# Semantic Search Summary

## Setting up Semantic Search from Scratch

This tutorial walks through setting up the semantic search project from scratch: installing Python, creating a virtual environment, installing the required libraries, and running the code.

### Step 1: Install Python

First, you need to have Python installed on your system. Download and install the latest version of Python from the official website:

[https://www.python.org/downloads/](https://www.python.org/downloads/)

### Step 2: Install pip (if not installed)

Pip is the package installer for Python. It is usually included with the Python installation. However, if it's not installed, you can download and install it by following the instructions here:

[https://pip.pypa.io/en/stable/installation/](https://pip.pypa.io/en/stable/installation/)

### Step 3: Create a virtual environment (Optional)

Although not required, it's a good practice to create a virtual environment to keep your project dependencies separate from your system-wide Python installation.

Create a new directory for your project:

```
mkdir semantic_search
cd semantic_search
```

Create a virtual environment using `venv`:


```
python -m venv venv
```

Activate the virtual environment:

- On Windows:

```
venv\Scripts\activate
```
- On macOS and Linux:

```
source venv/bin/activate
```
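To confirm that activation worked, one option is to ask the interpreter itself. The following is a minimal sketch; the helper name `in_virtual_environment` is made up for illustration, and it assumes an environment created with `venv` as above.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Inside a virtual environment:", in_virtual_environment())
    print("Interpreter in use:", sys.executable)
```

If the first line prints `False` after activation, the shell is probably still resolving `python` to the system-wide interpreter.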

### Step 4: Install dependencies

Install the required libraries using `pip`:

```
pip install pandas numpy scikit-learn transformers torch nltk
```
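After installation, it can help to check that every library is importable before opening the notebook. The sketch below uses only the standard library; the helper name `missing_modules` is hypothetical. Note that the scikit-learn package is imported as `sklearn`, so the list holds import names rather than pip package names.

```python
from importlib.util import find_spec

# Import names used by the notebook (scikit-learn is imported as "sklearn").
REQUIRED_MODULES = ["pandas", "numpy", "sklearn", "transformers", "torch", "nltk"]


def missing_modules(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```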


## Cell 1: Import libraries

This cell imports the necessary libraries for the script:

- **pandas**: A powerful data manipulation library that allows us to work with DataFrames, which are essentially tables of data. We use pandas to read the input CSV file and store the data in a structured format.
- **numpy**: A library for working with arrays and numerical operations. We use numpy to work with embeddings and compute similarities between them.
- **sklearn.metrics.pairwise**: A submodule from the scikit-learn library that provides the cosine_similarity function. We use this function to calculate cosine similarities between the embeddings of summaries and the query.
- **transformers**: A library that provides pre-trained natural language processing models and tools. We use transformers to load a pre-trained sentence transformer model and its tokenizer to generate embeddings for the text.
- **torch**: A library for working with deep learning models. We use torch to handle the computation of embeddings using the transformer model.
- **ast**: Python's built-in Abstract Syntax Trees module. We use `ast.literal_eval` to parse the embedding strings stored in the CSV file back into sequences of numbers, which are then converted to numpy arrays.
- **nltk**: The Natural Language Toolkit is a library for working with human language data (text). We use nltk for text preprocessing, including tokenizing, removing stopwords, and lemmatizing.




In [6]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel
import torch
import ast
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

## Cell 2: Downloading NLTK resources

This cell downloads the necessary resources from the NLTK library for text preprocessing:

- **punkt**: The 'punkt' resource is a pre-trained tokenizer for several languages. It is used by the `word_tokenize` function, which is responsible for splitting a text into individual words or tokens. In this script, we use 'punkt' to tokenize text during the preprocessing step.

- **stopwords**: Stopwords are common words that usually do not provide valuable information for text analysis tasks. Examples of stopwords in English include 'a', 'an', 'the', 'in', 'and', etc. The 'stopwords' resource provides a collection of such words for various languages. In this script, we use the 'stopwords' resource to filter out stopwords from the text during the preprocessing step.

- **wordnet**: WordNet is a large lexical database of English. It groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records various semantic relations between these synonym sets. The 'wordnet' resource is used by the `WordNetLemmatizer` class in NLTK, which is responsible for converting words to their base form or lemma. In this script, we use the 'wordnet' resource to lemmatize words during the preprocessing step.



In [9]:
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Jorrit\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jorrit\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Jorrit\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Cell 3: Define necessary functions

1. `preprocess_text(text)`: This function takes a text input and performs the following preprocessing steps:

    - Converts the text to lowercase.
    - Tokenizes the text into individual words.
    - Removes stopwords (common words that don't carry much meaning, like "the", "and", etc.).
    - Lemmatizes the tokens (reduces words to their base form, e.g., "running" becomes "run").
    - Joins the processed tokens back together into a single string.

2. `generate_embedding(text)`: This function takes a text input and generates an embedding for it using the pre-trained transformer model. It processes the input text with the tokenizer and feeds the result into the model to obtain the last hidden state, which holds one vector per token. It then averages the last hidden state over dimension 1 (the token dimension), a technique known as mean pooling, to produce a single fixed-size embedding.

3. `extract_relevant_sentences(summary, query_embedding, k=2)`: This function takes a summary, a query embedding, and an optional parameter k (default value 2). It extracts the k sentences from the summary that are most similar to the query. The function splits the summary into sentences at periods, generates an embedding for each sentence, computes the cosine similarity between the query embedding and each sentence embedding, and selects the k sentences with the highest similarities.

4. `search_authors_and_relevant_parts(df, query, n=3, k=2)`: This function takes a DataFrame, a query, and optional parameters n (default value 3) and k (default value 2). It searches for the top-n authors with the most relevant parts of their summaries based on the input query. The function generates a query embedding, computes the cosine similarities between the query embedding and summary embeddings in the DataFrame, sorts the authors based on similarity, selects the top-n authors, and extracts the top-k most relevant sentences from their summaries.
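The mean pooling step in `generate_embedding` can be illustrated with plain numpy. The sketch below is not the notebook's code: `mean_pool` is a hypothetical helper that also takes the attention mask into account. For a single text with no padding it gives the same result as a plain mean over dimension 1; the mask would only matter if several texts of different lengths were embedded in one batch.

```python
import numpy as np


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    last_hidden_state: array of shape (batch, seq_len, hidden)
    attention_mask:    array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# One sequence with three token vectors; the third is padding.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

print(mean_pool(hidden, mask))  # averages only the two real tokens
print(hidden.mean(axis=1))      # plain mean, which also averages the padding vector
```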


In [10]:
# Define necessary functions
def preprocess_text(text):
    text = text.lower()
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(tokens)

def generate_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).numpy()

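def extract_relevant_sentences(summary, query_embedding, k=2):
    # Split on periods and drop empty fragments (e.g. the one after the final period).
    sentences = [sentence for sentence in summary.split(".") if sentence.strip()]
    sentence_embeddings = [generate_embedding(sentence) for sentence in sentences]
    similarities = [cosine_similarity(query_embedding.reshape(1, -1), sentence_embedding.reshape(1, -1))[0, 0] for sentence_embedding in sentence_embeddings]
    # argsort is ascending, so the last k indices are the k most similar sentences
    # (returned from least to most similar).
    top_k_indices = np.argsort(similarities)[-k:]
    return [sentences[index] for index in top_k_indices]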

def search_authors_and_relevant_parts(df, query, n=3, k=2):
    query_embedding = generate_embedding(query)
    df["similarity"] = df["embedding"].apply(lambda x: cosine_similarity(x.reshape(1, -1), query_embedding.reshape(1, -1))[0, 0])

    top_n_results = df.sort_values("similarity", ascending=False).head(n)

    for _, row in top_n_results.iterrows():
        relevant_parts = extract_relevant_sentences(row["summary"], query_embedding, k)
        print(f"Author: {row['author']} (similarity: {row['similarity']:.4f})")
        print(f"Relevant parts: {'.'.join(relevant_parts)}...")

## Cell 4: Load model and tokenizer
In this cell, we load the pre-trained model and tokenizer using the specified model name. The pre-trained model is based on a transformer architecture, which has become the go-to architecture for many natural language processing tasks due to its ability to effectively capture contextual information in text.

Transformers are a type of deep learning model that use self-attention mechanisms to weigh the importance of words in a sequence. This helps them to understand the context and relationships between words in a sentence or a text. In this tutorial, we will use the transformer model to generate embeddings for text, which can then be used to perform semantic search.
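To make the idea of self-attention concrete, here is a minimal numpy sketch of single-head scaled dot-product attention. It is an illustration only: the weight matrices are random rather than trained, the sizes are arbitrary, and real transformer layers add multiple heads, residual connections, and normalization.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over token vectors x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Each row of `weights` says how much one token attends to every token.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, each an 8-dimensional vector
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape)          # one context-aware vector per token
print(weights.sum(axis=1))   # each row of attention weights sums to 1
```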

Model names in the Hugging Face Model Hub are unique identifiers for pre-trained models. In this case, we have chosen the "sentence-transformers/paraphrase-MiniLM-L6-v2" model. This is a compact six-layer MiniLM model, smaller and faster than full-size BERT models, trained to produce sentence embeddings in which paraphrases end up close together. It outputs 384-dimensional vectors.

A tokenizer is a necessary component for working with transformer models. It is responsible for converting raw text into a format that can be fed into the model, including tokenization, mapping tokens to IDs, and creating input tensors.
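The toy example below sketches those three jobs in plain Python: splitting text into tokens, mapping tokens to IDs, and building a padded input with an attention mask. It is deliberately simplified. `TOY_VOCAB` and `toy_encode` are made up for illustration; the real tokenizer loaded with `AutoTokenizer` uses a large subword vocabulary and returns tensors.

```python
# Hypothetical toy vocabulary; real vocabularies contain tens of thousands of subword entries.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "plant": 4, "biology": 5, "growth": 6}


def toy_encode(text, max_length=6):
    """Turn text into fixed-length token IDs plus an attention mask."""
    # Tokenize on whitespace and truncate, leaving room for the two special tokens.
    words = text.lower().split()[: max_length - 2]
    tokens = ["[CLS]"] + words + ["[SEP]"]
    # Map tokens to IDs; unknown words fall back to the [UNK] ID.
    input_ids = [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]
    attention_mask = [1] * len(input_ids)
    # Pad to max_length; padded positions get mask value 0.
    padding = max_length - len(input_ids)
    input_ids += [TOY_VOCAB["[PAD]"]] * padding
    attention_mask += [0] * padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


print(toy_encode("Plant molecular biology"))
# "molecular" is not in the toy vocabulary, so it maps to the [UNK] ID 1.
```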

In [11]:
# Load model and tokenizer
datafile_path = "summaries.csv"
output_file_path = "summaries_with_embeddings.csv"
model_name = "sentence-transformers/paraphrase-MiniLM-L6-v2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

## Cell 5: Generate embeddings and save to a CSV file
In this cell, we generate embeddings for the preprocessed summaries and save them to a separate CSV file. The main reasons for generating the embeddings once and saving them to a separate CSV file are:

1. **Computational cost**: Generating embeddings using transformer models can be computationally expensive, especially when dealing with a large number of texts. By generating the embeddings once and saving them to a separate file, we can avoid the need to recompute them each time we want to perform a search. This significantly reduces the computational overhead and speeds up the search process.

2. **Reusability**: Storing the embeddings in a separate CSV file allows us to reuse them in future searches or other tasks that require text similarity comparisons. This way, we can save time and resources by not having to recompute the embeddings each time they are needed.

3. **Portability**: Saving the embeddings to a file makes it easier to share and transfer the data between different environments or systems. This can be particularly useful when working with remote servers or collaborating with others.

By generating and saving embeddings to a separate CSV file, we can optimize the search process and make it more efficient, especially when working with large datasets or computationally expensive models.
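Because a CSV cell can only hold text, each embedding has to be serialized to a string and parsed back on load. The sketch below shows that round trip on a small made-up vector, using the same comma-joined format and `ast.literal_eval` parsing that the notebook relies on. The helper names `embedding_to_string` and `string_to_embedding` are hypothetical.

```python
import ast

import numpy as np


def embedding_to_string(embedding):
    """Serialize a 1-D embedding as comma-separated numbers."""
    return ", ".join(map(str, embedding))


def string_to_embedding(text):
    """Parse a comma-separated string back into a numpy array."""
    # literal_eval reads "a, b, c" as a tuple of numbers without executing code.
    return np.array(ast.literal_eval(text))


original = np.array([0.25, -1.5, 3.0], dtype=np.float32)
stored = embedding_to_string(original)
restored = string_to_embedding(stored)

print(stored)
print(np.allclose(original, restored))
```

For large collections, a binary format such as numpy's `.npy` would likely be more compact and faster to load, but a CSV keeps the embeddings next to the author and summary columns in one human-readable file.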


In [12]:
# Generate embeddings and save to a CSV file
df = pd.read_csv(datafile_path)
df["preprocessed_summary"] = df.summary.apply(preprocess_text)
df["embedding"] = df.preprocessed_summary.apply(generate_embedding).apply(lambda x: ', '.join(map(str, x[0])))
df.to_csv(output_file_path, index=False)



## Cell 6: Load embeddings and define a search function

In this cell, we load the saved embeddings and define a `search` function that takes a query as input and finds the most relevant authors and parts of their summaries. `search` calls `search_authors_and_relevant_parts`, which works as follows:

1. First, it generates an embedding for the query using the same pre-trained transformer model as used for the summaries. This ensures that the query embedding and summary embeddings are in the same semantic space.

2. Next, the function computes the cosine similarities between the query embedding and each of the summary embeddings. Cosine similarity is a widely used metric to measure the similarity between two vectors, in our case, the query and summary embeddings. A higher cosine similarity indicates a higher degree of similarity between the query and the summary.

3. The function then sorts the authors based on their similarity scores and selects the top-n authors with the highest cosine similarities. This provides us with the most relevant authors based on the input query.

4. Finally, the function extracts the top-k most relevant sentences from each of the selected author's summaries. This is done by computing the cosine similarities between the query embedding and individual sentence embeddings within each summary, and selecting the sentences with the highest similarities.

By following these steps, the search function provides a ranked list of authors along with the most relevant parts of their summaries, based on the input query.
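The ranking logic in steps 2 and 3 can be sketched with numpy alone. The example below uses tiny made-up two-dimensional vectors in place of real embeddings, and the helper names `cosine_similarities` and `top_n` are hypothetical. The notebook itself uses `cosine_similarity` from scikit-learn together with a pandas sort.

```python
import numpy as np


def cosine_similarities(query, matrix):
    """Cosine similarity between one query vector and each row of matrix."""
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ query


def top_n(query, matrix, n=3):
    """Return (row index, similarity) pairs for the n most similar rows, best first."""
    scores = cosine_similarities(query, matrix)
    order = np.argsort(scores)[::-1][:n]
    return [(int(index), float(scores[index])) for index in order]


# Three toy "summary embeddings" and one toy "query embedding".
summaries = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.0])

# Row 0 points the same way as the query, row 2 is at 45 degrees, row 1 is orthogonal.
print(top_n(query, summaries, n=2))
```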



In [16]:
# Load embeddings and define a search function
df_with_embeddings = pd.read_csv(output_file_path)
df_with_embeddings["embedding"] = df_with_embeddings.embedding.apply(lambda x: np.array(ast.literal_eval(x)))

def search(query):
    search_authors_and_relevant_parts(df_with_embeddings, query, n=5, k=2)

## Cell 7: Test the search function
This cell runs the `search` function with an example query ("plant molecular biology") and displays the results.

In [17]:
# Test the search function

search("plant molecular biology")

Author: Garcia, M. (similarity: 0.3946)
Relevant parts:  Four types of growth regulators, including auxin, cytokinin, gibberellin, and abscisic acid, were applied to tomato plants under controlled conditions.The objective of this project was to investigate the effects of different plant growth regulators on the growth and development of tomato plants...
Author: Lee, H. (similarity: 0.2827)
Relevant parts:  The plants were grown for a period of six weeks, and growth parameters such as shoot and root length, leaf area, and dry weight were measured at regular intervals.The objective of this project was to investigate the effects of different light spectra on the growth and nutritional content of lettuce plants...
Author: Johnson, A. (similarity: 0.2698)
Relevant parts: The objective of this project was to investigate the effect of different concentrations of caffeine on the growth rate of Arabidopsis thaliana. These findings suggest that caffeine may have a negative impact on the growth o