# Semantic Search Summary

## Setting up Semantic Search from Scratch

This tutorial will guide you through the entire process of setting up the semantic search project from scratch, including installing Python, creating a virtual environment, installing the necessary libraries, and running the code.

### Step 1: Install Python

First, you need to have Python installed on your system. Download and install the latest version of Python from the official website:

[https://www.python.org/downloads/](https://www.python.org/downloads/)

### Step 2: Install pip (if not installed)

Pip is the package installer for Python. It is usually included with the Python installation. However, if it's not installed, you can download and install it by following the instructions here:

[https://pip.pypa.io/en/stable/installation/](https://pip.pypa.io/en/stable/installation/)

### Step 3: Create a virtual environment (Optional)

Although not required, it's a good practice to create a virtual environment to keep your project dependencies separate from your system-wide Python installation.

Create a new directory for your project:

```
mkdir semantic_search
cd semantic_search
```

Create a virtual environment using `venv`:


```
python -m venv venv
```

Activate the virtual environment:

- On Windows:

```
venv\Scripts\activate
```
- On macOS and Linux:

```
source venv/bin/activate
```
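
To confirm that the environment is active, you can ask the interpreter itself. The following is a minimal sketch; the helper name `in_virtualenv` is made up for illustration. It relies on the fact that, inside an environment created with `venv`, `sys.prefix` differs from `sys.base_prefix`.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```

If this prints `False` after activation, the `python` on your path is probably not the one inside the `venv` directory.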

### Step 4: Install dependencies

Install the required libraries using `pip`:

```
pip install pandas numpy scikit-learn transformers torch nltk
```
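
Before opening the notebook, it can help to check that every library is importable. The sketch below is one way to do that using only the standard library; `REQUIRED_MODULES` and `missing_modules` are illustrative names, not part of the project. Note that the list holds import names, and scikit-learn is imported as `sklearn`.

```python
import importlib.util

# Import names used by the notebook (scikit-learn is imported as "sklearn").
REQUIRED_MODULES = ["pandas", "numpy", "sklearn", "transformers", "torch", "nltk"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("missing:", missing if missing else "none")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy packages such as `torch`.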


## Cell 1: Import libraries
This cell imports the libraries used for data manipulation (pandas, numpy), text preprocessing (nltk), cosine similarity calculation (sklearn), the transformer model (transformers, torch), and parsing the stored embeddings (ast).


In [1]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel
import torch
import ast
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

## Cell 2: Downloading required NLTK resources
Here, we download the necessary resources ('punkt', 'stopwords', 'wordnet') for text preprocessing using the NLTK library.


In [2]:
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Jorrit\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jorrit\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Jorrit\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Cell 3: Define necessary functions
This cell contains the definitions of four functions:
1. `preprocess_text`: Preprocesses text by converting to lowercase, tokenizing, removing stopwords, and lemmatizing.
2. `generate_embedding`: Generates embeddings for a given text using the pre-trained model.
3. `extract_relevant_sentences`: Extracts the top-k most relevant sentences from a summary based on cosine similarity with a given query embedding.
4. `search_authors_and_relevant_parts`: Searches for the top-n authors with the most relevant parts of their summaries based on a query.
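
To make the preprocessing steps concrete, here is a simplified stand-in that needs no NLTK download. It is only a sketch: `toy_preprocess` and `TOY_STOPWORDS` are hypothetical names, the stopword list is a tiny sample, tokenization is a plain regular expression, and lemmatization is left out.

```python
import re

# Tiny illustrative stopword list; NLTK's English list is much longer.
TOY_STOPWORDS = {"the", "of", "on", "a", "an", "and", "to", "in", "was", "were"}


def toy_preprocess(text: str) -> str:
    """Lowercase, tokenize on letters and digits, and drop stopwords (no lemmatization)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(token for token in tokens if token not in TOY_STOPWORDS)


print(toy_preprocess("The effects of light on the growth of lettuce plants."))
# -> effects light growth lettuce plants
```

Unlike this sketch, `word_tokenize` keeps punctuation marks as separate tokens, so the output of the real `preprocess_text` can still contain them.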

In [3]:
# Define necessary functions
def preprocess_text(text):
    text = text.lower()
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(tokens)

def generate_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).numpy()

def extract_relevant_sentences(summary, query_embedding, k=2):
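    # Split on periods and drop empty fragments (e.g. the one after the final period)
    sentences = [sentence for sentence in summary.split(".") if sentence.strip()]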
    sentence_embeddings = [generate_embedding(sentence) for sentence in sentences]
    similarities = [cosine_similarity(query_embedding.reshape(1, -1), sentence_embedding.reshape(1, -1))[0, 0] for sentence_embedding in sentence_embeddings]
    top_k_indices = np.argsort(similarities)[-k:]
    return [sentences[index] for index in top_k_indices]

def search_authors_and_relevant_parts(df, query, n=3, k=2):
    query_embedding = generate_embedding(query)
    df["similarity"] = df["embedding"].apply(lambda x: cosine_similarity(x.reshape(1, -1), query_embedding.reshape(1, -1))[0, 0])

    top_n_results = df.sort_values("similarity", ascending=False).head(n)

    for _, row in top_n_results.iterrows():
        relevant_parts = extract_relevant_sentences(row["summary"], query_embedding, k)
        print(f"Author: {row['author']} (similarity: {row['similarity']:.4f})")
        print(f"Relevant parts: {'.'.join(relevant_parts)}...")
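
The ranking step in `extract_relevant_sentences` can be tried in isolation on toy vectors. The sketch below uses hypothetical helpers (`cosine`, `top_k_indices`) and hand-made 2-D vectors in place of real sentence embeddings. One difference from the cell above is deliberate: `np.argsort(similarities)[-k:]` returns the top-k indices in ascending order of similarity, whereas this version reverses the order so the best match comes first.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def top_k_indices(query: np.ndarray, candidates: np.ndarray, k: int = 2) -> list:
    """Indices of the k rows in `candidates` most similar to `query`, best first."""
    sims = np.array([cosine(query, row) for row in candidates])
    # argsort is ascending, so reverse it before taking the first k entries.
    return np.argsort(sims)[::-1][:k].tolist()


# Toy 2-D vectors standing in for sentence embeddings.
query = np.array([1.0, 0.0])
candidates = np.array([[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]])
print(top_k_indices(query, candidates, k=2))  # -> [1, 2]
```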

## Cell 4: Load model and tokenizer
This cell sets the input and output file paths and loads the pre-trained model and tokenizer named in `model_name`.

In [4]:
# Load model and tokenizer
datafile_path = "summaries.csv"
output_file_path = "summaries_with_embeddings.csv"
model_name = "sentence-transformers/paraphrase-MiniLM-L6-v2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
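
`generate_embedding` averages the model's `last_hidden_state` over all token positions. That works here because each text is encoded on its own, so no padding tokens are added. If you later encode several texts in one batch, padded positions would be included in the mean. The sketch below shows one common remedy, an attention-mask-weighted mean, using NumPy arrays as stand-ins for the model output and the tokenizer's `attention_mask`; `masked_mean_pool` is an illustrative name, not part of the notebook.

```python
import numpy as np


def masked_mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring padded positions.

    hidden: (batch, tokens, dim) array standing in for last_hidden_state.
    mask:   (batch, tokens) array with 1 for real tokens and 0 for padding.
    """
    weights = mask[:, :, None].astype(hidden.dtype)     # (batch, tokens, 1)
    summed = (hidden * weights).sum(axis=1)             # (batch, dim)
    counts = np.clip(weights.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts


# One sequence with two real tokens and one padded position.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(masked_mean_pool(hidden, mask))  # -> [[2. 3.]]
```

The padded vector `[100, 100]` has no influence on the result, which is the point of weighting by the mask.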

## Cell 5: Generate embeddings and save to a CSV file
Here, we read the input CSV file with summaries, preprocess the summaries, generate embeddings for the preprocessed summaries, and save the resulting DataFrame to a new CSV file.

In [5]:
# Generate embeddings and save to a CSV file
df = pd.read_csv(datafile_path)
df["preprocessed_summary"] = df.summary.apply(preprocess_text)
df["embedding"] = df.preprocessed_summary.apply(generate_embedding).apply(lambda x: ', '.join(map(str, x[0])))
df.to_csv(output_file_path, index=False)

## Cell 6: Load embeddings and define a search function
In this cell, we load the CSV file containing the summaries with embeddings, convert the embeddings column to NumPy arrays, and define a `search` function that wraps the `search_authors_and_relevant_parts` function for easier use.

In [6]:
# Load embeddings and define a search function
df_with_embeddings = pd.read_csv(output_file_path)
df_with_embeddings["embedding"] = df_with_embeddings.embedding.apply(lambda x: np.array(ast.literal_eval(x)))

def search(query):
    search_authors_and_relevant_parts(df_with_embeddings, query, n=3, k=2)
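
The embeddings survive the trip through CSV because each vector is written as a comma-separated string and parsed back with `ast.literal_eval`. The sketch below reproduces that round trip with the standard library only. The helper names and the author value `"Doe, J."` are made up for illustration, and the parser wraps the string in brackets, a small variation on the notebook's call so that a single value also parses as a list.

```python
import ast
import csv
import io


def embedding_to_cell(vector) -> str:
    """Serialize a sequence of floats as a comma-separated string."""
    return ", ".join(map(str, vector))


def cell_to_embedding(cell: str) -> list:
    """Parse the comma-separated string back into a list of floats."""
    # Wrapping in brackets makes literal_eval return a list, even for one value.
    return [float(value) for value in ast.literal_eval(f"[{cell}]")]


original = [0.25, -1.5, 3e-05]

# Write one row to an in-memory CSV file and read it back.
buffer = io.StringIO()
csv.writer(buffer).writerow(["Doe, J.", embedding_to_cell(original)])
buffer.seek(0)
author, cell = next(csv.reader(buffer))

assert cell_to_embedding(cell) == original
```

Because the string contains commas, the CSV writer quotes the field, which is why the values come back intact.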

## Cell 7: Test the search function
This cell runs the `search` function with an example query ("plant biology") and displays the top three authors along with the most relevant sentences from their summaries.

In [9]:
# Test the search function

search("plant biology")

Author: Lee, H. (similarity: 0.3164)
Relevant parts:  The plants were grown for a period of six weeks, and growth parameters such as shoot and root length, leaf area, and dry weight were measured at regular intervals.The objective of this project was to investigate the effects of different light spectra on the growth and nutritional content of lettuce plants...
Author: Smith, J. (similarity: 0.3009)
Relevant parts:  The plants were grown under controlled conditions for a period of 10 weeks, and the growth parameters such as plant height, stem diameter, and number of leaves were measured at regular intervals.The aim of this project was to investigate the effect of different types of soil on the growth of tomato plants...
Author: Johnson, A. (similarity: 0.2969)
Relevant parts:  Moreover, the higher the caffeine concentration, the slower the growth rate of the plants. These findings suggest that caffeine may have a negative impact on the growth of Arabidopsis thaliana and could potential