# Task
Provide a simple Python example that demonstrates how to use an open-source LLM and an open-source vector database from Hugging Face. The example should include loading a dataset, creating embeddings with a pre-trained model, building a FAISS index, and performing a similarity search.

## Install libraries

### Subtask:
Install the necessary libraries, including `transformers`, `datasets`, and `faiss-cpu`.


**Reasoning**:
I will install the necessary libraries, `transformers`, `datasets`, and `faiss-cpu`, as requested in the instructions, using a single pip command. PyTorch (`torch`), which a later cell imports, is assumed to be preinstalled in the environment.



In [1]:
!pip install transformers datasets faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0.post1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.0 kB)
Downloading faiss_cpu-1.11.0.post1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.3 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.11.0.post1


## Load dataset

### Subtask:
Load a dataset from the Hugging Face Hub.


**Reasoning**:
The next step is to load the dataset from the Hugging Face Hub as per the instructions. I will import the necessary function, load the specified dataset and split, and then print the dataset to inspect it. This covers all the instructions in the subtask.



In [4]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("dair-ai/emotion", split="train")

# Print the dataset to inspect its structure
print(dataset)

Dataset({
    features: ['text', 'label'],
    num_rows: 16000
})


## Load pre-trained model

### Subtask:
Load a pre-trained sentence transformer model from the Hugging Face Hub.


**Reasoning**:
To load the pre-trained model, I will import the necessary classes from the `transformers` library, specify the model name, and then use the `from_pretrained` methods of `AutoTokenizer` and `AutoModel` to load the tokenizer and the model respectively. Finally, I will print the model to verify it's loaded correctly.



In [3]:
from transformers import AutoTokenizer, AutoModel

model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

print(model)

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 384, padding_idx=0)
    (position_embeddings): Embedding(512, 384)
    (token_type_embeddings): Embedding(2, 384)
    (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-5): 6 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=384, out_features=384, bias=True)
            (key): Linear(in_features=384, out_features=384, bias=True)
            (value): Linear(in_features=384, out_features=384, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=384, out_features=384, bias=True)
            (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)


## Create embeddings

### Subtask:
Create embeddings for the dataset using the pre-trained model.


**Reasoning**:
To create the embeddings, I will first define a function that tokenizes a batch of text and generates embeddings using the pre-trained model. This function will also perform mean pooling to create fixed-size sentence embeddings. Then, I will apply this function to the entire dataset using the `map` method and store the resulting embeddings in a new column named `embeddings`.



In [5]:
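import torch

def get_embeddings(text_list):
    # Tokenize the texts, padding to the longest one and truncating to the model's maximum length
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    # Inference only, so gradient tracking is not needed
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Mean pooling: average the token embeddings while ignoring padding positions
    token_embeddings = model_output.last_hidden_state
    attention_mask = encoded_input["attention_mask"]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    # Clamp the token count to avoid division by zero
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    return sum_embeddings / sum_mask

# Embed one example per call: the single text is wrapped in a list,
# and [0] extracts its vector from the resulting one-row batch
dataset = dataset.map(
    lambda x: {"embeddings": get_embeddings([x["text"]]).numpy()[0]}
)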

print(dataset)

Dataset({
    features: ['text', 'label', 'embeddings'],
    num_rows: 16000
})
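
The mean pooling step inside `get_embeddings` can be checked by hand on a toy input. The sketch below is a minimal illustration in plain Python lists rather than tensors, so it runs without `torch`; the function name `masked_mean_pool` and the toy values are my own assumptions and are not part of the notebook.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                summed[i] += value
    # Mirror the clamp in get_embeddings: avoid dividing by zero on all-padding input
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]


# Toy sequence: two real tokens followed by one padding token
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector `[100.0, 100.0]` has no effect on the result, which is the point of multiplying by the expanded attention mask before summing.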


## Build FAISS index

### Subtask:
Build a FAISS index for efficient similarity search.


**Reasoning**:
To build a FAISS index, I will use the `add_faiss_index` method on the dataset object, specifying the 'embeddings' column as the source for the index. Then, I will print the dataset. The printed summary lists only the features and the row count, so the index itself does not appear in it.



In [6]:
dataset.add_faiss_index(column="embeddings")
print(dataset)

Dataset({
    features: ['text', 'label', 'embeddings'],
    num_rows: 16000
})
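
Conceptually, a flat (exhaustive) index compares the query against every stored vector and keeps the closest ones. The sketch below shows that idea in plain Python with squared L2 distance. It assumes the index built above is a flat L2 index, which I believe is the default when no `string_factory` or `metric_type` is passed; the class `TinyFlatIndex` is hypothetical and is not the FAISS API.

```python
import heapq


class TinyFlatIndex:
    """Illustrative stand-in for an exhaustive (flat) L2 index; not the FAISS API."""

    def __init__(self):
        self.vectors = []

    def add(self, vectors):
        self.vectors.extend(vectors)

    def search(self, query, k):
        # Squared L2 distance between the query and every stored vector
        distances = [
            (sum((q - v) ** 2 for q, v in zip(query, vector)), idx)
            for idx, vector in enumerate(self.vectors)
        ]
        # Smallest distance first: a lower score means a closer match
        nearest = heapq.nsmallest(k, distances)
        return [d for d, _ in nearest], [i for _, i in nearest]


index = TinyFlatIndex()
index.add([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
scores, ids = index.search([0.9, 1.2], k=2)
print(ids)  # [1, 0]
```

An exhaustive scan like this is exact but linear in the number of vectors, which is acceptable for 16,000 rows; approximate index types trade some accuracy for speed on larger collections.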


## Perform similarity search

### Subtask:
Perform a similarity search on the FAISS index.


**Reasoning**:
To perform a similarity search, I will first define a query, then generate its embedding using the `get_embeddings` function, and finally use the `get_nearest_examples` method on the dataset to find and display the most similar entries. The returned scores are distances (L2 by default, as far as I know), so a lower score indicates a closer match.



In [7]:
query = "I'm feeling happy and excited!"
# Embed the query with the same model and pooling used for the dataset
query_embedding = get_embeddings([query]).numpy()

# Retrieve the 5 nearest rows from the "embeddings" index
scores, samples = dataset.get_nearest_examples("embeddings", query_embedding, k=5)

print("Query:", query)
for score, sample in zip(scores, samples["text"]):
    print(f"Score: {score:.4f}, Sample: {sample}")

Query: I'm feeling happy and excited!
Score: 10.8984, Sample: im feeling a bit apprehensive but excited as well
Score: 11.5661, Sample: i am going to be happy today i am going to enjoy feeling excited about life joyful eager knowing and empowered
Score: 11.9438, Sample: i feel very happy and excited since i learned so many things
Score: 12.9900, Sample: i am so relieved and excited and i feel confident again
Score: 13.4553, Sample: i am feeling hopeful excited and very much being made new
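
The scores above increase down the list, which is consistent with distances rather than similarities. If cosine similarity is preferred, one common option is to L2-normalize the embeddings before indexing: for unit vectors, squared L2 distance equals 2 - 2 * cos, so both measures give the same ranking. The sketch below checks that identity on toy vectors; the helper names are illustrative assumptions and are not part of the notebook.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    # For unit vectors the dot product is the cosine similarity
    return sum(x * y for x, y in zip(a, b))


a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

# For unit vectors: squared L2 distance == 2 - 2 * cosine similarity
print(math.isclose(squared_l2(a, b), 2 - 2 * cosine(a, b)))  # True
```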


## Summary:

### Data Analysis Key Findings
* The `train` split of the `dair-ai/emotion` dataset was loaded; it contains 16,000 rows with `text` and `label` features.
* The pre-trained `sentence-transformers/all-MiniLM-L6-v2` model, combined with mean pooling over token embeddings, was used to generate one embedding per row.
* A FAISS index was built on the `embeddings` column with `add_faiss_index`, enabling nearest-neighbor search over the dataset.
* A similarity search for the query "I'm feeling happy and excited!" retrieved the 5 nearest sentences, with scores ranging from 10.8984 to 13.4553; the scores are distances, so lower values indicate closer matches.

### Insights or Next Steps
* This example can be adapted to use different datasets and pre-trained models from the Hugging Face Hub for various NLP tasks.
* The similarity search functionality can be integrated into applications like semantic search engines, recommendation systems, or chatbots.
