Creating a Vector Database
- The goal here is to take the data and turn it into vectors for a RAG (retrieval-augmented generation) model
- Vector database - helps you perform a semantic search: finding info not just by keywords but by understanding content and meaning
- Text Embedding - translating words into meaningful numbers (converts text into numerical vectors)
- Text Classification - adding a label to a piece of text
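To make "semantic search" concrete, here is a minimal sketch of ranking documents by cosine similarity between vectors. The 3-dimensional vectors and the document names are made up for illustration; a real embedding model would produce vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (not produced by any real model).
documents = {
    "vegan cafe": [0.9, 0.1, 0.0],
    "steakhouse": [0.0, 0.2, 0.9],
    "plant-based bistro": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]  # stand-in for the embedding of a "meat-free restaurant" query

# Rank documents from most to least similar to the query vector.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)
```

The query shares no keywords with the documents; the ranking comes only from how close the vectors are, which is the idea a vector database builds on.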


1) Install the prerequisites
- pymongo is the official driver for working with MongoDB databases from Python applications. It handles the connection and translates between Python objects and MongoDB documents so the two can work together.
- transformers - the Hugging Face library for loading and training transformer models. A transformer is a neural network architecture that excels at processing sequential data like text. Unlike older models, transformers use a "self-attention" mechanism to weigh the importance of different parts of the input, which lets them capture context and relationships within a sequence more effectively. This will be extremely helpful for getting the vector database to understand meaning.
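The "self-attention" idea mentioned above can be sketched in a few lines. This is a simplified illustration, not what the transformers library runs internally: it uses the token vectors directly as queries, keys and values, whereas a real transformer first passes them through learned projection matrices. The toy token vectors are made up.

```python
import math

def softmax(row):
    """Turn raw scores into weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product attention with queries = keys = values = vectors."""
    dim = len(vectors[0])
    outputs, weights = [], []
    for q in vectors:
        # Score this token against every token in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in vectors]
        w = softmax(scores)
        weights.append(w)
        # New representation: weighted average of all token vectors.
        outputs.append([sum(w_j * v[d] for w_j, v in zip(w, vectors)) for d in range(dim)])
    return outputs, weights

# Hypothetical 2-dimensional token vectors: the first two are similar, the third is not.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
print(weights[0])
```

The first token puts more weight on the second (similar) token than on the third, which is the "weighing the importance of different parts of the input" described above.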

In [1]:
!pip install datasets pandas pymongo sentence_transformers
!pip install -U transformers
!pip install accelerate
!pip install -U datasets


The code below installs the Hugging Face evaluate library. This library is designed to simplify the evaluation of machine learning models and datasets.

In [2]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.5-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.5-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.5


Configurations

In [6]:
data_path = "/content/RestaurantData2 - Sheet5 (1).csv" #@param{type:"string"}
text_column_name = "Description" #@param {type:"string"}
# The CSV column names use a non-breaking hyphen (U+2011), not an ASCII '-'
label_column_names = ['Vegan', 'Gluten‑Free', 'Dairy‑Free']

model_name = "bert-base-uncased" #@param {type:"string"}
test_size = 0.2 #@param {type:"number"}
num_labels = 3 #@param {type:"number"}

Read and prepare the Dataset

In [7]:
import pandas as pd

In [8]:
df = pd.read_csv(data_path)

In [9]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 141 entries, 0 to 140
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Name          141 non-null    object
 1   Category      141 non-null    object
 2   Description   141 non-null    object
 3   Neighborhood  141 non-null    object
 4   Vegan         141 non-null    int64 
 5   Gluten‑Free   141 non-null    int64 
 6   Dairy‑Free    141 non-null    int64 
dtypes: int64(3), object(4)
memory usage: 7.8+ KB


   Name                              Category            Description                                        Neighborhood      Vegan  Gluten‑Free  Dairy‑Free
0  A. Schwab Trading Company         General Store       A. Schwab Trading Company is a historic genera...  Beale Street          0            0           0
1  Alcenia’s Southern Style Cuisine  Southern/Soul Food  Alcenia’s Southern Style Cuisine serves up sou...  Downtown Memphis      0            0           0
2  Alchemy                           Restaurant / Bar    Alchemy is a stylish restaurant and bar in Coo...  Midtown               0            1           0
3  Aldo's Pizza Pies                 Pizza               Aldo’s Pizza Pies offers New York-style pizzas...  Midtown               0            1           0
4  Aldo's Pizza Pies                 Pizza               Aldo’s Pizza Pies brings thin-crust New York-i...  Downtown Memphis      0            1           0


In [10]:
df = df.drop('Neighborhood', axis=1)
df = df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 140 entries, 0 to 140
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Name         140 non-null    object
 1   Category     140 non-null    object
 2   Description  140 non-null    object
 3   Vegan        140 non-null    int64 
 4   Gluten‑Free  140 non-null    int64 
 5   Dairy‑Free   140 non-null    int64 
dtypes: int64(3), object(3)
memory usage: 7.7+ KB


Label Encoder

In [11]:
foodlabels = ['Vegan', 'Gluten‑Free', 'Dairy‑Free']

Train/Test Split

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
df_train,df_test = train_test_split(df, test_size=test_size)
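As a rough illustration of what the split does with 140 rows and test_size = 0.2, here is a stdlib-only sketch. It is not scikit-learn's implementation; `split_indices` and the seed are hypothetical. It only shows the shuffle-then-cut idea and why the later steps see 112 training and 28 test examples.

```python
import math
import random

def split_indices(n_rows, test_size, seed=0):
    """Shuffle row indices and cut off a test portion (simplified sketch)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)  # test set size is rounded up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(140, 0.2)
print(len(train_idx), len(test_idx))
```

Because the real split above is random, rerunning the notebook changes which rows land in each set; `train_test_split` accepts a `random_state` argument if a repeatable split is wanted.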

Convert to Huggingface Dataset

In [14]:
from datasets import Dataset

In [15]:
train_dataset = Dataset.from_pandas(df_train)
test_dataset = Dataset.from_pandas(df_test)

Tokenizer
- This prepares the text data for the transformer model
- It breaks the text down into tokens and then converts the tokens into numbers (token IDs)
- Here it tokenizes the Category column (not the Description column named in text_column_name) and attaches the Vegan, Gluten‑Free and Dairy‑Free columns as a multi-label target
- This lets the model classify each restaurant as vegan, gluten-free and/or dairy-free
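The padding and truncation steps can be pictured with a toy tokenizer. This is only a sketch: the vocabulary, `toy_encode`, and the ids are invented, and the real BERT tokenizer splits text into WordPiece sub-words and adds special tokens, which this version leaves out.

```python
def toy_encode(text, vocab, max_length):
    """Whitespace tokenizer with truncation and padding (id 0 = padding, 1 = unknown)."""
    ids = [vocab.get(word, 1) for word in text.lower().split()]
    ids = ids[:max_length]                 # truncation: drop anything past max_length
    mask = [1] * len(ids)                  # 1 marks a real token
    padding = max_length - len(ids)        # padding: fill up to max_length
    return {
        "input_ids": ids + [0] * padding,
        "attention_mask": mask + [0] * padding,
    }

# Hypothetical vocabulary for illustration only.
vocab = {"pizza": 2, "southern": 3, "food": 4, "bar": 5}
encoded = toy_encode("Southern Soul Food", vocab, max_length=5)
print(encoded)
```

Every example ends up with the same length, and the attention mask tells the model which positions are real tokens and which are padding.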

In [16]:
from transformers import AutoTokenizer

In [17]:
import numpy as np

tokenizer = AutoTokenizer.from_pretrained(model_name)

def preprocess_function(examples):
  # take a batch of texts
  text = examples["Category"]
  # encode them
  encoding = tokenizer(text, padding="max_length", truncation=True, max_length=128)
  # add labels
  labels_batch = {k: examples[k] for k in examples.keys() if k in foodlabels}
  # create numpy array of shape (batch_size, num_labels)
  labels_matrix = np.zeros((len(text), len(foodlabels)))
  # fill numpy array
  for idx, label in enumerate(foodlabels):
    labels_matrix[:, idx] = labels_batch[label]

  encoding["labels"] = labels_matrix.tolist()

  return encoding
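The label-building part of the function above turns three separate label columns into one row of floats per example. A stdlib-only sketch of the same reshaping follows; the batch is made up, and the column names use plain ASCII hyphens here for readability.

```python
def build_label_rows(batch, label_names):
    """Convert per-label columns into one multi-hot float row per example."""
    columns = [batch[name] for name in label_names]
    # zip(*columns) walks the columns in parallel, giving one tuple per example.
    return [[float(value) for value in row] for row in zip(*columns)]

label_names = ["Vegan", "Gluten-Free", "Dairy-Free"]
# Hypothetical batch in the column-oriented shape that Dataset.map(batched=True) passes in.
batch = {
    "Category": ["Restaurant / Bar", "General Store"],
    "Vegan": [0, 0],
    "Gluten-Free": [1, 0],
    "Dairy-Free": [0, 0],
}
rows = build_label_rows(batch, label_names)
print(rows)
```

The values are floats rather than ints because the multi-label setup trains with a binary cross-entropy loss, which expects float targets.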


In [18]:
tokenized_train = train_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/112 [00:00<?, ? examples/s]

In [19]:
tokenized_test = test_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/28 [00:00<?, ? examples/s]

In [20]:
tokenized_train.set_format("torch")
tokenized_test.set_format("torch")

In [21]:
id2label = {idx:label for idx, label in enumerate(foodlabels)}
label2id = {label:idx for idx, label in enumerate(foodlabels)}

In [22]:
from transformers import AutoModelForSequenceClassification

# Reuse model_name from the configuration cell instead of repeating the string
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    problem_type="multi_label_classification",
    num_labels=len(foodlabels),
    id2label=id2label,
    label2id=label2id,
)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [23]:
batch_size = 8
metric_name = "f1"

In [24]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    "bert-finetuned-sem_eval-english",  # output directory for checkpoints
    eval_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    label_names=["labels"]
    #push_to_hub=True,
)

In [25]:
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from transformers import EvalPrediction
import torch

# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.5):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # finally, compute metrics
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
    accuracy = accuracy_score(y_true, y_pred)
    # return as dictionary
    metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions,
            tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds,
        labels=p.label_ids)
    return result
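To see what the metrics function computes, here is a small worked example using only the standard library. The logits and labels are invented, and `binarize`, `micro_f1` and `subset_accuracy` are simplified stand-ins for the sigmoid/threshold step, `f1_score(average='micro')` and `accuracy_score`.

```python
import math

def binarize(logits, threshold=0.5):
    """Apply a sigmoid to each logit and threshold it into a 0/1 prediction."""
    return [[1 if 1 / (1 + math.exp(-v)) >= threshold else 0 for v in row] for row in logits]

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool true/false positives and false negatives over all labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def subset_accuracy(y_true, y_pred):
    """Share of examples where every label is predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical logits for three examples and three labels.
logits = [[-2.0, 1.5, -0.5], [0.3, 2.0, -1.0], [-1.0, -0.2, -3.0]]
y_true = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
y_pred = binarize(logits)
print(y_pred, micro_f1(y_true, y_pred), subset_accuracy(y_true, y_pred))
```

For multi-label targets the accuracy is the strict exact-match kind: a row counts as correct only if all three labels match, which is why accuracy tends to sit below F1.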

In [26]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)


In [27]:
trainer.train()




Epoch  Training Loss  Validation Loss  F1        Roc Auc   Accuracy
1      No log         0.495992         0.808511  0.87206   0.75
2      No log         0.390272         0.833333  0.893799  0.75
3      No log         0.358698         0.833333  0.893799  0.75
4      No log         0.346126         0.851064  0.901996  0.785714
5      No log         0.326492         0.851064  0.901996  0.785714


TrainOutput(global_step=70, training_loss=0.4607630593436105, metrics={'train_runtime': 1262.2283, 'train_samples_per_second': 0.444, 'train_steps_per_second': 0.055, 'total_flos': 36835878481920.0, 'train_loss': 0.4607630593436105, 'epoch': 5.0})
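The numbers in the TrainOutput line can be checked with a little arithmetic. This sketch assumes a single device and no gradient accumulation, and it assumes the default `logging_steps` of 500 was in effect, which would explain the "No log" entries in the Training Loss column.

```python
import math

train_rows = 112        # training examples after the 80/20 split
batch_size = 8          # per_device_train_batch_size
epochs = 5              # num_train_epochs
train_runtime = 1262.2283  # seconds, from the TrainOutput above

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_rows * epochs / train_runtime

# Assumed default: the training loss is only logged every 500 steps.
default_logging_steps = 500
loss_was_logged = total_steps >= default_logging_steps

print(steps_per_epoch, total_steps, round(samples_per_second, 3), loss_was_logged)
```

With only 70 optimizer steps in total, the run ends before the first logging step, so passing a smaller `logging_steps` in TrainingArguments would be one way to see the training loss per epoch.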

In [28]:
from sklearn.metrics import classification_report

In [29]:
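preds = trainer.predict(tokenized_train)
# preds.predictions holds one logit per label, shape (n_examples, 3).
# Caution: argmax returns the index of the largest logit (0, 1 or 2), not a
# 0/1 Vegan prediction, so this report compares label indices with binary labels.
preds = np.argmax(preds.predictions, axis=1)
GT = df_train['Vegan'].tolist()
print(classification_report(GT,preds))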

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        90
           1       0.03      0.14      0.05        22
           2       0.00      0.00      0.00         0

    accuracy                           0.03       112
   macro avg       0.01      0.05      0.02       112
weighted avg       0.01      0.03      0.01       112
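The very low scores here are probably an artifact of how the predictions are decoded rather than of the model itself. The sketch below uses invented logits to contrast argmax, which picks one label index per row, with the per-label sigmoid threshold that matches multi-label training. The label names use ASCII hyphens and `label_predictions` is a hypothetical helper.

```python
import math

LABELS = ["Vegan", "Gluten-Free", "Dairy-Free"]

def argmax_index(row):
    """Index of the largest logit in a row (what np.argmax(..., axis=1) returns)."""
    return max(range(len(row)), key=row.__getitem__)

def label_predictions(logits, label_index, threshold=0.5):
    """0/1 predictions for one label: sigmoid of that label's logit, then threshold."""
    return [1 if 1 / (1 + math.exp(-row[label_index])) >= threshold else 0 for row in logits]

# Hypothetical logits for three restaurants.
logits = [[-2.0, 1.5, -0.5], [0.3, 2.0, -1.0], [-1.0, -0.2, -3.0]]

argmax_preds = [argmax_index(row) for row in logits]
vegan_preds = label_predictions(logits, LABELS.index("Vegan"))
gluten_free_preds = label_predictions(logits, LABELS.index("Gluten-Free"))
print(argmax_preds, vegan_preds, gluten_free_preds)
```

In this toy case argmax answers "which label has the largest logit" (always index 1 here), which would explain why the reports list a class 2 with zero support and why only the Gluten‑Free column, whose index happens to be 1, looks reasonable. Comparing each label's thresholded column with the matching DataFrame column would give a per-label report consistent with the F1 reported during training.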





In [30]:
preds = trainer.predict(tokenized_test)
preds = np.argmax(preds.predictions, axis=1)  # index of the largest logit, not a 0/1 prediction
GT = df_test['Vegan'].tolist()
print(classification_report(GT,preds))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        26
           1       0.04      0.50      0.07         2
           2       0.00      0.00      0.00         0

    accuracy                           0.04        28
   macro avg       0.01      0.17      0.02        28
weighted avg       0.00      0.04      0.00        28





In [31]:
preds = trainer.predict(tokenized_train)
preds = np.argmax(preds.predictions, axis=1)  # index of the largest logit, not a 0/1 prediction
GT = df_train['Gluten‑Free'].tolist()
print(classification_report(GT,preds))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        35
           1       0.62      0.71      0.66        77
           2       0.00      0.00      0.00         0

    accuracy                           0.49       112
   macro avg       0.21      0.24      0.22       112
weighted avg       0.42      0.49      0.46       112





In [32]:
preds = trainer.predict(tokenized_test)
preds = np.argmax(preds.predictions, axis=1)  # index of the largest logit, not a 0/1 prediction
GT = df_test['Gluten‑Free'].tolist()
print(classification_report(GT,preds))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         9
           1       0.67      0.95      0.78        19
           2       0.00      0.00      0.00         0

    accuracy                           0.64        28
   macro avg       0.22      0.32      0.26        28
weighted avg       0.45      0.64      0.53        28





In [33]:
preds = trainer.predict(tokenized_train)
preds = np.argmax(preds.predictions, axis=1)  # index of the largest logit, not a 0/1 prediction
GT = df_train['Dairy‑Free'].tolist()
print(classification_report(GT,preds))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        87
           1       0.03      0.12      0.05        25
           2       0.00      0.00      0.00         0

    accuracy                           0.03       112
   macro avg       0.01      0.04      0.02       112
weighted avg       0.01      0.03      0.01       112





In [34]:
preds = trainer.predict(tokenized_test)
preds = np.argmax(preds.predictions, axis=1)  # index of the largest logit, not a 0/1 prediction
GT = df_test['Dairy‑Free'].tolist()
print(classification_report(GT,preds))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        26
           1       0.04      0.50      0.07         2
           2       0.00      0.00      0.00         0

    accuracy                           0.04        28
   macro avg       0.01      0.17      0.02        28
weighted avg       0.00      0.04      0.00        28



