# **Ranking Using Sentence Transformer**

**The objective of this project is to rank wines by similarity score and recommend the 15 best-matching wines to the user.**

If you are using Jupyter Notebook, run **!pipenv install** to install all necessary packages.

If you are on Colab (which is recommended), you can enable a hardware accelerator to speed up the operations:
1. **Edit > Notebook settings > GPU**
2. Insert a code cell below and run these commands before beginning:

  a. !pip install sentence_transformers
  
  b. !pip install nmslib

## **1. Import the Necessary Libraries**

In [44]:
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer, util
import nmslib
import torch

## **2. Define Necessary Variables**

In [45]:
dataset_path = 'dataset/wine_dataset.csv'
user_query = "firm, bitter wine with richness in texture."
model_name = 'distilbert-base-uncased'

## **3. Load the dataset**

I am using the wine review dataset. It contains around 20,000 rows with various columns, of which only a few are relevant here: country, description, title, variety, winery, price, and rating (the `points` column).

Dataset download link: https://drive.google.com/drive/folders/15tRf8lO3x22BgUTb90S7rGjlN-FhGBSx?usp=sharing

I took the original dataset of about 130,000 rows and removed about 110,000 of them for this project.
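If only a handful of columns are needed, pandas can skip the rest at load time through the `usecols` argument of `read_csv`. The sketch below is only an illustration: it reads two abbreviated rows from an in-memory string instead of the real file, and the names `csv_text`, `wanted`, and `df_small` are my own.

```python
import io
import pandas as pd

# Two rows shaped like the wine review CSV (descriptions abbreviated for illustration).
csv_text = io.StringIO(
    "country,description,points,price,title,variety,winery\n"
    'Italy,"Aromas include tropical fruit",87,,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia\n'
    'Portugal,"This is ripe and fruity",87,15.0,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos\n'
)

# Hypothetical selection of columns; every other column is skipped while parsing.
wanted = ['country', 'description', 'variety', 'winery', 'price', 'points']
df_small = pd.read_csv(csv_text, usecols=wanted)
print(df_small.columns.tolist())
```

Loading fewer columns keeps memory use down when the full file is large.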

In [46]:
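# on_bad_lines='skip' replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in 2.0
df = pd.read_csv(dataset_path, sep=',', engine='python', quotechar='"', on_bad_lines='skip')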
df.head(2)

| | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | Unnamed: 14 | Unnamed: 15 | Unnamed: 16 | Unnamed: 17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia | NaN | NaN | NaN | NaN |
| 1 | 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos | NaN | NaN | NaN | NaN |


Of all these columns, we will use only the **description** column for generating the embeddings.

In [47]:
df.drop(df.columns.difference(['description']), axis=1, inplace=True)

In [48]:
df.head()

| | description |
|---|---|
| 0 | Aromas include tropical fruit, broom, brimston... |
| 1 | This is ripe and fruity, a wine that is smooth... |
| 2 | Tart and snappy, the flavors of lime flesh and... |
| 3 | Pineapple rind, lemon pith and orange blossom ... |
| 4 | Much like the regular bottling from 2012, this... |


## **4. Create Embeddings**

To generate embeddings from the descriptions, we will use Sentence Transformers with the pretrained DistilBERT model.
Sentence Transformers is a framework for state-of-the-art sentence, text, and image embeddings.
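When a plain Hugging Face checkpoint such as distilbert-base-uncased is passed to SentenceTransformer, the library adds a pooling step on top of the transformer (mean pooling by default, as far as I know) that averages the token vectors into a single sentence vector. Here is a minimal NumPy sketch of that pooling idea. The helper name `mean_pool` and the toy numbers are my own and are not part of the library.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padded positions (hypothetical helper)."""
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)  # (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(axis=1)                    # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                    # avoid division by zero
    return summed / counts

# Toy input: 1 sentence, 3 token slots (the last one is padding), 2 dimensions.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [[2. 3.]]
```

The padded token is excluded from the average, so the result is the mean of the two real token vectors.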

In [49]:
distilbert = SentenceTransformer(model_name)

Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


DistilBERT is a transformer model that is smaller and faster than BERT. It was pretrained in a self-supervised way on the same corpus, using the BERT base model as a teacher.

More about 'distilbert-base-uncased' here: https://huggingface.co/distilbert-base-uncased


In [50]:
# Pass a plain list of strings; encode() returns one vector per description
embeddings = distilbert.encode(df['description'].tolist(), convert_to_tensor=True).cpu().numpy()

We encoded the descriptions and converted the resulting tensors to a NumPy array. Now let us look at the length (dimensionality) of one embedding and then add the embeddings to the dataframe as a new column.

In [51]:
print("Total length of one embedding:" + str(len(embeddings[0])))

Total length of one embedding:768


In [52]:
df['embeddings'] = torch.Tensor(embeddings).tolist()
df.head()

| | description | embeddings |
|---|---|---|
| 0 | Aromas include tropical fruit, broom, brimston... | [-0.3061617612838745, 0.22339025139808655, 0.0... |
| 1 | This is ripe and fruity, a wine that is smooth... | [-0.37820613384246826, -0.1644999235868454, 0.... |
| 2 | Tart and snappy, the flavors of lime flesh and... | [-0.217906653881073, -0.0607813261449337, 0.02... |
| 3 | Pineapple rind, lemon pith and orange blossom ... | [-0.2283596694469452, 0.021180907264351845, 0.... |
| 4 | Much like the regular bottling from 2012, this... | [-0.36137744784355164, -0.02509395033121109, 0... |
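Storing each embedding as a Python list in a dataframe cell is handy for display, but numeric libraries usually want a 2-D array. Here is a small sketch of the round trip, using a toy dataframe and made-up 3-dimensional vectors in place of the real 768-dimensional ones. The names `toy`, `vectors`, and `restored` are my own.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'description': ['wine a', 'wine b']})
vectors = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], dtype=np.float32)

# One Python list per row, as in the 'embeddings' column above.
toy['embeddings'] = vectors.tolist()

# Rebuild the (rows, dimensions) matrix from the list column.
restored = np.array(toy['embeddings'].tolist(), dtype=np.float32)
print(restored.shape)  # (2, 3)
```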


## **5. Create Search Index**

This is a very important section. I will use a library called **nmslib**, which stands for *Non-Metric Space Library*.

**nmslib** provides fast similarity search.
- The search runs over the data points we provide, here the description embeddings.
- To search, it **uses the data points and a dissimilarity measure (distance function)** that we specify.
- The combination of data points and the distance function is called the **search space**.

In short, the search space holds all the data points, and for a given query point the index returns the data points closest to it under the chosen distance.
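To make the idea of a search space concrete, here is a brute-force version of the search in NumPy: compute the distance from the query to every data point and keep the k smallest. This is only a sketch of what an index like nmslib approximates much faster, not how nmslib works internally. The helper name `brute_force_knn` and the toy points are my own.

```python
import numpy as np

def brute_force_knn(points, query, k):
    """Exact k-nearest-neighbour search by cosine distance (hypothetical helper)."""
    points_n = points / np.linalg.norm(points, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    distances = 1.0 - points_n @ query_n      # cosine distance to every point
    ids = np.argsort(distances)[:k]           # positions of the k smallest distances
    return ids, distances[ids]

# Four toy data points in 2 dimensions and one query.
points = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
ids, dists = brute_force_knn(points, np.array([1.0, 0.1]), k=2)
print(ids)  # [0 2]
```

Brute force compares the query against every row, which becomes slow for large collections; that is the motivation for building an index.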

In [53]:
embeddings_index = nmslib.init(method='hnsw', space='cosinesimil')

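We initialize the nmslib index by passing:

**method=hnsw**: Many indexing methods have been proposed over the years; one of the most popular is HNSW (Hierarchical Navigable Small World graph).

**space=cosinesimil**: This parameter selects the distance function. For our use case we can use cosine similarity (`cosinesimil`) or Euclidean distance (`l2`). With `cosinesimil`, the values returned by a query are cosine distances (1 minus the cosine similarity), so smaller values mean more similar items.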
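The difference between the two distance choices is easiest to see on toy vectors. Cosine distance compares only the direction of two vectors, while Euclidean distance also reacts to their length. The sketch below uses plain NumPy and made-up vectors. The function names are my own, and I am assuming cosine distance is defined as 1 minus the cosine similarity.

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 when the vectors point in the same direction
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    # Straight-line distance between the two points
    return np.linalg.norm(a - b)

a = np.array([1.0, 2.0, 3.0])
b = 10 * a                        # same direction, ten times the length
c = np.array([3.0, 2.0, 1.0])     # same length as a, different direction

print(cosine_distance(a, b))      # close to 0: the direction is identical
print(euclidean_distance(a, b))   # large: the magnitudes differ
print(cosine_distance(a, c))      # clearly above 0: the directions differ
```

For text embeddings, the direction usually carries the meaning, which is one reason cosine-based spaces are a common choice.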

In [54]:
embeddings_index.addDataPointBatch(embeddings)
embeddings_index.createIndex({'post': 2}, print_progress=True)

We then add the embeddings to the index and build it.

## **6. Rank Wines**

Now we encode the user's query and retrieve its 500 nearest neighbours from the index.

In [55]:
def rank_wine_based_on_distance(query):
    # Reload the full dataset to recover the metadata columns dropped from df earlier
    df_dataset = pd.read_csv(dataset_path, sep=',', engine='python', quotechar='"', on_bad_lines='skip')
    closest_datapoints_to_check = 500
    # Encode the query with the same model used for the descriptions
    query = distilbert.encode([query], convert_to_tensor=True).cpu().numpy()
    # knnQuery returns row positions and their distances to the query
    ids, distances = embeddings_index.knnQuery(query, k=closest_datapoints_to_check)
    matches = []
    for i, j in zip(ids, distances):
        matches.append({
            'country': df_dataset.country.values[i],
            'winery': df_dataset.winery.values[i],
            'variety': df_dataset.variety.values[i],
            'price': df_dataset.price.values[i],
            'similarity': j,  # cosine distance: smaller means more similar
        })
    return pd.DataFrame(matches)

In [56]:
df_rank = rank_wine_based_on_distance(user_query)
df_rank.head()

| | country | winery | variety | price | similarity |
|---|---|---|---|---|---|
| 0 | US | Sobon Estate | Zinfandel | 15 | 0.053573 |
| 1 | France | Domaine Barraud | Chardonnay | 28 | 0.064184 |
| 2 | Austria | Bründlmayer | Sparkling Blend | 44 | 0.064605 |
| 3 | Portugal | Borges | Portuguese White | 9 | 0.064635 |
| 4 | Austria | Pratsch | Grüner Veltliner | 22 | 0.065314 |
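The `similarity` values above are the distances returned by the index, so lower values mean closer matches. The sketch below assumes cosine distance equals 1 minus the cosine similarity. It adds an explicit similarity score and shows that picking the smallest distances and the largest similarities gives the same rows. The rows are copied from the output above, and the names `df_rank_sample` and `cosine_similarity` are my own.

```python
import pandas as pd

df_rank_sample = pd.DataFrame({
    'winery': ['Sobon Estate', 'Domaine Barraud', 'Bründlmayer', 'Borges', 'Pratsch'],
    'similarity': [0.053573, 0.064184, 0.064605, 0.064635, 0.065314],  # cosine distances
})

# Convert distance to similarity so that larger really means more similar.
df_rank_sample['cosine_similarity'] = 1.0 - df_rank_sample['similarity']

# Closest matches: smallest distance, equivalently largest cosine similarity.
top_by_distance = df_rank_sample.nsmallest(3, 'similarity')
top_by_similarity = df_rank_sample.nlargest(3, 'cosine_similarity')
print(top_by_distance['winery'].tolist())
```

Keeping both columns makes it explicit which direction "better" points in when sorting or filtering.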


## **7. Recommend User**

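The df_rank dataframe is sorted in ascending order of the `similarity` column. That column holds the distance returned by nmslib for the `cosinesimil` space, so smaller values mean closer matches and the best matches come first. Now let us recommend 15 wines to the user based on the description provided earlier as user_query. Note that `nlargest(15, 'similarity')` selects the 15 rows with the largest distance, i.e. the least similar of the retrieved neighbours; `df_rank.nsmallest(15, 'similarity')` or `df_rank.head(15)` returns the 15 closest matches instead.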

In [57]:
df_rank.nlargest(15, 'similarity')

Unnamed: 0,country,winery,variety,price,similarity
361,France,Louis Jadot,Chardonnay,15.0,0.106212
360,US,Naches Heights,Pinot Gris,16.0,0.106149
359,France,L. Tramier & Fils,Pinot Noir,120.0,0.105992
358,France,Cave de Lugny,Chardonnay,,0.105934
357,France,Cave de Viré,Chardonnay,15.0,0.105651
356,France,Famille Laplace,Tannat-Syrah,10.0,0.10565
355,France,Charles Clément,Champagne Blend,30.0,0.105639
354,France,Famille Laplace,Tannat-Syrah,10.0,0.105635
353,France,Henri de Villamont,Pinot Noir,,0.105146
352,Portugal,Quinta da Alorna,Portuguese White,15.0,0.10497
