# **Ranking Using Sentence Transformer**

In [None]:
!pip install sentence_transformers
!pip install nmslib
!pip install texthero



**1. Import the necessary dependencies**

In [None]:
import numpy as np
import pandas as pd
import texthero as hero
from texthero import preprocessing
from sentence_transformers import SentenceTransformer, util
import nmslib
import time
import datetime
import torch

**2. Load the dataset**

This notebook uses the wine review dataset. It contains around 50,000 rows with columns such as country, description, title, variety, winery, price, and points (the rating).

In [None]:
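# on_bad_lines='warn' (pandas >= 1.3) skips malformed rows and reports them;
# older pandas versions used error_bad_lines=False for the same behaviour.
df = pd.read_csv('wine_dataset.csv', sep=',', engine='python', quotechar='"', on_bad_lines='warn')
# Keep a full copy so the other columns can be looked up after the search.
df_copy = df.copy()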
df.head()

Skipping line 5148: ',' expected after '"'


Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,,,,
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,,,,
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,,,,
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,,,,
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,,,,


The dataset has many columns, but for this notebook we only need the **description** column.

In [None]:
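# Drop every column except 'description' (the positional axis argument was removed in pandas 2.0).
df.drop(columns=df.columns.difference(['description']), inplace=True)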

In [None]:
df.head()

Unnamed: 0,description
0,"Aromas include tropical fruit, broom, brimston..."
1,"This is ripe and fruity, a wine that is smooth..."
2,"Tart and snappy, the flavors of lime flesh and..."
3,"Pineapple rind, lemon pith and orange blossom ..."
4,"Much like the regular bottling from 2012, this..."


**3. Create Embeddings**

In [None]:
distilbert = SentenceTransformer('distilbert-base-uncased')

Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
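# Pass a plain list of strings; encoding the whole dataset can take a long time without a GPU.
embeddings = distilbert.encode(df['description'].tolist(), convert_to_tensor=True, show_progress_bar=True).cpu().numpy()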

KeyboardInterrupt: ignored

In [None]:
len(embeddings)

In [None]:
embeddings[0]
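
When a plain Hugging Face checkpoint such as `distilbert-base-uncased` is loaded through `SentenceTransformer`, the library adds a mean-pooling layer on top of the token embeddings, so each description becomes one fixed-length vector. The block below is a minimal pure-Python sketch of that pooling step. The token vectors are made up, and `mean_pool` is an illustrative helper, not a function of the library.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding tokens."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            for d in range(dim):
                total[d] += vector[d]
            count += 1
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]


# Toy input: three real tokens and one padding token, two dimensions each.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)
```

The padding token is excluded from the average, which is why the mask is needed in addition to the token vectors.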

**4. Create Search Index**

This is the most important step. We will use **nmslib**, the *Non-Metric Space Library*.

**nmslib** provides fast similarity search.
- The search runs over the data points we add to the index (here, the description embeddings).
- To search, it **uses the query and a dissimilarity measure (distance function)** that we specify.
- The combination of data points and the distance function is called the **search space**.


In short, the search space holds the data points, and for a given query the index returns the k data points nearest to it under the chosen distance.
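
To make the idea of a k-nearest-neighbour query concrete, here is a minimal brute-force sketch in pure Python. It compares the query with every data point, which is the exact search that an index like the one built below approximates much faster. The helper names (`cosine_distance`, `brute_force_knn`) and the toy vectors are assumptions made for illustration; they are not part of nmslib.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_knn(points, query, k):
    """Return (ids, distances) of the k points closest to the query."""
    scored = sorted((cosine_distance(p, query), i) for i, p in enumerate(points))
    top = scored[:k]
    return [i for _, i in top], [d for d, _ in top]


# Toy search space with four 2-D data points.
points = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
ids, distances = brute_force_knn(points, query=[1.0, 0.1], k=2)
print(ids, distances)
```

Brute force costs one distance computation per data point for every query, which is why an approximate index is worth building once the dataset grows to tens of thousands of embeddings.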

In [None]:
embeddings_index = nmslib.init(method='hnsw', space='cosinesimil')

We initialize the nmslib index by passing:

**method=hnsw** : nmslib supports several indexing methods; one of the most popular is hnsw (Hierarchical Navigable Small World graph).

**space=cosinesimil** : This parameter selects the distance function. `cosinesimil` is the cosine distance (1 - cosine similarity). For our use case we could also use Euclidean distance (`l2`).
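
The two distances are closely related once the vectors are normalized to unit length: the squared Euclidean distance then equals twice the cosine distance, so both produce the same neighbour ranking. The sketch below checks this on two toy vectors. It assumes the cosine distance is defined as 1 - cosine similarity, and the helper names are illustrative only.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance_unit(a, b):
    """Cosine distance for vectors that are already unit length."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])
# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b))
print(cosine_distance_unit(a, b), squared_l2(a, b) / 2)
```

Without normalization the two distances can rank neighbours differently, because Euclidean distance also depends on vector length.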

In [None]:
embeddings_index.addDataPointBatch(embeddings)
embeddings_index.createIndex({'post': 2}, print_progress=True)

We then add the embeddings to the index and build it. The `post` parameter controls how much post-processing is applied to the graph after construction.

In [None]:
def recommend_user(df_copy, query, k=20):
    # Return the k reviews closest to the query, with their distances, as a DataFrame.
    if df_copy is None or query is None:
        return None
    # Encoding a single string returns a 1-D vector, which is what knnQuery expects.
    query_vector = distilbert.encode(query)
    ids, distances = embeddings_index.knnQuery(query_vector, k=k)
    # The loaded data has 'points' (the rating) and no 'color' column, so 'color' is omitted.
    columns = ['country', 'winery', 'title', 'variety', 'description', 'price', 'points']
    matches = df_copy.iloc[ids][columns].rename(columns={'points': 'rating'}).reset_index(drop=True)
    matches['distance'] = distances
    return matches

In [None]:
recommend_user(df_copy, "Wine, tasty, sweet")
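
The `distance` column returned by the search is a cosine distance, so smaller values mean closer matches. If a similarity score is easier to read, it can be derived as 1 - distance. The following is a small sketch of that conversion on plain dictionaries. The rows are made up for illustration and `add_similarity` is a hypothetical helper, not part of the notebook above.

```python
def add_similarity(matches):
    """Return new rows with similarity = 1 - distance, best match first."""
    ranked = [dict(row, similarity=1.0 - row["distance"]) for row in matches]
    ranked.sort(key=lambda row: row["similarity"], reverse=True)
    return ranked


# Made-up rows for illustration only; real rows would come from recommend_user.
example_matches = [
    {"title": "Wine B", "distance": 0.25},
    {"title": "Wine A", "distance": 0.125},
]
for row in add_similarity(example_matches):
    print(row["title"], row["similarity"])
```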