Package location on my system:

Runs in an isolated virtual environment for SharpestMinds with Python 3.10.

In [2]:
# import pandas
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Farhad\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Farhad\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
# import data
df = pd.read_csv("https://raw.githubusercontent.com/nikitaa30/Content-based-Recommender-System/master/sample-data.csv")

In [4]:
df.columns

Index(['id', 'description'], dtype='object')

In [5]:
df.dropna(axis=0, how='any', inplace=True)

* explore the DataFrame

In [6]:
df

| | id | description |
|---|---|---|
| 0 | 1 | Active classic boxers - There's a reason why o... |
| 1 | 2 | Active sport boxer briefs - Skinning up Glory ... |
| 2 | 3 | Active sport briefs - These superbreathable no... |
| 3 | 4 | Alpine guide pants - Skin in, climb ice, switc... |
| 4 | 5 | Alpine wind jkt - On high ridges, steep ice an... |
| ... | ... | ... |
| 495 | 496 | Cap 2 bottoms - Cut loose from the maddening c... |
| 496 | 497 | Cap 2 crew - This crew takes the edge off fick... |
| 497 | 498 | All-time shell - No need to use that morning T... |
| 498 | 499 | All-wear cargo shorts - All-Wear Cargo Shorts ... |


We will use TF-IDF to find similar items based on their descriptions.
* instantiate TF-IDF

In [7]:
vectorizer = TfidfVectorizer()

* fit and transform the 'description' column with TF-IDF

In [8]:
X = vectorizer.fit_transform(df['description'])

indices = pd.Series(df.index)
indices[:5]

0    0
1    1
2    2
3    3
4    4
dtype: int64
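To make the TF-IDF step more concrete, here is a minimal sketch on a hypothetical three-document corpus (`toy_docs` and the other `toy_*` names are made up for illustration and are not part of the dataset). It assumes scikit-learn's default `TfidfVectorizer` settings and shows that, within one document, a term found in fewer documents gets a larger weight than a term shared by several.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the product data
toy_docs = [
    "merino wool base layer",
    "merino wool socks",
    "waterproof shell jacket",
]

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix, one row per document

# vocabulary_ maps each term to its column in the matrix
dense = toy_X.toarray()
first_row = {term: dense[0, col] for term, col in toy_vectorizer.vocabulary_.items()}

# Print only the terms that occur in the first document
for term, weight in sorted(first_row.items()):
    if weight > 0:
        print(f"{term:<10} {weight:.3f}")
```

In the first document, "base" and "layer" occur in only one document of the corpus, so they should come out with a higher weight than "merino" and "wool", which occur in two.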

* calculate the cosine similarity of each item with every other item in the dataset

In [9]:
similarity = cosine_similarity(X, X)

In [10]:
similarity

array([[1.        , 0.32792053, 0.20819843, ..., 0.17696975, 0.20143942,
        0.22598052],
       [0.32792053, 1.        , 0.5673509 , ..., 0.12925175, 0.21139731,
        0.19396413],
       [0.20819843, 0.5673509 , 1.        , ..., 0.13509939, 0.14185763,
        0.15717399],
       ...,
       [0.17696975, 0.12925175, 0.13509939, ..., 1.        , 0.14187074,
        0.17045334],
       [0.20143942, 0.21139731, 0.14185763, ..., 0.14187074, 1.        ,
        0.55846363],
       [0.22598052, 0.19396413, 0.15717399, ..., 0.17045334, 0.55846363,
        1.        ]])
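The matrix above is symmetric with ones on the diagonal, which follows from the definition of cosine similarity. As a small sketch (the vectors `a`, `b`, `c` and the helper `manual_cosine` are hypothetical, chosen only for illustration), the same quantity can be computed by hand and compared with `cosine_similarity`:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a

def manual_cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Pairwise similarities for all three vectors at once
sim = cosine_similarity(np.vstack([a, b, c]))

print(manual_cosine(a, b), sim[0, 1])  # parallel vectors
print(manual_cosine(a, c), sim[0, 2])  # orthogonal vectors
```

Parallel vectors score 1 and orthogonal vectors score 0, regardless of their lengths.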

* for each item i, sort all other items by their similarity to i, and store the values in the dictionary `results`

```
results = {
    "1": [5,7,9...],
    "2": [45,2,3...]
}
```

In [11]:
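# maps each 1-based item id to its raw row of similarity scores;
# the row is only sorted later, when it is queried
results = {idx + 1 : similarity[idx] for idx in range(len(similarity))}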
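The cell above stores unsorted similarity rows. One possible way to get the sorted-id layout sketched in the fenced example is shown below on a hypothetical 3x3 matrix (`toy_similarity` and `toy_results` are illustrative names). It assumes, as the `idx + 1` above does, that ids are consecutive and start at 1.

```python
import numpy as np

# Hypothetical similarity matrix for three items
toy_similarity = np.array([
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.4],
    [0.7, 0.4, 1.0],
])

toy_results = {}
for row_pos, row in enumerate(toy_similarity):
    # Row positions ordered from most to least similar
    order = np.argsort(row)[::-1]
    # Drop the item itself, then convert positions to 1-based ids
    toy_results[row_pos + 1] = [int(j) + 1 for j in order if j != row_pos]

print(toy_results)  # {1: [3, 2], 2: [3, 1], 3: [1, 2]}
```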

* create function `recommender` that will recommend similar products
    * function must have two input params: **item_id** and **count** of similar products 

In [12]:
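# function that takes an item_id and returns the `count` most similar items
# NOTE: item_id is matched against the DataFrame index labels, not the 'id' column,
# and the default cosine_sim is bound once, when the function is defined

def recommender(item_id, count=5, cosine_sim=similarity):

    # position of the requested item in the similarity matrix
    idx = indices[indices == item_id].index[0]

    # similarity scores for that item, highest first
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending=False)

    # skip the first entry (the item itself) and keep the next `count` items
    top_indexes = list(score_series.iloc[1:count + 1].index)

    # map positions back to DataFrame index labels
    return [df.index[i] for i in top_indexes]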
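The function above looks items up by DataFrame index label. If the lookup should go through the `id` column instead, a variant could look like the following sketch. Everything here (`toy_df`, `toy_sim`, `recommend_by_id`) is hypothetical and uses made-up data, so it is an illustration of the idea rather than a drop-in replacement.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical data with ids that do not match the row positions
toy_df = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "description": [
        "merino wool base layer top",
        "merino wool base layer bottoms",
        "waterproof shell jacket",
        "waterproof shell pants",
    ],
})
toy_sim = cosine_similarity(TfidfVectorizer().fit_transform(toy_df["description"]))

def recommend_by_id(item_id, count=2, frame=toy_df, sim=toy_sim):
    """Return the ids of the `count` items most similar to `item_id`."""
    ids = frame["id"].to_numpy()
    positions = np.flatnonzero(ids == item_id)
    if positions.size == 0:
        raise KeyError(f"unknown item id: {item_id}")
    # Label the similarity row with item ids, drop the item itself, sort
    scores = pd.Series(sim[positions[0]], index=ids).drop(item_id)
    top = scores.sort_values(ascending=False).index[:count]
    return [int(i) for i in top]

print(recommend_by_id(101, count=1))
print(recommend_by_id(103, count=1))
```

Passing the frame and the similarity matrix as explicit arguments also makes it easy to swap in a different matrix later without redefining the function.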

* show the top 5 most similar items for the item with item_id = 11

In [13]:
item_id_11_Tf_Idf = pd.DataFrame(results[recommender(11)[0]])
item_id_11_Tf_Idf = item_id_11_Tf_Idf.sort_values(0, axis=0, ascending=False)[0:6]

In [14]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['description'])

indices = pd.Series(df.index)
indices[:5]

0    0
1    1
2    2
3    3
4    4
dtype: int64

In [15]:
similarity = cosine_similarity(X, X)

In [16]:
results = {idx + 1 : similarity[idx] for idx in range(len(similarity))}

In [17]:
item_id_11_countvectorizer = pd.DataFrame(results[recommender(11)[0]])
item_id_11_countvectorizer = item_id_11_countvectorizer.sort_values(0, axis=0, ascending=False)[0:6]
item_id_11_countvectorizer

| | 0 |
|---|---|
| 400 | 1.0 |
| 278 | 0.789982 |
| 46 | 0.779829 |
| 403 | 0.770904 |
| 248 | 0.763223 |
| 362 | 0.759253 |


In [18]:
item_id_11_Tf_Idf

| | 0 |
|---|---|
| 400 | 1.0 |
| 278 | 0.421506 |
| 403 | 0.40433 |
| 197 | 0.331858 |
| 46 | 0.308194 |
| 362 | 0.295987 |


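The results differ because CountVectorizer uses raw term counts, while TF-IDF gives less weight to words that appear in many documents of the corpus. Words shared across many descriptions therefore tend to raise the CountVectorizer similarities (0.79 for the closest match), whereas the TF-IDF similarities are lower (0.42 for the same item) and the ranking of the remaining items changes.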
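The effect can be reproduced on a small scale. The sketch below uses a hypothetical corpus (`toy_docs`) in which every document shares the same common words and differs in one term only. It assumes the default settings of both vectorizers.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus: common words everywhere, one distinctive word each
toy_docs = [
    "the fabric of the jacket",
    "the fabric of the pants",
    "the fabric of the socks",
]

count_sim = cosine_similarity(CountVectorizer().fit_transform(toy_docs))
tfidf_sim = cosine_similarity(TfidfVectorizer().fit_transform(toy_docs))

# Similarity between the first two documents under each weighting
print(f"count : {count_sim[0, 1]:.3f}")
print(f"tf-idf: {tfidf_sim[0, 1]:.3f}")
```

With raw counts the shared words dominate the vectors, so the two documents look very similar. TF-IDF boosts the one distinctive word in each document, which pulls the similarity down.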