In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

A simple way to try out the TF-IDF algorithm on the product descriptions
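As a quick illustration of what TF-IDF does before applying it to the real data, here is a minimal sketch on three made-up descriptions. The `toy_docs` texts and the `toy_` names are hypothetical and not taken from `lego_sets.csv`. A word that appears in several documents should end up with a lower inverse document frequency than a word that appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy descriptions (not from lego_sets.csv)
toy_docs = [
    "build the red starfighter",
    "build the blue starfighter",
    "launch the catapult at the pigs",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_docs)  # one row per document

vocab = toy_vectorizer.vocabulary_   # word -> column index
idf = toy_vectorizer.idf_            # one idf weight per column

# "starfighter" occurs in two documents, "catapult" in one,
# so "starfighter" gets the lower idf weight
print(idf[vocab["starfighter"]], idf[vocab["catapult"]])
```

Rare words therefore contribute more to the similarity between two descriptions than words shared by many products.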

In [7]:
lego_sets = pd.read_csv('../data/lego_sets.csv')
lego_sets.drop_duplicates(subset=['prod_id'], inplace=True)
lego_sets = lego_sets.reset_index(drop=True)


In [8]:
lego_sets['prod_id'] = lego_sets['prod_id'].astype(int)

In [9]:
lego_df = lego_sets[['prod_id','set_name','prod_long_desc','list_price','prod_desc','num_reviews','ages','country','star_rating','theme_name']]
lego_df.head(2)

Unnamed: 0,prod_id,set_name,prod_long_desc,list_price,prod_desc,num_reviews,ages,country,star_rating,theme_name
0,75823,Bird Island Egg Heist,Use the staircase catapult to launch Red into ...,29.99,Catapult into action and take back the eggs fr...,2.0,6-12,US,4.5,Angry Birds™
1,75822,Piggy Plane Attack,Pilot Pig has taken off from Bird Island with ...,19.99,Launch a flying attack and rescue the eggs fro...,2.0,6-12,US,5.0,Angry Birds™


The similarity analysis below is based only on the product description. It can and should be extended with further signals:

- Location
- Time
- Customer age (child)
- Customer age (buyer)
- Interests
- Activity
- Past purchasing behavior
- User group purchasing behavior
- Regional purchasing behavior
- Trend predictions
- Similarities in price, dimension, complexity, etc.
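One possible way to bring in a second signal such as price is to blend the text similarity with a price-closeness score. The sketch below is only an assumption about how that could look: `blend_similarities`, `text_weight`, and the toy numbers are hypothetical, and the function expects a dense array (a sparse similarity matrix would first need `.toarray()`).

```python
import numpy as np

def blend_similarities(text_sim, prices, text_weight=0.8):
    """Blend a text-similarity matrix with a price-closeness matrix (hypothetical helper)."""
    prices = np.asarray(prices, dtype=float)
    # Absolute price difference between every pair of products
    diff = np.abs(prices[:, None] - prices[None, :])
    # Scale to [0, 1]: identical price -> 1, largest price gap -> 0
    max_diff = diff.max()
    price_sim = 1.0 - diff / max_diff if max_diff > 0 else np.ones_like(diff)
    return text_weight * np.asarray(text_sim) + (1.0 - text_weight) * price_sim

# Hypothetical toy data: three products
toy_text_sim = np.array([[1.0, 0.5, 0.1],
                         [0.5, 1.0, 0.2],
                         [0.1, 0.2, 1.0]])
toy_prices = [29.99, 19.99, 99.99]

blended = blend_similarities(toy_text_sim, toy_prices)
print(blended.round(3))
```

The weight between the two signals is a free choice here; in practice it would have to be tuned against real purchasing data.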

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english", min_df=2) # keep only words that appear in at least two documents (i.e. product descriptions)
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning raised when writing into a slice of lego_sets
lego_df = lego_df.assign(prod_long_desc=lego_df['prod_long_desc'].fillna(""))

TF_IDF_matrix = vectorizer.fit_transform(lego_df['prod_long_desc'])


In [11]:
TF_IDF_matrix.shape

(744, 3756)

In [12]:
lego_df[lego_df['set_name'].str.contains('Jedi', case= False ,na=False)].sample(2)

Unnamed: 0,prod_id,set_name,prod_long_desc,list_price,prod_desc,num_reviews,ages,country,star_rating,theme_name
610,75168,Yoda's Jedi Starfighter™,Add a classic ship to your LEGO® Star Wars col...,24.99,Journey to the stars aboard Yoda’s starship!,37.0,8-12,US,4.5,Star Wars™
581,75206,Jedi™ and Clone Troopers™ Battle Pack,Command your own LEGO® Star Wars Jedi & Clone ...,14.99,Take on the Separatists with the Jedi and Clon...,3.0,6-12,US,4.0,Star Wars™


In [13]:
from sklearn.metrics.pairwise import cosine_similarity

product1 = TF_IDF_matrix[(lego_df['prod_id'] == 41609).values,]
product2 = TF_IDF_matrix[(lego_df['prod_id'] == 41608).values,]

print("Similarity:", cosine_similarity(product1, product2)) # Note that the result is a 2D 1x1 array, so to get
                                                          # the number itself we need to index into it

Similarity: [[0.73018903]]
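To make the 1x1 result concrete, here is a small sketch on two made-up vectors (the values of `a` and `b` are hypothetical). It pulls the scalar out of the array and checks it against the cosine formula computed by hand, i.e. the dot product divided by the product of the vector norms.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical row vectors, shaped (1, n_features) like a single TF-IDF row
a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 1.0, 0.0]])

result = cosine_similarity(a, b)   # 2D array of shape (1, 1)
score = result[0, 0]               # the similarity as a plain number

# Same value computed directly from the definition
manual = a.ravel() @ b.ravel() / (np.linalg.norm(a) * np.linalg.norm(b))
print(score, manual)
```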


In [14]:
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(TF_IDF_matrix, dense_output=False)

In [15]:
similarities.shape

(744, 744)

In [16]:
def content_recommender(set_name, products, similarities, vote_threshold=10):

    # Get the row index of the product with this set name (first match if the name is not unique)
    product_index = products[products['set_name'] == set_name].index[0]

    # Create a dataframe with the similarity of every product to the chosen one
    sim_df = pd.DataFrame(
        {'set_name': products['set_name'],
         'prod_id': products['prod_id'],
         'similarity': similarities[product_index, :].toarray().ravel(),
         'num_reviews': products['num_reviews']
        })

    # Get the top 10 products with more than vote_threshold reviews
    top_products = sim_df[sim_df['num_reviews'] > vote_threshold].sort_values(by='similarity', ascending=False).head(10)

    return top_products

In [17]:
similar_products = content_recommender('Jedi Starfighter™ With Hyperdrive', lego_df, similarities, vote_threshold=10)
similar_products.head(10)

Unnamed: 0,set_name,prod_id,similarity,num_reviews
629,Jedi Starfighter™ With Hyperdrive,75191,1.0,15.0
647,The Phantom,75170,0.25955,25.0
610,Yoda's Jedi Starfighter™,75168,0.25189,37.0
597,Y-Wing Starfighter™,75172,0.242695,51.0
593,Poe's X-Wing Fighter™,75102,0.227475,71.0
620,Tatooine™ Battle Pack,75198,0.213069,14.0
612,Duel on Naboo™,75169,0.197085,32.0
740,A-Wing Starfighter™,75175,0.18131,24.0
587,Resistance Bomber,75188,0.157,23.0
583,Death Star™,75159,0.156804,84.0


This simple algorithm returns more products than the current Gift Shop does, and it also allows a single category, here Star Wars, to be selected.
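If selecting one category is read as restricting the recommendations to a single theme, a filter on `theme_name` could look like the sketch below. The helper `filter_by_theme` and the toy frames are hypothetical, and the lookup assumes that `prod_id` is unique, which holds here after `drop_duplicates(subset=['prod_id'])`.

```python
import pandas as pd

def filter_by_theme(recommendations, products, theme_name):
    """Keep only recommendations whose product belongs to the given theme (hypothetical helper)."""
    # Lookup table prod_id -> theme_name (requires unique prod_id values)
    themes = products.set_index('prod_id')['theme_name']
    mask = recommendations['prod_id'].map(themes) == theme_name
    return recommendations[mask]

# Hypothetical toy data standing in for lego_df and a recommender result
toy_products = pd.DataFrame({
    'prod_id': [1, 2, 3],
    'theme_name': ['Star Wars™', 'Angry Birds™', 'Star Wars™'],
})
toy_recs = pd.DataFrame({
    'prod_id': [1, 2, 3],
    'similarity': [1.0, 0.4, 0.3],
})

star_wars_only = filter_by_theme(toy_recs, toy_products, 'Star Wars™')
print(star_wars_only)
```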