# Create a file from two datasets and add headers, then use it for an "item-based" similarity algorithm

We'll start by loading up the Shopping Rates dataset. Using Pandas, we can very quickly load the columns we care about from the two CSV files (ratings and products), and merge them together so we can work with product names instead of IDs. (In a real production job, you'd stick with IDs and worry about the names at the display layer to make things more efficient. But this lets us understand what's going on better for now.)

In [1]:
import pandas as pd

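# Column names to assign to the ratings file (the first three columns are read)
r_cols = ['product_id', 'user_id', 'rating']
ratings = pd.read_csv('C:\\Users\\Diego Alves\\Desktop\\Data_sets\\test_jupyter.csv', names=r_cols, usecols=range(3), encoding="ISO-8859-1")

# Column names to assign to the products file (the first two columns are read)
m_cols = ['product_id', 'title']
products = pd.read_csv('C:\\Users\\Diego Alves\\Desktop\\Data_sets\\test_jupyter_2.csv', names=m_cols, usecols=range(2), encoding="ISO-8859-1")

# Merge on the shared product_id column so every rating carries its product title
ratings = pd.merge(products, ratings)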

In [2]:
ratings.head()

Unnamed: 0,product_id,title,user_id,rating
0,1,WHITE HANGING HEART T-LIGHT HOLDER,102,3
1,2,WHITE METAL LANTERN,109,2
2,3,CREAM CUPID HEARTS COAT HANGER,109,5
3,4,KNITTED UNION FLAG HOT WATER BOTTLE,104,3
4,5,RED WOOLLY HOTTIE WHITE HEART.,103,5
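
The merge above relies on pandas finding the one column the two frames share. As a minimal sketch of the same step with the join key spelled out, the toy frames below stand in for the two CSV files; their names and values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (hypothetical values, for illustration only).
ratings_toy = pd.DataFrame({
    "product_id": [1, 1, 2, 3],
    "user_id": [101, 102, 101, 103],
    "rating": [3, 4, 2, 5],
})
products_toy = pd.DataFrame({
    "product_id": [1, 2, 3],
    "title": ["LANTERN", "COAT HANGER", "WATER BOTTLE"],
})

# Naming the key makes the join explicit instead of relying on shared column names.
merged_toy = pd.merge(products_toy, ratings_toy, on="product_id", how="inner")
print(merged_toy)
```

Stating `on="product_id"` guards against an accidental join on some other column if the two files ever gain another column with the same name.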


Now the amazing pivot_table function on a DataFrame will construct a user / product rating matrix. Note how NaN indicates missing data - products that specific users didn't rate.

In [3]:
productsRatings = ratings.pivot_table(index=['user_id'],columns=['title'],values='rating')
productsRatings.head()

user_id,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,I LOVE LONDON MINI RUCKSACK,NINE DRAWER OFFICE TIDY,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,...,wrongly coded 20713,wrongly coded 23343,wrongly coded-23343,wrongly marked,wrongly marked 23343,wrongly marked carton 22804,wrongly marked. 23343 in box,wrongly sold (22719) barcode,wrongly sold as sets,wrongly sold sets
101,3.5,3.222222,3.0,2.2,,3.666667,2.846154,3.083333,3.138889,2.85,...,,4.0,,,4.0,,4.0,,,
102,3.333333,2.866667,3.157895,2.642857,,,2.809524,3.357143,3.1,2.647059,...,,,3.0,,,,,,,
103,3.25,3.727273,2.9,2.625,,2.0,2.9,3.0,3.346154,3.0,...,,,,3.0,,,,,,
104,3.5,3.266667,3.071429,3.25,,2.0,2.764706,3.2,2.692308,2.909091,...,,,,,,,,,,
105,3.0,2.1,2.782609,2.2,4.0,2.0,3.7,3.428571,2.533333,3.384615,...,4.5,,,,,,,2.0,,
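
The fractional values in the matrix come from how pivot_table handles repeated (user, product) pairs: by default it averages them. A small sketch on made-up data (the frame and its values are hypothetical) shows both the averaging and the NaN for an unrated product:

```python
import pandas as pd

# Hypothetical ratings: user 101 rated LANTERN twice, user 102 never rated BEAKER.
toy = pd.DataFrame({
    "user_id": [101, 101, 101, 102],
    "title": ["LANTERN", "LANTERN", "BEAKER", "LANTERN"],
    "rating": [2, 5, 4, 3],
})

# The default aggfunc is the mean, so duplicate (user, title) pairs are averaged.
matrix_toy = toy.pivot_table(index="user_id", columns="title", values="rating")
print(matrix_toy)
```

Here user 101's two LANTERN ratings collapse into their average, and the cell for user 102 and BEAKER is NaN.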


Let's extract the Series of user ratings for "WHITE METAL LANTERN":

In [4]:
WHITE_METAL_LANTERN_Ratings = productsRatings['WHITE METAL LANTERN']
WHITE_METAL_LANTERN_Ratings.head()

user_id
101    3.285714
102    3.224490
103    3.189189
104    3.088235
105    2.576923
Name: WHITE METAL LANTERN, dtype: float64

Pandas' corrwith function makes it really easy to compute the pairwise correlation of the WHITE METAL LANTERN vector of user ratings with every other product! After that, we'll drop any results that have no data, and construct a new DataFrame of products and their correlation score (similarity) to WHITE METAL LANTERN:

In [5]:
similarProducts = productsRatings.corrwith(WHITE_METAL_LANTERN_Ratings)
similarProducts = similarProducts.dropna()
df = pd.DataFrame(similarProducts)
df.head(10)

  c = cov(x, y, rowvar)
  c *= np.true_divide(1, fact)


title,0
4 PURPLE FLOCK DINNER CANDLES,0.211416
50'S CHRISTMAS GIFT BAG LARGE,0.435368
DOLLY GIRL BEAKER,0.257584
I LOVE LONDON MINI BACKPACK,-0.005706
NINE DRAWER OFFICE TIDY,0.131937
OVAL WALL MIRROR DIAMANTE,-0.562115
RED SPOT GIFT BAG LARGE,-0.084893
SET 2 TEA TOWELS I LOVE LONDON,0.530502
SPACEBOY BABY GIFT SET,-0.715176
TRELLIS COAT RACK,-0.537366
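
To see what corrwith and dropna are doing, here is a minimal sketch on a made-up four-user matrix (column names and values are hypothetical). Column D shares only one rated user with A, so no correlation can be computed for it and it is dropped:

```python
import numpy as np
import pandas as pd

# Hypothetical user x product matrix; NaN means "not rated".
matrix_toy = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0, 4.0],
        "B": [2.0, 4.0, 6.0, 8.0],           # moves exactly with A
        "C": [4.0, 3.0, 2.0, 1.0],           # moves exactly against A
        "D": [np.nan, np.nan, np.nan, 5.0],  # only one rating overall
    },
    index=[101, 102, 103, 104],
)

# Correlate every column with A, then drop columns where no score exists.
sims = matrix_toy.corrwith(matrix_toy["A"]).dropna()
print(sims)
```

A single overlapping rating has no variance, which is the kind of situation that can trigger a NumPy runtime warning like the one shown above.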


(That warning is safe to ignore.) Let's sort the results by similarity score, and we should have the products most similar to WHITE METAL LANTERN! Except... we don't. These results make no sense at all! This is why it's important to know your data - clearly we missed something important.

In [6]:
similarProducts.sort_values(ascending=False)

title
RED STONE/CRYSTAL EARRINGS             1.0
DAIRY MAID  PUDDING BOWL               1.0
WRAP  VINTAGE DOILEY                   1.0
LEAVES MAGNETIC  SHOPPING LIST         1.0
VINTAGE PHOTO ALBUM PARIS DAYS         1.0
                                      ... 
SET 10 MINICARDS CUTE SNOWMAN 17071   -1.0
PADS TO MATCH ALL CUSHIONS            -1.0
PACK OF 12 DOILEY TISSUES             -1.0
ORANGE PENDANT TRIPLE SHELL NECKLAC   -1.0
wet/rusty                             -1.0
Length: 3850, dtype: float64

Our results are probably getting messed up by products that have only been rated by a handful of people who also happened to rate WHITE METAL LANTERN. So we need to get rid of products that were only bought by a few people and are producing spurious results. Let's construct a new DataFrame that counts up how many ratings exist for each product, and also the average rating while we're at it - that could also come in handy later.
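
Why so many scores of exactly 1.0 and -1.0? When only two users have rated both products, the Pearson correlation of those two points is always +1 or -1, whatever the ratings are. A sketch on made-up data (names and values are hypothetical) that also counts the overlap behind each score:

```python
import numpy as np
import pandas as pd

# Hypothetical matrix: RARE shares only two raters with TARGET.
matrix_toy = pd.DataFrame(
    {
        "TARGET": [1.0, 2.0, 4.0, 5.0, 3.0],
        "RARE": [np.nan, np.nan, 2.0, 3.0, np.nan],
        "COMMON": [2.0, 1.0, 4.0, 5.0, 2.0],
    },
    index=[101, 102, 103, 104, 105],
)

target = matrix_toy["TARGET"]
corr = matrix_toy.corrwith(target)

# Number of users who rated both TARGET and each product.
overlap = matrix_toy.loc[target.notna()].count()

print(pd.DataFrame({"similarity": corr, "overlap": overlap}))
```

RARE gets a perfect score from just two shared raters, while COMMON, backed by five, gets a more believable value below 1.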

## Let's try to improve these results:

Can you improve on these results? Perhaps a different method or min_periods value on the correlation computation would produce more interesting results (pandas' DataFrame.corr accepts both a method and a min_periods argument).
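
As one possible starting point for that experiment, the sketch below uses `DataFrame.corr`, which computes the full product-by-product correlation matrix and takes a `min_periods` argument. The toy matrix and the threshold of 3 are assumptions chosen for illustration, not values from this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical user x product matrix; RARE shares only two raters with TARGET.
matrix_toy = pd.DataFrame(
    {
        "TARGET": [1.0, 2.0, 4.0, 5.0, 3.0],
        "RARE": [np.nan, np.nan, 2.0, 3.0, np.nan],
        "COMMON": [2.0, 1.0, 4.0, 5.0, 2.0],
    },
    index=[101, 102, 103, 104, 105],
)

# Pairs with fewer than min_periods shared raters get NaN instead of a score.
corr_matrix = matrix_toy.corr(method="pearson", min_periods=3)
sims = corr_matrix["TARGET"].dropna()
print(sims)
```

On the real matrix the full correlation matrix can be large, so a sensible threshold and the memory cost are both things to check before relying on this approach.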

Also, when similarity scores like these are used to build recommendations for a user, products similar to ones that user rated poorly can still make it through to the final list. Perhaps products similar to ones the user rated poorly should actually be penalized, instead of just scaled down?

There are also probably some outliers in the user rating data set - some users may have rated a huge number of products and have a disproportionate effect on the results. Go back to earlier lectures to learn how to identify these outliers, and see if removing them improves things.

For an even bigger project: we're evaluating the result qualitatively here, but we could actually apply train/test and measure our ability to predict user ratings for products they've already rated. Whether that's actually a measure of a "good" recommendation is debatable, though!

In [7]:
# Count the ratings and compute the mean rating for each product
productStats = ratings.groupby('title').agg({'rating': ['size', 'mean']})
productStats.head()

title,"(rating, size)","(rating, mean)"
4 PURPLE FLOCK DINNER CANDLES,41,3.219512
50'S CHRISTMAS GIFT BAG LARGE,130,3.069231
DOLLY GIRL BEAKER,181,2.944751
I LOVE LONDON MINI BACKPACK,88,2.647727
I LOVE LONDON MINI RUCKSACK,1,4.0
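
The two-level column header above can be awkward to work with later. As an alternative sketch, pandas' named aggregation yields flat column names directly. The names `n_ratings` and `mean_rating`, the toy data, and the threshold of 3 are all hypothetical choices for illustration:

```python
import pandas as pd

# Hypothetical ratings for two products.
toy = pd.DataFrame({
    "title": ["LANTERN", "LANTERN", "LANTERN", "BEAKER"],
    "rating": [2, 3, 4, 5],
})

# Named aggregation: output_column=(input_column, aggregation).
stats_toy = toy.groupby("title").agg(
    n_ratings=("rating", "size"),
    mean_rating=("rating", "mean"),
)

# Keep only products with enough ratings (threshold chosen for the toy data).
popular_toy = stats_toy[stats_toy["n_ratings"] >= 3]
print(popular_toy)
```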


Let's get rid of any products with fewer than 100 ratings, and check the top-rated ones that are left:

In [8]:
popularProducts = productStats['rating']['size'] >= 100
productStats[popularProducts].sort_values([('rating', 'mean')], ascending=False)[:15]

title,"(rating, size)","(rating, mean)"
12 EGG HOUSE PAINTED WOOD,100,3.54
SET 6 FOOTBALL CELEBRATION CANDLES,109,3.477064
HEARTS STICKERS,120,3.375
MINI LADLE LOVE HEART RED,155,3.36129
RED REFECTORY CLOCK,112,3.348214
CHRISTMAS METAL POSTCARD WITH BELLS,116,3.344828
LARGE CAKE TOWEL CHOCOLATE SPOTS,109,3.330275
MAGNETS PACK OF 4 VINTAGE LABELS,128,3.304688
COFFEE MUG DOG + BALL DESIGN,205,3.302439
DOORMAT WELCOME SUNRISE,123,3.300813


100 might still be too low, but these results look pretty good as far as "well rated products that people have heard of." Let's join this data with our original set of similar products to WHITE METAL LANTERN:

In [9]:
df = productStats[popularProducts].join(pd.DataFrame(similarProducts, columns=['similarity']))



In [10]:
df.head()

title,"(rating, size)","(rating, mean)",similarity
50'S CHRISTMAS GIFT BAG LARGE,130,3.069231,0.435368
DOLLY GIRL BEAKER,181,2.944751,0.257584
OVAL WALL MIRROR DIAMANTE,162,2.950617,-0.562115
RED SPOT GIFT BAG LARGE,105,3.2,-0.084893
SET 2 TEA TOWELS I LOVE LONDON,282,2.865248,0.530502
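
The tuple-style column names above come from joining a frame with two-level columns to one with ordinary columns, and how pandas handles that mix may vary between versions. A hedged sketch of one way to sidestep it is to flatten the column names first and then join a named Series. The toy values and the flattened names `rating_size` and `rating_mean` are assumptions for illustration:

```python
import pandas as pd

# Hypothetical stats frame with two-level columns, like the groupby output.
stats_toy = pd.DataFrame(
    [[130, 3.07], [181, 2.94]],
    index=pd.Index(["GIFT BAG", "BEAKER"], name="title"),
    columns=pd.MultiIndex.from_tuples([("rating", "size"), ("rating", "mean")]),
)
# Hypothetical similarity scores; OTHER has no stats row.
sims_toy = pd.Series({"GIFT BAG": 0.44, "BEAKER": 0.26, "OTHER": 0.9}, name="similarity")

# Flatten ("rating", "size") into "rating_size", and so on.
flat = stats_toy.copy()
flat.columns = ["_".join(col) for col in flat.columns]

# Left join on the index: only products present in the stats frame are kept.
joined = flat.join(sims_toy)
print(joined)
```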


Now sort these new results by similarity score. That's more like it!

In [11]:
df.sort_values(['similarity'], ascending=False)[:15]

title,"(rating, size)","(rating, mean)",similarity
WHITE METAL LANTERN,328,2.926829,1.0
NATURAL SLATE RECTANGLE CHALKBOARD,427,3.032787,0.878113
MINI JIGSAW DINOSAUR,103,3.097087,0.87461
"KEY FOB , SHED",380,3.076316,0.844699
SET 12 KIDS COLOUR CHALK STICKS,376,2.888298,0.831866
SET OF 6 HEART CHOPSTICKS,159,2.880503,0.819564
SET/2 RED RETROSPOT TEA TOWELS,403,2.853598,0.810626
METAL SIGN CUPCAKE SINGLE HOOK,113,3.115044,0.806738
VINTAGE RED ENAMEL TRIM JUG,146,2.945205,0.806278
BOX OF 6 CHRISTMAS CAKE DECORATIONS,224,3.098214,0.79675


Ideally we'd also filter out the product we started from - of course WHITE METAL LANTERN is 100% similar to itself. But otherwise these results aren't bad.
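
Filtering out the starting product can be done by dropping its row before sorting. A minimal sketch, using a toy frame with rounded, made-up similarity values rather than the real `df`:

```python
import pandas as pd

target = "WHITE METAL LANTERN"

# Hypothetical similarity table indexed by product title.
df_toy = pd.DataFrame(
    {"similarity": [1.0, 0.88, 0.87]},
    index=pd.Index([target, "CHALKBOARD", "JIGSAW"], name="title"),
)

# Drop the product we started from, then rank what is left.
recs = df_toy.drop(index=target, errors="ignore").sort_values("similarity", ascending=False)
print(recs)
```

Passing `errors="ignore"` keeps the line from failing if the target was already filtered out, for example by the popularity threshold.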