
___
# Chapter 1 - Simple Approaches to Recommender Systems
## Segment 3 - Making Recommendations Based on Correlation

In [8]:
import numpy as np
import pandas as pd

These datasets are hosted on: https://archive.ics.uci.edu/ml/datasets/Restaurant+%26+consumer+data

They were originally published by: Blanca Vargas-Govea, Juan Gabriel González-Serna, Rafael Ponce-Medellín. Effects of relevant contextual features in the performance of a restaurant recommender system. In RecSys'11: Workshop on Context Aware Recommender Systems (CARS-2011), Chicago, IL, USA, October 23, 2011.

In [9]:
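frame = pd.read_csv('rating_final.csv')
cuisine = pd.read_csv('chefmozcuisine.csv')
# The original call used encoding='mbcs', which exists only on Windows.
# 'latin-1' is assumed here as a portable single-byte stand-in.
geodata = pd.read_csv('geoplaces2.csv', encoding='latin-1')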

In [10]:
frame.head()

Unnamed: 0,userID,placeID,rating,food_rating,service_rating
0,U1077,135085,2,2,2
1,U1077,135038,2,2,1
2,U1077,132825,2,2,2
3,U1077,135060,1,2,2
4,U1068,135104,1,1,2


Each place in the dataset gets a rating of zero, one, or two, where two is the best rating and zero is the worst.

Looking at the head here, you can see that user IDs are repeated.

That happens when a user has reviewed more than one place.
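
A quick way to confirm that repeated user IDs simply mean several reviews per user is to count them. The sketch below rebuilds the five rows shown in the head above as a small stand-in data frame (the name toy is hypothetical) and checks that no user reviewed the same place twice.

```python
import pandas as pd

# Stand-in for `frame`: the five rows shown by frame.head() above
toy = pd.DataFrame({
    'userID':  ['U1077', 'U1077', 'U1077', 'U1077', 'U1068'],
    'placeID': [135085, 135038, 132825, 135060, 135104],
    'rating':  [2, 2, 2, 1, 1],
})

# Number of reviews written by each user
reviews_per_user = toy['userID'].value_counts()

# True if any (userID, placeID) pair appears more than once
has_duplicate_pairs = toy.duplicated(subset=['userID', 'placeID']).any()

print(reviews_per_user)
print(has_duplicate_pairs)
```

On the full dataset the same two calls could be run on frame directly.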

In [11]:
geodata.head()

# The reason that we want this dataset is that it provides a name for each of the unique places that's been reviewed

Unnamed: 0,placeID,latitude,longitude,the_geom_meter,name,address,city,state,country,fax,...,alcohol,smoking_area,dress_code,accessibility,price,url,Rambience,franchise,area,other_services
0,134999,18.915421,-99.184871,0101000020957F000088568DE356715AC138C0A525FC46...,Kiku Cuernavaca,Revolucion,Cuernavaca,Morelos,Mexico,?,...,No_Alcohol_Served,none,informal,no_accessibility,medium,kikucuernavaca.com.mx,familiar,f,closed,none
1,132825,22.147392,-100.983092,0101000020957F00001AD016568C4858C1243261274BA5...,puesto de tacos,esquina santos degollado y leon guzman,s.l.p.,s.l.p.,mexico,?,...,No_Alcohol_Served,none,informal,completely,low,?,familiar,f,open,none
2,135106,22.149709,-100.976093,0101000020957F0000649D6F21634858C119AE9BF528A3...,El Rincón de San Francisco,Universidad 169,San Luis Potosi,San Luis Potosi,Mexico,?,...,Wine-Beer,only at bar,informal,partially,medium,?,familiar,f,open,none
3,132667,23.752697,-99.163359,0101000020957F00005D67BCDDED8157C1222A2DC8D84D...,little pizza Emilio Portes Gil,calle emilio portes gil,victoria,tamaulipas,?,?,...,No_Alcohol_Served,none,informal,completely,low,?,familiar,t,closed,none
4,132613,23.752903,-99.165076,0101000020957F00008EBA2D06DC8157C194E03B7B504E...,carnitas_mata,lic. Emilio portes gil,victoria,Tamaulipas,Mexico,?,...,No_Alcohol_Served,permitted,informal,completely,medium,?,familiar,t,closed,none


The geodata dataset provides a name for each of the unique places that have been reviewed, but since we don't need all of the attributes in this data frame, let's subset it down to only place ID and name.

We'll call this subset places, and we'll select just the two columns 'placeID' and 'name'.

Now let's look at the head of this.

In [12]:
places =  geodata[['placeID', 'name']]
places.head()

Unnamed: 0,placeID,name
0,134999,Kiku Cuernavaca
1,132825,puesto de tacos
2,135106,El Rincón de San Francisco
3,132667,little pizza Emilio Portes Gil
4,132613,carnitas_mata


Okay, so now we have each of our place IDs and the name of the restaurant that goes with that place ID.

In [13]:
cuisine.head()

Unnamed: 0,placeID,Rcuisine
0,135110,Spanish
1,135109,Italian
2,135107,Latin_American
3,135106,Mexican
4,135105,Fast_Food


## Grouping and Ranking Data

Now let's look at the ratings these places are getting. 

To do that, we will look at the mean value of all the ratings that are given to each place. 

So we'll group by place ID, select the rating column, and take the mean of the ratings that were given to each place.

And let's print out the head of this to see what it looks like. 



In [14]:
rating = pd.DataFrame(frame.groupby('placeID')['rating'].mean())
rating.head()

placeID,rating
132560,0.5
132561,0.75
132564,1.25
132572,1.0
132583,1.0


Great, so we've got each of our places and then the average rating that each of the places was given. 

In addition to the mean value we also want to look at how popular each of these places was. 

So to do this, let's add a column called rating count, and then in that column we'll generate counts for how many reviews each place got. 

We group by place ID again and select the rating column.

This time, though, we take a count of how many ratings were given.

In [15]:
# groupby(...).count() already returns a Series indexed by placeID, so it can be assigned directly
rating['rating_count'] = frame.groupby('placeID')['rating'].count()
rating.head()

placeID,rating,rating_count
132560,0.5,4
132561,0.75,4
132564,1.25,4
132572,1.0,15
132583,1.0,4


What we've got here is each of the place IDs with its average rating and its rating count, the number of ratings that each place got.
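
The mean and the count can also be computed in a single groupby pass. The sketch below uses pandas named aggregation on a small made-up data frame (toy and its values are hypothetical) as an alternative to the two separate steps, not as the method this notebook uses.

```python
import pandas as pd

# Hypothetical toy ratings; the notebook itself works on `frame`
toy = pd.DataFrame({
    'placeID': [1, 1, 1, 2, 2, 3],
    'rating':  [2, 1, 0, 2, 2, 1],
})

# Named aggregation: each keyword becomes an output column,
# defined as (source column, aggregation function)
toy_rating = toy.groupby('placeID').agg(
    rating=('rating', 'mean'),
    rating_count=('rating', 'count'),
)

print(toy_rating)
```

The result has the same layout as the rating data frame: one row per place ID, with a rating column and a rating_count column.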

In [16]:
# Now let's look at a statistical description of this rating data frame. 

rating.describe()

Unnamed: 0,rating,rating_count
count,130.0,130.0
mean,1.179622,8.930769
std,0.349354,6.124279
min,0.25,3.0
25%,1.0,5.0
50%,1.181818,7.0
75%,1.4,11.0
max,2.0,36.0


For the count of the rating data frame we get 130, which indicates that there are 130 unique places that have been reviewed. I also want to point out that the max value for rating count comes out to 36.

What this means is that the most popular place in the dataset has a total of 36 reviews.

To see which place that is, all we have to do is sort our dataset by rating count in descending order.

In [18]:
rating.sort_values('rating_count', ascending=False).head()

placeID,rating,rating_count
135085,1.333333,36
132825,1.28125,32
135032,1.178571,28
135052,1.28,25
132834,1.0,25


We see that our most popular place has got a place ID of 135085. 

That's kind of an obscure way to refer to a restaurant. 

So let's find the name of this place. 

In order to do that, I'm going to create a filter. The filter is true wherever the place ID is equal to 135085, and we use it to return only the matching record from our places data frame.

In [19]:
places[places['placeID']==135085]

Unnamed: 0,placeID,name
121,135085,Tortas Locas Hipocampo


We've got the name of the place. It's called Tortas Locas Hipocampo.
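
The filter used for this lookup works in two steps, which the sketch below spells out on a three-row stand-in for places (the name toy_places is hypothetical, and the rows are copied from the head shown earlier).

```python
import pandas as pd

# Stand-in for `places`: the first three rows shown by places.head()
toy_places = pd.DataFrame({
    'placeID': [134999, 132825, 135106],
    'name': ['Kiku Cuernavaca', 'puesto de tacos', 'El Rincón de San Francisco'],
})

# Step 1: the comparison produces one True/False value per row
mask = toy_places['placeID'] == 132825

# Step 2: indexing with that mask keeps only the rows marked True
match = toy_places[mask]

print(mask.tolist())
print(match['name'].iloc[0])
```

Writing places[places['placeID']==135085] simply does both steps in one expression.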

Let's also look at the type of cuisine this place serves. 

We'll use the same filtering process, this time on our cuisine data frame.

In [20]:
cuisine[cuisine['placeID']==135085]

Unnamed: 0,placeID,Rcuisine
44,135085,Fast_Food


We can see here that Tortas Locas Hipocampo serves fast food.

## Preparing Data For Analysis

The next thing we need to do is to build a user by item utility matrix. 

To do that, we're going to call the pivot_table function.

This function will cross tabulate each user against each place, and output a matrix.

Our data is going to be the frame data frame.

The values we're interested in are the values from the rating column, our index is going to be the user ID, and our columns are going to be the place IDs.

Now let's look at the first five records of places_crosstab.

In [21]:
places_crosstab = pd.pivot_table(data=frame, values='rating', index='userID', columns='placeID')
places_crosstab.head()

placeID,132560,132561,132564,132572,132583,132584,132594,132608,132609,132613,...,135080,135081,135082,135085,135086,135088,135104,135106,135108,135109
userID
U1001,,,,,,,,,,,...,,,,0.0,,,,,,
U1002,,,,,,,,,,,...,,,,1.0,,,,1.0,,
U1003,,,,,,,,,,,...,2.0,,,,,,,,,
U1004,,,,,,,,,,,...,,,,,,,,2.0,,
U1005,,,,,,,,,,,...,,,,,,,,,,


Now the first thing you'll notice about this cross tab is that it's full of null values. 

That's because nobody reviews more than a small fraction of the places. Each user has rated only a few of them, which is why this matrix is so sparse.
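
To put a number on that sparsity, you can compare the count of filled cells with the total number of cells. The sketch below does this on a made-up four-review dataset (toy and toy_crosstab are hypothetical names); the same three lines at the end could be applied to places_crosstab.

```python
import pandas as pd

# Hypothetical toy reviews: three users, three places, four ratings
toy = pd.DataFrame({
    'userID':  ['U1', 'U1', 'U2', 'U3'],
    'placeID': [10, 20, 20, 30],
    'rating':  [2, 1, 0, 2],
})

toy_crosstab = pd.pivot_table(data=toy, values='rating', index='userID', columns='placeID')

filled = toy_crosstab.notna().sum().sum()   # cells that hold a rating
total = toy_crosstab.size                   # users x places
sparsity = 1 - filled / total               # share of empty cells

print(f'{filled} of {total} cells filled, sparsity = {sparsity:.2f}')
```

Here 4 of the 9 cells are filled, so a bit more than half of this tiny matrix is empty.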

You do see some numbers in the cross tab. These are the ratings that each user gave to the places they did review. You might be thinking that this matrix can't be very useful because it's got so many null values, but let me show you how we can use it to find places that are correlated.

Before we do that, we need to first isolate the user ratings from our restaurant called Tortas Locas Hipocampo. 

In [23]:
# Create a series from places_crosstab by selecting
# the column that's labeled with the place ID 135085.
Tortas_ratings = places_crosstab[135085]

# Let's also filter Tortas_ratings so that we see only the non-null values.
# As you recall, Tortas is the most popular place with 36 ratings, so let's take a look at what those ratings are.
Tortas_ratings[Tortas_ratings>=0]

userID
U1001    0.0
U1002    1.0
U1007    1.0
U1013    1.0
U1016    2.0
U1027    1.0
U1029    1.0
U1032    1.0
U1033    2.0
U1036    2.0
U1045    2.0
U1046    1.0
U1049    0.0
U1056    2.0
U1059    2.0
U1062    0.0
U1077    2.0
U1081    1.0
U1084    2.0
U1086    2.0
U1089    1.0
U1090    2.0
U1092    0.0
U1098    1.0
U1104    2.0
U1106    2.0
U1108    1.0
U1109    2.0
U1113    1.0
U1116    2.0
U1120    0.0
U1122    2.0
U1132    2.0
U1134    2.0
U1135    0.0
U1137    2.0
Name: 135085, dtype: float64

And when we run that, we've got 36 review scores, and they range between zero and two. Perfect.
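
The comparison >= 0 removes the null values because any comparison with NaN evaluates to False. A more explicit way to say the same thing is dropna. The sketch below, on a made-up series (toy_ratings is a hypothetical name), shows that both give the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for one place; NaN means the user did not review it
toy_ratings = pd.Series([0.0, np.nan, 2.0, np.nan, 1.0],
                        index=['U1', 'U2', 'U3', 'U4', 'U5'])

# NaN >= 0 is False, so the comparison silently drops the missing values
by_comparison = toy_ratings[toy_ratings >= 0]

# dropna() states the intent directly
by_dropna = toy_ratings.dropna()

print(by_dropna)
print(by_comparison.equals(by_dropna))
```

The >= 0 version only works here because every valid rating is non-negative; dropna does not depend on that.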

## Evaluating Similarity Based on Correlation

Now, to find the correlation between each of the places and the Tortas restaurant, what we'll do is call the corrwith method off of our places cross tab, and then pass it the Tortas ratings series.

What this will do is generate a Pearson R correlation coefficient between Tortas and each other place that's been reviewed in the dataset.

Keep in mind that this correlation is based on similarities in the user reviews that were given to each place.
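
To see what corrwith computes for a single pair of places, the sketch below works the Pearson coefficient out by hand on a made-up matrix (toy_crosstab and the place labels A, B, C are hypothetical). The point to notice is that only users who rated both places enter the calculation.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-place matrix; NaN means "not reviewed"
toy_crosstab = pd.DataFrame(
    {
        'A': [2.0, 1.0, 0.0, 2.0, np.nan],
        'B': [2.0, 1.0, np.nan, 1.0, 0.0],
        'C': [np.nan, 0.0, 2.0, np.nan, 1.0],
    },
    index=['U1', 'U2', 'U3', 'U4', 'U5'],
)

# Correlate every place with place A, as the notebook does with Tortas
target = toy_crosstab['A']
toy_corr = toy_crosstab.corrwith(target)

# Manual Pearson r for place B, using only users who rated both A and B
both = toy_crosstab[['A', 'B']].dropna()
x = both['A'] - both['A'].mean()
y = both['B'] - both['B'].mean()
manual_r = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(toy_corr)
print(manual_r)
```

For this toy data the hand calculation and corrwith agree on 0.5 for place B, which shares three reviewers with A. Place C shares only two reviewers with A, and its coefficient comes out at exactly -1.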

In [25]:
similar_to_Tortas = places_crosstab.corrwith(Tortas_ratings)

# similar_to_Tortas is returned as a pandas Series, and we want to convert it to a data frame
corr_Tortas = pd.DataFrame(similar_to_Tortas, columns=['PearsonR'])

# And we don't want to see all the null values so let's drop those
corr_Tortas.dropna(inplace=True)
corr_Tortas.head()



placeID,PearsonR
132572,-0.428571
132723,0.301511
132754,0.930261
132825,0.700745
132834,0.814823


Looking at the head of corr_Tortas, we see that we have a data frame that contains each place ID and a Pearson R correlation coefficient that indicates how well each place correlates with Tortas based on user rating. 

But let's think about this for a minute. If we found some places that were really well correlated with Tortas but that had only, say, two ratings in total, that correlation would rest on too little evidence to say those places are really similar to Tortas.

I mean, maybe those places got ratings similar to Tortas's, but with so few reviews the match could easily be a coincidence.

Therefore, that correlation really wouldn't be meaningful.

So in addition to how well the review scores correlate, we also need to take stock of how popular each of these places is.

To do that, let's join our corr_Tortas data frame with the rating data frame.

In [26]:
Tortas_corr_summary = corr_Tortas.join(rating['rating_count'])

Let's create a filter now so that we can see only the places from the data frame that have at least 10 user reviews, and for those places, let's look at the Pearson R correlation coefficient sorted in descending order. 

In [19]:
Tortas_corr_summary[Tortas_corr_summary['rating_count']>=10].sort_values('PearsonR', ascending=False).head(10)

placeID,PearsonR,rating_count
135076,1.0,13
135085,1.0,36
135066,1.0,12
132754,0.930261,13
135045,0.912871,13
135062,0.898933,21
135028,0.892218,15
135042,0.881409,20
135046,0.867722,11
132872,0.840168,12


Since we sorted the data frame in descending order by correlation, we now have a list of well-reviewed places that are most similar to Tortas.

I want to point out the places here that have a Pearson R value of one, though.

These Pearson R values of one aren't meaningful here. One of them, place ID 135085, is Tortas itself, which trivially correlates perfectly with its own ratings. For the other two, a value of exactly one most likely means that only a couple of users reviewed both places and their scores happened to line up. (With a single shared reviewer the coefficient can't be computed at all, so those places were already dropped along with the other null values.)

But a correlation that's based on so few shared ratings isn't meaningful.

The places need to have more reviewers in common, so we'll throw those two places out.
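
Rather than spotting suspicious values of one by eye, you can count how many reviewers each place shares with the target and filter on that. The sketch below is one possible way to do it on made-up data; the column names, the variable min_shared, and its threshold of 3 are all hypothetical choices, not something taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-place matrix; NaN means "not reviewed"
toy_crosstab = pd.DataFrame(
    {
        'target':      [2.0, 1.0, 0.0, 2.0],
        'one_shared':  [2.0, np.nan, np.nan, np.nan],
        'two_shared':  [np.nan, 1.0, 0.0, np.nan],
        'many_shared': [2.0, 1.0, 1.0, np.nan],
    },
    index=['U1', 'U2', 'U3', 'U4'],
)
target = toy_crosstab['target']

# NumPy may emit a RuntimeWarning for 'one_shared': a single shared
# reviewer gives no variance to work with, so the result is NaN
toy_corr = toy_crosstab.corrwith(target)

# Count the users who rated both the target and each place
rated = toy_crosstab.notna()
shared = rated.loc[rated['target']].sum()

summary = pd.DataFrame({'PearsonR': toy_corr, 'shared_reviewers': shared})

# Hypothetical threshold: keep places with at least 3 shared reviewers
min_shared = 3
reliable = summary[summary['shared_reviewers'] >= min_shared]

print(summary)
print(reliable)
```

In this toy data the place with one shared reviewer gets a null coefficient, the place with two shared reviewers gets exactly one, and only the target itself and the place with three shared reviewers survive the filter.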

In [30]:
# Now let's take the top correlated results that remain and see if any of these places also serve fast food.
# We build a data frame from seven place IDs: Tortas itself (135085) plus the six top correlated places.

places_corr_Tortas = pd.DataFrame([135085, 132754, 135045, 135062, 135028, 135042, 135046], index = np.arange(7), columns=['placeID'])

# We then create a summary table by merging places_corr_Tortas with cuisine.
# The goal is a summary of each top correlated place ID and the type of food it serves.
summary = pd.merge(places_corr_Tortas, cuisine,on='placeID')
summary

Unnamed: 0,placeID,Rcuisine
0,135085,Fast_Food
1,132754,Mexican
2,135028,Mexican
3,135042,Chinese
4,135046,Fast_Food


We only get five results, even though we included seven place IDs in this data frame.

The reason you're only seeing five places here is that not all of the places are listed in the cuisine dataset.

Because the merge keeps only the place IDs that appear in both tables, places that aren't in the cuisine dataset are dropped from the merged output table.
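
If you would rather keep the unmatched place IDs visible, a left merge is one option. The sketch below uses three of the place IDs from above and a two-row stand-in for the cuisine table (top_ids and toy_cuisine are hypothetical names) to contrast the default inner merge with how='left'.

```python
import pandas as pd

# Three of the place IDs used above
top_ids = pd.DataFrame({'placeID': [135085, 132754, 135045]})

# Stand-in for `cuisine`, deliberately missing place 135045
toy_cuisine = pd.DataFrame({
    'placeID': [135085, 132754],
    'Rcuisine': ['Fast_Food', 'Mexican'],
})

# Default merge is an inner join: unmatched place IDs disappear
inner = pd.merge(top_ids, toy_cuisine, on='placeID')

# A left join keeps every place ID; indicator=True adds a _merge column
# saying whether each row was found in both tables or only the left one
left = pd.merge(top_ids, toy_cuisine, on='placeID', how='left', indicator=True)

print(inner)
print(left)
```

In the left merge, place 135045 stays in the table with a null cuisine and a _merge value of left_only, so the gap in the cuisine data is easy to see.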

Nonetheless, what we are seeing here is that among the top six places that were most correlated with Tortas, at least one also serves fast food.

Let's get a name for this place so we don't have to refer to it as a number. 

In [21]:
places[places['placeID']==135046]

Unnamed: 0,placeID,name
42,135046,Restaurante El Reyecito


We see that this place is called Restaurante El Reyecito.

To evaluate how relevant the similarity metric really is, though, let's consider the entire set of possibilities, meaning how many cuisine types are served at places in this dataset.

To do that, we'll use the describe method.

In [22]:
cuisine['Rcuisine'].describe()

count         916
unique         59
top       Mexican
freq          239
Name: Rcuisine, dtype: object

We can see that, according to our cuisine data frame, there are 59 unique types of cuisine served.

So in the final analysis, what we got back were six top places that were similar to Tortas based on correlation and popularity.

Of these six places, one other place also serves fast food.

Considering that there are 59 cuisine types that could have been offered, and that we got back another fast food place in our top six most similar places, it looks like our correlation-based recommendation system is on track.

In this case, we'd be safe recommending Restaurante El Reyecito to users who also like Tortas Locas Hipocampo.
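
One rough way to put the comparison against the cuisine types on firmer ground is to ask how likely it would be to see a matching cuisine purely by chance. The sketch below is only an illustration on made-up labels: toy_cuisine and its counts are hypothetical, and the formula assumes each of the k places is drawn independently according to the cuisine frequencies, which is a simplification.

```python
import pandas as pd

# Hypothetical cuisine labels; a real check would use cuisine['Rcuisine']
toy_cuisine = pd.Series(
    ['Mexican'] * 5 + ['Fast_Food'] * 2 + ['Italian'] * 2 + ['Chinese']
)

# Share of rows carrying each cuisine label
share = toy_cuisine.value_counts(normalize=True)

p = share['Fast_Food']   # chance that one random pick is fast food
k = 6                    # number of recommended places considered

# Chance that at least one of k independent random picks is fast food
p_at_least_one = 1 - (1 - p) ** k

print(share)
print(round(p_at_least_one, 3))
```

The lower this chance baseline is on the real data, the more a fast food match among the top correlated places tells you about the recommender.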