
## Content-Based Recommender Systems
## Nearest Neighbors Algorithm

In [1]:
import numpy as np
import pandas as pd

import sklearn
from sklearn.neighbors import NearestNeighbors

*mtcars dataset source:* 
Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.

In [5]:
cars = pd.read_csv('../Data/mtcars.csv')
cars.columns = ['car_names', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb']
cars.head()

,car_names,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


Imagine that a customer walks in and tells you that he's looking for a car that weighs about 3,200 pounds (3.2 in the dataset's units of 1,000 lbs), gets at least 15 miles per gallon, and has an engine with a displacement of 300 cubic inches and 160 horsepower.

Let's make a test point to represent the shopper's specifications. We'll call it `t` and set it equal to a list whose values follow the column order mpg, disp, hp, wt: 15 for miles per gallon, 300 for cubic inches of displacement, 160 for horsepower, and 3.2 for weight (in thousands of pounds).

In [25]:
t = [15, 300, 160, 3.2]

X = cars[["mpg", "disp", "hp", "wt"]].values
X[0:5]

array([[ 21.   , 160.   , 110.   ,   2.62 ],
       [ 21.   , 160.   , 110.   ,   2.875],
       [ 22.8  , 108.   ,  93.   ,   2.32 ],
       [ 21.4  , 258.   , 110.   ,   3.215],
       [ 18.7  , 360.   , 175.   ,   3.44 ]])

Now let's see how the nearest neighbors algorithm can be used to recommend a car for this shopper based on his requirements. The first thing we need to do is define our dataset, `X`. The cell above builds it by selecting, by name, only the four columns that match the shopper's criteria, in the same order as the test point: `mpg` (miles per gallon), `disp` (displacement in cubic inches), `hp` (horsepower), and `wt` (weight in thousands of pounds).

Of these columns we just want the values, so `.values` converts the selection to a NumPy array. `X[0:5]` then returns only the first five records so we can check them.

In [26]:
# This tells the algorithm to search the dataset and find a single point p that is nearest to the test point t.
nbrs = NearestNeighbors(n_neighbors=1).fit(X)

The `kneighbors` method returns a tuple of two arrays. The first holds the distance from the test point t to the nearest point p, and the second holds the index of that point, which is the most similar instance in dataset X.

In [27]:
print(nbrs.kneighbors([t]))

(array([[10.77474942]]), array([[22]], dtype=int64))
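To make the returned distance and index more concrete, here is a minimal sketch that does the same search by brute force with NumPy. It assumes the default `NearestNeighbors` settings (Minkowski metric with `p=2`, which is ordinary Euclidean distance) and uses only the first five rows printed earlier, so the names `X_sample` and `names` are illustrative and not part of the original notebook.

```python
import numpy as np

# First five rows of the data (mpg, disp, hp, wt), copied from the output above.
# This is only a sample, not the full mtcars dataset.
names = ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
         "Hornet 4 Drive", "Hornet Sportabout"]
X_sample = np.array([
    [21.0, 160.0, 110.0, 2.62],
    [21.0, 160.0, 110.0, 2.875],
    [22.8, 108.0,  93.0, 2.32],
    [21.4, 258.0, 110.0, 3.215],
    [18.7, 360.0, 175.0, 3.44],
])
t = np.array([15, 300, 160, 3.2])

# Euclidean distance from every row to the test point
distances = np.linalg.norm(X_sample - t, axis=1)

# Index of the smallest distance, i.e. the nearest neighbor within the sample
nearest = int(np.argmin(distances))
print(nearest, names[nearest], distances[nearest])
```

Within this five-row sample the nearest car is the Hornet Sportabout. That differs from the full-data result only because the sample does not contain the row at index 22. Note also that the columns are used unscaled, so `disp` and `hp` contribute far more to the distance than `wt` does.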


In [28]:
cars

,car_names,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
5,Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
6,Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
7,Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
8,Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
9,Merc 280,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4


According to our nearest neighbors model, the closest match is the row at index 22, the AMC Javelin, which lies beyond the first ten rows shown above. You should recommend that the shopper take a closer look at this car, because of all the cars the dealer has on the lot, it is the most similar to the shopper's specifications.

## Popularity-Based Recommenders

Dataset: https://archive.ics.uci.edu/ml/datasets/Restaurant+%26+consumer+data

In [31]:
# Import libraries
import pandas as pd
import numpy as np

In [33]:
df = pd.read_csv('../Data/rating_final.csv')
cuisine = pd.read_csv('../Data/chefmozcuisine.csv')

In [34]:
df.head()

,userID,placeID,rating,food_rating,service_rating
0,U1077,135085,2,2,2
1,U1077,135038,2,2,1
2,U1077,132825,2,2,2
3,U1077,135060,1,2,2
4,U1068,135104,1,1,2


In [35]:
cuisine.head()

,placeID,Rcuisine
0,135110,Spanish
1,135109,Italian
2,135107,Latin_American
3,135106,Mexican
4,135105,Fast_Food


In this demonstration, users are restaurant reviewers and the items are restaurants (placeID). Each place gets a rating of 0, 1 or 2 where 2 is the best and 0 is the worst rating.

## Recommending based on counts

In [36]:
rating_count = pd.DataFrame(df.groupby('placeID')['rating'].count())

rating_count.sort_values('rating', ascending=False).head()

placeID,rating
135085,36
132825,32
135032,28
135052,25
132834,25


Now, let's take the five most frequently rated places and see whether the cuisines they serve have anything in common.

In [37]:
most_rated_places = pd.DataFrame([135085, 132825, 135032, 135052, 132834], index=np.arange(5), columns=['placeID'])

summary = pd.merge(most_rated_places, cuisine, on='placeID')
summary

,placeID,Rcuisine
0,135085,Fast_Food
1,132825,Mexican
2,135032,Cafeteria
3,135032,Contemporary
4,135052,Bar
5,135052,Bar_Pub_Brewery
6,132834,Mexican
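The cell above types the five place IDs in by hand. As a hedged alternative, the sketch below derives the IDs directly from the rating counts and then merges them with the cuisine table. The two small DataFrames are made-up stand-ins for `rating_final.csv` and `chefmozcuisine.csv`, and `top_n` is a hypothetical parameter, so treat this as an illustration and not as the notebook's own method.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the ratings and cuisine files
ratings = pd.DataFrame({
    "placeID": [1, 1, 1, 2, 2, 3],
    "rating":  [2, 1, 2, 0, 1, 2],
})
cuisines = pd.DataFrame({
    "placeID":  [1, 2, 2, 3],
    "Rcuisine": ["Mexican", "Bar", "Bar_Pub_Brewery", "Cafeteria"],
})

top_n = 2  # how many of the most-rated places to keep (assumed value)

# Count ratings per place, then keep the IDs with the highest counts
rating_count = ratings.groupby("placeID")["rating"].count()
top_ids = rating_count.sort_values(ascending=False).head(top_n).index.tolist()

# Merge the selected places with their cuisines; a place that serves
# several cuisines appears once per cuisine, as in the summary above
most_rated = pd.DataFrame({"placeID": top_ids})
summary = pd.merge(most_rated, cuisines, on="placeID")
print(summary)
```

Deriving the IDs this way keeps the summary in sync with the data if the ratings file changes.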


In [38]:
cuisine['Rcuisine'].describe()

count         916
unique         59
top       Mexican
freq          239
Name: Rcuisine, dtype: object

What you can see here is that 59 unique types of cuisine are represented in our data, and that the most frequently occurring type is Mexican, with 239 of the 916 entries. Now look back at our summary table: two of the five most-rated places (132825 and 132834) serve Mexican food. The recommender is suggesting that Mexican food is popular and that places that serve it are good candidates for recommending.

Our recommender is basically saying that places that serve the most popular types of cuisine are more likely to be appreciated by the average restaurant-goer in the city. Makes sense, right?
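`describe()` reports only the single most frequent cuisine. To see the full ranking, `value_counts()` is one option. The sketch below uses a short made-up Series in place of `cuisine['Rcuisine']`, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for cuisine['Rcuisine']
rcuisine = pd.Series(
    ["Mexican", "Mexican", "Bar", "Italian", "Mexican", "Bar"],
    name="Rcuisine",
)

# describe() gives count, unique, top (most frequent value), and freq
stats = rcuisine.describe()

# value_counts() lists every cuisine, sorted from most to least frequent
counts = rcuisine.value_counts()

print(stats)
print(counts)
```

On the real data, the first entry of `counts` would correspond to the `top` and `freq` values shown in the `describe()` output.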