# Activities 19
## Data Science, Deep Learning, & Machine Learning with Python
### Arash Nouri
### Our choice of 10 for K was arbitrary - what effect do different K values have on the results? Our distance metric was also somewhat arbitrary - we just took the cosine distance between the genres and added it to the difference between the normalized popularity scores. Can you improve on that?

Load libraries

In [1]:
import pandas as pd
import numpy as np
from scipy import spatial
import operator

Load data

In [2]:
ratings = pd.read_csv('u.data', sep='\t', names=['user_id', 'movie_id', 'rating'], usecols=range(3))
ratings.head()

Unnamed: 0,user_id,movie_id,rating
0,0,50,5
1,0,172,5
2,0,133,1
3,196,242,3
4,186,302,3


Now, we'll group everything by movie ID, and compute the total number of ratings (each movie's popularity) and the average rating for every movie:

In [3]:
movieProperties = ratings.groupby('movie_id').agg({'rating': ['size', 'mean']})
movieProperties.head()

Unnamed: 0_level_0,rating,rating
Unnamed: 0_level_1,size,mean
movie_id,Unnamed: 1_level_2,Unnamed: 2_level_2
1,452,3.878319
2,131,3.206107
3,90,3.033333
4,209,3.550239
5,86,3.302326
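
The aggregation produces two-level column labels, which is why later cells select `movieProperties['rating']['size']`. A minimal sketch on a made-up ratings table (the five rows below are assumptions for illustration, not taken from u.data) shows the shape of the result:

```python
import pandas as pd

# Hypothetical ratings: movie 1 has three ratings, movie 2 has two.
toy_ratings = pd.DataFrame({
    "movie_id": [1, 1, 2, 1, 2],
    "rating":   [5, 4, 3, 3, 1],
})

# Same aggregation as above: the count and the mean of 'rating' per movie.
toy_properties = toy_ratings.groupby("movie_id").agg({"rating": ["size", "mean"]})
print(toy_properties)

# The columns form a two-level index, so a single column needs both labels.
rating_counts = toy_properties["rating"]["size"]
print(rating_counts)
```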


The raw number of ratings isn't very useful for computing distances between movies, so we'll create a new DataFrame that contains the min-max normalized number of ratings. A value of 0 means the movie has the fewest ratings in the dataset, and a value of 1 means it's the most popular movie there is.

In [4]:
movieNumRatings = pd.DataFrame(movieProperties['rating']['size'])
movieNormalizedNumRatings = movieNumRatings.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
movieNormalizedNumRatings.head()

Unnamed: 0_level_0,size
movie_id,Unnamed: 1_level_1
1,0.773585
2,0.222985
3,0.152659
4,0.356775
5,0.145798
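
To make the scaling concrete, here is a small worked sketch of the same min-max formula on made-up counts (the values and movie IDs below are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical rating counts for four movies, indexed by movie ID.
counts = pd.Series([1, 86, 131, 584], index=[10, 5, 2, 50], name="size")

# Min-max scaling: the smallest count maps to 0.0 and the largest to 1.0.
normalized = (counts - counts.min()) / (counts.max() - counts.min())
print(normalized)
```

Note that 0 marks the least-rated movie in the table, whatever its raw count happens to be.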


Now, let's get the genre information from the u.item file. Each line ends with 19 fields, one per genre: a value of '0' means the movie is not in that genre, and '1' means it is. A movie may have more than one genre associated with it.

In [5]:
movieDict = {}
with open(r'u.item', encoding='ISO-8859-1') as f:
    for line in f:
        fields = line.rstrip('\n').split('|')
        movieID = int(fields[0])
        name = fields[1]
        # The genre flags are the last 19 fields of each line (indices 5 through 23).
        genres = [int(flag) for flag in fields[5:24]]
        # Each entry: (name, genre flags, normalized popularity, mean rating).
        movieDict[movieID] = (name, genres, movieNormalizedNumRatings.loc[movieID].get('size'),
                              movieProperties.loc[movieID].rating.get('mean'))
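
As a quick check of the parsing step, the sketch below splits one pipe-delimited line in the u.item layout. The line itself is made up (the title, URL, and genre flags are placeholders, not a real record):

```python
# Hypothetical line: ID, title, release date, empty video release date, URL, then 19 genre flags.
genre_flags = ["0", "0", "0", "1", "1", "1"] + ["0"] * 13
line = "1|Example Movie (1995)|01-Jan-1995||http://example.com/title|" + "|".join(genre_flags) + "\n"

fields = line.rstrip("\n").split("|")
movie_id = int(fields[0])
name = fields[1]
# Five leading fields plus 19 genre flags gives 24 fields in total.
genres = [int(flag) for flag in fields[5:24]]

print(movie_id, name, genres)
```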

Now let's define a function that computes a "distance" between two movies based on how similar their genres are and how similar their popularity is. The third argument names the SciPy metric used to compare the genre vectors.

In [6]:
# Genre metrics from scipy.spatial.distance that ComputeDistance accepts.
GENRE_METRICS = ("cosine", "braycurtis", "canberra", "chebyshev", "cityblock",
                 "correlation", "euclidean", "minkowski", "sqeuclidean")

def ComputeDistance(a, b, c):
    # a and b are movieDict entries: (name, genres, normalized popularity, mean rating).
    # c is the name of the scipy.spatial.distance function applied to the genre vectors.
    if c not in GENRE_METRICS:
        raise ValueError("Unsupported genre metric: " + str(c))
    genreDistance = getattr(spatial.distance, c)(a[1], b[1])
    popularityDistance = abs(a[2] - b[2])
    return genreDistance + popularityDistance
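
One thing to watch when swapping metrics is scale: the popularity term always lies between 0 and 1, but some genre metrics do not. The sketch below compares cosine and city-block distance on two made-up genre vectors, then shows one possible way to weight the two terms. The vectors, the popularity values, and the `weighted_distance` helper with its `genre_weight` parameter are assumptions for illustration, not part of the original notebook.

```python
from scipy import spatial

# Hypothetical genre flags and normalized popularity for two movies.
genres_a, popularity_a = [0, 1, 1, 0, 1], 0.77
genres_b, popularity_b = [0, 1, 0, 1, 0], 0.22

popularity_distance = abs(popularity_a - popularity_b)

# Cosine distance stays between 0 and 1 for non-negative vectors.
cosine = spatial.distance.cosine(genres_a, genres_b)
# City-block distance counts the differing genre flags, so it can exceed 1.
cityblock = spatial.distance.cityblock(genres_a, genres_b)

print(cosine + popularity_distance)
print(cityblock + popularity_distance)

def weighted_distance(genre_distance, popularity_distance, genre_weight=0.5):
    """Blend the two terms; genre_weight is a hypothetical tuning knob in [0, 1]."""
    return genre_weight * genre_distance + (1 - genre_weight) * popularity_distance

print(weighted_distance(cosine, popularity_distance, genre_weight=0.8))
```

With an unbounded metric such as city-block, the genre term can dominate the sum, so rescaling or weighting the terms is one direction to explore when trying to improve on the original metric.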

Next, we define a function that finds the `K` nearest neighbors of a movie using the distance function.

In [7]:
import operator

def getNeighbors(movieID, K, c):
    distances = []
    for movie in movieDict:
        if (movie != movieID):
            dist = ComputeDistance(movieDict[movieID], movieDict[movie], c)
            distances.append((movie, dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(K):
        neighbors.append(distances[x][0])
    return neighbors
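
Sorting the full list works, but only the `K` smallest distances are needed. As an alternative sketch (not the notebook's method), the standard library's `heapq.nsmallest` selects them directly. The `k_nearest` helper and the distance pairs below are hypothetical:

```python
import heapq
import operator

# Hypothetical (movie_id, distance) pairs, standing in for the list built in getNeighbors.
toy_distances = [(2, 0.61), (3, 0.12), (4, 0.48), (5, 0.05), (6, 0.33)]

def k_nearest(pairs, k):
    """Return the IDs of the k pairs with the smallest distance, closest first."""
    closest = heapq.nsmallest(k, pairs, key=operator.itemgetter(1))
    return [movie for movie, _ in closest]

print(k_nearest(toy_distances, 3))
```

Unlike indexing into the sorted list, `heapq.nsmallest` simply returns every pair when `k` exceeds the number of candidates.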

Now we find the `K` nearest neighbors of one movie (Toy Story, movie ID 1, in this example) using different distance metrics and different values of `K`:

In [8]:
distances = ["cosine", "braycurtis", "canberra", "chebyshev", "cityblock", "correlation", "euclidean", "minkowski",
             "sqeuclidean"]
names = []
score = []
k_val = []
dist = []
for metric in distances:
    for K in range(1, 11):
        # Record every neighbor of Toy Story (movie ID 1) for this metric and K.
        for neighbor in getNeighbors(1, K, metric):
            names.append(movieDict[neighbor][0])    # movie title
            score.append(movieDict[neighbor][3])    # mean rating
            k_val.append(K)
            dist.append(metric)

The results are collected into a single DataFrame:

In [12]:
names = pd.DataFrame(names)
score = pd.DataFrame(score)
k_val = pd.DataFrame(k_val)
dist = pd.DataFrame(dist)


result = pd.concat([names, score, k_val, dist], axis=1)
result.columns = ['names','score','k_val','dist']

result.head()

Unnamed: 0,names,score,k_val,dist
0,Liar Liar (1997),3.156701,1,cosine
1,Liar Liar (1997),3.156701,2,cosine
2,Aladdin (1992),3.812785,2,cosine
3,Liar Liar (1997),3.156701,3,cosine
4,Aladdin (1992),3.812785,3,cosine


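Finally, we count how often each movie appears across all distance metrics and values of `K`. Because the neighbors for a given `K` are also neighbors for every larger `K`, the closest movies are counted several times per metric. The movies that appear most often are: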

In [14]:
result.groupby('names').count().sort_values("score" ,ascending = False)

Unnamed: 0_level_0,score,k_val,dist
names,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Aladdin (1992),73,73,73
Aladdin and the King of Thieves (1996),68,68,68
George of the Jungle (1997),55,55,55
Liar Liar (1997),48,48,48
Beavis and Butt-head Do America (1996),47,47,47
Home Alone (1990),31,31,31
Willy Wonka and the Chocolate Factory (1971),25,25,25
Jungle2Jungle (1997),23,23,23
"Wrong Trousers, The (1993)",18,18,18
Monty Python and the Holy Grail (1974),16,16,16
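
Since smaller values of `K` return a subset of the neighbors found for larger values, the raw counts favor the closest movies. One possible complementary view, sketched below on made-up rows in the same layout as `result` (the titles, scores, and the `toy_result` name are placeholders), is to keep only the largest `K` and count in how many distinct metrics each movie appears:

```python
import pandas as pd

# Hypothetical rows: two metrics, K in {1, 2}, movies "A", "B", "C".
toy_result = pd.DataFrame({
    "names": ["A", "A", "B", "C", "C", "A"],
    "score": [3.1, 3.1, 3.8, 3.5, 3.5, 3.1],
    "k_val": [1, 2, 2, 1, 2, 2],
    "dist":  ["cosine", "cosine", "cosine", "euclidean", "euclidean", "euclidean"],
})

# Keep only the rows for the largest K so each (movie, metric) pair appears once.
largest_k = toy_result[toy_result["k_val"] == toy_result["k_val"].max()]

# Count the number of distinct metrics under which each movie is a neighbor.
metrics_per_movie = largest_k.groupby("names")["dist"].nunique().sort_values(ascending=False)
print(metrics_per_movie)
```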
