# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint

In [0]:
#@title Experiment Walkthrough
#@markdown Movie Recommendation using KNN
from IPython.display import HTML

HTML("""<video width="320" height="240" controls>
  <source src="https://cdn.talentsprint.com/talentsprint/archives/sc/aiml/aiml_labs_blr/movie_recommendation_system_knn.mp4" type="video/mp4">
</video>
""")

## Learning Objective

At the end of this experiment, you will be able to:

* Recommend movies to the users.

## Dataset

### Description

The dataset chosen for this experiment is a subset of the original MovieLens dataset.

Consider the problem of recommending movies to users. We have M users and N movies.
Now, we want to predict whether a given test user $x$ will watch movie $y$.

User $x$ has seen some movies in the past and not others. We will use $x$'s movie-watching history as the feature vector for our recommendation system.

We will use KNN to find the K nearest neighbour users (users with similar taste) to $x$, and make predictions based on their entries for movie $y$.

Each user has either seen a movie (1) or not seen it (0). We can represent this as a matrix of size M×N (M rows and N columns). In the code, this matrix is stored as a dictionary keyed by (userId, movieId) pairs.

Each element of the matrix is either zero or one. If the $(u, m)$ entry in this matrix is 1, then the $u^{th}$ user has seen movie $m$.
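
As a minimal sketch of this representation, a dictionary keyed by `(userId, movieId)` can stand in for the M×N matrix. The toy pairs and the names `watched_pairs`, `seen_toy` and `has_seen` are assumptions made for illustration; they are not part of the dataset or of the notebook code.

```python
# Toy (userId, movieId) pairs; hypothetical data, not taken from MovieLens.
watched_pairs = [(0, 0), (0, 2), (1, 2), (2, 1)]

# Sparse stand-in for the M x N binary matrix: only the 1-entries are stored.
seen_toy = {(u, m): 1 for u, m in watched_pairs}

def has_seen(u, m):
    """Return 1 if user u has seen movie m, else 0 (a missing key means not seen)."""
    return seen_toy.get((u, m), 0)

# Rebuild the dense 3 x 3 matrix to check that both views agree.
M, N = 3, 3
dense = [[has_seen(u, m) for m in range(N)] for u in range(M)]
for row in dense:
    print(row)
```

Storing only the 1-entries and defaulting to 0 on lookup is one possible design; it avoids creating an explicit entry for every (user, movie) pair.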

#### Training set
M×N binary matrix indicating seen/not-seen.
#### Test set
L test cases, each an $(x, y)$ pair. $x$ is an N-dimensional binary vector whose $y^{th}$ entry is missing; that entry is what we want to predict.
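
The following sketch shows how one such test case could be handled on a toy training matrix. The matrix `train`, the function `predict_seen` and the chosen `k` are hypothetical and only illustrate the idea of voting among the most similar users; this is not the notebook's own implementation.

```python
# Hypothetical 4-user x 4-movie training matrix (rows = users, columns = movies).
train = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
]

def predict_seen(x, y, k):
    """Predict whether the test user with history x has seen movie y.

    x is a length-N 0/1 list whose y-th entry is unknown and is ignored.
    """
    scored = []
    for u, row in enumerate(train):
        # Similarity = number of movies both have seen, skipping column y.
        overlap = sum(a * b for m, (a, b) in enumerate(zip(x, row)) if m != y)
        scored.append((overlap, u))
    # Highest overlap first; ties go to the larger user index.
    scored.sort(reverse=True)
    neighbours = [u for _, u in scored[:k]]
    # Majority vote among the k neighbours on movie y.
    votes = sum(train[u][y] for u in neighbours)
    return 2 * votes > k, neighbours

x = [1, 1, 0, 0]   # entry y = 3 is the one to predict; its value here is a placeholder
result, neighbours = predict_seen(x, y=3, k=3)
print(result, neighbours)
```

Here two of the three nearest users have seen movie 3, so the sketch predicts that the test user will have seen it too.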


### Data Source

* AIML_DS_MOVIE-TRAIN_SMALLSUBSETOFMOVIELENSDATASET.csv

* AIML_DS_MOVIE-TEST_SMALLSUBSETOFMOVIELENSDATASET.csv

These have been taken (and modified) from:
http://kevinmolloy.info/teaching/cs504_2017Fall/

This is a small subset of the original MovieLens dataset.
https://grouplens.org/datasets/movielens/



* The code for computing nearest neighbours with the cosine distance is provided.

## Keywords

* KNN
* Recommendation Systems
* Cosine distance

#### Setup Steps

In [0]:
#@title Please enter your registration id to start: (e.g. P181900101) { run: "auto", display-mode: "form" }
Id = "P181902118" #@param {type:"string"}


In [0]:
#@title Please enter your password (normally your phone number) to continue: { run: "auto", display-mode: "form" }
password = "8860303743" #@param {type:"string"}


In [0]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython

ipython = get_ipython()
  
notebook="BLR_M2W4E29_MovieRecommendationSystem_KNN" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")
    ipython.magic("sx wget https://cdn.talentsprint.com/aiml/Experiment_related_data/AIML_DS_MOVIE-TEST_SMALLSUBSETOFMOVIELENSDATASET.csv")
    ipython.magic("sx wget https://cdn.talentsprint.com/aiml/Experiment_related_data/AIML_DS_MOVIE-TRAIN_SMALLSUBSETOFMOVIELENSDATASET.csv")
    from IPython.display import HTML, display
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():
    
    ipython.magic("notebook -e "+ notebook + ".ipynb")
    
    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:        
        print(r["err"])
        return None        
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer() and getComplexity() and getAdditional() and getConcepts():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional, 
              "concepts" : Concepts, "record_id" : submission_id, 
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook}

      r = requests.post(url, data = data)
      r = json.loads(r.text)
      print("Your submission is successful.")
      print("Ref Id:", submission_id)
      print("Date of submission: ", r["date"])
      print("Time of submission: ", r["time"])
      print("View your submissions: https://iiith-aiml.talentsprint.com/notebook_submissions")
      print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
      return submission_id
    else: return submission_id
    

def getAdditional():
  try:
    if Additional: return Additional      
    else: raise NameError('')
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None
  
def getConcepts():
  try:
    return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None

def getAnswer():
  try:
    return Answer
  except NameError:
    print ("Please answer Question")
    return None

def getId():
  try: 
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup 
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
    from IPython.display import HTML
    HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id))
  
else:
  print ("Please complete Id and Password cells before running setup")



Setup completed successfully


In [0]:
# Importing required packages
import pandas as pd

In [0]:
## Setting up the files

Train_set = "AIML_DS_MOVIE-TRAIN_SMALLSUBSETOFMOVIELENSDATASET.csv"
Test_set = "AIML_DS_MOVIE-TEST_SMALLSUBSETOFMOVIELENSDATASET.csv"

In [0]:

## Loading the data from set up files
rated = pd.read_csv(Train_set, converters={"userId":int, "movieId":int})
rated.describe()

| | userId | movieId | rating |
|---|---|---|---|
| count | 80045.0 | 80045.0 | 80045.0 |
| mean | 345.401574 | 1654.71185 | 3.544594 |
| std | 195.180637 | 1887.186635 | 1.058349 |
| min | 0.0 | 0.0 | 0.5 |
| 25% | 179.0 | 327.0 | 3.0 |
| 50% | 363.0 | 870.0 | 4.0 |
| 75% | 518.0 | 2337.0 | 4.0 |
| max | 670.0 | 9065.0 | 5.0 |


In [0]:
rated.shape

(80045, 3)

In [0]:
rated

In [0]:
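# userId and movieId start at 0, so these are the largest IDs rather than true counts;
# range(userCount) and range(movieCount) below therefore stop one short of the last user and movie.
userCount = max(rated.userId)
movieCount = max(rated.movieId)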

In [0]:
userCount

670

In [0]:
movieCount

9065

In [0]:
seen = {}
for x in rated.values:
    seen[(int(x[0]), int(x[1]))] = 1

In [0]:
allUsersMovies = [(u,m) for u in range(userCount) for m in range(movieCount)]

In [0]:
for x in allUsersMovies:
    if x not in seen:
        seen[x] = 0

Now that the data is loaded into a dictionary, let us recast the distance function to use it. Given two users, $u_1$ and $u_2$, and a movie $mx$ to predict, we must ignore the entries for $mx$ while computing the distance.
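
To make the "ignore $mx$" step concrete, here is a small sketch on toy vectors. It shows both the raw overlap (a dot product of 0/1 vectors) and a normalised cosine similarity. The vectors and the function names `overlap` and `cosine_similarity` are assumptions for illustration only.

```python
import math

def overlap(a, b, mx):
    """Dot product of two 0/1 vectors, skipping index mx (count of shared movies)."""
    return sum(x * y for m, (x, y) in enumerate(zip(a, b)) if m != mx)

def cosine_similarity(a, b, mx):
    """Overlap divided by the two vector lengths, also computed without index mx."""
    norm_a = math.sqrt(overlap(a, a, mx))
    norm_b = math.sqrt(overlap(b, b, mx))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a user with no other movies is treated as similar to nobody
    return overlap(a, b, mx) / (norm_a * norm_b)

u1 = [1, 1, 0, 1, 0]
u2 = [1, 0, 0, 1, 1]
mx = 3  # movie under prediction, excluded from both measures

print(overlap(u1, u2, mx))
print(round(cosine_similarity(u1, u2, mx), 3))
```

Both users have seen movie 3, but because that column is excluded, only movie 0 counts as shared. For either measure a larger value means the users are more alike, which is why the neighbour list is sorted in descending order.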

In [0]:
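# Similarity score: the number of movies both users have seen, ignoring mx
# (the unnormalised dot product that cosine similarity is built on)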
def distance(u1, u2, mx):
    d = 0 - seen[(u1, mx)] * seen[(u2, mx)]
    for m in range(movieCount):
        d += seen[(u1, m)] * seen[(u2, m)]
    return d

def kNN(k, givenUser, givenMovie):
    distances = []
    for u in range(userCount):
        if u != givenUser:
            distances.append([distance(u, givenUser, givenMovie), u])
    distances.sort()
    distances.reverse() ## Higher score = more similar, so put the largest scores first
    return distances[:k] 

def prediction(k, givenUser, givenMovie):
    neighbours = kNN(k, givenUser, givenMovie)
    print(neighbours)
    howmanySaw = sum([seen[(u, givenMovie)] for d, u in neighbours])
    print("howmanySaw=", howmanySaw)
    return 2 * howmanySaw > k      ### predict 1 if more than half of the similar users have seen this movie, otherwise 0.
        

In [0]:
prediction(5,1,0)

[[39, 563], [39, 486], [38, 14], [37, 460], [37, 211]]
howmanySaw= 1


False

### Ungraded Exercise 1

Verify the above code and check that it works.

In [0]:
print(movieCount)
# Count how many users have seen each movie (list index = movieId)
maxMovieWatched = [0] * (movieCount + 1)
for (u, m), flag in seen.items():
    if flag != 0:
        maxMovieWatched[m] += 1

# Find the most-watched movie; ties keep the lowest movieId
maxValue = 0
maxIndex = 0
for i, each in enumerate(maxMovieWatched):
    if each > maxValue:
        maxValue = each
        maxIndex = i
print(maxValue, maxIndex)
    

9065
272 57


In [0]:
for x in range(14, 15):
  print(prediction(7,x,0))

[[539, 72], [532, 546], [476, 623], [417, 379], [411, 467], [396, 451], [365, 579]]
howmanySaw= 2
False


In [0]:
# Your Answer Here
prediction(10,1,57)

[[38, 563], [38, 486], [37, 14], [36, 460], [36, 211], [35, 310], [34, 513], [32, 653], [32, 495], [32, 18]]
howmanySaw= 10


True

### Ungraded Exercise 2 

Change the distance function to compute Euclidean, and see if the prediction changes. Remember to modify the kNN function to pick the smallest distances: do not reverse()!

In [0]:
## Your Code Here
# Euclidean distance between two users, ignoring movie mx
import math
def distance(u1, u2, mx):
    # Start by cancelling the mx term that the loop below will add back
    d = 0 - (seen[(u1, mx)] - seen[(u2, mx)])**2
    for m in range(movieCount):
        d += (seen[(u1, m)] - seen[(u2, m)])**2
    return math.sqrt(d)

def kNN(k, givenUser, givenMovie):
    distances = []
    for u in range(userCount):
        if u != givenUser:
            distances.append([distance(u, givenUser, givenMovie), u])
    distances.sort()
    # No reverse() here: a smaller Euclidean distance means a closer user
    return distances[:k] 

def prediction(k, givenUser, givenMovie):
    neighbours = kNN(k, givenUser, givenMovie)
    print(neighbours)
    howmanySaw = sum([seen[(u, givenMovie)] for d, u in neighbours])
    print("howmanySaw=", howmanySaw)
    return 2 * howmanySaw > k      ### predict 1 if more than half of the similar users have seen this movie, otherwise 0.
        

In [0]:
prediction(10,1,57)

[[7.14142842854285, 494], [7.280109889280518, 265], [7.483314773547883, 81], [7.483314773547883, 190], [7.483314773547883, 420], [7.483314773547883, 448], [7.54983443527075, 107], [7.54983443527075, 224], [7.54983443527075, 248], [7.54983443527075, 337]]
howmanySaw= 7


True

### Ungraded Exercise 3

Change the distance function to compute Manhattan, and see if the prediction changes. Remember to modify the kNN function to pick the smallest distances: do not reverse()!
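
One point worth checking before writing the solution: the Manhattan distance sums absolute differences, so `abs()` is essential. Without it, positive and negative differences cancel and the result only reflects how many movies each user has seen. The sketch below uses toy vectors and hypothetical helper names (`manhattan`, `euclidean`) to show that, for 0/1 vectors, the Manhattan distance equals the squared Euclidean distance.

```python
import math

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

u1 = [1, 1, 0, 1, 0]
u2 = [1, 0, 0, 1, 1]
u3 = [0, 0, 1, 0, 1]

for other in (u2, u3):
    # For 0/1 data, |x - y| == (x - y)**2, so the two columns match.
    print(manhattan(u1, other), round(euclidean(u1, other) ** 2))
```

Since squaring preserves order for non-negative numbers, both distances rank neighbours in the same order on binary data.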

In [0]:
## Your Code Here
# Manhattan distance between two users, ignoring movie mx
def distance(u1, u2, mx):
    # abs() is required: without it the differences cancel and the result is
    # only the difference in how many movies each user has seen.
    # For 0/1 data |a - b| == (a - b)**2, so this equals the squared Euclidean distance.
    # Start by cancelling the mx term that the loop below will add back
    d = 0 - abs(seen[(u1, mx)] - seen[(u2, mx)])
    for m in range(movieCount):
        d += abs(seen[(u1, m)] - seen[(u2, m)])
    return d

def kNN(k, givenUser, givenMovie):
    distances = []
    for u in range(userCount):
        if u != givenUser:
            distances.append([distance(u, givenUser, givenMovie), u])
    distances.sort()
    # No reverse() here: a smaller Manhattan distance means a closer user
    return distances[:k] 

def prediction(k, givenUser, givenMovie):
    neighbours = kNN(k, givenUser, givenMovie)
    print(neighbours)
    howmanySaw = sum([seen[(u, givenMovie)] for d, u in neighbours])
    print("howmanySaw=", howmanySaw)
    return 2 * howmanySaw > k      ### predict 1 if more than half of the similar users have seen this movie, otherwise 0.
        

In [0]:
prediction(10,1,57)

[[51, 494], [53, 265], [56, 81], [56, 190], [56, 420], [56, 448], [57, 107], [57, 224], [57, 248], [57, 337]]
howmanySaw= 7


True

### Summary

In the above experiment, we learnt how to build a recommendation system using a KNN classifier.

### Please answer the questions below to complete the experiment:

In [0]:
#@title In the experiment above, are two users considered nearest neighbours if they have both watched the same number of movies (not necessarily the same movies)? { run: "auto", form-width: "500px", display-mode: "form" }
Answer = "FALSE" #@param ["TRUE","FALSE"]


In [0]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity =  "Good and Challenging me" #@param ["Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging me", "Was Tough, but I did it", "Too Difficult for me"]


In [0]:
#@title If it was very easy, what more would you have liked to have been added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "test" #@param {type:"string"}

In [0]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "Yes" #@param ["Yes", "No"]

In [0]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id =return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")

Your submission is successful.
Ref Id: 5060
Date of submission:  15 May 2019
Time of submission:  23:14:29
View your submissions: https://iiith-aiml.talentsprint.com/notebook_submissions
For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.
