# Collaborative Filtering based Recommendation System_Questions

## About Book Crossing Dataset
This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books and ratings. Explicit ratings are expressed on a scale from 1-10 (higher values denoting higher appreciation), and an implicit rating is expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

## Objective
This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

Execute the below cell to load the datasets

In [1]:
import pandas as pd

In [2]:
#Loading data (error_bad_lines was removed in pandas 2.0; on newer versions use on_bad_lines='warn' instead)
books = pd.read_csv("books.csv", sep=";", error_bad_lines=False, encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv('users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv('ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'


## Q1. Check the number of records (shape) and the features in each dataset

In [3]:
print('The shape of the Books Dataset is: ',books.shape)
print('The shape of the Users Dataset is: ',users.shape)
print('The shape of the Ratings Dataset is: ',ratings.shape)

The shape of the Books Dataset is:  (271360, 8)
The shape of the Users Dataset is:  (278858, 3)
The shape of the Ratings Dataset is:  (1149780, 3)


## Q2. Exploring books dataset - 1

In [4]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [5]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
ISBN                 271360 non-null object
bookTitle            271360 non-null object
bookAuthor           271359 non-null object
yearOfPublication    271360 non-null object
publisher            271358 non-null object
imageUrlS            271360 non-null object
imageUrlM            271360 non-null object
imageUrlL            271357 non-null object
dtypes: object(8)
memory usage: 16.6+ MB


In [6]:
books.describe().transpose()

Unnamed: 0,count,unique,top,freq
ISBN,271360,271360,0375806377,1
bookTitle,271360,242135,Selected Poems,27
bookAuthor,271359,102023,Agatha Christie,632
yearOfPublication,271360,202,2002,13903
publisher,271358,16807,Harlequin,7535
imageUrlS,271360,271044,http://images.amazon.com/images/P/006016848X.0...,2
imageUrlM,271360,271044,http://images.amazon.com/images/P/067100767X.0...,2
imageUrlL,271357,271041,http://images.amazon.com/images/P/044640361X.0...,2


### Drop last three columns containing image URLs which will not be required for analysis

In [7]:
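columns = ['imageUrlS', 'imageUrlM', 'imageUrlL']
#Note: books_new is an alias of books (not a copy), so the in-place drop below also removes these columns from books
books_new = books
books_new.drop(columns, axis=1, inplace=True)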

In [8]:
books_new.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


### Check the unique values of yearOfPublication

In [9]:
books_new['yearOfPublication'].unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

### Check the rows having 'DK Publishing Inc' as yearOfPublication and drop them
### Change the datatype of yearOfPublication to 'int'  -1

In [10]:
#Removing rows from yearOfPublication where value is 'DK Publishing Inc'
books_new.drop(books.index[books['yearOfPublication'] == 'DK Publishing Inc'], inplace=True)

In [11]:
#The unique value list above contains one more non-numeric entry, 'Gallimard', so we remove those rows too
books_new.drop(books.index[books['yearOfPublication'] == 'Gallimard'], inplace=True)

In [12]:
books_new[books_new['yearOfPublication'] == 'DK Publishing Inc']

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher


In [13]:
#Converting yearOfPublication column now into numeric from object after removing characters
books_new['yearOfPublication'] = pd.to_numeric(books_new['yearOfPublication'])

In [14]:
books_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 271357 entries, 0 to 271359
Data columns (total 5 columns):
ISBN                 271357 non-null object
bookTitle            271357 non-null object
bookAuthor           271356 non-null object
yearOfPublication    271357 non-null int64
publisher            271355 non-null object
dtypes: int64(1), object(4)
memory usage: 12.4+ MB


### Check for null values and impute them

In [15]:
books_new.isnull().sum()

ISBN                 0
bookTitle            0
bookAuthor           1
yearOfPublication    0
publisher            2
dtype: int64

#Find the rows that contain a null value in one or more columns

In [16]:
books_new[books_new.isnull().any(axis=1)]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
128890,193169656X,Tyrant Moon,Elaine Corvidae,2002,
129037,1931696993,Finders Keepers,Linnea Sinclair,2001,
187689,9627982032,The Credit Suisse Guide to Managing Your Perso...,,1995,Edinburgh Financial Publishing


#Replacing NaN values with the most common values for categorical columns

In [17]:
import numpy as np

In [18]:
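#Assign the result back instead of calling fillna(inplace=True) on a column selection, which newer pandas versions may not propagate to the dataframe
books_new['bookAuthor'] = books_new['bookAuthor'].fillna(books_new['bookAuthor'].value_counts().index[0])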

In [19]:
#Assign the result back instead of calling fillna(inplace=True) on a column selection
books_new['publisher'] = books_new['publisher'].fillna(books_new['publisher'].value_counts().index[0])

In [20]:
books_new.isnull().sum()

ISBN                 0
bookTitle            0
bookAuthor           0
yearOfPublication    0
publisher            0
dtype: int64

## Q3. Explore Users Dataset

### Age values below 5 and above 90 do not make much sense for book ratings, so replace them with the mean and change the datatype to int - 1

In [21]:
users_new = users

In [22]:
users_new.head()

Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [23]:
users_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         168096 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


In [24]:
users_new.index[users_new['Age'] <5]

Int64Index([   219,    469,    561,    612,    670,    931,   1148,   1460,
              1564,   1685,
            ...
            276073, 276184, 276226, 276315, 276577, 276692, 276784, 276889,
            277075, 277908],
           dtype='int64', length=882)

In [25]:
#users_new['Age'].replace(users_new.index[users_new['Age'] <5], users_new['Age'].mean(), inplace=True)

In [26]:
users_new.loc[users_new.Age < 5, 'Age'] = users_new['Age'].mean()

In [27]:
users_new.loc[users_new.Age > 90, 'Age'] = users_new['Age'].mean()

In [28]:
users_new['Age'] = pd.to_numeric(users_new['Age'])

In [29]:
users_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         168096 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


In [30]:
users_new['Age'].isnull().any()

True

In [31]:
#We need to replace null values before we change the datatype to int
users_new['Age'] = users_new['Age'].fillna(0)

In [32]:
users_new['Age'].isnull().any()

False

In [33]:
users_new['Age'] = pd.to_numeric(users_new['Age'])

In [34]:
users_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         278858 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB
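
The cells above end with missing ages stored as 0 and the column still typed float64, whereas the heading asks for mean replacement and an integer type. A minimal plain-Python sketch of that intended logic follows. `clean_ages` is a hypothetical helper, missing ages are assumed to be `None`, and the mean is taken over plausible ages only (the notebook's `users_new['Age'].mean()` also includes the out-of-range values).

```python
from statistics import mean

def clean_ages(ages, low=5, high=90):
    """Replace missing or implausible ages with the rounded mean of the valid ones."""
    # Only ages inside [low, high] contribute to the mean.
    valid = [a for a in ages if a is not None and low <= a <= high]
    fill = round(mean(valid))
    # Keep valid ages (as int); substitute the fill value everywhere else.
    return [int(a) if a is not None and low <= a <= high else fill for a in ages]

cleaned = clean_ages([18.0, None, 17.0, 2.0, 120.0, 35.0])
print(cleaned)  # [18, 23, 17, 23, 23, 35]
```

Filling with the mean rather than 0 avoids introducing a new implausible age that the earlier range check was meant to remove.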


## Q4. Explore ratings Dataset

In [35]:
rating_new = ratings

In [36]:
rating_new.head()

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [37]:
rating_new.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
userID,1149780.0,140386.395126,80562.277718,2.0,70345.0,141010.0,211028.0,278854.0
bookRating,1149780.0,2.86695,3.854184,0.0,0.0,0.0,7.0,10.0


In [38]:
rating_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
userID        1149780 non-null int64
ISBN          1149780 non-null object
bookRating    1149780 non-null int64
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


In [39]:
rating_new.isna().sum()

userID        0
ISBN          0
bookRating    0
dtype: int64

### Ratings dataset should have books only which exist in our books dataset. - 1

In [40]:
rating_book = rating_new[rating_new['ISBN'].isin(books_new['ISBN'])]

In [41]:
rating_book.shape

(1031132, 3)

In [42]:
rating_new.shape

(1149780, 3)

### Ratings dataset should have ratings from users which exist in users dataset.

In [43]:
ratings_user = rating_book[rating_book['userID'].isin(users['userID'])]

In [44]:
ratings_user.shape

(1031132, 3)

In [45]:
rating_new.shape

(1149780, 3)

### Consider only ratings from 1-10 and leave 0s.

In [46]:
rating10 = ratings_user

In [47]:
rating10['bookRating'].unique()

array([ 0,  5,  3,  6,  7,  9,  8, 10,  1,  4,  2], dtype=int64)

In [48]:
rating10 = ratings_user[ratings_user['bookRating'] != 0]

In [49]:
rating10['bookRating'].unique()

array([ 5,  3,  6,  7,  9,  8, 10,  1,  4,  2], dtype=int64)

### Find out which rating has been given the highest number of times

In [50]:
rating10['bookRating'].value_counts()

8     91804
10    71225
7     66401
9     60778
5     45355
6     31687
4      7617
3      5118
2      2375
1      1481
Name: bookRating, dtype: int64

#Rating 8 has been given the highest number of times (91804)

## **Collaborative Filtering Based Recommendation Systems**

### For more accurate results, only consider users who have rated at least 100 books

In [51]:
user_count = rating10['userID'].value_counts()

In [52]:
rating_index = user_count.loc[user_count.values > 99].index

In [53]:
len(rating_index)

449

#This shows we have 449 users who have rated at least 100 books

In [54]:
#Filtering the ratings for those 449 users
rating10 = rating10.loc[rating10['userID'].isin(rating_index)]
rating10.shape

(103271, 3)

In [55]:
len(rating10['userID'].unique())

449

In [56]:
rating10.head()

Unnamed: 0,userID,ISBN,bookRating
1456,277427,002542730X,10
1458,277427,003008685X,8
1461,277427,0060006641,10
1465,277427,0060542128,7
1474,277427,0061009059,9


## Q5 Generating ratings matrix from explicit ratings table

#### Note: since NaNs cannot be handled by training algorithms, replace these by 0, which indicates absence of ratings

In [57]:
ratings_matrix = rating10.pivot(index = 'userID',columns = 'ISBN', values = 'bookRating')
userID = ratings_matrix.index
ISBN = ratings_matrix.columns
ratings_matrix.fillna(0, inplace=True)
print("Matrix Shape: ", ratings_matrix.shape)
ratings_matrix.head()

Matrix Shape:  (449, 66574)


userID,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
2033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2110,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2276,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4385,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
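
To make the pivot concrete, here is a small plain-Python sketch of the same idea: one row per user, one column per ISBN, and 0.0 wherever no rating exists. `build_matrix` is a hypothetical helper and the three sample ratings are invented for illustration.

```python
def build_matrix(ratings):
    """Return (users, items, matrix) with 0.0 marking absent ratings."""
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    row = {user: k for k, user in enumerate(users)}
    col = {item: j for j, item in enumerate(items)}
    # Start from an all-zero matrix, then write the known ratings.
    matrix = [[0.0] * len(items) for _ in users]
    for u, i, r in ratings:
        matrix[row[u]][col[i]] = float(r)
    return users, items, matrix

users, items, matrix = build_matrix([(2033, "B", 8), (2110, "A", 10), (2033, "A", 5)])
print(matrix)  # [[5.0, 8.0], [10.0, 0.0]]
```

Note that after zero-filling, "not rated" is numerically indistinguishable from a very low score, which is worth keeping in mind when interpreting anything computed directly on this matrix.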


## Q6. Generate the predicted ratings using SVD with 50 singular values (latent factors)

In [58]:
from surprise import Dataset,Reader
from surprise import SVD
from surprise import accuracy

In [59]:
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(rating10[['userID', 'ISBN', 'bookRating']], reader)

In [60]:
from surprise.model_selection import train_test_split
trainset, testset = train_test_split(data, test_size=.30,random_state=45)

In [61]:
svd_model = SVD(n_factors=50,biased=False)

In [62]:
svd_model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x18491749c88>

In [63]:
test_pred = svd_model.test(testset)

In [64]:
accuracy.rmse(test_pred)

RMSE: 3.2606


3.260558003093164
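
For reference, the RMSE reported above is the square root of the mean squared difference between true and estimated ratings, so a value of about 3.26 on a 1-10 scale corresponds to a typical error of roughly three rating points. A minimal sketch of the metric, with a hypothetical `rmse` helper and invented sample pairs:

```python
import math

def rmse(pairs):
    """pairs: iterable of (true_rating, estimated_rating)."""
    pairs = list(pairs)
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(pairs))

# Squared errors are 1, 9 and 16, so the result is sqrt(26 / 3).
score = rmse([(8, 7), (10, 7), (5, 9)])
```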

## Take a particular user_id

Let's find the recommendations for the user with id 2110.
Note: Execute the cells below to load the variables.

In [65]:
userID = 2110

## Q7 Get the predicted ratings for userID 2110 and sort them in descending order

First, we need to predict the ratings of the books that user 2110 has not rated.

In [66]:
#For this, first we need to generate an anti test set
testset_new = trainset.build_anti_testset()

In [67]:
testset_new[0][1]

'0064400026'

In [68]:
#Convert the user's raw id to an inner uid
trainset.to_inner_uid(2110)

442

In [69]:
#Keep only the anti-testset entries that belong to user id 2110.
user2110 = [row for row in testset_new if row[0] == 2110]

In [70]:
#Thus we have the list of books for user id 2110 which the user has not rated
user2110[0:5]

[(2110, '0373706235', 7.8208856119188255),
 (2110, '0064400026', 7.8208856119188255),
 (2110, '1400032016', 7.8208856119188255),
 (2110, '0671025422', 7.8208856119188255),
 (2110, '0312966970', 7.8208856119188255)]
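
The anti-testset is every (user, item) pair that does not occur in the training set. By default Surprise fills the placeholder rating with the training set's global mean, which appears to be why each triple above carries the same third value. A plain-Python sketch of that construction, using a hypothetical function and invented ratings rather than Surprise's own code:

```python
def build_anti_testset(train_ratings):
    """All (user, item, fill) triples absent from train_ratings; fill is the global mean."""
    users = sorted({u for u, _, _ in train_ratings})
    items = sorted({i for _, i, _ in train_ratings})
    seen = {(u, i) for u, i, _ in train_ratings}
    fill = sum(r for _, _, r in train_ratings) / len(train_ratings)
    # Pair every user with every item and keep only the unseen combinations.
    return [(u, i, fill) for u in users for i in items if (u, i) not in seen]

train = [(2110, "A", 8), (2110, "B", 6), (2033, "C", 10)]
anti = build_anti_testset(train)
print(anti)  # [(2033, 'A', 8.0), (2033, 'B', 8.0), (2110, 'C', 8.0)]
```

The size of this list grows with users times items, so restricting it to a single user early, as the notebook does next, keeps the prediction step manageable.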

In [71]:
predictions = svd_model.test(user2110)

In [72]:
len(predictions)

50086

In [73]:
#Creating the dataframe of user id 2110 with each book and its estimated rating
predictions_df = pd.DataFrame([[x.uid,x.iid,x.est] for x in predictions])

In [74]:
predictions_df.columns = ["userID","ISBN","est_rating"]
predictions_df.sort_values(by = ["userID","est_rating"],ascending=False,inplace=True)

In [75]:
predictions_df.head()

Unnamed: 0,userID,ISBN,est_rating
3032,2110,1400031346,8.780548
687,2110,0345339681,8.175391
77,2110,0842329129,7.994389
7028,2110,0786866845,7.862248
10963,2110,0440215625,7.825756


## Q8 Create a dataframe with name user_data containing userID 2110 explicitly interacted books

#Let's create a dataframe of the books that the user has rated

In [76]:
user_data = rating10[rating10['userID']==2110]

In [77]:
len(user_data)

103

#It shows that user 2110 has rated 103 books

## Q9 Combine the user_data and the corresponding book data (`book_data`) in a single dataframe with name `user_full_info`

In [78]:
user_full_info = user_data.merge(books,on='ISBN')

In [79]:
len(user_full_info)

103

## Q10 Get top 10 recommendations for above given userID from the books not already rated by that user

#Let's use the model's predictions to see which books we can recommend to the user

In [80]:
predictions_df = predictions_df.merge(books, on = 'ISBN')
predictions_df.head()

Unnamed: 0,userID,ISBN,est_rating,bookTitle,bookAuthor,yearOfPublication,publisher
0,2110,1400031346,8.780548,The No. 1 Ladies' Detective Agency,Alexander McCall Smith,2002,Anchor (UK)
1,2110,0345339681,8.175391,The Hobbit : The Enchanting Prelude to The Lor...,J.R.R. TOLKIEN,1986,Del Rey
2,2110,0842329129,7.994389,Left Behind: A Novel of the Earth's Last Days ...,Tim Lahaye,1996,Tyndale House Publishers
3,2110,0786866845,7.862248,Ice Bound: A Doctor's Incredible Battle for Su...,Dr. Jerri Nielsen,2001,Miramax
4,2110,0440215625,7.825756,Dragonfly in Amber,DIANA GABALDON,1993,Dell


In [81]:
predictions_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50086 entries, 0 to 50085
Data columns (total 7 columns):
userID               50086 non-null int64
ISBN                 50086 non-null object
est_rating           50086 non-null float64
bookTitle            50086 non-null object
bookAuthor           50086 non-null object
yearOfPublication    50086 non-null int64
publisher            50086 non-null object
dtypes: float64(1), int64(2), object(4)
memory usage: 3.1+ MB


#There are 50086 candidate books (not yet rated by the user) with a predicted rating
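
The question asks for the top 10, while the cells above only display the first five rows. Assuming the merge keeps the descending sort order, `predictions_df.head(10)` would give the list directly. The selection itself can be sketched in plain Python as follows, where `top_n` is a hypothetical helper and the sample predictions are invented:

```python
import heapq

def top_n(predictions, rated_isbns, n=10):
    """predictions: iterable of (isbn, est_rating); books already rated are skipped."""
    candidates = ((isbn, est) for isbn, est in predictions if isbn not in rated_isbns)
    # nlargest returns the n highest estimates in descending order.
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])

preds = [("A", 7.9), ("B", 8.8), ("C", 8.2), ("D", 6.1)]
best = top_n(preds, rated_isbns={"C"}, n=2)
print(best)  # [('B', 8.8), ('A', 7.9)]
```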

# Content Based Recommendation System - Optional ( Q11 - Q19 will not be graded)

## Q11 Read the Dataset `movies_metadata.csv`

In [82]:
movies = pd.read_csv("movies_metadata.csv", error_bad_lines=False, encoding="latin-1")
movies.head()



Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [83]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
adult                    45466 non-null object
belongs_to_collection    4494 non-null object
budget                   45466 non-null object
genres                   45466 non-null object
homepage                 7782 non-null object
id                       45466 non-null object
imdb_id                  45449 non-null object
original_language        45455 non-null object
original_title           45466 non-null object
overview                 44512 non-null object
popularity               45461 non-null object
poster_path              45080 non-null object
production_companies     45463 non-null object
production_countries     45463 non-null object
release_date             45379 non-null object
revenue                  45460 non-null float64
runtime                  45203 non-null float64
spoken_languages         45460 non-null object
status                   45379 non-null objec

## Q12 Create a new column with name 'description' combining `'overview' and 'tagline'` columns in the given dataset

In [84]:
movies['description'] = movies.overview.str.cat(movies.tagline)
movies.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,description
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,When siblings Judy and Peter discover an encha...
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,A family wedding reignites the ancient feud be...
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,"Cheated on, mistreated and stepped on, the wom..."
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,Just when George Banks has recovered from his ...


## Q13  Let's drop the null values in `description` column

In [85]:
movies.description.isnull().sum()

25062

In [86]:
movies = movies[pd.notnull(movies['description'])]

In [87]:
movies.description.isnull().sum()

0
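
The large number of null descriptions arises because `Series.str.cat` returns a missing value whenever either input is missing, so every movie without a tagline (such as the first row, Toy Story) loses its description. It also joins the two texts with no separator. A hedged plain-Python sketch of a more forgiving combination, where `combine_text` is a hypothetical helper and missing values are assumed to be `None`:

```python
def combine_text(overview, tagline, sep=" "):
    """Join whichever parts are present; return None only if both are missing."""
    parts = [p for p in (overview, tagline) if p]
    return sep.join(parts) if parts else None

print(combine_text("Led by Woody, Andy's toys live happily", None))
print(combine_text("A family wedding reignites the feud.", "Still Yelling."))
```

In pandas, the `sep` and `na_rep` parameters of `Series.str.cat` offer a similar effect; the outputs shown in this notebook reflect the original call without them.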

## Q14 Keep the first occurrence and drop duplicates of each title in column `title`

In [88]:
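#Note: keep=False drops every row whose title occurs more than once; keep='first' would retain the first occurrence, as the heading asks
movies.drop_duplicates(subset ="title", keep = False, inplace = True)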

In [89]:
movies.shape

(18608, 25)

## Q15   As we might have dropped a few rows with duplicate `title` in above step, just reset the index [make sure you are not adding any new column to the dataframe while doing reset index]

In [90]:
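#Assign the result back; reset_index returns a new dataframe and does not modify movies in place
movies = movies.reset_index(drop=True)
movies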

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,description
0,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,When siblings Judy and Peter discover an encha...
1,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,A family wedding reignites the ancient feud be...
2,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,"Cheated on, mistreated and stepped on, the wom..."
3,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,Just when George Banks has recovered from his ...
4,False,,60000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",,949,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",...,187436818.0,170.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,False,7.7,1886.0,"Obsessive master thief, Neil McCauley leads a ..."
5,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,45325,tt0112302,en,Tom and Huck,"A mischievous young boy, Tom Sawyer, witnesses...",...,0.0,97.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,The Original Bad Boys.,Tom and Huck,False,5.4,45.0,"A mischievous young boy, Tom Sawyer, witnesses..."
6,False,,35000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,9091,tt0114576,en,Sudden Death,International action superstar Jean Claude Van...,...,64350171.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Terror goes into overtime.,Sudden Death,False,5.5,174.0,International action superstar Jean Claude Van...
7,False,"{'id': 645, 'name': 'James Bond Collection', '...",58000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...",http://www.mgm.com/view/movie/757/Goldeneye/,710,tt0113189,en,GoldenEye,James Bond must unmask the mysterious head of ...,...,352194034.0,130.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,No limits. No fears. No substitutes.,GoldenEye,False,6.6,1194.0,James Bond must unmask the mysterious head of ...
8,False,,62000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,9087,tt0112346,en,The American President,"Widowed U.S. president Andrew Shepherd, one of...",...,107879496.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Why can't the most powerful man in the world h...,The American President,False,6.5,199.0,"Widowed U.S. president Andrew Shepherd, one of..."
9,False,"{'id': 117693, 'name': 'Balto Collection', 'po...",0,"[{'id': 10751, 'name': 'Family'}, {'id': 16, '...",,21032,tt0112453,en,Balto,An outcast half-wolf risks his life to prevent...,...,11348324.0,78.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Part Dog. Part Wolf. All Hero.,Balto,False,7.1,423.0,An outcast half-wolf risks his life to prevent...


## Q16    Generate tf-idf matrix using the column `description`. Consider till 3-grams, with minimum document frequency as 0.

Hint:
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')

In [91]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')
tf.fit(movies["description"])
title_matrix = tf.transform(movies["description"])

In [92]:
tf

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=0,
        ngram_range=(1, 3), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [93]:
title_matrix

<18608x1108701 sparse matrix of type '<class 'numpy.float64'>'
	with 1747620 stored elements in Compressed Sparse Row format>
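
The very wide matrix (1,108,701 columns) follows from `ngram_range=(1, 3)`: every run of one, two or three consecutive words becomes its own feature. A small sketch of that expansion, using a hypothetical `word_ngrams` helper on an already tokenized text (stop-word removal and lowercasing are left out):

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

grams = word_ngrams(["toy", "story", "collection"])
print(grams)
# ['toy', 'story', 'collection', 'toy story', 'story collection', 'toy story collection']
```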

## Q17  Create cosine similarity matrix
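
The notebook leaves this question unanswered. As a minimal sketch of the idea, cosine similarity is the dot product of two vectors divided by the product of their lengths. The helpers below are hypothetical and operate on sparse `{term: weight}` dictionaries rather than on `title_matrix`.

```python
import math

def cosine_similarity(a, b):
    """a, b: sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)

def similarity_matrix(vectors):
    """Pairwise cosine similarities as a list of lists."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

docs = [{"toy": 1.0, "story": 1.0}, {"toy": 1.0}, {"heat": 2.0}]
sims = similarity_matrix(docs)
```

Because the vectorizer above uses `norm='l2'`, the rows of `title_matrix` already have unit length, so a plain dot product such as scikit-learn's `linear_kernel(title_matrix, title_matrix)` should give the same values. A dense 18608 x 18608 float64 result would need roughly 2.8 GB of memory.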

## Q18  Write a function with name `recommend` which takes `title` as argument and returns a list of 10 recommended title names in the output based on the above cosine similarities

Hint:

titles = df['title'] <br>
indices = pd.Series(df.index, index=df['title']) <br>

def recommend(title): <br>
    idx = indices[title] <br>
    sim_scores = list(enumerate(cosine_similarities[idx])) <br>
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True) <br>
    sim_scores = sim_scores[1:11] <br>
    movie_indices = [i[0] for i in sim_scores] <br>
    return titles.iloc[movie_indices] <br>

## Q19 Give the recommendations from above functions for movies `The Godfather` and `The Dark Knight Rises`

# Popularity Based Recommendation System

### About Dataset

Anonymous Ratings on jokes.

1. Ratings are real values ranging from -10.00 to +10.00 (the value "99" corresponds to "null" = "not rated").

2. One row per user

3. The first column gives the number of jokes rated by that user. The next 100 columns give the ratings for jokes 01 - 100.

# Q20 Read the dataset (jokes.csv)

Take care with the `header` argument of read_csv(), as there are no column names given in the dataset.
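
A minimal sketch of the two `header` settings, using a small in-memory CSV instead of `jokes.csv`. The sample values and the `NumJokes`/`Joke1`/`Joke2` labels passed to `names` are illustrative. With `header=None`, pandas treats the first line as data and the columns can be named explicitly; with the default, the first line is consumed as column names.

```python
import io
import pandas as pd

# Two rows of made-up data with no header line.
raw = "74,-7.82,8.79\n100,4.08,-0.29\n"

# header=None keeps the first line as data; names supplies the labels.
with_names = pd.read_csv(io.StringIO(raw), header=None,
                         names=["NumJokes", "Joke1", "Joke2"])

# Default header handling would swallow the first row as column names.
default = pd.read_csv(io.StringIO(raw))

print(with_names.shape)  # both rows kept
print(default.shape)     # one row lost to the header
```

In this notebook the `info()` output shows columns named `NumJokes` to `Joke100`, so the copy of the file used here appears to include a header row and the default call works.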

In [94]:
jokes_df = pd.read_csv('jokes.csv')

In [95]:
jokes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24983 entries, 0 to 24982
Columns: 101 entries, NumJokes to Joke100
dtypes: float64(100), int64(1)
memory usage: 19.3 MB


In [96]:
jokes_df.head()

Unnamed: 0,NumJokes,Joke1,Joke2,Joke3,Joke4,Joke5,Joke6,Joke7,Joke8,Joke9,...,Joke91,Joke92,Joke93,Joke94,Joke95,Joke96,Joke97,Joke98,Joke99,Joke100
0,74,-7.82,8.79,-9.66,-8.16,-7.52,-8.5,-9.85,4.17,-8.98,...,2.82,99.0,99.0,99.0,99.0,99.0,-5.63,99.0,99.0,99.0
1,100,4.08,-0.29,6.36,4.37,-2.38,-9.66,-0.73,-5.34,8.88,...,2.82,-4.95,-0.29,7.86,-0.19,-2.14,3.06,0.34,-4.32,1.07
2,49,99.0,99.0,99.0,99.0,9.03,9.27,9.03,9.27,99.0,...,99.0,99.0,99.0,9.08,99.0,99.0,99.0,99.0,99.0,99.0
3,48,99.0,8.35,99.0,99.0,1.8,8.16,-2.82,6.21,99.0,...,99.0,99.0,99.0,0.53,99.0,99.0,99.0,99.0,99.0,99.0
4,91,8.5,4.61,-4.17,-5.39,1.36,1.6,7.04,4.61,-0.44,...,5.19,5.58,4.27,5.19,5.73,1.55,3.11,6.55,1.8,1.6


# Q21 Create a dataframe named `ratings` containing only the first 200 rows and all columns from index 1 onward (the first column has index 0)

In [97]:
ratings = jokes_df.iloc[0:200,1:]

In [98]:
ratings.shape

(200, 100)

# Q22 Rename the columns to the integers 0 to 99

In [99]:
ratings.columns = list(range(0,100))

In [100]:
ratings.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,-7.82,8.79,-9.66,-8.16,-7.52,-8.5,-9.85,4.17,-8.98,-4.76,...,2.82,99.0,99.0,99.0,99.0,99.0,-5.63,99.0,99.0,99.0
1,4.08,-0.29,6.36,4.37,-2.38,-9.66,-0.73,-5.34,8.88,9.22,...,2.82,-4.95,-0.29,7.86,-0.19,-2.14,3.06,0.34,-4.32,1.07
2,99.0,99.0,99.0,99.0,9.03,9.27,9.03,9.27,99.0,99.0,...,99.0,99.0,99.0,9.08,99.0,99.0,99.0,99.0,99.0,99.0
3,99.0,8.35,99.0,99.0,1.8,8.16,-2.82,6.21,99.0,1.84,...,99.0,99.0,99.0,0.53,99.0,99.0,99.0,99.0,99.0,99.0
4,8.5,4.61,-4.17,-5.39,1.36,1.6,7.04,4.61,-0.44,5.73,...,5.19,5.58,4.27,5.19,5.73,1.55,3.11,6.55,1.8,1.6


# Q23 In the dataset, the null ratings are given as 99.00, so replace all 99.00s with 0
Hint: You can use `ratings.replace(<the given value>, <the replacement value>)`
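
As a quick illustration of the hint, the sketch below applies `replace` to a tiny made-up frame. `DataFrame.replace` returns a new frame by default, so the result has to be assigned back.

```python
import pandas as pd

# Made-up ratings; 99.0 marks "not rated".
toy = pd.DataFrame({0: [-7.82, 99.0], 1: [99.0, 4.37]})

cleaned = toy.replace(99.0, 0)

print(cleaned)
print((toy == 99.0).sum().sum())  # original frame is unchanged
```

Note that 0 is also a valid rating on the -10 to +10 scale, so after this step an unrated joke is indistinguishable from a joke rated exactly 0.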

In [101]:
ratings = ratings.replace(99.00,0)

In [102]:
ratings.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,-7.82,8.79,-9.66,-8.16,-7.52,-8.5,-9.85,4.17,-8.98,-4.76,...,2.82,0.0,0.0,0.0,0.0,0.0,-5.63,0.0,0.0,0.0
1,4.08,-0.29,6.36,4.37,-2.38,-9.66,-0.73,-5.34,8.88,9.22,...,2.82,-4.95,-0.29,7.86,-0.19,-2.14,3.06,0.34,-4.32,1.07
2,0.0,0.0,0.0,0.0,9.03,9.27,9.03,9.27,0.0,0.0,...,0.0,0.0,0.0,9.08,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,8.35,0.0,0.0,1.8,8.16,-2.82,6.21,0.0,1.84,...,0.0,0.0,0.0,0.53,0.0,0.0,0.0,0.0,0.0,0.0
4,8.5,4.61,-4.17,-5.39,1.36,1.6,7.04,4.61,-0.44,5.73,...,5.19,5.58,4.27,5.19,5.73,1.55,3.11,6.55,1.8,1.6


# Q24 Normalize the ratings using StandardScaler and save them in the `ratings_diff` variable

In [103]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

In [104]:
ratings.shape

(200, 100)

In [105]:
sc.fit(ratings)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [106]:
ratings_diff = sc.transform(ratings)

In [107]:
ratings_diff.shape

(200, 100)

### Popularity based recommendation system

# Q25 Find the mean of each column in `ratings_diff`, i.e., for each joke
Consider all the mean ratings, find the jokes with the highest mean values, and display the top 10 joke IDs.

In [108]:
mean_array = ratings_diff.mean(axis=0)

In [109]:
len(mean_array) # we can see that we have obtained the mean of 100 columns

100

In [110]:
#joke having highest mean value 
print('JokeId having maximum mean value: ',np.argmax(mean_array))

JokeId having maximum mean value:  53


In [111]:
#list of top 10 joke ids
top10_index = mean_array.argsort()[-10:][::-1]

In [112]:
print('Top 10 joke ids: ',top10_index)

Top 10 joke ids:  [53 20 47 49 64 99 83 23 73 13]
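
One thing worth checking here: `StandardScaler` with its default `with_mean=True` subtracts each column's mean, so the column means of `ratings_diff` are all zero up to floating-point rounding, and a ranking based on them may not reflect popularity. The sketch below reproduces the effect with NumPy on made-up ratings (the manual standardisation is assumed to mirror the scaler's defaults). It then ranks on the unscaled column means instead, which is one possible alternative rather than the notebook's method.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up ratings: 200 users x 5 jokes, with well-separated means per joke.
true_means = np.array([-4.0, 0.0, 4.0, 2.0, -2.0])
toy_ratings = rng.normal(loc=true_means, scale=1.0, size=(200, 5))

# Column-wise standardisation: subtract the mean, divide by the std.
scaled = (toy_ratings - toy_ratings.mean(axis=0)) / toy_ratings.std(axis=0)

# Every column mean is now ~0, so argsort on it carries no signal.
print(np.abs(scaled.mean(axis=0)).max())

# Ranking on the raw column means recovers the popular jokes.
top3 = toy_ratings.mean(axis=0).argsort()[-3:][::-1]
print(top3)
```

Applied to this notebook, the equivalent check would be to inspect `np.abs(mean_array).max()` and, if it is close to zero, rank on `ratings.mean(axis=0)` instead.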
