```
DATA DESCRIPTION:
    • author : name of the person who gave the rating
    • country : country the person who gave the rating belongs to
    • data : date of the rating
    • domain: website from which the rating was taken from
    • extract: rating content
    • language: language in which the rating was given
    • product: name of the product/mobile phone for which the rating was given
    • score: average rating for the phone
    • score_max: highest rating given for the phone
    • source: source from where the rating was taken
```

**Project Objectives :**  we will build a recommendation system using popularity based and collaborative filtering methods to recommend mobile phones to a user which are most popular and personalised respectively.


# Steps and tasks: [ Total Score: 60 points]

In [42]:
import pandas as pd
import numpy as np
import glob
import string

from collections import defaultdict

from surprise import SVD
from surprise import KNNWithMeans
from surprise import Dataset, Reader
from surprise import accuracy
from surprise.model_selection import train_test_split, cross_validate

In [92]:
# reading all csv files
def read_csv_files(path) :
    # Get CSV files list from a path
    csv_files = glob.glob(path + "/*.csv")
    csv_files

    df  = pd.DataFrame()

    for file in csv_files :
        print('file =',file)
        df = pd.concat([df, pd.read_csv(file, encoding='latin-1') ],ignore_index=True)
        pd.read_csv(file, encoding='latin-1').info()
    return df 

#return null values percentage
def null_percentage(df):
    return  100 * (df.isnull().sum()/len(df))

# duplicate values
def duplicate_values(df):
    return df.drop_duplicates()

#get value counts
def value_counts(df):
    for feature in df.columns: # Loop through all columns in the dataframe
        print(df[feature].value_counts(),'\n---------------------------------------',)
        
#has special characters
def has_special_char(df):
    for feature in df.columns:
        if (df[feature].dtype == 'object' ): 
            unwanted = string.ascii_letters + string.punctuation + string.whitespace + string.digits
            print(df[df[feature].str.strip(unwanted).astype(bool)==True])

# impute missing values by median
def impute_missing_by_median(df, include):
    for feature in df.columns: # Loop through all columns in the dataframe
        if (feature in include ): # Only apply for columns with categorical strings
            df[feature]=df[feature].fillna(df[feature].median()) # Impute missing By Median
    return df

# drop missing values
def drop_missing_rows(df, feature):
    df.dropna(subset= feature, inplace=True)
    return df


1. Import the necessary libraries and read the provided CSVs as a data frame and perform the below steps. [15 Marks]  
A. Merge all the provided CSVs into one dataFrame. [2 Marks] 

In [101]:
mobile_df = read_csv_files('DataSet')
mobile_df.info()

file = DataSet/phone_user_review_file_1.csv
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374910 entries, 0 to 374909
Data columns (total 11 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   phone_url  374910 non-null  object 
 1   date       374910 non-null  object 
 2   lang       374910 non-null  object 
 3   country    374910 non-null  object 
 4   source     374910 non-null  object 
 5   domain     374910 non-null  object 
 6   score      366691 non-null  float64
 7   score_max  366691 non-null  float64
 8   extract    371934 non-null  object 
 9   author     371641 non-null  object 
 10  product    374910 non-null  object 
dtypes: float64(2), object(9)
memory usage: 31.5+ MB
file = DataSet/phone_user_review_file_2.csv
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114925 entries, 0 to 114924
Data columns (total 11 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   phone_url  114925

***
    The given compressed file contains 6 csv files.
    phone_user_review_file_1.csv file contains 374910 rows
    phone_user_review_file_2.csv file contains 114925 rows
    phone_user_review_file_3.csv file contains 312961 rows
    phone_user_review_file_4.csv file contains 98284 rows
    phone_user_review_file_5.csv file contains 350216 rows
    phone_user_review_file_6.csv file contains 163837 rows
***

In [88]:
mobile_df.head()

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Verizon Wireless,verizonwireless.com,10.0,10.0,As a diehard Samsung fan who has had every Sam...,CarolAnn35,Samsung Galaxy S8
1,/cellphones/samsung-galaxy-s8/,4/28/2017,en,us,Phone Arena,phonearena.com,10.0,10.0,Love the phone. the phone is sleek and smooth ...,james0923,Samsung Galaxy S8
2,/cellphones/samsung-galaxy-s8/,5/4/2017,en,us,Amazon,amazon.com,6.0,10.0,Adequate feel. Nice heft. Processor's still sl...,R. Craig,"Samsung Galaxy S8 (64GB) G950U 5.8"" 4G LTE Unl..."
3,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Samsung,samsung.com,9.2,10.0,Never disappointed. One of the reasons I've be...,Buster2020,Samsung Galaxy S8 64GB (AT&T)
4,/cellphones/samsung-galaxy-s8/,5/11/2017,en,us,Verizon Wireless,verizonwireless.com,4.0,10.0,I've now found that i'm in a group of people t...,S Ate Mine,Samsung Galaxy S8


B. Explore, understand the Data and share at least 2 observations. [2 Marks]


In [89]:
mobile_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
score,1351644.0,8.00706,2.616121,0.2,7.2,9.2,10.0,10.0
score_max,1351644.0,10.0,0.0,10.0,10.0,10.0,10.0,10.0


***
For score feature,  median is closer to max than min, hence the distribution is positively skewed (skewed right)
score_max is having min, Q1, median, Q2 and max are having value 10. that is only constant value 10 is there in the score_max feature.
***

In [90]:
mobile_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1415133 entries, 0 to 1415132
Data columns (total 11 columns):
 #   Column     Non-Null Count    Dtype  
---  ------     --------------    -----  
 0   phone_url  1415133 non-null  object 
 1   date       1415133 non-null  object 
 2   lang       1415133 non-null  object 
 3   country    1415133 non-null  object 
 4   source     1415133 non-null  object 
 5   domain     1415133 non-null  object 
 6   score      1351644 non-null  float64
 7   score_max  1351644 non-null  float64
 8   extract    1395772 non-null  object 
 9   author     1351931 non-null  object 
 10  product    1415132 non-null  object 
dtypes: float64(2), object(9)
memory usage: 118.8+ MB


***
The mobile data freame contains 1415133 rows and 11 attributes.   
Score and score_max are float datypes and other attributes are of object datatype.
***

In [93]:
# Unique values
value_counts(mobile_df)

/cellphones/samsung-galaxy-s-iii/      17093
/cellphones/apple-iphone-5s/           16379
/cellphones/samsung-galaxy-s6/         16145
/cellphones/samsung-galaxy-s5/         16082
/cellphones/samsung-galaxy-s7-edge/    15917
                                       ...  
/cellphones/motorola-rizr/                 1
/cellphones/lg-c105/                       1
/cellphones/toshiba-tg02/                  1
/cellphones/zte-x990/                      1
/cellphones/alcatel-ot-799/                1
Name: phone_url, Length: 5556, dtype: int64 
---------------------------------------
7/18/2016     3244
7/17/2016     3102
1/15/2014     2885
2/5/2016      2849
2/6/2016      2567
              ... 
31/5/2017        1
14/9/2011        1
12/24/2002       1
6/6/2017         1
12/16/1999       1
Name: date, Length: 7728, dtype: int64 
---------------------------------------
en    554746
ru    207443
de    176600
it    116120
es     99739
fr     95080
pt     67155
nl     38375
tr     28359
sv     17149
f

In [94]:
# Unwanted characters
has_special_char(mobile_df)

Empty DataFrame
Columns: [phone_url, date, lang, country, source, domain, score, score_max, extract, author, product]
Index: []
Empty DataFrame
Columns: [phone_url, date, lang, country, source, domain, score, score_max, extract, author, product]
Index: []
Empty DataFrame
Columns: [phone_url, date, lang, country, source, domain, score, score_max, extract, author, product]
Index: []
Empty DataFrame
Columns: [phone_url, date, lang, country, source, domain, score, score_max, extract, author, product]
Index: []
                              phone_url        date lang country  \
245      /cellphones/samsung-galaxy-s8/    5/6/2017   no      no   
359      /cellphones/samsung-galaxy-s8/   4/26/2017   no      no   
457      /cellphones/samsung-galaxy-s8/   4/24/2017   no      no   
465      /cellphones/samsung-galaxy-s8/   4/23/2017   no      no   
489      /cellphones/samsung-galaxy-s8/   4/15/2017   ru      ru   
...                                 ...         ...  ...     ...   
1413333   /c

C. Round off scores to the nearest integers. [3 Marks]


In [95]:
mobile_df['score'] = np.round(mobile_df['score'],0).fillna(mobile_df['score'].median()).apply(np.int64)
mobile_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1415133 entries, 0 to 1415132
Data columns (total 11 columns):
 #   Column     Non-Null Count    Dtype  
---  ------     --------------    -----  
 0   phone_url  1415133 non-null  object 
 1   date       1415133 non-null  object 
 2   lang       1415133 non-null  object 
 3   country    1415133 non-null  object 
 4   source     1415133 non-null  object 
 5   domain     1415133 non-null  object 
 6   score      1415133 non-null  int64  
 7   score_max  1351644 non-null  float64
 8   extract    1395772 non-null  object 
 9   author     1351931 non-null  object 
 10  product    1415132 non-null  object 
dtypes: float64(1), int64(1), object(9)
memory usage: 118.8+ MB


***
There are 4% socre data has NaN values. Filled missing values with Median and rounded off scroes to nearest integer.
***

D. Check for missing values. Impute the missing values, if any. [2 Marks]


In [102]:
# Percentage of missing values
null_percentage(mobile_df)

phone_url    0.000000
date         0.000000
lang         0.000000
country      0.000000
source       0.000000
domain       0.000000
score        4.486433
score_max    4.486433
extract      1.368140
author       4.466153
product      0.000071
dtype: float64

In [103]:
mobile_df.shape

(1415133, 11)

In [104]:
mobile_df = drop_missing_rows(mobile_df,['author'])

In [105]:
mobile_df.shape

(1351931, 11)

***
4.4% of author feature data is missing. dropped missing author rows 63202.
***

In [53]:
# Percentage of missing values
null_percentage(mobile_df)

phone_url    0.000000
date         0.000000
lang         0.000000
country      0.000000
source       0.000000
domain       0.000000
score        0.000000
score_max    4.504150
extract      1.147618
author       0.000000
product      0.000000
dtype: float64

***
There are 4% socre data has NaN values. Filled missing values with Median and rounded off scroes to nearest integer.  
4.4% of author feature data is missing. dropped missing author rows 63202.
***

E. Check for duplicate values and remove them, if any. [2 Marks]


In [107]:
# drop duplicate
mobile_df = duplicate_values(mobile_df)

In [108]:
mobile_df.shape

(1346904, 11)

***
There are 5027 duplicate records and dropped them.
***

F. Keep only 1 Million data samples. Use random state=612. [2 Marks] 


In [55]:
mobile_df.sample(n=1000000, random_state=612)

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
138292,/cellphones/xiaomi-mi-5/,3/23/2017,he,il,Zap.il,zap.co.il,10,10.0,×××©××¨ ××××× × ×¨×× ×××© ××× ×...,×©×,×××¤×× ×¡××××¨× Xiaomi Mi5 64GB
258075,/cellphones/asus-ze500kl/,12/9/2015,en,in,Amazon,amazon.in,4,10.0,for the price its best. but its heating..nd be...,Amazon Customer,"Asus Zenfone 2 Laser ZE500KL (Black, 16GB)"
753000,/cellphones/zte-open-c/,3/27/2015,de,de,Amazon,amazon.de,8,10.0,Habe das Handy jetzt seit etwa 3 Monaten als H...,Venja Kahrs,ZTE OPEN C 4.0â³Dual-Core 1.2 GHz Smartphone ...
117824,/cellphones/oneplus-3t/,3/12/2017,en,in,Amazon,amazon.in,10,10.0,Great smartphone,Amazon certified Customer,"OnePlus 3T (Gunmetal, 6GB RAM + 128GB memory)"
306295,/cellphones/acer-liquid-z630/,1/19/2016,es,es,Amazon,amazon.es,8,10.0,"Es un muy buen mÃ³vil, a veces se calienta un ...",antcaesar,Acer Liquid Z630S - Smartphone libre Android (...
...,...,...,...,...,...,...,...,...,...,...,...
1001291,/cellphones/samsung-galaxy-mini-2-s6500/,6/7/2014,it,it,Amazon,amazon.it,10,10.0,"arrivato immediatamente,perfetto con scontrino...",adele64,"Samsung Galaxy Mini 2 Smartphone, Display 3.27..."
734529,/cellphones/sony-xperia-sp/,5/6/2015,en,gb,Amazon,amazon.co.uk,2,10.0,DO NOT BUY THIS PHONE. On day one this seemd l...,G. D. Symes,Sony Xperia SP Smartphone - on EE T-Mobile Ora...
492229,/cellphones/ttfone-pluto/,3/17/2015,en,gb,Amazon,amazon.co.uk,10,10.0,Great phone and no frills. Does what it needs ...,tracystride,TTfone Pluto Big Button Clamshell Flip Unlocke...
865209,/cellphones/blackberry-8800/,8/22/2009,en,gb,Dooyoo,dooyoo.co.uk,10,10.0,I really was very pleased that I bought this p...,surfaholic18,Blackberry 8800


G. Drop irrelevant features. Keep features like Author, Product, and Score. [2 Marks] 


In [56]:
mobileDF= mobile_df[['author','product','score']]

***
selected only the required features - Author, Product and Score
***

2. Answer the following questions. [10 Marks]  
A. Identify the most rated products. [3 Marks]


In [116]:
product_rate_df = mobileDF['product'].value_counts().reset_index(name='product_rating').sort_values(['product_rating'], ascending=False)
print('\n Most rated product \n',product_rate_df[:1],'\n-----------------------------------------------------------')



 Most rated product 
                               index  product_rating
0  Lenovo Vibe K4 Note (White,16GB)            5223 
-----------------------------------------------------------


B. Identify the users with most number of reviews. [3 Marks]  

In [115]:
user_reviews_df = mobileDF['author'].value_counts().reset_index(name='author_reviews').sort_values(['author_reviews'], ascending=False)
print(' \n User with Most number of reviews  \n\n',user_reviews_df[:1],'\n-----------------------------------------------------')

 
 User with Most number of reviews  

              index  author_reviews
0  Amazon Customer           76933 
-----------------------------------------------------


C. Select the data with products having more than 50 ratings and users who have given more than 50 ratings. Report the shape of the final
dataset. [4 Marks]  

In [118]:
auth_filter = mobileDF[mobileDF['author'].map(mobileDF['author'].value_counts()) > 50]['author']

In [119]:
product_filter = mobileDF[mobileDF['product'].map(mobileDF['product'].value_counts()) > 50]['product']

In [120]:
mobileDF_50 = mobileDF[mobileDF['author'].isin(auth_filter) & mobileDF['product'].isin(product_filter)]

In [126]:
print('\n Users who have given more than 50 ratings :',auth_filter.shape[0])
print('\n Products having more than 50 ratings :',product_filter.shape[0])
print('\n Users who have given more than 50 ratings  and Products having more than 50 ratings:', mobileDF_50.shape[0],'\n----------------------------------------------------------------------------------------------')



 Users who have given more than 50 ratings : 254238

 Products having more than 50 ratings : 866350

 Users who have given more than 50 ratings  and Products having more than 50 ratings: 174359 
----------------------------------------------------------------------------------------------


3. Build a popularity based model and recommend top 5 mobile phones. [5 Marks]

In [63]:
mobileDF.groupby('product')['score'].mean().sort_values(ascending=False).head()  

product
SAMSUNG Galaxy S5 Mini - G800F - White - Smartphone unlocked                                                                                                                                                                                                              10.0
Cubot sbloccato S222 5.5 "Android 4.2 3G phablet MTK6582 Quad Core 1.3GHz 1G di RAM +16 G ROM Dual SIM Dual Standby SIM-Free Smartphone 3G HD IPS Schermo 13.0MP Torna 8,0 MP fotocamera frontale GPS Google Play Store Quali applicazioni Tablet PC WIFI per Orange ,    10.0
Motorola Moto Z Droid Edition                                                                                                                                                                                                                                             10.0
Samsung Galaxy Note 5 SM-N9200 32GB                                                                                                                                                

In [64]:
mobileDF.groupby('product')['score'].count().sort_values(ascending=False).head()  

product
Lenovo Vibe K4 Note (White,16GB)     5223
Lenovo Vibe K4 Note (Black, 16GB)    4389
OnePlus 3 (Graphite, 64 GB)          4103
OnePlus 3 (Soft Gold, 64 GB)         3557
Huawei P8lite zwart / 16 GB          2707
Name: score, dtype: int64

In [65]:
product_mean_count = pd.DataFrame(mobileDF.groupby('product')['score'].mean()) 

In [66]:
product_mean_count['product_score_counts'] = pd.DataFrame(mobileDF.groupby('product')['score'].count())  

In [129]:
print('\nPopularity based model, Top 5 mobile phones recommend: \n\n')
product_mean_count.sort_values(by='product_score_counts',ascending=False).head()  


Popularity based model, Top 5 mobile phones recommend: 




Unnamed: 0_level_0,score,product_score_counts
product,Unnamed: 1_level_1,Unnamed: 2_level_1
"Lenovo Vibe K4 Note (White,16GB)",7.180165,5223
"Lenovo Vibe K4 Note (Black, 16GB)",7.173388,4389
"OnePlus 3 (Graphite, 64 GB)",8.725323,4103
"OnePlus 3 (Soft Gold, 64 GB)",8.502109,3557
Huawei P8lite zwart / 16 GB,8.457702,2707


4. Build a collaborative filtering model using SVD. You can use SVD from surprise or build it from scratch(Note: Incase you’re building it from scratch you
can limit your data points to 5000 samples if you face memory issues). Build a collaborative filtering model using kNNWithMeans from surprise. You
can try both user-based and item-based model. [10 Marks] 

In [68]:
#pip install scikit-surprise

In [69]:
reader = Reader(rating_scale=(1,10))
mobile_ds = Dataset.load_from_df(mobileDF_50, reader)

In [70]:
# Split data in training and test
train_data, test_data = train_test_split(mobile_ds, test_size = 0.2)

In [130]:
#a collaborative filtering model using SVD
algo_svd = SVD()
algo_svd.fit(train_data)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f94a92fccd0>

In [72]:
test_pred_svd = algo_svd.test(test_data)

In [131]:
print('\n A collaborative filtering model using SVD \n')
test_pred_svd[0:5]


 A collaborative filtering model using SVD 



[Prediction(uid='Kindle Customer', iid='Samsung Galaxy Note 4 N910a 32GB Unlocked GSM 4G LTE Smartphone White', r_ui=2.0, est=5.220687580854655, details={'was_impossible': False}),
 Prediction(uid='Frank', iid='HTC One S Smartphone (10,9 cm (4,3 Zoll) AMOLED-Tochscreen, 8 Megapixel Kamera, Android OS) grau', r_ui=10.0, est=6.796751591096053, details={'was_impossible': False}),
 Prediction(uid='Amazon Customer', iid='Lenovo Vibe K4 Note (Black, 16GB)', r_ui=8.0, est=7.129228074180737, details={'was_impossible': False}),
 Prediction(uid='Amazon Customer', iid='Panasonic P81 (Black)', r_ui=10.0, est=7.237923682822756, details={'was_impossible': False}),
 Prediction(uid='Andrew', iid='HTC EVO Design 4G Prepaid Android Phone (Boost Mobile)', r_ui=10.0, est=7.699560520593857, details={'was_impossible': False})]

In [132]:
#user-based collaborative filtering
algo_user = KNNWithMeans(k=50, sim_options={'name': 'pearson_baseline', 'user_based': True})
algo_user.fit(train_data)


Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7f94a92fc700>

In [133]:
# run the trained model against the testset
test_pred_auth = algo_user.test(test_data)

In [134]:
# item-based collaborative filtering
algo_product = KNNWithMeans(k=50, sim_options={'name': 'pearson_baseline', 'user_based': False})
algo_product.fit(train_data)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7f9456de2880>

In [136]:
# run the trained model against the testset
test_pred_product = algo_product.test(test_data)

5. Evaluate the collaborative model. Print RMSE value. [2 Marks]  

In [137]:
# get RMSE
print("SVD Model : Test Set recommendation")
accuracy.rmse(test_pred_svd, verbose=True)

SVD Model : Test Set recommendation
RMSE: 2.6502


2.6502010582620814

In [138]:
# get RMSE
print("User-based Model : Test Set recommendation")
accuracy.rmse(test_pred_auth, verbose=True)

User-based Model : Test Set recommendation
RMSE: 2.7277


2.7276855920476772

In [139]:
# get RMSE
print("item-based Model : Test Set recommendation")
accuracy.rmse(test_pred_product, verbose=True)

item-based Model : Test Set recommendation
RMSE: 2.6645


2.6644953021820728

6. Predict score (average rating) for test users. [2 Marks]  


In [140]:
test_pred_auth[0:5]

[Prediction(uid='Kindle Customer', iid='Samsung Galaxy Note 4 N910a 32GB Unlocked GSM 4G LTE Smartphone White', r_ui=2.0, est=4.250866554305137, details={'actual_k': 2, 'was_impossible': False}),
 Prediction(uid='Frank', iid='HTC One S Smartphone (10,9 cm (4,3 Zoll) AMOLED-Tochscreen, 8 Megapixel Kamera, Android OS) grau', r_ui=10.0, est=3.8719531785532597, details={'actual_k': 8, 'was_impossible': False}),
 Prediction(uid='Amazon Customer', iid='Lenovo Vibe K4 Note (Black, 16GB)', r_ui=8.0, est=7.72, details={'actual_k': 50, 'was_impossible': False}),
 Prediction(uid='Amazon Customer', iid='Panasonic P81 (Black)', r_ui=10.0, est=7.5, details={'actual_k': 8, 'was_impossible': False}),
 Prediction(uid='Andrew', iid='HTC EVO Design 4G Prepaid Android Phone (Boost Mobile)', r_ui=10.0, est=8.346456692913385, details={'actual_k': 0, 'was_impossible': False})]

In [141]:
test_pred_product[0:5]

[Prediction(uid='Kindle Customer', iid='Samsung Galaxy Note 4 N910a 32GB Unlocked GSM 4G LTE Smartphone White', r_ui=2.0, est=5.896202181804837, details={'actual_k': 50, 'was_impossible': False}),
 Prediction(uid='Frank', iid='HTC One S Smartphone (10,9 cm (4,3 Zoll) AMOLED-Tochscreen, 8 Megapixel Kamera, Android OS) grau', r_ui=10.0, est=5.478400946173987, details={'actual_k': 50, 'was_impossible': False}),
 Prediction(uid='Amazon Customer', iid='Lenovo Vibe K4 Note (Black, 16GB)', r_ui=8.0, est=7.720000000000001, details={'actual_k': 50, 'was_impossible': False}),
 Prediction(uid='Amazon Customer', iid='Panasonic P81 (Black)', r_ui=10.0, est=7.616162068700027, details={'actual_k': 50, 'was_impossible': False}),
 Prediction(uid='Andrew', iid='HTC EVO Design 4G Prepaid Android Phone (Boost Mobile)', r_ui=10.0, est=8.26388955254541, details={'actual_k': 39, 'was_impossible': False})]

7. Report your findings and inferences. [2 Marks]  


***
1. The compression file contains 6 csv files. All 6 csv files are meged and formed Pandas dataframe
2. Data contains missing values. For Score feature missing values are imputed by median and for Author feature, deleted the missing values.
3. Data contains duplicate value. Dropped duplicate values.
4. Data contains multiple features. Selected only the required features - Author, Product and Score for model building
5. Root mean square error with SVD model is 2.6502  
6. Root mean square error with User based collaborative filtering  model is 2.7277  
7. Root mean square error with Item based collaborative filtering model is 2.6645

8. RMSE with SVD model is less than RMSE with User based collaborative filtering and Item based collaborative filtering models.
***

8. Try and recommend top 5 products for test users. [5 Marks]  

In [142]:
def get_top_n(predictions, n=5):
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

In [143]:
top_n = get_top_n(test_pred_auth, n=5)

In [144]:
top_n['Ralf']

[('Mobistel Cynus T2 Smartphone (12,7 cm (5 Zoll) Touchscreen, 12 Megapixel Kamera, 4GB Speicher, Dual-SIM, Android 4.0) weiÃ\x9f',
  9.558132062987049),
 ('Sony Xperia XCompact Smartphone (11,7 cm (4,6 Zoll), 32 GB Speicher, Android 6.0) Mist Blue',
  9.176204585415022),
 ('Microsoft Nokia 2323 classic Handy (GPRS, Bluetooth, E-Mail) black',
  8.607581661368377),
 ('Huawei Ascend G700 Smartphone (12,7 cm (5 Zoll) Touchscreen, 8 Megapixel Kamera, 8 GB Interner Speicher, Android 4.2) schwarz',
  7.771428571428571),
 ('Nokia E52', 7.771428571428571)]

9. Try other techniques (Example: cross validation) to get better results. [3 Marks]  


In [145]:
# Run 5-fold cross-validation and print results
result_cv_svd = cross_validate(algo_svd, mobile_ds, measures=["RMSE", "MAE"], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    2.6771  2.6775  2.6852  2.6967  2.6512  2.6775  0.0149  
MAE (testset)     2.0576  2.0655  2.0672  2.0672  2.0222  2.0559  0.0173  
Fit time          1.64    1.61    1.63    1.69    1.93    1.70    0.12    
Test time         0.28    0.41    0.27    0.44    0.31    0.34    0.07    


In [146]:
print('Root Mean Square error with SVD :',result_cv_svd['test_rmse'].mean())
print('Mean Absolute Error with SVD: ',result_cv_svd['test_mae'].mean())

Root Mean Square error with SVD : 2.6775353398599147
Mean Absolute Error with SVD:  2.0559355858788133


In [150]:
# Run 5-fold cross-validation and print results.
result_cv_item = cross_validate(algo_product, mobile_ds, measures=['RMSE', 'MAE'], cv=5, verbose=True)
#return_train_measures=True

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    2.6621  2.6633  2.6588  2.6674  2.6738  2.6651  0.0051  
MAE (testset)     2.0245  2.0330  2.0300  2.0388  2.0449  2.0342  0.0071  
Fit time          263.63  271.83  261.40  304.83  265.21  273.38  16.11   
Test time         628.16  590.44  615.95  687.18  62

In [151]:
print('Root Mean Square error with Item-Item Collaboration :',result_cv_item['test_rmse'].mean())
print('Mean Absolute Error with Item-Item Collaboration: ',result_cv_item['test_mae'].mean())

Root Mean Square error with Item-Item Collaboration : 2.665078356674573
Mean Absolute Error with Item-Item Collaboration:  2.034230867557857


In [148]:
# Run 5-fold cross-validation and print results.
result_cv_user = cross_validate(algo_user, mobile_ds, measures=['RMSE', 'MAE'], cv=5, verbose=True, return_train_measures=True);

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    2.7489  2.7378  2.7189  2.7560  2.7371  2.7397  0.0126  
MAE (testset)     2.0572  2.0612  2.0460  2.0752  2.0582  2.0596  0.0094  
RMSE (trainset)   2.0864  2.0944  2.0974  2.0917  2.0966  2.0933  0.0040  
MAE (trainset)    1.3841  1.3980  1.4005  1.3966  1.

In [149]:
print('Root Mean Square error with User-User Collaboration :',result_cv_user['test_rmse'].mean())
print('Mean Absolute Error with User-User Collaboration: ',result_cv_user['test_mae'].mean())

Root Mean Square error with User-User Collaboration : 2.7397259277202934
Mean Absolute Error with User-User Collaboration:  2.0595560254253287


10. In what business scenario you should use popularity based Recommendation Systems ? [2 Marks]  


***

Items that have been viewed or purchased by the majority of people and have received high recommendations are recommended by systems of recommendation based on popularity.

Suitable for 
1. Popular news
2. Trending videos
3. Current events
***

11. In what business scenario you should use CF based Recommendation Systems ? [2 Marks]  

***
A collaborative filtering recommendation system anticipates what a user might find interesting based on the preferences of many other users.

Business Scenario
1. Recommand products to buyers
2. Streaming platform
***

12. What other possible methods can you think of which can further improve the recommendation for different users ? [2 Marks]  

***
A hybrid approach combines collaborative and content-based filtering techniques while providing recommendations. When making recommendations, both the user-to-item and user-to-user relationships are crucial. This framework offers personalised recommendations, gives mobile recommendations based on the user's understanding, and fixes an issue if a particular customer disregards essential information.

Together, these two strategies will provide you more knowledge and better results since they investigate new routes to significant underlying material and collaborative filtering techniques using data on buyer behaviour.
***