# Lab 09 - Ensembles and Recommender systems
This week we are exploring ensemble methods in scikit-learn and how to build a recommender system using what
you learned in class. Let's import the necessary libraries first.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, VotingClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, pairwise_distances, classification_report
from sklearn import datasets

Let's load the breast cancer dataset. It consists of features computed from images of a fine needle aspirate (FNA) of a
breast mass. The target to predict is whether the mass is benign or malignant. You can find more
information on the dataset [here](https://scikit-learn.org/stable/datasets/index.html#breast-cancer-dataset).

This dataset is already cleaned and encoded. The target feature (diagnosis) is encoded as follows:
* 1 - Benign
* 0 - Malignant

In [2]:
data = datasets.load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.DataFrame(data.target, columns=['target'])

X.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [3]:
y.value_counts()

target
1         357
0         212
dtype: int64

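It is good practice to normalize the features before training scale-sensitive models such as logistic regression. Here each feature is centred on its mean, divided by half of its range, and then shifted by -1.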

In [4]:
X = -1 + (((X - X.mean())*2) / (X.max() - X.min()))
X.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,-0.634369,-1.602614,-0.573893,-0.706375,-0.602063,0.062874,-0.009837,-0.024047,-0.386483,-0.329722,...,-0.351774,-1.444948,-0.229655,-0.440416,-0.606004,-0.201842,-0.297585,0.036383,-0.329689,-0.541464
1,-0.390155,-1.102783,-0.434304,-0.430659,-1.209809,-1.157665,-1.0089,-0.788759,-0.999615,-1.258113,...,-0.379523,-1.120854,-0.48664,-0.471384,-1.113169,-1.131298,-1.048863,-0.509321,-1.059435,-0.933436
2,-0.473453,-0.867409,-0.474384,-0.535007,-0.760951,-0.659168,-0.491093,-0.214902,-0.740019,-1.118265,...,-0.480554,-1.007848,-0.549392,-0.592795,-0.841096,-0.669674,-0.715317,-0.117569,-0.719202,-0.952326
3,-1.256263,-0.926253,-1.198867,-1.228029,-0.166928,0.101521,-0.284908,-0.440548,-0.206685,0.459241,...,-1.096705,-0.956142,-1.083582,-1.153796,0.022669,0.187599,-0.337522,-0.017912,0.473386,0.16823
4,-0.416659,-1.334775,-0.403898,-0.455261,-0.928867,-0.825416,-0.488282,-0.449494,-1.002645,-1.167128,...,-0.553838,-1.480129,-0.55238,-0.658663,-0.933548,-1.095594,-0.795828,-0.670833,-1.211613,-1.094003


In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

## Ensembles
### Bagging

The bagging classifier is available as `sklearn.ensemble.BaggingClassifier`. It needs a base model, and we will use
logistic regression for that. You can find the complete documentation for the bagging classifier
[here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html).
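
To make the idea concrete, here is a minimal hand-rolled sketch of bagging: each logistic regression is trained on a bootstrap sample (rows drawn with replacement), and the predictions are combined. The variable names and the standardization step are assumptions for illustration only. Note that scikit-learn's `BaggingClassifier` averages predicted probabilities when the base model provides them, whereas this sketch uses a plain majority vote.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative data: the same breast cancer dataset, standardized for this sketch.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=42)
scaler = StandardScaler().fit(Xtr)
Xtr, Xte = scaler.transform(Xtr), scaler.transform(Xte)

rng = np.random.default_rng(0)
n_models = 10
models = []
for _ in range(n_models):
    # Bootstrap sample: draw len(Xtr) row indices with replacement.
    idx = rng.integers(0, len(Xtr), size=len(Xtr))
    models.append(LogisticRegression(max_iter=1000).fit(Xtr[idx], ytr[idx]))

# Majority vote over the individual predictions (labels are 0/1; ties go to 1).
votes = np.stack([m.predict(Xte) for m in models])  # shape (n_models, n_test)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
manual_bagging_accuracy = (bagged_pred == yte).mean()
print(f"manual bagging accuracy: {manual_bagging_accuracy:.3f}")
```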

Let's train a bagging classifier with 10 estimators on our training data.

In [6]:
base_log_reg = LogisticRegression(max_iter=1000)
bag_clf = BaggingClassifier(base_estimator=base_log_reg, n_estimators=10)

bag_clf.fit(X_train, y_train.values.ravel())

BaggingClassifier(base_estimator=LogisticRegression(max_iter=1000))

Let's measure the performance.

In [7]:
pred = bag_clf.predict(X_test)
print(classification_report(y_pred=pred, y_true=y_test))

              precision    recall  f1-score   support

           0       1.00      0.96      0.98        54
           1       0.98      1.00      0.99        89

    accuracy                           0.99       143
   macro avg       0.99      0.98      0.99       143
weighted avg       0.99      0.99      0.99       143



### Random Forest
One of the most popular ensemble methods is the random forest classifier. You can find the full documentation on the random
forest implementation [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
Let's train a random forest model with 200 trees where the maximum depth of each tree is two.

In [8]:
rf_clf = RandomForestClassifier(n_estimators=200, max_depth=2)
rf_clf.fit(X_train, y_train.values.ravel())

RandomForestClassifier(max_depth=2, n_estimators=200)

Let's measure the performance of it as well.

In [9]:
pred = rf_clf.predict(X_test)
print(classification_report(y_pred=pred, y_true=y_test))

              precision    recall  f1-score   support

           0       0.98      0.94      0.96        54
           1       0.97      0.99      0.98        89

    accuracy                           0.97       143
   macro avg       0.97      0.97      0.97       143
weighted avg       0.97      0.97      0.97       143
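
As a rough illustration of why combining many shallow trees helps, the sketch below compares each individual depth-2 tree with the whole forest. It assumes the raw (unnormalized) breast cancer features and a fixed `random_state`, neither of which is used in the cell above, so the numbers will not match the report exactly. The forest usually scores higher than the average single tree.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

forest = RandomForestClassifier(n_estimators=200, max_depth=2, random_state=0)
forest.fit(X_tr, y_tr)

# Accuracy of each individual depth-2 tree versus the whole forest.
# The sub-trees predict class indices, which coincide with the 0/1 labels here.
tree_acc = np.array([(tree.predict(X_te) == y_te).mean() for tree in forest.estimators_])
forest_acc = forest.score(X_te, y_te)
print(f"mean single-tree accuracy: {tree_acc.mean():.3f}")
print(f"forest accuracy:           {forest_acc:.3f}")
```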



### Voting classifier
The voting classifier is implemented in
[sklearn.ensemble.VotingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier).
This implementation supports weighted voting as well.
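
The sketch below shows, on made-up predictions, what weighted hard voting amounts to: for every sample, each classifier adds its weight to the class it predicted, and the class with the largest total wins. The function name `weighted_hard_vote` and the example predictions are hypothetical and only illustrate the idea.

```python
import numpy as np

# Hypothetical hard predictions from three classifiers on five samples.
preds = np.array([
    [0, 1, 1, 0, 1],   # classifier A
    [0, 0, 1, 1, 1],   # classifier B
    [1, 0, 1, 0, 0],   # classifier C
])

def weighted_hard_vote(preds, weights, n_classes=2):
    """Return, per sample, the class with the largest total weight."""
    weights = np.asarray(weights, dtype=float)
    totals = np.zeros((n_classes, preds.shape[1]))
    for c in range(n_classes):
        # Sum the weights of the classifiers that voted for class c.
        totals[c] = ((preds == c) * weights[:, None]).sum(axis=0)
    return totals.argmax(axis=0)

uniform = weighted_hard_vote(preds, [1, 1, 1])  # plain majority vote
skewed = weighted_hard_vote(preds, [1, 1, 3])   # classifier C counts three times
print(uniform)  # [0 0 1 0 1]
print(skewed)   # [1 0 1 0 0]
```

With uniform weights this reduces to a simple majority vote; giving classifier C a weight of 3 lets it overrule the other two whenever they agree with each other but not with C.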

We are going to train a voting classifier with three classifiers (a logistic regression classifier, a random forest classifier,
and a decision tree classifier) with uniform weights.

In [10]:
tree_clf = DecisionTreeClassifier()
log_reg_clf = LogisticRegression(max_iter=1000)
randf_clf = RandomForestClassifier()

voting_clf = VotingClassifier([('decTree', tree_clf), ('LogReg', log_reg_clf), ('RandForest', randf_clf)], weights=None)
voting_clf.fit(X_train, y_train.values.ravel())

VotingClassifier(estimators=[('decTree', DecisionTreeClassifier()),
                             ('LogReg', LogisticRegression(max_iter=1000)),
                             ('RandForest', RandomForestClassifier())])

Let's measure the performance of our voting classifier.

In [11]:
pred = voting_clf.predict(X_test)
print(classification_report(y_pred=pred, y_true=y_test))

              precision    recall  f1-score   support

           0       1.00      0.94      0.97        54
           1       0.97      1.00      0.98        89

    accuracy                           0.98       143
   macro avg       0.98      0.97      0.98       143
weighted avg       0.98      0.98      0.98       143



### Boosting
We will train a model using the gradient boosting algorithm. You can find more information about scikit-learn's gradient boosting
classifier
[here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier).
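
The core loop of gradient boosting can be sketched in a few lines. The example below is a simplification: it uses regression with squared loss on synthetic data, where the negative gradient is just the residual, whereas `GradientBoostingClassifier` optimizes a classification loss. The data, learning rate, and tree depth are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (made up for this sketch).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
target = np.sin(x).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(target, target.mean())  # start from a constant model
mse_history = [np.mean((target - prediction) ** 2)]

for _ in range(50):
    residuals = target - prediction  # negative gradient of the squared loss
    small_tree = DecisionTreeRegressor(max_depth=2).fit(x, residuals)
    # Each new tree corrects a fraction of the current ensemble's errors.
    prediction += learning_rate * small_tree.predict(x)
    mse_history.append(np.mean((target - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```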

Let's train a gradient boosting model.

In [12]:
gb_clf = GradientBoostingClassifier(n_estimators=200)
gb_clf.fit(X_train, y_train.values.ravel())

GradientBoostingClassifier(n_estimators=200)

In [13]:
pred = gb_clf.predict(X_test)
print(classification_report(y_pred=pred, y_true=y_test))

              precision    recall  f1-score   support

           0       0.94      0.94      0.94        54
           1       0.97      0.97      0.97        89

    accuracy                           0.96       143
   macro avg       0.96      0.96      0.96       143
weighted avg       0.96      0.96      0.96       143



Finally, comparing the results above, we can see that logistic regression with bagging performs best
on this particular dataset. This is a nice example showing that more sophisticated
algorithms do not always give better predictions; sometimes simple solutions work better. However, that does
not mean the bagging classifier is a better ensemble method in general than the other algorithms used here. It only means that, for this
dataset and this train/test split, the bagging classifier gives the best model of the ones we tried.

### Task

Change the normalization method to each of the following:
* Min-max scaling
* Mean normalization
* Standard scaling (standardization)

and see what happens to the performance of the models.
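
As a starting point, the three options could be set up roughly as follows. This is a sketch under a few assumptions: "mean scaling" is read as mean normalization, `(x - mean) / (max - min)`, and every scaler is fitted on the training split only and then applied to the test split. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_raw, y_raw = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, random_state=42)

# Min-max scaling: each feature is mapped to [0, 1] on the training set.
minmax = MinMaxScaler().fit(X_tr)
X_tr_minmax, X_te_minmax = minmax.transform(X_tr), minmax.transform(X_te)

# Standard scaling: zero mean and unit variance on the training set.
standard = StandardScaler().fit(X_tr)
X_tr_std, X_te_std = standard.transform(X_tr), standard.transform(X_te)

# Mean normalization: (x - mean) / (max - min), computed by hand.
tr_mean = X_tr.mean(axis=0)
tr_range = X_tr.max(axis=0) - X_tr.min(axis=0)
X_tr_mean, X_te_mean = (X_tr - tr_mean) / tr_range, (X_te - tr_mean) / tr_range
```

Each pair of arrays can then be passed to the models above in place of `X_train` and `X_test`.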

## Recommender systems
We are using a movie ratings dataset for this lab. First, we have to import the ratings data.

In [14]:
ratings_data = pd.read_csv("ratings.csv")
ratings_data.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [15]:
ratings_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


Each row in the dataset corresponds to one rating.
* The userId column contains the ID of the user who left the rating.
* The movieId column contains the ID of the movie.
* The rating column contains the rating left by the user. Ratings range from 0.5 to 5 in steps of 0.5.
* The timestamp column contains the time at which the user left the rating, as a Unix timestamp in seconds.

This dataset contains the IDs of the movies but not their titles. We'll need the titles of the movies being
rated. They are stored in the "movies.csv" file.

In [16]:
movie_names = pd.read_csv("movies.csv")
movie_names.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [17]:
movie_names.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


We need a dataset that contains the userId, the movie title, and the rating. We have this information in two different
dataframe objects: "ratings_data" and "movie_names". To get it into a single dataframe, we can
merge the two dataframe objects on the movieId column, since it is common to both.

In [18]:
movie_data = pd.merge(ratings_data, movie_names, on='movieId')
movie_data.head()

,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


In [19]:
movie_data.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 100836 entries, 0 to 100835
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
 4   title      100836 non-null  object 
 5   genres     100836 non-null  object 
dtypes: float64(1), int64(3), object(2)
memory usage: 5.4+ MB


Now let's split our data into train and test sets.

In [20]:
train_data, test_data = train_test_split(movie_data, test_size=0.3, random_state=42)

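# Named rating_splits so that it does not shadow the sklearn `datasets` module imported above.
rating_splits = [train_data, test_data]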

Now we have to create a dataset that represents all the ratings of a user as a single row, with one column per movie.

In [21]:
user_rating_df = pd.DataFrame(index=train_data['userId'].unique(), columns=movie_data['movieId'].unique())
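# .at writes one cell by label and avoids chained indexing, which pandas may not apply to the original frame.
for row in train_data.itertuples(index=False):
    user_rating_df.at[row.userId, row.movieId] = row.rating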

user_rating_df.head()

,1,3,6,47,50,70,101,110,151,157,...,147662,148166,149011,152372,158721,160341,160527,160836,163937,163981
488,,,,,,,,,,,...,,,,,,,,,,
129,,,,3.5,,,,,,,...,,,,,,,,,,
489,,,,3.0,,2.0,,4.5,3.5,,...,,,,,,,,,,
509,4.0,,,,,,,,,,...,,,,,,,,,,
200,,,,,,,,,,,...,,,,,,,,,,


From here onwards **you are going to** implement the algorithm.

With the datasets ready, we can now use the pairwise distance function or the pandas corr() method to create the
similarity (correlation) matrices for both users and items from the training set. We are going to use Pearson correlation. You can find information on
pairwise distance
[here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html). Information
about the metric you are going to use can be found
[here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.correlation.html).

Note:
* The pairwise distance function calculates the distance, not the similarity.
* Replace NaN with 0.
* Normalizing the data is really important for user-based CF.
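
The notes above can be illustrated on a tiny made-up matrix. This is only a sketch of the mechanics, not a reference solution: the user IDs, movie IDs, and ratings are invented, and the normalization step is left out. The correlation distance equals one minus the Pearson correlation, so the similarity is recovered as `1 - distance`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

# Toy user x movie matrix; NaN marks a missing rating (values are made up).
toy = pd.DataFrame(
    [[5.0, 4.0, np.nan, 1.0],
     [4.0, 5.0, 1.0, np.nan],
     [1.0, np.nan, 5.0, 4.0]],
    index=[10, 20, 30],      # hypothetical user IDs
    columns=[1, 2, 3, 4],    # hypothetical movie IDs
)
filled = toy.fillna(0)

# Rows are users, so row-wise distances compare users with each other.
user_sim = pd.DataFrame(
    1 - pairwise_distances(filled.to_numpy(), metric="correlation"),
    index=toy.index, columns=toy.index,
)
# Transposing the matrix compares movies instead of users.
item_sim = pd.DataFrame(
    1 - pairwise_distances(filled.T.to_numpy(), metric="correlation"),
    index=toy.columns, columns=toy.columns,
)
print(user_sim.round(3))
```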

In [22]:
# Generate user similarity matrix


In [23]:
# Generate item similarity matrix


With the similarity matrices in hand, we can now predict the ratings that are missing from the training data. We can compare these predictions with the test data to validate the quality of our recommender model.

For user-based CF we are using k-nearest neighbours. The steps are:
* Rank all the users and find the top-k most similar users for a particular user in the training set.
* Aggregate the ratings of the k nearest neighbours by averaging them to form the prediction.
* Return the predictions.
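
The steps above can be sketched on toy NumPy arrays. The function name `predict_user_based`, the positional (not ID-based) indexing, and the numbers are assumptions for illustration. Missing ratings are taken to be already replaced by 0, so they are included in the average.

```python
import numpy as np

def predict_user_based(ratings, user_sim, k, user_idx):
    """Predict one user's ratings as the mean over the k most similar users."""
    sims = user_sim[user_idx].astype(float)  # astype returns a copy
    sims[user_idx] = -np.inf                 # never pick the user themselves
    neighbours = np.argsort(sims)[::-1][:k]  # indices of the top-k similarities
    return ratings[neighbours].mean(axis=0)

# Toy data: 4 users x 3 movies, with 0 standing in for "not rated".
ratings = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 0.0],
    [1.0, 1.0, 5.0],
    [5.0, 4.0, 1.0],
])
user_sim = np.array([
    [1.0, 0.9, 0.1, 0.8],
    [0.9, 1.0, 0.2, 0.7],
    [0.1, 0.2, 1.0, 0.3],
    [0.8, 0.7, 0.3, 1.0],
])

predicted = predict_user_based(ratings, user_sim, k=2, user_idx=0)
print(predicted)  # [4.5 2.  0.5]
```

For user 0 the two nearest neighbours are users 1 and 3, so each prediction is the average of their two ratings.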

Create a function that takes the training set, the user similarity matrix, K (the number of neighbours to consider), and
user_Id as inputs and returns the predicted rating values for that user for all the movies.

For example, for this dataset:
* for user 509, the predicted normalized rating for movie Id 50 is 0.3333
* for user 19, the predicted normalized rating for movie Id 260 is 0.5833

In [24]:
# Implement your function here.

Implement a function that takes the training set, the item similarity matrix, K (the number of neighbours to consider), and
movie_Id as inputs and returns the predicted rating values for all the users using item-based CF.

For example, for this dataset:
* for user 63, the predicted normalized rating for movie Id 50 is 0.4444
* for user 19, the predicted normalized rating for movie Id 260 is 0.4167
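
A possible shape for the item-based variant, again as a sketch on toy arrays: the k most similar movies are found from the item similarity matrix, and every user's prediction is the average of that user's ratings for those movies. The function name `predict_item_based`, the positional indexing, and the data are hypothetical, and the toy numbers are unrelated to the expected values listed above.

```python
import numpy as np

def predict_item_based(ratings, item_sim, k, item_idx):
    """Predict every user's rating for one item from its k most similar items."""
    sims = item_sim[item_idx].astype(float)  # astype returns a copy
    sims[item_idx] = -np.inf                 # exclude the item itself
    neighbours = np.argsort(sims)[::-1][:k]  # indices of the top-k similar items
    return ratings[:, neighbours].mean(axis=1)

# Toy data: 3 users x 3 movies, with 0 standing in for "not rated".
ratings = np.array([
    [5.0, 3.0, 4.0],
    [4.0, 0.0, 5.0],
    [1.0, 5.0, 2.0],
])
item_sim = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
])

predicted_item = predict_item_based(ratings, item_sim, k=2, item_idx=0)
print(predicted_item)  # [3.5 2.5 3.5]
```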

In [25]:
# Implement your function here.

### Evaluation

We are going to use Root Mean Square Error (RMSE) as our evaluation metric.

The evaluation steps are as follows:
* Using the implemented function, generate a completed training set.
* For each instance in the test set,
    * Check whether that userId and movieId exist in the completed training set. If not, skip the instance.
    * Get the predicted rating from the completed training set and measure the error.
* Using the error values, calculate the RMSE for your prediction model.
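
The evaluation loop could look roughly like the sketch below. The `completed` matrix and `test_ratings` frame are tiny made-up stand-ins for the completed training set and the test set, and normalization is ignored here.

```python
import numpy as np
import pandas as pd

# Made-up completed matrix: rows are user IDs, columns are movie IDs.
completed = pd.DataFrame([[4.0, 3.0], [2.0, 5.0]], index=[1, 2], columns=[10, 20])

# Made-up test ratings; user 3 does not appear in the completed matrix.
test_ratings = pd.DataFrame({
    "userId": [1, 2, 3],
    "movieId": [10, 20, 10],
    "rating": [5.0, 5.0, 1.0],
})

errors = []
for row in test_ratings.itertuples(index=False):
    # Skip pairs that cannot be looked up in the completed matrix.
    if row.userId not in completed.index or row.movieId not in completed.columns:
        continue
    errors.append(completed.at[row.userId, row.movieId] - row.rating)

rmse = float(np.sqrt(np.mean(np.square(errors))))
print(f"RMSE over {len(errors)} test ratings: {rmse:.4f}")
```

Here two test ratings are usable, with errors of -1 and 0, so the RMSE is the square root of 0.5.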

Calculate the user-based CF model performance. Please normalize the test data using the same values you used to
normalize the training data.

In [26]:
# Calculate the performance here.

Calculate the item-based CF model performance here.

In [27]:
# Calculate the performance here.

