**NYU CDS - Introduction to Data Science (Fall 2021)**
* Student NET-ID: N14948495
* Project deadline: Dec 06, 2021, 11:59pm


**<font size=5> Data analysis Project 2: Correlation and Regression of Movie Ratings Data </font>**

In [3]:
# Import libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import warnings
warnings.filterwarnings("ignore")

**Dataset description**

This dataset features ratings data of 400 movies from 1097 research participants. 

* 1st row: Headers (Movie titles/questions) – note that the indexing in this list is from 1
* Rows 2-1098: Responses from individual participants
* Columns 1-400: These columns contain the ratings for the 400 movies (0 to 4, and missing)
* Columns 401-421: These columns contain self-assessments on sensation seeking behaviors (1-5)
* Columns 422-464: These columns contain responses to personality questions (1-5)
* Columns 465-474: These columns contain self-reported movie experience ratings (1-5)
* Column 475: Gender identity (1 = female, 2 = male, 3 = self-described)
* Column 476: Only child (1 = yes, 0 = no, -1 = no response)
* Column 477: Movies are best enjoyed alone (1 = yes, 0 = no, -1 = no response)

Note that we did most of the data munging for you already (e.g. Python interprets commas in a csv file as separators, so we removed all commas from movie titles), but you still need to handle missing data.

### Q1:


**Note:** Fill in every missing value in the data with the average of its column.



In this problem, **the most correlated** means the largest correlation in absolute value.


1.1. For every user in the given data, find its most correlated user. 

1.2. What is the pair of the most correlated users in the data? 

1.3. What is the value of this highest correlation?

1.4. For users 0, 1, 2, $\dots$, 9, print their most correlated users. 
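The two conventions above (column-mean imputation and ranking by absolute correlation) can be sketched on a toy table. This is a minimal illustration, not the assignment's reference solution: the frame `toy`, its values, and the names `by_abs` and `by_sign` are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: rows are users, columns are movies
toy = pd.DataFrame({
    "movie_a": [4.0, 3.5, 0.5, np.nan],
    "movie_b": [3.0, np.nan, 1.0, 2.5],
    "movie_c": [1.0, 0.5, 4.0, 1.5],
    "movie_d": [2.0, 2.5, 3.5, np.nan],
})

# Replace each missing rating with the mean of its column
toy_filled = toy.fillna(toy.mean())

# DataFrame.corr() correlates columns, so transpose to correlate users
user_corr = toy_filled.T.corr()

# Candidates for user 0: every other user (drop the self-correlation)
row = user_corr.loc[0].drop(0)
by_abs = row.abs().idxmax()   # largest correlation in absolute value
by_sign = row.idxmax()        # largest signed correlation
print(by_abs, by_sign)
```

Ranking by signed correlation and ranking by absolute value need not agree: a strongly negative correlation counts as "most correlated" only under the absolute-value convention.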



In [4]:
df = pd.read_csv(r'..\movieReplicationSet.csv') # Import data (raw string keeps the backslash literal)

df.fillna(df.mean(), inplace=True) # Fill in missing values
display(df.head())

Unnamed: 0,The Life of David Gale (2003),Wing Commander (1999),Django Unchained (2012),Alien (1979),Indiana Jones and the Last Crusade (1989),Snatch (2000),Rambo: First Blood Part II,Fargo (1996),Let the Right One In (2008),Black Swan (2010),...,When watching a movie I cheer or shout or talk or curse at the screen,When watching a movie I feel like the things on the screen are happening to me,As a movie unfolds I start to have problems keeping track of events that happened earlier,"The emotions on the screen ""rub off"" on me - for instance if something sad is happening I get sad or if something frightening is happening I get scared",When watching a movie I get completely immersed in the alternative reality of the film,Movies change my position on social economic or political issues,When watching movies things get so intense that I have to stop watching,Gender identity (1 = female; 2 = male; 3 = self-described),Are you an only child? (1: Yes; 0: No; -1: Did not respond),Movies are best enjoyed alone (1: Yes; 0: No; -1: Did not respond)
0,2.151316,2.021127,4.0,2.707612,3.0,2.597656,2.365385,2.899606,2.49635,2.911565,...,1.0,6.0,2.0,5.0,5.0,5.0,1.0,1.0,0,1
1,2.151316,2.021127,1.5,2.707612,2.778618,2.597656,2.365385,2.899606,2.49635,2.911565,...,3.0,1.0,1.0,6.0,5.0,3.0,2.0,1.0,0,0
2,2.151316,2.021127,3.153422,2.707612,2.778618,2.597656,2.365385,2.899606,2.49635,2.911565,...,5.0,4.0,3.0,5.0,5.0,4.0,4.0,1.0,1,0
3,2.151316,2.021127,2.0,2.707612,3.0,2.597656,2.365385,2.899606,2.49635,4.0,...,3.0,1.0,1.0,4.0,5.0,3.0,1.0,1.0,0,1
4,2.151316,2.021127,3.5,2.707612,0.5,2.597656,0.5,1.0,2.49635,0.0,...,2.0,3.0,2.0,5.0,6.0,4.0,4.0,1.0,1,1


**Question 1.1: For every user in the given data, find its most correlated user**

In [9]:
ratings = df.iloc[:, :400]  # movie-rating columns only
my_dict = {}

for user in range(ratings.shape[0]):
    # Correlation of this user with every other user (self excluded)
    corr = ratings.corrwith(ratings.iloc[user], axis=1).drop(user)
    # "Most correlated" means the largest correlation in absolute value
    best = corr.abs().idxmax()
    my_dict[user] = [best, corr.loc[best]]

In [10]:
correlation_df = pd.DataFrame.from_dict(my_dict, orient='index', columns=['correlated_user', 'correlation'])
display(correlation_df.head())

Unnamed: 0,correlated_user,correlation
0,118,0.564325
1,831,0.831628
2,896,0.944122
3,19,0.540632
4,784,0.47703
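Looping over users calls `corrwith` once per row, which can be slow for about a thousand users. As a hedged alternative, the whole user-by-user correlation matrix can be computed in one call with `np.corrcoef`. The sketch below uses a small random array as a stand-in for the imputed ratings; `ratings_demo` and the other names are assumptions, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the imputed ratings block: 6 hypothetical users x 20 movies
ratings_demo = rng.uniform(0.0, 4.0, size=(6, 20))

# np.corrcoef treats each row as one variable, giving a users x users matrix
corr = np.corrcoef(ratings_demo)

# Rank by absolute value and make sure a user never matches themselves
abs_corr = np.abs(corr)
np.fill_diagonal(abs_corr, -np.inf)

# Most correlated user for every user, with the signed correlation
best_match = abs_corr.argmax(axis=1)
best_value = corr[np.arange(len(corr)), best_match]

# Most correlated pair overall
i, j = np.unravel_index(abs_corr.argmax(), abs_corr.shape)
print(best_match, (i, j), corr[i, j])
```

On the real data the same steps would apply to `df.iloc[:, :400].to_numpy()` after imputation, assuming no user has constant ratings (a constant row has undefined correlation).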


**Question 1.2: What is the pair of the most correlated users in the data?**

In [97]:
temp_df = correlation_df.sort_values('correlation', key=lambda s: s.abs(), ascending=False) # Rank by absolute correlation
display(temp_df.head())
print('The most correlated users are: {} and {}'.format(temp_df.index[0], temp_df.iloc[0,0]))

Unnamed: 0,correlated_user,correlation
896,831,0.999542
831,896,0.999542
858,896,0.960376
456,449,0.952448
449,456,0.952448


The most correlated users are: 896 and 831


**Question 1.3: What is the value of this highest correlation?**

In [93]:
print('The correlation between the highest correlated users is: {}'.format(temp_df.iloc[0,1]))

The correlation between the highest correlated users is: 0.9995424261495214


**Question 1.4: For users 0, 1, 2, $\dots$, 9, print their most correlated users.**

In [98]:
display(correlation_df.head(10))

Unnamed: 0,correlated_user,correlation
0,583,0.551171
1,831,0.725494
2,896,0.784047
3,364,0.640055
4,896,0.528441
5,99,0.612641
6,239,0.602601
7,896,0.5141
8,896,0.706144
9,1004,0.752591


### Q2:

We want to model the relationship between the ratings and the personal part of the data. To do so, consider:


**Part 1**: the ratings of all users over columns 1-400: 

-- Columns 1-400: These columns contain the ratings for the 400 movies (0 to 4, and missing);

call this part `df_rate`


and 


**Part 2**:  the part of the data which includes all users over columns 401-474

-- Columns 401-421: These columns contain self-assessments on sensation seeking behaviors (1-5)

-- Columns 422-464: These columns contain responses to personality questions (1-5)

-- Columns 465-474: These columns contain self-reported movie experience ratings (1-5)

call this part `df_pers`.

---

Our main task is to model: 


`df_pers = function(df_rate)`


---

**Note:** Split the original data into training and testing sets with a ratio of 0.80:0.20. 
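A minimal sketch of such a split, assuming random stand-ins `X_demo` (ratings) and `Y_demo` (personal responses) rather than the real data. Passing both arrays to `train_test_split` keeps the rows aligned, and `test_size=0.20` reserves 20% of the users for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 4.0, size=(50, 8))  # hypothetical ratings
Y_demo = rng.uniform(1.0, 5.0, size=(50, 3))  # hypothetical personal responses

# Rows of X_demo and Y_demo are shuffled together, then split 80/20
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.20, random_state=0
)
print(X_tr.shape, X_te.shape, Y_tr.shape, Y_te.shape)
```

Fixing `random_state` makes the split reproducible across runs.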


2.1. Model `df_pers = function(df_rate)` using linear regression. 

What are the errors on: (i) the training part; (ii) the testing part?




2.2. Model `df_pers = function(df_rate)` using ridge regression with hyperparameter values alpha from [0.0, 1e-8, 1e-5, 0.1, 1, 10]. 

For each of these values of alpha, what are the errors on: (i) the training part; (ii) the testing part?

What is the best choice for alpha?



2.3. Model `df_pers = function(df_rate)` using lasso regression with hyperparameter values alpha from [1e-3, 1e-2, 1e-1, 1]. 

For each of these values of alpha, what are the errors on: (i) the training part; (ii) the testing part?

What is the best choice for alpha?


**Note**: Ignore any `convergence warning` you may get from the Lasso regression.




In [42]:
# Divides the dataset into features and target
df_rate = df.iloc[:,:400]
df_pers = df.iloc[:,400:474]

# Divides the dataset into training and testing
X_train, X_test, y_train, y_test = train_test_split(df_rate, df_pers, test_size=0.2, random_state=101)

**Model 1: Linear Regression**

In [33]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression().fit(X_train, y_train)

from sklearn.metrics import mean_squared_error
y_train_pred = linreg.predict(X_train)
y_test_pred = linreg.predict(X_test) 
print('Training MSE for this Linear Regression model: {}'
      .format(round(mean_squared_error(y_train, y_train_pred), 3)))
print('Testing MSE for this Linear Regression model: {}'
      .format(round(mean_squared_error(y_test, y_test_pred), 3)))

Training MSE for this Linear Regression model: 0.627
Testing MSE for this Linear Regression model: 3.304
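Because the target here has many columns, each MSE above is a single number summarizing all of them: by default `mean_squared_error` averages the per-column errors uniformly. The sketch below illustrates this on synthetic data; `X_syn`, `Y_syn`, and the coefficient matrix `W` are assumptions made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(40, 5))                      # hypothetical features
W = rng.normal(size=(5, 3))                           # hypothetical true weights
Y_syn = X_syn @ W + rng.normal(scale=0.5, size=(40, 3))  # three target columns

# LinearRegression fits all target columns at once
model = LinearRegression().fit(X_syn, Y_syn)
Y_hat = model.predict(X_syn)

# Default: one number, the uniform average over target columns
overall = mean_squared_error(Y_syn, Y_hat)
# One MSE per target column
per_target = mean_squared_error(Y_syn, Y_hat, multioutput="raw_values")
print(overall, per_target)
```

Inspecting `per_target` on the real data would show which personal questions are predicted better or worse than the average.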


**Model 2: Ridge Regression**

In [34]:
from sklearn.linear_model import Ridge
for a in [0, 1e-8, 1e-5, 0.1, 1, 10]:
    ridge = Ridge(alpha = a).fit(X_train, y_train)
    y_pred_train = ridge.predict(X_train)
    y_pred_test = ridge.predict(X_test)
    print('The training MSE for this Ridge Regression model with alpha {} is: {}'
          .format(a, round(mean_squared_error(y_train, y_pred_train), 3)))
    print('The testing MSE for this Ridge Regression model with alpha {} is: {}\n'
          .format(a, round(mean_squared_error(y_test, y_pred_test), 3)))

The training MSE for this Ridge Regression model with alpha 0 is: 0.627
The testing MSE for this Ridge Regression model with alpha 0 is: 3.304

The training MSE for this Ridge Regression model with alpha 1e-08 is: 0.627
The testing MSE for this Ridge Regression model with alpha 1e-08 is: 3.304

The training MSE for this Ridge Regression model with alpha 1e-05 is: 0.627
The testing MSE for this Ridge Regression model with alpha 1e-05 is: 3.304

The training MSE for this Ridge Regression model with alpha 0.1 is: 0.627
The testing MSE for this Ridge Regression model with alpha 0.1 is: 3.201

The training MSE for this Ridge Regression model with alpha 1 is: 0.632
The testing MSE for this Ridge Regression model with alpha 1 is: 2.69

The training MSE for this Ridge Regression model with alpha 10 is: 0.684
The testing MSE for this Ridge Regression model with alpha 10 is: 1.818



The best choice for alpha is 10, which gives the lowest testing MSE (1.818) among the values tried.
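Choosing alpha by the lowest testing MSE can also be done programmatically rather than by reading the printout. The following is a sketch on synthetic data, with `X_syn`, `Y_syn`, and the dictionary `test_mse` introduced only for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(60, 30))                              # hypothetical features
Y_syn = X_syn[:, :3] + rng.normal(scale=1.0, size=(60, 3))     # hypothetical targets
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_syn, Y_syn, test_size=0.2, random_state=0
)

# Record the testing MSE for each candidate alpha
test_mse = {}
for alpha in [1e-5, 0.1, 1, 10]:
    model = Ridge(alpha=alpha).fit(X_tr, Y_tr)
    test_mse[alpha] = mean_squared_error(Y_te, model.predict(X_te))

# The alpha whose model has the smallest testing MSE
best_alpha = min(test_mse, key=test_mse.get)
print(best_alpha, test_mse[best_alpha])
```

The same loop works for `Lasso` by swapping the estimator and the list of alpha values.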

**Model 3: Lasso Regression**

In [41]:
from sklearn.linear_model import Lasso
for a in [1e-3, 1e-2, 1e-1, 1]:
    lasso = Lasso(alpha = a).fit(X_train, y_train)
    y_pred_train = lasso.predict(X_train)
    y_pred_test = lasso.predict(X_test)
    print('The training MSE for this Lasso Regression model with alpha {} is: {}'
          .format(a, round(mean_squared_error(y_train, y_pred_train), 3)))
    print('The testing MSE for this Lasso Regression model with alpha {} is: {}\n'
          .format(a, round(mean_squared_error(y_test, y_pred_test), 3)))

The training MSE for this Lasso Regression model with alpha 0.001 is: 0.651
The testing MSE for this Lasso Regression model with alpha 0.001 is: 2.241

The training MSE for this Lasso Regression model with alpha 0.01 is: 0.911
The testing MSE for this Lasso Regression model with alpha 0.01 is: 1.293

The training MSE for this Lasso Regression model with alpha 0.1 is: 1.224
The testing MSE for this Lasso Regression model with alpha 0.1 is: 1.198

The training MSE for this Lasso Regression model with alpha 1 is: 1.241
The testing MSE for this Lasso Regression model with alpha 1 is: 1.206



The best choice for alpha is 0.1, which gives the lowest testing MSE (1.198) among the values tried.