### NYU CDS

### Fall 2021

### Introduction to Data Science

### Project 2

### student netid: cya220

### deadline: Dec 06, 2021, 11:59pm

---
# Data analysis Project 2
### Correlation and Regression of Movie Ratings Data
---

### Dataset description

This dataset features ratings data of 400 movies from 1097 research participants. 

* 1st row: Headers (Movie titles/questions) – note that the indexing in this list is from 1
* Rows 2-1098: Responses from individual participants
* Columns 1-400: These columns contain the ratings for the 400 movies (0 to 4, and missing)
* Columns 401-421: These columns contain self-assessments on sensation seeking behaviors (1-5)
* Columns 422-464: These columns contain responses to personality questions (1-5)
* Columns 465-474: These columns contain self-reported movie experience ratings (1-5)
* Column 475: Gender identity (1 = female, 2 = male, 3 = self-described)
* Column 476: Only child (1 = yes, 0 = no, -1 = no response)
* Column 477: Movies are best enjoyed alone (1 = yes, 0 = no, -1 = no response)

Note that we did most of the data munging for you already (e.g. Python interprets commas in a csv file as separators, so we removed all commas from movie titles), but you still need to handle missing data.




### Q1:


**Note:** Fill every missing value in the data with the average of its column.

In [1]:
import numpy as np
from scipy import stats
import pandas as pd

df = pd.read_csv('movieReplicationSet.csv',skipinitialspace=True)
df = df.iloc[:,:-3] #Remove the last three columns (475-477), which are not used in this project
df = df.fillna(df.mean()) #Fill missing values with column mean
df_movies = df.iloc[:,:400] #Take only the movie ratings columns
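The column-mean imputation used above can be checked on a tiny example. This is a minimal sketch with made-up ratings; the frame `toy` and its column names are hypothetical and not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: 3 users x 2 movies, NaN marks a missing rating
toy = pd.DataFrame({"movie_a": [4.0, np.nan, 2.0],
                    "movie_b": [np.nan, 1.0, 3.0]})

# DataFrame.mean() skips NaN by default, so each mean uses observed ratings only
col_means = toy.mean()

# fillna with a Series aligns on column names: each gap gets its own column's mean
filled = toy.fillna(col_means)
print(filled)
```

Because the means are computed per column, a missing rating is replaced by the average rating of that movie, not by the average rating of that user.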

In this problem, the "most correlated" user is the one with the largest correlation in absolute value, so a strong negative correlation counts as much as a strong positive one.

1.1. For every user in the given data, find its most correlated user.

In [2]:
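#Transpose so that users become columns, then compute the absolute user-by-user correlation matrix
df_corr = df_movies.T.corr(method = 'pearson').abs()
np.fill_diagonal(df_corr.values, 0) #Set diagonal elements to zero so no user is matched with itself

#For each user (row), take the position of the largest absolute correlation
most_corr = pd.DataFrame(df_corr.values.argmax(axis=1))
most_corr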

| | 0 |
|---|---|
| 0 | 118 |
| 1 | 831 |
| 2 | 896 |
| 3 | 19 |
| 4 | 784 |
| ... | ... |
| 1092 | 896 |
| 1093 | 784 |
| 1094 | 896 |
| 1095 | 896 |
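To see why the absolute value matters, here is a small sketch on made-up data following the same recipe (transpose, correlate, zero the diagonal, take the row-wise argmax). The ratings in `toy_ratings` are hypothetical and chosen so that one pair of users is strongly negatively correlated.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are users, columns are movies
toy_ratings = pd.DataFrame(
    [[0.0, 1.0, 2.0, 3.0],   # user 0
     [4.0, 3.0, 1.0, 0.0],   # user 1: nearly the opposite taste of user 0
     [0.0, 2.0, 1.0, 3.0]],  # user 2: similar taste to user 0
)

# Absolute user-by-user Pearson correlations, copied so the array is writable
toy_corr = toy_ratings.T.corr(method="pearson").abs().to_numpy().copy()
np.fill_diagonal(toy_corr, 0)  # a user must not be matched with itself

# Position of the largest absolute correlation in each row
toy_most_corr = toy_corr.argmax(axis=1)
print(toy_most_corr)
```

In this toy case user 0 is matched with user 1 (correlation about -0.99) rather than with user 2 (correlation 0.8), which is the behavior the absolute-value convention asks for.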


1.2. What is the pair of the most correlated users in the data?

In [3]:
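max_corr = pd.DataFrame(df_corr.max()) #Largest absolute correlation for each user
user_1 = np.argmax(max_corr) #User whose best match is the strongest overall
user_2 = most_corr.iloc[user_1][0] #That user's most correlated partner
print(user_1,"&",user_2)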

831 & 896


1.3. What is the value of this highest correlation?

In [4]:
print(max_corr.max()[0])

0.9987890924779799
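The pair and its correlation can also be read directly off the matrix in one step. The sketch below uses a small hypothetical matrix `abs_corr` in place of `df_corr`; it assumes the matrix is symmetric with a zeroed diagonal, as constructed in 1.1.

```python
import numpy as np

# Hypothetical symmetric matrix of absolute correlations with a zeroed diagonal
abs_corr = np.array([[0.0, 0.2, 0.9],
                     [0.2, 0.0, 0.5],
                     [0.9, 0.5, 0.0]])

# argmax on the flattened matrix, then convert back to (row, column) coordinates
flat_pos = abs_corr.argmax()
user_a, user_b = np.unravel_index(flat_pos, abs_corr.shape)

best_value = abs_corr[user_a, user_b]
print(user_a, "&", user_b, "->", best_value)
```

Since the matrix is symmetric, the same value also appears at the mirrored position; `argmax` simply returns the first occurrence.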


1.4. For users 0, 1, 2, $\dots$, 9, print their most correlated users.

In [5]:
most_corr.head(10)

| | 0 |
|---|---|
| 0 | 118 |
| 1 | 831 |
| 2 | 896 |
| 3 | 19 |
| 4 | 784 |
| 5 | 990 |
| 6 | 1071 |
| 7 | 1074 |
| 8 | 821 |
| 9 | 1004 |


### Q2:

We want to model the relationship between the movie ratings and the personal (self-reported) part of the data. To do so, consider:


**Part 1**: the ratings of all users over columns 1-400: 

-- Columns 1-400: These columns contain the ratings for the 400 movies (0 to 4, and missing);

call this part `df_rate`

In [6]:
df_rate = df.iloc[:,:400]

**Part 2**:  the part of the data which includes all users over columns 401-474

-- Columns 401-421: These columns contain self-assessments on sensation seeking behaviors (1-5)

-- Columns 422-464: These columns contain responses to personality questions (1-5)

-- Columns 465-474: These columns contain self-reported movie experience ratings (1-5)

call this part `df_pers`

In [7]:
df_pers = df.iloc[:,400:]

Our main task is to model: 


`df_pers = function(df_rate)`

**Note:** Split the original data into training and testing sets with a ratio of 0.80:0.20.

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(df_rate, df_pers, test_size=0.2, random_state=42)
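The shapes produced by this split can be inspected without the CSV file. The sketch below uses zero-filled placeholder arrays with the dimensions given in the dataset description (1097 users, 400 rating columns, 74 personal columns); the names `X_demo` and `Y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_users = 1097                      # participants, per the dataset description
X_demo = np.zeros((n_users, 400))   # placeholder for df_rate (columns 1-400)
Y_demo = np.zeros((n_users, 74))    # placeholder for df_pers (columns 401-474)

# Same arguments as above: 20% of the rows are held out, rows stay paired across X and Y
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=42
)

print("train:", X_tr.shape, Y_tr.shape)
print("test: ", X_te.shape, Y_te.shape)
```

With 400 predictors and only about 880 training rows, an unregularized linear model has a lot of freedom to fit noise, which is worth keeping in mind when comparing training and testing errors.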

2.1. Model `df_pers = function(df_rate)` using linear regression. 

What are the errors on: (i) the training part; (ii) the testing part?

In [9]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

model = LinearRegression().fit(X_train, Y_train)
Y_hat_train = model.predict(X_train)
Y_hat_test = model.predict(X_test)

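# RMSE via squared=False; newer scikit-learn releases deprecate this argument in favor of root_mean_squared_error
train_rmse = mean_squared_error(Y_train, Y_hat_train, squared=False)
test_rmse = mean_squared_error(Y_test, Y_hat_test, squared=False)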
print("Train Error: ",train_rmse)
print("Test Error: ",test_rmse)

Train Error:  0.7713293217782915
Test Error:  1.7738241527829712
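Because `df_pers` has many columns, "the" RMSE can be aggregated in more than one way, and which one `mean_squared_error(..., squared=False)` returns appears to depend on the installed scikit-learn version. The sketch below spells out two plausible definitions with plain NumPy so the choice is explicit; both helper names are hypothetical and neither is claimed to be the exact computation behind the numbers printed above.

```python
import numpy as np

def mean_columnwise_rmse(y_true, y_pred):
    """RMSE of each target column, then averaged over columns."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    per_column_mse = np.mean((y_true - y_pred) ** 2, axis=0)
    return float(np.mean(np.sqrt(per_column_mse)))

def rmse_of_pooled_mse(y_true, y_pred):
    """Square root of the MSE pooled over all rows and columns."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical two-target example: column 0 is predicted perfectly, column 1 is not
y_true_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
y_pred_demo = np.array([[1.0, 0.0], [3.0, 0.0]])

columnwise = mean_columnwise_rmse(y_true_demo, y_pred_demo)
pooled = rmse_of_pooled_mse(y_true_demo, y_pred_demo)
print(columnwise, pooled)
```

The two definitions agree only when every column has the same error, so it is worth stating which one is reported when comparing results across environments.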


2.2. Model `df_pers = function(df_rate)` using ridge regression with hyperparameter values alpha from [0.0, 1e-8, 1e-5, 0.1, 1, 10]. 

For each of these alpha values, what are the errors on: (i) the training part; (ii) the testing part?

What is the best choice for alpha?

In [10]:
from sklearn.linear_model import Ridge

alphas = [0.0, 1e-8, 1e-5, 0.1, 1, 10]
def ridge_function(alpha_list):
    train_errors = []
    test_errors = []
    for a in alpha_list:
        model = Ridge(alpha=a).fit(X_train, Y_train)
        Y_hat_train = model.predict(X_train)
        Y_hat_test = model.predict(X_test)
        train_rmse = mean_squared_error(Y_train, Y_hat_train, squared=False)
        test_rmse = mean_squared_error(Y_test, Y_hat_test, squared=False)
        train_errors.append(train_rmse)
        test_errors.append(test_rmse)
    return train_errors, test_errors 

train_errors, test_errors = ridge_function(alphas)
print("Lowest Train Error: ", np.min(train_errors))
print("Best Train Alpha: ", alphas[np.argmin(train_errors)])
print("Lowest Test Error: ", np.min(test_errors))
print("Best Test Alpha: ", alphas[np.argmin(test_errors)])
pd.DataFrame(np.stack((np.array(alphas), np.array(train_errors),np.array(test_errors))).T)

Lowest Train Error:  0.7713293217782915
Best Train Alpha:  0.0
Lowest Test Error:  1.3570421726110675
Best Test Alpha:  10


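| | 0 (alpha) | 1 (train error) | 2 (test error) |
|---|---|---|---|
| 0 | 0.0 | 0.771329 | 1.773824 |
| 1 | 1e-08 | 0.771329 | 1.773824 |
| 2 | 1e-05 | 0.771329 | 1.773822 |
| 3 | 0.1 | 0.771397 | 1.750919 |
| 4 | 1.0 | 0.774441 | 1.624254 |
| 5 | 10.0 | 0.805712 | 1.357042 |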
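The pattern in the ridge results (training error creeping up while testing error falls as alpha grows) is what the L2 penalty is designed to produce: larger alpha shrinks the coefficients, trading a slightly worse fit on the training data for less variance. The sketch below illustrates the shrinkage on synthetic data; the arrays `X_syn` and `y_syn` are made up and unrelated to the movie dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data: 50 samples, 40 features, only the first feature carries signal
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(50, 40))
y_syn = X_syn[:, 0] + rng.normal(scale=0.5, size=50)

demo_alphas = [1e-5, 0.1, 1, 10]
coef_norms = []
train_rmses = []
for a in demo_alphas:
    ridge = Ridge(alpha=a).fit(X_syn, y_syn)
    residuals = y_syn - ridge.predict(X_syn)
    coef_norms.append(float(np.linalg.norm(ridge.coef_)))    # size of the weight vector
    train_rmses.append(float(np.sqrt(np.mean(residuals ** 2))))

for a, norm, err in zip(demo_alphas, coef_norms, train_rmses):
    print(f"alpha={a:<8} ||coef||={norm:.4f}  train RMSE={err:.4f}")
```

Note that the "best" alpha in the table above is chosen by the testing error, as the question asks; the training error alone would always favor the smallest alpha.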


2.3. Model `df_pers = function(df_rate)` using lasso regression with hyperparameter values alpha from [1e-3, 1e-2, 1e-1, 1]. 

For each of these alpha values, what are the errors on: (i) the training part; (ii) the testing part?

What is the best choice for alpha?


**Note**: Ignore any `convergence warning` you may obtain from the Lasso regression.

In [11]:
import warnings
from sklearn.linear_model import Lasso
from sklearn.exceptions import ConvergenceWarning

# Suppress Lasso convergence warnings, as the note above allows (avoids sklearn's private testing helpers)
warnings.filterwarnings("ignore", category=ConvergenceWarning)

alphas = [1e-3, 1e-2, 1e-1, 1]
def lasso_function(alpha_list):
    train_errors = []
    test_errors = []
    for a in alpha_list:
        model = Lasso(alpha=a).fit(X_train, Y_train)
        Y_hat_train = model.predict(X_train)
        Y_hat_test = model.predict(X_test)
        train_rmse = mean_squared_error(Y_train, Y_hat_train, squared=False)
        test_rmse = mean_squared_error(Y_test, Y_hat_test, squared=False)
        train_errors.append(train_rmse)
        test_errors.append(test_rmse)
    return train_errors, test_errors 

train_errors, test_errors = lasso_function(alphas)
print("Lowest Train Error: ", np.min(train_errors))
print("Best Train Alpha: ", alphas[np.argmin(train_errors)])
print("Lowest Test Error: ", np.min(test_errors))
print("Best Test Alpha: ", alphas[np.argmin(test_errors)])
pd.DataFrame(np.stack((np.array(alphas), np.array(train_errors),np.array(test_errors))).T)

Lowest Train Error:  0.7864463543603145
Best Train Alpha:  0.001
Lowest Test Error:  1.104763164764148
Best Test Alpha:  0.1


| | 0 (alpha) | 1 (train error) | 2 (test error) |
|---|---|---|---|
| 0 | 0.001 | 0.786446 | 1.483397 |
| 1 | 0.01 | 0.934716 | 1.154863 |
| 2 | 0.1 | 1.085629 | 1.104763 |
| 3 | 1.0 | 1.091588 | 1.108752 |
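Unlike ridge, the L1 penalty in lasso can set coefficients exactly to zero, so larger alpha values typically leave fewer active predictors. The sketch below counts non-zero coefficients on synthetic data as a rough illustration; `X_syn` and `y_syn` are made up, and the counts say nothing definite about how many movie columns survive in the real fit.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 100 samples, 20 features, only the first two carry signal
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(100, 20))
y_syn = 2.0 * X_syn[:, 0] - 1.0 * X_syn[:, 1] + rng.normal(scale=0.1, size=100)

demo_alphas = [1e-3, 1e-2, 1e-1, 1]
nonzero_counts = []
for a in demo_alphas:
    lasso = Lasso(alpha=a).fit(X_syn, y_syn)
    # Count the coefficients that the L1 penalty has not driven to exactly zero
    nonzero_counts.append(int(np.sum(lasso.coef_ != 0)))

for a, k in zip(demo_alphas, nonzero_counts):
    print(f"alpha={a:<6} non-zero coefficients: {k}")
```

When the training and testing errors end up close together at large alpha, as in the last rows of the table above, it may be that the model has dropped most predictors and relies mainly on the intercept; checking `np.sum(model.coef_ != 0)` on the fitted model would confirm or rule that out.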
