# General Instructions to students:

1. There are 5 types of cells in this notebook. The cell type will be indicated within the cell.
    1. Markdown cells with problem written in it. (DO NOT TOUCH THESE CELLS) (**Cell type: TextRead**)
    2. Python cells with setup code for further evaluations. (DO NOT TOUCH THESE CELLS) (**Cell type: CodeRead**)
    3. Python code cells with some template code or empty cell. (FILL CODE IN THESE CELLS BASED ON INSTRUCTIONS IN CURRENT AND PREVIOUS CELLS) (**Cell type: CodeWrite**)
    4. Markdown cells where a written reasoning or conclusion is expected. (WRITE SENTENCES IN THESE CELLS) (**Cell type: TextWrite**)
    5. Temporary code cells for convenience and TAs. (YOU MAY DO WHAT YOU WILL WITH THESE CELLS, TAs WILL REPLACE WHATEVER YOU WRITE HERE WITH OFFICIAL EVALUATION CODE) (**Cell type: Convenience**)
    
2. You are not allowed to insert new cells in the submitted notebook.

3. You are not allowed to import any extra packages.

4. The code is to be written in Python 3.6 syntax. The latest versions of other packages may be assumed.

5. In CodeWrite cells, the only outputs allowed are the plots asked for in the question. Nothing else should be printed or output.

6. If TextWrite cells ask you to give accuracy/error/other numbers you can print them on the code cells, but remove the print statements before submitting.

7. The convenience code can be used to check the expected syntax of the functions. At a minimum, your entire notebook must run with "run all" with the convenience cells left as they are. Any runtime failure in the notebook as submitted will get zero marks.

8. All code must be written by yourself. Copying from other students/material on the web is strictly prohibited. Any violations will result in zero marks.

9. All datasets will be given as .npz files, each containing 4 numpy arrays in this order: "X_train, Y_train, X_test, Y_test". The meaning of the 4 arrays can be inferred from their names.

10. All plots must be labelled properly, all tables must have rows and columns named properly.
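
As an illustration of instruction 9, here is a minimal sketch of writing and reading a `.npz` file with the four named arrays. The file name and the toy shapes are assumptions made for the example, not part of the assignment data.

```python
import os
import tempfile

import numpy as np

# Hypothetical file name and toy shapes, used only for illustration.
rng = np.random.default_rng(0)
path = os.path.join(tempfile.mkdtemp(), "example_dataset.npz")
np.savez(
    path,
    X_train=rng.normal(size=(8, 3)),
    Y_train=rng.normal(size=8),
    X_test=rng.normal(size=(4, 3)),
    Y_test=rng.normal(size=4),
)

# np.load returns a lazy archive; index it by array name.
with np.load(path) as data:
    keys = list(data.files)
    X_train, Y_train = data["X_train"], data["Y_train"]
    X_test, Y_test = data["X_test"], data["Y_test"]
```

Indexing by name, as above, avoids depending on the order in which the arrays were stored.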

In [45]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import json

In [2]:
#CodeRead

data_folder = '../Data_contest/dataset/'


genome_scores_df=pd.read_csv(data_folder+'genome_scores.csv') # Large (500MB)
movies_df=pd.read_csv(data_folder+'movies.csv')
train_df=pd.read_csv(data_folder+'train.csv') # Large 500MB
validation_df = pd.read_csv(data_folder+'validation.csv') 








In [3]:
#CodeRead

# create movie rating dataset from train

# Feature vector for the 10000 movies, each with a 1128 dimensional vector. 
# If a movie doesn't appear in genome_scores we make it simply the 0 vector.
X=np.zeros((10000,1128)) 
movies_with_featvecs=set(genome_scores_df['movieId'])
# The average rating, for each of the movies in the training set. 
# -1 if it is not in the train set.
rating_movies = -1*np.ones(10000) 


In [4]:

# Fill in the feature vector (tag relevance scores) of every movie that appears in genome_scores

for i in range(10000):
    if i not in movies_with_featvecs:
        continue
    temp = genome_scores_df[genome_scores_df['movieId']==i]
    feat_vec= np.array(temp['relevance'])
    X[i,:]=feat_vec
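
The loop above filters the full dataframe once per movie, which can be slow on a large file. A possible vectorized alternative is sketched below on toy data. It assumes the genome scores also carry a `tagId` column (only `movieId` and `relevance` are used in the notebook), and the toy sizes stand in for 10000 movies and 1128 tags.

```python
import numpy as np
import pandas as pd

# Toy stand-in for genome_scores_df; the `tagId` column is an assumption.
toy_scores = pd.DataFrame({
    "movieId":   [1, 1, 1, 3, 3, 3, 9, 9, 9],
    "tagId":     [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "relevance": [0.1, 0.2, 0.3, 0.7, 0.8, 0.9, 0.4, 0.5, 0.6],
})
n_movies, n_tags = 5, 3  # 10000 and 1128 in the notebook

# One row per movie, one column per tag, columns sorted by tag id.
wide = toy_scores.pivot(index="movieId", columns="tagId", values="relevance")
wide = wide.sort_index(axis=1)
# Keep only the movie ids that fit in the feature matrix.
wide = wide.loc[wide.index < n_movies]

# Movies without scores keep the zero vector, as in the loop above.
X_toy = np.zeros((n_movies, n_tags))
X_toy[wide.index.to_numpy()] = wide.to_numpy()
```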


In [5]:

for i in range(10000):
    temp = train_df[train_df['movieId']==i]
    if len(temp)==0:
        continue
    ratings_curr_movies = temp['rating']
    rating_movies[i] = np.mean(ratings_curr_movies)
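
The per-movie average can also be computed in one pass with a group-by. The sketch below uses a toy dataframe in place of `train_df`; the names `toy_train` and `rating_toy` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df, with the two columns used above.
toy_train = pd.DataFrame({
    "movieId": [0, 0, 2, 2, 2, 7],
    "rating":  [4.0, 3.0, 5.0, 4.0, 3.0, 2.5],
})
n_movies = 5  # 10000 in the notebook

# Mean rating per movie, restricted to ids that fit in the array.
mean_by_movie = toy_train.groupby("movieId")["rating"].mean()
mean_by_movie = mean_by_movie.loc[mean_by_movie.index < n_movies]

# Movies absent from the training set keep the sentinel value -1.
rating_toy = -1 * np.ones(n_movies)
rating_toy[mean_by_movie.index.to_numpy()] = mean_by_movie.to_numpy()
```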



In [46]:
all_genres = []
for i in range(10000):
    temp = movies_df[movies_df['movieId']==i]
    if len(temp)==0:
        continue
    temp = temp['genres'].values[0]
    temp = temp.split('|')
    for genre in temp:
        if genre not in all_genres:
            all_genres.append(genre)
        



In [53]:
X_genre = np.zeros((10000,19))

for i in range(10000):
    temp = movies_df[movies_df['movieId']==i]
    if len(temp)==0:
        continue
    temp = temp['genres'].values[0]
    temp = temp.split('|')
    
    for idx, genre in enumerate(all_genres):
        X_genre[i,idx] = genre in temp
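
A multi-hot genre matrix like `X_genre` can also be built with pandas' `Series.str.get_dummies`, sketched here on toy data. Note that the column order is whatever pandas returns, which need not match the first-seen order of `all_genres`, so the two matrices may differ by a column permutation.

```python
import numpy as np
import pandas as pd

# Toy stand-in for movies_df, with the two columns used above.
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 4, 8],
    "genres":  ["Comedy|Drama", "Action", "Comedy", "Drama"],
})
n_movies = 5  # 10000 in the notebook

# One indicator column per genre found in the '|'-separated strings.
dummies = toy_movies["genres"].str.get_dummies(sep="|")
genre_names = list(dummies.columns)

# Drop movie ids that do not fit in the matrix, then scatter the rows.
ids = toy_movies["movieId"].to_numpy()
mask = ids < n_movies
X_genre_toy = np.zeros((n_movies, len(genre_names)))
X_genre_toy[ids[mask]] = dummies.to_numpy()[mask]
```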

In [75]:
X_concat = np.concatenate((X,X_genre),axis=1)
X_concat.shape

(10000, 1147)

In [76]:
np.array(train_df['rating']).mean()

3.3601078084681526

TextWrite cell. Report test accuracies for different k here.




# Problem 2: PCA and regression

Take the regression dataset below, and perform linear regression after doing PCA on the feature vector. 

For each k in [4, 32, 256, 1024], take the top k principal components and report the mean squared error on the test set below. 

For each k, you can choose the regularisation hyperparameter $\lambda$ for linear regression using an 80-20 split of the training set. 

For each k above, report the best $\lambda$ and the corresponding mean squared error in the cell after the next one.
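
The procedure described above can be sketched with NumPy alone. This is an illustrative outline on synthetic data, not the official solution: the helper names (`pca_fit`, `ridge_fit`, `select_lambda`), the candidate $\lambda$ grid, and the unshuffled 80-20 split are all assumptions.

```python
import numpy as np

def pca_fit(X, k):
    """Return the data mean and the top-k principal directions (as columns)."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k].T

def ridge_fit(Z, y, lam):
    """Closed-form ridge regression with an unpenalised intercept (lam > 0)."""
    z_mean, y_mean = Z.mean(axis=0), y.mean()
    Zc = Z - z_mean
    w = np.linalg.solve(Zc.T @ Zc + lam * np.eye(Z.shape[1]), Zc.T @ (y - y_mean))
    b = y_mean - z_mean @ w
    return w, b

def mse(Z, y, w, b):
    """Mean squared error of the linear predictor Z @ w + b."""
    return float(np.mean((Z @ w + b - y) ** 2))

def select_lambda(X_train, Y_train, k, lambdas, train_frac=0.8):
    """Pick the lambda with the lowest validation MSE on an 80-20 split."""
    split = int(train_frac * X_train.shape[0])
    X_fit, X_val = X_train[:split], X_train[split:]
    Y_fit, Y_val = Y_train[:split], Y_train[split:]
    # PCA is fitted on the fitting part only, then applied to the validation part.
    mean, W = pca_fit(X_fit, k)
    Z_fit, Z_val = (X_fit - mean) @ W, (X_val - mean) @ W
    scores = {}
    for lam in lambdas:
        w, b = ridge_fit(Z_fit, Y_fit, lam)
        scores[lam] = mse(Z_val, Y_val, w, b)
    best = min(scores, key=scores.get)
    return best, scores[best]

# Synthetic regression data, used only to exercise the functions.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
Y_demo = X_demo @ rng.normal(size=10) + 0.1 * rng.normal(size=200)
demo_lambdas = [1e-3, 1e-1, 1.0, 10.0]
best_lam, best_val_mse = select_lambda(X_demo, Y_demo, k=10, lambdas=demo_lambdas)
```

Once the best $\lambda$ is chosen for a given k, one would typically refit PCA and the regression on the full training set before measuring the test error.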




In [77]:
# CodeWrite

X_all = X_concat[rating_movies>0]
Y_all = rating_movies[rating_movies>0]

X_train = np.array(X_all[:7000])
Y_train = np.array(Y_all[:7000])
X_test = np.array(X_all[7000:])
Y_test = np.array(Y_all[7000:])






In [78]:
from sklearn.svm import SVR


In [81]:

def SVM_func(X_train, Y_train, kernel, C=1, kernel_param=1):
    # Fit a support vector regressor.
    # kernel_param is the degree for 'poly' and gamma for every other non-linear kernel.
    if kernel == 'linear':
        SVM_algo =  SVR(C=C, kernel=kernel)
    elif kernel == 'poly':
        SVM_algo =  SVR(C=C, kernel=kernel, degree = kernel_param)
    else:
        SVM_algo =  SVR(C=C, kernel=kernel, gamma = kernel_param)

    # Despite the variable name, this is a fitted regressor (SVR), not a classifier.
    classifier = SVM_algo.fit(X_train,Y_train)
    return classifier

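def best_hyperparam(X_train, Y_train, kernel):
    # Hold out the last 30% of the rows for validation (no shuffling).
    split = int(0.7*X_train.shape[0])
    X_train1 = X_train[:split]
    X_val = X_train[split:]
    Y_train1 = Y_train[:split]
    Y_val = Y_train[split:]
    best_loss = np.inf
    best_kernel_param = 1
    best_reg_param = 1  # SVR requires C > 0; only the 'rbf' search below updates these defaults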
    
    #reg_params = [0.0001]

#     if kernel == 'linear':
#         kernel_param = 1
#         for C in reg_params:
#             classifier = SVM_func(X_train1, Y_train1, kernel, C, kernel_param)
#             Y_val_pred1 = classifier.decision_function(X_val)
#             Y_val_pred = np.where(Y_val_pred1>0,1,-1)
#             zero_one_loss = np.where(Y_val_pred != Y_val,1,0)
#             mean_zero_one_loss = np.mean(zero_one_loss)
#             if mean_zero_one_loss < best_zero_one_loss:
#                 best_zero_one_loss = mean_zero_one_loss
#                 best_kernel_param = kernel_param
#                 best_reg_param = C
#             print('C ',C,'loss = ',mean_zero_one_loss)
            
#     #degree_params = [3]            
#     degree_params = [1,3,5,9,15]                  
#     if kernel =='poly':  
#         for kernel_param in degree_params:
#             for C in reg_params:
#                 classifier = SVM_func(X_train1, Y_train1, kernel, C, kernel_param)
#                 Y_val_pred1 = classifier.decision_function(X_val)
#                 Y_val_pred = np.where(Y_val_pred1>0,1,-1)
#                 zero_one_loss = np.where(Y_val_pred != Y_val,1,0)
#                 mean_zero_one_loss = np.mean(zero_one_loss)
#                 if mean_zero_one_loss < best_zero_one_loss:
#                     best_zero_one_loss = mean_zero_one_loss
#                     best_kernel_param = kernel_param
#                     best_reg_param = C
#                 #print('C ',C,'degree ', kernel_param, 'loss = ',mean_zero_one_loss)


    reg_params = [1e3,1e2,1e1,1,1e-1,1e-2]
    rbf_lambda_params = [1e-5,1e-3,1e-1,1,10,1e2] # gamma values for the rbf kernel (printed as 'lambda' below)
    if kernel =='rbf':
        for kernel_param in rbf_lambda_params:
            for C in reg_params:
                classifier = SVM_func(X_train1, Y_train1, kernel, C, kernel_param)
                Y_val_pred = classifier.predict(X_val)
                mse_loss = np.mean((Y_val_pred - Y_val)**2)
                if mse_loss < best_loss:
                    best_loss = mse_loss
                    best_kernel_param = kernel_param
                    best_reg_param = C
                print('C ',C,'lambda ', kernel_param, 'loss = ',mse_loss)


                    
    return best_kernel_param, best_reg_param




In [None]:
kernel = 'rbf'
# best_kernel_param = 0.1 
# best_reg_param = 10

best_kernel_param, best_reg_param = best_hyperparam(X_all, Y_all, kernel)
# 0.001, 10 according to best_hyperparam()
# 0.1,   10 according to submission

classifier = SVM_func(X_all, Y_all, kernel, C = best_reg_param, kernel_param = best_kernel_param)

# X_train and X_test are both subsets of X_all, which the model was fitted on,
# so these are in-sample mean squared errors, not held-out estimates.
Y_train_pred = classifier.predict(X_train)
Y_test_pred = classifier.predict(X_test)
train_mse = np.mean((Y_train_pred - Y_train)**2)
test_mse = np.mean((Y_test_pred - Y_test)**2)
print("MSE train  , test = ",train_mse," , ",test_mse)
print('best params', best_kernel_param, best_reg_param)

C  1000.0 lambda  1e-05 loss =  0.02559231890136304
C  100.0 lambda  1e-05 loss =  0.03180249615654678


In [None]:
# Recorded run (the values are mean squared errors): train  , test =  0.009417467803048247  ,  0.010346241350575927
# best_kernel_param = 0.001 
# best_reg_param = 10


In [None]:
X_all = X_concat
X_all.shape
Y_pred_all = classifier.predict(X_all)


In [None]:
#Submission

df_test=pd.read_csv(data_folder+'test.csv') # Large 500MB

predictions = np.zeros(len(df_test))
for i in range(len(df_test)):
    userid =  df_test.iloc[i,0]
    movieid = df_test.iloc[i,1]
    rating = float("{0:.1f}".format(Y_pred_all[movieid]))
    if rating>5:
        rating = 5
    if rating<0.5:
        rating =0.5
    predictions[i] = rating
df_submission = pd.read_csv(data_folder+'dummy_submission.csv')
df_submission.Prediction = predictions
df_submission.to_csv('./Submission_regression_concat2.csv',index=False)
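
The row-by-row loop in the submission cell can be replaced by array indexing. The sketch below uses toy arrays; `Y_pred_toy` plays the role of `Y_pred_all` and `movie_ids_toy` the role of the second column of `df_test`. It assumes that rounding with `np.round` is an acceptable substitute for the string formatting used above; the two can differ on exact ties.

```python
import numpy as np

# Toy per-movie predictions, indexed by movie id.
Y_pred_toy = np.array([0.23, 3.14159, 5.6, 4.07])
# Toy movie ids, one per row of the test set.
movie_ids_toy = np.array([2, 0, 1, 1, 3])

# Look up each row's prediction, round to one decimal, clip to [0.5, 5].
ratings_toy = np.clip(np.round(Y_pred_toy[movie_ids_toy], 1), 0.5, 5.0)
```

With the notebook's data, `movie_ids_toy` would be something like `df_test.iloc[:, 1].to_numpy()`, given that the loop above reads the movie id from that column.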