<a href="https://colab.research.google.com/github/d4gituser/Turhan_DS/blob/main/p2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Portfolio Part 2

The goal of the second Portfolio task is to train linear regression models to predict users' ratings of movies. This involves a standard Data Science workflow: exploring data, building models, making predictions, and evaluating results. In this task, we will explore how feature selection and the size of the training/testing data affect model performance. We will continue using the MovieLens dataset provided in Portfolio task 1.

### Import Cleaned MovieLens Dataset
Save the cleaned data from Portfolio task 1 (i.e., after removing missing values and outliers) as a csv file named 'movielens_data_clean.csv'. You may need to use the Pandas method `to_csv` for this. After that, please import the csv file (i.e., 'movielens_data_clean.csv') and print out its total length.

In [2]:
# your code and solution
import pandas as pd
import numpy as np

#reading the raw csv
data = pd.read_csv("./movielens_data.csv")

#dropna() removes the rows that contain missing (NaN) values
data = data.dropna()

#keep only the rows whose occupation is not the string "none"
data = data[data['occupation'] != "none"]

reading = str(data.shape)
print("Data size after cleaning: " + reading)

#saving new cleaned csv
data.to_csv('movielens_data_clean.csv', index=False)
data.shape

Data size after cleaning: (99022, 8)


(99022, 8)

In [8]:
data.head()

Unnamed: 0,userId,age,gender,occupation,movieId,rating,genre,timestamp
0,196,49.0,M,writer,242,3.0,Comedy,881250949
2,22,25.0,M,writer,377,1.0,Children,878887116
3,244,28.0,M,technician,51,2.0,Romance,880606923
5,298,44.0,M,executive,474,4.0,War,884182806
7,253,26.0,F,librarian,465,5.0,Adventure,891628467


### Explore the Dataset

* Use the `head()` and `info()` methods to get a rough picture of the data, e.g., how many columns there are and the data type of each column.
* As our goal is to predict ratings from the other columns, compute the correlations between age/gender/genre/occupation and rating using the `corr()` method.
* To compute these correlations, you may first need to convert the categorical features (i.e., gender, genre and occupation) into numerical values. For this, you may need to import `OrdinalEncoder` from `sklearn.preprocessing` (refer to the useful examples [here](https://pbpython.com/categorical-encoding.html)).
* Please provide ___necessary explanations/analysis___ of the correlations, and identify the ___most___ and ___least___ correlated features with respect to rating. Try to ___discuss___ how the correlation will affect the final prediction results if we use these features to train a regression model for rating prediction. In what follows, we will conduct experiments to verify your hypothesis.

In [7]:
# your code and solution
df = pd.read_csv("./movielens_data_clean.csv") 
df.head()

Unnamed: 0,userId,age,gender,occupation,movieId,rating,genre,timestamp
0,196,49.0,M,writer,242,3.0,Comedy,881250949
1,22,25.0,M,writer,377,1.0,Children,878887116
2,244,28.0,M,technician,51,2.0,Romance,880606923
3,298,44.0,M,executive,474,4.0,War,884182806
4,253,26.0,F,librarian,465,5.0,Adventure,891628467


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99022 entries, 0 to 99021
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   userId      99022 non-null  int64  
 1   age         99022 non-null  float64
 2   gender      99022 non-null  object 
 3   occupation  99022 non-null  object 
 4   movieId     99022 non-null  int64  
 5   rating      99022 non-null  float64
 6   genre       99022 non-null  object 
 7   timestamp   99022 non-null  int64  
dtypes: float64(2), int64(3), object(3)
memory usage: 6.0+ MB


# Correlation
To find the correlation between columns we need to compare them, and for that all columns must hold numerical data.
So we will use the label encoder (`LabelEncoder`) from sklearn to convert the categorical values in this dataframe into numerical ones.

In [31]:
#correlation
#making a real copy of the dataframe so the original stays unchanged
df2 = df.copy()
df2.head()

#importing necessary libraries
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

#convert the categorical columns into integer codes (categories are numbered in sorted order)
df2['gender'] = le.fit_transform(df2['gender'])
df2['genre'] = le.fit_transform(df2['genre'])
df2['occupation'] = le.fit_transform(df2['occupation'])
df2.head()

Unnamed: 0,userId,age,gender,occupation,movieId,rating,genre,timestamp
0,196,49.0,1,19,242,3.0,4,881250949
1,22,25.0,1,19,377,1.0,3,878887116
2,244,28.0,1,18,51,2.0,13,880606923
3,298,44.0,1,6,474,4.0,16,884182806
4,253,26.0,0,10,465,5.0,1,891628467




Now that all the categorical columns in `df2` have been converted to numbers, we can apply `corr()` between each feature and `rating`.




In [35]:
print('correlation between age and rating:',df2['age'].corr(df2['rating']))
print('correlation between gender and rating:',df2['gender'].corr(df2['rating']))
print('correlation between genre and rating:',df2['genre'].corr(df2['rating']))
print('correlation between occupation and rating:',df2['occupation'].corr(df2['rating']))

correlation between age and rating: 0.05684068891822537
correlation between gender and rating: -0.0014983044579124484
correlation between genre and rating: 0.044561354728212356
correlation between occupation and rating: -0.029203996222652656
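Ranked by absolute value, the printed correlations put age (about 0.057) and genre (about 0.045) ahead of occupation (about -0.029) and gender (about -0.0015), and all four are very weak. The sketch below shows one way to get the same kind of ranking in a single step with `OrdinalEncoder`, which the task suggests. It runs on a small made-up frame (`toy` is an assumption, not the real dataset), so its numbers say nothing about MovieLens. Like `LabelEncoder`, `OrdinalEncoder` numbers categories in sorted order by default, so on the real data it should give the same codes as the cell above. Note that these integer codes are arbitrary for unordered categories such as genre and occupation, so their correlation with rating should be read with caution.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up stand-in for the cleaned data; the values are for illustration only.
toy = pd.DataFrame({
    "age": [49.0, 25.0, 28.0, 44.0, 26.0, 33.0],
    "gender": ["M", "M", "M", "M", "F", "F"],
    "occupation": ["writer", "writer", "technician", "executive", "librarian", "student"],
    "genre": ["Comedy", "Children", "Romance", "War", "Adventure", "Comedy"],
    "rating": [3.0, 1.0, 2.0, 4.0, 5.0, 4.0],
})

categorical = ["gender", "occupation", "genre"]

# OrdinalEncoder maps each category to an integer code, one column at a time.
codes = OrdinalEncoder().fit_transform(toy[categorical])

encoded = toy.copy()
for i, col in enumerate(categorical):
    encoded[col] = codes[:, i]

# Correlation of every feature with rating, ranked by absolute value.
corr_with_rating = encoded.corr()["rating"].drop("rating")
ranked = corr_with_rating.abs().sort_values(ascending=False)
print(ranked)
```

On the real data, replacing `toy` with `df` would give the ranking used to pick the two most and two least correlated features.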


### Split Training and Testing Data
* Machine learning models are trained to help make predictions for the future. Normally, we need to randomly split the dataset into training and testing sets, where we use the training set to train the model, and then use the trained model to make predictions on the testing set.
* To further investigate whether the size of the training/testing data affects the model performance, please randomly split the data into training and testing sets with different sizes:
    * Case 1: training data containing 10% of the entire data;
    * Case 2: training data containing 90% of the entire data. 
* Print the shape of training and testing sets in the two cases. 

In [None]:
# your code and solution
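A minimal sketch of the two splits, assuming scikit-learn's `train_test_split` is used. The frame `toy` is random synthetic data standing in for the encoded dataset (an assumption, so the block runs without the csv file), and `random_state=42` is an arbitrary choice that only makes the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded MovieLens frame (random values, illustration only).
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "age": rng.integers(18, 70, size=n).astype(float),
    "gender": rng.integers(0, 2, size=n),
    "occupation": rng.integers(0, 20, size=n),
    "genre": rng.integers(0, 18, size=n),
    "rating": rng.integers(1, 6, size=n).astype(float),
})

# Case 1: 10% of the rows for training, the remaining 90% for testing.
train1, test1 = train_test_split(toy, train_size=0.1, random_state=42)

# Case 2: 90% of the rows for training, the remaining 10% for testing.
train2, test2 = train_test_split(toy, train_size=0.9, random_state=42)

print("Case 1 train/test shapes:", train1.shape, test1.shape)
print("Case 2 train/test shapes:", train2.shape, test2.shape)
```

With the real data, `toy` would be replaced by `df2`, and the printed shapes would reflect its 99022 rows.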

### Train Linear Regression Models with Feature Selection under Cases 1 & 2
* When training a machine learning model for prediction, we may need to select the most important/correlated input features for more accurate results. 
* To investigate whether feature selection affects the model performance, please select the two most correlated features and the two least correlated features with respect to rating. 
* Train four linear regression models under the following conditions:
    - (model-a) using the training/testing data in case 1 with the two most correlated input features
    - (model-b) using the training/testing data in case 1 with the two least correlated input features
    - (model-c) using the training/testing data in case 2 with the two most correlated input features
    - (model-d) using the training/testing data in case 2 with the two least correlated input features
* By doing this, we can verify the impact of the size of the training/testing data on the model performance by comparing model-a and model-c (or model-b and model-d); meanwhile, the impact of feature selection can be validated by comparing model-a and model-b (or model-c and model-d).

In [None]:
# your code and solution
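One possible way to organise the four models is sketched below. It assumes the feature ranking read off the correlations printed earlier (age and genre as most correlated, occupation and gender as least correlated, by absolute value). The names `toy`, `splits`, `configs` and `fit_model` are hypothetical helpers, and `toy` is random synthetic data rather than the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded MovieLens frame (random values, illustration only).
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "age": rng.integers(18, 70, size=n).astype(float),
    "gender": rng.integers(0, 2, size=n),
    "occupation": rng.integers(0, 20, size=n),
    "genre": rng.integers(0, 18, size=n),
    "rating": rng.integers(1, 6, size=n).astype(float),
})

# Assumption: feature groups taken from the correlations printed earlier.
most_corr = ["age", "genre"]
least_corr = ["occupation", "gender"]

# Each entry holds a (train, test) pair.
splits = {
    "case1": train_test_split(toy, train_size=0.1, random_state=42),
    "case2": train_test_split(toy, train_size=0.9, random_state=42),
}

configs = {
    "model-a": ("case1", most_corr),
    "model-b": ("case1", least_corr),
    "model-c": ("case2", most_corr),
    "model-d": ("case2", least_corr),
}

def fit_model(train, features, target="rating"):
    """Fit a linear regression on the given feature columns of the training set."""
    model = LinearRegression()
    model.fit(train[features], train[target])
    return model

models = {
    name: fit_model(splits[case][0], features)
    for name, (case, features) in configs.items()
}

for name, model in models.items():
    print(name, "coefficients:", model.coef_, "intercept:", model.intercept_)
```

Keeping the split and feature choice in one `configs` table makes the a/c and a/b comparisons described above easy to line up later.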

### Evaluate Models
* Evaluate the performance of the four models with two metrics: MSE and root MSE (RMSE)
* Print the results of the four models regarding the two metrics

In [None]:
# your code and solution
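The evaluation step could look like the sketch below, where MSE is the mean of the squared prediction errors on the testing set and RMSE is its square root. The data is again a random synthetic stand-in (`toy`), and the feature groups are the same assumption as before, so the printed numbers are not results for MovieLens.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded MovieLens frame (random values, illustration only).
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "age": rng.integers(18, 70, size=n).astype(float),
    "gender": rng.integers(0, 2, size=n),
    "occupation": rng.integers(0, 20, size=n),
    "genre": rng.integers(0, 18, size=n),
    "rating": rng.integers(1, 6, size=n).astype(float),
})

# Assumption: (training fraction, feature list) for each of the four models.
configs = {
    "model-a": (0.1, ["age", "genre"]),
    "model-b": (0.1, ["occupation", "gender"]),
    "model-c": (0.9, ["age", "genre"]),
    "model-d": (0.9, ["occupation", "gender"]),
}

results = {}
for name, (train_size, features) in configs.items():
    train, test = train_test_split(toy, train_size=train_size, random_state=42)
    model = LinearRegression().fit(train[features], train["rating"])
    predictions = model.predict(test[features])
    mse = mean_squared_error(test["rating"], predictions)
    results[name] = {"MSE": mse, "RMSE": np.sqrt(mse)}

# One row per model, one column per metric.
results_df = pd.DataFrame(results).T
print(results_df)
```

Taking `np.sqrt` of the MSE avoids depending on version-specific RMSE helpers in scikit-learn.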

### Visualize, Compare and Analyze the Results
* Visualize the results, and perform ___insightful analysis___ on the obtained results. For better visualization, you may need to carefully set the scale for the y-axis.
* Normally, the model trained with the most correlated features and more training data will get better results. Do you observe the same? If not, please ___explain the possible reasons___.

In [None]:
# your code and solution
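A possible visualization is a bar chart of RMSE per model with a narrowed y-axis, sketched below with matplotlib. As before, `toy` is random synthetic data and the feature groups are an assumption, so the chart only demonstrates the plotting approach. Because the correlations in this dataset are so weak, the four RMSE values on the real data may turn out to be very close, which is why the y-axis range matters.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded MovieLens frame (random values, illustration only).
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "age": rng.integers(18, 70, size=n).astype(float),
    "gender": rng.integers(0, 2, size=n),
    "occupation": rng.integers(0, 20, size=n),
    "genre": rng.integers(0, 18, size=n),
    "rating": rng.integers(1, 6, size=n).astype(float),
})

# Assumption: (training fraction, feature list) for each of the four models.
configs = {
    "model-a": (0.1, ["age", "genre"]),
    "model-b": (0.1, ["occupation", "gender"]),
    "model-c": (0.9, ["age", "genre"]),
    "model-d": (0.9, ["occupation", "gender"]),
}

rmse = {}
for name, (train_size, features) in configs.items():
    train, test = train_test_split(toy, train_size=train_size, random_state=42)
    model = LinearRegression().fit(train[features], train["rating"])
    mse = mean_squared_error(test["rating"], model.predict(test[features]))
    rmse[name] = np.sqrt(mse)

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(list(rmse), list(rmse.values()))

# Narrow the y-axis around the observed values so small differences are visible.
lowest, highest = min(rmse.values()), max(rmse.values())
pad = max((highest - lowest) * 0.5, 1e-3)
ax.set_ylim(lowest - pad, highest + pad)
ax.set_ylabel("RMSE")
ax.set_title("RMSE per model (y-axis does not start at zero)")
```

A truncated y-axis makes small gaps look large, so the axis range is stated in the title and should be kept in mind when comparing the bars.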