**Background of Problem Statement :**

The GroupLens Research Project is a research group in the Department of Computer Science and Engineering at the University of Minnesota. Its members work on research projects in information filtering, collaborative filtering, and recommender systems. The project is led by professors John Riedl and Joseph Konstan. It began exploring automated collaborative filtering in 1992 but is best known for its worldwide trial of an automated collaborative filtering system for Usenet news in 1996. Since then the project has expanded its scope to research overall information filtering solutions, integrating content-based methods as well as improving current collaborative filtering technology.

**Problem Objective :**

Here, we ask you to perform the analysis using the Exploratory Data Analysis technique. You need to find features affecting the ratings of any particular movie and build a model to predict the movie ratings.

**Analysis Tasks to be performed:**

1. Import the three datasets.

2. Create a new dataset [Master_Data] with the following columns: MovieID, Title, UserID, Age, Gender, Occupation, Rating. (Hint: (i) Merge two tables at a time. (ii) Merge the tables using the two primary keys MovieID and UserID.)

3. Explore the datasets using visual representations (graphs or tables), also include your comments on the following:

    i). User Age Distribution.

    ii). User rating of the movie “Toy Story”.

    iii). Top 25 movies by viewership rating.

    iv). Find the ratings for all the movies reviewed by a particular user with user id = 2696.
    

4. Feature Engineering:

    **Use column genres:**

    i). Find out all the unique genres (Hint: split the data in column genre making a list and then process the data to find out only the unique categories of genres).
    
    ii). Create a separate column for each genre category with a one-hot encoding ( 1 and 0) whether or not the movie belongs to that genre.
    
    iii). Determine the features affecting the ratings of any particular movie.
    
    iv). Develop an appropriate model to predict the movie ratings.

**Dataset Description :**

***These files contain 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users who joined MovieLens in 2000.***

In [None]:
#Import the necessary library.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#Import machine learning library.

**1. Import the dataset.**

In [None]:
data_1 = pd.read_csv('movies.csv', engine = 'python', sep = '::', encoding = 'latin-1'
                    ,header = None)
data_1.columns = ['Movie_ID', 'Title', 'Genres']
print(data_1.shape)
data_1.head()

In [None]:
data_1.describe()

In [None]:
data_1.info()

In [None]:
data_2 = pd.read_csv('users.csv', engine = 'python', sep = '::', encoding = 'latin-1',
                    header = None)
data_2.columns = ['User_ID','Gender','Age','Occupation','Zip_Code']
print(data_2.shape)
data_2.head()

In [None]:
data_2.describe()

In [None]:
data_2.info()

In [None]:
data_3 = pd.read_csv('ratings.csv', engine = 'python', sep = '::', encoding = 'latin-1',
                    header = None)
data_3.columns = ['User_ID','Movie_ID','Ratings','Timestamp']
print(data_3.shape)
data_3.head()

In [None]:
data_3.describe()

In [None]:
data_3.info()
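The three files use `::` as the field separator, which is why the calls above pass `engine = 'python'`: the default C parser only accepts single-character separators. A minimal sketch of the same pattern on an in-memory sample follows; the two rows are illustrative stand-ins for the ratings file, and `names=` is used here as an alternative to assigning `.columns` afterwards.

```python
import io
import pandas as pd

# Two illustrative rows in the 'User_ID::Movie_ID::Ratings::Timestamp' layout.
sample = "1::1193::5::978300760\n1::661::3::978302109\n"

ratings_sample = pd.read_csv(
    io.StringIO(sample),
    sep="::",          # multi-character separator, handled by the python engine
    engine="python",
    header=None,       # the file has no header row
    names=["User_ID", "Movie_ID", "Ratings", "Timestamp"],
)
print(ratings_sample)
```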

**2. Create a Master Data**

In [None]:
data_4 = pd.merge(data_1,data_3, on = 'Movie_ID', how = 'inner')
data_4.head()

In [None]:
data = pd.merge(data_4,data_2, on = 'User_ID', how = 'inner')
data.head()

In [None]:
print(data.shape)
#print(data.describe)
data.info()
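A quick way to check a two-step merge like this is to confirm that the result still has one row per rating. The sketch below uses three tiny, made-up tables (not the real data) and the `validate` argument of `merge`, which raises an error if a key that should be unique is duplicated.

```python
import pandas as pd

# Made-up stand-ins for the movies, ratings and users tables.
movies = pd.DataFrame({"Movie_ID": [1, 2], "Title": ["A", "B"]})
ratings = pd.DataFrame(
    {"User_ID": [10, 10, 11], "Movie_ID": [1, 2, 1], "Ratings": [4, 5, 3]}
)
users = pd.DataFrame({"User_ID": [10, 11], "Age": [25, 35]})

# Merge two tables at a time; validate checks the expected key cardinality.
step1 = movies.merge(ratings, on="Movie_ID", how="inner", validate="one_to_many")
master = step1.merge(users, on="User_ID", how="inner", validate="many_to_one")

# Every rating should appear exactly once in the merged table.
print(master.shape, len(ratings))
```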

**3 - i). User Age Distribution**

In [None]:
#1.User Age Distribution
sns.histplot(data = data['Age'], bins = 7, color = 'grey')
plt.title('User Age Distribution')
plt.show()

In [None]:
sns.boxplot(data = data['Age'])
plt.show()

In [None]:
(data['Age'].value_counts()).plot(kind = 'bar')
plt.title('User Age Distribution')
plt.show()

**3 - ii). User rating of the movie “Toy Story”**

In [None]:
#User rating of the movie “Toy Story”
movie_list = data.groupby('Title')[['Ratings']].mean()

In [None]:
movie_list = movie_list.reset_index()

In [None]:
print(movie_list[movie_list['Title'] == 'Toy Story 2 (1999)'])
print("\n")
print(movie_list[movie_list['Title'] == 'Toy Story (1995)'])
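The mean is only one summary of how users rated "Toy Story". If the full distribution of ratings is of interest, a cross-tabulation is one option. The sketch below runs on a made-up frame with the same `Title` and `Ratings` column names; the values are not taken from the dataset.

```python
import pandas as pd

# Made-up ratings, only to illustrate the shape of the result.
toy = pd.DataFrame({
    "Title": ["Toy Story (1995)"] * 3 + ["Toy Story 2 (1999)"] * 2 + ["Heat (1995)"],
    "Ratings": [5, 4, 5, 4, 4, 3],
})

# Keep both Toy Story titles, then count how often each rating value occurs.
mask = toy["Title"].str.startswith("Toy Story")
dist = pd.crosstab(toy.loc[mask, "Title"], toy.loc[mask, "Ratings"])
print(dist)
```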

**3 - iii). Top 25 movies by viewership rating.**

In [None]:
#3.Top 25 movies by viewership rating.
top_25 = movie_list.sort_values('Ratings', ascending = False).head(25)
top_25 = top_25.reset_index(drop = True)
top_25
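Ranking by mean rating alone can put movies with only one or two ratings at the top. One possible refinement, sketched here on made-up data, is to compute the rating count next to the mean and apply a minimum-count filter. The threshold `min_ratings` is a hypothetical parameter, not something the assignment specifies.

```python
import pandas as pd

# Made-up ratings: "B" has the highest mean but only a single rating.
toy = pd.DataFrame({
    "Title": ["A", "A", "A", "B", "C", "C"],
    "Ratings": [4, 5, 4, 5, 3, 4],
})

# Mean rating and number of ratings per title.
stats = toy.groupby("Title")["Ratings"].agg(mean_rating="mean", n_ratings="count")

min_ratings = 2  # hypothetical threshold; tune it for the real data
top = (
    stats[stats["n_ratings"] >= min_ratings]
    .sort_values("mean_rating", ascending=False)
)
print(top)
```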

**3 - iv). Find the ratings for all the movies reviewed by a particular user with user id = 2696.**

In [None]:
user_2696 = data[data['User_ID'] == 2696]
movie_list_2696 = user_2696.groupby('Title')[['Ratings']].mean()
movie_list_2696 = movie_list_2696.reset_index()
print(movie_list_2696.shape)
movie_list_2696

**4. Feature Engineering**

**i) List of unique genres.**

In [None]:
#Find out all the unique genres.
genres = data['Genres']
b = []
for i in genres:
    a = []
    a = i.split('|')
    b.append(a)
    
ls = []
for i in b:
    for j in i:
        ls.append(j)

ls = pd.Series(ls)
ls = ls.unique()
print(ls)
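The same split-and-collect logic can be written more compactly with pandas string methods. This is a sketch on a made-up `Genres` series, using `str.split` followed by `explode` to flatten the lists before taking the unique values.

```python
import pandas as pd

# Made-up genre strings in the same pipe-separated format.
toy_genres = pd.Series(["Animation|Children's|Comedy", "Comedy|Romance", "Drama"])

# Split each string into a list, flatten to one genre per row, deduplicate.
unique_genres = sorted(toy_genres.str.split("|").explode().unique())
print(unique_genres)
```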

**ii). Create a separate column for each genre category with a one-hot encoding ( 1 and 0) whether or not the movie belongs to that genre.**

In [None]:
genre_dummy = data['Genres'].str.get_dummies()
genre_dummy
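`str.get_dummies` splits on `|` by default, which happens to match the genre format here. The sketch below makes the separator explicit and shows the resulting 0/1 columns on two made-up rows.

```python
import pandas as pd

# Two made-up movies with pipe-separated genres.
toy = pd.DataFrame({
    "Title": ["Movie A", "Movie B"],
    "Genres": ["Animation|Comedy", "Comedy|Romance"],
})

# One column per genre; 1 if the movie belongs to it, else 0.
dummies = toy["Genres"].str.get_dummies(sep="|")
print(dummies)
```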

**iii). Determine the features affecting the rating of the movie.**

**Predicting the rating is treated here as a multiclass classification problem, with the independent and dependent variables listed below:**

**Independent variables:**
1. Movie_ID 
2. Title 
3. Genres 
4. User_ID 
5. Gender
6. Age 
7. Occupation
8. Zip_Code 
9. Timestamp

**Dependent Variable(or target variable)**
1. Ratings

In [None]:
#View the master data
data.head()

**Univariate Analysis of the relevant columns**

In [None]:
#Count plot of the movie ratings. (Target variable)
sns.countplot(data = data, x = 'Ratings')
plt.title("Count plot of the Movie ratings")
plt.show()

In [None]:
#Count plot of Gender.
sns.countplot(data = data, x = 'Gender')
plt.title("Count plot of Gender")
plt.show()

In [None]:
#Count plot of Age.
sns.countplot(data = data, x = 'Age')
plt.title("Count plot of Age")
plt.show()

In [None]:
#Count plot of Occupation.
sns.countplot(data = data, x = 'Occupation')
plt.title("Count plot of Occupation")
plt.show()

**Bivariate Analysis of the Data**

In [None]:
#1. Bivariate Analysis of Age with Rating.
sns.countplot(data = data, x = 'Ratings', hue = 'Age')
plt.title("Count plot of the Movie ratings")
plt.show()

In [None]:
sns.violinplot(data = data, x = 'Ratings', y = 'Age')
plt.title("Violin plot of the Movie ratings")
plt.show()

In [None]:
#2.Bivariate Analysis of Gender with Rating.
sns.countplot(data = data, x = 'Ratings', hue = 'Gender')
plt.title("Count plot of the Movie ratings")
plt.show()

In [None]:
#3.Bivariate Analysis of Occupation with Rating.
sns.violinplot(data = data, x = 'Ratings', y = 'Occupation')
plt.title("Violin plot of the Movie ratings")
plt.show()

In [None]:
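#Correlation is only defined for numeric columns, so select them first.
sns.heatmap(data.select_dtypes(include = 'number').corr(), annot = True)
plt.show()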
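A correlation heatmap only covers numeric columns, so a categorical feature such as `Gender` does not show up in it. A grouped summary is one simple way to compare ratings across categories. The sketch below uses made-up values and the same column names as the master data.

```python
import pandas as pd

# Made-up rows; the real analysis would use the merged master data.
toy = pd.DataFrame({
    "Gender": ["F", "F", "M", "M"],
    "Age": [18, 25, 18, 25],
    "Ratings": [4, 5, 3, 4],
})

# Mean rating and number of ratings for each gender.
by_gender = toy.groupby("Gender")["Ratings"].agg(["mean", "count"])
print(by_gender)
```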

**From the analysis above, we drop some of the independent features and retain the others, as listed below:**

**Independent variables:**
1. Movie_ID = Need to Drop
2. Title = Need to Drop
3. Genres = Affecting the rating of the movie.
4. User_ID = Need to Drop
5. Gender = Affecting the rating of the movie.
6. Age = Affecting the rating of the movie.
7. Occupation = Affecting the rating of the movie.
8. Zip_Code = Need to Drop
9. Timestamp = Need to Drop

**Dependent Variable(or target variable)**
1. Ratings

In [None]:
data = data.drop(columns = ['Movie_ID','Title','Genres','User_ID','Zip_Code','Timestamp'])

In [None]:
data = pd.concat([data, genre_dummy], axis = 1)
data.head(10)

In [None]:
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
data['Gender'] = label.fit_transform(data['Gender'])
data.head()

In [None]:
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
#StandardScaler expects 2D input, so pass a one-column DataFrame and flatten the result.
data['Age'] = scale.fit_transform(data[['Age']]).ravel()
data.head()
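`StandardScaler` works on 2D input (rows by features), so a single column has to be passed as a one-column DataFrame rather than a Series. The sketch below shows this on a made-up `Age` column and checks that the scaled values have zero mean and unit variance. In a stricter workflow the scaler would be fitted on the training split only, which this sketch does not attempt.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up ages; double brackets keep the column two-dimensional.
toy = pd.DataFrame({"Age": [18, 25, 35, 50]})

scaler = StandardScaler()
toy["Age_scaled"] = scaler.fit_transform(toy[["Age"]]).ravel()
print(toy)
```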

In [None]:
#The keyword is 'columns' and it takes a list of column names.
data = pd.get_dummies(data = data, columns = ['Occupation'])
data.head(10)

In [None]:
#Splitting the data into Dependent and Independent variables.
x = data.drop(columns = 'Ratings')
y = data['Ratings']

In [None]:
#Splitting the data into training and Testing Set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, random_state = 15)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
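If the rating classes are imbalanced (the count plot of `Ratings` is the place to check), a stratified split keeps the class proportions similar in the training and testing sets. The sketch below is an optional variation on made-up arrays; `stratify` is a standard argument of `train_test_split`, and the 25% test size is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up features and an imbalanced target: 4 samples of class 1, 16 of class 5.
X = np.arange(40).reshape(20, 2)
y = np.array([1] * 4 + [5] * 16)

# stratify=y preserves the 20% / 80% class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=15
)
print(np.bincount(y_tr)[[1, 5]], np.bincount(y_te)[[1, 5]])
```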

**1. Logistic Regression Model**

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(x_train,y_train)

In [None]:
#Compute the accuracy on the training and testing data.
from sklearn import metrics
pred_train = model.predict(x_train)
pred_test = model.predict(x_test)
accuracy_train = metrics.accuracy_score(y_train, pred_train)
accuracy_test = metrics.accuracy_score(y_test, pred_test)
print("Training accuracy: {}".format(accuracy_train))
print("Testing accuracy: {}".format(accuracy_test))

In [None]:
#Confusion Matrix and Classification report of the model.
#scikit-learn expects the true labels first, then the predictions.
print(metrics.classification_report(y_test, pred_test))
print("\n")
print(metrics.confusion_matrix(y_test, pred_test))
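The argument order matters for these metric functions: scikit-learn expects the true labels first and the predictions second, and swapping them transposes the confusion matrix and exchanges precision with recall. The sketch below uses small made-up label lists to show the effect.

```python
from sklearn import metrics

# Made-up labels: one sample of class 1 is misclassified as class 2.
y_true = [1, 1, 1, 2, 2]
y_pred = [1, 1, 2, 2, 2]

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_true, y_pred)
cm_swapped = metrics.confusion_matrix(y_pred, y_true)

# output_dict=True returns the report as a nested dictionary keyed by label.
report = metrics.classification_report(y_true, y_pred, output_dict=True)
print(cm)
print(cm_swapped)
print(report["1"]["precision"], report["1"]["recall"])
```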

**2. Decision Tree Model**

In [None]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(x_train,y_train)

In [None]:
#Compute the accuracy on the training and testing data.
pred_train = model.predict(x_train)
pred_test = model.predict(x_test)
accuracy_train = metrics.accuracy_score(y_train, pred_train)
accuracy_test = metrics.accuracy_score(y_test, pred_test)
print("Training accuracy: {}".format(accuracy_train))
print("Testing accuracy: {}".format(accuracy_test))

In [None]:
#Confusion Matrix and Classification report of the model.
print(metrics.classification_report(y_test, pred_test))
print("\n")
print(metrics.confusion_matrix(y_test, pred_test))

**3. Random Forest Model** 

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(x_train,y_train)

In [None]:
#Compute the accuracy on the training and testing data.
pred_train = model.predict(x_train)
pred_test = model.predict(x_test)
accuracy_train = metrics.accuracy_score(y_train, pred_train)
accuracy_test = metrics.accuracy_score(y_test, pred_test)
print("Training accuracy: {}".format(accuracy_train))
print("Testing accuracy: {}".format(accuracy_test))

In [None]:
#Confusion Matrix and Classification report of the model.
print(metrics.classification_report(y_test, pred_test))
print("\n")
print(metrics.confusion_matrix(y_test, pred_test))

**4. Gradient Boosting Model**

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators = 200)
#Fit the model directly; no grid search object is defined in this notebook.
model.fit(x_train, y_train)

In [None]:
#Compute the accuracy on the training and testing data.
from sklearn import metrics
pred_train = model.predict(x_train)
pred_test = model.predict(x_test)
accuracy_train = metrics.accuracy_score(y_train, pred_train)
accuracy_test = metrics.accuracy_score(y_test, pred_test)
print("Training accuracy: {}".format(accuracy_train))
print("Testing accuracy: {}".format(accuracy_test))

In [None]:
#Confusion Matrix and Classification report of the model.
print(metrics.classification_report(y_test, pred_test))
print("\n")
print(metrics.confusion_matrix(y_test, pred_test))
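Accuracy figures for the four models are easier to interpret next to a trivial baseline. One common choice, sketched here on made-up data, is scikit-learn's `DummyClassifier` with the `most_frequent` strategy, which always predicts the most common rating. The class proportions below are invented for illustration and are not the dataset's.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn import metrics

# Made-up data: the features are random noise; 60% of the targets are rating 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([4] * 60 + [3] * 25 + [5] * 15)

# The baseline ignores X and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = metrics.accuracy_score(y, baseline.predict(X))
print("Baseline accuracy: {}".format(baseline_accuracy))
```

A model whose testing accuracy is not clearly above this baseline has learned little beyond the majority class.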