### The dataset provided contains movie reviews given by Amazon customers. Reviews were given between May 1996 and July 2014.

### Data Dictionary
 - UserID (`user_id` column) – identifier of the customer; 4848 distinct customers provided ratings
 - Movie 1 to Movie 206 – one column per movie, holding the ratings that the 4848 users gave to each of the 206 movies

### Data Considerations
- Not every user has watched every movie, so not every movie is rated by every user. These missing values are represented by NA.
- Ratings are described as being on a scale of -1 to 10, where -1 is the lowest rating and 10 is the best. The ratings actually present in this file range from 1 to 5.

### Analysis Task
- Exploratory Data Analysis:
  - Which movies have the most views/ratings?
  - What is the average rating for each movie? Identify the top 5 movies with the highest ratings.
  - Identify the 5 movies with the smallest audience.
- Recommendation Model: Some of the movies have not been watched, and are therefore not rated, by some users. Netflix would like to take this as an opportunity and build a machine learning recommendation algorithm that predicts the ratings for each user.
  - Divide the data into training and test data
  - Build a recommendation model on the training data
  - Make predictions on the test data

In [15]:
import pandas as pd
import numpy as np
# Raw string so the backslashes in the Windows path are not treated as escape sequences
movie=pd.read_csv(r'A:\MachineLearning\Project3\Amazon - Movies and TV Ratings.csv')
movie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4848 entries, 0 to 4847
Columns: 207 entries, user_id to Movie206
dtypes: float64(206), object(1)
memory usage: 7.7+ MB


In [16]:
movie.shape

(4848, 207)

In [17]:
movie.head()

Unnamed: 0,user_id,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,A3R5OBKS7OM2IR,5.0,5.0,,,,,,,,...,,,,,,,,,,
1,AH3QC2PC1VTGP,,,2.0,,,,,,,...,,,,,,,,,,
2,A3LKP6WPMP9UKX,,,,5.0,,,,,,...,,,,,,,,,,
3,AVIY68KEPQ5ZD,,,,5.0,,,,,,...,,,,,,,,,,
4,A1CV1WROP5KTTW,,,,,5.0,,,,,...,,,,,,,,,,


In [18]:
# Ratings only (user_id dropped), as a NumPy array
movie_ratings=movie.iloc[:,1:].values

In [52]:
a = np.unique(movie_ratings[~np.isnan(movie_ratings)],return_counts=True)
type(a)

tuple

In [53]:
df=pd.DataFrame(list(a))
df

Unnamed: 0,0,1,2,3,4
0,1.0,2.0,3.0,4.0,5.0
1,363.0,185.0,272.0,521.0,3659.0


In [38]:
# Replace missing ratings with 0; from here on 0 means "not rated" (real ratings are 1 to 5)
movie=movie.fillna(value=0)
movie.head()

Unnamed: 0,user_id,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,A3R5OBKS7OM2IR,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,AH3QC2PC1VTGP,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,A3LKP6WPMP9UKX,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,AVIY68KEPQ5ZD,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,A1CV1WROP5KTTW,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [40]:
data=movie.drop(columns='user_id')
data.head()

Unnamed: 0,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,Movie10,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [41]:
max_ratings=pd.DataFrame(data.sum().sort_values())
max_ratings.tail(5)

Unnamed: 0,0
Movie29,1168.0
Movie103,1241.0
Movie16,1446.0
Movie140,2794.0
Movie127,9511.0


In [55]:
#Which movies have maximum views/ratings?
pd.DataFrame(np.count_nonzero(data,axis=0),columns=['cnt']).sort_values(by='cnt',ascending = False).head()

Unnamed: 0,cnt
126,2313
139,578
15,320
102,272
28,243


The index shown is the zero-based column position, so row 126 is Movie127: it has been rated by 2313 users, far more than any other movie.
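Because `np.count_nonzero` returns a bare array, the index in the table above is a column position rather than a movie name. The following is a minimal sketch of one way to keep the names, shown on a small made-up frame (the `toy` data is illustrative and not taken from the dataset). Counting non-missing entries per column gives the audience size directly, and `nlargest`/`nsmallest` then cover both the most-rated and the least-audience questions.

```python
import numpy as np
import pandas as pd

# Illustrative toy ratings (NaN = not rated); not the real dataset.
toy = pd.DataFrame(
    {
        "Movie1": [5.0, np.nan, 4.0, 3.0],
        "Movie2": [np.nan, np.nan, 2.0, np.nan],
        "Movie3": [1.0, 5.0, np.nan, np.nan],
    }
)

# Number of users who rated each movie, indexed by movie name.
rating_counts = toy.notna().sum()

most_rated = rating_counts.nlargest(2)    # largest audiences
least_rated = rating_counts.nsmallest(2)  # smallest audiences

print(most_rated)
print(least_rated)
```

On the real data this would presumably be applied to the frame before `fillna`, with `5` in place of `2`.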

### What is the average rating for each movie? 

In [67]:
x=data.sum()
print(type(x),'is type of x')
#x=x.to_numpy()
y=np.count_nonzero(data,axis=0)
print(type(y),'is type of y')
avg=x/y
print(type(avg),'is type of avg')
b=pd.DataFrame(avg,columns=['Average']).sort_values(by='Average')
print(b)

<class 'pandas.core.series.Series'> is type of x
<class 'numpy.ndarray'> is type of y
<class 'pandas.core.series.Series'> is type of avg
           Average
Movie144  1.000000
Movie67   1.000000
Movie45   1.000000
Movie58   1.000000
Movie60   1.000000
Movie154  1.000000
Movie69   1.000000
Movie90   1.833333
Movie59   2.000000
Movie53   2.000000
Movie3    2.000000
Movie73   2.000000
Movie171  2.000000
Movie159  3.000000
Movie20   3.000000
Movie26   3.000000
Movie83   3.000000
Movie64   3.000000
Movie203  3.000000
Movie62   3.000000
Movie17   3.000000
Movie95   3.333333
Movie28   3.333333
Movie52   3.470588
Movie19   3.500000
Movie113  3.750000
Movie197  3.800000
Movie146  4.000000
Movie155  4.000000
Movie156  4.000000
...            ...
Movie106  5.000000
Movie105  5.000000
Movie78   5.000000
Movie48   5.000000
Movie128  5.000000
Movie65   5.000000
Movie152  5.000000
Movie49   5.000000
Movie150  5.000000
Movie149  5.000000
Movie148  5.000000
Movie147  5.000000
Movie50   5.000000
Movie145

In [74]:
#Define the top 5 movies with the maximum ratings.
#Many movies tie at an average of 5.0, so the tail shows only some of the top-rated movies.
b.tail(7)

Unnamed: 0,Average
Movie135,5.0
Movie63,5.0
Movie133,5.0
Movie132,5.0
Movie131,5.0
Movie55,5.0
Movie1,5.0
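Since many movies share an average of 5.0, ranking by the average alone leaves the order among them arbitrary. One possible refinement, sketched here on made-up data (the `toy` frame and the tie-breaking rule are assumptions, not part of the original analysis), is to compute the mean over non-missing ratings and break ties by the number of ratings.

```python
import numpy as np
import pandas as pd

# Illustrative toy ratings (NaN = not rated); not the real dataset.
toy = pd.DataFrame(
    {
        "Movie1": [5.0, np.nan, 5.0, 5.0],
        "Movie2": [5.0, np.nan, np.nan, np.nan],
        "Movie3": [4.0, 2.0, np.nan, np.nan],
    }
)

summary = pd.DataFrame(
    {
        "average": toy.mean(),     # NaN entries are skipped by default
        "n_ratings": toy.count(),  # non-missing ratings per movie
    }
)

# Highest average first; among equal averages, more ratings first.
ranked = summary.sort_values(["average", "n_ratings"], ascending=False)
print(ranked)
```

Here `Movie1` and `Movie2` both average 5.0, but `Movie1` ranks first because three users rated it.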


In [37]:
#Divide the data into training and test data
#Build a recommendation model on training data
#Make predictions on the test data

In [78]:
new_data=pd.melt(movie,id_vars=['user_id'],value_vars=movie.columns[1:].values)
new_data.head()

Unnamed: 0,user_id,variable,value
0,A3R5OBKS7OM2IR,Movie1,5.0
1,AH3QC2PC1VTGP,Movie1,0.0
2,A3LKP6WPMP9UKX,Movie1,0.0
3,AVIY68KEPQ5ZD,Movie1,0.0
4,A1CV1WROP5KTTW,Movie1,0.0


In [108]:
new_data.shape

(998688, 3)
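The melted frame has one row for every user-movie pair (4848 × 206 = 998688). Going by the rating counts tabulated earlier, only 5000 of those rows are actual ratings; the rest are zero placeholders. A minimal sketch of an alternative, on made-up data (the `wide` frame and the column names `movie` and `rating` are illustrative), is to melt the frame before filling missing values and keep only the observed ratings.

```python
import numpy as np
import pandas as pd

# Illustrative wide-format ratings (NaN = not rated); not the real dataset.
wide = pd.DataFrame(
    {
        "user_id": ["u1", "u2", "u3"],
        "Movie1": [5.0, np.nan, 4.0],
        "Movie2": [np.nan, 2.0, np.nan],
    }
)

# One row per user-movie pair, then drop the pairs with no rating.
long = wide.melt(id_vars="user_id", var_name="movie", value_name="rating")
observed = long.dropna(subset=["rating"]).reset_index(drop=True)

print(observed)
```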

In [79]:
from sklearn.model_selection import train_test_split
train_data,test_data=train_test_split(new_data,test_size=0.25)

In [80]:
train_data.shape

(749016, 3)

In [81]:
train_data.head()

Unnamed: 0,user_id,variable,value
446873,A23UCQFQ84N23C,Movie93,0.0
965345,A1UD0YDZ5KTGK0,Movie200,0.0
539324,A1JFZY0BE5COMO,Movie112,0.0
241131,A1PJAX7WONEPVX,Movie50,0.0
291989,A160AB64G2E949,Movie61,0.0


In [82]:
from sklearn import preprocessing
label_encoder=preprocessing.LabelEncoder()

In [83]:
train_data = train_data.copy()  # explicit copy avoids SettingWithCopyWarning
train_data['user_id'] = label_encoder.fit_transform(train_data['user_id'])

In [84]:
# Refit the same encoder on the movie column; the user mapping is not kept
train_data['variable'] = label_encoder.fit_transform(train_data['variable'])


In [85]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
X_train = train_data.iloc[:,0:2]
y_train = train_data.iloc[:,2:3]
classifier.fit (X_train, y_train)


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [86]:
# Fit on the full ID columns so test codes match training codes (assumes every user and movie also appears in train)
test_data = test_data.copy()  # explicit copy avoids SettingWithCopyWarning
test_data['variable'] = preprocessing.LabelEncoder().fit(new_data['variable']).transform(test_data['variable'])
test_data['user_id'] = preprocessing.LabelEncoder().fit(new_data['user_id']).transform(test_data['user_id'])


In [87]:
X_test = test_data.iloc[:,0:2]
y_test = test_data.iloc[:,2:3]
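Label codes are only comparable across the two splits if they come from the same fitted mapping. The sketch below uses made-up IDs (`train_ids` and `test_ids` are hypothetical) to show how refitting a `LabelEncoder` on the test IDs alone can assign different integers to the same user.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical IDs for illustration only.
train_ids = ["userB", "userA", "userC"]
test_ids = ["userC", "userB"]

# Fit once on the training IDs, then reuse that mapping for the test IDs.
enc = LabelEncoder().fit(train_ids)
consistent = enc.transform(test_ids)            # userC -> 2, userB -> 1

# Refitting on the test IDs alone builds a different mapping.
refit = LabelEncoder().fit_transform(test_ids)  # userC -> 1, userB -> 0

print(list(consistent), list(refit))
```

Note that `transform` raises an error for an ID the encoder has not seen, so the fitted set needs to cover every ID that appears in the test split.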

In [88]:
y_pred = classifier.predict(X_test)

In [89]:
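from sklearn.metrics import classification_report
# classification_report expects (y_true, y_pred). The arguments are passed here as
# (y_pred, y_test), so in the report below precision and recall are swapped and
# "support" counts predicted labels rather than true labels.
print (classification_report(y_pred,y_test))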

              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00    248415
         1.0       0.05      0.06      0.05        90
         2.0       0.00      0.00      0.00        46
         3.0       0.01      0.02      0.01        64
         4.0       0.01      0.01      0.01       126
         5.0       0.16      0.16      0.16       931

   micro avg       0.99      0.99      0.99    249672
   macro avg       0.20      0.21      0.20    249672
weighted avg       0.99      0.99      0.99    249672
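The 0.99 averages are dominated by the 0.0 class, which is the placeholder for unrated pairs; the scores for the real ratings 1 to 5 are all 0.16 or lower. One way to get a more informative number is to evaluate only on observed ratings and compare against a simple baseline. The sketch below is an assumption-laden illustration, not the approach used in this notebook: the `ratings` frame is made up, the split is fixed by hand, and the baseline simply predicts each movie's mean training rating.

```python
import numpy as np
import pandas as pd

# Illustrative long-format ratings (observed only); not the real dataset.
ratings = pd.DataFrame(
    {
        "user_id": ["u1", "u2", "u3", "u4", "u1", "u2", "u3", "u4"],
        "movie":   ["M1", "M1", "M1", "M1", "M2", "M2", "M2", "M2"],
        "rating":  [5.0, 4.0, 5.0, 4.0, 2.0, 1.0, 3.0, 2.0],
    }
)

# Fixed hold-out split so the example is reproducible.
test = ratings.iloc[[3, 7]]
train = ratings.drop(test.index)

# Baseline: predict each movie's mean training rating,
# falling back to the global mean for movies unseen in training.
movie_means = train.groupby("movie")["rating"].mean()
global_mean = train["rating"].mean()
pred = test["movie"].map(movie_means).fillna(global_mean)

# Root mean squared error on the held-out observed ratings.
rmse = float(np.sqrt(((pred - test["rating"]) ** 2).mean()))
print(round(rmse, 3))
```

Any candidate model can then be judged by whether its error on held-out observed ratings is lower than this baseline's.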

