Hi guys, Exploratory Data Analysis(EDA) is all about getting insight from datasets, and if this process is followed properly then we get some really good understanding of the features and distributions we have. In depth EDA certainly results in outstanding Feature Engineering which eventually leaves heavy impact on model performance.

**Update:** I've added a simple recommendation model using Keras, later we'll do more improvisation.

This is a random exercise that I have performed. Here I've used 100K movie rating's older dataset provided by Movie Lens (https://grouplens.org/datasets/movielens/) but If you want, you can use 1M dataset as well.

### Importing Libraries and reading datasets

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from wordcloud import WordCloud
from uszipcode import SearchEngine


%matplotlib inline



In [2]:
dateparse = lambda x: datetime.utcfromtimestamp(int(x)).strftime('%Y-%m-%d %H:%M:%S')

rating_df = pd.read_csv('./input/u.data', sep='\t', 
                        names=['user_id', 'movie_id', 'rating', 'timestamp'], 
                        parse_dates=['timestamp'], 
                        date_parser=dateparse)

movie_df = pd.read_csv('./input/u.item', sep='|', encoding='latin-1',
                    names=['movie_id', 'movie_title', 'release_date', 'video_release_date', 'imdb_url', 'unknown', 'action', 
                           'adventure', 'animation', 'childrens', 'comedy', 'crime', 'documentary', 'drama', 'fantasy', 
                           'film_noir', 'horror', 'musical', 'mystery', 'romance', 'sci_fi', 'thriller', 'war', 'western'])

user_df = pd.read_csv('./input/u.user', sep='|', encoding='latin-1',
                     names=['user_id', 'age', 'gender', 'occupation', 'zip_code'])

In [3]:
movie_df.sample(10)

Unnamed: 0,movie_id,movie_title,release_date,video_release_date,imdb_url,unknown,action,adventure,animation,childrens,...,fantasy,film_noir,horror,musical,mystery,romance,sci_fi,thriller,war,western
272,273,Heat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Heat%20(1995),0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1678,1679,B. Monkey (1998),06-Feb-1998,,http://us.imdb.com/M/title-exact?B%2E+Monkey+(...,0,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0
115,116,Cold Comfort Farm (1995),23-Apr-1996,,http://us.imdb.com/M/title-exact?Cold%20Comfor...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
745,746,Real Genius (1985),01-Jan-1985,,http://us.imdb.com/M/title-exact?Real%20Genius...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1405,1406,When Night Is Falling (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?When%20Night%...,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1279,1280,Gang Related (1997),01-Jan-1997,,http://us.imdb.com/M/title-exact?Gang+Related+...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14,15,Mr. Holland's Opus (1995),29-Jan-1996,,http://us.imdb.com/M/title-exact?Mr.%20Holland...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
969,970,Hear My Song (1991),01-Jan-1991,,http://us.imdb.com/M/title-exact?Hear%20My%20S...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
409,410,Kingpin (1996),12-Jul-1996,,http://us.imdb.com/M/title-exact?Kingpin%20(1996),0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
397,398,Super Mario Bros. (1993),01-Jan-1993,,http://us.imdb.com/M/title-exact?Super%20Mario...,0,1,1,0,1,...,0,0,0,0,0,0,1,0,0,0


### Exploring 'item_df' - 

In [4]:
movie_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1682 entries, 0 to 1681
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   movie_id            1682 non-null   int64  
 1   movie_title         1682 non-null   object 
 2   release_date        1681 non-null   object 
 3   video_release_date  0 non-null      float64
 4   imdb_url            1679 non-null   object 
 5   unknown             1682 non-null   int64  
 6   action              1682 non-null   int64  
 7   adventure           1682 non-null   int64  
 8   animation           1682 non-null   int64  
 9   childrens           1682 non-null   int64  
 10  comedy              1682 non-null   int64  
 11  crime               1682 non-null   int64  
 12  documentary         1682 non-null   int64  
 13  drama               1682 non-null   int64  
 14  fantasy             1682 non-null   int64  
 15  film_noir           1682 non-null   int64  
 16  horror

In [5]:
movie_df.sample(6)

Unnamed: 0,movie_id,movie_title,release_date,video_release_date,imdb_url,unknown,action,adventure,animation,childrens,...,fantasy,film_noir,horror,musical,mystery,romance,sci_fi,thriller,war,western
519,520,"Great Escape, The (1963)",01-Jan-1963,,http://us.imdb.com/M/title-exact?Great%20Escap...,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
28,29,Batman Forever (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Batman%20Fore...,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
126,127,"Godfather, The (1972)",01-Jan-1972,,"http://us.imdb.com/M/title-exact?Godfather,%20...",0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
118,119,Maya Lin: A Strong Clear Vision (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?Maya%20Lin:%2...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
901,902,"Big Lebowski, The (1998)",26-Dec-1997,,http://us.imdb.com/M/title-exact?imdb-title-11...,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
808,809,Rising Sun (1993),01-Jan-1993,,http://us.imdb.com/M/title-exact?Rising%20Sun%...,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [6]:
movie_df.describe()

Unnamed: 0,movie_id,video_release_date,unknown,action,adventure,animation,childrens,comedy,crime,documentary,...,fantasy,film_noir,horror,musical,mystery,romance,sci_fi,thriller,war,western
count,1682.0,0.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,...,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0,1682.0
mean,841.5,,0.001189,0.149227,0.080262,0.02497,0.072533,0.300238,0.064804,0.029727,...,0.01308,0.014269,0.054697,0.033294,0.036266,0.146849,0.060048,0.149227,0.042212,0.016052
std,485.695893,,0.034473,0.356418,0.271779,0.156081,0.259445,0.458498,0.246253,0.169882,...,0.11365,0.118632,0.227455,0.179456,0.187008,0.354061,0.237646,0.356418,0.201131,0.125714
min,1.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,421.25,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,841.5,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1261.75,,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1682.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Observations:
    1. video_release_date seems containing lots of NaN values.
    2. release_date needs to be parsed to datetime.
    3. imdb_url contains external links which is not usefull here anyways.

In [7]:
movie_df.release_date = pd.to_datetime(movie_df.release_date)

### Checkout NaN values percent in each column

In [8]:
def check_nan_percent(df):
    columns_null_count = df.isnull().sum()
    null_percent_df = pd.DataFrame({'null_percent': (columns_null_count/df.shape[0])*100, 'total_null': columns_null_count})
    return null_percent_df[null_percent_df.null_percent>0]

In [9]:
check_nan_percent(movie_df)

Unnamed: 0,null_percent,total_null
release_date,0.059453,1
video_release_date,100.0,1682
imdb_url,0.178359,3


Clearly video_release_date contains nothing and imdb_url is of no use, lets remove them. In release_date we have only null value, if we remove entire row, then it will cost us one missing movie_id which is not good. So lets fill it with mode value of same column, since this approach is not gonna harm much.

In [10]:
movie_df.drop(columns=['video_release_date', 'imdb_url'], inplace=True)
movie_df.release_date.fillna(value=movie_df.release_date.mode()[0], inplace=True)

check_nan_percent(movie_df)

Unnamed: 0,null_percent,total_null


Cool

### Exploring rating_df

In [11]:
rating_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   user_id    100000 non-null  int64         
 1   movie_id   100000 non-null  int64         
 2   rating     100000 non-null  int64         
 3   timestamp  100000 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(3)
memory usage: 3.1 MB


In [12]:
rating_df.sample(6)

Unnamed: 0,user_id,movie_id,rating,timestamp
50190,346,3,3,1997-09-26 09:16:32
12668,384,355,4,1998-03-30 16:07:35
86203,464,479,4,1997-11-01 03:32:47
88333,741,255,3,1998-04-01 19:14:58
20516,339,179,5,1998-03-27 21:06:33
28843,309,334,4,1997-10-20 17:59:16


In [13]:
rating_df.describe()

Unnamed: 0,user_id,movie_id,rating
count,100000.0,100000.0,100000.0
mean,462.48475,425.53013,3.52986
std,266.61442,330.798356,1.125674
min,1.0,1.0,1.0
25%,254.0,175.0,3.0
50%,447.0,322.0,4.0
75%,682.0,631.0,4.0
max,943.0,1682.0,5.0


Observations:
    1. Everything seems all ok.

#### Check missing values

In [14]:
check_nan_percent(rating_df)

Unnamed: 0,null_percent,total_null


perfect, we don't have any missing value.

### Exploring user_df

In [15]:
user_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 943 entries, 0 to 942
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     943 non-null    int64 
 1   age         943 non-null    int64 
 2   gender      943 non-null    object
 3   occupation  943 non-null    object
 4   zip_code    943 non-null    object
dtypes: int64(2), object(3)
memory usage: 37.0+ KB


In [16]:
user_df.sample(6)

Unnamed: 0,user_id,age,gender,occupation,zip_code
891,892,36,M,other,45243
346,347,18,M,student,90210
171,172,55,M,marketing,22207
10,11,39,F,other,30329
324,325,48,M,technician,2139
161,162,25,M,artist,15610


Zip codes can be converted to city names, let's do that, and later we can use it for some additional analysis.

In [19]:
search = SearchEngine(simple_zipcode=True)
user_df['city'] = user_df.zip_code.apply(lambda zip: search.by_zipcode(zip).major_city)

Exception ignored in: <function SearchEngine.__del__ at 0x7f90eee7b5f0>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/uszipcode/search.py", line 195, in __del__
    if self.ses:
AttributeError: 'SearchEngine' object has no attribute 'ses'


TypeError: __init__() got an unexpected keyword argument 'simple_zipcode'

In [20]:
user_df.describe()

Unnamed: 0,user_id,age
count,943.0,943.0
mean,472.0,34.051962
std,272.364951,12.19274
min,1.0,7.0
25%,236.5,25.0
50%,472.0,31.0
75%,707.5,43.0
max,943.0,73.0


#### Check missing values

In [21]:
check_nan_percent(user_df)

Unnamed: 0,null_percent,total_null


37 zip codes can not be converted to city names. Lets check these records

In [22]:
user_df[user_df.city.isnull()]

AttributeError: 'DataFrame' object has no attribute 'city'

In all these cases zip codes seems invalid. Again removing these records completely will bring inconsistency, so better we fill these value with mode of the column.

In [None]:
user_df.city.fillna(value=user_df.city.mode()[0], inplace=True)

## Finding Insight - 

I believe the best way to do that is by asking usefull questions from dataset, and not moving forward till you get the answers.

__QA. From movie_df -__
   1. [What are common genere of movies?](#QA1)
   2. [How many movies got released each year and month?](#QA2)
   3. [What is the prefferable week of month to release movies?](#QA3)

__QB. From movie_df -__
   1. [Who watches more movies Men/Women?](#QB1)
   2. [What age group watches more movies?](#QB2)
   3. [Which kind of occupant watches more movies?](#QB3)
    
__QC. From movie_df + user_df -__
   1. [What gender likes which kind of genere](#QC1)
   2. [What age group watches which kind of movies?](#QC2)
   3. [Are movie lover's increasing over time](#QC3)
      1. [Overall](#QC3a)
      2. [Gender wise](#QC3b)
      3. [Age group wise](#QC3c)

__QD. From movie_df + user_df + rating_df__
   1. [How much rating people give mostly.](#QD1)
   2. [Most Rated Movies](#QD2)
      1. [during all years](#QD2a)
      3. [gender wise](#QD2c)
      4. [age group wise](#QD2d)
   3. [Most Loved Movies](#QD3)
      1. [during all years](#QD3a)
      3. [gender wise](#QD3c)
      4. [age group wise](#QD3d)
   4. [Worst movie as per user rating.](QD4)

If you notice at various places, we have used word age group, which is not already there, so lets go ahead and create one additional categorical feature in user_df called __age_group__. Minimum age is 7 and maximum age is 73 so we should have following divisions:

* 5-12   -  Gradeschooler
* 13-19  -  Teenager
* 20-35  -  Young
* 35-55  -  Midlife
* above 55    -  Old

In [None]:
user_df['age_group'] = user_df.age.apply(lambda age: 'Gradeschooler' if 5<=age<=12 else ('Teenager' if 13<=age<=19 else ('Young' if 20<=age<=35 else ('Midlife' if 35<=age<=55 else 'Old'))))
user_df.sample(5)

Lets also create joined DataFrames, they will be helpful later.

In [None]:
rating_user_df = rating_df.join(other=user_df, how='inner', on='user_id', lsuffix='_R')
rating_user_movie_df = rating_user_df.join(other=movie_df, how='inner', on='movie_id', rsuffix='_M')
rating_movie_df = rating_df.join(other=movie_df, how='inner', on='movie_id', rsuffix='_M')

### <a id='QA1'>What are common genere of movies?</a>

In [None]:
generes = ['unknown', 'action',
       'adventure', 'animation', 'childrens', 'comedy', 'crime', 'documentary',
       'drama', 'fantasy', 'film_noir', 'horror', 'musical', 'mystery',
       'romance', 'sci_fi', 'thriller', 'war', 'western']

plt.figure(figsize=(12,7))
genere_counts = movie_df.loc[:,generes].sum().sort_values(ascending=False)
sns.barplot(x=genere_counts.index, y=genere_counts.values)
plt.xticks(rotation=60);

drama and comedy are the most common genere type. We must also note that it can not be a clear indication of people's preference, because One movie can have more than one genere and drama is the most commor genere type.

### <a id="QA2">How many movies got released each year and month?</a>

Yearly release

In [None]:
plt.figure(figsize=(12,7))
yearly_release_counts = movie_df.groupby(movie_df.release_date.dt.year).size().sort_values(ascending=False)
sns.lineplot(yearly_release_counts.index, yearly_release_counts.values);
plt.xlabel('Release Year');

#release significantly increased after 80s, however it doesn't exactly depicts the actual number of release, but the number of rated release, actual number of release must be far greater than what we have. Anyway, one thing is clear that people mostly watched movies released in 90s. There is a sudden fall in 1998's record, that might be because of incomplete record.

Since there is not much record available for movies before 90s, so I will consider release count for months in 90s only.`

In [None]:
plt.figure(figsize=(12,7))
monthly_release_counts = movie_df[movie_df.release_date.dt.year > 1990].groupby(movie_df.release_date.dt.month).size()
sns.barplot(['jan', 'feb', 'mar', 'apr', 'may', 'june', 'july', 'aug', 'sept', 'oct', 'nov', 'dec'], monthly_release_counts.values);
plt.xlabel('Release Month');

this is weird, rated movie rlease count in Jan month is too high as cmpare to other months. Dataset creators must have considered Jan 1st as the default release month and date.

### <a id="QA3">What is the prefferable week of month to release movies?</a>

In [None]:
plt.figure(figsize=(12,7))
weekday_release_counts = movie_df[movie_df.release_date.dt.year > 1990].groupby(movie_df.release_date.dt.dayofweek).size()
sns.barplot(['mon', 'tue', 'wed', 'thus', 'fri', 'sat', 'sun'], weekday_release_counts.values);
plt.xlabel('Release Day of Week');

This trend seems fine. Mostly rated movies got released on Friday, and then on weekends.

### <a id='QB1'>Who watches more movies Men/Women?</a>

In [None]:
plt.figure(figsize=(9,6))
sns.barplot(user_df.groupby('gender').size().index, user_df.groupby('gender').size().values)
plt.title('Male/Female movie rating ratio');

Male, of course ;)
or may be mostly Male takes interest in rating movies.

### <a id='QB2'>What age group watches more movies?</a>

In [None]:
plt.figure(figsize=(9,6))
sns.barplot(user_df.groupby('age_group').size().index, user_df.groupby('age_group').size().values)
plt.title('movie watchers age_group wise');

Obviously Gradeschoolers and Teenagers don't have that much of time and old people don't have much interestest. Youngesters are the ones who watches movies mostly and prefers rating them. So overall people from age 20 - 55 watches more movies.

### <a id='QB3'>Which kind of occupant watches more movies?</a>

In [None]:
plt.figure(figsize=(12,7))
movie_watcher_occupants = user_df.groupby('occupation').size().sort_values(ascending=False)
sns.barplot(movie_watcher_occupants.index, movie_watcher_occupants.values)
plt.title('movie watchers age_group wise')
plt.xticks(rotation=50);

It appears that Students watches more movies, may be the ones who are above 20 means Young students. Irony is that people who are in entertainment don't watch or may rate movies. Lets explore it bit more.

In [None]:
pd.DataFrame(user_df.groupby(['occupation', 'age_group']).size().sort_values(ascending=False))

From above dataframe it is clear that Young students are more interested in movies. After that Midelife people in any profession seems interested too.

### <a id='QC1'>What gender likes which kind of genere</a>

In [None]:
temp_df = rating_user_movie_df.groupby('gender').sum().loc[:,generes]
temp_df = temp_df.transpose()
temp_df

Stacked Bar Chart-

In [None]:
plt.figure(figsize=(12, 6))

temp_df.M.sort_values(ascending=False).plot(kind='bar', color='teal', label="Male")
temp_df.F.sort_values(ascending=False).plot(kind='bar', color='black', label="Fe-Male")
plt.legend()
plt.xticks(rotation=60)
plt.show()

Multi Car Chart - 

In [None]:
plt.figure(figsize=(12, 6))
m_temp_df = temp_df.M.sort_values(ascending=False)
f_temp_df = temp_df.F.sort_values(ascending=False)

plt.bar(x=m_temp_df.index, height=m_temp_df.values, label="Male", align="edge", width=0.3, color='teal')
plt.bar(x=f_temp_df.index, height=f_temp_df.values, label="Female", width=0.3, color='black')
plt.legend()
plt.xticks(rotation=60)
plt.show()

Looks cool.. isn't it.

### <a id='QC2'>What age group watches which kind of movies?</a>

### <a id='QC3'>Are movie lover's increasing over time</a>

####      <a id='QC3a'>Overall</a>

In [None]:
rating_df.groupby(rating_df.timestamp.dt.year).size()

We have rating record for only two years. Which is not enought for this observation.

####      <a id='QC3b'>Gender wise</a>

In [None]:
rating_user_df.groupby([rating_user_df.timestamp.dt.year, 'gender']).size()

We don't have enough record in ratings dataset :/

####      <a id='QC3c'>Age group wise</a>

In [None]:
rating_user_df.groupby([rating_user_df.timestamp.dt.year, 'age_group']).size()

We don't have enough record in ratings dataset :/

### <a id='QD1'>How much rating people give mostly.</a>

In [None]:
temp_df = rating_user_df.groupby(['gender', 'rating']).size()
plt.figure(figsize=(10, 5))
m_temp_df = temp_df.M.sort_values(ascending=False)
f_temp_df = temp_df.F.sort_values(ascending=False)

plt.bar(x=m_temp_df.index, height=m_temp_df.values, label="Male", align="edge", width=0.3, color='teal')
plt.bar(x=f_temp_df.index, height=f_temp_df.values, label="Female", width=0.3, color='black')
plt.title('Ratings given by Male/Female Viewers')
plt.legend()
plt.xlabel('Ratings')
plt.ylabel('Count')
plt.show()

Viewers gave mostly 4 start rating then 3 and then 5

### <a id='QD2'>Most Rated Movies</a>

####      <a id='QD2a'>during all years</a>

In [None]:
def draw_horizontal_movie_bar(movie_titles, ratings_count, title=''):
    plt.figure(figsize=(12, 7))
    sns.barplot(y=movie_titles, x=ratings_count, orient='h')
    plt.title(title)
    plt.ylabel('Movies')
    plt.xlabel('Count')
    plt.show()

In [None]:
top_ten_rated_movies = rating_movie_df.groupby('movie_id').size().sort_values(ascending=False)[:10]
top_ten_movie_titles = movie_df.iloc[top_ten_rated_movies.index].movie_title

draw_horizontal_movie_bar(top_ten_movie_titles.values, top_ten_rated_movies.values, 'Top 10 watched movies')

####      <a id='QD2c'>gender wise</a>

In [None]:
top_rated_movies_gender_wise = rating_user_movie_df.groupby(['gender','movie_id']).size()

for index_label in top_rated_movies_gender_wise.index.get_level_values(0).unique():

    top_10_userkind_rated_movies = top_rated_movies_gender_wise[index_label].sort_values(ascending=False)[:10]
    top_10_userkind_rated_movie_titles = movie_df.iloc[top_10_userkind_rated_movies.index].movie_title
    draw_horizontal_movie_bar(top_10_userkind_rated_movie_titles.values, top_10_userkind_rated_movies.values, f'Top 10 {index_label} watched movies')

####      <a id='QD2d'>age group wise</a>

In [None]:
top_rated_movies_age_group_wise = rating_user_movie_df.groupby(['age_group','movie_id']).size()

for index_label in top_rated_movies_age_group_wise.index.get_level_values(0).unique():
    top_10_userkind_rated_movies = top_rated_movies_age_group_wise[index_label].sort_values(ascending=False)[:10]
    top_10_userkind_rated_movie_titles = movie_df.iloc[top_10_userkind_rated_movies.index].movie_title
    draw_horizontal_movie_bar(top_10_userkind_rated_movie_titles.values, top_10_userkind_rated_movies.values, f'Top 10 {index_label} watched movies')

Mostly movies are same for all user kinds, may be because of limited record we have in our ratings dataset

### <a id='QD3'>Most Loved Movies</a>

####      <a id='QD3a'>during all years</a>

In [None]:
top_ten_most_loved_movies = rating_movie_df.groupby('movie_id').sum().rating.sort_values(ascending=False)[:10]
top_ten_most_loved_movie_titles = movie_df.iloc[top_ten_most_loved_movies.index].movie_title

draw_horizontal_movie_bar(top_ten_most_loved_movie_titles.values, top_ten_most_loved_movies.values, 'Top 10 most loved movies')

####      <a id='QD3c'>gender wise</a>

In [None]:
most_loved_movies_gender_wise = rating_user_movie_df.groupby(['gender','movie_id']).sum().rating

for index_label in most_loved_movies_gender_wise.index.get_level_values(0).unique():

    top_10_userkind_loved_movies = most_loved_movies_gender_wise[index_label].sort_values(ascending=False)[:10]
    top_10_userkind_loved_movie_titles = movie_df.iloc[top_10_userkind_loved_movies.index].movie_title
    draw_horizontal_movie_bar(top_10_userkind_loved_movie_titles.values, top_10_userkind_loved_movies.values, f'Top 10 {index_label} loved movies')

####      <a id='QD3d'>age group wise</a>

In [None]:
most_loved_movies_age_group_wise = rating_user_movie_df.groupby(['age_group','movie_id']).sum().rating

for index_label in most_loved_movies_age_group_wise.index.get_level_values(0).unique():
    top_10_userkind_loved_movies = top_rated_movies_age_group_wise[index_label].sort_values(ascending=False)[:10]
    top_10_userkind_loved_movie_titles = movie_df.iloc[top_10_userkind_loved_movies.index].movie_title
    draw_horizontal_movie_bar(top_10_userkind_loved_movie_titles.values, top_10_userkind_loved_movies.values, f'Top 10 {index_label} loved movies')

To be honest, I haven't watched most of these movies so not quite sure what movie contains what kind of storyline :>

### <a id='QD4'>Worst movies as per user rating.</a>

In [None]:
movies_ratings_sum = rating_user_movie_df.groupby('movie_id').sum().rating.sort_values()
movies_ratings_sum.index = movie_df.iloc[movies_ratings_sum.index].movie_title
# Will show movies with 0 < total_rating<= 10
lowest_rated_movies = movies_ratings_sum[movies_ratings_sum <= 10]


wordcloud = WordCloud(min_font_size=7, width=800, height=500, random_state=21, max_font_size=50, relative_scaling=0.5, colormap='Dark2')
# Substracted lowest_rated_movies from 11 so that we can have greater font size of least rated movies.
wordcloud.generate_from_frequencies(frequencies=(11-lowest_rated_movies).to_dict())
plt.figure(figsize=(16,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Well.. all I can say is I am not amazed.

So this is it in this EDA, I have covered most of the questions that could be asked from this dataset except the ones related to user locations(zip codes), we can generate some insight from those as well, like 
- Is there any pattern in location and movie choices?
- What kind of occupats are residing in which place?
- Whether place has any impact on users being soft rater and hard rater etc.

Will do that later ;)

In [None]:
# rating_df
# movie_df
# user_df

In [None]:
rating_df.columns, 

In [None]:
top_users_g = rating_df.groupby('user_id')['rating'].count()
top_users = top_users_g.sort_values(ascending=False)[:15]
top_movies_g = rating_df.groupby('movie_id')['rating'].count()
top_movies = top_movies_g.sort_values(ascending=False)[:15]
top_users_movies = rating_df.join(top_users, rsuffix='_r', how='inner', on='user_id')
top_users_movies = top_users_movies.join(top_movies, rsuffix='_r', how='inner', on='movie_id')

In [None]:
rating_df.shape, top_users_movies.shape

Ratings record drastically came down after applying top-n user and movie filters. Let's check what these movies and users are:

In [None]:
top_users_movies.user_id.unique(), top_users_movies.movie_id.unique()

Top movies picked by top users:

In [None]:
top_users_df = user_df[user_df.user_id.isin(top_users_movies.user_id.unique())]
top_movies_df = movie_df[movie_df.movie_id.isin(top_users_movies.movie_id.unique())]
top_movies_df

In [None]:
pd.crosstab(top_users_movies.user_id, top_users_movies.movie_id, top_users_movies.rating, aggfunc=np.sum)

The user/movie fields are currently non-sequential integers representing some unique ID for that entity. We need them to be sequential starting at zero

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [None]:
user_enc = LabelEncoder()
rating_df['user'] = user_enc.fit_transform(rating_df['user_id'].values)
n_users = rating_df['user'].nunique()

item_enc = LabelEncoder()
rating_df['movie'] = item_enc.fit_transform(rating_df['movie_id'].values)
n_movies = rating_df['movie'].nunique()

rating_df['rating'] = rating_df['rating'].values.astype(np.float32)

min_rating = min(rating_df['rating'])
max_rating = max(rating_df['rating'])

n_users, n_movies, min_rating, max_rating

In [None]:
# train/test split
X = rating_df[['user', 'movie']].values
y = rating_df['rating'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

In [None]:
# deciding on number of factors
n_factors = 50

In [None]:
# separating user and movie lists
X_train_array = [X_train[:, 0], X_train[:, 1]]
X_test_array = [X_test[:, 0], X_test[:, 1]]

Building and training a simple Neural Network:

In [None]:
from keras.models import Model
from keras.layers import Input, Reshape, Dot
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam
from keras.regularizers import l2

In [None]:
def RecommenderV1(n_users, n_movies, n_factors):
    # building user input
    user = Input(shape=(1,))
    user_embed = Embedding(n_users, n_factors, embeddings_initializer='he_normal',
                  embeddings_regularizer=l2(1e-6))(user)
    user_embed = Reshape((n_factors,))(user_embed)
    
    # building movie input
    movie = Input(shape=(1,))
    movie_embed = Embedding(n_movies, n_factors, embeddings_initializer='he_normal',
                  embeddings_regularizer=l2(1e-6))(movie)
    movie_embed = Reshape((n_factors,))(movie_embed)
    
    # putting together user and movie input
    x = Dot(axes=1)([user_embed, movie_embed])
    model = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='mean_squared_error', optimizer=opt)

    return model


model = RecommenderV1(n_users, n_movies, n_factors)
model.summary()

In [None]:
history = model.fit(x=X_train_array, y=y_train, 
                    batch_size=64, epochs=5, 
                    verbose=1, validation_data=(X_test_array, y_test))

It's not a bad result, considering the simplicty of our network. Let's do few improvements:

In [None]:
from keras.layers import Add, Activation, Lambda

**Improvising embedding layer:** Add a bias term to each of the embeddings

In [None]:
class EmbeddingLayer:
    def __init__(self, n_items, n_factors):
        self.n_items = n_items
        self.n_factors = n_factors
    
    def __call__(self, x):
        x = Embedding(self.n_items, self.n_factors, embeddings_initializer='he_normal',
                      embeddings_regularizer=l2(1e-6))(x)
        x = Reshape((self.n_factors,))(x)
        return x

**Adding a sigmoid layer** and scaling the output between the valid rating range i.e. 1-5

In [None]:
def RecommenderV2(n_users, n_movies, n_factors, min_rating, max_rating):
    user = Input(shape=(1,))
    u = EmbeddingLayer(n_users, n_factors)(user)
    ub = EmbeddingLayer(n_users, 1)(user)

    movie = Input(shape=(1,))
    m = EmbeddingLayer(n_movies, n_factors)(movie)
    mb = EmbeddingLayer(n_movies, 1)(movie)

    x = Dot(axes=1)([u, m])
    x = Add()([x, ub, mb])
    x = Activation('sigmoid')(x)
    x = Lambda(lambda x: x * (max_rating - min_rating) + min_rating)(x)
    model = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='mean_squared_error', optimizer=opt)
    return model

In [None]:
model = RecommenderV2(n_users, n_movies, n_factors, min_rating, max_rating)
model.summary()

In [None]:
# Let's retrain 
history = model.fit(x=X_train_array, y=y_train, 
                    batch_size=64, epochs=5,
                    verbose=1, validation_data=(X_test_array, y_test))

In [None]:
There is a slight improvement, we can still apply more enhancements in our network to get a much better model.

There is a slight improvement, we can still apply more enhancements in our network to get a much better model.