# Exploratory Data Analysis

For this exercise we will download the MovieLens dataset. It has been used many times as a testbed for recommendation algorithms, i.e. algorithms that predict which movies a user may be interested in watching, similar to what Netflix does. There are several versions of the dataset. We will use "MovieLens 1 Million", which contains approximately one million ratings of movies that users watched.



## Downloading the necessary file(s)
You can download MovieLens manually from the [GroupLens website](https://grouplens.org/datasets/movielens/20m/), but it is better to load the data programmatically.

If you are using Colab, you can load the raw data directly from GitHub.
Go to the [data folder](https://github.com/ahmadajal/DM_ML_course_public/tree/master/2%263.%20Data%26EDA/data/ml-1m) of week 2&3 on GitHub and copy the links to the raw files.

If you are working in a local Jupyter notebook instead, download the files from [here](http://files.grouplens.org/datasets/movielens/ml-1m.zip).

In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [None]:
# movies data
movies_link = ""  # TO DO: paste the link to the raw movies file from GitHub
# engine='python' is needed because '::' is a multi-character delimiter
movies = pd.read_csv(movies_link, delimiter='::', engine='python',
                     names=['movie_id', 'title', 'genres'], header=0)
# ratings data
ratings_link = ""  # TO DO: paste the link to the raw ratings file from GitHub
ratings = pd.read_csv(ratings_link, delimiter='::', engine='python',
                      names=['UserID', 'MovieID', 'Rating', 'Timestamp'], header=0)
# users data
users_link = ""  # TO DO: paste the link to the raw users file from GitHub
users = pd.read_csv(users_link, delimiter='::', engine='python',
                    names=['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code'], header=0)
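
As a quick way to check the parsing options before pointing them at the real files, here is a minimal sketch that reads a few invented rows in the same `::`-separated layout from memory. The rows and the name `sample_movies` are made up for illustration. Note that `header=0` treats the first line of the file as a header and replaces it with `names`; if a file has no header line, `header=None` (used here) keeps every row.

```python
import io

import pandas as pd

# Three invented rows in a "::"-separated layout, with no header line.
raw = (
    "1::Movie A (1995)::Animation|Comedy\n"
    "2::Movie B (1987)::Adventure|Fantasy\n"
    "3::Movie C (1999)::Drama\n"
)

sample_movies = pd.read_csv(
    io.StringIO(raw),
    sep="::",          # multi-character separator
    engine="python",   # the C parser only handles single-character separators
    header=None,       # this sample has no header row, so keep every line as data
    names=["movie_id", "title", "genres"],
)
print(sample_movies)
```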


In [None]:
movies.head()

In [None]:
ratings.head()

In [None]:
users.head()

Split the title and the release year into separate columns:

In [None]:
# raw string avoids invalid-escape warnings; expand=False returns a Series
movies['year'] = movies.title.str.extract(r"\((\d{4})\)", expand=False)
movies.head()

Correcting the data type of the columns:

If you check the data types of the columns right now, the new column `year` is of type `object`, whereas it should be an integer.

In [None]:
# TO DO: convert column year to integer
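
One possible approach, sketched on a toy dataframe rather than on `movies`: extract the year as text, then convert it with `pd.to_numeric`. The nullable `Int64` dtype is an assumption made here so that titles without a year become missing values instead of raising an error; a plain `astype(int)` also works when every title contains a year.

```python
import pandas as pd

# Toy titles (invented); the last one has no year in parentheses.
toy = pd.DataFrame({"title": ["Movie A (1995)", "Movie B (1987)", "Untitled"]})

# Extract the four digits between parentheses as text.
year_text = toy["title"].str.extract(r"\((\d{4})\)", expand=False)

# Convert text to numbers, then to a nullable integer dtype that tolerates missing years.
toy["year"] = pd.to_numeric(year_text).astype("Int64")
print(toy.dtypes)
```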



Convert the timestamp in the ratings table to a date type and add it to the ratings table as a new column `date`.

In [None]:
# TO DO: convert timestamp to date and add a new column called date to ratings
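
A minimal sketch of the conversion on toy data, assuming the timestamps are seconds since the Unix epoch (check the dataset README to confirm the unit). The dataframe `toy_ratings` and its values are invented for illustration.

```python
import pandas as pd

# Toy timestamps, assumed to be seconds since 1970-01-01 00:00:00 UTC.
toy_ratings = pd.DataFrame({"Timestamp": [0, 86400, 978300760]})

# unit="s" tells pandas how to interpret the integers.
toy_ratings["date"] = pd.to_datetime(toy_ratings["Timestamp"], unit="s")
print(toy_ratings)
```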



The README describes the occupation codes. Let's apply a mapping that converts the occupation codes in the users dataframe to their descriptions.

In [None]:
# TO DO: map the occupation codes to the jobs name using the readme
job_code = {0:  "other or not specified",
 1:  "academic/educator",
 2:  "artist",
 3:  "clerical/admin",
 4:  "college/grad student",
 5:  "customer service",
 6:  "doctor/health care",
 7:  "executive/managerial",
 8:  "farmer",
 9:  "homemaker",
10:  "K-12 student",
11:  "lawyer",
12:  "programmer",
13:  "retired",
14:  "sales/marketing",
15:  "scientist",
16:  "self-employed",
17:  "technician/engineer",
18:  "tradesman/craftsman",
19:  "unemployed",
20:  "writer"}
users["Occupation"] = users["Occupation"].map(job_code)

users.head()

### Q1, Q2, Q3: Number of movies, users and ratings
How many movies?

How many users?

How many ratings?

In [None]:
print("number of users: ", ratings.groupby("UserID").count().shape[0])
# TO DO:
print("number of movies: ")
print("number of ratings: ")
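
As an illustration on invented data, `nunique` counts distinct values in a column and `len` counts rows. Keep in mind that the number of distinct movie ids in the ratings table can differ from the number of rows in the movies table if some movies were never rated.

```python
import pandas as pd

# Toy ratings table (invented values).
toy = pd.DataFrame({
    "UserID":  [1, 1, 2, 3, 3, 3],
    "MovieID": [10, 20, 10, 10, 20, 30],
    "Rating":  [5, 3, 4, 2, 4, 5],
})

n_users = toy["UserID"].nunique()    # distinct users who rated something
n_movies = toy["MovieID"].nunique()  # distinct movies that received a rating
n_ratings = len(toy)                 # one row per rating
print(n_users, n_movies, n_ratings)
```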

### Q4: Distribution of ratings
What is the distribution of ratings? Can you find the median and mode values?

In [None]:
# hint: make a bar plot
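
A small sketch on an invented series of ratings: `value_counts` gives the frequency of each rating value, and `median` and `mode` give the two summary statistics asked for.

```python
import pandas as pd

# Toy ratings on a 1-5 scale (invented values).
toy = pd.Series([5, 4, 4, 3, 4, 5, 1, 2, 4, 3], name="Rating")

counts = toy.value_counts().sort_index()  # frequency of each rating, ordered 1..5
median = toy.median()
mode = toy.mode()  # a Series, because several values can tie for most frequent
print(counts)
print("median:", median, "mode:", mode.tolist())
```

Calling `counts.plot(kind="bar")` on the result would draw the bar plot mentioned in the hint.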

### Q5, Q6: Average rating
What is the average rating across all users?

What is the average rating *per user*?

In [None]:
# across all users
print("avg rating across all users: ", np.mean(ratings["Rating"]))

In [None]:
# per user
# hint: read about `groupby` in pandas
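
To illustrate the hint on invented data: `groupby` splits the rows by user and `mean` is then computed within each group. The overall mean weights every rating equally, so it generally differs from the mean of the per-user means.

```python
import pandas as pd

# Toy ratings (invented values).
toy = pd.DataFrame({
    "UserID": [1, 1, 2, 3, 3, 3],
    "Rating": [5, 3, 4, 2, 4, 5],
})

per_user = toy.groupby("UserID")["Rating"].mean()  # one mean per user
overall = toy["Rating"].mean()                     # every rating weighted equally
print(per_user)
print("overall:", overall, "mean of user means:", per_user.mean())
```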

### Q7, Q8, Q9: Most-watched movies
Which are the most-watched movies?

Which movies are the favorites (best average rating)?

Create a dataframe that has both of the above values for each movie in separate columns.

In [None]:
# merge the dataframes to match the movie id to the movie name
df = pd.merge(movies,ratings,left_on='movie_id', right_on="MovieID")
df.head()

In [None]:
# most watched movies
# TO DO

In [None]:
# movies with best avg ratings
# TO DO

In [None]:
# Do both of the above simultaneously:
# create Dataframe: ratings count and mean rating per movie
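
One way to get both values in a single pass is named aggregation, sketched here on invented data. The column name `ratingCount` follows the name used later in this notebook; `ratingMean` and `summary` are hypothetical names chosen for this example.

```python
import pandas as pd

# Toy merged table: one row per rating (invented values).
toy = pd.DataFrame({
    "title":  ["A", "A", "A", "B", "B", "C"],
    "Rating": [5, 4, 3, 2, 4, 5],
})

summary = (
    toy.groupby("title")["Rating"]
    .agg(ratingCount="count", ratingMean="mean")  # two statistics per movie
    .sort_values("ratingCount", ascending=False)  # most-rated movies first
)
print(summary)
```

Sorting the same dataframe by `ratingMean` instead gives the best-rated movies; movies with very few ratings tend to dominate that ranking, which is one reason to keep the count next to the mean.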

### For fun!
Let's try to find the poster of the popular movies in our dataset.

To do so we can use [tmdbsimple](https://pypi.org/project/tmdbsimple/) which is a wrapper, written in Python, for The Movie Database (TMDb) API v3.

You will need an API key to The Movie Database to access the API. To obtain a key, follow these steps:

- Register for and verify an [account](https://www.themoviedb.org/account/signup).
- [Log into](https://www.themoviedb.org/login) your account.
- Select the API section on the left side of your account page.
- Click on the link to generate a new API key and follow the instructions.


In [None]:
# first we need to install the package
!pip3 install tmdbsimple

In [None]:
import tmdbsimple as tmdb
import time


class TMDB():
    """For retrieving image poster.

    """
    poster_prefix = "http://image.tmdb.org/t/p/w200"
    tmdb_api_key = "YOUR_API_KEY_HERE"

    def __init__(self):
        tmdb.API_KEY = self.tmdb_api_key

    def search(self, search_string):
        search = tmdb.Search()
        response = search.movie(query=search_string)
        return response

    def get_poster_path_by_name(self, search_string):
        """Return just the poster path for the movie"""

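        response = self.search(search_string)
        for hit in response["results"]:
            # some results have no poster; skip them instead of failing on None
            if hit.get("poster_path"):
                return self.poster_prefix + hit["poster_path"]
        return None  # no result with a poster was found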

    def get_poster_path_by_id(self, tmdb_id):
        movie = tmdb.Movies(tmdb_id)
        response = movie.info()

        return self.poster_prefix + response["poster_path"]


if __name__ == '__main__':
    start_time = time.time()

    # check tmdb
    print(TMDB().get_poster_path_by_name("Jurassic Park"))

    print("--- %s seconds ---" % (time.time() - start_time))

Now we try to find the poster for the 10 most popular movies in the dataset

In [None]:
first_10_movies = df_ratingCount.head(10).reset_index()
first_10_movies["title"] = first_10_movies["title"].map(lambda x: x.strip()[:-6])
first_10_movies["link to poster"] = first_10_movies["title"].map(TMDB().get_poster_path_by_name)
first_10_movies

In [None]:
# poster for Terminator 2
print(first_10_movies["link to poster"][6])

### Q9: Most popular genres
Which are the most popular genres?

In [None]:
# most popular genres (with most ratings)
# hint: use `groupby` to find the number of ratings per genre and use `sort_values` to sort them
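
A sketch of the idea on invented data, assuming that a movie's genres are stored in one string separated by `|` (check `movies.head()` to confirm). Splitting and exploding gives one row per rating and genre, which `groupby` can then count. The column name `Total ratings` follows the name used later in this notebook.

```python
import pandas as pd

# Toy merged table: one row per rating, genres assumed to be "|"-separated.
toy = pd.DataFrame({
    "title":  ["A", "A", "B", "C"],
    "genres": ["Action|Comedy", "Action|Comedy", "Drama", "Action"],
    "Rating": [5, 4, 3, 4],
})

# One row per (rating, genre) pair.
exploded = toy.assign(genre=toy["genres"].str.split("|")).explode("genre")

per_genre = (
    exploded.groupby("genre")["Rating"]
    .count()
    .rename("Total ratings")
    .sort_values(ascending=False)
    .to_frame()
)
print(per_genre)
```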

In [None]:
# movies distribution per genres
# TO DO

## Visual Data Analysis


In [None]:
# import libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# select plot style
plt.style.use('ggplot')
sns.set_style('white')

### Q10, Q11

#### Distribution of ratings per movie (with ratings count > 50)
What is the rating distribution per movie?

What is the movie distribution per release year?

In [None]:
# hint: filter out movies with fewer than 50 ratings and then plot a histogram
# hint: you may use the same dataframe that you created for Q7
# TO DO

#### Movies distribution per release year (> 1985)

In [None]:
# hint: histogram of movies that were released after 1985
# TO DO

### Q12

#### Users distribution per occupation
What is the distribution of jobs for users?

In [None]:
# hint: use `catplot` with kind="count" from the seaborn package (called `factorplot` in older seaborn versions)

### Q13
#### box plot
Visualize the distribution of the number of ratings per movie with a box plot. Identify the outliers.

In [None]:
# TO DO: box plot of the number of ratings for each movie. use the dataframe that you created
# for Q7
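
To identify the outliers numerically as well as visually, one common convention is the 1.5 × IQR rule, which is also the default whisker length of matplotlib and seaborn box plots. The sketch below applies it to an invented series of rating counts.

```python
import pandas as pd

# Toy number of ratings per movie (invented values); one movie is far more popular.
counts = pd.Series([3, 5, 8, 10, 12, 15, 18, 20, 400], name="ratingCount")

q1, q3 = counts.quantile([0.25, 0.75])  # first and third quartiles
iqr = q3 - q1                           # interquartile range
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Points beyond the whiskers are the ones a box plot draws individually.
outliers = counts[(counts < lower) | (counts > upper)]
print(outliers)
```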


#### violin plot
It is very similar to a box plot, except that it also shows a kernel density estimate of the underlying distribution. For more information, check [this documentation](https://seaborn.pydata.org/generated/seaborn.violinplot.html).

In [None]:
plt.figure(figsize=(5,8))
sns.set_style("whitegrid")
# pass the column by keyword; recent seaborn versions do not accept it positionally
sns.violinplot(y="ratingCount", data=df_ratingCount);

### Q14
#### Relationship between number of ratings - ratings mean

Read about [jointplot](https://seaborn.pydata.org/generated/seaborn.jointplot.html) from seaborn documentation. What can you say about the relationship between the number of ratings for a movie and its average rating?

In [1]:
# hint: use `jointplot` from the seaborn package
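
A correlation coefficient can complement the visual impression from the plot. The sketch below uses invented numbers purely to show the calls, so its output says nothing about the real dataset; `ratingMean` is a hypothetical column name. Pearson correlation measures linear association, and the rank-based variant (Pearson on ranks, i.e. Spearman) is less sensitive to the heavy skew that rating counts usually have.

```python
import pandas as pd

# Invented per-movie summary, for illustration only.
summary = pd.DataFrame({
    "ratingCount": [10, 50, 200, 800, 1500],
    "ratingMean":  [2.9, 3.1, 3.4, 3.8, 4.0],
})

# Linear (Pearson) correlation between popularity and average rating.
pearson_r = summary["ratingCount"].corr(summary["ratingMean"])

# Rank correlation: Pearson correlation computed on the ranks of both columns.
rank_r = summary["ratingCount"].rank().corr(summary["ratingMean"].rank())
print(pearson_r, rank_r)
```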


### Q15 (bonus)
Read about treemaps and try to visualize the distribution of the 10 most popular genres with one.


In [None]:
best_10_genres = most_popular_genre.sort_values('Total ratings',ascending=False).head(10)

In [None]:
best_10_genres

We may use a colormap from matplotlib. Note that here both the intensity of the color and the size of the square indicate the value. Check the package `squarify` to learn how to plot treemaps.

In [None]:
# TO DO: tree map