# Exploratory Data Analysis

## Goals
The goal is to learn to use pandas to load a dataset, perform a basic analysis, and answer some questions that we may have about that data.

## Introduction

The first step before analyzing a dataset is simply to "explore" it. This is called **Exploratory Data Analysis**, or EDA for short. The term was coined by [John Tukey](https://en.wikipedia.org/wiki/John_Tukey). (Sidenote: among other things, he also co-invented the Fast Fourier Transform.) If you are interested in learning more about EDA, you can read this short encyclopedia [article](https://www.stat.berkeley.edu/~brill/Papers/EDASage.pdf) or the "[Art of Data Science](http://bedford-computing.co.uk/learning/wp-content/uploads/2016/09/artofdatascience.pdf)" handbook.

### Goals of Exploratory Data Analysis

1. To determine if there are any problems with your dataset.
2. To determine whether the question you are asking can be answered by the data that you have.
3. To develop a sketch of the answer to your question.

### Exploratory Data Analysis checklist

1. Formulate your question
2. Read in your data
3. "Check the packaging" (rows, columns, types, etc.)
4. Look at the top and the bottom of your data 
5. Check your “n”s (do you see the expected number of rows, columns?)
6. Validate with at least one external data source 
7. Make a plot
8. Try the easy solution first
9. Follow up (do you have the right data? do you need other data? do you have the right question?)
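
Steps 3 to 5 of the checklist can be sketched with a few pandas calls. The snippet below is only an illustration: the `toy` dataframe and its values are made up, and the real tables are loaded later in this notebook.

```python
import pandas as pd

# Toy data standing in for a ratings table (values are made up for illustration)
toy = pd.DataFrame({
    "UserID": [1, 1, 2, 3],
    "MovieID": [10, 20, 10, 30],
    "Rating": [5, 3, 4, 2],
})

# "Check the packaging": number of rows and columns, and the column types
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
print(toy.dtypes)

# Look at the top and the bottom of the data
print(toy.head(2))
print(toy.tail(2))

# Check your "n"s: distinct users and movies, and missing values per column
n_users = toy["UserID"].nunique()
n_movies = toy["MovieID"].nunique()
n_missing = toy.isna().sum()
print(n_users, n_movies)
print(n_missing)
```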

For this exercise we will download the MovieLens dataset. It has been used many times as a testbed for recommendation algorithms, i.e. algorithms that predict which movies a user may be interested in watching, similar to what Netflix does. There are various versions of the dataset. We will use "MovieLens 1 Million", which contains approximately one million ratings of movies that users watched.

## Questions
These are the questions that we are interested in answering:

1. How many movies?
2. How many users?
3. How many ratings?
4. What is the distribution of ratings?
5. What is the average rating across all users?
6. What is the average rating *per user*?
7. Which are the most-watched movies?
8. Which are the favorite movies (best average rating)?
9. Which are the most popular genres?
10. What is the rating distribution per movie?
11. What is the movie distribution per year?
12. What is the distribution of jobs for users?
13. Visualize the distribution of the number of ratings per movie with a box plot.
14. Read about treemaps and try to visualize the distribution of genres using one.
15. Read about [jointplot](https://seaborn.pydata.org/generated/seaborn.jointplot.html) in the seaborn documentation. What can you say about the relationship between the number of ratings for a movie and its average rating?
16. Find groups of users that have watched at least 20 similar movies.
17. If someone has watched "The Matrix", which other movie would you recommend?


## Downloading the necessary file(s)
You may download MovieLens from [here](https://grouplens.org/datasets/movielens/20m/), but it is better to load it programmatically!

You can load the raw data directly from GitHub. Note that you should take the column names from the README (check the GitHub repository).

In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [None]:
# '::' is a multi-character separator, so the python parsing engine is required
# note: header=0 discards the first line of each file and uses `names` instead;
# use header=None if a file has no header row
# movies data
movies = pd.read_csv('https://raw.githubusercontent.com/ahmadajal/DM_ML_course_public/master/2.%20Data%26EDA/data/ml-1m/movies.dat', delimiter='::',
                     names=['movie_id', 'title', 'genres'], header=0, encoding='latin1', engine='python')
# ratings data
ratings = pd.read_csv('https://raw.githubusercontent.com/ahmadajal/DM_ML_course_public/master/2.%20Data%26EDA/data/ml-1m/ratings.dat', delimiter='::',
                      names=['UserID', 'MovieID', 'Rating', 'Timestamp'], header=0, engine='python')
# users data
users = pd.read_csv('https://raw.githubusercontent.com/ahmadajal/DM_ML_course_public/master/2.%20Data%26EDA/data/ml-1m/users.dat', delimiter='::',
                    names=['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code'], header=0, engine='python')


In [None]:
movies.head()

In [None]:
ratings.head()

In [None]:
users.head()

Extract the release year from the title into a separate column:

In [None]:
# raw string avoids invalid-escape warnings; expand=False returns a Series
movies['year'] = movies.title.str.extract(r"\((\d{4})\)", expand=False)
movies.head()

Correcting the datatype of the columns:

Right now, if you check the data types of the columns, the new column `year` has type `object`, whereas it should be an integer.

In [None]:
# TO DO: convert column year to integer
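
One possible way to do the conversion is sketched below on a made-up `toy_movies` dataframe (the titles are illustrative, not read from the dataset). It assumes some titles may lack a year, so it uses `pd.to_numeric` with `errors="coerce"` and the nullable `Int64` dtype instead of a plain `astype(int)`, which would fail on missing values.

```python
import pandas as pd

# Toy titles; the last one has no year on purpose
toy_movies = pd.DataFrame({"title": ["Toy Story (1995)", "Heat (1995)", "Untitled"]})
toy_movies["year"] = toy_movies["title"].str.extract(r"\((\d{4})\)", expand=False)

# errors="coerce" turns unparsable entries into NaN;
# the nullable "Int64" dtype keeps them as <NA> instead of forcing floats
toy_movies["year"] = pd.to_numeric(toy_movies["year"], errors="coerce").astype("Int64")
print(toy_movies.dtypes)
```

If every title is known to contain a year, `astype(int)` alone would be enough.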



Convert the timestamp in the ratings table to a date type and store it in a new column `date` in the ratings table.

In [None]:
# TO DO: convert timestamp to date and add a new column called date to ratings
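
A minimal sketch of the conversion, assuming the timestamps are seconds since the Unix epoch (as described in the MovieLens README). The `toy_ratings` dataframe is a stand-in for `ratings`.

```python
import pandas as pd

# Toy timestamps in seconds since 1970-01-01 (UTC)
toy_ratings = pd.DataFrame({"Timestamp": [978300760, 978302109]})

# unit="s" tells pandas that the integers are seconds, not nanoseconds
toy_ratings["date"] = pd.to_datetime(toy_ratings["Timestamp"], unit="s")
print(toy_ratings)
```

Once the column is a datetime, accessors such as `toy_ratings["date"].dt.year` become available.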



We have the description of the occupation codes from the README. Let's apply a mapping to convert the occupation codes to descriptions in the users dataframe.

In [None]:
# map the occupation codes to job names using the README
job_code = {0:  "other or not specified",
 1:  "academic/educator",
 2:  "artist",
 3:  "clerical/admin",
 4:  "college/grad student",
 5:  "customer service",
 6:  "doctor/health care",
 7:  "executive/managerial",
 8:  "farmer",
 9:  "homemaker",
10:  "K-12 student",
11:  "lawyer",
12:  "programmer",
13:  "retired",
14:  "sales/marketing",
15:  "scientist",
16:  "self-employed",
17:  "technician/engineer",
18:  "tradesman/craftsman",
19:  "unemployed",
20:  "writer"}
users["Occupation"] = users["Occupation"].map(job_code)

users.head()

### Q1, Q2, Q3: Number of movies, users and ratings

In [None]:
print("number of users: ", ratings.groupby("UserID").count().shape[0])
# TO DO:
print("number of movies: ")
print("number of ratings: ")
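
As a hedged sketch on made-up data, `nunique()` and `len()` are one way to get these counts. Note that counting movies from the ratings table only counts movies that were rated at least once; the movies table may contain more.

```python
import pandas as pd

# Toy stand-in for the ratings table
toy_ratings = pd.DataFrame({
    "UserID": [1, 1, 2, 3],
    "MovieID": [10, 20, 10, 30],
    "Rating": [5, 3, 4, 2],
})

n_users = toy_ratings["UserID"].nunique()    # distinct users
n_movies = toy_ratings["MovieID"].nunique()  # distinct rated movies
n_ratings = len(toy_ratings)                 # one row per rating
print(n_users, n_movies, n_ratings)
```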

### Q4: Distribution of ratings

In [None]:
# hint: make a bar plot
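
The data behind such a bar plot can be computed with `value_counts()`. The sketch below uses made-up ratings; in the notebook, calling `.plot(kind="bar")` on the resulting Series would draw the plot.

```python
import pandas as pd

# Toy ratings on the 1-5 scale
toy_rating_values = pd.Series([5, 4, 4, 3, 5, 5, 1], name="Rating")

# Count how often each rating value occurs, ordered by rating value
rating_counts = toy_rating_values.value_counts().sort_index()

# The same distribution expressed as shares of all ratings
rating_shares = rating_counts / rating_counts.sum()
print(rating_counts)
print(rating_shares)
```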

### Q5, Q6: Average rating


In [None]:
# across all users
print("avg rating across all users: ", np.mean(ratings["Rating"]))

In [None]:
# per user
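
A possible sketch of the per-user average using `groupby`, on made-up data. It also shows that the mean of the per-user means generally differs from the overall mean, because users rate different numbers of movies.

```python
import pandas as pd

# Toy stand-in for the ratings table
toy_ratings = pd.DataFrame({
    "UserID": [1, 1, 2, 3],
    "Rating": [5, 3, 4, 2],
})

# Average over all ratings
overall_mean = toy_ratings["Rating"].mean()

# Average rating given by each user
per_user_mean = toy_ratings.groupby("UserID")["Rating"].mean()

print(overall_mean)
print(per_user_mean)
print(per_user_mean.mean())
```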

### Q7, Q8: Most watched and best-rated movies

In [None]:
# merge the dataframes to match the movie id to the movie name
df = pd.merge(movies,ratings,left_on='movie_id', right_on="MovieID")
df.head()

In [None]:
# most watched movies
# TO DO

In [None]:
# movies with best avg ratings
# TO DO

In [None]:
# Do both of the above simultaneously:
# create Dataframe: ratings count and mean rating per movie
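
One way to compute both statistics at once is a `groupby` with named aggregations, sketched here on made-up data. The names `df_ratingCount` and `ratingCount` match the ones used later in this notebook; `ratingMean` and the `min_ratings` threshold are assumptions made for this example.

```python
import pandas as pd

# Toy stand-in for the merged movies/ratings dataframe
toy_df = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C", "C"],
    "Rating": [5, 4, 3, 5, 2, 4],
})

# Ratings count and mean rating per movie in a single pass
df_ratingCount = (
    toy_df.groupby("title")["Rating"]
    .agg(ratingCount="count", ratingMean="mean")
    .reset_index()
)

# Most watched: sort by number of ratings
most_watched = df_ratingCount.sort_values("ratingCount", ascending=False)

# Best rated: require a minimum number of ratings (hypothetical threshold)
min_ratings = 2
best_rated = (
    df_ratingCount[df_ratingCount["ratingCount"] >= min_ratings]
    .sort_values("ratingMean", ascending=False)
)
print(most_watched)
print(best_rated)
```

The threshold matters because a movie with a single 5-star rating would otherwise top the list.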

### Q9: Most popular genres


In [None]:
# most popular genres (with most ratings)
most_popular_genre = df.groupby('genres', as_index = False)['UserID'].count()
most_popular_genre = most_popular_genre.rename(columns={'UserID' : 'Total ratings'})
most_popular_genre.sort_values('Total ratings',ascending=False).head(10)

In [None]:
# movies distribution per genres
# TO DO
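
In MovieLens the `genres` column holds pipe-separated strings such as `Action|Comedy`, so grouping on it directly counts genre combinations rather than single genres. A sketch of counting movies per individual genre, on made-up data, could use `str.split` followed by `explode`:

```python
import pandas as pd

# Toy stand-in for the movies table
toy_movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["Action|Comedy", "Comedy", "Action|Drama|Comedy"],
})

# Split the pipe-separated string into a list, then give each genre its own row
per_genre = toy_movies.assign(genre=toy_movies["genres"].str.split("|")).explode("genre")

# Number of movies per individual genre
genre_counts = per_genre["genre"].value_counts()
print(genre_counts)
```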

## Visual Data Analysis


In [None]:
# import libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# select plot style
plt.style.use('ggplot')
sns.set_style('white')

### Q10, Q11

#### Distribution of ratings per movie (with ratings count > 50)

In [None]:
# plot a graph that indicates that most of the movies have fewer than 2000 ratings
# TO DO

#### Movies distribution per year (> 1985)

In [None]:
# most movies have been released after 1985
# TO DO

### Q12

#### Users distribution per occupation

In [None]:
# factorplot was renamed to catplot in seaborn 0.9
sns.catplot(x="Occupation", data=users, aspect=3, kind="count", color="b").set_xticklabels(rotation=90, fontsize=15)

### Q13
#### Box plot
You can see that most of the movies in the dataset have fewer than 1000 ratings. Therefore we see a lot of outliers in the box plot.

In [None]:
# TO DO: box plot of the ratingCount column in the df_ratingCount data frame
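
To see why a skewed distribution produces many outliers, recall that with matplotlib's default settings the whiskers of a box plot extend at most 1.5 times the interquartile range (IQR) beyond the quartiles, and points outside are drawn individually. The sketch below applies that rule to made-up rating counts.

```python
import pandas as pd

# Toy rating counts: most movies have few ratings, a couple have many
toy_counts = pd.Series([1, 2, 2, 3, 3, 4, 5, 50, 400])

q1 = toy_counts.quantile(0.25)
q3 = toy_counts.quantile(0.75)
iqr = q3 - q1

# Points above this fence are drawn as outliers (default whisker rule)
upper_fence = q3 + 1.5 * iqr
n_outliers = int((toy_counts > upper_fence).sum())
print(upper_fence, n_outliers)
```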


#### Violin plot
It is very similar to a box plot, except that it also shows a kernel density estimate of the underlying distribution. For more information, check [this documentation](https://seaborn.pydata.org/generated/seaborn.violinplot.html).

In [None]:
plt.figure(figsize=(5,8))
sns.set_style("whitegrid")
# assumes df_ratingCount (one row per movie, with a ratingCount column) was built in the Q7, Q8 cells
sns.violinplot(y="ratingCount", data=df_ratingCount);

### Q14
Let's plot the treemap for the 10 most popular genres.

In [None]:
best_10_genres = most_popular_genre.sort_values('Total ratings',ascending=False).head(10)

In [None]:
best_10_genres

We may use a colormap from matplotlib. Note that here both the intensity of the color and the size of the rectangle indicate the value.

In [None]:
import squarify
import matplotlib
plt.figure(figsize=(8,8))
cmap = matplotlib.cm.Reds
mini=min(best_10_genres["Total ratings"])
maxi=max(best_10_genres["Total ratings"])
norm = matplotlib.colors.Normalize(vmin=mini, vmax=maxi)
colors = [cmap(norm(value)) for value in best_10_genres["Total ratings"]]
squarify.plot(sizes=best_10_genres["Total ratings"], label=best_10_genres["genres"], color=colors,alpha=.6 )
plt.axis('off')
plt.show()

### Q15
#### Relationship between the number of ratings and the mean rating

We can see that the movies with the most ratings tend to have higher average ratings.

In [None]:
# TO DO: try to do it yourself. it is a bonus!
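
Alongside the plot, a numeric check of the relationship is possible. The sketch below computes the Pearson correlation between the two columns on made-up values; the column names `ratingCount` and `ratingMean` are assumptions, and the result says nothing about the real dataset.

```python
import pandas as pd

# Toy per-movie statistics (made-up values)
toy_stats = pd.DataFrame({
    "ratingCount": [10, 50, 200, 1000],
    "ratingMean": [2.5, 3.0, 3.5, 4.2],
})

# Pearson correlation (the pandas default); a positive value means that
# movies with more ratings tend to have higher mean ratings in this toy data
corr = toy_stats["ratingCount"].corr(toy_stats["ratingMean"])
print(corr)
```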
