# &#127909; Introduction to This Notebook:
This notebook uses the [MovieLens 20M Dataset](https://www.kaggle.com/grouplens/movielens-20m-dataset). We will describe the dataset further as we explore it with pandas.


![Imgur](https://i.imgur.com/sO3hkdl.png)

# &#128279; Code Library, Style, and Links 

In [None]:
%%html
<style>
@import url('https://fonts.googleapis.com/css?family=Orbitron|Roboto&effect=3d|ice|');
body {background-color: gainsboro;} 
a {color: #37c9e1; font-family: 'Roboto';} 
h1 {color: #37c9e1; font-family: 'Orbitron'; text-shadow: 4px 4px 4px #aaa;} 
h2, h3 {color: slategray; font-family: 'Orbitron'; text-shadow: 4px 4px 4px #aaa;}
h4 {color: #818286; font-family: 'Roboto';}
span {font-family:'Roboto'; color:black; text-shadow: 5px 5px 5px #aaa;}  
div.output_area pre{font-family:'Roboto'; font-size:110%; color:lightblue;}      
</style>

Useful `LINKS`:

&#128187; 1. [MovieLens 20M Dataset Research Paper](http://files.grouplens.org/papers/harper-tiis2015.pdf)

&#128187; 2. [Pandas Documentation](http://pandas.pydata.org/pandas-docs/stable/)

&#128187; 3. [Pandas Official Site](https://pandas.pydata.org)

# &#128203; Introduction to MovieLens:
This is a report on the MovieLens dataset linked above. MovieLens itself is a research site run by the GroupLens Research group at the University of Minnesota. The first automated recommender system was developed there in 1993.

# &#128221; Dataset Description:
The dataset is available in several snapshots. The ones used in this analysis were the Latest Datasets, both full and small (for web scraping). They were last updated in October 2016.



# &#128214; Definitions of Pandas:
Pandas is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python.

Pandas builds on NumPy, providing easy-to-use data structures and data manipulation functions with integrated indexing.

* The main data structures pandas provides are Series and DataFrames.
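
As a minimal sketch of these two structures, the example below builds a small Series and DataFrame by hand. The values and labels are made up for illustration and are not taken from the MovieLens data.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
s = pd.Series([4.0, 3.5, 5.0], index=['a', 'b', 'c'], name='rating')

# A DataFrame is a two-dimensional table; each column is a Series.
df = pd.DataFrame({
    'movieId': [1, 2, 3],
    'rating': [4.0, 3.5, 5.0],
})

print(s['b'])              # label-based access on a Series
print(type(df['rating']))  # selecting a single column yields a Series
```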

![Imgur](https://i.imgur.com/L9GESTT.jpg)

# &#128204; Getting Started
To get started, you will need to download the dataset.

Here are the links to the data source and location:

* **Data Source:** Kaggle Data Science Home (filename: movielens-20m-dataset.zip)
* **Location:** https://www.kaggle.com/grouplens/movielens-20m-dataset

# &#128229; Import Libraries

In [None]:
import pandas as pd

# &#128197; Read the Dataset
In this notebook, we will be using three CSV files:

* **rating.csv:** userId, movieId, rating, timestamp

* **tag.csv:** userId, movieId, tag, timestamp

* **movie.csv:** movieId, title, genres

In [None]:
movies = pd.read_csv('../input/movielens-20m-dataset/movie.csv', sep=',')
print(type(movies))
movies.head(20)

In [None]:
tags = pd.read_csv('../input/movielens-20m-dataset/tag.csv', sep=',')
tags.head()

In [None]:
ratings = pd.read_csv('../input/movielens-20m-dataset/rating.csv', sep=',', parse_dates=['timestamp'])
ratings.head()

* For the current analysis, we will remove the timestamp columns.

In [None]:
del ratings['timestamp']
del tags['timestamp']

# &#128230; Data Structures:

## &#128678; Series

In [None]:
row_0 = tags.iloc[0]
type(row_0)

In [None]:
print(row_0)

In [None]:
row_0.index

In [None]:
row_0['userId']

In [None]:
'rating' in row_0

In [None]:
row_0.name

In [None]:
row_0 = row_0.rename('firstRow')
row_0.name

## &#9641; DataFrames

In [None]:
tags.head()

In [None]:
tags.index

In [None]:
tags.columns

In [None]:
tags.iloc[ [0,11,500] ]

# &#128200; &#128201; Descriptive Statistics 
Let's look at how the ratings are distributed!

In [None]:
ratings['rating'].describe()

In [None]:
ratings.describe()

In [None]:
ratings['rating'].mean()

In [None]:
ratings.mean()

In [None]:
ratings['rating'].min()

In [None]:
ratings['rating'].max()

In [None]:
ratings['rating'].std()

In [None]:
ratings['rating'].mode()

In [None]:
ratings.corr()

In [None]:
filter1 = ratings['rating'] > 10
print(filter1)
filter1.any()

In [None]:
filter2 = ratings['rating'] > 0
filter2.all()
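
The two checks above rely on boolean Series: `any()` asks whether at least one element is `True`, while `all()` asks whether every element is. A minimal sketch on a made-up three-row frame (the values are illustrative only):

```python
import pandas as pd

# Toy ratings, not taken from the real dataset.
toy = pd.DataFrame({'rating': [0.5, 3.0, 5.0]})

above_scale = toy['rating'] > 5.0   # element-wise comparison yields a boolean Series
positive = toy['rating'] > 0

has_above_scale = bool(above_scale.any())  # True if any rating exceeds 5.0
all_positive = bool(positive.all())        # True only if every rating is positive

print(has_above_scale, all_positive)
```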

# &#128295; Data Cleaning: Handling Missing Data

In [None]:
movies.shape

In [None]:
movies.isnull().any().any()

* That's nice! No NULL values!

In [None]:
ratings.shape

In [None]:
ratings.isnull().any().any()

* That's nice! No NULL values!

In [None]:
tags.shape

In [None]:
tags.isnull().any().any()

* We have some tags which are NULL.

In [None]:
tags=tags.dropna()

In [None]:
tags.isnull().any().any()

In [None]:
tags.shape

* That's nice! No NULL values! Notice that the number of rows has decreased.
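
To see where the missing values are before dropping them, it can help to count them per column. The sketch below uses a made-up frame with one missing tag. The `subset` argument is an optional refinement that is not used in the cells above: it drops rows only when the named column is missing.

```python
import numpy as np
import pandas as pd

# Toy tags table with one missing tag (illustrative data only).
toy_tags = pd.DataFrame({
    'movieId': [1, 2, 3],
    'tag': ['funny', np.nan, 'dark'],
})

# Count missing values in each column.
missing_per_column = toy_tags.isnull().sum()

# Drop only the rows where 'tag' is missing.
cleaned = toy_tags.dropna(subset=['tag'])

print(missing_per_column)
print(toy_tags.shape, cleaned.shape)
```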

# &#128202; Data Visualization

In [None]:
%matplotlib inline

ratings.hist(column='rating', figsize=(10,5))

In [None]:
ratings.boxplot(column='rating', figsize=(10,5))

# &#128228; Slicing Out Columns

In [None]:
tags['tag'].head()

In [None]:
movies[['title','genres']].head()

In [None]:
ratings[-10:]

In [None]:
tag_counts = tags['tag'].value_counts()
tag_counts[-10:]

In [None]:
tag_counts[:10].plot(kind='bar', figsize=(10,5))

# &#127907; Filters for Selecting Rows

In [None]:
is_highly_rated = ratings['rating'] >= 5.0
ratings[is_highly_rated][30:50]

In [None]:
is_action= movies['genres'].str.contains('Action')
movies[is_action][5:15]

In [None]:
movies[is_action].head(15)

# &#128101; Group By and Aggregate

In [None]:
ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count

In [None]:
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.head()

In [None]:
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()

In [None]:
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.tail()
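
The mean and count per movie can also be computed in a single pass with `agg`. This is a sketch on made-up ratings rather than the notebook's own approach:

```python
import pandas as pd

# Toy ratings: movie 1 has two ratings, movie 2 has three (illustrative data).
toy_ratings = pd.DataFrame({
    'movieId': [1, 1, 2, 2, 2],
    'rating': [4.0, 5.0, 3.0, 3.5, 2.5],
})

# One groupby, two aggregations: the result has one row per movieId
# and one column per aggregation.
summary = toy_ratings.groupby('movieId')['rating'].agg(['mean', 'count'])

print(summary)
```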

# &#128304; Merge Dataframes

In [None]:
tags.head()

In [None]:
movies.head()

In [None]:
t = movies.merge(tags, on='movieId', how='inner')
t.head()

* More examples: http://pandas.pydata.org/pandas-docs/stable/merging.html
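
An inner merge keeps only the movies that have at least one tag. As a sketch of the difference, the example below compares it with a left merge on made-up tables. The `indicator=True` option adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Toy tables (illustrative data): movie 2 has no tags.
toy_movies = pd.DataFrame({'movieId': [1, 2, 3], 'title': ['A', 'B', 'C']})
toy_tags = pd.DataFrame({'movieId': [1, 1, 3], 'tag': ['funny', 'classic', 'dark']})

# Inner merge: untagged movies are dropped.
inner = toy_movies.merge(toy_tags, on='movieId', how='inner')

# Left merge: every movie is kept; untagged ones get NaN in 'tag'.
left = toy_movies.merge(toy_tags, on='movieId', how='left', indicator=True)

print(len(inner), len(left))
print(left[left['_merge'] == 'left_only'])
```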

### &#128218; Combine aggregation, merging, and filters to get useful analytics

In [None]:
avg_ratings= ratings.groupby('movieId', as_index=False).mean()
del avg_ratings['userId']
avg_ratings.head()

In [None]:
box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.tail()

In [None]:
is_highly_rated = box_office['rating'] >= 4.0
box_office[is_highly_rated][-5:]

In [None]:
is_Adventure = box_office['genres'].str.contains('Adventure')
box_office[is_Adventure][:5]

In [None]:
box_office[is_Adventure & is_highly_rated][-5:]
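
Boolean masks combine element-wise with `&` (and), `|` (or), and `~` (not). When the comparisons are written inline, each one needs its own parentheses. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy table standing in for the merged movies and ratings (illustrative data).
toy_box = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genres': ['Adventure|Comedy', 'Drama', 'Adventure'],
    'rating': [4.2, 4.5, 3.1],
})

is_adventure = toy_box['genres'].str.contains('Adventure')
is_high = toy_box['rating'] >= 4.0

both = toy_box[is_adventure & is_high]    # adventure AND highly rated
either = toy_box[is_adventure | is_high]  # adventure OR highly rated
not_adventure = toy_box[~is_adventure]    # everything that is not adventure

print(both['title'].tolist())
```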

# &#128221; Vectorized String Operations

In [None]:
movies.head()

## &#128300; Split 'genres' into multiple columns 

In [None]:
movie_genres = movies['genres'].str.split('|', expand=True)

In [None]:
movie_genres[:10]

## &#128681; Add a new column for comedy genre flag 

In [None]:
movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')

In [None]:
movie_genres[:10]

## &#128223; Extract year from title e.g. (2007) 

In [None]:
# Raw string avoids invalid-escape warnings; expand=False returns a Series
movies['year'] = movies['title'].str.extract(r'.*\((.*)\).*', expand=False)

In [None]:
movies.tail()

More here: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods 
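
The pattern used above captures whatever sits inside the last pair of parentheses, which may not always be a year. A stricter variant, offered here as an assumption rather than as the notebook's method, only accepts four digits in parentheses at the end of the title. The titles below are made up for illustration.

```python
import pandas as pd

# Illustrative titles: one plain, one with an extra parenthetical, one without a year.
titles = pd.Series([
    'Toy Story (1995)',
    'City of Lost Children, The (Cité des enfants perdus, La) (1995)',
    'Untitled Project',
])

# \((\d{4})\) matches exactly four digits inside parentheses;
# \s*$ anchors the match to the end of the title.
years = titles.str.extract(r'\((\d{4})\)\s*$', expand=False)

print(years)
```

Titles that do not match yield `NaN`, so they can be inspected or dropped afterwards.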

# &#128336; Parsing Timestamps

 * Timestamps are common in sensor data or other time series datasets. Let us revisit the tags.csv dataset and read the timestamps!

In [None]:
tags = pd.read_csv('../input/movielenslatest/tags.csv', sep=',')

In [None]:
tags.dtypes

Unix time / POSIX time / epoch time records time in seconds

since midnight Coordinated Universal Time (UTC) of January 1, 1970
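
As a quick check of how epoch seconds map to calendar dates, the sketch below converts two hand-picked values with `pd.to_datetime` and `unit='s'`:

```python
import pandas as pd

# Zero seconds is the start of the Unix epoch.
epoch_start = pd.to_datetime(0, unit='s')

# One billion seconds after the epoch.
one_billion = pd.to_datetime(1_000_000_000, unit='s')

print(epoch_start)
print(one_billion)
```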

In [None]:
tags.head(5)

In [None]:
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')

* Data type datetime64[ns] maps to either `<M8[ns]` or `>M8[ns]`, depending on the byte order of the hardware

In [None]:
tags['parsed_time'].dtype

In [None]:
tags.head(2)

Selecting rows based on timestamps

In [None]:
greater_than_t = tags['parsed_time'] > '2015-02-01'

selected_rows = tags[greater_than_t]

tags.shape, selected_rows.shape

Sorting the table using the timestamps

In [None]:
tags.sort_values(by='parsed_time', ascending=True)[:10]
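
The same parse, filter, and sort steps can be tried on a small made-up table where the expected result is easy to verify by eye. The timestamps below are illustrative epoch seconds, not rows from the dataset.

```python
import pandas as pd

# Three toy tags with epoch-second timestamps:
# 2015-01-01, 2015-03-01, and 2014-01-01 (all at midnight UTC).
toy = pd.DataFrame({
    'tag': ['a', 'b', 'c'],
    'timestamp': [1420070400, 1425168000, 1388534400],
})

toy['parsed_time'] = pd.to_datetime(toy['timestamp'], unit='s')

# Comparing a datetime column with a date string works element-wise.
recent = toy[toy['parsed_time'] > '2015-02-01']

# Sort from oldest to newest.
ordered = toy.sort_values(by='parsed_time', ascending=True)

print(recent['tag'].tolist())
print(ordered['tag'].tolist())
```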

# &#128199; Average Movie Ratings over Time

## Are movie ratings related to the year of release?

In [None]:
average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.tail()

In [None]:
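joined = movies.merge(average_rating, on='movieId', how='inner')
joined.head()
# Restrict the correlation to numeric columns (text columns such as title and genres are skipped)
joined.corr(numeric_only=True)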
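
A year extracted from a title is stored as text, so it has to be converted to a number before it can be correlated with the rating. The sketch below shows one way to do that on made-up rows. Using `pd.to_numeric` with `errors='coerce'` is an assumption about how to handle unparseable years, not a step taken from the notebook.

```python
import pandas as pd

# Toy data: years as strings, one missing (illustrative values only).
toy = pd.DataFrame({
    'year': ['1995', '2000', '2005', None],
    'rating': [4.0, 3.5, 3.0, 2.0],
})

# Convert to numbers; anything unparseable becomes NaN.
toy['year'] = pd.to_numeric(toy['year'], errors='coerce')

# Keep only rows with a usable year, then compute the Pearson correlation.
with_year = toy.dropna(subset=['year'])
year_rating_corr = with_year['year'].corr(with_year['rating'])

print(year_rating_corr)
```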

# &#128170; Motivation

# **“Learning how to do data science is like learning to ski. You have to do it.”**
![Imgur](https://i.imgur.com/2sUbqv7.jpg)