## Module 7 - Workshop - Data Analysis, Part 1

In [1]:
from numpy.random import randn
import numpy as np
import pandas as pd
np.random.seed(123)
import os
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4)
pd.options.display.max_rows = 10
#pd.options.display.max_rows = 20

## MovieLens 1M Dataset

GroupLens Research provides a number of collections of movie ratings data collected from users of MovieLens in the late 1990s and early 2000s. The data provide movie ratings, movie metadata (genres and year), and demographic data about the users (age, zip code, gender identification, and occupation). In this workshop, we will explore this dataset and practice slicing and dicing datasets like these into the form required for further analysis and predictive modeling.

The MovieLens 1M dataset (https://grouplens.org/datasets/movielens/) contains about 1 million ratings collected from roughly 6,000 users on roughly 4,000 movies. It is spread across three tables: **ratings**, **user** information, and **movie** information.

We want to analyze the data and answer the following questions:
- What is the **average rating for any movie by users' gender and age**?
- Out of the films that received at least 200 ratings, **what are the top films as rated by the female viewers and the top films as rated by the male viewers**?
- Find the movies that are **most divisive** between male and female viewers.

**EXERCISE 1 (10 minutes)**:

Extract the data into a directory on your hard drive.
After extracting the data from the ZIP file, load the data from each file into a pandas DataFrame object.
Load the data into 3 separate dataframes:
- **users** for user information data
- **ratings** for ratings data
- **movies** for movie information data

You can use the pandas *read_table()* function to read the data.
In the README file you will find information about the structure of the data in each file.

NOTE: you can pass the parameter *engine='python'* to avoid getting a warning message. Please refer to the documentation for a detailed explanation of this parameter: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html
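
As a minimal sketch of the call pattern, the snippet below parses a small in-memory sample rather than the real files. It assumes fields separated by `::` and no header row (check the README for the actual layout of each file); the sample rows and the column names in `sample_names` are made up for illustration.

```python
import io
import pandas as pd

# Made-up rows in a '::'-delimited layout; these are not real MovieLens records.
sample = io.StringIO("1::F::25::4::10001\n2::M::35::7::20002\n3::M::18::1::30003\n")

# Illustrative column names (an assumption); the sample has no header row.
sample_names = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A multi-character separator is handled by the Python parsing engine,
# so passing engine='python' explicitly avoids the fallback warning.
sample_users = pd.read_table(sample, sep='::', header=None,
                             names=sample_names, engine='python')
print(sample_users)
```

The same pattern should carry over to a file path in place of the `io.StringIO` object, with the column names adjusted to match each file.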

In [None]:
# Reading users data:
users = pd.read

In [None]:
# Reading ratings data:
ratings = 

In [None]:
# Reading movies data:
movies = 

**EXERCISE 1 - discussion:**

Note that ages and occupations are coded as integers indicating groups described in the dataset's README file.

Analyzing the data across three tables is not a simple task; for example, suppose we wanted to compute **mean ratings for a particular movie by users' gender and age**. This is much easier to do with all of the data merged together into a single table. We need to merge all 3 dataframes into one.

**EXERCISE 2 (5 minutes):**

Using the pandas *merge* function, first merge **ratings** with **users** and then merge that result with the **movies** data. Keep in mind that pandas infers which columns to use as the merge (or join) keys based on overlapping column names. You can read more about the *merge* function in the documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
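
To illustrate how the join keys are inferred, here is a small sketch on three made-up tables. The `toy_*` frames and their column names are assumptions for illustration only, not the real dataset.

```python
import pandas as pd

# Tiny made-up tables; column names are illustrative assumptions.
toy_ratings = pd.DataFrame({'user_id': [1, 1, 2],
                            'movie_id': [10, 20, 10],
                            'rating': [5, 3, 4]})
toy_users = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
toy_movies = pd.DataFrame({'movie_id': [10, 20],
                           'title': ['Movie A', 'Movie B']})

# With no 'on' argument, merge joins on the column names both frames share:
# 'user_id' for the inner merge, 'movie_id' for the outer one.
toy_merged = pd.merge(pd.merge(toy_ratings, toy_users), toy_movies)
print(toy_merged)
```

Passing `on='user_id'` (or `on='movie_id'`) makes the key explicit, which can be safer when two tables happen to share a column name that is not meant to be a key.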

In [None]:
movies_data = 

**EXERCISE 2 - discussion:**

To get mean movie ratings for each film grouped by gender, we can use the *pivot_table* method (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html):

In [None]:
mean_ratings = movies_data.pivot_table('rating', index='title',
                                columns='gender', aggfunc='mean')
mean_ratings[:10]

This produces another DataFrame containing mean ratings, with movie titles as row labels (the 'index') and gender as column labels.
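
To make the shape of such a result concrete, the sketch below pivots a made-up table of six ratings. The `toy_data` frame and its values are hypothetical; only the `pivot_table` call mirrors the one used in this workshop.

```python
import pandas as pd

# Made-up merged table: one row per rating.
toy_data = pd.DataFrame({
    'title':  ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B', 'Movie B'],
    'gender': ['F', 'F', 'M', 'F', 'M', 'M'],
    'rating': [5, 3, 4, 2, 4, 2],
})

# One row per title, one column per gender, each cell holding the mean rating.
toy_means = toy_data.pivot_table(values='rating', index='title',
                                 columns='gender', aggfunc='mean')
print(toy_means)
```

A title with no ratings from one of the groups would get `NaN` in that column.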

Then, we need to get the **movies that received at least 200 ratings**. To start, we need to group the data by title and use size() to get a Series of group sizes for each title:

In [None]:
ratings_by_title = movies_data.groupby('title').size()
ratings_by_title[:10]

In [None]:
ratings_by_title.index

In [None]:
# Titles of the movies with at least 200 ratings

active_titles = ratings_by_title.index[ratings_by_title >= 200]
active_titles

**EXERCISE 3 (15 minutes):**

The purpose of this exercise is to select the top films among female viewers and the top films among male viewers, restricted to the movies that received at least 200 ratings.

1) Using the index of titles receiving at least 200 ratings, **active_titles**, select rows from **mean_ratings**.

2) Select the top 10 films favoured by the female users

3) Select the top 10 films favoured by the male users
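
One possible way to approach these steps, sketched on a made-up table so the mechanics are visible: select rows by index label with `.loc`, then sort by a column in descending order. The `toy_*` names and values are hypothetical.

```python
import pandas as pd

# Made-up mean ratings and rating counts for four titles.
titles = pd.Index(['Movie A', 'Movie B', 'Movie C', 'Movie D'], name='title')
toy_means = pd.DataFrame({'F': [4.5, 3.0, 4.8, 2.0],
                          'M': [4.0, 3.5, 4.9, 4.2]}, index=titles)
toy_counts = pd.Series([250, 120, 300, 200], index=titles)

# Keep only the titles with at least 200 ratings.
toy_active = toy_counts.index[toy_counts >= 200]
toy_selected = toy_means.loc[toy_active]

# Sort by the 'F' column, highest mean first, and look at the first rows.
toy_top_f = toy_selected.sort_values(by='F', ascending=False)
print(toy_top_f.head(2))
```

Sorting by `'M'` instead gives the corresponding ranking for the other column.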

In [None]:
mean_ratings = 

In [None]:
top_female_ratings = 

In [None]:
top_male_ratings =

**EXERCISE 3 - discussion:**

Suppose we wanted to find the movies that are most divisive between male and female viewers. One way is to add a column to **mean_ratings** containing the difference in means, then sort by that column.

**EXERCISE 4 (10 minutes):**

1. Add a new column to **mean_ratings** which will be calculated as mean ratings by male viewers (column titled 'M') minus mean ratings by female viewers (column 'F')

2. Sort the rows of the dataframe by the values in the new column.

3. Display the first 10 rows of the sorted dataframe.
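
The steps above can be sketched on a made-up table as follows. The `toy` frame and its values are hypothetical; the point is how the sign of the difference determines which end of the sorted result favours which group.

```python
import pandas as pd

# Made-up mean ratings for three titles.
toy = pd.DataFrame({'F': [4.5, 3.0, 4.0],
                    'M': [4.0, 3.5, 4.0]},
                   index=['Movie A', 'Movie B', 'Movie C'])

# Positive values: rated higher by 'M'; negative values: rated higher by 'F'.
toy['diff'] = toy['M'] - toy['F']

# Ascending sort puts the titles preferred by 'F' first.
by_diff = toy.sort_values(by='diff')
print(by_diff)

# Reversing the row order puts the titles preferred by 'M' first.
print(by_diff[::-1])
```

With an ascending sort, the first rows are the films rated higher by female viewers and the last rows are those rated higher by male viewers, so both ends of the sorted table are worth inspecting.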

In [None]:
mean_ratings['diff'] = 

**EXERCISE 4 - discussion:**

If we wanted the movies that elicited the most disagreement among viewers, independent of gender identification, then the disagreement can be measured by the variance or standard deviation of the ratings:

In [None]:
# Standard deviation of rating grouped by title
rating_std_by_title = movies_data.groupby('title')['rating'].std()
rating_std_by_title

In [None]:
# Filter down to active_titles
rating_std_by_title = rating_std_by_title.loc[active_titles]
# Order Series by value in descending order
rating_std_by_title.sort_values(ascending=False)[:10]
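
To see why standard deviation is a reasonable proxy for disagreement, consider this sketch with two made-up titles that share the same mean rating. The `toy_ratings` frame is hypothetical.

```python
import pandas as pd

# Two made-up titles with the same mean rating but very different spread.
toy_ratings = pd.DataFrame({
    'title':  ['Movie A'] * 4 + ['Movie B'] * 4,
    'rating': [3, 3, 3, 3, 1, 5, 1, 5],
})

toy_mean = toy_ratings.groupby('title')['rating'].mean()
# pandas computes the sample standard deviation (ddof=1) by default.
toy_std = toy_ratings.groupby('title')['rating'].std()

print(toy_mean)
print(toy_std)
```

Both titles average 3, but only the second one has viewers split between very low and very high ratings, which the larger standard deviation reflects.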