In [None]:
movies_file = "../data/ml-1m/movies.dat"
ratings_file = "../data/ml-1m/ratings.dat"
users_file = "../data/ml-1m/users.dat"

# Executive summary

Let's start with the great news: our BigCorporation Oy has decided to benefit from the extensive movie ratings database it collected over the last few years, by applying machine learning methods commonly known as 'AI'.

Strictly speaking, keeping a large dataset is a liability, especially if it contains personal data. It costs money to store, it costs work hours to update with new data chunks and to modify or delete parts of the data on users' requests under GDPR, and it may leak or be stolen, damaging our company's image and perhaps the share price too.

But the data has value through the services we can create with it. These services will increase our revenue, and may even expand the activities of our enterprise and diversify our offering. Taking an active approach toward data is the correct attitude in the modern world, much better than storing it "just in case", Yahoo-style.

This document explores the opportunities that are technically feasible with the data at hand. Our experienced leadership board can then decide which of these opportunities make business sense to develop into services.

#### Note: I will skip the machine learning part here as it seems out of scope for this task. It can easily be added later on top of the data loaded in this Jupyter notebook.

## Have a look at our data

First things first - let's load the data and make sure it is there, is not damaged, and has no large missing chunks or other abnormalities.
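
A minimal sketch of what such a check could look like once a file is loaded into a pandas DataFrame. The helper name `quick_health_check` is hypothetical, and the toy frame only stands in for the real data.

```python
import pandas as pd


def quick_health_check(df):
    """Summarize size, missing values, and duplicated index entries of a frame."""
    return {
        "rows": len(df),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicated_index": int(df.index.duplicated().sum()),
    }


# Toy frame standing in for a loaded data file (values are illustrative only)
toy = pd.DataFrame(
    {"Title": ["A", None, "C"], "Genres": ["Drama", "Comedy", "Drama"]},
    index=[1, 2, 2],
)
report = quick_health_check(toy)
print(report)
```

The same call could be applied to each of the three frames right after loading.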

## 1.1 Movies data file

In [None]:
import pandas as pd

In [None]:
movie_columns = ("MovieID", "Title", "Genres")  # from the dataset README
movies = pd.read_csv(movies_file, sep="::", names=movie_columns, index_col="MovieID", 
                     engine='python')

In [None]:
# too lazy to manually copy from README
movie_genres_split = movies.Genres.str.split("|", expand=True)
movie_genres = list(pd.unique(movie_genres_split.values.flatten()))
movie_genres.remove(None)
movie_genres

We will create a separate 0/1 column for each genre, because a movie may belong to several genres at once.

In [None]:
if "Genres" in movies.columns:   # avoid errors on cell re-run
    for genre in movie_genres:
        # regex=False: treat the genre name as a literal string, not a pattern
        movies[genre] = movies.Genres.str.contains(genre, regex=False).astype('int')
        
    del movies['Genres']
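
As a side note, pandas can build the same kind of 0/1 genre columns in one call with `Series.str.get_dummies`, which splits on the separator and matches whole tokens. A small sketch on made-up titles (the toy series is an assumption, not part of the dataset):

```python
import pandas as pd

# Toy stand-in for the "Genres" column, indexed by MovieID
toy_genres = pd.Series(
    ["Animation|Comedy", "Drama", "Comedy|Drama"],
    index=pd.Index([1, 2, 3], name="MovieID"),
    name="Genres",
)

# One indicator column per distinct genre token
genre_flags = toy_genres.str.get_dummies(sep="|")
print(genre_flags)
```

The resulting columns come out in alphabetical order, so the column order may differ from the loop-based version above.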

General look at the data

In [None]:
movies.head()

In [None]:
from matplotlib import pyplot as plt
import matplotlib
%matplotlib inline

In [None]:
plt.spy(movies[movie_genres].T, aspect=150)  # one column per movie, one row per genre
plt.title("Genres of each movie")
plt.xticks([])
plt.xlabel("Movies")
plt.yticks(range(len(movie_genres)), movie_genres)
plt.show()

Genres are spread across the whole range of movies without visible gaps or blocks, so there are no obvious artifacts of missing data here.

## 1.2 Users data file 

In [None]:
# from the dataset README
user_columns = ("UserID", "Gender", "Age", "Occupation", "Zip-code")  
age_columns = ("Under 18", "18-24", "25-34", "35-44", "45-49", "50-55", "56+")
occupation_dict = {
    0:  "other",
    1:  "academic/educator",
    2:  "artist",
    3:  "clerical/admin",
    4:  "college/grad student",
    5:  "customer service",
    6:  "doctor/health care",
    7:  "executive/managerial",
    8:  "farmer",
    9:  "homemaker",
    10:  "K-12 student",
    11:  "lawyer",
    12:  "programmer",
    13:  "retired",
    14:  "sales/marketing",
    15:  "scientist",
    16:  "self-employed",
    17:  "technician/engineer",
    18:  "tradesman/craftsman",
    19:  "unemployed",
    20:  "writer"}

In [None]:
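users = pd.read_csv(users_file, sep="::", names=user_columns, index_col="UserID",
                    engine='python',            # multi-character separator needs the python engine
                    dtype={"Zip-code": str})    # keep zip codes as text, including leading zeros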

In [None]:
users.Occupation = users.Occupation.map(occupation_dict)
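
The `Age` column can be decoded the same way. A sketch on a toy series, assuming the README encodes each age group by its lower bound (1, 18, 25, 35, 45, 50, 56); `age_codes` and `age_dict` are hypothetical names introduced here:

```python
import pandas as pd

age_codes = (1, 18, 25, 35, 45, 50, 56)  # assumed codes, in the README's order
age_columns = ("Under 18", "18-24", "25-34", "35-44", "45-49", "50-55", "56+")
age_dict = dict(zip(age_codes, age_columns))

# Toy ages standing in for users.Age
toy_age = pd.Series([1, 25, 56, 25])
toy_labels = toy_age.map(age_dict)
print(toy_labels.tolist())
```

The notebook keeps the numeric codes in `users.Age`, which is convenient for sorting, and only uses the labels for the plot axes.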

General look at the data

In [None]:
users.head()

Let's see how many viewers we have in the different occupation and age groups.

In [None]:
data = pd.pivot_table(users, values="Zip-code", index=['Occupation'],
                      columns=['Age'], aggfunc="count")

In [None]:
import seaborn as sn

In [None]:
matplotlib.rcParams['figure.figsize'] = [12, 10]
sn.heatmap(data, cmap='Blues', xticklabels=age_columns)
plt.title("Age and occupation of viewers")
plt.ylim([-0.5, len(occupation_dict)+0.5])  # avoid cutting first and last rows in half
plt.show()

The data seems to be correct - the number of viewers is smoothly distributed over age with a peak around 30 years old. K-12 students are predictably young, college/grad students peak at 18-24 years, and retired people are mostly in the 56+ group. The largest group of viewers is, predictably, the students.

There are zip codes in the dataset. Let's plot them on a map to see which area is covered by the data.

In [None]:
import zipcodes

In [None]:
def get_coordinates(rec):
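    """Look up latitude/longitude for a zip code; return missing values if not found."""
    try:
        coords = zipcodes.matching(rec)
        # float() works whether the package returns numbers or numeric strings
        return pd.Series({'lat': float(coords[0]['lat']),
                          'long': float(coords[0]['long'])})
    except (ValueError, TypeError, IndexError):  # malformed or unknown zip code
        return pd.Series({'lat': None, 'long': None})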

Add user coordinates to the dataset

In [None]:
# this takes a while
if 'lat' not in users.columns:  # re-run safeguard
    users = users.merge(
        users['Zip-code'].apply(get_coordinates),
        left_index=True,
        right_index=True
    )
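
Since many users share a zip code, one way to shorten the wait might be to look up each distinct code only once. A sketch under that assumption: `geocode_unique` and `toy_lookup` are hypothetical helpers, and the coordinates are made up.

```python
import pandas as pd


def geocode_unique(zip_series, lookup):
    """Call `lookup` once per distinct zip code and broadcast the results.

    `lookup` is any callable returning a (lat, long) tuple, or None when
    the code is unknown.
    """
    cache = {z: lookup(z) for z in zip_series.dropna().unique()}
    return pd.DataFrame(
        {
            "lat": [cache[z][0] if cache.get(z) else None for z in zip_series],
            "long": [cache[z][1] if cache.get(z) else None for z in zip_series],
        },
        index=zip_series.index,
    )


def toy_lookup(zip_code):
    # Stand-in for a real geocoder; the coordinates are made up
    return {"11111": (10.0, 20.0)}.get(zip_code)


toy_zips = pd.Series(["11111", "11111", "99999"], index=[1, 2, 3])
coords = geocode_unique(toy_zips, toy_lookup)
print(coords)
```

In the notebook, a thin wrapper around the real zip code lookup could play the role of `toy_lookup`.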

In [None]:
users.head()

In [None]:
from mpl_toolkits.basemap import Basemap

In [None]:
m = Basemap()
m.drawmapboundary(fill_color='#A6CAE0', linewidth=0)
m.fillcontinents(color='grey', alpha=0.7, lake_color='grey')
m.drawcoastlines(linewidth=0.1, color="white")
 
# Add one marker per user in the data frame
m.plot(users['long'], users['lat'], linestyle='none', marker="o", 
       markersize=10, alpha=0.05, c="orange", markeredgecolor="none", 
       markeredgewidth=1)
plt.show()

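The users on the map all live in the US, with good coverage across the country. Keep in mind that the `zipcodes` package only knows US zip codes, so a user with a foreign or malformed code gets no coordinates and is silently missing from the map.

This is an important finding - any service derived from this dataset would likely be useless for, e.g., the Chinese market.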

## 1.3 Ratings data file

In [None]:
import datetime

In [None]:
rating_columns = ("UserID", "MovieID", "Rating", "Timestamp")  # from the dataset README
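ratings = pd.read_csv(
    ratings_file, sep="::", names=rating_columns, engine='python')
# Timestamps are seconds since the Unix epoch. Convert them explicitly
# (date_parser is deprecated in recent pandas) and sort for time-based slicing.
ratings["Timestamp"] = pd.to_datetime(ratings["Timestamp"], unit="s")
ratings = ratings.set_index("Timestamp").sort_index()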
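
A Unix timestamp can be read in the machine's local time zone or in UTC, and the choice shifts ratings near midnight into a different day or week. A sketch of the UTC reading with the standard library only; `epoch_to_utc` is a hypothetical helper and the value is illustrative, not taken from the file:

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert Unix epoch seconds (int or numeric string) to an aware UTC datetime."""
    return datetime.fromtimestamp(int(seconds), tz=timezone.utc)


example = epoch_to_utc("978307200")  # illustrative value
print(example.isoformat())  # 2001-01-01T00:00:00+00:00
```

Using UTC gives the same result on every machine, which keeps the weekly counts reproducible.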

Let's look at the available ratings

In [None]:
# use 'sample' to not kill the computer by plotting a million points
ratings.sample(frac=0.01).plot(
    use_index=True, y="Rating", 
    marker='.', linestyle='none', 
    markersize=50, alpha=0.01
)
plt.show()

Let's examine how our ratings are distributed by checking their weekly frequency.

In [None]:
weekly_ratings = ratings.Rating.resample('W').count()
weekly_ratings.plot()
plt.show()

A closer look

In [None]:
weekly_ratings.plot()
plt.ylim([0, 9000])
plt.show()

In [None]:
# Count by calendar year so the two groups do not overlap on 31 December 2000
print("Ratings in year 2000: ", (ratings.index.year == 2000).sum())
print("Ratings after 2000: ", (ratings.index.year > 2000).sum())

The distribution of our ratings is very uneven - about 90% of the data comes from the year 2000, with only about 10% in the following two years. Real usage data rarely looks like this, which suggests we received a manually crafted sample of the actual business data for analysis.

There are sharp peaks in the data, e.g. around December 2000, but these may be explained by holidays or by the release of some popular movies.

A natural split point would be 1 January 2001, where we could use a large amount of prior data for training and a smaller amount of the following data (but over a longer period) for testing.
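
A minimal sketch of such a time-based split, assuming a frame indexed by datetime like `ratings`. The helper `split_by_date` is hypothetical and the toy ratings are made up:

```python
import pandas as pd


def split_by_date(df, split_point):
    """Return (rows before split_point, rows from split_point on)."""
    split_point = pd.Timestamp(split_point)
    return df[df.index < split_point], df[df.index >= split_point]


# Toy ratings indexed by time (values are illustrative only)
toy = pd.DataFrame(
    {"Rating": [5, 3, 4, 2]},
    index=pd.to_datetime(
        ["2000-05-01", "2000-12-31", "2001-01-01", "2002-06-15"]
    ),
)
train, test = split_by_date(toy, "2001-01-01")
print(len(train), len(test))
```

Splitting by time rather than at random keeps future ratings out of the training set, which is closer to how a deployed service would see the data.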

## 1.4 Answering some questions with our data

What are the preferred movie genres for female viewers of different ages?

Let's define "preferred" as movies with a rating of 4 or 5.

In [None]:
data = ratings[ratings.Rating >= 4]\
    .join(movies, on='MovieID')\
    .join(users[users.Gender=='F'], on='UserID', how='inner')

In [None]:
pivot = []
for g in movie_genres:
    pivot_col = pd.pivot_table(
        data, values='Rating', columns=[g],
        index=['Age'], aggfunc='count', fill_value=0)
    pivot_col = pivot_col.loc[:, 1:1]
    pivot_col.columns = [g]
    pivot.append(pivot_col)
genres_pref = pd.concat(pivot, axis=1)

In [None]:
matplotlib.rcParams['figure.figsize'] = [12, 10]
sn.heatmap(genres_pref.T, cmap='Blues', xticklabels=age_columns, cbar=False)
plt.title("Preferred movie genres by female viewers by age")
plt.ylim([-0.5, len(movie_genres)+0.5])  # rows are genres here; avoid cutting first and last rows in half
plt.show()

Same for male viewers

In [None]:
data = ratings[ratings.Rating >= 4]\
    .join(movies, on='MovieID')\
    .join(users[users.Gender=='M'], on='UserID', how='inner')

In [None]:
pivot = []
for g in movie_genres:
    pivot_col = pd.pivot_table(
        data, values='Rating', columns=[g],
        index=['Age'], aggfunc='count', fill_value=0)
    pivot_col = pivot_col.loc[:, 1:1]
    pivot_col.columns = [g]
    pivot.append(pivot_col)
genres_pref = pd.concat(pivot, axis=1)

In [None]:
matplotlib.rcParams['figure.figsize'] = [12, 10]
sn.heatmap(genres_pref.T, cmap='Blues', xticklabels=age_columns, cbar=False)
plt.title("Preferred movie genres by male viewers by age")
plt.ylim([-0.5, len(movie_genres)+0.5])  # rows are genres here; avoid cutting first and last rows in half
plt.show()

It seems that both genders like comedy and, surprisingly, drama. Male viewers give more high ratings to action, thriller and war movies than female viewers do, but the overall impact of these preferences seems smaller than the shared liking of comedies and dramas.

Another explanation may be that there were some really good comedies and dramas that received high ratings from everyone. Let's check that.

In [None]:
data = ratings[ratings.Rating >= 4]\
    .join(movies, on='MovieID')

In [None]:
top_comedy = data[data.Comedy==1]\
    .groupby("Title").Rating.count().sort_values(ascending=False)

top_drama = data[data.Drama==1]\
    .groupby("Title").Rating.count().sort_values(ascending=False)

In [None]:
top_comedy[:30].plot(rot=90, title='Top rated comedy movies')
plt.xticks(range(30), top_comedy[:30].index.values)
plt.ylim([0, 3000])
plt.ylabel("Number of ratings >= 4")
plt.show()

In [None]:
top_drama[:30].plot(rot=90, title='Top rated drama movies')
plt.xticks(range(30), top_drama[:30].index.values)
plt.ylim([0, 3000])
plt.ylabel("Number of ratings >= 4")
plt.show()

There are indeed very popular drama and comedy movies, including the top-ranked one, which belongs to both the drama and comedy genres.
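
Counts of high ratings also grow with the sheer number of ratings a genre receives, so a complementary check could be the share of ratings of 4 or 5 within each genre. A sketch on made-up data; `high_rating_share` is a hypothetical helper that expects the same 0/1 genre columns as above:

```python
import pandas as pd


def high_rating_share(df, genres, threshold=4):
    """Per genre: number of ratings >= threshold, total ratings, and their ratio."""
    rows = {}
    for g in genres:
        in_genre = df[df[g] == 1]
        liked = int((in_genre.Rating >= threshold).sum())
        total = len(in_genre)
        rows[g] = {
            "liked": liked,
            "total": total,
            "share": liked / total if total else float("nan"),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy ratings joined with genre flags (values are illustrative only)
toy = pd.DataFrame({
    "Rating": [5, 4, 2, 3, 5, 4],
    "Comedy": [1, 1, 1, 1, 0, 0],
    "War":    [0, 0, 0, 0, 1, 1],
})
shares = high_rating_share(toy, ["Comedy", "War"])
print(shares)
```

In this toy example both genres collect the same number of high ratings, yet the shares differ, which is the kind of effect a pure count would hide.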