## MovieLens Analysis
### By: Carter Carlson

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

movies = pd.read_csv('../Data/movies.csv')
ratings = pd.read_csv('../Data/ratings.csv')

### Optimizing table size

In [2]:
# Original memory usage of dataset
print('Ratings: original data\n')
print(ratings.info(memory_usage='deep'))

# Convert ratings to a 1-10 scale, downcast dtypes, and keep only the columns we need
ratings['rating'] *= 2
ratings['rating'] = ratings['rating'].astype(np.int8)
ratings['movieId'] = ratings['movieId'].astype(np.int32)
ratings = ratings[['rating', 'movieId']]

print('\n\nRatings: optimized data\n')
print(ratings.info(memory_usage='deep'))

Ratings: original data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000263 entries, 0 to 20000262
Data columns (total 4 columns):
userId       int64
movieId      int64
rating       float64
timestamp    int64
dtypes: float64(1), int64(3)
memory usage: 610.4 MB
None


Ratings: optimized data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000263 entries, 0 to 20000262
Data columns (total 2 columns):
rating     int8
movieId    int32
dtypes: int32(1), int8(1)
memory usage: 95.4 MB
None


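Our optimized dataset takes up about 84% less memory (610.4 MB down to 95.4 MB)! Each row shrinks from 32 bytes (four 64-bit columns) to 5 bytes (one `int8` and one `int32`), so part of the saving comes from dropping `userId` and `timestamp` and the rest from the smaller dtypes.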
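
To see how much each step contributes, here is a minimal sketch on synthetic data. The `toy` frame and the `saved_fraction` helper are made up for illustration and only mimic the column layout of `ratings.csv`; they are not part of the analysis above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for ratings.csv (hypothetical values, same column layout)
rng = np.random.default_rng(0)
n_rows = 100_000
toy = pd.DataFrame({
    'userId': rng.integers(1, 1_000, n_rows),
    'movieId': rng.integers(1, 130_000, n_rows),
    'rating': rng.integers(1, 11, n_rows) / 2,   # 0.5 to 5.0 in half-star steps
    'timestamp': rng.integers(0, 2**31 - 1, n_rows),
})

def saved_fraction(before, after):
    """Fraction of memory saved when going from `before` to `after`."""
    bytes_before = before.memory_usage(deep=True).sum()
    bytes_after = after.memory_usage(deep=True).sum()
    return 1 - bytes_after / bytes_before

# Step 1: drop the two unused columns, keep the 64-bit dtypes
dropped = toy[['rating', 'movieId']]

# Step 2: also rescale and downcast, as in the cell above
downcast = dropped.assign(
    rating=(dropped['rating'] * 2).astype(np.int8),
    movieId=dropped['movieId'].astype(np.int32),
)

print('Dropping columns only:  {:.1%} saved'.format(saved_fraction(toy, dropped)))
print('Dropping + downcasting: {:.1%} saved'.format(saved_fraction(toy, downcast)))
```

Note that `int8` only works here because the rescaled ratings are whole numbers from 1 to 10; a column with larger values would need a wider integer type.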



---


### Dataset Cleaning

In [3]:
# Movies that have no genre listed
no_genre = movies.loc[movies['genres'] == '(no genres listed)', 'movieId'].tolist()

# Movies whose title contains neither '19' nor '20', i.e. no release year
no_year_in_title = movies.loc[~movies['title'].str.contains('19|20'), 'movieId'].tolist()

movies_to_remove = no_genre + no_year_in_title

ratings = ratings.loc[~ratings['movieId'].isin(movies_to_remove)].reset_index(drop=True)

# Clean movies so that there are only movies that have a rating and genre
movie_set = set(movies['movieId'])
ratings_set = set(ratings['movieId'])
missing_movies = list(movie_set - ratings_set)

movies = movies.loc[~movies['movieId'].isin(missing_movies)].reset_index(drop=True)
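
The year check in the cell above looks for the substrings '19' or '20' anywhere in the title, so a title with those digits but no release year would still pass. A stricter test, sketched below on made-up titles, requires the title to end in a parenthesized four-digit year. The `toy_movies` frame is hypothetical, and the pattern assumes years appear as `(19xx)` or `(20xx)` at the end of the title.

```python
import pandas as pd

# Hypothetical titles for illustration only
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3, 4],
    'title': ['Toy Story (1995)', 'Movie 20', 'Untitled Film', 'Heat (1995) '],
})

# Loose check: '19' or '20' anywhere in the title
loose = toy_movies['title'].str.contains('19|20')

# Stricter check: title ends with (19xx) or (20xx), ignoring trailing spaces
strict = toy_movies['title'].str.contains(r'\((?:19|20)\d{2}\)\s*$')

print(toy_movies.assign(loose=loose, strict=strict))
```

Only 'Movie 20' is treated differently by the two checks: the loose test keeps it, the strict one flags it as having no year.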

### Average and Median rating

In [4]:
# Mean rating per movie, as a Series indexed by movieId
ratings_avg = ratings.groupby('movieId')['rating'].mean()

# Add average rating to each movie by looking up its movieId
movies['average rating'] = movies['movieId'].map(ratings_avg)

# Sorting by average value
movies.sort_values('average rating', inplace=True)
movies = movies.reset_index(drop=True)


print('Average movie rating:  {0:.2f}'.format(movies['average rating'].mean()))
print('Median movie rating:   {0:.2f}'.format(movies['average rating'].median()))

Average movie rating:  6.27
Median movie rating:   6.48


### Feature Engineering

In [5]:
# Movies are sorted by average rating, so every tenth of the rows is one decile
movies_10_pct = len(movies) // 10

# Column of movie popularity, assigned by row position without chained indexing
movies['popularity'] = None
pop_col = movies.columns.get_loc('popularity')
movies.iloc[:movies_10_pct*2, pop_col] = 1                  # 'Worst': bottom 20%
movies.iloc[movies_10_pct*4:movies_10_pct*6, pop_col] = 2   # 'OK': middle 20%
movies.iloc[movies_10_pct*8:, pop_col] = 3                  # 'Best': top 20%

# Remove movies that are not in top, middle, or bottom 20%
movies = movies.dropna().reset_index(drop=True)

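# Add year of movie
movies['title'] = movies['title'].str.strip()
def extract_year(title):
    """Return the four characters after the last '(' in the title as an int."""
    if title.find('(') < 0:
        return int(title[:4])

    return extract_year(title[title.find('(')+1:])

# Column of movie year
movies['year'] = [extract_year(movie) for movie in movies['title']]

# Column of movie age; later assignments overwrite earlier ones, and
# movies from before 1960 stay None and are dropped below
movies['age'] = None
movies.loc[movies['year'] >= 1960, 'age'] = 1 # 'Old': 1960-1969
movies.loc[movies['year'] >= 1970, 'age'] = 2 # 'Medium': 1970-1989
movies.loc[movies['year'] >= 1990, 'age'] = 3 # 'New': 1990 onward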

movies = movies.dropna().reset_index(drop=True)

# Make a 0/1 indicator column for each genre
genre_groups = [genres.split('|') for genres in movies['genres']]
genre_set = sorted({genre for genres in genre_groups for genre in genres})

# Start with all zeros; passing the list itself (not a nested list) keeps flat column names
df = pd.DataFrame(0, index=movies.index, columns=genre_set)

df['average rating'] = movies['average rating']
df['popularity'] = movies['popularity'].astype(int)
df['age'] = movies['age'].astype(int)

# Populate the genre indicator columns
for i, genres in enumerate(genre_groups):
    df.loc[i, genres] = 1
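
As a cross-check on the feature engineering above, the same genre indicators and age buckets can be sketched with pandas built-ins. The `toy_movies` frame below is made up; `str.get_dummies` and `pd.cut` are swapped in as alternatives and are not what the cell above uses.

```python
import pandas as pd

# Hypothetical movies for illustration only
toy_movies = pd.DataFrame({
    'title': ['A (1955)', 'B (1965)', 'C (1985)', 'D (1999)'],
    'genres': ['Drama', 'Comedy|Romance', 'Action|Comedy', 'Drama|Action'],
    'year': [1955, 1965, 1985, 1999],
})

# One 0/1 column per genre, split on the '|' separator
genre_dummies = toy_movies['genres'].str.get_dummies(sep='|')

# Age buckets with the same cut-offs as above:
# [1960, 1970) -> 1, [1970, 1990) -> 2, [1990, inf) -> 3, earlier years -> NaN
age = pd.cut(toy_movies['year'],
             bins=[1960, 1970, 1990, float('inf')],
             right=False,
             labels=[1, 2, 3])

print(genre_dummies)
print(age)
```

Rows with a missing `age` (here the 1955 movie) correspond to the movies that the `dropna()` call above removes.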

### Machine Learning

We will use three groups of data as predictors for popularity:
1. Movie genres
2. Movie age
3. Movie genres & age

And we will utilize three ML algorithms:
1. Logistic Regression
2. Bernoulli Naive Bayes
3. Random Forest

In [6]:
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import math

x1 = df[genre_set]
x2 = pd.get_dummies(df['age'], prefix='age')  # string column names, matching the genre columns
x3 = pd.concat([x1, x2], axis=1)
y = df['popularity']

logr = LogisticRegression()
bnb = BernoulliNB()
tree = RandomForestClassifier()

# Collection of parameters to test
# Note: math.exp(-5) is e**-5 (about 0.0067), not 1e-5
logr_param = {'tol': [math.exp(-5), math.exp(-4), math.exp(-3)]}

bnb_param = {'alpha': [0.01, 0.5, 1, 2]}

tree_param = {'n_estimators': [10, 100, 200],
              'n_jobs': [-1],
              'max_depth': [2, 5, 10]}

# Determine the best parameters
logr_best = GridSearchCV(estimator=logr, param_grid=logr_param, cv=5)
bnb_best = GridSearchCV(estimator=bnb, param_grid=bnb_param, cv=5)
tree_best = GridSearchCV(estimator=tree, param_grid=tree_param, cv=5)

classifiers = [logr_best, bnb_best, tree_best]
predictors = [x1, x2, x3]
input_columns = ['Genres', 'Age', 'Genres and Age']
classifier_name = ['Logistic Regression', 'Bernoulli Naive Bayes', 'Random Forest']

print('Predicting movie popularity')
for input_name, x in zip(input_columns, predictors):
    print('\n\nInput: {}\n'.format(input_name))

    for name, classifier in zip(classifier_name, classifiers):
        classifier.fit(x, y)
        # Predictions are made on the same rows used for fitting,
        # so the figures below are training accuracy
        y_pred = classifier.predict(x)
        print('{}  -  accuracy: {:.3f}'.format(name, accuracy_score(y, y_pred)))

Predicting movie popularity


Input: Genres

Logistic Regression  -  accuracy: 0.491
Bernoulli Naive Bayes  -  accuracy: 0.487
Random Forest  -  accuracy: 0.531


Input: Age

Logistic Regression  -  accuracy: 0.350
Bernoulli Naive Bayes  -  accuracy: 0.345
Random Forest  -  accuracy: 0.357


Input: Genres and Age

Logistic Regression  -  accuracy: 0.494
Bernoulli Naive Bayes  -  accuracy: 0.487
Random Forest  -  accuracy: 0.544
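
The accuracies above are computed on the same rows the models were fitted on. A held-out split gives a fairer estimate; the sketch below shows the idea on synthetic data. The data from `make_classification`, the 25% test size, and the fixed `random_state` are all assumptions for illustration, and the numbers it prints say nothing about the MovieLens results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the genre/age features (hypothetical)
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# Hold out 25% of the rows, keeping class proportions the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Compare accuracy on rows the model has seen with rows it has not
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print('train accuracy: {:.3f}'.format(train_acc))
print('test accuracy:  {:.3f}'.format(test_acc))
```

The same split could be applied to `x1`, `x2`, and `x3` before calling `fit`, with `GridSearchCV` run on the training part only.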


### Interesting findings

I originally expected movie age to be a strong predictor of movie rating.  My reasoning was that most old, low-rated movies would fade into the background, with few people watching and reviewing them 10 years after release.  Similarly, the high-rated older movies should stand the test of time and continue to be viewed and positively rated years after they premiered.

Now that I think about it, that may not necessarily be true.  An action movie released in 1980 may have been highly rated, but 15 years later the graphics, actors, and movie references will be outdated and lose some of their original appeal.  Too many variables come into play with movie age, which is why age alone predicts popularity with about 35% accuracy, only slightly better than randomly picking one of the three classes (roughly 33% if the classes are balanced).
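
The chance level quoted above depends on how the three classes are distributed. A minimal sketch of the two usual baselines follows; the label counts are made up, and the real ones would come from `df['popularity']`.

```python
import pandas as pd

# Hypothetical label counts for illustration only
labels = pd.Series([1] * 40 + [2] * 30 + [3] * 30)

# Share of each class among the labels
class_share = labels.value_counts(normalize=True)

uniform_guess = 1 / class_share.size   # pick one of the classes at random
majority_guess = class_share.max()     # always predict the largest class

print('uniform guess:  {:.3f}'.format(uniform_guess))
print('majority guess: {:.3f}'.format(majority_guess))
```

A model is only informative if it beats the larger of these two figures.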

When it comes to genres, every genre has its fair share of good and bad movies.  If there were a proven 'best' combination of genres, the other genres wouldn't exist.  Even so, genres were the more useful input here, with about 49-53% accuracy against about 35% for age.  An interesting trend to analyze would be movie genre popularity over time.  I can't say exactly how genre popularity has changed over the years (or whether it has changed at all), but I'm sure the right ML analysis would show surprising results, just as it did with predicting movie popularity.