We have a Twitter Movie Ratings dataset consisting of three files:
* movies (IMDB id, title and year, and genre tags)
* users (each mapped to a real Twitter ID. We don't use the handle because it can change, but a lookup through tweeterid.com is possible)
* ratings (a user-movie edge, rated 1-10, with a timestamp).

For each movie, we can compute its average score. Hopefully, we can find some correlations between movie properties and scores.

In [None]:
import pandas as pd

DIR = 'snapshots/10K/'
#DIR = 'latest/'

movies_frame = pd.read_csv(DIR + 'movies.dat', sep='::', engine='python',
                          names=['movie_id', 'title_and_year', 'bar_sep_genres']).dropna()

users_frame = pd.read_csv(DIR + 'users.dat', sep='::', engine='python',
                          names=['user_id', 'twitter_id']).dropna()

ratings_frame = pd.read_csv(DIR + 'ratings.dat', sep='::', engine='python',
                          names=['user_id', 'movie_id', 'rating', 'timestamp']).dropna()


Examining the data

In [None]:
print(movies_frame.count())
movies_frame[:5]

These are movies with data from IMDB; the genre tags are separated by bars. We think the genres are relevant to a user's rating of a film, so we want to represent the set of genres more clearly. As a first question, how many different genres exist in the movie set?

In [None]:
# demonstrating Pandas string splitter,
# Here "Crime|Drama" -> ["Crime", "Drama"]
movies_frame['bar_sep_genres'].str.split('|')[170:180]

In [None]:
import numpy as np
from functools import reduce  # Used to combine rows
tag_lists = np.array(movies_frame['bar_sep_genres'].str.split('|'))
# removing NaN element from empty tag strings using 't == t'
unique_tags = reduce(set().union, [t for t in tag_lists if t == t], set())
print(unique_tags)
print(len(unique_tags))
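
As a complementary view, the sketch below counts how often each genre occurs instead of only collecting the distinct names. It uses a small made-up series (`genres`) as a stand-in for `movies_frame['bar_sep_genres']`; the sample values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature stand-in for movies_frame['bar_sep_genres']
genres = pd.Series(['Crime|Drama', 'Drama', 'Comedy|Drama|Romance'])

# Split each string into a list, emit one row per (movie, genre) pair,
# then count how many movies carry each genre
genre_counts = genres.str.split('|').explode().value_counts()

print(genre_counts)
print(genre_counts.size)  # number of distinct genres
```

Besides the number of distinct genres, this shows which genres are rare, which can matter when interpreting their regression coefficients later.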


With relatively few genres, we can one-hot encode them all into a frame for later.

In [None]:
genre_ohe_frame = pd.get_dummies(
    movies_frame['bar_sep_genres'].str.split('|')
    .apply(pd.Series).stack()
).groupby(level=0).sum()  # sum(level=0) was removed in pandas 2.0
# Showing that genre_ohe_frame now contains an appropriate encoding
pd.DataFrame(movies_frame['bar_sep_genres']).join(genre_ohe_frame)[:5]
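
A shorter route to the same kind of encoding may be pandas' `Series.str.get_dummies`, which splits on a separator and one-hot encodes in a single call. The sketch below runs on a made-up series (`genres`, an assumption for illustration) rather than on the real data.

```python
import pandas as pd

# Hypothetical miniature stand-in for movies_frame['bar_sep_genres']
genres = pd.Series(['Crime|Drama', 'Drama', 'Comedy|Drama|Romance'])

# Split on '|' and produce one indicator column per distinct genre
genre_ohe = genres.str.get_dummies(sep='|')

print(genre_ohe)
```

The result keeps the index of the input series, so it can be joined back onto the movie frame in the same way as `genre_ohe_frame`.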

The release year of the movie is also an interesting feature. Rather than using the year directly, we transform it into the movie's age.

In [None]:
from datetime import date
current_year = date.today().year  # 2019 when this notebook was written
movie_ages = list()
for name in movies_frame['title_and_year']:
    # Using that the name strings end in " (xxxx)" to extract year string "xxxx"
    year_string = name[-5:-1]
    age = current_year - int(year_string)
    # Express the age in centuries to keep it on a scale similar to the 0/1 genre columns
    movie_ages.append(age/100)

movie_ages = pd.DataFrame(movie_ages, columns=['age'])
# Resetting indices before joining eliminates the risk of NaNs from misaligned indices
movie_ages.reset_index(drop=True, inplace=True)
movies_frame.reset_index(drop=True, inplace=True)

# Showing that age is properly computed
pd.DataFrame(movies_frame['title_and_year']).join(movie_ages)[-4:]
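
The slicing above relies on every title ending in exactly " (xxxx)". A slightly more defensive variant, sketched here on made-up titles, extracts the year with a regular expression. The sample titles and the fixed `reference_year` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical titles in the same "Title (year)" format as movies.dat
titles = pd.Series(['Toy Story (1995)', 'Heat (1995)', 'Metropolis (1927)'])
reference_year = 2019  # assumed fixed reference year

# Capture four digits inside parentheses at the end of the string.
# Titles that do not match would yield NaN, and astype(int) would then raise,
# which makes malformed rows visible instead of silently mis-parsed.
years = titles.str.extract(r'\((\d{4})\)\s*$', expand=False).astype(int)

# Age in centuries, matching the scaling used for movie_ages
ages = (reference_year - years) / 100
print(ages)
```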

We can now join our features to one set describing the movie:

In [None]:

genre_ohe_frame.reset_index(drop=True, inplace=True)

movie_features = movie_ages.join(genre_ohe_frame)

movie_features[:5]

What we want to predict for the movies is their rating, as given by people through tweets. It is time to look at our other files: _users_ and _ratings_.

In [None]:
print(users_frame.count())
users_frame[:5]

In [None]:
print(ratings_frame.count())
ratings_frame[:5]

In [None]:
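import datetime
# Convert Unix timestamps to calendar dates (date.fromtimestamp uses the local time zone)
ratings_frame['date'] = ratings_frame['timestamp'].apply(datetime.date.fromtimestamp)
# Keep only ratings from 2013; note that between() includes both endpoints
period_start = datetime.date.fromisoformat('2013-01-01')
period_end = datetime.date.fromisoformat('2014-01-01')
ratings_period = ratings_frame[ratings_frame['date'].between(period_start, period_end)]
# Most recent rating date in the full data set
max(ratings_frame['date'])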
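
The same filtering can be sketched with vectorized pandas datetimes. Two differences from the cell above are worth noting: `pd.to_datetime(..., unit='s')` interprets the timestamps as UTC rather than local time, and the half-open comparison below excludes 2014-01-01, whereas `between` includes both endpoints by default. The small `ratings` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical ratings: two from 2013 and one from 2015 (Unix seconds, UTC)
ratings = pd.DataFrame({
    'movie_id': [1, 1, 2],
    'rating': [8, 6, 9],
    'timestamp': [1356998400, 1372636800, 1420070400],
})

# Vectorized conversion from Unix seconds to datetime64 values
ratings['date'] = pd.to_datetime(ratings['timestamp'], unit='s')

# Half-open interval [2013-01-01, 2014-01-01)
in_period = (ratings['date'] >= '2013-01-01') & (ratings['date'] < '2014-01-01')
ratings_2013 = ratings[in_period]
print(ratings_2013)
```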

The _users_ frame is only useful for looking up a user's real-life Twitter handle. Since we are not interested in individual users, we can ignore this frame entirely.

In the _ratings_ frame, we want to look at which rating was given to which movie. Since there may be many ratings for a single movie, we have to choose some way to compute its "score", or final rating. The obvious choice is the arithmetic mean.

In [None]:
movie_ratings = list()
# This computation may take a while, since it scans the ratings once per movie
for movie_id in movies_frame['movie_id']:
    ratings = ratings_period[ratings_period['movie_id'] == movie_id]
    if ratings.empty:
        # No ratings in the selected period: fall back to a placeholder score of 5
        averaged = 5
    else:
        averaged = ratings['rating'].mean()
    movie_ratings.append(averaged)
movie_ratings = pd.DataFrame(movie_ratings, columns=['rating'])

movies_frame.join(movie_ratings)[:5]
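
The per-movie loop can likely be replaced by a single `groupby`, which computes all means in one pass. The sketch below uses small made-up frames (`movies`, `ratings`) and the same placeholder score of 5 for movies without ratings. The names are assumptions for illustration, not the notebook's variables.

```python
import pandas as pd

# Hypothetical stand-ins for movies_frame and ratings_period
movies = pd.DataFrame({'movie_id': [10, 20, 30]})
ratings = pd.DataFrame({'movie_id': [10, 10, 30], 'rating': [8, 6, 9]})

# Mean rating per movie, indexed by movie_id
mean_by_movie = ratings.groupby('movie_id')['rating'].mean()

# Look up each movie's mean; movies without ratings get the placeholder 5
movie_scores = (movies['movie_id']
                .map(mean_by_movie)
                .fillna(5)
                .to_frame('rating'))
print(movie_scores)
```

Because `map` preserves the index of `movies`, the result lines up row by row with the movie frame, just like the list built in the loop.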

Now it's time to see if we can do some regression!

In [None]:
from sklearn.model_selection import train_test_split
X = movie_features
y = movie_ratings

print(X.isna().values.any())

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

test_score = model.score(X_test, y_test)
train_score = model.score(X_train, y_train)
print('Linear Regression')
# For regressors, score() returns the coefficient of determination R^2, not an accuracy
print(f'Test set R^2: {test_score}')
print(f'Train set R^2: {train_score}')
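
For a regressor such as `LinearRegression`, `score` returns the coefficient of determination $R^2 = 1 - SS_{res}/SS_{tot}$ rather than a classification accuracy. The sketch below computes it by hand on made-up numbers (`y_true` and `y_pred` are assumptions for illustration) to show what the value means.

```python
import numpy as np

# Hypothetical true scores and model predictions
y_true = np.array([6.0, 7.0, 8.0, 9.0])
y_pred = np.array([6.5, 7.0, 7.5, 9.0])

# Residual sum of squares: error left after the model's predictions
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: error of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1 - ss_res / ss_tot
print(r2)

# A model that always predicts the mean scores exactly 0
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = 1 - np.sum((y_true - baseline) ** 2) / ss_tot
print(r2_baseline)
```

A value near 0 therefore means the model is barely better than predicting the average score for every movie, and a negative value on the test set means it is worse.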

In [None]:
list(zip(
    movie_features.columns, model.coef_.flatten()))

It is difficult to judge from these scores alone whether the model is accurate, but at least we see that there are differences between the genres. Certain genres have higher coefficients than others (Documentary is very high, for example), but the largest influence seems to be the age. Note that age is measured in centuries, so its coefficient is the change in rating per 100 years.

Let's create a function to plot average ratings by "era", a span of several years:

In [None]:
import matplotlib.pyplot as plt

def plot_rating_by_year(features_, ratings, genres, show_all=True):
    """Plot the mean rating per era (a 5-year bin of movie age), overall and per genre."""
    # Ages are stored in centuries, so a bin width of 0.05 is five years
    bins = np.arange(min(features_['age']), max(features_['age']), 0.05)
    features = features_.copy()  # avoid modifying the caller's frame
    features['era'] = pd.cut(features['age'], bins)

    table = ratings.join(features)
    # Convert the lower age edge of each bin back to a calendar year
    era_years = current_year - bins[:-1] * 100

    fig, ax = plt.subplots()
    ax.axis((1900, current_year + 1, 4, 10))
    if show_all:
        # observed=False keeps empty eras, so there is one value per bin
        mean_ratings = table.groupby('era', observed=False)['rating'].mean()
        ax.plot(era_years, mean_ratings, label='All')

    for genre in genres:
        movie_subset = table[table[genre] == 1]
        mean_ratings = movie_subset.groupby('era', observed=False)['rating'].mean()
        ax.plot(era_years, mean_ratings, label=genre)
    ax.set_xlabel('Release year')
    ax.set_ylabel('Mean rating')
    ax.legend(bbox_to_anchor=(1.5, 1))

plot_rating_by_year(movie_features, movie_ratings, ['Documentary', 'Animation'], show_all=False)
plot_rating_by_year(movie_features, movie_ratings, ['Horror'])
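
One detail of the binning is easy to miss: `pd.cut` uses right-closed intervals by default, and `np.arange` stops before its upper limit. With bins built as in the plotting function, the very youngest movies and those older than the last bin edge can end up without an era. The sketch below illustrates this on made-up ages (an assumption for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical ages in centuries
ages = pd.Series([0.0, 0.03, 0.07, 0.12])

# Same construction as in the plotting function: edges at 0.0, 0.05, 0.10
bins = np.arange(ages.min(), ages.max(), 0.05)

# Default right=True gives intervals (0.0, 0.05] and (0.05, 0.10]
eras = pd.cut(ages, bins)
print(eras)

# Ages outside every interval are assigned NaN and drop out of the groupby
print(eras.isna().tolist())
```

If those edge cases matter, passing `include_lowest=True` to `pd.cut` or extending the upper limit of `np.arange` by one bin width are two possible adjustments.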
