In [29]:
import numpy as np
import pandas as pd

#### Business Goals

For the purposes of this project, I work for a company called **Big Apple Tix**, an event ticketing company for the NYC area. I've been asked to develop a recommender system to promote events to users based on their previous patterns of attending and rating other events. Our objective is to ensure that the events we promote to individual users lead to a.) additional purchases, so we can collect transaction fees, and b.) customers being satisfied with the events we show them, so they keep coming back to our service to figure out where they should go next. Continued business also benefits us by making our ratings data more robust, improving our ability to promote events effectively in the future.

I'll start by pulling in my (dummy) data for events and user ratings. This data is already formatted as a dataframe with null values--functionally a matrix--so it is suitable for future calculations we'll need to run.

In [84]:
df = pd.read_csv('data/nyc_event_ratings.csv', index_col='user')

Let's quickly examine a subset of the data:

In [85]:
df.shape

(50, 20)

In [86]:
df.head()

user,Brooklyn Street Art Tour,Carnegie Hall Classical Concert,Caroline's on Broadway,Comedy Cellar Stand-up Night,Death of a Salesman,Glengarry Glen Ross,Guggenheim Retrospective,Hadestown,Hamilton,Jets Game,Knicks Game,Mets Game,MoMA Contemporary Exhibit,NYC Jazz Festival,"Oh, Mary!",Rangers Game,The Lion King,Tribeca Film Festival,Wicked,Yankees Game
User_001,7.0,4.0,5.0,10.0,6.0,6.0,5.0,,4.0,9.0,5.0,,5.0,,5.0,6.0,,,,9.0
User_002,,8.0,6.0,,9.0,5.0,,,8.0,9.0,8.0,,,7.0,6.0,,,,8.0,10.0
User_003,,,,6.0,10.0,9.0,6.0,8.0,7.0,,,,,,8.0,,10.0,6.0,9.0,
User_004,8.0,6.0,,,6.0,8.0,7.0,10.0,8.0,7.0,5.0,7.0,8.0,,9.0,7.0,,,,5.0
User_005,7.0,8.0,,,8.0,7.0,7.0,6.0,6.0,7.0,8.0,,,9.0,,,8.0,,7.0,7.0


We can see that our data consists of 50 users of the ticketing platform as rows, and the columns are 20 distinct events they could have attended in NYC. Users have rated the events they've attended from 1 to 10, and events they have not attended (or at least not rated) are null.

Now, we can split the data into training and testing sets by randomly removing values from the data and reserving them as test data. For each user, we'll hold out 20% of their ratings (rounded down), with a minimum of one.

In [87]:
#training data is copy of main df, will REMOVE test values
df_train = df.copy()
#test data starts as empty df with same index and cols, will ADD test values
df_test = pd.DataFrame(np.nan, index=df.index, columns=df.columns)

In [88]:
#set the testing split % at 20%
test_perc = 0.2
#set a seed for consistent randomness
rng = np.random.default_rng(905)

In [89]:
for user in df.index:
    #all items this user has rated
    rated_items = df.loc[user].dropna().index
    if len(rated_items) == 0:
        continue
    #how many to hold out (at least 1)
    n_test = max(1, int(len(rated_items) * test_perc))
    test_items = rng.choice(rated_items, size=n_test, replace=False)
    
    #move to test_df and blank out in train_df
    df_test.loc[user, test_items] = df.loc[user, test_items]
    df_train.loc[user, test_items] = np.nan
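As a quick sanity check on a split like this, the sketch below confirms that no rating sits in both sets and that the two sets together rebuild the original ratings. The helper `check_split` and the toy frame are hypothetical, introduced only for illustration; they are not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

def check_split(df, df_train, df_test):
    """Return True if train and test are disjoint and together rebuild df."""
    # no cell may hold a rating in both train and test
    overlap = (df_train.notna() & df_test.notna()).to_numpy().any()
    # filling the gaps in train with the test values should give back the original
    rebuilt = df_train.combine_first(df_test)
    return (not overlap) and rebuilt.equals(df)

# toy data (hypothetical): 2 users x 3 events
toy = pd.DataFrame({'A': [7.0, np.nan], 'B': [4.0, 8.0], 'C': [np.nan, 6.0]},
                   index=['u1', 'u2'])
toy_train = toy.copy()
toy_test = pd.DataFrame(np.nan, index=toy.index, columns=toy.columns)

# hold out one rating per user by hand
toy_test.loc['u1', 'B'] = toy.loc['u1', 'B']
toy_train.loc['u1', 'B'] = np.nan
toy_test.loc['u2', 'C'] = toy.loc['u2', 'C']
toy_train.loc['u2', 'C'] = np.nan

print(check_split(toy, toy_train, toy_test))
```

Running the same check on `df`, `df_train`, and `df_test` would be one way to catch a held-out rating accidentally left in the training data.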

**Raw Average**

As an initial predictor, we can set all expected values to the global average of all known event ratings. This will serve as a baseline against which we can compare our (hopefully) improved model that accounts for which users tend to rate things higher or lower, and which events themselves tend to be rated higher or lower.

In [90]:
#put all values in a single array and take the mean (method ignores nulls)
global_avg = df_train.stack().mean()

In [91]:
global_avg

np.float64(6.83015873015873)

So the mean rating across the training data is about **6.8**. Let's plug that in as the baseline rating for all user-event pairs.

In [92]:
#build df with all values as global average for comparison
df_raw_avg = pd.DataFrame(global_avg,
                          index=df_train.index,
                          columns=df_train.columns)

Now I just need a function to check the RMSE for my predictions against my known values. Earlier, I split those knowns into training and testing data so I could get a sense of how well my system would work on genuinely unknown future data.

In [93]:
#build function to compare predictions with actual
def calc_rmse(pred_df, actual_df):
    #mask of observed entries
    mask = actual_df.notna()
    #squared errors only where mask=True
    se = (pred_df[mask] - actual_df[mask])**2
    #stack() will turn that into a single Series of all squared errors
    return np.sqrt(se.stack().mean())
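To see what the mask is doing, here is a small worked example on a made-up 2x2 frame where only two cells have known ratings. The data is hypothetical, and the function is repeated only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def calc_rmse(pred_df, actual_df):
    # same logic as the notebook's function, repeated so this snippet is standalone
    mask = actual_df.notna()
    se = (pred_df[mask] - actual_df[mask]) ** 2
    return np.sqrt(se.stack().mean())

# hypothetical example: u1 rated only A, u2 rated only B
actual = pd.DataFrame({'A': [4.0, np.nan], 'B': [np.nan, 8.0]}, index=['u1', 'u2'])
# predict 5.0 everywhere, including cells with no known rating
pred = pd.DataFrame(5.0, index=actual.index, columns=actual.columns)

# errors on the known cells are (5-4)=1 and (5-8)=-3,
# so RMSE = sqrt((1 + 9) / 2) = sqrt(5)
rmse = calc_rmse(pred, actual)
print(round(rmse, 3))  # 2.236
```

The two cells without a known rating contribute nothing to the error, which is why the same function can score predictions against either the training or the test frame.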

My first "predictor" simply expects every rating to be the global average we previously calculated. Let's first try this against the **training data**:

In [94]:
calc_rmse(df_raw_avg, df_train)

np.float64(1.682225936605682)

And now the **test data**:

In [95]:
calc_rmse(df_raw_avg, df_test)

np.float64(1.7900331930760587)

**User and Event Biases**

Now that we have the baseline RMSE for the global average of event ratings, let's try to improve the model by accounting for user- and event-level biases. A bias reflects how the average rating for a given user or event differs from the global average. In other words, how much more or less "generous" than average is each user when rating events? And how much higher or lower are the ratings for a particular event than for all events in aggregate?
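In symbols, the bias model described here can be written as below. The notation is my own shorthand, not something taken from a library: $\mu$ is the global average of the training ratings, $\bar{r}_u$ is user $u$'s mean training rating, and $\bar{r}_i$ is event $i$'s mean training rating.

```latex
\hat{r}_{ui} = \mu + b_u + b_i,
\qquad b_u = \bar{r}_u - \mu,
\qquad b_i = \bar{r}_i - \mu
```

Each bias is measured against the global average independently, so the event bias is not adjusted for which users happened to rate that event.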

*User Biases*

In [96]:
#10 random users - average event scores
df_train.mean(axis=1).sample(10, random_state=905)

user
User_046    5.750000
User_021    7.600000
User_035    5.266667
User_017    8.916667
User_019    7.100000
User_006    7.000000
User_003    8.250000
User_016    6.083333
User_049    8.538462
User_005    7.272727
dtype: float64

Here are 10 random users from the data and their average scores for the events they attended. Our global average of ~6.8 passes the sense check here, but we can see some real variance. For example, User_035 has an average score of ~5.3 for the events they attended, while User_017 rated events an average of ~8.9. That means their **user biases** are roughly -1.6 and +2.1, respectively. We can calculate this bias for every user and use it as a starting point for all predictions for that user. In all likelihood, this will get us a better RMSE than the global average alone.

In [97]:
#calculate all user biases
user_biases = df_train.mean(axis=1) - global_avg

In [98]:
#make new prediction df with biases added to global averages
df_biases = df_raw_avg.add(user_biases, axis=0)

In [99]:
df_biases.head()

user,Brooklyn Street Art Tour,Carnegie Hall Classical Concert,Caroline's on Broadway,Comedy Cellar Stand-up Night,Death of a Salesman,Glengarry Glen Ross,Guggenheim Retrospective,Hadestown,Hamilton,Jets Game,Knicks Game,Mets Game,MoMA Contemporary Exhibit,NYC Jazz Festival,"Oh, Mary!",Rangers Game,The Lion King,Tribeca Film Festival,Wicked,Yankees Game
User_001,6.083333,6.083333,6.083333,6.083333,6.083333,6.083333,6.083333,6.083333,6.083333,6.083333,6.083333,6.083333,6.083333,6.083333,6.083333,6.083333,6.083333,6.083333,6.083333,6.083333
User_002,7.666667,7.666667,7.666667,7.666667,7.666667,7.666667,7.666667,7.666667,7.666667,7.666667,7.666667,7.666667,7.666667,7.666667,7.666667,7.666667,7.666667,7.666667,7.666667,7.666667
User_003,8.25,8.25,8.25,8.25,8.25,8.25,8.25,8.25,8.25,8.25,8.25,8.25,8.25,8.25,8.25,8.25,8.25,8.25,8.25,8.25
User_004,6.833333,6.833333,6.833333,6.833333,6.833333,6.833333,6.833333,6.833333,6.833333,6.833333,6.833333,6.833333,6.833333,6.833333,6.833333,6.833333,6.833333,6.833333,6.833333,6.833333
User_005,7.272727,7.272727,7.272727,7.272727,7.272727,7.272727,7.272727,7.272727,7.272727,7.272727,7.272727,7.272727,7.272727,7.272727,7.272727,7.272727,7.272727,7.272727,7.272727,7.272727


Calculate RMSE for user-biased predictions against training data:

In [100]:
calc_rmse(df_biases, df_train)

np.float64(1.3642360587512585)

And the same against the test data:

In [101]:
calc_rmse(df_biases, df_test)

np.float64(1.6515574619992706)

Already we see a substantial decrease in the RMSE for these predictions: from about 1.68 to 1.36 on the training data, and from about 1.79 to 1.65 on the test data. In other words, these predictions are closer to the actual ratings than when we predicted that every value would be the global average.

*Event Biases*

Now, let's examine a random assortment of events and their average scores:

In [102]:
#10 random events - average user scores
df_train.mean(axis=0).sample(10, random_state=905)

Tribeca Film Festival              6.928571
Oh, Mary!                          7.343750
Glengarry Glen Ross                7.166667
Carnegie Hall Classical Concert    6.621622
Brooklyn Street Art Tour           6.303030
Caroline's on Broadway             7.200000
NYC Jazz Festival                  7.428571
Wicked                             6.871795
Jets Game                          6.548387
Hadestown                          6.833333
dtype: float64

Here we can see a similar dynamic to the user scores above, where the average event scores hover around our global average of 6.8, but are all some differing amount above or below it. Incidentally, there appears to be less variability among average event scores than user scores. That should be fine--it just may mean the biases are smaller for this dimension, and therefore impact our predictions less than user biases.
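One way to put a number on that impression is to compare the standard deviation of the per-user means with that of the per-event means. Subtracting the global average shifts every mean by the same amount, so these are also the spreads of the biases. The helper `bias_spread` and the toy frame below are hypothetical, for illustration only.

```python
import numpy as np
import pandas as pd

def bias_spread(train):
    """Return (std of user mean ratings, std of event mean ratings)."""
    user_sd = train.mean(axis=1).std()
    event_sd = train.mean(axis=0).std()
    return user_sd, event_sd

# toy data (hypothetical): users differ a lot, events do not
toy = pd.DataFrame({'A': [9.0, 5.0, 7.0], 'B': [9.0, 5.0, np.nan]},
                   index=['u1', 'u2', 'u3'])

user_sd, event_sd = bias_spread(toy)
# user means are 9, 5, 7 (std 2.0); both event means are 7 (std 0.0)
print(user_sd, event_sd)
```

Calling `bias_spread(df_train)` on the real training data would show whether event means really are more tightly clustered than user means.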

In [103]:
#calculate event biases
event_biases = df_train.mean(axis=0) - global_avg

Now I can update my prediction df with the additional bias information obtained from the event biases:

In [104]:
df_biases = df_biases.add(event_biases, axis=1)

RMSE for the training data:

In [105]:
calc_rmse(df_biases, df_train)

np.float64(1.3528542437780882)

And finally, for the test data:

In [106]:
calc_rmse(df_biases, df_test)

np.float64(1.6335358785834804)

As expected, the RMSE did go down when accounting for event biases, but only slightly (from about 1.65 to 1.63 on the test data). Overall, we have achieved a somewhat blunt but decently effective model by starting with the global average of known ratings for every user-event pair, and then adding the user and event biases for that pair.
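The steps above can also be collected into a single function. This is a minimal sketch rather than the notebook's actual code: the name `baseline_predictions` is hypothetical, and treating a missing bias as 0 (for a user or event with no training ratings) is an assumption added here.

```python
import numpy as np
import pandas as pd

def baseline_predictions(train):
    """Predict every user-event cell as global mean + user bias + event bias."""
    mu = train.stack().mean()
    # a user or event with no training ratings has a NaN bias;
    # filling with 0 (an assumption) falls back to the remaining terms
    b_user = (train.mean(axis=1) - mu).fillna(0.0)
    b_event = (train.mean(axis=0) - mu).fillna(0.0)
    # outer sum via broadcasting: column of user biases + row of event biases
    values = mu + b_user.to_numpy()[:, None] + b_event.to_numpy()[None, :]
    return pd.DataFrame(values, index=train.index, columns=train.columns)

# toy data (hypothetical): global mean is 6, user means are 7 and 4,
# and both event means are 6
toy_train = pd.DataFrame({'A': [8.0, 4.0], 'B': [6.0, np.nan]}, index=['u1', 'u2'])
preds = baseline_predictions(toy_train)
print(preds)
```

In the toy example, u2's unrated event B is predicted at 4.0: the global mean of 6, minus 2 for a user who rates low, plus 0 for an event that sits exactly at the average.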

Below, I can build a function to list out, in order, what the recommendations would be for a specified user based on our predictions. The results only account for nulls in the original dataframe (in other words, events that user hasn't attended/rated).

In [107]:
def extract_recommendation_rankings(df_actual, df_pred, user_id):
    '''Returns a dataframe of events the specified user hasn't yet attended/rated,
    ordered by highest predicted rating'''
    #line up the user's actual and predicted ratings side by side, one row per event
    actual = df_actual.loc[[user_id]].transpose()
    pred = df_pred.loc[[user_id]].transpose()
    df_list = actual.join(pred, rsuffix='_pred')
    #keep only events with no actual rating
    df_list = df_list[df_list[user_id].isnull()]
    return df_list.sort_values(by=user_id + '_pred', ascending=False)

In [108]:
extract_recommendation_rankings(df, df_biases, 'User_025')

user,User_025,User_025_pred
NYC Jazz Festival,,7.598413
"Oh, Mary!",,7.513591
Glengarry Glen Ross,,7.336508
Guggenheim Retrospective,,7.169841
Death of a Salesman,,7.109235
Tribeca Film Festival,,7.098413
Carnegie Hall Classical Concert,,6.791463
Brooklyn Street Art Tour,,6.472872


So, for this user, we would recommend they go see the NYC Jazz Festival first, predicting they might rate it at about a 7.6.