# Popularity Prediction

In this notebook I try to predict the popularity of a talk from its tags alone. Tags by themselves carry little signal and the dataset is small, so accurate predictions are unlikely, but it is still fun to fiddle with the data a bit.


## Imports

In [1]:
import pandas
import ast
import numpy as np

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor 
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import RandomizedSearchCV

## Load the data

In [2]:
data_frame = pandas.read_csv('data/ted_updated.csv')
data_frame.head()

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,...,related_talks,speaker_occupation,tags,title,url,views,impressions,positive_impressions,negative_impressions,popularity
0,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,1140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,1151367060,...,"[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110,93850.0,92712.0,1138.0,0.441611
1,265,With the same humor and humanity he exuded in ...,977,TED2006,1140825600,43,Al Gore,Al Gore: Averting the climate crisis,1,1151367060,...,"[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520,2936.0,2372.0,564.0,0.355374
2,124,New York Times columnist David Pogue takes aim...,1286,TED2006,1140739200,26,David Pogue,David Pogue: Simplicity sells,1,1151367060,...,"[{'id': 1725, 'hero': 'https://pe.tedcdn.com/i...",Technology columnist,"['computers', 'entertainment', 'interface desi...",Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292,2824.0,2473.0,351.0,0.393828
3,200,"In an emotionally charged talk, MacArthur-winn...",1116,TED2006,1140912000,35,Majora Carter,Majora Carter: Greening the ghetto,1,1151367060,...,"[{'id': 1041, 'hero': 'https://pe.tedcdn.com/i...",Activist for environmental justice,"['MacArthur grant', 'activism', 'business', 'c...",Greening the ghetto,https://www.ted.com/talks/majora_carter_s_tale...,1697550,3728.0,3572.0,156.0,0.443118
4,593,You've never seen data presented like this. Wi...,1190,TED2006,1140566400,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,1151440680,...,"[{'id': 2056, 'hero': 'https://pe.tedcdn.com/i...",Global health expert; data visionary,"['Africa', 'Asia', 'Google', 'demo', 'economic...",The best stats you've ever seen,https://www.ted.com/talks/hans_rosling_shows_t...,12005869,25620.0,25310.0,310.0,0.446907


### Let's first parse the string-encoded tags into lists

In [3]:
def parseJsonInDataframe(df, columns):
    # Convert string-encoded Python literals in the given columns, in place
    for i in range(0, df.shape[0]):
        for column in columns:
            if column in df and isinstance(df.at[i, column], str):
                df.at[i, column] = ast.literal_eval(df.at[i, column])

parseJsonInDataframe(data_frame, ["tags"])
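The `tags` column holds Python list literals written with single quotes, which is presumably why the cell above uses `ast.literal_eval` rather than a JSON parser. A minimal sketch on a hypothetical sample string (typed by hand here, not read from the dataset):

```python
import ast
import json

# Hypothetical sample in the same format as the `tags` column
raw = "['children', 'creativity', 'culture']"

# literal_eval safely evaluates Python literals (lists, strings, numbers)
tags = ast.literal_eval(raw)
print(tags)

# json.loads rejects the same string because JSON requires double quotes
try:
    json.loads(raw)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False
print("Valid JSON:", json_ok)
```

If the row-by-row loop ever becomes a bottleneck, `data_frame["tags"].apply(ast.literal_eval)` would be a more compact way to do the same conversion, assuming every value in the column is a string.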

### Extract all unique tags

In [4]:
unique_tags = set()
max_tags_per_sample = 0
min_tags_per_sample = 100

for tags in data_frame.tags:
    unique_tags.update(tags)
    max_tags_per_sample = max(max_tags_per_sample, len(tags))
    min_tags_per_sample = min(min_tags_per_sample, len(tags))
    
# Sets don't support indexing, so convert back to a list
unique_tags = list(unique_tags)
print("Unique tags count:", len(unique_tags))
print("Max amount of tags per sample:", max_tags_per_sample)
print("Min amount of tags per sample:", min_tags_per_sample)

Unique tags count: 416
Max amount of tags per sample: 32
Min amount of tags per sample: 1


### Use one-hot representation to prepare train/test datasets

In [5]:
samples = []
labels = []

for i, row in data_frame.iterrows():
    samples.append([unique_tags.index(tag) for tag in row["tags"]])
    labels.append(row["popularity"])

# Let's convert samples to one-hot representation
for i, sample in enumerate(samples):
    one_hot = np.zeros(len(unique_tags))
    for tag in sample:
        one_hot[tag] = 1
    
    samples[i] = one_hot

# Shuffle the data
samples, labels = shuffle(samples, labels, random_state=42)

# Create train/test splits
X_train, X_test, y_train, y_test = train_test_split(samples, labels, test_size=0.1, random_state=42)
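The encoding above can also be sketched with a tag-to-index dictionary, which avoids the repeated `list.index` lookups. The toy tag lists and the names `tag_lists`, `vocabulary`, `tag_to_index` and `encoded` are assumptions made for illustration. The vocabulary is sorted here because the iteration order of a set of strings can change between Python runs, so sorting keeps the column order reproducible.

```python
import numpy as np

# Hypothetical toy data standing in for data_frame.tags
tag_lists = [["culture", "dance"], ["cars"], ["cars", "culture"]]

# Sort the vocabulary so the column order is the same on every run
vocabulary = sorted({tag for tags in tag_lists for tag in tags})
tag_to_index = {tag: i for i, tag in enumerate(vocabulary)}

# One row per sample, one column per tag; 1 marks a tag that is present
encoded = np.zeros((len(tag_lists), len(vocabulary)))
for row, tags in enumerate(tag_lists):
    for tag in tags:
        encoded[row, tag_to_index[tag]] = 1

print(vocabulary)
print(encoded)
```

scikit-learn's `MultiLabelBinarizer` offers a ready-made version of this kind of encoding, if an extra import is acceptable.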

### Try MLPRegressor

In [6]:
mlpr = MLPRegressor()
mlpr.fit(X_train, y_train)
mlpr_predicted = mlpr.predict(X_test)

print("Score:", r2_score(y_test, mlpr_predicted))

Score: -1.0755872045831714


### Try RandomForestRegressor

In [7]:
rfr = RandomForestRegressor(n_estimators=100)
rfr.fit(X_train, y_train)
rfr_predicted = rfr.predict(X_test)

print("Score:", r2_score(y_test, rfr_predicted))

Score: -0.06537931204834568


### Try AdaBoostRegressor

In [8]:
adar = AdaBoostRegressor()
adar.fit(X_train, y_train)
adar_predicted = adar.predict(X_test)

print("Score:", r2_score(y_test, adar_predicted))

Score: -0.36976869743479424


### Try DecisionTreeRegressor

In [9]:
dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
dtr_predicted = dtr.predict(X_test)

print("Score:", r2_score(y_test, dtr_predicted))

Score: -0.897982628460644


### Try LinearRegression

In [10]:
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_predicted = lr.predict(X_test)

print("Score:", r2_score(y_test, lr_predicted))

Score: -1.955937527267203e+24


### Try GradientBoostingRegressor

In [11]:
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)
gbr_predicted = gbr.predict(X_test)

print("Score:", r2_score(y_test, gbr_predicted))

Score: 0.08304069161957117


### Try KNeighborsRegressor

In [12]:
knr = KNeighborsRegressor()
knr.fit(X_train, y_train)
knr_predicted = knr.predict(X_test)

print("Score:", r2_score(y_test, knr_predicted))

Score: -0.2339003909938253


The results are not very good, as expected. The best score came from the gradient tree boosting regressor, the only model with a positive R² (about 0.08). Note, however, that every regressor was used with its default configuration, which may not be optimal.
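The negative scores may look odd at first. `r2_score` computes 1 - SS_res / SS_tot, so a model that always predicts the mean of the test labels scores 0, and anything worse than that goes below zero. A small sketch of the formula with made-up numbers (the helper `r2` is written here for illustration and is not part of the notebook):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up labels, not taken from the dataset
y_true = [1, 2, 3]

baseline = r2(y_true, [2, 2, 2])  # always predict the mean
worse = r2(y_true, [3, 2, 1])     # predictions in reverse order

print(baseline)  # 0.0
print(worse)     # -3.0
```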

### Try doing some "ensembling"

Just compute the mean of all predictions.

In [13]:
def mean(numbers):
    return float(sum(numbers)) / max(len(numbers), 1)

def ensemblePredict(data):
    # Collect the predictions of every regressor fitted above
    r1 = mlpr.predict(data)
    r2 = rfr.predict(data)
    r3 = adar.predict(data)
    r4 = dtr.predict(data)
    r5 = lr.predict(data)
    r6 = gbr.predict(data)
    r7 = knr.predict(data)

    # Average the seven predictions sample by sample
    return [mean([r1[i], r2[i], r3[i], r4[i], r5[i], r6[i], r7[i]]) for i in range(0, len(data))]

res = ensemblePredict(X_test)
print("Score:", r2_score(y_test, res))

Score: -3.991709239319396e+22


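This is not the usual way to implement ensembling: ensembles are normally trained as a unit rather than assembled from separately trained models. Still, it illustrates the effect of averaging. The ensemble scores above the average of the individual scores, but only because that average is dragged down by LinearRegression. Its score is almost exactly the LinearRegression score divided by 49 (7²), which is what you get when one of seven averaged models contributes nearly all of the error. Six of the seven individual regressors still score better than the ensemble.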
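A plain mean is sensitive to a single diverging model, such as the LinearRegression fit above. The sketch below uses made-up predictions from three hypothetical models to compare the mean with the median. Switching to the median is an assumption about what might help, not something the notebook does.

```python
import numpy as np

# Made-up predictions from three hypothetical models for two samples;
# the third model has diverged
predictions = np.array([
    [0.40, 0.42],
    [0.38, 0.44],
    [1e12, -1e12],
])

mean_prediction = predictions.mean(axis=0)
median_prediction = np.median(predictions, axis=0)

print(mean_prediction)    # dominated by the diverging model
print(median_prediction)  # stays close to the two well-behaved models
```

Another option would be to leave the diverging regressor out of the average altogether.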

# Let's try some parameter tuning

Let's first define the parameters we want to tune.

In [14]:
# Number of trees
n_estimators = [int(x) for x in np.linspace(start = 50, stop = 500, num = 10)]

# Maximum tree depth
max_depth = list(range(1, 11))

# Minimum number of samples required to split a node
min_samples_split = [2, 3, 4, 5]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

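# Number of features to consider at every split
# ('auto' was removed in scikit-learn 1.3; on newer versions None should
# give the same behaviour, i.e. use all features)
max_features = ['auto', 'sqrt', 'log2']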

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

print(random_grid)

{'n_estimators': [50, 100, 150, 200, 250, 300, 350, 400, 450, 500], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'min_samples_split': [2, 3, 4, 5], 'min_samples_leaf': [1, 2, 4]}
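To get a feel for the size of this search space, the number of distinct combinations can be counted directly. The sketch below rebuilds the same value lists and compares the total with `n_iter = 100`, the number of candidates used in the search that follows.

```python
from math import prod

# Same value lists as in the grid printed above
random_grid = {
    'n_estimators': list(range(50, 501, 50)),
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': list(range(1, 11)),
    'min_samples_split': [2, 3, 4, 5],
    'min_samples_leaf': [1, 2, 4],
}

total_combinations = prod(len(values) for values in random_grid.values())
n_iter = 100  # number of candidates sampled by the random search

print(total_combinations)           # 3600
print(n_iter / total_combinations)  # fraction of the grid that is tried
```

With 3-fold cross-validation, an exhaustive grid search would need 10800 fits, while the random search needs 300.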


Now that we have defined the parameters, let's use a random search to optimize the hyperparameters. For the regressor I will use the one that scored best, GradientBoostingRegressor. The goal is to see whether hyperparameter optimization gives better results.

In [15]:
regressor = GradientBoostingRegressor()

regressor_random = RandomizedSearchCV(
    estimator=regressor,
    param_distributions=random_grid,
    n_iter=100,
    cv=3,
    verbose=2,
    random_state=42,
    n_jobs=-1,
)

regressor_random.fit(X_train, y_train)
pred = regressor_random.predict(X_test)

print("Score:", r2_score(y_test, pred))

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   12.4s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  7.4min finished


Score: 0.11241341416127681


So the results are a little better (about 0.11 versus 0.08). Let's check the parameters chosen by the optimizer.

In [16]:
regressor_random.best_params_

{'max_depth': 2,
 'max_features': 'auto',
 'min_samples_leaf': 4,
 'min_samples_split': 4,
 'n_estimators': 100}
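Besides `best_params_`, a fitted `RandomizedSearchCV` also exposes `best_score_`, the mean cross-validated score of the best candidate. Comparing it with the score on the held-out test set can hint at how much of the improvement is down to the particular test split. A small sketch on synthetic stand-in data (the data, the reduced grid and the name `search` are assumptions chosen to keep the example quick, and are not the notebook's setup):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; the notebook would use X_train and y_train
X, y = make_regression(n_samples=120, n_features=10, noise=5.0, random_state=42)

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_distributions={
        'n_estimators': [20, 40],
        'max_depth': [1, 2, 3],
        'min_samples_leaf': [1, 2, 4],
    },
    n_iter=4,
    cv=3,
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)  # parameters of the best candidate
print(search.best_score_)   # its mean cross-validated R^2
```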

## Summary

Predicting popularity from tags alone is far from optimal. However, it was still a cool experiment. The results were as expected and improved slightly through hyperparameter optimization.