# Snapchat Political Ads

# Summary of Findings


### Introduction
Using the Snapchat Political Ads data from 2018 and 2019, we attempt to predict the number of Impressions a given ad receives, which tells us how many people would see it. Since Impressions is quantitative, this is a regression problem. The evaluation metric we used for our models is R-Squared, the proportion of variance in the target that is explained by our features. A higher R-Squared means that the model's predictions track the observed impressions more closely.
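
As a reference for the metric, here is a minimal sketch that computes R-Squared by hand as 1 - SS_res / SS_tot. The helper name `r_squared` and the numbers are made up for illustration and are not drawn from the ads data. It also shows why the score can be negative: a model that predicts worse than the constant mean of the target has more residual variation than total variation.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Illustrative values only (not from the Snapchat data)
y = np.array([10.0, 20.0, 30.0, 40.0])

perfect = r_squared(y, y)                        # predictions match exactly -> 1.0
mean_only = r_squared(y, np.full(4, y.mean()))   # always predict the mean -> 0.0
worse = r_squared(y, y[::-1])                    # predictions in reversed order -> -3.0
```

A score near zero therefore means the model does about as well as always guessing the average number of impressions.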

### Baseline Model
Our baseline model was based only on information that we would know at the time of prediction, that is, when the ad was launched. This means that we could not use the Spend column (or the target, Impressions), as neither is determined until the ad campaign ends. We decided to use 4 features: Currency Code, Gender, Country Code, and Language. These are all nominal features, so we one-hot encoded them. We chose them because they have a limited number of unique values, which keeps the number of one-hot columns small: the encoded dataset is less sparse, and the model is less likely to produce unexpected results.
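
To illustrate the encoding step, here is a small sketch on made-up rows (the values are hypothetical, not taken from the dataset). Missing entries are filled with the placeholder `'Missing'`, as in our notebook, and each distinct value becomes its own 0/1 column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy rows; values are made up for illustration
toy = pd.DataFrame({
    'Gender': ['FEMALE', None, 'MALE', None],
    'Language': ['en', 'fr', None, 'en'],
}).fillna('Missing')

encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(toy).toarray()   # dense copy, only for inspection

# One output column per distinct value: 3 for Gender + 3 for Language
n_rows, n_cols = encoded.shape

# A category never seen during fit is encoded as all zeros for that feature
unseen = encoder.transform(
    pd.DataFrame({'Gender': ['FEMALE'], 'Language': ['de']})
).toarray()
```

With only a handful of distinct values per column the encoded matrix stays narrow, which is the property we relied on when choosing these four features.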

Our baseline model used linear regression, and we measured its performance using R-Squared. We generated 100 random training and testing sets from the ads dataset (a 70/30 split each time), calculated R-Squared on each testing set, and took the average of the scores. We got a mean R-Squared of -0.0038. A score this close to zero means the model does no better than always predicting the mean number of impressions. We attributed this poor performance to the fact that none of our features reveal meaningful information about the reach of an ad. To develop a better model, we added several engineered features.
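
The repeated-split evaluation can also be written with scikit-learn's `ShuffleSplit` and `cross_val_score`, which we did not use in the notebook; this is only an equivalent sketch on synthetic data. The features below are random noise by construction, so the mean R-Squared is expected to land near zero, much like our baseline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))   # synthetic features unrelated to the target
y_toy = rng.normal(size=200)        # synthetic target (pure noise)

# 20 independent random 70/30 splits (the notebook uses 100)
splitter = ShuffleSplit(n_splits=20, test_size=0.3, random_state=0)

# For regressors the default score is R^2 on each held-out test set
scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=splitter)
mean_r2 = scores.mean()
```

Averaging over many splits reduces the dependence of the reported score on any single lucky or unlucky partition of the data.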


### Final Model
The first change we made was to improve the one-hot encoding of the nominal features from the baseline model by adding a PCA transformer after it. This suits our data because the dimensionality reduction removes redundant columns created by the one-hot encoding. We chose the number of PCA components (37 in the final pipeline) through a grid search to maximize our R-Squared value.
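
The grid search itself is not shown in the code section, so the following is only a sketch of how such a search could look, run on synthetic categorical data. The column values, the candidate grid, and the `to_dense` helper are assumptions made for this example; the explicit dense conversion is there so that PCA receives a dense array regardless of scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

def to_dense(X):
    """Convert the encoder's sparse output into a dense array for PCA."""
    return X.toarray()

# Synthetic categorical data (hypothetical values, not the ads dataset)
n = 60
toy = pd.DataFrame({
    'Gender': [['FEMALE', 'MALE', 'Missing'][i % 3] for i in range(n)],
    'Language': [['en', 'fr', 'de', 'Missing'][i % 4] for i in range(n)],
})
rng = np.random.default_rng(0)
target = (toy['Gender'] == 'FEMALE') * 5.0 + rng.normal(size=n)

pl = Pipeline([
    ('one-hot', OneHotEncoder(handle_unknown='ignore')),
    ('dense', FunctionTransformer(to_dense)),
    ('pca', PCA(svd_solver='full')),
    ('lin-reg', LinearRegression()),
])

# Candidate component counts; none may exceed the 7 one-hot columns
param_grid = {'pca__n_components': [1, 2, 3, 5]}
search = GridSearchCV(pl, param_grid, cv=3)   # scored with R^2 by default
search.fit(toy, target)
best_n = search.best_params_['pca__n_components']
```

The `pca__n_components` key follows scikit-learn's `step__parameter` naming, so the search refits the whole pipeline for every candidate value and keeps the one with the best cross-validated score.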

The second feature we added was the length of time the ad was up on Snapchat. We thought this information could be valuable because how long an ad runs could indicate how many people saw it: if the ad was up for longer, then more people had a chance to see it. We computed this feature as the difference, in days, between the specified start and end dates. The ads with no specified end date are still running according to Snapchat, so we replaced the null entries in that column with the current date when calculating duration.
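
A vectorized sketch of the same calculation on two made-up ads is shown here. The fixed `now` timestamp is an assumption introduced so that the example is reproducible; the notebook uses the date on which it is run.

```python
import pandas as pd

# Two toy ads; the second has no end date (treated as still running)
toy = pd.DataFrame({
    'StartDate': pd.to_datetime(['2019-01-01 00:00:00Z', '2019-03-10 12:00:00Z'], utc=True),
    'EndDate': pd.to_datetime(['2019-01-31 00:00:00Z', None], utc=True),
})

# Fixed "now" for reproducibility (the notebook uses today's date instead)
now = pd.Timestamp('2019-03-20 12:00:00', tz='UTC')

# Fill missing end dates with `now`, then take the difference in whole days
duration_days = (toy['EndDate'].fillna(now) - toy['StartDate']).dt.days
```

One consequence of filling with the current date is that the duration of open-ended ads grows each time the notebook is rerun.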

The rest of the features we added transform nominal columns that weren't previously one-hot encoded into quantitative measures: the character length of the entry in each row. The columns that we performed this transformation on all hold location data that determines where the ad would run. The more localized the ad is, the longer the entries in those columns tend to be, and a missing entry counts as length 0. This transformation helps our model because a highly localized ad would most likely get fewer impressions, since the number of people it is shown to is limited.
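
As an illustration, the character-length transformation can be sketched with pandas string methods on made-up targeting entries (the values below are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy targeting columns; NaN means no targeting was specified
toy = pd.DataFrame({
    'Metros (Included)': ['new york,los angeles', np.nan, 'chicago'],
    'Postal Codes (Included)': [np.nan, '92093,92122', np.nan],
})

# Character count per entry; missing entries count as 0
local_scores = toy.apply(lambda col: col.str.len()).fillna(0).astype(int)
```

Character length is only a rough proxy: it grows with the number of listed locations, but also with the length of each location's name.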

We tested our features with linear and logistic regression. We ultimately chose linear regression for our model because logistic regression took a long time to run and gave worse results. For our final model with linear regression, we again generated multiple training and testing sets from the ads dataset, calculated R-Squared on each, and took the average of the scores. It had an average R-Squared score of 0.006728 over 100 runs. This model still performs poorly overall, but it is an improvement over our baseline model.


### Fairness Evaluation
The subsets of the data that we wanted to evaluate fairness on were ads with lower impressions (<100,000 views) and ads with higher impressions (>= 100,000 views). The parity measure that we used was the difference of R-Squared scores since the model is solving a regression problem. 

- Our null hypothesis was that the R-Squared values of the lower and higher impressions classes are approximately equal.

- Our alternate hypothesis was that the R-Squared values of the lower and higher impressions classes are not approximately equal.

Using a permutation test with 100 shuffles of the group labels, we obtained a p-value of 0.0: none of the shuffled statistics was as large as the observed one, so we reject the null hypothesis. The cause of this was the large difference in R-Squared scores on the observed data, driven particularly by the large negative score on the lower impressions subset. This tells us that our model is unfair and is more likely to score well on ads with higher impressions. This was a surprise to us, as we would have guessed that the engineered features would be biased towards more localized, less viewed ads.
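
The mechanics of the permutation test can be sketched independently of the ads data. In this example everything is synthetic, and the helper names `r2_gap` and `permutation_p_value` are hypothetical, introduced only for illustration. The group labels are assigned at random and the predictions are equally noisy for both groups, so no unfairness is built in.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_gap(y_true, y_pred, group):
    """Absolute difference in R^2 between group 1 and group 0."""
    hi = group == 1
    return abs(r2_score(y_true[hi], y_pred[hi]) - r2_score(y_true[~hi], y_pred[~hi]))

def permutation_p_value(y_true, y_pred, group, n_perm=200, seed=0):
    """Share of label shuffles whose R^2 gap is at least the observed gap."""
    rng = np.random.default_rng(seed)
    observed = r2_gap(y_true, y_pred, group)
    # Shuffle only the group labels; targets and predictions stay paired
    stats = np.array([
        r2_gap(y_true, y_pred, rng.permutation(group)) for _ in range(n_perm)
    ])
    return observed, np.mean(stats >= observed)

# Synthetic data: same noise level for everyone, random group membership
rng = np.random.default_rng(1)
y_true = rng.normal(size=300)
y_pred = y_true + rng.normal(scale=0.5, size=300)
group = rng.integers(0, 2, size=300)

observed, p_val = permutation_p_value(y_true, y_pred, group)
```

A small p-value says that a gap as large as the observed one rarely arises when group membership is assigned at random, which is the sense in which we call the model unfair.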


# Code

In [99]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

In [100]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

In [101]:
# reading in data sets
ads_2018 = os.path.join('PoliticalAds2018.csv')
ads18 = pd.read_csv(ads_2018)
ads_2019 = os.path.join('PoliticalAds2019.csv')
ads19 = pd.read_csv(ads_2019)

# concat the data sets
ads = pd.concat([ads18, ads19], ignore_index=True)

# Data Cleaning
# convert Start and End dates to date time objects + to PST
ads['StartDate'] = pd.to_datetime(ads['StartDate']).dt.tz_convert(tz="America/Los_Angeles")
ads['EndDate'] = pd.to_datetime(ads['EndDate']).dt.tz_convert(tz="America/Los_Angeles")

### Baseline Model

In [102]:
#selecting appropriate features for baseline model
features = ['Currency Code', 'Gender', 'CountryCode', 'Language']
baseline_feats = ads[features].fillna('Missing')

# features
X = baseline_feats
# outcome
y = ads.Impressions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

base_pl = Pipeline([('one-hot', OneHotEncoder(handle_unknown = 'ignore')), ('lin-reg', LinearRegression())])

#finding the average R^2 score over 100 iterations 
N = 100
out = []
for i in range(N):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) 
    base_pl.fit(X_train, y_train)
    out.append(base_pl.score(X_test, y_test))

base_r2 = np.mean(out)
base_r2

-0.003809749261793306

### Final Model

In [103]:
#selecting features and filling missing values for one hot encoding
features = ['Currency Code', 'Gender', 'CountryCode', 'Language']
final_feats = ads[features].fillna('Missing')
final_feats[['EndDate', 'StartDate', 'Metros (Included)', 
             'Metros (Excluded)', 'Radius Targeting (Included)', 
             'Radius Targeting (Excluded)', 'Postal Codes (Included)', 
             'Postal Codes (Excluded)']] = ads[['EndDate', 'StartDate', 'Metros (Included)', 'Metros (Excluded)',
                                                'Radius Targeting (Included)', 'Radius Targeting (Excluded)', 
                                                'Postal Codes (Included)', 'Postal Codes (Excluded)']]

In [105]:
# helper function to apply transformations by row 
def by_col(col, func):
    r_s = col.apply(func, axis=1)
    return r_s.to_frame()

In [106]:
## 1st Feature
## Engineer duration from StartDate and EndDate

def calc_duration(row):
    if pd.isnull(row['EndDate']):
        return (pd.to_datetime('today').tz_localize("America/Los_Angeles") - row['StartDate']).days
    else:
        return (row['EndDate'] - row['StartDate']).days

duration_transform = Pipeline([('Duration', FunctionTransformer(by_col, kw_args={'func':calc_duration}))])

In [108]:
## 2nd Feature
## Engineer Localized feature which estimates how 'localized' an ad is by counting the number of characters in the column

def str_length(row):
    # each row holds a single column value; non-string entries (e.g. NaN) count as 0
    value = row.iloc[0]
    return len(value) if isinstance(value, str) else 0

local_transform = Pipeline([('Localized', FunctionTransformer(by_col, kw_args={'func' :str_length}))])

In [110]:
# Assemble final pipeline with one hot encoding and engineered features
# note: scikit-learn >= 1.2 renames `sparse` to `sparse_output` (the old name was removed in 1.4)
one_hot_pl = Pipeline([('one-hot', OneHotEncoder(handle_unknown = 'ignore', sparse=False)),
                       ('pca', PCA(svd_solver = 'full', n_components= 37))
                      ])

preprocess = ColumnTransformer(transformers=[('Duration', duration_transform, ['StartDate', 'EndDate']),
                                             ('LocalizationScore_mi', local_transform, ['Metros (Included)']),
                                             ('LocalizationScore_me', local_transform, ['Metros (Excluded)']),
                                             ('LocalizationScore_rti', local_transform, ['Radius Targeting (Included)']),
                                             ('LocalizationScore_rte', local_transform, ['Radius Targeting (Excluded)']),
                                             ('LocalizationScore_pci', local_transform, ['Postal Codes (Included)']),
                                             ('LocalizationScore_pce', local_transform, ['Postal Codes (Excluded)']),
                                             ('categorical', one_hot_pl, ['Currency Code','Gender', 'CountryCode', 
                                                                            'Language'])
                                            ])

final_pl = Pipeline(steps=[('preprocesser', preprocess), ('lin-reg', LinearRegression())])

# final model performance 
# features
X = final_feats
# outcome
y = ads.Impressions

#finding the average R^2 score over 100 iterations 
N = 100
out = []
for i in range(N):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) 
    final_pl.fit(X_train, y_train)
    out.append(final_pl.score(X_test, y_test))

final_r2 = np.mean(out)
final_r2



0.006727598969040846

### Fairness Evaluation

In [158]:
def calc_stat(df):
    #splitting data by less and more views and calculating R^2 score for each subset
    more = df[df['Views'] == 1]
    # features
    X_more = more.drop(columns=['Impressions', 'Views'])
    # outcome
    y_more = more.Impressions
    more_score = final_pl.score(X_more, y_more)
    
    less = df[df['Views'] == 0]
    # features
    X_less = less.drop(columns=['Impressions', 'Views'])
    # outcome
    y_less = less.Impressions
    less_score = final_pl.score(X_less, y_less)
    # test stat: difference of R^2 scores for ads with less views and ads with more views
    return abs(more_score - less_score)

In [162]:
# evaluating model on fairness using a permutation test 

#selecting features and filling missing values for one hot encoding
# .copy() gives an independent frame, so the column assignments below do not
# trigger pandas' SettingWithCopyWarning
perm_data = ads[['EndDate', 'StartDate', 'Metros (Included)', 'Metros (Excluded)','Radius Targeting (Included)',
                 'Radius Targeting (Excluded)', 'Postal Codes (Included)', 'Postal Codes (Excluded)', 'Impressions']].copy()
features = ['Currency Code', 'Gender', 'CountryCode', 'Language']
perm_data[features] = ads[features].fillna('Missing')

# adding in a binary column for ad Impressions with a threshold of 100,000 - 0 -> Less views, 1 -> More views 
perm_data['Views'] = (ads['Impressions'] >= 100000).astype(int)

#putting together the Pipeline using one-hot encoding and engineered features
# note: scikit-learn >= 1.2 renames `sparse` to `sparse_output` (the old name was removed in 1.4)
one_hot_pl = Pipeline([('one-hot', OneHotEncoder(handle_unknown = 'ignore', sparse=False)),
                       ('pca', PCA(svd_solver = 'full', n_components= 37))
                      ])
preprocess = ColumnTransformer(transformers=[('Duration', duration_transform, ['StartDate', 'EndDate']),
                                             ('LocalizationScore_mi', local_transform, ['Metros (Included)']),
                                             ('LocalizationScore_me', local_transform, ['Metros (Excluded)']),
                                             ('LocalizationScore_rti', local_transform, ['Radius Targeting (Included)']),
                                             ('LocalizationScore_rte', local_transform, ['Radius Targeting (Excluded)']),
                                             ('LocalizationScore_pci', local_transform, ['Postal Codes (Included)']),
                                             ('LocalizationScore_pce', local_transform, ['Postal Codes (Excluded)']),
                                             ('categorical', one_hot_pl, ['Currency Code','Gender', 'CountryCode', 
                                                                            'Language'])
                                            ])
final_pl = Pipeline(steps=[('preprocesser', preprocess), ('lin-reg', LinearRegression())])
#fitting the Pipeline to the data
final_pl.fit(perm_data.drop(columns=['Impressions', 'Views']), perm_data.Impressions)

# observed statistic
observed = calc_stat(perm_data)

# calculating test statistic on shuffled data 
N = 100
results = []
for _ in range(N):
    # shuffle Views column
    shuffled_views = (perm_data['Views'].sample(replace=False, frac=1).reset_index(drop=True))
    shuffled = perm_data.assign(**{'Views': shuffled_views,})
    results.append(calc_stat(shuffled))       
    
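# calculate the p-value: share of shuffled statistics at least as large as the observed one
# (convert the list to an array so the comparison is elementwise)
p_val = np.count_nonzero(np.array(results) >= observed) / N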
p_val



0.0