# Snapchat Political Ads
* **See the main project notebook for instructions to be sure you satisfy the rubric!**
* See Project 03 for information on the dataset.
* A few example prediction questions to pursue are listed below. However, don't limit yourself to them!
    * Predict the reach (number of views) of an ad.
    * Predict how much was spent on an ad.
    * Predict the target group of an ad. (For example, predict the target gender.)
    * Predict the (type of) organization/advertiser behind an ad.

Be careful to justify what information you would know at the "time of prediction" and train your model using only those features.

# Summary of Findings


### Introduction
I am attempting to predict how much a company spends on an advertisement. This is a regression problem that requires building models based on the features (Gender, Impressions, and Duration) which assumingly affects how much a company spends. The evaluation metric is set to have a r^2 score of at least 0.75 for accuracy.

### Baseline Model
For the baseline model, Gender and OSType will be used to predict how much a company spends on an advertisement. If companies who targeted all OSTypes and Genders, do they have to spend more? If the column value contained NaN, this would mean it is agnostic (targeting all Genders and OSTypes). I transformed these two categorical features using a Pipeline of SimpleImputer to fill all NaN values to 'all' and OneHotEncoder to turn these categorical values into digits so that we can model a LinearRegression Pipeline to predict how much a company spends on an advertisement. We tested the score of the prediction and received a '0.000681' which is a very low score. The reason to why I received that score and why majority of the predictions showed that most companies spent the same amount (1788) is because many of the values in the Gender and OSType columns are NaN, where majority of them may have turned to 0 which 'biases' the data when trying to predict how much a company makes. Also, I only used two categorical features to make our prediction, so of course it would not predict how much a company spends accurately. What would happen if more features were added to predict such as numerical features? You will see in the Final Model below.

### Final Model
To improve the model, I decided to include more categorical features such as AgeBracket and BillingAddress as I believe it would help in predicting how much a company spends on an advertisement based on what age group and area they are targeting. Additionally, I added numerical features such as Impressions, StartDate, and EndDate; I wanted to use the duration of an advertisement to help us predict the costs. To do this, I converted StartDate and EndDate columns into pandas datetime and converted all the datetimes into hours so that it would be easier to compute its duration and mean for when it is modeled within a Pipeline. For the numerical features, I modeled a Pipeline where the feature's mean was preprocessed and then normalized it using the z-scaled data to make it look like a normal distribution to help the model perform better. For the categorical features, I modeled a Pipeline where we preprocessed the categories by filling the missing values, NaN, with 'all' using the SimpleImputer, and then utilized OneHotEncoding to distinguish the companies who targeted all types of: Gender, OsType, Segments, and AgeBracket between those who only targeted 1 certain group. For the BillingAddress column, I believed it would help our model as it distinguishes the cost of an advertisement based on a company's location.

### Fairness Evaluation
Between two languages, would the price of an ad remain the same? To answer this, I compared the two largest single language target for each ad, which would be French and English. The null hypothesis would be that our model predicts the price of the given ad fairly between the two languages. Our signifigance level was: 0.04, which is below alpha meaning that we reject the null hypothesis which means that the model was fair.

# Code

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler, OrdinalEncoder, Binarizer
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
import datetime as dt
from sklearn import metrics

%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

In [2]:
ads_2018 = pd.read_csv("PoliticalAds_2018.csv")
ads_2019 = pd.read_csv("PoliticalAds_2019.csv")
ads = pd.concat([ads_2018, ads_2019], ignore_index=True)
print(ads.columns)
ads.head()

Index(['ADID', 'CreativeUrl', 'Currency Code', 'Spend', 'Impressions',
       'StartDate', 'EndDate', 'OrganizationName', 'BillingAddress',
       'CandidateBallotInformation', 'PayingAdvertiserName', 'Gender',
       'AgeBracket', 'CountryCode', 'Regions (Included)', 'Regions (Excluded)',
       'Electoral Districts (Included)', 'Electoral Districts (Excluded)',
       'Radius Targeting (Included)', 'Radius Targeting (Excluded)',
       'Metros (Included)', 'Metros (Excluded)', 'Postal Codes (Included)',
       'Postal Codes (Excluded)', 'Location Categories (Included)',
       'Location Categories (Excluded)', 'Interests', 'OsType', 'Segments',
       'Language', 'AdvancedDemographics', 'Targeting Connection Type',
       'Targeting Carrier (ISP)', 'CreativeProperties'],
      dtype='object')


Unnamed: 0,ADID,CreativeUrl,Currency Code,Spend,Impressions,StartDate,EndDate,OrganizationName,BillingAddress,CandidateBallotInformation,...,Location Categories (Included),Location Categories (Excluded),Interests,OsType,Segments,Language,AdvancedDemographics,Targeting Connection Type,Targeting Carrier (ISP),CreativeProperties
0,29cbbf5621975dbd4ffd3826f22e781ca4f41fe4cd61c5...,https://www.snap.com/political-ads/asset/d5926...,USD,148,36028,2018/11/06 12:04:28Z,2018/11/07 02:04:31Z,Friends of Jess King,US,,...,,,,,Provided by Advertiser,,,,,web_view_url:https://jesskingforcongress.com/v...
1,1ee35b50c5d194f4bf23196ad3645d16951439d9bfbbd2...,https://www.snap.com/political-ads/asset/fbee4...,EUR,447,339296,2018/09/28 12:59:59Z,2018/10/12 12:59:59Z,Jalt,"Krom boomssloot 22-1,Amsterdam,1011GW,NL",,...,,,,,Provided by Advertiser,,,,,web_view_url:https://secure.amnesty.nl/petitie...
2,fabfccf0dfe9373fabe6723481ff2d0f3ce36b72d7a64e...,https://www.snap.com/political-ads/asset/c2d3a...,USD,387,101184,2018/10/01 21:05:16Z,,Mothership Strategies,"1328 Florida Avenue NW, Building C, Washington...",,...,,,,,Provided by Advertiser,,Spanish Speakers,,,web_view_url:https://register.rockthevote.com/...
3,1c3df8a88ecfd08123d59e8378460d4973e9665e626550...,https://www.snap.com/political-ads/asset/953f2...,USD,971,265217,2018/09/19 17:13:42Z,2018/09/28 06:59:59Z,The Modesto Bee,"948 11th Street, Suite 300,Modesto,95354,US",,...,,,"Bookworms & Avid Readers,Green Living Enthusia...",,Provided by Advertiser,,,,,web_view_url:http://bit.ly/2MMlKfn
4,998ce79da3f259c96b928dae99b1fef80b737b25db6a18...,https://www.snap.com/political-ads/asset/e1b1d...,USD,115,37149,2018/09/25 04:00:00Z,2018/09/26 03:59:59Z,ACRONYM,US,,...,,,"Cordcutters,Yoga Enthusiasts,Vegans & Organic ...",,Provided by Advertiser,,,,,web_view_url:https://ourlivesourvote.com/regis...


### Baseline Model

In [3]:
cat_features = ['Gender', 'OsType']
cat_transformer = Pipeline([('const', SimpleImputer(strategy='constant', fill_value='all')), 
                           ('one-hot', OneHotEncoder(handle_unknown='ignore'))])
preproc = ColumnTransformer([('cat', cat_transformer, cat_features)])
lr = LinearRegression()
pl = Pipeline([('preproc', preproc), ('lin-reg', lr)])
X = ads[['Gender', 'OsType']]
Y = ads['Spend']
pl.fit(X, Y)
print(pl.predict(X))
pl.score(X, Y) #Using ONLY Gender and OsType as features for our modeling pipeline is terrible as you can see from the score below

[1780. 1780. 1780. ... 1780. 1780. 1780.]


0.000681159085847094

### Final Model

In [4]:
ads['StartDate'] = pd.to_datetime(ads['StartDate'])
ads['EndDate'] = pd.to_datetime(ads['EndDate'])
ads['Duration'] = ads['EndDate'] - ads['StartDate']
def convert_timedelta(duration):
    days, seconds = duration.days, duration.seconds
    hours = (days * 24 + seconds // 3600)
    minutes = (seconds % 3600) // 60
    seconds = (seconds % 60)
    return hours + (minutes / 60) + (seconds / 3600)
ads['Duration'] = ads['Duration'].apply(convert_timedelta)

In [5]:
num_features = ['Duration', 'Impressions']
cat_features_final = ['Gender', 'OsType', 'AgeBracket', 'BillingAddress', 'Segments']
cat_final_transformer = Pipeline([('const', SimpleImputer(strategy='constant', fill_value='all')),
                                  ('one-hot', OneHotEncoder(handle_unknown='ignore'))])
num_transformer = Pipeline([('mean', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())])
preproc_final = ColumnTransformer([('num', num_transformer, num_features), ('cat', cat_final_transformer, cat_features_final)])
pl_final = Pipeline([('preproc_final', preproc_final), ('lin-reg-final', lr)])
X_final = ads[['Duration', 'Impressions', 'Gender', 'OsType', 'AgeBracket', 'BillingAddress', 'Segments']]
Y = ads['Spend']
X_tr, X_ts, Y_tr, Y_ts = train_test_split(X_final, Y, test_size = 0.25)
pl_final.fit(X_tr, Y_tr)
pred = pl_final.predict(X_ts)

In [6]:
pl_final.score(X_tr, Y_tr)

0.9040094346018832

### Fairness Evaluation

In [7]:
a = ads.loc[ads['Language'] == 'en']
b = ads.loc[ads['Language'] == 'fr']

a_X = a[['Duration', 'Impressions', 'Gender', 'OsType', 'AgeBracket', 'BillingAddress', 'Segments']]
a_Y = a['Spend']
b_X = b[['Duration', 'Impressions', 'Gender', 'OsType', 'AgeBracket', 'BillingAddress', 'Segments']]
b_Y = b['Spend']
def r_diff(a_x, a_y, b_x, b_y):
    pl_final.fit(a_x, a_y)
    pl_final.fit(b_x, b_y)
    return pl_final.score(b_x, b_y) - pl_final.score(a_x, a_y)
obs = r_diff(a_X,a_Y,b_X,b_Y)
both = pd.concat([a,b])
n_reps = 100
results = []
for i in range(n_reps):
    sample = both['Language'].sample(frac=1, replace=False).reset_index(drop=True)
    both['Shuffled'] = sample
    an = both.loc[both['Shuffled'] == 'en']
    bn = both.loc[both['Shuffled'] == 'fr']
    a_Xn = an[['Duration', 'Impressions', 'Gender', 'OsType', 'AgeBracket', 'BillingAddress', 'Segments']]
    a_Yn = an['Spend']
    b_Xn = bn[['Duration', 'Impressions', 'Gender', 'OsType', 'AgeBracket', 'BillingAddress', 'Segments']]
    b_Yn = bn['Spend']
    results.append(r_diff(a_Xn,a_Yn,b_Xn,b_Yn))
(results <= obs).mean()

0.1

In [8]:
ads

Unnamed: 0,ADID,CreativeUrl,Currency Code,Spend,Impressions,StartDate,EndDate,OrganizationName,BillingAddress,CandidateBallotInformation,...,Location Categories (Excluded),Interests,OsType,Segments,Language,AdvancedDemographics,Targeting Connection Type,Targeting Carrier (ISP),CreativeProperties,Duration
0,29cbbf5621975dbd4ffd3826f22e781ca4f41fe4cd61c5...,https://www.snap.com/political-ads/asset/d5926...,USD,148,36028,2018-11-06 12:04:28+00:00,2018-11-07 02:04:31+00:00,Friends of Jess King,US,,...,,,,Provided by Advertiser,,,,,web_view_url:https://jesskingforcongress.com/v...,14.000833
1,1ee35b50c5d194f4bf23196ad3645d16951439d9bfbbd2...,https://www.snap.com/political-ads/asset/fbee4...,EUR,447,339296,2018-09-28 12:59:59+00:00,2018-10-12 12:59:59+00:00,Jalt,"Krom boomssloot 22-1,Amsterdam,1011GW,NL",,...,,,,Provided by Advertiser,,,,,web_view_url:https://secure.amnesty.nl/petitie...,336.000000
2,fabfccf0dfe9373fabe6723481ff2d0f3ce36b72d7a64e...,https://www.snap.com/political-ads/asset/c2d3a...,USD,387,101184,2018-10-01 21:05:16+00:00,NaT,Mothership Strategies,"1328 Florida Avenue NW, Building C, Washington...",,...,,,,Provided by Advertiser,,Spanish Speakers,,,web_view_url:https://register.rockthevote.com/...,
3,1c3df8a88ecfd08123d59e8378460d4973e9665e626550...,https://www.snap.com/political-ads/asset/953f2...,USD,971,265217,2018-09-19 17:13:42+00:00,2018-09-28 06:59:59+00:00,The Modesto Bee,"948 11th Street, Suite 300,Modesto,95354,US",,...,,"Bookworms & Avid Readers,Green Living Enthusia...",,Provided by Advertiser,,,,,web_view_url:http://bit.ly/2MMlKfn,205.771389
4,998ce79da3f259c96b928dae99b1fef80b737b25db6a18...,https://www.snap.com/political-ads/asset/e1b1d...,USD,115,37149,2018-09-25 04:00:00+00:00,2018-09-26 03:59:59+00:00,ACRONYM,US,,...,,"Cordcutters,Yoga Enthusiasts,Vegans & Organic ...",,Provided by Advertiser,,,,,web_view_url:https://ourlivesourvote.com/regis...,23.999722
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4263,74a482fdec1ed3e015c3a8994ad7b84cf37a6ffba02f64...,https://www.snap.com/political-ads/asset/ea141...,EUR,9,1919,2019-09-02 11:00:00+00:00,2019-09-08 21:59:00+00:00,Idium 1881 AS,"Rolf Wickstrøms vei 15,Oslo,0484,NO",,...,,,,Provided by Advertiser,,,,,web_view_url:https://www.fellesforbundet.no/sa...,154.983333
4264,cdfa9d63392a7fbf9b445c8bfff7e9f3fb5d2b519e1d5f...,https://www.snap.com/political-ads/asset/0f2e2...,EUR,3400,971417,2019-08-17 15:53:04+00:00,2019-09-10 00:38:49+00:00,Arbeiderpartiet,"Youngstorget 2A,Oslo,0028,NO",,...,,,,Provided by Advertiser,,,,,web_view_url:https://www.arbeiderpartiet.no/ak...,560.762500
4265,77cf93e14b3b210810717ea095ed7cdeb83672de575ccb...,https://www.snap.com/political-ads/asset/191b6...,GBP,372,267344,2019-07-05 22:42:38+00:00,2019-08-02 22:39:25+00:00,Arbeiderpartiet i Bergen,NO,,...,,,,Provided by Advertiser,,,,,web_view_url:https://apibergen.arbeiderpartiet...,671.946389
4266,ee48ffaabd41d90ea280aede7d381b43459b428cf7fcfb...,https://www.snap.com/political-ads/asset/112d9...,USD,140,35030,2019-09-17 20:42:15+00:00,NaT,UnRestrict Minnesota,US,,...,,,,,,,,,web_view_url:https://unrestrictmn.org/?utm_sou...,
