# Snapchat Political Ads
This project uses political ads data from Snapchat, a popular social media app. Interesting questions to consider include:
- What are the most prevalent organizations, advertisers, and ballot candidates in the data? Do you recognize any?
- What are the characteristics of ads with a large reach, i.e., many views? What may a campaign consider when maximizing an ad's reach?
- What are the characteristics of ads with a smaller reach, i.e., fewer views? Aside from funding constraints, why might a campaign want to produce an ad with a smaller but more targeted reach?
- What are the characteristics of the most expensive ads? If a campaign is limited on advertising funds, what type of ad may the campaign consider?
- What groups or regions are targeted frequently? (For example, for single-gender campaigns, are men or women targeted more frequently?) What groups or regions are targeted less frequently? Why? Does this depend on the type of campaign?
- Have the characteristics of ads changed over time (e.g. over the past year)?
- When is the most common local time of day for an ad's start date? What about the most common day of week? (Make sure to account for time zones for both questions.)

### Getting the Data
The data and its corresponding data dictionary are downloadable [here](https://www.snap.com/en-US/political-ads/). Download both the 2018 CSV and the 2019 CSV.

The CSVs have the same filename; rename the CSVs as needed.

Note that the CSVs have the exact same columns and the exact same data dictionaries (`readme.txt`).

### Cleaning and EDA
- Concatenate the 2018 CSV and the 2019 CSV into one DataFrame so that we have data from both years.
- Clean the data.
    - Convert `StartDate` and `EndDate` into datetime. Make sure the datetimes are in the correct time zone.
- Understand the data in ways relevant to your question using univariate and bivariate analysis of the data as well as aggregations.
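The concatenation step above can be sketched as follows. This is a minimal illustration on two made-up frames, not the project's data. The `ignore_index=True` argument is an optional choice, assumed here because it gives the combined frame a fresh 0..n-1 index instead of repeating each file's row labels.

```python
import pandas as pd

# Two tiny stand-ins for the yearly CSVs (values are made up for illustration).
df_a = pd.DataFrame({'ADID': ['a1', 'a2'], 'Spend': [100, 250]})
df_b = pd.DataFrame({'ADID': ['b1', 'b2'], 'Spend': [75, 900]})

# Stack the rows; ignore_index=True renumbers them 0..n-1 so labels stay unique.
combined = pd.concat([df_a, df_b], ignore_index=True)

print(combined.shape)            # (4, 2)
print(combined.index.is_unique)  # True
```

Without `ignore_index=True` the row labels of both inputs are kept, so label-based lookups such as `combined.loc[0]` would return one row per file.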

*Hint 1: What is the "Z" at the end of each timestamp?*

*Hint 2: `pd.to_datetime` will be useful here. `Series.dt.tz_convert` will be useful if a change in time zone is needed.*
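As a rough sketch of the two hints, the snippet below parses two timestamps in the dataset's format as UTC and converts them to a local zone. The zone `'America/Los_Angeles'` is only an assumed example; the ads target several countries, so a real analysis would need to choose a zone per ad (for instance from `CountryCode`).

```python
import pandas as pd

# Two timestamps in the format the dataset uses; the trailing "Z" denotes UTC.
raw = pd.Series(['2018/10/30 17:45:51Z', '2018/11/07 00:00:00Z'])

# Parse as timezone-aware UTC datetimes (an explicit format avoids ambiguous parsing).
utc = pd.to_datetime(raw, format='%Y/%m/%d %H:%M:%S%z', utc=True)

# Convert to a local zone; 'America/Los_Angeles' is only an example choice.
local = utc.dt.tz_convert('America/Los_Angeles')

print(local.dt.hour.tolist())        # [10, 16]
print(local.dt.day_name().tolist())  # ['Tuesday', 'Tuesday']
```

The second timestamp falls on November 7 in UTC but on November 6 in the local zone, which is why the conversion matters for time-of-day and day-of-week questions.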

*Tip: To visualize geospatial data, consider [Folium](https://python-visualization.github.io/folium/) or another geospatial plotting library.*

### Assessment of Missingness
Many columns which have `NaN` values may not actually have missing data. How come? In some cases, a null or empty value corresponds to an actual, meaningful value. For example, `readme.txt` states the following about `Gender`:

>  Gender - Gender targeting criteria used in the Ad. If empty, then it is targeting all genders

In this scenario, an empty `Gender` value (which is read in as `NaN` in pandas) corresponds to "all genders".
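One way to make this explicit is to replace such `NaN` values with a label before analysis. The sketch below uses a made-up `Gender` column; the values `'FEMALE'` and `'MALE'` and the label `'ALL'` are assumptions for illustration, not taken from the data dictionary.

```python
import numpy as np
import pandas as pd

# Toy column: NaN here means "no gender targeting", not an unknown value.
toy = pd.DataFrame({'Gender': ['FEMALE', np.nan, 'MALE', np.nan]})

# Replace the by-design NaN with an explicit (hypothetical) label.
toy['GenderFilled'] = toy['Gender'].fillna('ALL')

print(toy['GenderFilled'].tolist())  # ['FEMALE', 'ALL', 'MALE', 'ALL']
```

Keeping the original column alongside the filled one makes it easy to check later which rows were empty in the raw file.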

- Refer to the data dictionary to determine which columns do **not** belong to the scenario above. Assess the missingness of one of these columns.
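A common way to assess missingness is a permutation test that asks whether the missingness of one column depends on the values of another. The sketch below is one possible setup, run on synthetic data: the column names `EndDate` and `Spend` are borrowed only as labels, the helper `missingness_pvalue` is hypothetical, and the absolute difference in means is an assumed choice of test statistic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the ads table (all values are made up).
n = 200
toy = pd.DataFrame({'Spend': rng.integers(1, 5000, size=n).astype(float)})
# Make 'EndDate' missing more often for low-spend rows so there is a dependence to detect.
p_missing = np.where(toy['Spend'] < 1000, 0.6, 0.1)
toy['EndDate'] = np.where(rng.random(n) < p_missing, np.nan, 1.0)  # 1.0 is a placeholder value

def missingness_pvalue(df, target, other, n_perm=1000, rng=rng):
    """Permutation test: does missingness of `target` depend on `other`?"""
    is_missing = df[target].isna().to_numpy()
    values = df[other].to_numpy()
    # Observed statistic: absolute difference in mean of `other` between the two groups.
    observed = abs(values[is_missing].mean() - values[~is_missing].mean())
    count = 0
    for _ in range(n_perm):
        # Shuffle the missingness labels to simulate "no dependence".
        shuffled = rng.permutation(is_missing)
        stat = abs(values[shuffled].mean() - values[~shuffled].mean())
        count += stat >= observed
    return observed, count / n_perm

observed, p_value = missingness_pvalue(toy, 'EndDate', 'Spend')
print(observed, p_value)
```

A small p-value suggests the missingness depends on the other column; a large one is consistent with no dependence on that column, though it does not prove it.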

### Hypothesis Test / Permutation Test
Find a hypothesis test or permutation test to perform. You can use the questions at the top of the notebook for inspiration.

# Summary of Findings

### Introduction
TODO #Create/Select a question to explore
* The more common political party advertising
    * could split this on gender; e.g., are men more Democratic?
* The highest impression rate by:
    * age group?
    * political party?
        * could assess the political preference of Snapchat users
    * both?
        * could assess political preference of users by age
* Does spending more money equate to more impressions?
    * split by party
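As a first look at the spend question, the sketch below computes a Pearson correlation between `Spend` and `Impressions` on the ten rows visible in the preview output in the Code section. Ten rows are far too few to draw conclusions, and the derived column name `ImpressionsPerSpend` is a hypothetical label, so this only illustrates the mechanics.

```python
import pandas as pd

# Spend and Impressions copied from the ten preview rows shown in the Code section.
sample = pd.DataFrame({
    'Spend': [1044, 279, 6743, 3698, 445, 651, 132, 659, 2784, 250],
    'Impressions': [137185, 94161, 3149886, 573475, 232906,
                    88101, 16923, 58543, 908208, 284533],
})

# Impressions obtained per unit of spend (hypothetical helper column).
sample['ImpressionsPerSpend'] = sample['Impressions'] / sample['Spend']

# Pearson correlation between spend and impressions.
pearson_r = sample['Spend'].corr(sample['Impressions'])
print(pearson_r)
```

On the full data, a scatter plot on log scales would likely be more informative than a single coefficient, since a few very large ads can dominate the correlation.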

### Cleaning and EDA
TODO

### Assessment of Missingness

The columns * contain null values. Looking at readme.txt, the null values in * are explained/accounted for. We must then assess the missingness of the remaining columns *

TODO

### Hypothesis Test
TODO

# Code

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

### Cleaning and EDA

In [3]:
# load the csv files for 2018 and 2019
fp_18 = os.path.join('data', 'ads_2018.csv')
fp_19 = os.path.join('data', 'ads_2019.csv')
# read files into dataframes
df_18 = pd.read_csv(fp_18)
df_19 = pd.read_csv(fp_19)
ad_data = pd.concat([df_18, df_19])
ad_data

Unnamed: 0,ADID,CreativeUrl,Spend,Impressions,StartDate,EndDate,OrganizationName,BillingAddress,CandidateBallotInformation,PayingAdvertiserName,...,Interests,OsType,Segments,LocationType,Language,AdvancedDemographics,Targeting Connection Type,Targeting Carrier (ISP),Targeting Geo - Postal Code,CreativeProperties
0,91db2796a80472ed8c2bfa17760b3ce1471f6ec1f3147b...,https://www.snap.com/political-ads/asset/b2c47...,1044,137185,2018/10/30 17:45:51Z,2018/11/07 00:00:00Z,"GMMB, Inc","3050 K Street,Washington,20007,US",,JB for Governor,...,,,Provided by Advertiser,,,,,,,web_view_url:https://iwillvote.com/?state=il
1,97e3f17d5ec164c454a35d2822734482ca60be3f3af310...,https://www.snap.com/political-ads/asset/affc7...,279,94161,2018/12/23 14:26:52Z,2018/12/28 14:28:06Z,Revolution Messaging,"1730 Rhode Island Ave NW,Washington,20036,US",,Paid for by ReBuild USA,...,"Arts & Culture Mavens,Chat Fiction Enthusiasts...",,Provided by Advertiser,,,,,,,web_view_url:https://rebuildusa.info/landing-3
2,14535fea019a9b1a910a77ce1555af8bdedbb5c78fb60a...,https://www.snap.com/political-ads/asset/754f6...,6743,3149886,2018/10/06 01:11:41Z,2018/11/07 03:00:00Z,Lockwood Strategy,US,,Change Now,...,"TV Live Event Viewers (The Academy Awards),TV ...",,Provided by Advertiser,,,,,,,web_view_url:https://action.socalhealthcarecoa...
3,10b64550ad4a23c651d7883746cabeac93cbd92d5f3b3f...,https://www.snap.com/political-ads/asset/818ae...,3698,573475,2018/11/02 16:20:57Z,2018/11/06 18:15:30Z,The Prosper Group,"435 E. Main,Greenwood,46143,US",,No On L,...,,,Provided by Advertiser,,,,,,"92801,92802,92803,92804,92805,92806,92807,9280...",web_view_url:https://www.stopmeasurel.com
4,2438786c60ae41cf56614885b415a72857bbfb5c06f760...,https://www.snap.com/political-ads/asset/2c264...,445,232906,2018/11/27 21:44:19Z,2019/01/13 21:43:53Z,Amnesty International UK,"17-25 New Inn Yard,London,EC2A 3EA,GB",,Amnesty International UK,...,,,Provided by Advertiser,,,,,,,web_view_url:https://www.amnesty.org.uk/write-...
5,e2124010037e7f87d518d2d02569b33a500786f4c4138b...,https://www.snap.com/political-ads/asset/39ceb...,651,88101,2018/10/17 15:00:00Z,2018/11/07 04:00:00Z,Democratic Congressional Campaign Committee,"430 S Capitol St SE,Washington,20003,US",,DCCC,...,,,,,,,,,,web_view_url:https://mypollingplace.org/
6,ad196210394ae33426eacc77649600a60deedb56e3957e...,https://www.snap.com/political-ads/asset/4f5b7...,132,16923,2018/10/28 17:58:01Z,2018/11/06 22:59:59Z,Mothership Strategies,"1328 Florida Avenue NW, Building C, Washington...",,Progressive Turnout Project,...,,,,,,,,,,web_view_url:http://votingmatters.org/
7,51b851e3247ba7ac3591778553eb43cd41a03a10ff8cbd...,https://www.snap.com/political-ads/asset/55a38...,659,58543,2018/10/18 16:25:09Z,2018/11/06 23:00:00Z,Bully Pulpit Interactive,"1140 Connecticut Ave NW, Suite 800,Washington,...",,NextGen America,...,,,,,,,,,"95355,95382,95316,95377,95353,95361,95313,9537...",
8,1ce7c447a1b781684558b8b52a1074527074b71c9b459c...,https://www.snap.com/political-ads/asset/1b8bb...,2784,908208,2018/09/02 09:59:57Z,2018/09/15 09:59:59Z,Media Agent,"Østre alle 2 ,Værløse ,3500,DK",,Dansk_Folkeparti,...,,,Provided by Advertiser,,,,,,,web_view_url:https://danskfolkeparti.dk
9,ff7487becb4bf23c7670b9682080a701a27cfa3febabea...,https://www.snap.com/political-ads/asset/90f33...,250,284533,2018/10/01 08:52:13Z,2018/10/06 08:52:13Z,Amnesty International Switzerland,CH,,Amnesty International,...,,,Provided by Advertiser,,,,,,,web_view_url:https://amnestyyouth1.typeform.co...


In [66]:
# convert StartDate and EndDate to datetime objects (the timestamps are in UTC)
# time zones will not be used in our analysis
ad_data['StartDate'] = pd.to_datetime(ad_data['StartDate'])
ad_data['EndDate'] = pd.to_datetime(ad_data['EndDate'])

# display the data type of each column
ad_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3303 entries, 0 to 2643
Data columns (total 27 columns):
ADID                           3303 non-null object
CreativeUrl                    3303 non-null object
Spend                          3303 non-null int64
Impressions                    3303 non-null int64
StartDate                      3303 non-null datetime64[ns]
EndDate                        2647 non-null datetime64[ns]
OrganizationName               3303 non-null object
BillingAddress                 3303 non-null object
CandidateBallotInformation     225 non-null object
PayingAdvertiserName           3303 non-null object
Gender                         322 non-null object
AgeBracket                     3029 non-null object
CountryCode                    3303 non-null object
RegionID                       1013 non-null object
ElectoralDistrictID            65 non-null object
LatLongRad                     0 non-null float64
MetroID                        180 non-null object
In

In [63]:
# get the proportion of null values for each column
prop_null = ad_data.isna().mean()
# find the columns with null values
has_null = pd.Series(prop_null[prop_null > 0].index)

# the proportion of null values in each column
prop_null

ADID                           0.000000
CreativeUrl                    0.000000
Spend                          0.000000
Impressions                    0.000000
StartDate                      0.000000
EndDate                        0.198607
OrganizationName               0.000000
BillingAddress                 0.000000
CandidateBallotInformation     0.931880
PayingAdvertiserName           0.000000
Gender                         0.902513
AgeBracket                     0.082955
CountryCode                    0.000000
RegionID                       0.693309
ElectoralDistrictID            0.980321
LatLongRad                     1.000000
MetroID                        0.945504
Interests                      0.762035
OsType                         0.993642
Segments                       0.337269
LocationType                   0.994550
Language                       0.723282
AdvancedDemographics           0.970936
Targeting Connection Type      1.000000
Targeting Carrier (ISP)        1.000000


In [62]:
# the following columns have null values
print(has_null)

0                         EndDate
1      CandidateBallotInformation
2                          Gender
3                      AgeBracket
4                        RegionID
5             ElectoralDistrictID
6                      LatLongRad
7                         MetroID
8                       Interests
9                          OsType
10                       Segments
11                   LocationType
12                       Language
13           AdvancedDemographics
14      Targeting Connection Type
15        Targeting Carrier (ISP)
16    Targeting Geo - Postal Code
17             CreativeProperties
dtype: object


In [58]:
# the following columns have values missing by design

    # Gender: if null, then target all genders
    # AgeBracket: if null, then target all ages
    # RegionID: if null, then target all regions in target country
    # ElectoralDistrictID: if null, then target all electoral districts in the target country
    # LatLongRad: if null, then target all lat/long in the target country
    # MetroID: if null, then target all metros in the target country
    # Interests: if null, then the ad is agnostic to interests
    # OsType: if null, then target all operating systems
    # Language: if null, then the ad is agnostic to language
    # AdvancedDemographics: if null, then the ad is agnostic to 3rd party data segments
    # Targeting Connection Type: if null, then the ad is agnostic to internet connection type
    # Targeting Carrier (ISP): if null, then the ad is agnostic to carrier type
    # Targeting Geo - Postal Code: if null, then targets all postal codes in the target country

# columns missing by design
mbd = [
    'Gender', 'AgeBracket', 'RegionID', 'ElectoralDistrictID', 'LatLongRad', 'MetroID',
    'Interests', 'OsType', 'Language', 'AdvancedDemographics', 'Targeting Connection Type', 
    'Targeting Carrier (ISP)', 'Targeting Geo - Postal Code'
]

# get the column names that are not missing by design
not_mbd = has_null[~has_null.isin(mbd)]
not_mbd

0                        EndDate
1     CandidateBallotInformation
10                      Segments
11                  LocationType
17            CreativeProperties
dtype: object

In [60]:
# the proportion of nulls for the remaining columns
prop_null[not_mbd]

EndDate                       0.198607
CandidateBallotInformation    0.931880
Segments                      0.337269
LocationType                  0.994550
CreativeProperties            0.195883
dtype: float64

### Assessment of Missingness

In [None]:
# TODO

### Hypothesis Test

In [None]:
# TODO