# Python Project

In this part of the course we will use Python to predict poll results for the 2016 Presidential Election. Sounds fun?

Here's what we're going to do:
1. use selenium to scrape search volume data from Google Trends
2. use pandas to prepare the data
3. use data.world to download 538 data from polls in each state
4. use scipy to do some data analysis
5. use XXX to visualize our results

Let's go!

## 1. Web scraping
Before starting to code your brains out, it's worth taking a look at [Google Trends](https://trends.google.com/trends). Familiarize yourself with the basic functionalities of the website. Search for something and try to get the data in csv form.

Some things to think about:
- Google Trends normalizes the data so that the peak search interest in the results corresponds to a value of 100. You can't unnormalize the data, but maybe it can actually save you time! Ask: How would I normalize anyways?
- With this in mind, should you search for both candidates simultaneously or for one at a time?
- There are different options to search: search terms and concepts. Which should you use?
- We want to have a time series for each state. There are two main ways to accomplish this: export the time series for each state or export the data in the map for each time point. Which should you use?
- Can you use a static method or will you have to use a dynamic webdriver?

For this step-by-step guide we will compare both search terms simultaneously and export the map data for every day in the year leading up to the election. If the volume is very low, Google doesn't report anything. Although there exists a [workaround](https://www.sciencedirect.com/science/article/pii/S0047272714000929) we won't bother with it here.

As a first step, write a function `open_driver()` that imports and opens the selenium webdriver. Return the webdriver as an object. I suggest that you also import `By` and `Options`.

You could try changing the download location using Options, but it might not work on your OS. We'll rename the downloads anyways so you can just move them to the correct location at that step. Don't worry, by now that shouldn't be a challenge for you!

In [1]:
def open_driver():
    """
    Function to open the chrome webdriver.
    """
    # ---
    # add your code here
    
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    
    chrome_options = Options()
    #chrome_options.add_argument("--headless") # doesn't work yet
    chrome_options.add_argument('--no-sandbox')
    # Selenium 4 takes the driver path via Service and the options via `options=`
    driver = webdriver.Chrome(service=Service('/usr/local/bin/chromedriver'), options=chrome_options)
    
    # ---
    print('Chrome driver is good to go!')
    return driver

Make sure to test your function:

In [286]:
driver = open_driver()

Next, set up an example search for both candidates on election day (Nov 8, 2016). Take a look at your URL. While the date menu is relatively complicated to navigate, changing the date through the URL is straightforward, so that is what we're going to do.
https://trends.google.com/trends/explore?hl=en&date=[START_DATE]%20[END_DATE]&geo=[REGION]&q=[SEARCH_TERMS]

Note: I added the `hl=en` flag to the URL; otherwise the labels might not be in English and you'll have trouble matching the data.

Write a function that takes a start date, end date, region, and a list of search terms as inputs and returns the url. For fancy pants: Check if `searchterms` is a list and join it using commas if necessary. Replace spaces in search terms by `%20`.

In [2]:
def build_url(start, end, region, searchterms):
    """
    Function to construct the URL.
    """
    # ---
    # add your code here
    
    if isinstance(searchterms, list):
        searchterms = ','.join(map(str, searchterms)) # map(str, ...) keeps the join from failing on non-string items
    searchterms = searchterms.replace(' ','%20')
    
    url = 'https://trends.google.com/trends/explore?hl=en&date='+start+'%20'+end+'&geo='+region+'&q='+searchterms
    
    # ---
    return url

Test the function:

In [283]:
url = build_url('2016-11-08','2016-11-08','US',['Donald Trump','Hillary Clinton'])
url

'https://trends.google.com/trends/explore?hl=en&date=2016-11-08%202016-11-08&geo=US&q=Donald%20Trump,Hillary%20Clinton'
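
Manual replacement only handles spaces. As a hedged alternative, the standard library's `urllib.parse.quote` percent-encodes any other reserved character as well. The sketch below assumes the same URL layout as above; `build_url_quoted` is a hypothetical name used for illustration only.

```python
from urllib.parse import quote


def build_url_quoted(start, end, region, searchterms):
    """Variant of the URL builder that percent-encodes the search terms."""
    # Accept a single string or a list/tuple of terms.
    if isinstance(searchterms, (list, tuple)):
        searchterms = ','.join(str(term) for term in searchterms)
    # Keep the comma literal because it separates the terms in the query.
    query = quote(searchterms, safe=',')
    return ('https://trends.google.com/trends/explore?hl=en'
            '&date=' + start + '%20' + end +
            '&geo=' + region + '&q=' + query)


print(build_url_quoted('2016-11-08', '2016-11-08', 'US',
                       ['Donald Trump', 'Hillary Clinton']))
```

For the two candidate names this gives the same URL as the manual version; the difference only shows up for terms containing characters such as `&` or `#`.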

Now comes the tricky part: Write a function that opens the url in the driver and downloads the csv file for the map. Be careful, as there are multiple download buttons!

Note: Google has a nasty habit of producing an error the first time you access the URL. A relatively reliable workaround uses the `time.sleep()` function to wait two seconds and try again. A more sophisticated solution would check if the download button is there and reload periodically until it is (although in my experience, either you get the page on the second attempt or you don't get it anytime soon).

Also: All the files will have the same name when downloaded. This can cause some problems, especially if you start a new download while there's still a previous file in the directory. I suggest that you first remove the old file (if it exists) and that you wait at the end until your download is complete. Both can easily be achieved using an `if os.path.exists("path/to/your/file"):` clause.

In [3]:
def download_csv(url, driver):
    """
    Function to download the csv file of the map.
    """
    # ---
    # add your code here
    
    import os
    import time
    from selenium.webdriver.common.by import By
    print('... start download...')
    
    map_dl = '/Users/czuend/Downloads/geoMap.csv'
    if os.path.exists(map_dl):
        os.remove(map_dl)

    export_map = []
    while len(export_map) == 0:
        print('... ... try loading the page...')
        driver.get(url)
        time.sleep(2)
        # find_elements(By.XPATH, ...) replaces find_elements_by_xpath(), which Selenium 4 removed
        export_map = driver.find_elements(By.XPATH, '//*[@class="fe-multi-heat-map-generated fe-atoms-generic-container"]')
    
    export_map[0].find_element(By.XPATH, './/*[@title="CSV"]').click()
    
    while not os.path.exists(map_dl):
        time.sleep(1)
        
    del export_map # maybe not needed with the download_csv function. 
    
    print('... download complete.')
    
    # ---
    return

... and test it:

In [284]:
download_csv(url, driver)

... start download...
... ... wait one second and try again...
... ... wait one second and try again...
... download complete.
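
A loop that waits for the file without an upper limit can hang forever if the download never starts. A minimal sketch of a bounded wait, using only the standard library; `wait_for_file` and its default values are assumptions for illustration, not part of the solution above.

```python
import os
import time


def wait_for_file(path, timeout=30.0, poll_interval=0.5):
    """Return True once `path` exists, or False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll_interval)
    # One last check in case the file appeared during the final sleep.
    return os.path.exists(path)
```

Inside a download function this could be used as `if not wait_for_file(map_dl): raise RuntimeError('download timed out')`, so that a stuck download stops the script instead of blocking it.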


We have to rename (and possibly move) the downloaded csv file. Name the file `"map_[searchterms]_[startdate]_[enddate]_[region].csv"`, to avoid accidentally overwriting existing files if you later explore other search specifications.

In [4]:
def rename_csv(start, end, region, searchterms):
    """
    Function to rename and move files.
    """
    # ---
    # add your code here
    
    import os
    
    if isinstance(searchterms, list):
        searchterms = ','.join(map(str, searchterms))
    searchterms = searchterms.replace(' ','%20')
    
    data_dir = 'data' # avoid the name `dir`, which shadows a Python built-in
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)

    map_dl = '/Users/czuend/Downloads/geoMap.csv'
    map_name = data_dir+'/map_'+searchterms+'_'+start+'_'+end+'_'+region+'.csv'

    os.rename(map_dl, map_name)
    
    # ---
    return

... and test it:

In [71]:
rename_csv('2016-11-08','2016-11-08','US',['Donald Trump','Hillary Clinton'])
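
`os.rename()` can raise an `OSError` when the download folder and the target folder sit on different file systems. A hedged alternative is `shutil.move()`, which falls back to copying and deleting in that case. The helper name `move_download` and the demo file names are made up for this sketch.

```python
import os
import shutil
import tempfile


def move_download(src, dest_dir, new_name):
    """Move `src` into `dest_dir` under `new_name` and return the new path."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, new_name)
    # shutil.move copies and deletes if a plain rename is not possible.
    shutil.move(src, dest)
    return dest


# Demo in a temporary directory so that no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'geoMap.csv')
    with open(src, 'w') as fh:
        fh.write('demo')
    moved = move_download(src, os.path.join(tmp, 'data'), 'map_demo.csv')
    print(os.path.basename(moved), os.path.exists(moved))
```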

Finally, we need a list of dates to loop over. There are many ways to do this. The internet is your friend here!

Note: I suggest you only scrape a month's worth of data and copy the rest from the GitHub repo, as web scraping is time consuming and not universally appreciated (e.g. by Google's system admins).

In [12]:
def get_dates():
    """
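    Function to produce a list of dates (YYYY-MM-DD) from d1 to d2, inclusive.
    The full project range is 2015-11-08 to 2016-11-08.
    """
    # ---
    # add your code here
    
    from datetime import date, timedelta
    d1 = date(2016,7,2) # start of this run; use date(2015,11,8) for the full year
    d2 = date(2016,11,8)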
    dates = [str(d1 + timedelta(days=x)) for x in range((d2-d1).days + 1)]
    
    # ---
    return dates

One last test:

In [291]:
get_dates()

Congratulations, we're ready to put everything together!

In [13]:
def main():
    driver = open_driver()
    searchterms = ['Donald Trump','Hillary Clinton']
    region = 'US'
    dates = get_dates()
    for date in dates:
        print('Download data for: ', date)
        url = build_url(date, date, region, searchterms)
        download_csv(url, driver)
        rename_csv(date, date, region, searchterms)
    driver.quit()
    print('All data downloaded.')
    
if __name__ == '__main__':
    main()

Chrome driver is good to go!
Download data for:  2016-07-02
... start download...
... ... try loading the page...
... ... try loading the page...
... download complete.
Download data for:  2016-07-03
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-07-04
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-07-05
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-07-06
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-07-07
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-07-08
... start download...
... ... try loading the page...
... ... try loading the page...
... ... try loading the page...
... ... try loading the page...
... ... try loading the page...
... ... try loading the page...
... ... try loading the page...
... ... try lo

... download complete.
Download data for:  2016-09-12
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-09-13
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-09-14
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-09-15
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-09-16
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-09-17
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-09-18
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-09-19
... start download...
... ... try loading the page...
... download complete.
Download data for:  2016-09-20
... start download...
... ... try loading the page...
... download complete.
Downl

In [11]:
driver.quit()

NameError: name 'driver' is not defined

## 2. Prepare Google Trends data

Now that we have the data from Google, we're going to combine all the csv files in a large pandas dataframe. 

- use `os.listdir()` to generate a list of all the files
- write a loop to append all of them in a large pandas dataframe
- clean the data to remove missing values and normalize the data

First, use `os.listdir()` to generate a list `files` of all the files we want to load. If you saved other data in the same directory, you might need to filter the list and keep just the Google Trends data.

In [194]:
# ---
# add your code here

import os 
files = [f for f in os.listdir('data') if 'map_' in f]

# ---

If done correctly, you should have 367 files (Nov 8th is included twice and 2016 was a leap year). Let's see how many we got:

In [200]:
len(files)

367
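
The expected count follows from simple date arithmetic. A minimal check with the standard library, assuming the full range from 2015-11-08 to 2016-11-08 with both end points included:

```python
from datetime import date, timedelta

start = date(2015, 11, 8)
end = date(2016, 11, 8)

# Both end points count, hence the "+ 1".
n_days = (end - start).days + 1
all_dates = [str(start + timedelta(days=i)) for i in range(n_days)]

print(n_days)
print('2016-02-29' in all_dates)  # the leap day falls inside the range
```

Comparing `all_dates` with the dates found in your file names is a quick way to spot days that failed to download.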

Write a function `load_data()` to load a file into a pandas dataframe with columns `state`, `trump`, and `clinton`, and generate a column `date` that contains the date from the file name.

In [215]:
def load_data(f):
    # ---
    # add your code here
    import re
    import pandas as pd
    
    df = pd.read_csv('data/'+f, header = 1, names = ['state','trump','clinton'])
    # the file name contains the date twice (start and end); take the first match
    date = re.search(r'\d{4}-\d{2}-\d{2}',f)
    df['date'] = date.group()
    # ---
    return df

Test the function:

In [216]:
load_data(files[0])

Unnamed: 0,state,trump,clinton,date
0,South Carolina,94 %,19 %,2016-02-21
1,Massachusetts,80 %,32 %,2016-02-21
2,Georgia,89 %,23 %,2016-02-21
3,Pennsylvania,92 %,21 %,2016-02-21
4,Nevada,70 %,43 %,2016-02-21
5,Norddakota,79 %,34 %,2016-02-21
6,North Carolina,85 %,28 %,2016-02-21
7,New Jersey,86 %,27 %,2016-02-21
8,Connecticut,92 %,21 %,2016-02-21
9,Tennessee,92 %,21 %,2016-02-21
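
Extracting the date from the file name is the step most likely to fail silently, for example when a stray file slips into the list. A small sketch of a stricter parser that also checks that the match is a real calendar date; `date_from_filename` is a hypothetical helper, and the example file name only follows the naming scheme described earlier.

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r'\d{4}-\d{2}-\d{2}')


def date_from_filename(filename):
    """Return the first valid YYYY-MM-DD date in `filename`, or None."""
    match = DATE_PATTERN.search(filename)
    if match is None:
        return None
    try:
        # strptime rejects impossible dates such as 2016-13-40.
        datetime.strptime(match.group(), '%Y-%m-%d')
    except ValueError:
        return None
    return match.group()


name = 'map_Donald%20Trump,Hillary%20Clinton_2016-02-21_2016-02-21_US.csv'
print(date_from_filename(name))
print(date_from_filename('notes.txt'))
```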


Create an empty pandas dataframe `google_raw` and write a loop over `files` that applies `load_data()` on each file and appends it to `google_raw`.

In [220]:
# ---
# add your code here

import pandas as pd
google_raw = pd.DataFrame()

for f in files:
    df = load_data(f)
    # DataFrame.append() was removed in pandas 2.0; pd.concat() does the same job
    google_raw = pd.concat([google_raw, df], ignore_index=True)
    
# ---

Let's see whether it worked:

In [222]:
google_raw

Finally, we'll do some housekeeping:

1. create a copy of the 'google_raw' dataframe so that we don't have to rerun the loop if we mess up.
2. remove missing values
3. remove the `%` symbol from the percentages
4. normalize the data such that `candidate = candidate / (total searches)`
5. use `groupby` to order by `date` and `state`

In [296]:
# ---
# add your code here

import numpy as np

df = google_raw.copy()
df = df.dropna()

# some state names come back in German; map them to their English names
df = df.replace(to_replace=r'^Kalifornien$', value='California', regex=True)
df = df.replace(to_replace=r'^Norddakota$', value='North Dakota', regex=True)
df = df.replace(to_replace=r'^Süddakota$', value='South Dakota', regex=True)

# strip the trailing ' %' and convert to integers
df['trump'] = df['trump'].astype(str).str[:-2].astype(np.int64)
df['clinton'] = df['clinton'].astype(str).str[:-2].astype(np.int64)

df['tot'] = df['trump'] + df['clinton']
df['trump'] = df['trump']/df['tot']
df['clinton'] = df['clinton']/df['tot']
del df['tot']

df = df.groupby(["date", "state"]).agg('sum')
# ---

Checking the result and copying it to `google_data`:

In [297]:
google_data = df.copy()
google_data

date,state,trump,clinton
2015-11-08,Alabama,0.940000,0.060000
2015-11-08,Alaska,0.810000,0.190000
2015-11-08,Arizona,0.920000,0.080000
2015-11-08,California,0.890000,0.110000
2015-11-08,Colorado,0.910000,0.090000
2015-11-08,Connecticut,0.920000,0.080000
2015-11-08,District of Columbia,0.890000,0.110000
2015-11-08,Florida,0.900000,0.100000
2015-11-08,Georgia,0.900000,0.100000
2015-11-08,Hawaii,0.850000,0.150000
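
The normalization itself is plain arithmetic: each candidate's value is divided by the sum of both values, so the two shares add up to one. A minimal sketch in pure Python, using one row from the raw table shown earlier (`94 %` and `19 %`); the function names are assumptions for illustration.

```python
def parse_percent(cell):
    """Turn a Google Trends cell such as '94 %' into the integer 94."""
    return int(cell.replace('%', '').strip())


def search_shares(trump_cell, clinton_cell):
    """Return (trump_share, clinton_share); the two shares sum to 1."""
    trump = parse_percent(trump_cell)
    clinton = parse_percent(clinton_cell)
    total = trump + clinton
    if total == 0:
        raise ValueError('no search volume for either candidate')
    return trump / total, clinton / total


trump_share, clinton_share = search_shares('94 %', '19 %')
print(round(trump_share, 3), round(clinton_share, 3))
```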


## 3. Prepare polling data

We're going to use polling data compiled by 538. However, instead of scraping that data too, we're going to use a shortcut. Interesting data has often been scraped by somebody else before, so you can save a lot of time by googling datasets and checking GitHub and other data repositories!

1. Go to [data.world](https://data.world) and register as a new user (unless you're an old user, duh!). They have some cool data, so it's worth it.
2. Download [presidential_polls_2016_fivethirtyeight.csv](https://data.world/databeats/2016-us-presidential-election/workspace/file?filename=presidential_polls_2016_fivethirtyeight.csv) to an appropriate location.

Note: If you feel like showing off, try creating an API key and downloading the data using the python module. [Here's some guidance.](https://data.world/integrations/python)

Load the poll data into a pandas dataframe called `polls`.

In [94]:
# ---
# add your code here
import pandas as pd

polls = pd.read_csv('data/presidential_polls_2016_fivethirtyeight.csv')
# ---

Let's have a look:

In [95]:
polls

Unnamed: 0,cycle,branch,type,matchup,forecastdate,state,startdate,enddate,pollster,grade,...,adjpoll_clinton,adjpoll_trump,adjpoll_johnson,adjpoll_mcmullin,multiversions,url,poll_id,question_id,createddate,timestamp
0,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/3/2016,11/6/2016,ABC News/Washington Post,A+,...,45.20163,41.72430,4.626221,,,https://www.washingtonpost.com/news/the-fix/wp...,48630,76192,11/7/16,09:35:33 8 Nov 2016
1,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/1/2016,11/7/2016,Google Consumer Surveys,B,...,43.34557,41.21439,5.175792,,,https://datastudio.google.com/u/0/#/org//repor...,48847,76443,11/7/16,09:35:33 8 Nov 2016
2,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/2/2016,11/6/2016,Ipsos,A-,...,42.02638,38.81620,6.844734,,,http://projects.fivethirtyeight.com/polls/2016...,48922,76636,11/8/16,09:35:33 8 Nov 2016
3,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/4/2016,11/7/2016,YouGov,B,...,45.65676,40.92004,6.069454,,,https://d25d2506sfb94s.cloudfront.net/cumulus_...,48687,76262,11/7/16,09:35:33 8 Nov 2016
4,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/3/2016,11/6/2016,Gravis Marketing,B-,...,46.84089,42.33184,3.726098,,,http://www.gravispolls.com/2016/11/final-natio...,48848,76444,11/7/16,09:35:33 8 Nov 2016
5,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/3/2016,11/6/2016,Fox News/Anderson Robbins Research/Shaw & Comp...,A,...,49.02208,43.95631,3.057876,,,http://www.foxnews.com/politics/2016/11/07/fox...,48619,76163,11/7/16,09:35:33 8 Nov 2016
6,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/2/2016,11/6/2016,CBS News/New York Times,A-,...,45.11649,40.92722,4.341786,,,http://www.cbsnews.com/news/cbs-news-poll-stat...,48521,76058,11/7/16,09:35:33 8 Nov 2016
7,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/3/2016,11/5/2016,NBC News/Wall Street Journal,A-,...,43.58576,40.77325,5.365788,,,http://www.nbcnews.com/storyline/2016-election...,48480,75974,11/6/16,09:35:33 8 Nov 2016
8,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,New Mexico,11/6/2016,11/6/2016,Zia Poll,,...,44.82594,41.59978,7.870127,,,http://projects.fivethirtyeight.com.s3.amazona...,48614,76158,11/7/16,09:35:33 8 Nov 2016
9,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/4/2016,11/7/2016,IBD/TIPP,A-,...,42.92745,42.23545,6.316175,,,http://www.investors.com/politics/ibd-tipp-pre...,48916,76611,11/8/16,09:35:33 8 Nov 2016


To ensure that the data are comparable, check whether all the results are for the same election cycle, branch of government, type, forecastdate, and most importantly, the same matchup of candidates! Tabulate the data to make sure that this is indeed the case.

In [111]:
# ---
# add your code here

print(polls.groupby('cycle')['cycle'].count(),'\n')
print(polls.groupby('branch')['cycle'].count(),'\n')
print(polls.groupby('type')['cycle'].count(),'\n')
print(polls.groupby('matchup')['cycle'].count(),'\n')
print(polls.groupby('forecastdate')['cycle'].count(),'\n')
print(polls.groupby('cycle')['cycle'].count(),'\n')

# ---

cycle
2016    12624
Name: cycle, dtype: int64 

branch
President    12624
Name: cycle, dtype: int64 

type
now-cast      4208
polls-only    4208
polls-plus    4208
Name: cycle, dtype: int64 

matchup
Clinton vs. Trump vs. Johnson    12624
Name: cycle, dtype: int64 

forecastdate
11/8/16    12624
Name: cycle, dtype: int64 

cycle
2016    12624
Name: cycle, dtype: int64 



This seems mostly ok. But what's up with type? Turns out that each poll is included three times:

In [187]:
polls.groupby('poll_id')['cycle'].count()

We'll have to change that. The next step is filtering the data!

Use pandas' amazing data-slicing ability to generate `polls_filtered` such that: 
- only `polls-only` are included
- only polls with information on sample size are included
- national polls (`state` = 'U.S.') are excluded
- Maine and Nebraska have split electoral votes. Make sure to include only polls for the entire state. 

In [170]:
# ---
# add your code here

# `~` negates a boolean mask; a unary minus on booleans fails in recent pandas/numpy versions
polls_filtered = polls[(polls.type == 'polls-only') & (~polls.samplesize.isnull()) & (polls.state != 'U.S.') & (~polls.state.str.contains('CD-'))]

# ---

Let's see:

In [171]:
polls_filtered

Unnamed: 0,cycle,branch,type,matchup,forecastdate,state,startdate,enddate,pollster,grade,...,adjpoll_clinton,adjpoll_trump,adjpoll_johnson,adjpoll_mcmullin,multiversions,url,poll_id,question_id,createddate,timestamp
8424,2016,President,polls-only,Clinton vs. Trump vs. Johnson,11/8/16,New Mexico,11/6/2016,11/6/2016,Zia Poll,,...,45.03026,41.83415,8.034579,,,http://projects.fivethirtyeight.com.s3.amazona...,48614,76158,11/7/16,09:14:14 8 Nov 2016
8429,2016,President,polls-only,Clinton vs. Trump vs. Johnson,11/8/16,Virginia,11/3/2016,11/4/2016,Public Policy Polling,B+,...,47.45132,42.32406,2.164561,,,http://www.publicpolicypolling.com/pdf/2015/PP...,48349,75743,11/4/16,09:14:14 8 Nov 2016
8431,2016,President,polls-only,Clinton vs. Trump vs. Johnson,11/8/16,Iowa,11/1/2016,11/4/2016,Selzer & Company,A+,...,39.49021,45.61412,6.009565,,,http://www.desmoinesregister.com/story/news/po...,48470,75957,11/5/16,09:14:14 8 Nov 2016
8433,2016,President,polls-only,Clinton vs. Trump vs. Johnson,11/8/16,Wisconsin,10/26/2016,10/31/2016,Marquette University,A,...,46.12806,40.94420,2.939782,,,https://twitter.com/MULawPoll,48095,75264,11/2/16,09:14:14 8 Nov 2016
8434,2016,President,polls-only,Clinton vs. Trump vs. Johnson,11/8/16,North Carolina,11/4/2016,11/6/2016,Siena College,A,...,44.26774,45.04665,2.408998,,,http://www.nytimes.com/2016/11/08/upshot/trump...,48524,76066,11/7/16,09:14:14 8 Nov 2016
8435,2016,President,polls-only,Clinton vs. Trump vs. Johnson,11/8/16,Georgia,11/6/2016,11/6/2016,Landmark Communications,B,...,45.23540,48.68394,3.665207,,,http://landmarkcommunications.net/landmarkrose...,48525,76068,11/7/16,09:14:14 8 Nov 2016
8436,2016,President,polls-only,Clinton vs. Trump vs. Johnson,11/8/16,Florida,11/3/2016,11/6/2016,Quinnipiac University,A-,...,46.39114,44.03355,1.978985,,,https://poll.qu.edu/2016-presidential-swing-st...,48522,76060,11/7/16,09:14:14 8 Nov 2016
8437,2016,President,polls-only,Clinton vs. Trump vs. Johnson,11/8/16,North Carolina,11/3/2016,11/6/2016,Quinnipiac University,A-,...,47.38561,44.03075,2.978985,,,https://poll.qu.edu/2016-presidential-swing-st...,48523,76062,11/7/16,09:14:14 8 Nov 2016
8438,2016,President,polls-only,Clinton vs. Trump vs. Johnson,11/8/16,Virginia,10/27/2016,10/30/2016,ABC News/Washington Post,A+,...,46.40351,41.57526,6.069753,,,https://www.washingtonpost.com/local/virginia-...,47880,74934,11/1/16,09:14:14 8 Nov 2016
8440,2016,President,polls-only,Clinton vs. Trump vs. Johnson,11/8/16,Georgia,11/3/2016,11/6/2016,Gravis Marketing,B-,...,43.82129,47.27258,3.749071,,,http://www.gravispolls.com/2016/11/multi-state...,48852,76448,11/7/16,09:14:14 8 Nov 2016


Finally we have to aggregate the data so that we can merge it on the day and state. There are many possibilities and there's going to be an element of subjective judgement. Here's one way:

- `date`: There's a start date and an end date. We deal with this by calculating the number of days that the poll ran for and assuming that the same number of respondents participated on each day.
- `state`: There can be multiple polls in the same state and on the same day. We deal with this by calculating a weighted average based on the (estimated) sample size of each poll on that date.

Step by step:
1. copy `polls_filtered` to a new dataframe `df`. We don't want to make a mess!
2. calculate the number of days that the poll ran for from `startdate` and `enddate`.
3. calculate `samplesize_day` by dividing `samplesize` by the number of days.
4. expand the dataframe such that each observation is included `days` times.
5. generate a variable `date` with the fictitious polling date.
6. aggregate the data by date and state such that `rawpoll_clinton`, `rawpoll_trump`, `adjpoll_clinton` and `adjpoll_trump` are averages weighted by the sample size, and `samplesize` is the sum of the daily sample sizes.
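
The steps above can be illustrated on a toy example in plain Python before turning to pandas. The two polls below are made-up numbers, and `expand_poll` and `weighted_mean` are hypothetical helpers; the sketch assigns one row to each day from `startdate` to `enddate`.

```python
from datetime import datetime, timedelta


def expand_poll(startdate, enddate, samplesize, share):
    """Spread one poll evenly over its field days.

    Returns one (date, samplesize_day, share) tuple per day.
    """
    start = datetime.strptime(startdate, '%m/%d/%Y').date()
    end = datetime.strptime(enddate, '%m/%d/%Y').date()
    days = (end - start).days + 1          # both end points count
    per_day = samplesize / days            # same number of respondents each day
    return [(str(start + timedelta(days=i)), per_day, share)
            for i in range(days)]


def weighted_mean(values, weights):
    """Average of `values`, weighted by `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Two made-up polls in the same state (illustrative numbers only).
poll_a = expand_poll('11/3/2016', '11/4/2016', 1000, 46.0)
poll_b = expand_poll('11/4/2016', '11/6/2016', 600, 43.0)

# Both polls cover 2016-11-04, so that day gets a weighted average.
same_day = [row for row in poll_a + poll_b if row[0] == '2016-11-04']
shares = [row[2] for row in same_day]
sizes = [row[1] for row in same_day]
print(sum(sizes), weighted_mean(shares, sizes))
```

The larger poll contributes 500 respondents to that day and the smaller one 200, so the combined value sits closer to the larger poll's result.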

In [320]:
# ---
# add your code here

import numpy as np

df = polls_filtered.copy()

# number of days each poll was in the field (start and end date inclusive)
df['days'] = (pd.to_datetime(df['enddate']) - pd.to_datetime(df['startdate'])).dt.days + 1

# assume respondents are spread evenly over the field days
df['samplesize_day'] = df['samplesize'] / df['days']

# repeat each poll once per field day
df = df.reindex(np.repeat(df.index, df['days'])).reset_index(drop=True)

# n counts the repeats of each poll from 1, so the dates below run from
# startdate + 1 day to enddate + 1 day; drop the "+ 1" to cover startdate..enddate
df['n'] = df.groupby('poll_id').cumcount() + 1
df['date'] = pd.to_datetime(df['startdate']) + pd.to_timedelta(df['n'], unit='D')
df['date'] = df['date'].astype(str)

# average weighted by the daily sample size of each poll
wm = lambda x: np.average(x, weights=df.loc[x.index, "samplesize_day"])
# plain column -> function mapping; nested renaming dicts are no longer supported by agg()
f = {'samplesize_day': 'sum',
     'rawpoll_clinton': wm,
     'rawpoll_trump': wm,
     'adjpoll_clinton': wm,
     'adjpoll_trump': wm
    }

df = df.groupby(["date", "state"]).agg(f)
df.columns = ['samplesize', 'rawpoll_clinton', 'rawpoll_trump', 'adjpoll_clinton', 'adjpoll_trump']
# ---


Let's check the results:

In [321]:
poll_data = df.copy()
poll_data

date,state,samplesize_day,rawpoll_clinton,rawpoll_trump,adjpoll_clinton,adjpoll_trump
2015-11-07,New Hampshire,48.181818,45.000000,38.000000,45.641840,39.072340
2015-11-08,New Hampshire,48.181818,45.000000,38.000000,45.641840,39.072340
2015-11-08,South Carolina,645.000000,42.000000,47.000000,40.866410,45.800820
2015-11-09,New Hampshire,48.181818,45.000000,38.000000,45.641840,39.072340
2015-11-09,South Carolina,645.000000,42.000000,47.000000,40.866410,45.800820
2015-11-10,New Hampshire,48.181818,45.000000,38.000000,45.641840,39.072340
2015-11-10,Virginia,111.000000,50.000000,36.000000,49.381830,38.098560
2015-11-11,Iowa,91.571429,41.000000,40.000000,42.006200,41.554670
2015-11-11,Nevada,89.714286,41.000000,44.000000,41.870070,45.427960
2015-11-11,New Hampshire,48.181818,45.000000,38.000000,45.641840,39.072340


## 4. Analyze data
With our data at hand, we can finally test what Google searches reveal about political opinion. We combine the data (easy!) and use the scipy module to test some ideas. After that you're free to explore!

Combine Google and 538 data using `pd.concat()`.

In [322]:
# ---
# add your code here

data = pd.concat([google_data, poll_data], axis=1)

# ---

As always, check the result! (If you have no complete observations, something went wrong with the matching. Most likely the date wasn't the same format.)

In [325]:
data.dropna()

date,state,trump,clinton,samplesize_day,rawpoll_clinton,rawpoll_trump,adjpoll_clinton,adjpoll_trump
2015-11-08,New Hampshire,0.680000,0.320000,48.181818,45.000000,38.000000,45.641840,39.072340
2015-11-08,South Carolina,0.820000,0.180000,645.000000,42.000000,47.000000,40.866410,45.800820
2015-11-09,New Hampshire,0.630000,0.370000,48.181818,45.000000,38.000000,45.641840,39.072340
2015-11-09,South Carolina,0.880000,0.120000,645.000000,42.000000,47.000000,40.866410,45.800820
2015-11-10,New Hampshire,0.570000,0.430000,48.181818,45.000000,38.000000,45.641840,39.072340
2015-11-10,Virginia,0.750000,0.250000,111.000000,50.000000,36.000000,49.381830,38.098560
2015-11-11,Iowa,0.750000,0.250000,91.571429,41.000000,40.000000,42.006200,41.554670
2015-11-11,Nevada,0.680000,0.320000,89.714286,41.000000,44.000000,41.870070,45.427960
2015-11-11,New Hampshire,0.620000,0.380000,48.181818,45.000000,38.000000,45.641840,39.072340
2015-11-11,South Carolina,0.840000,0.160000,89.571429,41.000000,44.000000,42.378460,45.954830
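
The merge only lines up when both sources write dates in exactly the same way. The poll file uses month/day/year (for example `11/3/2016`), while the Google Trends file names use `YYYY-MM-DD`. A small standard-library sketch of the conversion; `to_iso` is a hypothetical helper name.

```python
from datetime import datetime


def to_iso(date_string, fmt='%m/%d/%Y'):
    """Convert a date string in format `fmt` to YYYY-MM-DD."""
    return datetime.strptime(date_string, fmt).strftime('%Y-%m-%d')


print(to_iso('11/3/2016'))                # four-digit year
print(to_iso('11/8/16', fmt='%m/%d/%y'))  # two-digit year
```

Zero-padding matters here: `2016-11-3` and `2016-11-03` are different strings and would not match in the index.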


Before we use scipy, it's worth keeping in mind that pandas has a lot of data science capability built in! Try to estimate the correlation between the relative search volume for Trump and his performance in the polls using `.corr()`:

In [328]:
# ---
# add your code here
print(data['rawpoll_trump'].corr(data['trump']))
print(data['rawpoll_clinton'].corr(data['clinton']))

# ---

-0.06811603007332102
-0.13837400708040404


Seems like people didn't particularly like the candidates they googled...
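
By default, `.corr()` computes the Pearson correlation coefficient and skips pairs with missing values. To make the number less of a black box, here is a hedged pure-Python sketch of the same formula; `pearson` is a hypothetical helper, and the demo values are the first five rows of the merged table above (search share and raw poll value for Trump).

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two sequences, skipping pairs that contain None."""
    pairs = [(x, y) for x, y in zip(xs, ys)
             if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError('need at least two complete pairs')
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError('correlation is undefined for a constant series')
    return cov / math.sqrt(var_x * var_y)


search_share = [0.68, 0.82, 0.63, 0.88, 0.57]
raw_poll = [38.0, 47.0, 38.0, 47.0, 38.0]
print(round(pearson(search_share, raw_poll), 3))
```

Five rows are far too few to draw conclusions from. If a p-value is wanted as well, `scipy.stats.pearsonr` returns one together with the coefficient.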

In [352]:
states = data.index.get_level_values(1).unique()

df = data.copy()
df = df.reset_index(level=['state', 'date'])

for state in states:
    print(state,':')
    df1 = df[df['state'] == state]
    print(df1['rawpoll_trump'].corr(df1['trump']))
    print(df1['rawpoll_clinton'].corr(df1['clinton']))
    print('')

New Hampshire :
0.08492454422868134
-0.05581006335463174

Alabama :
0.029880230372623186
-0.14439223672618134

Alaska :
-0.2014053238558636
0.012037607836193771

Arizona :
0.0344630781623133
-0.14547733649553238

California :
0.13005408100779295
-0.14069124636707947

Colorado :
-0.03901347523098426
-0.1789499750511866

Connecticut :
-0.12272704155784375
0.03090220191402395

District of Columbia :
-0.04074292334544256
0.06456152186700687

Florida :
-0.21371622697317946
-0.42036865055851036

Georgia :
-0.14944956122165087
-0.02927963316852944

Hawaii :
0.009737456185285322
-0.3338100210587378

Illinois :
-0.27801930012176007
-0.1491028249263646

Indiana :
0.10350899684120925
-0.2720359329020461

Iowa :
-0.11209945728693901
-0.3717510757757521

Kansas :
-0.09686310532076879
0.2428560542712669

Kentucky :
-0.09152823804312159
-0.16655976382103038

Louisiana :
-0.12140367050653773
-0.004350064553488925

Maryland :
-0.14213725665604207
-0.17145988101527238

Massachusetts :
-0.346132124108986

Some additional ideas to test if you have enough time:
- Search data around the primaries might be misleading. Does the fit improve if we only look at the period after the national conventions?
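
One possible way to restrict the sample by date is sketched below. The cutoff is an assumption: it uses the day after the Democratic National Convention, which closed on July 28, 2016 (the Republican convention ended a week earlier). The helper name and the demo rows are made up.

```python
from datetime import date

# Assumed cutoff: first day after both national conventions were over.
CUTOFF = date(2016, 7, 29)


def after_conventions(date_string, cutoff=CUTOFF):
    """True if an ISO date string (YYYY-MM-DD) falls on or after the cutoff."""
    return date.fromisoformat(date_string) >= cutoff


rows = [('2016-07-02', 'Ohio'), ('2016-07-29', 'Ohio'), ('2016-11-08', 'Iowa')]
kept = [row for row in rows if after_conventions(row[0])]
print(kept)
```

The same comparison can be applied to the `date` values of the merged data before the correlations are recomputed.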