# Summary
This is the third and final notebook in our data scraping and cleaning process. We use the .csv files created by [clean_csv1.ipynb](https://github.com/dinopants174/DataScienceFinalProject/blob/gh-pages/clean_csv1.ipynb), which found the sessions of Congress that have data on the bills voted on in that session and added that data to the .csv files. Those files identify each bill's sponsor only by name. In this notebook, we replace that name with the legislator ID that we use to identify each legislator.

This notebook produces the final iteration of our .csv files, in which each bill's sponsor is recorded as an ID that refers to our legislators .csv file.

Import the packages we need and the legislators data from [legislators.csv](https://github.com/dinopants174/DataScienceFinalProject/blob/gh-pages/data/legislators.csv).

In [2]:
import numpy as np
import pandas as pd
import re

data = pd.read_csv('data/legislators.csv')

We define a function to apply to the sponsor column of our .csv files. It replaces each sponsor namestring with the corresponding ID from our legislators data. This code is far from efficient and works around many one-off errors, such as name misspellings, hyphenated last names, and UTF-8 encoding errors.

We are aware that this is not an ideal solution, and building it was manually intensive.

In [133]:
'''
namestring is of the form: firstname#lastname#state

'''

def reference_leg_id(namestring):
    if type(namestring) == str:  #a missing sponsor is not a str and falls through to the "unknown" return at the bottom
        namestring = namestring.encode('utf-8') #encoding failed for several Hispanic names, which we corrected manually in legislators.csv
        locs = [m.start() for m in re.finditer('#', namestring)]
        lastname = ""
        state = ""
        for i in range(0,len(locs)):
            if i == 0:
                lastname = namestring[locs[i]+1:locs[i+1]]
            else:
                state = namestring[locs[i]+1:]
                
        #alters the sponsor name string so that we can find this legislator in our legislators.csv file
        if lastname == 'St Germain':
            lastname = 'St. Germain'
        if lastname == 'Waggoner' and state == 'LA':
            lastname = 'Waggonner'
        if lastname == 'Roth Jr.' and state == 'DE':
            lastname = 'Roth'
        if lastname == 'Lambert' and state == 'AR':
            lastname = 'Lincoln'
        if lastname == 'Jackson-Lee' and state == 'TX':
            lastname = 'Jackson Lee'
        
        #we find all the legislators that have the same last name and state
        res = data.vote_id[(data.last_name == lastname) & (data.state == state)]
        
        #if we have found more than one, ex. Kennedy from MA of which there have been several legislators
        if (len(res) > 1): #we now use the first name to identify the correct legislator
            firstname = namestring[0:namestring.index('#')].split()[0]
            #again alters the sponsor name, assumes that this error doesn't occur for several legislators from the same state
            #ie all legislators with firstname 'J.' from LA are all named 'John'
            if firstname == 'J.' and state == 'LA':
                firstname = 'John'
            if firstname == 'Wright' and state == 'TX':
                firstname = 'John'
            if firstname == 'Sam' and state == 'NC':
                firstname = 'Samuel'
            if firstname == 'Stuart' and state == 'MO':
                firstname = 'William'
            if firstname == 'Bill' and state == 'TN':
                firstname = 'William'
            if firstname == 'J.' and state == 'MD':
                firstname = 'John'
            if firstname == 'Dick' and state == 'IA':
                firstname = 'Richard'
            if firstname == 'Bud' and state == 'PA':
                firstname = 'Elmer'
            if firstname == 'Bart' and state == 'TN':
                firstname = 'Barton'
            if firstname == 'Bill' and state == 'MO':
                firstname = 'Norvell'
            if firstname == 'Estaban' and state == 'CA':
                firstname = 'Esteban'
            if firstname == 'Bob' and state == 'LA':
                firstname = 'Robert'
            if firstname == 'Bob' and state == 'NH':
                firstname = 'Robert'
            if firstname == 'Don' and state == 'PA':
                firstname = 'Donald'
            if firstname == 'Chris' and state == 'UT':
                firstname = 'Christopher'
            if firstname == 'Ernie' and state == 'KY':
                firstname = 'Ernest'
            if firstname == 'Amo' and state == 'NY':
                firstname = 'Amory'
            if firstname == 'Tom' and state == 'VA':
                firstname = 'Thomas'
            if firstname == 'Jo' and state == 'VA':
                firstname = 'Jo Ann'
            if firstname == 'Vic' and state == 'AR':
                firstname = 'Victor'
            #finds the correct legislator using the last name, first name, and state and returns the id
            return data.vote_id[(data.last_name == lastname) & (data.state == state) & (data.first_name == firstname)].iloc[0]
        else:
            return data.vote_id[(data.last_name == lastname) & (data.state == state)].iloc[0]
    else:
        return "unknown"
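
The name corrections above can also be expressed as lookup tables keyed by (name, state), which keeps the parsing logic separate from the list of known errors. The following is a minimal sketch under stated assumptions, not the code used to produce our files: the names `LASTNAME_FIXES`, `FIRSTNAME_FIXES`, `parse_namestring`, and `normalize_name` are hypothetical, only a few of the corrections are copied over, and the sample inputs in the usage note are made up. Unlike the function above, it applies first-name fixes to every namestring rather than only to ambiguous ones, and it raises a `ValueError` if the namestring does not contain exactly two `#` separators.

```python
# Hypothetical tables: the entries are copied from reference_leg_id,
# but the table layout and names are assumptions made for this sketch.
LASTNAME_FIXES = {
    ('Waggoner', 'LA'): 'Waggonner',
    ('Roth Jr.', 'DE'): 'Roth',
    ('Lambert', 'AR'): 'Lincoln',
    ('Jackson-Lee', 'TX'): 'Jackson Lee',
}

FIRSTNAME_FIXES = {
    ('J.', 'LA'): 'John',
    ('Dick', 'IA'): 'Richard',
    ('Estaban', 'CA'): 'Esteban',
}


def parse_namestring(namestring):
    """Split 'firstname#lastname#state' into its three fields."""
    first, last, state = namestring.split('#')
    # Keep only the first whitespace-separated token of the first name,
    # as reference_leg_id does.
    first = first.split()[0]
    return first, last, state


def normalize_name(namestring):
    """Return (first, last, state) with the known corrections applied."""
    first, last, state = parse_namestring(namestring)
    # This fix does not depend on the state in reference_leg_id.
    if last == 'St Germain':
        last = 'St. Germain'
    # Fall back to the unchanged name when no correction is listed.
    last = LASTNAME_FIXES.get((last, state), last)
    first = FIRSTNAME_FIXES.get((first, state), first)
    return first, last, state
```

For example, `normalize_name('Dick#Smith#IA')` returns `('Richard', 'Smith', 'IA')`. Adding a newly discovered misspelling then means adding one table entry instead of another `if` statement.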

In this cell, we take the cleaned .csv files from [clean_csv1.ipynb](https://github.com/dinopants174/DataScienceFinalProject/blob/gh-pages/clean_csv1.ipynb) and apply our reference_leg_id function to the sponsor column, which contained namestrings and now contains IDs from our [legislators.csv](https://github.com/dinopants174/DataScienceFinalProject/blob/gh-pages/data/legislators.csv) file. We then save these .csv files to their final location, the [data/congress_sessions_legislation](https://github.com/dinopants174/DataScienceFinalProject/tree/gh-pages/data/congress_sessions_legislation) folder.
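
Each file is named by its Congress number followed by the chamber, for example 1house.csv. As a small sketch of how the input and output paths could be built in one place, the helper below joins directory and file names with `os.path.join`, so a separator cannot be dropped by accident. The helper name `session_paths` and its parameters are hypothetical; the directory names are the ones used in this notebook.

```python
import os


# Hypothetical helper: directory names come from this notebook,
# the function name and parameters are assumptions for this sketch.
def session_paths(congress, body,
                  in_dir='cleanedcsv',
                  out_dir=os.path.join('data', 'congress_sessions_legislation')):
    """Return (input_path, output_path) for one Congress and chamber."""
    filename = '{0}{1}.csv'.format(congress, body)
    return os.path.join(in_dir, filename), os.path.join(out_dir, filename)


# Paths for the House in the 1st Congress.
src, dst = session_paths(1, 'house')
```

Here `src` points into the cleanedcsv folder and `dst` into the data/congress_sessions_legislation folder, both ending in 1house.csv.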

In [146]:
for i in range(1, 114):
    for body in ['house', 'senate']:
        print(str(i) + body + ' starting')
        path = 'cleanedcsv/' + str(i) + body + '.csv'
        df = pd.read_csv(path)
        df['sponsor'] = df['sponsor'].apply(reference_leg_id)
        #the trailing '/' places each file inside the congress_sessions_legislation folder
        df.to_csv('data/congress_sessions_legislation/' + str(i) + body + '.csv', index=False)
        print(str(i) + body + ' finished')

1house starting
1house finished
1senate starting
1senate finished
2house starting
2house finished
2senate starting
2senate finished
3house starting
3house finished
3senate starting
3senate finished
4house starting
4house finished
4senate starting
4senate finished
5house starting
5house finished
5senate starting
5senate finished
6house starting
6house finished
6senate starting
6senate finished
7house starting
7house finished
7senate starting
7senate finished
8house starting
8house finished
8senate starting
8senate finished
9house starting
9house finished
9senate starting
9senate finished
10house starting
10house finished
10senate starting
10senate finished
11house starting
11house finished
11senate starting
11senate finished
12house starting
12house finished
12senate starting
12senate finished
13house starting
13house finished
13senate starting
13senate finished
14house starting
14house finished
14senate starting
14senate finished
15house starting
15house finished
15senate starting
15se