# Instructions

This is a walkthrough of the cleaning functions in create_staging_tables. These functions clean and reorganize data collected from the API and scraped from the web. The data was initially collected and stored in a PostgreSQL database named "wa_leg_raw". The API and scraping functions can be found in the data_aquisition directory.

There are seven steps to creating the necessary staging tables. Only the following tables need to be saved to the wa_leg_staging database:
* legislator_df
* rep_score_df
* bill_text_df
* merged_final_df
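
As a minimal sketch of the saving step, assuming the tables are written with pandas' `DataFrame.to_sql`: the helper name `save_staging_tables` is hypothetical, and an in-memory SQLite connection stands in for the wa_leg_staging database so the example runs on its own. Against the real database one would presumably pass a SQLAlchemy engine instead, created the same way as the raw-database engine in the cells below.

```python
import sqlite3

import pandas as pd


def save_staging_tables(tables, con):
    """Write each DataFrame in `tables` to a table named after its dict key.

    Hypothetical helper, not part of create_staging_tables.
    """
    for table_name, df in tables.items():
        # if_exists='replace' overwrites a table left over from an earlier run.
        df.to_sql(table_name, con=con, if_exists='replace', index=False)


# Made-up stand-in for legislator_df; the real frame has more columns.
demo_con = sqlite3.connect(':memory:')
demo_tables = {
    'legislator_df': pd.DataFrame({'id': [7, 11],
                                   'party': ['Republican', 'Democrat']}),
}
save_staging_tables(demo_tables, demo_con)

# Read the table back to confirm the round trip.
saved = pd.read_sql_query('select * from "legislator_df"', con=demo_con)
```

Columns that hold Python lists (such as secondary_sponsors) may need to be converted to a storable type first; that step is not shown here.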

In [1]:
import psycopg2
from sqlalchemy import create_engine
import pandas as pd
from create_staging_tables import (
    create_staging_legislator_df_STEP_ONE,
    create_staging_vote_df_STEP_TWO,
    create_staging_bill_df_STEP_THREE,
    create_staging_merged_initial_df_STEP_FOUR,
    create_staging_bill_text_df_STEP_FIVE,
    clean_merged_final_STEP_SEVEN,
    create_rep_score_STEP_EIGHT,
    load_and_clean_party_minority_history_df,
)

In [2]:
engine = create_engine('postgresql://localhost:5432/wa_leg_raw')

In [3]:
raw_vote_df = pd.read_sql_query('select * from "vote_api"', con=engine)
raw_committee_member_df = pd.read_sql_query('select * from "committee_member_api"', con=engine)
missing_leg_info_df = pd.read_csv('../data/missing_legislators.csv', sep = '|')
raw_bill_df = pd.read_sql_query('select * from "bill_api"', con=engine)
raw_sponsor_df = pd.read_sql_query('select * from "sponsor_api"', con=engine)

In [4]:
legislator_df = create_staging_legislator_df_STEP_ONE(raw_vote_df, raw_committee_member_df, missing_leg_info_df)

In [5]:
staging_vote_df = create_staging_vote_df_STEP_TWO(raw_vote_df)

In [6]:
staging_bill_df = create_staging_bill_df_STEP_THREE(raw_bill_df, raw_sponsor_df)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  sponsor_df_reformatted['bill_num'] = sponsor_df_reformatted['bill_id'].apply(lambda x: x.split()[1])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  sponsor_df_reformatted['bill_num_unique'] = sponsor_df_reformatted['biennium'] + ' ' + sponsor_df_reformatted['bill_num']
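
The messages above are pandas' SettingWithCopyWarning, raised when a column is assigned on a DataFrame that pandas thinks may be a slice of another one. A common way to avoid it is to take an explicit `.copy()` before adding columns. The sketch below illustrates that pattern only: the `sponsor_df` data and the biennium filter are made up, and this is not the actual code in create_staging_tables.

```python
import pandas as pd

# Made-up stand-in for the raw sponsor data.
sponsor_df = pd.DataFrame({
    'biennium': ['1991-92', '1991-92', '1993-94'],
    'bill_id': ['HB 1001', 'SHB 1001', 'SB 5000'],
})

# .copy() makes the subset an independent DataFrame, so the column
# assignments below are unambiguous and do not trigger the warning.
sponsor_df_reformatted = sponsor_df[sponsor_df['biennium'] == '1991-92'].copy()

sponsor_df_reformatted['bill_num'] = (
    sponsor_df_reformatted['bill_id'].apply(lambda x: x.split()[1])
)
sponsor_df_reformatted['bill_num_unique'] = (
    sponsor_df_reformatted['biennium'] + ' ' + sponsor_df_reformatted['bill_num']
)
```

The original `sponsor_df` is left untouched, which is usually what a staging step wants.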


In [7]:
staging_bill_df.head()

Unnamed: 0,biennium,bill_id,class,description,htm_create_date,htm_last_modified_date,htm_url,long_friendly_name,name,type,bill_unique,bill_num,bill_num_unique,sponsor_agency,primary_sponsor_id,secondary_sponsors
0,1991-92,HB 1001,Bills,,1991-08-30T00:00:00,2006-07-10T17:13:53.543,http://app.leg.wa.gov/documents/billdocs/1991-...,House Bill 1001,1001,House Bills,1991-92 HB 1001,1001,1991-92 1001,House,251,"[444, 429, 297, 48, 75, 286, 325, 32, 213, 188..."
1,1991-92,SHB 1001,Bills,,1991-02-01T00:00:00,2006-07-10T17:13:54.903,http://app.leg.wa.gov/documents/billdocs/1991-...,Substitute House Bill 1001,1001-S,House Bills,1991-92 SHB 1001,1001,1991-92 1001,House,251,"[444, 429, 297, 48, 75, 286, 325, 32, 213, 188..."
2,1991-92,HB 1002,Bills,,1991-01-14T00:00:00,2006-07-10T17:13:11.637,http://app.leg.wa.gov/documents/billdocs/1991-...,House Bill 1002,1002,House Bills,1991-92 HB 1002,1002,1991-92 1002,House,251,"[474, 207, 219, 180, 23, 227, 394, 484, 110, 4..."
3,1991-92,HB 1003,Bills,,1991-01-14T00:00:00,2006-07-10T17:13:11.747,http://app.leg.wa.gov/documents/billdocs/1991-...,House Bill 1003,1003,House Bills,1991-92 HB 1003,1003,1991-92 1003,House,311,"[54, 110, 474]"
4,1991-92,SHB 1003,Bills,,1991-02-21T00:00:00,2006-07-10T17:14:07.357,http://app.leg.wa.gov/documents/billdocs/1991-...,Substitute House Bill 1003,1003-S,House Bills,1991-92 SHB 1003,1003,1991-92 1003,House,311,"[54, 110, 474]"


In [8]:
merged_initial_df = create_staging_merged_initial_df_STEP_FOUR(staging_vote_df, staging_bill_df, legislator_df)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  unique_vote_dates.drop_duplicates(keep='first', inplace=True)


In [9]:
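def change_agency_to_int(agency):
    """Encode the chamber name as an integer: House -> 0, Senate -> 1."""
    if agency == 'House':
        return 0
    if agency == 'Senate':
        return 1
    # Any other value (including missing data) falls through and returns None.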

In [10]:
merged_initial_df['sponsor_agency'] = merged_initial_df['sponsor_agency'].apply(change_agency_to_int)
legislator_df['agency'] = legislator_df['agency'].apply(change_agency_to_int)
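
One thing to watch with this kind of encoding is that it is not safe to apply twice: once the column holds 0 and 1, neither value matches 'House' or 'Senate', so a second pass turns every row into a missing value. The sketch below shows this with a dictionary-based `Series.map`, which behaves the same way for these inputs. The names `AGENCY_CODES` and `encode_agency` are hypothetical. It also shows why a column with unmatched values ends up as floats (0.0, 1.0) rather than integers.

```python
import pandas as pd

# Hypothetical names; the mapping mirrors change_agency_to_int.
AGENCY_CODES = {'House': 0, 'Senate': 1}


def encode_agency(series):
    """Map chamber names to integer codes; unmatched values become NaN."""
    return series.map(AGENCY_CODES)


agency = pd.Series(['House', 'Senate', None])

# First pass: the strings are encoded; the missing value stays missing,
# which forces the result to a float dtype.
encoded = encode_agency(agency)

# Second pass: 0 and 1 are not keys of AGENCY_CODES, so everything is lost.
encoded_twice = encode_agency(encoded)
```

Running the encoding cell only once per freshly built DataFrame avoids the problem.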

In [19]:
legislator_df.head()

Unnamed: 0,id,agency,district,first_name,party,last_name
0,7,0.0,42,Ann,Republican,Anderson
2,7,1.0,42,Ann,Republican,Anderson
3,11,0.0,46,Marlin,Democrat,Appelwick
7,17,0.0,12,Clyde,Republican,Ballard
13,23,0.0,19,Bob,Democrat,Basich


In [20]:
merged_initial_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3042939 entries, 0 to 3042938
Data columns (total 24 columns):
sequence_number           int64
vote                      int64
vote_date                 datetime64[ns]
voter_id                  int64
voter_name                object
voting_agency             int64
bill_unique               object
year                      int64
unique_id                 float64
biennium                  object
bill_id                   object
class                     object
description               object
htm_create_date           datetime64[ns]
htm_last_modified_date    datetime64[ns]
htm_url                   object
long_friendly_name        object
name                      object
type                      object
bill_num                  object
bill_num_unique           object
sponsor_agency            float64
primary_sponsor_id        object
secondary_sponsors        object
dtypes: datetime64[ns](3), float64(2), int64(5), object(14)
memory usage: 

In [21]:
merged_initial_df = merged_initial_df.merge(legislator_df,
                                            how='left',
                                            left_on=['voter_id', 'voting_agency'],
                                            right_on=['id', 'agency'])
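
The merge uses two keys because a legislator id can appear once per chamber (id 7 is listed for both agency 0 and agency 1 in legislator_df above). A quick way to check how many votes failed to find a legislator is the `indicator=True` option of `DataFrame.merge`. The sketch below uses small made-up frames in place of the real tables.

```python
import pandas as pd

# Made-up stand-ins for the vote and legislator tables.
votes = pd.DataFrame({
    'voter_id': [7, 7, 99],
    'voting_agency': [0, 1, 0],
    'vote': [1, 0, 1],
})
legislators = pd.DataFrame({
    'id': [7, 7],
    'agency': [0, 1],
    'party': ['Republican', 'Republican'],
})

# indicator=True adds a '_merge' column saying where each row came from.
merged = votes.merge(legislators,
                     how='left',
                     left_on=['voter_id', 'voting_agency'],
                     right_on=['id', 'agency'],
                     indicator=True)

# Rows marked 'left_only' are votes with no matching legislator record.
unmatched = merged[merged['_merge'] == 'left_only']
```

With a left join every vote row is kept, so unmatched votes show up with missing legislator columns rather than disappearing.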

In [23]:
def change_party_word_to_int(party):
    """Encode the party name as an integer: Democrat -> 0, Republican -> 1."""
    if party == 'Democrat':
        return 0
    if party == 'Republican':
        return 1

In [24]:
merged_initial_df = merged_initial_df.drop(['sequence_number', 'type',
                                            'voter_name', 'htm_last_modified_date', 'description',
                                            'bill_num_unique', 'bill_num', 'class'], axis=1)
merged_initial_df['party'] = merged_initial_df['party'].apply(change_party_word_to_int)

KeyError: "labels ['sequence_number' 'type' 'voter_name' 'htm_last_modified_date'\n 'description' 'bill_num_unique' 'bill_num' 'class'] not contained in axis"
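
The KeyError above lists every one of the eight columns as missing, even though they all appear in the merged_initial_df.info() output earlier. The most likely explanation is that the cell was run a second time, after the first run had already dropped them. Below is a sketch of a version that is safe to re-run, assuming a reasonably recent pandas where `DataFrame.drop` accepts `columns=` and `errors='ignore'`. The names `clean_merged`, `PARTY_CODES` and `COLUMNS_TO_DROP` are hypothetical, and the demo frame is made up.

```python
import pandas as pd

# Hypothetical names; the mapping mirrors change_party_word_to_int.
PARTY_CODES = {'Democrat': 0, 'Republican': 1}
COLUMNS_TO_DROP = ['sequence_number', 'type', 'voter_name',
                   'htm_last_modified_date', 'description',
                   'bill_num_unique', 'bill_num', 'class']


def clean_merged(df):
    """Drop unused columns and encode party; safe to call more than once."""
    # errors='ignore' skips columns that are already gone.
    cleaned = df.drop(columns=COLUMNS_TO_DROP, errors='ignore')
    # Only encode while party names are still present, so a second call
    # does not overwrite the integer codes with missing values.
    if cleaned['party'].isin(list(PARTY_CODES)).any():
        cleaned['party'] = cleaned['party'].map(PARTY_CODES)
    return cleaned


# Made-up frame with a few of the real column names.
demo = pd.DataFrame({
    'vote': [1, 0],
    'party': ['Democrat', 'Republican'],
    'class': ['Bills', 'Bills'],
    'type': ['House Bills', 'House Bills'],
})
once = clean_merged(demo)
twice = clean_merged(once)  # second call leaves the result unchanged
```

The guard on the party column matters: without it, ignoring the missing columns would let a second run silently wipe out the party codes.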