# WIM Python Workshop: Introduction to Python
## Part II: Organizing Data
* Date: Oct 20th, 2023
* Instructors: Eehyun Kim (eehkim@iu.edu), Anne Kavalerchik (akavaler@iu.edu)

First, we load the necessary packages.

In [None]:
# load the necessary packages
import os
import json
import pandas as pd  # data analysis package
import pickle  # Python specific module to serialize data

Ordinarily, you might have to install packages like `pandas` first, but this environment comes with it preinstalled, so it only needs to be imported.

To install `pandas` (or other packages) on your local machine:

* Open a terminal (macOS/Linux) or Command Prompt (Windows)
* Type `pip3 install pandas` and press Enter

## Load the data

In Python, we can work with multiple datasets at the same time.

In [None]:
# load the dataframe of members of the House of Representatives
# we will use pandas read_csv

congress_df = pd.read_csv('congress_house.csv')
print(congress_df.shape)  # tells you the dimensions
congress_df  # prints out a subset of the dataframe

In [None]:
# load the bills "pickle" file
# pickle is a Python module for storing Python objects in a file
# we will open it with the following command.
# "rb" tells Python we are opening it for reading only, in binary format
with open('ProPublica_Members-Bills.pkl', "rb") as file:
    bills_1000 = pickle.load(file)
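
To see what `pickle.load` is undoing, here is a minimal round-trip sketch. It does not use the workshop file: `toy_bills`, its contents, and the file name `toy_bills.pkl` are made up for illustration, and the `defaultdict(dict)` type is only an assumption about how such a file might have been built.

```python
import os
import pickle
import tempfile
from collections import defaultdict

# Hypothetical stand-in for the bills data: a defaultdict keyed by integers
toy_bills = defaultdict(dict)
toy_bills[0] = {'status': 'OK', 'results': [{'bill_id': 'hr1-118'}]}

# Work in a temporary directory so nothing in the working directory is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_bills.pkl')

    with open(path, 'wb') as file:  # "wb" = write, binary
        pickle.dump(toy_bills, file)

    with open(path, 'rb') as file:  # "rb" = read, binary
        restored = pickle.load(file)

print(type(restored))
print(restored == toy_bills)
```

The object comes back with the same type and contents it was saved with. Because unpickling can run arbitrary code, only load pickle files from sources you trust.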

## Explore the congress dataset using pandas

`pandas` is a powerful Python data analysis package.

We will only use a few of its features today.


In [None]:
# using pandas methods
print(congress_df.info())  # overview of the dataframe, missingness, and types of variables
print(congress_df.describe())  # describes basic summary statistics for each column

print(congress_df.head(3))  # returns first 3 rows
print(congress_df.tail(3))  # returns last 3 rows

print(congress_df['id'])  # returns "id" column as a pandas series
print(congress_df[['id']])  # returns "id" column as a pandas dataframe

print(congress_df.iloc[37:53])  # returns the rows at positions 37-52 (the end of the slice, 53, is excluded)
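
The slicing and bracket rules are easier to check on a small table. The sketch below uses a made-up five-row dataframe (`toy_df` is not part of the workshop data) to show that `iloc` slices leave out the end position, and that single and double brackets return different types.

```python
import pandas as pd

# Made-up toy data, not the workshop's congress_house.csv
toy_df = pd.DataFrame({'id': ['A1', 'B2', 'C3', 'D4', 'E5'],
                       'party': ['D', 'R', 'D', 'R', 'D']})

subset = toy_df.iloc[1:3]  # positions 1 and 2; position 3 is excluded
print(subset)

print(type(toy_df['id']))    # single brackets give a Series
print(type(toy_df[['id']]))  # double brackets give a DataFrame
```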

## Explore the bills dataset using basics from Part 1 to call values

`bills_1000` contains the 1000 most recent health bills, and it is deeply nested.

In [None]:
# explore bills dataset
# bills_1000 is a dictionary of 50 dictionaries, keyed by the integers 0-49
print(len(bills_1000))  # Tells us the length - 50

print(type(bills_1000))  # defaultdict - similar to a dictionary
print(bills_1000.keys())  # Prints the keys of the dictionary
print(type(bills_1000[0]))  # dict
print(bills_1000[0])  # Prints the value stored under key 0 of bills_1000
print(bills_1000[0].keys())  # Prints the keys of that inner dictionary


It may be helpful to look at an entry visually to understand the nestedness.

In [None]:
bills_1000[0]

So, `bills_1000[0]` is a dictionary. The value of its `results` key is a list with one dictionary per bill, and each of those dictionaries holds key-value pairs that describe the bill.
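
The layers can be peeled off one at a time. The miniature below has the same shape as the structure just described, but every value in it (ids, names) is invented for illustration.

```python
# A made-up miniature with the same shape as bills_1000:
# outer dictionary -> 'results' list -> one dictionary per bill
toy_bills = {
    0: {
        'status': 'OK',
        'results': [
            {'bill_id': 'hr1-118', 'sponsor_id': 'A000001', 'sponsor_name': 'Ada Example'},
            {'bill_id': 's2-118', 'sponsor_id': 'B000002', 'sponsor_name': 'Ben Sample'},
        ],
    },
}

entry = toy_bills[0]                # step 1: pick an entry by its integer key
results = entry['results']          # step 2: pull out the list of bills
second_bill = results[1]            # step 3: pick a bill by position
print(second_bill['sponsor_name'])  # step 4: read one field

# The same lookup written as a single chain
print(toy_bills[0]['results'][1]['sponsor_name'])
```

Reading a chained lookup left to right, each pair of brackets goes one level deeper: dictionary key, then dictionary key, then list position, then dictionary key.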

Let's practice calling specific values from this dictionary, which holds the first 20 bills.

In [None]:
# let's practice calling specific values of these first 20 bills
# print out the values 
print(bills_1000[0]['status'])
print(bills_1000[0]['copyright'])
print(bills_1000[0]['offset'])
print(bills_1000[0]['subject'])
print(bills_1000[0]['results'])

In [None]:
print(len(bills_1000[0]['results']))  # length is 20 - 20 bills
print(bills_1000[0]['results'][0])  # the first bill
print(type(bills_1000[0]['results'][0]))  # dictionary

In [None]:
print(bills_1000[0]['results'][0]['bill_id'])  # bill id
print(bills_1000[0]['results'][0]['title'])  # title
print(bills_1000[0]['results'][0]['sponsor_id'])  # sponsor id

In [None]:
print(bills_1000[43]['results'][15]['bill_id'])
print(bills_1000[42]['results'][9]['sponsor_name'])

Note: What happens if we call a value that does not exist?

`KeyError` - when a dictionary key does not exist

`IndexError` - when a sequence subscript is out of range

In [None]:
print(bills_1000[50]['results'][0]['sponsor_name'])  # KeyError (the keys only run from 0 to 49)

In [None]:
print(bills_1000[49]['results'][20]['sponsor_name'])  # IndexError (positions only run from 0 to 19)
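
One caveat about the `KeyError` lookup: `type(bills_1000)` reported a `defaultdict`, and a `defaultdict` that has a default factory creates a missing key the moment it is looked up. Whether that applies to the workshop file depends on how it was built, which is not shown here, so the sketch below uses a made-up `defaultdict(dict)` purely as an assumption.

```python
from collections import defaultdict

toy = defaultdict(dict)   # assumed default factory; the real file's may differ
toy[0] = {'results': []}

print(len(toy))           # 1
try:
    toy[50]['results']    # toy[50] is created as {} before 'results' is looked up
except KeyError as err:
    print('KeyError:', err)
print(len(toy))           # 2: key 50 was inserted by the failed lookup

# Checking membership first never inserts anything
print(99 in toy)          # False
print(len(toy))           # still 2
```

If the real data behaves this way, `len(bills_1000)` grows to 51 after the failed lookup and later loops over `bills_1000` would reach the empty entry. Reloading the pickle file restores the original 50 entries.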

You can handle both errors with a `try`/`except` block:

In [None]:
try:
    print(bills_1000[50]['results'][0]['sponsor_name'])
except KeyError:
    print('KeyError')

try:
    print(bills_1000[49]['results'][20]['sponsor_name'])
except IndexError:
    print('IndexError')
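
When the same guarded lookup is needed in several places, one option is to wrap it in a small function. This is only a sketch: `get_sponsor` and `toy_bills` are hypothetical names and data, and returning `None` for a missing value is a design choice rather than anything the workshop prescribes.

```python
# Made-up miniature of the nested bills structure
toy_bills = {
    0: {'results': [{'sponsor_name': 'Ada Example'}]},
}


def get_sponsor(bills, entry_key, position):
    """Return the sponsor name, or None if the key or position is missing."""
    try:
        return bills[entry_key]['results'][position]['sponsor_name']
    except (KeyError, IndexError):  # catch both errors in one clause
        return None


print(get_sponsor(toy_bills, 0, 0))  # present
print(get_sponsor(toy_bills, 5, 0))  # no entry 5, KeyError is caught
print(get_sponsor(toy_bills, 0, 9))  # no position 9, IndexError is caught
```

Listing only `KeyError` and `IndexError` keeps other, unexpected errors visible instead of hiding them.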


Use nested `for` loops to navigate `bills_1000`.

`for` loops are not always the most elegant solution, but they are very useful.

In [None]:
# experiment with for loops here
all_pols = []  # initializing an empty list
for entry in bills_1000:  # Loop through each of the 50 dictionaries
    results = bills_1000[entry]['results']

    # Then loop through each of the 20 dictionaries that belong to that entry
    for result in results:
        all_pols.append(result['sponsor_name'])  # appending the sponsor name on each bill to a list

print(len(all_pols))  # 1000
# print(all_pols)

In [None]:
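# Indentation matters: here the second loop is NOT nested inside the first,
# so `results` is overwritten on every pass and only the last entry survives
for entry in bills_1000:
    results = bills_1000[entry]['results']

last_pols = []  # separate name so the full `all_pols` list above is kept
for result in results:
    last_pols.append(result['sponsor_name'])
last_pols  # only 20 names, not 1000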

Note: Sometimes your code can get _too_ nested, which hurts readability.

There are several ways around this: list/dict comprehensions, functions, and more.

List comprehension:

In [None]:
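all_pols_1 = [bills_1000[entry]['results'] for entry in bills_1000]  # 50 lists of bills
# flatten by looping over each list directly instead of hard-coding range(0, 20)
all_pols_2 = [result['sponsor_name'] for results in all_pols_1 for result in results]
all_pols_2 == all_pols  # True, provided all_pols still holds all 1000 names from the nested loop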
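
The other two options mentioned in the note, functions and dict comprehensions, can be sketched in the same spirit. The helper `iter_bills` and the data in `toy_bills` are hypothetical, chosen only to mirror the shape of `bills_1000`.

```python
# Made-up miniature of the nested bills structure
toy_bills = {
    0: {'results': [{'sponsor_id': 'A000001', 'sponsor_name': 'Ada Example'},
                    {'sponsor_id': 'B000002', 'sponsor_name': 'Ben Sample'}]},
    1: {'results': [{'sponsor_id': 'A000001', 'sponsor_name': 'Ada Example'}]},
}


def iter_bills(bills):
    """Yield every bill dictionary, hiding the two levels of nesting."""
    for entry in bills.values():
        for bill in entry['results']:
            yield bill


# List comprehension: one sponsor name per bill
names = [bill['sponsor_name'] for bill in iter_bills(toy_bills)]
print(names)

# Dict comprehension: map each sponsor id to the sponsor's name
id_to_name = {bill['sponsor_id']: bill['sponsor_name'] for bill in iter_bills(toy_bills)}
print(id_to_name)
```

Once the nesting lives inside one function, every later step can loop over bills as if they were a flat list.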

## __Count__ the number of bills sponsored by each politician

What we really want is to combine `bills_1000` with `congress_df` in some useful way.

`id` in `congress_df` is linked to `sponsor_id` in `bills_1000`.

We will do this with nested loops again!

In [None]:
# initialize a dictionary with every key a congressperson's id,
# and every value = 0
num_bills = {}
for i in congress_df['id']:
    num_bills[i] = 0

# loop through the data and count every bill
for entry in bills_1000:
    results = bills_1000[entry]['results']
    for result in results:
        num_bills[result['sponsor_id']] += 1

The loop above raises a `KeyError` for bills sponsored by senators, whose ids are not in `congress_df`. Adding a `try`/`except` clause accounts for them:

In [None]:
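# add a try...except to account for senators!
senators = {}  # sponsor ids missing from congress_df, mapped to sponsor names
num_bills = {}  # House member id -> number of bills sponsored
for i in congress_df['id']:
    num_bills[i] = 0

for entry in bills_1000:
    results = bills_1000[entry]['results']

    for result in results:
        try:
            num_bills[result['sponsor_id']] += 1
        except KeyError:
            # sponsor is not in congress_df (a senator), so record the name instead
            senators[result['sponsor_id']] = result['sponsor_name']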
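
A different technique that reaches the same counts is `collections.Counter` from the standard library. This is an alternative sketch, not the workshop's method, and the ids in `house_ids` and `sponsor_ids` are made up.

```python
from collections import Counter

# Made-up stand-ins: ids of House members and sponsor ids found on bills
house_ids = ['A000001', 'B000002', 'C000003']
sponsor_ids = ['A000001', 'S000009', 'A000001', 'B000002']

counts = Counter(sponsor_ids)  # tallies every sponsor, House member or not

# Every House member gets an entry; a Counter returns 0 for ids it never saw
num_bills = {member_id: counts[member_id] for member_id in house_ids}

# Sponsors whose id is not in the House list (senators, in the workshop data)
others = {sid: n for sid, n in counts.items() if sid not in house_ids}

print(num_bills)
print(others)
```

With the real data, `sponsor_ids` would be collected by looping over `bills_1000` as in the cells above.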

## Use pandas to combine the two data sets

Remember, `id` in the congress data corresponds to `sponsor_id` in the bills data.

In [None]:
num_bills_df = pd.DataFrame(num_bills.items(),
                            columns=['sponsor_id', 'num_bills'])

You can also initialize a dataframe this way:

In [None]:
num_bills_df = pd.DataFrame.from_dict(num_bills, orient='index').reset_index()
num_bills_df.columns = ['sponsor_id', 'num_bills']
num_bills_df

Pull out the columns we're interested in:

In [None]:
print(congress_df.columns)  # list every available column name
cols = ['id', 'short_title', 'first_name', 'last_name', 'party', 'gender']

In [None]:
final_df = pd.merge(congress_df[cols], num_bills_df, left_on='id',
                    right_on='sponsor_id')
final_df
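
`pd.merge` performs an inner join unless told otherwise, so rows without a match on both sides are dropped. The sketch below shows the difference on two made-up tables (`members` and `counts` are stand-ins, not the workshop data).

```python
import pandas as pd

# Made-up toy tables standing in for congress_df and num_bills_df
members = pd.DataFrame({'id': ['A1', 'B2', 'C3'],
                        'party': ['D', 'R', 'D']})
counts = pd.DataFrame({'sponsor_id': ['A1', 'B2', 'Z9'],
                       'num_bills': [2, 1, 4]})

# Default how='inner': only ids present in both tables survive
inner = pd.merge(members, counts, left_on='id', right_on='sponsor_id')
print(inner)

# how='left' keeps every member; indicator=True adds a '_merge' column
left = pd.merge(members, counts, left_on='id', right_on='sponsor_id',
                how='left', indicator=True)
print(left)
```

In the workshop itself, `num_bills` was initialized with every `id` in `congress_df`, so the default inner join should not drop any members. The `_merge` column is a quick way to confirm that on other data.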

In [None]:
print('Republicans:', final_df[final_df.party=='R'].num_bills.sum())
print('Democrats:', final_df[final_df.party=='D'].num_bills.sum())
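
Filtering once per party works for two parties. As an alternative sketch, `groupby` computes the total for every party in one step. The table `toy_final` is made up and only mimics the two columns used above.

```python
import pandas as pd

# Made-up toy table with the same two columns used above
toy_final = pd.DataFrame({'party': ['D', 'R', 'D', 'R', 'I'],
                          'num_bills': [3, 1, 2, 4, 1]})

# One row of output per party, instead of one filter per party
totals = toy_final.groupby('party')['num_bills'].sum()
print(totals)
```

On the workshop data, the equivalent call would be `final_df.groupby('party')['num_bills'].sum()`.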

In [None]:
final_df.sort_values('num_bills', ascending=False)

In [None]:
# export into a csv file
final_df.to_csv('reps_with_num_bills.csv', index=False)