# INFO 3401 – Module Assignment 1.1

[Brian C. Keegan, Ph.D.](http://brianckeegan.com/)  
[Assistant Professor, Department of Information Science](https://www.colorado.edu/cmci/people/information-science/brian-c-keegan)  
University of Colorado Boulder  

Copyright and distributed under an [MIT License](https://opensource.org/licenses/MIT).

## Learning Objectives
This is one of two required sub-assignments for Module Assignment 1. In this assignment we want to evaluate how well you can complete a data analysis on your own. This notebook is due before midnight (11:59pm) on Friday, September 11 by submission to Canvas.

## Background

Representatives elected to the U.S. House are given budgets to hire staff and to run their offices in Washington, D.C. as well as in their home district. Senior leadership offices and committees also receive budgets to hire their own staff and conduct their business. These funds cannot be used for personal or campaign expenses and there is no reserve fund if members run over budget. These budgets vary across members and committees and depend on a variety of factors. Detailed records of these disbursements are [published quarterly as CSV files](https://www.house.gov/the-house-explained/open-government/statement-of-disbursements/archive) going back to 2010.

Make sure to also check out the [Details](https://www.house.gov/the-house-explained/open-government/statement-of-disbursements/details), [FAQ](https://www.house.gov/the-house-explained/open-government/statement-of-disbursements/frequently-asked-questions), [Glossary](https://www.house.gov/the-house-explained/open-government/statement-of-disbursements/glossary-of-terms), and [Transaction Codes](https://www.house.gov/the-house-explained/open-government/statement-of-disbursements/transaction-codes). Disappointingly, as both an information scientist and a citizen, the Senate does not publish analogous kinds of office disbursement data.

For more optional background, also check out the [cleaned data](https://projects.propublica.org/represent/expenditures), [blog posts](https://www.propublica.org/article/update-on-house-disbursements-a-few-notes-on-how-to-use-the-data), [training resources](https://www.propublica.org/documents/item/3230540-75012825-House-Disbursements-Reports-Training.html), and stories about [budget reductions](https://www.propublica.org/article/house-operating-budget-cuts-paving-way-for-more-special-interest-influence) and [staff turnover](https://www.propublica.org/article/turnover-in-the-house-who-keeps-and-who-loses-the-most-staff) that ProPublica/Sunlight Foundation have published about these data.

The notebook follows Chapter 4 in Peng and Matsui *[The Art of Data Science](https://canvas.colorado.edu/courses/62560/files/19782496)* (available under Week 01 on Canvas) if you want more background.

## Step 0: Load libraries

Load the libraries for pandas, numpy, pyplot, and seaborn. Include the matplotlib [cell magic](https://ipython.readthedocs.io/en/stable/interactive/plotting.html#id1). Basically, copy the first cell from previous notebooks!

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Step 1: Formulate a question

Here's the question we'll all explore.

## Step 2: Read in the data

Load the 2020 Q1 "Detail" data from the official House site and assign it to `h20q1_df`. Reading the file from disk and downloading it directly from the web are both acceptable.

Include two other parameters in your `read_csv` function:
* **encoding** You may get a character encoding error. If so, pass 'latin1' to the "encoding" parameter.
* **parse_dates** Pass "TRANSACTION DATE", "PERFORM START DT", and "PERFORM END DT" as a list to the "parse_dates" parameter.

In [4]:
h20q1_df = pd.read_csv(
    'JAN-MAR-2020-SOD-DETAIL-GRID_FINAL.csv',
    encoding='latin1',
    parse_dates=['TRANSACTION DATE', 'PERFORM START DT', 'PERFORM END DT']
)
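
To see what the two extra parameters do without needing the full file, here is a minimal sketch on a made-up three-row sample. The names `sample_csv`, `date_cols`, and `sample_df`, and the vendor "CAFÉ EXAMPLE", are hypothetical and exist only for illustration.

```python
import io
import pandas as pd

# Hypothetical sample in a layout similar to the House file; the accented
# vendor name is made up to show why the encoding parameter matters.
sample_csv = (
    "ORGANIZATION,TRANSACTION DATE,PERFORM START DT,PERFORM END DT,VENDOR NAME,AMOUNT\n"
    "OFFICE A,23-Mar-20,3-Jan-20,31-Jan-20,CAFÉ EXAMPLE,66.49\n"
    "OFFICE A,31-Mar-20,1-Feb-20,29-Feb-20,UNITED STATES POSTAL SERVICE,60.67\n"
    "OFFICE A,,,,,127.16\n"
).encode("latin1")

date_cols = ["TRANSACTION DATE", "PERFORM START DT", "PERFORM END DT"]

# encoding tells pandas how to decode the bytes; parse_dates converts the
# listed columns from strings to datetimes (blank cells become NaT).
sample_df = pd.read_csv(io.BytesIO(sample_csv), encoding="latin1", parse_dates=date_cols)

# Inspect the resulting column types
print(sample_df.dtypes)
```

Checking `.dtypes` on the real `h20q1_df` in the same way is a quick test of whether the date columns were actually parsed.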

## Step 3: Check the packaging

How many rows and columns are in `h20q1_df`? Print the number of rows and the number of columns as integers.

In [5]:
h20q1_df.shape

(132754, 12)
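
One way to get the two counts as separate integers is to unpack the shape tuple. This is a minimal sketch on a made-up DataFrame; `toy_df`, `n_rows`, and `n_cols` are hypothetical names standing in for the real data.

```python
import pandas as pd

# Toy DataFrame standing in for h20q1_df (values are made up)
toy_df = pd.DataFrame({"ORGANIZATION": ["A", "B", "C"], "AMOUNT": [1.0, 2.5, 4.0]})

# .shape is a (rows, columns) tuple, so it unpacks into two integers
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)

# len() on the DataFrame and on its columns gives the same two numbers
print(len(toy_df), len(toy_df.columns))
```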

## Step 4: Look at the top and bottom of the data

Inspect the first 10 and last 10 rows of data.

In [6]:
h20q1_df.head()

Unnamed: 0,ORGANIZATION,PROGRAM,SORT SUBTOTAL DESCRIPTION,SORT SEQUENCE,TRANSACTION DATE,DATA SOURCE,DOCUMENT,VENDOR NAME,PERFORM START DT,PERFORM END DT,DESCRIPTION,AMOUNT
0,2020 OFFICE OF THE SPEAKER,GENERAL EXPENDITURES,FRANKED MAIL,DETAIL,23-Mar-20,AP,1265156.0,UNITED STATES POSTAL SERVICE,3-Jan-20,31-Jan-20,FRANKED MAIL,66.49
1,2020 OFFICE OF THE SPEAKER,GENERAL EXPENDITURES,FRANKED MAIL,DETAIL,31-Mar-20,AP,1275764.0,UNITED STATES POSTAL SERVICE,1-Feb-20,29-Feb-20,FRANKED MAIL,60.67
2,2020 OFFICE OF THE SPEAKER,GENERAL EXPENDITURES,FRANKED MAIL,SUBTOTAL,,,,,,,FRANKED MAIL TOTALS:,127.16
3,2020 OFFICE OF THE SPEAKER,GENERAL EXPENDITURES,PERSONNEL COMPENSATION,DETAIL,,,,BERRET EMILY C,3-Jan-20,31-Mar-20,DIR OF OPERATIONS & ADVISOR,31777.77
4,2020 OFFICE OF THE SPEAKER,GENERAL EXPENDITURES,PERSONNEL COMPENSATION,DETAIL,,,,BUSH JACQUELINE D,3-Jan-20,31-Mar-20,DIGITAL ASSISTANT,7944.43


In [7]:
h20q1_df.tail()

Unnamed: 0,ORGANIZATION,PROGRAM,SORT SUBTOTAL DESCRIPTION,SORT SEQUENCE,TRANSACTION DATE,DATA SOURCE,DOCUMENT,VENDOR NAME,PERFORM START DT,PERFORM END DT,DESCRIPTION,AMOUNT
132749,FISCAL YEAR 2019 PAGING,PAGING,EQUIPMENT,DETAIL,19-Mar-20,AP,1265498.0,BEARCOM,1-Feb-20,29-Feb-20,WARRANTIES,6405.41
132750,FISCAL YEAR 2019 PAGING,PAGING,EQUIPMENT,DETAIL,26-Mar-20,AP,1276203.0,BEARCOM,1-Mar-20,31-Mar-20,WARRANTIES,6405.41
132751,FISCAL YEAR 2019 PAGING,PAGING,EQUIPMENT,SUBTOTAL,,,,,,,EQUIPMENT TOTALS:,25621.64
132752,FISCAL YEAR 2019 PAGING,PAGING,EQUIPMENT,SUBTOTAL,,,,,,,PAGING TOTALS:,25621.64
132753,FISCAL YEAR 2019 PAGING,PAGING,EQUIPMENT,GRAND TOTAL FOR ORGANIZATION,,,,,,,OFFICE TOTALS:,25621.64
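
Note that `head()` and `tail()` return 5 rows by default, while the prompt asks for 10. A minimal sketch of passing the row count explicitly, using a made-up 25-row DataFrame (`toy_df` is a hypothetical stand-in for `h20q1_df`):

```python
import pandas as pd

# Toy DataFrame with 25 rows standing in for h20q1_df
toy_df = pd.DataFrame({"ROW": range(25)})

# Pass the number of rows wanted; the default for both methods is 5
first_ten = toy_df.head(10)
last_ten = toy_df.tail(10)
print(len(first_ten), len(last_ten))
```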


In [10]:
# Give the last column the clean name 'AMOUNT'
h20q1_df.rename(columns={h20q1_df.columns[-1]:'AMOUNT'},inplace=True)

## Step 5: Check the "n"s

Check the kinds of values that are possible for each of the "ORGANIZATION", "PROGRAM", "SORT SUBTOTAL DESCRIPTION", and "SORT SEQUENCE" columns.

Use Boolean indexing to remove the rows of data corresponding to "SUBTOTAL" and "GRAND TOTAL FOR ORGANIZATION" under "SORT SEQUENCE" since these are duplicates/aggregations. Check to make sure that only "DETAIL" remains afterwards.

In [11]:
only_detail = h20q1_df['SORT SEQUENCE'] == 'DETAIL'
only_detail_df = h20q1_df[only_detail]
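
The inspection and the final check can be sketched with `value_counts()` and `unique()`. The example below uses a made-up four-row DataFrame; `toy_df`, `is_aggregate`, and `toy_detail_df` are hypothetical names, and excluding the two aggregate labels is only one possible alternative to selecting "DETAIL" directly.

```python
import pandas as pd

# Toy rows mimicking the three "SORT SEQUENCE" values that appear in the data
toy_df = pd.DataFrame({
    "SORT SEQUENCE": ["DETAIL", "DETAIL", "SUBTOTAL", "GRAND TOTAL FOR ORGANIZATION"],
    "AMOUNT": [66.49, 60.67, 127.16, 127.16],
})

# Count how often each value occurs before filtering
print(toy_df["SORT SEQUENCE"].value_counts())

# Boolean mask marking the aggregate rows, then keep everything else
is_aggregate = toy_df["SORT SEQUENCE"].isin(["SUBTOTAL", "GRAND TOTAL FOR ORGANIZATION"])
toy_detail_df = toy_df[~is_aggregate]

# Confirm that only "DETAIL" remains
print(toy_detail_df["SORT SEQUENCE"].unique())
```

The same `value_counts()` call works for "ORGANIZATION", "PROGRAM", and "SORT SUBTOTAL DESCRIPTION".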

## Step 6: Validate the data against another source

"TRAVEL" makes up one of the most frequent expenses under "SORT SUBTOTAL DESCRIPTION". Use Boolean indexing to create a new DataFrame called `travel_df` that only contains "TRAVEL" from "SORT SUBTOTAL DESCRIPTION". What is the average value of "AMOUNT" in `travel_df`? Does this seem reasonable?

In [15]:
only_travel = only_detail_df['SORT SUBTOTAL DESCRIPTION'] == "TRAVEL"
travel_df = only_detail_df[only_travel]

In [17]:
travel_df['AMOUNT'].mean()

204.4647220559862
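
A single mean can hide a lot, so it may help to compare it with the median and the extremes when judging whether it is reasonable. This is a minimal sketch on made-up amounts; `toy_amounts`, `toy_mean`, and `toy_median` are hypothetical and are not taken from the House data.

```python
import pandas as pd

# Made-up travel amounts, including a refund (negative) and one large charge
toy_amounts = pd.Series([12.50, 48.00, -100.00, 367.30, 2824.97], name="AMOUNT")

# describe() reports count, mean, std, min, quartiles, and max in one call
print(toy_amounts.describe())

# A mean well above the median suggests a few large values are pulling it up
toy_mean = toy_amounts.mean()
toy_median = toy_amounts.median()
print(toy_mean, toy_median)
```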

## Step 7: Make a plot

Make a [histogram](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#histograms) of the `travel_df` "AMOUNT"s. Describe some interesting features about the distribution of travel expenditures.
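
A minimal sketch of one way to draw the histogram with the pandas plotting interface. The DataFrame `toy_travel_df` and its values are made up, the choice of 20 bins is an arbitrary assumption, and the `matplotlib.use("Agg")` line is only there so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up amounts standing in for travel_df["AMOUNT"]
toy_travel_df = pd.DataFrame({
    "AMOUNT": [12.5, 48.0, 55.2, 60.0, 75.9, 110.0, 204.4, 367.3, 900.0, 2824.97]
})

# Series.plot.hist returns a matplotlib Axes that can be labeled afterwards
ax = toy_travel_df["AMOUNT"].plot.hist(bins=20)
ax.set_xlabel("AMOUNT (dollars)")
ax.set_ylabel("Number of transactions")
plt.savefig("toy_travel_hist.png")
```

If most values crowd into the first few bins, trying more bins or a logarithmic y-axis (`logy=True`) can make the long tail easier to see.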

## Step 8: Try an easy solution

Make a pivot table from `travel_df` with "ORGANIZATION" as the index, "DESCRIPTION" as the columns, "AMOUNT" as the values, and 'sum' as the aggfunc.

In [18]:
pd.pivot_table(travel_df,
               index='ORGANIZATION',
               columns='DESCRIPTION',
               values='AMOUNT',
               aggfunc='sum'
              )

ORGANIZATION,AUTOMOBILE LEASE,CAR RENTAL,CAUCUS TRAVEL,COMMERCIAL TRANSPORTATION,CONSULT TRAVEL / RELATED EXP,GASOLINE,LODGING,MEALS,MISCELLANEOUS TRAVEL,PRIVATE AUTO MILEAGE,TAXI/PARKING/TOLLS,WITNESS TRAVEL / RELATED EXP
2015 HON. DENNY HECK,,,,,,,,,,1134.48,,
2016 HON. DENNY HECK,,,,,,,,,,720.90,,
2016 HON. PAUL A. GOSAR,,,,,,,367.30,,,,,
2017 HON. DENNY HECK,,,,,,,,,,2824.97,,
"2017 HON. EARL L. ""BUDDY"" CARTER",,,,,,,,,,-100.00,,
...,...,...,...,...,...,...,...,...,...,...,...,...
FISCAL YEAR 2020 OFFICE OF ATTENDING PHYSICIAN,,,,863.31,,,1184.17,532.00,,40.14,113.43,
FISCAL YEAR 2020 OFFICE OF CONGRESSIONAL ETHICS,,,,2559.40,,,2218.99,1218.96,9752.81,,581.99,
FISCAL YEAR 2020 OFFICE OF GENERAL COUNSEL,,,,1134.69,,,802.70,161.45,,6.90,339.37,
FISCAL YEAR 2020 OFFICE OF INSPECTOR GENERAL,,,,,,,,,,,14.55,


Sort the resulting table in descending order by different columns like "COMMERCIAL TRANSPORTATION", "MEALS", "PRIVATE AUTO MILEAGE".
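
Sorting a pivot table by one of its columns can be sketched with `sort_values`. The example uses a made-up five-row DataFrame; `toy_travel_df`, `toy_pivot`, and `toy_sorted` are hypothetical names standing in for the real objects.

```python
import pandas as pd

# Made-up travel rows for three offices
toy_travel_df = pd.DataFrame({
    "ORGANIZATION": ["OFFICE A", "OFFICE A", "OFFICE B", "OFFICE B", "OFFICE C"],
    "DESCRIPTION": ["MEALS", "LODGING", "MEALS", "MEALS", "LODGING"],
    "AMOUNT": [100.0, 250.0, 300.0, 50.0, 75.0],
})

# Same pivot structure as the assignment: offices as rows, descriptions as columns
toy_pivot = pd.pivot_table(
    toy_travel_df,
    index="ORGANIZATION",
    columns="DESCRIPTION",
    values="AMOUNT",
    aggfunc="sum",
)

# Sort offices from highest to lowest "MEALS" total; offices with no
# "MEALS" spending (NaN) are placed at the bottom by default
toy_sorted = toy_pivot.sort_values("MEALS", ascending=False)
print(toy_sorted)
```

Swapping "MEALS" for another column name re-sorts the same table without rebuilding it.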

Look up the biographies for the offices or committees with the highest totals. Are there any patterns based on the district they represent or the work that the committee does?