## Objectives

- Explore and glean insights from a real dataset using pandas
- Practice using pandas for exploratory analysis, information gathering, and discovery
- Practice cleaning data and answering questions

## General Guidelines:

- This is a **real** dataset, so it may contain errors and other peculiarities to work through
- This dataset is ~218 MB, which will take some time to load (and probably won't load in Google Sheets or Excel)
- If you make assumptions, annotate them in your responses
- While there is one code/markdown cell positioned after each question as a placeholder, some of your code/responses may require multiple cells
- Double-click the markdown cells that say, for example, **1a answer here:** to enter your written answers. If you need more cells for your written answers, make them markdown cells (rather than code cells)
- This homework assignment is not autograded because of the variety of responses one could give. 
  - Please upload this notebook to the autograder page and the TAs will manually grade it. 
  - Ensure that each cell is run and outputs your answer for ease of grading! 
  - We highly suggest doing a `restart & run all` before uploading your notebook to ensure everything runs and outputs correctly.
  - Answers without code (or code that runs) will be given 0 points.
- **This is meant to simulate real world data so you will have to do some external research to determine what some of the answers are!** 

## Dataset

You are to analyze campaign contributions to the 2016 U.S. presidential primary races made in California. Use the CSV file located here: https://drive.google.com/file/d/1Lgg-PwXQ6TQLDowd6XyBxZw5g1NGWPjB/view?usp=sharing. Download this file and save it in the same folder as this notebook. This file originally came from the U.S. Federal Election Commission (https://www.fec.gov/).

## Data Questions

You are working for a California state-wide election campaign. Your boss wants you to examine historic 2016 election contribution data to see what zipcodes are more supportive of fundraising for your candidate. 

Your boss asks you to filter out some of the records:
- Only use primary 2016 contribution data (it most closely resembles your race).
- Concentrate on Bernie Sanders as a candidate (he is most like your candidate).

The questions your boss wants answered are:
- Which zipcode (5-digit zipcode) had the highest count of contributions and the most dollar amount?
- What day(s) of the month do most people donate?

## Setup

Run the cell below as it will load the data into a pandas dataframe named `contrib`. Note that a custom date parser is defined to speed up loading. If Python were to guess the date format, it would take even longer to load.

In [3]:
import pandas as pd
import numpy as np
from datetime import datetime

# The commands below set some display options for pandas
pd.set_option('display.max_rows', 1000)
pd.options.display.float_format = '{:,.2f}'.format

# Define a date parser to pass to read_csv
d = lambda x: datetime.strptime(x, '%d-%b-%y')

# Load the data
# The path assumes the CSV sits in the same folder as this notebook - please change it as needed
contrib = pd.read_csv('P00000001-CA.csv', index_col=False, parse_dates=['contb_receipt_dt'], date_parser=d)

# Note - for now, it is okay to ignore the warning about mixed types. 
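
The `date_parser` argument used above is deprecated as of pandas 2.0. A minimal sketch of an alternative, assuming a recent pandas version: read the column as text and convert it with an explicit format, which also avoids per-value format guessing. The rows below are made up and only mimic the day-month-year layout of `contb_receipt_dt`.

```python
import io

import pandas as pd

# Made-up rows that only mimic the layout of `contb_receipt_dt`.
sample_csv = io.StringIO(
    "contb_receipt_amt,contb_receipt_dt\n"
    "50.0,26-Apr-16\n"
    "200.0,20-Apr-16\n"
    "40.0,04-Mar-16\n"
)

sample = pd.read_csv(sample_csv)

# Convert with an explicit format instead of passing a parser to read_csv.
sample["contb_receipt_dt"] = pd.to_datetime(
    sample["contb_receipt_dt"], format="%d-%b-%y"
)

print(sample.dtypes)
```

For the real file, the same `pd.to_datetime` call would be applied to `contrib['contb_receipt_dt']` after loading without `parse_dates`.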



***
## 1. Initial Data Checks (50 points)

First we will take a preliminary look at the data to check that it was loaded correctly and contains the info we need.

The questions to answer at the end of this section:
- Do we have the correct # of columns and rows?
- Do the records contain data for the questions we want to answer?
- What columns are important? 
- What columns can be dropped?
- What are the data problems?

**1a.** Print the *shape* of the data. Does this match the expectation? (2 points)

In [4]:
# 1a YOUR CODE HERE
contrib.shape

(1125659, 18)

- **1a answer here:** 

**1b.** Print a list of column names. Are all the columns included that are in the documentation? (2 points)

In [5]:
# 1b YOUR CODE HERE
contrib.columns

Index(['cmte_id', 'cand_id', 'cand_nm', 'contbr_nm', 'contbr_city',
       'contbr_st', 'contbr_zip', 'contbr_employer', 'contbr_occupation',
       'contb_receipt_amt', 'contb_receipt_dt', 'receipt_desc', 'memo_cd',
       'memo_text', 'form_tp', 'file_num', 'tran_id', 'election_tp'],
      dtype='object')

- **1b answer here:** 

**1c.** Print out the first five rows of the dataset. How do the columns `cand_id`, `cand_nm` and `contbr_st` look? (3 points)

In [6]:
# 1c YOUR CODE HERE
contrib.head()

Unnamed: 0,cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
0,C00575795,P00003392,"Clinton, Hillary Rodham","AULL, ANNE",LARKSPUR,CA,949391913.0,,RETIRED,50.0,2016-04-26,,X,* HILLARY VICTORY FUND,SA18,1091718,C4768722,P2016
1,C00575795,P00003392,"Clinton, Hillary Rodham","CARROLL, MARYJEAN",CAMBRIA,CA,934284638.0,,RETIRED,200.0,2016-04-20,,X,* HILLARY VICTORY FUND,SA18,1091718,C4747242,P2016
2,C00575795,P00003392,"Clinton, Hillary Rodham","GANDARA, DESIREE",FONTANA,CA,923371507.0,,RETIRED,5.0,2016-04-02,,X,* HILLARY VICTORY FUND,SA18,1091718,C4666603,P2016
3,C00577130,P60007168,"Sanders, Bernard","LEE, ALAN",CAMARILLO,CA,930111214.0,AT&T GOVERNMENT SOLUTIONS,SOFTWARE ENGINEER,40.0,2016-03-04,,,* EARMARKED CONTRIBUTION: SEE BELOW,SA17A,1077404,VPF7BKWA097,P2016
4,C00577130,P60007168,"Sanders, Bernard","LEONELLI, ODETTE",REDONDO BEACH,CA,902784310.0,VERICOR ENTERPRISES INC.,PHARMACIST,35.0,2016-03-05,,,* EARMARKED CONTRIBUTION: SEE BELOW,SA17A,1077404,VPF7BKX3MB3,P2016


- **1c answer here:** 

**1d.** Print out the values for the column `election_tp`. In your own words, based on the documentation, what information does the `election_tp` variable contain? Do the values in the column match the documentation? (3 points)

In [7]:
# 1d YOUR CODE HERE
contrib.election_tp.unique()  # contrib.election_tp.value_counts(dropna=False) would also show how often each value occurs

array(['P2016', 'G2016', nan, 'P2020'], dtype=object)

- **1d answer here:** 
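
As a small sketch of checking a categorical column such as `election_tp`, `value_counts(dropna=False)` lists each distinct value with its frequency and keeps missing values in the tally. The series below is made up for illustration and is not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up values that only mimic the kinds of codes seen in `election_tp`.
election_tp = pd.Series(["P2016", "G2016", np.nan, "P2016", "P2020", "P2016"])

# dropna=False keeps missing values in the tally instead of silently skipping them.
counts = election_tp.value_counts(dropna=False)
print(counts)
```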

**1e.** Print out the datatypes for all of the columns. What are the datatypes for the `contbr_zip`, `contb_receipt_amt`, `contb_receipt_dt`? (5 points)

In [8]:
# 1e YOUR CODE HERE
contrib.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1125659 entries, 0 to 1125658
Data columns (total 18 columns):
 #   Column             Non-Null Count    Dtype         
---  ------             --------------    -----         
 0   cmte_id            1125659 non-null  object        
 1   cand_id            1125659 non-null  object        
 2   cand_nm            1125659 non-null  object        
 3   contbr_nm          1125659 non-null  object        
 4   contbr_city        1125633 non-null  object        
 5   contbr_st          1125659 non-null  object        
 6   contbr_zip         1125564 non-null  object        
 7   contbr_employer    967757 non-null   object        
 8   contbr_occupation  1115260 non-null  object        
 9   contb_receipt_amt  1125659 non-null  float64       
 10  contb_receipt_dt   1125659 non-null  datetime64[ns]
 11  receipt_desc       15045 non-null    object        
 12  memo_cd            144268 non-null   object        
 13  memo_text          501148 n

- **1e answer here:** 

**1f.** What columns have the most non-nulls? Would you recommend dropping any columns based on the number of nulls? (5 points)

In [9]:
# 1f YOUR CODE HERE
for col in contrib.columns:
    print(col,contrib[col].isna().sum())

cmte_id 0
cand_id 0
cand_nm 0
contbr_nm 0
contbr_city 26
contbr_st 0
contbr_zip 95
contbr_employer 157902
contbr_occupation 10399
contb_receipt_amt 0
contb_receipt_dt 0
receipt_desc 1110614
memo_cd 981391
memo_text 624511
form_tp 0
file_num 0
tran_id 0
election_tp 1425


- **1f answer here:** 
1. receipt_desc
2. memo_cd
3. memo_text

These three columns have the most nulls (about 98.7%, 87.2%, and 55.5% of the rows, respectively), so they can be dropped.
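
A minimal sketch of ranking columns by their share of missing values, using a made-up frame rather than `contrib`. The 0.5 cutoff is an arbitrary assumption for illustration, not a rule from the assignment.

```python
import numpy as np
import pandas as pd

# Made-up frame; the column names are borrowed from the dataset for readability.
toy = pd.DataFrame({
    "cand_nm": ["A", "B", "C", "D"],
    "memo_cd": [np.nan, "X", np.nan, np.nan],
    "receipt_desc": [np.nan, np.nan, np.nan, np.nan],
})

# isna().mean() gives the fraction of missing values per column.
null_share = toy.isna().mean().sort_values(ascending=False)

# Hypothetical threshold: flag columns that are more than half empty.
mostly_null = null_share[null_share > 0.5].index.tolist()
print(null_share)
print(mostly_null)
```

On the real data, `contrib.isna().mean()` would give the same kind of summary in one call.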

**1g.** A column we know that we want to use is the `cand_nm` column. According to the documentation, each candidate also has a unique candidate ID. Check the data quality of the `cand_id` column to see if it matches the `cand_nm` column. Specifically, check that our targeted candidate 'Bernard Sanders' always has the same `cand_id` throughout. Are there any issues with `cand_nm` matching `cand_id`? (5 points)

In [10]:
# 1g YOUR CODE HERE
contrib[contrib['cand_nm'].str.contains('Sanders, Bernard')]['cand_id'].unique()

array(['P60007168'], dtype=object)

- **1g answer here:** 
No issues: every 'Sanders, Bernard' record has the same `cand_id`, `P60007168`.
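
The check above covers one candidate. A sketch of checking the name-to-ID mapping in both directions for all candidates, on a made-up frame, could look like this:

```python
import pandas as pd

# Made-up rows; names and IDs follow the pattern seen in the dataset.
toy = pd.DataFrame({
    "cand_nm": ["Sanders, Bernard", "Sanders, Bernard", "Clinton, Hillary Rodham"],
    "cand_id": ["P60007168", "P60007168", "P00003392"],
})

# Each name should map to exactly one ID, and each ID to exactly one name.
ids_per_name = toy.groupby("cand_nm")["cand_id"].nunique()
names_per_id = toy.groupby("cand_id")["cand_nm"].nunique()

consistent = bool((ids_per_name == 1).all() and (names_per_id == 1).all())
print(ids_per_name)
print(consistent)
```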

**1h.** Another area to check is to make sure all of the records are from California. Check the `contbr_st` column - are there any records outside of California based on `contbr_st`? (5 points)

In [11]:
# 1h YOUR CODE HERE
contrib[~contrib['contbr_st'].eq('CA')]['contbr_st'].count()

0

- **1h answer here:** 
No. Every record has `contbr_st` equal to `CA`.

**1i.** The next column to check for the analysis is the `tran_id` column. This column could be the primary key, so look for duplicates. How many duplicate entries are there? Is there any pattern that explains why there are duplicate entries? (5 points)

In [12]:
# 1i YOUR CODE HERE


- **1i answer here:** 
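
One possible way to look for duplicate keys, sketched on a made-up frame (the `tran_id` values here are hypothetical): `duplicated()` flags repeats after the first occurrence, while `keep=False` flags every row that shares a key, which makes patterns easier to inspect side by side.

```python
import pandas as pd

# Made-up rows with hypothetical transaction IDs.
toy = pd.DataFrame({
    "tran_id": ["T1", "T2", "T2", "T3"],
    "contb_receipt_amt": [25.0, 100.0, -100.0, 10.0],
    "memo_cd": [None, None, "X", None],
})

# Number of rows that repeat an earlier tran_id.
n_extra = int(toy["tran_id"].duplicated().sum())

# All rows involved in a duplicate, sorted so matching IDs sit together.
dupes = toy[toy["tran_id"].duplicated(keep=False)].sort_values("tran_id")
print(n_extra)
print(dupes)
```

Comparing columns such as `contb_receipt_amt` and `memo_cd` within each duplicated group is one way to look for a pattern.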

**1j.** Another column to check is the `contb_receipt_amt` that shows the donation amounts. How many negative donations are included? What do negative donations mean? Please pull at least a few rows to look at the records with negative donations. Do these records match with the expectation of why a negative donation would happen? (5 points)

In [13]:
# 1j YOUR CODE HERE
negative = contrib[contrib['contb_receipt_amt'] < 0]
len(negative)  # not displayed: a cell only shows its last expression, so wrap this in print() to see the count
negative[:5]

Unnamed: 0,cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
19,C00574624,P60006111,"Cruz, Rafael Edward 'Ted'","JOLLIFF, RICHARD",CHICO,CA,959289507.0,SELF EMPLOYED,RANCHER,-25.0,2016-04-29,REDESIGNATION TO GENERAL,X,REDESIGNATION TO GENERAL,SA17A,1077664,SA17A.1826482B,P2016
23,C00574624,P60006111,"Cruz, Rafael Edward 'Ted'","JOLLIFF, RICHARD",CHICO,CA,959289507.0,SELF EMPLOYED,RANCHER,-150.0,2016-04-29,REDESIGNATION TO GENERAL,X,REDESIGNATION TO GENERAL,SA17A,1077664,SA17A.1826483B,P2016
81,C00574624,P60006111,"Cruz, Rafael Edward 'Ted'","JOLLIFF, RICHARD",CHICO,CA,959289507.0,SELF EMPLOYED,RANCHER,-60.0,2016-04-14,REDESIGNATION TO GENERAL,X,REDESIGNATION TO GENERAL,SA17A,1077664,SA17A.1827494,P2016
190,C00574624,P60006111,"Cruz, Rafael Edward 'Ted'","NOWELL, DIANA L.",RANCHO SANTA MARGARITA,CA,926884928.0,CAPISTRAND UNIFIED SCHOOL DISTRICT,LIBRARIAN TECHNICIAN,-100.0,2016-04-11,REDESIGNATION TO GENERAL,X,REDESIGNATION TO GENERAL,SA17A,1077664,SA17A.1639830B,P2016
213,C00574624,P60006111,"Cruz, Rafael Edward 'Ted'","LICHTY, ANDREW MR.",SAN DIEGO,CA,921096720.0,SELF EMPLOYED,REAL ESTATE,-25.0,2016-04-30,REDESIGNATION TO GENERAL,X,REDESIGNATION TO GENERAL,SA17A,1077664,SA17A.1826888B,P2016


- **1j answer here:**

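Possibly refunds to donors? The sampled rows, however, are all marked `REDESIGNATION TO GENERAL`, which suggests the amount was moved from the primary to the general election rather than returned to the donor.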

**1k.** One more column to look at is the date of donation column. Are there any dates outside of the primary period (defined as 1 Jan 2014 to 7 June 2016)? Are the dates well-formatted for our analysis? (5 points)

In [14]:
# 1k YOUR CODE HERE
contrib[(contrib['election_tp']=='P2016') & ~((contrib['contb_receipt_dt'] >= '2014-01-01') & (contrib['contb_receipt_dt'] <= '2016-06-07'))]['contb_receipt_dt'].count()

141617

- **1k answer here:**
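
A small sketch of the date-range check using `Series.between`, which is inclusive on both ends by default. The dates below are made up; the window is the primary period given in the question.

```python
import pandas as pd

# Made-up dates around the edges of the primary period.
dates = pd.Series(
    pd.to_datetime(["2013-12-31", "2014-01-01", "2016-06-07", "2016-06-08"])
)

start = pd.Timestamp("2014-01-01")
end = pd.Timestamp("2016-06-07")

# True where the date falls inside the primary window (both ends included).
in_primary = dates.between(start, end)
n_outside = int((~in_primary).sum())

print(dates.min(), dates.max(), n_outside)
```

Looking at the minimum and maximum dates first is a quick way to see whether any out-of-range values exist at all.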

**1l.** Finally, answer the initial questions in the cells below (5 points)

**1l.1** Do we have the correct # of columns and rows?

- **1l.1 answer here:**

**1l.2** Do the records contain data for the questions we want to answer?

- **1l.2 answer here:**

**1l.3** What columns are important?

- **1l.3 answer here:** 

**1l.4** What columns can be dropped?

- **1l.4 answer here:** 

**1l.5** What are the data problems?

- **1l.5 answer here:**

**1l.6** List any assumptions so far:

- **1l.6 answer here:**

***
## 2. Data filtering and data quality fixes (30 points)

Now that we have a basic understanding of the data, let's filter out the records we don't need and fix the data.

**2a.** From the dataset filter out (remove) any election_tp not in the primary election. Print/show the shape of the dataframe after the filtering is complete. (5 points)

In [15]:
# 2a YOUR CODE HERE
df_2016 = contrib[(contrib['election_tp']=='P2016')]
contrib.shape, df_2016.shape

((1125659, 18), (810481, 18))

**2b.** From the dataset filter out (remove) any candidate that is not Bernie Sanders. Print/show the shape of the dataframe after the filtering is complete. (5 points)

In [16]:
# 2b YOUR CODE HERE
df_2016_Bernie = df_2016[df_2016['cand_nm'].str.contains('Sanders, Bernard')]
df_2016.shape, df_2016_Bernie.shape

((810481, 18), (407171, 18))

**2c.** The `contbr_zip` column is not formatted well for our analysis. Make a new zipcode column that contains the five-digit zipcode. Filter out any records outside of California based on the zipcode. Print/show the shape of the dataframe after the filtering is complete. (10 points).

- You will have to research what the valid 5-digit zipcodes for California are!

In [17]:
# 2c YOUR CODE HERE
# Work on an explicit copy so the new column is not assigned to a slice of df_2016
df_2016_Bernie = df_2016_Bernie.copy()
# Cast to string (the column mixes numbers and text) and keep the first five characters
df_2016_Bernie['zip_code'] = df_2016_Bernie['contbr_zip'].astype(str).str[:5]
df_2016_Bernie['zip_code'][:3]

3    93011
4    90278
5    90278
Name: zip_code, dtype: object
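
Question 2c also asks for a California-only filter. A minimal sketch on a made-up frame follows, assuming the commonly cited California range of 90001 to 96162; that range is an assumption to verify through your own research, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up values mixing floats, strings, a non-California zip, and a missing value.
toy = pd.DataFrame(
    {"contbr_zip": [949391913.0, "90278", "100014321", np.nan]}, dtype=object
)

# Keep the first five characters of the string form of each value.
toy["zip_code"] = toy["contbr_zip"].astype(str).str[:5]

# Assumed California range; unparseable values become NaN and fail the test.
zip_number = pd.to_numeric(toy["zip_code"], errors="coerce")
ca_only = toy[zip_number.between(90001, 96162)].copy()

print(ca_only.shape)
print(ca_only["zip_code"].tolist())
```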

**2d.** The receipt amount column has negative donations. After talking with your team, a decision was made that the best course of action is to remove these negative values so that the donation count and amount are more accurate. Print/show the shape of the dataframe after the filtering is complete. (5 points)

In [18]:
# 2d YOUR CODE HERE
df_2016_Bernie = df_2016_Bernie[df_2016_Bernie['contb_receipt_amt'] >= 0]  # keep only non-negative amounts
df_2016_Bernie.shape

(404083, 19)

**2e.** From the dataset drop any columns that won't be used in the analysis. Print/show the shape of the dataframe after the dropping is complete. What columns did you drop and why? (5 points)

In [19]:
# 2e YOUR CODE HERE
df_2016_Bernie.columns

Index(['cmte_id', 'cand_id', 'cand_nm', 'contbr_nm', 'contbr_city',
       'contbr_st', 'contbr_zip', 'contbr_employer', 'contbr_occupation',
       'contb_receipt_amt', 'contb_receipt_dt', 'receipt_desc', 'memo_cd',
       'memo_text', 'form_tp', 'file_num', 'tran_id', 'election_tp',
       'zip_code'],
      dtype='object')

- **2e answer here:**
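
The cell above only lists the columns. A sketch of the actual drop on a made-up one-row frame is shown below; which columns to remove is a judgment call, and the list here is only an assumption based on the null counts from 1f.

```python
import pandas as pd

# Made-up single row carrying a subset of the dataset's column names.
toy = pd.DataFrame({
    "contb_receipt_amt": [25.0],
    "contb_receipt_dt": pd.to_datetime(["2016-03-04"]),
    "zip_code": ["93011"],
    "receipt_desc": [None],
    "memo_cd": [None],
    "memo_text": ["example text"],
})

# Assumed drop list: the mostly-null columns.
to_drop = ["receipt_desc", "memo_cd", "memo_text"]
trimmed = toy.drop(columns=to_drop)

print(trimmed.shape)
print(trimmed.columns.tolist())
```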

**2f.** List any assumptions that you made up to this point:

- **2f answer here:**

***
## 3. Answering the questions (20 points)

Now that the data is cleaned and filtered - let's answer the two questions from your boss!

**3a.** Which zipcode had the highest count of contributions and the most dollar amount? (10 points)

In [20]:
# 3a YOUR CODE HERE
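top_count_zip = df_2016_Bernie['zip_code'].value_counts().idxmax()  # zipcode with the most contributions
top_amount = df_2016_Bernie.groupby('zip_code')['contb_receipt_amt'].sum().nlargest(1)  # zipcode with the largest total
top_count_zip, top_amount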

('94110',
 zip_code
 94110   294,061.13
 Name: contb_receipt_amt, dtype: float64)

- **3a answer here:** 
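
A sketch of getting both measures from a single `groupby` with named aggregation, on made-up numbers. The count and the dollar total can point to different zipcodes, so it is worth reporting them separately.

```python
import pandas as pd

# Made-up contributions: one zipcode gives often, the other gives more in total.
toy = pd.DataFrame({
    "zip_code": ["94110", "94110", "90278", "90278", "90278"],
    "contb_receipt_amt": [500.0, 300.0, 10.0, 20.0, 30.0],
})

# One row per zipcode with the number of contributions and the dollar total.
by_zip = toy.groupby("zip_code")["contb_receipt_amt"].agg(
    n_contributions="count", total_amount="sum"
)

top_by_count = by_zip["n_contributions"].idxmax()
top_by_amount = by_zip["total_amount"].idxmax()
print(by_zip)
print(top_by_count, top_by_amount)
```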

**3b.** What day(s) of the month do most people donate? (10 points)

In [21]:
# 3b YOUR CODE HERE
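# Note: day_name() returns the weekday name, not the day of the month;
# .dt.day would give the day-of-month number that the question asks about
df_2016_Bernie['contb_receipt_dt'].dt.day_name().value_counts().idxmax()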

'Wednesday'

- **3b answer here:** 
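
The question asks about the day of the month, which `.dt.day` returns as a number from 1 to 31. A minimal sketch on made-up dates:

```python
import pandas as pd

# Made-up dates: three fall on the 1st of a month, two on the 15th.
dates = pd.Series(
    pd.to_datetime(
        ["2016-03-01", "2016-04-01", "2016-04-15", "2016-05-01", "2016-05-15"]
    )
)

# Count contributions per day of the month and pick the most common day.
by_day = dates.dt.day.value_counts()
busiest_day = int(by_day.idxmax())

print(by_day)
print(busiest_day)
```

Looking at the full `by_day` table, rather than only the top value, helps answer the "day(s)" part of the question when several days are close.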