## Week 12 Assignment - W200 Introduction to Data Science Programming, UC Berkeley MIDS

Write code in this Jupyter Notebook to solve the following problems. This assignment addresses material covered in Unit 11. Please upload this **Notebook** with your solutions to your GitHub repository in your SUBMISSIONS/week_12 folder by 11:59PM PST the night before class. Do **NOT** push/upload the data file. If you turn in anything on ISVC, please do so under the Week 12 Assignment category.

## Objectives

- Explore and glean insights from a real dataset using pandas
- Practice using pandas for exploratory analysis, information gathering, and discovery
- Practice using matplotlib for data visualization

## Dataset

You are to analyze campaign contributions to the 2016 U.S. presidential primary races made in California. Use the csv file located here: https://drive.google.com/file/d/1Lgg-PwXQ6TQLDowd6XyBxZw5g1NGWPjB/view?usp=sharing. You should download and save this file in the same folder as this notebook is stored.  This file originally came from the U.S. Federal Election Commission (https://www.fec.gov/).

**DO NOT PUSH THIS FILE TO YOUR GITHUB REPO!**

- Best practice is to keep DATA files out of your code repo. As shown below, the default load path points outside of the folder this notebook is in. If you change the folder where the file is stored, please update the first cell!
- If you do accidentally push the file to your GitHub repo, follow the directions here to fix it: https://docs.google.com/document/d/15Irgb5V5G7pKPWgAerH7FPMpKeQRunbNflaW-hR2hTA/edit?usp=sharing

Documentation for this data can be found here: https://drive.google.com/file/d/11o_SByceenv0NgNMstM-dxC1jL7I9fHL/view?usp=sharing

## General Guidelines:

- This is a **real** dataset, so it may contain errors and other peculiarities to work through
- This dataset is ~218 MB, which will take some time to load (and probably won't load in Google Sheets or Excel)
- If you make assumptions, annotate them in your responses
- While there is one code/markdown cell positioned after each question as a placeholder, some of your code/responses may require multiple cells
- Double-click the markdown cells that say YOUR ANSWER HERE to enter your written answers. If you need more cells for your written answers, make them markdown cells (rather than code cells)

## Setup

Run the two cells below. 

The first cell will load the data into a pandas dataframe named `contrib`. Note that a custom date parser is defined to speed up loading. If Python were to guess the date format, it would take even longer to load.  

The second cell subsets the dataframe to focus on just the primary period through May 2016. Otherwise, we would see general election donations which would make it harder to draw conclusions about the primaries.
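
The date-parsing step in the first cell can be tried in isolation. The following is a minimal sketch on a few made-up date strings written in the `%d-%b-%y` layout that the parser expects; the sample values are assumptions for illustration, not rows from the FEC file.

```python
import pandas as pd

# Hypothetical sample strings in day-month-year layout (not real rows from the file)
sample_dates = pd.Series(['01-Jan-16', '15-Mar-16', '31-May-16'])

# Passing an explicit format spares pandas from inferring the layout of each value
parsed_dates = pd.to_datetime(sample_dates, format='%d-%b-%y')

print(parsed_dates.dt.year.tolist())
print(parsed_dates.dt.month.tolist())
```

Supplying the format up front is the same idea as the custom parser: pandas does not have to guess how each string is laid out.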

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import datetime

# The commands below set some pandas display options and make matplotlib show charts in the notebook
pd.set_option('display.max_rows', 1000)
pd.options.display.float_format = '{:,.2f}'.format
%matplotlib inline

# Define a date parser to pass to read_csv
d = lambda x: datetime.datetime.strptime(x, '%d-%b-%y')  # pd.datetime was removed from newer pandas releases

# Load the data
# We have this defaulted to the folder OUTSIDE of your repo - please change it as needed
contrib = pd.read_csv('../../P00000001-CA.csv', index_col=False, parse_dates=['contb_receipt_dt'], date_parser=d)
print(contrib.shape)

# Note - for now, it is okay to ignore the warning about mixed types. 

FileNotFoundError: [Errno 2] No such file or directory: '../../P00000001-CA.csv'

In [None]:
# Subset data to primary period 
contrib = contrib[contrib['contb_receipt_dt'] <= datetime.datetime(2016, 5, 31)].copy()
print(contrib.shape)

## 1. Data Exploration (20 points)

**1a. First, take a preliminary look at the data.**
- Print the *shape* of the data. What does this tell you about the number of variables and rows you have?
- Print a list of column names. 
- Review the documentation for this data (link above). Do you have all of the columns you expect to have?
- Sometimes variable names are not clear unless we read the documentation. In your own words, based on the documentation, what information does the `election_tp` variable contain?

In [2]:
# contrib.shape: This tells us the number of rows followed by number of columns.

contrib.shape

In [None]:
# contrib.columns: This prints all the column names.
# I have all the columns I expect.

contrib.columns

`election_tp`: indicates the type of election the contribution was made for, such as the primary election or the general election.

**1b. Print the first 5 rows from the dataset to manually check some of the data.** 

This is a good idea to ensure the data loaded and the columns parsed correctly!

In [3]:
contrib.head(5)

**1c. Pick three variables from the dataset above and run some quick sanity checks.**

When working with a new dataset, it is important to explore and sanity check your variables. For example, you may want to examine the maximum and minimum values, a frequency count, or something else. Use the three markdown cells below to explain if your **three** chosen variables "pass" your sanity checks or if you have concerns about the integrity of your data and why. 
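
As an illustration of what such checks can look like, here is a small sketch on invented data. The column names mirror the dataset, but the values (and the choice of checks) are assumptions for demonstration only.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the dataset, values are invented
toy = pd.DataFrame({
    'cand_nm': ['A', 'B', 'A', 'B'],
    'contb_receipt_amt': [50.0, -25.0, 2700.0, 10.0],
    'contbr_city': ['OAKLAND', None, 'SAN DIEGO', 'OAKLAND'],
})

# Range check: negative amounts may be refunds and deserve a closer look
n_negative = int((toy['contb_receipt_amt'] < 0).sum())

# Missing-value check: count the nulls in a column
n_missing_city = int(toy['contbr_city'].isna().sum())

# Frequency check: unexpected categories or misspellings tend to show up here
city_counts = toy['contbr_city'].value_counts(dropna=False)

print(n_negative, n_missing_city)
print(city_counts)
```

A variable "passes" when these numbers match what the documentation leads you to expect; anything surprising is worth noting as an assumption or concern.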

In [4]:
contrib['contbr_city'].value_counts()

In [None]:
contrib['contb_receipt_amt'].max()

In [None]:
contrib['cand_nm'].unique()

1- `contrib['contbr_city'].value_counts()` passes my sanity check because it lists all the cities and their occurrences as expected.\
2- `contrib['contb_receipt_amt'].max()` passes my sanity check because it shows the maximum value of the `contb_receipt_amt` column as expected.\
3- `contrib['cand_nm'].unique()` passes my sanity check because it shows all the unique candidates as expected.

**1d. Plotting a histogram** 

Make a histogram of **one** of the variables you picked above. What are some insights that you can see from this histogram? 
Remember to include on your histogram:
- Include a title
- Include axis labels
- The correct number of bins to see the breakout of values
- Hint: For some variables the range of values is very large. To do a better exploration, make the initial histogram over the full range and then make a second histogram 'zoomed' in on a smaller, discrete range.
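
The "zoom" idea in the hint can be sketched as follows. The amounts are randomly generated stand-ins (an assumption, not the real contributions), and the 0 to 200 window is an arbitrary choice for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical skewed amounts: many small values plus a few large outliers
rng = np.random.default_rng(0)
amounts = np.concatenate([rng.uniform(1, 200, 1000), [2700, 5400, 10000]])

fig, (ax_full, ax_zoom) = plt.subplots(1, 2, figsize=(12, 4))

# Full range: the outliers squeeze most of the data into the first bin
ax_full.hist(amounts, bins=50)
ax_full.set(title='Full range', xlabel='Amount (USD)', ylabel='Frequency')

# Zoomed: the range argument restricts the bins to the 0-200 window
zoom_counts, _, _ = ax_zoom.hist(amounts, bins=20, range=(0, 200))
ax_zoom.set(title='Zoomed to 0-200', xlabel='Amount (USD)', ylabel='Frequency')
```

Using `range=` keeps the original data untouched; values outside the window are simply not binned.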

In [2]:
plt.subplots(figsize =(20,7))
plt.xlabel('Candidate Names', color = 'red')
plt.xticks(rotation=90)
plt.ylabel('Frequency of Contributions', color = 'red') 
plt.title(r'Frequency of Contributions Per Candidate', color = 'red') 
plt.hist(contrib['cand_nm'], bins = 23, ec ='black', align = 'left') 
pass

From this histogram, we can see that Bernard Sanders had the highest frequency of contributions. Hillary Clinton is in second place.

## 2. Exploring Campaign Contributions (30 points)

Let's investigate the donations to the candidates.

**2a. Present a table that shows the number of donations to each candidate sorted by number of donations.**

- When presenting data as a table, it is often best to sort the data in a meaningful way. This makes it easier for your reader to examine what you've done and to glean insights.  From now on, all tables that you present in this assignment (and course) should be sorted.
- Hint: Use the `groupby` method. Groupby is explained in Unit 13: async 13.3 & 13.5
- Hint: Use the `sort_values` method to sort the data so that candidates with the largest number of donations appear on top.
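
The two hints combine into a short chain. Below is a minimal sketch on an invented table (the candidates `A`, `B`, `C` and their amounts are placeholders, not values from the dataset).

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset
toy = pd.DataFrame({
    'cand_nm': ['A', 'B', 'A', 'C', 'A', 'B'],
    'contb_receipt_amt': [10.0, 500.0, 25.0, 1000.0, 5.0, 250.0],
})

# Count donations per candidate, then put the largest counts on top
donation_counts = (
    toy.groupby('cand_nm')['contb_receipt_amt']
       .count()
       .sort_values(ascending=False)
)
print(donation_counts)
```

Swapping `.count()` for `.sum()` in the same chain gives the total value per candidate instead of the number of donations.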

Which candidate received the largest number of contributions (variable 'contb_receipt_amt')?

In [3]:
contribution_number = contrib.groupby(by='cand_nm')['contb_receipt_amt'].count().sort_values(ascending = False)
contribution_number

The candidate with the largest number of contributions is Bernard Sanders.

**2b. Now, present a table that shows the total value of donations to each candidate, sorted by the total value of the donations.**

Which candidate raised the most money in California?

In [None]:
contributions_amount = contrib.groupby(by="cand_nm")["contb_receipt_amt"].sum().sort_values(ascending=False)
contributions_amount

Hillary Clinton raised the most money in California; she has the highest total amount of contributions.

**2c. Combine the tables (sorted by either a or b above).**

- Look at the two tables you presented above. If those tables are Series, convert them to DataFrames.
- Rename the variable (column) names to accurately describe what is presented.
- Merge together your tables to show the *count* and the *value* of donations to each candidate in one table.
- Hint: Use the `merge` method.
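
One possible shape for these steps, sketched on two invented Series. The names `n_donations` and `total_amount` are hypothetical column names chosen for the example.

```python
import pandas as pd

# Hypothetical per-candidate results, as they might come out of a groupby
counts = pd.Series({'A': 3, 'B': 2}, name='contb_receipt_amt')
totals = pd.Series({'A': 40.0, 'B': 750.0}, name='contb_receipt_amt')
counts.index.name = 'cand_nm'
totals.index.name = 'cand_nm'

# Convert each Series to a DataFrame with a descriptive column name
counts_df = counts.to_frame(name='n_donations').reset_index()
totals_df = totals.to_frame(name='total_amount').reset_index()

# Merge on the shared candidate column
combined = counts_df.merge(totals_df, on='cand_nm')
print(combined)
```

Renaming during the conversion avoids the clash that would otherwise occur when both tables carry a column called `contb_receipt_amt`.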

In [None]:
# Converting the above two tables into data frames
contribution_number_df = pd.DataFrame(contribution_number)
contributions_amount_df = pd.DataFrame(contributions_amount)

# Resetting indices for both tables.
contribution_number_df.reset_index(inplace=True)
contributions_amount_df.reset_index(inplace=True)


# Renaming column names with dictionaries so they properly reflect the data.
contribution_number_df.rename(columns = {'contb_receipt_amt': 'total_number_of_contributions'}, inplace =True)
contributions_amount_df.rename(columns = {'contb_receipt_amt': 'total_amount_of_contributions '},inplace = True)


# Merging both tables together.  
number_amount_contb = pd.merge(contribution_number_df, contributions_amount_df, on = 'cand_nm')
number_amount_contb

**2d. Calculate and add a new variable to the table from 2c that shows the average \$ per donation. Print this table sorted by the average donation**

In [None]:
# Finding average $ per donation and creating a new column
number_amount_contb['average_per_contribution'] = number_amount_contb['total_amount_of_contributions ']/ number_amount_contb['total_number_of_contributions']
    
#Sorting by average donation
number_amount_contb.sort_values(by='average_per_contribution', ascending = False) 

**2e. Plotting a Bar Chart**

Make a single bar chart that shows two different bars per candidate with one bar as the total value of the donations and the other as average $ per donation. 
- Show the Candidates Name on the x-axis
- Show the amount on the y-axis
- Include a title
- Include axis labels
- Hint: Make the y-axis a log-scale to show both numbers! (matplotlib docs: https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.yscale.html )
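
To see why the log scale matters, consider this small sketch with invented numbers, where totals are in the millions and averages in the tens or hundreds. The column names `total_amount` and `avg_amount` are placeholders for the example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical values spanning several orders of magnitude
toy = pd.DataFrame({
    'cand_nm': ['A', 'B'],
    'total_amount': [1_500_000.0, 40_000.0],
    'avg_amount': [45.0, 800.0],
})

# logy=True puts the y-axis on a log scale so both bars stay visible
ax = toy.plot(x='cand_nm', y=['total_amount', 'avg_amount'],
              kind='bar', logy=True, title='Total vs. average donation')
ax.set_xlabel('Candidate')
ax.set_ylabel('Amount (USD, log scale)')
```

On a linear axis the average bars would be invisible next to the totals; on a log axis, equal distances represent equal ratios rather than equal differences.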

In [None]:
number_amount_contb.plot(x='cand_nm', y=['total_amount_of_contributions ', 'average_per_contribution'], 
                         kind="bar", logy= True, figsize=(17,9),
                         title= "Total and Average Donations Per Candidate",
                         xlabel= 'Candidate Name', ylabel="Logarithmic Scale") 
pass

**2f. Comment on the results of your data analysis in a short paragraph.**

- There are several interesting conclusions you can draw from the table you have created.
- What have you learned about campaign contributions in California?
- We are looking for data insights here rather than comments on the code!

1- James Gilmore has the fewest contributions but a very high average contribution. These contributions may come from family, friends, or wealthy individuals.

2- Bernard Sanders has the largest number of contributions but a very low average contribution. These contributions may come from the general public, specifically lower- and middle-income donors.

3- Hillary Clinton, despite receiving fewer contributions than Bernard Sanders, has the highest total amount of contributions. Her average contribution is also much higher than his, which may mean that most of her supporters are middle or upper class.

4- Overall, the two main Democratic candidates received the most contributions from California, which is consistent with California being a strongly Democratic ("blue") state.



## 3. Exploring Donor Occupations (30 points)

Above in part 2, we saw that some simple data analysis can give us insights into the campaigns of our candidates. Now let's quickly look to see what *kind* of person is donating to each campaign using the `contbr_occupation` variable.

**3a. Show the top 5 occupations of individuals that contributed to Hillary Clinton.** 

- Subset your data to create a dataframe with only donations for Hillary Clinton.
- Then use the `value_counts` and `head` methods to present the top 5 occupations (`contbr_occupation`) for her donors.
- Note: we are just interested in the count of donations, not the value of those donations.

In [None]:
Hillary_Donations = contrib.loc[contrib['cand_nm'] == 'Clinton, Hillary Rodham']
Donations_count = Hillary_Donations['contbr_occupation'].value_counts()
Donations_count_df = pd.DataFrame(Donations_count)
Donations_count_df.head(5)

**3b. Write a function called `get_donors`.**

Imagine that you want to do the previous operation on several candidates.  To keep your work neat, you want to take the work you did on the Clinton-subset and wrap it in a function that you can apply to other subsets of the data.

- The function should take a DataFrame as a parameter, and return a Series containing the counts for the top 5 occupations contained in that DataFrame.

In [None]:
def get_donors(df):
    """This function takes a dataframe that contains a variable named contbr_occupation
    (or that column on its own as a Series). It outputs a Series containing the counts
    for the 5 most common values of that variable."""

    # Accept either the full DataFrame or just its contbr_occupation column
    occupations = df['contbr_occupation'] if isinstance(df, pd.DataFrame) else df
    return occupations.value_counts().head(5)

**3c. Now run the `get_donors` function on subsets of the dataframe corresponding to three candidates. Show each of the three candidates below.**

- Hillary Clinton
- Bernie Sanders
- Donald Trump

In [None]:
# Using .loc to select the rows for each candidate in the "cand_nm" column
Hillary_donations = contrib.loc[contrib['cand_nm'] == 'Clinton, Hillary Rodham']
Bernie_donations = contrib.loc[contrib['cand_nm'] == 'Sanders, Bernard']
Trump_donations = contrib.loc[contrib['cand_nm'] == 'Trump, Donald J.']

In [None]:
# Calling get_donors to find the top occupations for Hillary Clinton
get_donors(Hillary_donations['contbr_occupation'])

In [None]:
# Calling get_donors to find the top occupations for Bernie Sanders
get_donors(Bernie_donations['contbr_occupation'])

In [None]:
# Calling get_donors to find the top occupations for Donald Trump
get_donors(Trump_donations['contbr_occupation'])

**3d. Finally, use `groupby` to separate the entire dataset by candidate.**

- Call .apply(get_donors) on your groupby object, which will apply the function you wrote to each subset of your data.
- Look at your output and marvel at what pandas can do in just one line!
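
The pattern can be tried on a tiny invented table first. In this sketch, `top_occupations` is a hypothetical stand-in for `get_donors`, and the candidates and occupations are made up.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset
toy = pd.DataFrame({
    'cand_nm': ['A', 'A', 'A', 'B', 'B'],
    'contbr_occupation': ['RETIRED', 'RETIRED', 'TEACHER', 'ENGINEER', 'ENGINEER'],
})

def top_occupations(group, n=2):
    """Hypothetical helper: the n most common occupations in one group."""
    return group['contbr_occupation'].value_counts().head(n)

# apply() calls the helper once per candidate and stacks the results
top_by_candidate = toy.groupby('cand_nm').apply(top_occupations)
print(top_by_candidate)
```

Because each group returns a Series with its own occupations, the combined result is indexed by candidate and occupation together, so a single value can be looked up with a tuple such as `('A', 'RETIRED')`.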

In [None]:
# Group the data by candidate
contributions_by_candidate = contrib.groupby('cand_nm')

# Use apply with a lambda to run get_donors on each candidate's contbr_occupation column
contributions_by_candidate.apply(lambda x: get_donors(x['contbr_occupation']))

**3e. Comment on your data insights & findings in a short paragraph.**

1- Retired people appear to donate the most often across candidates compared with every other occupation group.\
2- Bernie Sanders received the largest number of donations from unemployed people, followed by retired people.\
3- Bernie Sanders also received a huge number of donations from teachers.\
4- James Gilmore received all his donations from wealthy individuals.\
5- Hillary Clinton received an overwhelming number of donations from retired people and attorneys.\
6- There are many more insights that we can pull from this table; I listed the most interesting ones above.\
7- Not surprisingly, Trump did not receive many contributions from a blue state such as California.

**3f. Think about your findings in section 3 vs. your findings in section 2 of this assignment.**

Do you have any new data insights into the results you saw in section 2 now that you see the top occupations for each candidate?

1- James Gilmore did have wealthy donors.\
2- The majority of Bernie Sanders's donors were middle- and low-income people.\
3- Hillary Clinton had middle-class and upper-class donors.


## 4. Plotting Data (20 points)

There is an important element that we have not yet explored in this dataset - time.

**4a. Present a single line chart with the following elements.**

- Show the date on the x-axis
- Show the contribution amount on the y-axis
- Include a title
- Include axis labels

In [4]:
contrib.plot(x = 'contb_receipt_dt', y = 'contb_receipt_amt', figsize=(20,9),
                         title= "Contributions Over Time ",
                         xlabel= 'Date', ylabel="Contributions Amount", ylim=([0,12000]), grid = True) 
pass

**4b. Make a better time-series line chart**

This chart is messy and it is hard to gain insights from it.  Improve the chart from 4a so that your new chart shows a specific insight. In the spot provided, write the insight(s) that can be gained from this new time-series line chart.
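
One common way to tame a noisy per-row chart is to aggregate by time period before plotting. The sketch below uses invented rows and a monthly sum; both the data and the choice of monthly totals are assumptions, and other aggregations (daily counts, per-candidate totals) may suit a given insight better.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset
toy = pd.DataFrame({
    'contb_receipt_dt': pd.to_datetime(
        ['2016-01-05', '2016-01-20', '2016-02-03', '2016-02-03']),
    'contb_receipt_amt': [10.0, 20.0, 5.0, 15.0],
})

# Index by date, then sum the amounts within each calendar month
monthly_totals = (
    toy.set_index('contb_receipt_dt')['contb_receipt_amt']
       .resample('MS')
       .sum()
)
print(monthly_totals)
```

Calling `.plot()` on a Series like `monthly_totals` would then draw one point per month instead of one per contribution.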

In [5]:
contribution_dates.plot(x = 'n', y = 'contb_receipt_amt', figsize=(23,14),
                         title= "Contributions Over Time ",
                         xlabel= 'Date', ylabel="Contributions Amount", ylim=([0,13000]), grid = True, color = 'red') 

1- As expected, donations increased as the primary elections approached.\
2- Based on the graph, the amount of donations started picking up after 04/2015. This may indicate when candidates started announcing that they would run for president.\
3- The amount of donations increased rapidly around 04/2016. This could indicate the start of the candidates' campaigning season.

## If you have feedback for this homework, please submit it using the link below:

http://goo.gl/forms/74yCiQTf6k