# Assignment: Data Wrangling and Exploratory Data Analysis
## Do Q1 and Q2, and one other question.
`! git clone https://www.github.com/DS3001/assignment2`

**Q1.** Open the "tidy_data.pdf" document in the repo, which is a paper called Tidy Data by Hadley Wickham.

  1. Read the abstract. What is this paper about?
  2. Read the introduction. What is the "tidy data standard" intended to accomplish?
  3. Read the intro to section 2. What does this sentence mean: "Like families, tidy datasets are all alike but every messy dataset is messy in its own way." What does this sentence mean: "For a given dataset, it’s usually easy to figure out what are observations and what are variables, but it is surprisingly difficult to precisely define variables and observations in general."
  4. Read Section 2.2. How does Wickham define values, variables, and observations?
  5. How is "Tidy Data" defined in section 2.3?
  6. Read the intro to Section 3 and Section 3.1. What are the 5 most common problems with messy datasets? Why are the data in Table 4 messy? What is "melting" a dataset?
  7. Why, specifically, is table 11 messy but table 12 tidy and "molten"?
  8. Read Section 6. What is the "chicken-and-egg" problem with focusing on tidy data? What does Wickham hope happens in the future with further work on the subject of data wrangling?

In [None]:
# Q1 Reading the Tidy Data paper

# Q1.1 Abstract
# This paper argues that tidy datasets are easy to manipulate, model, and visualize because they share a specific structure:
# each variable is a column, each observation is a row, and each type of observational unit is a table.
# Because of this structure, only a small set of tools is needed to handle a wide range of messy datasets.

# Q1.2 Introduction
# It is intended to organise data values within a dataset to make data cleaning easier

# Q1.3 Intro to section 2
# "Like families, tidy datasets are all alike but every messy dataset is messy in its own way" meaning:
# Tidy datasets all share the same structure, so they look alike, while messy datasets can go wrong in many different ways, for example
# through missing values or values that were stored incorrectly.

# "For a given dataset, it's usually easy to figure out what are observations and what are variables, but it is surprisingly difficult to
# precisely define variables and observations in general" meaning:
# When a person looks at a particular dataset, they can usually tell which are the observations and which are the variables, but they can have a hard time
# articulating a general definition of observations and variables that applies to every dataset.

# Q1.4 Section 2.2
# Values are what is collected in a dataset: they can be either numbers or strings, and each value belongs to both a variable and an observation
# A variable contains all values that measure the same attribute across units. For example, weight or height
# An observation contains all values measured on the same unit across attributes. For example, if we measure the weight and height (variables) of someone, the observation is that person

# Q1.5 Section 2.3
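# Tidy data is a standard way of mapping the meaning of a dataset to its structure: each variable forms a column, each observation forms a row, and each type of observational unit forms a table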

# Q1.6 Section 3 - 3.1
# 5 most common problems with messy datasets:
# 1. Column headers are values, not variable names. So instead of one column named "cost", the price numbers themselves are the headers
# 2. Multiple variables are stored in one column. So a single column mixes several variables (for example, sex and age group), which makes it hard to work with either one
# 3. Variables are stored in both rows and columns. So some variables are spread across column headers while others appear as entries in the rows
# 4. Multiple types of observational units are stored in the same table. So the same facts are repeated in many rows, which can lead to inconsistencies
# 5. A single observational unit is stored in multiple tables. So one type of data is split across several tables or files

# Data in Table 4 are messy because the column headers are income ranges (values) instead of there being one column named "income"
# Melting a dataset is stacking it: the columns that hold values are turned into rows, with one column for the former header and one for the value

# Q1.7 Table 11 and Table 12
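# Table 11 is messy because variables are stored in both rows and columns: there is a column for each day of the month (most of them NA), and the "element" column holds variable names (tmin, tmax)
# Table 12 is tidy and "molten" because the day columns are melted into a single date column, and tmin and tmax each get their own column, so each row is one day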

# Q1.8 Section 6
# The chicken-and-egg problem with tidy data is that tidy data is only as useful as the tools that work with it, and tidy tools in turn depend on tidy data, so it is hard to improve one without the other
# Wickham hopes that others will build on this tidy data framework and develop better tools and better data storage strategies
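
The idea of "melting" from Q1.6 can be sketched with pandas. This is a minimal illustration, not code from the paper: the table below is a made-up stand-in in the style of Table 4, and the column names `religion`, `income`, and `freq` as well as the counts are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical messy table: the income brackets are values,
# but they appear as column headers (illustrative counts only).
messy = pd.DataFrame({
    "religion": ["Agnostic", "Atheist"],
    "<$10k": [27, 12],
    "$10-20k": [34, 27],
})

# Melt: keep 'religion' fixed and stack the bracket columns into rows,
# so that 'income' and 'freq' each become a single column.
molten = messy.melt(id_vars="religion", var_name="income", value_name="freq")
print(molten)
```

After melting, each row is one religion-income combination, which matches the "each observation is a row" rule.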


**Q2.** This question provides some practice cleaning variables which have common problems.
1. Numeric variable: For `./data/airbnb_hw.csv`, clean the `Price` variable as well as you can, and explain the choices you make. How many missing values do you end up with? (Hint: What happens to the formatting when a price goes over 999 dollars, say from 675 to 1,112?)
2. Categorical variable: For the `./data/sharks.csv` data covered in the lecture, clean the "Type" variable as well as you can, and explain the choices you make.
3. Dummy variable: For the pretrial data covered in the lecture, clean the `WhetherDefendantWasReleasedPretrial` variable as well as you can, and, in particular, replace missing values with `np.nan`.
4. Missing values, not at random: For the pretrial data covered in the lecture, clean the `ImposedSentenceAllChargeInContactEvent` variable as well as you can, and explain the choices you make. (Hint: Look at the `SentenceTypeAllChargesAtConvictionInContactEvent` variable.)

In [84]:
# Q2.1 Cleaning the numeric variable 'Price' from the airbnb data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# importing the data
df_airbnb = pd.read_csv( 'airbnb_hw.csv')

print(df_airbnb)
# Getting the price variable
df_airbnb.loc[:,"Price"]


        Host Id Host Since                                Name Neighbourhood   \
0       5162530        NaN     1 Bedroom in Prime Williamsburg       Brooklyn   
1      33134899        NaN     Sunny, Private room in Bushwick       Brooklyn   
2      39608626        NaN                Sunny Room in Harlem      Manhattan   
3           500  6/26/2008  Gorgeous 1 BR with Private Balcony      Manhattan   
4           500  6/26/2008            Trendy Times Square Loft      Manhattan   
...         ...        ...                                 ...            ...   
30473  43022976  8/31/2015   10 Mins to Time Square/two floors         Queens   
30474  42993382  8/31/2015       1BR ocean view & F,Q train st       Brooklyn   
30475  43033067  8/31/2015                Amazing Private Room       Brooklyn   
30476  43000991  8/31/2015   Charming private female room: UWS      Manhattan   
30477  42999189  8/31/2015    Huge Beautiful Bedroom - Astoria         Queens   

      Property Type  Review

0        145
1         37
2         28
3        199
4        549
        ... 
30473    300
30474    125
30475     80
30476     35
30477     80
Name: Price, Length: 30478, dtype: object

In [85]:
# Q2.1 - Looking at the price variable
# using the .unique function
print(df_airbnb['Price'].unique(), '\n') # the values are strings (note the quotation marks), so Price is not numeric yet - not good
# using the .value_counts() function to see how many times each price happens
print(df_airbnb['Price'].value_counts(), '\n')

['145' '37' '28' '199' '549' '149' '250' '90' '270' '290' '170' '59' '49'
 '68' '285' '75' '100' '150' '700' '125' '175' '40' '89' '95' '99' '499'
 '120' '79' '110' '180' '143' '230' '350' '135' '85' '60' '70' '55' '44'
 '200' '165' '115' '74' '84' '129' '50' '185' '80' '190' '140' '45' '65'
 '225' '600' '109' '1,990' '73' '240' '72' '105' '155' '160' '42' '132'
 '117' '295' '280' '159' '107' '69' '239' '220' '399' '130' '375' '585'
 '275' '139' '260' '35' '133' '300' '289' '179' '98' '195' '29' '27' '39'
 '249' '192' '142' '169' '1,000' '131' '138' '113' '122' '329' '101' '475'
 '238' '272' '308' '126' '235' '315' '248' '128' '56' '207' '450' '215'
 '210' '385' '445' '136' '247' '118' '77' '76' '92' '198' '205' '299'
 '222' '245' '104' '153' '349' '114' '320' '292' '226' '420' '500' '325'
 '307' '78' '265' '108' '123' '189' '32' '58' '86' '219' '800' '335' '63'
 '229' '425' '67' '87' '1,200' '158' '650' '234' '310' '695' '400' '166'
 '119' '62' '168' '340' '479' '43' '395' '144' '52' 

In [86]:
# Q2.1 - Cleaning the numeric variables:

# remove the ',' separators from the numbers:
# 1. convert all the values in 'Price' to string:
df_airbnb['Price'] = df_airbnb['Price'].astype(str)
# 2. remove the commas
df_airbnb['Price'] = df_airbnb['Price'].str.replace(',', '', regex=False) # .str.replace edits substrings; the comma is a literal character, so no regular expression is needed
# so for this example it will look at the values above and if it sees a comma, it will replace it with nothing
print(df_airbnb['Price'].unique(), '\n')

['145' '37' '28' '199' '549' '149' '250' '90' '270' '290' '170' '59' '49'
 '68' '285' '75' '100' '150' '700' '125' '175' '40' '89' '95' '99' '499'
 '120' '79' '110' '180' '143' '230' '350' '135' '85' '60' '70' '55' '44'
 '200' '165' '115' '74' '84' '129' '50' '185' '80' '190' '140' '45' '65'
 '225' '600' '109' '1990' '73' '240' '72' '105' '155' '160' '42' '132'
 '117' '295' '280' '159' '107' '69' '239' '220' '399' '130' '375' '585'
 '275' '139' '260' '35' '133' '300' '289' '179' '98' '195' '29' '27' '39'
 '249' '192' '142' '169' '1000' '131' '138' '113' '122' '329' '101' '475'
 '238' '272' '308' '126' '235' '315' '248' '128' '56' '207' '450' '215'
 '210' '385' '445' '136' '247' '118' '77' '76' '92' '198' '205' '299'
 '222' '245' '104' '153' '349' '114' '320' '292' '226' '420' '500' '325'
 '307' '78' '265' '108' '123' '189' '32' '58' '86' '219' '800' '335' '63'
 '229' '425' '67' '87' '1200' '158' '650' '234' '310' '695' '400' '166'
 '119' '62' '168' '340' '479' '43' '395' '144' '52' '47

In [87]:
# 3. change the price variable back to numeric
df_airbnb['Price'] = pd.to_numeric(df_airbnb['Price'], errors='coerce')
# 4. checking what type price is
print(df_airbnb['Price'].dtype)


int64


In [88]:
# Q2.1 -  seeing if there are any missing values
print(df_airbnb['Price'].isnull().values.any())
# end up with no missing values


False
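
The Q2.1 steps above can be condensed into one small sketch. It runs on a toy Series rather than the real file (the values are illustrative), and it counts missing values with `.isna().sum()`, since the question asks how many there are rather than whether any exist.

```python
import pandas as pd

# Toy stand-in for the Price column (illustrative values, not read from the file).
price_raw = pd.Series(["145", "1,990", "1,000", "80"])

# Strip the thousands separator as a literal string, then convert to numbers.
# errors="coerce" turns anything that still cannot be parsed into NaN.
price = pd.to_numeric(price_raw.str.replace(",", "", regex=False), errors="coerce")

print(price.tolist())
print(price.isna().sum())  # number of values that failed to convert
```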


In [None]:
# SEPARATION FROM QUESTION 2.1 TO 2.2

In [234]:
# Q2.2 Cleaning the categorical variable 'Type' from the shark data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# importing the data
df_sharks = pd.read_csv('sharks.csv',low_memory=False)

print(df_sharks)
# getting the 'type' variable
df_sharks['Type']

      index   Case Number                  Date    Year        Type  \
0         0    2020.02.05           05-Feb-2020  2020.0  Unprovoked   
1         1  2020.01.30.R  Reported 30-Jan-2020  2020.0    Provoked   
2         2    2020.01.17           17-Jan-2020  2020.0  Unprovoked   
3         3    2020.01.16           16-Jan-2020  2020.0  Unprovoked   
4         4    2020.01.13           13-Jan-2020  2020.0  Unprovoked   
...     ...           ...                   ...     ...         ...   
6457   6457       ND.0005           Before 1903     0.0  Unprovoked   
6458   6458       ND.0004           Before 1903     0.0  Unprovoked   
6459   6459       ND.0003             1900-1905     0.0  Unprovoked   
6460   6460       ND.0002             1883-1889     0.0  Unprovoked   
6461   6461       ND.0001             1845-1853     0.0  Unprovoked   

                 Country               Area  \
0                    USA               Maui   
1                BAHAMAS             Exumas   
2     

0       Unprovoked
1         Provoked
2       Unprovoked
3       Unprovoked
4       Unprovoked
           ...    
6457    Unprovoked
6458    Unprovoked
6459    Unprovoked
6460    Unprovoked
6461    Unprovoked
Name: Type, Length: 6462, dtype: object

In [235]:
# finding the missing values
print(df_sharks['Type'].isnull().values.any())
# checking how many NAs are there
df_sharks['Type_NA'] = df_sharks["Type"].isnull()
print(sum(df_sharks['Type_NA']),'\n')
# removing the 5 NAs
df_sharks = df_sharks.dropna(subset=['Type'])

True
5 



In [236]:
# check if the NAs were dropped
print(df_sharks['Type'].isnull().values.any())
# no missing values since the 5 NAs were dropped

False


In [237]:
# seeing the values in the "Type" column
print(df_sharks['Type'].unique().tolist())

['Unprovoked', 'Provoked', 'Questionable', 'Watercraft', 'Unconfirmed', 'Unverified', 'Invalid', 'Under investigation', 'Boating', 'Sea Disaster', 'Boat', 'Boatomg']


In [238]:
# Create a column called "Boat_Activity" that will contain "Watercraft", "Boating", "Boat", "Boatomg", and "Sea Disaster"
df_sharks.loc[df_sharks['Type'] == 'Watercraft', 'Boat_Activity'] = 'Watercraft'
df_sharks.loc[df_sharks['Type'] == 'Boating', 'Boat_Activity'] = 'Boating'
df_sharks.loc[df_sharks['Type'] == 'Sea Disaster', 'Boat_Activity'] = 'Sea Disaster'
df_sharks.loc[df_sharks['Type'] == 'Boat', 'Boat_Activity'] = 'Boat'
df_sharks.loc[df_sharks['Type'] == 'Boatomg', 'Boat_Activity'] = 'Boatomg'


print(df_sharks['Boat_Activity'].value_counts())

# Create a column called "Status" that will contain "Questionable", "Unconfirmed", "Unverified", "Under investigation", and "Invalid"
df_sharks.loc[df_sharks['Type'] == 'Questionable', 'Status'] = 'Questionable'
df_sharks.loc[df_sharks['Type'] == 'Unconfirmed', 'Status'] = 'Unconfirmed'
df_sharks.loc[df_sharks['Type'] == 'Unverified', 'Status'] = 'Unverified'
df_sharks.loc[df_sharks['Type'] == 'Invalid', 'Status'] = 'Invalid'
df_sharks.loc[df_sharks['Type'] == 'Under investigation', 'Status'] = 'Under investigation'


print(df_sharks['Status'].value_counts())

Sea Disaster    239
Watercraft      142
Boat            109
Boating          92
Boatomg           1
Name: Boat_Activity, dtype: int64
Invalid                552
Questionable            10
Unconfirmed              1
Unverified               1
Under investigation      1
Name: Status, dtype: int64


In [239]:
# Check the NAs for the "Boat_Activity"
print(df_sharks['Boat_Activity'].isnull().values.any())
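# drop the rows where "Boat_Activity" is NA (this removes every row whose Type is not boat-related, including "Unprovoked" and "Provoked")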
df_sharks = df_sharks.dropna(subset=['Boat_Activity'])


True


In [240]:
# check if the NAs were removed
print(df_sharks['Boat_Activity'].isnull().values.any())

False


In [242]:
# Check the NAs for the "Status"
print(df_sharks['Status'].isnull().values.any())
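# drop the rows where "Status" is NA (after the previous drop every remaining row has a missing "Status", so no rows are left)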
df_sharks = df_sharks.dropna(subset=['Status'])

True


In [243]:
# check if the NAs were removed
print(df_sharks['Status'].isnull().values.any())

False
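
An alternative way to clean `Type` that keeps every row is to collapse the labels in place instead of dropping NAs from helper columns. The sketch below is one possible set of choices, not the lecture's solution: it treats the boat spellings as one category, leaves "Sea Disaster" separate, and marks the unclear labels as missing. It uses a toy Series built from the labels printed above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Type column, using the labels printed above.
shark_type = pd.Series(["Unprovoked", "Provoked", "Boat", "Boatomg",
                        "Sea Disaster", "Invalid", "Unverified", np.nan])

boat_labels = ["Watercraft", "Boating", "Boat", "Boatomg"]
unclear_labels = ["Questionable", "Unconfirmed", "Unverified",
                  "Invalid", "Under investigation"]

# Collapse the spelling variants into a single "Watercraft" label.
type_clean = shark_type.replace(boat_labels, "Watercraft")
# Mark the unclear cases as missing instead of dropping their rows.
type_clean = type_clean.mask(type_clean.isin(unclear_labels))

print(type_clean.value_counts(dropna=False))
```

Because no rows are dropped, the cleaned column has the same length as the original and can be assigned straight back to the DataFrame.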


In [None]:
# SEPARATION FROM Q2.2 TO Q2.3

In [79]:
# Q2.3 Cleaning the dummy variable "WhetherDefendantWasReleasedPretrial" from pretrial data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# importing the data
df = pd.read_csv('/content/pretrial_data.csv',low_memory=False)

# getting and renaming the 'WhetherDefendantWasReleasedPretrial' to 'released':
df = df.rename(columns = {'WhetherDefendantWasReleasedPretrial':'released'})
print(df['released'])
# 0 is not released and 1 is released and 9 is uncertain


0        NaN
1        0.0
2        0.0
3        0.0
4        1.0
        ... 
22981    1.0
22982    1.0
22983    1.0
22984    1.0
22985    1.0
Name: released, Length: 22986, dtype: float64


In [80]:
# How many NAs:
df['released_NA'] = df['released'].isnull()
print(sum(df['released_NA']),'\n')

print(df['released'].dtype)

31 

float64


In [81]:
# replace the missing-value code 9 with np.nan (the NaNs already in this float column are np.nan, so they need no change)
df['released'] = df['released'].replace(9, np.nan)
# checking if it worked
print(df['released'].unique())


[nan  0.  1.]
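
As a small check of the dummy-variable cleaning, the sketch below recodes the "uncertain" code 9 to `np.nan` on a toy Series (illustrative values, not the real column) and counts the resulting missing values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw dummy: 0 = not released, 1 = released, 9 = uncertain.
released_raw = pd.Series([9, 0, 0, 1, 1])

# Recode the uncertain code as a true missing value.
released = released_raw.replace(9, np.nan)

print(released.unique())
print(released.isna().sum())  # how many values are now missing
```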


In [None]:
# SEPARATION FROM Q2.3 TO 2.4

In [65]:
# Question 2.4 Clean the ImposedSentenceAllChargeInContactEvent

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# getting the data:
df = pd.read_csv('/content/VirginiaPretrialData2017.csv')

# renaming 'WhetherDefendantWasReleasedPretrial' to 'imposed' (note: the question asks about 'ImposedSentenceAllChargeInContactEvent', which this cell does not touch):

df = df.rename(columns = {'WhetherDefendantWasReleasedPretrial':'imposed'})
print(df['imposed'])



0        9
1        0
2        0
3        0
4        1
        ..
22981    1
22982    1
22983    1
22984    1
22985    1
Name: imposed, Length: 22986, dtype: int64


In [68]:
# change the value 9 (missing) to nan
df['imposed'] = df['imposed'].replace(9,np.nan)
# check if it worked
print(df['imposed'].unique())
# checking to see if there are any 9s left
print(9 in df['imposed'].unique())

[nan  0.  1.]
False


In [76]:
# recode the value 4 as 0 (no 4s are present in this column, so this leaves it unchanged)
df['imposed'] = df['imposed'].replace(4, 0)
# check if it worked
print(df['imposed'].unique())
# checking to see if there are any 4s left
print(4 in df['imposed'].unique())


[nan  0.  1.]
False
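
For the variable Q2.4 actually names, the hint suggests using the sentence type to decide what a blank sentence means. The sketch below shows that idea on toy data only. The coding is an assumption that should be checked against the codebook: it supposes that type 4 means the charges were dismissed (so the sentence is 0) and type 9 means the information is unavailable (so the sentence stays missing). The short column names `sentence` and `sentence_type` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in: sentence lengths stored as text, with blanks for missing entries.
toy = pd.DataFrame({
    "sentence": ["12", " ", " ", "60.5", " "],
    "sentence_type": [1, 4, 9, 2, 4],
})

# Convert to numbers; blanks cannot be parsed and become NaN.
length = pd.to_numeric(toy["sentence"], errors="coerce")

# Assumption: type 4 = charges dismissed, so the imposed sentence is 0.
length = length.mask(toy["sentence_type"] == 4, 0)
# Assumption: type 9 = not available, so the value is genuinely missing.
length = length.mask(toy["sentence_type"] == 9, np.nan)

print(length.tolist())
print(length.isna().sum())
```

The point of the design is that the blanks are not missing at random: some of them are real zeros, and only the sentence type tells the two cases apart.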


**Q3.** This question provides some practice doing exploratory data analysis and visualization.

The "relevant" variables for this question are:
  - `level` - Level of institution (4-year, 2-year)
  - `aid_value` - The average amount of student aid going to undergraduate recipients
  - `control` - Public, Private not-for-profit, Private for-profit
  - `grad_100_value` - percentage of first-time, full-time, degree-seeking undergraduates who complete a degree or certificate program within 100 percent of expected time (bachelor's-seeking group at 4-year institutions)

1. Load the `./data/college_completion.csv` data with Pandas.
2. What are the dimensions of the data? How many observations are there? What are the variables included? Use `.head()` to examine the first few rows of data.
3. Cross tabulate `control` and `level`. Describe the patterns you see.
4. For `grad_100_value`, create a histogram, kernel density plot, boxplot, and statistical description.
5. For `grad_100_value`, create a grouped kernel density plot by `control` and by `level`. Describe what you see. Use `groupby` and `.describe` to make grouped calculations of statistical descriptions of `grad_100_value` by `level` and `control`. Which institutions appear to have the best graduation rates?
6. Create a new variable, `df['levelXcontrol']=df['level']+', '+df['control']` that interacts level and control. Make a grouped kernel density plot. Which institutions appear to have the best graduation rates?
7. Make a kernel density plot of `aid_value`. Notice that your graph is "bi-modal", having two little peaks that represent locally most common values. Now group your graph by `level` and `control`. What explains the bi-modal nature of the graph? Use `groupby` and `.describe` to make grouped calculations of statistical descriptions of `aid_value` by `level` and `control`.
8. Make a scatterplot of `grad_100_value` by `aid_value`. Describe what you see. Now make the same plot, grouping by `level` and then `control`. Describe what you see. For which kinds of institutions does aid seem to increase graduation rates?
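
The cross tabulation, interaction variable, and grouped description asked for in Q3 can be sketched on a toy DataFrame. The rows below are made up for illustration and only reuse the column names from the question; the real data would come from `./data/college_completion.csv`.

```python
import pandas as pd

# Toy stand-in with the column names from the question (values are invented).
toy = pd.DataFrame({
    "level": ["4-year", "4-year", "2-year", "2-year"],
    "control": ["Public", "Private not-for-profit", "Public", "Private for-profit"],
    "grad_100_value": [30.0, 55.0, 15.0, 40.0],
})

# Cross tabulate control (rows) against level (columns).
counts = pd.crosstab(toy["control"], toy["level"])

# Interaction variable, built exactly as in part 6.
toy["levelXcontrol"] = toy["level"] + ", " + toy["control"]

# Grouped statistical description of the graduation rate.
summary = toy.groupby("levelXcontrol")["grad_100_value"].describe()

print(counts)
print(summary)
```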

**Q4.** This question uses the Airbnb data to practice making visualizations.

  1. Load the `./data/airbnb_hw.csv` data with Pandas. You should have cleaned the `Price` variable in question 2, and you'll need it later for this question.
  2. What are the dimensions of the data? How many observations are there? What are the variables included? Use `.head()` to examine the first few rows of data.
  3. Cross tabulate `Room Type` and `Property Type`. What patterns do you see in what kinds of rentals are available? For which kinds of properties are private rooms more common than renting the entire property?
  4. For `Price`, make a histogram, kernel density, box plot, and a statistical description of the variable. Are the data badly scaled? Are there many outliers? Use `log` to transform price into a new variable, `price_log`, and take these steps again.
  5. Make a scatterplot of `price_log` and `Beds`. Describe what you see. Use `.groupby()` to compute a description of `Price` conditional on/grouped by the number of beds. Describe any patterns you see in the average price and standard deviation in prices.
  6. Make a scatterplot of `price_log` and `Beds`, but color the graph by `Room Type` and `Property Type`. What patterns do you see? Compute a description of `Price` conditional on `Room Type` and `Property Type`. Which Room Type and Property Type have the highest prices on average? Which have the highest standard deviation? Does the mean or median appear to be a more reliable estimate of central tendency, and explain why?
  7. We've looked a bit at this `price_log` and `Beds` scatterplot. Use seaborn to make a `jointplot` with `kind=hex`. Where are the data actually distributed? How does it affect the way you think about the plots in 5 and 6?
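
The log transform and the grouped description from parts 4 and 5 can be sketched as follows. The data are a made-up toy example that only reuses the `Price` and `Beds` column names; it assumes `Price` has already been cleaned to a numeric type with strictly positive values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cleaned Airbnb data (invented values).
toy = pd.DataFrame({"Price": [50, 100, 150, 1000], "Beds": [1, 1, 2, 2]})

# The log transform compresses the long right tail of prices.
toy["price_log"] = np.log(toy["Price"])

# Description of Price conditional on the number of beds.
by_beds = toy.groupby("Beds")["Price"].describe()

print(toy)
print(by_beds)
```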

**Q5.** Many important datasets contain a race variable, typically limited to a handful of values often including Black, White, Asian, Latino, and Indigenous. This question looks at data gathering efforts on this variable by the U.S. Federal government.

1. How did the most recent US Census gather data on race?
2. Why do we gather these data? What role do these kinds of data play in politics and society? Why does data quality matter?
3. Please provide a constructive criticism of how the Census was conducted: What was done well? What do you think was missing? How should future large scale surveys be adjusted to best reflect the diversity of the population? Could some of the Census' good practices be adopted more widely to gather richer and more useful data?
4. How did the Census gather data on sex and gender? Please provide a similar constructive criticism of their practices.
5. When it comes to cleaning data, what concerns do you have about protected characteristics like sex, gender, sexual identity, or race? What challenges can you imagine arising when there are missing values? What good or bad practices might people adopt, and why?
6. Suppose someone invented an algorithm to impute values for protected characteristics like race, gender, sex, or sexuality. What kinds of concerns would you have?

In [None]:
# Question 5

# Q5.1
# There is a census questionnaire that participants answer. The choices are "white", "black or african american", "american indian and alaska native", "asian",
# "native hawaiian and other pacific islander", and / or "two or more races". The US Census Bureau collects this data following the guidelines provided by
# the US Office of Management and Budget. (https://www.census.gov/quickfacts/fact/note/US/RHI625222)

# Q5.2
# We gather this information to make decisions about policy, health, the environment, and other societal issues.
# These data play a big role in politics since they inform policy making, and a big role in society since they can
# shape how races are perceived and can make the public aware of problems that particular groups face.
# Data quality matters because these data play such a big role in people's lives.

# Q5.3
# In getting information about a person's race, the US Census does a good job of allowing people to select more than one race, offering detailed options
# (for example, separate options for Asians and Pacific Islanders), and protecting the data. Some things that are missing are a separate option for people
# with Middle Eastern or North African ancestry, more specific options for Indigenous people (tribes), and a multiracial
# option for people who might not want to check every race that applies to them. Future surveys should have more inclusive categories, such as Middle Eastern or North African
# ancestry and more options for Indigenous tribes. Another suggestion is to include open-ended questions for things that cannot be answered in a multiple-choice question.
# Lastly, there could be examples for each category in case people get confused. Some of the Census' practices could be adopted in education or healthcare.

# Q 5.4
# To get information about a person's sex and gender, the US Census Bureau asks them in surveys. One good thing is that
# they provide clarifications: for example, when asking about a participant's sex, they have participants mark the sex they identify with even if it doesn't match their
# official records. Another good thing is both the sensitivity that they show and the data protection. Some things that are missing are non-binary or transgender
# options. Future surveys should have wider gender options like "gender fluid" or "non-binary" and options for different cultural gender identities.
# Lastly, they should update the wording as the language around gender and sex changes.

# Q5.5
# One concern when cleaning data like this is erasing rare categories that might look like outliers, or merging two categories that are actually completely different things.
# One challenge with missing data is that some groups may not have enough values to analyze, which can lead to inaccurate results about that group of people.
# Another challenge is inferring missing values: for example, if you have the name "Roy" and assume it's a woman but it turns out to be a man, this results in misclassification.
# A good practice is working with a diverse group of people so that there are multiple perspectives, which can lead to more accurate data cleaning.
# Another good practice is to document each data cleaning decision in comments so that if a mistake happens, people can go back, see the reasoning, and fix it.
# A bad practice is ignoring the missing values without understanding why they are there, which could lead to misinterpreted results.
# Another bad practice is making changes without thinking through the consequences, which again can lead to bad results.

# Q5.6
# One concern is that a researcher who imputes a value for these protected characteristics is making assumptions about that person's identity
# A second concern is that imputed characteristics like these can be wrong and lead to misinterpretation
# A third concern is that a person could face legal consequences for imputing values without consent

**Q6.** Open the `./data/CBO_data.pdf` file. This contains tax data for 2019, explaining where the money that the U.S. Federal Government spends comes from, in terms of taxation on individuals/families and payroll taxes (the amount that your employer pays in taxes on your wages).

For some context, the Federal government ultimately spent about $4.4 trillion in 2019, which was 21% of GDP (the total monetary value of all goods and services produced within the United States). Individual Income Taxes is the amount individuals pay on their wages to the Federal government, Corporate Income Taxes is the taxes individuals pay on capital gains from investment when they sell stock or other financial instruments, Payroll Taxes is the tax your employer pays on your wages, Excises and Customs Duties are taxes on goods or services like sin taxes on cigarettes or alcohol, and Estate and Gift Taxes are taxes paid on transfers of wealth to other people.

1. Get the Millions of Families and Billions of Dollars data into a .csv file and load it with Pandas.
2. Create a bar plot of individual income taxes by income decile. Explain what the graph shows. Why are some values negative?
3. Create a bar plot of Total Federal Taxes by income decile. Which deciles are paying net positive amounts, and which are paying net negative amounts?
4. Create a stacked bar plot for which Total Federal Taxes is grouped by Individual Income Taxes, Payroll Taxes, Excises and Customs Duties, and Estate and Gift Taxes. How does the share of taxes paid vary across the adjusted income deciles? (Hint: Are these the kind of data you want to melt?)
5. Below the Total line for Millions of Families and Billions of Dollars, there are data for the richest of the richest families. Plot this alongside the bars for the deciles above the Total line. Describe your results.
6. Get the Percent Distribution data into a .csv file and load it with Pandas. Create a bar graph of Total Federal Taxes by income decile.
7. A tax system is progressive if higher-income and wealthier individuals pay more than lower-income and less wealthy individuals, and it is regressive if the opposite is true. Is the U.S. tax system progressive in terms of amount paid? In terms of the percentage of the overall total?
8. Do the rich pay enough in taxes? Defend your answer.