# Assignment: Data Wrangling and Exploratory Data Analysis
## Do Q1 and Q2, and one other question.
`! git clone https://www.github.com/DS3001/assignment2`

**Q1.** Open the "tidy_data.pdf" document in the repo, which is a paper called Tidy Data by Hadley Wickham.

  1. Read the abstract. What is this paper about?
    
    -> This paper examines an often overlooked part of data cleaning: data tidying. It argues that structuring data in a specific, consistent format makes datasets easier to manipulate, model, and visualize, and helps avoid errors in data handling.

  2. Read the introduction. What is the "tidy data standard" intended to accomplish?
    
    -> The tidy data standard is intended to provide a universal way of organizing data, so that anyone who knows the standard can understand any table that follows it. As long as the standard is applied consistently, one can navigate any tidy dataset. It is a bit like learning an alphabet: once you know it, you can read, and eventually understand, every word written in it.
    

  3. Read the intro to section 2. What does this sentence mean: "Like families, tidy datasets are all alike but every messy dataset is messy in its own way." What does this sentence mean: "For a given dataset, it’s usually easy to figure out what are observations and what are variables, but it is surprisingly difficult to precisely define variables and observations in general."

    -> "Like families, tidy datasets are all alike but every messy dataset is messy in its own way." Every tidy dataset follows the same standard, just as "tidy" or healthy families resemble one another. For example, if we take the nuclear family as the standard for "healthy", then all nuclear families are alike. However, non-nuclear families all differ from one another, in the same way that each messy dataset follows its own conventions. A messy dataset is therefore different from every other messy dataset, as well as from the tidy ones.
   
  4. Read Section 2.2. How does Wickham define values, variables, and observations?

    -> Values are the individual quantitative (numbers) or qualitative (strings) data points within the dataset. Every value belongs to a variable and an observation.
    -> Variables group the values that measure the same underlying attribute (such as height or temperature) across units.
    -> Observations group the values measured on the same unit (such as a person or a day) across all of the variables.

  5. How is "Tidy Data" defined in section 2.3?

    -> Tidy data is defined as a standard way of mapping the meaning of a dataset to its structure. It follows three rules:
    -> 1. Each variable forms a column
    -> 2. Each observation forms a row.
    -> 3. Each type of observational unit forms a table.

  6. Read the intro to Section 3 and Section 3.1. What are the 5 most common problems with messy datasets? Why are the data in Table 4 messy? What is "melting" a dataset?

    -> The five most common problems with messy datasets are:
    Column headers are values, not variable names.
    Multiple variables are stored in one column.
    Variables are stored in both rows and columns.
    Multiple types of observational units are stored in the same table.
    A single observational unit is stored in multiple tables.
    -> In Table 4, the column headers are values, not variable names: each header is an income range, which is a value of an income variable, while the rows are identified by another variable.
    -> Melting a dataset is the process of turning columns into rows, so that values stored in column headers become values of a proper variable.
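
Melting can be sketched in pandas with `DataFrame.melt`. The table below is a hypothetical toy example, not the data from the paper: the column names `religion`, `income`, and `freq` and all of the numbers are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy table: the headers "<$10k" and "$10-20k" are values
# of an income variable, not variable names, so the table is messy.
wide = pd.DataFrame({
    "religion": ["A", "B"],
    "<$10k": [27, 12],
    "$10-20k": [34, 27],
})

# Melt: keep `religion` as an identifier and stack the income columns into rows.
long = wide.melt(id_vars="religion", var_name="income", value_name="freq")
print(long)
```

After melting, each row holds one (religion, income, freq) observation, which is the shape the three tidy data rules ask for.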

  7. Why, specifically, is table 11 messy but table 12 tidy and "molten"?

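    -> Table 11 is messy because variables are stored in both rows and columns: the days of the month are spread across separate columns (d1, d2, ...), the element column holds variable names (tmin, tmax) rather than values, and many cells are left blank.
    -> Table 12 fixes this in two steps. The molten version turns the day columns into rows, giving one row per id, date, and element. The tidy version then gives tmin and tmax their own columns, so that each row is the observation for a single day.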

  8. Read Section 6. What is the "chicken-and-egg" problem with focusing on tidy data? What does Wickham hope happens in the future with further work on the subject of data wrangling?

    -> Tidy data is only as useful as the tools designed to work with it; however, the development of tidy tools depends on the structure of tidy data. Because each depends on the other, changing the data structures or the tools independently may not improve the workflow.
    -> Wickham hopes that incremental improvements will continue as our understanding of tidy data and tidy tools develops.

**Q2.** This question provides some practice cleaning variables which have common problems.
1. Numeric variable: For `./data/airbnb_hw.csv`, clean the `Price` variable as well as you can, and explain the choices you make. How many missing values do you end up with? (Hint: What happens to the formatting when a price goes over 999 dollars, say from 675 to 1,112?)
2. Categorical variable: For the `./data/sharks.csv` data covered in the lecture, clean the "Type" variable as well as you can, and explain the choices you make.
3. Dummy variable: For the pretrial data covered in the lecture, clean the `WhetherDefendantWasReleasedPretrial` variable as well as you can, and, in particular, replace missing values with `np.nan`.
4. Missing values, not at random: For the pretrial data covered in the lecture, clean the `ImposedSentenceAllChargeInContactEvent` variable as well as you can, and explain the choices you make. (Hint: Look at the `SentenceTypeAllChargesAtConvictionInContactEvent` variable.)

In [2]:
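import pandas as pd
import numpy as np
import seaborn as sns

df = pd.read_csv('./data/airbnb_hw.csv', low_memory=False)
print( df.shape, '\n')
df.head()

# Prices above 999 are stored as strings such as "1,112", so strip "$" and ","
# before converting; anything that is still non-numeric is coerced to NaN.
df['Cleaned_Price'] = df['Price'].astype(str).str.replace(r'[\$,]', '', regex=True)
df['Cleaned_Price'] = pd.to_numeric(df['Cleaned_Price'], errors='coerce')

# Handling potential outliers: treat non-positive prices and prices of
# 10,000 or more as implausible and set them to NaN.
reasonable_price_range = (df['Cleaned_Price'] > 0) & (df['Cleaned_Price'] < 10000)
df['Cleaned_Price'] = df['Cleaned_Price'].where(reasonable_price_range, np.nan)

# Count the missing values that remain after cleaning
missing_values_price = df['Cleaned_Price'].isnull().sum()
print('Missing prices:', missing_values_price)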


(30478, 13) 



In [3]:
df = pd.read_csv('./data/sharks.csv', low_memory=False)
df.head()
df.columns.tolist()
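# Normalize case and surrounding whitespace so variants of one label collapse together
df['Cleaned_Type'] = df['Type'].str.strip().str.lower()
# Group the watercraft-related labels into a single category
df['Cleaned_Type'] = df['Cleaned_Type'].replace(['boat', 'boating', 'sea disaster'], 'watercraft')
# Treat uninformative or unverified labels as missing
df['Cleaned_Type'] = df['Cleaned_Type'].replace(['invalid', 'questionable', 'unconfirmed', 'unverified', 'under investigation'], np.nan)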
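
Parts 3 and 4 of Q2 are not attempted above. The following is only a minimal sketch of the general technique on made-up data, not a cleaning of the actual pretrial file. The short column names, the use of `9` as an "unknown" code, and the idea that sentence type `4` means no sentence was imposed are all assumptions for illustration and should be checked against the real codebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data. The code 9 for "unknown" and the sentence-type
# code 4 for "no sentence imposed" are assumptions, not the real codebook.
toy = pd.DataFrame({
    "released": [1, 0, 9, 1, 9],
    "sentence_type": [0, 1, 4, 4, 2],
    "sentence": ["12", " ", " ", " ", "36"],
})

# Dummy variable: keep 0/1 and replace the assumed sentinel code with np.nan.
toy["released_clean"] = toy["released"].replace(9, np.nan)

# Sentence length: blank strings become NaN when coerced to numeric.
toy["sentence_clean"] = pd.to_numeric(toy["sentence"], errors="coerce")

# Missing not at random: where the assumed type code says no sentence was
# imposed, a blank is really a zero. Other blanks stay missing.
no_sentence = toy["sentence_type"] == 4
toy.loc[no_sentence, "sentence_clean"] = 0.0

print(toy)
```

The point of the second step is that a blank sentence is not always unknown: the sentence type variable can explain why it is blank, so only the unexplained blanks should remain `NaN`.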


**Q3.** This question provides some practice doing exploratory data analysis and visualization.

The "relevant" variables for this question are:
  - `level` - Level of institution (4-year, 2-year)
  - `aid_value` - The average amount of student aid going to undergraduate recipients
  - `control` - Public, Private not-for-profit, Private for-profit
  - `grad_100_value` - percentage of first-time, full-time, degree-seeking undergraduates who complete a degree or certificate program within 100 percent of expected time (bachelor's-seeking group at 4-year institutions)

1. Load the `./data/college_completion.csv` data with Pandas.
2. What are the dimensions of the data? How many observations are there? What are the variables included? Use `.head()` to examine the first few rows of data.
3. Cross tabulate `control` and `level`. Describe the patterns you see.
4. For `grad_100_value`, create a histogram, kernel density plot, boxplot, and statistical description.
5. For `grad_100_value`, create a grouped kernel density plot by `control` and by `level`. Describe what you see. Use `groupby` and `.describe` to make grouped calculations of statistical descriptions of `grad_100_value` by `level` and `control`. Which institutions appear to have the best graduation rates?
6. Create a new variable, `df['levelXcontrol']=df['level']+', '+df['control']` that interacts level and control. Make a grouped kernel density plot. Which institutions appear to have the best graduation rates?
7. Make a kernel density plot of `aid_value`. Notice that your graph is "bi-modal", having two little peaks that represent locally most common values. Now group your graph by `level` and `control`. What explains the bi-modal nature of the graph? Use `groupby` and `.describe` to make grouped calculations of statistical descriptions of `aid_value` by `level` and `control`.
8. Make a scatterplot of `grad_100_value` by `aid_value`. Describe what you see. Now make the same plot, grouping by `level` and then `control`. Describe what you see. For which kinds of institutions does aid seem to increase graduation rates?
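
The cross tabulation and grouped descriptions asked for in parts 3, 5, and 6 can be sketched as follows. The rows here are made up and only reuse the assignment's column names; they are an assumption for illustration, not the contents of `college_completion.csv`.

```python
import pandas as pd

# Hypothetical toy rows that reuse the assignment's column names; values are made up.
toy = pd.DataFrame({
    "level": ["4-year", "4-year", "2-year", "2-year", "4-year", "2-year"],
    "control": ["Public", "Private not-for-profit", "Public",
                "Private for-profit", "Public", "Public"],
    "grad_100_value": [30.0, 55.0, 12.0, 40.0, 34.0, 10.0],
})

# Cross tabulation: number of institutions for each control/level pair.
counts = pd.crosstab(toy["control"], toy["level"])
print(counts)

# Interaction variable, built the same way as in part 6.
toy["levelXcontrol"] = toy["level"] + ", " + toy["control"]

# Grouped statistical description of graduation rates.
summary = toy.groupby("levelXcontrol")["grad_100_value"].describe()
print(summary)
```

With the real data loaded into `df`, the same calls apply unchanged, and a grouped kernel density plot could then be drawn with something like `sns.kdeplot(data=df, x='grad_100_value', hue='levelXcontrol')`.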

**Q4.** This question uses the Airbnb data to practice making visualizations.

  1. Load the `./data/airbnb_hw.csv` data with Pandas. You should have cleaned the `Price` variable in question 2, and you'll need it later for this question.
  2. What are the dimensions of the data? How many observations are there? What are the variables included? Use `.head()` to examine the first few rows of data.
  3. Cross tabulate `Room Type` and `Property Type`. What patterns do you see in what kinds of rentals are available? For which kinds of properties are private rooms more common than renting the entire property?
  4. For `Price`, make a histogram, kernel density, box plot, and a statistical description of the variable. Are the data badly scaled? Are there many outliers? Use `log` to transform price into a new variable, `price_log`, and take these steps again.
  5. Make a scatterplot of `price_log` and `Beds`. Describe what you see. Use `.groupby()` to compute a description of `Price` conditional on/grouped by the number of beds. Describe any patterns you see in the average price and standard deviation in prices.
  6. Make a scatterplot of `price_log` and `Beds`, but color the graph by `Room Type` and `Property Type`. What patterns do you see? Compute a description of `Price` conditional on `Room Type` and `Property Type`. Which Room Type and Property Type have the highest prices on average? Which have the highest standard deviation? Does the mean or median appear to be a more reliable estimate of central tendency, and explain why?
  7. We've looked a bit at this `price_log` and `Beds` scatterplot. Use seaborn to make a `jointplot` with `kind='hex'`. Where are the data actually distributed? How does it affect the way you think about the plots in 5 and 6?
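
The log transform and grouped description from parts 4 and 5 can be sketched on a small made-up table. The prices and bed counts below are hypothetical and chosen so that one extreme listing pulls the mean away from the median. They are not taken from `airbnb_hw.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy listings; the values are made up for illustration.
toy = pd.DataFrame({
    "Price": [50.0, 60.0, 70.0, 100.0, 120.0, 2000.0],
    "Beds": [1, 1, 1, 2, 2, 2],
})

# The log transform compresses the long right tail of prices.
toy["price_log"] = np.log(toy["Price"])

# Description of Price conditional on the number of beds.
by_beds = toy.groupby("Beds")["Price"].describe()
print(by_beds[["count", "mean", "std", "50%"]])
```

In this toy table the two-bed group has a mean far above its median because of a single expensive listing, which is the kind of pattern that makes the median the more reliable measure of central tendency. With the real data, the hex plot in part 7 could be drawn with something like `sns.jointplot(data=df, x='Beds', y='price_log', kind='hex')`.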

**Q5.** Many important datasets contain a race variable, typically limited to a handful of values often including Black, White, Asian, Latino, and Indigenous. This question looks at data gathering efforts on this variable by the U.S. Federal government.

1. How did the most recent US Census gather data on race?

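- Through self-reported survey responses: households could answer the questionnaire online, by phone, or by mail, with in-person follow-up interviews for those who did not respond.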

2. Why do we gather these data? What role do these kinds of data play in politics and society? Why does data quality matter?

- Our society is not just white people, so it is important to gather information from people of all backgrounds, races, and cultures. These data are often used for the wrong reasons and sometimes for the right ones. They can be used for predatory advertising to sell products, or to target votes for politicians in certain communities. But they can also be used to address real socio-economic problems in communities of color. The quality of these data matters because the statistics will be used widely, for a variety of purposes that affect people's lives, so they need to be as clear and transparent as possible.

3. Please provide a constructive criticism of how the Census was conducted: What was done well? What do you think was missing? How should future large scale surveys be adjusted to best reflect the diversity of the population? Could some of the Census' good practices be adopted more widely to gather richer and more useful data?

- The Census lacked detailed interviews and representation of all communities. These surveys are usually conducted by white people, so right from the start the results carry some bias. It is important to understand the culture of the people we are asking to fill out census data, since context is extremely important when representing statistics.

4. How did the Census gather data on sex and gender? Please provide a similar constructive criticism of their practices.

- Through online surveys and questionnaires, as well as interviews. Gender and sexuality take on much broader definitions once we leave the realm of the Bible. We cannot design a survey from the point of view of a single culture. Getting this right is tedious, but it demands precise, detailed work because the way the data are used will affect people's lives.

5. When it comes to cleaning data, what concerns do you have about protected characteristics like sex, gender, sexual identity, or race? What challenges can you imagine arising when there are missing values? What good or bad practices might people adopt, and why?

- When cleaning data on gender, people have different concepts of gender, and imposing one fixed standard can be controversial and can misrepresent minority communities. If we are to represent the data and information of all of our citizens, we must take all of them into account as best we can.

6. Suppose someone invented an algorithm to impute values for protected characteristics like race, gender, sex, or sexuality. What kinds of concerns would you have?

- Race, gender, sex, and sexuality are not as simple as 1 and 0. Their definitions are complex and vary across cultures and perspectives. Using a single algorithm to determine them reduces these concepts to numerical values, which does not reflect what they are. Much more qualitative analysis and context is needed in order to discuss and present data on these concepts properly.

**Q6.** Open the `./data/CBO_data.pdf` file. This contains tax data for 2019, explaining where the money comes from that the U.S. Federal Government Spends in terms of taxation on individuals/families and payroll taxes (the amount that your employer pays in taxes on your wages).

For some context, the Federal government ultimately spent about $4.4 trillion in 2019, which was 21% of GDP (the total monetary value of all goods and services produced within the United States). Individual Income Taxes is the amount individuals pay on their wages to the Federal government, Corporate Income Taxes is the taxes individuals pay on capital gains from investment when they sell stock or other financial instruments, Payroll Taxes is the tax your employer pays on your wages, Excises and Customs Duties are taxes on goods or services like sin taxes on cigarettes or alcohol, and Estate and Gift Taxes are taxes paid on transfers of wealth to other people.

1. Get the Millions of Families and Billions of Dollars data into a .csv file and load it with Pandas.
2. Create a bar plot of individual income taxes by income decile. Explain what the graph shows. Why are some values negative?
3. Create a bar plot of Total Federal Taxes by income decile. Which deciles are paying net positive amounts, and which are paying net negative amounts?
4. Create a stacked bar plot for which Total Federal Taxes is grouped by Individual Income Taxes, Payroll Taxes, Excises and Customs Duties, and Estate and Gift Taxes. How does the share of taxes paid vary across the adjusted income deciles? (Hint: Are these the kind of data you want to melt?)
5. Below the Total line for Millions of Families and Billions of Dollars, there are data for the richest of the richest families. Plot this alongside the bars for the deciles above the Total line. Describe your results.
6. Get the Percent Distribution data into a .csv file and load it with Pandas. Create a bar graph of Total Federal Taxes by income decile.
7. A tax system is progressive if higher-income and wealthier individuals pay more than lower-income and less wealthy individuals, and it is regressive if the opposite is true. Is the U.S. tax system progressive in terms of amount paid? In terms of the percentage of the overall total?
8. Do the rich pay enough in taxes? Defend your answer.
    -> no they don't: 