# Assignment: Data Wrangling

**Q1.** Open the "tidy_data.pdf" document in the repo, which is a paper called Tidy Data by Hadley Wickham.

  1. Read the abstract. What is this paper about?
  2. Read the introduction. What is the "tidy data standard" intended to accomplish?
  3. Read the intro to section 2. What does this sentence mean: "Like families, tidy datasets are all alike but every messy dataset is messy in its own way." What does this sentence mean: "For a given dataset, it’s usually easy to figure out what are observations and what are variables, but it is surprisingly difficult to precisely define variables and observations in general."
  4. Read Section 2.2. How does Wickham define values, variables, and observations?
  5. How is "Tidy Data" defined in section 2.3?
  6. Read the intro to Section 3 and Section 3.1. What are the 5 most common problems with messy datasets? Why are the data in Table 4 messy? What is "melting" a dataset?
  7. Why, specifically, is table 11 messy but table 12 tidy and "molten"?
  8. Read Section 6. What is the "chicken-and-egg" problem with focusing on tidy data? What does Wickham hope happens in the future with further work on the subject of data wrangling?

Q1: #1. This paper is about data cleaning, and more specifically data tidying. The author explains how to make messy datasets tidy with a small set of tools, which he demonstrates with a case study.

Q1: #2. The tidy data standard is intended to offer a standard way to organize the values within a dataset. With a consistent structure, you can apply the same steps and tools every time you clean data, which makes the initial cleaning easier and lets you move on to analyzing the data.

Q1: #3.
- In the introduction to section 2, the first sentence refers to the fact that families are all alike in certain ways: members are related, share similar features, and so on. Tidy datasets are similar to families because they are all structured in the same way. On the other hand, every family is different, just as each messy dataset has its own kind of disorganization and messiness.
- The second sentence means that within a particular dataset it is usually easy to tell which things are observations and which are variables, but it is difficult to write a general definition of the two that applies to every dataset.

Q1: #4
Wickham defines values as the pieces that make up a dataset, usually either numbers (if quantitative) or strings (if qualitative). A variable contains all values that measure the same underlying attribute across units. Lastly, an observation contains all values measured on the same unit across attributes.


Q1: #5
Tidy data is defined by three characteristics: each variable forms a column, each observation forms a row, and each type of observational unit forms a table.

Q1: #6
The 5 most common problems with messy datasets are:

  1. Column headers are values, not variable names.
  2. Multiple variables are stored in one column.
  3. Variables are stored in both rows and columns.
  4. Multiple types of observational units are stored in the same table.
  5. A single observational unit is stored in multiple tables.

The data in Table 4 are messy because the column headers are values of the income variable (income ranges) rather than variable names. Lastly, melting a dataset means turning columns into rows: the value-bearing column headers are stacked into a single variable column, with a second column holding the corresponding values.
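As a rough illustration of melting, here is a minimal pandas sketch. The table and its numbers are made up for this example (they only imitate the shape of Table 4), and the column names `income` and `freq` are my own choices.

```python
import pandas as pd

# Made-up counts shaped like Table 4: the income brackets are values,
# but here they are being used as column headers.
messy = pd.DataFrame({
    "religion": ["A", "B"],
    "<$10k": [10, 20],
    "$10-20k": [30, 40],
})

# Melt: keep `religion` as the identifier and stack the bracket columns
# into (income, freq) pairs, one row per religion-income combination.
tidy = messy.melt(id_vars="religion", var_name="income", value_name="freq")
print(tidy)
```

The result has one row per religion and income bracket, so each variable (religion, income, freq) is its own column.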


Q1: #7
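Table 11 is messy because variables are stored in both rows and columns: the day columns (d1, d2, ...) are values of a day variable rather than variable names, while the element column holds variable names (tmax and tmin) rather than values. Table 12(a) is "molten" because the day columns have been melted into rows and combined with year and month into a single date variable. Lastly, Table 12(b) is fully tidy because tmax and tmin are split out of element into their own columns, so each row is the observation for a single day.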
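The two steps (melt, then spread the element column back out) can be sketched in pandas as below. This is only an illustration under my own assumptions: the readings are made up, only two day columns are shown, and the day is kept as a separate column instead of being combined into a date.

```python
import pandas as pd

# Made-up readings shaped like Table 11: the day columns (d1, d2) hold values
# of a "day" variable, and `element` holds variable names (tmax, tmin).
messy = pd.DataFrame({
    "id": ["S1", "S1"],
    "year": [2010, 2010],
    "month": [1, 1],
    "element": ["tmax", "tmin"],
    "d1": [30.0, 15.0],
    "d2": [31.0, 16.0],
})

# Step 1 (melt): stack the day columns into rows, giving the "molten" form.
molten = messy.melt(
    id_vars=["id", "year", "month", "element"],
    var_name="day",
    value_name="value",
)
molten["day"] = molten["day"].str.lstrip("d").astype(int)

# Step 2 (cast): spread `element` back out so tmax and tmin become columns.
tidy = molten.pivot(
    index=["id", "year", "month", "day"], columns="element", values="value"
).reset_index()
tidy.columns.name = None
print(tidy)
```

After step 1 the element column still contains variable names; only after step 2 does each row describe one day with tmax and tmin as separate variables.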


Q1: #8
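The “chicken-and-egg” problem with focusing on tidy data is that tidy data is only as useful as the tools that work with it, while tidy tools in turn depend on tidy data. Because the two are linked, changing only the data structures or only the tools may not improve the workflow. Wickham hopes that others will build on the tidy data framework to develop even better ways of storing data and better tools for working with it.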


**Q2.** This question provides some practice cleaning variables which have common problems.
1. Numeric variable: For `./data/airbnb_hw.csv`, clean the `Price` variable as well as you can, and explain the choices you make. How many missing values do you end up with? (Hint: What happens to the formatting when a price goes over 999 dollars, say from 675 to 1,112?)
2. Categorical variable: For the `./data/sharks.csv` data covered in the lecture, clean the "Type" variable as well as you can, and explain the choices you make.
3. Dummy variable: For the pretrial data covered in the lecture, clean the `WhetherDefendantWasReleasedPretrial` variable as well as you can, and, in particular, replace missing values with `np.nan`.
4. Missing values, not at random: For the pretrial data covered in the lecture, clean the `ImposedSentenceAllChargeInContactEvent` variable as well as you can, and explain the choices you make. (Hint: Look at the `SentenceTypeAllChargesAtConvictionInContactEvent` variable.)

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns

Q2: #1

In [24]:
#loading df
url= 'https://raw.githubusercontent.com/DS3001/wrangling/main/assignment/data/airbnb_hw.csv'
df = pd.read_csv(url, low_memory=False)

In [None]:
#figuring out the unique prices
price = df['Price']
price.unique()

In [None]:
#remove the comma from thousands to clean the formatting to match the other numbers
price = price.str.replace(',','')
print(price.unique())

In [None]:
#convert the cleaned strings to numeric values; anything that cannot be parsed becomes NaN
price = pd.to_numeric(price,errors='coerce')
print(price.unique())
#count the missing values
print('Total missing: ',price.isnull().sum())
#I end up with zero missing values
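
To show why the comma has to be removed before converting, here is a small sketch on made-up price strings (not the Airbnb data itself). It compares coercing the raw strings with coercing the cleaned strings.

```python
import pandas as pd

# Made-up price strings; "1,112" mimics the thousands separator from the hint.
raw = pd.Series(["675", "1,112", "85"])

# Coercing directly turns the comma-formatted price into a missing value.
naive = pd.to_numeric(raw, errors="coerce")

# Removing the separator first keeps every price.
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

print("missing without cleaning:", naive.isna().sum())
print("missing after cleaning:  ", cleaned.isna().sum())
```

Skipping the comma step would silently turn every price above 999 into NaN, which would bias the variable toward cheaper listings.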


Q2: #2

In [None]:
#load df and list the unique Type values to see all types of shark attacks
url = 'https://raw.githubusercontent.com/DS3001/wrangling/main/assignment/data/sharks.csv'
df = pd.read_csv(url, low_memory=False)
print(df.head())
#named attack_type so the built-in type() is not shadowed
attack_type = df['Type']
print(attack_type.unique())
attack_type.value_counts()

In [None]:
#grouping all the boat-related values together because there are typos and redundant terms
attack_type = attack_type.replace(['Sea Disaster','Boat','Boating','Boatomg'],'Watercraft')
print(attack_type.value_counts())
#setting every value that isn't Watercraft, Provoked, or Unprovoked to missing
attack_type = attack_type.replace(['Invalid', 'Questionable','Unconfirmed','Unverified','Under investigation']
                    ,np.nan)
print(attack_type.value_counts())
df['Type'] = attack_type
del attack_type
df.head()
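
An alternative way to record the same kind of recoding is a single dictionary, so every decision sits in one place. This is only a sketch on made-up labels that echo the categories above; the `recode` dictionary is my own illustration and does not cover the full dataset.

```python
import numpy as np
import pandas as pd

# Made-up labels echoing the categories handled above.
raw = pd.Series(["Boat", "Boating", "Boatomg", "Sea Disaster",
                 "Provoked", "Unprovoked", "Invalid", "Questionable"])

# One dictionary holds every recoding decision: boat-related labels are
# merged, and uninformative labels are set to missing.
recode = {
    "Boat": "Watercraft", "Boating": "Watercraft",
    "Boatomg": "Watercraft", "Sea Disaster": "Watercraft",
    "Invalid": np.nan, "Questionable": np.nan,
}
clean = raw.replace(recode)
print(clean.value_counts(dropna=False))
```

Labels that are not keys in the dictionary (here Provoked and Unprovoked) pass through unchanged.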

Q2: #3

In [None]:
url = 'http://www.vcsc.virginia.gov/pretrialdataproject/October%202017%20Cohort_Virginia%20Pretrial%20Data%20Project_Deidentified%20FINAL%20Update_10272021.csv'
df = pd.read_csv(url,low_memory=False)
df.head()

In [None]:
released = df['WhetherDefendantWasReleasedPretrial']
print(released.unique())
print(released.value_counts())
#the values we have are 0, 1, & 9
#replace every 9 with np.nan because 9 is the code used for missing values
released = released.replace(9,np.nan)
print(released.value_counts())
#count how many values are now missing
print(sum(released.isnull()))
df['WhetherDefendantWasReleasedPretrial'] = released
del released
df.head()
#now all we have is 0,1, & NaN!
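
A quick sanity check after this kind of recoding is to confirm that only 0, 1, and NaN remain. The sketch below uses a made-up series, and treating 9 as "missing" is the same assumption made in the cell above.

```python
import numpy as np
import pandas as pd

# Made-up codes: 0 and 1 are real answers; 9 is assumed to mean "missing".
released = pd.Series([1, 0, 9, 1, 9])

cleaned = released.replace(9, np.nan)

# After cleaning, every non-missing value should be 0 or 1.
assert set(cleaned.dropna().unique()) <= {0, 1}
print(cleaned.value_counts(dropna=False))
```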

Q2: #4

In [None]:
length = df['ImposedSentenceAllChargeInContactEvent']
print(length.unique())
#named sentence_type so the built-in type() is not shadowed
sentence_type = df['SentenceTypeAllChargesAtConvictionInContactEvent']
print(sentence_type.unique())
#convert length to numeric values; non-numeric entries become NaN
length = pd.to_numeric(length, errors='coerce')
length_NA = length.isnull()
print(np.sum(length_NA))
#of roughly 23,000 values, roughly 9,000 are missing, so I am going to
#compare missingness against the sentence type to see when charges were dismissed

In [None]:
print(pd.crosstab(length_NA, sentence_type))
#type 4 means the charges were dismissed, so the sentence length is set to 0 when type == 4
length = length.mask(sentence_type == 4, 0)
#type 9 is treated as missing, so the length is set to NaN
length = length.mask(sentence_type == 9, np.nan)
# a new missing dummy for NaN length values
length_NA = length.isnull()
print(pd.crosstab(length_NA, sentence_type))
print(np.sum(length_NA))
#missing values are now cut down to 274 vs 9k!
df['ImposedSentenceAllChargeInContactEvent'] = length
del length, sentence_type
df.head()
df.head()
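
The reasoning behind this step (the values are missing for a reason, not at random) can be seen on a tiny made-up table. The codes and numbers below are invented for illustration; only the idea that type 4 marks dismissed charges comes from the comments above.

```python
import numpy as np
import pandas as pd

# Made-up records. Assumed coding: sentence type 4 = charges dismissed.
toy = pd.DataFrame({
    "sentence_type": [0, 1, 4, 4, 9, 2],
    "sentence_months": [12.0, 6.0, np.nan, np.nan, np.nan, np.nan],
})

# Cross-tabulate missingness against sentence type to see where the gaps cluster.
missing = toy["sentence_months"].isna()
print(pd.crosstab(missing, toy["sentence_type"]))

# Dismissed cases carry no sentence, so their gaps are filled with zero.
filled = toy["sentence_months"].mask(toy["sentence_type"] == 4, 0)
print("missing before:", missing.sum(), "after:", filled.isna().sum())
```

The crosstab is what justifies the fill: if the missing values were spread evenly across sentence types, replacing them with 0 would not be defensible.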

**Q3.** Many important datasets contain a race variable, typically limited to a handful of values often including Black, White, Asian, Latino, and Indigenous. This question looks at data gathering efforts on this variable by the U.S. Federal government.

1. How did the most recent US Census gather data on race?
2. Why do we gather these data? What role do these kinds of data play in politics and society? Why does data quality matter?
3. Please provide a constructive criticism of how the Census was conducted: What was done well? What do you think was missing? How should future large scale surveys be adjusted to best reflect the diversity of the population? Could some of the Census' good practices be adopted more widely to gather richer and more useful data?
4. How did the Census gather data on sex and gender? Please provide a similar constructive criticism of their practices.
5. When it comes to cleaning data, what concerns do you have about protected characteristics like sex, gender, sexual identity, or race? What challenges can you imagine arising when there are missing values? What good or bad practices might people adopt, and why?
6. Suppose someone invented an algorithm to impute values for protected characteristics like race, gender, sex, or sexuality. What kinds of concerns would you have?