# Assignment: Data Wrangling

**Q1.** Open the "tidy_data.pdf" document in the repo, which is a paper called Tidy Data by Hadley Wickham.

  1. Read the abstract. What is this paper about?
  Cleaning data is a critical step in analysis, though there have been few advances in making the process more effective. Data tidying is the concept of organizing a dataset so that each variable is a column, each observation is a row, and each type of observational unit is a table. The paper presents a framework in which a small set of tools can tidy a wide range of messy datasets, and in which analysis tools both take in and return tidy data.
 2. Read the introduction. What is the "tidy data standard" intended to accomplish?


 Data tidying is the structuring of datasets to facilitate analysis, and data cleaning is often said to take up about 80% of the time spent on an analysis. The tidy data standard is intended to provide a default way to organize the values within a dataset, making the initial cleaning less time-consuming. The framework provided in this paper offers a “philosophy of data” that underlies the plyr and ggplot2 packages.
  
 3. Read the intro to section 2. What does this sentence mean: "Like families, tidy datasets are all alike but every messy dataset is messy in its own way." What does this sentence mean: "For a given dataset, it’s usually easy to figure out what are observations and what are variables, but it is surprisingly difficult to precisely define variables and observations in general."


The first sentence means that tidy datasets share a common set of properties, such as their physical layout and grouping, that make them similar, while each messy dataset has its own unique problems that make it hard to analyze. The second sentence refers to the idea that there are many ways to organize the same underlying data. Formulating a broad, all-encompassing definition of variables and observations is difficult because these decisions depend heavily on context.
   
 4. Read Section 2.2. How does Wickham define values, variables, and observations?


Wickham defines values as what a dataset is composed of, usually either numbers or strings. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (e.g., height, temperature, duration) across units, and an observation contains all values measured on the same unit (e.g., a person, a day, or a race) across attributes.

 5. How is "Tidy Data" defined in section 2.3?


Tidy data is defined as data in which each variable forms a column, each observation forms a row, and each type of observational unit forms a table. Messy data is any other arrangement of the data.


 6. Read the intro to Section 3 and Section 3.1. What are the 5 most common problems with messy datasets? Why are the data in Table 4 messy? What is "melting" a dataset?


There are 5 common problems with messy datasets: column headers are values, not variable names; multiple variables are stored in one column; variables are stored in both rows and columns; multiple types of observational units are stored in the same table; and a single observational unit is stored in multiple tables. The data in Table 4 are messy because the column headers are values, not variable names. The headers are values of an income variable, which should be stored in its own column next to religion so that each variable forms a column. Melting is turning columns into rows to create a molten dataset.

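As a rough illustration of melting, the sketch below builds a small wide table shaped like Table 4 and melts it with the pandas `melt` method. The counts and the column names `income` and `freq` are made up for illustration; they are not taken from the paper.

```python
import pandas as pd

# Hypothetical wide table in the style of Table 4: income brackets are column headers.
# The counts are invented for illustration only.
wide = pd.DataFrame({
    'religion': ['Agnostic', 'Atheist'],
    '<$10k': [1, 2],
    '$10-20k': [3, 4],
})

# Melt: keep 'religion' as an identifier and stack the income columns into rows,
# so each row is one (religion, income) combination with its frequency.
molten = wide.melt(id_vars='religion', var_name='income', value_name='freq')
print(molten)
```

Each of the two income columns contributes one row per religion, so the molten table has four rows and three columns.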

 7. Why, specifically, is table 11 messy but table 12 tidy and "molten"?


Table 11 is messy because variables are stored in both rows and columns: there is a column for each day of the month, and the element column is not a variable; it stores the names of variables (tmax and tmin). Table 12(a) is “molten” because the day columns have been melted into rows, but tmax and tmin are still stacked in rows under element. Rotating the element variable back out into the columns gives the tidy weather dataset in Table 12(b), where each variable is a column and each row is one day's observation.

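The two steps described above (melt, then rotate `element` back into columns) can be sketched in pandas as follows. The station id, day labels, and temperatures are hypothetical stand-ins for the weather data in the paper, and `set_index` followed by `unstack` is only one of several ways to do the rotation.

```python
import pandas as pd

# Hypothetical weather table in the style of Table 11: one column per day,
# with the 'element' column holding variable names (tmax, tmin).
messy = pd.DataFrame({
    'id': ['S1', 'S1'],
    'element': ['tmax', 'tmin'],
    'd1': [30.0, 15.0],
    'd2': [31.0, 16.0],
})

# Step 1 (melt): turn the day columns into rows, giving the molten form.
molten = messy.melt(id_vars=['id', 'element'], var_name='day', value_name='value')

# Step 2 (unstack): rotate 'element' back into columns so that tmax and tmin
# each become a variable and each row is one station-day.
tidy = (molten.set_index(['id', 'day', 'element'])['value']
              .unstack('element')
              .reset_index())
tidy.columns.name = None  # drop the leftover 'element' label on the column index
print(tidy)
```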

 8. Read Section 6. What is the "chicken-and-egg" problem with focusing on tidy data? What does Wickham hope happens in the future with further work on the subject of data wrangling?


The “chicken-and-egg” problem is that tidy data is only as useful as the tools that work with it, so tidy tools are inextricably linked to tidy data. This makes it easy to get stuck in a local maximum where changing the data structures or the tools independently does not improve the workflow. Wickham acknowledges that he does not see the tidy data framework as the final solution and hopes that others will build on it to develop better storage strategies and better tools. He likewise hopes that more frameworks are developed for the other tasks involved in cleaning data, such as parsing dates and numbers, identifying missing values, and correcting character encodings.


**Q2.** This question provides some practice cleaning variables which have common problems.
1. Numeric variable: For `./data/airbnb_hw.csv`, clean the `Price` variable as well as you can, and explain the choices you make. How many missing values do you end up with? (Hint: What happens to the formatting when a price goes over 999 dollars, say from 675 to 1,112?)


In [10]:
import pandas as pd
import numpy as np

! git clone https://github.com/DS3001/wrangling

fatal: destination path 'wrangling' already exists and is not an empty directory.


In [11]:
df = pd.read_csv('wrangling/assignment/data/airbnb_hw.csv', low_memory=False)
print( df.shape, '\n')
df.head()

price = df['Price']
price.unique() # Inspect the values: numbers over 999 contain comma separators

price = price.str.replace(',', '', regex=False) # Remove the separator commas so large prices can be parsed
price = pd.to_numeric(price, errors='coerce') # Coerce to numeric; anything unparseable becomes NaN

print("Missing Values:", price.isnull().sum()) # Count the missing values


(30478, 13) 

Missing Values: 0

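The hint can be checked on a toy Series: with `errors='coerce'`, a string such as `'1,112'` cannot be parsed and becomes missing unless the comma is removed first. The values below are made up and are not drawn from `airbnb_hw.csv`.

```python
import pandas as pd

# Toy prices stored as strings; values are illustrative only.
raw = pd.Series(['675', '1,112', '99'])

# Without cleaning, the comma-formatted value cannot be parsed and is coerced to NaN.
naive = pd.to_numeric(raw, errors='coerce')

# Removing the thousands separator first lets every value parse.
cleaned = pd.to_numeric(raw.str.replace(',', '', regex=False), errors='coerce')

print(naive.isna().sum(), cleaned.isna().sum())
```

Comparing the two counts shows why the comma must be stripped before coercion: coercing first would silently turn every price over 999 into a missing value.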

2. Categorical variable: For the `./data/sharks.csv` data covered in the lecture, clean the "Type" variable as well as you can, and explain the choices you make.

In [12]:
df = pd.read_csv('wrangling/assignment/data/sharks.csv', low_memory=False)
df['Type'].value_counts()

Unprovoked             4716
Provoked                593
Invalid                 552
Sea Disaster            239
Watercraft              142
Boat                    109
Boating                  92
Questionable             10
Unconfirmed               1
Unverified                1
Under investigation       1
Boatomg                   1
Name: Type, dtype: int64

In [13]:
shark_type = df['Type'] # Create a temporary vector of values for the Type variable (not named `type`, to avoid shadowing the Python built-in)

shark_type = shark_type.replace(['Sea Disaster', 'Boat','Boating','Boatomg'],'Watercraft') # Group sea disasters and the duplicate or misspelled boating labels under Watercraft
shark_type = shark_type.replace(['Invalid', 'Questionable','Unconfirmed','Unverified','Under investigation'],np.nan) # Treat unverifiable categories as missing values with np.nan from NumPy ("not a number", a float)
shark_type.value_counts()

df['Type'] = shark_type # Replace the 'Type' variable with the cleaned values
del shark_type # Delete the temporary vector

df['Type'].value_counts()

Unprovoked    4716
Provoked       593
Watercraft     583
Name: Type, dtype: int64

3. Dummy variable: For the pretrial data covered in the lecture, clean the `WhetherDefendantWasReleasedPretrial` variable as well as you can, and, in particular, replace missing values with `np.nan`.

In [15]:
df = pd.read_csv('http://www.vcsc.virginia.gov/pretrialdataproject/October%202017%20Cohort_Virginia%20Pretrial%20Data%20Project_Deidentified%20FINAL%20Update_10272021.csv', low_memory=False)
print(df['WhetherDefendantWasReleasedPretrial'].value_counts(),'\n')

release = df['WhetherDefendantWasReleasedPretrial'] #Create a temporary vector of values for the variable
print(release.unique(),'\n')
release = release.replace(9,np.nan) # Based on the codebook, we know 9 means "unclear"
print(release.value_counts(),'\n')
release.isnull().sum() # The 31 values coded 9 are now missing
df['WhetherDefendantWasReleasedPretrial'] = release # Replace variable with the cleaned values
del release # Delete the temporary vector

1    19154
0     3801
9       31
Name: WhetherDefendantWasReleasedPretrial, dtype: int64 

[9 0 1] 

1.0    19154
0.0     3801
Name: WhetherDefendantWasReleasedPretrial, dtype: int64 

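One side effect visible in the output above is that the 0/1 codes become `0.0`/`1.0`, because `np.nan` is a float. If integer codes are preferred, one option is the pandas nullable `Int64` dtype. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Toy release codes; 9 stands in for the "unclear" code. Values are illustrative only.
raw = pd.Series([1, 0, 9, 1])

# Replacing 9 with np.nan upcasts the column to float64.
as_float = raw.replace(9, np.nan)

# Converting to the nullable integer dtype keeps 0/1 as integers and marks missing as <NA>.
as_int = as_float.astype('Int64')

print(as_float.dtype, as_int.dtype, as_int.isna().sum())
```

Either representation satisfies the request to mark missing values; the nullable dtype only changes how the remaining codes are stored and displayed.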


4. Missing values, not at random: For the pretrial data covered in the lecture, clean the `ImposedSentenceAllChargeInContactEvent` variable as well as you can, and explain the choices you make. (Hint: Look at the `SentenceTypeAllChargesAtConvictionInContactEvent` variable.)


In [17]:
sentence = df['ImposedSentenceAllChargeInContactEvent']
sentence_type = df['SentenceTypeAllChargesAtConvictionInContactEvent'] # Not named `type`, to avoid shadowing the Python built-in

sentence = pd.to_numeric(sentence,errors='coerce') # Coerce the values to numeric using the Pandas method
sentence_NA = sentence.isnull() # Create a missing dummy
print( np.sum(sentence_NA),'\n') # see we have many missing values (9k of 23k)

print( pd.crosstab(sentence_NA, sentence_type), '\n') # group 4 is cases where charges were dismissed/pending/deferred

sentence = sentence.mask( sentence_type == 4, 0) # Sentence is 0 when type == 4, since no sentence was imposed
sentence = sentence.mask( sentence_type == 9, np.nan) # Sentence is np.nan when type == 9 since group 9 has no sentencing record

sentence_NA = sentence.isnull()
print( np.sum(sentence_NA),'\n') # see only 274 missing values

df['ImposedSentenceAllChargeInContactEvent'] = sentence # Replace variable with the cleaned values
del sentence, sentence_type # Delete the temporary vectors

274 

SentenceTypeAllChargesAtConvictionInContactEvent     0     1    2     4    9
ImposedSentenceAllChargeInContactEvent                                      
False                                             8720  4299  914  8779    0
True                                                 0     0    0     0  274 

274 

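The logic of the two `mask` calls can be seen on a toy example: `Series.mask(cond, other)` replaces values where `cond` is true and leaves the rest unchanged. The numbers below are made up, and the name `sentence_type` is chosen here for illustration; the type codes 4 and 9 follow their usage in the cell above.

```python
import numpy as np
import pandas as pd

# Toy data: sentence lengths as strings (blank when nothing is recorded)
# and a matching sentence-type code. Values are illustrative only.
sentence = pd.to_numeric(pd.Series(['12', ' ', ' ', '6']), errors='coerce')
sentence_type = pd.Series([1, 4, 9, 0])

# Type 4 (dismissed/pending/deferred): no sentence was imposed, so record 0.
sentence = sentence.mask(sentence_type == 4, 0)
# Type 9: no sentencing record, so the value stays missing.
sentence = sentence.mask(sentence_type == 9, np.nan)

print(sentence.tolist())
```

The point of the exercise is that the blanks are not missing at random: most of them correspond to cases with no conviction, where a sentence of 0 is the sensible value, and only the type 9 cases are truly unknown.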


**Q3.** Many important datasets contain a race variable, typically limited to a handful of values often including Black, White, Asian, Latino, and Indigenous. This question looks at data gathering efforts on this variable by the U.S. Federal government.

1. How did the most recent US Census gather data on race?
2. Why do we gather these data? What role do these kinds of data play in politics and society? Why does data quality matter?
3. Please provide a constructive criticism of how the Census was conducted: What was done well? What do you think was missing? How should future large scale surveys be adjusted to best reflect the diversity of the population? Could some of the Census' good practices be adopted more widely to gather richer and more useful data?
4. How did the Census gather data on sex and gender? Please provide a similar constructive criticism of their practices.
5. When it comes to cleaning data, what concerns do you have about protected characteristics like sex, gender, sexual identity, or race? What challenges can you imagine arising when there are missing values? What good or bad practices might people adopt, and why?
6. Suppose someone invented an algorithm to impute values for protected characteristics like race, gender, sex, or sexuality. What kinds of concerns would you have?