# Data Cleaning 
Here are some of the functions presented in *Python for Data Analysis*, Chapter 7.

## **Pandas Functions for Data Handling and Transformation**

*   **`add_prefix()`**: Adds a prefix to column labels in a DataFrame.
*   **`.agg()`**: Aggregates data using one or more operations over a specified axis.
*   **`.any()`**: Checks if any element is `True` over a requested axis.
*   **`as_ordered()`**: Makes a categorical Series ordered.
*   **`.astype()`**: Casts a pandas object to a specified data type.
*   **`pd.Categorical()`**: Creates a categorical data type.
*   **`pd.cut()`**: Bins values into discrete intervals.
*   **`.describe()`**: Generates descriptive statistics of a Series or DataFrame.
*   **`drop_duplicates()`**: Removes duplicate rows from a DataFrame.
*   **`dropna()`**: Removes missing values from a Series or DataFrame.
*   **`.duplicated()`**: Returns a boolean Series indicating duplicate rows.
*   **`fillna()`**: Fills missing (NA/NaN) values with a specified value or method.
*   **`pd.get_dummies()`**: Converts categorical variables into dummy/indicator variables.
*   **`.groupby()`**: Groups a DataFrame or Series by a mapper or by a Series of columns.
*   **`.isna()`**: Detects missing values, returning a boolean same-sized object.
*   **`.isin()`**: Checks whether each element in a DataFrame is contained in a sequence of values.
*   **`.join()`**: Joins columns with another DataFrame.
*   **`.map()`**: Maps values of a Series according to an input correspondence.
*   **`.mean()`**: Calculates the mean of a Series.
*   **`.notna()`**: Detects existing (non-missing) values.
*   **`pd.qcut()`**: Quantile-based discretization function.
*   **`replace()`**: Replaces a value in a DataFrame.
*   **`.reset_index()`**: Resets the index of a DataFrame.
*   **`.sample()`**: Returns a random sample of items from an axis of an object.
*   **`pd.Series()`**: Creates a one-dimensional labeled array.
*   **`.str.contains()`**: Tests if a pattern or regex is contained within a string of a Series.
*   **`.str.extract()`**: Extracts capture groups in a regex as a DataFrame.
*   **`.str.findall()`**: Finds all occurrences of a pattern or regex in a Series.
*   **`.str.get()`**: Extracts the i-th element of each element in a Series.
*   **`.str.get_dummies()`**: Splits each string in a Series by a separator and returns a DataFrame of dummy/indicator variables.
*   **`pd.unique()`**: Returns the unique values of a Series.
*   **`pd.value_counts()`**: Returns a Series containing counts of unique values.
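
To make the difference between two of the binning functions above concrete, here is a minimal sketch. The `heights` values, bin edges, and labels are made up for illustration and do not come from the survey data.

```python
import pandas as pd

# Hypothetical heights in inches, chosen only to illustrate binning.
heights = pd.Series([60, 62, 64, 66, 68, 70, 72, 74])

# pd.cut uses the bin edges you supply, so group sizes can differ.
by_edges = pd.cut(heights, bins=[59, 65, 71, 77], labels=["short", "medium", "tall"])

# pd.qcut picks edges from the data's quantiles, so groups are roughly equal in size.
by_quantile = pd.qcut(heights, q=4)

print(by_edges.value_counts())
print(by_quantile.value_counts())
```

With `pd.cut` the analyst controls where the boundaries fall; with `pd.qcut` the data does.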

## **Python Functions for General Operations**

*   **`.abs()`**: Returns the absolute value of each element.
*   **`.count()`**: Counts non-NA cells for each column or row.
*   **`.find()`**: Returns the lowest index of a substring in the string, or `-1` if the substring is not found.
*   **`.index()`**: Returns the lowest index of a substring; raises a `ValueError` if the substring is not found.
*   **`.join()` (string method)**: Concatenates the elements of an iterable with a string separator.
*   **`np.random.permutation()`**: Randomly permutes a sequence.
*   **`np.random.seed()`**: Seeds the random number generator for reproducibility.
*   **`np.sign()`**: Returns an element-wise indication of the sign of a number.
*   **`.split()`**: Splits a string into a list on a separator (whitespace by default).
*   **`.strip()`**: Removes leading and trailing characters (whitespace by default).
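
The string methods above behave differently when a substring is missing, which matters when cleaning free-text responses. A short sketch, using a made-up string:

```python
# Hypothetical response with stray whitespace.
text = "  deep dish  "
cleaned = text.strip()

found_at = cleaned.find("dish")     # index of the first match
missing_at = cleaned.find("thin")   # find() signals "not found" with -1

try:
    cleaned.index("thin")           # index() raises instead of returning -1
    raised = False
except ValueError:
    raised = True

words = cleaned.split()             # split on whitespace
rejoined = "-".join(words)          # join the pieces with a separator

print(found_at, missing_at, raised, words, rejoined)
```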

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# results url
url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vRkK73xD192AdP0jZe6ac9cnVPSeqqbYZmSPnhY2hnY8ANROAOCStRFdvjwFoapv3j2rzMtZ91KXPFm/pub?output=csv"

# create data frame from url
df = pd.read_csv(url)

# assign original headers to list
survey_questions = df.columns.to_list()

# replace with column names easier to work with
renamelist = ['Timestamp', 'musicartist', 'height', 'city', '30min', 'travel',
              'likepizza', 'deepdish', 'sport', 'spell', 'hangout', 'talk',
              'year', 'quote', 'areacode', 'pets', 'superpower', 'shoes']
df.columns = renamelist

# print new column labels and original
for new_name, question in zip(renamelist, survey_questions):
  print(f'{new_name:15} {question}')

Timestamp       Timestamp
musicartist     Who is your favorite music artist (broadly defined)?
height          What physical height would you like to be?
city            If you had to live in the city, but could pick any city in the world, what city would you live in?
30min           If you could have 30 minutes to talk with any person, living or dead, who would you pick?
travel          If you could travel to any location in the world for vacation, where would you go?
likepizza       On a scale of 1 (gross) to five (awesome) how much do you like pizza?
deepdish        Is Chicago-style deep dish actually pizza or is it really casserole?
sport           What sport do you most enjoy watching?
spell           Which is the most difficult to spell? 
hangout         What is the optimal number of people to hang out with?
talk            Do you think you talk more or less than the average person?
year            If you had a time machine and could visit any year you like, which year would you 

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541 entries, 0 to 540
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Timestamp    541 non-null    object 
 1   musicartist  518 non-null    object 
 2   height       537 non-null    object 
 3   city         531 non-null    object 
 4   30min        506 non-null    object 
 5   travel       531 non-null    object 
 6   likepizza    538 non-null    float64
 7   deepdish     539 non-null    object 
 8   sport        522 non-null    object 
 9   spell        540 non-null    object 
 10  hangout      537 non-null    object 
 11  talk         539 non-null    object 
 12  year         523 non-null    object 
 13  quote        505 non-null    object 
 14  areacode     332 non-null    object 
 15  pets         323 non-null    object 
 16  superpower   333 non-null    object 
 17  shoes        336 non-null    object 
dtypes: float64(1), object(17)
memory usage: 76.2+ KB


In [9]:
df.head()

| | Timestamp | musicartist | height | city | 30min | travel | likepizza | deepdish | sport | spell | hangout | talk | year | quote | areacode | pets | superpower | shoes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8/27/2018 13:54:50 | LSTNYT | 5'9 because then I could model | Tokyo | Prince dead, living Miguel | Spain | 5.0 | pizza | Gymnastics | daiquiri | 5 | More | Future, no options for me in the past | Treat yo self | NaN | NaN | NaN | NaN |
| 1 | 8/27/2018 14:11:00 | MJ | 5 foot 9. I am 5 foot 7 | LA | Bruce Lee | Normandy | 3.0 | no opinion | Basketball | hors d'oeuvre | 2 | Less | 2050 | Accept it is to be liberated from it, when tom... | NaN | NaN | NaN | NaN |
| 2 | 8/27/2018 14:14:23 | Il Volo | 5'2" | Chicago | Dmitri Shostakovich | Vienna, Austria | 4.0 | pizza | Hockey | hors d'oeuvre | 3 | Less | 1919 | Relatable composer quote from Richard Strauss:... | NaN | NaN | NaN | NaN |
| 3 | 8/27/2018 14:26:51 | Jon Belllion | 6'1" | Bar Harbor, ME | Theodore Roosevelt | New Zealand | 4.0 | pizza | Football | hors d'oeuvre | 4 | Less | 1776 | "Good leaders don't make excuses. Instead, the... | NaN | NaN | NaN | NaN |
| 4 | 8/27/2018 15:05:00 | Queen | 1.7m | Shanghai | Konosuke Matsushita | Austrailia | 4.0 | casserole | Soccer | hors d'oeuvre | 3 | Less | 1985 | Believe in yourself. You are braver than you t... | NaN | NaN | NaN | NaN |


## Problem 1
Drop duplicate entries. Be sure to ignore the timestamp when you do the drop. Show that there are fewer entries in `'df'` after the drop. 
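
One possible approach, sketched on a made-up three-row frame rather than the survey itself, is to pass every column except the timestamp to the `subset` argument of `drop_duplicates()`. The names `toy`, `compare_cols`, and `deduped` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 differ only in Timestamp.
toy = pd.DataFrame({
    "Timestamp": ["8/27/2018 13:54:50", "8/27/2018 13:55:10", "8/27/2018 14:11:00"],
    "city": ["Tokyo", "Tokyo", "LA"],
    "sport": ["Gymnastics", "Gymnastics", "Basketball"],
})

# Compare rows on every column except Timestamp.
compare_cols = [col for col in toy.columns if col != "Timestamp"]
deduped = toy.drop_duplicates(subset=compare_cols)

# Fewer rows after the drop shows that a duplicate was removed.
print(f"rows before: {len(toy)}, rows after: {len(deduped)}")
```

By default the first occurrence of each duplicate is kept, so the earliest submission survives.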

## Problem 2
Inspect, clean, and visualize the series `'spell'` to clearly communicate how people responded to the question. Your visualization should include appropriate titles and axes labels. 

## Problem 3
* Inspect, clean, and visualize the series `'deepdish'` to clearly communicate how people responded to the question. 
* Your visualization should present the percentage of respondents who chose each option rather than the raw count. One approach is to pass `normalize=True` to `value_counts()`.
* Your visualization should include appropriate titles and axes labels. 
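
As a rough sketch of the `normalize=True` approach, the example below uses invented responses (not the survey data) and tidies the text before counting. The resulting Series could then be passed to a bar plot.

```python
import pandas as pd

# Hypothetical responses with inconsistent case and stray whitespace.
responses = pd.Series(["pizza", "Pizza ", "casserole", "no opinion", None, "pizza"])

# Tidy the text first so "Pizza " and "pizza" count as one option.
cleaned = responses.str.strip().str.lower()

# normalize=True returns fractions; missing values are excluded by default.
percents = cleaned.value_counts(normalize=True) * 100
print(percents.round(1))
```

Because missing values are excluded, the percentages describe people who answered, not everyone surveyed.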

## Problem 4
* Inspect, clean, and visualize the series `'likepizza'` to clearly communicate how people responded to the question. 
* The question uses a scale that runs from 1 ('Gross') to 5 ('Awesome').
* Your visualization should include appropriate titles and axes labels. 

## Problem 5
* Inspect, clean, and visualize the series `'shoes'` to clearly communicate how people responded to the question. 
* Remove any missing values and non-numeric values from the series. You can use `.str.isnumeric()` and `.notna()` to accomplish this. Other methods are also fine.
* Note that it is likely pandas is currently treating the individual values as strings. You can convert them to integers using `.astype('int')`
* Your visualization should include appropriate titles and axes labels. 
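
The filtering steps described above might look like the following sketch. The `raw` values are invented; the real column will contain other kinds of free text.

```python
import pandas as pd

# Hypothetical raw responses stored as strings, with gaps and free text.
raw = pd.Series(["3", "12", None, "a lot", "7", "2.5"])

# Drop missing values first, then keep strings made only of digits.
present = raw[raw.notna()]
numeric_only = present[present.str.isnumeric()]

# The surviving strings convert cleanly to integers.
counts = numeric_only.astype("int")
print(counts.tolist())
```

Note that `.str.isnumeric()` rejects `"2.5"` because the decimal point is not a numeric character, so decimal answers are dropped by this filter.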

## Problem 6
* Inspect, clean, and visualize the series `'areacode'` to clearly communicate how people responded to the question.
* Make sure the visualization sorts and displays the area codes appropriately.
* Your visualization should include appropriate titles and axes labels. 
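
Area codes are labels rather than quantities, so one option is to keep them as strings and sort the counts by label. This is only a sketch on invented values; the regular expression is an assumption about how responses are written.

```python
import pandas as pd

# Hypothetical responses with punctuation, whitespace, and a non-answer.
raw = pd.Series(["312", "773", "312 ", "(847)", "N/A", "773", "312"])

# Pull out the first three consecutive digits; non-matches become NaN and are dropped.
codes = raw.str.extract(r"(\d{3})", expand=False).dropna()

# Count each code, then order by label so the axis reads 312, 773, 847.
code_counts = codes.value_counts().sort_index()
print(code_counts)
```

Sorting by count instead (the default order of `value_counts()`) is equally reasonable if the goal is to highlight the most common codes.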

## Problem 7
* Inspect, clean, and visualize the series `'talk'` to clearly communicate how people responded to the question. 
* Your visualization should present the percentage of respondents who chose each option rather than the raw count. One approach is to pass `normalize=True` to `value_counts()`.
* Your visualization should include appropriate titles and axes labels. 

## Problem 8
* Inspect, clean, and visualize the series `'pets'` to clearly communicate how people responded to the question. 
* At a minimum your code should produce a visualization of the percent of responses that contain the following: dog, cat, lizard, fish, bird.
* Optional: include bunny (or variants) as well.
* Your visualization should include appropriate titles and axes labels. 
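
Because one response can mention several animals, counting unique values is not enough here. A minimal sketch, on invented responses, checks each keyword separately with `.str.contains()`; the mean of the resulting boolean mask is the share of responses that mention it.

```python
import pandas as pd

# Hypothetical free-text responses.
pets = pd.Series(["Dog", "2 cats and a dog", "None", "goldfish", None, "Bird, cat"])

keywords = ["dog", "cat", "lizard", "fish", "bird"]
answered = pets.dropna().str.lower()

# For each keyword, the mean of a boolean mask is the fraction of responses containing it.
pet_percent = pd.Series(
    {word: answered.str.contains(word, regex=False).mean() * 100 for word in keywords}
)
print(pet_percent)
```

Plain substring matching is a simplification: `"cat"` would also match a word such as "caterpillar", so inspect the matches before trusting the totals.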

## Problem 9
* Inspect, clean, and visualize the series `'city'` to clearly communicate how people responded to the question.
* Make sure the visualization includes any city mentioned in five or more responses. 
* Make sure your code handles all forms of responses that include 'Sydney' and 'New York'
* Suggestion: the `str.title()` method capitalizes the first letter of every word in a string.
* Your visualization should include appropriate titles and axes labels. 
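
The sketch below shows one way to standardize city names on invented responses. The variant spellings in the lookup dictionary are assumptions; the real list should come from inspecting `value_counts()` on the actual column.

```python
import pandas as pd

# Hypothetical responses with mixed case, abbreviations, and extra detail.
cities = pd.Series(
    ["new york", "NYC", "New York City", "sydney, australia", "Sydney", " chicago"]
)

# Trim whitespace and capitalize each word ("NYC" becomes "Nyc").
tidy = cities.str.strip().str.title()

# Hypothetical lookup of exact variants to collapse into one label.
tidy = tidy.replace({"Nyc": "New York", "New York City": "New York"})

# Collapse anything that mentions Sydney to a single label.
tidy = tidy.mask(tidy.str.contains("Sydney"), "Sydney")

city_counts = tidy.value_counts()
print(city_counts)
```

To keep only cities with five or more mentions, filter the counts, for example `city_counts[city_counts >= 5]`.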

## Problem 10
* Inspect, clean, and visualize the series `'travel'` to clearly communicate how people responded to the question. 
* Your visualization should present the percentage of respondents who chose each option rather than the raw count. One approach is to pass `normalize=True` to `value_counts()`.
* Your visualization should include appropriate titles and axes labels.

## Problem 11
* Inspect, clean, and visualize the series `'superpower'` to clearly communicate how people responded to the question.
* Your code should handle all versions of invisibility and mind reading 
* Your visualization should show the top five responses by count
* Your visualization should include appropriate titles and axes labels.
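
Grouping the different spellings of a superpower can be sketched with pattern matching, as below. The responses and both regular expressions are invented for illustration and will likely need adjusting once the real values are inspected.

```python
import pandas as pd

# Hypothetical responses showing several ways to name the same power.
powers = pd.Series([
    "Invisibility", "being invisible", "Mind reading", "read minds",
    "mind-reading", "Flight", "flying", "Teleportation",
])

tidy = powers.str.strip().str.lower()

# Hypothetical patterns: any text containing the pattern is replaced by one label.
tidy = tidy.mask(tidy.str.contains("invisib"), "invisibility")
tidy = tidy.mask(tidy.str.contains(r"mind.?read|read.*mind"), "mind reading")

# value_counts() sorts by count, so head(5) gives the top five responses.
top_five = tidy.value_counts().head(5)
print(top_five)
```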

## Problem 12
* Inspect, clean, and visualize the series `'year'` to clearly communicate how people responded to the question. 
* Your code should handle any response that includes four or fewer digits. You may drop responses that have more than four digits.
* What is the average year for responses?
  * Your code should handle any responses that contain any version of 'bc'
  * Print the average year, as well as what percentage of responses were included in computing the average 
* Visualize the frequency of responses between 1850 and 2050 (inclusive), exclude all other responses
  * Your visualization should include appropriate titles and axes labels. 


Potential tools:
* `.astype()`, a Series method that converts values to integers when passed the argument `'int'`; it raises an error if a value cannot be converted
* `.isnumeric()`, a Python string method that returns `True` if the string contains only numeric characters
* `.map()` to apply custom functions to values in a series
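
The tools above can be combined roughly as follows. This is a sketch on invented responses, and `parse_year` is a hypothetical helper, not part of pandas. It strips every non-digit character, so a response that mentions two years would be dropped for having too many digits.

```python
import re
import pandas as pd

def parse_year(text):
    """Return a signed integer year, or None when the response is unusable."""
    if not isinstance(text, str):
        return None
    cleaned = text.strip().lower()
    digits = re.sub(r"\D", "", cleaned)   # keep digits only
    if not digits or len(digits) > 4:
        return None
    year = int(digits)
    # Treat spellings such as "bc", "B.C.", or "b.c.e" as a negative year.
    if re.search(r"b\.?\s*c", cleaned):
        year = -year
    return year

# Hypothetical responses covering the main cases.
raw_years = pd.Series(["1985", "2050", "300 BC", "Future", None, "12000", "1776"])
years = raw_years.map(parse_year).dropna().astype("int")

retained = len(years) / len(raw_years) * 100
print(f"retained {retained:.1f}% of responses, mean year {years.mean():.1f}")

# between() is inclusive on both ends by default.
recent = years[years.between(1850, 2050)]
print(recent.tolist())
```

The `bc` pattern is deliberately loose and could match unrelated text that happens to contain those letters, so check which responses it flags.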

In [3]:
# write your cleaning code here

In [4]:
# check your cleaning code here
# print out percentage of original responses that were retained
# then print the mean of the retained responses

In [5]:
# make your visualization here
# Note! sns.histplot() has an argument binwidth, the example below uses binwidth = 10