
In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Data Cleaning

In [0]:
vals = np.array(["4", 5, "2", 2.3, "3.12"]) 
df = pd.DataFrame(vals)
df.at[0, 0] * 4 # the value is the string "4", so * repeats it instead of multiplying

'4444'

In [0]:
cleaned = df.astype(float) # convert every string in the column to a float
cleaned

Unnamed: 0,0
0,4.0
1,5.0
2,2.0
3,2.3
4,3.12


In [0]:
cleaned.at[0, 0] * 4 # works now

16.0
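
The conversion above raises an error as soon as one value cannot be parsed as a number. As a rough sketch of one alternative, `pd.to_numeric` with `errors="coerce"` turns unparseable entries into NaN instead. The `raw` series and its "n/a" entry are made up for illustration.

```python
import pandas as pd

# Hypothetical messy column: numbers stored as strings, plus one unparseable entry
raw = pd.Series(["4", "5", "2", "2.3", "3.12", "n/a"])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric.iloc[0] * 4)   # 16.0
print(numeric.isna().sum())  # 1
```

The coerced NaN can then be handled with the imputation techniques in the next section.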

# Imputation

You will often encounter missing data in your datasets. Imputation means filling in missing values with substitute values; this section also covers the simpler option of dropping them.

Take, for example, this dataframe with several NaN and None values:

In [0]:
vals = np.array([1, np.nan, 3, None, 4, np.nan, 1]) 
df = pd.DataFrame(vals)
df.head(11)

Unnamed: 0,0
0,1.0
1,
2,3.0
3,
4,4.0
5,
6,1.0


Left as they are, these missing values can cause problems with many pandas and matplotlib methods.
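
Before deciding how to handle the gaps, it can help to count them. The sketch below assumes a hypothetical column name, `value`, and shows one typical problem: NaN propagates through plain arithmetic.

```python
import numpy as np
import pandas as pd

# Same values as above, in a hypothetical named column; None becomes NaN here
df_missing = pd.DataFrame({"value": [1, np.nan, 3, None, 4, np.nan, 1]})

# isna() marks each missing cell; sum() counts them per column
missing_per_column = df_missing.isna().sum()
print(missing_per_column)

# Any arithmetic involving NaN yields NaN
total = df_missing["value"].iloc[0] + df_missing["value"].iloc[1]
print(total)  # nan
```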

### 1. Dropping rows with missing values

By far the easiest way of dealing with missing data is just dropping rows that have missing values:

In [0]:
dropped = df.dropna()
dropped

Unnamed: 0,0
0,1
2,3
4,4
6,1


You might also need to drop rows only when a particular column is missing a value. In this case, we can pass the subset argument to .dropna():

In [0]:
df.dropna(subset=[0]) # if you had column names, subset could be a list of column names

Unnamed: 0,0
0,1
2,3
4,4
6,1
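
With a single column, `subset` makes no visible difference. Here is a small sketch with two named columns, where it does. The `ratings` table and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column table with one gap in each column
ratings = pd.DataFrame({
    "Dining": ["Castle", "Newcomb", "Runk", None],
    "Rating": [4, np.nan, 4, 3],
})

# Drop a row only when "Rating" is missing; a missing "Dining" is tolerated
by_rating = ratings.dropna(subset=["Rating"])

# Default behaviour: drop a row when any column is missing
any_missing = ratings.dropna()

print(len(by_rating))    # 3
print(len(any_missing))  # 2
```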


### 2. Filling in missing values

We can fill in missing values in several ways. You can replace missing values with a specific value (like the mean), forward-fill, back-fill, or use more advanced imputation methods such as KNN.

- Replacing missing values with a specific value:

In [0]:
df.fillna(0) # replace missing values with 0

Unnamed: 0,0
0,1
1,0
2,3
3,0
4,4
5,0
6,1


You can also use this to replace missing values with the mean of the values:

In [0]:
df[0].fillna(df[0].mean())

0    1.00
1    2.25
2    3.00
3    2.25
4    4.00
5    2.25
6    1.00
Name: 0, dtype: float64
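
The same idea extends to several columns at once. As a sketch, passing a Series of column means to `fillna` fills each column with its own mean. The `scores` table is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical table with gaps in two numeric columns
scores = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [10.0, 20.0, np.nan],
})

# scores.mean() is a Series indexed by column name, so fillna uses
# each column's own mean for that column's gaps
filled = scores.fillna(scores.mean())

print(filled)
```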

- Another filling strategy that can be useful for time series data is forward-fill and back-fill. Forward-fill replaces each missing value with the last valid value before it; back-fill replaces it with the next valid value after it.

In [0]:
df.ffill() # forward-fill; replaces the older df.fillna(method='ffill') spelling

Unnamed: 0,0
0,1
1,1
2,3
3,3
4,4
5,4
6,1


In [0]:
df.bfill() # back-fill; replaces the older df.fillna(method='bfill') spelling

Unnamed: 0,0
0,1
1,3
2,3
3,4
4,4
5,1
6,1
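
To make the time series case concrete, here is a sketch on a date-indexed series. The dates and readings are made up, and `limit` is an optional argument that caps how many consecutive gaps get filled.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps
dates = pd.date_range("2018-09-01", periods=5, freq="D")
readings = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan], index=dates)

forward = readings.ffill()         # carry the last known value forward
backward = readings.bfill()        # pull the next known value backward
limited = readings.ffill(limit=1)  # fill at most one consecutive gap

print(forward.tolist())   # [1.0, 1.0, 1.0, 4.0, 4.0]
print(backward.tolist())  # [1.0, 4.0, 4.0, 4.0, nan]
print(limited.tolist())   # [1.0, 1.0, nan, 4.0, 4.0]
```

Note that back-fill leaves the last reading missing because there is no later value to pull from.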


## Merging DataFrames

Sometimes you have data from two different sources that you'd like to have in one data frame to analyze. We can do that with pd.concat() and pd.merge().

In [0]:
left = pd.DataFrame({'Dining': ['Castle', 'Newcomb', 'Chick-fil-A', 'Five Guys', 'Runk', 'Subway'], 'Rating': [4, 3, 5, 4, 4, 3]})
left

Unnamed: 0,Dining,Rating
0,Castle,4
1,Newcomb,3
2,Chick-fil-A,5
3,Five Guys,4
4,Runk,4
5,Subway,3


In [0]:
right = pd.DataFrame({'Dining': ['Castle', 'Newcomb', 'Chick-fil-A', 'Five Guys', 'Runk', 'Subway'], 'Price': [8.5, 9.5, 6.5, 7.75, 9.5, 6]})
right

Unnamed: 0,Dining,Price
0,Castle,8.5
1,Newcomb,9.5
2,Chick-fil-A,6.5
3,Five Guys,7.75
4,Runk,9.5
5,Subway,6.0


Now we want to put these into one data table without duplicating the "Dining" column. You can pass both data tables to pd.merge() and use the on= argument to name the column to merge on.

In [0]:
pd.merge(left, right, on='Dining')

Unnamed: 0,Dining,Rating,Price
0,Castle,4,8.5
1,Newcomb,3,9.5
2,Chick-fil-A,5,6.5
3,Five Guys,4,7.75
4,Runk,4,9.5
5,Subway,3,6.0
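
In the example above, every dining hall appears in both tables. When the keys only partly overlap, the `how` argument decides which rows survive. The tables below are a made-up sketch of that case.

```python
import pandas as pd

# Hypothetical tables whose "Dining" values only partly overlap
ratings = pd.DataFrame({"Dining": ["Castle", "Newcomb", "Runk"], "Rating": [4, 3, 4]})
prices = pd.DataFrame({"Dining": ["Castle", "Runk", "Subway"], "Price": [8.5, 9.5, 6.0]})

# Default how="inner": keep only dining halls present in both tables
inner = pd.merge(ratings, prices, on="Dining")

# how="outer": keep every dining hall, with NaN where one side has no match
outer = pd.merge(ratings, prices, on="Dining", how="outer")

print(len(inner))  # 2
print(len(outer))  # 4
```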


Let's try .concat() to see if it does the same thing. It's a little different from .merge() because you have to pass in a list of dataframes that you want to concatenate.

In [0]:
pd.concat([left, right])

Unnamed: 0,Dining,Price,Rating
0,Castle,,4.0
1,Newcomb,,3.0
2,Chick-fil-A,,5.0
3,Five Guys,,4.0
4,Runk,,4.0
5,Subway,,3.0
0,Castle,8.5,
1,Newcomb,9.5,
2,Chick-fil-A,6.5,
3,Five Guys,7.75,
4,Runk,9.5,
5,Subway,6.0,


So that didn't really work, because now we have duplicate rows with missing information. However, pd.concat() can be useful for a different case.

In [0]:
more_data = pd.DataFrame({'Dining': ["O'Hill", "Starbucks", "N2Go", "Burrito Theory"], 'Rating': [3, 4, 4, 3]})
more_data

Unnamed: 0,Dining,Rating
0,O'Hill,3
1,Starbucks,4
2,N2Go,4
3,Burrito Theory,3


In [0]:
pd.concat([left, more_data])

Unnamed: 0,Dining,Rating
0,Castle,4
1,Newcomb,3
2,Chick-fil-A,5
3,Five Guys,4
4,Runk,4
5,Subway,3
0,O'Hill,3
1,Starbucks,4
2,N2Go,4
3,Burrito Theory,3
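
Notice that the index labels repeat (0 to 5, then 0 to 3) because each table keeps its own index. As a sketch, `ignore_index=True` renumbers the stacked rows. The two small tables here are trimmed-down stand-ins for the ones above.

```python
import pandas as pd

# Small stand-ins for the tables above
first = pd.DataFrame({"Dining": ["Castle", "Newcomb"], "Rating": [4, 3]})
second = pd.DataFrame({"Dining": ["O'Hill", "Starbucks"], "Rating": [3, 4]})

# ignore_index=True discards the original labels and numbers rows 0..n-1
stacked = pd.concat([first, second], ignore_index=True)

print(stacked.index.tolist())  # [0, 1, 2, 3]
```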


Concat can actually add columns instead of rows too. Use the axis=1 argument. Here's a case where it might be useful.

In [0]:
more_info = pd.DataFrame({'Popularity': [8, 5, 10, 7, 8, 7], 'Hours': ["7:00-9:00", "7:00-8:00", "11:00-8:00", "11:00-8:00", "7:00-8:00", "11:00-8:00"]})
more_info

Unnamed: 0,Hours,Popularity
0,7:00-9:00,8
1,7:00-8:00,5
2,11:00-8:00,10
3,11:00-8:00,7
4,7:00-8:00,8
5,11:00-8:00,7


In [0]:
df = pd.concat([right, more_info], axis=1)
df

Unnamed: 0,Dining,Price,Hours,Popularity
0,Castle,8.5,7:00-9:00,8
1,Newcomb,9.5,7:00-8:00,5
2,Chick-fil-A,6.5,11:00-8:00,10
3,Five Guys,7.75,11:00-8:00,7
4,Runk,9.5,7:00-8:00,8
5,Subway,6.0,11:00-8:00,7


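Note the difference between pd.concat(axis=1) and pd.merge(). concat places the tables side by side and lines rows up by their index, so we would use it when the tables share no column and their rows are already in the same order. merge lines rows up by the values in a shared column, so we would use it when there is one.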
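
To see why the distinction matters, here is a sketch where the second table lists the same dining halls in a different order. The ratings are made up so that each row is easy to tell apart.

```python
import pandas as pd

prices = pd.DataFrame({"Dining": ["Castle", "Newcomb", "Runk"], "Price": [8.5, 9.5, 9.5]})
# Same dining halls, listed in a different order (hypothetical ratings)
ratings = pd.DataFrame({"Dining": ["Runk", "Castle", "Newcomb"], "Rating": [2, 4, 3]})

# concat pairs rows by index label, so row 0 puts Runk's rating next to Castle
side_by_side = pd.concat([prices, ratings[["Rating"]]], axis=1)

# merge pairs rows by the value in "Dining", so the order does not matter
merged = pd.merge(prices, ratings, on="Dining")

print(side_by_side.loc[0, "Rating"])                               # 2
print(merged.loc[merged["Dining"] == "Castle", "Rating"].iloc[0])  # 4
```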

### A brief aside: cleaning time series data

Let's say UVA decided to close all dining halls one hour later. We could go in and manually fix the hours, but that's not efficient if there were more than 10 dining halls in the table. Instead, we can extract the times we need and turn them into timestamps, which make it easy to add and subtract minutes, hours, etc.

We need to split the hours into open and close time first.

In [0]:
df['Open'] = df['Hours'].apply(lambda x: x.split('-')[0])
df['Close'] = df['Hours'].apply(lambda x: x.split('-')[1])
df

Unnamed: 0,Dining,Price,Hours,Popularity,Open,Close
0,Castle,8.5,7:00-9:00,8,7:00,9:00
1,Newcomb,9.5,7:00-8:00,5,7:00,8:00
2,Chick-fil-A,6.5,11:00-8:00,10,11:00,8:00
3,Five Guys,7.75,11:00-8:00,7,11:00,8:00
4,Runk,9.5,7:00-8:00,8,7:00,8:00
5,Subway,6.0,11:00-8:00,7,11:00,8:00


In [0]:
df['Open'] = pd.to_datetime(df['Open'])
df['Close'] = pd.to_datetime(df['Close'])
df

Unnamed: 0,Dining,Price,Hours,Popularity,Open,Close
0,Castle,8.5,7:00-9:00,8,2018-09-30 07:00:00,2018-09-30 09:00:00
1,Newcomb,9.5,7:00-8:00,5,2018-09-30 07:00:00,2018-09-30 08:00:00
2,Chick-fil-A,6.5,11:00-8:00,10,2018-09-30 11:00:00,2018-09-30 08:00:00
3,Five Guys,7.75,11:00-8:00,7,2018-09-30 11:00:00,2018-09-30 08:00:00
4,Runk,9.5,7:00-8:00,8,2018-09-30 07:00:00,2018-09-30 08:00:00
5,Subway,6.0,11:00-8:00,7,2018-09-30 11:00:00,2018-09-30 08:00:00


The close hours are actually incorrect: the times were written in 12-hour notation without AM/PM, so they were parsed as morning times on the 24-hour clock. We need to add 12 hours to each one, plus the 1 hour for UVA closing the dining halls later.

In [0]:
df['Close'] = df['Close'] + pd.Timedelta(hours=13)
df

Unnamed: 0,Dining,Price,Hours,Popularity,Open,Close
0,Castle,8.5,7:00-9:00,8,2018-09-30 07:00:00,2018-09-30 22:00:00
1,Newcomb,9.5,7:00-8:00,5,2018-09-30 07:00:00,2018-09-30 21:00:00
2,Chick-fil-A,6.5,11:00-8:00,10,2018-09-30 11:00:00,2018-09-30 21:00:00
3,Five Guys,7.75,11:00-8:00,7,2018-09-30 11:00:00,2018-09-30 21:00:00
4,Runk,9.5,7:00-8:00,8,2018-09-30 07:00:00,2018-09-30 21:00:00
5,Subway,6.0,11:00-8:00,7,2018-09-30 11:00:00,2018-09-30 21:00:00


(To isolate only the hour, use the .dt.hour accessor, which returns it as an integer rather than a timestamp.)
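
If the source data carried an explicit AM/PM marker, the manual 12-hour shift would not be needed. The sketch below assumes made-up closing times in that format and parses them with an explicit `format` string.

```python
import pandas as pd

# Hypothetical closing times in 12-hour notation with an explicit AM/PM marker
close = pd.Series(["9:00 PM", "8:00 PM", "8:00 PM"])

# %I is the 12-hour clock hour and %p the AM/PM marker, so no manual +12 is needed
close_ts = pd.to_datetime(close, format="%I:%M %p")

# Shift every closing time one hour later
close_ts = close_ts + pd.Timedelta(hours=1)

# .dt.hour pulls out just the hour as an integer
print(close_ts.dt.hour.tolist())  # [22, 21, 21]
```

Because no date is given, pandas fills in a placeholder date; only the time part is meaningful here.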

# Data Visualization

See the Kaggle notebooks in the 9/30 slides.

Seaborn pairplot documentation: https://seaborn.pydata.org/generated/seaborn.pairplot.html



In [0]:
from google.colab import files

uploaded = files.upload()

Saving titanic.csv to titanic.csv


In [0]:
import pandas as pd
df = pd.read_csv('titanic.csv', sep='\t')

In [0]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
