# Pandas & Data Cleaning

# Pandas Info

* __[Pandas Docs](https://pandas.pydata.org/)__
* __[I learned Pandas from this book](https://wesmckinney.com/book/)__
* __[YAHT: Yet another handy tutorial](https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/)__


# Pandas: What is it good for?

* Tabular or spreadsheet-like data in which each column may be a different type (string, numeric, date, or otherwise). 
* Multidimensional arrays (matrices).
* Multiple tables of data interrelated by key columns (what would be primary or foreign keys for a SQL user).
* Evenly or unevenly spaced time series.

Source: [Python for Data Analysis, Preliminaries](https://wesmckinney.com/book/preliminaries.html)

# Pandas: What is it good for?

Pandas integrates nicely with other Python libraries:

* NumPy - Matrices 
* SciPy - Statistics
* Matplotlib - Plotting
* Scikit-learn - Machine Learning

# Pandas Series

A __series__ is a 1-dimensional array-like object containing a sequence of same-typed values and an associated index of labels.

The default index is numeric, starting at 0, just like a list.

In [None]:
import pandas as pd

ser = pd.Series([8,6,7,5,3,0,9])
ser

# Pandas Series

Pass index labels using the __index__ argument.

In [None]:
serIdx = pd.Series([8,6,7,5,3,0,9], index=["j","e","n","N","y","_","!"])
serIdx

In [None]:
#select a value from a Series with its index
print(ser[1], serIdx["N"])

In [None]:
#assign values similarly

ser[1]=9

ser

In [None]:
#filter series using boolean operators

ser[ser>4]

# I read this as select from ser WHERE value in ser > 4

In [None]:
# do scalar arithmetic, e.g. add 3

ser+3

In [None]:
# e.g. square
ser**2

In [None]:
#a Series is like a fixed-length, ordered dictionary that maps index labels to values
data = {"j":8,"e":6,"n":7,"N":5,"y":3,"_":0,"!":9}
serIdx2 = pd.Series(data)
serIdx2

In [None]:
# convert that back to a dict
serIdx2.to_dict()

In [None]:
#note, "J" is not a key in data, so pandas uses NaN for that label (and "j" is left out)
#note also, pandas converts the values to float, because NaN is a float
index=["J","e","n","N","y","_","!"]
serIdx3 = pd.Series(data, index=index)
serIdx3

In [None]:
# isna and notna are each available as a Series method and as a top-level pandas function
serIdx3.isna(), pd.notna(serIdx3)

In [None]:
#Automatic Index Alignment
serIdx2 + serIdx3
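
Here's a minimal sketch of what alignment does when the labels only partly overlap. The Series `left` and `right` and their values are made up for illustration; they are not part of the examples above.

```python
import pandas as pd

# Two small Series whose labels only partly overlap (made-up values)
left = pd.Series({"a": 1, "b": 2, "c": 3})
right = pd.Series({"b": 10, "c": 20, "d": 30})

# Values are matched by label, not position.
# "a" and "d" exist on one side only, so their sums are NaN.
total = left + right

# The add method takes fill_value to treat a missing label as 0 instead
filled = left.add(right, fill_value=0)

print(total)
print(filled)
```

Whether a missing label should count as 0 or stay NaN is a judgment call that depends on what the data mean.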

In [None]:
#Name the Series & its index
serIdx3.name = "demoSeries"
serIdx3.index.name = "demoIndex"
serIdx3

In [None]:
#assign a new index
serIdx3.index = ["j","e","n","N","y","_","!"]
serIdx3

# Pandas Data Frames

A __Data Frame__ is a rectangular table of data with:

* An ordered, named collection of columns, each of which can hold a different data type
* A row index
* A column index

In [None]:
#enough of this stupid song reference.
#now pulling from here: [pandas book](https://wesmckinney.com/book/pandas-basics.html)

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

In [None]:
#inspect head
frame.head()

In [None]:
#inspect tail, specifying number of rows, which works on head too
frame.tail(3)

In [None]:
#specify column order
pd.DataFrame(data, columns=["year", "state", "pop"])

In [None]:
# select columns like a dictionary, or as an attribute

frame["year"], frame.year

In [None]:
# select rows with loc using the index label and iloc using the integer position
frame.index = ["a","b","c","d","e","f"]
frame.loc["b"], frame.iloc[1]

In [None]:
#select multiple rows in loc or iloc by passing a list
frame.loc[["c","d"]]

In [None]:
#select multiple rows using a slice
frame.iloc[2:4]
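
One detail worth knowing about slices: `loc` includes the end label, while `iloc` excludes the end position, like a regular Python slice. A small sketch, using a cut-down stand-in for `frame` so it runs on its own:

```python
import pandas as pd

# A cut-down stand-in for the frame used above
frame = pd.DataFrame({"pop": [1.5, 1.7, 3.6, 2.4]}, index=["a", "b", "c", "d"])

by_label = frame.loc["b":"c"]    # label slice: both "b" and "c" are included
by_position = frame.iloc[1:2]    # position slice: stop position 2 is excluded

print(by_label)
print(by_position)
```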

In [None]:
#create a column using NaN values from NumPy
import numpy as np

frame["debt"]=np.nan

frame

In [None]:
#Modify column by assignment
# this modifies the underlying data

frame["debt"] = 0

frame

In [None]:
# set a column as the index
# set_index returns a new DataFrame; frame itself is not modified
frame.set_index("state")

In [None]:
# use multiple columns for a multi-index

frame.set_index(["year","state"])

In [None]:
#the data is unchanged
frame

In [None]:
# make the change permanent by assigning the result back to frame

frame =frame.set_index(["year","state"])
frame

In [None]:
#undo the change using reset_index and the inplace keyword argument
frame.reset_index(inplace=True)
frame
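
The pattern above (a method returns a new object unless you reassign or pass `inplace=True`) trips people up, so here is a self-contained sketch of it. The two-row `frame` and the names `indexed` and `restored` are illustrative only.

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"],
                      "year": [2000, 2001],
                      "pop": [1.5, 2.4]})

# set_index returns a new DataFrame and leaves frame alone
indexed = frame.set_index("state")
print(frame.columns.tolist())     # "state" is still a column here
print(indexed.columns.tolist())   # "state" is now the index of the new object

# reset_index also returns a new DataFrame unless inplace=True is passed
restored = indexed.reset_index()
print(restored.columns.tolist())
```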

In [None]:
# there are some built in analysis functions
#(number of rows, number of columns)
frame.shape 

In [None]:
#total number of values (rows x columns)
frame.size

In [None]:
#data type of each column
frame.dtypes

In [None]:
#mean, median, variance
frame['pop'].mean(), frame['pop'].median(), frame['pop'].var()

In [None]:
#min, max
frame['pop'].min(), frame['pop'].max()

In [None]:
#unique values
frame['state'].unique(), frame['pop'].unique()

In [None]:
#sort numeric values
frame['pop'].sort_values(ascending=False)
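
If you want several of these summaries at once, `describe` and `value_counts` are handy. A quick sketch on the same state and population values used above:

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
                      "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]})

# count, mean, std, min, quartiles, and max in one call
summary = frame["pop"].describe()

# number of rows for each unique value
counts = frame["state"].value_counts()

print(summary)
print(counts)
```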

# Pandas Data Reading & Writing

Now that you know enough pandas to be dangerous, you can read a variety of formats, including:

* csv
* excel
* html
* json
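
Writing works the same way in reverse. Here's a minimal round-trip sketch that writes a small frame to CSV text and reads it back. It uses an in-memory buffer instead of a file path so it runs anywhere; the two-row `frame` is made up.

```python
import io
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"],
                      "year": [2000, 2001],
                      "pop": [1.5, 2.4]})

# Write to an in-memory text buffer; a file path would work the same way
buffer = io.StringIO()
frame.to_csv(buffer, index=False)   # index=False leaves the row index out of the file

# Rewind the buffer and read the CSV text back into a new DataFrame
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(round_trip)
```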

In [None]:
#change csvPath to the path for your file
csvPath = r"/Users/brandanscully/Documents/GitHub/DATA_510/handyrefs.csv"

# there are many controls for ordering, skipping, and setting column preferences
# we're not going to get into them now
handyrefs = pd.read_csv(csvPath)

handyrefs

# Pandas Data Reading & Writing

The point here is that pandas is very flexible for managing tabular data.

# Data Cleaning!

So far, we have talked about how to get data from a source, including:

* api
* website via scraping
* text file
* csv file

# Data Cleaning!

Now that we have the data, we need to talk about what to do when it's "dirty".

First, we need to think about data types, which can be confusing.

# Data Types

* __Discrete__: Countable, e.g. individuals in a group 
* __Continuous__: Measurable, can take any value within a range, including fractional values, e.g. temperature

[Discrete v. Continuous](https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/discrete-vs-continuous-variables/)

# Data Types

* __Qualitative__: Type or Label data
* __Quantitative__: Numeric data

# Qualitative Data

* __Nominal__: Category labels with no specific hierarchy
* __Ordinal__: Category labels with a meaningful order, but no measurable difference between them

# Quantitative Data

* __Interval__: Numeric data that preserve order and difference along known, equal intervals
* __Ratio__: Numeric data that preserves order, difference, and has a "true 0".

# Nominal Data

If we were to partition our class by student name, we would have 8 groups:

Lydia, Will, Maddie, Sheldon, Trevor, Giselle, David, Cayden

We could then assign a __dummy variable__ (a placeholder for another value) for each of those 8 groups, e.g. 0-7.

Lydia:0, Will:1, Maddie:2, Sheldon:3, Trevor:4, Giselle:5, David:6, Cayden:7

The order of labels is arbitrary.

The difference between labels is uninformative.
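
In pandas, one way to get integer codes like these is `pd.factorize`, which numbers labels in order of first appearance. This is just a sketch of the idea; the variable names are illustrative.

```python
import pandas as pd

names = pd.Series(["Lydia", "Will", "Maddie", "Sheldon",
                   "Trevor", "Giselle", "David", "Cayden"])

# codes holds one integer per row; labels holds the unique names
codes, labels = pd.factorize(names)

# Pair each name with its code. The numbers are identifiers only:
# sorting by them or subtracting them tells us nothing about the students.
lookup = dict(zip(labels, range(len(labels))))
print(lookup)
```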

# Ordinal Data

Ordinal data assigns an inherent order to label data.

Charleston's temperature can be described as:

* "Very hot"
* "hot"
* "cold"
* "very cold"

The labels tell us something meaningful about temperature, absent actual temperature ranges.

Arbitrarily defining temperature ranges doesn't affect the ordinal labels:

* "Very hot": T >= 100 F
* "hot": 100 > T >= 75
* "cold": 75 > T >= 60 
* "very cold": T < 60

Assigning a _dummy variable_ to ordinal data must preserve the order.

* "Very hot": 0
* "hot": 1
* "cold": 2
* "very cold": 3
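
pandas can hold ordinal labels as an ordered categorical, which keeps the order attached to the labels. A minimal sketch, assuming the four labels above and a handful of made-up observations:

```python
import pandas as pd

# Order matters here: it defines which label comes before which
levels = ["Very hot", "hot", "cold", "very cold"]

# Made-up observations for illustration
observed = ["cold", "Very hot", "hot", "very cold", "hot"]
temps = pd.Series(pd.Categorical(observed, categories=levels, ordered=True))

# Integer codes follow the order of levels, matching the mapping above
codes = temps.cat.codes

# Comparisons use the category order, not alphabetical order
before_cold = temps[temps < "cold"]

print(codes.tolist())
print(before_cold.tolist())
```

The codes preserve order, but the gap between 0 and 1 still says nothing about how much hotter "Very hot" is than "hot".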

# Interval Data

Interval data have labels and order and a meaningful difference between known equal intervals.

Sticking with temperature, the difference between 100 F and 75 F is 25 F, and the difference between 75 F and 60 F is 15 F.

0 F and -10 F are _relatively_ cold, but 0 F does not mean "no temperature". It is just the point halfway between 1 F and -1 F.

Without a "true 0", addition and subtraction are meaningful but ratios are not: 100 F is not "twice as hot" as 50 F.

# Ratio Data 

Ratio data have labels, order, a meaningful difference between known equal intervals, and a "true 0".

A __True 0__ means that a value of 0 indicates a complete absence of the quantity being measured.

0 Kelvin is absolute zero, the lowest possible temperature. 0 F and 0 C are both warmer than 0 K; they are arbitrary points on their scales, not an absence of temperature.

Ratio data can be divided, multiplied, etc.
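
A quick bit of arithmetic shows why the scale matters. The sketch below compares 100 F and 50 F (values chosen for illustration) as a ratio in Fahrenheit and again after converting to Kelvin. The helper name `fahrenheit_to_kelvin` is made up for this example.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert degrees Fahrenheit to Kelvin."""
    return (deg_f - 32) * 5 / 9 + 273.15

hot_f, mild_f = 100.0, 50.0

# Differences are meaningful on an interval scale
difference_f = hot_f - mild_f

# The Fahrenheit ratio depends on where the scale puts its zero
naive_ratio = hot_f / mild_f

# Kelvin has a true 0, so this ratio is physically meaningful
kelvin_ratio = fahrenheit_to_kelvin(hot_f) / fahrenheit_to_kelvin(mild_f)

print(difference_f, naive_ratio, round(kelvin_ratio, 3))
```

On the Kelvin scale, 100 F is only about 10% warmer than 50 F, not twice as warm.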

# Data Types Explainers

Here are some handy references if this is confusing:

* [Levels of Measurement](https://careerfoundry.com/en/blog/data-analytics/data-levels-of-measurement/)
* [Measurement Scales](https://www.statsdirect.com/help/basics/measurement_scales.htm#:~:text=The%20interval%20measurement%20scale%20is,treat%20discrete%20values%20as%20continuous.)
* [Ratio Data](https://careerfoundry.com/en/blog/data-analytics/what-is-ratio-data/)
* [Data Types](https://builtin.com/data-science/data-types-statistics)

# Missing Data & Sentinel Values

There is a difference between an observation of value 0, and no observation.

If you are measuring the height of something, a measurement of 0 means the object has no height.

If you don't record a measurement for an object, it may have 0 height, but that's not necessarily true.

__Null Data__ is the case where data are missing.

You will need to make a decision of what to do with this data.

The same goes for the case where a __Sentinel value__ stands in for an observation.

A __Sentinel value__ is a placeholder used to represent another value or set of values.

If only observations in the range 1-10 are of interest and an observation is more than 10, it might be recorded with a sentinel value like -999.
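
Here's a small sketch of why a sentinel needs handling before analysis. The `readings` values are made up, with -999 standing in for out-of-range observations.

```python
import numpy as np
import pandas as pd

# Made-up readings; -999 marks observations outside the range of interest
readings = pd.Series([3, 7, -999, 5, -999, 9])

# Treating the sentinel as a real number wrecks the summary
naive_mean = readings.mean()

# Swap the sentinel for NaN so pandas skips it in calculations
cleaned = readings.replace(-999, np.nan)
clean_mean = cleaned.mean()

print(naive_mean, clean_mean)
```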

Python in general, and Pandas specifically, have tools built in to help us with these issues.

Starting with the [read_csv documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html):

* usecols=None
* true_values=None 
* false_values=None
* na_values=None
* na_filter=True
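* delim_whitespace=False (deprecated as of pandas 2.2; `sep=r"\s+"` is the suggested replacement)

Together, these control which columns are read and how pandas treats missing values, sentinel values, boolean flags, whitespace characters, etc. at the read-in stage.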
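
To make a few of these concrete, here is a minimal sketch that reads a tiny made-up CSV from an in-memory string. The column names (`site`, `height`, `flag`, `notes`) and the -999 sentinel are assumptions for illustration, not from a real file.

```python
import io
import pandas as pd

# A tiny made-up CSV: -999 is a sentinel, and one height is simply blank
raw = io.StringIO(
    "site,height,flag,notes\n"
    "A,12,yes,ok\n"
    "B,-999,no,check\n"
    "C,,yes,ok\n"
)

table = pd.read_csv(
    raw,
    usecols=["site", "height", "flag"],  # skip the notes column entirely
    na_values=["-999"],                  # treat the sentinel as missing
    true_values=["yes"],                 # read yes/no as booleans
    false_values=["no"],
)

print(table)
print(table.dtypes)
```

Handling these at read-in keeps the sentinel from ever showing up as a real number in later calculations.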

Once data are read in, we can access some data cleaning functions right in pandas.

In [None]:
#chained indexing: frame.iloc[0] returns a temporary object,
#so the assignment may never reach frame, and pandas typically warns about it
frame.iloc[0]['debt']=np.nan

In [None]:
#this is still chained indexing, just with loc instead of iloc
frame.loc[3]['debt']=np.nan

In [None]:
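#select the row label and the column in a single .loc call, then assign
#debt holds integers at this point, so cast to float first so it can store NaN
frame['debt'] = frame['debt'].astype(float)
frame.loc[0, 'debt'] = np.nan
frame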
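
As a standalone sketch of the safe pattern: put the row and the column in one `loc` (or `iloc`) call, so pandas writes to the frame itself rather than to a temporary copy. The three-row `frame` here is a made-up stand-in.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada", "Nevada"],
                      "debt": [0.0, 0.0, 0.0]})

# Row label and column name in ONE .loc call
frame.loc[0, "debt"] = np.nan

# iloc works the same way with integer positions (row 2, column 1)
frame.iloc[2, 1] = np.nan

print(frame)
```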

In [None]:
#dropna returns a copy without the rows that contain null values; frame itself is unchanged
frame.dropna()

In [None]:
frame

In [None]:
#or use inplace = True keyword argument to make it permanent
frame.dropna(inplace=True)
frame
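
Dropping every row with any null is the bluntest option. Two gentler alternatives are sketched below on a made-up frame: dropping only when a specific column is null, and filling nulls with a chosen value. Filling with 0 is an assumption about the data, since 0 and "not observed" are different things.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing state and one missing debt
frame = pd.DataFrame({"state": ["Ohio", "Nevada", None],
                      "debt": [1.0, np.nan, 3.0]})

# Drop rows only where debt is missing; the row with a missing state survives
only_debt = frame.dropna(subset=["debt"])

# Fill missing debt with 0 and leave other columns alone
filled = frame.fillna({"debt": 0.0})

print(only_debt)
print(filled)
```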

In [None]:
# if we care about whether a field should be retained or dropped for analysis, we can start applying cleaning rules
# In this case the variance of 'debt' is 0
frame['debt'].var()

In [None]:
#Statistically, var=0 means every value is identical (homogeneous), so the column carries no information
#for data read from a file, we could omit such a column on read-in by leaving it out of usecols
#or we can delete it

del frame['debt']
frame
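
With more than a handful of columns, checking variance one column at a time gets tedious. Here's one possible sketch that finds every constant column by counting unique values, which also works for non-numeric columns. The frame and the name `constant_cols` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Ohio", "Nevada"],
                      "pop": [1.5, 1.7, 2.4],
                      "debt": [0, 0, 0]})

# A column with at most one distinct value (counting NaN) carries no information
constant_cols = [col for col in frame.columns
                 if frame[col].nunique(dropna=False) <= 1]

# drop returns a new DataFrame without those columns
trimmed = frame.drop(columns=constant_cols)

print(constant_cols)
print(trimmed)
```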

In [None]:
#a Series has an apply method that runs a function on each value in the column; here the function is a lambda
handyrefs['Link'].apply(lambda x: len(x))

Clearly, Link 6 is an outlier so we can investigate it.

In [None]:
# Alternatively, we can apply some formatting operations to fix data
handyrefs['Link'].apply(lambda x: x.split('://')[1])
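
For string columns, the `.str` accessor does the same jobs without a lambda. The sketch below uses a made-up `links` Series, since the contents of the CSV aren't shown here; the third entry deliberately has no `://` to show one way of coping with inconsistent values.

```python
import pandas as pd

# Made-up stand-in for a column of links; the last one has no scheme
links = pd.Series(["https://pandas.pydata.org/",
                   "https://wesmckinney.com/book/",
                   "pandas.pydata.org/docs"])

# Length of each string
lengths = links.str.len()

# Split on "://" and keep the last piece, so a link with no scheme is returned whole
without_scheme = links.str.split("://").str[-1]

print(lengths)
print(without_scheme)
```

Taking the last piece instead of index 1 avoids an error on values that have nothing to split.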

This is just the beginning; we will look at data cleaning in more depth next week.

# YAIBP: Yet Another Inspirational Blog Post

[How Text Messages Change After Having a Baby](http://adashofdata.com/2017/09/05/how-text-messages-change-after-having-a-baby/)