# More Pandas and a Data Cleaning Example

## Housekeeping

* Last week's material.
* HW2, project proposals.
* The midterm.

# Last Week's Material

* Pandas is great for dealing with spreadsheet type data.
* Columns/Indices make it versatile for querying data, but these can be tricky.
* Pandas is well documented - look there first with questions.
* Practice is the best way to get better with Pandas, text cleaning, & coding in general.

# Project Proposals

* I'm STOKED.
* Everyone is doing something unique and interesting.
* The goal is to learn something, and hopefully have something to demonstrate your skill.
* I'm here to help you succeed.
* Get to work!

# Midterm

* Opens in OAKS Sep 28, 2022 9:00 PM EDT
* Closes in OAKS Oct 4, 2022 11:59 PM EDT
* 29 questions: T/F, multiple choice, matching.
* Open class notebooks. 
* Please don't collaborate.

# Anything Else?

# Data Cleaning with Pandas

I'm referencing a few tutorials.

* [Oil Spills & Iris dataset](https://machinelearningmastery.com/basic-data-cleaning-for-machine-learning)
* In turn, this tutorial references Kuhn, M., and Johnson, K. (2019) _Feature Engineering and Selection: A Practical Approach for Predictive Models_ (1st ed). Chapman & Hall/CRC Data Science Series.

You should also check out this [kaggle tutorial](https://www.kaggle.com/code/ashrafkhan94/oil-spill-imbalanced-classification/notebook).

And the [pandas docs](https://pandas.pydata.org/).

In [None]:
import pandas as pd

# How do we think about data?

* One way is to focus explicitly on the values.
* Another way is to think about the big picture.

Both of these are valuable.

# Iris Dataset

A balanced dataset that describes 3 species (classes) of Iris flower, each with 50 instances (150 total). Each observation has 4 measurements of different parts of the flower.

[Fisher, R.A. (1936). "The use of multiple measurements in taxonomic problems". _Annals of Eugenics_, 7, Part II 179-188.](https://onlinelibrary.wiley.com/doi/10.1111/j.1469-1809.1936.tb02137.x)

[Iris Dataset](https://archive.ics.uci.edu/ml/datasets/iris)

In [None]:
# this dataset has been adulterated
iris_data = r"https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
columns=["Sepal length in cm", "sepal width in cm", "petal length in cm", "petal width in cm", "class"]
iris_df = pd.read_csv(iris_data, header=None, names=columns)

In [None]:
iris_df.head()

In [None]:
iris_df.shape

One of the simplest errors to check for is duplicated data.

Pandas has a function for that.

In [None]:
# we do have duplicates: compare rows 34 and 37.
iris_df.iloc[33:38]

In [None]:
iris_df[iris_df.duplicated()]

Rows 34 and 37 duplicate row 9.
Row 142 duplicates row 101.
We can confirm this by making a pandas selection.

In [None]:
iris_df[(iris_df["Sepal length in cm"]==4.9)|(iris_df["Sepal length in cm"]==5.8)]
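
By default `duplicated` flags only the later copies, which is why row 9 and row 101 don't show up in the output above. Passing `keep=False` flags every member of a duplicate group. Here is a minimal sketch on a made-up frame (the `toy` name and its values are illustrative, not the iris data):

```python
import pandas as pd

# Hypothetical toy frame: rows 0, 2 and 3 are identical.
toy = pd.DataFrame({
    "length": [4.9, 5.8, 4.9, 4.9],
    "width": [3.1, 2.7, 3.1, 3.1],
})

# Default keep="first": only the later copies are flagged.
later_copies = toy[toy.duplicated()]

# keep=False: every row that has a twin is flagged, including the first one.
all_copies = toy[toy.duplicated(keep=False)]

print(later_copies.index.tolist())
print(all_copies.index.tolist())
```

The same call on the real data would be `iris_df[iris_df.duplicated(keep=False)]`.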

In [None]:
#they are easily dropped with drop_duplicates and the inplace keyword argument
iris_df.drop_duplicates(inplace=True)
iris_df

In [None]:
iris_df.shape

In [None]:
#pandas can tell us a little about the data
iris_df.describe()

In [None]:
#A scatter matrix gives us a quick way to compare data relationships.
pd.plotting.scatter_matrix(iris_df)

Variables 2 and 3 (petal length and petal width, counting from 0) seem to have a strong linear relationship.

Everything else seems like there could be embedded relationships, given that we're dealing with 3 labels.
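
One way to put a number on "strong linear relationship" is the Pearson correlation that `DataFrame.corr` computes. This is a small sketch on made-up measurements, not the iris data; the column names `a`, `b`, `c` are placeholders.

```python
import pandas as pd

# Hypothetical measurements: b is exactly 2 * a + 1,
# while c has no linear relationship with a.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [3.0, 5.0, 7.0, 9.0, 11.0],
    "c": [2.0, 1.0, 3.0, 1.0, 2.0],
})

# Pairwise Pearson correlation between every pair of columns.
corr = toy.corr()
print(corr.round(2))
```

On `iris_df` the string-valued `class` column would have to be left out first, for example by selecting only the four measurement columns before calling `corr`.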

In [None]:
pd.plotting.scatter_matrix(iris_df[iris_df['class']=='Iris-versicolor'])

In [None]:
pd.plotting.scatter_matrix(iris_df[iris_df['class']=='Iris-virginica'])

In [None]:
pd.plotting.scatter_matrix(iris_df[iris_df['class']=='Iris-setosa'])

In [None]:
#sometimes, you want to assign a value based on some other value, such as when plotting
iris_df.plot.scatter(x="Sepal length in cm",y="sepal width in cm", c="class")

In [None]:
# we can build a dict of integer values to quiet that error, and create a new column by mapping it
int_class={"Iris-versicolor":0, "Iris-setosa":1, "Iris-virginica":2}
iris_df["int_class"] = iris_df["class"].map(int_class)

In [None]:
iris_df.head()

In [None]:
iris_df.plot.scatter(x="Sepal length in cm",y="sepal width in cm", c="int_class")
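
Writing the dict by hand is fine for three labels. With many labels, `pd.factorize` can build the integer codes for you. The sketch below uses a short made-up series (`labels` is illustrative); note that the codes follow order of first appearance, so they will not necessarily match the hand-written dict above.

```python
import pandas as pd

# Hypothetical label column with repeated class names.
labels = pd.Series(["Iris-setosa", "Iris-virginica", "Iris-setosa", "Iris-versicolor"])

# codes: one integer per row; uniques: the label each integer stands for.
codes, uniques = pd.factorize(labels)

print(codes)
print(list(uniques))
```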

# Oil Spills Dataset

An imbalanced dataset that describes 41 oil slicks and 896 non-oil slicks.

Each case includes a patch number (column 0), a class label (1 = slick, 0 = no slick), and 48 numerical features derived from computer vision analysis of satellite imagery.

[Kubat, M., Holte, R., & Matwin, S. (1998) Machine learning for the detection of oil spills in satellite radar images. _Machine Learning_, 30, 195-215.](https://link.springer.com/content/pdf/10.1023/A:1007452223027.pdf)

In [None]:
oil_data = r"https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv"
oil_df = pd.read_csv(oil_data, header=None)

oil_df.head()

By calling the head of the dataframe, we can see that it has 50 columns as expected.

To determine the shape of the dataset, we can call oil_df.shape.

In [None]:
oil_df.shape #50 columns, 937 rows = 41 slicks + 896 non-slicks

# When is data _valuable_?

When it tells us something!

So, how can we tell if our data is telling us something?

A quick way to explore your data is to look at each column's data type and number of unique values.

In [None]:
# use the dtypes command to get the data type of each column
oil_df.dtypes

In [None]:
#the pandas describe() method runs some simple statistics on each field.
oil_df.describe()

In [None]:
# alternatively, summarize the integer columns using the select_dtypes command and the len() command
print('There are {} columns with integer data'.format(len(oil_df.select_dtypes(include=['int']).columns)))
oil_df.select_dtypes(include=['int']).columns

# What integer data would make sense?

* Column 0 is an integer patch number.
* We know that there is a class label (1 = slick, 0 = no slick).
* This is numerical data from an automated process, so it would make sense if data were encoded as ordinal/categorical.

In [None]:
# we can also summarize the float columns using the select_dtypes command and the len() command
print('There are {} columns with float data'.format(len(oil_df.select_dtypes(include=['float']).columns)))
oil_df.select_dtypes(include=['float']).columns

In [None]:
# use the nunique command to get the number of unique values per column.
# what fields are interesting?
counts = oil_df.nunique()
counts

# Initial Analysis

Field 0 is the patch number. Why are there only 238 uniques?

Fields 45 and 49 only have 2 unique integers each. Class label candidates.

Field 22 has only 1 unique value.

Field 46 has a unique value for every case.

A handful of fields have single digit unique values.

It's informative to look at the number of unique values per field as a percent of the total number of cases.

In [None]:
for i in range(oil_df.shape[1]):
    num = len(oil_df[i].unique())
    pctg = num/oil_df.shape[0]*100
    print('%d, %d, %.1f%%' % (i, num, pctg))
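
The loop above can also be written as a single vectorized expression, since `nunique` already returns one count per column. This is a sketch on a made-up frame (`toy` and its column names are illustrative):

```python
import pandas as pd

# Hypothetical frame with 4 rows and columns of differing cardinality.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],       # every value unique
    "flag": [0, 1, 0, 0],     # two unique values
    "const": [7, 7, 7, 7],    # one unique value
})

# Unique values per column as a percent of the number of rows.
pct_unique = toy.nunique() / len(toy) * 100
print(pct_unique)
```

For the oil data this would be `oil_df.nunique() / len(oil_df) * 100`.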

What's the variance of a field with one unique value?

In [None]:
oil_df[22].var()

This doesn't tell us anything, so we can exclude it. The tutorial goes into applying variance thresholds on the data columns as a means of identifying columns to drop based on varying criteria. It's overkill for now, but feel free to explore it on your own.

For the sake of argument, let's assume it's appropriate to discard anything with variance = 0 and columns with fewer than 1% unique values, except for the class label.
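
As a rough sketch of that rule, assuming we already know which column is the label, the selection could look like this. The frame, the column names, and the `label` placeholder are all made up for illustration; this is not the oil data.

```python
import pandas as pd

# Hypothetical frame with 400 rows; "label" plays the role of the class column.
n = 400
toy = pd.DataFrame({
    "feature": list(range(n)),        # 400 unique values (100%)
    "constant": [5] * n,              # 1 unique value, variance 0
    "binary": [0, 1] * (n // 2),      # 2 unique values (0.5%)
    "label": [0] * 390 + [1] * 10,    # 2 unique values, but we want to keep it
})

pct_unique = toy.nunique() / len(toy) * 100   # percent unique per column
zero_var = toy.var() == 0                     # True where variance is exactly 0

# Flag columns that meet either criterion, then protect the label.
mask = (pct_unique < 1) | zero_var
to_drop = [col for col in mask[mask].index if col != "label"]
print(to_drop)
```

Without the `!= "label"` guard, the label would be flagged too, since a two-valued column has very few unique values by construction.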

Which field is the class label?

In [None]:
# Columns 45 and 49 both have 2 unique values. We know one should have 41 instances of 1 and 896 instances of 0
oil_df[oil_df[45]==1].shape[0], oil_df[oil_df[45]==0].shape[0]

In [None]:
# It doesn't appear to be column 45. Try 49.
oil_df[oil_df[49]==1].shape[0], oil_df[oil_df[49]==0].shape[0]

Column 49 is the class label.

In [None]:
#let's see the label weights as percent.
pct_spill = oil_df[oil_df[49]==1].shape[0]/oil_df.shape[0]*100
pct_nospill=oil_df[oil_df[49]==0].shape[0]/oil_df.shape[0]*100
print('Class 1: %.3f%%, Class 0: %.3f%%' % (pct_spill, pct_nospill))
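
A shorter route to the same percentages is `value_counts(normalize=True)`, which returns the fraction of rows per value. Here is a sketch on a made-up label series (not the oil data):

```python
import pandas as pd

# Hypothetical imbalanced labels: nine 0s and one 1.
labels = pd.Series([0] * 9 + [1] * 1)

# Fraction of rows per class, scaled to percent.
pct = labels.value_counts(normalize=True) * 100
print(pct)
```

On the oil data this would be `oil_df[49].value_counts(normalize=True) * 100`.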

In [None]:
# pandas has some helpful plotting functionality
oil_df[49].hist()

In [None]:
# let's look at column 45. It's very similar to column 49.
oil_df[45].hist()

In [None]:
# visualize the columns as histograms to see distributions.
import matplotlib.pyplot as plt #import pyplot 

fig = plt.figure(figsize=(50,50)) #create a figure
ax = fig.gca() #assign an axis variable that gets the current axis
_ = oil_df.hist(ax=ax) #plot a histogram that uses data frame fields assigning a new axis to each field.

In [None]:
# recall we said it was appropriate to delete any column whose unique values are 1% or less of the rows
to_del = [i for i, v in enumerate(counts) if (v/oil_df.shape[0]*100) <= 1]
to_del

In [None]:
# how many are we deleting?
len(to_del)

In [None]:
# let's save the patch names and the labels as series to their own variables
patches = oil_df[0]
labels = oil_df[49]

In [None]:
# drop the low variance fields in place; axis=1 tells drop that to_del holds column labels.
oil_df.drop(to_del, axis=1, inplace=True)

In [None]:
#check the shape of the resulting dataframe.
oil_df.shape

In [None]:
# we can join the patches and labels series into a dataframe using the pandas concat method.
pd.concat([patches, labels], axis=1)

In [None]:
# the concat method defaults to axis 0, which appends series end to end.
pd.concat([patches,labels])
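
The difference between the two axes is easier to see on tiny inputs. This sketch uses two made-up series (`a` and `b` are illustrative names) and also shows `ignore_index=True`, which renumbers the rows when stacking end to end.

```python
import pandas as pd

# Two hypothetical series that share the index 0, 1.
a = pd.Series([10, 20], name="patch")
b = pd.Series([0, 1], name="label")

# axis=1: align on the index and place the series side by side as columns.
side_by_side = pd.concat([a, b], axis=1)

# Default axis=0: stack end to end, keeping each series' original index.
end_to_end = pd.concat([a, b])

# ignore_index=True: stack end to end and build a fresh 0..n-1 index.
stacked = pd.concat([a, b], ignore_index=True)

print(side_by_side.shape)
print(end_to_end.index.tolist())
print(stacked.index.tolist())
```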