# CSCI 303
# Introduction to Data Science
<p/>

### 10 - Exploratory Data Analysis

![Exploratory data analysis](eda.png)

## This Lecture
---
- Explore the California Housing data set

The obligatory setup code...

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as sk
import sklearn.datasets

from pandas import Series, DataFrame

plt.style.use('bmh')

%matplotlib inline

## The California Housing Dataset
---
A well-known and heavily studied dataset for statistical inference.

Available in the scikit-learn package, or from many sources online.

In [None]:
from sklearn.datasets import fetch_california_housing    
raw = fetch_california_housing()
#print(raw.target_names)
cali = DataFrame(raw.data, columns=raw.feature_names)

#example if you do not want to put the target in the cali dataframe
#X = cali
#y = raw.target

#example if you want to put the target in the cali dataframe, be sure to separate for ML
cali['MedHouseVal'] = raw.target
X = cali[['HouseAge', 'AveRooms', 'AveBedrms']]
y = cali['MedHouseVal']
#if you want to remove a column, you can use the .drop
#cali = cali.drop('MedHouseVal', axis=1)

cali.head()
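
The "separate for ML" advice above can be sketched on a tiny made-up frame (the values are invented; only the column names come from the data set). This is one possible pattern, not the only one: `drop` returns a new DataFrame, so the original keeps its target column.

```python
import pandas as pd

# Toy stand-in for the cali DataFrame (made-up values, for illustration only)
toy = pd.DataFrame({
    'HouseAge': [10.0, 20.0, 30.0],
    'AveRooms': [5.0, 6.0, 4.5],
    'MedHouseVal': [1.5, 2.5, 3.0],
})

# Features: every column except the target; drop() does not modify toy
X_toy = toy.drop(columns='MedHouseVal')

# Target: just the MedHouseVal column, as a Series
y_toy = toy['MedHouseVal']

print(X_toy.columns.tolist())
print(y_toy.name)
```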

In [None]:
print(raw.DESCR)

## Basic Statistics
---
pandas provides the `describe` method (similar to R's `summary`):

In [None]:
cali.describe()
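
One thing worth looking for in the `describe` output is a mean that sits far from the median (the `50%` row), or a max that is far beyond the `75%` row. Here is a minimal sketch on a made-up column (the values are invented for illustration) showing how a single large value produces that pattern:

```python
import pandas as pd

# Made-up values: four ordinary entries and one very large one
toy = pd.DataFrame({'rooms': [4.0, 5.0, 5.5, 6.0, 140.0]})

summary = toy.describe()

# Individual statistics can be pulled out of the summary by row label
mean_rooms = summary.loc['mean', 'rooms']
median_rooms = summary.loc['50%', 'rooms']
max_rooms = summary.loc['max', 'rooms']

print(mean_rooms, median_rooms, max_rooms)
```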

## What Shall We Explore?
---
Some ideas:

- distributions of individual inputs
- correlations between pairs of inputs and/or the target
- your suggestion here

## Distributions
---
Often best explored via histogram.

A histogram divides the range of the data into (usually) equal-width *bins*, then counts how many samples fall into each bin.
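
To make the binning concrete, here is a small sketch using `np.histogram`, which does the counting without drawing anything. The sample values are made up for illustration.

```python
import numpy as np

# Made-up sample values, purely for illustration
values = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 9.0])

# Four equal-width bins spanning the range [1, 9]
counts, edges = np.histogram(values, bins=4, range=(1, 9))

print(edges)   # the five bin boundaries
print(counts)  # how many samples fall in each of the four bins
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.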

For example, let's look at average number of rooms per dwelling.

In [None]:
#using the defaults, does this give useful information?
plt.hist(cali['AveRooms'])
plt.show()

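Not very informative, is it?  A handful of very large values stretch the x-axis, so almost every sample lands in the first bin.  We can restrict the range and vary the number of bins for more or less precision.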

In [None]:
#maybe update the range and number of bins
plt.hist(cali['AveRooms'], bins=50, range=[1,10])
plt.show()

How about the average number of household members?

In [None]:
plt.hist(cali['AveOccup'], bins=20, range=[1,10]) 
plt.show()

## Correlations
---
Often best explored via a scatter plot.
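
A scatter plot can be complemented with a single number, the Pearson correlation coefficient, which pandas computes with `Series.corr` (or `DataFrame.corr` for all pairs of columns at once). A minimal sketch on made-up data, with column names invented for illustration:

```python
import pandas as pd

# Made-up values, purely for illustration
toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y_linear': [3.0, 5.0, 7.0, 9.0, 11.0],  # exactly 2*x + 1
    'y_noisy': [2.0, 1.0, 4.0, 3.0, 5.0],    # upward trend with some scatter
})

# Pearson correlation between two columns (the default method)
r_linear = toy['x'].corr(toy['y_linear'])
r_noisy = toy['x'].corr(toy['y_noisy'])
print(r_linear, r_noisy)

# Correlation matrix for every pair of columns
corr_matrix = toy.corr()
print(corr_matrix)
```

The coefficient only measures linear association, so it is best read alongside the plot rather than instead of it.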

I theorize that there will be a correlation between average number of bedrooms and average occupancy.  Let's take a look:

In [None]:
plt.scatter(cali['AveBedrms'], cali['AveOccup']) 
plt.xlabel('AveBedrms'); plt.ylabel('AveOccup');
plt.show()

There seem to be some odd artifacts on the AveOccup axis; we should explore further.

Let's take a closer look at the AveOccup data.

In [None]:
#find the max to double check the plot?
print(cali['AveOccup'].max())

#see what the counts of unique values are in this Series object
cali['AveOccup'].value_counts().head()
#cali['AveOccup'].value_counts()

These large numbers seem suspicious.  Some kind of accidental input, corporate housing, hmm?

In [None]:
#let's explore just the large values, starting with more than 10 occupants
caliSubset = cali[cali['AveOccup'] > 10]
caliSubset.describe()

In [None]:
#looks like there are 37 out of the 20K+ samples that are in this selection
caliSubset.head()

What are the chances that 37 out of the 20K+ samples are corrupt or not usable?
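
To put that count in perspective, here is a quick calculation of the share of samples involved. The total of 20,640 is the row count given in the dataset description; the toy Series below is made up and only illustrates that the mean of a boolean mask gives the same kind of fraction.

```python
import pandas as pd

# Counts from the text: 37 flagged rows out of 20,640 samples
n_total = 20640
n_flagged = 37
share = n_flagged / n_total
print(f"{share:.2%}")

# Same idea on made-up values: the mean of a boolean mask
# is the fraction of entries where the condition is True
occup = pd.Series([2.5, 3.1, 2.8, 15.0, 3.3])
mask_share = (occup > 10).mean()
print(mask_share)
```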

In [None]:
#let's redo our scatterplot without those samples in it
caliSubset = cali[cali['AveOccup'] <= 10]
plt.scatter(caliSubset['AveBedrms'], caliSubset['AveOccup']) 
plt.xlabel('AveBedrms'); plt.ylabel('AveOccup');
plt.show()

That looks better!
Wait ... now what is going on with the AveBedrms? 35 bedrooms, that is a big house!

In [None]:
#shall we further refine the dataset, removing those possible outliers, using a best guess?
caliSubset = caliSubset[caliSubset['AveBedrms'] <= 13]
plt.scatter(caliSubset['AveBedrms'], caliSubset['AveOccup']) 
plt.xlabel('AveBedrms'); plt.ylabel('AveOccup');
plt.show()
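
The cutoff of 13 above is a judgment call. One common alternative, sketched here on made-up values, is Tukey's rule: flag anything more than 1.5 times the interquartile range above the third quartile. This is a different technique from the hand-picked threshold used in this lecture, and it can be too aggressive on skewed data, so treat it as a starting point.

```python
import pandas as pd

# Made-up values: seven ordinary entries and one extreme one
s = pd.Series([1.0, 1.1, 0.9, 1.2, 1.0, 1.05, 0.95, 34.0], name='AveBedrms')

# Interquartile range and the upper fence from Tukey's rule
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr

# Keep only the values at or below the fence
kept = s[s <= upper]

print(upper)
print(len(s), len(kept))
```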

What shall we look at now? We could check whether the average number of rooms correlates with the target, the median house value.

In [None]:
plt.scatter(caliSubset['AveRooms'], caliSubset['MedHouseVal']) 
plt.xlabel('AveRooms'); plt.ylabel('MedHouseVal');
plt.show()

In [None]:
#Wow - what is going on at 5.0 (MedHouseVal is in units of $100,000, so that is 500K)? Lots of points are clustered up there. How should we look?
print(caliSubset['MedHouseVal'].max())
caliSubset['MedHouseVal'].value_counts().iloc[:10]

In [None]:
caliSubset[caliSubset['MedHouseVal'] >= 5]

I'm quite suspicious that this value is some kind of data-entry default.

1. It's the same number, 5.00001, for 987 of the entries?
2. It's the maximum and same value no matter the number of rooms, age, etc. Could be some big outliers in there!
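
Both observations can be checked in code. A minimal sketch on a made-up target Series (values invented for illustration): if the most frequent value is also the maximum, and a noticeable share of rows sit exactly on it, the column was probably capped.

```python
import pandas as pd

# Made-up target values with a repeated maximum
target = pd.Series([1.2, 3.4, 5.00001, 2.2, 5.00001, 5.00001, 4.1])

# Most frequent value in the Series
top_value = target.value_counts().idxmax()

# Is the most frequent value also the maximum?
mode_is_max = top_value == target.max()

# Fraction of rows sitting exactly on the maximum
share_at_max = (target == target.max()).mean()

print(top_value, mode_is_max, share_at_max)
```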

For now, let's remove that data. It might not be justified, but without access to the original data collection info, it makes the most sense to me.

In [None]:
caliSubset = caliSubset[caliSubset['MedHouseVal'] < 5]
caliSubset

We are left with over 19K samples, so we can move on ... see how it goes!

We have done some histograms and scatter plots to explore our data, removed some possible outliers ... now what?

In [None]:
# another way to plot: we can add a 3rd dimension, MedInc, using a gradient of color
# using our caliSubset with some outliers removed, plot using the pandas plotting library
caliSubset.plot(kind='scatter', x='AveRooms', y='MedHouseVal', c='MedInc', colormap='Blues_r') 
plt.show()

In [None]:
# we can change the 3rd dimension to our target, again using a gradient of color
# using our caliSubset with some outliers removed, plot using the pandas plotting library
caliSubset.plot(kind='scatter', x='AveRooms', y='MedInc', c='MedHouseVal', colormap='Blues_r') 
plt.show()

So these plots make some sense, with the value increasing as the income increases ... 

- Recall, we removed some suspicious data.

- We almost certainly lost some good data.

- Was removing data the right thing to do?
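
One way to hedge on that last question is to flag suspicious rows instead of deleting them, so any analysis can be run both with and without them. A minimal sketch on a made-up frame; the `suspect` column name is invented for illustration, and the thresholds mirror the ones used earlier in this lecture.

```python
import pandas as pd

# Made-up rows: one with a huge occupancy, one sitting at the capped target value
toy = pd.DataFrame({
    'AveOccup': [2.5, 3.0, 2.8, 599.7, 3.2],
    'MedHouseVal': [1.5, 2.0, 1.8, 1.6, 5.00001],
})

# Flag rather than drop: True where either rule marks the row as suspicious
toy['suspect'] = (toy['AveOccup'] > 10) | (toy['MedHouseVal'] >= 5)

# A filtered view for analyses that should exclude the flagged rows
clean = toy[~toy['suspect']]

print(len(toy), len(clean))
print(toy['MedHouseVal'].mean(), clean['MedHouseVal'].mean())
```

Comparing a statistic on `toy` and on `clean` shows how much the removal actually changes the result.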

Other questions we could explore:
    
- What could we do with the Latitude and Longitude features?
- Does population have any bearing on the target?
- Enter here during class
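
As one possible starting point for the Latitude question, here is a sketch that groups samples into one-degree latitude bands and averages the target in each band. The values are made up and the `lat_bin` column name is invented for illustration; many other approaches would work.

```python
import pandas as pd

# Made-up coordinates and target values, purely for illustration
toy = pd.DataFrame({
    'Latitude': [34.1, 34.3, 37.8, 37.7, 32.7],
    'MedHouseVal': [2.0, 3.0, 4.0, 5.0, 1.5],
})

# Round each latitude to the nearest degree to form coarse bands
toy['lat_bin'] = toy['Latitude'].round()

# Mean target value within each band
by_lat = toy.groupby('lat_bin')['MedHouseVal'].mean()
print(by_lat)
```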

In [None]:
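# for each feature, plot its histogram next to a scatter plot of the feature against the target
for f in raw.feature_names:
    plt.figure(figsize=(10, 4))  # wider figure so the two panels have room
    plt.subplot(1,2,1)
    plt.hist(caliSubset[f])
    plt.xlabel(f)
    plt.subplot(1,2,2)
    plt.scatter(caliSubset[f], caliSubset['MedHouseVal'])
    plt.xlabel(f)
    plt.ylabel('MedHouseVal')
    plt.tight_layout()  # keep axis labels from overlapping the neighboring panel
    plt.show()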
    