# Assignment 9 - Option 2
Submit a mini-project of your own.

This would be similar to your individual project but smaller in scope. For example, you can pick a dataset and explore it, providing background, analysis, and visualization. Or you can pick a Python data visualization library (Altair, Bokeh, Plotly Express, Seaborn, etc.) or another library of interest, then learn it and use it on a dataset.

Make sure the project is well styled and easy to follow. It does not need to be long, just good enough to show that you learned something new from a new dataset or a new library.

## What is Bokeh?
"Bokeh is an interactive visualization library for modern web browsers." ([Bokeh](https://docs.bokeh.org/en/latest/index.html)) The main idea behind Bokeh is that "graphs are built up one layer at a time" ([Towards Data Science](https://towardsdatascience.com/data-visualization-with-bokeh-in-python-part-one-getting-started-a11655a467d4)).

Features of Bokeh ([Geeks for Geeks](https://www.geeksforgeeks.org/introduction-to-bokeh-in-python/)):


*   Flexibility
*   Productivity
*   Interactivity
*   Powerful
*   Shareable
*   Open Source



First, I'll work through the example from Towards Data Science as an initial test, then I'll import my own dataset.

In [2]:
pip install bokeh



In [4]:
# import libraries
from bokeh.plotting import figure
from bokeh.io import show, output_notebook

# Create a blank figure with labels
# (Bokeh 3.0+ takes width/height; older releases used plot_width/plot_height)
p = figure(width = 600, height = 600, title = 'Example Glyphs', x_axis_label = 'X', y_axis_label = 'Y')

# Example data
squares_x = [1,3,4,5,8]
squares_y = [8,7,3,1,10]
circles_x = [9,12,4,3,15]
circles_y = [8,4,11,6,10]

# Add square glyph
p.square(squares_x, squares_y, size=12, color="navy", alpha=0.6)
# Add circle glyph
p.circle(circles_x, circles_y, size=12, color='red')

# Set to output the plot in the notebook
output_notebook()
# Show the plot
show(p)

## Mini Project

### **Load the Data**

I'm going to look at one of the datasets that comes with sklearn since the focus of this mini project is to learn how to use Bokeh.

In [13]:
# import library
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np

# fetch the California housing dataset
housing = fetch_california_housing()

In [18]:
# description of dataset
print(housing.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bur

In [17]:
# load dataset as a dataframe
df = pd.DataFrame(housing.data, columns = housing.feature_names)
df['target'] = housing.target
df.head()

|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [19]:
# summary of the stats for each column
df.describe()

|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | 3.870671 | 28.639486 | 5.429 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 | 2.068558 |
| std | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.38605 | 2.135952 | 2.003532 | 1.153956 |
| min | 0.4999 | 1.0 | 0.846154 | 0.333333 | 3.0 | 0.692308 | 32.54 | -124.35 | 0.14999 |
| 25% | 2.5634 | 18.0 | 4.440716 | 1.006079 | 787.0 | 2.429741 | 33.93 | -121.8 | 1.196 |
| 50% | 3.5348 | 29.0 | 5.229129 | 1.04878 | 1166.0 | 2.818116 | 34.26 | -118.49 | 1.797 |
| 75% | 4.74325 | 37.0 | 6.052381 | 1.099526 | 1725.0 | 3.282261 | 37.71 | -118.01 | 2.64725 |
| max | 15.0001 | 52.0 | 141.909091 | 34.066667 | 35682.0 | 1243.333333 | 41.95 | -114.31 | 5.00001 |


### **Histograms of Median Income Blocks**

In [21]:
arr_hist, edges = np.histogram(df['MedInc'], 
                               bins = 15, 
                               range = [0, 15])


delays = pd.DataFrame({'MedInc': arr_hist, 
                       'left': edges[:-1], 
                       'right': edges[1:]})

In [22]:
delays.head()

|   | MedInc | left | right |
|---|---|---|---|
| 0 | 155 | 0.0 | 1.0 |
| 1 | 2284 | 1.0 | 2.0 |
| 2 | 4926 | 2.0 | 3.0 |
| 3 | 5151 | 3.0 | 4.0 |
| 4 | 3615 | 4.0 | 5.0 |
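
To make the binning step concrete, here is a minimal sketch on a small made-up list (the `values` list and the `bins_df` name are assumptions for illustration, not taken from the housing data). `np.histogram` returns the per-bin counts plus one more edge than there are bins, which is why the left and right edges are taken as `edges[:-1]` and `edges[1:]`.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only (not from the housing data)
values = [0.2, 0.7, 1.1, 1.5, 1.9, 2.4]

# Three equal-width bins over the range 0 to 3
counts, edges = np.histogram(values, bins=3, range=[0, 3])

# One row per bin: its count plus its left and right edge
bins_df = pd.DataFrame({'count': counts,
                        'left': edges[:-1],
                        'right': edges[1:]})
print(bins_df)
```

Each row of `bins_df` then maps directly onto one rectangle of a `quad` glyph: `left` and `right` give the bar's horizontal extent and `count` gives its height.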


In [23]:
# Create the blank plot
# (Bokeh 3.0+ takes width/height; older releases used plot_width/plot_height)
p = figure(height = 600, width = 600, 
           title = 'Histogram of Median Income in blocks',
           x_axis_label = 'Median Income in blocks', 
           y_axis_label = 'Number of Block Groups')

# Add a quad glyph
p.quad(bottom=0, top=delays['MedInc'], 
       left=delays['left'], right=delays['right'], 
       fill_color='red', line_color='black')

# Show the plot
show(p)

This shows that the distribution is right-skewed: most block groups fall in the lower median-income bins.

In [25]:
arr_hist, edges = np.histogram(df['MedInc'], 
                               bins = 30, 
                               range = [0, 15])


delays = pd.DataFrame({'MedInc': arr_hist, 
                       'left': edges[:-1], 
                       'right': edges[1:]})

# Create the blank plot
# (Bokeh 3.0+ takes width/height; older releases used plot_width/plot_height)
p = figure(height = 600, width = 600, 
           title = 'Histogram of Median Income in blocks',
           x_axis_label = 'Median Income in blocks', 
           y_axis_label = 'Number of Block Groups')

# Add a quad glyph
p.quad(bottom=0, top=delays['MedInc'], 
       left=delays['left'], right=delays['right'], 
       fill_color='red', line_color='black')

# Show the plot
show(p)

Doubling the resolution does not provide any further insight.
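
Doubling the bin count over the same range splits each wide bin into two narrow ones, so the overall shape can only change within each original bar. A minimal sketch with randomly generated, made-up values (the seed and the gamma parameters are assumptions chosen only to give a skewed sample):

```python
import numpy as np

# Made-up, right-skewed sample for illustration only
rng = np.random.default_rng(0)
values = rng.gamma(shape=2.0, scale=2.0, size=1000)

coarse, coarse_edges = np.histogram(values, bins=15, range=[0, 15])
fine, fine_edges = np.histogram(values, bins=30, range=[0, 15])

# Bin widths over the same range: 1.0 for 15 bins, 0.5 for 30 bins
coarse_width = coarse_edges[1] - coarse_edges[0]
fine_width = fine_edges[1] - fine_edges[0]

# Each pair of neighbouring fine bins adds up to one coarse bin
paired = fine.reshape(-1, 2).sum(axis=1)
print(coarse_width, fine_width)
print(np.array_equal(paired, coarse))
```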

### **Average # of Rooms vs. Average Number of People**

In [45]:
# Build one mask so each plotted point pairs values from the same row
mask = (df['AveOccup'] < df['AveOccup'].quantile(.95)) & (df['AveRooms'] < df['AveRooms'].quantile(.95))
df_trim = df[mask]

p  = figure(height = 600, width = 600, 
          x_axis_label = 'Average Occupancy', 
           y_axis_label = 'Average Number of Rooms')
p.square(df_trim['AveOccup'], df_trim['AveRooms'], size=2, color="olive", alpha=0.5)
show(p)

Here we can see that there isn't a noticeable trend, even after cutting off the top 5% of each variable to eliminate outliers.
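
The 95th-percentile cutoff can be sketched on a small made-up DataFrame (the `toy` frame, its column names `x` and `y`, and its values are assumptions for illustration). Combining both conditions into a single boolean mask keeps each remaining point's x and y values from the same row:

```python
import pandas as pd

# Made-up data: x counts up from 1 to 20 while y counts down from 20 to 1
toy = pd.DataFrame({'x': range(1, 21),
                    'y': range(20, 0, -1)})

# Keep rows where BOTH columns fall below their own 95th percentile
mask = (toy['x'] < toy['x'].quantile(.95)) & (toy['y'] < toy['y'].quantile(.95))
trimmed = toy[mask]

print(len(toy), len(trimmed))
```

Because the filter is applied to whole rows, `trimmed['x']` and `trimmed['y']` always have the same length and the same index, which is what a scatter glyph needs.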

Try it with 1/25th of the data to see if we can notice a trend.

In [44]:
# Randomly sample 1/25th of the rows
df_25 = df.sample(n=len(df) // 25)

# Build one mask so each plotted point pairs values from the same row
mask = (df_25['AveOccup'] < df_25['AveOccup'].quantile(.95)) & (df_25['AveRooms'] < df_25['AveRooms'].quantile(.95))
df_25_trim = df_25[mask]

p  = figure(height = 600, width = 600, 
          x_axis_label = 'Average Occupancy', 
           y_axis_label = 'Average Number of Rooms')
p.square(df_25_trim['AveOccup'], df_25_trim['AveRooms'], size=2, color="olive", alpha=0.5)
show(p)

Even with 1/25th of the data, there is no discernible trend.
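
Because `DataFrame.sample` draws rows at random, each run of the cell plots a different subset. A minimal sketch, on a made-up 100-row DataFrame, of drawing 1/25th of the rows with a fixed `random_state` so that the subset is reproducible (the seed value 0 is an arbitrary assumption):

```python
import pandas as pd

# Made-up 100-row frame for illustration only
toy = pd.DataFrame({'value': range(100)})

# 1/25th of the rows; integer division keeps the row count an int
n_rows = len(toy) // 25

# The same seed yields the same rows on every run
first = toy.sample(n=n_rows, random_state=0)
second = toy.sample(n=n_rows, random_state=0)

print(n_rows)
print(first.index.equals(second.index))
```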

### **Average # of Rooms vs. Average Income**

Only going to look at 1/25th of the data.

In [43]:
# Randomly sample 1/25th of the rows
df_25 = df.sample(n=len(df) // 25)

# Build one mask so each plotted point pairs values from the same row
mask = (df_25['MedInc'] < df_25['MedInc'].quantile(.95)) & (df_25['AveRooms'] < df_25['AveRooms'].quantile(.95))
df_25_trim = df_25[mask]

p  = figure(height = 600, width = 600, 
          x_axis_label = 'Median Income', 
           y_axis_label = 'Average Number of Rooms')
p.square(df_25_trim['MedInc'], df_25_trim['AveRooms'], size=2, color="olive", alpha=0.5)
show(p)



No trends
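
A scatter plot of many small markers can hide a weak relationship, so a correlation coefficient is a useful cross-check on a visual impression. A minimal sketch on made-up data (the `toy` frame, its column names, and its values are assumptions, not results from the housing data), using pandas' `Series.corr`, which computes the Pearson coefficient by default:

```python
import pandas as pd

# Made-up data: one column that rises with income, one that does not
toy = pd.DataFrame({'income': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'rooms_linear': [2.0, 4.0, 6.0, 8.0, 10.0],
                    'rooms_flat': [5.0, 3.0, 6.0, 3.0, 5.0]})

# Pearson correlation: close to 1 for a rising line, close to 0 for no linear trend
r_linear = toy['income'].corr(toy['rooms_linear'])
r_flat = toy['income'].corr(toy['rooms_flat'])
print(round(r_linear, 3), round(r_flat, 3))
```

Applied to this notebook, the equivalent call would be `df['MedInc'].corr(df['AveRooms'])`.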

### **Age of House vs. Median Income Over the Years**

In [47]:
df_ex = df[['HouseAge','MedInc']]
df_ex.head()

|   | HouseAge | MedInc |
|---|---|---|
| 0 | 41.0 | 8.3252 |
| 1 | 21.0 | 8.3014 |
| 2 | 52.0 | 7.2574 |
| 3 | 52.0 | 5.6431 |
| 4 | 52.0 | 3.8462 |


In [53]:
# group by house age and take the mean of the median incomes for each age
df_group = df_ex.groupby('HouseAge').mean()
df_group = df_group.reset_index()
df_group.head()

|   | HouseAge | MedInc |
|---|---|---|
| 0 | 1.0 | 4.0034 |
| 1 | 2.0 | 5.167766 |
| 2 | 3.0 | 5.460258 |
| 3 | 4.0 | 5.180673 |
| 4 | 5.0 | 4.697636 |


In [58]:
# (Bokeh 3.0+ takes width/height; older releases used plot_width/plot_height)
p = figure(width=600, height=600,
           title = 'Age vs. Income Over the Years',
          x_axis_label = 'Age of Home', 
           y_axis_label = 'Average Median Income')
p.line(df_group['HouseAge'], df_group['MedInc'], line_width=2, color='teal')
show(p)

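This suggests that block groups with newer homes tend to have higher median incomes on average. The data describes where people live rather than what they bought, so it hints at, but does not prove, that people with more money are buying newer houses.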
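
The group-then-average step behind this line chart can be sketched on a tiny made-up table (the values are assumptions for illustration): rows sharing the same `HouseAge` are collapsed into one row holding the mean of their `MedInc` values.

```python
import pandas as pd

# Made-up rows: two block groups with age 1, three with age 2
toy = pd.DataFrame({'HouseAge': [1.0, 1.0, 2.0, 2.0, 2.0],
                    'MedInc':   [3.0, 5.0, 2.0, 4.0, 6.0]})

# Mean MedInc per HouseAge, with HouseAge restored as a regular column
grouped = toy.groupby('HouseAge').mean().reset_index()

# Number of rows behind each mean
group_sizes = toy.groupby('HouseAge').size()

print(grouped)
print(group_sizes)
```

Checking the group sizes alongside the means is worthwhile, since an age represented by only a few rows produces a noisier point on the line.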