# ML Checklist: Getting Started

- What is the objective in business terms?

- Understand how your solution will be used

- Are there current solutions/workarounds?

- What category of problem is it? (supervised, unsupervised, etc.)

- How will performance be measured?

- Does the performance measure match the business objective?

- What's the minimum acceptable performance?

- Any reuse possible?

- Is human expertise available?

- What would be the manual solution?

- Are there any assumptions?

- Document! Document! Document!

## Selecting Performance Measures

- How will you measure how well your model is performing?

<b>Confusion Matrix</b>

| N=300 | Predicted: No | Predicted: Yes | |
| :--- | :--- | :--- | :--- |
| <b>Actual: No</b> | TN=140 | FP=15 | 155 |
| <b>Actual: Yes</b> | FN=100 | TP=45 | 145 |
| | 240 | 60 | |
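
The counts in the confusion matrix above can be turned into the usual classification measures. The following is a minimal sketch in plain Python; the variable names are chosen here for illustration, and the formulas are the standard definitions of accuracy, precision, recall, and F1.

```python
# Counts taken from the confusion matrix above (N = 300).
tn, fp, fn, tp = 140, 15, 100, 45

total = tn + fp + fn + tp

# Share of all predictions that are correct.
accuracy = (tp + tn) / total
# Of the cases predicted "Yes", the share that were actually "Yes".
precision = tp / (tp + fp)
# Of the cases that were actually "Yes", the share the model found.
recall = tp / (tp + fn)
# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# → accuracy=0.617 precision=0.750 recall=0.310 f1=0.439
```

Accuracy alone looks moderate here, but recall shows the model misses most of the actual "Yes" cases, which is why the measure has to match the business objective.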

<b>Common Regression Measures</b>

RMSE: Root Mean Squared Error
- Most commonly used
$RMSE = \sqrt{\frac{\displaystyle\sum_{i=1}^{N}(Predicted_i-Actual_i)^2}{N}} $

MAE: Mean Absolute Error
- Preferred when there are many outliers
$MAE = \frac{1}{N}\displaystyle\sum_{i=1}^{N}|Predicted_i-Actual_i|$
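
A minimal sketch of RMSE and MAE in plain Python can show why MAE is preferred when there are many outliers. The function names and the toy values are invented for illustration; only the formulas come from the definitions above.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error: square root of the mean squared residual."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

def mae(predicted, actual):
    """Mean absolute error: mean of the absolute residuals."""
    n = len(actual)
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / n

# Toy values, invented for illustration only.
actual = [10.0, 12.0, 14.0, 16.0]
clean = [11.0, 11.0, 15.0, 15.0]     # every residual is +1 or -1
outlier = [11.0, 11.0, 15.0, 26.0]   # the last residual is 10

print(rmse(clean, actual), mae(clean, actual))
print(round(rmse(outlier, actual), 3), mae(outlier, actual))
```

With one large residual, RMSE rises from 1.0 to about 5.07 while MAE rises only to 3.25, because squaring gives the outlier far more weight.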

R^2: R-squared
- Also called the coefficient of determination
- Ratio of the Sum of Square Regression (SSR) to the Sum of Square Total (SST); for a least-squares fit with an intercept this equals one minus the ratio of the Sum of Square Error (SSE) to SST
$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$, where $SSR = \sum(y^\prime{}-\bar{y})^2$, $SST = \sum(y-\bar{y})^2$, $SSE = \sum(y-y^\prime{})^2$, and $y^\prime{}$ is the predicted value
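
R-squared can be checked numerically from the sums of squares. The sketch below fits a one-feature least-squares line in plain Python and computes R-squared both as SSR/SST and as 1 - SSE/SST. The helper names and the toy data are invented for illustration, and the two forms are only guaranteed to agree for a least-squares fit that includes an intercept.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for a single feature."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def sums_of_squares(actual, predicted):
    """Return (SSE, SSR, SST) for the given actual and predicted values."""
    mean_y = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ssr = sum((p - mean_y) ** 2 for p in predicted)
    sst = sum((a - mean_y) ** 2 for a in actual)
    return sse, ssr, sst

# Toy data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

slope, intercept = fit_line(x, y)
y_hat = [intercept + slope * xi for xi in x]

sse, ssr, sst = sums_of_squares(y, y_hat)
r2_from_error = 1 - sse / sst
r2_from_regression = ssr / sst

print(round(r2_from_error, 6), round(r2_from_regression, 6))
```

For this toy data both forms come out to 0.6. For predictions that do not come from a least-squares fit, the 1 - SSE/SST form is the safer one to report.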

<b>Summary of Common Measures</b>

| Acronym | Full Name | Residual Operation | Robust to Outliers |
| :--- | :--- | :--- | :--- |
| MAE | Mean Absolute Error | Absolute Value | Yes |
| MSE | Mean Squared Error | Square | No |
| RMSE | Root Mean Squared Error | Square | No |
| MAPE | Mean Absolute Percentage Error | Absolute Value | Yes |
| MPE | Mean Percentage Error | N/A | Yes |
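
The two percentage measures in the table differ only in whether the sign of each residual is kept. A minimal sketch in plain Python follows; the function names and toy values are invented for illustration, and both measures are undefined when any actual value is zero.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    n = len(actual)
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n

def mpe(actual, predicted):
    """Mean percentage error, in percent. Signed, so errors can cancel."""
    n = len(actual)
    return 100 * sum((a - p) / a for a, p in zip(actual, predicted)) / n

# Toy values, invented for illustration only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]   # one over-prediction, one under-prediction

print(round(mape(actual, predicted), 3), round(mpe(actual, predicted), 3))
```

Here MAPE is about 6.667 percent while MPE is 0, because the 10 percent over-prediction and the 10 percent under-prediction cancel. MPE is therefore more useful as a check for systematic bias than as a measure of overall error size.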

<b>Check and Validate All Your Assumptions</b>

## Module 2.2 Assignment
Using the California Census data, answer the following questions.

### 1. What are the attributes for each district?

In [2]:
# package imports
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

In [3]:
# upload the data
df_cal_cens = pd.read_csv('housing.csv')
df_cal_cens.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [4]:
# get summary stats on each numerical feature
df_cal_cens.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


In [10]:
# determine unique values of the categorical features
print(f"Attributes for each district:\n{df_cal_cens['ocean_proximity'].unique()}")

Attributes for each district:
['NEAR BAY' '<1H OCEAN' 'INLAND' 'NEAR OCEAN' 'ISLAND']
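
One way to list every attribute, not only the levels of `ocean_proximity`, is to split the columns by dtype. The sketch below assumes pandas is installed and rebuilds a small frame from the five rows shown in the `head()` output above, so it runs without `housing.csv`. The names `sample_csv` and `df_sample` are hypothetical.

```python
import io

import pandas as pd

# The five rows shown in the head() output above; the notebook itself
# reads the full housing.csv instead.
sample_csv = """longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
"""
df_sample = pd.read_csv(io.StringIO(sample_csv))

# Numeric attributes are summarized with describe(); the remaining
# attributes are categorical and are summarized with value_counts().
numeric_cols = df_sample.select_dtypes(include="number").columns.tolist()
categorical_cols = df_sample.select_dtypes(exclude="number").columns.tolist()

print("numeric attributes:", numeric_cols)
print("categorical attributes:", categorical_cols)
for col in categorical_cols:
    print(df_sample[col].value_counts())
```

On the full dataset the same `value_counts()` call would also give the number of districts in each `ocean_proximity` category.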


### 2. What attributes are confusing to you?

The most confusing attributes to me are the "<1H OCEAN" and "NEAR OCEAN" values of `ocean_proximity`, because the difference between them is unclear. I would assume that NEAR OCEAN means more than one hour from the ocean, but I would most likely have to plot the districts on a map to understand the distinction.

### 3. Without graphing tools, what observations can you make about the data?

After reviewing the summary statistics below, we can point out some key observations:

1. There are only 5 ISLAND districts, and the largest category is <1H OCEAN with 9,136 observations
2. ISLAND districts have the greatest housing median age (mean 42.4) and INLAND districts the lowest (mean 24.3)
3. ISLAND districts have the lowest mean total rooms (1,574.6) and total bedrooms (420.4) of all the categories
4. ISLAND has the lowest mean number of households at 276.6 (unsure what this feature means)
5. ISLAND has the lowest mean median income (2.74) and also the lowest standard deviation of median income (0.44)
6. INLAND has the lowest mean median house value (124,805)

In [13]:
# without graphing tools, we can look at the summary stats for each attribute/district
sum_stats_dict = {}
for val in df_cal_cens['ocean_proximity'].unique():
    # print(f"summary stats for {val}\n{df_cal_cens[df_cal_cens['ocean_proximity'] == val].describe()}\n")
    sum_stats_dict[val] = df_cal_cens[df_cal_cens['ocean_proximity'] == val].describe()
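
The loop above stores one `describe()` frame per category, which then has to be read one frame at a time. As an alternative sketch, assuming pandas is installed, a single `groupby` can put the counts and means for every category side by side. The helper `compare_categories` and the toy frame are hypothetical and the toy values are invented; in the notebook the helper would be called on `df_cal_cens`.

```python
import pandas as pd

def compare_categories(df, by="ocean_proximity"):
    """Return one row per category: row count plus the mean of each numeric column."""
    counts = df.groupby(by).size().rename("n_districts")
    means = df.groupby(by).mean(numeric_only=True)
    summary = pd.concat([counts, means], axis=1)
    # Largest category first, to make the imbalance between categories visible.
    return summary.sort_values("n_districts", ascending=False)

# Toy frame with invented values, for illustration only.
toy = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "ISLAND", "NEAR BAY"],
    "housing_median_age": [20.0, 30.0, 52.0, 41.0],
    "median_house_value": [100000.0, 150000.0, 400000.0, 450000.0],
})

summary = compare_categories(toy)
print(summary)
```

A table like this makes comparisons such as "lowest mean house value" a matter of reading one column, and the `n_districts` column is a reminder that a category with very few rows gives unstable statistics.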

In [15]:
sum_stats_dict['NEAR BAY']

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| count | 2290.0 | 2290.0 | 2290.0 | 2290.0 | 2270.0 | 2290.0 | 2290.0 | 2290.0 | 2290.0 |
| mean | -122.260694 | 37.801057 | 37.730131 | 2493.58952 | 514.182819 | 1230.317467 | 488.616157 | 4.172885 | 259212.31179 |
| std | 0.147004 | 0.185434 | 13.070385 | 1830.817022 | 367.887605 | 885.899035 | 350.598369 | 2.017427 | 122818.537064 |
| min | -122.59 | 37.35 | 2.0 | 8.0 | 1.0 | 8.0 | 1.0 | 0.4999 | 22500.0 |
| 25% | -122.41 | 37.73 | 29.0 | 1431.25 | 289.0 | 718.25 | 275.0 | 2.83475 | 162500.0 |
| 50% | -122.25 | 37.79 | 39.0 | 2083.0 | 423.0 | 1033.5 | 406.0 | 3.81865 | 233800.0 |
| 75% | -122.14 | 37.9075 | 52.0 | 3029.75 | 628.75 | 1495.0 | 599.25 | 5.054425 | 345700.0 |
| max | -122.01 | 38.34 | 52.0 | 18634.0 | 3226.0 | 8276.0 | 3589.0 | 15.0001 | 500001.0 |


In [19]:
sum_stats_dict['<1H OCEAN']

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| count | 9136.0 | 9136.0 | 9136.0 | 9136.0 | 9034.0 | 9136.0 | 9136.0 | 9136.0 | 9136.0 |
| mean | -118.847766 | 34.560577 | 29.279225 | 2628.343586 | 546.539185 | 1520.290499 | 517.744965 | 4.230682 | 240084.285464 |
| std | 1.588888 | 1.467127 | 11.644453 | 2160.463696 | 427.911417 | 1185.848357 | 392.280718 | 2.001223 | 106124.292213 |
| min | -124.14 | 32.61 | 2.0 | 11.0 | 5.0 | 3.0 | 4.0 | 0.4999 | 17500.0 |
| 25% | -118.5 | 33.86 | 20.0 | 1464.0 | 303.0 | 857.75 | 293.0 | 2.8649 | 164100.0 |
| 50% | -118.275 | 34.03 | 30.0 | 2108.0 | 438.0 | 1247.0 | 421.0 | 3.875 | 214850.0 |
| 75% | -118.0 | 34.22 | 37.0 | 3141.0 | 652.0 | 1848.0 | 617.0 | 5.1805 | 289100.0 |
| max | -116.62 | 41.88 | 52.0 | 37937.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


In [16]:
sum_stats_dict['NEAR OCEAN']

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| count | 2658.0 | 2658.0 | 2658.0 | 2658.0 | 2628.0 | 2658.0 | 2658.0 | 2658.0 | 2658.0 |
| mean | -119.332555 | 34.738439 | 29.347254 | 2583.700903 | 538.615677 | 1354.008653 | 501.244545 | 4.005785 | 249433.977427 |
| std | 2.327307 | 2.275386 | 11.840371 | 1990.72476 | 376.320045 | 1005.563166 | 344.445256 | 2.010558 | 122477.145927 |
| min | -124.35 | 32.54 | 2.0 | 15.0 | 3.0 | 8.0 | 3.0 | 0.536 | 22500.0 |
| 25% | -122.02 | 32.78 | 20.0 | 1505.0 | 313.0 | 778.5 | 299.0 | 2.630525 | 150000.0 |
| 50% | -118.26 | 33.79 | 29.0 | 2195.0 | 464.0 | 1136.5 | 429.0 | 3.64705 | 229450.0 |
| 75% | -117.1825 | 36.98 | 37.0 | 3109.0 | 666.0 | 1628.0 | 614.0 | 4.8374 | 322750.0 |
| max | -116.97 | 41.95 | 52.0 | 30405.0 | 4585.0 | 12873.0 | 4176.0 | 15.0001 | 500001.0 |


In [17]:
sum_stats_dict['ISLAND']

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| count | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |
| mean | -118.354 | 33.358 | 42.4 | 1574.6 | 420.4 | 668.0 | 276.6 | 2.74442 | 380440.0 |
| std | 0.070569 | 0.040866 | 13.164346 | 707.545264 | 169.320111 | 301.691067 | 113.200265 | 0.44418 | 80559.561816 |
| min | -118.48 | 33.33 | 27.0 | 716.0 | 214.0 | 341.0 | 160.0 | 2.1579 | 287500.0 |
| 25% | -118.33 | 33.34 | 29.0 | 996.0 | 264.0 | 422.0 | 173.0 | 2.6042 | 300000.0 |
| 50% | -118.32 | 33.34 | 52.0 | 1675.0 | 512.0 | 733.0 | 288.0 | 2.7361 | 414700.0 |
| 75% | -118.32 | 33.35 | 52.0 | 2127.0 | 521.0 | 744.0 | 331.0 | 2.8333 | 450000.0 |
| max | -118.32 | 33.43 | 52.0 | 2359.0 | 591.0 | 1100.0 | 431.0 | 3.3906 | 450000.0 |


In [18]:
sum_stats_dict['INLAND']

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| count | 6551.0 | 6551.0 | 6551.0 | 6551.0 | 6496.0 | 6551.0 | 6551.0 | 6551.0 | 6551.0 |
| mean | -119.73299 | 36.731829 | 24.271867 | 2717.742787 | 533.881619 | 1391.046252 | 477.447565 | 3.208996 | 124805.392001 |
| std | 1.90095 | 2.116073 | 12.01802 | 2385.831111 | 446.117778 | 1168.670126 | 392.252095 | 1.437465 | 70007.908494 |
| min | -123.73 | 32.64 | 1.0 | 2.0 | 2.0 | 5.0 | 2.0 | 0.4999 | 14999.0 |
| 25% | -121.35 | 34.18 | 15.0 | 1404.0 | 282.0 | 722.0 | 254.0 | 2.18895 | 77500.0 |
| 50% | -120.0 | 36.97 | 23.0 | 2131.0 | 423.0 | 1124.0 | 385.0 | 2.9877 | 108500.0 |
| 75% | -117.84 | 38.55 | 33.0 | 3216.0 | 636.0 | 1687.0 | 578.0 | 3.9615 | 148950.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6210.0 | 16305.0 | 5358.0 | 15.0001 | 500001.0 |
