# DS-SF-23 | Lab 08 | Introduction to Classification

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# grid_search and cross_validation were merged into model_selection (removed in scikit-learn 0.20)
from sklearn import neighbors, metrics, model_selection

pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

%matplotlib inline
plt.style.use('ggplot')

In [2]:
df = pd.read_csv(os.path.join('..', 'datasets', 'boston.csv'))

In [3]:
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,...,TAX,PTRATIO,BLACK,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,...,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,...,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,...,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,...,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,...,222,18.7,396.9,5.33,36.2


The Boston dataset describes housing values in the suburbs of Boston.  Its variables are as follows:

- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sqft
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River binary/dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centers
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- BLACK (`B` in the original dataset): 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
- LSTAT: % lower status of the population
- MEDV: Median value of owner-occupied homes (in thousands of dollars)

> ## Question 1.  Let's first categorize `MEDV` into 4 groups: the bottom 20% as Level 1, the next 30% as Level 2, the next 30% as Level 3, and the top 20% as Level 4.  Create a new variable `MEDV_Category` that stores the level number.

In [6]:
df[['MEDV']].describe()


Unnamed: 0,MEDV
count,506.0
mean,22.532806
std,9.197104
min,5.0
25%,17.025
50%,21.2
75%,25.0
max,50.0


In [11]:
level_2 = ((df.MEDV > df.MEDV.quantile(.2)) & (df.MEDV <= df.MEDV.quantile(.5)))
len(level_2)

506
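
Note that `len` on a boolean mask returns the total number of rows, which is why the result above equals the row count of the dataset rather than the size of Level 2. Summing the mask counts its `True` entries instead. A minimal sketch on made-up values (the series `values` and its cut points are illustrative assumptions, not the lab data):

```python
import pandas as pd

# Made-up values standing in for a numeric column such as MEDV
values = pd.Series([10.0, 15.0, 20.0, 25.0, 30.0])

# Boolean mask: True where the value lies in the interval (12, 22]
mask = (values > 12) & (values <= 22)

n_rows = len(mask)             # length of the mask = total number of rows
n_selected = int(mask.sum())   # number of True entries = rows actually selected
```

Here `n_rows` counts every row while `n_selected` counts only the rows inside the interval, so `mask.sum()` is the quantity to check when verifying a group's size.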

In [18]:
# Quantile cut points for the 20% / 30% / 30% / 20% split
q20, q50, q80 = df.MEDV.quantile([.2, .5, .8])

# Boolean masks for Levels 2 to 4; rows matching none of them stay at Level 1
level_2 = (df.MEDV > q20) & (df.MEDV <= q50)
level_3 = (df.MEDV > q50) & (df.MEDV <= q80)
level_4 = df.MEDV > q80

# Assign the level number through the masks; a row-wise apply that
# modifies its row argument does not write back to the DataFrame
df['MEDV_Category'] = 1
df.loc[level_2, 'MEDV_Category'] = 2
df.loc[level_3, 'MEDV_Category'] = 3
df.loc[level_4, 'MEDV_Category'] = 4

In [17]:
df[df['MEDV_Category'] == 1]


## Our goal is to predict `MEDV_Category` based on `RM`, `PTRATIO`, and `LSTAT`

> ## Question 2.  First normalize `RM`, `PTRATIO`, and `LSTAT` into the new variables `RM_s`, `PTRATIO_s`, and `LSTAT_s`.  By normalizing, we mean to scale each variable between 0 and 1 with the lowest value as 0 and the highest value as 1
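
The scaling described here is min-max normalization: subtract the column minimum, then divide by the column range, so the smallest value maps to 0 and the largest to 1. A minimal sketch on a made-up frame (the frame `toy` and the helper `min_max_scale` are hypothetical names used only for illustration):

```python
import pandas as pd

# Made-up values; the real lab would use the columns of df instead
toy = pd.DataFrame({'RM': [4.0, 6.0, 8.0], 'LSTAT': [2.0, 12.0, 22.0]})

def min_max_scale(column):
    """Scale a numeric Series linearly so that min -> 0 and max -> 1."""
    return (column - column.min()) / (column.max() - column.min())

# Store each scaled column under the original name plus an '_s' suffix
for name in ['RM', 'LSTAT']:
    toy[name + '_s'] = min_max_scale(toy[name])
```

This form assumes each column has at least two distinct values; a constant column would make the range zero and the division undefined.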

In [None]:
# TODO

> ## Question 3.  Run a KNN classifier with 5 nearest neighbors and report your misclassification error; set weights to uniform
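
Misclassification error is one minus accuracy, and `KNeighborsClassifier.score` returns accuracy on the data it is given. The sketch below shows the mechanics on made-up, well-separated points rather than on the lab data (the arrays `X` and `y` are illustrative assumptions):

```python
import numpy as np
from sklearn import neighbors

# Two made-up clusters of six points each, far apart from one another
X = np.array([
    [0.00, 0.00], [0.10, 0.00], [0.00, 0.10],
    [0.10, 0.10], [0.05, 0.05], [0.02, 0.08],
    [1.00, 1.00], [0.90, 1.00], [1.00, 0.90],
    [0.90, 0.90], [0.95, 0.95], [0.98, 0.92],
])
y = np.array([1] * 6 + [2] * 6)

# 5 nearest neighbors, each neighbor's vote counted equally
model = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform')
model.fit(X, y)

# score() is accuracy; the error here is measured on the training data itself
accuracy = model.score(X, y)
misclassification_error = 1 - accuracy
```

Because the error is computed on the same points used for fitting, each point counts itself among its own neighbors, which is worth keeping in mind for the next question.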

In [None]:
# TODO

Answer:

> ## Question 4.  Is this error reliable?

Answer:

> ## Question 5.  Now use 10-fold cross-validation to choose the best-performing `k`
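
One way to do this, sketched here under assumptions, is `GridSearchCV` from `sklearn.model_selection` with `cv=10`: it fits the classifier for every candidate `k` on each of the 10 folds and keeps the `k` with the highest mean held-out accuracy. The data below is randomly generated for illustration, and the candidate range 1 to 15 is an arbitrary choice, not one prescribed by the lab:

```python
import numpy as np
from sklearn import neighbors, model_selection

# Made-up data: two overlapping Gaussian blobs, 50 points per class
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 2)),
    rng.normal(2.0, 1.0, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

# Try k = 1..15, scoring each by mean accuracy over 10 folds
search = model_selection.GridSearchCV(
    estimator=neighbors.KNeighborsClassifier(weights='uniform'),
    param_grid={'n_neighbors': list(range(1, 16))},
    cv=10,
    scoring='accuracy',
)
search.fit(X, y)

best_k = search.best_params_['n_neighbors']
best_cv_accuracy = search.best_score_
```

By default the search refits the winning model on all the data, so `search.best_estimator_` can be used directly for later predictions.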

In [None]:
# TODO

> ## Question 6.  Explain your findings

Answer:

> ## Question 7.  Train your model with the optimal `k` you found above (don't worry if it changes from time to time - if that is the case use the one that is usually the best)

In [None]:
# TODO

Answer:

> ## Question 8.  After training your model with that `k`, use it to predict the class of a neighborhood with `RM = 2`, `PTRATIO = 19`, and `LSTAT = 3.5`
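
Since the model is trained on scaled features, a new observation has to be scaled with the same minimum and maximum as the training data before it is passed to `predict`. A minimal sketch on a made-up training set (the frame `train`, its labels, the new point, and the choice of 3 neighbors are all illustrative assumptions):

```python
import pandas as pd
from sklearn import neighbors

# Made-up training data with two features and two class labels
train = pd.DataFrame({
    'RM': [4.0, 4.5, 5.0, 7.0, 7.5, 8.0],
    'LSTAT': [30.0, 28.0, 25.0, 5.0, 4.0, 3.0],
    'label': [1, 1, 1, 4, 4, 4],
})
features = ['RM', 'LSTAT']

# Keep the training min and max so new points can be scaled identically
col_min = train[features].min()
col_max = train[features].max()
X_train = (train[features] - col_min) / (col_max - col_min)

model = neighbors.KNeighborsClassifier(n_neighbors=3, weights='uniform')
model.fit(X_train, train['label'])

# Scale the new observation with the training statistics, then predict
new = pd.DataFrame({'RM': [7.2], 'LSTAT': [4.5]})
new_scaled = (new[features] - col_min) / (col_max - col_min)
predicted = model.predict(new_scaled)[0]
```

A new value that falls outside the training range scales to a number below 0 or above 1; the classifier still returns a prediction, but it is an extrapolation and should be read with caution.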

In [None]:
# TODO

Answer: