# Feature Engineering and k-Nearest Neighbors with the California Housing Prices Data Set
* [Overview](#overview)   
* [Using seaborn](#using-seaborn)
* [Reviewing the Data Set](#reviewing-the-data-set)
* [Examining the Categorical Data](#examining-the-categorical-data)
* [One-Hot Encoding](#one-hot-encoding)
* [k Nearest Neighbors](#k-nearest-neighbors)
* [Rescaling the Data](#rescaling-the-data)
* [Putting it Together](#putting-it-together)

## Overview

## Using Seaborn 

*seaborn* is a Python library built on top of matplotlib. It provides a higher-level interface for statistical plots, and it works directly with pandas data frames. You should be able to install seaborn using whatever method you've used for other packages (conda or pip). We can then import it. 

In [1]:
import seaborn as sns #import the seaborn library

Seaborn has a bunch of nice plotting features. One thing that I like is the ability to create scatterplots whose points are color-coded by the value of a chosen variable, using the [sns.scatterplot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) command. 

## Reviewing the Data Set 

We will be working with the California Housing Prices Data Set from two weeks ago. 

In [2]:
import pandas as pd

housing_df = pd.read_csv("california-housing.csv")


We can use seaborn's scatterplot command to visualize how location affects price in this data set.

In [None]:
sns.scatterplot(x="longitude",
                y="latitude",
                data=housing_df,
                hue="median_house_value")


## Examining the Categorical Data

A quick review: here is what the data columns look like. 

In [None]:
housing_df.head()

In [None]:
housing_df.dtypes #access the data types of the columns. 

There is one column that is not numeric. (We could automate the check for categorical variables by using the [select_dtypes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html) command.)
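As a minimal sketch of that automated check, the cell below uses select_dtypes on a small made-up frame (toy_df is a hypothetical stand-in for housing_df, with only three of its columns) to list the columns that are not numeric.

```python
import pandas as pd

# Hypothetical miniature frame standing in for housing_df
toy_df = pd.DataFrame({
    "longitude": [-122.23, -118.30],
    "median_house_value": [452600.0, 187500.0],
    "ocean_proximity": ["NEAR BAY", "INLAND"],
})

# Exclude every numeric dtype; whatever remains is categorical/text
categorical_cols = toy_df.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```

The same call on housing_df should pick out the single non-numeric column.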

In the past, we dealt with this column by dropping it. Now, we want to see whether it actually carries useful information about housing prices. We'll analyze this both quantitatively using pandas and visually in seaborn. 

Let's see how many unique categories there are in ocean_proximity.

In [None]:
housing_df.ocean_proximity.value_counts()

Let's create a latitude-longitude scatterplot that shows what these categorical features represent.

In [None]:
sns.scatterplot(x="longitude",
                y="latitude",
                data=housing_df,
                hue="ocean_proximity")

In [None]:
housing_df.groupby(["ocean_proximity"]).mean() #Find the average value by different categories. 

Let's also use seaborn to create some boxplots to visualize how the housing prices vary with the different categories, using the [boxplot](https://seaborn.pydata.org/generated/seaborn.boxplot.html) command. 

In [None]:
graph = sns.boxplot(x = "ocean_proximity", y = "median_house_value", data = housing_df)
graph.axhline(housing_df["median_house_value"].median(),
             color = "red", label = "median house value")
graph.axhline(housing_df["median_house_value"].mean(),
             color = "blue", label = "average house value")
graph.legend()

## One-Hot Encoding

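Our machine learning algorithms are mathematical processes that operate on numbers, so a text column such as ocean_proximity cannot be used directly. One approach is *one-hot encoding*: create a new column for each category, holding 1 if the row belongs to that category and 0 otherwise. 

pandas provides the *get_dummies* function for this. 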

In [None]:
ocean_dummies = pd.get_dummies(housing_df["ocean_proximity"], dtype=int) #get dummy variables as 0/1 integers (recent pandas versions default to True/False)

In [None]:
housing_df.columns

Now we need to add the columns from ocean_dummies onto housing_df.

In [None]:
new_df = housing_df.join(ocean_dummies)

In [None]:
new_df.head()

In [None]:
new_df = new_df.drop(columns = ["ocean_proximity"]) #We've gotten that information into a usable form. 
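To see what these steps produce without needing the full data set, here is a small sketch on a made-up three-row frame (toy_df and its values are hypothetical, chosen only for illustration).

```python
import pandas as pd

# Hypothetical three-row frame with one numeric and one categorical column
toy_df = pd.DataFrame({
    "median_income": [8.3, 2.1, 5.6],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})

# One 0/1 column per category
dummies = pd.get_dummies(toy_df["ocean_proximity"], dtype=int)

# Attach the new columns, then drop the original text column
encoded = toy_df.join(dummies).drop(columns=["ocean_proximity"])
print(encoded)
```

Each row has exactly one 1 among the new columns, marking the category it belonged to.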

We could just try going back to Linear Regression and see if it improves the fit. 

We've done more data preprocessing, and we have a better understanding of what our data represents. 

## k Nearest Neighbors -- A New Predictor

We plan to use the *k-nearest-neighbors* approach to regression. sklearn implements this with the [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html) class. 

The class has many options. Some of them are: 

1. *n_neighbors* tells how many neighbors to use in the prediction. Default is 5.
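2. *weights* tells how to weight the responses from the neighbors: *uniform* (every neighbor counts equally, the default) or *distance* (closer neighbors count more, weighted by the inverse of their distance).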
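Here is a minimal sketch of how these two options change a prediction. The training data is made up (one feature, four points) so that the arithmetic can be checked by hand; it is not the housing data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up training data: one feature and one response per row
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 10.0, 20.0, 30.0])
x_new = np.array([[1.2]])  # its two nearest neighbors are x=1.0 and x=2.0

# Uniform weights: plain average of the 2 nearest responses, (10 + 20) / 2
knn_uniform = KNeighborsRegressor(n_neighbors=2, weights="uniform")
knn_uniform.fit(X_train, y_train)
uniform_pred = knn_uniform.predict(x_new)[0]

# Distance weights: each response is weighted by 1 / distance,
# (10/0.2 + 20/0.8) / (1/0.2 + 1/0.8)
knn_distance = KNeighborsRegressor(n_neighbors=2, weights="distance")
knn_distance.fit(X_train, y_train)
distance_pred = knn_distance.predict(x_new)[0]

print(uniform_pred, distance_pred)
```

The distance-weighted prediction is pulled toward the response at x=1.0 because that neighbor is closer to the new point.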

We will discuss how some of these options work, along with some of the other options, in tomorrow's videos. 

![Example:nn-regression-data-set](nn-regression-data-set.png)

It's important to remember here that distance doesn't just mean physical distance. When we use this with the housing data set, all of the variables will be used in calculating distance. <span style="color:green"> A better term might be "similarity". </span>

## Rescaling the Data

When we look at distance, it's important that features be on the same scale. For instance, is a housing district which is 10000 dollars away "closer" than one that is 2 degrees of longitude away? 
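To make the question concrete, the sketch below computes ordinary Euclidean distances between three made-up districts described by price and longitude only. The numbers are hypothetical and chosen to mirror the example in the text.

```python
import numpy as np

# Hypothetical districts: [median_house_value in dollars, longitude in degrees]
a = np.array([200000.0, -122.0])
b = np.array([210000.0, -122.0])  # 10000 dollars away, same longitude
c = np.array([200000.0, -120.0])  # same price, 2 degrees of longitude away

dist_ab = np.linalg.norm(a - b)
dist_ac = np.linalg.norm(a - c)
print(dist_ab, dist_ac)
```

On the raw numbers, the district 2 degrees away looks far "closer" than the one 10000 dollars away, simply because dollars are measured in much larger units than degrees.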

To address this issue, we need to rescale all of our variables so that they are on the same scale. 

Two approaches: StandardScaler and MinMaxScaler. 

StandardScaler: converts each value to its z-score (Math 270). For each column, it subtracts the column's mean and divides by the column's standard deviation. 
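As a quick check of that description, the sketch below computes z-scores by hand on a made-up column and compares them with sklearn's result. One detail worth knowing: StandardScaler uses the population standard deviation (ddof=0), which matches numpy's default for std but not pandas' default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single column of values
values = np.array([[1.0], [2.0], [3.0], [4.0]])

# z-score by hand: subtract the column mean, divide by the column std (ddof=0)
manual = (values - values.mean(axis=0)) / values.std(axis=0)

# The same computation done by sklearn
scaled = StandardScaler().fit_transform(values)

print(np.allclose(manual, scaled))
```

After scaling, the column has mean 0 and standard deviation 1.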

sklearn provides a StandardScaler object that does this for you. You will need to use one of the two scaling approaches for k Nearest Neighbors to work well. 

In [None]:
from sklearn.preprocessing import StandardScaler

st_scaler = StandardScaler()

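new_df_standard = pd.DataFrame(st_scaler.fit_transform(new_df),
                               columns=new_df.columns) #keep the original column names

The output of fit_transform is a numpy array, so I convert it back into a data frame, passing along the original column names so that they are not lost. I now have a new data frame that is compatible with k Nearest Neighbors. 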

In [None]:
new_df_standard.head()
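The other approach mentioned above, MinMaxScaler, is used the same way. By default it rescales each column to the range from 0 to 1 by subtracting the column's minimum and dividing by the column's range. The sketch below uses a made-up frame (toy_df and its column names are hypothetical) rather than new_df.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up frame with two columns on very different scales
toy_df = pd.DataFrame({
    "rooms": [2.0, 4.0, 6.0],
    "income": [10.0, 20.0, 50.0],
})

mm_scaler = MinMaxScaler()

# fit_transform returns a numpy array, so rebuild the data frame with its column names
toy_minmax = pd.DataFrame(mm_scaler.fit_transform(toy_df), columns=toy_df.columns)
print(toy_minmax)
```

Every column now runs from 0 (its smallest value) to 1 (its largest value).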

## Putting It Together