# I. Introduction
This beginner's project is based on many ideas found in *Machine Learning for Absolute Beginners: A Plain English Introduction* by Oliver Theobald. It has been heavily augmented with references found around the web (linked when appropriate).

---

![Machine Learning for Absolute Beginners](https://images-na.ssl-images-amazon.com/images/I/413%2BI3pEaXL.jpg)

---


This book is an excellent introduction for those just beginning their machine learning journey. Though this class introduces a number of principles found in that text, I highly recommend buying the book yourself and working through it after your experience here. The author walks the reader through a number of important concepts that are too extensive for this course; his writing is clear, and he does a spectacular job of explaining difficult topics to ensure understanding. Additionally, he provides a number of examples which the reader can tackle after getting down the basics.

The book can be found here if you are interested: [Machine Learning for Absolute Beginners](https://www.amazon.com/Machine-Learning-Absolute-Beginners-Introduction/dp/1549617214/ref=sr_1_1?crid=1AF1PFSE85G4F&keywords=machine+learning+for+absolute+beginners+a+plain+english+introduction&qid=1563399014&s=gateway&sprefix=machine+learning+for+absolute+beginners+%2Caps%2C326&sr=8-1)

# II. Obtain and Work with the Data

## A. Get the dataset
1. Press "Connect" in the upper right corner of this page (on Colab).

2. The dataset is available through Kaggle.com, which requires a login to access. To save you that step, I have also made the dataset available on my GitHub page, and you should be able to access and import it using the following commands.


- If the following command fails, please go to the [GitHub page](https://github.com/ccwilliamsut/machine_learning/tree/master/absolute_beginners/data_files/modified), download "CaliforniaHousingDataModified.csv" to your downloads folder, and uncomment the alternative commands.

- If the above does not work and you are still having trouble, consult this [link](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92) to learn how to work with data files in Colaboratory.
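
The "try the URL first, then fall back to a local copy" idea described above can be sketched as follows. This is a minimal illustration rather than part of the course code: `load_csv_with_fallback` is a hypothetical helper name, and a small temporary file stands in for the downloaded dataset so that the example runs without network access.

```python
import os
import tempfile

import pandas as pd


def load_csv_with_fallback(primary, fallback):
    """Try to read `primary`; if that fails, read `fallback` instead.

    Hypothetical helper for this sketch (not a pandas function).
    """
    try:
        return pd.read_csv(primary)
    except Exception as err:  # e.g. no network access, a bad URL, a missing file
        print(f"Could not read {primary!r} ({type(err).__name__}); trying {fallback!r}")
        return pd.read_csv(fallback)


# A tiny local file stands in for the manually downloaded dataset
tmp_dir = tempfile.mkdtemp()
local_copy = os.path.join(tmp_dir, "demo.csv")
pd.DataFrame({"a": [1, 2], "b": [3, 4]}).to_csv(local_copy, index=False)

# The primary source does not exist, so the fallback is used
missing_path = os.path.join(tmp_dir, "does_not_exist.csv")
demo_df = load_csv_with_fallback(missing_path, local_copy)
print(demo_df.shape)
```

In the notebook itself, the first argument would be the GitHub URL and the second the path to the file you downloaded.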

In [0]:
# ------------------------------------------ Setup the Environment ------------------------------------------
# Import libraries
import pandas as pd
import seaborn as sns; sns.set()
import numpy as np
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
import missingno as msno
from IPython.display import display
from sklearn.model_selection import train_test_split 
from sklearn import ensemble
from scipy import stats
from sklearn import linear_model
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import learning_curve,GridSearchCV
from sklearn.model_selection import cross_val_predict
from sklearn import preprocessing
from collections import Counter
import joblib                     # 'sklearn.externals.joblib' was removed from newer scikit-learn releases
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Set the URL for the data file
url = 'https://raw.githubusercontent.com/ccwilliamsut/machine_learning/master/absolute_beginners/data_files/modified/CaliforniaHousingDataModified.csv'

# Import the datafile from the provided url and run the cell
df = pd.read_csv(url)

# ------- ALTERNATIVE COMMANDS (if the above commands do not work) -------
#df = pd.read_csv('~/Downloads/CaliforniaHousingDataModified.csv')

## B. Learn how to initially explore datasets


### (1). Basic functions for analyzing datasets
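The cells below walk through a few basic functions for getting a first look at a dataset: previewing rows, drawing a random sample, and summarizing the values. To see the names of our **features** (i.e. column headings), we can use **```df.columns```**; wrapping it in **```list()```** prints them in a form that is easier to read.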

In [0]:
# Look at the first 5 rows of the file
print('The "head" of our data:')
df.head()

In [0]:
# Get a random sample of the data
print('\n\nRandom sample of the data:')
df.sample(5)

In [0]:
# Get a statistical description of the data
print('\n\nStatistical description of the data:')
df.describe()

In [0]:
# List the column headers to see what we are working with
print("\n\nHere are all of our columns:\n")
df.columns

In [0]:
# List all the columns in the dataset in an ordered way
print('\n\nAll of our columns in a list form (much easier to read):\n')
list(df.columns)

In [0]:
# Look at a specific sample (record) in the dataset
print('\n\nLook at one specific record in a dataset:\n')
print(df.iloc[27])

In [0]:
# Look at the "shape" of the dataset
print('\n\nLook at the shape of a dataset (rows / columns):\n')
df.shape

In [0]:
# Look at some information about the dataset (feature name, total non-null record count,
#   whether the feature is empty, datatype for each feature)
print('\n\nLook at the info about a dataset (feature name, record count, records exist, datatype):\n')
df.info()

### (2). Analyze the dataset for potential problems

**Finding problems**
- Look at the above data closely. **Talk** about it with your **group/partner** or think about it on your own.
- Can you spot the potential issues that we might have with this dataset?

```


```

When you are ready, unhide the cells below to see some of the potential issues with this dataset.

#### i. Looking for missing data
One thing that we need to do before moving on is to see if there is any missing data. If a value is null when the data is imported to Pandas, then the value "NaN" is assigned. Some of our data might have null (```NaN```) values that will cause problems with the performance of our model. 

Missing data can cause a number of problems:
- Further analysis with statistical measures and graphs can **distort our understanding**
- Missing data can **distort our model** if those values affect predictive or clustering calculations
- Some **functions can fail** if they encounter null values
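
The points above can be seen with a tiny made-up series. This is only a sketch (the numbers are invented, not taken from the housing data): pandas quietly skips the missing value, while NumPy lets it propagate into the result.

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, np.nan, 6.0])   # made-up numbers, one of them missing

# pandas skips NaN by default, so the mean is taken over 3 values, not 4
pandas_mean = values.mean()

# NumPy propagates NaN, so a single missing value turns the result into NaN
numpy_mean = np.mean(values.to_numpy())

# Counting the missing values makes the problem visible
missing_count = values.isnull().sum()

print(pandas_mean, numpy_mean, missing_count)
```

Neither behavior is wrong, but both can mislead if we do not know that the gaps are there.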

Let's quickly look over our dataset for any missing values.

In [0]:
# Use the MissingNo library to visualize where any missing data is
msno.matrix(df)

It appears that our **```total_bedrooms```** feature has some missing data. We can investigate further to count the number of records with missing values.

In [0]:
# Get a count of "NaN" or missing values by feature
print('Count of missing values by feature:\n')
display(df.isnull().sum(axis=0))

### (3). Discussion / Reflection


#### Quality issues
The primary problems at this point are:
- The spelling of **```lattitude```**
- The meaning of **```t_rooms```**
- The scale used for **```median_income```**
- The scale used for **```proximity_to_store```**


#### Analyzing the problems
- **Spelling** might not seem important at first (after all, it doesn't directly affect our model's performance), but the spelling is actually a fairly big issue. This is not because of any immediate coding problems, as we could easily write our code and accommodate the misspelling. But when our model moves downstream in other workflows, we will have to make special note of the misspelling to others, lest we make their debugging more problematic and time-consuming. Worse yet, it might not get caught at all and be shown to leadership as is, possibly leaving a bad impression. **If we think that we will possibly use this feature, then it is better to fix it now than wait for later**. 

- **```t_rooms```** presents another issue, as we cannot be sure exactly what it means at this point. We can guess based on the other data that it likely means "total rooms", but we need to be sure. To get the answer, we (fictionally, in this case) contact the team that collected the data and confirm that **```t_rooms```** should be **```total_rooms```**. Again, we want to think about whether we actually need this feature before making the change. It seems likely that the total rooms in a house will have some kind of correlation to the overall price, so let's keep this and change it.

- **```median_income```** and **```proximity_to_store```** present different problems. We do not know the scale being used in either case, and they appear to be different from one another. We will have to analyze these features further in the next section to see if we (1) can use them as is, (2) have to drop one or both of them, or (3) have to engineer them to create something useful.

In [0]:
# Change the spelling of a feature name
df.rename(columns = {'lattitude':'latitude',
                     't_rooms':'total_rooms'
                     },
          inplace=True
          )

# Verify that the changes have been made as intended
list(df.columns)

## C. Analyze the data for viability
- Become more familiar with the data
- Look for possible relationships
- Determine which features should be used and which need to be removed

```

```

**Key questions to consider at this point:**
1. Does my data have **labels**? (dependent variables; determines whether we will use supervised or unsupervised algorithms)
2. Are there major **issues** with the data (lots of null values, small number of samples, misspellings, derived data, unknown scales or values, etc.)?
3. What is my **goal**? (For this course, our goal is to create a model that can predict housing prices based upon the given set of features.)
    - Will I be able to accomplish that goal with this dataset?
    - If not, can I employ feature engineering to create the data I need?
    - If so, which features will contribute to that goal and which are unnecessary?
4. Can I see any **relationships** in the data that might serve as a good foundation for my model?

In [0]:
# Look at the data once again to get an idea of variance, outliers, problem features, relationships, etc.
df.describe()

### (1). Identify patterns, problems and usability
- Notice that **not all columns are shown** in the output of the **```df.describe()```** function... Why do you think this might be?
- Do you see any **new problems**?
- Is our data **labeled or unlabeled**?
- Can you spot **any possible relationships** that we might explore (using any of the data above)?
- Can we **accomplish our goal** with this dataset?



### (2). Preliminary inquiries about "problem" data
Let's look at a few features that we have identified as problems above:
1. Why is **```ocean_proximity```** not included in the output of the **```df.describe()```** function?
2. What do the distributions of **```median_income```** and **```ocean_proximity```** tell us? Are they useful?

In [0]:
# List 15 random samples in 'ocean_proximity' to discover why it is not shown in the describe function
print('Here are some of the contents of the "ocean_proximity" column:')
df['ocean_proximity'].sample(15)

#### i. Question: Ocean Proximity
- Why do you think that **```ocean_proximity```** is not included in the output of the **```df.describe()```** function?
- Do you think we can use this data? How?

In [0]:
# Look at distributions of median_income and ocean_proximity
fig, (ax0, ax1) = plt.subplots(nrows = 1,
                               ncols = 2,
                               figsize = (15, 5),
                               sharey = False
                               )
df.hist('median_income', ax = ax0)
sns.countplot(x = 'ocean_proximity',
              data = df,
              orient = 'h',
              ax = ax1
              )


plt.show()
plt.close()

#### ii. Questions: Usability
- Do you think that the above data will help our model to be more accurate?
- Should we keep one, both or none of these features?

We will not make any changes to these columns just yet, but it is worth noting the potential problems that they exhibit (unknown scale and non-numerical data).
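
As a small illustration of the "non-numerical data" point, the sketch below uses a made-up four-row frame (not the real dataset) to show that **```describe()```** summarizes only numeric columns by default, and how a text column can be summarized instead.

```python
import pandas as pd

# Made-up rows that mimic two of the columns in our dataset
demo = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8],
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "NEAR OCEAN"],
})

# By default, describe() reports on numeric columns only
numeric_summary = demo.describe()

# Excluding numeric columns summarizes the text column (count, unique, top, freq)
text_summary = demo.describe(exclude="number")

# value_counts() shows how many records fall into each category
category_counts = demo["ocean_proximity"].value_counts()

print(numeric_summary)
print(text_summary)
print(category_counts)
```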

### (3). Use graphs to explore the data further
Using a variety of graphs is a great way to explore data. 

> Websites referenced:
- [Overview of graph types and purposes](https://towardsdatascience.com/a-step-by-step-guide-for-creating-advanced-python-data-visualizations-with-seaborn-matplotlib-1579d6a1a7d0)

#### i. Univariate Data Analysis
We can use some graphs to analyze one feature at a time. A couple of the more useful in this regard are:
- Histograms
- KDE (Kernel Density Estimation) - helps us to understand the distribution of data

> Websites referenced:
- [Analyze the data through data visualization using Seaborn (Towards Data Science)](https://towardsdatascience.com/analyze-the-data-through-data-visualization-using-seaborn-255e1cd3948e)
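
To make the difference between the two graph types concrete, here is a minimal NumPy sketch of what each one computes. It uses synthetic data and a hand-picked bandwidth of 3.0, both of which are assumptions for illustration; seaborn chooses a bandwidth automatically and its implementation differs in detail. `gaussian_kde_1d` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=30.0, scale=10.0, size=500)   # synthetic stand-in for a feature

# Histogram: count how many values fall into each of 20 equal-width bins
counts, edges = np.histogram(sample, bins=20)


def gaussian_kde_1d(data, grid, bandwidth):
    """Average a Gaussian 'bump' centered on every observation (hypothetical helper)."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth


# KDE: evaluate the smoothed density on a grid that extends well past the data
grid = np.linspace(sample.min() - 30.0, sample.max() + 30.0, 400)
density = gaussian_kde_1d(sample, grid, bandwidth=3.0)

# A density curve should enclose an area of roughly 1
area = np.sum(density) * (grid[1] - grid[0])
print(counts.sum(), round(area, 3))
```

The histogram depends on where the bin edges fall, while the KDE depends on the bandwidth; a larger bandwidth gives a smoother curve.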

##### a. Histogram Plots

> Websites referenced:
- [Tutorial on visualizing distributions](https://www.machinelearningplus.com/plots/matplotlib-histogram-python-examples/)
- [Creating multi-dimensional subplots](https://matplotlib.org/3.1.1/gallery/subplots_axes_and_figures/subplots_demo.html)
- [Great demo of the differences between histograms and KDE](https://mglerner.github.io/posts/histograms-and-kernel-density-estimation-kde-2.html)
- [Explanation of KDE](http://www.mvstat.net/tduong/research/seminars/seminar-2001-05/)

In [0]:
# List the features in the dataset
list(df.columns)

In [0]:
# Use histograms to explore features in the dataset

# Create a "figure" that can hold multiple plots
fig, (ax0, ax1, ax2) = plt.subplots(nrows = 1,              # create a grid with a specified number of rows
                                    ncols = 3,              # specify the number of columns
                                    figsize = (15, 4),      # specify the size of the whole figure (width, height in inches)
                                    sharey = False          # specify if all plots will share the y-axis values
                                    )

# ------------------ Histograms ---------------------
# Histogram 1 (in slot 'ax0', the first container)
sns.distplot(df['housing_median_age'], 
             kde = False,
             bins = 20, 
             color = 'magenta',
             ax = ax0
             )

# Histogram 2 (in slot 'ax1', the second container)
sns.distplot(df['total_rooms'], 
             kde = False,
             bins = 30, 
             color = 'green', 
             ax = ax1
             )

# Histogram 3 (in slot 'ax2', the third container)
sns.distplot(df['median_house_value'], 
             kde = False,
             bins = 50, 
             color = 'blue',
             ax = ax2
             )



##### b. KDE (Kernel Density Estimation) Plots

In [0]:
# Plot Kernel Density Estimates (KDE) for some features
fig, (ax0, ax1, ax2) = plt.subplots(nrows = 1,              # create a grid with a specified number of rows
                                    ncols = 3,              # specify the number of columns
                                    figsize = (15, 4),      # specify the size of the whole figure (width, height in inches)
                                    sharey = False          # specify if all plots will share the y-axis values
                                    )
# KDE plot 1 (in slot 'ax0', the first container)
sns.distplot(df['housing_median_age'], 
             hist = False,
             color = 'magenta',
             ax = ax0,
             kde_kws = {'shade': True}
             )

# KDE plot 2 (in slot 'ax1', the second container)
sns.distplot(df['total_rooms'], 
             hist = False,
             color = 'green', 
             ax = ax1,
             kde_kws = {'shade': True}
             )

# KDE plot 3 (in slot 'ax2', the third container)
sns.distplot(df['median_house_value'], 
             hist = False,
             color = 'blue',
             ax = ax2, 
             kde_kws = {'shade': True}
             )

#### ii. Bivariate Data Analysis
We can use a multitude of graphs to analyze features against each other in varying combinations.

Datasets generally have two primary data types:
- Statistical Data (integers, continuous variables)
- Categorical Data (boolean values, text-based values, etc.)

We can explore both of these types with the following types of graphs (note that this is not an exhaustive list):
- Pairplots
- Scatterplots
- Lineplots
- Surface plots
- Correlation heatmaps
- Category plots
- Boxplots
- Pointplots

> Websites referenced: 
- [Analyze the data through data visualization using Seaborn (Towards Data Science)](https://towardsdatascience.com/analyze-the-data-through-data-visualization-using-seaborn-255e1cd3948e)

##### a. Statistical Data
- Scatterplots
- Lineplots
- Pairplots
- Surface Plots

> Websites referenced:
- [Stackoverflow Help](https://stackoverflow.com/questions/25212986/how-to-set-some-xlim-and-ylim-in-seaborn-lmplot-facetgrid)

Let's begin by analyzing a few features using these types of graphs:

###### 1. Scatterplots

In [0]:
# Use scatterplots to compare features in the dataset

# Create a "figure" that can hold multiple plots (in this case, 1 row and 3 columns)
fig, ((ax1, ax2, ax3)) = plt.subplots(nrows = 1,                      # create a grid with a specified number of rows
                                      ncols = 3,                      # specify the number of columns
                                      figsize = (20, 5),              # specify the size of the whole figure (width, height in inches)
                                      sharey = False,                 # specify if all plots will share the y-axis values
                                      sharex = False                  # specify if all plots will share the x-axis values
                                      )

# Plot three scatterplots (each with a fitted regression line) side by side
# Scatterplots 1 to 3
ax1 = sns.regplot(x = df['median_income'],
                  y = df['median_house_value'],
                  data = df,
                  ax = ax1, 
                  dropna = True,
                  order = 2,                                          # Note here that we are using polynomial regression
                  line_kws = {'color': 'darkorange'},
                  fit_reg = True
                  )

ax2 = sns.regplot(x = df['total_bedrooms'],
                  y = df['median_house_value'],
                  data = df,
                  ax = ax2, 
                  dropna = True,
                  order = 2,
                  line_kws = {'color': 'purple'},
                  fit_reg = True
                  )

ax3 = sns.regplot(x = df['population'],
                  y = df['median_house_value'],
                  data = df,
                  ax = ax3, 
                  dropna = True,
                  order = 2,
                  line_kws = {'color': 'blue'},
                  fit_reg = True
                  )


# Change the range of a few axes variables to make the graphs more useful
ax1.set(ylim = (0, 500000))

#ax2.set(ylim = (0, 500000))
#ax2.set(xlim = (0, 2))

ax3.set(ylim = (0, 500000))
ax3.set(xlim = (0, 15000))

#ax6.set(xlim = (0, 15000))



###### 2. Lineplots
>**Note:** Lineplots can be very compute-intensive because the data is first sorted and then plotted. We will therefore create a sample set of data for use with those plots.

In [0]:
# Create a "figure" that can hold multiple plots
fig, ((ax1, ax2, ax3)) = plt.subplots(nrows = 1,                    # create a grid with a specified number of rows
                                    ncols = 3,                      # specify the number of columns
                                    figsize = (20, 5),             # specify the size of the whole figure (width, height in inches)
                                    sharey = True,                 # specify if all plots will share the y-axis values
                                    sharex = False                  # specify if all plots will share the x-axis values
                                    )
# Create a sample dataset
df_sample = df.sample(frac = 0.005, random_state = 5)

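# Ensure that our sample dataset does not have any missing values
#   (numeric_only = True skips text columns such as 'ocean_proximity' when computing the medians)
df_sample = df_sample.fillna(df_sample.median(numeric_only = True))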

# Plot the data for each figure container
ax1 = sns.lineplot(x = df_sample['median_income'],
                   y = df_sample['median_house_value'],
                   data = df_sample,
                   ax = ax1,
                   style = 'ocean_proximity',
                   hue = 'ocean_proximity'
                   )

ax2 = sns.lineplot(x = df_sample['total_bedrooms'],
                   y = df_sample['median_house_value'],
                   data = df_sample,
                   ax = ax2,
                   hue = 'ocean_proximity'
                   )

ax3 = sns.lineplot(x = df_sample['housing_median_age'],
                   y = df_sample['median_house_value'],
                   data = df_sample,
                   ax = ax3,
                   style = 'ocean_proximity'
                   )


###### 3. Pairplots
With this simple function, we can quickly analyze most of our features against one another to:
- Seek out possible relationships
- Identify possible issues with data
```
```

***Websites referenced:***
- [Seaborn Documentation: Pairplots](https://seaborn.pydata.org/generated/seaborn.pairplot.html)
- [Short Tutorial](https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166)
- [Pairplot Code Examples](https://jovianlin.io/data-visualization-seaborn-part-2/)

In [0]:
# Use seaborn.pairplot() to quickly look at possible relationships and visually analyze the data

# Use this to analyze each feature against the others while also segmenting out the proximity to the ocean
#sns.pairplot(data = df,
#              hue='ocean_proximity', 
#              dropna=True, 
#             )

# Use this function to analyze each feature against the others
#   (adding kind = 'reg' would overlay a regression line to help identify relationships, at the cost of speed)
sns.pairplot(data = df, 
             dropna = True
             )

###### 4. Surface Plots
The immediate goal for us is to determine which features will help us make a better model and which ones will not. For those that will not help us, we want to drop them in order to reduce compute needs and processing time for our final model. 

Let's look at some different types of graphs and how they can help us analyze our dataset for this purpose.

> Websites referenced:
> [Stackoverflow: Limiting x- and y-axis values](https://stackoverflow.com/questions/54951362/seaborn-jointplot-with-defined-axes-limits)

```

```
Let's first look at a basic contoured graph (using KDE: Kernel Density Estimation):

In [0]:
# Use contoured plots with KDE (Kernel Density Estimation) and targeted ranges

# Create a surface plot with KDE
plot1 = sns.jointplot(x = df['total_bedrooms'],       # set the x-axis variable
                      y = df['median_house_value'],   # set the y-axis variable
                      data=df,                        # define the dataset being used
                      kind='kde',                     # define the graph type to be used
                      dropna=True                     # define what happens to 'NaN' values
                      )

# Adjust the top of the subplot
plt.subplots_adjust(top=0.9)

# Add a title to a graph
plot1.fig.suptitle('Plot 1: Original Plot')

Looking at the data above, we can see that the vast majority of **```total_bedrooms```** values are less than 1000. It is likely that the graph is being stretched by some outlier data above 1000, so let's concentrate on the portion below that value to see if there is something interesting to learn.
```

```
We will construct the same graph, but we will limit the x-axis and y-axis outputs to specified ranges (x <= 1000 and y <= 400,000).
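
Before restricting the axes, an impression such as "the vast majority of values are below 1000" can also be checked numerically. The sketch below uses synthetic long-tailed data as a stand-in (an assumption for illustration, not the real **```total_bedrooms```** column); on the real data, the same two calls could be applied to the column directly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic long-tailed values standing in for a feature such as total_bedrooms
bedrooms = pd.Series(rng.lognormal(mean=6.0, sigma=0.6, size=5000))

threshold = 1000
share_below = (bedrooms < threshold).mean()      # fraction of records under the cut-off
p95, p99 = bedrooms.quantile([0.95, 0.99])       # where the upper tail begins

print(f"{share_below:.1%} of values are below {threshold}")
print(f"95th percentile: {p95:.0f}, 99th percentile: {p99:.0f}")
```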

In [0]:
# Plot the same type of graph (contour KDE) but with restricted x- and y-axis values
plot2 = sns.jointplot(x = df['total_bedrooms'],
                      y = df['median_house_value'], 
                      data=df,
                      kind='kde',
                      dropna=True
                      )

# Limit the x-axis and y-axis ranges to the specified values
plot2.ax_marg_x.set_xlim(0, 1000)
plot2.ax_marg_y.set_ylim(0, 400000)

# Add a title to a second graph
plot2.fig.suptitle('Plot 2: Limited x-axis & y-axis values')
plt.subplots_adjust(top=0.9)

###### 5. Correlation Heatmaps
We can also look at how well data is correlated via the **```seaborn.heatmap()```** function. This will help to give us an idea of which features are well or poorly correlated to each other.

> *Websites referenced:*
- [Seaborn Documentation: Heatmaps](https://seaborn.pydata.org/generated/seaborn.heatmap.html)
- [Excellent tutorial on modifying heatmaps](https://heartbeat.fritz.ai/seaborn-heatmaps-13-ways-to-customize-correlation-matrix-visualizations-f1c49c816f07)
- [Short Tutorial about correlation and heatmaps](https://jovianlin.io/data-visualization-seaborn-part-2/)
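
Each cell of a correlation heatmap is a single number between -1 and 1. As a minimal sketch of where that number comes from, the example below computes Pearson's correlation coefficient by hand for two made-up columns and compares it with the value pandas reports. The figures are invented for illustration only.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "median_income":      [2.0, 3.5, 4.1, 5.8, 7.9],          # made-up values
    "median_house_value": [110.0, 160.0, 175.0, 260.0, 340.0],
})

x = demo["median_income"].to_numpy()
y = demo["median_house_value"].to_numpy()

# Pearson's r: how the two columns vary together, scaled by how much each varies alone
x_dev = x - x.mean()
y_dev = y - y.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# pandas computes the same quantity for every pair of columns
r_pandas = demo.corr().loc["median_income", "median_house_value"]

print(round(r_manual, 4), round(r_pandas, 4))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.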

In [0]:
# -- Analyze the data with a heatmap --
# 1. Only keep the columns and data that we want to see correlated
cols = df.drop(['latitude', 'longitude', 'ocean_proximity'],
               axis=1
               )

# 2. Fill in missing values with the median value of the feature
#     - You can also use the mean (or even the mode if dealing with categorical data)
cols.fillna(cols.median(),
            inplace = True
            )

# 3. Calculate the correlations
corr = cols.corr()

# 4. Set the size you want
plt.figure(figsize=(9 ,9))

# 5. Display the heatmap
sns.heatmap(corr,
            annot=True,
            vmin = -1,
            vmax = 1,
            center = 0
            )

##### b. Categorical Data
Categorical data consists of:
- qualitative measurements (good / bad / okay, 5-stars / 3-stars, etc.)
- discrete measurements ('near ocean', 'on time', etc.)

```

```

Categorical data can also be qualified by the following parameters:
- nominal (no intrinsic order to the categories)
- ordinal (has an intrinsic order such as a 5-star ratings system)
- dichotomous (can be only one of two possible values such as true/false, off/on, etc.)
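
In pandas, these distinctions can be made explicit. The following is a small sketch with made-up values; the star ratings are borrowed from the example above and are not part of our dataset.

```python
import pandas as pd

# Nominal: the categories have no intrinsic order
proximity = pd.Categorical(["INLAND", "NEAR BAY", "INLAND", "ISLAND"])

# Ordinal: declare the order explicitly so that sorting and comparisons respect it
ratings = pd.Categorical(
    ["3-stars", "5-stars", "1-star", "3-stars"],
    categories=["1-star", "3-stars", "5-stars"],
    ordered=True,
)

# Dichotomous: only two possible values
near_water = pd.Series([False, True, False, True])

print(proximity.ordered, ratings.ordered)
print(ratings.min(), ratings.max())
print(near_water.nunique())
```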

```

```

We can analyze categorical data using many types of graphs with varying techniques. One of these techniques is known as a **facetgrid**, and it allows us to break out statistical data according to categorical measures.

###### 1. Countplots


In [0]:
# Create a countplot with a categorical variable
sns.countplot(x = df['ocean_proximity'])

###### 2. Histograms (broken down by category, using Facetgrid)
**Facetgrid plots** allow one to break out data by category while analyzing either univariate or bivariate data.

In [0]:
# Analyze graphs side-by-side (using seaborn Facetgrid)
ghist = sns.FacetGrid(df, 
                      col='ocean_proximity', 
                      hue = 'ocean_proximity', 
                      dropna = True, 
                      legend_out=True, 
                      despine=True
                      )
ghist.map(plt.hist,
          'median_house_value',
          alpha=1,
          bins = 20
          )

# Create another facetgrid with scatterplots
gscat = sns.FacetGrid(df,
                      col='ocean_proximity',
                      hue = 'ocean_proximity',
                      dropna = True,
                      despine=True
                      )
gscat.map(plt.scatter,
          'median_income',
          'median_house_value',
          alpha=.3
          )

###### 3. Category Plots (Catplots)

In [0]:
# Create a category plot using 'ocean_proximity'
g = sns.catplot(x = 'ocean_proximity',
                y = 'median_house_value',
                hue = 'ocean_proximity',
                data = df,
                height = 5,
                aspect = 1.5,
                kind = 'box',          # Change this to: 'box', 'violin', 'boxen', 'point', or 'bar'
                order = ['NEAR OCEAN',
                         'NEAR BAY',
                         '<1H OCEAN',
                         'ISLAND',
                         'INLAND']
                )

g.set_xticklabels(rotation = 45)

#### (4). Binning / Bucketing Data
Using this method, we can analyze continuous variables as categorical ones (though the actual value might remain continuous). This allows us to explore our dataset in ways that would not otherwise be available.

In the following example, we will create "bins" or "buckets" of categories for **```median_house_value```** and then filter our data into them (using a new feature called **```median_house_value_bins```**).
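
When binning, it matters which edge of each bin is included. The sketch below uses made-up prices that sit exactly on the bin edges to show how the `right` argument of **```pandas.cut()```** changes the result: by default each bin includes its right edge, while `right=False` includes the left edge instead.

```python
import pandas as pd

values = pd.Series([50_000, 100_000, 150_000, 200_000])   # made-up prices
bins = [0, 100_000, 200_000, 300_000]
labels = ["< 100k", "100-199", "200-299"]

# Default (right=True): each bin includes its RIGHT edge, e.g. (0, 100000]
right_closed = pd.cut(values, bins=bins, labels=labels)

# right=False: each bin includes its LEFT edge, e.g. [0, 100000)
left_closed = pd.cut(values, bins=bins, labels=labels, right=False)

comparison = pd.DataFrame({
    "value": values,
    "right=True": right_closed.astype(str),
    "right=False": left_closed.astype(str),
})
print(comparison)
```

With labels such as `'100-199'`, a value of exactly 100,000 only lands in the matching bin when `right=False` is used.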

In [0]:
# Use binned / bucketed data to create discrete groups from statistical (continuous) data

# Create bin / bucket edges, then create labels to match those bins
#   (right = False makes each bin include its left edge and exclude its right edge)
bins = [0, 100000, 200000, 300000, 400000, 500000, 600000]
labels = ['< 100k', '100-199', '200-299', '300-399', '400-499', '500-599']

# Create a new feature in our dataset to hold the new categorical data and perform the pandas.cut() operation
df['median_house_value_bins'] = pd.cut(df['median_house_value'], 
                                       bins = bins,
                                       labels = labels,
                                       right = False)

# Create a variable to hold the name of our new category and operators (so changing them later is easy)
yval = 'avg_people_per_household'
vnum = 'population'
vdem = 'households'

# Here we create another feature based on calculations of existing features
df[yval] = (df[vnum] / df[vdem])

# Create a catplot which will break out our categorical data in multiple ways
g = sns.catplot(x = 'ocean_proximity',                # set the x-axis categorical variable (will create ticks for each category)
                y = yval,                             # set the y-axis variable (the dependent variable)
                col = 'median_house_value_bins',      # set the column category with which you want to break up the data
                data = df,                            # set the data source
                height = 4,                           # set the height of the graph
                aspect = .8,                          # set the aspect ratio to determine the graph shape (width = height * aspect)
                order = ['NEAR OCEAN',                # set the order in which you would like to lay out the categories on the x-axis
                         'NEAR BAY',
                         '<1H OCEAN',
                         'ISLAND',
                         'INLAND'
                         ]
                )

g.set_axis_labels(x_var = '', y_var = yval)           # set the variables to use on the x- and y-axes
g.set_xticklabels(rotation = 45)                      # rotate the x-axis labels (if running into each other)
g.set(ylim = (0,20))                                   # set the y-axis limits to measure

#### (5). Final Analysis and Planning
Now that we have explored our data in multiple ways and visualized it, we can begin to transform it in ways that will help us accomplish our goal (to predict the **```median_house_value```** based on the feature set that we choose).



##### a. Questions:
- How do we read this data? What does it mean?
- Which features, if any, will help us build a model that can accurately predict the **```median_house_value```**? Do any of them appear to correlate?
- Which features do we need to **drop** or **change**?

##### b. New problems:

When looking at all the data analysis above, it becomes easier to identify at least a few new problems:
1. The **```median_house_value```** has a lot of values stacked up in the $500,000 range
2. **```total_rooms```** and **```total_bedrooms```** have very *long tails* which can cause problems with accuracy (changes the mean, distorts graphs, etc.)
3. We have **categorical data** in ```ocean_proximity```...  Can we use this in a machine learning model?
4. We have some **missing data** in one of our columns... Can you spot which one?
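
The effect of a long tail (item 2) can be sketched with a handful of made-up room counts. These are invented numbers, not values from our dataset, and the cut-off of 10,000 is arbitrary.

```python
import pandas as pd

# Made-up room counts: most districts are similar, but one is extremely large
rooms = pd.Series([1800, 2100, 2300, 2500, 2700, 39000])

mean_all = rooms.mean()
median_all = rooms.median()

# Dropping the single extreme value barely moves the median but shifts the mean a lot
trimmed = rooms[rooms < 10_000]
mean_trimmed = trimmed.mean()
median_trimmed = trimmed.median()

print(f"All values:     mean = {mean_all:.0f}, median = {median_all:.0f}")
print(f"Without outlier: mean = {mean_trimmed:.0f}, median = {median_trimmed:.0f}")
```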

```
```

Take some time to think about the above questions. Discuss them with your group or partner if applicable.

# III. Scrubbing Data
Most of a data scientist's time is spent on working with data. This includes performing all of the tasks that we have done previously as well as the ones which we will now cover:
- One-hot encoding
- Deleting (and sometimes creating new) features
- Fixing missing data
- Dealing with outlier data
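
As a rough preview of these four tasks, here is a minimal sketch on a made-up four-row frame. The column choices, the use of the median, and the cap of 1000 are assumptions for illustration only; the approach taken later in this course may differ.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0, 6445.0],           # made-up values
    "median_income": [8.3, 7.2, 5.6, 3.8],
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "NEAR OCEAN"],
})

# 1. One-hot encoding: one indicator column per category
encoded = pd.get_dummies(demo, columns=["ocean_proximity"])

# 2. Deleting a feature that we decide not to use
encoded = encoded.drop(columns=["median_income"])

# 3. Fixing missing data: fill the gap with the column median
median_bedrooms = encoded["total_bedrooms"].median()
encoded["total_bedrooms"] = encoded["total_bedrooms"].fillna(median_bedrooms)

# 4. Dealing with outliers: cap values at a chosen upper limit (1000 is arbitrary here)
encoded["total_bedrooms"] = encoded["total_bedrooms"].clip(upper=1000)

print(encoded)
```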


---

At the beginning of this section, we will re-import all of our libraries and perform all of the small changes that we have made thus far. The reason I re-introduce this code here is to make it easier and faster for you to make changes to your code and see the results. It keeps you from having to wait for all of the above code to execute before seeing any of your changes. **Please note that any changes you make above will not be reflected below, as the dataset is essentially "reset" with the code cell that follows**.

> **NOTE:** Because we will be looking at a heatmap a couple of times, I have created a function that will make it much easier. By calling this function and (optionally) supplying it with a version of the dataframe, we do not have to re-create the code each time. To call it from here on out, we simply issue the command **```make_heatmap(df)```**.

In [0]:
# ------------------------------------------ Setup the Environment ------------------------------------------
# Import libraries
import pandas as pd
import seaborn as sns; sns.set()
import numpy as np
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
import missingno as msno
from IPython.display import display
from sklearn.model_selection import train_test_split 
from sklearn import ensemble
from scipy import stats
from sklearn import linear_model
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import learning_curve,GridSearchCV
from sklearn.model_selection import cross_val_predict
from sklearn import preprocessing
from collections import Counter
import joblib
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Set the URL for the data file
url = 'https://raw.githubusercontent.com/ccwilliamsut/machine_learning/master/absolute_beginners/data_files/modified/CaliforniaHousingDataModified.csv'

# Import the datafile from the provided url and run the cell
df = pd.read_csv(url)

# ------- ALTERNATIVE COMMANDS (if above commands do not work) -------
#df = pd.read_csv('~/Downloads/CaliforniaHousingDataModified.csv')

# Change the spelling of a feature name
df.rename(columns = {'lattitude':'latitude', 't_rooms':'total_rooms'}, inplace=True)


# --------------------------------------------- FUNCTIONS ---------------------------------------------
def make_heatmap(dataframe = df):
  # Show a heatmap to identify any possible correlations
  # - Calculate the correlation matrix of the dataframe that was passed in
  corr = dataframe.corr()

  # Set the size you want
  plt.figure(figsize=(15 ,9))

  # Display the heatmap
  sns.heatmap(corr,
              annot=True,
              vmin = -1,
              vmax = 1,
              center = 0,
              fmt = '.1g',
              cmap = 'coolwarm'
              )

  plt.show()
  plt.close()

## A. One-Hot Encoding
One of our features, **```ocean_proximity```**, is categorical. Because we believe this feature to be useful in helping us predict housing prices, we want to keep it, but we must change the categorical values into numerical ones for modeling purposes. To do this, we employ a procedure known as **one-hot encoding** which creates a new feature for each of the known values and then assigns a **```1``` or ```0```** depending upon whether or not each row contains that value.
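
To make the idea concrete before we apply it to the real data, here is a minimal sketch on a tiny, made-up dataframe (the `toy` data and variable names are hypothetical; only the category names are borrowed from our dataset). It shows the full encoding first and then the effect of `drop_first = True`, which is the option used in this lesson.

```python
import pandas as pd

# Hypothetical toy data; the category names mirror those in the housing dataset
toy = pd.DataFrame({'ocean_proximity': ['INLAND', 'NEAR BAY', 'INLAND', 'ISLAND']})

# Full one-hot encoding: one new column per category,
# and every row holds a 1 in exactly one of those columns
encoded = pd.get_dummies(toy['ocean_proximity'], dtype=int)
print(encoded)

# With drop_first=True the first category (in sorted order) is removed.
# It is still implied: a row with 0 in every remaining column belongs to it.
encoded_drop = pd.get_dummies(toy['ocean_proximity'], drop_first=True, dtype=int)
print(encoded_drop)
```

Dropping one column avoids storing redundant information, since the dropped category can always be worked out from the others.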

In [0]:
# Perform one-hot encoding on 'ocean_proximity'
# Show the initial state 
print('The original state of our dataset:')
display(df.sample(5))

# 1. First, cast the feature we want to encode into a categorical data type
df['ocean_proximity'] = pd.Categorical(df['ocean_proximity'])

# 2. Create a temp dataframe to hold our new dummy values
df_dummies = pd.get_dummies(df['ocean_proximity'], 
                            drop_first = True
                            )

# 3. Drop the original categorical feature from the original dataframe
df.drop(['ocean_proximity'], 
        axis = 1, 
        inplace = True
        )

# 4. join our temp dataframe with our original to create a single dataframe
df = pd.concat([df, df_dummies], 
               axis=1
               )

# Show the new state after one-hot encoding
print('\n\nThe state of our dataset after one-hot encoding:')
display(df.sample(5))

## B. Removing Unwanted Features
When working with many datasets, there are likely to be features that are not useful for the goal at hand. In our case, we are trying to predict the median house value, so we must look at the features we have and decide if there are any that will not help us in that goal.

Let's first take a look at the correlations between our target variable and the other variables to get an idea of which features will be useful in predicting a price. We will use our brand new **```make_heatmap```** function to build it.

In [0]:
# Call on your new function to create a heatmap of the current dataframe
make_heatmap(df)

# Get a list of column names and decide which ones we can eliminate (if any)
display(list(df.columns))

Based upon our goal, the following features can probably be dropped:
- **```population```** - Appears to have almost no correlation, and it appears to be well represented in other features (such as total rooms, total bedrooms, etc.)
- **```households```** - Perfect correlation with total bedrooms, and very correlated with population, so this appears to be well-represented elsewhere
- **```proximity_to_store```** - This appears to be uniformly distributed and, thus, of little use for predictive purposes
- **```ISLAND```** - Almost no correlation with our target variable (little better than guessing)
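
Reading correlations off a heatmap works well, but it can also help to rank them as plain numbers. The following is a rough sketch on synthetic data (the `toy` dataframe, its `noise_feature` column and the numbers used to generate it are all made up for illustration); the same `corr()` call could be pointed at our own dataframe.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature that drives the target and one that does not
rng = np.random.default_rng(0)
n = 500
income = rng.normal(5, 1, n)
toy = pd.DataFrame({
    'median_income': income,
    'noise_feature': rng.uniform(0, 10, n),                          # unrelated to the target
    'median_house_value': 40000 * income + rng.normal(0, 20000, n),  # depends on income
})

# Absolute correlation of every feature with the target, strongest first
target_corr = (toy.corr()['median_house_value']
               .drop('median_house_value')
               .abs()
               .sort_values(ascending=False))
print(target_corr)
```

Features near the bottom of such a list are candidates for dropping, though correlation only captures linear relationships, so it is a guide rather than a rule.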

> **NOTE:** If you would like to change which features are dropped and/or kept, simply modify the code below by adding or removing the hash in front of the desired feature. The features with hash marks are kept and those without them are dropped. **Remember** that you will also need to run the **environment setup commands in section III** (and all cells below it) if you modify these parameters.

In [0]:
# Drop those columns that we will not use for training / testing
df.drop([#'longitude',
         #'latitude',
         #'housing_median_age',
         #'total_rooms',
         #'total_bedrooms',
         'population',
         'households',
         #'median_income',
         #'median_house_value',
         'proximity_to_store',
         #'INLAND',
         'ISLAND',
         #'NEAR BAY',
         #'NEAR OCEAN'
         ],
        axis = 1,
        inplace = True
        )

# List the current features in our dataset
display(list(df.columns))
print('\n\n')

# List the data types for each column
df.info()

## C. Missing Data
When first analyzing our data, we discovered that there were some "NaN" values (null data) in the **```total_bedrooms```** feature. Before proceeding to build our model, we must first figure out what to do with the records that contain those missing values.

Let's quickly count the missing values again:

In [0]:
# Get a count of "NaN" or missing values by feature
print('Count of missing values by feature:\n')
display(df.isnull().sum(axis=0))


Now that we have located missing data, we have *at least* four choices:
1. Fill in missing values with the **mode**
2. Fill in missing values with the **median**
3. **Delete** samples with missing values
4. **Delete** the feature(s) containing the missing data

**Options 1 & 2** will depend upon the type of data being manipulated. 
- For **categorical data**, it is generally okay to use the **mode** as this represents the most frequently encountered value.
- For **continuous data**, it is generally okay to use the **median** so as to avoid the influence of any outlier data (using the mean will allow this influence)

**Option 3** (deleting samples) is considered the **last resort** as it will cause us to lose valuable data that can help our model perform better. If the number of samples to be deleted is insignificant compared to the overall size of the dataset, then this might not be a problem. Often, however, our data is limited and every sample counts, so we will want to keep every bit of it that we can.

**Option 4** (deleting features with missing data) is also worth considering **if and only if the feature is not consequential for the model**. In other words, if we do not feel that the data in this feature will help us to train a more accurate prediction model, then we can just delete the feature altogether and be done with it.
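
The four options can be sketched in a few lines of pandas. The dataframe below is a made-up example (its values are hypothetical and chosen so that one entry is an obvious outlier), so treat it as an illustration of the calls rather than a recipe for our dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each of two columns
toy = pd.DataFrame({
    'total_bedrooms': [2.0, 3.0, np.nan, 4.0, 100.0],   # 100.0 is an outlier
    'ocean_proximity': ['INLAND', None, 'INLAND', 'NEAR BAY', 'INLAND'],
    'median_income': [1.0, 2.0, 3.0, 4.0, 5.0],         # no missing values
})

# Option 1: fill a categorical feature with its mode (most frequent value)
mode_filled = toy['ocean_proximity'].fillna(toy['ocean_proximity'].mode()[0])

# Option 2: fill a continuous feature with its median
median_filled = toy['total_bedrooms'].fillna(toy['total_bedrooms'].median())

# Option 3: delete every sample (row) that contains a missing value
rows_dropped = toy.dropna(axis=0)

# Option 4: delete every feature (column) that contains a missing value
cols_dropped = toy.dropna(axis=1)

# The outlier pulls the mean far away from the typical value; the median barely moves
print('median:', toy['total_bedrooms'].median())
print('mean:  ', toy['total_bedrooms'].mean())
```

Here the median of the known values is 3.5 while the mean is 27.25, which is why the median is usually the safer fill value for continuous data with outliers.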

> **NOTE:** There are actually a host of different methods for dealing with missing data, and they range greatly in their complexity. **The method one chooses will ultimately depend upon the type of data being used and the goal(s) of the model.** For the purposes of this lesson, however, we have covered 4 possibilities simply to demonstrate that there are easy ways to manage datasets with missing values. For more information on dealing with missing data, please see the following links:
- [Towards Data Science: Compensating for missing values](https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779)
- [Geeks for Geeks: Excellent yet simple tutorial for working with missing data](https://www.geeksforgeeks.org/working-with-missing-data-in-pandas/)
- [Pandas Documentation: Working with missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)
- [Machine Learning Mastery: Detailed options for dealing with missing data](https://machinelearningmastery.com/handle-missing-data-python/)

```
```
In this case, we have 207 values missing out of 20,640, which represents about 1% of our data. For the purposes of this exercise, we will elect to keep that data and fill those missing values.

> **Question:** Which method should we use to fill in the missing data?

#### (1). Working with missing data

In [0]:
# Let's look at the current state of our data prior to changes
print('-------------------- BEFORE CHANGES --------------------')
print('\nData types by column:')
print(df.dtypes)
print('\nCount of null values per feature:')
print(df.isnull().sum(axis=0))
print('\nMedian value for total_bedrooms: ', df['total_bedrooms'].median())
print('\nMean value for total_bedrooms: ', df['total_bedrooms'].mean(), '\n\n')

# --------------- Now let's apply the changes we want to see ------------------
# First, create a variable to hold the median value
tb_med = df['total_bedrooms'].median(axis=0)

# Next, we want to fill the "NaN" values with the median value of the column (existing 'NaN' values are skipped when the median is calculated)
df['total_bedrooms'] = df['total_bedrooms'].fillna(value = tb_med)

# Change the datatype of 'total_bedrooms' to 'int' (since we cannot really have partial bedrooms)
df['total_bedrooms'] = df['total_bedrooms'].astype(int)

# Next, we want to verify that our columns are all of the correct length and data type (ensuring that we can perform calculations on them later)
print('-------------------- AFTER CHANGES --------------------')
print('\nData types by column:')
print(df.dtypes)
print('\nCount of null values per feature:')
print(df.isnull().sum(axis=0))

# Finally, we want to ensure that our median value is still accurate (should be the same as above including the replaced 'NaN' values)
print('\nMedian value for total_bedrooms: ', df['total_bedrooms'].median())
print('\nMean value for total_bedrooms: ', df['total_bedrooms'].mean())

## D. Working with Outlier Data

Now that we have filled in our missing data, let's take a look at our remaining features' distributions. This will help us to get an idea of other changes that we might need to make.

First, let's look at some graphs (histograms and boxplots).

>Websites Referenced:
- [Towards Data Science: Working with Outlier Data](https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba)
- [Medium.com: Standardize and Normalize Data](https://medium.com/@rrfd/standardize-or-normalize-examples-in-python-e3f174b65dfc)

In [0]:
# Look at the distributions of our remaining features

# Create a "figure" that can hold multiple plots
fig, ((ax1, ax2, ax3, ax4, ax5), (ax6, ax7, ax8, ax9, ax10)) = plt.subplots(nrows = 2, ncols = 5, figsize = (20, 5), sharey = False)

df.hist('longitude', ax = ax1)
df.hist('latitude', ax = ax2)
df.hist('housing_median_age', ax = ax3)
df.hist('total_rooms', ax = ax4)
df.hist('total_bedrooms', ax = ax5)
df.hist('median_income', ax = ax6)
df.hist('median_house_value', ax = ax7)
sns.countplot(x = df['INLAND'], ax = ax8)
sns.countplot(x = df['NEAR BAY'], ax = ax9)
sns.countplot(x = df['NEAR OCEAN'], ax = ax10)

plt.show()
plt.close()

In [0]:
# Create a "figure" that can hold multiple plots
fig, (ax1, ax2, ax3, ax4, ax5, ax6, ax7) = plt.subplots(nrows = 1, ncols = 7, figsize = (25, 5), sharey = False)

sns.boxplot(x=df['longitude'], ax = ax1)
sns.boxplot(x=df['latitude'], ax = ax2)
sns.boxplot(x=df['housing_median_age'], ax = ax3)
sns.boxplot(x=df['total_rooms'], ax = ax4)
sns.boxplot(x=df['total_bedrooms'], ax = ax5)
sns.boxplot(x=df['median_income'], ax = ax6)
sns.boxplot(x=df['median_house_value'], ax = ax7)
sns.lmplot(x = 'median_house_value', y = 'INLAND', logistic = True, data = df, ci = None)
sns.lmplot(x = 'median_house_value', y = 'NEAR BAY', logistic = True, data = df, ci = None)
sns.lmplot(x = 'median_house_value', y = 'NEAR OCEAN', logistic = True, data = df, ci = None)

plt.show()
plt.close()

### (1). Eliminating outlier data with Z-score

There are a number of ways in which to deal with outlier data, but one of the more common ones is utilizing a **z-score** which gives us the ability to exclude items based on standard deviation. This, in turn, will allow us to exclude data that falls outside of standard deviation boundaries of our choice (thus eliminating outliers). We will then look at our data again and see if this has helped.
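
Before running this on the full dataset, here is a small sketch of the same idea on made-up numbers (the `toy` dataframe is hypothetical). A z-score is simply `(value - mean) / standard deviation`, so a z-score of 3 means "three standard deviations away from the mean".

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data: 19 ordinary values plus one extreme value
toy = pd.DataFrame({'total_rooms': list(range(1, 20)) + [500]})

# Absolute z-score of every value: how many standard deviations it lies from the mean
z = np.abs(stats.zscore(toy))

# Keep only the rows in which every feature has an absolute z-score below 3
kept = toy[(z < 3).all(axis=1)]

print('Rows before:', len(toy))
print('Rows after: ', len(kept))
```

Only the extreme value (500) is removed here. Note that the threshold of 3 is a common convention rather than a fixed rule, and it can be tightened or loosened to suit the data.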

In [0]:
# ------------------------------------- Calculate the z-score for all data -------------------------------------
# Get the absolute z-score value for our dataset
z = np.abs(stats.zscore(df))

# Finally, we keep only those rows in which every feature is within 3 standard deviations of its mean
dfz = df[(z < 3).all(axis = 1)]

# Create a "figure" that can hold multiple plots
fig, (ax1, ax2, ax3, ax4, ax5, ax6, ax7) = plt.subplots(nrows = 1, ncols = 7, figsize = (25, 5), sharey = False)

sns.boxplot(x=dfz['longitude'], ax = ax1)
sns.boxplot(x=dfz['latitude'], ax = ax2)
sns.boxplot(x=dfz['housing_median_age'], ax = ax3)
sns.boxplot(x=dfz['total_rooms'], ax = ax4)
sns.boxplot(x=dfz['total_bedrooms'], ax = ax5)
sns.boxplot(x=dfz['median_income'], ax = ax6)
sns.boxplot(x=dfz['median_house_value'], ax = ax7)

# Also show the data description to get accurate values for the quartiles
display(dfz.describe())

### (2). Eliminating outlier data with Interquartile Range

Another way to work with outliers is to eliminate those values that fall outside of our IQR (Interquartile Range). This will work somewhat similarly to the z-score, but our threshold for eliminating outlier data will be based upon quartiles.

> Websites referenced:
- [Towards Data Science: Detect and Remove Outliers](https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba)

After running the code below, we will have the IQR necessary to calculate our threshold for outliers. The formula is:

- ```lower boundary = q1 - (1.5 * iqr)```
- ```upper boundary = q3 + (1.5 * iqr)```

The subsequent code will then look for and remove all samples that fall outside these boundaries and leave us (hopefully) with substantially fewer outliers. 

We will then compare both methods (z-score and IQR) against the original dataset to see which one gives us the most usable data.
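
As a quick illustration of the boundary formulas above, here is a minimal sketch on ten made-up values (the `toy` dataframe is hypothetical), small enough that the quartiles can be checked by hand.

```python
import pandas as pd

# Hypothetical toy data: nine ordinary values plus one extreme value
toy = pd.DataFrame({'total_rooms': [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]})

# Quartiles and interquartile range
q1 = toy.quantile(0.25)     # 3.25
q3 = toy.quantile(0.75)     # 7.75
iqr = q3 - q1               # 4.5

# Boundaries from the formulas above
lower = q1 - (1.5 * iqr)    # 3.25 - 6.75 = -3.5
upper = q3 + (1.5 * iqr)    # 7.75 + 6.75 = 14.5

# Keep only the rows that fall inside the boundaries for every feature
kept = toy[~((toy < lower) | (toy > upper)).any(axis=1)]

print('Lower boundary:', lower['total_rooms'])
print('Upper boundary:', upper['total_rooms'])
print('Rows kept:', len(kept), 'of', len(toy))
```

Because the boundaries are built from quartiles rather than from the mean and standard deviation, a single extreme value has very little effect on where they fall.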

In [0]:
# Create a dataset with outliers removed via IQR method

# First, establish quantile variables
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1

# Display the interquartile range for our features
print(iqr)

# Remove outliers with IQR method

# Calculate the lower and upper boundaries from the quartiles established above
lower = q1 - (1.5 * iqr)
upper = q3 + (1.5 * iqr)

# Remove samples outside of the upper and lower boundaries
dfi = df[~((df < lower) | (df > upper)).any(axis = 1)]

# Compare longitude
fig, (ax1a, ax1b, ax1c) = plt.subplots(nrows = 3, ncols = 1, figsize = (10, 5), sharex = True)
sns.boxplot(x=dfi['longitude'],  ax = ax1a, color='g')
sns.boxplot(x=dfz['longitude'],  ax = ax1b)
sns.boxplot(x=df['longitude'],  ax = ax1c, color='grey')
plt.show()
plt.close()
print('\n')


# Compare latitude
fig, (ax2a, ax2b, ax2c) = plt.subplots(nrows = 3, ncols = 1, figsize = (10, 5), sharex = True)
sns.boxplot(x=dfi['latitude'],  ax = ax2a, color='g')
sns.boxplot(x=dfz['latitude'],  ax = ax2b)
sns.boxplot(x=df['latitude'],  ax = ax2c, color='grey')
plt.show()
plt.close()
print('\n')


# Compare housing_median_age
fig, (ax3a, ax3b, ax3c) = plt.subplots(nrows = 3, ncols = 1, figsize = (10, 5), sharex = True)
sns.boxplot(x=dfi['housing_median_age'], ax = ax3a, color='g')
sns.boxplot(x=dfz['housing_median_age'], ax = ax3b)
sns.boxplot(x=df['housing_median_age'], ax = ax3c, color='grey')
plt.show()
plt.close()
print('\n')


# Compare total_rooms
fig, (ax4a, ax4b, ax4c) = plt.subplots(nrows = 3, ncols = 1, figsize = (10, 5), sharex = True)
sns.boxplot(x=dfi['total_rooms'], ax = ax4a, color='g')
sns.boxplot(x=dfz['total_rooms'], ax = ax4b)
sns.boxplot(x=df['total_rooms'], ax = ax4c, color='grey')
plt.show()
plt.close()
print('\n')


# Compare total_bedrooms
fig, (ax5a, ax5b, ax5c) = plt.subplots(nrows = 3, ncols = 1, figsize = (10, 5), sharex = True)
sns.boxplot(x=dfi['total_bedrooms'], ax = ax5a, color='g')
sns.boxplot(x=dfz['total_bedrooms'], ax = ax5b)
sns.boxplot(x=df['total_bedrooms'], ax = ax5c, color='grey')
plt.show()
plt.close()
print('\n')


# Compare median_income
fig, (ax6a, ax6b, ax6c) = plt.subplots(nrows = 3, ncols = 1, figsize = (10, 5), sharex = True)
sns.boxplot(x=dfi['median_income'], ax = ax6a, color='g')
sns.boxplot(x=dfz['median_income'], ax = ax6b)
sns.boxplot(x=df['median_income'], ax = ax6c, color='grey')
plt.show()
plt.close()
print('\n')


# Compare median_house_value
fig, (ax7a, ax7b, ax7c) = plt.subplots(nrows = 3, ncols = 1, figsize = (10, 5), sharex = True)
sns.boxplot(x=dfi['median_house_value'], ax = ax7a, color='g')
sns.boxplot(x=dfz['median_house_value'], ax = ax7b)
sns.boxplot(x=df['median_house_value'], ax = ax7c, color='grey')
plt.show()
plt.close()
print('\n')

Everything appears to be better using the IQR method with the exception of **```median_house_value```**, which we might want to inspect a little more closely. Let's look at the new data in a distribution plot (histogram) to see what it looks like.

In [0]:
# Compare median_house_value via histograms
fig, (ax8a, ax8b, ax8c) = plt.subplots(nrows = 1, ncols = 3, figsize = (20, 5))
ax8a = sns.distplot(dfi['median_house_value'],  
                    kde = True,
                    hist = True,
                    bins = 50, 
                    color = 'green',
                    ax = ax8a,
                    hist_kws={'alpha':1}, 
                    )
ax8b = sns.distplot(dfz['median_house_value'], 
                    kde = True,
                    hist = True,
                    bins = 50,
                    ax = ax8b,
                    hist_kws={'alpha':1}
                    )
ax8c = sns.distplot(df['median_house_value'],  
                    kde = True,
                    hist = True,
                    bins = 50, 
                    color = 'grey',
                    ax = ax8c,
                    hist_kws={'alpha':1}
                    )

Upon further investigation, it appears that the IQR method gives us the best resulting dataset with which to move forward and build our model. It eliminates many of the existing outlier problems while also handling the **```median_house_value```** "wall" at ~500,000. 

Let's compare the shape and descriptions of our datasets one last time before moving on to building our model:

In [0]:
# Compare the results of each outlier removal method
print('IQR Dataframe shape (rows, columns):')
display(dfi.shape)
print('\nZ-Score Dataframe shape (rows, columns):')
display(dfz.shape)
print('\nOriginal Dataframe shape (rows, columns):')
display(df.shape)
print('\nIQR Dataframe description:')
display(dfi.describe())
print('\nZ-Score Dataframe description:')
display(dfz.describe())
print('\nOriginal Dataframe description:')
display(df.describe())

# IV. Building Machine Learning Models
At this point, we have now cleaned our dataset and are ready to build a machine learning model. We want to predict housing prices and we have labeled data, so we will build a supervised learning model with both **linear regression** and a **regression tree** (which is a form of decision tree). 

Before we begin building a model, **there are a few definitions that we should cover** so that you can better understand what is happening in the code below.

```

```

## A. Key Definitions
- **[Estimator / Model](https://scikit-learn.org/stable/user_guide.html)**
  - An object that neatly packages up all of the low-level training code. 
    - It works through a training set, using a [score method](https://scikit-learn.org/stable/modules/model_evaluation.html#model-evaluation) to determine "loss" (how far the predicted values deviate from the actual values).
    - A **linear regression estimator**, for example, will try to find a single line that has the least amount of deviation from all target values.
  - Estimators allow us to **configure** training loops rather than coding everything from the ground up.
  - [Examples](https://scikit-learn.org/stable/user_guide.html) include **linear regressors**, **decision trees**, **k-Means Clustering**, etc.
  >**Note:** Some sklearn estimators, such as ```LinearRegression```, solve for the minimum error/loss/cost directly rather than using Gradient Descent to minimize it incrementally via "epochs"; others (```SGDRegressor```, for example) do train iteratively. See the description and calculations at the following links:
    - [Data Science Exchange: Epochs discussion](https://datascience.stackexchange.com/questions/29044/how-many-epochs-does-fit-method-run)
    - [Machine Learning Mastery: Gradient Descent](https://machinelearningmastery.com/gradient-descent-for-machine-learning/).
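
For plain linear regression, the "calculated exactly" idea from the note above can be checked on a tiny example. The sketch below uses four made-up points that lie on the line `y = 2x + 1` (the data and variable names are hypothetical) and compares the estimator's result with a direct least-squares solution from NumPy.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical toy data that lies exactly on the line y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Let the estimator find the best-fitting line
model = linear_model.LinearRegression()
model.fit(X, y)
print('Estimator slope and intercept:', model.coef_[0], model.intercept_)

# Solve the same least-squares problem directly (no epochs, no learning rate).
# The extra column of ones lets the solver find the intercept as well.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print('Direct slope and intercept:   ', coef[0], coef[1])
```

Both approaches arrive at a slope of about 2 and an intercept of about 1 in a single step, with no training loop to configure.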


```

```


- **[Scoring](https://scikit-learn.org/stable/modules/model_evaluation.html#model-evaluation)**
  - A method that measures "loss" (a.k.a. "cost" or "error") when training an estimator
  - Can be measured in various ways such as **mean squared error**, **median absolute error**, etc.
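
To see how these measures differ, here is a small sketch with made-up prices and predictions (both arrays are hypothetical). Three of the predictions are close and one is badly off, which is exactly the situation where the choice of metric matters.

```python
import numpy as np
import sklearn.metrics as metrics

# Hypothetical actual values and predictions; the last prediction is off by 100
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 300.0, 500.0])

# The individual absolute errors are 10, 10, 0 and 100
mse = metrics.mean_squared_error(y_true, y_pred)        # squares each error, so big misses dominate
mae = metrics.mean_absolute_error(y_true, y_pred)       # average size of the errors
medae = metrics.median_absolute_error(y_true, y_pred)   # the "typical" error, ignoring extremes

print('Mean squared error:   ', mse)
print('Mean absolute error:  ', mae)
print('Median absolute error:', medae)
```

The single large miss pushes the mean squared error to 2550 and the mean absolute error to 30, while the median absolute error stays at 10.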

```

```


- **Splitting Data**
  - Refers to the act of separating data into **training** and **testing** (and possibly **validation**) sets
  - The parameter **```test_size```** dictates the proportion of data to be set aside as the test set (with the remaining being the training set)
    - Splits are typically 70/30, 80/20 or somewhere between for training/test sets
  - Accomplished with the **[train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)** function, which returns 4 arrays of data (X_train, X_test, y_train, y_test)
  - These 4 arrays are then used when **fitting** the estimator (training arrays) and evaluating it (testing arrays)
  - Data can (and should be) **shuffled** with this method (to reduce chances of bias)
  > **Question**: Why do we need to split our data into training and testing sets?
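
A split can be tried out on a handful of made-up samples before using it on real data. In this sketch the arrays `X` and `y` are hypothetical, and the `test_size` and `random_state` values are simply the ones used later in this lesson.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, plus 10 target values
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Set aside 30% of the samples for testing; shuffle first so the order does not matter
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    shuffle=True,
                                                    random_state=20
                                                    )

print('Training set shape:', X_train.shape)
print('Testing set shape: ', X_test.shape)
```

With 10 samples and `test_size=0.3`, 7 samples end up in the training set and 3 in the test set. Setting `random_state` makes the shuffle repeatable, so the same split comes back every time the cell is run.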

```

```


- **Hyperparameters**
  - These are values that we define **before** training a model. Regular parameters are learned and change during the training process, but hyperparameters cannot be "learned" and must be set beforehand.
  - Includes things like **```n_estimators```**, **```max_depth```**, etc.
  - We can "tune" hyperparameters in order to get better model performance.
  - We can also employ the **```GridSearchCV```** tool, which will try every combination of the hyperparameter values that we list and reveal which combination produced the best performance (shown at the end of the lesson). We use it to reduce the work necessary to create a "good" model.
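
Here is a rough sketch of what a grid search looks like in code. Everything in it is an assumption made for illustration: the data is synthetic, and the grid of `max_depth` and `min_samples_leaf` values is arbitrary rather than a recommendation for our housing data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy sine wave
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

# The hyperparameter values we want to try (3 x 2 = 6 combinations)
param_grid = {'max_depth': [1, 3, 5],
              'min_samples_leaf': [1, 5]
              }

# Train and score one model per combination, using 3-fold cross-validation
search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid,
                      cv=3,
                      scoring='neg_mean_absolute_error'
                      )
search.fit(X, y)

print('Combinations tried:', len(search.cv_results_['params']))
print('Best combination:  ', search.best_params_)
```

The search only considers the values we put in the grid, so the "best" combination is the best of those we listed, not necessarily the best possible.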

```

```

- **[Fitting](https://stackoverflow.com/questions/45704226/what-does-fit-method-in-scikit-learn-do) a model/estimator**
  - This refers to the act of actually **training** the estimator on the training set of data (X_train, y_train)
  - The estimator loops through the data (using the hyperparameter settings), measures loss with the scoring method and produces a **trained model**.
  - The trained model is then run against the test set to evaluate the accuracy.
  - The error rates returned can be used to determine if:
    - The model needs to be further refined
    - Data needs to be further engineered
    - The model is ready to deploy.
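
Put together, fitting and evaluating an estimator takes only a few lines. The sketch below uses synthetic data and an arbitrarily chosen `max_depth` (both are assumptions for illustration), and it follows the same pattern that we will use on the housing data.

```python
import numpy as np
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends on two features plus some noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, 300)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=20)

# Set the hyperparameters, then fit (train) the estimator on the training set only
estimator = DecisionTreeRegressor(max_depth=4, random_state=0)
estimator.fit(X_train, y_train)

# Run the trained model against the test set and measure the error
y_predicted = estimator.predict(X_test)
test_error = metrics.median_absolute_error(y_test, y_predicted)

# Compare R^2 on data the model has seen (training) and has not seen (testing)
train_r2 = estimator.score(X_train, y_train)
test_r2 = estimator.score(X_test, y_test)

print('Median absolute error on test set: %.2f' % test_error)
print('Training R^2: %.3f' % train_r2)
print('Testing R^2:  %.3f' % test_r2)
```

A training score that is much higher than the testing score is a common sign that the model has memorized the training data and needs further refinement.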


## B. Basic steps for Building a Machine Learning Model
1. **Import necessary libraries**
  - In order to train a model, we must identify the library that contains it and import it so that we can use it in our code (```from sklearn import linear_model```). For this lesson, we also need to import some other libraries such as ```preprocessing``` and ```Counter```. 
---
2. **Identify the Target Variable**
  - We must then **identify our target variable**. Basically, we can choose any feature that we would like to predict as our target, and the remaining features will be used to try and determine an effective model that can predict any value of the target feature. 
  - For this lesson, we will use **```median_house_value```** as our target variable, meaning that we are trying to predict a house's value based upon the factors contained in the other features (```total_rooms, total_bedrooms, median_income```, etc). We can easily change this up later, however, if we want to try and predict another variable instead.
---
3. **Split the Dataset into Training and Testing sets**
  - In this step, we set our training and testing sets, determine the test set size and shuffle our data (to prevent bias).
  - **NOTE:** It's vitally important to **shuffle** the dataset at this point in order to randomize the data and prevent bias from entering our model.
---
4. **Fit (or "train") the Model**
  - At this point, we call our estimator and "fit" it to our training arrays (X_train, y_train)
    - ```estimator.fit(X_train, y_train)```
  - Once trained, we then run the test data through the model while also gathering the error data through the sklearn.metrics library
    - ```this_err = metrics.median_absolute_error(y_test, estimator.predict(X_test))```
  - As those errors are gathered, we append them to the error array (```errvals```) that we initialized when setting our variables.
    - ```errvals = np.append(errvals, this_err)```
---
5. **Visualize the data**
  - Once everything is completed, we create a simple vertical bar graph to show how our model(s) performed.

## C. Models
We can now begin building our models (linear regression, regression tree and gradient boost models).

We can setup our environment with the code below. This code will create **three datasets** with which you can work:
- **df:** This is our original data (though the 'NaN' values have been filled with median values)
- **dfz:** This is a dataset created using the z-score method
- **dfi:** This is a dataset created using the IQR method

In [0]:
# ------------------------------------------ SETUP THE ENVIRONMENT ------------------------------------------
# Import libraries
import pandas as pd
import seaborn as sns; sns.set()
import numpy as np
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
import missingno as msno
from IPython.display import display
from sklearn.model_selection import train_test_split 
from sklearn import ensemble
from scipy import stats
from sklearn import linear_model
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import learning_curve,GridSearchCV
from sklearn.model_selection import cross_val_predict
from sklearn import preprocessing
from collections import Counter
import joblib
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

url = 'https://raw.githubusercontent.com/ccwilliamsut/machine_learning/master/absolute_beginners/data_files/modified/CaliforniaHousingDataModified.csv'

df = pd.read_csv(url)
#df = pd.read_csv('~/Downloads/CaliforniaHousingDataModified.csv')


# --------------------------------------------- RENAME FEATURES ---------------------------------------------
df.rename(columns = {'lattitude':'latitude', 't_rooms':'total_rooms'}, inplace=True)


# --------------------------------------------- FUNCTIONS ---------------------------------------------
def make_heatmap(dataframe = df):
  corr = dataframe.corr()
  plt.figure(figsize=(15 ,9))
  sns.heatmap(corr, annot=True, vmin = -1, vmax = 1, center = 0, fmt = '.1g', cmap = 'coolwarm')
  plt.show()
  plt.close()


# --------------------------------------------- ONE-HOT ENCODING ---------------------------------------------
df['ocean_proximity'] = pd.Categorical(df['ocean_proximity'])
df_dummies = pd.get_dummies(df['ocean_proximity'], drop_first = True)
df.drop(['ocean_proximity'], axis = 1, inplace = True)
df = pd.concat([df, df_dummies], axis=1)


# --------------------------------------------- DROP UNWANTED FEATURES ---------------------------------------------
df.drop(['population', 'households', 'proximity_to_store', 'ISLAND'], axis = 1, inplace = True)


# ------------------------------------- FIX MISSING DATA -------------------------------------
tb_med = df['total_bedrooms'].median(axis=0)
df['total_bedrooms'] = df['total_bedrooms'].fillna(value = tb_med)
df['total_bedrooms'] = df['total_bedrooms'].astype(int)


# ------------------------------------- Z-SCORE -------------------------------------
z = np.abs(stats.zscore(df))
dfz = df[(z < 3).all(axis = 1)]

# ------------------------------------- INTERQUARTILE RANGE -------------------------------------
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1
lower = q1 - (1.5 * iqr)
upper = q3 + (1.5 * iqr)
dfi = df[~((df < lower) | (df > upper)).any(axis = 1)].copy()  # Copy so that the drop below acts on an independent dataframe
dfi.drop(['NEAR BAY', 'NEAR OCEAN'], axis = 1, inplace = True)  # After applying IQR, these features hold a single value (all zeros) and can be dropped

print('Original Heatmap')
make_heatmap(df)

print('Z-Score Heatmap')
make_heatmap(dfz)

print('IQR Heatmap')
make_heatmap(dfi)

### (1). Linear Models

In [0]:
# ---------------------------- A: Import Libraries ----------------------------
# NOTE: We have imported all necessary libraries in the code cell above

# ---------------------------- B: Set Variables ----------------------------
# Choose the dataset to model (here, the IQR version) and work on a copy of it
dfx = dfi.copy()

# Set the Random State variable (for use in the splitting function)
rs = 20

# Create an array to hold our error values during training
errvals = np.array([])

# Here we will try a number of different linear regression models and then
#   compare their performance against one another.
estimators = [linear_model.LinearRegression(), 
              linear_model.Ridge(),
              linear_model.Lasso(),
              linear_model.ElasticNet(),
              linear_model.BayesianRidge(),
              linear_model.OrthogonalMatchingPursuit()
              ]

# Create labels that match the linear regressors for use in the graph
estimators_labels = np.array(['Linear',
                              'Ridge', 
                              'Lasso', 
                              'ElasticNet', 
                              'BayesRidge', 
                              'OMP'
                              ]
                             )


# ---------------------------- C: Identify Target Variable ----------------------------
# Create a copy of the dataframe with only the desired features (dropping the target feature)
features_dfx = dfx.copy()
features_dfx.drop(['median_house_value'], 
                 axis = 1,
                 inplace = True
                 )

# Create X and y arrays to hold the independent (X) and dependent (y) variables
X = features_dfx.values
y = dfx['median_house_value'].values

# ---------------------------- D: Split Dataset ----------------------------
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    shuffle = True,
                                                    random_state = rs
                                                    )


# ---------------------------- E: Fit (i.e. 'train') the model ----------------------------
# Loop through each linear regression estimator (paired with its label) and:
for label, estimator in zip(estimators_labels, estimators):
    estimator.fit(X_train, y_train)                                       # (1) train the model (fit the model to the training data)
    training_score = estimator.score(X_train, y_train)                    # (2) compute the R^2 score (share of the target's variance explained by the model) for training and test sets
    testing_score = estimator.score(X_test, y_test)
    y_predicted_score = estimator.predict(X_test)                         # (3) predict the target values for the test set
    this_err = metrics.median_absolute_error(y_test, y_predicted_score)   # (4) measure the error of those predictions (median absolute error)
    print(label,                                                          # (5) print the relevant metrics for analysis
          'Median Absolute Error on Test Set: %0.2f' % this_err,
          '\n       Mean Absolute Error on Test Set: %.2f' % metrics.mean_absolute_error(y_test, y_predicted_score),
          '\n       Training set R^2: ', training_score,
          '\n       Testing set R^2: ', testing_score
          )
    errvals = np.append(errvals, this_err)                                # (6) append this_err to the errvals array (for use in plotting)
    print('-' * 80)

# Plot a bar chart with sorted values from errvals (median_absolute_error for each linear regression estimator)
pos = np.arange(errvals.shape[0])
srt = np.argsort(errvals)
plt.figure(figsize=(8,4))
plt.bar(pos, errvals[srt], align='center')
plt.xticks(pos, estimators_labels[srt])
plt.xlabel('Estimator')
plt.ylabel('Median Absolute Error')
plt.show()
plt.close()
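
The loop above reports three different numbers for each estimator, and it can help to see how each one is computed. The following is a minimal sketch in plain Python, not scikit-learn's implementation; the helper names (`r2`, `mean_abs_err`, `median_abs_err`) and the sample values are made up for illustration.

```python
from statistics import mean, median


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_abs_err(y_true, y_pred):
    """Average of the absolute prediction errors."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def median_abs_err(y_true, y_pred):
    """Middle value of the absolute prediction errors."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))


# Hypothetical house values (in $1000s); the last prediction misses badly
y_true = [100.0, 200.0, 300.0, 400.0, 500.0]
y_pred = [110.0, 190.0, 310.0, 390.0, 300.0]

print("R^2:                  ", round(r2(y_true, y_pred), 3))
print("Mean absolute error:  ", mean_abs_err(y_true, y_pred))
print("Median absolute error:", median_abs_err(y_true, y_pred))
```

In this example a single large miss pulls the mean absolute error up to 48 while the median absolute error stays at 10, which is why the two metrics can rank the same estimators differently.
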

### (2). Single Regression Tree

In [0]:
# ---------------------------- A: Import Libraries ----------------------------
# NOTE: We have imported all necessary libraries in the code cell above

# ---------------------------- B: Set Variables ----------------------------
# Define the random state variable (which influences the splitting function to ensure that
#   we are splitting our dataset with the same value over multiple runs)
dfx = dfi.copy()
rs = 20

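# Define the desired algorithm and configure the hyperparameters
# NOTE: scikit-learn 1.0 renamed 'mae' to 'absolute_error' and 'mse' to 'squared_error';
#   the old names were removed in version 1.2, so use the new names on recent versions
estimator = DecisionTreeRegressor(criterion='mae',  # can also be 'mse' or other values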
                                  max_depth = None, 
                                  min_samples_split = 4,
                                  min_samples_leaf = 4,
                                  max_features = 1.0,
                                  random_state = rs
                                  )


# ---------------------------- C: Identify Target Variable ----------------------------
# Create a copy of the dataframe with only the desired features (dropping the target feature)
features_dfx = dfx.copy()
features_dfx.drop(['median_house_value'], 
                 axis = 1,
                 inplace = True
                 )

# Create X and y arrays to hold the independent (X) and dependent (y) variables
X = features_dfx.values
y = dfx['median_house_value'].values


# ---------------------------- D: Split Dataset ----------------------------
# Split the dataset (we can change the test set size here)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.3,
                                                    random_state = rs,
                                                    shuffle = True
                                                    )


# ---------------------------- E: Fit (i.e. 'train') the model ----------------------------
# Train the model using our training sets (X_train, y_train)
estimator.fit(X_train, y_train)


# ---------------------------- Analysis / Visualization ----------------------------
# Analyze the results (mean/median absolute error in the model)
training_score = estimator.score(X_train, y_train)
testing_score = estimator.score(X_test, y_test)
y_predicted_score = estimator.predict(X_test)

# Look at the R^2 score to get an idea of the fit (1.0 is a perfect prediction)
print('Training set R^2: ', training_score)
print('Test set R^2: %.2f' % testing_score)
# The mean absolute error
print("Mean Absolute Error on Test Set: %.2f" % metrics.mean_absolute_error(y_test, y_predicted_score))


# Run the model against the test data to produce a graph
fig, ax = plt.subplots()
ax.scatter(y_test, y_predicted_score, edgecolors=(0, 0, 0), alpha=0.5)
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=4)
ax.set_xlabel('Actual')
ax.set_ylabel('Predicted')
ax.set_title("Ground Truth vs Predicted")
plt.show()

# Plot feature importance
feature_importance = estimator.feature_importances_

# Make importances relative to max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.figure(figsize=(8, 4))
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, features_dfx.columns[sorted_idx])  # reorder the labels to match the sorted bars
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()
plt.close()
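
The importance chart above sorts its bars with `np.argsort`, so the tick labels have to be reordered with the same index array to stay attached to the right bar. Here is a small sketch of that idea in plain Python; the feature names and importance values are hypothetical, and `sorted(range(...))` stands in for `np.argsort`.

```python
# Hypothetical feature names and importances, in the original column order
names = ['median_income', 'latitude', 'longitude', 'housing_median_age']
importance = [0.55, 0.15, 0.20, 0.10]

# Indices that would sort 'importance' in ascending order (what np.argsort returns)
sorted_idx = sorted(range(len(importance)), key=lambda i: importance[i])

# Apply the SAME index order to both the values and the labels
sorted_values = [importance[i] for i in sorted_idx]
sorted_names = [names[i] for i in sorted_idx]

for name, value in zip(sorted_names, sorted_values):
    print('%-20s %.2f' % (name, value))
```

If only the values were reordered, the largest bar would still carry the label of whichever column happened to come last in the dataframe.
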

### (3). Gradient Boosting
### From *Machine Learning For Absolute Beginners* by Oliver Theobald
```

model = ensemble.GradientBoostingRegressor(
                                           n_estimators = 150, 
                                           learning_rate = 0.1, 
                                           max_depth = 30, 
                                           min_samples_split = 4, 
                                           min_samples_leaf = 6, 
                                           max_features = 0.6, 
                                           loss = 'huber'
                                          )
```

The first line selects the algorithm itself (gradient boosting). The remaining lines set the hyperparameters for this algorithm:
- **n_estimators** sets how many decision trees are used. Remember that a high number of trees generally improves accuracy (up to a certain point) but will extend the model’s processing time. Above, I have selected 150 decision trees as an initial starting point. 
- **learning_rate** controls the rate at which additional decision trees influence the overall prediction. This effectively shrinks the contribution of each tree by the set learning_rate. Inserting a low rate here, such as 0.1, should help to improve accuracy. 
- **max_depth** defines the maximum number of layers (depth) for each decision tree. If “None” is selected, then nodes expand until all leaves are pure or until all leaves contain fewer than min_samples_split samples. Here, I have chosen a high maximum number of layers (30), which will have a dramatic effect on the final result, as we’ll soon see. 
- [**min_samples_split**](https://stackoverflow.com/questions/46480457/difference-between-min-samples-split-and-min-samples-leaf-in-sklearn-decisiontre) defines the minimum number of samples required to execute a new binary split. For example, min_samples_split = 10 means there must be ten available samples in order to create a new branch.
- **min_samples_leaf** represents the minimum number of samples that must appear in each child node (leaf) before a new branch can be implemented. This helps to mitigate the impact of outliers and anomalies in the form of a low number of samples found in one leaf as a result of a binary split. For example, min_samples_leaf = 4 requires there to be at least four available samples within each leaf for a new branch to be created. 
- **max_features** is the number of features presented to the model when determining the best split. When it is given as a float, as with 0.6 above, it is read as a fraction of the available features.
- **loss** selects the loss function used to measure the model's error. For this exercise, we are using huber, which protects against outliers and anomalies. Alternative options include ls (least squares regression), lad (least absolute deviations), and quantile (quantile regression); scikit-learn 1.0 renamed ls and lad to squared_error and absolute_error. Huber is a combination of least squares regression and least absolute deviations.


  >Theobald, Oliver. Machine Learning For Absolute Beginners: A Plain English Introduction (Second Edition) (Machine Learning For Beginners Book 1) (pp. 139-141). Scatterplot Press. Kindle Edition.
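
To make the last point concrete, here is a minimal sketch of the Huber loss for a single residual, written in plain Python. It is an illustration rather than scikit-learn's code: the function name `huber_loss` and the fixed threshold `delta = 1.0` are assumptions for this example (scikit-learn derives its threshold from a quantile of the residuals, controlled by the `alpha` parameter).

```python
def huber_loss(residual, delta):
    """Quadratic (least squares) for small residuals, linear (absolute) for large ones."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r ** 2            # behaves like least squares near zero
    return delta * (r - 0.5 * delta)   # grows only linearly for outliers


delta = 1.0  # hypothetical threshold between the two regimes

print('residual  squared  absolute  huber')
for residual in [0.5, 1.0, 3.0, 10.0]:
    squared = 0.5 * residual ** 2
    absolute = abs(residual)
    print('%8.1f  %7.3f  %8.3f  %5.3f' % (residual, squared, absolute, huber_loss(residual, delta)))
```

For the residual of 10, the squared loss is 50 while the Huber loss is 9.5, so a single outlier has far less pull on the fitted model.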

I also referenced the following website(s) and adapted code in order to graph this model:
- [Scikit-Learn Documentation: Ensemble Gradient Boosting Visualization](https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regression.html#sphx-glr-auto-examples-ensemble-plot-gradient-boosting-regression-py)
- [Sharp Sight Labs: Numpy Zeros Tutorial](https://www.sharpsightlabs.com/blog/numpy-zeros-python/)

In [0]:
# ---------------------------- A: Import Libraries ----------------------------
# NOTE: We have imported all necessary libraries in the code cell above

# ---------------------------- B: Set Variables ----------------------------
dfx = dfi.copy()
rs = 20

# -- Create a dictionary with key/value pairs that we can use in our function calls
params = {'n_estimators': 500,
          'learning_rate': .01,
          'max_depth': 5,
          'min_samples_split': 6,
          'min_samples_leaf': 4,
          'max_features': 0.6,
          # Options for 'loss': huber, ls (least squares), lad (least absolute deviations) and quantile (quantile regression)
          # NOTE: scikit-learn 1.0 and later name 'ls' and 'lad' as 'squared_error' and 'absolute_error'
          'loss': 'huber'
          }

# Use the params defined above in the GradientBoostingRegressor estimator call
estimator = ensemble.GradientBoostingRegressor(**params)


# ---------------------------- C: Identify Target Variable ----------------------------
# Create a copy of the dataframe with only the desired features (dropping the target feature)
features_dfx = dfx.copy()
features_dfx.drop(['median_house_value'], 
                  axis = 1,
                  inplace = True
                  )


# ---------------------------- D: Split Dataset ----------------------------
# Create X and y arrays to hold the independent (X) and dependent (y) variables
X = features_dfx.values
y = dfx['median_house_value'].values


# Split the dataset (we can change the test set size here)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.3,
                                                    random_state = rs,
                                                    shuffle = True
                                                    )


# ---------------------------- E: Fit (i.e. 'train') the model ----------------------------
estimator.fit(X_train, y_train) 


# ---------------------------- Analysis / Visualization ----------------------------
training_score = estimator.score(X_train, y_train)
testing_score = estimator.score(X_test, y_test)
y_predicted_score = estimator.predict(X_test)
y_staged_predicted_score = estimator.staged_predict(X_test)

# Print out the various metrics that we have collected on our test sets
print('Training set R^2: ', training_score)
print('Test set R^2: %.2f' % testing_score)
print("Mean Absolute Error on Test Set: %.2f" % metrics.mean_absolute_error(y_test, y_predicted_score))


# Plot the training/testing deviance (loss) at each boosting stage
# Create an array of zeros with one entry per boosting stage ('n_estimators' key)
test_score = np.zeros(shape = (params['n_estimators'],),
                      dtype=np.float64
                      )

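# Compute test set deviance (loss) at each stage and place it into the array
#  based on the predicted value (y_pred) against the actual value (y_test)
# NOTE: the 'loss_' attribute was removed in scikit-learn 1.3; on newer versions, score
#  each stage with a metric instead, e.g. metrics.mean_squared_error(y_test, y_pred)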
for i, y_pred in enumerate(y_staged_predicted_score):
    test_score[i] = estimator.loss_(y_test, y_pred)

plt.figure(figsize=(20, 10))
plt.subplot(1, 2, 1)
plt.title('Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, 
         estimator.train_score_, 'b-',
         label='Training Set Deviance'
         )
plt.plot(np.arange(params['n_estimators']) + 1,
         test_score, 'r-',
         label='Test Set Deviance'
         )
plt.legend(loc='upper right')
plt.xlabel('Boosting Iterations')
plt.ylabel('Deviance')


# Plot feature importance
feature_importance = estimator.feature_importances_

# Make importances relative to max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.subplot(1, 2, 2)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, features_dfx.columns[sorted_idx])  # reorder the labels to match the sorted bars
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()

# Save the model so that we can use it later
joblib.dump(estimator, 'ca_housing_trained_model.pkl')
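
The deviance curves plotted above track the loss after each boosting stage. As a rough illustration of what happens inside the loop, here is a toy version of gradient boosting in plain Python, under strong simplifying assumptions: squared-error loss instead of huber, one feature, and one-split trees ("stumps"). The names `fit_stump` and `boost` and the data are hypothetical, and this is not how scikit-learn implements it.

```python
def fit_stump(x, residuals):
    """Find the single threshold on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_estimators, learning_rate):
    """Return the training mean squared error after each boosting stage."""
    pred = [sum(y) / len(y)] * len(y)  # stage 0: predict the mean for every sample
    history = []
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # what is still unexplained
        stump = fit_stump(x, residuals)                   # new tree fits the residuals
        # Each tree's contribution is shrunk by the learning rate
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))
    return history


# Made-up, roughly linear data
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]

fast = boost(x, y, n_estimators=20, learning_rate=0.5)
slow = boost(x, y, n_estimators=20, learning_rate=0.05)
print('Training MSE after 20 trees (learning_rate=0.5):  %.4f' % fast[-1])
print('Training MSE after 20 trees (learning_rate=0.05): %.4f' % slow[-1])
```

With the same number of trees, the smaller learning rate leaves more training error, which is why a low `learning_rate` is usually paired with a larger `n_estimators`, as in the `params` dictionary above.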

# V. Moving Forward

## A. Additional challenges
If you would like to continue experimenting with this dataset, you can try some of the following activities:
- Try using the z-score and original datasets instead of the IQR one shown in class.
- Try to get the best scores possible on the Gradient Boosting section.
- Try to import a different, though similar, dataset that uses Australian housing data
  - That dataset can be found here: [Australian Housing Data FULL](https://raw.githubusercontent.com/ccwilliamsut/machine_learning/master/absolute_beginners/data_files/original/Melbourne_housing_FULL.csv)
  - Simply copy the link into the "url" variable in the code and then begin exploring it. You will have to start out with new code, as our model has different variables, but the process will be similar.

## B. Additional Reading and Tools
There are a number of different paths forward, but here are a few of the most useful links and options that I have found during my learning:


### Books
1. Machine Learning With Random Forests And Decision Trees: A Visual Guide For Beginners  
  - Format: E-book 
  - Author: Scott Hartshorn 
  - Suggested Audience: Established beginners 
  - A short, affordable ($3.20 USD), and engaging read on decision trees and random forests with detailed visual examples, useful practical tips, and clear instructions.

2. Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems 
  - Format: E-book
  - Author: Aurélien Géron 
  - Suggested Audience: All (with an interest in programming in Python, Scikit-Learn, and TensorFlow)
  - As a popular O’Reilly Media book written by machine learning consultant Aurélien Géron, this is an excellent advanced resource for anyone with a solid foundation in machine learning and computer programming.

### Websites
1. [Google AI](https://experiments.withgoogle.com/collection/ai)
  - Excellent resource for hands-on experiments with AI/ML concepts


### Datasets
