<a href="https://colab.research.google.com/github/ccwilliamsut/machine_learning/blob/master/01b_setting_up_your_environment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build a Decision Tree Model with Python

In this exercise, we will learn to build a simple model that uses a Decision Tree algorithm (with gradient boosting) to determine housing prices in the California market.

## Basic Steps
- **Step 1: Set up your environment**
    - Import the appropriate libraries and set up software, if necessary (we will be using Google Colaboratory, so we do not need to set up any additional software)
    - Import your data
- **Step 2: Analyze your data**
    - Become familiar with the data
    - Look for possible relationships
    - Determine which features should be used and which need to be removed
- **Step 3: Scrub the data**
    - Remove unnecessary features
    - Perform *one-hot encoding* (convert text values to numerical values)
    - Fix problematic data (outliers, null values, duplicate data, etc.)
- **Step 4: Choose an algorithm**
    - Determine which type of algorithm fits the data:
        - **Supervised** - uses labeled data to predict values or classes (predictive model; regression and classification)
        - **Unsupervised** - uses unlabeled data to create clusters of data (descriptive model; clustering)
        - **Reinforcement Learning** - Creates a system of rewards and punishments to interactively learn from the environment (online model; classification and control)
    - Determine which algorithm to use
        - Linear regression, logistic regression, k-means clustering, k-nearest neighbor, etc.
    - Implement the algorithm
        - Build it out with estimated guesses for the hyperparameters (where applicable)
- **Step 5: Test and fit the model**
    - Run the model and analyze performance
    - Look for overfitting (the model is tuned too closely to the training data and predicts poorly on test data)
    - Look for underfitting (the model does not predict well even on the training data)
    - Continue adjusting until a good fit is found (run, tweak, run again)
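
The overfitting and underfitting check in Step 5 can be sketched as follows. This is a minimal illustration on synthetic data rather than the housing dataset; the generated features, the hyperparameter values and the `random_state` seeds are assumptions chosen only to make the example reproducible.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): a noisy linear relationship
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=500)

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hyperparameters here are first guesses, to be tweaked between runs
model = GradientBoostingRegressor(
    n_estimators=100, max_depth=3, learning_rate=0.1, random_state=0
)
model.fit(X_train, y_train)

# Compare the error on data the model has seen with data it has not seen
train_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print("Training MAE: {0:.3f}".format(train_mae))
print("Test MAE:     {0:.3f}".format(test_mae))
```

A training error far below the test error points to overfitting; a high error on both points to underfitting.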

# Step 1. Set up your environment (importing libraries and data)

For this project, we will draw on the following libraries (commonly used for machine learning projects):

- pandas (for working easily with datasets)
- sklearn (scikit-learn contains a number of useful machine learning tools)
- IPython (for displaying results in Colaboratory)
- seaborn (for displaying graphs)

### The original dataset comes from Kaggle
- [California Housing Data](https://www.kaggle.com/camnugent/california-housing-prices)

### My modified dataset for this course is in my Github repository
- [ccwilliamsut Github](https://github.com/ccwilliamsut/ml_beginners/blob/master/CaliforniaHousingData.csv)


In [0]:
# Import libraries
import pandas as pd
import seaborn as sns
from IPython.display import display
from sklearn.model_selection import train_test_split 
from sklearn import ensemble 
from sklearn.metrics import mean_absolute_error 
import joblib

# Import the data
# Set the URL for the data file
url = 'https://raw.githubusercontent.com/ccwilliamsut/ml_beginners/master/CaliforniaHousingData.csv'

# Import the datafile from the provided url
df = pd.read_csv(url)

### Verify that the dataset imported correctly
Quickly look to see that our data was imported and will be usable.

In [0]:
# Test that the data was imported correctly (look at the first 5 rows of the file)
display(df.head())

# List the column headers to see what we are working with
print("\n\nHere are all of our columns:")
display(df.columns)

# Step 2: Analyze the data
- Become familiar with the data
- Look for possible relationships
- Determine which features should be used and which need to be removed



We want to get an idea of what we are working with (i.e. the "shape" of the data) so that we can begin to see what work is required to prepare the data for training. We also want to begin thinking about what kind of algorithm we will be using (supervised or unsupervised, typically).

> **Key questions to consider at this point:**
1. Does my data have **labels**? (supervised or unsupervised algorithms)
2. Are there major **issues** with the data (lots of null values, small number of samples, misspellings, derived data, unknown scales or values, etc.)?
3. What is my **goal**? Will I be able to accomplish that goal with this dataset?
    - If not, can I employ feature engineering to create the data I need?
    - If so, which features will contribute to that goal and which are unnecessary?
4. Can I see any **relationships** in the data that might serve as a good foundation for my model?


Let's begin by looking at a summary of the dataset using **```dataframe.describe()```** and a list of its columns using **```list(dataframe.columns)```**:



In [0]:
# Analyze the "shape" of the data
display(df.describe())

# List all the columns in the dataset
list(df.columns)

### Analyze the data above
- Notice that not all columns were shown in the **```dataframe.describe()```** output... Why not?
- Do you see any immediate problems?
- Is our data labeled or unlabeled?
- Can you spot any possible relationships that we might explore?
- Can I accomplish my goal with this dataset?

### Use graphs to explore the data
Graphs are another way to look at the distributions of the data.

In [0]:
# List the first 5 items in 'ocean_proximity' to discover why it is not shown in the describe function
display(df['ocean_proximity'].head(5))

# Use histograms to look at the distribution of the data
df.hist('median_house_value')
df.hist('housing_median_age')
df.hist('t_rooms')
df.hist('total_bedrooms')
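
One way to see why a column is absent from `describe()`: by default it summarizes numeric columns only, so a text column has to be inspected separately. A small sketch on a made-up dataframe (the values are invented for illustration; only the column names mirror the housing data):

```python
import pandas as pd

# Toy dataframe (made-up values) with one numeric and one text column
toy = pd.DataFrame({
    'housing_median_age': [41, 21, 52, 52],
    'ocean_proximity': ['NEAR BAY', 'NEAR BAY', 'INLAND', '<1H OCEAN'],
})

# describe() keeps only the numeric column by default
numeric_summary = toy.describe()
print(list(numeric_summary.columns))

# Text columns can be summarized by counting each distinct value
category_counts = toy['ocean_proximity'].value_counts()
print(category_counts)
```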

### Problems
Looking at the data in the above few cells (including the column list and data description), there are at least a few problems we can see:
1. The **```median_house_value```** feature has a lot of values stacked up at the top of its range, around $500,000
2. **```t_rooms```** and **```total_bedrooms```** have very *long tails* which can cause problems with accuracy
3. We have **text-based data** in ```ocean_proximity```...  What should we do with this?
4. There is a **spelling error**... Do we need to worry about this?
5. We have some **missing data** in one of our columns... Can you spot which one?
---
Take some time to think about the above questions. Discuss them with your group or partner if applicable.
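
The "long tail" mentioned in item 2 can also be checked numerically. The sketch below uses invented numbers; it compares the mean with the median and computes the skewness with pandas `Series.skew()`. A mean well above the median and a large positive skew both indicate a long right tail.

```python
import pandas as pd

# Made-up room counts: most blocks are small, one is very large
rooms = pd.Series([800, 900, 1000, 1100, 1200, 1300, 30000])

mean_rooms = rooms.mean()
median_rooms = rooms.median()
skewness = rooms.skew()

print("mean: {0:.1f}, median: {1:.1f}, skew: {2:.2f}".format(
    mean_rooms, median_rooms, skewness))
```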

### Other types of visualizations
The *seaborn* library gives us access to a number of different visualizations. We will cover a few more here, but for further information, see the following links:
- [Seaborn official documentation](https://seaborn.pydata.org/)
- [Specific info on distributions](https://seaborn.pydata.org/tutorial/distributions.html)

Before we can perform more advanced visualizations, however, we must first deal with some of the issues listed above.

# **Step 3: Scrub the data**
### Typical actions taken at this stage are:

1.   Modifying or removing data (incomplete, irrelevant or duplicate)
2.   *One-hot encoding* (converting text-based values to numerical values)
3.   Binning (using numerical values to categorize data, such as longitude, price range, etc.)
4.   Feature engineering (in order to optimize our model)
       * feature selection (determining which features will be useful)
       * combining features (to save processing time)
       * removing unnecessary features
       
Next we need to methodically look through our dataset for errors, problems and missing values. This is known as "scrubbing" the data, and it is a **vital part of any machine learning project**. In fact, this stage represents most of the time that a data scientist spends on a project, as the data usually needs to be cleaned and changed significantly in order to maximize efficiency and create a viable model.
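
Binning (item 3 in the list above) can be sketched with `pandas.cut`. The bin edges and labels below are arbitrary choices for illustration, not values used elsewhere in this lesson.

```python
import pandas as pd

# Made-up house values
values = pd.Series([85000, 150000, 240000, 310000, 480000])

# Arbitrary bin edges and labels (assumptions for this example)
edges = [0, 100000, 250000, 500000]
labels = ['low', 'medium', 'high']

# Each value is assigned to the interval (left, right] that contains it
price_range = pd.cut(values, bins=edges, labels=labels)
print(price_range.tolist())
```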









## Pinpoint problems
We need to look at the data in multiple ways to identify all the problems that might exist.

### 1. Identify null values
- We can get a count of "NaN" values in each column with ```dataframe.isnull().sum(axis=0)```
- Another way to do this is to use ```dataframe.count(axis=0)``` to get a count of all non-null values in each column

Both methods are shown below:


In [0]:
# Get a count of "NaN" or missing values by feature
print('Counting only "NaN" values:\n')
display(df.isnull().sum(axis=0))

# Count the records in each column
print('\n\nCounting all non-null values:\n')
display(df.count(axis=0))

We can see in the above output that we are missing 207 records in ```total_bedrooms```. How can we deal with these missing values?
>The first questions to ask are: *Do we need this column? Will the information help us?*
>If the answer is "yes", then we should fix the data. If not, then we can move on. In this case, however, it's likely that ```total_bedrooms``` will be useful for our model, so we will keep it.

We have the following choices when dealing with null data:
1. Use the **mode** of the column
    - Useful for binary and categorical data (yes/no, divorce status, etc.)
2. Use the **median** value of the column
    - Useful for continuous variables in which there is an infinite number of possibilities
3. **Delete** the samples with missing data
    - Used as a last resort, as it reduces the data available for training/testing
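
The three options above can be compared on a tiny made-up column (the numbers are invented for illustration):

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value
bedrooms = pd.Series([2.0, 3.0, 3.0, 4.0, 5.0, 10.0, np.nan])

# Option 1: fill with the most frequent value
filled_with_mode = bedrooms.fillna(bedrooms.mode()[0])

# Option 2: fill with the middle value of the sorted data
filled_with_median = bedrooms.fillna(bedrooms.median())

# Option 3: remove the sample entirely
dropped = bedrooms.dropna()

print(filled_with_mode.tolist())
print(filled_with_median.tolist())
print(len(dropped))
```

Note that the mode and the median can differ even on a small column, which is why it is worth checking the summary statistics after filling.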
    
>The value in question here is continuous (though not extremely fine-grained), so according to the list above the median would be the natural choice. As a first experiment, however, we will fill in the missing data with the mode and look at how the summary statistics change.


In [0]:
# Display the original data for comparison
display(df.describe())

# Replace the null values with the mode, working on a copy so that df itself is left unchanged
df_mode = df.copy()
df_mode['total_bedrooms'] = df_mode['total_bedrooms'].fillna(df_mode['total_bedrooms'].mode()[0])
display(df_mode.describe())

## Pairplots using Seaborn
Seaborn is a visualization library that gives us access to a number of easy graphing options. One of those options is a pairplot, in which each feature is plotted against every other feature in a grid of graphs so that we can identify possible relationships.

In [0]:
sns.pairplot(df)

## Analysis of the data above
Looking at the graphs above:

*   Do you see any possible relationships?
*   Are there any features that we can eliminate to make things more optimized?
*   Are there any other issues that might be causing a problem?



## Remove unnecessary features
If a feature will not help the model, we should remove it before attempting any feature engineering or training our model. With larger datasets, this can significantly reduce compute time, and it saves us from having to fix errors in data we will not use.

*    It appears that **```proximity_to_store```** is pretty much useless. It does not have any immediately recognizable scale, is likely unrelated to the price of a house and seems to be uniformly distributed. We can get rid of it.
*    Latitude and longitude also seem unlikely to be useful. They might help identify specific cities or regions, but they are not going to help us nearly as much as other features in the set.
    *    Notice also that the latitude feature is misspelled ("lattitude" instead of "latitude"). If we were going to keep this feature, we would fix the spelling, since downstream applications or developers might not notice the error and would look for the column under its correct name.
*   We should also get rid of **```median_income```** since we are not sure what the scale represents.

In [0]:
# Remove unnecessary features
del df['proximity_to_store']
del df['lattitude']
del df['longitude']
del df['median_income']

# Ensure that changes have been made
display(df.head())
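
An alternative to `del` is `DataFrame.drop`, which removes several columns in one call. A small sketch on a made-up dataframe; `errors='ignore'` is used here so that re-running the cell in a notebook does not fail once the columns are already gone:

```python
import pandas as pd

# Toy dataframe (made-up values) reusing column names from the lesson
toy = pd.DataFrame({
    'proximity_to_store': [1, 2],
    'median_income': [8.3, 7.2],
    'median_house_value': [452600, 358500],
})

unwanted = ['proximity_to_store', 'median_income']

# drop() returns a new dataframe; columns that are absent are skipped silently
toy = toy.drop(columns=unwanted, errors='ignore')
toy = toy.drop(columns=unwanted, errors='ignore')  # safe to run twice

print(list(toy.columns))
```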

## Fix feature names if necessary
Looking at the feature names, **```t_rooms```**  stands out a bit. What is "t_rooms" measuring? Total rooms? Tea rooms? 

Looking at the feature next to it (**```total_bedrooms```**), it seems likely that this is supposed to be **```total_rooms```**, as the numbers seem consistently larger than the total bedrooms. After speaking with the team that gathered the data, we confirm that this is, in fact, a measure of total rooms per block, so we need to change the feature name.

In [0]:
# Ensure that feature names are descriptive of content and change them if necessary
df.rename(columns={'t_rooms':'total_rooms'}, inplace=True)

# Check that changes are applied and correct
display(df.head())

## Find and count the missing values
If a value is null when the data is imported to Pandas, then the value "NaN" is assigned. We could search through the entire dataset to manually locate the problems, but Pandas provides a much easier solution:



**```dataframe.isnull()```** will help us to identify where the missing values lie, and **```sum()```** will count them for us within each feature, as seen below.




In [0]:
# Get a count of "NaN" or missing values by feature
df.isnull().sum(axis=0)

We see in the output above that 207 of our records are missing values. At this point, we need to deal with these missing values using one of the following methods:
*    Assign a mean value to each (can create bias if outliers exist)
*    Assign a median value
*    Delete the samples with missing values

In truth, 207 samples out of our 20,000+ is not really a big number, so we could probably delete the records without much impact to the accuracy of our model. For the purposes of this lesson, however, we will look through our options to see if we can save those records (and so that you can learn how to do this when a larger percentage of your data is affected).
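
To judge whether the missing values are "a big number", it helps to express them as a share of the rows. A sketch on made-up data; `isnull().mean()` gives the fraction of missing entries per column because each `True` counts as 1:

```python
import numpy as np
import pandas as pd

# Toy dataframe (made-up values) with one missing entry
toy = pd.DataFrame({
    'total_rooms': [880, 7099, 1467, 1274],
    'total_bedrooms': [129, np.nan, 190, 235],
})

missing_count = toy.isnull().sum()    # number of missing entries per column
missing_share = toy.isnull().mean()   # fraction of missing entries per column

print(missing_count)
print((missing_share * 100).round(1))
```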

## Fill the values with the median of the feature
Below, we fill the null values in the original dataframe (df) with the median and store the result in a new dataframe (df_median) to analyze the impact of using the median value. We need to see if this drastically changes any of the dataset's statistical values.

In [0]:
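# Fill the null values with each column's median and store the result in a new dataframe for comparison
# (numeric_only=True skips the text column 'ocean_proximity', which has no median)
df_median = df.fillna(df.median(numeric_only=True))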

# Check that all null values have been filled in the new dataframe
df_median.isnull().sum()

In [0]:
# Analyze the mean for the affected feature (total_bedrooms) and compare it with the value in df_median
df.describe()

In [0]:
df_median.describe()

Next, let's compare the above results to what happens if we drop the null values instead of using the median.

In [0]:
# Drop every row that contains at least one null value (df itself is left unchanged)
df_dropped = df.dropna(axis=0, how='any')
df_dropped.isnull().sum()

In [0]:
df_median.describe()

In [0]:
df_dropped.describe()

We can see from the above analysis that filling the null records with the median value in ```total_bedrooms``` does not drastically change the values as compared to the dropped version, and we get the added benefit of more data for all the other columns that are not null in those records, so we will keep them and move on.
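
Rather than scrolling between two `describe()` outputs, the two versions of a column can be placed side by side. A sketch on made-up data (the dataframe and the column labels `median_filled` and `dropped` are names chosen for this example):

```python
import numpy as np
import pandas as pd

# Toy dataframe (made-up values) with one missing entry
toy = pd.DataFrame({'total_bedrooms': [100.0, 200.0, 300.0, 400.0, np.nan]})

median_version = toy.fillna(toy.median(numeric_only=True))
dropped_version = toy.dropna()

# Put both summaries next to each other, one column per strategy
comparison = pd.concat(
    [median_version['total_bedrooms'].describe(),
     dropped_version['total_bedrooms'].describe()],
    axis=1,
    keys=['median_filled', 'dropped'],
)
print(comparison)
```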

In [0]:
df = df.fillna(df.median(numeric_only=True))
display(df.describe())
print('\n\nCount of NaN values in each category:')
display(df.isnull().sum())

# Dealing with outliers
The next problem that we have to fix in our data is the outliers.

If we look again at our data description (**```df.describe()```**), we can see that the max value of **```median_house_value```** is 500001, which suggests that values above $500,000 were capped at that number. To confirm this suspicion, we can count all samples with a value over $500,000 by using the following command:


In [0]:
# Count the number of samples with values over 500000
count1 = (df['median_house_value'] == 500001).sum()
count2 = (df['median_house_value'] > 500000).sum()
print("Count of houses valued at $500,001: {0}".format(count1))
print("Count of houses valued over $500,000: {0}".format(count2))

We can see that there are 965 samples in the dataset that have the value 500001. Since this represents almost 5% of our data, it stands a good chance of introducing bias into our model (skewing the mean for **```median_house_value```**). Furthermore, it is preventing us from being able to accurately analyze the various relationships between the data in our **```sns.pairplot(df)```** graphs. 

We could do something like we did before and replace the outlier values with a median value to preserve the other datapoints, but this would likely result in more problems. For example, these higher value homes likely have other figures associated with them such as **```total_rooms```** and **```total_bedrooms```** that will not be accurately represented by median house value. We will likely cause more problems in our model than we will fix, so it is probably best to just eliminate the outlier samples instead. We will lose 5% of our data, but the remaining data will likely be more accurate.
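
Eliminating the capped samples can be sketched as a simple boolean filter. This is an illustration on made-up data, not necessarily the approach taken in the rest of this lesson; the cap of 500001 is taken from the data description discussed above.

```python
import pandas as pd

# Toy dataframe (made-up values); two samples sit at the cap
toy = pd.DataFrame({
    'median_house_value': [120000, 500001, 340000, 500001, 95000],
    'total_rooms': [1500, 4200, 2100, 3900, 800],
})

cap = 500001  # maximum value seen in the data description

# Share of samples at the cap, then the dataframe without them
capped_share = (toy['median_house_value'] >= cap).mean()
uncapped = toy[toy['median_house_value'] < cap]

print("Share of capped samples: {0:.0%}".format(capped_share))
print(len(uncapped))
```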

Before we get rid of all of our outliers, however, let's first change all our text-based values to numerical ones. Currently **```ocean_proximity```** has values like ```<1H OCEAN``` and ```INLAND``` rather than numerical values. We can use ***one hot encoding*** to fix that problem. Basically, it will create new features for each possibility and then assign a ```0``` if false and ```1``` if true for each sample.

In [0]:
# Make a new dataframe with 0/1 indicator columns for 'ocean_proximity'
# (dtype=int keeps the new columns numeric; newer pandas versions default to True/False)
features_df = pd.get_dummies(df, columns=['ocean_proximity'], dtype=int)

# Print the results to ensure that we have numerical values for each feature
display(features_df.head())
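
To make the effect of one-hot encoding concrete, here is the same kind of call on a three-row made-up dataframe. `dtype=int` is passed so that the new columns hold 0/1 rather than True/False.

```python
import pandas as pd

# Toy dataframe (made-up values) with one text column
toy = pd.DataFrame({
    'total_rooms': [880, 7099, 1467],
    'ocean_proximity': ['INLAND', 'NEAR BAY', 'INLAND'],
})

# One new column is created per distinct text value
encoded = pd.get_dummies(toy, columns=['ocean_proximity'], dtype=int)
print(list(encoded.columns))
print(encoded)
```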

### Next we can get rid of the outliers
Now we can deal with all possible outliers in our data by using the following commands:

In [0]:
# Import the necessary libraries
from scipy import stats
import numpy as np


# Use the following code to eliminate all outliers from every column in our new feature set
# This code is referenced here: https://stackoverflow.com/questions/23199796/detect-and-exclude-outliers-in-pandas-data-frame
features_df = features_df[(np.abs(stats.zscore(features_df)) < 3).all(axis=1)]
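
What the filter above does can be seen on a small made-up dataframe. The sketch computes z-scores with plain pandas using the population standard deviation (`ddof=0`, the default of `scipy.stats.zscore`). It also shows a side effect worth knowing: a rare one-hot category can itself exceed the threshold and be dropped. Restricting the test to the continuous columns, as in the last step, is one possible workaround and an assumption of this example, not something the lesson prescribes.

```python
import pandas as pd

# 20 made-up rows: one extreme room count and one rare one-hot category
toy = pd.DataFrame({
    'total_rooms': [1000] * 18 + [1100, 50000],
    'ocean_proximity_ISLAND': [1] + [0] * 19,
})

def zscores(frame):
    # z = (value - column mean) / column standard deviation (ddof=0)
    return (frame - frame.mean()) / frame.std(ddof=0)

# Same rule as above: keep rows where every column has |z| < 3
keep_all = (zscores(toy).abs() < 3).all(axis=1)
filtered_all = toy[keep_all]

# Variant: apply the rule to the continuous column only
keep_continuous = (zscores(toy[['total_rooms']]).abs() < 3).all(axis=1)
filtered_continuous = toy[keep_continuous]

print(len(filtered_all), len(filtered_continuous))
```

In the first result the single ISLAND row is removed along with the extreme room count; in the second only the extreme room count is removed.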

In [0]:
# Analyze the median house value 
print('Basic distribution:')
sns.pairplot(features_df)
#sns.distplot(features_df['median_house_value'])

In [0]:
# Analyze the median house value with a specified number of bins
print('Binned distribution (10 bins)')
sns.histplot(df['median_house_value'], bins=10, kde=True)

In [0]:
sns.histplot(df['total_rooms'], bins=20, kde=True)

In [0]:
display((df['total_rooms'] > 6000).sum())