# Visualizations and Random Forest 

Prior to this task, you should have watched a video on random forest on Canvas.

## Advantages of Random Forest

* Random forest can solve both types of problems, classification and regression, and gives decent estimates on both fronts.
* Random forest can be used on both categorical and continuous variables. 
* You do not have to scale features.
* Fairly robust to missing data and outliers.
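
As a quick illustration of the first and third points, here is a minimal sketch on synthetic data (not the Boston dataset). The variable names and the choice of 50 trees are assumptions made for this example only; note that no feature-scaling step appears anywhere.

```python
# Minimal sketch on synthetic data; names and settings are illustrative assumptions.
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: the target is a discrete label (0 or 1)
X_cls, y_cls = make_classification(n_samples=200, n_features=5, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_cls, y_cls)

# Regression: the target is a continuous value
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rfr = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_reg, y_reg)

print(clf.predict(X_cls[:3]))   # predicted class labels
print(rfr.predict(X_reg[:3]))   # predicted real-valued targets
```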

## Disadvantages of Random Forest

* It is complex, e.g., look at the tree at the end of this exercise!  This makes it feel like a black box, and we have very little control over what the model does.
* It can take a long time to train.

In [None]:
# Here are some alternative ways to load packages in python as aliases 
# This can be useful if you call them often

import numpy as np
import sklearn as sk
import sklearn.datasets as skd
import sklearn.ensemble as ske
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

The Boston Housing Dataset records house prices in various areas of Boston. Alongside price, the dataset also provides information such as the crime rate (CRIM), the proportion of non-retail business acres in the town (INDUS), and the proportion of owner-occupied units built before 1940 (AGE). The full list of attributes is given here:

* CRIM - per capita crime rate by town
* ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
* INDUS - proportion of non-retail business acres per town.
* CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
* NOX - nitric oxides concentration (parts per 10 million)
* RM - average number of rooms per dwelling
* AGE - proportion of owner-occupied units built prior to 1940
* DIS - weighted distances to five Boston employment centres
* RAD - index of accessibility to radial highways
* TAX - full-value property-tax rate per 10,000 dollars
* PTRATIO - pupil-teacher ratio by town
* B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
* LSTAT - % lower status of the population
* MEDV - median value of owner-occupied homes in 1000s of dollars

In [None]:
data = pd.read_csv('../data/___')
df = pd.DataFrame(data)
df.shape
df['MEDV']

In [None]:
df.shape

We should check to see if there are any null values.  There are several ways we've learned to do this.

In [None]:
pd.___(df).any() 

In [None]:
pd.___(df).sum()

So there are some null values in there.  You should decide how you want to deal with these.  In this exercise, I'm just going to remove any rows with null values.
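
If you would rather keep every row, one common alternative is to fill each gap with a summary statistic of its column. The sketch below uses a small made-up DataFrame (`toy` and its values are hypothetical, not taken from the Boston data) and fills missing entries with the column median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({"RM": [6.5, np.nan, 5.9, 7.1],
                    "TAX": [296.0, 242.0, np.nan, 222.0]})

# Count missing values per column
print(toy.isnull().sum())

# Alternative to dropping rows: fill each gap with that column's median
toy_filled = toy.fillna(toy.median())
print(toy_filled)
```

Whether imputation or row removal is the better choice depends on how much data you can afford to lose and on why the values are missing.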

In general, remember that we want to confirm that:
* No data points immediately appear anomalous.
* There are no zeros in any of the measurement columns.

Another way to verify the quality of the data is to make basic plots. Often it is easier to spot anomalies in a graph than in numbers.

In [None]:
df_dropped = df.___()
pd.isnull(df_dropped).sum()

In [None]:
df_dropped.describe()

It is useful to know whether some pairs of attributes are correlated, and how strongly. For many ML algorithms, strongly correlated (non-independent) features should be treated with caution.  Here is a good [blog post](https://towardsdatascience.com/data-correlation-can-make-or-break-your-machine-learning-project-82ee11039cc9) explaining why.

To prevent this, there are methods for deriving features that are as uncorrelated as possible (PCA, ICA, autoencoders, dimensionality reduction, manifold learning, etc.), which we'll learn about in coming classes.

We can explore correlation with Pandas pretty easily...

In [None]:
corr = df_dropped.____(method='pearson')
corr
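
As a reminder of what the Pearson coefficient measures, here is a small hand-rolled sketch on made-up arrays. The helper `pearson` is a hypothetical function written for this illustration; it divides the covariance of two variables by the product of their standard deviations, and `np.corrcoef` is used as a cross-check.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_linear = 2.0 * x + 1.0    # exact increasing linear function of x
y_inverse = -x              # exact decreasing linear function of x

def pearson(a, b):
    """Pearson r: covariance divided by the product of the standard deviations."""
    a_c, b_c = a - a.mean(), b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

print(pearson(x, y_linear))             # very close to +1
print(pearson(x, y_inverse))            # very close to -1
print(np.corrcoef(x, y_linear)[0, 1])   # NumPy's built-in cross-check
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship.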

### Let's explore/review some visualization approaches

A good way to look at correlations quickly is a visualization called a heatmap.  Let's take a look at correlations between features in our dataset.

In [None]:
import seaborn as sns 

sns.heatmap(____) # compute and plot the pair-wise correlations
# save to file, remove the big white borders
# You can read more about heatmaps and correlations in sns:  
# https://seaborn.pydata.org/examples/many_pairwise_correlations.html



Let's take a look how we can explore the distributions of values within a specific feature.  Specifically, let's look at the distribution of property tax in Boston. We can do this either in matplotlib or sns.  There are so many tools available to you in Python!

In [None]:
df = df_dropped

In [None]:
import matplotlib.pyplot as plt
attr = df['TAX']
plt.hist(attr, bins=50)

In [None]:
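# sns.distplot is deprecated in recent seaborn releases;
# histplot with kde=True draws the same histogram plus density curve
sns.histplot(attr, bins=50, kde=True)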

What's the correlation between property taxes and the number of rooms in a house?

In [None]:
plt.scatter(df['TAX'], df['RM'])


Another possibility is to aggregate data points over 2D areas and estimate the [probability density function](https://en.wikipedia.org/wiki/Probability_density_function). It's a 2D generalization of a histogram. We can either use a rectangular grid, or even a hexagonal one.
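
To make the "2D histogram as a density estimate" idea concrete, here is a minimal NumPy sketch on a rectangular grid. The data are synthetic and the names `a` and `b` are placeholders for any two numeric columns.

```python
import numpy as np

# Synthetic, loosely correlated data standing in for two numeric columns
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = 0.5 * a + rng.normal(scale=0.5, size=1000)

# Count points in a 10x10 rectangular grid; density=True rescales the counts
# so that the histogram integrates to 1 over the grid
density, a_edges, b_edges = np.histogram2d(a, b, bins=10, density=True)

# Area of each grid cell, used to integrate the estimated density
cell_area = np.outer(np.diff(a_edges), np.diff(b_edges))
print(density.shape)                 # (10, 10)
print((density * cell_area).sum())   # approximately 1.0
```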

In [None]:
sns.jointplot(x = 'TAX', y = 'RM', data = df,  kind='hex')


In [None]:
sns.jointplot(x = 'TAX', y = 'RM', data = df,  kind='kde')


What you'll see is you have access to so many visualizations.  A great way to explore them is through the gallery:  https://seaborn.pydata.org/examples/index.html


# How to implement Random Forest

First, we need to get a train and test dataset going. We are going to see if we can predict the median value of housing.

In [None]:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split

# Target: median value of owner-occupied homes in $1000s (column name from the description)
df_target = pd.DataFrame(df, columns = ['____'])
# Features: every column except the target
# (assumed definition; df2 is not defined anywhere else in this notebook)
df2 = df.drop(columns = df_target.columns)
X = df2
y = np.array(df_target).ravel() # flattens the (n, 1) column vector into a 1-D array of shape (n,)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
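
To see what `test_size = 0.25` does to the array shapes, here is a toy sketch with 100 made-up rows. The arrays `X_toy` and `y_toy` are hypothetical and unrelated to the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(200).reshape(100, 2)   # 100 rows, 2 features
y_toy = np.arange(100)                   # 100 targets

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)
print(X_tr.shape, X_te.shape)   # (75, 2) (25, 2)
print(y_tr.shape, y_te.shape)   # (75,) (25,)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the test set on every run.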

In [None]:
print(X_train)

In [None]:

print(X_train.shape, y_train.shape)

In [None]:
print(df)

In [None]:
reg = ske.RandomForestRegressor(n_estimators = 1000, random_state = 0)

In [None]:
reg.fit(____, _____)

The 'ravel' command flattens an array:  "ravel(): when you have y.shape == (10, 1), using y.ravel().shape == (10, ). In words... it flattens an array."

https://stackoverflow.com/questions/34165731/a-column-vector-y-was-passed-when-a-1d-array-was-expected
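
The quoted behaviour is easy to check on a small made-up array:

```python
import numpy as np

col = np.arange(10).reshape(10, 1)   # column vector with shape (10, 1)
flat = col.ravel()                   # flattened 1-D array

print(col.shape)    # (10, 1)
print(flat.shape)   # (10,)
```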

In [None]:
y_pred = reg.predict(____)
print(y_pred)

How do we evaluate this model?  Previously, we've worked with labels for classifications but now instead of a DISCRETE target, we've got a continuous target.  For example, the confusion matrix doesn't make sense and the code will error out below:

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(____, ____)

Check out this [documentation](https://scikit-learn.org/stable/modules/model_evaluation.html) and see if you can find some ways to evaluate this model.

In [None]:
from sklearn.metrics import explained_variance_score, max_error, mean_absolute_error, mean_squared_error
from sklearn.metrics import r2_score

print(explained_variance_score(y_test, y_pred))
print(max_error(y_test, y_pred))
print(mean_absolute_error(y_test, y_pred))
print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred, multioutput='variance_weighted'))
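
To get a feel for what these numbers mean, the sketch below recomputes three of the metrics by hand on four made-up values (`y_true_toy` and `y_pred_toy` are hypothetical) and checks them against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true_toy = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_toy = np.array([2.0, 5.0, 4.0, 8.0])

err = y_true_toy - y_pred_toy                 # residuals: [1, 0, -2, -1]
mae = np.abs(err).mean()                      # mean absolute error = 1.0
mse = (err ** 2).mean()                       # mean squared error = 1.5
ss_res = (err ** 2).sum()                     # residual sum of squares
ss_tot = ((y_true_toy - y_true_toy.mean()) ** 2).sum()   # total sum of squares
r2 = 1 - ss_res / ss_tot                      # fraction of variance explained

print(mae, mean_absolute_error(y_true_toy, y_pred_toy))
print(mse, mean_squared_error(y_true_toy, y_pred_toy))
print(r2, r2_score(y_true_toy, y_pred_toy))
```

MAE and MSE are in the units of the target (and its square), so lower is better, while an R² closer to 1 means the model explains more of the variance.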

The importance of our features can be found in reg.feature_importances_. 

In [None]:
print(reg.feature_importances_)
print(df2.columns)
df3 = pd.DataFrame({'feature_names':df2.columns, 'fet_imp': reg.feature_importances_})
df3

We can compute how much each feature contributes to decreasing the weighted impurity within a tree.   This is a fast calculation, but one should be cautious because it can be a biased approach.  It has a tendency to inflate the importance of continuous features or high-cardinality categorical variables (those with a lot of very uncommon or unique values).
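
One commonly used cross-check, which is not part of the original exercise, is permutation importance: shuffle one feature at a time on held-out data and measure how much the score drops. The sketch below uses synthetic data, and all names and settings in it are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 4 features, only 2 of which actually drive the target
X_syn, y_syn = make_regression(n_samples=300, n_features=4, n_informative=2,
                               noise=5.0, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_a, y_a)

# Impurity-based importances (computed from the training data, sum to 1)
print(forest.feature_importances_)

# Permutation importances, measured on the held-out split
perm = permutation_importance(forest, X_b, y_b, n_repeats=5, random_state=0)
print(perm.importances_mean)
```

If the two rankings disagree strongly, that is a hint that the impurity-based numbers may be affected by the bias described above.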

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(x='feature_names', y='fet_imp', data=df3)
plt.xlabel('Feature Names')
plt.ylabel('Feature Importance')
plt.title('Feature Importance Bar Chart')


In [None]:
fig, ax = plt.subplots(1, 1)
ax.scatter(X['RM'], y)
ax.set_xlabel('RM')
ax.set_ylabel('Value of houses (k$)')

In [None]:
from sklearn import tree
tree.export_graphviz(reg.estimators_[0],
                     'tree.dot')

In [None]:
# Import tools needed for visualization
from sklearn.tree import export_graphviz

# Pull out one tree from the forest
# (named single_tree so it does not shadow the sklearn `tree` module imported above)
single_tree = reg.estimators_[5]
# Export the tree to a dot file
export_graphviz(single_tree, out_file = 'tree.dot', feature_names = list(df2.columns), rounded = True, precision = 1)


You'll need to open the tree.dot file in a text editor, e.g., Notepad.  Select all the code and paste it in here:  http://www.webgraphviz.com/.  Scroll right and the tree should show up.
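
If you just want a quick look at a tree's rules without leaving the notebook, scikit-learn can also print them as plain text. This is a sketch on synthetic data with a deliberately shallow forest; the feature names `f0`, `f1`, `f2` are made up for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

# Small synthetic problem and a shallow forest so the printout stays short
X_demo, y_demo = make_regression(n_samples=100, n_features=3, random_state=0)
demo_forest = RandomForestRegressor(n_estimators=5, max_depth=2,
                                    random_state=0).fit(X_demo, y_demo)

# Text rendering of the first tree's split rules
rules = export_text(demo_forest.estimators_[0], feature_names=["f0", "f1", "f2"])
print(rules)
```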

## More practice - optional but recommended because it's interesting and doesn't take too long

This is another good [tutorial](https://towardsdatascience.com/random-forest-in-python-24d0893d51c0) on random forest.  You can work through this tutorial on your own and expand it for your choose-your-adventure, though you should be sure to demonstrate knowledge of this topic rather than just copying and executing the tutorial.