Welcome to PyData Special Interest Group @ SF Python Project Night
-----

The goal is to have a common dataset to explore together. 

Tonight, we are going to explore the Diamond 💎 Dataset by fitting a decision tree 🌳.

Refer back to the previous notebook about exploratory data analysis (EDA) [here](2019-03-20-Diamond-Dataset-Explore.ipynb).

You can download and run this notebook locally. Or you can run it in the cloud by clicking [here](https://colab.research.google.com/github/brianspiering/PyDataSIG/blob/master/2019-04-17-Diamond-Dataset-Decision-Tree.ipynb).

Visualizing a decision tree
-----

Let's walk through a tutorial to get the fundamentals of decision trees [here](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/).

Creating our own decision tree
-----

Let's run everything from an empty state

In [85]:
reset -fs

In [86]:
# Here are imports to get you started
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

In [87]:
# Let's grab the data from GitHub
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv'
diamonds = pd.read_csv(url)

In [88]:
# Let's predict price
y = diamonds.price.values

In [89]:
# Let's use all other variables to predict
X = diamonds.drop('price', axis=1)
X.head(n=2)

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,3.89,3.84,2.31


In [90]:
# It is a bit tricky to handle categorical variables.
# Let's drop those for now
X = X.drop(['cut', 'color', 'clarity'], axis=1)
X.head(n=2)

Unnamed: 0,carat,depth,table,x,y,z
0,0.23,61.5,55.0,3.95,3.98,2.43
1,0.21,59.8,61.0,3.89,3.84,2.31


In [91]:
# Create decision tree regressor object
model = DecisionTreeRegressor()

In [92]:
# Train model
model.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [93]:
# Define new datapoint
datapoint = {'carat': .22,
             'depth': 60,
             'table': 57,
             'x': 3.9,
             'y': 3.9,
             'z': 2.40}

# Munge datapoint
datapoint_for_model = np.array(list(datapoint.values())).reshape(1, -1)

# Predict value of new datapoint
prediction = model.predict(datapoint_for_model)
prediction

array([404.])

In [94]:
# Let's clean that up
print(f"${prediction[0]:,.2f}")

$404.00
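
The munging step above relies on the dictionary's keys being in the same order as the columns used for training. A minimal sketch of one alternative, assuming a tiny made-up training frame (the names `X_demo`, `y_demo`, `demo_model`, and `new_point` and all values are illustrative, not taken from the diamonds data), builds a one-row DataFrame and reorders it to match the training columns:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Tiny made-up training set (illustrative values only)
X_demo = pd.DataFrame({'carat': [0.2, 0.3, 1.0, 1.5],
                       'depth': [61.0, 60.0, 62.0, 59.0]})
y_demo = [350.0, 450.0, 5000.0, 9000.0]

demo_model = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

# Keys deliberately listed in a different order than the training columns
new_point = {'depth': 61.0, 'carat': 0.2}

# Reorder the columns to match the training data before predicting
new_row = pd.DataFrame([new_point])[list(X_demo.columns)]
demo_prediction = demo_model.predict(new_row)
print(f"${demo_prediction[0]:,.2f}")
```

Selecting columns by name keeps each value attached to the right feature even when the dictionary is written in a different order.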


Summary
-----

1. Define problem - Predict the price of a diamond based on its attributes
1. Find the data - Diamond dataset
1. Train model - Fit a decision tree
1. Predict new data point

Sidebar: Overfitting
----

The goal of machine learning is to learn a function from historical data that can predict future data.

Overfitting is memorizing the historical data. Memorization is brittle and might not do well predicting future data.

We can see if we are overfitting by doing a "train/test" split. 

In [96]:
from sklearn.model_selection import train_test_split

In [97]:
# Split the data 
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [103]:
# Train model just on training data
model.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [104]:
# Evaluate the model just on the test data
predictions = model.predict(X_test)

In [110]:
from sklearn.metrics import r2_score

In [114]:
r2 = r2_score(y_test, predictions)

print(f"The current model scores on unseen data {r2:.2f} out of 1.") 

The current model scores on unseen data 0.78 out of 1.
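
A test score on its own does not show overfitting; the gap between the training score and the test score does. The sketch below assumes synthetic data (a noisy linear relationship generated with NumPy, not the diamonds data) and compares an unconstrained tree with one limited by `max_depth`. All variable names here are illustrative.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one feature and a noisy linear target (assumption for illustration)
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(500, 1))
y_syn = 3 * X_syn[:, 0] + rng.normal(0, 3, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# Map each max_depth setting to its (train R^2, test R^2) pair
scores = {}
for depth in (None, 3):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    scores[depth] = (r2_score(y_tr, tree.predict(X_tr)),
                     r2_score(y_te, tree.predict(X_te)))
    print(f"max_depth={depth}: train {scores[depth][0]:.2f}, "
          f"test {scores[depth][1]:.2f}")
```

A training score near 1 paired with a noticeably lower test score suggests the tree has memorized the training data. Limiting the depth usually narrows that gap.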


On Your Own
-----

- Tune model by trying other parameters. Check out the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html).
- Add categorical data as features. An example is [here](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features).
- Better visualizations for the decision tree. An example is [here](https://explained.ai/decision-tree-viz/).
- Fit other models. A list of regression models in scikit-learn is [here](https://scikit-learn.org/stable/supervised_learning.html).
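
As a starting point for the categorical-features exercise, one option is ordinal encoding, which turns each category into a number a tree can split on. The sketch below assumes a small hand-made frame that mimics the `cut` column and an assumed quality ordering from lowest to highest; `X_small`, `cut_order`, and `X_encoded` are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Small hand-made frame that mimics two of the diamond columns
X_small = pd.DataFrame({'carat': [0.23, 0.21, 0.29],
                        'cut': ['Ideal', 'Premium', 'Good']})

# Assumed ordering from lowest to highest quality
cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']

# Each category is replaced by its position in cut_order
encoder = OrdinalEncoder(categories=[cut_order])
X_encoded = X_small.copy()
X_encoded['cut'] = encoder.fit_transform(X_small[['cut']]).ravel()
print(X_encoded)
```

The same pattern could be repeated for `color` and `clarity`, each with its own ordering. One-hot encoding is an alternative when no natural order exists.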