## Sample code for treating the project as a regression problem. Based on Kaggle's material on modelling house prices with regression.

**IMPORT AND SET UP DATA**

The following cell imports and reads the data, setting it up in a comprehensible format. 


In [2]:
import pandas as pd
import matplotlib.pyplot as plt

# to find the directory you are working in (the /Users/ameliebuc part of the path below), uncomment:
# import os
# print(os.getcwd())

spreadsheet_file_path = "/Users/ameliebuc/Documents/byond_internship/ImBlanced-Classification.csv"
data = pd.read_csv(spreadsheet_file_path, encoding = 'utf-8')
data.describe()

| Unnamed: 0 | Label | b | c | d | e | f | g | h | i | j | k | l | m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 24844.0 | 24844.0 | 24844.0 | 24844.0 | 24844.0 | 24844.0 | 24844.0 | 24844.0 | 24844.0 | 24844.0 | 24844.0 | 24844.0 | 24844.0 |
| mean | 0.040211 | 36.845979 | 1.288118 | 1.564845 | 15.765939 | 7.881702 | 72.928192 | 3.200935 | 736.620633 | 951.564281 | 2253.277 | 1.95351 | 0.339961 |
| std | 0.196458 | 13.241031 | 0.453162 | 0.496761 | 26.337659 | 18.785623 | 40.728075 | 6.440581 | 292.545306 | 749.563452 | 14034.04 | 0.642311 | 0.473705 |
| min | 0.0 | 18.676712 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 25% | 0.0 | 25.920548 | 1.0 | 1.0 | 0.0 | 2.0 | 27.0 | 0.0 | 743.0 | 0.0 | 850.0 | 2.0 | 0.0 |
| 50% | 0.0 | 34.178082 | 1.0 | 2.0 | 8.0 | 2.0 | 99.0 | 0.0 | 826.0 | 1271.0 | 1200.0 | 2.0 | 0.0 |
| 75% | 0.0 | 45.328767 | 2.0 | 2.0 | 8.0 | 8.0 | 99.0 | 3.317808 | 904.0 | 1624.0 | 2050.0 | 2.0 | 1.0 |
| max | 1.0 | 95.476712 | 2.0 | 2.0 | 81.0 | 99.0 | 99.0 | 56.356164 | 1138.0 | 1900.0 | 2000000.0 | 3.0 | 1.0 |


Kaggle interprets the summary statistics named in the leftmost column as follows (in terms of their "Melbourne Housing" dataset; see https://www.kaggle.com/dansbecker/explore-your-data):

The results show 8 numbers for each column in your original dataset. The first number, the **count**, shows how many rows have non-missing values.

**Missing values** arise for many reasons. For example, the size of the 2nd bedroom wouldn't be collected when surveying a 1 bedroom house. We'll come back to the topic of missing data.
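As a minimal sketch of how missing values show up in pandas, the cell below builds a small made-up table (the column names `rooms` and `bedroom2_size` are hypothetical, not taken from the project data), counts the missing entries, and shows what `dropna(axis=0)` removes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: the second row has no value for "bedroom2_size".
toy = pd.DataFrame({
    "rooms": [2, 1, 3],
    "bedroom2_size": [12.5, np.nan, 10.0],
})

# Number of missing entries per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# describe() counts only the non-missing entries in its "count" row.
print(toy.describe().loc["count"])

# dropna(axis=0) removes every row that contains at least one missing value.
complete_rows = toy.dropna(axis=0)
print(len(toy), "->", len(complete_rows))
```

Dropping rows is the simplest option, but it discards the other values in those rows as well, so it is worth checking how many rows would be lost first.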

The second value is the **mean**, which is the average. Under that, **std** is the standard deviation, which measures how numerically spread out the values are.

To interpret the **min, 25%, 50%, 75% and max** values, imagine sorting each column from lowest to highest value. The first (smallest) value is the min. If you go a quarter of the way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the 25% value (pronounced "25th percentile"). The 50th and 75th percentiles are defined analogously, and the max is the largest number.
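The percentile description above can be checked on a tiny example. This is only an illustrative sketch; the five values are made up and already sorted so the percentiles are easy to read off.

```python
import pandas as pd

# Hypothetical values, already sorted from lowest to highest.
values = pd.Series([1, 2, 3, 4, 5])

# describe() returns count, mean, std, min, 25%, 50%, 75% and max.
summary = values.describe()
print(summary)

# quantile() computes the same percentiles directly.
q25, q50, q75 = values.quantile([0.25, 0.5, 0.75])
print(q25, q50, q75)
```

When a percentile falls between two entries, pandas interpolates linearly between them by default.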

In [6]:
data.columns  # column names; a notebook only displays this if it is the last expression in the cell
# dropna drops rows with missing values (think of na as "not available")
# data = data.dropna(axis=0)
y = data.Label
data_features = ['b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm']
X = data[data_features]
#X.describe()
#X.head()

**BUILD MODEL: SELECT DATA AND FEATURES FOR MODELING**

1. Choose variables/columns for modelling manually.
2. To select a prediction target (the column we want to predict), use dot notation. By convention, the prediction target is called **y**.
3. Choose the features that will be used to predict the Label. By convention, the table of features is called **X**.

The modelling steps are: define, fit, predict, evaluate.

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# split data into a set for training and for validation based on a random number generator
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

# compute the validation MAE of a decision tree limited to max_leaf_nodes leaves
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    data_model = DecisionTreeRegressor(max_leaf_nodes = max_leaf_nodes, random_state = 1)
    data_model.fit(train_X, train_y)
    val_predictions = data_model.predict(val_X)
    # Mean absolute error: on average, model is off by about X. error = |actual-predicted|
    val_mae = mean_absolute_error(val_y, val_predictions)
    return val_mae
# print("Validation Set's MAE: {}".format(val_mae))

# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
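    # note: %d truncates the MAE to an integer, so any error below 1 prints as 0; use %.4f to see the decimals
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))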

Max leaf nodes: 5  		 Mean Absolute Error:  0
Max leaf nodes: 50  		 Mean Absolute Error:  0
Max leaf nodes: 500  		 Mean Absolute Error:  0
Max leaf nodes: 5000  		 Mean Absolute Error:  0
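The zeros in the output above are almost certainly a formatting effect rather than a perfect model: the Label only takes values between 0 and 1, so the MAE stays below 1, and `%d` truncates it to an integer. The sketch below uses a made-up MAE value (an assumption, not a result from this dataset) to show the difference between the two format specifiers.

```python
# Hypothetical MAE value, chosen only to illustrate the formatting difference.
example_mae = 0.0712

# %d converts the number to an integer and drops the fractional part.
as_integer = "Mean Absolute Error: %d" % example_mae

# %.4f keeps four decimal places.
as_float = "Mean Absolute Error: %.4f" % example_mae

print(as_integer)  # Mean Absolute Error: 0
print(as_float)    # Mean Absolute Error: 0.0712
```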


When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).

This is a phenomenon called overfitting, where a model matches the training data almost perfectly, but does poorly in validation and other new data. On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups.

At an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses. The resulting predictions may be far off for most houses, even in the training data (and they will be bad in validation too, for the same reason). When a model fails to capture important distinctions and patterns in the data, and so performs poorly even on the training data, that is called underfitting.

Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting. 
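One way to see both effects is to compare the error on the training split with the error on the validation split for trees of different sizes. The sketch below uses synthetic data (a noisy sine curve, an assumption made purely for illustration) rather than the project data; the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=400)

X_tr_demo, X_val_demo, y_tr_demo, y_val_demo = train_test_split(
    X_demo, y_demo, random_state=0
)

# Map each tree size to its (training MAE, validation MAE).
results = {}
for max_leaf_nodes in [2, 20, 300]:
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(X_tr_demo, y_tr_demo)
    train_mae = mean_absolute_error(y_tr_demo, model.predict(X_tr_demo))
    val_mae = mean_absolute_error(y_val_demo, model.predict(X_val_demo))
    results[max_leaf_nodes] = (train_mae, val_mae)
    print("leaves=%d  train MAE=%.3f  validation MAE=%.3f"
          % (max_leaf_nodes, train_mae, val_mae))
```

A very small tree should show a high error on both splits (underfitting), while a very large tree should show a training error near zero together with a clearly higher validation error (overfitting).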

There are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes. But the max_leaf_nodes argument provides a very sensible way to control overfitting vs underfitting. The more leaves we allow the model to make, the more we move from underfitting toward overfitting.
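As a rough sketch of those alternatives, scikit-learn's `DecisionTreeRegressor` also accepts `max_depth` and `min_samples_leaf`. The example below fits three trees on synthetic data (an assumption for illustration, not the project data) and reports how many leaves and how much depth each constraint produces.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = 2 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.5, size=300)

# Cap the number of leaves; individual branches may still differ in depth.
by_leaves = DecisionTreeRegressor(max_leaf_nodes=8, random_state=1).fit(X_demo, y_demo)

# Cap the depth of every branch instead.
by_depth = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X_demo, y_demo)

# Require a minimum number of training rows in every leaf.
by_leaf_size = DecisionTreeRegressor(min_samples_leaf=20, random_state=1).fit(X_demo, y_demo)

for name, tree in [
    ("max_leaf_nodes=8", by_leaves),
    ("max_depth=3", by_depth),
    ("min_samples_leaf=20", by_leaf_size),
]:
    print("%-20s leaves=%d  depth=%d" % (name, tree.get_n_leaves(), tree.get_depth()))
```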

In summary, a model can suffer from either of two problems:

**Overfitting**: capturing spurious patterns that won't recur in the future, leading to less accurate predictions, or

**Underfitting**: failing to capture relevant patterns, again leading to less accurate predictions.

**We use validation data, which isn't used in model training**, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one.
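The "try many candidates and keep the best one" step might look like the following sketch. It runs on synthetic stand-in data (an assumption; the real X and y would be substituted), and `validation_mae` is a hypothetical helper name. Refitting the chosen tree size on all the data at the end is one common choice, not the only one.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption; substitute the real X and y).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=400)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=0)

def validation_mae(max_leaf_nodes):
    """Fit on the training split and score on the held-out validation split."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(X_tr, y_tr)
    return mean_absolute_error(y_val, model.predict(X_val))

# Score every candidate size and keep the one with the lowest validation MAE.
candidates = [5, 50, 500, 5000]
scores = {n: validation_mae(n) for n in candidates}
best_size = min(scores, key=scores.get)
print("Best max_leaf_nodes:", best_size)

# Once the size is chosen, refit on all available data.
final_model = DecisionTreeRegressor(max_leaf_nodes=best_size, random_state=1)
final_model.fit(X_demo, y_demo)
```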