### The goal of this exercise is to build, visualize, and diagnose the performance of DT/kNN algorithms on the (larger) habitable planets data set.

Data for this exercise come from [here](http://phl.upr.edu/projects/habitable-exoplanets-catalog/data/database).

In [None]:
import pandas as pd
import numpy as np
import sklearn.tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics 
from sklearn.model_selection import cross_val_predict, cross_val_score, cross_validate
from sklearn.model_selection import KFold, StratifiedKFold
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

In [None]:
#Skip this cell if it gives you problems!

from io import StringIO  
from IPython.display import Image  
import pydotplus
from sklearn.tree import export_graphviz

In [None]:
import matplotlib
font = {'size'   : 20}
matplotlib.rc('font', **font)
matplotlib.rc('xtick', labelsize=20) 
matplotlib.rc('ytick', labelsize=20) 
matplotlib.rcParams['figure.dpi'] = 300

### Step 1: Preliminary data analysis/exploration.

Once we are working with research-level data sets, our first step should always be data exploration.

We can read the data into a data frame, as we did previously, and do some preliminary data analysis.

In [None]:
df = pd.read_csv('phl_exoplanet_catalog.csv', sep = ',')

In [None]:
df.head()

In [None]:
df.columns

The "describe" method is very useful for viewing summary statistics of the numerical columns.

In [None]:
df.describe()

Here is a way of counting the non-missing entries in each column, by class.

In [None]:
df.groupby('P_HABITABLE').count()

#### Start by lumping together Probably and Possibly Habitable planets.

In [None]:
bindf = df.drop('P_HABITABLE', axis = 1) #What are we doing here?  Dropping label column

In [None]:
bindf['P_HABITABLE'] = (np.logical_or((df.P_HABITABLE == 1) , (df.P_HABITABLE == 2))).astype(int) #how about here? 
#bindf['P_HABITABLE'] = bindf['P_HABITABLE'].astype(int) #creating a new label column and casting it as integer

In [None]:
bindf['P_HABITABLE'].head()
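
The relabeling above can also be written with `isin`, which some readers find easier to scan than `np.logical_or`. The sketch below runs on a small made-up column (the values in `toy` are illustrative only, not catalog data) and assumes, as the cell above does, that labels 1 and 2 are the two habitable classes.

```python
import pandas as pd

# Toy label column (made-up values): 0 = not habitable, 1 and 2 = the two
# habitable classes that we want to merge into a single positive class.
toy = pd.DataFrame({'P_HABITABLE': [0, 1, 2, 0, 2]})

# isin() returns a boolean mask; astype(int) turns True/False into 1/0.
toy_binary = toy['P_HABITABLE'].isin([1, 2]).astype(int)

print(toy_binary.tolist())  # [0, 1, 1, 0, 1]
```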

### Let's select some columns.

S_MAG - star magnitude 

S_DISTANCE - star distance (parsecs)

S_METALLICITY - star metallicity (dex)

S_MASS - star mass (solar units)

S_RADIUS - star radius (solar units)

S_AGE - star age (Gy)

S_TEMPERATURE - star effective temperature (K)

S_LOG_G - star log(g)

P_DISTANCE - planet mean distance from the star (AU) 

P_FLUX - planet mean stellar flux (earth units)

P_PERIOD - planet period (days) 

### We can select the same features as we did in Chapter 2.

In [None]:
final_features = bindf[['S_MASS', 'P_PERIOD', 'P_DISTANCE']] 

In [None]:
targets = bindf.P_HABITABLE

In [None]:
final_features.head()

In [None]:
targets

### There are some NaNs. We can see this by using the "describe" method, whose "count" row only includes the non-missing values in each column.

In [None]:
final_features.shape

In [None]:
final_features.describe()

### We can count missing data by column...

In [None]:
final_features.isnull().sum() #can also use .isna

### ...and get rid of them (Note: there are much better imputing strategies!)

In [None]:
final_features = final_features.dropna(axis = 0) #gets rid of any instance with at least one NaN in any column
final_features.shape
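
As one possible alternative to dropping rows, here is a minimal sketch of median imputation with scikit-learn's `SimpleImputer`. The `toy_features` frame is made up for illustration; in a real workflow the imputer would be fit on the training set only, so that no information leaks from the test set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up feature table with one missing value per column.
toy_features = pd.DataFrame({
    'S_MASS': [1.0, np.nan, 0.5, 1.5],
    'P_PERIOD': [10.0, 20.0, np.nan, 40.0],
})

# Replace each NaN with the median of the observed values in its column.
imputer = SimpleImputer(strategy='median')
imputed = pd.DataFrame(imputer.fit_transform(toy_features),
                       columns=toy_features.columns,
                       index=toy_features.index)

print(imputed)
```

Unlike `dropna`, this keeps every row, at the price of inventing plausible values for the missing entries.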

### Next step: search for outliers

Method 1 - plot!

In [None]:
plt.hist(final_features.iloc[:,0], bins = 100, alpha = 0.5);

There is a remarkable outlier; the same happens for other features. Non-pro tip: if you plot data in a histogram, and the range is surprisingly large, it means that there are outliers :) 

But we could have also known from the difference between mean and median (which, in fact, is even more pronounced for orbital distance and period).

In [None]:
final_features.describe()

#### Let's get rid of some outliers.

In [None]:
#This eliminates > 5 sigma outliers; however, z-scores are measured from the mean and scaled by the standard deviation, both of which the outliers themselves distort, so it might not be ideal

final_features = final_features[(np.abs(stats.zscore(final_features)) < 5).all(axis=1)] 
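
To illustrate why counting from the mean can be a problem, here is a sketch that compares the ordinary z-score with a robust variant based on the median and the median absolute deviation (MAD). This is a swapped-in technique, not what the cell above does; the data are made up, and the factor 0.6745 is the usual constant that makes the MAD comparable to a standard deviation for normally distributed data.

```python
import pandas as pd

# Made-up column: six ordinary values and one extreme outlier.
toy = pd.DataFrame({'S_MASS': [0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 50.0]})

# Ordinary z-score: the outlier inflates both the mean and the standard deviation.
classic_z = (toy - toy.mean()) / toy.std(ddof=0)
keep_classic = (classic_z.abs() < 5).all(axis=1)

# Robust z-score: median and MAD are barely affected by a single extreme value.
# (If the MAD were zero, this would divide by zero; real data may need a guard.)
median = toy.median()
mad = (toy - median).abs().median()
robust_z = 0.6745 * (toy - median) / mad
keep_robust = (robust_z.abs() < 5).all(axis=1)

toy_clean = toy[keep_robust]
print(int(keep_classic.sum()), int(keep_robust.sum()))
```

In this small example the ordinary 5-sigma cut keeps all seven rows, because the outlier drags the mean and standard deviation toward itself, while the robust cut removes it.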

In [None]:
targets = targets[final_features.index]

#### Now reset index.

In [None]:
final_features = final_features.reset_index(drop=True)

In [None]:
final_features.head()

#### And do the same for the label vector.

In [None]:
targets = targets.reset_index(drop=True)

In [None]:
targets.head()

#### Comparing the shapes, we can see that 9 outliers were eliminated.

In [None]:
final_features.shape

### Check balance of data set

In [None]:
#Simple way: count 0/1s, get fraction of total

In [None]:
np.sum(targets)/len(targets)

In [None]:
np.bincount(targets) #this shows the distribution of the two classes

### This tells us that our data set is extremely imbalanced, and therefore, we need to be careful.
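
One way to see why imbalance calls for care is to score a classifier that always predicts the majority class. The sketch below uses synthetic labels with an assumed 2% positive rate (not the ratio in the catalog) and scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn import metrics

# Synthetic labels: 2 positives out of 100 (an assumed ratio, for illustration only).
y = np.array([1] * 2 + [0] * 98)
X = np.zeros((100, 1))  # the dummy classifier ignores the features

# Always predict the most frequent class, i.e. "not habitable".
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = baseline.predict(X)

acc = metrics.accuracy_score(y, y_pred)
rec = metrics.recall_score(y, y_pred)
print(acc, rec)
```

The accuracy looks excellent even though not a single positive example is found, which is why recall or precision are often more informative than accuracy on data like these.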

#### Finally, we can explore the data by class, to get a sense of how the two classes differ from one another. For this, we need to concatenate the feature and label data frames so that we can group by label.

In [None]:
#Note: the result is displayed but not stored; assign it to a variable if you want to keep the combined data frame

pd.concat([final_features, targets], axis=1)

In [None]:
#We can group by label and take a look at summary statistics

pd.concat([final_features, targets], axis=1).groupby('P_HABITABLE').describe(percentiles = [])

#### We can also just take a look at the first two features, using different colors for the two classes.

In [None]:
plt.figure(figsize=(10,6))

cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ['#20B2AA','#FF00FF'])

a = plt.scatter(final_features['S_MASS'], final_features['P_PERIOD'], marker = 'o',\
            c = targets, s = 100, cmap=cmap, label = 'Test')

plt.legend();

a.set_facecolor('none')

plt.yscale('log')
plt.xlabel('Mass of Parent Star (Solar Mass Units)')
plt.ylabel('Period of Orbit (days)');

bluepatch = mpatches.Patch(color='#20B2AA', label='Not Habitable')
magentapatch = mpatches.Patch(color='#FF00FF', label='Habitable')

ax = plt.gca()
leg = ax.get_legend()

plt.legend(handles=[magentapatch, bluepatch],\
           loc = 'lower right', fontsize = 14);

#### Ok, this is all for preliminary data exploration. Time to deploy.

### If you just want a random train/test split, you can use this function to create the four arrays you need:

In [None]:
Xtrain, Xtest, ytrain, ytest = train_test_split(final_features, targets, random_state=2)

In [None]:
Xtrain.shape, Xtest.shape
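
With very few positive examples, a purely random split can leave the test set with almost none of them. A possible remedy, sketched here on synthetic labels (the arrays `X` and `y` are made up), is the `stratify` argument of `train_test_split`, which preserves the class proportions in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 200 examples, 10% positives.
X = np.arange(200).reshape(-1, 1)
y = np.array([1] * 20 + [0] * 180)

# stratify=y keeps the 10% positive rate in both the train and the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=2)

print(y_tr.mean(), y_te.mean())
```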

### We can also implement three flavors of k-fold Cross Validation, as you see below.

Note: you can fix the random seed for exactly reproducible behavior.

TL;DR: use the second or the third method.

In [None]:
# This is the standard version. Important: it doesn't shuffle the data, 
# so if your positive examples are all at the beginning or all at the end, it might lead to disastrous results.

cv1 = KFold(n_splits = 5)

#This is v2: shuffling added (recommended!)

cv2 = KFold(shuffle = True, n_splits = 5, random_state=5)

# STRATIFICATION ensures that the class distributions in each split resemble those of the 
# entire data set 

cv3 = StratifiedKFold(shuffle = True, n_splits = 5, random_state=5)
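
To make the difference concrete, the sketch below counts the positive examples that end up in each test fold, using synthetic labels in which all the positives sit at the beginning. The helper `positives_per_test_fold` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic labels: 10 positives, all at the beginning, followed by 90 negatives.
y = np.array([1] * 10 + [0] * 90)
X = np.zeros((100, 1))

def positives_per_test_fold(cv):
    """Return the number of positive labels in each test fold of the splitter."""
    return [int(y[test_idx].sum()) for _, test_idx in cv.split(X, y)]

plain = positives_per_test_fold(KFold(n_splits=5))
stratified = positives_per_test_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=5))

print(plain)       # [10, 0, 0, 0, 0]
print(stratified)  # [2, 2, 2, 2, 2]
```

Without shuffling, four of the five test folds contain no positive examples at all, so metrics such as recall cannot even be computed on them.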


### Here is an example of how to use cross validation.

The first command generates a dictionary with keys fit_time, score_time, test_score and, when return_train_score = True is passed, train_score.

The second generates predicted labels by compiling the predictions obtained on the k test folds. 


In [None]:
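# `model` must be defined before this cell runs; the decision tree below is only a placeholder.
model = DecisionTreeClassifier()

scores = cross_validate(model, final_features, targets, cv = cv2, scoring = 'accuracy', return_train_score = True)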

pred = cross_val_predict(model, final_features, targets, cv = cv2)
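
Here is a self-contained sketch of the same two calls on synthetic data, so that the structure of the outputs can be inspected without the catalog file. The data set from `make_classification`, the choice of recall as the metric, and the unpruned decision tree are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (StratifiedKFold, cross_val_predict,
                                     cross_validate)
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data (roughly 10% positives); not the exoplanet catalog.
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)

model = DecisionTreeClassifier(random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)

# One train score and one test score per fold, plus timing information.
scores = cross_validate(model, X, y, cv=cv, scoring='recall',
                        return_train_score=True)

# One predicted label per example, each made while that example was in a test fold.
pred = cross_val_predict(model, X, y, cv=cv)

print(sorted(scores.keys()))
print(scores['train_score'].mean(), scores['test_score'].mean())
```

Comparing the mean train score with the mean test score is a quick first check for overfitting: a large gap between the two suggests high variance.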

### Ok, now it's your turn!

- Deploy your favorite model, either a Decision Tree or a kNN (with k of your choice).

- Choose which metric you want to use to evaluate your model (e.g., accuracy, precision, or recall).

- Decide whether you want to do a single train/test split or cross-validation.

- Generate scores and predicted labels for your model.

- Does your model have high variance or high bias?

- What kind of steps would you take to improve?

You might find the functions below useful in your quest.

In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    plt.figure(figsize=(7,6))
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center", verticalalignment="center",
                 color="green" if i == j else "red", fontsize = 30)

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
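
The `cm` argument of the function above can be computed with `metrics.confusion_matrix`. The sketch below does so for made-up labels and applies the same row normalization used inside the function, so that each row shows what fraction of that true class received each predicted label.

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted labels (0 = not habitable, 1 = habitable).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_true, y_pred)

# Row normalization: divide each row by the number of examples of that true class.
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

print(cm)
print(cm_norm)
```

After normalization, the bottom-right entry is the recall of the positive class. With the function defined above, the plot would be produced by a call such as `plot_confusion_matrix(cm, classes=['Not Habitable', 'Habitable'], normalize=True)`.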

Learning curves are helpful for visualizing the training scores vs the test scores, and how they vary as a function of data set size. They allow us to determine whether we have enough training data, AND whether we have a high bias or high variance problem.

The source code below is a slight modification of [this code](https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html).

In [None]:

from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=5,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5), scoring = 'accuracy'):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use scikit-learn's default 5-fold cross-validation,
          - integer, to specify the number of folds.
          - :term:`CV splitter`,
          - An iterable yielding (train, test) splits as arrays of indices.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : int or None, optional (default=-1)
        Number of jobs to run in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

    train_sizes : array-like, shape (n_ticks,), dtype float or int
        Relative or absolute numbers of training examples that will be used to
        generate the learning curve. If the dtype is float, it is regarded as a
        fraction of the maximum size of the training set (that is determined
        by the selected validation method), i.e. it has to be within (0, 1].
        Otherwise it is interpreted as absolute sizes of the training sets.
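        Note that for classification the number of samples usually has to
        be big enough to contain at least one sample from each class.
        (default: np.linspace(0.1, 1.0, 5))

    scoring : string, optional (default='accuracy')
        Name of the scikit-learn scoring metric passed to ``learning_curve``
        and used as the y-axis label.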
    """
    plt.figure(figsize=(10,6))
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel(str(scoring))
    
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, scoring = scoring)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Test score from cross-validation")

    plt.legend(loc="best")
    return plt
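
To show what the function above plots, here is a sketch that calls `learning_curve` directly on synthetic data and prints the gap between the mean training and test scores at each training set size. The data set and the unpruned decision tree are assumptions chosen for illustration, not the model you are expected to use.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; not the exoplanet catalog.
X, y = make_classification(n_samples=400, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)

# Each row of the score arrays corresponds to one training set size,
# each column to one cross-validation fold.
train_sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y, cv=cv,
    train_sizes=np.linspace(0.1, 1.0, 5), scoring='accuracy')

gap = train_scores.mean(axis=1) - test_scores.mean(axis=1)
for n, g in zip(train_sizes, gap):
    print(n, round(g, 3))
```

As a rule of thumb, a training score that stays well above the test score points to high variance, while two curves that converge at a low score point to high bias. With the function defined above, the corresponding plot would come from a call such as `plot_learning_curve(model, 'Learning curve', final_features, targets, cv=cv3)`.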