# Week16 - Group Exercise

**1. Work to improve the model performance for the diabetes decision tree we created in class. You should be able to improve the precision and recall to be above .8 and .7 respectively. You can improve the preprocessing OR alter the model itself.**

**Model Created in Class**

In [1]:
# Load in our modules
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.metrics import classification_report

# Load in data frame
diabetes_df = pd.read_csv('diabetes.csv')
# View brief sample of data
diabetes_df.sample(5)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
482,4,85,58,22,49,27.8,0.306,28,0
266,0,138,0,0,0,36.3,0.933,25,1
112,1,89,76,34,37,31.2,0.192,23,0
101,1,151,60,0,0,26.1,0.179,22,0
278,5,114,74,0,0,24.9,0.744,57,0


In [2]:
# Import train_test_split and StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split our data into outcome vs. predictors
# Predictors
X = diabetes_df.drop('Outcome', axis=1)
# Outcome
y = diabetes_df['Outcome']

# Split into training vs. testing data (using model from class to tune)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42, stratify=y)

# Standardize our data
sc=StandardScaler()
# Fit the scaler and transform the training and test sets (the scaler is re-fit on the test set)
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)
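
The cell above calls `fit_transform` on both splits, so the test set is scaled with its own mean and standard deviation. A common alternative is to fit the scaler on the training rows only and reuse those statistics for the test rows. The sketch below shows that pattern on synthetic data (the `X_demo`, `y_demo`, and `scaler` names are assumptions made for illustration), so it says nothing about how the reported scores would change.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictors and outcome (not the diabetes data)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std from the training rows only
X_te_scaled = scaler.transform(X_te)      # reuse the training mean/std on the test rows
```

With this pattern the test rows never influence the scaling statistics, which keeps the evaluation closer to how the model would see new data.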

In [3]:
# Create the decision tree classifier
tree_model = tree.DecisionTreeClassifier(max_depth = 8, random_state=42)

In [4]:
# Fit the classifier to our training data
tree_model = tree_model.fit(X_train, y_train)
# Find my predicted values from the model
y_pred = tree_model.predict(X_test)
# Print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.83      0.81       150
           1       0.66      0.59      0.62        81

    accuracy                           0.75       231
   macro avg       0.72      0.71      0.72       231
weighted avg       0.74      0.75      0.75       231
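
To make the report concrete, the sketch below recomputes the class 1 row and the accuracy from confusion-matrix counts. The counts are not printed in the notebook; they are back-solved from the rounded report and the supports (81 positives, 150 negatives), so treat them as an assumption. `precision_recall` is a hypothetical helper.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Assumed counts for class 1, back-solved from the rounded report (not notebook output)
tp, fp, fn, tn = 48, 25, 33, 125

precision, recall = precision_recall(tp, fp, fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.66 recall=0.59 f1=0.62 accuracy=0.75
```

Precision is the share of predicted positives that are correct, while recall is the share of actual positives that the model finds, which is why the two can move in opposite directions when the model is tuned.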



**Model Refinement: RandomOverSampler Technique**

After a lot of trial and error, I decided to oversample the minority class (Outcome = 1) in the training data to try to improve the precision and recall of the decision tree model. I then fit the model to the resampled data.

In [5]:
# Import module to use RandomOverSampler
from imblearn.over_sampling import RandomOverSampler

# Instantiate the oversampler
ros = RandomOverSampler(sampling_strategy='minority', shrinkage=1, random_state=33)

# Resample the training data with the oversampler
X_res, y_res = ros.fit_resample(X_train, y_train)
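
As a rough illustration of what random oversampling does, the pure-Python sketch below duplicates randomly chosen rows from the smaller class until every class matches the largest one. It is a simplified stand-in, not imblearn's implementation: `oversample_minority` is a hypothetical helper, and it leaves out the smoothed-bootstrap perturbation that, as I understand it, the `shrinkage` argument adds in the cell above.

```python
import random
from collections import Counter

def oversample_minority(rows, labels, seed=33):
    """Duplicate random rows of the smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_rows, new_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            new_rows.append(rng.choice(pool))  # sample with replacement
            new_labels.append(cls)
    return new_rows, new_labels

# Toy data: four negatives, two positives
rows = [[1.0], [2.0], [3.0], [4.0], [9.0], [10.0]]
labels = [0, 0, 0, 0, 1, 1]
rows_res, labels_res = oversample_minority(rows, labels)
print(Counter(labels_res))
```

Both classes end with four rows, and every added row is a copy of an existing positive row. Only the training split should be resampled, as in the notebook, so that the test set keeps its original class balance.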

In [6]:
# Instantiate the decision tree classifier
res_model = tree.DecisionTreeClassifier(max_depth=11, random_state=33)

In [7]:
# Fit the decision tree model to our resampled data
new_model = res_model.fit(X_res, y_res)
# Find my predicted values from the new model
y_pred = new_model.predict(X_test)
# Print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.65      0.72       150
           1       0.52      0.72      0.60        81

    accuracy                           0.67       231
   macro avg       0.67      0.68      0.66       231
weighted avg       0.71      0.67      0.68       231



Resampling raised the recall for the positive class from 0.59 to 0.72, an improvement over the original model from class. However, precision dropped from 0.66 to 0.52. Below, I tune the decision tree classifier to try to recover precision.

**Model Refinement: Randomized Search**

I use randomized search to tune the decision tree on the resampled data, in the hope of increasing precision.

In [51]:
# Import to search for best parameters
from sklearn.model_selection import RandomizedSearchCV

# Setup the parameters that we will test on our decision tree
param_dict = {"max_depth": [11],
              "max_features": range(1,4),
              "min_samples_leaf": range(1,15),
              "criterion": ["gini"],
              "splitter": ["best"]}

# Instantiate the classifier
trees = tree.DecisionTreeClassifier()

# Dictionary of scoring measures
scores = {'recall': "recall", "precision":"precision"}

In [52]:
# Run the decision tree model through the search, using the specified parameters and 10-fold cross-validation
random = RandomizedSearchCV(trees, 
                            n_iter = 20,
                            param_distributions = param_dict, 
                            cv=10,
                            scoring = scores,
                            refit="precision",
                            random_state=33)

# Fit the search to our resampled data
random.fit(X_res, y_res)

# Print the tuned parameters and the best mean cross-validated precision (the refit metric)
print("Tuned Decision Tree Parameters: {}".format(random.best_params_))
print("Best score is {}".format(random.best_score_))

Tuned Decision Tree Parameters: {'splitter': 'best', 'min_samples_leaf': 8, 'max_features': 3, 'max_depth': 11, 'criterion': 'gini'}
Best score is 0.727660986071604
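
Because the search scores both recall and precision but refits on precision, `best_score_` reports precision alone. One way to read both metrics back for the selected candidate is through `cv_results_`, sketched below. The sketch uses synthetic data from `make_classification` and fewer iterations and folds than the notebook (all assumptions), so its numbers will not match the output above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the resampled training data (an assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=33)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=33),
    param_distributions={"max_depth": [11],
                         "max_features": list(range(1, 4)),
                         "min_samples_leaf": list(range(1, 15))},
    n_iter=5,
    cv=5,
    scoring={"recall": "recall", "precision": "precision"},
    refit="precision",
    random_state=33,
)
search.fit(X_demo, y_demo)

# With multi-metric scoring, cv_results_ holds one mean_test_<name> array per scorer
best = search.best_index_
mean_precision = search.cv_results_["mean_test_precision"][best]
mean_recall = search.cv_results_["mean_test_recall"][best]
print(f"precision={mean_precision:.3f} recall={mean_recall:.3f}")
```

Checking the recall of the candidate that won on precision makes the trade-off visible before the model is re-evaluated on the test set.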


**Final Model**

Of all the models I tried, the one below gave the best combination of precision and recall, as well as the highest accuracy. After finding it, I went back to the earlier steps and kept tuning parameters to push precision higher, but each time recall dropped noticeably. The model below was the best overall trade-off.

In [61]:
# Build the final model with hand-tuned parameters
# (min_samples_leaf and max_features differ from the search result above)
best_model = tree.DecisionTreeClassifier(splitter='best', min_samples_leaf=13, max_features=4, max_depth = 11, 
                                          criterion='gini', random_state=33)

In [62]:
# Fit the final model to our resampled training data
best_model.fit(X_res, y_res)
# Find the predicted values
y_pred_best = best_model.predict(X_test)
# Print the classification report
print(classification_report(y_test, y_pred_best))

              precision    recall  f1-score   support

           0       0.85      0.83      0.84       150
           1       0.70      0.73      0.72        81

    accuracy                           0.80       231
   macro avg       0.78      0.78      0.78       231
weighted avg       0.80      0.80      0.80       231



**2. Create a function that accepts an array of names and returns a string formatted as a list of names separated by commas EXCEPT for the last two names, which are separated by an ampersand (and sign - &)**

*Example input:*

[ {'name': 'Nichole'}, {'name': 'Tanisha'}, {'name': 'Maggie'} ]


*Example output:*

Nichole, Tanisha & Maggie

In [152]:
# Define my function
def organize_name(array):
    # Create an empty list to collect the names
    name_string = []
    # Iterate through the items in the array
    for name in array:
        # Iterate through the keys and values in the name item
        for key, value in name.items():
            # Append the values (which are the names) to the empty list
            name_string.append(value)
    # Create new object called "organized names"
    # Join all of the items in name_string with ', ' to separate by a comma
    # Split the string where the last comma is
    # Join the strings together, but this time with ampersand
    organized_names = ' & '.join(', '.join(name_string).rsplit(', ', 1))
    print(organized_names)

In [153]:
# Sample input array
name_array = [ {'name': 'Nichole'}, {'name': 'Tanisha'}, {'name': 'Maggie'} ]
name_array

[{'name': 'Nichole'}, {'name': 'Tanisha'}, {'name': 'Maggie'}]

In [155]:
# Call function to test
organize_name(name_array)

Nichole, Tanisha & Maggie


In [156]:
# Try with other array
new_array = [ {'name': 'Nichole'}, {'name': 'Tanisha'}, {'name': 'Maggie'}, {'name': 'Sam'}, 
             {'name': 'Don'}, {'name': 'Joe'}  ]
new_array

[{'name': 'Nichole'},
 {'name': 'Tanisha'},
 {'name': 'Maggie'},
 {'name': 'Sam'},
 {'name': 'Don'},
 {'name': 'Joe'}]

In [157]:
# Test function
organize_name(new_array)

Nichole, Tanisha, Maggie, Sam, Don & Joe
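
The exercise asks for a function that returns the string, whereas `organize_name` prints it. A possible variant that returns the result and also handles empty and single-name input is sketched below. `format_names` is a hypothetical name, and the sketch assumes every dictionary has a `'name'` key.

```python
def format_names(people):
    """Return names joined by ', ' with ' & ' before the last name."""
    names = [person["name"] for person in people]
    if len(names) <= 1:
        # Empty input gives '', a single name is returned unchanged
        return "".join(names)
    return ", ".join(names[:-1]) + " & " + names[-1]

print(format_names([{"name": "Nichole"}, {"name": "Tanisha"}, {"name": "Maggie"}]))
# Nichole, Tanisha & Maggie
```

Building the string from slices rather than splitting on the last `', '` also avoids breaking a name that itself contains a comma.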
