### 1.Perform combined over and undersampling on the diabetes dataset (use SMOTEENN). Explain how combined sampling works.

In [26]:
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.metrics import confusion_matrix, classification_report, plot_confusion_matrix
from imblearn.combine import SMOTEENN
import pydotplus

from IPython.display import Image

diabetes_df = pd.read_csv("diabetes.csv")
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [27]:
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']


# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42, stratify=y)

#Standardize
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)
print('Original dataset shape %s' % Counter(y_train))

Original dataset shape Counter({0: 350, 1: 187})


In [7]:
# instantiate a SMOTEENN object
smote_enn = SMOTEENN(random_state=42)
X_resampled, y_resampled = smote_enn.fit_resample(X_train, y_train)

print('Resampled dataset shape %s' % Counter(y_resampled))

Resampled dataset shape Counter({1: 211, 0: 187})


Oversampling adds more samples to the minority class, and undersampling removes samples from the majority class to help balance out a dataset where there is a large difference in the sizes of the classes that might skew a supervised learning algorithm to more often pick the majority class. Combined sampling allows you to do both oversampling and undersampling to balance out the dataset.

### 2.	Comment on the performance of combined sampling vs the other approaches we have used for the diabetes dataset.

In [22]:
# decision tree classifier
model = tree.DecisionTreeClassifier(max_depth = 10,random_state=42)

model = model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print('Baseline output')
print(classification_report(y_test, y_pred))

"""SMOTEEN sample"""
# decision tree classifier
model = tree.DecisionTreeClassifier(max_depth = 10,random_state=42)

combined = model.fit(X_resampled, y_resampled)
#model = model.fit(X_test, y_test)
y_pred = combined.predict(X_test)

print('SMOTEEN output')
#print(classification_report(y_test, y_pred))
from imblearn.metrics import classification_report_imbalanced
print(classification_report_imbalanced(y_test, y_pred))

""" Performance of random oversampled output, everything else the same"""
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_res, y_res= ros.fit_resample(X_train, y_train)

model = model.fit(X_res, y_res)
#model = model.fit(X_test, y_test)
y_pred = model.predict(X_test)

print('Random Oversampler output')
#print(classification_report(y_test, y_pred))
from imblearn.metrics import classification_report_imbalanced
print(classification_report_imbalanced(y_test, y_pred))

"""Performance of an oversampled output with change in depth"""
X = diabetes_df[['Glucose','BMI','Age']]
y = diabetes_df['Outcome']

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42, stratify=y)

#Standardize
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)

# decision tree classifier
model = tree.DecisionTreeClassifier(max_depth = 5, random_state=42)

from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_res, y_res= ros.fit_resample(X_train, y_train)

model = model.fit(X_res, y_res)
#model = model.fit(X_test, y_test)
y_pred = model.predict(X_test)

print('Optimized Random Oversampler output')
#print(classification_report(y_test, y_pred))
from imblearn.metrics import classification_report_imbalanced
print(classification_report_imbalanced(y_test, y_pred))



Baseline output
              precision    recall  f1-score   support

           0       0.77      0.77      0.77       150
           1       0.58      0.58      0.58        81

    accuracy                           0.71       231
   macro avg       0.68      0.68      0.68       231
weighted avg       0.71      0.71      0.71       231

SMOTEEN output
                   pre       rec       spe        f1       geo       iba       sup

          0       0.82      0.74      0.70      0.78      0.72      0.52       150
          1       0.59      0.70      0.74      0.64      0.72      0.52        81

avg / total       0.74      0.73      0.72      0.73      0.72      0.52       231

Random Oversampler output
                   pre       rec       spe        f1       geo       iba       sup

          0       0.83      0.74      0.73      0.78      0.73      0.54       150
          1       0.60      0.73      0.74      0.66      0.73      0.54        81

avg / total       0.75      0.

In [24]:
X = diabetes_df[['Glucose','BMI','Age']]
y = diabetes_df['Outcome']

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42, stratify=y)

#Standardize
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)


"""Performance of combined sample with optimization"""
# instantiate a SMOTEENN object
smote_enn = SMOTEENN(random_state=42)
X_resampled, y_resampled = smote_enn.fit_resample(X_train, y_train)

# decision tree classifier
model = tree.DecisionTreeClassifier(max_depth = 6,random_state=42)

combined = model.fit(X_resampled, y_resampled)
#model = model.fit(X_test, y_test)
y_pred = combined.predict(X_test)

print('SMOTEEN output optimized')
#print(classification_report(y_test, y_pred))
from imblearn.metrics import classification_report_imbalanced
print(classification_report_imbalanced(y_test, y_pred))

SMOTEEN output optimized
                   pre       rec       spe        f1       geo       iba       sup

          0       0.83      0.73      0.73      0.78      0.73      0.53       150
          1       0.59      0.73      0.73      0.65      0.73      0.53        81

avg / total       0.75      0.73      0.73      0.73      0.73      0.53       231



The SMOTEEN output didn't do as well as the optimized random oversampler output for precision and recall. 

### 3.	What is outlier detection? Why is it useful? What methods can you use for outlier detection?

Outlier detection identifies samples in a dataset that are very far from other samples and then ignores those samples. It is useful because outliers can skew your data.
You can use sklearn's covariance EllipticEnvelope method which assumes a Gaussian distribution of the data and fit the data to an ellipse, anything outside the ellipse is considered an outlier.
You can also use ensemble IsolationForest which creates a tree with a minimum number of splits and looks for terminating nodes that stop splitting too soon.
You can also use the neighbors LocalOutlierFactor which uses k-nearest neighbors to look at the density of the number of neighbors around a sample, if the density of the sample is much smaller than the average density of other samples it's considered an outlier.

### 4.	Perform a linear SVM to predict credit approval (last column) using this dataset: https://archive.ics.uci.edu/ml/datasets/Statlog+%28Australian+Credit+Approval%29 . Make sure you look at the accompanying document that describes the data in the dat file. You will need to either convert this data to another file type or import the dat file to python. 


In [30]:
import csv

with open("australian.dat") as infile, open("outfile.csv", "w") as outfile:
    csv_writer = csv.writer(outfile, delimiter=',')
    
    csv_writer.writerow(['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A11', 'A12','A13', 'A14', 'A15'])
    for line in infile:
        row = [field.strip() for field in line.split(' ')]
        csv_writer.writerow(row)
    

In [31]:
australia_df = pd.read_csv('outfile.csv')
australia_df.head()
# desired outcome is column A15 where + =1 and - = 2

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15
0,1,22.08,11.46,2,4,4,1.585,0,0,0,1,2,100,1213,0
1,0,22.67,7.0,2,8,4,0.165,0,0,0,0,2,160,1,0
2,0,29.58,1.75,1,4,4,1.25,0,0,0,1,2,280,1,0
3,0,21.67,11.5,1,5,3,0.0,1,1,11,1,2,0,1,1
4,1,20.17,8.17,2,6,4,1.96,1,1,14,0,2,60,159,1


In [32]:
australia_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A1      690 non-null    int64  
 1   A2      690 non-null    float64
 2   A3      690 non-null    float64
 3   A4      690 non-null    int64  
 4   A5      690 non-null    int64  
 5   A6      690 non-null    int64  
 6   A7      690 non-null    float64
 7   A8      690 non-null    int64  
 8   A9      690 non-null    int64  
 9   A10     690 non-null    int64  
 10  A11     690 non-null    int64  
 11  A12     690 non-null    int64  
 12  A13     690 non-null    int64  
 13  A14     690 non-null    int64  
 14  A15     690 non-null    int64  
dtypes: float64(3), int64(12)
memory usage: 81.0 KB


In [33]:
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear')

X = australia_df.drop('A15', axis=1)
y = np.array(australia_df['A15']).reshape(-1)
print(X.shape)
print(y.shape)

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42, stratify=y)

#Standardize
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)

classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

#classifier.score(y_test, y_pred)
print(classification_report(y_test, y_pred))


(690, 14)
(690,)
              precision    recall  f1-score   support

           0       0.93      0.75      0.83       115
           1       0.75      0.93      0.83        92

    accuracy                           0.83       207
   macro avg       0.84      0.84      0.83       207
weighted avg       0.85      0.83      0.83       207



In [34]:
from sklearn.svm import LinearSVC
l_SVC = LinearSVC()

X = australia_df.drop('A15', axis=1)
y = np.array(australia_df['A15']).reshape(-1)
print(X.shape)
print(y.shape)

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42, stratify=y)

#Standardize
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)

l_SVC.fit(X_train, y_train)
y_pred = l_SVC.predict(X_test)

#l_SVC.score(y_test, y_pred)
print(classification_report(y_test, y_pred))

(690, 14)
(690,)
              precision    recall  f1-score   support

           0       0.95      0.77      0.85       115
           1       0.76      0.95      0.84        92

    accuracy                           0.85       207
   macro avg       0.85      0.86      0.85       207
weighted avg       0.86      0.85      0.85       207



### 6.	What kinds of jobs in data are you most interested in? Do some research on what is out there. Write about your thoughts in under 400 words. 

Ultimately I’m interested in a data scientist position but based on some of the job listings I’ve seen, it seems like I’m more likely to get my foot in the door as a data analyst and be able to work my way up to data scientist once I have more experience. I think I need to build up my statistics skills so that I’m better able to choose the right tools and be able to explain why it’s the right tool to use when it comes to machine learning. Perhaps if a company was willing to do some investing in me I could get a junior data scientist position. I am definitely interested in machine learning, creating visualizations and communicating what we've learned. 
I found a couple data scientist and data analyst positions with a few healthcare companies that could be a good fit building from my healthcare background, there was even one that didn't have experience as a requirement just as a preference. 
I think I would be a good fit for a consulting job based on my work experience as a contract therapist and ability to integrate into new teams well.