1. Perform combined over and undersampling on the diabetes dataset (use SMOTEENN). Explain how combined sampling works.

In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.pipeline import Pipeline
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier  # needed for the model defined below
from imblearn.combine import SMOTEENN
from imblearn.under_sampling import EditedNearestNeighbours

diabetes_df = pd.read_csv("diabetes.csv")
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [10]:
X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42, stratify=y)

In [15]:
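#Define model
diabetesmdl = DecisionTreeClassifier()

# Baseline: fit the plain decision tree (no resampling) on the training split
model = diabetesmdl.fit(X_train, y_train)
y_pred = model.predict(X_test)

#Define SMOTEENN (ENN cleans both classes after SMOTE oversampling)
resample = SMOTEENN(enn=EditedNearestNeighbours(sampling_strategy='all'))

#Define pipeline (cross_validate clones and refits it on every fold,
#and the sampler is applied only when fitting, not when predicting)
pipeline = Pipeline(steps=[('r', resample), ('m', model)])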

#Define evaluation procedure (Repeated Stratified K-Fold CV)
cv=RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)

#Evaluate model
scoring=['accuracy','precision_macro','recall_macro']
scores = cross_validate(pipeline, X, y, scoring=scoring, cv=cv, n_jobs=-1)

# summarize performance
print('Mean Accuracy: %.4f' % np.mean(scores['test_accuracy']))
print('Mean Precision: %.4f' % np.mean(scores['test_precision_macro']))
print('Mean Recall: %.4f' % np.mean(scores['test_recall_macro']))

Mean Accuracy: 0.7209
Mean Precision: 0.7127
Mean Recall: 0.7289


In [16]:
from sklearn.metrics import classification_report

# y_pred comes from the baseline decision tree fitted on X_train (no SMOTEENN)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.76      0.87      0.81       100
           1       0.67      0.48      0.56        54

    accuracy                           0.73       154
   macro avg       0.71      0.68      0.68       154
weighted avg       0.73      0.73      0.72       154
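
Combined sampling chains two steps. SMOTE first oversamples the minority class by creating synthetic points along the line between a minority point and one of its nearest minority neighbours. Edited Nearest Neighbours (ENN) then undersamples by removing points whose label disagrees with their nearest neighbours, which cleans up noisy or overlapping regions in both classes. The sketch below illustrates the idea with NumPy only. It is a simplified stand-in, not the imblearn implementation: the toy data, the function names `smote` and `edited_nearest_neighbours`, and the majority-vote rule for ENN are all assumptions made for illustration.

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=None):
    """Create n_new synthetic minority points by interpolating between a
    minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))             # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

def edited_nearest_neighbours(X, y, k=3):
    """Keep only points whose 0/1 label matches the majority label of
    their k nearest neighbours (applied to both classes)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    majority = (y[neighbours].mean(axis=1) > 0.5).astype(int)
    keep = majority == y
    return X[keep], y[keep]

# Hypothetical imbalanced toy data: 90 majority points, 10 minority points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(90, 2)),
               rng.normal(2.0, 1.0, size=(10, 2))])
y = np.array([0] * 90 + [1] * 10)

# Step 1: oversample the minority class up to the majority count
X_new = smote(X[y == 1], n_new=80, seed=1)
X_over = np.vstack([X, X_new])
y_over = np.concatenate([y, np.ones(80, dtype=int)])

# Step 2: remove points that disagree with their neighbourhood
X_res, y_res = edited_nearest_neighbours(X_over, y_over)

print("original :", np.bincount(y))
print("after SMOTE:", np.bincount(y_over))
print("after ENN  :", np.bincount(y_res))
```

Because ENN removes points after SMOTE has balanced the classes, the final class counts are usually close to balanced but not exactly equal.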



2. Comment on the performance of combined sampling vs the other approaches we have used for the diabetes dataset.

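Scores from using SMOTEENN are comparable to scores we obtained using Logistic Regression/Random Oversampler.  Compared with the Decision Tree Classifier alone, SMOTEENN gives a higher macro recall (0.73 vs. 0.68), while accuracy (0.72 vs. 0.73) and macro precision (0.71 in both cases) are essentially unchanged.  Note that the SMOTEENN figures are cross-validated means, whereas the decision tree figures come from a single test split.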

3. What is outlier detection? Why is it useful? What methods can you use for outlier detection?

Outlier detection means finding anomalous data points, that is, points whose values are extremely high or extremely low relative to the majority of the dataset.  Extreme outliers can skew the data and make prediction models less reliable, so identifying and removing them can make your data modeling more effective.

Methods for detecting outliers include visualizing the data, for example with boxplots.  You can also use statistical rules based on the Interquartile Range (IQR).
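
As a rough illustration of the IQR approach, the sketch below flags values that fall more than 1.5 × IQR outside the first or third quartile. The 1.5 multiplier is the common boxplot convention rather than anything specific to this assignment, and the `values` array is made-up example data.

```python
import numpy as np

# Hypothetical measurements with two obvious extremes (95 and -40)
values = np.array([10, 12, 11, 13, 12, 11, 95, 12, 10, -40])

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (boxplot convention)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print("fences:", lower, upper)
print("outliers:", outliers)
```

These are the same points a boxplot would draw beyond its whiskers.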

4. Perform a linear SVM to predict credit approval (last column) using this dataset: https://archive.ics.uci.edu/ml/datasets/Statlog+%28Australian+Credit+Approval%29 . Make sure you look at the accompanying document that describes the data in the dat file. You will need to either convert this data to another file type or import the dat file to python. 

In [45]:
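# Note: the column labels in the output below (0, 2267, 7, ...) are values from a
# data row, so the file appears to have no header line. With these settings,
# skip_header=1 drops the first record, names=True turns the next record into
# column names, and skip_footer=1 drops the last record.
credit_data = np.genfromtxt('australian.dat', skip_header=1, skip_footer=1,
                           names=True, dtype=None, delimiter=" ")
credit_df = pd.DataFrame(credit_data)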

credit_df.head()

Unnamed: 0,0,2267,7,2,8,4,0165,0_1,0_2,0_3,0_4,2_1,160,1,0_5
0,0,29.58,1.75,1,4,4,1.25,0,0,0,1,2,280,1,0
1,0,21.67,11.5,1,5,3,0.0,1,1,11,1,2,0,1,1
2,1,20.17,8.17,2,6,4,1.96,1,1,14,0,2,60,159,1
3,0,15.83,0.585,2,8,8,1.5,1,1,2,0,2,100,1,1
4,1,17.42,6.5,2,3,4,0.125,0,0,0,0,2,60,101,0
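
The column labels in the output above look like values from a data row, which suggests the .dat file has no header line. One way to avoid losing records is to read the file with `header=None` and supply column names explicitly. The sketch below does this on three records copied from the output above, held in memory so it runs without the file. The names `A1` to `A14` and `approved` are placeholders chosen for illustration, and `sample_df` is a hypothetical variable name.

```python
import io
import pandas as pd

# Three records copied from the head() output above, in the space-separated
# layout that the loading code assumes for australian.dat
sample = io.StringIO(
    "0 29.58 1.75 1 4 4 1.25 0 0 0 1 2 280 1 0\n"
    "0 21.67 11.5 1 5 3 0.0 1 1 11 1 2 0 1 1\n"
    "1 20.17 8.17 2 6 4 1.96 1 1 14 0 2 60 159 1\n"
)

# Placeholder names: 14 attributes plus the class label in the last column
columns = [f"A{i}" for i in range(1, 15)] + ["approved"]

# header=None tells pandas that the first line is data, not column names
sample_df = pd.read_csv(sample, sep=" ", header=None, names=columns)
print(sample_df)
```

With the real file, the same call would take the file path in place of `sample`, assuming the file is space-delimited throughout.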


In [50]:
X = credit_df.drop(credit_df.columns[-1], axis=1)
y = credit_df[credit_df.columns[-1]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42, stratify=y)

In [53]:
from sklearn.svm import LinearSVC

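svm = LinearSVC()

# First fit: these predictions feed the classification report below
model = svm.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Note: this refits the same estimator, so the score printed here
# need not match the accuracy of y_pred from the first fit
svm.fit(X_train, y_train)
svm.score(X_test, y_test)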



0.8333333333333334

5. How did the SVM model perform? Use a classification report.

In [54]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.58      1.00      0.73        77
           1       1.00      0.08      0.15        61

    accuracy                           0.59       138
   macro avg       0.79      0.54      0.44       138
weighted avg       0.77      0.59      0.48       138



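The SVM model did well on precision for class 1 (1.00), but its recall needs improvement: it correctly identifies only about 8% of the true positives for class 1, and macro-averaged recall is 0.54.  In effect, the model labels almost every test case as class 0, which gives an accuracy of 0.59 for these predictions (the 0.83 score printed earlier came from a separate fit of the model).  One might try experimenting with decreasing the classification threshold in order to increase the recall (keeping in mind that this may decrease precision).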
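
A minimal sketch of the threshold idea, under stated assumptions: `LinearSVC` has no probability output, so the threshold is applied to `decision_function` scores, where the default prediction corresponds to a cutoff of 0. The synthetic data from `make_classification`, the `StandardScaler` step, and the cutoff of -0.5 are illustrative choices, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the credit data (14 features, mildly imbalanced)
X_demo, y_demo = make_classification(n_samples=600, n_features=14,
                                     weights=[0.55, 0.45], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Scaling the features usually helps a linear SVM converge
clf = make_pipeline(StandardScaler(),
                    LinearSVC(random_state=0, max_iter=10000))
clf.fit(X_tr, y_tr)

# Signed distance to the separating hyperplane; predict() uses a cutoff of 0
scores = clf.decision_function(X_te)

recalls = {}
for threshold in (0.0, -0.5):
    pred = (scores > threshold).astype(int)
    recalls[threshold] = recall_score(y_te, pred)
    precision = precision_score(y_te, pred, zero_division=0)
    print(f"threshold={threshold:+.1f}  "
          f"precision={precision:.2f}  recall={recalls[threshold]:.2f}")
```

Lowering the cutoff can only add predicted positives, so recall for class 1 cannot go down; precision typically falls as more borderline cases are labelled positive.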

6. What kinds of jobs in data are you most interested in? Do some research on what is out there. Write about your thoughts in under 400 words. 

With my background and current work in marketing, I think I would enjoy work that uses data science techniques to make marketing efforts more efficient, perhaps as a Marketing Analyst:

https://www.discoverdatascience.org/career-information/

I also like the description of Data Mining Specialist.  It sounds like that type of role involves looking at data from a number of different angles (modeling, research, etc.), and I like work that involves variety, since it keeps me from getting bored!