# Support Vector Machines Homework (Week 16)

### 1.	Perform combined over and undersampling on the diabetes dataset (use SMOTEENN). Explain how combined sampling works.

In [1]:
import pandas as pd
import numpy as np

from sklearn import tree
from sklearn.metrics import confusion_matrix, classification_report, plot_confusion_matrix
import pydotplus

from IPython.display import Image

from imblearn.pipeline import Pipeline
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.model_selection import RepeatedStratifiedKFold
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

diabetes_df = pd.read_csv("diabetes.csv")
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [2]:
#Check for null values (none)
print(diabetes_df.info())
print( )
print(diabetes_df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
None

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                     

In [3]:
diabetes_df['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = diabetes_df.drop('Outcome',axis = 1)
y = diabetes_df['Outcome']

#Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)

#Standardize
sc = StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)

In [5]:
#Decision tree classifier
#model = tree.DecisionTreeClassifier(max_depth=10, random_state=42)

In [6]:
sme = SMOTEENN(random_state=42)
X_resampled, y_resampled = sme.fit_resample(X_train, y_train)

In [7]:
#Define and train modelusing the resampled data
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=42)
model.fit(X_resampled, y_resampled)

LogisticRegression(random_state=42)

In [8]:
#Define SMOTE-ENN
sme = SMOTEENN()

#Create pipeline
pipeline = Pipeline(steps=[('r', sme), ('m', model)])

#Define evaluation procedure (repeated stratified K-fold cross-validation)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)

#Score model
scoring = ['accuracy','precision_macro','recall_macro']
scores = cross_validate(pipeline, X_resampled, y_resampled, scoring=scoring, cv=cv, n_jobs=-1)

#Summarize performance
print('Mean Accuracy: %.4f' % np.mean(scores['test_accuracy']))
print('Mean Precision: %.4f' % np.mean(scores['test_precision_macro']))
print('Mean Recall: %.4f' % np.mean(scores['test_recall_macro']))

Mean Accuracy: 0.9069
Mean Precision: 0.9098
Mean Recall: 0.9066


Combined sampling uses both oversampling and undersampling techniques together. It's used in combination with a machine learning algorithm. Combined sampling oversamples the data's minority class and undersamples the majority class. This prevents the machine learnng method from overemphasizing the majortiy class's data and ignoring the minority class's data.

### 2.	Comment on the performance of combined sampling vs the other approaches we have used for the diabetes dataset.

Combined sampling of the logistic regression model with SMOTE-ENN worked much better than most of the other methods we've used on the diabetes data set. Accuracy, precision, and recall were all slightly above 0.90. On their own, multiple linear regression and logistic regression had accuracy scores of about 0.60, and logistic regression with random oversampling had accuracy, precision, recall, and f1 scores all hovering around 0.75. Logistic regression plus cluster centroids undersampling had basically the same scores (all around 0.75) as logistic regression with random oversampling.

Logistic regression with SMOTE-ENN combined sampling also performed better than the tree-based models. Decision tree classifier had precision, recall, and f1 scores of 0.69, 0.70, and 0.70, respectively. Random forest classifier had accuracy, recall, and f1 scores of 0.76 and precision of 0.77. 

### 3.	What is outlier detection? Why is it useful? What methods can you use for outlier detection?

Outlier detection is identifying the presence of outliers or anomalies that deviate significantly from the majority of the data and thus may need to be excluded from the training or test data when creating a model. Outliers may be caused by errors in data collection, experimental procedure, etc., so they may not necessarily be "real" values. 

If included in the training data, outliers may corrupt the performance of a prediction or classification model. They can drastically skew it so that it is not trained correctly and does not make very accurate predictions for new data.

Some methods for outlier detection include K-nearest neighbor with local outlier factor, feature bagging, isolation forest, one-class SVM,  minimum covariance determinant, and DBSCAN clustering. Standard deviation, boxplots, and/or histograms can also be used but are less automated than machine learning algorithms.

### 4.	Perform a linear SVM to predict credit approval (last column) using this dataset: https://archive.ics.uci.edu/ml/datasets/Statlog+%28Australian+Credit+Approval%29 . Make sure you look at the accompanying document that describes the data in the dat file. You will need to either convert this data to another file type or import the dat file to python. 

#### You can use this code, but otherwise you follow standard practices we have already used many times: 

    from sklearn.svm import SVC
    classifier = SVC(kernel='linear')

In [9]:
aussie_df = pd.read_csv("australian.dat", sep=' ', header=None)
aussie_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,1,22.08,11.46,2,4,4,1.585,0,0,0,1,2,100,1213,0
1,0,22.67,7.0,2,8,4,0.165,0,0,0,0,2,160,1,0
2,0,29.58,1.75,1,4,4,1.25,0,0,0,1,2,280,1,0
3,0,21.67,11.5,1,5,3,0.0,1,1,11,1,2,0,1,1
4,1,20.17,8.17,2,6,4,1.96,1,1,14,0,2,60,159,1


In [10]:
from sklearn.svm import SVC
classifier = SVC(kernel='linear')

In [11]:
X = aussie_df.drop(aussie_df.columns[14],axis=1)
y = aussie_df[aussie_df.columns[14]]

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)

#Standardize
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.fit_transform(X_test)

classifier = SVC(kernel='linear')
classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)

### 5.	How did the SVM model perform? Use a classification report.

In [12]:
#Classification Report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.70      0.79        77
           1       0.71      0.92      0.80        61

    accuracy                           0.80       138
   macro avg       0.81      0.81      0.80       138
weighted avg       0.82      0.80      0.80       138



### 6.	What kinds of jobs in data are you most interested in? Do some research on what is out there. Write about your thoughts in under 400 words.

I am most interested in using data science in the non-profit sector to try to make the world a better place. I'm very interested in human rights and the environment, in particular.

Researching jobs online, I found several that appealed to me. An organization called Love Justice International has a program to combat human trafficking, and they currently are advertising a data science position. That position requires a Masters degree, though. I also looked at data analyst positions because there seem to be more entry-level positions of this type. The Mayo Clinic and several other healthcare organizations are looking for data scientists and/or data analysts to work on predicting clinical outcomes for patients based on various factors, including treatment plans. The Center for Employment Opportunities is an organization that provides career resources to formerly incarcerated people, and they are in need of a data and contracts analyst. The Voter Participation Center focuses on registering and mobilizing underrepresented groups, who together make up the majority of potential voters but are currently underrepresented in politics, to vote. The CDC is also looking for a data analyst to look at participation rates in the Covid vaccination program. An organization called MITRE is looking for a climate health data scientist.

I also looked at the Rome Group, which lists non-profit job postings in the St. Louis area, but I did not find anything related to data science.

DataKind seems to be a great resource for data science work in the non-profit sector. It appears that they may be looking primarily for volunteers right now, but this could still be a good opportunity to build experience and participate in projects that would build out my resume and qualify me for more and better data science jobs. DataKind has had recent projects related to identifying food bank dependency, finding paths out of homelessness, empowering women, revitalizing communities through environmental cleanup, and uncovering corruption. One project that looked particularly interesting was "Using Time Series Forecasting to Improve Access to Safe Sanitation." This is also a major focus of the World Bank (I do copyediting work for them), but I did not find any current data job postings on their website.