In [1]:
# Authors: Fredi Weideli

### Purpose

This is the partial solution for Project 1, covering the work up to the end of Milestone 2: novelty detection using OneClassSVM.

<a id="top"></a> <br>
## Table of Contents
1. [Introduction](#1)

1. [Initialization](#2)
    1. [Load packages](#21)
    1. [Define Metadata](#22)
    
1. [Load Data](#3)

1. [Data Insights](#4)
    1. [Data Structure](#41)
    1. [Summary Stats](#42)
    1. [Unique Value Checking](#43)
    1. [Identifying 'Bad Columns'](#44)

1. [Data Cleansing](#5)
    1. [Data Reduction](#51)
        1. [Dropping Bad Columns](#511)
        1. [Null Value Removal](#512)
        1. [Data Encoding](#513)
    1. [Export csv file for later use](#52)

1. [Modelling Workflow](#6)
    1. [Data Prep](#61)
        1. [Feature Target Split](#611)
        1. [Train-Test Split](#612)
        1. [Normalizing Numerical Variables](#613)
    1. [Estimate of Baseline Accuracy - Class Distributions](#62)
    1. [Semisupervised & Unsupervised Techniques for Novelty & Outlier Detection](#63)
        1. [OneClassSVMs for Novelty Detection](#631)
        1. [Robust Covariance for Outlier Detection](#632)
        1. [Isolation Forest for Novelty Detection](#633)
        1. [Local Outlier Factor for Novelty Detection](#634)


# <a id='1'>Introduction</a>  

As described on the project page, the dataset contains thyroid disease data and is imbalanced. Before running this notebook, please make sure that you have gone through the first project in the LiveSeries and the starter template.

## <a id='2'>Initialization</a>  


### <a id='21'>Load Packages</a>  

Load the minimum set of packages needed to get started; more are added as we go along.

In [2]:
import pandas as pd 
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import scipy
from scipy.io.arff import loadarff
import scipy.io as sio

from collections import Counter
from sklearn.preprocessing import MinMaxScaler

from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

from sklearn.metrics import accuracy_score, classification_report,confusion_matrix 
from sklearn.metrics import roc_auc_score, f1_score, precision_score, recall_score, average_precision_score

### <a id='22'>Define Metadata</a>  

In [3]:
# Define the name of the target class column here instead of manually typing it out everywhere
target_class_name = 6

# Fill in the names of what you want to call the 0 and 1 class
labels = ['inliers', 'outliers']

# The following file is downloaded from http://odds.cs.stonybrook.edu/thyroid-disease-dataset/ and kept in the data/raw folder
input_file_name = 'thyroid.mat'

# Use a relative path: the data subfolder is two directories up from the current folder
raw_data_folder = '../../data/raw/'
processed_data_folder = '../../data/processed/'

# Any exported artifacts will have this date
export_date = '20220129'

## <a id='3'>Load Data</a>  

In this part we load the data and perform the necessary preprocessing.

In [4]:
from os.path import dirname, join as pjoin

mat_fname = pjoin(raw_data_folder, input_file_name)
data = sio.loadmat(mat_fname)

data

{'__header__': b'MATLAB 5.0 MAT-file, written by Octave 3.8.0, 2014-12-05 13:11:25 UTC',
 '__version__': '1.0',
 '__globals__': [],
 'X': array([[7.74193548e-01, 1.13207547e-03, 1.37571157e-01, 2.75700935e-01,
         2.95774648e-01, 2.36065574e-01],
        [2.47311828e-01, 4.71698113e-04, 2.79886148e-01, 3.29439252e-01,
         5.35211268e-01, 1.73770492e-01],
        [4.94623656e-01, 3.58490566e-03, 2.22960152e-01, 2.33644860e-01,
         5.25821596e-01, 1.24590164e-01],
        ...,
        [9.35483871e-01, 2.45283019e-02, 1.60341556e-01, 2.82710280e-01,
         3.75586854e-01, 2.00000000e-01],
        [6.77419355e-01, 1.47169811e-03, 1.90702087e-01, 2.42990654e-01,
         3.23943662e-01, 1.95081967e-01],
        [4.83870968e-01, 3.56603774e-03, 1.90702087e-01, 2.12616822e-01,
         3.38028169e-01, 1.63934426e-01]]),
 'y': array([[0.],
        [0.],
        [0.],
        ...,
        [0.],
        [0.],
        [0.]])}

In [5]:
# STORE THE FEATURES AND TARGET OBJECTS IN THEIR OWN VARIABLES FOR EASY RETRIEVAL

X = data['X']
y = data['y']

X.shape, y.shape

((3772, 6), (3772, 1))
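
The target comes out of the `.mat` file as a column vector of shape `(n, 1)`, whereas most scikit-learn metrics expect a 1-D label array. A minimal sketch of flattening it, using a small stand-in array rather than the real data (the names `y_demo` and `y_flat` are hypothetical):

```python
import numpy as np

# Stand-in for data['y']: a (5, 1) column vector of 0/1 labels
y_demo = np.array([[0.], [0.], [1.], [0.], [1.]])

# ravel() flattens to 1-D; astype(int) turns the float labels into integers
y_flat = y_demo.ravel().astype(int)

print(y_demo.shape, y_flat.shape)
print(np.bincount(y_flat))  # number of rows in class 0 and class 1
```

The same two calls could be applied to `y` later in the modelling steps if a 1-D target is needed.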

In [6]:
#### CONCATENATE THE X AND y OBJECTS TO CREATE THE DATAFRAME
df = pd.concat([pd.DataFrame(data=X), pd.DataFrame(data=y)], axis=1, ignore_index=True)

df.sample(5)

Unnamed: 0,0,1,2,3,4,5,6
325,0.548387,0.003208,0.1926,0.235981,0.394366,0.160656,0.0
354,0.55914,0.000755,0.190702,0.264019,0.333333,0.206557,0.0
3563,0.526882,0.01434,0.156546,0.294393,0.41784,0.190164,0.0
3155,0.784946,0.001151,0.080645,0.264019,0.262911,0.247541,0.0
3648,0.634409,0.002642,0.190702,0.273364,0.43662,0.170492,0.0


Let's check the head and tail to make sure nothing odd is going on in the header or the last rows.

In [7]:
df.head(3)

Unnamed: 0,0,1,2,3,4,5,6
0,0.774194,0.001132,0.137571,0.275701,0.295775,0.236066,0.0
1,0.247312,0.000472,0.279886,0.329439,0.535211,0.17377,0.0
2,0.494624,0.003585,0.22296,0.233645,0.525822,0.12459,0.0


In [8]:
df.tail(3)

Unnamed: 0,0,1,2,3,4,5,6
3769,0.935484,0.024528,0.160342,0.28271,0.375587,0.2,0.0
3770,0.677419,0.001472,0.190702,0.242991,0.323944,0.195082,0.0
3771,0.483871,0.003566,0.190702,0.212617,0.338028,0.163934,0.0


No trouble loading the data. Both the head and the tail are clean.

## <a id='4'>Data Insights</a>

### <a id='41'>Data Structure</a> 

In [9]:
# Let's see the data structure
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3772 entries, 0 to 3771
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       3772 non-null   float64
 1   1       3772 non-null   float64
 2   2       3772 non-null   float64
 3   3       3772 non-null   float64
 4   4       3772 non-null   float64
 5   5       3772 non-null   float64
 6   6       3772 non-null   float64
dtypes: float64(7)
memory usage: 206.4 KB


None of the columns have null values at first glance, but we will run a more thorough diagnostic later
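
As a rough illustration of what a per-column null diagnostic can look like, the sketch below counts missing values on a toy frame with one `NaN` injected. The frame `df_demo` is a hypothetical stand-in, not the thyroid data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df, with one missing value injected in column 0
df_demo = pd.DataFrame({0: [0.1, 0.2, np.nan], 1: [0.5, 0.6, 0.7]})

# Number of missing values per column
null_counts = df_demo.isna().sum()

# Labels of the columns that contain at least one missing value
cols_with_nulls = null_counts[null_counts > 0].index.tolist()

print(null_counts)
print(cols_with_nulls)
```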

### <a id='42'>Summary Stats</a> 

Check each column's summary statistics.
Note that `describe()` only covers the numerical columns by default.
In general you would also exclude discrete identifier-like columns (for example a 'customer_id' column), whose summary stats are meaningless; this dataset has none.

In [10]:
df.describe()
# Looks like all the numbers are between 0 and 1

Unnamed: 0,0,1,2,3,4,5,6
count,3772.0,3772.0,3772.0,3772.0,3772.0,3772.0,3772.0
mean,0.543121,0.008983,0.186826,0.248332,0.376941,0.177301,0.024655
std,0.20379,0.043978,0.070405,0.080579,0.087382,0.054907,0.155093
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.376344,0.001132,0.156546,0.203271,0.328638,0.14918,0.0
50%,0.569892,0.003019,0.190702,0.241822,0.375587,0.17377,0.0
75%,0.709677,0.004528,0.213472,0.28271,0.413146,0.196721,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0
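
The mean of column 6 already hints at the class imbalance: for a 0/1 column, the mean equals the fraction of rows labelled 1. A small sketch of that arithmetic, using the row count and the (rounded) mean printed above; the variable names are hypothetical:

```python
import pandas as pd

n_rows = 3772           # row count reported by df.info()
target_mean = 0.024655  # mean of column 6 from df.describe(), rounded to 6 decimals

# For a 0/1 column: mean = (number of ones) / (number of rows)
n_outliers = round(n_rows * target_mean)
n_inliers = n_rows - n_outliers
print(n_outliers, n_inliers)

# The same relationship on a toy series: one 1 out of four rows gives a mean of 0.25
toy_target = pd.Series([0.0, 0.0, 0.0, 1.0])
print(toy_target.mean(), toy_target.value_counts(normalize=True).to_dict())
```

On the real dataframe, `df[target_class_name].value_counts()` would give the exact counts directly.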


### <a id='43'>Unique Value Checking</a> 

In [11]:
#### ITERATE IN A FOR LOOP TO SEE THE UNIQUE VALUES IN EACH COLUMN
for column in df:
    print(str(column) + ' ' + str(df[column].nunique()))

0 93
1 280
2 72
3 243
4 141
5 324
6 2


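Every column has at least 2 unique values, so none of them is strictly constant. Column 6, the target, has exactly 2 (the inlier and outlier labels). A unique-value count alone cannot rule out quasi-constant columns, so that is checked separately in the next section.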
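
To see why the number of unique values says little about quasi-constant columns, the sketch below compares two toy columns that both have 2 unique values. The frame `df_demo` and the name `dominant_share` are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'almost_constant': [1.0] * 99 + [2.0],  # 2 unique values, 99% identical
    'balanced': [0.0, 1.0] * 50,            # 2 unique values, split 50/50
})

# Both columns report the same number of unique values
print(df_demo.nunique())

# Share of rows taken by the most frequent value in each column
# (value_counts sorts in descending order, so the first entry is the largest)
dominant_share = df_demo.apply(lambda s: s.value_counts(normalize=True).iloc[0])
print(dominant_share)
```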

### <a id='44'>Identifying Bad Columns</a> 

In [12]:
# Copy this function from the starter template

def find_bad_columns_function(dataframe):
    '''
    Args: dataframe that may contain columns of concern that need to be fixed or deleted
    
    Logic: Find the columns that have 
    null values
    blank strings (a single space) in object columns
    quasi-constant/constant values: standard deviation below 0.01 for numeric columns,
    or a single value in more than 99% of the rows for non-numeric columns
    
    Returns: 4 lists of column names: those with nulls, those with blanks,
    quasi-constant numeric columns and quasi-constant non-numeric columns
    
    '''
    
    ###### Finding Null Values
    null_col_list = dataframe.columns[dataframe.isna().any()].tolist()
    
    print('Identified {} features with atleast one null'.format(
        len(null_col_list)))

    ###### Finding Blank Spaces in the object column
    # Non-obvious nulls such as blanks: The line items where there are spaces 
    blank_space_col_list = []
    object_columns = dataframe.select_dtypes(include=['object']).columns

    for col in object_columns:
        if sum(dataframe[col]==' '):
            blank_space_col_list.append(col)

    print('Identified {} features with atleast one blank space'.format(
        len(blank_space_col_list)))
    
    ####### Finding Quasi Constant/Constant Value in numerical columns
    # Flag the numeric columns whose standard deviation is below 0.01
    # Note: the threshold is absolute, so it only makes sense for features on a comparable scale (here 0 to 1)
    
    numeric_df = dataframe.select_dtypes(include='number')
    constant_numeric_col_list = [col for col in numeric_df.columns if numeric_df[col].std()<0.01]

    print('Identified {} numeric features that have quasi-constant values'.format(
        len(constant_numeric_col_list)))
    
    # Non-numeric columns get separate logic: the mode-based check below would miss float columns
    # whose values vary only slightly, which is why the std-based check is used for numeric columns
    
    ###### Finding Quasi Constant/Constant non_numeric value
    constant_non_numeric_col_list = []
    
    # Find the columns that are not in numeric_df
    non_numeric_col_set = set(dataframe.columns) - set(numeric_df.columns)   

    for col in non_numeric_col_set:
        categorical_mode_value = (dataframe[col].mode().values)[0]
        fractional_presence = sum(dataframe[col]==categorical_mode_value)/len(dataframe) 
    
        if fractional_presence > 0.99:
            constant_non_numeric_col_list.append(col)
            
    print('Identified {} non-numeric features that have quasi-constant values'.format(
        len(constant_non_numeric_col_list)))
    
    return null_col_list, blank_space_col_list, constant_numeric_col_list, constant_non_numeric_col_list

In [13]:
# USE THE ABOVE CUSTOM FUNCTION TO FIGURE OUT IF THERE ARE ANY COLUMNS WE NEED TO BE CONCERNED ABOUT

bad_values = find_bad_columns_function(df)

Identified 0 features with atleast one null
Identified 0 features with atleast one blank space
Identified 0 numeric features that have quasi-constant values
Identified 0 non-numeric features that have quasi-constant values


Thankfully, there is no need to worry about any of the columns in this dataset
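
One caveat worth keeping in mind: the numeric check in `find_bad_columns_function` compares the standard deviation against a fixed 0.01, which depends on the scale of the feature. The sketch below, on a made-up series, shows the same values being flagged on a 0 to 1 scale but not after multiplying by 1000. Since all features here lie between 0 and 1, the fixed threshold is a reasonable choice for this dataset.

```python
import pandas as pd

# Hypothetical feature that barely varies around 0.5
col = pd.Series([0.500, 0.505, 0.495, 0.502, 0.498])

flagged_original = col.std() < 0.01           # std is roughly 0.004
flagged_rescaled = (col * 1000).std() < 0.01  # same data, std is roughly 4

print(flagged_original, flagged_rescaled)
```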

## <a id='5'>Data Cleansing</a> 

### <a id='51'>Data Reduction</a> 

#### <a id='511'>Dropping Bad Columns</a> 

In [14]:
# Skip this because there are no bad columns

#### <a id='512'>Null Value Removal</a> 

In [15]:
# No null values to drop from the rows

#### <a id='513'>Data Encoding</a> 

In [16]:
df.dtypes

0    float64
1    float64
2    float64
3    float64
4    float64
5    float64
6    float64
dtype: object

In [17]:
# There are no categorical variables to encode 

### <a id='52'>Export cleaned csv file for later use</a> 

In [18]:
#### USING THE .to_csv() METHOD EXPORT THE DATAFRAME FOR LATER USE

csv_fname = pjoin(processed_data_folder, 'thyroid_' + export_date + '.csv')

df.to_csv(csv_fname, encoding='utf-8', index=False)
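
A detail to watch when the exported file is read back: integer column labels are written as text in the CSV header, so `pd.read_csv` returns them as strings, and a lookup with the integer `target_class_name = 6` would then fail. A minimal sketch of the round trip on a toy frame, using an in-memory buffer instead of a real file (all names below are hypothetical):

```python
import io
import pandas as pd

# Toy frame with integer column labels 0 and 1, like the thyroid dataframe
df_demo = pd.DataFrame([[0.1, 0.0], [0.2, 1.0]])

# Write to an in-memory buffer and read it back
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)
df_back = pd.read_csv(buffer)

labels_after_read = df_back.columns.tolist()
print(labels_after_read)  # the labels are now strings

# Convert the labels back to integers so integer-based lookups work again
df_back.columns = df_back.columns.astype(int)
print(df_back.columns.tolist())
```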

## Milestone 1 Ends

## Milestone 2 Begins

## <a id='6'>Modelling Workflow</a>