## Mini-project 1

__Goal:__ Conduct a study that compares the performance of several classification algorithms across a variety of datasets.

__Datasets:__ The folder data contains 98 publicly available datasets from the UCI machine learning repository ([link](http://archive.ics.uci.edu/ml/index.php)). These datasets were collected and converted to a standard format by Dunn and Bertsimas (for more details see [link1](https://github.com/JackDunnNZ/uci-data) and [link2](http://jack.dunn.nz/papers/OptimalClassificationTrees.pdf)):
* Each dataset is stored in a separate folder
* Each folder contains a datafile and the configuration file config.ini specifying the data format
- Data files are stored in CSV format, and their names end with either ".orig" or ".custom". If both files exist in a folder, use the file ending with ".custom"
- Each config.ini file contains information about a dataset: 
    - separator: the character used to separate columns in the respective csv file
    - header_lines: the number of rows to skip at the top of the data file, because they describe the file rather than hold data
    - target_index: the column number of the output variable
    - value_indices: the column numbers of the input variables
    - categoric_indices: column numbers of categorical data
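
The keys listed above can be read with `configparser`. The following is a minimal sketch, not the course's reference solution: the section name `[info]` and the spelling `comma` for the separator are taken from the notebook code further down, while the sample values and the helper names `parse_indices` and `read_config` are hypothetical.

```python
import configparser

# Hypothetical config.ini contents, mirroring the keys listed above
EXAMPLE_CONFIG = """
[info]
separator = comma
header_lines = 0
target_index = 5
value_indices = 1,2,3,4
categoric_indices = 2,4
"""

def parse_indices(raw):
    """Turn a comma-separated string of 1-based column numbers into 0-based ints."""
    return [int(token) - 1 for token in raw.split(',') if token.strip()]

def read_config(text):
    """Parse the [info] section into a plain dict with 0-based column positions."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    info = parser['info']
    return {
        'separator': ',' if info['separator'] == 'comma' else info['separator'],
        'header_lines': int(info['header_lines']),
        'target': int(info['target_index']) - 1,
        'values': parse_indices(info['value_indices']),
        'categoric': parse_indices(info['categoric_indices']),
    }

settings = read_config(EXAMPLE_CONFIG)
print(settings)
```

Converting to 0-based positions once, at parse time, keeps the later pandas indexing free of scattered `- 1` corrections.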
    



__Remarks__:
1. Notice that column numbering in the configuration files begins with 1 (versus 0 in Python)
2. You may use the package [configparser](https://docs.python.org/3.7/library/configparser.html) to read and parse config.ini files
3. The character "?" denotes a null value. After reading a data file, you may drop all lines that contain null values.
4. Out of the 98 datasets, use only the 54 datasets whose names are listed in the file "datasets_selection".
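
Remark 3 can be handled while the file is being read. The sketch below assumes pandas is used for loading; the three-row data file is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row data file; the second row has a missing value marked "?"
raw = io.StringIO("5.1,3.5,a\n?,3.0,b\n6.2,2.9,a\n")

# na_values tells pandas to read "?" as NaN; dropna then removes those rows
frame = pd.read_csv(raw, header=None, na_values=['?']).dropna(how='any')
print(frame)
```

Because "?" is turned into NaN at parse time, the first column is still read as a numeric column rather than as strings.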



__Assignment__: compare the performance of the following classification algorithms on the 54 datasets: 
- Support vector machine, 
- Logistic Regression, 
- K-nearest neighbors, 
- Decision trees, 
- Quadratic discriminant analysis, 
- Random forests, and 
- AdaBoost


Submit your solution as a Jupyter notebook, and include any other files needed to replicate your analysis. In addition, submit a report (at most 4 pages long) that discusses your methodology, your key findings, and the limitations of your analysis. Compare the use of ML methods in this project against typical ML applications. 


__Tip:__ Start early. The assignment requires a substantial amount of file processing before you can run the learning algorithms and analyze the results. 


In [1]:
import pandas as pd
import numpy as np

In [2]:
d = pd.read_csv('datasets_selection', header=None)

# drop the last five characters of each entry to get the dataset folder name
dflist = d[0].str[:-5].tolist()
len(dflist)

54

In [3]:
a=!ls data
len(a)

98

In [4]:
# keep only the selected names that have a matching folder under data/
dataset_selection = [name for name in dflist if name in a]

In [5]:
dataset_selection.append('breast-cancer-wisconsin-original')

In [6]:
# return the dataset file path
import os

def loadDataset(folder):
    files = os.listdir('data/' + folder)
    custom = [file for file in files if '.orig.custom' in file]
    orig = [file for file in files if '.orig' in file]
    if len(custom) > 0:
        return ('data/' + folder + '/' + custom[0])
    elif len(orig) > 0:
        return ('data/' + folder + '/' + orig[0])
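
As an alternative to substring matching, the ".custom before .orig" rule can be sketched with `pathlib` and an explicit suffix check. This is an illustration only: `pick_data_file` and the file names in the demonstration are hypothetical, and unlike `loadDataset` it raises an error instead of returning `None` when no data file is found.

```python
import tempfile
from pathlib import Path

def pick_data_file(folder):
    """Return the '.custom' file in folder if there is one, else the '.orig' file."""
    folder = Path(folder)
    for suffix in ('.custom', '.orig'):
        matches = sorted(p for p in folder.iterdir() if p.name.endswith(suffix))
        if matches:
            return matches[0]
    raise FileNotFoundError(f'no .custom or .orig file in {folder}')

# Demonstration on a throwaway folder with hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ('demo.data.orig', 'demo.data.orig.custom', 'config.ini'):
        (Path(tmp) / name).touch()
    chosen = pick_data_file(tmp).name

print(chosen)  # demo.data.orig.custom
```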

In [7]:
# pre-process a data file in place: strip leading whitespace and collapse runs of spaces
def whitespace(path):
    with open(path, 'r', encoding='utf-8') as f:
        rows = f.read().split('\n')
    with open(path, 'w', encoding='utf-8') as f:
        for r in rows:
            # split on single spaces and drop the empty pieces left by repeated spaces
            l = [piece for piece in r.lstrip().split(' ') if piece != '']
            f.write(' '.join(l) + '\n')
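
The same clean-up can be tried in memory before touching any file. The sketch below is a variant, not the notebook's function: `normalize_whitespace` is a hypothetical helper that relies on `str.split()` with no argument, so it also collapses tabs and trims trailing spaces.

```python
def normalize_whitespace(text):
    """Collapse each run of spaces or tabs to one space and trim both ends of every line."""
    return '\n'.join(' '.join(line.split()) for line in text.splitlines())

# Hypothetical two-line sample with irregular spacing
sample = "  1   2.5  yes\n3  4.0   no  "
cleaned = normalize_whitespace(sample)
print(cleaned)
```

Working on a string keeps the original data file untouched, which makes the preprocessing repeatable.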

In [8]:
# return the dataset as a DataFrame, with rows containing null values dropped
import pandas as pd
import configparser

def configure(folder):
    config = configparser.ConfigParser()
    config.read('data/' + folder + '/config.ini')
    info = config['info']

    # skiprows=0 reads the file from the first line, so one call covers both cases
    options = {'header': None, 'skiprows': int(info['header_lines']), 'na_values': ['?', 'NaN']}

    # config.ini spells the comma separator as the word "comma"; any value
    # other than "comma" or ";" is read as whitespace-delimited
    if info['separator'] == 'comma':
        options.update(sep=',', engine='python')
    elif info['separator'] == ';':
        options.update(sep=';', engine='python')
    else:
        options.update(sep=r'\s+')

    return pd.read_csv(loadDataset(folder), **options).dropna(how='any')


In [9]:
# return target_index
def label(folder):
    config = configparser.ConfigParser()
    config.read('data/'+ folder + '/config.ini')
    target = int(config['info']['target_index'])-1
    return target

In [10]:
# data cleaning pipeline
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin): 
    def __init__(self, attibute_names):
        self.attibute_names = attibute_names
    def fit(self, X, y=None):
        return(self)
    def transform(self, X): 
        return(X[self.attibute_names].values)

from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
%run CategoricalEncoder.py

from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion

def prepared(folder):
    config = configparser.ConfigParser()
    config.read('data/'+ folder + '/config.ini')
    num = config['info']['value_indices'].split(',')
    cat = config['info']['categoric_indices'].split(',')
    
    # numerical attributes' pipeline
    num_attributes = [int(i)-1 for i in num if i not in cat]
    num_pipeline = Pipeline([
        ('selector',DataFrameSelector(num_attributes)),
        ('imputer',Imputer(strategy="median")),
        ('std_scaler', StandardScaler()),])

    # categorical attributes' pipeline
    cat_attributes = [int(i)-1 for i in cat if i != '']
    cat_pipeline = Pipeline([
        ('selector',DataFrameSelector(cat_attributes)),
        ('label_binarizer',CategoricalEncoder(encoding="onehot-dense")),])
    
    # full pipeline
    if len(num_attributes) == 0:
        full_pipeline = FeatureUnion(transformer_list=[
        ('cat_pipeline',cat_pipeline)])
    elif len(cat_attributes) == 0:
        full_pipeline = FeatureUnion(transformer_list=[
        ('num_pipeline',num_pipeline)])
    else:
        full_pipeline = FeatureUnion(transformer_list=[
            ('num_pipeline',num_pipeline),
            ('cat_pipeline',cat_pipeline),])

    
    df = configure(folder)
    # label-encode every column (numeric features and the target included) as integer codes
    for i in range(len(df.columns)):
        encoder = LabelEncoder()
        df[i] = encoder.fit_transform(df[i])  # fit and transform in one call
    df_Y = df[label(folder)]
    del df[label(folder)]
    df_X = df
    return (full_pipeline.fit_transform(df_X), df_Y)
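
For readers on a newer scikit-learn, the same "scale the numeric columns, one-hot encode the categorical ones" idea can be sketched with `ColumnTransformer`. This is a swapped-in technique, not the notebook's pipeline: it assumes scikit-learn 0.20 or later (where `SimpleImputer` and `ColumnTransformer` exist), and the small frame and column lists are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical frame: columns 0 and 1 are numeric, column 2 is categorical
frame = pd.DataFrame({0: [1.0, 2.0, 3.0, 4.0],
                      1: [10.0, 20.0, 30.0, 40.0],
                      2: ['a', 'b', 'c', 'a']})
num_columns, cat_columns = [0, 1], [2]

# Numeric columns: fill gaps with the median, then standardize
numeric = Pipeline([('imputer', SimpleImputer(strategy='median')),
                    ('scaler', StandardScaler())])

# Route each group of columns to its own transformer and concatenate the results
preprocess = ColumnTransformer([
    ('num', numeric, num_columns),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_columns),
])

features = preprocess.fit_transform(frame)
# The result may be a sparse matrix when most entries are zero; convert to dense then
if hasattr(features, 'toarray'):
    features = features.toarray()
print(features.shape)
```

The two numeric columns stay as two features and the three categories become three indicator columns, so the result has five columns.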

In [11]:
# function Randomforest
from sklearn.ensemble import RandomForestClassifier 
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
def Random_forest():
    parameters = {'max_features': ('auto', 'sqrt'), 'n_estimators': [100, 150], 'max_depth': [15, 30]}
    rf = RandomForestClassifier()
    clf = GridSearchCV(rf, parameters)
    clf.fit(train_X, train_Y)
    rf_predictions = clf.predict(test_X)
    rf_mse = mean_squared_error(test_Y, rf_predictions)
    rf_rmse = np.sqrt(rf_mse)
    print("The rmse of Random forest is: ",rf_rmse)
    n = len(test_Y)
    accuracy = sum(rf_predictions==test_Y)/n
    print("The accuracy of Random forest is: ",accuracy)
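
One caveat about the two metrics printed here: RMSE treats the encoded class labels as numbers, so its value depends on how the classes happen to be numbered, whereas accuracy does not. A small worked example with made-up labels (the helper names `rmse` and `accuracy` are hypothetical) illustrates the difference.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two label arrays, treated as numbers."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label exactly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Hypothetical labels for a three-class problem, with one misclassified sample
y_true = np.array([0, 1, 2, 2])
y_pred = np.array([0, 1, 2, 0])

# Renumber the classes (0 -> 0, 1 -> 2, 2 -> 1); the predictions mean the same thing
relabel = np.array([0, 2, 1])

print(accuracy(y_true, y_pred), accuracy(relabel[y_true], relabel[y_pred]))  # 0.75 0.75
print(rmse(y_true, y_pred), rmse(relabel[y_true], relabel[y_pred]))          # 1.0 0.5
```

The same single mistake yields an RMSE of 1.0 under one numbering and 0.5 under another, so accuracy is the safer quantity to compare across datasets.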

In [12]:
# function QDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
def QDA():
    clf = QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0)
    clf.fit(train_X, train_Y)
    qda_predictions = clf.predict(test_X)
    qda_mse = mean_squared_error(test_Y, qda_predictions)
    qda_rmse = np.sqrt(qda_mse)
    print("The rmse of QDA is: ",qda_rmse)
    n = len(test_Y)
    accuracy = sum(qda_predictions==test_Y)/n
    print("The accuracy of QDA is: ",accuracy)

In [13]:
# function KNN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import mean_squared_error
def KNN():
    parameters = {'n_neighbors': [3, 5, 7], 'weights': ('distance', 'uniform')}
    knn = KNeighborsClassifier()
    clf = GridSearchCV(knn, parameters)
    clf.fit(train_X, train_Y)
    knn_predictions = clf.predict(test_X)
    knn_mse = mean_squared_error(test_Y, knn_predictions)
    knn_rmse = np.sqrt(knn_mse)
    print("The rmse of KNN is: ",knn_rmse)
    n = len(test_Y)
    accuracy = sum(knn_predictions==test_Y)/n
    print("The accuracy of KNN is: ",accuracy)

In [14]:
# function adaboost
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import mean_squared_error
def adaboost():
    parameters = {'learning_rate': [0.05, 0.1, 0.2], 'n_estimators': [40, 50, 60, 70]} 
    ada = AdaBoostClassifier()
    clf = GridSearchCV(ada, parameters)
    clf.fit(train_X,train_Y)
    ada_predictions = clf.predict(test_X)
    ada_mse = mean_squared_error(test_Y,ada_predictions)
    ada_rmse = np.sqrt(ada_mse)
    print("The rmse of Adaboost is: ",ada_rmse)
    n = len(test_Y)
    accuracy = sum(ada_predictions==test_Y)/n
    print("The accuracy of adaboost is: ",accuracy)

In [15]:
# function LogisticRegression
from sklearn.metrics import mean_squared_error
def LogisticRegression():
    # this function shadows the scikit-learn class of the same name,
    # so the class is imported inside the function body
    from sklearn.linear_model import LogisticRegression
    parameters = {'C': [10, 100, 1000, 10**4], 'fit_intercept': (True, False), 'solver': ('liblinear', 'lbfgs', 'newton-cg')}
    lr = LogisticRegression()
    clf = GridSearchCV(lr, parameters) 
    clf.fit(train_X, train_Y)
    lr_predictions = clf.predict(test_X)
    lr_mse = mean_squared_error(test_Y, lr_predictions)
    lr_rmse = np.sqrt(lr_mse)
    print("The rmse of Logistic regression is: ",lr_rmse)
    n = len(test_Y)
    accuracy = sum(lr_predictions==test_Y)/n
    print("The accuracy of Logistic regression is: ",accuracy)

In [16]:
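# function Decision Tree
# NOTE: DecisionTreeRegressor predicts leaf averages, which need not be valid class
# labels, and the accuracy below counts only exact matches; DecisionTreeClassifier
# is the classification counterpart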
from sklearn.tree import DecisionTreeRegressor
def DecisionTree():
    clf = DecisionTreeRegressor()
    clf.fit(train_X, train_Y)
    dt_predictions = clf.predict(test_X)
    dt_mse = mean_squared_error(test_Y, dt_predictions)
    dt_rmse = np.sqrt(dt_mse)
    print("The rmse of Decision Tree is: ",dt_rmse)
    n = len(test_Y)
    accuracy = sum(dt_predictions==test_Y)/n
    print("The accuracy of Decision Tree is: ",accuracy)

In [17]:
# function SVM
from sklearn import svm
def SVM():
    parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10, 100, 1000]}
    svc = svm.SVC()
    clf = GridSearchCV(svc, parameters)
    clf.fit(train_X, train_Y)
    svm_predictions = clf.predict(test_X)
    svm_mse = mean_squared_error(test_Y, svm_predictions)
    svm_rmse = np.sqrt(svm_mse)
    print("The rmse of SVM is: ",svm_rmse)
    n = len(test_Y)
    accuracy = sum(svm_predictions==test_Y)/n
    print("The accuracy of svm is: ",accuracy)
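
The seven functions above repeat the same fit, predict, and score steps and read the train/test arrays from global variables. One possible consolidation is sketched below, under stated assumptions: `evaluate` is a hypothetical helper, synthetic data from `make_classification` stands in for a prepared dataset, and only two of the seven models are shown.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def evaluate(estimator, param_grid, X_train, y_train, X_test, y_test):
    """Fit estimator (grid-searched when param_grid is non-empty) and return test accuracy."""
    model = GridSearchCV(estimator, param_grid) if param_grid else estimator
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

# Synthetic data stands in for one prepared dataset
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Each entry pairs an estimator with the hyperparameter grid to search (empty = no search)
models = {
    'KNN': (KNeighborsClassifier(), {'n_neighbors': [3, 5, 7], 'weights': ['distance', 'uniform']}),
    'Decision tree': (DecisionTreeClassifier(random_state=0), {}),
}
scores = {name: evaluate(est, grid, X_train, y_train, X_test, y_test)
          for name, (est, grid) in models.items()}
print(scores)
```

Passing the data as arguments and returning the score, rather than printing it, makes it straightforward to loop over all 54 datasets and collect the results.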

In [48]:
x = int(input())
print(dataset_selection[x])
folder = dataset_selection[x]

# clean the whitespace
whitespace(loadDataset(folder))

# data cleaning
df_X, df_Y = prepared(folder)

# split the training and test dataset
from sklearn.model_selection import train_test_split

# a single call splits features and labels with the same shuffled indices
train_X, test_X, train_Y, test_Y = train_test_split(df_X, df_Y, test_size=0.2, random_state=1)

# flatten the label Series into 1-D NumPy arrays
train_Y = train_Y.values.ravel()
test_Y = test_Y.values.ravel()

20
hayes-roth


In [49]:
# show the RMSE and accuracy of the seven algorithms
Random_forest()
QDA()
KNN()
adaboost()
LogisticRegression()
DecisionTree()
SVM()

The rmse of Random forest is:  0.38490017946
The accuracy of Random forest is:  0.851851851852
The rmse of QDA is:  0.881917103688
The accuracy of QDA is:  0.444444444444
The rmse of KNN is:  0.981306762925
The accuracy of KNN is:  0.592592592593
The rmse of Adaboost is:  0.0
The accuracy of adaboost is:  1.0
The rmse of Logistic regression is:  0.902670933848
The accuracy of Logistic regression is:  0.407407407407
The rmse of Decision Tree is:  0.160375074775
The accuracy of Decision Tree is:  0.925925925926
The rmse of SVM is:  0.693888666489
The accuracy of svm is:  0.740740740741
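
Printed lines are hard to compare across 54 datasets. A minimal sketch of collecting the accuracies in a table follows; the numbers are copied from the hayes-roth output above and rounded to three decimals, and the variable names are hypothetical.

```python
import pandas as pd

# Accuracies copied from the hayes-roth output above, rounded to three decimals
accuracy = {
    'Random forest': 0.852,
    'QDA': 0.444,
    'KNN': 0.593,
    'AdaBoost': 1.000,
    'Logistic regression': 0.407,
    'Decision tree': 0.926,
    'SVM': 0.741,
}

# One column per dataset; further datasets would be added as extra columns
results = pd.DataFrame({'hayes-roth': accuracy})
ranking = results['hayes-roth'].sort_values(ascending=False)
print(ranking)
```

With one column per dataset, per-algorithm summaries such as the mean accuracy or mean rank reduce to row-wise operations on `results`.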


In [50]:
pd.DataFrame(df_X).describe()

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| count | 132.0 | 132.0 | 132.0 | 132.0 |
| mean | 0.0 | 8.41078e-18 | -6.055762e-17 | 7.149163e-18 |
| std | 1.00381 | 1.00381 | 1.00381 | 1.00381 |
| min | -1.224745 | -1.010753 | -1.010753 | -1.010753 |
| 25% | -1.224745 | -1.010753 | -1.010753 | -1.010753 |
| 50% | 0.0 | 0.0481311 | 0.0481311 | 0.0481311 |
| 75% | 1.224745 | 0.0481311 | 0.0481311 | 0.0481311 |
| max | 1.224745 | 2.165899 | 2.165899 | 2.165899 |
