# Machine Learning - Assignment 2 - Brian Seggebruch

### Q1

#### Data information for the Auto MPG dataset from the UCI Machine Learning Repository

1. Title: Auto-Mpg Data


2. Sources:
   (a) Origin:  This dataset was taken from the StatLib library which is
                maintained at Carnegie Mellon University. The dataset was 
                used in the 1983 American Statistical Association Exposition.
   (c) Date: July 7, 1993


3. Past Usage:
    -  See 2b (above)
    -  Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning.
       In Proceedings on the Tenth International Conference of Machine 
       Learning, 236-243, University of Massachusetts, Amherst. Morgan
       Kaufmann.


4. Relevant Information:

   This dataset is a slightly modified version of the dataset provided in
   the StatLib library.  In line with the use by Ross Quinlan (1993) in
   predicting the attribute "mpg", 8 of the original instances were removed 
   because they had unknown values for the "mpg" attribute.  The original 
   dataset is available in the file "auto-mpg.data-original".

   "The data concerns city-cycle fuel consumption in miles per gallon,
    to be predicted in terms of 3 multivalued discrete and 5 continuous
    attributes." (Quinlan, 1993)


5. Number of Instances: 398


6. Number of Attributes: 9 including the class attribute


7. Attribute Information:

    1. mpg:           continuous
    2. cylinders:     multi-valued discrete
    3. displacement:  continuous
    4. horsepower:    continuous
    5. weight:        continuous
    6. acceleration:  continuous
    7. model year:    multi-valued discrete
    8. origin:        multi-valued discrete
    9. car name:      string (unique for each instance)


8. Missing Attribute Values:  horsepower has 6 missing values



In [166]:
# import libraries

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import mean_squared_error, classification_report, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [22]:
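# create df: the file is whitespace-delimited and has no header row
# (passing both sep=',' and delim_whitespace=True is contradictory and errors in recent pandas)
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
    ,header=None
    ,sep=r'\s+'
)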

In [28]:
# checking nulls
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
0    398 non-null float64
1    398 non-null int64
2    398 non-null float64
3    398 non-null object
4    398 non-null float64
5    398 non-null float64
6    398 non-null int64
7    398 non-null int64
8    398 non-null object
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB


In [29]:
# check data
df.head()

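|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | ford torino |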


In [30]:
# renaming columns
df.columns = ['mpg','cyl','displ','hp','wt','accl','yr','origin','name']
df.head()

|   | mpg | cyl | displ | hp | wt | accl | yr | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | ford torino |


In [31]:
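# exploring correlations with mpg over the numeric columns only
# (hp is still stored as strings here; the [1:-1] slice drops mpg itself and the last numeric column, origin)
df.select_dtypes('number').corr()['mpg'][1:-1]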

cyl     -0.775396
displ   -0.804203
wt      -0.831741
accl     0.420289
yr       0.579267
Name: mpg, dtype: float64

In [36]:
df.iloc[:5,0]

0    18.0
1    15.0
2    18.0
3    16.0
4    17.0
Name: mpg, dtype: float64

In [67]:
# remove rows where horsepower is missing (recorded as '?'), then make hp numeric
df = df[df.hp != '?'].copy()
df['hp'] = df['hp'].astype(float)
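
An alternative way to handle the `?` placeholders is to coerce the column to numbers and drop whatever fails to parse. The snippet below is only a sketch on a tiny hand-made frame (the `toy` values are made up for illustration, not taken from the real file):

```python
import pandas as pd

# Toy stand-in for the auto-mpg frame (illustrative values, not the real data)
toy = pd.DataFrame({
    "mpg": [18.0, 25.0, 21.0],
    "hp": ["130.0", "?", "95.0"],
})

# Turn '?' into NaN and every other entry into a float
toy["hp"] = pd.to_numeric(toy["hp"], errors="coerce")

# Drop the rows whose horsepower is missing
clean = toy.dropna(subset=["hp"]).reset_index(drop=True)

print(clean.dtypes["hp"])
print(len(clean))
```

The advantage of this route is that `hp` ends up as a float column, so it also takes part in `corr()` and other numeric summaries.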

In [68]:
# train test split
X = df.iloc[:,1:-1]
y = df.iloc[:,0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)

In [69]:
# check shape of resulting X and y train/test
display(X_train.shape)
display(X_test.shape)
display(y_train.shape)
display(y_test.shape)

(313, 7)

(79, 7)

(313,)

(79,)

In [71]:
# Linear Regression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [73]:
# measure RMSE on train set
pred = lin_reg.predict(X_train)
lin_mse = mean_squared_error(y_train, pred)
lin_rmse = np.sqrt(lin_mse)
display(lin_rmse)

3.1757563687086

In [75]:
# measure RMSE on test set
pred = lin_reg.predict(X_test)
lin_mse = mean_squared_error(y_test, pred)
lin_rmse = np.sqrt(lin_mse)
display(lin_rmse)

3.7621400479117106

In [76]:
df.mpg.agg(['mean','std'])

mean    23.445918
std      7.805007
Name: mpg, dtype: float64

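In both cases the RMSE is low relative to the scale of the target: about 3.2 mpg on the training set and 3.8 mpg on the test set, against a mean of 23.4 mpg and a standard deviation of 7.8 mpg. The somewhat higher test error is expected, since the model never saw those rows during fitting.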
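
One way to read an RMSE against the spread of the data is to compare it with a baseline that always predicts the mean: that baseline's RMSE equals the population standard deviation of the target. Below is a minimal sketch of that idea, using made-up mpg values rather than the real dataset:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Made-up mpg values, for illustration only
y = [18.0, 15.0, 24.0, 31.0, 27.0, 22.0]

# Baseline: always predict the mean of the target
mean_y = sum(y) / len(y)
baseline_rmse = rmse(y, [mean_y] * len(y))

# Population standard deviation (divides by n, unlike pandas' default of n - 1)
pop_std = math.sqrt(sum((t - mean_y) ** 2 for t in y) / len(y))

print(round(baseline_rmse, 3), round(pop_std, 3))
```

Read this way, a test RMSE of about 3.8 against a standard deviation of about 7.8 suggests the regression roughly halves the error of the mean-only baseline, with the caveat that the standard deviation above was computed on the whole cleaned dataset rather than on the test split.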

### Q2

##### Logistic Regression

In [86]:
from sklearn.datasets import load_iris

In [87]:
data = load_iris()

In [109]:
labels = data.target_names

In [100]:
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.DataFrame(data.target, columns=['label'])
display(X.head(),y.head())

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


|   | label |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


In [120]:
# split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)

In [174]:
# Logistic Regression
log_reg = LogisticRegression()
# pass the label column as a 1-D Series; a one-column DataFrame triggers a DataConversionWarning
log_reg.fit(X_train, y_train.label)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [175]:
# measure accuracy on train set
pred = log_reg.predict(X_train)
print(classification_report(y_train, pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        38
           1       1.00      0.90      0.95        41
           2       0.91      1.00      0.95        41

    accuracy                           0.97       120
   macro avg       0.97      0.97      0.97       120
weighted avg       0.97      0.97      0.97       120



In [176]:
# measure accuracy on test set
pred = log_reg.predict(X_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       1.00      0.89      0.94         9
           2       0.90      1.00      0.95         9

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.96        30
weighted avg       0.97      0.97      0.97        30



On both the training and test sets, precision and recall are high and overall accuracy is 0.97, which suggests the model did not overfit the training data. Precision is the fraction of samples predicted as a class that truly belong to it; recall is the fraction of samples truly in a class that the model assigns to that class.
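
To make the two definitions concrete, here is a small sketch that computes them by hand. The label lists are a hypothetical reconstruction that is consistent with the test report above (supports of 12, 9 and 9, with one class-1 sample predicted as class 2); the notebook does not show which samples were actually misclassified.

```python
def precision_recall(y_true, y_pred, cls):
    """Per-class precision and recall computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)  # true positives
    fp = sum(1 for t, p in pairs if t != cls and p == cls)  # false positives
    fn = sum(1 for t, p in pairs if t == cls and p != cls)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels: one class-1 sample is predicted as class 2
y_true = [0] * 12 + [1] * 9 + [2] * 9
y_pred = [0] * 12 + [1] * 8 + [2] + [2] * 9

for cls in (0, 1, 2):
    precision, recall = precision_recall(y_true, y_pred, cls)
    print(cls, round(precision, 2), round(recall, 2))
```

A single mistake lowers the recall of class 1 (one of its nine samples is missed) and the precision of class 2 (one of its ten predictions is wrong), which matches the pattern in the report.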

In [177]:
print(accuracy_score(y_test, pred))

0.9666666666666667


##### SVM

In [208]:
# fit SVM model (features are standardized before the linear SVC)
svm = Pipeline([
    ("scaler", StandardScaler())
    ,("linear_svc", LinearSVC(C=1, loss="hinge"))
])
svm.fit(X_train, y_train.label)

Pipeline(memory=None,
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('linear_svc',
                 LinearSVC(C=1, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                           loss='hinge', max_iter=1000, multi_class='ovr',
                           penalty='l2', random_state=None, tol=0.0001,
                           verbose=0))],
         verbose=False)

In [170]:
# measure accuracy on train data
pred = svm.predict(X_train)
print(classification_report(y_train, pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        38
           1       0.92      0.88      0.90        41
           2       0.88      0.93      0.90        41

    accuracy                           0.93       120
   macro avg       0.94      0.93      0.93       120
weighted avg       0.93      0.93      0.93       120



In [171]:
# measure accuracy on test data
pred = svm.predict(X_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       1.00      0.92      0.96        12
           1       0.88      0.78      0.82         9
           2       0.82      1.00      0.90         9

    accuracy                           0.90        30
   macro avg       0.90      0.90      0.89        30
weighted avg       0.91      0.90      0.90        30



In [172]:
print(labels)

['setosa' 'versicolor' 'virginica']


With this linear SVM we achieve similar results, but it is slightly weaker than logistic regression: on the test set, setosa recall drops to 0.92, and versicolor and virginica are confused more often (versicolor recall 0.78, virginica precision 0.82). Test accuracy (0.90) is slightly below training accuracy (0.93), which may indicate mild overfitting.
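
With only 30 test samples, a gap of a few accuracy points is hard to interpret. As a rough check, the sketch below computes an approximate 95% Wilson score interval for the test accuracy. This is a technique I am adding for illustration; it is not part of the assignment, and `wilson_interval` is a helper name of my own.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    p = correct / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half, center + half

# 0.90 accuracy on 30 test samples corresponds to 27 correct predictions
low, high = wilson_interval(27, 30)
print(f"{low:.2f} to {high:.2f}")
```

If the interval comfortably contains the training accuracy, the train/test gap on its own is weak evidence of overfitting.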

In [173]:
print(accuracy_score(y_test, pred))

0.9


##### Decision Tree

In [163]:
# fit decision tree model
tree_clf = DecisionTreeClassifier(max_depth=6)
tree_clf.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [164]:
# measure accuracy on train data
pred = tree_clf.predict(X_train)
print(classification_report(y_train, pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        38
           1       1.00      1.00      1.00        41
           2       1.00      1.00      1.00        41

    accuracy                           1.00       120
   macro avg       1.00      1.00      1.00       120
weighted avg       1.00      1.00      1.00       120



In [165]:
# measure accuracy on test data
pred = tree_clf.predict(X_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       1.00      0.78      0.88         9
           2       0.82      1.00      0.90         9

    accuracy                           0.93        30
   macro avg       0.94      0.93      0.92        30
weighted avg       0.95      0.93      0.93        30



With `max_depth` set to 6, the tree classifies the training data perfectly. A perfect training score often signals overfitting, but here the model still generalizes reasonably well, reaching 0.93 accuracy on the test data.
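
One way to probe whether the depth is causing overfitting is to compare train and test accuracy over a few depths. The sketch below reuses the same split settings as above; the depth values and the seed `random_state=0` are arbitrary choices of mine.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12
)

results = {}
for depth in (1, 2, 3, 6):
    # Fixed seed so repeated runs build the same tree
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    # Store (train accuracy, test accuracy) for this depth
    results[depth] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(depth, results[depth])
```

If test accuracy stops improving while train accuracy keeps climbing, the extra depth is probably fitting noise.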

In [167]:
print(accuracy_score(y_test, pred))

0.9333333333333333


##### Random Forest

In [181]:
# fit the random forest model
rf_clf = RandomForestClassifier(n_estimators=100)
rf_clf.fit(X_train, y_train.label)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [183]:
# measure accuracy on train data
pred = rf_clf.predict(X_train)
print(classification_report(y_train, pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        38
           1       1.00      1.00      1.00        41
           2       1.00      1.00      1.00        41

    accuracy                           1.00       120
   macro avg       1.00      1.00      1.00       120
weighted avg       1.00      1.00      1.00       120



In [184]:
# measure accuracy on test data
pred = rf_clf.predict(X_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       1.00      0.89      0.94         9
           2       0.90      1.00      0.95         9

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.96        30
weighted avg       0.97      0.97      0.97        30



The random forest model outperforms the single decision tree (0.97 versus 0.93 test accuracy) and matches our original logistic regression model on every test metric reported above.

In [185]:
print(accuracy_score(y_test, pred))

0.9666666666666667


##### Comparing all models...

In [194]:
for model in (log_reg, svm, tree_clf, rf_clf):
    # refit each model on the same split; y_train.label is the 1-D label column
    model.fit(X_train, y_train.label)
    y_pred = model.predict(X_test)
    print(model.__class__.__name__, f'{accuracy_score(y_test, y_pred):.2f}')


LogisticRegression 0.97
Pipeline 0.90
DecisionTreeClassifier 0.97
RandomForestClassifier 0.97
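
The decision tree scores 0.97 here but scored 0.93 when it was fit earlier. A plausible cause is that the tree and the forest were created with `random_state=None`, so each refit can build a slightly different model. Below is a sketch of how fixing the seed makes repeated fits agree; the seed value 0 and the helper `fit_and_score` are my own choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12
)

def fit_and_score(make_model):
    """Build a fresh model, fit it, and return its test accuracy."""
    model = make_model()
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

# Factories so that every call starts from an unfitted model with the same seed
models = {
    "decision tree": lambda: DecisionTreeClassifier(max_depth=6, random_state=0),
    "random forest": lambda: RandomForestClassifier(n_estimators=100, random_state=0),
}

runs = {}
for name, make_model in models.items():
    runs[name] = (fit_and_score(make_model), fit_and_score(make_model))
    print(name, runs[name])
```

Keying the results by a readable name also avoids the SVM showing up as `Pipeline` in the printed comparison.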


### Q2 - new dataset

In [195]:
from sklearn.datasets import load_breast_cancer

In [209]:
# bring in data
data = load_breast_cancer()
labels = data.target_names

In [204]:
print(data.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, f

In [210]:
# create our dataframes
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.DataFrame(data.target, columns=['label'])
display(X.head(),y.head())

|   | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


|   | label |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


In [211]:
# split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)

In [212]:
for model in (log_reg, svm, tree_clf, rf_clf):
    # refit each model on the new dataset; y_train.label is the 1-D label column
    model.fit(X_train, y_train.label)
    y_pred = model.predict(X_test)
    print(model.__class__.__name__, f'{accuracy_score(y_test, y_pred):.2f}')


LogisticRegression 0.94
Pipeline 0.96
DecisionTreeClassifier 0.95
RandomForestClassifier 0.93


With my new data, the SVM classifier outperformed all the other models, reaching 0.96 test accuracy against 0.93 to 0.95 for the rest.
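
Since the accuracies above come from a single train/test split and sit within a few points of each other, the ranking could change with a different split. A rough way to check is k-fold cross-validation. The sketch below makes a few assumptions of my own: 5 folds, a scaler in front of the logistic regression as well (in the runs above only the SVM was scaled), and larger `max_iter` values to help convergence.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)

# Both candidates standardize the features first (an assumption of this sketch)
candidates = {
    "logistic regression": Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]),
    "linear SVM": Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LinearSVC(C=1, loss="hinge", max_iter=10000)),
    ]),
}

# One accuracy score per fold for each candidate
cv_scores = {
    name: cross_val_score(model, X, y, cv=5)
    for name, model in candidates.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Comparing the fold means together with their standard deviations gives a better sense of whether one model is really ahead.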