### Lab 1.3: Multi-Class Linear Classifier

In this lab you will explore multi-class classification and evaluate model generalization using a [dataset for heart disease prediction from the UCI ML repository](https://archive.ics.uci.edu/dataset/45/heart+disease).

In [1]:
!pip install ucimlrepo

Defaulting to user installation because normal site-packages is not writeable


The ``ucimlrepo`` package provides a convenient interface for downloading datasets from the UCI ML repository.

In [2]:
from ucimlrepo import fetch_ucirepo 

# fetch dataset 
heart_disease = fetch_ucirepo(id=45) 
  
# data (as pandas dataframes) 
X = heart_disease.data.features 
y = heart_disease.data.targets 
  
# variable information 
heart_disease.variables


| | name | role | type | demographic | description | units | missing_values |
|---|---|---|---|---|---|---|---|
| 0 | age | Feature | Integer | Age | | years | no |
| 1 | sex | Feature | Categorical | Sex | | | no |
| 2 | cp | Feature | Categorical | | | | no |
| 3 | trestbps | Feature | Integer | | resting blood pressure (on admission to the ho... | mm Hg | no |
| 4 | chol | Feature | Integer | | serum cholestoral | mg/dl | no |
| 5 | fbs | Feature | Categorical | | fasting blood sugar > 120 mg/dl | | no |
| 6 | restecg | Feature | Categorical | | | | no |
| 7 | thalach | Feature | Integer | | maximum heart rate achieved | | no |
| 8 | exang | Feature | Categorical | | exercise induced angina | | no |
| 9 | oldpeak | Feature | Integer | | ST depression induced by exercise relative to ... | | no |


Here I remove every row that has a missing feature value, dropping the matching rows from the labels as well.

In [3]:
bad = X.isna().any(axis=1)
X = X[~bad]
y = y[~bad]

Finally I convert the DataFrames to numpy arrays.

In [4]:
X = X.values
y = y.values.flatten()

The classification target is a number from 0-4 indicating the severity of heart disease.  Let's try fitting a linear model.

In [5]:
import sklearn.linear_model  # a bare `import sklearn` is not guaranteed to load the submodule

In [6]:
model = sklearn.linear_model.LogisticRegression().fit(X,y)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [7]:
model.score(X,y)

0.6094276094276094
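
The convergence warning above suggests either raising `max_iter` or scaling the data. The following is a minimal sketch of the scaling option, not the lab's prescribed solution. It runs on synthetic data (an assumption, so that it works without downloading anything), and the names `X_demo`, `y_demo`, and `scaled_model` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lab data (assumption): 297 rows, 13 features, 5 classes.
X_demo, y_demo = make_classification(
    n_samples=297, n_features=13, n_informative=8, n_classes=5, random_state=0
)
# Give the columns very different magnitudes, loosely mimicking age vs. chol.
X_demo = X_demo * np.logspace(0, 3, 13)

# Standardize each feature to zero mean and unit variance, then fit the classifier.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

train_accuracy = scaled_model.score(X_demo, y_demo)
print("training accuracy:", train_accuracy)
```

Wrapping the scaler and the classifier in one pipeline means the scaling statistics are learned only from whatever data is passed to `fit`, which matters once the data is split into train and test sets.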

### Exercises

1. Compute the $\mathbf{z}$ values for the classifier manually, i.e. compute

$$\mathbf{W}\mathbf{X}+\mathbf{b}.$$

*Hints*: 
- Use `.shape` to get the shape of a NumPy array.
- ``@`` is the matrix multiplication operator in NumPy.
- The actual computation will be a little different from what is written above.  You will need to use a matrix transpose, which is `.T` in NumPy.
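
As a sketch of why the transpose is needed, assume the scikit-learn layout in which each row of $\mathbf{X}$ is one example ($N$ examples, $d$ features) and each row of $\mathbf{W}$ holds the weights for one class ($K$ classes). Under that assumption the scores for all examples can be written as

$$\mathbf{Z} = \mathbf{X}\mathbf{W}^\top + \mathbf{1}_N\mathbf{b}^\top, \qquad \mathbf{X}\in\mathbb{R}^{N\times d},\ \mathbf{W}\in\mathbb{R}^{K\times d},\ \mathbf{b}\in\mathbb{R}^{K},\ \mathbf{Z}\in\mathbb{R}^{N\times K},$$

where $\mathbf{1}_N$ is a column of ones, so that $\mathbf{b}$ is added to every row. NumPy broadcasting performs this row-wise addition automatically when a length-$K$ vector is added to an $N\times K$ matrix.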


In [8]:
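W = model.coef_  # weight matrix: one row of 13 weights per class
b = model.intercept_  # bias vector: one entry per class
print("W shape: ", W.shape)
print("B shape: ", b.shape)
print("X shape: ", X.shape)
print("W.T shape", W.T.shape)
# Examples are rows of X, so multiply by W.T; b is broadcast across rows
z = X @ W.T + b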


W shape:  (5, 13)
B shape:  (5,)
X shape:  (297, 13)
W.T shape (13, 5)
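
One way to check a manual computation like this is to compare it with scikit-learn's own `decision_function`. The sketch below does so on synthetic data (an assumption, to keep it self-contained); `X_demo`, `y_demo`, and `demo_model` are hypothetical names, not part of the lab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 5-class problem with 13 features (stand-in for the lab data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=13, n_informative=8, n_classes=5, random_state=0
)
demo_model = LogisticRegression(max_iter=2000).fit(X_demo, y_demo)

# Manual scores: one row per example, one column per class.
z_manual = X_demo @ demo_model.coef_.T + demo_model.intercept_

# Scores as computed by scikit-learn.
z_sklearn = demo_model.decision_function(X_demo)

print(z_manual.shape)
print(np.allclose(z_manual, z_sklearn))

# The predicted label is the class whose score is largest.
manual_pred = demo_model.classes_[np.argmax(z_manual, axis=1)]
```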


Print out the $\mathbf{z}$ values for the first example in the dataset and the first label.   Determine if the classifier is correctly classifying the first example in the dataset.

In [13]:
import numpy as np

# Scores (one per class) for the first example
z_first_example = X[0] @ W.T + model.intercept_
print("z values for the first example:", z_first_example)

# The predicted class is the label whose score is largest
predict = model.classes_[np.argmax(z_first_example)]
actual = y[0]
print(f"Predicted class for the first example: {predict}")
print(f"Actual class for the first example: {actual}")

print(f"Classifier correctly classified the first example: {actual == predict}")


z values for the first example: [ 1.02965934  0.44582056 -0.31957023 -0.33764968 -0.81825999]
Predicted class for the first example: 0
Actual class for the first example: 0
Classifier correctly classified the first example: True
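
The $\mathbf{z}$ values are unnormalized scores. Assuming the model uses the multinomial (softmax) formulation, they can be converted into class probabilities as sketched below. The scores are copied from the output above; the `softmax` helper is written here for illustration and is not part of the lab.

```python
import numpy as np

# z values for the first example, copied from the printed output.
z_first = np.array([1.02965934, 0.44582056, -0.31957023, -0.33764968, -0.81825999])

def softmax(z):
    """Map a vector of scores to probabilities that sum to 1."""
    shifted = z - np.max(z)  # subtracting the max avoids overflow in exp
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

probs = softmax(z_first)
print("class probabilities:", probs)
print("most probable class:", int(np.argmax(probs)))
```

Because softmax preserves the ordering of the scores, taking the argmax of the probabilities always selects the same class as taking the argmax of $\mathbf{z}$ directly.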


2. Use ``sklearn.model_selection.train_test_split`` to split ``X`` and ``y`` into 90% train and 10% test splits.  Note that this should be done in a single call to ``train_test_split``.

*Note*: Pass ``random_state=42`` to ``train_test_split`` to ensure you get the same result from random shuffling each time.


In [16]:
import sklearn.model_selection


X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1, random_state=42)
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (267, 13)
X_test shape: (30, 13)
y_train shape: (267,)
y_test shape: (30,)
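
To illustrate what a shuffled 90/10 split does, here is a minimal NumPy sketch on random stand-in data. It is not how `train_test_split` is implemented and will not reproduce its shuffle. Rounding the test size up is an assumption chosen to match the 30-row test set above, and all `*_demo` names are hypothetical.

```python
import math
import numpy as np

n_samples, n_features = 297, 13
rng = np.random.default_rng(42)

# Random stand-in data with the same shape as the lab's X and y.
X_demo = rng.normal(size=(n_samples, n_features))
y_demo = rng.integers(0, 5, size=n_samples)

# 10% of 297 is 29.7; rounding up gives 30 test rows and 267 training rows.
n_test = math.ceil(0.1 * n_samples)

# Shuffle the row indices once, then carve off the first n_test for testing.
shuffled = rng.permutation(n_samples)
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

X_train_demo, X_test_demo = X_demo[train_idx], X_demo[test_idx]
y_train_demo, y_test_demo = y_demo[train_idx], y_demo[test_idx]
print(X_train_demo.shape, X_test_demo.shape)
```

Indexing `X_demo` and `y_demo` with the same index arrays keeps each feature row paired with its label, which is also why the lab asks for a single call to `train_test_split`.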


Fit the model to the training split and calculate accuracy on the test split.  How does it compare to the previous accuracy value (when the model was trained and evaluated on the same data)?

In [20]:
model.fit(X_train, y_train)
test = model.score(X_test, y_test)
train = model.score(X_train, y_train)
print("test: ", test)
print("train: ", train)

test:  0.7333333333333333
train:  0.602996254681648


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


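Test accuracy (73.3%) is higher than both the training accuracy (60.3%) and the earlier accuracy of 60.9% obtained by training and evaluating on the full dataset. This does not show that the model is better on unseen data: the test split has only 30 examples, so each one shifts the accuracy by about 3.3 percentage points, and a gap of this size can arise from a favorable random split.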
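
To gauge how much an accuracy measured on only 30 test examples can vary, one rough approach is the binomial standard error $\sqrt{p(1-p)/n}$. The sketch below applies it to the test accuracy printed above. Treating the test examples as independent draws is an assumption, and the two-standard-error interval is only a rule of thumb.

```python
import math

n_test = 30
test_accuracy = 0.7333333333333333  # value printed above

# Number of correctly classified test examples (22 of 30).
n_correct = round(test_accuracy * n_test)

# Binomial standard error of an accuracy estimated from n_test examples.
standard_error = math.sqrt(test_accuracy * (1 - test_accuracy) / n_test)

# Rough interval of plus or minus two standard errors.
low = test_accuracy - 2 * standard_error
high = test_accuracy + 2 * standard_error
print(f"{n_correct}/{n_test} correct, standard error ~ {standard_error:.3f}")
print(f"rough interval: {low:.3f} to {high:.3f}")
```

The standard error comes out near 0.08, so the rough interval is wide enough to include the training accuracy of about 0.60.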

3. Run $k$-fold cross validation with $k=5$ and interpret the results (see `sklearn.model_selection.cross_val_score`).

In [29]:
scores = sklearn.model_selection.cross_val_score(model, X, y, cv=5)
print(scores)
print("Average accuracy", scores.mean())

[0.6        0.6        0.52542373 0.55932203 0.59322034]
Average accuracy 0.5755932203389831


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

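Each entry of the array is the accuracy on one held-out fold. The fold accuracies range from 52.5% to 60%, with a mean of 57.6%. This is below the 73.3% measured on the single 10% test split and close to the training accuracy of about 60%, which suggests that the single split was a favorable one and that roughly 58% is the more reliable estimate of accuracy on unseen data.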
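
A compact way to summarize cross-validation results is to report the mean together with the spread across folds. The sketch below does this for the five scores printed above, using only the standard library. Using the sample standard deviation as the measure of spread is a choice made here, not something the lab specifies.

```python
import statistics

# Fold accuracies copied from the cross_val_score output above.
fold_scores = [0.6, 0.6, 0.52542373, 0.55932203, 0.59322034]

mean_score = statistics.mean(fold_scores)
std_score = statistics.stdev(fold_scores)  # sample standard deviation across folds

print(f"accuracy: {mean_score:.3f} +/- {std_score:.3f} over {len(fold_scores)} folds")
```

The spread of roughly three percentage points between folds shows how much the estimate depends on which examples land in the held-out set.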