### Lab 1.3: Multi-Class Linear Classifier

In this lab you will explore multi-class classification and evaluate model generalization using a [dataset for heart disease prediction from the UCI ML repository](https://archive.ics.uci.edu/dataset/45/heart+disease).

In [1]:
!pip install ucimlrepo



The ``ucimlrepo`` package provides a convenient interface for accessing the repository's datasets.

In [2]:
from ucimlrepo import fetch_ucirepo 

# fetch dataset 
heart_disease = fetch_ucirepo(id=45) 
  
# data (as pandas dataframes) 
X = heart_disease.data.features 
y = heart_disease.data.targets 
  
# variable information 
heart_disease.variables


| | name | role | type | demographic | description | units | missing_values |
|---|---|---|---|---|---|---|---|
| 0 | age | Feature | Integer | Age | | years | no |
| 1 | sex | Feature | Categorical | Sex | | | no |
| 2 | cp | Feature | Categorical | | | | no |
| 3 | trestbps | Feature | Integer | | resting blood pressure (on admission to the ho... | mm Hg | no |
| 4 | chol | Feature | Integer | | serum cholestoral | mg/dl | no |
| 5 | fbs | Feature | Categorical | | fasting blood sugar > 120 mg/dl | | no |
| 6 | restecg | Feature | Categorical | | | | no |
| 7 | thalach | Feature | Integer | | maximum heart rate achieved | | no |
| 8 | exang | Feature | Categorical | | exercise induced angina | | no |
| 9 | oldpeak | Feature | Integer | | ST depression induced by exercise relative to ... | | no |


Here I remove the rows that contain missing values from both the features and the labels.

In [3]:
bad = X.isna().any(axis=1)
X = X[~bad]
y = y[~bad]
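
To see why the same mask is applied to both `X` and `y`, here is a minimal sketch on a made-up three-row DataFrame. The column names and values are illustrative only and are not taken from the heart disease data. Because the boolean mask is indexed by row, filtering both objects with it keeps features and labels aligned.

```python
import numpy as np
import pandas as pd

# Toy features and labels (hypothetical values, for illustration only).
X_toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})
y_toy = pd.DataFrame({"label": [0, 1, 2]})

# True for every row that has at least one missing feature value.
bad_toy = X_toy.isna().any(axis=1)

# Apply the same row mask to features and labels so they stay aligned.
X_clean = X_toy[~bad_toy]
y_clean = y_toy[~bad_toy]

print(X_clean.index.tolist())     # [0, 2]
print(y_clean["label"].tolist())  # [0, 2]
```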

Finally I convert the DataFrames to numpy arrays.

In [4]:
X = X.values
y = y.values.flatten()

The classification target is an integer from 0 to 4 indicating the severity of heart disease. Let's try fitting a linear model.

In [5]:
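import sklearn.linear_model     # explicit submodule imports, so that the
import sklearn.model_selection  # sklearn.<submodule> names used below resolve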

In [6]:
model = sklearn.linear_model.LogisticRegression().fit(X,y)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [7]:
model.score(X,y)

0.6094276094276094
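
The warning above suggests either raising `max_iter` or scaling the data. The following is a minimal sketch of the scaling option, not part of the lab itself. It uses synthetic data from `make_classification` as a stand-in so that it runs without a download, and the pipeline and the `max_iter=1000` value are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart disease data (assumption: the sizes
# are chosen only to resemble a small 5-class problem).
X_demo, y_demo = make_classification(
    n_samples=297, n_features=13, n_informative=8,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
# Put one feature on a much larger scale than the others.
X_demo[:, 0] *= 1000.0

# Standardize each feature, then fit logistic regression on the result.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)
print(scaled_model.score(X_demo, y_demo))
```

Wrapping the scaler and the classifier in one pipeline means the scaling statistics are always learned from whatever data is passed to `fit`, which matters once the data is split into train and test sets.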

### Exercises

1. Compute the $\mathbf{z}$ values for the classifier manually, i.e. compute

$$\mathbf{W}\mathbf{X}+\mathbf{b}.$$

*Hints*: 
- Use `.shape` to get the shape of a NumPy array.
- ``@`` is the matrix multiplication operator in NumPy.
- The actual computation will be a little different from what is written above. You will need a matrix transpose, which is `.T` in NumPy.
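
The hints can be tried out on a small made-up example before touching the real model. The shapes below (4 examples, 3 features, 2 classes) and the `_small` names are arbitrary and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_small = rng.normal(size=(4, 3))  # 4 examples, 3 features
W_small = rng.normal(size=(2, 3))  # 2 classes, 3 features
b_small = rng.normal(size=2)       # one bias per class

print(X_small.shape, W_small.shape, b_small.shape)

# W_small @ X_small would fail: (2, 3) @ (4, 3) has mismatched inner dimensions.
# Transposing X lines them up: (2, 3) @ (3, 4) -> (2, 4), one column per example.
Z_cols = W_small @ X_small.T

# Add the bias to every column by giving it shape (2, 1).
Z_cols = Z_cols + b_small[:, None]
print(Z_cols.shape)  # (2, 4)
```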

In [8]:
W = model.coef_       # shape (n_classes, n_features)
b = model.intercept_  # shape (n_classes,)

# X stores one example per row, so W @ X does not line up; W @ X.T does.
# W @ X.T has shape (n_classes, n_samples). Transposing the product gives
# one row per example, which lets b broadcast across the rows.
# Since (W @ X.T).T == X @ W.T, compute that form directly.
Z = X @ W.T + b
Z

array([[ 1.02965673,  0.44586375, -0.31956904, -0.33767823, -0.8182732 ],
       [-0.20355016,  0.06120372,  0.22317677,  0.39085623, -0.47168657],
       [-1.27332421,  0.49379511,  0.50156054,  0.36807311, -0.09010456],
       ...,
       [-0.77804289,  0.3704436 ,  0.24356547,  0.43608363, -0.27204981],
       [-1.11470271,  0.53178527,  0.1641042 ,  0.64461847, -0.22580523],
       [ 3.65848276,  0.56258452, -1.13350087, -1.59797108, -1.48959534]],
      shape=(297, 5))
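
One way to check a manual computation like this is to compare it against `decision_function`, which for scikit-learn's linear classifiers returns the per-class scores. The sketch below does this on synthetic data from `make_classification` (an assumption, so that it runs on its own); the names `clf`, `Z_manual`, and `Z_sklearn` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem used as a stand-in for the lab data.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Manual scores: one row per example, one column per class.
Z_manual = X_demo @ clf.coef_.T + clf.intercept_

# scikit-learn's own scores for the same inputs.
Z_sklearn = clf.decision_function(X_demo)

print(Z_manual.shape)                    # (200, 3)
print(np.allclose(Z_manual, Z_sklearn))  # True
```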

Print out the $\mathbf{z}$ values for the first example in the dataset along with its label. Determine whether the classifier classifies this example correctly.

In [9]:
idx = 0
print(Z[idx])
print(y[idx])

[ 1.02965673  0.44586375 -0.31956904 -0.33767823 -0.8182732 ]
0


The largest $\mathbf{z}$ value (about 1.03) is at index 0, so the classifier predicts class 0. The true label is also 0, so the first example in the dataset is classified correctly.
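
The same check can be written in code. This sketch copies the $\mathbf{z}$ values printed above and picks the index of the largest one. It also converts the scores to probabilities with a softmax, assuming the model uses the multinomial (softmax) formulation; that step is an illustration and not required by the exercise.

```python
import numpy as np

# z values for the first example, copied from the printed output.
z0 = np.array([1.02965673, 0.44586375, -0.31956904, -0.33767823, -0.8182732])

# The predicted class is the index of the largest score.
predicted = int(np.argmax(z0))
print(predicted)  # 0

# Softmax turns the scores into probabilities; subtracting the max first
# avoids overflow and does not change the result.
exp_z = np.exp(z0 - z0.max())
probs = exp_z / exp_z.sum()
print(probs.round(3))
```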

2. Use ``sklearn.model_selection.train_test_split`` to split ``X`` and ``y`` into 90% train and 10% test splits.  Note that this should be done in a single call to ``train_test_split``.

*Note*: Pass ``random_state=42`` to ``train_test_split`` to ensure you get the same result from random shuffling each time.


In [10]:
split = 0.10
# A single call splits X and y together, so the rows stay aligned.
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=split, random_state=42
)
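
Passing several arrays to one `train_test_split` call shuffles them with the same permutation, which is what keeps each feature row paired with its label. A minimal sketch on toy arrays (the values are made up so that misaligned rows would be easy to spot):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data where row i of X_toy is [i, i] and its label is i.
X_toy = np.repeat(np.arange(10), 2).reshape(10, 2)
y_toy = np.arange(10)

# One call shuffles once and applies the same permutation to both arrays.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.10, random_state=42
)

print(X_tr.shape, X_te.shape)            # (9, 2) (1, 2)
print(np.array_equal(X_tr[:, 0], y_tr))  # True
```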

Fit the model to the training split and calculate accuracy on the test split.  How does it compare to the previous accuracy value (when the model was trained and evaluated on the same data)?

In [11]:
model.fit(X_train, y_train)
model.score(X_test, y_test)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.7333333333333333

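The test accuracy (0.733) is higher than the accuracy obtained earlier when the model was trained and evaluated on the same data (0.609). This is surprising, because the model has not seen the test examples before. However, the test split holds only 30 of the 297 examples, so the estimate is noisy and the higher score may simply reflect a favorable split rather than better generalization.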
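
To get a feel for how noisy an accuracy measured on a 10% split can be, here is a back-of-the-envelope sketch. It assumes the test split has 30 examples (10% of 297, rounded up) and treats each test example as an independent coin flip, which is a simplification.

```python
import math

n_test = 30                    # assumed size of the 10% test split
accuracy = 0.7333333333333333  # test accuracy reported above

# Number of correctly classified test examples.
n_correct = round(accuracy * n_test)
print(n_correct)  # 22

# Rough standard error of the accuracy estimate under the
# independent-coin-flip (binomial) assumption.
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)
print(round(std_err, 3))  # 0.081
```

Under these assumptions a swing of several percentage points between different random splits would not be unusual.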

3. Run $k$-fold cross validation with $k=5$ and interpret the results (see `sklearn.model_selection.cross_val_score`).

In [12]:
sklearn.model_selection.cross_val_score(model, X, y, cv=5)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

array([0.6       , 0.6       , 0.52542373, 0.55932203, 0.59322034])

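Each value is the accuracy on one held-out fold after training on the other four. The scores range from 0.525 to 0.6, with a mean of about 0.576. This is lower than both the accuracy measured on the training data (0.609) and the accuracy on the single test split (0.733). Because every example is used for testing exactly once, the cross-validation average is a more reliable estimate of generalization than one small test split, and it suggests that the 0.733 result was an optimistic outlier.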
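
The fold scores can be summarized by their mean and spread. This short sketch copies the five scores from the output above; using the sample standard deviation as the measure of spread is one reasonable choice among several.

```python
import statistics

# Fold scores copied from the cross-validation output above.
scores = [0.6, 0.6, 0.52542373, 0.55932203, 0.59322034]

mean_score = statistics.mean(scores)
std_score = statistics.stdev(scores)  # sample standard deviation

print(round(mean_score, 3))  # 0.576
print(round(std_score, 3))   # 0.033
```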