
# Logistic Regression and Classification Error Metrics

## Introduction

We will be using the [Human Activity Recognition with Smartphones](https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones) database, which was built from recordings of study participants performing activities of daily living (ADL) while carrying a smartphone with embedded inertial sensors. The objective is to classify each record as one of the six activities performed (walking, walking upstairs, walking downstairs, sitting, standing, and laying).

Each record in the dataset provides:

- Triaxial acceleration from the accelerometer (total acceleration) and the estimated body acceleration. 
- Triaxial Angular velocity from the gyroscope. 
- A 561-feature vector with time and frequency domain variables. 
- Its activity label. 

More information about the features is available on the website linked above or at https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones

## Question 1

Import the data and do the following:

* Examine the data types--there are many columns, so it might be wise to use value counts
* Determine if the floating point values need to be scaled
* Determine the breakdown of each activity
* Encode the activity label as an integer

In [44]:
# The CSV is expected in the same folder as this notebook, so no data path is needed.
# If the file lives elsewhere, set the path according to its location on your system, e.g.:
# data_path = ['..', 'data']
# filepath = os.sep.join(data_path + ['Human_Activity_Recognition_Using_Smartphones_Data.csv'])

from __future__ import print_function
import os
import pandas as pd
import numpy as np

filepath = 'Human_Activity_Recognition_Using_Smartphones_Data.csv'
data = pd.read_csv(filepath, sep=',')

FileNotFoundError: [Errno 2] No such file or directory: 'Human_Activity_Recognition_Using_Smartphones_Data.csv'

The data columns are all floats except for the activity label.

In [45]:
data.dtypes.value_counts()

NameError: name 'data' is not defined

In [46]:
data.dtypes.tail()

NameError: name 'data' is not defined

The data are all scaled from -1 (minimum) to 1.0 (maximum).

In [47]:
data.iloc[:, :-1].min().value_counts()

NameError: name 'data' is not defined

In [48]:
data.iloc[:, :-1].max().value_counts()

NameError: name 'data' is not defined

Examine the breakdown of activities--they are relatively balanced.

In [49]:
data.Activity.value_counts()

NameError: name 'data' is not defined

Scikit-learn classifiers won't accept a sparse matrix for the prediction column. Thus, either `LabelEncoder` needs to be used to convert the activity labels to integers, or if `DictVectorizer` is used, the resulting matrix must be converted to a non-sparse array.  
Use `LabelEncoder` to `fit_transform` the "Activity" column, and look at 5 random values.

In [50]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
data['Activity'] = le.fit_transform(data.Activity)
data['Activity'].sample(5)

NameError: name 'data' is not defined
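To make the encoding step concrete, here is a minimal sketch of how `LabelEncoder` maps activity names to integers and back. The label strings below are a hypothetical sample chosen for illustration, not values read from the CSV.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of activity labels (assumed spelling, for illustration only)
activities = ['WALKING', 'SITTING', 'LAYING', 'STANDING',
              'WALKING_UPSTAIRS', 'WALKING_DOWNSTAIRS', 'SITTING']

le_demo = LabelEncoder()
codes = le_demo.fit_transform(activities)

# classes_ holds the sorted unique labels; the label at position i gets code i
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the integer codes
recovered = le_demo.inverse_transform(codes)
print(list(recovered))
```

Because the codes follow the sorted order of the labels, keeping the fitted encoder around (as the notebook does with `le`) is what allows predictions to be translated back into activity names later.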

## Question 2

* Calculate the correlations between the independent variables (the features).
* Create a histogram of the correlation values
* Identify those that are most correlated (either positively or negatively).

In [51]:
# Calculate the correlation values
feature_cols = data.columns[:-1]
corr_values = data[feature_cols].corr()

# Simplify by emptying all the data below the diagonal
tril_index = np.tril_indices_from(corr_values)

# Make the unused values NaNs
for coord in zip(*tril_index):
    corr_values.iloc[coord[0], coord[1]] = np.nan  # np.NaN was removed in NumPy 2.0
    
# Stack the data and convert to a data frame
corr_values = (corr_values
               .stack()
               .to_frame()
               .reset_index()
               .rename(columns={'level_0': 'feature1',
                                'level_1': 'feature2',
                                0: 'correlation'}))

# Get the absolute values for sorting
corr_values['abs_correlation'] = corr_values.correlation.abs()

NameError: name 'data' is not defined
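Since the cell above cannot run without the CSV, the same idea can be tried on synthetic data. The sketch below is one possible vectorized variant, not the notebook's exact method: it masks everything except the strict upper triangle with `DataFrame.where` instead of looping over coordinates. The frame `demo` and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=list('abcd'))
# Column 'e' is built to be strongly correlated with 'a'
demo['e'] = 0.9 * demo['a'] + rng.normal(scale=0.1, size=200)

corr = demo.corr()

# Boolean mask that is True strictly above the diagonal
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

pairs = (corr.where(upper)            # diagonal and lower triangle become NaN
             .stack()
             .dropna()                # explicit, so the result does not depend on stack defaults
             .rename_axis(['feature1', 'feature2'])
             .reset_index(name='correlation'))

pairs['abs_correlation'] = pairs['correlation'].abs()
pairs = pairs.sort_values('abs_correlation', ascending=False).reset_index(drop=True)
print(pairs.head())
```

Each unordered pair of features appears exactly once, so 5 columns yield 10 rows, and the deliberately correlated pair (`a`, `e`) sorts to the top.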

In [52]:
data[feature_cols]

NameError: name 'data' is not defined

A histogram of the absolute value correlations.

In [53]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [54]:
sns.set_context('talk')
sns.set_style('white')
sns.set_palette('dark')

ax = corr_values.abs_correlation.hist(bins=50)

ax.set(xlabel='Absolute Correlation', ylabel='Frequency');

NameError: name 'corr_values' is not defined

In [55]:
# The most highly correlated values
corr_values.sort_values('correlation', ascending=False).query('abs_correlation>0.8')

NameError: name 'corr_values' is not defined

## Question 3

* Split the data into train and test data sets. This can be done using any method, but consider using Scikit-learn's `StratifiedShuffleSplit` to maintain the same ratio of target classes in both sets.
* Regardless of methods used to split the data, compare the ratio of classes in both the train and test splits.


In [56]:
from sklearn.model_selection import StratifiedShuffleSplit

# Get the split indexes
strat_shuf_split = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)

train_idx, test_idx = next(strat_shuf_split.split(data[feature_cols], data.Activity))

# Create the dataframes
X_train = data.loc[train_idx, feature_cols]
y_train = data.loc[train_idx, 'Activity']
print(X_train)

X_test  = data.loc[test_idx, feature_cols]
y_test  = data.loc[test_idx, 'Activity']

NameError: name 'data' is not defined

In [57]:
y_train.value_counts(normalize=True)

NameError: name 'y_train' is not defined

In [58]:
y_test.value_counts(normalize=True)

NameError: name 'y_test' is not defined
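As a quick check of what stratification does, the following sketch splits a deliberately imbalanced synthetic label column and compares the class ratios of the two splits. All names (`X_demo`, `y_demo`, `splitter`) are illustrative assumptions. It uses `.iloc` because `split` returns positional indices; the notebook's `.loc` is equivalent only while the DataFrame keeps its default integer index.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(42)

# Synthetic labels with a 60% / 30% / 10% class balance
y_demo = pd.Series(np.repeat([0, 1, 2], [600, 300, 100]))
X_demo = pd.DataFrame(rng.normal(size=(1000, 3)))

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(X_demo, y_demo))

# Class proportions in each split, ordered by class label
train_ratio = y_demo.iloc[train_idx].value_counts(normalize=True).sort_index()
test_ratio = y_demo.iloc[test_idx].value_counts(normalize=True).sort_index()

print(pd.DataFrame({'train': train_ratio, 'test': test_ratio}))
```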

### Analyze Questions 4 to 9
## Question 4

* Fit a logistic regression model without any regularization using all of the features. Be sure to read the documentation about fitting a multi-class model so you understand the coefficient output. Store the model.
* Using cross validation to determine the hyperparameters, fit models using L1, and L2 regularization. Store each of these models as well. Note the limitations on multi-class models, solvers, and regularizations. The regularized models, in particular the L1 model, will probably take a while to fit.

In [59]:
from sklearn.linear_model import LogisticRegression

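# Standard logistic regression.
# Note: scikit-learn's default still applies L2 regularization with C=1.0;
# a very large C (e.g. C=1e10) approximates an unregularized fit.
lr = LogisticRegression().fit(X_train, y_train)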

NameError: name 'X_train' is not defined

In [60]:
from sklearn.linear_model import LogisticRegressionCV

# L1 regularized logistic regression
lr_l1 = LogisticRegressionCV(Cs=10, cv=4, penalty='l1', solver='liblinear').fit(X_train, y_train)

NameError: name 'X_train' is not defined

In [61]:
lr_l1.coef_

NameError: name 'lr_l1' is not defined

### Try with different solvers like ‘newton-cg’, ‘lbfgs’, ‘sag’, ‘saga’ and give your observations

In [62]:
# L2 regularized logistic regression
lr_l2 = LogisticRegressionCV(Cs=10, cv=4, penalty='l2').fit(X_train, y_train) #lbfgs (default)

NameError: name 'X_train' is not defined

#### Observations and Analysis
Question 4 imports the estimators and fits the logistic regression models so that they can be compared.
There is little else to analyze here, apart from noting that the two regularized models use different solvers: `liblinear` for L1 and `lbfgs` (the default) for L2.
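One way to approach the solver comparison is sketched below on a small synthetic problem, since the real data is not loaded here. The dataset parameters and the `solver_results` dictionary are assumptions for illustration, and timings will vary by machine. The loop relies on the default L2 penalty, which all four solvers support; of the solvers discussed in this notebook, only `liblinear` and `saga` accept an L1 penalty.

```python
import time
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small synthetic 3-class problem standing in for the real data
X_syn, y_syn = make_classification(n_samples=600, n_features=20, n_informative=8,
                                   n_classes=3, n_clusters_per_class=1,
                                   class_sep=2.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_syn, y_syn, test_size=0.3,
                                      stratify=y_syn, random_state=0)

solver_results = {}
for solver in ['newton-cg', 'lbfgs', 'sag', 'saga']:
    with warnings.catch_warnings():
        # sag/saga may stop before converging on unscaled data; silence that here
        warnings.simplefilter('ignore', ConvergenceWarning)
        start = time.perf_counter()
        model = LogisticRegression(solver=solver, max_iter=500).fit(Xtr, ytr)
        elapsed = time.perf_counter() - start
    solver_results[solver] = (model.score(Xte, yte), elapsed)

for solver, (acc, secs) in solver_results.items():
    print(f'{solver:10s} accuracy={acc:.3f} fit_time={secs:.3f}s')
```

On a problem this small the accuracies are usually very close, so the more interesting observation tends to be fit time and whether a solver needs more iterations to converge.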


## Question 5
### 5.1
Compare the magnitudes of the coefficients for each of the models. If one-vs-rest fitting was used, each set of coefficients can be plotted separately. 

In [63]:
np.shape(lr.coef_)

NameError: name 'lr' is not defined

In [64]:
# Combine all the coefficients into a dataframe
coefficients = list()

coeff_labels = ['lr', 'l1', 'l2']
coeff_models = [lr, lr_l1, lr_l2]

for lab,mod in zip(coeff_labels, coeff_models):
    coeffs = mod.coef_
    levels = [[lab], [0, 1, 2, 3, 4, 5]]
    # `codes` was named `labels` before pandas 0.24
    codes = [[0, 0, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5]]
    coeff_label = pd.MultiIndex(levels=levels, codes=codes)
    coefficients.append(pd.DataFrame(coeffs.T, columns=coeff_label))

coefficients = pd.concat(coefficients, axis=1)

coefficients.sample(10)

NameError: name 'lr' is not defined

### 5.2 
Prepare six separate plots for each of the multi-class coefficients.

In [65]:
fig, axList = plt.subplots(nrows=3, ncols=2)
axList = axList.flatten()
fig.set_size_inches(10,10)


for loc, ax in enumerate(axList):
    # Use a new name so the original `data` DataFrame is not overwritten
    coef_set = coefficients.xs(loc, level=1, axis=1)
    coef_set.plot(marker='o', ls='', ms=2.0, ax=ax, legend=False)
    
    if ax is axList[0]:
        ax.legend(loc=4)
        
    ax.set(title='Coefficient Set '+str(loc))

plt.tight_layout()


AttributeError: 'list' object has no attribute 'xs'

#### Observations and Analysis

Part 5.1 builds a table to compare the coefficients of the three fitted models: `lr` is the standard logistic regression, `l1` is the L1-regularized model, and `l2` is the L2-regularized model, which also uses a different solver.

Each feature has 6 coefficients, one per class. For each of the 500-odd features (561 in this dataset), these coefficients indicate how strongly the feature pushes a sample towards each class.

The table therefore has columns labelled 0 to 5 for each model, one per class, and one row per feature holding that feature's coefficient for the class.

Part 5.2 plots the coefficient values, which range from roughly -5 to 5, for the three models, with the feature index on the x axis. These plots show which coefficients are more spread out and which stay close together across the three models, making the differences between them clear and showing which model has larger or smaller disparities when classifying.


coef_ : ndarray of shape (1, n_features) or (n_classes, n_features)
Coefficient of the features in the decision function.

coef_ is of shape (1, n_features) when the given problem is binary.

Here `coef_` returns an array of shape (n_classes, n_features),
and the number of classes is 6, hence the 6 sets of coefficients.
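The shape of `coef_` can be checked on a small synthetic six-class problem. This is only a sketch: the data comes from `make_classification`, and the names `X_six`, `y_six` and `clf` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with 6 classes and 12 features
X_six, y_six = make_classification(n_samples=600, n_features=12, n_informative=6,
                                   n_classes=6, n_clusters_per_class=1,
                                   random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_six, y_six)

# One row of coefficients per class, one column per feature
print(clf.coef_.shape)
# One intercept per class
print(clf.intercept_.shape)
```

Transposing `coef_`, as the notebook does with `coeffs.T`, gives one row per feature and one column per class, which is the layout used in the comparison table.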



In [66]:
np.shape(lr.coef_)

NameError: name 'lr' is not defined

## Question 6

* Predict and store the class for each model.
* Also store the probability for the predicted class for each model. 

In [67]:
# Predict the class and the probability for each

y_pred = list()
y_prob = list()

coeff_labels = ['lr', 'l1', 'l2']
coeff_models = [lr, lr_l1, lr_l2]

for lab,mod in zip(coeff_labels, coeff_models):
    y_pred.append(pd.Series(mod.predict(X_test), name=lab))
    y_prob.append(pd.Series(mod.predict_proba(X_test).max(axis=1), name=lab))
    
y_pred = pd.concat(y_pred, axis=1)
y_prob = pd.concat(y_prob, axis=1)

y_pred.head()


NameError: name 'lr' is not defined

In [68]:
y_prob.head()

AttributeError: 'list' object has no attribute 'head'

#### Observations and Analysis

This cell predicts on the validation/test set using the models fitted on the training set and builds two tables: one with the class predicted by each model, and one with the probability each model assigned to that predicted class for each set of features.

Note that the values 0, 1, 2, 3, 4, 5 seen in Questions 4 and 5 appear to correspond to the predicted class labels.

I believe the probability is a useful point of comparison: it can help judge whether a prediction is likely to be a "false positive" or a "false negative" when checking later how well a case was classified.
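The relationship between `predict` and `predict_proba` can be verified directly. In this sketch on synthetic data (all names are illustrative), the maximum of each probability row is the confidence stored in `y_prob`, and the column where it occurs is the predicted class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_p, y_p = make_classification(n_samples=500, n_features=10, n_informative=5,
                               n_classes=3, n_clusters_per_class=1,
                               random_state=1)
clf_p = LogisticRegression(max_iter=1000).fit(X_p, y_p)

proba = clf_p.predict_proba(X_p[:5])   # one row per sample, one column per class
pred = clf_p.predict(X_p[:5])

confidence = proba.max(axis=1)         # probability of the predicted class
row_sums = proba.sum(axis=1)           # each row is a probability distribution

# The predicted label is the class whose column holds the row maximum
pred_from_proba = clf_p.classes_[proba.argmax(axis=1)]

print(np.round(proba, 3))
print(pred, np.round(confidence, 3))
```

A low maximum probability flags samples where the model is close to undecided between classes, which are the natural candidates to inspect when looking for misclassifications.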





## Question 7

For each model, calculate the following error metrics: 

* accuracy
* precision
* recall
* fscore
* confusion matrix

Decide how to combine the multi-class metrics into a single value for each model.
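One common way to combine per-class scores is a support-weighted average, which is what `average='weighted'` computes. The sketch below uses a tiny hand-made set of labels (an assumption for illustration) to show that the weighted precision equals the per-class precisions averaged with the class counts as weights, and contrasts it with the unweighted macro average.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Tiny hand-made example: 4 samples of class 0, 3 of class 1, 3 of class 2
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred_demo = [0, 0, 1, 0, 1, 1, 2, 2, 2, 0]

# average=None returns one value per class plus the class counts (support)
per_class_p, per_class_r, per_class_f, support = precision_recall_fscore_support(
    y_true_demo, y_pred_demo, average=None)

# Weighted average: each class contributes in proportion to its support
manual_weighted = np.average(per_class_p, weights=support)
weighted_p, _, _, _ = precision_recall_fscore_support(
    y_true_demo, y_pred_demo, average='weighted')

# Macro average: every class counts equally, regardless of size
macro_p, _, _, _ = precision_recall_fscore_support(
    y_true_demo, y_pred_demo, average='macro')

print(per_class_p, support)
print(manual_weighted, weighted_p, macro_p)
```

With roughly balanced classes, as in this dataset, weighted and macro averages should be close; they diverge when some classes are much rarer than others.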

In [69]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score
from sklearn.preprocessing import label_binarize

metrics = list()
cm = dict()

for lab in coeff_labels:

    # Precision, recall, f-score from the multi-class support function
    precision, recall, fscore, _ = score(y_test, y_pred[lab], average='weighted')
    
    # The usual way to calculate accuracy
    accuracy = accuracy_score(y_test, y_pred[lab])
    
    # ROC-AUC scores can be calculated by binarizing the data
    auc = roc_auc_score(label_binarize(y_test, classes=[0,1,2,3,4,5]),
              label_binarize(y_pred[lab], classes=[0,1,2,3,4,5]), 
              average='weighted')
    
    # Last, the confusion matrix
    cm[lab] = confusion_matrix(y_test, y_pred[lab])
    
    metrics.append(pd.Series({'precision':precision, 'recall':recall, 
                              'fscore':fscore, 'accuracy':accuracy,
                              'auc':auc}, 
                             name=lab))

metrics = pd.concat(metrics, axis=1)


NameError: name 'y_test' is not defined

In [70]:
#Run the metrics
metrics

[]

#### Observations and Analysis

This question simply takes the predictions from Question 6, computes the metrics, and arranges them in a table so they are easier to read. The multi-class metrics appear to be combined into a single value per model through weighted averaging (`average='weighted'`), with ROC AUC reported alongside.

From Wikipedia: "A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied." The scikit-learn documentation says the multi-class comparison is made in one of two ways: either the classes are compared pair by pair, or each class is contrasted with the rest. It is a comparison between false alarms and true positives.
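The notebook computes ROC AUC from binarized hard predictions. As a point of comparison, the sketch below also computes it from predicted probabilities using `roc_auc_score` with `multi_class='ovr'`, which lets the score reflect how well the classes are ranked across thresholds. The synthetic data and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

X_a, y_a = make_classification(n_samples=600, n_features=10, n_informative=5,
                               n_classes=3, n_clusters_per_class=1,
                               random_state=2)
Xa_tr, Xa_te, ya_tr, ya_te = train_test_split(X_a, y_a, test_size=0.3,
                                              stratify=y_a, random_state=2)
clf_a = LogisticRegression(max_iter=1000).fit(Xa_tr, ya_tr)

classes = [0, 1, 2]

# Approach used in the notebook: binarize true labels and hard predictions
auc_from_labels = roc_auc_score(
    label_binarize(ya_te, classes=classes),
    label_binarize(clf_a.predict(Xa_te), classes=classes),
    average='weighted')

# Alternative: one-vs-rest AUC computed from the class probabilities
auc_from_proba = roc_auc_score(ya_te, clf_a.predict_proba(Xa_te),
                               multi_class='ovr', average='weighted')

print(auc_from_labels, auc_from_proba)
```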


## Question 8

Display or plot the confusion matrix for each model.

In [71]:

fig, axList = plt.subplots(nrows=2, ncols=2)
axList = axList.flatten()
fig.set_size_inches(12, 10)

axList[-1].axis('off')

for ax,lab in zip(axList[:-1], coeff_labels):
    sns.heatmap(cm[lab], ax=ax, annot=True, fmt='d');
    ax.set(title=lab);
    
plt.tight_layout()


KeyError: 'lr'

#### Observations and Analysis

This cell simply plots each model's confusion matrix as a heatmap of counts; there is nothing more to it.
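A confusion matrix carries more information than the heatmap alone suggests. In scikit-learn's convention rows are true classes and columns are predicted classes, so per-class recall and precision can be read off by normalizing the diagonal by row and column sums. The labels below are a small hand-made example, not notebook output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true_cm = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred_cm = [0, 0, 1, 0, 1, 1, 2, 2, 2, 0]

# Rows are true classes, columns are predicted classes
cm_demo = confusion_matrix(y_true_cm, y_pred_cm)
print(cm_demo)

# Recall: correct predictions divided by the number of true samples per class
recall_from_cm = cm_demo.diagonal() / cm_demo.sum(axis=1)

# Precision: correct predictions divided by the number of predictions per class
precision_from_cm = cm_demo.diagonal() / cm_demo.sum(axis=0)

print(recall_from_cm, precision_from_cm)
```

Off-diagonal cells show which pairs of classes get confused with each other, which is what to look for when comparing the heatmaps of different models.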


## Question 9
 Identify highly correlated columns and drop those columns before building models

In [72]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import VarianceThreshold

# Keep only features whose variance exceeds 0.7 * (1 - 0.7) = 0.21

sel = VarianceThreshold(threshold=(.7 * (1 - .7)))

data2 = pd.concat([X_train, X_test])
data_new = pd.DataFrame(sel.fit_transform(data2))


data_y = pd.concat([y_train, y_test])

from sklearn.model_selection import train_test_split

# Split features and labels in a single call so their rows stay aligned;
# two separate calls shuffle differently and pair features with the wrong labels
X_new, X_test_new, Y_new, Y_test_new = train_test_split(data_new, data_y)

NameError: name 'X_train' is not defined

#### Observations and Analysis

The cell creates `sel`, a variance-threshold selector, and then calls `fit_transform` on the data to remove features. Note that `VarianceThreshold` drops features whose variance falls below the threshold; it does not measure correlation between features.

Checking the shape of `X_new` gives many rows by 42 columns, so the data went from 500-odd features to only 42.
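Because Question 9 asks for highly correlated columns to be dropped, a correlation-based filter is one alternative to the variance threshold. The sketch below is an illustration, not the notebook's method: `drop_correlated` is a hypothetical helper, the 0.8 cutoff mirrors the `abs_correlation>0.8` query used in Question 2, and the toy frame is synthetic.

```python
import numpy as np
import pandas as pd


def drop_correlated(frame, threshold=0.8):
    """Hypothetical helper: drop one column from every pair with |corr| > threshold."""
    corr_abs = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is examined once
    mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
    upper = corr_abs.where(mask)
    # A column is dropped if it is highly correlated with any earlier column
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop


rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(300, 3)), columns=['f1', 'f2', 'f3'])
# Near-duplicate of f1, so it should be removed
toy['f1_copy'] = toy['f1'] + rng.normal(scale=0.05, size=300)

reduced, dropped = drop_correlated(toy, threshold=0.8)
print(dropped)
print(reduced.columns.tolist())
```

In practice the columns to drop would be chosen from the training split only and then removed from both splits, so the test data does not influence feature selection.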


 Repeat model building with the new training data after removing highly correlated columns

In [73]:
# Try standard, L1 and L2 Logistic regression
lr2 = LogisticRegression().fit(X_new, Y_new)




NameError: name 'X_new' is not defined

In [74]:
lr2_l1 = LogisticRegressionCV(Cs=1000, cv=4, penalty='l1', solver='liblinear').fit(X_new, Y_new)


NameError: name 'X_new' is not defined

In [75]:
lr2_l2 = LogisticRegressionCV(Cs=10, cv=4, penalty='l2').fit(X_new, Y_new) #lbfgs (default)


NameError: name 'X_new' is not defined

In [76]:
#Try with different solvers like ‘newton-cg’, ‘lbfgs’, ‘sag’, ‘saga’ and give your observations
lr2_newton = LogisticRegressionCV(Cs=10, cv=4, penalty='l2', solver='newton-cg').fit(X_new, Y_new) # newton-cg instead of the default lbfgs


NameError: name 'X_new' is not defined

In [77]:
Y_new

NameError: name 'Y_new' is not defined

### Implement 10 to 13
## Question 10

Compare the magnitudes of the coefficients for each of the models. If one-vs-rest fitting was used, each set of coefficients can be plotted separately. 

In [78]:

coefficients2 = list()

coeff_labels2 = ['lr2', 'lr2_l1', 'lr2_l2','lr2_newton']
coeff_models2 = [lr2, lr2_l1 , lr2_l2, lr2_newton]


for lab,mod in zip(coeff_labels2, coeff_models2):
    coeffs = mod.coef_
    levels = [[lab], [0, 1, 2, 3, 4, 5]]
    # `codes` was named `labels` before pandas 0.24
    codes = [[0, 0, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5]]
    coeff_label2 = pd.MultiIndex(levels=levels, codes=codes)
    coefficients2.append(pd.DataFrame(coeffs.T, columns=coeff_label2))

coefficients2 = pd.concat(coefficients2, axis=1)
coefficients2.sample(10)


NameError: name 'lr2' is not defined

Prepare six separate plots for each of the multi-class coefficients.

In [79]:
# try the plots
fig, axList = plt.subplots(nrows=3, ncols=2)
axList = axList.flatten()
fig.set_size_inches(10,10)


for loc, ax in enumerate(axList):
    # Use a new name so the original `data` DataFrame is not overwritten
    coef_set = coefficients2.xs(loc, level=1, axis=1)
    coef_set.plot(marker='o', ls='', ms=2.0, ax=ax, legend=False)
    
    if ax is axList[0]:
        ax.legend(loc=4)
        
    ax.set(title='Coefficient Set '+str(loc))

plt.tight_layout()

AttributeError: 'list' object has no attribute 'xs'

## Question 11

* Predict and store the class for each model.
* Also store the probability for the predicted class for each model. 

In [80]:
# Predict the class and the probability for each
# Combine the predictions and probabilities into dataframes for comparison
y_pred2 = list()
y_prob2 = list()

for lab,mod in zip(coeff_labels2, coeff_models2):
    y_pred2.append(pd.Series(mod.predict(X_test_new), name=lab))
    y_prob2.append(pd.Series(mod.predict_proba(X_test_new).max(axis=1), name=lab))
    
y_pred2 = pd.concat(y_pred2, axis=1)
y_prob2 = pd.concat(y_prob2, axis=1)
y_pred2.head()


NameError: name 'coeff_models2' is not defined

In [81]:
y_prob2.head()


AttributeError: 'list' object has no attribute 'head'

## Question 12

For each model, calculate the following error metrics: 

* accuracy
* precision
* recall
* fscore
* confusion matrix

Decide how to combine the multi-class metrics into a single value for each model.

In [82]:
# Calculate the error metrics as listed above

metrics2 = list()
cm2 = dict()

for lab in coeff_labels2:

    # Precision, recall, f-score from the multi-class support function
    precision, recall, fscore, _ = score(Y_test_new, y_pred2[lab], average='weighted')
    
    # The usual way to calculate accuracy
    accuracy = accuracy_score(Y_test_new, y_pred2[lab])
    
    # ROC-AUC scores can be calculated by binarizing the data
    auc = roc_auc_score(label_binarize(Y_test_new, classes=[0,1,2,3,4,5]),
              label_binarize(y_pred2[lab], classes=[0,1,2,3,4,5]), 
              average='weighted')
    
    # Last, the confusion matrix
    cm2[lab] = confusion_matrix(Y_test_new, y_pred2[lab])
    
    metrics2.append(pd.Series({'precision':precision, 'recall':recall, 
                              'fscore':fscore, 'accuracy':accuracy,
                              'auc':auc}, 
                             name=lab))

metrics2 = pd.concat(metrics2, axis=1)

NameError: name 'Y_test_new' is not defined

In [83]:
#Run the metrics
metrics2

[]

## Question 13

Display or plot the confusion matrix for each model.

In [84]:
#plot the confusion matrix
fig, axList = plt.subplots(nrows=2, ncols=2)
axList = axList.flatten()
fig.set_size_inches(12, 10)

for ax,lab in zip(axList, coeff_labels2):
    sns.heatmap(cm2[lab], ax=ax, annot=True, fmt='d');
    ax.set(title=lab);
    
plt.tight_layout()

KeyError: 'lr2'

In [85]:
# Perform a comparison of the outputs between Question 7 and 12 and give your observation

In [86]:
# Perform a comparison of the outputs between Question 8 and 13 and give your observation