<b><font size="6">Logistic Regression</font></b><a class="anchor"></a><a id='toc'></a><br>

**Step 1:** Import pandas and load the data

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split 
tugas = pd.read_csv('datasets/final_tugas.csv',index_col='Custid')

**Step 2:** Data partition
- Assign all the variables except `DepVar` to the object `data` (the cell below also drops `Income`)
- Assign the dependent variable to the object `target`
- Import the function needed to partition the dataset (`train_test_split`, already imported in Step 1)
- Split the data and the target into `X_train`, `X_test`, `y_train`, `y_test`, with `test_size` equal to 0.2, `random_state` equal to 5, and `stratify` equal to `target`

In [2]:
data = tugas.drop(['DepVar','Income'], axis=1)
target = tugas['DepVar']

In [3]:
# Split into train and test sets, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=5, stratify=target)
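
To see what `stratify` contributes, here is a minimal sketch on a synthetic, imbalanced target. The names `toy_X` and `toy_y` are hypothetical and are not part of the dataset used in this notebook. With stratification, the share of the minority class should be the same in the train and test partitions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 900 samples of class 0, 100 of class 1.
toy_X = pd.DataFrame({"feature": np.arange(1000)})
toy_y = pd.Series([0] * 900 + [1] * 100)

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=5, stratify=toy_y
)

# Share of class 1 in each partition; both match the 10% of the full data.
train_share = toy_y_train.mean()
test_share = toy_y_test.mean()
print(train_share, test_share)
```

Without `stratify`, the two shares would generally differ from split to split, which matters when the positive class is rare.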

**Step 3:** Import the model and create an instance

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html'>sklearn.linear_model.LogisticRegression(fit_intercept=True,...)</a>

__Definition:__ <br>
Applies a logistic regression classifier.

__Parameters:__ <br>
*fit_intercept*: whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations; <br>
...
</div>

In [19]:
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression()
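
As an illustration of the `fit_intercept` parameter, the sketch below fits two models on synthetic data generated with `make_classification` (an assumption made for the example, not the notebook's dataset). With `fit_intercept=False`, no intercept is estimated and `intercept_` stays at zero.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration.
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)

with_intercept = LogisticRegression(fit_intercept=True).fit(X_toy, y_toy)
no_intercept = LogisticRegression(fit_intercept=False).fit(X_toy, y_toy)

print(with_intercept.intercept_)  # one fitted value for a binary target
print(no_intercept.intercept_)    # zero: no intercept was estimated
```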

**Step 4:** Fit the model to the train data

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html'>sklearn.linear_model.LogisticRegression().fit(X,y,...)</a>

__Definition:__ <br>
Fit the logistic model to the training data.

__Parameters:__ <br>
X : The features (regressors) of the training dataset; <br>
y : The target of the training dataset; <br>
...
</div>

In [20]:
log_model.fit(X_train,y_train)

LogisticRegression()

**Step 5:** Use the model to predict the labels of the test data. Assign them to **y_pred**.

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html'>sklearn.linear_model.LogisticRegression().predict(X)</a>

__Definition:__ <br>
Predict class labels for samples in X.

__Parameters:__ <br>
X : Samples to predict; <br>
...

</div>

In [25]:
y_pred = log_model.predict(X_test)
y_pred

array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,

***Note:*** You can get the predicted probability of each class for every sample, instead of the assigned class, using the method `predict_proba()`. This also lets you apply a decision threshold other than the default of 0.5.

In [59]:
from sklearn.metrics import f1_score

# Each row holds [P(class 0), P(class 1)] for one test sample.
pred_prob = log_model.predict_proba(X_test)

# Custom threshold: predict 0 only when P(class 0) exceeds 0.592.
y_pred2 = [0 if p[0] > 0.592 else 1 for p in pred_prob]

f1_score(y_test, y_pred2)

0.4067796610169492
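
The same thresholding rule can be written without an explicit loop. The sketch below uses a small hand-made probability array (`toy_proba` is hypothetical, standing in for the output of `predict_proba`) and checks that `np.where` gives the same labels as the loop-style rule.

```python
import numpy as np

# Hypothetical probabilities: column 0 is P(class 0), column 1 is P(class 1).
toy_proba = np.array([
    [0.90, 0.10],
    [0.60, 0.40],
    [0.55, 0.45],
    [0.20, 0.80],
])
threshold = 0.592

# Loop-style rule: class 0 only when P(class 0) exceeds the threshold.
labels_loop = [0 if row[0] > threshold else 1 for row in toy_proba]

# Vectorized equivalent of the same rule.
labels_vec = np.where(toy_proba[:, 0] > threshold, 0, 1)

print(labels_loop)
print(labels_vec.tolist())
```

Raising the threshold on P(class 0) above 0.5 makes the model predict class 1 more often, which typically trades precision for recall on the positive class.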

***Note:*** As with linear regression, you can retrieve the fitted coefficients and intercept, through the attributes `coef_` and `intercept_`.
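
A minimal sketch of how the coefficients and intercept relate to the predicted probabilities, again on synthetic data (`X_toy`, `y_toy` and `toy_model` are hypothetical names). For a binary target, the probability of class 1 is the sigmoid of the linear score built from `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration.
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
toy_model = LogisticRegression().fit(X_toy, y_toy)

print(toy_model.coef_)       # shape (1, n_features) for a binary target
print(toy_model.intercept_)  # shape (1,)

# Linear score z = X @ coef + intercept; the sigmoid maps z to P(class 1).
z = X_toy @ toy_model.coef_.ravel() + toy_model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

sk_proba = toy_model.predict_proba(X_toy)[:, 1]
matches = np.allclose(manual_proba, sk_proba)
print(matches)
```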

**Step 6:** Evaluate the model

***Note:*** Since we are predicting a categorical target (classification), we evaluate the model with different metrics than we would use for a regression problem. Also, for logistic regression the R-squared cannot be obtained in the same way as in the linear case.

### The confusion matrix

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix'>sklearn.metrics.confusion_matrix(y_true, y_pred, ...)</a>

__Definition:__ <br>
Compute the confusion matrix to evaluate the accuracy of a classification.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated targets as returned by a classifier; <br>
...
</div>

In [14]:
from sklearn.metrics import confusion_matrix

In [15]:
confusion_matrix(y_test,y_pred)

array([[455,  10],
       [ 29,   6]], dtype=int64)

The confusion matrix in sklearn is presented in the following format, where rows are the true classes and columns are the predicted classes: <br>
[ [ TN  FP  ] <br>
    [ FN  TP ] ]
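
As a worked example, the four counts can be unpacked with `ravel()` and turned into the metrics by hand. The sketch below copies the matrix printed above into a NumPy array; the variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]].
cm = np.array([[455, 10],
               [29, 6]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # correct predictions over all samples
precision = tp / (tp + fp)       # how many predicted positives are real
recall = tp / (tp + fn)          # how many real positives were found

print(accuracy)   # 0.922
print(precision)  # 0.375
print(recall)     # about 0.171
```

The high accuracy next to the low recall reflects the class imbalance: only 35 of the 500 test samples are positive, so predicting mostly 0 is already right most of the time.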

### The accuracy score
<img src="img/accuracy.png" alt="Drawing" style="width: 300px;"/>

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score'>sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True,...)</a>

__Definition:__ <br>
Accuracy classification score.

__Interpretation:__ <br>
If normalize is True, then the best performance is 1. When normalize = False, then the best performance is the number of samples.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated targets as returned by a classifier; <br>
_normalize_: If False, return the number of correctly classified samples. Otherwise, return the fraction of correctly classified samples. <br>
...
</div>

In [17]:
from sklearn.metrics import accuracy_score

In [18]:
accuracy_score(y_test,y_pred)

0.922
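
To illustrate the `normalize` parameter, here is a small sketch on hand-made labels (`y_true_toy` and `y_pred_toy` are hypothetical). Four of the five predictions are correct, so the normalized score is 0.8 and the unnormalized score is the count of correct predictions.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: only the third prediction is wrong.
y_true_toy = [0, 1, 1, 0, 1]
y_pred_toy = [0, 1, 0, 0, 1]

fraction_correct = accuracy_score(y_true_toy, y_pred_toy)
count_correct = accuracy_score(y_true_toy, y_pred_toy, normalize=False)

print(fraction_correct)  # fraction of correct predictions (4 of 5)
print(count_correct)     # number of correct predictions
```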

### The precision
<img src="img/precision.png" alt="Drawing" style="width: 200px;"/>

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score'>sklearn.metrics.precision_score(y_true, y_pred, ...)</a>

__Definition:__ <br>
Compute the precision.

__Interpretation:__ <br>
The best value is 1, and the worst value is 0.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated targets as returned by a classifier; <br>
...
</div>

In [19]:
from sklearn.metrics import precision_score

In [20]:
precision_score(y_test,y_pred)

0.375

### The recall
<img src="img/recall.png" alt="Drawing" style="width: 180px;"/>

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score'>sklearn.metrics.recall_score(y_true, y_pred, ...)</a>

__Definition:__ <br>
Compute the recall.

__Interpretation:__ <br>
The best value is 1 and the worst value is 0.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated targets as returned by a classifier; <br>
...
</div>

In [21]:
from sklearn.metrics import recall_score

In [22]:
recall_score(y_test,y_pred)

0.17142857142857143
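
Argument order matters for precision and recall: the sklearn metric functions expect `y_true` first and `y_pred` second. A short sketch on hand-made labels (hypothetical names) shows that swapping the two arguments of `recall_score` returns the precision instead.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 4 real positives, of which only 1 is found,
# plus 1 false positive.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

precision = precision_score(y_true_toy, y_pred_toy)    # TP / (TP + FP) = 1 / 2
recall = recall_score(y_true_toy, y_pred_toy)          # TP / (TP + FN) = 1 / 4
swapped_recall = recall_score(y_pred_toy, y_true_toy)  # arguments reversed

print(precision, recall, swapped_recall)
```

With the arguments reversed, false positives and false negatives trade places, so `swapped_recall` equals `precision` rather than `recall`.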

### The F1 Score
<img src="img/f1.png" alt="Drawing" style="width: 270px;"/>

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score'>sklearn.metrics.f1_score(y_true, y_pred, ...)</a>

__Definition:__ <br>
Compute the F1 score, also known as balanced F-score or F-measure.

__Interpretation:__ <br>
F1 score reaches its best value at 1 and worst score at 0.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated targets as returned by a classifier; <br>
...
</div>

In [7]:
from sklearn.metrics import f1_score

In [34]:
f1_score(y_test,y_pred)


0.23076923076923078
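
The F1 score is the harmonic mean of precision and recall. The sketch below checks this on hand-made labels (hypothetical names, not the notebook's data) by comparing the manual formula with `f1_score`.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: TP = 1, FN = 3, FP = 1.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

p = precision_score(y_true_toy, y_pred_toy)
r = recall_score(y_true_toy, y_pred_toy)

# Harmonic mean of precision and recall.
manual_f1 = 2 * p * r / (p + r)
sk_f1 = f1_score(y_true_toy, y_pred_toy)

print(manual_f1, sk_f1)
```

Because the harmonic mean is pulled toward the smaller of the two values, a model with high precision but low recall (or the reverse) still gets a low F1.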

                            OLS Regression Results                            
Dep. Variable:                 DepVar   R-squared:                       0.242
Model:                            OLS   Adj. R-squared:                  0.234
Method:                 Least Squares   F-statistic:                     28.68
Date:                Tue, 12 Oct 2021   Prob (F-statistic):          2.25e-102
Time:                        16:35:38   Log-Likelihood:                 170.94
No. Observations:                2000   AIC:                            -295.9
Df Residuals:                    1977   BIC:                            -167.1
Df Model:                          22                                         
Covariance Type:            nonrobust                                         
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
const                     

In [63]:
import pandas as pd

# Enter desired years of data
YEARS = [2019, 2018, 2017]

frames = []

for year in YEARS:
    # low_memory=False eliminates a warning
    year_data = pd.read_csv('https://github.com/nflverse/nflfastR-data/blob/master/data/'
                            'play_by_play_' + str(year) + '.csv.gz?raw=True',
                            compression='gzip', low_memory=False)
    frames.append(year_data)

# DataFrame.append was removed in pandas 2.0, so concatenate once instead.
# sort=True alphabetically sorts columns; ignore_index=True gives each row a unique index.
data = pd.concat(frames, sort=True, ignore_index=True)

IncompleteRead: IncompleteRead(4300336 bytes read, 15207478 more expected)

In [61]:
data

NameError: name 't' is not defined