# Model a Multinomial Logistic Regression in Python

This notebook will perform multinomial logistic regression on our sample data.  We have a decent amount of data, though there is some skew that we'll have to watch out for:  two of our classes are under-represented in the dataset.

For prior analysis, we've used `pandas`, `numpy`, and a variety of functions from `scikit-learn`.  Now we'll add two more tools from `sklearn.model_selection`:  the `RepeatedStratifiedKFold` class and the `cross_val_score` function.  These aren't mandatory for multinomial logistic regression, but they will give us a better idea of how the model fares.

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

## Data Preparation

We will read in the same dataset that we used in R, so missing values show up as `NA` rather than as blanks.

In [None]:
df = pd.read_csv("../data/ExtendedAttackData.csv")

The good news is that `pandas` interprets those `NA` values as `NaN` when reading the CSV, just as we want.

In [None]:
df.head(5)
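To confirm the `NA` handling without relying on our real file, here is a minimal sketch using a small in-memory CSV.  The column names and values are made up for illustration and are not taken from `ExtendedAttackData.csv`.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; the column names are invented for this example.
csv_text = "duration,protocol\n10,tcp\nNA,udp\n5,NA\n"
toy = pd.read_csv(io.StringIO(csv_text))

# Count the missing values that pandas detected in each column.
missing_counts = toy.isna().sum()
print(missing_counts)
```

Running `df.isna().sum()` against our real data frame would give the same kind of per-column summary.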

Before performing any string imputation or column transformations, let's take `AttackType` and make it our label.  We'll also drop `malicious` from the feature set, as it won't be necessary.

In [None]:
y = df['AttackType']
x = df.drop(['AttackType', 'malicious'], axis=1)

Just as before, we'll create an ordinal encoder and transform string values into ordinals.  Then, we'll impute missing numeric values with the mean value.

In [None]:
string_cols = x.select_dtypes(include=[object]).columns.values
enc = OrdinalEncoder()
enc.fit(x[string_cols])
x[string_cols] = enc.transform(x[string_cols])

In [None]:
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
x[:] = imp_mean.fit_transform(x)
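To make these two steps concrete, here is a small sketch on a toy data frame.  The `protocol` and `bytes` columns are hypothetical stand-ins, not columns from our dataset, and the sketch writes to copies rather than modifying the frame in place.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Toy data: one string column and one numeric column with a missing value.
toy = pd.DataFrame({
    "protocol": ["tcp", "udp", "tcp", "icmp"],
    "bytes": [100.0, np.nan, 300.0, 200.0],
})

# Ordinal encoding assigns integer codes to the sorted category names:
# icmp -> 0, tcp -> 1, udp -> 2.
enc = OrdinalEncoder()
encoded = toy.copy()
encoded["protocol"] = enc.fit_transform(toy[["protocol"]]).ravel()

# Mean imputation replaces the missing bytes value with the column mean.
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = pd.DataFrame(imp.fit_transform(encoded), columns=encoded.columns)
print(imputed)
```

Note that the ordinal codes imply an ordering that the original strings may not have, which is a trade-off of this encoding.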

## K-Fold Cross-Validation

Before we fit our model and run test data against it, let's use a new function:  `cross_val_score()`, which performs k-fold cross-validation.  We'll split our data 10 ways (i.e., 10-fold) with the `RepeatedStratifiedKFold` class, which keeps the class proportions roughly equal in each fold.  We repeat the 10-fold split three separate times, shuffling the data differently each time, and generate an accuracy value for each fold of each repeat.

In [None]:
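# Note: scikit-learn 1.5 deprecated the multi_class parameter; with the lbfgs solver,
# a multinomial model is fit by default when the target has more than two classes.
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs', random_state=106842, max_iter=1000)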

In [None]:
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(clf, x, y, scoring='accuracy', cv=cv, n_jobs=-1)

Here are the resulting scores for the 30 separate tests (10 folds times 3 repeats).  Ideally, the accuracy remains very similar across splits.  That would be a good indicator that our results are stable and won't change drastically with a different random state.

In [None]:
n_scores
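To see where the 30 scores come from, the following sketch runs the same splitter over synthetic labels.  The 90/10 class balance is an assumption chosen for illustration and does not reflect our data.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Synthetic labels: 90 majority-class rows and 10 minority-class rows.
y_toy = np.array([0] * 90 + [1] * 10)
x_toy = np.zeros((100, 1))  # feature values do not matter for splitting

cv_toy = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
splits = list(cv_toy.split(x_toy, y_toy))

# 10 folds per repeat times 3 repeats.
print(len(splits))

# Stratification spreads the minority class evenly across the test folds.
minority_per_fold = [int(y_toy[test_idx].sum()) for _, test_idx in splits]
print(minority_per_fold)
```

Because each test fold receives its share of the minority class, no fold is scored on the majority class alone, which matters when some classes are under-represented.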

We can also aggregate these results by taking the mean and standard deviation of the scores.  A small standard deviation relative to the mean tells us the accuracy is stable across splits.  That is the case here, which is a good sign and lets us go ahead with our proper analysis.

In [None]:
np.mean(n_scores)

In [None]:
np.std(n_scores)

## Evaluation against Test Data

Now we can split our data into training and test subsets, fit our model to the training data, and generate predictions from the test data.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, stratify=y)
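The `stratify=y` argument keeps the class proportions the same in the training and test subsets.  Here is a quick sketch on synthetic labels.  The class names, the 80/20 balance, and the `random_state` value are assumptions added for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 20% minority class.
y_toy = np.array(["normal"] * 80 + ["dos"] * 20)
x_toy = np.arange(100).reshape(-1, 1)

# random_state is set here only so this example is repeatable.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

# Both subsets keep the original share of the minority class.
train_share = np.mean(y_tr == "dos")
test_share = np.mean(y_te == "dos")
print(train_share, test_share)
```

Without stratification, a random 30% sample could over- or under-draw a rare class, which would distort the test metrics for that class.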

In [None]:
clf = clf.fit(x_train, y_train)

In [None]:
y_pred = clf.predict(x_test)

With our predictions in hand, we can use the `confusion_matrix()` function to generate a confusion matrix.  The findings here are interesting:  unlike in R, we can see a real difficulty in separating two of the classes.

In [None]:
confusion_matrix(y_test, y_pred)
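The raw array can be hard to read because it carries no labels.  As a sketch, we can pass an explicit label order and wrap the result in a data frame.  The labels and predictions below are invented for illustration and are not our model's output.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Invented example: every "dos" row is predicted as "broadcast".
y_true = ["dos", "dos", "broadcast", "broadcast", "normal", "normal"]
y_hat = ["broadcast", "broadcast", "broadcast", "broadcast", "normal", "normal"]
labels = ["broadcast", "dos", "normal"]

# Rows are true classes and columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_hat, labels=labels)
cm_df = pd.DataFrame(
    cm,
    index=[f"true_{name}" for name in labels],
    columns=[f"pred_{name}" for name in labels],
)
print(cm_df)
```

In this toy matrix, the `true_dos` row has all of its counts in the `pred_broadcast` column, which is the pattern to look for when two classes are being confused.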

Looking at the classification report, we see that scikit-learn's logistic regression does a terrible job of separating regular denial of service attacks from broadcast denial of service attacks.  Because of this, we get every one of the classic DoS predictions wrong.  The model does a great job of getting everything else correct, however.

In [None]:
print(classification_report(y_test, y_pred))
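If we want the per-class numbers programmatically rather than as printed text, `classification_report()` accepts `output_dict=True`.  The sketch below uses invented labels, not our results, and sets `zero_division=0` so that a class with no predictions reports a precision of 0 instead of raising a warning.

```python
from sklearn.metrics import classification_report

# Invented example: every "dos" row is predicted as "broadcast".
y_true = ["dos", "dos", "broadcast", "broadcast", "normal", "normal"]
y_hat = ["broadcast", "broadcast", "broadcast", "broadcast", "normal", "normal"]

report = classification_report(y_true, y_hat, output_dict=True, zero_division=0)

# Recall is the fraction of each true class that the model recovered.
for name in ["broadcast", "dos", "normal"]:
    print(name, report[name]["precision"], report[name]["recall"])
```

A recall of 0 for one class alongside reduced precision for another is the signature of the model folding one class into the other.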