# Model a Logistic Regression in Python

This notebook fits a logistic regression model to our sample data.  The sample is small, but because most of our input features have very few unique values, that is less of a problem than it first appears.

In addition to the `Pandas` and `NumPy` libraries, we will also use a few functions from `scikit-learn`, another great package for data scientists to use.

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score

In [None]:
df = pd.read_csv("../1553_dos_attack1_Py_clean.csv")

Note that, even though we set missing values to `None`, pandas loads them back in as `NaN`.  This is fine for our analysis.

In [None]:
df.head(5)

## Data Preparation

Aside from the work we've already done, there are a few additional tweaks we need to make to the data before we can train a model.  First up is encoding string columns such as `connType`, because scikit-learn requires all inputs to be numeric.  We'll use the `OrdinalEncoder` in sklearn to transform strings like "BC->RT" into arbitrary numbers.

In [None]:
string_cols = df.select_dtypes(include=[object]).columns.values
enc = OrdinalEncoder()
enc.fit(df[string_cols])
df[string_cols] = enc.transform(df[string_cols])

In [None]:
df['connType'].unique()
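
To make the encoding step less opaque, here is a minimal sketch on a hypothetical toy column. Only "BC->RT" comes from the text above; the other labels and the `toy`, `toy_enc`, and `mapping` names are invented for illustration. It shows that the fitted encoder keeps its sorted labels in `categories_`, and that each label's position in that list is the code it receives.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: only "BC->RT" appears in the notebook text,
# the remaining labels are made up for this sketch.
toy = pd.DataFrame({"connType": ["BC->RT", "RT->BC", "RT->RT", "BC->RT"]})

toy_enc = OrdinalEncoder()
toy_codes = toy_enc.fit_transform(toy[["connType"]])

# categories_ holds one sorted array of labels per encoded column;
# a label's index in that array is its numeric code.
mapping = {label: float(i) for i, label in enumerate(toy_enc.categories_[0])}
print(mapping)
print(toy_codes.ravel())
```

If the original strings are needed later, `toy_enc.inverse_transform(toy_codes)` should recover them from the codes.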

Next, let's split our data into two sets:  `y`, which contains our label; and `x`, which contains all of our features.

In [None]:
y = df['malicious']

In [None]:
# Keep every column except the label; drop() returns a new DataFrame
x = df.drop(columns='malicious')

In [None]:
y.head(5)

In [None]:
x.head(5)

## Impute Missing Data

Unlike the library we used for R, sklearn's logistic regression implementation will not accept missing values.  We can check how many records are missing data in each column.

In [None]:
x.isna().sum()

To fix this, we will use the sklearn `SimpleImputer` and tell it to replace each missing value with the mean of its column.

In [None]:
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

In [None]:
x[:] = imp_mean.fit_transform(x)

After doing this, there are no more missing values.

In [None]:
x.isna().sum()
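
As a rough illustration of what mean imputation does, the sketch below runs the same imputer settings on a hypothetical two-column frame. The `toy_x` data and all `toy_*` names are invented for this example. Each gap is filled with the mean of the observed values in its own column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy features with one missing value per column.
toy_x = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})

toy_imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform returns a NumPy array, so rebuild the DataFrame
# with the original column names.
toy_filled = pd.DataFrame(toy_imp.fit_transform(toy_x), columns=toy_x.columns)

print(toy_imp.statistics_)  # per-column means learned during fit
print(toy_filled)
```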

## Partition Data

The next step is to break our data out into training and test datasets, reserving approximately 30% of the data for testing.  The `stratify=y` argument keeps the proportion of each label roughly the same in both sets.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, stratify=y)

In [None]:
len(x_train)

In [None]:
len(x_test)
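
To see what stratification buys us, here is a small sketch on hypothetical data with an 80/20 class balance. The arrays and the `random_state=0` seed are assumptions added for reproducibility; the notebook's own call does not set a seed. Both partitions should end up with the same share of positive labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 20% of them labeled 1.
toy_features = np.arange(100).reshape(-1, 1)
toy_labels = np.array([0] * 80 + [1] * 20)

# random_state is an assumption for this sketch so the split is repeatable.
f_train, f_test, l_train, l_test = train_test_split(
    toy_features, toy_labels, test_size=0.3, stratify=toy_labels, random_state=0
)

print(len(f_train), len(f_test))      # sizes of the two partitions
print(l_train.mean(), l_test.mean())  # share of positive labels in each
```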

## Training a Model

Training a model is very easy to do with Python and sklearn.  The `LogisticRegression` constructor creates a classifier object, which we then fit to our training data only.  The test data stays out of the fitting step so that we can use it to evaluate the model.

In [None]:
clf = LogisticRegression(random_state=184856).fit(x_train, y_train)

To make predictions, we pass in our test features **without** the labels.

In [None]:
y_pred = clf.predict(x_test)
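
The fit-then-predict pattern can be sketched on a hypothetical one-feature dataset. All `toy_*` names and values are invented here. The sketch also calls `predict_proba`, which returns the class probabilities behind each hard prediction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: small values are class 0, large values class 1.
toy_train_x = np.arange(8, dtype=float).reshape(-1, 1)
toy_train_y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

toy_clf = LogisticRegression().fit(toy_train_x, toy_train_y)

# New observations are passed as features only, with no labels.
toy_new_x = np.array([[1.0], [6.0]])
toy_pred = toy_clf.predict(toy_new_x)
toy_proba = toy_clf.predict_proba(toy_new_x)

print(toy_pred)   # hard class predictions
print(toy_proba)  # one row per observation, one column per class
```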

Instead of calculating accuracy ourselves, we can use a function in `sklearn.metrics` called `accuracy_score()` to get the result for us.  In this case, both R and Python achieved 100% accuracy on this simple dataset.

In [None]:
accuracy_score(y_test, y_pred)
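
For reference, accuracy is simply the fraction of predictions that match the true labels. The sketch below uses hypothetical label arrays, not the notebook's results, to compute it by hand and check that `accuracy_score()` gives the same number.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: five of the six predictions are correct.
true_labels = np.array([0, 0, 1, 1, 0, 1])
pred_labels = np.array([0, 0, 1, 0, 0, 1])

# Elementwise comparison gives booleans; their mean is the accuracy.
manual_accuracy = (true_labels == pred_labels).mean()
library_accuracy = accuracy_score(true_labels, pred_labels)

print(manual_accuracy, library_accuracy)
```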