# Logistic Regression

Logistic regression is not really a classification technique, but instead a regression technique that we can use to classify.  It is closely related to linear regression: we take the output of a linear model and pass it through a sigmoid function.  This squeezes every value onto an S-shaped curve between 0 and 1, with the majority of values ending up near 0 or near 1, so we can treat the result as a probability and pick a class by thresholding it.  Let's take a look at campus recruitment using logistic regression.
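As a rough illustration of that squeezing effect, here is a minimal sketch of the sigmoid applied to some made-up linear model outputs.  The `linear_scores` values and the 0.5 threshold are assumptions for demonstration only, not something taken from the campus recruitment data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array of them) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical outputs of a linear model (w . x + b) for five observations
linear_scores = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
probabilities = sigmoid(linear_scores)

# Turn probabilities into classes by thresholding at 0.5 (an assumed cutoff)
predicted_class = (probabilities >= 0.5).astype(int)

print(np.round(probabilities, 3))
print(predicted_class)
```

Scores far from zero land close to 0 or 1, while a score of exactly zero sits at 0.5, right on the decision boundary.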

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

clf = LogisticRegression(max_iter=1000)

Now let's prep the data.  Because we prepare it the same way for each algorithm, we only need to do this once.  I'll also skip the bits where we explore the data, as we've seen them enough times already.

In [None]:
campus_data = "../data/CampusRecruitment.csv"
df = pd.read_csv(campus_data, header=0)
y = df['status']
X = df.drop(['sl_no', 'status', 'salary'], axis=1)

## Pre-Processing

For this dataset, we want to use the columns leading up to `status` to determine whether different college graduates were placed at a job.  Because salary is a consequence of placement status, we can't use it to predict whether a new graduate will be placed, so we'll have to drop that column.  Note that if we were interested in doing a regression analysis, we could try to predict the salary given placement, but we're keeping it classy and sticking to classification algorithms only.  Also, we'll drop the `sl_no` column because it's a unique student number and tells us nothing about placement.

Unlike the heart attack dataset, this dataset includes non-numeric features.

In [None]:
df

### How Many Observations Do We Need?

As a quick reminder, our rule of thumb for observation count is: `10 * (number of features) / (probability of least common occurrence)`. That's a little hard to work with for continuous variables, so we'd probably want to bin each of those features and calculate the probabilities from a histogram. To keep things simple, I'll focus on the categorical variables.

In [None]:
for col in df[['gender', 'ssc_b', 'hsc_b', 'hsc_s', 'degree_t', 'workex', 'specialisation']]:
    print(df[col].unique())

Each categorical variable has 2-3 distinct values. That's a good sign. Now let's see how frequently each value occurs.

In [None]:
# Report the row count from the data rather than hard-coding it
print(f'{len(df)} total entries')
for col in df[['gender', 'ssc_b', 'hsc_b', 'hsc_s', 'degree_t', 'workex', 'specialisation']]:
    print(df[col].value_counts())

We can see that `hsc_s` == "Arts" and `degree_t` == "Others" are the least frequent, at 11 out of 215 occurrences. That gives us a probability of ~5%.

So ideally, we would have a minimum of `(10 * 12) / 0.05` observations in our dataset, as we have 12 features (5 numeric and 7 categorical) and a 5% minimum likelihood for our categorical features. That's a total of 2400 observations, so we're clearly well under the ideal, which should give us some pause. But we'll keep working through this analysis, keeping in mind that insufficient data can lead to a weaker model than we'd want.
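If you want to repeat this calculation on other datasets, the rule of thumb can be wrapped in a small helper.  This is only a sketch: `min_observations` is a name I made up, and the `toy` frame is invented data rather than the campus recruitment set.

```python
import pandas as pd

def min_observations(frame, categorical_cols, n_features):
    """Rule of thumb: 10 * (number of features) / (probability of the rarest category)."""
    rarest = min(
        frame[col].value_counts(normalize=True).min()
        for col in categorical_cols
    )
    return 10 * n_features / rarest

# Toy data: 'stream' has one rare value that shows up in 1 row out of 10
toy = pd.DataFrame({
    'stream': ['Science'] * 6 + ['Commerce'] * 3 + ['Arts'],
    'workex': ['Yes'] * 4 + ['No'] * 6,
})

needed = min_observations(toy, ['stream', 'workex'], n_features=2)
print(needed)
```

With two features and a rarest category at 10%, the rule asks for roughly 200 observations, far more than the 10 rows in the toy frame.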

Also, we can speculate that the 11 School of Arts students are the 11 people with `degree_t` == "Others" and so, for the remaining 204 graduates, we seem to have a fairly good mixture. This means we may struggle a bit with School of Arts graduates, but we can have some hope that it won't spill over into the other sets of students, for whom we do have a good number of observations.

### One-Hot Encoding

Before we can feed this data into the logistic regression algorithm (or pretty much any other classification or regression algorithm), we need to convert any text data into numeric data.  There are a few common techniques for encoding.  The technique we will use for our dataset is called one-hot encoding.  What it does is "pivot" the categorical data, so that each distinct categorical value gets its own feature.  For example, `gender` has two values, M and F.  One-hot encoding will create two new features, one for `gender=M` and one for `gender=F`, and each row gets a 1 in the column matching its value and a 0 in the other.  We need to do this for each of the non-numeric features.

In [None]:
enc = preprocessing.OneHotEncoder()

# Get only categorical features
X_object = X.select_dtypes('object')

# Fit the input features to our encoder
enc.fit(X_object)

# Perform the transformation on our dataset
codes = enc.transform(X_object).toarray()
feature_names = enc.get_feature_names_out(['gender', 'ssc_b', 'hsc_b', 'hsc_s', 'degree_t', 'workex', 'specialisation'])

# Combine the new one-hot encoded features with our existing floats and everything is now numeric.
X = pd.concat([X.select_dtypes(exclude='object'), pd.DataFrame(codes, columns=feature_names).astype(int)], axis=1)
X.shape

The `shape` here shows that we have the same number of rows as before (215), but the number of columns went from 12 to 21.  We started with 5 floating point features and added 16 one-hot columns for the categorical variables: five of our categorical variables had 2 values each and two had 3 values each (5 × 2 + 2 × 3 = 16).

In [None]:
X

By contrast, I'm going to perform a simple label encoding of the `status`.

In [None]:
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)
y
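It's worth knowing which integer ended up meaning what.  Here's a small sketch on made-up labels; I'm assuming the `status` strings are "Placed" and "Not Placed", so check `df['status'].unique()` against the real data.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label strings; verify against the real 'status' column
toy_status = ['Placed', 'Not Placed', 'Placed', 'Placed']

toy_le = LabelEncoder()
toy_y = toy_le.fit_transform(toy_status)

# classes_ is sorted, and each label's position in it is its integer code
mapping = {str(label): code for code, label in enumerate(toy_le.classes_)}
print(mapping)
print(toy_y)
```

Because the encoder sorts its classes, the alphabetically first label gets 0.  Under the assumption above, that makes non-placement the negative class and placement the positive class.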

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1740)

In [None]:
X_train

Now let's train a model using logistic regression.

In [None]:
clf = clf.fit(X_train, y_train)

## How'd We Do?

Let's first use the `accuracy_score` function in sklearn to see just how well we did on the held-out test set.

In [None]:
predicted = clf.predict(X_test)
accuracy_score(y_test, predicted)

Now let's review the confusion matrix and classification report.

In [None]:
confusion_matrix(y_test, predicted)
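If the layout of that matrix is hard to remember, here's a quick sketch on made-up labels showing how to unpack it into the four named cases.  The `toy_true` and `toy_pred` lists are invented for illustration.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 0 = negative class, 1 = positive class
toy_true = [0, 0, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()
print(tn, fp, fn, tp)
```

For a binary problem, `ravel()` flattens the 2x2 matrix row by row, which gives the true negative, false positive, false negative, true positive order.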

In [None]:
print(classification_report(y_test, predicted))

Here's how things are looking so far for us:

|Case|kNN|XGBoost|Logistic Regression|
|----|---|-------|-------------------|
|True Negative|13|18|17|
|False Positive|8|3|4|
|False Negative|4|5|6|
|True Positive|40|39|38|

Compared to kNN, the logistic regression model was better in terms of identifying non-placements, and slightly worse at identifying placements. It did ever so slightly worse than XGBoost in both cases, though there's not enough data for me to be confident in that statement.
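To put numbers on that comparison, here's a short sketch that turns the counts in the table into rates.  The `summarize` helper and its key names are my own, and I'm treating "placed" as the positive class.

```python
# Counts copied from the table above: (TN, FP, FN, TP)
results = {
    'kNN': (13, 8, 4, 40),
    'XGBoost': (18, 3, 5, 39),
    'Logistic Regression': (17, 4, 6, 38),
}

def summarize(tn, fp, fn, tp):
    """Convert confusion matrix counts into accuracy and per-class hit rates."""
    total = tn + fp + fn + tp
    return {
        'accuracy': (tn + tp) / total,
        'non_placement_rate': tn / (tn + fp),  # share of actual negatives caught
        'placement_rate': tp / (tp + fn),      # share of actual positives caught
    }

for model, counts in results.items():
    rates = summarize(*counts)
    print(model, {name: round(value, 3) for name, value in rates.items()})
```

With only 65 test observations, a difference of one or two cases shifts these rates by a few percentage points, so small gaps between models shouldn't be over-read.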

All in all, logistic regression performed well here despite the paucity of data.