# Logistic Regression Model
A very basic logistic regression model predicting college admittance from SAT score (continuous) and gender (binary).

# Checking and Wrangling the Data

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [2]:
raw_data = pd.read_csv('college_admittance_v2.csv')

In [3]:
raw_data.head()

Unnamed: 0,SAT,Admitted,Gender
0,1363,No,Male
1,1792,Yes,Female
2,1954,Yes,Female
3,1653,No,Male
4,1593,No,Male


In [4]:
data_train = pd.get_dummies(raw_data, drop_first=True) #drop first to ensure N-1 groups are made for our dummy variables
data_train.head()

Unnamed: 0,SAT,Admitted_Yes,Gender_Male
0,1363,False,True
1,1792,True,False
2,1954,True,False
3,1653,False,True
4,1593,False,True


In [6]:
data_train = data_train.astype(float) #this converts the boolean "True/False" into 1/0 and also preserves the rest of the values
data_train.head()

Unnamed: 0,SAT,Admitted_Yes,Gender_Male
0,1363.0,0.0,1.0
1,1792.0,1.0,0.0
2,1954.0,1.0,0.0
3,1653.0,0.0,1.0
4,1593.0,0.0,1.0


## Descriptives

In [7]:
data_train.describe() #check for missing values, and look at the distributions

Unnamed: 0,SAT,Admitted_Yes,Gender_Male
count,168.0,168.0,168.0
mean,1695.27381,0.559524,0.535714
std,183.019017,0.497928,0.500214
min,1334.0,0.0,0.0
25%,1547.5,0.0,0.0
50%,1691.5,1.0,1.0
75%,1844.5,1.0,1.0
max,2050.0,1.0,1.0


# Set Up & Run the Model

In [8]:
y = data_train['Admitted_Yes']
x1 = data_train[['SAT', 'Gender_Male']]

In [9]:
x = sm.add_constant(x1) #add constant
reg_log = sm.Logit(y,x) #define the variable holding the model
results_log = reg_log.fit() #run the model
results_log.summary()

Optimization terminated successfully.
         Current function value: 0.120117
         Iterations 10


0,1,2,3
Dep. Variable:,Admitted_Yes,No. Observations:,168.0
Model:,Logit,Df Residuals:,165.0
Method:,MLE,Df Model:,2.0
Date:,"Tue, 14 May 2024",Pseudo R-squ.:,0.8249
Time:,15:40:01,Log-Likelihood:,-20.18
converged:,True,LL-Null:,-115.26
Covariance Type:,nonrobust,LLR p-value:,5.1180000000000006e-42

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-66.4040,16.321,-4.068,0.000,-98.394,-34.414
SAT,0.0406,0.010,4.129,0.000,0.021,0.060
Gender_Male,-1.9449,0.846,-2.299,0.022,-3.603,-0.287


To find the odds ratios, which are the most interpretable output of a logistic regression, I'll exponentiate the coefficients. This is needed because the coefficients are reported on the log-odds scale.

In [10]:
np.exp(-1.9449) # this is to calculate the odds-ratio of gender_male

0.14300152277538664

In [11]:
np.exp(0.0406) # this is to calculate the odds-ratios of SAT

1.0414354480403178

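Interpretation:
- Each 1-point increase in SAT score multiplies the odds of admission by about 1.041, i.e. a 4.1% increase in the odds (not in the probability), holding gender constant
- At equal SAT scores, the odds of admission for males are about 0.143 times the odds for females, i.e. roughly 86% lower odds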
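The step from log-odds coefficients to odds ratios can be sketched as follows. This is a minimal illustration using only the standard library; the coefficient values are copied (rounded) from the summary table, and the names `coefs`, `odds_ratios`, and `pct_change_in_odds` are introduced here for illustration only.

```python
import math

# Coefficients copied (rounded) from the summary table; they are on the log-odds scale
coefs = {"SAT": 0.0406, "Gender_Male": -1.9449}

# Exponentiating a log-odds coefficient gives the odds ratio
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

# Percentage change in the odds (not the probability) per one-unit increase
pct_change_in_odds = {name: (ratio - 1) * 100 for name, ratio in odds_ratios.items()}

for name in coefs:
    print(f"{name}: odds ratio = {odds_ratios[name]:.3f}, "
          f"change in odds = {pct_change_in_odds[name]:+.1f}%")
```

An odds ratio above 1 raises the odds and one below 1 lowers them, so the gender coefficient corresponds to a drop of about 86% in the odds rather than a negative probability.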


# Training Model Accuracy

In [12]:
np.set_printoptions(formatter={'float':lambda x:"{0:0.2f}".format(x)}) #this is to make the predicted values legible
results_log.predict()

array([0.00, 1.00, 1.00, 0.23, 0.02, 0.99, 1.00, 1.00, 1.00, 0.01, 1.00,
       1.00, 0.76, 0.00, 0.60, 1.00, 0.11, 0.12, 0.51, 1.00, 1.00, 1.00,
       0.00, 0.01, 0.97, 1.00, 0.48, 0.99, 1.00, 0.99, 0.00, 0.83, 0.25,
       1.00, 1.00, 1.00, 0.31, 1.00, 0.23, 0.00, 0.02, 0.45, 1.00, 0.00,
       0.99, 0.00, 0.99, 0.00, 0.00, 0.01, 0.00, 1.00, 0.92, 0.02, 1.00,
       0.00, 0.37, 0.98, 0.12, 1.00, 0.00, 0.78, 1.00, 1.00, 0.98, 0.00,
       0.00, 0.00, 1.00, 0.00, 0.78, 0.12, 0.00, 0.99, 1.00, 1.00, 0.00,
       0.30, 1.00, 1.00, 0.00, 1.00, 1.00, 0.85, 1.00, 1.00, 0.00, 1.00,
       1.00, 0.89, 0.83, 0.00, 0.98, 0.97, 0.00, 1.00, 1.00, 0.03, 0.99,
       0.96, 1.00, 0.00, 1.00, 0.01, 0.01, 1.00, 1.00, 1.00, 0.00, 0.00,
       0.02, 0.33, 0.00, 1.00, 0.09, 0.00, 0.97, 0.00, 0.75, 1.00, 1.00,
       0.01, 0.01, 0.00, 1.00, 0.00, 0.99, 0.57, 0.54, 0.87, 0.83, 0.00,
       1.00, 0.00, 0.00, 0.00, 1.00, 0.04, 0.00, 0.01, 1.00, 0.99, 0.52,
       1.00, 1.00, 0.05, 0.00, 0.00, 0.00, 0.68, 1.

In [14]:
np.array(data_train['Admitted_Yes']) #these are the actual outcomes

array([0.00, 1.00, 1.00, 0.00, 0.00, 1.00, 1.00, 1.00, 1.00, 0.00, 1.00,
       1.00, 1.00, 0.00, 0.00, 1.00, 0.00, 0.00, 1.00, 1.00, 1.00, 1.00,
       0.00, 0.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 0.00, 1.00, 0.00,
       1.00, 1.00, 1.00, 0.00, 1.00, 0.00, 0.00, 0.00, 1.00, 1.00, 0.00,
       1.00, 0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 1.00,
       0.00, 0.00, 1.00, 0.00, 1.00, 0.00, 1.00, 1.00, 1.00, 1.00, 0.00,
       0.00, 0.00, 1.00, 0.00, 1.00, 1.00, 0.00, 1.00, 1.00, 1.00, 0.00,
       1.00, 1.00, 1.00, 0.00, 1.00, 1.00, 0.00, 1.00, 1.00, 0.00, 1.00,
       1.00, 1.00, 0.00, 0.00, 1.00, 1.00, 0.00, 1.00, 1.00, 0.00, 1.00,
       1.00, 1.00, 0.00, 1.00, 0.00, 0.00, 1.00, 1.00, 1.00, 0.00, 0.00,
       0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 1.00, 0.00, 1.00, 1.00, 1.00,
       0.00, 0.00, 0.00, 1.00, 0.00, 1.00, 0.00, 1.00, 1.00, 1.00, 0.00,
       1.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00, 1.00, 1.00, 1.00,
       1.00, 1.00, 0.00, 0.00, 0.00, 0.00, 1.00, 1.

Printing the predicted probabilities next to the actual outcomes is a quick way to compare what the model predicts with what actually happened.
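Each predicted value is the logistic function applied to the linear combination of the coefficients and the features. The sketch below reproduces one prediction by hand; the coefficients are copied (rounded) from the summary table, and `admit_probability` is a hypothetical helper, not part of statsmodels.

```python
import math

# Coefficients copied (rounded) from the summary table; rounding shifts the result slightly
CONST, B_SAT, B_MALE = -66.4040, 0.0406, -1.9449

def admit_probability(sat, is_male):
    """Hypothetical helper: logistic function applied to the linear predictor."""
    log_odds = CONST + B_SAT * sat + B_MALE * is_male
    return 1.0 / (1.0 + math.exp(-log_odds))

# Fourth training row (index 3): SAT 1653, male
p_row3 = admit_probability(1653, 1)
# Second training row (index 1): SAT 1792, female
p_row1 = admit_probability(1792, 0)

print(round(p_row3, 2), round(p_row1, 2))
```

The hand-computed value for row 3 comes out at roughly 0.22, close to the 0.23 that `results_log.predict()` reports for that row; the small gap is expected because the coefficients here are rounded to four decimals.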

## Confusion Matrix
Comparing true positives, true negatives, false positives, and false negatives.

Essentially, this breaks the model's correct and incorrect predictions down by class (positive and negative).

In [15]:
results_log.pred_table()

array([[69.00, 5.00],
       [4.00, 90.00]])

In [16]:
cm_df = pd.DataFrame(results_log.pred_table())
cm_df.columns = ['Predicted 0','Predicted 1']
cm_df = cm_df.rename(index={0:'Actual 0',1:'Actual 1'})
cm_df

Unnamed: 0,Predicted 0,Predicted 1
Actual 0,69.0,5.0
Actual 1,4.0,90.0


Interpretation:
- For 69 observations, they were predicted as 0 and were actually 0
- For 90 observations, they were predicted as 1 and were actually 1
- For 4 observations, they were predicted as 0 but were actually 1
- For 5 observations, they were predicted as 1 but were actually 0
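The table is built by turning each predicted probability into a class with a cut-off and counting the actual/predicted combinations. A minimal sketch, assuming the 0.5 cut-off that `pred_table()` uses by default; `confusion_counts` is a hypothetical helper, and the five values are the first entries of the arrays printed earlier.

```python
def confusion_counts(actual, predicted_prob, threshold=0.5):
    """Hypothetical helper: 2x2 table with rows = actual class, columns = predicted class."""
    table = [[0, 0], [0, 0]]
    for y, p in zip(actual, predicted_prob):
        predicted_class = 1 if p >= threshold else 0
        table[int(y)][predicted_class] += 1
    return table

# First five actual outcomes and predicted probabilities from the arrays printed earlier
actual = [0, 1, 1, 0, 0]
probs = [0.00, 1.00, 1.00, 0.23, 0.02]

table = confusion_counts(actual, probs)
print(table)  # [[3, 0], [0, 2]]
```

Moving the threshold trades false positives against false negatives, so 0.5 is a convention rather than a requirement.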

In [17]:
cm = np.array(cm_df)
accuracy_train = (cm[0,0]+cm[1,1])/cm.sum() #calculate the accuracy of the model
accuracy_train.round(3)

0.946

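The model's training accuracy is 94.6%.
- Essentially, this is the number of correctly classified observations (159) divided by the total number of observations (168).

This is a strong fit on the training data, well above the roughly 56% accuracy (94/168) that always predicting admission would give.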
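Accuracy is only one summary of the confusion matrix. The sketch below derives it together with sensitivity, specificity, and a naive baseline from the four training counts shown above; the variable names are introduced here for illustration.

```python
# Training confusion matrix from above: rows = actual, columns = predicted
tn, fp = 69, 5   # actual 0: predicted 0, predicted 1
fn, tp = 4, 90   # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total      # correct predictions over all predictions
sensitivity = tp / (tp + fn)      # share of actual admits predicted as admitted
specificity = tn / (tn + fp)      # share of actual non-admits predicted as not admitted
baseline = (tp + fn) / total      # accuracy of always predicting "admitted"

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f} baseline={baseline:.3f}")
```

Comparing accuracy with the baseline shows how much the model adds over simply guessing the majority class.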

# Testing the Model
Using a testing dataset to evaluate the model with new data. 

Comparing the results of the predicted outcomes with the actual outcomes. 

## Wrangle Data

In [18]:
test_raw_data = pd.read_csv('college_admittance_v2_test.csv')

In [19]:
test_raw_data.head()

Unnamed: 0,SAT,Admitted,Gender
0,1323,No,Male
1,1725,Yes,Female
2,1762,Yes,Female
3,1777,Yes,Male
4,1665,No,Male


In [21]:
test_data = pd.get_dummies(test_raw_data, drop_first=True) #drop first to ensure N-1 groups are made for our dummy variables
test_data = test_data.astype(float) #this converts the boolean "True/False" into 1/0 and also preserves the rest of the values
test_data.head() #ensure the data are structured in the same way as the training data

Unnamed: 0,SAT,Admitted_Yes,Gender_Male
0,1323.0,0.0,1.0
1,1725.0,1.0,0.0
2,1762.0,1.0,0.0
3,1777.0,1.0,1.0
4,1665.0,0.0,1.0


In [22]:
test_data.describe() #check for missing values, and look at the distributions

Unnamed: 0,SAT,Admitted_Yes,Gender_Male
count,19.0,19.0,19.0
mean,1716.157895,0.684211,0.421053
std,189.579729,0.477567,0.507257
min,1323.0,0.0,0.0
25%,1610.5,0.0,0.0
50%,1726.0,1.0,0.0
75%,1842.5,1.0,1.0
max,2039.0,1.0,1.0


In [23]:
x #the training inputs include a constant column, so I'll arrange the test data to match

Unnamed: 0,const,SAT,Gender_Male
0,1.0,1363.0,1.0
1,1.0,1792.0,0.0
2,1.0,1954.0,0.0
3,1.0,1653.0,1.0
4,1.0,1593.0,1.0
...,...,...,...
163,1.0,1722.0,0.0
164,1.0,1750.0,1.0
165,1.0,1555.0,1.0
166,1.0,1524.0,1.0


In [25]:
test_actual = test_data['Admitted_Yes'] #split the target variable from the testing data
test_data1 = test_data.drop(['Admitted_Yes'],axis=1)
test_data1 = sm.add_constant(test_data1) #add the constant to the features only, not to the full test_data
test_data1 = test_data1[x.columns.values] #although not necessary here, this code lines up the data columns to match the training data
test_data1 #now the data line up

Unnamed: 0,const,SAT,Gender_Male
0,1.0,1323.0,1.0
1,1.0,1725.0,0.0
2,1.0,1762.0,0.0
3,1.0,1777.0,1.0
4,1.0,1665.0,1.0
5,1.0,1556.0,0.0
6,1.0,1731.0,0.0
7,1.0,1809.0,0.0
8,1.0,1930.0,0.0
9,1.0,1708.0,1.0


# Results

## Confusion Matrix

In [33]:
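def confusion_matrix(data, actual_values, model):
    pred_values = model.predict(data) #predicted probabilities for the supplied data
    bins=np.array([0,0.5,1]) #bin edges: [0,0.5) counts as class 0, [0.5,1] as class 1
    cm = np.histogram2d(actual_values, pred_values, bins=bins)[0] #rows = actual, columns = predicted
    accuracy = (cm[0,0]+cm[1,1])/np.sum(cm) #correct predictions divided by all predictions
    return cm, accuracy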

In [36]:
cm = confusion_matrix(test_data1,test_actual,results_log)

In [37]:
cm

(array([[5.00, 1.00],
        [1.00, 12.00]]),
 0.8947368421052632)

The test accuracy is 89.5% (17 of 19 observations classified correctly).

In [40]:
cm_df = pd.DataFrame(cm[0])
cm_df.columns = ['Predicted 0','Predicted 1']
cm_df = cm_df.rename(index={0:'Actual 0',1:'Actual 1'})
cm_df

Unnamed: 0,Predicted 0,Predicted 1
Actual 0,5.0,1.0
Actual 1,1.0,12.0


Interpretation:
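- Test accuracy is 89.5%, slightly below the 94.6% training accuracy. A drop this small suggests the model is not badly overfitted, although with only 19 test observations the estimate is imprecise: each misclassified observation moves the accuracy by about 5 percentage points.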
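One way to gauge how much weight 19 test observations can bear is an approximate confidence interval for the test accuracy. The sketch below uses the Wilson score interval, which is a technique swapped in here for illustration and not part of the original analysis; `wilson_interval` is a hypothetical helper and the 95% level (`z=1.96`) is an assumption.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Hypothetical helper: Wilson score interval for a proportion (z=1.96 for ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# 17 of the 19 test observations were classified correctly
low, high = wilson_interval(17, 19)
print(f"approximate 95% interval for test accuracy: {low:.2f} to {high:.2f}")
```

The interval is wide, roughly 0.69 to 0.97, and it contains the training accuracy of 0.946, so the test set on its own cannot distinguish a small drop in accuracy from no drop at all.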