# Predicting SBAC
by Amee Tan 

Table of Contents 
----
1. Research Question
2. Evaluation Metrics
3. Load Data
4. Search for best algorithms
5. Tune hyperparameters for 3 candidate models
6. Compare 3 candidate models and select final model
7. Run final model on test set
8. Examine feature importances for the final model
9. Conclusion and next steps 

Research Question 
----
Can we use student demographic information and test scores throughout the year to predict whether or not a student will pass the Math SBAC at the end of the year? Which features are most important for making the prediction? 


Evaluation Metrics  
----
Two evaluation metrics are used in this notebook: precision and accuracy.

**Precision:** 
- Since the goal is to produce a model that reduces false positives, precision is the best way to measure the model's performance.
- A false positive means the model predicts that a student will pass SBAC when in reality they fail. This is problematic because students who should have received extra support during the school year would not receive it: the model did not identify them as being on track to fail.

**Accuracy:** 
- It is also important to know the overall accuracy of the model. Given that the target class had a 66/34 split, the model needs to outperform the a priori probability of the majority class, meaning it must achieve an accuracy score above 0.66.
- I debated whether to use accuracy or balanced accuracy. Balanced accuracy is better when there are class imbalances, but since the target had about a 66/34 split between the two classes, I decided this was not enough of a class imbalance to merit using balanced accuracy. Furthermore, balanced accuracy is defined as the average of recall obtained on each class. Recall is not as important for this case, so I stuck with accuracy score as my second metric.
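
To make the metrics concrete, here is a minimal sketch on a small set of made-up labels (1 = pass, 0 = fail). The label lists and helper function names are illustrative assumptions, not data or code from this project.

```python
# Hypothetical labels: 1 = pass, 0 = fail (not taken from the SBAC data)
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]


def precision(y_true, y_pred):
    """Share of predicted passes that really passed: TP / (TP + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp)


def accuracy(y_true, y_pred):
    """Share of all predictions that are correct."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the recall obtained on each class."""
    recalls = []
    for label in (0, 1):
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        total = sum(1 for t in y_true if t == label)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


toy_precision = precision(y_true, y_pred)          # 3 / (3 + 1)
toy_accuracy = accuracy(y_true, y_pred)            # 4 / 6
toy_balanced = balanced_accuracy(y_true, y_pred)   # (1/2 + 3/4) / 2

# Baseline: always predict the majority class (pass)
baseline_accuracy = accuracy(y_true, [1] * len(y_true))
```

In this toy case the model's accuracy only matches the always-predict-pass baseline, which is why accuracy is compared against the majority-class share rather than against 0.5.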

In [115]:
# Imports

import numpy as np
import pandas as pd
import seaborn as sns
from   sklearn.base            import BaseEstimator
from   sklearn.compose         import *
from   sklearn.ensemble        import RandomForestClassifier, ExtraTreesClassifier, IsolationForest, GradientBoostingClassifier
from   sklearn.experimental    import enable_iterative_imputer
from   sklearn.impute          import *
from   sklearn.inspection      import permutation_importance
from   sklearn.linear_model    import LogisticRegression, PassiveAggressiveClassifier, RidgeClassifier, SGDClassifier
from   sklearn.metrics         import precision_score, classification_report, accuracy_score, confusion_matrix
from   sklearn.model_selection import cross_validate, cross_val_score, KFold, RandomizedSearchCV, train_test_split
from   sklearn.neighbors       import *
from   sklearn.pipeline        import Pipeline
from   sklearn.preprocessing   import *
from   sklearn.svm             import SVC
from   sklearn.tree            import DecisionTreeClassifier, ExtraTreeClassifier

import warnings
warnings.filterwarnings("ignore")

Load Data
-----

In [63]:
data = pd.read_csv('https://github.com/amtan20/predictSBAC/raw/main/public_student_data.csv')

In [64]:
# Create X and y dataframes 
X = data.drop(columns=['Mathematics Achievement Level','ELA/Literacy Achievement Level'])
y = data['Mathematics Achievement Level']

In [65]:
# Clean up column names 

X.rename(columns={"Reading Fall '18 %ile" : "Reading Fall Percentile", 
                  "Reading Winter '18 %ile" : "Reading Winter Percentile", 
                  "Math Fall '18 %ile" : "Math Fall Percentile", 
                  "Winter '18 %ile" : "Math Winter Percentile"}, inplace=True)

Target Engineering 
-----
The data in the target column originally has 4 levels:
- Standard Not Met (Fail)
- Standard Nearly Met (Fail)
- Standard Met (Pass)
- Standard Exceeded (Pass)

For this project, I was more interested in determining if a student would pass or fail because this binary classification is more important for teachers than the multi-class classification. The first two levels indicate that a student failed. The last two levels indicate that a student passed. 

In [66]:
# Change y into binary numeric target 
def create_binary_target(y):
    return np.where(y=='Standard Not Met',0, np.where(y=='Standard Nearly Met', 0, 1))

In [67]:
# Use FunctionTransformer to apply the function to the target 
transformer = FunctionTransformer(create_binary_target)
y = transformer.transform(y)
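
One thing worth checking with this mapping: any value that is not one of the two failing levels, including a missing value, ends up as 1 (pass). A minimal sketch of a stricter mapping that leaves unknown values unmapped is below; the dictionary name and the sample values are assumptions for illustration.

```python
# Explicit mapping from the four achievement levels to a binary target
LEVEL_TO_BINARY = {
    "Standard Not Met": 0,
    "Standard Nearly Met": 0,
    "Standard Met": 1,
    "Standard Exceeded": 1,
}

# Hypothetical sample, including one missing value
levels = ["Standard Met", "Standard Not Met", None,
          "Standard Exceeded", "Standard Nearly Met"]

# dict.get returns None for anything outside the four known levels,
# so missing or misspelled values are not silently counted as passes
binary = [LEVEL_TO_BINARY.get(level) for level in levels]
n_unmapped = sum(value is None for value in binary)
```

If `n_unmapped` is greater than zero on the real data, those rows could be dropped or inspected before modeling.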

Create Train and Test Sets 
-----

In [68]:
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=46)


In [69]:
X_train.to_csv('X_train.csv')
pd.DataFrame(y_train).to_csv('y_train.csv')

Build Pipeline 
----
Impute missing values, one hot encode categorical features, standardize numeric features 

In [136]:
boolean_columns = ['IDEA Indicator', 'LEP Status', 'Economically Disadvantaged Status']

categorical_columns = ['Grade _x', 
       'Reading \nMet Winter Goal?', 
       'Math Met Winter Goal?', 
       'Race/Ethnicity', 'Language Code',
       'English Language Proficiency Level', 'Migrant Status',
       'Primary Disability Type']

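# Note: 'Math Met Winter Goal?' also appears in categorical_columns above,
# so it passes through both the categorical and the numeric pipelines.
numeric_columns = ['Reading Fall Percentile',
       'Reading Winter Percentile', 'Reading Fall Score to Winter Growth',
       'Math Fall Percentile',
       'Math Winter Percentile', 'Math Fall to Winter Growth',
       'Math Met Winter Goal?', 
       'First Entry Date Into US School', 'LEP Entry Date', 'LEP Exit Date']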


boolean_pipe = Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='False')), 
                         ('ohe', OneHotEncoder(handle_unknown='ignore'))])

categorical_pipe = Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                             ('ohe', OneHotEncoder(handle_unknown='ignore'))])

numeric_pipe = Pipeline([('scaler', StandardScaler()),
                         ('imputer', SimpleImputer(strategy='median'))])

preprocessing = ColumnTransformer([('boolean', boolean_pipe,  boolean_columns),
                                   ('categorical', categorical_pipe, categorical_columns),
                                   ('numeric',  numeric_pipe, numeric_columns)])   
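
To see what this preprocessing produces, here is a small sketch that applies the same three sub-pipelines to a made-up three-row frame. The column names `flag`, `group`, and `score` and their values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "flag": ["True", np.nan, "False"],
    "group": ["a", "b", np.nan],
    "score": [10.0, np.nan, 30.0],
}).astype({"flag": object, "group": object})

toy_boolean_pipe = Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="False")),
                             ("ohe", OneHotEncoder(handle_unknown="ignore"))])
toy_categorical_pipe = Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
                                 ("ohe", OneHotEncoder(handle_unknown="ignore"))])
# StandardScaler ignores NaN when fitting and keeps it when transforming,
# so the median is imputed on the already-scaled values
toy_numeric_pipe = Pipeline([("scaler", StandardScaler()),
                             ("imputer", SimpleImputer(strategy="median"))])

toy_preprocessing = ColumnTransformer([("boolean", toy_boolean_pipe, ["flag"]),
                                       ("categorical", toy_categorical_pipe, ["group"]),
                                       ("numeric", toy_numeric_pipe, ["score"])])

transformed = toy_preprocessing.fit_transform(toy)
# The result may be sparse or dense depending on its density; force dense here
dense = transformed.toarray() if hasattr(transformed, "toarray") else np.asarray(transformed)
```

The output has two one-hot columns for `flag`, three for `group` (including the `missing` category), and one scaled numeric column.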

Search for best algorithms using randomized search cross validation
----
RandomizedSearchCV returns a single best model per run. With n_iter=5, each run scores a random sample of 5 of the 9 candidate algorithms, so I run the search 3 times to collect 3 candidate models. 

In [42]:
# Create helper class 
class DummyEstimator(BaseEstimator):
    "Placeholder estimator; the search replaces it with each candidate model."
    def fit(self, X, y=None): return self
    def score(self, X, y=None): return None

In [43]:
# 1st iteration 

pipe = Pipeline([('preprocessing', preprocessing),
                 ('mod', DummyEstimator())])

models = [{'mod' : [ExtraTreesClassifier()]},
          {'mod' : [GradientBoostingClassifier()]},
          {'mod' : [KNeighborsClassifier()]},
          {'mod' : [LogisticRegression()]},
          {'mod' : [PassiveAggressiveClassifier()]},
          {'mod' : [RandomForestClassifier()]},
          {'mod' : [RidgeClassifier()]},
          {'mod' : [SGDClassifier()]},
          {'mod' : [SVC()]}
          ]

clf_rand = RandomizedSearchCV(estimator=pipe, 
                              param_distributions=models, 
                              n_iter=5,
                              cv=5,
                              scoring='precision',
                              n_jobs=-1)
best_model = clf_rand.fit(X_train, y_train) 
best_model.best_estimator_


Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('boolean',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='False',
                                                                                 strategy='constant')),
                                                                  ('ohe',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['IDEA Indicator',
                                                   'LEP Status',
                                                   'Economically Disadvantaged '
                                                   'Status']),
                                                 ('categorical',
                                                  Pipeline(steps=[('imputer',
                 

In [91]:
# 2nd iteration 

clf_rand = RandomizedSearchCV(estimator=pipe, 
                              param_distributions=models, 
                              n_iter=5,
                              cv=5,
                              scoring='precision',
                              n_jobs=-1)
best_model = clf_rand.fit(X_train, y_train) 
best_model.best_estimator_


Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('boolean',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='False',
                                                                                 strategy='constant')),
                                                                  ('ohe',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['LEP Status',
                                                   'Economically Disadvantaged '
                                                   'Status']),
                                                 ('categorical',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_

In [92]:
# 3rd iteration 

clf_rand = RandomizedSearchCV(estimator=pipe, 
                              param_distributions=models, 
                              n_iter=5,
                              cv=5,
                              scoring='precision',
                              n_jobs=-1)
best_model = clf_rand.fit(X_train, y_train) 
best_model.best_estimator_

Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('boolean',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='False',
                                                                                 strategy='constant')),
                                                                  ('ohe',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['LEP Status',
                                                   'Economically Disadvantaged '
                                                   'Status']),
                                                 ('categorical',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_
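
One possible alternative to rerunning the search three times, sketched here under assumptions rather than taken from the original analysis: score every candidate once with `GridSearchCV` and rank them using `cv_results_`. The synthetic data from `make_classification` and the three stand-in classifiers are illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the student data
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# The 'mod' step is a placeholder that the search swaps for each candidate
toy_pipe = Pipeline([("mod", LogisticRegression())])
candidates = [{"mod": [LogisticRegression(max_iter=1000)]},
              {"mod": [RidgeClassifier()]},
              {"mod": [DecisionTreeClassifier(random_state=0)]}]

# An exhaustive grid evaluates every candidate exactly once
search = GridSearchCV(toy_pipe, candidates, cv=5, scoring="precision")
search.fit(X_toy, y_toy)

# One row per candidate, best mean precision first
ranking = pd.DataFrame({
    "model": [type(params["mod"]).__name__ for params in search.cv_results_["params"]],
    "mean_precision": search.cv_results_["mean_test_score"],
}).sort_values("mean_precision", ascending=False).reset_index(drop=True)
```

Taking the first three rows of `ranking` would give three candidates from a single search, with the classifier names visible directly.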

Hyperparameter Tuning and KFold Cross Validation for each candidate model
----
Now that I have my 3 candidate models, I will:
1. Do a randomized search with cross validation to find the best hyperparameters for each model. 
2. Run a KFold Cross Validation to get the precision and accuracy scores for the tuned models. With 5 folds, I get 5 precision and 5 accuracy scores per model and take the mean of each. Averaging across folds gives a more stable estimate of the model's performance than a single train/validation split. 
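
As a quick illustration of how the KFold step partitions the data, here is a sketch on ten made-up rows. The array `X_demo` is an assumption for illustration; the splitter settings mirror the ones used for the candidate models.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical rows with two features each
X_demo = np.arange(20).reshape(10, 2)

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=46)

fold_sizes = []
held_out = []
for train_idx, test_idx in kfold_demo.split(X_demo):
    # Each fold trains on 8 rows and evaluates on the remaining 2
    fold_sizes.append((len(train_idx), len(test_idx)))
    held_out.extend(test_idx.tolist())
```

Every row is held out exactly once, so each score in the mean comes from data the model did not see during that fold's training.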

### Model 1: Logistic Regression Classifier

In [47]:
# Search for hyperparameters 

pipe = Pipeline([('preprocessing', preprocessing), 
                 ('classifier',  LogisticRegression())])

search_space = {'classifier__C': np.logspace(0, 4, 10),
                'classifier__class_weight': [None,'balanced'],
                'classifier__dual': [True,False],
                'classifier__fit_intercept': [True,False],
                'classifier__max_iter': [10, 100, 500],
                'classifier__multi_class': ['auto', 'ovr', 'multinomial'],
                'classifier__penalty': ['l1', 'l2', 'elasticnet', None],
                'classifier__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}

clf_rand = RandomizedSearchCV(estimator=pipe, 
                            param_distributions=search_space, 
                            n_iter=5,
                            cv=5,
                            verbose=True)

clf_rand.fit(X_train, y_train)
clf_rand.best_estimator_.get_params()

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    1.1s finished


{'memory': None,
 'steps': [('preprocessing',
   ColumnTransformer(transformers=[('boolean',
                                    Pipeline(steps=[('imputer',
                                                     SimpleImputer(fill_value='False',
                                                                   strategy='constant')),
                                                    ('ohe',
                                                     OneHotEncoder(handle_unknown='ignore'))]),
                                    ['IDEA Indicator', 'LEP Status',
                                     'Economically Disadvantaged Status']),
                                   ('categorical',
                                    Pipeline(steps=[('imputer',
                                                     SimpleImputer(fill_value='missing',
                                                                   strategy='constant')),
                                                    ('ohe',
           

In [48]:
# Include only non-default params 
lr_best_params = {'C': 7.742636826811269,
                  'class_weight': 'balanced',
                  'fit_intercept': False,
                  'max_iter': 500,
                  'penalty': 'l1',
                  'solver': 'liblinear'}

In [None]:
# Establish kfold and scoring to be used for all 3 candidate models

kfold = KFold(n_splits=5, shuffle=True, random_state=46)

scoring = ['precision', 'accuracy']

In [137]:
# Fit model

pipe = Pipeline([('preprocessing', preprocessing), 
                 ('classifier', LogisticRegression(**lr_best_params))])

scores = cross_validate(pipe,
                        X_train,
                        y_train, 
                        cv=kfold, 
                        scoring=scoring)

lr_precision_score = scores['test_precision'].mean()
lr_acc_score = scores['test_accuracy'].mean()
print("Mean Logistic Regression Precision:", round(lr_precision_score,3))
print("Mean Logistic Regression Accuracy:", round(lr_acc_score,3))

Mean Logistic Regression Precision: 0.905
Mean Logistic Regression Accuracy: 0.862



### Model 2 : Extra Trees Classifier 


In [50]:
# Search for hyperparameters 

pipe = Pipeline([('preprocessing', preprocessing), 
                 ('classifier',  ExtraTreesClassifier())])

search_space = {'classifier__bootstrap': [False,True],
                'classifier__ccp_alpha': [0, 0.1, 0.01],
                'classifier__class_weight': [None, 'balanced', 'balanced_subsample'],
                'classifier__criterion': ['gini', 'entropy'],
                'classifier__max_features': ['auto','sqrt', 'log2'],
                'classifier__max_samples': [0.25,0.5, 0.75, None],
                'classifier__min_samples_leaf': [1,2,3],
                'classifier__min_samples_split': [2,3],
                'classifier__min_weight_fraction_leaf': [0.0, 0.1],
                'classifier__n_estimators': [10,100,500]}

clf_rand = RandomizedSearchCV(estimator=pipe, 
                            param_distributions=search_space, 
                            n_iter=5,
                            cv=5,
                            verbose=True,
                            )
clf_rand.fit(X_train, y_train)
clf_rand.best_estimator_.get_params()

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    4.1s finished


{'memory': None,
 'steps': [('preprocessing',
   ColumnTransformer(transformers=[('boolean',
                                    Pipeline(steps=[('imputer',
                                                     SimpleImputer(fill_value='False',
                                                                   strategy='constant')),
                                                    ('ohe',
                                                     OneHotEncoder(handle_unknown='ignore'))]),
                                    ['IDEA Indicator', 'LEP Status',
                                     'Economically Disadvantaged Status']),
                                   ('categorical',
                                    Pipeline(steps=[('imputer',
                                                     SimpleImputer(fill_value='missing',
                                                                   strategy='constant')),
                                                    ('ohe',
           

In [51]:
# Only include non-default params 

et_best_params = {'bootstrap': True,
                  'min_samples_leaf': 2,
                  'min_samples_split': 3}

In [138]:
# Fit model

pipe = Pipeline([('preprocessing', preprocessing), 
                 ('classifier', ExtraTreesClassifier(**et_best_params))])

scores = cross_validate(pipe,
                        X_train,
                        y_train, 
                        cv=kfold, 
                        scoring=scoring)

et_precision_score = scores['test_precision'].mean()
et_acc_score = scores['test_accuracy'].mean()
print("Mean Extra Trees Precision:", round(et_precision_score,3))
print("Mean Extra Trees Accuracy:", round(et_acc_score,3))


Mean Extra Trees Precision: 0.88
Mean Extra Trees Accuracy: 0.865


### Model 3: SGD Classifier 

In [53]:
# Search for hyperparameters 

pipe = Pipeline([('preprocessing', preprocessing), 
                 ('classifier',  SGDClassifier())])

search_space = {'classifier__alpha': [0.0001, 0.001, 0.01],
                 'classifier__class_weight': [None, 'balanced'],
                 'classifier__early_stopping': [True,False],
                 'classifier__fit_intercept': [True, False],
                 'classifier__l1_ratio': [0.05, 0.15, 0.25],
                 'classifier__loss': ['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron'],
                 'classifier__max_iter': [1000, 2000],
                 'classifier__penalty': ['l2', 'l1', 'elasticnet']}

clf_rand = RandomizedSearchCV(estimator=pipe, 
                            param_distributions=search_space, 
                            n_iter=5,
                            cv=5,
                            verbose=True,
                            )
clf_rand.fit(X_train, y_train)
clf_rand.best_estimator_.get_params()

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    0.7s finished


{'memory': None,
 'steps': [('preprocessing',
   ColumnTransformer(transformers=[('boolean',
                                    Pipeline(steps=[('imputer',
                                                     SimpleImputer(fill_value='False',
                                                                   strategy='constant')),
                                                    ('ohe',
                                                     OneHotEncoder(handle_unknown='ignore'))]),
                                    ['IDEA Indicator', 'LEP Status',
                                     'Economically Disadvantaged Status']),
                                   ('categorical',
                                    Pipeline(steps=[('imputer',
                                                     SimpleImputer(fill_value='missing',
                                                                   strategy='constant')),
                                                    ('ohe',
           

In [54]:
# Include only those that are different from the default values 

sgd_best_params =  {'alpha': 0.01,
                    'early_stopping': True,
                    'fit_intercept': False,
                    'l1_ratio': 0.25,
                    'max_iter': 2000,
                    'penalty': 'l1'}


In [139]:
# Fit model using KFold CV

pipe = Pipeline([('preprocessing', preprocessing), 
                 ('classifier', SGDClassifier(**sgd_best_params))])

scores = cross_validate(pipe,
                        X_train,
                        y_train, 
                        cv=kfold, 
                        scoring=scoring)

sgd_precision_score = scores['test_precision'].mean()
sgd_acc_score = scores['test_accuracy'].mean()
print("Mean SGD Precision:", round(sgd_precision_score,3))
print("Mean SGD Accuracy:", round(sgd_acc_score,3))


Mean SGD Precision: 0.89
Mean SGD Accuracy: 0.844


Summary of 3 Candidate Models 
----

In [140]:
pd.DataFrame({"Model": ["Logistic Regression", "Extra Trees", "SGD"], 
              "Precision": [lr_precision_score,et_precision_score,sgd_precision_score], 
              "Accuracy": [lr_acc_score,et_acc_score,sgd_acc_score]})\
            .round(3)\
            .sort_values(['Precision'], ascending=False)

|   | Model | Precision | Accuracy |
|---|---|---|---|
| 0 | Logistic Regression | 0.905 | 0.862 |
| 2 | SGD | 0.89 | 0.844 |
| 1 | Extra Trees | 0.88 | 0.865 |


Since Logistic Regression had the highest precision and the 2nd highest accuracy, I selected this as my final model. 

Examine Importances for Logistic Regression Model
----
Looking at the importances serves 2 purposes:
   1. **Reduce the number of features in my model.** 
       - For logistic regression models, a simpler model is generally better because it tends to generalize better to new data. I will use the results of the permutation importance to keep only the features that have positive importances. 
   2. **Gain a better understanding about which features are most important to the logistic regression model.** 
       - This will help me answer my second question: which features matter most for predicting student performance, and therefore which ones teachers should focus on. 
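
As a minimal sketch of what permutation importance measures, the example below uses synthetic data where the target depends only on the first column. The data and variable names are assumptions for illustration, not part of the project.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 2))
# The target depends only on column 0; column 1 is pure noise
y_demo = (X_demo[:, 0] > 0).astype(int)

# Fit on the first 300 rows, measure importances on the held-out 100
demo_model = LogisticRegression().fit(X_demo[:300], y_demo[:300])

# Shuffle one column at a time and record how much the score drops
demo_result = permutation_importance(demo_model,
                                     X_demo[300:], y_demo[300:],
                                     n_repeats=5,
                                     random_state=42)
demo_importances = demo_result.importances_mean
```

Shuffling the informative column destroys most of the model's accuracy, while shuffling the noise column changes almost nothing, which is the pattern used here to decide which features to keep.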

In [None]:
# Create train and validation set from original training set

X_train_imp, X_valid_imp, y_train_imp, y_valid_imp = train_test_split(X_train, y_train, random_state=46)

In [82]:
pipe = Pipeline([('preprocessing', preprocessing), 
                 ('classifier',  LogisticRegression(**lr_best_params))])

model = pipe.fit(X_train_imp, y_train_imp)

r = permutation_importance(model, 
                           X_valid_imp, y_valid_imp,  
                           n_repeats=5,
                           random_state=42)

In [83]:
# Print results of importances 

features = X.columns
importances = r.importances_mean

# Pair each input column with its mean importance, largest first
feature_importances = sorted(zip(features, importances),
                             key=lambda pair: pair[1],
                             reverse=True)

for name, importance in feature_importances:
    print(name, importance)

Math Winter Percentile 0.22712328767123283
Reading Winter Percentile 0.07753424657534244
Math Met Winter Goal? 0.016027397260273944
Race/Ethnicity 0.006164383561643815
Economically Disadvantaged Status 0.003013698630136974
LEP Status 0.0010958904109588852
ID  0.0
Grade _x 0.0
Reading Fall '18 RIT 0.0
Reading Fall Percentile 0.0
Reading Typical RIT Growth Points 0.0
Reading Tiered RIT Growth Points 0.0
Reading Winter '18 GOAL Score 0.0
Reading Winter '18 RIT 0.0
Reading Fall Score to Winter Growth 0.0
Reading 
Met Winter Goal? 0.0
Reading Spring '19 GOAL Score 0.0
Reading Spring '19 RIT 0.0
Reading Spring '19 %ile 0.0
Reading Fall to Spring RIT Growth 0.0
Reading Met Spring Goal? 0.0
Math Fall '18 RIT 0.0
Math Fall Percentile 0.0
Math Typical RIT Growth Points 0.0
Math Tiered RIT Growth Points 0.0
Math Winter '18 GOAL Score 0.0
Math Winter '18 RIT 0.0
Math Fall to Winter Growth 0.0
Math Spring '19 GOAL Score 0.0
Math Spring '19 RIT 0.0
Math Spring '19 %ile 0.0
Math Fall to Spring RIT Gr

Remove some correlated features and run again 
----
I removed all features that had zero or negative permutation importance. The model originally had 23 features and now only has 6. 

Discarded features: 'English Language Proficiency Level', 'Reading \nMet Winter Goal?', 'Grade _x', 'LEP Entry Date', 'Math Met Winter Goal?' (dropped from the numeric columns only; it is kept as a categorical feature), 'Reading Fall Score to Winter Growth', 'IDEA Indicator', 'Migrant Status', 'LEP Exit Date', 'Primary Disability Type'
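
The selection rule can also be applied programmatically. Here is a small sketch using the six non-zero importances printed earlier plus three of the zero-importance columns; the dictionary and variable names are my own.

```python
# Mean permutation importances copied from the printed results
# (only a few of the zero-importance columns are included here)
importance_by_feature = {
    "Math Winter Percentile": 0.22712328767123283,
    "Reading Winter Percentile": 0.07753424657534244,
    "Math Met Winter Goal?": 0.016027397260273944,
    "Race/Ethnicity": 0.006164383561643815,
    "Economically Disadvantaged Status": 0.003013698630136974,
    "LEP Status": 0.0010958904109588852,
    "Grade _x": 0.0,
    "Reading Fall Percentile": 0.0,
    "Math Fall Percentile": 0.0,
}

# Keep only features whose importance is strictly positive
kept_features = [name for name, value in importance_by_feature.items() if value > 0]
dropped_features = [name for name, value in importance_by_feature.items() if value <= 0]
```

The six names in `kept_features` are the ones that appear in the V2 pipeline column lists.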

In [152]:
# Pipeline V2

boolean_columns_final = ['LEP Status', 'Economically Disadvantaged Status']

categorical_columns_final = ['Math Met Winter Goal?', 'Race/Ethnicity']

numeric_columns_final = ['Reading Winter Percentile', 'Math Winter Percentile']


boolean_pipe_final = Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='False')), 
                         ('ohe', OneHotEncoder(handle_unknown='ignore'))])

categorical_pipe_final = Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                             ('ohe', OneHotEncoder(handle_unknown='ignore'))])

numeric_pipe_final = Pipeline([('scaler', StandardScaler()),
                         ('imputer', SimpleImputer(strategy='median'))])

preprocessing_final = ColumnTransformer([('boolean', boolean_pipe_final,  boolean_columns_final),
                                   ('categorical', categorical_pipe_final, categorical_columns_final),
                                   ('numeric',  numeric_pipe_final, numeric_columns_final)])  

# Hyperparameters
lr_best_params = {'C': 7.742636826811269,
                  'class_weight': 'balanced',
                  'fit_intercept': False,
                  'max_iter': 500,
                  'penalty': 'l1',
                  'solver': 'liblinear'}

# Final Pipeline
pipe_final = Pipeline([('preprocessing', preprocessing_final), 
                 ('classifier',  LogisticRegression(**lr_best_params))])

scores = cross_validate(pipe_final,
                        X_train,
                        y_train, 
                        cv=kfold, 
                        scoring=scoring)

lr_final_precision_score = scores['test_precision'].mean()
lr_final_acc_score = scores['test_accuracy'].mean()
print("Mean Logistic Regression Precision:", round(lr_final_precision_score,3))
print("Mean Logistic Regression Accuracy:", round(lr_final_acc_score,3))
print("")
print(f"Precision went from {round(lr_precision_score,3)} to {round(lr_final_precision_score,3)}, an improvement of", round(lr_final_precision_score - lr_precision_score, 3))
print(f"Accuracy went from {round(lr_acc_score,3)} to {round(lr_final_acc_score,3)}, an improvement of", round(lr_final_acc_score - lr_acc_score, 3))

Mean Logistic Regression Precision: 0.93
Mean Logistic Regression Accuracy: 0.862

Precision went from 0.905 to 0.93, an improvement of 0.025
Accuracy went from 0.862 to 0.862, an improvement of 0.0


Removing features increased the precision, but accuracy stayed the same. 

Final Model Results 
----

In [153]:
# Final Features
boolean_columns_final = ['LEP Status', 'Economically Disadvantaged Status']

categorical_columns_final = ['Math Met Winter Goal?', 'Race/Ethnicity']

numeric_columns_final = ['Reading Winter Percentile', 'Math Winter Percentile']


# Final Preprocessing Pipeline
boolean_pipe_final = Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='False')), 
                         ('ohe', OneHotEncoder(handle_unknown='ignore'))])

categorical_pipe_final = Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                             ('ohe', OneHotEncoder(handle_unknown='ignore'))])

numeric_pipe_final = Pipeline([('scaler', StandardScaler()),
                         ('imputer', SimpleImputer(strategy='median'))])

preprocessing_final = ColumnTransformer([('boolean', boolean_pipe_final,  boolean_columns_final),
                                   ('categorical', categorical_pipe_final, categorical_columns_final),
                                   ('numeric',  numeric_pipe_final, numeric_columns_final)])  


# Hyperparameters
lr_best_params = {'C': 7.742636826811269,
                  'class_weight': 'balanced',
                  'fit_intercept': False,
                  'max_iter': 500,
                  'penalty': 'l1',
                  'solver': 'liblinear'}

# Final Pipeline
pipe_final = Pipeline([('preprocessing', preprocessing_final), 
                 ('classifier',  LogisticRegression(**lr_best_params))])

# Fit Pipeline on entire train 
pipe_final.fit(X_train, y_train)

# Predict on test set
preds = pipe_final.predict(X_test)

# See test set precision, accuracy and confusion matrix 
lr_precision_score = precision_score(y_test, preds)
lr_acc_score = accuracy_score(y_test, preds)
lr_confusion = confusion_matrix(y_test, preds)

print("Logistic Regression Set Precision:", round(lr_precision_score,3))
print("Logistic Regression Set Accuracy:", round(lr_acc_score,3))
print("")
print("Test Set Confusion Matrix:")
print(lr_confusion)

Logistic Regression Set Precision: 0.938
Logistic Regression Set Accuracy: 0.856

Test Set Confusion Matrix:
[[22  4]
 [10 61]]


In [154]:
# Create df to interpret confusion matrix 
labels = ['Fail', 'Pass']
pd.DataFrame(lr_confusion, columns=[f'Pred: {label}' for label in labels],
                  index=[f'Actual: {label}' for label in labels])


|   | Pred: Fail | Pred: Pass |
|---|---|---|
| Actual: Fail | 22 | 4 |
| Actual: Pass | 10 | 61 |


In [155]:
# Check classification report 
full_report = classification_report(y_test, preds)
print(full_report)

              precision    recall  f1-score   support

           0       0.69      0.85      0.76        26
           1       0.94      0.86      0.90        71

    accuracy                           0.86        97
   macro avg       0.81      0.85      0.83        97
weighted avg       0.87      0.86      0.86        97



It is interesting that the model has high precision for class 1 but low precision for class 0. Recall for both classes is about the same. Accuracy is 0.86, well above the a priori probability of 0.66. 
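
The per-class figures in the classification report can be re-derived directly from the test set confusion matrix. A small sketch using the counts printed above (the variable names are my own):

```python
# Test set confusion matrix: rows are actual, columns are predicted (Fail, Pass)
confusion = [[22, 4],
             [10, 61]]

tn, fp = confusion[0]   # actual fail: predicted fail, predicted pass
fn, tp = confusion[1]   # actual pass: predicted fail, predicted pass
total = tn + fp + fn + tp

precision_pass = tp / (tp + fp)   # 61 / 65
precision_fail = tn / (tn + fn)   # 22 / 32
recall_pass = tp / (tp + fn)      # 61 / 71
recall_fail = tn / (tn + fp)      # 22 / 26
test_accuracy = (tp + tn) / total # 83 / 97

# Share of the majority class (pass) in the test set: 71 / 97
majority_share = max(tn + fp, fn + tp) / total
```

The gap between `precision_pass` and `precision_fail` comes from the 10 false negatives being large relative to the 22 correctly predicted fails, while the 4 false positives are small relative to the 61 correctly predicted passes.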

Conclusions and Next Steps
----

My final logistic regression model does a good job of correctly predicting students who will pass the SBAC. The model was selected and tuned to minimize the chance of false positives. In the test set, only 4 of the 97 students are predicted to pass when in reality they failed. Using this model can help teachers identify students that need extra support who would have otherwise gone unnoticed.

It is important to note that in the test set, 10 students were predicted to fail who in reality passed. The model incorrectly suggests that these students should be receiving extra interventions. This could be an issue because teachers have limited time and resources, and these students would be unnecessarily taking away some of the resources from the students who need them most. The problem of false negatives is not as significant as the problem of false positives in this case, but it is important to be aware of it. 

Though the model's pass predictions are correct about 94% of the time (precision for class 1), its fail predictions are correct only about 69% of the time (precision for class 0). When I performed this analysis, I recognized there was a class imbalance, but I did not think it was significant enough to address. Given the major differences in precision for the two classes, I would like to consider addressing this class imbalance issue in the future using SMOTE. 

