# Table of Contents
1. Research Question
2. Evaluation Metrics
3. Load Data
4. Search for best algorithm
5. Tune hyperparameters for 3 candidate models
6. Compare 3 candidate models and select final model
7. Run final model on test set
8. Examine feature importances for the final model
9. Conclusion and next steps 

Research Question 
----
Can we use student demographic information and test scores throughout the year to predict whether or not a student will pass the Math SBAC at the end of the year? Which indicators from the first semester are most important for predicting whether or not a student will pass the Math SBAC? 


Evaluation Metrics  
----
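Two evaluation metrics are used in this notebook: 
- precision: the share of students predicted to pass who actually pass. This is the primary metric, because false positives are the costliest error here.
- accuracy: the share of all students who are classified correctly.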
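
Both metrics can be read off a 2x2 confusion matrix. The sketch below uses hypothetical counts (not results from this notebook) and assumes scikit-learn's layout, where rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical counts, laid out as scikit-learn's confusion_matrix does:
# rows = actual class, columns = predicted class, class order [0 (fail), 1 (pass)].
cm = np.array([[30, 10],
               [ 5, 55]])

tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)        # of the predicted passes, how many really passed
accuracy = (tp + tn) / cm.sum()   # of all students, how many were classified correctly

print(round(precision, 3), round(accuracy, 3))
```

Precision penalizes only false positives (the top-right cell), which is why it is the metric to watch when a wrongly predicted pass is the costliest mistake.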

In [62]:
# Imports

import numpy as np
import pandas as pd
import seaborn as sns
from   sklearn.base            import BaseEstimator
from   sklearn.compose         import *
from   sklearn.ensemble        import RandomForestClassifier, ExtraTreesClassifier, IsolationForest, GradientBoostingClassifier
from   sklearn.experimental    import enable_iterative_imputer
from   sklearn.impute          import *
from   sklearn.inspection      import permutation_importance
from   sklearn.linear_model    import LogisticRegression, PassiveAggressiveClassifier, RidgeClassifier, SGDClassifier
from   sklearn.metrics         import precision_score, classification_report, accuracy_score, confusion_matrix
from   sklearn.model_selection import train_test_split
from   sklearn.model_selection import RandomizedSearchCV
from   sklearn.neighbors       import *
from   sklearn.pipeline        import Pipeline
from   sklearn.preprocessing   import *
from   sklearn.svm             import SVC
from   sklearn.tree            import DecisionTreeClassifier, ExtraTreeClassifier


import warnings
warnings.filterwarnings("ignore")

Load Data
-----

In [63]:
data = pd.read_csv('https://github.com/amtan20/predictSBAC/raw/main/public_student_data.csv')

In [64]:
# Create X and y dataframes 
X = data.drop(columns=['Mathematics Achievement Level','ELA/Literacy Achievement Level'])
y = data['Mathematics Achievement Level']

In [65]:
# Clean up column names 

X.rename(columns={"Reading Fall '18 %ile" : "Reading Fall Percentile", 
                  "Reading Winter '18 %ile" : "Reading Winter Percentile", 
                  "Math Fall '18 %ile" : "Math Fall Percentile", 
                  "Winter '18 %ile" : "Math Winter Percentile"}, inplace=True)

Target Engineering 
-----

In [66]:
# Change y into binary numeric target 
def create_binary_target(y):
    return np.where(y=='Standard Not Met',0, np.where(y=='Standard Nearly Met', 0, 1))

In [67]:
# Use FunctionTransformer to apply the function to the target 
transformer = FunctionTransformer(create_binary_target)
y = transformer.transform(y)
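
As a quick illustration of the mapping, the sketch below applies the same rule to a handful of labels. The two failing labels come from `create_binary_target`; the two passing labels are assumed for illustration and may be spelled differently in the data. Note that any value other than the two failing labels, including a missing level, would map to 1 under this rule.

```python
import numpy as np

# The two "not met" labels come from create_binary_target above; the two passing
# labels are assumed here for illustration only.
levels = np.array(['Standard Not Met', 'Standard Nearly Met',
                   'Standard Met', 'Standard Exceeded'])

failing = ['Standard Not Met', 'Standard Nearly Met']

# Equivalent to the nested np.where: failing labels -> 0, everything else -> 1.
binary = np.where(np.isin(levels, failing), 0, 1)

print(binary)  # [0 0 1 1]
```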

Create Train, Validation, and Test Sets 
-----

In [68]:
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=46)


In [69]:
X_train.to_csv('X_train.csv')
pd.DataFrame(y_train).to_csv('y_train.csv')

In [70]:
X_train_tuning, X_valid_tuning, y_train_tuning, y_valid_tuning = train_test_split(X_train, y_train, random_state=46)
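
Neither call sets `test_size`, so each split uses scikit-learn's default of 25% for the held-out part, rounded up. The arithmetic below is a sketch under one assumption: the total of 388 rows is inferred from the 97 test rows and 73 validation rows that appear in the outputs later in this notebook, not read from the data.

```python
import math

def split_sizes(n_rows, test_fraction=0.25):
    """Return (n_train, n_test) with the held-out part rounded up."""
    n_test = math.ceil(test_fraction * n_rows)
    return n_rows - n_test, n_test

n_total = 388  # assumed: inferred from the reported test and validation counts
n_train, n_test = split_sizes(n_total)
n_train_tuning, n_valid_tuning = split_sizes(n_train)

print(n_train, n_test, n_train_tuning, n_valid_tuning)  # 291 97 218 73
```

Under that assumption, roughly 56% of the rows are used for fitting during tuning, 19% for validation, and 25% for the final test.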

Build Pipeline 
----
Impute missing values, one hot encode categorical features, standardize numeric features 

In [41]:
boolean_columns = ['IDEA Indicator', 'LEP Status', 'Economically Disadvantaged Status']

categorical_columns = ['Grade _x', 
       'Reading \nMet Winter Goal?', 
       'Math Met Winter Goal?', 
       'Race/Ethnicity', 'Language Code',
       'English Language Proficiency Level', 'Migrant Status',
       'Primary Disability Type']

numeric_columns = ['Reading Fall Percentile',
       'Reading Winter Percentile', 'Reading Fall Score to Winter Growth',
       'Math Fall Percentile',
       'Math Winter Percentile', 'Math Fall to Winter Growth',
       'Math Met Winter Goal?', 
       'First Entry Date Into US School', 'LEP Entry Date', 'LEP Exit Date']


boolean_pipe = Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='False')), 
                         ('ohe', OneHotEncoder(handle_unknown='ignore'))])

categorical_pipe = Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                             ('ohe', OneHotEncoder(handle_unknown='ignore'))])

numeric_pipe = Pipeline([('scaler', StandardScaler()),
                         ('imputer', SimpleImputer(strategy='median'))])

preprocessing = ColumnTransformer([('boolean', boolean_pipe,  boolean_columns),
                                   ('categorical', categorical_pipe, categorical_columns),
                                   ('numeric',  numeric_pipe, numeric_columns)])   
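
The numeric pipeline scales first and imputes second. A minimal NumPy sketch of that order on a hypothetical column is shown below; it mimics the idea (statistics computed from observed values only, then the gap filled with the median of the scaled values) rather than calling the scikit-learn classes themselves.

```python
import numpy as np

# Hypothetical percentile column with one missing value.
col = np.array([10.0, 50.0, np.nan, 90.0])

# Step 1: standardize using the mean and (population) standard deviation of the
# observed values only; the missing entry stays missing.
scaled = (col - np.nanmean(col)) / np.nanstd(col)

# Step 2: fill the remaining gap with the median of the scaled column.
imputed = np.where(np.isnan(scaled), np.nanmedian(scaled), scaled)

print(imputed)
```

Reversing the order (impute, then scale) would include the filled-in values when computing the mean and standard deviation, so the two orders are not guaranteed to give identical features.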

Search for best algorithm using RandomizedSearchCV
----

In [42]:
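# Helper class (you do not have to use it)
class DummyEstimator(BaseEstimator):
    "Pass through class, methods are present but do nothing."
    # Placeholder for the 'mod' step; the search replaces it with a real estimator.
    def fit(self, X, y=None): return self
    def score(self, X, y=None): pass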

In [43]:
pipe = Pipeline([('preprocessing', preprocessing),
                 ('mod', DummyEstimator())])

models = [{'mod' : [ExtraTreesClassifier()]},
          {'mod' : [GradientBoostingClassifier()]},
          {'mod' : [KNeighborsClassifier()]},
          {'mod' : [LogisticRegression()]},
          {'mod' : [PassiveAggressiveClassifier()]},
          {'mod' : [RandomForestClassifier()]},
          {'mod' : [RidgeClassifier()]},
          {'mod' : [SGDClassifier()]},
          {'mod' : [SVC()]}
          ]

clf_rand = RandomizedSearchCV(estimator=pipe, 
                              param_distributions=models, 
                              n_iter=3,
                              cv=5,
                              scoring='precision',
                              n_jobs=-1)
best_model = clf_rand.fit(X_train, y_train) 
best_model.best_estimator_


Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('boolean',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='False',
                                                                                 strategy='constant')),
                                                                  ('ohe',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['IDEA Indicator',
                                                   'LEP Status',
                                                   'Economically Disadvantaged '
                                                   'Status']),
                                                 ('categorical',
                                                  Pipeline(steps=[('imputer',
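
One thing to keep in mind when reading this result: with `n_iter=3` and nine single-option entries in `models`, the randomized search, as far as I understand its sampling, draws three of the nine candidates without replacement, so six algorithms are never fitted in this cell. The sketch below only illustrates that sampling with the standard library; the seed and the shortened names are arbitrary choices, not part of the notebook.

```python
import random

# Shortened names mirroring the `models` list above; each entry is one option.
candidates = ['ExtraTrees', 'GradientBoosting', 'KNeighbors', 'LogisticRegression',
              'PassiveAggressive', 'RandomForest', 'Ridge', 'SGD', 'SVC']

n_iter = 3
rng = random.Random(0)                  # arbitrary seed, for a repeatable illustration
tried = rng.sample(candidates, n_iter)  # sampling without replacement

never_tried = [name for name in candidates if name not in tried]
print(len(tried), len(never_tried))  # 3 6
```

This is one reason the manual search in the next section, which loops over all nine algorithms, is a useful complement.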
                 

Manual Search for 2 more candidate algorithms 
----

In [44]:
algorithms = [ExtraTreesClassifier(), 
              GradientBoostingClassifier(),
              KNeighborsClassifier(),
              LogisticRegression(), 
              PassiveAggressiveClassifier(), 
              RandomForestClassifier(),
              RidgeClassifier(), 
              SGDClassifier(), 
              SVC()    
              ]

In [45]:
# Average accuracy over 100 random train/validation splits 

final_acc_results = []

for algo in algorithms: 
    pipe = Pipeline([('preprocessing', preprocessing), 
                     ('classifier',  algo)])
    acc_results = []
    for x in range(100): 
        
        X_train1, X_valid, y_train1, y_valid = train_test_split(X_train, y_train)
        
        pipe.fit(X_train1, y_train1)

        y_pred = pipe.predict(X_valid)
        acc_test = accuracy_score(y_valid, y_pred)
        acc_results.append(acc_test)
        avg_acc = sum(acc_results)/len(acc_results)
    
    final_acc_results.append((algo, round(avg_acc,3)))

final_acc_results = sorted(final_acc_results, key = lambda x: x[1], reverse=True)

for x in final_acc_results:
    print(x[0], x[1])

ExtraTreesClassifier() 0.869
RandomForestClassifier() 0.865
SVC() 0.865
KNeighborsClassifier() 0.861
LogisticRegression() 0.859
RidgeClassifier() 0.847
PassiveAggressiveClassifier() 0.843
SGDClassifier() 0.84
GradientBoostingClassifier() 0.835


In [46]:
# Average precision over 100 random train/validation splits 

final_precision_results = []

for algo in algorithms: 
    pipe = Pipeline([('preprocessing', preprocessing), 
                     ('classifier',  algo)])

    precision_results = []
    for x in range(100):  
        
        X_train1, X_valid, y_train1, y_valid = train_test_split(X_train, y_train)
        
        pipe.fit(X_train1, y_train1)

        y_pred = pipe.predict(X_valid)
        precision_test = precision_score(y_valid, y_pred)
        precision_results.append(precision_test)
        avg_precision = sum(precision_results)/len(precision_results)

    final_precision_results.append((algo, round(avg_precision,3)))

final_precision_results = sorted(final_precision_results, key = lambda x: x[1], reverse=True)

for x in final_precision_results:
    print(x[0], x[1])

ExtraTreesClassifier() 0.898
SGDClassifier() 0.895
RandomForestClassifier() 0.892
LogisticRegression() 0.889
PassiveAggressiveClassifier() 0.881
RidgeClassifier() 0.874
GradientBoostingClassifier() 0.872
KNeighborsClassifier() 0.872
SVC() 0.863
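
The averages above are close to one another, so it can help to look at the spread across splits as well as the mean. The sketch below uses hypothetical per-split scores (not values from this notebook) to show how the standard deviation and the standard error of the mean could be reported next to each average.

```python
import statistics

# Hypothetical per-split precision scores for one algorithm (not from this notebook).
scores = [0.91, 0.88, 0.90, 0.87, 0.92, 0.89]

mean = statistics.mean(scores)
stdev = statistics.stdev(scores)        # sample standard deviation across splits
std_err = stdev / len(scores) ** 0.5    # standard error of the mean

print(round(mean, 3), round(stdev, 3), round(std_err, 3))
```

If two algorithms differ by less than a couple of standard errors, the ranking between them is probably not stable.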


Tune hyperparameters for candidate model 1: Logistic Regression Classifier 
----

In [47]:
# Search for hyperparameters 

pipe = Pipeline([('preprocessing', preprocessing), 
                 ('classifier',  LogisticRegression())])

search_space = {'classifier__C': np.logspace(0, 4, 10),
                'classifier__class_weight': [None,'balanced'],
                'classifier__dual': [True,False],
                'classifier__fit_intercept': [True,False],
                'classifier__max_iter': [10, 100, 500],
                'classifier__multi_class': ['auto', 'ovr', 'multinomial'],
                'classifier__penalty': ['l1', 'l2', 'elasticnet', None],
                'classifier__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}

clf_rand = RandomizedSearchCV(estimator=pipe, 
                            param_distributions=search_space, 
                            n_iter=5,
                            cv=5,
                            verbose=True)

clf_rand.fit(X_train, y_train)
clf_rand.best_estimator_.get_params()

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    1.1s finished


{'memory': None,
 'steps': [('preprocessing',
   ColumnTransformer(transformers=[('boolean',
                                    Pipeline(steps=[('imputer',
                                                     SimpleImputer(fill_value='False',
                                                                   strategy='constant')),
                                                    ('ohe',
                                                     OneHotEncoder(handle_unknown='ignore'))]),
                                    ['IDEA Indicator', 'LEP Status',
                                     'Economically Disadvantaged Status']),
                                   ('categorical',
                                    Pipeline(steps=[('imputer',
                                                     SimpleImputer(fill_value='missing',
                                                                   strategy='constant')),
                                                    ('ohe',
           

In [48]:
# Include only non-default params 
lr_best_params = {'C': 7.742636826811269,
                  'class_weight': 'balanced',
                  'fit_intercept': False,
                  'max_iter': 500,
                  'penalty': 'l1',
                  'solver': 'liblinear'}
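
The selected `C` is one of the ten log-spaced grid values, and the five sampled candidates are a very small part of the full grid. The sketch below reproduces both numbers from the search space defined in the previous cell; it does not rerun the search.

```python
import numpy as np

c_grid = np.logspace(0, 4, 10)   # 10 values from 10**0 to 10**4, evenly spaced in log10
print(c_grid[2])                 # 10 ** (8/9), the C value selected above

# Number of options per hyperparameter in the search space:
# C, class_weight, dual, fit_intercept, max_iter, multi_class, penalty, solver
option_counts = [10, 2, 2, 2, 3, 3, 4, 5]
n_combinations = int(np.prod(option_counts))
print(n_combinations)            # 14400, of which n_iter=5 are sampled
```

Some of those combinations are not valid together (for example, an `l1` penalty with the `lbfgs` solver), so a sampled candidate may fail to fit rather than produce a score.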

In [72]:
# Fit model
pipe = Pipeline([('preprocessing', preprocessing), 
                 ('classifier',  LogisticRegression(**lr_best_params))])

pipe.fit(X_train_tuning, y_train_tuning)
preds = pipe.predict(X_valid_tuning)
lr_precision_score = precision_score(y_valid_tuning, preds)
lr_acc_score = accuracy_score(y_valid_tuning, preds)
lr_confusion = confusion_matrix(y_valid_tuning, preds)

print("Logistic Regression Precision:", round(lr_precision_score,3))
print("Logistic Regression Accuracy:", round(lr_acc_score,3))
print("")
print("Confusion Matrix:")
lr_confusion

Logistic Regression Precision: 0.913
Logistic Regression Accuracy: 0.877

Confusion Matrix:


array([[22,  4],
       [ 5, 42]])

**Note:** 4 students were incorrectly predicted to pass who in reality failed (the top-right cell of the confusion matrix, where rows are actual classes and columns are predicted classes).


Tune hyperparameters for candidate model 2 : Extra Trees Classifier 
----


In [50]:
# Search for hyperparameters 

pipe = Pipeline([('preprocessing', preprocessing), 
                 ('classifier',  ExtraTreesClassifier())])

search_space = {'classifier__bootstrap': [False,True],
                'classifier__ccp_alpha': [0, 0.1, 0.01],
                'classifier__class_weight': [None, 'balanced', 'balanced_subsample'],
                'classifier__criterion': ['gini', 'entropy'],
                'classifier__max_features': ['auto','sqrt', 'log2'],
                'classifier__max_samples': [0.25,0.5, 0.75, None],
                'classifier__min_samples_leaf': [1,2,3],
                'classifier__min_samples_split': [2,3],
                'classifier__min_weight_fraction_leaf': [0.0, 0.1],
                'classifier__n_estimators': [10,100,500]}

clf_rand = RandomizedSearchCV(estimator=pipe, 
                            param_distributions=search_space, 
                            n_iter=5,
                            cv=5,
                            verbose=True,
                            )
clf_rand.fit(X_train, y_train)
clf_rand.best_estimator_.get_params()

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    4.1s finished


{'memory': None,
 'steps': [('preprocessing',
   ColumnTransformer(transformers=[('boolean',
                                    Pipeline(steps=[('imputer',
                                                     SimpleImputer(fill_value='False',
                                                                   strategy='constant')),
                                                    ('ohe',
                                                     OneHotEncoder(handle_unknown='ignore'))]),
                                    ['IDEA Indicator', 'LEP Status',
                                     'Economically Disadvantaged Status']),
                                   ('categorical',
                                    Pipeline(steps=[('imputer',
                                                     SimpleImputer(fill_value='missing',
                                                                   strategy='constant')),
                                                    ('ohe',
           

In [51]:
# Only include non-default params 

et_best_params = {'bootstrap': True,
                  'min_samples_leaf': 2,
                  'min_samples_split': 3}

In [73]:
# Fit model

pipe = Pipeline([('preprocessing', preprocessing), 
                 ('classifier',  ExtraTreesClassifier(**et_best_params))])

pipe.fit(X_train_tuning, y_train_tuning)
preds = pipe.predict(X_valid_tuning)
et_precision_score = precision_score(y_valid_tuning, preds)
et_acc_score = accuracy_score(y_valid_tuning, preds)
et_confusion = confusion_matrix(y_valid_tuning, preds)

print("Extra Trees Precision:", round(et_precision_score,3))
print("Extra Trees Accuracy:", round(et_acc_score,3))
print("")
print("Confusion Matrix:")
et_confusion


Extra Trees Precision: 0.902
Extra Trees Accuracy: 0.918

Confusion Matrix:


array([[21,  5],
       [ 1, 46]])

**Note:** This model misidentifies 5 students as passing when in reality they fail, one more than the logistic regression model. Its precision is therefore lower, although its accuracy is higher. 

Tune hyperparameters for candidate model 3: SGD Classifier 
----

In [53]:
# Search for hyperparameters 

pipe = Pipeline([('preprocessing', preprocessing), 
                 ('classifier',  SGDClassifier())])

search_space = {'classifier__alpha': [0.0001, 0.001, 0.01],
                 'classifier__class_weight': [None, 'balanced'],
                 'classifier__early_stopping': [True,False],
                 'classifier__fit_intercept': [True, False],
                 'classifier__l1_ratio': [0.05, 0.15, 0.25],
                 'classifier__loss': ['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron'],
                 'classifier__max_iter': [1000, 2000],
                 'classifier__penalty': ['l2', 'l1', 'elasticnet']}

clf_rand = RandomizedSearchCV(estimator=pipe, 
                            param_distributions=search_space, 
                            n_iter=5,
                            cv=5,
                            verbose=True,
                            )
clf_rand.fit(X_train, y_train)
clf_rand.best_estimator_.get_params()

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    0.7s finished


{'memory': None,
 'steps': [('preprocessing',
   ColumnTransformer(transformers=[('boolean',
                                    Pipeline(steps=[('imputer',
                                                     SimpleImputer(fill_value='False',
                                                                   strategy='constant')),
                                                    ('ohe',
                                                     OneHotEncoder(handle_unknown='ignore'))]),
                                    ['IDEA Indicator', 'LEP Status',
                                     'Economically Disadvantaged Status']),
                                   ('categorical',
                                    Pipeline(steps=[('imputer',
                                                     SimpleImputer(fill_value='missing',
                                                                   strategy='constant')),
                                                    ('ohe',
           

In [54]:
# Include only those that are different from the default values 

sgd_best_params =  {'alpha': 0.01,
                    'early_stopping': True,
                    'fit_intercept': False,
                    'l1_ratio': 0.25,
                    'max_iter': 2000,
                    'penalty': 'l1'}


In [74]:
# Fit model 

pipe = Pipeline([('preprocessing', preprocessing), 
                 ('classifier',  SGDClassifier(**sgd_best_params))])

pipe.fit(X_train_tuning, y_train_tuning)
preds = pipe.predict(X_valid_tuning)
sgd_precision_score = precision_score(y_valid_tuning, preds)
sgd_acc_score = accuracy_score(y_valid_tuning, preds)
sgd_confusion = confusion_matrix(y_valid_tuning, preds)

print("SGD Precision:", round(sgd_precision_score,3))
print("SGD Accuracy:", round(sgd_acc_score,3))
print("")
print("Confusion Matrix:")
sgd_confusion


SGD Precision: 0.911
SGD Accuracy: 0.863

Confusion Matrix:


array([[22,  4],
       [ 6, 41]])

Summary of 3 Candidate Models 
----

In [75]:
pd.DataFrame({"Model": ["Logistic Regression", "Extra Trees", "SGD"], 
              "Precision": [lr_precision_score,et_precision_score,sgd_precision_score], 
              "Accuracy": [lr_acc_score,et_acc_score,sgd_acc_score]})\
            .round(3)\
            .sort_values(['Precision'], ascending=False)

|   | Model | Precision | Accuracy |
|---|---|---|---|
| 0 | Logistic Regression | 0.913 | 0.877 |
| 2 | SGD | 0.911 | 0.863 |
| 1 | Extra Trees | 0.902 | 0.918 |


Examine Importances for Logistic Regression Model
----

In [82]:
pipe = Pipeline([('preprocessing', preprocessing), 
                 ('classifier',  LogisticRegression(**lr_best_params))])

model = pipe.fit(X_train_tuning, y_train_tuning)

r = permutation_importance(model, 
                           X_valid_tuning, y_valid_tuning,  
                           n_repeats=100,
                           random_state=42)

In [83]:
features = X.columns
importances = r.importances_mean

feature_importances = []
for x in zip(features,importances):
    feature_importances.append((x[0], x[1]))

for x in sorted(feature_importances, key= lambda x: x[1], reverse=True):
    print(x[0], x[1])

Math Winter Percentile 0.22712328767123283
Reading Winter Percentile 0.07753424657534244
Math Met Winter Goal? 0.016027397260273944
Race/Ethnicity 0.006164383561643815
Economically Disadvantaged Status 0.003013698630136974
LEP Status 0.0010958904109588852
ID  0.0
Grade _x 0.0
Reading Fall '18 RIT 0.0
Reading Fall Percentile 0.0
Reading Typical RIT Growth Points 0.0
Reading Tiered RIT Growth Points 0.0
Reading Winter '18 GOAL Score 0.0
Reading Winter '18 RIT 0.0
Reading Fall Score to Winter Growth 0.0
Reading 
Met Winter Goal? 0.0
Reading Spring '19 GOAL Score 0.0
Reading Spring '19 RIT 0.0
Reading Spring '19 %ile 0.0
Reading Fall to Spring RIT Growth 0.0
Reading Met Spring Goal? 0.0
Math Fall '18 RIT 0.0
Math Fall Percentile 0.0
Math Typical RIT Growth Points 0.0
Math Tiered RIT Growth Points 0.0
Math Winter '18 GOAL Score 0.0
Math Winter '18 RIT 0.0
Math Fall to Winter Growth 0.0
Math Spring '19 GOAL Score 0.0
Math Spring '19 RIT 0.0
Math Spring '19 %ile 0.0
Math Fall to Spring RIT Gr
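
To make the numbers above easier to interpret, here is a minimal NumPy sketch of the idea behind permutation importance: shuffle one column at a time and measure how much accuracy drops. It is a simplified illustration on hypothetical data with a hypothetical stand-in for the fitted model, not the scikit-learn implementation.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link between column j and y
            drops.append(baseline - np.mean(predict(X_perm) == y))
        importances[j] = np.mean(drops)
    return importances

# Hypothetical data: the label depends only on column 0; column 1 is noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Hypothetical stand-in for a fitted model that only looks at column 0.
predict = lambda X: (X[:, 0] > 0).astype(int)

importances = permutation_importance_sketch(predict, X_demo, y_demo)
print(importances)
```

A column the model never reads gets an importance of exactly 0.0. That is a likely explanation for many of the zeros above: columns such as `ID` are not listed in any of the three column groups, so the `ColumnTransformer` drops them before the classifier sees them, and an L1 penalty can also shrink some coefficients to zero.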

Remove some correlated features and run again 
----

In [84]:
# Pipeline V2

"""
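Columns removed relative to the first pipeline: 
'IDEA Indicator', 'Grade _x', 'Reading \nMet Winter Goal?', 'Language Code',
'English Language Proficiency Level', 'Migrant Status', 'Primary Disability Type',
'Reading Fall Percentile', 'Reading Fall Score to Winter Growth',
'Math Fall Percentile', 'Math Fall to Winter Growth',
'First Entry Date Into US School', 'LEP Entry Date', 'LEP Exit Date'
'Math Met Winter Goal?' is kept, now as a categorical feature only.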
""" 
boolean_columns = ['LEP Status', 'Economically Disadvantaged Status']

categorical_columns = ['Math Met Winter Goal?', 'Race/Ethnicity']

numeric_columns = ['Reading Winter Percentile', 'Math Winter Percentile']


boolean_pipe = Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='False')), 
                         ('ohe', OneHotEncoder(handle_unknown='ignore'))])

categorical_pipe = Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                             ('ohe', OneHotEncoder(handle_unknown='ignore'))])

numeric_pipe = Pipeline([('scaler', StandardScaler()),
                         ('imputer', SimpleImputer(strategy='median'))])

preprocessing = ColumnTransformer([('boolean', boolean_pipe,  boolean_columns),
                                   ('categorical', categorical_pipe, categorical_columns),
                                   ('numeric',  numeric_pipe, numeric_columns)])  

# Hyperparameters
lr_best_params = {'C': 7.742636826811269,
                  'class_weight': 'balanced',
                  'fit_intercept': False,
                  'max_iter': 500,
                  'penalty': 'l1',
                  'solver': 'liblinear'}

# Final Pipeline
pipe = Pipeline([('preprocessing', preprocessing), 
                 ('classifier',  LogisticRegression(**lr_best_params))])

# Fit pipeline on the tuning train split
pipe.fit(X_train_tuning, y_train_tuning)

# Predict on the validation split
preds = pipe.predict(X_valid_tuning)

# See validation precision, accuracy and confusion matrix 
lr_precision_score = precision_score(y_valid_tuning, preds)
lr_acc_score = accuracy_score(y_valid_tuning, preds)
lr_confusion = confusion_matrix(y_valid_tuning, preds)

print("Logistic Regression Precision:", round(lr_precision_score,3))
print("Logistic Regression Accuracy:", round(lr_acc_score,3))
print("")
print("Confusion Matrix:")
lr_confusion

Logistic Regression Precision: 0.935
Logistic Regression Accuracy: 0.904

Confusion Matrix:


array([[23,  3],
       [ 4, 43]])

**Note:** After simplifying the model, precision and accuracy increased and the number of false positives decreased by 1. 

Final Model Results 
----
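After tuning the hyperparameters, Logistic Regression had the highest precision (0.913) and the second-highest accuracy (0.877). SGD was close behind on precision (0.911) but had the lowest accuracy (0.863), while Extra Trees had the highest accuracy (0.918) but the lowest precision (0.902). Since I care most about avoiding false positives, I am selecting Logistic Regression as my final model, though I am concerned that its accuracy is lower than that of Extra Trees. 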

In [86]:
# Final Features
boolean_columns = ['LEP Status', 'Economically Disadvantaged Status']

categorical_columns = ['Math Met Winter Goal?', 'Race/Ethnicity']

numeric_columns = ['Reading Winter Percentile', 'Math Winter Percentile']


# Final Preprocessing Pipeline
boolean_pipe = Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='False')), 
                         ('ohe', OneHotEncoder(handle_unknown='ignore'))])

categorical_pipe = Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                             ('ohe', OneHotEncoder(handle_unknown='ignore'))])

numeric_pipe = Pipeline([('scaler', StandardScaler()),
                         ('imputer', SimpleImputer(strategy='median'))])

preprocessing = ColumnTransformer([('boolean', boolean_pipe,  boolean_columns),
                                   ('categorical', categorical_pipe, categorical_columns),
                                   ('numeric',  numeric_pipe, numeric_columns)])  

# Hyperparameters
lr_best_params = {'C': 7.742636826811269,
                  'class_weight': 'balanced',
                  'fit_intercept': False,
                  'max_iter': 500,
                  'penalty': 'l1',
                  'solver': 'liblinear'}

# Final Pipeline
pipe = Pipeline([('preprocessing', preprocessing), 
                 ('classifier',  LogisticRegression(**lr_best_params))])

# Fit Pipeline on entire train 
pipe.fit(X_train, y_train)

# Predict on test set
preds = pipe.predict(X_test)

# See test set precision, accuracy and confusion matrix 
lr_precision_score = precision_score(y_test, preds)
lr_acc_score = accuracy_score(y_test, preds)
lr_confusion = confusion_matrix(y_test, preds)

print("Logistic Regression Set Precision:", round(lr_precision_score,3))
print("Logistic Regression Set Accuracy:", round(lr_acc_score,3))
print("")
print("Test Set Confusion Matrix:")
print(lr_confusion)

Logistic Regression Set Precision: 0.938
Logistic Regression Set Accuracy: 0.856

Test Set Confusion Matrix:
[[22  4]
 [10 61]]


In [88]:
# Create df to interpret confusion matrix 
labels = ['Fail', 'Pass']
pd.DataFrame(lr_confusion, columns=[f'Pred: {label}' for label in labels],
                  index=[f'Actual: {label}' for label in labels])


|   | Pred: Fail | Pred: Pass |
|---|---|---|
| Actual: Fail | 22 | 4 |
| Actual: Pass | 10 | 61 |


In [89]:
# Check classification report 
full_report = classification_report(y_test, preds)
print(full_report)

              precision    recall  f1-score   support

           0       0.69      0.85      0.76        26
           1       0.94      0.86      0.90        71

    accuracy                           0.86        97
   macro avg       0.81      0.85      0.83        97
weighted avg       0.87      0.86      0.86        97
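
The per-class figures in this report follow directly from the test set confusion matrix. The sketch below recomputes them with NumPy, assuming the usual layout of rows as actual classes and columns as predicted classes.

```python
import numpy as np

# Test set confusion matrix printed above (rows = actual, columns = predicted).
cm = np.array([[22,  4],
               [10, 61]])

precision = np.diag(cm) / cm.sum(axis=0)   # correct predictions / all predictions of that class
recall = np.diag(cm) / cm.sum(axis=1)      # correct predictions / all actual members of that class
accuracy = np.trace(cm) / cm.sum()

print(precision.round(2), recall.round(2), round(accuracy, 2))
```

For the "pass" class, precision is 61 / 65 and recall is 61 / 71; for the "fail" class, precision is 22 / 32 and recall is 22 / 26.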



**Remarks about importances**
- I was not surprised that Math Winter Percentile from the MAP test was the most important feature in this model. This supports what we have observed at the school and reinforces many of the MAP score-based practices that we have in place at the school. 
- It is unsurprising that Math Fall Percentile shows no importance (0.0 in the permutation results above) since it is highly correlated with Math Winter Percentile. The same pattern holds for Reading Fall Percentile and Reading Winter Percentile. 

Conclusions and Next Steps
----

My final logistic regression model does a good job of correctly predicting students who will pass the SBAC. The model was selected and tuned to minimize the chance of false positives. In the test set, only 4 of the 97 students are predicted to pass when in reality they failed. Using this model can help teachers identify students who need extra support and who would otherwise have gone unnoticed.

It is important to note that in the test set, 10 students were predicted to fail who in reality passed. The model incorrectly suggests that these students should be receiving extra interventions. This could be an issue because teachers have limited time and resources, and these students would be unnecessarily taking some of those resources away from the students who need them most. The problem of false negatives is not as significant as the problem of false positives in this case, but it is important to be aware of it. 

When the model predicts that a student will pass, it is correct 94% of the time, but when it predicts that a student will fail, it is correct only 69% of the time. When I performed this analysis, I recognized there was a class imbalance, but I did not think it was significant enough to address. Given the major difference in precision between the two classes, I would like to consider addressing this class imbalance in the future using SMOTE. 
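
For context on what SMOTE would do, the sketch below shows the core idea in plain NumPy: create synthetic minority-class rows by interpolating between a minority row and a nearby minority row. It is a deliberately simplified illustration on hypothetical data (it uses only the single nearest neighbor), not the algorithm as implemented in the imbalanced-learn package.

```python
import numpy as np

def smote_like(X_minority, n_new, seed=0):
    """Create n_new synthetic rows by interpolating toward the nearest minority neighbor."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        dists = np.linalg.norm(X_minority - X_minority[i], axis=1)
        dists[i] = np.inf                      # exclude the row itself
        neighbor = X_minority[np.argmin(dists)]
        gap = rng.random()                     # position along the segment, in [0, 1)
        synthetic.append(X_minority[i] + gap * (neighbor - X_minority[i]))
    return np.array(synthetic)

# Hypothetical minority-class rows (two numeric features).
X_fail = np.array([[10.0, 20.0], [12.0, 25.0], [30.0, 15.0], [28.0, 18.0]])
X_new = smote_like(X_fail, n_new=5)
print(X_new.shape)  # (5, 2)
```

Any oversampling of this kind would need to be applied to the training split only, after the features have been encoded as numbers, so that the validation and test sets still reflect the real class balance.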

