# Impact of Master's Pay on School Performance
While interning with the Department of Public Instruction, I was asked to analyze the relationship between the share of teachers with master's degrees and school performance. The hope was that a positive relationship would encourage the NC General Assembly to reinstate master's pay.

At the time I did not know about multiple linear regression, so I used a simple linear regression. Here I share my original results and improve upon them using machine learning techniques. I examine which features affect school growth, school performance grades, and school performance scores.

### Section 0: Importing and cleaning data

In [None]:
#Importing needed modules
import pandas as pd
from pandasql import sqldf
import random
import plotly.express as px
from pandas.api.types import is_numeric_dtype
from scipy.stats import kruskal
from scipy.stats import normaltest
import numpy as np
import scipy.stats as stats
from sklearn import tree
import graphviz
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import Lasso
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
#Reading in data and selecting specific columns of interest
masters=pd.read_csv("C:\\Users\\amlaw\\Documents\\Institute for Advanced Analytics\\SideProjects\\NC School Data\\FullData\\NCSchoolRemade.csv")

#removing data without a spg score
masters=masters[masters['spg_score'].notnull()]

#Replacing missing values with 0
masters=masters.fillna(0)

#Viewing data
masters.head()

In [None]:
#Printing unique variables to verify clean data.
for col in masters.columns:
    print(col, masters[col].unique())

In [None]:
#Separating data into predictors and y
X=masters.drop(columns='spg_score')
y=masters['spg_score']

#Splitting data once to create test sets
X_train_bad, X_test, y_train_bad, y_test = train_test_split(X, y, test_size=0.1, random_state=844)
#Verifying that the test is 10%
print('Test %', len(X_test)/len(X))
print('Test count',len(X_test))

#Splitting data again to create training and validation
X_train, X_valid, y_train, y_valid = train_test_split(X_train_bad, y_train_bad, test_size=0.22, random_state=844)
#Checking percentages
print('Train %', len(X_train)/len(X))
print('Valid %', len(X_valid)/len(X))
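
The two successive splits above imply overall proportions that are easy to work out by hand. A minimal sketch of that arithmetic, using only the two `test_size` values from the cell above (the variable names are my own, and the actual row counts will differ slightly because of rounding):

```python
# Fractions implied by two successive splits (values taken from the cell above).
test_size_first = 0.10   # share of all rows held out as the test set
test_size_second = 0.22  # share of the remaining rows held out for validation

test_frac = test_size_first
valid_frac = (1 - test_size_first) * test_size_second
train_frac = (1 - test_size_first) * (1 - test_size_second)

print(f"train={train_frac:.3f}, valid={valid_frac:.3f}, test={test_frac:.3f}")
```

So a second split of 0.22 gives roughly a 70/20/10 train/validation/test division.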

In [None]:
#Visualizing numeric data with histograms
for col in X_train.columns:
    if is_numeric_dtype(X_train[col]):
        fig1=px.histogram(X_train[col])
        fig1.show()

### Section 1: Relationship between school performance grades and percentage of teachers with advanced degrees.

#### Is the percentage of teachers with advanced degrees different across whether a school did not meet, met growth, or exceeded growth?

In [None]:
#Creating dataframes for schools that did not meet growth, met growth, or exceeded growth
NotMet=X_train[X_train['eg_status']=='NotMet']
Met=X_train[X_train['eg_status']=='Met']
Exceeded=X_train[X_train['eg_status']=='Exceeded']

In [None]:
#Visualizing the data to see if it is normal enough to use an ANOVA
fig2=px.histogram(NotMet, x='pct_adv_degree', 
                  title='Histogram of teachers with advanced degrees for schools that did not meet growth',
                  labels={'pct_adv_degree': 'Percent of teachers with advanced degrees'})
fig3=px.histogram(Met, x='pct_adv_degree', 
                  title='Histogram of teachers with advanced degrees for schools that met growth',
                  labels={'pct_adv_degree': 'Percent of teachers with advanced degrees'})
fig4=px.histogram(Exceeded, x='pct_adv_degree', 
                  title='Histogram of teachers with advanced degrees for schools that exceeded growth',
                  labels={'pct_adv_degree': 'Percent of teachers with advanced degrees'})
fig2.show()
fig3.show()
fig4.show()
#Some are questionable on their normality

In [None]:
#Need to remove one school because they have no degree information at all
Met=Met[Met['pct_adv_degree'].notnull()]

In [None]:
#I'm not certain if some of these are close enough, so I want to see a statistical test of normality
k1, p1 = normaltest(NotMet['pct_adv_degree'])
k2, p2 = normaltest(Met['pct_adv_degree'])
k3, p3 = normaltest(Exceeded['pct_adv_degree'])
print(p1, p2, p3)
#Not normal

In [None]:
#Testing if there is a difference in distribution of teachers with advanced degrees in schools with 
#different levels of EVAAS growth
stat, p=kruskal(NotMet['pct_adv_degree'], Met['pct_adv_degree'], Exceeded['pct_adv_degree'])
print('The p-value is ' + str(p))

if (p>0.0054):
    print('This shows that there is NOT a difference in percentage of teachers with advanced degrees.')
else:
    print('This shows that there IS a difference in percentage of teachers with advanced degrees.')

In [None]:
#Visualizing differences in percentage of teachers with advanced degrees with boxplots
fig5 = px.box(X_train[X_train['eg_status']!=0], x="eg_status", y="pct_adv_degree", 
              labels={'eg_status': 'EVAAS Growth Status',
                        'pct_adv_degree': '% of teachers with advanced degrees'})
fig5.update_layout(title={'text':'Growth Status and % of teachers with advanced degrees', 'x':0.5})
fig5.show()

The boxplots show that the percentage of teachers with advanced degrees varies little across growth statuses.

### Examining relationship between school score and % of teachers with advanced degrees

In [None]:
#Combining X and y for ease
Xy_train=pd.concat([X_train, y_train], axis=1)

In [None]:
#Fitting a linear regression to see the relationship between school score and the percent of teachers with advanced degrees.
fig6=px.scatter(Xy_train, x='pct_adv_degree', y='spg_score', trendline='ols', 
                labels={'pct_adv_degree':'% of teachers with advanced degrees',
                       'spg_score':'2017 School Performance Score'})
fig6.update_layout(title={'text':'Relationship between school performance scores and teacher advanced degrees','x':0.5})
fig6.show()

### Examining if this continues for Master's degrees too.

In [None]:
#Visualizing for normality
fig7=px.histogram(NotMet, x='Masters', 
                  title="Histogram of teachers with master's degrees for schools that did not meet growth",
                  labels={'Masters': "Percent of teachers with master's degrees"})
fig8=px.histogram(Met, x='Masters', 
                  title="Histogram of teachers with master's degrees for schools that met growth",
                  labels={'Masters': "Percent of teachers with master's degrees"})
fig9=px.histogram(Exceeded, x='Masters', 
                  title="Histogram of teachers with master's degrees for schools that exceeded growth",
                  labels={'Masters': "Percent of teachers with master's degrees"})
fig7.show()
fig8.show()
fig9.show()
#These are definitely not normal. So Kruskal-Wallis test it is.

In [None]:
#Testing if there is a difference in distribution of teachers with masters degrees in schools with 
#different levels of EVAAS growth
stat, p=kruskal(NotMet['Masters'], Met['Masters'], Exceeded['Masters'])
print('The p-value is ' + str(p))

if (p>0.0054):
    print('This shows that there is NOT a difference in percentage of teachers with masters degrees.')
else:
    print('This shows that there IS a difference in percentage of teachers with masters degrees.')

In [None]:
#Visualizing differences in percentage of teachers with masters degrees with boxplots
fig10 = px.box(Xy_train[Xy_train['eg_status']!=0], x="eg_status", y="Masters", 
               labels={'eg_status': 'EVAAS Growth Status',
                    'Masters': "% of teachers with master's degrees"})
fig10.update_layout(title={'text': "Growth Status and % of teachers with master's degrees", 'x':0.5})
fig10.show()

It appears that only outlier schools in NC have more than 30% of teachers with master's degrees. Schools that met growth seem to have a lower percentage of teachers with master's degrees overall.

### Examining relationship between school score and % of teachers with masters degrees

In [None]:
#Fitting a linear regression to see the relationship between school score and the percent of teachers with masters degrees.
fig11=px.scatter(Xy_train, x='Masters', y='spg_score', trendline='ols', 
                labels={'Masters':"% of teachers with master's degrees",
                       'spg_score':'2017 School Performance Score'})
fig11.update_layout(title={'text':"Relationship between school performance scores and teacher master's degrees",'x':0.5})
fig11.show()

# Section 2: What features contribute to whether a school did not meet growth, met growth, or exceeded growth?

In [None]:
#Creating matrices to complete a decision tree below
Xy_train=Xy_train[Xy_train['eg_status']!=0]
X=Xy_train.drop(columns=['eg_status', 'agency_code', 'School_Name', 'LEA_Name', 'eg_score', 'School_Type_Desc', 'ma_spg_score',
                         'rd_spg_score', 'ma_eg_status', 'rd_eg_status', 'ma_eg_score', 'rd_eg_score', 'LEA_Name.1', 'Free',
                         'Reduced', 'Total', 'ACALL', 'ACCO', 'ACEN', 'ACMA', 'ACRD', 'ACSC', 'ACWR', 'avg_sat_score',
                         'pct_sat_participation', 'EDS%', 'Final_ADM', 'pct_ap_pass', 'pct_ap_participation', 'year', 
                         'WAP_Count', 'NumClassrooms', 'crime', 'short_term', 'long_term', 'expulsion', 
                         'Does_Not_Meet_Expected_Growth', 'Exceeds_Expected_Growth', 'Meets_Expected_Growth', 'Locale2', 
                         'spg_score', 'spg_grade', 'Advanced', 'In_Process', 'Total_Degrees'])
y=Xy_train.eg_status[Xy_train['eg_status']!=0]

In [None]:
#Shortening decision tree classifier
dt=tree.DecisionTreeClassifier(criterion="gini", max_depth=10, min_samples_leaf=50)

#I tried gini and entropy and gini proved to be the best. I tried different pruning depths and 10 with a minimum leaf size of
#50 seemed to produce a reasonably sized tree
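
The comparison of split criteria and depths described above can be written as a small loop. This is only a sketch under assumptions: the data comes from `make_classification` rather than the school file, and the grid of depths is hypothetical, so it shows the pattern and not my actual tuning run.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the school data (an assumption).
X_demo, y_demo = make_classification(n_samples=1500, n_features=10, n_informative=5,
                                     n_classes=3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Validation accuracy for each (criterion, max_depth) pair.
results = {}
for criterion in ("gini", "entropy"):
    for depth in (3, 5, 10):
        clf = DecisionTreeClassifier(criterion=criterion, max_depth=depth,
                                     min_samples_leaf=50, random_state=0)
        clf.fit(X_tr, y_tr)
        results[(criterion, depth)] = clf.score(X_va, y_va)

best = max(results, key=results.get)
print("best setting:", best, "validation accuracy:", results[best])
```

Scoring on held-out validation rows rather than the training rows keeps the comparison from rewarding trees that only memorize the training data.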

In [None]:
#Fitting decision tree
dt=dt.fit(X, y)

In [None]:
#Visualizing the tree itself (and saving it as a PDF)
X_feature_names = list(X.columns)
#export_graphviz expects class names in the same sorted order as dt.classes_
dot_data = tree.export_graphviz(dt, out_file=None, feature_names=X_feature_names, 
                                class_names=[str(c) for c in dt.classes_])
graph = graphviz.Source(dot_data) 
graph.render("masters")
graph

In [None]:
#Creating predictions based on this tree
y_pred = dt.predict(X)

#What is the goodness of fit?
print("Goodness of fit:",metrics.accuracy_score(y, y_pred))

In [None]:
#Preparing the validation data with the same columns used to fit the growth tree
X_valid_simp=X_valid[X_valid['eg_status']!=0]
Valid_growth=X_valid_simp['eg_status']
X_valid_simp=X_valid_simp.drop(columns=['eg_status', 'agency_code', 'School_Name', 'LEA_Name', 'eg_score', 'School_Type_Desc', 
                         'rd_spg_score', 'ma_eg_status', 'rd_eg_status', 'ma_eg_score', 'rd_eg_score', 'LEA_Name.1', 'Free',
                         'Reduced', 'Total', 'ACALL', 'ACCO', 'ACEN', 'ACMA', 'ACRD', 'ACSC', 'ACWR', 'avg_sat_score',
                         'pct_sat_participation', 'EDS%', 'Final_ADM', 'pct_ap_pass', 'pct_ap_participation', 'year', 
                         'WAP_Count', 'NumClassrooms', 'crime', 'short_term', 'long_term', 'expulsion', 
                         'Does_Not_Meet_Expected_Growth', 'Exceeds_Expected_Growth', 'Meets_Expected_Growth', 'Locale2', 
                         'ma_spg_score', 'spg_grade', 'Advanced', 'In_Process', 'Total_Degrees'])
y_pred_valid = dt.predict(X_valid_simp)

#Accuracy on validation
print(metrics.accuracy_score(Valid_growth, y_pred_valid))

#Gini performed slightly better on validation
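
Overall accuracy hides which growth categories the tree confuses with each other. A minimal sketch of a confusion matrix using `metrics.confusion_matrix`, on a handful of made-up labels (the toy lists below are illustrative, not model output):

```python
from sklearn import metrics

# Hypothetical true and predicted growth statuses for six schools.
labels = ["NotMet", "Met", "Exceeded"]
y_true_demo = ["Met", "Met", "NotMet", "Exceeded", "Met", "NotMet"]
y_pred_demo = ["Met", "NotMet", "NotMet", "Met", "Met", "Met"]

# Rows are the true classes, columns the predicted classes, both in `labels` order.
cm = metrics.confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
acc = metrics.accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print("accuracy:", acc)
```

Passing `labels` explicitly fixes the row and column order, which otherwise follows the sorted class values.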

In [None]:
#Seeing if variable importance provides any extra information for interpreting the tree.
feat_importance = dt.tree_.compute_feature_importances(normalize=False)
var_importance = {}
var_import=[]
for i in range(0, len(X_feature_names)):
    if feat_importance[i] > 0:
        var_importance[X_feature_names[i]] = feat_importance[i]
        #Storing a list of significant variables for later use
        var_import.append(X_feature_names[i])

    
sort_var_import=sorted(var_importance.items(), key=lambda x: x[1], reverse=True)
for i in sort_var_import:
    print(str(i[0]) + ': ' + str(i[1]))

Degrees matter for school growth, but teacher turnover and the percentage of economically disadvantaged youth also contribute.
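
An alternative way to rank the variables is the fitted tree's public `feature_importances_` attribute, wrapped in a pandas Series. The sketch below uses synthetic data and placeholder column names (both are assumptions). Unlike the unnormalized values printed above, these importances are normalized to sum to 1.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with placeholder feature names (not the school columns).
X_demo, y_demo = make_classification(n_samples=1000, n_features=6, n_informative=3,
                                     n_redundant=0, random_state=1)
cols = [f"feature_{i}" for i in range(X_demo.shape[1])]

clf = DecisionTreeClassifier(max_depth=4, min_samples_leaf=50, random_state=1)
clf.fit(X_demo, y_demo)

# Normalized importances, keeping only features the tree actually split on.
importance = pd.Series(clf.feature_importances_, index=cols)
used = importance[importance > 0].sort_values(ascending=False)
print(used)
```

A feature with zero importance was never chosen for a split, which is the same rule used above to build the list of variables passed to the LASSO models.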

# Section 3: What features contribute to school performance grades?

## Fitting a decision tree for school performance grades (A-F)

In [None]:
#Creating data matrix for seeing relationship between school performance grade and other variables
X2=Xy_train.drop(columns=['eg_status', 'agency_code', 'School_Name', 'LEA_Name', 'eg_score', 'School_Type_Desc', 
                         'rd_spg_score', 'ma_eg_status', 'rd_eg_status', 'ma_eg_score', 'rd_eg_score', 'LEA_Name.1', 'Free',
                         'Reduced', 'Total', 'ACALL', 'ACCO', 'ACEN', 'ACMA', 'ACRD', 'ACSC', 'ACWR', 'avg_sat_score',
                         'pct_sat_participation', 'EDS%', 'Final_ADM', 'pct_ap_pass', 'pct_ap_participation', 'year', 
                         'WAP_Count', 'NumClassrooms', 'crime', 'short_term', 'long_term', 'expulsion', 
                         'Does_Not_Meet_Expected_Growth', 'Exceeds_Expected_Growth', 'Meets_Expected_Growth', 'Locale2', 
                         'ma_spg_score', 'spg_grade', 'Advanced', 'In_Process', 'Total_Degrees', 'spg_score'])
y2=Xy_train['spg_grade']

In [None]:
dt2=tree.DecisionTreeClassifier(criterion="entropy", max_depth=10, min_samples_leaf=50)

In [None]:
#Fitting tree
dt2=dt2.fit(X2, y2)

In [None]:
#Visualizing tree
X2_feature_names = list(X2.columns)
#export_graphviz expects class names in the same sorted order as dt2.classes_
dot_data = tree.export_graphviz(dt2, out_file=None, feature_names=X2_feature_names, class_names=[str(c) for c in dt2.classes_])
graph = graphviz.Source(dot_data) 
graph.render("masters_grade")
graph

In [None]:
#Seeing if variable importance provides any extra information
feat_importance2 = dt2.tree_.compute_feature_importances(normalize=False)
var_importance2 = {}
var_import2=[]
for i in range(0, len(X2_feature_names)):
    if feat_importance2[i] > 0:
        var_importance2[X2_feature_names[i]] = feat_importance2[i]
        #Storing a list of significant variables for later use
        var_import2.append(X2_feature_names[i])

    
sort_var_import2=sorted(var_importance2.items(), key=lambda x: x[1], reverse=True)
for i in sort_var_import2:
    print(str(i[0]) + ': ' + str(i[1]))

In [None]:
#Predicting values based on tree
y_pred2 = dt2.predict(X2)

#What is the goodness of fit?
print("Goodness of fit:",metrics.accuracy_score(y2, y_pred2))

In [None]:
#Preparing the validation data with the same columns used to fit the grade tree
Valid_grade=X_valid['spg_grade']
X_valid_simp=X_valid.drop(columns=['eg_status', 'agency_code', 'School_Name', 'LEA_Name', 'eg_score', 'School_Type_Desc', 
                         'rd_spg_score', 'ma_eg_status', 'rd_eg_status', 'ma_eg_score', 'rd_eg_score', 'LEA_Name.1', 'Free',
                         'Reduced', 'Total', 'ACALL', 'ACCO', 'ACEN', 'ACMA', 'ACRD', 'ACSC', 'ACWR', 'avg_sat_score',
                         'pct_sat_participation', 'EDS%', 'Final_ADM', 'pct_ap_pass', 'pct_ap_participation', 'year', 
                         'WAP_Count', 'NumClassrooms', 'crime', 'short_term', 'long_term', 'expulsion', 
                         'Does_Not_Meet_Expected_Growth', 'Exceeds_Expected_Growth', 'Meets_Expected_Growth', 'Locale2', 
                         'ma_spg_score', 'spg_grade', 'Advanced', 'In_Process', 'Total_Degrees'])
y_pred2_valid = dt2.predict(X_valid_simp)


#Accuracy
print("Accuracy:",metrics.accuracy_score(Valid_grade, y_pred2_valid))

#Entropy did better on accuracy
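
The same long list of dropped columns is repeated for every training and validation matrix, and the copies can drift apart. One way to keep them in sync is a small helper. This is a sketch only: `DROP_COLS` is shortened for illustration and `select_features` is a hypothetical name, not something used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical, shortened drop list; the real list in this notebook is much longer.
DROP_COLS = ["eg_status", "School_Name", "spg_grade", "spg_score"]

def select_features(df, drop_cols=DROP_COLS):
    """Return df without the excluded columns, ignoring any that are absent."""
    return df.drop(columns=list(drop_cols), errors="ignore")

# Toy frames: the validation frame has no 'spg_score' column, like X_valid.
train_demo = pd.DataFrame({"eg_status": ["Met"], "School_Name": ["A"], "spg_grade": ["B"],
                           "spg_score": [72], "pct_eds": [40.0], "Masters": [25.0]})
valid_demo = train_demo.drop(columns="spg_score")

print(list(select_features(train_demo).columns))
print(list(select_features(valid_demo).columns))
```

With `errors="ignore"`, the same list works on frames that contain the target column and on frames that do not, so the predictors always line up.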

# Section 4: What features contribute to school performance scores?

In [None]:
#Standardizing data for LASSO Regression
#Separate scalers for X and y, each fit on the training data only
x_scaler=StandardScaler()
y_scaler=StandardScaler()
stand_X = x_scaler.fit_transform(X2)
stand_X_train = pd.DataFrame(stand_X, columns=X2_feature_names)

np_y_train = Xy_train['spg_score'].to_numpy()

stand_y_train = y_scaler.fit_transform(np_y_train.reshape(-1,1))

#Standardize validation data
X_v=X_valid.drop(columns=['eg_status', 'agency_code', 'School_Name', 'LEA_Name', 'eg_score', 'School_Type_Desc', 
                         'rd_spg_score', 'ma_eg_status', 'rd_eg_status', 'ma_eg_score', 'rd_eg_score', 'LEA_Name.1', 'Free',
                         'Reduced', 'Total', 'ACALL', 'ACCO', 'ACEN', 'ACMA', 'ACRD', 'ACSC', 'ACWR', 'avg_sat_score',
                         'pct_sat_participation', 'EDS%', 'Final_ADM', 'pct_ap_pass', 'pct_ap_participation', 'year', 
                         'WAP_Count', 'NumClassrooms', 'crime', 'short_term', 'long_term', 'expulsion', 
                         'Does_Not_Meet_Expected_Growth', 'Exceeds_Expected_Growth', 'Meets_Expected_Growth', 'Locale2', 
                         'ma_spg_score', 'spg_grade', 'Advanced', 'In_Process', 'Total_Degrees'])
#Reusing the training means and standard deviations so the validation data does not leak into the scaling
stand_X_v = x_scaler.transform(X_v)
stand_X_valid = pd.DataFrame(stand_X_v, columns=X_v.columns)

np_y_valid = y_valid.to_numpy()

stand_y_valid = y_scaler.transform(np_y_valid.reshape(-1,1))

In [None]:
#I used the significant variables from the decision trees for growth and school performance grade as a starting point

#X3 has variables from school performance grade model
X3=stand_X_train[var_import2]
X3_v=stand_X_valid[var_import2]

#X4 has variables from the growth model
X4=stand_X_train[var_import]
X4_v=stand_X_valid[var_import]

In [None]:
# Creating a function to repeat LASSO

def lasso_model(x, y, x_valid, y_valid, alpha):
    # fit
    model = Lasso(alpha=alpha, random_state=543)
    model.fit(x, y)
    pred = model.predict(x_valid)
    coef = model.coef_
    
    #MAE (flattening both arrays so a column vector and a 1-D array are not broadcast into an n x n matrix)
    diff=np.abs(np.ravel(y_valid)-np.ravel(pred))
    print("Alpha", alpha, "MAE:",np.mean(diff))

In [None]:
# Create list of alpha values to test
alpha_lasso = [1e-15, 1e-10, 1e-8, 1e-5,1e-4, 1e-3,1e-2, 1e-1, 1, 5, 10]

# Run function for X3
for elem in alpha_lasso:
    lasso_model(X3, stand_y_train, X3_v, stand_y_valid, elem)

An alpha >= 1 gave all 0 coefficients. So I'm using 0.1 as my alpha level.
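
Besides the MAE, it helps to track how many coefficients survive at each alpha, since that is what drove the choice above. A sketch on synthetic standardized data, where the data, the true coefficients, and the shorter alpha grid are all assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Synthetic standardized predictors; only three of the eight matter.
X_demo = rng.standard_normal((300, 8))
true_coef = np.array([1.0, -0.5, 0.0, 0.0, 0.25, 0.0, 0.0, 0.0])
y_demo = X_demo @ true_coef + 0.1 * rng.standard_normal(300)
y_demo = (y_demo - y_demo.mean()) / y_demo.std()  # standardize the response

# Simple train/validation split by position.
X_tr, X_va = X_demo[:200], X_demo[200:]
y_tr, y_va = y_demo[:200], y_demo[200:]

# For each alpha: validation MAE and the number of nonzero coefficients.
summary = []
for alpha in (1e-3, 1e-2, 1e-1, 1, 5):
    model = Lasso(alpha=alpha).fit(X_tr, y_tr)
    mae = mean_absolute_error(y_va, model.predict(X_va))
    summary.append((alpha, mae, int(np.count_nonzero(model.coef_))))

for alpha, mae, n_nonzero in summary:
    print(f"alpha={alpha:g}  MAE={mae:.3f}  nonzero={n_nonzero}")
```

With a standardized response, a large enough alpha shrinks every coefficient to zero and the model only predicts the mean, which matches what I saw for alpha >= 1.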

In [None]:
#LASSO regression using significant variables for school performance grades
model = Lasso(alpha=0.1, random_state=543)
model.fit(X3, stand_y_train)

for i in range(0, len(var_importance2)):
    print(var_import2[i], model.coef_[i])

In [None]:
#Creating a pandas dataframe to print values
dict_LASSO_X3 = {'Variable':var_import2,'Coefficient':model.coef_}
LASSO_X3 = pd.DataFrame(dict_LASSO_X3)

#Removing zero coefficients and printing values
LASSO_X3_Sig = LASSO_X3[LASSO_X3['Coefficient'] != 0]
LASSO_X3_Sig.reset_index(drop=True, inplace=True)
prop=pd.DataFrame(['4-Year Graduation Rate', '% of Black Students',  '% of White Students',
                   '% of Economically Disadvantaged Students', '% of Attendance to ADM',
                   '% of Free Lunch Students', 'Short Term Suspensions per 100'], columns=['Variable Name'])
LASSO_X3_Sig=pd.concat([LASSO_X3_Sig, prop], axis=1)
LASSO_X3_Sig['Index']=LASSO_X3_Sig['Variable']
LASSO_X3_Sig.set_index('Index', inplace=True)
LASSO_X3_Sig

In [None]:
#Creating function to put variable coefficients back into regular units
def unstdev(l, coef):
    for elem in coef['Variable']:
        l.append(coef.loc[elem, 'Coefficient']*(np.std(y_train)/np.std(X_train[elem])))
    for i in range(0, len(coef)):
        print(coef['Variable'].iloc[i] +': ' + str(l[i]))

In [None]:
#unstandardize coefficients
coef_unstd_X3=[]
unstdev(coef_unstd_X3, LASSO_X3_Sig)
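
The back-transformation multiplies each standardized coefficient by sd(y)/sd(x). The sketch below checks that rule numerically with ordinary least squares on synthetic data. Both the data and the use of OLS are assumptions: for OLS the identity is exact, while for LASSO it only converts the penalized coefficients into original units.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic predictors on very different scales (not the school data).
X_raw = np.column_stack([rng.normal(50, 5, n), rng.normal(50, 20, n)])
y_raw = 3.0 + 0.4 * X_raw[:, 0] - 0.1 * X_raw[:, 1] + rng.normal(0, 1, n)

sd_x = X_raw.std(axis=0)
sd_y = y_raw.std()
X_std = (X_raw - X_raw.mean(axis=0)) / sd_x
y_std = (y_raw - y_raw.mean()) / sd_y

# OLS on standardized data (no intercept needed because everything is centered).
beta_std = np.linalg.lstsq(X_std, y_std, rcond=None)[0]

# Back-transform: beta_raw_j = beta_std_j * sd(y) / sd(x_j)
beta_back = beta_std * sd_y / sd_x

# OLS on the raw data with an explicit intercept column, for comparison.
design = np.column_stack([np.ones(n), X_raw])
beta_raw = np.linalg.lstsq(design, y_raw, rcond=None)[0]

print("back-transformed:", beta_back)
print("raw-scale slopes:", beta_raw[1:])
```

The two printed vectors agree, so a back-transformed coefficient can be read as the change in school performance score per one-unit change in the predictor.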

In [None]:
#Creating a plot to visualize information 
fig12 = px.bar(LASSO_X3_Sig, x="Variable Name", y="Coefficient")
fig12.update_layout(title={'text':'Coefficients for each Selected Variable', 'x':0.5})
fig12.show()

In [None]:
# Run function for X4
for elem in alpha_lasso:
    lasso_model(X4, stand_y_train, X4_v, stand_y_valid, elem)

An alpha >= 1 gave all 0 coefficients. So I'm using 0.1 as my alpha level.

In [None]:
#LASSO regression using significant variables from school growth decision tree
model = Lasso(alpha=0.1, random_state=543)
model.fit(X4, stand_y_train)
for i in range(0, len(var_importance)):
    print(var_import[i], model.coef_[i])

In [None]:
#Creating pandas dataframe for examining coefficients
dict_LASSO_X4 = {'Variable':var_import,'Coefficient':model.coef_}
LASSO_X4 = pd.DataFrame(dict_LASSO_X4)

#Removing zero coefficients
LASSO_X4_Sig = LASSO_X4[LASSO_X4['Coefficient'] != 0]
LASSO_X4_Sig.reset_index(drop=True, inplace=True)
prop=pd.DataFrame(['5-Year Graduation Rate', '% of Black Students', '% of Economically Disadvantaged Students',
                   'Average Number of Book Titles', '% of Teachers with < 3 Years Experience', 
                   '1 Year Teacher/Principal Turnover %', '% of Reduced Lunch Students', 'Short Term Suspensions per 100'],
                   columns=['Variable Name'])
LASSO_X4_Sig=pd.concat([LASSO_X4_Sig, prop], axis=1)
LASSO_X4_Sig['Index']=LASSO_X4_Sig['Variable']
LASSO_X4_Sig.set_index('Index', inplace=True)

LASSO_X4_Sig

In [None]:
#Unstandardize the coefficients
coef_unstd_X4=[]
unstdev(coef_unstd_X4, LASSO_X4_Sig)

In [None]:
#Creating a plot to visualize information 
fig13 = px.bar(LASSO_X4_Sig, x="Variable Name", y="Coefficient")
fig13.update_layout(title={'text':'Coefficients for each Selected Variable', 'x':0.5})
fig13.show()

In [None]:
#Removing variables that measure similar ideas in school and graduation rates to prevent multicollinearity
X5=stand_X_train.drop(columns=['total_expense', 'GradRate5Yr', 'Male_Tot', 'FreePct', 'ReducedPct', 'local_rank', 'state_rank', 
                               'federal_rank', 'total_rank'])

#Validation data
X5_v=stand_X_valid.drop(columns=['total_expense', 'GradRate5Yr', 'Male_Tot', 'FreePct', 'ReducedPct', 'local_rank', 'state_rank', 
                               'federal_rank', 'total_rank'])

In [None]:
# Run function for X5
for elem in alpha_lasso:
    lasso_model(X5, stand_y_train, X5_v, stand_y_valid, elem)

An alpha >= 1 gave all 0 coefficients. So I'm using 0.1 as my alpha level.

In [None]:
#LASSO regression using most variables
model=Lasso(alpha=0.1, random_state=543)
model.fit(X5, stand_y_train)
sig_X5=[]
for i in range(0, len(X5.columns)):
    print(X5.columns[i], model.coef_[i])
    if model.coef_[i]!=0:
        sig_X5.append(X5.columns[i])

In [None]:
#Creating pandas dataframe to print values
dict_LASSO_X5 = {'Variable':X5.columns,'Coefficient':model.coef_}
LASSO_X5 = pd.DataFrame(dict_LASSO_X5)

#Removing zero coefficients
LASSO_X5_Sig = LASSO_X5[LASSO_X5['Coefficient'] != 0]
LASSO_X5_Sig.reset_index(drop=True, inplace=True)
prop=pd.DataFrame(['4-Year Graduation Rate', 'Community Eligibility Provision Indicator', '% of Female Students',
                   '% of Black Students', '% of White Students', '% of Economically Disadvantaged Students',
                   '% of Attendance to ADM', 'Average Media Age', '% of Teachers with < 3 Years Experience',
                   '1 Year Teacher/Principal Turnover %', 'Short Term Suspensions per 100'], columns=['Variable Name'])
LASSO_X5_Sig=pd.concat([LASSO_X5_Sig, prop], axis=1)
LASSO_X5_Sig['Index']=LASSO_X5_Sig['Variable']
LASSO_X5_Sig.set_index('Index', inplace=True)
LASSO_X5_Sig

In [None]:
#Unstandardize the coefficients
coef_unstd_X5=[]
unstdev(coef_unstd_X5, LASSO_X5_Sig)

In [None]:
#Creating a plot to visualize information 
fig14 = px.bar(LASSO_X5_Sig, x="Variable Name", y="Coefficient")
fig14.update_layout(title={'text':'Coefficients for each Selected Variable', 'x':0.5})
fig14.show()

# Results: The model using X4 had the best results

In [None]:
#Printing results again to see outcomes
model = Lasso(alpha=0.1, random_state=543)
model.fit(X4, stand_y_train)
for i in range(0, len(var_importance)):
    print(var_import[i], model.coef_[i])

# Additional Plots for Report

In [None]:
#Fitting a linear regression to see the relationship between school score and the percent of teachers with <3 years experience.
fig15=px.scatter(Xy_train, x='pct_experience_0', y='spg_score', trendline='ols', 
                labels={'pct_experience_0':"% of teachers with < 3 Years of Experience",
                       'spg_score':'2017 School Performance Score'})
fig15.update_layout(title={'text':"Relationship between school performance scores and teacher experience",'x':0.5})
fig15.show()

fig11.show()

In [None]:
#Fitting a linear regression to see the relationship between school score and the percent of economically disadvantaged students.
fig16=px.scatter(masters, x='pct_eds', y='spg_score', trendline='ols', 
                labels={'pct_eds':"Percentage of Economically Disadvantaged Students",
                       'spg_score':'2017 School Performance Score'})
fig16.update_layout(title={'text':"Relationship between school performance scores and economic disadvantage",'x':0.5})

fig16.show()

In [None]:
#Fitting a linear regression to see the relationship between school score and number of short term suspensions.
fig17=px.scatter(Xy_train, x='shortsusper100', y='spg_score', trendline='ols', 
                labels={'shortsusper100':"Number of Short Term Suspensions per 100 Students",
                       'spg_score':'2017 School Performance Score'})
fig17.update_layout(title={'text':"Relationship between school performance scores and short term suspensions",'x':0.5})
fig17.show()

In [None]:
#Importing plotly subplots and graph objects
from plotly.subplots import make_subplots
import plotly.graph_objects as go

In [None]:
#Creating a plot to compare variable significance in the 3 models to get a sense of what matters for school performance scores
fig=go.Figure()
fig.add_trace(go.Bar(
    y=LASSO_X3_Sig['Variable Name'], 
    x=LASSO_X3_Sig['Coefficient'], 
    orientation='h',
    name='Grade Decision Tree Variables',
    marker={'color':'orange'}))
fig.add_trace(go.Bar(
    y=LASSO_X4_Sig['Variable Name'], 
    x=LASSO_X4_Sig['Coefficient'], 
    orientation='h',
    name='Growth Decision Tree Variables',
    marker={'color':'blue'}))
fig.add_trace(go.Bar(
    y=LASSO_X5_Sig['Variable Name'], 
    x=LASSO_X5_Sig['Coefficient'], 
    orientation='h',
    name='Selected Variables',
    marker={'color':'lightgreen'}))
fig.update_layout(title='Variable Significance', title_x=0.5, legend=dict(
    yanchor="bottom",
    y=-0.25,
    xanchor="left",
    x=0
))
layout = go.Layout(
    autosize=False,
    width=800,
    height=700)
fig.update_layout(layout)

In [None]:
#Fitting a linear regression to see the relationship between school score and the percent of white students
fig18=px.scatter(masters, x='White', y='spg_score', trendline='ols', 
                labels={'White':"% of White Students",
                       'spg_score':'2017 School Performance Score'})
fig18.update_layout(title={'text':"Relationship between school performance scores and white students",'x':0.5})
fig18.show()

In [None]:
#Fitting a linear regression to see the relationship between percent of white students and percent of economically disadvantaged
#students
fig19=px.scatter(masters, x='White', y='pct_eds', trendline='ols', 
                labels={'White':"% of White Students",
                       'pct_eds':'% Economically Disadvantaged'})
fig19.update_layout(title={'text':"Relationship between race and economic disadvantage",'x':0.5})
fig19.show()

In [None]:
#Fitting a linear regression to see the relationship between school score and the percent of teachers with masters degrees.
fig20=px.scatter(masters, x='Masters', y='spg_score', trendline='ols', 
                labels={'Masters':"Percentage of Teachers with Master's Degrees",
                       'spg_score':'2017 School Performance Score'})
fig20.update_layout(title={'text':"Relationship between school performance scores and percent of teachers with master's degrees",'x':0.5})

fig20.show()

In [None]:
#Fitting a linear regression to see the relationship between SAT score and the percent of teachers with masters degrees.
sat_masters=masters[masters['avg_sat_score']>0]
fig21=px.scatter(sat_masters, x='Masters', y='avg_sat_score', trendline='ols', 
                labels={'Masters':"Percentage of Teachers with Master's Degrees",
                       'avg_sat_score':'2017 SAT Scores'})
fig21.update_layout(title={'text':"Relationship between SAT scores and percent of teachers with master's degrees",'x':0.5})

fig21.show()