# Modeling 

In [1]:
%load_ext autoreload
%autoreload 2
import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

In [2]:
# import standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 


# import additional libraries
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols

# functions from .py file
import src.eda_functions as fun

# sklearn
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor


# turn off warnings
import warnings
warnings.simplefilter('ignore', category = DeprecationWarning)
warnings.simplefilter('ignore', category = FutureWarning)

  import pandas.util.testing as tm


read in merged_data file

In [3]:
data = pd.read_csv('../../data/train_set.csv')

In [4]:
data.head()

Unnamed: 0,cohort,rcdts,school_name,district,city,county,district_type,district_size,school_type,grades_served,...,student_chronic_truancy_rate,high_school_dropout_rate_total,high_school_4_year_graduation_rate_total,high_school_5_year_graduation_rate_total,avg_class_size_high_school,pupil_teacher_ratio_high_school,teacher_retention_rate,percent_graduates_enrolled_in_a_postsecondary_institution_within_16_months,percent_graduates_enrolled_in_a_postsecondary_institution_within_12_months,percent_9th_grade_on_track
0,2013,10010010260001,Seymour High School,Payson CUSD 1,Payson,Adams,UNIT,SMALL,HIGH SCHOOL,7 8 9 10 11 12,...,2.3,1.4,86.8,94.1,11.1,,,,,
1,2013,10010020260001,Liberty High School,Liberty CUSD 2,Liberty,Adams,UNIT,MEDIUM,HIGH SCHOOL,7 8 9 10 11 12,...,1.8,0.5,90.7,93.8,29.5,,,,,
2,2013,10010030260001,Central High School,Central CUSD 3,Camp Point,Adams,UNIT,MEDIUM,HIGH SCHOOL,9 10 11 12,...,5.7,0.0,96.9,89.7,12.7,,,,,
3,2013,10010040260001,Unity High School,CUSD 4,Mendon,Adams,UNIT,MEDIUM,HIGH SCHOOL,9 10 11 12,...,1.0,0.5,91.1,88.6,10.9,,,,,
4,2013,10011720220003,Quincy Sr High School,Quincy SD 172,Quincy,Adams,UNIT,LARGE,HIGH SCHOOL,10 11 12,...,12.8,4.6,88.2,88.5,20.1,,,,,


In [5]:
data['cohort'] = data['cohort'].astype('object')

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3403 entries, 0 to 3402
Data columns (total 30 columns):
 #   Column                                                                      Non-Null Count  Dtype  
---  ------                                                                      --------------  -----  
 0   cohort                                                                      3403 non-null   object 
 1   rcdts                                                                       3403 non-null   object 
 2   school_name                                                                 3403 non-null   object 
 3   district                                                                    3403 non-null   object 
 4   city                                                                        3403 non-null   object 
 5   county                                                                      3403 non-null   object 
 6   district_type                                   

# 1. Replace nans with zeros

In [7]:
zeros = data.loc[:, ('high_school_4_year_graduation_rate_total',
                        'high_school_5_year_graduation_rate_total',
                        'percent_graduates_enrolled_in_a_postsecondary_institution_within_16_months',
                        'percent_graduates_enrolled_in_a_postsecondary_institution_within_12_months',
                        'percent_9th_grade_on_track')]

for zero in zeros:
    data[zero].fillna(0, inplace=True)
    
data.isnull().sum().sort_values(ascending=False)

teacher_retention_rate                                                        732
pupil_teacher_ratio_high_school                                               682
avg_class_size_high_school                                                     34
high_school_dropout_rate_total                                                  2
county                                                                          0
percent_student_enrollment_white                                                0
grades_served                                                                   0
school_type                                                                     0
district_size                                                                   0
district_type                                                                   0
percent_9th_grade_on_track                                                      0
city                                                                            0
percent_student_

# 2. Replace nans with mean()


In [8]:
nans = data.loc[:, ('teacher_retention_rate', 
                    'pupil_teacher_ratio_high_school', 
                    'avg_class_size_high_school', 
                    'high_school_dropout_rate_total')]

for nan in nans:
    data[nan].fillna(data[nan].mean(), inplace=True)

In [9]:
data.isnull().sum().sort_values(ascending=False)

percent_9th_grade_on_track                                                    0
percent_graduates_enrolled_in_a_postsecondary_institution_within_12_months    0
rcdts                                                                         0
school_name                                                                   0
district                                                                      0
city                                                                          0
county                                                                        0
district_type                                                                 0
district_size                                                                 0
school_type                                                                   0
grades_served                                                                 0
percent_student_enrollment_white                                              0
percent_student_enrollment_black_or_afri

# Train Test Split
- Train set = data_df
- Validation set = hs_18
- Test set = hs_19

In [32]:
X_train = data.drop('high_school_4_year_graduation_rate_total', axis=1)
y_train = data.high_school_4_year_graduation_rate_total

val_set = pd.read_csv('../../data/val_set.csv')
X_val = val_set.drop('high_school_4_year_graduation_rate_total', axis=1)
y_val = val_set.high_school_4_year_graduation_rate_total

test_set = pd.read_csv('../../data/test_set.csv')
X_test = test_set.drop('high_school_4_year_graduation_rate_total', axis=1)
y_test = test_set.high_school_4_year_graduation_rate_total

In [11]:
X_val['cohort'] = X_val['cohort'].astype('object')
X_test['cohort'] = X_test['cohort'].astype('object')

# Scaling Numeric features
- Standard Scaler?
- MinMaxScaler?

In [33]:
X_train_num = X_train.select_dtypes(['float64', 'int64'])
X_val_num = X_val.select_dtypes(['float64', 'int64'])
X_test_num = X_test.select_dtypes(['float64', 'int64'])

In [34]:
X_train_index = X_train.index
X_val_index = X_val.index
X_test_index = X_test.index

In [36]:
ss = StandardScaler()
X_train_sc = pd.DataFrame(ss.fit_transform(X_train_num), columns=X_train_num.columns, index=X_train_index)
X_val_sc = pd.DataFrame(ss.transform(X_val_num), columns=X_val_num.columns, index=X_val_index)
X_test_sc = pd.DataFrame(ss.transform(X_test_num), columns=X_test_num.columns, index=X_test_index)


ValueError: operands could not be broadcast together with shapes (720,20) (19,) (720,20) 

# 2. Encoding Categorical features

In [15]:
X_train.district_type.unique()

array(['UNIT       ', 'HIGH SCHOOL'], dtype=object)

In [16]:
X_train_cat = X_train[['district_type', 'district_size', 'school_type']]
X_val_cat = X_val[['district_type', 'district_size', 'school_type']]

In [17]:
X_train_cat.district_type = X_train_cat.district_type.str.rstrip()
X_train_cat.district_size = X_train_cat.district_size.str.rstrip()
X_train_cat.school_type = X_train_cat.school_type.str.rstrip()
#X_train_cat.grades_served = X_train_cat.grades_served.str.rstrip()




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [18]:
X_val_cat.district_type.str.rstrip()
X_val_cat.district_size.str.rstrip()
X_val_cat.school_type.str.rstrip()
#X_val_cat.grades_served.str.rstrip()

0      HIGH SCHOOL
1      HIGH SCHOOL
2      HIGH SCHOOL
3      HIGH SCHOOL
4      HIGH SCHOOL
          ...     
716    HIGH SCHOOL
717    HIGH SCHOOL
718    HIGH SCHOOL
719    HIGH SCHOOL
720    HIGH SCHOOL
Name: school_type, Length: 721, dtype: object

In [19]:
X_train_cat.district_type.unique()

array(['UNIT', 'HIGH SCHOOL'], dtype=object)

In [20]:
ohe = OneHotEncoder()
X_train_coded = ohe.fit(X_train_cat)
X_train_coded = ohe.transform(X_train_cat)
X_val_coded = ohe.fit_transform(X_val_cat)

In [26]:
train_columns = ohe.get_feature_names(input_features=X_train_cat.columns)
val_columns = ohe.get_feature_names(input_features=X_val_cat.columns)
#test_columns = ohe.get_feature_names(input_features=X_test_cat.columns)

X_train_processed = pd.DataFrame(X_train_coded.todense(), columns=train_columns, index=X_train_index)
X_val_processed = pd.DataFrame(X_val_coded.todense(), columns=val_columns, index=X_val_index)

#X_test_processed = pd.DataFrame(X_test_coded.todense(), columns=test_columns, index=X_test_index)#

In [27]:
X_train_all = pd.concat([X_train_sc, X_train_processed], axis=1)
X_val_all = pd.concat([X_train_sc, X_val_processed], axis=1)
#X_test_all = pd.concat([X_test_sc, X_test_processed], axis=1)


# 6. Test models
- Linear Regression
- Random Forest
- Gradient Boost
- SVM
- KNN

In [23]:
from sklearn.linear_model import LinearRegression

# fit an sklearn model
#instantiate a linear regression object 
lr = LinearRegression()

# split the data into target and features
y = y_train
X = X_train_all

# Call .fit from the linear regression object, and feed X and y in as parameters
lr.fit(X,y)

# lr has a method called score.  Again, feed in X and y, and read the output. Save it in the variable score.  What is that number?  Compare it to statsmodels. 
score = lr.score(X,y)
# that is the r_2.  It is the same as the Statsmodels R_2

# lr also has attributes coef_ and intercept_. Save and compare to statsmodels
beta = lr.coef_
intercept = lr.intercept_
#sklearn calculates the same coeficients and intercepts as statmodels.

In [24]:
score

0.4691838070464443

In [28]:
from sklearn.ensemble import GradientBoostingRegressor

# Create the model
gradient_boosted = GradientBoostingRegressor(random_state=19)

# Fit the model on the training data
gradient_boosted.fit(X_train_all, y_train)

# Make predictions on the test data
predictions = gradient_boosted.predict(X_val_all)

# Evaluate the model
mae = np.mean(abs(predictions - y_val))

print('Gradient Boosted Performance on the test set: MAE = %0.4f' % mae)

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

In [31]:
X_val_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3403 entries, 0 to 3402
Data columns (total 27 columns):
 #   Column                                                                      Non-Null Count  Dtype  
---  ------                                                                      --------------  -----  
 0   percent_student_enrollment_white                                            3403 non-null   float64
 1   percent_student_enrollment_black_or_african_american                        3403 non-null   float64
 2   percent_student_enrollment_hispanic_or_latino                               3403 non-null   float64
 3   percent_student_enrollment_asian                                            3403 non-null   float64
 4   percent_student_enrollment_native_hawaiian_or_other_pacific_islander        3403 non-null   float64
 5   percent_student_enrollment_american_indian_or_alaska_native                 3403 non-null   float64
 6   percent_student_enrollment_two_or_more_races    

# 7. Select model

# 8. grid search w cross val

In [None]:
# Loss function to be optimized
loss = ['ls', 'lad', 'huber']

# Number of trees used in the boosting process
n_estimators = [100, 500, 900, 1100, 1500]

# Maximum depth of each tree
max_depth = [2, 3, 5, 10, 15]

# Minimum number of samples per leaf
min_samples_leaf = [1, 2, 4, 6, 8]

# Minimum number of samples to split a node
min_samples_split = [2, 4, 6, 10]

# Maximum number of features to consider for making splits
max_features = ['auto', 'sqrt', 'log2', None]

# Define the grid of hyperparameters to search
hyperparameter_grid = {'loss': loss,
                       'n_estimators': n_estimators,
                       'max_depth': max_depth,
                       'min_samples_leaf': min_samples_leaf,
                       'min_samples_split': min_samples_split,
                       'max_features': max_features}

# Create the model to use for hyperparameter tuning
model = GradientBoostingRegressor(random_state = 42)

# Set up the random search with 4-fold cross validation
random_cv = RandomizedSearchCV(estimator=model,
                               param_distributions=hyperparameter_grid,
                               cv=4, n_iter=25, 
                               scoring = 'neg_mean_absolute_error',
                               n_jobs = -1, verbose = 1, 
                               return_train_score = True,
                               random_state=42)

# Fit on the training data
random_cv.fit(X, y)

In [None]:
# Find the best combination of settings
random_cv.best_estimator_


# 9. train

# 10. evalutate on test set

In [None]:
# Make predictions on the test set using default and final model
default_pred = default_model.predict(X_test)
final_pred = final_model.predict(X_test)

# 11. Model interpretation
- feature impolrtances
 - create df & visualize
- plot single decision tree
- LIME

### Step 5 FSM
Linear Regression

In [None]:
X = num_features.drop("high_school_4_year_graduation_rate_total", axis=1)
y = num_features["high_school_4_year_graduation_rate_total"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
test=pd.DataFrame(X_train)
test.info()


In [None]:
heatmap_numeric_w_dependent_variable(num_features, 'high_school_4_year_graduation_rate_total')
plt.savefig('figures/heatmap.png')
plt.show();

In [None]:
X = num_features.drop("high_school_4_year_graduation_rate_total", axis=1)
y = num_features["high_school_4_year_graduation_rate_total"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
lin_reg_model = LinearRegression()
baseline_cross_val_score = cross_val_score(lin_reg_model, X_train, y_train)
baseline_cross_val_score

In [None]:
outcome = 'high_school_4_year_graduation_rate_total'
predictors = num_features.drop('high_school_4_year_graduation_rate_total', axis=1)
pred_sum = '+'.join(predictors.columns)
formula = outcome + '~' + pred_sum
model = ols(formula=formula, data=num_features).fit()
model.summary()

In [None]:
from sklearn.ensemble import RandomForestRegressor

rfr= RandomForestRegressor(random_state=42)

print(rfr.fit(X_train, y_train))
print(cross_val_score(rfr, X_train, y_train))