# High School Longitudinal Study Data Analysis: Modeling Academic Success
- Ursaminor Jupyter Notebook
- 22 November 2020
- By Barnett Yang

## Miscellaneous Notes and Libraries

In [113]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn import preprocessing

import tensorflow as tf
from tensorflow import keras

## Indicators and Relevant Variables

### Relevant Academic Success Indicators
- X1TXMTH: Mathematics theta score. 
    - 2009 9th graders.
    - Norm-referenced measure of achievement. 
    - Min: -8, Max: 3.0283, Mean: -0.6693, Std: 2.4536.
- X1TXMTSCOR: Mathematics standardized theta score.
    - 2009 9th graders.
    - Norm-referenced measure of achievement, standardized to facilitate comparisons in standard deviation units.
    - Min: -8, Max: 82.1876, Mean: 45.9312, Std: 19.2860.
- X1TXMSCR: Mathematics IRT-estimated number right score. 
    - 2009 9th graders.
    - Min: -8, Max: 69.9317, Mean: 35.9645, Std: 17.7911.
- X1TXMQUINT: Mathematics quintile score. 
    - 1 (lowest) to 5 (highest).
    - Note: Remove unit non-response (-8).
    - Based on the base-year (2009) Mathematics Assessment of Algebraic Reasoning. See Section 2.3 of https://nces.ed.gov/pubs2014/2014361.pdf for details.
- X1TXMPROF1 - X1TXMPROF5: Mathematics proficiency probability scores. 
    - Min: -8, Max: 1.
- S3CLGFT: Attending college full-time or part-time as of Nov 1, 2013.
    - 1 = full-time; 2 = part-time; 3 = don't know. 
    - Note: Remove Unit non-response (-8) and Item legitimate skip/NA (-7).
- S4EARNAMT2: Amount earned for February 2016/last job. 
    - May not be a good indicator since some students may still be in school.
- S4HSGPA: Average grades in high school.
    - 1 (mostly A's) to 7 (D's and below).
    - Note: Remove item legitimate skip (-7), unit non-response (-6), and item not administered (-4)
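
The negative codes noted above (-8, -7, -6, -4) are missing-data flags rather than measurements, so they likely need to be masked before any summary statistic is computed. A minimal sketch, assuming a small made-up frame (the values below are hypothetical, not HSLS records):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the real file has many more columns and records.
toy = pd.DataFrame({
    "X1TXMTH": [-8.0, 0.5, 1.2, -0.3],
    "S4HSGPA": [-7, 1, 4, -6],
})

# Replace only the exact missing-data codes listed in the notes above.
# Legitimately negative theta scores such as -0.3 are left untouched.
MISSING_CODES = [-8, -7, -6, -4]
clean = toy.replace(MISSING_CODES, np.nan)

print(clean["X1TXMTH"].mean())   # mean over the three valid theta scores
print(clean["S4HSGPA"].count())  # number of valid GPA responses
```

Masking with NaN (instead of filtering rows) keeps the frame's shape, which can be convenient when different columns have different missing patterns.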

### Sex, Race, Family Income, Poverty Level Variables
- X1SEX: Student Sex.
    - 1: Male
    - 2: Female
- X1RACE: Student Race. 
    - 1: American Indian/Alaska Native 
    - 2: Asian
    - 3: Black/African-American
    - 4: Hispanic, no race specified
    - 5: Hispanic, race specified
    - 7: Native Hawaiian/Pacific Islander
    - 8: White
- X1FAMINCOME: Total family income from all sources (in US dollars, 2008). 
    - 1: <= 15,000
    - 2: > 15,000 and <= 35,000
    - 3: > 35,000 and <= 55,000
    - 4: > 55,000 and <= 75,000
    - 5: > 75,000 and <= 95,000
    - 6: > 95,000 and <= 115,000
    - 7: > 115,000 and <= 135,000
    - 8: > 135,000 and <= 155,000
    - 9: > 155,000 and <= 175,000
    - 10: > 175,000 and <= 195,000
    - 11: > 195,000 and <= 215,000
    - 12: > 215,000 and <= 235,000
    - 13: > 235,000
    - -8: Unit non-response
- X1POVERTY: X1 Poverty indicator.
    - Relative to 100% of Census poverty threshold. 
    - 0: At or above poverty threshold
    - 1: Below poverty threshold
    - Note: Remove Unit non-response (-8).
- X1DADEDU: Father's/male guardian's highest level of education
    - 0: No bio/adoptive/step-father in household
    - 1: Less than high school
    - 2: High school diploma or GED
    - 3: Associate's degree
    - 4: Bachelor's degree
    - 5: Master's degree
    - 7: Ph.D/M.D/Law/other high lvl prof degree
    - -9: Missing
    - -8: Unit non-response
- X1MOMEDU: Mother's/female guardian's highest level of education
    - 0: No bio/adoptive/step-mother in household
    - 1: Less than high school
    - 2: High school diploma or GED
    - 3: Associate's degree
    - 4: Bachelor's degree
    - 5: Master's degree
    - 7: Ph.D/M.D/Law/other high lvl prof degree
    - -9: Missing
    - -8: Unit non-response
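
The X1FAMINCOME codes listed above follow a regular pattern: codes 2 through 12 are 20,000-dollar bands starting at 15,000, while codes 1 and 13 are open-ended. A small sketch of that arithmetic (the helper name `income_bracket_bounds` is hypothetical and not part of the HSLS codebook):

```python
import math


def income_bracket_bounds(code):
    """Return (exclusive lower, inclusive upper) dollar bounds for an X1FAMINCOME code."""
    if not 1 <= code <= 13:
        raise ValueError(f"not a valid income bracket code: {code}")
    if code == 1:
        return (-math.inf, 15_000.0)      # "<= 15,000", no lower bound stated
    if code == 13:
        return (235_000.0, math.inf)      # "> 235,000", no upper bound stated
    lower = 15_000.0 + 20_000.0 * (code - 2)
    return (lower, lower + 20_000.0)


for code in (2, 3, 12):
    print(code, income_bracket_bounds(code))
```

Having numeric bounds available makes it possible to treat income as an ordered quantity (for example via bracket midpoints) instead of thirteen unrelated categories.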

## Linear Regression Model to Predict Overall High School GPA By Poverty and Income

### Constants and Load Data

In [2]:
variables_gpa = ['X1FAMINCOME', 'X1POVERTY', 'S4HSGPA']
dfgpas = pd.read_csv('../data/HSLS_2017_Datasets/hsls_17_student.csv', usecols=variables_gpa)
dfgpas.head()

Unnamed: 0,X1FAMINCOME,X1POVERTY,S4HSGPA
0,10,0,-7
1,3,0,-7
2,6,0,-7
3,5,0,-7
4,9,0,-7


In [3]:
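# Keep only rows with valid (non-missing) codes; .copy() avoids chained-assignment warnings in later cells
dfgpas = dfgpas[(dfgpas['X1FAMINCOME']>0) & (dfgpas['X1POVERTY']>=0) & (dfgpas['S4HSGPA']>0)].copy()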
dfgpas

Unnamed: 0,X1FAMINCOME,X1POVERTY,S4HSGPA
15,3,0,1
41,3,0,4
53,6,0,2
68,5,0,4
89,5,0,2
...,...,...,...
23435,13,0,1
23455,13,0,2
23471,4,0,5
23474,8,0,2


### One Hot Encode Income

In [4]:
# Relabel codes 1-13 as "Income Bracket <code>" in one vectorized step
dfgpas['X1FAMINCOME'] = "Income Bracket " + dfgpas['X1FAMINCOME'].astype(str)
dfgpas

Unnamed: 0,X1FAMINCOME,X1POVERTY,S4HSGPA
15,Income Bracket 3,0,1
41,Income Bracket 3,0,4
53,Income Bracket 6,0,2
68,Income Bracket 5,0,4
89,Income Bracket 5,0,2
...,...,...,...
23435,Income Bracket 13,0,1
23455,Income Bracket 13,0,2
23471,Income Bracket 4,0,5
23474,Income Bracket 8,0,2


In [5]:
col_names = ["Income Bracket " + str(i) for i in range(1, 14)]
dummies = pd.get_dummies(dfgpas.X1FAMINCOME)

dummies = dummies.reindex(columns=col_names)

In [6]:
dfgpas = pd.concat([dfgpas, dummies], axis='columns')
dfgpas = dfgpas.drop(['X1FAMINCOME'], axis='columns')
dfgpas

Unnamed: 0,X1POVERTY,S4HSGPA,Income Bracket 1,Income Bracket 2,Income Bracket 3,Income Bracket 4,Income Bracket 5,Income Bracket 6,Income Bracket 7,Income Bracket 8,Income Bracket 9,Income Bracket 10,Income Bracket 11,Income Bracket 12,Income Bracket 13
15,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
41,0,4,0,0,1,0,0,0,0,0,0,0,0,0,0
53,0,2,0,0,0,0,0,1,0,0,0,0,0,0,0
68,0,4,0,0,0,0,1,0,0,0,0,0,0,0,0
89,0,2,0,0,0,0,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23435,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
23455,0,2,0,0,0,0,0,0,0,0,0,0,0,0,1
23471,0,5,0,0,0,1,0,0,0,0,0,0,0,0,0
23474,0,2,0,0,0,0,0,0,0,1,0,0,0,0,0
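
The relabel, `get_dummies`, `reindex`, and `concat` steps above can probably be collapsed into a single call. A minimal sketch on a hypothetical three-row frame, assuming the income column still holds the integer codes 1 to 13:

```python
import pandas as pd

# Hypothetical toy rows standing in for the filtered HSLS frame.
toy = pd.DataFrame({"X1FAMINCOME": [3, 6, 13], "X1POVERTY": [0, 0, 0]})

# Declare all 13 levels so brackets absent from the data still get a column.
toy["X1FAMINCOME"] = pd.Categorical(toy["X1FAMINCOME"], categories=list(range(1, 14)))

encoded = pd.get_dummies(
    toy,
    columns=["X1FAMINCOME"],
    prefix="Income Bracket",
    prefix_sep=" ",
    dtype=int,
)
print(encoded.columns.tolist())
```

Passing `drop_first=True` to `pd.get_dummies` would additionally omit the first bracket so that it serves as the reference level in a regression with an intercept.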


### Set up Feature and Response Variables

In [7]:
#set up X (features) & y (target/response)
X = dfgpas.drop(columns='S4HSGPA')
X = X.values

y = dfgpas['S4HSGPA']
y = y.values

### Ordinary Least Squares

In [8]:
# Hold out the last 25% of rows for testing (0.25 is also sklearn's default); shuffle=False keeps the file order
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=False)
linreg = LinearRegression()
linreg.fit(X_train, y_train)
predictions = linreg.predict(X_test)
linreg_mse = mean_squared_error(y_test, predictions)
print(linreg_mse)
print(linreg.coef_)

3.184564422662211
[ 1.19148111e-01 -2.82591644e+13 -2.82591644e+13 -2.82591644e+13
 -2.82591644e+13 -2.82591644e+13 -2.82591644e+13 -2.82591644e+13
 -2.82591644e+13 -2.82591644e+13 -2.82591644e+13 -2.82591644e+13
 -2.82591644e+13 -2.82591644e+13]
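
The identical, very large income coefficients printed above are the pattern one would expect from perfect multicollinearity: the 13 bracket indicators sum to 1 in every row, which duplicates the intercept that `LinearRegression` fits by default. A minimal sketch on synthetic data (the bracket codes below are randomly generated, not HSLS values) checks the rank of the design matrix with and without one bracket dropped:

```python
import numpy as np

rng = np.random.default_rng(0)
codes = rng.integers(1, 14, size=200)   # synthetic bracket codes 1..13
codes[:13] = np.arange(1, 14)           # make sure every bracket appears at least once

dummies = (codes[:, None] == np.arange(1, 14)).astype(float)  # 200 x 13 one-hot matrix
intercept = np.ones((200, 1))

full = np.hstack([intercept, dummies])            # intercept + all 13 indicators
reduced = np.hstack([intercept, dummies[:, 1:]])  # bracket 1 becomes the reference level

print(full.shape[1], np.linalg.matrix_rank(full))        # one more column than the rank
print(reduced.shape[1], np.linalg.matrix_rank(reduced))  # columns and rank agree
```

When the design matrix is rank-deficient the least-squares coefficients are not unique, so dropping one indicator (or fitting without an intercept) is a common way to get interpretable values.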


## Linear Regression Model to Predict College Enrollment Status By Poverty and Income

### Constants and Load Data

In [9]:
variables_college = ['X1FAMINCOME', 'X1POVERTY', 'S3CLGFT']
dfcollege = pd.read_csv('../data/HSLS_2017_Datasets/hsls_17_student.csv', usecols=variables_college)
dfcollege.head()

Unnamed: 0,X1FAMINCOME,X1POVERTY,S3CLGFT
0,10,0,1
1,3,0,1
2,6,0,1
3,5,0,-8
4,9,0,3


In [10]:
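# Keep only rows with valid (non-missing) codes; .copy() avoids chained-assignment warnings in later cells
dfcollege = dfcollege[(dfcollege['X1FAMINCOME']>0) & (dfcollege['X1POVERTY'] >= 0) & (dfcollege['S3CLGFT']>0)].copy()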
dfcollege

Unnamed: 0,X1FAMINCOME,X1POVERTY,S3CLGFT
0,10,0,1
1,3,0,1
2,6,0,1
4,9,0,3
5,5,0,1
...,...,...,...
23497,7,0,1
23499,1,1,3
23500,7,0,1
23501,3,0,1


### One Hot Encode Income

In [11]:
# Relabel codes 1-13 as "Income Bracket <code>" in one vectorized step
dfcollege['X1FAMINCOME'] = "Income Bracket " + dfcollege['X1FAMINCOME'].astype(str)
dfcollege

Unnamed: 0,X1FAMINCOME,X1POVERTY,S3CLGFT
0,Income Bracket 10,0,1
1,Income Bracket 3,0,1
2,Income Bracket 6,0,1
4,Income Bracket 9,0,3
5,Income Bracket 5,0,1
...,...,...,...
23497,Income Bracket 7,0,1
23499,Income Bracket 1,1,3
23500,Income Bracket 7,0,1
23501,Income Bracket 3,0,1


In [12]:
col_names = ["Income Bracket " + str(i) for i in range(1, 14)]
dummies = pd.get_dummies(dfcollege.X1FAMINCOME)

dummies = dummies.reindex(columns=col_names)
dummies

Unnamed: 0,Income Bracket 1,Income Bracket 2,Income Bracket 3,Income Bracket 4,Income Bracket 5,Income Bracket 6,Income Bracket 7,Income Bracket 8,Income Bracket 9,Income Bracket 10,Income Bracket 11,Income Bracket 12,Income Bracket 13
0,0,0,0,0,0,0,0,0,0,1,0,0,0
1,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1,0,0,0,0
5,0,0,0,0,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
23497,0,0,0,0,0,0,1,0,0,0,0,0,0
23499,1,0,0,0,0,0,0,0,0,0,0,0,0
23500,0,0,0,0,0,0,1,0,0,0,0,0,0
23501,0,0,1,0,0,0,0,0,0,0,0,0,0


In [13]:
dfcollege = pd.concat([dfcollege, dummies], axis='columns')
dfcollege = dfcollege.drop(['X1FAMINCOME'], axis='columns')
dfcollege

Unnamed: 0,X1POVERTY,S3CLGFT,Income Bracket 1,Income Bracket 2,Income Bracket 3,Income Bracket 4,Income Bracket 5,Income Bracket 6,Income Bracket 7,Income Bracket 8,Income Bracket 9,Income Bracket 10,Income Bracket 11,Income Bracket 12,Income Bracket 13
0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0
1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0
4,0,3,0,0,0,0,0,0,0,0,1,0,0,0,0
5,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23497,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0
23499,1,3,1,0,0,0,0,0,0,0,0,0,0,0,0
23500,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0
23501,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0


### Set up Feature and Response Variables

In [14]:
#set up X (features) & y (target/response)
X = dfcollege.drop(columns='S3CLGFT')
X = X.values

y = dfcollege['S3CLGFT']
y = y.values

### Ordinary Least Squares

In [15]:
# Hold out the last 25% of rows for testing (0.25 is also sklearn's default); shuffle=False keeps the file order
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=False)
linreg = LinearRegression()
linreg.fit(X_train, y_train)
predictions = linreg.predict(X_test)
linreg_mse = mean_squared_error(y_test, predictions)
print(linreg_mse)
print(linreg.coef_)

0.23446106430800523
[ 7.14269371e-02 -1.57264149e+12 -1.57264149e+12 -1.57264149e+12
 -1.57264149e+12 -1.57264149e+12 -1.57264149e+12 -1.57264149e+12
 -1.57264149e+12 -1.57264149e+12 -1.57264149e+12 -1.57264149e+12
 -1.57264149e+12 -1.57264149e+12]


## Some Observations of the Linear Regression Models and Hypotheses Regarding Their Limitations

### Linear Regression Prediction of High School GPA Based on Family Income and Poverty
- This model had a high mean squared error of 3.18 on a response variable whose codes span a range of 6 (1 to 7). This model is far from ideal.
- Possible explanations of the high mean squared error:
    - There were many non-respondents to the S4HSGPA variable in the HSLS.
    - Linear regression may not work well in cases where all feature and response variables are categorical.
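
One way to put an MSE such as 3.18 in context is to compare it against a baseline that ignores the features and always predicts the training mean. A minimal sketch with hypothetical GPA codes (not HSLS data):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical GPA codes (1 = mostly A's ... 7 = D's and below).
y_train = np.array([1, 2, 2, 3, 4, 2, 1, 5])
y_test = np.array([2, 3, 1, 4])
X_train = np.zeros((len(y_train), 1))  # features are ignored by the baseline
X_test = np.zeros((len(y_test), 1))

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_mse = mean_squared_error(y_test, baseline.predict(X_test))
print(baseline_mse)
```

If a regression's test MSE is not clearly below this baseline, the features are contributing little predictive value.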
    
### Linear Regression Prediction of College Enrollment Status Based on Family Income and Poverty
- This model had a lower mean squared error of 0.23 on a response variable whose codes span a range of 2 (1 to 3). Notably, the coefficients for all thirteen one-hot encoded income variables were identical. This pattern is consistent with perfect multicollinearity, since the thirteen indicators always sum to one and therefore duplicate the model's intercept.
- Possible sources of inaccuracy:
    - Substantial number of non-respondents to the S3CLGFT variable in the HSLS. However, there were far fewer non-respondents to this variable than there were to S4HSGPA, possibly contributing to its lower mean squared error.
    - The data is not temporally consistent. Income and poverty data was collected in 2008/2009 and college enrollment data was collected in 2013.
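
Because S3CLGFT is a category label (1 = full-time, 2 = part-time, 3 = don't know) and not a quantity, a classifier may be a more natural fit than ordinary least squares. A minimal sketch using multinomial logistic regression on synthetic stand-in data (nothing below comes from the HSLS file, and this is not the model used in this notebook):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic stand-ins: one poverty flag and one income code per student.
X = np.column_stack([rng.integers(0, 2, 300), rng.integers(1, 14, 300)])
y = rng.integers(1, 4, 300)   # 1 = full-time, 2 = part-time, 3 = don't know

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X[:5])   # one probability per enrollment category

print(clf.classes_)
print(proba.sum(axis=1))           # each row sums to 1
```

A classifier is scored with accuracy or log loss instead of mean squared error, which avoids treating the distance between codes 1 and 3 as meaningful.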

## Neural Network to Model College Enrollment Status by Family Income, Poverty, Race, Sex, and Parents' Level of Education

In [148]:
variables = ['X2POVERTY', 'X2POVERTY185', 'X2FAMINCOME', 'X1RACE', 'X1SES', 'X1MTHID', 'X1SCIID',
             'X1SEX', 'X1DADEDU', 'X1MOMEDU', 'X1SCHOOLENG', 'X1SCHOOLBEL', 'S3CLGFT']
coldf = pd.read_csv('../data/HSLS_2017_Datasets/hsls_17_student.csv', usecols=variables)

In [151]:
coldf = coldf[(coldf['X2POVERTY'] >= 0) & (coldf['X2FAMINCOME'] > 0) & (coldf['X1RACE'] > 0) & 
              (coldf['X1SEX'] >= 0) & (coldf['X1DADEDU'] > 0) & (coldf['X1MOMEDU'] > 0) & 
              (coldf['X2POVERTY185'] >= 0) & (coldf['S3CLGFT'] > 0)]
coldf.head()

Unnamed: 0,X1SEX,X1RACE,X1MOMEDU,X1DADEDU,X1SES,X1MTHID,X1SCIID,X1SCHOOLBEL,X1SCHOOLENG,X2FAMINCOME,X2POVERTY,X2POVERTY185,S3CLGFT
0,1,8,5,5,1.5644,1.76,0.91,0.84,-1.41,11,0,0,1
1,2,8,3,2,-0.3699,-0.57,0.91,0.05,-0.2,3,0,1,1
5,2,8,3,3,1.0639,1.16,-1.57,-0.52,0.96,5,0,0,1
7,1,8,5,7,1.5144,0.0,-0.33,0.45,-0.2,7,0,0,1
10,1,8,5,3,1.2058,1.76,1.52,-0.04,-0.58,5,0,0,1


In [152]:
coldf['S3CLGFT'] = coldf['S3CLGFT'] - 1
coldf.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  coldf['S3CLGFT'] = coldf['S3CLGFT'] - 1


Unnamed: 0,X1SEX,X1RACE,X1MOMEDU,X1DADEDU,X1SES,X1MTHID,X1SCIID,X1SCHOOLBEL,X1SCHOOLENG,X2FAMINCOME,X2POVERTY,X2POVERTY185,S3CLGFT
0,1,8,5,5,1.5644,1.76,0.91,0.84,-1.41,11,0,0,0
1,2,8,3,2,-0.3699,-0.57,0.91,0.05,-0.2,3,0,1,0
5,2,8,3,3,1.0639,1.16,-1.57,-0.52,0.96,5,0,0,0
7,1,8,5,7,1.5144,0.0,-0.33,0.45,-0.2,7,0,0,0
10,1,8,5,3,1.2058,1.76,1.52,-0.04,-0.58,5,0,0,0
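
The warning shown above typically appears when a column is assigned on a frame that was produced by boolean filtering. A minimal sketch of one common way to avoid it, using hypothetical toy rows: take an explicit `.copy()` of the filtered frame before modifying it.

```python
import pandas as pd

raw = pd.DataFrame({"X1SEX": [1, 2, 1], "S3CLGFT": [1, -8, 3]})  # hypothetical toy rows

# .copy() makes the filtered frame independent of `raw`,
# so the assignment below is unambiguous.
valid = raw[raw["S3CLGFT"] > 0].copy()
valid["S3CLGFT"] = valid["S3CLGFT"] - 1   # shift labels 1..3 to 0..2

print(valid["S3CLGFT"].tolist())
```

The shift to zero-based labels matters because sparse categorical cross-entropy expects class indices starting at 0.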


In [153]:
X = coldf.drop(columns='S3CLGFT')
min_max_scaler = preprocessing.MinMaxScaler()
X_scaled = min_max_scaler.fit_transform(X)
y = coldf['S3CLGFT']
y = y.values

# Hold out the last 25% of rows for testing (0.25 is also sklearn's default); shuffle=False keeps the file order
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, shuffle=False)
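
In the cell above the scaler is fitted on all rows before the split, so the test rows influence the scaling. A minimal sketch of the alternative order (split first, then fit the scaler on the training rows only), using a hypothetical feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = np.arange(20, dtype=float).reshape(10, 2)   # hypothetical feature matrix
y = np.arange(10) % 3                           # hypothetical labels 0..2

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=False)

scaler = MinMaxScaler().fit(X_train)       # learn min/max from training rows only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # may fall outside [0, 1]; that is expected

print(X_train_scaled.min(), X_train_scaled.max())
print(X_test_scaled.max())
```

Test values outside the unit interval are not an error here; they simply reflect that the test rows were not seen when the scaling was learned.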

In [154]:
# Create and compile model
nn_model2 = keras.Sequential([
    keras.layers.Dense(10, input_shape=(12,), activation='sigmoid'),
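    # softmax makes the three class outputs a probability distribution, as sparse_categorical_crossentropy expects
    keras.layers.Dense(3, activation='softmax')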
])

nn_model2.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

nn_model2.fit(X_train, y_train, epochs=20)

Epoch 1/20
...
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7fb9884b7220>
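
For reference, the `sparse_categorical_crossentropy` loss used above takes an integer class label per sample and a predicted probability per class, and returns the negative log of the probability assigned to the true class. A softmax output layer is the usual pairing because it guarantees that the class probabilities sum to 1. A minimal NumPy sketch of both pieces (an illustration of the arithmetic, not Keras's internal implementation):

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def sparse_categorical_crossentropy(y_true, probs):
    """Negative log-probability of the true class for each sample."""
    return -np.log(probs[np.arange(len(y_true)), y_true])


logits = np.array([[0.0, 0.0, 0.0],    # no preference between the three classes
                   [2.0, 0.0, 0.0]])   # leans towards class 0
labels = np.array([0, 0])

probs = softmax(logits)
losses = sparse_categorical_crossentropy(labels, probs)
print(probs.sum(axis=1))   # each row sums to 1
print(losses)              # the confident, correct row has the smaller loss
```

An uninformed three-class prediction gives a loss of ln 3, which can serve as a rough reference point when reading training logs.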