<a href="https://colab.research.google.com/github/drewebeatty/colabassignment/blob/main/Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Neural Network Model to Produce Predictions of Success in Residential Treatment**


---




Welcome to this Colab notebook, where our objective is to build a neural network model capable of predicting whether an adolescent will successfully complete the program based on their demographic characteristics and initial score on the Youth Outcome Questionnaire (YOQ). The YOQ is a comprehensive 64-item self-report measure that assesses various aspects of psychological health.

The successful development of this model can serve two valuable purposes. Firstly, it could facilitate predictions prior to clients' admission into the program, aiding in informed admissions decisions. Secondly, it may assist therapists and staff in adopting a more proactive approach to treatment.

For our analysis using the R programming language, we will be utilizing publicly available and de-identified data from a long-term residential treatment center catering to adolescent girls with borderline tendencies. This dataset comprises demographic information, treatment-related variables, responses to all 64 YOQ questions, calculated scores for the six YOQ subscales, and an overall YOQ total score. Additionally, the dataset includes an outcome variable, denoting whether the adolescent successfully completed the program, in contrast to dropping out or being asked to leave.

The framework and workflow for this assignment come from John Curtin's machine learning class, specifically our unit on neural networks. The code, however, has been changed to accommodate data from my own research!







# **Required Packages**

The following packages are needed to build this model. Some of them are never called directly, but functions we use depend on them. When using R, we need the keras package in addition to tensorflow. We will be using tidymodels and a tidymodels style to set up and run our model.

In [None]:
install.packages('tidymodels') # for modeling
install.packages("psych") # for viewing data and summary stats
install.packages('tidyverse') # for general data wrangling
install.packages('kableExtra') # for displaying formatted tables w/ kbl()
install.packages('skimr') # for skim()
install.packages('corrplot') # for correlation plots
install.packages('janitor') # for cleaning variable names
install.packages('cowplot') # for plot_grid() and theme_half_open()
install.packages('ggplot2') # for plotting performance
install.packages("keras") # for NN - needed layer for R
install.packages("tensorflow") # for NN

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)




# **Load the Required Packages**

Now that all the packages are installed, let's load them into our environment. We will also be using some functions that John Curtin wrote and posted to his GitHub. We will pull these functions down from GitHub using devtools. If this method doesn't work for you, the functions can also be found in the file "fun_modeling.R", which is included in the repo.

In [None]:
# Load in the required libraries, set plotting theme, and source functions through git and file
library(ggplot2)
#theme_set(theme_half_open()) # plotting theme
#source('fun_modeling.R')
library(keras)
library(tensorflow)
library(psych) # for summary of data
library(tidymodels) # for modeling
library(tidyverse) # for general data wrangling
library(kableExtra) # for displaying formatted tables w/ kbl()
library(skimr) # for skim()
library(corrplot) # for correlation plots
library(cowplot) # for plot_grid() and theme_half_open()
devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_plots.R?raw=true") # functions for plotting
devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/fun_ml.R?raw=true") # other functions that might come in handy


Attaching package: ‘psych’


The following objects are masked from ‘package:ggplot2’:

    %+%, alpha


── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──

✔ broom        1.0.5     ✔ rsample      1.1.1
✔ dials        1.2.0     ✔ tibble       3.2.1
✔ dplyr        1.1.2     ✔ tidyr        1.3.0
✔ infer        1.0.4     ✔ tune         1.1.1
✔ modeldata    1.1.0     ✔ workflows    1.1.3
✔ parsnip      1.1.0     ✔ workflowsets 1.0.1
✔ purrr        1.0.1     ✔ yardstick    1.2.0
✔ recipes      1.0.6

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ psych::%+%()             masks …

# **Load in the Data**
The data file included in the repo is called "yoq_nn.csv". You will need to upload this file to your working files in Colab to read it in.

In [None]:
d <- read.csv('yoq_nn.csv') # read the file in
describe(d) # get a quick look at our data set

| variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| CID | 1 | 417 | 4.589204e+03 | 855.4372753 | 4623.0 | 4587.1582090 | 1073.4024 | 3111 | 6269 | 3158 | 0.005653123 | -1.1330093 | 41.890943599 |
| Relationship_to_client | 2 | 417 | 0.000000e+00 | 0.0000000 | 0.0 | 0.0000000 | 0.0000 | 0 | 0 | 0 | | | 0.000000000 |
| TimeInstance | 3 | 417 | 1.000000e+00 | 0.0000000 | 1.0 | 1.0000000 | 0.0000 | 1 | 1 | 0 | | | 0.000000000 |
| LOS | 4 | 416 | 2.970409e+02 | 140.5687784 | 305.5 | 296.4161677 | 128.9862 | 3 | 743 | 740 | 0.077276786 | -0.1197712 | 6.891951385 |
| ReligPref | 5 | 121 | 1.006612e+01 | 8.1911899 | 10.0 | 9.4432990 | 10.3782 | 0 | 25 | 25 | 0.410265911 | -1.1135970 | 0.744653629 |
| ReligSimp | 6 | 121 | 4.181818e+00 | 2.3021729 | 6.0 | 4.4226804 | 0.0000 | 0 | 6 | 6 | -0.635089989 | -1.3913465 | 0.209288444 |
| Race | 7 | 397 | 1.375315e+00 | 1.0287264 | 1.0 | 1.0626959 | 0.0000 | 1 | 6 | 5 | 2.730537530 | 6.2780484 | 0.051630298 |
| Ethnicity | 8 | 387 | 8.527132e-02 | 0.2796466 | 0.0 | 0.0000000 | 0.0000 | 0 | 1 | 1 | 2.958427569 | 6.7698067 | 0.014215233 |
| MaritalStat | 9 | 413 | 2.012107e+00 | 1.8688423 | 1.0 | 1.5377644 | 0.0000 | 1 | 9 | 8 | 1.938932447 | 2.5255548 | 0.091959718 |
| Positivity | 10 | 406 | 8.522167e+00 | 2.1712490 | 9.0 | 8.9877301 | 1.4826 | 0 | 10 | 10 | -1.944444456 | 3.7000300 | 0.107757278 |


## Split Data Into Training and Testing Sets

We first need to split our full data into a training set for our model to learn from and a test set for evaluating our model's performance on cases that it has not seen before.

In [None]:
set.seed(12345) # set the seed before splitting so that the split itself is reproducible
splits <- d %>%
  initial_split(prop = 0.75, strata = "Completion") # splitting our data, stratifying the 0.75 (3/4) split on our outcome variable

data_trn <- analysis(splits) # saving it into our training set
data_trn %>%  nrow() # get count of training set rows, see if split seems right

data_test <- assessment(splits) # saving it into our test set
data_test %>% nrow() # get count of test set rows, see if split seems right
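
To illustrate what stratifying on the outcome does, here is a minimal base-R sketch. It is not how `initial_split()` is implemented. The `outcome` vector and its 300/100 class counts are made up for illustration and are not taken from the real data.

```r
set.seed(12345)

# Hypothetical outcome vector with an imbalanced class distribution (made-up counts)
outcome <- factor(rep(c("completion", "non_completion"), times = c(300, 100)))

# Sample 75% of the row indices separately within each class
idx_by_class <- split(seq_along(outcome), outcome)
trn_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.75 * length(idx)))
}))

prop.table(table(outcome))            # class proportions in the full data
prop.table(table(outcome[trn_idx]))   # proportions in the training portion
prop.table(table(outcome[-trn_idx]))  # proportions in the held-out portion
```

Because sampling happens within each class, the training and held-out portions keep the same class balance as the full data, which is the reason for passing `strata = "Completion"` above.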

## Set Random Seed
For reproducibility we will set a random seed.

In [None]:
set.seed(12345) # random seed
fit_seeds <- sample.int(10^5, size = 3) # we will be using a random seed within our model, so we are saving it here

## Setting Up K-Fold Splits
We will be using k-fold cross-validation for our neural network. This is advantageous because we are dealing with a fairly small dataset. K-fold cross-validation makes better use of the data by dividing it into k subsets (or "folds"). Each fold serves as the validation set in exactly one iteration, while the remaining k-1 folds are used for training. This maximizes the use of our data, makes the performance estimate more robust, and helps us detect overfitting.

K-fold cross-validation also provides a more reliable performance estimate by averaging performance metrics over the *k* iterations. This helps us assess the model's consistency and generalization to unseen data, which could vary greatly with a small dataset like ours. Additionally, k-fold cross-validation supports hyperparameter tuning, because each candidate configuration can be evaluated across all *k* folds. This ultimately leads to more informed hyperparameter selection and a more stable neural network model, even with our small dataset!

In [None]:
splits_kfold <- data_trn %>%
  vfold_cv(v = 10, repeats = 1, strata = "Completion") # specifying that we want 10 folds stratified on our outcome variable "Completion" with just one repeat
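
The fold mechanics described above can be sketched in a few lines of base R. This is a simplified illustration without stratification, not the rsample implementation. The names `n`, `k`, `fold_id`, `held_out`, and `held_in` are illustrative, and `n` is an assumed row count of about 75% of the 417 rows in the data.

```r
set.seed(12345)
n <- 312   # assumed number of training rows (about 75% of 417)
k <- 10    # number of folds

# Randomly assign each row to one of k roughly equal-sized folds
fold_id <- sample(rep(seq_len(k), length.out = n))

# In iteration i, rows in fold i are held out and all other rows are used for fitting
held_out <- lapply(seq_len(k), function(i) which(fold_id == i))
held_in  <- lapply(seq_len(k), function(i) which(fold_id != i))

# Every row is held out exactly once across the k iterations
stopifnot(all(sort(unlist(held_out)) == seq_len(n)))

sapply(held_out, length)  # fold sizes (31 or 32 rows each)
```

In the notebook itself, `vfold_cv()` handles this bookkeeping and additionally keeps the proportion of each `Completion` class similar across folds.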

## Setting Up a Recipe

In the tidymodels framework, a "recipe" is a specification of the data preprocessing and feature engineering steps that transform our raw data into a format suited for training and evaluating our neural network model. Typically, recipes handle steps like scaling, normalization, imputing missing values, and encoding categorical variables to produce consistent and sensible input for the model.

Some models require extensive data preprocessing and feature engineering to optimize results; however, the neural network does pretty well "out of the box", and thus minimal processing and feature engineering is required.


In [None]:
rec <-
  recipe(Completion ~ ., data = data_trn) %>% # predicting our outcome from all other variables in the data
  step_string2factor(Completion, levels = c("completion", "non_completion")) %>% # specifying the levels of our outcome variable and turning it into a factor variable instead of a string
  step_YeoJohnson(all_numeric_predictors()) %>% # for normality, there is some extreme skew in some variables
  step_nzv(all_predictors()) %>% # removes variables that are very sparse and unbalanced (Near-Zero Variance), there are some variables like this in the dataset
  step_impute_knn(all_numeric_predictors()) %>% # since there is missing data, we will use the knn method to impute the data
  step_range(all_predictors()) # range correction for better model performance and convergence
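
As a rough illustration of the final step, the range correction applied by `step_range()` is min-max scaling, which by default maps each predictor onto the interval from 0 to 1 using the minimum and maximum found in the training data. The helper `rescale_01()` below is a hypothetical name used only for this sketch. The example values are the LOS minimum, median, and maximum from the `describe()` output above. In the actual recipe the Yeo-Johnson transformation runs first, so the real scaled values will differ.

```r
# Min-max scaling: (x - min) / (max - min)
# rescale_01 is a hypothetical helper for illustration only
rescale_01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

los <- c(3, 305.5, 743)  # LOS min, median, and max from the summary table
rescale_01(los)          # approximately 0, 0.409, 1
```

Putting every predictor on a common scale keeps variables with large raw ranges, such as LOS, from dominating the gradient updates during training.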

## Make a Feature Matrix
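Now that we have a recipe, we will feed in our training data to make a feature matrix to use in our model. Note that `make_features()` is not part of tidymodels; it is one of the helper functions we sourced above.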

In [None]:
feat_trn <- rec %>%
  make_features(data_trn)

Rows: 312
Columns: 128
$ CID                <dbl> 0.9428398, 0.9037638, 0.8927500, 0.8909134, 0.88723…
$ LOS                <dbl> 0.2913546, 0.4853893, 0.2281092, 0.3838314, 0.25345…
$ ReligPref          <dbl> 0.3427779, 0.4829690, 0.5015057, 0.6169810, 0.26482…
$ ReligSimp          <dbl> 0.4380603, 0.5410472, 0.6383677, 0.8191838, 0.43806…
$ Race               <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8, 0…
$ Ethnicity          <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ MaritalStat        <dbl> 0.6938626, 0.9666326, 0.0000000, 0.0000000, 0.69386…
$ Positivity         <dbl> 0.50272067, 1.00000000, 1.00000000, 0.12513324, 1.0…
$ Sense              <dbl> 0.36942738, 0.78333050, 0.67715994, 0.78333050, 0.6…
$ Gender             <dbl> 0.0, 0.2, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0, …

## Hyperparameter Tuning
With the tidymodels framework we can tune our hyperparameters to find the best-performing values. For our model, we are going to tune the number of hidden units and the amount of dropout.

In [None]:
grid_keras <- expand_grid(hidden_units = c(5, 10, 20, 30), dropout = c(.1, .001, .0001))
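
To get a sense of the computational cost, the short sketch below counts how many networks this grid implies. It uses base R's `expand.grid()` rather than tidyr's `expand_grid()` only so that it runs without extra packages. The names `grid_demo`, `n_configs`, `n_folds`, and `n_fits` are illustrative.

```r
# Base-R equivalent of the tuning grid, used here only to count model fits
grid_demo <- expand.grid(hidden_units = c(5, 10, 20, 30),
                         dropout = c(0.1, 0.001, 0.0001))

n_configs <- nrow(grid_demo)      # 4 x 3 = 12 configurations
n_folds   <- 10                   # folds requested in vfold_cv() above
n_fits    <- n_configs * n_folds  # 120 networks trained during tuning

c(configs = n_configs, fits = n_fits)
```

With 200 epochs per network, tuning can take a while, which is worth keeping in mind before enlarging the grid.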

In [None]:
fits_nn <-
  mlp(hidden_units = tune(), dropout = tune(), activation = "relu", epochs = 200) %>%
  set_mode("classification") %>%
  set_engine("keras", verbose = 1, seeds = fit_seeds) %>%
  tune_grid(preprocessor = rec,
                grid = grid_keras,
                resamples = splits_kfold,
                metrics = metric_set(accuracy))

→ A | error:   Python module tensorflow.keras was not found.

               Detected Python configuration:

There were issues with some computations   A: x36





In [None]:
install.packages("reticulate")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
Sys.which("python")

In [None]:
library(reticulate)
use_python("/usr/bin/python3") # point reticulate at the system Python 3 interpreter

In [None]:
library(tensorflow)
tf$executing_eagerly() # check whether TensorFlow can be reached from R and is running in eager mode

ERROR: ignored

In [None]:
tensorflow::tf$enable_eager_execution() # TensorFlow 1.x call; eager execution is already the default in TensorFlow 2.x

ERROR: ignored
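
The error during tuning says that the Python module behind the R keras package could not be found. Here is one possible way to check for it before fitting, as a sketch only. `check_backend()` is a hypothetical helper name and not part of any package. Whether `keras::install_keras()` resolves the problem in this Colab environment is an assumption that the notebook has not confirmed.

```r
# Report whether the Python modules needed by the R keras package are visible to R
# (check_backend is a hypothetical helper, not a function from any package)
check_backend <- function(modules = "tensorflow") {
  if (!requireNamespace("reticulate", quietly = TRUE)) {
    # Without reticulate, R cannot reach Python at all
    return(setNames(rep(FALSE, length(modules)), modules))
  }
  vapply(modules, reticulate::py_module_available, logical(1))
}

status <- check_backend()
print(status)

if (!all(status)) {
  # Assumption: installing the Python backend and restarting R is one possible fix
  message("Python backend not found; running keras::install_keras() and ",
          "restarting the R session is one possible fix.")
}
```

If the check returns `TRUE`, the tuning cell above can be re-run. Otherwise the Python side of TensorFlow still needs to be installed or pointed to.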