# Student Loan Risk with Deep Learning

In [1]:
# Imports
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from pathlib import Path

# M18D01A04

---

## Prepare the data to be used on a neural network model

### Step 1: Read the `student-loans.csv` file into a Pandas DataFrame. Review the DataFrame, looking for columns that could eventually define your features and target variables.   

In [2]:
# Read the csv into a Pandas DataFrame
file_path = "https://static.bc-edx.com/ai/ail-v-1-0/m18/lms/datasets/student-loans.csv"
loans_df = pd.read_csv(file_path)

# Review the DataFrame
loans_df.head()

Unnamed: 0,payment_history,location_parameter,stem_degree_score,gpa_ranking,alumni_success,study_major_code,time_to_completion,finance_workshop_score,cohort_ranking,total_loan_score,financial_aid_score,credit_ranking
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,0
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,0
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,0
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,1
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,0


In [3]:
# Review the data types associated with the columns
loans_df.dtypes

Unnamed: 0,0
payment_history,float64
location_parameter,float64
stem_degree_score,float64
gpa_ranking,float64
alumni_success,float64
study_major_code,float64
time_to_completion,float64
finance_workshop_score,float64
cohort_ranking,float64
total_loan_score,float64


In [4]:
# Check the credit_ranking value counts
loans_df["credit_ranking"].value_counts()

credit_ranking,count
1,855
0,744
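
The counts above suggest the two classes are fairly balanced, which is one reason plain accuracy is a usable metric later in the notebook. A minimal sketch of checking the proportions follows. It rebuilds the label column from the counts shown instead of re-downloading the CSV, so `labels` is only a stand-in for `loans_df["credit_ranking"]`.

```python
import pandas as pd

# Stand-in for loans_df["credit_ranking"], rebuilt from the counts above
labels = pd.Series([1] * 855 + [0] * 744, name="credit_ranking")

# normalize=True returns class proportions instead of raw counts
class_share = labels.value_counts(normalize=True)
print(class_share.round(3))
```

On the real DataFrame the equivalent call would be `loans_df["credit_ranking"].value_counts(normalize=True)`.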


### Step 2: Using the preprocessed data, create the features (`X`) and target (`y`) datasets. The target dataset should be defined by the preprocessed DataFrame column “credit_ranking”. The remaining columns should define the features dataset.

In [5]:
### array([0, 0, 0, 1, 0])

# Define the target set y using the credit_ranking column
y = loans_df["credit_ranking"]

# Display a sample of y
y.head()

Unnamed: 0,credit_ranking
0,0
1,0
2,0
3,1
4,0


In [6]:
### 	payment_history	location_parameter	stem_degree_score	gpa_ranking	alumni_success	study_major_code	time_to_completion	finance_workshop_score	cohort_ranking	total_loan_score	financial_aid_score
### 0	        7.4	                0.70	            0.00	        1.9	        0.076	            11.0	            34.0	            0.9978	            3.51	            0.56	                9.4
### 1	        7.8	                0.88	            0.00	        2.6	        0.098	            5.0	                67.0	            0.9968	            3.20	            0.68	                9.8

# Define features set X by selecting all columns but credit_ranking
X = loans_df.copy()
X = X.drop(columns=["credit_ranking"])
# Two datasets were created: a target (y) dataset, which includes the "credit_ranking" column, and a features (X) dataset, which includes the other columns. ----------------------------------------------> (5 points)

# Review the features DataFrame
X

Unnamed: 0,payment_history,location_parameter,stem_degree_score,gpa_ranking,alumni_success,study_major_code,time_to_completion,finance_workshop_score,cohort_ranking,total_loan_score,financial_aid_score
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4
...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2


### Step 3: Split the features and target sets into training and testing datasets.


In [7]:
# Split the preprocessed data into a training and testing dataset
# Assign the function a random_state equal to 1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# The features and target sets have been split into training and testing datasets. -------------------------------------------------------------------------------------------------------------> (5 points)
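
A quick sanity check on the split sizes, sketched under two assumptions: the dataset has 1,599 rows (the sum of the two class counts reported earlier) and Keras uses its default batch size of 32. The arrays `X_demo` and `y_demo` are placeholders, not the real loan data.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the loan data (1599 rows, 11 features)
X_demo = np.zeros((1599, 11))
y_demo = np.array([1] * 855 + [0] * 744)

# No test_size is given, so scikit-learn holds out 25% of the rows
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
print(len(X_tr), len(X_te))

# Batches per pass over the data, assuming a batch size of 32
batch_size = 32
train_steps = math.ceil(len(X_tr) / batch_size)
test_steps = math.ceil(len(X_te) / batch_size)
print(train_steps, test_steps)
```

These sizes line up with the `38/38` steps per epoch in the training log and with the `13/13` steps and support of 400 in the evaluation output. Passing `stratify=y` would additionally keep the class proportions equal in both splits; this notebook does not do that.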

### Step 4: Use scikit-learn's `StandardScaler` to scale the features data.

In [8]:
# Create a StandardScaler instance
X_scaler = StandardScaler()

# Fit the scaler to the features training dataset
X_scaler.fit(X_train)

# Scale the features training and testing datasets
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
# Scikit-learn's StandardScaler was used to scale the features data. ---------------------------------------------------------------------------------------------------------------------------> (5 points)
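
The scaler is fit on the training rows only and then reused on the test rows, so no test-set statistics leak into training. The sketch below illustrates that pattern on hypothetical toy values (not the loan data) and checks that `transform` matches the z-score formula `(x - mean) / std` computed by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
train = np.array([[7.4, 34.0], [7.8, 67.0], [11.2, 60.0], [7.8, 54.0]])
test = np.array([[6.2, 44.0]])

scaler = StandardScaler().fit(train)    # learn mean and std from training rows only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)    # reuse the training statistics

# The same transformation written out by hand (population std, ddof=0)
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(manual, test_scaled))
```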

---

## Compile and Evaluate a Model Using a Neural Network

### Step 1: Create a deep neural network by assigning the number of input features, the number of layers, and the number of neurons on each layer using TensorFlow’s Keras.

> **Hint** You can start with a two-layer deep neural network model that uses the `relu` activation function for both layers.


In [9]:
### 11

# Define the number of inputs (features) to the model
input_nodes = len(X.columns)

# Review the number of features
input_nodes

11

In [10]:
# Define the number of hidden nodes for the first hidden layer
hidden_nodes_01 = 6

# Define the number of hidden nodes for the second hidden layer
hidden_nodes_02 = 3

# Define the number of neurons in the output layer
output_nodes = 1

In [11]:
# Create the Sequential model instance
nn_model = tf.keras.models.Sequential()

# Add the first hidden layer
nn_model.add(tf.keras.layers.Dense(units=hidden_nodes_01, activation="relu", input_dim=input_nodes))

# Add the second hidden layer (its input size is inferred from the previous layer)
nn_model.add(tf.keras.layers.Dense(units=hidden_nodes_02, activation="relu"))

# Add the output layer to the model specifying the number of output neurons and activation function
nn_model.add(tf.keras.layers.Dense(units=output_nodes, activation="sigmoid"))


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


In [12]:
# Model: "sequential"
# _________________________________________________________________
#  Layer (type)                Output Shape              Param #
# =================================================================
#  dense (Dense)               (None, 6)                 72
#  dense_1 (Dense)             (None, 3)                 21
#  dense_2 (Dense)             (None, 1)                 4
# =================================================================
# Total params: 97 (388.00 Byte)
# Trainable params: 97 (388.00 Byte)
# Non-trainable params: 0 (0.00 Byte)
# _________________________________________________________________

# Display the Sequential model summary
nn_model.summary()

### A deep neural network was created with appropriate parameters. ------------------------------------------------------------------------------------------------------------------------------> (10 points)
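
The parameter counts in the summary can be reproduced by hand: each `Dense` layer has one weight per input-unit pair plus one bias per unit. A small sketch of that arithmetic, using the layer sizes defined in this notebook:

```python
# Layer sizes used in this notebook: 11 inputs -> 6 -> 3 -> 1
layer_sizes = [11, 6, 3, 1]

# Each Dense layer has (inputs * units) weights plus one bias per unit
params_per_layer = [
    (n_in + 1) * n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
print(params_per_layer, sum(params_per_layer))
```

This gives 72, 21, and 4 parameters, for a total of 97, matching the expected summary.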

### Step 2: Compile and fit the model using the `binary_crossentropy` loss function, the `adam` optimizer, and the `accuracy` evaluation metric.


In [13]:
# Compile the Sequential model
nn_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

In [14]:
# M18D01A02 → "Scaled" data to be used.
# CORRECTION - Later samples used for reference brought in corrected data and did not perform scaling operations.

# Fit the model using 50 epochs and the training data
fit_nn_model = nn_model.fit(X_train_scaled, y_train, epochs=50) # ======================================================================================>>> Correction (Accuracy improved 68.1% → 75.0% w/ Scaled data)

### The model was compiled and fit using the binary_crossentropy loss function, the adam optimizer, the accuracy evaluation metric, and a small number of epochs, such as 50 or 100. --------------------------------> (10 points)

Epoch 1/50
38/38 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.6016 - loss: 0.6918
Epoch 2/50
38/38 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.5661 - loss: 0.6894
Epoch 3/50
38/38 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.5953 - loss: 0.6816
Epoch 4/50
38/38 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.5884 - loss: 0.6711
Epoch 5/50
38/38 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.6215 - loss: 0.6518
Epoch 6/50
38/38 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.6004 - loss: 0.6399
Epoch 7/50
38/38 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.6121 - loss: 0.6290
Epoch 8/50
38/38 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.6919 - loss: 0.6032
Epoch 9/50
38/38 ━━━━━━━━━━━━━━━━━━━━
...

### Step 3: Evaluate the model using the test data to determine the model’s loss and accuracy.


In [18]:
# 13/13 - 0s - loss: 0.5049 - accuracy: 0.7350 - 175ms/epoch - 13ms/step
# Loss: 0.5049149394035339, Accuracy: 0.7350000143051147

# M18D01A02 - CORRECTION (See Above)
# Evaluate the model loss and accuracy metrics using the evaluate method and the test data
model_loss, model_accuracy = nn_model.evaluate(X_test_scaled, y_test, verbose=2) # ============================================================================>>> Correction (Output now matches example)


# Display the model loss and accuracy results
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

### The model was evaluated using the test data to determine its loss and accuracy. ---------------------------------------------------------------------------------------------------------> (5 points)

13/13 - 0s - 3ms/step - accuracy: 0.7575 - loss: 0.5298
Loss: 0.5298176407814026, Accuracy: 0.7574999928474426


### Step 4: Save and export your model to a keras file, and name the file `student_loans.keras`.


In [16]:
### M18D03A05
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [22]:
## M18D02A06
# Set the model's file path
save_file_path = Path("/content/drive/My Drive/Colab Notebooks/18-Neural-Networks-Deep-Learning-1/student_loans.keras")

# Export your model to a keras file (File updated 8/31 to 9/7, with corrections from above)
nn_model.save(save_file_path)

### The model was saved and exported to a keras file named student_loans.keras. ----------------------------------------------------------------> (5 points)

---
## Predict Loan Repayment Success by Using your Neural Network Model

### Step 1: Reload your saved model.

In [26]:
# Set the model's file path
load_file_path = Path("/content/drive/My Drive/Colab Notebooks/18-Neural-Networks-Deep-Learning-1/student_loans.keras")

# Load the model to a new object
nn_imported = tf.keras.models.load_model(load_file_path)

# The saved model was reloaded. ----------------------------------------------------------------------------------------------------------------> (5 points)

### Step 2: Make predictions on the testing data and save the predictions to a DataFrame.

In [27]:
# 13/13 - 0s - 126ms/epoch - 10ms/step
# array([[0.63573396],
#        [0.41001907],
#        [0.88199055],
#        [0.61770827],
#        [0.9708398 ]], dtype=float32)

### M18D01A04 - // CORRECTION → scaled data to be used
# Make predictions with the test data
loan_predictions = nn_imported.predict(X_test_scaled, verbose=2) # =================================================================================================== >>> Correction (5 points)

# Display a sample of the predictions
loan_predictions[:5]

13/13 - 0s - 8ms/step


array([[0.26394847],
       [0.2969619 ],
       [0.77366173],
       [0.66697717],
       [0.9462218 ]], dtype=float32)

In [28]:
# 	    predictions
# 0	    1.0
# 1	    0.0
# 2	    1.0
# 3	    1.0
# 4	    1.0
# ...	...
# 395	1.0
# 396	0.0
# 397	1.0
# 398	0.0
# 399	0.0
# 400 rows × 1 columns

# M18D01A04
# Round the predictions to binary results
predictions_rounded = [round(prediction[0], 0) for prediction in loan_predictions]

# Save the rounded predictions to a DataFrame
predictions_df = pd.DataFrame()
predictions_df["predictions"] = predictions_rounded
predictions_df

### The reloaded model was used to make binary predictions on the testing data. -------------------------------------------------> (10 points)

Unnamed: 0,predictions
0,0.0
1,0.0
2,1.0
3,1.0
4,1.0
...,...
395,1.0
396,0.0
397,1.0
398,0.0
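
Rounding each sigmoid output amounts to applying a 0.5 decision threshold. As an alternative sketch, the same step can be vectorized with NumPy, which also yields integer labels instead of floats. The array below holds only the five sample probabilities displayed earlier, and `sample_probs` and `sample_labels` are illustrative names.

```python
import numpy as np

# The five sample probabilities displayed above, shaped (n, 1) like Keras output
sample_probs = np.array(
    [[0.26394847], [0.2969619], [0.77366173], [0.66697717], [0.9462218]],
    dtype=np.float32,
)

# Flatten to 1-D and apply a 0.5 decision threshold
sample_labels = (sample_probs.ravel() > 0.5).astype(int)
print(sample_labels)
```

The resulting labels, 0, 0, 1, 1, 1, agree with the first five rows of the predictions DataFrame.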


### Step 3: Display a classification report with the y test data and predictions

In [29]:
#             precision    recall  f1-score   support

#            0       0.70      0.76      0.73       188
#            1       0.77      0.72      0.74       212

#     accuracy                           0.73       400
#    macro avg       0.74      0.74      0.73       400
# weighted avg       0.74      0.73      0.74       400


# Print the classification report with the y test data and predictions
print(classification_report(y_test, predictions_rounded))

### A classification report is generated for the predictions and the testing data. -----------------------------------------------------> (10 points)

              precision    recall  f1-score   support

           0       0.72      0.79      0.75       188
           1       0.80      0.73      0.76       212

    accuracy                           0.76       400
   macro avg       0.76      0.76      0.76       400
weighted avg       0.76      0.76      0.76       400
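
To make the report's columns concrete, the sketch below derives precision, recall, and F1 for class 1 from a confusion matrix. The labels are hypothetical and chosen only to keep the arithmetic easy to follow; they are not drawn from the loan data.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, not taken from the loan data
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision_1 = tp / (tp + fp)   # of the rows predicted 1, how many were truly 1
recall_1 = tp / (tp + fn)      # of the rows that were truly 1, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
print(precision_1, recall_1, f1_1)
```

The `support` column in the report is simply the number of true rows in each class, which is why it sums to the 400 test rows.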



---
## Discuss creating a recommendation system for student loans

Briefly answer the following questions in the space provided:

1. Describe the data that you would need to collect to build a recommendation system to recommend student loan options for students. Explain why this data would be relevant and appropriate.

2. Based on the data you chose to use in this recommendation system, would your model be using collaborative filtering, content-based filtering, or context-based filtering? Justify why the data you selected would be suitable for your choice of filtering method.

3. Describe two real-world challenges that you would take into consideration while building a recommendation system for student loans. Explain why these challenges would be of concern for a student loan recommendation system.

1. From the Bank :   
&emsp;Loan options.  
&emsp;Connections between associated Loan elements.  
From the Student:  
&emsp;Identification (Name, ID, SSN, permanent address).  
&emsp;Type of loan requested.  
&emsp;Existing services already used.  
&emsp;Income / employment / employment duration / Credit history.  
&emsp;Cosigner information (same as above).  
```
The response describes the data that should be collected to build a recommendation system for student loan options. (4 points)  
```  
This is the standard information used in a loan application.  Additional information beyond this scope would amount to an invasion of privacy.  Even if legal, questions beyond the scope of "reasonable acquisition" (however helpful) will be a red flag for applicants and could drive customers away.  
```
The response explains why they think that data should be collected. (4 points)  
The type of data described is appropriate for a recommendation system for student loan options.(2 points)
```
  
2. This would be context-based filtering, due to the separation of individual users.  The most important driving factor for the selection of loan options is the specific loan being requested.  The individualized ranking of the available options (normally a short list, such as payment terms) can be derived from the standard incoming profile listed above.  The general preferences of other customers are less important than the individual's capacity to pay back the loan.  In addition, once a cosigner comes into play, there are two separate populations providing information.  The number of existing bank customers matching both profiles will be limited, and recommendation systems need very large data sets, which will most likely not be available in this situation.  

```
The response chose a filtering method. (4 points)  
The student justified the choice of their filtering method. (4 points)  
The choice of filtering method was appropriate for the data selected in the previous question. (2 points)
```
  
3. A) The main concern would be the availability of sufficient training data.
 With restricted data, the model would likely suffer, but the shortfall could be mitigated by designing the system to be used only for ranking the available options (after adjusting for feasibility of repayment, the analog of the analysis generated previously in the module).  If a recommendation is in error, the "correct" options will simply be delayed.
 B) Using loan selection as the main driver for presenting loan options means that other options, which might make different types of loans more enticing, may be ignored.  Although useful for up-selling or optimizing the options of present loans, this model cannot be used in "sell-across" drives to encourage existing customers to take out other or additional loans.

```
The response lists two real-world challenges with building a recommendation system for student loans. (4 points)
The response explains why these challenges would be of concern for a student loan recommendation system. (6 points)
```
