## Preprocessing

**This code focused on optimizing the model to reach an accuracy of 75%. Each code cell contains comments describing the attempts made to improve accuracy. While the presentation may not be ideal, this approach proved efficient for testing: trying a variation only required switching the order of code cells and rerunning them. I hope this commentary helps anyone reviewing the code who wants to contribute further improvements in accuracy. Once the desired accuracy is achieved, the final step will be to clean up the code.**

In [31]:
# Importing the dependencies needed to train the model and work on the data.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import tensorflow as tf

# Reading charity_data.csv from the provided cloud URL (pandas is already imported above).
application_df = pd.read_csv("https://static.bc-edx.com/data/dl-1-2/m21/lms/starter/charity_data.csv")
application_df.head()

Unnamed: 0,EIN,NAME,APPLICATION_TYPE,AFFILIATION,CLASSIFICATION,USE_CASE,ORGANIZATION,STATUS,INCOME_AMT,SPECIAL_CONSIDERATIONS,ASK_AMT,IS_SUCCESSFUL
0,10520599,BLUE KNIGHTS MOTORCYCLE CLUB,T10,Independent,C1000,ProductDev,Association,1,0,N,5000,1
1,10531628,AMERICAN CHESAPEAKE CLUB CHARITABLE TR,T3,Independent,C2000,Preservation,Co-operative,1,1-9999,N,108590,1
2,10547893,ST CLOUD PROFESSIONAL FIREFIGHTERS,T5,CompanySponsored,C3000,ProductDev,Association,1,0,N,5000,0
3,10553066,SOUTHSIDE ATHLETIC ASSOCIATION,T3,CompanySponsored,C2000,Preservation,Trust,1,10000-24999,N,6692,1
4,10556103,GENETIC RESEARCH INSTITUTE OF THE DESERT,T3,Independent,C1000,Heathcare,Trust,1,100000-499999,N,142590,1


In [32]:
# Drop the non-beneficial ID columns, 'EIN' and 'NAME'. These are unique identifiers that do not contribute to the predictive model.
application_df = application_df.drop(columns=['EIN', 'NAME'])

In [33]:
# Determining the number of unique values in each column.
#THE LINE BELOW IS CODE
#application_df.nunique()

In [34]:
# After many attempts at changing the learning rate, dropping columns with few unique values, and adding neurons, epochs, layers, parameters, and different activation functions, I decided to adjust the variables before training the model,
# hoping this might raise the accuracy.
# Step 1: Check the value counts for categorical columns
print(application_df["APPLICATION_TYPE"].value_counts())
print(application_df["CLASSIFICATION"].value_counts())

APPLICATION_TYPE
T3     27037
T4      1542
T6      1216
T5      1173
T19     1065
T8       737
T7       725
T10      528
T9       156
T13       66
T12       27
T2        16
T25        3
T14        3
T29        2
T15        2
T17        1
Name: count, dtype: int64
CLASSIFICATION
C1000    17326
C2000     6074
C1200     4837
C3000     1918
C2100     1883
         ...  
C1248        1
C6100        1
C1820        1
C1900        1
C2150        1
Name: count, Length: 71, dtype: int64


In [35]:
# Step 2: Defining threshold for binning.
app_threshold = 500
class_threshold = 500

In [36]:
# Step 3: Identify categories to be replaced.
application_types_to_replace = application_df["APPLICATION_TYPE"].value_counts()[application_df["APPLICATION_TYPE"].value_counts() < app_threshold].index
classifications_to_replace = application_df["CLASSIFICATION"].value_counts()[application_df["CLASSIFICATION"].value_counts() < class_threshold].index


In [37]:
# Step 4: Replacing rare categories with "Other"
application_df["APPLICATION_TYPE"] = application_df["APPLICATION_TYPE"].apply(lambda x: "Other" if x in application_types_to_replace else x)
application_df["CLASSIFICATION"] = application_df["CLASSIFICATION"].apply(lambda x: "Other" if x in classifications_to_replace else x)

In [38]:
# Step 5: Verifying replacement.
print(application_df["APPLICATION_TYPE"].value_counts())
print(application_df["CLASSIFICATION"].value_counts())

APPLICATION_TYPE
T3       27037
T4        1542
T6        1216
T5        1173
T19       1065
T8         737
T7         725
T10        528
Other      276
Name: count, dtype: int64
CLASSIFICATION
C1000    17326
C2000     6074
C1200     4837
C3000     1918
C2100     1883
Other     1484
C7000      777
Name: count, dtype: int64
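
Steps 1 to 5 above can be folded into one reusable helper. The following is a minimal sketch, not the notebook's own code: the function name `bin_rare_categories` is hypothetical, and the data is a toy series rather than the charity dataset.

```python
import pandas as pd


def bin_rare_categories(series: pd.Series, threshold: int, label: str = "Other") -> pd.Series:
    """Replace categories that occur fewer than `threshold` times with `label`."""
    counts = series.value_counts()
    rare = counts[counts < threshold].index
    # Keep values that are not rare; substitute `label` everywhere else.
    return series.where(~series.isin(rare), label)


# Toy data (not the charity dataset): "T3" and "T4" are common, "T17" and "T29" are rare.
toy = pd.Series(["T3"] * 5 + ["T4"] * 3 + ["T17", "T29"])
binned = bin_rare_categories(toy, threshold=3)
print(binned.value_counts())
```

With a helper like this, trying a different cutoff for `APPLICATION_TYPE` or `CLASSIFICATION` would only mean changing the `threshold` argument.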


In [39]:
# Look at APPLICATION_TYPE value counts.
#THE LINE BELOW IS CODE
#application_df['APPLICATION_TYPE'].value_counts()

In [40]:
# Choose a cutoff value and create a list of application types to be replaced, so there isn't too much noise while testing the model.
# Recommended to use the variable name `application_types_to_replace`
#THE LINE BELOW IS CODE

#application_types_to_replace = application_df['APPLICATION_TYPE'].value_counts()[application_df['APPLICATION_TYPE'].value_counts() < 500].index


# Replace in dataframe
#THE LINE BELOW IS CODE

#for app in application_types_to_replace:
#    application_df['APPLICATION_TYPE'] = application_df['APPLICATION_TYPE'].replace(app,"Other")

# Check to make sure replacement was successful
#THE LINE BELOW IS CODE

#application_df['APPLICATION_TYPE'].value_counts()

In [41]:
# Look at CLASSIFICATION value counts
#THE LINE BELOW IS CODE
#application_df['CLASSIFICATION'].value_counts()


In [42]:
# You may find it helpful to look at CLASSIFICATION value counts > 1. Again, this lets us check how many categories are being taken into consideration and detect which ones have
# fewer occurrences, so we can avoid the noise while testing.
#THE LINE BELOW IS CODE

#application_types_to_replace = application_df['APPLICATION_TYPE'].value_counts()[application_df['APPLICATION_TYPE'].value_counts() > 1].index
#application_types_to_replace.value_counts()

In [43]:
# Choosing a cutoff value and creating a list of classifications to be replaced
# Recommended to use the variable name `classifications_to_replace`
#THE LINE BELOW IS CODE

#classifications_to_replace = application_df['CLASSIFICATION'].value_counts()[application_df['CLASSIFICATION'].value_counts() < 1000].index

# Replace in dataframe
#THE LINE BELOW IS CODE

#for cls in classifications_to_replace:
#    application_df['CLASSIFICATION'] = application_df['CLASSIFICATION'].replace(cls,"Other")

# Check to make sure replacement was successful
#THE LINE BELOW IS CODE

#application_df['CLASSIFICATION'].value_counts()

In [44]:
# Convert categorical data to numeric with `pd.get_dummies`. This turns each string category into its own indicator column, so the model can read whether it's a 1 or a 0.
application_df = pd.get_dummies(application_df)
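
To make the effect of `pd.get_dummies` concrete, here is a small sketch on a toy frame (the values are illustrative, not the full charity dataset). Note that the indicator columns may come back as booleans or integers depending on the pandas version; passing `dtype=int` is one way to force 0/1 values.

```python
import pandas as pd

# Toy frame (not the charity dataset) with one text column and one numeric column.
toy = pd.DataFrame({
    "APPLICATION_TYPE": ["T3", "T4", "Other"],
    "ASK_AMT": [5000, 108590, 6692],
})

encoded = pd.get_dummies(toy, dtype=int)

# Numeric columns pass through; each category becomes its own indicator column.
print(list(encoded.columns))
print(encoded)
```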

In [45]:
# Splitting our preprocessed data into our features and target arrays
X = application_df.drop(columns=['IS_SUCCESSFUL'])  # Features
y = application_df['IS_SUCCESSFUL']  # Labels

# Splitting the preprocessed data into a training and testing dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
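
Because no `test_size` is passed, `train_test_split` falls back to its default 75/25 split. The arithmetic below is a rough check of what that implies for the step counts shown in the training and evaluation logs further down. It assumes the row count is the sum of the `APPLICATION_TYPE` counts printed earlier (34,299) and that no rows were dropped in between.

```python
import math

n_rows = 34299          # sum of the APPLICATION_TYPE counts printed above
test_fraction = 0.25    # scikit-learn's default when test_size and train_size are unset
batch_size = 32         # value passed to fit() later in the notebook

n_test = math.ceil(n_rows * test_fraction)   # the test share is rounded up
n_train = n_rows - n_test

train_steps = math.ceil(n_train / batch_size)
test_steps = math.ceil(n_test / batch_size)  # evaluate() also uses batches of 32 by default

print(n_train, n_test, train_steps, test_steps)
```

These values line up with the `804/804` steps per epoch and the `268/268` evaluation steps in the logs.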

In [46]:
# Create a StandardScaler instance
scaler = StandardScaler()

# Fitting the StandardScaler
X_scaler = scaler.fit(X_train)

# Scale the data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
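
The scaler is fitted on the training split only and then applied to both splits, so no information from the test rows leaks into the statistics. The sketch below reproduces that idea by hand with NumPy on toy arrays; it is an illustration of the standardization formula, not a replacement for `StandardScaler`.

```python
import numpy as np

# Toy feature matrices (not the charity data): one column, already split.
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[4.0]])

# "fit": learn the statistics from the training rows only.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population standard deviation (ddof=0)

# "transform": apply the same training statistics to both splits.
X_train_toy_scaled = (X_train_toy - mean) / std
X_test_toy_scaled = (X_test_toy - mean) / std

print(X_train_toy_scaled.ravel(), X_test_toy_scaled.ravel())
```

The scaled training column has mean 0 and standard deviation 1, while the test value is expressed in the training set's units and can fall outside that range.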

## Compile, Train and Evaluate the Model

In [47]:
# Define the model - deep neural net, i.e., the number of input features and hidden nodes for each layer.

# Defining the sequential model by creating an instance of the Sequential class.
nn_optimized = tf.keras.models.Sequential()

# First hidden layer
nn_optimized.add(tf.keras.layers.Dense(units=64, activation="relu", input_dim=X_train.shape[1]))

# Second hidden layer
nn_optimized.add(tf.keras.layers.Dense(units=32, activation="relu"))

# Third hidden layer (optional, test if it improves accuracy)
nn_optimized.add(tf.keras.layers.Dense(units=16, activation="relu"))

# Output layer
nn_optimized.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

# Check the structure of the model
nn_optimized.summary()
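
As a rough guide to what `summary()` reports, the parameter count of a stack of `Dense` layers can be worked out by hand: each layer has `inputs * units` weights plus `units` biases. The sketch below uses a hypothetical input width of 40, because the real value (`X_train.shape[1]`) is not shown in this notebook export; `dense_param_count` is an illustrative helper, not a Keras function.

```python
def dense_param_count(n_inputs: int, layer_units: list[int]) -> int:
    """Total trainable parameters for a stack of fully connected layers."""
    total = 0
    for units in layer_units:
        total += n_inputs * units + units  # weights + biases
        n_inputs = units                   # this layer's outputs feed the next layer
    return total


# The 64-32-16-1 stack defined above, with a HYPOTHETICAL input width.
n_features = 40
print(dense_param_count(n_features, [64, 32, 16, 1]))
```

For this architecture the total works out to `64 * n_features + 2689`, so most of the capacity sits in the first hidden layer whenever the one-hot encoding produces many columns.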



In [48]:
# Import the Adam optimizer from TensorFlow's Keras module for model optimization
from tensorflow.keras.optimizers import Adam

# Set the learning rate used to optimize training
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)  # Current value is the default (0.001); an earlier attempt used 0.0005

# Compile the model to prepare it for training with specified loss function, optimizer, and metrics.
nn_optimized.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])
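
For readers unfamiliar with the `binary_crossentropy` loss named in `compile`, the sketch below computes the same quantity by hand for a few toy predictions. It is a plain-Python illustration of the formula, with a small clipping constant assumed to avoid `log(0)`; it is not the Keras implementation.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Toy labels and sigmoid outputs (not predictions from this notebook's model).
labels = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.6, 0.4]
print(round(binary_crossentropy(labels, probs), 4))
```

A model that always predicts 0.5 scores `ln(2)`, about 0.693, so a reported loss below that value indicates the model is doing better than an uninformed guess.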

In [49]:
# Import the EarlyStopping callback from TensorFlow's Keras module
from tensorflow.keras.callbacks import EarlyStopping

# Create an EarlyStopping callback to stop training when the training accuracy has not improved for 10 consecutive epochs
early_stopping = EarlyStopping(monitor='accuracy', patience=10, restore_best_weights=True)

# Train the neural network model with the training data
history = nn_optimized.fit(X_train_scaled, y_train,
                           epochs=200, batch_size=32,
                           verbose=1, callbacks=[early_stopping])


Epoch 1/200
804/804 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.7120 - loss: 0.5873
Epoch 2/200
804/804 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.7250 - loss: 0.5571
Epoch 3/200
804/804 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.7290 - loss: 0.5509
Epoch 4/200
804/804 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.7340 - loss: 0.5471
Epoch 5/200
804/804 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.7300 - loss: 0.5498
Epoch 6/200
804/804 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - accuracy: 0.7296 - loss: 0.5474
Epoch 7/200
804/804 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.7264 - loss: 0.5515
Epoch 8/200
804/804 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.7324 - loss: 0.5468
Epoch 9/200
804/804 ...

In [50]:
# Evaluating the model using the test data
model_loss, model_accuracy = nn_optimized.evaluate(X_test_scaled,y_test,verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

268/268 - 1s - 2ms/step - accuracy: 0.7293 - loss: 0.5527
Loss: 0.552704930305481, Accuracy: 0.7293294668197632
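
The `EarlyStopping` callback used during training relies on a patience counter: training stops once the monitored value has gone `patience` epochs without beating its best so far. The function below is a simplified plain-Python illustration of that rule (the name `early_stop_epoch` is hypothetical and this is not the Keras source), fed with the training accuracies from the first eight epochs logged above.

```python
def early_stop_epoch(values, patience):
    """Return (stopped_epoch, best_epoch) for a metric that should increase.

    stopped_epoch is None if training would run through every value.
    Epoch indices are zero-based.
    """
    best = float("-inf")
    best_epoch = None
    wait = 0
    for epoch, value in enumerate(values):
        if value > best:
            best, best_epoch, wait = value, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Training accuracy from the first eight epochs logged above.
accuracy = [0.7120, 0.7250, 0.7290, 0.7340, 0.7300, 0.7296, 0.7264, 0.7324]

print(early_stop_epoch(accuracy, patience=3))   # hypothetical, shorter patience
print(early_stop_epoch(accuracy, patience=10))  # the notebook's setting
```

With a patience of 3 the loop would stop at the seventh epoch and fall back to the fourth; with the notebook's patience of 10 these eight epochs are not enough to trigger a stop. A common alternative worth testing is to monitor a validation metric such as `val_loss` (which requires passing validation data or `validation_split` to `fit`), since training accuracy can keep rising while test accuracy stalls.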


In [51]:
# Define the model - deep neural net, i.e., the number of input features and hidden nodes for each layer.
#THE LINE BELOW IS A DEPENDENCY

#from tensorflow.keras.layers import LeakyReLU

# Defining the sequential model by creating an instance of the Sequential class.
#THE LINE BELOW IS CODE

#nn_optimized= tf.keras.models.Sequential()

# Input layer and first hidden layer
#THE LINE BELOW IS CODE

#nn_optimized.add(tf.keras.layers.Dense(units=64, activation="tanh", input_dim=X_train.shape[1]))

# Second hidden layer (increase neurons)
#THE LINE BELOW IS CODE

#nn_optimized.add(tf.keras.layers.Dense(units=32, activation="tanh"))

# Output layer
#THE LINE BELOW IS CODE

#nn_optimized.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

# Check the structure of the model
#THE LINE BELOW IS CODE

#nn_optimized.summary()

In [52]:
# Import the Adam optimizer from TensorFlow's Keras module for model optimization
#THE LINE BELOW IS A DEPENDENCY

#from tensorflow.keras.optimizers import Adam

# Reduce the learning rate
#THE LINE BELOW IS CODE

#optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)  # Default is 0.001, try lowering to 0.0005
#nn_optimized.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])

In [53]:
# Train the model
#THE LINE BELOW IS A DEPENDENCY

#from tensorflow.keras.callbacks import EarlyStopping

#THE LINES BELOW ARE CODE

#early_stopping = EarlyStopping(monitor='accuracy', patience=10, restore_best_weights=True)

#history = nn_optimized.fit(X_train_scaled, y_train,
#                           epochs=200, batch_size=32,
#                           verbose=1, callbacks=[early_stopping])


In [54]:
# Evaluate the model using the test data
#THE LINE BELOW IS CODE

#model_loss, model_accuracy = nn_optimized.evaluate(X_test_scaled,y_test,verbose=2)
#print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

In [55]:
#This was done during the first optimization attempt to check the number of unique values in each column and drop the columns with only 2-3 unique values, which could add noise.
#THE LINE BELOW IS CODE

#application_df.nunique()


In [56]:
#THE LINE BELOW IS CODE

#columns_to_drop = application_df.nunique()[application_df.nunique() <= 3].index
#application_df = application_df.drop(columns=columns_to_drop)
#application_df.nunique()

In [57]:
#Same thing as the models above; just switching parameters, adding or removing layers, changing the learning rate, and so on.
#THE LINE BELOW IS CODE

#nn_optimized.add(tf.keras.layers.Dense(units=64, activation="tanh", input_dim=X_train.shape[1]))
#nn_optimized.add(tf.keras.layers.Dense(units=32, activation=LeakyReLU(alpha=0.01)))

# Third hidden layer (optional, test if it improves accuracy)
#THE LINE BELOW IS CODE

#nn_optimized.add(tf.keras.layers.Dense(units=16, activation="tanh"))

# Output layer
#THE LINE BELOW IS CODE

#nn_optimized.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

# Check the structure of the model
#THE LINE BELOW IS CODE

#nn_optimized.summary()

In [58]:
# Reduce the learning rate
#THE LINES BELOW ARE CODE

#optimizer = Adam(learning_rate=0.0005)

#nn_optimized.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])


In [59]:
# Train the model
#THE LINE BELOW IS A DEPENDENCY

#from tensorflow.keras.callbacks import EarlyStopping

#THE LINES BELOW ARE CODE

#early_stopping = EarlyStopping(monitor='accuracy', patience=10, restore_best_weights=True)

#history = nn_optimized.fit(X_train_scaled, y_train,
#                           epochs=200, batch_size=32,
#                           verbose=1, callbacks=[early_stopping])

In [60]:
# Evaluate the model using the test data
#THE LINES BELOW ARE CODE

#model_loss, model_accuracy = nn_optimized.evaluate(X_test_scaled,y_test,verbose=2)
#print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

In [61]:
#We went back to the original training settings because at one point we managed to reach 73% accuracy with this training, but we can't remember what the parameters were or whether any variables were dropped.
#THE LINE BELOW IS CODE

#nn_optimized.fit(X_train_scaled, y_train, epochs=100, batch_size=32)

In [62]:
# Evaluate the model using the test data
#THE LINES BELOW ARE CODE

#model_loss, model_accuracy = nn_optimized.evaluate(X_test_scaled,y_test,verbose=2)
#print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

In [63]:
#THE LINE BELOW IS CODE

#nn_optimized= tf.keras.models.Sequential()

# Input layer and first hidden layer
#THE LINE BELOW IS CODE

#nn_optimized.add(tf.keras.layers.Dense(units=64, activation="relu", input_dim=X_train.shape[1]))

# Second hidden layer (increase neurons)
#THE LINE BELOW IS CODE

#nn_optimized.add(tf.keras.layers.Dense(units=32, activation="relu"))

# Output layer
#THE LINE BELOW IS CODE

#nn_optimized.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

In [64]:
# Reduce the learning rate
#THE LINES BELOW ARE CODE

#optimizer = Adam(learning_rate=0.0005)

#nn_optimized.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])


In [65]:
#THE LINE BELOW IS CODE

#nn_optimized.fit(X_train_scaled, y_train, epochs=150, batch_size=32)

In [66]:
# Evaluate the model using the test data
#THE LINES BELOW ARE CODE

#model_loss, model_accuracy = nn_optimized.evaluate(X_test_scaled,y_test,verbose=2)
#print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

In [67]:
#Dependencies needed to save the files in Google Drive.
from google.colab import drive

# Mounting Google Drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [68]:
# Saving the model in Google Drive
nn_optimized.save("/content/drive/My Drive/Colab Notebooks/AlphabetSoupCharity.h5")




In [69]:
# Exporting the model to HDF5 file
nn_optimized.save("AlphabetSoupCharity.h5")

