## Overview

####Purpose of the analysis: We were tasked creating a tool, a model, that predicts whether or not a hypothetical applicant will succeed in their ventures, if given funding. We were given a CSV containing 34,000 applications from 19,568 organizations. The following report is a brief explanation of the model. I used TensorFlow as it can handle large amounts of data, and analyze complex patterns. The report covers data preprocessing, compiling, training, and evaluation. By the end, we will understand why I chose this particular model.

## Data Preprocessing
----
####1. What variable(s) are the target(s) for your model?

####The "Is-Successful" column was the variable the model targeted. The whole point of the model is to see which applicants paid back which loans successfully, and which were not. The only feature that definitely answered this was "Is-Successful".
---
####2. What variable(s) are the features for your model?

####The variables used for the features for this model are: NAME, APPLICATION, TYPE, AFFILIATION, CLASSIFICATION, USE_CASE, ORGANIZATION, INCOME_AMT, SPECIAL_CONSIDERATIONS, STATUS, and ASK_AMT
---
####3. What variable(s) should be removed from the input data because they are neither targets nor features?

####For the optmitizer, I removed the EIN, or Employment Identification Number. Since we have the name of the organization or applicant, the EIN is effectively useless and could even create noise within the model from numbers that had no influence on the outcome of success or no.

####STATUS was also removed as all values were the same (1), making the feature hyper-clustered and effectively useless.

####SPECIAL_CONSIDERATIONS could be removed as these were objects that would not be able to be converted into the appropriate datatype for a model.




## Compiling, Training, and Evaluating the Model
----
####1. How many neurons, layers, and activation functions did you select for your neural network model, and why?

#### While this orignially seemed like a simple classification model requiring only one layer of neurons, it needed more as I found the accuracy was not able to exceed 53% with just one. I experimented with changing the number of epochs and layers, and while 100 epochs verified the persistence of the model, just 10 epochs seemed to achieve almost the same results within a few hundredth points. Using multiple layers was the key to moving the accuracy above 75%, as was increasing the size of the random state from 2 to at least above 50 (I chose 99). I used 'relu' for the first activation function, and 'sigmoid' for any subsequent and output layers.
---
####2. Were you able to achieve the target model performance?

#### Yes - 80%
---
####3. What steps did you take in your attempts to increase model performance?

#### Steps to increase model performance included grouping outlying values in the features NAME, APPLICATION_TYPE, CLASSIFICATION, and ASK_AMT. It also comprised adjusting the number of layers, activation function placement, number of units, number of random states. Encoding the NAME column into numeric seemed to have the greatest effect for improving accuracy, which makes sense as the model was not able to form a relationship between the target and the other features it was training on.

## Summary
----
####In conclusion, the model is able to accurately predict successful applications, to a degree of 80%, as long as the application has the following criteria:
- The application must have one of the following classifications, under CLASSIFICATION; C1000, C2000, C3000, C1200, C2100
- The application must have used one of the following application types, under APPLICATION_TYPE: T3, T4, T5, T6, T7, T8, T10, T19
- The ASK_AMT = 5000
- The applicant, under NAME, must have submitted an application more than 5 times


####A different model I would recommend is the Random Forest model as it produced an accuracy score of 79.5%, almost identical to TensorFlow. Random Forest is good for classification and regression tasks. I also experimented using K-Nearest Neighbors, although this produced slightly less accurate result of 78.6%. (Scroll past the TensorFlow model optimization to see both the Random Forest and KNN model processes and results).

## Model Optimization for Alphabet Soup Applicant Selection

In [None]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import tensorflow as tf

#  Import and read the charity_data.csv.
import pandas as pd
df = pd.read_csv("https://static.bc-edx.com/data/dl-1-2/m21/lms/starter/charity_data.csv")
df.head()

Unnamed: 0,EIN,NAME,APPLICATION_TYPE,AFFILIATION,CLASSIFICATION,USE_CASE,ORGANIZATION,STATUS,INCOME_AMT,SPECIAL_CONSIDERATIONS,ASK_AMT,IS_SUCCESSFUL
0,10520599,BLUE KNIGHTS MOTORCYCLE CLUB,T10,Independent,C1000,ProductDev,Association,1,0,N,5000,1
1,10531628,AMERICAN CHESAPEAKE CLUB CHARITABLE TR,T3,Independent,C2000,Preservation,Co-operative,1,1-9999,N,108590,1
2,10547893,ST CLOUD PROFESSIONAL FIREFIGHTERS,T5,CompanySponsored,C3000,ProductDev,Association,1,0,N,5000,0
3,10553066,SOUTHSIDE ATHLETIC ASSOCIATION,T3,CompanySponsored,C2000,Preservation,Trust,1,10000-24999,N,6692,1
4,10556103,GENETIC RESEARCH INSTITUTE OF THE DESERT,T3,Independent,C1000,Heathcare,Trust,1,100000-499999,N,142590,1


In [None]:
df.shape

(34299, 12)

In [None]:
# Drop any non-beneficial ID columns
# This time, keep NAME and encode

df = df.drop(["EIN"], axis=1)

In [None]:
# Find the number of unique values in each feature

df.nunique()

Unnamed: 0,0
NAME,19568
APPLICATION_TYPE,17
AFFILIATION,6
CLASSIFICATION,71
USE_CASE,5
ORGANIZATION,4
STATUS,2
INCOME_AMT,9
SPECIAL_CONSIDERATIONS,2
ASK_AMT,8747


In [None]:
# We need to group and possibly eliminate applicants with low number of applications
# This helps the model train on a more evenly distributed range of data
# Group any applicants with applications fewer than 5 with "Other"

application_count = df['NAME'].value_counts()

#  Application counts greater than 5

application_count[application_count>5]

Unnamed: 0_level_0,count
NAME,Unnamed: 1_level_1
PARENT BOOSTER USA INC,1260
TOPS CLUB INC,765
UNITED STATES BOWLING CONGRESS INC,700
WASHINGTON STATE UNIVERSITY,492
AMATEUR ATHLETIC UNION OF THE UNITED STATES INC,408
...,...
OLD OAK CLIFF CONSERVATION LEAGUE INC,6
AMERICAN NEPHROLOGY NURSES ASSOCIATION,6
HUMBLE ISD EDUCATIONAL SUPPORT GROUPS INC,6
PROFESSIONAL LOADMASTER ASSOCIATION,6


In [None]:
# Create list of applicants with application counts <= 5
app_count_fiveandbelow = list(application_count[application_count <= 5].index)

# Iterate through list, group into "Other", replace in df
for application in app_count_fiveandbelow:
    df['NAME'] = df['NAME'].replace(application,"Other")

# Did it work?
df['NAME'].value_counts()

Unnamed: 0_level_0,count
NAME,Unnamed: 1_level_1
Other,20043
PARENT BOOSTER USA INC,1260
TOPS CLUB INC,765
UNITED STATES BOWLING CONGRESS INC,700
WASHINGTON STATE UNIVERSITY,492
...,...
HABITAT FOR HUMANITY INTERNATIONAL,6
DAMAGE PREVENTION COUNCIL OF TEXAS,6
FLEET RESERVE ASSOCIATION,6
HUGH OBRIAN YOUTH LEADERSHIP,6


In [None]:
# Do the same with APPLICATION_TYPE - we will need to group application types with low counts into a new "Other" value
application_type_counts = df['APPLICATION_TYPE'].value_counts()
application_type_counts

Unnamed: 0_level_0,count
APPLICATION_TYPE,Unnamed: 1_level_1
T3,27037
T4,1542
T6,1216
T5,1173
T19,1065
T8,737
T7,725
T10,528
T9,156
T13,66


In [None]:
# Create a list of application types that have counts less than 500
application_types_under500 = list(application_type_counts[application_type_counts < 500].index)

# Iterate through list, group into "Other", replace in df
for application in application_types_under500:
    df['APPLICATION_TYPE'] = df['APPLICATION_TYPE'].replace(application,"Other")

# Did it work?
df['APPLICATION_TYPE'].value_counts()

Unnamed: 0_level_0,count
APPLICATION_TYPE,Unnamed: 1_level_1
T3,27037
T4,1542
T6,1216
T5,1173
T19,1065
T8,737
T7,725
T10,528
Other,276


In [None]:
# Do the same with CLASSIFICATION - we will need to group application types with low counts into a new "Other" value
classification_counts = df['CLASSIFICATION'].value_counts()
classification_counts

Unnamed: 0_level_0,count
CLASSIFICATION,Unnamed: 1_level_1
C1000,17326
C2000,6074
C1200,4837
C3000,1918
C2100,1883
...,...
C4120,1
C8210,1
C2561,1
C4500,1


In [None]:
# Create a list of classifications that have counts less than 1000
classifications_under1000 = list(classification_counts[classification_counts < 1000].index)

# Iterate through list, group into "Other", replace in df
for classification in classifications_under1000:
    df['CLASSIFICATION'] = df['CLASSIFICATION'].replace(classification,"Other")

# Did it work?
df['CLASSIFICATION'].value_counts()

Unnamed: 0_level_0,count
CLASSIFICATION,Unnamed: 1_level_1
C1000,17326
C2000,6074
C1200,4837
Other,2261
C3000,1918
C2100,1883


In [None]:
# Do the same with ASK_AMT
ask_counts = df['ASK_AMT'].value_counts()

ask_counts[ask_counts>5]

Unnamed: 0_level_0,count
ASK_AMT,Unnamed: 1_level_1
5000,25398


In [None]:
# Create a list of ask amounts that have counts outside of 5000
ask_amount_not_5000 = list(ask_counts[ask_counts < 5].index)

# Iterate through list, group into "Other", replace in df
for amount in ask_amount_not_5000:
    df['ASK_AMT'] = df['ASK_AMT'].replace(amount,"Other")

# Did it work?
df['ASK_AMT'].value_counts()

Unnamed: 0_level_0,count
ASK_AMT,Unnamed: 1_level_1
5000,25398
Other,8901


In [None]:
# Create new index list based on all object features in DataFrame
categories = df.dtypes[df.dtypes == "object"].index.tolist()

In [None]:
# Like before, convert categorical data to numeric with 'pd.get_dummies'
dummies_df = pd.get_dummies(df)

In [None]:
# Split our preprocessed data into our features and target arrays
X = dummies_df.drop(["IS_SUCCESSFUL"], axis='columns').values
y = dummies_df["IS_SUCCESSFUL"].values

# Split the preprocessed data into a training and testing dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)

In [None]:
# Create a StandardScaler instances
scaler = StandardScaler()

# Fit the StandardScaler
X_scaler = scaler.fit(X_train)

# Scale the data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

In [None]:
# Define the model - deep neural net, i.e., the number of input features and hidden nodes for each layer.

nn = tf.keras.models.Sequential()

input_features_len = len(X_train[0])


# First hidden layer
nn.add(
    tf.keras.layers.Dense(units=200, input_dim=input_features_len, activation="relu")
)

# Second hidden layer
nn.add(tf.keras.layers.Dense(units=60, activation="sigmoid"))

# Third hidden layer
nn.add(tf.keras.layers.Dense(units=20, activation="sigmoid"))

# Output layer
nn.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

# Check the structure of the model
nn.summary()

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


In [None]:
# Compile the model
nn.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

In [None]:
# Train the model
train = nn.fit(X_train_scaled,y_train,epochs=10)

Epoch 1/10
[1m804/804[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 3ms/step - accuracy: 0.7349 - loss: 0.5358
Epoch 2/10
[1m804/804[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 3ms/step - accuracy: 0.7975 - loss: 0.4397
Epoch 3/10
[1m804/804[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 5ms/step - accuracy: 0.7982 - loss: 0.4335
Epoch 4/10
[1m804/804[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 3ms/step - accuracy: 0.7952 - loss: 0.4308
Epoch 5/10
[1m804/804[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 4ms/step - accuracy: 0.7955 - loss: 0.4309
Epoch 6/10
[1m804/804[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 4ms/step - accuracy: 0.7955 - loss: 0.4263
Epoch 7/10
[1m804/804[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 3ms/step - accuracy: 0.7992 - loss: 0.4215
Epoch 8/10
[1m804/804[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 3ms/step - accuracy: 0.7992 - loss: 0.4187
Epoch 9/10
[1m804/804[0m [32m━━━━━━━━

In [None]:
# Evaluate the model using the test data
model_loss, model_accuracy = nn.evaluate(X_test_scaled,y_test,verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

268/268 - 1s - 2ms/step - accuracy: 0.8001 - loss: 0.4266
Loss: 0.4265761077404022, Accuracy: 0.8001165986061096


In [None]:
# Export our model to HDF5 file
nn.save("AlphabetSoupCharity_Optimization.h5")



In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier


In [None]:
# Create a KNeighbors classifier
knn_model = KNeighborsClassifier(n_neighbors=22)

# Fitting the model
knn_model = knn_model.fit(X_train_scaled, y_train)

# Evaluate the model
y_pred = knn_model.predict(X_test_scaled)
print(f" K-Nearest Neighbors model accuracy: {accuracy_score(y_test,y_pred):.3f}")

 K-Nearest Neighbors model accuracy: 0.786


In [None]:
# Create a random forest classifier.
rf_model = RandomForestClassifier(n_estimators=128, random_state=99)

# Fitting the model
rf_model = rf_model.fit(X_train_scaled, y_train)

# Evaluate the model
y_pred = rf_model.predict(X_test_scaled)
print(f" Random forest model accuracy: {accuracy_score(y_test,y_pred):.3f}")

 Random forest model accuracy: 0.795
