Charity Funding Predictor

Background:


The nonprofit foundation Alphabet Soup wants a tool that can help it select the funding applicants with the best chance of success in their ventures. With your knowledge of machine learning and neural networks, you’ll use the features in the provided dataset to create a binary classifier that predicts whether an applicant will be successful if funded by Alphabet Soup.
From Alphabet Soup’s business team, you have received a CSV file containing records for more than 34,000 organizations that have received funding from Alphabet Soup over the years.

STEPS:

1. Preprocess the Data
2. Compile, Train, and Evaluate the Model
3. Optimize the Model
4. Write a Report on the Neural Network Model

In [1]:
#import dependencies
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [2]:
#load in file and create DataFrame
file = 'charity_data_copy.csv'
charity_df = pd.read_csv(file)
charity_df.head()

,EIN,NAME,APPLICATION_TYPE,AFFILIATION,CLASSIFICATION,USE_CASE,ORGANIZATION,STATUS,INCOME_AMT,SPECIAL_CONSIDERATIONS,ASK_AMT,IS_SUCCESSFUL
0,10520599,BLUE KNIGHTS MOTORCYCLE CLUB,T10,Independent,C1000,ProductDev,Association,1,0,N,5000,1
1,10531628,AMERICAN CHESAPEAKE CLUB CHARITABLE TR,T3,Independent,C2000,Preservation,Co-operative,1,1-9999,N,108590,1
2,10547893,ST CLOUD PROFESSIONAL FIREFIGHTERS,T5,CompanySponsored,C3000,ProductDev,Association,1,0,N,5000,0
3,10553066,SOUTHSIDE ATHLETIC ASSOCIATION,T3,CompanySponsored,C2000,Preservation,Trust,1,10000-24999,N,6692,1
4,10556103,GENETIC RESEARCH INSTITUTE OF THE DESERT,T3,Independent,C1000,Heathcare,Trust,1,100000-499999,N,142590,1


In [17]:
len(charity_df)

34299

In [3]:
#Target for the model: IS_SUCCESSFUL
#Features for the model: every remaining column except the identifiers EIN and NAME

#separate the target, then drop the target, EIN, and NAME columns from the features
target = charity_df["IS_SUCCESSFUL"].values
X = charity_df.drop(["EIN", "NAME", "IS_SUCCESSFUL"], axis=1)

In [4]:
X.dtypes

APPLICATION_TYPE          object
AFFILIATION               object
CLASSIFICATION            object
USE_CASE                  object
ORGANIZATION              object
STATUS                     int64
INCOME_AMT                object
SPECIAL_CONSIDERATIONS    object
ASK_AMT                    int64
dtype: object

In [5]:
#Determine the number of unique values for each column
X.nunique()

APPLICATION_TYPE            17
AFFILIATION                  6
CLASSIFICATION              71
USE_CASE                     5
ORGANIZATION                 4
STATUS                       2
INCOME_AMT                   9
SPECIAL_CONSIDERATIONS       2
ASK_AMT                   8747
dtype: int64

In [6]:
#For columns that have more than 10 unique values, determine the number of data points for each unique value.
X['APPLICATION_TYPE'].value_counts()

T3     27037
T4      1542
T6      1216
T5      1173
T19     1065
T8       737
T7       725
T10      528
T9       156
T13       66
T12       27
T2        16
T25        3
T14        3
T29        2
T15        2
T17        1
Name: APPLICATION_TYPE, dtype: int64

In [7]:
classif = X['CLASSIFICATION'].value_counts()
classif

C1000    17326
C2000     6074
C1200     4837
C3000     1918
C2100     1883
         ...  
C4120        1
C8210        1
C2561        1
C4500        1
C2150        1
Name: CLASSIFICATION, Length: 71, dtype: int64

In [8]:
X['ASK_AMT'].value_counts()

5000        25398
10478           3
15583           3
63981           3
6725            3
            ...  
5371754         1
30060           1
43091152        1
18683           1
36500179        1
Name: ASK_AMT, Length: 8747, dtype: int64

### Use the number of data points for each unique value to pick a cutoff point to bin "rare" categorical variables together in a new value, Other, and then check if the binning was successful.

In [9]:
#list the application types with fewer than 500 data points; these are binned as 'other'
replaced_apps = ['T9', 'T13', 'T12', 'T2', 'T25', 'T14', 'T29', 'T15', 'T17']

#replace every listed application type with the value 'other' in a single call
X['APPLICATION_TYPE'] = X['APPLICATION_TYPE'].replace(replaced_apps, 'other')

#check that the binning was done correctly
X['APPLICATION_TYPE'].value_counts()

T3       27037
T4        1542
T6        1216
T5        1173
T19       1065
T8         737
T7         725
T10        528
other      276
Name: APPLICATION_TYPE, dtype: int64
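
The hard-coded list in the cell above can also be derived from the counts themselves. The following is a minimal sketch, assuming a count cutoff and a hypothetical helper named `bin_rare_values` (not part of the original notebook); the toy series only mimics the shape of the `APPLICATION_TYPE` column.

```python
import pandas as pd

def bin_rare_values(series, cutoff, label="other"):
    """Replace every value that occurs fewer than `cutoff` times with `label`.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    counts = series.value_counts()
    rare = counts[counts < cutoff].index
    # keep the common values, overwrite the rare ones with the label
    return series.where(~series.isin(rare), label)

# Toy stand-in for X['APPLICATION_TYPE'] (made-up counts)
toy = pd.Series(["T3"] * 6 + ["T4"] * 3 + ["T9"] * 2 + ["T17"])
binned = bin_rare_values(toy, cutoff=3)
print(binned.value_counts())
```

On the real data, a call such as `bin_rare_values(X['APPLICATION_TYPE'], cutoff=500)` would be expected to give the same grouping as the manual list, since every type in that list has fewer than 500 rows.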

In [10]:
print(classif.values)
print(classif.index)

[17326  6074  4837  1918  1883   777   287   194   116   114   104    95
    75    58    50    36    34    32    32    30    20    18    16    15
    15    14    11    10    10     9     9     7     6     6     6     5
     5     3     3     3     2     2     2     2     2     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1]
Index(['C1000', 'C2000', 'C1200', 'C3000', 'C2100', 'C7000', 'C1700', 'C4000',
       'C5000', 'C1270', 'C2700', 'C2800', 'C7100', 'C1300', 'C1280', 'C1230',
       'C1400', 'C7200', 'C2300', 'C1240', 'C8000', 'C7120', 'C1500', 'C1800',
       'C6000', 'C1250', 'C8200', 'C1238', 'C1278', 'C1235', 'C1237', 'C7210',
       'C2400', 'C1720', 'C4100', 'C1257', 'C1600', 'C1260', 'C2710', 'C0',
       'C3200', 'C1234', 'C1246', 'C1267', 'C1256', 'C2190', 'C4200', 'C2600',
       'C5200', 'C1370', 'C1248', 'C6100', 'C1820', 'C1900', 'C1236', 'C3700',
       'C2570', '

In [11]:
#list the classifications with few data points; anything with fewer than 300 data points is binned as 'other'
replaced_class = ['C1700', 'C4000',
       'C5000', 'C1270', 'C2700', 'C2800', 'C7100', 'C1300', 'C1280', 'C1230',
       'C1400', 'C7200', 'C2300', 'C1240', 'C8000', 'C7120', 'C1500', 'C1800',
       'C6000', 'C1250', 'C8200', 'C1238', 'C1278', 'C1235', 'C1237', 'C7210',
       'C2400', 'C1720', 'C4100', 'C1257', 'C1600', 'C1260', 'C2710', 'C0',
       'C3200', 'C1234', 'C1246', 'C1267', 'C1256', 'C2190', 'C4200', 'C2600',
       'C5200', 'C1370', 'C1248', 'C6100', 'C1820', 'C1900', 'C1236', 'C3700',
       'C2570', 'C1580', 'C1245', 'C2500', 'C1570', 'C1283', 'C2380', 'C1732',
       'C1728', 'C2170', 'C4120', 'C8210', 'C2561', 'C4500', 'C2150']
#replace every listed classification with the value 'other' in a single call
X['CLASSIFICATION'] = X['CLASSIFICATION'].replace(replaced_class, 'other')

#check that the binning was done correctly
X['CLASSIFICATION'].value_counts()

C1000    17326
C2000     6074
C1200     4837
C3000     1918
C2100     1883
other     1484
C7000      777
Name: CLASSIFICATION, dtype: int64

In [12]:
#use get_dummies() function to encode categorical variables
X = pd.get_dummies(X)
X.head()

,STATUS,ASK_AMT,APPLICATION_TYPE_T10,APPLICATION_TYPE_T19,APPLICATION_TYPE_T3,APPLICATION_TYPE_T4,APPLICATION_TYPE_T5,APPLICATION_TYPE_T6,APPLICATION_TYPE_T7,APPLICATION_TYPE_T8,...,INCOME_AMT_1-9999,INCOME_AMT_10000-24999,INCOME_AMT_100000-499999,INCOME_AMT_10M-50M,INCOME_AMT_1M-5M,INCOME_AMT_25000-99999,INCOME_AMT_50M+,INCOME_AMT_5M-10M,SPECIAL_CONSIDERATIONS_N,SPECIAL_CONSIDERATIONS_Y
0,1,5000,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,1,108590,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
2,1,5000,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,1,6692,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
4,1,142590,0,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0


In [16]:
len(X.columns)

44
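
The count of 44 can be checked by hand. As a small worked example using only the `nunique()` and `value_counts()` outputs shown earlier, `get_dummies()` (with its default settings) creates one column per category level and passes the two numeric columns through unchanged:

```python
# Unique values per categorical column after binning, taken from the
# nunique() and value_counts() outputs earlier in this notebook.
categorical_levels = {
    "APPLICATION_TYPE": 9,        # 8 kept types + 'other'
    "AFFILIATION": 6,
    "CLASSIFICATION": 7,          # 6 kept classes + 'other'
    "USE_CASE": 5,
    "ORGANIZATION": 4,
    "INCOME_AMT": 9,
    "SPECIAL_CONSIDERATIONS": 2,
}
numeric_columns = ["STATUS", "ASK_AMT"]  # left as-is by get_dummies

expected_columns = len(numeric_columns) + sum(categorical_levels.values())
print(expected_columns)  # 44
```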

In [15]:
#create train and test splits for the model
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=42)

#create scaler instance
scaler = StandardScaler()
#fit the scaler on the training data only
scaler.fit(X_train)

#transform X datasets with scaler
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
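
To illustrate what the fit and transform steps compute, here is a hand-rolled sketch in NumPy with made-up numbers. It is not the scikit-learn implementation, only the same arithmetic: per-column mean and population standard deviation are learned from the training split and then reused for the test split.

```python
import numpy as np

# Toy feature matrices; the values are made up for illustration.
train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
test = np.array([[2.0, 900.0]])

# "fit": learn the per-column mean and (population) standard deviation
mean = train.mean(axis=0)
std = train.std(axis=0)  # ddof=0, the same convention StandardScaler uses

# "transform": apply the training statistics to both splits
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately zero in each column
print(test_scaled)
```

Reusing the training statistics for the test split keeps information about the test data out of the preprocessing step, which is why the scaler is fitted on `X_train` alone.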

### Step 2: Compile, Train, and Evaluate the Model

In [3]:
#Create a neural network model by assigning the number of input features and nodes for each layer using TensorFlow and Keras.
import tensorflow as tf

nn_model = tf.keras.Sequential()

#add layers to the neural network model
#units is the number of nodes in the layer
#input_dim is the number of features in the dataset (44 columns after encoding)
#activation is the activation function applied to the layer's output

#first hidden layer
nn_model.add(tf.keras.layers.Dense(units=8, input_dim=44, activation="relu"))

#second hidden layer
nn_model.add(tf.keras.layers.Dense(units=5, activation="relu"))

#output layer: a single sigmoid unit that returns the probability that IS_SUCCESSFUL is 1
nn_model.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

#checking the structure of the model
nn_model.summary()
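
The summary output is not captured in this notebook, but the number of trainable parameters can be worked out by hand. As a rough sketch, a fully connected layer has one weight per input-unit pair plus one bias per unit; `dense_params` is a hypothetical helper written for this illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases for one fully connected (Dense) layer."""
    return n_inputs * n_units + n_units

layer_sizes = [44, 8, 5, 1]  # input features, then units per Dense layer
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [360, 45, 6] 411
```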

In [1]:
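#Compile the model.
#this cell depends on nn_model from the previous cell; running it first in a fresh kernel raises the NameError shown below
nn_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

#train the model
fitted_model = nn_model.fit(X_train_scaled, y_train, epochs=100)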

NameError: name 'nn_model' is not defined

In [None]:
#Evaluate the model using the test data to determine the loss and accuracy.
model_loss, model_accuracy = nn_model.evaluate(X_test_scaled, y_test, verbose=2)
print(f'Model Loss: {model_loss}, Model Accuracy: {model_accuracy}')
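
For reference, the two numbers returned by the evaluation can be sketched by hand. The example below uses made-up labels and sigmoid outputs, not results from this model, and assumes the usual 0.5 decision threshold for binary accuracy.

```python
import numpy as np

# Made-up labels and sigmoid outputs for illustration only
y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.7, 0.6])

# Binary cross-entropy: mean of -[y*log(p) + (1-y)*log(1-p)]
eps = 1e-7  # clip to avoid log(0)
p = np.clip(y_prob, eps, 1 - eps)
loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Accuracy: threshold the probabilities at 0.5 and compare with the labels
y_pred = (y_prob > 0.5).astype(int)
accuracy = np.mean(y_pred == y_true)

print(f"loss={loss:.4f}, accuracy={accuracy:.2f}")
```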

In [None]:
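#Save and export the trained model to an HDF5 file
#(h5py.File(..., 'w') would only create an empty file; the model's save method writes the architecture and weights)
nn_model.save('AlphabetSoupCharity.h5')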