#### 1 & 2: Import and Load Data

In [1]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler,OneHotEncoder
import pandas as pd
import tensorflow as tf 
from tensorflow.keras.layers import Input, Dense
from tensorflow import keras

# Import our input dataset
# (the separator pattern strips stray whitespace around the commas)
charity_df = pd.read_csv('Resources/charity_data.csv', sep=r'\s*,\s*', header=0, encoding='ascii', engine='python')
charity_df.head()

Unnamed: 0,EIN,NAME,APPLICATION_TYPE,AFFILIATION,CLASSIFICATION,USE_CASE,ORGANIZATION,STATUS,INCOME_AMT,SPECIAL_CONSIDERATIONS,ASK_AMT,IS_SUCCESSFUL
0,10520599,BLUE KNIGHTS MOTORCYCLE CLUB,T10,Independent,C1000,ProductDev,Association,1,0,N,5000,1
1,10531628,AMERICAN CHESAPEAKE CLUB CHARITABLE TR,T3,Independent,C2000,Preservation,Co-operative,1,1-9999,N,108590,1
2,10547893,ST CLOUD PROFESSIONAL FIREFIGHTERS,T5,CompanySponsored,C3000,ProductDev,Association,1,0,N,5000,0
3,10553066,SOUTHSIDE ATHLETIC ASSOCIATION,T3,CompanySponsored,C2000,Preservation,Trust,1,10000-24999,N,6692,1
4,10556103,GENETIC RESEARCH INSTITUTE OF THE DESERT,T3,Independent,C1000,Heathcare,Trust,1,100000-499999,N,142590,1


#### 3: Import and characterize the input data. Hint: Be sure to identify the following in your dataset

What variable(s) are considered the target and the features for your model?
What variable(s) are neither and should be removed from the input data?

In [2]:
# Drop the columns that will not be used as features or as the target

charity_df.drop(["EIN", "AFFILIATION", "SPECIAL_CONSIDERATIONS"], axis=1, inplace=True)

# Display the first rows
charity_df.head()

Unnamed: 0,NAME,APPLICATION_TYPE,CLASSIFICATION,USE_CASE,ORGANIZATION,STATUS,INCOME_AMT,ASK_AMT,IS_SUCCESSFUL
0,BLUE KNIGHTS MOTORCYCLE CLUB,T10,C1000,ProductDev,Association,1,0,5000,1
1,AMERICAN CHESAPEAKE CLUB CHARITABLE TR,T3,C2000,Preservation,Co-operative,1,1-9999,108590,1
2,ST CLOUD PROFESSIONAL FIREFIGHTERS,T5,C3000,ProductDev,Association,1,0,5000,0
3,SOUTHSIDE ATHLETIC ASSOCIATION,T3,C2000,Preservation,Trust,1,10000-24999,6692,1
4,GENETIC RESEARCH INSTITUTE OF THE DESERT,T3,C1000,Heathcare,Trust,1,100000-499999,142590,1


In [3]:
# Generate our categorical variable list
charity_cat = charity_df.dtypes[charity_df.dtypes == "object"].index.tolist()
charity_cat

['NAME',
 'APPLICATION_TYPE',
 'CLASSIFICATION',
 'USE_CASE',
 'ORGANIZATION',
 'INCOME_AMT']

In [4]:
# Check the number of unique values in each categorical column with nunique().
# Columns with more than 10 unique values are candidates for bucketing.
charity_df[charity_cat].nunique()

NAME                19568
APPLICATION_TYPE       17
CLASSIFICATION         71
USE_CASE                5
ORGANIZATION            4
INCOME_AMT              9
dtype: int64
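
The cardinality check above can be turned into an explicit list of columns that need bucketing. The following is a minimal sketch on a toy DataFrame; the values in `toy_df` and the name `max_unique` are illustrative assumptions, not part of the charity data.

```python
import pandas as pd

# Toy stand-in for charity_df (hypothetical values, for illustration only)
toy_df = pd.DataFrame({
    "APPLICATION_TYPE": [f"T{i}" for i in range(12)],
    "USE_CASE": ["ProductDev", "Preservation"] * 6,
})

# Count the unique values in each categorical column
unique_counts = toy_df.nunique()

# Keep only the columns whose cardinality exceeds the threshold
max_unique = 10
needs_bucketing = unique_counts[unique_counts > max_unique].index.tolist()
print(needs_bucketing)
```

Applied to the counts printed above, the same filter would flag NAME, APPLICATION_TYPE, and CLASSIFICATION.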

#### 4: Preprocess all Numerical and Categorical Variables, as needed

##### Bucketing, Encoding & Standardization

To determine which CLASSIFICATION values are uncommon enough to bucket into an “Other” category, use a density plot to identify where the value counts “fall off” and set the threshold within that region.

In [5]:
# Print out the CLASSIFICATION value counts
charity_counts = charity_df.CLASSIFICATION.value_counts()
charity_counts

C1000    17326
C2000     6074
C1200     4837
C3000     1918
C2100     1883
         ...  
C2500        1
C2170        1
C1820        1
C4500        1
C1900        1
Name: CLASSIFICATION, Length: 71, dtype: int64

In [6]:
# Visualize the value counts, use a density plot to identify where the value counts “fall off” and set the threshold
charity_counts.plot.density()

<matplotlib.axes._subplots.AxesSubplot at 0x17ab5350f60>

According to the density plot, the most common values each have more than 1000 instances in the dataset. Therefore, we can bucket any CLASSIFICATION value that appears fewer than 1000 times as “Other.” To do this, we’ll use a Python for loop and Pandas’ replace method.

In [7]:
# Determine which values to replace
replace_charity = list(charity_counts[charity_counts < 1000].index)

# Replace in DataFrame
for charity in replace_charity:
    charity_df.CLASSIFICATION = charity_df.CLASSIFICATION.replace(charity,"Other")


# Check to make sure binning was successful
charity_df.CLASSIFICATION.value_counts()

C1000    17326
C2000     6074
C1200     4837
Other     2261
C3000     1918
C2100     1883
Name: CLASSIFICATION, dtype: int64
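
The loop above calls replace once per rare value. As an alternative, the same bucketing can be done in a single vectorized step with `isin` and `where`. This is a sketch on a toy Series; the values and the cutoff of 3 are assumptions chosen to keep the example small (the notebook uses 1000).

```python
import pandas as pd

# Toy column (hypothetical values); the notebook works on charity_df.CLASSIFICATION
classification = pd.Series(["C1000"] * 5 + ["C2000"] * 3 + ["C2500", "C4500"])

counts = classification.value_counts()
threshold = 3  # illustrative cutoff
rare_values = counts[counts < threshold].index

# Keep values that are not rare; replace every rare value with "Other"
bucketed = classification.where(~classification.isin(rare_values), "Other")
print(bucketed.value_counts())
```

On a column with many rare values this avoids one full pass over the data per replaced value.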

After reducing the number of unique values in the CLASSIFICATION variable, transform the variable using one-hot encoding. A common way to perform one-hot encoding in Python is to use Scikit-learn’s OneHotEncoder class on the CLASSIFICATION variable. To build the encoded columns, create an instance of OneHotEncoder and “fit” the encoder with our values.

In [8]:
# Create the OneHotEncoder instance
# (scikit-learn >= 1.2; older releases use sparse=False instead of sparse_output=False)
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse_output=False)

# Fit the encoder and produce encoded DataFrame
encode_df = pd.DataFrame(enc.fit_transform(charity_df.CLASSIFICATION.values.reshape(-1,1)))

# Rename encoded columns
# (older scikit-learn releases name this method get_feature_names)
encode_df.columns = enc.get_feature_names_out(['CLASSIFICATION'])
encode_df.head()

Unnamed: 0,CLASSIFICATION_C1000,CLASSIFICATION_C1200,CLASSIFICATION_C2000,CLASSIFICATION_C2100,CLASSIFICATION_C3000,CLASSIFICATION_Other
0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0
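
For comparison, pandas can produce the same kind of indicator columns directly with `get_dummies`. This is a sketch on a toy DataFrame with made-up rows, offered as an alternative to the OneHotEncoder approach rather than a replacement for it.

```python
import pandas as pd

# Toy stand-in for the bucketed CLASSIFICATION column (hypothetical rows)
toy_df = pd.DataFrame({"CLASSIFICATION": ["C1000", "C2000", "Other", "C1000"]})

# dtype=float mirrors the 1.0/0.0 values that OneHotEncoder produces
encoded = pd.get_dummies(toy_df["CLASSIFICATION"], prefix="CLASSIFICATION", dtype=float)
print(encoded)
```

One trade-off: `get_dummies` does not remember the categories it saw, so a fitted OneHotEncoder is usually the safer choice when the same encoding must later be applied to new data.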


Join the encoded DataFrame with the original and drop the original “CLASSIFICATION” column. The process of joining the two DataFrames together is handled by the Pandas merge method and can be performed within one line.

In [9]:
# Merge the two DataFrames together and drop the CLASSIFICATION column
# (the result is displayed only; it is not assigned back to charity_df)
charity_df.merge(encode_df,left_index=True,right_index=True).drop(columns="CLASSIFICATION")

Unnamed: 0,NAME,APPLICATION_TYPE,USE_CASE,ORGANIZATION,STATUS,INCOME_AMT,ASK_AMT,IS_SUCCESSFUL,CLASSIFICATION_C1000,CLASSIFICATION_C1200,CLASSIFICATION_C2000,CLASSIFICATION_C2100,CLASSIFICATION_C3000,CLASSIFICATION_Other
0,BLUE KNIGHTS MOTORCYCLE CLUB,T10,ProductDev,Association,1,0,5000,1,1.0,0.0,0.0,0.0,0.0,0.0
1,AMERICAN CHESAPEAKE CLUB CHARITABLE TR,T3,Preservation,Co-operative,1,1-9999,108590,1,0.0,0.0,1.0,0.0,0.0,0.0
2,ST CLOUD PROFESSIONAL FIREFIGHTERS,T5,ProductDev,Association,1,0,5000,0,0.0,0.0,0.0,0.0,1.0,0.0
3,SOUTHSIDE ATHLETIC ASSOCIATION,T3,Preservation,Trust,1,10000-24999,6692,1,0.0,0.0,1.0,0.0,0.0,0.0
4,GENETIC RESEARCH INSTITUTE OF THE DESERT,T3,Heathcare,Trust,1,100000-499999,142590,1,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34294,THE LIONS CLUB OF HONOLULU KAMEHAMEHA,T4,ProductDev,Association,1,0,5000,0,1.0,0.0,0.0,0.0,0.0,0.0
34295,INTERNATIONAL ASSOCIATION OF LIONS CLUBS,T4,ProductDev,Association,1,0,5000,0,0.0,0.0,0.0,0.0,1.0,0.0
34296,PTA HAWAII CONGRESS,T3,Preservation,Association,1,0,5000,0,0.0,0.0,1.0,0.0,0.0,0.0
34297,AMERICAN FEDERATION OF GOVERNMENT EMPLOYEES LO...,T5,ProductDev,Association,1,0,5000,1,0.0,0.0,0.0,0.0,1.0,0.0
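
Because `merge` returns a new DataFrame instead of modifying the original, the result has to be assigned if later cells should see the encoded columns. A minimal sketch on toy data follows; `toy_df`, `toy_encoded`, and `merged_df` are hypothetical names.

```python
import pandas as pd

# Toy stand-ins for charity_df and encode_df (hypothetical rows)
toy_df = pd.DataFrame({
    "CLASSIFICATION": ["C1000", "Other"],
    "ASK_AMT": [5000, 108590],
})
toy_encoded = pd.DataFrame({
    "CLASSIFICATION_C1000": [1.0, 0.0],
    "CLASSIFICATION_Other": [0.0, 1.0],
})

# merge() returns a new DataFrame, so keep the result by assigning it
merged_df = (
    toy_df.merge(toy_encoded, left_index=True, right_index=True)
    .drop(columns="CLASSIFICATION")
)
print(merged_df.columns.tolist())
```

In this notebook the merged result is only displayed, which is why `charity_df` still contains the CLASSIFICATION column in the cells that follow.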


Standardize numerical variables using Scikit-learn’s StandardScaler class. The remaining string columns must first be converted to numbers.

In [10]:
# Convert the remaining string columns to integer codes to resolve the error below:
# ValueError: could not convert string to float: 'ASSOCIATION OF OPERATING ROOM NURSES INC'
# (AFFILIATION and SPECIAL_CONSIDERATIONS were dropped earlier, so they are not encoded)

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
charity_df['NAME'] = le.fit_transform(charity_df['NAME'])
charity_df['APPLICATION_TYPE'] = le.fit_transform(charity_df['APPLICATION_TYPE'])
charity_df['CLASSIFICATION'] = le.fit_transform(charity_df['CLASSIFICATION'])
charity_df['USE_CASE'] = le.fit_transform(charity_df['USE_CASE'])
charity_df['ORGANIZATION'] = le.fit_transform(charity_df['ORGANIZATION'])
charity_df['INCOME_AMT'] = le.fit_transform(charity_df['INCOME_AMT'])

charity_df.head()

Unnamed: 0,NAME,APPLICATION_TYPE,CLASSIFICATION,USE_CASE,ORGANIZATION,STATUS,INCOME_AMT,ASK_AMT,IS_SUCCESSFUL
0,2237,0,0,4,0,1,0,5000,1
1,860,10,2,3,1,1,1,108590,1
2,16310,12,4,4,0,1,0,5000,0
3,16127,10,2,3,3,1,2,6692,1
4,6807,10,0,1,3,1,3,142590,1


In [11]:
# Check the data type of each column
charity_df.dtypes

NAME                int32
APPLICATION_TYPE    int32
CLASSIFICATION      int32
USE_CASE            int32
ORGANIZATION        int32
STATUS              int64
INCOME_AMT          int32
ASK_AMT             int64
IS_SUCCESSFUL       int64
dtype: object

In [12]:
# Split our preprocessed data into our "X" features and "y" target arrays
y = charity_df["IS_SUCCESSFUL"].values
X = charity_df.drop(columns=["IS_SUCCESSFUL"]).values

# Split the preprocessed data into a training and testing dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)

In [13]:
# Next, standardize the numerical variables using Scikit-learn’s StandardScaler class
# Preprocess numerical data for neural network

# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the StandardScaler on the training data only
X_scaler = scaler.fit(X_train)

# Scale the data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
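
To make the scaling step concrete: standardization subtracts each column's mean and divides by its standard deviation, z = (x - mean) / std, with both statistics learned from the training data only. The sketch below reproduces that arithmetic in NumPy on toy arrays. The values are made up, and this illustrates the formula rather than StandardScaler's actual code.

```python
import numpy as np

# Toy feature matrices (hypothetical values)
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_toy = np.array([[2.0, 400.0]])

# "Fit": learn the per-column mean and standard deviation from the training data only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population standard deviation (ddof=0)

# "Transform": apply the same training statistics to both sets
X_train_toy_scaled = (X_train_toy - mean) / std
X_test_toy_scaled = (X_test_toy - mean) / std
print(X_train_toy_scaled)
print(X_test_toy_scaled)
```

Reusing the training statistics for the test set keeps information about the test data out of the preprocessing step.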

In [18]:
# Create a DataFrame with the scaled data
transformed_scaled_data = pd.DataFrame(X_train_scaled, columns=['NAME', 'APPLICATION_TYPE', 'CLASSIFICATION', 'USE_CASE',
       'ORGANIZATION', 'STATUS', 'INCOME_AMT', 'ASK_AMT'])
transformed_scaled_data.head()

Unnamed: 0,NAME,APPLICATION_TYPE,CLASSIFICATION,USE_CASE,ORGANIZATION,STATUS,INCOME_AMT,ASK_AMT
0,-1.499204,1.479888,0.507463,1.695031,0.671967,0.013943,-0.572771,-0.033484
1,-0.033289,-0.065461,-0.781882,-0.23456,0.671967,0.013943,-0.572771,-0.033484
2,0.467479,-0.065461,1.152135,-0.23456,0.671967,0.013943,-0.572771,-0.033484
3,-0.08107,-0.065461,-0.13721,-0.23456,0.671967,0.013943,-0.572771,-0.033484
4,0.138621,-0.065461,-0.781882,-0.23456,0.671967,0.013943,1.270587,0.100966


In [16]:
charity_df.columns

Index(['NAME', 'APPLICATION_TYPE', 'CLASSIFICATION', 'USE_CASE',
       'ORGANIZATION', 'STATUS', 'INCOME_AMT', 'ASK_AMT', 'IS_SUCCESSFUL'],
      dtype='object')

#### 5: Using a TensorFlow neural network design of your choice, create a binary classification model that can predict if an Alphabet Soup funded organization will be successful based on the features in the dataset.

You may choose to use a neural network or deep learning model.
Hint: Think about how many inputs there are before determining the number of neurons and layers in your model.

#### Deep Learning Model

In [19]:
# Define the model - deep neural net
number_input_features = len(X_train[0])
hidden_nodes_layer1 =  8
hidden_nodes_layer2 = 5

nn = tf.keras.models.Sequential()

# First hidden layer
nn.add(
    tf.keras.layers.Dense(units=hidden_nodes_layer1, input_dim=number_input_features, activation="relu")
)

# Second hidden layer
nn.add(tf.keras.layers.Dense(units=hidden_nodes_layer2, activation="relu"))

# Output layer
nn.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

# Check the structure of the model
nn.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 8)                 72        
_________________________________________________________________
dense_2 (Dense)              (None, 5)                 45        
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 6         
Total params: 123
Trainable params: 123
Non-trainable params: 0
_________________________________________________________________
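
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. A short sketch of that arithmetic, using a hypothetical helper `dense_param_count`:

```python
def dense_param_count(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 8 input features, followed by the three Dense layers (8, 5, and 1 units)
layer_sizes = [8, 8, 5, 1]
params = [dense_param_count(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(params, sum(params))
```

This gives 72, 45, and 6 parameters per layer, 123 in total, matching the summary above.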


#### 6: Compile, train, and evaluate your binary classification model, and define the loss and accuracy metrics

In [20]:
# Compile model, use model as a binary classifier by using the binary_crossentropy loss function, adam optimizer, 
# and accuracy metrics

nn.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

Instructions for updating:
keep_dims is deprecated, use keepdims instead
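
To show what the two compile settings measure, here is a simplified NumPy sketch of binary cross-entropy and accuracy. The function names, the clipping constant `eps`, and the sample values are assumptions for illustration. This is not Keras's implementation.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps clips probabilities away from 0 and 1."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions that fall on the correct side of the threshold."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))

# Hypothetical labels and predicted probabilities
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.1, 0.4, 0.2]
print(binary_crossentropy(y_true, y_prob), accuracy(y_true, y_prob))
```

The loss penalizes confident wrong predictions heavily, while accuracy only counts which side of the threshold each prediction falls on.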


In [21]:
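# Train the model
# Note: this call passes the unscaled X_train; the standardized X_train_scaled
# array created in the scaling step above is not used here.
fit_model = nn.fit(X_train,y_train,epochs=100)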

Epoch 1/100
...
Epoch 100/100



In [22]:
# Evaluate the model using the test data
# Note: X_test is unscaled here as well, matching the unscaled data used for training.
model_loss, model_accuracy = nn.evaluate(X_test,y_test,verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

Loss: 7.442258485454859, Accuracy: 0.5331778426142545


#### 7: Do your best to optimize your model training and input data to achieve a target predictive accuracy higher than 75%.

In [None]:
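# Convert ORGANIZATION to float in place; assigning the result to charity_df
# itself would replace the whole DataFrame with a single column
charity_df['ORGANIZATION'] = charity_df['ORGANIZATION'].astype('float')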
charity_df