# Challenge 21: Deep Learning
## AlphabetSoupCharity_Optimization.ipynb
The goal is to improve the performance of the neural network developed
in 'deep_learning_challenge.ipynb'.
As in the previous model, the data set describes historical proposals and their outcomes and has these features:
- APPLICATION_TYPE (17 values)
- AFFILIATION ( 6 values)
- CLASSIFICATION ( 71 values)
- USE_CASE ( 5 values )
- ORGANIZATION ( 4 values )
- STATUS ( 2 values )
- INCOME_AMT ( 9 values )
- SPECIAL_CONSIDERATIONS ( 2 values )
- ASK_AMT ( 8747 values )
The target is:
- IS_SUCCESSFUL ( 2 values )

As before, the data set was edited as follows:
1. The columns 'EIN' and 'NAME' were dropped as uninformative.
2. The column 'APPLICATION_TYPE' has 17 unique values, but many of them occur infrequently. 'APPLICATION_TYPE' values with fewer than 500 counts were condensed into a single 'Other' category.
3. OneHot encoding was used to convert categorical values into numeric values
### In addition, the following four strategies were applied in order to improve model performance:
1. In addition to singleton 'CLASSIFICATION' values, any with 10 or fewer instances were consolidated into 'Other'.
2. Dropped the 'STATUS' column on the basis that the vast majority of cases are 'STATUS = 1' (active); only 5 of 34299 cases are 'STATUS = 0'.
3. Dropped the 'SPECIAL_CONSIDERATIONS' column on the basis that only 27/ 34299 have special considerations, and only 10/ 27 are 'not successful' (compared to 16038/ 34299 unsuccessful in total).
4. The keras-tuner was used to optimize the number of hidden layers (1 - 16 after the first layer), the number of neurons in each layer (2 - 24 in the first layer, 2 - 32 in the others, in steps of 2) and the activation function ('relu','tanh','sigmoid','leaky_relu').
### The keras tuner discovered the following 'best' model:
Model: "sequential_5"
| Layer (Type) | Output Shape  | Param # |
|:------------:|:--------------|:--------|
| Dense        | (None, 18)    | 1,134   |
| Dense        | (None, 26)    |   494   |
| Dense        | (None, 20)    |   540   |
| Dense        | (None,  4)    |    84   |
| Dense        | (None, 16)    |    80   |
| Dense        | (None, 20)    |   340   |
| Dense        | (None, 24)    |   504   |
| Dense        | (None,  4)    |   100   |
| Dense        | (None, 18)    |    90   |
| Dense        | (None, 16)    |   304   |
| Dense        | (None, 18)    |   306   |
| Dense        | (None,  1)    |    19   |
<br>
 Total params: 11,987 (46.83 KB)
 Trainable params: 3,995 (15.61 KB)
 Non-trainable params: 0 (0.00 B)
 Hidden Activation: tanh
 Output Activation: sigmoid
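
The per-layer parameter counts in the table can be checked by hand: a Dense layer with `n_in` inputs and `n_out` units has `n_in * n_out + n_out` weights and biases. The sketch below assumes 62 input features (the 63 columns reported by `application_df.info()` after encoding, minus the target); the helper name `dense_params` is illustrative only.

```python
# Layer widths taken from the summary table above
layer_units = [18, 26, 20, 4, 16, 20, 24, 4, 18, 16, 18, 1]
n_features = 62  # assumption: 63 preprocessed columns minus IS_SUCCESSFUL


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


params_per_layer = []
n_in = n_features
for n_out in layer_units:
    params_per_layer.append(dense_params(n_in, n_out))
    n_in = n_out  # this layer's output feeds the next layer

total_trainable = sum(params_per_layer)
print(params_per_layer)
print(total_trainable)
```

The sum agrees with the trainable parameter count in the summary; the larger 'Total params' figure presumably also includes optimizer state.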

### Performance:
| Model                        | Accuracy  | Loss    |
|:----------------------------:|:----------|:--------|
| tuner best model             | 0.7271    | 0.5547  |
| manual reconstruction        | 0.7285    | 0.5553  |
| save and restored            | 0.7285    | 0.5553  |
| simple model from notebook 1 | 0.7191    | 0.5763  |

## Conclusion: the interventions result in a slight improvement in model accuracy (just under 1 percentage point: 0.7285 vs 0.7191)

## Preprocessing

### - Import dependencies

In [541]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import pandas as pd
import tensorflow as tf
import keras_tuner as kt
import pprint

### - Declare Functions

In [544]:
def inspectColumnsForRareBinaryValues(df):
    """Return labels, counts and minority share for each two-valued column of df."""
    binary_value_columns = {}
    for col in df.columns:
        # Use the dataframe passed in, not the global application_df
        value_counts = df[col].value_counts()
        if len(value_counts) == 2:
            index_list = value_counts.index.to_list()
            value_count_list = list(value_counts)
            a, b = value_count_list
            # Share of the less frequent value (0.5 when both are equally common)
            percent = round(min(a, b) / (a + b), 5)
            binary_value_columns[col] = {
                'value_labels': index_list,
                'value_counts': value_count_list,
                'percent': percent,
            }
    return binary_value_columns

### - read the charity_data.csv
### from the provided cloud URL into dataframe, and inspect
- 'import pandas as pd'
- 'application_df = pd.read_csv("https://static.bc-edx.com/data/dl-1-2/m21/lms/starter/charity_data.csv")'
- 'application_df.head()'

In [547]:
application_df = pd.read_csv("https://static.bc-edx.com/data/dl-1-2/m21/lms/starter/charity_data.csv")
application_df.head()

Unnamed: 0,EIN,NAME,APPLICATION_TYPE,AFFILIATION,CLASSIFICATION,USE_CASE,ORGANIZATION,STATUS,INCOME_AMT,SPECIAL_CONSIDERATIONS,ASK_AMT,IS_SUCCESSFUL
0,10520599,BLUE KNIGHTS MOTORCYCLE CLUB,T10,Independent,C1000,ProductDev,Association,1,0,N,5000,1
1,10531628,AMERICAN CHESAPEAKE CLUB CHARITABLE TR,T3,Independent,C2000,Preservation,Co-operative,1,1-9999,N,108590,1
2,10547893,ST CLOUD PROFESSIONAL FIREFIGHTERS,T5,CompanySponsored,C3000,ProductDev,Association,1,0,N,5000,0
3,10553066,SOUTHSIDE ATHLETIC ASSOCIATION,T3,CompanySponsored,C2000,Preservation,Trust,1,10000-24999,N,6692,1
4,10556103,GENETIC RESEARCH INSTITUTE OF THE DESERT,T3,Independent,C1000,Heathcare,Trust,1,100000-499999,N,142590,1


In [549]:
binary_value_columns = inspectColumnsForRareBinaryValues(application_df)
pprint.pprint(binary_value_columns)

{'IS_SUCCESSFUL': {'percent': 0.46759,
                   'value_counts': [18261, 16038],
                   'value_labels': [1, 0]},
 'SPECIAL_CONSIDERATIONS': {'percent': 0.00079,
                            'value_counts': [34272, 27],
                            'value_labels': ['N', 'Y']},
 'STATUS': {'percent': 0.00015,
            'value_counts': [34294, 5],
            'value_labels': [1, 0]}}
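
The 'percent' entries are the share of the less frequent value in each two-valued column. As a quick sanity check, they can be recomputed from the printed counts; the dictionary name `printed_counts` exists only in this sketch.

```python
# Counts copied from the output above
printed_counts = {
    "IS_SUCCESSFUL": [18261, 16038],
    "SPECIAL_CONSIDERATIONS": [34272, 27],
    "STATUS": [34294, 5],
}

# Minority share = count of the rarer value / total rows, rounded as in the function
minority_share = {
    col: round(min(counts) / sum(counts), 5)
    for col, counts in printed_counts.items()
}
print(minority_share)
```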


### - Get value_counts for 'STATUS'

In [552]:
c = application_df['STATUS'].value_counts()
print(c)
print(type(c))
index_list = c.index.to_list()
print(index_list)

STATUS
1    34294
0        5
Name: count, dtype: int64
<class 'pandas.core.series.Series'>
[1, 0]


### CONCLUSION: only 5 of 34299 applications (~ 0.015%) are inactive: drop the 'STATUS' column

### - Get value_counts for 'SPECIAL_CONSIDERATIONS'

In [556]:
counts = list(application_df['SPECIAL_CONSIDERATIONS'].value_counts())
print(counts)

[34272, 27]


### - get co-counts of 'SPECIAL_CONSIDERATIONS' and 'IS_SUCCESSFUL'

In [559]:
application_df.loc[application_df['SPECIAL_CONSIDERATIONS'] == 'Y', ['SPECIAL_CONSIDERATIONS', 'IS_SUCCESSFUL']]

Unnamed: 0,SPECIAL_CONSIDERATIONS,IS_SUCCESSFUL
1374,Y,0
2928,Y,1
6056,Y,1
6805,Y,1
7747,Y,0
9437,Y,1
9941,Y,0
12318,Y,1
13998,Y,0
14575,Y,0


### - drop 'SPECIAL_CONSIDERATIONS' column:
- only 27/ 34299 have special considerations, and
- only 10/ 27 are not successful (compared to 16038/ 34299 unsuccessful in total)

In [562]:
application_df.drop(columns=['SPECIAL_CONSIDERATIONS'], inplace=True)
application_df.head()

Unnamed: 0,EIN,NAME,APPLICATION_TYPE,AFFILIATION,CLASSIFICATION,USE_CASE,ORGANIZATION,STATUS,INCOME_AMT,ASK_AMT,IS_SUCCESSFUL
0,10520599,BLUE KNIGHTS MOTORCYCLE CLUB,T10,Independent,C1000,ProductDev,Association,1,0,5000,1
1,10531628,AMERICAN CHESAPEAKE CLUB CHARITABLE TR,T3,Independent,C2000,Preservation,Co-operative,1,1-9999,108590,1
2,10547893,ST CLOUD PROFESSIONAL FIREFIGHTERS,T5,CompanySponsored,C3000,ProductDev,Association,1,0,5000,0
3,10553066,SOUTHSIDE ATHLETIC ASSOCIATION,T3,CompanySponsored,C2000,Preservation,Trust,1,10000-24999,6692,1
4,10556103,GENETIC RESEARCH INSTITUTE OF THE DESERT,T3,Independent,C1000,Heathcare,Trust,1,100000-499999,142590,1


### - Get percentage of successful applications

In [565]:
application_df['IS_SUCCESSFUL'].value_counts()

IS_SUCCESSFUL
1    18261
0    16038
Name: count, dtype: int64

### - Drop the  ID columns, 'EIN' and 'NAME', and the 'STATUS' column; inspect
- 'application_df.drop(columns=['EIN','NAME', 'STATUS'], inplace=True)'
- 'application_df.head()'

In [568]:
application_df.drop(columns=['EIN','NAME','STATUS'], inplace=True)
application_df.head()

Unnamed: 0,APPLICATION_TYPE,AFFILIATION,CLASSIFICATION,USE_CASE,ORGANIZATION,INCOME_AMT,ASK_AMT,IS_SUCCESSFUL
0,T10,Independent,C1000,ProductDev,Association,0,5000,1
1,T3,Independent,C2000,Preservation,Co-operative,1-9999,108590,1
2,T5,CompanySponsored,C3000,ProductDev,Association,0,5000,0
3,T3,CompanySponsored,C2000,Preservation,Trust,10000-24999,6692,1
4,T3,Independent,C1000,Heathcare,Trust,100000-499999,142590,1


### - Determine the number of unique values in each column.
- 'unique_values = application_df.nunique()'
- 'print(unique_values)'

In [571]:
unique_values = application_df.nunique()
print(unique_values)

APPLICATION_TYPE      17
AFFILIATION            6
CLASSIFICATION        71
USE_CASE               5
ORGANIZATION           4
INCOME_AMT             9
ASK_AMT             8747
IS_SUCCESSFUL          2
dtype: int64


### - get the number of rows and columns
- 'row_col = application_df.shape'
- 'print(f"Number of rows: {row_col[0]}")'
- 'print(f"Number of columns: {row_col[1]}")'

In [574]:
row_col = application_df.shape
print(f"Number of rows: {row_col[0]}")
print(f"Number of columns: {row_col[1]}")

Number of rows: 34299
Number of columns: 8


### - get the data types of dataframe column values
- 'application_df.info()'

In [577]:
application_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34299 entries, 0 to 34298
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   APPLICATION_TYPE  34299 non-null  object
 1   AFFILIATION       34299 non-null  object
 2   CLASSIFICATION    34299 non-null  object
 3   USE_CASE          34299 non-null  object
 4   ORGANIZATION      34299 non-null  object
 5   INCOME_AMT        34299 non-null  object
 6   ASK_AMT           34299 non-null  int64 
 7   IS_SUCCESSFUL     34299 non-null  int64 
dtypes: int64(2), object(6)
memory usage: 2.1+ MB


### inspect APPLICATION_TYPE value counts to identify infrequent types;

In [580]:
counts = application_df['APPLICATION_TYPE'].value_counts()
count_keys = counts.index
print(f"keys: {count_keys} values: {counts}")

keys: Index(['T3', 'T4', 'T6', 'T5', 'T19', 'T8', 'T7', 'T10', 'T9', 'T13', 'T12',
       'T2', 'T25', 'T14', 'T29', 'T15', 'T17'],
      dtype='object', name='APPLICATION_TYPE') values: APPLICATION_TYPE
T3     27037
T4      1542
T6      1216
T5      1173
T19     1065
T8       737
T7       725
T10      528
T9       156
T13       66
T12       27
T2        16
T25        3
T14        3
T29        2
T15        2
T17        1
Name: count, dtype: int64


In [582]:
application_df.head()

Unnamed: 0,APPLICATION_TYPE,AFFILIATION,CLASSIFICATION,USE_CASE,ORGANIZATION,INCOME_AMT,ASK_AMT,IS_SUCCESSFUL
0,T10,Independent,C1000,ProductDev,Association,0,5000,1
1,T3,Independent,C2000,Preservation,Co-operative,1-9999,108590,1
2,T5,CompanySponsored,C3000,ProductDev,Association,0,5000,0
3,T3,CompanySponsored,C2000,Preservation,Trust,10000-24999,6692,1
4,T3,Independent,C1000,Heathcare,Trust,100000-499999,142590,1


### collect application_types with fewer than 500 counts
### in list application_types_to_replace
- 'application_types_to_replace = []'
- 'for i in range(len(counts)):'
   > 'if counts.iloc[i] < 500:'
   > > 'application = count_keys[i]'
   > > 'application_types_to_replace.append(application)'
- 'print(application_types_to_replace)'

In [585]:
application_types_to_replace = []
for i in range(len(counts)):
    if counts.iloc[i] < 500:
        application = count_keys[i]
        application_types_to_replace.append(application)
print(application_types_to_replace)

['T9', 'T13', 'T12', 'T2', 'T25', 'T14', 'T29', 'T15', 'T17']


### - Replace low counts (< 500) in dataframe with "Other"
- 'for app in application_types_to_replace:'
    > 'application_df['APPLICATION_TYPE'] = application_df['APPLICATION_TYPE'].replace(app,"Other")'

In [588]:
for app in application_types_to_replace:
    application_df['APPLICATION_TYPE'] = application_df['APPLICATION_TYPE'].replace(app,"Other")
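
The collect-then-replace loop can also be written as a single vectorized step. This is a minimal sketch of an alternative, not the notebook's own method; the helper name `consolidate_rare` and the toy series `demo` are made up for illustration.

```python
import pandas as pd


def consolidate_rare(series, min_count, other="Other"):
    """Replace every value that occurs fewer than min_count times with `other`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; substitute `other` everywhere else
    return series.where(~series.isin(rare), other)


# Toy data: T3 and T4 are frequent, T9 and T2 occur once each
demo = pd.Series(["T3"] * 5 + ["T4"] * 3 + ["T9", "T2"])
result = consolidate_rare(demo, min_count=3)
print(result.value_counts())
```

On the real data the equivalent call would be `consolidate_rare(application_df['APPLICATION_TYPE'], 500)`.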

### - Check to make sure replacement was successful: OK

In [591]:
application_df['APPLICATION_TYPE'].value_counts()

APPLICATION_TYPE
T3       27037
T4        1542
T6        1216
T5        1173
T19       1065
T8         737
T7         725
T10        528
Other      276
Name: count, dtype: int64

In [593]:
application_df.head()

Unnamed: 0,APPLICATION_TYPE,AFFILIATION,CLASSIFICATION,USE_CASE,ORGANIZATION,INCOME_AMT,ASK_AMT,IS_SUCCESSFUL
0,T10,Independent,C1000,ProductDev,Association,0,5000,1
1,T3,Independent,C2000,Preservation,Co-operative,1-9999,108590,1
2,T5,CompanySponsored,C3000,ProductDev,Association,0,5000,0
3,T3,CompanySponsored,C2000,Preservation,Trust,10000-24999,6692,1
4,T3,Independent,C1000,Heathcare,Trust,100000-499999,142590,1


In [595]:
application_df['IS_SUCCESSFUL'].value_counts()

IS_SUCCESSFUL
1    18261
0    16038
Name: count, dtype: int64

### Look at CLASSIFICATION value counts to identify 
### low counts for replacement with "Other"
- 'classification_counts = application_df['CLASSIFICATION'].value_counts()'
- 'classification_count_keys = classification_counts.index'
- 'print(f"keys: {classification_count_keys} values: {classification_counts}")'

In [598]:
classification_counts = application_df['CLASSIFICATION'].value_counts()
classification_count_keys = classification_counts.index
print(f"keys: {classification_count_keys} values: {classification_counts}")

keys: Index(['C1000', 'C2000', 'C1200', 'C3000', 'C2100', 'C7000', 'C1700', 'C4000',
       'C5000', 'C1270', 'C2700', 'C2800', 'C7100', 'C1300', 'C1280', 'C1230',
       'C1400', 'C7200', 'C2300', 'C1240', 'C8000', 'C7120', 'C1500', 'C1800',
       'C6000', 'C1250', 'C8200', 'C1238', 'C1278', 'C1235', 'C1237', 'C7210',
       'C2400', 'C1720', 'C4100', 'C1257', 'C1600', 'C1260', 'C2710', 'C0',
       'C3200', 'C1234', 'C1246', 'C1267', 'C1256', 'C2190', 'C4200', 'C2600',
       'C5200', 'C1370', 'C1248', 'C6100', 'C1820', 'C1900', 'C1236', 'C3700',
       'C2570', 'C1580', 'C1245', 'C2500', 'C1570', 'C1283', 'C2380', 'C1732',
       'C1728', 'C2170', 'C4120', 'C8210', 'C2561', 'C4500', 'C2150'],
      dtype='object', name='CLASSIFICATION') values: CLASSIFICATION
C1000    17326
C2000     6074
C1200     4837
C3000     1918
C2100     1883
         ...  
C4120        1
C8210        1
C2561        1
C4500        1
C2150        1
Name: count, Length: 71, dtype: int64


### - create a list of classifications with 10 or fewer instances
### - collect in list `classifications_to_replace`
- 'classifications_to_replace = []'
- 'for i in range(len(classification_counts)):'
   >  'if classification_counts.iloc[i] <= 10:'
   > >     'classification = classification_count_keys[i]'
   > >     'classifications_to_replace.append(classification)'
- 'print(classifications_to_replace)'

In [601]:
classifications_to_replace = []
for i in range(len(classification_counts)):
    if classification_counts.iloc[i] <= 10:
        classification = classification_count_keys[i]
        classifications_to_replace.append(classification)
print(classifications_to_replace)

['C1238', 'C1278', 'C1235', 'C1237', 'C7210', 'C2400', 'C1720', 'C4100', 'C1257', 'C1600', 'C1260', 'C2710', 'C0', 'C3200', 'C1234', 'C1246', 'C1267', 'C1256', 'C2190', 'C4200', 'C2600', 'C5200', 'C1370', 'C1248', 'C6100', 'C1820', 'C1900', 'C1236', 'C3700', 'C2570', 'C1580', 'C1245', 'C2500', 'C1570', 'C1283', 'C2380', 'C1732', 'C1728', 'C2170', 'C4120', 'C8210', 'C2561', 'C4500', 'C2150']


### - Replace classifications in list classifications_to_replace
### with "Other"
- 'for cls in classifications_to_replace:'
  >   'application_df['CLASSIFICATION'] = application_df['CLASSIFICATION'].replace(cls,"Other")'

In [604]:
for cls in classifications_to_replace:
    application_df['CLASSIFICATION'] = application_df['CLASSIFICATION'].replace(cls,"Other")

### - check to make sure replacement was successful
- 'application_df['CLASSIFICATION'].value_counts()'

In [607]:
application_df['CLASSIFICATION'].value_counts()

CLASSIFICATION
C1000    17326
C2000     6074
C1200     4837
C3000     1918
C2100     1883
C7000      777
C1700      287
C4000      194
Other      118
C5000      116
C1270      114
C2700      104
C2800       95
C7100       75
C1300       58
C1280       50
C1230       36
C1400       34
C2300       32
C7200       32
C1240       30
C8000       20
C7120       18
C1500       16
C6000       15
C1800       15
C1250       14
C8200       11
Name: count, dtype: int64

### - Get column headers
- 'application_df.columns'

In [610]:
application_df.columns

Index(['APPLICATION_TYPE', 'AFFILIATION', 'CLASSIFICATION', 'USE_CASE',
       'ORGANIZATION', 'INCOME_AMT', 'ASK_AMT', 'IS_SUCCESSFUL'],
      dtype='object')

### - generate categorical variable list
- 'application_cat = application_df.dtypes[application_df.dtypes == "object"].index.tolist()'
- 'application_cat'

In [613]:
application_cat = application_df.dtypes[application_df.dtypes == "object"].index.tolist()
application_cat

['APPLICATION_TYPE',
 'AFFILIATION',
 'CLASSIFICATION',
 'USE_CASE',
 'ORGANIZATION',
 'INCOME_AMT']

### - check the number of unique values in each categorical column
- 'application_df[application_cat].nunique()'

In [616]:
application_df[application_cat].nunique()

APPLICATION_TYPE     9
AFFILIATION          6
CLASSIFICATION      28
USE_CASE             5
ORGANIZATION         4
INCOME_AMT           9
dtype: int64

### - create OneHot Encoder Instance
### NOTE using OneHot encoding rather than get_dummies to deal with categorical values
- 'enc = OneHotEncoder(sparse_output=False)'

In [619]:
enc = OneHotEncoder(sparse_output=False)

### - fit and transform One-Hot Encoder using the categorical variables list 'application_cat'
- 'encode_df = pd.DataFrame(enc.fit_transform(application_df[application_cat]))'

In [622]:
encode_df = pd.DataFrame(enc.fit_transform(application_df[application_cat]))

### - add the encoder variable names to the dataframe
- 'encode_df.columns = enc.get_feature_names_out(application_cat)'
- 'encode_df.head()'

In [625]:
encode_df.columns = enc.get_feature_names_out(application_cat)
encode_df.head()

Unnamed: 0,APPLICATION_TYPE_Other,APPLICATION_TYPE_T10,APPLICATION_TYPE_T19,APPLICATION_TYPE_T3,APPLICATION_TYPE_T4,APPLICATION_TYPE_T5,APPLICATION_TYPE_T6,APPLICATION_TYPE_T7,APPLICATION_TYPE_T8,AFFILIATION_CompanySponsored,...,ORGANIZATION_Trust,INCOME_AMT_0,INCOME_AMT_1-9999,INCOME_AMT_10000-24999,INCOME_AMT_100000-499999,INCOME_AMT_10M-50M,INCOME_AMT_1M-5M,INCOME_AMT_25000-99999,INCOME_AMT_50M+,INCOME_AMT_5M-10M
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


### - merge One-Hot features and drop the originals
- 'application_df = application_df.merge(encode_df,left_index=True, right_index=True)'
- 'application_df = application_df.drop(application_cat, axis=1)'
- 'application_df.head()'

In [628]:
application_df = application_df.merge(encode_df,left_index=True, right_index=True)
application_df = application_df.drop(application_cat, axis=1)
application_df.head()

Unnamed: 0,ASK_AMT,IS_SUCCESSFUL,APPLICATION_TYPE_Other,APPLICATION_TYPE_T10,APPLICATION_TYPE_T19,APPLICATION_TYPE_T3,APPLICATION_TYPE_T4,APPLICATION_TYPE_T5,APPLICATION_TYPE_T6,APPLICATION_TYPE_T7,...,ORGANIZATION_Trust,INCOME_AMT_0,INCOME_AMT_1-9999,INCOME_AMT_10000-24999,INCOME_AMT_100000-499999,INCOME_AMT_10M-50M,INCOME_AMT_1M-5M,INCOME_AMT_25000-99999,INCOME_AMT_50M+,INCOME_AMT_5M-10M
0,5000,1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,108590,1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,5000,0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,6692,1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,142590,1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


### - inspect dataframe with 'info'
- 'application_df.info()'

In [631]:
application_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34299 entries, 0 to 34298
Data columns (total 63 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   ASK_AMT                       34299 non-null  int64  
 1   IS_SUCCESSFUL                 34299 non-null  int64  
 2   APPLICATION_TYPE_Other        34299 non-null  float64
 3   APPLICATION_TYPE_T10          34299 non-null  float64
 4   APPLICATION_TYPE_T19          34299 non-null  float64
 5   APPLICATION_TYPE_T3           34299 non-null  float64
 6   APPLICATION_TYPE_T4           34299 non-null  float64
 7   APPLICATION_TYPE_T5           34299 non-null  float64
 8   APPLICATION_TYPE_T6           34299 non-null  float64
 9   APPLICATION_TYPE_T7           34299 non-null  float64
 10  APPLICATION_TYPE_T8           34299 non-null  float64
 11  AFFILIATION_CompanySponsored  34299 non-null  float64
 12  AFFILIATION_Family/Parent     34299 non-null  float64
 13  A

### - split the preprocessed data into features ('X') and targets ('y')
- 'y = application_df['IS_SUCCESSFUL'].values'
- 'X = application_df.drop("IS_SUCCESSFUL", axis=1).values'

In [634]:
y = application_df['IS_SUCCESSFUL'].values
X = application_df.drop("IS_SUCCESSFUL", axis=1).values

### - split the preprocessed data into a training and testing dataset
- 'X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)'

In [637]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)
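
`train_test_split` is called without `test_size`, so scikit-learn's default 25% hold-out applies. The arithmetic below is a small sketch of the resulting split sizes and, assuming the Keras default batch size of 32, the number of batches per pass; the result is consistent with the `804/804` and `268/268` step counts in the training and evaluation logs.

```python
import math

n_rows = 34299        # rows in application_df
test_fraction = 0.25  # scikit-learn default when test_size and train_size are not given
batch_size = 32       # assumption: Keras default batch size for fit() and evaluate()

n_test = math.ceil(n_rows * test_fraction)  # the test split is rounded up
n_train = n_rows - n_test

train_steps = math.ceil(n_train / batch_size)
test_steps = math.ceil(n_test / batch_size)
print(n_train, n_test, train_steps, test_steps)
```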

### - Create a StandardScaler instance
- 'scaler = StandardScaler()'

In [640]:
scaler = StandardScaler()

### - Fit the StandardScaler
- 'X_scaler = scaler.fit(X_train)'

In [643]:
X_scaler = scaler.fit(X_train)

### - Scale the data
- 'X_train_scaled = X_scaler.transform(X_train)'
- 'X_test_scaled = X_scaler.transform(X_test)'

In [646]:
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
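
The point of fitting on `X_train` only is that the test set is scaled with statistics it did not contribute to. A minimal NumPy sketch of the same fit/transform idea, on made-up data (the array names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # made-up training features
test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))    # made-up test features

# "fit": learn per-column mean and population standard deviation from train only
mean = train.mean(axis=0)
std = train.std(axis=0)

# "transform": apply the training statistics to both splits
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

The scaled training columns have mean 0 and standard deviation 1 by construction; the test columns are only approximately standardized, which is expected.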

## Compile, Train and Evaluate the Model

## Define a function to create a new Sequential model with hyperparameter options
### Function will:
1. Create a keras model with: 'nn_model = tf.keras.models.Sequential()'
2. Allow keras-tuner to decide which activation function to use in hidden layers <br>
   with: 'activation = hp.Choice('activation',['relu','tanh','sigmoid','leaky_relu'])'
3. Allow kerastuner to decide number of neurons in first layer <br>
   with: <br>nn_model.add(tf.keras.layers.Dense(units=hp.Int('first_units',<br>
&nbsp;&nbsp;&nbsp;&nbsp;       min_value=2,<br>
&nbsp;&nbsp;&nbsp;&nbsp;        max_value=24,<br>
&nbsp;&nbsp;&nbsp;&nbsp;        step=2), activation=activation))
4. Allow kerastuner to decide number of hidden layers and neurons in hidden layers, with:<br>
for i in range(hp.Int('num_layers', 1, 16)):<br>
&nbsp;&nbsp;&nbsp;        nn_model.add(tf.keras.layers.Dense(units=hp.Int('units_' + str(i),<br>
&nbsp;&nbsp;&nbsp;&nbsp;            min_value=2,<br>
&nbsp;&nbsp;&nbsp;&nbsp;            max_value=32,<br>
&nbsp;&nbsp;&nbsp;&nbsp;            step=2),<br>
&nbsp;&nbsp;&nbsp;&nbsp;            activation=activation))
5. Add the output layer with: 'nn_model.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))'
6. Compile the model with: <br>
   'nn_model.compile(optimizer='adam', loss="binary_crossentropy", metrics=['accuracy'])'
7. Return nn_model

In [650]:
# >>>> # Create a method that creates a new Sequential model with hyperparameter options
def create_model(hp):
    nn_model = tf.keras.models.Sequential()
    
    # Allow kerastuner to decide which activation function to use in hidden layers
    activation = hp.Choice('activation',['relu','tanh','sigmoid','leaky_relu'])
    
    # Allow kerastuner to decide number of neurons in first layer
    nn_model.add(tf.keras.layers.Dense(units=hp.Int('first_units',
        min_value=2,
        max_value=24,
        step=2), activation=activation))

    # Allow kerastuner to decide number of hidden layers and neurons in hidden layers
    for i in range(hp.Int('num_layers', 1, 16)):
        nn_model.add(tf.keras.layers.Dense(units=hp.Int('units_' + str(i),
            min_value=2,
            max_value=32,
            step=2),
            activation=activation))

    nn_model.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

    # Compile the model
    nn_model.compile(optimizer='adam',
              #loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              loss="binary_crossentropy",
              metrics=['accuracy'])
    return nn_model
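
To get a feel for why a tuner is used rather than an exhaustive grid, the size of the search space defined by `create_model` can be counted directly. This is a back-of-the-envelope sketch; the variable names are illustrative and the count treats every combination of active hyperparameters as a distinct configuration.

```python
activations = ['relu', 'tanh', 'sigmoid', 'leaky_relu']
first_units_options = list(range(2, 24 + 1, 2))   # hp.Int('first_units', 2..24, step 2)
hidden_units_options = list(range(2, 32 + 1, 2))  # hp.Int('units_i', 2..32, step 2)
num_layers_options = range(1, 16 + 1)             # hp.Int('num_layers', 1, 16)

# For num_layers = n, each of the n extra layers picks its width independently
architectures = sum(len(hidden_units_options) ** n for n in num_layers_options)
search_space_size = len(activations) * len(first_units_options) * architectures
print(f"{search_space_size:.3e} distinct configurations")
```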

### - instantiate a tuner

In [653]:
# >>>> # Import the keras-tuner library
import keras_tuner as kt

tuner = kt.Hyperband(
    create_model, 
    seed=42,
    objective="val_accuracy",
    max_epochs=50,
    hyperband_iterations=2,
    directory="tune_dir",
    project_name="AlphabetSoupCharity_Optimization")

### - Run the kerastuner search for best hyperparameters
- 'tuner.search(X_train_scaled,y_train,epochs=50,validation_data=(X_test_scaled,y_test))'

In [689]:
tuner.search(X_train_scaled,y_train,epochs=50,validation_data=(X_test_scaled,y_test))

### - Get best model hyperparameters
- 'best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]'
- 'best_hps.values'

In [692]:
# Get the optimal hyperparameters
best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]
best_hps.values

{'activation': 'tanh',
 'first_units': 18,
 'num_layers': 10,
 'units_0': 26,
 'units_1': 20,
 'units_2': 4,
 'units_3': 16,
 'units_4': 20,
 'units_5': 24,
 'units_6': 4,
 'units_7': 18,
 'units_8': 16,
 'units_9': 18,
 'units_10': 8,
 'units_11': 30,
 'units_12': 10,
 'units_13': 6,
 'units_14': 18,
 'units_15': 4,
 'tuner/epochs': 50,
 'tuner/initial_epoch': 17,
 'tuner/bracket': 2,
 'tuner/round': 2,
 'tuner/trial_id': '0155'}
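
Note that the dictionary lists `units_10` to `units_15` even though `num_layers` is 10. The loop in `create_model` only reads `units_0` to `units_9` in that case, so the remaining entries do not affect the built model (presumably they were registered by other trials). The sketch below mirrors that loop in plain Python to recover the layer widths; `best_values` is a hand-copied subset of the output above.

```python
# Values copied from best_hps.values above (tuner bookkeeping keys omitted)
best_values = {
    'activation': 'tanh', 'first_units': 18, 'num_layers': 10,
    'units_0': 26, 'units_1': 20, 'units_2': 4, 'units_3': 16,
    'units_4': 20, 'units_5': 24, 'units_6': 4, 'units_7': 18,
    'units_8': 16, 'units_9': 18, 'units_10': 8, 'units_11': 30,
    'units_12': 10, 'units_13': 6, 'units_14': 18, 'units_15': 4,
}

# Mirror the loop in create_model: only units_0 .. units_{num_layers-1} are read
hidden_widths = [best_values['first_units']] + [
    best_values[f'units_{i}'] for i in range(best_values['num_layers'])
]
layer_widths = hidden_widths + [1]  # final sigmoid output unit
print(layer_widths)
```

The resulting widths match the twelve Dense layers in the model summary and in the manual reconstruction.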

## Evaluate best model against full test data
- 'model_opt = tuner.hypermodel.build(best_hps)'

In [696]:
# Build the model with the optimal hyperparameters and train it on the data for 50 epochs
model_opt = tuner.hypermodel.build(best_hps)
history = model_opt.fit(X_train_scaled, y_train, epochs=50, validation_data=(X_test_scaled,y_test))

val_acc_per_epoch = history.history['val_accuracy']
best_epoch = val_acc_per_epoch.index(max(val_acc_per_epoch)) + 1
print('Best epoch: %d' % (best_epoch,))

Epoch 1/50
804/804 ━━━━━━━━━━━━━━━━━━━━ 2s 807us/step - accuracy: 0.6983 - loss: 0.5973 - val_accuracy: 0.7229 - val_loss: 0.5671
Epoch 2/50
804/804 ━━━━━━━━━━━━━━━━━━━━ 1s 647us/step - accuracy: 0.7267 - loss: 0.5661 - val_accuracy: 0.7219 - val_loss: 0.5710
Epoch 3/50
804/804 ━━━━━━━━━━━━━━━━━━━━ 1s 645us/step - accuracy: 0.7305 - loss: 0.5565 - val_accuracy: 0.7227 - val_loss: 0.5593
Epoch 4/50
804/804 ━━━━━━━━━━━━━━━━━━━━ 1s 644us/step - accuracy: 0.7365 - loss: 0.5498 - val_accuracy: 0.7261 - val_loss: 0.5603
Epoch 5/50
804/804 ━━━━━━━━━━━━━━━━━━━━ 1s 639us/step - accuracy: 0.7342 - loss: 0.5497 - val_accuracy: 0.7223 - val_loss: 0.5641
Epoch 6/50
804/804 ━━━━━━━━━━━━━━━━━━━━ 1s 640us/step - accuracy: 0.7320 - loss: 0.5525 - val_accuracy: 0.7215 - val_loss: 0.5680
Epoch 7/50
8

In [701]:
# Evaluate the model using the test data
model_loss_opt, model_accuracy_opt = model_opt.evaluate(X_test_scaled,y_test,verbose=2)
print(f"Loss: {model_loss_opt}, Accuracy: {model_accuracy_opt}")

268/268 - 0s - 399us/step - accuracy: 0.7271 - loss: 0.5547
Loss: 0.5546735525131226, Accuracy: 0.7271137237548828


In [703]:
model_opt.summary()

In [706]:
# Define the deep learning model 
nn_model_opt = tf.keras.models.Sequential()
nn_model_opt.add(tf.keras.layers.Dense(units=18, activation="tanh"))
nn_model_opt.add(tf.keras.layers.Dense(units=26, activation="tanh"))
nn_model_opt.add(tf.keras.layers.Dense(units=20, activation="tanh"))
nn_model_opt.add(tf.keras.layers.Dense(units=4, activation="tanh"))
nn_model_opt.add(tf.keras.layers.Dense(units=16, activation="tanh"))
nn_model_opt.add(tf.keras.layers.Dense(units=20, activation="tanh"))
nn_model_opt.add(tf.keras.layers.Dense(units=24, activation="tanh"))
nn_model_opt.add(tf.keras.layers.Dense(units=4, activation="tanh"))
nn_model_opt.add(tf.keras.layers.Dense(units=18, activation="tanh"))
nn_model_opt.add(tf.keras.layers.Dense(units=16, activation="tanh"))
nn_model_opt.add(tf.keras.layers.Dense(units=18, activation="tanh"))
nn_model_opt.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))


In [708]:
# Compile the Sequential model together and customize metrics
import keras
opt = keras.optimizers.Adam()
nn_model_opt.compile(loss="binary_crossentropy",optimizer=opt, metrics=["accuracy"])

# Train the model
fit_model = nn_model_opt.fit(X_train_scaled, y_train, epochs=50)

Epoch 1/50
804/804 ━━━━━━━━━━━━━━━━━━━━ 2s 581us/step - accuracy: 0.6992 - loss: 0.5976
Epoch 2/50
804/804 ━━━━━━━━━━━━━━━━━━━━ 0s 540us/step - accuracy: 0.7283 - loss: 0.5660
Epoch 3/50
804/804 ━━━━━━━━━━━━━━━━━━━━ 0s 541us/step - accuracy: 0.7312 - loss: 0.5562
Epoch 4/50
804/804 ━━━━━━━━━━━━━━━━━━━━ 0s 541us/step - accuracy: 0.7340 - loss: 0.5511
Epoch 5/50
804/804 ━━━━━━━━━━━━━━━━━━━━ 0s 543us/step - accuracy: 0.7265 - loss: 0.5551
Epoch 6/50
804/804 ━━━━━━━━━━━━━━━━━━━━ 0s 546us/step - accuracy: 0.7347 - loss: 0.5477
Epoch 7/50
804/804 ━━━━━━━━━━━━━━━━━━━━ 0s 542us/step - accuracy: 0.7353 - loss: 0.5484
Epoch 8/50
804/804 ━━━━━━━━━━━━━━━━━━━━ 0s 538us/step - accuracy: 0.7326 - loss: 0.5469
Epoch 9/50
804/804

In [710]:
# Evaluate the model using the test data
model_loss_opt, model_accuracy_opt = nn_model_opt.evaluate(X_test_scaled,y_test,verbose=2)

268/268 - 0s - 664us/step - accuracy: 0.7285 - loss: 0.5553


In [712]:
print(f"Loss: {model_loss_opt}, Accuracy: {model_accuracy_opt}")

Loss: 0.5553467273712158, Accuracy: 0.7285131216049194


In [714]:
nn_model_opt.summary()

In [716]:
# Save the model on local directory
nn_model_opt.save('./AlphabetSoupCharity_optimnized.keras')

In [719]:
test_restored_model = tf.keras.models.load_model('./AlphabetSoupCharity_optimnized.keras')

In [721]:
test_restored_model.summary()

In [723]:
loss, acc = test_restored_model.evaluate(X_test_scaled,y_test, verbose=2)
print('Restored model, accuracy: {:5.2f}%'.format(100 * acc))

268/268 - 0s - 843us/step - accuracy: 0.7285 - loss: 0.5553
Restored model, accuracy: 72.85%
