
In [None]:
import requests
def download_file(url, filename):
    response = requests.get(url)
    if response.status_code == 200:
        with open(filename, 'wb') as file:
            file.write(response.content)
        print(f"File successfully downloaded as {filename}")
    else:
        print("Failed to retrieve the file. Status code:", response.status_code)

In [None]:
url = "https://raw.githubusercontent.com/CreekCS/Lab_3_Titanic/refs/heads/main/Titanic_data.csv"
filename = "Titanic_data.csv"
download_file(url, filename)

In [None]:
"""
################################################################################################################
Lab #3 - The Titanic: Probability of Surviving
The point of this lab is to show how the Lab #2 techniques can be used on a more complex linear model.  We will
insert some activation functions into the model to make it compute a probability.  (Therefore this model will
output a probability, as opposed to Lab #2 which output a real value - Brodie's weight)  This lab...
    -uses TensorFlow and Keras to implement a model with 2 hidden layers, having 7 input Features and 120
     Weights (i.e. a 121-D model of cost),
    -introduces the idea of splitting the data into Training and Testing datasets,
    -shows how to manipulate values within the input Python arrays so as to build appropriate Training and Test
     datasets,
    -introduces data scaling using the StandardScaler to scale the individual features of the Training dataset,
    -employs the non-linear RELU and Sigmoid activation functions, and
    -introduces a Confusion Matrix to analyze the results.

Needed files:  Titanic_data.csv
# Author: R. Bourquard - Dec 2020
################################################################################################################
"""

from __future__ import absolute_import, division, print_function, unicode_literals
import numpy as np
import tensorflow as tf
from tensorflow import keras
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split


In [None]:
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# DISCUSSION #1:  The input file
# -----------------------------------------------------------------------------------------------------------------
# In the early morning hours of April 15, 1912, the ocean liner Titanic sank with great loss of life.  Ironically,
# it was the "unsinkable" ship's maiden voyage.  There were approximately 2,200 passengers and crew aboard.  Shortly
# before midnight, the ship struck an iceberg.  It sank less than 3 hours later, resulting in the deaths of
# approximately 1,500 people.
#
# Our objective is to build a model (algorithm and weights) that will allow us to input a passenger's data (the
# Features) and be able to predict whether that passenger survived or perished.  In other words, we need to be able
# to find and appropriately weight those Features which were important to survival.  Were your survival odds improved
# if you were in 1st class?  If you were female?  If you were young?  If you were alone?
#
# The input csv file, 'Titanic_data.csv', contains information about all the passengers.  (The file's dataset was
# built by Khashayar Baghizadeh Hosseini, and is available at https://www.kaggle.com/heptapod/titanic.)
# There were approximately 1,310 passengers on the ship, and this file includes as many rows; one row for each
# passenger.
#
# The file contains 9 columns of data for each passenger.  Each column is a Feature.  Each row is a Training Example.
# Columns in csv files are numbered starting with "1".  The numbers in brackets [] are column numbers as stored in a
# python array (starting with "0").
# Column 1 [0]  - An arbitrary Passenger ID (1:1310).
# Column 2 [1] -  The age in years of the passenger.
# Column 3 [2] -  The fare paid in dollars.  (These fares seem incredibly cheap because of the rampant inflation
#                 that has occurred since 1912.)
# Column 4 [3] -  The passenger's sex:  0=male, 1=female
# Column 5 [4] -  SibSp:  The total number of siblings plus spouse who were accompanying the passenger.
# Column 6 [5] -  ParCh:  The total number of parents or children who were accompanying the passenger.
# Column 7 [6] -  Passenger Class:  1=1st Class, 2=2nd Class, 3=3rd Class
# Column 8 [7] -  Embarkation point:  0=Cherbourg, 1=Queenstown, 2=Southampton
# Column 9 [8] -  Survived:  1=survived, 0=perished (This is the Ground Truth value)
#
# Column 1, the passenger ID, is of no value in this problem.  And Column 9, the survived flag, is the answer (or
# Ground Truth).  So that leaves 7 columns of interest to be used as our input Features.
# We could guess which of these features might be most important to survival, but it would be difficult/impossible
# to guess a useful relative weight to all 7.  A deep neural network can do this for us.  Once it derives the
# weights we can input a new passenger's 7 Features and use them to calculate a prediction of the probability of
# survival.
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# READ THE INPUT DATA
input_filename = filename
input_data = np.loadtxt(input_filename, dtype='float32', delimiter=",", skiprows=1)
print('input_data:', input_data.shape)
# PRINT SOME OF THE INPUT DATA EXAMPLES
print('    ID     Age    Fare    Sex     SibSp     ParCh     Class   Embark   GTSurvive')
for i in range(0, 19):
    print('{:6.0f}'.format(input_data[i,0]), '  ',
          '{:4.0f}'.format(input_data[i,1]), '  ',
          '${:3.0f}'.format(input_data[i,2]), ' ',
          '{:3.0f}'.format(input_data[i,3]), '    ',
          '{:3.0f}'.format(input_data[i, 4]), '     ',
          '{:3.0f}'.format(input_data[i, 5]), '     ',
          '{:3.0f}'.format(input_data[i, 6]), '    ',
          '{:3.0f}'.format(input_data[i, 7]), '     ',
          '{}'.format(input_data[i, 8]==1) )
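
The csv-column versus Python-column numbering described above can be sketched on a tiny stand-in file. This is only an illustration: the header names and the three rows below are made up (they are not taken from Titanic_data.csv), and `col_index` is a hypothetical helper for looking up a column by name.

```python
import io
import numpy as np

# Hypothetical 3-passenger file with the same 9-column layout (all values are made up)
sample_text = (
    "ID,Age,Fare,Sex,SibSp,ParCh,Class,Embark,Survived\n"
    "1,22,7,0,1,0,3,2,0\n"
    "2,38,71,1,1,0,1,0,1\n"
    "3,26,8,1,0,0,3,2,1\n"
)

# Map each header name to its python column number (starting with 0)
header = sample_text.splitlines()[0].split(",")
col_index = {name: i for i, name in enumerate(header)}

# Same loadtxt call as the lab, reading from memory instead of from a file
sample = np.loadtxt(io.StringIO(sample_text), dtype='float32', delimiter=",", skiprows=1)

print(sample.shape)                                # (3, 9)
print(col_index["Age"], col_index["Survived"])     # 1 8
print(sample[:, col_index["Age"]])                 # the Age column for all 3 rows
```

Looking columns up by name is optional; the lab itself uses the bare column numbers.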


In [None]:

# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# DISCUSSION #2:  Split the input data into a Training dataset and a Testing dataset
# -----------------------------------------------------------------------------------------------------------------
# This model (algorithm and weights) is far more complex than the previous Brodie Weight model.  Once we have
# determined the weights, there will be no way we can plot an imaginary boundary in 7-D Feature-space between those
# passengers who survived and those who perished to visually assess how well the model works.  Instead, we will
# assess how well the model works by simply using part of the initial data we were given.
#
# To accomplish this, we will divide the data into two groups: a "Training Examples" dataset, and a "Testing
# Examples" dataset.  The Training Examples dataset will be used to train our model's weights by having it try to
# predict the survival flags (Ground Truth Values) contained therein.  After running many Epochs, the "final" model
# should be able to predict the actual Ground Truth outcomes fairly well.  We will eventually run statistics on
# the "final" model to measure whether it was a good choice.
#
# Once we have a "final" model, the Testing Examples dataset will be input into the "final" model and it will
# predict whether the Testing Examples' passengers survived.  We already know whether these passengers survived
# since we have their Ground Truth values, so if our model is really good, its predictions will match what actually
# happened.  Since our model has never "seen" these Testing Examples before, it's statistically fair to compare
# the predicted outcomes to the actual Ground Truth outcomes to evaluate how good our model will be on new data.
#
# This code splits the input data into a Training Examples dataset and a Testing Examples dataset.  80% of the rows
# (training_split=0.80) will be randomly selected and put in the Training Examples dataset; the remainder will be put
# in the Testing Examples dataset.  Specifying the 'seed' causes the random generator to split the data identically
# each time this program is run, so comparisons can be made between runs.
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# SPLIT THE INPUT DATA INTO A TRAINING DATASET AND A TESTING DATASET
training_split = 0.80
seed = 42
train_data, test_data = train_test_split(input_data, train_size=training_split, random_state=seed)
print('train_data:', train_data.shape)
print('test_data:', test_data.shape)
print()
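
As a quick check of the claim that a fixed seed gives an identical split on every run, here is a minimal sketch on made-up data. The array `toy_data` is an assumption standing in for `input_data`; only `train_test_split` and its `train_size` and `random_state` arguments come from the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up rows with 9 columns each, standing in for the passenger data
toy_data = np.arange(100 * 9, dtype='float32').reshape(100, 9)

# Split twice with the same seed
train_a, test_a = train_test_split(toy_data, train_size=0.80, random_state=42)
train_b, test_b = train_test_split(toy_data, train_size=0.80, random_state=42)

print(train_a.shape, test_a.shape)          # (80, 9) (20, 9)
print(np.array_equal(train_a, train_b))     # True: same seed, same rows in the same order

# Every row lands in exactly one of the two datasets
train_ids = set(train_a[:, 0].tolist())
test_ids = set(test_a[:, 0].tolist())
print(len(train_ids & test_ids))            # 0 rows shared
```

Changing `random_state` would shuffle the rows differently, which is why the lab keeps it fixed when comparing runs.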





In [None]:

# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# DISCUSSION #3:  Split-off the Features and the Ground Truth from the Training and Testing datafiles
# -----------------------------------------------------------------------------------------------------------------
# There are 9 values, in csv columns 1:9 [python columns 0:8].  However, column 1 [0] is the Passenger ID, which
# is (by my intuition) of no value in deciding who survived, so it is skipped.  Column 9 [8] is the 'survived'
# flag, which is our Ground Truth.  So our input Features will be columns 2:8 [1:7].  This code simply splits the
# Examples into a 2-D Feature matrix (rows=Examples, columns=the 7 Features), and the Ground Truth into a matching
# 1-D array (aka a vector).
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# SEPARATE-OUT THE FEATURES AND THE GROUND TRUTH
nFeatures = 7
ground_truth_col = 8
# for the training data
train_X = train_data[:,1:nFeatures+1]   # The 7 features are in csv columns 2-8 [1:7]
train_truth = train_data[:,ground_truth_col]   # The 'survived' flag is in csv column 9 [8]
print('train_truth shape[0]', train_truth.shape[0])
# for the test data
test_X = test_data[:,1:nFeatures+1]   # The 7 features are in csv columns 2-8 [1:7]
test_truth = test_data[:,ground_truth_col]   # The 'survived' flag is in column 9 [8]
print('train_X', train_X.shape)
print('train_truth', train_truth.shape)
print('test_X', test_X.shape)
print('test_truth', test_truth.shape)
print()
n_test_examples = test_X.shape[0]
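
The slice `[:, 1:nFeatures+1]` is easy to misread, because Python excludes the end of a slice. A small sketch on two made-up rows (the values are hypothetical) shows which columns end up where:

```python
import numpy as np

n_features = 7
ground_truth_col = 8

# Two made-up passengers:  ID, Age, Fare, Sex, SibSp, ParCh, Class, Embark, Survived
toy_rows = np.array([[1, 22, 7, 0, 1, 0, 3, 2, 0],
                     [2, 38, 71, 1, 1, 0, 1, 0, 1]], dtype='float32')

# 1:n_features+1 is 1:8, which selects python columns 1 through 7 (column 8 is excluded)
toy_X = toy_rows[:, 1:n_features + 1]
toy_truth = toy_rows[:, ground_truth_col]

print(toy_X.shape)       # (2, 7)  -> a 2-D Feature matrix
print(toy_truth.shape)   # (2,)    -> a 1-D Ground Truth vector
print(toy_X[0])          # the first passenger's 7 Features, without the ID or the survived flag
```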



In [None]:

# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# DISCUSSION #4:  Scale the values in the datasets
# -----------------------------------------------------------------------------------------------------------------
# As with the Brodie Weights model, there is a great difference in magnitude between the values of the various
# Features.  A typical way to solve this is to individually scale the input Features, so they are all
# standardized (centered on zero with a standard deviation of 1).  (This means that after scaling, most values
# will lie within a few units of zero; for bell-shaped data, roughly two-thirds fall between -1 and +1.)  The
# following code does this.  Each Feature column is scaled individually across all the Training Example rows.
#
# Note that we compute a scaling object (scaler_obj), which contains the derived scale factors.  It is calculated
# from just the Training Features.  Once we have it, it will be applied to both the Training Features and the Test
# Features, since they both must be scaled identically.
#
# (Typically, after this is done, the scaling object is output and saved in a file for later use, since it must be
# applied to the Features of every subsequent input to the model.  In this program, we don't save the scaling
# object because there are no additional Examples to be run in some later program.)
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# SCALE EACH FEATURE TO BE CENTERED ON ZERO, WITH A STANDARD DEVIATION OF 1
scaler_obj = preprocessing.StandardScaler().fit(train_X)   # scaler_obj will scale each Feature (column) independently
train_X_scaled = scaler_obj.transform(train_X)  # scale each Training Feature (column)
test_X_scaled = scaler_obj.transform(test_X)  # scale each Test Feature (column)


print('train_X', train_X_scaled.shape)
print('train_truth', train_truth.shape)
print('test_X', test_X_scaled.shape)
print('test_truth', test_truth.shape)
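
The scaling steps above can be sketched on made-up data. The two toy columns below are assumptions (an age-like and a fare-like Feature drawn at random), and using `pickle` is just one possible way to save the scaling object; the lab itself does not save it.

```python
import pickle
import numpy as np
from sklearn import preprocessing

rng = np.random.default_rng(0)
# Made-up Features on very different scales (an age-like column and a fare-like column)
toy_train = rng.normal(loc=[30.0, 100.0], scale=[10.0, 50.0], size=(200, 2))
toy_test = rng.normal(loc=[30.0, 100.0], scale=[10.0, 50.0], size=(50, 2))

# The scale factors are computed from the Training rows only...
toy_scaler = preprocessing.StandardScaler().fit(toy_train)
# ...and then applied identically to both datasets
toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_train_scaled.mean(axis=0))   # approximately 0 for each column
print(toy_train_scaled.std(axis=0))    # approximately 1 for each column

# Round-trip the fitted scaler through pickle, as it might be saved for a later program
restored = pickle.loads(pickle.dumps(toy_scaler))
print(np.allclose(restored.transform(toy_test), toy_test_scaled))   # True
```

The Test columns will not come out at exactly mean 0 and standard deviation 1, because their scale factors were borrowed from the Training rows. That is intended.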



In [None]:

# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# DISCUSSION #5:  The DNN Model
# -----------------------------------------------------------------------------------------------------------------
# Now we are ready to build our model.  We use Keras to define the layers:
#  - There is one Input Layer, which simply receives the 7 Features.
#  - There are 2 Hidden (Activation) Layers; both of which use ReLU activation functions.
#  - And 1 Output Layer, which uses a single Sigmoid activation function, since we want the model to predict just
#    one outcome = the probability of survival.
# You can look up the ReLU and Sigmoid activation functions on your own, to see how they modify their input data.
# The specified optimizer is 'adam', which is basically gradient descent modified by a momentum factor and by
# learning rates that adapt separately for each weight.
# The optimizer will minimize the loss, which will be calculated as 'binary_crossentropy' because we want only
# a single probability value representing the 2 (binary) outcomes:  survived or perished.
# The screen print shows the model has 120 weights and biases ("Trainable params") to be trained.
#
#
# AI MODELING NOTE:  Why are there 120 weights?
# -- FOR THE INPUT LAYER:
#    There are 7 input features composing the Input Layer.
# -- FOR HIDDEN LAYER 1:
#    Because we are using 'Dense' connections, each of the 7 inputs connects to every neuron in Hidden Layer 1.
#    There are 7 neurons in Hidden Layer 1. This means that each of the 7 input features will connect to all
#    7 Hidden Layer 1 neurons. Since each connection has an independent weight, there will be 7*7 = 49 weights.
#    In addition, each neuron in Hidden Layer 1 has a bias value, so there are also 7 bias values. So the total
#    for Hidden Layer 1 will be 49 + 7 = 56 weights and biases.
# -- FOR HIDDEN LAYER 2:
#    Because we are using 'Dense' connections, each of the 7 Hidden Layer 1 neurons connects to every neuron in
#    Hidden Layer 2. Since there are 7 neurons in Hidden Layer 2, there will be 7*7 = 49 weights. In addition,
#    each neuron in Hidden Layer 2 also has a bias value, so there are 7 bias values. So the total for Hidden
#    Layer 2 will be 49 + 7 = 56 weights and biases.
# -- FOR THE OUTPUT LAYER:
#    Because we are using 'Dense' connections, each of the 7 Hidden Layer 2 neurons connects to the single Output
#    Layer neuron. This means that there will be 7 more weights. In addition, the output neuron has a bias value,
#    so the total for the Output Layer will be 7 + 1 = 8 weights and biases.
# Therefore the total number of weights and biases is 56 + 56 + 8 = 120 weights and biases. The weights and biases
# are called "Trainable params" in the output below.
# ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

# BUILD THE TENSORFLOW MODEL
model = keras.models.Sequential()
model.add(keras.layers.InputLayer(shape=[nFeatures,]))        # Input Layer
model.add(keras.layers.Dense(nFeatures, activation='relu'))   # Hidden Layer 1
model.add(keras.layers.Dense(nFeatures, activation='relu'))   # Hidden Layer 2
model.add(keras.layers.Dense(1, activation='sigmoid'))        # Output Layer
model.summary()
model.compile(loss="binary_crossentropy",optimizer='adam',metrics=["accuracy"])
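
The weight count and the two activation functions can be checked by hand with plain NumPy. This is a rough sketch, not how Keras works internally: `count_dense_params` is a hypothetical helper, and the forward pass uses random, untrained weights on made-up inputs purely to show the shapes and the output range.

```python
import numpy as np

def relu(z):
    """ReLU: negative values become 0, positive values pass through unchanged."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Sigmoid: squashes any real value into the range 0 to 1."""
    return 1.0 / (1.0 + np.exp(-z))

def count_dense_params(layer_sizes):
    """Weights plus biases for Dense layers connecting the given layer sizes."""
    pairs = zip(layer_sizes[:-1], layer_sizes[1:])
    return sum(n_in * n_out + n_out for n_in, n_out in pairs)

layer_sizes = [7, 7, 7, 1]                  # inputs, Hidden Layer 1, Hidden Layer 2, output
print(count_dense_params(layer_sizes))      # 120  (56 + 56 + 8)

# Forward pass for 5 made-up, already-scaled passengers through random (untrained) weights
rng = np.random.default_rng(0)
activations = rng.normal(size=(5, 7))
n_layers = len(layer_sizes) - 1
for k in range(n_layers):
    W = 0.5 * rng.normal(size=(layer_sizes[k], layer_sizes[k + 1]))   # one weight per connection
    b = np.zeros(layer_sizes[k + 1])                                  # one bias per neuron
    z = activations @ W + b
    # the hidden layers use ReLU, the final layer uses Sigmoid
    activations = sigmoid(z) if k == n_layers - 1 else relu(z)

print(activations.shape)    # (5, 1): one value between 0 and 1 per passenger
```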




In [None]:

# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# DISCUSSION #6:  Run the model
# -----------------------------------------------------------------------------------------------------------------
# The TensorFlow 'fit' method is used to compute the best weights for the Training Examples.  The 120 weights
# are trained over 10 epochs.  Note that the sigmoid activation function outputs a probability of survival (between
# 0 and 1).  We will consider anything >= 0.5 as a prediction of survival.  The comparison of the Training Examples'
# predictions to their Ground Truth values is stored in the 'history' object for plotting.
#
# The Test Examples are also input to the fit method and the predictions from our evolving model (derived from the
# Training Examples) are compared against the matching Test Ground Truth values.  This gives a validation
# measure of how the model will do when data it has never seen before are input.  The comparison of the Test
# Examples' predictions to their Ground Truth values is also stored in the 'history' object for plotting.
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# FIND BEST VALUES FOR THE 120 WEIGHTS
nEpochs = 10
history = model.fit(train_X_scaled, train_truth, batch_size=1200, epochs=nEpochs, validation_data=(test_X_scaled, test_truth))
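
One detail worth noticing in the call above is how `batch_size` interacts with the number of Training Examples: the weights are updated once per batch, so an epoch contains ceil(examples / batch_size) updates. The sketch below assumes about 1,047 Training Examples (80% of roughly 1,310 rows); the real count is `train_X.shape[0]`, and `updates_per_epoch` is a hypothetical helper.

```python
import math

def updates_per_epoch(n_examples, batch_size):
    """Number of weight updates in one epoch (the last batch may be smaller)."""
    return math.ceil(n_examples / batch_size)

n_train_assumed = 1047   # assumption: about 80% of roughly 1,310 rows

print(updates_per_epoch(n_train_assumed, 1200))   # 1:  the whole Training dataset fits in one batch
print(updates_per_epoch(n_train_assumed, 32))     # 33: many smaller updates per epoch
```

Under that assumption, 10 epochs with batch_size=1200 means only 10 weight updates in total. A smaller batch size or more epochs would give the optimizer more steps.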




In [None]:

# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# DISCUSSION #7:  Print/plot information about its accuracy
# -----------------------------------------------------------------------------------------------------------------
# To get a measure of the accuracy of the new model, the 'evaluate' method can be run on both the Training
# Examples and their matching Ground Truth values, and on the Test Examples and their matching Ground Truth
# values.  For each set of examples, the 'loss' shows the Cost: a measure of how far the predicted probabilities
# were from the Ground Truth values, so confident wrong predictions cost the most.  Lower Cost is better.  The
# 'accuracy' shows the fraction of examples whose prediction (survived or perished) matched the Ground Truth
# value.  Higher accuracy is better.
#
# For this model's Training data, after 10 epochs the cost was 0.44 (arbitrary units), and the prediction
# accuracy was about 79%.  This means given a Training Example passenger's Features, we can predict that
# passenger's survival with about 79% accuracy.  This gives us a measure of whether our model is appropriate for
# the task at hand.
# For this model's Test Data (USING THE WEIGHTS DERIVED FROM THE TRAINING EXAMPLES) the cost was 0.49, and
# the prediction accuracy was about 77%.  This means given any passenger's Features, we can predict that
# passenger's survival with about a 77% accuracy.  This gives us a measure of whether our model will work well
# on entirely new data.
#
# We would expect the Training Examples' accuracy and Cost to be better than the Test Examples' accuracy and
# Cost, since the model weights were derived from the Training Examples!  Regardless, the accuracy and Cost
# for the Test Examples should be similar to those of the Training Examples, if the model we derived is any good.
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# PRINT AND PLOT STATISTICS
score,accuracy = model.evaluate(train_X_scaled, train_truth, batch_size=16, verbose=0)
print("Train score (cost)       = {:.2f}".format(score))
print("Train accuracy (accuracy)= {:.2f}".format(accuracy))
score,accuracy = model.evaluate(test_X_scaled, test_truth, batch_size=16, verbose=0)
print("Test score (val_cost)    = {:.2f}".format(score))
print("Test accuracy (val_accuracy)= {:.2f}".format(accuracy))


# PLOT COST AND ACCURACY
pd.DataFrame(history.history).plot(figsize=(8,5))
plt.grid(True)
plt.gca().set_ylim(0,1)
plt.suptitle("Cost and Accuracy for " + str(nEpochs) + " epochs")
plt.title("'loss' = Cost      'val' = test scores")
plt.show()
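
To see why Cost and accuracy are reported separately, here is a minimal NumPy sketch of both measures on four made-up passengers. The functions are hand-written approximations of what 'binary_crossentropy' and 'accuracy' measure, not the Keras implementations, and the probabilities are invented for illustration.

```python
import numpy as np

def binary_crossentropy(truth, prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(np.mean(-(truth * np.log(prob) + (1.0 - truth) * np.log(1.0 - prob))))

def accuracy_at_threshold(truth, prob, threshold=0.5):
    """Fraction of examples where the thresholded prediction matches the Ground Truth."""
    return float(np.mean((prob >= threshold) == (truth == 1)))

toy_truth = np.array([1.0, 0.0, 1.0, 0.0])
confident = np.array([0.9, 0.1, 0.8, 0.2])   # right, and sure about it
hesitant = np.array([0.6, 0.4, 0.6, 0.4])    # right, but only barely

for name, prob in (("confident", confident), ("hesitant", hesitant)):
    print(name,
          "cost = {:.2f}".format(binary_crossentropy(toy_truth, prob)),
          "accuracy = {:.2f}".format(accuracy_at_threshold(toy_truth, prob)))
```

Both sets of predictions are 100% accurate, yet the hesitant one has a noticeably higher Cost, because the Cost looks at the probabilities themselves rather than only at which side of 0.5 they fall on.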




In [None]:

# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# DISCUSSION #8:  Print a Confusion Matrix on the Test Examples
# -----------------------------------------------------------------------------------------------------------------
# One way to see the relationship between cost and accuracy is to compute a Confusion Matrix.  It shows all the
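# Testing Examples' outcomes collected into a square of 4 groupings for quick comparison.  In the table printed
# below, each row is a prediction and each column is an actual outcome.  (scikit-learn's confusion_matrix returns
# the transpose of this layout, with one row per actual outcome.)
#
#                                   actually perished    actually survived
#          predicted perished:      true negatives       false negatives
#          predicted survived:      false positives      true positives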
#
# The Confusion Matrix is run on the Test Examples because those are the best indicators of how accurate the
# model is.
#
# Ideally, the "true" diagonal of 'true negatives' and 'true positives' (in this case, correct 'perished'
# and correct 'survived' predictions) should be far greater than the "false" diagonal of 'false positives' and
# 'false negatives' (in this case, incorrect 'survived' and incorrect 'perished' predictions).
#
# For this Titanic lab, the biggest concern is probably the number of "false positives", because these are
# people the DNN predicted to survive, who actually perished!  The consequence of being wrong was fatal!  For a
# useful model, you would want to minimize these outcomes.  For other situations, however, the number of "false
# negatives" might be more important.  For example, if the DNN was suggesting surgery, a "false negative"
# would mean a needed surgery was NOT performed.  You might wish to minimize this category instead.
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# PRINT A CONFUSION MATRIX FOR THE TEST EXAMPLES
probabilities = model.predict(test_X_scaled)   # the probability that the passengers survived
# convert the probabilities to predictions (0 or 1) for comparison with the ground truth
min_for_true = 0.5   # if the probability is >= 0.5, then assume the passenger survived
# predict() returns a single column of probabilities; flatten it to 1-D and apply the threshold directly
predictions = (probabilities.ravel() >= min_for_true).astype(int)
((n_true_negatives, n_false_positives), (n_false_negatives, n_true_positives)) \
      = confusion_matrix(test_truth, predictions)

print()
print()
print('Confusion Matrix for Predicted Survival.  Passenger counts for', n_test_examples, 'Test Examples')
print()
print('     Correctly Predicted will Perish:          ', n_true_negatives, ' | ',
      n_false_negatives, ' :Predicted Perished, but actually Survived')
print('                              ------------------------------------------------')
print('     Predicted Survive, but actually Perished:  ', n_false_positives, ' | ',
      n_true_positives, ' :Correctly Predicted will Survive')
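
The four groupings can be verified by hand on a small made-up example. The eight Ground Truth values and probabilities below are invented for illustration; `confusion_matrix` is the same scikit-learn function used above, and calling `ravel()` on its 2x2 result yields the counts in the order true negatives, false positives, false negatives, true positives.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Eight made-up passengers: 1 = survived, 0 = perished
toy_truth = np.array([0, 0, 0, 0, 1, 1, 1, 0])
toy_prob = np.array([0.1, 0.4, 0.7, 0.2, 0.9, 0.3, 0.6, 0.5])

# Threshold the probabilities: >= 0.5 counts as a prediction of survival
toy_pred = (toy_prob >= 0.5).astype(int)

# scikit-learn layout: rows = actual outcome, columns = predicted outcome
tn, fp, fn, tp = confusion_matrix(toy_truth, toy_pred).ravel()
print(tn, fp, fn, tp)                 # 3 2 1 2

toy_accuracy = (tn + tp) / (tn + fp + fn + tp)
print(toy_accuracy)                   # 0.625
```

Here the two false positives are the passengers with probabilities 0.7 and 0.5 who actually perished, and the one false negative is the survivor who was given a probability of 0.3.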
print()




In [None]:

# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# DISCUSSION #9:  Print some examples
# -----------------------------------------------------------------------------------------------------------------
# Now that we have derived a model and weights, we can use it to predict the outcomes of new passengers.  However,
# there are no new passengers, so here I've simply predicted the outcomes of the first 20 Test Examples.  The table
# shows the passenger data, the resulting prediction, and the actual Ground Truth outcome for comparison.
# The predictions should be correct about 77% of the time.
#
# Note that the model and its weights are relatively opaque.  It can predict who will survive, but it's difficult
# to know why!  This is a simple model with only 120 weights.  Yet it would be nearly impossible to translate those
# weights into an understanding of the relative importance of each input Feature.  Therefore, the model is rather
# like a black box.  Sadly, this opacity is typical of AI models.
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# PRINT SOME OF THE INPUT DATA EXAMPLES
print()
print()
print('Some Passenger Data, with Predictions and Actual Outcomes...')
print(' Age    Fare    Sex     SibSp     ParCh     Class   Embark   [Pred]Survived    [Actual]Survived')
for i in range(0, 20):
    print('{:4.0f}'.format(test_X[i,0]), '  ',
          '${:3.0f}'.format(test_X[i,1]), ' ',
          '{:3.0f}'.format(test_X[i,2]), '    ',
          '{:3.0f}'.format(test_X[i, 3]), '     ',
          '{:3.0f}'.format(test_X[i, 4]), '     ',
          '{:3.0f}'.format(test_X[i, 5]), '    ',
          '{:3.0f}'.format(test_X[i, 6]), '      ',
          '{}'.format(predictions[i]==1), '         ',
          '{}'.format(test_truth[i]==1)
          )
