# Titanic Survivor Prediction using Weka 

This notebook requires a Java kernel. It uses the Weka library to train, test, and verify a survivor-prediction model on the Titanic dataset.

NOTE:

I used IJava to create the Java kernel. IJava requires JDK 9 or higher because it uses JShell internally.

IJava - Jupyter kernel for executing Java code
https://github.com/SpencerPark/IJava

Install IJava using the archive from https://github.com/SpencerPark/IJava/releases/download/v1.1.2/ijava-1.1.2.zip

In [1]:
%%loadFromPOM
<!-- Download the weka library from maven repository -->

<!-- https://mvnrepository.com/artifact/nz.ac.waikato.cms.weka/weka-stable -->
<dependency>
	<groupId>nz.ac.waikato.cms.weka</groupId>
	<artifactId>weka-stable</artifactId>
	<version>3.6.13</version>
</dependency>


In [2]:
/*
 * Import the required classes into the notebook.
 */

import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.List;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.evaluation.EvaluationUtils;
import weka.classifiers.trees.BFTree;
import weka.classifiers.trees.RandomForest;
import weka.core.Attribute;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ArffLoader;
import weka.core.converters.CSVLoader;
import weka.core.converters.CSVSaver;
import weka.core.converters.Loader;

In [3]:
/*
 * Load the training dataset first. A CSV file carries no type
 * information (unlike Weka's native ARFF format), so tell the loader
 * which columns are string and which are nominal.
 * 
 * The following are the columns in train and test datasets.
 * @attribute survived {0,1}
 * @attribute pclass numeric
 * @attribute name string
 * @attribute sex {male,female}
 * @attribute age numeric
 * @attribute sibsp numeric
 * @attribute parch numeric
 * @attribute ticket string
 * @attribute fare numeric
 * @attribute cabin string
 * @attribute embarked {Q,S,C}
 * 
 * Columns 1, 4 and 11 are nominal, columns 3, 8 and 10 are strings,
 * and the remaining columns are numeric. Set the types for the string
 * and nominal columns; the others are read as numeric.
 */
CSVLoader trainCsvLoader = new CSVLoader();
trainCsvLoader.setSource(new File("survivor_predict/train.csv"));
trainCsvLoader.setStringAttributes("3,8,10");
trainCsvLoader.setNominalAttributes("1,4,11");
Instances trainDataSet = trainCsvLoader.getDataSet();

[GenericPropertiesCreator] classloader in use is not the system classloader: using static entries in weka/gui/GenericObjectEditor.props rather than dynamic class discovery.
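
The index lists passed to `setStringAttributes` and `setNominalAttributes` are 1-based column positions. As a quick sanity check of which columns `"3,8,10"` and `"1,4,11"` select, here is a minimal plain-Java sketch that does not use Weka at all. The class `ColumnIndexSketch` and its methods are hypothetical helpers written for this illustration, and the sketch handles only comma-separated lists, not ranges.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Weka: resolves a 1-based index list such as
// "3,8,10" against the CSV header so the selected columns can be checked by eye.
class ColumnIndexSketch {

    // Column order as listed in the comment of the loading cell.
    static final List<String> HEADER = Arrays.asList(
        "survived", "pclass", "name", "sex", "age", "sibsp",
        "parch", "ticket", "fare", "cabin", "embarked");

    // Returns the names of the columns selected by a comma-separated, 1-based index list.
    static List<String> select(String indexList) {
        List<String> selected = new ArrayList<>();
        for (String token : indexList.split(",")) {
            int oneBased = Integer.parseInt(token.trim());
            selected.add(HEADER.get(oneBased - 1)); // convert to a 0-based list index
        }
        return selected;
    }

    // Returns the columns that remain once the listed ones are dropped.
    static List<String> remaining(String droppedIndexList) {
        List<String> rest = new ArrayList<>(HEADER);
        rest.removeAll(select(droppedIndexList));
        return rest;
    }
}
```

`ColumnIndexSketch.select("3,8,10")` yields `[name, ticket, cabin]` and `ColumnIndexSketch.select("1,4,11")` yields `[survived, sex, embarked]`. Dropping the three string columns leaves 8 of the 11 columns, which is the attribute count reported after `deleteStringAttributes()`.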


In [4]:
/*
 * Identify the label. Here, survived is the label
 * we want to predict.
 */
Attribute trainAttribute = trainDataSet.attribute(0);
trainDataSet.setClass(trainAttribute);

In [5]:
/*
 * The string attributes (name, ticket, cabin) do not contribute useful
 * features for prediction, and RandomForest cannot handle string
 * attributes. Remove them before training.
 */
trainDataSet.deleteStringAttributes();
trainDataSet.numAttributes();

8

In [6]:
/*
 * Create a RandomForest classifier and configure it.
 */
RandomForest classifier = new RandomForest();
classifier.setNumTrees(500);
//classifier.setDebug(true);

In [8]:
/*
 * Train the classifier and create the prediction model.
 */
classifier.buildClassifier(trainDataSet);

In [9]:
/*
 * Save the model for future prediction.
 */
SerializationHelper.write("survivor_predict/titanic_survivor_prediction.model", classifier);
System.out.println("Saved the trained model as titanic_survivor_prediction.model");

Saved the trained model as titanic_survivor_prediction.model


In [10]:
/*
 * Now load the test data to verify the model.
 * Set attribute types and remove string attributes.
 */
CSVLoader testCsvLoader = new CSVLoader();
testCsvLoader.setSource(new File("survivor_predict/test.csv"));

testCsvLoader.setStringAttributes("3,8,10");
testCsvLoader.setNominalAttributes("1,4,11");
Instances testDataSet = testCsvLoader.getDataSet();

testDataSet.deleteStringAttributes();
testDataSet.numAttributes();

8

In [11]:
/*
 * Load the test data again into a separate dataset that will hold the
 * predicted values. The string attributes are kept here.
 */
CSVLoader testCsvLoader = new CSVLoader();
testCsvLoader.setSource(new File("survivor_predict/test.csv"));

testCsvLoader.setStringAttributes("3,8,10");
testCsvLoader.setNominalAttributes("1,4,11");
Instances predictDataSet = testCsvLoader.getDataSet();

In [12]:
/*
 * Set the label for the test and predict datasets. The remaining
 * attributes are treated as features.
 */
Attribute testAttribute = testDataSet.attribute(0);
testDataSet.setClass(testAttribute);

Attribute predictAttribute = predictDataSet.attribute(0);
predictDataSet.setClass(predictAttribute);

predictDataSet.numAttributes();


11

In [13]:
/*
 * Read the serialized model from disk for verification.
 */
Classifier classifier = (Classifier) SerializationHelper
                            .read("survivor_predict/titanic_survivor_prediction.model");
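
Saving and reloading the model relies on standard Java object serialization, which, as far as I know, is what `SerializationHelper` wraps. The sketch below shows the same write-then-read round trip with only the JDK, using a temporary file. The class `SerializationSketch` and its method names are hypothetical and stand in for the Weka helper; they are not its implementation.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a model store, built on plain Java serialization.
class SerializationSketch {

    // Serializes any Serializable object to the given file.
    static void write(File file, Object model) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(model);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the object back; the caller casts it to the expected type.
    static Object read(File file) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Class of the stored object is not on the classpath", e);
        }
    }

    // Writes a value to a temporary file and reads it back.
    static Object roundTrip(Object value) {
        try {
            File tmp = File.createTempFile("model-sketch", ".bin");
            tmp.deleteOnExit();
            write(tmp, value);
            return read(tmp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The cast to `Classifier` in the cell above plays the same role as casting the result of `read` here: deserialization returns an `Object`, and the class of the stored model has to be available on the classpath when it is read back.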

In [14]:
/*
 * Iterate over the test data, classify each entry and set the
 * value of the 'survived' column of predict dataset with the result 
 * of the classification
 */
Enumeration testInstances = testDataSet.enumerateInstances();
Enumeration predictInstances = predictDataSet.enumerateInstances();
while (testInstances.hasMoreElements()) {
    Instance testInstance = (Instance) testInstances.nextElement();
    Instance predictInstance = (Instance) predictInstances.nextElement();
    
    double classification = classifier.classifyInstance(testInstance);
    predictInstance.setClassValue(classification);
}

In [15]:
/*
 * Write the predicted output dataset to disk. 
 */
CSVSaver predictedCsvSaver = new CSVSaver();
predictedCsvSaver.setFile(new File("survivor_predict/titanic_survivor_prediction.csv"));
predictedCsvSaver.setInstances(predictDataSet);
predictedCsvSaver.writeBatch();

System.out.println("Predicted output dataset written as survivor_predict/titanic_survivor_prediction.csv");


Predicted output dataset written as survivor_predict/titanic_survivor_prediction.csv


In [16]:
/*
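 * Evaluate the classifier on the predict dataset.
 * Delete the string attributes first so that its columns match the
 * training dataset.
 *
 * Caveat: the class values of predictDataSet were overwritten with the
 * model's own predictions in the classification loop above, so this
 * compares the model with itself. To measure accuracy, evaluate against
 * a dataset that still holds the true labels (testDataSet, if test.csv
 * contains them).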
 */
predictDataSet.deleteStringAttributes();

Evaluation evaluation = new Evaluation(trainDataSet);
evaluation.evaluateModel(classifier, predictDataSet, new Object[] {});

[D@6a2d79ac
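
Which labels the predictions are compared with determines what the evaluation means. The loop in In [14] replaced the `survived` values of `predictDataSet` with the classifier's output, so evaluating on that dataset scores the predictions against themselves. The toy sketch below makes the difference visible with made-up labels; the class `AccuracySketch` and the arrays are hypothetical and unrelated to the Titanic data.

```java
// Hypothetical helper: accuracy as the fraction of matching labels.
class AccuracySketch {

    // Fraction of positions where the predicted label equals the reference label.
    static double accuracy(int[] reference, int[] predicted) {
        if (reference.length != predicted.length || reference.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        int correct = 0;
        for (int i = 0; i < reference.length; i++) {
            if (reference[i] == predicted[i]) {
                correct++;
            }
        }
        return (double) correct / reference.length;
    }
}
```

With toy true labels `{0, 1, 1, 0}` and predictions `{0, 1, 0, 0}`, `accuracy(trueLabels, predictions)` is 0.75, while `accuracy(predictions, predictions)` is always 1.0 regardless of how good the model is.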

In [17]:
/*
 * Print the details of classifier and evaluation summary.
 */
System.out.println(classifier);
System.out.println(evaluation.toSummaryString());


Random forest of 500 trees, each constructed while considering 3 random features.
Out of bag error: 0.1796



Correctly Classified Instances         418              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1     
Mean absolute error                      0.1949
Root mean squared error                  0.239 
Relative absolute error                 41.9483 %
Root relative squared error             50.058  %
Total Number of Instances              418     
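
The summary reports a kappa of 1 together with a non-zero mean absolute error. A plausible reading, stated here as an assumption about how Weka computes these figures, is that kappa depends only on the predicted labels, while the error measures compare the predicted class probabilities with 0/1 indicators of the reference class. The sketch below implements both quantities on toy numbers; `SummaryMetricsSketch` is a hypothetical class, not Weka code.

```java
// Hypothetical re-implementation of two summary statistics on toy inputs.
class SummaryMetricsSketch {

    // Cohen's kappa from a square confusion matrix (rows = actual, columns = predicted).
    static double kappa(double[][] confusion) {
        int n = confusion.length;
        double total = 0;
        double agree = 0;
        double[] rowSum = new double[n];
        double[] colSum = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                rowSum[i] += confusion[i][j];
                colSum[j] += confusion[i][j];
                total += confusion[i][j];
            }
            agree += confusion[i][i];
        }
        double chance = 0;
        for (int i = 0; i < n; i++) {
            chance += rowSum[i] * colSum[i];
        }
        chance /= total * total;           // agreement expected by chance
        double observed = agree / total;   // observed agreement
        return chance < 1 ? (observed - chance) / (1 - chance) : 1.0;
    }

    // Mean absolute difference between predicted class probabilities and
    // 0/1 indicators of the actual class, averaged over classes and instances.
    static double meanAbsoluteError(double[][] distributions, int[] actual) {
        double sum = 0;
        for (int i = 0; i < actual.length; i++) {
            double perInstance = 0;
            for (int c = 0; c < distributions[i].length; c++) {
                double target = (c == actual[i]) ? 1.0 : 0.0;
                perInstance += Math.abs(distributions[i][c] - target);
            }
            sum += perInstance / distributions[i].length;
        }
        return sum / actual.length;
    }
}
```

For a purely diagonal toy matrix such as `{{3, 0}, {0, 2}}`, `kappa` returns 1. For two correctly classified toy instances with probabilities `{0.8, 0.2}` and `{0.3, 0.7}` and actual classes `0` and `1`, `meanAbsoluteError` is about 0.25, which shows how every label can match while the error stays above zero.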

