# A Brief Introduction to Spark in Python and Spark's Machine Learning Tools
In this notebook, we will train two classifiers to predict survivors of the Titanic. We will use this classic machine learning problem as a brief introduction to using Spark local mode in a notebook.

In [1]:
import pyspark
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.tree import DecisionTree

In [2]:
sc = pyspark.SparkContext()

## Sample the data
The result of `sc.textFile` is an RDD, not the content of the file. Like a Spark transformation, this step is lazy: nothing is read yet.
We then query the RDD for the number of lines in the file. This call causes the file to be read and the result to be computed. This is a Spark action.

In [3]:
raw_rdd = sc.textFile("data/COUNT/titanic.csv")
raw_rdd.count()

1317
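
The difference between a lazy step and an action can be illustrated without Spark. The following is a plain-Python analogy only, with a generator standing in for the RDD. The names `lazy_lines` and `reads` are hypothetical, and the small in-memory list stands in for the file.

```python
# Plain-Python analogy for lazy evaluation; this is not Spark code.
reads = []  # records each line at the moment it is actually produced

def lazy_lines(lines):
    """Yield lines one at a time, noting each read as it happens."""
    for line in lines:
        reads.append(line)
        yield line

sample = ['"","class","age","sex","survived"',
          '"1","1st class","adults","man","yes"',
          '"2","1st class","adults","man","yes"']

pending = lazy_lines(sample)          # like a transformation: nothing is read yet
reads_before = len(reads)             # still 0

line_count = sum(1 for _ in pending)  # like an action: forces the read
reads_after = len(reads)              # now 3
```

Creating `pending` does no work; only counting its elements triggers the reads, much as `count()` triggers the file read above.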

We query for the first five rows of the RDD. Even though the data is small, we shouldn't get into the habit of pulling the entire dataset into the notebook. Many datasets that we might want to work with using Spark will be much too large to fit in the memory of a single machine.

In [4]:
raw_rdd.take(5)

['"","class","age","sex","survived"',
 '"1","1st class","adults","man","yes"',
 '"2","1st class","adults","man","yes"',
 '"3","1st class","adults","man","yes"',
 '"4","1st class","adults","man","yes"']

We see a header row followed by a set of data rows. We filter out the header to define a new RDD containing only the data rows.

In [5]:
header = raw_rdd.first()
data_rdd = raw_rdd.filter(lambda line: line != header)

In [6]:
data_rdd.takeSample(False, 5,0)

['"159","1st class","adults","man","no"',
 '"256","1st class","adults","women","yes"',
 '"1204","3rd class","adults","women","no"',
 '"758","3rd class","adults","man","no"',
 '"730","3rd class","adults","man","no"']

We see that the first value in every row is a passenger number. The next three values are the passenger attributes we might use to predict passenger survival. The final value is the survival ground truth.

## Create labeled points
Now we define a function to turn the passenger attributes into structured LabeledPoint objects.

In [7]:
def raw_to_labeled_point(line):
    """
    Builds a LabeledPoint consisting of:

    survival (truth): 0=no, 1=yes
    ticket class: 0=1st class, 1=2nd class, 2=3rd class
    age group: 0=child, 1=adults
    gender: 0=man, 1=women
    """
    passenger_id, kclass, age, sex, survived = [segs.strip('"') for segs in line.split(',')]
    # '1st class' -> 0, '2nd class' -> 1, '3rd class' -> 2
    kclass = int(kclass[0]) - 1
    # Fail loudly on any value the encoding below does not cover
    if (age not in ['adults', 'child'] or
        sex not in ['man', 'women'] or
        survived not in ['yes', 'no']):
        raise RuntimeError('unknown value')
    features = [
        kclass, (1 if age == 'adults' else 0), (1 if sex == 'women' else 0)]
    return LabeledPoint(1 if survived == 'yes' else 0, features)


We apply this function to all rows.

In [8]:
labeled_points_rdd = data_rdd.map(raw_to_labeled_point)
labeled_points_rdd.takeSample(False,5,0)

[LabeledPoint(0.0, [0.0,1.0,0.0]),
 LabeledPoint(1.0, [0.0,1.0,1.0]),
 LabeledPoint(0.0, [2.0,1.0,1.0]),
 LabeledPoint(0.0, [2.0,1.0,0.0]),
 LabeledPoint(0.0, [2.0,1.0,0.0])]
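
The encoding can also be checked on a single row without Spark. The sketch below is a simplified stand-in that returns a plain `(label, features)` tuple instead of a LabeledPoint and omits the validation step. The name `encode_row` is hypothetical, and the input row is copied from the sample shown earlier.

```python
def encode_row(line):
    """Return (label, [class, age, sex]) using the same encoding as above."""
    _, kclass, age, sex, survived = [seg.strip('"') for seg in line.split(',')]
    features = [
        int(kclass[0]) - 1,            # '1st class' -> 0, '2nd' -> 1, '3rd' -> 2
        1 if age == 'adults' else 0,   # adults -> 1, child -> 0
        1 if sex == 'women' else 0,    # women -> 1, man -> 0
    ]
    return (1 if survived == 'yes' else 0), features

row = '"256","1st class","adults","women","yes"'
label, features = encode_row(row)
print(label, features)  # 1 [0, 1, 1]
```

This is consistent with the second LabeledPoint in the sample above, `LabeledPoint(1.0, [0.0,1.0,1.0])`.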

## Split for training and test
We split the transformed data into a training set (70%) and a test set (30%), and print the total number of items in each segment.

In [9]:
training_rdd, test_rdd = labeled_points_rdd.randomSplit([0.7,0.3],seed=0)
training_count=training_rdd.count()
test_count = test_rdd.count()

In [10]:
training_count, test_count

(914, 402)
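
The counts are close to, but not exactly, 70/30: 914 of 1316 rows is about 69.5%. One way to see why a weighted random split gives approximate proportions is to assign each item to a bucket independently at random. The sketch below does this in plain Python. It is an illustration under that assumption, not Spark's implementation; the name `bernoulli_split` is hypothetical, and because it uses Python's own random generator it will not reproduce the counts above.

```python
import random

def bernoulli_split(items, weights, seed=0):
    """Assign each item to one bucket, chosen independently with the given weights."""
    rng = random.Random(seed)
    total = float(sum(weights))
    buckets = [[] for _ in weights]
    for item in items:
        draw = rng.random() * total
        upper = 0.0
        for index, weight in enumerate(weights):
            upper += weight
            if draw < upper:
                buckets[index].append(item)
                break
        else:
            # Guard against floating-point rounding at the top of the range.
            buckets[-1].append(item)
    return buckets

train, test = bernoulli_split(range(1316), [0.7, 0.3], seed=0)
print(len(train), len(test))  # the two sizes sum to 1316; the exact split depends on the seed
```

Every item lands in exactly one bucket, so the two sizes always sum to the total, but the ratio only approaches 70/30 on average.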

## Train and test a decision tree classifier
Now we train a Decision Tree model. We specify that we're training a binary classifier (i.e., there are two outcomes). We also specify that all of our features are categorical and give the number of possible categories for each.

In [11]:
model = DecisionTree.trainClassifier(training_rdd, numClasses=2,
                                    categoricalFeaturesInfo={0:3,1:2,2:2})

We now apply the trained model to the feature values in the test set to get the list of predicted outcomes.

In [12]:
predictions_rdd = model.predict(test_rdd.map(lambda x: x.features))

We bundle our predictions with the ground truth outcome for each passenger in the test set.

In [13]:
truth_and_predictions_rdd = test_rdd.map(lambda lp:lp.label).zip(predictions_rdd)

In [14]:
accuracy = truth_and_predictions_rdd.filter(lambda v_p:v_p[0]==v_p[1]).count()/float(test_count)
print('Accuracy= ', accuracy)
print(model.toDebugString())

Accuracy=  0.7985074626865671
DecisionTreeModel classifier of depth 4 with 21 nodes
  If (feature 2 in {0.0})
   If (feature 1 in {0.0})
    If (feature 0 in {0.0,1.0})
     Predict: 1.0
    Else (feature 0 not in {0.0,1.0})
     Predict: 0.0
   Else (feature 1 not in {0.0})
    If (feature 0 in {1.0})
     Predict: 0.0
    Else (feature 0 not in {1.0})
     If (feature 0 in {0.0})
      Predict: 0.0
     Else (feature 0 not in {0.0})
      Predict: 0.0
  Else (feature 2 not in {0.0})
   If (feature 0 in {2.0})
    If (feature 1 in {0.0})
     Predict: 0.0
    Else (feature 1 not in {0.0})
     Predict: 0.0
   Else (feature 0 not in {2.0})
    If (feature 0 in {1.0})
     If (feature 1 in {0.0})
      Predict: 1.0
     Else (feature 1 not in {0.0})
      Predict: 1.0
    Else (feature 0 not in {1.0})
     If (feature 1 in {0.0})
      Predict: 1.0
     Else (feature 1 not in {0.0})
      Predict: 1.0
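
The accuracy computed above is the fraction of (truth, prediction) pairs whose two values agree. The same calculation can be sketched on a plain Python list. The name `accuracy_of` is hypothetical, and the example pairs are made up for illustration; they are not taken from the test set.

```python
def accuracy_of(pairs):
    """Fraction of (truth, prediction) pairs whose two values agree."""
    pairs = list(pairs)
    matches = sum(1 for truth, predicted in pairs if truth == predicted)
    return matches / float(len(pairs))

# Hypothetical pairs for illustration only; not taken from the test set.
example_pairs = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 1.0)]
print(accuracy_of(example_pairs))  # 3 of 4 agree -> 0.75
```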



Now we use the trained model to predict whether a passenger with the features [1,0,0], which means (2nd class, child, man), survives.

In [15]:
prediction = model.predict([1,0,0])
print('yes' if prediction==1 else 'no')

yes
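
This prediction can be cross-checked against the printed tree. The function below is a hand transcription of the debug string above, collapsed wherever both branches of a split predict the same value. The name `tree_predict` is hypothetical; it is a reading aid, not a replacement for `model.predict`.

```python
def tree_predict(features):
    """Hand-transcribed version of the printed decision tree.

    features = [class, age, sex] with the encoding used in this notebook.
    """
    kclass, age, sex = features
    if sex == 0:                      # feature 2 in {0.0}: man
        if age == 0:                  # feature 1 in {0.0}: child
            return 1.0 if kclass in (0, 1) else 0.0
        return 0.0                    # adult men: every leaf predicts 0.0
    # women: only 3rd class (feature 0 in {2.0}) predicts 0.0
    return 0.0 if kclass == 2 else 1.0

print('yes' if tree_predict([1, 0, 0]) == 1 else 'no')  # yes
```

Reading the tree this way shows that the model effectively uses only a few rules: boys in 1st or 2nd class and women outside 3rd class are predicted to survive.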


## Train and test a logistic regression classifier
For a simple comparison, we also train and test a LogisticRegressionWithLBFGS model.
> Note: LogisticRegressionWithSGD is deprecated as of Spark 2.0.0, so we use LogisticRegressionWithLBFGS instead.

In [16]:
model2 = LogisticRegressionWithLBFGS.train(training_rdd)
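
As a rough illustration of how a binary logistic regression model turns features into a prediction, the sketch below computes a weighted sum, passes it through the logistic function, and compares the result with a threshold. The function name `logistic_predict`, the weights, and the 0.5 threshold are assumptions for illustration; they are not the values learned by `model2`.

```python
import math

def logistic_predict(weights, intercept, features, threshold=0.5):
    """Return 1 if the logistic score of the features exceeds the threshold, else 0."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    probability = 1.0 / (1.0 + math.exp(-margin))
    return 1 if probability > threshold else 0

# Hypothetical weights for [class, age, sex]; not the values learned by model2.
hypothetical_weights = [-0.8, -0.5, 2.0]
print(logistic_predict(hypothetical_weights, 0.0, [0, 1, 1]))  # margin 1.5 -> 1
print(logistic_predict(hypothetical_weights, 0.0, [2, 1, 0]))  # margin -2.1 -> 0
```

Unlike the decision tree, this kind of model treats the class code as a number, so each step from 1st to 3rd class shifts the score by the same amount.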

In [17]:
predictions2_rdd = model2.predict(test_rdd.map(lambda x:x.features))

In [18]:
labels_and_predictions2_rdd = test_rdd.map(lambda lp:lp.label).zip(predictions2_rdd)

In [19]:
accuracy = labels_and_predictions2_rdd.filter(lambda v_p:v_p[0]==v_p[1]).count()/float(test_count)
print('Accuracy: ', accuracy)

Accuracy:  0.7835820895522388
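
To put the two accuracies in concrete terms, the number of correctly classified test passengers can be recovered from the reported values and the test set size of 402. This is simple arithmetic on the numbers printed above; the variable names are chosen for this sketch.

```python
test_count = 402
tree_accuracy = 0.7985074626865671
logistic_accuracy = 0.7835820895522388

# Recover the number of correct predictions behind each reported accuracy.
tree_correct = round(tree_accuracy * test_count)
logistic_correct = round(logistic_accuracy * test_count)

print(tree_correct, logistic_correct, tree_correct - logistic_correct)  # 321 315 6
```

The decision tree classifies 321 test passengers correctly and the logistic regression 315, a difference of 6 passengers out of 402.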


These two classifiers show similar accuracy. More information about the passengers would likely help improve this metric.

> In this case, the Decision Tree model performs slightly better than the Logistic Regression model trained with the LBFGS optimization algorithm.