## Hello World of Machine Learning: using PySpark ML on Kaggle's Titanic dataset


## Description
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

## Goals
It is your job to predict whether a passenger survived the sinking of the Titanic or not.
For each passenger in the test set, you must predict a 0 or 1 value for the Survived variable.

### Metric
Your score is the percentage of passengers you correctly predict. This is known simply as "accuracy".
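To make the metric concrete, here is a minimal plain-Python sketch of how accuracy can be computed. The function name `accuracy` and the example labels are made up for illustration; this is not Kaggle's scoring code.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the prediction equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot score an empty set of predictions")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical labels for five passengers (made-up values, not Kaggle data).
example_predicted = [0, 1, 1, 0, 1]
example_actual = [0, 1, 0, 0, 1]
print(accuracy(example_predicted, example_actual))  # 4 of 5 predictions match
```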

### Submission File Format
You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.

The file should have exactly 2 columns:

- PassengerId (sorted in any order)
- Survived (contains your binary predictions: 1 for survived, 0 for deceased)

```
PassengerId,Survived
892,0
893,1
894,0
Etc.
```
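A file in this shape can be produced with Python's standard `csv` module. The sketch below is only an illustration: the helper name `write_submission` is hypothetical and the three predictions are the sample rows from the format example, not real model output.

```python
import csv
import io


def write_submission(predictions, out):
    """Write (PassengerId, Survived) pairs as CSV with the required header."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for passenger_id, survived in predictions:
        if survived not in (0, 1):
            raise ValueError(f"Survived must be 0 or 1, got {survived!r}")
        writer.writerow([passenger_id, survived])


# Made-up predictions that mirror the sample rows of the format example.
sample_predictions = [(892, 0), (893, 1), (894, 0)]
buffer = io.StringIO()
write_submission(sample_predictions, buffer)
print(buffer.getvalue())
```

To write a real file, pass an object opened with `open("submission.csv", "w", newline="")` instead of the in-memory buffer.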

## Data Overview
The data has been split into two groups:
- training set (train.csv)
- test set (test.csv)


The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.


### Data Dictionary
Variable | Definition | Key
--- | --- | ---
survival | Survival | 0 = No, 1 = Yes
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd
sex | Sex |
Age | Age in years |
sibsp | # of siblings / spouses aboard the Titanic |
parch | # of parents / children aboard the Titanic |
ticket | Ticket number |
fare | Passenger fare |
cabin | Cabin number |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton



#### Variable Notes
- pclass: A proxy for socio-economic status (SES)
  * 1st = Upper
  * 2nd = Middle
  * 3rd = Lower

- age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

- sibsp: The dataset defines family relations in this way...
  * Sibling = brother, sister, stepbrother, stepsister
  * Spouse = husband, wife (mistresses and fiancés were ignored)

- parch: The dataset defines family relations in this way...
  * Parent = mother, father
  * Child = daughter, son, stepdaughter, stepson
  * Some children travelled only with a nanny, therefore parch=0 for them.


### Prerequisites:
An understanding of the following topics:
  * Python https://www.kaggle.com/learn/python  
  * Machine learning https://www.kaggle.com/learn/machine-learning
  * Apache Spark 


Ref. https://www.kaggle.com/c/titanic

#### Importing the required libraries

In [0]:
from pyspark.sql import SparkSession 
from pyspark.ml import Pipeline
from pyspark.sql.functions import mean, col, split, count, regexp_extract, when, lit
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler 
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import QuantileDiscretizer


### pyspark.ml.pipeline
A simple pipeline, which acts as an estimator. A Pipeline consists of a sequence of stages, each of which is either an Estimator or a Transformer. When Pipeline.fit() is called, the stages are executed in order. If a stage is an Estimator, its Estimator.fit() method will be called on the input dataset to fit a model. Then the model, which is a transformer, will be used to transform the dataset as the input to the next stage. If a stage is a Transformer, its Transformer.transform() method will be called to produce the dataset for the next stage. The fitted model from a Pipeline is a PipelineModel, which consists of fitted models and transformers, corresponding to the pipeline stages. If stages is an empty list, the pipeline acts as an identity transformer.
- Estimator: Abstract class for estimators that fit models to data.
- Transformer: Abstract class for transformers that transform one dataset into another. 
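The fit/transform flow described above can be imitated in a few lines of plain Python. This is only a toy sketch of the idea, not Spark's implementation: every class and function name below (`AddOne`, `CenterEstimator`, `Centered`, `fit_pipeline`, `transform_pipeline`) is hypothetical.

```python
class AddOne:
    """Toy transformer: transform() maps one dataset to another."""

    def transform(self, data):
        return [x + 1 for x in data]


class Centered:
    """Toy fitted model: a transformer that subtracts a learned mean."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, data):
        return [x - self.mean for x in data]


class CenterEstimator:
    """Toy estimator: fit() learns the mean and returns a transformer."""

    def fit(self, data):
        return Centered(sum(data) / len(data))


def fit_pipeline(stages, data):
    """Run the stages in order and return the list of fitted transformers."""
    fitted = []
    for stage in stages:
        # An estimator is fitted first; the resulting model is a transformer.
        model = stage.fit(data) if hasattr(stage, "fit") else stage
        # The output of this stage is the input of the next one.
        data = model.transform(data)
        fitted.append(model)
    return fitted


def transform_pipeline(fitted, data):
    """Apply every fitted transformer in order."""
    for model in fitted:
        data = model.transform(data)
    return data


toy_model = fit_pipeline([AddOne(), CenterEstimator()], [1.0, 2.0, 3.0])
print(transform_pipeline(toy_model, [1.0, 2.0, 3.0]))
```

Note that the estimator stage is fitted on the output of the previous stage (here on `[2.0, 3.0, 4.0]`), which mirrors the order of execution described in the text.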

### pyspark.sql.functions
- mean - is an alias for avg().
- col - returns a Column based on the given column name.
- split - splits str around pattern (pattern is a regular expression).
- count - counts the number of records for each group.
- regexp_extract - Extract a specific group matched by a Java regex, from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned.
- when - Evaluates a list of conditions and returns one of multiple possible result expressions. If Column.otherwise() is not invoked, None is returned for unmatched conditions. Parameters: 1. condition – a boolean Column expression. 2. value – a literal value, or a Column expression.
- lit - creates a Column of literal value.

### pyspark.ml.feature.StringIndexer
A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels). By default, this is ordered by label frequencies so the most frequent label gets index 0. The ordering behavior is controlled by setting stringOrderType. Its default value is ‘frequencyDesc’.
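The default 'frequencyDesc' ordering can be pictured with a small plain-Python sketch. The function `fit_string_index` is a hypothetical stand-in, not the Spark API, and breaking ties alphabetically is an assumption made here only to keep the result deterministic.

```python
from collections import Counter


def fit_string_index(labels):
    """Map each label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; ties are broken alphabetically
    # (an assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Made-up port codes for illustration.
ports = ["S", "C", "S", "Q", "S", "C"]
port_index = fit_string_index(ports)
print(port_index)
```

"S" is the most frequent label, so it gets index 0.0, followed by "C" and then "Q".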

### pyspark.ml.feature.VectorAssembler
A feature transformer that merges multiple columns into a vector column.

### pyspark.ml.evaluation.MulticlassClassificationEvaluator
Evaluator for Multiclass Classification, which expects two input columns: prediction and label. Calling evaluate returns metric.

### pyspark.ml.feature.QuantileDiscretizer
QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the numBuckets parameter. It is possible that the number of buckets used will be less than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles.

NaN handling: Note also that QuantileDiscretizer will raise an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values within the dataset by setting handleInvalid parameter. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket, for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].

Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for approxQuantile() for a detailed description). The precision of the approximation can be controlled with the relativeError parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values.
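The binning idea can be sketched in plain Python as follows. This is a simplification under stated assumptions: it uses exact rank-based quantiles rather than Spark's approximate algorithm, it ignores NaN handling, and the names `quantile_splits` and `bucket_of` are hypothetical.

```python
import bisect
import math


def quantile_splits(values, num_buckets):
    """Return bin boundaries, with -inf and +inf as the outer bounds."""
    ordered = sorted(values)
    inner = []
    for k in range(1, num_buckets):
        # Exact quantile by rank (Spark uses an approximation instead).
        cut = ordered[(k * len(ordered)) // num_buckets]
        if not inner or cut > inner[-1]:
            # Duplicate cut points collapse, so fewer buckets may result.
            inner.append(cut)
    return [-math.inf] + inner + [math.inf]


def bucket_of(value, splits):
    """Bucket i covers the half-open range [splits[i], splits[i + 1])."""
    return bisect.bisect_right(splits, value) - 1


# Made-up fares for illustration.
fares = [5.0, 7.0, 8.0, 10.0, 15.0, 30.0, 50.0, 80.0]
fare_splits = quantile_splits(fares, 4)
print(fare_splits)
print([bucket_of(fare, fare_splits) for fare in fares])
```

With too few distinct values, for example `quantile_splits([1.0, 1.0, 1.0, 1.0], 4)`, the sketch yields fewer buckets than requested, which matches the behaviour described above.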


Ref. 
- https://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=vectorassembler#pyspark.ml.Transformer
- https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html

#### Beginning with SparkSession

In [0]:
# Spark running locally https://spark.apache.org/docs/latest/quick-start.html
# Builder, get or create 
# Scala -> dataset
# Python -> dataframe

###### The entry point into all functionality in Spark is the SparkSession class.
###### To create a basic SparkSession, just use SparkSession.builder

Before 2.0, the entry point to Spark Core was `SparkContext`. In earlier versions of Spark, you had to create a SparkConf and a SparkContext to interact with Spark.

For Scala:

Previously, for example:
```scala
// set up the Spark configuration and create the contexts
val sparkConf = new SparkConf()
  .setAppName("Spark ML example on titanic data ")
  .setMaster("local")
  .set("spark.some.config.option", "some-value")
// your handle to SparkContext, used to access other contexts such as SQLContext
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
```

Now there is no need to explicitly create a SparkConf, SparkContext or SQLContext, as they are encapsulated within the SparkSession.
```scala
val spark = SparkSession.builder()
  .master("local")
  .appName("Spark ML example on titanic data ")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
```

In [0]:
spark = SparkSession \
    .builder \
    .appName("Spark ML example on titanic data ") \
    .getOrCreate()

# FYI - this is only required when running Spark locally. Synapse/Databricks creates the SparkSession when the notebook is created or opened.

# STORY #1 - load

1. Load the data
2. Display it in the simplest way
3. Count how many passengers we have data for

###### Next, we have to import the dataset. 


* inferSchema (default false): infers the input schema automatically from data. It requires one extra pass over the data. 

Ref. https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/DataFrameReader.html

In [0]:
titanic_df = spark.read.load(
    'abfss://synapseekot@synapseekot.dfs.core.windows.net/titanic/train.csv',
    format='csv',
    header=True,       # the first line holds the column names
    inferSchema=True,  # costs one extra pass over the data
)

* display function - The easiest way to create a DataFrame visualization in Synapse/Databricks is to call display(<dataframe-name>). For example, if you have a Spark DataFrame diamonds_df of a diamonds dataset grouped by diamond color, computing the average price, and you call display(diamonds_df), the notebook renders the result as a table.

In [0]:
# TODO Print/display dataframe
# <***>(titanic_df)

* .printSchema() - If you want to see the Structure (Schema) of the DataFrame, then use the following command.

In [0]:
# TODO Print the schema of the dataframe
# titanic_df.<***>

In [0]:
# TODO Count total number of passengers

passengers_count: int = 0  #= titanic_df.<***>

In [0]:
print(passengers_count)

# STORY #2 - explore

1. Display the first 5 rows
2. Display statistics for the data
3. Display the columns "Survived", "Pclass", "Embarked" (only these)

###### Viewing a few rows

In [0]:
# TODO show the first 5 rows of data

# titanic_df.show(<***>)

Summary of data

* describe: The function describe returns a DataFrame containing information such as number of non-null entries (count), mean, standard deviation, and minimum and maximum value for each numerical column.
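For a single numeric column, the kind of summary that describe produces can be imitated in plain Python. This is only a sketch: the function name `describe_column` and the sample ages are made up, and using the sample standard deviation is an assumption about how the statistic is defined.

```python
import statistics


def describe_column(values):
    """Summarise the non-null entries of one numeric column."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),  # number of non-null entries
        "mean": statistics.mean(present),
        "stddev": statistics.stdev(present),  # sample standard deviation (assumed)
        "min": min(present),
        "max": max(present),
    }


# Made-up ages, one of them missing.
sample_ages = [20.0, 30.0, None, 40.0]
age_summary = describe_column(sample_ages)
print(age_summary)
```

The missing entry is excluded from the count, which is why the count can be lower than the number of rows.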

In [0]:
# TODO Describe the dataset and show the result. The result should be a DataFrame containing information such as the number of non-null entries (count), mean, standard deviation, and minimum and maximum value for each numerical column.

# titanic_df.<***>.<***>

###### Selecting a few features

In [0]:
# TODO show features for columns: "Survived","Pclass","Embarked"

# titanic_df.<***>.show()

# STORY #3 - count survivors

1. Count how many passengers survived and show the result in a table
2. Create a dataframe with the information from step 1
3. Display the created dataframe

### Let's do some simple exploratory data analysis (EDA)

###### How many passengers survived?

In [0]:
# TODO show how many passengers survived and how many did not

# All methods https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrame.html

# titanic_df.<***>.<***>.show()


In [0]:
# TODO present as a (donut) chart the percentage of passengers who survived and who did not

# output = titanic_df.<***>.<***>

# display(output)

Complete the sentence:

Out of ... passengers in the dataset, only about ... survived.

# STORY #4 - Are gender & ticket class correlated with the number of survivors?

###### To learn more about the survivors, we have to explore more of the data.
###### The survival rate may depend on different features of the dataset, such as Sex, Port of Embarkation and Age, to name a few.

Checking the survival rate using the Sex feature

In [0]:
# TODO show in a table how many male/female passengers survived

# titanic_df.<***>.count().show()

Write down your observations and conclusions, e.g.
- Although there were more males than females on the ship, about twice as many females as males survived.

In [0]:
# TODO check whether there is any correlation between ticket class and the number of passengers who survived

# titanic_df.<***>.count().show()

Write down your observations and conclusions

# STORY #5 - data quality

1. Read through the method below and find the bug in it
2. Display a table with two columns (column name, number of null values)

#### Checking Null values

In [0]:
# This function is used to collect the features with null values together with their null counts
def null_value_count(df):
  null_columns_counts = []
  numRows = df.count()
  for k in df.columns:
    nullRows = df.where(col(k).isNull()).count()
    if(nullRows > 0):
      temp = k,nullRows
      null_columns_counts.append(temp)
  return(null_columns_counts)

In [0]:
# Calling function
null_columns_count_list = null_value_count(titanic_df)

In [0]:
spark.createDataFrame(null_columns_count_list, ['Column_With_Null_Value', 'Null_Values_Count']).show()

Age feature has XX null values.

# STORY #6 Can we guess the age of people based on another feature?

In [0]:
# TODO print mean age of all passengers

# mean_age = titanic_df.select(mean(<***>)).collect()[0][0]

print(mean_age)

In [0]:
# TODO show the feature which can give the most valuable information which could be correlated with age

# titanic_df.select(<***>).show()

* A null value represents "no value" or "nothing". It is not even an empty string or zero. It can be used to indicate that nothing useful exists.
* NaN stands for "Not a Number". It is usually the result of a mathematical operation that doesn't make sense, e.g. 0.0/0.0
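The same distinction exists in plain Python, where `None` plays the role of null and `float("nan")` is a NaN. The sketch below is only an analogy and the helper `is_missing` is hypothetical. In PySpark the two cases are checked separately, with `Column.isNull()` for nulls and `pyspark.sql.functions.isnan` for NaN.

```python
import math

missing_age = None             # "no value", comparable to a null cell
undefined_age = float("nan")   # a float that is "not a number"


def is_missing(value):
    """True for None and for NaN, False for every real value."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# NaN is never equal to itself, so equality checks cannot find it.
print(undefined_age == undefined_age)
print(is_missing(missing_age), is_missing(undefined_age), is_missing(0.0))
```

An empty string and zero are ordinary values, so `is_missing("")` and `is_missing(0.0)` are both false.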

To replace these missing Age values, we could assign them the mean age of the dataset. The problem is that the passengers' ages vary widely: we can't just assign the mean age, which is 29 years, to a 4-year-old child.

Instead, we can check the Name feature. Looking at it, we can see that the names contain a salutation such as Mr or Mrs. Thus we can assign the mean age of each salutation group (Mr, Mrs and so on) to the passengers of that group whose age is missing.
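The idea of filling a missing age with the mean of the passenger's salutation group can be sketched in plain Python. The function `impute_by_group`, the key names and the rows are made up for illustration; this is not the notebook's Spark code.

```python
from collections import defaultdict


def impute_by_group(rows, group_key, value_key):
    """Fill missing values with the mean of the row's group."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row[value_key] is not None:
            sums[row[group_key]] += row[value_key]
            counts[row[group_key]] += 1
    means = {group: sums[group] / counts[group] for group in counts}

    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[value_key] is None:
            # Stays None when the whole group has no known value.
            new_row[value_key] = means.get(new_row[group_key])
        filled.append(new_row)
    return filled


# Made-up passengers for illustration.
passengers = [
    {"Salutation": "Mr", "Age": 30.0},
    {"Salutation": "Mr", "Age": 40.0},
    {"Salutation": "Mr", "Age": None},
    {"Salutation": "Master", "Age": 4.0},
    {"Salutation": "Master", "Age": None},
]
imputed = impute_by_group(passengers, "Salutation", "Age")
print([row["Age"] for row in imputed])
```

The missing "Mr" age becomes 35.0 and the missing "Master" age becomes 4.0, so a child is no longer assigned an adult's age.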

In [0]:
titanic_df = titanic_df.withColumn("Initial", regexp_extract(col("Name"), r"([A-Za-z]+)\.", 1))


Using the regex `([A-Za-z]+)\.` we extract the initials from the Name. It looks for a run of letters (A-Z or a-z) that is followed by a . (dot).
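The pattern can be tried out with Python's standard `re` module before running it in Spark. This is a sketch with a hypothetical helper `extract_initial`; returning an empty string when nothing matches imitates the behaviour described for regexp_extract.

```python
import re

# One or more letters, captured as group 1, followed by a literal dot.
TITLE_PATTERN = re.compile(r"([A-Za-z]+)\.")


def extract_initial(name):
    """Return the first run of letters that is followed by a dot, or ''."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else ""


print(extract_initial("Braund, Mr. Owen Harris"))
print(extract_initial("Heikkinen, Miss. Laina"))
```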

In [0]:
titanic_df.show()

In [0]:
# TODO show unique initials

# titanic_df.select(<***>).<***>.show()


Some Initials are rare or variant forms, such as Mlle or Mme, which we treat here as Miss. I will replace them with Miss and do the same for the other values.

In [0]:
# TODO replace variant initials like Mlle with Miss and do the same for the other values.
# mapping dict
# 'Mlle' -> 'Miss'
# 'Mme' -> 'Miss'
# 'Ms'  -> 'Miss'
# 'Dr' -> 'Mr'
# 'Major' -> 'Mr'
# 'Lady' -> 'Mrs'
# 'Countess' -> 'Mrs'
# 'Jonkheer' -> 'Other'
# 'Col' -> 'Other'
# 'Rev' -> 'Other'
# 'Capt' -> 'Mr'
# 'Sir' -> 'Mr'
# 'Don' -> 'Mr'

# titanic_df = titanic_df.<***>([<***>], [<***>])


In [0]:
# TODO verify if now the unique initials are correct

# titanic_df.select(<***>).<***>.show()


In [0]:
# TODO check the average age per initial

# titanic_df.groupby(<***>).avg('Age').collect()

Let's impute the missing values in the Age feature based on the average age for each Initial.

In [0]:
# TODO replace <***> with correct numbers 

titanic_df = titanic_df.withColumn("Age",when((titanic_df["Initial"] == "Miss") & (titanic_df["Age"].isNull()), <***> ).otherwise(titanic_df["Age"]))
titanic_df = titanic_df.withColumn("Age",when((titanic_df["Initial"] == "Other") & (titanic_df["Age"].isNull()), <***>).otherwise(titanic_df["Age"]))
titanic_df = titanic_df.withColumn("Age",when((titanic_df["Initial"] == "Master") & (titanic_df["Age"].isNull()), <***>).otherwise(titanic_df["Age"]))
titanic_df = titanic_df.withColumn("Age",when((titanic_df["Initial"] == "Mr") & (titanic_df["Age"].isNull()), <***>).otherwise(titanic_df["Age"]))
titanic_df = titanic_df.withColumn("Age",when((titanic_df["Initial"] == "Mrs") & (titanic_df["Age"].isNull()), <***>).otherwise(titanic_df["Age"]))


Check the imputation

In [0]:
# TODO 
# titanic_df.filter(titanic_df.Age== <***> ).select("Initial").show()

In [0]:
titanic_df.select("Age").show()

# STORY #7 - Simplification

The Embarked feature has only two missing values. Let's check the values within Embarked.

In [0]:
# TODO 
# titanic_df.groupBy(<***>).count().show()

The majority of passengers boarded at "S". We can impute the missing values with "S".

In [0]:
titanic_df = titanic_df.na.fill({"Embarked" : 'S'})


We can drop the Cabin feature, as it has lots of null values.

In [0]:
titanic_df = titanic_df.<***>(<***>)

In [0]:
titanic_df.printSchema()

# STORY #8 - "With family, you only look good in photos"

We can create two new features, "Family_Size" and "Alone", and analyse them. Family_Size is the sum of Parch (parents/children) and SibSp (siblings/spouses). It gives us combined data so that we can check whether the survival rate has anything to do with the family size of the passengers.

In [0]:
titanic_df = titanic_df.withColumn("Family_Size",col('SibSp')+col('Parch'))

In [0]:
titanic_df.groupBy("Family_Size").count().show()

In [0]:
titanic_df = titanic_df.withColumn('Alone',lit(0))


In [0]:
#TODO put proper value

# titanic_df = titanic_df.withColumn("Alone",when(titanic_df["Family_Size"] == 0, <***>).otherwise(titanic_df["Alone"]))

In [0]:
titanic_df.columns

# STORY #9 - numeric representation

Let's convert the Sex, Embarked & Initial columns from strings to numbers using StringIndexer


Helper:
* StringIndexer:
  * converts a single column to an index column (similar to a factor column in R)
  * Use it if you want the machine learning algorithm to treat a column as a categorical variable, or if you want to convert textual data to numeric data while keeping the categorical context.
  * e.g. converting days (Monday, Tuesday...) to a numeric representation.

* VectorIndexer:
  * is used to index categorical predictors in a featuresCol column. Remember that featuresCol is a single column consisting of vectors (refer to featuresCol and labelCol). Each row is a vector which contains the values of each predictor.
  * if you have string type predictors, you first need to index those columns with StringIndexer. featuresCol contains vectors, and vectors do not contain string values.
  * use it if you do not know the types of the incoming data, so that the logic of differentiating between categorical and non-categorical data is left to the algorithm, using VectorIndexer.
  * e.g. data coming from a 3rd party API, where the data is hidden and is ingested directly into the training model.


Ref.: https://mingchen0919.github.io/learning-apache-spark/StringIndexer-and-VectorIndexer.html

In [0]:
# Pipeline.fit() fits each StringIndexer stage, so the stages need no separate fit() call
indexers = [StringIndexer(inputCol=column, outputCol=column + "_index") for column in ["Sex", "Embarked", "Initial"]]
pipeline = Pipeline(stages=indexers)
titanic_df = pipeline.fit(titanic_df).transform(titanic_df)

In [0]:
titanic_df.show()

In [0]:
titanic_df.printSchema()

# STORY #10 - cleaning

###### Drop columns which are not required

In [0]:
# TODO drop columns "PassengerId","Name","Ticket","Cabin","Embarked","Sex","Initial"
# titanic_df = titanic_df.drop(<***>)

In [0]:
titanic_df.show()

# STORY #11 - "It's all about vectors!!!!"

Let's put all the features into a single vector column

In [0]:
feature = VectorAssembler(inputCols=titanic_df.columns[1:],outputCol="features")
feature_vector= feature.transform(titanic_df)

In [0]:
feature_vector.show()

# STORY #12 - test & train datasets *

* from main train dataset

Now that the data is all set, let's split it into training and test sets. I'll be using 80% of it for training.
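A weighted random split with a fixed seed can be sketched in plain Python as below. This is an illustration of the idea rather than Spark's implementation: the function `random_split` is hypothetical, and the 0.7/0.3 weights are arbitrary example values. Because every row is assigned by an independent random draw, the resulting sizes only approximate the requested proportions.

```python
import random


def random_split(rows, weights, seed):
    """Assign every row to one part, with probability proportional to its weight."""
    total = sum(weights)
    bounds, running = [], 0.0
    for weight in weights:
        running += weight / total  # weights are normalised to sum to 1
        bounds.append(running)

    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last part also catches draws lost to floating point rounding.
            if draw < bound or i == len(bounds) - 1:
                parts[i].append(row)
                break
    return parts


example_rows = list(range(100))
train_part, test_part = random_split(example_rows, [0.7, 0.3], seed=42)
print(len(train_part), len(test_part))
```

Calling the function again with the same seed returns the same split, which is what makes experiments repeatable.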

In [0]:
# TODO put proper values

# (trainingData, testData) = feature_vector.randomSplit([0.8, <***> ],seed = <***>)

# STORY #13 - MACHINE LEARNING

### Modelling

###### Here is a list of a few classification algorithms from Spark ML

- LogisticRegression
- DecisionTreeClassifier
- RandomForestClassifier
- Gradient-boosted tree classifier
- NaiveBayes
- Support Vector Machine

or any other listed at https://spark.apache.org/docs/latest/ml-classification-regression.html

###### LogisticRegression



Intro

In the biological and medical sciences we often deal with dichotomous variables, such as Cancer_Occurrence (1 = yes, 0 = no) or Survival (1 = yes, 0 = no). In such a situation an important question may be which variables significantly affect survival or the occurrence of cancer. Logistic regression works very well for this kind of problem.
The analysis and interpretation of logistic regression results is very similar to that of classical regression. The most important differences between the two methods are:
- More complicated and time-consuming computations,
- Computing residual values and plotting residuals usually adds nothing new to the model.

Assumptions of logistic regression
- Random sampling;
- Appropriate coding (the logistic regression model computes the probability that the dependent variable takes the value 1);
- All relevant variables are included;
- All irrelevant variables are excluded from the model;
- The dependence of the logit transformation on the independent variables is linear;
- The logistic regression model does not explain interaction effects between independent variables;
- The independent variables must not be collinear;
- Logistic regression is sensitive to outliers. They should be removed before the analysis starts (outliers can be detected by analysing the residuals);
- The sample must be sufficiently large (at least n=100);


In logistic regression, besides the regression coefficients and their statistical significance, there is one additional parameter: the odds ratio.

Ref. http://home.agh.edu.pl/~mmd/_media/dydaktyka/adp/regresja_logistyczna.pdf
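How a logistic regression model turns features into a probability, and how a coefficient relates to the odds ratio, can be sketched in plain Python. The function names and the coefficient values below are made up for illustration and are not results from the Titanic data.

```python
import math


def predicted_probability(intercept, coefficients, features):
    """Probability that the dependent variable equals 1."""
    # The logit is a linear function of the independent variables.
    logit = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-logit))


def odds_ratio(coefficient):
    """Factor by which the odds change when the feature grows by one unit."""
    return math.exp(coefficient)


# Hypothetical model with a single binary feature.
example_intercept = -1.0
example_coefficient = 2.0
p_without = predicted_probability(example_intercept, [example_coefficient], [0.0])
p_with = predicted_probability(example_intercept, [example_coefficient], [1.0])
print(p_without, p_with, odds_ratio(example_coefficient))
```

The ratio of the odds `p / (1 - p)` for the two cases equals `odds_ratio(example_coefficient)`, which is why the odds ratio is a convenient way to read a coefficient.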

In [0]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="Survived", featuresCol="features")
# Train the model
lrModel = lr.fit(trainingData)
lr_prediction = lrModel.transform(testData)
lr_prediction.select("prediction", "Survived", "features").show()
evaluator = MulticlassClassificationEvaluator(labelCol="Survived", predictionCol="prediction", metricName="accuracy")

###### Evaluating accuracy of LogisticRegression.

In [0]:
lr_accuracy = evaluator.evaluate(lr_prediction)
print("Accuracy of LogisticRegression is = %g"% (lr_accuracy))
print("Test Error of LogisticRegression = %g " % (1.0 - lr_accuracy))

###### DecisionTreeClassifier

In [0]:
from pyspark.ml.classification import DecisionTreeClassifier
dt = DecisionTreeClassifier(labelCol="Survived", featuresCol="features")
dt_model = dt.fit(trainingData)
dt_prediction = dt_model.transform(testData)
dt_prediction.select("prediction", "Survived", "features").show()


###### Evaluating accuracy of DecisionTreeClassifier.

In [0]:
dt_accuracy = evaluator.evaluate(dt_prediction)
print("Accuracy of DecisionTreeClassifier is = %g"% (dt_accuracy))
print("Test Error of DecisionTreeClassifier = %g " % (1.0 - dt_accuracy))


###### RandomForestClassifier

In [0]:
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(labelCol="Survived", featuresCol="features")
rf_model = rf.fit(trainingData)
rf_prediction = rf_model.transform(testData)
rf_prediction.select("prediction", "Survived", "features").show()

###### Evaluating accuracy of RandomForestClassifier.

In [0]:
rf_accuracy = evaluator.evaluate(rf_prediction)
print("Accuracy of RandomForestClassifier is = %g"% (rf_accuracy))
print("Test Error of RandomForestClassifier  = %g " % (1.0 - rf_accuracy))

###### Gradient-boosted tree classifier

In [0]:
from pyspark.ml.classification import GBTClassifier
gbt = GBTClassifier(labelCol="Survived", featuresCol="features",maxIter=10)
gbt_model = gbt.fit(trainingData)
gbt_prediction = gbt_model.transform(testData)
gbt_prediction.select("prediction", "Survived", "features").show()


###### Evaluating accuracy of Gradient-boosted tree classifier.

In [0]:
gbt_accuracy = evaluator.evaluate(gbt_prediction)
print("Accuracy of Gradient-boosted tree classifier is = %g"% (gbt_accuracy))
print("Test Error of Gradient-boosted tree classifier = %g " % (1.0 - gbt_accuracy))

###### NaiveBayes

In [0]:
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes(labelCol="Survived", featuresCol="features")
nb_model = nb.fit(trainingData)
nb_prediction = nb_model.transform(testData)
nb_prediction.select("prediction", "Survived", "features").show()


###### Evaluating accuracy of NaiveBayes.

In [0]:
nb_accuracy = evaluator.evaluate(nb_prediction)
print("Accuracy of NaiveBayes is  = %g"% (nb_accuracy))
print("Test Error of NaiveBayes  = %g " % (1.0 - nb_accuracy))

###### Support Vector Machine

In [0]:
from pyspark.ml.classification import LinearSVC
svm = LinearSVC(labelCol="Survived", featuresCol="features")
svm_model = svm.fit(trainingData)
svm_prediction = svm_model.transform(testData)
svm_prediction.select("prediction", "Survived", "features").show()


###### Evaluating the accuracy of Support Vector Machine.

In [0]:
svm_accuracy = evaluator.evaluate(svm_prediction)
print("Accuracy of Support Vector Machine is = %g"% (svm_accuracy))
print("Test Error of Support Vector Machine = %g " % (1.0 - svm_accuracy))

How can we increase the accuracy of a model?
  * Add new features or drop existing features, then retrain the model.
  * Tune the ML algorithm (https://spark.apache.org/docs/latest/ml-tuning.html)