Let's start with your project: 

Are you a data scientist? 

I think you are an awesome data scientist.

### **Problem** 
**Our goal is to create a predictive model that can answer the following question:**

**What kind of people had a better chance of surviving?**

**Data about passengers:**
*   Name
*   Age
*   Gender.


## Install and Import Libraries
Let's install PySpark:

## Build Spark Session

In [4]:
%load_ext nb_black


In [5]:
import findspark

findspark.init()


In [6]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


## Data Loading


You have two datasets:
* Train
* Test

Read both datasets:



In [8]:
Train = spark.read.csv("train.csv", inferSchema=True, header=True)
Test = spark.read.csv("test.csv", inferSchema=True, header=True)


Let's work with the train dataset:

**Confirm if this is a dataframe or not:**

In [9]:
type(Train)


pyspark.sql.dataframe.DataFrame


In [10]:
type(Test)

pyspark.sql.dataframe.DataFrame


**Show 5 rows.**

In [11]:
Train.show(5)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+


**Display schema for the dataset:**

In [12]:
Train.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)




In [13]:
Test.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)




**Statistical summary:**

In [14]:
Train.describe().show()

+-------+-----------------+-------------------+------------------+--------------------+------+------------------+------------------+-------------------+------------------+-----------------+-----+--------+
|summary|      PassengerId|           Survived|            Pclass|                Name|   Sex|               Age|             SibSp|              Parch|            Ticket|             Fare|Cabin|Embarked|
+-------+-----------------+-------------------+------------------+--------------------+------+------------------+------------------+-------------------+------------------+-----------------+-----+--------+
|  count|              891|                891|               891|                 891|   891|               714|               891|                891|               891|              891|  204|     889|
|   mean|            446.0| 0.3838383838383838| 2.308641975308642|                null|  null| 29.69911764705882|0.5230078563411896|0.38159371492704824|260318.54916792738| 32.20420


## EDA - Exploratory Data Analysis

**Display count for the train dataset:**

In [15]:
import pyspark.sql.functions as F

Train.count()

891


**Can you answer this question:** 

**How many people survived, and how many didn't survive?** 

**Please save data in a variable.**

In [16]:
survived = Train.select("Survived").where("Survived = 1")
not_survived = Train.select("Survived").where("Survived = 0")

**Display your result:**

In [17]:
print("Survived: ", survived.count())
print("Not survived:  ", not_survived.count())


Survived:  342
Not survived:   549


**Can you display your answer in ratio form? (Hint: use a UDF. This is only a hint; you can use any method.)**

In [18]:
print("Survived Ratio: ", survived.count() / Train.count() * 100)
print("Not Survived Ratio: ", not_survived.count() / Train.count() * 100)

Survived Ratio:  38.38383838383838
Not Survived Ratio:  61.61616161616161



**Can you get the number of males and females?**


In [19]:
males_and_females = (
    Train.select("Sex", "Survived")
    .groupby("Sex")
    .agg(F.count("Sex").alias("Sex_count"))
)
males_and_females.show()  # show() returns None, so call it separately

+------+---------+
|   Sex|Sex_count|
+------+---------+
|female|      314|
|  male|      577|
+------+---------+




**1. What is the average survival rate for each gender?**

In [20]:
avg_survivors_for_each_gender = (
    Train.select("Sex", "Survived")
    .groupby("Sex")
    .agg(F.avg("Survived").alias("Average_survivors"))
)
avg_survivors_for_each_gender.show()

+------+-------------------+
|   Sex|  Average_survivors|
+------+-------------------+
|female| 0.7420382165605095|
|  male|0.18890814558058924|
+------+-------------------+






**2. What is the number of survivors of each gender?**

(Hint: Group by the "Sex" column. This is only a hint; you can use any method.)

In [21]:
males_and_females = Train.select('Sex' , 'Survived').where('Survived = 1')
# Keep the DataFrame in the variable; .show() returns None
num_survivors_for_each_gender = males_and_females.groupby('Sex').agg(F.count(F.col('Survived')).alias("Survived_count"))
num_survivors_for_each_gender.show()



+------+--------------+
|   Sex|Survived_count|
+------+--------------+
|female|           233|
|  male|           109|
+------+--------------+




**Create a temporary view in PySpark:**

In [22]:
Train.createOrReplaceTempView("Train_view")


**How many people survived, and how many didn't survive? By SQL:**

In [27]:
spark.sql(
    """ SELECT Survived , count(Survived) 
                From Train_view
                GROUP BY Survived """
).show()

+--------+---------------+
|Survived|count(Survived)|
+--------+---------------+
|       1|            342|
|       0|            549|
+--------+---------------+




**Can you display the number of survivors from each gender as a ratio?**

(Hint: Group by the "Sex" column. This is only a hint; you can use any method.)

**Can you do this via SQL?**

In [26]:
spark.sql(
    """ SELECT sex , round(avg(Survived) , 2) 
                From Train_view
                GROUP BY Sex """
).show()

+------+---------------------------------------+
|   sex|round(avg(CAST(Survived AS BIGINT)), 2)|
+------+---------------------------------------+
|female|                                   0.74|
|  male|                                   0.19|
+------+---------------------------------------+




**Display the survival ratio for each "Pclass": SUM(Survived) / COUNT(Pclass)**


In [29]:
spark.sql(
    """ SELECT Pclass, round(sum(Survived)/count(Pclass),2) as pclass_ratio
                From Train_view
                GROUP BY pclass """
).show()

+------+------------+
|Pclass|pclass_ratio|
+------+------------+
|     1|        0.63|
|     3|        0.24|
|     2|        0.47|
+------+------------+
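
Because `Survived` only holds 0 and 1, `SUM(Survived) / COUNT(...)` per group gives the same number as `AVG(Survived)`. Here is a minimal plain-Python sketch of that idea. The rows are made up for illustration and are not taken from the dataset, and the helper name `survival_ratio_by_group` is an assumption:

```python
from collections import defaultdict

# Hypothetical (Pclass, Survived) pairs, for illustration only
rows = [(1, 1), (1, 0), (1, 1), (3, 0), (3, 0), (3, 1), (3, 0)]


def survival_ratio_by_group(pairs):
    """Return {group: sum(survived) / count}, rounded to 2 decimals."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for group, survived in pairs:
        totals[group] += survived
        counts[group] += 1
    return {g: round(totals[g] / counts[g], 2) for g in counts}


ratios = survival_ratio_by_group(rows)
print(ratios)
```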




**Let's take a break and continue after this.**

## Data Cleaning

**First and foremost, we must merge both the train and test datasets. (Hint: The union function can do this.)**



In [30]:

final_df = Train.union(Test)



**Display count:**

In [31]:
print("Train: " , Train.count())
print("Test: " , Test.count())
print("final: " , final_df.count())


Train:  891
Test:  438
final:  1329



**Can you define the number of null values in each column?**


In [32]:

all_col_null_count = final_df.select([F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)).alias(c) for c in final_df.columns])
all_col_null_count.show()


+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|          0|       0|     0|   0|  0|265|    0|    0|     0|   0| 1021|       3|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
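
The expression above counts, for each column, the rows where the value is null or NaN. The same idea in plain Python, as a sketch over a small made-up list of records (the values and the helper names `is_missing` and `count_missing` are assumptions for illustration):

```python
import math

# Hypothetical records; None marks a missing value
records = [
    {"Age": 22.0, "Cabin": None, "Embarked": "S"},
    {"Age": None, "Cabin": "C85", "Embarked": "C"},
    {"Age": float("nan"), "Cabin": None, "Embarked": None},
]


def is_missing(value):
    """True for None and for a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_missing(rows):
    """Return {column: number of missing values in that column}."""
    columns = rows[0].keys()
    return {c: sum(is_missing(r[c]) for r in rows) for c in columns}


missing = count_missing(records)
print(missing)
```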





**Create Dataframe for null values**

1. Column
2. Number of missing values.

In [33]:
null_cols = all_col_null_count.select("Age", "Cabin", "Embarked")
null_cols.show()

+---+-----+--------+
|Age|Cabin|Embarked|
+---+-----+--------+
|265| 1021|       3|
+---+-----+--------+




## Preprocessing 

**Create a temporary view in PySpark:**

In [34]:
Train.createOrReplaceTempView("Train_view")


**Can you show the "name" column from your temporary table?**

In [35]:
spark.sql(
    """ SELECT name
              From Train_view
             """
).show(truncate=False)

+-------------------------------------------------------+
|name                                                   |
+-------------------------------------------------------+
|Braund, Mr. Owen Harris                                |
|Cumings, Mrs. John Bradley (Florence Briggs Thayer)    |
|Heikkinen, Miss. Laina                                 |
|Futrelle, Mrs. Jacques Heath (Lily May Peel)           |
|Allen, Mr. William Henry                               |
|Moran, Mr. James                                       |
|McCarthy, Mr. Timothy J                                |
|Palsson, Master. Gosta Leonard                         |
|Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)      |
|Nasser, Mrs. Nicholas (Adele Achem)                    |
|Sandstrom, Miss. Marguerite Rut                        |
|Bonnell, Miss. Elizabeth                               |
|Saundercock, Mr. William Henry                         |
|Andersson, Mr. Anders Johan                            |
|Vestrom, Miss


**Run this code:**

In [38]:
import pyspark.sql.functions as F

# Raw string so that \. reaches the regex engine as an escaped dot
final_df = final_df.withColumn(
    "Title", F.regexp_extract(F.col("Name"), r"([A-Za-z]+)\.", 1)
)
final_df.createOrReplaceTempView("combined")
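
The pattern `([A-Za-z]+)\.` captures the run of letters right before the first period, which in these names is the title. Here is a plain-Python sketch of the same extraction using the standard `re` module. The helper name `extract_title` is an assumption for illustration; it returns an empty string when nothing matches, which mirrors what `regexp_extract` does:

```python
import re

# Same pattern as in the notebook cell: letters followed by a literal dot
TITLE_PATTERN = re.compile(r"([A-Za-z]+)\.")


def extract_title(name):
    """Return the first 'letters followed by a dot' group, or '' if none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else ""


for passenger in ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"]:
    print(passenger, "->", extract_title(passenger))
```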


**Display "Title" column and count "Title" column:**

In [39]:
spark.sql(
    """ SELECT Title 
              From combined
             """
).show(truncate=False)

spark.sql(
    """ SELECT count(Title)
              From combined
             """
).show(truncate=False)

+------+
|Title |
+------+
|Mr    |
|Mrs   |
|Miss  |
|Mrs   |
|Mr    |
|Mr    |
|Mr    |
|Master|
|Mrs   |
|Mrs   |
|Miss  |
|Miss  |
|Mr    |
|Mr    |
|Miss  |
|Mrs   |
|Master|
|Mr    |
|Mrs   |
|Mrs   |
+------+
only showing top 20 rows

+------------+
|count(Title)|
+------------+
|1329        |
+------------+




In [40]:
title_counts = spark.sql(
    """ SELECT Title , count(Title) as count
              From combined
              GROUP BY Title
             """
)

title_counts.show()

+--------+-----+
|   Title|count|
+--------+-----+
|     Don|    1|
|    Miss|  257|
|Countess|    2|
|     Col|    4|
|     Rev|    9|
|    Lady|    2|
|  Master|   56|
|     Mme|    1|
|    Capt|    2|
|      Mr|  786|
|      Dr|   11|
|     Mrs|  186|
|     Sir|    2|
|Jonkheer|    2|
|    Mlle|    4|
|   Major|    3|
|      Ms|    1|
+--------+-----+




**We can see that Dr, Rev, Major, Col, Mlle, Capt, Don, Jonkheer, Countess, Ms, Sir, Lady, and Mme are rare titles, so create a dictionary that maps each of them to "Rare".**

In [41]:
import numpy as np

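# Titles seen fewer than 50 times are treated as rare
rare = np.array(title_counts.select("Title").where("count < 50").collect())
rare = rare.flatten()

# Use >= so a title with exactly 50 occurrences is not left out of both lists
not_rare = np.array(title_counts.select("Title").where("count >= 50").collect())
not_rare = not_rare.flatten()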
print(not_rare)
print(type(not_rare))

['Miss' 'Master' 'Mr' 'Mrs']
<class 'numpy.ndarray'>



In [43]:
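# Map every rare title to "Rare"; use plain str keys rather than numpy strings
titles_map = {str(k): "Rare" for k in rare}

# Common titles map to themselves
for val in not_rare:
    print(type(str(val)))
    titles_map[str(val)] = str(val)

print(titles_map)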


<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
{'Don': 'Rare', 'Countess': 'Rare', 'Col': 'Rare', 'Rev': 'Rare', 'Lady': 'Rare', 'Mme': 'Rare', 'Capt': 'Rare', 'Dr': 'Rare', 'Sir': 'Rare', 'Jonkheer': 'Rare', 'Mlle': 'Rare', 'Major': 'Rare', 'Ms': 'Rare', 'Miss': 'Miss', 'Master': 'Master', 'Mr': 'Mr', 'Mrs': 'Mrs'}



**Run the function:**

In [44]:
def impute_title(title):
    # Look up the title in the titles_map dictionary built above
    return titles_map[title]


# If your dictionary has a different name, replace titles_map with that name.
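
Note that `titles_map[title]` raises a `KeyError` for any title that is not in the dictionary, for example an empty string or a title that was never counted. One possible defensive variant is sketched below. The name `impute_title_safe`, the shortened mapping, and the choice of "Rare" as the fallback are all assumptions, not part of the assignment:

```python
# Hypothetical mapping, shortened for illustration
titles_map = {"Mr": "Mr", "Mrs": "Mrs", "Miss": "Miss", "Master": "Master", "Dr": "Rare"}


def impute_title_safe(title, mapping=titles_map, default="Rare"):
    """Look up a title; fall back to `default` for unknown or missing titles."""
    if title is None:
        return default
    return mapping.get(title, default)


print(impute_title_safe("Mr"))
print(impute_title_safe("Dona"))
```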


In [45]:
print(type(titles_map))

<class 'dict'>



**Apply the function on "Title" column using UDF:**

In [46]:
final_df.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)
 |-- Title: string (nullable = true)




In [47]:
from pyspark.sql.types import StringType

# Wrap the Python function as a UDF that returns a string column
impute_title_UDF = F.udf(impute_title, StringType())

final_df = final_df.withColumn("Title", impute_title_UDF("Title"))
final_df.show(5)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+-----+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|Title|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+-----+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|   Mr|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|  Mrs|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S| Miss|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|  Mrs|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|   Mr|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+-----+


**Display "Title" from table and group by "Title" column:**

In [49]:
final_df.createOrReplaceTempView("temp")
spark.sql("select Title from temp GROUP BY Title").show()

+------+
| Title|
+------+
|  Miss|
|Master|
|    Mr|
|   Mrs|
|  Rare|
+------+




## **Preprocessing Age**

**Based on the "age" column mean, you will fill in the missing age values:**

In [50]:
mean = final_df.select(F.mean(F.col('age'))).collect()[0][0]



**Fill missing with "age" mean:**

In [51]:
final_df = final_df.na.fill(mean, "Age")
final_df.show(5)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+-----+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|Title|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+-----+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|   Mr|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|  Mrs|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S| Miss|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|  Mrs|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|   Mr|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+-----+
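
Mean imputation replaces each missing value with the average of the known values. A tiny plain-Python sketch of that step, with made-up ages and a hypothetical helper name `fill_with_mean`:

```python
# Hypothetical ages; None marks a missing value
ages = [22.0, None, 26.0, 35.0, None]


def fill_with_mean(values):
    """Replace None entries with the mean of the non-missing entries."""
    known = [v for v in values if v is not None]
    mean_value = sum(known) / len(known)
    return [mean_value if v is None else v for v in values]


filled = fill_with_mean(ages)
print(filled)
```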


## **Preprocessing Embarked**

**Select the "Embarked" column, count its values, order by count descending, and save the result in a grouped_Embarked variable (named emb in the code below):**




In [53]:
emb = spark.sql(
    """ SELECT Embarked , count(Embarked) as Embarked_count
              From temp
              GROUP BY Embarked
              ORDER BY Embarked_count desc """
)


**Show your grouped_Embarked variable:**

In [54]:
emb.show()

+--------+--------------+
|Embarked|Embarked_count|
+--------+--------------+
|       S|           962|
|       C|           253|
|       Q|           111|
|    null|             0|
+--------+--------------+




**Get the maximum count from grouped_Embarked:**

In [55]:
# Named max_count to avoid shadowing Python's built-in max()
max_count = emb.select(F.max(F.col("Embarked_count"))).collect()[0][0]
max_count

962


**Fill missing values with 'S', the most frequent value in grouped_Embarked:**

In [56]:
final_df = final_df.na.fill("S", "Embarked")
final_df.show(5)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+-----+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|Title|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+-----+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|   Mr|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|  Mrs|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S| Miss|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|  Mrs|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|   Mr|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+-----+
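
The fill value "S" is typed by hand above. If you would rather derive the most frequent value from the data, the idea looks like this in plain Python. This is a sketch with made-up ports, and the helper name `most_frequent` is an assumption:

```python
from collections import Counter

# Hypothetical embarkation ports; None marks a missing value
ports = ["S", "C", "S", None, "Q", "S", None]


def most_frequent(values):
    """Return the most common non-missing value."""
    counts = Counter(v for v in values if v is not None)
    return counts.most_common(1)[0][0]


mode_port = most_frequent(ports)
filled_ports = [mode_port if p is None else p for p in ports]
print(mode_port, filled_ports)
```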


## **Preprocessing Cabin**

**Replace "cabin" column with first char from the string:**



In [57]:
def replace_with_first(input_str):
    if input_str is None:
        return input_str
    else:
        return input_str[0]


replace_with_first("ASmaa")

'A'


In [58]:
replace_with_first_udf = F.udf(lambda x : replace_with_first(x))
final_df = final_df.withColumn('cabin' , replace_with_first_udf(F.col('cabin')))




**Show the result:**

In [59]:
final_df.show(5)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+-----+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|cabin|Embarked|Title|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+-----+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|   Mr|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|    C|       C|  Mrs|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S| Miss|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1|    C|       S|  Mrs|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|   Mr|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+-----+


**Create the temporary view:**

In [60]:
final_df.createOrReplaceTempView("temp")


**Select "Cabin" column, count "Cabin" column, Group by "Cabin" column, Order By count DESC**  

In [62]:
spark.sql(
    """ SELECT cabin , count(cabin) as cabin_count
                From temp
                GROUP BY cabin
                ORDER BY cabin_count desc """
).show()

+-----+-----------+
|cabin|cabin_count|
+-----+-----------+
|    C|         82|
|    B|         77|
|    D|         52|
|    E|         51|
|    A|         23|
|    F|         18|
|    G|          4|
|    T|          1|
| null|          0|
+-----+-----------+




**Fill missing values with "U":**

In [63]:
final_df = final_df.na.fill("U", "cabin")
final_df.show(5)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+-----+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|cabin|Embarked|Title|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+-----+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25|    U|       S|   Mr|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|    C|       C|  Mrs|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925|    U|       S| Miss|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1|    C|       S|  Mrs|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05|    U|       S|   Mr|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+-----+


**StringIndexer: A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels). By default, this is ordered by label frequencies so the most frequent label gets index 0. The ordering behavior is controlled by setting stringOrderType. Its default value is ‘frequencyDesc’.**
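
To see what "ordered by label frequencies" means, here is a plain-Python sketch of the 'frequencyDesc' ordering. It is not the Spark implementation. The helper name `frequency_desc_index` is hypothetical, and breaking ties alphabetically is an assumption based on how recent Spark versions are documented:

```python
from collections import Counter


def frequency_desc_index(labels):
    """Map each label to an index: most frequent label gets 0.0."""
    counts = Counter(labels)
    # Descending count first; alphabetical order for ties (assumed)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


index = frequency_desc_index(["S", "C", "S", "Q", "S", "C"])
print(index)
```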

In [64]:
string_col = [c for (c, d) in final_df.dtypes]
string_col

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'cabin',
 'Embarked',
 'Title']


In [65]:
# output column names for StringIndexer
strInd_col = [c + "_Ind" for c in string_col]
strInd_col

['PassengerId_Ind',
 'Survived_Ind',
 'Pclass_Ind',
 'Name_Ind',
 'Sex_Ind',
 'Age_Ind',
 'SibSp_Ind',
 'Parch_Ind',
 'Ticket_Ind',
 'Fare_Ind',
 'cabin_Ind',
 'Embarked_Ind',
 'Title_Ind']


In [66]:
# output columns for OneHotEncoder

OHE_col = [c + "_OHE" for c in string_col]
OHE_col

['PassengerId_OHE',
 'Survived_OHE',
 'Pclass_OHE',
 'Name_OHE',
 'Sex_OHE',
 'Age_OHE',
 'SibSp_OHE',
 'Parch_OHE',
 'Ticket_OHE',
 'Fare_OHE',
 'cabin_OHE',
 'Embarked_OHE',
 'Title_OHE']


In [67]:
# getting numeric columns to be merged with OHE columns
numeric_cols = [
    c for (c, d) in final_df.dtypes if ((d == "double" or d == "int") and (c != "crew"))
]
numeric_cols

['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
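
Something to watch: the column lists built in these cells include `Survived` (the label) and `PassengerId` (a row identifier). When the label is part of the feature vector, the model can read the answer directly, which tends to inflate accuracy. One possible way to choose feature columns is sketched below over a `(name, dtype)` list shaped like `final_df.dtypes`. The names `LABEL` and `EXCLUDED` and the decision to also leave out `Name` and `Ticket` are assumptions, not part of the assignment:

```python
# (name, dtype) pairs shaped like final_df.dtypes in this notebook
dtypes = [
    ("PassengerId", "int"), ("Survived", "int"), ("Pclass", "int"),
    ("Name", "string"), ("Sex", "string"), ("Age", "double"),
    ("SibSp", "int"), ("Parch", "int"), ("Ticket", "string"),
    ("Fare", "double"), ("cabin", "string"), ("Embarked", "string"),
    ("Title", "string"),
]

LABEL = "Survived"  # assumed label column
# Assumed exclusions: the label, the row id, and near-unique text columns
EXCLUDED = {LABEL, "PassengerId", "Name", "Ticket"}

# String columns would go through StringIndexer and OneHotEncoder
categorical = [c for c, d in dtypes if d == "string" and c not in EXCLUDED]
# Numeric columns would go straight into the VectorAssembler
numeric = [c for c, d in dtypes if d in ("int", "double") and c not in EXCLUDED]

print(categorical)
print(numeric)
```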


In [68]:
# getting all features together and Instantiate VectorAssembler
all_columns = OHE_col + numeric_cols
all_columns

['PassengerId_OHE',
 'Survived_OHE',
 'Pclass_OHE',
 'Name_OHE',
 'Sex_OHE',
 'Age_OHE',
 'SibSp_OHE',
 'Parch_OHE',
 'Ticket_OHE',
 'Fare_OHE',
 'cabin_OHE',
 'Embarked_OHE',
 'Title_OHE',
 'PassengerId',
 'Survived',
 'Pclass',
 'Age',
 'SibSp',
 'Parch',
 'Fare']


**StringIndexer(inputCol=None, outputCol=None)**

In [69]:
from pyspark.ml.feature import StringIndexer , OneHotEncoder , VectorAssembler
#Instantiate StringIndexer
strInd = StringIndexer(inputCols=string_col , outputCols=strInd_col , handleInvalid='keep')



**OneHotEncoder(inputCols=None, outputCols=None)**

A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast), because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
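
The example in the paragraph above can be reproduced with a few lines of plain Python. This is only a sketch of the encoding rule, using a dense list instead of Spark's sparse vector, and the helper name `one_hot` is hypothetical:

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list; with drop_last the last category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    position = int(index)
    if position < size:
        vector[position] = 1.0
    return vector


print(one_hot(2.0, 5))  # category 2 of 5, last category dropped
print(one_hot(4.0, 5))  # the dropped last category
```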

In [70]:
#Instantiate OneHotEncoding
OHEncoder = OneHotEncoder(inputCols=strInd_col , outputCols=OHE_col)



**VectorAssembler: VectorAssembler(*, inputCols=None, outputCol=None). A feature transformer that merges multiple columns into a vector column.**
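
Conceptually, merging columns into a vector means concatenating scalar values and vector values, in the order of the input columns, into one flat list per row. A plain-Python sketch of that idea (the helper name `assemble` and the sample row are made up for illustration):

```python
def assemble(row, input_cols):
    """Concatenate scalar and list-valued columns into one flat feature list."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            features.extend(float(v) for v in value)
        else:
            features.append(float(value))
    return features


# Hypothetical row: one encoded column plus two numeric columns
row = {"Sex_OHE": [1.0, 0.0], "Age": 22.0, "Fare": 7.25}
features = assemble(row, ["Sex_OHE", "Age", "Fare"])
print(features)
```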



In [71]:
# Instantiate VectorAssembler
vecAssem = VectorAssembler(
    inputCols=all_columns, outputCol="features", handleInvalid="skip"
)


**Use the randomSplit function to split the data into training and test sets of 80% and 20% respectively:**

In [72]:
trainDF, testDF = final_df.randomSplit([0.8, 0.2], seed=42)
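
A weighted random split assigns each row to a split by chance, so the resulting sizes are close to 80% and 20% rather than exact, and the seed makes the assignment repeatable. The plain-Python sketch below illustrates the idea. It is not how Spark implements `randomSplit`, and the helper name `random_split` is an assumption:

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split at random, in proportion to `weights`."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each split on the [0, 1) interval
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point remainder
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train_rows, test_rows = random_split(list(range(1000)), [0.8, 0.2], seed=42)
print(len(train_rows), len(test_rows))
```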


**Pipeline: ML Pipelines provide a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning pipelines.**

**Build RandomForestClassifier model and use pipeline to fit and transform then display "prediction, Survived, features" columns**

In [73]:
from pyspark.ml.classification import RandomForestClassifier

rand_forest = RandomForestClassifier(featuresCol="features", labelCol="Survived")


In [74]:
from pyspark.ml import Pipeline
pip = Pipeline(stages=[strInd , OHEncoder , vecAssem , rand_forest] )
pip_model = pip.fit(trainDF)

#transform train data
train_transformed = pip_model.transform(trainDF)
train_pred = train_transformed.select("Survived" , "prediction")

#transform test data
test_transformed = pip_model.transform(testDF)
test_pred = test_transformed.select("Survived" , "prediction")



**Use MulticlassClassificationEvaluator and set the "labelCol" to "Survived",  "predictionCol" to "prediction", "metricName" to "accuracy"** 

In [75]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator()
evaluator.setPredictionCol("prediction")
evaluator.setLabelCol("Survived")
evaluator.evaluate(train_pred, {evaluator.metricName: "accuracy"})


0.9753199268738574
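
The "accuracy" metric is simply the share of rows where the prediction equals the label. A plain-Python sketch with made-up labels and predictions (the helper name `accuracy` is for illustration only):

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the label."""
    pairs = list(zip(labels, predictions))
    correct = sum(1 for label, pred in pairs if label == pred)
    return correct / len(pairs)


# Hypothetical labels and predictions: 4 of 5 match
score = accuracy([0, 1, 1, 0, 1], [0.0, 1.0, 0.0, 0.0, 1.0])
print(score)
```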


In [76]:
evaluator2 = MulticlassClassificationEvaluator()
evaluator2.setPredictionCol("prediction")
evaluator2.setLabelCol("Survived")
evaluator2.evaluate(test_pred, {evaluator2.metricName: "accuracy"})

0.9872024205226979


**When you are finished send the project via Google classroom**
**Please let me know if you have any questions.**
* nabieh.mostafa@yahoo.com
* +201015197566 (Whatsapp)

**Don't hate me; I push you to learn.**

**I will help you become an awesome data engineer.**

**Why did I say "data engineer"?**

**That is a tricky but optional question. If you would like to know the answer, ask me.**
