<a href="https://colab.research.google.com/github/dgoon29/ai_ml_as_hw/blob/main/hw2/hw2_deep_goon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 2: Analyzing Titanic Dataset
### Deep Goon


**source:** https://www.kaggle.com/c/titanic/data

### **Step 0:** Setup

In [12]:
# Imports

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.2.3/spark-3.2.3-bin-hadoop2.7.tgz
!tar xf spark-3.2.3-bin-hadoop2.7.tgz
!pip install -q findspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.3-bin-hadoop2.7"

import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

In [5]:
# from google.colab import drive
# drive.mount('/content/drive/')

# # d('/content/drive/My Drive/Colab Notebooks/Spark ML Notebooks/My Work/Homework/Homework 2/titanic_train.csv')

Mounted at /content/drive/


### **Step 1:** Load Titanic Dataset from local folder




In [None]:
from google.colab import files
files.upload()

In [23]:
dataset = spark.read.csv('titanic_train.csv', inferSchema=True, header=True)

### **Step 2:** Familiarize yourself with the dataset

#### i) Print the dataset schema

In [10]:
dataset.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



#### ii) Print first 10 rows of the dataset

In [13]:
dataset.show(10)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|          6|       0|     3|    Moran, Mr. James|  male|null|    0|    0|      

#### iii) Summary statistics

In [16]:
dataset.summary().show()

+-------+-----------------+-------------------+------------------+--------------------+------+------------------+------------------+-------------------+------------------+-----------------+-----+--------+
|summary|      PassengerId|           Survived|            Pclass|                Name|   Sex|               Age|             SibSp|              Parch|            Ticket|             Fare|Cabin|Embarked|
+-------+-----------------+-------------------+------------------+--------------------+------+------------------+------------------+-------------------+------------------+-----------------+-----+--------+
|  count|              891|                891|               891|                 891|   891|               714|               891|                891|               891|              891|  204|     889|
|   mean|            446.0| 0.3838383838383838| 2.308641975308642|                null|  null| 29.69911764705882|0.5230078563411896|0.38159371492704824|260318.54916792738| 32.20420

In [28]:
from pyspark.sql.functions import col

dataset.groupBy("Sex").count().orderBy(col("count").desc()).show(10)
dataset.groupBy("Embarked").count().orderBy(col("count").desc()).show(10)



+------+-----+
|   Sex|count|
+------+-----+
|  male|  577|
|female|  314|
+------+-----+

+--------+-----+
|Embarked|count|
+--------+-----+
|       S|  644|
|       C|  168|
|       Q|   77|
|    null|    2|
+--------+-----+



In [33]:
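# Note: this import shadows Python's built-in sum() for the rest of the notebook
from pyspark.sql.functions import col, sum

# Count the nulls in each column: cast each isNull() flag to int and sum it
null_counts = dataset.select([sum(col(c).isNull().cast("int")).alias(c) for c in dataset.columns])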

null_counts.show()


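# Drop every row that has a null in any column.
# Cabin is null in 687 of 891 rows, so at most 204 rows can survive this.
# The following cells keep working from `dataset`, not `df_clean`.
df_clean = dataset.dropna()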


+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|          0|       0|     0|   0|  0|177|    0|    0|     0|   0|  687|       2|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+



In [34]:
dataset.count()

891

##### Answer to v) I will drop the `Cabin` column because it is mostly empty (687 nulls out of 891 observations, about 77%). For the 177 missing `Age` values, I will fill in the average age.
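The choice between dropping a column and imputing it comes down to how much of the column is missing. As a rough sketch in plain Python (no Spark session needed), the fractions can be worked out from the null counts printed above. The `DROP_THRESHOLD` cutoff of 0.5 is an illustrative assumption, not a rule from the assignment.

```python
# Null counts and row count copied from the notebook output above
total_rows = 891
null_counts = {"Age": 177, "Cabin": 687, "Embarked": 2}

# Hypothetical cutoff: treat a column as "mostly empty" above this fraction
DROP_THRESHOLD = 0.5

# Fraction of rows missing in each column
null_fraction = {column: n / total_rows for column, n in null_counts.items()}

for column, fraction in null_fraction.items():
    action = "drop the column" if fraction > DROP_THRESHOLD else "impute or drop rows"
    print(f"{column}: {fraction:.1%} missing -> {action}")
```

Under this assumed cutoff only `Cabin` is a candidate for dropping; `Age` and `Embarked` are missing in a minority of rows.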

In [40]:
from pyspark.sql.functions import col

# Feature columns to keep; Cabin is left out because it is mostly null
feature_columns = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
target_column = 'Survived'

df = dataset.select(*feature_columns, target_column)

# Cast the numeric columns (including the Survived label) to double
for column in ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Survived']:
    df = df.withColumn(column, col(column).cast('double'))


In [48]:
df.show(10)

+------+------+-----------------+-----+-----+-------+--------+--------+-----+
|Pclass|   Sex|              Age|SibSp|Parch|   Fare|Embarked|Survived|AgeNA|
+------+------+-----------------+-----+-----+-------+--------+--------+-----+
|   3.0|  male|             22.0|  1.0|  0.0|   7.25|       S|     0.0|    0|
|   1.0|female|             38.0|  1.0|  0.0|71.2833|       C|     1.0|    0|
|   3.0|female|             26.0|  0.0|  0.0|  7.925|       S|     1.0|    0|
|   1.0|female|             35.0|  1.0|  0.0|   53.1|       S|     1.0|    0|
|   3.0|  male|             35.0|  0.0|  0.0|   8.05|       S|     0.0|    0|
|   3.0|  male|29.69911764705882|  0.0|  0.0| 8.4583|       Q|     0.0|    1|
|   1.0|  male|             54.0|  0.0|  0.0|51.8625|       S|     0.0|    0|
|   3.0|  male|              2.0|  3.0|  1.0| 21.075|       S|     0.0|    0|
|   3.0|female|             27.0|  0.0|  2.0|11.1333|       S|     1.0|    0|
|   2.0|female|             14.0|  1.0|  0.0|30.0708|       C|  

In [47]:
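from pyspark.sql.functions import col, when, mean

# Mean of the observed ages (Spark's mean ignores nulls); computed before filling
mean_age = df.select(mean('Age')).collect()[0][0]

# Flag rows whose Age was missing, before the nulls are overwritten
df = df.withColumn('AgeNA', when(col('Age').isNull(), 1).otherwise(0))

# Replace the missing ages with the mean
df = df.na.fill({'Age': mean_age})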
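The order of the three steps in this cell matters: the mean is taken over the observed ages only, the `AgeNA` flag is recorded while the nulls are still visible, and only then are the nulls filled. The same logic can be sketched in plain Python on a small made-up list. The ages below are toy values, not the real column, and the variable names are only illustrative.

```python
from statistics import mean

# Toy ages with gaps (None stands in for a Spark null); not the real data
ages = [22.0, 38.0, None, 35.0, None, 54.0]

# Step 1: mean over the observed values only
observed = [a for a in ages if a is not None]
mean_age = mean(observed)

# Step 2: record which entries were missing, before overwriting them
age_na = [1 if a is None else 0 for a in ages]

# Step 3: fill the gaps with the mean
ages_filled = [mean_age if a is None else a for a in ages]

print(mean_age, age_na, ages_filled)
```

Filling with the mean leaves the column mean unchanged, and the mean of the indicator equals the fraction of rows that were missing. That matches the summary output in this notebook, where the `AgeNA` mean of about 0.1987 is 177/891.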

In [49]:
df.summary().show()

+-------+------------------+------+------------------+------------------+-------------------+-----------------+--------+-------------------+-------------------+
|summary|            Pclass|   Sex|               Age|             SibSp|              Parch|             Fare|Embarked|           Survived|              AgeNA|
+-------+------------------+------+------------------+------------------+-------------------+-----------------+--------+-------------------+-------------------+
|  count|               891|   891|               891|               891|                891|              891|     889|                891|                891|
|   mean| 2.308641975308642|  null|29.699117647058763|0.5230078563411896|0.38159371492704824| 32.2042079685746|    null| 0.3838383838383838|0.19865319865319866|
| stddev|0.8360712409770491|  null|13.002015226002891|1.1027434322934315| 0.8060572211299488|49.69342859718089|    null|0.48659245426485753|0.39921043398804806|
|    min|               1.0|female

In [50]:
# Re-check the number of nulls in each column after imputing Age
from pyspark.sql.functions import col, sum as spark_sum

null_counts_2 = df.select([spark_sum(col(c).isNull().cast("int")).alias(c) for c in df.columns])

null_counts_2.show()

+------+---+---+-----+-----+----+--------+--------+-----+
|Pclass|Sex|Age|SibSp|Parch|Fare|Embarked|Survived|AgeNA|
+------+---+---+-----+-----+----+--------+--------+-----+
|     0|  0|  0|    0|    0|   0|       2|       0|    0|
+------+---+---+-----+-----+----+--------+--------+-----+
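The check above shows that two `Embarked` values are still null. The notebook does not say how these are handled; one common option, shown here only as an assumption, is to fill them with the most frequent port. The sketch below uses plain Python with the counts from the earlier `groupBy("Embarked")` output and a made-up toy column.

```python
# Port counts copied from the groupBy("Embarked") output earlier in the notebook
embarked_counts = {"S": 644, "C": 168, "Q": 77}

# Most frequent port (the mode of the observed values)
most_common_port = max(embarked_counts, key=embarked_counts.get)

# Toy column with gaps (None stands in for a Spark null); not the real data
embarked = ["S", None, "C", None, "Q"]

# Fill the gaps with the mode
embarked_filled = [most_common_port if port is None else port for port in embarked]

print(most_common_port, embarked_filled)
```

In PySpark the equivalent fill could reuse the same `df.na.fill({...})` pattern applied to `Age`, passing the chosen port for `Embarked`. Dropping the two rows instead would be an equally reasonable choice at this scale.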

