## Hands-On Exercise 6: Outlier Detection with Titanic dataset

In [1]:
# Install the required library (the separate "spark" package on PyPI is unrelated and not needed)
!pip install pyspark
from pyspark.sql import SparkSession
# Initialize a Spark session
spark = SparkSession.builder.appName("CSC533").getOrCreate()


In [4]:
raw_df = spark.read.option("header", "true").option("inferSchema","true").csv("/content/titanic (1).csv")

In [5]:
raw_df.show(2)

+-----------+--------+------+--------------------+------+----+-----+-----+---------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|Gender| Age|SibSp|Parch|   Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+---------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|A/5 21171|   7.25| NULL|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0| PC 17599|71.2833|  C85|       C|
+-----------+--------+------+--------------------+------+----+-----+-----+---------+-------+-----+--------+
only showing top 2 rows



#1.2. Total number of records: count()

In [6]:
raw_df.count()

891

#1.3. Basic statistics: describe()

In [7]:
raw_df.describe().show()

+-------+-----------------+-------------------+------------------+--------------------+------+------------------+------------------+-------------------+------------------+-----------------+-----+--------+
|summary|      PassengerId|           Survived|            Pclass|                Name|Gender|               Age|             SibSp|              Parch|            Ticket|             Fare|Cabin|Embarked|
+-------+-----------------+-------------------+------------------+--------------------+------+------------------+------------------+-------------------+------------------+-----------------+-----+--------+
|  count|              891|                891|               891|                 891|   891|               714|               891|                891|               891|              891|  204|     889|
|   mean|            446.0| 0.3838383838383838| 2.308641975308642|                NULL|  NULL| 29.69911764705882|0.5230078563411896|0.38159371492704824|260318.54916792738| 32.20420

#1.4. Filtering columns: select()

In [8]:
filtered_df = raw_df.select(['Survived', 'Pclass', 'Gender', 'Age', 'SibSp', 'Parch','Fare'])

In [9]:
filtered_df.show()

+--------+------+------+----+-----+-----+-------+
|Survived|Pclass|Gender| Age|SibSp|Parch|   Fare|
+--------+------+------+----+-----+-----+-------+
|       0|     3|  male|22.0|    1|    0|   7.25|
|       1|     1|female|38.0|    1|    0|71.2833|
|       1|     3|female|26.0|    0|    0|  7.925|
|       1|     1|female|35.0|    1|    0|   53.1|
|       0|     3|  male|35.0|    0|    0|   8.05|
|       0|     3|  male|NULL|    0|    0| 8.4583|
|       0|     1|  male|54.0|    0|    0|51.8625|
|       0|     3|  male| 2.0|    3|    1| 21.075|
|       1|     3|female|27.0|    0|    2|11.1333|
|       1|     2|female|14.0|    1|    0|30.0708|
|       1|     3|female| 4.0|    1|    1|   16.7|
|       1|     1|female|58.0|    0|    0|  26.55|
|       0|     3|  male|20.0|    0|    0|   8.05|
|       0|     3|  male|39.0|    1|    5| 31.275|
|       0|     3|female|14.0|    0|    0| 7.8542|
|       1|     2|female|55.0|    0|    0|   16.0|
|       0|     3|  male| 2.0|    4|    1| 29.125|


#1.7. Possible Outliers in Fare Column

In [10]:
filtered_df.select('Fare').describe().show()

+-------+-----------------+
|summary|             Fare|
+-------+-----------------+
|  count|              891|
|   mean| 32.2042079685746|
| stddev|49.69342859718089|
|    min|              0.0|
|    max|         512.3292|
+-------+-----------------+



#1.8. Bucketizer

In [11]:
# Import Bucketizer
from pyspark.ml.feature import Bucketizer
# Define Splits
splits = [0.0, 100.0, 200.0, 300.0, 400.0, float("inf")]
# Define bucketizer with splits, input and output columns
bucketizer = Bucketizer(splits=splits, inputCol="Fare", outputCol="bucketedFare")
# Transform the data to get a bucketed DataFrame
bucketed_df = bucketizer.transform(filtered_df)

In [12]:
bucketed_df.select('Fare','bucketedFare').show()

+-------+------------+
|   Fare|bucketedFare|
+-------+------------+
|   7.25|         0.0|
|71.2833|         0.0|
|  7.925|         0.0|
|   53.1|         0.0|
|   8.05|         0.0|
| 8.4583|         0.0|
|51.8625|         0.0|
| 21.075|         0.0|
|11.1333|         0.0|
|30.0708|         0.0|
|   16.7|         0.0|
|  26.55|         0.0|
|   8.05|         0.0|
| 31.275|         0.0|
| 7.8542|         0.0|
|   16.0|         0.0|
| 29.125|         0.0|
|   13.0|         0.0|
|   18.0|         0.0|
|  7.225|         0.0|
+-------+------------+
only showing top 20 rows



In [13]:
bucketed_df.groupBy('bucketedFare').count().orderBy('bucketedFare').show()

+------------+-----+
|bucketedFare|count|
+------------+-----+
|         0.0|  838|
|         1.0|   33|
|         2.0|   17|
|         4.0|    3|
+------------+-----+
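
The bucket counts above follow from how `Bucketizer` maps a value to a bucket index: each bucket covers a half-open range [x, y) between consecutive splits, and the last bucket also includes its upper bound. A minimal plain-Python sketch of that mapping, assuming the same splits as above (the helper name `bucket_index` is hypothetical and is not part of PySpark):

```python
from bisect import bisect_right

# Same split points as the Bucketizer cell above
SPLITS = [0.0, 100.0, 200.0, 300.0, 400.0, float("inf")]

def bucket_index(value, splits=SPLITS):
    """Return the index of the bucket [splits[i], splits[i+1]) containing value.

    Hypothetical helper that mimics the documented Bucketizer rule;
    the last bucket also includes its upper bound.
    """
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the split range")
    # bisect_right counts the splits <= value; subtract 1 for a 0-based bucket
    index = bisect_right(splits, value) - 1
    # Clamp so that the top split itself falls into the last bucket
    return min(index, len(splits) - 2)

for fare in [7.25, 100.0, 263.0, 512.3292]:
    print(fare, "->", bucket_index(fare))
```

No fare falls in the range [300, 400), which is why bucket 3.0 does not appear in the grouped counts.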



#1.10. Calculate Quantiles and IQR in PySpark

In [14]:
# Calculate quantiles (a relativeError of 0.0 requests exact quantiles)
quantiles = filtered_df.approxQuantile("Fare", [0.25, 0.75], 0.0)
# Show first quantile (25%)
print(quantiles[0])
# 7.8958
# Show third quantile (75%)
print(quantiles[1])

7.8958
31.0


In [20]:
# Calculate quantiles
quantiles = filtered_df.approxQuantile("Fare", [0.25, 0.75], 0.0)
# Show first quantile (25%) and assign to variable Q1
Q1 = quantiles[0]
print("Q1:",Q1)
# 7.8958
# Show third quantile (75%) and assign to variable Q3
Q3 = quantiles[1]
print("Q3:",Q3)

# Calculate IQR using Q3 and Q1
IQR = Q3 - Q1
print("IQR:",IQR)

# Calculate Lower Range using Q1 and IQR
lowerRange = Q1 - 1.5 * IQR
print("lowerRange:",lowerRange)

Q1: 7.8958
Q3: 31.0
IQR: 23.1042
lowerRange: -26.7605


In [22]:
# Calculate Upper Range using Q3 and IQR
upperRange = Q3 + 1.5 * IQR
print(upperRange)

65.6563
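
The bounds printed above can be reproduced by hand from the two quartiles. A small sketch, using the Q1 and Q3 values reported by `approxQuantile` (the helper `iqr_bounds` is hypothetical, introduced only for illustration):

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (lower, upper) whisker bounds for the box-and-whisker rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of Fare as printed by the cells above
lower, upper = iqr_bounds(7.8958, 31.0)
print(round(lower, 4), round(upper, 4))
```

Since the minimum Fare is 0.0, a negative lower bound means no value can be flagged on the low side; only the upper bound matters for this column.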


In [23]:
# Calculate lower Outliers
outliers_low = filtered_df.filter(filtered_df.Fare < lowerRange)
print(outliers_low.count())
# 0
outliers_low.show()

0
+--------+------+------+---+-----+-----+----+
|Survived|Pclass|Gender|Age|SibSp|Parch|Fare|
+--------+------+------+---+-----+-----+----+
+--------+------+------+---+-----+-----+----+



In [24]:
# Calculate Upper Outliers
outliers_upper = filtered_df.filter(filtered_df.Fare > upperRange)
print(outliers_upper.count())
# 116
outliers_upper.show(15)

116
+--------+------+------+----+-----+-----+--------+
|Survived|Pclass|Gender| Age|SibSp|Parch|    Fare|
+--------+------+------+----+-----+-----+--------+
|       1|     1|female|38.0|    1|    0| 71.2833|
|       0|     1|  male|19.0|    3|    2|   263.0|
|       1|     1|female|NULL|    1|    0|146.5208|
|       0|     1|  male|28.0|    1|    0| 82.1708|
|       1|     1|female|49.0|    1|    0| 76.7292|
|       1|     1|female|38.0|    0|    0|    80.0|
|       0|     1|  male|45.0|    1|    0|  83.475|
|       0|     2|  male|21.0|    0|    0|    73.5|
|       1|     1|female|23.0|    3|    2|   263.0|
|       0|     1|  male|21.0|    0|    1| 77.2875|
|       0|     1|  male|24.0|    0|    1|247.5208|
|       0|     2|  male|21.0|    2|    0|    73.5|
|       0|     1|  male|54.0|    0|    1| 77.2875|
|       0|     1|  male|24.0|    0|    0|    79.2|
|       1|     1|female|22.0|    1|    0|    66.6|
+--------+------+------+----+-----+-----+--------+
only showing top 15 rows



Assignment #1 - 4 (0.6pts for each column you choose, 2.4pts in total)
Remove outliers from the 4 columns you choose (you may include ‘Fare’) in the dataset using
the Box and Whisker method. Choose at least three different columns that have (possible)
outliers. For each column you choose,
• Analyze the column to see whether it has outliers. Does it need outlier removal or
not? Why?
• Show how you found the outliers using the IQR method in this exercise.

In [26]:
import pandas as pd

# Load the dataset
df = pd.read_csv('/content/titanic (1).csv')

# Columns to analyze
columns_to_analyze = ['Survived', 'Fare', 'Age', 'SibSp', 'Parch']

# Function to detect and remove outliers using IQR
def remove_outliers_iqr(df, column):
    """Print the IQR bounds for `column` and return (rows within bounds, has_outliers)."""
    Q1 = df[column].quantile(0.25)  # 25th percentile
    Q3 = df[column].quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1  # Interquartile Range

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    print(f"Column: {column}")
    print(f"Q1: {Q1}, Q3: {Q3}, IQR: {IQR}")
    print(f"Lower Bound: {lower_bound}, Upper Bound: {upper_bound}")

    # Check if column has outliers
    has_outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)].shape[0] > 0
    print(f"Outliers Detected: {has_outliers}")

    # Remove outliers. Comparisons with NaN are False, so rows with a
    # missing value in this column are dropped as well.
    filtered_df = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
    return filtered_df, has_outliers

# Apply IQR method to chosen columns, one after another: each pass works on
# the rows kept by the previous pass, so later quartiles use already-filtered data
for column in columns_to_analyze:
    df, outliers_detected = remove_outliers_iqr(df, column)
    if outliers_detected:
        print(f"Outliers removed from column: {column}")
    else:
        print(f"No outliers found in column: {column}")

Column: Survived
Q1: 0.0, Q3: 1.0, IQR: 1.0
Lower Bound: -1.5, Upper Bound: 2.5
Outliers Detected: False
No outliers found in column: Survived
Column: Fare
Q1: 7.9104, Q3: 31.0, IQR: 23.0896
Lower Bound: -26.724, Upper Bound: 65.6344
Outliers Detected: True
Outliers removed from column: Fare
Column: Age
Q1: 20.0, Q3: 37.0, IQR: 17.0
Lower Bound: -5.5, Upper Bound: 62.5
Outliers Detected: True
Outliers removed from column: Age
Column: SibSp
Q1: 0.0, Q3: 1.0, IQR: 1.0
Lower Bound: -1.5, Upper Bound: 2.5
Outliers Detected: True
Outliers removed from column: SibSp
Column: Parch
Q1: 0.0, Q3: 0.0, IQR: 0.0
Lower Bound: 0.0, Upper Bound: 0.0
Outliers Detected: True
Outliers removed from column: Parch
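
One side effect of the filter above is worth checking: a comparison with a missing value is always False, so a keep-mask of the form `(value >= lower) & (value <= upper)` also discards rows where the column is empty, such as passengers with no recorded Age. A plain-Python sketch of the effect, using made-up ages and the Age bounds printed above (the `in_bounds_keep_missing` variant is an assumption about what one might prefer, not what the notebook does):

```python
import math

def in_bounds(value, lower, upper):
    """Same logic as the notebook's keep-mask: False for outliers and for NaN."""
    return lower <= value <= upper

def in_bounds_keep_missing(value, lower, upper):
    """Hypothetical variant: keep missing values so they can be imputed later."""
    return math.isnan(value) or lower <= value <= upper

# Illustrative ages (not taken from the dataset); NaN marks a missing value
ages = [22.0, 71.0, math.nan, 38.0]
lower, upper = -5.5, 62.5  # Age bounds printed above

kept_strict = [a for a in ages if in_bounds(a, lower, upper)]
kept_with_missing = [a for a in ages if in_bounds_keep_missing(a, lower, upper)]
print(kept_strict)             # the NaN row and the outlier are both gone
print(len(kept_with_missing))  # only the outlier is gone
```

If missing ages are meant to be imputed later, as in the Imputer step of Hands-on Ex 4-2, the second variant keeps those rows available.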


Analysis of Columns for Outliers
1. Column: Fare
Outliers Detected: Yes
Q1: 7.9104, Q3: 31.0, IQR: 23.0896
Lower Bound: -26.724, Upper Bound: 65.6344
Values outside this range are considered outliers.
Need for Outlier Removal:
The outliers in Fare likely represent extreme cases, such as very high ticket prices for first-class passengers.
If the analysis focuses on general patterns, removing outliers may reduce noise and improve model performance.
However, if high fares are significant (e.g., predictive of survival in Titanic data), outliers should be retained.
2. Column: Age
Outliers Detected: Yes

Q1: 20.0, Q3: 37.0, IQR: 17.0
Lower Bound: -5.5, Upper Bound: 62.5
Values below -5.5 or above 62.5 are outliers.
Need for Outlier Removal:

Ages above 62.5 might be outliers, but they could represent valid older individuals in the dataset.
If the goal is to reduce skewness or focus on younger populations, removing outliers could be helpful.
Retain these values if they are relevant to the context (e.g., analyzing survival rates of older passengers).
3. Column: SibSp (Number of Siblings/Spouses Aboard)
Outliers Detected: Yes

Q1: 0.0, Q3: 1.0, IQR: 1.0
Lower Bound: -1.5, Upper Bound: 2.5
Values below -1.5 or above 2.5 are outliers.
Need for Outlier Removal:

High values for SibSp could represent passengers traveling with large families, which might be rare but valid cases.
Removing these outliers can improve modeling if high SibSp values distort relationships with the target variable.
Retain if large family groups are relevant to the analysis.
4. Column: Parch (Number of Parents/Children Aboard)
Outliers Detected: Yes

Q1: 0.0, Q3: 0.0, IQR: 0.0
Lower Bound: 0.0, Upper Bound: 0.0
Any non-zero values are outliers since $Q1 = Q3 = 0.0$.
Need for Outlier Removal:

Any non-zero Parch values are flagged because at least 75% of the remaining passengers have Parch = 0, which makes the IQR zero; most passengers did not travel with parents or children.
Removing these outliers might not always be necessary if the analysis seeks to understand family dynamics.
Retain if non-zero Parch values are meaningful for the analysis.
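
The Parch result shows a limitation of the rule: when at least three quarters of the values are identical, Q1 and Q3 coincide, the IQR is zero, and every other value is flagged. A sketch with made-up counts (the list below is illustrative, not the real column, and `flag_outliers` is a hypothetical helper):

```python
def flag_outliers(values, q1, q3, k=1.5):
    """Return the values that fall outside [q1 - k*IQR, q3 + k*IQR]."""
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Illustrative Parch-like data: mostly zeros, a few families
parch = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]

# With Q1 = Q3 = 0 (as printed above) both bounds collapse to 0
flagged = flag_outliers(parch, q1=0.0, q3=0.0)
print(flagged)  # every non-zero value is flagged
```

In such a case it may be more informative to count how many rows would be dropped before deciding whether to remove them.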
How Outliers Were Found Using the IQR Method
Steps Taken:

Calculate Q1 and Q3: These represent the 25th and 75th percentiles of the column values.
Compute IQR: $\text{IQR} = Q3 - Q1$.
Determine Bounds:
Lower Bound: $Q1 - 1.5 \times \text{IQR}$
Upper Bound: $Q3 + 1.5 \times \text{IQR}$
Identify Outliers:
Values below the lower bound or above the upper bound were flagged as outliers.
Example: Column Fare:

$Q1 = 7.9104$, $Q3 = 31.0$, $\text{IQR} = 23.0896$
$\text{Lower Bound} = 7.9104 - 1.5 \times 23.0896 = -26.724$
$\text{Upper Bound} = 31.0 + 1.5 \times 23.0896 = 65.6344$
Outliers: All values outside the range [-26.724, 65.6344].
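
The Q1 used in this example (7.9104, from pandas) differs slightly from the 7.8958 returned by `approxQuantile` earlier. pandas interpolates linearly between neighbouring sorted values by default, whereas the Spark result appears to be a value that occurs in the column. A small sketch of the two conventions on made-up data (both helpers are hypothetical; `observed_quantile` is only a stand-in for "return an actual data point" and is not Spark's algorithm):

```python
import math

def linear_quantile(values, q):
    """Quantile with linear interpolation between neighbouring sorted values."""
    s = sorted(values)
    pos = (len(s) - 1) * q
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def observed_quantile(values, q):
    """Quantile that always returns a value present in the data (lower neighbour)."""
    s = sorted(values)
    return s[math.floor((len(s) - 1) * q)]

data = [1.0, 2.0, 3.0, 4.0]  # made-up numbers
print(linear_quantile(data, 0.25))    # 1.75
print(observed_quantile(data, 0.25))  # 1.0
```

Small differences of this kind shift the bounds only slightly (65.6344 versus 65.6563 for Fare), but they explain why the two tools do not report identical quartiles.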
Output:

The bounds and outlier status for each column were computed and displayed, confirming whether outliers exist.
Conclusion
Outliers were detected in all four feature columns analyzed (Fare, Age, SibSp, and Parch); Survived had none.
The decision to remove or retain outliers depends on the dataset's context and the goals of the analysis.
The IQR method effectively identified extreme values, enabling data cleaning for better modeling.

## Assignment #5 (3.6pts)
After you removed all outliers from the columns you chose, redo Hands-on Ex 4-2 and evaluate
the model you trained with the no-outliers dataset. Explain the differences you found, e.g.,
AUROC and AUPR, between the model trained with outliers and the model trained with the
no-outliers dataset. Better or not?

#Assignment No 4.2

In [27]:
filtered_df = raw_df.select(['Survived', 'Pclass', 'Gender', 'Age', 'SibSp', 'Parch', 'Fare'])
filtered_df.show(2)

+--------+------+------+----+-----+-----+-------+
|Survived|Pclass|Gender| Age|SibSp|Parch|   Fare|
+--------+------+------+----+-----+-----+-------+
|       0|     3|  male|22.0|    1|    0|   7.25|
|       1|     1|female|38.0|    1|    0|71.2833|
+--------+------+------+----+-----+-----+-------+
only showing top 2 rows



In [28]:
filtered_df.show(10)

+--------+------+------+----+-----+-----+-------+
|Survived|Pclass|Gender| Age|SibSp|Parch|   Fare|
+--------+------+------+----+-----+-----+-------+
|       0|     3|  male|22.0|    1|    0|   7.25|
|       1|     1|female|38.0|    1|    0|71.2833|
|       1|     3|female|26.0|    0|    0|  7.925|
|       1|     1|female|35.0|    1|    0|   53.1|
|       0|     3|  male|35.0|    0|    0|   8.05|
|       0|     3|  male|NULL|    0|    0| 8.4583|
|       0|     1|  male|54.0|    0|    0|51.8625|
|       0|     3|  male| 2.0|    3|    1| 21.075|
|       1|     3|female|27.0|    0|    2|11.1333|
|       1|     2|female|14.0|    1|    0|30.0708|
+--------+------+------+----+-----+-----+-------+
only showing top 10 rows



In [29]:
from pyspark.ml.feature import Imputer

# Define imputer for Age column
imputer = Imputer(strategy='mean', inputCols=['Age'], outputCols=['ImputedAge'])

# Apply imputer
imputed_df = imputer.fit(filtered_df).transform(filtered_df)
imputed_df.show()

+--------+------+------+----+-----+-----+-------+-----------------+
|Survived|Pclass|Gender| Age|SibSp|Parch|   Fare|       ImputedAge|
+--------+------+------+----+-----+-----+-------+-----------------+
|       0|     3|  male|22.0|    1|    0|   7.25|             22.0|
|       1|     1|female|38.0|    1|    0|71.2833|             38.0|
|       1|     3|female|26.0|    0|    0|  7.925|             26.0|
|       1|     1|female|35.0|    1|    0|   53.1|             35.0|
|       0|     3|  male|35.0|    0|    0|   8.05|             35.0|
|       0|     3|  male|NULL|    0|    0| 8.4583|29.69911764705882|
|       0|     1|  male|54.0|    0|    0|51.8625|             54.0|
|       0|     3|  male| 2.0|    3|    1| 21.075|              2.0|
|       1|     3|female|27.0|    0|    2|11.1333|             27.0|
|       1|     2|female|14.0|    1|    0|30.0708|             14.0|
|       1|     3|female| 4.0|    1|    1|   16.7|              4.0|
|       1|     1|female|58.0|    0|    0|  26.55

In [30]:
from pyspark.ml.feature import StringIndexer

# Define indexer for Gender column
gender_indexer = StringIndexer(inputCol="Gender", outputCol="IndexedGender")

# Apply indexer
indexed_df = gender_indexer.fit(imputed_df).transform(imputed_df)
indexed_df.show(20)

+--------+------+------+----+-----+-----+-------+-----------------+-------------+
|Survived|Pclass|Gender| Age|SibSp|Parch|   Fare|       ImputedAge|IndexedGender|
+--------+------+------+----+-----+-----+-------+-----------------+-------------+
|       0|     3|  male|22.0|    1|    0|   7.25|             22.0|          0.0|
|       1|     1|female|38.0|    1|    0|71.2833|             38.0|          1.0|
|       1|     3|female|26.0|    0|    0|  7.925|             26.0|          1.0|
|       1|     1|female|35.0|    1|    0|   53.1|             35.0|          1.0|
|       0|     3|  male|35.0|    0|    0|   8.05|             35.0|          0.0|
|       0|     3|  male|NULL|    0|    0| 8.4583|29.69911764705882|          0.0|
|       0|     1|  male|54.0|    0|    0|51.8625|             54.0|          0.0|
|       0|     3|  male| 2.0|    3|    1| 21.075|              2.0|          0.0|
|       1|     3|female|27.0|    0|    2|11.1333|             27.0|          1.0|
|       1|     2
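
By default `StringIndexer` orders labels by descending frequency, so the most common label receives index 0.0; this is consistent with `male` mapping to 0.0 and `female` to 1.0 in the table above. A plain-Python sketch of that ordering (the helper `index_labels` is hypothetical, and its alphabetical tie-break is an assumption):

```python
from collections import Counter

def index_labels(labels):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; ties broken alphabetically (assumed)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Illustrative sample with more 'male' than 'female' entries
mapping = index_labels(["male", "female", "female", "male", "male"])
print(mapping)
```

The index values are category codes, not quantities; a tree-based model such as the random forest used later only needs them to separate the categories.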

In [32]:
# Import the class
from pyspark.ml.feature import VectorAssembler
# Creating a vector
assembler = VectorAssembler(inputCols=['Pclass', 'SibSp', 'Parch', 'Fare', 'ImputedAge', 'IndexedGender'], outputCol='features')
# Transform
features_df = assembler.transform(indexed_df)

In [33]:
features_df.show(10)

+--------+------+------+----+-----+-----+-------+-----------------+-------------+--------------------+
|Survived|Pclass|Gender| Age|SibSp|Parch|   Fare|       ImputedAge|IndexedGender|            features|
+--------+------+------+----+-----+-----+-------+-----------------+-------------+--------------------+
|       0|     3|  male|22.0|    1|    0|   7.25|             22.0|          0.0|[3.0,1.0,0.0,7.25...|
|       1|     1|female|38.0|    1|    0|71.2833|             38.0|          1.0|[1.0,1.0,0.0,71.2...|
|       1|     3|female|26.0|    0|    0|  7.925|             26.0|          1.0|[3.0,0.0,0.0,7.92...|
|       1|     1|female|35.0|    1|    0|   53.1|             35.0|          1.0|[1.0,1.0,0.0,53.1...|
|       0|     3|  male|35.0|    0|    0|   8.05|             35.0|          0.0|[3.0,0.0,0.0,8.05...|
|       0|     3|  male|NULL|    0|    0| 8.4583|29.69911764705882|          0.0|[3.0,0.0,0.0,8.45...|
|       0|     1|  male|54.0|    0|    0|51.8625|             54.0|      

In [34]:
features_df.select(['Survived', 'features']).show(10)

+--------+--------------------+
|Survived|            features|
+--------+--------------------+
|       0|[3.0,1.0,0.0,7.25...|
|       1|[1.0,1.0,0.0,71.2...|
|       1|[3.0,0.0,0.0,7.92...|
|       1|[1.0,1.0,0.0,53.1...|
|       0|[3.0,0.0,0.0,8.05...|
|       0|[3.0,0.0,0.0,8.45...|
|       0|[1.0,0.0,0.0,51.8...|
|       0|[3.0,3.0,1.0,21.0...|
|       1|[3.0,0.0,2.0,11.1...|
|       1|[2.0,1.0,0.0,30.0...|
+--------+--------------------+
only showing top 10 rows



In [35]:
# Split dataset into training and testing sets
trainData, testData = features_df.randomSplit([0.8, 0.2], seed=42)

print("Training data count:", trainData.count())
print("Testing data count:", testData.count())

Training data count: 746
Testing data count: 145


In [36]:
print(trainData.columns)  # Check for 'ImputedAge'


['Survived', 'Pclass', 'Gender', 'Age', 'SibSp', 'Parch', 'Fare', 'ImputedAge', 'IndexedGender', 'features']


In [37]:
from pyspark.ml.classification import RandomForestClassifier

# Define Random Forest Classifier
rf = RandomForestClassifier(featuresCol='features', labelCol='Survived')

# Train the model
modelRF = rf.fit(trainData)
print(modelRF)

RandomForestClassificationModel: uid=RandomForestClassifier_ddd35202dcf6, numTrees=20, numClasses=2, numFeatures=6


In [38]:
# Generate predictions
predictions_df = modelRF.transform(testData)
predictions_df.select(['Survived', 'features', 'probability', 'prediction']).show(5)

+--------+--------------------+--------------------+----------+
|Survived|            features|         probability|prediction|
+--------+--------------------+--------------------+----------+
|       0|[1.0,0.0,0.0,28.7...|[0.06884233663291...|       1.0|
|       0|[1.0,0.0,0.0,26.0...|[0.73137718339878...|       0.0|
|       0|[1.0,0.0,0.0,27.7...|[0.67084785016574...|       0.0|
|       0|[1.0,0.0,0.0,39.6...|[0.67084785016574...|       0.0|
|       0|[1.0,1.0,0.0,108....|[0.52829447694717...|       0.0|
+--------+--------------------+--------------------+----------+
only showing top 5 rows



In [39]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Define evaluators
evaluator_roc = BinaryClassificationEvaluator(labelCol='Survived', metricName='areaUnderROC')
evaluator_pr = BinaryClassificationEvaluator(labelCol='Survived', metricName='areaUnderPR')

# Evaluate the model
roc_auc = evaluator_roc.evaluate(predictions_df)
pr_auc = evaluator_pr.evaluate(predictions_df)

print("Area under ROC curve:", roc_auc)
print("Area under PR curve:", pr_auc)

Area under ROC curve: 0.8881733021077285
Area under PR curve: 0.8758526587128705


## Explain the differences you found, e.g., AUROC and AUPR, between the model trained with outliers and the model trained without outliers. Better or not?

To evaluate the impact of outlier removal on model performance, we compare two key metrics: the Area Under the ROC Curve (AUROC) and the Area Under the Precision-Recall Curve (AUPR). These metrics help assess how well the model distinguishes between classes and balances precision with recall.

An AUROC of about 0.888, as reported by the evaluation above, indicates strong overall performance in distinguishing between positive and negative classes. This metric captures the trade-off between the true positive rate and the false positive rate across thresholds. Removing outliers can improve AUROC if those outliers are noisy or unrepresentative of meaningful patterns. However, if outliers are informative or represent important edge cases, their removal might reduce generalization, lowering the AUROC.
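
AUROC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise definition on made-up labels and scores (the function `auroc_pairwise` is hypothetical and is not how `BinaryClassificationEvaluator` computes the metric):

```python
def auroc_pairwise(labels, scores):
    """AUROC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: three of the four positive/negative pairs are ordered correctly
print(auroc_pairwise([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Under this reading, a score of 0.5 corresponds to random ranking and 1.0 to a perfect ordering of survivors above non-survivors.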

The AUPR score of about 0.876 reflects how well the model balances precision and recall, which is especially important in imbalanced datasets. Because AUPR focuses on the positive class, it may react more strongly than AUROC when rows are removed. Removing noisy outliers can improve AUPR by reducing false positives and negatives, but discarding valuable rare cases could hurt recall and overall predictive performance.

Ultimately, whether outlier removal helps depends on the nature of the outliers. If they’re mostly noise, both AUROC and AUPR typically improve. If they carry valuable information, removing them might reduce performance.

In this scenario, the AUROC and AUPR values are high, which would be consistent with most outliers being noise, but that conclusion holds only if the scores exceed those of the model trained on the full dataset in Hands-on Ex 4-2. Comparing the two sets of scores side by side confirms whether outlier removal truly enhanced performance. For deeper insight, plotting the ROC and PR curves of both models can further clarify the differences.

