## HW Assignment 2

In this assignment, we will learn how to use Apache Spark DataFrames and MLlib, Spark's machine learning package.
Run the code below to start a local Spark instance.

In [None]:
# Install Spark 3.2.4
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.2.4/spark-3.2.4-bin-hadoop2.7.tgz
!tar xf spark-3.2.4-bin-hadoop2.7.tgz

In [None]:
# Set Environment Variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.4-bin-hadoop2.7"

In [None]:
# Note: this pins pyspark 2.4.0, while the Spark distribution downloaded above is 3.2.4
!python -m pip install --upgrade pyspark==2.4.0
!python -m pip install -q findspark

  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-2.4.0-py2.py3-none-any.whl size=213793581 sha256=a681b26a5ee96d98f92d7cfbb60a6060584ccd6e7243beaa0f71bfd1d112108f
  Stored in directory: /root/.cache/pip/wheels/f7/6f/a8/4d2c26233a51a570ccf015208651aeed4590ed3f935b70e7c6
Successfully built pyspark
Installing collected packages: py4j, pyspark
  Attempting uninstall: py4j
    Found existing installation: py4j 0.10.9.7
    Uninstalling py4j-0.10.9.7:
      Successfully uninstalled py4j-0.10.9.7
Successfully installed py4j-0.10.7 pyspark-2.4.0


In [None]:
import findspark
findspark.init()

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
from pyspark.sql import SparkSession

In [None]:
APP_NAME = "HW2"

In [None]:
spark = SparkSession.builder.appName(APP_NAME).getOrCreate()

In [None]:
spark

1. a. In the first part of this assignment, we will load a dataset and explore it.
Load the travel insurance dataset provided with the assignment, with the option inferSchema set to true. Print 20 rows from the table.

In [None]:
df = spark.read.load('/content/gdrive/My Drive//travel insurance.csv', format='csv', inferSchema=True, header=True)

df.show(20)

+------+-------------+--------------------+--------------------+-----+--------+--------------------+---------+--------------------+------+---+
|Agency|  Agency Type|Distribution Channel|        Product Name|Claim|Duration|         Destination|Net Sales|Commision (in value)|Gender|Age|
+------+-------------+--------------------+--------------------+-----+--------+--------------------+---------+--------------------+------+---+
|   CBH|Travel Agency|             Offline|  Comprehensive Plan|   No|     186|            MALAYSIA|    -29.0|                9.57|     F| 81|
|   CBH|Travel Agency|             Offline|  Comprehensive Plan|   No|     186|            MALAYSIA|    -29.0|                9.57|     F| 71|
|   CWT|Travel Agency|              Online|Rental Vehicle Ex...|   No|      65|           AUSTRALIA|    -49.5|                29.7|  null| 32|
|   CWT|Travel Agency|              Online|Rental Vehicle Ex...|   No|      60|           AUSTRALIA|    -39.6|               23.76|  null| 32|

b. Rename the Commision (in value) column to Commission. Assign this dataframe to a new variable.

In [None]:
df_renamed = df.withColumnRenamed('Commision (in value)', 'Commission')
df_renamed.show(20)

+------+-------------+--------------------+--------------------+-----+--------+--------------------+---------+----------+------+---+
|Agency|  Agency Type|Distribution Channel|        Product Name|Claim|Duration|         Destination|Net Sales|Commission|Gender|Age|
+------+-------------+--------------------+--------------------+-----+--------+--------------------+---------+----------+------+---+
|   CBH|Travel Agency|             Offline|  Comprehensive Plan|   No|     186|            MALAYSIA|    -29.0|      9.57|     F| 81|
|   CBH|Travel Agency|             Offline|  Comprehensive Plan|   No|     186|            MALAYSIA|    -29.0|      9.57|     F| 71|
|   CWT|Travel Agency|              Online|Rental Vehicle Ex...|   No|      65|           AUSTRALIA|    -49.5|      29.7|  null| 32|
|   CWT|Travel Agency|              Online|Rental Vehicle Ex...|   No|      60|           AUSTRALIA|    -39.6|     23.76|  null| 32|
|   CWT|Travel Agency|              Online|Rental Vehicle Ex...|   No

c. Compute the count of policies for each destination. Print the top 10 destinations by the count of policies.

In [None]:
from pyspark.sql.functions import col

# Group by 'Destination', count the policies, and then order by count in descending order
destination_counts = df_renamed.groupBy("Destination").count().orderBy(col("count").desc())

# Show the top 10 destinations by policy count
destination_counts.show(10)

+-------------+-----+
|  Destination|count|
+-------------+-----+
|    SINGAPORE|13255|
|     MALAYSIA| 5930|
|     THAILAND| 5894|
|        CHINA| 4796|
|    AUSTRALIA| 3694|
|    INDONESIA| 3452|
|UNITED STATES| 2530|
|  PHILIPPINES| 2490|
|    HONG KONG| 2411|
|        INDIA| 2251|
+-------------+-----+
only showing top 10 rows
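
For intuition, the same group, count, and sort logic can be sketched in plain Python on a small list. The destinations below are hypothetical sample values rather than rows from the dataset, and `top_destinations` is an illustrative helper name, not part of Spark.

```python
from collections import Counter

def top_destinations(destinations, n=10):
    """Count occurrences per destination and return the n most frequent as (name, count) pairs."""
    return Counter(destinations).most_common(n)

# Hypothetical toy data, for illustration only
sample = ["SINGAPORE", "MALAYSIA", "SINGAPORE", "THAILAND", "SINGAPORE", "MALAYSIA"]
print(top_destinations(sample, n=2))  # [('SINGAPORE', 3), ('MALAYSIA', 2)]
```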



d. What is the mean age for customers who filed a claim? What is the mean age for customers who did not file a claim?
Print the mean age for both customers who filed and those who didn't file.

In [None]:
from pyspark.sql.functions import mean

# Group by claim status and average the age within each group
mean_age_by_claim_status = df_renamed.groupBy("Claim").agg(mean("Age").alias("Mean Age"))

# Show the mean age for both groups
mean_age_by_claim_status.show()

+-----+------------------+
|Claim|          Mean Age|
+-----+------------------+
|   No|39.989823554864664|
|  Yes| 38.63430420711974|
+-----+------------------+



e. Which travel agency earned the most commission? Compute the total commission for each agency and print the top 10 agencies ordered by that total.

In [None]:
# Sum the commission per agency and give the aggregate column a readable name
total_commission_by_agency = df_renamed.groupBy("Agency").sum("Commission").withColumnRenamed("sum(Commission)", "Total Commission")

# Order the results to get the top 10 agencies by total commission
top_agencies_by_commission = total_commission_by_agency.orderBy("Total Commission", ascending=False)

# Show the top 10 agencies
top_agencies_by_commission.show(10)

+------+------------------+
|Agency|  Total Commission|
+------+------------------+
|   CWT|277825.68000001844|
|   C2B|169747.34000000358|
|   JZI| 74471.24999999916|
|   LWC| 51169.12999999995|
|   JWT|16208.399999999956|
|   KML| 8550.380000000014|
|   TST|           5556.25|
|   RAB| 5239.199999999985|
|   ART|3493.3499999999995|
|   ADM| 3136.899999999999|
+------+------------------+
only showing top 10 rows
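
The long fractional tails in totals such as 277825.68000001844 are most likely ordinary floating-point accumulation error from summing many doubles, not real fractions of a cent. The plain-Python sketch below reproduces the effect on toy values, which are an assumption for illustration. In Spark, rounding the aggregate for display (for example with `pyspark.sql.functions.round`) would be one way to tidy the output.

```python
import math

values = [0.1] * 10  # ten toy amounts of 0.1 each

naive_total = sum(values)        # left-to-right addition accumulates binary rounding error
exact_total = math.fsum(values)  # tracks partial sums to avoid that error

print(naive_total)            # 0.9999999999999999
print(exact_total)            # 1.0
print(round(naive_total, 2))  # 1.0
```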



2. a. In the second part of the assignment, we will create a transformation pipeline. Remove all rows with missing data, convert all strings to integers, create dummy variables, and assemble a feature vector. Use only the following variables as predictors: Agency Type, Distribution Channel, Duration, Net Sales, Commission, Gender, Age. Use Claim as the response variable.

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

# Remove every row that has a null in any column
df_clean = df_renamed.na.drop()

# Convert string columns to integer values (StringIndexer) and create dummy variables (OneHotEncoder)
agency_type_indexer = StringIndexer(inputCol="Agency Type", outputCol="Agency Type Index")
distribution_channel_indexer = StringIndexer(inputCol="Distribution Channel", outputCol="Distribution Channel Index")
gender_indexer = StringIndexer(inputCol="Gender", outputCol="Gender Index")

agency_type_encoder = OneHotEncoder(inputCol="Agency Type Index", outputCol="Agency Type Vec")
distribution_channel_encoder = OneHotEncoder(inputCol="Distribution Channel Index", outputCol="Distribution Channel Vec")
gender_encoder = OneHotEncoder(inputCol="Gender Index", outputCol="Gender Vec")

# Assemble the feature vector
feature_assembler = VectorAssembler(
    inputCols=["Agency Type Vec", "Distribution Channel Vec", "Duration", "Net Sales", "Commission", "Gender Vec", "Age"],
    outputCol="features")

# Set Claim as the response variable
label_indexer = StringIndexer(inputCol="Claim", outputCol="label")

# Define the pipeline
pipeline = Pipeline(stages=[agency_type_indexer, distribution_channel_indexer, gender_indexer,
                            agency_type_encoder, distribution_channel_encoder, gender_encoder,
                            feature_assembler, label_indexer])

# Fit the pipeline to the data
pipeline_model = pipeline.fit(df_clean)

# Transform the data
df_transformed = pipeline_model.transform(df_clean)

# The resulting DataFrame
df_transformed.select("features", "label").show()

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[0.0,0.0,186.0,-2...|  0.0|
|[0.0,0.0,186.0,-2...|  0.0|
|[1.0,1.0,66.0,-12...|  0.0|
|[1.0,1.0,1.0,-18....|  0.0|
|[0.0,1.0,53.0,-13...|  0.0|
|[1.0,1.0,3.0,-18....|  0.0|
|[1.0,1.0,12.0,46....|  0.0|
|[1.0,1.0,7.0,17.5...|  0.0|
|[1.0,1.0,12.0,94....|  1.0|
|[1.0,1.0,190.0,29...|  0.0|
|[1.0,1.0,364.0,38...|  0.0|
|[1.0,1.0,11.0,50....|  0.0|
|[1.0,1.0,4.0,15.0...|  0.0|
|[1.0,1.0,45.0,26....|  0.0|
|[1.0,1.0,181.0,30...|  0.0|
|[1.0,1.0,5.0,22.0...|  0.0|
|[1.0,1.0,22.0,18....|  0.0|
|[1.0,1.0,76.0,35....|  0.0|
|[1.0,1.0,41.0,44....|  0.0|
|[1.0,1.0,43.0,22....|  0.0|
+--------------------+-----+
only showing top 20 rows
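
To see why each two-valued categorical column contributes a single number to the feature vector, here is a plain-Python sketch of what the indexing and encoding steps do conceptually. It assumes Spark's defaults (`StringIndexer` ordering labels by descending frequency, `OneHotEncoder` dropping the last category) and alphabetical tie-breaking. The helper names `string_index` and `one_hot` and the toy column are hypothetical, not Spark APIs.

```python
from collections import Counter

def string_index(values):
    """Map each distinct string to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

def one_hot(index, n_categories, drop_last=True):
    """Return a dummy vector for index; with drop_last the final category becomes all zeros."""
    size = n_categories - 1 if drop_last else n_categories
    return [1.0 if i == int(index) else 0.0 for i in range(size)]

# Hypothetical toy column, for illustration only
channel = ["Online", "Online", "Offline", "Online"]
mapping = string_index(channel)
print(mapping)  # {'Online': 0.0, 'Offline': 1.0}
print([one_hot(mapping[v], len(mapping)) for v in channel])  # [[1.0], [1.0], [0.0], [1.0]]
```

Dropping the last category avoids a redundant column, since with two categories the second dummy is fully determined by the first.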



b. Split the data into train and test with 20% of the data in the test sample.

In [None]:

train_data, test_data = df_clean.randomSplit([0.8, 0.2], seed=1234)
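
Because `randomSplit` assigns rows at random, the 80/20 proportions are approximate rather than exact, and the seed is what makes the split repeatable. The plain-Python sketch below illustrates that idea only; it is not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number against cumulative weights."""
    total = sum(weights)
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any draw left over by float rounding
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train, test = random_split(range(1000), [0.8, 0.2], seed=1234)
print(len(train), len(test))  # close to 800 and 200, but not guaranteed to be exact
```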




c. Apply a logistic regression model to predict the probability that a customer will make a claim. Use the training data to produce a model and then test it using the test dataset.

In [None]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator


# Fit the preprocessing pipeline and transform the cleaned data, as in part 2a
pipeline_model = pipeline.fit(df_clean)
df_transformed = pipeline_model.transform(df_clean)


# Split the transformed data into train and test sets
train_data, test_data = df_transformed.randomSplit([0.8, 0.2], seed=1234)


# Create and train the Logistic Regression model
lr = LogisticRegression(featuresCol='features', labelCol='label')
lr_model = lr.fit(train_data)


# Score the held-out test set
test_predictions = lr_model.transform(test_data)
test_predictions.select("prediction", "label", "features").show(5)


# BinaryClassificationEvaluator defaults to metricName="areaUnderROC", so this value is AUC, not accuracy
evaluator = BinaryClassificationEvaluator()
test_auc = evaluator.evaluate(test_predictions)
print(f"Test Data AUC: {test_auc}")

train_predictions = lr_model.transform(train_data)



+----------+-----+--------------------+
|prediction|label|            features|
+----------+-----+--------------------+
|       0.0|  0.0|[0.0,1.0,7.0,86.0...|
|       0.0|  0.0|[0.0,1.0,16.0,-86...|
|       0.0|  0.0|[0.0,1.0,197.0,86...|
|       0.0|  0.0|[0.0,1.0,30.0,130...|
|       0.0|  0.0|[0.0,1.0,49.0,0.0...|
+----------+-----+--------------------+
only showing top 5 rows

Test Data AUC: 0.7284892741500263
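
Note that `BinaryClassificationEvaluator` uses `areaUnderROC` as its default metric, so the number it returns is an AUC rather than an accuracy. The two measure different things: accuracy checks thresholded predictions against labels, while AUC measures how well the scores rank positives above negatives. The plain-Python sketch below computes both on hypothetical toy labels and scores to show that they can disagree; the helper names are illustrative.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of rows whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def area_under_roc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical toy labels and predicted probabilities, for illustration only
labels = [0, 0, 0, 0, 1]
scores = [0.10, 0.20, 0.05, 0.40, 0.30]

print(accuracy(labels, scores))        # 0.8
print(area_under_roc(labels, scores))  # 0.75
```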


d. Compute the accuracy for the train and test datasets.

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# metricName="accuracy" gives the fraction of rows whose prediction equals the label
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")

# Accuracy on the training set
train_predictions = lr_model.transform(train_data)
train_accuracy = evaluator.evaluate(train_predictions)
print(f"Training Data Accuracy: {train_accuracy}")

# Accuracy on the test set
test_predictions = lr_model.transform(test_data)
test_accuracy = evaluator.evaluate(test_predictions)
print(f"Test Data Accuracy: {test_accuracy}")

Training Data Accuracy: 0.9644128113879004
Test Data Accuracy: 0.9670085943997782
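
An accuracy figure is easier to interpret next to a baseline. When one class is rare, always predicting the majority class already scores highly, so it is worth comparing the model against that. The sketch below uses hypothetical toy labels, not the actual class counts from this dataset; those could be checked with something like `df_transformed.groupBy("label").count()`.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical toy labels: 97 non-claims and 3 claims
labels = [0.0] * 97 + [1.0] * 3
print(majority_baseline_accuracy(labels))  # 0.97
```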
