# **File 1: PySpark on Health Insurance Marketplace**

##Introduction to PySpark

Apache Spark is an open-source distributed data processing engine designed for large-scale workloads. PySpark is the Python API for Spark: it lets you drive Spark from Python and exposes Spark's distributed DataFrames, SQL engine, and machine learning library (MLlib).

With PySpark you can build data pipelines, analyze data, and train machine learning models. It is particularly useful for distributed data processing: Spark splits the data into partitions and processes them in parallel, so the same code can scale from a single machine to a cluster.
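
The idea of splitting work across partitions can be illustrated without Spark. The sketch below is plain Python, not Spark's actual implementation: each partition independently produces a `(sum, count)` pair, and the pairs are then merged into a global mean. The function names are hypothetical, and the sample values are the five `IndividualRate` values displayed later in this notebook.

```python
from functools import reduce
from typing import List, Tuple


def partial_sum_count(partition: List[float]) -> Tuple[float, int]:
    """Work done independently on one partition: a (sum, count) pair."""
    return sum(partition), len(partition)


def combine(a: Tuple[float, int], b: Tuple[float, int]) -> Tuple[float, int]:
    """Merge two partial results; order does not matter."""
    return a[0] + b[0], a[1] + b[1]


def distributed_mean(partitions: List[List[float]]) -> float:
    """Mean over all partitions, computed only from per-partition results."""
    total, count = reduce(combine, map(partial_sum_count, partitions), (0.0, 0))
    if count == 0:
        raise ValueError("cannot take the mean of zero values")
    return total / count


# Illustrative split of five values into three "partitions".
partitions = [[29.0, 36.95], [36.95, 32.0], [32.0]]
print(distributed_mean(partitions))
```

Because only small `(sum, count)` pairs need to be moved between workers, this kind of aggregation stays cheap even when the partitions themselves are large.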

To use PySpark you need Python and a Java runtime; installing the `pyspark` package from PyPI, as in the setup step below, also provides Spark itself. You can then call the PySpark API from your Python programs. The official Spark documentation is a good starting point for learning more about building data processing and machine learning pipelines.

##Connecting Drive to Colab

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


##Setting up PySpark in Colab

In [None]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.1.tar.gz (281.4 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.1-py2.py3-none-any.whl size=281845512 sha256=2f909f36cb9fbae39e2cd67ce7ec6e81293a28776673e9916879c0e737078029
  Stored in directory: /root/.cache/pip/wheels/43/dc/11/ec201cd671da62fa9c5cc77078235e40722170ceba231d7598
Successfully built pyspark
Installing collected packages: py4j, pyspa

##Initialize PySpark Session

In [None]:
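from pyspark.sql import SparkSession

# Build a Spark session for this notebook, or reuse one that already exists.
# master("local") runs Spark inside this process with a single worker thread.
spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()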

In [None]:
spark

##Loading data into PySpark

In [None]:
%%time
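# header=True takes column names from the first row; without inferSchema=True every column is read as a string.
spark_df = spark.read.csv('/content/drive/MyDrive/Colab Notebooks/Rate.csv', header=True)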

CPU times: user 73.1 ms, sys: 5.24 ms, total: 78.3 ms
Wall time: 10.7 s


In [None]:
%%time
spark_df.show()

+------------+---------+--------+----------+----------+-------------------+---------+----------+-----------------+------------------+--------------+-------------+-------------+-------------+--------------+---------------------+------+--------------------------------+---------------------------------+-----------------------------------------+---------------------+----------------------+------------------------------+---------+
|BusinessYear|StateCode|IssuerId|SourceName|VersionNum|         ImportDate|IssuerId2|FederalTIN|RateEffectiveDate|RateExpirationDate|        PlanId| RatingAreaId|      Tobacco|          Age|IndividualRate|IndividualTobaccoRate|Couple|PrimarySubscriberAndOneDependent|PrimarySubscriberAndTwoDependents|PrimarySubscriberAndThreeOrMoreDependents|CoupleAndOneDependent|CoupleAndTwoDependents|CoupleAndThreeOrMoreDependents|RowNumber|
+------------+---------+--------+----------+----------+-------------------+---------+----------+-----------------+------------------+-------

##Show column details

In [None]:
%%time
spark_df.printSchema()

root
 |-- BusinessYear: string (nullable = true)
 |-- StateCode: string (nullable = true)
 |-- IssuerId: string (nullable = true)
 |-- SourceName: string (nullable = true)
 |-- VersionNum: string (nullable = true)
 |-- ImportDate: string (nullable = true)
 |-- IssuerId2: string (nullable = true)
 |-- FederalTIN: string (nullable = true)
 |-- RateEffectiveDate: string (nullable = true)
 |-- RateExpirationDate: string (nullable = true)
 |-- PlanId: string (nullable = true)
 |-- RatingAreaId: string (nullable = true)
 |-- Tobacco: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- IndividualRate: string (nullable = true)
 |-- IndividualTobaccoRate: string (nullable = true)
 |-- Couple: string (nullable = true)
 |-- PrimarySubscriberAndOneDependent: string (nullable = true)
 |-- PrimarySubscriberAndTwoDependents: string (nullable = true)
 |-- PrimarySubscriberAndThreeOrMoreDependents: string (nullable = true)
 |-- CoupleAndOneDependent: string (nullable = true)
 |-- CoupleAnd
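
Every column in the schema above is typed `string`, because the CSV was read without schema inference. Numeric work on such columns is fragile: for example, `min` and `max` compare strings lexicographically. Below is a minimal sketch of one way to fix this, assuming `spark_df` from the earlier cell. The helper name `cast_columns` and the choice of columns to cast are illustrative assumptions, not part of the original notebook.

```python
def cast_columns(df, columns, dtype="double"):
    """Return a new DataFrame with each named column cast to `dtype`.

    `df` is expected to be a pyspark.sql.DataFrame; only DataFrame.withColumn,
    DataFrame.__getitem__ and Column.cast are used.
    """
    for name in columns:
        # withColumn with an existing name replaces that column in the result.
        df = df.withColumn(name, df[name].cast(dtype))
    return df


# Hypothetical choice of columns; adjust to the ones you need as numbers.
RATE_COLUMNS = ["IndividualRate", "IndividualTobaccoRate"]

# In the notebook (not run here):
# typed_df = cast_columns(spark_df, RATE_COLUMNS)
# typed_df.printSchema()
```

With Spark's default (non-ANSI) settings, values that cannot be parsed as numbers become null rather than raising an error. Alternatively, passing `inferSchema=True` to `spark.read.csv` lets Spark guess the types, at the cost of an extra pass over the file.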

##Number of rows in the DataFrame

In [None]:
%%time
spark_df.count()

CPU times: user 129 ms, sys: 22.5 ms, total: 152 ms
Wall time: 22.1 s


12694445

##Display specific columns

In [None]:
%%time
spark_df.select("BusinessYear","ImportDate","Age","IndividualRate","IndividualTobaccoRate").show(5)

+------------+-------------------+-------------+--------------+---------------------+
|BusinessYear|         ImportDate|          Age|IndividualRate|IndividualTobaccoRate|
+------------+-------------------+-------------+--------------+---------------------+
|        2014|2014-03-19 07:06:49|         0-20|          29.0|                 null|
|        2014|2014-03-19 07:06:49|Family Option|         36.95|                 null|
|        2014|2014-03-19 07:06:49|Family Option|         36.95|                 null|
|        2014|2014-03-19 07:06:49|           21|          32.0|                 null|
|        2014|2014-03-19 07:06:49|           22|          32.0|                 null|
+------------+-------------------+-------------+--------------+---------------------+
only showing top 5 rows

CPU times: user 10.1 ms, sys: 0 ns, total: 10.1 ms
Wall time: 339 ms


##Calculate mean

In [None]:
%%time
spark_df.agg({'IndividualRate': 'mean'}).show()

+-------------------+
|avg(IndividualRate)|
+-------------------+
|   4098.02645859167|
+-------------------+

CPU times: user 234 ms, sys: 29.9 ms, total: 264 ms
Wall time: 45 s
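
The mean above is far larger than the rates in the five sample rows shown earlier (29.0 to 36.95), so it may be worth looking at the column's distribution before trusting a single average. Below is a minimal sketch of one way to do that, assuming `spark_df` from the earlier cells; the helper name `rate_summary` is hypothetical. The column is cast to double first so that `min` and `max` are numeric rather than lexicographic.

```python
def rate_summary(df, column="IndividualRate"):
    """Return a summary DataFrame for `column`, cast to double.

    `df` is expected to be a pyspark.sql.DataFrame.
    """
    numeric = df.select(df[column].cast("double").alias(column))
    # DataFrame.summary accepts named statistics and percentiles given as strings.
    return numeric.summary("count", "mean", "min", "50%", "max")


# In the notebook (not run here):
# rate_summary(spark_df).show()
```

Comparing the median (`50%`) with the mean and the maximum is a quick way to see whether a few very large values are pulling the average up.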
