<a href="https://colab.research.google.com/github/duhajarrar/SparkApp/blob/main/Spark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Spark

In [None]:
!pip install pyspark



# Import libraries

In [None]:
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
import functools
import pyspark
from pyspark import SparkContext

# Read The Dataset

In [None]:
spark = SparkSession.builder.master("local[1]").appName("SparkApp").getOrCreate()
dfCar=spark.read.option("header",True).csv("/content/drive/MyDrive/Spark-Harri/cars.csv")
dfCar.printSchema()
print(type(spark),type(dfCar))
dfCar.show(5)
print(dfCar.columns)

root
 |-- Car Brand: string (nullable = true)
 |-- Country of Origin: string (nullable = true)

<class 'pyspark.sql.session.SparkSession'> <class 'pyspark.sql.dataframe.DataFrame'>
+------------+-----------------+
|   Car Brand|Country of Origin|
+------------+-----------------+
|      Abarth|            Italy|
|  Alfa Romeo|            Italy|
|Aston Martin|          England|
|        Audi|          Germany|
|     Bentley|          England|
+------------+-----------------+
only showing top 5 rows

['Car Brand', 'Country of Origin']


# Task1: Extract a file which contains the car model and the country of origin of this car.

In [None]:
# Use one partition per row so that Spark writes roughly one output file per car.
rows = dfCar.count()
dfCar.repartition(rows).write.csv('/content/drive/MyDrive/Spark-Harri/Cars')

# Task2: Extract one file per country

In [None]:
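# partitionBy writes one sub-directory per distinct country; the column name matches the schema printed above.
dfCar.write.partitionBy('Country of Origin').mode("overwrite").csv('/content/drive/MyDrive/Spark-Harri/Country Of Origin')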
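To make the resulting folder layout concrete, here is a minimal plain-Python sketch of what a partitioned write produces: one `column=value` sub-directory per country, with the partition column left out of the file contents. It does not use Spark. The helper `write_partitioned` and the file name `part-00000.csv` are hypothetical simplifications (Spark generates longer part-file names), and the rows are the five shown in the `dfCar.show(5)` output above.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

# First five rows of cars.csv, copied from the output above.
rows = [
    ("Abarth", "Italy"),
    ("Alfa Romeo", "Italy"),
    ("Aston Martin", "England"),
    ("Audi", "Germany"),
    ("Bentley", "England"),
]


def write_partitioned(rows, out_dir, column="Country of Origin"):
    """Hypothetical helper: write one CSV per country under out_dir/<column>=<country>/."""
    groups = defaultdict(list)
    for brand, country in rows:
        groups[country].append(brand)

    written = {}
    for country, brands in groups.items():
        part_dir = Path(out_dir) / f"{column}={country}"
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / "part-00000.csv"  # simplified file name
        with path.open("w", newline="") as fh:
            # Only the brand is written: the country lives in the directory name.
            csv.writer(fh).writerows([brand] for brand in brands)
        written[country] = path
    return written


with tempfile.TemporaryDirectory() as tmp:
    files = write_partitioned(rows, tmp)
    layout = sorted(path.parent.name for path in files.values())
    italy_rows = files["Italy"].read_text().splitlines()

print(layout)
print(italy_rows)
```

Reading the parent folder back with Spark should restore the country as a regular column, since the value is encoded in each directory name.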

In [None]:
# from pyspark import SparkContext
# sc = SparkContext("local", "First App")

In [None]:
# sc.parallelize(dfCar)

In [None]:
# rddCar=dfCar.rdd
# print(rddCar.collect())

In [None]:
# def toCSVLine(data):
#   return ','.join(str(d) for d in data)

# lines = rddCar.map(toCSVLine)
# lines.saveAsTextFile('/content/drive/MyDrive/Spark-Harri/Part2')

In [None]:
#rddCar.saveAsTextFile()
# rddCar.partitionBy('Country Of Origin')
#.mode("overwrite").csv('/content/drive/MyDrive/Spark-Harri/Part2')

# Task3: Use caching properly to optimize the performance

In [None]:
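# cache() is lazy: dfCar is materialised on the first action after this call and reused by later ones.
dfCar=dfCar.cache()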

# Task4: Read a file with updated records and merge these updates into the original dataset.

## Read 2015_State_Top10Report_wTotalThefts file

In [None]:
dfReport=spark.read.option("header",True).csv("/content/drive/MyDrive/Spark-Harri/2015_State_Top10Report_wTotalThefts.csv")
dfReport.printSchema()
dfReport=dfReport.withColumn("Thefts",dfReport.Thefts.cast('int'))
dfReport.show()

root
 |-- State: string (nullable = true)
 |-- Rank: string (nullable = true)
 |-- Make/Model: string (nullable = true)
 |-- Model Year: string (nullable = true)
 |-- Thefts: string (nullable = true)

+-------+----+--------------------+----------+------+
|  State|Rank|          Make/Model|Model Year|Thefts|
+-------+----+--------------------+----------+------+
|Alabama|   1|Chevrolet Pickup ...|      2005|   499|
|Alabama|   2|Ford Pickup (Full...|      2006|   357|
|Alabama|   3|        Toyota Camry|      2014|   205|
|Alabama|   4|       Nissan Altima|      2014|   191|
|Alabama|   4|    Chevrolet Impala|      2004|   191|
|Alabama|   5|        Honda Accord|      1998|   180|
|Alabama|   6|GMC Pickup (Full ...|      1999|   152|
|Alabama|   7|Dodge Pickup (Ful...|      1998|   138|
|Alabama|   8|        Ford Mustang|      2002|   122|
|Alabama|   9|       Ford Explorer|      2002|   119|
| Alaska|   1|Chevrolet Pickup ...|      2003|   147|
| Alaska|   2|Ford Pickup (Full...|      20

Rename the columns whose names contain a slash or a space so they are easier to reference.

In [None]:
dfReport=dfReport.withColumnRenamed('Make/Model','MakeModel').withColumnRenamed('Model Year','ModelYear')

## Read Updated - Sheet1 file

In [None]:
dfUpdate=spark.read.option("header",True).csv("/content/drive/MyDrive/Spark-Harri/Updated - Sheet1.csv")
dfUpdate.printSchema()
dfUpdate=dfUpdate.withColumn("Thefts",dfUpdate.Thefts.cast('int'))
dfUpdate.show()

root
 |-- State: string (nullable = true)
 |-- Rank: string (nullable = true)
 |-- Make/Model: string (nullable = true)
 |-- Model Year: string (nullable = true)
 |-- Thefts: string (nullable = true)

+------------+----+--------------------+----------+------+
|       State|Rank|          Make/Model|Model Year|Thefts|
+------------+----+--------------------+----------+------+
|    Arkansas|   6|       Nissan Altima|      2015|  3000|
|       Idaho|   8|Jeep Cherokee/Gra...|      1997|    19|
|   Minnesota|   1|         Honda Civic|      1998|    50|
|   Minnesota|   2|        Honda Accord|      1997|    20|
|    Virginia|   7|      Toyota Corolla|      2013|   900|
|    Virginia|   8|       Ford Explorer|      2002|   543|
|North Dakota|   9|    Pontiac Grand Am|      2000|  2100|
|    New York|   5|           Seat Leon|      2019|    11|
|       Maine|   2|             VW Golf|      2021|     6|
+------------+----+--------------------+----------+------+



Rename the same columns in the update dataset so both DataFrames share identical column names.

In [None]:
dfUpdate=dfUpdate.withColumnRenamed('Make/Model','MakeModel').withColumnRenamed('Model Year','ModelYear')
print(dfUpdate.columns)

['State', 'Rank', 'MakeModel', 'ModelYear', 'Thefts']


## Update the Report dataset using the updated dataset 

In [None]:
dfUpdatedRank = (
    dfReport.alias('a')
    .join(dfUpdate.alias('b'), ['State', 'MakeModel', 'ModelYear', 'Thefts'], how='outer')
    .select('State', 'MakeModel', 'ModelYear', 'Thefts',
            f.coalesce('b.Rank', 'a.Rank').alias('Rank'))
)
dfUpdatedRank.show(5)

+-------+--------------------+---------+------+----+
|  State|           MakeModel|ModelYear|Thefts|Rank|
+-------+--------------------+---------+------+----+
|Alabama|    Chevrolet Impala|     2004|   191|   4|
|Alabama|Chevrolet Pickup ...|     2005|   499|   1|
|Alabama|Dodge Pickup (Ful...|     1998|   138|   7|
|Alabama|       Ford Explorer|     2002|   119|   9|
|Alabama|        Ford Mustang|     2002|   122|   8|
+-------+--------------------+---------+------+----+
only showing top 5 rows
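The join above is a full outer join on `State`, `MakeModel`, `ModelYear` and `Thefts`, with `Rank` taken from the update when one exists. The following plain-Python sketch mirrors that logic with dictionaries so the semantics are easy to check by hand. It is an illustration, not the Spark implementation: `merge_ranks` is a hypothetical helper, the Toyota Camry and VW Golf rows come from the outputs above, and the Honda Accord update row is invented to show a rank being overwritten.

```python
def merge_ranks(report, update):
    """Full outer join on the key tuple; prefer the update's Rank, else keep the report's."""
    merged = {}
    for key in report.keys() | update.keys():
        new_rank = update.get(key)
        merged[key] = new_rank if new_rank is not None else report.get(key)
    return merged


# Keys are (State, MakeModel, ModelYear, Thefts); values are Rank (a string, as in the notebook).
report = {
    ("Alabama", "Toyota Camry", "2014", 205): "3",
    ("Alabama", "Honda Accord", "1998", 180): "5",
}
update = {
    ("Alabama", "Honda Accord", "1998", 180): "4",  # hypothetical update row
    ("Maine", "VW Golf", "2021", 6): "2",           # new record from the update file
}

merged = merge_ranks(report, update)
for key in sorted(merged):
    print(key, merged[key])
```

Because `Thefts` is part of the key, an update row whose `Thefts` value differs from the report's does not match the old row, so both rows end up in the result. If the intent is to overwrite `Thefts` as well, the key would need to exclude it.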



In [None]:
# dfUpdatedThefts=dfReport.alias('a').join(dfUpdate.alias('b'), ['State','MakeModel','ModelYear','Rank'], how='outer').select('State','MakeModel','ModelYear','Rank',f.coalesce('b.Thefts', 'a.Thefts').alias('Thefts'))
# dfUpdatedThefts.show(5)

# Create Cars table 

In [None]:
dfUpdatedRank=dfUpdatedRank.withColumn("Thefts",dfUpdatedRank.Thefts.cast('int'))

In [None]:
dfMost5Thefts=dfUpdatedRank.sort('Thefts',ascending=False)

In [None]:
dfMost5Thefts.show()

+------------+--------------------+---------+------+----+
|       State|           MakeModel|ModelYear|Thefts|Rank|
+------------+--------------------+---------+------+----+
|    Arkansas|       Nissan Altima|     2015|  3000|   6|
|North Dakota|    Pontiac Grand Am|     2000|  2100|   9|
|       Texas|       Nissan Altima|     2012|   957|   9|
|     Georgia|Ford Pickup (Full...|     2006|   954|   2|
|     Georgia|Chevrolet Pickup ...|     1999|   948|   3|
|        Utah|        Honda Accord|     1997|   938|   1|
|        Utah|         Honda Civic|     1998|   915|   2|
|     Florida|      Toyota Corolla|     2014|   914|   6|
|    Virginia|      Toyota Corolla|     2013|   900|   7|
|       Texas|    Chevrolet Impala|     2007|   898|  10|
|    Missouri|Ford Pickup (Full...|     2004|   880|   1|
|     Arizona|Chevrolet Pickup ...|     2004|   850|   3|
|  New Jersey|        Honda Accord|     1997|   844|   1|
|     Florida|Chevrolet Pickup ...|     2015|   786|   7|
|    Missouri|

In [None]:
dfUpdatedRank.createOrReplaceTempView("Cars")

# Task5: List the 5 most stolen models in the U.S.

In [None]:
spark.sql("select distinct MakeModel,Thefts from Cars ORDER BY Thefts desc").show(5)

+--------------------+------+
|           MakeModel|Thefts|
+--------------------+------+
|       Nissan Altima|  3000|
|    Pontiac Grand Am|  2100|
|       Nissan Altima|   957|
|Ford Pickup (Full...|   954|
|Chevrolet Pickup ...|   948|
+--------------------+------+
only showing top 5 rows



In [None]:
# dfUpdatedRank.select('MakeModel','Thefts').sort('Thefts',ascending=False).show(5)

# Task6: List the top 5 states by number of stolen cars.

In [None]:
spark.sql("select distinct State,Thefts from Cars ORDER BY Thefts desc").show(5)

+------------+------+
|       State|Thefts|
+------------+------+
|    Arkansas|  3000|
|North Dakota|  2100|
|       Texas|   957|
|     Georgia|   954|
|     Georgia|   948|
+------------+------+
only showing top 5 rows
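The Task5 and Task6 queries above rank individual report rows, which is why Nissan Altima and Georgia each appear twice. If the goal is a total per model or per state, one option is to aggregate with `GROUP BY` and `SUM` before ordering. The sketch below runs that query shape against an in-memory SQLite table rather than Spark, so it can be checked without a cluster. The rows are a subset copied from the sorted output above, the totals therefore cover only that subset, and `top_totals` is a hypothetical helper.

```python
import sqlite3

# Subset of (State, MakeModel, Thefts) rows copied from the sorted output above.
rows = [
    ("Arkansas", "Nissan Altima", 3000),
    ("North Dakota", "Pontiac Grand Am", 2100),
    ("Texas", "Nissan Altima", 957),
    ("Utah", "Honda Accord", 938),
    ("Utah", "Honda Civic", 915),
    ("Florida", "Toyota Corolla", 914),
    ("Virginia", "Toyota Corolla", 900),
    ("Texas", "Chevrolet Impala", 898),
    ("New Jersey", "Honda Accord", 844),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Cars (State TEXT, MakeModel TEXT, Thefts INTEGER)")
conn.executemany("INSERT INTO Cars VALUES (?, ?, ?)", rows)


def top_totals(column, n=5):
    """Hypothetical helper: sum Thefts per value of `column` and return the n largest totals."""
    if column not in {"State", "MakeModel"}:
        raise ValueError(f"unsupported grouping column: {column}")
    query = (
        f"SELECT {column}, SUM(Thefts) AS TotalThefts "
        f"FROM Cars GROUP BY {column} "
        "ORDER BY TotalThefts DESC LIMIT ?"
    )
    return conn.execute(query, (n,)).fetchall()


top_models = top_totals("MakeModel")
top_states = top_totals("State")
print(top_models)
print(top_states)
```

The same `SELECT ... SUM(Thefts) ... GROUP BY ... ORDER BY ... DESC` statement should also be accepted by `spark.sql` on the `Cars` view, although it has not been run against the notebook's data here.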



# Task7: Based on the models, which country do Americans buy most of their cars from?

## Extract Model name 

We need to extract the make (the first word of each model name) and then join it with its country of origin, using the cars.csv file.

In [None]:
# Keep only the first word of MakeModel, i.e. the make (brand) name.
split_col = f.split(dfUpdatedRank['MakeModel'], ' ')
dfUpdatedRank = dfUpdatedRank.withColumn('MakeModel', split_col.getItem(0))
dfUpdatedRank.show(5)

+-------+---------+---------+------+----+
|  State|MakeModel|ModelYear|Thefts|Rank|
+-------+---------+---------+------+----+
|Alabama|Chevrolet|     2004|   191|   4|
|Alabama|Chevrolet|     2005|   499|   1|
|Alabama|    Dodge|     1998|   138|   7|
|Alabama|     Ford|     2002|   119|   9|
|Alabama|     Ford|     2002|   122|   8|
+-------+---------+---------+------+----+
only showing top 5 rows



In [None]:
numOfModelsBefore=dfUpdatedRank.select('MakeModel').distinct().count()

In [None]:
#dfUpdatedRank.select('MakeModel').distinct().show()

Rename the columns of dfCar so that the brand column is called MakeModel and can serve as the join key.

In [None]:
dfCar=dfCar.withColumnRenamed('Car Brand','MakeModel').withColumnRenamed('Country of Origin','CountryOfOrigin')
dfCar.show(5)

+------------+---------------+
|   MakeModel|CountryOfOrigin|
+------------+---------------+
|      Abarth|          Italy|
|  Alfa Romeo|          Italy|
|Aston Martin|        England|
|        Audi|        Germany|
|     Bentley|        England|
+------------+---------------+
only showing top 5 rows



## Join cars dataset with report dataset

In [None]:
dfUpdatedRank=dfUpdatedRank.join(dfCar, ['MakeModel'], 'inner')
dfUpdatedRank.show(5)

+---------+-------+---------+------+----+---------------+
|MakeModel|  State|ModelYear|Thefts|Rank|CountryOfOrigin|
+---------+-------+---------+------+----+---------------+
|Chevrolet|Alabama|     2004|   191|   4|        America|
|Chevrolet|Alabama|     2005|   499|   1|        America|
|    Dodge|Alabama|     1998|   138|   7|        America|
|     Ford|Alabama|     2002|   119|   9|        America|
|     Ford|Alabama|     2002|   122|   8|        America|
+---------+-------+---------+------+----+---------------+
only showing top 5 rows



In [None]:
numOfModelsAfter=dfUpdatedRank.select('MakeModel').distinct().count()

In [None]:
#dfUpdatedRank.select('MakeModel').distinct().show()

In [None]:
#dfCar.select('MakeModel').distinct().show()

**Important**

In [None]:
print("Number of models in cars.csv file = ",dfCar.select('MakeModel').distinct().count())

Number of models in cars.csv file =  58


In [None]:
print(" Number Of Models Before join  = ",numOfModelsBefore," Number Of Models After join  = ",numOfModelsAfter)

 Number Of Models Before join  =  15  Number Of Models After join  =  10


**Note:** VW, GMC, Seat, Pontiac, and Acura are not in cars.csv, so only 10 of the 15 makes in the report match a brand in cars.csv.
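The drop from 15 to 10 follows from combining first-word extraction with an inner join: any make without an exact match in cars.csv disappears from the result. The plain-Python sketch below reproduces that behaviour on a handful of values taken from the outputs above. The `brand_country` lookup contains only the three brands visible in the join output, and `first_token` is a hypothetical helper standing in for `split(...).getItem(0)`.

```python
# Brand -> country pairs visible in the join output above (not the full cars.csv).
brand_country = {"Chevrolet": "America", "Dodge": "America", "Ford": "America"}

# Model names that appear untruncated in the report and update outputs.
models = [
    "Chevrolet Impala",
    "Ford Mustang",
    "Ford Explorer",
    "Pontiac Grand Am",
    "VW Golf",
    "Seat Leon",
]


def first_token(model):
    """Return the first space-separated word, used here as the make."""
    return model.split(" ")[0]


# Inner join: keep only models whose make has a matching brand.
matched = [
    (model, brand_country[first_token(model)])
    for model in models
    if first_token(model) in brand_country
]
# Makes that the inner join silently drops.
dropped = sorted({first_token(m) for m in models if first_token(m) not in brand_country})

print(matched)
print(dropped)
print(first_token("Alfa Romeo"))  # multi-word brands are cut to their first word
```

One consequence of keying on the first word is that multi-word brands in cars.csv, such as Alfa Romeo or Aston Martin, can never match, because the extracted make would be only `Alfa` or `Aston`. A left join followed by a filter on null countries would be one way to list the unmatched makes instead of dropping them.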

## Find the country of origin that appears most often in the report, based on the model

In [None]:
dfUpdatedRank.groupby('CountryOfOrigin').count().sort('count',ascending=False).show(1)

+---------------+-----+
|CountryOfOrigin|count|
+---------------+-----+
|        America|  268|
+---------------+-----+
only showing top 1 row

