## Get Started With SparkSQL
### What is SparkSQL?
SparkSQL is the component of Apache Spark that lets users run SQL queries on Spark dataframes. It supports several data sources, such as:
- Hive tables
- JSON files
- Parquet files
- JDBC (Java Database Connectivity)

### Steps to use SparkSQL
1. Import libraries
2. Start a Spark session
3. Load data into a Spark dataframe
4. Register the dataframe as a temporary view
5. Run SQL queries
6. Show results
7. Finally, stop the Spark session

In [2]:
#Import libraries
import findspark
findspark.init()      #locate the Spark installation before importing pyspark
from pyspark.sql import SparkSession

In [3]:
#Start a Spark session named 'Start with SparkSQL'
spark = SparkSession.builder.appName('Start with SparkSQL').getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/12/10 16:14:28 WARN Utils: Your hostname, omar, resolves to a loopback address: 127.0.1.1; using 192.168.1.4 instead (on interface wlo1)
25/12/10 16:14:28 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/10 16:14:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [5]:
#Load the CSV file into a Spark dataframe
#header=True takes column names from the first row; inferSchema=True detects column types
data = spark.read.csv('data/mpg.csv', header=True, inferSchema=True)
data.printSchema()

root
 |-- MPG: double (nullable = true)
 |-- Cylinders: integer (nullable = true)
 |-- Engine Disp: double (nullable = true)
 |-- Horsepower: integer (nullable = true)
 |-- Weight: integer (nullable = true)
 |-- Accelerate: double (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Origin: string (nullable = true)



In [6]:
#Create a temporary view from the dataframe
data.createOrReplaceTempView('mileage')      #replaces the view if it already exists

In [7]:
#Select the cars with mileage greater than 40
query = 'Select * From mileage Where mpg > 40'
result = spark.sql(query)
result.show()

+----+---------+-----------+----------+------+----------+----+--------+
| MPG|Cylinders|Engine Disp|Horsepower|Weight|Accelerate|Year|  Origin|
+----+---------+-----------+----------+------+----------+----+--------+
|43.1|        4|       90.0|        48|  1985|      21.5|  78|European|
|43.4|        4|       90.0|        48|  2335|      23.7|  80|European|
|41.5|        4|       98.0|        76|  2144|      14.7|  80|European|
|44.3|        4|       90.0|        48|  2085|      21.7|  80|European|
|40.8|        4|       85.0|        65|  2110|      19.2|  80|Japanese|
|44.6|        4|       91.0|        67|  1850|      13.8|  80|Japanese|
|46.6|        4|       86.0|        65|  2110|      17.9|  80|Japanese|
|44.0|        4|       97.0|        52|  2130|      24.6|  82|European|
+----+---------+-----------+----------+------+----------+----+--------+



### Analyze the data

In [9]:
#List the distinct origins
spark.sql('Select distinct Origin From mileage').show()

+--------+
|  Origin|
+--------+
|European|
|Japanese|
|American|
+--------+



In [13]:
#Count the Japanese cars (filter rows with Where before grouping)
spark.sql("Select Origin, count(*) as Car_count From mileage Where Origin = 'Japanese' Group by Origin").show()

+--------+---------+
|  Origin|Car_count|
+--------+---------+
|Japanese|       79|
+--------+---------+



In [14]:
#Count the number of cars with mileage greater than 40
spark.sql('Select count(*) as result From mileage Where mpg > 40').show()

+------+
|result|
+------+
|     8|
+------+



In [None]:
#Count the cars made in each year, largest count first
spark.sql('Select Year, count(*) as number_of_cars From mileage Group by Year Order by number_of_cars desc').show()

+----+--------------+
|Year|number_of_cars|
+----+--------------+
|  73|            40|
|  78|            36|
|  76|            34|
|  82|            30|
|  75|            30|
|  70|            29|
|  79|            29|
|  81|            28|
|  72|            28|
|  77|            28|
|  80|            27|
|  71|            27|
|  74|            26|
+----+--------------+



In [16]:
#Stop the Spark session to release its resources
spark.stop()