## Exercises

Using the repo setup directions, set up a new local and remote repository named spark-exercises. The local version of your repo should live inside ~/codeup-data-science.

Save this work in your spark-exercises repo. Then add, commit, and push your changes.


Create a Jupyter notebook or Python script named spark101 for this exercise.


Create a Spark dataframe that contains your favorite programming languages.

- The name of the column should be language
- View the schema of the dataframe
- Output the shape of the dataframe
- Show the first 5 records in the dataframe


Load the mpg dataset as a spark dataframe.

- Create 1 column of output that contains a message like the one below for each vehicle:

    - The 1999 audi a4 has a 4 cylinder engine.

- Transform the trans column so that it only contains either manual or auto.


Load the tips dataset as a spark dataframe.

- What percentage of observations are smokers?
- Create a column that contains the tip percentage
- Calculate the average tip percentage for each combination of sex and smoker.


Use the seattle weather dataset referenced in the lesson to answer the questions below.

- Convert the temperatures to Fahrenheit.
- Which month has the most rain, on average?
- Which year was the windiest?
- What is the most frequent type of weather in January?
- What is the average high and low temperature on sunny days in July in 2013 and 2014?
- What percentage of days were rainy in Q3 of 2015?
- For each year, find what percentage of days it rained (had non-zero precipitation).


In [1]:
import pandas as pd
import numpy as np
from pydataset import data

In [2]:
import pyspark
import pyspark.sql.functions as F

spark = pyspark.sql.SparkSession.builder.getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/10/21 16:09:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/10/21 16:09:40 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
22/10/21 16:09:40 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
22/10/21 16:09:40 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
22/10/21 16:09:40 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.


In [3]:
df = spark.createDataFrame(
    pd.DataFrame(
        {"language": [
                "python",
                "java",
                "c_plus_plus",
                "html",
            ]
        }))

df.show(truncate=False)

+-----------+
|language   |
+-----------+
|python     |
|java       |
|c_plus_plus|
|html       |
+-----------+



In [4]:
df.schema

StructType([StructField('language', StringType(), True)])

In [5]:
df.printSchema()

root
 |-- language: string (nullable = true)



In [6]:
df.count(), len(df.columns)

(4, 1)
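Spark dataframes have no `.shape` attribute, which is why the cell above pairs a row count with a column count. A minimal sketch of wrapping that pattern in a helper follows; `spark_shape` is a hypothetical name, not part of PySpark, and it assumes only that the object exposes `count()` and `columns`.

```python
def spark_shape(df):
    """Return (n_rows, n_columns), similar in spirit to pandas' DataFrame.shape.

    `spark_shape` is a hypothetical helper, not a PySpark API. It relies only
    on `df.count()` and `df.columns`. Keep in mind that count() is a Spark
    action, so calling this triggers a job.
    """
    return df.count(), len(df.columns)
```

Calling `spark_shape(df)` on the languages dataframe should give the same tuple as the cell above.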

In [8]:
df.describe().show()

+-------+-----------+
|summary|   language|
+-------+-----------+
|  count|          4|
|   mean|       null|
| stddev|       null|
|    min|c_plus_plus|
|    max|     python|
+-------+-----------+



In [9]:
df.show()

+-----------+
|   language|
+-----------+
|     python|
|       java|
|c_plus_plus|
|       html|
+-----------+



In [10]:
mpg = spark.createDataFrame(data("mpg"))
mpg.show(5)

+------------+-----+-----+----+---+----------+---+---+---+---+-------+
|manufacturer|model|displ|year|cyl|     trans|drv|cty|hwy| fl|  class|
+------------+-----+-----+----+---+----------+---+---+---+---+-------+
|        audi|   a4|  1.8|1999|  4|  auto(l5)|  f| 18| 29|  p|compact|
|        audi|   a4|  1.8|1999|  4|manual(m5)|  f| 21| 29|  p|compact|
|        audi|   a4|  2.0|2008|  4|manual(m6)|  f| 20| 31|  p|compact|
|        audi|   a4|  2.0|2008|  4|  auto(av)|  f| 21| 30|  p|compact|
|        audi|   a4|  2.8|1999|  6|  auto(l5)|  f| 16| 26|  p|compact|
+------------+-----+-----+----+---+----------+---+---+---+---+-------+
only showing top 5 rows



In [20]:
mpg.select(F.concat(F.lit('The '), mpg.year, F.lit(' '), mpg.manufacturer, F.lit(' '), mpg.model,
                    F.lit(' has a '), mpg.cyl, F.lit(' cylinder engine.')).alias('description')
          ).show(5, truncate=False)

+-----------------------------------------+
|description                              |
+-----------------------------------------+
|The 1999 audi a4 has a 4 cylinder engine.|
|The 1999 audi a4 has a 4 cylinder engine.|
|The 2008 audi a4 has a 4 cylinder engine.|
|The 2008 audi a4 has a 4 cylinder engine.|
|The 1999 audi a4 has a 6 cylinder engine.|
+-----------------------------------------+
only showing top 5 rows

In [21]:
mpg.select(mpg.trans).show(5)

+----------+
|     trans|
+----------+
|  auto(l5)|
|manual(m5)|
|manual(m6)|
|  auto(av)|
|  auto(l5)|
+----------+
only showing top 5 rows



In [24]:
mpg.withColumn('trans', F.when(mpg.trans.like('auto%'), 'auto').otherwise('manual')).select('trans').show(5)


+------+
| trans|
+------+
|  auto|
|manual|
|manual|
|  auto|
|  auto|
+------+
only showing top 5 rows
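The `when`/`otherwise` cell above labels every value that does not start with `auto` as `manual`. An alternative, sketched here under the assumption that every `trans` value has the form `name(code)` as in the rows shown, is to strip the parenthesised suffix with a regular expression. The function names are hypothetical; `pyspark` is imported inside the Spark helper so the plain-Python reference can be run on its own.

```python
import re


def simplify_trans(value):
    """Plain-Python reference: drop everything from the first '(' onward."""
    return re.sub(r"\(.*$", "", value)


def simplify_trans_column(mpg):
    """Spark version of the same idea.

    Assumes `mpg` is a Spark dataframe with a string column named `trans`.
    """
    import pyspark.sql.functions as F  # lazy import: only needed for the Spark path

    return mpg.withColumn("trans", F.regexp_replace("trans", r"\(.*$", ""))
```

For example, `simplify_trans("manual(m6)")` gives `"manual"`. Unlike the `when`/`otherwise` version, this keeps whatever prefix is present, so an unexpected value would pass through rather than being counted as `manual`.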



In [25]:
tips = spark.createDataFrame(data('tips'))

In [26]:
tips.count()

244

In [27]:
tips.groupby('smoker').count().show()

+------+-----+
|smoker|count|
+------+-----+
|    No|  151|
|   Yes|   93|
+------+-----+



In [29]:
tips.groupby('smoker').count().withColumn('percent', F.col('count') / tips.count() * 100).show()

+------+-----+------------------+
|smoker|count|           percent|
+------+-----+------------------+
|    No|  151|61.885245901639344|
|   Yes|   93|38.114754098360656|
+------+-----+------------------+
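The cell above calls `tips.count()` a second time to get the denominator. One possible variation, offered as a sketch rather than the intended solution, computes the total from the grouped counts with a window that spans all rows. The helper names are hypothetical, and the plain-Python function is included only to make the arithmetic easy to check.

```python
def percent_of_total(counts):
    """Plain-Python reference: map each label to its share of the total, in percent."""
    total = sum(counts.values())
    return {label: n / total * 100 for label, n in counts.items()}


def smoker_percentages(tips):
    """Spark sketch: percentage of rows per `smoker` value.

    Assumes `tips` is a Spark dataframe with a `smoker` column.
    """
    import pyspark.sql.functions as F  # lazy import: only needed for the Spark path
    from pyspark.sql import Window

    everything = Window.partitionBy()  # a single window covering every grouped row
    return (
        tips.groupby("smoker")
        .count()
        .withColumn("percent", F.col("count") / F.sum("count").over(everything) * 100)
    )
```

With the counts shown above, `percent_of_total({"No": 151, "Yes": 93})` reproduces the two percentages. Spark may log a warning that the window has no partition defined, which is harmless on a two-row result.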



In [30]:
tips.columns

['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size']

In [31]:
tips.withColumn('tip_percentage', tips.tip / tips.total_bill).show(5)

+----------+----+------+------+---+------+----+-------------------+
|total_bill| tip|   sex|smoker|day|  time|size|     tip_percentage|
+----------+----+------+------+---+------+----+-------------------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|0.05944673337257211|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|0.16054158607350097|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|0.16658733936220846|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2| 0.1397804054054054|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|0.14680764538430255|
+----------+----+------+------+---+------+----+-------------------+
only showing top 5 rows



In [41]:
tips.select(tips.tip, tips.total_bill, F.round((tips.tip / tips.total_bill), 3).alias('tip_percentage')).show(5)

+----+----------+--------------+
| tip|total_bill|tip_percentage|
+----+----------+--------------+
|1.01|     16.99|         0.059|
|1.66|     10.34|         0.161|
| 3.5|     21.01|         0.167|
|3.31|     23.68|          0.14|
|3.61|     24.59|         0.147|
+----+----------+--------------+
only showing top 5 rows



In [40]:
tips.withColumn('tip_percentage', tips.tip / tips.total_bill
               ).groupby('sex', 'smoker').agg(F.round(F.mean('tip_percentage'), 3).alias('avg_tip_p')).show()

+------+------+---------+
|   sex|smoker|avg_tip_p|
+------+------+---------+
|  Male|    No|    0.161|
|Female|    No|    0.157|
|  Male|   Yes|    0.153|
|Female|   Yes|    0.182|
+------+------+---------+



In [39]:
tips.groupby('sex').pivot('smoker').agg(F.round(F.mean(tips.tip / tips.total_bill), 3)).show()

+------+-----+-----+
|   sex|   No|  Yes|
+------+-----+-----+
|Female|0.157|0.182|
|  Male|0.161|0.153|
+------+-----+-----+



In [42]:
# Note: this rebinds `data`, shadowing the pydataset import from In [1].
from vega_datasets import data
weather = data.seattle_weather()
weather = spark.createDataFrame(weather)

In [43]:
weather.show(5)

+-------------------+-------------+--------+--------+----+-------+
|               date|precipitation|temp_max|temp_min|wind|weather|
+-------------------+-------------+--------+--------+----+-------+
|2012-01-01 00:00:00|          0.0|    12.8|     5.0| 4.7|drizzle|
|2012-01-02 00:00:00|         10.9|    10.6|     2.8| 4.5|   rain|
|2012-01-03 00:00:00|          0.8|    11.7|     7.2| 2.3|   rain|
|2012-01-04 00:00:00|         20.3|    12.2|     5.6| 4.7|   rain|
|2012-01-05 00:00:00|          1.3|     8.9|     2.8| 6.1|   rain|
+-------------------+-------------+--------+--------+----+-------+
only showing top 5 rows



In [44]:
weather.columns

['date', 'precipitation', 'temp_max', 'temp_min', 'wind', 'weather']

In [45]:
weather.count(), len(weather.columns)

(1461, 6)
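The first weather question asks for the temperatures in Fahrenheit. A minimal sketch follows, assuming `temp_max` and `temp_min` are recorded in degrees Celsius (the January values shown above are consistent with that, but it is an assumption). `c_to_f` and `add_fahrenheit` are hypothetical names. Because Spark `Column` objects support arithmetic operators, the same function works on plain numbers and on columns.

```python
def c_to_f(celsius):
    """Convert Celsius to Fahrenheit; works for numbers and for Spark Columns."""
    return celsius * 9 / 5 + 32


def add_fahrenheit(weather):
    """Return `weather` with temp_max and temp_min converted in place.

    Assumes `weather` is a Spark dataframe whose temperature columns are in Celsius.
    """
    import pyspark.sql.functions as F  # lazy import: only needed for the Spark path

    return (
        weather
        .withColumn("temp_max", c_to_f(F.col("temp_max")))
        .withColumn("temp_min", c_to_f(F.col("temp_min")))
    )
```

Something like `add_fahrenheit(weather).show(5)` could then be used to inspect the converted columns.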

In [46]:
weather.withColumn('month', F.month(weather.date)).groupby(F.col('month')).agg(
    F.mean(weather.precipitation).alias('avg_rainfall')
).sort(F.col('avg_rainfall').desc()).first()[0]

11

In [48]:
weather.withColumn('year', F.year(weather.date)).groupby(F.col('year')).agg(
    F.mean(weather.wind).alias('avg_wind')
).sort(F.col('avg_wind').desc()).first()

Row(year=2012, avg_wind=3.400819672131148)

In [49]:
weather.filter(F.month(weather.date) == 1).groupby(weather.weather).count().sort(F.col('count').desc()).show()

+-------+-----+
|weather|count|
+-------+-----+
|    fog|   38|
|   rain|   35|
|    sun|   33|
|drizzle|   10|
|   snow|    8|
+-------+-----+



In [50]:
weather.filter(F.month('date') == 7).filter(F.year('date').isin(2013, 2014)).filter(
    F.col('weather') == 'sun').agg(
    F.avg('temp_max').alias('average_high_temp'),
    F.avg('temp_min').alias('average_low_temp')).show()

+------------------+-----------------+
| average_high_temp| average_low_temp|
+------------------+-----------------+
|26.828846153846158|14.18269230769231|
+------------------+-----------------+



In [51]:
weather.filter(F.year('date') == 2015).filter(
    F.quarter('date') == 3).select(F.when(
    F.col('weather') == 'rain', 1).otherwise(0).alias('rain')).agg(F.mean('rain')).show()

+--------------------+
|           avg(rain)|
+--------------------+
|0.021739130434782608|
+--------------------+
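The value above is a fraction, while the question asks for a percentage. A short sketch of scaling and rounding the result follows. It keeps the same definition of a rainy day as the cell above (`weather == 'rain'`); `as_percent` and `rainy_percentage` are hypothetical helper names.

```python
def as_percent(fraction, digits=2):
    """Plain-Python reference: turn a fraction such as 0.25 into 25.0."""
    return round(fraction * 100, digits)


def rainy_percentage(weather, year=2015, quarter=3):
    """Spark sketch: percent of days labelled 'rain' in the given year and quarter.

    Assumes `weather` has a `date` column and a string `weather` column.
    """
    import pyspark.sql.functions as F  # lazy import: only needed for the Spark path

    is_rain = F.when(F.col("weather") == "rain", 1).otherwise(0)
    return (
        weather
        .filter((F.year("date") == year) & (F.quarter("date") == quarter))
        .agg(F.round(F.mean(is_rain) * 100, 2).alias("pct_rainy"))
    )
```

Applied to the fraction printed above, `as_percent(0.021739130434782608)` gives `2.17`, so roughly 2.17% of days in that quarter were labelled as rain.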



In [52]:
weather.withColumn('year', F.year('date')).select(
    F.when(F.col('precipitation') > 0, 1).otherwise(0).alias(
        'did_rain'), 'year').groupby('year').agg(F.mean('did_rain')).show()

+----+-------------------+
|year|      avg(did_rain)|
+----+-------------------+
|2012|0.48360655737704916|
|2013|0.41643835616438357|
|2014|  0.410958904109589|
|2015|0.39452054794520547|
+----+-------------------+

