### Spark Exercises

Within your `codeup-data-science` directory, create a new repo named `spark-exercises`. This will be where you do your work for this module. Create a repository on GitHub with the same name, and link your local repository to GitHub.

Save this work in your `spark-exercises` repo. Then add, commit, and push your changes.

Create a Jupyter notebook or Python script named `spark101` for this exercise.

In [2]:
import pyspark
import pandas as pd
import numpy as np
from pydataset import data
from pyspark.sql.functions import col, expr
# concat is needed for exercise 2; sum, min, and max are left out so they do not shadow the Python builtins
from pyspark.sql.functions import concat, avg, count, mean
from pyspark.sql.functions import lit
from pyspark.sql.functions import regexp_extract, regexp_replace
from pyspark.sql.functions import when
from pyspark.sql.functions import asc, desc
from pyspark.sql.functions import month, year, quarter


1. Create a Spark dataframe that contains your favorite programming languages.

    - The name of the column should be `language`
    - View the schema of the dataframe
    - Output the shape of the dataframe
    - Show the first 5 records in the dataframe

In [4]:
np.random.seed(456)

pl_df = pd.DataFrame(np.random.choice(['python', 'sql', 'java', 'c', 'r'], 20))
pl_df.columns = ['language']
spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pl_df)
df

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/02/11 13:48:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/02/11 13:48:13 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


DataFrame[language: string]

In [6]:
df.printSchema()

root
 |-- language: string (nullable = true)



In [8]:
print('DataFrame shape: ', df.count(), ' x ', len(df.columns))

DataFrame shape:  20  x  1


In [9]:
df.show(5)

+--------+
|language|
+--------+
|       c|
|       c|
|     sql|
|    java|
|       r|
+--------+
only showing top 5 rows



2. Load the `mpg` dataset as a Spark dataframe.

    a. For each vehicle, create 1 column of output that contains a message like the one below:
    
   > - The 1999 audi a4 has a 4 cylinder engine.
   
    b. Transform the `trans` column so that it only contains either `manual` or `auto`.

In [10]:
mpg = spark.createDataFrame(data("mpg"))
mpg.show(5)

+------------+-----+-----+----+---+----------+---+---+---+---+-------+
|manufacturer|model|displ|year|cyl|     trans|drv|cty|hwy| fl|  class|
+------------+-----+-----+----+---+----------+---+---+---+---+-------+
|        audi|   a4|  1.8|1999|  4|  auto(l5)|  f| 18| 29|  p|compact|
|        audi|   a4|  1.8|1999|  4|manual(m5)|  f| 21| 29|  p|compact|
|        audi|   a4|  2.0|2008|  4|manual(m6)|  f| 20| 31|  p|compact|
|        audi|   a4|  2.0|2008|  4|  auto(av)|  f| 21| 30|  p|compact|
|        audi|   a4|  2.8|1999|  6|  auto(l5)|  f| 16| 26|  p|compact|
+------------+-----+-----+----+---+----------+---+---+---+---+-------+
only showing top 5 rows



In [24]:
desc_column = mpg.select(concat(lit('The '),mpg.year, lit(' '), mpg.manufacturer, lit(' '), mpg.model, lit(' has a '), mpg.cyl, lit(' cylinder engine.')).alias('description'))

In [25]:
desc_column.show(5, truncate = False)

+-----------------------------------------+
|description                              |
+-----------------------------------------+
|The 1999 audi a4 has a 4 cylinder engine.|
|The 1999 audi a4 has a 4 cylinder engine.|
|The 2008 audi a4 has a 4 cylinder engine.|
|The 2008 audi a4 has a 4 cylinder engine.|
|The 1999 audi a4 has a 6 cylinder engine.|
+-----------------------------------------+
only showing top 5 rows



In [30]:
mpg.select('trans', regexp_replace('trans', r"\(.+$", "")).show()

+----------+---------------------------------+
|     trans|regexp_replace(trans, \(.+$, , 1)|
+----------+---------------------------------+
|  auto(l5)|                             auto|
|manual(m5)|                           manual|
|manual(m6)|                           manual|
|  auto(av)|                             auto|
|  auto(l5)|                             auto|
|manual(m5)|                           manual|
|  auto(av)|                             auto|
|manual(m5)|                           manual|
|  auto(l5)|                             auto|
|manual(m6)|                           manual|
|  auto(s6)|                             auto|
|  auto(l5)|                             auto|
|manual(m5)|                           manual|
|  auto(s6)|                             auto|
|manual(m6)|                           manual|
|  auto(l5)|                             auto|
|  auto(s6)|                             auto|
|  auto(s6)|                             auto|
|  auto(l4)| 

3. Load the `tips` dataset as a Spark dataframe.

 a. What percentage of observations are smokers?

 b. Create a column that contains the tip percentage.

 c. Calculate the average tip percentage for each combination of sex and smoker.

In [32]:
tips = spark.createDataFrame(data("tips"))
tips.show(5)

+----------+----+------+------+---+------+----+
|total_bill| tip|   sex|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|
+----------+----+------+------+---+------+----+
only showing top 5 rows



4. Use the Seattle weather dataset referenced in the lesson to answer the questions below.

 - Convert the temperatures to Fahrenheit.
 - Which month has the most rain, on average?
 - Which year was the windiest?
 - What is the most frequent type of weather in January?
 - What is the average high and low temperature on sunny days in July in 2013 and 2014?
 - What percentage of days were rainy in Q3 of 2015?
 - For each year, find what percentage of days it rained (had non-zero precipitation).