## Exercise
### Create a spark data frame that contains your favorite programming languages.
- The name of the column should be language.
- View the schema of the dataframe.
- Output the shape of the dataframe.
- Show the first 5 records in the dataframe.

In [5]:
import pandas as pd
import pyspark

In [8]:
spark = pyspark.sql.SparkSession.builder.getOrCreate()

programming_language = ["Python","JavaScript","Java","Go","PHP","Shell","C++"]

In [10]:
language = pd.DataFrame({"language":programming_language})

In [12]:
df = spark.createDataFrame(language)

In [16]:
df.printSchema()

root
 |-- language: string (nullable = true)



In [27]:
print(f"no. of columns: {len(df.columns)}")
print(f"no. of rows: {df.count()}")

no. of columns: 1
no. of rows: 7


In [13]:
df.show(5)

+----------+
|  language|
+----------+
|    Python|
|JavaScript|
|      Java|
|        Go|
|       PHP|
+----------+
only showing top 5 rows



### Load the mpg dataset as a spark dataframe.
- Create 1 column of output that contains, for each vehicle, a message like the one below:
    The 1999 audi a4 has a 4 cylinder engine.
- Transform the trans column so that it only contains either manual or auto.
---
- Use the seattle weather dataset referenced in the lesson to answer the questions below.
- Convert the temperatures to Fahrenheit.
- Which month has the most rain, on average?
- Which year was the windiest?
- What is the most frequent type of weather in January?
- What are the average high and low temperatures on sunny days in July of 2013 and 2014?
- What percentage of days were rainy in Q3 of 2015?

- For each year, find what percentage of days it rained (had non-zero precipitation).
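
The Fahrenheit conversion asked for above is the usual linear formula F = C * 9/5 + 32. Below is a minimal plain-Python sketch of that arithmetic. It assumes the dataset stores temperatures in Celsius, and the helper name `celsius_to_fahrenheit` is made up for illustration. The same expression can presumably be applied to a Spark column, since Spark columns support `*`, `/`, and `+` directly.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a Celsius temperature to Fahrenheit (hypothetical helper)."""
    return celsius * 9 / 5 + 32


# Spot-check the formula on a few familiar values.
for c in (0, 37, 100):
    print(c, "C ->", celsius_to_fahrenheit(c), "F")
```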

In [28]:
from pydataset import data

In [69]:
from pyspark.sql.functions import lit, concat, regexp_extract

In [70]:
mpg = data("mpg")
spark = pyspark.sql.SparkSession.builder.getOrCreate()

In [71]:
mpg = spark.createDataFrame(mpg)

In [72]:
mpg.select(concat(lit("The "),mpg.year, lit(" "), mpg.manufacturer, lit(" "), mpg.model, lit(" has a "), mpg.cyl, lit(" cylinder engine.")).alias("description")).show(truncate=False)

+--------------------------------------------------------------+
|description                                                   |
+--------------------------------------------------------------+
|The 1999 audi a4 has a 4 cylinder engine.                     |
|The 1999 audi a4 has a 4 cylinder engine.                     |
|The 2008 audi a4 has a 4 cylinder engine.                     |
|The 2008 audi a4 has a 4 cylinder engine.                     |
|The 1999 audi a4 has a 6 cylinder engine.                     |
|The 1999 audi a4 has a 6 cylinder engine.                     |
|The 2008 audi a4 has a 6 cylinder engine.                     |
|The 1999 audi a4 quattro has a 4 cylinder engine.             |
|The 1999 audi a4 quattro has a 4 cylinder engine.             |
|The 2008 audi a4 quattro has a 4 cylinder engine.             |
|The 2008 audi a4 quattro has a 4 cylinder engine.             |
|The 1999 audi a4 quattro has a 6 cylinder engine.             |
|The 1999 audi a4 quattro

Transform the trans column so that it only contains either manual or auto.

In [73]:
mpg.select("*").show(1)

+------------+-----+-----+----+---+--------+---+---+---+---+-------+
|manufacturer|model|displ|year|cyl|   trans|drv|cty|hwy| fl|  class|
+------------+-----+-----+----+---+--------+---+---+---+---+-------+
|        audi|   a4|  1.8|1999|  4|auto(l5)|  f| 18| 29|  p|compact|
+------------+-----+-----+----+---+--------+---+---+---+---+-------+
only showing top 1 row



In [78]:
trans_filter = regexp_extract("trans",r'^(\w+)',1)

In [80]:
mpg.select(trans_filter).show(5)

+--------------------------------+
|regexp_extract(trans, ^(\w+), 1)|
+--------------------------------+
|                            auto|
|                          manual|
|                          manual|
|                            auto|
|                            auto|
+--------------------------------+
only showing top 5 rows
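
The pattern `^(\w+)` captures the leading run of word characters and stops at the opening parenthesis, which is why `auto(l5)` becomes `auto`. As a quick sanity check, the same pattern can be tried with Python's `re` module before handing it to Spark. This is only a sketch: the helper name `transmission_type` is hypothetical, the sample strings simply follow the format of the `auto(l5)` value shown above, and Spark evaluates the pattern with Java's regex engine, which should behave the same for these ASCII inputs.

```python
import re

PATTERN = r'^(\w+)'  # same pattern passed to regexp_extract above


def transmission_type(trans):
    """Return the leading word of a trans value, or "" if nothing matches."""
    match = re.match(PATTERN, trans)
    return match.group(1) if match else ""


# Sample strings in the same shape as the trans column values.
for raw in ["auto(l5)", "manual(m5)", "auto(av)"]:
    print(raw, "->", transmission_type(raw))
```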



### Load the tips dataset as a spark dataframe.
- What percentage of observations are smokers?
    - Create a column that contains the tip percentage
    - Calculate the average tip percentage for each combination of sex and smoker.

In [81]:
from pydataset import data

In [89]:
from pyspark.sql.functions import round

In [84]:
tips = data("tips")
spark = pyspark.sql.SparkSession.builder.getOrCreate()

In [85]:
tips = spark.createDataFrame(tips)

In [112]:
percentage = (tips.tip/tips.total_bill).alias("tip_percentage")

In [114]:
tips.select("*",percentage).show(5)

+----------+----+------+------+---+------+----+-------------------+
|total_bill| tip|   sex|smoker|day|  time|size|     tip_percentage|
+----------+----+------+------+---+------+----+-------------------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|0.05944673337257211|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|0.16054158607350097|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|0.16658733936220846|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2| 0.1397804054054054|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|0.14680764538430255|
+----------+----+------+------+---+------+----+-------------------+
only showing top 5 rows



In [116]:
tips.groupBy("sex","smoker").agg(percentage)

AnalysisException: "expression '`tip`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;\nAggregate [sex#284, smoker#285], [sex#284, smoker#285, (tip#283 / total_bill#282) AS tip_percentage#368]\n+- LogicalRDD [total_bill#282, tip#283, sex#284, smoker#285, day#286, time#287, size#288L], false\n"
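
The exception is raised because `percentage` is a per-row expression, while `agg` needs something that collapses each group to a single value. Wrapping the ratio in an aggregate such as `avg` from `pyspark.sql.functions` is the likely fix. The sketch below shows in plain Python what that grouped average computes, using only the five rows displayed earlier. The function name `average_tip_percentage` is hypothetical, and the Spark line in the trailing comment is an untested suggestion, not output from this notebook.

```python
from collections import defaultdict

# (total_bill, tip, sex, smoker) for the five rows shown in the output above.
rows = [
    (16.99, 1.01, "Female", "No"),
    (10.34, 1.66, "Male", "No"),
    (21.01, 3.50, "Male", "No"),
    (23.68, 3.31, "Male", "No"),
    (24.59, 3.61, "Female", "No"),
]


def average_tip_percentage(rows):
    """Average tip / total_bill for each (sex, smoker) combination."""
    groups = defaultdict(list)
    for total_bill, tip, sex, smoker in rows:
        groups[(sex, smoker)].append(tip / total_bill)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


result = average_tip_percentage(rows)
for key, value in sorted(result.items()):
    print(key, value)

# Presumed Spark equivalent (not run here):
# from pyspark.sql.functions import avg
# tips.groupBy("sex", "smoker").agg(
#     avg(tips.tip / tips.total_bill).alias("avg_tip_percentage")
# )
```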
