# ------------------------------------------------------------------------
## Exercise 1:

Within your `codeup-data-science` directory, create a new repo named `spark-exercises`. This will be where you do your work for this module. Create a repository on GitHub with the same name, and link your local repository to GitHub.

Save this work in your `spark-exercises` repo. Then add, commit, and push your changes.

Create a Jupyter notebook or Python script named `spark101` for this exercise.

Create a Spark dataframe that contains your favorite programming languages.

- Create a dataframe with one column named `language`
> Hint: Start with a pandas dataframe. Maybe use a dictionary?
- View the schema of the dataframe
- Output the shape of the dataframe
- Show the first 5 records in the dataframe

# ------------------------------------------------------------------------

In [1]:
import pyspark
import pandas as pd
import numpy as np
spark = pyspark.sql.SparkSession.builder.getOrCreate()

In [4]:
# Load the scraped repo data and keep only the language column.
pandas_df = pd.read_json('repo_readmes_10_feb_am.json')
pandas_df = pandas_df.drop(columns=['repo', 'readme_contents'])
pandas_df.head()


     language
0          C#
1  JavaScript
2         C++
3  JavaScript
4        None


In [7]:
df = spark.createDataFrame(pandas_df)

In [8]:
df.show(5)

+----------+
|  language|
+----------+
|        C#|
|JavaScript|
|       C++|
|JavaScript|
|      null|
+----------+
only showing top 5 rows



In [9]:
df.printSchema()

root
 |-- language: string (nullable = true)



In [10]:
print("DataFrame shape: ", df.count(), " x ", len(df.columns))

DataFrame shape:  180  x  1


# ------------------------------------------------------------------------
## Exercise 2:

Load the `mpg` dataset as a spark dataframe.

a. Create 1 column of output that contains a message like the one below for each record:

    The 1999 audi a4 has a 4 cylinder engine.

> Hint: You will need to concatenate values that already exist in the data with string literals

b. Transform the trans column so that it only contains either manual or auto.

> Hint: Consider spark string methods and `when().otherwise()` chaining
# ------------------------------------------------------------------------

In [13]:
# Import only what this exercise uses. Importing sum, min, and max by name
# from pyspark.sql.functions would shadow Python's built-ins.
from pyspark.sql.functions import concat, lit

## a. Create 1 column of output that contains a message like the one below for each record:

    The 1999 audi a4 has a 4 cylinder engine.

> Hint: You will need to concatenate values that already exist in the data with string literals



In [14]:
from pydataset import data

mpg = spark.createDataFrame(data("mpg"))
mpg.show(5)

+------------+-----+-----+----+---+----------+---+---+---+---+-------+
|manufacturer|model|displ|year|cyl|     trans|drv|cty|hwy| fl|  class|
+------------+-----+-----+----+---+----------+---+---+---+---+-------+
|        audi|   a4|  1.8|1999|  4|  auto(l5)|  f| 18| 29|  p|compact|
|        audi|   a4|  1.8|1999|  4|manual(m5)|  f| 21| 29|  p|compact|
|        audi|   a4|  2.0|2008|  4|manual(m6)|  f| 20| 31|  p|compact|
|        audi|   a4|  2.0|2008|  4|  auto(av)|  f| 21| 30|  p|compact|
|        audi|   a4|  2.8|1999|  6|  auto(l5)|  f| 16| 26|  p|compact|
+------------+-----+-----+----+---+----------+---+---+---+---+-------+
only showing top 5 rows



In [25]:
# mpg.select(concat(mpg.cyl, lit(" cylinders"))).show(5)

In [62]:
mpg.select(
    concat(
        lit('The '),
        mpg.year,
        lit(' '),
        mpg.manufacturer,
        lit(' '),
        mpg.model,
        lit(' has a '),
        mpg.cyl,
        lit(' cylinder engine'),
    )
).show(20, truncate=False)


+-----------------------------------------------------------------------------+
|concat(The , year,  , manufacturer,  , model,  has a , cyl,  cylinder engine)|
+-----------------------------------------------------------------------------+
|The 1999 audi a4 has a 4 cylinder engine                                     |
|The 1999 audi a4 has a 4 cylinder engine                                     |
|The 2008 audi a4 has a 4 cylinder engine                                     |
|The 2008 audi a4 has a 4 cylinder engine                                     |
|The 1999 audi a4 has a 6 cylinder engine                                     |
|The 1999 audi a4 has a 6 cylinder engine                                     |
|The 2008 audi a4 has a 6 cylinder engine                                     |
|The 1999 audi a4 quattro has a 4 cylinder engine                             |
|The 1999 audi a4 quattro has a 4 cylinder engine                             |
|The 2008 audi a4 quattro has a 4 cylind

## b. Transform the trans column so that it only contains either manual or auto.

> Hint: Consider spark string methods and `when().otherwise()` chaining

In [53]:
from pyspark.sql.functions import regexp_extract, regexp_replace
from pyspark.sql.functions import when

In [36]:
mpg.createOrReplaceTempView("mpg")

In [37]:
# mpg.where(mpg.cyl == 4).where(mpg["class"] == "subcompact").show()
# vs
# mpg.filter(mpg.cyl == 4).where(mpg["class"] == "subcompact").show()

In [40]:
mpg.show(2)

+------------+-----+-----+----+---+----------+---+---+---+---+-------+
|manufacturer|model|displ|year|cyl|     trans|drv|cty|hwy| fl|  class|
+------------+-----+-----+----+---+----------+---+---+---+---+-------+
|        audi|   a4|  1.8|1999|  4|  auto(l5)|  f| 18| 29|  p|compact|
|        audi|   a4|  1.8|1999|  4|manual(m5)|  f| 21| 29|  p|compact|
+------------+-----+-----+----+---+----------+---+---+---+---+-------+
only showing top 2 rows



In [44]:
spark.sql(
    """
SELECT trans
FROM mpg
"""
).show(5)

+----------+
|     trans|
+----------+
|  auto(l5)|
|manual(m5)|
|manual(m6)|
|  auto(av)|
|  auto(l5)|
+----------+
only showing top 5 rows



In [63]:
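mpg.select(
    'trans',
    # This attempt fails: ^([a]) captures only the single character "a",
    # so the comparison with "auto" is never true and every row falls
    # through to "manual".
    when(regexp_extract("trans", r"^([a])", 1) == "auto", "auto")
    .otherwise("manual")
).show(12)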

+----------+-----------------------------------------------------------------------------+
|     trans|CASE WHEN (regexp_extract(trans, ^([a]), 1) = auto) THEN auto ELSE manual END|
+----------+-----------------------------------------------------------------------------+
|  auto(l5)|                                                                       manual|
|manual(m5)|                                                                       manual|
|manual(m6)|                                                                       manual|
|  auto(av)|                                                                       manual|
|  auto(l5)|                                                                       manual|
|manual(m5)|                                                                       manual|
|  auto(av)|                                                                       manual|
|manual(m5)|                                                                       manual|
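
Every row above comes back as `manual` because the pattern `^([a])` captures a single character, so the extracted value is never equal to `auto`. The sketch below reproduces that logic with Python's `re` module as a stand-in for Spark's `regexp_extract`. Spark uses Java regular expressions, but for a pattern this simple the two engines should agree. The helper `extract` is hypothetical and only mimics returning an empty string when there is no match.

```python
import re


def extract(value, pattern, group):
    """Mimic regexp_extract: return the captured group, or "" on no match."""
    match = re.search(pattern, value)
    return match.group(group) if match else ""


samples = ["auto(l5)", "manual(m5)", "auto(av)"]

# The original pattern captures only the leading "a", never the word "auto".
captured = [extract(s, r"^([a])", 1) for s in samples]
print(captured)

# Comparing that capture with "auto" is therefore always False,
# which sends every row to the otherwise("manual") branch.
labels = ["auto" if c == "auto" else "manual" for c in captured]
print(labels)
```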

In [67]:
mpg.select(
    'trans',
    # regexp_extract('trans', r"^(\w+)\(", 1).alias('regexp_extract'),
    when(mpg.trans.like("a%"), "auto").otherwise("manual").alias("when + like")
).show()

+----------+-----------+
|     trans|when + like|
+----------+-----------+
|  auto(l5)|       auto|
|manual(m5)|     manual|
|manual(m6)|     manual|
|  auto(av)|       auto|
|  auto(l5)|       auto|
|manual(m5)|     manual|
|  auto(av)|       auto|
|manual(m5)|     manual|
|  auto(l5)|       auto|
|manual(m6)|     manual|
|  auto(s6)|       auto|
|  auto(l5)|       auto|
|manual(m5)|     manual|
|  auto(s6)|       auto|
|manual(m6)|     manual|
|  auto(l5)|       auto|
|  auto(s6)|       auto|
|  auto(s6)|       auto|
|  auto(l4)|       auto|
|  auto(l4)|       auto|
+----------+-----------+
only showing top 20 rows
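
The commented-out `regexp_extract` line in the cell above hints at an alternative: capture the word before the opening parenthesis instead of testing the first letter. Here is a rough sketch of what that pattern captures, again using Python's `re` module as a stand-in for Spark's Java-based regex engine. `transmission_type` is a hypothetical helper name.

```python
import re

# Same pattern as the commented-out line: one or more word characters
# at the start of the string, followed by a literal "(".
PATTERN = re.compile(r"^(\w+)\(")


def transmission_type(trans):
    """Return the text before the parenthesis, or "" if the pattern does not match."""
    match = PATTERN.search(trans)
    return match.group(1) if match else ""


values = ["auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto(s6)"]
cleaned = [transmission_type(v) for v in values]
print(cleaned)
```

In Spark, the same pattern could be passed to `regexp_extract` inside `withColumn('trans', ...)` to overwrite the column rather than display it alongside the original, which is closer to what part b asks for.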

