# Capstone Project: Parsing Nested Data

Mount JSON data using DBFS, define and apply a schema, parse fields, and save the cleaned results back to DBFS.

## Instructions

A common source of data in ETL pipelines is <a href="https://kafka.apache.org/" target="_blank">Apache Kafka</a>, or the managed alternative <a href="https://docs.microsoft.com/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview" target="_blank">Azure Event Hubs</a>. A common data type in these use cases is newline-separated JSON.
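
As a minimal sketch of what "newline-separated JSON" means, the plain-Python snippet below parses a payload in which every line is an independent JSON document. The sample payload and the helper name `parse_ndjson` are hypothetical and only for illustration; in this exercise Spark's JSON reader does this work.

```python
import json

# Hypothetical two-record payload; the field names are illustrative only.
raw = '{"id": 1, "text": "hello"}\n{"id": 2, "text": "world"}\n'

def parse_ndjson(payload):
    """Parse newline-separated JSON: one independent JSON document per line."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]

records = parse_ndjson(raw)
print(len(records), records[0]["text"])
```

Because each line parses on its own, a file in this format can be split on line boundaries and processed in parallel.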

For this exercise, Tweets were streamed from the <a href="https://developer.twitter.com/en/docs" target="_blank">Twitter firehose API</a>.
Use these four exercises to perform ETL on the data in this bucket:  
<br>
1. Extracting and Exploring the Data
2. Defining and Applying a Schema
3. Creating the Tables
4. Loading the Results

Run the following cell.

In [3]:
%run "../Includes/Classroom-Setup"

## Exercise 1: Extracting and Exploring the Data

First, review the data.

### Step 1: Explore the Folder Structure

Explore the mount and review the directory structure using `%fs ls`. The data is located in `/mnt/training/twitter/firehose/`.

In [6]:
# TODO
path="/mnt/training/twitter/firehose/2018/01/10/01"
display(dbutils.fs.ls(path))

path,name,size
dbfs:/mnt/training/twitter/firehose/2018/01/10/01/twitterstream-1-2018-01-10-01-02-54-76b07dc1-609a-47d7-ab72-60975b109629,twitterstream-1-2018-01-10-01-02-54-76b07dc1-609a-47d7-ab72-60975b109629,27357467
dbfs:/mnt/training/twitter/firehose/2018/01/10/01/twitterstream-1-2018-01-10-01-12-55-ec34b878-b230-43a0-82a9-9d69dd0fbda5,twitterstream-1-2018-01-10-01-12-55-ec34b878-b230-43a0-82a9-9d69dd0fbda5,19907670
dbfs:/mnt/training/twitter/firehose/2018/01/10/01/twitterstream-1-2018-01-10-01-22-55-6916b906-8a3d-49a2-954b-bbfcd69b03fd,twitterstream-1-2018-01-10-01-22-55-6916b906-8a3d-49a2-954b-bbfcd69b03fd,21936208
dbfs:/mnt/training/twitter/firehose/2018/01/10/01/twitterstream-1-2018-01-10-01-33-42-134418ff-a7d1-4a7e-be5e-398638187699,twitterstream-1-2018-01-10-01-33-42-134418ff-a7d1-4a7e-be5e-398638187699,44


### Step 2: Explore a Single File

> "Premature optimization is the root of all evil." -Sir Tony Hoare

There are a few gigabytes of Twitter data available in the directory. Hoare's law about premature optimization is applicable here. Instead of building a schema for the entire data set and then trying it out, work iteratively, which is much less error-prone and runs much faster. Start by working on a single file before you apply your proof of concept across the entire data set.

Read a single file.  Start with `twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4`. Find this in `/mnt/training/twitter/firehose/2018/01/08/18/`.  Save the results to the variable `df`.

In [9]:
# TODO
folder = "/mnt/training/twitter/firehose/2018/01/08/18/"
fname = "twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4"
path = folder+fname

df = spark.read.json(path)

In [10]:
# TEST - Run this cell to test your solution
cols = df.columns

dbTest("ET1-P-08-02-01", 1744, df.count())
dbTest("ET1-P-08-02-02", True, "id" in cols)
dbTest("ET1-P-08-02-03", True, "text" in cols)

print("Tests passed!")

Display the schema.

In [12]:
# TODO
df.printSchema()

Count the records in the file. Save the result to `dfCount`.

In [14]:
# TODO
dfCount = df.count()

In [15]:
# TEST - Run this cell to test your solution
dbTest("ET1-P-08-03-01", 1744, dfCount)

print("Tests passed!")

## Exercise 2: Defining and Applying a Schema

Applying schemas is especially helpful for data with many fields to sort through. With a complex dataset like this, define a schema **that captures only the relevant fields**.

Capture the hashtags and dates from the data to get a sense for Twitter trends. Use the same file as above.
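
To make the idea of a schema that captures only the relevant fields concrete, here is a rough plain-Python analogy. It is not how Spark implements schemas: the nested-dict `schema`, the `project` helper, and the sample record are all assumptions made for this sketch.

```python
# A schema here is a nested dict: leaf fields map to None, nested structs to dicts.
# This only mirrors the idea of a narrow schema; it is not Spark's implementation.
schema = {"id": None, "user": {"id": None}, "lang": None}

def project(record, schema):
    """Keep only the fields named in schema; fields the record lacks become None."""
    out = {}
    for name, sub in schema.items():
        value = record.get(name)
        if isinstance(sub, dict):
            # Recurse into nested structs; a missing struct becomes None.
            out[name] = project(value, sub) if isinstance(value, dict) else None
        else:
            out[name] = value
    return out

# Hypothetical record with extra fields that the schema does not ask for.
tweet = {"id": 1, "lang": "en", "source": "web", "user": {"id": 7, "name": "x"}}
print(project(tweet, schema))
```

Fields outside the schema (`source`, `user.name`) are dropped, and a record with none of the requested fields comes back with every value set to `None`.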

### Step 1: Understanding the Data Model

In order to apply structure to semi-structured data, you first must understand the data model.  

There are two kinds of data model to choose from: relational and non-relational.<br><br>
* **Relational models** are within the domain of traditional databases. [Normalization](https://en.wikipedia.org/wiki/Database_normalization) is the primary goal of the data model. <br>
* **Non-relational data models** prefer scalability, performance, or flexibility over normalized data.

Use the following relational model to define several tables that can be joined on shared columns to reconstitute the original data. Regardless of the data model, the ETL principles are roughly the same.
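
The split-then-join idea can be sketched in plain Python. The record below is hypothetical (the IDs are made up and the column names are chosen for this sketch); it only illustrates how one nested record becomes rows in several flat tables that share key columns.

```python
# Hypothetical nested record shaped like the fields used in this exercise.
nested = {
    "id": 100,
    "text": "tea time #Tea #GoldenGlobes",
    "user": {"id": 7, "screen_name": "zooeeen"},
    "entities": {"hashtags": [{"text": "Tea"}, {"text": "GoldenGlobes"}]},
}

# Split the record into three flat tables that share key columns.
tweets = [{"tweet_id": nested["id"], "user_id": nested["user"]["id"], "text": nested["text"]}]
accounts = [{"user_id": nested["user"]["id"], "screen_name": nested["user"]["screen_name"]}]
hashtags = [{"tweet_id": nested["id"], "hashtag": h["text"]}
            for h in nested["entities"]["hashtags"]]

# Join the tables back together on their keys to rebuild the original view.
by_user = {a["user_id"]: a for a in accounts}
rebuilt = [
    {**t,
     "screen_name": by_user[t["user_id"]]["screen_name"],
     "hashtags": [h["hashtag"] for h in hashtags if h["tweet_id"] == t["tweet_id"]]}
    for t in tweets
]
print(rebuilt[0]["screen_name"], rebuilt[0]["hashtags"])
```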

Compare the following [Entity-Relationship Diagram](https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model) to the schema you printed out in the previous step to get a sense for how to populate the tables.

<img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/ER-diagram.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

### Step 2: Create a Schema for the `Tweet` Table

Create a schema for the JSON data to extract just the information that is needed for the `Tweet` table, parsing each of the following fields in the data model:

| Field | Type|
|-------|-----|
| tweet_id | integer |
| user_id | integer |
| language | string |
| text | string |
| created_at | string* |

*Note: Start with `created_at` as a string. Turn this into a timestamp later.
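
The table lists `tweet_id` and `user_id` as integers, but the width of the integer matters. Spark's `IntegerType` is a 32-bit signed integer and `LongType` is a 64-bit signed integer. The quick check below, using two IDs that appear in this data set, suggests why a 64-bit type is the safer choice for these columns.

```python
INT32_MAX = 2**31 - 1  # largest value a 32-bit signed integer can hold
INT64_MAX = 2**63 - 1  # largest value a 64-bit signed integer can hold

tweet_id = 950438954272096257  # a tweet ID from the sample file
user_id = 738897225061912576   # a user ID from the sample file

for name, value in [("tweet_id", tweet_id), ("user_id", user_id)]:
    # Report whether each ID fits in 32 bits and in 64 bits.
    print(name, value <= INT32_MAX, value <= INT64_MAX)
```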

Save the schema to `tweetSchema`, use it to create a dataframe named `tweetDF`, and use the same file used in the exercise above: `"/mnt/training/twitter/firehose/2018/01/08/18/twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4"`.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** You might need to reexamine the data schema. <br>
<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** [Import types from `pyspark.sql.types`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=pyspark%20sql%20types#module-pyspark.sql.types).

In [20]:
# TODO
from pyspark.sql.types import IntegerType, StringType, StructField, StructType, LongType, ArrayType
path = "/mnt/training/twitter/firehose/2018/01/08/18/twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4"

tweetSchema = StructType([
  StructField("id", LongType(), True),
  StructField("user", StructType([
    StructField("id", LongType(), True)
  ]), True),
  StructField("lang", StringType(), True),
  StructField("text", StringType(), True),
  StructField("created_at", StringType(), True)
])

tweetDF = (spark.read
    .schema(tweetSchema)
    .json(path)
)

display(tweetDF)

id,user,lang,text,created_at
,,,,
,,,,
9.504389542720961e+17,List(371607576),en,RT @TheTinaVasquez: Quick facts for the know-nothings who will tweet me today: MS-13 began in Los Angeles. The U.S. helped fund El Salvador…,Mon Jan 08 18:47:59 +0000 2018
9.504389542889144e+17,List(732417055),ja,【太ももを引きしめるエクササイズ】足あげ～１．うつぶせに寝る。２．片方の足先を床から10センチ上にあげてキープ３．２の状態で足を振り上げる。※お腹が床から離れると× #diet,Mon Jan 08 18:47:59 +0000 2018
9.504389542764504e+17,List(235927210),tr,Ben bir beni bulup icine girip saklanirsam kim beni bulur,Mon Jan 08 18:47:59 +0000 2018
9.504389542804723e+17,List(1564880654),ar,تواصل قالوا عن قطر المتحدث باسم الجيش الصهيوني يشكر قناة الجزيرة https://t.co/j0RgDwS36n … #صاروخ_سعودي_يرعب_ايران,Mon Jan 08 18:47:59 +0000 2018
9.504389542888897e+17,List(349070364),en,*Before you argue about your dirty house someone didn't clean or sweep -* *Think of the people who are living in the streets.*,Mon Jan 08 18:47:59 +0000 2018
9.504389542806692e+17,List(340482488),en,RT @TippyLexx: Bruh you ever accidentally open a message and be like damn now I gotta reply 😂😂,Mon Jan 08 18:47:59 +0000 2018
9.504389542764419e+17,List(4354072997),pt,"RT @MorraoTudo2: A liberdade é só questão de tempo, solta os faixa preta 🔐🔓⏳✌️",Mon Jan 08 18:47:59 +0000 2018
9.504389542764787e+17,List(738897225061912576),en,I just want this all to be over,Mon Jan 08 18:47:59 +0000 2018


In [21]:
# TEST - Run this cell to test your solution
from pyspark.sql.functions import col

schema = tweetSchema.fieldNames()
schema.sort()
tweetCount = tweetDF.filter(col("id").isNotNull()).count()

dbTest("ET1-P-08-04-01", 'created_at', schema[0])
dbTest("ET1-P-08-04-02", 'id', schema[1])
dbTest("ET1-P-08-04-03", 1491, tweetCount)

assert schema[0] == 'created_at' and schema[1] == 'id'
assert tweetCount == 1491

print("Tests passed!")

### Step 3: Create a Schema for the Remaining Tables

Finish off the full schema, save it to `fullTweetSchema`, and use it to create the DataFrame `fullTweetDF`. Your schema should parse all the entities from the ER diagram above. Remember: start small, run your code, and then iterate.

In [23]:
# TODO
path = "/mnt/training/twitter/firehose/2018/01/08/18/twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4"

fullTweetSchema = StructType([
    StructField("id", LongType(), True),
    StructField("user", StructType([
        StructField("id", LongType(), True),
        StructField("screen_name", StringType(), True),
        StructField("location", StringType(), True),
        StructField("friends_count", IntegerType(), True),
        StructField("followers_count", IntegerType(), True),
        StructField("description", StringType(), True)
    ]), True),
    StructField("entities", StructType([
        StructField("hashtags", ArrayType(
            StructType([
                StructField("text", StringType(), True)
            ])
        ), True),
        StructField("url", ArrayType(
            StructType([
                StructField("url", StringType(), True),
                StructField("expanded_url", StringType(), True),
                StructField("display_url", StringType(), True)
            ])
        ), True)
    ]), True),
    StructField("lang", StringType(), True),
    StructField("text", StringType(), True),
    StructField("created_at", StringType(), True)

])

fullTweetDF = (spark.read
    .schema(fullTweetSchema)
    .json(path)
)

fullTweetDF.printSchema()
display(fullTweetDF)

id,user,entities,lang,text,created_at
,,,,,
,,,,,
9.504389542720961e+17,"List(371607576, smileifyou_love, null, 473, 160, •Psalm 34:18• Living life one day at a time ✌️)","List(List(), null)",en,RT @TheTinaVasquez: Quick facts for the know-nothings who will tweet me today: MS-13 began in Los Angeles. The U.S. helped fund El Salvador…,Mon Jan 08 18:47:59 +0000 2018
9.504389542889144e+17,"List(732417055, bw198e18, null, 1641, 1285, 【期間限定】今なら無料！！ ただ今話題沸騰中の「ダイエットできるアプリ」こと「ヤセサポ」！！ 今だけ無料でダウンロードできます！この機会にぜひ！お試しあれ★ダウンロード⇒http://bit.ly/MxDjjc)","List(List(List(diet)), null)",ja,【太ももを引きしめるエクササイズ】足あげ～１．うつぶせに寝る。２．片方の足先を床から10センチ上にあげてキープ３．２の状態で足を振り上げる。※お腹が床から離れると× #diet,Mon Jan 08 18:47:59 +0000 2018
9.504389542764504e+17,"List(235927210, marlascigarette, null, 214, 223, △)","List(List(), null)",tr,Ben bir beni bulup icine girip saklanirsam kim beni bulur,Mon Jan 08 18:47:59 +0000 2018
9.504389542804723e+17,"List(1564880654, rebaab_1326, null, 45, 0, null)","List(List(List(صاروخ_سعودي_يرعب_ايران)), null)",ar,تواصل قالوا عن قطر المتحدث باسم الجيش الصهيوني يشكر قناة الجزيرة https://t.co/j0RgDwS36n … #صاروخ_سعودي_يرعب_ايران,Mon Jan 08 18:47:59 +0000 2018
9.504389542888897e+17,"List(349070364, puskine, Kampala, Uganda, 5008, 4916, God first . Football fun . Talk so much . Reader. Year of FINANCIAL BREAK THROUGH . Still learning how to love kibitram@gmail.com. +256779646952)","List(List(), null)",en,*Before you argue about your dirty house someone didn't clean or sweep -* *Think of the people who are living in the streets.*,Mon Jan 08 18:47:59 +0000 2018
9.504389542806692e+17,"List(340482488, xNina_Beana, the land , 1130, 1646, Prince Carter ❤️ && Messiah Carter Miles ❤️)","List(List(), null)",en,RT @TippyLexx: Bruh you ever accidentally open a message and be like damn now I gotta reply 😂😂,Mon Jan 08 18:47:59 +0000 2018
9.504389542764419e+17,"List(4354072997, gbfranca22, cpx da congo🔞, 252, 632, mãe nunca te escutei, mas sempre te amarei❤)","List(List(), null)",pt,"RT @MorraoTudo2: A liberdade é só questão de tempo, solta os faixa preta 🔐🔓⏳✌️",Mon Jan 08 18:47:59 +0000 2018
9.504389542764787e+17,"List(738897225061912576, squeeqi, null, 213, 160, We are two guys who have great knowledge in scripting. If you have 10k+ coins on CSGODouble we can help you triple that amount. Check out how in the link below)","List(List(), null)",en,I just want this all to be over,Mon Jan 08 18:47:59 +0000 2018


In [24]:
# TEST - Run this cell to test your solution
from pyspark.sql.functions import col

schema = fullTweetSchema.fieldNames()
schema.sort()
tweetCount = fullTweetDF.filter(col("id").isNotNull()).count()

assert tweetCount == 1491

dbTest("ET1-P-08-05-01", "created_at", schema[0])
dbTest("ET1-P-08-05-02", "entities", schema[1])
dbTest("ET1-P-08-05-03", 1491, tweetCount)

print("Tests passed!")

## Exercise 3: Creating the Tables

Apply the schema you defined to create tables that match the relational data model.

### Step 1: Filtering Nulls

The Twitter data contains both deletions and tweets. This is why some records appear as null values. Create a DataFrame called `fullTweetFilteredDF` that filters out the null values.
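
The following plain-Python sketch illustrates why deletion records show up as rows of nulls once a tweet-shaped schema is applied. Both sample records are hypothetical, and the shape of the deletion record in particular is an assumption made for illustration.

```python
# Hypothetical stream sample: one tweet and one deletion notice.
# The shape of the deletion record is an assumption for illustration.
stream = [
    {"id": 1, "text": "hello", "user": {"id": 7}},
    {"delete": {"status": {"id": 1, "user_id": 7}}},
]

# Reading with a tweet-shaped schema: fields a record lacks come back as None.
rows = [{"id": r.get("id"), "text": r.get("text")} for r in stream]

# Equivalent in spirit to keeping only rows whose `id` column is not null.
tweets_only = [row for row in rows if row["id"] is not None]
print(len(rows), len(tweets_only))
```

The deletion notice has no top-level `id` or `text`, so every requested field is empty for that row, and filtering on a non-null `id` removes it.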

In [27]:
# TODO
from pyspark.sql.functions import col

fullTweetFilteredDF = fullTweetDF.filter(col("id").isNotNull())

In [28]:
# TEST - Run this cell to test your solution
dbTest("ET1-P-08-06-01", 1491, fullTweetFilteredDF.count())

print("Tests passed!")

### Step 2: Creating the `Tweet` Table

Twitter uses a non-standard timestamp format that Spark doesn't recognize. Currently the `created_at` column is formatted as a string. Create the `Tweet` table and save it as `tweetDF`. Parse the timestamp column using `unix_timestamp`, and cast the result as `TimestampType`. The timestamp format is `EEE MMM dd HH:mm:ss ZZZZZ yyyy`.
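
To see what the pattern is matching, here is a plain-Python sketch that parses one `created_at` value from the sample file with `datetime.strptime`. The `strptime` directives are this sketch's approximate translation of the Spark pattern, not something the exercise itself uses.

```python
from datetime import datetime, timezone

# Rough strptime counterpart of "EEE MMM dd HH:mm:ss ZZZZZ yyyy":
# day name, month name, day of month, time, UTC offset, year.
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

raw = "Mon Jan 08 18:47:59 +0000 2018"  # a created_at value from the sample file
parsed = datetime.strptime(raw, TWITTER_FORMAT)

# Seconds since the Unix epoch, comparable to what a unix-timestamp step yields.
epoch_seconds = int(parsed.timestamp())
utc_view = parsed.astimezone(timezone.utc)
print(utc_view.isoformat(), epoch_seconds)
```

The year comes last and the UTC offset sits in the middle of the string, which is why a default timestamp parser is unlikely to recognize the value without an explicit pattern.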

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Use `alias` to alias the name of your columns to the final name you want for them.  
<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** `id` corresponds to `tweet_id` and `user.id` corresponds to `user_id`.

In [30]:
# TODO
from pyspark.sql.functions import unix_timestamp
from pyspark.sql.types import TimestampType

timestampFormat = "EEE MMM dd HH:mm:ss ZZZZZ yyyy"
tweetDF = (fullTweetFilteredDF
           .select(col("id").alias("tweet_id"),
                   col("user.id").alias("user_id"),
                   col("lang").alias("language"),
                   "text",
                   unix_timestamp("created_at", timestampFormat).cast(TimestampType()).alias("createdAt")
                   )
          )
display(tweetDF)

tweet_id,user_id,language,text,createdAt
950438954272096257,371607576,en,RT @TheTinaVasquez: Quick facts for the know-nothings who will tweet me today: MS-13 began in Los Angeles. The U.S. helped fund El Salvador…,2018-01-08T18:47:59.000+0000
950438954288914432,732417055,ja,【太ももを引きしめるエクササイズ】足あげ～１．うつぶせに寝る。２．片方の足先を床から10センチ上にあげてキープ３．２の状態で足を振り上げる。※お腹が床から離れると× #diet,2018-01-08T18:47:59.000+0000
950438954276450305,235927210,tr,Ben bir beni bulup icine girip saklanirsam kim beni bulur,2018-01-08T18:47:59.000+0000
950438954280472576,1564880654,ar,تواصل قالوا عن قطر المتحدث باسم الجيش الصهيوني يشكر قناة الجزيرة https://t.co/j0RgDwS36n … #صاروخ_سعودي_يرعب_ايران,2018-01-08T18:47:59.000+0000
950438954288889856,349070364,en,*Before you argue about your dirty house someone didn't clean or sweep -* *Think of the people who are living in the streets.*,2018-01-08T18:47:59.000+0000
950438954280669184,340482488,en,RT @TippyLexx: Bruh you ever accidentally open a message and be like damn now I gotta reply 😂😂,2018-01-08T18:47:59.000+0000
950438954276442113,4354072997,pt,"RT @MorraoTudo2: A liberdade é só questão de tempo, solta os faixa preta 🔐🔓⏳✌️",2018-01-08T18:47:59.000+0000
950438954276478976,738897225061912576,en,I just want this all to be over,2018-01-08T18:47:59.000+0000
950438954289033216,273646363,ar,RT @Arab_original: للاسف قطاع كان ممكن حل ولا اروع للبطاله لكن وزارة النقل قررت ان لا تنظم السوق بحجه السوق الحرة !! اي حريه والشركتين يسحق…,2018-01-08T18:47:59.000+0000
950438954289033218,1541143441,ru,RT @craneswordboi: блять мне так смешно от слова срождество,2018-01-08T18:47:59.000+0000


In [31]:
# TEST - Run this cell to test your solution
from pyspark.sql.types import TimestampType
t = tweetDF.select("createdAt").schema[0]

dbTest("ET1-P-08-07-01", TimestampType(), t.dataType)

print("Tests passed!")

### Step 3: Creating the Account Table

Save the account table as `accountDF`.

In [33]:
# TODO
accountDF = (fullTweetFilteredDF.select(
        col("user.id").alias("userID"),
        col("user.screen_name").alias("screenName"),
        col("user.location"),
        col("user.friends_count").alias("friendsCount"),
        col("user.followers_count").alias("followersCount"),
        col("user.description")
    )
)

display(accountDF)

userID,screenName,location,friendsCount,followersCount,description
371607576,smileifyou_love,,473,160,•Psalm 34:18• Living life one day at a time ✌️
732417055,bw198e18,,1641,1285,【期間限定】今なら無料！！ ただ今話題沸騰中の「ダイエットできるアプリ」こと「ヤセサポ」！！ 今だけ無料でダウンロードできます！この機会にぜひ！お試しあれ★ダウンロード⇒http://bit.ly/MxDjjc
235927210,marlascigarette,,214,223,△
1564880654,rebaab_1326,,45,0,
349070364,puskine,"Kampala, Uganda",5008,4916,God first . Football fun . Talk so much . Reader. Year of FINANCIAL BREAK THROUGH . Still learning how to love kibitram@gmail.com. +256779646952
340482488,xNina_Beana,the land,1130,1646,Prince Carter ❤️ && Messiah Carter Miles ❤️
4354072997,gbfranca22,cpx da congo🔞,252,632,"mãe nunca te escutei, mas sempre te amarei❤"
738897225061912576,squeeqi,,213,160,We are two guys who have great knowledge in scripting. If you have 10k+ coins on CSGODouble we can help you triple that amount. Check out how in the link below
273646363,iiib53,,631,427,
1541143441,nappo_what,🇫🇮🇺🇦,297,925,


In [34]:
# TEST - Run this cell to test your solution
cols = accountDF.columns

dbTest("ET1-P-08-08-01", True, "friendsCount" in cols)
dbTest("ET1-P-08-08-02", True, "screenName" in cols)
dbTest("ET1-P-08-08-03", 1491, accountDF.count())


print("Tests passed!")

### Step 4: Creating Hashtag and URL Tables Using `explode`

Each tweet in the data set contains zero, one, or many URLs and hashtags. Parse these using the `explode` function so each URL or hashtag has its own row.

In this example, `explode` produces one row for each value in the array column `hashtags`. All other columns are left untouched.

```
+---------------+--------------------+----------------+
|     screenName|            hashtags|explodedHashtags|
+---------------+--------------------+----------------+
|        zooeeen|[[Tea], [GoldenGl...|           [Tea]|
|        zooeeen|[[Tea], [GoldenGl...|  [GoldenGlobes]|
|mannydidthisone|[[beats], [90s], ...|         [beats]|
|mannydidthisone|[[beats], [90s], ...|           [90s]|
|mannydidthisone|[[beats], [90s], ...|     [90shiphop]|
|mannydidthisone|[[beats], [90s], ...|           [pac]|
|mannydidthisone|[[beats], [90s], ...|        [legend]|
|mannydidthisone|[[beats], [90s], ...|          [thug]|
|mannydidthisone|[[beats], [90s], ...|         [music]|
|mannydidthisone|[[beats], [90s], ...|     [westcoast]|
|mannydidthisone|[[beats], [90s], ...|        [eminem]|
|mannydidthisone|[[beats], [90s], ...|         [drdre]|
|mannydidthisone|[[beats], [90s], ...|          [trap]|
|  Satish0919995|[[BB11], [BiggBos...|          [BB11]|
|  Satish0919995|[[BB11], [BiggBos...|    [BiggBoss11]|
|  Satish0919995|[[BB11], [BiggBos...| [WeekendKaVaar]|
+---------------+--------------------+----------------+
```

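Like `pivot`, `explode` reshapes a table, but it works in the opposite direction: `pivot` turns row values into columns, while `explode` turns array elements into rows.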
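
As a plain-Python analogy for the table above, the sketch below emits one output row per array element and copies the remaining columns. The helper is a hypothetical stand-in written for this illustration, not Spark's `explode`, and unlike the table it drops the original array column from the output.

```python
# Rows shaped like the table above: a name plus an array of hashtags.
rows = [
    {"screenName": "zooeeen", "hashtags": ["Tea", "GoldenGlobes"]},
    {"screenName": "quiet_user", "hashtags": []},  # hypothetical row with no hashtags
]

def explode(rows, array_col, out_col):
    """Yield one output row per array element, copying the other columns."""
    for row in rows:
        rest = {k: v for k, v in row.items() if k != array_col}
        # An empty or missing array contributes no output rows.
        for item in row[array_col] or []:
            yield {**rest, out_col: item}

exploded = list(explode(rows, "hashtags", "hashtag"))
print(exploded)
```

A row whose array is empty produces no output, which matches Spark's `explode`; Spark's `explode_outer` keeps such rows with a null value instead.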

Create the rest of the tables and save them to the following DataFrames:<br><br>

* `hashtagDF`
* `urlDF`

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> <a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=explode#pyspark.sql.functions.explode" target="_blank">Find the documentation for `explode` here</a>

In [36]:
# TODO
from pyspark.sql.functions import explode, col

# One row per hashtag: explode the array of hashtag text values
hashtagDF = (fullTweetFilteredDF
    .select(col("id").alias("tweetID"),
            explode(col("entities.hashtags.text")).alias("hashtag")
    )
)

# One row per URL: explode the array of URL structs, then flatten each struct
urlDF = (fullTweetFilteredDF
    .select(col("id").alias("tweetID"),
            explode(col("entities.url")).alias("urls"))
    .select(
        col("tweetID"),
        col("urls.url").alias("URL"),
        col("urls.display_url").alias("displayURL"),
        col("urls.expanded_url").alias("expandedURL"))
)
#hashtagDF.show()
urlDF.show()

In [37]:
# TEST - Run this cell to test your solution
hashtagCols = hashtagDF.columns
urlCols = urlDF.columns
hashtagDFCounts = hashtagDF.count()
urlDFCounts = urlDF.count()

dbTest("ET1-P-08-09-01", True, "hashtag" in hashtagCols)
dbTest("ET1-P-08-09-02", True, "displayURL" in urlCols)
dbTest("ET1-P-08-09-03", 394, hashtagDFCounts)
dbTest("ET1-P-08-09-04", 368, urlDFCounts)

print("Tests passed!")

## Exercise 4: Loading the Results

Use DBFS as your target warehouse for your transformed data. Save the DataFrames in Parquet format to the following endpoints:  

| DataFrame    | Endpoint              |
|:-------------|:----------------------|
| `accountDF`  | `/tmp/account.parquet`|
| `tweetDF`    | `/tmp/tweet.parquet`  |
| `hashtagDF`  | `/tmp/hashtag.parquet`|
| `urlDF`      | `/tmp/url.parquet`    |

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> If you run out of storage in `/tmp`, use `.limit(10)` to limit the size of your DataFrames to 10 records.

In [39]:
# TODO
accountDF.write.mode("overwrite").parquet("/tmp/account.parquet")
tweetDF.write.mode("overwrite").parquet("/tmp/tweet.parquet")
hashtagDF.write.mode("overwrite").parquet("/tmp/hashtag.parquet")
urlDF.write.mode("overwrite").parquet("/tmp/url.parquet")

In [40]:
# TEST - Run this cell to test your solution
from pyspark.sql.dataframe import DataFrame

accountDF = spark.read.parquet("/tmp/account.parquet")
tweetDF = spark.read.parquet("/tmp/tweet.parquet")
hashtagDF = spark.read.parquet("/tmp/hashtag.parquet")
urlDF = spark.read.parquet("/tmp/url.parquet")

dbTest("ET1-P-08-10-01", DataFrame, type(accountDF))
dbTest("ET1-P-08-10-02", DataFrame, type(tweetDF))
dbTest("ET1-P-08-10-03", DataFrame, type(hashtagDF))
dbTest("ET1-P-08-10-04", DataFrame, type(urlDF))

print("Tests passed!")