
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Capstone Project: Parsing Nested Data

Mount JSON data using DBFS, define and apply a schema, parse fields, and save the cleaned results back to DBFS.

## Audience
* Primary Audience: Data Engineers
* Additional Audiences: Data Scientists and Data Pipeline Engineers

## Prerequisites
* Web browser: Please use a <a href="https://docs.azuredatabricks.net/user-guide/supported-browsers.html#supported-browsers" target="_blank">supported browser</a>.
* Lesson: <a href="$./02-ETL-Process-Overview">ETL Process Overview</a> 
* Lesson: <a href="$./05-Applying-Schemas-to-JSON-Data">Applying Schemas to JSON Data</a> 

## Instructions

A common source of data in ETL pipelines is <a href="https://kafka.apache.org/" target="_blank">Apache Kafka</a>, or the managed alternative
<a href="https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-about" target="_blank">Azure Event Hubs</a>.
A common data format in these use cases is newline-separated JSON, in which each line of a file holds one complete JSON record.
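
As a rough illustration of the format, a reader can parse such a file one line at a time. The sketch below uses only the Python standard library and two made-up records (the field values are hypothetical, not taken from the data set). In this course, `spark.read.json` does the equivalent work.

```python
import json

# Two hypothetical records, one JSON document per line
raw = '{"id": 1, "text": "first"}\n{"id": 2, "text": "second"}\n'

# Parse each non-empty line independently
records = [json.loads(line) for line in raw.splitlines() if line.strip()]

print(len(records))        # 2
print(records[1]["text"])  # second
```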

For this exercise, Tweets were streamed from the <a href="https://developer.twitter.com/en/docs" target="_blank">Twitter firehose API</a> into such an aggregation server and,
from there, dumped into the distributed file system.

Use these four exercises to perform ETL on the data in this bucket:  
<br>
1. Extracting and Exploring the Data
2. Defining and Applying a Schema
3. Creating the Tables
4. Loading the Results

Run the following cell.

In [4]:
%run "./Includes/Classroom-Setup"

## Exercise 1: Extracting and Exploring the Data

First, review the data.

### Step 1: Explore the Folder Structure

Explore the mount and review the directory structure using `%fs ls`. The data is located in `/mnt/training/twitter/firehose/`.

In [7]:
%fs ls "/mnt/training/twitter/firehose/"

path,name,size
dbfs:/mnt/training/twitter/firehose/2018/,2018/,0


In [8]:
# Alternate
display(dbutils.fs.ls("/mnt/training/twitter/firehose/2018/01/08/18/"))

path,name,size
dbfs:/mnt/training/twitter/firehose/2018/01/08/18/twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4,twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4,7945820
dbfs:/mnt/training/twitter/firehose/2018/01/08/18/twitterstream-1-2018-01-08-18-58-00-90ebdcae-ee96-443d-bd8b-de09ece454c2,twitterstream-1-2018-01-08-18-58-00-90ebdcae-ee96-443d-bd8b-de09ece454c2,12527115


### Step 2: Explore a Single File

> "Premature optimization is the root of all evil." -Sir Tony Hoare

There are a few gigabytes of Twitter data available in the directory. Hoare's law about premature optimization applies here. Rather than building a schema for the entire data set and then trying it out, work iteratively, which is much less error-prone and runs much faster. Start with a single file, and apply your proof of concept across the entire data set only once it works.

Read a single file. Start with `twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4`, which is located in `/mnt/training/twitter/firehose/2018/01/08/18/`. Save the result to the variable `df`.

In [11]:
%fs head "/mnt/training/twitter/firehose/2018/01/08/18/twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4"

In [12]:
# TODO
filename = "twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4"
df = (spark.read
     .json("/mnt/training/twitter/firehose/2018/01/08/18/" + filename)
     )
display(df)

contributors,coordinates,created_at,delete,display_text_range,entities,extended_entities,extended_tweet,favorite_count,favorited,filter_level,geo,hangup,heartbeat_timeout,id,id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,is_quote_status,lang,place,possibly_sensitive,quote_count,quoted_status,quoted_status_id,quoted_status_id_str,reply_count,retweet_count,retweeted,retweeted_status,source,text,timestamp_ms,truncated,user
,,,"List(List(950438769756397568, 950438769756397568, 2299370352, 2299370352), 1515437279350)",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,"List(List(950434776783228931, 950434776783228931, 2938447679, 2938447679), 1515437279393)",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,Mon Jan 08 18:47:59 +0000 2018,,,"List(List(), null, List(), List(), List(List(1666134386, 1666134386, List(3, 18), Tina Vasquez, TheTinaVasquez)))",,,0.0,False,low,,,,9.504389542720961e+17,9.504389542720961e+17,,,,,,False,en,,,0.0,,,,0.0,0.0,False,"List(null, null, Mon Jan 08 15:30:07 +0000 2018, null, List(List(), null, List(), List(List(twitter.com/i/web/status/9…, https://twitter.com/i/web/status/950389157758685185, List(117, 140), https://t.co/oHa6NSRlJk)), List()), null, List(List(0, 275), List(List(), null, List(), List(), List()), null, Quick facts for the know-nothings who will tweet me today: MS-13 began in Los Angeles. The U.S. helped fund El Salvador's civil war, including training death squads. We funded the reason people fled, the effects of which are still being felt and still causing people to flee.), 167, false, low, null, 950389157758685185, 950389157758685185, TheTinaVasquez, 950388673727684609, 950388673727684609, 1666134386, 1666134386, false, en, null, null, 15, null, null, null, 11, 193, false, Twitter Web Client, Quick facts for the know-nothings who will tweet me today: MS-13 began in Los Angeles. The U.S. helped fund El Salv… https://t.co/oHa6NSRlJk, true, List(false, Mon Aug 12 22:06:43 +0000 2013, false, false, Immigration Reporter, @rewire_news. Tips? 
vasquez.tina@rewire.news, 22699, null, 9877, null, 1296, false, 1666134386, 1666134386, false, en, 308, North Carolina, Tina Vasquez, null, 1A1B1F, http://abs.twimg.com/images/themes/theme9/bg.gif, https://abs.twimg.com/images/themes/theme9/bg.gif, false, https://pbs.twimg.com/profile_banners/1666134386/1493340440, http://pbs.twimg.com/profile_images/851172170776576004/57clqAAv_normal.jpg, https://pbs.twimg.com/profile_images/851172170776576004/57clqAAv_normal.jpg, 2FC2EF, 181A1E, 252429, 666666, true, false, TheTinaVasquez, 39885, Eastern Time (US & Canada), none, https://rewire.news/author/tina-vasquez/, -18000, true))",Twitter for iPhone,RT @TheTinaVasquez: Quick facts for the know-nothings who will tweet me today: MS-13 began in Los Angeles. The U.S. helped fund El Salvador…,1515437279657.0,False,"List(false, Sun Sep 11 05:18:35 +0000 2011, false, false, •Psalm 34:18• Living life one day at a time ✌️, 7277, null, 160, null, 473, true, 371607576, 371607576, false, en, 0, null, Ash, null, C0DEED, http://pbs.twimg.com/profile_background_images/652822689/ng7zyrh6taxcv5gfhx5i.jpeg, https://pbs.twimg.com/profile_background_images/652822689/ng7zyrh6taxcv5gfhx5i.jpeg, true, https://pbs.twimg.com/profile_banners/371607576/1467821525, http://pbs.twimg.com/profile_images/785334362099200001/6RT_Leu__normal.jpg, https://pbs.twimg.com/profile_images/785334362099200001/6RT_Leu__normal.jpg, 0084B4, C0DEED, DDEEF6, 333333, true, false, smileifyou_love, 1654, Alaska, none, null, -32400, false)"
,,Mon Jan 08 18:47:59 +0000 2018,,,"List(List(List(List(83, 88), diet)), null, List(), List(), List())",,,0.0,False,low,,,,9.504389542889144e+17,9.504389542889144e+17,,,,,,False,ja,,,0.0,,,,0.0,0.0,False,,twittbot.net,【太ももを引きしめるエクササイズ】足あげ～１．うつぶせに寝る。２．片方の足先を床から10センチ上にあげてキープ３．２の状態で足を振り上げる。※お腹が床から離れると× #diet,1515437279661.0,False,"List(false, Thu Aug 02 08:08:50 +0000 2012, true, false, 【期間限定】今なら無料！！ ただ今話題沸騰中の「ダイエットできるアプリ」こと「ヤセサポ」！！ 今だけ無料でダウンロードできます！この機会にぜひ！お試しあれ★ダウンロード⇒http://bit.ly/MxDjjc, 0, null, 1285, null, 1641, false, 732417055, 732417055, false, ja, 21, null, 美モテDIET, null, C0DEED, http://abs.twimg.com/images/themes/theme1/bg.png, https://abs.twimg.com/images/themes/theme1/bg.png, false, null, http://pbs.twimg.com/profile_images/2458475899/icon_normal.png, https://pbs.twimg.com/profile_images/2458475899/icon_normal.png, 1DA1F2, C0DEED, DDEEF6, 333333, true, false, bw198e18, 68293, Irkutsk, none, null, 28800, false)"
,,Mon Jan 08 18:47:59 +0000 2018,,,"List(List(), null, List(), List(), List())",,,0.0,False,low,,,,9.504389542764504e+17,9.504389542764504e+17,,,,,,False,tr,,,0.0,,,,0.0,0.0,False,,Twitter for iPhone,Ben bir beni bulup icine girip saklanirsam kim beni bulur,1515437279658.0,False,"List(false, Sun Jan 09 11:52:15 +0000 2011, false, false, △, 1834, null, 223, null, 214, true, 235927210, 235927210, false, tr, 3, null, E., null, D3BCDB, http://pbs.twimg.com/profile_background_images/378800000115824546/a446c7dcda954ab0baa1e2a5ff831dcf.jpeg, https://pbs.twimg.com/profile_background_images/378800000115824546/a446c7dcda954ab0baa1e2a5ff831dcf.jpeg, true, https://pbs.twimg.com/profile_banners/235927210/1512946341, http://pbs.twimg.com/profile_images/939990745775247360/OQSeNi8n_normal.jpg, https://pbs.twimg.com/profile_images/939990745775247360/OQSeNi8n_normal.jpg, 6E1A6B, FFFFFF, C0DBC7, 000000, true, false, marlascigarette, 3475, Istanbul, none, http://Instagram.com/ecegizemerikci, 10800, false)"
,,Mon Jan 08 18:47:59 +0000 2018,,,"List(List(List(List(91, 114), صاروخ_سعودي_يرعب_ايران)), null, List(), List(List(youtube.com/watch?v=b4iz9n…, https://www.youtube.com/watch?v=b4iz9nZPzAA, List(65, 88), https://t.co/j0RgDwS36n)), List())",,,0.0,False,low,,,,9.504389542804723e+17,9.504389542804723e+17,,,,,,False,ar,,False,0.0,,,,0.0,0.0,False,,Twitter Web Client,تواصل قالوا عن قطر المتحدث باسم الجيش الصهيوني يشكر قناة الجزيرة https://t.co/j0RgDwS36n … #صاروخ_سعودي_يرعب_ايران,1515437279659.0,False,"List(false, Wed Jul 03 04:50:29 +0000 2013, true, false, null, 0, null, 0, null, 45, false, 1564880654, 1564880654, false, en, 0, null, سعد الشمري, null, C0DEED, http://abs.twimg.com/images/themes/theme1/bg.png, https://abs.twimg.com/images/themes/theme1/bg.png, false, null, http://pbs.twimg.com/profile_images/950101193430200320/RDrilm60_normal.jpg, https://pbs.twimg.com/profile_images/950101193430200320/RDrilm60_normal.jpg, 1DA1F2, C0DEED, DDEEF6, 333333, true, false, rebaab_1326, 371, Arizona, none, null, -25200, false)"
,,Mon Jan 08 18:47:59 +0000 2018,,,"List(List(), null, List(), List(), List())",,,0.0,False,low,,,,9.504389542888897e+17,9.504389542888897e+17,,,,,,False,en,,,0.0,,,,0.0,0.0,False,,Mobile Web (M2),*Before you argue about your dirty house someone didn't clean or sweep -* *Think of the people who are living in the streets.*,1515437279661.0,False,"List(false, Fri Aug 05 14:02:37 +0000 2011, false, false, God first . Football fun . Talk so much . Reader. Year of FINANCIAL BREAK THROUGH . Still learning how to love kibitram@gmail.com. +256779646952, 1191, null, 4916, null, 5008, true, 349070364, 349070364, false, en, 9, Kampala, Uganda, Kibirango Martin, null, 1A1B1F, http://pbs.twimg.com/profile_background_images/378800000065105458/dfb2c050324d851ddc47955ac445c2cf.jpeg, https://pbs.twimg.com/profile_background_images/378800000065105458/dfb2c050324d851ddc47955ac445c2cf.jpeg, false, https://pbs.twimg.com/profile_banners/349070364/1514792019, http://pbs.twimg.com/profile_images/947732559034798080/Eon5oPyT_normal.jpg, https://pbs.twimg.com/profile_images/947732559034798080/Eon5oPyT_normal.jpg, 2FC2EF, 000000, DDEEF6, 333333, true, false, puskine, 4376, Baghdad, none, null, 10800, false)"
,,Mon Jan 08 18:47:59 +0000 2018,,,"List(List(), null, List(), List(), List(List(735005565911453696, 735005565911453696, List(3, 13), Alexis Holloway🌐, TippyLexx)))",,,0.0,False,low,,,,9.504389542806692e+17,9.504389542806692e+17,,,,,,False,en,,,0.0,,,,0.0,0.0,False,"List(null, null, Sat Jan 06 18:21:37 +0000 2018, null, List(List(), null, List(), List(), List()), null, null, 5986, false, low, null, 949707540786532357, 949707540786532357, null, null, null, null, null, false, en, null, null, 1058, null, null, null, 40, 4802, false, Twitter for iPhone, Bruh you ever accidentally open a message and be like damn now I gotta reply 😂😂, false, List(false, Tue May 24 07:12:37 +0000 2016, false, false, Features and Booking info: tippylexbooking@gmail.com | snap : eff_haters 🏳️‍🌈alcorn |, 955, null, 21665, null, 12, true, 735005565911453696, 735005565911453696, false, en, 189, Gautier, MS, Alexis Holloway🌐, null, 000000, http://abs.twimg.com/images/themes/theme1/bg.png, https://abs.twimg.com/images/themes/theme1/bg.png, false, https://pbs.twimg.com/profile_banners/735005565911453696/1514685108, http://pbs.twimg.com/profile_images/947288075423617024/KvrEXdmZ_normal.jpg, https://pbs.twimg.com/profile_images/947288075423617024/KvrEXdmZ_normal.jpg, 981CEB, 000000, 000000, 000000, false, false, TippyLexx, 41, Pacific Time (US & Canada), none, http://tippylexofficialsite.webnode.com, -28800, false))",Twitter for iPhone,RT @TippyLexx: Bruh you ever accidentally open a message and be like damn now I gotta reply 😂😂,1515437279659.0,False,"List(false, Fri Jul 22 19:34:10 +0000 2011, false, false, Prince Carter ❤️ && Messiah Carter Miles ❤️, 2792, null, 1646, null, 1130, true, 340482488, 340482488, false, en, 2, the land , the Queen 👑❤️, null, C0DEED, http://pbs.twimg.com/profile_background_images/707066859/d306dfc46dd2eeecfb053f88b335525a.jpeg, https://pbs.twimg.com/profile_background_images/707066859/d306dfc46dd2eeecfb053f88b335525a.jpeg, true, 
https://pbs.twimg.com/profile_banners/340482488/1510876123, http://pbs.twimg.com/profile_images/929758380461355008/JJe3uSrP_normal.jpg, https://pbs.twimg.com/profile_images/929758380461355008/JJe3uSrP_normal.jpg, 0084B4, FFFFFF, 252429, 666666, true, false, xNina_Beana, 50597, Quito, none, null, -18000, false)"
,,Mon Jan 08 18:47:59 +0000 2018,,,"List(List(), null, List(), List(), List(List(728233133389324288, 728233133389324288, List(3, 15), Morrão Tudo 2, MorraoTudo2)))",,,0.0,False,low,,,,9.504389542764419e+17,9.504389542764419e+17,,,,,,False,pt,,,0.0,,,,0.0,0.0,False,"List(null, null, Mon Jan 08 03:14:08 +0000 2018, null, List(List(), null, List(), List(), List()), null, null, 106, false, low, null, 950203940544671745, 950203940544671745, null, null, null, null, null, false, pt, null, null, 21, null, null, null, 1, 184, false, Twitter for Android, A liberdade é só questão de tempo, solta os faixa preta 🔐🔓⏳✌️, false, List(false, Thu May 05 14:41:23 +0000 2016, false, false, Tudo sobre o morrão,eventos e fatos da comunidade,qualquer dúvida só chamar na DM! Morrão vigiado por Deus, monitorado pelos cria... ✌ Gestão Inteligente #2, 1432, null, 20556, null, 438, true, 728233133389324288, 728233133389324288, false, pt, 9, Terra De Loucos - Ffmlc, Morrão Tudo 2, null, 000000, http://abs.twimg.com/images/themes/theme1/bg.png, https://abs.twimg.com/images/themes/theme1/bg.png, false, https://pbs.twimg.com/profile_banners/728233133389324288/1497224104, http://pbs.twimg.com/profile_images/857000775565881344/w9LIfVLm_normal.jpg, https://pbs.twimg.com/profile_images/857000775565881344/w9LIfVLm_normal.jpg, ABB8C2, 000000, 000000, 000000, false, false, MorraoTudo2, 7404, null, none, null, null, false))",Twitter for Android,"RT @MorraoTudo2: A liberdade é só questão de tempo, solta os faixa preta 🔐🔓⏳✌️",1515437279658.0,False,"List(false, Wed Dec 02 20:29:19 +0000 2015, false, false, mãe nunca te escutei, mas sempre te amarei❤, 8836, null, 632, null, 252, true, 4354072997, 4354072997, false, pt, 3, cpx da congo🔞, gabrielfrança😇, null, 000000, http://abs.twimg.com/images/themes/theme1/bg.png, https://abs.twimg.com/images/themes/theme1/bg.png, false, https://pbs.twimg.com/profile_banners/4354072997/1513471552, http://pbs.twimg.com/profile_images/948648044790272006/0FUD7Ux0_normal.jpg, 
https://pbs.twimg.com/profile_images/948648044790272006/0FUD7Ux0_normal.jpg, 1B95E0, 000000, 000000, 000000, false, false, gbfranca22, 14277, Pacific Time (US & Canada), none, http://www.youtube.com/c/GabrielFrançaDoYouTube, -28800, false)"
,,Mon Jan 08 18:47:59 +0000 2018,,,"List(List(), null, List(), List(), List())",,,0.0,False,low,,,,9.504389542764787e+17,9.504389542764787e+17,,,,,,False,en,,,0.0,,,,0.0,0.0,False,,Twitter Web Client,I just want this all to be over,1515437279658.0,False,"List(false, Sat Jun 04 00:56:41 +0000 2016, true, false, We are two guys who have great knowledge in scripting. If you have 10k+ coins on CSGODouble we can help you triple that amount. Check out how in the link below, 659, null, 160, null, 213, false, 738897225061912576, 738897225061912576, false, en, 4, null, andy, null, F5F8FA, , , false, https://pbs.twimg.com/profile_banners/738897225061912576/1465001956, http://pbs.twimg.com/profile_images/738897770149490688/4RcvCqYn_normal.jpg, https://pbs.twimg.com/profile_images/738897770149490688/4RcvCqYn_normal.jpg, 1DA1F2, C0DEED, DDEEF6, 333333, true, false, squeeqi, 1753, null, regular, null, null, false)"


In [13]:
# TEST - Run this cell to test your solution
cols = df.columns

dbTest("ET1-P-08-02-01", 1744, df.count())
dbTest("ET1-P-08-02-02", True, "id" in cols)
dbTest("ET1-P-08-02-03", True, "text" in cols)

print("Tests passed!")

Display the schema.

In [15]:
# TODO
df.printSchema()

In [16]:
display(df.limit(5))

contributors,coordinates,created_at,delete,display_text_range,entities,extended_entities,extended_tweet,favorite_count,favorited,filter_level,geo,hangup,heartbeat_timeout,id,id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,is_quote_status,lang,place,possibly_sensitive,quote_count,quoted_status,quoted_status_id,quoted_status_id_str,reply_count,retweet_count,retweeted,retweeted_status,source,text,timestamp_ms,truncated,user
,,,"List(List(950438769756397568, 950438769756397568, 2299370352, 2299370352), 1515437279350)",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,"List(List(950434776783228931, 950434776783228931, 2938447679, 2938447679), 1515437279393)",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,Mon Jan 08 18:47:59 +0000 2018,,,"List(List(), null, List(), List(), List(List(1666134386, 1666134386, List(3, 18), Tina Vasquez, TheTinaVasquez)))",,,0.0,False,low,,,,9.504389542720961e+17,9.504389542720961e+17,,,,,,False,en,,,0.0,,,,0.0,0.0,False,"List(null, null, Mon Jan 08 15:30:07 +0000 2018, null, List(List(), null, List(), List(List(twitter.com/i/web/status/9…, https://twitter.com/i/web/status/950389157758685185, List(117, 140), https://t.co/oHa6NSRlJk)), List()), null, List(List(0, 275), List(List(), null, List(), List(), List()), null, Quick facts for the know-nothings who will tweet me today: MS-13 began in Los Angeles. The U.S. helped fund El Salvador's civil war, including training death squads. We funded the reason people fled, the effects of which are still being felt and still causing people to flee.), 167, false, low, null, 950389157758685185, 950389157758685185, TheTinaVasquez, 950388673727684609, 950388673727684609, 1666134386, 1666134386, false, en, null, null, 15, null, null, null, 11, 193, false, Twitter Web Client, Quick facts for the know-nothings who will tweet me today: MS-13 began in Los Angeles. The U.S. helped fund El Salv… https://t.co/oHa6NSRlJk, true, List(false, Mon Aug 12 22:06:43 +0000 2013, false, false, Immigration Reporter, @rewire_news. Tips? 
vasquez.tina@rewire.news, 22699, null, 9877, null, 1296, false, 1666134386, 1666134386, false, en, 308, North Carolina, Tina Vasquez, null, 1A1B1F, http://abs.twimg.com/images/themes/theme9/bg.gif, https://abs.twimg.com/images/themes/theme9/bg.gif, false, https://pbs.twimg.com/profile_banners/1666134386/1493340440, http://pbs.twimg.com/profile_images/851172170776576004/57clqAAv_normal.jpg, https://pbs.twimg.com/profile_images/851172170776576004/57clqAAv_normal.jpg, 2FC2EF, 181A1E, 252429, 666666, true, false, TheTinaVasquez, 39885, Eastern Time (US & Canada), none, https://rewire.news/author/tina-vasquez/, -18000, true))",Twitter for iPhone,RT @TheTinaVasquez: Quick facts for the know-nothings who will tweet me today: MS-13 began in Los Angeles. The U.S. helped fund El Salvador…,1515437279657.0,False,"List(false, Sun Sep 11 05:18:35 +0000 2011, false, false, •Psalm 34:18• Living life one day at a time ✌️, 7277, null, 160, null, 473, true, 371607576, 371607576, false, en, 0, null, Ash, null, C0DEED, http://pbs.twimg.com/profile_background_images/652822689/ng7zyrh6taxcv5gfhx5i.jpeg, https://pbs.twimg.com/profile_background_images/652822689/ng7zyrh6taxcv5gfhx5i.jpeg, true, https://pbs.twimg.com/profile_banners/371607576/1467821525, http://pbs.twimg.com/profile_images/785334362099200001/6RT_Leu__normal.jpg, https://pbs.twimg.com/profile_images/785334362099200001/6RT_Leu__normal.jpg, 0084B4, C0DEED, DDEEF6, 333333, true, false, smileifyou_love, 1654, Alaska, none, null, -32400, false)"
,,Mon Jan 08 18:47:59 +0000 2018,,,"List(List(List(List(83, 88), diet)), null, List(), List(), List())",,,0.0,False,low,,,,9.504389542889144e+17,9.504389542889144e+17,,,,,,False,ja,,,0.0,,,,0.0,0.0,False,,twittbot.net,【太ももを引きしめるエクササイズ】足あげ～１．うつぶせに寝る。２．片方の足先を床から10センチ上にあげてキープ３．２の状態で足を振り上げる。※お腹が床から離れると× #diet,1515437279661.0,False,"List(false, Thu Aug 02 08:08:50 +0000 2012, true, false, 【期間限定】今なら無料！！ ただ今話題沸騰中の「ダイエットできるアプリ」こと「ヤセサポ」！！ 今だけ無料でダウンロードできます！この機会にぜひ！お試しあれ★ダウンロード⇒http://bit.ly/MxDjjc, 0, null, 1285, null, 1641, false, 732417055, 732417055, false, ja, 21, null, 美モテDIET, null, C0DEED, http://abs.twimg.com/images/themes/theme1/bg.png, https://abs.twimg.com/images/themes/theme1/bg.png, false, null, http://pbs.twimg.com/profile_images/2458475899/icon_normal.png, https://pbs.twimg.com/profile_images/2458475899/icon_normal.png, 1DA1F2, C0DEED, DDEEF6, 333333, true, false, bw198e18, 68293, Irkutsk, none, null, 28800, false)"
,,Mon Jan 08 18:47:59 +0000 2018,,,"List(List(), null, List(), List(), List())",,,0.0,False,low,,,,9.504389542764504e+17,9.504389542764504e+17,,,,,,False,tr,,,0.0,,,,0.0,0.0,False,,Twitter for iPhone,Ben bir beni bulup icine girip saklanirsam kim beni bulur,1515437279658.0,False,"List(false, Sun Jan 09 11:52:15 +0000 2011, false, false, △, 1834, null, 223, null, 214, true, 235927210, 235927210, false, tr, 3, null, E., null, D3BCDB, http://pbs.twimg.com/profile_background_images/378800000115824546/a446c7dcda954ab0baa1e2a5ff831dcf.jpeg, https://pbs.twimg.com/profile_background_images/378800000115824546/a446c7dcda954ab0baa1e2a5ff831dcf.jpeg, true, https://pbs.twimg.com/profile_banners/235927210/1512946341, http://pbs.twimg.com/profile_images/939990745775247360/OQSeNi8n_normal.jpg, https://pbs.twimg.com/profile_images/939990745775247360/OQSeNi8n_normal.jpg, 6E1A6B, FFFFFF, C0DBC7, 000000, true, false, marlascigarette, 3475, Istanbul, none, http://Instagram.com/ecegizemerikci, 10800, false)"


Count the records in the file. Save the result to `dfCount`.

In [18]:
# TODO
dfCount = df.count()

In [19]:
# TEST - Run this cell to test your solution
dbTest("ET1-P-08-03-01", 1744, dfCount)

print("Tests passed!")

## Exercise 2: Defining and Applying a Schema

Applying schemas is especially helpful for data with many fields to sort through. With a complex dataset like this, define a schema **that captures only the relevant fields**.

Capture the hashtags and dates from the data to get a sense for Twitter trends. Use the same file as above.

### Step 1: Understanding the Data Model

In order to apply structure to semi-structured data, you first must understand the data model.  

Two kinds of data model are available: relational and non-relational.<br><br>
* **Relational models** belong to the domain of traditional databases, where [normalization](https://en.wikipedia.org/wiki/Database_normalization) is the primary goal. <br>
* **Non-relational data models** trade normalization for scalability, performance, or flexibility.

Use the following relational model to define several tables that can be joined on shared columns to reconstitute the original data. Regardless of the data model, the ETL principles are roughly the same.

Compare the following [Entity-Relationship Diagram](https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model) to the schema you printed out in the previous step to get a sense for how to populate the tables.

<img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/ER-diagram.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

### Step 2: Create a Schema for the `Tweet` Table

Create a schema for the JSON data to extract just the information that is needed for the `Tweet` table, parsing each of the following fields in the data model:

| Field | Type|
|-------|-----|
| tweet_id | integer |
| user_id | integer |
| language | string |
| text | string |
| created_at | string* |

*Note: Start with `created_at` as a string. Turn this into a timestamp later.

Save the schema to `tweetSchema`, use it to create a DataFrame named `tweetDF`, and use the same file used in the exercise above: `"/mnt/training/twitter/firehose/2018/01/08/18/twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4"`.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** You might need to reexamine the data schema. <br>
<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** [Import types from `pyspark.sql.types`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=pyspark%20sql%20types#module-pyspark.sql.types).
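
One thing worth checking when reexamining the schema: the table lists the ID fields as integers, but the IDs seen in the earlier output are larger than a 32-bit integer can hold. The sketch below is a plain-Python range check, using an ID copied from that output, suggesting that such values need a 64-bit type (Spark's `LongType`) rather than the 32-bit `IntegerType`.

```python
# Range limits of signed 32-bit and 64-bit integers
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1

# An ID copied from the output displayed earlier
sample_id = 950438769756397568

fits_in_int32 = sample_id <= INT32_MAX
fits_in_int64 = sample_id <= INT64_MAX

print(fits_in_int32, fits_in_int64)  # False True
```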

In [24]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, LongType, ArrayType
from pyspark.sql.functions import col

In [25]:
# TODO
path = "/mnt/training/twitter/firehose/2018/01/08/18/twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4"

tweetSchema = StructType([
  StructField("id", LongType(), True),
  StructField("user", StructType([
    StructField("id", LongType(), True)
  ]), True),  
  StructField("lang", StringType(), True),
  StructField("text", StringType(), True),
  StructField("created_at", StringType(), True)
])

tweetDF = (spark.read
           .schema(tweetSchema)
           .json(path)
          )
# Optional: remove null rows, then flatten and rename the user id.
# Left disabled so that the output below keeps the nested `user` column.
# tweetDF = (tweetDF
#            .dropna()
#            .select("id", col("user.id").alias("user_id"), "lang", "text", "created_at"))
display(tweetDF)

id,user,lang,text,created_at
,,,,
,,,,
9.504389542720961e+17,List(371607576),en,RT @TheTinaVasquez: Quick facts for the know-nothings who will tweet me today: MS-13 began in Los Angeles. The U.S. helped fund El Salvador…,Mon Jan 08 18:47:59 +0000 2018
9.504389542889144e+17,List(732417055),ja,【太ももを引きしめるエクササイズ】足あげ～１．うつぶせに寝る。２．片方の足先を床から10センチ上にあげてキープ３．２の状態で足を振り上げる。※お腹が床から離れると× #diet,Mon Jan 08 18:47:59 +0000 2018
9.504389542764504e+17,List(235927210),tr,Ben bir beni bulup icine girip saklanirsam kim beni bulur,Mon Jan 08 18:47:59 +0000 2018
9.504389542804723e+17,List(1564880654),ar,تواصل قالوا عن قطر المتحدث باسم الجيش الصهيوني يشكر قناة الجزيرة https://t.co/j0RgDwS36n … #صاروخ_سعودي_يرعب_ايران,Mon Jan 08 18:47:59 +0000 2018
9.504389542888897e+17,List(349070364),en,*Before you argue about your dirty house someone didn't clean or sweep -* *Think of the people who are living in the streets.*,Mon Jan 08 18:47:59 +0000 2018
9.504389542806692e+17,List(340482488),en,RT @TippyLexx: Bruh you ever accidentally open a message and be like damn now I gotta reply 😂😂,Mon Jan 08 18:47:59 +0000 2018
9.504389542764419e+17,List(4354072997),pt,"RT @MorraoTudo2: A liberdade é só questão de tempo, solta os faixa preta 🔐🔓⏳✌️",Mon Jan 08 18:47:59 +0000 2018
9.504389542764787e+17,List(738897225061912576),en,I just want this all to be over,Mon Jan 08 18:47:59 +0000 2018


In [26]:
# TEST - Run this cell to test your solution
from pyspark.sql.functions import col

schema = tweetSchema.fieldNames()
schema.sort()
tweetCount = tweetDF.filter(col("id").isNotNull()).count()

dbTest("ET1-P-08-04-01", 'created_at', schema[0])
dbTest("ET1-P-08-04-02", 'id', schema[1])
dbTest("ET1-P-08-04-03", 1491, tweetCount)

assert schema[0] == 'created_at' and schema[1] == 'id'
assert tweetCount == 1491

print("Tests passed!")

### Step 3: Create a Schema for the Remaining Tables

Finish off the full schema, save it to `fullTweetSchema`, and use it to create the DataFrame `fullTweetDF`. Your schema should parse all the entities from the ER diagram above. Remember: start small, run your code, and then iterate.

In [28]:
# TODO
path = "/mnt/training/twitter/firehose/2018/01/08/18/twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4"

# Define the schema
fullTweetSchema = StructType([
  StructField("id", LongType(), True),
  StructField("user", StructType([
    StructField("id", LongType(), True),
    StructField("screen_name", StringType(), True),
    StructField("location", StringType(), True),
    StructField("friends_count", IntegerType(), True),
    StructField("followers_count", IntegerType(), True),
    StructField("description", StringType(), True)
  ]), True),
  StructField("entities", StructType([
    StructField("hashtags", ArrayType(
      StructType([
        StructField("text", StringType(), True)
      ]),
    ), True),
    StructField("urls", ArrayType(
      StructType([
        StructField("url", StringType(), True),
        StructField("expanded_url", StringType(), True),
        StructField("display_url", StringType(), True)
      ]),
    ), True)
  ]), True),
  StructField("lang", StringType(), True),
  StructField("text", StringType(), True),
  StructField("created_at", StringType(), True)
])

# Load with the schema
fullTweetDF = (spark.read
           .schema(fullTweetSchema)
           .json(path)
          )

# Show the DataFrame
display(fullTweetDF)

id,user,entities,lang,text,created_at
,,,,,
,,,,,
9.504389542720961e+17,"List(371607576, smileifyou_love, null, 473, 160, •Psalm 34:18• Living life one day at a time ✌️)","List(List(), List())",en,RT @TheTinaVasquez: Quick facts for the know-nothings who will tweet me today: MS-13 began in Los Angeles. The U.S. helped fund El Salvador…,Mon Jan 08 18:47:59 +0000 2018
9.504389542889144e+17,"List(732417055, bw198e18, null, 1641, 1285, 【期間限定】今なら無料！！ ただ今話題沸騰中の「ダイエットできるアプリ」こと「ヤセサポ」！！ 今だけ無料でダウンロードできます！この機会にぜひ！お試しあれ★ダウンロード⇒http://bit.ly/MxDjjc)","List(List(List(diet)), List())",ja,【太ももを引きしめるエクササイズ】足あげ～１．うつぶせに寝る。２．片方の足先を床から10センチ上にあげてキープ３．２の状態で足を振り上げる。※お腹が床から離れると× #diet,Mon Jan 08 18:47:59 +0000 2018
9.504389542764504e+17,"List(235927210, marlascigarette, null, 214, 223, △)","List(List(), List())",tr,Ben bir beni bulup icine girip saklanirsam kim beni bulur,Mon Jan 08 18:47:59 +0000 2018
9.504389542804723e+17,"List(1564880654, rebaab_1326, null, 45, 0, null)","List(List(List(صاروخ_سعودي_يرعب_ايران)), List(List(https://t.co/j0RgDwS36n, https://www.youtube.com/watch?v=b4iz9nZPzAA, youtube.com/watch?v=b4iz9n…)))",ar,تواصل قالوا عن قطر المتحدث باسم الجيش الصهيوني يشكر قناة الجزيرة https://t.co/j0RgDwS36n … #صاروخ_سعودي_يرعب_ايران,Mon Jan 08 18:47:59 +0000 2018
9.504389542888897e+17,"List(349070364, puskine, Kampala, Uganda, 5008, 4916, God first . Football fun . Talk so much . Reader. Year of FINANCIAL BREAK THROUGH . Still learning how to love kibitram@gmail.com. +256779646952)","List(List(), List())",en,*Before you argue about your dirty house someone didn't clean or sweep -* *Think of the people who are living in the streets.*,Mon Jan 08 18:47:59 +0000 2018
9.504389542806692e+17,"List(340482488, xNina_Beana, the land , 1130, 1646, Prince Carter ❤️ && Messiah Carter Miles ❤️)","List(List(), List())",en,RT @TippyLexx: Bruh you ever accidentally open a message and be like damn now I gotta reply 😂😂,Mon Jan 08 18:47:59 +0000 2018
9.504389542764419e+17,"List(4354072997, gbfranca22, cpx da congo🔞, 252, 632, mãe nunca te escutei, mas sempre te amarei❤)","List(List(), List())",pt,"RT @MorraoTudo2: A liberdade é só questão de tempo, solta os faixa preta 🔐🔓⏳✌️",Mon Jan 08 18:47:59 +0000 2018
9.504389542764787e+17,"List(738897225061912576, squeeqi, null, 213, 160, We are two guys who have great knowledge in scripting. If you have 10k+ coins on CSGODouble we can help you triple that amount. Check out how in the link below)","List(List(), List())",en,I just want this all to be over,Mon Jan 08 18:47:59 +0000 2018


In [29]:
# TEST - Run this cell to test your solution
from pyspark.sql.functions import col

schema = fullTweetSchema.fieldNames()
schema.sort()
tweetCount = fullTweetDF.filter(col("id").isNotNull()).count()

assert tweetCount == 1491

dbTest("ET1-P-08-05-01", "created_at", schema[0])
dbTest("ET1-P-08-05-02", "entities", schema[1])
dbTest("ET1-P-08-05-03", 1491, tweetCount)

print("Tests passed!")

## Exercise 3: Creating the Tables

Apply the schema you defined to create tables that match the relational data model.

### Step 1: Filtering Nulls

The Twitter data contains both deletions and tweets.  This is why some records appear as null values. Create a DataFrame called `fullTweetFilteredDF` that filters out the null values.
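One way to picture why deletions show up as nulls is the plain-Python sketch below. It assumes a deletion notice is a JSON object with a single `delete` key; that record shape, the sample lines, and the `project` helper are all hypothetical and only serve the illustration. Because none of the tweet schema's fields exist in such a record, every projected value comes back as `None`.

```python
import json

# Hypothetical newline-separated JSON: one tweet and one deletion notice.
# The shape of the deletion record is an assumption for illustration.
raw_lines = [
    '{"id": 1, "lang": "en", "text": "hello"}',
    '{"delete": {"status": {"id": 2}}}',
]

# Top-level fields the schema asks for (a reduced, illustrative subset).
schema_fields = ["id", "lang", "text"]

def project(line, fields):
    """Keep only the schema's fields; fields missing from the record become None."""
    record = json.loads(line)
    return {name: record.get(name) for name in fields}

rows = [project(line, schema_fields) for line in raw_lines]

# Keep only rows that carry a tweet id, mirroring a filter on `id`.
tweets = [row for row in rows if row["id"] is not None]

print(rows)
print(tweets)
```

The second row is all `None`, so a filter on a non-null `id` is enough to separate tweets from deletions.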

In [32]:
# TODO
# Deletion records carry no tweet fields, so `id` is null for them
fullTweetFilteredDF = fullTweetDF.filter(col("id").isNotNull())

In [33]:
# TEST - Run this cell to test your solution
dbTest("ET1-P-08-06-01", 1491, fullTweetFilteredDF.count())

print("Tests passed!")

### Step 2: Creating the `Tweet` Table

Twitter uses a non-standard timestamp format that Spark doesn't recognize. Currently the `created_at` column is formatted as a string. Create the `Tweet` table and save it as `tweetDF`. Parse the timestamp column using `unix_timestamp`, and cast the result as `TimestampType`. The timestamp format is `EEE MMM dd HH:mm:ss ZZZZZ yyyy`.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Use `alias` to alias the name of your columns to the final name you want for them.  
<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** `id` becomes the `tweetID` column and `user.id` becomes the `userID` column.
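To see what each part of the timestamp pattern matches, here is a plain-Python sketch that parses one `created_at` value from this data set with `datetime.strptime`. The `strptime` directives are Python's rough counterparts of the Spark pattern letters, not the same syntax, and the variable names are illustrative.

```python
from datetime import datetime

# A created_at value as it appears in the Twitter data.
created_at = "Mon Jan 08 18:47:59 +0000 2018"

# Python directives roughly matching EEE MMM dd HH:mm:ss ZZZZZ yyyy:
# %a = day name, %b = month name, %d = day of month, %H:%M:%S = time,
# %z = UTC offset, %Y = four-digit year.
python_format = "%a %b %d %H:%M:%S %z %Y"

parsed = datetime.strptime(created_at, python_format)

# Seconds since the Unix epoch: the kind of value unix_timestamp yields
# before it is cast to a timestamp.
epoch_seconds = int(parsed.timestamp())

print(parsed.isoformat())
print(epoch_seconds)
```

The two-stage shape (string to epoch seconds, then epoch seconds to timestamp) is the same one the exercise asks for with `unix_timestamp` followed by a cast to `TimestampType`.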

In [35]:
from pyspark.sql.types import TimestampType
from pyspark.sql.functions import unix_timestamp

In [36]:
# tweetDF is not defined until the next cell, so inspect the filtered source DataFrame
fullTweetFilteredDF.printSchema()

In [37]:
# TODO

# Timestamp format used by Twitter's created_at field
timestampFormat = "EEE MMM dd HH:mm:ss ZZZZZ yyyy"

# Create the DataFrame: rename the columns and parse created_at into a timestamp
tweetDF = (fullTweetFilteredDF
  .select(
    col("id").alias("tweetID"),
    col("user.id").alias("userID"),
    col("lang").alias("language"),
    col("text"),
    unix_timestamp("created_at", timestampFormat).cast(TimestampType()).alias("createdAt"))
)
display(tweetDF.limit(10))

tweetID,userID,language,text,createdAt
950438954272096257,371607576,en,RT @TheTinaVasquez: Quick facts for the know-nothings who will tweet me today: MS-13 began in Los Angeles. The U.S. helped fund El Salvador…,2018-01-08T18:47:59.000+0000
950438954288914432,732417055,ja,【太ももを引きしめるエクササイズ】足あげ～１．うつぶせに寝る。２．片方の足先を床から10センチ上にあげてキープ３．２の状態で足を振り上げる。※お腹が床から離れると× #diet,2018-01-08T18:47:59.000+0000
950438954276450305,235927210,tr,Ben bir beni bulup icine girip saklanirsam kim beni bulur,2018-01-08T18:47:59.000+0000
950438954280472576,1564880654,ar,تواصل قالوا عن قطر المتحدث باسم الجيش الصهيوني يشكر قناة الجزيرة https://t.co/j0RgDwS36n … #صاروخ_سعودي_يرعب_ايران,2018-01-08T18:47:59.000+0000
950438954288889856,349070364,en,*Before you argue about your dirty house someone didn't clean or sweep -* *Think of the people who are living in the streets.*,2018-01-08T18:47:59.000+0000
950438954280669184,340482488,en,RT @TippyLexx: Bruh you ever accidentally open a message and be like damn now I gotta reply 😂😂,2018-01-08T18:47:59.000+0000
950438954276442113,4354072997,pt,"RT @MorraoTudo2: A liberdade é só questão de tempo, solta os faixa preta 🔐🔓⏳✌️",2018-01-08T18:47:59.000+0000
950438954276478976,738897225061912576,en,I just want this all to be over,2018-01-08T18:47:59.000+0000
950438954289033216,273646363,ar,RT @Arab_original: للاسف قطاع كان ممكن حل ولا اروع للبطاله لكن وزارة النقل قررت ان لا تنظم السوق بحجه السوق الحرة !! اي حريه والشركتين يسحق…,2018-01-08T18:47:59.000+0000
950438954289033218,1541143441,ru,RT @craneswordboi: блять мне так смешно от слова срождество,2018-01-08T18:47:59.000+0000


In [38]:
# TEST - Run this cell to test your solution
from pyspark.sql.types import TimestampType
t = tweetDF.select("createdAt").schema[0]

dbTest("ET1-P-08-07-01", TimestampType(), t.dataType)

print("Tests passed!")

### Step 3: Creating the `Account` Table

Save the account table as `accountDF`.
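The account fields live inside the nested `user` object, so each one is reached through a dotted path such as `user.screen_name` and then renamed. The plain-Python sketch below mimics that idea on a single simplified record; the `get_path` helper and the `aliases` mapping are hypothetical names used only for illustration.

```python
# A simplified tweet record with a nested `user` object (values taken from
# the sample output earlier in this notebook; the layout is reduced).
tweet = {
    "id": 950438954272096257,
    "user": {"id": 371607576, "screen_name": "smileifyou_love", "location": None},
}

def get_path(record, dotted_path):
    """Follow a dotted path such as 'user.id'; return None if any step is missing."""
    value = record
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

# Map each nested source path to the flat column name wanted in the table.
aliases = {
    "user.id": "userID",
    "user.screen_name": "screenName",
    "user.location": "location",
}

account_row = {alias: get_path(tweet, path) for path, alias in aliases.items()}
print(account_row)
```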

In [40]:
fullTweetDF.printSchema()

In [41]:
# TODO
accountDF = (fullTweetFilteredDF
            .select(col("user.id").alias("userID"),
                    col("user.screen_name").alias("screenName"),
                    col("user.location").alias("location"),
                    col("user.friends_count").alias("friendsCount"),
                    col("user.followers_count").alias("followersCount"),
                    col("user.description").alias("description")
                   )
            )
display(accountDF.limit(5))

userID,screenName,location,friendsCount,followersCount,description
371607576,smileifyou_love,,473,160,•Psalm 34:18• Living life one day at a time ✌️
732417055,bw198e18,,1641,1285,【期間限定】今なら無料！！ ただ今話題沸騰中の「ダイエットできるアプリ」こと「ヤセサポ」！！ 今だけ無料でダウンロードできます！この機会にぜひ！お試しあれ★ダウンロード⇒http://bit.ly/MxDjjc
235927210,marlascigarette,,214,223,△
1564880654,rebaab_1326,,45,0,
349070364,puskine,"Kampala, Uganda",5008,4916,God first . Football fun . Talk so much . Reader. Year of FINANCIAL BREAK THROUGH . Still learning how to love kibitram@gmail.com. +256779646952


In [42]:
# TEST - Run this cell to test your solution
cols = accountDF.columns

dbTest("ET1-P-08-08-01", True, "friendsCount" in cols)
dbTest("ET1-P-08-08-02", True, "screenName" in cols)
dbTest("ET1-P-08-08-03", 1491, accountDF.count())


print("Tests passed!")

### Step 4: Creating Hashtag and URL Tables Using `explode`

Each tweet in the data set contains zero, one, or many URLs and hashtags. Parse these using the `explode` function so that each URL or hashtag has its own row.

In this example, `explode` produces one output row for each element of the array in the original `hashtags` column. The values in all other columns are repeated unchanged on each of those rows.

```
+---------------+--------------------+----------------+
|     screenName|            hashtags|explodedHashtags|
+---------------+--------------------+----------------+
|        zooeeen|[[Tea], [GoldenGl...|           [Tea]|
|        zooeeen|[[Tea], [GoldenGl...|  [GoldenGlobes]|
|mannydidthisone|[[beats], [90s], ...|         [beats]|
|mannydidthisone|[[beats], [90s], ...|           [90s]|
|mannydidthisone|[[beats], [90s], ...|     [90shiphop]|
|mannydidthisone|[[beats], [90s], ...|           [pac]|
|mannydidthisone|[[beats], [90s], ...|        [legend]|
|mannydidthisone|[[beats], [90s], ...|          [thug]|
|mannydidthisone|[[beats], [90s], ...|         [music]|
|mannydidthisone|[[beats], [90s], ...|     [westcoast]|
|mannydidthisone|[[beats], [90s], ...|        [eminem]|
|mannydidthisone|[[beats], [90s], ...|         [drdre]|
|mannydidthisone|[[beats], [90s], ...|          [trap]|
|  Satish0919995|[[BB11], [BiggBos...|          [BB11]|
|  Satish0919995|[[BB11], [BiggBos...|    [BiggBoss11]|
|  Satish0919995|[[BB11], [BiggBos...| [WeekendKaVaar]|
+---------------+--------------------+----------------+
```

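Like `pivot`, `explode` reshapes data, but in the opposite direction: `pivot` spreads row values across columns, while `explode` expands the elements of an array into rows.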
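The row-multiplying behavior can be sketched in plain Python. This is a minimal illustration of the semantics, not Spark's implementation: `explode_rows` is a hypothetical helper, the `no_tags` row is invented to show the empty-array case, and, unlike the table above, the sketch drops the original array column, in the way a `select` that keeps only the exploded value would.

```python
# Illustrative rows: a screen name and an array of hashtags.
rows = [
    {"screenName": "zooeeen", "hashtags": ["Tea", "GoldenGlobes"]},
    {"screenName": "no_tags", "hashtags": []},
    {"screenName": "Satish0919995", "hashtags": ["BB11", "BiggBoss11", "WeekendKaVaar"]},
]

def explode_rows(rows, array_col, out_col):
    """Emit one row per array element, copying the other columns unchanged."""
    for row in rows:
        for element in row[array_col]:
            exploded = {k: v for k, v in row.items() if k != array_col}
            exploded[out_col] = element
            yield exploded

exploded = list(explode_rows(rows, "hashtags", "hashtag"))
for row in exploded:
    print(row)
```

A row with an empty array contributes no output rows, which matches how `explode` treats tweets that have no hashtags or URLs.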

Create the rest of the tables and save them to the following DataFrames:<br><br>

* `hashtagDF`
* `urlDF`

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> <a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=explode#pyspark.sql.functions.explode" target="_blank">Find the documentation for `explode` here</a>

In [44]:
fullTweetFilteredDF.printSchema()

In [45]:
from pyspark.sql.functions import explode, col

In [46]:
# TODO
# One row per hashtag: explode the array of hashtag texts
hashtagDF = (fullTweetFilteredDF
  .select(
    col("id").alias("tweetID"),
    explode(col("entities.hashtags.text")).alias("hashtag"))
)

# One row per URL: explode the array of URL structs, then flatten its fields
urlDF = (fullTweetFilteredDF
  .select(
    col("id").alias("tweetID"),
    explode(col("entities.urls")).alias("urls"))
  .select(
    col("tweetID"),
    col("urls.url").alias("URL"),
    col("urls.display_url").alias("displayURL"),
    col("urls.expanded_url").alias("expandedURL"))
)

display(hashtagDF.limit(5))

tweetID,hashtag
950438954288914432,diet
950438954280472576,صاروخ_سعودي_يرعب_ايران
950438954297303040,Tea
950438954297303040,GoldenGlobes
950438954305716226,الهلال_الاتفاق


In [47]:
display(urlDF.limit(5))

tweetID,URL,displayURL,expandedURL
950438954280472576,https://t.co/j0RgDwS36n,youtube.com/watch?v=b4iz9n…,https://www.youtube.com/watch?v=b4iz9nZPzAA
950438954284797958,https://t.co/B5Zgkoy4TL,twitter.com/i/web/status/9…,https://twitter.com/i/web/status/950438954284797958
950438954310033410,http://t.co/Kv3EWEhO,bit.ly/OYlKII,http://bit.ly/OYlKII
950438954305835008,https://t.co/l3x0sVSvFa,goo.gl/fb/atjACB,https://goo.gl/fb/atjACB
950438954305761280,https://t.co/syjjduK5w0,instagram.com/p/BdsvNFABXNL/,https://www.instagram.com/p/BdsvNFABXNL/


In [48]:
# TEST - Run this cell to test your solution
hashtagCols = hashtagDF.columns
urlCols = urlDF.columns
hashtagDFCounts = hashtagDF.count()
urlDFCounts = urlDF.count()

dbTest("ET1-P-08-09-01", True, "hashtag" in hashtagCols)
dbTest("ET1-P-08-09-02", True, "displayURL" in urlCols)
dbTest("ET1-P-08-09-03", 394, hashtagDFCounts)
dbTest("ET1-P-08-09-04", 368, urlDFCounts)

print("Tests passed!")

## Exercise 4: Loading the Results

Use DBFS as your target warehouse for your transformed data. Save the DataFrames in Parquet format to the following endpoints:  

| DataFrame    | Endpoint                                 |
|:-------------|:-----------------------------------------|
| `accountDF`  | `"/tmp/" + username + "/account.parquet"`|
| `tweetDF`    | `"/tmp/" + username + "/tweet.parquet"`  |
| `hashtagDF`  | `"/tmp/" + username + "/hashtag.parquet"`|
| `urlDF`      | `"/tmp/" + username + "/url.parquet"`    |

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> If you run out of storage in `/tmp`, use `.limit(10)` to limit the size of your DataFrames to 10 records.
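The four target paths in the table share one pattern, so they can be built in a single step. The sketch below is plain Python with a hypothetical `username` value and an illustrative `target_paths` dictionary; it only constructs the path strings and does not write anything.

```python
# Hypothetical username, used here only to show the resulting paths.
username = "student@example.com"

table_names = ["account", "tweet", "hashtag", "url"]

# Build "/tmp/<username>/<table>.parquet" for each table.
target_paths = {
    name: "/tmp/" + username + "/" + name + ".parquet"
    for name in table_names
}

for name, path in target_paths.items():
    print(name, "->", path)
```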

In [50]:
# TODO

# Replace with your own username
username = "kcmunnin@microsoft.com"

# Write each table as Parquet; "overwrite" lets the cell be re-run safely
accountDF.write.mode("overwrite").parquet("/tmp/" + username + "/account.parquet")
tweetDF.write.mode("overwrite").parquet("/tmp/" + username + "/tweet.parquet")
hashtagDF.write.mode("overwrite").parquet("/tmp/" + username + "/hashtag.parquet")
urlDF.write.mode("overwrite").parquet("/tmp/" + username + "/url.parquet")

In [51]:
# TEST - Run this cell to test your solution
from pyspark.sql.dataframe import DataFrame

accountDF = spark.read.parquet("/tmp/" + username + "/account.parquet")
tweetDF = spark.read.parquet("/tmp/" + username + "/tweet.parquet")
hashtagDF = spark.read.parquet("/tmp/" + username + "/hashtag.parquet")
urlDF = spark.read.parquet("/tmp/" + username + "/url.parquet")

dbTest("ET1-P-08-10-01", DataFrame, type(accountDF))
dbTest("ET1-P-08-10-02", DataFrame, type(tweetDF))
dbTest("ET1-P-08-10-03", DataFrame, type(hashtagDF))
dbTest("ET1-P-08-10-04", DataFrame, type(urlDF))

print("Tests passed!")

## IMPORTANT Next Steps
* Please complete the <a href="https://www.surveymonkey.com/r/WPD7YNV" target="_blank">short feedback survey</a>.  Your input is extremely important and shapes future course development.
* Congratulations, you have completed ETL Part 1!

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>