# Group E S2 - Ukraine War -Twitter Promotion from raw to std layer

# Sections
* [1. Setup](#1)
  * [1.1 Start Hadoop](#1.1)  
  * [1.2 Search for Spark Installation](#1.2)
  * [1.3 Create SparkSession](#1.3)
* [2. Lab](#2)
  * [2.1 Check Files](#2.1)
  * [2.2 Read raw DataFrame](#2.3)
  * [2.3 Transform raw DataFrame](#2.3)
  * [2.4 Write DataFrame to std](#2.4)
  * [2.5 Code improvements](#2.5)

<a id='1'></a>
## 1. Setup

Since we are going to process data stored from HDFS let's start the service

<a id='1.1'></a>
### 1.1 Start Hadoop

Start Hadoop
Open a terminal and execute
```sh
hadoop-start.sh
```

<a id='1.2'></a>
### 1.2 Search for Spark Installation 
This step is required just because we are working in the course environment.

In [1]:
import findspark
findspark.init()

I'm changing pandas max column width property to improve data displaying

In [2]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

<a id='1.3'></a>
### 1.3 Create SparkSession

By setting this environment variable we can include extra libraries in our Spark cluster

In [3]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /opt/hive3/lib/hive-hcatalog-core-3.1.2.jar pyspark-shell'

The first thing always is to create the SparkSession

In [4]:
from pyspark.sql.session import SparkSession

spark = SparkSession.builder\
.appName("RAW to STD - DataFrames")\
.config("spark.sql.warehouse.dir","hdfs://localhost:9000/warehouse")\
.config("spark.sql.legacy.timeParserPolicy","LEGACY")\
.config("spark.sql.sources.partitionOverwriteMode","dynamic")\
.enableHiveSupport()\
.getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


<a id='2'></a>
## 2. Lab

<a id='2.1'></a>
### 2.1 Check  Files

To promote the twitter files, we need to ingest them using nifi as we describe in our report.
The files in raw format will be located here:

http://localhost:50070/explorer.html#/datalake/raw/twitter/War/

<a id='2.2'></a>
### 2.2 Read Raw DataFrame

We can infer the schema of the underlying json files by setting this option during the reading operation.<br/>
This is not recommended in production workloads as is very expensive (Spark will scan all the files in order to determine all the columns)


In [5]:
tweets_raw = spark.read.option("inferSchema","true")\
                       .option("recursiveFileLookup", "true")\
                       .json("hdfs://localhost:9000/datalake/raw/twitter/War/")
                       
tweets_raw.limit(5).toPandas()

                                                                                

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,extended_tweet,favorite_count,favorited,filter_level,...,reply_count,retweet_count,retweeted,retweeted_status,source,text,timestamp_ms,truncated,user,withheld_in_countries
0,,,Sun Feb 27 12:08:48 +0000 2022,,"([], None, [], [], [(80653307, 80653307, [3, 14], Mikhail Khodorkovsky (English), mbk_center)])",,,0,False,low,...,0,0,False,"(None, None, Sun Feb 27 01:20:51 +0000 2022, None, ([], None, [], [Row(display_url='twitter.com/i/web/status/1…', expanded_url='https://twitter.com/i/web/status/1497743498094825475', indices=[117, 140], url='https://t.co/WDULKtPX0I')], []), None, ([0, 149], ([], None, [], [Row(display_url='dw.com/en/ukraine-ant…', expanded_url='https://www.dw.com/en/ukraine-anti-war-protests-take-place-around-the-world/a-60930141?fbclid=IwAR0Q21jELU1JJKZG3Kpnw_py5IPFVCjdcqjWUDgewHWjPUlqOwE-GzOAmdg', indices=[126, 149], url='https://t.co/wBaScBeXs6')], []), None, Putin's invasion of Ukraine🇺🇦 has sparked a host of anti-war demonstrations in cities around the world, including Russia 🇷🇺. https://t.co/wBaScBeXs6), 588, False, low, None, 1497743498094825475, 1497743498094825475, None, None, None, None, None, False, en, None, False, 5, None, None, None, None, 16, 107, False, None, <a href=""https://mobile.twitter.com"" rel=""nofollow"">Twitter Web App</a>, Putin's invasion of Ukraine🇺🇦 has sparked a host of anti-war demonstrations in cities around the world, including R… https://t.co/WDULKtPX0I, True, (False, Wed Oct 07 19:03:44 +0000 2009, False, False, Mikhail Khodorkovsky @mich261213 Twitter Account in English, 17519, None, 28401, None, 1664, True, 80653307, 80653307, False, None, 702, London, Mikhail Khodorkovsky (English), None, 1E2529, http://abs.twimg.com/images/themes/theme1/bg.png, https://abs.twimg.com/images/themes/theme1/bg.png, True, https://pbs.twimg.com/profile_banners/80653307/1615379899, http://pbs.twimg.com/profile_images/1404398088224133130/_cFSk3wh_normal.jpg, https://pbs.twimg.com/profile_images/1404398088224133130/_cFSk3wh_normal.jpg, 0084B4, C0DEED, DDEEF6, 333333, True, False, mbk_center, 35576, None, none, https://khodorkovsky.com/, None, False, []), None)","<a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>","RT @mbk_center: Putin's invasion of Ukraine🇺🇦 has sparked a host of anti-war demonstrations in cities around the world, including Russia 🇷🇺…",1645963728513,False,"(False, Sun Apr 03 21:06:31 +0000 2016, True, False, Graphics Designer | Travel and Tour consultant | Man United Fan⚽ | Music Lover | Movie lover.... I always Follow Back✌️, 4075, None, 188, None, 1091, False, 716733642488299520, 716733642488299520, False, None, 0, Ogun, Nigeria, Abęfę🤴, None, F5F8FA, , , False, https://pbs.twimg.com/profile_banners/716733642488299520/1617610000, http://pbs.twimg.com/profile_images/1378982406951202820/Ww9zr1oh_normal.jpg, https://pbs.twimg.com/profile_images/1378982406951202820/Ww9zr1oh_normal.jpg, 1DA1F2, C0DEED, DDEEF6, 333333, True, False, abefe007, 3347, None, none, None, None, False, [])",
1,,,Sun Feb 27 12:08:48 +0000 2022,,"([], None, [], [], [(1027991081584144386, 1027991081584144386, [3, 11], nws, load_pm)])",,,0,False,low,...,0,0,False,"(None, None, Sun Feb 27 12:02:00 +0000 2022, [0, 140], ([], None, [], [Row(display_url='twitter.com/i/web/status/1…', expanded_url='https://twitter.com/i/web/status/1497904851489107969', indices=[117, 140], url='https://t.co/E1mhc4wTSg')], []), None, ([0, 273], ([Row(indices=[235, 246], text='stoprussia'), Row(indices=[247, 255], text='Украина'), Row(indices=[256, 264], text='Україна'), Row(indices=[265, 273], text='ukraine')], [Row(additional_media_info=Row(description=None, embeddable=None, monetizable=False, title=None), description=None, display_url='pic.twitter.com/8znkA8g6im', expanded_url='https://twitter.com/load_pm/status/1497904851489107969/video/1', id=1497904821160099843, id_str='1497904821160099843', indices=[274, 297], media_url='http://pbs.twimg.com/ext_tw_video_thumb/1497904821160099843/pu/img/JuTsvq9OrkTNxVrw.jpg', media_url_https='https://pbs.twimg.com/ext_tw_video_thumb/1497904821160099843/pu/img/JuTsvq9OrkTNxVrw.jpg', sizes=Row(large=Row(h=464, resize='fit', w=848), medium=Row(h=464, resize='fit', w=848), small=Row(h=372, resize='fit', w=680), thumb=Row(h=150, resize='crop', w=150)), source_status_id=None, source_status_id_str=None, source_user_id=None, source_user_id_str=None, type='video', url='https://t.co/8znkA8g6im', video_info=Row(aspect_ratio=[53, 29], duration_millis=17869, variants=[Row(bitrate=832000, content_type='video/mp4', url='https://video.twimg.com/ext_tw_video/1497904821160099843/pu/vid/848x464/td9POzDPWWqSC0Wi.mp4?tag=12'), Row(bitrate=256000, content_type='video/mp4', url='https://video.twimg.com/ext_tw_video/1497904821160099843/pu/vid/492x270/c05E-PSDLx_sWcmv.mp4?tag=12'), Row(bitrate=None, content_type='application/x-mpegURL', url='https://video.twimg.com/ext_tw_video/1497904821160099843/pu/pl/LsG2uAoaXB0TsN5x.m3u8?tag=12&container=fmp4')]))], [], [], []), ([Row(additional_media_info=Row(description=None, embeddable=None, monetizable=False, title=None), description=None, display_url='pic.twitter.com/8znkA8g6im', expanded_url='https://twitter.com/load_pm/status/1497904851489107969/video/1', id=1497904821160099843, id_str='1497904821160099843', indices=[274, 297], media_url='http://pbs.twimg.com/ext_tw_video_thumb/1497904821160099843/pu/img/JuTsvq9OrkTNxVrw.jpg', media_url_https='https://pbs.twimg.com/ext_tw_video_thumb/1497904821160099843/pu/img/JuTsvq9OrkTNxVrw.jpg', sizes=Row(large=Row(h=464, resize='fit', w=848), medium=Row(h=464, resize='fit', w=848), small=Row(h=372, resize='fit', w=680), thumb=Row(h=150, resize='crop', w=150)), source_status_id=None, source_status_id_str=None, source_user_id=None, source_user_id_str=None, type='video', url='https://t.co/8znkA8g6im', video_info=Row(aspect_ratio=[53, 29], duration_millis=17869, variants=[Row(bitrate=832000, content_type='video/mp4', url='https://video.twimg.com/ext_tw_video/1497904821160099843/pu/vid/848x464/td9POzDPWWqSC0Wi.mp4?tag=12'), Row(bitrate=256000, content_type='video/mp4', url='https://video.twimg.com/ext_tw_video/1497904821160099843/pu/vid/492x270/c05E-PSDLx_sWcmv.mp4?tag=12'), Row(bitrate=None, content_type='application/x-mpegURL', url='https://video.twimg.com/ext_tw_video/1497904821160099843/pu/pl/LsG2uAoaXB0TsN5x.m3u8?tag=12&container=fmp4')]))],), 💪Зараз у Хмельницькому жінки та діти плетуть маскувальні сітки, щоб допомогти нашим героям. Всі українці, від малого до великого, працюють на перемогу🇺🇦\n\nМи непереможні, адже кожен долучається до захисту своєї землі!\n\nСлава Україні !\n\n#stoprussia #Украина #Україна #ukraine https://t.co/8znkA8g6im), 12, False, low, None, 1497904851489107969, 1497904851489107969, None, None, None, None, None, False, uk, None, False, 0, None, None, None, None, 0, 1, False, None, <a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>, 💪Зараз у Хмельницькому жінки та діти плетуть маскувальні сітки, щоб допомогти нашим героям. Всі українці, від малог… https://t.co/E1mhc4wTSg, True, (False, Fri Aug 10 18:52:27 +0000 2018, True, False, 2022 Russian invasion of Ukraine, 108, None, 1652, None, 14, False, 1027991081584144386, 1027991081584144386, False, None, 21, UA Кременчук, nws, None, F5F8FA, , , False, https://pbs.twimg.com/profile_banners/1027991081584144386/1645953910, http://pbs.twimg.com/profile_images/1497180735576449031/T_pNZOlN_normal.jpg, https://pbs.twimg.com/profile_images/1497180735576449031/T_pNZOlN_normal.jpg, 1DA1F2, C0DEED, DDEEF6, 333333, True, False, load_pm, 395, None, none, None, None, False, []), None)","<a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>","RT @load_pm: 💪Зараз у Хмельницькому жінки та діти плетуть маскувальні сітки, щоб допомогти нашим героям. Всі українці, від малого до велико…",1645963728569,False,"(False, Mon Nov 28 23:57:34 +0000 2016, True, False, None, 37, None, 1, None, 14, False, 803387388533899264, 803387388533899264, False, None, 0, None, Piotr, None, F5F8FA, , , False, None, http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png, https://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png, 1DA1F2, C0DEED, DDEEF6, 333333, True, False, Grado_Wwa, 7, None, none, None, None, False, [])",
2,,,Sun Feb 27 12:08:48 +0000 2022,,"([], None, [], [], [(1466263472132423681, 1466263472132423681, [3, 11], Raebo, Raebo56)])",,,0,False,low,...,0,0,False,"(None, None, Sat Feb 26 17:07:49 +0000 2022, None, ([], None, [], [Row(display_url='twitter.com/i/web/status/1…', expanded_url='https://twitter.com/i/web/status/1497619424223772672', indices=[116, 139], url='https://t.co/9zJSIToIw8')], []), None, ([0, 180], ([], None, [], [], []), None, Don’t let the Russia invasion of Ukraine allow you to be distracted from the fact that \nJustin Trudeau and Jagmeet Singh just tried to take Canadians human rights away indefinitely), 440, False, low, None, 1497619424223772672, 1497619424223772672, None, None, None, None, None, False, en, None, None, 9, None, None, None, None, 17, 184, False, None, <a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>, Don’t let the Russia invasion of Ukraine allow you to be distracted from the fact that \nJustin Trudeau and Jagmeet… https://t.co/9zJSIToIw8, True, (False, Thu Dec 02 04:31:03 +0000 2021, True, False, Albertan, Secessionist, Political Activist, Conservative, Business Owner, Patent Holder, Inventor, Jack Russells, Aviation, Cosmetic Chemistry, Photography, 3086, None, 562, None, 594, False, 1466263472132423681, 1466263472132423681, False, None, 1, Alberta, Canada, Raebo, None, F5F8FA, , , False, None, http://pbs.twimg.com/profile_images/1466267537381814272/-2Ayv9-S_normal.jpg, https://pbs.twimg.com/profile_images/1466267537381814272/-2Ayv9-S_normal.jpg, 1DA1F2, C0DEED, DDEEF6, 333333, True, False, Raebo56, 7581, None, none, None, None, False, []), None)","<a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>",RT @Raebo56: Don’t let the Russia invasion of Ukraine allow you to be distracted from the fact that \nJustin Trudeau and Jagmeet Singh just…,1645963728571,False,"(False, Sun Feb 11 18:57:47 +0000 2018, True, False, Married, one adult son and a beautiful dog named Lily. Love all creatures except some humans I know:) Very into Canadian, British and U.S. politics., 53826, None, 525, None, 1225, True, 962762606766317568, 962762606766317568, False, None, 4, Montreal, Sheila J, None, F5F8FA, , , False, None, http://pbs.twimg.com/profile_images/962772067023245312/b2pn8HCa_normal.jpg, https://pbs.twimg.com/profile_images/962772067023245312/b2pn8HCa_normal.jpg, 1DA1F2, C0DEED, DDEEF6, 333333, True, False, Sheilaanne2191, 31286, None, none, None, None, False, [])",
3,,,Sun Feb 27 12:08:48 +0000 2022,,"([], None, [], [], [(1034205948007645184, 1034205948007645184, [3, 19], David Laufman, DavidLaufmanLaw)])",,,0,False,low,...,0,0,False,"(None, None, Sat Feb 26 19:25:20 +0000 2022, None, ([], None, [], [Row(display_url='twitter.com/i/web/status/1…', expanded_url='https://twitter.com/i/web/status/1497654030901948420', indices=[117, 140], url='https://t.co/DdRmq1YWQW')], []), None, ([0, 263], ([Row(indices=[233, 238], text='fara')], None, [], [Row(display_url='washingtonpost.com/media/2022/02/…', expanded_url='https://www.washingtonpost.com/media/2022/02/26/rt-america-putin-ukraine/', indices=[240, 263], url='https://t.co/uhPvorKu2d')], [Row(id=73181712, id_str='73181712', indices=[215, 230], name='Justice Department', screen_name='TheJusticeDept')]), None, Correction: It wasn’t “the Trump Administration” that required RT’s U.S. affiliate to register under the Foreign Agents Registration Act in 2017; it was non-political officials at the National Security Division of ⁦@TheJusticeDept⁩.\n#fara https://t.co/uhPvorKu2d), 433, False, low, None, 1497654030901948420, 1497654030901948420, None, None, None, None, None, False, en, None, False, 4, None, None, None, None, 4, 111, False, None, <a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>, Correction: It wasn’t “the Trump Administration” that required RT’s U.S. affiliate to register under the Foreign Ag… https://t.co/DdRmq1YWQW, True, (False, Mon Aug 27 22:28:07 +0000 2018, True, False, Representation in gov’t investigations and national security matters. Former Chief of DOJ Counterintelligence Section. Stalwart Houston Astros fan. Views my own, 3553, None, 29303, None, 1286, False, 1034205948007645184, 1034205948007645184, False, None, 259, Washington, DC, David Laufman, None, F5F8FA, , , False, https://pbs.twimg.com/profile_banners/1034205948007645184/1628726511, http://pbs.twimg.com/profile_images/1034207602727628801/imEvklvQ_normal.jpg, https://pbs.twimg.com/profile_images/1034207602727628801/imEvklvQ_normal.jpg, 1DA1F2, C0DEED, DDEEF6, 333333, True, False, DavidLaufmanLaw, 991, None, none, http://www.wiggin.com, None, False, []), None)","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @DavidLaufmanLaw: Correction: It wasn’t “the Trump Administration” that required RT’s U.S. affiliate to register under the Foreign Agent…,1645963728575,False,"(False, Tue Jun 02 20:27:08 +0000 2020, True, False, expat returned after little over 20 years. Trying to understand what’s going on?!?, 47309, None, 49, None, 120, False, 1267915607804194817, 1267915607804194817, False, None, 0, New York, USA, TerryDean, None, F5F8FA, , , False, https://pbs.twimg.com/profile_banners/1267915607804194817/1632346380, http://pbs.twimg.com/profile_images/1496686195681677319/zUErAAiq_normal.jpg, https://pbs.twimg.com/profile_images/1496686195681677319/zUErAAiq_normal.jpg, 1DA1F2, C0DEED, DDEEF6, 333333, True, False, TerryDe53425222, 6838, None, none, None, None, False, [])",
4,,,Sun Feb 27 12:08:48 +0000 2022,,"([], None, [], [], [(1475905928511766532, 1475905928511766532, [3, 19], Alexandre 𓅨𓂋𓀀, PessimismeActif)])",,,0,False,low,...,0,0,False,"(None, None, Sun Feb 27 11:39:09 +0000 2022, None, ([], None, [], [], []), None, None, 154, False, low, None, 1497899097491218438, 1497899097491218438, None, None, None, None, None, True, fr, None, None, 4, (None, None, Sun Feb 27 02:14:07 +0000 2022, [0, 21], ([], [Row(additional_media_info=Row(description=None, embeddable=None, monetizable=False, title=None), description=None, display_url='pic.twitter.com/6YrrO6yHr7', expanded_url='https://twitter.com/caissesdegreve/status/1497756905003012104/video/1', id=1497756333961060358, id_str='1497756333961060358', indices=[22, 45], media_url='http://pbs.twimg.com/ext_tw_video_thumb/1497756333961060358/pu/img/9xWAnqSJjeJMcm5o.jpg', media_url_https='https://pbs.twimg.com/ext_tw_video_thumb/1497756333961060358/pu/img/9xWAnqSJjeJMcm5o.jpg', sizes=Row(large=Row(h=720, resize='fit', w=1280), medium=Row(h=675, resize='fit', w=1200), small=Row(h=383, resize='fit', w=680), thumb=Row(h=150, resize='crop', w=150)), source_status_id=None, source_status_id_str=None, source_user_id=None, source_user_id_str=None, type='photo', url='https://t.co/6YrrO6yHr7')], [], [], []), ([Row(additional_media_info=Row(description=None, embeddable=None, monetizable=False, title=None), description=None, display_url='pic.twitter.com/6YrrO6yHr7', expanded_url='https://twitter.com/caissesdegreve/status/1497756905003012104/video/1', id=1497756333961060358, id_str='1497756333961060358', indices=[22, 45], media_url='http://pbs.twimg.com/ext_tw_video_thumb/1497756333961060358/pu/img/9xWAnqSJjeJMcm5o.jpg', media_url_https='https://pbs.twimg.com/ext_tw_video_thumb/1497756333961060358/pu/img/9xWAnqSJjeJMcm5o.jpg', sizes=Row(large=Row(h=720, resize='fit', w=1280), medium=Row(h=675, resize='fit', w=1200), small=Row(h=383, resize='fit', w=680), thumb=Row(h=150, resize='crop', w=150)), source_status_id=None, source_status_id_str=None, source_user_id=None, source_user_id_str=None, type='video', url='https://t.co/6YrrO6yHr7', video_info=Row(aspect_ratio=[16, 9], duration_millis=62920, variants=[Row(bitrate=256000, content_type='video/mp4', url='https://video.twimg.com/ext_tw_video/1497756333961060358/pu/vid/480x270/VY212mP7vxX3STn9.mp4?tag=12'), Row(bitrate=832000, content_type='video/mp4', url='https://video.twimg.com/ext_tw_video/1497756333961060358/pu/vid/640x360/RoNvnDxIiPb8t8wh.mp4?tag=12'), Row(bitrate=2176000, content_type='video/mp4', url='https://video.twimg.com/ext_tw_video/1497756333961060358/pu/vid/1280x720/FCm5oVM-boQCL0ts.mp4?tag=12'), Row(bitrate=None, content_type='application/x-mpegURL', url='https://video.twimg.com/ext_tw_video/1497756333961060358/pu/pl/QhTjEhO8tzRKTCJu.m3u8?tag=12&container=fmp4')]))],), None, 15854, False, low, None, 1497756905003012104, 1497756905003012104, None, None, None, None, None, False, es, None, False, 1582, None, None, 374, 8548, False, None, <a href=""https://mobile.twitter.com"" rel=""nofollow"">Twitter Web App</a>, 50 nuances de racisme https://t.co/6YrrO6yHr7, False, (False, Tue Dec 24 12:50:13 +0000 2019, True, False, Vidéos diverses et recensement des caisses de grève en ligne.\n\nau cas où https://www.buymeacoffee.com/caissesdegreve, 6520, None, 30640, None, 273, False, 1209456196916301824, 1209456196916301824, False, None, 143, None, Caisses de grève, None, F5F8FA, , , False, https://pbs.twimg.com/profile_banners/1209456196916301824/1577199552, http://pbs.twimg.com/profile_images/1239891411564191745/s1D9fg4W_normal.jpg, https://pbs.twimg.com/profile_images/1239891411564191745/s1D9fg4W_normal.jpg, 1DA1F2, C0DEED, DDEEF6, 333333, True, False, caissesdegreve, 5311, None, none, http://caissesdegreve.fr, None, False, []), None), 1497756905003012104, 1497756905003012104, (twitter.com/caissesdegreve…, https://twitter.com/caissesdegreve/status/1497756905003012104, https://t.co/2X41458p31), 14, 33, False, None, <a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>, Oui, l’Ukraine est plus chère et proche à nos cœurs que l’Afghanistan ou le Togo.\n\nEt on s’excusera de rien du tout., False, (False, Tue Dec 28 19:06:21 +0000 2021, True, False, Chats, Géopolitique & Pessimisme | moi/je, 3613, None, 1125, None, 307, False, 1475905928511766532, 1475905928511766532, False, None, 1, Paris, France, Alexandre 𓅨𓂋𓀀, None, F5F8FA, , , False, https://pbs.twimg.com/profile_banners/1475905928511766532/1640721000, http://pbs.twimg.com/profile_images/1475916961032847365/R3J-Yz5V_normal.jpg, https://pbs.twimg.com/profile_images/1475916961032847365/R3J-Yz5V_normal.jpg, 1DA1F2, C0DEED, DDEEF6, 333333, True, False, PessimismeActif, 2438, None, none, None, None, False, []), None)","<a href=""https://mobile.twitter.com"" rel=""nofollow"">Twitter Web App</a>","RT @PessimismeActif: Oui, l’Ukraine est plus chère et proche à nos cœurs que l’Afghanistan ou le Togo.\n\nEt on s’excusera de rien du tout.",1645963728541,False,"(False, Sat Oct 12 12:13:06 +0000 2013, False, False, TOUS A VOS TEE SHIRT pour \nLIBERTE - EGALITE - FRATERNITE -\nRUES DE FRANCES LAÏQUES -\nLA RELIGION A LA MAISON -\nPOUR UNE FRANCE EN PAIX et SANS DISCREMINATION, 18668, None, 1280, None, 1619, True, 1956489876, 1956489876, False, None, 14, Aquitaine, France, lemmerdeuse 🇫🇷🇮🇱Z, None, 000000, http://abs.twimg.com/images/themes/theme1/bg.png, https://abs.twimg.com/images/themes/theme1/bg.png, False, https://pbs.twimg.com/profile_banners/1956489876/1442309512, http://pbs.twimg.com/profile_images/643718584426983424/kg1ilVbC_normal.jpg, https://pbs.twimg.com/profile_images/643718584426983424/kg1ilVbC_normal.jpg, 0084B4, 000000, 000000, 000000, False, False, lemmerdeuse24, 22247, None, none, None, None, False, [])",


Setting up the schema during data reading.

In [6]:
schema="""
created_at string,
id_str string,
text string,

`user` struct<
            name:string,
            screen_name:string,
            location:string,
            description:string,
            followers_count:bigint,
            friends_count:bigint,
            listed_count:bigint,
            favourites_count:bigint,
            statuses_count:bigint,
            created_at:string
            >,
retweeted_status struct <
            quot_count:int,
            reply_count:int,
            retweet_count:int,
            favorite_count:int,
            user: 
            struct <
                id_str:string,
                name:string,
                screen_name:string,
                location:string,
                description:string,
                followers_count:bigint,
                friends_count:bigint,
                listed_count:bigint,
                favourites_count:bigint,
                statuses_count:bigint,
                created_at:string
            >
            >,
place struct<
            country:string,
            country_code:string,
            full_name:string,
            place_type:string,
            url:string
            >,
quote_count bigint,
reply_count bigint,
retweet_count bigint,
favorite_count bigint,
entities struct<
            user_mentions:array<struct<screen_name:string>>,
            hashtags:array<struct<text:string>>, 
            media:array<struct<expanded_url:string>>, 
            urls:array<struct<expanded_url:string>>, 
            symbols:array<struct<text:string>>
            >,
favorited boolean,
retweeted boolean,
possibly_sensitive boolean,
filter_level string,
lang string
"""
tweets_raw = spark.read.schema(schema)\
                       .option("recursiveFileLookup", "true")\
                       .json("hdfs://localhost:9000/datalake/raw/twitter/War/")
                       
tweets_raw.limit(5).toPandas()


Unnamed: 0,created_at,id_str,text,user,retweeted_status,place,quote_count,reply_count,retweet_count,favorite_count,entities,favorited,retweeted,possibly_sensitive,filter_level,lang
0,Sun Feb 27 12:08:48 +0000 2022,1497906560814796801,"RT @mbk_center: Putin's invasion of Ukraine🇺🇦 has sparked a host of anti-war demonstrations in cities around the world, including Russia 🇷🇺…","(Abęfę🤴, abefe007, Ogun, Nigeria, Graphics Designer | Travel and Tour consultant | Man United Fan⚽ | Music Lover | Movie lover.... I always Follow Back✌️, 188, 1091, 0, 4075, 3347, Sun Apr 03 21:06:31 +0000 2016)","(None, 16, 107, 588, (80653307, Mikhail Khodorkovsky (English), mbk_center, London, Mikhail Khodorkovsky @mich261213 Twitter Account in English, 28401, 1664, 702, 17519, 35576, Wed Oct 07 19:03:44 +0000 2009))",,0,0,0,0,"([(mbk_center,)], [], None, [], [])",False,False,,low,en
1,Sun Feb 27 12:08:48 +0000 2022,1497906561049583620,"RT @load_pm: 💪Зараз у Хмельницькому жінки та діти плетуть маскувальні сітки, щоб допомогти нашим героям. Всі українці, від малого до велико…","(Piotr, Grado_Wwa, None, None, 1, 14, 0, 37, 7, Mon Nov 28 23:57:34 +0000 2016)","(None, 0, 1, 12, (1027991081584144386, nws, load_pm, UA Кременчук, 2022 Russian invasion of Ukraine, 1652, 14, 21, 108, 395, Fri Aug 10 18:52:27 +0000 2018))",,0,0,0,0,"([(load_pm,)], [], None, [], [])",False,False,,low,uk
2,Sun Feb 27 12:08:48 +0000 2022,1497906561058062340,RT @Raebo56: Don’t let the Russia invasion of Ukraine allow you to be distracted from the fact that \nJustin Trudeau and Jagmeet Singh just…,"(Sheila J, Sheilaanne2191, Montreal, Married, one adult son and a beautiful dog named Lily. Love all creatures except some humans I know:) Very into Canadian, British and U.S. politics., 525, 1225, 4, 53826, 31286, Sun Feb 11 18:57:47 +0000 2018)","(None, 17, 184, 440, (1466263472132423681, Raebo, Raebo56, Alberta, Canada, Albertan, Secessionist, Political Activist, Conservative, Business Owner, Patent Holder, Inventor, Jack Russells, Aviation, Cosmetic Chemistry, Photography, 562, 594, 1, 3086, 7581, Thu Dec 02 04:31:03 +0000 2021))",,0,0,0,0,"([(Raebo56,)], [], None, [], [])",False,False,,low,en
3,Sun Feb 27 12:08:48 +0000 2022,1497906561074741250,RT @DavidLaufmanLaw: Correction: It wasn’t “the Trump Administration” that required RT’s U.S. affiliate to register under the Foreign Agent…,"(TerryDean, TerryDe53425222, New York, USA, expat returned after little over 20 years. Trying to understand what’s going on?!?, 49, 120, 0, 47309, 6838, Tue Jun 02 20:27:08 +0000 2020)","(None, 4, 111, 433, (1034205948007645184, David Laufman, DavidLaufmanLaw, Washington, DC, Representation in gov’t investigations and national security matters. Former Chief of DOJ Counterintelligence Section. Stalwart Houston Astros fan. Views my own, 29303, 1286, 259, 3553, 991, Mon Aug 27 22:28:07 +0000 2018))",,0,0,0,0,"([(DavidLaufmanLaw,)], [], None, [], [])",False,False,,low,en
4,Sun Feb 27 12:08:48 +0000 2022,1497906560932134913,"RT @PessimismeActif: Oui, l’Ukraine est plus chère et proche à nos cœurs que l’Afghanistan ou le Togo.\n\nEt on s’excusera de rien du tout.","(lemmerdeuse 🇫🇷🇮🇱Z, lemmerdeuse24, Aquitaine, France, TOUS A VOS TEE SHIRT pour \nLIBERTE - EGALITE - FRATERNITE -\nRUES DE FRANCES LAÏQUES -\nLA RELIGION A LA MAISON -\nPOUR UNE FRANCE EN PAIX et SANS DISCREMINATION, 1280, 1619, 14, 18668, 22247, Sat Oct 12 12:13:06 +0000 2013)","(None, 14, 33, 154, (1475905928511766532, Alexandre 𓅨𓂋𓀀, PessimismeActif, Paris, France, Chats, Géopolitique & Pessimisme | moi/je, 1125, 307, 1, 3613, 2438, Tue Dec 28 19:06:21 +0000 2021))",,0,0,0,0,"([(PessimismeActif,)], [], None, [], [])",False,False,,low,fr


<a id='2.3'></a>
### 2.3 Transform Raw DataFrame
We transform the column created_at to its proper timestamp data type because is a string type and also we created some derive columns (dt, year, hour and day-hour) to store data information.<br/>

In [7]:
import pyspark.sql.functions as F
from pyspark.sql.functions import *

In [8]:
tweets_std = tweets_raw\
             .withColumn("created_at",F.to_timestamp(F.col("created_at"),"EEE MMM dd HH:mm:ss ZZZZZ yyyy"))\
             .withColumn("year",F.year("created_at"))\
             .withColumn("dt",F.to_date("created_at"))\
             .withColumn("hour",F.hour("created_at"))\
             .withColumn("day-hour",concat_ws("-",F.to_date("created_at"),F.hour("created_at")))
                
tweets_std.limit(50).toPandas()

Unnamed: 0,created_at,id_str,text,user,retweeted_status,place,quote_count,reply_count,retweet_count,favorite_count,entities,favorited,retweeted,possibly_sensitive,filter_level,lang,year,dt,hour,day-hour
0,2022-02-27 13:08:48,1497906560814796801,"RT @mbk_center: Putin's invasion of Ukraine🇺🇦 has sparked a host of anti-war demonstrations in cities around the world, including Russia 🇷🇺…","(Abęfę🤴, abefe007, Ogun, Nigeria, Graphics Designer | Travel and Tour consultant | Man United Fan⚽ | Music Lover | Movie lover.... I always Follow Back✌️, 188, 1091, 0, 4075, 3347, Sun Apr 03 21:06:31 +0000 2016)","(None, 16, 107, 588, (80653307, Mikhail Khodorkovsky (English), mbk_center, London, Mikhail Khodorkovsky @mich261213 Twitter Account in English, 28401, 1664, 702, 17519, 35576, Wed Oct 07 19:03:44 +0000 2009))",,0,0,0,0,"([(mbk_center,)], [], None, [], [])",False,False,,low,en,2022,2022-02-27,13,2022-02-27-13
1,2022-02-27 13:08:48,1497906561049583620,"RT @load_pm: 💪Зараз у Хмельницькому жінки та діти плетуть маскувальні сітки, щоб допомогти нашим героям. Всі українці, від малого до велико…","(Piotr, Grado_Wwa, None, None, 1, 14, 0, 37, 7, Mon Nov 28 23:57:34 +0000 2016)","(None, 0, 1, 12, (1027991081584144386, nws, load_pm, UA Кременчук, 2022 Russian invasion of Ukraine, 1652, 14, 21, 108, 395, Fri Aug 10 18:52:27 +0000 2018))",,0,0,0,0,"([(load_pm,)], [], None, [], [])",False,False,,low,uk,2022,2022-02-27,13,2022-02-27-13
2,2022-02-27 13:08:48,1497906561058062340,RT @Raebo56: Don’t let the Russia invasion of Ukraine allow you to be distracted from the fact that \nJustin Trudeau and Jagmeet Singh just…,"(Sheila J, Sheilaanne2191, Montreal, Married, one adult son and a beautiful dog named Lily. Love all creatures except some humans I know:) Very into Canadian, British and U.S. politics., 525, 1225, 4, 53826, 31286, Sun Feb 11 18:57:47 +0000 2018)","(None, 17, 184, 440, (1466263472132423681, Raebo, Raebo56, Alberta, Canada, Albertan, Secessionist, Political Activist, Conservative, Business Owner, Patent Holder, Inventor, Jack Russells, Aviation, Cosmetic Chemistry, Photography, 562, 594, 1, 3086, 7581, Thu Dec 02 04:31:03 +0000 2021))",,0,0,0,0,"([(Raebo56,)], [], None, [], [])",False,False,,low,en,2022,2022-02-27,13,2022-02-27-13
3,2022-02-27 13:08:48,1497906561074741250,RT @DavidLaufmanLaw: Correction: It wasn’t “the Trump Administration” that required RT’s U.S. affiliate to register under the Foreign Agent…,"(TerryDean, TerryDe53425222, New York, USA, expat returned after little over 20 years. Trying to understand what’s going on?!?, 49, 120, 0, 47309, 6838, Tue Jun 02 20:27:08 +0000 2020)","(None, 4, 111, 433, (1034205948007645184, David Laufman, DavidLaufmanLaw, Washington, DC, Representation in gov’t investigations and national security matters. Former Chief of DOJ Counterintelligence Section. Stalwart Houston Astros fan. Views my own, 29303, 1286, 259, 3553, 991, Mon Aug 27 22:28:07 +0000 2018))",,0,0,0,0,"([(DavidLaufmanLaw,)], [], None, [], [])",False,False,,low,en,2022,2022-02-27,13,2022-02-27-13
4,2022-02-27 13:08:48,1497906560932134913,"RT @PessimismeActif: Oui, l’Ukraine est plus chère et proche à nos cœurs que l’Afghanistan ou le Togo.\n\nEt on s’excusera de rien du tout.","(lemmerdeuse 🇫🇷🇮🇱Z, lemmerdeuse24, Aquitaine, France, TOUS A VOS TEE SHIRT pour \nLIBERTE - EGALITE - FRATERNITE -\nRUES DE FRANCES LAÏQUES -\nLA RELIGION A LA MAISON -\nPOUR UNE FRANCE EN PAIX et SANS DISCREMINATION, 1280, 1619, 14, 18668, 22247, Sat Oct 12 12:13:06 +0000 2013)","(None, 14, 33, 154, (1475905928511766532, Alexandre 𓅨𓂋𓀀, PessimismeActif, Paris, France, Chats, Géopolitique & Pessimisme | moi/je, 1125, 307, 1, 3613, 2438, Tue Dec 28 19:06:21 +0000 2021))",,0,0,0,0,"([(PessimismeActif,)], [], None, [], [])",False,False,,low,fr,2022,2022-02-27,13,2022-02-27-13
5,2022-02-27 13:08:48,1497906561125023745,"RT @thehill: Germany says its sending anti-tank weapons, stinger missiles to Ukraine https://t.co/YwIi7KJPPn https://t.co/0fV1BFgovZ","(#Get Vaccinated, jtatsuno, USA, The Republican Party died Jan 6 & became the Insurrection Party., 1518, 1907, 67, 8422, 71377, Tue Mar 24 17:52:29 +0000 2009)","(None, 5, 27, 103, (1917731, The Hill, thehill, Washington, DC, The Hill is the premier source for policy and political news. Follow for tweets on what's happening in Washington, breaking news and retweets of our reporters., 4316605, 290, 28996, 10, 988995, Thu Mar 22 18:15:18 +0000 2007))",,0,0,0,0,"([(thehill,)], [], [(https://twitter.com/thehill/status/1497905617721339907/photo/1,)], [(http://hill.cm/CjdC3Pu,)], [])",False,False,False,low,en,2022,2022-02-27,13,2022-02-27-13
6,2022-02-27 13:08:48,1497906560852389892,"RT @kamilkazani: News about 10 000 Chechen troops leaving to Ukraine alarmed many. And yet, one must know context to understand its meaning…","(John McTernan, johnmcternan, London, England, Strategist and commentator. Formerly Tony Blair's Political Secretary and Julia Gillard's Comms Director. No 2, PR Influencer Index., 33503, 13114, 571, 265, 213829, Tue Feb 17 23:31:28 +0000 2009)","(None, 141, 1481, 5836, (1364845405, Kamil Galeev, kamilkazani, Washington DC, Galina Starovoitova Fellow @WoodrowWilsonCenter. MLitt in Early Modern History, St Andrews. MA in China Studies, Peking University, 28785, 605, 452, 863, 1824, Fri Apr 19 16:30:08 +0000 2013))",,0,0,0,0,"([(kamilkazani,)], [], None, [], [])",False,False,,low,en,2022,2022-02-27,13,2022-02-27-13
7,2022-02-27 13:08:48,1497906561125126148,RT @spectatorindex: BREAKING: Japan's Prime Minister Kishida says his country will join SWIFT measures against Russia,"(Baba Chelsea, JrKambewa, Mashonaland Central, Zimbabwe, ●@ChelseaFC fanatic\n●GMO @MvurwiHospital, 2450, 3630, 9, 170805, 93186, Tue Aug 26 07:45:23 +0000 2014)","(None, 72, 1515, 8226, (1626294277, The Spectator Index, spectatorindex, Global, Focused on finance, business, news, science and sports., 2109410, 0, 12238, 5, 9890, Sat Jul 27 20:42:27 +0000 2013))",,0,0,0,0,"([(spectatorindex,)], [], None, [], [])",False,False,,low,en,2022,2022-02-27,13,2022-02-27-13
8,2022-02-27 13:08:48,1497906561213034502,"RT @RichardEngel: Multiple reports and videos of Russian vehicles running out of fuel, stuck on the roads in Ukraine. There’s now an onlin…","(sunilraju, sunilraju1, None, None, 249, 4046, 1, 172787, 58877, Wed Jun 08 04:12:41 +0000 2011)","(None, 110, 1392, 6468, (47438401, Richard Engel, RichardEngel, None, @NBCNews Chief Foreign Correspondent, 572764, 761, 8701, 506, 9802, Mon Jun 15 20:40:51 +0000 2009))",,0,0,0,0,"([(RichardEngel,)], [], None, [], [])",False,False,,low,en,2022,2022-02-27,13,2022-02-27-13
9,2022-02-27 13:08:48,1497906561141944322,RT @mhikaric: Belarus is doing a kangaroo referendum tomorrow which will remove its formal neutrality and allow Russia to base nuclear weap…,"(Christoffer Skogholt, ChristofferSk11, None, Ph D-student in philosophy of religion. Writing on theological anthropology in relation to evolutionary biology. Podcast: http://trofornuft.libsyn.com, 256, 881, 3, 3006, 2561, Tue Feb 02 09:27:18 +0000 2021)","(None, 110, 2463, 9365, (1431294506452783105, Michael Hikari Cecire 🇺🇦, mhikaric, Trantor, Sr Policy Advisor at @HelsinkiComm. Former econ devt. & defense production @CRS4Congress. The only non PhD on Twitter. Dogs are people too. RT≠E. Personal acct., 1445, 1980, 28, 6440, 3178, Fri Aug 27 16:36:29 +0000 2021))",,0,0,0,0,"([(mhikaric,)], [], None, [], [])",False,False,,low,en,2022,2022-02-27,13,2022-02-27-13


<a id='2.4'></a>
### 2.4 Write DataFrame to std

In this case since a single day of data would be comprised of multiple files (partitions) I'm going to regroup them in just one with the coalesce method).<br/>
Also I'm going to partition the data using the columns created in the previous step.

In [9]:
(tweets_std.coalesce(1)
          .write
          .partitionBy("year","dt","day-hour")
          .mode("overwrite")
          .parquet("hdfs://localhost:9000/datalake/std/twitter/War/"))

                                                                                

<a id='2.5'></a>
### 2.5 Code improvements

The previous code it will read all the files from raw and write them to std layer.<br/>
This is not a good idea since we will convert and save the same data over and over if we execute this code every day.<br/>
With time based datasets like this one is typical just to take one day of data to just propote previous data to std layer without having to replay the transformations of previous days.<br/>
We also adapt the code with retweeted_Status structure as the original columns for counting favorites and retweets did not get data, and after reviewing Twitter API we decided those columns were the right ones.

In [10]:
from pyspark.sql.functions import *             

def promote_raw2std(day):
    schema="""
        created_at string,
        id_str string,
        text string,

        `user` struct<
                    name:string,
                    screen_name:string,
                    location:string,
                    description:string,
                    followers_count:bigint,
                    friends_count:bigint,
                    listed_count:bigint,
                    favourites_count:bigint,
                    statuses_count:bigint,
                    created_at:string
                    >,
        retweeted_status struct <
                    quot_count:int,
                    reply_count:int,
                    retweet_count:int,
                    favorite_count:int,
                    user: 
                    struct <
                        id_str:string,
                        name:string,
                        screen_name:string,
                        location:string,
                        description:string,
                        followers_count:bigint,
                        friends_count:bigint,
                        listed_count:bigint,
                        favourites_count:bigint,
                        statuses_count:bigint,
                        created_at:string
                    >
                    >,
        place struct<
                    country:string,
                    country_code:string,
                    full_name:string,
                    place_type:string,
                    url:string
                    >,
        quote_count bigint,
        reply_count bigint,
        retweet_count bigint,
        favorite_count bigint,
        entities struct<
                    user_mentions:array<struct<screen_name:string>>,
                    hashtags:array<struct<text:string>>, 
                    media:array<struct<expanded_url:string>>, 
                    urls:array<struct<expanded_url:string>>, 
                    symbols:array<struct<text:string>>
                    >,
        favorited boolean,
        retweeted boolean,
        possibly_sensitive boolean,
        filter_level string,
        lang string,
        hour int,
        year int,
        mix string
        """
    raw_location = f"hdfs://localhost:9000/datalake/raw/twitter/War/{day}/"
    (spark.read.schema(schema)
                       .option("recursiveFileLookup", "true")
                       .json(raw_location)
                       .withColumn("created_at",to_timestamp(col("created_at"),"EEE MMM dd HH:mm:ss ZZZZZ yyyy"))
                       .withColumn("year",year("created_at"))
                       .withColumn("dt",to_date("created_at"))
                       .withColumn("day-hour",concat_ws("-",F.to_date("created_at"),F.hour("created_at")))
                       .coalesce(1)
                       .write
                       .partitionBy("year","dt","day-hour")
                       .mode("overwrite")
                       .parquet("hdfs://localhost:9000/datalake/std/twitter/War/"))

In [11]:
tweets_std.printSchema()

root
 |-- created_at: timestamp (nullable = true)
 |-- id_str: string (nullable = true)
 |-- text: string (nullable = true)
 |-- user: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- screen_name: string (nullable = true)
 |    |-- location: string (nullable = true)
 |    |-- description: string (nullable = true)
 |    |-- followers_count: long (nullable = true)
 |    |-- friends_count: long (nullable = true)
 |    |-- listed_count: long (nullable = true)
 |    |-- favourites_count: long (nullable = true)
 |    |-- statuses_count: long (nullable = true)
 |    |-- created_at: string (nullable = true)
 |-- retweeted_status: struct (nullable = true)
 |    |-- quot_count: integer (nullable = true)
 |    |-- reply_count: integer (nullable = true)
 |    |-- retweet_count: integer (nullable = true)
 |    |-- favorite_count: integer (nullable = true)
 |    |-- user: struct (nullable = true)
 |    |    |-- id_str: string (nullable = true)
 |    |    |-- name: string (

Now we can promote the data per year, in this case, 2022.<br/>


In [12]:
# Change this date according to your data in HDFS

#promote_raw2std("2021/12/06")

#promote_raw2std("2021/12")

promote_raw2std("2022")

                                                                                