# Hello!

This is Group 9's Jupyter Notebook for data exploration. We wanted to figure out all sorts of things about the data we have, and the result of that exploration is the series of cells you see below.

We also wanted to present our results in a way that immerses the reader in the data exploration experience, so there's a narrative tone throughout this notebook. Enjoy!

---

As with any self-respecting Python program, let's first import the ~~minions~~ packages we'll need:

- `pandas` (DataFrames!)
- `plotly` (plotting)
- `re` (regular expressions)
- `nltk` (natural language processing)
- `emoji` (🤔)

In [18]:
%pip install emoji



In [19]:
import pandas as pd
import plotly.express as px
import plotly.io as pio
import re
import nltk
import emoji

pio.renderers.default = "notebook"  # so that plotly works in an HTML file

# So...where's our data?

The CSV files for our dataset are hosted in one of Daryll's GitHub repositories, so let's just yoink those into this notebook:

In [20]:
CSV_PATHS = [
    "https://raw.githubusercontent.com/daryll-ko/cs132-main/main/explorer/data/data_001-100.csv",
    "https://raw.githubusercontent.com/daryll-ko/cs132-main/main/explorer/data/data_101-150.csv",
    "https://raw.githubusercontent.com/daryll-ko/cs132-main/main/explorer/data/data_other_groups.csv",
]

In [21]:
dfs = [pd.read_csv(csv_path, index_col=False) for csv_path in CSV_PATHS]

There are three CSV files (and DataFrames) because our data is split into three parts: one for the first 100 samples, another for the remaining 50 samples, and the last one for other groups' samples. Having to work with three things at once is a pain, so let's concatenate the DataFrames together:

In [22]:
df = pd.concat(dfs, ignore_index=True)

# Preprocessing

Let's look at the shape of the data:

In [23]:
df.shape

(2120, 32)

That's...an awful lot of rows. It turns out that the CSV files still have the default blank rows from Google Sheets.

To fix this, let's just filter out the rows that don't have a `Tweet URL` or `Following` entry:

In [24]:
sentinels = ["Tweet URL", "Following"]

for sentinel in sentinels:
    df = df[df[sentinel].notna()]

df.reset_index(drop=True, inplace=True)

How many rows do we have now?

In [25]:
df.shape

(399, 32)

That's much better! Now let's look at our DataFrame's columns:

In [26]:
df.columns

Index(['ID', 'Timestamp', 'Tweet URL', 'Group', 'Collector', 'Category',
       'Topic', 'Keywords', 'Account handle', 'Account name', 'Account bio',
       'Account type', 'Joined', 'Following', 'Followers', 'Location', 'Tweet',
       'Tweet Translated', 'Tweet Type', 'Date posted', 'Screenshot',
       'Content type', 'Likes', 'Replies', 'Retweets', 'Quote Tweets', 'Views',
       'Rating', 'Reasoning', 'Remarks', 'Reviewer', 'Review'],
      dtype='object')

Some of these columns are irrelevant to the core of our data science project (e.g., `Timestamp` of data collection), so let's just drop 'em:

In [27]:
cols_to_drop = [
    "ID",
    "Timestamp",
    "Group",
    "Collector",
    "Category",
    "Topic",
    "Keywords",
    "Reviewer",
    "Review",
    "Remarks",
]

df = df.drop(columns=cols_to_drop)

What columns do we have left?

In [28]:
df.columns

Index(['Tweet URL', 'Account handle', 'Account name', 'Account bio',
       'Account type', 'Joined', 'Following', 'Followers', 'Location', 'Tweet',
       'Tweet Translated', 'Tweet Type', 'Date posted', 'Screenshot',
       'Content type', 'Likes', 'Replies', 'Retweets', 'Quote Tweets', 'Views',
       'Rating', 'Reasoning'],
      dtype='object')

That's much better! In subsequent sections, we'll clean up our data further and extract new features from our existing features that are more relevant to our problem (context analysis).
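
As a side note, the sentinel loop used earlier to drop blank rows can also be written as a single `dropna` call. The sketch below uses a made-up three-row frame (the URLs and values are placeholders, not our data) and checks that both approaches give the same result:

```python
import pandas as pd

# Toy frame standing in for the real data; the empty cells mimic the
# blank Google Sheets rows. All values here are made up.
toy = pd.DataFrame(
    {
        "Tweet URL": ["https://example.com/1", None, "https://example.com/3"],
        "Following": [10, None, None],
        "Likes": [1, None, 5],
    }
)

# Loop version, mirroring the notebook's approach.
looped = toy
for sentinel in ["Tweet URL", "Following"]:
    looped = looped[looped[sentinel].notna()]
looped = looped.reset_index(drop=True)

# Single-call version: drop rows missing any of the sentinel columns.
dropped = toy.dropna(subset=["Tweet URL", "Following"]).reset_index(drop=True)

print(dropped.equals(looped))
print(len(dropped))
```

Either form works; `dropna(subset=...)` just states the intent in one line.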

## Handling missing values

Before we even know *what* to handle, let's get a bird's-eye view of our dataset using the `describe()` method:

In [29]:
df.describe()

Unnamed: 0,Likes,Replies,Retweets,Quote Tweets,Views
count,399.0,399.0,399.0,128.0,7.0
mean,13.408521,1.498747,4.994987,1.601562,55.857143
std,72.995223,7.001391,25.784963,8.352792,57.744924
min,0.0,0.0,0.0,0.0,3.0
25%,0.0,0.0,0.0,0.0,22.0
50%,0.0,0.0,0.0,0.0,49.0
75%,2.0,1.0,1.0,0.0,59.0
max,738.0,70.0,281.0,72.0,177.0


There are some oddities here. First of all, some columns don't have all 399 entries: `Quote Tweets` has only 128 and `Views` has only 7 (these were optional columns in the data collection phase).

There's not much we can do about that, so let's just drop 'em as well...

In [30]:
more_cols_to_drop = [
    "Tweet Translated",
    "Screenshot",
    "Quote Tweets",
    "Views",
    "Rating",
]

df = df.drop(columns=more_cols_to_drop)

That should do the trick! Let's call the `describe` method again:

In [31]:
df.describe()

Unnamed: 0,Likes,Replies,Retweets
count,399.0,399.0,399.0
mean,13.408521,1.498747,4.994987
std,72.995223,7.001391,25.784963
min,0.0,0.0,0.0
25%,0.0,0.0,0.0
50%,0.0,0.0,0.0
75%,2.0,1.0,1.0
max,738.0,70.0,281.0


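Cool! At least for our quantitative features, we seem to have data that looks nor... wait a second, where are `Following` and `Followers`? They're supposed to be numbers too, but `describe` left them out, which means pandas didn't read them as numeric columns.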

To our dismay, we found that for some reason, we didn't include the `Joined`, `Following`, `Followers`, and `Location` info for one row. Oh well, let's see what we can do.

`Following` and `Followers` are data points at the ratio level, so once we've converted them to numbers (anything unparseable becomes `NaN`), we can impute the missing values with the mean of their respective columns:

In [33]:
cols_to_fill = ["Following", "Followers"]

df[cols_to_fill] = df[cols_to_fill].apply(pd.to_numeric, errors="coerce")

for col in cols_to_fill:
    mean = df[col].mean()
    df[col] = df[col].fillna(mean)

`describe` *should* be fine now...

In [34]:
df.describe()

Unnamed: 0,Following,Followers,Likes,Replies,Retweets
count,399.0,399.0,399.0,399.0,399.0
mean,628.275862,1299.478836,13.408521,1.498747,4.994987
std,1100.116199,7008.724823,72.995223,7.001391,25.784963
min,0.0,0.0,0.0,0.0,0.0
25%,77.0,25.5,0.0,0.0,0.0
50%,247.0,181.0,0.0,0.0,0.0
75%,628.275862,714.5,2.0,1.0,1.0
max,8058.0,85700.0,738.0,70.0,281.0
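
To make the coerce-then-impute step concrete, here is a minimal sketch on a made-up column (the values are placeholders, not our data). It shows that `errors="coerce"` turns unparseable strings into `NaN`, and that `mean()` skips `NaN` when computing the fill value:

```python
import pandas as pd

# Made-up column with two values that cannot be read as numbers.
toy = pd.DataFrame({"Followers": ["100", "300", "n/a", None]})

# Unparseable strings become NaN instead of raising an error.
toy["Followers"] = pd.to_numeric(toy["Followers"], errors="coerce")
n_missing = int(toy["Followers"].isna().sum())

# mean() ignores NaN by default, so the fill value is (100 + 300) / 2.
toy["Followers"] = toy["Followers"].fillna(toy["Followers"].mean())

print(n_missing)
print(toy["Followers"].tolist())
```

One thing to keep in mind: every imputed row lands exactly on the mean, which shifts the quantiles. In the summary above, the 75th percentile of `Following` is equal to its mean, which is the kind of thing you'd see when a fair number of rows received the imputed value.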


## Handling outliers + Standardization

For our project, tweets that are outliers in terms of likes, replies, or retweets are actually important: they have a lot of reach, so they give us a lot more context when mis/disinformation is our issue.

Thus, we're not going to drop these outliers; instead, we'll just identify them. It's good to know what kinds of tweets are getting a lot more attention than usual.

We first add new columns dedicated to showing the z-scores of the entries in some columns (standardization), then identify outliers based on these z-score columns:

In [35]:
cols = ["Likes", "Replies", "Retweets"]

for col in cols:
    new_col_name = f"{col} (z-scores)"
    df[new_col_name] = (df[col] - df[col].mean()) / df[col].std()
    print(f"Rows with outlier in [{col}]")
    display(df[abs(df[new_col_name]) > 3])
    print()

Rows with outlier in [Likes]


Unnamed: 0,Tweet URL,Account handle,Account name,Account bio,Account type,Joined,Following,Followers,Location,Tweet,Tweet Type,Date posted,Content type,Likes,Replies,Retweets,Reasoning,Likes (z-scores)
1,https://twitter.com/ItsJamMagno/status/1602713...,@itsjammagno,Jam Magno,"Ibahin niyo ako, palaban to.\nQueen of Facts. ...",Identified,08/15,0.0,76069.0,Butuan City,Jusko si Leni walang ambag sa Maritime Industr...,Text,14/12/22 01:14,"Rational, Emotional",612.0,51.0,94.0,Stranded seafarers sa 481 Mariners Safe Dormit...,8.20042
136,https://twitter.com/PulbuRonn/status/149790370...,@pulburonn,Ronn,Writer | AB Journalism from PUP (the UP bootle...,Identified,09/17,144.0,3768.0,,"Oh Leni, we don't need ""fake news"" to know how...",Text,27/02/22 19:57,Rational,371.0,31.0,85.0,Former VP Robredo has been productive during h...,4.898834
358,https://twitter.com/AndresRizal68/status/14606...,@AndresRizal68,Andres Rizal,Math enthusiast. Patriot. Strategist. Sic Parv...,Anonymous,10/21,568.0,558.0,Unspecified,Bakit Hindi si Leni:\r\n1. Mahina sa kominikas...,Text,16/11/21 21:57,Rational,721.0,49.0,281.0,Claims with no evidence,9.693668
372,https://twitter.com/Paps_Caloy/status/15173373...,@Paps_Caloy,Paps Caloy,Laban para sa Bayan 👊\r\n,Anonymous,01/18,1713.0,13600.0,"Pampanga, Philippines",Oh tapos sasabihin niyo redtag?,"Text, Image",22/04/22 10:59,Rational,738.0,70.0,267.0,Vice President Robredo already stated that she...,9.92656
375,https://twitter.com/jmy_perez/status/151740892...,@jmy_perez,Jimmy Perez,"Been there, done that; enjoying the life of my...",Identified,11/14,4971.0,4278.0,"Pasig City, Philippines",Hoy Losers\r\nAlam ba ninyo ito??,"Tweet, Image",22/04/22 15:44,Rational,633.0,58.0,213.0,Communist Party of the Philippines founder Jos...,8.48811



Rows with outlier in [Replies]


Unnamed: 0,Tweet URL,Account handle,Account name,Account bio,Account type,Joined,Following,Followers,Location,Tweet,Tweet Type,Date posted,Content type,Likes,Replies,Retweets,Reasoning,Likes (z-scores),Replies (z-scores)
1,https://twitter.com/ItsJamMagno/status/1602713...,@itsjammagno,Jam Magno,"Ibahin niyo ako, palaban to.\nQueen of Facts. ...",Identified,08/15,0.0,76069.0,Butuan City,Jusko si Leni walang ambag sa Maritime Industr...,Text,14/12/22 01:14,"Rational, Emotional",612.0,51.0,94.0,Stranded seafarers sa 481 Mariners Safe Dormit...,8.20042,7.070203
18,https://twitter.com/4theBoysIlocano/status/124...,@4theboysilocano,For the Boys,Kilusan ng Pagkakaisa,Anonymous,02/20,257.0,130.0,,@cnnphilippines @AC_Nicholls Yan nga yung sina...,"Text, Reply",02/04/20 12:43,Rational,5.0,31.0,0.0,"""In 2020, the first year of the pandemic, the ...",-0.115193,4.213628
136,https://twitter.com/PulbuRonn/status/149790370...,@pulburonn,Ronn,Writer | AB Journalism from PUP (the UP bootle...,Identified,09/17,144.0,3768.0,,"Oh Leni, we don't need ""fake news"" to know how...",Text,27/02/22 19:57,Rational,371.0,31.0,85.0,Former VP Robredo has been productive during h...,4.898834,4.213628
171,https://twitter.com/WinwinEklabu/status/151771...,@WinwinEklabu,Mr.Winwin_Situation,"I am proud to be part of ""The 31 Million Stron...",Identified,12/15,6456.0,7123.0,Metro Manila,Hala guys si Leni Robredo may asawa pala sa un...,"Text, picture",23/04/22 12:14,Rational,142.0,38.0,64.0,Leni Robredo's first and only husband is confi...,1.761642,5.213429
204,https://twitter.com/BbMaharlika/status/1001357...,@BbMaharlika,Maharlika\r\n\r\n,"I'm not afraid to speak my mind, even if it up...",Identified,11/15,534.0,1299.478836,United States,Sabi ni Fake VP Leni Robredo sa kanyang video:...,Text,29/5/18 15:01,Emotional,147.0,24.0,36.0,Tweet insinuates romantic link between Leni Ro...,1.83014,3.213826
292,https://twitter.com/TEAMMADAMCORA10/status/136...,@TEAMMADAMCORA10,KPOP FAN,simple kpop fan with 60 followers,Anonymous,02/23,323.0,74.0,,Pinanigan si Leni ng Supreme Court dahil 1 yea...,"Text, Reply",2021-02-22T07:42:47.000Z,Rational,13.0,42.0,0.0,The tweet contains a hashtag accusing Robredo ...,-0.005597,5.784744
358,https://twitter.com/AndresRizal68/status/14606...,@AndresRizal68,Andres Rizal,Math enthusiast. Patriot. Strategist. Sic Parv...,Anonymous,10/21,568.0,558.0,Unspecified,Bakit Hindi si Leni:\r\n1. Mahina sa kominikas...,Text,16/11/21 21:57,Rational,721.0,49.0,281.0,Claims with no evidence,9.693668,6.784545
372,https://twitter.com/Paps_Caloy/status/15173373...,@Paps_Caloy,Paps Caloy,Laban para sa Bayan 👊\r\n,Anonymous,01/18,1713.0,13600.0,"Pampanga, Philippines",Oh tapos sasabihin niyo redtag?,"Text, Image",22/04/22 10:59,Rational,738.0,70.0,267.0,Vice President Robredo already stated that she...,9.92656,9.78395
375,https://twitter.com/jmy_perez/status/151740892...,@jmy_perez,Jimmy Perez,"Been there, done that; enjoying the life of my...",Identified,11/14,4971.0,4278.0,"Pasig City, Philippines",Hoy Losers\r\nAlam ba ninyo ito??,"Tweet, Image",22/04/22 15:44,Rational,633.0,58.0,213.0,Communist Party of the Philippines founder Jos...,8.48811,8.070004
397,https://twitter.com/smninews/status/1425760974...,@smninews,SMNI News,We bring truth that matters\nWe are nation-bui...,Media,03/20,784.0,85700.0,Makati City,"VP Leni Robredo, kabilang sa mga CPP-NPA lover...","Text, Image, URL",21/08/21 18:08,Rational,93.0,26.0,44.0,VERA FILES FACT CHECK: Robredo DID NOT threate...,1.090366,3.499484



Rows with outlier in [Retweets]


Unnamed: 0,Tweet URL,Account handle,Account name,Account bio,Account type,Joined,Following,Followers,Location,Tweet,Tweet Type,Date posted,Content type,Likes,Replies,Retweets,Reasoning,Likes (z-scores),Replies (z-scores),Retweets (z-scores)
1,https://twitter.com/ItsJamMagno/status/1602713...,@itsjammagno,Jam Magno,"Ibahin niyo ako, palaban to.\nQueen of Facts. ...",Identified,08/15,0.0,76069.0,Butuan City,Jusko si Leni walang ambag sa Maritime Industr...,Text,14/12/22 01:14,"Rational, Emotional",612.0,51.0,94.0,Stranded seafarers sa 481 Mariners Safe Dormit...,8.20042,7.070203,3.451819
136,https://twitter.com/PulbuRonn/status/149790370...,@pulburonn,Ronn,Writer | AB Journalism from PUP (the UP bootle...,Identified,09/17,144.0,3768.0,,"Oh Leni, we don't need ""fake news"" to know how...",Text,27/02/22 19:57,Rational,371.0,31.0,85.0,Former VP Robredo has been productive during h...,4.898834,4.213628,3.102778
148,https://twitter.com/AlgenFaith/status/84265951...,@algenfaith,Genalie Baluya,,Identified,02/16,454.0,57.0,,The most inept Fake VP Leni Robredo in the Phi...,"Text, Image",17/03/17 16:50,Rational,163.0,7.0,182.0,Former VP Robredo has been productive during h...,2.049332,0.785737,6.86466
162,https://twitter.com/dearugreisback/status/1517...,@dearugreisback\r,Dear Ugre\r,FAN ACCOUNT #SarahGeronimo #EugeneDomingo #Ugr...,Anonymous,08/20,361.0,595.0,,Akala nila fake news ito.\n\nSi butch Robredo ...,"Text, picture",22/04/2022 9:01,Rational,192.0,4.0,102.0,"\n\nThere are no news articles, statements, or...",2.446619,0.357251,3.762077
358,https://twitter.com/AndresRizal68/status/14606...,@AndresRizal68,Andres Rizal,Math enthusiast. Patriot. Strategist. Sic Parv...,Anonymous,10/21,568.0,558.0,Unspecified,Bakit Hindi si Leni:\r\n1. Mahina sa kominikas...,Text,16/11/21 21:57,Rational,721.0,49.0,281.0,Claims with no evidence,9.693668,6.784545,10.704108
372,https://twitter.com/Paps_Caloy/status/15173373...,@Paps_Caloy,Paps Caloy,Laban para sa Bayan 👊\r\n,Anonymous,01/18,1713.0,13600.0,"Pampanga, Philippines",Oh tapos sasabihin niyo redtag?,"Text, Image",22/04/22 10:59,Rational,738.0,70.0,267.0,Vice President Robredo already stated that she...,9.92656,9.78395,10.161155
375,https://twitter.com/jmy_perez/status/151740892...,@jmy_perez,Jimmy Perez,"Been there, done that; enjoying the life of my...",Identified,11/14,4971.0,4278.0,"Pasig City, Philippines",Hoy Losers\r\nAlam ba ninyo ito??,"Tweet, Image",22/04/22 15:44,Rational,633.0,58.0,213.0,Communist Party of the Philippines founder Jos...,8.48811,8.070004,8.066912





Cool! These are rows we can keep in mind for later on.
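
For reference, here's the z-score rule from this section on a tiny made-up series (nineteen tweets with 0 likes and one with 100; these numbers are illustrative only). Note that pandas' `std()` uses the sample standard deviation (`ddof=1`) by default, so that's what the z-scores here are based on:

```python
import pandas as pd

# Made-up data: nineteen tweets with no likes and one with 100.
likes = pd.Series([0] * 19 + [100], name="Likes")

# z = (x - mean) / std, where std() defaults to ddof=1 (sample std).
z = (likes - likes.mean()) / likes.std()

# Flag entries more than 3 standard deviations away from the mean.
outliers = likes[z.abs() > 3]

print(round(float(z.iloc[19]), 4))
print(outliers)
```

With heavily skewed counts like ours (most tweets have zero engagement), the mean and standard deviation are themselves pulled up by the big tweets, so this rule is a rough filter rather than a precise test.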

## Ensuring formatting consistency

Let's make sure that for columns that are supposed to follow a specific format, the entries in these columns *do* follow that format.

First, let's verify that our account handles start with the `@` character:

In [36]:
len(df[df["Account handle"].str.startswith("@")])

398
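
A count alone doesn't tell us *which* handles fail the check. Here's a small sketch, on made-up handles, of how the failing rows could be pulled out by negating the mask. The `na=False` argument is our own addition so that a missing handle counts as a failure instead of breaking the filter:

```python
import pandas as pd

# Made-up handles: one valid, one without "@", one with a leading space,
# and one missing entirely.
handles = pd.Series(["@alice", "bob", " @carol", None], name="Account handle")

# na=False treats missing handles as failures instead of propagating NaN.
ok = handles.str.startswith("@", na=False)

# Negate the mask to see the offending entries.
bad = handles[~ok]

print(int(ok.sum()))
print(bad.index.tolist())
```

On the real DataFrame, the same idea would be `df[~df["Account handle"].str.startswith("@", na=False)]`.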

Next, let's verify that the `Joined` entries are of the form `MM/YY`. We'll use a regular expression to accomplish this:

In [37]:
def is_proper_month(month: int) -> bool:
    return 1 <= month <= 12


def is_proper_joined_date(s: str) -> bool:
    result = re.search(r"^(\d{2})/(\d{2})$", s)
    if result is None:
        return False
    month, year = map(int, result.groups())
    return is_proper_month(month)


def row_has_valid_joined_date(row) -> bool:
    return is_proper_joined_date(str(row["Joined"]))


len(df[df.apply(row_has_valid_joined_date, axis=1)])

350

Remember when we said that we missed the `Joined` entry for one row? Our code has noticed that and more: only 350 of the 399 rows pass the check, so 49 `Joined` entries aren't in `MM/YY` form.

Finally, let's verify that the `Date posted` entries are of the form `DD/MM/YY hh:mm`. We'll use another regular expression to accomplish this:

In [38]:
def is_proper_hour(hour: int) -> bool:
    return 0 <= hour < 24


def is_proper_minute(minute: int) -> bool:
    return 0 <= minute < 60


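def is_leap_year(year: int) -> bool:
    # `year` is the two-digit year captured by the regex (e.g., 22 for 2022).
    # Applying the Gregorian rule to it directly is still correct for 2000-2099.
    return year % 4 == 0 and (year % 400 == 0 or year % 100 != 0)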


def is_proper_date(day: int, month: int, year: int) -> bool:
    if month == 2:
        return 1 <= day <= (29 if is_leap_year(year) else 28)
    elif month in [1, 3, 5, 7, 8, 10, 12]:
        return 1 <= day <= 31
    else:
        return 1 <= day <= 30


def is_proper_date_posted(s: str) -> bool:
    result = re.search(r"^(\d{2})/(\d{2})/(\d{2}) (\d{2}):(\d{2})$", s)
    if result is None:
        return False
    day, month, year, hour, minute = map(int, result.groups())
    return (
        is_proper_month(int(month))
        and is_proper_date(day, month, year)
        and is_proper_hour(hour)
        and is_proper_minute(minute)
    )


def has_valid_date_posted(row) -> bool:
    return is_proper_date_posted(str(row["Date posted"]))


len(df[df.apply(has_valid_date_posted, axis=1)])

255

Hm? 255 out of 399 doesn't look right. Let's find out what the offending rows are:

In [39]:
def has_invalid_date_posted(row) -> bool:
    return not has_valid_date_posted(row)


df[df.apply(has_invalid_date_posted, axis=1)]

Unnamed: 0,Tweet URL,Account handle,Account name,Account bio,Account type,Joined,Following,Followers,Location,Tweet,Tweet Type,Date posted,Content type,Likes,Replies,Retweets,Reasoning,Likes (z-scores),Replies (z-scores),Retweets (z-scores)
44,https://twitter.com/EggtlogNaMaalat/status/154...,@eggtlognamaalat,EggtlogNaMaalat,Pink on the outside. Yellow on the inside.,Anonymous,03/20,819.0,116.0,Egg Yolk of Itlog na Maalat,Myghad Leni Robredo! Napaka-KURAKOT mo talaga ...,"Text, Image",20/07/22 0:44,Rational,0.0,0.0,0.0,There is no evidence to suggest that Leni is c...,-0.183690,-0.214064,-0.193717
150,https://twitter.com/cierloX6/status/1598004813...,@cierloX6,Cierlo,"natsugi acct ko, makaleninggaw si pipito, pest...",Anonymous,02/22,564.0,559.0,,meron sex video ang anak ni leni robredo sabi...,Text,01/12/22 1:23,Emotional,16.0,2.0,1.0,The video scandal going around the internet cl...,0.035502,0.071593,-0.154935
153,https://twitter.com/Bongtothemax/status/159778...,@Bongtothemax,Bong,"I""Am your lifeguard...",Anonymous,12/20,962.0,1011.0,"Cebu City, Central Visayas",It was you Leni Robredo who spread disimformat...,Reply (Quote tweet),30/11/2022 10:49,Emotional,3.0,1.0,2.0,The video scandal going around the internet cl...,-0.142592,-0.071235,-0.116152
154,https://twitter.com/tapatnapilipino/status/103...,@tapatnapilipino\n,TapatNaPilipino(TNP)\r,Ang naninira sa gobyerno ay kasiraan sa Lahing...,Anonymous,04/11,58.0,316.0,Sa Puso ng Pilipinas,Anong failure? Kayo lang ang nagsasabi nyan at...,Reply (Quote tweet),02/09/18 5:22,Emotional,23.0,0.0,6.0,"There are no news articles, statements, or off...",0.131399,-0.214064,0.038977
155,https://twitter.com/Jaybee_Elay/status/1032826...,@Jaybee_Elay\r,BBM-SaraDU30 Nation-Building 🇵🇭👊❤💚🇵🇭\r,"💚Build-Build-Build. ❤War on Drugs, Terrorism, ...",Anonymous,06/18,2807.0,5545.0,Philippines,I hate Drugs! Please watch and RT.\r\n\r\nButc...,Reply (Quote tweet),24/08/18 11:06,Rational,13.0,2.0,9.0,"There are no news articles, statements, or off...",-0.005597,0.071593,0.155324
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
371,https://twitter.com/onofight/status/1502794215...,@onofight,𝕕_ŕεαl_ΩᎶ,ℙ𝕖𝕣𝕤𝕠𝕟𝕒𝕝 𝕊𝕪𝕤𝕥𝕖𝕞 ℂ𝕠𝕞𝕡𝕖𝕥𝕚𝕥𝕚𝕧𝕖 𝔸𝕟𝕒𝕝𝕪𝕤𝕥\r\nReviews...,Identified,11/21,1869.0,1818.0,Unspecified,NO to Leni - NPA Collaboration\r\n#notocppnpan...,"Text, Image (through Quote Tweet)",13/03/22 7:50,Rational,5.0,0.0,1.0,"Claims with no evidence, lacks context and unp...",-0.115193,-0.214064,-0.154935
376,https://twitter.com/Jackrayperc/status/1517732...,@Jackrayperc,Jack,Corporate Professional,Anonymous,12/20,69.0,4.0,"NCR, Philippines",Lol NPA tlga mga dilwan at pinklawan... C cory...,"Tweet, Reply (comment)",23/04/2022 13:11,Rational,0.0,0.0,0.0,Claims with no evidence,-0.183690,-0.214064,-0.193717
378,https://twitter.com/callmealizia/status/161550...,@callmealizia,AkoSiAlizia ✌👊💪,"mother of two grandmother to three, im no fan ...",Identified,12/19,962.0,992.0,"Dasmarinas, Calabarzon",Bkit nga ba mas pinili ko si BBM over ka leni ...,Text,01/18/23 8:30,Rational,19.0,2.0,0.0,VERA FILES FACT CHECK: Robredo DID NOT threate...,0.076601,0.071593,-0.193717
383,https://twitter.com/Snowy16C/status/1524120448...,@Snowy16C,Let Leni Leave,shut up,Anonymous,11/21,29.0,3.0,Unspecified,"Eh bat puro martial law ? NPA Si Leni, NPA Sum...","Text, Reply",11/05/22 4:13,Rational,0.0,1.0,0.0,VERA FILES FACT CHECK: Robredo DID NOT threate...,-0.183690,-0.071235,-0.193717


Ah, in many of these we missed a `0` for the hour (e.g., `0:44`), and some use a four-digit year instead (e.g., `30/11/2022 10:49`). Oh well...
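
As an alternative to the hand-rolled day/month/leap-year checks, the standard library's `datetime.strptime` can do the calendar validation for us. The helper name `parse_date_posted` below is hypothetical, and this is only a sketch. It's also a bit more lenient than our regular expression: `strptime` accepts hours without a leading zero.

```python
from datetime import datetime
from typing import Optional


def parse_date_posted(s: str) -> Optional[datetime]:
    """Return a datetime if `s` parses as DD/MM/YY hh:mm, else None."""
    try:
        return datetime.strptime(s, "%d/%m/%y %H:%M")
    except ValueError:
        # Raised for a wrong layout and for impossible dates (e.g., 29 Feb 2021).
        return None


samples = [
    "14/12/22 01:14",    # matches the expected format
    "20/07/22 0:44",     # hour without a leading zero
    "30/11/2022 10:49",  # four-digit year
    "01/18/23 8:30",     # month and day swapped
    "29/02/21 10:00",    # 2021 is not a leap year
]

for sample in samples:
    print(sample, "->", parse_date_posted(sample))
```

A parser like this could also be a first step toward normalizing the inconsistent entries, since it returns an actual `datetime` instead of just `True` or `False`.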

## Categorical data encoding

There are three columns that look suitable for categorical data encoding:

- `Account type`
- `Tweet Type`
- `Content type`

An `Account type` may be `Identified`, `Anonymous`, or `Media`. Let's encode these using three new columns: each new column indicates whether or not an account is of some account type (the values in that column are 0 or 1).

As a convention, let's name the new columns `Account is {Account Type}` (e.g., `Account is Anonymous`):

In [40]:
account_types = ["Identified", "Anonymous", "Media"]


def h_decorator(account_type: str):
    def h(row) -> int:
        return 1 if account_type == str(row["Account type"]).strip() else 0

    return h


for account_type in account_types:
    df[f"Account is {account_type}"] = df.apply(h_decorator(account_type), axis=1)

df[[f"Account is {account_type}" for account_type in account_types]]

Unnamed: 0,Account is Identified,Account is Anonymous,Account is Media
0,0,1,0
1,1,0,0
2,0,1,0
3,0,1,0
4,0,1,0
...,...,...,...
394,1,0,0
395,0,1,0
396,0,1,0
397,0,0,1
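
The same single-label encoding can also be done without a row-wise `apply`, by comparing the whole column at once. This is just a sketch on made-up rows (note the stray spaces, which `str.strip()` takes care of):

```python
import pandas as pd

# Made-up rows; two of them carry stray whitespace.
toy = pd.DataFrame({"Account type": ["Identified", " Anonymous", "Media "]})

account_types = ["Identified", "Anonymous", "Media"]
cleaned = toy["Account type"].str.strip()

for account_type in account_types:
    # eq() compares every entry at once; astype(int) maps True/False to 1/0.
    toy[f"Account is {account_type}"] = cleaned.eq(account_type).astype(int)

print(toy)
```

The column-wise version tends to be faster on large frames, though for a few hundred rows the difference doesn't matter.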


A `Tweet Type` may be a combination of several things:

- `Text`
- `Image`
- `Video`
- `URL`
- `Reply`
- `Quote Tweet`

To encode this, let's introduce six new columns: each new column indicates whether or not a tweet is of some tweet type (again, the values in that column are 0 or 1). Since we strip spaces before matching, `Quote Tweet` is spelled `QuoteTweet` in the code.

As a convention, let's name the new columns `Tweet is {tweet_type}` (e.g., `Tweet is Text`):

In [41]:
tweet_types = ["Text", "Image", "Video", "URL", "Reply", "QuoteTweet"]


def f_decorator(tweet_type: str):
    def f(row) -> int:
        return 1 if tweet_type in row["Tweet Type"].replace(" ", "").split(",") else 0

    return f


for tweet_type in tweet_types:
    df[f"Tweet is {tweet_type}"] = df.apply(f_decorator(tweet_type), axis=1)

df[[f"Tweet is {tweet_type}" for tweet_type in tweet_types]]

Unnamed: 0,Tweet is Text,Tweet is Image,Tweet is Video,Tweet is URL,Tweet is Reply,Tweet is QuoteTweet
0,1,0,0,0,1,0
1,1,0,0,0,0,0
2,1,0,0,0,1,0
3,1,0,0,0,1,0
4,1,0,0,0,1,0
...,...,...,...,...,...,...
394,1,0,0,0,1,0
395,1,0,0,0,1,0
396,1,1,0,0,0,0
397,1,1,0,1,0,0


Finally, a `Content type` may also be a combination of several things:

- `Rational`
- `Emotional`
- `Transactional`

Once again, we introduce three new columns. Let's name them `Content is {content_type}` (e.g., `Content is Emotional`):

In [42]:
content_types = ["Rational", "Emotional", "Transactional"]


def g_decorator(content_type: str):
    def g(row) -> int:
        return (
            1 if content_type in row["Content type"].replace(" ", "").split(",") else 0
        )

    return g


for content_type in content_types:
    df[f"Content is {content_type}"] = df.apply(g_decorator(content_type), axis=1)

df[[f"Content is {content_type}" for content_type in content_types]]

Unnamed: 0,Content is Rational,Content is Emotional,Content is Transactional
0,1,0,0
1,1,1,0
2,1,1,0
3,1,0,0
4,1,1,0
...,...,...,...
394,1,0,0
395,1,0,0
396,0,1,0
397,1,0,0
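
For multi-label columns like `Tweet Type` and `Content type`, pandas also offers `Series.str.get_dummies`, which splits on a separator and builds the 0/1 columns in one go. Here's a sketch on made-up entries; the `reindex` step is there so that a label that never appears (here, `Transactional`) still gets a column:

```python
import pandas as pd

content_types = ["Rational", "Emotional", "Transactional"]

# Made-up entries in the same comma-separated style as our column.
content = pd.Series(["Rational", "Rational, Emotional", "Emotional"])

encoded = (
    content.str.replace(" ", "", regex=False)  # "Rational, Emotional" -> "Rational,Emotional"
    .str.get_dummies(sep=",")                  # one 0/1 column per label found
    .reindex(columns=content_types, fill_value=0)  # fixed order, add absent labels
    .add_prefix("Content is ")
)

print(encoded)
```

Like our closure-based version, this only matches labels spelled exactly as listed, so entries such as `picture` or `Reply (Quote tweet)` would need to be normalized first.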


Great! With all that covered, it's time to do some **natural language processing**.

## Tokenization + lower casing

For analysis, let's reflect our language processing work in a new column called `Tweet (processed)`. This will allow us to see the progress we've made in processing our tweets.

Let's tokenize our tweets first. In our case, we split the tweets on spaces and a few punctuation marks (`,`, `!`, `.`, `?`, `;`, and `:`):

In [43]:
df["Tweet (processed)"] = df["Tweet"]

df["Tweet (processed)"] = df["Tweet (processed)"].transform(
    lambda tweet: re.split(r" |,|!|\.|\?|;|:", tweet)
)

df[["Tweet", "Tweet (processed)"]]

Unnamed: 0,Tweet,Tweet (processed)
0,Wala alam si leni sa foreign policy. Di nya ng...,"[Wala, alam, si, leni, sa, foreign, policy, , ..."
1,Jusko si Leni walang ambag sa Maritime Industr...,"[Jusko, si, Leni, walang, ambag, sa, Maritime,..."
2,"@setsu0196 @indaysara Inggit lang mga yun, wal...","[@setsu0196, @indaysara, Inggit, lang, mga, yu..."
3,@BosyoJ Wala na kasing ibang topic na alam si ...,"[@BosyoJ, Wala, na, kasing, ibang, topic, na, ..."
4,"@VancouverEye Parang Vovo, matagal ng walang g...","[@VancouverEye, Parang, Vovo, , matagal, ng, w..."
...,...,...
394,"NPA po ang kalaban, na pinoprotektahan ni Leni...","[NPA, po, ang, kalaban, , na, pinoprotektahan,..."
395,we are only given the choices of:\nBBM - son o...,"[we, are, only, given, the, choices, of, \nBBM..."
396,Desperado na si Leni..npa at kumunista Ang gag...,"[Desperado, na, si, Leni, , npa, at, kumunista..."
397,"VP Leni Robredo, kabilang sa mga CPP-NPA lover...","[VP, Leni, Robredo, , kabilang, sa, mga, CPP-N..."


Afterwards, let's turn each token to lowercase (if possible):

In [44]:
df["Tweet (processed)"] = df["Tweet (processed)"].transform(
    lambda tokens: list(map(lambda token: token.lower(), tokens))
)

df[["Tweet", "Tweet (processed)"]]

Unnamed: 0,Tweet,Tweet (processed)
0,Wala alam si leni sa foreign policy. Di nya ng...,"[wala, alam, si, leni, sa, foreign, policy, , ..."
1,Jusko si Leni walang ambag sa Maritime Industr...,"[jusko, si, leni, walang, ambag, sa, maritime,..."
2,"@setsu0196 @indaysara Inggit lang mga yun, wal...","[@setsu0196, @indaysara, inggit, lang, mga, yu..."
3,@BosyoJ Wala na kasing ibang topic na alam si ...,"[@bosyoj, wala, na, kasing, ibang, topic, na, ..."
4,"@VancouverEye Parang Vovo, matagal ng walang g...","[@vancouvereye, parang, vovo, , matagal, ng, w..."
...,...,...
394,"NPA po ang kalaban, na pinoprotektahan ni Leni...","[npa, po, ang, kalaban, , na, pinoprotektahan,..."
395,we are only given the choices of:\nBBM - son o...,"[we, are, only, given, the, choices, of, \nbbm..."
396,Desperado na si Leni..npa at kumunista Ang gag...,"[desperado, na, si, leni, , npa, at, kumunista..."
397,"VP Leni Robredo, kabilang sa mga CPP-NPA lover...","[vp, leni, robredo, , kabilang, sa, mga, cpp-n..."

Next, let's keep only the tokens made up entirely of ASCII letters. This drops empty tokens, mentions, and anything containing digits, hyphens, or other symbols:


In [45]:
df["Tweet (processed)"] = df["Tweet (processed)"].transform(
    lambda tokens: list(filter(lambda token: re.match(r"^[a-zA-Z]+$", token), tokens))
)

df[["Tweet", "Tweet (processed)"]]

Unnamed: 0,Tweet,Tweet (processed)
0,Wala alam si leni sa foreign policy. Di nya ng...,"[wala, alam, si, leni, sa, foreign, policy, di..."
1,Jusko si Leni walang ambag sa Maritime Industr...,"[jusko, si, leni, walang, ambag, sa, maritime,..."
2,"@setsu0196 @indaysara Inggit lang mga yun, wal...","[inggit, lang, mga, yun, walang, ambag, kasi, ..."
3,@BosyoJ Wala na kasing ibang topic na alam si ...,"[wala, na, kasing, ibang, topic, na, alam, si,..."
4,"@VancouverEye Parang Vovo, matagal ng walang g...","[parang, vovo, matagal, ng, walang, galaw, ang..."
...,...,...
394,"NPA po ang kalaban, na pinoprotektahan ni Leni...","[npa, po, ang, kalaban, na, pinoprotektahan, n..."
395,we are only given the choices of:\nBBM - son o...,"[we, are, only, given, the, choices, of, son, ..."
396,Desperado na si Leni..npa at kumunista Ang gag...,"[desperado, na, si, leni, npa, at, kumunista, ..."
397,"VP Leni Robredo, kabilang sa mga CPP-NPA lover...","[vp, leni, robredo, kabilang, sa, mga, lovers,..."
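
The three steps in this section (split, lowercase, keep alphabetic tokens) can be folded into one small function. The sketch below is a variation, not a copy of the cells above: it splits on *any* whitespace (`\s`), not just spaces, so a token glued to a newline (like `\nBBM`) survives instead of being filtered out. The function name `tokenize` and the sample tweet are made up for illustration:

```python
import re


def tokenize(tweet: str) -> list[str]:
    # Split on runs of whitespace or the punctuation marks used in this section.
    tokens = re.split(r"[\s,!.?;:]+", tweet)
    # Lowercase, and keep only tokens made up entirely of ASCII letters.
    return [t.lower() for t in tokens if re.fullmatch(r"[a-zA-Z]+", t)]


print(tokenize("@BosyoJ Wala na, alam si Leni..npa\nBBM - son"))
```

Splitting on runs (`+`) also avoids producing the empty tokens that show up when two separators are adjacent, such as `..` or `, `.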


## Stop words removal + other filters

Let's remove stop words from our tweets. For filtering out English stop words, we'll use the `nltk` package. Meanwhile, for filtering out Tagalog stop words, we'll use our own curated set:

In [46]:
nltk.download("stopwords")
from nltk.corpus import stopwords

stopwords_set_english = set(stopwords.words("english"))
stopwords_set_tagalog = set(
    [
        "ah",
        "akin",
        "aking",
        "ako",
        "alin",
        "alinsunod",
        "amin",
        "ang",
        "ano",
        "apat",
        "at",
        "ay",
        "ayon",
        "ayun",
        "ba",
        "bagaman",
        "bagamat",
        "bakit",
        "basta",
        "dahil",
        "dalawa",
        "datapwat",
        "daw",
        "di",
        "din",
        "dito",
        "doon",
        "eh",
        "ganito",
        "gayunpaman",
        "ha",
        "hala",
        "hanggang",
        "haydiba",
        "hinggil",
        "https",
        "ikaw",
        "isa",
        "ito",
        "iyan",
        "iyon",
        "jusko",
        "kabila",
        "kami",
        "kanila",
        "kasi",
        "ka",
        "kay",
        "kaya",
        "kayo",
        "kaysa",
        "kina",
        "ko",
        "kung",
        "kuwan",
        "labag",
        "lang",
        "mag",
        "may",
        "mga",
        "mo",
        "mong",
        "mula",
        "na",
        "naku",
        "naman",
        "nang",
        "ng",
        "nga",
        "ngek",
        "ngunit",
        "ni",
        "nina",
        "niya",
        "niyo",
        "noong",
        "nung",
        "nya",
        "nyo",
        "o",
        "opo",
        "pa",
        "pag",
        "pagkatapos",
        "pangalawa",
        "para",
        "parang",
        "pero",
        "po",
        "raw",
        "rin",
        "sa",
        "sapagkat",
        "si",
        "sila",
        "sina",
        "siyatalaga",
        "sumunod",
        "sya",
        "tatlo",
        "tayo",
        "tungo",
        "una",
        "yan",
        "yun",
        "yung",
    ]
)
stopwords_set = stopwords_set_english.union(stopwords_set_tagalog)

df["Tweet (processed)"] = df["Tweet (processed)"].transform(
    lambda tokens: list(filter(lambda token: token not in stopwords_set, tokens))
)

df[["Tweet", "Tweet (processed)"]]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,Tweet,Tweet (processed)
0,Wala alam si leni sa foreign policy. Di nya ng...,"[wala, alam, leni, foreign, policy, alam, pres..."
1,Jusko si Leni walang ambag sa Maritime Industr...,"[leni, walang, ambag, maritime, industry, year..."
2,"@setsu0196 @indaysara Inggit lang mga yun, wal...","[inggit, walang, ambag, leni, moro, chaka]"
3,@BosyoJ Wala na kasing ibang topic na alam si ...,"[wala, kasing, ibang, topic, alam, leni, walan..."
4,"@VancouverEye Parang Vovo, matagal ng walang g...","[vovo, matagal, walang, galaw, unilever, ngayo..."
...,...,...
394,"NPA po ang kalaban, na pinoprotektahan ni Leni...","[npa, kalaban, pinoprotektahan, leni, npa, pum..."
395,we are only given the choices of:\nBBM - son o...,"[given, choices, son, npa, casted, votes, some..."
396,Desperado na si Leni..npa at kumunista Ang gag...,"[desperado, leni, npa, kumunista, gagawa, gulo..."
397,"VP Leni Robredo, kabilang sa mga CPP-NPA lover...","[vp, leni, robredo, kabilang, lovers, pastor, ..."
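
Here's the stop word filter in isolation, as a minimal sketch. The `STOPWORDS` set below is a tiny illustrative subset (not our full English + Tagalog set), and the list comprehension is equivalent to the `filter`/`lambda` combination used above:

```python
# Tiny illustrative subset; the real set combines nltk's English stop words
# with the curated Tagalog list.
STOPWORDS = frozenset({"si", "sa", "na", "ang", "the", "of"})


def remove_stopwords(tokens: list[str]) -> list[str]:
    # Set membership checks are fast, so this stays cheap even for big sets.
    return [token for token in tokens if token not in STOPWORDS]


print(remove_stopwords(["wala", "alam", "si", "leni", "sa", "foreign", "policy"]))
```

Note that the filter relies on the tokens already being lowercase, which is why the lowercasing step comes first.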


Emojis might be painful to work with (for one, they're not ASCII characters), so let's make sure none are left in our tokens. We'll accomplish this by replacing all emojis with empty strings, with the help of the `emoji` package!

In [47]:
df["Tweet (processed)"] = df["Tweet (processed)"].transform(
    lambda tokens: list(map(lambda token: emoji.replace_emoji(token, ""), tokens))
)

display(df[["Tweet", "Tweet (processed)"]])

Unnamed: 0,Tweet,Tweet (processed)
0,Wala alam si leni sa foreign policy. Di nya ng...,"[wala, alam, leni, foreign, policy, alam, pres..."
1,Jusko si Leni walang ambag sa Maritime Industr...,"[leni, walang, ambag, maritime, industry, year..."
2,"@setsu0196 @indaysara Inggit lang mga yun, wal...","[inggit, walang, ambag, leni, moro, chaka]"
3,@BosyoJ Wala na kasing ibang topic na alam si ...,"[wala, kasing, ibang, topic, alam, leni, walan..."
4,"@VancouverEye Parang Vovo, matagal ng walang g...","[vovo, matagal, walang, galaw, unilever, ngayo..."
...,...,...
394,"NPA po ang kalaban, na pinoprotektahan ni Leni...","[npa, kalaban, pinoprotektahan, leni, npa, pum..."
395,we are only given the choices of:\nBBM - son o...,"[given, choices, son, npa, casted, votes, some..."
396,Desperado na si Leni..npa at kumunista Ang gag...,"[desperado, leni, npa, kumunista, gagawa, gulo..."
397,"VP Leni Robredo, kabilang sa mga CPP-NPA lover...","[vp, leni, robredo, kabilang, lovers, pastor, ..."


Next, let's filter out empty tokens:

In [48]:
df["Tweet (processed)"] = df["Tweet (processed)"].transform(
    lambda tokens: list(filter(lambda token: len(token) > 0, tokens))
)

df[["Tweet", "Tweet (processed)"]]

Unnamed: 0,Tweet,Tweet (processed)
0,Wala alam si leni sa foreign policy. Di nya ng...,"[wala, alam, leni, foreign, policy, alam, pres..."
1,Jusko si Leni walang ambag sa Maritime Industr...,"[leni, walang, ambag, maritime, industry, year..."
2,"@setsu0196 @indaysara Inggit lang mga yun, wal...","[inggit, walang, ambag, leni, moro, chaka]"
3,@BosyoJ Wala na kasing ibang topic na alam si ...,"[wala, kasing, ibang, topic, alam, leni, walan..."
4,"@VancouverEye Parang Vovo, matagal ng walang g...","[vovo, matagal, walang, galaw, unilever, ngayo..."
...,...,...
394,"NPA po ang kalaban, na pinoprotektahan ni Leni...","[npa, kalaban, pinoprotektahan, leni, npa, pum..."
395,we are only given the choices of:\nBBM - son o...,"[given, choices, son, npa, casted, votes, some..."
396,Desperado na si Leni..npa at kumunista Ang gag...,"[desperado, leni, npa, kumunista, gagawa, gulo..."
397,"VP Leni Robredo, kabilang sa mga CPP-NPA lover...","[vp, leni, robredo, kabilang, lovers, pastor, ..."


Finally, let's remove the mentions (tokens that start with `@`):

In [49]:
df["Tweet (processed)"] = df["Tweet (processed)"].transform(
    lambda tokens: list(filter(lambda token: not token.startswith("@"), tokens))
)

df[["Tweet", "Tweet (processed)"]]

Unnamed: 0,Tweet,Tweet (processed)
0,Wala alam si leni sa foreign policy. Di nya ng...,"[wala, alam, leni, foreign, policy, alam, pres..."
1,Jusko si Leni walang ambag sa Maritime Industr...,"[leni, walang, ambag, maritime, industry, year..."
2,"@setsu0196 @indaysara Inggit lang mga yun, wal...","[inggit, walang, ambag, leni, moro, chaka]"
3,@BosyoJ Wala na kasing ibang topic na alam si ...,"[wala, kasing, ibang, topic, alam, leni, walan..."
4,"@VancouverEye Parang Vovo, matagal ng walang g...","[vovo, matagal, walang, galaw, unilever, ngayo..."
...,...,...
394,"NPA po ang kalaban, na pinoprotektahan ni Leni...","[npa, kalaban, pinoprotektahan, leni, npa, pum..."
395,we are only given the choices of:\nBBM - son o...,"[given, choices, son, npa, casted, votes, some..."
396,Desperado na si Leni..npa at kumunista Ang gag...,"[desperado, leni, npa, kumunista, gagawa, gulo..."
397,"VP Leni Robredo, kabilang sa mga CPP-NPA lover...","[vp, leni, robredo, kabilang, lovers, pastor, ..."


In [50]:
df.to_csv("processed_data.csv")

Cool! Our tweets look easier to work with now.
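
The stopword, empty-token, and mention filters above each walk the token list once. As a rough sketch, they could also be folded into a single pass. The `clean_tokens` helper and the sample data below are hypothetical (they're not part of our pipeline), and emoji removal is left out to keep the example dependency-free:

```python
def clean_tokens(tokens, stopwords):
    """Drop empty strings, stopwords, and @mentions in a single pass."""
    return [
        token
        for token in tokens
        if token                         # keep non-empty tokens only
        and token not in stopwords       # drop stopwords
        and not token.startswith("@")    # drop mentions
    ]


# Hypothetical sample data, for illustration only.
sample_stopwords = {"si", "sa", "na"}
sample_tokens = ["@bosyoj", "wala", "na", "", "alam", "si", "leni"]

print(clean_tokens(sample_tokens, sample_stopwords))
```

Using `startswith` instead of `token[0]` means the mention check doesn't rely on empty tokens having been removed first.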

# Visualization

Here comes the fun part! Let's look at all kinds of relationships in our data using different kinds of plots.

For this section, we'll be using the `plotly` package: it gives us *interactive* plots to play with, which makes the exploration process more active.

## Histogram: token counts

For our first plot, we'll see how long our token lists ended up as a result of our natural language processing.

First, let's add a new column for the number of tokens per tweet:

In [51]:
df["Token count"] = df["Tweet (processed)"].map(len)

df[["Tweet (processed)", "Token count"]]

Unnamed: 0,Tweet (processed),Token count
0,"[wala, alam, leni, foreign, policy, alam, pres...",10
1,"[leni, walang, ambag, maritime, industry, year...",18
2,"[inggit, walang, ambag, leni, moro, chaka]",6
3,"[wala, kasing, ibang, topic, alam, leni, walan...",28
4,"[vovo, matagal, walang, galaw, unilever, ngayo...",25
...,...,...
394,"[npa, kalaban, pinoprotektahan, leni, npa, pum...",8
395,"[given, choices, son, npa, casted, votes, some...",9
396,"[desperado, leni, npa, kumunista, gagawa, gulo...",13
397,"[vp, leni, robredo, kabilang, lovers, pastor, ...",11


Now let's chuck those token counts into a `plotly` histogram:

In [59]:
fig = px.histogram(
    df, x="Token count", nbins=4, text_auto=True, title="Distribution of token counts"
)
fig.update_layout(
    font_family="monospace",
    title_font_family="monospace",
)
fig.show()

We see that there are fewer tweets with larger token counts. Here's another histogram, where each token count gets its own bin:

In [53]:
fig = px.histogram(
    df, x="Token count", nbins=40, text_auto=True, title="Distribution of token counts"
)
fig.update_layout(
    font_family="monospace",
    title_font_family="monospace",
    yaxis_title="Number of tweets",
)
fig.show()

Now it looks like a city background!
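
To make it concrete what a one-bin-per-value histogram is counting, here's a minimal sketch in plain Python. The `token_counts` list is made up for illustration and isn't taken from our dataset:

```python
from collections import Counter

# Hypothetical token counts, one per tweet (not taken from our dataset).
token_counts = [10, 18, 6, 28, 25, 8, 9, 13, 11, 10, 6, 6]

# Map each distinct token count to the number of tweets that have it.
tweets_per_count = Counter(token_counts)

# Print a tiny text histogram, smallest token count first.
for count in sorted(tweets_per_count):
    print(f"{count:>2} tokens: {'#' * tweets_per_count[count]}")
```

Each bar's height is just the number of tweets that share that token count.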

## Heat map: content type against engagements

Let's look at the correlation matrix for the engagement counts (`Likes`, `Replies`, `Retweets`) and some of the content types (`Rational`, `Emotional`).

First, let's make a mini DataFrame that contains only these columns and get its correlation matrix:

In [54]:
mini_df = df[
    ["Likes", "Replies", "Retweets", "Content is Rational", "Content is Emotional"]
]

mini_corr = mini_df.corr(numeric_only=True).round(2)

Great! Let's turn this into a heatmap using `plotly`'s `imshow` function:

In [55]:
fig = px.imshow(mini_corr, text_auto=True)
fig.update_layout(
    title="Correlation matrix of content types and engagement counts",
    font_family="monospace",
    title_font_family="monospace",
)
fig.show()

Interesting: correlations seem to be stronger *within* content types and engagement counts rather than *between* them...
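
For reference, `corr` computes the Pearson correlation coefficient by default. Here's a small sketch of that formula in plain Python; the `pearson` helper and the engagement numbers are hypothetical and only meant to show what a cell in the matrix means:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical engagement counts for five tweets.
likes = [1, 2, 3, 4, 5]
retweets = [2, 4, 6, 8, 10]  # rises in lockstep with likes
replies = [5, 4, 3, 2, 1]    # falls as likes rise

print(round(pearson(likes, retweets), 2))
print(round(pearson(likes, replies), 2))
```

Values near 1 or -1 mean two columns move together (or in opposite directions), while values near 0 mean there's little linear relationship.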

## Violin plot: tweet type and likes

How are like counts distributed across tweet types? Let's find out! To accomplish this, we'll use a **violin plot**; it's like a box plot, but there's a kernel density plot (read: distribution) surrounding it.

First, since a row in our original DataFrame can have multiple tweet types, let's make an equivalent DataFrame that has one row per tweet type per tweet:

In [56]:
rows = []

for _, row in df.iterrows():
    # A tweet can list several comma-separated types; emit one row per type.
    tweet_types = row["Tweet Type"].replace(" ", "").split(",")
    for tweet_type in tweet_types:
        rows.append({"Likes": row["Likes"], "Tweet Type": tweet_type})

# Build the DataFrame once instead of concatenating inside the loop.
new_df = pd.DataFrame(rows, columns=["Likes", "Tweet Type"])

Now let's turn it into a violin plot!

In [57]:
fig = px.violin(
    new_df,
    y="Likes",
    x="Tweet Type",
    title="Distribution of like counts per tweet type",
)
fig.update_layout(font_family="monospace", title_font_family="monospace")
fig.show()

## Line graph: tweet counts over the years

How many tweets do we have for each year from 2016 to 2022? Let's use a line graph to find out!

First, we need to group our tweets by year:

In [58]:
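years = list(range(2016, 2023))
counts = [0] * len(years)

for date_posted in df["Date posted"]:
    # The year is the third field when splitting on "/", " ", or ":".
    fields = re.split(r"/| |:", str(date_posted))
    if len(fields) < 3 or not fields[2].isdigit():
        continue  # skip dates that don't match the expected format
    year_posted = int(fields[2])
    if year_posted < 100:
        year_posted += 2000  # expand two-digit years, e.g. 22 -> 2022
    # Only count years we have a slot for, so we never index past `counts`.
    if years[0] <= year_posted <= years[-1]:
        counts[year_posted - years[0]] += 1

fig = px.line(x=years, y=counts, title="Number of tweets per year", text=counts)
fig.update_layout(
    xaxis_title="Year",
    yaxis_title="Number of tweets",
    font_family="monospace",
    title_font_family="monospace",
)
fig.update_traces(textposition="top left")
fig.show()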

This is actually not too surprising: our scraper for data collection was originally programmed to yoink recent tweets, which explains the spike for the later years.
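
Tallying by year is less fragile with a `Counter` than with a fixed-length list, since an unexpected year just gets its own key. Here's a minimal sketch; the `year_of` helper and the sample date strings are hypothetical, and it assumes the year is the third field of the date:

```python
import re
from collections import Counter


def year_of(date_posted):
    """Return the four-digit year of a date string, or None if unreadable.

    Assumes the year is the third field when splitting on "/", " ", or ":".
    """
    fields = re.split(r"/| |:", date_posted)
    if len(fields) < 3 or not fields[2].isdigit():
        return None
    year = int(fields[2])
    return year + 2000 if year < 100 else year  # expand two-digit years


# Hypothetical date strings, for illustration only.
dates = ["05/12/22 10:30", "17/03/21 08:05", "01/01/2022 00:00", "n/a"]

# Count tweets per year, ignoring dates that could not be parsed.
tweets_per_year = Counter(
    year for year in map(year_of, dates) if year is not None
)
print(sorted(tweets_per_year.items()))
```

Unparseable dates are skipped here rather than raising, which may or may not be what you want; logging them instead would make data problems easier to spot.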

## 3D: engagement z-scores

Back when we were preprocessing, we added columns for the z-scores of likes, replies, and retweets. Given that triples of these values are usually in the range $[-3, 3]^3$, a 3D scatter plot of these triples may be insightful:

In [None]:
fig = px.scatter_3d(
    df,
    x="Likes (z-scores)",
    y="Replies (z-scores)",
    z="Retweets (z-scores)",
    title="Likes z-score vs. Replies z-score vs. Retweets z-score",
)
fig.update_layout(
    font_family="monospace",
    title_font_family="monospace",
)
fig.show()

Here's another 3D scatter plot, which zooms in on the non-outliers by limiting each axis to $[-3, 3]$:

In [None]:
fig = px.scatter_3d(
    df,
    x="Likes (z-scores)",
    y="Replies (z-scores)",
    z="Retweets (z-scores)",
    title="Likes z-score vs. Replies z-score vs. Retweets z-score",
)
fig.update_layout(
    scene=dict(
        xaxis=dict(nticks=4, range=[-3, 3]),
        yaxis=dict(nticks=4, range=[-3, 3]),
        zaxis=dict(nticks=4, range=[-3, 3]),
    ),
    font_family="monospace",
    title_font_family="monospace",
)
fig.show()

The triples still look close together!
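
Limiting the axes hides outliers from view but keeps them in the data. As a rough sketch of the alternative, here's how one might compute z-scores and filter on them in plain Python. The `z_scores` helper, the sample like counts, and the use of the population standard deviation are all assumptions for illustration; they aren't necessarily how our z-score columns were computed:

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values: subtract the mean, divide by the std. deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    return [(value - mu) / sigma for value in values]


# Hypothetical like counts; the last tweet is far more popular than the rest.
likes = [0, 1, 2, 1, 0, 3, 2, 1, 0, 1] * 2 + [500]
likes_z = z_scores(likes)

# Keep only the tweets whose z-score lies in [-3, 3].
non_outliers = [like for like, z in zip(likes, likes_z) if abs(z) <= 3]
print(non_outliers)
```

For a 3D plot, the same test would need to hold for all three z-scores of a tweet at once.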

# Features

This section will be relatively short; that's because we've actually already done most of the work for this section in previous sections.

For the last two subsections of this tour, we'll highlight how some of our actions during preprocessing and visualization will help us for the modeling phase in the future.

## Feature selection

At the start of `Preprocessing`, we mentioned dropping some columns because they were irrelevant (e.g., optional columns during data collection, name of data collector).

Removing these columns reflects the spirit of *feature selection*: once we start training ML models using our data, we want to make sure that the columns we're feeding into our models will actually help them make more accurate decisions.

For example, suppose we didn't drop the `Collector` column. For all we know, our model will end up basing its judgments on whether Westin collected the data or not; that would be bad!

## Feature generation

Sometimes, the columns we have aren't enough to help our future models make accurate predictions. Remedying this is what *feature generation* is for.

It feels like the opposite of feature selection, but both aim for the same result: feature selection is a negative × negative (remove bad stuff), while feature generation is a positive × positive (add good stuff).

There are a few places where we introduced some new columns partly to incorporate feature generation:

- The first is when we added columns of z-scores for different engagement types (likes, replies, retweets). This was done more for the purpose of standardization itself, but it could be helpful to our models.

- The second is when we did categorical data encoding. Our models will appreciate it if we feed them numerical data, as that's what they're good at crunching, so bridging the gap between categories and numbers should prove helpful during the modeling phase.

- The third is when we introduced the token count column during visualization. This is a case where we extracted a *property* of some column and turned that into another column (as opposed to standardization or encoding). These property-based new features could potentially provide more context to our models.
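
To illustrate the last two bullets, here's a minimal sketch of generating features from raw rows in plain Python. The sample rows, the `tweet_type` values, and the choice of one-hot encoding are all hypothetical; the encoding we actually used during preprocessing may differ:

```python
# Hypothetical rows, for illustration only (not taken from our dataset).
tweets = [
    {"tokens": ["leni", "walang", "ambag"], "tweet_type": "Text"},
    {"tokens": ["npa", "kalaban"], "tweet_type": "Reply"},
    {"tokens": ["vp", "leni", "robredo", "npa"], "tweet_type": "Text"},
]

# Categorical encoding: one 0/1 column per distinct category (one-hot).
categories = sorted({tweet["tweet_type"] for tweet in tweets})

features = []
for tweet in tweets:
    row = {
        f"is_{category.lower()}": int(tweet["tweet_type"] == category)
        for category in categories
    }
    # Property-based feature: the length of the token list.
    row["token_count"] = len(tweet["tokens"])
    features.append(row)

for row in features:
    print(row)
```

Every generated column is numeric, which is the form most models expect as input.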

# Goodbye!

That wraps up our journey through the Group 9 dataset. We hope you enjoyed seeing the process evolve, and feel free to apply some of the ideas here in your own future work.

That's all from `<Team Name>` for now. Until next time! 👋