# Hello!

This is Group 9's Jupyter Notebook for data exploration. We wanted to figure out all sorts of things about the data we have, and the result of that exploration is the series of cells you see below.

We also wanted to present our results in a way that immerses the reader in the data exploration experience, so there's a narrative tone throughout this notebook. Enjoy!

---

As with any self-respecting Python program, let's first import the ~~minions~~ packages we'll need:

- `pandas` (DataFrames!)
- `plotly` (plotting)
- `re` (regular expressions)
- `nltk` (natural language processing)
- `emoji` (🤔)

In [199]:
%pip install emoji

Note: you may need to restart the kernel to use updated packages.


In [200]:
import emoji
import nltk
import pandas as pd
import plotly.express as px
import plotly.io as pio
import re

pio.renderers.default = "notebook"  # so that plotly works in an HTML file

# So...where's our data?

The CSV files for our dataset are hosted in one of Daryll's GitHub repositories, so let's just yoink those into this notebook:

In [201]:
CSV_PATHS = [
    "https://raw.githubusercontent.com/daryll-ko/cs132-main/main/explorer/data/data_001-100.csv",
    "https://raw.githubusercontent.com/daryll-ko/cs132-main/main/explorer/data/data_101-150.csv",
    "https://raw.githubusercontent.com/daryll-ko/cs132-main/main/explorer/data/data_other_groups.csv",
]

In [202]:
dfs = [pd.read_csv(csv_path, index_col=False) for csv_path in CSV_PATHS]

There are three CSV files (and DataFrames) because our data is split into three parts: one for the first 100 samples, another for the remaining 50 samples, and the last one for other groups' samples. Having to work with three things at once is a pain, so let's concatenate the DataFrames together:

In [203]:
df = pd.concat(dfs, ignore_index=True)
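To see what `ignore_index=True` buys us, here's a minimal sketch with two tiny made-up DataFrames (the values are hypothetical stand-ins, not rows from our dataset):

```python
import pandas as pd

# Two toy frames standing in for separate CSV files (hypothetical values).
part_a = pd.DataFrame({"Tweet URL": ["url_1", "url_2"]})
part_b = pd.DataFrame({"Tweet URL": ["url_3"]})

# Without ignore_index, each frame keeps its own row labels, so labels repeat.
kept = pd.concat([part_a, part_b])
print(list(kept.index))

# With ignore_index=True, the result gets a fresh 0..n-1 index.
fresh = pd.concat([part_a, part_b], ignore_index=True)
print(list(fresh.index))
```

Repeated row labels make label-based lookups ambiguous, which is why a fresh index is handy once the parts are glued together.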

# Preprocessing

Let's look at the shape of the data:

In [204]:
df.shape

(2120, 32)

That's...an awful lot of rows. It turns out that the CSV files still have the default blank rows from Google Sheets.

To fix this, let's just filter out the rows that are missing a `Tweet URL`, `Following`, or `Date posted` entry:

In [205]:
sentinels = ["Tweet URL", "Following", "Date posted"]

for sentinel in sentinels:
    df = df[df[sentinel].notna()]

df.reset_index(drop=True, inplace=True)

How many rows do we have now?

In [206]:
df.shape

(399, 32)
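For what it's worth, the same filtering can probably be written as a single `dropna` call. Here's a small sketch on a made-up DataFrame (the rows are hypothetical) that compares the loop with the one-call version:

```python
import pandas as pd

# Toy data: only the first row has all three sentinel columns filled in.
toy = pd.DataFrame(
    {
        "Tweet URL": ["url_1", None, "url_3", None],
        "Following": [10, None, None, None],
        "Date posted": ["01/01/22 10:00", None, "02/01/22 11:00", None],
    }
)
sentinels = ["Tweet URL", "Following", "Date posted"]

# Loop version: keep rows where each sentinel column is present.
looped = toy
for sentinel in sentinels:
    looped = looped[looped[sentinel].notna()]
looped = looped.reset_index(drop=True)

# One-call version: drop rows with a missing value in any sentinel column.
dropped = toy.dropna(subset=sentinels).reset_index(drop=True)

print(len(looped), len(dropped))
```

Note that the third toy row is dropped even though it has a `Tweet URL`, because its `Following` entry is missing.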

That's much better! Now let's look at our DataFrame's columns:

In [207]:
df.columns

Index(['ID', 'Timestamp', 'Tweet URL', 'Group', 'Collector', 'Category',
       'Topic', 'Keywords', 'Account handle', 'Account name', 'Account bio',
       'Account type', 'Joined', 'Following', 'Followers', 'Location', 'Tweet',
       'Tweet Translated', 'Tweet Type', 'Date posted', 'Screenshot',
       'Content type', 'Likes', 'Replies', 'Retweets', 'Quote Tweets', 'Views',
       'Rating', 'Reasoning', 'Remarks', 'Reviewer', 'Review'],
      dtype='object')

Some of these columns are irrelevant to the core of our data science project (e.g., `Timestamp` of data collection), so let's just drop 'em:

In [208]:
cols_to_drop = [
    "ID",
    "Timestamp",
    "Group",
    "Collector",
    "Category",
    "Topic",
    "Keywords",
    "Reviewer",
    "Review",
    "Remarks",
]

df = df.drop(columns=cols_to_drop)

What columns do we have left?

In [209]:
df.columns

Index(['Tweet URL', 'Account handle', 'Account name', 'Account bio',
       'Account type', 'Joined', 'Following', 'Followers', 'Location', 'Tweet',
       'Tweet Translated', 'Tweet Type', 'Date posted', 'Screenshot',
       'Content type', 'Likes', 'Replies', 'Retweets', 'Quote Tweets', 'Views',
       'Rating', 'Reasoning'],
      dtype='object')

That's much better! In subsequent sections, we'll clean up our data further and derive new features that are more relevant to our problem (context analysis).

## Handling missing values

Before we even know *what* to handle, let's get a bird's-eye view of our dataset using the `describe()` method:

In [210]:
df.describe()

Unnamed: 0,Likes,Replies,Retweets,Quote Tweets,Views
count,399.0,399.0,399.0,128.0,7.0
mean,13.408521,1.498747,4.994987,1.601562,55.857143
std,72.995223,7.001391,25.784963,8.352792,57.744924
min,0.0,0.0,0.0,0.0,3.0
25%,0.0,0.0,0.0,0.0,22.0
50%,0.0,0.0,0.0,0.0,49.0
75%,2.0,1.0,1.0,0.0,59.0
max,738.0,70.0,281.0,72.0,177.0


There are some oddities here. First of all, it seems like some columns don't have all the 400-ish entries (these were optional columns in the data collection phase).

There's not much we can do about that, so let's just drop 'em as well...

In [211]:
more_cols_to_drop = [
    "Tweet Translated",
    "Screenshot",
    "Quote Tweets",
    "Views",
    "Rating",
]

df = df.drop(columns=more_cols_to_drop)

That should do the trick! Let's call the `describe` method again:

In [212]:
df.describe()

Unnamed: 0,Likes,Replies,Retweets
count,399.0,399.0,399.0
mean,13.408521,1.498747,4.994987
std,72.995223,7.001391,25.784963
min,0.0,0.0,0.0
25%,0.0,0.0,0.0
50%,0.0,0.0,0.0
75%,2.0,1.0,1.0
max,738.0,70.0,281.0


Cool! At least for our quantitative features, we seem to have data that looks nor... wait a second, where did `Following` and `Followers` go? Shouldn't those be numeric columns too?

To our dismay, we found that for some reason, we didn't include the `Joined`, `Following`, `Followers`, and `Location` info for one row. Oh well, let's see what we can do.

`Following` and `Followers` are data points at the ratio level, so we can impute those missing values with the mean of their respective columns:

In [213]:
cols_to_fill = ["Following", "Followers"]

df[cols_to_fill] = df[cols_to_fill].apply(pd.to_numeric, errors="coerce")

for col in cols_to_fill:
    mean = df[col].mean()
    df[col] = df[col].fillna(mean)

`describe` *should* be fine now...

In [214]:
df.describe()

Unnamed: 0,Following,Followers,Likes,Replies,Retweets
count,399.0,399.0,399.0,399.0,399.0
mean,628.275862,1299.478836,13.408521,1.498747,4.994987
std,1100.116199,7008.724823,72.995223,7.001391,25.784963
min,0.0,0.0,0.0,0.0,0.0
25%,77.0,25.5,0.0,0.0,0.0
50%,247.0,181.0,0.0,0.0,0.0
75%,628.275862,714.5,2.0,1.0,1.0
max,8058.0,85700.0,738.0,70.0,281.0
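Here's a minimal sketch of what the coerce-then-impute step does, using a made-up column (the entries are hypothetical, including the unparseable `"n/a"`):

```python
import pandas as pd

# Hypothetical raw entries: numbers stored as text, plus one unparseable value.
raw = pd.Series(["100", "300", "n/a", "200"])

# errors="coerce" turns anything that can't be parsed into NaN.
numeric = pd.to_numeric(raw, errors="coerce")

# Fill the gaps with the mean of the values that did parse.
filled = numeric.fillna(numeric.mean())

print(filled.tolist())
print(numeric.std(), filled.std())
```

Mean imputation keeps the column's mean unchanged but shrinks its spread a little, since every filled-in value sits exactly at the mean. That may also be why a quantile in the summary above can coincide with the mean.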


## Handling outliers + Standardization

For our project, tweets that are outliers in terms of likes, replies, or retweets are actually important because these are tweets that have a lot of reach, and so they give a lot more context when mis/disinformation is our issue.

Thus, we're not going to drop these outliers; instead, we'll just identify them. It's good to know what kinds of tweets are getting a lot more attention than usual.

We first add new columns dedicated to showing the z-scores of the entries in some columns (standardization), then identify outliers based on these z-score columns:

In [215]:
cols = ["Likes", "Replies", "Retweets"]

for col in cols:
    new_col_name = f"{col} (z-scores)"
    df[new_col_name] = (df[col] - df[col].mean()) / df[col].std()
    print(f"Rows with outlier in [{col}]")
    display(df[abs(df[new_col_name]) > 3][f"{col} (z-scores)"])
    print()

Rows with outlier in [Likes]


1      8.200420
136    4.898834
358    9.693668
372    9.926560
375    8.488110
Name: Likes (z-scores), dtype: float64


Rows with outlier in [Replies]


1      7.070203
18     4.213628
136    4.213628
171    5.213429
204    3.213826
292    5.784744
358    6.784545
372    9.783950
375    8.070004
397    3.499484
Name: Replies (z-scores), dtype: float64


Rows with outlier in [Retweets]


1       3.451819
136     3.102778
148     6.864660
162     3.762077
358    10.704108
372    10.161155
375     8.066912
Name: Retweets (z-scores), dtype: float64




Cool! These are rows we can keep in mind for later on.
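As a sanity check on the idea, here's a small standard-library sketch of the same z-score rule, $z = (x - \bar{x}) / s$, on made-up like counts (the numbers are hypothetical). It uses the sample standard deviation, which matches the default of pandas' `.std()`:

```python
import statistics

# Hypothetical like counts: mostly tiny, with one viral tweet.
likes = [0] * 10 + [2] * 5 + [500]

mean = statistics.mean(likes)
std = statistics.stdev(likes)  # sample standard deviation (divides by n - 1)

z_scores = [(x - mean) / std for x in likes]

# Flag entries more than 3 standard deviations away from the mean.
outliers = [x for x, z in zip(likes, z_scores) if abs(z) > 3]
print(outliers)
```

Only the viral tweet crosses the threshold; everything else sits slightly below zero because the single large value drags the mean up.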

## Ensuring formatting consistency

Let's make sure that for columns that are supposed to follow a specific format, the entries in these columns *do* follow that format.

First, let's strip stray newlines from our account handles and check that they all start with the `@` character:

In [216]:
def purify_account_handle(row) -> str:
    return row["Account handle"].replace("\n", "")


df["Account handle"] = df.apply(purify_account_handle, axis=1)

In [217]:
len(df[[not handle.startswith("@") for handle in df["Account handle"]]])

0

In [218]:
print(df.columns)
print(df.shape)

Index(['Tweet URL', 'Account handle', 'Account name', 'Account bio',
       'Account type', 'Joined', 'Following', 'Followers', 'Location', 'Tweet',
       'Tweet Type', 'Date posted', 'Content type', 'Likes', 'Replies',
       'Retweets', 'Reasoning', 'Likes (z-scores)', 'Replies (z-scores)',
       'Retweets (z-scores)'],
      dtype='object')
(399, 20)


Next, let's ensure that the `Joined` entries are of the form `MM/YY`. We'll use a regular expression to accomplish this:

In [219]:
regex_normal = r"^(\d{1,2})/(\d{2,4})$"
regex_fallback = r"^(\d{2})/(\d{2})/(\d{4}) (\d{1,2}):(\d{2}):(\d{2})$"


def is_proper_month(month: int) -> bool:
    return 1 <= month <= 12


def is_proper_joined_date(s: str) -> bool:
    result = re.search(regex_normal, s)
    if result is None:
        return False
    month, _ = map(int, result.groups())
    return is_proper_month(month)


def purify_joined_date(row) -> str:
    s = row["Joined"]
    result = re.search(regex_normal, s)
    if result is not None:
        return s
    else:
        fallback = re.search(regex_fallback, s)
        if fallback is not None:
            _, month, year, _, _, _ = map(int, fallback.groups())
            return f"{str(month).zfill(2)}/{str(year)[-2:]}"
        else:
            return ""


df["Joined"] = df.apply(purify_joined_date, axis=1)

In [220]:
def row_has_valid_joined_date(row) -> bool:
    return is_proper_joined_date(str(row["Joined"]))


len(df[df.apply(row_has_valid_joined_date, axis=1)])

399
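To make the two patterns more concrete, here's a compact sketch of the same normalization idea applied to a few made-up inputs (the helper name `normalize_joined` and the sample strings are ours for illustration, not part of the notebook's pipeline):

```python
import re

REGEX_NORMAL = r"^(\d{1,2})/(\d{2,4})$"
REGEX_FALLBACK = r"^(\d{2})/(\d{2})/(\d{4}) (\d{1,2}):(\d{2}):(\d{2})$"


def normalize_joined(s: str) -> str:
    """Return s in MM/YY form, or "" if neither pattern matches."""
    if re.search(REGEX_NORMAL, s):
        return s  # already in the expected shape
    fallback = re.search(REGEX_FALLBACK, s)
    if fallback is None:
        return ""
    # Fallback groups are day, month, year, hour, minute, second.
    _, month, year, *_ = fallback.groups()
    return f"{month}/{year[-2:]}"


for raw in ["03/19", "01/03/2019 0:00:00", "March 2019"]:
    print(repr(raw), "->", repr(normalize_joined(raw)))
```

One thing worth noticing: the first pattern also accepts strings like `3/2019`, so it's a bit looser than a strict `MM/YY`.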

Finally, let's ensure that the `Date posted` entries are of the form `DD/MM/YY hh:mm`. We'll use another regular expression to accomplish this:

In [221]:
regex_normal = r"^(\d{2})/(\d{2})/(\d{2}) (\d{2}):(\d{2})$"
regex_fallback = r"^(\d{2})/(\d{1,2})/(\d{2,4}) (\d{1,2}):(\d{2}).*$"
regex_fallback_2 = r"^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}).*$"


def is_proper_hour(hour: int) -> bool:
    return 0 <= hour < 24


def is_proper_minute(minute: int) -> bool:
    return 0 <= minute < 60


def is_leap_year(year: int) -> bool:
    return year % 4 == 0 and (year % 400 == 0 or year % 100 != 0)


def is_proper_date(day: int, month: int, year: int) -> bool:
    if month == 2:
        return 1 <= day <= (29 if is_leap_year(2000 + year % 1000) else 28)
    elif month in [1, 3, 5, 7, 8, 10, 12]:
        return 1 <= day <= 31
    else:
        return 1 <= day <= 30


def is_proper_date_posted(s: str) -> bool:
    result = re.search(regex_normal, s)
    if result is None:
        return False
    day, month, year, hour, minute = map(int, result.groups())
    return (
        is_proper_month(int(month))
        and is_proper_date(day, month, year)
        and is_proper_hour(hour)
        and is_proper_minute(minute)
    )


def purify_date_posted(row) -> str:
    s = re.sub(r"\s+", " ", row["Date posted"])
    result = re.search(regex_normal, s)
    if result is not None:
        day, month, year, hour, minute = map(int, result.groups())
        if not is_proper_month(month):
            day, month = month, day
        return (
            f"{str(day).zfill(2)}"
            "/"
            f"{str(month).zfill(2)}"
            "/"
            f"{str(year)[-2:]}"
            " "
            f"{str(hour).zfill(2)}"
            ":"
            f"{str(minute).zfill(2)}"
        )
    else:
        fallback = re.search(regex_fallback, s)
        if fallback is not None:
            day, month, year, hour, minute = map(int, fallback.groups())
            return (
                f"{str(day).zfill(2)}"
                "/"
                f"{str(month).zfill(2)}"
                "/"
                f"{str(year)[-2:]}"
                " "
                f"{str(hour).zfill(2)}"
                ":"
                f"{str(minute).zfill(2)}"
            )
        else:
            fallback_2 = re.search(regex_fallback_2, s)
            if fallback_2 is not None:
                year, month, day, hour, minute = map(int, fallback_2.groups())
                return (
                    f"{str(day).zfill(2)}"
                    "/"
                    f"{str(month).zfill(2)}"
                    "/"
                    f"{str(year)[-2:]}"
                    " "
                    f"{str(hour).zfill(2)}"
                    ":"
                    f"{str(minute).zfill(2)}"
                )
            else:
                print(s)
                return s


df["Date posted"] = df.apply(purify_date_posted, axis=1)

In [222]:
def has_valid_date_posted(row) -> bool:
    return is_proper_date_posted(row["Date posted"])


len(df[df.apply(has_valid_date_posted, axis=1)])

395
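As an alternative to hand-rolled month, day, and leap-year checks, the standard library's `datetime.strptime` can do the calendar validation for us. This is only a sketch of that option (the helper name `is_valid_date_posted` is hypothetical), not what the cells above use:

```python
from datetime import datetime


def is_valid_date_posted(s: str) -> bool:
    """Return True if s parses as a real date/time in DD/MM/YY hh:mm form."""
    try:
        datetime.strptime(s, "%d/%m/%y %H:%M")
    except ValueError:
        # Raised for wrong shapes and for impossible dates like 31/04.
        return False
    return True


for candidate in ["29/02/24 08:30", "29/02/23 08:30", "31/04/22 10:00"]:
    print(candidate, is_valid_date_posted(candidate))
```

A caveat: `strptime` also accepts fields without zero padding (e.g., `1/2/22 3:04`), so it's slightly more lenient than the regular expression about formatting.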

## Categorical data encoding

There are three columns that look susceptible to categorical data encoding:

- `Account type`
- `Tweet Type`
- `Content type`

An `Account type` may be `Identified`, `Anonymous`, or `Media`. Let's encode these using three new columns: each new column indicates whether or not an account is of some account type (the values in that column are 0 or 1).

As a convention, let's name the new columns `Account is {account_type}` (e.g., `Account is Anonymous`):

In [223]:
account_types = ["Identified", "Anonymous", "Media"]


def h_decorator(account_type: str):
    def h(row) -> int:
        return 1 if account_type == str(row["Account type"]).strip() else 0

    return h


for account_type in account_types:
    df[f"Account is {account_type}"] = df.apply(h_decorator(account_type), axis=1)

df[[f"Account is {account_type}" for account_type in account_types]]

Unnamed: 0,Account is Identified,Account is Anonymous,Account is Media
0,0,1,0
1,1,0,0
2,0,1,0
3,0,1,0
4,0,1,0
...,...,...,...
394,1,0,0
395,0,1,0
396,0,1,0
397,0,0,1


A `Tweet Type` may be some of multiple things:

- `Text`
- `Image`
- `Video`
- `URL`
- `Reply`
- `Quote Tweet`

To encode this, let's introduce six new columns: each new column indicates whether or not a tweet is part of some tweet type (again, the values in that column are 0 or 1).

As a convention, let's name the new columns `Tweet is {tweet_type}` (e.g., `Tweet is Text`):

In [224]:
tweet_types = ["Text", "Image", "Video", "URL", "Reply", "QuoteTweet"]


def f_decorator(tweet_type: str):
    def f(row) -> int:
        return 1 if tweet_type in row["Tweet Type"].replace(" ", "").split(",") else 0

    return f


for tweet_type in tweet_types:
    df[f"Tweet is {tweet_type}"] = df.apply(f_decorator(tweet_type), axis=1)

df[[f"Tweet is {tweet_type}" for tweet_type in tweet_types]]

Unnamed: 0,Tweet is Text,Tweet is Image,Tweet is Video,Tweet is URL,Tweet is Reply,Tweet is QuoteTweet
0,1,0,0,0,1,0
1,1,0,0,0,0,0
2,1,0,0,0,1,0
3,1,0,0,0,1,0
4,1,0,0,0,1,0
...,...,...,...,...,...,...
394,1,0,0,0,1,0
395,1,0,0,0,1,0
396,1,1,0,0,0,0
397,1,1,0,1,0,0


Finally, a `Content type` may be some of multiple things:

- `Rational`
- `Emotional`
- `Transactional`

Once again, we introduce three new columns. Let's name them `Content is {content_type}` (e.g., `Content is Emotional`):

In [225]:
content_types = ["Rational", "Emotional", "Transactional"]


def g_decorator(content_type: str):
    def g(row) -> int:
        return (
            1 if content_type in row["Content type"].replace(" ", "").split(",") else 0
        )

    return g


for content_type in content_types:
    df[f"Content is {content_type}"] = df.apply(g_decorator(content_type), axis=1)

df[[f"Content is {content_type}" for content_type in content_types]]

Unnamed: 0,Content is Rational,Content is Emotional,Content is Transactional
0,1,0,0
1,1,1,0
2,1,1,0
3,1,0,0
4,1,1,0
...,...,...,...
394,1,0,0
395,1,0,0
396,0,1,0
397,1,0,0
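For comma-separated labels like these, pandas also has a built-in shortcut, `Series.str.get_dummies`. Here's a minimal sketch on made-up entries (the values are hypothetical) that produces the same kind of 0/1 columns:

```python
import pandas as pd

# Hypothetical entries in the same comma-separated style as `Content type`.
content = pd.Series(["Rational", "Rational, Emotional", "Emotional"])

# Strip spaces, then split on commas and build one 0/1 column per label.
encoded = content.str.replace(" ", "").str.get_dummies(sep=",")
encoded = encoded.add_prefix("Content is ")

print(encoded)
```

One difference from the explicit loop: this only creates columns for labels that actually occur, so a type like `Transactional` gets no column here because no toy row uses it.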


Great! With all that covered, it's time to do some **natural language processing**.

## Tokenization + lower casing

For analysis, let's reflect our language processing work in a new column called `Tweet (processed)`. This will allow us to see the progress we've made in processing our tweets.

Let's tokenize our tweets first. In our case, we split the tweets on spaces and a few common punctuation marks:

In [226]:
df["Tweet (processed)"] = df["Tweet"]

df["Tweet (processed)"] = df["Tweet (processed)"].transform(
    lambda tweet: re.split(r" |,|!|\.|\?|;|:", tweet)
)

df[["Tweet", "Tweet (processed)"]]

Unnamed: 0,Tweet,Tweet (processed)
0,Wala alam si leni sa foreign policy. Di nya ng...,"[Wala, alam, si, leni, sa, foreign, policy, , ..."
1,Jusko si Leni walang ambag sa Maritime Industr...,"[Jusko, si, Leni, walang, ambag, sa, Maritime,..."
2,"@setsu0196 @indaysara Inggit lang mga yun, wal...","[@setsu0196, @indaysara, Inggit, lang, mga, yu..."
3,@BosyoJ Wala na kasing ibang topic na alam si ...,"[@BosyoJ, Wala, na, kasing, ibang, topic, na, ..."
4,"@VancouverEye Parang Vovo, matagal ng walang g...","[@VancouverEye, Parang, Vovo, , matagal, ng, w..."
...,...,...
394,"NPA po ang kalaban, na pinoprotektahan ni Leni...","[NPA, po, ang, kalaban, , na, pinoprotektahan,..."
395,we are only given the choices of:\nBBM - son o...,"[we, are, only, given, the, choices, of, \nBBM..."
396,Desperado na si Leni..npa at kumunista Ang gag...,"[Desperado, na, si, Leni, , npa, at, kumunista..."
397,"VP Leni Robredo, kabilang sa mga CPP-NPA lover...","[VP, Leni, Robredo, , kabilang, sa, mga, CPP-N..."


Afterwards, let's turn each token to lowercase (if possible):

In [227]:
df["Tweet (processed)"] = df["Tweet (processed)"].transform(
    lambda tokens: list(map(lambda token: token.lower(), tokens))
)

df[["Tweet", "Tweet (processed)"]]

Unnamed: 0,Tweet,Tweet (processed)
0,Wala alam si leni sa foreign policy. Di nya ng...,"[wala, alam, si, leni, sa, foreign, policy, , ..."
1,Jusko si Leni walang ambag sa Maritime Industr...,"[jusko, si, leni, walang, ambag, sa, maritime,..."
2,"@setsu0196 @indaysara Inggit lang mga yun, wal...","[@setsu0196, @indaysara, inggit, lang, mga, yu..."
3,@BosyoJ Wala na kasing ibang topic na alam si ...,"[@bosyoj, wala, na, kasing, ibang, topic, na, ..."
4,"@VancouverEye Parang Vovo, matagal ng walang g...","[@vancouvereye, parang, vovo, , matagal, ng, w..."
...,...,...
394,"NPA po ang kalaban, na pinoprotektahan ni Leni...","[npa, po, ang, kalaban, , na, pinoprotektahan,..."
395,we are only given the choices of:\nBBM - son o...,"[we, are, only, given, the, choices, of, \nbbm..."
396,Desperado na si Leni..npa at kumunista Ang gag...,"[desperado, na, si, leni, , npa, at, kumunista..."
397,"VP Leni Robredo, kabilang sa mga CPP-NPA lover...","[vp, leni, robredo, , kabilang, sa, mga, cpp-n..."


Let's remove tokens that don't consist solely of uppercase or lowercase letters:

In [228]:
df["Tweet (processed)"] = df["Tweet (processed)"].transform(
    lambda tokens: list(filter(lambda token: re.match(r"^[a-zA-Z]+$", token), tokens))
)

df[["Tweet", "Tweet (processed)"]]

Unnamed: 0,Tweet,Tweet (processed)
0,Wala alam si leni sa foreign policy. Di nya ng...,"[wala, alam, si, leni, sa, foreign, policy, di..."
1,Jusko si Leni walang ambag sa Maritime Industr...,"[jusko, si, leni, walang, ambag, sa, maritime,..."
2,"@setsu0196 @indaysara Inggit lang mga yun, wal...","[inggit, lang, mga, yun, walang, ambag, kasi, ..."
3,@BosyoJ Wala na kasing ibang topic na alam si ...,"[wala, na, kasing, ibang, topic, na, alam, si,..."
4,"@VancouverEye Parang Vovo, matagal ng walang g...","[parang, vovo, matagal, ng, walang, galaw, ang..."
...,...,...
394,"NPA po ang kalaban, na pinoprotektahan ni Leni...","[npa, po, ang, kalaban, na, pinoprotektahan, n..."
395,we are only given the choices of:\nBBM - son o...,"[we, are, only, given, the, choices, of, son, ..."
396,Desperado na si Leni..npa at kumunista Ang gag...,"[desperado, na, si, leni, npa, at, kumunista, ..."
397,"VP Leni Robredo, kabilang sa mga CPP-NPA lover...","[vp, leni, robredo, kabilang, sa, mga, lovers,..."
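If you're wondering where the empty tokens in the earlier outputs come from, here's a small sketch that runs the same three steps on a single made-up tweet (the text is hypothetical):

```python
import re

# Hypothetical tweet (not from our dataset).
tweet = "@someone Wow, ang galing!! Vote 2022"

# Step 1: split on single spaces and punctuation marks.
raw_tokens = re.split(r" |,|!|\.|\?|;|:", tweet)
print(raw_tokens)

# Step 2: lowercase every token.
tokens = [token.lower() for token in raw_tokens]

# Step 3: keep only tokens made up entirely of letters.
tokens = [token for token in tokens if re.match(r"^[a-zA-Z]+$", token)]
print(tokens)
```

Two delimiters in a row (like `, ` or `!!`) leave an empty string between them, which is why `''` shows up after splitting. The letters-only filter then removes those empty strings along with the mention and the number.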


## Stop words removal + other filters

Let's remove stop words from our tweets. For filtering out English stop words, we'll use the `nltk` package. Meanwhile, for filtering out Tagalog stop words, we'll use our own curated set:

In [229]:
nltk.download("stopwords")
from nltk.corpus import stopwords

stopwords_set_english = set(stopwords.words("english"))
stopwords_set_tagalog = set(
    [
        "ah",
        "akin",
        "aking",
        "ako",
        "alin",
        "alinsunod",
        "amin",
        "ang",
        "ano",
        "apat",
        "at",
        "ay",
        "ayon",
        "ayun",
        "ba",
        "bagaman",
        "bagamat",
        "bakit",
        "basta",
        "dahil",
        "dalawa",
        "datapwat",
        "daw",
        "di",
        "din",
        "dito",
        "doon",
        "eh",
        "ganito",
        "gayunpaman",
        "ha",
        "hala",
        "hanggang",
        "haydiba",
        "hinggil",
        "https",
        "ikaw",
        "isa",
        "ito",
        "iyan",
        "iyon",
        "jusko",
        "kabila",
        "kami",
        "kanila",
        "kasi",
        "ka",
        "kay",
        "kaya",
        "kayo",
        "kaysa",
        "kina",
        "ko",
        "kung",
        "kuwan",
        "labag",
        "lang",
        "mag",
        "may",
        "mga",
        "mo",
        "mong",
        "mula",
        "na",
        "naku",
        "naman",
        "nang",
        "ng",
        "nga",
        "ngek",
        "ngunit",
        "ni",
        "nina",
        "niya",
        "niyo",
        "noong",
        "nung",
        "nya",
        "nyo",
        "o",
        "opo",
        "pa",
        "pag",
        "pagkatapos",
        "pangalawa",
        "para",
        "parang",
        "pero",
        "po",
        "raw",
        "rin",
        "sa",
        "sapagkat",
        "si",
        "sila",
        "sina",
        "siya",
        "talaga",
        "sumunod",
        "sya",
        "tatlo",
        "tayo",
        "tungo",
        "una",
        "yan",
        "yun",
        "yung",
    ]
)
stopwords_set = stopwords_set_english.union(stopwords_set_tagalog)

df["Tweet (processed)"] = df["Tweet (processed)"].transform(
    lambda tokens: list(filter(lambda token: token not in stopwords_set, tokens))
)

df[["Tweet", "Tweet (processed)"]]

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/macintoshhd/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,Tweet,Tweet (processed)
0,Wala alam si leni sa foreign policy. Di nya ng...,"[wala, alam, leni, foreign, policy, alam, pres..."
1,Jusko si Leni walang ambag sa Maritime Industr...,"[leni, walang, ambag, maritime, industry, year..."
2,"@setsu0196 @indaysara Inggit lang mga yun, wal...","[inggit, walang, ambag, leni, moro, chaka]"
3,@BosyoJ Wala na kasing ibang topic na alam si ...,"[wala, kasing, ibang, topic, alam, leni, walan..."
4,"@VancouverEye Parang Vovo, matagal ng walang g...","[vovo, matagal, walang, galaw, unilever, ngayo..."
...,...,...
394,"NPA po ang kalaban, na pinoprotektahan ni Leni...","[npa, kalaban, pinoprotektahan, leni, npa, pum..."
395,we are only given the choices of:\nBBM - son o...,"[given, choices, son, npa, casted, votes, some..."
396,Desperado na si Leni..npa at kumunista Ang gag...,"[desperado, leni, npa, kumunista, gagawa, gulo..."
397,"VP Leni Robredo, kabilang sa mga CPP-NPA lover...","[vp, leni, robredo, kabilang, lovers, pastor, ..."


(Disclaimer: The filters below are already redundant given the letters-only filter above.)

All of those emojis might be painful to work with (for one, they're not ASCII characters), so let's remove them. We'll accomplish this by replacing all emojis with empty strings, with the help of the `emoji` package!

In [230]:
df["Tweet (processed)"] = df["Tweet (processed)"].transform(
    lambda tokens: list(map(lambda token: emoji.replace_emoji(token, ""), tokens))
)

display(df[["Tweet", "Tweet (processed)"]])

Unnamed: 0,Tweet,Tweet (processed)
0,Wala alam si leni sa foreign policy. Di nya ng...,"[wala, alam, leni, foreign, policy, alam, pres..."
1,Jusko si Leni walang ambag sa Maritime Industr...,"[leni, walang, ambag, maritime, industry, year..."
2,"@setsu0196 @indaysara Inggit lang mga yun, wal...","[inggit, walang, ambag, leni, moro, chaka]"
3,@BosyoJ Wala na kasing ibang topic na alam si ...,"[wala, kasing, ibang, topic, alam, leni, walan..."
4,"@VancouverEye Parang Vovo, matagal ng walang g...","[vovo, matagal, walang, galaw, unilever, ngayo..."
...,...,...
394,"NPA po ang kalaban, na pinoprotektahan ni Leni...","[npa, kalaban, pinoprotektahan, leni, npa, pum..."
395,we are only given the choices of:\nBBM - son o...,"[given, choices, son, npa, casted, votes, some..."
396,Desperado na si Leni..npa at kumunista Ang gag...,"[desperado, leni, npa, kumunista, gagawa, gulo..."
397,"VP Leni Robredo, kabilang sa mga CPP-NPA lover...","[vp, leni, robredo, kabilang, lovers, pastor, ..."


Next, let's filter out empty tokens:

In [231]:
df["Tweet (processed)"] = df["Tweet (processed)"].transform(
    lambda tokens: list(filter(lambda token: len(token) > 0, tokens))
)

df[["Tweet", "Tweet (processed)"]]

Unnamed: 0,Tweet,Tweet (processed)
0,Wala alam si leni sa foreign policy. Di nya ng...,"[wala, alam, leni, foreign, policy, alam, pres..."
1,Jusko si Leni walang ambag sa Maritime Industr...,"[leni, walang, ambag, maritime, industry, year..."
2,"@setsu0196 @indaysara Inggit lang mga yun, wal...","[inggit, walang, ambag, leni, moro, chaka]"
3,@BosyoJ Wala na kasing ibang topic na alam si ...,"[wala, kasing, ibang, topic, alam, leni, walan..."
4,"@VancouverEye Parang Vovo, matagal ng walang g...","[vovo, matagal, walang, galaw, unilever, ngayo..."
...,...,...
394,"NPA po ang kalaban, na pinoprotektahan ni Leni...","[npa, kalaban, pinoprotektahan, leni, npa, pum..."
395,we are only given the choices of:\nBBM - son o...,"[given, choices, son, npa, casted, votes, some..."
396,Desperado na si Leni..npa at kumunista Ang gag...,"[desperado, leni, npa, kumunista, gagawa, gulo..."
397,"VP Leni Robredo, kabilang sa mga CPP-NPA lover...","[vp, leni, robredo, kabilang, lovers, pastor, ..."


Finally, let's remove the mentions (tokens that start with `@`):

In [232]:
df["Tweet (processed)"] = df["Tweet (processed)"].transform(
    lambda tokens: list(filter(lambda token: token[0] != "@", tokens))
)

df[["Tweet", "Tweet (processed)"]]

Unnamed: 0,Tweet,Tweet (processed)
0,Wala alam si leni sa foreign policy. Di nya ng...,"[wala, alam, leni, foreign, policy, alam, pres..."
1,Jusko si Leni walang ambag sa Maritime Industr...,"[leni, walang, ambag, maritime, industry, year..."
2,"@setsu0196 @indaysara Inggit lang mga yun, wal...","[inggit, walang, ambag, leni, moro, chaka]"
3,@BosyoJ Wala na kasing ibang topic na alam si ...,"[wala, kasing, ibang, topic, alam, leni, walan..."
4,"@VancouverEye Parang Vovo, matagal ng walang g...","[vovo, matagal, walang, galaw, unilever, ngayo..."
...,...,...
394,"NPA po ang kalaban, na pinoprotektahan ni Leni...","[npa, kalaban, pinoprotektahan, leni, npa, pum..."
395,we are only given the choices of:\nBBM - son o...,"[given, choices, son, npa, casted, votes, some..."
396,Desperado na si Leni..npa at kumunista Ang gag...,"[desperado, leni, npa, kumunista, gagawa, gulo..."
397,"VP Leni Robredo, kabilang sa mga CPP-NPA lover...","[vp, leni, robredo, kabilang, lovers, pastor, ..."


Cool! Our tweets look easier to work with now. Let's export this DataFrame as a CSV file for future analysis.

In [233]:
df.to_csv("processed_data.csv")
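If we ever need to rerun the text processing on new tweets, the steps from this section could be bundled into one function. Here's a rough sketch under a few assumptions: the name `process_tweet` is ours, and `DEMO_STOPWORDS` is a tiny stand-in for the much larger stop word set used above.

```python
import re

# Tiny stand-in stop word set (hypothetical); the real set is much larger.
DEMO_STOPWORDS = {"ang", "ng", "sa", "the", "is"}


def process_tweet(tweet: str, stopwords: set) -> list:
    """Tokenize, lowercase, keep letters-only tokens, then drop stop words."""
    tokens = re.split(r" |,|!|\.|\?|;|:", tweet)
    tokens = [token.lower() for token in tokens]
    tokens = [token for token in tokens if re.match(r"^[a-zA-Z]+$", token)]
    return [token for token in tokens if token not in stopwords]


print(process_tweet("@user Ang ganda ng panahon sa Manila!", DEMO_STOPWORDS))
```

The sketch leaves out the emoji, empty-token, and mention filters on purpose: the letters-only filter already discards all three kinds of tokens.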

# Visualization

Here comes the fun part! Let's look at all kinds of relationships in our data using different kinds of plots.

For this section, we'll be using the `plotly` package: it gives us *interactive* plots to play with, which makes the exploration process more active.

## Histogram: token counts

For our first plot, we'll see how long our token lists ended up as a result of our natural language processing.

First, let's add a new column for the number of tokens per tweet:

In [39]:
df["Token count"] = df["Tweet (processed)"].transform(lambda tokens: len(tokens))

df[["Tweet (processed)", "Token count"]]

Unnamed: 0,Tweet (processed),Token count
0,"[wala, alam, leni, foreign, policy, alam, pres...",10
1,"[leni, walang, ambag, maritime, industry, year...",18
2,"[inggit, walang, ambag, leni, moro, chaka]",6
3,"[wala, kasing, ibang, topic, alam, leni, walan...",28
4,"[vovo, matagal, walang, galaw, unilever, ngayo...",25
...,...,...
394,"[npa, kalaban, pinoprotektahan, leni, npa, pum...",8
395,"[given, choices, son, npa, casted, votes, some...",9
396,"[desperado, leni, npa, kumunista, gagawa, gulo...",13
397,"[vp, leni, robredo, kabilang, lovers, pastor, ...",11


Now let's chuck those token counts into a `plotly` histogram:

In [40]:
fig = px.histogram(
    df, x="Token count", nbins=4, text_auto=True, title="Distribution of token counts"
)
fig.update_layout(
    font_family="monospace",
    title_font_family="monospace",
)
fig.show()

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

We see that there are fewer tweets with larger token counts. Here's another histogram with much finer bins (`nbins=40`):

In [None]:
fig = px.histogram(
    df, x="Token count", nbins=40, text_auto=True, title="Distribution of token counts"
)
fig.update_layout(
    font_family="monospace",
    title_font_family="monospace",
    yaxis_title="Number of tweets",
)
fig.show()

Now it looks like a city background!

## Heat map: content type against engagements

Let's look at the correlation matrix for the engagement counts (`Likes`, `Replies`, `Retweets`) and some of the content types (`Rational`, `Emotional`).

First, let's make a mini DataFrame that contains only these columns and get its correlation matrix:

In [None]:
mini_df = df[
    ["Likes", "Replies", "Retweets", "Content is Rational", "Content is Emotional"]
]

mini_corr = mini_df.corr(numeric_only=True).round(2)

Great! Let's turn this into a heatmap using `plotly`'s `imshow` function:

In [None]:
fig = px.imshow(mini_corr, text_auto=True)
fig.update_layout(
    title="Correlation matrix of content types and engagement counts",
    font_family="monospace",
    title_font_family="monospace",
)
fig.show()

Interesting: correlations seem to be stronger *within* content types and engagement counts rather than *between* them...
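For reference, each cell in that matrix is a Pearson correlation coefficient (the default for pandas' `corr`). Here's a minimal hand-rolled sketch of that computation on made-up numbers (the values and the `pearson` helper are hypothetical, just for illustration):

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical like counts and a 0/1 "Content is Emotional" flag.
likes = [0, 2, 4, 6]
is_emotional = [0, 0, 1, 1]

print(round(pearson(likes, is_emotional), 2))
```

A 0/1 column works fine here: the coefficient simply measures whether the numeric column tends to be larger when the flag is 1.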

## Violin plot: tweet type and likes

How is like count distributed across tweet types? Let's find out! To accomplish this, we'll use a **violin plot**; it's like a box plot, but there's a kernel density plot (read: distribution) surrounding it.

First, since a row in our original DataFrame can have multiple tweet types, let's make an equivalent DataFrame that has one row per tweet type per tweet:

In [None]:
rows = []

for _, row in df.iterrows():
    # A tweet with several types contributes one row per type.
    for tweet_type in row["Tweet Type"].replace(" ", "").split(","):
        rows.append({"Likes": row["Likes"], "Tweet Type": tweet_type})

# Build the DataFrame once instead of concatenating inside the loop.
new_df = pd.DataFrame(rows)

Now let's turn it into a violin plot!

In [None]:
fig = px.violin(
    new_df,
    y="Likes",
    x="Tweet Type",
    title="Distribution of like counts per tweet type",
)
fig.update_layout(font_family="monospace", title_font_family="monospace")
fig.show()

## Line graph: tweet counts over the years

How many tweets do we have for each year from 2016 to 2022? Let's use a line graph to find out!

First, we need to group our tweets by year:

In [None]:
years = list(range(2016, 2023))
counts = [0] * len(years)

for _, row in df.iterrows():
    # Pull the two-digit year out of a `DD/MM/YY hh:mm` entry.
    match = re.match(r"^\d{2}/\d{2}/(\d{2}) ", str(row["Date posted"]))
    if match is None:
        continue  # skip entries that weren't normalized during preprocessing
    year_posted = 2000 + int(match.group(1))
    if year_posted in years:  # skip years outside the plotted range
        counts[year_posted - years[0]] += 1

fig = px.line(x=years, y=counts, title="Number of tweets per year", text=counts)
fig.update_layout(
    xaxis_title="Year",
    yaxis_title="Number of tweets",
    font_family="monospace",
    title_font_family="monospace",
)
fig.update_traces(textposition="top left")
fig.show()

This is actually not too surprising: our scraper for data collection was originally programmed to yoink recent tweets, which explains the spike for the later years.
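A more defensive way to do this kind of tally is to let `datetime` parse each entry and count years with a `Counter`, so malformed entries are skipped rather than crashing the loop. Here's a sketch on made-up date strings (all values are hypothetical):

```python
from collections import Counter
from datetime import datetime

# Hypothetical `Date posted` entries, including one that can't be parsed.
dates = ["05/03/21 14:30", "17/11/22 09:05", "bad value", "01/01/22 00:00"]

counts = Counter()
for s in dates:
    try:
        year = datetime.strptime(s, "%d/%m/%y %H:%M").year
    except ValueError:
        continue  # skip entries that don't look like DD/MM/YY hh:mm
    counts[year] += 1

# A Counter returns 0 for missing keys, so empty years need no special case.
years = list(range(2016, 2023))
per_year = [counts[year] for year in years]
print(per_year)
```

Because the counts are keyed by year instead of by list position, a year outside the plotted range can't cause an out-of-bounds index.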

## 3D: engagement z-scores

Back when we were preprocessing, we added columns for the z-scores of likes, replies, and retweets. Given that triples of these values are usually in the range $[-3, 3]^3$, a 3D scatter plot of these triples may be insightful:

In [None]:
fig = px.scatter_3d(
    df,
    x="Likes (z-scores)",
    y="Replies (z-scores)",
    z="Retweets (z-scores)",
    title="Likes z-score vs. Replies z-score vs. Retweets z-score",
)
fig.update_layout(
    font_family="monospace",
    title_font_family="monospace",
)
fig.show()

Here's another 3D scatter plot, which only includes non-outliers:

In [None]:
fig = px.scatter_3d(
    df,
    x="Likes (z-scores)",
    y="Replies (z-scores)",
    z="Retweets (z-scores)",
    title="Likes z-score vs. Replies z-score vs. Retweets z-score",
)
fig.update_layout(
    scene=dict(
        xaxis=dict(nticks=4, range=[-3, 3]),
        yaxis=dict(nticks=4, range=[-3, 3]),
        zaxis=dict(nticks=4, range=[-3, 3]),
    ),
    font_family="monospace",
    title_font_family="monospace",
)
fig.show()

The triples still look close together!

# Features

This section will be relatively short; that's because we've actually already done most of the work for this section in previous sections.

For the last two subsections of this tour, we'll highlight how some of our actions during preprocessing and visualization will help us for the modeling phase in the future.

## Feature selection

At the start of `Preprocessing`, we mentioned dropping some columns because they were irrelevant (e.g., optional columns during data collection, name of data collector).

Removing these columns reflects the spirit of *feature selection*: once we start training ML models using our data, we want to make sure that the columns we're feeding into our models will actually help them make more accurate decisions.

For example, suppose we didn't drop the `Collector` column. For all we know, our model will end up basing its judgments on whether Westin collected the data or not; that would be bad!

## Feature generation

Sometimes, the columns we have aren't enough to help our future models make accurate predictions. Remedying this is what *feature generation* is for.

It feels like the opposite of feature selection, but the idea is that feature selection is negative × negative (remove bad stuff), while feature generation is positive × positive (add good stuff).

There are a few places where we introduced some new columns partly to incorporate feature generation:

- The first is when we added columns of z-scores for different engagement types (likes, replies, retweets). This was done more for the purpose of standardization itself, but it could be helpful to our models.

- The second is when we did categorical data encoding. Our models will appreciate it if we feed them numerical data, as that's what they're good at crunching, so bridging the gap between categories and numbers should prove helpful during the modeling phase.

- The third is when we introduced the token count column during visualization. This is a case where we extracted a *property* of some column and turned that into another column (as opposed to standardization or encoding). These property-based new features could potentially provide more context to our models.

# Goodbye!

That wraps up our journey through the Group 9 dataset. We hope you enjoyed seeing the process evolve, and feel free to apply some of the ideas here in your own future works.

That's all from `<Team Name>` for now. Until next time! 👋