# **[CS 132] Group 29: Allegations Against the Aquinos**

## **Data Modeling and Data Communication using Google Colaboratory**

### **I. Preprocessing (Data Exploration Extension)**

In [None]:
# Import core libraries
import numpy as np
import scipy as sp
import pandas as pd
import calendar
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import statsmodels.api as sm

from copy import deepcopy
from datetime import datetime, timedelta
from scipy.signal import argrelextrema
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression

custom_params = {"axes.spines.right": False, "axes.spines.top": False}
sns.set_theme(style="darkgrid", rc=custom_params, palette="pastel")

In [None]:
main = pd.read_csv('https://raw.githubusercontent.com/cs132group29/allegations-surrounding-the-aquinos/main/Dataset%20-%20Group%2029.csv') # Retrieve Main Dataset from URL
master = pd.read_csv('https://raw.githubusercontent.com/cs132group29/allegations-surrounding-the-aquinos/main/Master-Dataset.csv') # Retrieve Master Dataset from URL

In [None]:
df_main = deepcopy(main) # Make a copy of the original dataset
df_master = deepcopy(master) # Make a copy of the master dataset

# Removes the rows containing null values
df_main = df_main[df_main['Timestamp'].notna()]
df_master = df_master[df_master['Timestamp'].notna()]

df_master.rename(columns={'Joined\nFormat:': 'Joined'}, inplace=True)

# Remove irrelevant columns
drop_list = ['ID', 'Keywords', 'Group', 'Category', 'Topic', 'Collector', 'Tweet Translated', 'Screenshot',
                'Quote Tweets', 'Views', 'Rating', 'Remarks', 'Reviewer', 'Review', 'Reasoning', 'Joined',
                  'Followers', 'Following', 'Location', 'Content type', 'Likes', 'Replies', 'Retweets',
                    'Account type', 'Timestamp', 'Tweet Type']

df_main.drop(drop_list, axis = 1, inplace = True)
df_master.drop(drop_list, axis = 1, inplace = True)

# Anonymize data
anonymity_list = ['Tweet URL','Account handle','Account name',]

df_main.drop(anonymity_list, axis = 1, inplace = True)
df_master.drop(anonymity_list, axis = 1, inplace = True)

df_master.rename(columns={'Date posted ': 'Date posted'}, inplace=True)

# Drop columns that contain any NaN entries
df_main.dropna(axis=1, how='any', inplace = True)
df_master.dropna(axis=1, how='any', inplace = True)

# Type casting
column_types = {
    'Date posted': 'datetime64[ns]',
}

df_main = df_main.astype(column_types)
df_master = df_master.astype(column_types)

# Merge the dataframes
df = pd.merge(df_main, df_master, how='outer')

# Drop duplicate rows
df = df.drop_duplicates(subset=['Tweet', 'Date posted'])

# Invalid parsing will be tagged as null
df['Date posted'] = pd.to_datetime(df['Date posted'], format='%d/%m/%y %H:%M', errors='coerce')

# Apply a uniform format
df['Date posted'] = df['Date posted'].dt.strftime('%d/%m/%y %H:%M')

# Reset the index of the merged dataframe
df = df.reset_index(drop=True)

# display(df)
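The de-duplication and date-coercion steps above can be illustrated on a tiny made-up table. This is only a sketch: the `toy` frame and its values are hypothetical stand-ins, not rows from the group's dataset, and it assumes the day-first `%d/%m/%y %H:%M` format used in the cell above.

```python
import pandas as pd

# Hypothetical stand-in for the merged tweet table (values are made up)
toy = pd.DataFrame({
    "Tweet": ["a", "a", "b", "c"],
    "Date posted": ["25/02/22 08:30", "25/02/22 08:30", "21/08/22 19:05", "not a date"],
})

# Drop rows that repeat the same tweet text and timestamp
toy = toy.drop_duplicates(subset=["Tweet", "Date posted"])

# Parse with an explicit day-first format; unparseable strings become NaT
toy["Date posted"] = pd.to_datetime(
    toy["Date posted"], format="%d/%m/%y %H:%M", errors="coerce"
)

n_rows = len(toy)                                   # rows left after de-duplication
n_invalid = int(toy["Date posted"].isna().sum())    # rows whose date failed to parse
print(n_rows, n_invalid)
```

Counting the `NaT` values after coercion is a quick way to see how many rows would silently drop out of a later time-series count.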

------------------------------------------------
### **II. Data Modeling and Analysis**

For the time-series analysis, we want to know whether the tweets are related to particular events. We therefore focus on the occurrence of tweets during the 2016-2022 timeframe.

In [None]:
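# Parse the day-first date strings written above, then keep only the calendar date
# (an explicit format avoids day/month being swapped for ambiguous dates)
df['date'] = pd.to_datetime(df['Date posted'], format='%d/%m/%y %H:%M').dt.normalize()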


# Filter out 2013 and 2023
df = df[(df['date'].dt.year != 2023) & (df['date'].dt.year != 2013)]

# Count the occurrences of similar dates
date_counts = df['date'].value_counts()

# Create a new DataFrame
df_reg = pd.DataFrame({'date': date_counts.index, 'count': date_counts.values})
df_reg = df_reg.sort_values('date')

df_reg['year'] = df_reg['date'].dt.year
df_reg['month'] = df_reg['date'].dt.month
df_reg['day'] = df_reg['date'].dt.day


# display(df_reg)

### **A. Monthly Support Vector Regression (SVR) and Linear Regression Model**

In [None]:
## Perform regression modeling
df_reg['Month_Year'] = df_reg['date'].dt.to_period('M')
# df_reg holds one row per date, so this counts the dates with tweets in each month
MY_counts = df_reg['Month_Year'].value_counts()
# Create a new DataFrame
MY_reg = pd.DataFrame({'date': MY_counts.index, 'count': MY_counts.values})
MY_reg = MY_reg.sort_values('date')

# Convert datetime to int
x = MY_reg['date'].astype(int) / 10**9  # Convert to seconds (UNIX epoch start)
x = x.values.reshape(-1, 1)

y = MY_reg['count']

#---------------------------------------------------------------------------
# Linear Regression and SVR Model with Hyperparameter Optimization
#---------------------------------------------------------------------------
# Statistical approach
x_lms = sm.add_constant(x)
linear_model_stat = sm.OLS(y, x_lms)
lms_results = linear_model_stat.fit()
p_values = lms_results.pvalues[1:]

# Machine learning approach (no p-values)
linear_model = LinearRegression()
linear_model.fit(x_lms, y)
y_linear_pred = linear_model.predict(x_lms)

# Split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Create a pipeline with StandardScaler and SVR
pipeline = Pipeline([('scaler', StandardScaler()),
                     ('svr', SVR(kernel='rbf'))])

# Define the parameter grid for hyperparameter optimization
#param_grid = {'svr__C': [10000, 1000000, 100],
#              'svr__gamma': [10, 1000, 100]}
param_grid = {'svr__C': [0.01,0.1,10],
              'svr__gamma': [0.1,1,10,100, 1000]}

# Perform grid search with cross-validation on train data
grid_search = GridSearchCV(pipeline, param_grid)


grid_search.fit(x_train, y_train)

# Predict using best model on test data
best_svr = grid_search.best_estimator_
y_svr_pred_test = best_svr.predict(x_test)

# Calculate R2 and RMSE for linear regression model
linear_r2 = r2_score(y, y_linear_pred)
linear_rmse = np.sqrt(mean_squared_error(y, y_linear_pred))

print("Model Evaluation")
print("\nLinear Regression: RMSE=%.2f, R2=%.2f" % (linear_rmse, linear_r2))
for name, p_value in p_values.items():
  print(f'P({name}): {p_value}')

# Calculate R2 and RMSE for SVR model
svr_r2 = r2_score(y_test, y_svr_pred_test)
svr_rmse = np.sqrt(mean_squared_error(y_test, y_svr_pred_test))

print("\nSupport Vector Regression: RMSE=%.2f, R2=%.2f" % (svr_rmse, svr_r2))
print("Hyperparameters:", best_svr, "\n")

if any(p_values <= 0.05):
  print("There is a significant relationship between the predictor and the response.\n")
else:
  print("There is no significant relationship between the predictor and the response.\n")

# Plot the model
xtt = MY_reg['date'].dt.strftime('%Y-%m-%d')
scatter_actual = go.Scatter(x=xtt, y=y, mode='markers', name='Actual', marker=dict(color='#FFE459', opacity=1))

line_regression = go.Scatter(x=xtt, y=y_linear_pred, mode='lines', name='LR', line=dict(color='#1B70C8', dash='dash'))

y_svr_pred = best_svr.predict(x) # Plot on all data
line_svr = go.Scatter(x=xtt, y=y_svr_pred, mode='lines', name='SVR', line=dict(color='#3691E9'))

data = [scatter_actual, line_regression, line_svr]

fig = go.Figure(data=data)

# Calculate the minimum and maximum dates
min_date = datetime.strptime(xtt.min(), "%Y-%m-%d")
max_date = datetime.strptime(xtt.max(), "%Y-%m-%d")

# Pad the x-axis range by 30 days on each side
min_date = min_date - timedelta(days=30)
max_date = max_date + timedelta(days=30)

fig.update_xaxes(
    range=[min_date.strftime("%Y-%m-%d"), max_date.strftime("%Y-%m-%d")],
    tickformat="%m - %Y",
    dtick="M3",
    ticklabelmode="period",
    #tickangle=90,
    tickfont=dict(size=18), # Size of x-axis labels (tweak this)
)

fig.update_layout(
    height=600,
    title=dict(text='<b>Twitter Aquino Allegations Monthly SVR and Linear Regression Model</b>', font=dict(size=30, color='#1B70C8')), # Size and color of title (tweak this)
    xaxis_title=dict(text='<b>Month - Year</b>', font=dict(size=20, color='#3691E9')), # Size of x-axis header (tweak this)
    yaxis_title=dict(text='<b>Number of Tweets</b>', font=dict(size=20, color='#3691E9')), # Size of y-axis header (tweak this)
    plot_bgcolor='#99CEFF' # Color of background (tweak this)
)

fig.update_layout(
    font=dict(
        family="Arial Black",
        size=18,
    ),
)
fig.show()

Model Evaluation

Linear Regression: RMSE=3.33, R2=0.38
P(x1): 5.563760949067641e-10

Support Vector Regression: RMSE=2.08, R2=0.72
Hyperparameters: Pipeline(steps=[('scaler', StandardScaler()), ('svr', SVR(C=10, gamma=10))]) 

There is a significant relationship between the predictor and the response.
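The RMSE and R2 values printed above follow the usual definitions. As a minimal sketch on made-up numbers (the arrays `y_true` and `y_pred` are hypothetical, not the group's data), they can be computed by hand like this:

```python
import numpy as np

# Made-up monthly counts and model predictions (illustrative only)
y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([3.0, 3.0, 7.0, 7.0])

residuals = y_true - y_pred
rmse = np.sqrt(np.mean(residuals ** 2))          # root mean squared error
ss_res = np.sum(residuals ** 2)                  # variation left unexplained by the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination

print(f"RMSE={rmse:.2f}, R2={r2:.2f}")  # prints RMSE=1.00, R2=0.80
```

Note that in the cell above the linear-regression metrics are computed on all months (the same data the model was fit on), while the SVR metrics are computed on the 20% held-out test split, so the two pairs of numbers are not directly comparable.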



### **Interpretation**

These models were used to look for a pattern in tweet volume over time. Looking at the monthly tweet counts, the number of tweets gradually increases from year to year, which suggests a **significant increase** in the Twitter **influencers/trolls/users** who contribute to the **spread of mis/disinformation**. There is also an **observable pattern** in the **SVR model**: the **tweet count** is at its **maximum** during the months of **April to August**.
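One way to sanity-check a seasonal pattern like the one described above is to total the daily counts per month and then average those totals by calendar month. The sketch below assumes a frame shaped like `df_reg` (a `date` column and a `count` column); the `toy_reg` values are made up for illustration and are not the group's data.

```python
import pandas as pd

# Made-up daily counts, shaped like df_reg (illustrative only)
toy_reg = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-02-25", "2021-08-20", "2021-08-21",
        "2022-02-25", "2022-05-10", "2022-08-21",
    ]),
    "count": [3, 2, 5, 4, 6, 7],
})

# Total tweets per month-year: sum the daily counts within each month
monthly_total = toy_reg.groupby(toy_reg["date"].dt.to_period("M"))["count"].sum()

# Average monthly total for each calendar month (1-12) across years
by_month = monthly_total.groupby(monthly_total.index.month).mean()
print(by_month)
```

Summing the `count` column gives tweets per month, whereas counting rows would give the number of dates with tweets, so the choice of aggregation matters when the y-axis is labeled as a number of tweets.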

Focusing on the **period with the greatest number of tweets** (2022), the following **major events** occurred in the Philippines.

***April-June 2022***: Many of the topics discussed on social media during the 2022 National Elections revolved around the Marcoses, the Robredos, and, more importantly, the Aquinos. Because these three families are historically linked, topics such as the EDSA Revolution and Ninoy Aquino's unresolved death and contribution to the nation re-emerged among the trending topics for Filipinos during this time.

***July 2022***: President Bongbong Marcos was set to deliver his first State of the Nation Address (SONA) at the Batasang Pambansa Complex. This major event brought discussion of Martial Law under Ferdinand Marcos back to the surface. Because the Aquinos were key players in toppling Marcos' dictatorial rule, hearsay and unproven claims about them tend to arise as well.

***August 2022***: As confirmed by the National Historical Commission of the Philippines (NHCP), Ninoy Aquino is declared a national hero. As a result, during the month of National Heroes' Day, there is heated debate on social media over whether Ninoy Aquino deserves the title. Unfortunately, many Twitter users rely on claims already deemed false to support their arguments.

However, this is just the tip of the iceberg! We still need to know why this influx of tweets occurs every year (why is there a pattern?). With that, we move on to the next analysis.

--------------------

### **B. Peak Event Detection**

In [None]:
#------------------------
# Peak Events Detection
#------------------------

# Convert the date column to seconds
x = df_reg['date'].astype(int) / 10**9  # Convert to seconds (UNIX epoch start)
x = x.values.reshape(-1, 1)
y = df_reg['count'].values

# Find local maxima
peaks = argrelextrema(y, np.greater)[0]

# Filter peaks based on their significance
significance_threshold = 3  # Adjust as needed
filtered_peaks = [peak for peak in peaks if y[peak] > significance_threshold * np.mean(y)]

# Extract event timestamps
events = df_reg.iloc[filtered_peaks]

# Plot peaks
fig = go.Figure()

# Original data
fig.add_trace(go.Scatter(
    x=df_reg['date'],
    y=df_reg['count'],
    hovertext=df_reg['date'].dt.strftime('%Y-%m-%d'),
    mode='lines',
    name='Tweet Data',
    line=dict(color='#1B70C8') # Color of lines
))

# Peaks
fig.add_trace(go.Scatter(
    x=events['date'],
    y=events['count'],
    hovertext=events['date'].dt.strftime('%Y-%m-%d'),
    mode='markers',
    name='Peaks',
    marker=dict(
        color='#FFE459', # Color of peaks (X mark)
        size=15, # Size of peaks (X mark)
        symbol='x'
    ))
)

# Arrows
for date, count in zip(events['date'], events['count']):
    fig.add_annotation(
        x=date,
        y=count,
        text=date.strftime('%b, %d, %Y'),
        showarrow=True,
        arrowhead=2,
        arrowsize=1,
        arrowwidth=2,
        arrowcolor='black', # Color of arrows
        ax=20,
        ay=-30,
        font=dict(size=14, family='Arial Black') # Size and font of annotation labels
    )

fig.update_layout(
    height=600,
    title=dict(text='<b>Twitter Aquino Allegations Peak Events</b>', font=dict(size=30, color='#1B70C8')), # Size and color of title
    xaxis_title=dict(text='<b>Year (2016-2022)</b>', font=dict(size=20, color='#3691E9')), # Size of x-axis header
    yaxis_title=dict(text='<b>Number of Tweets</b>', font=dict(size=20, color='#3691E9')), # Size of y-axis header
    plot_bgcolor='#99CEFF' # Color of background (tweak this)
)

fig.update_xaxes(
    tickformat="%Y",
    dtick="M12",
    ticklabelmode="period",
    ticklen=10,
    tickfont=dict(size=18,family="Arial Black") # Size of x-axis labels
)

fig.update_yaxes(
    tickfont=dict(size=18,family="Arial Black") # Size of y-axis labels
)

fig.show()

### **Interpretation**

For peak detection, we first locate the local maxima of the daily tweet counts using ```argrelextrema()```, then keep only the peaks whose count exceeds ```significance_threshold = 3``` times the mean of the daily counts. Plotting the remaining peaks shows that the following dates have major implications for the occurrence of mis/disinformation tweets regarding the Aquinos:

*   **February 25, 2020** - EDSA Revolution Anniversary
*   **August 21, 2020** - Ninoy Aquino Day
*   **February 26, 2021** - EDSA Revolution Anniversary
*   **August 21, 2021** - Ninoy Aquino Day
*   **January 25, 2022** - Cory Aquino's Birthday
*   **February 25, 2022** - EDSA Revolution Anniversary
*   **May 10, 2022** - Philippine Elections
*   **August 21, 2022** - Ninoy Aquino Day

This implies that mis/disinformation tweets **increase significantly** whenever **one of these major events** occurs. It is likely that there is an **underlying connection** that drives trolls to **stir or distort history** whenever the **topics/controversies on the Aquinos** are once more debated on social media.
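The peak filter described in this section can be sketched on a small made-up series. To stay dependency-light, the sketch uses a NumPy-only strict local-maximum test in place of `argrelextrema(y, np.greater)`; for a 1-D array with the default neighborhood of one point on each side, it is expected to select the same indices, but that equivalence is an assumption of this sketch. The array `y` is hypothetical.

```python
import numpy as np

# Made-up daily tweet counts (illustrative only)
y = np.array([1, 2, 1, 1, 30, 1, 2, 4, 2, 1, 1, 25, 1, 1, 3, 1])

# Strict local maxima: interior points larger than both neighbors
interior = y[1:-1]
is_peak = (interior > y[:-2]) & (interior > y[2:])
peaks = np.flatnonzero(is_peak) + 1   # +1 shifts back to indices of y

# Keep only peaks above 3x the mean count, as in the notebook's filter
significance_threshold = 3
cutoff = significance_threshold * y.mean()
filtered_peaks = peaks[y[peaks] > cutoff]

print(peaks.tolist())           # all local maxima
print(filtered_peaks.tolist())  # peaks that pass the significance cutoff
```

Because the comparison is strict, a plateau of equal neighboring values is not reported as a peak, and because the cutoff scales with the mean, a single very large day raises the bar for every other day.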

### **III. Data Communication**

**Introduction** \\
Social media is quoted as the “primary platform for politics,” making it one of the most active mediums for publicity and political discourse.

Filipinos highly trust the variety of available information in social media platforms, causing a great divide in public opinion. There is an existing influx of misinformation that distorts the criteria for legitimate sources and amplifies the existence of echo chambers, which contain biased and misleading information.

Hence, our team investigated and analyzed the disinformation currently present on social media in our study titled "Decoding Dates: A Time Series Analysis on the Deceptions Bound to 'Dilawan'".

**Materials and Methods** \\
The problem that we have identified is that, despite the Aquinos being one of the most prominent families in the Philippines for their political involvement and significant influence on Filipino democracy, their image and legacy are facing a downturn as they are tampered with and manipulated through disinformation.

This study aims to shed light on the impact of recent events, such as the 2016 and 2022 Elections, EDSA People Power Anniversaries, and Ninoy Aquino Day, on the spread of the allegations surrounding the Aquinos. The study is limited to relevant tweets from 2016-2022.

In line with this problem, we have:

*   **Null hypothesis**: The recent events had no significant effect on the number of allegation tweets against the Aquinos.
*   **Alternative hypothesis**: The recent events had a significant effect on the number of allegation tweets against the Aquinos.

These are the methods that we used to determine whether the null hypothesis is rejected:

1.   Our group’s first step was to use a data scraper and Twitter’s advanced search feature to gather a pool of tweets containing keywords such as Ninoy traitor, Ninoy not a hero, Aquino murderer, corrupt, and NPA.
2.   From the gathered data, each group member identified at least 50 allegation tweets that were verified as mis/disinformation by trusted fact-checking organizations such as VERA Files, Rappler, FactRakers, and Agence France-Presse. As part of this step, we ensured that all collected data was clean and labeled appropriately.
3.   We also took tweets containing misinformation about the Aquinos from the Master Dataset so that our findings and results would be more reliable. These tweets came from Groups 35, 37, 46, and 59. We manually checked that they were clean and labeled correctly.
4.   Next, in the data exploration part, the group accumulated a dataset of 722 tweets, which was reduced to 713 after preprocessing. This step included handling missing values, handling outliers, ensuring formatting consistency, and normalization/standardization/scaling. It also included preparation for the time series analysis, namely counting the number of tweets per day, month, and year. All of these preprocessing steps are necessary to ensure that the data is ready for modeling and analysis.
5.   Finally, we proceeded to statistical analysis and machine learning. For this, we used peak point detection and the SVR and Linear Regression models to find trends in the tweets.
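The significance check in the last step relies, in the notebook, on the p-value of an ordinary least squares fit. As a dependency-light illustration of the same idea, the sketch below uses a permutation test on the fitted slope instead. This is a swapped-in technique, not the one the group ran, and the data, the helper `fit_slope`, and the number of permutations are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic monthly series with an upward trend (illustrative only)
x = np.arange(24, dtype=float)                    # month index
y = 0.5 * x + rng.normal(0.0, 1.0, size=x.size)   # counts with noise

def fit_slope(x, y):
    """Least-squares slope of y on x (hypothetical helper)."""
    return np.polyfit(x, y, 1)[0]

observed = fit_slope(x, y)

# Under the null hypothesis of no time trend, shuffling y should give
# slopes as extreme as the observed one reasonably often
n_perm = 2000
perm_slopes = np.array([fit_slope(x, rng.permutation(y)) for _ in range(n_perm)])

# Two-sided p-value with the usual +1 correction
p_value = (np.sum(np.abs(perm_slopes) >= abs(observed)) + 1) / (n_perm + 1)
reject_null = p_value <= 0.05

print(f"slope={observed:.3f}, p={p_value:.4f}, reject null: {reject_null}")
```

Either approach only tests whether the count changes with time; it does not by itself attribute that change to specific events.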

**Results and Discussion** \\
With the SVR and Linear Regression models, looking at the monthly tweet counts, the number of tweets gradually increases from year to year, which suggests a **significant increase** in the Twitter **influencers/trolls/users** who contribute to the **spread of mis/disinformation**. There is also an **observable pattern** in the **SVR model**: the **tweet count** is at its **maximum** during the months of **April to August**.

Focusing on the **period with the greatest number of tweets** (2022), the following **major events** occurred in the Philippines.

***April-June 2022***: Many of the topics discussed on social media during the 2022 National Elections revolved around the Marcoses, the Robredos, and, more importantly, the Aquinos. Because these three families are historically linked, topics such as the EDSA Revolution and Ninoy Aquino's unresolved death and contribution to the nation re-emerged among the trending topics for Filipinos during this time.

***July 2022***: President Bongbong Marcos was set to deliver his first State of the Nation Address (SONA) at the Batasang Pambansa Complex. This major event brought discussion of Martial Law under Ferdinand Marcos back to the surface. Because the Aquinos were key players in toppling Marcos' dictatorial rule, hearsay and unproven claims about them tend to arise as well.

***August 2022***: As confirmed by the National Historical Commission of the Philippines (NHCP), Ninoy Aquino is declared a national hero. As a result, during the month of National Heroes' Day, there is heated debate on social media over whether Ninoy Aquino deserves the title. Unfortunately, many Twitter users rely on claims already deemed false to support their arguments.

However, this is just the tip of the iceberg! We still need to know why this influx of tweets occurs every year (why is there a pattern?). With that, we move on to the next analysis.

For peak detection, we first locate the local maxima of the daily tweet counts using ```argrelextrema()```, then keep only the peaks whose count exceeds ```significance_threshold = 3``` times the mean of the daily counts. Plotting the remaining peaks shows that the following dates have major implications for the occurrence of mis/disinformation tweets regarding the Aquinos:

*   **February 25, 2020** - EDSA Revolution Anniversary
*   **August 21, 2020** - Ninoy Aquino Day
*   **February 26, 2021** - EDSA Revolution Anniversary
*   **August 21, 2021** - Ninoy Aquino Day
*   **January 25, 2022** - Cory Aquino's Birthday
*   **February 25, 2022** - EDSA Revolution Anniversary
*   **May 10, 2022** - Philippine Elections
*   **August 21, 2022** - Ninoy Aquino Day



**Implications** \\
Now, this implies that there is a **significant increase** in mis/disinformation tweets whenever major events related to the Aquinos, a prominent political family in the Philippines, occur. It suggests that there may be an **underlying connection** driving trolls to manipulate or distort historical events whenever discussions or controversies surrounding the Aquinos resurface on social media.

These mis/disinformation campaigns often surge during major events associated with the Aquinos, such as anniversaries of historical moments or political developments. These campaigns are driven by various motivations, including political agendas, personal biases, or attempts to shape public opinion. Furthermore, trolls engage in spreading mis/disinformation by stirring or distorting history, particularly events linked to the Aquinos. Their aim is to alter public perceptions, discredit the Aquinos, or incite controversy. Historical revisionism also plays a role, as some individuals or organizations seek to promote alternative narratives or downplay aspects of the Aquinos' legacy.

**Conclusion** \\
In summary, the results show how recent events established a connection with the Aquinos, which in turn increased the volume of allegation tweets against them. The Monthly SVR and Linear Regression Model showed that the allegation tweets increased linearly over time, implying a significant increase in Twitter users who contribute to the influx of mis/disinformation.

The identified peak events turned out to be the dates of relevant events such as the People Power Revolution Anniversary, Ninoy Aquino Day, Cory Aquino’s Birthday, and the 2022 National Elections. Thus, these are the top contributors to the volume of allegation tweets.

This calls for more rigorous and extensive fact-checking, especially during these important events that reflect our political views, not just as individuals but in our decisions as a democratic country. We should be more vigilant, as the battle against mis/disinformation will never end. However, as long as journalists, data scientists, and we exist, we will do whatever it takes for truth to prevail against this sea of mis/disinformation.

With that, this concludes the study that we have conducted about the Deceptions Bound to "Dilawan".