<img src="../images/GA-logo.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs and NLP

**Primary Objectives:**

1. Scrape tweets from two Twitter accounts
2. Use NLP to train a classifier to predict the account a given tweet comes from (i.e. binary classification)


----

## Problem Statement

Given an unlabelled tweet, can we tell which of 2 political parties' Twitter accounts it came from? What are the factors that best differentiate the 2 parties' tweets?

Political analysts frequently need to dissect and analyse the stances that different political parties take on similar issues. A classifier that surfaces the features separating the parties' tweets would let analysts supplement their qualitative analysis with quantitative evidence.

We will build an NLP binary classifier with both explanatory and predictive power to predict which of 2 Singapore political parties' accounts a given tweet comes from, namely @PAPSingapore or @WPSG.

----

## Background & External Research

Political parties tend to talk about similar things, e.g., government policy and national security. Based on personal experience, anecdotal evidence and ChatGPT, the tweets from different political parties (loosely based on those in the U.S.) can differ in a few key aspects despite being similar in content, namely:
- **Messaging** - messaging usually reflects the party's ideology, values and priorities. For instance, conservative parties may emphasise lower taxes, greater individual freedom and smaller government (e.g., Republicans in the U.S.) while liberal parties may emphasise social justice and government intervention (e.g., Democrats in the U.S.). The type of messaging can also be proxied by the hashtags commonly used by the account
- **Tone** - this can range from aggressive to conciliatory even if the content is similar (e.g., talking about the same topic), depending on the party's political objective
- **Content Type** - Depending on the target audience, political parties may use different types of content that resonate better with the intended target audience. For example, political parties that tend to target younger voters may use more trendy language and memes.
- **Frequency** - some political parties may tweet more often if the majority of their voters are on Twitter. We will not use this in our classifier, as frequency is a property of an account rather than of an individual tweet, and the problem statement asks us to classify individual tweets.
- **Level of Engagement** - some political parties may more readily engage their followers by replying to comments, mentioning users or retweeting to amplify messages from supporters.

These key aspects may be present in the tweets of Singapore's political parties as well, albeit to different extents compared to the U.S. As such, we may be able to differentiate the tweets from different political parties by looking at these aspects of their tweets.
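
As a rough illustration of how some of these aspects could be turned into per-tweet signals, the sketch below pulls hashtags (a proxy for messaging) and mentions and retweet markers (proxies for engagement) out of a tweet's text. The function name `surface_features`, the regular expressions and the sample tweet are all assumptions made for illustration and are not part of this project's code.

```python
import re

# Simple patterns; real tweets may need more careful handling (e.g. e-mail
# addresses would also match the mention pattern).
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def surface_features(text):
    """Return a few surface-level signals for a single tweet's text."""
    return {
        "hashtags": [tag.lower() for tag in HASHTAG_RE.findall(text)],
        "n_mentions": len(MENTION_RE.findall(text)),
        "is_retweet": text.startswith("RT @"),
    }


# Made-up sample tweet, not taken from either account
example = surface_features("RT @someone: Budget debate today #SGBudget #Parliament")
```

Tone and content type are harder to capture with simple patterns like these and are more likely to show up through the words the classifier learns to weight.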

To test this hypothesis, we choose Singapore's 2 largest political parties (by number of seats in Parliament) and examine their tweets:

|  |  |
|-------------|--------------|
| <div style="text-align:center;"><img src="../images/twitter_pap.jpg" alt="PAP Logo" width="150" height="150" style="display: block; margin: 0 auto;"><br>**People's Action Party (PAP)**<br>[@PAPSingapore](https://twitter.com/PAPSingapore)</div> | <div style="text-align:center;"><img src="../images/twitter_wp.jpg" alt="WP Logo" width="150" height="150" style="display: block; margin: 0 auto;"><br>**The Workers' Party (WP)**<br>[@WPSG](https://twitter.com/wpsg?lang=en)</div> |


----

## Data Scraping

There are many tools for scraping Twitter; for ease of implementation, we use the [snscrape](https://github.com/JustAnotherArchivist/snscrape) package to scrape the tweets from these 2 accounts. A guide to scraping Twitter with snscrape can be found [here](https://datasciencedojo.com/blog/scrape-twitter-data-using-snscrape/).

We scrape the same time period for both accounts (1 Jan 2010 to 31 Dec 2022) so that the classifier cannot separate the accounts simply because their tweets were sampled from different periods.
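
The helpers `get_query` and `scrape_tweets` used in the cells that follow live in the local `data_cleaning` module, which is not reproduced in this notebook. The sketch below shows one way such helpers might be written with snscrape's `TwitterSearchScraper`. It is a guess at their shape, not the project's actual code: the query format, the `tweets_to_frame` helper and the chosen columns are all assumptions.

```python
import pandas as pd

# Assumed output columns; the real helper may keep different fields.
COLUMNS = ["date", "id", "content", "username"]


def get_query(account, since, until):
    """Build a Twitter search query for one account and date window."""
    return f"from:{account} since:{since} until:{until}"


def tweets_to_frame(tweets):
    """Collect selected attributes of tweet-like objects into a DataFrame."""
    rows = []
    for t in tweets:
        # Newer snscrape releases expose the text as `rawContent`;
        # older ones use `content`. Fall back if the first is missing.
        text = getattr(t, "rawContent", None)
        if text is None:
            text = getattr(t, "content", None)
        rows.append(
            {"date": t.date, "id": t.id, "content": text, "username": t.user.username}
        )
    return pd.DataFrame(rows, columns=COLUMNS)


def scrape_tweets(account, since, until):
    """Scrape all tweets from `account` in the given window (needs network access)."""
    # Imported here so the pure helpers above work without snscrape installed.
    import snscrape.modules.twitter as sntwitter

    print(f"Starting scrape of Twitter account {account}...")
    scraper = sntwitter.TwitterSearchScraper(get_query(account, since, until))
    df = tweets_to_frame(scraper.get_items())
    print(f"Scraped a total of {len(df)} tweets.")
    return df
```

Twitter documents the `until:` operator as matching tweets sent before the given date, so if the real helper builds its query this way, tweets posted on 31 Dec 2022 itself would fall outside the window.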

In [2]:
import snscrape.modules.twitter as sntwitter
import numpy as np
import pandas as pd

from data_cleaning import get_query, scrape_tweets

In [2]:
# define account names to scrape
account_pap = 'PAPSingapore'
account_wp = 'WPSG'

# define time period to scrape
since = '2010-01-01'
until = '2022-12-31'

In [3]:
%%time
# Scrape the PAPSingapore Twitter account
# NOTE: run this cell only if you intend to scrape; the process can take a few minutes
tweets_pap_raw = scrape_tweets(account_pap, since, until)

Starting scrape of Twitter account PAPSingapore...
Scraped a total of 2762 tweets.
CPU times: total: 781 ms
Wall time: 2min 11s


In [4]:
%%time
# Scrape the WPSG Twitter account
# NOTE: run this cell only if you intend to scrape; the process can take a few minutes
tweets_wp_raw = scrape_tweets(account_wp, since, until)

Starting scrape of Twitter account WPSG...
Scraped a total of 2654 tweets.
CPU times: total: 766 ms
Wall time: 1min 59s


### Export Raw Data

In [5]:
tweets_pap_raw.to_csv('../datasets/tweets_pap_raw.csv')

In [6]:
tweets_wp_raw.to_csv('../datasets/tweets_wp_raw.csv')
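
Because `to_csv` is called here without `index=False`, pandas writes the row index as an unnamed first column alongside the data. The sketch below shows how such a file can be read back without picking up a stray `Unnamed: 0` column. It uses a tiny in-memory stand-in rather than the real exports, and the column name `content` is an assumption about what the scraped frames contain.

```python
import io

import pandas as pd

# Tiny stand-in for a scraped frame; the real columns depend on scrape_tweets.
demo = pd.DataFrame({"content": ["first tweet", "second tweet"]})

# Write to an in-memory buffer with the same call style as the export cells.
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the saved index instead of adding an "Unnamed: 0" column.
restored = pd.read_csv(buffer, index_col=0)
```

The same `index_col=0` argument would apply when loading `../datasets/tweets_pap_raw.csv` and `../datasets/tweets_wp_raw.csv` in a later notebook.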