# Capstone Project: Life in the "New Normal"
---

#### Organization of Project Notebooks:
- **Notebook #1: Problem Statement & Webscraping (current notebook)**
- Notebook #2: [Data Cleaning & Exploratory Data Analysis](./02_data_cleaning_and_eda.ipynb)
- Notebook #3: Preprocessing & Modelling
    - Notebook #3-1: [Preprocessing & Sentiment Analysis](./03-1_preprocessing_and_sentiment_analysis.ipynb)
    - Notebook #3-2: [Building Pre-trained LSTM RNN Model for Sentiment Analysis](./03-2_building_pretrained_lstm_model.ipynb) 
- Notebook #4: [Topic Modelling & Conclusion](./04_topic_modelling_and_conclusion.ipynb)
- Notebook #5: [Plotly & Dash Visualization](./05_plotly_and_dash.ipynb)

## Notebook #1: Problem Statement & Webscraping

### Contents
1. [Background](#1.-Background)
2. [Problem Statement](#2.-Problem-Statement)
3. [Executive Summary](#3.-Executive-Summary)
4. [Data Dictionary](#4.-Data-Dictionary)
5. [Webscraping](#5.-Webscraping)

### 1. Background

It has been more than a year since the COVID-19 pandemic started. The Ministry of Health (MOH) announced the first imported case of "novel coronavirus" infection in Singapore on 23 January 2020 (source: [Yong, Jan 2021](https://www.channelnewsasia.com/singapore/covid-19-pandemic-singapore-one-year-on-coronavirus-419301)). Since then, several aspects of life in the country have changed, such as the economy, working lifestyle and social fabric. To curb the rising number of confirmed COVID-19 cases in the community, the government of Singapore has introduced measures that include:
- Travel restrictions, i.e. unable to travel overseas for leisure.
- Going into isolation, i.e. circuit breaker, Phases 1 and 2 where dining in at coffee shops/restaurants is not allowed.
- Working from home and online home-based learning.
- Rollout of COVID-19 vaccinations.

In the same timeframe, an increasing number of people in Singapore have sought help for mental health issues (source: [Ang & Phua, Apr 2020](https://www.channelnewsasia.com/singapore/covid-19-fear-toll-mental-health-hotline-anxiety-singapore-763336)). Concerns over the rising number of COVID-19 cases in the community, loss of employment, reduction in income, burnout from work, separation from loved ones, etc. have left some people feeling more anxious and uneasy. At the same time, other groups of people are seemingly unaffected by the COVID-19 pandemic – life goes on normally for them, with great work-life balance and luxurious experiences (source: [Warren, Mar 2021](https://www.insider.com/singapore-rich-spending-travel-luxury-hotels-pandemic-2021-3)).

### 2. Problem Statement

With social media becoming more prevalent in our daily lives, more and more people are turning to it to share their lives and express their opinions. While there have been studies on how people reacted to COVID-19 at the height of the pandemic, there is limited understanding of how people in Singapore are adapting to life in the "new normal" – a pandemic-filled era.

Therefore, this project aims to leverage tweets in Singapore from 1 January 2021 to 31 July 2021 to answer the following questions:
- What is the sentiment expressed in Singapore tweets?  
- What are the common topics of discussion on Twitter?
- How have the sentiment and topics of discussion changed over the past seven months in Singapore? 
- What can the government of Singapore do to help members of the public to better cope with the pandemic?

### 3. Executive Summary

*INTRODUCTION*

When the first COVID-19 case in Singapore was announced on 23 January 2020, most people expected the COVID-19 pandemic to last for a couple of months, given its resemblance to the SARS (Severe Acute Respiratory Syndrome) virus outbreak in 2003. Fast forward to July 2021, and the media is still reporting new COVID-19 cases in the community as well as new variants of COVID-19, constantly reminding us to take precautions when carrying out our day-to-day activities.

With our lives becoming more intertwined with social media (where we openly express our opinions on online platforms to people whom we have never met before), I'm keen to understand how people are reacting and adapting to life in the "new normal". The social media site that will be explored in this project is Twitter, which is a 'microblogging' platform that allows users to send and receive short posts (of up to 280 characters) called tweets (source: [UKRI, Aug 2021](https://www.ukri.org/councils/esrc/impact-toolkit-for-economic-and-social-sciences/how-to-use-social-media/choosing-what-social-media-you-use/)).


*METHODOLOGY*

A data science workflow was followed to conduct this analysis. Firstly, the problem statement was defined: I would like to understand the sentiment expressed in Singapore tweets, the common topics of discussion, how these aspects changed over the past seven months and what the government of Singapore could do to help members of the public better cope with the pandemic.

Next, tweets from 1 January 2021 to 31 July 2021 located near Singapore were extracted via webscraping using Twint. An exploratory data analysis was conducted to understand the distribution of the tweets over the seven months as well as any patterns and trends associated with the number of likes, replies, retweets, mentions, hashtags, etc. New features such as the character length and word count of the tweets were engineered, and relationships were visualized with a series of bar charts, histograms and boxplots. Commonly occurring one-word, two-word and three-word phrases were also identified and visualized using bar charts and word clouds.
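The feature engineering and phrase counting described above can be sketched with the standard library alone. This is a minimal illustration, not the code used in the next notebook: the function names `tweet_features` and `ngram_counts`, the whitespace tokenisation and the two sample tweets are all assumptions made for the example.

```python
from collections import Counter


def tweet_features(text):
    """Return the character length and whitespace-delimited word count of a tweet."""
    return {"char_length": len(text), "word_count": len(text.split())}


def ngram_counts(texts, n):
    """Count n-word phrases across texts (lowercased, split on whitespace)."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # slide a window of n tokens over the tweet; short tweets yield nothing
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


# made-up sample tweets, for illustration only
sample_tweets = [
    "Stay safe and stay home",
    "Please stay safe everyone",
]

features = [tweet_features(t) for t in sample_tweets]
unigrams = ngram_counts(sample_tweets, 1)
bigrams = ngram_counts(sample_tweets, 2)
print(features[0])
print(bigrams.most_common(1))
```

Calling `ngram_counts` with `n` set to 1, 2 or 3 gives the one-word, two-word and three-word phrase counts that feed the bar charts and word clouds.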

To determine the sentiment of the tweets, Natural Language Processing (NLP) techniques and various packages were utilised, such as VADER (Valence Aware Dictionary and sEntiment Reasoner), TextBlob and a pre-trained Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) model. As the tweets were unlabelled, VADER and TextBlob could be employed to predict the tweets' sentiment without any training data. The LSTM RNN was slightly trickier: the model was first trained on labelled tweets from Sentiment140 (source: [Sentiment140, 2021](http://help.sentiment140.com/for-students)) and obtained an accuracy score of 79.6%. It was then used to predict the sentiment of Singapore tweets in the odd months (January, March, May and July). These predictions served as labels for training a new LSTM RNN model, which achieved a higher accuracy score of 86.4%. The new model was then used to predict the sentiment of Singapore tweets in the even months (February, April and June).
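The odd-month and even-month split used for the LSTM RNN step can be made explicit with a short sketch. It only shows how the seven monthly datasets are divided between the two stages; the names `MONTHS` and `split_by_parity` are assumptions for illustration, and no model is trained here.

```python
# month abbreviations as used in the dataset names (sg_tweets_<month>_2021)
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun", "jul"]


def split_by_parity(months):
    """Split a January-first list of months into odd and even calendar months."""
    odd = months[0::2]   # positions 0, 2, 4, ... are months 1, 3, 5, ...
    even = months[1::2]  # positions 1, 3, 5, ... are months 2, 4, 6, ...
    return odd, even


odd_months, even_months = split_by_parity(MONTHS)

# datasets labelled by the Sentiment140-trained model, then used to train the new model
training_sets = [f"sg_tweets_{m}_2021" for m in odd_months]
# datasets whose sentiment is predicted by the new model
prediction_sets = [f"sg_tweets_{m}_2021" for m in even_months]
print(training_sets)
print(prediction_sets)
```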

For topic modelling, Latent Dirichlet Allocation (LDA) from Gensim was employed to identify the optimal number of topics in the tweets and the content surrounding them, to understand what is commonly discussed on Twitter. Relationships between the topics and their keywords, as well as between the topics and their sentiment, were visualized. Concurrently, external research was carried out on announcements and events in Singapore and notable events around the world over the past seven months, to understand factors that could have affected the sentiment labels.

Finally, a dashboard featuring some of the visualizations was created using Plotly and Dash, allowing for easier reference to trends, sentiment and topics of Singapore tweets from January to July 2021.


*FINDINGS*

In the first seven months of 2021, the overall proportion of positive tweets (64.9%) is close to twice that of negative tweets (35.1%), a ratio of about 1.85. This indicates that despite the prevalence of the COVID-19 virus and various social issues, the community seems to have remained positive.

The dominant topics in the tweets include discussions related to COVID-19 cases, clusters and vaccinations, recreational activities in Singapore, the military coup in Myanmar, and even political affairs in the United States of America.

### 4. Data Dictionary

|Feature|Type|Dataset|Description|
|---|---|---|---|
|id|integer|all tweets datasets|id of each tweet|
|date|object|all tweets datasets|date and time each tweet is created|
|tweet|object|all tweets datasets|content of each tweet|
|language|object|all tweets datasets|language of the tweet|
|hashtags|object|all tweets datasets|hashtags present in each tweet|
|user_id|integer|all tweets datasets|user id of each Twitter user|
|username|object|all tweets datasets|username of each Twitter user|
|nlikes|integer|all tweets datasets|number of likes that each tweet receives|
|nreplies|integer|all tweets datasets|number of replies that each tweet receives|
|nretweets|integer|all tweets datasets|number of retweets that each tweet receives|
|near|object|all tweets datasets|location detected from each tweet posting|

*Note: 'all tweets datasets' refers to the following datasets:*
- *sg_tweets_jan_2021*
- *sg_tweets_feb_2021*
- *sg_tweets_mar_2021*
- *sg_tweets_apr_2021*
- *sg_tweets_may_2021*
- *sg_tweets_jun_2021*
- *sg_tweets_jul_2021*
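A quick way to confirm that a scraped file matches the data dictionary is to compare its header against the expected feature names. The sketch below uses only the standard library; the helper name `missing_columns` and the one-row sample are made up for illustration, and the empty first header field stands in for the unnamed index column that pandas writes by default.

```python
import csv
import io

# feature names from the data dictionary above
EXPECTED_COLUMNS = ["id", "date", "tweet", "language", "hashtags", "user_id",
                    "username", "nlikes", "nreplies", "nretweets", "near"]


def missing_columns(csv_file, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    return [name for name in expected if name not in header]


# made-up one-row sample standing in for a scraped file (not real data)
sample = io.StringIO(
    ",id,date,tweet,language,hashtags,user_id,username,nlikes,nreplies,nretweets,near\n"
    "0,1,2021-01-01 00:00:00,hello,en,[],42,someone,0,0,0,Singapore\n"
)
missing = missing_columns(sample)
print(missing)
```

For a real dataset, the same function could be given an open file handle, e.g. `open('../datasets/sg_tweets_jan_2021.csv', newline='')`.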

### 5. Webscraping

To collect data from Twitter, the options available include [Tweepy](https://www.tweepy.org/), a Python client for the official Twitter API, and [Twint](https://github.com/twintproject/twint), a scraping tool that does not use the official API. Twint is selected for this project, as it can retrieve fields from Twitter without a Twitter Developer account, and without restrictions on the number of tweets that can be extracted during a time period.

Using Twint, the following information for a tweet is collected:
- ID
- Date
- Content
- Language
- Hashtags
- User ID 
- User Name
- Number of likes 
- Number of replies
- Number of retweets
- Location (represented by 'near')

In [66]:
# import the relevant packages
import pandas as pd

import twint
import nest_asyncio
nest_asyncio.apply()

After importing the relevant packages, tweets that are located near 'Singapore' from 1 January 2021 to 31 July 2021 will be extracted.

In [54]:
# instantiate twint Config
c = twint.Config()

# derived from twint configuration write-up
tweet_columns = ['id', 'date', 'tweet', 'language', 'hashtags', 'user_id', 
                 'username', 'nlikes', 'nreplies', 'nretweets', 'near']

In [55]:
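# create a function to extract tweets and save them to a CSV file
def extract_tweets(start_date, end_date, file_location):
    c.Lang = 'en'           # English-language tweets only
    c.Since = start_date    # start date, 'YYYY-MM-DD'
    # end date, 'YYYY-MM-DD'; assumption: the end date itself may be excluded,
    # as with Twitter's `until:` search operator
    c.Until = end_date
    c.Near = 'Singapore'    # tweets posted near Singapore
    c.Pandas = True         # store results in twint's pandas output
    c.Hide_output = True    # do not print each tweet while scraping
    twint.run.Search(c)
    # keep only the columns listed in tweet_columns
    tweets_df = twint.output.panda.Tweets_df[tweet_columns]
    tweets_df.to_csv(file_location)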

In [63]:
# extract January 2021 tweets
extract_tweets('2021-01-01', '2021-01-31', '../datasets/sg_tweets_jan_2021.csv')

[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.


In [62]:
# extract February 2021 tweets
extract_tweets('2021-02-01', '2021-02-28', '../datasets/sg_tweets_feb_2021.csv')

[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.


In [59]:
# extract March 2021 tweets
extract_tweets('2021-03-01', '2021-03-31', '../datasets/sg_tweets_mar_2021.csv')

[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.


In [58]:
# extract April 2021 tweets
extract_tweets('2021-04-01', '2021-04-30', '../datasets/sg_tweets_apr_2021.csv')

[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.


In [57]:
# extract May 2021 tweets
extract_tweets('2021-05-01', '2021-05-31', '../datasets/sg_tweets_may_2021.csv')

[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.


In [65]:
# extract June 2021 tweets
extract_tweets('2021-06-01', '2021-06-30', '../datasets/sg_tweets_jun_2021.csv')

[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.


In [64]:
# extract July 2021 tweets
extract_tweets('2021-07-01', '2021-07-31', '../datasets/sg_tweets_jul_2021.csv')

[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
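The seven extraction cells differ only in their dates and file names, so the arguments could also be generated in a loop. The sketch below builds the same (start date, end date, file path) triples with the standard library; `monthly_jobs` and `MONTH_ABBR` are hypothetical names introduced for illustration, and the call to `extract_tweets` is left as a comment because it needs Twint and network access.

```python
import calendar

MONTH_ABBR = ("jan", "feb", "mar", "apr", "may", "jun",
              "jul", "aug", "sep", "oct", "nov", "dec")


def monthly_jobs(year, first_month, last_month, folder="../datasets"):
    """Build (start_date, end_date, file_location) triples, one per month."""
    jobs = []
    for month in range(first_month, last_month + 1):
        # monthrange returns (weekday of first day, number of days in month)
        last_day = calendar.monthrange(year, month)[1]
        start = f"{year}-{month:02d}-01"
        end = f"{year}-{month:02d}-{last_day:02d}"
        path = f"{folder}/sg_tweets_{MONTH_ABBR[month - 1]}_{year}.csv"
        jobs.append((start, end, path))
    return jobs


jobs = monthly_jobs(2021, 1, 7)
for start, end, path in jobs:
    print(start, end, path)
    # in this notebook, each triple could be passed to the function defined earlier:
    # extract_tweets(start, end, path)
```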


The data from this section will be explored in the [next notebook](./02_data_cleaning_and_eda.ipynb).