# Project Proposal

### Problem Statement

What are the main opinions expressed in tweets related to the 2020 US election? What are people talking about on Twitter in the two weeks before election day?

### Context

An "October surprise", according to Wikipedia, is U.S. political jargon for a news event that may influence the outcome of an upcoming November election (particularly one for the U.S. presidency), whether deliberately planned or spontaneously occurring. Twitter is a great source of unfiltered conversations, opinions, and news events posted directly by individuals, in contrast to the filtered news provided by media outlets. This project analyses the sentiments in tweets posted during the two weeks before election day to identify the main topics people are talking about and any October surprises.

### Criteria For Success

Achieve at least 75% accuracy in predicting the topics of the tweets.

### Scope of Solution Space

The solution scope is limited to analysing tweets related to the 2020 US election. The modeling section focuses on training and evaluating deep learning models on the text data. While labeling the tweets is required as part of the project, improving the performance of the labeling job is out of scope.

### Solution Approach
* Collect data from Twitter using relevant keywords
* Apply Natural Language Processing (NLP) techniques to clean and preprocess the data
* Use distributed computing to preprocess and label the data set
* Train and evaluate SimpleRNN and LSTM models in Keras

### Constraints

* Limited computational power and resources
* Limited hands-on experience in advanced deep learning techniques
* Access to distributed computing platforms and the technical challenges of using them

### Stakeholders

The report is intended for, and will be shared with, the general public.

### Deliverables

* Final Report
* Final Presentation
* Jupyter notebooks with the code for each stage of the data science method
* Model deployment into production
* Article on Medium / TDS

### Data Sources
Data will be collected directly from Twitter via the Twitter API and the Tweepy package in Python. The data set will contain roughly 440,000 raw tweets about the 2020 US election, collected during the two weeks prior to election day.


# Data Collection - Streaming Twitter Data

## Step 1: Setting up the environment

In [1]:
# Install Tweepy
!pip install tweepy

Collecting tweepy
  Downloading tweepy-3.9.0-py2.py3-none-any.whl (30 kB)
Collecting requests-oauthlib>=0.7.0
  Downloading requests_oauthlib-1.3.0-py2.py3-none-any.whl (23 kB)
Collecting oauthlib>=3.0.0
  Downloading oauthlib-3.1.0-py2.py3-none-any.whl (147 kB)
Installing collected packages: oauthlib, requests-oauthlib, tweepy
Successfully installed oauthlib-3.1.0 requests-oauthlib-1.3.0 tweepy-3.9.0
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.


In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import tweepy
import json

In [4]:
# Twitter API credentials (replace each placeholder with your own value)
access_token = "  insert yours here  "
access_token_secret = "  insert yours here  "
consumer_key = "  insert yours here  "
consumer_secret = "  insert yours here  "

In [5]:
# Twitter API authorization
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

In [6]:
class MyStreamListener(tweepy.StreamListener):
    """Listener that writes each streamed tweet to tweets.txt, one JSON object per line"""
    def __init__(self, api=None):
        super(MyStreamListener, self).__init__()
        self.num_tweets = 0
        # Opening in "w" mode overwrites the file from the previous chunk
        self.file = open("tweets.txt", "w")

    def on_status(self, status):
        tweet = status._json
        self.file.write( json.dumps(tweet) + '\n' )
        self.num_tweets += 1
        if self.num_tweets < 20000:
            return True
        # Limit reached: close the file first, then return False to stop the stream
        self.file.close()
        return False

    def on_error(self, status):
        # Print the error status code reported by the API
        print(status)
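
The stop-after-N logic in the listener can be tried out without a Twitter connection. The following is a minimal sketch, assuming a hypothetical `TweetFileWriter` class (not part of Tweepy or of the original notebook) that mirrors what `on_status` is meant to do: write one JSON object per line, count the tweets, and signal a stop once the limit is reached, closing the file before it does so.

```python
import json
import os
import tempfile


class TweetFileWriter:
    """Hypothetical helper: writes tweets as JSON lines and stops at a limit."""

    def __init__(self, path, max_tweets):
        self.max_tweets = max_tweets
        self.num_tweets = 0
        self.file = open(path, "w")

    def handle(self, tweet):
        """Write one tweet dict; return True to continue, False to stop."""
        self.file.write(json.dumps(tweet) + "\n")
        self.num_tweets += 1
        if self.num_tweets < self.max_tweets:
            return True
        # Close before returning so the last lines are flushed to disk
        self.file.close()
        return False


def run_demo(max_tweets=3):
    """Feed made-up tweets until the writer asks to stop; return lines written."""
    path = os.path.join(tempfile.mkdtemp(), "tweets_demo.txt")
    writer = TweetFileWriter(path, max_tweets)
    fake_id = 0
    while writer.handle({"id": fake_id, "text": "example tweet"}):
        fake_id += 1
    with open(path) as f:
        return sum(1 for _ in f)


print(run_demo())
```

Closing the file before returning matters because any statement placed after a `return` never runs, and an unclosed file may leave its last buffered lines unwritten.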

## Step 2: Determine your data collection approach  
I stream and collect Twitter data in 11 chunks of up to 20,000 tweets each because streaming is computationally intensive, and I run through these 11 chunks twice to collect roughly 440,000 tweets in total.

## Step 3: Let the stream begin!

### Chunk 1

In [7]:
listener = MyStreamListener()
stream = tweepy.Stream(auth, listener)
stream.filter(track = ['US Election', 'election', 'trump', 'Mike Pence', 'biden', 'Kamala Harris', 'Donald Trump', 'Joe Biden'])
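
The `track` list acts as a set of OR-ed phrases: according to Twitter's documentation for the v1.1 streaming filter, a tweet matches when it contains every term of at least one phrase, in any order and ignoring case. The sketch below is a simplified approximation of that rule, written only to illustrate it; the real service also handles hashtags, mentions, URLs, and punctuation in ways this ignores. `matches_track` is a hypothetical helper, not a Tweepy function.

```python
import re


def matches_track(text, track):
    """Approximate the streaming 'track' rule for a single tweet text."""
    words = set(re.findall(r"\w+", text.lower()))
    for phrase in track:
        # A phrase matches only if all of its terms appear in the tweet
        terms = re.findall(r"\w+", phrase.lower())
        if terms and all(term in words for term in terms):
            return True
    return False


track = ['US Election', 'election', 'trump', 'Mike Pence', 'biden',
         'Kamala Harris', 'Donald Trump', 'Joe Biden']

print(matches_track("TRUMP rally tonight", track))   # matched by 'trump'
print(matches_track("Pence spoke today", track))     # 'Mike Pence' needs both terms
```

Under this rule, changing only the capitalisation of a keyword between chunks (for example `'Trump'` versus `'trump'`) should not change which tweets are matched.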

In [8]:
tweets_data_path = 'tweets.txt'
tweets_data=[]
tweets_file_1 = open(tweets_data_path, 'r')
# Read in tweets and store in list: tweets_data
for line in tweets_file_1:
    tweet = json.loads(line)
    tweets_data.append(tweet)

# Close connection to file
tweets_file_1.close()

# Print the keys of the first tweet dict
print(tweets_data[0].keys())

dict_keys(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source', 'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'extended_tweet', 'quote_count', 'reply_count', 'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms'])
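
The same read-and-parse cell is repeated for every chunk. As a minimal sketch, it could be wrapped in one function that uses a `with` block so the file is closed even if a line fails to parse. `load_tweets` and the demo file name are assumptions for illustration, not part of the original notebook.

```python
import json
import os
import tempfile


def load_tweets(path):
    """Read a file with one JSON object per line and return a list of dicts."""
    tweets = []
    with open(path, "r") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                tweets.append(json.loads(line))
    return tweets


# Small demonstration on a temporary file with two made-up tweets
demo_path = os.path.join(tempfile.mkdtemp(), "tweets_demo.txt")
with open(demo_path, "w") as f:
    f.write(json.dumps({"id": 1, "text": "first"}) + "\n")
    f.write(json.dumps({"id": 2, "text": "second"}) + "\n")

demo_tweets = load_tweets(demo_path)
print(len(demo_tweets), sorted(demo_tweets[0].keys()))
```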


In [9]:
names = ['created_at', 'id', 'id_str', 'text', 'source', 'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str', \
         'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', \
         'contributors', 'retweeted_status', 'quoted_status_id', 'quoted_status_id_str', 'quoted_status', \
         'quoted_status_permalink', 'is_quote_status', 'quote_count', 'reply_count', 'retweet_count', 'favorite_count', \
         'entities', 'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms']
df1 = pd.DataFrame(tweets_data, columns= names)

In [10]:
df1.shape

(19999, 32)
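
The printed keys differ from tweet to tweet (for example, only retweets carry `retweeted_status`), which is why the DataFrame is built with a fixed `columns=names` list: keys outside the list are dropped and missing ones are left empty. The sketch below illustrates the same idea with plain dictionaries, using `None` where pandas would use `NaN`. `to_fixed_schema` and the demo data are hypothetical.

```python
def to_fixed_schema(tweets, columns):
    """Keep only the given columns for each tweet, filling gaps with None."""
    return [{col: tweet.get(col) for col in columns} for tweet in tweets]


demo_columns = ["id", "text", "retweeted_status", "lang"]
demo_tweets = [
    # An original tweet: no 'retweeted_status', but one extra key
    {"id": 1, "text": "a", "lang": "en", "display_text_range": [0, 1]},
    # A retweet: carries 'retweeted_status'
    {"id": 2, "text": "b", "lang": "en", "retweeted_status": {"id": 1}},
]

rows = to_fixed_schema(demo_tweets, demo_columns)
for row in rows:
    print(row)
```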

### Chunk 2

In [11]:
listener = MyStreamListener()
stream = tweepy.Stream(auth, listener)
stream.filter(track = ['US Election', 'Election', 'Trump', 'pence', 'biden', 'harris', 'Donald Trump', 'Joe Biden'])

In [12]:
tweets_data_path = 'tweets.txt'
tweets_data=[]
tweets_file_2 = open(tweets_data_path, 'r')
# Read in tweets and store in list: tweets_data
for line in tweets_file_2:
    tweet = json.loads(line)
    tweets_data.append(tweet)

# Close connection to file
tweets_file_2.close()

# Print the keys of the first tweet dict
print(tweets_data[0].keys())

dict_keys(['created_at', 'id', 'id_str', 'text', 'source', 'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'retweeted_status', 'is_quote_status', 'quote_count', 'reply_count', 'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms'])


In [13]:
# 2nd Streaming
df2 = pd.DataFrame(tweets_data, columns=names)

In [14]:
df2.shape

(20000, 32)

### Chunk 3

In [15]:
listener = MyStreamListener()
stream = tweepy.Stream(auth, listener)
stream.filter(track = ['US Election', 'Election', 'Trump', 'Pence', 'Biden', 'Harris', 'Donald Trump', 'Sleepy Joe'])

In [16]:
tweets_data_path = 'tweets.txt'
tweets_data=[]
tweets_file_3 = open(tweets_data_path, 'r')
# Read in tweets and store in list: tweets_data
for line in tweets_file_3:
    tweet = json.loads(line)
    tweets_data.append(tweet)

# Close connection to file
tweets_file_3.close()

# Print the keys of the first tweet dict
print(tweets_data[0].keys())

dict_keys(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source', 'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'quoted_status_id', 'quoted_status_id_str', 'quoted_status', 'quoted_status_permalink', 'is_quote_status', 'quote_count', 'reply_count', 'retweet_count', 'favorite_count', 'entities', 'extended_entities', 'favorited', 'retweeted', 'possibly_sensitive', 'filter_level', 'lang', 'timestamp_ms'])


In [17]:
# 3rd Streaming
df3 = pd.DataFrame(tweets_data, columns=names)

In [18]:
df3.shape

(20000, 32)

### Chunk 4

In [19]:
listener = MyStreamListener()
stream = tweepy.Stream(auth, listener)
stream.filter(track = ['2021 Presidential Election', 'election', 'trump', 'pence', 'biden', 'harris', 'Donald Trump', 'Joe Biden'])

In [20]:
tweets_data_path = 'tweets.txt'
tweets_data=[]
tweets_file_4 = open(tweets_data_path, 'r')
# Read in tweets and store in list: tweets_data
for line in tweets_file_4:
    tweet = json.loads(line)
    tweets_data.append(tweet)

# Close connection to file
tweets_file_4.close()

# Print the keys of the first tweet dict
print(tweets_data[0].keys())

dict_keys(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source', 'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'quote_count', 'reply_count', 'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms'])


In [21]:
# 4th Streaming
df4 = pd.DataFrame(tweets_data, columns=names)

In [22]:
df4.shape

(20000, 32)

### Chunk 5

In [23]:
listener = MyStreamListener()
stream = tweepy.Stream(auth, listener)
stream.filter(track = ['US COVID19', 'Covid', 'covid19', 'election', 'trump', 'pence', 'biden', 'harris', 'Donald Trump', 'Joe Biden'])

In [24]:
tweets_data_path = 'tweets.txt'
tweets_data=[]
tweets_file_5 = open(tweets_data_path, 'r')
# Read in tweets and store in list: tweets_data
for line in tweets_file_5:
    tweet = json.loads(line)
    tweets_data.append(tweet)

# Close connection to file
tweets_file_5.close()

# Print the keys of the first tweet dict
print(tweets_data[0].keys())

dict_keys(['created_at', 'id', 'id_str', 'text', 'source', 'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'retweeted_status', 'is_quote_status', 'quote_count', 'reply_count', 'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms'])


In [25]:
df5 = pd.DataFrame(tweets_data, columns=names)
df5.shape

(20000, 32)

### Chunk 6 

In [26]:
listener = MyStreamListener()
stream = tweepy.Stream(auth, listener)
stream.filter(track = ['USA COVID19', 'US Covid', 'usa covid-19', 'usa election', 'Trump', 'Pence', 'Biden', 'Harris'])

In [28]:
tweets_data_path = 'tweets.txt'
tweets_data=[]
tweets_file_6 = open(tweets_data_path, 'r')
# Read in tweets and store in list: tweets_data
for line in tweets_file_6:
    tweet = json.loads(line)
    tweets_data.append(tweet)

# Close connection to file
tweets_file_6.close()

# Print the keys of the first tweet dict
print(tweets_data[0].keys())

dict_keys(['created_at', 'id', 'id_str', 'text', 'source', 'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'retweeted_status', 'is_quote_status', 'quote_count', 'reply_count', 'retweet_count', 'favorite_count', 'entities', 'extended_entities', 'favorited', 'retweeted', 'possibly_sensitive', 'filter_level', 'lang', 'timestamp_ms'])


In [29]:
df6 = pd.DataFrame(tweets_data, columns=names)

In [30]:
df6.shape

(20000, 32)

### Chunk 7

In [31]:
listener = MyStreamListener()
stream = tweepy.Stream(auth, listener)
stream.filter(track = ['GOP', 'Democrats', 'usa covid-19', 'usa election', 'Trump', 'Pence', 'Biden', 'Harris', 'COVID19'])

In [32]:
tweets_data_path = 'tweets.txt'
tweets_data=[]
tweets_file_7 = open(tweets_data_path, 'r')
# Read in tweets and store in list: tweets_data
for line in tweets_file_7:
    tweet = json.loads(line)
    tweets_data.append(tweet)

# Close connection to file
tweets_file_7.close()

# Print the keys of the first tweet dict
print(tweets_data[0].keys())

dict_keys(['created_at', 'id', 'id_str', 'text', 'source', 'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'retweeted_status', 'is_quote_status', 'quote_count', 'reply_count', 'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms'])


In [33]:
df7 = pd.DataFrame(tweets_data, columns=names)

In [34]:
df7.shape

(20000, 32)

### Chunk 8

In [35]:
listener = MyStreamListener()
stream = tweepy.Stream(auth, listener)
stream.filter(track = ['US', 'China', 'Democrats', 'Republicans', 'usa election', 'Trump', 'Pence', 'Biden', 'Harris', 'COVID19'])

In [36]:
tweets_data_path = 'tweets.txt'
tweets_data=[]
tweets_file_8 = open(tweets_data_path, 'r')
# Read in tweets and store in list: tweets_data
for line in tweets_file_8:
    tweet = json.loads(line)
    tweets_data.append(tweet)

# Close connection to file
tweets_file_8.close()

# Print the keys of the first tweet dict
print(tweets_data[0].keys())

dict_keys(['created_at', 'id', 'id_str', 'text', 'source', 'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'retweeted_status', 'is_quote_status', 'quote_count', 'reply_count', 'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted', 'possibly_sensitive', 'filter_level', 'lang', 'timestamp_ms'])


In [37]:
df8 = pd.DataFrame(tweets_data, columns=names)

In [38]:
df8.shape

(19999, 32)

### Chunk 9

In [43]:
listener = MyStreamListener()
stream = tweepy.Stream(auth, listener)
stream.filter(track = ['election', 'Democrats', 'Republicans', 'election', 'Trump', 'Pence', 'Biden', 'Harris', 'US Election'])

In [44]:
tweets_data_path = 'tweets.txt'
tweets_data=[]
tweets_file_9 = open(tweets_data_path, 'r')
# Read in tweets and store in list: tweets_data
for line in tweets_file_9:
    tweet = json.loads(line)
    tweets_data.append(tweet)

# Close connection to file
tweets_file_9.close()

# Print the keys of the first tweet dict
print(tweets_data[0].keys())

dict_keys(['created_at', 'id', 'id_str', 'text', 'source', 'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'retweeted_status', 'is_quote_status', 'quote_count', 'reply_count', 'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms'])


In [45]:
df9 = pd.DataFrame(tweets_data, columns=names)

In [46]:
df9.shape

(20000, 32)

### Chunk 10

In [47]:
listener = MyStreamListener()
stream = tweepy.Stream(auth, listener)
stream.filter(track = ['Democrats', 'US Congress', 'BLM', 'Republicans', 'election', 'Trump', 'Pence', 'Biden', 'Harris', 'US Election'])

In [48]:
tweets_data_path = 'tweets.txt'
tweets_data=[]
tweets_file_10 = open(tweets_data_path, 'r')
# Read in tweets and store in list: tweets_data
for line in tweets_file_10:
    tweet = json.loads(line)
    tweets_data.append(tweet)

# Close connection to file
tweets_file_10.close()

# Print the keys of the first tweet dict
print(tweets_data[0].keys())

dict_keys(['created_at', 'id', 'id_str', 'text', 'source', 'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'retweeted_status', 'quoted_status_id', 'quoted_status_id_str', 'quoted_status', 'quoted_status_permalink', 'is_quote_status', 'quote_count', 'reply_count', 'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms'])


In [49]:
df10 = pd.DataFrame(tweets_data, columns=names)

In [50]:
df10.shape

(19998, 32)

### Chunk 11

In [51]:
listener = MyStreamListener()
stream = tweepy.Stream(auth, listener)
stream.filter(track = ['Republicans', 'election', 'Democrats', 'Donald Trump', 'Pence', 'Joe Biden', 'Harris', 'usa election'])

In [52]:
tweets_data_path = 'tweets.txt'
tweets_data=[]
tweets_file_11 = open(tweets_data_path, 'r')
# Read in tweets and store in list: tweets_data
for line in tweets_file_11:
    tweet = json.loads(line)
    tweets_data.append(tweet)

# Close connection to file
tweets_file_11.close()

# Print the keys of the first tweet dict
print(tweets_data[0].keys())

dict_keys(['created_at', 'id', 'id_str', 'text', 'source', 'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'quoted_status_id', 'quoted_status_id_str', 'quoted_status', 'quoted_status_permalink', 'is_quote_status', 'quote_count', 'reply_count', 'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms'])


In [53]:
df11 = pd.DataFrame(tweets_data, columns=names)

In [54]:
df11.shape

(19999, 32)

## Step 4: Combine All Data Chunks Into A Single DataFrame

In [55]:
df = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11], ignore_index=True)

In [56]:
df.shape

(219995, 32)
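
As a quick sanity check, the combined row count can be compared with the per-chunk shapes printed above. The numbers below are copied from those outputs; the planned figure assumes 11 chunks at the listener's 20,000-tweet limit.

```python
# Row counts reported by df1.shape ... df11.shape in this run
chunk_rows = [19999, 20000, 20000, 20000, 20000, 20000,
              20000, 19999, 20000, 19998, 19999]

total_rows = sum(chunk_rows)
planned_rows = 11 * 20000  # 11 chunks at the 20,000-tweet limit

print(total_rows)                 # 219995, matching df.shape
print(planned_rows - total_rows)  # 5 rows short of the plan
```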

## Step 5: Save the combined data in a CSV file

In [57]:
df.to_csv('tweets_2.csv', index=False)

## Final Step: Combine the CSV files into a single data set

In [3]:
first_run = pd.read_csv('tweets.csv')



In [5]:
first_run.shape

(220004, 32)

In [6]:
second_run = pd.read_csv('tweets_2.csv')



In [7]:
second_run.shape

(219995, 32)

In [9]:
final = pd.concat([first_run, second_run], ignore_index=True)
final.shape

(439999, 32)

In [10]:
final.head()

Unnamed: 0,created_at,id,id_str,text,source,truncated,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,...,quote_count,reply_count,retweet_count,favorite_count,entities,favorited,retweeted,filter_level,lang,timestamp_ms
0,Fri Oct 16 05:01:51 +0000 2020,1316967569660776450,1316967569660776450,RT @RudyGiuliani: The competing Town Halls wer...,"<a href=""https://mobile.twitter.com"" rel=""nofo...",False,,,,,...,0.0,0,0,0,"{'hashtags': [], 'urls': [], 'user_mentions': ...",False,False,low,en,1602825000000.0
1,Fri Oct 16 05:01:51 +0000 2020,1316967569648222211,1316967569648222211,RT @rachelv12: Trump and machismo https://t.c...,"<a href=""https://mobile.twitter.com"" rel=""nofo...",False,,,,,...,0.0,0,0,0,"{'hashtags': [], 'urls': [{'url': 'https://t.c...",False,False,low,en,1602825000000.0
2,Fri Oct 16 05:01:51 +0000 2020,1316967569652371456,1316967569652371456,RT @briantylercohen: Biden is like an encyclop...,"<a href=""http://twitter.com/download/iphone"" r...",False,,,,,...,0.0,0,0,0,"{'hashtags': [], 'urls': [], 'user_mentions': ...",False,False,low,en,1602825000000.0
3,Fri Oct 16 05:01:51 +0000 2020,1316967569652371458,1316967569652371458,RT @BradleyWhitford: Yo Semites!!! QAnon doesn...,"<a href=""http://twitter.com/download/iphone"" r...",False,,,,,...,0.0,0,0,0,"{'hashtags': [], 'urls': [], 'user_mentions': ...",False,False,low,en,1602825000000.0
4,Fri Oct 16 05:01:51 +0000 2020,1316967569794977792,1316967569794977792,RT @ACTBrigitte: Retweet if President Trump wo...,"<a href=""http://twitter.com/download/iphone"" r...",False,,,,,...,0.0,0,0,0,"{'hashtags': [], 'urls': [], 'user_mentions': ...",False,False,low,en,1602825000000.0


In [11]:
final.to_csv('all_tweets.csv', index=False)