# Twitter Web Scraper and Text Classifier

#### Author: Chandana Karunaratne
#### Date: 9 April 2020
#### Description: This code does the following:
#### i) Collects the n most recent tweets for a specific hashtag during a specific period and saves them to a Pandas dataframe
#### ii) Classifies each tweet based on sentiment (whether it is most likely to be positive or negative)
#### iii) Classifies each tweet based on thematic area (the most likely category it belongs to)


In [1]:
print("hello, friend")

hello, friend


In [2]:
import tweepy as tw
import numpy as np
import pandas as pd
import pattern
from pattern.en import sentiment
from pattern.en import mood, modality

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier



### a) Twitter Web Scraper:

In [3]:
# Source: https://www.earthdatascience.org/courses/earth-analytics-python/using-apis-natural-language-processing-twitter/get-and-use-twitter-data-in-python/

# Specify the variables that contain the user credentials to access Twitter API:

# Note: The credentials below need to be updated occasionally

access_token = "XXXXXXXXXXXXXXXXXXXX"
access_token_secret = "XXXXXXXXXXXXXXXXXXXX"
consumer_key = "XXXXXXXXXXXXXXXXXXXX"
consumer_secret = "XXXXXXXXXXXXXXXXXXXX"


In [4]:
# Specify the authentication parameters using the variables specified above:

auth = tw.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)


In [5]:
# Define the search term and the date_since date as variables:

search_words = "#coronavirus"

date_since = "2019-12-01"

# Remove retweets:

search_words = search_words + " -filter:retweets" 
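The query construction above can be wrapped in a small helper so the hashtag and the retweet filter are set in one place. This is only a sketch: `build_query` and its `exclude_retweets` parameter are hypothetical names, not part of tweepy or of the original notebook.

```python
def build_query(hashtag, exclude_retweets=True):
    """Return a Twitter search query string for a single hashtag."""
    # Add the leading '#' only if the caller left it off.
    query = hashtag if hashtag.startswith("#") else "#" + hashtag
    if exclude_retweets:
        # Same search operator that the cell above appends by hand.
        query += " -filter:retweets"
    return query


search_words = build_query("coronavirus")
print(search_words)
```

The resulting string can be passed as `q=search_words` exactly as in the next cell.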


In [6]:
# Collect n most recent tweets from date_since date: 

n = 100

tweets = tw.Cursor(api.search,
                       q=search_words,
                       lang="en",
                       since=date_since,
                       tweet_mode="extended").items(n)

# For each tweet, save the author's name, screen name, the datetime of the tweet, the user-specified location,
# the number of followers, and the full tweet text:

# For more attributes, see: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object

user_data = [[tweet.user.name, 
              tweet.user.screen_name, 
              tweet.created_at, 
              tweet.user.location,
              tweet.user.followers_count,
              tweet.full_text] for tweet in tweets]
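The row-building step above can be tried without Twitter credentials by using stand-in objects that expose the same attributes. The sketch below is an illustration under that assumption: `User`, `Tweet`, `tweet_to_row` and the sample values are hypothetical and are not real tweepy classes or collected data.

```python
from collections import namedtuple

# Hypothetical stand-ins that mimic only the attributes read in the cell above.
User = namedtuple("User", ["name", "screen_name", "location", "followers_count"])
Tweet = namedtuple("Tweet", ["user", "created_at", "full_text"])


def tweet_to_row(tweet):
    """Flatten one tweet-like object into the six fields kept in the notebook."""
    return [tweet.user.name,
            tweet.user.screen_name,
            tweet.created_at,
            tweet.user.location,
            tweet.user.followers_count,
            tweet.full_text]


# Made-up sample record, used only to exercise the function.
sample = Tweet(user=User("Example User", "example_user", "London", 10),
               created_at="2020-04-09 19:18:17",
               full_text="Example text #coronavirus")

rows = [tweet_to_row(t) for t in [sample]]
print(rows[0])
```

Keeping the field order in one function makes it easier to keep the rows aligned with the column names used when the dataframe is built.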


In [7]:
pd.set_option('display.max_colwidth', -1) # Display all text in cells without truncation

columns = ['Twitter Name', 
           'Twitter Handle (username)', 
           'Date and Time', 
           'User-specified Location', 
           'Number of Followers', 
           'Tweet']

tweets_df = pd.DataFrame(data = user_data, columns = columns)

tweets_df.head(2)


Unnamed: 0,Twitter Name,Twitter Handle (username),Date and Time,User-specified Location,Number of Followers,Tweet
0,Michael Coleman,Colemans1,2020-04-09 19:18:17,London,1198,BBC News - #Coronavirus: Ofcom formally probes David Icke TV interview https://t.co/dBc20GHUks
1,Dan Currie,poeboston,2020-04-09 19:18:17,Boston,1623,The coronavirus outbreak is part of the climate change crisis @AJEnglish https://t.co/PnCfwmRGFZ via @AJEnglish &gt; #bospoli #mapoli #capitalism #publichealth #environment #climatechange #covid19 #coronavirus #pandemic


In [8]:
# Check how many tweets were collected:

len(tweets_df)


100

In [9]:
# Sources:

# 1) https://www.earthdatascience.org/courses/earth-analytics-python/using-apis-natural-language-processing-twitter/get-and-use-twitter-data-in-python/



### b) Twitter Text Classifier:

#### b) i) Cleaning the data:

In [10]:
# Remove non-ASCII characters from 'Tweet' column in 'tweets_df' dataframe:

# Sources: 

# i) https://www.quora.com/How-do-I-remove-non-ASCII-characters-e-g-%C3%90%C2%B1%C2%A7%E2%80%A2-%C2%B5%C2%B4%E2%80%A1%C5%BD%C2%AE%C2%BA%C3%8F%C6%92%C2%B6%C2%B9-from-texts-in-Panda%E2%80%99s-DataFrame-columns
# ii) https://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20

# function to remove non-ASCII

def remove_non_ascii(text):
    return ''.join(i for i in text if ord(i)<128).encode('utf-8')
 
tweets_df['Tweet'] = tweets_df['Tweet'].apply(remove_non_ascii)
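One caveat about the function above: the trailing `.encode('utf-8')` returns a byte string, which on Python 3 is a `bytes` object rather than text. A minimal sketch of a variant that keeps the result as text is shown below; `strip_non_ascii` is a hypothetical name chosen to avoid clashing with the notebook's own function.

```python
def strip_non_ascii(text):
    """Drop every character whose code point is outside the ASCII range (0-127)."""
    return "".join(ch for ch in text if ord(ch) < 128)


# The accented letter and the curly apostrophe are removed; the rest is kept.
cleaned = strip_non_ascii(u"caf\u00e9 \u2019ok")
print(cleaned)
```

It could be applied in the same way as the original, for example with `tweets_df['Tweet'].apply(strip_non_ascii)`.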


In [11]:
# End each tweet with a full stop (".") if it does not already have one:

# Source: https://stackoverflow.com/questions/20025882/append-string-to-the-start-of-each-value-in-a-said-column-of-a-pandas-dataframe

counter = 0

while counter < len(tweets_df):
    if tweets_df['Tweet'][counter][-1] != '.':
        tweets_df.loc[counter, 'Tweet'] = tweets_df['Tweet'][counter] + '.'
    counter = counter + 1
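The loop above indexes the last character with `[-1]`, which raises an `IndexError` if a tweet is an empty string (possible after non-ASCII characters are stripped). A small sketch of the same rule as a function that tolerates empty input follows; `ensure_full_stop` is a hypothetical helper name.

```python
def ensure_full_stop(text):
    """Return text unchanged if it ends with '.', otherwise append one."""
    if text.endswith("."):
        return text
    return text + "."


print(ensure_full_stop("No trailing stop"))
print(ensure_full_stop("Already punctuated."))
```

With pandas, this could replace the `while` loop through `tweets_df['Tweet'].apply(ensure_full_stop)`. Note that an empty string becomes `"."` under this rule.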


#### b) ii) Sentiment Analysis:

In [12]:
# Convert the data in the 'Tweet' column of the 'tweets_df' dataframe to a list:

tweets_list = tweets_df['Tweet'].tolist()
# Create a list of sentiment polarity scores, one for each tweet:
sent_list = []
for i in tweets_list:
    sent_list.append(sentiment(i)[0])
    
# Convert the above list into a dataframe:
sent_df = pd.DataFrame({'Sentiment': sent_list})

# Get the length of the above dataframe and then create a list of consecutive numbers to be used as the "Tweet Number":
df_len = len(sent_df)
my_list = range(0, df_len)
idx = 0
sent_df.insert(loc=idx, column='Tweet Number', value=my_list)



In [13]:
# Add a column 'Tweet Number' to tweets_df so that the dataframe can be merged with sent_df:

tweets_df.insert(loc=idx, column='Tweet Number', value=my_list)

# Now, merge the two dataframes above, giving one dataframe with all information, including the sentiment value for 
# each tweet:

# Source: https://chrisalbon.com/python/data_wrangling/pandas_join_merge_dataframe/

tweets_df = pd.merge(tweets_df, sent_df, left_on = 'Tweet Number', right_on = 'Tweet Number')



In [14]:
# Add a column to the 'tweets_df' dataframe with a label indicating whether the sentiment is positive or negative
# (note: a neutral score of exactly 0 is labelled 'Positive' by the '>= 0' test below):
# Source: https://stackoverflow.com/questions/26830752/making-new-column-in-pandas-dataframe-based-on-filter

tweets_df['Sent_label'] = tweets_df['Sentiment'] >= 0

# Replace all True/False boolean values with their respective labels ('Positive' or 'Negative'):
# Source: https://stackoverflow.com/questions/23307301/pandas-replacing-column-values-in-dataframe

tweets_df['Sent_label'] = tweets_df['Sent_label'].replace([True, False], ['Positive', 'Negative'])
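The labelling rule above reduces to a single threshold on the polarity score, which `pattern` reports on a scale from -1.0 to 1.0. The sketch below restates that rule as a plain function so its treatment of neutral scores is easy to see; `label_sentiment` is a hypothetical name.

```python
def label_sentiment(polarity):
    """Map a polarity score to the notebook's two labels using the >= 0 rule."""
    return "Positive" if polarity >= 0 else "Negative"


# A score of exactly 0.0 (neutral) falls on the 'Positive' side of the threshold.
for score in (0.159524, 0.0, -0.25):
    print(score, label_sentiment(score))
```

Because many tweets score exactly 0.0, a large share of the 'Positive' labels may in practice be neutral tweets.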


In [15]:
tweets_df.head(2)

Unnamed: 0,Tweet Number,Twitter Name,Twitter Handle (username),Date and Time,User-specified Location,Number of Followers,Tweet,Sentiment,Sent_label
0,0,Michael Coleman,Colemans1,2020-04-09 19:18:17,London,1198,BBC News - #Coronavirus: Ofcom formally probes David Icke TV interview https://t.co/dBc20GHUks.,0.0,Positive
1,1,Dan Currie,poeboston,2020-04-09 19:18:17,Boston,1623,The coronavirus outbreak is part of the climate change crisis @AJEnglish https://t.co/PnCfwmRGFZ via @AJEnglish &gt; #bospoli #mapoli #capitalism #publichealth #environment #climatechange #covid19 #coronavirus #pandemic.,0.0,Positive


#### b) iii) Classification Analysis:

In [16]:
# Read in the training/test data set:
# Note: Sample tweets should be in first column and corresponding classification (category) should be in second column.

train_df = pd.read_csv('filepath/filename.csv') # Read in the sample data that is used to train the model. 


In [17]:
train_df.head()

Unnamed: 0,Tweet,Category
0,"Here's a good rundown of how coronavirus is affecting agriculture and food distribution, ICYMI.\n\n#coronavirus #economy https://t.co/H07NGkUKcC",Advice
1,"If you are trapped with an abuser during social isolation, find resources on how to get help here: https://t.co/9j9wAQ4oJH. #COVID19 #Coronavirus @Gothamist",Advice
2,Coronavirus: Half of all confirmed cases worldwide are now in Europe #Coronavirus https://t.co/nG3Hht13Hf,News coverage
3,@barbarafriesen I was telling friends &amp; family to wear masks 3 weeks ago and these top doctors were essentially saying I was a quack. Now one of them has #coronavirus. You damn right I’ll call them out on this screwup.,Blame
4,Keeping Your Employees and Customers Safe. Learn more about our #Coronavirus Disinfection Services here: https://t.co/seS7flf7P7 https://t.co/CwUuXpeoeo,Advice


In [18]:
# Clean the training/test data set:

# Remove non-ASCII characters from 'Tweet' column in 'train_df' dataframe:

# Sources: 

# i) https://www.quora.com/How-do-I-remove-non-ASCII-characters-e-g-%C3%90%C2%B1%C2%A7%E2%80%A2-%C2%B5%C2%B4%E2%80%A1%C5%BD%C2%AE%C2%BA%C3%8F%C6%92%C2%B6%C2%B9-from-texts-in-Panda%E2%80%99s-DataFrame-columns
# ii) https://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20

# function to remove non-ASCII

def remove_non_ascii(text):
    return ''.join(i for i in text if ord(i)<128).encode('utf-8')
 
train_df['Tweet'] = train_df['Tweet'].apply(remove_non_ascii)


# End each tweet with a full stop (".") if it does not already have one:

# Source: https://stackoverflow.com/questions/20025882/append-string-to-the-start-of-each-value-in-a-said-column-of-a-pandas-dataframe

counter = 0

while counter < len(train_df):
    if train_df['Tweet'][counter][-1] != '.':
        train_df.loc[counter, 'Tweet'] = train_df['Tweet'][counter] + '.'
    counter = counter + 1



In [19]:
# Create a train-test data set and thereafter create a learning model to classify the test data set using the 
# Stochastic Gradient Descent classifier:

train_array = train_df.values    # Convert the 'train_df' dataframe to an array

X = train_array[:, 0]    # Assign all the data in the first column of 'train_array' to X
Y = train_array[:, 1]    # Assign all the data in the second column of 'train_array' to Y

# Specify the parameters of the training data set and the test data set. 'test_size = 0.4' refers to the test data set 
# comprising 40% of the entire data set.

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.4, random_state = 42)    

# Specify the parameters of the classifier model. Use the English 'stop words' list to filter out words like 'the', 
# 'is', 'of', etc. and use the Stochastic Gradient Descent classifier for the classification task. 

tweet_clf = Pipeline([('vect', CountVectorizer(stop_words = 'english')),
                      ('tfidf', TfidfTransformer()),
                      ('clf-sgd', SGDClassifier(loss = 'hinge', alpha = 1e-3, max_iter = 5, tol = None, random_state = 42))])

tweet_clf = tweet_clf.fit(X_train, Y_train)    # Fit the training data to the classifier model.

predicted = tweet_clf.predict(X_test)    # Predict each classification value for the data in the test set.

# Summary of above variables:

# 'X_train' = refers to the tweets in the training set 
# 'Y_train' = refers to the manually-labeled categories in the training set
# 'X_test' = refers to the tweets in the test set
# 'Y_test' = refers to the manually-labeled categories in the test set
# 'predicted' = refers to the categories predicted by the classifier for the tweets in 'X_test'
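To make the 60/40 hold-out idea concrete, the sketch below splits paired lists with the standard library only. It illustrates the concept and is not how `train_test_split` is implemented: `holdout_split` is a hypothetical helper, the sample data are made up, and the rows it selects will differ from scikit-learn's even with the same seed.

```python
import random


def holdout_split(texts, labels, test_size=0.4, seed=42):
    """Shuffle paired items reproducibly and hold out a fraction for testing."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)   # seeded so the split is repeatable
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [texts[i] for i in train_idx]
    X_test = [texts[i] for i in test_idx]
    Y_train = [labels[i] for i in train_idx]
    Y_test = [labels[i] for i in test_idx]
    return X_train, X_test, Y_train, Y_test


# Made-up data: ten placeholder tweets with alternating placeholder labels.
texts = ["tweet %d" % i for i in range(10)]
labels = ["Advice" if i % 2 == 0 else "Blame" for i in range(10)]

X_train, X_test, Y_train, Y_test = holdout_split(texts, labels)
print(len(X_train), len(X_test))
```

Each text stays paired with its label because both lists are indexed with the same shuffled positions.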


In [20]:
# Print the accuracy of the classification model based on the test set:

# Specifically, check how many predicted values corresponded to the actual values in the test set
# and then get the mean of these values. This provides an indicator of how accurate
# the classifier is for the given set of test data (this is presented as a percentage):

print("Accuracy of classification model: ", float(np.mean(predicted == Y_test) * 100), "%") 


('Accuracy of classification model: ', 40.0, '%')
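The accuracy figure above is the share of test tweets whose predicted category equals the manually assigned one. A standard-library sketch of the same calculation is given below; `accuracy_percent` is a hypothetical helper and the labels in the example are made up, not taken from the notebook's test set.

```python
def accuracy_percent(predicted, actual):
    """Return the percentage of positions where predicted equals actual."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty test set")
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * matches / len(actual)


# Made-up labels: three of the four predictions match.
example_predicted = ["Advice", "Blame", "Spam", "Advice"]
example_actual = ["Advice", "Blame", "Spam", "Humour"]
print(accuracy_percent(example_predicted, example_actual))
```

Overall accuracy says nothing about which categories are being confused, so it is worth reading alongside the category frequencies of the predictions.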


In [21]:
# Now, run the classifier on the unclassified (new) tweets:

tweets_list = tweets_df['Tweet'].tolist()   # Convert the data in the 'Tweet' column of the 'tweets_df' dataframe to a list.
tweets_array = np.asarray(tweets_list)    # Convert list into array.
tweets_classd_array = tweet_clf.predict(tweets_array)    # Classify new tweets contained in array using the
                                                         # tweet classifier ('tweet_clf') to predict the classification labels

tweets_classd_df = pd.DataFrame(tweets_classd_array, columns=['Category'])   # Convert array containing classification labels
                                                                             # into dataframe and specify column name.



In [22]:
# Finally, merge the original tweets dataframe with the classified tweets dataframe:

tweets_merged_df = pd.merge(tweets_df, tweets_classd_df, left_index = True, right_index = True)


In [23]:
tweets_merged_df.head()

Unnamed: 0,Tweet Number,Twitter Name,Twitter Handle (username),Date and Time,User-specified Location,Number of Followers,Tweet,Sentiment,Sent_label,Category
0,0,Michael Coleman,Colemans1,2020-04-09 19:18:17,London,1198,BBC News - #Coronavirus: Ofcom formally probes David Icke TV interview https://t.co/dBc20GHUks.,0.0,Positive,News coverage
1,1,Dan Currie,poeboston,2020-04-09 19:18:17,Boston,1623,The coronavirus outbreak is part of the climate change crisis @AJEnglish https://t.co/PnCfwmRGFZ via @AJEnglish &gt; #bospoli #mapoli #capitalism #publichealth #environment #climatechange #covid19 #coronavirus #pandemic.,0.0,Positive,News coverage
2,2,Julie Collins,CollinsAtl,2020-04-09 19:18:14,"New York, NY",2784,"My daily visit with her, she remains a beacon of hope and health #coronavirus statueoflibertynyc #LadyLiberty #NYCStrong #golegsgo @ New York, New York https://t.co/EaHoPqcBVi.",0.090909,Positive,News coverage
3,3,amit kumar rai,talk2amit,2020-04-09 19:18:13,,32,Sweden took actions at the right time but not adequately..only if they would have enforced social distancing .. government should have not trusted people ..the count would have been lower.. maybe they wanted to save on the pension money #coronavirus #COVID19SWEDEN.,0.159524,Positive,Blame
4,4,Hakim Bellamy,HakimBe,2020-04-09 19:18:13,"Albuquerque, NM",4424,"@shaunking We knew there winds of change were on the horizon, ""weather"" political or literal...it was going to be revolution or evolution. Apparently, we elected the latter. #Coronavirus #COVID19 #COVID19forPresident #PrimaryElection #GeneralElection.",0.016667,Positive,News coverage


In [24]:
# Save 'tweets_merged_df' dataframe to a csv file:

tweets_merged_df.to_csv("filepath/filename.csv", encoding = 'utf-8', index = False) 


In [25]:
# Create a dataframe featuring the frequencies of each category:

freq_series = tweets_classd_df.Category.value_counts()    # Create a series with the number of tweets assigned to each 
                                                                # category.
    
freq_df = pd.DataFrame(freq_series)    # Convert the frequency of counts into a dataframe (for easy manipulation).

freq_df.reset_index(level = 0, inplace = True)    # Reset the index of the dataframe so that you can use the 
                                                  # classification labels in a separate column.
    
freq_df.columns = ["Category", "Frequency"]    # Rename the columns of the dataframe.


In [26]:
freq_df

Unnamed: 0,Category,Frequency
0,News coverage,81
1,Advice,6
2,Humour,4
3,Appreciation,3
4,Blame,3
5,Self-promotion,2
6,Spam,1
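The frequency table above is a count of labels sorted from most to least common. The same tally can be sketched with the standard library, which may help when checking the pandas result by hand; `category_frequencies` is a hypothetical helper and the label list is made up.

```python
from collections import Counter


def category_frequencies(categories):
    """Return (category, count) pairs ordered from most to least frequent."""
    return Counter(categories).most_common()


# Made-up predicted labels, used only to show the shape of the result.
example_labels = ["News coverage", "Advice", "News coverage",
                  "Blame", "News coverage", "Advice"]

for category, count in category_frequencies(example_labels):
    print(category, count)
```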


In [27]:
# Save 'freq_df' dataframe to a csv file:

freq_df.to_csv("filepath/filename.csv", encoding = 'utf-8', index = False) 
