In this notebook, my aim is to fetch a recent batch of tweets from Twitter, clean the text, and then run sentiment analysis on each tweet one by one, assigning scores that indicate how positive or negative it is.

The data below was scraped on 6 July 2021. To keep the example short, only the first 100 tweets containing the #BTC hashtag were kept.

Note: You need to sign up for a Twitter developer account and use your own credentials to pull data through the tweepy package. I keep my credentials in a local module called 'TwitterKeys' and import it so that I can use them to receive data through the API.
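
A minimal sketch of what such a credentials module might look like, assuming the four keys are stored in environment variables. The attribute names on the left match the ones this notebook reads from TwitterKeys; the environment variable names and the helper load_keys are hypothetical.

```python
import os

# Hypothetical environment variable names; the dictionary keys are the
# attribute names this notebook reads from the TwitterKeys module.
ENV_NAMES = {
    "API_TOKEN": "TWITTER_API_TOKEN",
    "API_KEY": "TWITTER_API_KEY",
    "ACCESS_TOKEN": "TWITTER_ACCESS_TOKEN",
    "ACCESS_KEY": "TWITTER_ACCESS_KEY",
}


def load_keys(environ=os.environ):
    """Return the four credentials as a dict, raising if any is missing."""
    missing = [env for env in ENV_NAMES.values() if not environ.get(env)]
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
    return {attr: environ[env] for attr, env in ENV_NAMES.items()}
```

Keeping the values out of the notebook itself means the notebook can be shared without exposing them.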


In [1]:
import tweepy, json, TwitterKeys, csv, re, time  # tweepy: Python wrapper around the Twitter API
import pandas as pd
import numpy as np
from datetime import date
from datetime import datetime
import matplotlib.pyplot as plt
from textblob import TextBlob as tb
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from transformers import pipeline
from nltk.tokenize import regexp_tokenize, TweetTokenizer,word_tokenize
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/burcin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
# Authenticate to Twitter
auth = tweepy.OAuthHandler(TwitterKeys.API_TOKEN, TwitterKeys.API_KEY)

auth.set_access_token(TwitterKeys.ACCESS_TOKEN, TwitterKeys.ACCESS_KEY)

api = tweepy.API(auth)

try:
    api.verify_credentials()
    print("Authentication OK")
except Exception as err:  # avoid a bare except, and show why authentication failed
    print("Error during authentication:", err)

Authentication OK


After authentication, I pulled data through the Twitter API. You can optionally set a start date to get older tweets, but this option has two drawbacks. First, you cannot set a time of day together with the date. More importantly, the search only returns tweets from at most 7 days before the current date.
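
To make the 7-day limit concrete, here is a small sketch that computes the earliest start date the search would still cover. The helper name earliest_since and the constant are hypothetical, and the window length is taken from the limit described above. The result is a date-only string because a time of day cannot be passed.

```python
from datetime import date, timedelta

SEARCH_WINDOW_DAYS = 7  # assumed from the limit described above


def earliest_since(today=None):
    """Earliest start date (YYYY-MM-DD) that is still inside the search window."""
    today = today or date.today()
    return (today - timedelta(days=SEARCH_WINDOW_DAYS)).isoformat()


print(earliest_since(date(2021, 7, 6)))  # 2021-06-29
```

The returned string could be assigned to date_since and passed as the since argument that is commented out in the search cell.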

Then I saved the tweets in a data frame for processing.

In [4]:
search_words = "#BTC"


In [5]:
new_search = search_words + " -filter:retweets" # Filtered retweets
new_search

'#BTC -filter:retweets'

In [30]:
tweets = tweepy.Cursor(api.search,
                       q=new_search,
                       lang="en"
                       #since=date_since
                      ).items(100)

df = pd.DataFrame(
    data=[[tweet.user.screen_name, tweet.user.location, tweet.created_at, tweet.text] for tweet in tweets],
    columns=["user", "location", "tweet_date", "tweet_text"],
)

In [31]:
df.shape

(100, 4)

In [32]:
df.head()

Unnamed: 0,user,location,tweet_date,tweet_text
0,nowonbitcoin,"London, England 🇬🇧",2021-07-06 09:25:36,DOGE and DeFi Avoided by Australian Crypto Fund with 119% of YTD Returns in 2021 \n\n#btc #bitcoin #cryptocurrency… https://t.co/4XXBZuJRqM
1,AmegaFX,,2021-07-06 09:25:30,❇️Cryptocurrencies Now #BTC #LTC #ETH #DOGE #ADA https://t.co/wBZciKPFQh
2,hana_x007,,2021-07-06 09:25:21,"I'm excited about this one! Prepare your bags #NFTCommunity! 💵\n\nLet's make money on $HOD lads, @SecretsOfCrypto&amp;… https://t.co/d3gFp99n67"
3,Crypto_trace,,2021-07-06 09:25:17,All waiting for @elonmusk to give the green light.. Tesla accepting #btc https://t.co/5fkWoJhIQO
4,Raaaj_Financial,"New York, NY",2021-07-06 09:25:02,"2,873 Satoshis now equal $1 Dollar (1 #BTC = 100,000,000 Satoshis) \nExplore the full chart @… https://t.co/gMw0kQw2R9"


In [34]:
pd.options.display.max_colwidth = 2000
# Pattern that matches every link (URL)
pattern = r'http\S+'

# Apply the pattern to every tweet and list the URLs it finds
print(df.tweet_text.apply(lambda x: regexp_tokenize(x, pattern=pattern)))
#print(df.tweet_text.apply(lambda x: re.search("(?P<url>https?://[^\s]+)", x).group("url")))


0                              [https://t.co/4XXBZuJRqM]
1                              [https://t.co/wBZciKPFQh]
2                              [https://t.co/d3gFp99n67]
3                              [https://t.co/5fkWoJhIQO]
4                              [https://t.co/gMw0kQw2R9]
                             ...                        
95                                                    []
96                             [https://t.co/mNN35cimL1]
97                             [https://t.co/rDrHrQKeSs]
98    [https://t.co/La5Cc7UYpP, https://t.co/k31lx3aKEy]
99                             [https://t.co/xFhaM3TqX5]
Name: tweet_text, Length: 100, dtype: object
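
The same pattern also works with the standard library alone. A small sketch, using a made-up sentence that contains the two links from row 98 above:

```python
import re

URL_PATTERN = r'http\S+'  # same pattern as in the cell above

# Made-up text for illustration; only the two links come from row 98 above.
sample = "check this https://t.co/La5Cc7UYpP and this https://t.co/k31lx3aKEy"

urls = re.findall(URL_PATTERN, sample)     # list every link
no_urls = re.sub(URL_PATTERN, "", sample)  # remove every link

print(urls)
print(no_urls.split())
```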


Next, I remove URLs from the tweets and drop tweets with identical content by storing each cleaned tweet text as a dictionary key.

After that, I also remove English stopwords, hashtags, and mentions to get cleaner text.
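
The cleaning steps described above can be sketched without NLTK. This is a simplified stand-in rather than the notebook's own cell: it splits on whitespace instead of using TweetTokenizer, STOP_WORDS is a tiny hypothetical subset of the English stopword list, and the helper names are made up, so its output will differ from the notebook's around punctuation.

```python
import re

# Tiny hypothetical stopword subset, for illustration only.
STOP_WORDS = {"and", "by", "with", "of", "in", "the", "to", "for"}


def clean_tweet(text):
    """Drop URLs, hashtags, mentions, stopwords and one-character tokens."""
    text = re.sub(r'http\S+', "", text)
    tokens = text.split()  # whitespace split, simpler than TweetTokenizer
    return [t for t in tokens
            if t not in STOP_WORDS and len(t) > 1 and t[0] not in "#@"]


def unique_tweets(texts):
    """Keep the first occurrence of each cleaned tweet, in order."""
    seen = {}
    for text in texts:
        seen.setdefault(" ".join(clean_tweet(text)), 1)
    return list(seen)


# Shortened version of the first tweet in the data frame above.
raw = "DOGE and DeFi Avoided by Australian Crypto Fund #btc https://t.co/4XXBZuJRqM"
print(clean_tweet(raw))
```

Using the cleaned text as a dictionary key is what removes duplicates: two tweets that differ only in their links collapse to the same key.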

In [329]:
tknzr = TweetTokenizer()
stop_words = set(stopwords.words('english'))  # build the stopword set once
tweet_dict = {}  # cleaned tweet text -> 1; the keys drop duplicate tweets

for text in df.tweet_text:
    tweettext = tknzr.tokenize(re.sub(r'http\S+', "", text))  # drop urls
    # remove hashtags, mentions, stopwords and one-character tokens
    tweettext = [t for t in tweettext if t not in stop_words and len(t) > 1 and t[0] != '#' and t[0] != '@']
    tweet_dict[str(tweettext)] = 1

#tweets = [word_tokenize(re.sub(r"""["?,$!:#@/[///\]\)\.?\+\-]|'(?!(?<! ')[ts])""", "",key)) for key in tweet_dict.keys()]
# each key is one unique cleaned tweet, stored as the string form of its token list
tweets = list(tweet_dict.keys())


In [340]:
tweets

["['DOGE', 'DeFi', 'Avoided', 'Australian', 'Crypto', 'Fund', '119', 'YTD', 'Returns', '2021']",
 "['️Cryptocurrencies', 'Now']",
 '["I\'m", \'excited\', \'one\', \'Prepare\', \'bags\', "Let\'s", \'make\', \'money\', \'HOD\', \'lads\']',
 "['All', 'waiting', 'give', 'green', 'light', '..', 'Tesla', 'accepting']",
 "['2,873', 'Satoshis', 'equal', 'Dollar', '100,000', '000', 'Satoshis', 'Explore', 'full', 'chart']",
 "['Auto', 'Follower']",
 "['Once', 'bears', 'deliver', 'massive', 'counter', 'punch']",
 "['BULLWHALE', 'LONGED', '4,183', '469', 'worth', '34,109', 'Futures']",
 "['MORE', 'FUD', 'BTC']",
 "['Do', 'experience', 'using', 'crypto', 'currency', 'Follow', 'us', 'Twitter', 'Facebook', 'airdrops', 'futu']",
 "['Thanks', 'giving', 'us', 'great', 'opportunity', 'supporting', 'always', 'success', 'development']",
 '[\'earned\', \'lot\', \'Mini\', \'doge\', "It\'s", \'Give\', \'away\', \'time\', \'100\', \'lucky\', \'person\', \'Follow\', \'Retweet\', \'like\']',
 "['So', '...', 'Jus

In [331]:
df.tweet_text[0] # raw version

'DOGE and DeFi Avoided by Australian Crypto Fund with 119% of YTD Returns in 2021  \n\n#btc #bitcoin #cryptocurrency… https://t.co/4XXBZuJRqM'

In [332]:
tweets[0] # cleaned version

"['DOGE', 'DeFi', 'Avoided', 'Australian', 'Crypto', 'Fund', '119', 'YTD', 'Returns', '2021']"
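
Because each cleaned tweet is stored as the string form of a token list, the brackets, quotes and commas are part of the text that the later cells score. If plain text is preferred, one possible approach (an assumption, not what this notebook does) is to parse the string back into a list and join the tokens. The helper name to_plain_text is hypothetical.

```python
import ast


def to_plain_text(token_list_str):
    """Parse a stringified token list and join the tokens with spaces."""
    tokens = ast.literal_eval(token_list_str)
    return " ".join(tokens)


cleaned = "['DOGE', 'DeFi', 'Avoided', 'Australian', 'Crypto', 'Fund', '119', 'YTD', 'Returns', '2021']"
print(to_plain_text(cleaned))
# DOGE DeFi Avoided Australian Crypto Fund 119 YTD Returns 2021
```

Scores computed on the joined text may differ from the ones shown in this notebook.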

In [33]:
# Load the ready-to-use pretrained BERT model provided by Hugging Face

classifier = pipeline("sentiment-analysis")


In [333]:
[item["label"] for item in classifier(tweets[0])]

['NEGATIVE']

In [335]:
# SentimentIntensityAnalyzer from the vaderSentiment library scores the tweets
analyzer = SentimentIntensityAnalyzer()
analyzer.polarity_scores(tweets[0])

{'neg': 0.211, 'neu': 0.789, 'pos': 0.0, 'compound': -0.34}

Last but not least, I created a data frame that holds the scores from the textblob, BERT, and vadersentiment libraries.

With this clean data frame, the three tools can be evaluated by comparing their results side by side.

In [336]:
# values from textblob
pol = []
sub = []
# values from transformers
label = []
score = []
# values from vader sentiment
neg_val = []
pos_val = []
neu_val = []
comp_val = []

for j in tweets:
    tx = tb(j)
    pol.append(tx.sentiment.polarity)
    sub.append(tx.sentiment.subjectivity)

    result = classifier(j)  # run the transformer once per tweet
    label.append([item["label"] for item in result])
    score.append([item["score"] for item in result])

    vs = analyzer.polarity_scores(j)  # run VADER once per tweet
    neg_val.append(vs['neg'])
    pos_val.append(vs['pos'])
    neu_val.append(vs['neu'])
    comp_val.append(vs['compound'])


In [337]:
df_pols = pd.DataFrame({"polarity":pol,"subjectivity":sub, 'label' : label, 'score': score, 'neg_val': neg_val, 'pos_val': pos_val, 'neu_val' : neu_val,'comp_val': comp_val, 'text':tweets})


In [338]:
df_pols

Unnamed: 0,polarity,subjectivity,label,score,neg_val,pos_val,neu_val,comp_val,text
0,0.000000,0.000000,[NEGATIVE],[0.9941918849945068],0.211,0.000,0.789,-0.3400,"['DOGE', 'DeFi', 'Avoided', 'Australian', 'Crypto', 'Fund', '119', 'YTD', 'Returns', '2021']"
1,0.000000,0.000000,[NEGATIVE],[0.9730685949325562],0.000,0.000,1.000,0.0000,"['️Cryptocurrencies', 'Now']"
2,0.375000,0.750000,[POSITIVE],[0.861072838306427],0.000,0.211,0.789,0.3400,"[""I'm"", 'excited', 'one', 'Prepare', 'bags', ""Let's"", 'make', 'money', 'HOD', 'lads']"
3,0.100000,0.500000,[POSITIVE],[0.9365736842155457],0.000,0.271,0.729,0.3818,"['All', 'waiting', 'give', 'green', 'light', '..', 'Tesla', 'accepting']"
4,0.175000,0.400000,[POSITIVE],[0.9320528507232666],0.000,0.000,1.000,0.0000,"['2,873', 'Satoshis', 'equal', 'Dollar', '100,000', '000', 'Satoshis', 'Explore', 'full', 'chart']"
...,...,...,...,...,...,...,...,...,...
92,-0.145833,0.770833,[POSITIVE],[0.760202944278717],0.097,0.417,0.486,0.6808,"['Let', 'begin', 'Years', 'hard', 'dev', 'work', 'community', 'trust', 'support', 'finally']"
93,0.000000,0.000000,[NEGATIVE],[0.952576756477356],0.000,0.000,1.000,0.0000,"['Biggest', 'exchange', 'outflow', ""Let's"", 'see', 'market', 'behaves', 'coming', 'days']"
94,0.000000,0.000000,[NEGATIVE],[0.9995121955871582],0.323,0.000,0.677,-0.5160,"['As', 'told', 'No', 'volume', 'grayscale', 'unlocking', 'We', 'DUMP']"
95,0.000000,0.750000,[NEGATIVE],[0.9966623783111572],0.000,0.000,1.000,0.0000,"['The', 'Shibaswap', 'open', 'use', 'swaps', 'anymore', 'Only', 'shibaswap']"
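
One way to start the comparison is to turn each tool's score into a common label and check where the tools agree. The sketch below uses three rows copied from the table above. The 0.05 cut-offs for the VADER compound score follow the convention suggested in the vaderSentiment documentation; treating a TextBlob polarity of exactly 0 as neutral is an assumption, and the helper names are hypothetical.

```python
def vader_label(compound, cutoff=0.05):
    """Map a VADER compound score to a label using the usual cut-offs."""
    if compound >= cutoff:
        return "POSITIVE"
    if compound <= -cutoff:
        return "NEGATIVE"
    return "NEUTRAL"


def textblob_label(polarity):
    """Map a TextBlob polarity to a label (0 treated as neutral, an assumption)."""
    if polarity > 0:
        return "POSITIVE"
    if polarity < 0:
        return "NEGATIVE"
    return "NEUTRAL"


# (row index, TextBlob polarity, transformer label, VADER compound), from the table above
rows = [
    (0, 0.0, "NEGATIVE", -0.34),
    (2, 0.375, "POSITIVE", 0.34),
    (4, 0.175, "POSITIVE", 0.0),
]

for idx, polarity, bert, compound in rows:
    labels = (textblob_label(polarity), bert, vader_label(compound))
    print(idx, labels, "agree" if len(set(labels)) == 1 else "disagree")
```

Of these three rows, only row 2 gets the same label from all three tools. The transformer never returns a neutral label here, so any row that another tool scores as neutral counts as a disagreement under this scheme.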
