# Setup

The `Scweet` library is old and relies on the deprecated `Selenium` function `find_elements_by_xpath()`, which newer `Selenium` releases no longer provide. We therefore pin `Selenium` to the older version 4.2.0.

This can be done by running the command `pip install selenium==4.2.0 --force-reinstall --user` in the terminal.
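
As a quick sanity check after installing, the sketch below reports whether the installed `Selenium` version should still provide the legacy `find_elements_by_*` helpers. It assumes those helpers were dropped starting with release 4.3.0; `has_legacy_finders` is a hypothetical helper name, not part of `Selenium` or `Scweet`.

```python
from importlib import metadata


def has_legacy_finders(version_string):
    """Return True if the version predates 4.3.0 (assumed removal release)."""
    major, minor = (int(part) for part in version_string.split(".")[:2])
    return (major, minor) < (4, 3)


try:
    installed = metadata.version("selenium")
    print(installed, "legacy finders available:", has_legacy_finders(installed))
except metadata.PackageNotFoundError:
    print("selenium is not installed")
```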

In [1]:
import pandas as pd
from Scweet.scweet import scrape

In [91]:
# constants
KEYWORDS = ["$AAPL", "AAPL"]
MAX_TWEETS = 100
MAX_REPLIES = 10
ORIGINAL_DIR = "data/scraped_twitter/original"
PROCESSED_DIR = "data/scraped_twitter/"

# Scraping the Data

In [44]:
scraped_data = scrape(words=KEYWORDS, since="2013-12-29", until="2014-01-01", interval=1,
                      save_images=False, limit=MAX_TWEETS, headless=True, proxy=None, save_dir="data")

Scraping on headless mode.
looking for tweets between 2013-12-29 and 2013-12-30 ...
 path : https://twitter.com/search?q=($AAPL%20OR%20AAPL)%20until%3A2013-12-30%20since%3A2013-12-29%20&src=typed_query
Tweet made at: 2013-12-29T06:17:53.000Z is found.
Tweet made at: 2013-12-29T15:10:29.000Z is found.
Tweet made at: 2013-12-29T02:04:02.000Z is found.
Tweet made at: 2013-12-29T23:08:49.000Z is found.
Tweet made at: 2013-12-29T00:34:30.000Z is found.
Tweet made at: 2013-12-29T03:58:42.000Z is found.
Tweet made at: 2013-12-29T16:08:02.000Z is found.
scroll  1
Tweet made at: 2013-12-29T19:56:38.000Z is found.
Tweet made at: 2013-12-29T04:03:29.000Z is found.
Tweet made at: 2013-12-29T16:16:03.000Z is found.
Tweet made at: 2013-12-29T19:32:17.000Z is found.
Tweet made at: 2013-12-29T18:12:10.000Z is found.
Tweet made at: 2013-12-29T19:51:03.000Z is found.
Tweet made at: 2013-12-29T17:19:42.000Z is found.
Tweet made at: 2013-12-29T01:44:39.000Z is found.
Tweet made at: 2013-12-29T18:45:13.000

Tweet made at: 2013-12-31T15:03:37.000Z is found.
scroll  4
Tweet made at: 2013-12-31T14:01:20.000Z is found.
Tweet made at: 2013-12-31T13:30:32.000Z is found.
Tweet made at: 2013-12-31T22:14:31.000Z is found.
Tweet made at: 2013-12-31T17:23:09.000Z is found.
Tweet made at: 2013-12-31T15:01:13.000Z is found.
Tweet made at: 2013-12-31T15:15:45.000Z is found.
Tweet made at: 2013-12-31T18:04:45.000Z is found.
Tweet made at: 2013-12-31T15:26:39.000Z is found.
Tweet made at: 2013-12-31T18:11:02.000Z is found.
Tweet made at: 2013-12-31T20:00:30.000Z is found.
Tweet made at: 2013-12-31T18:02:10.000Z is found.
Tweet made at: 2013-12-31T23:08:43.000Z is found.
Tweet made at: 2013-12-31T15:08:59.000Z is found.
Tweet made at: 2013-12-31T01:46:40.000Z is found.
Tweet made at: 2013-12-31T14:00:05.000Z is found.
Tweet made at: 2013-12-31T16:25:22.000Z is found.
Tweet made at: 2013-12-31T18:48:34.000Z is found.
Tweet made at: 2013-12-31T03:25:35.000Z is found.
scroll  5
Tweet made at: 2013-12-31T19:0

In [45]:
scraped_data.shape

(186, 11)

In [47]:
scraped_data.head()

Unnamed: 0,UserScreenName,UserName,Timestamp,Text,Embedded_text,Emojis,Comments,Likes,Retweets,Image link,Tweet URL
0,"FMC (,)",@FreeMrktCptlst,2013-12-29T06:17:53.000Z,"FMC (,)\n@FreeMrktCptlst\n·\nDec 29, 2013",$AAPL needs to run a contest that awards a $14...,📈 📈,,1,1,[https://pbs.twimg.com/media/Bcoc2YQIMAAwRE6?f...,https://twitter.com/FreeMrktCptlst/status/4171...
1,David Patrick -President- Fitzstock Charts LLC,@Fitzstock2004,2013-12-29T15:10:29.000Z,David Patrick -President- Fitzstock Charts LLC...,$AAPL update http://stks.co/f062X,,,1,4,[],https://twitter.com/Fitzstock2004/status/41731...
2,"Rachel Shasha, MS, MFT",@Sassy_SPY,2013-12-29T02:04:02.000Z,"Rachel Shasha, MS, MFT\n@Sassy_SPY\n·\nDec 28,...",It's Over! Move Along + 1st OPEX of 2014 $SPY ...,,3.0,8,11,[],https://twitter.com/Sassy_SPY/status/417113751...
3,analognotebook,@MonsaludJerry,2013-12-29T23:08:49.000Z,"analognotebook\n@MonsaludJerry\n·\nDec 29, 2013",what will keep pushing forward 2014?\n$TSLA $G...,,1.0,1,2,[],https://twitter.com/MonsaludJerry/status/41743...
4,Daniel Eran Dilger,@DanielEran,2013-12-29T00:34:30.000Z,"Daniel Eran Dilger\n@DanielEran\n·\nDec 28, 2013",Editorial: 2013 was a terrible year for both A...,,9.0,13,8,[],https://twitter.com/DanielEran/status/41709121...


# Cleaning the Data

First, we keep only the `Timestamp` and `Embedded_text` columns.

In [48]:
sd = scraped_data[['Timestamp', 'Embedded_text']]
sd

Unnamed: 0,Timestamp,Embedded_text
0,2013-12-29T06:17:53.000Z,$AAPL needs to run a contest that awards a $14...
1,2013-12-29T15:10:29.000Z,$AAPL update http://stks.co/f062X
2,2013-12-29T02:04:02.000Z,It's Over! Move Along + 1st OPEX of 2014 $SPY ...
3,2013-12-29T23:08:49.000Z,what will keep pushing forward 2014?\n$TSLA $G...
4,2013-12-29T00:34:30.000Z,Editorial: 2013 was a terrible year for both A...
...,...,...
181,2013-12-31T12:33:51.000Z,Summary of Yesterday's Webcast Featuring $AAPL...
182,2013-12-31T23:10:36.000Z,Replying to \n@SconsetCapital
183,2013-12-31T13:06:40.000Z,"Early movers: HTZ, FDX, TWTR, NFLX, AAPL & PSX..."
184,2013-12-31T12:35:32.000Z,"***excellent piece, I agree 100%***\n@jfahmy\n..."


We then split the `Timestamp` column into separate date and time columns, which makes it easier to join with the stock data later.

In [49]:
# sd.iloc[0].Timestamp
sd[['Date', 'Time']] = sd['Timestamp'].str.split('T', expand=True)
sd.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


Unnamed: 0,Timestamp,Embedded_text,Date,Time
0,2013-12-29T06:17:53.000Z,$AAPL needs to run a contest that awards a $14...,2013-12-29,06:17:53.000Z
1,2013-12-29T15:10:29.000Z,$AAPL update http://stks.co/f062X,2013-12-29,15:10:29.000Z
2,2013-12-29T02:04:02.000Z,It's Over! Move Along + 1st OPEX of 2014 $SPY ...,2013-12-29,02:04:02.000Z
3,2013-12-29T23:08:49.000Z,what will keep pushing forward 2014?\n$TSLA $G...,2013-12-29,23:08:49.000Z
4,2013-12-29T00:34:30.000Z,Editorial: 2013 was a terrible year for both A...,2013-12-29,00:34:30.000Z
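
The warning above appears because `sd` was created by selecting columns from `scraped_data`, so pandas cannot tell whether the new columns are meant for `sd` alone or for the original frame as well. One common way to avoid it is to take an explicit `.copy()` before adding columns. The sketch below uses a small made-up DataFrame; the names `raw` and `subset` and the sample rows are illustrative, not the scraped data.

```python
import pandas as pd

# Made-up rows that mimic the shape of the scraped data.
raw = pd.DataFrame({
    "Timestamp": ["2013-12-29T06:17:53.000Z", "2013-12-31T15:03:37.000Z"],
    "Embedded_text": ["first tweet", "second tweet"],
    "Likes": [1, 4],
})

# .copy() creates an independent DataFrame, so adding columns is unambiguous.
subset = raw[["Timestamp", "Embedded_text"]].copy()

# Split "date T time" into two new columns, as in the cell above.
subset[["Date", "Time"]] = subset["Timestamp"].str.split("T", expand=True)
print(subset[["Date", "Time"]])
```

The original `raw` frame is left untouched; only `subset` gains the `Date` and `Time` columns.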


In [50]:
# dropping unnecessary columns and reordering
sd = sd[['Embedded_text', 'Date']]
sd.head()

Unnamed: 0,Embedded_text,Date
0,$AAPL needs to run a contest that awards a $14...,2013-12-29
1,$AAPL update http://stks.co/f062X,2013-12-29
2,It's Over! Move Along + 1st OPEX of 2014 $SPY ...,2013-12-29
3,what will keep pushing forward 2014?\n$TSLA $G...,2013-12-29
4,Editorial: 2013 was a terrible year for both A...,2013-12-29


We then clean the text using the `nltk` library.

In [51]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ksnbx\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ksnbx\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ksnbx\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [52]:
def clean(text):
    wn = nltk.WordNetLemmatizer()
    stopword = nltk.corpus.stopwords.words('english')
    
    # break into tokens
    tokens = nltk.word_tokenize(text)
    
    # lowercase the text
    lower = [word.lower() for word in tokens]
    
    # remove stopwords
    no_stopwords = [word for word in lower if word not in stopword]
    
    # keep only purely alphabetic tokens (drops punctuation, numbers, and symbols)
    alpha_only = [word for word in no_stopwords if word.isalpha()]

    # lemmatize the tokens
    lemm_text = [wn.lemmatize(word) for word in alpha_only]

    return lemm_text

Thank you to [Ona_Gilbert](https://www.kaggle.com/code/onadegibert/sentiment-analysis-with-tfidf-and-random-forest) for pointing us to the `nltk` library.

In [53]:
sd['Cleaned_text'] = sd['Embedded_text'].apply(clean)
sd

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sd['Cleaned_text'] = sd['Embedded_text'].apply(clean)


Unnamed: 0,Embedded_text,Date,Cleaned_text
0,$AAPL needs to run a contest that awards a $14...,2013-12-29,"[aapl, need, run, contest, award, gift, card, ..."
1,$AAPL update http://stks.co/f062X,2013-12-29,"[aapl, update, http]"
2,It's Over! Move Along + 1st OPEX of 2014 $SPY ...,2013-12-29,"[move, along, opex, spy, aapl, amzn, bidu, fb,..."
3,what will keep pushing forward 2014?\n$TSLA $G...,2013-12-29,"[keep, pushing, forward, tsla, goog, fb, twtr,..."
4,Editorial: 2013 was a terrible year for both A...,2013-12-29,"[editorial, terrible, year, apple, competitor,..."
...,...,...,...
181,Summary of Yesterday's Webcast Featuring $AAPL...,2013-12-31,"[summary, yesterday, webcast, featuring, aapl,..."
182,Replying to \n@SconsetCapital,2013-12-31,"[replying, sconsetcapital]"
183,"Early movers: HTZ, FDX, TWTR, NFLX, AAPL & PSX...",2013-12-31,"[early, mover, htz, fdx, twtr, nflx, aapl, psx..."
184,"***excellent piece, I agree 100%***\n@jfahmy\n...",2013-12-31,"[excellent, piece, agree, jfahmy, blog, post, ..."
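
Note that row 1 keeps a stray `http` token: the URL is broken apart during tokenization and only its alphabetic piece survives the filter. One optional refinement, which is not part of the pipeline above, would be to strip URLs before calling `clean`. The sketch below uses only the standard `re` module; `URL_PATTERN` and `strip_urls` are hypothetical names.

```python
import re

# Matches http:// or https:// followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def strip_urls(text):
    """Remove URLs from the text and trim leftover surrounding whitespace."""
    return URL_PATTERN.sub("", text).strip()


print(strip_urls("$AAPL update http://stks.co/f062X"))
```

If adopted, it could be applied to `Embedded_text` before tokenization, for example as `clean(strip_urls(text))`.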


In [54]:
sd['Untokenized_clean'] = sd['Cleaned_text'].map(lambda t: " ".join(t))
sd

Unnamed: 0,Embedded_text,Date,Cleaned_text,Untokenized_clean
0,$AAPL needs to run a contest that awards a $14...,2013-12-29,"[aapl, need, run, contest, award, gift, card, ...",aapl need run contest award gift card throwing...
1,$AAPL update http://stks.co/f062X,2013-12-29,"[aapl, update, http]",aapl update http
2,It's Over! Move Along + 1st OPEX of 2014 $SPY ...,2013-12-29,"[move, along, opex, spy, aapl, amzn, bidu, fb,...",move along opex spy aapl amzn bidu fb goog lnk...
3,what will keep pushing forward 2014?\n$TSLA $G...,2013-12-29,"[keep, pushing, forward, tsla, goog, fb, twtr,...",keep pushing forward tsla goog fb twtr amzn aa...
4,Editorial: 2013 was a terrible year for both A...,2013-12-29,"[editorial, terrible, year, apple, competitor,...",editorial terrible year apple competitor mediu...
...,...,...,...,...
181,Summary of Yesterday's Webcast Featuring $AAPL...,2013-12-31,"[summary, yesterday, webcast, featuring, aapl,...",summary yesterday webcast featuring aapl wynn ...
182,Replying to \n@SconsetCapital,2013-12-31,"[replying, sconsetcapital]",replying sconsetcapital
183,"Early movers: HTZ, FDX, TWTR, NFLX, AAPL & PSX...",2013-12-31,"[early, mover, htz, fdx, twtr, nflx, aapl, psx...",early mover htz fdx twtr nflx aapl psx http
184,"***excellent piece, I agree 100%***\n@jfahmy\n...",2013-12-31,"[excellent, piece, agree, jfahmy, blog, post, ...",excellent piece agree jfahmy blog post everyon...


# Sentiment analysis using VADER

In [55]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\ksnbx\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [56]:
sd['vader_polarity'] = sd['Untokenized_clean'].map(lambda text: sid.polarity_scores(text)['compound'])
sd.head()

Unnamed: 0,Embedded_text,Date,Cleaned_text,Untokenized_clean,vader_polarity
0,$AAPL needs to run a contest that awards a $14...,2013-12-29,"[aapl, need, run, contest, award, gift, card, ...",aapl need run contest award gift card throwing...,0.4215
1,$AAPL update http://stks.co/f062X,2013-12-29,"[aapl, update, http]",aapl update http,0.0
2,It's Over! Move Along + 1st OPEX of 2014 $SPY ...,2013-12-29,"[move, along, opex, spy, aapl, amzn, bidu, fb,...",move along opex spy aapl amzn bidu fb goog lnk...,0.0
3,what will keep pushing forward 2014?\n$TSLA $G...,2013-12-29,"[keep, pushing, forward, tsla, goog, fb, twtr,...",keep pushing forward tsla goog fb twtr amzn aa...,0.3612
4,Editorial: 2013 was a terrible year for both A...,2013-12-29,"[editorial, terrible, year, apple, competitor,...",editorial terrible year apple competitor mediu...,-0.6369
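
The `compound` score stored in `vader_polarity` is a single normalized value between -1 and 1. The notebook uses the raw score, but for reading the table it can help to map it to a label. The sketch below follows the thresholds commonly recommended for VADER (at or above 0.05 is positive, at or below -0.05 is negative); `label_compound` is a hypothetical helper, and the cutoffs are a convention rather than something this notebook relies on.

```python
def label_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Scores taken from the first rows of the table above.
for score in (0.4215, 0.0, -0.6369):
    print(score, label_compound(score))
```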


# Calculating each day's sentiment

In [71]:
byday = sd.groupby('Date')['vader_polarity'].mean()

In [73]:
byday.to_csv("data/twitter_sentiment.csv")
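
To make the grouping step concrete, here is a small sketch on made-up scores (the values are illustrative, not scraped results). Each date's tweets are averaged into one number, giving a Series indexed by `Date`.

```python
import pandas as pd

# Illustrative scores only: two tweets on the first day, one on the second.
scores = pd.DataFrame({
    "Date": ["2013-12-29", "2013-12-29", "2013-12-30"],
    "vader_polarity": [0.5, -0.5, 0.25],
})

# One mean polarity value per calendar day.
daily = scores.groupby("Date")["vader_polarity"].mean()
print(daily)
```

A strongly positive and a strongly negative tweet on the same day cancel out, so a daily mean near zero can mean either neutral tweets or mixed ones.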

# Putting it all into one function
Scraping tweets for our full 10-year period takes a long time, so we wrap the entire process in one function that saves the final daily sentiment data to a CSV file. Each file is one batch: the 10-year period is split into 20 batches of 6 months each.

In [93]:
def twitter_sentiment(begin, end):
    # scrape the data
    scraped_data = scrape(words=KEYWORDS, since=begin, until=end, interval=1,
                          save_images=False, limit=MAX_TWEETS, headless=True, proxy=None, save_dir=ORIGINAL_DIR)

    # keep only relevant columns; .copy() avoids pandas' SettingWithCopyWarning
    sd = scraped_data[['Timestamp', 'Embedded_text']].copy()

    # split into date and time columns
    sd[['Date', 'Time']] = sd['Timestamp'].str.split('T', expand=True)

    # dropping unnecessary columns and reordering
    sd = sd[['Embedded_text', 'Date']].copy()

    # clean the text
    sd['Cleaned_text'] = sd['Embedded_text'].apply(clean)

    # untokenize the text
    sd['Untokenized_clean'] = sd['Cleaned_text'].map(lambda t: " ".join(t))

    # sentiment analysis
    sd['vader_polarity'] = sd['Untokenized_clean'].map(lambda text: sid.polarity_scores(text)['compound'])

    # calculate the mean sentiment by day
    byday = sd.groupby('Date')['vader_polarity'].mean()

    # save the batch, naming the file after the keywords and the date range
    name = 'twitter_sentiment_' + '_'.join(KEYWORDS) + '_' + begin + '_' + end + '.csv'
    byday.to_csv(PROCESSED_DIR + name)

In [99]:
# out of curiosity, time each batch
import time

In [96]:
# the start + end dates of each batch
dates = [("2022-12-31", "2023-01-1")]
dates

[('2022-12-31', '2023-01-1')]
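
The list above holds a single short test batch. For the full run, the 20 six-month batches could be generated rather than typed by hand. The sketch below assumes the 10-year period runs from 2013-01-01 to 2023-01-01, which is an assumption since the exact boundaries are not stated here; `six_month_batches` and `all_batches` are hypothetical names.

```python
from datetime import date


def six_month_batches(start_year, n_batches):
    """Return (begin, end) ISO date strings for consecutive 6-month windows."""
    batches = []
    year, month = start_year, 1
    for _ in range(n_batches):
        begin = date(year, month, 1)
        # advance by six months, rolling over into the next year when needed
        month += 6
        if month > 12:
            month -= 12
            year += 1
        end = date(year, month, 1)
        batches.append((begin.isoformat(), end.isoformat()))
    return batches


# Assumed period: 20 batches starting in January 2013.
all_batches = six_month_batches(2013, 20)
print(all_batches[0], all_batches[-1])
```

Each batch ends on the date the next one begins, so whether a boundary day is counted once or twice depends on how the scraper treats its `until` date.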

In [107]:
# scrape the data for each batch
times = []
for date in dates:
    start = time.time()
    twitter_sentiment(date[0], date[1])
    end = time.time()
    duration = end - start
    times.append(duration)
    print(f"Batch {date} elapsed - {duration} seconds")

('2022-12-31', '2023-01-1')
Batch ('2022-12-31', '2023-01-1') elapsed - 3.000563859939575 seconds
