In [None]:
# Uncomment if nltk and textblob are not installed yet
# !pip install nltk textblob

### Twitter Sentiment  

In the first [notebook](01_tweets_mining.ipynb), we downloaded tweets for different cryptocurrencies. Next, we perform [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) on the tweets to understand the public mood towards each cryptocurrency, and eventually to see whether the sentiments can be used to predict the cryptocurrency's returns.

As a quick start to [Natural Language Processing](https://en.wikipedia.org/wiki/Natural_language_processing), we use [TextBlob](https://textblob.readthedocs.io/en/dev/quickstart.html) to extract the polarity of the tweets. We will revisit polarity later in the notebook.  

Note that we are unable to share the downloaded tweets because of Twitter's restrictions on its data. Hence, this notebook only serves as a reference for how the sentiments can be obtained.

  

In [None]:
import glob
import os
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from textblob import TextBlob

# Download the corpora used by the stop-word list and the lemmatizer.
nltk.download("stopwords")
nltk.download("wordnet")

The mined tweets are split into files by currency, so we use the list below to iterate through the currency files for processing.

In [None]:
currency = [
    "BTC",
    "ETH",
    "USDT",
    "XRP",
    "BCH",
    "ADA",
    "BSV",
    "LTC",
    "LINK",
    "BNB",
    "EOS",
    "TRON",
]

In [None]:
def process_ccy(curr):
    """Load every mined CSV for one currency and save its tweet index."""
    pd_df = pd.concat(
        [pd.read_csv(f) for f in glob.glob(f"clean_data/{curr}*.csv")],
        ignore_index=True,
    )

    print(f"Number of records loaded for {curr}: {len(pd_df)}")

    # Make sure the output folder exists before writing to it.
    os.makedirs("clean_data/ccy_tweets", exist_ok=True)
    pd_df[["coin_symbol", "tweet_id", "created_at", "date", "hour"]].to_csv(
        f"clean_data/ccy_tweets/{curr}_tweets.csv", index=False
    )

    return pd_df
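
To make the file matching concrete, here is a minimal sketch of the glob-and-concatenate step on a throwaway folder. The file names (`BTC_1.csv` and so on) and the rows are made up for illustration; the real mined files are not part of this notebook.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical mini data set: two BTC files and one ETH file.
tmp_dir = tempfile.mkdtemp()
samples = {
    "BTC_1.csv": [{"coin_symbol": "BTC", "tweet_id": 1}],
    "BTC_2.csv": [
        {"coin_symbol": "BTC", "tweet_id": 2},
        {"coin_symbol": "BTC", "tweet_id": 3},
    ],
    "ETH_1.csv": [{"coin_symbol": "ETH", "tweet_id": 4}],
}
for name, rows in samples.items():
    pd.DataFrame(rows).to_csv(os.path.join(tmp_dir, name), index=False)

# Same pattern shape as in process_ccy: every CSV whose name starts with the symbol.
btc_files = sorted(glob.glob(os.path.join(tmp_dir, "BTC*.csv")))
btc_df = pd.concat([pd.read_csv(f) for f in btc_files], ignore_index=True)

print(len(btc_files), len(btc_df))
```

The pattern is a prefix match, so a symbol that is a prefix of another symbol's file name would pick up both sets of files.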

We combine all the downloaded tweets into one dataframe.

In [None]:
combined_df = pd.concat(
    [process_ccy(c) for c in currency],
    ignore_index=True,
)

Since we repeated the data mining multiple times, we expect duplicate tweets. We keep the most recently mined copy of each tweet, as the number of followers and retweets can change between runs.  

In [None]:
combined_df.sort_values(by=["mined_at"], inplace=True, ignore_index=True)
combined_df.drop_duplicates(
    subset=["tweet_id"], inplace=True, keep="last", ignore_index=True
)
combined_df.head(5)
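
As a small illustration of the "keep the latest mined copy" rule, the sketch below applies the same sort-then-deduplicate calls to three made-up rows. The `retweets` column and the values are hypothetical, and `mined_at` is assumed to be stored in a format that sorts chronologically.

```python
import pandas as pd

# Hypothetical rows: tweet 101 was mined twice with different retweet counts.
sample = pd.DataFrame(
    {
        "tweet_id": [101, 102, 101],
        "mined_at": [
            "2021-01-02 10:00:00",
            "2021-01-01 09:00:00",
            "2021-01-03 11:00:00",
        ],
        "retweets": [5, 1, 8],
    }
)

# Sort by mining time, then keep the last (most recent) row per tweet_id.
latest = sample.sort_values(by=["mined_at"], ignore_index=True).drop_duplicates(
    subset=["tweet_id"], keep="last", ignore_index=True
)
print(latest)
```

Only the copy of tweet 101 mined on 2021-01-03 survives, so its retweet count is the most recent one.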

Since `tweet_id` values increase over time (they embed a timestamp), we sort the tweets by id to get them in roughly chronological order, though the order does not matter much here.

In [None]:
combined_df.sort_values(by=["tweet_id"], inplace=True)
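
Sorting by id approximates chronological order because tweet ids issued since late 2010 are Snowflake ids, whose upper bits hold a millisecond timestamp. The sketch below decodes that timestamp, assuming the publicly documented layout (timestamp stored above the lowest 22 bits, offset by the Twitter epoch of 1288834974657 ms). The helper names are introduced here for illustration and are not used elsewhere in the notebook.

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # Snowflake epoch, in ms since the Unix epoch


def snowflake_to_datetime(tweet_id):
    """Decode the creation time embedded in a Snowflake tweet id."""
    timestamp_ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


def datetime_to_snowflake(dt):
    """Build the smallest id that could carry the timestamp `dt` (illustration only)."""
    timestamp_ms = int(dt.timestamp() * 1000)
    return (timestamp_ms - TWITTER_EPOCH_MS) << 22


earlier = datetime_to_snowflake(datetime(2021, 1, 1, tzinfo=timezone.utc))
later = datetime_to_snowflake(datetime(2021, 1, 2, tzinfo=timezone.utc))

print(snowflake_to_datetime(earlier))
print(earlier < later)  # a later timestamp gives a larger id
```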

In [None]:
len(combined_df)

### Preprocessing

We apply only minimal text processing in this notebook to turn the text into a canonical, machine-readable format.  
The text normalization process includes:
- converting all letters to lower case
- splitting the text into word tokens, which also drops punctuation
- removing stop words
- lemmatizing and stemming the words

In [None]:
stop_words = set(stopwords.words("english"))  # a set makes membership tests fast
stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

tokenizer = RegexpTokenizer(r"\w+")
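
To see what the `\w+` pattern keeps and discards, the sketch below uses the standard library's `re.findall`, which should yield the same tokens as `RegexpTokenizer(r"\w+")` for this pattern. The sample tweet is made up.

```python
import re

sample_tweet = "Bitcoin is #ToTheMoon!! https://t.co/abc"

# Lower-case first, then keep only runs of word characters.
tokens = re.findall(r"\w+", sample_tweet.lower())
print(tokens)
# ['bitcoin', 'is', 'tothemoon', 'https', 't', 'co', 'abc']
```

Punctuation and the `#` sign are dropped, while a URL is split into several short tokens rather than removed.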

In [None]:
def preprocess(tokens):
    """Drop stop words, lemmatize, then stem a list of lower-case tokens."""
    words = [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words]
    return " ".join(stemmer.stem(w) for w in words)

In [None]:
combined_df["processed text"] = combined_df["text"].apply(
    lambda x: preprocess(tokenizer.tokenize(x.lower()))
)

### Getting sentiment

We create a TextBlob and obtain the polarity from its [sentiment](https://textblob.readthedocs.io/en/dev/api_reference.html#textblob.blob.TextBlob.sentiment) property.  

In [None]:
def getSentiment(tweet):
    """Return TextBlob's polarity score for a piece of text."""
    analysis = TextBlob(tweet)
    return analysis.sentiment.polarity

In [None]:
combined_df["polarity"] = combined_df["processed text"].apply(getSentiment)

In [None]:
combined_df.head(5)

The polarity is a value between -1.0 and 1.0, where 0 indicates a neutral post, which is likely to come from a bot.  
A negative polarity indicates a negative post and, likewise, a positive polarity indicates a positive mood.  
We create a `sentiment` column to store these classifications.

In [None]:
combined_df["sentiment"] = combined_df["polarity"].apply(
    lambda s: "Positive" if s > 0 else ("Neutral" if s == 0 else "Negative")
)
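
The same mapping can be written as a named function, which makes it easy to check on a few hand-picked scores. `classify` is a name introduced here for illustration, with the labels spelled out in full.

```python
def classify(polarity):
    """Map a polarity score in [-1.0, 1.0] to a sentiment label."""
    if polarity > 0:
        return "Positive"
    if polarity == 0:
        return "Neutral"
    return "Negative"


for score in (0.8, 0.0, -0.35):
    print(score, classify(score))
```

Only a score of exactly 0 is labelled neutral, so even a slightly positive or negative score falls into one of the other two classes.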

In [None]:
combined_df.to_csv("full_tweets.csv", index=False)

  