<a href="https://colab.research.google.com/github/aytimothy/3804ict-fandoms-connect/blob/master/3803ICT_Week_7_Workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
!wget https://aytimothy.xyz/files/uni/2020_t1_3803ict/elonmusk_tweets.csv

!pip install numpy
!pip install pandas
!pip install scipy
!pip install matplotlib
!pip install nltk
!pip install twitter_scraper

--2020-04-15 08:50:32--  https://aytimothy.xyz/files/uni/2020_t1_3803ict/elonmusk_tweets.csv
Resolving aytimothy.xyz (aytimothy.xyz)... 104.27.169.222, 104.27.168.222, 2606:4700:3032::681b:a8de, ...
Connecting to aytimothy.xyz (aytimothy.xyz)|104.27.169.222|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘elonmusk_tweets.csv’

elonmusk_tweets.csv     [<=>                 ]       0  --.-KB/s               elonmusk_tweets.csv     [ <=>                ] 392.65K  --.-KB/s    in 0.01s   

2020-04-15 08:50:32 (33.4 MB/s) - ‘elonmusk_tweets.csv’ saved [402077]

Collecting twitter_scraper
  Downloading https://files.pythonhosted.org/packages/69/d2/080ad55919d547dba653936eb62f1cb25512b9715873a1e69337fb1d5b78/twitter_scraper-0.4.1-py2.py3-none-any.whl
Collecting MechanicalSoup
  Downloading https://files.pythonhosted.org/packages/0b/fe/4f871ec3379080c5979815bfec3266871e555eebf4879f551a7e5dee4766/MechanicalSoup-0.12.0-py2.py3-none-an

In [0]:
import collections
import math
import pandas
import matplotlib.pyplot
import nltk
import nltk.sentiment
import numpy
import operator
import sklearn
import sklearn.feature_extraction.text
import sklearn.model_selection

In [0]:
nltk.download("stopwords")
stopwords = set(nltk.corpus.stopwords.words("english"))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Feature Engineering

First, let's see if we can read the data.

In [0]:
dataframe = pandas.read_csv("elonmusk_tweets.csv")
print(dataframe)

                      id  ...                                               text
0     849636868052275200  ...  b'And so the robots spared humanity ... https:...
1     848988730585096192  ...  b"@ForIn2020 @waltmossberg @mims @defcon_5 Exa...
2     848943072423497728  ...      b'@waltmossberg @mims @defcon_5 Et tu, Walt?'
3     848935705057280001  ...                b'Stormy weather in Shortville ...'
4     848416049573658624  ...  b"@DaveLeeBBC @verge Coal is dying due to nat ...
...                  ...  ...                                                ...
2814  142881284019060736  ...               b'That was a total non sequitur btw'
2815  142880871391838208  ...  b'Great Voltaire quote, arguably better than T...
2816  142188458125963264  ...  b'I made the volume on the Model S http://t.co...
2817  142179928203460608  ...  b"Went to Iceland on Sat to ride bumper cars o...
2818         15434727182  ...  b'Please ignore prior tweets, as that was some...

[2819 rows x 3 columns]


We can see that there are three columns and 2819 tweets.

  * **id** is an integer that uniquely identifies the tweet in Twitter's databases.
  * **created_at** is a formatted timestamp of when the tweet was created.
  * **text** is the string contents of the tweet, stored here as the text of a Python bytes literal (`b'...'`).

You can go back from the ID to the actual Tweet page by substituting it in the URL:

    https://twitter.com/[username]/status/[id]

*Note: It doesn't really matter whose username you put in the username field; Twitter will correct it to the username of whoever posted the tweet with that ID.*

So, for example, Imagineer Co. (Japan) can't claim [this tweet](https://twitter.com/medarotsha/status/1245585645025644554) for themselves:

    https://twitter.com/medarotsha/status/1245585645025644554

(It'll show up as mine.)
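The URL substitution above can be sketched as a tiny helper. The function name `tweet_url` and its default username are made up for illustration; only the URL pattern comes from the text above.

```python
def tweet_url(tweet_id, username="anyone"):
    """Build a tweet URL from its ID.

    The username is only a placeholder here: as noted above, Twitter
    corrects it to the real author when the page is opened.
    """
    return f"https://twitter.com/{username}/status/{tweet_id}"


# First ID from the dataframe printed earlier.
print(tweet_url(849636868052275200, "elonmusk"))
```

Applied to the `id` column, this gives a quick way to jump from a row back to the tweet page.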

Oh, I forgot to mention that there are also the [twitter-scraper](https://pypi.org/project/twitter-scraper/) and [python-twitter](https://github.com/bear/python-twitter) packages for fetching tweets:

    pip install twitter_scraper
    pip install python-twitter

First, let's clean up the data into a form that's readable by `pandas`.

In [0]:
dataframe["created_at"] = pandas.to_datetime(dataframe["created_at"], format="%Y-%m-%d %H:%M:%S")
dataframe["text"] = [str(text[2:-1]) for text in dataframe["text"]]
print(dataframe.dtypes)
print(dataframe)

id                     int64
created_at    datetime64[ns]
text                  object
dtype: object
                      id  ...                                               text
0     849636868052275200  ...  And so the robots spared humanity ... https://...
1     848988730585096192  ...  @ForIn2020 @waltmossberg @mims @defcon_5 Exact...
2     848943072423497728  ...         @waltmossberg @mims @defcon_5 Et tu, Walt?
3     848935705057280001  ...                   Stormy weather in Shortville ...
4     848416049573658624  ...  @DaveLeeBBC @verge Coal is dying due to nat ga...
...                  ...  ...                                                ...
2814  142881284019060736  ...                  That was a total non sequitur btw
2815  142880871391838208  ...  Great Voltaire quote, arguably better than Twa...
2816  142188458125963264  ...  I made the volume on the Model S http://t.co/w...
2817  142179928203460608  ...  Went to Iceland on Sat to ride bumper cars on ...
2818    
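A note on the slicing used above: `text[2:-1]` only removes the `b'` prefix and the closing quote, so escape sequences such as `\xe2\x80\x99` stay in the string as literal characters. That is probably where the `xe2` and `x80` terms in the later term counts come from. A minimal sketch of an alternative, assuming every value is a well-formed Python bytes literal encoded as UTF-8 (the helper name `decode_bytes_literal` is hypothetical):

```python
import ast


def decode_bytes_literal(raw):
    """Turn the text of a bytes literal, e.g. "b'caf\\xc3\\xa9'", into a str."""
    value = ast.literal_eval(raw)              # Safely evaluates the literal only
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return str(value)                          # Already a plain string literal


# Made-up sample in the same shape as the CSV's text column.
sample = "b'It\\xe2\\x80\\x99s working'"
print(decode_bytes_literal(sample))
```

`ast.literal_eval` raises an exception on malformed input, so a real clean-up pass would need to decide what to do with rows that fail to parse.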

## Text Normalization

First of all, there are six possible features in a Tweet:

  * **Text**, which is the primary content of the tweet. This is what we want.
  * **Links**, which are URLs. Twitter loves analytics, so any link you post is always shortened to `https://t.co/[shortlink]` (23 characters long).
  * **Images**, which are also URLs, always of the form `https://pbs.twimg.com/media/[id]`. These are not included in the post's character count, unless posted as a link.
  * **Retweet/Parent Data**, which is appended to the end of a tweet's text and is actually stored as part of the tweet itself. It refers to the tweet's parent tweet or to the tweet being retweeted.
  * **Mentions**, which are pings for other people, beginning with `@`.
  * **Hashtags**, which are topic tags for a tweet. They all begin with `#`.

We can work out:

  * **Text** is anything that does not fit the criteria below.
  * **Links** are anything beginning with `https://t.co/`.
  * **Images** are anything beginning with `https://pbs.twimg.com/media/`.
  * **Retweet/Parent Data** is anything at the end of a post beginning with `https://twitter.com/`.
  * **Mentions** are anything beginning with `@`.
  * **Hashtags**, like Mentions, are anything beginning with `#`.
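The rules above can be sketched as a small token classifier. This is only an illustration: the name `classify_token` and the sample tweet are made up, and the sketch ignores the "at the end of a post" condition for retweet/parent data.

```python
def classify_token(token):
    """Label a whitespace-separated token using the prefix rules listed above."""
    if token.startswith("https://pbs.twimg.com/media/"):
        return "image"
    if token.startswith("https://t.co/"):
        return "link"
    if token.startswith("https://twitter.com/"):
        return "retweet/parent"    # Position in the tweet is not checked here
    if token.startswith("@"):
        return "mention"
    if token.startswith("#"):
        return "hashtag"
    return "text"


# Hypothetical tweet, not taken from the dataset.
sample_tweet = "@verge Coal is dying #energy https://t.co/abc123"
for token in sample_tweet.split(" "):
    print(token, "->", classify_token(token))
```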

In [0]:
def normalize_text(text):
    output = []
    text = text.replace("\r", ' ')                                              # Treat line breaks as spaces
    text = text.replace("\n", ' ')
    for split_string in text.split(' '):
        if split_string.startswith('#'):                                        # Is a hashtag
            continue
        if split_string.startswith('@'):                                        # Is a mention
            continue
        if split_string.startswith("https://t.co"):                             # Is a link
            continue
        if split_string.startswith("https://pbs.twimg.com"):                    # Is a media
            continue
        if split_string.startswith("https://twitter.com"):                      # Is a retweet
            continue
        if split_string.startswith("http://"):                                  # Is a link (also)
            continue
        if split_string.startswith("rt"):                                       # Is a retweet marker (case-sensitive, so "RT" is not caught)
            continue
        split_string = split_string.replace(',', '')
        split_string = split_string.replace('.', '')
        split_string = split_string.replace('!', '')
        split_string = split_string.replace('?', '')
        split_string = split_string.replace('-', '')
        split_string = split_string.replace('_', '')
        split_string = split_string.replace('/', '')
        split_string = split_string.replace('"', '')
        split_string = split_string.replace("'", '')
        if not split_string:                                                    # Did we just strip everything?
            continue
        split_string = split_string.lower()
        if split_string in stopwords:                                           # Is a stop word
            continue
        output.append(split_string)

    return output

In [0]:
print((' ').join(normalize_text("\"Halt!\" said the policeman, \"Stop where you are, thief!\"\nYou are under arrest.")))

halt said policeman stop thief arrest


## Implement TF-IDF

First, we make the word cloud of the top 500 words...

In [0]:
wordcloud = [word for tweet in dataframe["text"] for word in normalize_text(tweet)]
wordcount = collections.Counter(wordcloud)
wordcountlist = wordcount.most_common(500)                                      # (word, count) pairs, most frequent first
keywords = [counttuple[0] for counttuple in wordcountlist]

In [0]:
import math

# Term Frequency: The fraction of the words in the text "document" that are the word "word".
# "word" is a keyword. "document" is an array of words.
def tf(word, document):
    if not document:                                                            # An empty document contains no words
        return 0.0
    count = 0
    for _word in document:
        if _word.lower() == word.lower():
            count += 1
    return count / len(document)

# Inverse Document Frequency: The inverse of "how frequent is a word in all documents"
# "word" is a keyword. "documents" is an array of arrays of words, i.e. an array of documents as passed to tf().
def idf(word, documents):
    count = 0                                                                   # Number of documents containing the word
    for document in documents:
        if any(_word.lower() == word.lower() for _word in document):
            count += 1

    return math.log((1 + len(documents)) / (1 + count))

# Term-Frequency Inverse-Document-Frequency: "How original is a word?"
def tfidf(word, document, documents):
    return tf(word, document) * idf(word, documents)
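As a quick sanity check of the formulas in this cell, here is a self-contained sketch on a made-up three-document corpus. The corpus and the helper names (`term_frequency`, `inverse_document_frequency`, `tf_idf`) are illustrative only; the formulas are the same length-normalized term frequency and smoothed logarithmic inverse document frequency described in the comments above.

```python
import math

# Hypothetical corpus: each document is already a list of normalized words.
toy_documents = [
    ["tesla", "model", "s"],
    ["tesla", "rocket"],
    ["rocket", "launch", "launch"],
]


def term_frequency(word, document):
    # Share of the document's words that are `word`.
    return document.count(word) / len(document)


def inverse_document_frequency(word, documents):
    # Smoothed: log((1 + N) / (1 + number of documents containing the word)).
    containing = sum(1 for document in documents if word in document)
    return math.log((1 + len(documents)) / (1 + containing))


def tf_idf(word, document, documents):
    return term_frequency(word, document) * inverse_document_frequency(word, documents)


launch_score = tf_idf("launch", toy_documents[2], toy_documents)
tesla_score = tf_idf("tesla", toy_documents[0], toy_documents)
print(round(launch_score, 4), round(tesla_score, 4))
```

"launch" appears twice in a single document and nowhere else, so it scores higher than "tesla", which appears once in each of two documents.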

## Compare the results with the reference implementation of scikit-learn library

In [0]:
tfidf = sklearn.feature_extraction.text.TfidfVectorizer(analyzer='word', ngram_range=(1,1), min_df = 1, stop_words = 'english', max_features=500)
features = tfidf.fit(dataframe["text"])
corpus_tf_idf = tfidf.transform(dataframe["text"])
sum_words = corpus_tf_idf.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in tfidf.vocabulary_.items()]
print(sorted(words_freq, key = lambda x: x[1], reverse=True)[:5])
print("tesla", corpus_tf_idf[1, features.vocabulary_["tesla"]])

[('http', 163.54366542841234), ('https', 151.85039944652075), ('rt', 112.61998731390989), ('tesla', 95.96401470715628), ('xe2', 88.20944486346477)]
tesla 0.3495243100660956


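Except this is nonsense, because the top terms (`http`, `https`) come from URLs rather than from the actual text. Let's run it again on the normalized tweets.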

In [0]:
tfidf = sklearn.feature_extraction.text.TfidfVectorizer(analyzer='word', ngram_range=(1,1), min_df = 1, stop_words = 'english', max_features=500)
normalized_tweets = [(' ').join(normalize_text(text)) for text in dataframe["text"]]
features = tfidf.fit(normalized_tweets)
corpus_tf_idf = tfidf.transform(normalized_tweets)
sum_words = corpus_tf_idf.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in tfidf.vocabulary_.items()]
print(sorted(words_freq, key = lambda x: x[1], reverse=True)[:5])
print("tesla", corpus_tf_idf[1, features.vocabulary_["tesla"]])

[('rt', 145.0874794898136), ('tesla', 95.68148712014852), ('model', 80.97774132627796), ('xe2', 73.79066553488956), ('x80', 72.1418978712968)]
tesla 0.3086434846143853
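When comparing numbers, keep in mind that `TfidfVectorizer` does not use quite the same formula as a hand-written `tf * idf`. As far as I understand its default settings (`smooth_idf=True`, `norm='l2'`), it uses raw counts as the term frequency, adds 1 to the smoothed idf, and then scales each document's vector to unit length. A pure-Python sketch of that weighting on a made-up corpus (the function name `sklearn_style_tfidf` is hypothetical, and this is not scikit-learn's actual code):

```python
import math
from collections import Counter


def sklearn_style_tfidf(documents):
    """Weight each document as count * (ln((1 + N) / (1 + df)) + 1), then L2-normalize."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each word occur?
    document_frequency = Counter(word for document in documents for word in set(document))
    rows = []
    for document in documents:
        weights = {
            word: count * (math.log((1 + n_docs) / (1 + document_frequency[word])) + 1)
            for word, count in Counter(document).items()
        }
        norm = math.sqrt(sum(weight ** 2 for weight in weights.values()))
        rows.append({word: weight / norm for word, weight in weights.items()} if norm else {})
    return rows


# Hypothetical corpus of already-normalized documents.
toy_documents = [["tesla", "model", "s"], ["tesla", "rocket"], ["rocket", "launch", "launch"]]
rows = sklearn_style_tfidf(toy_documents)
print(rows[1])
```

Because of the extra `+ 1` and the normalization, scores from this scheme and from a plain `tf * idf` rank words similarly but will not match digit for digit.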


## Apply TF-IDF for Information Retrieval

(Out of Time)

# Sentiment Analysis

In [0]:
nltk.download('vader_lexicon')
nltk.download('movie_reviews')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


True

In [0]:
pos_docs = [(list(nltk.corpus.movie_reviews.words(pos_id)), 'pos') for pos_id in nltk.corpus.movie_reviews.fileids('pos')[:500]]
neg_docs = [(list(nltk.corpus.movie_reviews.words(neg_id)), 'neg') for neg_id in nltk.corpus.movie_reviews.fileids('neg')[:500]]
# X = words (atoms), Y = "neg" / "pos" (result)
pos_xtrain, pos_xtest, pos_ytrain, pos_ytest = sklearn.model_selection.train_test_split(pos_docs, ["pos" for i in range(len(pos_docs))], test_size = 0.2)
neg_xtrain, neg_xtest, neg_ytrain, neg_ytest = sklearn.model_selection.train_test_split(neg_docs, ["neg" for i in range(len(neg_docs))], test_size = 0.2)
x_train = pos_xtrain + neg_xtrain
y_train = pos_ytrain + neg_ytrain
x_test = pos_xtest + neg_xtest
y_test = pos_ytest + neg_ytest

## Classification Approach

In [0]:
from nltk.sentiment.util import extract_unigram_feats, mark_negation

sentim_analyzer = nltk.sentiment.SentimentAnalyzer()
all_words_neg = sentim_analyzer.all_words([mark_negation(doc) for doc in neg_xtrain])
print("Negative Words:")
print(all_words_neg[:5])
all_words_pos = sentim_analyzer.all_words([mark_negation(doc) for doc in pos_xtrain])
print("Positive Words:")
print(all_words_pos[:5])

neg_unigram_feats = sentim_analyzer.unigram_word_feats(all_words_neg, min_freq=4)
print(len(neg_unigram_feats))
sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=neg_unigram_feats)

## Lexical Approach

-0.9472


In [32]:
sid = nltk.sentiment.vader.SentimentIntensityAnalyzer()
for doc in x_train:
    doc = " ".join(doc[0])
    print(doc[:100] + "...")
    ss = sid.polarity_scores(doc)
    for k in sorted(ss):
        print('{0}: {1}, '.format(k, ss[k]), end='')
    if ss["neg"] > ss["pos"]:
        print("NEGATIVE")
    else:
        print("POSITIVE")

what if one of our cities became the target for terror ? what can we do ? what can america really do...
compound: -0.9536, neg: 0.172, neu: 0.664, pos: 0.164, NEGATIVE
the disney studios has its formula for annual , full - length animated features down so pat that it ...
compound: 0.9957, neg: 0.053, neu: 0.768, pos: 0.179, POSITIVE
i think the first thing this reviewer should mention is wether or not i am a fan of the x - files . ...
compound: 0.9986, neg: 0.078, neu: 0.774, pos: 0.148, POSITIVE
as with his other stateside releases , jackie chan ' s latest chopsocky vehicle , mr . nice guy , is...
compound: 0.9965, neg: 0.062, neu: 0.752, pos: 0.186, POSITIVE
" jaws " is a rare film that grabs your attention before it shows you a single image on screen . the...
compound: 0.977, neg: 0.084, neu: 0.809, pos: 0.106, POSITIVE
the " submarine " genre of movies seems to be one of the most intriguing and compelling types of sto...
compound: 0.9584, neg: 0.053, neu: 0.85, pos: 0.098, POSITIVE
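The loop above labels a review by comparing the `neg` and `pos` scores. Another option is to threshold the `compound` score instead. The sketch below assumes the commonly cited VADER convention of ±0.05 as the cut-off; that threshold and the helper name `label_from_vader_scores` are assumptions, not something taken from this notebook.

```python
def label_from_vader_scores(scores, threshold=0.05):
    """Label a polarity_scores() style dict by its compound score.

    The +/-0.05 default is an assumed convention, not a tuned value.
    """
    compound = scores["compound"]
    if compound >= threshold:
        return "POSITIVE"
    if compound <= -threshold:
        return "NEGATIVE"
    return "NEUTRAL"


# Scores copied from the first review in the output above.
example_scores = {"compound": -0.9536, "neg": 0.172, "neu": 0.664, "pos": 0.164}
print(label_from_vader_scores(example_scores))
```

Unlike the `neg > pos` comparison, this variant can also return a neutral label when the compound score sits close to zero.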

## Comparing the two approaches

Both approaches seem able to distinguish negative from positive reviews somewhat accurately.

However, this is a classic example of the combined knowledge of people versus the auto-generation of an algorithm. The lexicon version is only good because many painstaking man-hours have been spent classifying words as "positive", "negative" or "neutral".