<center><h1>BUSS6002 - Data Science in Business</h1></center>

#### Pre-Tutorial Checklist

1. Complete Task 1 and Task 2 from Week 10

# Tutorial 11 - Text Analytics

## Text Analytics

Often the data that we need to analyse is textual. Text data is incompatible with the models we have discussed so far because those models require numeric values. For example, there is no direct numeric distance between the words "hello" and "friend".

Moreover, a single data point such as a tweet or a Facebook post can contain a large number of words. So we need a way to convert text data into a flexible numeric representation. This representation should tell us which words occurred and how many times.


## Bag-of-Words

The bag-of-words (BoW) model is a simple method of transforming strings into a numeric representation. BoW treats each word as a feature, and the value of the feature is the number of times the word occurs.


For example the string

    "The quick brown fox jumps over the lazy dog"
    
would be transformed into

| the | quick | brown | fox | jumps | over | lazy | dog |
|---|---|---|---|---|---|---|---|
| 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

Note that:
- the word "the" occurs twice so its count is 2
- other words are unique so they only occur once
- the number of features is the number of unique words

To create a BoW set of features we can use scikit-learn's `CountVectorizer`:

<div style="margin-bottom: 0px;"><img width=20 style="display: block; float: left;  margin-right: 20px;" src="img/docs.png"> <h3 style="padding-top: 0px;">Documentation - CountVectorizer</h3></div>
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()

corpus = ["The quick brown fox jumps over the lazy dog",
          "Another dog ran after the fox"]

X = count_vectorizer.fit_transform(corpus)

X is a sparse matrix. To view it we need to convert it to a dense matrix

In [2]:
print(X.todense())

[[0 0 1 1 1 1 1 1 1 0 2]
 [1 1 0 1 1 0 0 0 0 1 1]]


In [4]:
print(X)
X.shape

  (0, 3)	1
  (0, 6)	1
  (0, 7)	1
  (0, 5)	1
  (0, 4)	1
  (0, 2)	1
  (0, 8)	1
  (0, 10)	2
  (1, 0)	1
  (1, 9)	1
  (1, 1)	1
  (1, 3)	1
  (1, 4)	1
  (1, 10)	1


(2, 11)
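
Each `(row, column)  value` line printed above is one non-zero count; zeros are simply not stored. As a rough illustration of that idea (SciPy's actual storage format differs in detail), the dense matrix can be turned into a dictionary of non-zero cells. The variable names here are chosen for this example only.

```python
# Dense counts copied from the output above (2 documents, 11 words).
dense = [
    [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 2],
    [1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1],
]

# Keep only the non-zero cells as {(row, column): count}.
sparse = {
    (row, col): value
    for row, values in enumerate(dense)
    for col, value in enumerate(values)
    if value != 0
}

total_cells = len(dense) * len(dense[0])
print(len(sparse), "of", total_cells, "cells are non-zero")
print(sparse[(0, 10)])  # count of word 10 in document 0
```

With a realistic vocabulary of thousands of words, almost every cell is zero, which is why the sparse form saves so much memory.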

The matrix isn't that helpful by itself. Let's look at the corresponding feature names (words).

In [5]:
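# Note: scikit-learn 1.0 added get_feature_names_out(), and version 1.2 removed get_feature_names()
print(count_vectorizer.get_feature_names())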

['after', 'another', 'brown', 'dog', 'fox', 'jumps', 'lazy', 'over', 'quick', 'ran', 'the']


Let's combine everything into a DataFrame for clarity. You don't always have to do this.

In [7]:
import pandas as pd

# Transform the features into a pandas DataFrame
features = pd.DataFrame(X.todense(), columns = count_vectorizer.get_feature_names())

print(features)

corpus = ["The quick brown fox jumps over the lazy dog",
          "Another dog ran after the fox"]

print(corpus[0])
print(corpus[1])


# Recap:
# 1. import: from sklearn.feature_extraction.text import CountVectorizer
# 2. fit and transform: X = count_vectorizer.fit_transform(corpus)
# 3. print the features stored in X

   after  another  brown  dog  fox  jumps  lazy  over  quick  ran  the
0      0        0      1    1    1      1     1     1      1    0    2
1      1        1      0    1    1      0     0     0      0    1    1
The quick brown fox jumps over the lazy dog
Another dog ran after the fox


In [None]:
# e.g. to calculate tf-idf for "brown" in the two-document corpus above
# tf_brown_1 = 1
# tf_brown_2 = 0

# idf_brown = log(2/1) = log(2)

# tf_idf_brown_1 = 1 * log(2)
# tf_idf_brown_2 = 0 * log(2)

# this example uses the simplest version of tf-idf: Boolean counting for tf and log(N/n_D) for idf


## TF-IDF

In text data there will be lots of repeated words such as "a", "is" and "the" that aren't very useful. We should ignore them as much as possible.

Term Frequency–Inverse Document Frequency (TF-IDF) is a weighting procedure for BoW data. The TF-IDF weights boost the counts or frequencies of uncommon words (which tend to be more informative) and shrink the magnitude of common words. There are two components to the TF-IDF weights, and each of these can be calculated in different ways:

- Term Frequency, often the _raw count_ of a term in a document $tf = f_D$. Other possibilities are boolean (1 if the term appears, otherwise 0), length adjusted ($tf = \frac{f_D}{n_{words}}$) or logarithmic ($tf = \log(1+f_D)$).

- Inverse Document Frequency, or a measure of the information contained in a word. This is a penalty for commonly used words like 'a' and 'the'. It's the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient). $idf = \log(\frac{N}{n_D})$ where $N$ is the number of documents and $n_D$ is the number of documents in which the word appears.

The TF-IDF score is calculated as follows:

$$tfidf = tf \cdot idf $$
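
The definitions above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than how scikit-learn does it: tokenisation is a naive lower-case whitespace split, the logarithm is the natural log, and the helper names `idf` and `tf_variants` are invented for this example.

```python
import math

corpus = ["The quick brown fox jumps over the lazy dog",
          "Another dog ran after the fox"]

# Naive tokenisation (an assumption): lower-case and split on whitespace.
docs = [doc.lower().split() for doc in corpus]

def idf(term):
    """log(N / n_D), natural log; the term must occur in at least one document."""
    n_docs_with_term = sum(term in doc for doc in docs)
    return math.log(len(docs) / n_docs_with_term)

def tf_variants(term, doc):
    """Return the four term-frequency options listed above."""
    f = doc.count(term)
    return {
        "raw": f,
        "boolean": 1 if f > 0 else 0,
        "length_adjusted": f / len(doc),
        "log": math.log(1 + f),
    }

tf_the = tf_variants("the", docs[0])
print(tf_the)
print(idf("brown"))                                             # log(2/1)
print(tf_variants("brown", docs[0])["boolean"] * idf("brown"))  # Boolean tf-idf
print(idf("the"))                                               # log(2/2)
```

Because "the" appears in every document its idf is zero, so any choice of tf gives it a TF-IDF weight of zero under this simple formula.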

## TF-IDF in Sklearn

Let's vectorise a collection of documents. Notice that each line is treated as a document in this case, so our corpus contains 4 documents.

<div style="margin-bottom: 0px;"><img width=20 style="display: block; float: left;  margin-right: 20px;" src="img/docs.png"> <h3 style="padding-top: 0px;">Documentation - TfidfVectorizer</h3></div>
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

In [9]:
corpus = ["The quick brown fox jumps over the lazy dog",
          "Another dog ran after the fox",
          "The world is turning",
          "Hello world"
         ]

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectoriser = TfidfVectorizer() # default settings apply idf smoothing and L2 row normalisation, so the values differ from the simple formula above

X = tfidf_vectoriser.fit_transform(corpus)

In [11]:
features = pd.DataFrame(X.todense(), columns = tfidf_vectoriser.get_feature_names())

features

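| | after | another | brown | dog | fox | hello | is | jumps | lazy | over | quick | ran | the | turning | world |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.356398 | 0.280988 | 0.280988 | 0.0 | 0.0 | 0.356398 | 0.356398 | 0.356398 | 0.356398 | 0.0 | 0.454968 | 0.0 | 0.0 |
| 1 | 0.463709 | 0.463709 | 0.0 | 0.365594 | 0.365594 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.463709 | 0.29598 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.57458 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.366747 | 0.57458 | 0.453005 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.785288 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.61913 |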
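
These numbers do not match the plain $tf \cdot idf$ formula because `TfidfVectorizer`, with its default settings, uses raw counts for tf, a smoothed idf of $\ln\frac{1+N}{1+n_D} + 1$, and then rescales each row to unit Euclidean length. The sketch below tries to reproduce the last row ("Hello world") by hand using only the `math` module. The document frequencies are read off the four-document corpus, and the variable names are chosen for this example.

```python
import math

N = 4                                 # documents in the corpus
doc_freq = {"hello": 1, "world": 2}   # number of documents containing each word
counts = {"hello": 1, "world": 1}     # raw counts in "Hello world"

# Smoothed idf: ln((1 + N) / (1 + n_D)) + 1
idf = {w: math.log((1 + N) / (1 + n)) + 1 for w, n in doc_freq.items()}

# Raw tf * idf, then divide by the row's Euclidean (L2) norm
raw = {w: counts[w] * idf[w] for w in counts}
norm = math.sqrt(sum(v * v for v in raw.values()))
weights = {w: v / norm for w, v in raw.items()}

print(round(weights["hello"], 6), round(weights["world"], 6))
```

The printed values should agree with the `hello` and `world` entries in row 3 of the table.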


<div style="margin-bottom: 30px;"><img width=48 style="display: block; float: left;  margin-right: 20px;" src="img/question-mark-button.png"> <h3 style="padding-top: 15px;">Exercise 1 - TF-IDF weights calculation</h3></div>

In the corpus with two documents:

``On the 24th of February, 1815, the look–out at Notre–Dame de la Garde signalled the three–master, the Pharaon from Smyrna, Trieste, and Naples.``

``As usual, a pilot put off immediately, and rounding the Chateau d’If, got on board the vessel between Cape Morgion and Rion island.``

Find the TF-IDF weights for the words ``the`` and ``Trieste`` by code or by hand. For term frequency, try Boolean weights.

In [None]:
# try to work it out by hand
# N = 2
# we use Boolean weights: tf is 1 if the word appears in the document, otherwise 0

## Solution:

For ``the``: $tf_1 = 1$, $tf_2 = 1$, $idf = \log(2/2) = 0$

So for both documents, $tfidf = 0$

For ``Trieste``: $tf_1 = 1$, $tf_2 = 0$, $idf = \log(2/1) \approx 0.69$ (using the natural logarithm)

So for document 1, $tfidf \approx 0.69$; for document 2, $tfidf = 0$.
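
The same answer can be checked by code. The following is a minimal sketch, assuming a simple tokeniser that lower-cases the text and keeps runs of letters only; the function name `boolean_tfidf` is invented for this example.

```python
import math
import re

docs = [
    "On the 24th of February, 1815, the look–out at Notre–Dame de la Garde "
    "signalled the three–master, the Pharaon from Smyrna, Trieste, and Naples.",
    "As usual, a pilot put off immediately, and rounding the Chateau d’If, "
    "got on board the vessel between Cape Morgion and Rion island.",
]

# Assumed tokenisation: lower-case, keep runs of the letters a-z only.
tokens = [set(re.findall(r"[a-z]+", doc.lower())) for doc in docs]

def boolean_tfidf(term):
    """Boolean tf times log(N / n_D) for each document."""
    n_docs_with_term = sum(term in doc for doc in tokens)
    idf = math.log(len(tokens) / n_docs_with_term)
    return [(1 if term in doc else 0) * idf for doc in tokens]

print(boolean_tfidf("the"))      # idf = log(2/2)
print(boolean_tfidf("trieste"))  # idf = log(2/1)
```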

In [12]:
# TASK 

import pandas as pd

# Load the Obama tweets set
obama = pd.read_csv("BarackObama.csv")

obama.head()

Unnamed: 0.1,Unnamed: 0,date,id,link,retweet,text,author
0,0,20h20 hours ago,786982739517943808,/BarackObama/status/786982739517943808,False,Denying climate change is dangerous. Join @OFA...,BarackObama
1,1,18h18 hours ago,787010142378332160,/BarackObama/status/787010142378332160,False,The American Bar Association gave Judge Garlan...,BarackObama
2,2,16h16 hours ago,787039774330748928,/BarackObama/status/787039774330748928,False,We need a fully functional Supreme Court. Edit...,BarackObama
3,3,21h21 hours ago,786964419905523712,/BarackObama/status/786964419905523712,False,"Cynics, take note: When we #ActOnClimate, we b...",BarackObama
4,4,Oct 13,786680553617629184,/BarackObama/status/786680553617629185,False,"""That’s how we will overcome the challenges we...",BarackObama


In [13]:
# Load the trump tweets set
trump = pd.read_csv("DonaldTrumpTweets.csv")

trump.head()

Unnamed: 0.1,Unnamed: 0,date,id,link,retweet,text,author
0,0,Oct 7,784609194234306560,/realDonaldTrump/status/784609194234306560,False,Here is my statement.pic.twitter.com/WAZiGoQqMQ,DonaldTrump
1,1,Oct 10,785608815962099712,/realDonaldTrump/status/785608815962099712,False,Is this really America? Terrible!pic.twitter.c...,DonaldTrump
2,2,Oct 8,784840992734064640,/realDonaldTrump/status/784840992734064641,False,The media and establishment want me out of the...,DonaldTrump
3,3,Oct 8,784767399442653184,/realDonaldTrump/status/784767399442653184,False,Certainly has been an interesting 24 hours!,DonaldTrump
4,4,Oct 10,785561269571026944,/realDonaldTrump/status/785561269571026946,False,Debate polls look great - thank you!\n#MAGA #A...,DonaldTrump


In [19]:
# Let's apply a tf-idf transformation to Obama's tweets to check for any
# weird words/phrases that might cause problems
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = obama['text']

test_vectorizer = TfidfVectorizer(stop_words = 'english')
tfidf = test_vectorizer.fit_transform(corpus)
print(tfidf.shape)
# print(test_vectorizer.get_feature_names())

(6896, 11064)
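
The `stop_words = 'english'` argument tells the vectoriser to drop common English words before counting. The idea can be sketched as follows; the tiny stop-word list and the helper `remove_stop_words` are hypothetical and only for illustration, since scikit-learn's built-in list is much longer.

```python
# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}

def remove_stop_words(text):
    """Lower-case, split on whitespace and drop the stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

kept = remove_stop_words("The world is turning")
print(kept)
```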


In [22]:
import re
# Remove links; the dots are escaped so that they only match a literal "."
corpus = corpus.apply(lambda x: re.sub(r"http\S+|pic\.twitter\.com\S+", '', x, flags=re.IGNORECASE))

# tf-idf
test_vectorizer = TfidfVectorizer(stop_words = 'english')
tfidf = test_vectorizer.fit_transform(corpus)
print(tfidf.shape)


(6896, 6646)


In [23]:
import re

corpus = trump['text']

# Remove links and @-mentions; the dots are escaped so that they only match a literal "."
corpus = corpus.apply(lambda x: re.sub(r"http\S+|pic\.twitter\.com\S+|@\S+", '', x, flags=re.IGNORECASE))

test_vectorizer = TfidfVectorizer(stop_words = 'english')
tfidf = test_vectorizer.fit_transform(corpus)

print(tfidf.shape)

(17216, 12482)
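
To see what the cleaning step does, it can help to run the pattern on a few strings. In the sketch below the first sample is the first Trump tweet shown earlier; the other two are made up for illustration.

```python
import re

# Links starting with "http", picture links, and @-mentions are removed.
# The dots are escaped so that they only match a literal "." character.
pattern = re.compile(r"http\S+|pic\.twitter\.com\S+|@\S+", flags=re.IGNORECASE)

samples = [
    "Here is my statement.pic.twitter.com/WAZiGoQqMQ",  # from the data above
    "Read more at https://example.com/page today",       # made-up tweet
    "Thanks @someone for the support",                   # made-up tweet
]

cleaned = [pattern.sub("", s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Note that removing a token from the middle of a sentence leaves two consecutive spaces behind. This is harmless here because the vectoriser ignores whitespace when it splits text into words.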


In [35]:
# Mark which tweets belong to which person
# print(trump.head())
trump['class'] = 1 # tweets written by Trump are assigned to class 1
obama['class'] = 0 # tweets written by Obama are assigned to class 0

print(trump[['text', 'class']].values.shape)
print(obama[['text', 'class']].values.shape)


# Combine the trump and obama tweets into a single dataframe
data = pd.concat( [trump[['text', 'class']], obama[['text', 'class']]], axis = 0 )
print(data.values.shape)

(17216, 2)
(6896, 2)
(24112, 2)


In [36]:
data.head()
# the raw (unprocessed) text feature and the class label

| | text | class |
|---|---|---|
| 0 | Here is my statement.pic.twitter.com/WAZiGoQqMQ | 1 |
| 1 | Is this really America? Terrible!pic.twitter.c... | 1 |
| 2 | The media and establishment want me out of the... | 1 |
| 3 | Certainly has been an interesting 24 hours! | 1 |
| 4 | Debate polls look great - thank you!\n#MAGA #A... | 1 |


In [37]:
# Select ALL the tweets
corpus = data['text']

# Clean the tweets by removing links and @-mentions
corpus = corpus.apply(lambda x: re.sub(r"http\S+|pic\.twitter\.com\S+|@\S+", '', x, flags=re.IGNORECASE))

# Create a TfidfVectorizer that ignores English stop words
tfidf_vectorizer = TfidfVectorizer(stop_words = 'english')

# Transform the raw text into tf-idf features
tfidf_data = tfidf_vectorizer.fit_transform(corpus)

In [43]:
from sklearn.model_selection import train_test_split

# Train/Test split the dataset
Xtrain, Xtest, ytrain, ytest = train_test_split(tfidf_data, data['class'])

# X holds the features, y holds the labels
print(Xtrain.shape)
print(ytrain.shape)

# we have 18084 rows of training data

print(Xtest.shape)
print(ytest.shape)

# and we evaluate on the 6028 rows of test data

(18084, 14595)
(18084,)
(6028, 14595)
(6028,)


In [44]:
from sklearn.linear_model import LogisticRegression

# create the model
log_reg = LogisticRegression()

# fit the model to the training data
log_reg.fit(Xtrain, ytrain)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [45]:
# evaluate on the test set
from sklearn.metrics import classification_report

ytest_preds = log_reg.predict(Xtest)

print(classification_report(ytest, ytest_preds))

             precision    recall  f1-score   support

          0       0.93      0.77      0.85      1766
          1       0.91      0.98      0.94      4262

avg / total       0.92      0.92      0.92      6028



In [46]:
from sklearn.metrics import confusion_matrix

confusion_matrix(ytest, ytest_preds)

array([[1365,  401],
       [  96, 4166]], dtype=int64)
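
The classification report and the confusion matrix describe the same predictions. In the matrix returned by `confusion_matrix`, rows are the true classes and columns are the predicted classes, so the precision, recall and F1 figures can be recomputed from the four counts. The sketch below does this in plain Python; the helper name `precision_recall_f1` is chosen for this example.

```python
# Confusion matrix from the output above: rows are true classes,
# columns are predicted classes (0 = Obama, 1 = Trump).
cm = [[1365, 401],
      [96, 4166]]

def precision_recall_f1(cm, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    true_pos = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column total
    actual_k = sum(cm[k])                    # row total
    precision = true_pos / predicted_k
    recall = true_pos / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
for k in (0, 1):
    p, r, f = precision_recall_f1(cm, k)
    print(k, round(p, 2), round(r, 2), round(f, 2))
print("accuracy", round(accuracy, 3))
```

The lower recall for class 0 reflects the 401 Obama tweets that were predicted as Trump tweets. The test set contains far more Trump tweets (4262) than Obama tweets (1766), which is worth keeping in mind when reading the averaged scores.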