![UKDS Logo](images/UKDS_Logos_Col_Grey_300dpi.png)

# Text-mining: Classifiers and sentiment analysis

Welcome to this <a href="https://ukdataservice.ac.uk/" target=_blank>UK Data Service</a> *Computational Social Science* training series! 

Our various *Computational Social Science* training series guide you through some of the popular and useful computational techniques, tools, methods and concepts that social science researchers might want to use. For example, the series cover collecting data from websites and social media platforms, working with text data, conducting simulations (agent based modelling), and more. They include recorded video webinars, interactive notebooks containing live programming code, reading lists and more.

* To access training materials on our GitHub site: <a href="https://github.com/UKDataServiceOpen/computational-social-science" target=_blank>[Training Materials]</a>

* To keep up to date with upcoming and past training events: <a href="https://ukdataservice.ac.uk/news-and-events/events" target=_blank>[Events]</a>

* To get in contact with feedback, ideas or to seek assistance: <a href="https://ukdataservice.ac.uk/help.aspx" target=_blank>[Help]</a>

<a href="https://www.research.manchester.ac.uk/portal/julia.kasmire.html" target=_blank>Dr J. Kasmire</a>  <br />
UK Data Service  <br />
University of Manchester <br />

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span><ul class="toc-item"><li><span><a href="#Guide-to-using-this-resource" data-toc-modified-id="Guide-to-using-this-resource-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Guide to using this resource</a></span></li><li><span><a href="#Interaction" data-toc-modified-id="Interaction-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Interaction</a></span></li><li><span><a href="#Learn-more" data-toc-modified-id="Learn-more-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Learn more</a></span></li></ul></li><li><span><a href="#Sentiment-Analysis-as-an-example-of-machine-learning/deep-learning-classification" data-toc-modified-id="Sentiment-Analysis-as-an-example-of-machine-learning/deep-learning-classification-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Sentiment Analysis as an example of machine learning/deep learning classification</a></span></li><li><span><a href="#Analyse-trivial-documents-with-built-in-sentiment-analysis-tool" data-toc-modified-id="Analyse-trivial-documents-with-built-in-sentiment-analysis-tool-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Analyse trivial documents with built-in sentiment analysis tool</a></span></li><li><span><a href="#Acquire-and-analyse-trivial-documents" data-toc-modified-id="Acquire-and-analyse-trivial-documents-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Acquire and analyse trivial documents</a></span></li><li><span><a href="#Train-and-test-a-sentiment-analysis-tool-with-trivial-data" data-toc-modified-id="Train-and-test-a-sentiment-analysis-tool-with-trivial-data-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Train and test a sentiment analysis tool with trivial data</a></span></li><li><span><a href="#You-can-train-and-test-a-sentiment-analysis-tool-with-more-interesting-data-too..." 
data-toc-modified-id="You-can-train-and-test-a-sentiment-analysis-tool-with-more-interesting-data-too...-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>You can train and test a sentiment analysis tool with more interesting data too...</a></span></li><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Further-reading-and-resources" data-toc-modified-id="Further-reading-and-resources-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Further reading and resources</a></span></li></ul></div>


There is a table of contents provided here at the top of the notebook, but you can also access this menu at any point by clicking the Table of Contents button on the top toolbar (an icon with four horizontal bars; if unsure, hover your mouse over the buttons). 

## Introduction

Sentiment analysis is a commonly used example of automatic classification. To be clear, automatic classification means that a model or learning algorithm has been trained on correctly classified documents and it uses this training to return a probability assessment of what class a new document should belong to. 

Sentiment analysis works the same way, but usually only has two classes - positive and negative. A trained model looks at new data and says whether that new data is likely to be positive or negative. Let's take a look!

### Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*Sentiment Analysis as an example of machine learning/deep learning classification*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [None]:
print("Enter your name and press enter:")
name = input()
print("\r")
print("Hello {}, enjoy learning more about Python and computational social science!".format(name)) 

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool.

## Sentiment Analysis as an example of machine learning/deep learning classification

Let's start off by importing and downloading some useful packages, including `textblob`: it is based on `nltk` and has built-in sentiment analysis tools. 

To import the packages, click in the code cell below and either hit the 'Run' button at the top of this page or hold down the 'Shift' key and hit the 'Enter' key. 

For the rest of this notebook, I will use 'Run/Shift+Enter' as shorthand for 'click in the code cell below and either hit the 'Run' button at the top of this page or hold down the 'Shift' key while hitting the 'Enter' key'. 

Run/Shift+Enter.

In [None]:
import os                         # os is a module for navigating your machine (e.g., file directories).
import nltk                       # nltk stands for natural language tool kit and is useful for text-mining. 
import csv                        # csv is for importing and working with csv files
import statistics

# List all of the files in the "data" folder that is provided to you

for file in os.listdir("./data/sentiment-analysis"):
   print("A file we can use is... ", file)
print("")

In [None]:
!pip install -U textblob -q
!python -m textblob.download_corpora -q
from textblob import TextBlob

## Analyse trivial documents with built-in sentiment analysis tool

Now, let's get some data.

Run/Shift+Enter, as above!

In [None]:
Doc1 = TextBlob("Textblob is just super. I love it!")             # Convert a few basic strings into Textblobs 
Doc2 = TextBlob("Cabbages are the worst. Say no to cabbages!")    # Textblobs, like other text-mining objects, are often called
Doc3 = TextBlob("Paris is the capital of France. ")               # 'documents'
print("...")
type(Doc1)

Docs 1 through 3 are TextBlob objects, which we can see from the output of `type(Doc1)`. 

We get a TextBlob object by passing a string to the `TextBlob` class that we imported above. Specifically, this is done by using this format --> TextBlob('string goes here'). TextBlob objects are ready for analysis through the `textblob` tools, such as the built-in sentiment analysis tool that we see in the code below. 

Run/Shift+Enter on those Textblobs.

In [None]:
print(Doc1.sentiment)
print(Doc2.sentiment)
print(Doc3.sentiment)

The previous code returns two values for each TextBlob object. Polarity refers to a positive-negative spectrum that runs from -1 (negative) to 1 (positive), while subjectivity refers to a fact-opinion spectrum that runs from 0 (factual) to 1 (opinion). 

We can see, for example, that Doc1 is fairly positive but also quite subjective while Doc2 is very negative and very subjective. Doc3, in contrast, is both neutral and factual. 
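
One way to read these numbers is to turn them into words. The sketch below is a minimal, standard-library-only illustration: `Sentiment` is a stand-in namedtuple that mirrors the two fields `textblob` reports, and the `describe` helper and its cut-off values (0.1 and 0.5) are hypothetical choices made for this example, not part of `textblob`.

```python
from collections import namedtuple

# Stand-in for the (polarity, subjectivity) pair that textblob reports.
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])

def describe(sentiment, neutral_band=0.1, subjective_cutoff=0.5):
    """Turn a (polarity, subjectivity) pair into a short verbal label.

    The thresholds are illustrative assumptions, not textblob defaults.
    """
    if sentiment.polarity > neutral_band:
        tone = "positive"
    elif sentiment.polarity < -neutral_band:
        tone = "negative"
    else:
        tone = "neutral"
    stance = "subjective" if sentiment.subjectivity >= subjective_cutoff else "factual"
    return tone + " and " + stance

# Made-up scores, chosen only to exercise each branch.
print(describe(Sentiment(0.5, 0.6)))
print(describe(Sentiment(-1.0, 1.0)))
print(describe(Sentiment(0.0, 0.0)))
```

Where you draw the line between 'neutral' and 'positive' is a judgement call, so treat the thresholds as something to tune for your own data.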

Maybe you don't need both polarity and subjectivity. For example, if you are trying to categorise opinions, you don't need the subjectivity score and would only want the polarity. 

To get only one of the two values, you can call the appropriate sub-function (strictly speaking, an attribute of the sentiment result) as shown below. 

Run/Shift+Enter for sub-functional fun. 

In [None]:
print(Doc1.sentiment.polarity)
print(Doc1.sentiment.subjectivity)

## Acquire and analyse trivial documents

Super. We have imported some documents (in our case, just sentences in string format) to `textblob` and analysed them using the built-in sentiment analyser. But we don't want to import documents one string at a time...that would take forever!

Let's import data in .csv format instead! The data here comes from a set of customer reviews of Amazon products. Naturally, not all of the comments in the product reviews are really on topic, but it does not actually matter for our purposes. But, I think it is only fair to warn you...there is some foul language and potentially objectionable personal opinions in the texts if you go through it all. 

Run/Shift+Enter (if you dare!)

In [None]:
with open('./data/sentiment-analysis/training_set.csv', newline='', encoding = 'ISO-8859-1') as f:  # Import a csv of scored "product reviews"
    reader = csv.reader(f)
    Doc_set = list(reader)

print(Doc_set[45:55])  # Look at a subset of the imported data

A very good start (although you will see what I mean about the off-topic comments and foul language). 

Now, the .csv file has multiple strings per row, the first of which we want to pass to `textblob` to create a TextBlob object. The second is a number representing the class that the statement belongs to. '4' represents positive, '2' represents neutral and '0' represents negative. Don't worry about this for now as we will come to that in a moment. 
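
To make the row layout concrete, here is a small sketch that reads rows of the same shape from an in-memory string rather than from the real file. The two sample rows and the `LABEL_NAMES` lookup are invented for illustration and are not taken from the data set.

```python
import csv
import io

# Two invented rows in the same layout as the file: text first, class second.
sample = '"I love this, it works great",4\n"Broke after one day",0\n'

# Hypothetical lookup from the class strings to readable names.
LABEL_NAMES = {"4": "positive", "2": "neutral", "0": "negative"}

rows = list(csv.reader(io.StringIO(sample, newline="")))

for text, label in rows:
    # csv.reader returns every field as a string, including the class.
    print(LABEL_NAMES[label], "->", text)
```

Note that the class comes back as the string '4' rather than the number 4, which matters when you compare it with anything later on.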

The code below creates a new list that has the text string, the class string and the sentiment score for each item in the imported Doc_set, and also shows you a ten-item slice of that new list (items 45 to 54) to look at. 

Run/Shift+Enter

In [None]:
Doc_set_analysed = []

for item in Doc_set:
    Doc_set_analysed.append([item[0], item[1], TextBlob(item[0]).sentiment])

print(Doc_set_analysed[45:55])

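Now, try editing a copy of the code above so that the new list only has the text string, the number string and the TextBlob polarity. (Leave Doc_set_analysed itself as it is, because the cells below read the polarity from the full sentiment result with `item[2].polarity`.) 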

We will want to use that to get a sense of whether the polarity judgements are accurate or not. Thus, we want to know whether the judgement assigned to each statement (the '4', '2' or '0') matches with the polarity assigned by the `textblob` sentiment analyser. 

To do this, we need to convert the second item (the '4', '2' or '0') to a 1, 0 or -1 to match what we get back from the sentiment analyser, compare them to find the difference and then find the average difference. 

Run/Shift+Enter. 

In [None]:
Doc_set_polarity_accuracy = []

for item in Doc_set_analysed:
    if (item[1] == '4'):                            # this code checks the string with the provided judgement
        x = 1                                       # and replaces it with a number matching textblob's polarity
    elif (item[1] == '2'):
        x = 0
    else:
        x = -1
    y = item[2].polarity
    Doc_set_polarity_accuracy.append(abs(x-y))     # unless my math is entirely wrong, this returns 'accuracy' or
                                                    # the difference between the provided and calculated polarity
                                                    # Exact matches (-1 and -1 or 1 and 1) return 0, complete opposites
                                                    # (1 and -1 or -1 and 1) returning 2, all else proportionally in between. 
    

print(statistics.mean(Doc_set_polarity_accuracy))   # Finding the average of all accuracy shows ... it is not great.  

Hmmm. If the sentiment analyser were:
- entirely accurate, we would have an average difference of 0
- entirely inaccurate, we would have an average difference of 2
- entirely random, we would expect an average difference of 1

As it stands, we have an average difference that suggests we are a bit more accurate than chance... but not by much. 
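
These benchmarks can be checked by hand on a toy example. The sketch below uses four invented labels (already converted to 1 and -1) and three invented sets of polarity scores; `mean_abs_diff` is a hypothetical helper name. It also shows a caveat worth keeping in mind: an analyser that always answers 0 (neutral) ends up with an average difference of 1 on clearly positive or negative items, the same as the chance figure.

```python
import statistics

def mean_abs_diff(labels, polarities):
    """Average absolute gap between assigned labels and polarity scores."""
    return statistics.mean(abs(x - y) for x, y in zip(labels, polarities))

labels = [1, -1, 1, -1]                 # invented, already converted from '4' and '0'

perfect = [1.0, -1.0, 1.0, -1.0]        # agrees exactly with every label
opposite = [-1.0, 1.0, -1.0, 1.0]       # disagrees as much as possible
always_neutral = [0.0, 0.0, 0.0, 0.0]   # never commits either way

print(mean_abs_diff(labels, perfect))
print(mean_abs_diff(labels, opposite))
print(mean_abs_diff(labels, always_neutral))
```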

However, it is important to remember that we are testing an assigned class against a graded score... The assigned class (the '4', '2' or '0' in the original data set) is an absolute judgement and so is always *exactly* 4, 2, or 0 but never 2.8 or 0.05. In contrast, the polarity returned by the sentiment analyser is a graded score that can fall anywhere between -1 and 1: it is 1 only if the sentiment analyser judges the statement to be as positive as possible, but .5 if it judges the statement to be moderately positive. 

In light of this, the fact that we got a better than chance score on our average accuracy test may mean we are doing quite well. We could test this, of course, and convert the polarity scores from the sentiment analyser into 1, 0 or -1 or even into 4, 2 and 0 and then compare those. 

Heck. Why not? Let's have a go. 
Run/Shift+Enter. 


In [None]:
Doc_set_polarity_accuracy_2 = []

for item in Doc_set_analysed:
    x = item[1]                                     # This code sets the original judgement assigned to each statement as x
    if (item[2].polarity > 0):                               # then converts polarity scores of more than 0 to '4'
        y = '4'                                    
    elif (item[2].polarity == 0 ):                           # converts polarity scores of exactly 0 to '2'
        y = '2'
    else:                                           # and converts negative polarity scores to '0'
        y = '0'
    if x == y:                                      # then compares the assigned judgement to the converted polarity score
        Doc_set_polarity_accuracy_2.append(1)       # and adds a 1 if they match exactly
    else:
        Doc_set_polarity_accuracy_2.append(0)       # or adds a 0 if they do not match exactly. 

print(statistics.mean(Doc_set_polarity_accuracy_2)) # Finds the average of the match rate. Still not great.  

Well, an average close to 1 would be entirely accurate while close to 0 would be entirely wrong (and to be fair, *entirely* wrong would also be accurate too...in a sense). 

Our average though suggests that our accuracy is still not great. Ah well. 
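
A single average hides where the mismatches happen. One possible next step, sketched below with invented data, is to count each (assigned, converted) combination with `collections.Counter`. The `scored` pairs and the `to_class` helper are assumptions for illustration; with the real data you would build the pairs from Doc_set_analysed.

```python
from collections import Counter

def to_class(polarity):
    """Convert a polarity score to '4', '2' or '0', mirroring the cell above."""
    if polarity > 0:
        return "4"
    if polarity == 0:
        return "2"
    return "0"

# Invented (assigned class, polarity score) pairs.
scored = [("4", 0.8), ("4", 0.0), ("0", -0.5), ("0", 0.0), ("0", 0.2), ("4", 0.3)]

confusion = Counter((assigned, to_class(polarity)) for assigned, polarity in scored)

for (assigned, converted), count in sorted(confusion.items()):
    print("assigned", assigned, "converted", converted, "count", count)

matches = sum(count for (a, c), count in confusion.items() if a == c)
print("match rate:", matches / len(scored))
```

In this toy data, two of the three mismatches come from a polarity of exactly 0 being read as neutral, which may also be worth checking for in the real results.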

## Train and test a sentiment analysis tool with trivial data

Now that we know how to use the built-in analyser, let's have a look back at the sentiment analysis scores for Doc1 and Doc2. 
- Doc1 = 'Textblob is just super. I love it!' which scored .48 on polarity... halfway between neutral and positive. 
- Doc2 = 'Cabbages are the worst. Say no to cabbages!' which scored -1 on polarity... the most negative it could score. 

Do we really think Doc2 is so much more negative than Doc1 is positive? Hmmmm. The built-in sentiment analyser is clearly not as accurate as we would want. Let's try to train our own, starting with a small set of trivial training and testing data sets. 

The following code does a few different things:
- It defines 'train' as a data set with 10 sentences, each of which is marked as 'pos' or 'neg'.
- It defines 'test' as a data set with 6 completely different sentences, also marked as 'pos' or 'neg'. 
- It imports NaiveBayesClassifier from the textblob.classifiers.
- It defines 'cl' as a brand new NaiveBayesClassifier that is trained on the 'train' data set. 

Run/Shift+Enter to make it so. 

In [None]:
train = [
    ('I love this sandwich.', 'pos'),
    ('this is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('this is my best work.', 'pos'),
    ("what an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('he is my sworn enemy!', 'neg'),
    ('my boss is horrible.', 'neg')]
test = [
     ('the beer was good.', 'pos'),
     ('I do not enjoy my job', 'neg'),
     ("I ain't feeling dandy today.", 'neg'),
     ("I feel amazing!", 'pos'),
     ('Gary is a friend of mine.', 'pos'),
     ("I can't believe I'm doing this.", 'neg')]


from textblob.classifiers import NaiveBayesClassifier
cl = NaiveBayesClassifier(train)

Hmm. The code ran but there is nothing to see. This is because we have no output! Let's get some output and see what it did. 

The next code block plays around with 'cl', the classifier we trained on our 'train' data set.

The first line asks 'cl' to return a judgment of one sentence about a library. 

Then, we ask it to return a judgement of another sentence about something being a doozy. Although both times we get a judgement on whether the sentence is 'pos' or 'neg', the second one has more detailed sub-judgements that show us how positive and how negative the sentence is, so we can see whether the overall judgement is a close call or not. 
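
To see where 'pos' and 'neg' scores like these can come from, here is a from-scratch sketch of a naive Bayes classifier that uses only the standard library. It is a simplified illustration under stated assumptions (lower-cased whitespace tokens, word counts, add-one smoothing), not a copy of how `textblob` or `nltk` implement it, so its numbers will differ from theirs. The names `tokenize`, `train_nb` and `prob_classify_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lower-case and split on whitespace, dropping surrounding punctuation."""
    return [w.strip(".,!?'\"") for w in text.lower().split() if w.strip(".,!?'\"")]

def train_nb(examples):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def prob_classify_nb(model, text):
    """Return a dict of label -> probability for the given text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label in label_counts:
        # Start from the prior: how common the label is in the training data.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalise the log scores so the probabilities sum to 1.
    top = max(log_scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {label: value / norm for label, value in exp_scores.items()}

toy_train = [
    ("I love this sandwich", "pos"),
    ("what an awesome view", "pos"),
    ("I do not like this restaurant", "neg"),
    ("my boss is horrible", "neg"),
]
model = train_nb(toy_train)
probs = prob_classify_nb(model, "an awesome sandwich")
print(max(probs, key=probs.get), {k: round(v, 2) for k, v in probs.items()})
```

The overall judgement is simply the label with the larger probability, which is why two scores near .5 signal a close call.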

Do the Run/Shift+Enter thing that you are so good at doing!

In [None]:
print("Our 'cl' classifier says 'This is an amazing library!' is ", cl.classify("This is an amazing library!"))
print('...')

prob_dist = cl.prob_classify("This one is a doozy.")
print("Our 'cl' classifier says 'This one is a doozy.' is probably",
      prob_dist.max(), "because its positive score is ",
      round(prob_dist.prob("pos"), 2),
      " and its negative score is ",
      round(prob_dist.prob("neg"), 2),
      ".")

Super. Now... What if we want to apply our 'cl' classifier to a document with multiple sentences... What kind of judgements can we get with that? 

Well, `textblob` is sophisticated enough to give an overall 'pos' or 'neg' judgement, as well as a sentence-by-sentence judgement. 

Run/Shift+Enter, buddy. 

In [None]:
blob = TextBlob("The beer is good. But the hangover is horrible.", classifier=cl)

print("Overall, 'blob' is ", blob.classify(), " because its sentences are ...")
for s in blob.sentences:
     print(s)
     print(s.classify())

What if we try to classify a document that we converted to Textblob format with the built-in sentiment analyser?

Well, we still have Doc1 to try it on.

Run/Shift+Enter

In [None]:
print(Doc1)
Doc1.classify()

Uh huh. We get an error. 

The error message says the blob known as Doc1 has no classifier. It suggests we train one first, but we can just apply 'cl'. 

Run/Shift+Enter

In [None]:
cl_Doc1 = TextBlob('Textblob is just super. I love it!', classifier=cl)
cl_Doc1.classify()

Unsurprisingly, when we classify the string that originally went into Doc1 using our 'cl' classifier, we still get a positive judgement. 

Now, what about accuracy? We have been using 'cl' even though it is trained on a REALLY tiny training data set. What does that do to our accuracy? For that, we need to run an accuracy challenge using our test data set. This time, we are using a built-in accuracy protocol which deals with negative values and everything for us. This means we want our result to be as close to 1 as possible. 

Run/Shift+Enter

In [None]:
cl.accuracy(test)


Hmmm. Not perfect.
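
For reference, an accuracy score of this kind is simply the share of test items whose predicted label matches the given label. The sketch below shows the calculation with a deliberately crude stand-in classifier; `keyword_classifier` and `accuracy` are hypothetical names, and this is not the `textblob` implementation.

```python
def keyword_classifier(text):
    """Crude stand-in: 'pos' if the text contains an upbeat keyword, else 'neg'."""
    upbeat = ("good", "amazing", "love")
    return "pos" if any(word in text.lower() for word in upbeat) else "neg"

def accuracy(classify, labelled_items):
    """Fraction of (text, label) pairs for which classify(text) == label."""
    correct = sum(1 for text, label in labelled_items if classify(text) == label)
    return correct / len(labelled_items)

toy_test = [
    ("the beer was good.", "pos"),
    ("I do not enjoy my job", "neg"),
    ("I feel amazing!", "pos"),
    ("Gary is a friend of mine.", "pos"),
]

print(accuracy(keyword_classifier, toy_test))
```

With only six items in 'test', each item is worth roughly 17 percentage points, so the score can only take a handful of values.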

Fortunately, we can add more training data and try again. The code below defines a new training data set and then runs a re-training method called 'update' on our 'cl' classifier. 

Run/Shift+Enter.

In [None]:
new_data = [('She is my best friend.', 'pos'),
            ("I'm happy to have a new friend.", 'pos'),
            ("Stay thirsty, my friend.", 'pos'),
            ("He ain't from around here.", 'neg')]

cl.update(new_data)

Now, copy the code we ran before to get the accuracy check. Paste it into the next code block and Run/Shift+Enter it.  

Not only will this tell us if updating 'cl' with 'new_data' has improved the accuracy, it is also a chance for you to create a code block of your own. Well done, you (I assume). 

In [None]:
# Copy and paste the accuracy challenge from above into this cell and re-run it to get an updated accuracy score. 


## You can train and test a sentiment analysis tool with more interesting data too...

This is all well and good, but 'cl' is trained on some seriously trivial data. What if we want to use some more interesting data, like the Doc_set that we imported from .csv earlier?

Well, we are in luck! Sort of...

We can definitely train a classifier on Doc_set, but let's just have a closer look at Doc_set before we jump right in and try that. 


In [None]:
print(Doc_set[45:55])
print('...')
print(len(Doc_set))

Doc_set is a set of comments that come from 'product reviews'. As we saw earlier, each item has two strings, the first of which is the comment and the second of which is a number 4, 2 or 0 which is written as a string. The second item, the number-written-as-a-string, is the class judgement. These scores may have been manually created, or may be the result of a semi-manual or supervised automation process. Excellent for our purposes, but not ideal because:
- These scores are strings rather than integers. You can tell because they are enclosed in quotes.
- These scores range from 0 (negative) to 4 (positive) and also include 2 (neutral), while the textblob sentiment analyser we have been using returns polarity scores from -1 (negative) through 0 (neutral) to 1 (positive), and our 'cl' classifier returns the labels 'pos' and 'neg'. 

Well, we could change 4 to 1, 2 to 0 and 0 to -1 with the use of regular expressions (RegEx) if we wanted. But as you will see, this is not strictly necessary. 
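
If you did want to recode the classes, a plain dictionary lookup is probably simpler than regular expressions. The sketch below works on three invented rows; `RECODE`, `toy_rows` and `recoded` are hypothetical names.

```python
# Hypothetical mapping from the class strings in the file to -1, 0 and 1.
RECODE = {"4": 1, "2": 0, "0": -1}

# Invented rows in the same [text, class] layout as Doc_set.
toy_rows = [["great value", "4"], ["it arrived", "2"], ["never again", "0"]]

# Build (text, numeric class) pairs; an unexpected class string raises KeyError.
recoded = [(text, RECODE[label]) for text, label in toy_rows]
print(recoded)
```

The recoding is optional because a classifier trained on labelled examples treats the labels as category names, so '4' and '0' should work just as well as 'pos' and 'neg'.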

However, there is another issue. Doc_set has 20,000 items. This is big, but this is actually MUCH smaller than it could be. This is a subset of a 1,000,000+ item data set that you can download for free (see extra resources and reading at the end). The original data set was way too big for Jupyter notebook and was even too big for me to analyse on my laptop. I know because I tried. When you find yourself in a situation like this, you can try: 
- Accessing proper research computing facilities (good for real research, too much for a code demo). 
- Dividing a too big data set into chunks, and training/updating on one chunk at a time. 
- Processing a too big data set to remove punctuation, stop words, urls, twitter handles, etc. (saving computer power for what matters).
- Or a combination of these options. 
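
Two of these options, chunking and pre-processing, can be sketched with the standard library alone. The helper names `chunked` and `clean_text`, the regular expressions and the sample rows are assumptions made for this example. Stop-word removal is left out, and real data will likely need more careful cleaning.

```python
import re

def chunked(items, size):
    """Yield successive slices of `items`, each holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

URL_RE = re.compile(r"https?://\S+")
HANDLE_RE = re.compile(r"@\w+")
PUNCT_RE = re.compile(r"[^\w\s']")

def clean_text(text):
    """Remove URLs, @handles and punctuation, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = HANDLE_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    return " ".join(text.lower().split())

# Invented rows in the same [text, class] layout as Doc_set.
toy_rows = [
    ["@shop Loved it!! see http://example.com/review", "4"],
    ["Worst. Purchase. Ever.", "0"],
    ["It's fine, I guess", "2"],
]

cleaned = [(clean_text(text), label) for text, label in toy_rows]

for batch in chunked(cleaned, 2):
    # With a trained classifier, each batch could be passed to an update step here.
    print(batch)
```

Cleaning before chunking means each batch is already as small as it can be by the time it reaches the classifier.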

But, you can try training a classifier on the much smaller 'testing_set' if you like. That set has under 5000 entries and so does not max out the computer's memory. 

I have provided the code below to load 'testing_set' into a new variable called Doc_set_2. Feel free to run the code below, then add more code blocks with processes copied from above. 

In [None]:
with open('./data/sentiment-analysis/testing_set.csv', newline='') as f:              # Import a csv of scored "product reviews"
    reader = csv.reader(f)
    Doc_set_2 = list(reader)

print(Doc_set_2[45:55])                                                             # Look at a subset of the imported data

## Conclusions

You can train a classifier on whatever data you want and with whatever categories you want. 

Want to train a classifier to recognise sarcasm? Go for it. 
How about recognising lies in political speeches? Good idea. 
How about tweets from bots or from real people? Definitely useful. 

The hard part is actually getting the data ready to train your classifier. Depending on what you want to train your classifier to do, you may have to manually tag a whole lotta data. But it is always a good idea to start small. 10 items? 100? What can you do quickly that will give you enough of an idea to see if it is worth investing more time? 
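
Starting small can be as simple as drawing a random sample and holding part of it back for testing. The sketch below uses only the standard library and invented rows; `small_split`, the sample size of 10 and the 80/20 split are hypothetical choices rather than recommendations.

```python
import random

def small_split(rows, sample_size, test_share=0.2, seed=0):
    """Draw a random sample of rows and split it into train and test lists."""
    rng = random.Random(seed)              # fixed seed so the split is repeatable
    sample = rng.sample(rows, min(sample_size, len(rows)))
    cut = len(sample) - int(len(sample) * test_share)
    return sample[:cut], sample[cut:]

# Invented (text, label) pairs standing in for manually tagged data.
toy_data = [("comment number " + str(i), "pos" if i % 2 else "neg") for i in range(50)]

train_rows, test_rows = small_split(toy_data, sample_size=10)
print(len(train_rows), "training items and", len(test_rows), "test items")
```

The two lists have the same (text, label) shape as the 'train' and 'test' sets used earlier in this notebook.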

Good luck!

## Further reading and resources

Books, tutorials, package recommendations, etc. for Python

- Natural Language Processing with Python by Steven Bird, Ewan Klein and Edward Loper, http://www.nltk.org/book/
- Foundations of Statistical Natural Language Processing by Christopher Manning and Hinrich Schütze, https://nlp.stanford.edu/fsnlp/promo/
- Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition by Dan Jurafsky and James H. Martin, https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
- Deep Learning in Natural Language Processing by Li Deng, Yang Liu, https://lidengsite.wordpress.com/book-chapters/
- Sentiment Analysis data sets https://blog.cambridgespark.com/50-free-machine-learning-datasets-sentiment-analysis-b9388f79c124

NLTK options
- nltk.corpus http://www.nltk.org/howto/corpus.html
- Data Camp tutorial on sentiment analysis with nltk https://www.datacamp.com/community/tutorials/simplifying-sentiment-analysis-python
- Vader sentiment analysis script available on github (nltk) https://www.nltk.org/_modules/nltk/sentiment/vader.html
- TextBlob https://textblob.readthedocs.io/en/dev/
- Flair, an NLP script available on github https://github.com/flairNLP/flair

spaCy options
- spaCy https://nlpforhackers.io/complete-guide-to-spacy/
- Data Quest tutorial on sentiment analysis with spaCy https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/


Books and package recommendations for R
- Quanteda, an R package for text analysis https://quanteda.io/
- Text Mining with R, a free online book https://www.tidytextmining.com/