![UKDS Logo](images/UKDS_Logos_Col_Grey_300dpi.png)

# Text-mining: Basics

Welcome to the <a href="https://ukdataservice.ac.uk/" target=_blank>UK Data Service</a> training series on *Computational Social Science*. This series guides you through some of the most common and valuable new sources of data available for social science research: data collected from websites and social media platforms, text data, and simulations (agent-based modelling), to name a few. We provide webinars, interactive notebooks containing live programming code, reading lists and more.

* To access training materials for the entire series: <a href="https://github.com/UKDataServiceOpen/computational-social-science" target=_blank>[Training Materials]</a>

* To keep up to date with upcoming and past training events: <a href="https://ukdataservice.ac.uk/news-and-events/events" target=_blank>[Events]</a>

* To get in contact with feedback, ideas or to seek assistance: <a href="https://ukdataservice.ac.uk/help.aspx" target=_blank>[Help]</a>

<a href="https://www.research.manchester.ac.uk/portal/julia.kasmire.html" target=_blank>Dr Julia Kasmire</a> and <a href="https://www.research.manchester.ac.uk/portal/diarmuid.mcdonnell.html" target=_blank>Dr Diarmuid McDonnell</a> <br />
UK Data Service  <br />
University of Manchester <br />
June 2020

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Guide-to-using-this-resource" data-toc-modified-id="Guide-to-using-this-resource-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Guide to using this resource</a></span><ul class="toc-item"><li><span><a href="#Interaction" data-toc-modified-id="Interaction-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Interaction</a></span></li><li><span><a href="#Learn-more" data-toc-modified-id="Learn-more-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Learn more</a></span></li></ul></li><li><span><a href="#Retrieval" data-toc-modified-id="Retrieval-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Retrieval</a></span></li><li><span><a href="#Processing" data-toc-modified-id="Processing-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Processing</a></span><ul class="toc-item"><li><span><a href="#Tokenisation" data-toc-modified-id="Tokenisation-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Tokenisation</a></span></li><li><span><a href="#Standardising" data-toc-modified-id="Standardising-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Standardising</a></span></li><li><span><a href="#Removing-irrelevancies" data-toc-modified-id="Removing-irrelevancies-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Removing irrelevancies</a></span></li><li><span><a href="#Consolidation" data-toc-modified-id="Consolidation-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Consolidation</a></span></li></ul></li><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Further-reading-and-resources" data-toc-modified-id="Further-reading-and-resources-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Further reading and resources</a></span></li></ul></div>


There is a table of contents provided here at the top of the notebook, but you can also access this menu at any point by clicking the Table of Contents button on the top toolbar (an icon with four horizontal bars, if unsure hover your mouse over the buttons). 

-------------------------------------

<div style="text-align: center"><i><b>This is notebook 1 of 2 in this lesson</b></i></div>

-------------------------------------

## Introduction



This is the first in a series of Jupyter notebooks on text-mining that cover basic preparation processes, common natural language processing tasks, and some more advanced natural language tasks. These interactive code-along notebooks use Python as the programming language and introduce various packages related to text-mining and text processing. Most of those tasks could be done with other packages, so please be aware that the options demonstrated here are not the only way, or even the best way, to accomplish a text-mining task.  

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*Retrieval*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [None]:
print("Enter your name and press enter:")
name = input()
print("\r")
print("Hello {}, enjoy learning more about Python and computational social science!".format(name))

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool.

## Retrieval


The first step in text-mining, or any form of data-mining, is retrieving a data set to work with. Within text-mining, or any language analysis context, one data set is usually referred to as 'a corpus' while multiple data sets are referred to as 'corpora'. 'Corpus' is a Latin-root word and therefore has a funny plural. 

For text-mining, a corpus can be:
- a set of tweets, 
- the full text of an 18th century novel,
- the contents of a page in the dictionary, 
- minutes of local council meetings, 
- random gibberish letters and numbers, or
- just about anything else in text format. 


Retrieval is a very important step, but it is not the focus of this particular training series. If you are interested in creating a corpus from internet data, then you may want to check out our <a href="https://github.com/UKDataServiceOpen/web-scraping" target=_blank>previous NFoD training series</a> that covers web-scraping (available as recordings of webinars or as a code-along Jupyter notebook like this one) and APIs (also as a recording or Jupyter notebook). Both of these demonstrate and discuss ways to get data from the internet that you could use to build a corpus. 

Instead, for the purposes of this session, we will assume that you already have a corpus to analyse. This is easy for us to assume, because we have provided a sample text file that we can use as a corpus for these exercises. 

First, let's check that it is there. To do that, click in the code cell below and either hit the 'Run' button at the top of this page or hold down the 'Shift' key and hit the 'Enter' key. 

For the rest of this notebook, I will use 'Run/Shift+Enter' as shorthand for 'click in the code cell below and either hit the 'Run' button at the top of this page or hold down the 'Shift' key while hitting the 'Enter' key'. 

In [None]:
# It is good practice to always start by importing the modules and packages you will need. 

import os                         # os is a module for navigating your machine (e.g., file directories).
import nltk                       # nltk stands for natural language tool kit and is useful for text-mining. 
import re                         # re is for regular expressions, which we use later 

print("1. Successfully imported necessary modules")   # The print statement is just a bit of encouragement!

print("")

# List all of the files in the "data" folder that is provided to you
for file in os.listdir("./data"):
   print("2. One of the files in ./data is...", file)
print("")


_______________________________________________________________________________________________________________________________
Great! We have imported a useful module and used it to check that we have access to the sample_text file. 

Now we need to load that sample_text file into a variable that we can work with in python. Time to Run/Shift+Enter again!

In [None]:
# Open the "sample_text" file and read (import) its contents to a variable called "corpus"
with open("./data/sample_text.txt", "r", encoding = "ISO-8859-1") as f:
    corpus = f.read()
    
    print(corpus)

_______________________________________________________________________________________________________________________________
Hmm. Not excellent literature, but it will do for our purposes. 

A quick look tells us that there are capital letters, contractions, punctuation, numbers as digits, numbers written out, abbreviations, and other things that, as humans, we know are equivalent but that computers do not know about. 

Before we go further, it helps to know what kind of variable corpus is. Run/Shift+Enter the next code block to find out!

In [None]:
type(corpus)

_______________________________________________________________________________________________________________________________
This tells us that 'corpus' is one very long string of text characters.  
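
Because a string is just a sequence of characters, the usual string operations apply to it. Here is a minimal sketch, using a short made-up string called `sample` (my own stand-in, not the real corpus), of the kind of quick checks you might run:

```python
# 'sample' is a hypothetical stand-in for the much longer 'corpus' string.
sample = "This is a sample corpus. It has two sentences."

print(type(sample))      # the same check as above
print(len(sample))       # number of characters, including spaces and punctuation
print(sample[:16])       # slicing returns the first 16 characters
```

The same calls should work on `corpus` itself, although the output will be much longer.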

Congratulations! We are done with the retrieval portion of this process. The rest won't be quite so straightforward because next up... Processing. 

Processing is about cleaning, correcting, standardising and formatting the raw data returned from the retrieval process. 

## Processing



_______________________________________________________________________________________________________________________________
The string we have as our corpus is a good starting point, but it is not perfect. It has a bunch of errors and punctuation which need to be corrected. But even worse, it is 'one long thing' when statistical analysis typically requires 'lots of short things'. 

So, clearly, we have a few steps to go through with our raw text. 
- Tokenisation, (or splitting text into various kinds of 'short things' that can be statistically analysed).
- Standardising the text (including converting uppercase to lower, correcting spelling, find-and-replace operations to remove abbreviations, etc.). 
- Removing irrelevancies (anything from punctuation to stopwords like 'the' or 'to' that are unhelpful for many kinds of analysis).
- Consolidating (including stemming and lemmatisation that strip words back to their 'root'). 
- Basic NLP (that puts some of the small things back together into logically useful medium things, like multi-word noun or verb phrases and proper names).

In practice, most text-mining work will require that any given corpus undergo multiple steps, but the exact steps and the exact order of steps depends on the desired analysis to be done. Thus, some of the examples that follow will use the raw text corpus as an input to the process while others use a processed corpus as an input. 

As a side note, it is good practice to create new variables whenever you manipulate an existing variable rather than write over the original. This means that you keep the original and can go back to it anytime you need to if you want to try a different manipulation or correct an error. You will see how this works as we progress through the processing steps. 

### Tokenisation

Our first step is to cut our 'one big thing' into tokens, or 'lots of little things'. As an example, one project I worked on involved downloading a file with hundreds of recorded chess games, which I then divided into individual text files with one game each. The games had a very standard format, with every game ending with either '1-0', '0-1' or '1/2-1/2'. Thus, I was able to use regular expressions (covered in more detail later) to iterate over the file, selecting everything until it found an instance of '1-0', '0-1' or '1/2-1/2', at which point it would cut what it had selected, write it to a blank file, save it, and start iterating over the original file again. 
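
To make the chess example a bit more concrete, here is a rough sketch of the idea. It is not the code from that project: the `games_text` string is invented, and the games are collected in a list rather than written out to separate files.

```python
import re

# Hypothetical miniature file contents: three games, each ending in a result.
games_text = "1. e4 e5 2. Nf3 Nc6 1-0 1. d4 d5 2. c4 e6 1/2-1/2 1. c4 e5 0-1"

# Lazily match everything up to and including the next result marker.
game_pattern = re.compile(r".*?(?:1/2-1/2|1-0|0-1)", re.DOTALL)

# findall returns each complete match; strip() tidies the leading spaces.
games = [match.strip() for match in game_pattern.findall(games_text)]

for number, game in enumerate(games, start=1):
    print(number, game)
```

Each item in `games` could then be written to its own file, if that is what the project needs.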

Other options that might make more sense with other kinds of files would be to cut and write from the large file to new files after a specified number of lines or characters. 
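
Cutting after a fixed number of lines could look something like the sketch below. The function name `chunk_lines` and the example text are my own inventions for illustration, and the chunks stay in a list rather than being written to files.

```python
def chunk_lines(text, lines_per_chunk):
    """Split text into pieces of at most lines_per_chunk lines each."""
    lines = text.splitlines()
    return [
        "\n".join(lines[start:start + lines_per_chunk])
        for start in range(0, len(lines), lines_per_chunk)
    ]

# Hypothetical five-line text, cut into chunks of two lines.
example_text = "line one\nline two\nline three\nline four\nline five"
chunks = chunk_lines(example_text, 2)
print(chunks)
```

The last chunk is allowed to be shorter than the others, which is usually what you want when the file does not divide evenly.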

Whether you have one big file or many smaller ones, most text-mining work will also want to divide the corpus into what are known as 'tokens'. These 'tokens' are the unit of analysis, which might be chapters, sections, paragraphs, sentences, words, or something else. 

Since we have one file already loaded as a corpus, we can skip right to tokenising that text into sentences and words. Both options are functions available through the nltk package that we imported earlier. These are both useful tokens in their own way, so we will see how to produce both kinds. 
 
We start by dividing our corpus into words, splitting the string into substrings whenever 'word_tokenize' detects a word. 

Let's try that. But this time, let's just have a look at the first 100 things it finds instead of the entire text.
Run/Shift+Enter.

In [None]:
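nltk.download('punkt')                                                   # tokeniser data used by word_tokenize (recent nltk
                                                                         # releases may ask you to download 'punkt_tab' instead)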
from nltk import word_tokenize                                           # importing the word_tokenize function from nltk

corpus_words = word_tokenize(corpus)                                     # Pass the corpus through word tokenize 
print(corpus_words[:100])                                                # the [:100] within the print statement says 
                                                                         # to print only the first 100 items in the list  
print("...")                                                             # the print("...") just improves output readability
type(corpus_words)                                                       # Always good to know your variable type!


Let's have a look. 

We can see that corpus_words is a list of strings. We know it is a list because it starts and ends with square brackets and we know the things in that list are strings because they are surrounded by single quotes. 

We can also see that punctuation marks are counted as tokens in that list. For example, the full stop at the end of the first sentence appears as its own token because word_tokenize knows that it does not count as part of the previous word. Interestingly, 'U.K.' is all one token, despite having full stops in. Clever stuff, this tokenisation function!

Word_tokenize is a useful function if you want to take a 'bag of words' approach to text-mining. This reduces a lot of the contextual information within the original corpus because it ignores how the words were used or in what order they originally appeared, making it easy to count how often each word occurs. There is a surprising amount of insight to be gained here, but it does mean that 'building' in the next two sentences will be counted as the "same" word. 
- "He is building a diorama for a school project." where 'building' is a verb
- "The building is a clear example of brutalist architecture." where 'building' is a noun
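
Here is a minimal sketch of that 'bag of words' counting, applied to the two example sentences. To keep it self-contained I use Python's `str.split()` as a crude stand-in for `word_tokenize`, so punctuation stays attached to some words.

```python
from collections import Counter

# The two example sentences from the text above.
sentences = [
    "He is building a diorama for a school project.",
    "The building is a clear example of brutalist architecture.",
]

bag_of_words = Counter()
for sentence in sentences:
    # Lowercase, split on whitespace, and add the words to the running count.
    bag_of_words.update(sentence.lower().split())

print(bag_of_words["building"])      # verb and noun uses are counted together
print(bag_of_words.most_common(3))   # the three most frequent words
```

The counter has no idea that one 'building' is a verb and the other a noun; all it sees is the same string twice.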

There are other kinds of analyses that you could do if you want verb-building and noun-building to be counted as different words. That usually starts with tokenising differently, for example into sentences rather than words. 
Let's see what that looks like by running the same basic analysis again, but this time with sentence-token things instead of word-token things. 

Do that funky Run/Shift+Enter thing! 

In [None]:
# importing sent_tokenize from nltk
from nltk import sent_tokenize

# Same again, but this time broken into sentences
corpus_sentences = sent_tokenize(corpus)
print(corpus_sentences[:10])                                                  # Since these are sentences instead of words, 
                                                                              # we only want the first 10 items instead of 100.
print("...")                                                                  
type(corpus_sentences)

_______________________________________________________________________________________________________________________________

Corpus_sentences is also a list of strings (starts and ends with square brackets, each item is surrounded by single quotes). 

This time, the full stops at the end of each sentence are included within the sentence token, which makes sense. 

Moving forward, some of the next steps make more sense to do on the word-tokens while others on sentence-tokens.

### Standardising
#### Remove uppercase letters

If we want to focus on the 'bag of words' approach, we don't really care about uppercase or lowercase distinctions. For example, we want 'Privacy' to count as the same word as 'privacy', rather than as two different words. 

We can remove all uppercase letters by applying a built-in Python string method to each item in corpus_words. Do this in the next code cell, again returning just the first 100 items instead of the whole thing. 

Do the Run/Shift+Enter thing. 

In [None]:
# You can see that I created a new variable called corpus_lower rather than edit corpus_words directly.
# This means I can easily compare two different processes or correct something without going back and re-running earlier steps. 

corpus_lower = [word.lower() for word in corpus_words]
print(corpus_lower[:100])

_______________________________________________________________________________________________________________________________
Great! This is another step in the right direction. 

If you want a bit more practice, you can copy/paste/edit the command above to create a second version that applies to corpus_sentences instead of corpus_words. You will have to think for yourself whether this makes sense to do or not. Uppercase letters are potentially useful in an analysis that looks at sentences, but since the tokens already capture sentences, maybe that value is no longer useful. 

Anyway, have a go. Knock yourself out! 

#### Spelling correction

_______________________________________________________________________________________________________________________________
Everybody loves spelling... RIGHT?!?

Fortunately, there are several decent spellchecking packages written for Python. They are not automatically installed and ready to import in the same way that the 'os' or 'nltk' packages were, but we just need to install the packages through an installer called 'pip' and then import the functions we need. You will see 'pip' in the next code block, but since this is in a Jupyter notebook rather than directly in a Python shell, we need to put a '!' in front of the 'pip' command. Don't worry too much about that now, I just mention it here in case you find it interesting to know. 

The next code cell:
- installs the 'autocorrect' package,
- imports the Speller function, and
- creates a one-word command that specifies that the Speller function should use English language. 

Run/Shift+Enter, as per usual. 

In [None]:
!pip install autocorrect
from autocorrect import Speller
check = Speller(lang='en')

_______________________________________________________________________________________________________________________________
Super. Creating that one-word command saves us some time, which is maybe less important here but is a good skill to be aware of if you are working on text-mining every day for weeks on end. Always be on the look out for good ways to save time. 

Moving on, we need to iterate over our corpus, checking and correcting each token. This is easy to do if we start with a new, empty list (I called mine 'corpus_correct_spell'). As we work through corpus_words, one token at a time, we append (which is just fancy for 'add to the end') the corrected word to our new list. 

Then, as usual, we have a quick look at the first 100 entries in the new 'corpus_correct_spell'. 

Run/Shift+Enter. You know how to do it. Don't worry if it takes a while... Checking the spelling on each word is not a cakewalk. 

In [None]:
corpus_correct_spell = []

for word in corpus_words:
    corpus_correct_spell.append(check(word))    

print(corpus_correct_spell[:100])

_______________________________________________________________________________________________________________________________
How did it do? Well, this spell-checker replaced 'haz' with 'had' rather than 'has'. That is ok, I guess. No automatic spelling correction programme will get it 100% right 100% of the time. Maybe your project has specific research questions that won't work with this decision. 

In that case, you would have to check out some other spell-checkers like textblob or pyspellchecker. You might even want to custom build or adapt your own spell-checker, especially if you were working with very non-standard text, like comment boards that use a bunch of slang, common typos, or specific terms. 

But take a moment here and consider the following questions... 
- Can you apply this spell-checker to corpus_sentences rather than corpus_words? If you are not sure what happens, try it out by copying, editing and re-running the above code block. 
- Should you have applied this spell-checker to corpus_lower rather than corpus_words? What difference would it make? Again, try it out if you are not sure. 

Next up, specific replacements with RegEx! 

#### RegEx replacements

RegEx stands for REGular EXpressions, which is probably familiar to you as the basis for how find-and-replace works in text documents. I mentioned this above when I talked about cutting up a large file into smaller files whenever the computer iterating over the large file found one of three specific combinations of numbers and symbols. 

But RegEx is actually stronger than that because you can use it to identify combinations of letters, numbers, symbols, spaces and more, some of which can be repeated more than once or can be optional. I won't go into RegEx too much more here, because that is a whole set of lessons on its own. But here are a couple of examples that you might find useful in a text like ours where we know that there are mixtures of numbers written as numbers, numbers spelled out, geographic abbreviations and more.
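
As a small taste of what 'optional' and 'repeated' mean in practice, here is a sketch using Python's built-in `re` module. The `text` string is made up for this illustration and is not taken from the sample corpus.

```python
import re

# Hypothetical snippet mixing digits, spelled-out numbers and abbreviations.
text = "The U.K. has 96 sites, the UK had ninety six, and the U.K has ninety-six."

# 'U', an optional full stop, 'K', another optional full stop.
uk_pattern = re.compile(r"U\.?K\.?")
print(uk_pattern.findall(text))

# 'ninety', then an optional hyphen or space, then 'six'.
ninety_six_pattern = re.compile(r"ninety[- ]?six")
print(ninety_six_pattern.sub("96", text))

# One or more digits in a row.
print(re.findall(r"\d+", text))
```

The `?` marks the character before it as optional, and the `+` says 'one or more of these'. Those two symbols alone cover a surprising number of everyday find-and-replace jobs.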

As you might expect, do the Run/Shift+Enter thing. 

In [None]:
corpus_numbers = [re.sub(r"ninety-six", "96", word) for word in corpus_words]   # Defines a new variable created by substituting
                                                                                # '96' for 'ninety-six' in corpus_words

print(corpus_numbers[:100])                                            # Prints the first 100 items in the newly created corpus


Super! Now, this only works on 'ninety-six', but there might be other numbers spelled out in the text. We would have to look at it all to be sure, either manually or by using word frequency tables (we'll get to that). If we were to find some, we would have to revise our RegEx to capture more things and substitute them properly. 

One way to do that might be to define multiple terms to replace and what to replace them with. To do that, I searched on Stack Overflow and found a function written to replace multiple items in a string using RegEx. 

Run/Shift+Enter the next code cell, which follows these notes!

Once you have run it, try editing it. 
What happens when we use lowercase letters instead of uppercase letters in "United Kingdom"?
What happens if you change the order of the entries in 'dict'? For example, what happens if you swap the order of these two entries?
- "United Kingdom of Great Britain and Northern Ireland" : "U.K."
- "United Kingdom of Great Britain" : "U.K."

You should also feel free to add your own lines to 'dict' to make some substitutions of your own. 

Note: this function works on strings, so I applied it to 'corpus', our original raw text. 
We can either put a step like this as the first step in a pipeline, or we can adapt the code to iterate over a list of strings. Both have pros and cons. What do you think those pros and cons might be?

In [None]:
def multiple_replace(dict, text):
    # Note: the parameter name 'dict' shadows Python's built-in dict type.
    # Create a regular expression from the dictionary keys,
    # escaping any characters that have a special meaning in RegEx
    regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))

    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda mo: dict[mo.group(0)], text)


# Each entry reads "text to find" : "text to put in its place"
dict = {
    "CA" : "California",
    "United Kingdom" : "U.K.",
    "United Kingdom of Great Britain and Northern Ireland" : "U.K.",
    "United Kingdom of Great Britain" : "U.K.",
    "UK" : "U.K.",
    "Privacy Policy" : "noodle soup",
}

corpus_replace = multiple_replace(dict, corpus)
print(corpus_replace)
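
If you would like to compare notes after experimenting, here is one possible adaptation that iterates over a list of strings. It is a sketch rather than the Stack Overflow function itself: the name `replace_in_tokens`, the `replacements` dictionary and the example sentences are all my own. It also sorts the search terms longest-first, because RegEx alternation takes the first option that matches.

```python
import re

def replace_in_tokens(replacements, tokens):
    """Apply every replacement to each string in a list of strings."""
    # Sort keys longest-first so that a long phrase is tried before any
    # shorter phrase it contains.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(map(re.escape, keys)))
    return [pattern.sub(lambda mo: replacements[mo.group(0)], token) for token in tokens]

# Hypothetical replacements and sentences, for illustration only.
replacements = {
    "United Kingdom": "U.K.",
    "United Kingdom of Great Britain": "U.K.",
}
sentences = [
    "Welcome to the United Kingdom of Great Britain.",
    "The United Kingdom has four nations.",
]

replaced = replace_in_tokens(replacements, sentences)
print(replaced)
```

Without the sorting step, the shorter "United Kingdom" would match first and leave "of Great Britain" stranded in the first sentence.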


### Removing irrelevancies

#### Remove punctuation

Punctuation is not always very useful for understanding text, especially if you look at words as tokens because lots of the punctuation ends up being tokenised on its own. 

We could use RegEx to replace all punctuation with nothing, and that is a valid approach. But, just for variety's sake, I demonstrate another way here.
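
For comparison, the RegEx route might look something like the sketch below. The `tokens` list is invented for illustration, and the choice to keep hyphens and apostrophes is an assumption on my part.

```python
import re

# Hypothetical tokens, similar in shape to the output of word_tokenize.
tokens = ["Hello", ",", "world", "!", "U.K.", "lactose-free", "don't"]

# \w matches letters, digits and the underscore; \s matches whitespace.
# Any other character is removed, except the apostrophe and hyphen,
# which are listed inside the brackets so that they survive.
no_punct = [re.sub(r"[^\w\s'-]", "", token) for token in tokens]
print(no_punct)
```

Note that `\w` also keeps underscores, so this pattern is not an exact mirror of a hand-written punctuation list.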

_______________________________________________________________________________________________________________________________
Forging ahead, let's filter out punctuation. We can define a string that includes all the standard English language punctuation, and then use that to iterate over corpus_words, removing anything that matches.

But wait... Do we really want to remove the:
- hyphen in 'ninety-six' or words like 'lactose-free'? 
- full stops in 'u.k.'? 
- the apostrophe in contractions or possessives?

There are no right or wrong answers here. Every project will have to decide, based on the research questions, what is the right choice for the specific context. In this case, we want to remove the full stops, even from 'u.k.' so that it becomes identical to 'uk'. 

But, at the same time, we don't necessarily want to remove dashes or apostrophes. Those are punctuation marks that occur in the middle of words and do add meaning to the word, so I want to keep them. 

Run/Shift+Enter, as is tradition. 

In [None]:
English_punctuation = "!\"#$%&()*+,./:;<=>?@[\\]^_`{|}~“”"     # Define a variable with all the punctuation to remove.
print(English_punctuation)                                     # Print that defined variable, just to check it is correct.
print("...")                                                   # Print an ellipsis, just to make the output more readable.

table_punctuation = str.maketrans('','', English_punctuation)  # The Python method 'maketrans' creates a table that maps
print(table_punctuation)                                       # the punctuation marks to 'None'. Print the table to check. 
print("...")                                                   # Just to be clear, '!' is 33 in Unicode, '"' is 34, etc.
                                                               # 'None' is Python for nothing, not a string of the word "none".
    
corpus_no_punct = [w.translate(table_punctuation) for w in corpus_words]  
                                                               # Iterate over corpus_words, turning punctuation to nothing.
print(corpus_no_punct[:100])                                   # Print the 1st 100 items in corpus_no_punct to check.

_______________________________________________________________________________________________________________________________
Super! 

Do you want to try something else? How about you create a version that *does* filter out dashes and apostrophes. 

C'mon. You know you can do it. 

Take each of the steps above and copy/paste/edit them as needed. 
- Create a copy of the line that defines the English_punctuation variable and edit it to define an All_English_Punctuation variable that includes more punctuation.
- Then create a copy of the line that defines the table_punctuation variable and have it create a table_all_punctuation variable.
- Then create a copy of the line that creates the corpus_no_punct variable and have it create an absolutely_no_punct variable.
- Then ask for the first 100 items of absolutely_no_punct. 

Feel free to change the variable names as you like. I am going for clarity, but you might prefer brevity. 

Did you notice that removing the punctuation has left list items that are empty strings? Between 'corpus' and 'it', for example, is an item shown as ''. This is an empty string item that was a full stop before we removed the punctuation. 

Why do you think these empty string items are included in the output list? 
Can you think of how we might remove them?
Empty strings are not the same thing as 'None', but Python treats both as 'falsy' (that is, as good as false), so Python can find and filter them out. 

Let's give it a try. Run/Shift+Enter. Do it!

In [None]:
corpus_no_space = list(filter(None, corpus_no_punct))     # filter(None, ...) drops falsy items, such as empty strings.

print(corpus_no_space[:100])
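
To see exactly what `filter(None, ...)` keeps and what it drops, here is a small sketch on a made-up list called `mixed`:

```python
# Hypothetical list containing several 'falsy' values as well as real tokens.
mixed = ["corpus", "", "it", None, 0, "0", " "]

# filter(None, ...) keeps only the items that Python treats as true.
kept = list(filter(None, mixed))
print(kept)

# An equivalent list comprehension, which some people find easier to read.
also_kept = [item for item in mixed if item]
print(also_kept)
```

The empty string, `None` and the number `0` are dropped, but the strings `"0"` and `" "` survive because any non-empty string counts as true. A token made only of spaces would therefore need separate handling.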

Now we are cooking with gas (unless that saying is no longer environmentally sustainable? Hmmm. ). 

But we are not done yet! Next up... Stopwords!

#### Stopwords

Stopwords are typically conjunctions ('and', 'or'), prepositions ('to', 'around'), determiners ('the', 'an'), possessives ('s) and the like. They are **REALLY** common in all languages, and tend to occur at about the same ratio in all kinds of writing, regardless of who did the writing or what it is about. These words are definitely important for structure as they make all the difference between "Freeze *or* I'll shoot!" and "Freeze *and* I'll shoot!". 

Buuuut... For many text-mining analyses, especially those that take the bag of words approach, these words don't have a whole lot of meaning in and of themselves. Thus, we want to remove them. 

Let's start by downloading the basic stopwords lists built into nltk and storing the English language ones in a set (like a list, but quicker to search) called, appropriately enough, 'stop_words'. 

Then let's have a look at what is in that list with a print command by doing the whole Run/Shift+Enter thing in the next two (two?!?) code cells. 

In [None]:
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(sorted(stop_words))


_____________________________________________________________________________________________________________________________
Great. Now let's remove those stop_words by creating another list called corpus_no_stop_words. Then, we iterate over corpus_lower, looking at the tokens one by one and appending them to corpus_no_stop_words if and only if they do not match any of the items in stop_words. 

As you might expect, you should do the whole Run/Shift+Enter thing. Again. (I know, I know...)

In [None]:
corpus_no_stop_words = []

for word in corpus_lower:
    if word not in stop_words:
        corpus_no_stop_words.append(word)
        
        
print(corpus_no_stop_words[:100])
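
The same loop can be written more compactly as a list comprehension. Here is a minimal sketch, using a tiny hand-made stopword set and token list (both invented here) so that it runs on its own:

```python
# A tiny hand-made stopword set, standing in for nltk's much longer English list.
example_stop_words = {"the", "is", "a", "of", "and"}

example_tokens = ["the", "corpus", "is", "a", "collection", "of", "texts"]

# Keep each token only if it is not in the stopword set.
example_no_stop_words = [word for word in example_tokens if word not in example_stop_words]
print(example_no_stop_words)
```

Which version you prefer is mostly a matter of taste. The loop is easier to read aloud, while the comprehension fits on one line.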

_______________________________________________________________________________________________________________________________
Hey now! That looks pretty good. Not perfect, but good.

Want to try more? Run the same code above, but on 'corpus_words' rather than 'corpus_lower'. What happens? Why do you think that is?

### Consolidation
#### Stemming words

You can probably imagine what comes next by now. We import a specific tool from nltk (it is not called the natural language tool kit for nuthin'), define a function, create a fresh new corpus by applying the function to an existing corpus, and print the first hundred items to have a nosey. 

Go ahead. Run/Shift+Enter

In [None]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
corpus_stemmed = [porter.stem(word) for word in corpus_no_space]
print(corpus_stemmed[:100])

We see that 'sample' has become 'sampl', which collapses 'sampled' together with 'samples' and 'sampling' and 'sample'. This puts plurals and verb tenses all in the same form so they can be counted as instances of the "same" word.
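
To check that claim for yourself, here is a small sketch that stems the four forms directly. The word list is typed in by hand rather than taken from the corpus, and the variable names are just for this example.

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

# Hand-typed forms of the same word, not drawn from the corpus.
sample_forms = ["sample", "samples", "sampled", "sampling"]

# Stem each form, then collect the distinct stems that come out.
sample_stems = {form: porter.stem(form) for form in sample_forms}
distinct_stems = set(sample_stems.values())

print(sample_stems)
print(distinct_stems)
```

If all four forms collapse together, the set of distinct stems will contain just one item.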

If we are happy with this stemming process, we might decide that we are done with the cleaning and can dive into the text-mining. 

Alternatively, we might decide to do a bit more cleaning, perhaps by downloading packages that replace contractions, so that 'haven't' would become 'have' and 'not'. There are many potentially useful changes like these that you may want to make. 
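
As a rough idea of what such a package does, here is a minimal sketch that expands contractions using a tiny hand-made lookup table. The table and the function name 'expand_contractions' are invented for this illustration; a real package would cover far more cases.

```python
# A tiny, hand-made lookup table (an assumption for illustration only).
toy_contractions = {
    "haven't": ["have", "not"],
    "isn't": ["is", "not"],
    "we're": ["we", "are"],
}

def expand_contractions(words):
    """Replace each known contraction with its expanded words."""
    expanded = []
    for word in words:
        # Words that are not in the table are passed through unchanged.
        expanded.extend(toy_contractions.get(word, [word]))
    return expanded

toy_words = ["we", "haven't", "finished", "cleaning"]
toy_expanded = expand_contractions(toy_words)
print(toy_expanded)
```

Note that a step like this would need to run before any step that strips out apostrophes, otherwise there would be nothing left to match against.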

Buuuuuuuuuuuuuut... maybe we want to count the verbs together and the nouns separately? For that, we need the slightly more sophisticated approach of 'lemmatisation'. 

#### Lemmatisation

Lemmatisation is similar to stemming, in that it aims to turn various forms of the same word into a single form. However, lemmatisation is a bit more sophisticated because: 
- It recognises irregular plurals and returns the correct singular form. Example = 'rocks' --> 'rock' but 'corpora' --> 'corpus' 
- If part of speech tags are supplied, it treats verbs, adjectives and nouns differently, even if they have the same surface form. Example - 'caring' would not be changed if used as an adjective (as in 'his caring manner') but would go to 'care' if it was a verb (as in 'he is caring for baby squirrels'). In contrast, a crude stemmer that simply strips the 'ing' would turn 'caring' into 'car'. 
- If no part of speech tags are supplied, lemmatisation tools tend to treat every word as a noun, so the process becomes a sophisticated de-pluraliser. 

Again, you import a specific tool from nltk, define a short form for its use, apply it to the relevant input variable, saving the output as a new variable with a suitable name. 

Once more, unto the Run/Shift+Enter!

In [None]:
nltk.download('wordnet')
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer() 
 
print('rocks :', lemmatizer.lemmatize('rocks'))              #a few examples of lemmatising as a de-pluraliser
print('corpora :', lemmatizer.lemmatize('corpora'))
print('cares :', lemmatizer.lemmatize('cares'))              #no part of speech tag supplied, so 'cares' is treated as noun
print('caring :', lemmatizer.lemmatize('caring', pos = "v")) #when part of speech tag added, 'caring' is treated as verb             
print('cared :', lemmatizer.lemmatize('cared', pos = "v"))

The results show that our examples produce good output: 'rocks', 'corpora' and 'cares' are all de-pluralised correctly. The examples with part of speech tags also show that 'caring' and 'cared' are both correctly converted to 'care' as the base verb. 

Let's try it on our corpus, this time applying it to the 'corpus_no_space' variable, which has not had the stemming process applied to it. 

Run/Shift+Enter. 

In [None]:
corpus_lemmed = [lemmatizer.lemmatize(word) for word in corpus_no_space]

print(corpus_lemmed[:100])

Well, the results are a bit mixed. There were no part of speech tags in our corpus, so everything was treated as a noun. The corpus has been effectively de-pluralised, but all of the different verb tenses remain. So, I guess we need to tag the corpus with parts of speech, usually abbreviated to POS. 

But that is a topic for the next section!

## Conclusions

We have achieved a whole lot already! This is great work! 

Now, you will have to think carefully about:
- what processes you will need for the analysis you want to run, 
- what is the right order of processes for your corpus/corpora and your research questions, and 
- how will you keep track of which processes you run and in which order. Replicability demands clear step-by-steps!
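
One way to keep track, sketched here under our own assumptions, is to list the steps in order as (name, function) pairs and apply them in a loop, recording each name as you go. The step functions and names below are simplified stand-ins, not the exact cells from this notebook.

```python
# Simplified stand-in steps: each takes a list of words and returns a new list.
def to_lower(words):
    return [word.lower() for word in words]

def remove_toy_stop_words(words):
    toy_stop_words = {"the", "a", "of"}  # hand-picked, not the nltk list
    return [word for word in words if word not in toy_stop_words]

# The order of this list is the order of processing.
pipeline = [
    ("lowercase", to_lower),
    ("remove stop words", remove_toy_stop_words),
]

def run_pipeline(words, steps):
    """Apply each step in order and keep a log of what was run."""
    log = []
    for name, step in steps:
        words = step(words)
        log.append(f"{name}: {len(words)} words remain")
    return words, log

toy_corpus = ["The", "Basics", "of", "a", "Corpus"]
cleaned, log = run_pipeline(toy_corpus, pipeline)
print(cleaned)
print(log)
```

Because the steps live in one list, changing the order or dropping a step is a one-line edit, and the log gives you a record to report alongside your results.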



## Further reading and resources

Books, tutorials, package recommendations, etc. for Python
- Programming with Python for Social Scientists. Brooker, 2020. https://study.sagepub.com/brooker
- Automate the Boring Stuff with Python: Practical Programming for Total Beginners, Sweigart, 2019. ISBN-13: 9781593279929
- SentDex, python programming tutorials on YouTube https://www.youtube.com/user/sentdex
- nltk (Natural Language Toolkit) https://www.nltk.org/book/ch01.html
- nltk.corpus http://www.nltk.org/howto/corpus.html
- spaCy https://nlpforhackers.io/complete-guide-to-spacy/

Books and package recommendations for R
- Quanteda, an R package for text analysis https://quanteda.io/
- Text Mining with R, a free online book https://www.tidytextmining.com/

<div style="text-align: right"><a href="./tm-extraction-2020-06-16.ipynb" target=_blank><i>Next section: Extracting text</i></a></div>