![UKDS Logo](images/UKDS_Logos_Col_Grey_300dpi.png)

# Text-mining: Basics

Welcome to the <a href="https://ukdataservice.ac.uk/" target=_blank>UK Data Service</a> training series on *Computational Social Science*. This series guides you through some of the most common and valuable new sources of data available for social science research: data collected from websites and social media platforms, text data, and simulations (agent-based modelling), to name a few. We provide webinars, interactive notebooks containing live programming code, reading lists and more.

* To access training materials for the entire series: <a href="https://github.com/UKDataServiceOpen/computational-social-science" target=_blank>[Training Materials]</a>

* To keep up to date with upcoming and past training events: <a href="https://ukdataservice.ac.uk/news-and-events/events" target=_blank>[Events]</a>

* To get in contact with feedback, ideas or to seek assistance: <a href="https://ukdataservice.ac.uk/help.aspx" target=_blank>[Help]</a>

<a href="https://www.research.manchester.ac.uk/portal/julia.kasmire.html" target=_blank>Dr Julia Kasmire</a> and <a href="https://www.research.manchester.ac.uk/portal/diarmuid.mcdonnell.html" target=_blank>Dr Diarmuid McDonnell</a> <br />
UK Data Service  <br />
University of Manchester <br />
June 2020

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Guide-to-using-this-resource" data-toc-modified-id="Guide-to-using-this-resource-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Guide to using this resource</a></span><ul class="toc-item"><li><span><a href="#Interaction" data-toc-modified-id="Interaction-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Interaction</a></span></li><li><span><a href="#Learn-more" data-toc-modified-id="Learn-more-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Learn more</a></span></li></ul></li><li><span><a href="#Preliminary-NLP-(or-finishing-up-the-processing)" data-toc-modified-id="Preliminary-NLP-(or-finishing-up-the-processing)-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Preliminary NLP (or finishing up the processing)</a></span><ul class="toc-item"><li><span><a href="#POS---part-of-speech-tagging" data-toc-modified-id="POS---part-of-speech-tagging-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>POS - part of speech tagging</a></span></li><li><span><a href="#Named-Entity-Recogntion-and-chunking" data-toc-modified-id="Named-Entity-Recogntion-and-chunking-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Named Entity Recogntion and chunking</a></span></li></ul></li><li><span><a href="#Counts-and-(relative)-frequency" data-toc-modified-id="Counts-and-(relative)-frequency-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Counts and (relative) frequency</a></span></li><li><span><a href="#Similarity" data-toc-modified-id="Similarity-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Similarity</a></span></li><li><span><a href="#Discovery" data-toc-modified-id="Discovery-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Discovery</a></span></li><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-7"><span 
class="toc-item-num">7&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Further-reading-and-resources" data-toc-modified-id="Further-reading-and-resources-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Further reading and resources</a></span></li></ul></div>


There is a table of contents provided here at the top of the notebook, but you can also access this menu at any point by clicking the Table of Contents button on the top toolbar (an icon with four horizontal bars; if unsure, hover your mouse over the buttons). 

-------------------------------------

<div style="text-align: center"><i><b>This is notebook 2 of 2 in this lesson</b></i></div>

-------------------------------------

## Introduction

At the end of the last section, we got held up by the need to do part-of-speech (POS) tagging in order to get a really effective lemmatisation process. POS tagging is actually basic NLP, as opposed to the kind of cleaning and regularising that we were doing in the last section. So why should an NLP process be needed before the preparatory processing is done? 

Well, it comes down to choices. Not every analysis will need a sophisticated lemmatiser, so those projects may have a clean break between the end of the processing step and the start of the extraction or NLP step. Others will need the lemmatiser or named entity recognisers or other advanced preparatory steps. Those projects will have a less clear distinction between preparation for NLP and NLP. 

But even if the project ends up having a clear distinction between the processes, researchers may find that after they start doing some NLP processes, they need to go back and run different preparatory processes instead of or in addition to the ones they chose earlier. 

The main takeaway point here is that researchers need to know that developing a text-mining project can be messy, iterative, and complicated. I recommend that you think about each step as an element in a pipeline (or in multiple pipelines), and that you build your own functions that chain the steps together, running each one on the output of the previous one. 

In this way, you get a fresh, clean output at the end of the pipeline whenever you need one. It also means that everything you apply the pipeline to gets treated in the same way, with each process done in the same order. This helps replicability!
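
As a minimal sketch of that idea, the code below chains a list of processing steps so that each one runs on the output of the previous one. The step functions (`lowercase`, `strip_punctuation`, `drop_empty`) and the name `run_pipeline` are hypothetical stand-ins, not the functions used later in this notebook; you would swap in your own steps.

```python
import string

def lowercase(words):
    """Convert every word to lower case."""
    return [word.lower() for word in words]

def strip_punctuation(words):
    """Remove ASCII punctuation characters from every word."""
    table = str.maketrans("", "", string.punctuation)
    return [word.translate(table) for word in words]

def drop_empty(words):
    """Discard any empty strings left behind by earlier steps."""
    return [word for word in words if word]

def run_pipeline(words, steps):
    """Apply each step in order, feeding it the output of the previous step."""
    for step in steps:
        words = step(words)
    return words

# The order of the list is the order in which the steps are applied
my_steps = [lowercase, strip_punctuation, drop_empty]
print(run_pipeline(["Hello", ",", "World", "!"], my_steps))
```

Because the order of the steps lives in one list, re-ordering or swapping a step means editing that list and re-running the pipeline, rather than hunting through several separate cells.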


## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*Preliminary NLP*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [None]:
print("Enter your name and press enter:")
name = input()
print("\r")
print("Hello {}, enjoy learning more about Python and computational social science!".format(name)) 

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool.

## Preliminary NLP (or finishing up the processing)

Let's start off by importing and downloading all the things we will need. 

Run/Shift+Enter.

In [None]:
import nltk                       # get nltk 
from nltk import word_tokenize    # and some of its key functions
from nltk import sent_tokenize    

!pip install autocorrect          
from autocorrect import Speller   # things we need for spell checking
check = Speller(lang='en')

import re                         # things we need for RegEx corrections
def multiple_replace(replacements, text):
  # Try longer keys first, so "United Kingdom of Great Britain" is not cut short by "United Kingdom"
  keys = sorted(replacements, key=len, reverse=True)
  regex = re.compile("(%s)" % "|".join(map(re.escape, keys)))
  return regex.sub(lambda mo: replacements[mo.group(0)], text) 

# Named 'replacements' rather than 'dict' so that Python's built-in dict is not overwritten
replacements = {
    "CA" : "California",
    "United Kingdom" : "U.K.",
    "United Kingdom of Great Britain and Northern Ireland" : "U.K.",
    "United Kingdom of Great Britain" : "U.K.",
    "UK" : "U.K.",
    "Privacy Policy" : "noodle soup",}

English_punctuation = "-!\"#$%&()'*+,./:;<=>?@[\\]^_`{|}~''“”"      # Things for removing punctuation, stopwords and empty strings (\\ is a literal backslash)
table_punctuation = str.maketrans('','', English_punctuation)  

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('webtext')

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

from nltk.corpus import wordnet                    # Finally, things we need for lemmatising!
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer() 
nltk.download('averaged_perceptron_tagger')        # Like a POS-tagger...

print("Successfully imported necessary modules")    # The print statement is just a bit of encouragement!

### POS - part of speech tagging

Now, let's get back to where we were when we left off last time - with a tokenised corpus on which we need to run a POS tagger. 

Run/Shift+Enter, as above!

In [None]:
with open("./data/sample_text.txt", "r", encoding = "ISO-8859-1") as f:
    corpus = f.read()
    
corpus_words = word_tokenize(corpus)

corpus_lower = [word.lower() for word in corpus_words]

corpus_correct_spell = []
for word in corpus_lower:
    corpus_correct_spell.append(check(word))    

corpus_no_stopwords = []
for word in corpus_correct_spell:
    if word not in stop_words:
        corpus_no_stopwords.append(word)
        
corpus_no_punct = [w.translate(table_punctuation) for w in corpus_no_stopwords]  
corpus_no_space = list(filter(None, corpus_no_punct))      
        
print(corpus_no_space[:100])

Excellent. Now it is time to tag that corpus with POS-tags. This is pretty easy, as nltk comes with a POS-tagger. 

Run/Shift+Enter, as you would expect. 

In [None]:
corpus_pos_tagged = nltk.pos_tag(corpus_no_space)        
print(corpus_pos_tagged[:100])

Excellent. That has successfully added POS tags to all of the words in our corpus. Now, let's try lemmatising again with the POS tags. 

Somewhat surprisingly, the nltk POS tagger does not use the same POS tags that the nltk lemmatize function needs. The tagger returns Penn Treebank tags (such as 'NN' or 'VBD'), while the lemmatiser expects the simpler WordNet categories (noun, verb, adjective, adverb). 

But to move forward, I need to define a quick little function called get_wordnet_pos to get a tag in the right format. I tell a lie. I did not write this function but copied it off of Stack Overflow. This is not cheating so much as being economical. A HUGE number of the things you want to do or the problems you want to solve will be discussed on Stack Overflow. Just use a popular search engine to find them, read through all the answers, and try them out. 

Having defined the get_wordnet_pos function, the code below then creates a new, blank list called corpus_lemmed. 
After that, the code iterates over corpus_pos_tagged, takes the word from each word and POS-tag pair, uses the get_wordnet_pos function to find a tag for that word in the format the lemmatiser accepts, and uses that tag to lemmatise correctly. 

At the end, the lemmatised word is appended to the new list we created. 

Go ahead. 
Run/Shift+Enter. 
You know you want to!

In [None]:
def get_wordnet_pos(word):
    """Tag a single word on its own and map the tag's first letter to the value lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

corpus_lemmed = []
for pair in corpus_pos_tagged:
    corpus_lemmed.append(lemmatizer.lemmatize(pair[0], get_wordnet_pos(pair[0])))   
print(corpus_lemmed[:100])

#corpus_lemmed_tagged = []    
#for pair in corpus_pos_tagged:
#    corpus_lemmed_tagged.append([(lemmatizer.lemmatize(pair[0], get_wordnet_pos(pair[0])), pair[1])])   
#print(corpus_lemmed_tagged[:100])

The code above returns a list containing only words, without any POS-tags. 

If you want to keep the corpus in pairs of word and POS-tag, you will need to activate the commented-out lines at the end of the code block. This means you will need to remove the '#' in front of each of the last four lines of code (from '#corpus_lemmed_tagged = []' to the final print statement) and re-run the code.

Give it a try!
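
One thing worth noticing is that get_wordnet_pos tags each word again on its own, so it does not use the tags already stored in corpus_pos_tagged. As a hedged alternative, the sketch below converts an existing Penn Treebank tag directly. The name `penn_to_wordnet` and the example pairs are hypothetical; the single letters it returns are the values that nltk's `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB` and `wordnet.ADV` constants hold.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter."""
    # 'a', 'n', 'v' and 'r' are the values of wordnet.ADJ, wordnet.NOUN,
    # wordnet.VERB and wordnet.ADV respectively
    tag_dict = {"J": "a", "N": "n", "V": "v", "R": "r"}
    # Anything else (or an empty tag) falls back to noun, as get_wordnet_pos does
    return tag_dict.get(tag[:1].upper(), "n")

# Hypothetical word and POS-tag pairs in the same shape as corpus_pos_tagged
tagged = [("dogs", "NNS"), ("ran", "VBD"), ("quickly", "RB"), ("happier", "JJR")]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

Inside this notebook, that would let you write `lemmatizer.lemmatize(pair[0], penn_to_wordnet(pair[1]))`, which keeps the sentence context that the tagger used the first time around.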

### Named Entity Recogntion and chunking

We are really getting somewhere now! Let's try another basic NLP process - CHUNKING!

Named Entity Recognition is a specific kind of 'chunk' operation. Chunking operations iterate over a corpus that has been word tokenised and POS-tagged and put it all back together into sentences. Named Entity Recognition does this too, with special attention to building up the noun phrases that capture well-known entities or organisations. 

The chunks are returned within sets of nested brackets (both square and round) to capture different levels of nesting. 

So, 'The Cat in the Hat' would come out as (S The/DT (ORGANIZATION Cat/NNP) in/IN the/DT Hat/NNP). 
The 'S' at the beginning stands for 'sentence' which is the highest level grouping that the chunker can find. 
The 'Cat' is recognised as the key entity, so is tagged with ORGANIZATION. 
The 'in the hat' part is captured as belonging to a noun phrase, same as the cat, but it recognises that this is a sentence about a cat, not a sentence about a hat. 

Clever, eh?
Let's try it!

In [None]:
#importing chunk library from nltk
nltk.download('words')
nltk.download('maxent_ne_chunker')
from nltk import ne_chunk                                    # ne_chunk is 'named entity chunk'. Other chunkers are available.

# NER and other chunkers only work on word tokenised and POS tagged corpora... 
corpus_pos_tagged2 = nltk.pos_tag(corpus_words)
corpus_chunked = ne_chunk(corpus_pos_tagged2)
print(corpus_chunked[57:88])

This time, I specifically only asked for a printout of a key range in the resulting corpus. I wanted to highlight here how the word "Tree" precedes those ORGANIZATION entities that hang together as multi-word entities. See, for example, how 'United Kingdom', 'Great Britain' and 'Northern Ireland' are each within square brackets to identify them as the multi-word entity captured by the ORGANIZATION tag. 

You might also have noticed that this chunking function is run on corpus_pos_tagged2, which is simply the corpus_words that has been put through the nltk.pos_tag function. This means that corpus_pos_tagged2 still has its stopwords, punctuation, etc. 

Why do you think this is? What do you think would happen if you ran the chunking procedure on corpus_pos_tagged which DOES have all the stopwords and punctuation removed?

Well, guess what? You can find out by doing that Run/Shift+Enter thing!

In [None]:
corpus_chunked_extra_processes = ne_chunk(corpus_pos_tagged)
print(corpus_chunked_extra_processes[:100])

Hmmm. No 'Tree' markers, no 'ORGANIZATION' markers, etc. 

This is because some chunking processes rely on stopwords (especially determiners like 'an' and 'the'), punctuation, etc. to determine appropriate chunks. 

This may create some challenges for your corpus. For example, if you want to: 
- Count words, then you probably want to remove stopwords, punctuation, etc. 
- Identify chunks, like named entities, then you probably want to leave some or all of the stopwords, punctuation, etc.
- Count chunks (e.g. count named entities), you probably want to combine the processes in the right order. 

Good to know!
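
As a rough sketch of the last case, counting named entities, the code below walks over a chunked result and collects each labelled chunk before counting. The function names `extract_entities` and `count_entities` are hypothetical. The sketch assumes the structure that `ne_chunk` returned above: plain tokens are (word, tag) tuples, while named-entity chunks are subtrees with `label()` and `leaves()` methods.

```python
from collections import Counter

def extract_entities(chunked):
    """Collect (label, text) pairs from an ne_chunk-style result."""
    entities = []
    for node in chunked:
        # Plain tokens are (word, tag) tuples and have no label() method;
        # named-entity chunks are subtrees that do
        if hasattr(node, "label"):
            words = [word for word, tag in node.leaves()]
            entities.append((node.label(), " ".join(words)))
    return entities

def count_entities(chunked):
    """Count how often each (label, text) entity appears."""
    return Counter(extract_entities(chunked))
```

With the cells above already run, `count_entities(corpus_chunked).most_common(10)` would list the ten most frequent entities. Note that this is run on the chunked version that still has its stopwords and punctuation, with any cleaning applied to the extracted entities afterwards.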

## Counts and (relative) frequency

Excellent! Now, you might be surprised, but a very important function of NLP for analysing text boils down to counting things, often words. This is why so much attention in the last section was focussed on making sure all the words that we want to be counted as 'the same word' appeared in the same form while all the words that we want to count as 'different words' appear in different forms. 

Thus, we want to apply the count functions to a corpus that has already had some of those standardisation, consolidation and lemmatisation (or at least stemming) processes applied. 

First, we import some counting functions, then we apply them to corpus_lemmed. 
Run/Shift+Enter

In [None]:
from collections import Counter
corpus_counts = Counter(corpus_lemmed)
print(corpus_counts)

Great! You may have noticed that we applied this count function to a list of words rather than word and POS-tag pairs. This is on purpose, but the code could be written so that it only looks at the first item in each word and POS-tag pair. 

If you want to try that, go ahead. You may want to refer back to the code block where we defined get_wordnet_pos because the code to create corpus_lemmed_tagged uses indices (in [square brackets]) to refer to only one element within a pair. 
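
A minimal sketch of that idea is below, using a small made-up list of pairs (`example_pairs` is hypothetical, not taken from the corpus). Unpacking each pair into `word, tag` does the same job as indexing with `pair[0]`.

```python
from collections import Counter

# Hypothetical (word, POS-tag) pairs standing in for a tagged corpus
example_pairs = [("policy", "NN"), ("use", "VB"), ("policy", "NN"), ("use", "NN")]

# Count only the word from each pair, ignoring the tag
word_counts = Counter(word for word, tag in example_pairs)

# Count whole pairs, so 'use' as a verb and 'use' as a noun stay separate
pair_counts = Counter(example_pairs)

print(word_counts)
print(pair_counts)
```

If you are working from corpus_lemmed_tagged as built in the commented-out code earlier, note that each item there is a one-element list holding a pair, so the word would be `item[0][0]` rather than `item[0]`.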

But, for now, let's have a closer look at the 100 most common words in our corpus by using the 'most_common' method of Counter. 
Run/Shift+Enter!

In [None]:
print(corpus_counts.most_common(100))

Just for comparison, let's find the 100 most common words in 'Emma' by Jane Austen. 
We do need to import the text as a corpus and process it in the same way as we did our sample corpus so that the two are comparable. 

Run/Shift+Enter  - but be patient. This is a lot of processes to run. 

In [None]:
nltk.download('gutenberg')
import nltk.corpus
emma = nltk.corpus.gutenberg.raw('austen-emma.txt')

emma_words = word_tokenize(emma)

emma_lower = [word.lower() for word in emma_words]

emma_correct_spell = []
for word in emma_lower:
    emma_correct_spell.append(check(word))    

emma_no_stopwords = []
for word in emma_correct_spell:         # use the spell-checked words, as we did for the sample corpus
    if word not in stop_words:
        emma_no_stopwords.append(word)
        
emma_no_punct = [w.translate(table_punctuation) for w in emma_no_stopwords]  
emma_no_space = list(filter(None, emma_no_punct))              
emma_pos_tagged = nltk.pos_tag(emma_no_space)   

emma_lemmed = []
for pair in emma_pos_tagged:
    emma_lemmed.append(lemmatizer.lemmatize(pair[0], get_wordnet_pos(pair[0])))   
    
emma_counts = Counter(emma_lemmed)
print(emma_counts)

Excellent! Clearly, the words in Emma are very different from those in our sample corpus, and those words that appear in both occur in very different relative frequencies (not least because one is a page of babble and the other is a full novel).

To get a better idea, how about we compare the 20 most common words from both corpora. 
Run/Shift+Enter

In [None]:
print(corpus_counts.most_common(20))
print(emma_counts.most_common(20))

Ok. Clearly very different. But let's try one more thing for now... Let's count how many times each of these texts uses the word 'personal'. We could use any word as the target word, but I happen to know that there is a non-zero result for these two texts for this word. 

Run/Shift+Enter!!!

In [None]:
print(corpus_counts['personal'])
print(emma_counts['personal'])

So, despite being much shorter, the sample text corpus uses the word 'personal' over 8 times more often. 

That sounds very personal. 

Feel free to choose other words and re-run the code. 
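
Raw counts are hard to compare when one text is far longer than the other, which is where relative frequency comes in. The sketch below divides a word's count by the total number of counted words and scales it to a rate per 1,000 words. The function name `relative_frequency` and the two small Counters are hypothetical, made-up examples rather than results from the texts above.

```python
from collections import Counter

def relative_frequency(counts, word, per=1000):
    """Occurrences of `word` per `per` counted words."""
    total = sum(counts.values())
    if total == 0:
        return 0.0          # an empty text has no frequencies to report
    return counts[word] * per / total

# Made-up counts for a short text and a much longer one
short_text = Counter({"personal": 4, "data": 6, "policy": 10})
long_text = Counter({"personal": 5, "emma": 95, "say": 900})

print(relative_frequency(short_text, "personal"))
print(relative_frequency(long_text, "personal"))
```

In this notebook you could call `relative_frequency(corpus_counts, 'personal')` and `relative_frequency(emma_counts, 'personal')` to compare the two texts on an equal footing.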

## Similarity

Now, comparing the most common words in two documents is one way to compare how similar they are, but there are more sophisticated ways. 

spaCy is a relatively new option for text-mining in Python, but it is very powerful. First off, we need to download and import a few things. 

Run/Shift+Enter (you are already so good at this!)

In [None]:
!pip install spacy -q
import spacy
!python -m spacy download en_core_web_lg -q
from nltk.corpus import webtext

Super. Now, let's load the model with spacy.load, and then test it on a trivial corpus that has only three words. 

Run/Shift+Enter already!

In [None]:
nlp = spacy.load('en_core_web_lg')

word_similarity = nlp("dog cat banana")    # the three words discussed in the text below


for word1 in word_similarity:
    for word2 in word_similarity:
        print(word1.text, word2.text, word1.similarity(word2))

This code does a few things. First, it loads a model of common words in English (this is the 'en_core_web_lg') that has 300-dimensional vectors for each word. If that sounds like nonsense, don't worry too much. 

What it means is that the model has a list of lots of common words in English, each of which comes with a 'scorecard' of how they rank on 300 different features, which is a sort of abstract way of capturing the meaning of the word. This is not derived from logical scoring by people, but through an AI sort of analysis of how the words are used in LOTS of text, which finds patterns like:
- Is the target word used more often like a noun or a verb? 
- Is it usually plural (if a noun) or in gerund ('ing'-form, if a verb)? 
- Is it frequently preceded by adjectives like 'little' or 'unprecedented' or adverbs like 'always' or 'never'? 

What comes out of this code is a pair-wise comparison of all the vectors, or scorecards, for the words in our little three-word corpus. For words like these, the score usually falls between 0 and 1, with values near 0 meaning totally different (a word not found in the model also scores 0) and 1 being a perfect match. The score is a cosine similarity, so negative values are technically possible. 

Looking at the results, we see that comparing a word to itself (e.g. the first line which has 'dog dog 1.0') scores a 1, or 100 percent match. Not surprising. 

We also see that 'dog' and 'cat' are a pretty good match at 0.8. Both words are likely to be used in similar ways. For example, both would fit equally well into a sentence like "I really want to get a pet (dog/cat), but I just don't spend enough time at home to take care of it properly."

We also see that 'banana' is closer to 'cat' than to 'dog', but not by much. Presumably, bananas are more likely to sit around like cats than to run around like dogs? No idea. 

Feel free to edit the little three-word corpus and re-run the similarity test. Try 'puppy' to see if it is closer to 'dog' than 'dog' is to 'cat'? Try adding 'apple'? Or 'unprecedented'? Or anything else?
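
If you are curious about what sits behind the similarity score, it is a cosine similarity: a measure of how closely two vectors point in the same direction. The sketch below computes it by hand for three tiny made-up vectors. The numbers are invented for illustration and have only 3 dimensions rather than 300, and `cosine_similarity` is a hypothetical helper written here, not a spaCy function.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0      # an all-zero vector has no direction, so score it as 0
    return dot / (norm_a * norm_b)

# Invented 3-dimensional 'scorecards'; real model vectors have 300 dimensions
dog = [0.9, 0.8, 0.1]
cat = [0.8, 0.9, 0.2]
banana = [0.1, 0.2, 0.9]

print("dog / dog   ", cosine_similarity(dog, dog))
print("dog / cat   ", cosine_similarity(dog, cat))
print("dog / banana", cosine_similarity(dog, banana))
```

Because only the direction of the vectors matters, two words can score as very similar even if one of them is far more common than the other.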

Of course, comparing individual words is all well and good, but what you probably want to compare is one text to another. To do that, first we need to prepare a few texts to do some comparing. You have already seen our sample corpus and 'Emma' by Jane Austen, but let's also add 'Persuasion' by Jane Austen and a selection of text from another nltk.corpus of texts from the web. The specific text is called 'firefox'. 

All of these texts need to be put through the nlp function we created from spaCy so that it creates a document vector. 

Document vectors are much like word vectors in that they score the document on a large number of dimensions. However, instead of coming packaged with spaCy, they are created from the text that you pass to spaCy. By default, spaCy builds the document vector by averaging the vectors of the tokens in the document. Which is what we are doing now.

Run/Shift+Enter below!

In [None]:
SimEmma = nlp(nltk.corpus.gutenberg.raw('austen-emma.txt'))
SimPers = nlp(nltk.corpus.gutenberg.raw('austen-persuasion.txt'))
SimFire = nlp(nltk.corpus.webtext.raw('firefox.txt'))
SimCorp = nlp(corpus)

You can, of course, have a peek at the contents of 'Persuasion' or 'firefox' if you like. You probably know how to do that with a print command, but maybe you want to run some of the other operations on the text too. 

Or you can plow on ahead and Run/Shift+Enter to run the document vector similarity comparisons below. 

In [None]:
print(SimEmma.similarity(SimPers))
print(SimEmma.similarity(SimFire))
print(SimEmma.similarity(SimCorp))

Each of these compares 'Emma' to one of the other texts (the ones in parentheses at the end). 

Are you surprised by the results? Feel free to try comparing the other texts to each other, rather than just to 'Emma'. 
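
One thing that may help explain the results: by default, spaCy's document vector is an average of the vectors of the tokens in the document. The sketch below shows that averaging step on a toy example. The two 3-dimensional word vectors and the name `mean_vector` are made up for illustration and are not taken from the model.

```python
def mean_vector(vectors):
    """Average a list of equal-length vectors element by element."""
    if not vectors:
        return []
    length = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(length)]

# Invented word vectors for a two-word 'document'
toy_word_vectors = {"privacy": [1.0, 0.0, 2.0], "policy": [3.0, 2.0, 0.0]}
toy_document = ["privacy", "policy"]

doc_vector = mean_vector([toy_word_vectors[word] for word in toy_document])
print(doc_vector)
```

Since long texts share a great many common words, their averaged vectors tend to end up close together, so high similarity scores between whole documents should be read with some caution.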



## Discovery


Now, for the final bit of NLP that we cover here, let's talk about discovery. This is about identifying patterns that reveal relationships and applying them more widely to discover additional relationships. Let's start by importing a few things that we need. 

Run/Shift+Enter. 

In [None]:
import re 
import string 
import nltk 
import spacy 
import pandas as pd 
import numpy as np 
import math 
from tqdm import tqdm 

from spacy.matcher import Matcher 
from spacy.tokens import Span 
from spacy import displacy 

pd.set_option('display.max_colwidth', 200)

Now, let's take a look at 'Emma'. We start by tokenising the raw text into sentences, then we collect all the sentences that contain the sub-string "like a", then we run the collected text through our nlp function from spaCy. 

Run/Shift+Enter. 

In [None]:
# sample text 
emma_sentences = nltk.sent_tokenize(nltk.corpus.gutenberg.raw('austen-emma.txt'))
emma_such_as =""
for sentence in emma_sentences:
    if "like a " in sentence:
        emma_such_as += sentence + " "   # keep a space between sentences so that words do not run together
                
# create a spaCy object 
doc = nlp(emma_such_as)

Now, let's take a closer look at the context around those instances of "like a". 

To do that, we use some spaCy functions that print the word, print its role in the sentence, and print its POS-tag. 

This lets us see if we can find any patterns in the word roles or POS-tags that might help us understand the patterns relating to "like a". 

Run/Shift+Enter.

In [None]:
# print token, dependency, POS tag 
for tok in doc: 
  print(tok.text, "-->",tok.dep_,"-->", tok.pos_)

It seems like a good start would be to define a pattern with "like a" followed by a noun. So first, we define that pattern. 

Run/Shift+Enter. 

In [None]:
#define the pattern 
pattern = [{'LOWER': 'like'}, 
           {'LOWER': 'a'}, 
           {'POS': 'NOUN'}]

Next, we create a Matcher object and run it over the text; it returns all of the substrings that match the pattern. 

Run/Shift+Enter. 

In [None]:
# Matcher class object 
matcher = Matcher(nlp.vocab) 
matcher.add("matching_1", [pattern])   # spaCy v3 syntax; in spaCy v2 this was matcher.add("matching_1", None, pattern)

matches = matcher(doc) 
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(span.text)

As you would expect, we get a list of substrings that match the pattern we have defined. Let's create a slightly more ambitious pattern. 

This time, we want to capture verbs followed by "like a" followed by up to three optional adjectival modifiers (the 'amod' dependency label) and finally followed by a noun. 

Run/Shift+Enter. 

In [None]:
# Matcher class object
matcher = Matcher(nlp.vocab)

#redefine the pattern
pattern2 = [{'POS':'VERB'},
           {'LOWER': 'like'},
           {'LOWER': 'a'},
           {'DEP':'amod', 'OP':"?"},
           {'DEP':'amod', 'OP':"?"},
           {'DEP':'amod', 'OP':"?"},
           {'POS': 'NOUN'}]

matcher.add("matching_1", [pattern2])   # spaCy v3 syntax; in spaCy v2 this was matcher.add("matching_1", None, pattern2)
matches = matcher(doc)

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(span.text)

Well, this is more interesting. We see that someone can 'look like a sensible young man', can 'argue like a young man' and can 'write like a sensible man'. This suggests that the author, or possibly society at the time of publication, associates these verbs and these adjectives with men. Potentially, the analysis is even more complicated in that young men might argue while old men do not, or that sensible men write much more often than other men. 

If we continued to analyse the text in this way, we might also find similar combinations of verbs and adjectives associated with women. Will there be any evidence that (young) women are sensible? Or that women write or argue in ways that are comparable to how men write and argue? 


## Conclusions

We have only started to dip our toes into what NLP can do, but hopefully this will whet your appetite to know more. 

As before, these exercises and this sample code should highlight to you that you need to think about:
- your research questions and what you want to show, explore or understand, 
- your data, texts, corpus, or other research materials to analyse, 
- how your processes are related to your research questions, and 
- how your processes and data can be made available and reproducible. 

## Further reading and resources

Books, tutorials, package recommendations, etc. for Python
- Natural Language Processing with Python by Steven Bird, Ewan Klein and Edward Loper, http://www.nltk.org/book/
- Foundations of Statistical Natural Language Processing by Christopher Manning and Hinrich Schütze, https://nlp.stanford.edu/fsnlp/promo/
- Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition by Dan Jurafsky and James H. Martin, https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
- Deep Learning in Natural Language Processing by Li Deng, Yang Liu, https://lidengsite.wordpress.com/book-chapters/
- nltk.corpus http://www.nltk.org/howto/corpus.html
- spaCy https://nlpforhackers.io/complete-guide-to-spacy/

Books and package recommendations for R
- Quanteda, an R package for text analysis https://quanteda.io/
- Text Mining with R, a free online book https://www.tidytextmining.com/

<div style="text-align: right"><a href="./tm-processing-2020-06-16.ipynb" target=_blank><i>Previous section: Processing text</i></a></div>