# Natural Language Processing

This chapter covers text analysis, also known as natural language processing. We'll cover tokenisation of text, removing stop words, counting words, performing other statistics on words, and analysing the parts of speech.

## Introduction

When doing NLP, it's worth thinking carefully about the unit of analysis: is it a corpus, a text, a line, a paragraph, a sentence, a word, or even a character? It could also be two of these simultaneously, and working with document x token matrices is one very common way of doing NLP. Although we'll be mixing between a few of these in this chapter, thinking about what block of text data you're working with will really help you keep track of which operations are being applied and how they might interact.
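To make the idea of a document x token matrix concrete, here is a minimal sketch on three made-up one-line documents (the `docs` list is purely illustrative and is not part of the text used in this chapter). Each row is a document, each column is a token, and each cell holds the number of times that token appears in that document.

```python
from collections import Counter

import pandas as pd

# Three toy "documents"; hypothetical example data
docs = ["the cat sat", "the dog sat down", "the cat saw the dog"]

# Count tokens within each document, then stack the counts into a matrix
counts = [Counter(doc.split()) for doc in docs]
doc_token = pd.DataFrame(counts).fillna(0).astype(int)
doc_token
```

Tokens that never appear in a document show up as zeros, which is why these matrices tend to be sparse for real corpora.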

In this chapter, we'll use a single example and apply NLP to it in a few different ways. First, though, we need to read in the text data we'll be using, part of Adam Smith's *The Wealth of Nations*, and do some light cleaning of it.

We'll read in our text so that each new line appears on a different row of a **pandas** dataframe. We'll also import the packages we'll need; remember, if a package isn't installed on your computer, you may need to run `pip install packagename` first.

In [None]:
import pandas as pd
import string

In [None]:
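import urllib.request

url = 'https://github.com/aeturrell/coding-for-economists/raw/main/data/smith_won.txt'
# Recent pandas versions reject "\n" as a read_csv delimiter, so read the lines directly
with urllib.request.urlopen(url) as response:
    lines = response.read().decode("utf-8").splitlines()
df = pd.DataFrame({"text": [line for line in lines if line.strip()]})
df.head()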

We need to do a bit of light text cleaning before we get on to the more in-depth natural language processing. We'll make use of vectorised string operations as seen in the [Introduction to Text](text-intro) chapter. First, we want to put everything in lower case:


In [None]:
df["text"] = df["text"].str.lower()
df.head()

Next, we'll remove the punctuation from the text. You may not always wish to do this but it's a good default.

In [None]:
translator = str.maketrans("", "", string.punctuation)
df["text"] = df["text"].str.translate(translator)
df.head()

Okay, we now have rows and rows of lower case words without punctuation.
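One reason you may not always want to strip punctuation is that doing so also changes words and numbers that contain it. A small sketch, using a made-up sentence rather than the Smith text:

```python
import string

# Hypothetical sentence with a contraction, a hyphenated word, and a decimal
sentence = "it's a well-known fact; prices rose 2.5%."

# Delete every character listed in string.punctuation
translator = str.maketrans("", "", string.punctuation)
stripped = sentence.translate(translator)
stripped
```

Here the apostrophe, hyphen, and decimal point all disappear, so "well-known" becomes "wellknown" and "2.5" becomes "25". Whether that matters depends on the analysis you have in mind.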

## Tokenisation

We're now going to see an example of tokenisation: the process of taking blocks of text and breaking them down into tokens, most commonly single words but potentially also pairs of adjacent words. Note that you might sometimes see adjacent two-word pairs referred to as 2-grams (or bigrams), with an n-gram being a sequence of n adjacent words. There are many ways to tokenise text; we'll look at two of the most common: using regular expressions and using pre-configured NLP packages.
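As a quick illustration of n-grams, here is a minimal sketch in plain Python. The helper name `ngrams` and the example tokens are made up for this illustration; NLP packages provide their own equivalents.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    # Zip the token list against copies of itself shifted by 1, 2, ..., n-1
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["the", "wealth", "of", "nations"]
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
bigrams
```

A line of four tokens gives three 2-grams, because each 2-gram starts one word later than the previous one.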

### Tokenisation with regular expressions

Because regular expressions excel at finding patterns in text, they can also be used to decide where to split text up into tokens. For a very simple example, let's take the first line of our text example:

In [None]:
import re
word_pattern = r'\w+'
tokens = re.findall(word_pattern, df.iloc[0, 0])
tokens

This produced a split of a single line into one-word tokens, represented as a list of strings. We could also have asked for other variations, e.g. sentences, by asking to split at every '.'.
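Splitting into sentences with a regular expression might look like the sketch below. It uses a made-up piece of text that still has its punctuation (our dataframe has already had punctuation removed, so this would not work on it), and the rule is deliberately naive: it would wrongly split after abbreviations such as "e.g.".

```python
import re

# Hypothetical text, with punctuation intact
text = "Labour is divided. Output rises! Does trade follow?"

# Split on whitespace that comes directly after ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)
sentences
```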


### Tokenisation using NLP tools

Many of the NLP packages available in Python come with built-in tokenisation tools. Two of the most loved NLP packages are [**nltk**](https://www.nltk.org/) and [**spaCy**](https://spacy.io/). We'll use nltk for tokenisation.


In [None]:
from nltk.tokenize import word_tokenize

word_tokenize(df.iloc[0, 0])

We have the same results as before. Now let's scale this up to our whole corpus while retaining the lines of text, giving us a structure of the form (lines x tokens):

In [None]:
df["tokens"] = df["text"].apply(word_tokenize)
df.head()
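If a long format with one token per row is more convenient than one list per line, **pandas** can reshape a column of lists with `explode`. The sketch below assumes a toy dataframe and uses simple whitespace splitting in place of `word_tokenize`, purely to keep the example free of extra dependencies.

```python
import pandas as pd

# Hypothetical two-line corpus
toy = pd.DataFrame({"text": ["the wealth of nations", "the division of labour"]})
toy["tokens"] = toy["text"].str.split()

# One row per token; the index still records which line each token came from
long = toy.explode("tokens")
long
```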

**nltk** also has a `sent_tokenize` function that tokenises sentences, although as it makes use of punctuation you must take care with what pre-cleaning of text you undertake.


## Stop Words and Their Removal

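Stop words are very common words, such as "the", "of", and "and", that appear in almost every text and so tell us little about what distinguishes one text from another. They are often removed before counting words so that the most frequent tokens are more informative.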
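A minimal sketch of stop word removal on an already tokenised line is given here. The `stop_words` set and the example tokens are made up for illustration; packages such as **nltk** ship much fuller lists.

```python
# A tiny, hand-written stop word set; real lists are much longer
stop_words = {"the", "of", "a", "and", "in", "to", "is"}

tokens = ["the", "wealth", "of", "nations", "is", "a", "book"]

# Keep only the tokens that are not in the stop word set
filtered = [tok for tok in tokens if tok not in stop_words]
filtered
```

Using a set rather than a list for the stop words keeps each membership check fast, which matters once this is applied to every token in a corpus.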

## Counting Text

There are several ways of performing basic counting statistics on text. We saw one in the previous chapter, `str.count()`, but that only applies to one word at a time. Often, we're interested in the relative counts of words in a corpus. In this section, we'll look at two powerful ways of computing this: using the `Counter` class and via term frequency-inverse document frequency (TF-IDF).

First, `Counter`, a class in Python's built-in `collections` module that does pretty much what you'd expect. Here's a simple example:


In [None]:
from collections import Counter

fruit_list = ["apple", "apple", "orange", "satsuma", "banana", "orange", "mango", "satsuma", "orange"]
freq = Counter(fruit_list)
freq

`Counter` returns a `collections.Counter` object in which the number of occurrences of each distinct item in the input list is tallied. The resulting dictionary of counts can be extracted using `dict(freq)`, and `Counter` has some other useful methods too, including `most_common()` which, given a number `n`, returns a list of the `n` most frequent items as tuples of the form `(thing, count)`:

In [None]:
freq.most_common(10)
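A few other `Counter` behaviours are handy when counting words. The sketch below uses a made-up list and shows conversion to a plain dictionary, looking up an item that was never seen, and folding in extra counts with `update`.

```python
from collections import Counter

freq = Counter(["apple", "orange", "apple"])

as_dict = dict(freq)        # plain dictionary of counts
top = freq.most_common(1)   # list with the single most frequent (item, count) pair
missing = freq["kiwi"]      # unseen items count as zero rather than raising KeyError

# Counts from new data can be added to the existing tallies
freq.update(["orange", "orange"])
freq
```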

Say we wanted to apply this not to every line in our corpus separately, but to our whole corpus in one go; how would we do it? `Counter` will happily accept a list, but our dataframe token column is currently a vector of lists. So we must first transform the token column into a single list of all tokens and then apply `Counter`. To achieve the former and flatten a list of lists, we'll use the `chain` function from `itertools`, which makes an iterator that returns elements from the first iterable until it is exhausted, then proceeds to the next iterable, until all of the iterables are exhausted. For example, given `[a, b, c]` and `[d, e, f]` as arguments, this function would yield `a, b, c, d, e, f`, which `list()` turns into `[a, b, c, d, e, f]`. Because this function accepts an arbitrary number of iterable arguments, we use the splat operator (`*`) to unpack our list of lists into separate arguments. The second step, using `Counter`, is far more straightforward!

In [None]:
import itertools

merged_list = list(itertools.chain(*df["tokens"].to_list()))
freq = Counter(merged_list)
freq.most_common(10)
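To see the flattening step in isolation, here is a small sketch on made-up token lists. It also shows `itertools.chain.from_iterable`, which takes the outer list directly and so avoids the splat operator; both give the same result.

```python
import itertools
from collections import Counter

# Hypothetical list of token lists, standing in for the tokens column
token_lists = [["a", "b", "c"], ["d", "e", "a"]]

# chain(*lists) unpacks the outer list into separate arguments
flat_splat = list(itertools.chain(*token_lists))

# chain.from_iterable takes the outer list as a single argument
flat_iter = list(itertools.chain.from_iterable(token_lists))

corpus_freq = Counter(flat_iter)
corpus_freq.most_common(1)
```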

### TF-IDF

