# [LEGALST-190] Preprocessing Text - Lab 3-1

---

This lab will provide an introduction to manipulating strings and chunking sentences.

*Estimated Time: 30-40 minutes*

---

### Topics Covered
- How to tokenize text
- How to stem text
- How to chunk text

### Table of Contents

[The Data](#data)<br>

1 - [Tokenization](#section 1)<br>

2 - [Stemming](#section 2)<br>

3 - [Chunking](#section 3)<br>


In [5]:
! pip install nltk

You are using pip version 8.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


---

## The Data <a id='data'></a>


In this notebook, you'll be working with the text of each country’s statement from the General Debate in annual sessions of the United Nations General Assembly. Each statement is tagged with its country, session, and year, and the dataset covers over forty years of statements from different countries.



### Visualizing data

Run the cells below and take a look at a sample of the data that we'll be working with.

In [5]:
import pandas as pd
!unzip "../data/un-general-debates.zip"
data = pd.read_csv("../data/un-general-debates.csv")



In [1]:
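import csv
# read every row of the CSV into a list; the first row is the header
with open("un-general-debates.csv", "r", encoding="utf-8", newline="") as f:
    data = list(csv.reader(f))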


In [7]:
data[:2]

[['session', 'year', 'country', 'text'],
 ['44',
  '1989',
  'MDV',

## Tokenization  <a id='section 1'></a>

Tokenization is defined as <b>the process of segmenting running text into words and sentences</b>.


### Why do we need to tokenize text?

Electronic text is a linear sequence of symbols. Before any processing can be done, the text needs to be segmented into linguistic units; this process is called tokenization.

We usually look at grammar and meaning at the level of words, related to each other within sentences, within each document. So if we're starting with raw text, we first need to split the text into sentences, and those sentences into words -- which we call "tokens".

### How to tokenize

You might imagine that the easiest way to identify sentences is to split the document at every period '.', and to split the sentences using white space to get the words.

In [8]:
# using the split function to create tokens
paragraph = data[1][3][:540]
sentences = paragraph.split(".")
for s in sentences:
    print(s + '\n')

﻿It is indeed a pleasure for me and the members of my delegation to extend to Ambassador Garba our sincere congratulations on his election to the presidency of the forty-fourth session of the General Assembly

 His election to this high office is a well-deserved tribute to his personal qualities and experience

 I am fully confident that under his able and wise leadership the Assembly will further consolidate the gains achieved during the past year


My delegation associates itself with previous speakers in expressing its appreciation of
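Splitting on periods breaks down as soon as a period is not a sentence boundary. The short sketch below uses a made-up sentence (not taken from the dataset) to illustrate the problem with abbreviations and times.

```python
# Hypothetical example text, chosen to contain periods that do not end a sentence
text = "Dr. Smith spoke at 10 a.m. on Monday. The session ended early."

# Naive approach: split on every period, then drop empty pieces and stray spaces
naive = [piece.strip() for piece in text.split(".") if piece.strip()]

print(len(naive))  # 5 pieces, although the text has only 2 sentences
for piece in naive:
    print(repr(piece))
```

The abbreviation "Dr." and the time "10 a.m." are each cut apart, which is one reason a dedicated sentence tokenizer is usually preferable.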



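We can then split each sentence further into words by splitting on spaces. Notice in the output that punctuation stays attached to the neighboring word.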

In [9]:
sentence = "What kind of patterns do you see in this graph?"
tokens = sentence.split(" ")
tokens

['What', 'kind', 'of', 'patterns', 'do', 'you', 'see', 'in', 'this', 'graph?']

### Regular expression

Regular expressions allow us to tokenize text more precisely. Run the example below.

In [10]:
import re
s1 = "this has  weird     spacing issues"
s1.split(" ")

['this', 'has', '', 'weird', '', '', '', '', 'spacing', 'issues']

Because we split on <b>one space</b>, every extra space in a run produces an empty string, and those empty strings show up as tokens.

In [11]:
# split using regular expression
re.split(r'\s+', s1)

['this', 'has', 'weird', 'spacing', 'issues']

`\s` matches any whitespace character (space, tab, newline), and the `+` sign means one or more. Thus, `\s+` means one or more whitespace characters in a row.

In [12]:
s2 = "brazil4china2france5australia"
re.split(r'[0-9]', s2)

['brazil', 'china', 'france', 'australia']

Above we are splitting up the string at each digit: `[0-9]` matches any single digit. Regular expressions provide an easy and clean way to do so.
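Two details of `re.split` are worth trying on small inputs. The strings below are made up for illustration: the first shows that `\s+` also covers tabs and newlines, and the second shows that a single-character pattern leaves empty strings behind when digits come in runs, which `[0-9]+` avoids.

```python
import re

# Made-up input mixing spaces, a tab and a newline
messy = "this has\tweird \n  spacing   issues"
by_space = messy.split(" ")          # splits on single spaces only
by_regex = re.split(r"\s+", messy)   # splits on any run of whitespace
print(by_space)
print(by_regex)  # ['this', 'has', 'weird', 'spacing', 'issues']

# Made-up input where two digits appear in a row
codes = "brazil42china7france"
one_digit = re.split(r"[0-9]", codes)    # ['brazil', '', 'china', 'france']
digit_runs = re.split(r"[0-9]+", codes)  # ['brazil', 'china', 'france']
print(one_digit)
print(digit_runs)
```

For this input, calling `messy.split()` with no argument gives the same list as the `\s+` version.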

Let's try to find all the words that are of length 5 in the sentence below.

`{5}` denotes that there are exactly 5 of whatever precedes it.

In [13]:
s3 = "A well-written encyclopedia article identifies a notable encyclopedic topic, summarizes that topic comprehensively, contains references to reliable sources, and links to other related topics"
re.findall(r'\s+[a-z]{5}\s+', s3)

[' topic ', ' links ', ' other ']

There are two things happening here:

1. `[` and `]` do not mean 'bracket'; they are special characters which mean 'any one character from this class'
2. a character class on its own matches only one character, so `{5}` is needed to repeat it five times

`re` is flexible about how you specify repetition counts - you can match none, some, a range, or all repetitions of a sequence or character class.

character | meaning
----------|--------
`{x}`     | exactly x repetitions
`{x,y}`   | between x and y repetitions
`?`       | 0 or 1 repetition
`*`       | 0 or more repetitions
`+`       | 1 or more repetitions
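The quantifiers in the table can be compared side by side on a tiny made-up string. The second half of the sketch revisits the length-5 search: the whitespace-padded pattern skips a word that is followed by a comma, while the word-boundary anchor `\b` (one possible alternative, not something the lab prescribes) picks it up.

```python
import re

# Made-up sample: an "a" followed by zero, one, or two "b" characters
sample = "a ab abb"
print(re.findall(r"ab?", sample))      # ['a', 'ab', 'ab']
print(re.findall(r"ab*", sample))      # ['a', 'ab', 'abb']
print(re.findall(r"ab+", sample))      # ['ab', 'abb']
print(re.findall(r"ab{2}", sample))    # ['abb']
print(re.findall(r"ab{1,2}", sample))  # ['ab', 'abb']

s3 = "A well-written encyclopedia article identifies a notable encyclopedic topic, summarizes that topic comprehensively, contains references to reliable sources, and links to other related topics"

# Requires whitespace on both sides, so "topic," is missed
padded = re.findall(r"\s+[a-z]{5}\s+", s3)
# \b matches at a word boundary without consuming any character
bounded = re.findall(r"\b[a-z]{5}\b", s3)
print(padded)   # [' topic ', ' links ', ' other ']
print(bounded)  # ['topic', 'topic', 'links', 'other']
```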

<b>Question</b>: find all words of length 8 in speech.

In [14]:
speech = data[1][3]
...

re is powerful but we'll stop there, because NLTK provides handy classes to do this for us:

### NLTK

NLTK (Natural Language Toolkit) is a platform for building Python programs to work with human language data.

In [15]:
import nltk
# if a later cell raises a LookupError for missing NLTK data, uncomment and run the command below
# nltk.download()

In [16]:
sents = nltk.sent_tokenize(speech)
sents[:3]

['\ufeffIt is indeed a pleasure for me and the members of my delegation to extend to Ambassador Garba our sincere congratulations on his election to the presidency of the forty-fourth session of the General Assembly.',
 'His election to this high office is a well-deserved tribute to his personal qualities and experience.',
 'I am fully confident that under his able and wise leadership the Assembly will further consolidate the gains achieved during the past year.']

In [17]:
s4 = "At eight o'clock on Thursday morning Arthur didn't feel very good."
nltk.word_tokenize(s4)

['At',
 'eight',
 "o'clock",
 'on',
 'Thursday',
 'morning',
 'Arthur',
 'did',
 "n't",
 'feel',
 'very',
 'good',
 '.']

NLTK recognized that "o'clock" is one word, separated "didn't" into "did" and "n't", and split the final period off as its own token.
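For comparison, a rough regular-expression tokenizer can be written in a few lines. This is only a sketch and not how NLTK tokenizes: the function name `simple_tokenize` and its pattern are assumptions made for illustration. It keeps words with an internal apostrophe together and splits other punctuation off as separate tokens.

```python
import re

def simple_tokenize(sentence):
    """Return word and punctuation tokens using one regular expression (illustrative only)."""
    # \w+(?:'\w+)?  a word, optionally followed by an apostrophe and more letters
    # [^\w\s]       any single character that is neither a word character nor whitespace
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

q_tokens = simple_tokenize("What kind of patterns do you see in this graph?")
s4_tokens = simple_tokenize("At eight o'clock on Thursday morning Arthur didn't feel very good.")
print(q_tokens)
print(s4_tokens)
```

Unlike `nltk.word_tokenize`, this sketch leaves "didn't" as a single token, which shows the kind of language-specific rule a dedicated tokenizer adds.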

For more complicated metrics, it's easier to use NLTK's classes and methods.

In [18]:
tokens = nltk.word_tokenize(speech)
fd = nltk.FreqDist(tokens)
fd.most_common(10)

[('the', 270),
 ('of', 165),
 ('.', 121),
 ('and', 110),
 (',', 109),
 ('to', 95),
 ('in', 74),
 ('that', 48),
 ('a', 47),
 ('is', 40)]
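NLTK's `FreqDist` is a subclass of `collections.Counter`, so the counting step itself can be tried without NLTK. The token list below is made up for illustration.

```python
from collections import Counter

# Made-up tokens standing in for the output of a tokenizer
toy_tokens = "the cat and the dog and the bird".split()

counts = Counter(toy_tokens)
top_two = counts.most_common(2)  # the two most frequent tokens with their counts
print(top_two)                   # [('the', 3), ('and', 2)]
```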

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging.

In [19]:
tagged = nltk.pos_tag(tokens[2:8])
tagged

[('indeed', 'RB'),
 ('a', 'DT'),
 ('pleasure', 'NN'),
 ('for', 'IN'),
 ('me', 'PRP'),
 ('and', 'CC')]

A common step in text analysis is to remove noise. *However*, what you deem "noise" is not only very important but also dependent on the project at hand. For the purposes of today, we will discuss two common categories of strings often considered "noise". 

- Punctuation: While important for sentence analysis, punctuation will get in the way of word frequency and n-gram analyses. It will also affect any clustering or topic modeling.

- Stopwords: Stopwords are the most frequent words in any given language. Words like "the", "a", "that", etc. are considered not semantically important, and would also skew any frequency or n-gram analysis.

<b>Question</b> Write a function below that takes a string as an argument and returns a list of words without punctuation or stopwords.

`punctuation` is a string containing the ASCII punctuation characters, and we have created the list `stop_words` for you.

Hint: first you'll want to remove punctuation, then tokenize, then remove stop words. Make sure you account for upper and lower case!

In [20]:
def rem_punc_stop(text):
    
    from string import punctuation
    from nltk.corpus import stopwords
    
    # a set makes the membership test below fast
    stop_words = set(stopwords.words("english"))
    
    # replace every punctuation character with a space
    punc_removed = "".join(" " if ch in punctuation else ch for ch in text)
            
    tokens = nltk.word_tokenize(punc_removed)
    
    # note: this comparison is case-sensitive and the stopword list is lowercase
    return [token for token in tokens if token not in stop_words]

Now we can rerun our frequency analysis without the noise:

In [21]:
tokens_reduced = rem_punc_stop(speech)
fd_reduced = nltk.FreqDist(tokens_reduced)
fd_reduced.most_common(10)

[('We', 25),
 ('The', 23),
 ('security', 16),
 ('international', 14),
 ('situation', 13),
 ('efforts', 13),
 ('development', 11),
 ('political', 11),
 ('peace', 11),
 ('environment', 11)]

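Now our analysis is much more informative. Note, however, that 'We' and 'The' still top the list: the stopword list is lowercase and the comparison in the function is case-sensitive, so capitalized stopwords slip through unless the tokens are lowercased first.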
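The result above still contains 'We' and 'The', which ties back to the hint about upper and lower case. One way to handle this is to compare the lowercased token against the stopword list. The sketch below is self-contained and rests on assumptions: it uses a small hand-written stopword set (`STOP_WORDS`) and plain whitespace splitting instead of NLTK's list and tokenizer, and the name `rem_punc_stop_ci` is made up for illustration.

```python
from string import punctuation

# Hypothetical, deliberately tiny stopword set (NLTK's English list is much longer)
STOP_WORDS = {"the", "a", "of", "and", "we", "is", "to", "in", "that"}

def rem_punc_stop_ci(text, stop_words=STOP_WORDS):
    """Remove punctuation and stopwords, ignoring case when checking stopwords."""
    # Replace each punctuation character with a space
    cleaned = "".join(" " if ch in punctuation else ch for ch in text)
    # Keep the token's original spelling, but compare its lowercase form
    return [tok for tok in cleaned.split() if tok.lower() not in stop_words]

kept = rem_punc_stop_ci("We believe that the security of the region is vital.")
print(kept)  # ['believe', 'security', 'region', 'vital']
```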

## Stemming <a id='section 2'></a>

In NLP it is often the case that the specific form of a word is not as important as the idea to which it refers. For example, if you are trying to identify the topic of a document, counting 'running', 'runs', 'ran', and 'run' as four separate words is not useful. Reducing words to their stems is a process called stemming.

A popular stemming implementation is the Snowball Stemmer, which is based on the Porter Stemmer. Its algorithm looks at word forms and does things like drop final 's's, 'ed's, and 'ing's.

Unlike the tokenizer functions above, a stemmer must first be created as an object for the language we are using.

In [22]:
snowball = nltk.SnowballStemmer('english')

Now we can try stemming some words:

In [23]:
snowball.stem('running')

'run'

In [24]:
snowball.stem('eats')

'eat'

In [25]:
snowball.stem('embarrassed')

'embarrass'

Snowball is a very fast algorithm, but it has a lot of edge cases. Sometimes related words are reduced to two different stems, and sometimes two different words are reduced to the same stem.

In [26]:
snowball.stem('cylinder'), snowball.stem('cylindrical')

('cylind', 'cylindr')

In [27]:
snowball.stem('vacation'), snowball.stem('vacate')

('vacat', 'vacat')
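To make the suffix-dropping idea concrete, here is a deliberately naive stemmer. It is not the Snowball or Porter algorithm; the function name `toy_stem`, the suffix list, and the minimum stem length of 3 are all assumptions chosen for illustration.

```python
def toy_stem(word):
    """Strip one common English suffix if enough of the word remains (illustrative only)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least 3 characters are left, so "bus" is not cut to "bu"
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("eats", "walked", "running", "bus"):
    print(w, "->", toy_stem(w))
```

The result for "running" is "runn" rather than "run", which hints at why real stemmers need extra rules (such as undoubling consonants) and why those rules produce edge cases of their own.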

## Chunking<a id='section 3'></a>

We may want to work with larger segments of text than single words (but still smaller than a sentence). For instance, in the sentence "The black cat climbed over the tall fence", we might want to treat "The black cat" as one thing (the subject), "climbed over" as a distinct act, and "the tall fence" as another thing (the object). The first and third sequences are noun phrases, and the second is a verb phrase.

We can separate these phrases by "chunking" the sentence, i.e. splitting it into larger chunks than individual tokens. This is also an important step toward identifying entities, which are often represented by more than one word. You can probably imagine certain patterns that would define a noun phrase, using part of speech tags. For instance, a determiner (e.g. an article like "the") could be concatenated onto the noun that follows it. If there's an adjective between them, we can include that too.

To define rules about how to structure words based on their part of speech tags, we use a grammar (in this case, a "chunk grammar"). NLTK provides a RegexpParser that takes as input a grammar composed of regular expressions. The grammar is defined as a string, with one line for each rule we define. Each rule starts with the label we want to assign to the chunk (e.g. NP for "noun phrase"), followed by a colon, then an expression in regex-like notation that will be matched to tokens' POS tags.

We can define a single rule for a noun phrase like this. The rule allows 0 or 1 determiner, then 0 or more adjectives, and finally at least 1 noun. (By using 'NN.*' as the last POS tag, we can match 'NN', 'NNP' for a proper noun, or 'NNS' for a plural noun.) If a matching sequence of tokens is found, it will be labeled 'NP'.

In [28]:
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
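The rule can be tried out without NLTK or a tagger by matching a regular expression against the sequence of tags. The sketch below is an illustration of the idea only, not NLTK's implementation: the function name `chunk_noun_phrases` is made up, and the example sentence is tagged by hand.

```python
import re

def chunk_noun_phrases(tagged):
    """Return noun-phrase chunks: optional DT, any number of JJ, then one or more NN* tags."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>"
    tag_string = "".join("<%s>" % tag for _, tag in tagged)
    # Record the character offset where each tag starts, to map matches back to tokens
    starts, pos = [], 0
    for _, tag in tagged:
        starts.append(pos)
        pos += len(tag) + 2  # tag plus its two angle brackets
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN[^>]*>)+")
    chunks = []
    for match in pattern.finditer(tag_string):
        first = starts.index(match.start())
        last = first + match.group().count("<")  # number of tags in the match
        chunks.append([word for word, _ in tagged[first:last]])
    return chunks

# Hand-tagged example sentence (tags assigned manually, not by a tagger)
tagged_sentence = [("The", "DT"), ("black", "JJ"), ("cat", "NN"), ("climbed", "VBD"),
                   ("over", "IN"), ("the", "DT"), ("tall", "JJ"), ("fence", "NN")]
noun_phrases = chunk_noun_phrases(tagged_sentence)
print(noun_phrases)  # [['The', 'black', 'cat'], ['the', 'tall', 'fence']]
```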

We create a chunk parser object by supplying this grammar, then use it to parse a sentence into chunks. The sentence we want to parse must already be POS-tagged, since our grammar uses those POS tags to identify chunks. Let's try this on the second sentence of the speech we generated above.

In [29]:
from nltk import RegexpParser

cp = RegexpParser(grammar)

# pos_tag expects a list of tokens; a raw string would be tagged one character at a time
sent_tagged = nltk.pos_tag(nltk.word_tokenize(sents[1]))
sent_chunked = cp.parse(sent_tagged)

print(sent_chunked)


When we called print() on this chunked sentence, it printed out a nested list of nodes.

In [30]:
type(sent_chunked)

nltk.tree.Tree

The tree object has a number of methods we can use to interact with its components. For instance, we can use the method draw() to see a more graphical representation. This will open a separate window.

The tree is pretty flat, because we defined a grammar that only grouped words into non-overlapping noun phrases, with no additional hierarchy above them. This is sometimes referred to as "shallow parsing".

In [57]:
sent_chunked.draw()

---
Notebook developed by: X,X,X

Data Science Modules: http://data.berkeley.edu/education/modules
