## Week 3: Basic Text Processing

In [8]:
import re # For regular expressions
from nltk import word_tokenize, sent_tokenize, ngrams
from collections import Counter # To help with counting

In [9]:
# An example string to use
text ="""More than US$20 More than US in venture capital is being invested in FinTech this year. \
@CloudExpo is pleased to bring you the latest FinTech developments as an integral part of our \
program, starting at the 21st International \
Cloud Expo October 31 - November 2, 2017 in Silicon Valley, and June 12-14, 2018, in New York City. \
The upcoming 21st International @CloudExpo | @ThingsExpo, October 31 - November 2, 2017, \
Santa Clara Convention Center, CA, and June 12-14, 2018, at the Javits Center in New York City, \
NY announces that its Call For Papers for speaking opportunities is open."""

### Regular expressions

In [10]:
# Remove everything that isn't a letter, digit, full stop, comma or space
pattern = re.compile(r'[^A-Za-z0-9., ]')
re.sub(pattern, "", text)

'More than US20 More than US in venture capital is being invested in FinTech this year. CloudExpo is pleased to bring you the latest FinTech developments as an integral part of our program, starting at the 21st International Cloud Expo October 31  November 2, 2017 in Silicon Valley, and June 1214, 2018, in New York City. The upcoming 21st International CloudExpo  ThingsExpo, October 31  November 2, 2017, Santa Clara Convention Center, CA, and June 1214, 2018, at the Javits Center in New York City, NY announces that its Call For Papers for speaking opportunities is open.'
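Deleting characters outright has side effects that are visible in the output above: `12-14` collapses into `1214`, and double spaces are left where ` - ` used to be. One possible alternative, sketched here on a shorter sample string (the string and variable names are assumptions for illustration), is to substitute a space and then collapse runs of whitespace.

```python
import re

# Shortened sample in the style of the example text (illustrative only)
sample = "October 31 - November 2, 2017 and June 12-14, 2018 @CloudExpo"

# Replace disallowed characters with a space instead of deleting them
no_specials = re.sub(r"[^A-Za-z0-9., ]", " ", sample)

# Collapse runs of whitespace into a single space and trim the ends
cleaned = re.sub(r"\s+", " ", no_specials).strip()
print(cleaned)
```

Whether `12 14` is preferable to `1214` depends on the downstream task, so treat this as one option rather than a fix.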

In [11]:
# Find the first dollar amount: a "$" followed by the characters up to the next space
re.search(r'\$(.+?) ', text)

<re.Match object; span=(12, 16), match='$20 '>
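Note that `re.search` stops at the first match. To collect every match, `re.findall` is one option. The sketch below uses a shortened sample string (an assumption for illustration) to list every character outside the allowed set, and every `@` mention.

```python
import re

# Shortened sample in the style of the example text (illustrative only)
sample = "More than US$20 is invested. @CloudExpo | @ThingsExpo, June 12-14, 2018."

# Every character that is not a letter, digit, full stop, comma or space
special_chars = re.findall(r"[^A-Za-z0-9., ]", sample)

# Every @mention: an "@" followed by one or more word characters
mentions = re.findall(r"@\w+", sample)

print(special_chars)  # ['$', '@', '|', '@', '-']
print(mentions)       # ['@CloudExpo', '@ThingsExpo']
```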

### Tokenization

In [12]:
# Sentence tokenization
sentences = sent_tokenize(text)
for sentence in sentences:
    print(sentence + "\n")

More than US$20 More than US in venture capital is being invested in FinTech this year.

@CloudExpo is pleased to bring you the latest FinTech developments as an integral part of our program, starting at the 21st International Cloud Expo October 31 - November 2, 2017 in Silicon Valley, and June 12-14, 2018, in New York City.

The upcoming 21st International @CloudExpo | @ThingsExpo, October 31 - November 2, 2017, Santa Clara Convention Center, CA, and June 12-14, 2018, at the Javits Center in New York City, NY announces that its Call For Papers for speaking opportunities is open.



In [13]:
tokens = word_tokenize(text)
for token in tokens[:2]:
    print(token + "\n")

More

than



Tokens can be grouped into consecutive sequences of n tokens, known as **n-grams** (for example, bigrams for n = 2 and trigrams for n = 3).

In [14]:
# ngrams returns a generator of 3-token tuples; it can only be iterated once
trigram = ngrams(tokens, 3)

# Counter maps each distinct n-gram to the number of times it occurs
# Counter(trigram)
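To make the counting step concrete, here is a minimal plain-Python sketch. `make_ngrams` is a hypothetical helper that produces the same kind of consecutive tuples as `ngrams`, and the sample tokens are made up for illustration.

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return consecutive n-token tuples (hypothetical plain-Python helper)."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Whitespace-split sample tokens (illustrative only)
sample_tokens = "in New York City and in New York state".split()

# Count how often each bigram occurs
bigram_counts = Counter(make_ngrams(sample_tokens, 2))
print(bigram_counts.most_common(2))
```

Because the helper returns a list rather than a generator, it can be counted and inspected more than once.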

### Stemming and Lemmatization

In [15]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

In [16]:
# Create a stemmer
ps = PorterStemmer()
stemmed_text = [ps.stem(token) for token in tokens]
stemmed_text[10:20]

['capit', 'is', 'be', 'invest', 'in', 'fintech', 'thi', 'year', '.', '@']

In [17]:
wl = WordNetLemmatizer()
lemmatized_text = [wl.lemmatize(token) for token in tokens]
lemmatized_text[10:20]

['capital',
 'is',
 'being',
 'invested',
 'in',
 'FinTech',
 'this',
 'year',
 '.',
 '@']
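The two outputs above differ because the stemmer strips suffixes by rule (and lowercases), while `WordNetLemmatizer.lemmatize` looks words up as nouns unless told otherwise, which is why `being` and `invested` come back unchanged. The sketch below passes `pos="v"` to request verb lemmas. It assumes the WordNet corpus may not be installed, so that part is wrapped in a `try` block.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

words = ["capital", "being", "invested", "this"]

# Stemming needs no extra data: it applies suffix-stripping rules
ps = PorterStemmer()
stems = [ps.stem(word) for word in words]
print(stems)  # the same stems as in the stemmer output above

# Lemmatization treats words as nouns by default; pos="v" asks for verb lemmas
wl = WordNetLemmatizer()
try:
    verb_lemmas = [wl.lemmatize(word, pos="v") for word in ["being", "invested"]]
    print(verb_lemmas)
except LookupError:
    # The WordNet corpus is missing; nltk.download("wordnet") fetches it
    print("WordNet data not found")
```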

### POS Tagging

In [21]:
from nltk import pos_tag

In [39]:
# POS on the third sentence
tokens = word_tokenize(sentences[2])
sentence_pos = pos_tag(tokens)
sentence_pos[:5]

[('The', 'DT'),
 ('upcoming', 'JJ'),
 ('21st', 'CD'),
 ('International', 'NNP'),
 ('@', 'NNP')]

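You can also define your own chunk grammar using regex-like patterns over POS tags. Given a POS-tagged sentence, the grammar groups matching tag sequences into phrases, which is a lighter-weight alternative to writing a full CFG.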

In [40]:
# A noun phrase (NP): an optional determiner, then any number of adjectives, then a proper noun
grammar = "NP: {<DT>?<JJ>*<NNP>}"

In [41]:
from nltk import RegexpParser
cp = RegexpParser(grammar)
cp

<chunk.RegexpParser with 1 stages>

In [42]:
print(cp.parse(sentence_pos))

(S
  The/DT
  upcoming/JJ
  21st/CD
  (NP International/NNP)
  (NP @/NNP)
  (NP CloudExpo/NNP)
  (NP |/NNP)
  (NP @/NNP)
  (NP ThingsExpo/NNP)
  ,/,
  (NP October/NNP)
  31/CD
  -/:
  (NP November/NNP)
  2/CD
  ,/,
  2017/CD
  ,/,
  (NP Santa/NNP)
  (NP Clara/NNP)
  (NP Convention/NNP)
  (NP Center/NNP)
  ,/,
  (NP CA/NNP)
  ,/,
  and/CC
  (NP June/NNP)
  12-14/CD
  ,/,
  2018/CD
  ,/,
  at/IN
  (NP the/DT Javits/NNP)
  (NP Center/NNP)
  in/IN
  (NP New/NNP)
  (NP York/NNP)
  (NP City/NNP)
  ,/,
  (NP NY/NNP)
  announces/VBZ
  that/IN
  its/PRP$
  (NP Call/NNP)
  For/IN
  (NP Papers/NNP)
  for/IN
  speaking/VBG
  opportunities/NNS
  is/VBZ
  open/JJ
  ./.)
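In the parse above, every proper noun becomes its own `NP` because the rule matches exactly one `<NNP>`. A possible refinement, sketched here as an assumption rather than the intended grammar, is `<NNP>+`, which groups runs of proper nouns into a single chunk. The tagged tokens are copied by hand from the output above, so no tagger data is needed.

```python
from nltk import RegexpParser

# Tagged tokens copied from the parse output above
tagged = [("at", "IN"), ("the", "DT"), ("Javits", "NNP"), ("Center", "NNP"),
          ("in", "IN"), ("New", "NNP"), ("York", "NNP"), ("City", "NNP")]

# <NNP>+ matches one or more consecutive proper nouns
multi_word_parser = RegexpParser("NP: {<DT>?<JJ>*<NNP>+}")
tree = multi_word_parser.parse(tagged)

# Join the words of each NP chunk back into a phrase
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]
print(noun_phrases)  # ['the Javits Center', 'New York City']
```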
