# Text preprocessing

- Your task is to complete the coding cells (marked with `YOUR CODE BELOW`).

### Basic Text Analysis - *Text representation & basic NLP operations*

In [1]:
!pip install nltk



<a id='lab1'></a>
## Lab 1: Basic Text Analysis

Let's start with basic NLP operations. These are typically used in text preprocessing
to improve data quality for downstream tasks:

- Stopword removal
- Word/sentence tokenization
- Part-of-speech (POS) tagging
- Stemming
- Lemmatization
- Word Sense Disambiguation (WSD)
- Named Entity Recognition (NER)

In [2]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
# sentence / word tokenization
cnn = '''The Cable News Network is a multinational news channel and website headquartered in Atlanta, Georgia, U.S.
        Founded in 1980 by American media proprietor Ted Turner and Reese Schonfeld as a 24-hour cable'''

sent_tokenize(cnn)

['The Cable News Network is a multinational news channel and website headquartered in Atlanta, Georgia, U.S.',
 'Founded in 1980 by American media proprietor Ted Turner and Reese Schonfeld as a 24-hour cable']
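
For contrast, a naive split on periods shows why a trained sentence tokenizer is useful for text like this. The sketch below uses only the standard library; `naive_sentences` is a hypothetical helper (not part of NLTK), and the text is a shortened stand-in for the example above.

```python
def naive_sentences(text):
    """Split on every period and drop empty fragments (deliberately naive)."""
    return [piece.strip() for piece in text.split('.') if piece.strip()]

# Shortened stand-in for the CNN text used above (an assumption for this sketch).
text = 'The network is headquartered in Atlanta, Georgia, U.S. It was founded in 1980.'

pieces = naive_sentences(text)
print(len(pieces))
print(pieces)
```

The abbreviation is cut into separate fragments here, whereas `sent_tokenize` above kept `U.S.` inside the first sentence.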

In [4]:
word_tokenize(cnn)

['The',
 'Cable',
 'News',
 'Network',
 'is',
 'a',
 'multinational',
 'news',
 'channel',
 'and',
 'website',
 'headquartered',
 'in',
 'Atlanta',
 ',',
 'Georgia',
 ',',
 'U.S',
 '.',
 'Founded',
 'in',
 '1980',
 'by',
 'American',
 'media',
 'proprietor',
 'Ted',
 'Turner',
 'and',
 'Reese',
 'Schonfeld',
 'as',
 'a',
 '24-hour',
 'cable']

In [5]:
# word tokenization
sent = 'This is the first sentence, and this is the second sentence.'
words = word_tokenize(sent.lower())
print(words)

['this', 'is', 'the', 'first', 'sentence', ',', 'and', 'this', 'is', 'the', 'second', 'sentence', '.']


In [6]:
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [7]:
stops_en = stopwords.words('english')
stops_fr = stopwords.words('french')
print(stops_en)
print(stops_fr)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [8]:
# customize your stop word list by adding words
stops_en.append('airline')
print(stops_en)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
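
The cells above build a stop word list but do not yet apply it. A minimal sketch of the filtering step follows; it uses a small hand-picked subset of the list (an assumption made so the example runs without NLTK data) and the tokens produced by the word tokenization cell earlier.

```python
import string

# Small illustrative subset of the English stop word list printed above,
# plus the custom word added in the previous cell.
stop_subset = {'this', 'is', 'the', 'and', 'airline'}

# Single punctuation characters to drop alongside stop words.
punct = set(string.punctuation)

# Tokens as produced by word_tokenize(sent.lower()) earlier in the notebook.
tokens = ['this', 'is', 'the', 'first', 'sentence', ',', 'and',
          'this', 'is', 'the', 'second', 'sentence', '.']

# Keep tokens that are neither stop words nor punctuation marks.
filtered = [t for t in tokens if t not in stop_subset and t not in punct]
print(filtered)
```

With the full list, `set(stops_en)` would take the place of `stop_subset`; converting the list to a set keeps membership checks fast on longer documents.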

In [9]:
# Part-of-speech (POS) tagging

from nltk import pos_tag

nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.


In [10]:
sent = 'I like that awesome movie, especially the great director.'

# YOUR CODE BELOW
words = word_tokenize(sent)
tagged = pos_tag(words)
print(tagged)

[('I', 'PRP'), ('like', 'VBP'), ('that', 'DT'), ('awesome', 'JJ'), ('movie', 'NN'), (',', ','), ('especially', 'RB'), ('the', 'DT'), ('great', 'JJ'), ('director', 'NN'), ('.', '.')]
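
Tagged tokens are often filtered by tag afterwards, for example to keep only adjectives or nouns. The sketch below works on the tagged list copied from the output above; `words_with_prefix` is a hypothetical helper, and it relies on the Penn Treebank convention that related tags share a prefix.

```python
# Tagged tokens copied from the output above.
tagged = [('I', 'PRP'), ('like', 'VBP'), ('that', 'DT'), ('awesome', 'JJ'),
          ('movie', 'NN'), (',', ','), ('especially', 'RB'), ('the', 'DT'),
          ('great', 'JJ'), ('director', 'NN'), ('.', '.')]

def words_with_prefix(tagged_tokens, prefix):
    """Return words whose Penn Treebank tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

adjectives = words_with_prefix(tagged, 'JJ')  # covers JJ, JJR, JJS
nouns = words_with_prefix(tagged, 'NN')       # covers NN, NNS, NNP, NNPS
print(adjectives)
print(nouns)
```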


In [11]:
# Stemming
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

d1 = 'Text mining, also referred to as text data mining, is equivalent to text analytics.'

# YOUR CODE BELOW
words = word_tokenize(d1)
for w in words:
  print(w,',',porter.stem(w),',',lancaster.stem(w))

Text , text , text
mining , mine , min
, , , , ,
also , also , also
referred , refer , refer
to , to , to
as , as , as
text , text , text
data , data , dat
mining , mine , min
, , , , ,
is , is , is
equivalent , equival , equ
to , to , to
text , text , text
analytics , analyt , analys
. , . , .


In [12]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [13]:
# Lemmatization
import nltk
from nltk import word_tokenize, pos_tag
nltk.download('wordnet')

d2 = 'Text analysis refers to information retrieval, analyzing documents and articles.'

# YOUR CODE BELOW
wnl = nltk.WordNetLemmatizer()

words = word_tokenize(d2)
for t in words:
  print(t,',',wnl.lemmatize(t))

[nltk_data] Downloading package wordnet to /root/nltk_data...


Text , Text
analysis , analysis
refers , refers
to , to
information , information
retrieval , retrieval
, , ,
analyzing , analyzing
documents , document
and , and
articles , article
. , .
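
In the output above, `refers` and `analyzing` come back unchanged because `WordNetLemmatizer.lemmatize` treats a word as a noun unless a `pos` argument says otherwise. One common workaround, sketched here under that assumption, maps the Penn Treebank tags returned by `pos_tag` to WordNet's single-letter POS codes. `penn_to_wordnet` is a hypothetical helper, not an NLTK function.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun as fallback)."""
    if tag.startswith('JJ'):
        return 'a'  # adjective
    if tag.startswith('VB'):
        return 'v'  # verb
    if tag.startswith('RB'):
        return 'r'  # adverb
    return 'n'      # nouns and everything else

# Show the mapping for a few common tags.
for tag in ['NN', 'VBZ', 'VBG', 'JJ', 'RB']:
    print(tag, '->', penn_to_wordnet(tag))
```

The returned letter can then be passed along with each token, for instance `wnl.lemmatize(word, pos=penn_to_wordnet(tag))` over the pairs from `pos_tag(words)`, which should reduce verb forms such as `refers` to their base form.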


In [14]:
!pip install pywsd

Collecting pywsd
  Downloading pywsd-1.2.5-py3-none-any.whl (26.9 MB)
Collecting wn==0.0.23 (from pywsd)
  Downloading wn-0.0.23.tar.gz (31.6 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wn
  Building wheel for wn (setup.py) ... done
  Created wheel for wn: filename=wn-0.0.23-py3-none-any.whl size=31792911 sha256=f6a2793cb69da5630fc874bac4a154a4a961f2ecf9d017ac0610504fd77b480e
  Stored in directory: /root/.cache/pip/wheels/a1/1a/7d/23a76ce45998af60e47466a694c237fa26023c5674b47672b2
Successfully built wn
Installing collected packages: wn, pywsd
Successfully installed pywsd-1.2.5 wn-0.0.23


In [15]:
# Word Sense Disambiguation
#plant_sents = ['The workers at the industrial plant were overworked', 'The plant was no longer bearing flowers']

from pywsd.lesk import simple_lesk

bank_sents = ['I went to the bank to deposit my money', 'The river bank was full of dead fishes']
answer = simple_lesk(bank_sents[0],'bank')
print("Sense:", answer)
print("Definition:",answer.definition())

print('===')

answer = simple_lesk(bank_sents[1],'bank')
print("Sense:", answer)
print("Definition:",answer.definition())

# PLEASE EXPLAIN WHAT THE CELL DOES IN PLAIN LANGUAGE
# This cell disambiguates the word "bank" in each sentence of "bank_sents".
# The word is spelled the same in both sentences but carries a different meaning in each.
# For each sentence, the simple_lesk function selects the WordNet sense of "bank" that best fits the surrounding context.
# The results show that the first sentence refers to a financial institution, while the second refers to sloping land beside a body of water (a riverbank).


Warming up PyWSD (takes ~10 secs)... 

Sense: Synset('depository_financial_institution.n.01')
Definition: a financial institution that accepts deposits and channels the money into lending activities
===
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)


took 8.438102006912231 secs.
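
Lesk-style methods choose the sense whose dictionary signature shares the most words with the sentence. The toy sketch below illustrates that idea with the standard library only; it is not pywsd's implementation. The glosses are copied from the output above, while the example sentences, the stop list, and the names `SENSES`, `STOP`, `content_words`, and `toy_lesk` are assumptions made for this illustration.

```python
import re

# Toy sense inventory for "bank". Glosses come from the output above;
# the example sentences are illustrative additions assumed for this sketch.
SENSES = {
    'depository_financial_institution.n.01': (
        'a financial institution that accepts deposits and channels '
        'the money into lending activities',
        'he cashed a check at the bank',
    ),
    'bank.n.01': (
        'sloping land (especially the slope beside a body of water)',
        'he sat on the bank of the river and watched the currents',
    ),
}

# Tiny hand-picked stop list so function words (and the target word
# itself) do not count as overlap.
STOP = {'a', 'an', 'and', 'at', 'he', 'i', 'into', 'my', 'of', 'on',
        'that', 'the', 'to', 'was', 'bank'}

def content_words(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    return {w for w in re.findall(r'[a-z]+', text.lower()) if w not in STOP}

def toy_lesk(sentence):
    """Return (best_sense, scores) based on word overlap with each signature."""
    context = content_words(sentence)
    scores = {
        name: len(context & content_words(' '.join(texts)))
        for name, texts in SENSES.items()
    }
    return max(scores, key=scores.get), scores

bank_sents = ['I went to the bank to deposit my money',
              'The river bank was full of dead fishes']
for s in bank_sents:
    print(toy_lesk(s))
```

Exact word matching is brittle (`deposit` does not match `deposits`, for example), which is one reason practical implementations add stemming or lemmatization and richer signatures.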


In [16]:
# Named Entity Recognition (NER)

nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('tagsets')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


True

In [18]:
from nltk import pos_tag, word_tokenize

ex = 'European authorities fined Google a record $5.1 billion for abusing its power in the mobile market.'

# YOUR CODE BELOW
words = word_tokenize(ex)
tags = pos_tag(words)

ne_tree = nltk.ne_chunk(tags)

print(ne_tree)

(S
  (GPE European/JJ)
  authorities/NNS
  fined/VBD
  (PERSON Google/NNP)
  a/DT
  record/NN
  $/$
  5.1/CD
  billion/CD
  for/IN
  abusing/VBG
  its/PRP$
  power/NN
  in/IN
  the/DT
  mobile/JJ
  market/NN
  ./.)
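
A typical next step is to pull the labeled chunks out of the tree. The sketch below uses a plain-Python stand-in for the first few nodes of the tree above so that it runs without NLTK; the `chunked` structure and `extract_entities` are assumptions made for this illustration.

```python
# Stand-in for part of the chunk tree above: a chunk is
# (label, [(word, tag), ...]) and an unchunked token is a (word, tag) pair.
chunked = [
    ('GPE', [('European', 'JJ')]),
    ('authorities', 'NNS'),
    ('fined', 'VBD'),
    ('PERSON', [('Google', 'NNP')]),
    ('a', 'DT'),
    ('record', 'NN'),
]

def extract_entities(nodes):
    """Collect (label, text) pairs for every chunk in the sequence."""
    entities = []
    for label, content in nodes:
        if isinstance(content, list):  # only chunks carry a list of leaves
            entities.append((label, ' '.join(word for word, _ in content)))
    return entities

entities = extract_entities(chunked)
print(entities)
```

On the real `ne_tree`, the same loop would test `isinstance(node, nltk.Tree)` and read `node.label()` and `node.leaves()`. The labels are model predictions and can be wrong: in the output above, `Google` is tagged `PERSON` and `European` is tagged `GPE`.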
