# Lab 3: Natural Language Processing

<h1>Table of contents</h1>

<div class="alert alert-block alert-info" style="margin-top: 10px">
    <ol>
        <li><a href="#download_data">Importing needed packages and downloading the data</a></li>
        <li><a href="#words">Segmentation into words (tokenization)</a></li>
        <li><a href="#Sentence">Sentence segmentation</a></li>
        <li><a href="#POS">PoS tagging</a></li>
        <li><a href="#lemmatization">Lemmatization</a></li>
        <li><a href="#recognition">Recognition of named entities</a></li>
        <li><a href="#Constituency">Constituency parsing</a></li>
        <li><a href="#recognition">Dependency parsing</a></li>
        <li><a href="#reference">Co-reference resolution</a></li>
    </ol>
</div>
<br>
<hr>

<h1 id='download_data'>1. Importing needed packages and downloading the data</h1>

In [2]:
import nltk
import urllib.request
import tarfile
import numpy as np
import os
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
import re
import string
import warnings
warnings.filterwarnings("ignore")

In [3]:
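# Download the tagger model, the tokenizer models, and the transcript archive
nltk.download('averaged_perceptron_tagger')
# Tokenizer models used by word_tokenize and sent_tokenize (newer NLTK releases may need 'punkt_tab' instead)
nltk.download('punkt', quiet=True)
filename = "tbbt.tar.gz"
urllib.request.urlretrieve("https://perso.limsi.fr/anne/tbbt.tar.gz", filename=filename)
# Extract the archive into ./data and close it afterwards
with tarfile.open(filename) as tf:
    tf.extractall('data')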

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/geekayman/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


<h1 id='words'>2. Segmentation into words (tokenization)</h1>

## Text processing

The following code performs word tokenization with the Natural Language Toolkit (NLTK).

In [4]:
tokenized_docs = []
file_name = []
start_path = 'data/tbbt/s3/txt/'  # directory holding the extracted transcripts
for path, dirs, files in os.walk(start_path):
    for filename in files:
        # Read each transcript and split it into word tokens
        with open(os.path.join(path, filename)) as fin:
            tokens = word_tokenize(fin.read())
        tokenized_docs.append(tokens)
        file_name.append(filename)

## Cleaning Text

In [5]:
# Regex matching any single punctuation character (see https://docs.python.org/3/library/string.html)
regex = re.compile('[%s]' % re.escape(string.punctuation))

tokenized_docs_no_punctuation = []

for doc_tokens in tokenized_docs:
    cleaned_tokens = []
    for token in doc_tokens:
        # Strip punctuation characters and drop tokens that become empty
        new_token = regex.sub('', token)
        if new_token != '':
            cleaned_tokens.append(new_token)

    tokenized_docs_no_punctuation.append(cleaned_tokens)

print(tokenized_docs_no_punctuation[0][:100])

['Goin', 'out', 'tonight', 'I', 'm', 'feelin', 'all', 'right', 'Gon', 'na', 'let', 'It', 'all', 'hang', 'out', 'Wan', 'na', 'make', 'some', 'noise', 'Really', 'raise', 'my', 'voice', 'Yeah', 'I', 'wan', 'na', 'scream', 'and', 'shout', 'Unh', 'No', 'Morning', 'Sheldon', 'Come', 'dance', 'with', 'me', 'No', 'Why', 'not', 'Penny', 'while', 'I', 'subscribe', 'to', 'the', 'ManyWorlds', 'theory', 'which', 'posits', 'the', 'existence', 'of', 'infinite', 'Sheldons', 'in', 'an', 'infinite', 'number', 'of', 'universes', 'I', 'assure', 'you', 'that', 'none', 'of', 'them', 'am', 'I', 'dancing', 'Are', 'you', 'fun', 'in', 'any', 'of', 'them', 'The', 'math', 'would', 'suggest', 'that', 'in', 'a', 'few', 'I', 'm', 'a', 'clown', 'made', 'of', 'candy', 'But', 'I', 'do', 'nt', 'dance']


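The function `word_tokenize()` does more than split on white space: it applies Treebank-style rules that separate punctuation and split contractions. This is why, once punctuation has been stripped, the output above contains pairs such as 'I', 'm' and 'do', 'nt', or 'Gon', 'na'.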
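
To see the difference from plain whitespace splitting, the sketch below compares `str.split()` with NLTK's `TreebankWordTokenizer` on one sentence from the transcript, then applies the same punctuation stripping as the cleaning cell. Using `TreebankWordTokenizer` directly instead of `word_tokenize()` is an assumption made here so that the example runs without the downloaded tokenizer models; results for other inputs may differ slightly between the two.

```python
import re
import string

from nltk.tokenize import TreebankWordTokenizer

sentence = "But I don't dance."

# Plain whitespace splitting keeps the contraction and the final period attached
whitespace_tokens = sentence.split()

# Treebank-style rules split the contraction and separate the final period
treebank_tokens = TreebankWordTokenizer().tokenize(sentence)

# Same punctuation stripping as in the cleaning cell: "n't" becomes "nt", "." is dropped
punctuation = re.compile('[%s]' % re.escape(string.punctuation))
cleaned_tokens = [t for t in (punctuation.sub('', tok) for tok in treebank_tokens) if t]

print(whitespace_tokens)  # ['But', 'I', "don't", 'dance.']
print(treebank_tokens)    # ['But', 'I', 'do', "n't", 'dance', '.']
print(cleaned_tokens)     # ['But', 'I', 'do', 'nt', 'dance']
```

The last line reproduces the 'do', 'nt' pair seen at the end of the cleaned output above.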

<h1 id='Sentence'>3. Sentence segmentation</h1>

In [6]:
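sentences = []
# Split each line of the transcript into sentences; the file is closed on exit
with open("data/tbbt/s3/txt/tbbts03e01.txt", "r") as text:
    for s in text:
        sentences.append(sent_tokenize(s))

# Flatten the per-line lists into a single list of sentences
sentences = [y for x in sentences for y in x]
sentences[0:50]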

["I just want you both tknow, when I publish my findings, I won't forget your contributions.",
 '- Great.',
 '- Thanks.',
 "Of course, Ian't mention you in my Nobel acceptance speech, but when I get around to writing my memoirs, you can expect a very effusive footnote and perhaps a signed copy.",
 '- We have to tell him.',
 '- Tell me what?',
 'Damn his Vulcan hearing.',
 "You fellows are planning a party for me, aren't you?",
 'Okay, Sheldon, sit down.',
 'If there\'s going to be a them I should let you know that I don\'t care for luau, toga or " under the sea. "',
 "Yeah, we'll keep that in mind.",
 'Look... we neeto talk to you about something that happened at the North Pole.',
 "If this is about the night the heat went out, there's nothing to be embarrassed about.",
 "It's not about that.",
 'And we agreed never speak of it again.',
 'So we slept together naked.',
 'It was only to keep our core body temperatures from plummeting.',
 "He's spking about it.",
 'For me, it wasa bonding
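
As a rough illustration of why a trained sentence tokenizer is used here, the sketch below implements a naive rule: split after '.', '!' or '?' when followed by white space. This is not what `sent_tokenize()` does internally (it relies on the Punkt models); the function name `naive_sent_split` and the second test sentence are made up for this example.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when the mark is followed by white space."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

# Works on simple dialogue lines
print(naive_sent_split("Okay, Sheldon, sit down. It's not about that."))
# ['Okay, Sheldon, sit down.', "It's not about that."]

# Breaks on abbreviations, which a trained model is designed to handle
print(naive_sent_split("Dr. Cooper is here. Sit down."))
# ['Dr.', 'Cooper is here.', 'Sit down.']
```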

<h1 id='POS'>4. PoS tagging with NLTK</h1>

PoS tagging assigns each token its part of speech (noun, verb, adverb, and so on). NLTK's default tagger uses the Penn Treebank tag set:

| POS tags | meaning                                  | POS tags | meaning               | POS tags | meaning                               |
|:---------|:-----------------------------------------|:---------|:----------------------|:---------|:--------------------------------------|
| CC       | Coordinating conjunction                 | NNS      | Noun, plural          | UH       | Interjection                          |
| CD       | Cardinal number                          | NNP      | Proper noun, singular | VB       | Verb, base form                       |
| DT       | Determiner                               | NNPS     | Proper noun, plural   | VBD      | Verb, past tense                      |
| EX       | Existential there                        | PDT      | Predeterminer         | VBG      | Verb, gerund or present participle    |
| FW       | Foreign word                             | POS      | Possessive ending     | VBN      | Verb, past participle                 |
| IN       | Preposition or subordinating conjunction | PRP      | Personal pronoun      | VBP      | Verb, non-3rd person singular present |
| JJ       | Adjective                                | PRP$     | Possessive pronoun    | VBZ      | Verb, 3rd person singular present     |
| JJR      | Adjective, comparative                   | RB       | Adverb                | WDT      | Wh-determiner                         |
| JJS      | Adjective, superlative                   | RBR      | Adverb, comparative   | WP       | Wh-pronoun                            |
| LS       | List item marker                         | RBS      | Adverb, superlative   | WP\$      | Possessive wh-pronoun                 |
| MD       | Modal                                    | RP       | Particle              | WRB      | Wh-adverb                             |
| NN       | Noun, singular or mass                   | SYM      | Symbol                | TO       | to                                    |



In [7]:
nltk.pos_tag(tokenized_docs_no_punctuation[0][:50])

[('Goin', 'VB'),
 ('out', 'RP'),
 ('tonight', 'NN'),
 ('I', 'PRP'),
 ('m', 'VBP'),
 ('feelin', 'RB'),
 ('all', 'RB'),
 ('right', 'JJ'),
 ('Gon', 'NNP'),
 ('na', 'TO'),
 ('let', 'VB'),
 ('It', 'PRP'),
 ('all', 'DT'),
 ('hang', 'VBD'),
 ('out', 'RP'),
 ('Wan', 'NNP'),
 ('na', 'NNS'),
 ('make', 'VBP'),
 ('some', 'DT'),
 ('noise', 'NN'),
 ('Really', 'RB'),
 ('raise', 'VB'),
 ('my', 'PRP$'),
 ('voice', 'NN'),
 ('Yeah', 'NNP'),
 ('I', 'PRP'),
 ('wan', 'VBP'),
 ('na', 'JJ'),
 ('scream', 'NN'),
 ('and', 'CC'),
 ('shout', 'NN'),
 ('Unh', 'NNP'),
 ('No', 'NNP'),
 ('Morning', 'NNP'),
 ('Sheldon', 'NNP'),
 ('Come', 'NNP'),
 ('dance', 'NN'),
 ('with', 'IN'),
 ('me', 'PRP'),
 ('No', 'NNP'),
 ('Why', 'WRB'),
 ('not', 'RB'),
 ('Penny', 'NNP'),
 ('while', 'IN'),
 ('I', 'PRP'),
 ('subscribe', 'VBP'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('ManyWorlds', 'NNP'),
 ('theory', 'NN')]
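
The tagger returns a list of `(token, tag)` pairs, so standard Python tools are enough to summarize it. The sketch below counts tags and filters adverbs on the first eight pairs copied from the output above; the variable names are illustrative only.

```python
from collections import Counter

# First few (token, tag) pairs copied from the output above
tagged = [('Goin', 'VB'), ('out', 'RP'), ('tonight', 'NN'), ('I', 'PRP'),
          ('m', 'VBP'), ('feelin', 'RB'), ('all', 'RB'), ('right', 'JJ')]

# How often each tag occurs
tag_counts = Counter(tag for _, tag in tagged)

# Tokens whose tag belongs to the adverb family (RB, RBR, RBS)
adverbs = [token for token, tag in tagged if tag.startswith('RB')]

print(tag_counts['RB'])  # 2
print(adverbs)           # ['feelin', 'all']
```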

<h1 id='lemmatization'>5. Performing lemmatization</h1>

In [8]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/geekayman/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [9]:
lemmatizer = WordNetLemmatizer()
# lemmatize() treats every word as a noun unless a pos argument is given
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokenized_docs_no_punctuation[1]]
print(lemmatized_tokens[:100])

['Sheldon', 'you', 're', 'wrong', 'Wolverine', 'wa', 'not', 'born', 'with', 'bone', 'claw', 'Howard', 'you', 'know', 'me', 'to', 'be', 'a', 'very', 'smart', 'man', 'Do', 'nt', 'you', 'think', 'if', 'I', 'were', 'wrong', 'I', 'd', 'know', 'it', 'First', 'of', 'all', 'Give', 'it', 'up', 'You', 're', 'arguing', 'with', 'a', 'crazy', 'person', 'I', 'm', 'not', 'crazy', 'My', 'mother', 'had', 'me', 'tested', 'Hey', 'guy', 'What', 'are', 'you', 'doing', 'here', 'What', 'It', 's', 'newcomicbook', 'night', 'Since', 'you', 'and', 'Penny', 'hooked', 'up', 'we', 'thought', 'you', 'two', 'would', 'be', 'having', 'bouncy', 'naked', 'yumyum', 'night', 'There', 's', 'more', 'to', 'life', 'than', 'sex', 'Raj', 'Who', 'had', 'Leonard', 'flame', 'out', 'in', '24', 'hour']
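
The output above contains 'wa' for "was" because `lemmatize()` assumes a noun when no part of speech is supplied. One possible refinement, sketched below, is to feed the Penn Treebank tags from the previous section into the lemmatizer. The helpers `penn_to_wordnet` and `lemmatize_tagged` are hypothetical names introduced for this example; the single-letter codes are the values accepted by the `pos` argument of `WordNetLemmatizer.lemmatize`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the single-letter POS code used by WordNet."""
    if tag.startswith('JJ'):
        return 'a'  # adjective
    if tag.startswith('VB'):
        return 'v'  # verb
    if tag.startswith('RB'):
        return 'r'  # adverb
    return 'n'      # everything else falls back to noun, the lemmatizer's default

def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (token, tag) pairs with any object exposing lemmatize(word, pos=...)."""
    return [lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag))
            for token, tag in tagged_tokens]
```

With the objects defined in this notebook, a call could look like `lemmatize_tagged(nltk.pos_tag(tokenized_docs_no_punctuation[1]), lemmatizer)`. Tagged as a verb, "was" should then be reduced to "be" rather than "wa", provided the tagger labels it correctly.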


<h1 id='recognition'>6. Recognition of named entities and dependency parsing</h1>

In [10]:
import spacy
from spacy import displacy

# Install the spaCy model first if it is not available:
#! pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.0/en_core_web_md-2.2.0.tar.gz
nlp = spacy.load("en_core_web_md")
doc = nlp(sentences[20])

displacy.render(doc, style="dep")  # (1) dependency parse
displacy.render(doc, style="ent")  # (2) named entities

In [11]:
doc = nlp(sentences[10])
displacy.render(doc, style="dep") # (1)
displacy.render(doc, style="ent") # (2)

In [12]:
doc = nlp(sentences[47])
displacy.render(doc, style="ent") # (2)

In [13]:
doc = nlp(sentences[134])
displacy.render(doc, style="dep") # (1)
displacy.render(doc, style="ent") # (2)
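
The displaCy renderings are visual only. To work with the same information in code, the entities and dependency arcs can be read directly from the `Doc` object. The sketch below uses the standard spaCy attributes `doc.ents`, `ent.label_`, `token.dep_` and `token.head`; the helper names `entity_table` and `dependency_table` are hypothetical.

```python
def entity_table(doc):
    """Return (text, label) for each named entity in a spaCy Doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]

def dependency_table(doc):
    """Return (token, dependency relation, head token) for each token in a spaCy Doc."""
    return [(token.text, token.dep_, token.head.text) for token in doc]
```

For instance, `entity_table(nlp(sentences[47]))` would list the entities highlighted in the corresponding rendering, and `dependency_table(nlp(sentences[10]))` the arcs drawn in the dependency view.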

<h1 id='Constituency'>7. Constituency parsing</h1>

Example 1

![Constituency parsing, example 1](https://raw.githubusercontent.com/damounayman/Text-Mining-and-Chatbots/main/NLP%20Basics/Constituency%20Parsing.png)

Example 2

![Constituency parsing, example 2](https://raw.githubusercontent.com/damounayman/Text-Mining-and-Chatbots/main/NLP%20Basics/Constituency%20parsing%201.png)
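
A constituency tree like the ones pictured can also be produced in code. The sketch below is a minimal illustration with NLTK's `CFG` and `ChartParser`, assuming a hand-written toy grammar and a made-up four-word sentence; it is not how the images above were generated, and a real parser would use a grammar learned from a treebank.

```python
import nltk

# Hypothetical toy grammar, written only for this example
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'Sheldon' | Det N
VP -> V NP
Det -> 'the'
N -> 'theory'
V -> 'likes'
""")

tokens = ['Sheldon', 'likes', 'the', 'theory']

# The chart parser enumerates every tree the grammar allows for the tokens
parser = nltk.ChartParser(grammar)
trees = list(parser.parse(tokens))

for tree in trees:
    print(tree)
```

Each printed tree groups the words into nested phrases (noun phrase, verb phrase) under the sentence node `S`.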

<h1 id='reference'>9. Co-reference resolution</h1>

![Co-reference resolution example](https://raw.githubusercontent.com/damounayman/Text-Mining-and-Chatbots/main/NLP%20Basics/Coreference%20resolution.png)