# Text mining

Date: 22/01/2021

Names: Jiangnan Huang, You Zuo

We will test the following analyses:

- tokenization
- sentence segmentation
- part-of-speech tagging
- lemmatization
- named entity recognition
- constituency parsing
- dependency parsing

The first four analyses use the **nltk** package; the last three use **StanfordCoreNLP**.

In [1]:
import numpy as np
import os
import itertools

import nltk
nltk.download('wordnet')                     # lexical database used by WordNetLemmatizer
nltk.download('punkt')                       # models used by word_tokenize and sent_tokenize
nltk.download('averaged_perceptron_tagger')  # model used by nltk.pos_tag

from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer

import matplotlib.pyplot as plt
%matplotlib inline

[nltk_data] Downloading package wordnet to /home/jiangnan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/jiangnan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jiangnan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [2]:
# load dataset

!wget https://perso.limsi.fr/anne/tbbt.tar.gz
!tar -xzf 'tbbt.tar.gz'

--2021-01-24 23:56:58--  https://perso.limsi.fr/anne/tbbt.tar.gz
Resolving perso.limsi.fr (perso.limsi.fr)... 129.175.134.198
Connecting to perso.limsi.fr (perso.limsi.fr)|129.175.134.198|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 210807 (206K) [application/x-gzip]
Saving to: ‘tbbt.tar.gz.3’


2021-01-24 23:56:58 (5,82 MB/s) - ‘tbbt.tar.gz.3’ saved [210807/210807]



In [3]:
dir_ = 'tbbt/s3/txt/'
# os.listdir returns entries in arbitrary order; sort so that files[0] is always episode 1.
files = sorted(os.path.join(dir_, name) for name in os.listdir(dir_))
files

['tbbt/s3/txt/tbbts03e01.txt',
 'tbbt/s3/txt/tbbts03e02.txt',
 'tbbt/s3/txt/tbbts03e03.txt',
 'tbbt/s3/txt/tbbts03e04.txt',
 'tbbt/s3/txt/tbbts03e05.txt']

## Word segmentation

In [4]:
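# Tokenize the first episode line by line; the file is closed automatically.
with open(files[0], 'r') as txtfile:
    words = [word_tokenize(line) for line in txtfile]

# Flatten the per-line token lists into a single list of tokens.
words = list(itertools.chain.from_iterable(words))
words[:30]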

['I',
 'just',
 'want',
 'you',
 'both',
 'tknow',
 ',',
 'when',
 'I',
 'publish',
 'my',
 'findings',
 ',',
 'I',
 'wo',
 "n't",
 'forget',
 'your',
 'contributions',
 '.',
 '-',
 'Great',
 '.',
 '-',
 'Thanks',
 '.',
 'Of',
 'course',
 ',',
 'Ia']

## Sentence segmentation

In [5]:
# Split each line of the first episode into sentences (one inner list per line).
with open(files[0], 'r') as txtfile:
    sentences = [sent_tokenize(line) for line in txtfile]

sentences[:10]

[["I just want you both tknow, when I publish my findings, I won't forget your contributions."],
 ['- Great.'],
 ['- Thanks.'],
 ["Of course, Ian't mention you in my Nobel acceptance speech, but when I get around to writing my memoirs, you can expect a very effusive footnote and perhaps a signed copy."],
 ['- We have to tell him.'],
 ['- Tell me what?'],
 ['Damn his Vulcan hearing.'],
 ["You fellows are planning a party for me, aren't you?"],
 ['Okay, Sheldon, sit down.'],
 ['If there\'s going to be a them I should let you know that I don\'t care for luau, toga or " under the sea. "']]
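
Because `sent_tokenize` is applied line by line, `sentences` is a list of lists rather than a flat list of sentences. The sketch below shows one way to flatten it, using only the first three entries printed above; `sentences_sample` and `flat_sentences` are names introduced here for illustration, and in the notebook `sentences` itself would be passed in.

```python
from itertools import chain

# First three entries of `sentences` as printed above: one inner list per line.
sentences_sample = [
    ["I just want you both tknow, when I publish my findings, I won't forget your contributions."],
    ['- Great.'],
    ['- Thanks.'],
]

# Flatten the per-line lists into a single list of sentence strings.
flat_sentences = list(chain.from_iterable(sentences_sample))
print(len(flat_sentences))
```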

## PoS tagging

In [6]:
pos = nltk.pos_tag(words)
pos[:10]

[('I', 'PRP'),
 ('just', 'RB'),
 ('want', 'VBP'),
 ('you', 'PRP'),
 ('both', 'DT'),
 ('tknow', 'VBP'),
 (',', ','),
 ('when', 'WRB'),
 ('I', 'PRP'),
 ('publish', 'VBP')]
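
A quick way to inspect the tagger's output is to count how often each tag occurs. The sketch below uses only the ten `(token, tag)` pairs printed above; `pos_sample` and `tag_counts` are illustrative names, and in the notebook the full `pos` list would be used instead.

```python
from collections import Counter

# The ten (token, tag) pairs printed above; in the notebook, use `pos` itself.
pos_sample = [
    ('I', 'PRP'), ('just', 'RB'), ('want', 'VBP'), ('you', 'PRP'),
    ('both', 'DT'), ('tknow', 'VBP'), (',', ','), ('when', 'WRB'),
    ('I', 'PRP'), ('publish', 'VBP'),
]

# Count how often each Penn Treebank tag occurs in the sample.
tag_counts = Counter(tag for _, tag in pos_sample)
print(tag_counts.most_common(3))
```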

## Lemmatization

In [7]:
# Lemmatize the first 30 tokens with the WordNet lemmatizer.
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word) for word in words[:30]]
lemmatized

['I',
 'just',
 'want',
 'you',
 'both',
 'tknow',
 ',',
 'when',
 'I',
 'publish',
 'my',
 'finding',
 ',',
 'I',
 'wo',
 "n't",
 'forget',
 'your',
 'contribution',
 '.',
 '-',
 'Great',
 '.',
 '-',
 'Thanks',
 '.',
 'Of',
 'course',
 ',',
 'Ia']
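
In the output above only the plural nouns change (`findings` becomes `finding`, `contributions` becomes `contribution`). This is because `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given. One possible refinement is to reuse the Penn Treebank tags from the PoS step. The sketch below is an assumption about how this could be wired up, not part of the original experiment: `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helper names, and the lemmatizer is passed in as a callable so the block itself does not need the WordNet data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun by default)."""
    if tag.startswith('J'):
        return 'a'  # adjective (JJ, JJR, JJS)
    if tag.startswith('V'):
        return 'v'  # verb (VB, VBD, VBP, ...)
    if tag.startswith('RB'):
        return 'r'  # adverb (RB, RBR, RBS)
    return 'n'      # noun, and fallback for every other tag


def lemmatize_tagged(tagged_words, lemmatize):
    """Call lemmatize(word, pos) for each (word, Penn tag) pair."""
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_words]
```

With the objects defined in this notebook, the call would be `lemmatize_tagged(pos[:30], lemmatizer.lemmatize)`.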

## Named entity recognition

In [8]:
# First install the `stanfordcorenlp` Python wrapper and download Stanford CoreNLP.
from stanfordcorenlp import StanfordCoreNLP

# Path to the unpacked CoreNLP directory.
nlp = StanfordCoreNLP('stanford-corenlp-full-2018-02-27')

In [9]:
sentence = 'Paris-Saclay University is the biggest university in France, which is located in the south of Paris.'

In [10]:
print('Named Entities:', nlp.ner(sentence))

Named Entities: [('Paris-Saclay', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('is', 'O'), ('the', 'O'), ('biggest', 'O'), ('university', 'O'), ('in', 'O'), ('France', 'COUNTRY'), (',', 'O'), ('which', 'O'), ('is', 'O'), ('located', 'O'), ('in', 'O'), ('the', 'O'), ('south', 'O'), ('of', 'O'), ('Paris', 'CITY'), ('.', 'O')]
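
The output labels each token separately, so `Paris-Saclay` and `University` appear as two `ORGANIZATION` tokens. A minimal sketch of merging adjacent tokens with the same label into entity spans follows. `group_entities` is a hypothetical helper, not part of the wrapper, and it assumes that neighbouring tokens with an identical label belong to one entity, which would wrongly merge two distinct entities of the same type that happen to be adjacent.

```python
from itertools import groupby

# Token-level output copied from the cell above.
ner_tags = [('Paris-Saclay', 'ORGANIZATION'), ('University', 'ORGANIZATION'),
            ('is', 'O'), ('the', 'O'), ('biggest', 'O'), ('university', 'O'),
            ('in', 'O'), ('France', 'COUNTRY'), (',', 'O'), ('which', 'O'),
            ('is', 'O'), ('located', 'O'), ('in', 'O'), ('the', 'O'),
            ('south', 'O'), ('of', 'O'), ('Paris', 'CITY'), ('.', 'O')]


def group_entities(tagged):
    """Merge runs of adjacent tokens that share the same non-'O' label."""
    entities = []
    for label, run in groupby(tagged, key=lambda pair: pair[1]):
        if label != 'O':
            entities.append((' '.join(token for token, _ in run), label))
    return entities


print(group_entities(ner_tags))
# [('Paris-Saclay University', 'ORGANIZATION'), ('France', 'COUNTRY'), ('Paris', 'CITY')]
```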


## Constituency parsing

In [11]:
print('Constituency Parsing:', nlp.parse(sentence))

Constituency Parsing: (ROOT
  (S
    (NP (NNP Paris-Saclay) (NNP University))
    (VP (VBZ is)
      (NP
        (NP (DT the) (JJS biggest) (NN university))
        (PP (IN in)
          (NP
            (NP (NNP France))
            (, ,)
            (SBAR
              (WHNP (WDT which))
              (S
                (VP (VBZ is)
                  (ADJP (JJ located)
                    (PP (IN in)
                      (NP
                        (NP (DT the) (NN south))
                        (PP (IN of)
                          (NP (NNP Paris)))))))))))))
    (. .)))
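
The parse is returned as a bracketed string. As a small illustration of how to read it, the sketch below pulls out the innermost brackets, each of which pairs a PoS tag with a word. The regular expression and the names `parse`, `pairs` and `tokens` are choices made here for illustration; it assumes that no token contains whitespace or parentheses.

```python
import re

# Bracketed parse copied from the output above (indentation is not significant).
parse = """(ROOT
  (S
    (NP (NNP Paris-Saclay) (NNP University))
    (VP (VBZ is)
      (NP
        (NP (DT the) (JJS biggest) (NN university))
        (PP (IN in)
          (NP
            (NP (NNP France))
            (, ,)
            (SBAR
              (WHNP (WDT which))
              (S
                (VP (VBZ is)
                  (ADJP (JJ located)
                    (PP (IN in)
                      (NP
                        (NP (DT the) (NN south))
                        (PP (IN of)
                          (NP (NNP Paris)))))))))))))
    (. .)))"""

# A preterminal is a bracket holding exactly one tag and one word, e.g. "(NNP Paris)".
pairs = re.findall(r'\(([^\s()]+) ([^\s()]+)\)', parse)
tokens = [word for _, word in pairs]
print(pairs[:3])
print(len(tokens))
```

For access to the full tree structure rather than only the leaves, `nltk.Tree.fromstring` reads the same bracketed format.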


## Dependency parsing

In [12]:
print('Dependency Parsing:', nlp.dependency_parse(sentence))

Dependency Parsing: [('ROOT', 0, 6), ('compound', 2, 1), ('nsubj', 6, 2), ('cop', 6, 3), ('det', 6, 4), ('amod', 6, 5), ('case', 8, 7), ('nmod', 6, 8), ('punct', 8, 9), ('nsubj', 12, 10), ('cop', 12, 11), ('acl:relcl', 8, 12), ('case', 15, 13), ('det', 15, 14), ('nmod', 12, 15), ('case', 17, 16), ('nmod', 15, 17), ('punct', 6, 18)]
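
The triples are easier to read once the indices are replaced by words. The sketch below assumes each triple is `(relation, head index, dependent index)` with 1-based token positions and `0` standing for the root. That reading is inferred from the output itself (for example, `('compound', 2, 1)` links `University` to `Paris-Saclay`). The token list is taken from the NER output, and `word_at` and `readable` are names introduced here.

```python
# Tokens of the example sentence, in order (same tokenization as the NER output).
tokens = ['Paris-Saclay', 'University', 'is', 'the', 'biggest', 'university',
          'in', 'France', ',', 'which', 'is', 'located', 'in', 'the', 'south',
          'of', 'Paris', '.']

# Triples copied from the output above.
deps = [('ROOT', 0, 6), ('compound', 2, 1), ('nsubj', 6, 2), ('cop', 6, 3),
        ('det', 6, 4), ('amod', 6, 5), ('case', 8, 7), ('nmod', 6, 8),
        ('punct', 8, 9), ('nsubj', 12, 10), ('cop', 12, 11),
        ('acl:relcl', 8, 12), ('case', 15, 13), ('det', 15, 14),
        ('nmod', 12, 15), ('case', 17, 16), ('nmod', 15, 17), ('punct', 6, 18)]


def word_at(index):
    """Return the token at a 1-based index; 0 stands for the artificial ROOT."""
    return 'ROOT' if index == 0 else tokens[index - 1]


# Replace head and dependent indices with the corresponding words.
readable = [(rel, word_at(head), word_at(dep)) for rel, head, dep in deps]
for rel, head, dep in readable:
    print(f'{rel:10} {head} -> {dep}')
```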


In [19]:
nlp.close()