## Tokenization

In [6]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /home/edureka/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [7]:
gold = """Gold is a chemical element with symbol Au (from Latin: aurum) and atomic number 79, making it one of the higher atomic number elements that occur naturally. In its purest form, it is a bright, slightly reddish yellow, dense, soft, malleable, and ductile metal. Chemically, gold is a transition metal and a group 11 element. It is one of the least reactive chemical elements and is solid under standard conditions. Gold often occurs in free elemental (native) form, as nuggets or grains, in rocks, in veins, and in alluvial deposits. It occurs in a solid solution series with the native element silver (as electrum) and also naturally alloyed with copper and palladium. Less commonly, it occurs in minerals as gold compounds, often with tellurium (gold tellurides).

Gold is resistant to most acids, though it does dissolve in aqua regia, a mixture of nitric acid and hydrochloric acid, which forms a soluble tetrachloroaurate anion. Gold is insoluble in nitric acid, which dissolves silver and base metals, a property that has long been used to refine gold and to confirm the presence of gold in metallic objects, giving rise to the term acid test. Gold also dissolves in alkaline solutions of cyanide, which are used in mining and electroplating. Gold dissolves in mercury, forming amalgam alloys, but this is not a chemical reaction.

A relatively rare element,[5][6] gold is a precious metal that has been used for coinage, jewelry, and other arts throughout recorded history. In the past, a gold standard was often implemented as a monetary policy, but gold coins ceased to be minted as a circulating currency in the 1930s, and the world gold standard was abandoned for a fiat currency system after 1971.

A total of 186,700 tonnes of gold exists above ground, as of 2015.[7] The world consumption of new gold produced is about 50% in jewelry, 40% in investments, and 10% in industry.[8] Gold's high malleability, ductility, resistance to corrosion and most other chemical reactions, and conductivity of electricity have led to its continued use in corrosion resistant electrical connectors in all types of computerized devices (its chief industrial use). Gold is also used in infrared shielding, colored-glass production, gold leafing, and tooth restoration. Certain gold salts are still used as anti-inflammatories in medicine. As of 2016, the world's largest gold producer by far was China with 450 tonnes per year"""

In [8]:
type(gold)

str

In [9]:
gold_word_tokenize = word_tokenize(gold)

In [10]:
gold_word_tokenize

['Gold',
 'is',
 'a',
 'chemical',
 'element',
 'with',
 'symbol',
 'Au',
 '(',
 'from',
 'Latin',
 ':',
 'aurum',
 ')',
 'and',
 'atomic',
 'number',
 '79',
 ',',
 'making',
 'it',
 'one',
 'of',
 'the',
 'higher',
 'atomic',
 'number',
 'elements',
 'that',
 'occur',
 'naturally',
 '.',
 'In',
 'its',
 'purest',
 'form',
 ',',
 'it',
 'is',
 'a',
 'bright',
 ',',
 'slightly',
 'reddish',
 'yellow',
 ',',
 'dense',
 ',',
 'soft',
 ',',
 'malleable',
 ',',
 'and',
 'ductile',
 'metal',
 '.',
 'Chemically',
 ',',
 'gold',
 'is',
 'a',
 'transition',
 'metal',
 'and',
 'a',
 'group',
 '11',
 'element',
 '.',
 'It',
 'is',
 'one',
 'of',
 'the',
 'least',
 'reactive',
 'chemical',
 'elements',
 'and',
 'is',
 'solid',
 'under',
 'standard',
 'conditions',
 '.',
 'Gold',
 'often',
 'occurs',
 'in',
 'free',
 'elemental',
 '(',
 'native',
 ')',
 'form',
 ',',
 'as',
 'nuggets',
 'or',
 'grains',
 ',',
 'in',
 'rocks',
 ',',
 'in',
 'veins',
 ',',
 'and',
 'in',
 'alluvial',
 'deposits',
 '.

In [11]:
len(gold_word_tokenize)

479

In [12]:
from nltk.probability import FreqDist

fdist = FreqDist()

In [13]:
for word in gold_word_tokenize:
    fdist[word.lower()]+=1

In [14]:
fdist

FreqDist({',': 39, 'gold': 22, 'in': 22, '.': 18, 'and': 17, 'a': 16, 'of': 12, 'is': 11, 'the': 10, 'as': 8, ...})

In [15]:
fdist['gold']

22

In [16]:
len(fdist)

224

In [17]:
fdist_top10=fdist.most_common(10)

In [18]:
fdist_top10

[(',', 39),
 ('gold', 22),
 ('in', 22),
 ('.', 18),
 ('and', 17),
 ('a', 16),
 ('of', 12),
 ('is', 11),
 ('the', 10),
 ('as', 8)]
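
The lowercase-then-count loop above can be written more compactly. NLTK's `FreqDist` is a subclass of `collections.Counter`, so the sketch below uses the standard-library `Counter` to show the same pattern without needing any NLTK data. The `sample_tokens` list is a small illustrative stand-in for `gold_word_tokenize`, not data from the notebook.

```python
from collections import Counter

# Illustrative tokens; in the notebook this would be gold_word_tokenize
sample_tokens = ["Gold", "is", "a", "metal", ".", "gold", "is", "soft", "."]

# Lowercase each token while counting, in a single pass
counts = Counter(token.lower() for token in sample_tokens)

# most_common(n) returns (token, count) pairs, highest counts first
top3 = counts.most_common(3)
print(top3)
```

Passing the same generator expression to `FreqDist(...)` should behave the same way, since `FreqDist` inherits the `Counter` constructor.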

### RegEx Tokenizer

In [19]:
from nltk.tokenize import regexp_tokenize, blankline_tokenize

In [20]:
regexp_tokenize(gold, pattern=r'\d+')

['79',
 '11',
 '5',
 '6',
 '1930',
 '1971',
 '186',
 '700',
 '2015',
 '7',
 '50',
 '40',
 '10',
 '8',
 '2016',
 '450']
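
Note that `\d+` splits `186,700` into `'186'` and `'700'` because the comma ends the match. The sketch below, using the standard-library `re` module on one sentence from the text, shows one possible pattern that keeps thousands separators together. The pattern is an assumption for illustration; it has not been checked against `regexp_tokenize` here.

```python
import re

sample = "A total of 186,700 tonnes of gold exists above ground, as of 2015."

# \d+ stops at the comma, so the number is split in two
plain = re.findall(r"\d+", sample)

# Allow repeated ",ddd" groups after the leading digits so 186,700 stays whole
grouped = re.findall(r"\d+(?:,\d{3})*", sample)

print(plain)
print(grouped)
```

The group is non-capturing (`(?:...)`) so that `re.findall` returns the whole match rather than only the group contents.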

### Blankline Tokenizer

In [21]:
gold_bl_tokenize=blankline_tokenize(gold)

In [22]:
len(gold_bl_tokenize)

4

In [23]:
gold_bl_tokenize[0]

'Gold is a chemical element with symbol Au (from Latin: aurum) and atomic number 79, making it one of the higher atomic number elements that occur naturally. In its purest form, it is a bright, slightly reddish yellow, dense, soft, malleable, and ductile metal. Chemically, gold is a transition metal and a group 11 element. It is one of the least reactive chemical elements and is solid under standard conditions. Gold often occurs in free elemental (native) form, as nuggets or grains, in rocks, in veins, and in alluvial deposits. It occurs in a solid solution series with the native element silver (as electrum) and also naturally alloyed with copper and palladium. Less commonly, it occurs in minerals as gold compounds, often with tellurium (gold tellurides).'
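
To make the behaviour concrete, here is a standard-library sketch of splitting text on blank lines. The regular expression is an assumed approximation of what `blankline_tokenize` does, not NLTK's own code, and `sample_text` is a made-up example.

```python
import re

sample_text = (
    "First paragraph, line one.\n"
    "Still the first paragraph.\n"
    "\n"
    "Second paragraph.\n"
    "   \n"
    "Third paragraph."
)

# A separator is a newline, optional whitespace, and another newline;
# single newlines inside a paragraph are left alone
paragraphs = [p for p in re.split(r"\s*\n\s*\n\s*", sample_text) if p]

print(len(paragraphs))
```

A line containing only spaces still counts as blank here, which is why the second and third paragraphs are separated.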

### Sentence Tokenizer

In [24]:
from nltk.tokenize import sent_tokenize

In [25]:
gold_sent_tokenize=sent_tokenize(gold)

In [26]:
gold_sent_tokenize

['Gold is a chemical element with symbol Au (from Latin: aurum) and atomic number 79, making it one of the higher atomic number elements that occur naturally.',
 'In its purest form, it is a bright, slightly reddish yellow, dense, soft, malleable, and ductile metal.',
 'Chemically, gold is a transition metal and a group 11 element.',
 'It is one of the least reactive chemical elements and is solid under standard conditions.',
 'Gold often occurs in free elemental (native) form, as nuggets or grains, in rocks, in veins, and in alluvial deposits.',
 'It occurs in a solid solution series with the native element silver (as electrum) and also naturally alloyed with copper and palladium.',
 'Less commonly, it occurs in minerals as gold compounds, often with tellurium (gold tellurides).',
 'Gold is resistant to most acids, though it does dissolve in aqua regia, a mixture of nitric acid and hydrochloric acid, which forms a soluble tetrachloroaurate anion.',
 'Gold is insoluble in nitric acid, 

## N Grams

In [27]:
from nltk.util import bigrams, trigrams, ngrams

In [28]:
string = "The Mona Lisa is a half length portrait painting by the Italian Renaissance artist Leonardo da Vinci"

In [29]:
mona_lisa_tokens=nltk.word_tokenize(string)

In [30]:
mona_lisa_tokens

['The',
 'Mona',
 'Lisa',
 'is',
 'a',
 'half',
 'length',
 'portrait',
 'painting',
 'by',
 'the',
 'Italian',
 'Renaissance',
 'artist',
 'Leonardo',
 'da',
 'Vinci']

In [31]:
mona_lisa_bigrams=list(nltk.bigrams(mona_lisa_tokens))

In [32]:
mona_lisa_bigrams

[('The', 'Mona'),
 ('Mona', 'Lisa'),
 ('Lisa', 'is'),
 ('is', 'a'),
 ('a', 'half'),
 ('half', 'length'),
 ('length', 'portrait'),
 ('portrait', 'painting'),
 ('painting', 'by'),
 ('by', 'the'),
 ('the', 'Italian'),
 ('Italian', 'Renaissance'),
 ('Renaissance', 'artist'),
 ('artist', 'Leonardo'),
 ('Leonardo', 'da'),
 ('da', 'Vinci')]

In [33]:
mona_lisa_trigrams=list(nltk.trigrams(mona_lisa_tokens))

In [34]:
mona_lisa_trigrams

[('The', 'Mona', 'Lisa'),
 ('Mona', 'Lisa', 'is'),
 ('Lisa', 'is', 'a'),
 ('is', 'a', 'half'),
 ('a', 'half', 'length'),
 ('half', 'length', 'portrait'),
 ('length', 'portrait', 'painting'),
 ('portrait', 'painting', 'by'),
 ('painting', 'by', 'the'),
 ('by', 'the', 'Italian'),
 ('the', 'Italian', 'Renaissance'),
 ('Italian', 'Renaissance', 'artist'),
 ('Renaissance', 'artist', 'Leonardo'),
 ('artist', 'Leonardo', 'da'),
 ('Leonardo', 'da', 'Vinci')]

In [35]:
mona_lisa_ngrams=list(nltk.ngrams(mona_lisa_tokens, 4))
mona_lisa_ngrams

[('The', 'Mona', 'Lisa', 'is'),
 ('Mona', 'Lisa', 'is', 'a'),
 ('Lisa', 'is', 'a', 'half'),
 ('is', 'a', 'half', 'length'),
 ('a', 'half', 'length', 'portrait'),
 ('half', 'length', 'portrait', 'painting'),
 ('length', 'portrait', 'painting', 'by'),
 ('portrait', 'painting', 'by', 'the'),
 ('painting', 'by', 'the', 'Italian'),
 ('by', 'the', 'Italian', 'Renaissance'),
 ('the', 'Italian', 'Renaissance', 'artist'),
 ('Italian', 'Renaissance', 'artist', 'Leonardo'),
 ('Renaissance', 'artist', 'Leonardo', 'da'),
 ('artist', 'Leonardo', 'da', 'Vinci')]
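
An n-gram list is just every sliding window of length n over the tokens, so a sequence of k tokens yields k - n + 1 n-grams (17 tokens above give 16 bigrams, 15 trigrams and 14 four-grams). The sketch below reproduces that with plain Python; `make_ngrams` is a hypothetical helper written for illustration, not an NLTK function.

```python
def make_ngrams(tokens, n):
    """Return all length-n sliding windows over tokens as tuples."""
    # zip stops at the shortest shifted copy, which drops incomplete windows
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["The", "Mona", "Lisa", "is", "a", "half", "length", "portrait"]

bigram_list = make_ngrams(tokens, 2)
fourgram_list = make_ngrams(tokens, 4)

print(len(bigram_list), len(fourgram_list))
```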

## Stemming

In [36]:
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

In [37]:
pst=PorterStemmer()

In [38]:
pst.stem("having")

'have'

In [39]:
words_to_stem=["give","giving","given","gave"]
for words in words_to_stem:
    print(words+ ":" +pst.stem(words))

give:give
giving:give
given:given
gave:gave


In [40]:
words_to_stem=["like","liking","liked","likes"]
for words in words_to_stem:
    print(words+ ":" +pst.stem(words))

like:like
liking:like
liked:like
likes:like


In [41]:
words_to_stem=["retrieval","retrieved","retrieves"]
for words in words_to_stem:
    print(words+ ":" +pst.stem(words))

retrieval:retriev
retrieved:retriev
retrieves:retriev


In [42]:
words_to_stem = ["program", "programs", "programer", "programing", "programers"] 
for words in words_to_stem:
    print(words+ ":" +pst.stem(words))

program:program
programs:program
programer:program
programing:program
programers:program


In [43]:
lst=LancasterStemmer()

In [44]:
for words in words_to_stem:
    print(words+ ":" +lst.stem(words))

program:program
programs:program
programer:program
programing:program
programers:program


In [45]:
sbst=SnowballStemmer('english')

In [46]:
sbst.languages

('arabic',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'hungarian',
 'italian',
 'norwegian',
 'porter',
 'portuguese',
 'romanian',
 'russian',
 'spanish',
 'swedish')

In [47]:
sbst.stem('having')

'have'

In [48]:
for words in words_to_stem:
    print(words+ ":" +sbst.stem(words))

program:program
programs:program
programer:program
programing:program
programers:program


In [49]:
def stemms(word):
    print("Porter:"+pst.stem(word))
    print("Lancaster:"+lst.stem(word))
    print("Snowball:"+sbst.stem(word))
    return

In [50]:
stemms('fishing')

Porter:fish
Lancaster:fish
Snowball:fish


In [51]:
stemms('corpus')

Porter:corpu
Lancaster:corp
Snowball:corpus


In [52]:
stemms('curricula')

Porter:curricula
Lancaster:curricul
Snowball:curricula


In [53]:
stemms('curriculum')

Porter:curriculum
Lancaster:curricul
Snowball:curriculum


## Lemmatization

In [58]:
nltk.download('wordnet')
from nltk.stem import wordnet
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package wordnet to /home/edureka/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [59]:
word_lem=WordNetLemmatizer()

In [60]:
word_lem.lemmatize('curricula')

'curriculum'

In [61]:
# lemmatize() treats each word as a noun by default (pos='n')
for words in words_to_stem:
    print(words+ ":" +word_lem.lemmatize(words))

program:program
programs:program
programer:programer
programing:programing
programers:programers


### Stanford NLP

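The models are a download of about 735 MB, so a good connection is needed. The process will be killed in the cloud lab; run this section on a local machine only. Note that the `stanfordnlp` package has since been superseded by `stanza`.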

In [66]:
# !pip install stanfordnlp

In [65]:
import stanfordnlp

MODELS_DIR = '/home/edureka/2 Extracting, Cleaning and Preprocessing Text'
stanfordnlp.download('en', MODELS_DIR) # Download the English models

In [39]:
nlp = stanfordnlp.Pipeline(processors='tokenize,pos', models_dir=MODELS_DIR, treebank='en_ewt', use_gpu=False, pos_batch_size=3000) # Build the pipeline, specify part-of-speech processor's batch size
doc = nlp("Barack Obama was born in Hawaii.") # Run the pipeline on input text
doc.sentences[0].print_tokens() # Look at the result

Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': 'F:\\All_Prep\\NLP Training\\Session\\2 Extracting cleaning and processing text\\en_ewt_models\\en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: pos
With settings: 
{'model_path': 'F:\\All_Prep\\NLP Training\\Session\\2 Extracting cleaning and processing text\\en_ewt_models\\en_ewt_tagger.pt', 'pretrain_path': 'F:\\All_Prep\\NLP Training\\Session\\2 Extracting cleaning and processing text\\en_ewt_models\\en_ewt.pretrain.pt', 'batch_size': 3000, 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Done loading processors!
---
<Token index=1;words=[<Word index=1;text=Barack;upos=PROPN;xpos=NNP;feats=Number=Sing>]>
<Token index=2;words=[<Word index=2;text=Obama;upos=PROPN;xpos=NNP;feats=Number=Sing>]>
<Token index=3;words=[<Word index=3;text=was;upos=AUX;xpos=VBD;feats=Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin>]>
<Token index=4;words=[<Word index=4;text=born;upos=V

In [32]:
# import stanfordnlp

nlp = stanfordnlp.Pipeline(processors='tokenize,mwt,pos,lemma',use_gpu=False,models_dir=MODELS_DIR)

# doc = nlp("Barack Obama was born in Hawaii.")
doc = nlp("A cat was seen by John in NY.")

print(*[f'word: {word.text+" "}\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')

word: A 	lemma: a
word: cat 	lemma: cat
word: was 	lemma: be
word: seen 	lemma: see
word: by 	lemma: by
word: John 	lemma: John
word: in 	lemma: in
word: NY 	lemma: NY
word: . 	lemma: .


In [None]:
# Other libraries to explore: spaCy, Gensim, CoreNLP (Java-based)

## Stop words

In [69]:
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/edureka/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [71]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [72]:
len(stopwords.words('english'))

179

In [73]:
fdist_top10

[(',', 39),
 ('gold', 22),
 ('in', 22),
 ('.', 18),
 ('and', 17),
 ('a', 16),
 ('of', 12),
 ('is', 11),
 ('the', 10),
 ('as', 8)]

In [74]:
len(gold_word_tokenize)

479

In [78]:
import re
# Strips punctuation and digits; '[' and ']' are not in this class, so citation brackets survive
punctuation=re.compile(r'[-.?!,:;()|0-9]')

In [79]:
post_punctuation=[]

In [80]:
for words in gold_word_tokenize:
    word=punctuation.sub("",words)
    if len(word)>0:
        post_punctuation.append(word)

In [81]:
len(post_punctuation) #list after removing punctuation

397

In [82]:
post_stop_words=[]

In [83]:
stp_words=stopwords.words('english')

In [84]:
for words in post_punctuation:
    words=words.lower()
    if words not in stp_words:
        post_stop_words.append(words)

In [85]:
len(post_stop_words)

243

In [86]:
fdist2=FreqDist()

In [87]:
for word in post_stop_words:
    fdist2[word]+=1

In [88]:
len(fdist2)

170

In [89]:
fdist2.most_common(10)

[('gold', 22),
 ('used', 5),
 ('chemical', 4),
 ('element', 4),
 ('acid', 4),
 ('[', 4),
 (']', 4),
 ('metal', 3),
 ('standard', 3),
 ('often', 3)]

In [90]:
fdist.most_common(10)

[(',', 39),
 ('gold', 22),
 ('in', 22),
 ('.', 18),
 ('and', 17),
 ('a', 16),
 ('of', 12),
 ('is', 11),
 ('the', 10),
 ('as', 8)]
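
The cleaned top 10 still contains `'['` and `']'` because the punctuation pattern does not include square brackets. The sketch below runs the same strip, lowercase and filter steps on a handful of tokens with the brackets added to the character class. It uses only the standard library; `mini_stop_words` is a hypothetical stand-in for `stopwords.words('english')`, and `sample_tokens` is illustrative.

```python
import re
from collections import Counter

# Hypothetical mini stop list; the notebook uses stopwords.words('english')
mini_stop_words = {"a", "is", "the", "of", "as", "in"}

# Same character class as the notebook's pattern, extended with '[' and ']'
punct_or_digit = re.compile(r"[-.?!,:;()|0-9\[\]]")

sample_tokens = ["A", "relatively", "rare", "element", ",", "[", "5", "]",
                 "gold", "is", "a", "precious", "metal", "."]

cleaned = []
for token in sample_tokens:
    # Remove punctuation and digits, then normalise case
    stripped = punct_or_digit.sub("", token).lower()
    # Keep only non-empty tokens that are not stop words
    if stripped and stripped not in mini_stop_words:
        cleaned.append(stripped)

freq = Counter(cleaned)
print(freq.most_common(3))
```

Applying the extended pattern to the full text would change the counts reported above, so those figures should be re-run rather than assumed.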

## POS

In [91]:
sent = "Mary is driving a big car."
sent_tokens = word_tokenize(sent)

In [94]:
nltk.download('averaged_perceptron_tagger')
# Each token is tagged on its own here, so the tagger sees no sentence context
for token in sent_tokens:
    print(nltk.pos_tag([token]))

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/edureka/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[('Mary', 'NNP')]
[('is', 'VBZ')]
[('driving', 'VBG')]
[('a', 'DT')]
[('big', 'JJ')]
[('car', 'NN')]
[('.', '.')]


In [95]:
sent2 = "John is eating a delicious cake."
sent2_tokens = word_tokenize(sent2)
for token in sent2_tokens:
    print(nltk.pos_tag([token]))

[('John', 'NNP')]
[('is', 'VBZ')]
[('eating', 'VBG')]
[('a', 'DT')]
[('delicious', 'JJ')]
[('cake', 'NN')]
[('.', '.')]


In [96]:
sent3 = "Jim eats a banana"
sent3_tokens = word_tokenize(sent3)
for token in sent3_tokens:
    print(nltk.pos_tag([token]))

[('Jim', 'NNP')]
[('eats', 'NNS')]
[('a', 'DT')]
[('banana', 'NN')]


In [97]:
from nltk.tokenize import RegexpTokenizer
reg_tokenizer = RegexpTokenizer(r'(?u)\W+|\$[\d\.]+|\S+')
regex_tokenize = reg_tokenizer.tokenize(sent3)
regex_tokenize

['Jim', ' ', 'eats', ' ', 'a', ' ', 'banana']

In [98]:
regex_tag = nltk.pos_tag(regex_tokenize)
regex_tag

[('Jim', 'NNP'),
 (' ', 'NNP'),
 ('eats', 'VBZ'),
 (' ', 'VBP'),
 ('a', 'DT'),
 (' ', 'NN'),
 ('banana', 'NN')]
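
The whitespace tokens come from the pattern itself: its first alternative, `\W+`, matches runs of non-word characters, including spaces. The sketch below uses `re.findall` from the standard library to compare that pattern with an assumed alternative that starts with `\w+` instead. The example sentence and the alternative pattern are illustrative, not from the notebook.

```python
import re

sentence = "Jim eats a banana for $3.50"

# Pattern from the cell above: \W+ comes first, so whitespace runs become tokens
with_spaces = re.findall(r"\W+|\$[\d\.]+|\S+", sentence)

# Assumed alternative: \w+ matches words, so whitespace is never captured
words_only = re.findall(r"\w+|\$[\d\.]+|\S+", sentence)

print(with_spaces)
print(words_only)
```

With the alternative pattern the `\$[\d\.]+` branch also gets a chance to match, so a price such as `$3.50` stays in one token.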

## NER

In [102]:
# nltk.download('maxent_ne_chunker')
# nltk.download('words')
from nltk import ne_chunk

ne_sent = "The US President stays in the White House."

ne_tokens = word_tokenize(ne_sent)
ne_tags = nltk.pos_tag(ne_tokens)
ne_ner = ne_chunk(ne_tags)
print(ne_ner)

(S
  The/DT
  (ORGANIZATION US/NNP)
  President/NNP
  stays/VBZ
  in/IN
  the/DT
  (FACILITY White/NNP House/NNP)
  ./.)


In [103]:
ne_sent2 = "Apple is a fruit and Apple is a company's name."

print(ne_chunk(nltk.pos_tag(word_tokenize(ne_sent2))))


(S
  (GPE Apple/NNP)
  is/VBZ
  a/DT
  fruit/NN
  and/CC
  (PERSON Apple/NNP)
  is/VBZ
  a/DT
  company/NN
  's/POS
  name/NN
  ./.)


### spaCy - POS

In [107]:
# !pip install -U spacy
# !python -m spacy download en_core_web_sm


In [108]:

import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("Mary is driving a big car.")
# doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

print("Text\tLemma\tPOS\tTag\tDep\tShape\tisAlpha\tisStop")
for token in doc:
    print("{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}".format(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, \
                                                  token.shape_, token.is_alpha, token.is_stop))

Text	Lemma	POS	Tag	Dep	Shape	isAlpha	isStop
Mary	Mary	PROPN	NNP	nsubj	Xxxx	True	False
is	be	AUX	VBZ	aux	xx	True	True
driving	drive	VERB	VBG	ROOT	xxxx	True	False
a	a	DET	DT	det	x	True	True
big	big	ADJ	JJ	amod	xxx	True	False
car	car	NOUN	NN	dobj	xxx	True	False
.	.	PUNCT	.	punct	.	False	False


In [109]:
for token in sent_tokens:
    print(nltk.pos_tag([token]))

[('Mary', 'NNP')]
[('is', 'VBZ')]
[('driving', 'VBG')]
[('a', 'DT')]
[('big', 'JJ')]
[('car', 'NN')]
[('.', '.')]


In [110]:
# import spacy
from spacy import displacy

displacy.render(doc, style="dep",jupyter=True, options={"word_spacing":45,"compact":False,"distance":150})

### spaCy NER

In [111]:
# import spacy
from spacy import displacy

text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.render(doc, style="ent",jupyter=True)

In [114]:

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Sebastian 5 14 NORP
Google 61 67 ORG
2007 71 75 DATE
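
The two numbers printed for each entity are character offsets into the original string, so slicing the text with them recovers the entity. The sketch below checks this with plain Python, using the offsets and labels copied from the output above; it does not call spaCy, and the `reported` list is just that output retyped.

```python
text = ("When Sebastian Thrun started working on self-driving cars at Google "
        "in 2007, few people outside of the company took him seriously.")

# (start_char, end_char, label) triples copied from the output above
reported = [(5, 14, "NORP"), (61, 67, "ORG"), (71, 75, "DATE")]

# end_char is exclusive, so text[start:end] is exactly the entity text
spans = [(text[start:end], label) for start, end, label in reported]

for entity_text, label in spans:
    print(entity_text, label)
```

This also makes the model's mistake easy to see: the first span covers only `Sebastian`, not the full name `Sebastian Thrun`.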


### Exercise data

In [115]:
nltk.download('movie_reviews')
file = nltk.corpus.movie_reviews.words('pos/cv033_24444.txt')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/edureka/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


In [116]:
text = " ".join(file)

In [117]:
text

'in wonder boys michael douglas plays an aged writer \\ professor with such lived - in naturalism that i believe it may be his best performance . ever since wall street , douglas has spent the greater part of his career playing variations on the shark in a suit gordon gecko character he personified in the mid - 80 \' s . in those performances he tended to exaggerate the vehemence of cutthroat businessmen , with much frothing at the mouth while projecting all his bad intentions to the world . you \' d think such a man would keep his evil wrapped tightly underneath a good - natured veneer , but from gordon gecko to nicholas van orton , douglas played the role straight and out in the open . in wonder boys his performance isn \' t showy or a tour de force , it \' s simple yet truthful . he embodies grady , a craggy old writer with a predilection for pot and pink bathrobes . grady instructs a writers workshop while working tirelessly on a follow up to the novel that put him on the map . whe

In [118]:
fdist5 = FreqDist()
for word in file:
    fdist5[word.lower()]+=1

In [119]:
fdist5.most_common(10)

[(',', 60),
 ('.', 54),
 ('the', 54),
 ('a', 46),
 ('in', 27),
 ('of', 24),
 ('to', 21),
 ("'", 20),
 ('i', 19),
 ('his', 19)]

## Exercise

1. Remove the stop words and find the most common words occurring in the list.
2. Apply stemming and lemmatization. Does the frequency of the most common words change?
3. Try POS tagging and NER with both NLTK and spaCy. Do you notice any difference?
4. How accurately do the two models identify names? What do you think explains the output?