In [None]:
!pip install textacy

In [1]:
#We need textacy, which in turn loads the spaCy library
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [5]:
import pandas as pd
import numpy as np
import spacy
import textacy.ke
from textacy import *

In [3]:
#Load a spacy model, which will be used for all further processing.
en = textacy.load_spacy_lang("en_core_web_sm")
en

<spacy.lang.en.English at 0x7fb091158310>

In [27]:
#Let us use a sample text file, nlphistory.txt, which is the text from the history section of Wikipedia's
#page on Natural Language Processing 
#https://en.wikipedia.org/wiki/Natural_language_processing
#path = 'PATH TO REPO'
#mytext = open(path+'./Data/nlphistory.txt').read()
path='https://raw.githubusercontent.com/duybluemind1988/Data-science/master/Practical%20NLP%20Oreilly/Ch5/Data/nlphistory.txt'
path_2='/content/nlphistory.txt'
mytext = open(path_2).read()
mytext

'The history of natural language processing generally started in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.\n\nThe Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem.[2] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.\n\nSome notably successful natural language processing systems developed in the 1960s were SHRDLU, a natural language system working 

In [29]:
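import requests
#Alternatively, download the text directly from the URL instead of reading a local copy.
url = path
response = requests.get(url)
response.raise_for_status()  #stop early if the download failed
mytext2 = response.text
print(mytext2)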

The history of natural language processing generally started in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.

The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem.[2] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.

Some notably successful natural language processing systems developed in the 1960s were SHRDLU, a natural language system working in re

In [33]:
#Convert the text into a spaCy document.
doc = textacy.make_spacy_doc(mytext2, lang=en)
doc._.preview

u'Doc(1090 tokens: "The history of natural language processing gene...")'

In [34]:
textacy.ke.textrank(doc, topn=5)

[(u'successful natural language processing system', 0.02477763226482553),
 (u'statistical machine translation system', 0.024691643927525625),
 (u'natural language system', 0.020534645561892825),
 (u'statistical natural language processing', 0.018614712584248353),
 (u'natural language task', 0.015808223689229454)]

In [35]:
#Print the key phrases using the TextRank algorithm, as implemented in textacy.
print("Textrank output: ", [kp for kp, weight in textacy.ke.textrank(doc, normalize="lemma", topn=5)])
#Print the key words and phrases using the SGRank algorithm, as implemented in textacy.
print("SGRank output: ", [kp for kp, weight in textacy.ke.sgrank(doc, topn=5)])

('Textrank output: ', [u'successful natural language processing system', u'statistical machine translation system', u'natural language system', u'statistical natural language processing', u'natural language task'])
('SGRank output: ', [u'natural language processing system', u'machine translation', u'successful natural language', u'statistical machine translation system', u'low enough time complexity'])


In [37]:
#To address the issue of overlapping key phrases, textacy has a function: aggregate_term_variants.
#It groups variant terms together; choosing one term per group gives us a list of non-overlapping key phrases.
terms = set([term for term,weight in textacy.ke.sgrank(doc)])
terms

{u'early',
 u'example',
 u'late 1980',
 u'low enough time complexity',
 u'machine translation',
 u'natural language processing system',
 u'real',
 u'research',
 u'statistical machine translation system',
 u'successful natural language'}

In [38]:
print(textacy.ke.utils.aggregate_term_variants(terms))

[set([u'statistical machine translation system']), set([u'natural language processing system']), set([u'successful natural language']), set([u'low enough time complexity']), set([u'machine translation']), set([u'late 1980']), set([u'research']), set([u'example']), set([u'early']), set([u'real'])]
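
One way to act on the grouping above is to keep a single representative per group. The sketch below assumes the groups are plain Python sets of strings, as in the printed output, and keeps the longest term from each group. `pick_representatives` is a hypothetical helper written for illustration, not part of textacy, and the sample groups are made up to show a group with more than one variant.

```python
def pick_representatives(groups):
    """Return one term per group: the longest, with ties broken alphabetically."""
    return [sorted(group, key=lambda term: (-len(term), term))[0] for group in groups]

# Hypothetical groups, shaped like the output of aggregate_term_variants
groups = [
    {"statistical machine translation system", "statistical machine translation"},
    {"natural language processing system"},
    {"machine translation"},
]

representatives = pick_representatives(groups)
print(representatives)
```

Preferring the longest term is only one possible heuristic; picking the highest-scoring variant would be another reasonable choice if the scores are kept.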


In [39]:
#Another way to find key phrases is to treat every noun chunk as a candidate.
#However, keep in mind that this yields a lot of phrases, with no built-in way to rank them.

print([chunk for chunk in textacy.extract.noun_chunks(doc)])

[history, natural language processing, 1950s, work, earlier periods, Alan Turing, article, what, criterion, intelligence, Georgetown experiment, fully automatic translation, more than sixty Russian sentences, English, authors, three or five years, machine translation, solved problem.[2, real progress, ALPAC report, ten-year-long research, expectations, machine translation, Little further research, machine translation, late 1980s, first statistical machine translation systems, notably successful natural language processing systems, 1960s, SHRDLU, natural language system, restricted "blocks worlds, restricted vocabularies, ELIZA, simulation, Rogerian psychotherapist, Joseph Weizenbaum, almost no information, human thought, emotion, ELIZA, startlingly human-like interaction, "patient, very small knowledge base, ELIZA, generic response, example, "My head, you, head, 1970s, many programmers, "conceptual ontologies, real-world information, computer-understandable data, Examples, MARGIE, (Sch
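
Noun chunks come without scores, but a crude ranking can be obtained by counting how often each chunk occurs. The sketch below uses only the standard library and a few chunk strings copied from the output above. It assumes the chunks have already been converted to strings (for spaCy spans, `chunk.text` would do that), and raw frequency is a stand-in for ranking here, not what TextRank or SGRank compute.

```python
from collections import Counter

# A few noun chunks copied from the output above, as plain strings
chunk_texts = [
    "history", "natural language processing", "machine translation",
    "real progress", "machine translation", "machine translation",
    "ELIZA", "ELIZA",
]

# Lowercase before counting so that case variants are merged
counts = Counter(text.lower() for text in chunk_texts)
top_chunks = counts.most_common(2)
print(top_chunks)
```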

Textacy also provides a number of other information extraction functions, many of them based on regular expression patterns and heuristics for extracting specific expressions such as acronyms and quotations. Apart from these, we can also extract matches for custom regular expressions, including POS tag patterns, or look for statements involving an entity, subject-verb-object tuples, etc. We will discuss some of these as they come up in this chapter.

Documentation: https://chartbeat-labs.github.io/textacy/build/html/index.html
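
To give a flavour of the pattern-based approach described above, here is a minimal standard-library sketch that pulls acronym-like tokens out of a sentence with a regular expression. It is not textacy's implementation: the pattern, the helper name `find_acronyms`, and the sample sentence are all assumptions made for illustration.

```python
import re

# Assumption: an "acronym" is a run of two or more uppercase letters
ACRONYM_PATTERN = re.compile(r"\b[A-Z]{2,}\b")

def find_acronyms(text):
    """Return acronym-like tokens in order of first appearance, without duplicates."""
    seen = []
    for match in ACRONYM_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen

sample = (
    "After the ALPAC report in 1966, systems such as SHRDLU and ELIZA appeared. "
    "ELIZA simulated a psychotherapist."
)
acronyms = find_acronyms(sample)
print(acronyms)
```

A heuristic this simple will also match ordinary words written in capitals, which is why library implementations typically layer further checks on top of the pattern.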

# Learn Textacy (outside book)

https://chartbeat-labs.github.io/textacy/build/html/getting_started/quickstart.html

In [42]:
text = (
"Since the so-called \"statistical revolution\" in the late 1980s and mid 1990s, "
"much Natural Language Processing research has relied heavily on machine learning. "
"Formerly, many language-processing tasks typically involved the direct hand coding "
"of rules, which is not in general robust to natural language variation. "
"The machine-learning paradigm calls instead for using statistical inference "
"to automatically learn such rules through the analysis of large corpora "
"of typical real-world examples.")
text

'Since the so-called "statistical revolution" in the late 1980s and mid 1990s, much Natural Language Processing research has relied heavily on machine learning. Formerly, many language-processing tasks typically involved the direct hand coding of rules, which is not in general robust to natural language variation. The machine-learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora of typical real-world examples.'