In [None]:
!pip install textacy

In [1]:
#We need textacy, which in turn loads the spaCy library, so download the small English spaCy model
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [29]:
import pandas as pd
import numpy as np
import spacy
import textacy
import textacy.ke  # key-term extraction algorithms (textrank, sgrank)

In [30]:
#Load a spacy model, which will be used for all further processing.
en = textacy.load_spacy_lang("en_core_web_sm")
en

<spacy.lang.en.English at 0x7f5199944438>

In [27]:
#Let us use a sample text file, nlphistory.txt, which is the text from the history section of Wikipedia's
#page on Natural Language Processing 
#https://en.wikipedia.org/wiki/Natural_language_processing
#path = 'PATH TO REPO'
#path_2='/content/nlphistory.txt'
#mytext = open(path+'./Data/nlphistory.txt').read()
#mytext = open(path_2).read()
#mytext

'The history of natural language processing generally started in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.\n\nThe Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem.[2] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.\n\nSome notably successful natural language processing systems developed in the 1960s were SHRDLU, a natural language system working 

In [27]:
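import requests

#Fetch the sample text directly from GitHub instead of reading a local file.
url = 'https://raw.githubusercontent.com/duybluemind1988/Data-science/master/Practical%20NLP%20Oreilly/Ch5/Data/nlphistory.txt'
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed
mytext2 = response.text
print(mytext2)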

The history of natural language processing generally started in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.

The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem.[2] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.

Some notably successful natural language processing systems developed in the 1960s were SHRDLU, a natural language system working in re

In [31]:
#convert the text into a spacy document.
doc = textacy.make_spacy_doc(mytext2, lang=en)
doc._.preview

'Doc(1090 tokens: "The history of natural language processing gene...")'

In [32]:
textacy.ke.textrank(doc, topn=5)

[('successful natural language processing system', 0.02475549496438359),
 ('statistical machine translation system', 0.024648673368376665),
 ('natural language system', 0.020518708001159278),
 ('statistical natural language processing', 0.01858983530270439),
 ('natural language task', 0.01579726776487791)]

In [33]:
#Print the key phrases found by the TextRank algorithm, as implemented in textacy.
print("Textrank output: ", [kps for kps, weights in textacy.ke.textrank(doc, normalize="lemma", topn=5)])
#Print the key words and phrases found by the SGRank algorithm, as implemented in textacy.
print("SGRank output: ", [kps for kps, weights in textacy.ke.sgrank(doc, topn=5)])

Textrank output:  ['successful natural language processing system', 'statistical machine translation system', 'natural language system', 'statistical natural language processing', 'natural language task']
SGRank output:  ['natural language processing system', 'statistical machine translation', 'research', 'late 1980', 'early']
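
TextRank scores terms by running PageRank over a graph of co-occurring words. The block below is a minimal standard-library sketch of that idea, not textacy's implementation: the function name `textrank_words`, the window of 2 tokens, the damping factor of 0.85, and the short token list are all assumptions chosen for illustration.

```python
from collections import defaultdict

def textrank_words(tokens, window=2, damping=0.85, iterations=50):
    """Rank words with PageRank over an undirected co-occurrence graph."""
    # Link each word to the words that follow it within `window` tokens.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    # Start from a uniform distribution and apply the PageRank update repeatedly.
    n_nodes = len(neighbors)
    scores = {word: 1.0 / n_nodes for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) / n_nodes
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical token list built from words that appear in the phrases above.
tokens = ("natural language processing system statistical machine "
          "translation system natural language system").split()
ranking = textrank_words(tokens)
print(ranking[:3])
```

Words that co-occur with many different neighbors end up with the highest scores; a phrase-level ranker such as the one in textacy then combines word scores into scores for candidate phrases.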


In [36]:
#To address the issue of overlapping key phrases, textacy has a function: aggregate_term_variants.
#Choosing one of the grouped terms per item will give us a list of non-overlapping key phrases!
terms = {term for term, weight in textacy.ke.sgrank(doc)}
terms

{'ELIZA',
 'early',
 'example',
 'late 1980',
 'natural language processing system',
 'real',
 'research',
 'statistical machine translation',
 'statistical model',
 'world'}

In [37]:
print(textacy.ke.utils.aggregate_term_variants(terms))

[{'natural language processing system'}, {'statistical machine translation'}, {'statistical model'}, {'late 1980'}, {'research'}, {'example'}, {'early'}, {'world'}, {'ELIZA'}, {'real'}]
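
To make the overlap problem concrete, here is a small sketch that filters the TextRank phrases printed earlier. It uses a simple heuristic that is an assumption of this example, not the logic of `aggregate_term_variants`: a phrase is dropped when all of its words already occur in a longer phrase that was kept. The function name `drop_contained_phrases` is hypothetical.

```python
def drop_contained_phrases(phrases):
    """Keep a phrase only if its words are not all contained in a longer kept phrase."""
    kept = []
    # Visit longer phrases first so that shorter variants are compared against them.
    for phrase in sorted(phrases, key=lambda p: len(p.split()), reverse=True):
        words = set(phrase.lower().split())
        if not any(words <= set(k.lower().split()) for k in kept):
            kept.append(phrase)
    return kept

# The five phrases from the TextRank output above.
phrases = [
    "successful natural language processing system",
    "statistical machine translation system",
    "natural language system",
    "statistical natural language processing",
    "natural language task",
]
result = drop_contained_phrases(phrases)
print(result)
```

With this heuristic, 'natural language system' is removed because its words are covered by the longest phrase, while the other four phrases survive.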


In [39]:
#Another way to find key phrases is to treat every noun chunk as a candidate.
#However, keep in mind that this yields a lot of phrases, with no built-in way to rank them!

print(list(textacy.extract.noun_chunks(doc)))

[history, natural language processing, 1950s, work, earlier periods, Alan Turing, article, what, criterion, intelligence, Georgetown experiment, fully automatic translation, more than sixty Russian sentences, English, authors, three or five years, machine translation, real progress, ALPAC report, ten-year-long research, expectations, machine translation, Little further research, machine translation, late 1980s, first statistical machine translation systems, notably successful natural language processing systems, SHRDLU, natural language system, restricted "blocks worlds, restricted vocabularies, ELIZA, simulation, Rogerian psychotherapist, Joseph Weizenbaum, almost no information, human thought, emotion, ELIZA, startlingly human-like interaction, "patient, very small knowledge base, ELIZA, generic response, example, head, you, head, 1970s, many programmers, "conceptual ontologies, real-world information, computer-understandable data, Examples, MARGIE, Schank, Cullingford, (Wilensky, Le
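
One simple way to impose an order on noun chunks is to count how often each one occurs. The sketch below does this with `collections.Counter` on a handful of chunk strings copied from the output above; frequency as a ranking signal is an assumption of this example, not something textacy applies here.

```python
from collections import Counter

# A few chunk strings copied from the printed output above (not the full list).
chunks = [
    "machine translation", "real progress", "ALPAC report",
    "ten-year-long research", "expectations", "machine translation",
    "Little further research", "machine translation", "late 1980s",
    "ELIZA", "simulation", "ELIZA", "ELIZA",
]

# Lowercase the chunks so that case variants are counted together.
counts = Counter(chunk.lower() for chunk in chunks)
print(counts.most_common(3))
```

With a real spaCy document, the same counter could be fed with `chunk.text.lower()` for each chunk returned by `textacy.extract.noun_chunks(doc)`.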

Textacy also provides other information extraction functions, many of them based on regular expression patterns and heuristics for extracting specific expressions such as acronyms and quotations. We can also extract matches for custom regular expressions, including POS tag patterns, or look for statements involving an entity, subject-verb-object tuples, and so on. We will discuss some of these as they come up in this chapter.
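
To give a feel for the regular-expression approach, here is a minimal sketch using only Python's `re` module. The two patterns are simplifying assumptions (an acronym is taken to be a standalone run of two or more capital letters, a quotation is text between straight double quotes), and the sample sentence is assembled for illustration from names that appear in the text above. textacy's own extractors use more careful heuristics.

```python
import re

# Assumption: an acronym is a standalone run of two or more capital letters.
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")
# Assumption: a quotation is any text enclosed in straight double quotes.
QUOTATION = re.compile(r'"([^"]+)"')

# Illustrative sentence, not a verbatim quote from the sample file.
sample = ('After the ALPAC report, systems such as SHRDLU and ELIZA '
          'worked in restricted "blocks worlds".')

acronyms = ACRONYM.findall(sample)
quotes = QUOTATION.findall(sample)
print(acronyms)
print(quotes)
```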

Documentation: https://chartbeat-labs.github.io/textacy/build/html/index.html

# Learn Textacy (outside book)

https://chartbeat-labs.github.io/textacy/build/html/getting_started/quickstart.html

In [4]:
!pip install textacy

Collecting textacy
  Downloading https://files.pythonhosted.org/packages/f3/fe/0b57ac1a202de9819e71e8373980d586e824f515ad2f4266e4e98627f8b8/textacy-0.10.0-py3-none-any.whl (206kB)
Collecting pyphen>=0.9.4
  Using cached https://files.pythonhosted.org/packages/15/82/08a3629dce8d1f3d91db843bb36d4d7db6b6269d5067259613a0d5c8a9db/Pyphen-0.9.5-py2.py3-none-any.whl
Collecting cytoolz>=0.8.0
  Using cached https://files.pythonhosted.org/packages/62/b1/7f16703fe4a497879b1b457adf1e472fad2d4f030477698b16d2febf38bb/cytoolz-0.10.1.tar.gz
Collecting jellyfish>=0.7.0
  Downloading https://files.pythonhosted.org/packages/6c/09/927ae35fc5a9f70abb6cc2c27ee88fc48549f7bc4786c1d4b177c22e997d/jellyfish-0.8.2-cp36-cp36m-manylinux2014_x86_64.whl (93kB)
Building wheels for collected packages: cytoolz
  Building wheel for cytoolz (setup.py) ... done
  Created wheel for cyto

In [5]:
import textacy

In [6]:
text = (
"Since the so-called \"statistical revolution\" in the late 1980s and mid 1990s, "
"much Natural Language Processing research has relied heavily on machine learning. "
"Formerly, many language-processing tasks typically involved the direct hand coding "
"of rules, which is not in general robust to natural language variation. "
"The machine-learning paradigm calls instead for using statistical inference "
"to automatically learn such rules through the analysis of large corpora "
"of typical real-world examples.")
text

'Since the so-called "statistical revolution" in the late 1980s and mid 1990s, much Natural Language Processing research has relied heavily on machine learning. Formerly, many language-processing tasks typically involved the direct hand coding of rules, which is not in general robust to natural language variation. The machine-learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora of typical real-world examples.'

Before (or in lieu of) processing this text with spaCy, we can do a few things. First, let’s look for keywords-in-context, as a quick way to assess, by eye, how a particular word or phrase is used in a body of text:

In [7]:
textacy.text_utils.KWIC(text, "language", window_width=35)

<generator object keyword_in_context at 0x7f51a3a23d00>

In [8]:
textacy.text_utils.KWIC(text, "statistical", window_width=35)

<generator object keyword_in_context at 0x7f51a3a23c50>
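
The two cells above display only a generator object, so no matches are shown until the generator is iterated. The following standard-library sketch shows what a keyword-in-context search does and how to consume its results. The function here is a hypothetical re-implementation written for illustration, not textacy's code, and the sample is a shortened copy of `text`.

```python
import re

def keyword_in_context(text, keyword, window_width=35, ignore_case=True):
    """Yield (left context, match, right context) for each keyword occurrence."""
    flags = re.IGNORECASE if ignore_case else 0
    for match in re.finditer(re.escape(keyword), text, flags=flags):
        start, end = match.span()
        left = text[max(0, start - window_width):start]
        right = text[end:end + window_width]
        yield left, match.group(), right

sample = (
    "much Natural Language Processing research has relied heavily on machine learning. "
    "Formerly, many language-processing tasks typically involved the direct hand coding "
    "of rules, which is not in general robust to natural language variation."
)

# Materialize the generator so the matches can be inspected and printed.
matches = list(keyword_in_context(sample, "language"))
for left, kw, right in matches:
    print(f"{left:>35} {kw} {right}")
```

Wrapping the call in `list(...)` or looping over it is what forces the generator to produce its matches.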

Sometimes, “raw” text is messy and must be cleaned up before analysis; other times, an analysis simply benefits from well-standardized text. In either case, the textacy.preprocessing sub-package contains a number of functions to normalize (whitespace, quotation marks, etc.), remove (punctuation, accents, etc.), and replace (URLs, emails, numbers, etc.) messy text data. For example:

In [60]:
text

'Since the so-called "statistical revolution" in the late 1980s and mid 1990s, much Natural Language Processing research has relied heavily on machine learning. Formerly, many language-processing tasks typically involved the direct hand coding of rules, which is not in general robust to natural language variation. The machine-learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora of typical real-world examples.'

In [9]:
from textacy import preprocessing
preprocessing.normalize_whitespace(preprocessing.remove_punctuation(text))

'Since the so called statistical revolution in the late 1980s and mid 1990s much Natural Language Processing research has relied heavily on machine learning Formerly many language processing tasks typically involved the direct hand coding of rules which is not in general robust to natural language variation The machine learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora of typical real world examples'
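
The effect of the two preprocessing calls can be approximated with the standard library. This is a rough sketch under stated assumptions: it handles only ASCII punctuation from `string.punctuation` and collapses all whitespace (including newlines) into single spaces, which may differ from what the textacy functions do on other input.

```python
import re
import string

def remove_punctuation(text):
    """Replace each ASCII punctuation character with a space."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)

def normalize_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

# The opening clause of `text` from the cells above.
sample = 'Since the so-called "statistical revolution" in the late 1980s and mid 1990s,'
cleaned = normalize_whitespace(remove_punctuation(sample))
print(cleaned)
```

Replacing punctuation with a space rather than deleting it keeps hyphenated words such as "so-called" from being fused into one token.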

Usually, though, we want to work with text that’s been processed by spaCy: tokenized, part-of-speech tagged, parsed, and so on. Since spaCy’s pipelines are language-dependent, we have to load a particular pipeline to match the text; when working with texts from multiple languages, this can be a pain. Fortunately, textacy includes automatic language detection to apply the right pipeline to the text, and it caches the loaded language data to minimize wait time and hassle. Making a Doc from text is easy:

In [10]:
doc = textacy.make_spacy_doc(text)
doc._.preview

'Doc(85 tokens: "Since the so-called "statistical revolution" in...")'

In [11]:
metadata = {
"title": "Natural-language processing",
"url": "https://en.wikipedia.org/wiki/Natural-language_processing",
"source": "wikipedia",
}
doc = textacy.make_spacy_doc((text, metadata))
doc._.meta["title"]

'Natural-language processing'

There are many ways to understand the content of a Doc. For starters, let’s extract various elements of interest:

In [14]:
list(textacy.extract.ngrams(
    doc, 3, filter_stops=True, filter_punct=True, filter_nums=False))

[1980s and mid,
 Natural Language Processing,
 Language Processing research,
 research has relied,
 heavily on machine,
 processing tasks typically,
 tasks typically involved,
 involved the direct,
 direct hand coding,
 coding of rules,
 robust to natural,
 natural language variation,
 learning paradigm calls,
 paradigm calls instead,
 inference to automatically,
 learn such rules,
 analysis of large,
 corpora of typical]

In [15]:
list(textacy.extract.ngrams(
    doc, 2, filter_stops=True, filter_punct=True, filter_nums=False))

[statistical revolution,
 late 1980s,
 mid 1990s,
 Natural Language,
 Language Processing,
 Processing research,
 relied heavily,
 machine learning,
 processing tasks,
 tasks typically,
 typically involved,
 direct hand,
 hand coding,
 general robust,
 natural language,
 language variation,
 learning paradigm,
 paradigm calls,
 calls instead,
 statistical inference,
 automatically learn,
 large corpora,
 typical real,
 world examples]
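
The outputs above suggest that `filter_stops=True` discards n-grams that start or end with a stop word while allowing stop words in the middle (for example, 'coding of rules' is kept). The sketch below reproduces that behavior in plain Python. The function name `ngrams`, the whitespace tokenization, and the tiny stop-word set are assumptions for illustration; spaCy's real stop list is much longer.

```python
def ngrams(tokens, n, stop_words=frozenset()):
    """Return n-grams as strings, skipping any that start or end with a stop word."""
    grams = []
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        if gram[0].lower() in stop_words or gram[-1].lower() in stop_words:
            continue
        grams.append(" ".join(gram))
    return grams

# Tiny illustrative stop-word list (an assumption, not spaCy's list).
STOP_WORDS = frozenset({"the", "of", "is", "not", "in", "to", "which"})

# Part of the second sentence of `text`, split on whitespace.
tokens = ("the direct hand coding of rules which is not in general "
          "robust to natural language variation").split()
trigrams = ngrams(tokens, 3, STOP_WORDS)
print(trigrams)
```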

We can also identify key terms in a document using any of several algorithms:

In [16]:
import textacy.ke
textacy.ke.textrank(doc, normalize="lemma", topn=10)

[('Natural Language Processing research', 0.059959246697826624),
 ('natural language variation', 0.04488350959275309),
 ('direct hand coding', 0.037736661821063354),
 ('statistical inference', 0.03432557996664981),
 ('statistical revolution', 0.034007535820683756),
 ('machine learning', 0.03305919655573349),
 ('mid 1990', 0.026993994406706995),
 ('late 1980', 0.026499549123496648),
 ('processing task', 0.0256684200517989),
 ('general robust', 0.024835834233545625)]

In [17]:
ts = textacy.TextStats(doc)
ts.n_unique_words

57

In [18]:
ts.basic_counts

{'n_chars': 414,
 'n_long_words': 30,
 'n_monosyllable_words': 38,
 'n_polysyllable_words': 19,
 'n_sents': 3,
 'n_syllables': 134,
 'n_unique_words': 57,
 'n_words': 73}

In [19]:
ts.flesch_kincaid_grade_level

15.56027397260274

In [20]:
ts.readability_stats

{'automated_readability_index': 17.448173515981736,
 'coleman_liau_index': 16.32928468493151,
 'flesch_kincaid_grade_level': 15.56027397260274,
 'flesch_reading_ease': 26.84351598173518,
 'gulpease_index': 44.61643835616438,
 'gunning_fog_index': 20.144292237442922,
 'lix': 65.42922374429223,
 'smog_index': 17.5058628484301,
 'wiener_sachtextformel': 11.857779908675797}
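
Two of these scores can be recomputed by hand from the basic counts, which helps in interpreting them. The sketch below applies the standard Flesch-Kincaid grade level and Flesch reading ease formulas to the word, sentence, and syllable counts reported by `ts.basic_counts` above; only those three numbers are taken from the notebook output.

```python
# Counts copied from the ts.basic_counts output above.
counts = {"n_words": 73, "n_sents": 3, "n_syllables": 134}

words_per_sentence = counts["n_words"] / counts["n_sents"]
syllables_per_word = counts["n_syllables"] / counts["n_words"]

# Flesch-Kincaid grade level: higher means harder to read.
flesch_kincaid_grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
# Flesch reading ease: lower means harder to read.
flesch_reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

print(flesch_kincaid_grade, flesch_reading_ease)
```

Both values agree with the `flesch_kincaid_grade_level` and `flesch_reading_ease` entries above, which shows that long sentences (about 24 words each) and many syllables per word are what drive this passage's high grade level.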

Lastly, we can transform a document into a “bag of terms”, with flexible weighting and term inclusion criteria:

In [24]:
bot = doc._.to_bag_of_terms(ngrams=(1, 2, 3), entities=True,
                            weighting="count", as_strings=True)
sorted(bot.items(), key=lambda x: x[1], reverse=True)[:15]

[('call', 2),
 ('statistical', 2),
 ('machine', 2),
 ('language', 2),
 ('rule', 2),
 ('learn', 2),
 ('revolution', 1),
 ('late', 1),
 ('1980', 1),
 ('mid', 1),
 ('1990', 1),
 ('Natural', 1),
 ('Language', 1),
 ('Processing', 1),
 ('research', 1)]
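
Conceptually, a count-weighted bag of terms is a mapping from each term to the number of times it occurs. The sketch below builds one for unigrams and bigrams with `collections.Counter`. The function name `bag_of_terms` and the short hand-picked token list are assumptions for illustration; unlike textacy, this sketch does no lemmatization, stop-word filtering, or entity handling.

```python
from collections import Counter

def bag_of_terms(tokens, ngram_sizes=(1, 2)):
    """Count every n-gram of the requested sizes in a list of tokens."""
    bag = Counter()
    for n in ngram_sizes:
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag

# Hand-picked tokens using words from the sample text (illustrative only).
tokens = ("statistical revolution machine learning "
          "statistical inference machine learning").split()
bag = bag_of_terms(tokens)
top = sorted(bag.items(), key=lambda item: item[1], reverse=True)[:5]
print(top)
```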