# NATURAL LANGUAGE PROCESSING WITH TEXTACY & SPACY

__Spacy__ is a very high performance NLP library for doing several tasks of NLP with ease and speed. Let us explore another library built on top of __SpaCy__ called __TextaCy__.

## TEXTACY
+ Textacy is a Python library for performing higher-level natural language processing (NLP) tasks,
built on the high-performance Spacy library.
+ Textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, and parsed text.
+ Uses
    + Text preprocessing
    + Keyword in Context
    + Topic modeling
    + Information Extraction
    + Keyterm extraction,
    + Text and Readability statistics,
    + Emotional valence analysis,
    + Quotation attribution

### INSTALLATION
You can install using `pip install textacy` or `conda install -c conda-forge textacy`.
NB: In case you are having issues with installing on windows you can use conda instead of pip.

### Downloading Dataset
You can use the following command to download the `capitol_words` dataset, whcih we will use in this tutorial.
`python -m textacy download capital_words`

### FOR LANGUAGE DETECTION
You can either use `pip install textacy[lang]` or `pip install cld2-cffi` to install the required language pack for textacy. 

__NOTE__: All required the package, dependencies, and add-on packs are pre-installed for this tutorial.

## Getting Started

In [23]:
# Loading Packages
import textacy

In [24]:
example = "Textacy is a Python library for performing higher-level natural language processing (NLP) tasks, \
built on the high-performance Spacy library. With the basics — tokenization, part-of-speech tagging, parsing \
— offloaded to another library, textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, \
and parsed text: keyterm extraction, readability statistics, emotional valence analysis, quotation attribution, \
and more."

In [25]:
example

'Textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance Spacy library. With the basics — tokenization, part-of-speech tagging, parsing — offloaded to another library, textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, and parsed text: keyterm extraction, readability statistics, emotional valence analysis, quotation attribution, and more.'

## TEXT PREPROCESSING WITH TEXTACY
Following methods can be used to preprocess your text data:

+ `textacy.preprocess_text()`
+ `textacy.preprocess.`
    + Punctuation Lowercase
    + Urls
    + Phone numbers
    + Currency
    + Emails


In [26]:
raw_text = """ The best programs, are the ones written when the programmer is supposed to be working on something else.\
Mike bought the book for $50 although in Paris it will cost $30 dollars. Don’t document the problem, \
fix it.This is from https://twitter.com/codewisdom?lang=en. """

In [27]:
# Removing Punctuation and Uppercase
processed_text = textacy.preprocess.remove_punct(raw_text)
processed_text

' The best programs  are the ones written when the programmer is supposed to be working on something else Mike bought the book for $50 although in Paris it will cost $30 dollars  Don t document the problem  fix it This is from https   twitter com codewisdom lang=en  '

In [28]:
# Removing urls
processed_text = textacy.preprocess.replace_urls(processed_text,replace_with='TWITTER')
processed_text

' The best programs  are the ones written when the programmer is supposed to be working on something else Mike bought the book for $50 although in Paris it will cost $30 dollars  Don t document the problem  fix it This is from https   twitter com codewisdom lang=en  '

In [29]:
# Replacing Currency Symbols
processed_text = textacy.preprocess.replace_currency_symbols(processed_text,replace_with='USD')
processed_text

' The best programs  are the ones written when the programmer is supposed to be working on something else Mike bought the book for USD50 although in Paris it will cost USD30 dollars  Don t document the problem  fix it This is from https   twitter com codewisdom lang=en  '

In [30]:
#Corrected text block

processed_text = textacy.preprocess.replace_currency_symbols(raw_text,replace_with='USD')
processed_text = textacy.preprocess.replace_urls(processed_text,replace_with='TWITTER')
processed_text = textacy.preprocess.remove_punct(processed_text)
processed_text

' The best programs  are the ones written when the programmer is supposed to be working on something else Mike bought the book for USD50 although in Paris it will cost USD30 dollars  Don t document the problem  fix it This is from TWITTER '

Notice that we created a variable `processed_text` in every cell block above? That is because that usually text preprocessing is a pipeline - with multiple steps in it. Here are the steps we completed above:
+ Removing Punctuation and Uppercase
+ Removing urls
+ Replacing Currency Symbols

So we are using the variable to pass text data from each step to the next.

There are much more text preprocessing steps. [Here](https://www.kdnuggets.com/2019/04/text-preprocessing-nlp-machine-learning.html) is a good summary of these steps.

You can incorporate all above steps together using the following code:

In [31]:
# Preprocess All
#This is not correct
#processed_text = textacy.preprocess_text(raw_text,lowercase=True,no_punct=True,no_urls=True)
#processed_text


Refer to the [docs](https://chartbeat-labs.github.io/textacy/api_reference.html#module-textacy.preprocess) for more details of the `textacy.preprocess` and its family methods.

### READING A TEXT OR A DOCUMENT
+ `textacy.Doc(your_text)`
+ `textacy.io.read_text(your_text)`

Textacy would not receive a lot of attractions if it only can remove URLs or punctuations; however, all additional/more advanced techniques/analyses required on `formatting` the data.

TextaCy/SpaCy uses a `Doc` as a container for any text objects. [Here](https://chartbeat-labs.github.io/textacy/getting_started/quickstart.html#make-a-doc) is nice documentation of the `doc` object.



In [32]:
# With Doc
# Requires Language Pkg Model
docx_textacy = textacy.Doc(example, lang='en')

In [33]:
docx_textacy

Doc(82 tokens; "Textacy is a Python library for performing high...")

We can look at the `type` of the `docx_textacy` object.

In [34]:
type(docx_textacy)

textacy.doc.Doc

Following code read a `doc` from a local file:

1. use the .read() method
`file_textacy = textacy.Doc(open("example.txt").read())`

2. create a generator
`file_textacy2 = textacy.io.read_text('example.txt',lines=True)`

then:

`for text in file_textacy2:`

    `docx_file = textacy.Doc(text)`
    
    `print(docx_file)`

### Advanced Text Analytics 

1. Named-Entity Recognition

Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as the person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc. [Source: [Wikipedia](https://en.wikipedia.org/wiki/Named-entity_recognition)]

TextaCy has a built-in method for NER.

In [35]:
# Using Textacy Named Entity Extraction
list(textacy.extract.named_entities(docx_textacy))

[NLP, Spacy]

2. n-grams

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.

TextaCy has a built-in method for n-grams.

N-grams, a.k.a __Bag-of-Words__, is a very important quantifying approach for text data.

In [36]:
# NGrams with Textacy
# NB SpaCy method would be to use noun Phrases
# Tri Grams

list(textacy.extract.ngrams(docx_textacy,3))

[library for performing,
 level natural language,
 natural language processing,
 performance Spacy library,
 With the basics,
 focuses on tasks,
 availability of tokenized,
 emotional valence analysis]

3. text statistics
This usually includes computing basic counts and various readability statistics.

In [37]:
ts = textacy.TextStats(docx_textacy)

In [38]:
# Number of unique words
ts.n_unique_words

51

In [39]:
# Basic counts of linguistic units
ts.basic_counts

{'n_sents': 2,
 'n_words': 60,
 'n_chars': 364,
 'n_syllables': 116,
 'n_unique_words': 51,
 'n_long_words': 28,
 'n_monosyllable_words': 29,
 'n_polysyllable_words': 19}

In [40]:
# readability scores
ts.readability_stats

{'flesch_kincaid_grade_level': 18.923333333333336,
 'flesch_reading_ease': 12.825000000000045,
 'smog_index': 20.736966565827903,
 'gunning_fog_index': 24.66666666666667,
 'coleman_liau_index': 18.884049400000002,
 'automated_readability_index': 22.144,
 'lix': 76.66666666666666,
 'gulpease_index': 38.333333333333336,
 'wiener_sachtextformel': 14.740666666666666}

Some of these basic counts and readability stats seem intimidating. Feel free to Google them to better understand them. 

4. Dealing with a collection of documents (corpus)

Many NLP tasks require datasets comprised of a large number of texts, which are often stored on disk in one or multiple files. textacy makes it easy to efficiently stream text and (text, metadata) pairs from disk, regardless of the format or compression of the data.

In [42]:
import textacy.datasets  # note the import
ds = textacy.datasets.CapitolWords()
ds.download()
records = ds.records(speaker_name={"Hillary Clinton", "Barack Obama"})
next(records)

{'date': '2001-02-13',
 'congress': 107,
 'speaker_name': 'Hillary Clinton',
 'speaker_party': 'D',
 'title': 'MORNING BUSINESS',
 'text': 'I yield myself 15 minutes of the time controlled by the Democrats.',
 'chamber': 'Senate'}

A `textacy.Corpus` is an ordered collection of spaCy Doc s, all processed by the same language pipeline. Let’s continue with the Capitol Words dataset and make a corpus from a stream of records. (Note: This may take a few minutes.)

In [43]:
cw = textacy.datasets.CapitolWords()
records = cw.records(limit=100)
text_stream, metadata_stream = textacy.io.split_records(records, 'text')
corpus = textacy.Corpus('en', texts=text_stream, metadatas=metadata_stream)
print(corpus)

Corpus(100 docs; 70500 tokens)


In [44]:
for doc in corpus:
    print(doc.metadata.get("speaker_name"))

Bernie Sanders
Lindsey Graham
Bernie Sanders
Bernie Sanders
Rick Santorum
Rick Santorum
Bernie Sanders
Bernie Sanders
Bernie Sanders
Bernie Sanders
Bernie Sanders
Rick Santorum
Rick Santorum
Rick Santorum
Joseph Biden
Bernie Sanders
Joseph Biden
Joseph Biden
Bernie Sanders
Rick Santorum
Rick Santorum
John Kasich
John Kasich
Lindsey Graham
Lindsey Graham
Lindsey Graham
Lindsey Graham
Bernie Sanders
Rick Santorum
Rick Santorum
Joseph Biden
Lindsey Graham
Rick Santorum
Bernie Sanders
Rick Santorum
Rick Santorum
Rick Santorum
Rick Santorum
Lindsey Graham
Lindsey Graham
Lindsey Graham
Joseph Biden
Joseph Biden
Rick Santorum
Bernie Sanders
Bernie Sanders
Bernie Sanders
John Kasich
John Kasich
Joseph Biden
Joseph Biden
Joseph Biden
Rick Santorum
Joseph Biden
Joseph Biden
Joseph Biden
Joseph Biden
Joseph Biden
Joseph Biden
Joseph Biden
Lindsey Graham
Lindsey Graham
Lindsey Graham
Lindsey Graham
Bernie Sanders
Rick Santorum
Rick Santorum
John Kasich
Joseph Biden
Joseph Biden
Joseph Biden
Joseph

You can filter the corpus using certain conditions, which would cover your specific use cases:

In [45]:
# Suppose we just want speeches of Mr. Bernie Sanders
# Just print out top-3 for illustration
match_func = lambda doc: doc.metadata.get("speaker_name") == "Bernie Sanders"

for doc in corpus.get(match_func, limit=3):
    print(doc)


Doc(159 tokens; "Mr. Speaker, 480,000 Federal employees are work...")
Doc(336 tokens; "Mr. Speaker, I thank the gentleman for yielding...")
Doc(177 tokens; "Mr. Speaker, if we want to understand why in th...")


Corpus is a list-like object that can be iterated on - each element in Corpus is a `textacy.Doc` object. 

Which means we can do slicing as we slice any list in Python:

In [46]:
# any element
corpus[0]

Doc(159 tokens; "Mr. Speaker, 480,000 Federal employees are work...")

In [47]:
# a sub-list
[doc for doc in corpus[:3]]

[Doc(159 tokens; "Mr. Speaker, 480,000 Federal employees are work..."),
 Doc(219 tokens; "Mr. Speaker, a relationship, to work and surviv..."),
 Doc(336 tokens; "Mr. Speaker, I thank the gentleman for yielding...")]

In [48]:
# You can delete elements from `corpus`
del corpus[:10]
corpus

Corpus(90 docs; 67161 tokens)

We can also get basic statistics of the `corpus` object:

In [49]:
corpus.n_docs, corpus.n_sents, corpus.n_tokens

(90, 2943, 67161)

In [50]:
# Word Counts - as a dictionary
counts = corpus.word_freqs(weighting='count', as_strings=True)

In [51]:
# we can get the top-5 frequent words form the `counts` dictionary
sorted(counts.items(), key=lambda x: x[1], reverse=True)[:5]

[('-PRON-', 4726),
 ('people', 241),
 ('mr.', 240),
 ('president', 216),
 ('bill', 209)]

We also introduce a new text metric called __term frequency - inversed document frequency__ (`tf-idf`). 

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use tf–idf.[Source: Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

Term frequency (`tf`) in `tf-idf` is the word frequency we get from above dictionary (`counts`). Now we need to calculate the `idf` part. 

The inverse document frequency is a measure of how much information the word provides, i.e., if it's common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient):

\begin{equation*}
idf(t, D) = log(\frac{N_D}{N_t})
\end{equation*}

In which, $ N_D $ is the number of documents (`doc`) in `corpus` D; and $ N_t $ is the number of `doc`s in $ D $ containing term $ t $.

Looks complicated, right? Fortunately we can use TextaCy's built-in methods to calculate `idf`. 

In [52]:
idf = corpus.word_doc_freqs(as_strings=True, weighting='idf')

In [53]:
sorted(idf.items(), key=lambda x: x[1], reverse=True)[:5]

[('355', 4.51085950651685),
 ('absent', 4.51085950651685),
 ('unavoidably', 4.51085950651685),
 ('ordering', 4.51085950651685),
 ('con', 4.51085950651685)]

Since now we have both __tf__ as a dict object `counts`; and idf as a dict object `idf`, we can calculate the complete __tf-idf__ metric.

In [54]:
tf_idf = {k: counts[k]/idf[k] for k in idf.keys() & counts}

### YOUR TURN HERE

Please print out the top-20 terms with the highest `tf-idf` values.

In [55]:
#### Complete your code here
sorted(tf_idf.items(), key=lambda x: x[1], reverse=True)[:20]

[('-PRON-', 6654.073638294795),
 ('mr.', 295.9564154851718),
 ('president', 222.81811001485437),
 ('people', 204.4703503128777),
 ('year', 172.23021208927045),
 ('bill', 172.09519917661507),
 ('think', 126.47630887391287),
 ("'s", 121.62794885731749),
 ('work', 118.93710442204127),
 ('american', 106.85316269251574),
 ('know', 102.90929479802848),
 ('$', 99.25286446516297),
 ('money', 92.74811562385676),
 ('good', 92.7279409085042),
 ('want', 90.44774564026228),
 ('time', 88.60415161642001),
 ('federal', 84.80466267593039),
 ('let', 83.60715983553656),
 ('percent', 79.19058576140583),
 ('state', 77.65646307702684)]

Some of above results make sense, such as 'president' and 'bill' and 'act'. But terms like '-PRON-' (referring to pronouns such as 'you' or 'I') and ''s' do not make sense. Similar conclusion can be drawn onto words such as 'a', 'an', 'the', ...

These words are called __stop words__. In computing, stop words are words which are filtered out before or after processing of natural language data (text). Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search. [Source: Wikipedia](https://en.wikipedia.org/wiki/Stop_words).

We can check if a word (a.k.a. token) is stop by using the `is_stop` attribute provided with `textacy.token` object.


In [56]:
my_doc = corpus[0]

for t in my_doc.tokens:
    print(t, t.is_stop)

Mr. False
Speaker False
, False
I False
was True
unavoidably False
absent False
during True
the True
votes False
on True
default False
legislation False
. False
If False
I False
had True
been True
present False
, False
I False
would True
have True
voted False
" False
nay False
" False
on True
the True
motions False
to True
table False
the True
appeal False
of True
the True
ruling False
of True
the True
Chair False
with True
regards False
to True
the True
resolutions False
offered False
by True
Mr. False
Gephardt False
( False
rollcall False
No False
. False
26 False
) False
and True
Ms. False
Jackson False
- False
Lee False
( False
rollcall False
No False
. False
27 False
) False
, False
I False
would True
have True
voted False
" False
nay False
" False
on True
the True
ordering False
of True
the True
previous False
question False
on True
House False
Resolution False
355 False
( False
rollcall False
No False
. False
28 False
) False
. False
I False
would True
have True
voted False
" Fa

### YOUR TURN HERE

Remove all stop words from `my_doc`.

__HINT__: using an `if` statement (with the `is_stop` attribute) in the `for` loop.

In [70]:
#### Complete your code here
not_stop_words =[]
for t in my_doc.tokens:
    if t.is_stop == True:
        del t
    else:
        not_stop_words.append(t)
print(not_stop_words)

        


[Mr., Speaker, ,, I, unavoidably, absent, votes, default, legislation, ., If, I, present, ,, I, voted, ", nay, ", motions, table, appeal, ruling, Chair, regards, resolutions, offered, Mr., Gephardt, (, rollcall, No, ., 26, ), Ms., Jackson, -, Lee, (, rollcall, No, ., 27, ), ,, I, voted, ", nay, ", ordering, previous, question, House, Resolution, 355, (, rollcall, No, ., 28, ), ., I, voted, ", nay, ", H., Con, ., Res, ., 141, (, rollcall, No, ., 29, ), ., I, voted, ", yea, ", H.R., 2924, (, rollcall, No, ., 30, ), .]


Sometimes we only care about __word tokens__, which means we need to filter out _numbers_, _punctuations_, and so forth. 

TextaCy provides a `is_alpha` attribute for that purpose.

In [66]:
for t in my_doc.tokens:
    print(t, t.is_alpha)

Mr. False
Speaker True
, False
I True
was True
unavoidably True
absent True
during True
the True
votes True
on True
default True
legislation True
. False
If True
I True
had True
been True
present True
, False
I True
would True
have True
voted True
" False
nay True
" False
on True
the True
motions True
to True
table True
the True
appeal True
of True
the True
ruling True
of True
the True
Chair True
with True
regards True
to True
the True
resolutions True
offered True
by True
Mr. False
Gephardt True
( False
rollcall True
No True
. False
26 False
) False
and True
Ms. False
Jackson True
- False
Lee True
( False
rollcall True
No True
. False
27 False
) False
, False
I True
would True
have True
voted True
" False
nay True
" False
on True
the True
ordering True
of True
the True
previous True
question True
on True
House True
Resolution True
355 False
( False
rollcall True
No True
. False
28 False
) False
. False
I True
would True
have True
voted True
" False
nay True
" False
on True
H. False
Co

### YOUR TURN HERE

What if we want non-stop and word tokens?

__HINT__: combine the two above steps together.

In [68]:
#### Complete you code here
[t for t in my_doc.tokens if ((not t.is_stop) &(t.is_alpha))]

[Speaker,
 I,
 unavoidably,
 absent,
 votes,
 default,
 legislation,
 If,
 I,
 present,
 I,
 voted,
 nay,
 motions,
 table,
 appeal,
 ruling,
 Chair,
 regards,
 resolutions,
 offered,
 Gephardt,
 rollcall,
 No,
 Jackson,
 Lee,
 rollcall,
 No,
 I,
 voted,
 nay,
 ordering,
 previous,
 question,
 House,
 Resolution,
 rollcall,
 No,
 I,
 voted,
 nay,
 Con,
 Res,
 rollcall,
 No,
 I,
 voted,
 yea,
 rollcall,
 No]

From the results, have you noticed that different forms of the same token may appear in the text? For instance, 'run', 'ran', and 'running' are all different forms of the root word 'run'. Counting them as different words may bias any subsequent model. Thus, it would be ideal to make different forms of the same word to the root. This process is called __lemmatization__.

_Lemmatization_ usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the _lemma_. If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma. Linguistic processing for stemming or lemmatization is often done by an additional plug-in component to the indexing process, and a number of such components exist, both commercial and open-source. [Source: Stanford NLP Group](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

TextaCy provides a `lemma_` attribute for `token` object for exactly this purpose.

In [None]:

for t in my_doc.tokens:
    print(t, t.lemma_)

# Exercise

In this exercise, you are going to complete following tasks.
1. From the `corpus` variable, generate a new list named `jb_lst` that contains all speeches from `Joseph Biden`. (__HINT__: use similar code as we filter `cw` for 'Bernie Sanders'.)
2. Select first 5 speeches (`doc`) from `jb_lst` and store them in a new list named `jb_selected`.
3. For each element in `jb_selected`, print out:
    a. Named Entities (`named_entities()`)
    b. Text Statistics (`TextStats()`)
4. For each element in `jb_selected`, print out the top 20 words based on their tf-idf score.
5. For each element in `jb_selected`, print out __lemmas__ (`lemma_`) for each token if it is not a stop word (`is_stop`) and is a word token (`is_alpha`).

In [73]:
#### Complete your code here
jb_list = []
for i in corpus:
    if i.corpus
jb_list = doc.corpus.get("speaker_name") == "Bernie Sanders"
print(jb_list)

False


This is the end of part 1. We will resume on text analytics (NLP) after the break.