# Basic operations on textual data

## Installing spaCy

`spaCy` is an open-source natural language processing library for Python. It is designed particularly for production use, and it can help us to build applications that process massive volumes of text efficiently. 

First, we need to install spaCy and its English-language model:

In [None]:
!pip install spacy

!python -m spacy download en

In [1]:
import spacy

## Tokenizing the Text

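Tokenization is the process of breaking text into pieces called tokens, such as words, contractions and punctuation marks. Many workflows later discard punctuation marks (, . “ ‘) and spaces, but spaCy keeps each punctuation mark as its own token so that nothing in the original text is lost. spaCy's tokenizer takes Unicode text as input and outputs a sequence of `Token` objects.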

Let’s take a look at a simple example. Imagine we have the following text, and we’d like to tokenize it:

> When learning data science, you shouldn’t get discouraged.

> Challenges and setbacks aren’t failures, they’re just part of the journey.

There are a couple of different ways we can approach this. The first is called __word tokenization__, which means breaking up the text into individual words. This is a critical step for many language processing applications, as they often require input in the form of individual words rather than longer strings of text.

In [2]:
# Word tokenization
from spacy.lang.en import English

# Create a blank English pipeline (it contains only the tokenizer)
nlp = English()

text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

# The "nlp" object is used to create documents with linguistic annotations.
my_doc = nlp(text)

# Create list of word tokens
token_list = []
for token in my_doc:
    token_list.append(token.text)
print(token_list)

['When', 'learning', 'data', 'science', ',', 'you', 'should', "n't", 'get', 'discouraged', '!', '\n', 'Challenges', 'and', 'setbacks', 'are', "n't", 'failures', ',', 'they', "'re", 'just', 'part', 'of', 'the', 'journey', '.', 'You', "'ve", 'got', 'this', '!']


As we can see, spaCy produces a list that contains each token as a separate item. Notice that it has recognized that contractions such as _shouldn’t_ actually represent two distinct words, and it has thus broken them down into two distinct tokens.

In the example above, we first create a language object for English with the `English()` class, which loads the language data for English, and assign it to `nlp`. The `nlp` object is used to create documents with linguistic annotations and various other NLP properties. After creating the document, we build a list of its tokens.
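To make the handling of contractions more concrete, here is a minimal sketch of a word tokenizer built on a single regular expression. It is not spaCy's tokenizer, which relies on much richer language-specific rules; the pattern and the `tokenize` helper are assumptions made purely for illustration.

```python
import re

# Hand-written pattern (an assumption, NOT spaCy's rules). Alternatives are
# tried in order: the "n't" suffix, other apostrophe suffixes such as 're,
# a word stem that is followed by "n't", a plain word, and finally any
# single symbol that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"n't|'\w+|\w+(?=n't)|\w+|[^\w\s]")

def tokenize(text):
    """Return a list of token strings; whitespace is skipped."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("When learning data science, you shouldn't get discouraged!"))
print(tokenize("They're just part of the journey."))
```

Like the spaCy output above, this sketch splits _shouldn't_ into `should` and `n't` and keeps punctuation marks as separate tokens. Unlike spaCy, it only understands straight apostrophes and drops whitespace entirely.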

If we want, we can also break the text into sentences rather than words. This is called __sentence tokenization__. When performing sentence tokenization, the tokenizer looks for specific characters that fall between sentences, like periods, exclamation points, and newline characters. To do this in spaCy we use a processing pipeline: a full pipeline typically includes a tokenizer, a tagger, a parser and an entity recognizer, and we can add a component whose job is to decide where each sentence begins and ends.

In the code below, spaCy tokenizes the text and creates a `Doc` object, which is then passed through each component of the pipeline. A pipeline can contain components such as a tagger, a parser and an entity recognizer, but here we add only the `sentencizer` component and use it to access the sentence tokens.

In [9]:
# sentence tokenization

# Create a blank English pipeline (it contains only the tokenizer)
nlp = English()

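# Create the pipeline 'sentencizer' component
# (spaCy v2 API; in spaCy v3 and later, nlp.add_pipe("sentencizer")
# creates and adds the component in a single step)
sbd = nlp.create_pipe('sentencizer')

# Add the component to the pipeline
nlp.add_pipe(sbd)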

text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

# The "nlp" object is used to create documents with linguistic annotations.
doc = nlp(text)

# create list of sentence tokens
sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)
print(sents_list)

["When learning data science, you shouldn't get discouraged!", "\nChallenges and setbacks aren't failures, they're just part of the journey.", "You've got this!"]
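As a rough illustration of rule-based sentence splitting, the sketch below breaks text after a period, exclamation point or question mark that is followed by whitespace. This is a deliberately naive rule written for this example (the `split_sentences` helper is hypothetical), not the logic of spaCy's `sentencizer`.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

text = ("When learning data science, you shouldn't get discouraged!\n"
        "Challenges and setbacks aren't failures, they're just part of "
        "the journey. You've got this!")

for sentence in split_sentences(text):
    print(repr(sentence))

# The same rule is fooled by abbreviations:
print(split_sentences("Dr. Smith arrived. He sat down."))
```

The sample text is split into the same three sentences as above, except that the newline is discarded instead of being kept at the start of the second sentence. The last line shows the weakness of purely character-based rules: `Dr.` is treated as a sentence of its own, which is one reason why more context-aware approaches exist.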


## Cleaning Text Data: Removing Stopwords
Most text data that we work with is going to contain a lot of words that aren’t actually useful to us. These words, called stopwords, are useful in human speech, but they don’t have much to contribute to data analysis. Removing stopwords helps us eliminate noise and distraction from our text data, and also speeds up the time analysis takes (since there are fewer words to process).

Let’s take a look at the stopwords spaCy includes by default.

In [10]:
# Stop words
# Import the stop words for the English language.
from spacy.lang.en.stop_words import STOP_WORDS
spacy_stopwords = STOP_WORDS

#Printing the total number of stop words:
print('Number of stop words: %d' % len(spacy_stopwords))

# Print the first twenty stop words:
print('First twenty stop words: %s' % list(spacy_stopwords)[:20])

Number of stop words: 326
First twenty stop words: ['the', 'thru', 'within', 'just', 'him', 'do', 'she', 'whatever', 'what', 'below', 'a', 'others', 'front', 'towards', 'thereupon', 'somewhere', 'once', 'above', 'onto', 'whose']
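spaCy stores its stop words in a Python `set`, so the words listed above come back in no particular order and may differ from one run to the next. If a reproducible listing matters, sorting first is one option. The sketch below uses a small hand-picked set as a stand-in for spaCy's list (an assumption, so that it runs without spaCy installed).

```python
# A small hand-picked set standing in for spaCy's STOP_WORDS
# (an assumption made so the snippet runs without spaCy).
stop_words = {"the", "just", "him", "do", "she", "what", "below", "a"}

# Slicing list(stop_words) gives no guaranteed order, whereas sorting
# first yields the same listing on every run.
first_five = sorted(stop_words)[:5]
print(first_five)  # ['a', 'below', 'do', 'him', 'just']
```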


Now that we’ve got our list of stopwords, let’s use it to remove the stopwords from the text string we were working on in the previous section.

In [11]:
# Implementation of stop words:
filtered_sent = []

# The "nlp" object is used to create documents with linguistic annotations.
doc = nlp(text)

# Filter out stop words
for word in doc:
    if not word.is_stop:
        filtered_sent.append(word)
print("Filtered Sentence:", filtered_sent)

Filtered Sentence: [learning, data, science, ,, discouraged, !, 
, Challenges, setbacks, failures, ,, journey, ., got, !]
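To see why removing stopwords reduces noise, consider a simple word-frequency count. The sketch below uses a tiny hand-picked stop list and a made-up token list (both are assumptions for illustration; spaCy's list is far larger) and compares the most frequent word before and after filtering.

```python
from collections import Counter

# Hypothetical, hand-picked stop list for illustration only.
STOP = {"the", "is", "part", "of", "and"}

# Made-up, already lowercased tokens.
tokens = ["the", "journey", "is", "part", "of", "the", "challenge",
          "and", "the", "journey", "matters"]

# Keep only the tokens that are not in the stop list.
filtered = [token for token in tokens if token not in STOP]

print(Counter(tokens).most_common(1))    # [('the', 3)]
print(Counter(filtered).most_common(1))  # [('journey', 2)]
print(len(tokens), "->", len(filtered))  # 11 -> 4
```

Before filtering, the most frequent word is a stopword that says nothing about the topic; afterwards, the most frequent word is a content word, and there are fewer tokens left to process.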


## Lemmatization
Lemmatization is a way of dealing with the fact that while words like connect, connection, connecting, connected, etc. aren’t exactly the same, they all have the same essential meaning: connect. The differences in spelling have grammatical functions in spoken language, but for machine processing, those differences can be confusing, so we need a way to change all the words that are forms of the word connect into the word connect itself.

One method for doing this is called __stemming__. Stemming involves simply lopping off easily identified prefixes and suffixes to produce what’s often the simplest version of a word. Connection, for example, would have the -ion suffix removed and be correctly reduced to connect. This kind of simple stemming is often all that’s needed, but lemmatization, which actually looks up words and their roots (called lemmas) as described in a dictionary, is more precise (as long as the words exist in the dictionary).
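The difference between the two approaches can be sketched in a few lines of plain Python. The suffix list, the lemma table and both helper functions below are toy assumptions made for illustration; real stemmers and lemmatizers use far more complete rules and dictionaries.

```python
# Toy suffix list (an assumption); checked in this order.
SUFFIXES = ("ing", "ion", "ed", "s")

def naive_stem(word):
    """Chop off the first matching suffix, leaving at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Toy dictionary mapping word forms to their lemma (an assumption).
LEMMA_TABLE = {
    "connected": "connect", "connecting": "connect", "connection": "connect",
    "runs": "run", "running": "run", "ran": "run",
}

def lookup_lemma(word):
    """Return the dictionary lemma, or the word itself if it is unknown."""
    return LEMMA_TABLE.get(word, word)

for word in ["connection", "running", "ran", "runner"]:
    print(word, naive_stem(word), lookup_lemma(word))
```

Stemming gets `connection` right but turns `running` into `runn` and leaves the irregular form `ran` untouched, while the dictionary lookup maps both to `run`. The lookup, in turn, leaves `runner` unchanged because that word is missing from the toy table.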

Let's look at a simple example using spaCy.

In [12]:
# Implementing lemmatization
lem = nlp("run runs running runner")
# finding lemma for each word
for word in lem:
    print(word.text,word.lemma_)

run run
runs runs
running running
runner runner


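Notice that in the spaCy output above, every lemma is identical to its token. The `nlp` object we have used so far was created with `English()` rather than loaded from a trained model, and it does not appear to include the data needed for lemmatization. A trained model such as `en_core_web_sm` is needed for that, and for the remaining examples.

## Part-of-Speech Tagging
A word's part of speech describes its function within a sentence (noun, verb, adverb, and so on). To tag parts of speech we need a trained model, so first we install spaCy's small English model, `en_core_web_sm`: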

In [17]:
!python -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.2.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz#egg=en_core_web_sm==2.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz (12.0MB)
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.2.0-cp37-none-any.whl size=12019125 sha256=2b67d6c00a8084294c5f510d6f63564f03ecbca112af1d517651b9a7592f5aca
  Stored in directory: /tmp/pip-ephem-wheel-cache-sq6olszz/wheels/48/5c/1c/15f9d02afc8221a668d2172446dd8467b20cdb9aef80a172a4
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-2.2.0

In [18]:
# POS tagging

# Import the English model en_core_web_sm for vocabulary, syntax & entities
import en_core_web_sm

# Load the en_core_web_sm model
nlp = en_core_web_sm.load()

# The "nlp" object is used to create documents with linguistic annotations.
docs = nlp(u"All is well that ends well.")

for word in docs:
    print(word.text,word.pos_)

All DET
is AUX
well ADV
that PRON
ends VERB
well ADV
. PUNCT
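Once every token has a tag, a common next step is to count or filter tokens by part of speech. The sketch below copies the (token, tag) pairs printed above into plain tuples so that it runs without spaCy; with a real `Doc`, the same pairs would come from `word.text` and `word.pos_`.

```python
from collections import Counter

# (token, tag) pairs copied from the output above.
tagged = [("All", "DET"), ("is", "AUX"), ("well", "ADV"), ("that", "PRON"),
          ("ends", "VERB"), ("well", "ADV"), (".", "PUNCT")]

# How often does each part of speech occur?
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common(1))  # [('ADV', 2)]

# Keep only the adverbs.
adverbs = [token for token, tag in tagged if tag == "ADV"]
print(adverbs)  # ['well', 'well']
```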


## Entity detection

__Entity detection__, also called entity recognition, is a more advanced form of language processing that identifies important elements like places, people, organizations, and languages within an input string of text. This is really helpful for extracting information from text, since you can quickly pick out important topics or identify key sections of text.

Let’s try out some entity detection using a few paragraphs from this [article](https://www.washingtonpost.com/health/2019/04/09/new-york-orders-measles-vaccinations-brooklyn-amid-outbreak-mayor-says/).

In [19]:
# Import displacy from spaCy to visualize the detected entities:

from spacy import displacy

nytimes= nlp(u"""New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases.

At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday.

The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.""")

entities=[(i, i.label_, i.label) for i in nytimes.ents]
entities

[(New York City, 'GPE', 384),
 (Tuesday, 'DATE', 391),
 (At least 285, 'CARDINAL', 397),
 (September, 'DATE', 391),
 (Brooklyn, 'GPE', 384),
 (Williamsburg, 'GPE', 384),
 (four, 'CARDINAL', 397),
 (Bill de Blasio, 'PERSON', 380),
 (Tuesday, 'DATE', 391),
 (Orthodox Jews, 'NORP', 381),
 (as young as 6 months old, 'DATE', 391),
 (up to $1,000, 'MONEY', 394)]

The example above shows that spaCy is able to identify a variety of different entity types, including specific locations (GPE), date-related words (DATE), important numbers (CARDINAL), specific individuals (PERSON), etc.
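A flat list of entities is often easier to work with once it is grouped by label. The sketch below copies the entity texts and labels from the output above into plain tuples (so it runs without spaCy) and groups them with a `defaultdict`; with a real `Doc`, the pairs would come from `ent.text` and `ent.label_`.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above.
entities = [
    ("New York City", "GPE"), ("Tuesday", "DATE"), ("At least 285", "CARDINAL"),
    ("September", "DATE"), ("Brooklyn", "GPE"), ("Williamsburg", "GPE"),
    ("four", "CARDINAL"), ("Bill de Blasio", "PERSON"), ("Tuesday", "DATE"),
    ("Orthodox Jews", "NORP"), ("as young as 6 months old", "DATE"),
    ("up to $1,000", "MONEY"),
]

# Group the entity texts under their label.
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

print(by_label["GPE"])     # ['New York City', 'Brooklyn', 'Williamsburg']
print(by_label["PERSON"])  # ['Bill de Blasio']
```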

Using `displaCy` we can also visualize our input text, with each identified entity highlighted by color and labeled. We’ll use `style = "ent"` to tell displaCy that we want to visualize entities here.

In [20]:
displacy.render(nytimes, style="ent", jupyter=True)


## Dependency Parsing
__Dependency parsing__ is a language processing technique that helps us determine the meaning of a sentence by analyzing how it’s constructed and how the individual words relate to each other.

Consider, for example, the sentence “Bill throws the ball.” We have two nouns (Bill and ball) and one verb (throws). But we can’t just look at these words individually, or we may end up thinking that the ball is throwing Bill! To understand the sentence correctly, we need to look at the word order and sentence structure, not just the words and their parts of speech.
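To see how structure settles who is throwing what, a parse can be written down as a list of arcs, each linking a word to its head with a dependency label. The arcs below are written by hand for this sentence, not produced by a parser; the labels (`nsubj` for the subject, `dobj` for the direct object) follow common conventions, and `find_role` is a hypothetical helper.

```python
# Hand-written (word, dependency label, head word) arcs for
# "Bill throws the ball." These are an assumption, not parser output.
arcs = [
    ("Bill", "nsubj", "throws"),
    ("throws", "ROOT", "throws"),
    ("the", "det", "ball"),
    ("ball", "dobj", "throws"),
]

def find_role(arcs, verb, label):
    """Return the word attached to `verb` with the given label, if any."""
    for word, dep, head in arcs:
        if dep == label and head == verb:
            return word
    return None

thrower = find_role(arcs, "throws", "nsubj")
thrown = find_role(arcs, "throws", "dobj")
print(f"{thrower} is throwing the {thrown}")  # Bill is throwing the ball
```

Both nouns attach to the same verb, and it is the label on each arc, not the part of speech, that tells us which noun acts and which is acted upon.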

Below, we have a short sentence. We’ll use another spaCy feature called `noun_chunks`, which breaks the input down into nouns and the words describing them. We iterate through each chunk in our source text and print the chunk's text, its root word, the root's dependency label, and the head word that the root is attached to.

In [22]:
docp = nlp(" In pursuit of a wall, President Trump ran into one.")

for chunk in docp.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)

pursuit pursuit pobj In
a wall wall pobj of
President Trump Trump nsubj ran


Let's see this output visually!

In [23]:
displacy.render(docp, style="dep", jupyter=True)