### Connecting Google Colab with your Google Drive

In [None]:
from google.colab import drive

drive.mount("/content/drive")

In [None]:
import os
os.chdir("/content/drive/MyDrive/nlp/textpre/Text_Preprocessing")
!ls

# <center>Text Preprocessing</center>

Text data falls into the category of unstructured data and requires some preparation before it can be used for modeling. Text preparation differs from pre-processing of structured data. In this activity we will use spaCy to prepare text.




In [None]:
from IPython.display import Image
Image(filename='./img/spacy_img.jpg')


## Why spaCy?

* Is a free and open-source library developed by Explosion AI.
* Works well for simple to complex language understanding tasks and is designed specifically for production use.
* Helps build applications that process and “understand” large volumes of text.

### Features of spaCy

spaCy provides [trained models](https://spacy.io/models/en) for many languages, as well as a multi-language model.

Before jumping in, let's look at the features offered by popular NLP libraries and how their performance compares to spaCy.

In [None]:
Image(filename='./img/Spacy_Features.png')

### Feature Comparison

In [None]:
Image(filename='./img/feature-comparision.png')

### Speed Comparison

In [None]:
Image(filename='./img/speed-comparision.png')

### spaCy Installation

To get started with spaCy, install the package using pip from the Terminal (macOS) or the Command Prompt (Windows).

__Note__: No separate installation of spaCy is required when running in Colab.

The pre-trained language model packages can be downloaded using the `spacy download` command. We will download the `en_core_web_sm` package. The name breaks down as follows:

- en = English
- core = Core (Vocab, Syntax, Entities, Vectors)
- web = Web Text
- sm/md/lg = Small/Medium/Large

``` Python
# Option 1: install with pip
!pip install -U spacy

# Option 2: install with conda
!conda install -c conda-forge spacy

# Download the small English model
!python -m spacy download en_core_web_sm
```

In [None]:
import spacy
import re

In [None]:
# Load the pre-trained language model
nlp = spacy.load('en_core_web_sm')

# If the above command fails, uncomment the lines below and run them instead.
# import en_core_web_sm
# nlp = en_core_web_sm.load()

# Create a spaCy Doc object
doc = nlp("Hello World")

# Print the document text
print(doc)

In [None]:

# doc1 = nlp(string)
# doc1

#### We will be using a sample text to demonstrate various text pre-processing steps.

In [None]:
string = '''At Waterloo we were fortunate in catching a train for Leatherhead, where we hired a trap at the station inn and drove for four or five miles through the lovely Surrey lanes.
It was a perfect day, with a bright sun and a few fleecy clouds in the heavens. The trees and wayside hedges were just throwing out their first green shoots, and the air was full of the pleasant smell of the moist earth. To me at least there was a strange contrast between the sweet promise of the spring and this sinister quest upon which we were engaged.
My companion Mr. Alfred sat in the front of the trap, his arms folded, his hat pulled down over his eyes, and his chin sunk upon his breast, buried in the deepest thought.
Suddenly, however, he started, tapped me on the shoulder, and pointed over the meadows.
The train was @09:30 AM and we have to reach the station by 08:30 AM. At Waterloo we were fortunate in catching a train for Leatherhead, where we hired a trap at the station inn and drove for four or five miles through the lovely Surrey lanes.
It was a perfect day, with a bright sun and a few fleecy clouds in the heavens.'''

In [None]:
string

In [None]:
type(string)

In [None]:
len(string)

In [None]:
string.count("Waterloo")

In [None]:
string[5]

In [None]:
string[:3]

#### Usually we will be reading the data from files, so let's do the same.

In [None]:
# Change the working directory using ".chdir()" method
PATH = os.getcwd()

DATA_PATH = os.path.join(PATH, "data")

os.chdir(DATA_PATH)
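
The cell above assumes that a `data` folder exists under the current directory. A minimal sketch of a more defensive variant, which builds the path with `os.path.join` and checks it before changing directory, might look like this. The temporary directory and the `data` subfolder created here are stand-ins for illustration only, not part of the original project layout.

```python
import os
import tempfile

# Stand-in project folder so the sketch runs anywhere (hypothetical layout).
project_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(project_dir, "data"))

data_path = os.path.join(project_dir, "data")

# Remember where we started, then change directory only if the target exists.
original_dir = os.getcwd()
if os.path.isdir(data_path):
    os.chdir(data_path)
    print("Now in:", os.getcwd())
else:
    print("Missing folder:", data_path)

# Return to the starting directory so later code is unaffected.
os.chdir(original_dir)
```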

In [None]:
DATA_PATH

In [None]:
# We can verify the files that are present in the path
print(os.listdir())

In [None]:
# Reading from a text file
with open('sample_text.txt', 'r') as f:
    string = f.read()

# 'r' in the code stands for read operation. One would use 'w' to write to a file and 'a' to append to an existing file
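
The three modes mentioned in the comment can be tried together in a small sketch. The scratch file name `notes.txt` and the temporary directory are hypothetical and used only so the example does not touch the course data.

```python
import os
import tempfile

# Hypothetical scratch file; the name is for illustration only.
path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# 'w' creates the file (or truncates an existing one) and writes to it.
with open(path, "w", encoding="utf-8") as f:
    f.write("first line\n")

# 'a' keeps the existing content and appends at the end.
with open(path, "a", encoding="utf-8") as f:
    f.write("second line\n")

# 'r' opens the file for reading only.
with open(path, "r", encoding="utf-8") as f:
    content = f.read()

print(content)
```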

In [None]:
string

### Now that our sample text is ready, let us perform the following steps:

1. Sentence Tokenizing
2. Word Tokenizing
3. Stop Word Removal
4. Lemmatization

# Tokenizing

#### Sentence tokenize - splits the entire text into sentences.

In [None]:
# string="a boy hello 1#"
string="sentence tokenization is Done. It is a part of preprocessing"

In [None]:
doc = nlp(string)                         # doc is a spaCy Doc object

sent_tokens = [sent.text for sent in doc.sents]  # list comprehension over the sentences

In [None]:
list(doc.sents)

In [None]:
len(sent_tokens)

In [None]:
for sent in sent_tokens:
    print(sent)
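
To see why a trained pipeline is used for this step, it helps to compare it with a naive rule. The sketch below is not how spaCy segments sentences. It simply splits after `.`, `!` or `?` followed by whitespace, using Python's `re` module, and shows how an abbreviation such as "Mr." trips it up. The sample text is adapted from the passage used earlier.

```python
import re

sample = "My companion Mr. Alfred sat in the front of the trap. Suddenly he started."

# Naive rule: split on whitespace that follows sentence-ending punctuation.
naive_sentences = re.split(r"(?<=[.!?])\s+", sample)

for s in naive_sentences:
    print(repr(s))
```

The rule yields three pieces instead of two, because it cannot tell that the period in "Mr." does not end a sentence.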

#### Word tokenize - splits strings into words and also separates punctuation

In [None]:
tokens = []
for token in doc:
    # print(token.text)
    tokens.append(token.text)
print(tokens)

In [None]:
def tok(stri):
    # Tokenize a raw string with spaCy and return the token texts
    doc = nlp(stri)
    tokens = []
    for token in doc:
        tokens.append(token.text)
    return tokens

In [None]:
# Just for the explanation

for token in doc:
  print(token, ":", type(token))
  print(token.text, ":", type(token.text))
  break

In [None]:
print(tokens[0:11])

## Regular-Expression Tokenizers

**What is Regular Expression?**

A regular expression (RegEx) is a special text string that describes a search pattern. It is extremely useful for extracting information from text such as code, files, logs, spreadsheets or documents.

When working with regular expressions, the first thing to recognize is that everything is essentially a character: we write patterns to match a specific sequence of characters, also referred to as a string. ASCII covers the Latin letters, digits, punctuation and special characters such as $#@!% found on a typical keyboard, while Unicode is needed to match text in other scripts.
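
The difference between ASCII and Unicode matching can be seen with `\w`. In Python 3, `\w` in a string pattern matches Unicode word characters by default, and the `re.ASCII` flag restricts it to `[a-zA-Z0-9_]`. The sample text below is made up for illustration.

```python
import re

# "café naïve résumé", written with escapes so the source stays ASCII-only.
text = "caf\u00e9 na\u00efve r\u00e9sum\u00e9"

# Default behaviour: accented letters count as word characters.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, accented letters break the words apart.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```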

**Let's try regular expressions using `re`**

The __re__ module included with Python is primarily used for string searching and manipulation. It is quite useful for text extraction and pre-processing. The most common use for __re__ is to search for patterns in text.

A regular-expression tokenizer splits a string into substrings using a regular expression. For example, the pattern `\w+|\$[\d.]+|\S+` forms tokens out of alphabetic sequences, money expressions, and any other non-whitespace sequences. For more information or different variations, see https://docs.python.org/3/library/re.html

https://spacy.io/usage/rule-based-matching
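
A minimal sketch of such a regular-expression tokenizer using `re.findall`. The dollar sign is written as `\$` because an unescaped `$` is the end-of-string anchor rather than a literal character. The sample sentence is made up for illustration.

```python
import re

# Alphanumeric runs, OR money expressions, OR any other non-whitespace run.
pattern = r"\w+|\$[\d.]+|\S+"

sentence = "Good muffins cost $3.88 in New York."
regex_tokens = re.findall(pattern, sentence)

print(regex_tokens)
```

Alternatives are tried from left to right at each position, so `$3.88` is kept as one token by the second alternative, and the final period falls through to `\S+`.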

**Regular Expressions**
1. `\d` - Matches any decimal digit; for ASCII text this is equivalent to the class [0-9].
2. `\D` - Matches any non-digit character; for ASCII text this is equivalent to the class [^0-9].
3. `\s` - Matches any whitespace character (space, tab, newline and so on).
4. `\S` - Matches any non-whitespace character.
5. `\w` - Matches any alphanumeric character or underscore; for ASCII text this is equivalent to the class [a-zA-Z0-9_].
6. `\W` - Matches any character that is not alphanumeric or an underscore; for ASCII text this is equivalent to the class [^a-zA-Z0-9_].

The special characters are:

7. `.` - (Dot) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
8. `^` - (Caret) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.
9. `$` - Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline.
10. `*` - Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.
11. `+` - Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.
12. `?` - Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.
13. `\` - Either escapes special characters (permitting you to match characters like `*`, `?`, and so forth), or signals a special sequence; special sequences are discussed below.
14. `[]` - Used to indicate a set of characters. Characters in a set can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.
15. `|` - A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the `|` in this way.
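
Several of the entries above can be checked directly in Python. This sketch uses a small hypothetical helper, `matches`, built on `re.fullmatch` so that a pattern must cover the whole string. The sample strings are made up for illustration.

```python
import re

def matches(pattern, s):
    """Return True when the whole string s fits the pattern."""
    return re.fullmatch(pattern, s) is not None

# '*' allows zero or more 'b', '+' needs at least one, '?' allows at most one.
star = [s for s in ["a", "ab", "abbb", "b"] if matches(r"ab*", s)]
plus = [s for s in ["a", "ab", "abbb"] if matches(r"ab+", s)]
optional = [s for s in ["a", "ab", "abb"] if matches(r"ab?", s)]

# A set matches any single listed character.
char_set = re.findall(r"[amk]", "make a mark")

# '|' matches either alternative, wherever it occurs.
either = re.findall(r"cat|dog", "cat, dog and catalog")

# '\d+' matches runs of digits.
digits = re.findall(r"\d+", "Opened in 2011, expanded in 2016")

print(star, plus, optional, char_set, either, digits, sep="\n")
```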


### Examples to understand regular expressions

Matches one or more occurrences of an alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].


In [None]:
print([w for w in tokens if re.search(r'\w+', w)])


In [None]:
print([w for w in tokens if re.search(r'\w', w)])

Matches a token that starts with a capital letter followed by zero or more alphanumeric characters ([a-zA-Z0-9_]). The `^` anchor is needed because `re.search` otherwise matches anywhere in the token.

In [None]:
print([w for w in tokens if re.search(r'^[A-Z]\w*', w)])

In [None]:
tokens = tok("deekshitha95")

Matches a string that contains "ee".

In [None]:
[w for w in tokens if re.search(r'\w*ee\w*', w)]

Matches a string that contains digits.

In [None]:
[w for w in tokens if re.search(r'\d+', w)]

In [None]:
text = nlp('''International School of Engineering (INSOFE) is an Applied Engineering school with area of focus in Data Science. It is located in Hyderabad, Bengaluru and Mumbai. It opened in 2011.
The program is delivered through classroom only sessions and is suitable for students and working professionals. Dr. Dakshinamurthy V Kolluru, Dr. Sridhar Pappu and A S L Ganapathi Kumar started the institution in Hyderabad in mid-2011 and expanded to Bengaluru in early-2016. Initially the school functioned under mentorship of Dr. Dakshinamurthy, Dr. Sridhar and Dr. Sreerama Murthy. They are now supported by a team of additional mentors and in-house data scientists.
In 2012, INSOFE also started Corporate training services. It extended operations to Bengaluru in 2016. CIO.com listed INSOFE 3rd in their list of "16 Big Data Certifications That Will Pay Off" consecutively from 2013-2016. Silicon India Magazine listed INSOFE in their list of "Top 5 Big Data Training Institutes 2016". Analytics India Magazine, listed INSOFE in "Top 9 Analytics Training Institutes in India in 2016". KDnuggets mentioned INSOFE in their list of Certificates in Analytics, Data Mining, and Data Science in 2014.
''')

Tokenize the text


In [None]:
tokens1 = [token.text for token in text]
print(tokens1)

Match tokens that start with `I`

In [None]:
print([w for w in tokens1 if re.search('^I', w)])

Match all tokens that end with either `ing` or `uru`

In [None]:
[w for w in tokens1 if re.search('ing$|uru$', w)]

In [None]:
# Note: '!' is a literal character here, not a negation operator
[w for w in tokens1 if re.search(r'!ing$|uru$', w)]
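
To select tokens that do not end in `ing`, `!` will not help, since Python's `re` matches it as an ordinary character. A sketch of two alternatives, assuming exclusion was the intent: negate the result of `re.search`, or use a negative lookbehind `(?<!ing)$`. The short `sample_tokens` list is a hypothetical stand-in for `tokens1`.

```python
import re

sample_tokens = ["Engineering", "Bengaluru", "training", "school", "wow!ing"]

# '!ing$' only matches a literal "!ing" at the end of a token.
literal_bang = [w for w in sample_tokens if re.search(r"!ing$|uru$", w)]

# Option 1: negate the match in Python.
not_ing = [w for w in sample_tokens if not re.search(r"ing$", w)]

# Option 2: negative lookbehind, "end of string not preceded by 'ing'".
lookbehind = [w for w in sample_tokens if re.search(r"(?<!ing)$", w)]

print(literal_bang)
print(not_ing)
print(lookbehind)
```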

Match all tokens that start with `H`, `B` or `M`

In [None]:
[w for w in tokens1 if re.search('^[HBM]', w)]
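
Inside a character class, `|` is an ordinary character rather than the "or" operator, so `[H|B|M]` also accepts a leading `|`. The sketch below compares the two spellings on a small hypothetical list.

```python
import re

with_pipes = re.compile(r"^[H|B|M]")   # the class contains H, B, M and '|'
plain = re.compile(r"^[HBM]")          # the class contains only H, B and M

candidates = ["Hyderabad", "Mumbai", "|pipe", "India"]

matched_with_pipes = [w for w in candidates if with_pipes.search(w)]
matched_plain = [w for w in candidates if plain.search(w)]

print(matched_with_pipes)
print(matched_plain)
```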

Search for the words Hyderabad, Bengaluru and Mumbai

In [None]:
[w for w in tokens1 if re.search('^(?:Hyd|Ben|Mum)', w)]
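
Anchors and alternation interact in a way that is easy to miss: in `^Hyd|Ben|Mum`, the `^` applies only to `Hyd`, while `Ben` and `Mum` may match anywhere in the token. Grouping the alternatives applies the anchor to all of them. The word list below is hypothetical and chosen to expose the difference.

```python
import re

words = ["Hyderabad", "Bengaluru", "Mumbai", "BigBen", "Hydra"]

# Reads as: (^Hyd) OR (Ben anywhere) OR (Mum anywhere)
loose = [w for w in words if re.search(r"^Hyd|Ben|Mum", w)]

# The non-capturing group makes '^' apply to every alternative.
anchored = [w for w in words if re.search(r"^(?:Hyd|Ben|Mum)", w)]

# To accept only the three city names, match the whole token.
exact = [w for w in words if re.fullmatch(r"Hyderabad|Bengaluru|Mumbai", w)]

print(loose)
print(anchored)
print(exact)
```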

Search for the words 'Data', 'Analytics' or 'Science'

In [None]:
[w for w in tokens1 if re.search('Data|Ana|Sci', w)]

Match tokens that end with `es`

In [None]:
[w for w in tokens1 if re.search('es$', w)]

Extract tokens that contain digits

In [None]:
[w for w in tokens1 if re.search('[0-9]', w)]

# Lower case
Converting the tokens to lower case

In [None]:
tokens = [token.lower() for token in tokens]
print(tokens)

# Stopwords

A stop word is a commonly used word (such as "a", "an", "it", "in", "the") that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.

We would not want these words taking up space in our database or valuable processing time. They are easy to remove once we have a list of the words we consider to be stop words. spaCy ships stop-word lists for many languages; the English list is available as the `STOP_WORDS` set in `spacy.lang.en.stop_words`.

Note: `STOP_WORDS` is an ordinary Python set, so you can extend it with words of your choice before filtering.

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS

print(STOP_WORDS)

Stopword removal

In [None]:
tokens = [token for token in tokens if token not in STOP_WORDS]
print(len(tokens))
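
The filtering step itself is plain Python set membership. Here is a minimal sketch, using a small hypothetical `DEMO_STOP_WORDS` set as a stand-in for spaCy's `STOP_WORDS` so that it runs without any model. The comparison is case-sensitive, which is why the tokens are lower-cased before filtering.

```python
# Hypothetical stand-in, much smaller than spaCy's real list.
DEMO_STOP_WORDS = {"a", "is", "it", "of", "the"}

raw_tokens = ["Sentence", "tokenization", "is", "Done", ".",
              "It", "is", "a", "part", "of", "preprocessing"]

# Without lower-casing, "It" is not recognised as a stop word.
case_sensitive = [t for t in raw_tokens if t not in DEMO_STOP_WORDS]

# Lower-case first, then filter.
lowered = [t.lower() for t in raw_tokens]
filtered = [t for t in lowered if t not in DEMO_STOP_WORDS]

print(case_sensitive)
print(filtered)
```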

# Lemmatizers

A lemma is the root (dictionary) form of a word. Lemmatization helps shrink the bag of words by mapping all inflected forms of a word to the same root.

* Lemmatizers use a corpus. The result is always a dictionary word.
* Lemmatizers need extra info about the part of speech they are processing.
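
To illustrate why the part of speech matters, here is a toy dictionary-lookup lemmatizer. It is only a sketch: the `LEMMA_TABLE` entries and the `toy_lemma` helper are hypothetical and far simpler than what spaCy does internally.

```python
# Hypothetical lookup table keyed on (lower-cased word, part of speech).
LEMMA_TABLE = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("went", "VERB"): "go",
    ("flies", "VERB"): "fly",
    ("flies", "NOUN"): "fly",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}

def toy_lemma(word, pos):
    """Return the table entry for (word, pos), or the lower-cased word if unknown."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

# The same surface form can have different lemmas depending on its role.
print(toy_lemma("saw", "VERB"))
print(toy_lemma("saw", "NOUN"))
print(toy_lemma("meeting", "VERB"))
print(toy_lemma("meeting", "NOUN"))
```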

Note: In spaCy v2.x, all English pronouns are lemmatized to the special token `-PRON-`. Unlike verbs and common nouns, there is no clear base form of a personal pronoun. Should the lemma of "me" be "I", or should person be normalized as well, giving "it", or maybe "he"? spaCy v2's solution was to introduce a novel symbol, `-PRON-`, used as the lemma for all personal pronouns. spaCy v3 dropped this special case, so with a current installation you should expect ordinary word forms rather than `-PRON-`.

https://spacy.io/usage#pron-lemma

https://spacy.io/api/annotation#lemmatization

In [None]:
tokens_new = nlp("going gone go goes went")

# Print each token's text together with its lemma
print([(w.text, w.lemma_) for w in tokens_new])

In [None]:
plurals = nlp("Indian caresses flies dies education denied computer computing xyzing done slept")

# Print each token's text together with its lemma
print([(w.text, w.lemma_) for w in plurals])

In [None]:
# Print the lemma of every token in doc
print([(w.lemma_) for w in doc])

#### Let's combine all the above commands into a single function

In [None]:
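def process_text(doc):
    # Tokenize and lower-case, since the stop-word list is lower-case
    tokens = [token.text.lower() for token in doc]

    # Remove stop words
    tokens = [token for token in tokens if token not in STOP_WORDS]

    # Re-parse the remaining tokens so that lemmas can be read from the new doc
    doc_new = nlp(" ".join(tokens))

    # Lemmatize
    tokens_lemma = [w.lemma_ for w in doc_new]

    return tokens_lemma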

In [None]:
print(process_text(nlp(string)))

``` python
# Option 1: install with pip
!pip install wordcloud

# Option 2: install with conda
!conda install -c conda-forge wordcloud=1.6.0
```

In [None]:
from wordcloud import WordCloud

import matplotlib.pyplot as plt

In [None]:
tokens

In [None]:
wordcloud = WordCloud(max_font_size=80, max_words=25, background_color="lavender").generate(string)
plt.figure(figsize=(10,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
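
A word cloud draws more frequent words in a larger font. To inspect the numbers behind such a picture, word frequencies can be counted with `collections.Counter`. This sketch uses a simple regular expression rather than WordCloud's own tokenization and stop-word handling, so its counts may differ from what the library computes. `sample` is a short stand-in for `string`.

```python
import re
from collections import Counter

sample = ("It was a perfect day, with a bright sun and a few fleecy clouds. "
          "The day was bright.")

# Lower-case the text and keep runs of letters only.
words = re.findall(r"[a-z]+", sample.lower())

counts = Counter(words)
top3 = counts.most_common(3)

print(top3)
```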

# Reference links

https://spacy.io/usage

https://spacy.io/usage/rule-based-matching

https://spacy.io/universe/project/textacy

https://chartbeat-labs.github.io/textacy/build/html/api_reference/information_extraction.html#textacy.extract.ngrams

https://docs.python.org/3/library/re.html
