# Introduction to Text Analytics

## Text Analytics and NLP applications

### Automated grading of written homework, Grammar checkers
### Automated teaching systems
### Categorizing/Classifying articles, documents, emails
### Creating Chatbots
### Creative writing
### Cryptography, Steganography
### Fraud detection
### Inter-language translation
### Legal document preparation
### Monitoring social media posts
### Reading books, articles, documentation and absorbing knowledge
### Sentiment analysis
### Text-to-speech engines
### Word games
### and many more…


# Topics we will cover

### Sentiment Analysis: Determine whether feedback and comments are positive, negative, or neutral, and apply that information to customer support, social media monitoring, etc.

### Text Classification: Supervised ML method that assigns text to predefined categories, as in spam filtering.

### Named Entity Recognition (NER): Finding important entities such as people, places, and numbers in the text.


### Topic Modeling: Identifying topics and themes in the text data.

### Text Clustering: Grouping similar texts into clusters for document retrieval, summarization, etc.

# Libraries
### TextBlob: Part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation. Find more: https://textblob.readthedocs.io/en/dev/
### NLTK (Natural Language Toolkit): Comprehensive text analytics API. Find more: https://www.nltk.org/
### Textatistic: Checks the readability of texts. Find more: https://pypi.org/project/textatistic/
### spaCy: More advanced text analytics. Find more: https://spacy.io/usage

# TextBlob
* https://textblob.readthedocs.io/en/latest/
* Object-oriented NLP text-processing library that is built on the NLTK and pattern NLP libraries
* Some of the NLP tasks TextBlob can perform include:
    * Tokenization—splitting text into pieces called tokens, which are meaningful units, such as words and numbers
    * Parts-of-speech (POS) tagging—identifying each word’s part of speech, such as noun, verb, adjective, etc.
    * Noun phrase extraction—locating groups of words that represent nouns, such as “red brick factory.”
        * The phrase “red brick factory” illustrates why natural language is such a difficult subject. Is a “red brick factory” a factory that makes red bricks? Is it a red factory that makes bricks of any color? Is it a factory built of red bricks that makes products of any type? In today’s music world, it could even be the name of a rock band or the name of a game on your smartphone.
    * Sentiment analysis—determining whether text has positive, neutral or negative sentiment.
    * Inter-language translation and language detection powered by Google Translate.


## Installing the TextBlob Module
```
conda install -c conda-forge textblob
```

* Once installation completes, execute the following command to download the NLTK corpora used by TextBlob:
```
ipython -m textblob.download_corpora
```

# TextBlob (cont.)
### Project Gutenberg
* Great source of text for analysis is the free e-books at Project Gutenberg:
> https://www.gutenberg.org
* Over 57,000 e-books in various formats, including plain text files
* Out of copyright in the United States
* [Terms of Use and copyright in other countries](https://www.gutenberg.org/wiki/Gutenberg:Terms_of_Use)
* Some examples use the plain-text e-book file for Shakespeare’s Romeo and Juliet:
https://www.gutenberg.org/ebooks/1513
* Project Gutenberg does not permit automated (robot) access to its pages, so download a copy of each book before processing it programmatically
> https://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages
* To download
    * Right click the Plain Text UTF-8 link on a book’s web page
    * Select **Save Link As…** (Chrome/FireFox), **Download Linked File As…** (Safari) or **Save target as** (Microsoft Edge)
* Save Romeo and Juliet as `RomeoAndJuliet.txt` in the `ch12` examples folder
* **For analysis purposes, we removed the Project Gutenberg text before "THE TRAGEDY OF ROMEO AND JULIET", as well as the Project Gutenberg information at the end of the file starting with: "End of the Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare"**
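
The trimming step described above can also be done in code. The following is a minimal sketch, assuming the two marker strings appear exactly as quoted; the helper name `strip_gutenberg_boilerplate` and the short `sample` string are hypothetical stand-ins, not part of the original examples.

```python
def strip_gutenberg_boilerplate(raw_text, start_marker, end_marker):
    """Return the text from start_marker up to (not including) end_marker.

    If a marker is missing, that side of the text is left untouched.
    """
    start = raw_text.find(start_marker)
    if start == -1:
        start = 0
    end = raw_text.find(end_marker, start)
    if end == -1:
        end = len(raw_text)
    return raw_text[start:end].strip()


# Tiny stand-in for the downloaded file (not the real e-book text)
sample = (
    'Project Gutenberg license header...\n'
    'THE TRAGEDY OF ROMEO AND JULIET\n'
    'Two households, both alike in dignity...\n'
    'End of the Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare\n'
    'License footer...\n'
)

cleaned = strip_gutenberg_boilerplate(
    sample,
    'THE TRAGEDY OF ROMEO AND JULIET',
    'End of the Project Gutenberg EBook of Romeo and Juliet',
)
print(cleaned)
```

With the real file, you would read it with `Path('RomeoAndJuliet.txt').read_text(encoding='utf-8')` and pass the result in place of `sample`.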

## Create a TextBlob

In [1]:
from textblob import TextBlob

In [2]:
text = 'Today is a beautiful day. Tomorrow looks like bad weather.'

In [3]:
blob = TextBlob(text)

In [4]:
blob

TextBlob("Today is a beautiful day. Tomorrow looks like bad weather.")

### `TextBlob`, `Sentence`s and `Word`s Support String Methods and Comparisons 
* `Sentence`s, `Word`s and `TextBlob`s inherit from **`BaseBlob`**, which defines many common methods and properties
* [**`BaseBlob` documentation**](https://textblob.readthedocs.io/en/dev/api_reference.html)

## Tokenizing Text into Sentences and Words
* Getting a list of sentences

In [5]:
blob.sentences

[Sentence("Today is a beautiful day."),
 Sentence("Tomorrow looks like bad weather.")]

* A `WordList` is a subclass of Python’s **built-in list type** with additional NLP methods. 
* Contains TextBlob `Word` objects

In [6]:
blob.words

WordList(['Today', 'is', 'a', 'beautiful', 'day', 'Tomorrow', 'looks', 'like', 'bad', 'weather'])

## Parts-of-Speech Tagging

* Evaluate words based on context to determine **parts of speech**, which can help determine meaning
* Eight primary English parts of speech
    * **nouns**, **pronouns**, **verbs**, **adjectives**, **adverbs**, **prepositions**, **conjunctions** and **interjections** (words that express emotion and that are typically followed by **punctuation**, like “Yes!” or “Ha!”)
    * Many subcategories
* Some words have multiple meanings
    * E.g., “set” and “run” have **hundreds of meanings** each!

In [7]:
blob

TextBlob("Today is a beautiful day. Tomorrow looks like bad weather.")

In [8]:
blob.tags

[('Today', 'NN'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('beautiful', 'JJ'),
 ('day', 'NN'),
 ('Tomorrow', 'NNP'),
 ('looks', 'VBZ'),
 ('like', 'IN'),
 ('bad', 'JJ'),
 ('weather', 'NN')]

## Parts-of-Speech Tagging (cont.)
* `TextBlob` uses a `PatternTagger` to determine parts-of-speech
* Uses [**pattern library**](https://www.clips.uantwerpen.be/pattern) POS tagging
* Pattern's [63 parts-of-speech tags](https://www.clips.uantwerpen.be/pages/MBSP-tags)
* In preceding output:
    * `NN`—a **singular noun** or **mass noun**
    * `VBZ`—a [**third person singular present verb**](https://www.grammar.cl/Present/Verbs_Third_Person.htm)
    * `DT`—a [**determiner**](https://en.wikipedia.org/wiki/Determiner) (the, an, that, this, my, their, etc.)
    * `JJ`—an **adjective**
    * `NNP`—a **proper singular noun**
    * `IN`—a **subordinating conjunction** or **preposition**

## Extracting Noun Phrases
* Preparing to purchase a **water ski**
* Might search for **“best water ski”**—**“water ski”** is a **noun phrase** 
* For best results, search engine must parse the noun phrase properly 
* Try searching for **“best water,”** **“best ski”**,  **“water ski”** and **“best water ski”** and see what you get 

In [9]:
blob.noun_phrases

WordList(['beautiful day', 'tomorrow', 'bad weather'])

* A **`Word`** can represent a noun phrase with **multiple words**. 

## Sentiment Analysis with TextBlob’s Default Sentiment Analyzer
* Determines whether text is **positive**, **neutral** or **negative**. 
* One of the most common and valuable NLP tasks (several later case studies perform it)
* Consider the **positive word “good”** and the **negative word “bad”**
    * Alone they are positive and negative, respectively, but...
    * **The food is not good** — clearly has negative sentiment
    * **The movie was not bad** — clearly has positive sentiment (but not as positive as **The movie was excellent!**)
* Complex **machine-learning problem**, but libraries like TextBlob can do it for you

### Getting the Sentiment of a TextBlob

In [11]:
blob

TextBlob("Today is a beautiful day. Tomorrow looks like bad weather.")

In [10]:
blob.sentiment

Sentiment(polarity=0.07500000000000007, subjectivity=0.8333333333333333)

* **`polarity`** is the **sentiment** — from **`-1.0` (negative)** to **`1.0` (positive)** with **`0.0`** being **neutral**. 
* **`subjectivity`** is a value from **0.0 (objective)** to **1.0 (subjective)**. 

### Getting the polarity and subjectivity from the Sentiment Object
* **`%precision`** magic specifies the **default precision** for **standalone** `float` objects and `float` objects in **built-in types** like lists, dictionaries and tuples:

In [12]:
%precision 3

'%.3f'

In [13]:
blob.sentiment.polarity

0.075

In [14]:
blob.sentiment.subjectivity

0.833

### Getting the Sentiment of a Sentence 
* One is **positive (`0.85`)** and one is **negative (`-0.6999999999999998`)**, which might explain why the entire `TextBlob`’s `sentiment` was close to **`0.0` (neutral)**

In [15]:
blob.sentiment

Sentiment(polarity=0.07500000000000007, subjectivity=0.8333333333333333)

In [16]:
for sentence in blob.sentences:
    print(sentence.sentiment)

Sentiment(polarity=0.85, subjectivity=1.0)
Sentiment(polarity=-0.6999999999999998, subjectivity=0.6666666666666666)
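
A quick arithmetic check shows why the overall polarity is near zero: in this example, the mean of the two sentence polarities matches the overall value of `0.075`. This is an observation about this particular text, not a statement of how TextBlob computes its score in general.

```python
from statistics import mean

# Sentence polarities copied from the output above
sentence_polarities = [0.85, -0.6999999999999998]

overall = mean(sentence_polarities)
print(round(overall, 3))  # 0.075
```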


## Language Detection and Translation
* **Google Translate**, **Microsoft Bing Translator** and others can translate between scores of languages instantly
* These services are now working toward **near-real-time translation**
    * Converse in real time with people who do not know your natural language
* In the **IBM Watson** presentation, we'll develop a script that does **inter-language translation**

In [17]:
blob

TextBlob("Today is a beautiful day. Tomorrow looks like bad weather.")

## Inflection: Pluralization and Singularization
* Inflections are different forms of the same word, such as singular and plural (like “person” and “people”) and different verb tenses (like “run” and “ran”)
* When you’re calculating word frequencies, you might first want to convert all inflected words to the same form for more accurate word frequencies

In [18]:
from textblob import Word

In [19]:
index = Word('index')

In [20]:
index.pluralize()

'indices'

In [21]:
cacti = Word('cacti')

In [22]:
cacti.singularize()

'cactus'

* Pluralizing and singularizing are not as simple as adding or removing an “s” or “es” at the end of a word

In [23]:
from textblob import TextBlob

In [24]:
animals = TextBlob('dog cat fish bird').words

In [25]:
animals.pluralize()

WordList(['dogs', 'cats', 'fish', 'birds'])

## Spell Checking and Correction
* For natural language processing tasks, it’s important that the text be free of spelling errors
* A `Word`’s **`spellcheck` method** returns a list of tuples containing **possible correct spellings** and **confidence values**
* Assume we meant to type “they” but misspelled it as “theyr”

In [26]:
from textblob import Word

In [27]:
word = Word('theyr')

In [28]:
word.spellcheck()

[('they', 0.571), ('their', 0.429)]

In [29]:
word.correct()  # chooses word with the highest confidence value

'they'

* Word with the highest confidence value might not be the correct word for the given context
* `TextBlob`s, `Sentence`s and `Word`s all have a `correct` method that you can call to correct spelling
* Calling `correct` on a `Word` returns the correctly spelled word that has the highest confidence value
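
Choosing the highest-confidence candidate can be sketched in plain Python. The candidate list below is copied from the `word.spellcheck()` output above (with the rounded confidence values shown there); this illustrates the selection rule only and is not TextBlob's internal code.

```python
# Candidate list copied from the word.spellcheck() output above
candidates = [('they', 0.571), ('their', 0.429)]

# Pick the (word, confidence) pair with the largest confidence
best_word, best_confidence = max(candidates, key=lambda pair: pair[1])
print(best_word)  # they
```

Because the choice depends only on confidence, a sentence that needed “their” would still be "corrected" to “they”.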

In [30]:
from textblob import TextBlob, Word

In [31]:
sentence = TextBlob('Ths sentense has missplled wrds.')

In [32]:
sentence.correct()

TextBlob("The sentence has misspelled words.")

## Normalization: Stemming and Lemmatization
* **Stemming** removes a **prefix** or **suffix** from a word leaving only a **stem**, which **may or may not be a real word**
* **Lemmatization** is similar, but factors in the word’s **part of speech** and **meaning** and results in a **real word**
* Both **normalize** words for analysis
    * Before calculating statistics on words in a body of text, you might convert all words to lowercase so that capitalized and lowercase words are not treated differently.
* You might want to use a word’s root to represent the word’s many forms.
    * E.g., treat "program" and "programs" as "program"

In [33]:
word = Word('varieties')

In [34]:
word.stem()

'varieti'
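
To make the idea of normalization concrete, here is a deliberately naive sketch that lowercases a word and strips one trailing plural “s”. The function `naive_normalize` is hypothetical and is not the stemming algorithm TextBlob uses; it only shows why “program” and “programs” can be counted as one word.

```python
def naive_normalize(word):
    """Lowercase a word and strip one trailing plural 's' (toy rule only)."""
    word = word.lower()
    if len(word) > 3 and word.endswith('s') and not word.endswith('ss'):
        return word[:-1]
    return word


for raw in ['Program', 'programs', 'PROGRAMS', 'class', 'varieties']:
    print(raw, '->', naive_normalize(raw))
```

The last word becomes `varietie`, which shows why real stemmers and lemmatizers need far more than a single suffix rule.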

**NOTE: Before running this notebook, place a copy of your downloaded RomeoAndJuliet.txt file in the same folder with this notebook.**

##  Word Frequencies
* Various techniques for detecting **similarity between documents** rely on **word frequencies**
* `TextBlob` can count word frequencies for you
* When you read a file with `Path`’s `read_text` method, it closes the file immediately after it finishes reading the file

In [35]:
from pathlib import Path

In [36]:
from textblob import TextBlob

In [38]:
blob = TextBlob(Path('RomeoAndJuliet.txt').read_text(encoding='utf-8'))  # the downloaded file is UTF-8 plain text

* Access the word frequencies through the `TextBlob`’s `word_counts` dictionary

In [39]:
blob.word_counts['juliet']

195

In [40]:
blob.word_counts['romeo']

320

In [41]:
blob.word_counts['thou']

278

* If you already have tokenized a `TextBlob` into a `WordList`, you can count specific words in the list via the `count` method

In [42]:
blob.words.count('joy')

14

In [43]:
blob.noun_phrases.count('lady capulet')

46
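
Conceptually, a word-frequency table is a case-insensitive tally of tokens. The sketch below uses only the standard library and a simple regular-expression tokenizer, which is an assumption made for illustration; TextBlob's own tokenizer is more careful, so counts on real text may differ.

```python
import re
from collections import Counter


def count_words(text):
    """Count case-insensitive word frequencies using a simple regex tokenizer."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# A short line from the play, used as stand-in text
line = 'O Romeo, Romeo, wherefore art thou Romeo?'
counts = count_words(line)

print(counts['romeo'])   # 3
print(counts['juliet'])  # 0 (a Counter returns 0 for missing keys)
```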

##  Getting Definitions, Synonyms and Antonyms from WordNet
### Getting Definitions
* [**WordNet**](https://wordnet.princeton.edu/) is an **English word database** created by **Princeton University**
* TextBlob uses **NLTK’s WordNet interface** to look up word **definitions**, and get **synonyms** and **antonyms** 
* [NLTK WordNet interface documentation](https://www.nltk.org/api/nltk.corpus.reader.html#module-nltk.corpus.reader.wordnet)

In [44]:
from textblob import Word

In [45]:
happy = Word('happy')

* `Word` class’s `definitions` property returns a list of all the word’s definitions in the WordNet database

## Deleting Stop Words
* Stop words are common words that are often removed before analysis because they carry little useful information
* NLTK's list is returned by the `stopwords` module’s [`words` function](https://www.nltk.org/book/ch02.html)

| NLTK’s English stop words list
| :---
| `['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some', 'such', 't', 'than', 'that', "that'll", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 've', 'very', 'was', 'wasn', "wasn't", 'we', 'were', 'weren', "weren't", 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'won', "won't", 'wouldn', "wouldn't", 'y', 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves']` 
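
Removing stop words comes down to a membership test per token. The following minimal sketch uses a small hand-picked subset of the list above rather than loading it from NLTK, and it lowercases each token for the comparison, since the list is all lowercase.

```python
# A small, hand-picked subset of the list above; a real run would use the full list
stop_words = {'a', 'is', 'the', 'to', 'and'}

tokens = ['Today', 'is', 'a', 'beautiful', 'day',
          'Tomorrow', 'looks', 'like', 'bad', 'weather']

# Lowercase each token for the comparison so capitalized stop words are also removed
filtered = [token for token in tokens if token.lower() not in stop_words]
print(filtered)
```

Using a `set` keeps each membership test fast, which matters when filtering thousands of documents.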

## n-grams 
* [**n-gram**](https://en.wikipedia.org/wiki/N-gram) &mdash; a sequence of **n** text items, such as letters in words or words in a sentence. 
* Used to identify letters or words that frequently appear adjacent to one another
    * **Predictive text input**
    * **Speech-to-text**

In [46]:
from textblob import TextBlob

In [47]:
text = 'Today is a beautiful day. Tomorrow looks like bad weather.'

In [48]:
blob = TextBlob(text)

* `TextBlob`’s `ngrams` method produces a list of `WordList` n-grams of length three by default—known as trigrams
* Use keyword argument `n` to produce n-grams of any desired length

In [49]:
blob.ngrams()

[WordList(['Today', 'is', 'a']),
 WordList(['is', 'a', 'beautiful']),
 WordList(['a', 'beautiful', 'day']),
 WordList(['beautiful', 'day', 'Tomorrow']),
 WordList(['day', 'Tomorrow', 'looks']),
 WordList(['Tomorrow', 'looks', 'like']),
 WordList(['looks', 'like', 'bad']),
 WordList(['like', 'bad', 'weather'])]

In [50]:
blob.ngrams(n=5)

[WordList(['Today', 'is', 'a', 'beautiful', 'day']),
 WordList(['is', 'a', 'beautiful', 'day', 'Tomorrow']),
 WordList(['a', 'beautiful', 'day', 'Tomorrow', 'looks']),
 WordList(['beautiful', 'day', 'Tomorrow', 'looks', 'like']),
 WordList(['day', 'Tomorrow', 'looks', 'like', 'bad']),
 WordList(['Tomorrow', 'looks', 'like', 'bad', 'weather'])]
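
The sliding-window idea behind n-grams fits in a few lines of plain Python. This sketch is a stand-alone illustration, not TextBlob's implementation; the function name `ngrams` simply mirrors the method above and returns ordinary lists.

```python
def ngrams(words, n=3):
    """Return every run of n consecutive items from words as a list of lists."""
    return [words[i:i + n] for i in range(len(words) - n + 1)]


words = ['Today', 'is', 'a', 'beautiful', 'day',
         'Tomorrow', 'looks', 'like', 'bad', 'weather']

trigrams = ngrams(words)
print(len(trigrams))            # 8
print(trigrams[0])              # ['Today', 'is', 'a']
print(len(ngrams(words, n=5)))  # 6
```

A list of 10 words yields `10 - n + 1` n-grams, which matches the 8 trigrams and 6 five-grams in the outputs above.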

* The following examples use excerpts from the CommonLit Readability Prize competition on Kaggle: https://www.kaggle.com/competitions/commonlitreadabilityprize

In [51]:
import pandas as pd
textdata = pd.read_csv('textdata.csv')
textdata.head()

Unnamed: 0.1,Unnamed: 0,id,excerpt
0,0,c12129c31,When the young people returned to the ballroom...
1,1,85aa80a4c,"All through dinner time, Mrs. Fayre was somewh..."
2,2,b69ac6792,"As Roger had predicted, the snow departed as q..."
3,3,dd1000b26,And outside before the palace a great garden w...
4,4,37c1b32fb,Once upon a time there were Three Bears who li...


In [52]:
textdata =  textdata[['id', 'excerpt']]
textdata.shape

(2834, 2)

In [54]:
textdata.head()

Unnamed: 0,id,excerpt
0,c12129c31,When the young people returned to the ballroom...
1,85aa80a4c,"All through dinner time, Mrs. Fayre was somewh..."
2,b69ac6792,"As Roger had predicted, the snow departed as q..."
3,dd1000b26,And outside before the palace a great garden w...
4,37c1b32fb,Once upon a time there were Three Bears who li...


In [55]:
textdata.excerpt[0]

'When the young people returned to the ballroom, it presented a decidedly changed appearance. Instead of an interior scene, it was a winter landscape.\nThe floor was covered with snow-white canvas, not laid on smoothly, but rumpled over bumps and hillocks, like a real snow field. The numerous palms and evergreens that had decorated the room, were powdered with flour and strewn with tufts of cotton, like snow. Also diamond dust had been lightly sprinkled on them, and glittering crystal icicles hung from the branches.\nAt each end of the room, on the wall, hung a beautiful bear-skin rug.\nThese rugs were for prizes, one for the girls and one for the boys. And this was the game.\nThe girls were gathered at one end of the room and the boys at the other, and one end was called the North Pole, and the other the South Pole. Each player was given a small flag which they were to plant on reaching the Pole.\nThis would have been an easy matter, but each traveller was obliged to wear snowshoes.'

In [56]:
list(textdata.excerpt)[:3]

['When the young people returned to the ballroom, it presented a decidedly changed appearance. Instead of an interior scene, it was a winter landscape.\nThe floor was covered with snow-white canvas, not laid on smoothly, but rumpled over bumps and hillocks, like a real snow field. The numerous palms and evergreens that had decorated the room, were powdered with flour and strewn with tufts of cotton, like snow. Also diamond dust had been lightly sprinkled on them, and glittering crystal icicles hung from the branches.\nAt each end of the room, on the wall, hung a beautiful bear-skin rug.\nThese rugs were for prizes, one for the girls and one for the boys. And this was the game.\nThe girls were gathered at one end of the room and the boys at the other, and one end was called the North Pole, and the other the South Pole. Each player was given a small flag which they were to plant on reaching the Pole.\nThis would have been an easy matter, but each traveller was obliged to wear snowshoes.'

In [57]:
from nltk.tokenize import word_tokenize

In [58]:
textdata['tokens'] = textdata['excerpt'].apply(word_tokenize)
textdata.head()

Unnamed: 0,id,excerpt,tokens
0,c12129c31,When the young people returned to the ballroom...,"[When, the, young, people, returned, to, the, ..."
1,85aa80a4c,"All through dinner time, Mrs. Fayre was somewh...","[All, through, dinner, time, ,, Mrs., Fayre, w..."
2,b69ac6792,"As Roger had predicted, the snow departed as q...","[As, Roger, had, predicted, ,, the, snow, depa..."
3,dd1000b26,And outside before the palace a great garden w...,"[And, outside, before, the, palace, a, great, ..."
4,37c1b32fb,Once upon a time there were Three Bears who li...,"[Once, upon, a, time, there, were, Three, Bear..."


In [67]:
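from wordcloud import STOPWORDS
stops = set(STOPWORDS)  # a set gives faster membership tests than a list

In [68]:
# The comparison is case-sensitive and STOPWORDS is lowercase, so capitalized tokens such as 'When' are kept
textdata['tokens'] = textdata['tokens'].apply(lambda x: [item for item in x if item not in stops])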

In [69]:
textdata.head(2)

Unnamed: 0,id,excerpt,tokens
0,c12129c31,When the young people returned to the ballroom...,"[When, young, people, returned, ballroom, ,, p..."
1,85aa80a4c,"All through dinner time, Mrs. Fayre was somewh...","[All, dinner, time, ,, Mrs., Fayre, somewhat, ..."


In [70]:
# Build each TextBlob once, then pull both fields from its Sentiment result
sentiments = textdata['excerpt'].apply(lambda text: TextBlob(text).sentiment)
textdata['subjectivity'] = sentiments.apply(lambda s: s.subjectivity)
textdata['polarity'] = sentiments.apply(lambda s: s.polarity)
textdata.head()

Unnamed: 0,id,excerpt,tokens,subjectivity,polarity
0,c12129c31,When the young people returned to the ballroom...,"[When, young, people, returned, ballroom, ,, p...",0.525758,0.134848
1,85aa80a4c,"All through dinner time, Mrs. Fayre was somewh...","[All, dinner, time, ,, Mrs., Fayre, somewhat, ...",0.566643,0.133999
2,b69ac6792,"As Roger had predicted, the snow departed as q...","[As, Roger, predicted, ,, snow, departed, quic...",0.61164,0.082672
3,dd1000b26,And outside before the palace a great garden w...,"[And, outside, palace, great, garden, walled, ...",0.636667,0.333869
4,37c1b32fb,Once upon a time there were Three Bears who li...,"[Once, upon, time, Three, Bears, lived, togeth...",0.567593,0.198611
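
Polarity scores are often turned into the positive, negative and neutral labels described at the start of these notes. The sketch below does that for the five polarity values shown above; the function `label_polarity` and its cutoff of `0.1` are assumptions chosen for illustration, not values defined by TextBlob.

```python
def label_polarity(polarity, threshold=0.1):
    """Map a polarity score in [-1.0, 1.0] to a coarse sentiment label."""
    if polarity > threshold:
        return 'positive'
    if polarity < -threshold:
        return 'negative'
    return 'neutral'


# Polarity values copied from the textdata.head() output above
polarities = [0.134848, 0.133999, 0.082672, 0.333869, 0.198611]

labels = [label_polarity(p) for p in polarities]
print(labels)  # ['positive', 'positive', 'neutral', 'positive', 'positive']
```

On the DataFrame itself, the same function could be applied with `textdata['polarity'].apply(label_polarity)`.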
