![image.png](https://raw.githubusercontent.com/fjvarasc/DSPXI/master/figures/py_logo.png)

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Text-Mining" data-toc-modified-id="Text-Mining-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Text Mining</a></span><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Introduction</a></span><ul class="toc-item"><li><span><a href="#Useful-Links" data-toc-modified-id="Useful-Links-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Useful Links</a></span></li></ul></li><li><span><a href="#Regular-Expressions" data-toc-modified-id="Regular-Expressions-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Regular Expressions</a></span><ul class="toc-item"><li><span><a href="#Useful-Links" data-toc-modified-id="Useful-Links-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Useful Links</a></span></li><li><span><a href="#Let's-play" data-toc-modified-id="Let's-play-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Let's play</a></span></li></ul></li><li><span><a href="#Text-Preprocessing" data-toc-modified-id="Text-Preprocessing-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Text Preprocessing</a></span><ul class="toc-item"><li><span><a href="#Noise-Removal" data-toc-modified-id="Noise-Removal-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Noise Removal</a></span></li><li><span><a href="#Tokenization" data-toc-modified-id="Tokenization-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>Tokenization</a></span></li><li><span><a href="#Stopwords" data-toc-modified-id="Stopwords-1.3.3"><span class="toc-item-num">1.3.3&nbsp;&nbsp;</span>Stopwords</a></span></li><li><span><a href="#Stemming" data-toc-modified-id="Stemming-1.3.4"><span class="toc-item-num">1.3.4&nbsp;&nbsp;</span>Stemming</a></span></li><li><span><a href="#Lemming" data-toc-modified-id="Lemming-1.3.5"><span class="toc-item-num">1.3.5&nbsp;&nbsp;</span>Lemming</a></span></li><li><span><a href="#NLTK" 
data-toc-modified-id="NLTK-1.3.6"><span class="toc-item-num">1.3.6&nbsp;&nbsp;</span>NLTK</a></span></li><li><span><a href="#SpaCy" data-toc-modified-id="SpaCy-1.3.7"><span class="toc-item-num">1.3.7&nbsp;&nbsp;</span>SpaCy</a></span></li><li><span><a href="#Let's-play" data-toc-modified-id="Let's-play-1.3.8"><span class="toc-item-num">1.3.8&nbsp;&nbsp;</span>Let's play</a></span></li></ul></li><li><span><a href="#Document-Term-Matrix" data-toc-modified-id="Document-Term-Matrix-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Document Term Matrix</a></span><ul class="toc-item"><li><span><a href="#TF-IDF" data-toc-modified-id="TF-IDF-1.4.1"><span class="toc-item-num">1.4.1&nbsp;&nbsp;</span>TF-IDF</a></span></li></ul></li><li><span><a href="#Asignment" data-toc-modified-id="Asignment-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Asignment</a></span></li></ul></li></ul></div>

# Text Mining

## Introduction

Text mining, sometimes also referred to as text analytics, is the process of deriving information from text. The objective is to extract useful information from text-based content. This content can be an email, a comment on social media, a review, a scientific paper, a contract, a book, and so on.  
Some useful applications for text mining are:
* Check popularity of a topic on social media
* Evaluate a review of a product as positive or negative
* Summarize the content of multiple news sources
* Surveillance of email to prevent fraud
* Extract important information from a contract

Text mining can also be used to turn unstructured data into structured data. Unstructured (often qualitative) data, such as free text or descriptions of colour and texture, are data that do not fit a predefined numeric or tabular format. Structured (often quantitative) data are organized into fields that can be measured and compared easily.

### Useful Links
* [Wikipedia Article about Text Mining](https://en.wikipedia.org/wiki/Text_mining)
* [A Definitive Guide on How Text Mining Works](https://www.educba.com/text-mining/)
* [About Text Mining (IBM article)](https://www.ibm.com/support/knowledgecenter/en/SS3RA7_17.1.0/ta_guide_ddita/textmining/shared_entities/tm_intro_tm_defined.html)

## Regular Expressions

[Regular Expression](https://en.wikipedia.org/wiki/Regular_expression) (usually abbreviated as **RegEx**) is a way of finding and/or replacing text by matching patterns. Unlike a plain text search, where we want to match an exact string or character, with RegEx we want to find a specific pattern. Examples of data that follow a pattern:
* Telephone Numbers
* Document Numbers (RUT, Passport)
* Car Plates
* Dates
* URLs
* Emails

The website RegEx101 provides a summary of the regex patterns and an easy way to test an expression.  
For Python, we are going to use the [re](https://docs.python.org/3/library/re.html) module from the standard library.
This module has a function, findall, that returns all non-overlapping occurrences of a regular expression in a given text.

In [162]:
import re

# Text to be searched
sample_text = 'You can send the details for eduardo.lopes@evalueserve.com. \
                Please keep in copy compliance@evalueserve.com'

# Pattern to be found
email_pattern = r'[a-z.]*@[a-z]*.com'

# Return all occurrences of email_pattern in sample_text
re.findall(email_pattern, sample_text)

['eduardo.lopes@evalueserve.com', 'compliance@evalueserve.com']

In the example above, we are telling Python that we are looking for a sequence of lowercase letters and/or dots (**[a-z.]**) of any length (the \* quantifier), followed by an **@**, followed by another sequence of lowercase letters (**[a-z]**), ending with **.com**. Note that the dot before **com** is not escaped, so it actually matches any character, not only a literal dot.
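
To see why the unescaped dot matters, here is a small sketch. The two input strings are made up for illustration, and the second pattern (with the dot escaped) is only one possible tightening, not the notebook's own solution.

```python
import re

# Pattern from the example above: the "." before "com" is unescaped,
# so it matches ANY single character, not only a literal dot.
loose_pattern = r'[a-z.]*@[a-z]*.com'

# Same pattern with the dot escaped, so only a literal "." is accepted.
strict_pattern = r'[a-z.]*@[a-z]*\.com'

# Made-up test strings: the second one is not a valid address.
texts = ['bob@site.com', 'bob@siteXcom']

loose_matches = [re.findall(loose_pattern, t) for t in texts]
strict_matches = [re.findall(strict_pattern, t) for t in texts]

print(loose_matches)   # the loose pattern accepts both strings
print(strict_matches)  # the strict pattern rejects 'bob@siteXcom'
```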

To help build our own regular expressions, there are some predefined character classes for the most common cases.  
Let's explore them.

**Basic Character Classes**

These are the basic predefined classes.

| Symbol | Matches                                 |
|--------|-----------------------------------------|
| \d     | Any digit                               |
| \w     | Any alphanumeric character (includes _) |
| \s     | Any whitespace character                |
| \t     | Tab character                           |
| \n     | New line character                      |

In [32]:
sample_text = 'email1991@evalueserve.com compliance@evalueserve.com'
re.findall(r'\d', sample_text)

['1', '9', '9', '1']

**Negation Character Classes**

These are the negations of the classes above. As we can see, they are written like the respective class, but in **UPPERCASE**.

| Symbol | Matches                        |
|--------|--------------------------------|
| \D     | Any non-digit                  |
| \W     | Any non-alphanumeric character |
| \S     | Any non-whitespace character   |

In [39]:
sample_text = 'email1991@evalueserve.com compliance@evalueserve.com'
re.findall(r'\W', sample_text)

['@', '.', ' ', '@', '.']
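
Negated classes are handy for cleaning values. As a small sketch (the phone number below is made up), re.sub can delete every character that is not a digit:

```python
import re

# Made-up phone number with spaces, brackets and a dash
raw_phone = '+56 (2) 555-1234'

# \D matches any character that is NOT a digit; replace each one with nothing
digits_only = re.sub(r'\D', '', raw_phone)

print(digits_only)
```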

**User Defined Character Classes**

If we want a specific set of characters (just vowels, for instance), we can use this class to define our own group.  
We used that in the Email example above.

| Symbol         | Matches                                                                   |
|----------------|---------------------------------------------------------------------------|
| [abc]          | a, b or c                                                                 |
| [ab]\|[c]      | Or operator, equivalent to above (a, b or c)                              |
| [^abc]         | Negation, matches everything except a, b, or c                            |
| [a-c]          | Range, matches any character between a and c                              |
| [a-c[f-h]]     | Union, matches a, b, c, f, g, h (Java syntax; in Python write [a-cf-h])   |
| [a-c&&[b-c]]   | Intersection, matches b or c (Java syntax; not supported by Python's re)  |
| [a-c&&[^b-c]]  | Subtraction, matches a (Java syntax; not supported by Python's re)        |

In [192]:
sample_text = 'email1991@evalueserve.com compliance@evalueserve.com'
re.findall(r'[aeiou]|[0-9]', sample_text)

['e',
 'a',
 'i',
 '1',
 '9',
 '9',
 '1',
 'e',
 'a',
 'u',
 'e',
 'e',
 'e',
 'o',
 'o',
 'i',
 'a',
 'e',
 'e',
 'a',
 'u',
 'e',
 'e',
 'e',
 'o']
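
Python's re module has no nested classes and no && operator, so the union, intersection and subtraction rows of the table need a different spelling in Python. The sketch below shows one possible way: a combined class for the union and lookaheads for the other two. Lookaheads are not covered in this material, and the test string is made up.

```python
import re

text = 'abcdefgh'  # made-up test string

# Union of [a-c] and [f-h]: list both ranges in one class
union = re.findall(r'[a-cf-h]', text)

# Intersection of [a-c] and [b-c]: the lookahead (?=[b-c]) requires the
# next character to be b or c, then [a-c] consumes it
intersection = re.findall(r'(?=[b-c])[a-c]', text)

# Subtraction [a-c] minus [b-c]: the negative lookahead (?![b-c]) rejects
# b and c before [a-c] consumes the character
subtraction = re.findall(r'(?![b-c])[a-c]', text)

print(union)
print(intersection)
print(subtraction)
```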

**Quantifiers**

Quantifiers are operators used in combination with the classes to determine the length of the sequence we are interested in.
For instance, if we are looking for years, we are only interested in sequences of digits of length 4.

| Symbol  | Matches                                         | Example |
|---------|-------------------------------------------------|---------|
| ? | Zero or one occurrences |   \d?      |
| * | Zero or more occurrences |   \d*      |
| + | One or more occurrences |    \d+     |
| {n}   | A sequence of length n               |  \d{n}        |
| {n,}  | A sequence of length at least n      |   \d{n,}      |
| {n,m} | A sequence of length between n and m |   \d{n,m}      |

In [48]:
sample_text = 'email1991@evalueserve.com compliance@evalueserve.com'
re.findall(r'[0-9]{3,}', sample_text)

['1991']
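
Going back to the year example, a pattern such as \d{4} on its own would also match the first four digits of a longer number. One way around that, sketched below, is to add word boundaries (\b), an operator not listed in the tables above. Both sentences are made up for illustration.

```python
import re

# Made-up sentence: two years and one longer number
text = 'Founded in 1991, relocated in 2005, postal code 12345.'

# \b marks a word boundary, so only runs of exactly four digits match
years = re.findall(r'\b\d{4}\b', text)

# ? makes the preceding character optional: both spellings match
spellings = re.findall(r'colou?r', 'color and colour')

print(years)
print(spellings)
```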

**Other Operators**

Here we introduced the most common RegEx operators, but there are many more.  
You can find a list of all the operators available in [Regex101](https://regex101.com/). 

### Useful Links
* [Wikipedia article about RegEx](https://en.wikipedia.org/wiki/Regular_expression)
* [Regex101](https://regex101.com/)
* [re documentation](https://docs.python.org/3/library/re.html)
* [Regex on Python](https://www.w3schools.com/python/python_regex.asp)

### Let's play

Going back to our previous example, can we make a better regular expression to identify emails?

In [50]:
sample_text = 'You can send the details for eduardo.lopes@evalueserve.com. Please keep in copy compliance@evalueserve.com'
email_pattern = r'[a-z.]*@[a-z]*.com' # Put your pattern here
re.findall(email_pattern, sample_text)

['eduardo.lopes@evalueserve.com', 'compliance@evalueserve.com']

## Text Preprocessing

When working with text, we usually have to deal with sequences of characters (letters, numbers, punctuation, special characters, blanks, spaces, etc.), many of which are not interesting for the analysis. Because of that, in every text mining/analytics project, the first step is preprocessing.  
Text preprocessing usually involves the following steps:
* Noise Removal
* Tokenization
* Stopword Removal
* Stemming
* Lemmatization

Depending on the data available and the objective of the analysis, some of these steps can be skipped and additional steps could be needed.  
Now, we will walk through each one of these steps.

### Noise Removal

Noise removal is the process of removing unwanted characters such as:  
* Text file headers and/or footers 
* HTML, XML, etc. markup and metadata  

It is usually necessary when dealing with content extracted from the web. Noise removal is therefore a more specific step, which depends on the kind of text we are dealing with.  

For the next example, we are using the [urllib](https://docs.python.org/3/library/urllib.html) package to download an HTML page from the web.

In [158]:
# Import package
from urllib import request

# Download html web page
url = "https://en.wikipedia.org/wiki/Text_mining"
html = request.urlopen(url).read().decode('utf8')

# Print a sample of the html
print(html[10000:12000])

Information_retrieval" title="Information retrieval">information retrieval</a>, <a href="/wiki/Lexical_analysis" title="Lexical analysis">lexical analysis</a> to study word frequency distributions, <a href="/wiki/Pattern_recognition" title="Pattern recognition">pattern recognition</a>, <a href="/wiki/Tag_(metadata)" title="Tag (metadata)">tagging</a>/<a href="/wiki/Annotation" title="Annotation">annotation</a>, <a href="/wiki/Information_extraction" title="Information extraction">information extraction</a>, <a href="/wiki/Data_mining" title="Data mining">data mining</a> techniques including link and association analysis, <a href="/wiki/Information_visualization" title="Information visualization">visualization</a>, and <a href="/wiki/Predictive_analytics" title="Predictive analytics">predictive analytics</a>. The overarching goal is, essentially, to turn text into data for analysis, via application of <a href="/wiki/Natural_language_processing" title="Natural language processing">natura

As we can see, there is a lot of undesired text, mostly HTML tags. To continue with our analysis, we need to remove this unwanted data.  
Luckily, there is a Python package that can help us with that, [Beautiful Soup](https://pypi.org/project/beautifulsoup4/). It is a package for pulling data out of HTML and XML files. 


In [157]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text()[8000:10000])

business problems, whether independently or in conjunction with query and analysis of fielded, numerical data. It is a truism that 80 percent of business-relevant information originates in unstructured form, primarily text.[5] These techniques and processes discover and present knowledge – facts, business rules, and relationships – that is otherwise locked in textual form, impenetrable to automated processing.

Text analysis processes[edit]
Subtasks—components of a larger text-analytics effort—typically include:

Information retrieval or identification of a corpus is a preparatory step: collecting or identifying a set of textual materials, on the Web or held in a file system, database, or content corpus manager, for analysis.
Although some text analytics systems apply exclusively advanced statistical methods, many others apply more extensive natural language processing, such as part of speech tagging, syntactic parsing, and other types of linguistic analysis.[citation needed]
Named ent

With the html tags removed, we are left with pure text. So now we can move on in our analysis.
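
For small snippets, a rough alternative is to strip the tags with a regular expression from the standard library. This is only a sketch under simplifying assumptions (well-formed tags, no scripts or comments) and is not how Beautiful Soup works internally; the HTML fragment is made up. For real pages, a proper parser such as Beautiful Soup remains the safer choice.

```python
import html
import re

# Made-up HTML fragment, similar in shape to the Wikipedia sample above
snippet = '<p>Text <a href="/wiki/Data_mining" title="Data mining">data mining</a> &amp; more</p>'

# Remove anything that looks like a tag: "<", then non-">" characters, then ">"
no_tags = re.sub(r'<[^>]+>', '', snippet)

# Convert HTML entities such as &amp; back into plain characters
clean_text = html.unescape(no_tags)

print(clean_text)
```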

### Tokenization

Tokenization is the act of splitting longer strings of text into smaller pieces, or **tokens**. Large documents can be tokenized into sentences, sentences can be tokenized into words, etc. In some situations, a name composed of two or more words can be considered a single token (Bank of America, Great Britain, Viña del Mar, etc.). 

Tokenization makes further analysis, like counting words or summarizing information, easier and faster to perform.

To understand how it is done, let's tokenize our first sentence. We will use the function split from the re package. This function splits a text wherever a given regular expression matches.  

In [418]:
sample_text = "To understand how it's done, let's tokenize our first sentence. Don't forget special cases, like U.S.A. This is important for understanding, specially when learning"

# Split our text on whitespace
split_pattern = r'\s'
tokens = re.split(split_pattern, sample_text)
print(tokens)

['To', 'understand', 'how', "it's", 'done,', "let's", 'tokenize', 'our', 'first', 'sentence.', "Don't", 'forget', 'special', 'cases,', 'like', 'U.S.A.', 'This', 'is', 'important', 'for', 'understanding,', 'specially', 'when', 'learning']


This is our first list of tokens.  
Unfortunately, it resulted in some undesirable tokens:
* **done,** (comma at the end)
* **sentence.** (period at the end)

Also, there are some special cases that some applications find acceptable. We will explore these later.
* **it's**
* **let's**
* **Don't**
* **U.S.A.**

One alternative is to look for anything that is not an alphanumeric character, and split the text at that position.
RegEx has an operator for that, the **\W** class. We will include the quantifier **+** to match sequences of one or more such characters, so a sequence of whitespaces, or a special character followed by a whitespace, will be treated as a single separator.

In [383]:
# Split text on runs of non-alphanumeric characters
split_pattern = r'\W+'
tokens = re.split(split_pattern, sample_text)
print(tokens)

['To', 'understand', 'how', 'it', 's', 'done', 'let', 's', 'tokenize', 'our', 'first', 'sentence', 'Don', 't', 'forget', 'special', 'cases', 'like', 'U', 'S', 'A', 'This', 'is', 'important', 'for', 'understanding', 'specially', 'when', 'learning']


We were able to correct the undesirable tokens, but the special cases (it's, let's, Don't and U.S.A.) were split.  
So we need to think about a more complex pattern.  
We can write a regular expression for each special case, and then combine them with the OR (**|**) operator.

In [421]:
pattern = r'''(?x)         # verbose flag: ignore whitespace and allow comments in the pattern
            \w[.]\w[.]\w   # U.S.A.
            | \w+[']\w     # it's, let's, Don't
            | \w+          # other words
        '''
tokens = re.findall(pattern, sample_text)
print(tokens)

['To', 'understand', 'how', "it's", 'done', "let's", 'tokenize', 'our', 'first', 'sentence', "Don't", 'forget', 'special', 'cases', 'like', 'U.S.A', 'This', 'is', 'important', 'for', 'understanding', 'specially', 'when', 'learning']


This appears to work with our example, but it is still not a perfect solution. For other abbreviations or words with hyphens (drive-thru), our tokenizer won't work properly.  
To help us with that task, the Natural Language Toolkit ([NLTK](http://www.nltk.org/)) package has a tokenizer. This package is focused on text mining, and will be used in our next steps in pre-processing.

In [422]:
import nltk
#nltk.download('punkt')
from nltk.tokenize import word_tokenize
tokens = word_tokenize(sample_text)
print(tokens)

['To', 'understand', 'how', 'it', "'s", 'done', ',', 'let', "'s", 'tokenize', 'our', 'first', 'sentence', '.', 'Do', "n't", 'forget', 'special', 'cases', ',', 'like', 'U.S.A', '.', 'This', 'is', 'important', 'for', 'understanding', ',', 'specially', 'when', 'learning']


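The tokenizer from NLTK, unlike the one we built with RegEx, splits words that contain an apostrophe (\') into separate tokens.  
A similar rule-based tokenizer can be approximated with the following regular expression (its output is close to, but not identical to, that of word_tokenize):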

In [423]:
pattern = r'''(?x)          # verbose flag: ignore whitespace and allow comments in the pattern
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''
tokens = re.findall(pattern, sample_text)
print(tokens)

['To', 'understand', 'how', 'it', "'", 's', 'done', ',', 'let', "'", 's', 'tokenize', 'our', 'first', 'sentence', '.', 'Don', "'", 't', 'forget', 'special', 'cases', ',', 'like', 'U.S.A.', 'This', 'is', 'important', 'for', 'understanding', ',', 'specially', 'when', 'learning']
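
As a quick check of the other alternatives in this pattern, the sketch below applies it to a made-up sentence containing a hyphenated word, an abbreviation, a price and an ellipsis. The pattern is copied from the cell above; only the sentence is new.

```python
import re

pattern = r'''(?x)          # verbose flag: whitespace and comments are ignored
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''

# Made-up sentence exercising several alternatives of the pattern
text = 'The drive-thru in the U.S.A. costs $12.40 ...'

tokens = re.findall(pattern, text)
print(tokens)
```

Because the alternatives are tried from left to right, their order matters: a token is produced by the first alternative that matches at the current position.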


As we can see, tokenization is not an easy task. It involves not only technical skills, but also an understanding of the language and the problem to be solved. Especially when dealing with specific subjects (scientific research, for instance), we can stumble across domain-specific terms, acronyms, abbreviations, etc. 

> **"It is not safe to make the assumption that source text will be perfect. A tokenizer must often be customized to the data in question."** 
>>*Natural Language Processing with Python - Analyzing Text with the Natural Language Toolkit  
>>Steven Bird, Ewan Klein, and Edward Loper - NLTK package creators*

That's why the NLTK tokenizer is an option, but not the only one, as we can see below:

![Table comparing different Tokenizers](https://raw.githubusercontent.com/fjvarasc/DSPXI/master/figures/TokenizerComparison.png "Comparing different Tokenizers")
*source: https://www.ibm.com/developerworks/community/blogs/nlp/entry/tokenization?lang=en*

**Useful Link**
* [Compare different tokenization methods](https://text-processing.com/demo/tokenize/)

### Stopwords

**Stop words** are the most common words in a language, like “the”, “a”, “on”, “is”, “all”. These words usually do not carry important meaning and are often removed from texts. NLTK has a built-in stopword list for multiple languages.

In [424]:
from nltk.corpus import stopwords
#nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
print(stop_words)

{'why', 'd', 'doing', 'those', 'needn', 'very', 'from', 'as', 'any', 'our', 'didn', 'because', 'isn', "needn't", 'herself', "mightn't", 'myself', 'be', 'is', 'after', 'what', "weren't", 'wouldn', 'its', 'off', 'or', 'ourselves', 'she', 'into', 've', 'i', 'their', 'weren', 'against', "wouldn't", 'same', "didn't", 'about', 'wasn', 'at', 'ours', 'all', "shan't", "couldn't", 'until', 'aren', 'hasn', 'hers', 'were', "you're", 'through', 'above', "you'll", 'few', 'theirs', 't', 'now', "haven't", 'to', "should've", 'under', 'them', 'such', 'by', 'should', 'once', 'up', 'only', 'here', 'in', 'the', 'shouldn', 'himself', 'whom', "shouldn't", 'between', 'your', 'most', 'can', 'haven', 'with', 'further', 'won', 'there', "that'll", 'couldn', 'mightn', 'am', 'than', "wasn't", 'had', 'been', 'during', 'are', 'y', 'no', "hadn't", 'being', 'and', 'did', 'a', 'has', "doesn't", 'ain', 'doesn', 'mustn', 'his', 'if', 'itself', 'will', 'we', "don't", 'of', 'that', 'him', 'on', 'me', "isn't", 'yours', 'who'

In [425]:
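# Keep only the tokens that are not in the stopword list (the comparison is case-sensitive)
tokens_nostop = [token for token in tokens if token not in stop_words]
print(tokens_nostop)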

['To', 'understand', "'", 'done', ',', 'let', "'", 'tokenize', 'first', 'sentence', '.', 'Don', "'", 'forget', 'special', 'cases', ',', 'like', 'U.S.A.', 'This', 'important', 'understanding', ',', 'specially', 'learning']
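
In the output above, 'To' and 'This' survive because the stopword list is lowercase and the comparison is case-sensitive. One possible fix is to lowercase each token before the lookup. The sketch below uses a tiny hand-written stopword set and token list (both assumptions, so that it runs without downloading the NLTK data):

```python
# Tiny hand-written stopword set, standing in for stopwords.words('english')
stop_words_demo = {'to', 'how', 'it', 'is', 'this', 'for'}

# Made-up token list with capitalized stopwords
tokens_demo = ['To', 'understand', 'how', 'it', 'is', 'done', 'This', 'is', 'important']

# Lowercase each token only for the lookup, keeping the original spelling
tokens_nostop_demo = [t for t in tokens_demo if t.lower() not in stop_words_demo]

print(tokens_nostop_demo)
```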


### Stemming

Stemming is the process of keeping just the root (or stem) of a word. 
The reason why we stem is to shorten the lookup and normalize sentences. Many variations of a word carry the same meaning, so we can treat these variations as one single case. For instance:

* I was taking a ride in the car.
* I was riding in the car. 

Both sentences carry the same meaning, so the variation in the verb ride can be stemmed so we treat **ride** and **riding** as the same word.

In [426]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

print("study :", stemmer.stem("study")) 
print("studying :", stemmer.stem("studying"))
print("studied :", stemmer.stem("studied"))

study : studi
studying : studi
studied : studi


In [427]:
tokens_stem = [stemmer.stem(word) for word in tokens_nostop]
print(tokens_stem)

['To', 'understand', "'", 'done', ',', 'let', "'", 'token', 'first', 'sentenc', '.', 'don', "'", 'forget', 'special', 'case', ',', 'like', 'u.s.a.', 'thi', 'import', 'understand', ',', 'special', 'learn']


### Lemmatization

The aim of lemmatization, like stemming, is to reduce variations of the same word. As opposed to stemming, lemmatization does not simply chop off inflections. Instead, it uses lexical knowledge bases and built-in dictionaries to get the correct base forms of words.
When dealing with irregular verbs, lemmatization is able to find the original form of the verb, while stemming is not.  
One of the downsides of lemmatization is that we have to tell the function the part of speech (POS) of the term. The POS is the role of the term in the sentence, usually noun, verb, adjective, adverb, and so on.

In [419]:
from nltk.stem import WordNetLemmatizer
#nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))
print("better (without POS):", lemmatizer.lemmatize("better"))
print("better (with POS):", lemmatizer.lemmatize("better", pos="a"))
print("are (without POS):", lemmatizer.lemmatize("are"))
print("are (with POS):", lemmatizer.lemmatize("are", pos="v"))

rocks : rock
corpora : corpus
better (without POS): better
better (with POS): good
are (without POS): are
are (with POS): be


In [420]:
from nltk.stem import WordNetLemmatizer
#nltk.download('wordnet')

lemmatizer=WordNetLemmatizer()

# Without a POS tag, lemmatize() treats every word as a noun
tokens_lemma = [lemmatizer.lemmatize(word) for word in tokens_stem]
print(tokens_lemma)

['To', 'understand', 'done,', "let'", 'token', 'first', 'sentence.', "don't", 'forget', 'special', 'cases,', 'like', 'u.s.a.', 'thi', 'import', 'better', 'understanding,', 'special', 'learn']


As you can imagine, it is not practical to manually provide the correct POS tag for every word. The NLTK package has the nltk.pos_tag() function, which can give us that information.
However, it returns the tags in a different format than the one lemmatize expects, so we build a function that translates them to the correct POS.

In [413]:
from nltk.corpus import wordnet
#nltk.download('averaged_perceptron_tagger')

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

# Init Lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize a Sentence with the appropriate POS tag
tokens_lemma = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in tokens_stem]
print(tokens_lemma)

['To', 'understand', 'done,', "let'", 'token', 'first', 'sentence.', "don't", 'forget', 'special', 'cases,', 'like', 'u.s.a.', 'thi', 'import', 'well', 'understanding,', 'special', 'learn']
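
The translation step itself does not need NLTK. As a minimal sketch, the function below maps Penn Treebank style tags (the format returned by nltk.pos_tag) to the single-letter codes that lemmatize() accepts. The function name and the sample tags are made up for illustration; the letters 'a', 'n', 'v' and 'r' are the values of wordnet.ADJ, wordnet.NOUN, wordnet.VERB and wordnet.ADV.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter."""
    first_letter_to_pos = {
        'J': 'a',  # adjectives (JJ, JJR, JJS)
        'N': 'n',  # nouns (NN, NNS, NNP, NNPS)
        'V': 'v',  # verbs (VB, VBD, VBG, ...)
        'R': 'r',  # adverbs (RB, RBR, RBS)
    }
    # Fall back to noun, the default POS used by lemmatize()
    return first_letter_to_pos.get(tag[:1].upper(), 'n')

# Made-up sample of tags, including one (DT) with no WordNet equivalent
sample_tags = ['JJR', 'NNS', 'VBD', 'RB', 'DT']
mapped = [penn_to_wordnet(tag) for tag in sample_tags]
print(mapped)
```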


**Useful Links**

Pre-processing:
* https://www.kdnuggets.com/2018/03/text-data-preprocessing-walkthrough-python.html  
* https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908  
* https://www.researchgate.net/publication/273127322_Preprocessing_Techniques_for_Text_Mining  
 
Lemmatization:

* https://www.machinelearningplus.com/nlp/lemmatization-examples-python/  
* https://www.geeksforgeeks.org/python-lemmatization-with-nltk/ 
* https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html 
* https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/

### NLTK

With all that we have seen until now, we can build a function that normalizes a given text.

In [428]:
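def normalize(text):
    """Lowercase, tokenize, remove stopwords, stem and lemmatize a text."""
    text = text.lower()
    tokens = word_tokenize(text)
    # Drop stopwords (the text is already lowercase, so the lookup matches)
    tokens_nostop = [token for token in tokens if token not in stop_words]
    # Reduce each remaining token to its stem
    tokens_stem = [stemmer.stem(word) for word in tokens_nostop]
    # Lemmatize the stems using the POS tag guessed for each one
    tokens_lemma = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in tokens_stem]
    return tokens_lemma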

In [429]:
input_str="He is very intelligent and smart, smarter than me, maybe better"
words = normalize(input_str)
print(words)

['intellig', 'smart', ',', 'smarter', ',', 'mayb', 'well']


### SpaCy

[SpaCy](https://spacy.io/) is, like NLTK, a package focused on text mining. One of the main differences between them is that spaCy has a built-in text processing pipeline that combines all the text preprocessing tasks. That means that, with fewer lines of code, we can accomplish the same result that we get with NLTK. The trade-off is that spaCy can be slower if we want to do just one task (tokenization, for instance), because the whole pipeline runs by default.

In [437]:
import spacy
import en_core_web_sm #English Text processor

nlp = en_core_web_sm.load() #Loading English Text processor
doc = nlp(sample_text) #Processing text

# Print one row per token: text, offset, lemma, flags, shape and POS
for token in doc:
    print("{0}\t\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.is_stop,
        token.shape_,
        token.pos_
    ))

To		0	to	False	False	True	Xx	PART
understand		3	understand	False	False	False	xxxx	VERB
how		14	how	False	False	True	xxx	ADV
it		18	-PRON-	False	False	True	xx	PRON
's		20	be	False	False	True	'x	VERB
done		23	do	False	False	True	xxxx	VERB
,		27	,	True	False	False	,	PUNCT
let		29	let	False	False	False	xxx	VERB
's		32	-PRON-	False	False	True	'x	PRON
tokenize		35	tokenize	False	False	False	xxxx	VERB
our		44	-PRON-	False	False	True	xxx	DET
first		48	first	False	False	True	xxxx	ADJ
sentence		54	sentence	False	False	False	xxxx	NOUN
.		62	.	True	False	False	.	PUNCT
Do		64	do	False	False	True	Xx	VERB
n't		66	not	False	False	True	x'x	ADV
forget		70	forget	False	False	False	xxxx	VERB
special		77	special	False	False	False	xxxx	ADJ
cases		85	case	False	False	False	xxxx	NOUN
,		90	,	True	False	False	,	PUNCT
like		92	like	False	False	False	xxxx	ADP
U.S.A.		97	U.S.A.	False	False	False	X.X.X.	PROPN
This		104	This	False	False	True	Xxxx	DET
is		109	be	False	False	True	xx	VERB
important		112	important	Fals

There are more differences between these packages that we won't cover in this material, but they can be found in the useful links below.

**Useful Links**

Spacy:  

* https://nlpforhackers.io/complete-guide-to-spacy/ 

NLTK vs Spacy:

* https://www.oreilly.com/learning/how-can-i-tokenize-a-sentence-with-python  
* https://medium.com/@akankshamalhotra24/introduction-to-libraries-of-nlp-in-python-nltk-vs-spacy-42d7b2f128f2  
* https://medium.com/@pemagrg/private-nltk-vs-spacy-3926b3674ee4  


### Let's play

Rewrite the normalize function, but using the spaCy package.

In [None]:
# Code here

## Document Term Matrix

At this point, we can start analyzing our text. The first thing we can think about is counting words. 

A Document Term Matrix (DTM) is a mathematical matrix that shows the frequency of a token for a given document. In a DTM, rows correspond to documents and columns correspond to tokens.

This representation is useful because it turns the task of comparing two texts into comparing two vectors (two rows in the DTM).

In [302]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

data = ["This is a sample text",
        "This is another sample",
        "This is the third sentence"]

vec = CountVectorizer()
X = vec.fit_transform(data)

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
print(df)

   another  is  sample  sentence  text  the  third  this
0        0   1       1         0     1    0      0     1
1        1   1       1         0     0    0      0     1
2        0   1       0         1     0    1      1     1
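
To make the idea of comparing two rows concrete, the sketch below rebuilds the same matrix with the standard library only and compares documents with cosine similarity. Cosine similarity is one common choice, used here as an assumption, since the text does not prescribe a measure. The tokenization rule (lowercase, words of two or more characters) mimics CountVectorizer's defaults.

```python
import math
import re
from collections import Counter

data = ["This is a sample text",
        "This is another sample",
        "This is the third sentence"]

# Lowercase and keep words of 2+ characters, mimicking CountVectorizer's defaults
tokenized = [re.findall(r'\b\w\w+\b', doc.lower()) for doc in data]

# Columns of the matrix: the sorted vocabulary
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row of counts per document
dtm = [[Counter(tokens)[term] for term in vocabulary] for tokens in tokenized]

def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_0_1 = cosine_similarity(dtm[0], dtm[1])
sim_0_2 = cosine_similarity(dtm[0], dtm[2])
print(vocabulary)
print(dtm)
print(round(sim_0_1, 3), round(sim_0_2, 3))
```

The first two documents share three of their four terms, so they score higher than the first and third, which only share "this" and "is".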


Note that this approach has a flaw. When comparing documents on the same subject, differentiating them can be difficult because they share many of the same terms.  
For example, mathematical papers all share words such as formula, theory, calculation, etc.

A possible solution for that is TF-IDF.

**Useful links**
* [DTM with Pandas and Sklearn](https://stackoverflow.com/questions/15899861/efficient-term-document-matrix-with-nltk)
* https://datawarrior.wordpress.com/2018/01/22/document-term-matrix-text-mining-in-r-and-python/  
* https://markroxor.github.io/gensim/static/notebooks/dtm_example.html  

### TF-IDF 

Given a collection of documents, how can we identify the keywords for each one of them?

The TF-IDF was created with that in mind. TF-IDF stands for:
* TF - Term Frequency
* IDF - Inverse Document Frequency

<br>

\begin{equation*}
tf(t) = \frac{\text{Number of times term t appears in a document}}{\text{Total number of terms in the document}} \\
\end{equation*}

\begin{equation*}
idf(t) = \log \frac{\text{Total number of documents}}{\text{Number of documents with term t in it}} \\
\end{equation*}

\begin{equation*}
tf\text{-}idf(t) = tf(t) \cdot idf(t) \\
\end{equation*}

In this equation, a term has a high TF-IDF if it appears a lot in a document (TF), but is penalized if it also appears in multiple documents (IDF). With that, we ensure that only words that are frequent in the document and rare in the rest of the collection get a high weight.  
Now, we can reproduce the DTM table, but instead of having a simple count, we will have the TF-IDF score.

In [210]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
Y = tfidf_transformer.fit_transform(X)
df = pd.DataFrame(Y.toarray(), columns=vec.get_feature_names_out())
print(df)

   another        is    sample  sentence     text      the    third      this
0  0.00000  0.391484  0.504107   0.00000  0.66284  0.00000  0.00000  0.391484
1  0.66284  0.391484  0.504107   0.00000  0.00000  0.00000  0.00000  0.391484
2  0.00000  0.307144  0.000000   0.52004  0.00000  0.52004  0.52004  0.307144
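
The numbers in the table above do not follow the textbook formulas exactly. With its default settings, scikit-learn's TfidfTransformer uses the raw counts as TF, a smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then scales each row to unit Euclidean length. The following sketch recomputes the table by hand under those assumptions, using only the standard library; the count matrix is copied from the DTM example.

```python
import math

vocabulary = ['another', 'is', 'sample', 'sentence', 'text', 'the', 'third', 'this']
counts = [[0, 1, 1, 0, 1, 0, 0, 1],
          [1, 1, 1, 0, 0, 0, 0, 1],
          [0, 1, 0, 1, 0, 1, 1, 1]]

n_docs = len(counts)

# Document frequency: in how many documents each term appears
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]

# Smoothed IDF, as in scikit-learn's default (smooth_idf=True)
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    """Weight raw counts by IDF, then scale the row to unit L2 norm."""
    weighted = [count * w for count, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm for x in weighted]

tfidf = [tfidf_row(row) for row in counts]
for row in tfidf:
    print([round(x, 6) for x in row])
```

Terms that appear in every document ("is", "this") get the lowest weight in each row, while terms unique to one document ("text", "another") get the highest.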


**Useful links**
* [TF-IDF](http://www.tfidf.com/)
* [A Simple Probabilistic Explanation of TF-IDF Heuristic](https://digitalcommons.utep.edu/cgi/viewcontent.cgi?article=1852&context=cs_techrep)
* [Keyword Extraction with TF-IDF and scikit-learn ](http://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.XKNh2mczWig)

## Assignment

[Project Gutenberg](http://www.gutenberg.org/) is a website that provides free access to over 58,000 eBooks that are in the public domain. 

We are going to work with the book [The Adventures of Sherlock Holmes](http://www.gutenberg.org/files/1661/1661-h/1661-h.htm), by Sir Arthur Conan Doyle, available on the website. This book is a collection of short stories.

Our objective is to find the keywords for each short story.

To accomplish that, we need to:
1. Download the book
3. Preprocess the text
4. Calculate TF-IDF