# Part III. Text mining

---


## What is Text Mining?
Text mining, or text analytics, is the process of extracting quantifiable features from structured or unstructured natural-language texts. Processing unstructured data relies on Natural Language Processing (NLP), statistical modeling and machine learning techniques.

## Why is it important?
It is often estimated that around 80% of generated data is not available in a structured, numerical format (emails, texts, meeting notes, documents, social media feeds). Such unstructured data includes images, drawings, videos, voice recordings and unstructured texts. These can be described by their metadata (length, topic, category, etc.), but transforming them into structured data is necessary to access the more detailed information they hold. Voice recordings, videos and drawings can also be transcribed into unstructured text so that they can be processed as textual data.  
The most common use cases are:

- document similarity computation
- document deduplication
- document clustering
- topic extraction
- sentiment analysis
- automated annotation
- text filtering
- text classification

## Tools
- NLP tools
    - tokenization
    - stemming
    - part-of-speech (POS) tagging
    - stop word filtering
    - bag-of-words representation
    - tf-idf transformation
- Other tools
    - Word2Vec representation
    - hashing
    - cosine/Jaccard/Levenshtein/etc. similarity computation
    - matrix factorization
    - etc

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import numpy as np
import scipy.sparse as sp
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Text Mining in practice

### 1. Read and examine data

Examining unstructured data is the key to proper preprocessing.  
The collection of texts is called a __`corpus`__.

In [None]:
with open('./data/SMSSpamCollection', encoding='utf-8') as spamfile:
    corpus = [line.strip() for line in spamfile]
len(corpus)

In [None]:
for text in corpus[:5]:
    print(text)

We can see that the data is in TSV (tab-separated values) format, so let's read it accordingly.
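As a rough illustration of what the TSV layout means, each line holds a label and a message separated by a tab. The sketch below parses two made-up lines (not rows from the dataset) with plain string handling; `parse_tsv_line` is a hypothetical helper, not part of pandas.

```python
# Two made-up lines in the label<TAB>message layout (not real dataset rows).
raw_lines = [
    "ham\tSee you at lunch?",
    "spam\tYou won a prize! Call now",
]

def parse_tsv_line(line):
    """Split one line into (label, message) at the first tab only."""
    label, message = line.rstrip("\n").split("\t", 1)
    return label, message

rows = [parse_tsv_line(line) for line in raw_lines]
print(rows)
```

Splitting at the first tab only keeps any later tab characters inside the message.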

In [None]:
corpus = pd.read_table('./data/SMSSpamCollection', names=['label', 'message'])

In [None]:
corpus.groupby('label').describe()

In [None]:
corpus['length'] = corpus.message.str.len()
corpus.head()

In [None]:
corpus.length.plot(bins=20, kind='hist');

In [None]:
corpus.length.describe()

A 910-character SMS?

In [None]:
corpus.loc[corpus.length > 900, 'message'].values

Is there a difference between spam and ham messages?

In [None]:
corpus.hist(bins=50, by='label');

Why not try a simple predictor?

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
splitted = train_test_split(corpus.length.values[:, np.newaxis], # we need a matrix, not a vector
                            corpus.label.values,
                            test_size=.25,
                            random_state=42)
X_train, X_test, y_train, y_test = splitted

In [None]:
pipe = Pipeline([('nb', MultinomialNB())])
pipe.fit(X_train, y_train)

In [None]:
accuracy_score(y_test, pipe.predict(X_test))

87% is our baseline. Let's get into the preprocessing!

### 2. Preprocessing
#### a) [Bag-of-words representation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)

The bag-of-words representation turns each document into a vector in which every distinct word has a fixed position. The value at each position is the number of times that word occurs in the given document. For example, the vector features for the documents
```python
docs = ["I like trains.", "Trains are like big cars.", "I like big cars"]
```
will be 
```python
features = {'I': 0, 'like': 1, 'trains': 2, 'are': 3, 'big': 4, 'cars': 5}
```
and the vector representations will be
```python
vectors = [[1, 1, 1, 0, 0, 0],
           [0, 1, 1, 1, 1, 1],
           [1, 1, 0, 0, 1, 1]]
```
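To make the construction concrete, here is a minimal pure-Python sketch that rebuilds the vocabulary and the count vectors above. The `tokenize` helper and its regular expression are assumptions made for illustration; note that the words are lowercased, so the key is `'i'` rather than `'I'`.

```python
import re

docs = ["I like trains.", "Trains are like big cars.", "I like big cars"]

def tokenize(doc):
    """Lowercase a document and return its alphabetic words."""
    return re.findall(r"[a-z]+", doc.lower())

# Give each new word the next free column index, in order of first appearance.
vocabulary = {}
for doc in docs:
    for word in tokenize(doc):
        vocabulary.setdefault(word, len(vocabulary))

# Count how often each vocabulary word occurs in each document.
vectors = []
for doc in docs:
    row = [0] * len(vocabulary)
    for word in tokenize(doc):
        row[vocabulary[word]] += 1
    vectors.append(row)

print(vocabulary)
print(vectors)
```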

Fortunately `scikit-learn` has a built-in solution for this: the [`CountVectorizer`](http://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation).  
Let's try out our little example:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
cntvec = CountVectorizer()
docs = ["I like trains.",
        "Trains are like big cars.",
        "I like big cars"]

cntvec.fit_transform(docs).todense(), cntvec.vocabulary_

#### N-grams
N-grams are tuples of n consecutive words. They are generated by sliding a window of length n over the text, and they can provide contextual information, which sometimes yields better results. The 2-grams of the sentence `"I like trains."` would be:
```python
[("I", "like"), ("like", "trains")]
```
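The rolling window can be sketched in a few lines. The `ngrams` function below is a hypothetical helper written for illustration; `CountVectorizer` does its own n-gram generation internally.

```python
def ngrams(tokens, n):
    """Return all n-long tuples from a rolling window over tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["I", "like", "trains"]
print(ngrams(tokens, 2))  # [('I', 'like'), ('like', 'trains')]
print(ngrams(tokens, 3))  # [('I', 'like', 'trains')]
```

A sentence with fewer than n tokens simply yields no n-grams.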

In [None]:
cntvec = CountVectorizer(ngram_range=(2, 2))
cntvec.fit_transform(docs).todense(), cntvec.vocabulary_

#### Minimum and maximum document frequency

Minimum and maximum document frequency (`min_df` and `max_df`) are thresholds that limit the number of features. If a _term_ (transformed word) appears in fewer than `min_df` or in more than `max_df` documents, it is left out of the vocabulary. An integer value is read as an absolute number of documents, a float between 0 and 1 as a proportion of all documents.
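A small sketch of the filtering, assuming absolute (integer) thresholds only. `filter_vocabulary` is a hypothetical helper, and the tokenizer keeps only words of two or more characters, so `"I"` never enters the counts.

```python
import re
from collections import Counter

docs = ["I like trains.", "Trains are like big cars.", "I like big cars"]

def tokenize(doc):
    # Lowercase and keep words of two or more characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

# Document frequency: in how many documents does each term occur at least once?
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(tokenize(doc)))

def filter_vocabulary(doc_freq, min_df=1, max_df=None):
    """Keep terms whose document count lies within [min_df, max_df]."""
    return sorted(term for term, df in doc_freq.items()
                  if df >= min_df and (max_df is None or df <= max_df))

print(dict(doc_freq))
print(filter_vocabulary(doc_freq, max_df=1))  # ['are']
print(filter_vocabulary(doc_freq, min_df=3))  # ['like']
```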

In [None]:
cntvec = CountVectorizer(max_df=1)
cntvec.fit_transform(docs).todense(), cntvec.vocabulary_

In [None]:
cntvec = CountVectorizer(min_df=3)
cntvec.fit_transform(docs).todense(), cntvec.vocabulary_

### Advanced tokenization

During vocabulary generation each word is analyzed and transformed in order to reduce the size of the vocabulary. Scikit-learn's default analyzer lowercases the text and drops tokens shorter than two characters; stop words are filtered only when the `stop_words` parameter is set, and no further transformation is done.

NLP offers more refined techniques for extracting the base form of words. Lemmatization is a powerful tool that reduces a word to its _root_ form (as it appears in dictionaries): that's how `are` becomes `be` and `trains` becomes `train`, etc.
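For comparison with these richer techniques, the default word analyzer can be approximated in a few lines. This is only a sketch: it uses the default `token_pattern` documented for `CountVectorizer`, and `default_like_analyzer` is a hypothetical name, not a scikit-learn function.

```python
import re

# The default token_pattern documented for CountVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_like_analyzer(doc):
    """Lowercase, then keep tokens of at least two word characters."""
    return TOKEN_PATTERN.findall(doc.lower())

print(default_like_analyzer("I like trains."))  # ['like', 'trains']
```

This is why `"I"` is missing from the vocabularies printed earlier: one-character tokens never match the pattern.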

#### Stemming

Stemming removes affixes from a word and returns the remaining stem. It applies its stripping rules without using any contextual information.
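To give a feel for rule-based stripping, here is a deliberately naive suffix stripper. It is a toy with an arbitrary suffix list, not the Porter algorithm, and `naive_stem` is a hypothetical name.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # arbitrary toy list, checked in this order

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in ["trains", "liked", "running", "are"]:
    print(word, "->", naive_stem(word))
```

The results for `liked` and `running` (`lik` and `runn`) show that blind stripping can produce stems that are not real words.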

In [None]:
from nltk.stem import PorterStemmer

In [None]:
stemmer = PorterStemmer()
stemmer.stem('trains')

#### Lemmatization

Lemmatization is similar to stemming; the difference is that the result of lemmatization is a real word.

In [None]:
import nltk  # needed for the download call below
nltk.download('wordnet')

In [None]:
from nltk.stem import WordNetLemmatizer

In [None]:
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('are', pos='v')

Lemmatization is more accurate but slower than stemming.

Using lemmatization we can create custom analyzers to use in CountVectorizers.

In [None]:
def split_into_lemmas(message):
    message = ''.join([char for char in message.lower()
                       if char.isalnum() or char.isspace()])
    return [lemmatizer.lemmatize(word, pos='v') 
            for word in message.split()]

[split_into_lemmas(doc) for doc in docs]

In [None]:
cntvec = CountVectorizer(analyzer=split_into_lemmas)
cntvec.fit_transform(docs).todense(), cntvec.vocabulary_

Let's insert our vectorizer into our pipeline!

In [None]:
splitted = train_test_split(corpus.message,
                            corpus.label.values,
                            test_size=.25,
                            random_state=42)
X_train, X_test, y_train, y_test = splitted

In [None]:
pipe = Pipeline([('cntvec', CountVectorizer(analyzer=split_into_lemmas, min_df=10, max_df=.5)),
                 ('nb', MultinomialNB())])
pipe.fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))

In [None]:
len(pipe.steps[0][-1].vocabulary_)

### Spacy, an advanced NLP library

There is an advanced library called `spacy` which has more sophisticated tokenization and lemmatization.

In [None]:
import spacy

In [None]:
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut no longer works in spaCy 3; assumes this model is installed
[token.lemma_ for token in nlp(docs[0])]

In [None]:
pd.DataFrame([
    {'text': token.text, 
     'lemma': token.lemma_, 
     'POS': token.pos_, 
     'tag': token.tag_, 
     'dep': token.dep_,
     'shape': token.shape_,
     'is_alpha': token.is_alpha, 
     'is_stop': token.is_stop}
    for token in nlp(docs[0])
]).set_index('text').transpose()

In [None]:
pipe = Pipeline([('cntvec', CountVectorizer(analyzer=lambda x: [w.lemma_ for w in nlp(x)], min_df=10, max_df=.5)),
                 ('nb', MultinomialNB())])
pipe.fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))

#### b) [Tf-Idf](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

Tf-Idf is short for Term Frequency - Inverse Document Frequency and is a way of normalizing term counts. It is the product of two separate metrics:

- _Term Frequency_ shows how often a term appears in a document: $\mathrm{tf}(t, d) = \frac{1}{2} + \frac{f_{t,d}}{2 \cdot \max\{f_{t',d} : t' \in d\}}$, where $f_{t,d}$ is the raw count of term $t$ in document $d$.
- _Inverse Document Frequency_ shows whether a term is common or rare across all documents: $\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$, where $N$ is the total number of documents in the corpus, $t$ is a term and $D$ is the set of documents.
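The two formulas can be evaluated by hand on a toy corpus. The sketch below follows the definitions above literally and assumes every queried term occurs in at least one document. Scikit-learn's `TfidfTransformer` uses raw counts and a smoothed idf by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, already tokenized.
docs = [["like", "trains"],
        ["trains", "are", "like", "big", "cars"],
        ["like", "big", "cars"]]

def tf(term, doc):
    """Augmented term frequency: 0.5 + 0.5 * count / max count in the document."""
    counts = Counter(doc)
    return 0.5 + 0.5 * counts[term] / max(counts.values())

def idf(term, docs):
    """log(N / number of documents containing the term)."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("like", docs[1], docs))  # 0.0, since "like" occurs in every document
print(tfidf("are", docs[1], docs))   # log(3), since "are" occurs in one document only
```

A term that appears everywhere carries no discriminating information, which is why its weight drops to zero.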

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer

In [None]:
pipe = Pipeline([('cntvec', CountVectorizer(analyzer=split_into_lemmas, min_df=5, max_df=.9)),
                 ('tfidf', TfidfTransformer()),
                 ('nb', MultinomialNB())])
pipe.fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))

In [None]:
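# Show the 20 terms with the highest idf, i.e. the rarest terms in the training set.
feature_names = pipe.steps[0][1].get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
for word in np.argsort(pipe.steps[1][1].idf_)[-20:][::-1]:
    print(word, feature_names[word], pipe.steps[1][1].idf_[word])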

#### c) [Hashing](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html#sklearn.feature_extraction.text.HashingVectorizer)

Very large corpora cause memory problems: the larger the corpus, the larger the vocabulary that has to be kept in memory.
To avoid this, the [_hashing trick_](http://scikit-learn.org/stable/modules/feature_extraction.html#feature-hashing) can be used. Instead of storing each distinct word in a dictionary, it applies a hash function to the features to determine their column index in the sample matrices directly. The price is that different words can collide in the same column and that column indices cannot be mapped back to words.
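The idea can be sketched without any vocabulary at all. The example below uses `hashlib.md5` as a stand-in hash purely for illustration (scikit-learn uses a different hash function), and `hashed_bow` is a hypothetical helper.

```python
import hashlib

def hashed_bow(tokens, n_features=16):
    """Count tokens into a fixed-length vector; the column comes from a hash."""
    vector = [0] * n_features
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        column = int(digest, 16) % n_features  # no vocabulary lookup needed
        vector[column] += 1
    return vector

vec = hashed_bow(["like", "trains", "like", "cars"])
print(vec)
```

The vector length is fixed up front by `n_features`, no matter how many distinct words the corpus contains.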

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer

In [None]:
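# non_negative was removed from HashingVectorizer; alternate_sign=False keeps the features non-negative for MultinomialNB.
pipe = Pipeline([('hash', HashingVectorizer(analyzer=split_into_lemmas, n_features=1000, alternate_sign=False)),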
                 ('nb', MultinomialNB())])
pipe.fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))

### 3. Latent Semantic Indexing

_"Latent semantic analysis (LSA) is a technique in natural language processing of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text."_ from: [wiki](https://en.wikipedia.org/wiki/Latent_semantic_analysis)  
LSA can be performed by applying [`SVD`](http://scikit-learn.org/stable/modules/decomposition.html#truncated-singular-value-decomposition-and-latent-semantic-analysis) to `Tf-Idf` vectors.
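What the decomposition does can be seen on a tiny matrix. The sketch below uses a full `numpy.linalg.svd` and truncates it by hand; the weights in `X` are made-up numbers, and for large sparse inputs `TruncatedSVD` is the practical choice.

```python
import numpy as np

# Toy term weights: rows are documents, columns are terms (made-up numbers).
X = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.9, 0.0, 0.1],
              [0.0, 0.0, 1.0, 1.0],
              [0.1, 0.0, 0.9, 1.0]])

# Full SVD, then keep only the k strongest components (the "truncation").
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_topics = U[:, :k] * s[:k]   # documents in the k-dimensional concept space
topic_terms = Vt[:k]            # each row weights the original terms

print(doc_topics.round(2))
print(topic_terms.round(2))
```

Documents 0 and 1 share mostly the first two terms and documents 2 and 3 the last two, so two components are enough to capture nearly all of the structure in `X`.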

In [None]:
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC

In [None]:
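# Note: stop_words is ignored (with a warning) when analyzer is a callable, so it has no effect here.
pipe = Pipeline([('cntvec', CountVectorizer(analyzer=split_into_lemmas, stop_words='english')),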
                 ('tfidf', TfidfTransformer(sublinear_tf=True)),
                 ('svd', TruncatedSVD(n_components=100, random_state=42)),
                 ('svm', SVC(C=300))])
pipe.fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))

In [None]:
feat_names = pipe.steps[0][1].get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
topics = pipe.steps[2][1].components_
topic_str = pipe.steps[2][1].explained_variance_ratio_

In [None]:
def get_most_important(topic, feat_names):
    indices = np.argsort(topic)[-10:][::-1]
    return pd.DataFrame([{'topic': feat_names[index], 
                          'weight': topic[index]}
                         for index in indices]).set_index('topic')

In [None]:
fig, ax = plt.subplots(nrows=10, figsize=(15, 60))
for i in range(10):
    get_most_important(topics[i], feat_names).plot.bar(ax=ax[i])

### 4. Document similarity metrics

The Euclidean distance is not well suited to measuring how alike two documents are, because it is sensitive to document length. Instead, the cosine similarity is used, which is the cosine of the angle between the document vectors.

Conveniently, the cosine similarity can be computed as the dot product of the documents' tf-idf vectors, since `TfidfVectorizer` normalizes each vector to unit length by default.
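The relation between dot product and cosine similarity can be checked on two small vectors. This is a pure-Python sketch with made-up numbers; the helper names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]

# For unit-length vectors the dot product is the cosine similarity.
print(cosine_similarity(a, b))
print(dot(l2_normalize(a), l2_normalize(b)))

# A document repeated twice points in the same direction: similarity stays 1.
print(cosine_similarity(a, [2.0, 4.0, 0.0]))
```

The last line illustrates why the angle is preferred over the Euclidean distance: doubling every count moves the vector far away in space but does not change its direction.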

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Note: stop_words has no effect here because analyzer is a callable.
tfidf = TfidfVectorizer(analyzer=split_into_lemmas,
                        stop_words='english',
                        min_df=10,
                        max_df=.5).fit(corpus.message)
vects = tfidf.transform(corpus.message)

In [None]:
vect = tfidf.transform([corpus.message[0]])

In [None]:
sims = vects.dot(vect.T).toarray().flatten()
most_similar = np.argsort(sims)[::-1][:10]

print(corpus.message[0])
print('-' * 80)
for i, index in enumerate(most_similar):
    print(i, sims[index])
    print(corpus.message[index])
    print('-' * 80)

## Further reading

- [nltk introduction](http://www.nltk.org/book/ch01.html)
- [dive into nltk blogpost series](http://textminingonline.com/dive-into-nltk-part-i-getting-started-with-nltk)
- [sklearn text data introduction](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)
- [sklearn text feature extraction introduction](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)
- [spacy introduction](https://spacy.io/usage/spacy-101)
- [LSA tutorial](http://mccormickml.com/2016/03/25/lsa-for-text-classification-tutorial/)