# Intro to Data Science
## Part V. - Text Mining

### Table of contents
- #### Text Mining
    - <a href="#What-is-Text-Mining?">Theory</a>
    - <a href="#Text-Mining-in-practice">In practice</a>
    - <a href="#a)-Bag-of-words-representation">Vectorizing documents</a>
    - <a href="#b)-Tf-Idf">Normalizing document vectors</a>
    - <a href="#c)-Hashing">Vectorizing large corpus</a>
    - <a href="#3.-Latent-Semantic-Indexing">Topic modelling</a>
    - <a href="#4.-Document-similarity-metrics">Document similarity</a>
- #### ANN
    - <a href="#Neural-Networks">Single layer networks</a>
    - <a href="#Solving-non-linear-problems">Multi layer networks</a>
    

## What is Text Mining?
Text mining or text analytics is the process of extracting quantified features from (un)structured (natural language) texts. Processing unstructured data involves using Natural Language Processing (NLP), statistical modeling and machine learning techniques.

## Why is it important?
Roughly 80% of generated data is not available in a structured, numerical format (emails, texts, meeting notes, documents, social media feeds). Such unstructured data includes images, drawings, videos, voice recordings and unstructured texts. These data can be described with their metadata (length, topic, category, etc.), but transforming them into structured data is necessary to access the more detailed information stored in such sources. Voice recordings, videos and drawings can also be transcribed into unstructured text so that they can be processed as textual data. 
The most common use cases are:

- document similarity computation
- document deduplication
- document clustering
- topic extraction
- sentiment analysis
- automated annotation
- text filtering
- text classification

## Tools
- NLP tools
    - tokenization
    - stemming
    - part-of-speech (POS) tagging
    - stop word filtering
    - bag-of-words representation
    - tf-idf transformation
- Other tools
    - Word2Vec representation
    - hashing
    - cosine/jaccard/levenshtein/etc similarity computation
    - matrix factorization
    - etc

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import numpy as np
import scipy.sparse as sp
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Text Mining in practice

### 1. Read and examine data

Examining unstructured data is the key to proper preprocessing.  
A collection of texts is called a __`corpus`__.

In [None]:
with open('./data/SMSSpamCollection', 'rb') as spamfile:
    corpus = [line.decode('utf-8').strip() for line in spamfile]
len(corpus)

In [None]:
for text in corpus[:5]:
    print(text)

We can see that the data is in TSV format, so let's read it accordingly.

In [None]:
corpus = pd.read_csv('./data/SMSSpamCollection', sep='\t', names=['label', 'message'])

In [None]:
corpus.groupby('label').describe()

In [None]:
corpus['length'] = corpus.message.str.len()
corpus.head()

In [None]:
corpus['wordcount'] = corpus.message.str.split().str.len()
corpus.head()

In [None]:
corpus.length.plot(bins=20, kind='hist');

In [None]:
corpus.length.describe()

A 910-character SMS?

In [None]:
corpus.loc[corpus.length > 900, 'message'].values

Is there a difference between spam and ham messages?

In [None]:
corpus[['length', 'label']].hist(bins=50, by='label', sharex=True);

Why not try a simple predictor?

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
splitted = train_test_split(corpus.length.values[:, np.newaxis], # we need a matrix, not a vector
                            corpus.label.values,
                            test_size=.25,
                            random_state=42)
X_train, X_test, y_train, y_test = splitted

In [None]:
pipe = Pipeline([('nb', MultinomialNB())])
pipe.fit(X_train, y_train)

In [None]:
accuracy_score(y_test, pipe.predict(X_test))

87% accuracy is our baseline. Let's get into the preprocessing!

### 2. Preprocessing
#### a) [Bag-of-words representation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)

The bag-of-words representation represents each document as a vector in which every distinct word has a fixed position. The value at each position is the number of occurrences of that word in the given document. For example, the vector features for the documents:
```python
docs = ["I like trains.", "Trains are like big cars.", "I like big cars"]
```
will be 
```python
features = {'I': 0, 'like': 1, 'trains': 2, 'are': 3, 'big': 4, 'cars': 5}
```
and the vectorial representations will be
```python
vectors = [[1, 1, 1, 0, 0, 0],
           [0, 1, 1, 1, 1, 1],
           [1, 1, 0, 0, 1, 1]]
```
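
The example above can be reproduced by hand in a few lines. This is only a minimal sketch: the `tokenize` helper and its regular expression are assumptions made for illustration, and every word is lowercased, so `'I'` shows up as `'i'` in the feature dictionary.

```python
import re

docs = ["I like trains.", "Trains are like big cars.", "I like big cars"]

def tokenize(doc):
    """Lowercase the text and return its alphabetic tokens (hypothetical helper)."""
    return re.findall(r"[a-z]+", doc.lower())

# Give every new word the next free column index, in order of first appearance.
features = {}
for doc in docs:
    for word in tokenize(doc):
        features.setdefault(word, len(features))

# Count how often each word occurs in each document.
vectors = []
for doc in docs:
    row = [0] * len(features)
    for word in tokenize(doc):
        row[features[word]] += 1
    vectors.append(row)

print(features)
print(vectors)
```

The column order here follows the order of first appearance, which is a choice of this sketch and not necessarily how a library orders its vocabulary.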

Fortunately `scikit-learn` has a built-in solution for this: the [`CountVectorizer`](http://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation).  
Let's try out our little example:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
cntvec = CountVectorizer()
docs = ["I like trains.",
        "Trains are like big cars.",
        "I like big cars"]

cntvec.fit_transform(docs).todense(), cntvec.vocabulary_

#### N-grams
N-grams are tuples of n consecutive words. They are generated by sliding an n-long window over the text, and they can provide contextual information which sometimes yields better results. The 2-grams of the `"I like trains."` sentence are:
```python
[("I", "like"), ("like", "trains")]
```
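
The rolling window itself is easy to sketch in plain Python. The `ngrams` function below is a hypothetical helper written for illustration and assumes the sentence has already been split into tokens.

```python
def ngrams(tokens, n):
    """Return all n-grams of a token list using an n-long rolling window."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["I", "like", "trains"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A token list shorter than `n` simply yields an empty list, because the window never fits.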

In [None]:
cntvec = CountVectorizer(ngram_range=(2, 2))
cntvec.fit_transform(docs).todense(), cntvec.vocabulary_

#### Minimum and maximum document frequency

Minimum and maximum document frequency (`min_df` and `max_df`) are thresholds that limit the number of features. If a _term_ (transformed word) appears in fewer than `min_df` or in more than `max_df` documents, it is left out of the vocabulary. An integer value is an absolute document count, while a float between 0 and 1 is a proportion of the documents.

In [None]:
cntvec = CountVectorizer(max_df=1)
cntvec.fit_transform(docs).todense(), cntvec.vocabulary_

In [None]:
cntvec = CountVectorizer(min_df=3)
cntvec.fit_transform(docs).todense(), cntvec.vocabulary_

### Advanced tokenization

During vocabulary generation each word is analyzed and transformed in order to reduce the vocabulary size. Scikit-learn's default analyzer lowercases the text and drops tokens shorter than two characters; stop words are filtered only when the `stop_words` parameter is set, and no further transformation is done.

NLP has more detailed techniques to better extract the base words. Lemmatization is a powerful tool to reduce a word to its _root_ form (as it appears in dictionaries): that's how `are` becomes `be`, `trains` becomes `train`, etc.

#### Stemming

Word stemming means removing affixes from a word and returning the remaining stem. Stemming does not use contextual information to do the stripping.

##### NLTK, the base NLP library
There is an almost standard library called [`Natural Language ToolKit`](https://www.nltk.org/) for basic NLP tasks, like stemming.  
To use:
```bash
conda install nltk
```

In [None]:
from nltk.stem import PorterStemmer

In [None]:
stemmer = PorterStemmer()
stemmer.stem('trains')

#### Lemmatization

Lemmatizing is similar to stemming; the difference is that the result of lemmatizing is a real word.

In [None]:
import nltk
nltk.download('wordnet')

In [None]:
from nltk.stem import WordNetLemmatizer

In [None]:
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('are', pos='v')

Lemmatization is more accurate but slower than stemming.

Using lemmatization we can create custom analyzers to use in a `CountVectorizer`.

In [None]:
def split_into_lemmas(message):
    message = ''.join([char for char in message.lower()
                       if char.isalnum() or char.isspace()])
    return [lemmatizer.lemmatize(word, pos='v') 
            for word in message.split()]

[split_into_lemmas(doc) for doc in docs]

##### Textblob, an advanced NLP library

The [__`textblob`__](https://textblob.readthedocs.io/en/dev/) library provides a more user friendly interface for lemmatization.  
To use:
```bash
conda install -c conda-forge textblob
```

In [None]:
from textblob import TextBlob

In [None]:
def split_into_lemmas(message):
    message = message.lower()
    words = TextBlob(message).words
    return [word.lemma for word in words]

[split_into_lemmas(doc) for doc in docs]

In [None]:
cntvec = CountVectorizer(analyzer=split_into_lemmas)
cntvec.fit_transform(docs).todense(), cntvec.vocabulary_

Let's insert our vectorizer into our pipeline!

In [None]:
splitted = train_test_split(corpus.message,
                            corpus.label.values,
                            test_size=.25,
                            random_state=42)
X_train, X_test, y_train, y_test = splitted

In [None]:
pipe = Pipeline([('cntvec', CountVectorizer(analyzer=split_into_lemmas, min_df=10, max_df=.5)),
                 ('nb', MultinomialNB())])
pipe.fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))

In [None]:
len(pipe['cntvec'].vocabulary_)

##### Spacy, a more advanced NLP library

There is an advanced library called [`spacy`](https://spacy.io/) which has more sophisticated tokenization and lemmatization.  
To use (__requires admin rights!__):
```bash
conda install -c conda-forge spacy
python -m spacy download en_core_web_sm
```

In [None]:
import spacy

In [None]:
nlp = spacy.load('en_core_web_sm')
[token.lemma_ for token in nlp(docs[0])]

In [None]:
pd.DataFrame([
    {'text': token.text, 
     'lemma': token.lemma_, 
     'POS': token.pos_, 
     'tag': token.tag_, 
     'dep': token.dep_,
     'shape': token.shape_,
     'is_alpha': token.is_alpha, 
     'is_stop': token.is_stop}
    for token in nlp(docs[0])
]).set_index('text').transpose()

In [None]:
pipe = Pipeline([('cntvec', CountVectorizer(analyzer=lambda x: [w.lemma_ for w in nlp(x)], 
                                            min_df=10, max_df=.5)),
                 ('nb', MultinomialNB())])
pipe.fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))

#### b) [Tf-Idf](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

Tf-Idf is short for Term Frequency - Inverse Document Frequency and is a way of normalizing term counts. It is a product of two separate metrics:

- _Term Frequency_ shows how often a term appears in a document. $\mathrm{tf}(t,d) = \frac{1}{2} + \frac{f_{t,d}}{2 \cdot \max\{f_{t',d} : t' \in d\}}$ where $f_{t,d}$ is the raw count of term $t$ in document $d$.
- _Inverse Document Frequency_ shows if a term is common or rare across all documents. $ \mathrm {idf}(t, D) =  \log \frac{N}{|\{d \in D: t \in d\}|}$ where $N$ is the total number of documents in the corpus, $t$ is a term, $D$ is the set of documents.
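
The two formulas above can be worked through on a toy count matrix. This is a sketch under stated assumptions: the `counts` matrix is made up for illustration, and terms that do not occur in a document are kept at zero instead of receiving the constant $\frac{1}{2}$. Note that scikit-learn's `TfidfTransformer` uses a different variant by default (raw counts, a smoothed idf and L2 normalization), so its numbers will not match these.

```python
import numpy as np

# Made-up term counts: three documents (rows) over four terms (columns).
counts = np.array([[2, 1, 0, 0],
                   [1, 0, 1, 0],
                   [1, 0, 0, 3]], dtype=float)

n_docs = counts.shape[0]

# Augmented term frequency: 0.5 + f / (2 * max f in the same document).
# Assumption of this sketch: absent terms stay at zero.
max_per_doc = counts.max(axis=1, keepdims=True)
tf = np.where(counts > 0, 0.5 + 0.5 * counts / max_per_doc, 0.0)

# Inverse document frequency: log(N / number of documents containing the term).
df = (counts > 0).sum(axis=0)
idf = np.log(n_docs / df)

tfidf = tf * idf
print(tfidf.round(3))
```

The first term occurs in every document, so its idf is $\log 1 = 0$ and it carries no weight, which is exactly the normalizing effect Tf-Idf is meant to have.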

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer

In [None]:
pipe = Pipeline([('cntvec', CountVectorizer(analyzer=split_into_lemmas, min_df=5, max_df=.9)),
                 ('tfidf', TfidfTransformer()),
                 ('nb', MultinomialNB())])
pipe.fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))

In [None]:
np.argsort([5, 3, 7, 9, 1])

In [None]:
for word in np.argsort(pipe['tfidf'].idf_)[-20:][::-1]:
    print(word, pipe['cntvec'].get_feature_names_out()[word], pipe['tfidf'].idf_[word])

#### c) [Hashing](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html#sklearn.feature_extraction.text.HashingVectorizer)

Really large corpora cause memory problems: the larger the corpus, the larger the vocabulary that has to be kept in memory.
To avoid this issue a [_hashing trick_](http://scikit-learn.org/stable/modules/feature_extraction.html#feature-hashing) can be used. Basically, instead of storing each distinct word in a dictionary, it applies a hash function to the features to determine their column index in the sample matrices directly.
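
The idea can be sketched without any library support. The functions `hashed_index` and `hash_vectorize` below are hypothetical helpers, and MD5 is used only as a convenient deterministic stand-in; scikit-learn's `HashingVectorizer` relies on a different hash function (MurmurHash3).

```python
import hashlib

def hashed_index(term, n_features):
    """Map a term to a column index without storing a vocabulary."""
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features

def hash_vectorize(tokens, n_features=8):
    """Count tokens into a fixed-length vector addressed by hashed index."""
    vector = [0] * n_features
    for token in tokens:
        vector[hashed_index(token, n_features)] += 1
    return vector

vec = hash_vectorize(["i", "like", "big", "cars", "like"])
print(vec)
```

The price of the fixed vector length is that different terms may collide in the same column, and that the original term can no longer be looked up from a column index.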

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer

In [None]:
pipe = Pipeline([('hash', HashingVectorizer(analyzer=split_into_lemmas, n_features=1000, alternate_sign=False)),
                 ('nb', MultinomialNB())])
pipe.fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))

### 3. Latent Semantic Indexing

_"Latent semantic analysis (LSA) is a technique in natural language processing of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text."_ from: [wiki](https://en.wikipedia.org/wiki/Latent_semantic_analysis)  
LSA can be performed by applying SVD to the Tf-Idf vectors.

In [None]:
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC

In [None]:
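# NOTE: scikit-learn ignores `stop_words` when `analyzer` is a callable, so no stop words are removed here.
pipe = Pipeline([('cntvec', CountVectorizer(analyzer=split_into_lemmas, stop_words='english')),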
                 ('tfidf', TfidfTransformer(sublinear_tf=True)),
                 ('svd', TruncatedSVD(n_components=300, random_state=42)),
                 ('svm', SVC(C=300))])
pipe.fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))

In [None]:
feat_names = pipe['cntvec'].get_feature_names_out()
topics = pipe['svd'].components_
topic_str = pipe['svd'].explained_variance_ratio_

In [None]:
def get_most_important(topic, feat_names):
    # indices of the ten terms with the largest weights in the topic
    indices = np.argsort(topic)[::-1][:10]
    terms = [feat_names[idx] for idx in indices]
    weights = [topic[idx] for idx in indices]
    return dict(zip(terms, weights))

In [None]:
for i in range(10):
    print(i, topic_str[i], get_most_important(topics[i], feat_names))
    print('-' * 80)

### 4. Document similarity metrics

Euclidean distance is not well suited to measuring the likeness of documents, because it is sensitive to document length. Instead, cosine similarity is used, which is the cosine of the angle between the document vectors.

It is super convenient that the cosine similarity can be computed as the dot product between document tf-idf vectors, because `TfidfVectorizer` L2-normalizes them by default.
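
A small numeric check makes the relationship concrete. The vectors and the `cosine_sim` helper below are made up for illustration: the cosine similarity depends only on the direction of the vectors, and for unit-length vectors it reduces to a plain dot product.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice as long
c = np.array([0.0, 0.0, 3.0])  # shares no non-zero position with a

# Scale the vectors to unit length (L2 normalization).
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

print(cosine_sim(a, b), cosine_sim(a, c))
print(a_unit @ b_unit)
print(np.linalg.norm(a - b))  # the Euclidean distance is non-zero despite the identical direction
```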

In [None]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
def split_to_lemmas_and_filter(message):
    lemmas = split_into_lemmas(message)
    return [lemma for lemma in lemmas 
            if lemma not in ENGLISH_STOP_WORDS]
    
tfidf = TfidfVectorizer(analyzer=split_to_lemmas_and_filter,
                        min_df=10,
                        max_df=.5).fit(corpus.message)
vects = tfidf.transform(corpus.message)

In [None]:
vect = tfidf.transform([corpus.message[0]])
corpus.message[0], vect

In [None]:
sims = vects.dot(vect.T).toarray().flatten()
most_similar = np.argsort(sims)[-10:][::-1]

for i, index in enumerate(most_similar):
    print(i, sims[index])
    print(corpus.message[index])
    print('-' * 80)

## Model of the week:
### Neural Networks

<img src ="pics/artificial_neuron.png" width="300" align="left"/>

Artificial Neural Networks are a supervised machine learning method for classification and regression. They are loosely based on the inner workings of the (human) brain and consist of basic execution units and the connections between them.  

The execution units are called __neurons__, and their job is to compute the weighted sum of their inputs and then apply an output function. Because of this simple nature a single [neuron](https://en.wikipedia.org/wiki/Perceptron) is only capable of solving linear problems. Its mechanism is easily expressed by the following equation:
$$y_{j}(t) = f\left(\sum_{i}w_{i}(t)\,x_{j,i}\right)$$
where $x_{j,i}$ is the $i$th feature of the $j$th input and $w_{i}(t)$ is the corresponding weight at step $t$.

The learning process is simply updating the input weights: 
$$w_{i}(t+1)=w_{i}(t)+(d_{j}-y_{j}(t))\,x_{j,i}$$
where $d_{j}$ is the expected output for the $j$th input.
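
The update rule above can be sketched in a few lines of NumPy. Everything specific here is an assumption made for illustration: the step output function, the constant bias input, the number of epochs and the choice of the linearly separable AND problem as training data.

```python
import numpy as np

def step(v):
    """Output function f: 1 if the weighted sum is positive, else 0."""
    return np.where(v > 0, 1, 0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([0, 0, 0, 1])            # expected outputs of the AND function

Xb = np.c_[np.ones(len(X)), X]        # prepend a constant input that acts as bias
w = np.zeros(Xb.shape[1])             # w(0): start from all-zero weights

for epoch in range(20):
    for x_j, d_j in zip(Xb, d):
        y_j = step(w @ x_j)           # y_j(t) = f(sum_i w_i(t) * x_ji)
        w += (d_j - y_j) * x_j        # w(t+1) = w(t) + (d_j - y_j(t)) * x_j

predictions = step(Xb @ w)
print(w, predictions)
```

Weights only change when a prediction is wrong, so on a linearly separable problem the loop eventually stops changing them.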

In [None]:
from sklearn.linear_model import Perceptron

In [None]:
XOR_X, XOR_y = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]), np.array([0, 1, 1, 0])
df = pd.DataFrame(data=XOR_X, columns=['x', 'y'])
df['label'] = XOR_y

In [None]:
def plot_results_with_hyperplane(clf, clf_name, df, ax):
    x_min, x_max = df.x.min() - .5, df.x.max() + .5
    y_min, y_max = df.y.min() - .5, df.y.max() + .5

    xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    ax.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired, shading='auto')
    ax.scatter(df.x, df.y, c=df.label, edgecolors='k')
    ax.set_title(clf_name)

In [None]:
perceptron = Perceptron(verbose=2, random_state=42).fit(XOR_X, XOR_y)
perceptron

In [None]:
fig, ax = plt.subplots()
plot_results_with_hyperplane(perceptron, 'perceptron', df, ax);

In [None]:
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(XOR_y, perceptron.predict(XOR_X))
conf_mat

In [None]:
sns.heatmap(conf_mat)

#### Solving non-linear problems

As we can see, a single neuron is not able to solve this (non-linear) problem. But they are not called __networks__ for nothing! The power of artificial neural networks lies in their topology. If we connect several __neurons__ we get a (real) neural network. The neurons are organized into __layer__s. The first layer is the __input layer__, then there are zero or more __hidden layer__(s), and finally there is the __output layer__. Each layer can consist of any number of neurons. Different topologies give rise to different ANN subtypes.

<img src="pics/mlp.png" width=350 align="left"/>

The simplest version of ANNs is the multi-layer feed-forward perceptron network (MLP). The typical output (activation) functions are $y(v_i) = \tanh(v_i) ~~ \textrm{and} ~~ y(v_i) = (1+e^{-v_i})^{-1}$

The weight updating algorithm is called [__Backpropagation__](https://en.wikipedia.org/wiki/Backpropagation). It basically propagates the errors back to their "root" neurons, so every neuron which was (even slightly) responsible for an error gets its weights updated accordingly. The main updating equations are easily expressed by applying [gradient descent rules](https://en.wikipedia.org/wiki/Backpropagation#Derivation).
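
To make the error propagation less abstract, here is a minimal NumPy sketch of backpropagation for a network with one hidden layer. The layer sizes, learning rate, number of epochs, squared-error loss and random initialization are all assumptions of this sketch and have nothing to do with the defaults of `MLPClassifier`.

```python
import numpy as np

rng = np.random.default_rng(42)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# One hidden layer with 4 tanh neurons and a single sigmoid output neuron.
W1, b1 = rng.normal(scale=0.5, size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)
learning_rate = 0.5
losses = []

for epoch in range(5000):
    # Forward pass: compute the activations layer by layer.
    h = np.tanh(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(0.5 * float(np.sum((out - y) ** 2)))

    # Backward pass: push the output error back through the layers (chain rule).
    delta_out = (out - y) * out * (1 - out)            # error at the output neuron
    delta_hidden = (delta_out @ W2.T) * (1 - h ** 2)   # share of each hidden neuron

    # Gradient descent step on every weight and bias.
    W2 -= learning_rate * h.T @ delta_out
    b2 -= learning_rate * delta_out.sum(axis=0)
    W1 -= learning_rate * X.T @ delta_hidden
    b1 -= learning_rate * delta_hidden.sum(axis=0)

print(losses[0], losses[-1])
print(out.round(2).ravel())
```

Whether all four points end up on the right side depends on the initialization and the hyperparameters, so the loss values are the thing to inspect first.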

In [None]:
from sklearn.neural_network import MLPClassifier

In [None]:
mlp = MLPClassifier(random_state=42).fit(XOR_X, XOR_y)
mlp

In [None]:
fig, ax = plt.subplots()
plot_results_with_hyperplane(mlp, 'mlp', df, ax)

In [None]:
conf_mat = confusion_matrix(XOR_y, mlp.predict(XOR_X))
conf_mat

A super nice tutorial can be found [here](https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb), it is worth checking out.