# Intro to Data Science
## Part V. Text Mining

### Table of contents
- #### Text Mining
    - <a href="#What-is-Text-mining?">Theory</a>
    - <a href="#Tools">In practice</a>
- #### Neural Networks
    - <a href="#Neural-Networks">Theory</a>
    - <a href="#Example">Linear regression</a>
    

## What is Text Mining?
Text mining, or text analytics, is the process of extracting quantified features from structured or unstructured natural language texts. Processing unstructured data involves Natural Language Processing (NLP), statistical modeling and machine learning techniques.

## Why is it important?
An often-quoted estimate is that about 80% of generated data is not available in a structured, numerical format (emails, texts, meeting notes, documents, social media feeds). This unstructured data includes images, drawings, videos, voice recordings and unstructured texts. Such data can be described by its metadata (length, topic, category, etc.), but transforming it into structured data is necessary to access the more detailed information it holds. Voice recordings, videos and drawings can also be transcribed into unstructured text so they can be processed as textual data.  
The most common use cases are:

- document similarity computation
- document deduplication
- document clustering
- topic extraction
- sentiment analysis
- automated annotation
- text filtering
- text classification

## Tools
- NLP tools
    - tokenization
    - stemming
    - part-of-speech (POS) tagging
    - stop word filtering
    - bag-of-words representation
    - tf-idf transformation
- Other tools
    - Word2Vec representation
    - hashing
    - cosine/Jaccard/Levenshtein/etc. similarity computation
    - matrix factorization
    - etc
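
Two of the similarity measures listed above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a library implementation: the `tokenize` helper and its naive whitespace splitting are assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Hypothetical helper: lowercase, drop periods, split on whitespace."""
    return text.lower().replace(".", "").split()


def jaccard_similarity(a, b):
    """Size of the shared token set divided by the size of the union."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    """Cosine of the angle between the two token-count vectors."""
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(jaccard_similarity("I like trains.", "I like big cars"))
print(cosine_similarity("I like trains.", "I like big cars"))
```

Jaccard only looks at which tokens are present, while cosine similarity also takes the counts into account.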

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import numpy as np
import scipy.sparse as sp
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Text Mining in practice

### 1. Read and examine data

Examining unstructured data is the key to proper preprocessing.  
The collection of texts is called a __`corpus`__.

In [None]:
with open('./data/SMSSpamCollection') as spamfile:
    corpus = [line.strip() for line in spamfile]
len(corpus)

In [None]:
for text in corpus[:5]:
    print(text)

We can see that the data is in TSV format (a label and a message separated by a tab), so let's read it accordingly.

In [None]:
corpus = pd.read_csv('./data/SMSSpamCollection', sep='\t', names=['label', 'message'])

In [None]:
corpus.groupby('label').describe()

In [None]:
corpus['length'] = corpus.message.str.len()
corpus.head()

In [None]:
corpus.length.plot(bins=20, kind='hist')

In [None]:
corpus.length.describe()

A 910-character SMS???

In [None]:
list(corpus.message[corpus.length > 900])

Is there a difference between spam and ham messages?

In [None]:
corpus.hist(column='length', by='label', bins=50)

Why not try a simple predictor?

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
splitted = train_test_split(corpus.length.values[:, np.newaxis], # we need a matrix, not a vector
                            corpus.label.values,
                            train_size=.75,
                            random_state=42)
X_train, X_test, y_train, y_test = splitted

In [None]:
pipe = Pipeline([('nb', MultinomialNB())])
pipe.fit(X_train, y_train)

In [None]:
accuracy_score(y_test, pipe.predict(X_test))

87% accuracy is our baseline. Let's get into the preprocessing!

### 2. Preprocessing
#### a) Bag-of-words representation

The bag-of-words representation turns each document into a vector in which every distinct word has a fixed position. The value at a position is the number of occurrences of that word in the given document. For example:
The vector features for the documents:
```python
docs = ["I like trains.", "Trains are like big cars.", "I like big cars"]
```
will be 
```python
features = {'I': 0, 'like': 1, 'trains': 2, 'are': 3, 'big': 4, 'cars': 5}
```
and the vectorial representations will be
```python
vectors = [[1, 1, 1, 0, 0, 0],
           [0, 1, 1, 1, 1, 1],
           [1, 1, 0, 0, 1, 1]]
```
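
The construction above can be reproduced by hand. The following is a minimal sketch in plain Python, not what `scikit-learn` does internally; the regex-based `tokenize` helper is an assumption for this example, and because it lowercases every token the feature `I` shows up as `i`.

```python
import re

docs = ["I like trains.", "Trains are like big cars.", "I like big cars"]


def tokenize(text):
    """Hypothetical helper: lowercase the text and extract word tokens."""
    return re.findall(r"\w+", text.lower())


# Assign each new word the next free position, in order of first appearance.
features = {}
for doc in docs:
    for token in tokenize(doc):
        features.setdefault(token, len(features))

# Count the occurrences of every word at its fixed position.
vectors = []
for doc in docs:
    row = [0] * len(features)
    for token in tokenize(doc):
        row[features[token]] += 1
    vectors.append(row)

print(features)
print(vectors)
```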

Fortunately `scikit-learn` has a built-in solution for this: the [`CountVectorizer`](http://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation).  
Let's try out our little example:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
cntvec = CountVectorizer()
docs = ["I like trains.",
        "Trains are like big cars.",
        "I like big cars"]
cntvec.fit_transform(docs).todense(), cntvec.vocabulary_

#### N-grams
N-grams are tuples of n consecutive words. They are generated by sliding a window of length n over the text, and they can provide contextual information that sometimes yields better results. For example, the 2-grams (bigrams) of the sentence `"I like trains."` are:
```python
[("I", "like"), ("like", "trains")]
```
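
The rolling window itself is easy to write down. A minimal sketch, assuming the sentence has already been split into tokens (the `ngrams` function name is made up for this example):

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["I", "like", "trains"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A sentence with fewer than n tokens simply produces no n-grams.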

In [None]:
cntvec = CountVectorizer(ngram_range=(2, 2))
cntvec.fit_transform(docs).todense(), cntvec.vocabulary_

#### Minimum and maximum document frequency

Minimum and maximum document frequency (`min_df` and `max_df`) are thresholds that limit the number of features. If a _term_ (transformed word) appears in fewer than `min_df` or in more than `max_df` documents, it is left out of the vocabulary. An integer value is an absolute document count; a float between 0 and 1 is a proportion of the documents.
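
The filtering rule can be sketched in plain Python. This is an illustration of the idea on pre-tokenized toy documents, not the `scikit-learn` implementation; `document_frequency` and `filter_vocabulary` are hypothetical helper names.

```python
from collections import Counter

# The three example documents, already lowercased and tokenized
# (single-character tokens such as "i" are left out).
tokenized_docs = [["like", "trains"],
                  ["trains", "are", "like", "big", "cars"],
                  ["like", "big", "cars"]]


def document_frequency(tokenized_docs):
    """Count in how many documents each term appears (not how often)."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    return df


def filter_vocabulary(tokenized_docs, min_df=1, max_df=1.0):
    """Keep the terms whose document frequency lies within the thresholds."""
    n_docs = len(tokenized_docs)
    # Integers are absolute document counts, floats are proportions.
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    df = document_frequency(tokenized_docs)
    return sorted(term for term, count in df.items() if low <= count <= high)


print(filter_vocabulary(tokenized_docs, max_df=1))
print(filter_vocabulary(tokenized_docs, min_df=3))
```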

In [None]:
cntvec = CountVectorizer(max_df=1)
cntvec.fit_transform(docs).todense(), cntvec.vocabulary_

In [None]:
cntvec = CountVectorizer(min_df=3)
cntvec.fit_transform(docs).todense(), cntvec.vocabulary_

#### Lemmatization

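In the vocabulary generation process each word is analyzed and transformed in order to reduce the vocabulary size. Scikit-learn's default analyzer lowercases the text and drops single-character tokens (which is why `I` is missing from the vocabularies above); stop word filtering is available through the `stop_words` parameter but is off by default. No further transformation is done.

NLP has more detailed techniques to better extract the base words. Lemmatization is a powerful tool to reduce a word to its _root_ form (as it appears in dictionaries): that's how `are` becomes `be` and `trains` becomes `train`, etc. Note that a lemmatizer usually needs the part of speech to do this: a WordNet-based lemmatizer that treats every word as a noun will turn `trains` into `train` but leave `are` unchanged.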

In [None]:
import nltk
from textblob import TextBlob

In [None]:
nltk.download('wordnet')

In [None]:
def split_into_lemmas(message):
    message = message.lower()  # Python 3 strings are already unicode, no decoding needed
    words = TextBlob(message).words
    return [word.lemma for word in words]

[split_into_lemmas(doc) for doc in docs]

In [None]:
cntvec = CountVectorizer(analyzer=split_into_lemmas)
cntvec.fit_transform(docs).todense(), cntvec.vocabulary_

Let's insert our vectorizer into our pipeline!

In [None]:
splitted = train_test_split(corpus.message,
                            corpus.label.values,
                            train_size=.75,
                            random_state=42)
X_train, X_test, y_train, y_test = splitted

In [None]:
pipe = Pipeline([('cntvec', CountVectorizer(analyzer=split_into_lemmas, min_df=10, max_df=.5)),
                 ('nb', MultinomialNB())])
pipe.fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))

In [None]:
len(pipe.steps[0][-1].vocabulary_)

#### b) Tf-Idf

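Tf-idf (term frequency-inverse document frequency) rescales the raw counts of the bag-of-words representation: a term's weight grows with its count inside a document and shrinks with the number of documents that contain it, so terms that appear almost everywhere contribute little.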
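
The weighting can be worked through by hand on the toy documents used earlier. The sketch below assumes the smoothed formula that `TfidfTransformer` uses by default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each document vector; the helper names are made up for this example.

```python
import math

# Toy documents, already lowercased and tokenized.
tokenized_docs = [["like", "trains"],
                  ["trains", "are", "like", "big", "cars"],
                  ["like", "big", "cars"]]
n_docs = len(tokenized_docs)

# Document frequency: the number of documents containing each term.
df = {}
for tokens in tokenized_docs:
    for term in set(tokens):
        df[term] = df.get(term, 0) + 1

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
       for term, count in df.items()}


def tfidf(tokens):
    """Return the L2-normalized tf-idf weights of one document."""
    weights = {}
    for term in tokens:
        # Adding idf once per occurrence gives tf * idf in total.
        weights[term] = weights.get(term, 0.0) + idf[term]
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


print(idf)
print(tfidf(tokenized_docs[0]))
```

`like` appears in every document, so it gets the lowest possible idf, while the rare `are` gets the highest.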

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer

In [None]:
pipe = Pipeline([('cntvec', CountVectorizer(analyzer=split_into_lemmas, min_df=5, max_df=.9)),
                 ('tfidf', TfidfTransformer()),
                 ('nb', MultinomialNB())])
pipe.fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))

In [None]:
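# Note: newer scikit-learn releases replace get_feature_names() with get_feature_names_out()
terms = pipe.steps[0][1].get_feature_names()
for idx in np.argsort(pipe.steps[1][1].idf_)[-20:][::-1]:
    print(idx, terms[idx], pipe.steps[1][1].idf_[idx])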

#### c) Hashing

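The hashing trick maps each term to a column index with a hash function instead of storing a vocabulary. The vectorizer therefore needs no fitting and uses a fixed amount of memory, at the cost of possible collisions (different terms sharing a column) and of losing the mapping from columns back to terms.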
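
The idea behind feature hashing can be sketched without any library support. This is an illustration only: it uses MD5 from the standard library as a stand-in hash function, which is an assumption of this example and not the hash that `HashingVectorizer` uses.

```python
import hashlib


def hashed_counts(tokens, n_features=8):
    """Count tokens in a fixed-length vector indexed by a hash of the token."""
    vector = [0] * n_features
    for token in tokens:
        digest = hashlib.md5(token.encode("utf8")).hexdigest()
        index = int(digest, 16) % n_features  # column for this token
        vector[index] += 1
    return vector


tokens = ["trains", "are", "like", "big", "cars"]
print(hashed_counts(tokens))
```

No vocabulary is stored, so unseen words need no special handling, but two different words may land in the same column and the columns can no longer be mapped back to words.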

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer

In [None]:
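# alternate_sign=False keeps every feature value non-negative, as MultinomialNB requires (it replaces the removed non_negative=True)
pipe = Pipeline([('hash', HashingVectorizer(analyzer=split_into_lemmas, n_features=1000, alternate_sign=False)),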
                 ('nb', MultinomialNB())])
pipe.fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))

### 3. Latent Semantic Indexing

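Topic modelling for the masses: tf-idf + SVD = LSA (Latent Semantic Analysis, also known as Latent Semantic Indexing).

Truncated SVD projects the sparse tf-idf vectors onto a small number of dense latent dimensions, so documents that use related terms end up close to each other.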
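
The SVD step can be illustrated on a tiny matrix. The sketch below is a simplification under stated assumptions: it uses raw counts instead of tf-idf weights and `np.linalg.svd` on a dense array instead of `TruncatedSVD` on a sparse one, and the column order of the toy matrix is chosen for this example.

```python
import numpy as np

# Toy document-term count matrix (rows: documents).
# Columns: like, trains, are, big, cars
X = np.array([[1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1],
              [1, 0, 0, 1, 1]], dtype=float)

# X = U @ diag(s) @ Vt, with singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent dimensions to keep
docs_latent = U[:, :k] * s[:k]       # documents in the k-dimensional space
X_approx = docs_latent @ Vt[:k, :]   # rank-k approximation of X

print(docs_latent)
print(np.round(X_approx, 2))
```

Each document is now described by `k` dense values instead of one count per term, and those values are what a downstream classifier receives.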

In [None]:
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC

In [None]:
pipe = Pipeline([('cntvec', CountVectorizer(analyzer=split_into_lemmas)),
                 ('tfidf', TfidfTransformer(sublinear_tf=True)),
                 ('svd', TruncatedSVD(n_components=300, random_state=42)),
                 ('svm', SVC(C=300))])
pipe.fit(X_train, y_train)
accuracy_score(y_test, pipe.predict(X_test))

## Model of the week:
### Neural Networks
They are fun!

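A single perceptron is a linear classifier, so it can only separate classes with a straight line (a hyperplane in higher dimensions). XOR is the classic example of a problem that is not linearly separable, while a multi-layer perceptron (MLP) with a hidden layer can represent it. The cells below compare the two on the four XOR points.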
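
Why XOR needs more than one unit can be checked by hand. The sketch below is an illustration with hand-picked weights, not a trained model and not what `scikit-learn` does: it searches a small grid of weights for a single threshold unit that fits XOR, then wires two hidden units (OR and AND) into a network that does. All function names are made up for this example.

```python
from itertools import product


def step(z):
    """Threshold activation: fire only for strictly positive input."""
    return 1 if z > 0 else 0


def linear_unit(x, y, w1, w2, b):
    """A single perceptron-style unit with two inputs."""
    return step(w1 * x + w2 * y + b)


def two_layer_xor(x, y):
    """Hand-wired network: XOR = OR and not AND."""
    h_or = linear_unit(x, y, 1, 1, -0.5)   # fires if at least one input is 1
    h_and = linear_unit(x, y, 1, 1, -1.5)  # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)


XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

# Try every weight combination on a coarse grid from -3.0 to 3.0.
grid = [i / 2 for i in range(-6, 7)]
single_unit_solutions = [
    (w1, w2, b) for w1, w2, b in product(grid, repeat=3)
    if all(linear_unit(x, y, w1, w2, b) == label
           for (x, y), label in XOR.items())
]

print(len(single_unit_solutions))
print([two_layer_xor(x, y) for (x, y) in XOR])
```

The grid search finds no single unit that fits all four points, which is consistent with XOR not being linearly separable, while the two-layer wiring reproduces the truth table.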

In [None]:
from sklearn.linear_model import Perceptron

In [None]:
XOR_X, XOR_y = np.array([[0,0], [0, 1], [1, 0], [1, 1]]), np.array([0, 1, 1, 0])
df = pd.DataFrame(data=XOR_X, columns=['x', 'y'])
df['label'] = XOR_y

In [None]:
def plot_results_with_hyperplane(clf, clf_name, df, ax):
    x_min, x_max = df.x.min() - .5, df.x.max() + .5
    y_min, y_max = df.y.min() - .5, df.y.max() + .5

    xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    ax.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
    ax.scatter(df.x, df.y, c=df.label, edgecolors='k')
    ax.set_title(clf_name)

In [None]:
perceptron = Perceptron(verbose=2, random_state=42).fit(XOR_X, XOR_y)
perceptron

In [None]:
fig, ax = plt.subplots()
plot_results_with_hyperplane(perceptron, 'perceptron', df, ax)

In [None]:
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(XOR_y, perceptron.predict(XOR_X))
conf_mat

In [None]:
sns.heatmap(conf_mat)

In [None]:
from sklearn.neural_network import MLPClassifier

In [None]:
mlp = MLPClassifier(random_state=42).fit(XOR_X, XOR_y)
mlp

In [None]:
fig, ax = plt.subplots()
plot_results_with_hyperplane(mlp, 'mlp', df, ax)

In [None]:
conf_mat = confusion_matrix(XOR_y, mlp.predict(XOR_X))
conf_mat