# 1. Text Pre-processing

## Tokenization

Tokenization is the process of splitting an input sequence into units called tokens.

![white_token.png](pics/white_token.png)
![white_token.png](pics/more_tokens.png)


In [2]:
import nltk

text = "This is Andrew's text, isn't it?"

In [3]:
tokenizer = nltk.tokenize.WhitespaceTokenizer()
tokenizer.tokenize(text)

['This', 'is', "Andrew's", 'text,', "isn't", 'it?']

In [5]:
tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokenizer.tokenize(text)

['This', 'is', 'Andrew', "'s", 'text', ',', 'is', "n't", 'it', '?']

In [6]:
tokenizer = nltk.tokenize.WordPunctTokenizer()
tokenizer.tokenize(text)

['This', 'is', 'Andrew', "'", 's', 'text', ',', 'isn', "'", 't', 'it', '?']
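
The whitespace and word/punctuation splits above can be approximated with the standard library alone. This is a minimal sketch, not NLTK's implementation; the regular expression is the pattern NLTK documents for `WordPunctTokenizer`, and the variable names are only illustrative.

```python
import re

text = "This is Andrew's text, isn't it?"

# Whitespace tokenization: punctuation stays glued to the neighbouring word.
whitespace_tokens = text.split()

# Word/punctuation tokenization: runs of word characters and runs of
# punctuation become separate tokens.
word_punct_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(whitespace_tokens)
print(word_punct_tokens)
```

The Treebank tokenizer is harder to imitate in one line because it applies a set of English-specific rules (for example, splitting `isn't` into `is` and `n't`).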

## Token Normalization

We may want the same token for different forms of a word:
* wolf, wolves -> wolf
* talk, talks -> talk

### Stemming 
* A process of removing and replacing suffixes to get to the root form of the word, which is called the **stem**.
* Usually refers to heuristics that chop off suffixes
![lemma_1.png](pics/lemma_1.png)

### Lemmatization
* Usually refers to doing things properly with the use of a vocabulary and morphological analysis
* Returns the base or dictionary form of a word, which is known as the **lemma**


![lemma_1.png](pics/lemma_2.png)



In [8]:
text = "feet cats wolves talked"
tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokens = tokenizer.tokenize(text)


In [9]:
stemmer = nltk.stem.PorterStemmer()
" ".join(stemmer.stem(token) for token in tokens)

'feet cat wolv talk'

In [13]:
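lemmatizer = nltk.stem.WordNetLemmatizer()
# lemmatize() treats each token as a noun by default (pos="n"), so the verb form "talked" is returned unchanged
" ".join(lemmatizer.lemmatize(token) for token in tokens)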

'foot cat wolf talked'
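
To make the difference between the two approaches concrete, here is a toy sketch under stated assumptions. `toy_stem` applies only the first suffix rules of the Porter algorithm (SSES -> SS, IES -> I, SS -> SS, S -> nothing), so its output differs from the full `PorterStemmer` above; `toy_lemmatize` uses a tiny hand-written lookup table (`LEMMA_TABLE`, hypothetical) in place of a real vocabulary such as WordNet.

```python
# First Porter-style suffix rules, tried in order; the first match wins.
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def toy_stem(token):
    """Heuristic stemming: chop or replace a suffix, no dictionary involved."""
    for suffix, replacement in SUFFIX_RULES:
        if token.endswith(suffix):
            return token[:-len(suffix)] + replacement
    return token

# Hypothetical mini-vocabulary standing in for a real lexical database.
LEMMA_TABLE = {"feet": "foot", "wolves": "wolf", "cats": "cat"}

def toy_lemmatize(token):
    """Dictionary-based lemmatization: look the word up, else leave it alone."""
    return LEMMA_TABLE.get(token, token)

tokens = ["feet", "cats", "wolves", "talked"]
print(" ".join(toy_stem(t) for t in tokens))
print(" ".join(toy_lemmatize(t) for t in tokens))
```

The heuristic cannot handle irregular forms such as `feet`, while the lookup handles them but only for words it knows.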

![norm_problems.png](pics/norm_problems.png)

## Summary
* We can think of text as a sequence of tokens
* Tokenization is a process of extracting those tokens
* We can normalize tokens using stemming or lemmatization
* We can also normalize casing and acronyms 

# 2. Feature Extraction from text

## Bag of Words - BOW
![bow](pics/bow.png)

## N-grams
![bow](pics/bow_1.png)

![bow](pics/remove_ngrams.png)

![bow](pics/remove_ngrams_1.png)

There are a lot of medium-frequency n-grams:
* Looking at n-gram frequency in our corpus proved useful for filtering out bad n-grams
* What if we also use it to rank medium-frequency n-grams?
* **Idea**: an n-gram with a smaller frequency can be more discriminating because it can capture a specific issue in the review.
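
The frequency-based filtering described above can be sketched in a few lines. This is an illustration, not a library API: the helper names and the thresholds `min_df` and `max_df_ratio` are assumptions chosen for the example, and tokens are produced with a plain `split()`.

```python
from collections import Counter

def ngrams(tokens, n_max=2):
    """Return all n-grams of length 1..n_max as space-joined strings."""
    out = []
    for n in range(1, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

def filter_by_doc_frequency(docs, min_df=2, max_df_ratio=0.5, n_max=2):
    """Keep n-grams that are neither too rare nor too frequent across documents."""
    df = Counter()
    for doc in docs:
        # Use a set so each n-gram counts once per document.
        df.update(set(ngrams(doc.split(), n_max)))
    limit = max_df_ratio * len(docs)
    return sorted(t for t, c in df.items() if min_df <= c <= limit)

docs = ["good movie", "not a good movie", "did not like", "i like it", "good one"]
kept = filter_by_doc_frequency(docs)
print(kept)
```

Here `good` is dropped for being too frequent (3 of 5 documents), and n-grams seen in a single document are dropped as too rare.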

### TF-IDF

![tf_idf_1.png](pics/tf_idf_1.png)
![inv_tf.png](pics/inv_tf.png)
![tf_idf_pros.png](pics/tf_idf_pros.png)
![tidf_code.png](pics/tidf_code.png)



In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd 
texts = [
    "good movie", "not a good movie", "did not like", "i like it", "good one"
]
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
features = tfidf.fit_transform(texts)
pd.DataFrame(
    features.todense(),
    columns=tfidf.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
)

|   | good movie | like | movie | not |
|---|---|---|---|---|
| 0 | 0.707107 | 0.0 | 0.707107 | 0.0 |
| 1 | 0.57735 | 0.0 | 0.57735 | 0.57735 |
| 2 | 0.0 | 0.707107 | 0.0 | 0.707107 |
| 3 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 |
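
The numbers in this output can be reproduced by hand. The sketch below assumes scikit-learn's documented defaults for `TfidfVectorizer`: tokens of fewer than two characters are dropped, the smoothed inverse document frequency is `ln((1 + n) / (1 + df)) + 1`, and each row is L2-normalized. The vocabulary is hard-coded to the four features that survive `min_df=2, max_df=0.5`; the helper names are illustrative.

```python
import math

docs = ["good movie", "not a good movie", "did not like", "i like it", "good one"]
vocab = ["good movie", "like", "movie", "not"]  # features kept by the vectorizer above

def doc_ngrams(text):
    # Drop one-character tokens first, then build bigrams from what remains.
    words = [w for w in text.split() if len(w) >= 2]
    return words + [" ".join(pair) for pair in zip(words, words[1:])]

def tfidf_rows(docs, vocab):
    n = len(docs)
    doc_terms = [doc_ngrams(d) for d in docs]
    df = {t: sum(t in terms for terms in doc_terms) for t in vocab}
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for terms in doc_terms:
        raw = [terms.count(t) * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0  # all-zero rows stay zero
        rows.append([x / norm for x in raw])
    return rows

rows = tfidf_rows(docs, vocab)
for row in rows:
    print([round(x, 6) for x in row])
```

All four features appear in exactly two documents, so they share one IDF value and the normalized weights depend only on how many of them occur in each row.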


## Summary
* We've built simple count features in a bag-of-words manner
* You can add n-grams
* You can replace counts with TF-IDF values

# 3. Linear models for Sentiment analysis

We will use the IMDb dataset, which has 50k reviews: 25k positive and 25k negative.

In [6]:
data = pd.read_csv("data/IMDB Dataset.csv")
print(data.shape)

(50000, 2)


![sparse_data.png](pics/sparse_data.png)
![sparse_models.png](pics/sparse_models.png)
![lr_1.png](pics/lr_1.png)
![lr_acc.png](pics/lr_acc.png)
![lr_acc_1.png](pics/lr_acc_1.png)
![lr_acc_1.png](pics/lr_acc_2.png)
![lr_acc_1.png](pics/lr_acc_3.png)
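
As a rough illustration of a linear model over bag-of-words features, here is a pure-Python logistic regression trained with stochastic gradient descent on a tiny made-up corpus. Everything in it (the toy reviews, `epochs`, `lr`, the helper names) is an assumption for the sketch; on the real IMDb data one would normally use a library implementation such as scikit-learn's `LogisticRegression` on sparse TF-IDF features.

```python
import math
from collections import Counter

def bow(text):
    """Sparse bag-of-words representation: token -> count."""
    return Counter(text.lower().split())

def predict_proba(text, weights, bias):
    z = bias + sum(weights.get(tok, 0.0) * cnt for tok, cnt in bow(text).items())
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(samples, epochs=200, lr=0.5):
    """Plain SGD on the logistic loss; only weights of seen tokens are stored."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in samples:
            grad = predict_proba(text, weights, bias) - label
            bias -= lr * grad
            for tok, cnt in bow(text).items():
                weights[tok] = weights.get(tok, 0.0) - lr * grad * cnt
    return weights, bias

# Hypothetical toy reviews: 1 = positive, 0 = negative.
train = [("good movie", 1), ("great film", 1), ("bad movie", 0), ("awful film", 0)]
weights, bias = train_logreg(train)
print(sorted(weights.items(), key=lambda kv: kv[1]))
```

Because each weight belongs to one token, the learned model stays easy to inspect: sentiment words get large positive or negative weights, while neutral words such as `movie` stay near zero.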

## Summary 

* Bag of words and simple linear models actually work for texts
* The accuracy gain from deep learning models is not mind blowing for sentiment classification




# 4. Hashing trick in spam filtering 

![m_n_g.png](pics/m_n_g.png)
![m_n_g_1.png](pics/m_n_g_1.png)
![m_n_g_1.png](pics/m_n_g_2.png)
![m_n_g_1.png](pics/m_n_g_3.png)
![m_n_g_1.png](pics/m_n_g_4.png)
![m_n_g_1.png](pics/m_n_g_5.png)
![m_n_g_1.png](pics/m_n_g_6.png)
![m_n_g_1.png](pics/m_n_g_7.png)
![m_n_g_1.png](pics/m_n_g_8.png)

## Summary

* We've looked at applications of feature hashing
* Personalized features are a nice trick
* Linear models over bag of words scale well for production
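
A minimal sketch of feature hashing with personalized features, under stated assumptions: tokens are mapped to one of `2 ** n_bits` buckets with a stable hash (MD5 here, chosen only because Python's built-in `hash()` is randomized per process), and each token is additionally hashed together with a user id. The function names, the `user_token` naming scheme, and the default `n_bits` are illustrative, not taken from a specific library.

```python
import hashlib

def bucket(feature, n_bits=18):
    """Map a feature name to a bucket index in [0, 2 ** n_bits)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << n_bits)

def hashed_features(tokens, user_id=None, n_bits=18):
    """Sparse hashed bag of words: bucket index -> count."""
    counts = {}
    for tok in tokens:
        names = [tok]
        if user_id is not None:
            # Personalized copy of the token, e.g. "u42_offer".
            names.append(f"{user_id}_{tok}")
        for name in names:
            idx = bucket(name, n_bits)
            counts[idx] = counts.get(idx, 0) + 1
    return counts

features = hashed_features(["cheap", "offer", "now"], user_id="u42")
print(features)
```

No vocabulary has to be stored or synchronized: any new token or user lands in an existing bucket, at the cost of occasional collisions.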


# 5. Quiz - Classical text mining

![quiz_1_1.png](pics/quiz_1_1.png)

![quiz_1_2.png](pics/quiz_1_2.png)

![quiz_1_2.png](pics/quiz_1_3.png)
![quiz_1_2.png](pics/quiz_1_3_a.png)

![quiz_1_2.png](pics/quiz_1_4.png)
![quiz_1_2.png](pics/quiz_1_5.png)
![quiz_1_2.png](pics/quiz_1_6.png)

![fet_1.png](pics/fet_1.png)
![fet_1.png](pics/fet_2.png)
![fet_1.png](pics/fet_3.png)
![fet_1.png](pics/fet_4.png)


In [3]:
import re
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')  # raw string avoids invalid-escape warnings
text = "gagae{}"

In [4]:
REPLACE_BY_SPACE_RE.sub(r' ',text)

'gagae  '

In [8]:
examples = ["SQL Server - any equivalent of Excel's CHOOSE function?",
                "How to free c++ memory vector<int> * arr?"]
answers = ["sql server equivalent excels choose function", 
           "free c++ memory vectorint arr"]
for text, ans in zip(examples, answers):
    print(text)
    print(REPLACE_BY_SPACE_RE.sub(r' ',text))
    if REPLACE_BY_SPACE_RE.sub(r' ',text) != ans:
        print("Wrong answer for the case: '%s'" % text) 


SQL Server - any equivalent of Excel's CHOOSE function?
SQL Server - any equivalent of Excel's CHOOSE function?
Wrong answer for the case: 'SQL Server - any equivalent of Excel's CHOOSE function?'
How to free c++ memory vector<int> * arr?
How to free c++ memory vector<int> * arr?
Wrong answer for the case: 'How to free c++ memory vector<int> * arr?'
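
Both checks fail because replacing the listed symbols with spaces is only one step of the clean-up the expected answers imply: the text also has to be lowercased, stripped of other punctuation, and filtered for stopwords. A possible sketch follows. `BAD_SYMBOLS_RE`, `text_prepare`, and the tiny `STOPWORDS` set are assumptions made for this example; a real pipeline would more likely use a full stopword list such as NLTK's.

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')  # assumed set of characters to keep
STOPWORDS = {"a", "an", "any", "how", "of", "the", "to"}  # hypothetical mini-list

def text_prepare(text):
    text = text.lower()                        # normalize casing
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # separators become spaces
    text = BAD_SYMBOLS_RE.sub('', text)        # drop remaining punctuation
    return ' '.join(w for w in text.split() if w not in STOPWORDS)

examples = ["SQL Server - any equivalent of Excel's CHOOSE function?",
            "How to free c++ memory vector<int> * arr?"]
answers = ["sql server equivalent excels choose function",
           "free c++ memory vectorint arr"]
for text, ans in zip(examples, answers):
    print(text_prepare(text), "->", "ok" if text_prepare(text) == ans else "wrong")
```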


# 6. Neural Networks for words

![bow_repr_1.png](pics/bow_repr_1.png)

In neural networks we prefer a dense representation.

![nn_rep_1.png](pics/nn_rep_1.png)

How do you represent a sentence? Summing the word vectors is an approach that works well in practice.
![nn_rep_2.png](pics/nn_rep_2.png)
![nn_rep_3.png](pics/nn_rep_3.png)
![nn_rep_4.png](pics/nn_rep_4.png)
![nn_reo_4.png](pics/nn_reo_4.png)
![nn_rep_4.png](pics/nn_rep_5.png)
![nn_rep_4.png](pics/nn_rep_6.png)
![nn_rep_4.png](pics/nn_rep_7.png)

![m_p_ot_1.png](pics/m_p_ot_1.png)

![m_p_ot_2.png](pics/m_p_ot_2.png)

## Summary

* You can just average pre-trained word2vec vectors for your text
* You can do better with 1D convolutions that learn more complex features
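
Both ideas can be sketched on toy data. The two-dimensional vectors in `EMBEDDINGS` are made up for illustration (real word2vec vectors have hundreds of dimensions), and the convolution uses a single hand-set filter of width 2 followed by max-over-time pooling; in a trained network the filter values would be learned.

```python
# Hypothetical 2-d word vectors, for illustration only.
EMBEDDINGS = {"cat": [1.0, 0.0], "sitting": [0.0, 1.0], "there": [0.5, 0.5]}

def average_vector(tokens):
    """Sentence representation as the mean of its word vectors."""
    vecs = [EMBEDDINGS[t] for t in tokens]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conv1d_max_pool(tokens, kernel):
    """Slide one filter over consecutive word vectors and keep the max activation."""
    vecs = [EMBEDDINGS[t] for t in tokens]
    width = len(kernel)
    activations = [
        sum(dot(kernel[j], vecs[i + j]) for j in range(width))
        for i in range(len(vecs) - width + 1)
    ]
    return max(activations)

sentence = ["cat", "sitting", "there"]
kernel = [[1.0, 0.0], [0.0, 1.0]]  # responds most strongly to "cat sitting"
print(average_vector(sentence))
print(conv1d_max_pool(sentence, kernel))
```

The average ignores word order, whereas the filter responds to a specific two-word pattern wherever it occurs in the sentence.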

# 7. Neural Networks for characters

![nn_w.png](pics/nn_w.png)
![nn_w_1.png](pics/nn_w_1.png)
![nn_w_1.png](pics/nn_w_3.png)
![nn_w_1.png](pics/nn_w_4.png)
![nn_w_1.png](pics/nn_w_5.png)
![nn_w_1.png](pics/nn_w_6.png)
![nn_w_1.png](pics/nn_w_7.png)
![nn_w_1.png](pics/nn_w_8.png)

## Summary

* You can use convolutional networks on top of characters (called learning from scratch)
* It works best for large datasets where it beats classical approaches (like BOW)
* Sometimes it even beats an LSTM that works at the word level
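
A character-level network needs the text turned into a sequence of character vectors before any convolution is applied. One common choice is one-hot encoding over a fixed alphabet; the sketch below assumes a small alphabet of lowercase letters plus space (`ALPHABET` and `one_hot_chars` are illustrative names), with unknown characters mapped to an all-zero row.

```python
import string

ALPHABET = string.ascii_lowercase + " "  # assumed alphabet for the example
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def one_hot_chars(text):
    """Encode text as a list of one-hot rows, one row per character."""
    rows = []
    for ch in text.lower():
        row = [0] * len(ALPHABET)
        if ch in CHAR_INDEX:
            row[CHAR_INDEX[ch]] = 1
        rows.append(row)
    return rows

encoded = one_hot_chars("Cat!")
print(len(encoded), "rows of length", len(encoded[0]))
```

The resulting matrix plays the same role as the sequence of word embeddings in the word-level case, so the same 1D convolution and pooling layers can be stacked on top of it.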


# 8. Quiz

![quiz_1](pics/quiz_2_1.png)
![quiz_1](pics/quiz_2_2.png)
![quiz_1](pics/quiz_2_3.png)
![quiz_1](pics/quiz_2_4.png)

