# Bag of Words

## Summary.

**Pipeline of applying BoW**
1. Preprocessing:
  * Lowercase, stemming, lemmatization, stopwords
2. Ngrams can help to use local context
3. Postprocessing: TF-IDF

## Feature extraction from texts and images

![extract-features-ex-titanic](img/extract-features-ex-titanic.png)

![extract-features-commonimgtxt](img/extract-features-commonimgtxt.png)

## Text -> vector

![txt-to-vector](img/txt-to-vector.png)

## Bag of Words
`sklearn.feature_extraction.text.CountVectorizer`

![bow](img/bag-of-words.png)
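A minimal sketch of the counting step with `CountVectorizer`. The two-sentence corpus is made up for illustration, and the vocabulary is read from the fitted `vocabulary_` mapping.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this example.
docs = [
    "the weather is sunny",
    "the weather is very very rainy",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # sparse matrix: one row per document
counts = X.toarray()                 # dense copy, only sensible for tiny data

# vocabulary_ maps each word to its column index.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocab)
print(counts)
```

Each column corresponds to one word of the vocabulary, and each cell holds the number of times that word occurs in the document.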

#### Post-processing
We can also post-process the calculated values using some predefined methods.
* To see why post-processing is needed, recall that models such as kNN, linear regression, and neural nets depend on the scaling of features.
  * **Main goals of post-processing**
    * `to make samples more comparable`
    * `to boost more important features while decreasing the scale of useless ones`


## Bag of Words: TF-IDF

#### Term frequency

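```python
# x: count matrix of shape (n_documents, n_words), assumed to be a dense NumPy array
tf = 1 / x.sum(axis=1)[:, None]  # reciprocal of each document's total word count
x = x * tf                       # every row now sums to 1
```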

This way we count word frequencies rather than raw occurrences, so texts of different lengths become more comparable. This is the **`term frequency transformation`**.
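As a small worked example of the term frequency step, assume a hypothetical dense count matrix with two documents of different lengths:

```python
import numpy as np

# Hypothetical counts: 2 documents x 3 words.
# The second document is twice as long as the first.
x = np.array([[1.0, 1.0, 0.0],
              [2.0, 0.0, 2.0]])

tf = 1 / x.sum(axis=1)[:, None]   # one scaling factor per document (row)
x_tf = x * tf

print(x_tf)
print(x_tf.sum(axis=1))           # each row sums to 1
```

After the transformation both rows are on the same scale, even though the raw counts differed by a factor of two.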

#### Inverse Document Frequency
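```python
import numpy as np
# x: matrix of shape (n_documents, n_words); idf = log(N / documents containing the word)
idf = np.log(x.shape[0] / (x > 0).sum(0))
x = x * idf
```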

`sklearn.feature_extraction.text.TfidfVectorizer`

<br>
To boost more important features, we post-process our matrix by normalizing the data column-wise.
* A good idea is to **normalize each feature by the inverse fraction of documents that contain the exact word corresponding to this feature**.
* In this case, features corresponding to frequent words will be scaled down compared to features corresponding to rarer words.
* We can further improve this idea by taking the logarithm of these normalization coefficients.
* As a result, this decreases the significance of widespread words in the dataset and performs the required feature scaling. **This is the purpose of the inverse document frequency transformation.**
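The column-wise scaling described above can be sketched on a hypothetical matrix in which one word appears in every document and another in only one:

```python
import numpy as np

# Hypothetical counts: 3 documents x 3 words.
# Column 0 occurs in all documents, column 1 in two, column 2 in one.
x = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [1.0, 1.0, 1.0]])

doc_freq = (x > 0).sum(axis=0)        # number of documents containing each word
idf = np.log(x.shape[0] / doc_freq)   # log(N / document frequency)
x_idf = x * idf                       # scale each column by its IDF weight

print(idf)
print(x_idf)
```

With this plain formula a word present in every document gets weight `log(1) = 0`, while the rarest word gets the largest weight. Note that `TfidfVectorizer` in sklearn uses a smoothed IDF and L2 row normalization by default, so its numbers will differ from this sketch.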



Applying the TF-IDF transformation to the previous example:
## Bag of Words: TF

![bow-tf](img/bow-tf.png)
* `Occurrences` switched to `Frequencies`
  * This means that the sum of values in each row is now equal to `1`.
  
  
## Bag of Words: TF + iDF

![bow-tf-idf](img/bow-tf-idf.png)
* Data is normalized column-wise.
* The IDF transformation scaled down the appropriate features (those of frequent words).
  * **There are plenty of other variants of TF-IDF which may work better depending on the specific data.**
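One such variant is what sklearn's `TfidfVectorizer` computes with its default settings (smoothed IDF, L2-normalized rows). A minimal sketch on a made-up corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for this example.
docs = [
    "the weather is sunny",
    "the weather is very very rainy",
    "sunny days are rare",
]

vectorizer = TfidfVectorizer()          # defaults: smooth_idf=True, norm="l2"
X = vectorizer.fit_transform(docs).toarray()

idx = vectorizer.vocabulary_            # word -> column index
idf_the = vectorizer.idf_[idx["the"]]   # "the" occurs in two documents
idf_rare = vectorizer.idf_[idx["rare"]] # "rare" occurs in one document

print(idf_the, idf_rare)
print(np.linalg.norm(X, axis=1))        # row norms after L2 normalization
```

The rarer word receives the larger IDF weight, and each row has unit L2 norm rather than summing to `1` as in the plain TF example.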

## N-grams

![n-grams](img/n-grams.png)

The concept of n-grams
* You add not only columns corresponding to single words, but also columns corresponding to sequences of consecutive words.
* This concept can also be applied to sequences of chars.
  * For low N, we will have a column for each possible combination of N chars.
  
#### Number of bigrams (`N = 2`) for `28` unique symbols is equal to:
* `28 * 28 = 784`
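The count can be checked by enumerating every combination. The notes do not say which 28 symbols are meant, so the alphabet below (26 lowercase letters plus space and period) is an assumption made only for illustration.

```python
from itertools import product
import string

# Assumed alphabet: 26 lowercase letters plus space and period (28 symbols).
alphabet = string.ascii_lowercase + " ."
n = 2

# Every ordered combination of n symbols is a potential feature column.
char_ngrams = ["".join(p) for p in product(alphabet, repeat=n)]

print(len(alphabet))      # 28
print(len(char_ngrams))   # 784
```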

#### It can be cheaper to have every possible char Ngram as a feature, instead of having a feature for each unique word from the dataset.
* Using char Ngrams also helps our model to handle unseen words - e.g. rarer forms of already used words

#### In sklearn, CountVectorizer has an appropriate parameter for n-grams: `ngram_range`
* To change from word n-grams to char n-grams, you may use the parameter named `analyzer`
```
sklearn.feature_extraction.text.CountVectorizer:
ngram_range, analyzer
```
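A short sketch of both parameters on a made-up corpus. It also probes an unseen word form to show why char n-grams can help; the word `sentences` is chosen only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this example.
docs = ["this is a sentence", "this is another sentence"]

# Word unigrams and bigrams.
word_vec = CountVectorizer(ngram_range=(1, 2)).fit(docs)
word_features = sorted(word_vec.vocabulary_)

# Character bigrams.
char_vec = CountVectorizer(analyzer="char", ngram_range=(2, 2)).fit(docs)
char_features = sorted(char_vec.vocabulary_)

# A word form that never appears in the training corpus.
unseen = "sentences"
word_hits = word_vec.transform([unseen]).sum()   # no known word n-gram matches
char_hits = char_vec.transform([unseen]).sum()   # several char bigrams still match

print(word_features)
print(char_features)
print(word_hits, char_hits)
```

The word-level vectorizer produces an all-zero row for the unseen form, while the char-level one still activates features shared with `sentence`.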

## Text preprocessing

1. **Lowercase**
2. **Lemmatization**
3. **Stemming**
4. **Stopwords**

### Text preprocessing : lowercase

`very == VERY`
`Sunny == sunny`
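In sklearn, `CountVectorizer` lowercases text by default through its `lowercase` parameter. A minimal sketch of the difference, on made-up input:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this example.
docs = ["Very sunny", "VERY Sunny"]

# lowercase=True is the default, so case variants share one column.
default_vocab = sorted(CountVectorizer().fit(docs).vocabulary_)

# With lowercasing disabled, every case variant gets its own column.
cased_vocab = sorted(CountVectorizer(lowercase=False).fit(docs).vocabulary_)

print(default_vocab)
print(cased_vocab)
```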

### Text preprocessing: lemmatization and stemming

```
I had a car == I have car
```
```
We have cars == We have car
```

#### Stemming
A heuristic process that **chops off the endings of words** and thus unites derivationally related words such as democracy, democratic, and democratization.
* democracy, democratic and democratization -> `democr`
* Saw -> `s`

#### Lemmatization
Uses knowledge of vocabulary and morphological analysis of words, returning `democracy` for each of the words below.
* democracy, democratic and democratization -> `democracy`
* Saw -> `see` or `saw` (depending on context)
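Stemming can be tried with NLTK's `PorterStemmer`, which needs no extra data downloads. This is only a sketch: rule-based stemmers rarely match hand-written examples exactly, so the stems it prints may differ from `democr` above. Lemmatization with NLTK's `WordNetLemmatizer` additionally requires the WordNet corpus to be downloaded, so it is left out here.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["democracy", "democratic", "democratization", "cars"]

# Map each word to the stem produced by the Porter algorithm.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(word, "->", stem)
```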

### Text preprocessing: stopwords

Examples:
1. Articles or prepositions
2. Very common words

NLTK, the Natural Language Toolkit library for Python
```python
# stopword-related parameter in CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# max_df: threshold on word document frequency; 0.8 is only an example value
vectorizer = CountVectorizer(max_df=0.8)
```
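Both ways of dropping stopwords in `CountVectorizer` can be sketched as follows: the built-in English list via `stop_words="english"`, and a document-frequency cutoff via `max_df`. The corpus and the `0.5` threshold are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this example.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew over the house",
]

# Option 1: remove words from sklearn's built-in English stopword list.
stop_vec = CountVectorizer(stop_words="english").fit(docs)

# Option 2: remove any word that occurs in more than 50% of the documents.
df_vec = CountVectorizer(max_df=0.5).fit(docs)

print(sorted(stop_vec.vocabulary_))
print(sorted(df_vec.vocabulary_))
```

The `max_df` approach needs no language-specific list, but it also removes frequent content words (here `sat`), so the threshold has to be tuned to the data.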