There are 2 cases when dealing with text or images.

## When there are only text or images as features

#### Text
- Text Feature extraction
- Kaggle's Allen Institute Challenge

#### Images
- Convolutional NN
- Kaggle's Data Science Bowl

## When text or images are additional data

We can extract features from the text or images that complement the other features.

- Kaggle's Avito Duplicate Ads Detection
- Kaggle's TRADESHIFT
- Kaggle's Titanic

## Text Preprocessing

Text and image processing each deserve separate, in-depth coverage, so this is only a quick review.

### 1. Lowercase
Convert strings to lowercase.

### 2. Stemming
A stemmer should identify the string "cats" (and possibly "catlike", "catty", etc.) as based on the root "cat". In effect, it mostly chops off word endings.
- It operates on a single word without knowledge of the context.
- democracy, democratic, and democratization ---> democr
- saw ---> s
    
### 3. Lemmatization
For example, the word "better" has "good" as its lemma. 
- It converts words of a sentence to their dictionary form. For example, given the words amusement, amusing, and amused, the lemma for each and all would be amuse.
- democracy, democratic, and democratization ---> democracy
- saw ---> see or saw (depending on the context)
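
The difference between the two can be sketched in a few lines of plain Python. This is only a toy illustration: the suffix list and the lemma dictionary are hand-picked assumptions chosen to reproduce the examples above, and real stemmers and lemmatizers use far richer rules and also take context into account.

```python
# Hand-picked suffixes and lemmas (assumptions for illustration only).
SUFFIXES = ("atization", "atic", "acy", "s")
LEMMAS = {"better": "good", "amused": "amuse", "amusing": "amuse"}


def toy_stem(word):
    """Strip the first matching suffix, with no knowledge of meaning."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are not destroyed.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a dictionary of known forms."""
    return LEMMAS.get(word, word)


stems = [toy_stem(w) for w in ("democracy", "democratic", "democratization")]
print(stems)                    # all three collapse to the same stem
print(toy_stem("cats"))         # suffix "s" is stripped
print(toy_lemmatize("better"))  # dictionary lookup, not suffix stripping
```

The stemmer only manipulates characters, while the lemmatizer needs a vocabulary, which is why lemmatization can map "better" to "good" and stemming cannot.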

### 4. Stopwords
Stopwords are words that carry little information for the task and are usually removed. Examples:
- Articles or prepositions
- Very common words
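
Lowercasing and stopword removal can be sketched as follows. The stopword set here is a tiny hypothetical list made up for the example; real lists are much longer and depend on the language and the task.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "is", "to"}


def preprocess(text):
    """Lowercase the text, split it into alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


print(preprocess("The cat is on the mat"))
```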

## Text Feature Extraction

![](images/textfeatures.png)

There are 2 main approaches:

#### 1. BoW

- https://github.com/rgap/simbig2016-facebook-reactions/blob/master/lectures/TF-IDF.ipynb
- **TF-IDF** is the best-known **postprocessing method** applied after BoW. There are different variants of TF-IDF that may work better depending on the data.
- **N-grams** may be useful. Add columns corresponding to single words and also columns corresponding to sequences of 'n' consecutive words.
- N-grams can also refer to sequences of characters.
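
The ideas above (counting words, adding n-gram columns, and reweighting with TF-IDF) can be sketched in plain Python. This is a minimal sketch, not a reference implementation: the helper names are made up for the example, and the weighting `tf * log(N / df)` is just one common TF-IDF variant among several.

```python
import math
from collections import Counter


def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return every run of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bag_of_words(docs, max_n=2):
    """Count all 1-grams up to max_n-grams in each document."""
    bags = []
    for doc in docs:
        tokens = doc.lower().split()
        terms = []
        for n in range(1, max_n + 1):
            terms.extend(word_ngrams(tokens, n))
        bags.append(Counter(terms))
    return bags


def tfidf(bags):
    """Weight each count by term frequency times log(N / document frequency)."""
    n_docs = len(bags)
    # Iterating over a Counter yields each distinct term once per document.
    doc_freq = Counter(term for bag in bags for term in bag)
    weighted = []
    for bag in bags:
        total = sum(bag.values())
        weighted.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weighted


docs = ["the cat sat", "the dog sat down"]
bags = bag_of_words(docs)
weights = tfidf(bags)
print(bags[0])
print(weights[0])
```

Terms that occur in every document (here "the" and "sat") get weight zero under this variant, while terms unique to one document keep a positive weight.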

##### Features
- It produces very large vectors
- The meaning of each value in the vector is known

#### 2. Word2vec
- https://github.com/rgap/simbig2016-facebook-reactions/blob/master/lectures/Doc2Vec.ipynb
- Implementations of this approach:
    - For words: Word2vec, GloVe, FastText, etc.
    - For sentences:
        - Use the mean, or the sum of word vectors.
        - Use Doc2vec
- Training can be slow, so there are **pre-trained models**

##### Features
- It produces relatively small vectors
- **Values in the vector can be interpreted only in some cases**
- Words with similar meaning often have similar embeddings (vector representations)
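
A sentence vector built as the mean of word vectors, and the "similar words have similar embeddings" property, can be illustrated with a toy example. The 3-dimensional vectors below are invented for the illustration; real embeddings come from a trained model and typically have hundreds of dimensions.

```python
import math

# Hypothetical embeddings, made up for this example.
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def sentence_vector(tokens):
    """Average the vectors of the tokens that have an embedding."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        raise ValueError("no token has a known embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(sentence_vector(["king", "queen"]))
print(cosine(EMBEDDINGS["king"], EMBEDDINGS["queen"]))
print(cosine(EMBEDDINGS["king"], EMBEDDINGS["apple"]))
```

With these made-up vectors, "king" is much closer to "queen" than to "apple", which mirrors the behaviour expected from real embeddings.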



## Image Feature Extraction

**Convolutional Neural Networks (CNN)** can give a vector representation for an image.

**Descriptors:** The outputs of any layer are called descriptors.

When running an image through the network, we can read not only the output of the last layer but also the outputs of the inner layers.

- Descriptors from later layers are specific to the task the network was trained to solve.
- Descriptors from earlier layers have more task independent information and can be used to perform other tasks.
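
The idea of reading descriptors from every layer can be sketched with a toy network. Note the assumptions: a tiny fully connected network with fixed, made-up weights stands in for a CNN here, only to show that each layer's output is kept and can be used as a feature vector. A real setup would use a deep learning framework and a trained model.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights):
    """Multiply an input vector by a weight matrix (one row per output unit)."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


# Hypothetical fixed weights; a real network learns them during training.
LAYERS = [
    [[1.0, -1.0, 0.5], [0.5, 0.5, 0.5]],  # layer 1: 3 inputs -> 2 units
    [[1.0, 1.0]],                         # layer 2: 2 units -> 1 output
]


def forward_with_descriptors(inputs):
    """Run the network and return the output of every layer, earliest first."""
    descriptors = []
    activation = inputs
    for weights in LAYERS:
        activation = relu(dense(activation, weights))
        descriptors.append(activation)
    return descriptors


descriptors = forward_with_descriptors([1.0, 2.0, 2.0])
early_descriptor = descriptors[0]   # output of an inner layer
final_descriptor = descriptors[-1]  # output of the last layer
print(early_descriptor, final_descriptor)
```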

![](networkCNN.jpg)

Feature extraction is essentially a by-product of training a CNN, so we first need a trained network. There are two ways to get one.

#### a. Train a network from scratch

- It can give better results when enough labeled data is available, because all parameters can be tuned for the task.

#### b. Use a pre-trained network
- **Fine-tuning** = the process of tuning a pre-trained model on new data.
- It's useful if we have **too little data**.
- It's mostly used when the only data given is images.
- It's appropriate when the task we are solving is similar to the task the model was trained on.
- When solving medical-specific tasks you may use these pre-trained networks:
    - VGG
    - ResNet

## Image augmentation

Image augmentation means generating additional training images from the existing ones in order to train a better network.

- Adding images rotated by 90 degrees.
- Adding images rotated by 180 degrees.
- Adding denoised images.
- etc

Augmented images can be used at both train and test time:
- At **train time** to increase the amount of training data
- At **test time** to average the predictions over several augmented versions of one sample
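
Both uses can be sketched in plain Python, treating an image as a 2D list of pixel values. The `toy_model` function is a hypothetical stand-in for a trained network, used only to make the example runnable.

```python
def rotate90(image):
    """Rotate a 2D list of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotate180(image):
    """Rotate a 2D list of pixels 180 degrees."""
    return [row[::-1] for row in image[::-1]]


def augment(image):
    """Return the original image together with its rotated versions."""
    return [image, rotate90(image), rotate180(image)]


def predict_with_tta(model, image):
    """Test-time augmentation: average predictions over augmented versions."""
    predictions = [model(version) for version in augment(image)]
    return sum(predictions) / len(predictions)


def toy_model(image):
    """Hypothetical model that scores an image by its top-left pixel."""
    return float(image[0][0])


image = [[1, 2], [3, 4]]
train_images = augment(image)                 # train time: more training data
tta_score = predict_with_tta(toy_model, image)  # test time: averaged prediction
print(train_images)
print(tta_score)
```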