## Machine Learning Steps

If the data is huge, you may want to sample smaller training sets so you can train many different
models in a reasonable time (be aware that this penalizes complex models such as large neural nets
or Random Forests).
Once again, try to automate these steps as much as possible.

1. Train many quick and dirty models from different categories (e.g., linear, naive Bayes, SVM, Random Forests, neural net, etc.) using standard parameters.
2. Measure and compare their performance. For each model, use N-fold cross-validation and compute the mean and standard deviation of the performance measure on the N folds.
3. Analyze the most significant variables for each algorithm.
4. Analyze the types of errors the models make. What data would a human have used to avoid these errors?
5. Have a quick round of feature selection and engineering.
6. Have one or two more quick iterations of the five previous steps.
7. Short-list the top three to five most promising models, preferring models that make different types of errors.
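
Step 2 above can be sketched with the standard library alone. This is a minimal illustration, not a reference implementation: the two "models" (`fit_mean`, `fit_line`), the helper names, and the synthetic data are all assumptions made up for the example. In practice a library routine for cross-validation would replace the hand-written loop.

```python
import random
import statistics

def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle the sample indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def cross_val_mae(fit, predict, X, y, n_folds=5):
    """Return (mean, standard deviation) of the MAE over the N folds."""
    fold_scores = []
    for held_out in k_fold_indices(len(X), n_folds):
        held_out_set = set(held_out)
        train = [i for i in range(len(X)) if i not in held_out_set]
        model = fit([X[i] for i in train], [y[i] for i in train])
        errors = [abs(predict(model, X[i]) - y[i]) for i in held_out]
        fold_scores.append(statistics.mean(errors))
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)

# Two deliberately simple models (hypothetical stand-ins for real ones).
def fit_mean(X, y):
    return statistics.mean(y)  # always predicts the training mean

def predict_mean(model, x):
    return model

def fit_line(X, y):
    # One-feature least-squares fit: returns (slope, intercept).
    mx, my = statistics.mean(X), statistics.mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(X, y)) / sum((a - mx) ** 2 for a in X)
    return slope, my - slope * mx

def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept

# Synthetic data (assumption): a noisy straight line.
X = [float(i) for i in range(20)]
y = [2.0 * x + 1.0 + 0.1 * ((i % 3) - 1) for i, x in enumerate(X)]

results = {
    "mean baseline": cross_val_mae(fit_mean, predict_mean, X, y),
    "least-squares line": cross_val_mae(fit_line, predict_line, X, y),
}
for name, (mean_mae, std_mae) in results.items():
    print(f"{name}: MAE = {mean_mae:.3f} +/- {std_mae:.3f}")
```

Reporting the standard deviation alongside the mean shows whether a difference between two models is larger than the fold-to-fold noise.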

Source: *Hands-On Machine Learning*, p. 646.

## Sentiment Analysis Options

### Features

- Length of review

*Ignoring word order*
- Frequencies of all relevant words (ignoring stop words)

OR

- Frequency of positive words (list from NLTK)
- Frequency of negative words (list from NLTK), etc.
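
Both order-free options can be sketched in a few lines. The word lists below are tiny hand-written stand-ins (assumptions for illustration), not the NLTK lists mentioned above, and `tokenize` is a deliberately crude regex tokenizer.

```python
import re
from collections import Counter

# Tiny hand-written lists for illustration only (assumptions, not NLTK's).
STOP_WORDS = {"the", "a", "was", "and", "but", "is", "it"}
POSITIVE_WORDS = {"good", "great", "tasty"}
NEGATIVE_WORDS = {"bad", "slow", "bland"}

def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(text):
    """Option 1: count every word that is not a stop word."""
    return Counter(w for w in tokenize(text) if w not in STOP_WORDS)

def lexicon_counts(text):
    """Option 2: return (number of positive words, number of negative words)."""
    tokens = tokenize(text)
    positive = sum(t in POSITIVE_WORDS for t in tokens)
    negative = sum(t in NEGATIVE_WORDS for t in tokens)
    return positive, negative

review = "The food was great and tasty, but the service was slow."
freqs = word_frequencies(review)
pos_count, neg_count = lexicon_counts(review)
print(freqs)
print(pos_count, neg_count)
```

Option 1 yields one feature per vocabulary word; option 2 compresses the review into two numbers, which is what the baseline models later in these notes use.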

*Keeping word order*
- One-hot encoding of words: each word becomes a vector as long as the vocabulary, with a 1 in the row for that word and 0 everywhere else
- Word embeddings: dense feature vectors for each word, learned by various algorithms

Video on word embeddings: https://www.youtube.com/watch?v=186HUTBQnpY
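
As a rough sketch of the one-hot option, a review can be turned into a sequence of vectors, one per word, so the order is preserved. The function names and the two toy reviews are assumptions for illustration.

```python
def build_vocab(tokenized_reviews):
    """Map every distinct word to a row index (alphabetical order)."""
    words = sorted({w for review in tokenized_reviews for w in review})
    return {word: index for index, word in enumerate(words)}

def one_hot(word, vocab):
    """Vector of vocabulary length with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1  # raises KeyError for out-of-vocabulary words
    return vector

def encode_review(tokens, vocab):
    """Keep word order: one one-hot vector per token, in sequence."""
    return [one_hot(t, vocab) for t in tokens]

reviews = [["not", "good"], ["very", "good"]]
vocab = build_vocab(reviews)
encoded = encode_review(reviews[0], vocab)
print(vocab)
print(encoded)
```

One-hot vectors grow with the vocabulary and treat all words as equally unrelated, which is the main motivation for the dense word embeddings in the second bullet.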

### Algorithms

#### Non-sequential algorithms

Any classification algorithm, e.g.:
- Logistic Regression
- Decision trees
- Naive Bayes

#### Sequential algorithms

RNNs (and variants such as BRNNs, LSTMs, GRUs)
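
To make "sequential" concrete, here is a minimal sketch of the forward pass of a plain (Elman-style) RNN over word embeddings, in pure Python. Everything here is an assumption for illustration: the class name `TinyRNN`, the dimensions, and the random, untrained weights. A real model would be trained with a deep learning library.

```python
import math
import random

def make_matrix(rows, cols, rng):
    """Random weight matrix as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def mat_vec(matrix, vector):
    """Matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

class TinyRNN:
    """Untrained RNN cell: h_t = tanh(W_in * e_t + W_rec * h_(t-1) + b)."""

    def __init__(self, vocab, embed_dim=4, hidden_dim=3, seed=0):
        rng = random.Random(seed)
        # One random embedding vector per word (a trained model would learn these).
        self.embeddings = {
            w: [rng.uniform(-0.5, 0.5) for _ in range(embed_dim)] for w in vocab
        }
        self.w_in = make_matrix(hidden_dim, embed_dim, rng)
        self.w_rec = make_matrix(hidden_dim, hidden_dim, rng)
        self.bias = [0.0] * hidden_dim
        self.hidden_dim = hidden_dim

    def encode(self, tokens):
        """Read the tokens in order and return the final hidden state."""
        hidden = [0.0] * self.hidden_dim
        for token in tokens:
            from_input = mat_vec(self.w_in, self.embeddings[token])
            from_past = mat_vec(self.w_rec, hidden)
            hidden = [
                math.tanh(a + b + c)
                for a, b, c in zip(from_input, from_past, self.bias)
            ]
        return hidden

rnn = TinyRNN(vocab=["not", "good", "bad"])
state_a = rnn.encode(["not", "good"])
state_b = rnn.encode(["good", "not"])
print(state_a)
print(state_b)
```

The same two words in a different order produce a different final state, which is exactly the information a bag-of-words model throws away.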

## Models Covered In Class

### Linear Regression - Baseline Model
Features: number of +ve words, number of -ve words

The mean absolute error on the training data is 0.832466 stars

### Random Forests (non-linear model)
Features: number of +ve words, number of -ve words, counted after taking negations into account (e.g. "not good")

This nonlinear regressor achieves an MAE of 0.715708 stars
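
One simple way to take negations into account when counting is to flip the polarity of a sentiment word that directly follows a negator. This is a sketch under that assumption; the exact rule used in class is not recorded in these notes, and the word lists are hand-written placeholders.

```python
import re

# Hand-written placeholder lists (assumptions for illustration).
NEGATORS = {"not", "no", "never"}
POSITIVE_WORDS = {"good", "great", "tasty"}
NEGATIVE_WORDS = {"bad", "slow", "bland"}

def negation_aware_counts(text):
    """Return (positive, negative) counts, flipping words preceded by a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive = negative = 0
    for i, token in enumerate(tokens):
        negated = i > 0 and tokens[i - 1] in NEGATORS
        if token in POSITIVE_WORDS:
            if negated:
                negative += 1  # "not good" counts as negative
            else:
                positive += 1
        elif token in NEGATIVE_WORDS:
            if negated:
                positive += 1  # "not bad" counts as positive
            else:
                negative += 1
    return positive, negative

counts = negation_aware_counts("The pizza was not good, but the staff were great.")
print(counts)
```

This rule only looks one word back, so it misses contractions such as "wasn't" and longer spans such as "not very good"; widening the window is a natural next step.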

### Linear Regression with NLTK Sentiment Intensity Analyser
Features: number of +ve words, number of -ve words (after taking negations into account, e.g. "not good"), plus the extra features listed under *Features* below

Now the mean absolute error on the training data is 0.758256 stars

On the validation set, the linear regression gets an error of 0.755795 stars

### Random Forests with NLTK Sentiment Intensity Analyser
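Features: number of +ve words, number of -ve words (after taking negations into account, e.g. "not good"), plus the extra features listed under *Features* below

The mean absolute error on the training data is 0.283528 stars for the random forest

On the validation set, the random forest regression gets 0.731631 stars. The large gap between training and validation error suggests the random forest is overfitting the training data.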

\* *Features*

1. the mean positive sentiment over all sentences
2. the mean neutral sentiment over all sentences
3. the mean negative sentiment over all sentences
4. the maximum positive sentiment over all sentences
5. the maximum neutral sentiment over all sentences
6. the maximum negative sentiment over all sentences
7. length of review (in thousands of characters), truncated at 2,500
8. percentage of exclamation marks (in %)
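
The eight features above could be computed roughly as follows. This is a sketch with several assumptions: sentences are split with a naive regex, "truncate at 2,500" is read as capping the length at 2,500 characters, and the sentence scorer is passed in as a function returning `pos`, `neu`, and `neg` scores. NLTK's `SentimentIntensityAnalyzer.polarity_scores` returns a dict with those keys, so it could be plugged in; `toy_scorer` below is a made-up stand-in so the example runs without NLTK.

```python
import re
import statistics

def review_features(text, score_sentence):
    """Return the 8 features for one review.

    score_sentence: callable mapping a sentence to a dict with
    'pos', 'neu' and 'neg' scores.
    """
    if not text.strip():
        return [0.0] * 8
    # Naive sentence split on ., ! or ? followed by whitespace (assumption).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    scores = [score_sentence(s) for s in sentences]
    features = []
    for aggregate in (statistics.mean, max):      # features 1-3, then 4-6
        for key in ("pos", "neu", "neg"):
            features.append(aggregate([s[key] for s in scores]))
    features.append(min(len(text), 2500) / 1000)       # feature 7
    features.append(100 * text.count("!") / len(text))  # feature 8
    return features

def toy_scorer(sentence):
    # Hypothetical stand-in for a real sentence-level sentiment scorer.
    lowered = sentence.lower()
    if "great" in lowered:
        return {"pos": 0.6, "neu": 0.4, "neg": 0.0}
    if "bad" in lowered:
        return {"pos": 0.0, "neu": 0.5, "neg": 0.5}
    return {"pos": 0.0, "neu": 1.0, "neg": 0.0}

features = review_features("Great food! The wait was bad.", toy_scorer)
print(features)
```

Passing the scorer in as an argument keeps the aggregation logic testable on its own and makes it easy to swap sentiment tools.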


## Plan

- Naive Bayes classifier

https://www.youtube.com/watch?v=tOP5DzKxc20

https://www.youtube.com/watch?v=5YymjfzMpL8

Open choice: handle negated words (e.g. "not good")?
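
As a starting point for this part of the plan, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing, written in pure Python. The class name, the four toy training reviews, and the choice to skip words never seen in training are assumptions for illustration. Any negation handling would be applied to the tokens before calling `fit` and `predict`.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over tokenized documents, add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # log prior + sum of log likelihoods (logs avoid underflow)
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                if token not in self.vocab:
                    continue  # skip words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training set (made up for the example).
train_docs = [
    ["great", "food"],
    ["tasty", "great"],
    ["bad", "service"],
    ["slow", "bad", "food"],
]
train_labels = ["pos", "pos", "neg", "neg"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["great", "tasty"]))
print(model.predict(["slow", "service"]))
```

The smoothing term keeps a word that appears in only one class from driving the other class's probability to zero.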

- RNNs (BRNNs?) with word embeddings