### Description of Models

#### Baseline: Logistic Regression

#### SVM

We selected an SVM because these models generally classify well, are robust to noise, and are less prone to overfitting. The major disadvantage of the SVM, which we ran into, is that it is computationally expensive. To reduce tuning time, we ran PCA and retained the principal components that accounted for 90% of the variance. We also tried tuning on a random sample of the training data. We ran the SVM using five-fold cross-validation and performed a grid search over the parameters kernel (rbf, linear), gamma, and cost.
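The tuning setup described above could be sketched roughly as follows with scikit-learn. This is a minimal illustration, not the exact code behind the reported results: the synthetic data, the grid values, the standardization step, and the choice of F1 as the scoring function are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for one binary genre target (hypothetical data).
X, y = make_classification(
    n_samples=200, n_features=30, n_informative=10, random_state=0
)

pipeline = Pipeline([
    ("scale", StandardScaler()),      # assumption: standardize before PCA
    ("pca", PCA(n_components=0.90)),  # keep components explaining 90% of variance
    ("svm", SVC()),
])

# Placeholder grid values; the report does not list the values it searched.
param_grid = {
    "svm__kernel": ["rbf", "linear"],
    "svm__gamma": [0.01, 0.1],
    "svm__C": [0.1, 1.0, 10.0],       # C is scikit-learn's name for the cost
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X, y)

best_params = search.best_params_
n_components_kept = search.best_estimator_.named_steps["pca"].n_components_
```

Placing PCA inside the pipeline means it is refit on each training fold, so the held-out fold does not influence the projection. Note that `gamma` has no effect for the linear kernel, so those grid combinations are redundant and could be split into a separate grid to save time.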

#### Random Forest

We chose a Random Forest as an alternative to the SVM, both for its general classification accuracy and its relative training efficiency. Random forests are generally good candidates for classification tasks because they combine multiple predictions into an ensemble, which reduces model variance. However, random forests are also susceptible to overfitting, particularly when the data are sparse. In this dataset, we retained several numerical features from the original TMDB/IMDB datasets and constructed bag-of-words representations of the titles and overviews, retaining the 100 most frequent words for each feature. While the result is still sparse, we expected that retaining only the most frequent words would reduce the overall sparsity and allow the RF to perform relatively well.
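To make the bag-of-words step concrete, here is a small pure-Python sketch of keeping only the most frequent words. The function names, the tokenization rule, and the example overviews are hypothetical, and the report does not say whether stop words were removed, so this sketch leaves them in.

```python
import re
from collections import Counter


def build_vocabulary(texts, top_k=100):
    """Return the top_k most frequent lowercase word tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return [word for word, _ in counts.most_common(top_k)]


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in a single text."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return [counts[word] for word in vocabulary]


# Hypothetical overviews, for illustration only.
overviews = [
    "A detective hunts a killer in the city",
    "A young wizard learns magic at a school",
    "The killer returns to the city",
]

# top_k=5 keeps the toy example readable; the report used 100.
vocab = build_vocabulary(overviews, top_k=5)
features = [bag_of_words(text, vocab) for text in overviews]
```

Each row of `features` has one count per vocabulary word, so capping the vocabulary directly caps the number of columns the Random Forest has to consider.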

In order to reduce overfitting, we tuned the Random Forest using five-fold cross validation and performed grid search using three values each for max_features, max_depth, and num_estimators.
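A grid of three values per parameter gives 27 candidate models, each evaluated on five folds. The sketch below shows one way to set this up in scikit-learn; the data are synthetic, the grid values are placeholders rather than the ones used here, and it assumes that `num_estimators` corresponds to scikit-learn's `n_estimators` parameter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for one binary genre target (hypothetical data).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=8, random_state=0
)

# Three placeholder values per hyperparameter: 3 * 3 * 3 = 27 candidates.
param_grid = {
    "max_features": ["sqrt", "log2", 0.5],
    "max_depth": [3, 5, 10],
    "n_estimators": [10, 25, 50],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,           # five-fold cross-validation
    scoring="f1",   # assumption: tune on F1, as in the metrics section
)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
best_params = search.best_params_
```

Limiting `max_depth` and `max_features` constrains how closely each tree can fit the training data, which is why these parameters are natural choices when the goal is to reduce overfitting.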

### Performance Metrics

#### F1 Score
F1 is the harmonic mean of precision (exactness) and recall (sensitivity), and is calculated as:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

For this task, we are concerned about the imbalance among the genres in the dataset. One way to account for this problem is to use the F-score during model tuning: because it balances precision and recall, the model with the "best performance" will not be one that performs well only on the dominant class.
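A small worked example shows why this matters for a rare genre. The labels below are made up for illustration, and the helper function is a hypothetical sketch rather than the code used for tuning.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical rare genre: 2 positives among 10 films.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10                     # never predicts the genre
one_hit = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]       # one correct hit, one false alarm

accuracy_negative = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
f1_negative = precision_recall_f1(y_true, always_negative)[2]
f1_one_hit = precision_recall_f1(y_true, one_hit)[2]
```

The classifier that never predicts the genre is 80% accurate but has an F1 of 0, while the classifier that finds one of the two positives (with one false alarm) has the same accuracy and an F1 of 0.5. Tuning on F1 therefore prefers the second model.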

#### Hamming Loss

This metric measures the fraction of labels that are predicted incorrectly in multi-label classification problems, so lower values are better. Although we implemented multi-class models above, we constructed "multi-label" final outputs by combining each genre classifier's predicted output into a multi-label Y. We then computed the Hamming loss on this final output in order to compare the three strategies (Logistic Regression, RF, and SVM).

Hamming loss is given by the following formula:

$$\frac{1}{|D|}\sum_{i=1}^{|D|} \frac{\operatorname{xor}(x_i, y_i)}{|L|}$$

where $|D|$ is the number of observations, $|L|$ is the number of labels, $y_i$ are the actual labels, $x_i$ are the predicted labels, and $\operatorname{xor}(x_i, y_i)$ counts the labels on which the two disagree.

Source: (https://www.kaggle.com/wiki/HammingLoss)
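As a sanity check on the formula, the following sketch computes the Hamming loss by hand for a tiny made-up example. The function and the label matrices are illustrative only; each row is one film and each column is one genre classifier's binary output.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label slots that differ, averaged over all observations."""
    n_obs = len(y_true)
    n_labels = len(y_true[0])
    mismatches = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return mismatches / (n_obs * n_labels)


# Hypothetical labels: 3 films, 4 genres.
y_true = [
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
y_pred = [
    [1, 0, 0, 0],  # misses the third genre
    [0, 1, 0, 1],  # adds a spurious fourth genre
    [1, 1, 0, 0],  # fully correct
]

loss = hamming_loss(y_true, y_pred)  # 2 mismatches out of 12 label slots
```

Two of the twelve label slots are wrong, so the loss is 2/12. Because every label slot counts equally, a model can score well on Hamming loss by being right about the many genres a film does not belong to, which is worth keeping in mind when reading the results.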

### Performance Evaluation

The following scores are reported for the test set:

|        Model        |  Hamming Loss  |
|---------------------|----------------|
| Logistic Regression |      0.21      |
|    Random Forest    |      0.07      |
|         SVM         |      0.23      |

### Discussion of Models

The SVM had the worst Hamming loss of the three models. We attribute this to the fact that the SVM was too computationally expensive to tune with a finer grid search. Further, while sampling did reduce computation time, it decreased the F1 score, particularly for the more obscure genres that are uncommon in our dataset. There are two clear paths to improving the model. First, the results would almost certainly improve if we tuned over a wider range of hyperparameter values. Second, sampling could be performed in a stratified manner, reducing computation time while retaining enough examples of each genre to train an accurate classifier.
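The stratified sampling idea could look something like the sketch below, which samples the same fraction from each class of a single binary genre target. The function name, the `min_per_class` safeguard, and the toy labels are assumptions for illustration; this was not part of the work reported here.

```python
import random
from collections import defaultdict


def stratified_sample(labels, fraction, min_per_class=1, seed=0):
    """Return sorted indices that keep `fraction` of each class.

    At least `min_per_class` examples are kept per class (when available),
    so rare classes are never dropped entirely.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    chosen = []
    for indices in by_class.values():
        k = max(min_per_class, round(fraction * len(indices)))
        k = min(k, len(indices))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Hypothetical rare genre: 10 positives among 100 films.
labels = [0] * 90 + [1] * 10
sample = stratified_sample(labels, fraction=0.2)
positives = sum(labels[i] for i in sample)
```

With a 20% sample, the subset keeps 18 negatives and 2 positives, preserving the original 9:1 ratio, whereas a purely random sample of 20 films could contain no positives at all. In scikit-learn, the `stratify` argument of `train_test_split` offers a similar guarantee for a single label column.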