# Exercise SML -- Morning day 4

### Option 1: practice with the IMDB data

* Reproduce examples from the book for SML on the IMDB data (11.2, 11.3, 11.4) (check [codefrombook.py](codefrombook.py) if you do not want to type the code)
* Play around with different options! Can you tweak the models and make them even better? Take a look back at the [slides](../../day1/day1-afternoon.pdf) where we compared different vectorizers as well!

### Option 2: practice with the data from [Vermeer](data-vermeer/)

* Work with the files `train.csv` and `test.csv` in the folder [`data-vermeer`](data-vermeer/) and train a classifier using this data.

### BONUS: Try it with your own data!

### Classifying news categories with data-vermeer

In [None]:
import sys
import csv

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

# the texts are long: raise the csv module's default field size limit
csv.field_size_limit(sys.maxsize)

In [None]:
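def get_data(t='test'):
    """Read data-vermeer/{t}.csv and return two lists: texts and labels."""
    text = []
    label = []

    # newline='' lets the csv module handle line breaks inside quoted fields
    with open(f'data-vermeer/{t}.csv', newline='') as fi:
        reader = csv.reader(fi, delimiter=',')
        next(reader)  # skips header row

        for row in reader:
            text.append(row[0])   # first column: the text
            label.append(row[1])  # second column: the category label

    return text, label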

In [None]:
# load data with existing train/test split
X_test, y_test = get_data('test')
X_train, y_train = get_data('train')

In [None]:
# sanity check: texts and labels should have the same length in each split
print(len(X_test), len(y_test))
print(len(X_train), len(y_train))

- Which configuration achieves the best performance? Based on which metrics?
- Can you improve on this configuration? (e.g., n-grams, custom tokenizer, ...?)
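As a starting point for the second question, here is a minimal sketch of two vectorizer options: `ngram_range` to add bigrams, and `tokenizer` to plug in your own tokenization function. The toy sentences and the `whitespace_tokenizer` helper are made up for illustration and are not part of the exercise data; any vectorizer configured this way could be dropped into the list of configurations.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, only for illustration (not the Vermeer data)
docs = ["the cat sat", "the dog sat"]

# Default: unigrams only, i.e. ngram_range=(1, 1)
unigrams = CountVectorizer()
unigrams.fit(docs)

# Unigrams and bigrams: word pairs such as "the cat" become features too
bigrams = CountVectorizer(ngram_range=(1, 2))
bigrams.fit(docs)


def whitespace_tokenizer(text):
    """Hypothetical custom tokenizer: split on whitespace only."""
    return text.split()


# token_pattern=None because the default pattern is unused once a tokenizer is given
custom = CountVectorizer(tokenizer=whitespace_tokenizer, token_pattern=None)
custom.fit(["it's a U.S. story"])

unigram_features = sorted(unigrams.vocabulary_)
bigram_features = sorted(bigrams.vocabulary_)
custom_features = sorted(custom.vocabulary_)

print(unigram_features)
print(bigram_features)
print(custom_features)
```

Note that the custom tokenizer keeps tokens such as `it's` and the single-character `a`, which the default token pattern would split or drop. Whether that helps on the news data is something to test, not a given.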

In [None]:
# each configuration: (description, vectorizer, classifier)
configurations = [
    ('NB with Count', CountVectorizer(min_df=5, max_df=.5), MultinomialNB()),
    ('NB with TfIdf', TfidfVectorizer(min_df=5, max_df=.5), MultinomialNB()),
    ('LogReg with Count', CountVectorizer(min_df=5, max_df=.5), LogisticRegression(solver='liblinear')),
    ('LogReg with TfIdf', TfidfVectorizer(min_df=5, max_df=.5), LogisticRegression(solver='liblinear')),
    ('SVM with Count - rbf kernel', CountVectorizer(min_df=5, max_df=.5), SVC(kernel='rbf')),
    ('SVM with Count - linear kernel', CountVectorizer(min_df=5, max_df=.5), SVC(kernel='linear')),
    ('SVM with Tfidf - rbf kernel', TfidfVectorizer(min_df=5, max_df=.5), SVC(kernel='rbf')),
    ('SVM with Tfidf - linear kernel', TfidfVectorizer(min_df=5, max_df=.5), SVC(kernel='linear')),
]

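for description, vectorizer, classifier in configurations:
    print(description)
    # fit the vocabulary on the training data only, then reuse it for the test data
    X_tr = vectorizer.fit_transform(X_train)
    X_te = vectorizer.transform(X_test)
    classifier.fit(X_tr, y_train)
    y_pred = classifier.predict(X_te)
    # precision, recall and F1 per category, plus accuracy and averages
    print(metrics.classification_report(y_test, y_pred))
    print('\n')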
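Reading eight classification reports side by side is tedious. One possible way to compare configurations is to collect a single summary score per configuration and sort on it. The sketch below does this with accuracy and macro-averaged F1. The helper name `score_configurations` and the tiny toy dataset are made up for illustration; on the real data you would pass `X_train`, `y_train`, `X_test`, `y_test` and the `configurations` list instead. Which metric to rank on is a choice: macro F1 weighs all categories equally, which matters if the categories are imbalanced.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics


def score_configurations(configurations, X_train, y_train, X_test, y_test):
    """Return (description, accuracy, macro F1) tuples, best macro F1 first."""
    results = []
    for description, vectorizer, classifier in configurations:
        X_tr = vectorizer.fit_transform(X_train)
        X_te = vectorizer.transform(X_test)
        classifier.fit(X_tr, y_train)
        y_pred = classifier.predict(X_te)
        accuracy = metrics.accuracy_score(y_test, y_pred)
        macro_f1 = metrics.f1_score(y_test, y_pred, average='macro', zero_division=0)
        results.append((description, accuracy, macro_f1))
    return sorted(results, key=lambda r: r[2], reverse=True)


# Toy data, only so that this sketch runs on its own (not the Vermeer data)
toy_train_texts = [
    "the team won the match with a late goal",
    "the striker scored twice in the final",
    "parliament voted on the new budget law",
    "the minister announced an election date",
]
toy_train_labels = ["sports", "sports", "politics", "politics"]
toy_test_texts = ["a goal in the last minute of the match", "the election law passed parliament"]
toy_test_labels = ["sports", "politics"]

# min_df is left at its default here because the toy corpus is tiny
toy_configurations = [
    ('NB with Count', CountVectorizer(), MultinomialNB()),
    ('LogReg with TfIdf + bigrams', TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(solver='liblinear')),
]

ranking = score_configurations(
    toy_configurations, toy_train_texts, toy_train_labels, toy_test_texts, toy_test_labels
)

for description, accuracy, macro_f1 in ranking:
    print(f'{description:30s} accuracy={accuracy:.2f} macro-F1={macro_f1:.2f}')
```

Scores on a toy corpus this small say nothing about the real task. The point is only the pattern of collecting comparable numbers per configuration.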