### Sentiment classification with logistic regression (in Sklearn)

The goal of this first exercise is to get acquainted with the data, understand how instances are represented, and implement a traditional baseline in `sklearn`.

To install the package:

```
pip install scikit-learn
```
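
To confirm the install worked, and to see which release you are running, a quick check along these lines may help. The output recorded further down shows `solver='liblinear'` and `multi_class='ovr'` in the model's printout, which suggests it was produced with an older release. Newer releases use different defaults and print a shorter model summary, so your output may not match exactly. The variable name `installed_version` is only illustrative.

```python
import sklearn

# Version of the installed scikit-learn package, as a string.
# Defaults (e.g. the LogisticRegression solver) can differ between releases.
installed_version = sklearn.__version__
print(installed_version)
```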

### References:

* [sklearn tutorial](https://github.com/bplank/2018-ma-notebooks/blob/master/01_Intro_to_ML.ipynb)
* [Machine Learning 101 with sklearn](https://github.com/bplank/2018-ma-notebooks/blob/master/01_Intro_to_ML.ipynb)


### Exercise 1
```
Q1) Examine the code - How is the text represented? What is the difference between fit_transform and transform? (Hint: check the sklearn documentation) 
Q2) How many instances are there per class in the training data? Which class is the least frequent?
Q3) Add code to train and evaluate the classifier. What accuracy do you get? Which class is the most difficult to get correct?
Q4) The code implements a simple baseline to compare your system to. What is this baseline?
```
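
As a hint for Q1, the following minimal sketch shows `fit_transform` and `transform` side by side on toy sentences. The sentences are made up for illustration and are not taken from the exercise data. It assumes the default `CountVectorizer()` settings, under which single-character tokens such as "a" are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentences (made up for illustration, not from the exercise data)
train_sentences = ["a good movie", "a bad movie", "good good fun"]
dev_sentences = ["a good plot", "terrible movie"]

vectorizer = CountVectorizer()

# fit_transform learns the vocabulary from the training sentences and
# returns their word-count matrix in a single step
X_train_toy = vectorizer.fit_transform(train_sentences)

# transform reuses the vocabulary learned above; words that were not
# seen during fitting ("plot", "terrible") get no column and are dropped
X_dev_toy = vectorizer.transform(dev_sentences)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)             # one column per vocabulary word
print(X_train_toy.toarray())  # rows = sentences, values = word counts
print(X_dev_toy.toarray())
```

Both matrices have the same number of columns because the dev sentences are mapped onto the vocabulary learned from the training sentences. This is why `load_data` calls `fit_transform` only when `train=True`.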

In [7]:
from collections import Counter
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score

def load_data(filename, vectorizer, train=False):
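    """
    Load the movie review data: one instance per line, formatted as
    "label ||| sentence". Returns the feature matrix X and the labels y.
    With train=True the vectorizer is fitted on the sentences first.
    """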
    labels, sentences = [],[]
    with open(filename, "r") as f:
        for line in f:
            tag, sentence = line.lower().strip().split(" ||| ")
            labels.append(tag)
            sentences.append(sentence)
    print("vectorize data..")
    if train:
        X = vectorizer.fit_transform(sentences) # Q1: make sure you understand the difference between fit_transform and transform
    else:
        X = vectorizer.transform(sentences)

    y = labels
    assert (X.shape[0] == len(y))
    return X, y


## read input data
print("load data..")
vectorizer = CountVectorizer() # Q1: What does the CountVectorizer() do?
X_train, y_train = load_data("data/classes/train.txt", vectorizer, train=True)
X_dev, y_dev = load_data("data/classes/dev.txt", vectorizer)


# Q2: count how often each label occurs in the training data
label_occurrences = {}
for lab in y_train:
    label_occurrences[lab] = label_occurrences.get(lab, 0) + 1

print("label occurrences:")
print(label_occurrences)

print("#train instances: {} #dev: {}".format(X_train.shape[0], X_dev.shape[0]))

### Q3: Train and evaluate the classifier on the dev set -- add your code here
print("train (fit) model..")
# Create the logistic regression classifier and fit it on the training data
clf = LogisticRegression()
clf.fit(X_train, y_train)

print(clf) # shows model and its parameters


print("predict..")
y_predicted_dev = clf.predict(X_dev)

## end add your code 

## Q4: A simple baseline -- what happens here?
base = DummyClassifier()
base.fit(X_train, y_train)
baseline_dev = base.predict(X_dev)

### evaluate
accuracy_dev = accuracy_score(y_dev, y_predicted_dev)

print("===== dev set ====")
print("Base accuracy:   {0:.2f}".format(accuracy_score(y_dev, baseline_dev)*100))
print("Classifier:      {0:.2f}".format(accuracy_dev*100))


load data..
vectorize data..
vectorize data..
{'3': 2322, '4': 1288, '2': 1624, '1': 2218, '0': 1092}
#train instances: 8544 #dev: 1101
train (fit) model..
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
predict..
===== dev set ====
Base accuracy:   21.25
Classifier:      35.33
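
A single accuracy number does not show which class is hardest (Q3) or how the score relates to the label distribution (Q2, Q4). The sketch below is one possible way to look at both, using toy labels that are made up for illustration. The `most_frequent` strategy is chosen here explicitly as an assumption; it is not necessarily what `DummyClassifier()` does by default in your installed version.

```python
from collections import Counter

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Toy training labels, gold labels and predictions (made up for illustration)
y_train_toy = ["0", "1", "1", "1", "2", "2"]
y_gold = ["0", "0", "1", "1", "1", "2"]
y_pred = ["1", "0", "1", "1", "2", "2"]
labels = ["0", "1", "2"]

# Class distribution of the training labels
train_counts = Counter(y_train_toy)
print(train_counts)

# Majority-class baseline: always predicts the most frequent training label.
# DummyClassifier ignores the features, so placeholder rows are enough.
X_train_placeholder = [[0]] * len(y_train_toy)
X_gold_placeholder = [[0]] * len(y_gold)
majority = DummyClassifier(strategy="most_frequent")
majority.fit(X_train_placeholder, y_train_toy)
baseline_acc = accuracy_score(y_gold, majority.predict(X_gold_placeholder))
print("majority baseline accuracy:", baseline_acc)

# Confusion matrix: rows are gold classes, columns are predicted classes
cm = confusion_matrix(y_gold, y_pred, labels=labels)
print(cm)

# Per-class recall: fraction of each gold class that was predicted correctly
per_class_recall = recall_score(y_gold, y_pred, labels=labels, average=None)
print(dict(zip(labels, per_class_recall)))
```

On the real data, the same calls would take `y_dev` and `y_predicted_dev` in place of the toy lists. The class with the lowest recall is the one the classifier gets right least often.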
