# Swiss SMS Classifier

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# The CSV has no header row: column 0 holds the SMS text, columns 1-6 the annotators' labels.
df = pd.read_csv('Annotation_Swiss_SMS - Annotations_03_05_2019.csv', header=None)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(df[0])



In [3]:
X_train_counts.shape

(70, 769)

## Creating the Ground Truth

Before we can use the annotations we created in class, we have to create what is referred to as the `ground truth`.

We have already seen multiple times that the input of the training process consists of two elements:
* input (a.k.a. X, sample representation)
* target (a.k.a. Y, the true label)

In the examples we have seen so far, the target value was always pre-defined as part of the test collection (a.k.a. training collection). For the Swiss SMS collection we have to create the ground truth ourselves.
Creating the ground truth in this case is the process of `consolidating` the labels or ratings provided by the set of annotators.

This is generally not a straightforward process. Depending on the type of annotation, it can be done by averaging the values of multiple annotators (e.g. when rating on a scale) or by forcing consensus (e.g. by defining that if 4 of 5 annotators agree, their rating becomes the ground truth).
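The two consolidation strategies mentioned above can be sketched in plain Python. The helper names `consensus_label` and `mean_rating` and the `min_agreement` parameter are illustrative assumptions, not part of the course material.

```python
from collections import Counter
from statistics import mean

def consensus_label(labels, min_agreement):
    """Return the most frequent label if at least `min_agreement`
    annotators chose it, otherwise None (no consensus reached)."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agreement else None

def mean_rating(ratings):
    """Consolidate numeric ratings (e.g. on a scale) by averaging them."""
    return mean(ratings)

# 4 of 5 annotators agree, so their label becomes the ground truth.
print(consensus_label(["APP", "APP", "NC", "APP", "APP"], min_agreement=4))
# Only 3 of 5 agree, so the sample gets no ground-truth label.
print(consensus_label(["APP", "NEWS", "NC", "APP", "APP"], min_agreement=4))
# Scale ratings are averaged instead of voted on.
print(mean_rating([3, 4, 4, 5, 4]))
```

Samples without consensus are typically set aside or re-annotated; which of the two is appropriate depends on the project.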


Creating the `ground truth` should also not be considered a 'fire and forget' task. It is generally necessary to revisit the `ground truth` creation process and its definitions at multiple stages during the lifetime of a machine learning application. This can be triggered by events such as:
* Addition of more annotators
* Expansion of annotated material
* Observations from tests on unseen data


 
 


# `Exercise: Establish Ground Truth`

Establish the ground truth for the content_type part of the annotated SMS collection.

The end result should be a dataframe that consists of two columns, one containing the text of the SMS and the other the consensus rating. 


In [9]:
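classes = ['NEWS', 'APP', 'NC']

# The first 50 rows are used; columns 1-6 hold the six annotators' labels.
df_annotated = df[:50]

# For every message, count how many annotators chose each class.
# Cells with several labels (e.g. "NEWS, APP") match none of the classes and are not counted.
row_frequencies = [
    df_annotated.apply(lambda row: (row.iloc[1:7] == class_name).sum(), axis=1)
    for class_name in classes
]
df_row_freq = pd.concat(row_frequencies, axis=1)
df_row_freq.columns = classes

# Highest vote count per message and the class that received it
# (on a tie, idxmax returns the first class in `classes`).
row_max_count = df_row_freq.max(axis=1)
row_max_label = df_row_freq.idxmax(axis=1)
row_max_count.name = 7
row_max_label.name = 8

# Keep only messages on which at least 5 of the 6 annotators agree.
df_consensus = pd.concat((df_annotated, row_max_count, row_max_label), axis=1)
df_tcol = df_consensus[df_consensus[7] > 4][[0, 8]]
len(df_tcol)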

35

In [10]:
df_consensus.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | De bini aber froh , dassd ni bisch abgschtür... | NEWS | NEWS | NEWS | NEWS | NEWS | NEWS | 6 | NEWS |
| 1 | höt ned zu der cho han leider no vell ufzgi a ... | APP | APP | APP | APP | APP | APP | 6 | APP |
| 2 | nei , säg das ned ! :( debi hani der welle ve... | NC | NC | NC | NC | NC | NC | 6 | NC |
| 3 | d' shm isch erscht nöchscht wuuchä ( dez. )... | APP | NEWS | APP | NEWS, APP | APP | NC | 3 | APP |
| 4 | wir sicher klappe . Also chunnsch dä chäs am f... | APP | APP | APP | APP | APP | APP | 6 | APP |
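Row 3 in the output above contains a cell with two labels (`NEWS, APP`). Because the counting step compares whole cells against a single class name, that cell contributes to no class. The sketch below shows one possible alternative that splits such cells into separate votes. The helper `count_votes` is hypothetical, and whether a multi-label cell should count as a full vote for each label is a design decision this notebook does not settle.

```python
from collections import Counter

def count_votes(annotations):
    """Count label votes, splitting comma-separated multi-label cells."""
    votes = Counter()
    for cell in annotations:
        for label in cell.split(","):
            votes[label.strip()] += 1
    return votes

# Annotations of row 3, copied from the output above.
row_3 = ["APP", "NEWS", "APP", "NEWS, APP", "APP", "NC"]

exact = Counter(row_3)       # whole-cell matching: "NEWS, APP" is its own key
split = count_votes(row_3)   # multi-label cell counted once per label

print("exact match:", exact["APP"], "votes for APP")
print("split cells:", split["APP"], "votes for APP")
```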


In [11]:
df_tcol

| | 0 | 8 |
|---|---|---|
| 0 | De bini aber froh , dassd ni bisch abgschtür... | NEWS |
| 1 | höt ned zu der cho han leider no vell ufzgi a ... | APP |
| 2 | nei , säg das ned ! :( debi hani der welle ve... | NC |
| 4 | wir sicher klappe . Also chunnsch dä chäs am f... | APP |
| 5 | Ne , glaub nöd . Bin mal go luege , isch nöd s... | NC |
| 6 | brucsch nu öpis ? ich bringä sicher milch , an... | NEWS |
| 7 | eus ischs eigetli glich wägem ? Du müessti... | NC |
| 8 | usserd mine für is oberland . Will aber am HB ... | NEWS |
| 10 | i gani heii .. :) Nacher schlafi , bi brutaa... | NC |
| 11 | Säge uf dim wiitere lebeswäg ! ! ! Er umgitt D... | NC |


# Create Vectoriser

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

# Fit the vocabulary on all SMS texts (column 0).
count_vect = CountVectorizer()
count_vect.fit(df[0])
len(count_vect.vocabulary_)

769
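To make the vocabulary size above more tangible, here is a minimal pure-Python sketch of the bag-of-words idea behind `CountVectorizer`. It mimics the default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary) but is not the library's implementation. The helper names and the toy messages are made up for illustration.

```python
import re

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and extract tokens of at least two characters."""
    return TOKEN_PATTERN.findall(text.lower())

def build_vocabulary(texts):
    """Map every distinct token to a column index (alphabetical order)."""
    terms = sorted({tok for text in texts for tok in tokenize(text)})
    return {term: idx for idx, term in enumerate(terms)}

def to_count_vector(text, vocabulary):
    """Count known tokens; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for tok in tokenize(text):
        idx = vocabulary.get(tok)
        if idx is not None:
            vector[idx] += 1
    return vector

# Made-up toy messages, not taken from the SMS collection.
toy_messages = ["hoi du", "hoi hoi zäme", "bis morn"]
vocab = build_vocabulary(toy_messages)
print(vocab)
print(to_count_vector("hoi hoi du dudeli", vocab))
```

Each message becomes a vector with one entry per vocabulary term, which is why every SMS in this notebook is represented by 769 numbers.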

# Transform Input - Create Input Vectors

In [None]:
count_vect.transform(['hey dudeli']).toarray().ravel()

In [16]:
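# Work on a copy so the assignment does not trigger pandas' SettingWithCopyWarning.
df_tcol = df_tcol.copy()
df_tcol[0] = df_tcol[0].map(lambda x: count_vect.transform([x]).toarray().ravel())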

In [30]:
from sklearn.utils import shuffle

df_tcol_shuffled = shuffle(df_tcol)

In [35]:
# The first 25 shuffled messages are used for training, the remaining 10 for testing.
train_x = df_tcol_shuffled[0][:25].tolist()
train_y = df_tcol_shuffled[8][:25]
test_x = df_tcol_shuffled[0][25:].tolist()
test_y = df_tcol_shuffled[8][25:]
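The shuffle above is not seeded, so every run produces a different split and therefore different scores. The following standard-library sketch shows the idea of a repeatable shuffle-and-split; `shuffle_and_split` and the placeholder data are assumptions for illustration. In the notebook itself, `sklearn.utils.shuffle` accepts a `random_state` argument for the same purpose.

```python
import random

def shuffle_and_split(samples, labels, n_train, seed=0):
    """Shuffle (sample, label) pairs together and split them into train/test."""
    pairs = list(zip(samples, labels))
    random.Random(seed).shuffle(pairs)  # fixed seed makes the split repeatable
    train, test = pairs[:n_train], pairs[n_train:]
    train_x = [s for s, _ in train]
    train_y = [y for _, y in train]
    test_x = [s for s, _ in test]
    test_y = [y for _, y in test]
    return train_x, train_y, test_x, test_y

# Placeholder data: 35 sample ids standing in for the 35 consensus messages.
samples = list(range(35))
labels = [["NEWS", "APP", "NC"][i % 3] for i in samples]

train_x, train_y, test_x, test_y = shuffle_and_split(samples, labels, n_train=25)
print(len(train_x), len(test_x))
```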

# A First Classification Run

In [36]:
from sklearn.linear_model import LogisticRegression

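# With the lbfgs solver, recent scikit-learn versions fit a multinomial model for
# multi-class targets by default, so the deprecated multi_class argument is omitted.
clf = LogisticRegression(random_state=0, solver='lbfgs').fit(train_x, train_y)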

In [53]:
from sklearn.ensemble import RandomForestClassifier

clf_rf = RandomForestClassifier(n_estimators=100).fit(train_x, train_y)

In [58]:
from sklearn.svm import LinearSVC

clf_svm = LinearSVC(random_state=0, tol=1e-5)
clf_svm = clf_svm.fit(train_x, train_y)


In [60]:
from sklearn.naive_bayes import MultinomialNB

clf_mnnb = MultinomialNB().fit(train_x, train_y)

# Manually Testing - Qualitative Assessment

In [37]:
clf.predict(count_vect.transform(['sägi 7 viertelab morn wie wenn uf di halbi 5']).toarray())

array(['APP'], dtype=object)

In [61]:
clf_mnnb.score(test_x, test_y)

0.6

In [59]:
clf_svm.score(test_x, test_y)

0.4

In [54]:
clf.score(test_x, test_y)

0.5

In [52]:
clf_rf.score(test_x, test_y)

0.3
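For classifiers, `score` returns the mean accuracy on the given data. With only 10 test messages (35 consensus messages minus 25 used for training), each message shifts the score by 0.1, so the differences above should be read with caution. A minimal sketch of the accuracy computation and of a majority-class baseline to compare against follows; the helper names and all labels in it are made up for illustration.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def majority_baseline(train_labels, n_test):
    """Always predict the most frequent training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n_test

# Hypothetical labels, not taken from the notebook's actual split.
y_train = ["NC"] * 12 + ["APP"] * 8 + ["NEWS"] * 5
y_test = ["NC", "APP", "NC", "NEWS", "NC", "APP", "NC", "APP", "NEWS", "NC"]
y_pred = ["NC", "APP", "NC", "APP", "NC", "NC", "APP", "APP", "NC", "NC"]

print("classifier:", accuracy(y_test, y_pred))
print("baseline:  ", accuracy(y_test, majority_baseline(y_train, len(y_test))))
```

A classifier that does not beat the baseline has not learned anything beyond the class distribution.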

In [15]:
clf?

In [49]:
clf.predict_proba(count_vect.transform(['  viertelab  wenn wie halbi 5']).toarray())

array([[0.51182204, 0.40406037, 0.08411759]])

In [20]:
clf.classes_

array(['APP', 'NC', 'NEWS'], dtype=object)
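The columns returned by `predict_proba` follow the order of `clf.classes_`, so the two outputs above can be paired to read off the probability per class. The sketch below copies the values from those outputs into plain lists; the variable names are illustrative.

```python
# Values copied from the clf.classes_ and clf.predict_proba outputs above.
classes = ["APP", "NC", "NEWS"]
probabilities = [0.51182204, 0.40406037, 0.08411759]

# Pair each class with its probability and sort from most to least likely.
ranked = sorted(zip(classes, probabilities), key=lambda pair: pair[1], reverse=True)

for label, p in ranked:
    print(f"{label}: {p:.3f}")
```

Here `APP` is only slightly ahead of `NC`, which suggests the model is far from certain about this message.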