# SVM
This notebook shows the training of the final SVM model.

- accuracy: 0.922975

In [10]:
import pandas as pd
df = pd.read_csv('dataset.csv')

df_target = df['humor']
# drop() returns a new DataFrame, so the result has to be assigned
df_data = df.drop(columns='humor')

df.describe()

| | text | humor |
|---|---|---|
| count | 200000 | 200000 |
| unique | 200000 | 2 |
| top | "Joe biden rules out 2020 bid: 'guys, i'm not r..." | False |
| freq | 1 | 100000 |


## Preprocessing
The preprocessing for the SVM consists only of stemming, since this approach showed the best results.
The data is then vectorized using TF-IDF.

In [2]:
from sklearn import preprocessing

#encode target to numeric
label_encoder = preprocessing.LabelEncoder()
df_target = label_encoder.fit_transform(df_target)
#df_target
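
`LabelEncoder` sorts the distinct labels and replaces each one with its index in that sorted order. Assuming the `humor` column holds booleans, `False` becomes 0 and `True` becomes 1. The following is a minimal pure-Python sketch of that mapping; the sample values are made up for illustration and it is not the scikit-learn implementation.

```python
# Pure-Python stand-in for LabelEncoder on a boolean column (illustration only).
labels = [False, True, True, False]   # made-up sample values, not from dataset.csv

classes = sorted(set(labels))                            # distinct labels in sorted order
class_to_index = {c: i for i, c in enumerate(classes)}   # label -> integer code
encoded = [class_to_index[value] for value in labels]

print(classes)
print(encoded)
```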

In [3]:
import nltk
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
import re, string

# When running for the first time, uncomment the following line once.
#nltk.download('stopwords')

# Definition of the stemming tokenizer
token_pattern = re.compile(r"(?u)\b\w\w+\b")  # matches tokens of two or more word characters (scikit-learn's default token pattern)

stemmer = PorterStemmer()  # create once and reuse for every document

def tokenize(text):
    # Extract the tokens and reduce each one to its stem
    return [stemmer.stem(token) for token in token_pattern.findall(text)]
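
To make the behaviour of the token pattern concrete, here is a small sketch that applies the same regular expression to a made-up sentence (the sentence is an assumption for illustration, not taken from the dataset). Single-character tokens and punctuation are dropped, so contractions such as "it's" lose their suffix. Stemming is left out so that the sketch only needs the standard library.

```python
import re

# Same pattern as in the notebook: runs of two or more word characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

sample = "I'm a big fan of A/B testing, it's 100% fun!"   # made-up example sentence
tokens = token_pattern.findall(sample)

print(tokens)   # ['big', 'fan', 'of', 'testing', 'it', '100', 'fun']
```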

In [4]:
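# Stem the data and vectorize it with TF-IDF

# A float min_df is a proportion: terms found in fewer than 0.01% of the documents (20 of 200,000) are dropped
stem_vectorizer = TfidfVectorizer(tokenizer=tokenize, min_df=0.0001)
matrix = stem_vectorizer.fit_transform(df_data['text'])
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
df_data_stemmed = pd.DataFrame(matrix.toarray(), columns=stem_vectorizer.get_feature_names_out())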
#display(df_data_stemmed)
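
As a rough illustration of what the TF-IDF weighting computes, the sketch below reimplements the formula that `TfidfVectorizer` uses with its default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalised rows) in plain Python. The three tiny "documents" are made up and assumed to be already stemmed; this is not the library code.

```python
import math
from collections import Counter

# Made-up, already stemmed documents (illustration only).
docs = [["funni", "joke"], ["news", "joke"], ["news", "report"]]

n_docs = len(docs)
# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return the L2-normalised TF-IDF weights of one tokenised document."""
    counts = Counter(doc)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf(doc) for doc in docs]
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

In the first document the rarer term "funni" receives a higher weight than "joke", which occurs in two of the three documents.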




## SVM Model
Here we train the final model with the identified parameters and compute its accuracy afterwards.

In [9]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create train/test split
df_data_train, df_data_test, df_target_train, df_target_test = train_test_split(
    df_data_stemmed, df_target, test_size=0.2, random_state=42)

# SVM classifier with the final parameters
svm = LinearSVC(random_state=42, tol=0.1, C=1, dual=True, penalty='l2', loss='hinge', max_iter=5000)

#train final model
svm.fit(df_data_train, df_target_train)

#test final model
df_prediction = svm.predict(df_data_test)

print("Accuracy: {}".format(accuracy_score(df_target_test, df_prediction)))

Accuracy: 0.922975
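
For readers unfamiliar with the parameters `penalty='l2'`, `loss='hinge'` and `C=1`, the sketch below evaluates the textbook soft-margin objective `0.5 * ||w||^2 + C * sum(max(0, 1 - y_i * (w . x_i + b)))` in plain Python. It is an assumption-laden illustration: the labels are encoded as -1/+1, the data points and weights are made up, and the solver behind `LinearSVC` differs in details such as how the intercept is handled.

```python
def hinge_objective(w, b, X, y, C=1.0):
    """L2-regularised hinge loss for a linear model; labels must be -1 or +1."""
    regulariser = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        # A point only contributes if it lies inside the margin or on the wrong side.
        loss += max(0.0, 1.0 - yi * score)
    return regulariser + C * loss

# Made-up toy data (illustration only).
X = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
y = [1, -1, 1]
w, b = [1.0, -1.0], 0.0

value = hinge_objective(w, b, X, y, C=1.0)
print(value)   # 2.0
```

The first two points lie outside the margin and add no loss; the third sits on the decision boundary and adds a loss of 1, so a larger `C` would penalise it more strongly relative to the regulariser.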


In [12]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Also check other evaluation measures to make sure they agree with the accuracy
print("Precision: {}".format(precision_score(df_target_test, df_prediction)))
print("Recall: {}".format(recall_score(df_target_test, df_prediction)))
print("F1: {}".format(f1_score(df_target_test, df_prediction)))

Precision: 0.9200516436587546
Recall: 0.9264463223161158
F1: 0.9232379101577098


Since precision, recall, and F1 score are all close to the accuracy (about 0.92), these measures are not used further.
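
As a quick sanity check, the F1 score is the harmonic mean of precision and recall, so the value returned by `f1_score` can be recomputed from the two other printed numbers. The short sketch below does this with the values copied from the output above.

```python
# Values copied from the notebook output above.
precision = 0.9200516436587546
recall = 0.9264463223161158

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print("F1 from precision and recall: {:.6f}".format(f1))
```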