# SVM
This notebook walks through the preprocessing and hyperparameter selection for a linear SVM.

In [1]:
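import pandas as pd
df = pd.read_csv('dataset.csv')
# Draw a random subsample of 20,000 rows (no random_state is set, so the sample differs between runs)
df = df.sample(20000)
df_target = df['humor']
# drop() returns a new DataFrame, so the result has to be assigned
df_data = df.drop(columns='humor')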

df_target.head()

4969      False
143983    False
46974     False
7079      False
100492    False
Name: humor, dtype: bool

## Preprocessing
The preprocessing for the SVM consists of stemming only, since this approach showed the best results.
The text is then vectorized using TF-IDF.
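
For reference, the weighting can be written out as follows. This is a sketch that assumes scikit-learn's default `TfidfVectorizer` settings (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), which apply here as long as only `tokenizer` and `min_df` are passed. The symbols are introduced for illustration: $n$ is the number of documents, $\mathrm{df}(t)$ the number of documents containing the stem $t$, and $\mathrm{tf}(t, d)$ the count of $t$ in document $d$.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t),
\qquad
\hat{w}(t, d) = \frac{w(t, d)}{\sqrt{\sum_{t'} w(t', d)^2}}
```

Stems that occur in many documents receive a small weight, and the final L2 normalization gives every document vector unit length.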

In [2]:
from sklearn import preprocessing

#encode target to numeric
label_encoder = preprocessing.LabelEncoder()
df_target = label_encoder.fit_transform(df_target)
#df_target

In [3]:
import nltk
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
import re

# On the first run, uncomment the following line once to download the NLTK stopword list.
#nltk.download('stopwords')

# Definition of the stemming tokenizer.
# The pattern matches tokens of two or more word characters
# (the same pattern TfidfVectorizer uses by default).
token_pattern = re.compile(r"(?u)\b\w\w+\b")
stemmer = PorterStemmer()  # created once and reused for every document

def tokenize(text):
    # Extract the tokens and reduce each one to its Porter stem
    return [stemmer.stem(token) for token in token_pattern.findall(text)]
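
To make the token pattern concrete, here is a minimal sketch that applies it to a made-up sentence (the sentence is hypothetical and not taken from the dataset). It shows only the tokenization step, without stemming, and uses nothing beyond the standard library.

```python
import re

# Same pattern as in the tokenizer above: runs of two or more word characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical example sentence, not taken from the dataset.
sample = "I saw a cat's hat, twice!"
tokens = token_pattern.findall(sample)
print(tokens)  # ['saw', 'cat', 'hat', 'twice']
```

Single-character tokens such as "I", "a" and the "s" after the apostrophe are dropped, and punctuation never becomes part of a token. Note that `TfidfVectorizer` lowercases the text by default before it calls a custom tokenizer.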

In [4]:
# Stem and vectorize the text with the TF-IDF vectorizer
stem_vectorizer = TfidfVectorizer(tokenizer=tokenize, min_df=0.0001)
matrix = stem_vectorizer.fit_transform(df_data['text'])

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_data_stemmed = pd.DataFrame(matrix.toarray(), columns=stem_vectorizer.get_feature_names_out())
#display(df_data_stemmed)




## Grid Search and Cross Validation

In the following, different parameter values are tested using a grid search combined with stratified 5-fold cross-validation to find the best parameter combination for the final model.
Note: The grid contains 280 parameter combinations, which gives 1400 fits over 5 folds. Because some combinations of `penalty`, `loss` and `dual` are not supported by `LinearSVC`, 700 of these 1400 fits failed, so only 140 of the 280 combinations could be evaluated.
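
The numbers in the note can be checked with a short sketch. It assumes the `tol` and `C` values of the grid used in this notebook and lists the four (`penalty`, `loss`, `dual`) triples that `LinearSVC` supports. The names `valid_triples`, `valid_parameters` and `count_combinations` are hypothetical helpers introduced for this illustration only.

```python
# Values shared by every sub-grid (copied from the grid used in this notebook).
tol_values = [1, 1e-01, 1e-02, 1e-03, 1e-04, 1e-05, 1e-06]
c_values = [1, 2, 10, 100, 1000]

# The full grid, including the unsupported combinations.
full_grid = {
    'penalty': ['l2', 'l1'],
    'loss': ['hinge', 'squared_hinge'],
    'dual': [True, False],
    'tol': tol_values,
    'C': c_values,
}

# (penalty, loss, dual) triples that LinearSVC supports.
valid_triples = [
    ('l2', 'hinge', True),
    ('l2', 'squared_hinge', True),
    ('l2', 'squared_hinge', False),
    ('l1', 'squared_hinge', False),
]

# One sub-grid per supported triple.
valid_parameters = [
    {'penalty': [p], 'loss': [l], 'dual': [d], 'tol': tol_values, 'C': c_values}
    for p, l, d in valid_triples
]

def count_combinations(grids):
    """Count the parameter combinations described by a list of grid dicts."""
    total = 0
    for grid in grids:
        size = 1
        for values in grid.values():
            size *= len(values)
        total += size
    return total

n_splits = 5
n_full = count_combinations([full_grid])
n_valid = count_combinations(valid_parameters)
print(n_full, n_full * n_splits)    # 280 1400
print(n_valid, n_valid * n_splits)  # 140 700
```

Since `GridSearchCV` also accepts a list of dictionaries as its parameter grid, a list like `valid_parameters` could be passed in place of the single dictionary, which would skip the unsupported fits and the warnings they produce.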

In [5]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split

# Create train/test split
df_data_train, df_data_test, df_target_train, df_target_test = train_test_split(
    df_data_stemmed, df_target, test_size=0.2, random_state=42)

svm = LinearSVC(random_state=42,max_iter=300000)

# Specify the tunable hyper parameters
parameters = {
    'penalty': ['l2','l1'],
    'loss': ['hinge','squared_hinge'],
    'dual': [True,False],
    'tol': [1, 1e-01, 1e-02, 1e-03, 1e-04, 1e-05, 1e-06],
    'C': [1, 2, 10, 100, 1000]
}

# Define KFold parameters
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

#run grid search and train model
estimator = GridSearchCV(svm, parameters, scoring="accuracy", cv=cv)
estimator.fit(df_data_train, df_target_train)

#print results
print(estimator.best_params_)
print(estimator.best_estimator_)
print(estimator.best_score_)

700 fits failed out of a total of 1400.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
175 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\mcl.NB-MCL\anaconda3\envs\datamining\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\mcl.NB-MCL\anaconda3\envs\datamining\lib\site-packages\sklearn\svm\_classes.py", line 257, in fit
    self.coef_, self.intercept_, self.n_iter_ = _fit_liblinear(
  File "C:\Users\mcl.NB-MCL\anaconda3\envs\datamining\lib\site-packages\sklearn\svm\_base.py", line 1185, in _fit_liblinear
    solver_type = _get_liblinear_solver_type(multi_class, penalty, loss, dual)
  File "C

{'C': 1, 'dual': True, 'loss': 'hinge', 'penalty': 'l2', 'tol': 0.1}
LinearSVC(C=1, loss='hinge', max_iter=300000, random_state=42, tol=0.1)
0.8963125
