# TF-IDF + Linear SVC (Baseline)

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

In [2]:
# Load the raw data and drop the publication date, which is not used below.
df = pd.read_csv("../data/raw_data.csv").drop(columns=["Publish Date"])
df

Unnamed: 0,Title,Abstract,Primary Category
0,Estimating the J function without edge correction,The interaction between points in a spatial po...,stat.TH
1,Is Bootstrap Really Helpful in Point Process S...,There are some papers which describe the use o...,stat.TH
2,Reconstruction of Gray-scale Images,Reconstruction of images corrupted by noise is...,stat.TH
3,Algorithmic Statistics,While Kolmogorov complexity is the accepted ab...,stat.TH
4,Minimax Entropy and Maximum Likelihood. Comple...,Concept of exponential family is generalized b...,stat.TH
...,...,...,...
60749,A stochastic optimisation unadjusted Langevin ...,This paper presents a novel stochastic optimis...,stat.AP
60750,Estimating hidden population size from a singl...,This work is concerned with the estimation of ...,stat.ME
60751,Repelling-Attracting Hamiltonian Monte Carlo,We propose a variant of Hamiltonian Monte Carl...,stat.TH
60752,Simultaneous Conformal Prediction of Missing O...,We study the problem of simultaneous predictiv...,stat.ME


In [3]:
# Combine title and abstract into a single text field, separated by a blank line.
df["Text"] = df["Title"] + "\n\n" + df["Abstract"]
df = df.drop(columns=["Title", "Abstract"])
df

Unnamed: 0,Primary Category,Text
0,stat.TH,Estimating the J function without edge correct...
1,stat.TH,Is Bootstrap Really Helpful in Point Process S...
2,stat.TH,Reconstruction of Gray-scale Images\n\nReconst...
3,stat.TH,Algorithmic Statistics\n\nWhile Kolmogorov com...
4,stat.TH,Minimax Entropy and Maximum Likelihood. Comple...
...,...,...
60749,stat.AP,A stochastic optimisation unadjusted Langevin ...
60750,stat.ME,Estimating hidden population size from a singl...
60751,stat.TH,Repelling-Attracting Hamiltonian Monte Carlo\n...
60752,stat.ME,Simultaneous Conformal Prediction of Missing O...
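One caveat with the concatenation above: in pandas, adding a string to a missing value yields a missing value, so a row with an empty Title or Abstract would end up with a missing Text. Whether `raw_data.csv` contains such rows is not shown here. The toy frame below is hypothetical and only sketches a defensive variant using `fillna("")`.

```python
import pandas as pd

# Hypothetical toy frame (not the real data): the second row has no abstract.
toy = pd.DataFrame(
    {
        "Title": ["A title", "Another title"],
        "Abstract": ["Some abstract text.", None],
    }
)

# Plain concatenation propagates the missing value into the combined text.
naive = toy["Title"] + "\n\n" + toy["Abstract"]

# Filling missing fields with empty strings keeps every row usable.
safe = toy["Title"].fillna("") + "\n\n" + toy["Abstract"].fillna("")

print("missing after naive concat:", int(naive.isna().sum()))
print("missing after safe concat: ", int(safe.isna().sum()))
```

If the real data has no missing titles or abstracts, both variants give identical results and the original cell is fine as written.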


In [4]:
print(df["Text"].iloc[477])

Laws and Likelihoods for Ornstein Uhlenbeck-Gamma and other BNS OU Stochastic Volatilty models with extensions

In recent years there have been many proposals as flexible alternatives to
Gaussian based continuous time stochastic volatility models. A great deal of
these models employ positive L\'evy processes. Among these are the attractive
non-Gaussian positive Ornstein-Uhlenbeck (OU) processes proposed by
Barndorff-Nielsen and Shephard (BNS) in a series of papers. One current problem
of these approaches is the unavailability of a tractable likelihood based
statistical analysis for the returns of financial assets. This paper, while
focusing on the BNS models, develops general theory for the implementation of
statistical inference for a host of models. Specifically we show how to reduce
the infinite-dimensional process based models to finite, albeit high,
dimensional ones. Inference can then be based on Monte Carlo methods. As
highlights, specific to BNS we show that an OU process drive

### Default Parameters

In [10]:
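# Hold out 30% of the data for testing; the fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    df["Text"], df["Primary Category"], test_size=0.3, random_state=42
)

# Baseline: TF-IDF features with default settings fed into a linear SVM.
pipe = Pipeline(
    [
        ("tfidf", TfidfVectorizer()),
        ("clf", LinearSVC(dual="auto", random_state=42, max_iter=5000)),
    ]
)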

In [11]:
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

print(classification_report(y_test, y_pred))
print(f"Weighted F1-Score: {f1_score(y_test, y_pred, average='weighted')}")

              precision    recall  f1-score   support

     stat.AP       0.63      0.57      0.60      2674
     stat.CO       0.56      0.42      0.48      1009
     stat.ME       0.63      0.66      0.64      5544
     stat.ML       0.77      0.80      0.79      4755
     stat.TH       0.71      0.72      0.71      4245

    accuracy                           0.68     18227
   macro avg       0.66      0.63      0.65     18227
weighted avg       0.68      0.68      0.68     18227

Weighted F1-Score: 0.681991560210697


### Grid Search

In [19]:
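# Search over stop-word removal and the regularization strength C (2 x 5 = 10 candidates).
param_grid = {
    "tfidf__stop_words": ["english", None],
    "clf__C": [0.05, 0.075, 0.1, 0.25, 0.5],
}

# 5-fold cross-validation on the training split, scored by weighted F1.
grid = GridSearchCV(pipe, param_grid, cv=5, verbose=3, scoring="f1_weighted")
grid.fit(X_train, y_train)
print(grid.best_params_)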

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END clf__C=0.05, tfidf__stop_words=english;, score=0.685 total time=   5.8s
[CV 2/5] END clf__C=0.05, tfidf__stop_words=english;, score=0.693 total time=   5.8s
[CV 3/5] END clf__C=0.05, tfidf__stop_words=english;, score=0.688 total time=   5.7s
[CV 4/5] END clf__C=0.05, tfidf__stop_words=english;, score=0.686 total time=   6.0s
[CV 5/5] END clf__C=0.05, tfidf__stop_words=english;, score=0.690 total time=   5.7s
[CV 1/5] END clf__C=0.05, tfidf__stop_words=None;, score=0.686 total time=   6.0s
[CV 2/5] END clf__C=0.05, tfidf__stop_words=None;, score=0.692 total time=   5.9s
[CV 3/5] END clf__C=0.05, tfidf__stop_words=None;, score=0.688 total time=   6.2s
[CV 4/5] END clf__C=0.05, tfidf__stop_words=None;, score=0.684 total time=   6.2s
[CV 5/5] END clf__C=0.05, tfidf__stop_words=None;, score=0.687 total time=   6.1s
[CV 1/5] END clf__C=0.075, tfidf__stop_words=english;, score=0.691 total time=   5.7s
[CV 2/5] END clf__

In [20]:
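# GridSearchCV refits the best pipeline on all of X_train (refit=True by default),
# so predict() here uses that refitted model.
y_pred = grid.predict(X_test)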

print(classification_report(y_test, y_pred))
print(f"Weighted F1-Score: {f1_score(y_test, y_pred, average='weighted')}")

              precision    recall  f1-score   support

     stat.AP       0.69      0.57      0.62      2674
     stat.CO       0.61      0.38      0.47      1009
     stat.ME       0.64      0.70      0.67      5544
     stat.ML       0.78      0.82      0.80      4755
     stat.TH       0.72      0.74      0.73      4245

    accuracy                           0.70     18227
   macro avg       0.69      0.64      0.66     18227
weighted avg       0.70      0.70      0.70     18227

Weighted F1-Score: 0.6984403783457029
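The gap between the macro average (0.66) and the weighted average (0.70) in this report comes from how each is computed: the macro average treats all five categories equally, while the weighted average scales each category's F1 by its support. The arithmetic can be checked directly from the rounded values in the report above; because the inputs are rounded to two decimals, the result only approximates the printed 0.6984.

```python
# Per-class (F1, support) copied from the tuned classification report above.
report = {
    "stat.AP": (0.62, 2674),
    "stat.CO": (0.47, 1009),
    "stat.ME": (0.67, 5544),
    "stat.ML": (0.80, 4755),
    "stat.TH": (0.73, 4245),
}

total_support = sum(support for _, support in report.values())

# Macro average: every category counts the same.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each category counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(f"total support: {total_support}")
print(f"macro F1 (from rounded values):    {macro_f1:.3f}")
print(f"weighted F1 (from rounded values): {weighted_f1:.3f}")
```

Since stat.CO has the smallest support (1009 of 18227), its low F1 pulls the macro average down much more than the weighted one, which is worth keeping in mind when the search is scored with `f1_weighted`.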
