# Spooky Books Author Prediction - Version 3

This notebook improves on the original pipeline with:
1. Improved TF-IDF preprocessing
2. Logistic Regression hyperparameter tuning
3. Added text metadata features
4. Cross-validated log-loss evaluation

## 1. Import Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from scipy.sparse import hstack

## 2. Load Data

In [2]:
train = pd.read_csv('./train/train.csv')
test  = pd.read_csv('./test/test.csv')
print('Train:', train.shape, 'Test:', test.shape)

Train: (19579, 3) Test: (8392, 2)


## 3. Feature Engineering & TF-IDF

In [3]:
# Metadata features: character length, word count, uppercase ratio
def extract_meta(texts):
    lengths = [len(t) for t in texts]
    words = [len(t.split()) for t in texts]
    # max(len(t), 1) guards against division by zero on empty strings
    uppers = [sum(c.isupper() for c in t) / max(len(t), 1) for t in texts]
    return np.vstack([lengths, words, uppers]).T

meta_train = extract_meta(train['text'])
meta_test = extract_meta(test['text'])

# TF-IDF with improved parameters
vectorizer = TfidfVectorizer(
    max_features=10000,
    ngram_range=(1,2),
    stop_words='english',
    min_df=3,
    max_df=0.9
)
X_tfidf = vectorizer.fit_transform(train['text'])
X_test_tfidf = vectorizer.transform(test['text'])

# Combine TF-IDF and metadata (CSR format supports the row indexing used in cross-validation)
X_train_full = hstack([X_tfidf, meta_train], format='csr')
X_test_full = hstack([X_test_tfidf, meta_test], format='csr')
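
The metadata columns sit on a very different scale from the TF-IDF values: character counts can run into the hundreds, while L2-normalized TF-IDF entries lie in [0, 1]. The scikit-learn documentation notes that fast convergence of the `saga` solver is only guaranteed when features have approximately the same scale. A minimal sketch of one common remedy, standardizing the dense columns before stacking, is shown below. The toy sentences and the use of `StandardScaler` are assumptions for illustration and are not part of the original pipeline.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy sentences (hypothetical), standing in for train['text'] and test['text'].
toy_train = [
    "It was a DARK and stormy night.",
    "The raven sat upon the bust of Pallas.",
    "I am alone and miserable.",
]
toy_test = ["Nothing could be heard but the wind."]

def extract_meta(texts):
    lengths = [len(t) for t in texts]
    words = [len(t.split()) for t in texts]
    uppers = [sum(c.isupper() for c in t) / max(len(t), 1) for t in texts]
    return np.vstack([lengths, words, uppers]).T

vec = TfidfVectorizer()
X_tr = vec.fit_transform(toy_train)
X_te = vec.transform(toy_test)

# Fit the scaler on training metadata only, then reuse it for the test set.
scaler = StandardScaler()
meta_tr = scaler.fit_transform(extract_meta(toy_train))
meta_te = scaler.transform(extract_meta(toy_test))

# Stack sparse TF-IDF with the scaled metadata columns.
X_tr_full = hstack([X_tr, csr_matrix(meta_tr)], format="csr")
X_te_full = hstack([X_te, csr_matrix(meta_te)], format="csr")
print(X_tr_full.shape, X_te_full.shape)
```

Fitting the scaler on the training rows only keeps test-set statistics out of the features. Inside cross-validation the same idea would call for wrapping the scaler in a pipeline so it is refit per fold.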

## 4. Cross-Validation Log-loss

In [4]:
# Model with class_weight to handle class imbalance
clf = LogisticRegression(
    multi_class='multinomial',
    solver='saga',
    C=1.0,
    max_iter=1000,
    class_weight='balanced',
    random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X_train_full, train['author'],
                         cv=cv, scoring='neg_log_loss', n_jobs=-1)
print('CV Log-loss:', -scores.mean(), '±', scores.std())

CV Log-loss: 1.099856573416362 ± 0.00041719144257379594
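
For reference, a classifier that assigns probability 1/3 to each of the three authors has a log-loss of ln 3 ≈ 1.0986, so the score above is essentially at chance level. The sketch below checks that baseline with `sklearn.metrics.log_loss`. The toy labels are made up for illustration and are not drawn from the dataset.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy labels (hypothetical), covering the three author classes.
y_true = ["EAP", "HPL", "MWS", "EAP", "MWS", "HPL"]
labels = ["EAP", "HPL", "MWS"]

# Uniform prediction: probability 1/3 for every class on every row.
uniform_probs = np.full((len(y_true), len(labels)), 1.0 / len(labels))

uniform_loss = log_loss(y_true, uniform_probs, labels=labels)
print(uniform_loss, np.log(3))
```

Because every row receives the same probability for its true class, the uniform log-loss equals ln 3 regardless of how the labels are distributed, which makes it a convenient yardstick for any three-class score.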


## 5. Train Final Model & Predict

In [5]:
# Train on the full training set
final_model = LogisticRegression(
    multi_class='multinomial',
    solver='saga',
    C=1.0,
    max_iter=1000,
    class_weight='balanced',
    random_state=42
)
final_model.fit(X_train_full, train['author'])

# Predict class probabilities
y_pred = final_model.predict_proba(X_test_full)
submission = pd.DataFrame(y_pred, columns=final_model.classes_)
submission.insert(0, 'id', test['id'])
submission = submission[['id', 'EAP', 'HPL', 'MWS']]
submission.to_csv('./sample_submission/submission_v3.csv', index=False)
submission.head()



|   | id | EAP | HPL | MWS |
|---|---|---|---|---|
| 0 | id02310 | 0.32749 | 0.344451 | 0.328059 |
| 1 | id24541 | 0.296788 | 0.329853 | 0.373359 |
| 2 | id00134 | 0.321989 | 0.350467 | 0.327545 |
| 3 | id27757 | 0.312124 | 0.338091 | 0.349784 |
| 4 | id04081 | 0.323291 | 0.325298 | 0.351412 |
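
A quick sanity check on the submission frame can catch formatting problems before upload. The sketch below assumes the expected format is one row per unique `id` with the three author probabilities summing to 1. `check_submission` is a hypothetical helper, and the three sample rows are copied from the output above.

```python
import numpy as np
import pandas as pd

def check_submission(df, class_cols=("EAP", "HPL", "MWS")):
    """Hypothetical helper: basic format checks on a probability submission."""
    probs = df[list(class_cols)].to_numpy()
    return {
        "unique_ids": bool(df["id"].is_unique),
        "in_unit_interval": bool(((probs >= 0) & (probs <= 1)).all()),
        # Tolerance allows for the rounding in the displayed values.
        "rows_sum_to_one": bool(np.allclose(probs.sum(axis=1), 1.0, atol=1e-4)),
    }

# First three rows of the output shown above.
sample = pd.DataFrame({
    "id": ["id02310", "id24541", "id00134"],
    "EAP": [0.32749, 0.296788, 0.321989],
    "HPL": [0.344451, 0.329853, 0.350467],
    "MWS": [0.328059, 0.373359, 0.327545],
})

report = check_submission(sample)
print(report)
```

In the notebook, the same call could be applied to the full `submission` frame before writing the CSV.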
