# Extra features
We definitely need some extra features, since the manual labels alone don't give good results. We'll test each new feature set on both naïve Bayes and XGBoost.

## Data
Our train and test sets remain the same.

In [4]:
from util import get_wpm_train_test, get_label_columns

train_x_full, train_y, test_x, test_y, groups = get_wpm_train_test(include_groups=True, x_train_features_only=False)

features = get_label_columns() # initially only the manual labels

## Baseline
Get the baseline results.

### Naive Bayes

In [5]:
from util import get_naive_bayes_model_wp, fit_predict_print_wp, print_wp_evaluation, predict_wp

model = get_naive_bayes_model_wp()

model.fit(train_x_full[features], train_y["Winner"])
predicted_winners = predict_wp(model, test_x, features=features)
print_wp_evaluation(predicted_winners, test_y)

Accuracy: 57.14% (104/182)


### XGBoost

In [7]:
from util import get_xgboost_model_wp, fit_predict_print_wp, print_wp_evaluation, predict_wp

model = get_xgboost_model_wp()

model.fit(train_x_full[features], train_y["Winner"], groups)
predicted_winners = predict_wp(model, test_x, features=features)
print_wp_evaluation(predicted_winners, test_y)



Accuracy: 51.65% (94/182)


## Sample feature
Let's add the length of the headline (in characters) as an extra feature.

In [9]:
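if "Length" not in features:  # avoid a duplicate entry when the cell is re-run
    features.append("Length")

def add_length_to_dataframe(df):
    """Return a copy of df with the headline's character count in "Length"."""
    modified_df = df.copy()
    modified_df["Length"] = df["Headline"].apply(len)
    return modified_df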

train_x_full_new = add_length_to_dataframe(train_x_full)
test_x_new = add_length_to_dataframe(test_x)
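
To check what the helper produces, here is a minimal sketch on a toy DataFrame. The two headlines are made up for illustration and do not come from the dataset; the only assumption carried over is that the headline text lives in a `Headline` column.

```python
import pandas as pd

def add_length_to_dataframe(df):
    # Work on a copy so the caller's DataFrame is left untouched
    modified_df = df.copy()
    # Character count of each headline, spaces included
    modified_df["Length"] = df["Headline"].apply(len)
    return modified_df

# Hypothetical headlines, for illustration only
toy = pd.DataFrame({"Headline": ["Short one", "A noticeably longer headline"]})
with_length = add_length_to_dataframe(toy)

lengths = with_length["Length"].tolist()
print(lengths)
```

Because the function copies its input, `toy` keeps its original columns, which is why the notebook can keep `train_x_full` and `test_x` around for the baseline runs.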

### Evaluate

In [10]:
from util import get_naive_bayes_model_wp, fit_predict_print_wp, print_wp_evaluation, predict_wp

model = get_naive_bayes_model_wp()

model.fit(train_x_full_new[features], train_y["Winner"])
predicted_winners = predict_wp(model, test_x_new, features=features)
print_wp_evaluation(predicted_winners, test_y)

Accuracy: 57.14% (104/182)


In [11]:
from util import get_xgboost_model_wp, fit_predict_print_wp, print_wp_evaluation, predict_wp

model = get_xgboost_model_wp()

model.fit(train_x_full_new[features], train_y["Winner"], groups)
predicted_winners = predict_wp(model, test_x_new, features=features)
print_wp_evaluation(predicted_winners, test_y)



Accuracy: 53.85% (98/182)
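
The accuracies printed so far can be compared directly. The sketch below only redoes the arithmetic on the correct-prediction counts shown in the outputs above (out of 182 test pairs); `accuracy_pct` and the `results` dictionary are names introduced here for illustration and are not part of `util`.

```python
TOTAL = 182  # size of the test set, taken from the printed outputs

def accuracy_pct(correct, total):
    # Accuracy as a percentage, rounded to two decimals like the printed output
    return round(100 * correct / total, 2)

# Correct-prediction counts copied from the evaluation outputs above
results = {
    "naive bayes": {"baseline": 104, "with Length": 104},
    "xgboost": {"baseline": 94, "with Length": 98},
}

for name, counts in results.items():
    before = accuracy_pct(counts["baseline"], TOTAL)
    after = accuracy_pct(counts["with Length"], TOTAL)
    gained = counts["with Length"] - counts["baseline"]
    print(f"{name}: {before:.2f}% -> {after:.2f}% ({gained:+d} correct predictions)")
```

On a test set this small, each additional correct prediction is worth roughly 0.55 percentage points, so the XGBoost change corresponds to four more correctly predicted winners.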


"Length" ranked only 8th among the important features in the XGBoost model, yet it already raised accuracy from 51.65% to 53.85% (94 to 98 of 182), so let's hope further features keep increasing it. It made no difference for naïve Bayes, which stayed at 57.14%.