# Winner prediction (manual labels)
In our dataset, each test has multiple candidate headlines, of which one is selected as the winner. My goal in this section is to predict this winner based on the manually labelled data.

In [1]:
from util import get_preprocessed_dataset

df = get_preprocessed_dataset()
df[df["Test"] == 1]

Unnamed: 0,Test,Headline ID,Winner,Headline,Actief,Lang,Vragen,Interpunctie,Tweeledigheid,Emotie,...,Lidwoorden,Adjectieven,Eigennamen,Betrekking,Voor+Achternaam,Cijfers,Quotes,Wat zit erin voor mij?,Modaliteit,Sensatie
0,1,A,False,Barack en Michelle Obama laten dansmoves zien ...,0,1,0,0,0,0,...,0,0,0,0,1,0,0,,,
1,1,B,True,Barack en Michelle Obama gaan helemaal los tij...,1,0,0,0,0,0,...,0,0,0,0,1,0,0,,,


The number of candidate headlines differs from test to test, e.g.:

In [2]:
df[df["Test"] == 16]

Unnamed: 0,Test,Headline ID,Winner,Headline,Actief,Lang,Vragen,Interpunctie,Tweeledigheid,Emotie,...,Lidwoorden,Adjectieven,Eigennamen,Betrekking,Voor+Achternaam,Cijfers,Quotes,Wat zit erin voor mij?,Modaliteit,Sensatie
37,16,A,False,"Verpleegster klapt uit de biecht: ""Ja, onze pa...",1,1,0,0,1,0,...,0,0,0,0,0,0,1,,,0.0
38,16,B,False,"Verpleegster klapt uit de biecht: ""Ik kijk pat...",1,1,0,0,1,0,...,0,0,0,0,0,0,1,,,0.0
39,16,C,False,"Verpleegster klapt uit de biecht: ""Ik kijk hen...",1,1,0,0,1,0,...,0,0,0,0,0,0,1,,,0.0
40,16,D,False,"Verpleegster klapt uit de biecht: ""Soms is het...",1,1,0,0,1,0,...,0,0,0,0,0,0,1,,,0.0
41,16,E,True,"Verpleegster klapt uit de biecht: ""Ook wij voe...",1,1,0,0,1,0,...,0,0,0,0,0,0,1,,,1.0
42,16,F,False,"Verpleegster klapt uit de biecht: ""Ik gooide k...",1,1,0,0,1,0,...,0,0,0,0,0,0,1,,,0.0
43,16,G,False,"Verpleegster klapt uit de biecht., ""Eén keer g...",1,1,0,0,1,0,...,0,0,0,0,0,1,1,,,0.0
44,16,H,False,"Verpleegster klapt uit de biecht: ""Ik was 25 e...",1,1,0,0,1,0,...,0,0,0,0,0,1,1,,,0.0


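This test already shows that the manually labelled features alone won't be enough to predict the winner: in the columns shown from Actief to Quotes, the winning headline (E) is labelled identically to most of the losing candidates.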
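To get a feeling for how often this happens, one could flag every test in which a losing headline carries exactly the same labels as the winner. The sketch below does this on a small made-up dataframe; `toy`, `label_cols` and `winner_is_ambiguous` are hypothetical names, and on the real data the label columns would presumably come from `get_label_columns()`.

```python
import pandas as pd

# Toy stand-in for the real dataframe: column names mirror the notebook,
# the values are invented for illustration only.
toy = pd.DataFrame({
    "Test":        [1, 1, 2, 2, 2],
    "Headline ID": ["A", "B", "A", "B", "C"],
    "Winner":      [False, True, False, True, False],
    "Actief":      [0, 1, 1, 1, 1],
    "Quotes":      [0, 0, 1, 1, 0],
})
label_cols = ["Actief", "Quotes"]  # assumed subset of the label columns


def winner_is_ambiguous(group: pd.DataFrame) -> bool:
    """True if some losing headline has exactly the same labels as the winner.

    Assumes the Winner column is boolean and each test has one winner.
    """
    winner = group.loc[group["Winner"], label_cols].iloc[0]
    losers = group.loc[~group["Winner"], label_cols]
    # Compare every losing row with the winner, column by column
    return bool((losers == winner).all(axis=1).any())


ambiguous = {
    test_id: winner_is_ambiguous(group)
    for test_id, group in toy.groupby("Test")
}
print(ambiguous)
```

For tests flagged as ambiguous, no classifier that only sees these labels can tell the winner apart from the identical loser.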

In [3]:
n_tests = df["Test"].nunique()
print(f"There are {n_tests} tests in total, with on average {len(df) / n_tests:.2f} candidate headlines per test.")

There are 909 tests in total, with on average 2.48 candidate headlines per test.
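The average hides how the group sizes are spread. A minimal sketch of how the distribution could be inspected, shown on a made-up `toy` dataframe rather than on the real `df`:

```python
import pandas as pd

# Toy stand-in: three tests with 2, 3 and 2 candidate headlines
toy = pd.DataFrame({"Test": [1, 1, 2, 2, 2, 3, 3]})

# Number of candidate headlines in each test
candidates_per_test = toy.groupby("Test").size()

# How many tests have 2 candidates, how many have 3, ...
size_distribution = candidates_per_test.value_counts().sort_index()
print(size_distribution)
```

Applied to the real dataframe, the same two lines would show how many tests are simple A/B tests and how many have more candidates, like test 16 above.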


## Train-test split
Our train-test split naturally differs from the one used for the label classification problem. All candidate headlines of a given test now need to end up in the same split, rather than being spread across both.

In [4]:
from sklearn.model_selection import train_test_split

# Get the unique test IDs
all_test_ids = df.Test.unique()

# Split on the test IDs, so that all candidates of a test stay together
train_ids, test_ids = train_test_split(all_test_ids, test_size=0.2, random_state=42)
train_ids[:5]

array([84, 11, 617, 250, 870], dtype=object)
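As an alternative to splitting the IDs by hand, scikit-learn's `GroupShuffleSplit` keeps all rows of a group in the same split. The sketch below is not what this notebook uses; it runs on a made-up `toy` dataframe and will generally not reproduce the same partition as the cell above.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in: five tests with two or three candidate headlines each
toy = pd.DataFrame({
    "Test":        [1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5],
    "Headline ID": list("ABABCABABAB"),
})

# Split the rows, using the Test column as the group key
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, groups=toy["Test"]))
toy_train, toy_test = toy.iloc[train_idx], toy.iloc[test_idx]

# No test should appear in both splits
shared = set(toy_train["Test"]) & set(toy_test["Test"])
print(shared)
```

The disjointness check at the end applies just as well to a manual split on IDs.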

In [5]:
# Get the train and test dataframes
train_df = df[df["Test"].isin(train_ids)]
test_df = df[df["Test"].isin(test_ids)]
train_df.head()

Unnamed: 0,Test,Headline ID,Winner,Headline,Actief,Lang,Vragen,Interpunctie,Tweeledigheid,Emotie,...,Lidwoorden,Adjectieven,Eigennamen,Betrekking,Voor+Achternaam,Cijfers,Quotes,Wat zit erin voor mij?,Modaliteit,Sensatie
0,1,A,False,Barack en Michelle Obama laten dansmoves zien ...,0,1,0,0,0,0,...,0,0,0,0,1,0,0,,,
1,1,B,True,Barack en Michelle Obama gaan helemaal los tij...,1,0,0,0,0,0,...,0,0,0,0,1,0,0,,,
2,2,A,True,Marc Coucke maakt opvallende keuze bij start v...,1,0,0,0,0,0,...,0,0,0,0,1,0,0,,,
3,2,B,False,"Marc Coucke: ""Spelers van 10 miljoen? Neen, li...",1,0,0,0,1,0,...,0,0,0,0,1,0,1,,,
4,3,A,True,Maandag drukste dag van het jaar op Brussels A...,1,1,0,0,1,0,...,0,0,0,1,0,0,1,1.0,,


In [6]:
# Get the number of unique values in "Test" column
print(f"{len(train_df.Test.unique())} tests in the train set, {len(test_df.Test.unique())} tests in the test set")

727 tests in the train set, 182 tests in the test set


In [7]:
# train_x holds all columns except Winner; train_y holds the (Test, Headline ID) pair of each winning headline
train_x = train_df.drop(columns=["Winner"])
train_y = train_df[train_df["Winner"] == True][["Test", "Headline ID"]]

test_x = test_df.drop(columns=["Winner"])
test_y = test_df[test_df["Winner"] == True][["Test", "Headline ID"]]

In [8]:
train_x.head()

Unnamed: 0,Test,Headline ID,Headline,Actief,Lang,Vragen,Interpunctie,Tweeledigheid,Emotie,Voorwaartse Verwijzing,...,Lidwoorden,Adjectieven,Eigennamen,Betrekking,Voor+Achternaam,Cijfers,Quotes,Wat zit erin voor mij?,Modaliteit,Sensatie
0,1,A,Barack en Michelle Obama laten dansmoves zien ...,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,,,
1,1,B,Barack en Michelle Obama gaan helemaal los tij...,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,,,
2,2,A,Marc Coucke maakt opvallende keuze bij start v...,1,0,0,0,0,0,1,...,0,0,0,0,1,0,0,,,
3,2,B,"Marc Coucke: ""Spelers van 10 miljoen? Neen, li...",1,0,0,0,1,0,0,...,0,0,0,0,1,0,1,,,
4,3,A,Maandag drukste dag van het jaar op Brussels A...,1,1,0,0,1,0,0,...,0,0,0,1,0,0,1,1.0,,


In [9]:
train_y.head()

Unnamed: 0,Test,Headline ID
1,1,B
2,2,A
4,3,A
8,4,B
10,5,B


In [10]:
# Assert that no test has more than one winner in train_y
assert len(train_y.Test.unique()) == len(train_y)
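Because `train_y` only contains winning rows, the assertion above catches tests with several winners but not tests without any. A sketch of a stricter check, counting winners per test on a made-up `toy` dataframe (assuming a boolean Winner column):

```python
import pandas as pd

# Toy stand-in: test 1 is fine, test 2 has two winners, test 3 has none
toy = pd.DataFrame({
    "Test":   [1, 1, 2, 2, 2, 3, 3],
    "Winner": [False, True, True, False, True, False, False],
})

# Summing booleans counts the True values per test
winners_per_test = toy.groupby("Test")["Winner"].sum()

# Keep only the tests that do not have exactly one winner
bad_tests = winners_per_test[winners_per_test != 1]
print(bad_tests.to_dict())
```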

For the test dataset, the test and the headline ID are all we need, but for the training dataset it might be useful to keep the Winner label, so it doesn't need to be reconstructed:

In [11]:
train_y_full = train_df[train_df["Winner"] == True][["Test", "Headline ID", "Winner"]]
train_y_full.head()

Unnamed: 0,Test,Headline ID,Winner
1,1,B,True
2,2,A,True
4,3,A,True
8,4,B,True
10,5,B,True
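With the targets stored as (Test, Headline ID) pairs, a prediction in the same format can be scored by joining the two on the test ID. This is only a sketch of one possible evaluation; `toy_y` and `toy_pred` are made-up frames, not outputs of this notebook.

```python
import pandas as pd

# Made-up ground truth and predictions, one row per test
toy_y = pd.DataFrame({"Test": [1, 2, 3], "Headline ID": ["B", "A", "C"]})
toy_pred = pd.DataFrame({"Test": [1, 2, 3], "Headline ID": ["B", "B", "C"]})

# Align predictions with the ground truth; validate guards against
# duplicated test IDs on either side
merged = toy_y.merge(
    toy_pred, on="Test", suffixes=("_true", "_pred"), validate="one_to_one"
)

# Fraction of tests for which the predicted headline is the actual winner
accuracy = (merged["Headline ID_true"] == merged["Headline ID_pred"]).mean()
print(f"{accuracy:.2f}")
```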


### Feature vectors
Classifiers often only want the features themselves as training input, rather than the entire dataframe, so let's provide that as well.

In [12]:
from util import get_label_columns
train_x_features = train_x[get_label_columns()]
train_x_features.head()

Unnamed: 0,Actief,Lang,Vragen,Interpunctie,Tweeledigheid,Emotie,Voorwaartse Verwijzing,Signaalwoorden,Lidwoorden,Adjectieven,Eigennamen,Betrekking,Voor+Achternaam,Cijfers,Quotes
0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0
1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0
3,1,0,0,0,1,0,0,0,0,0,0,0,1,0,1
4,1,1,0,0,1,0,0,0,0,0,0,1,0,0,1
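One way such feature vectors could be turned into a winner prediction is to score every headline individually and then pick the highest-scoring candidate within each test. The sketch below illustrates only the mechanics on a made-up `toy` dataframe: `LogisticRegression` is a stand-in rather than the model used in this project, `label_cols` is a hypothetical subset of the label columns, and scoring the training rows themselves is done purely to keep the example short.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-in: column names mirror the notebook, values are invented
toy = pd.DataFrame({
    "Test":        [1, 1, 2, 2, 3, 3, 3],
    "Headline ID": ["A", "B", "A", "B", "A", "B", "C"],
    "Winner":      [False, True, True, False, False, False, True],
    "Actief":      [0, 1, 1, 0, 0, 0, 1],
    "Quotes":      [0, 0, 1, 1, 1, 0, 0],
})
label_cols = ["Actief", "Quotes"]  # assumed subset of the label columns

# Fit a per-headline classifier on the feature vectors
clf = LogisticRegression()
clf.fit(toy[label_cols], toy["Winner"])

# Probability of the True class (second column) for every headline
scored = toy.assign(score=clf.predict_proba(toy[label_cols])[:, 1])

# Within each test, keep the row with the highest score
best_rows = scored.groupby("Test")["score"].idxmax()
predicted = scored.loc[best_rows, ["Test", "Headline ID"]].reset_index(drop=True)
print(predicted)
```

The result has one row per test with the same two columns as `train_y`, so it can be compared against the targets directly.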
