# Label Classification
Soubry defined multiple labels per headline, such as whether it contains emotion, uses a combination of first and last name, or contains a quotation. Let's take a look at the labels she defined:


In [1]:
from util import get_preprocessed_dataset
df = get_preprocessed_dataset()
print(f"The columns in the dataset are: {', '.join(df.columns)}")

The columns in the dataset are: Test, Headline ID, Winner, Headline, Actief, Lang, Vragen, Interpunctie, Tweeledigheid, Emotie, Voorwaartse Verwijzing, Signaalwoorden, Lidwoorden, Adjectieven, Eigennamen, Betrekking, Voor+Achternaam, Cijfers, Quotes, Wat zit erin voor mij?, Modaliteit, Sensatie


"Test" contains a unique number per test, which tells us whether two headlines belong to the same test. In each test, exactly one headline is marked as "Winner": the headline that was finally chosen for publication. The first test, for example, has the following two headlines, of which the second is the winner and was therefore used for the published article:

In [2]:
df[df["Test"]==1].head()

Unnamed: 0,Test,Headline ID,Winner,Headline,Actief,Lang,Vragen,Interpunctie,Tweeledigheid,Emotie,...,Lidwoorden,Adjectieven,Eigennamen,Betrekking,Voor+Achternaam,Cijfers,Quotes,Wat zit erin voor mij?,Modaliteit,Sensatie
0,1,A,False,Barack en Michelle Obama laten dansmoves zien ...,0,1,0,0,0,0,...,0,0,0,0,1,0,0,,,
1,1,B,True,Barack en Michelle Obama gaan helemaal los tij...,1,0,0,0,0,0,...,0,0,0,0,1,0,0,,,
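
The claim that every test has exactly one winner can be checked directly. The following is a minimal sketch on a small made-up frame: the `toy` data is hypothetical and only reuses the column names shown above. On the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (same column names, made-up rows).
toy = pd.DataFrame({
    "Test": [1, 1, 2, 2, 2],
    "Headline ID": ["A", "B", "A", "B", "C"],
    "Winner": [False, True, True, False, False],
})

# Summing a boolean column counts its True values, i.e. the winners per test.
winners_per_test = toy.groupby("Test")["Winner"].sum()

# Tests that break the "exactly one winner" assumption (empty if all is well).
violations = winners_per_test[winners_per_test != 1]
print(violations.empty)
```

If `violations` is not empty on the real data, its index lists the test numbers worth inspecting by hand.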


"Headline ID" contains a letter that is unique per headline within a given test. The last three columns ("Wat zit erin voor mij?", "Modaliteit" and "Sensatie") weren't used in the analysis, so we'll ignore them. The other columns contain attributes that were manually labeled for each headline; these are the labels we'll try to predict in the following chapters.

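One way to make "ignore the last three columns" concrete is to derive the list of label columns once and reuse it. This is only a sketch: the names `META_COLUMNS`, `UNUSED_COLUMNS` and `label_columns` are assumptions introduced here, and the empty `toy` frame stands in for `df` with a shortened column list.

```python
import pandas as pd

# Hypothetical stand-in for df: only the column names matter here.
toy = pd.DataFrame(columns=[
    "Test", "Headline ID", "Winner", "Headline",
    "Actief", "Lang", "Quotes",
    "Wat zit erin voor mij?", "Modaliteit", "Sensatie",
])

# Columns that identify a headline rather than describe it (assumed grouping).
META_COLUMNS = ["Test", "Headline ID", "Winner", "Headline"]
# Columns the text says were not used in the analysis.
UNUSED_COLUMNS = ["Wat zit erin voor mij?", "Modaliteit", "Sensatie"]

# Everything else is treated as a manually labeled attribute to predict.
excluded = set(META_COLUMNS + UNUSED_COLUMNS)
label_columns = [c for c in toy.columns if c not in excluded]
print(label_columns)
```
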
## Train-test split
We need to split our data into a training and a testing set. Since the labels are independent of the tests, I started by randomly assigning headlines to the train or test set. The problem, however, is that headlines from the same test usually share most of their labels (not all, but most) and much of their wording. A model can then overfit on specific words from training headlines that also occur in the testing set, which inflates the test score. So to create our split, I'll use only one headline per test and then split those headlines into our train and testing set.

In [6]:
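from sklearn import model_selection

# Sample one random headline per test (one row per "Test" value).
df_per_test = df.groupby("Test").apply(lambda x: x.sample(1))
# Note: this split is applied to df (all headlines), which is what the counts
# printed below reflect; pass df_per_test instead to split on one headline per test.
train, test = model_selection.train_test_split(df, random_state=42)
print(f"Selected {len(train)} headlines as training set and {len(test)} headlines as testing set.")
train.head()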

Selected 1692 headlines as training set and 565 headlines as testing set.


Unnamed: 0,Test,Headline ID,Winner,Headline,Actief,Lang,Vragen,Interpunctie,Tweeledigheid,Emotie,...,Lidwoorden,Adjectieven,Eigennamen,Betrekking,Voor+Achternaam,Cijfers,Quotes,Wat zit erin voor mij?,Modaliteit,Sensatie
1572,628,B,False,Deze jobs zijn zware beroepen volgens de vakbo...,1,1,0,1,1,0,...,0,0,0,0,0,0,0,,,
458,176,B,False,Vandereycken onthult ware toedracht over vertr...,1,1,0,0,1,0,...,0,0,0,0,0,0,1,,,
78,31,B,True,Vijfjarige en haar zusjes mooiste meisjes op I...,1,0,0,0,0,0,...,0,0,0,0,0,1,0,,,
32,13,D,False,Arrestant klimt op dak van rijdende politiewag...,1,1,0,0,1,0,...,1,0,0,0,0,0,0,,,0.0
1557,623,B,False,Een maand lang werd ze gemarteld en uitgehonge...,0,1,0,0,0,1,...,1,0,1,0,0,1,0,,,
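
As a sketch of the one-headline-per-test idea described above, the following samples a single headline per test and only then splits. It runs on a small made-up frame (`toy` is hypothetical), and it assumes pandas 1.1 or later for `DataFrameGroupBy.sample`, which keeps the original flat index. The resulting sizes will differ from the counts printed above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: four tests with two headlines each.
toy = pd.DataFrame({
    "Test": [1, 1, 2, 2, 3, 3, 4, 4],
    "Headline": [f"headline {i}" for i in range(8)],
    "Actief": [0, 1, 1, 1, 0, 0, 1, 0],
})

# Keep exactly one randomly chosen headline per test.
one_per_test = toy.groupby("Test").sample(n=1, random_state=42)

# Split the reduced frame; no test number can end up on both sides.
toy_train, toy_test = train_test_split(one_per_test, random_state=42)

shared_tests = set(toy_train["Test"]) & set(toy_test["Test"])
print(len(toy_train), len(toy_test), shared_tests)
```

The trade-off of this approach is that the other headlines of each test are discarded, so the model sees less data in exchange for a cleaner evaluation.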


### Per label
For our current train and test scenarios, we only need the headline and the column we're trying to predict, e.g. for "Actief":


In [16]:
train_x = train[["Headline"]]
train_y = train[["Actief"]]

In [17]:
train_x.head()

Unnamed: 0,Headline
1572,Deze jobs zijn zware beroepen volgens de vakbo...
458,Vandereycken onthult ware toedracht over vertr...
78,Vijfjarige en haar zusjes mooiste meisjes op I...
32,Arrestant klimt op dak van rijdende politiewag...
1557,Een maand lang werd ze gemarteld en uitgehonge...


In [18]:
train_y.head()

Unnamed: 0,Actief
1572,1
458,1
78,1
32,1
1557,0
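
Since the same selection is needed for every label, it could be wrapped in a small helper. This is a sketch under assumptions: `split_features_and_label` is a hypothetical name, and the `toy` frame is made up. Unlike the cells above, it returns one-dimensional Series (single brackets) instead of one-column DataFrames, which is the shape many scikit-learn estimators expect for the target.

```python
import pandas as pd

def split_features_and_label(frame, label):
    """Return the headline text and one label column as 1-D Series."""
    x = frame["Headline"]
    y = frame[label].astype(int)
    return x, y

# Hypothetical training rows with two of the label columns.
toy = pd.DataFrame({
    "Headline": ["first headline", "second headline", "third headline"],
    "Actief": [1, 0, 1],
    "Quotes": [0, 0, 1],
})

toy_x, toy_y = split_features_and_label(toy, "Actief")
print(toy_x.shape, toy_y.shape)
```

Calling the helper with another column name, for example `"Quotes"`, yields the inputs for that label without repeating the selection code.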
