The competition explicitly asks us not to misclassify examples that mention identities. One way to make training take this into account is to change the loss to a weighted cross entropy, which gives more weight to the examples that are likely to be misclassified.

Shown below is the general algorithm used to reweight samples. It was repeated for p1 and p2 of the training data, and different combinations of the weighting parameters were tried.
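
To make the idea of a weighted cross entropy concrete, here is a minimal NumPy sketch. The function name `weighted_bce` and the toy labels, predictions, and weights are assumptions for illustration only; the loss actually used in training is not shown in this notebook.

```python
import numpy as np

def weighted_bce(y_true, y_pred, weights, eps=1e-7):
    """Binary cross entropy where each sample's loss is scaled by its weight.

    Hypothetical helper for illustration; not the training code itself.
    """
    y_true = np.asarray(y_true, dtype=np.float64)
    # Clip predictions away from 0 and 1 so the logarithms stay finite
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1 - eps)
    weights = np.asarray(weights, dtype=np.float64)
    per_sample = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(weights * per_sample))

# Toy batch: the third example is a non-toxic comment predicted as toxic
y_true = [1, 0, 0]
y_pred = [0.9, 0.2, 0.6]

uniform = weighted_bce(y_true, y_pred, [1.0, 1.0, 1.0])
upweighted = weighted_bce(y_true, y_pred, [1.0, 1.0, 3.0])
print(uniform, upweighted)
```

Raising the weight of the third example increases its share of the total loss, so the optimizer is pushed harder to correct it.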

In [34]:
import pandas as pd
import pathlib
import numpy as np

In [35]:
#read data with identities
p1_iden=pd.read_csv(pathlib.Path.cwd().joinpath('data_float_all_cols', 'p1', 'train.tsv'), sep='\t')
#read data without identities
p1=pd.read_csv(pathlib.Path.cwd().joinpath('data_float', 'p1', 'train.tsv'), sep='\t', header=None)
p1.head()

         0         1  2  3
0  6229878       0.0  a  Beyond the non-existent customer service, the ...
1  6116726       0.5  a  Lol. The Idiocracy pResidency continues...
2  6157047       0.0  a  "someone had jammed a British Fantasy Series M...
3   719889       0.0  a  "Social contract," "will of the gods," "divine...
4  1019477  0.166667  a  Well, perhaps the money to fix this should com...


In [44]:
#replace tab characters in the text column (3) with spaces
p1[3]=p1[3].replace(r'\t', ' ', regex=True)

In [47]:
identity_cols = ['male', 'female', 'homosexual_gay_or_lesbian', 'christian', 'jewish',
    'muslim', 'black', 'white', 'psychiatric_or_mental_illness']

In [50]:
def make_weights(df_iden):
    """Return one weight per row of df_iden, normalized to a mean of 1."""
    df = df_iden.copy(deep=True)
    # Binarize the identity and target columns at a 0.5 threshold
    for column in identity_cols + ['target']:
        df[column] = df[column] >= 0.5
    sample_weights = np.ones(df.shape[0], dtype=np.float32)
    # Increase the sample weight by the number of true identity columns
    sample_weights += df[identity_cols].sum(axis=1)
    # If the target is true, increase the weight by the number of false identity columns
    sample_weights += df['target'] * (~df[identity_cols]).sum(axis=1)
    # If the target is false, increase the weight by 5 times the number of true identity columns
    sample_weights += (~df['target']) * df[identity_cols].sum(axis=1) * 5
    # Normalize so that the mean weight is 1
    sample_weights /= sample_weights.mean()
    return sample_weights
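
As a sanity check on the rules above, the following sketch applies the same three increments to a four-row toy frame. The frame, the two-column identity subset `toy_identity_cols`, and the helper `toy_weights` are made up for illustration; the real data uses all nine identity columns.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column subset, used only to keep the arithmetic readable
toy_identity_cols = ['male', 'female']

toy = pd.DataFrame({
    'target': [0.8, 0.1, 0.0, 0.9],
    'male':   [0.0, 1.0, 0.0, 1.0],
    'female': [0.0, 0.0, 0.0, 0.0],
})

def toy_weights(df, cols):
    """Apply the same weighting rules; return raw and mean-normalized weights."""
    flags = df[cols] >= 0.5          # binarized identity columns
    toxic = df['target'] >= 0.5      # binarized target
    n_true = flags.sum(axis=1)       # identities mentioned in each row
    n_false = (~flags).sum(axis=1)   # identities not mentioned in each row
    raw = 1 + n_true + toxic * n_false + (~toxic) * n_true * 5
    raw = raw.to_numpy(dtype=np.float32)
    return raw, raw / raw.mean()

raw, normalized = toy_weights(toy, toy_identity_cols)
print(raw)         # raw weights per row: 3, 7, 1, 3
print(normalized)  # same weights rescaled to a mean of 1
```

The second row, a non-toxic comment that mentions an identity, receives the largest weight (1 + 1 + 5 = 7), which is the case the competition penalizes most when it is misclassified. The third row, non-toxic with no identity, keeps the base weight of 1.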

In [51]:
#weights are assigned by position, so p1 and p1_iden must contain the same rows in the same order
p1['weights']=make_weights(p1_iden)

In [52]:
p1.to_csv(pathlib.Path.cwd().joinpath('data_float_weighted', 'p1', 'train.tsv'), sep='\t', header=False, index=False)