"""
Model description:

The model relies on the sign of "proj_score1" - "proj_score2", which is a very strong predictor.

If the sign is strictly positive, Team 1 wins; if it is strictly negative, Team 1 loses.
There are only 9 exceptions (out of 7395 observations) in the training set:
    - 2 in the "Mexican Primera Division Torneo Clausura" league
    - 7 in the "APD" league
Since those leagues are not in the test set, and the exceptions are rare (0.12% of cases),
we treat them as outliers.
In the test set, we therefore assign a probability of 1 when "proj_score1" - "proj_score2" is strictly positive
and a probability of 0 when it is strictly negative.

Only 29 observations (those where "proj_score1" equals "proj_score2") remain in the test set.
Since only 48 observations in the training set have equal "proj_score1" and "proj_score2",
we cannot fit a reliable model on them (too few observations).
Since "Outcome" in the training set is imbalanced (mean of 0.68),
we use this mean for the 29 remaining observations (instead of an uninformative 0.5 probability).
This small change reduces the log-loss from 0.00576 to 0.00445 on the public leaderboard,
and from 0.00502 to 0.00486 on the private one.
"""

import numpy as np
import pandas as pd

dtrain = pd.read_csv('train.csv')
# Mean of "Outcome" over the whole training set; used as the prediction for tied rows.
dtrain_expectation = dtrain['Outcome'].mean()
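
# A minimal sketch of how the exception count quoted in the docstring could be
# re-checked on the training set. Assumptions: "Outcome" is 1 when Team 1 wins
# and 0 otherwise, and there are no missing values. The helper name
# `count_sign_exceptions` is hypothetical and not used by the submission below.
def count_sign_exceptions(score1, score2, outcome):
    """Count rows where the sign of score1 - score2 contradicts the outcome; ties are skipped."""
    exceptions = 0
    for s1, s2, y in zip(score1, score2, outcome):
        if s1 == s2:
            continue  # ties carry no sign information
        predicted = 1 if s1 > s2 else 0
        if predicted != y:
            exceptions += 1
    return exceptions

# Possible usage once dtrain is loaded:
# count_sign_exceptions(dtrain['proj_score1'], dtrain['proj_score2'], dtrain['Outcome'])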

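dtest = pd.read_csv('test.csv')
# Sign of the projected score difference: 1 (Team 1 ahead), -1 (Team 1 behind), 0 (tie).
sign_diff_projscore_test = np.sign(dtest['proj_score1'] - dtest['proj_score2'])

# Predicted probability for each test row.
outcome_test = np.zeros(len(sign_diff_projscore_test))
outcome_test[sign_diff_projscore_test == 1] = 1
outcome_test[sign_diff_projscore_test == -1] = 0  # already 0; kept explicit for readability
outcome_test[sign_diff_projscore_test == 0] = dtrain_expectation  # ties get the training mean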

# Reuse the sample submission as a template (assumes its rows are in the same order as test.csv).
submission = pd.read_csv('submission.csv')
submission['Outcome'] = outcome_test
submission.to_csv('LHS_WE_2_Soccer_Fever_FinalSubmission.csv', index=False)
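
# A minimal sketch (not part of the submission pipeline) of why predicting the
# training mean on tied rows can beat 0.5. Assumption: the tied rows have
# roughly the same rate of Outcome == 1 as the whole training set (about 0.68,
# the value quoted in the docstring). The helper name `expected_log_loss` is
# hypothetical.
import math

def expected_log_loss(p, q):
    """Expected per-row log-loss of a constant prediction p when Outcome == 1 occurs at rate q."""
    return -(q * math.log(p) + (1.0 - q) * math.log(1.0 - p))

tie_rate = 0.68  # assumed rate of Outcome == 1 on tied rows
loss_half = expected_log_loss(0.5, tie_rate)       # equals ln(2), about 0.693
loss_mean = expected_log_loss(tie_rate, tie_rate)  # about 0.627, the minimum over p
print(loss_half, loss_mean)
# Under this assumption the gain applies only to the tied rows, which is
# consistent with the small leaderboard improvement described in the docstring.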
