## Predicting the 2018 FIFA World Cup Winner
In this Jupyter Notebook, we are going to develop a Machine Learning model to try to predict the outcomes of all games in World Cup 2018, which also means predicting the winner of the championship. 

We will first do some exploratory analysis on two datasets obtained from Kaggle. Then we will do some feature engineering to select the most relevant features, manipulate the data, choose an ML model and finally deploy it on the dataset. Let's go!

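### 1) Loading the data

We consider two datasets: the World Cup 2018 team data and the historical match results. From the results, we will only use matches played from 1930 onwards, the year the World Cup started.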


In [2]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Read data files
wc = pd.read_csv('data/World Cup 2018 Dataset.csv')
results = pd.read_csv('data/results.csv')

In [4]:
len(wc),len(results)

(33, 39082)

In [5]:
wc.head()

Unnamed: 0,Team,Group,Previous appearances,Previous titles,Previous  finals,Previous  semifinals,Current FIFA rank,First match against,Match index,history with first opponent  W-L,history with  first opponent  goals,Second match  against,Match index.1,history with  second opponent  W-L,history with  second opponent  goals,Third match  against,Match index.2,history with  third opponent  W-L,history with  third opponent  goals,Unnamed: 19
0,Russia,A,10.0,0.0,0.0,1.0,65.0,Saudi Arabia,1.0,-1.0,-2.0,Egypt,17.0,,,Uruguay,33.0,0.0,0.0,
1,Saudi Arabia,A,4.0,0.0,0.0,0.0,63.0,Russia,1.0,1.0,2.0,Uruguay,18.0,1.0,1.0,Egypt,34.0,-5.0,-5.0,
2,Egypt,A,2.0,0.0,0.0,0.0,31.0,Uruguay,2.0,-1.0,-2.0,Russia,17.0,,,Saudi Arabia,34.0,5.0,5.0,
3,Uruguay,A,12.0,2.0,2.0,5.0,21.0,Egypt,2.0,1.0,2.0,Saudi Arabia,18.0,-1.0,-1.0,Russia,33.0,0.0,0.0,
4,Porugal,B,6.0,0.0,0.0,2.0,3.0,Spain,3.0,-12.0,-31.0,Morocco,19.0,-1.0,-2.0,Iran,35.0,2.0,5.0,


In [6]:
results.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False


### 2) Exploring the data


In [7]:
# Adding new column for winner of each match
winner = []
for i in range(len(results['home_team'])):
    if results['home_score'][i] > results['away_score'][i]:
        winner.append(results['home_team'][i])
    elif results['home_score'][i] < results['away_score'][i]:
        winner.append(results['away_team'][i])
    else:
        winner.append('Tie')
results['winning_team'] = winner

# Adding new column for goal difference in matches
results['goal_difference'] = np.absolute(results['home_score'] - results['away_score'])

results.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,winning_team,goal_difference
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False,Tie,0
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False,England,2
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False,Scotland,1
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False,Tie,0
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False,Scotland,3
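
The same two columns can also be built without an explicit Python loop. The following is a minimal sketch on a toy DataFrame (the name `toy` is an assumption for illustration; its values are copied from the first rows shown above), using `np.select` to pick the winner row by row:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `results` DataFrame (hypothetical name, values from the rows above)
toy = pd.DataFrame({
    "home_team": ["Scotland", "England", "Scotland"],
    "away_team": ["England", "Scotland", "England"],
    "home_score": [0, 4, 2],
    "away_score": [0, 2, 1],
})

# np.select returns, for each row, the choice paired with the first true condition
conditions = [
    toy["home_score"] > toy["away_score"],
    toy["home_score"] < toy["away_score"],
]
choices = [toy["home_team"], toy["away_team"]]
toy["winning_team"] = np.select(conditions, choices, default="Tie")

# Absolute goal difference, computed column-wise
toy["goal_difference"] = (toy["home_score"] - toy["away_score"]).abs()
print(toy)
```

On a frame with tens of thousands of rows, the column-wise version is usually much faster than indexing each row inside a loop.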


Win rates for every country participating in the World Cup should be useful information for our analysis. Maybe we can use them to predict the most likely outcome of each match in the tournament?

However, many features in this dataset won't help. The World Cup is always held in one place (this year, Russia), so the location of previous matches won't add much to our analysis. The 'tournament' column will also be of little help, since all matches to which we are going to apply the model will be 'FIFA World Cup'. With that, we can start defining our model's scope and limitations more clearly.
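
As an illustration of the win rates mentioned above, here is a minimal sketch on a toy DataFrame. The frame `toy` and its rows are assumptions for illustration only; the column names follow the `results` DataFrame with its `winning_team` column.

```python
import pandas as pd

# Toy stand-in for `results` with the `winning_team` column already added (illustrative rows)
toy = pd.DataFrame({
    "home_team": ["Scotland", "England", "Scotland", "England"],
    "away_team": ["England", "Scotland", "England", "Scotland"],
    "winning_team": ["Tie", "England", "Scotland", "Tie"],
})

# Matches played per team, counting both home and away appearances
played = pd.concat([toy["home_team"], toy["away_team"]]).value_counts()

# Wins per team; ties are excluded because "Tie" is not a team name
wins = toy.loc[toy["winning_team"] != "Tie", "winning_team"].value_counts()

# Teams without any win get 0 instead of NaN before dividing
win_rate = (wins.reindex(played.index, fill_value=0) / played).sort_values(ascending=False)
print(win_rate)
```

Here each team plays four matches and wins one, so both win rates come out at 0.25.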

### 3) Defining the project

**Objective**: to create a Machine Learning model capable of predicting the outcomes of football games in the 2018 FIFA World Cup.

**Features**: Results of historical matches since the beginning of the championship (1930) for all participating teams.


#### 3.1) Narrowing the 2018 World Cup participants

In [21]:
# List of the 16 teams that reached the knockout stage
last_16=["Uruguay",
"Russia",
"Spain", 
"Portugal",
"France",
"Denmark" ,
"Croatia" ,
"Argentina", 
"Brazil", 
"Switzerland",
"Sweden" ,
"Mexico", 
"Belgium" ,
"England", 
"Colombia", 
"Japan"]



# Filter the 'results' dataframe to keep only matches involving this year's last-16 teams
df_teams_home = results[results['home_team'].isin(last_16)]
df_teams_away = results[results['away_team'].isin(last_16)]
df_teams = pd.concat((df_teams_home, df_teams_away))
# Note: drop_duplicates() returns a new DataFrame; the result is not assigned here,
# so matches between two listed teams stay duplicated in df_teams
df_teams.drop_duplicates()
df_teams.count()


date               12068
home_team          12068
away_team          12068
home_score         12068
away_score         12068
tournament         12068
city               12068
country            12068
neutral            12068
winning_team       12068
goal_difference    12068
dtype: int64

In [9]:
df_teams.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,winning_team,goal_difference
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False,England,2
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False,Tie,0
6,1877-03-03,England,Scotland,1,3,Friendly,London,England,False,Scotland,2
10,1879-01-18,England,Wales,2,1,Friendly,London,England,False,England,1
11,1879-04-05,England,Scotland,5,4,Friendly,London,England,False,England,1


#### 3.2) Manipulating the data

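Keep only the matches played from 1930 onwards, the year of the first World Cup. Note that this filter keeps every match from those years (friendlies included), not only World Cup games.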


In [22]:
# Loop for creating a new column 'year'
year = []
for row in df_teams['date']:
    year.append(int(row[:4]))
df_teams['match_year'] = year


df_fifa_world_cup = df_teams[df_teams.match_year >= 1930]
df_fifa_world_cup.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,winning_team,goal_difference,match_year
1230,1930-01-01,Spain,Czechoslovakia,1,0,Friendly,Barcelona,Spain,False,Spain,1,1930
1231,1930-01-12,Portugal,Czechoslovakia,1,0,Friendly,Lisbon,Portugal,False,Portugal,1,1930
1237,1930-02-23,Portugal,France,2,0,Friendly,Porto,Portugal,False,Portugal,2,1930
1240,1930-03-23,France,Switzerland,3,3,Friendly,Colombes,France,False,Tie,0,1930
1241,1930-04-05,England,Scotland,5,2,British Championship,London,England,False,England,3,1930
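
The year can also be read from parsed dates, and the subset can be narrowed further to World Cup games. The sketch below uses a toy frame with placeholder team names (all rows are assumptions for illustration); the `'FIFA World Cup'` label is the tournament value mentioned earlier in this notebook.

```python
import pandas as pd

# Toy stand-in for df_teams; rows and team names are placeholders
toy = pd.DataFrame({
    "date": ["1929-06-01", "1930-01-01", "1930-07-13", "1934-05-27"],
    "home_team": ["Team A", "Team A", "Team B", "Team C"],
    "away_team": ["Team B", "Team C", "Team D", "Team D"],
    "tournament": ["Friendly", "Friendly", "FIFA World Cup", "FIFA World Cup"],
})

# Parse the date strings once and read the year from the datetime accessor
toy["match_year"] = pd.to_datetime(toy["date"]).dt.year

# All matches from 1930 onwards (the same condition as the cell above)
since_1930 = toy[toy["match_year"] >= 1930]

# Stricter alternative: only matches labelled as World Cup games
world_cup_only = since_1930[since_1930["tournament"] == "FIFA World Cup"]
print(len(since_1930), len(world_cup_only))
```

Whether the stricter filter helps is a modelling choice: it matches the target tournament more closely but leaves far fewer training rows.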


In [11]:
df_fifa_world_cup = df_fifa_world_cup.drop(['date', 'tournament', 'city', 'country', 'goal_difference', 'match_year',"home_score",
                             "away_score"], axis=1)
df_fifa_world_cup.head(5)

Unnamed: 0,home_team,away_team,neutral,winning_team
1230,Spain,Czechoslovakia,False,Spain
1231,Portugal,Czechoslovakia,False,Portugal
1237,Portugal,France,False,Portugal
1240,France,Switzerland,False,Tie
1241,England,Scotland,False,England


#### 3.3) Building the ML model

First, let's modify the "Y" (prediction label) in order to simplify our model's processing. The winning_team column will show "2" if the home team has won, "1" if it was a tie, and "0" if the away team has won.

In [23]:
df_fifa_world_cup = df_fifa_world_cup.reset_index(drop=True)
df_fifa_world_cup.loc[df_fifa_world_cup.winning_team == df_fifa_world_cup.home_team, 'winning_team']= 2
df_fifa_world_cup.loc[df_fifa_world_cup.winning_team == 'Tie', 'winning_team']= 1
df_fifa_world_cup.loc[df_fifa_world_cup.winning_team == df_fifa_world_cup.away_team, 'winning_team']= 0

df_fifa_world_cup.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,winning_team,goal_difference,match_year
0,1930-01-01,Spain,Czechoslovakia,1,0,Friendly,Barcelona,Spain,False,2,1,1930
1,1930-01-12,Portugal,Czechoslovakia,1,0,Friendly,Lisbon,Portugal,False,2,1,1930
2,1930-02-23,Portugal,France,2,0,Friendly,Porto,Portugal,False,2,2,1930
3,1930-03-23,France,Switzerland,3,3,Friendly,Colombes,France,False,1,0,1930
4,1930-04-05,England,Scotland,5,2,British Championship,London,England,False,2,3,1930
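
An alternative sketch derives the same 2/1/0 encoding directly from the scores, assuming the `home_score` and `away_score` columns are still available. It produces a purely integer column instead of overwriting team names in `winning_team`. The frame `toy` and the column name `label` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy scores: a home win, a tie and an away win (illustrative values)
toy = pd.DataFrame({"home_score": [1, 3, 2], "away_score": [0, 3, 5]})

# sign(home - away) is +1, 0 or -1; adding 1 maps it to 2 (home win), 1 (tie), 0 (away win)
toy["label"] = np.sign(toy["home_score"] - toy["away_score"]) + 1
print(toy)
```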


We will now create dummy variables for home_team and away_team. Both are categorical variables, and scikit-learn models require numeric input, so we could not fit a model on the dataset otherwise.
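
To make the effect of the dummy encoding concrete, here is a small sketch on a two-row toy frame (the frame is an assumption for illustration). Each distinct team becomes one indicator column per role, while other columns are kept unchanged.

```python
import pandas as pd

# Toy frame with the same column names as the notebook's data
toy = pd.DataFrame({
    "home_team": ["Spain", "France"],
    "away_team": ["France", "Spain"],
    "winning_team": [2, 1],
})

# One indicator column per (role, team) pair; 'winning_team' is left as-is
dummies = pd.get_dummies(toy, prefix=["home_team", "away_team"],
                         columns=["home_team", "away_team"])
print(dummies.columns.tolist())
```

With hundreds of distinct opponents in the historical data, this encoding explains why the feature matrix ends up with several hundred columns.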

In [31]:
from sklearn.model_selection import train_test_split

# Get dummy variables (df_fifa_world_cup should hold only home_team, away_team, neutral and winning_team here)
final = pd.get_dummies(df_fifa_world_cup, prefix=['home_team', 'away_team'], columns=['home_team', 'away_team'])

In [32]:
# Separate X and y sets
X = final.drop(['winning_team'], axis=1)
y = final["winning_team"]
y = y.astype('int')

# Separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [33]:
X_train.shape,y_train.shape

((8874, 309), (8874,))

#### 3.4) Trying different models to check for better accuracy

In [34]:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

In [35]:
classifiers = [
    KNeighborsClassifier(3),
    SVC(probability=True,degree=2),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    GaussianNB(),
    LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis(),
    LogisticRegression(),
    MLPClassifier(hidden_layer_sizes=(20,10))]


log_cols = ["Classifier", "Accuracy"]
acc_dict = {}

# Fit each classifier on the training set and score it on the held-out test set
for clf in classifiers:
    name = clf.__class__.__name__
    clf.fit(X_train, y_train)
    test_predictions = clf.predict(X_test)
    # Each classifier is scored on a single split, so the accuracy is stored as-is
    acc_dict[name] = accuracy_score(y_test, test_predictions)

# Build the results DataFrame in one step (DataFrame.append is not available in pandas 2.0+)
log = pd.DataFrame(list(acc_dict.items()), columns=log_cols)

plt.xlabel('Accuracy')
plt.title('Classifier Accuracy')

sns.set_color_codes("muted")
sns.barplot(x='Accuracy', y='Classifier', data=log, color="b")



[Figure: bar plot of accuracy per classifier]
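
`StratifiedShuffleSplit` is imported above, but the comparison relies on a single train/test split. A more stable estimate averages the accuracy over several stratified splits. The sketch below shows the idea on synthetic data; `X_demo`, `y_demo` and the choice of 10 splits are assumptions for illustration, not the notebook's data or settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic 3-class data standing in for X and y (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=5,
                                     n_classes=3, random_state=0)

n_splits = 10
sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=42)

scores = []
for train_idx, test_idx in sss.split(X_demo, y_demo):
    # A fresh model per split, so no information leaks between splits
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    scores.append(accuracy_score(y_demo[test_idx], clf.predict(X_demo[test_idx])))

# Average over all splits
mean_acc = np.mean(scores)
print("Mean accuracy over %d splits: %.3f" % (n_splits, mean_acc))
```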

In [36]:
# Logistic regression performs well, so we choose it as our final model.
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
score = logreg.score(X_train, y_train)
score2 = logreg.score(X_test, y_test)

print("Training set accuracy: ", '%.3f'%(score))
print("Test set accuracy: ", '%.3f'%(score2))

Training set accuracy:  0.577
Test set accuracy:  0.558


In [37]:
sns.countplot(x='winning_team', data=df_fifa_world_cup)

[Figure: count plot of the winning_team labels]
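
The class counts also give a reference point for the accuracy figures: a model that always predicts the most frequent outcome already reaches the share of that class. A minimal sketch with scikit-learn's `DummyClassifier` follows; the label array `y_demo` is hypothetical and not taken from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 2 = home win, 1 = tie, 0 = away win
y_demo = np.array([2, 2, 2, 2, 2, 1, 1, 0, 0, 0])
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this strategy

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)
print("Baseline accuracy: %.3f" % baseline_acc)
```

Running the same baseline on `X_test` and `y_test` would show how much of the test accuracy comes from the model rather than from class imbalance.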

In [38]:
# Loading new datasets
ranking = pd.read_csv('data/fifa_rankings.csv') 
fixtures = pd.read_csv('data/fixtures.csv')


pred_set = []

FileNotFoundError: File b'data/fifa_rankings.csv' does not exist

In [None]:
# Create new columns with ranking position of each team
fixtures.insert(1, 'first_position', fixtures['Home Team'].map(ranking.set_index('Team')['Position']))
fixtures.insert(2, 'second_position', fixtures['Away Team'].map(ranking.set_index('Team')['Position']))

# We only need the group stage games, so we have to slice the dataset
fixtures = fixtures.iloc[:48, :]
fixtures.tail()

In [None]:
# Loop to add teams to new prediction dataset based on the ranking position of each team
for index, row in fixtures.iterrows():
    if row['first_position'] < row['second_position']:
        pred_set.append({'home_team': row['Home Team'], 'away_team': row['Away Team'], 'winning_team': None})
    else:
        pred_set.append({'home_team': row['Away Team'], 'away_team': row['Home Team'], 'winning_team': None})
        
pred_set = pd.DataFrame(pred_set)
backup_pred_set = pred_set

pred_set.head()

Great! We now have a clean dataset with all group stage games of the 2018 FIFA World Cup. We just need to create some dummy variables and then apply an ML model to this DataFrame. We can start with our Logistic Regression model.

In [None]:
# Get dummy variables and drop winning_team column
pred_set = pd.get_dummies(pred_set, prefix=['home_team', 'away_team'], columns=['home_team', 'away_team'])

# Add missing columns compared to the model's training dataset
missing_cols = set(final.columns) - set(pred_set.columns)
for c in missing_cols:
    pred_set[c] = 0
pred_set = pred_set[final.columns]

# Remove winning team column
pred_set = pred_set.drop(['winning_team'], axis=1)

pred_set.head()
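
The column alignment step can also be written with `DataFrame.reindex`, which adds missing columns, drops unknown ones and fixes the order in one call. This is a sketch under assumptions: `train_columns` is a hypothetical, shortened stand-in for the training feature columns.

```python
import pandas as pd

# Hypothetical, shortened list of training feature columns
train_columns = ["neutral", "home_team_Brazil", "home_team_Spain",
                 "away_team_Brazil", "away_team_Spain"]

# One fixture to predict, encoded the same way as the training data
pred = pd.DataFrame({"home_team": ["Brazil"], "away_team": ["Spain"]})
pred = pd.get_dummies(pred, prefix=["home_team", "away_team"],
                      columns=["home_team", "away_team"])

# Absent columns are created and filled with 0; the order follows train_columns
aligned = pred.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

Keeping the prediction columns in exactly the training order matters because scikit-learn estimators expect features in the same order as during fitting.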

### 5) Deploying the model

At last, it's time to deploy our model and start predicting the matches. Let's go!

In [None]:
predictions = logreg.predict(pred_set)
# Compute the class probabilities once; columns follow logreg.classes_ (0, 1, 2)
probabilities = logreg.predict_proba(pred_set)
for i in range(fixtures.shape[0]):
    # Access teams by column name so the output does not depend on column order
    home = backup_pred_set.loc[i, 'home_team']
    away = backup_pred_set.loc[i, 'away_team']
    print(home + " and " + away)
    if predictions[i] == 2:
        print("Winner: " + home)
    elif predictions[i] == 1:
        print("Tie")
    elif predictions[i] == 0:
        print("Winner: " + away)
    print('Probability of ' + home + ' winning: ', '%.3f'%(probabilities[i][2]))
    print('Probability of Tie: ', '%.3f'%(probabilities[i][1]))
    print('Probability of ' + away + ' winning: ', '%.3f'%(probabilities[i][0]))
    print("")
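
The probability columns returned by `predict_proba` follow the order of the fitted model's `classes_` attribute, which is why index 2 corresponds to a home win here. The sketch below makes that mapping explicit on a tiny synthetic problem; `X_demo`, `y_demo` and `outcome_names` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic 3-class problem (assumption for illustration)
X_demo = np.array([[0.0], [0.2], [1.0], [1.2], [2.0], [2.2]])
y_demo = np.array([0, 0, 1, 1, 2, 2])
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Label encoding used in this notebook
outcome_names = {0: "away win", 1: "tie", 2: "home win"}

# Pair each probability with its class instead of relying on a fixed index
proba = model.predict_proba(np.array([[1.1]]))[0]
for cls, p in zip(model.classes_, proba):
    print("%s: %.3f" % (outcome_names[cls], p))
```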


We'll have to separate the nations into 'home' teams and 'away' teams again. 

In [39]:
# List of tuples before we arrange the teams in home and away
group_16 = [
            ("Uruguay", "Portugal"),
            ("France", "Argentina"),
            ("Brazil", "Mexico"),
            ("Belgium", "Japan"),
            ("Spain", "Russia"),
            ("Croatia", "Denmark"),
            ("Sweden", "Switzerland"),
            ("Colombia","England")]

In [44]:
def predict(matches, ranking, final, logreg):

    # Initialization of auxiliary list for data cleaning
    positions = []

    # Loop to retrieve each team's position according to FIFA ranking
    for match in matches:
        positions.append(ranking.loc[ranking['Team'] == match[0],'Position'].iloc[0])
        positions.append(ranking.loc[ranking['Team'] == match[1],'Position'].iloc[0])
    
    # Creating the DataFrame for prediction
    pred_set = []

    # Initializing iterators for while loop
    i = 0
    j = 0

    # 'i' will be the iterator for the 'positions' list, and 'j' for the list of matches (list of tuples)
    while i < len(positions):
        dict1 = {}

        # If the first team has the better position, it is the 'home' team, and vice versa
        if positions[i] < positions[i + 1]:
            dict1.update({'home_team': matches[j][0], 'away_team': matches[j][1]})
        else:
            dict1.update({'home_team': matches[j][1], 'away_team': matches[j][0]})

        # Append updated dictionary to the list, that will later be converted into a DataFrame
        pred_set.append(dict1)
        i += 2
        j += 1

    # Convert list into DataFrame
    pred_set = pd.DataFrame(pred_set)
    backup_pred_set = pred_set

    # Get dummy variables and drop winning_team column
    pred_set = pd.get_dummies(pred_set, prefix=['home_team', 'away_team'], columns=['home_team', 'away_team'])

    # Add missing columns compared to the model's training dataset
    missing_cols2 = set(final.columns) - set(pred_set.columns)
    for c in missing_cols2:
        pred_set[c] = 0
    pred_set = pred_set[final.columns]

    # Remove winning team column
    pred_set = pred_set.drop(['winning_team'], axis=1)

    # Predict! Teams are accessed by column name so the output does not depend on column order
    predictions = logreg.predict(pred_set)
    for i in range(len(pred_set)):
        home = backup_pred_set.loc[i, 'home_team']
        away = backup_pred_set.loc[i, 'away_team']
        print(home + " and " + away)
        if predictions[i] == 2:
            print("Winner: " + home)
        elif predictions[i] == 1:
            print("Tie")
        elif predictions[i] == 0:
            print("Winner: " + away)
        print(" ")

In [45]:
predict(group_16, ranking, final, logreg)

Portugal and Uruguay
Winner: Portugal
 
Argentina and France
Winner: Argentina
 
Brazil and Mexico
Winner: Brazil
 
Belgium and Japan
Winner: Belgium
 
Spain and Russia
Winner: Spain
 
Denmark and Croatia
Winner: Croatia
 
Switzerland and Sweden
Winner: Switzerland
 
England and Colombia
Winner: England
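
The fixture lists for each round are typed by hand in this notebook. A small helper could instead build the next round from a list of predicted winners. The function `next_round` below is hypothetical (not part of the notebook) and assumes the winners are listed in bracket order, so that consecutive winners meet.

```python
def next_round(winners):
    """Pair consecutive winners: (1st, 2nd), (3rd, 4th), ..."""
    if len(winners) % 2 != 0:
        raise ValueError("expected an even number of winners")
    return list(zip(winners[0::2], winners[1::2]))

# Usage with placeholder team names
demo = next_round(["A", "B", "C", "D"])
print(demo)
```

The resulting list of tuples has the same shape as the match lists passed to `predict`.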
 


In [46]:
# List of matches
quarters = [('Brazil', 'Belgium'),
            ('Portugal', 'Argentina'),
            ('Croatia', 'Spain'),
            ('Sweden', 'England')]

In [47]:
predict(quarters, ranking, final, logreg)

Brazil and Belgium
Winner: Brazil
 
Portugal and Argentina
Winner: Portugal
 
Spain and Croatia
Winner: Spain
 
England and Sweden
Winner: England
 


**THE SEMI-FINALISTS**

In [48]:
# List of matches
semi = [('Spain', 'England'),
        ('Brazil', 'Portugal')]

In [49]:
predict(semi, ranking, final, logreg)

Spain and England
Winner: Spain
 
Brazil and Portugal
Winner: Brazil
 


**THE FINAL**

In [50]:
# The big game
finals = [('Spain', 'Brazil')]

In [51]:
predict(finals, ranking, final, logreg)

Brazil and Spain
Winner: Brazil
 


According to the model, the winner is Brazil.