# Challenge - Football match prediction

## Description

For this week's challenge, we will build a model that predicts football match outcomes. The data, which contains international football match results from 1872 to 2021, can be found on Kaggle (***see Kaggle section for link***).

This week's challenge follows the format of a Kaggle in-class competition: you will develop your model on the given ```train``` dataset and submit a file containing predictions for the ```test``` dataset. You will then be ranked on the leaderboard according to the accuracy of your predictions.


## Kaggle

You can access the competition via [this link](https://www.kaggle.com/c/ucl-ai-society-football-match-prediction/data), where you can also find a detailed description of the challenge and the provided data. The key points:

- The data section contains 3 files - ```train.csv```, ```test.csv```, ```sample_submission.csv```. You should develop the model using the **```train.csv```** file.
- As the outcome is binary, we are going to use the logistic regression model.
- After training the model, use it to predict the outcomes for the ```test.csv``` dataset. Combine the predictions with the ids to create a submission file (the format can be seen in the ```sample_submission.csv``` file).
- As the dataset features are in string format, we are going to use a string-to-integer encoder. Since it will only be covered in later tutorials, for now we provide a function that takes the features and returns the transformed output.
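
To give a rough idea of what string-to-integer encoding does, here is a minimal sketch using Scikit-learn's `LabelEncoder`. The team names are made up for illustration and are not taken from the competition data.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical team names, used only for illustration
teams = np.array(["Scotland", "England", "Wales", "England"])

# fit() learns the sorted set of unique labels;
# transform() maps each label to its index in that sorted set
team_encoder = LabelEncoder().fit(teams)
encoded = team_encoder.transform(teams)

print(team_encoder.classes_)  # unique labels in sorted order
print(encoded)                # one integer code per original entry
```

The same team always receives the same integer, which is why the provided function fits a single encoder on home and away teams together.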

![photo](https://user-images.githubusercontent.com/73468790/137705753-f5bd879d-0a60-4bce-bf3c-63a9b38ae5c0.jpg)

## Code

You can use the code structure below as guidance. Some steps (the string-to-integer encoding) are already done for you, so **do not change them**.

In [None]:
#Importing libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

After importing the libraries, specify the path for the downloaded ```train.csv``` dataset and read the file.

In [None]:
#Reading the data
PATH = 
data = 

As usual, feel free to have a look at the data using an appropriate function.

In [1]:
#Look at the top dataset values with the Pandas function (.head())


Last time, we learned the importance of checking for missing values. Check whether there are any missing values and, if so, remove them.
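
As a reminder of the general pattern, here is a small sketch on a toy DataFrame (the values are invented, and the real dataset may or may not contain missing entries):

```python
import pandas as pd

# Toy data with one missing home team (hypothetical values)
toy = pd.DataFrame({
    "home_team": ["Scotland", "England", None],
    "away_team": ["England", "Wales", "Scotland"],
})

# Count missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
cleaned = toy.dropna().reset_index(drop=True)
print(len(toy), "->", len(cleaned))
```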

In [None]:
#Check missing values and remove them (if there are any)


We now come to the data preprocessing step. As you can see, we have already written the ```encoder()``` function, which takes the data file and outputs the ```X``` features. **Do not change this part of the code**.

You will have to write a function that:
- Takes the data file and passes it through the ```encoder()``` function
- Drops the unwanted columns from the data file
- Extracts the outcome from the data file and converts it to a numpy array
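
The dropping and extracting steps can be sketched on a toy DataFrame as follows. The `city` and `outcome` column names are assumptions made for this example; check the actual column names in ```train.csv``` before adapting it.

```python
import pandas as pd

# Toy data; "city" and "outcome" are hypothetical column names
toy = pd.DataFrame({
    "home_team": ["Scotland", "England", "Wales"],
    "away_team": ["England", "Wales", "Scotland"],
    "city": ["Glasgow", "London", "Cardiff"],
    "outcome": [0, 1, 1],
})

# drop() returns a new DataFrame without the listed columns;
# the original DataFrame is left unchanged
features = toy.drop(columns=["city", "outcome"])

# Select the target column and convert it to a NumPy array
y_toy = toy["outcome"].to_numpy()

print(features.columns.tolist())
print(y_toy)
```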

In [None]:
def encoder(data):
    
    #--------------------------------------
    #Extracting teams' names
    H_team = np.array(data['home_team'])
    A_team = np.array(data['away_team'])
    
    Teams = np.hstack((H_team, A_team))
    
    #--------------------------------------
    #Encoding names
    team_encoder = LabelEncoder().fit(Teams)
    
    H_encoded = team_encoder.transform(H_team)
    A_encoded = team_encoder.transform(A_team)
 
    #--------------------------------------
    #Creating X feature encoded values
    H = np.expand_dims(H_encoded, axis = -1)
    A = np.expand_dims(A_encoded, axis = -1)
    
    X = np.concatenate([H, A], axis = 1)
    
    #--------------------------------------
    #Scale values
    X = StandardScaler().fit(X).transform(X)
    return X  
#--------------------------------------------------------------------------------------------------------------------------


def preprocessing(data):
    
    #Removing unwanted columns:
    
    #Extracting encoded features
    X = 
    
    #Extracting 'Outcome' and converting to the numpy array
    y = 
    
    return X, y

X, y = preprocessing(data)

Now that we have our features, we can split the data into train and test sets. Feel free to choose the exact method.
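
One common option is Scikit-learn's `train_test_split`. The sketch below uses toy arrays, and the 80/20 ratio and `random_state` value are arbitrary choices rather than requirements of the challenge.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features (10 samples, 2 columns) and binary labels
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Hold out 20% of the samples; random_state makes the split reproducible,
# stratify keeps the class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_train.shape, X_test.shape)
```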

In [None]:
#Split X and y into train and test datasets


Finally, we can build and train our logistic regression model. Feel free to do it from scratch or using Scikit-learn (you should have the code from the previous notebook).
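
If you go the Scikit-learn route, the basic fit and predict pattern looks roughly like this. The data here is synthetic and the default hyperparameters are used, so treat it as a sketch rather than a tuned solution.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, linearly separable toy data (not the competition data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Fit the model, then predict class labels (0 or 1)
model = LogisticRegression()
model.fit(X_toy, y_toy)
preds = model.predict(X_toy)

# score() returns the mean accuracy on the given data
accuracy = model.score(X_toy, y_toy)
print(accuracy)
```

In practice you would call `fit` on the training split and `score` on the held-out split to estimate how well the model generalises.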

In [None]:
#Build and train logistic regression model


#### Prepare submission file

Now that we have a trained model, we can pass the ```test.csv``` data through it to generate predictions and convert them into a submission file in the required format.

In [None]:
#Specify your test file path
test_path = 
test_data = pd.read_csv(test_path)

#Encoding and preprocessing our test features (do not change this part)
X = encoder(test_data)

#Making prediction
y = 

#Extracting id values from the test dataset and converting them to an array
idx = 

#Converting to DataFrame (do not change this part)
sub_file = pd.DataFrame([idx, y]).T
sub_file.columns = ['id', 'outcome']

#Specify your submission file
saving_path = 

#Saving submission file (do not change this part)
sub_file.to_csv(saving_path, index = False)
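
To see what the resulting file looks like without writing to disk, the layout can be checked on toy values. This sketch uses made-up ids and predictions and builds the DataFrame from a dictionary, which is an alternative to the construction in the cell above, not a replacement for it.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical ids and predictions, for illustration only
idx = np.array([0, 1, 2])
y_pred = np.array([1, 0, 1])

# Two columns named as in the submission cell: 'id' and 'outcome'
sub_file = pd.DataFrame({"id": idx, "outcome": y_pred})

# Write to an in-memory buffer instead of a file to inspect the CSV text
buffer = io.StringIO()
sub_file.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Compare the printed header and rows against ```sample_submission.csv``` before uploading.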