# Project Report Notebook - Football Transfer Analytics

[Executive Summary/Abstract](#Executive_Summary/Abstract)
<br>[Table of Contents Introduction](#Table_of_Contents_Introduction)
<br>[Analysis](#analysis)
<br>[Conclusion](#conclusion)
<br>[Appendix: Project Structure](#Project_Structure)

## Executive Summary/Abstract <a id='Executive_Summary/Abstract'></a>

<br>The question: What recipe should a football player follow to increase his market value the most?

<br>Purpose: We want to identify the most important factors that increase a player's value when he transfers from one club to another. To that end, we built a network from 10 years of historical transfer data, computed several centrality measures on it, and used them as input features for machine learning models.

- Network: Transfers of professional and semi-professional football players.
- Type: Directed and weighted network
- Source: [Transfermarkt](https://www.transfermarkt.co.uk/).
- Modification: We scraped the clubs with a total market value above 200 million euros. For these clubs, we downloaded the HTML pages of their football players, which include nationality, birthdate, transfer history, transfer fees, etc.
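
A directed, weighted transfer network of the kind described in the list above can be sketched with `networkx`. The snippet is only an illustration on made-up data: the club names, the fees, and the edge attribute names `fee` and `transfers` are assumptions, not the notebook's actual construction code.

```python
import networkx as nx

# Hypothetical toy transfers: (from_club, to_club, fee in euros)
transfers = [
    ("A", "B", 5_000_000),
    ("B", "C", 12_000_000),
    ("A", "C", 3_000_000),
    ("C", "A", 1_000_000),
]

def build_transfer_graph(rows):
    """Build a directed graph; repeated transfers between the same pair are accumulated."""
    graph = nx.DiGraph()
    for src, dst, fee in rows:
        if graph.has_edge(src, dst):
            graph[src][dst]["fee"] += fee
            graph[src][dst]["transfers"] += 1
        else:
            graph.add_edge(src, dst, fee=fee, transfers=1)
    return graph

toy_graph = build_transfer_graph(transfers)

# Unweighted in-degree: number of distinct clubs a club has bought from
in_degree = dict(toy_graph.in_degree())
# Weighted in-degree: total fees flowing into a club
fee_in_degree = dict(toy_graph.in_degree(weight="fee"))
# Closeness centrality on the directed graph
closeness = nx.closeness_centrality(toy_graph)

print(in_degree)
print(fee_in_degree)
```

Each centrality is a per-club number, so it can be attached to a transfer row twice: once for the selling club (`from_*`) and once for the buying club (`to_*`).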

## Table of Contents Introduction <a id='Table_of_Contents_Introduction'></a>

In [7]:
from urllib.request import *
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
import scipy.stats as stats
import pylab as pl
import requests
from geopy.geocoders import Nominatim
import json
import folium
import os
import time
from tempfile import TemporaryFile
from datetime import datetime
import matplotlib.pyplot as plt
from collections import Counter
from scipy import sparse, stats, spatial
import scipy.sparse.linalg
import networkx as nx
from pylab import rcParams
%matplotlib inline

# ML related libraries
from sklearn.linear_model import Lasso
import copy
import joblib  # sklearn.externals.joblib was removed in newer scikit-learn releases
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
import sklearn.svm as svm
from sklearn.ensemble import RandomForestRegressor
from sklearn.decomposition import PCA

from models import *
from helpers import *

import warnings
warnings.filterwarnings('ignore')

### Load data

In [2]:
# Let pandas open and close the file itself
df = pd.read_csv('transfers.csv')

G = nx.read_graphml("transfers.graphml")

In [3]:
print(nx.info(G))

Name: 
Type: DiGraph
Number of nodes: 401
Number of edges: 9621
Average in degree:  23.9925
Average out degree:  23.9925


We can see that some of the 418 clubs are not present in the graph. A club is missing when none of its players transferred to another club within the 5-year window considered and no player from another club transferred to it.
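
One way to list the clubs that never made it into the graph is to compare the scraped club ids with the graph's nodes. The sketch below uses a hypothetical set `scraped_club_ids` and a tiny hand-built graph. It assumes node ids are strings, which is the default for `nx.read_graphml`. It also shows that the "average in degree" reported above is simply edges divided by nodes (9621 / 401 ≈ 23.9925).

```python
import networkx as nx

# Hypothetical toy data: five scraped club ids, only three of which appear in a transfer
scraped_club_ids = {"13", "252", "281", "631", "1075"}

toy_graph = nx.DiGraph()
toy_graph.add_edge("631", "13")
toy_graph.add_edge("13", "631")
toy_graph.add_edge("281", "631")

# Clubs with no incoming and no outgoing transfer never become nodes
missing_clubs = sorted(scraped_club_ids - set(toy_graph.nodes()))

# Every edge contributes exactly one in-degree, so the average is edges / nodes
avg_in_degree = toy_graph.number_of_edges() / toy_graph.number_of_nodes()

print(missing_clubs)
print(avg_in_degree)
```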

In [4]:
df

| Unnamed: 0 | from_club_id | to_club_id | player_nationality | value_increase | player_stay_in_years | club_market_value_from | club_market_value_to | transfer_year | position | birth_date_year | ... | from_fee_pagerank | to_fee_pagerank | from_transfers_pagerank | to_transfers_pagerank | from_eigenvector | to_eigenvector | from_fee_eigenvector | to_fee_eigenvector | from_transfers_eigenvector | to_transfers_eigenvector |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1184 | 631 | Belgium | 8950000 | 2 | 63300000 | 631900000 | 2011 | Goalkeeper | 1992 | ... | 0.001574 | 0.030900 | 0.001918 | 0.007535 | 0.032253 | 0.142025 | 0.012032 | 0.329926 | 0.000452 | 0.006960 |
| 1 | 631 | 13 | Belgium | -7750000 | 0 | 631900000 | 509500000 | 2011 | Goalkeeper | 1992 | ... | 0.030900 | 0.012826 | 0.007535 | 0.004276 | 0.142025 | 0.111613 | 0.329926 | 0.122847 | 0.006960 | 0.006814 |
| 2 | 13 | 631 | Belgium | 23800000 | 3 | 509500000 | 631900000 | 2014 | Goalkeeper | 1992 | ... | 0.012826 | 0.030900 | 0.004276 | 0.007535 | 0.111613 | 0.142025 | 0.122847 | 0.329926 | 0.006814 | 0.006960 |
| 3 | 1084 | 281 | Argentina | 7100000 | 3 | 67580000 | 629500000 | 2014 | Goalkeeper | 1981 | ... | 0.003632 | 0.028804 | 0.003364 | 0.004917 | 0.095054 | 0.112573 | 0.025786 | 0.355295 | 0.002131 | 0.003495 |
| 4 | 281 | 631 | Argentina | -8000000 | 3 | 629500000 | 631900000 | 2017 | Goalkeeper | 1981 | ... | 0.028804 | 0.030900 | 0.004917 | 0.007535 | 0.112573 | 0.142025 | 0.355295 | 0.329926 | 0.003495 | 0.006960 |
| 5 | 1075 | 1085 | Portugal | 0 | 0 | 53350000 | 15400000 | 2007 | Goalkeeper | 1982 | ... | 0.003167 | 0.000679 | 0.004553 | 0.001639 | 0.097993 | 0.030901 | 0.014606 | 0.004624 | 0.007572 | 0.001900 |
| 6 | 1085 | 1075 | Portugal | 1500000 | 1 | 15400000 | 53350000 | 2008 | Goalkeeper | 1982 | ... | 0.000679 | 0.003167 | 0.001639 | 0.004553 | 0.030901 | 0.097993 | 0.004624 | 0.014606 | 0.001900 | 0.007572 |
| 7 | 1075 | 252 | Portugal | 3000000 | 2 | 53350000 | 87900000 | 2010 | Goalkeeper | 1982 | ... | 0.003167 | 0.006561 | 0.004553 | 0.005689 | 0.097993 | 0.118558 | 0.014606 | 0.085266 | 0.007572 | 0.006034 |
| 8 | 252 | 294 | Portugal | 2000000 | 1 | 87900000 | 165850000 | 2011 | Goalkeeper | 1982 | ... | 0.006561 | 0.009744 | 0.005689 | 0.006974 | 0.118558 | 0.165283 | 0.085266 | 0.073810 | 0.006034 | 0.015205 |
| 9 | 294 | 252 | Portugal | -2500000 | 1 | 165850000 | 87900000 | 2012 | Goalkeeper | 1982 | ... | 0.009744 | 0.006561 | 0.006974 | 0.005689 | 0.165283 | 0.118558 | 0.073810 | 0.085266 | 0.015205 | 0.006034 |


### Predicting player value increase

In [5]:
SEED = 20

def build_models(list_of_network_features, normalize=False):
    # For each set of network features, train every regressor and report
    # the mean and std of its 5-fold cross-validation error.
    for columns_to_keep in list_of_network_features:
        x_train, y_train = get_data(columns_to_keep)
        if normalize:
            x_train = normalize_feat(x_train)

        print("Build models that include {} network measures".format(columns_to_keep))
        
        # Build Lasso regression
        lasso_regressor = Lasso_Regression(alpha=1e4, seed=SEED)
        # Get cross validation error
        mean_error_lasso, std_error_lasso = cross_validation(model_ori=lasso_regressor, input=x_train, labels=y_train, K=5)
        print('* Lasso cross validation error mean: \t\t{}, \t\tstd: {}'.format(int(mean_error_lasso), int(std_error_lasso)))

        # Build KNN regressor
        knn_regressor = KNN()
        # Get cross validation error
        mean_error_knn, std_error_knn = cross_validation(model_ori=knn_regressor, input=x_train, labels=y_train, K=5)
        print('* KNN cross validation error mean: \t\t{}, \t\tstd: {}'.format(int(mean_error_knn), int(std_error_knn)))

        # Build MLP regressor
        mlp_regressor = MLP(seed=SEED, solver='adam', alpha=1e-5, hidden_layers=(25, 25), lr=1e-4, max_iter=100000)
        # Get cross validation error
        mean_error_mlp, std_error_mlp = cross_validation(model_ori=mlp_regressor, input=x_train, labels=y_train, K=5)
        print('* MLP cross validation error mean: \t\t{}, \t\tstd: {}'.format(int(mean_error_mlp), int(std_error_mlp)))
        
        # Build SVR regressor
        svr_regressor = SVR(kernel='linear', seed=SEED)
        # Get cross validation error
        mean_error_svm, std_error_svm = cross_validation(model_ori=svr_regressor, input=x_train, labels=y_train, K=5)
        print('* SVR cross validation error mean: \t\t{}, \t\tstd: {}'.format(int(mean_error_svm), int(std_error_svm)))

        # Build Random Forest Regressor
        random_forest_regressor = Random_Forest(n_estimators=100, max_depth=20, seed=SEED)
        # Get cross validation error
        mean_error_rf, std_error_rf = cross_validation(model_ori=random_forest_regressor, input=x_train, labels=y_train, K=5)
        print('* Random Forest cross validation error mean: \t{}, \t\tstd: {} \n'.format(int(mean_error_rf), int(std_error_rf)))
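
The `cross_validation` helper used above lives in `helpers` and is not shown in this report. As a rough sketch of what a K-fold evaluation of this kind might look like, the function below splits the data into `k` folds, fits a fresh copy of the model on each training split, and returns the mean and standard deviation of the per-fold error. The choice of mean absolute error, the shuffling, and the name `cross_validation_sketch` are assumptions. The demo data is synthetic.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

def cross_validation_sketch(model, X, y, k=5, seed=20):
    """Return (mean, std) of the per-fold mean absolute error."""
    folds = KFold(n_splits=k, shuffle=True, random_state=seed)
    errors = []
    for train_idx, test_idx in folds.split(X):
        fold_model = clone(model)  # unfitted copy, so folds do not share state
        fold_model.fit(X[train_idx], y[train_idx])
        predictions = fold_model.predict(X[test_idx])
        errors.append(mean_absolute_error(y[test_idx], predictions))
    return float(np.mean(errors)), float(np.std(errors))

# Synthetic demo data: the target depends mostly on the first feature
rng = np.random.default_rng(20)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

demo_model = RandomForestRegressor(n_estimators=20, random_state=20)
mean_err, std_err = cross_validation_sketch(demo_model, X_demo, y_demo, k=5)
print("mean: {:.3f}, std: {:.3f}".format(mean_err, std_err))
```

Reporting the standard deviation alongside the mean matters here because it shows how much the error varies from fold to fold, which is the yardstick for judging whether a difference between two feature sets is meaningful.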

### Feature evaluation

In [8]:
build_models([['to_closeness','to_in_degree'],[]])

Build models that include ['to_closeness', 'to_in_degree'] network measures
* Lasso cross validation error mean: 		1795623, 		std: 994607
* KNN cross validation error mean: 		1680808, 		std: 1067207
* MLP cross validation error mean: 		1743559, 		std: 1037453
* SVR cross validation error mean: 		2397499, 		std: 1390480
* Random Forest cross validation error mean: 	1563739, 		std: 1068441 

Build models that include [] network measures
* Lasso cross validation error mean: 		1795680, 		std: 994646
* KNN cross validation error mean: 		1680814, 		std: 1067209
* MLP cross validation error mean: 		1766269, 		std: 1017408
* SVR cross validation error mean: 		2397499, 		std: 1390480
* Random Forest cross validation error mean: 	1571424, 		std: 1062634 



Random Forest cross validation error with and without centrality features:

- 'to_closeness' and 'from_in_degree': mean 1559847, std 1054189
- 'to_closeness' and 'to_in_degree': mean 1563739, std 1068441
- No centrality: mean 1571424, std 1062634
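
To put these Random Forest numbers in perspective, the relative improvement over the no-centrality baseline can be computed directly from the reported means. The short calculation below uses only the figures printed above. The variable names are illustrative.

```python
baseline_mean = 1571424  # Random Forest, no centrality features
baseline_std = 1062634

with_centrality = {
    "to_closeness + from_in_degree": 1559847,
    "to_closeness + to_in_degree": 1563739,
}

# Fraction by which the mean error drops relative to the baseline
relative_gain = {
    name: (baseline_mean - mean_error) / baseline_mean
    for name, mean_error in with_centrality.items()
}

for name, gain in relative_gain.items():
    print("{}: {:.2%} lower mean error than the baseline".format(name, gain))
# to_closeness + from_in_degree: 0.74% lower mean error than the baseline
# to_closeness + to_in_degree: 0.49% lower mean error than the baseline
```

Both gains are below 1% of the baseline error, while the fold-to-fold standard deviation is roughly 1.06 million, so on these figures alone the improvement from the centrality features appears small compared with the spread across folds.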

## Analysis

## Conclusion

## Appendix: Project Structure