## Predicting player ratings
In this project you will predict the overall rating of a soccer player from attributes
such as 'crossing' and 'finishing'.
The dataset comes from the European Soccer Database
(https://www.kaggle.com/hugomathien/soccer), which has more than 25,000 matches and more than
10,000 players from European professional soccer seasons from 2008 to 2016.
Download the data into the same folder as this notebook, then run the following cells to load it into the environment.

### Import required libraries

In [None]:
import sqlite3
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score,make_scorer
from math import sqrt
from matplotlib import pyplot as plt
import xgboost as xgb
%matplotlib inline

## Step 1: Load Data from Source

In [None]:
# Create your connection.
cnx = sqlite3.connect('database.sqlite')
df = pd.read_sql_query("SELECT * FROM Player_Attributes", cnx)
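
The load step can also be wrapped so that the connection is always closed after reading. This is a minimal sketch, not part of the original notebook: the helper name `load_table` is hypothetical, and the demo builds a throwaway database with two made-up rows so that it runs without the Kaggle file.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

import pandas as pd


def load_table(db_path, table):
    """Read one table from a SQLite file into a DataFrame (hypothetical helper)."""
    # closing() guarantees the connection is closed even if the query fails.
    with closing(sqlite3.connect(db_path)) as cnx:
        return pd.read_sql_query(f'SELECT * FROM "{table}"', cnx)


# Demo on a throwaway database so the sketch runs without the Kaggle file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
with closing(sqlite3.connect(demo_path)) as cnx:
    cnx.execute("CREATE TABLE Player_Attributes (id INTEGER, overall_rating REAL)")
    cnx.executemany("INSERT INTO Player_Attributes VALUES (?, ?)", [(1, 67.0), (2, 61.0)])
    cnx.commit()

demo_df = load_table(demo_path, "Player_Attributes")
print(demo_df.shape)
```

With the real file, the call would presumably be `load_table('database.sqlite', 'Player_Attributes')`.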

## Step 2: Exploring Data

In [None]:
df.head(10).transpose()

### Data columns

In [None]:
df.columns

### Data description

In [None]:
df.describe().transpose()

In [None]:
df.hist(figsize=(25,25))

### Count of null values in each column

In [None]:
df.isnull().sum(axis=0)

### Unique values in categorical variables

In [None]:
df.preferred_foot.unique()

In [None]:
df.attacking_work_rate.unique()

In [75]:
df.defensive_work_rate.unique()

array(['medium', 'high', 'low', '_0', None, '5', 'ean', 'o', '1', 'ormal',
       '7', '2', '8', '4', 'tocky', '0', '3', '6', '9', 'es'],
      dtype=object)
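
The output above contains labels other than 'low', 'medium' and 'high'. The notebook keeps those rows and encodes them later; an alternative, sketched here under the assumption that only these three labels are meaningful work rates, is to filter the other rows out first. The toy frame below imitates a few of the values listed above.

```python
import pandas as pd

# Assumption: only these three labels are meaningful work rates.
VALID_RATES = {"low", "medium", "high"}

# Toy frame mimicking a few of the values listed above.
toy = pd.DataFrame({
    "attacking_work_rate": ["medium", "high", "low", "medium", "high"],
    "defensive_work_rate": ["medium", "_0", "low", "ean", "high"],
})

# Keep a row only if both work-rate columns hold a valid label.
mask = (
    toy["attacking_work_rate"].isin(VALID_RATES)
    & toy["defensive_work_rate"].isin(VALID_RATES)
)
clean = toy[mask].reset_index(drop=True)
print(clean)
```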

## Step 3: Data Cleaning

### Drop null values

In [76]:
rows_before = df.shape[0]
# Drop every row that has at least one missing value
df = df.dropna()
total_rows_removed = rows_before - df.shape[0]
print("Total Rows Removed {}".format(total_rows_removed))

Total Rows Removed 3624


### Convert Categorical values into numerical values

In [2]:
# pd.factorize returns (codes, uniques); the integer codes replace the string labels
preferred_foot_num = pd.factorize(df.preferred_foot)
attacking_work_rate_num = pd.factorize(df.attacking_work_rate)
defensive_work_rate_num = pd.factorize(df.defensive_work_rate)

In [3]:
df.preferred_foot = preferred_foot_num[0]
df.attacking_work_rate = attacking_work_rate_num[0]
df.defensive_work_rate = defensive_work_rate_num[0]
df.head()
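
To make the encoding step concrete, here is a small sketch of what `pd.factorize` returns, using a made-up series of foot labels rather than the real column.

```python
import pandas as pd

feet = pd.Series(["right", "left", "right", "right", "left"])

# factorize returns (codes, uniques); each code is an index into uniques.
codes, uniques = pd.factorize(feet)
print(codes)          # one integer code per row, numbered by first appearance
print(list(uniques))  # the label each code stands for
```

The codes carry no natural order: they only reflect which label appeared first. Tree-based models are largely indifferent to that, but a linear model will treat the codes as a numeric scale.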

## Step 4: Feature Correlation Analysis

In [86]:
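# Correlation of every numeric column with overall_rating, sorted ascending.
# select_dtypes skips text columns, which recent pandas versions no longer drop silently in corr().
df.select_dtypes(include='number').corr()['overall_rating'].sort_values()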

player_api_id         -0.328315
player_fifa_api_id    -0.278703
id                    -0.003738
preferred_foot         0.001417
gk_handling            0.006717
gk_reflexes            0.007804
gk_positioning         0.008029
defensive_work_rate    0.023312
gk_diving              0.027675
gk_kicking             0.028799
attacking_work_rate    0.069407
sliding_tackle         0.128054
marking                0.132185
balance                0.160211
standing_tackle        0.163986
agility                0.239963
acceleration           0.243998
interceptions          0.249094
sprint_speed           0.253048
jumping                0.258978
heading_accuracy       0.313324
strength               0.315684
aggression             0.322782
stamina                0.325606
finishing              0.330079
free_kick_accuracy     0.349800
dribbling              0.354191
crossing               0.357320
curve                  0.357566
volleys                0.361739
positioning            0.368978
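long_shots             0.392668
penalties              0.392715
shot_power             0.428053
vision                 0.431493
long_passing           0.434525
ball_control           0.443991
short_passing          0.458243
potential              0.765435
reactions              0.771856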

#### Note: Based on the analysis above, features whose correlation with the overall rating exceeds 0.3 are used to train the model
Feature list:
heading_accuracy       (0.313324)
strength               (0.315684)
aggression             (0.322782)
stamina                (0.325606)
finishing              (0.330079)
free_kick_accuracy     (0.349800)
dribbling              (0.354191)
crossing               (0.357320)
curve                  (0.357566)
volleys                (0.361739)
positioning            (0.368978)
long_shots             (0.392668)
penalties              (0.392715)
shot_power             (0.428053)
vision                 (0.431493)
long_passing           (0.434525)
ball_control           (0.443991)
short_passing          (0.458243)
potential              (0.765435)
reactions              (0.771856)
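
The same selection can be done programmatically instead of by copying names. This is a sketch on toy data; the helper name `features_above` is hypothetical. Like the list above, it thresholds the signed correlation, so strongly negative columns are excluded unless `.abs()` is applied first.

```python
import numpy as np
import pandas as pd


def features_above(df, target, threshold):
    """Names of numeric columns whose correlation with `target` exceeds `threshold`."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    return corr[corr > threshold].sort_values().index.tolist()


# Toy data: 'a' is a linear function of the target, 'b' just alternates in sign.
t = np.arange(10.0)
toy = pd.DataFrame({
    "a": 2 * t + 1,
    "b": [1.0, -1.0] * 5,
    "overall_rating": t,
})
selected = features_above(toy, "overall_rating", 0.3)
print(selected)
```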

In [92]:
data = df[['heading_accuracy','strength','aggression','stamina','finishing','free_kick_accuracy','dribbling','crossing','curve','volleys','positioning','long_shots','penalties','shot_power','vision','long_passing','ball_control','short_passing','potential','reactions','overall_rating']].copy()

In [93]:
data.head()

| | heading_accuracy | strength | aggression | stamina | finishing | free_kick_accuracy | dribbling | crossing | curve | volleys | ... | long_shots | penalties | shot_power | vision | long_passing | ball_control | short_passing | potential | reactions | overall_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 71.0 | 76.0 | 71.0 | 54.0 | 44.0 | 39.0 | 51.0 | 49.0 | 45.0 | 44.0 | ... | 35.0 | 48.0 | 55.0 | 54.0 | 64.0 | 49.0 | 61.0 | 71.0 | 47.0 | 67.0 |
| 1 | 71.0 | 76.0 | 71.0 | 54.0 | 44.0 | 39.0 | 51.0 | 49.0 | 45.0 | 44.0 | ... | 35.0 | 48.0 | 55.0 | 54.0 | 64.0 | 49.0 | 61.0 | 71.0 | 47.0 | 67.0 |
| 2 | 71.0 | 76.0 | 63.0 | 54.0 | 44.0 | 39.0 | 51.0 | 49.0 | 45.0 | 44.0 | ... | 35.0 | 48.0 | 55.0 | 54.0 | 64.0 | 49.0 | 61.0 | 66.0 | 47.0 | 62.0 |
| 3 | 70.0 | 76.0 | 62.0 | 54.0 | 43.0 | 38.0 | 50.0 | 48.0 | 44.0 | 43.0 | ... | 34.0 | 47.0 | 54.0 | 53.0 | 63.0 | 48.0 | 60.0 | 65.0 | 46.0 | 61.0 |
| 4 | 70.0 | 76.0 | 62.0 | 54.0 | 43.0 | 38.0 | 50.0 | 48.0 | 44.0 | 43.0 | ... | 34.0 | 47.0 | 54.0 | 53.0 | 63.0 | 48.0 | 60.0 | 65.0 | 46.0 | 61.0 |


## Step 5: Split Data for Training & Testing

In [134]:
X = np.asarray(data[['heading_accuracy','strength','aggression','stamina','finishing','free_kick_accuracy','dribbling','crossing','curve','volleys','positioning','long_shots','penalties','shot_power','vision','long_passing','ball_control','short_passing','potential','reactions']])
#X = np.asarray(data[['volleys','positioning','long_shots','penalties','shot_power','vision','long_passing','ball_control','short_passing','potential','reactions']])
Y = np.asarray(data.overall_rating)

In [135]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.33)
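
The split above is random, so the metrics reported below will vary slightly from run to run. A sketch of a repeatable split follows, using toy arrays in place of `X` and `Y`; the seed value 42 is an arbitrary assumption, and any fixed integer behaves the same way.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature matrix and target used above.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# random_state=42 is an arbitrary choice; any fixed integer gives a repeatable split.
split_a = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))
```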

### 1.  LinearRegression

In [187]:
lregr = LinearRegression()
# Fit data
lregr.fit(X_train, Y_train)
# Predict Y values
Y_pred = lregr.predict(X_test)
print("Mean squared error: %.2f"
      % mean_squared_error(Y_test, Y_pred))
# R^2 score (labelled 'Variance score' in the output): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(Y_test, Y_pred))

Mean squared error: 10.83
Variance score: 0.78
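
The number printed as "Variance score" comes from `r2_score`, the coefficient of determination. The sketch below computes both metrics by hand on three made-up values to show what they measure; it also adds the root mean squared error, which the notebook does not print.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 8.0])

# Mean squared error: average of the squared residuals.
mse = np.mean((y_true - y_hat) ** 2)
# Root mean squared error is in the same units as the rating itself.
rmse = np.sqrt(mse)
# R^2 = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(round(mse, 4), round(rmse, 4), round(r2, 4))
```

For instance, the MSE of 10.83 reported above corresponds to an RMSE of about 3.29 rating points.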


### 2. DecisionTreeRegressor

#### Hyperparameter tuning

In [203]:
# scoring method
scoring = make_scorer(r2_score)

# GridSearchCV
g_cv = GridSearchCV(DecisionTreeRegressor(random_state=0),
              param_grid={'min_samples_split': range(2, 10)},
              scoring=scoring, cv=5, refit=True)
g_cv.fit(X_train, Y_train)

#best hyper parameters
g_cv.best_params_

{'min_samples_split': 4}
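
`best_params_` reports only the winner. To see how close the other candidates were, the fitted search object also exposes `cv_results_`. The sketch below runs the same kind of search on synthetic data (an assumption made so that it runs standalone) and tabulates the scores.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; the notebook would use X_train and Y_train here.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"min_samples_split": list(range(2, 10))},
    scoring="r2",  # built-in scorer, equivalent to make_scorer(r2_score)
    cv=5,
)
search.fit(X_demo, y_demo)

# One row per candidate, ordered by mean cross-validated R^2 rank.
table = pd.DataFrame(search.cv_results_)[
    ["param_min_samples_split", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(table.head())
```

If the top few rows differ by less than their standard deviation, the choice of `min_samples_split` matters little for this model.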

In [204]:
dregr = DecisionTreeRegressor(random_state=0,min_samples_split= 4)
# Fit Data
dregr.fit(X_train, Y_train)
# Predict Y Value
Y_pred = dregr.predict(X_test)
print("Mean squared error: %.2f"
      % mean_squared_error(Y_test, Y_pred))
# R^2 score (labelled 'Variance score' in the output): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(Y_test, Y_pred))

Mean squared error: 3.22
Variance score: 0.94


### 3. RandomForestRegressor

#### Hyperparameter tuning

In [None]:
# scoring method
scoring = make_scorer(r2_score)

# GridSearchCV
g_cv = GridSearchCV(RandomForestRegressor(),
              param_grid={'max_features':range(2,10), 'min_samples_split':range(2,10), 'min_samples_leaf':range(2,10)},
              scoring=scoring, cv=5, refit=True)
g_cv.fit(X_train, Y_train)
#best hyper parameters
g_cv.best_params_
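
The grid above has 8 x 8 x 8 = 512 combinations, which means 2,560 forest fits with 5-fold cross-validation. A cheaper alternative, shown here as a sketch and not as the method the notebook used, is `RandomizedSearchCV`, which samples a fixed number of combinations. The synthetic data, `n_iter=5`, `cv=3` and `n_estimators=20` are all assumptions chosen to keep the demo fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; the notebook would use X_train and Y_train here.
X_demo, y_demo = make_regression(n_samples=150, n_features=12, noise=5.0, random_state=0)

# Same three parameters and ranges as the grid above.
param_space = {
    "max_features": list(range(2, 10)),
    "min_samples_split": list(range(2, 10)),
    "min_samples_leaf": list(range(2, 10)),
}

# n_iter=5 samples 5 of the 512 combinations instead of trying them all.
search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    param_distributions=param_space,
    n_iter=5,
    scoring="r2",
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```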

In [202]:
ranregr = RandomForestRegressor(max_features=2, min_samples_split=4, n_estimators=50, min_samples_leaf=2)
# Fit Data
ranregr.fit(X_train, Y_train)
# Predict Y Value
Y_pred = ranregr.predict(X_test)
print("Mean squared error: %.2f"
      % mean_squared_error(Y_test, Y_pred))
# R^2 score (labelled 'Variance score' in the output): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(Y_test, Y_pred))

Mean squared error: 1.77
Variance score: 0.96


### 4. XGBoost Regressor

In [None]:
# scoring method
scoring = make_scorer(r2_score)

# GridSearchCV
g_cv = GridSearchCV(xgb.XGBRegressor(),
              param_grid={'min_child_weight':[4,5], 'gamma':[i/10.0 for i in range(3,6)],  'subsample':[i/10.0 for i in range(6,11)],
'colsample_bytree':[i/10.0 for i in range(6,11)], 'max_depth': [20,25,30,35,40,45,50]},
              scoring=scoring, cv=5, refit=True)
g_cv.fit(X_train, Y_train)
#best hyper parameters
g_cv.best_params_
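
Before launching a search of this size it helps to know how many models it will train. The sketch below counts the candidates in the grid above with scikit-learn's `ParameterGrid`; it needs no data and does not import xgboost.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
xgb_grid = {
    "min_child_weight": [4, 5],
    "gamma": [i / 10.0 for i in range(3, 6)],
    "subsample": [i / 10.0 for i in range(6, 11)],
    "colsample_bytree": [i / 10.0 for i in range(6, 11)],
    "max_depth": [20, 25, 30, 35, 40, 45, 50],
}

n_candidates = len(ParameterGrid(xgb_grid))
n_fits = n_candidates * 5  # cv=5 folds per candidate, excluding the final refit
print(n_candidates, n_fits)
```

That is 1,050 candidates and 5,250 fits, so expect this cell to run for a long time on the full training set.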

In [221]:
xgregr=xgb.XGBRegressor(min_child_weight=11,gamma=0.5,subsample=0.8,colsample_bytree=0.6,max_depth=65)
# Fit Data
xgregr.fit(X_train, Y_train)
# Predict Y Value
Y_pred = xgregr.predict(X_test)
print("Mean squared error: %.2f"
      % mean_squared_error(Y_test, Y_pred))
# R^2 score (labelled 'Variance score' in the output): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(Y_test, Y_pred))

Mean squared error: 1.19
Variance score: 0.98


## Note: From the analysis above, XGBoost gives the lowest mean squared error (1.19) and the highest variance score (0.98) of the four models
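
For a side-by-side view, the four sets of test metrics printed above can be collected into one table. This sketch copies the numbers from the outputs by hand instead of recomputing them, and adds an RMSE column as a convenience.

```python
import pandas as pd

# Test-set metrics copied from the printed outputs above.
results = pd.DataFrame(
    {
        "model": [
            "LinearRegression",
            "DecisionTreeRegressor",
            "RandomForestRegressor",
            "XGBRegressor",
        ],
        "mse": [10.83, 3.22, 1.77, 1.19],
        "r2": [0.78, 0.94, 0.96, 0.98],
    }
)
# RMSE is expressed in rating points, which is easier to interpret than MSE.
results["rmse"] = results["mse"] ** 0.5
results = results.sort_values("mse").reset_index(drop=True)
print(results)
```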