# Intelligent Player Scouting and Talent Acquisition for Football Managers using AI

Using the FIFA 19 dataset, the proposed AI model helps managers choose the best players and identify average, underperforming, undervalued, and overpriced players.


**The structure of the model is described in two phases:**

The first phase builds a model that groups players by the similarity of their traits. To do this, I implemented the K-means, K-means++, and DBSCAN algorithms to group players by their individual abilities and to remove noise from the dataset. The model can potentially identify patterns that certain players share and that team managers would not normally consider during manual evaluation.

The second phase entails building a classification model that will be capable of re-evaluating the players based on the cluster labels provided by the clustering algorithm in the first phase. These classifiers will be able to predict what group a fresh set of players will belong to. Support Vector Machine and Random Forest are two ML algorithms that I used for this. This would also help managers diagnose lack of skill diversity, identify under-priced and over-priced players, and potentially influence their transfer decisions.
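
The two phases can be illustrated end to end on synthetic data. The sketch below is a minimal illustration only: the blob centres, sample counts, and variable names (`X_demo`, `cluster_ids`, `clf`) are assumptions made for the example and are not taken from the FIFA 19 data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for scaled player attributes (assumption: 4 well-separated groups).
X_demo, _ = make_blobs(n_samples=400, centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
                       cluster_std=0.6, random_state=0)

# Phase 1: unsupervised grouping; the cluster ids become the training labels.
cluster_ids = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_demo)

# Phase 2: a classifier learns to reproduce the grouping for unseen players.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, cluster_ids, test_size=0.25, random_state=0, stratify=cluster_ids)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(f"Held-out accuracy on cluster labels: {test_accuracy:.3f}")
```

Because the classifier is trained on labels produced by the clustering step, its accuracy measures how well it reproduces the grouping, not how good the grouping itself is.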



### Importing all of the libraries required to build the model

In [None]:
import numpy as np, pandas as pd, seaborn as sns, matplotlib.pyplot as plt

In [None]:
from sklearn.preprocessing import scale
from sklearn import preprocessing
import itertools
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.model_selection import ParameterGrid
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
import imblearn
from imblearn.over_sampling import SMOTE
from collections import Counter
from imblearn.under_sampling import EditedNearestNeighbours 
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils import shuffle
import plotly.graph_objs as go
from itertools import product

### Import the libraries required to evaluate the classifiers' accuracy and performance.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import jaccard_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

### About the Dataset

The dataset was obtained from Kaggle and can be downloaded from https://www.kaggle.com/karangadiya/fifa19.

In [None]:
data=pd.read_csv("data.csv")
data.head()

### Dataset contains 18,207 rows and 89 columns

In [None]:
data.shape

### Examine the number and percentage of missing values (NaN) in each column

In [None]:
train_test = pd.concat([data.drop('Photo', axis = 1)], keys = ['data'], axis = 0)
missing_values = pd.concat([train_test.isna().sum(),
                            (train_test.isna().sum() / train_test.shape[0]) * 100], axis = 1, 
                           keys = ['Values missing', 'Percent of missing'])
missing_values.loc[missing_values['Percent of missing'] > 0].sort_values(ascending = False, by = 'Percent of missing').style.background_gradient('Blues')

### To drop unused columns by name, I first need to remove the spaces from the column names

In [None]:
data.columns = [c.replace(' ', '') for c in data.columns]
data.columns

### Position
The first step in examining the ground truth is to check which position has the greatest number of players.

In [None]:
# Create the figure before plotting so the requested size applies to the count plot
plt.figure(figsize=(80, 40))
ax = sns.countplot(x = data['Position'])
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
ax.set_xlabel('Position') 
ax.set_ylabel('Number of players')
plt.tight_layout()
plt.show()

### Age
A player's age must always be taken into account, since it contributes to their market value.

In [None]:
# Create the figure before plotting so the requested size applies to the count plot
plt.figure(figsize=(80, 40))
ax = sns.countplot(x = data['Age'])
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
ax.set_xlabel('Age') 
ax.set_ylabel('Number of players')
plt.tight_layout()
plt.show()

### Potential 
Every player in the dataset has a Potential rating. It describes their expected ability, which contributes strongly to their market value.

In [None]:
# Create the figure before plotting so the requested size applies to the count plot
plt.figure(figsize=(80, 40))
ax = sns.countplot(x = data['Potential'])
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
ax.set_xlabel('Potential') 
ax.set_ylabel('Number of players')
plt.tight_layout()
plt.show()

### Drop the columns that aren't necessary for the model.

In [None]:
data=data.drop(['Name','Unnamed:0','ID','Photo','Flag','Overall','ClubLogo', 'Special', 'InternationalReputation', 'WeakFoot',
               'SkillMoves','WorkRate','BodyType','RealFace','JerseyNumber','Joined','LoanedFrom','ContractValidUntil',
                'Weight','Crossing','Finishing','HeadingAccuracy','ShortPassing','Volleys','Dribbling','Curve','FKAccuracy',
                'LongPassing','BallControl','Acceleration','SprintSpeed','Agility','Reactions','Balance','ShotPower','Jumping',
                'Stamina','Strength','LongShots','Aggression','Interceptions','Positioning','Vision','Penalties','Composure',
                'Marking','SlidingTackle','StandingTackle','ReleaseClause'], axis=1)

data.head(10)

### Fill the missing values (NaN) with each column's mean
Many attributes do not apply to goalkeepers, which is why some of their values are missing.

In [None]:
# numeric_only=True restricts the mean to numeric columns (recent pandas versions raise an error otherwise)
column_means = data.mean(numeric_only=True)
data = data.fillna(column_means)
data

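### Remove the euro symbol and unit letters from the players' wages and values.

In [None]:
# Strip the euro sign and the K suffix, then convert to float
data.Wage = data.Wage.str.replace("€","")
data.Wage = data.Wage.str.replace("K","").astype("float")
data.Wage.head() 

In [None]:
# Note: the M and K suffixes are stripped without rescaling,
# so the resulting numbers mix values quoted in millions and in thousands
data.Value = data.Value.str.replace("€","")
data.Value = data.Value.str.replace("M","")
data.Value = data.Value.str.replace("K","").astype("float")
data.Value.head() 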
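
If a single unit is preferred for the monetary columns, the suffix can be turned into a multiplier instead of being discarded. The helper below is a hypothetical sketch (`parse_euro` is not part of the notebook), and the sample strings are illustrative values in the same format as the `Value` column. Switching to it would change the feature values and therefore the clustering scores reported later.

```python
import pandas as pd

def parse_euro(series: pd.Series) -> pd.Series:
    """Convert strings such as '€110.5M' or '€565K' to a float amount in euros."""
    cleaned = series.str.replace("€", "", regex=False)
    # Map the trailing unit letter to a multiplier; plain numbers such as '0' get 1.
    multiplier = cleaned.str[-1].map({"K": 1e3, "M": 1e6}).fillna(1.0)
    return cleaned.str.rstrip("KM").astype(float) * multiplier

sample = pd.Series(["€110.5M", "€565K", "€0"])
print(parse_euro(sample).tolist())  # [110500000.0, 565000.0, 0.0]
```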

### One-hot encoding
This converts every categorical variable into indicator variables (0s and 1s).

In [None]:
dummies=pd.get_dummies(data)
dummies
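
As a small illustration of what `pd.get_dummies` does, the toy frame below (illustrative values, not rows from the dataset) has one numeric and one categorical column. Numeric columns pass through unchanged, while each category becomes its own indicator column.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (values are illustrative).
toy = pd.DataFrame({"Age": [31, 33, 26],
                    "PreferredFoot": ["Left", "Right", "Right"]})

# Numeric columns pass through; each category becomes its own indicator column.
toy_encoded = pd.get_dummies(toy)
print(toy_encoded.columns.tolist())  # ['Age', 'PreferredFoot_Left', 'PreferredFoot_Right']
```

Columns with many distinct values produce one indicator column per value, which is why one-hot encoding can widen a dataset considerably.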

### Store the one-hot encoded features in the variable X

In [None]:
X=dummies
X

### Standardize the dataset
Now let's standardize the dataset. Why is this needed? Standardization rescales each feature to zero mean and unit variance, so that distance-based algorithms such as K-means treat features with different magnitudes and distributions equally. I used StandardScaler() to standardize the dataset.

In [None]:
players_scale = preprocessing.StandardScaler().fit(X).transform(X)
players_scale[0:5]

In [None]:
#Store the scaled data into a dataframe object
df_players = pd.DataFrame(players_scale, columns=X.columns)
df_players.head()
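
The effect of `StandardScaler` can be checked by hand on a tiny array. The numbers below are illustrative and not taken from the dataset; the sketch only shows that the scaler subtracts each column's mean and divides by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (illustrative values, not from the dataset).
raw = np.array([[20.0, 1000.0],
                [25.0, 50000.0],
                [30.0, 250000.0]])

scaled = StandardScaler().fit_transform(raw)

# Same result by hand: subtract the column mean, divide by the population std (ddof=0).
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6))  # each column is centred on 0
print(scaled.std(axis=0).round(6))   # each column has unit standard deviation
```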

# K-Means Algorithm

## The following is the analytical strategy used in the K-means experiment:

### 1. Applying the elbow method to determine the optimal number of K using inertia
### 2. Applying K-means++ to the original dataset 
### 3. Hyperparameter tuning for K-means
### 4. Applying PCA to K-means++


### 1. Applying the elbow method to determine the optimal number of K using inertia
Inertia is the sum of squared distances between each point and its cluster center. The elbow plot shows inertia as a function of the number of clusters.

In [None]:
inertia = []
k_list = range(1, 15)

for k in k_list:
    km = KMeans(n_clusters=k)
    km.fit(df_players)
    inertia.append([k, km.inertia_])
    
pca_results = pd.DataFrame({'Cluster': range(1,15), 'SSE': inertia})
plt.figure(figsize=(12,6))
plt.plot(pd.DataFrame(inertia)[0], pd.DataFrame(inertia)[1], marker='o', color='green')
plt.title('Optimal Number of Clusters using Elbow Method (Original Data)')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
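
To make the definition of inertia concrete, the sketch below recomputes it by hand on synthetic data and compares it with the `inertia_` attribute. The blob centres and the names `X_demo` and `manual_inertia` are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three groups (assumption: centres chosen for illustration only).
X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                       cluster_std=0.7, random_state=1)
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X_demo)

# Inertia = sum over all points of the squared distance to the assigned centroid.
assigned_centres = km.cluster_centers_[km.labels_]
manual_inertia = float(((X_demo - assigned_centres) ** 2).sum())

print(manual_inertia, km.inertia_)
```

Inertia always decreases as more clusters are added, which is why the elbow method looks for the point where the decrease levels off instead of the minimum.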

### 2. Applying K-means++ to the original dataset

The plot indicates that the number of clusters should be between 4 and 5; for simplicity, I chose 4. Next, compute the silhouette score using K-means++ on the original dataset.

In [None]:
kmeans_scale = KMeans(n_clusters=4, n_init=100, max_iter=400, init='k-means++').fit(df_players)
print('KMeans Scaled Silhouette Score: {}'.format(silhouette_score(df_players, 
                                                                   kmeans_scale.labels_, metric='euclidean')))
labels_scale = kmeans_scale.labels_
clusters_scale = pd.concat([df_players, pd.DataFrame({'cluster_scaled':labels_scale})], axis=1)
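
For reference, the value printed by `silhouette_score` is the mean over all samples of the per-sample silhouette below. This is the standard definition, stated here as background; $a(i)$ is the mean distance from sample $i$ to the other members of its own cluster, and $b(i)$ is the mean distance from sample $i$ to the members of the nearest other cluster.

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
$$

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest that samples may have been assigned to the wrong cluster.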

### 3. Hyperparameter tuning for K-means

In [None]:
parameters = {'n_clusters': [2, 3, 4, 5, 10, 20, 30]}

parameter_grid = ParameterGrid(parameters)

In [None]:
list(parameter_grid)

In [None]:
best_score = -1
model = KMeans()

### 3.1. Fine-tune the K-means model

In [None]:
for g in parameter_grid:
    model.set_params(**g)
    model.fit(df_players)

    ss = metrics.silhouette_score(df_players, model.labels_)
    print('Parameter: ', g, 'Score: ', ss)
    if ss > best_score:
        best_score = ss
        best_grid = g

### 3.2. Get the best silhouette score along with the number of clusters

In [None]:
best_grid

### 3.3. A scatter plot of the original dataset using K-means

In [None]:
# Use the labels from the K-means++ model fitted on the scaled dataset
labels_scale = kmeans_scale.labels_
pca2 = PCA(n_components=3).fit(df_players)
pca2d = pca2.transform(df_players)
plt.figure(figsize = (10,10))
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x=pca2d[:,0], y=pca2d[:,1], 
                hue=labels_scale, 
                palette='Set1',
                s=100, alpha=0.2).set_title('KMeans Clusters (4) Derived from Original Dataset', fontsize=15)
plt.legend()
plt.ylabel('PC2')
plt.xlabel('PC1')
plt.show()

### 3.4. Plot a 3-D graph of the original dataset

In [None]:
Scene = dict(xaxis = dict(title  = 'PC1'),yaxis = dict(title  = 'PC2'),zaxis = dict(title  = 'PC3'))
labels = labels_scale
trace = go.Scatter3d(x=pca2d[:,0], y=pca2d[:,1], z=pca2d[:,2], mode='markers',marker=dict(color = labels, colorscale='Viridis', size = 10, line = dict(color = 'gray',width = 5)))
layout = go.Layout(margin=dict(l=0,r=0),scene = Scene, height = 1000,width = 1000)
# Pass the trace list directly so the `data` DataFrame is not overwritten
fig = go.Figure(data=[trace], layout=layout)
fig.show()

### 4. Applying PCA to K-means 

In [None]:
#n_components=900 because we have 900 features in the dataset
pca = PCA(n_components=900)
pca.fit(df_players)
variance = pca.explained_variance_ratio_
var = np.cumsum(np.round(variance, 3)*100)
plt.figure(figsize=(12,6))
plt.ylabel('% Variance Explained')
plt.xlabel('# of Features')
plt.title('PCA Analysis')
plt.ylim(0,100.5)
plt.plot(var)
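
The cumulative explained-variance curve can also be read programmatically. The sketch below uses synthetic data driven by two latent factors (an assumption made for the example) and finds the smallest number of components whose cumulative ratio reaches a chosen threshold; the 95% threshold is illustrative, not a value used in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 10 features driven by 2 latent factors plus small noise (assumption).
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

pca_demo = PCA().fit(X_demo)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the 95% threshold.
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_needed, cumulative[:3].round(3))
```

scikit-learn can make the same selection directly when `n_components` is given as a fraction between 0 and 1, for example `PCA(n_components=0.95)`.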

### 4.1. Examine PCA with n_components = 2

In [None]:
pca = PCA(n_components=2)
pca_scale = pca.fit_transform(df_players)
pca_df_scale = pd.DataFrame(pca_scale,  columns=['pc1','pc2'])
print(pca.explained_variance_ratio_)

### 4.2. Evaluate the elbow method distribution using PCA (2)

In [None]:
sse = []
k_list = range(1, 15)

for k in k_list:
    km = KMeans(n_clusters=k)
    km.fit(pca_df_scale)
    sse.append([k, km.inertia_])
    
pca_results_scale = pd.DataFrame({'Cluster': range(1,15), 'SSE': sse})
plt.figure(figsize=(12,6))
plt.plot(pd.DataFrame(sse)[0], pd.DataFrame(sse)[1], marker='o', color='green')
plt.title('Optimal Number of Clusters using Elbow Method (PCA_Scaled Data)')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')

### 4.3. After applying PCA (2), recalculate the silhouette score

In [None]:
kmeans_pca_scale = KMeans(n_clusters=4, n_init=100, max_iter=400, init='k-means++', random_state=42).fit(pca_df_scale)

print('KMeans PCA Scaled Silhouette Score: {}'.format(silhouette_score(pca_df_scale, kmeans_pca_scale.labels_, metric='euclidean')))
labels_pca_scale = kmeans_pca_scale.labels_
clusters_pca_scale = pd.concat([pca_df_scale, 
                                pd.DataFrame({'pca_clusters':labels_pca_scale})], axis=1)

### 4.4. Examine PCA with n_components = 3

In [None]:
pca = PCA(n_components=3)
pca_scale = pca.fit_transform(df_players)
pca_df_scale = pd.DataFrame(pca_scale,  columns=['pc1','pc2','pc3'])
print(pca.explained_variance_ratio_)

### 4.5. Evaluate the elbow method distribution using PCA (3)

In [None]:
sse = []
k_list = range(1, 15)

for k in k_list:
    km = KMeans(n_clusters=k)
    km.fit(pca_df_scale)
    sse.append([k, km.inertia_])
    
pca_results = pd.DataFrame({'Cluster': range(1,15), 'SSE': sse})
plt.figure(figsize=(12,6))
plt.plot(pd.DataFrame(sse)[0], pd.DataFrame(sse)[1], marker='o', color='green')
plt.title('Optimal Number of Clusters using Elbow Method (PCA_Scaled Data)')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')

### 4.6.  After applying PCA (3), recalculate the silhouette score

In [None]:
kmeans_pca_scale = KMeans(n_clusters=4, n_init=100, max_iter=400, init='k-means++', random_state=42).fit(pca_df_scale)

print('KMeans PCA Scaled Silhouette Score: {}'.format(silhouette_score(pca_df_scale, kmeans_pca_scale.labels_, metric='euclidean')))
labels_pca_scale = kmeans_pca_scale.labels_
clusters_pca_scale = pd.concat([pca_df_scale, 
                                pd.DataFrame({'pca_clusters':labels_pca_scale})], axis=1)

### 4.7. Examine PCA with n_components = 4

In [None]:
pca = PCA(n_components=4)
pca_scale = pca.fit_transform(df_players)
pca_df_scale = pd.DataFrame(pca_scale,  columns=['pc1','pc2','pc3','pc4'])
print(pca.explained_variance_ratio_)

### 4.8. Evaluate the elbow method distribution using PCA (4)

In [None]:
sse = []
k_list = range(1, 15)

for k in k_list:
    km = KMeans(n_clusters=k)
    km.fit(pca_df_scale)
    sse.append([k, km.inertia_])
    
pca_results = pd.DataFrame({'Cluster': range(1,15), 'SSE': sse})
plt.figure(figsize=(12,6))
plt.plot(pd.DataFrame(sse)[0], pd.DataFrame(sse)[1], marker='o', color='green')
plt.title('Optimal Number of Clusters using Elbow Method (PCA_Scaled Data)')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')

### 4.9. After applying PCA (4), recalculate the silhouette score using K-means++

In [None]:
kmeans_pca_scale = KMeans(n_clusters=4, n_init=100, max_iter=400, init='k-means++', random_state=42).fit(pca_df_scale)

print('KMeans PCA Scaled Silhouette Score: {}'.format(silhouette_score(pca_df_scale, kmeans_pca_scale.labels_, metric='euclidean')))
labels_pca_scale = kmeans_pca_scale.labels_
clusters_pca_scale = pd.concat([pca_df_scale, 
                                pd.DataFrame({'pca_clusters':labels_pca_scale})], axis=1)

### 4.10. Examine PCA with n_components = 5

In [None]:
pca = PCA(n_components=5)
pca_scale = pca.fit_transform(df_players)
pca_df_scale = pd.DataFrame(pca_scale,  columns=['pc1','pc2','pc3','pc4','pc5'])
print(pca.explained_variance_ratio_)

### 4.11. Evaluate the elbow method distribution using PCA (5)

In [None]:
sse = []
k_list = range(1, 15)

for k in k_list:
    km = KMeans(n_clusters=k)
    km.fit(pca_df_scale)
    sse.append([k, km.inertia_])
    
pca_results = pd.DataFrame({'Cluster': range(1,15), 'SSE': sse})
plt.figure(figsize=(12,6))
plt.plot(pd.DataFrame(sse)[0], pd.DataFrame(sse)[1], marker='o', color='green')
plt.title('Optimal Number of Clusters using Elbow Method (PCA_Scaled Data)')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')

### 4.12. After applying PCA (5), recalculate the silhouette score using K-means++

In [None]:
kmeans_pca_scale = KMeans(n_clusters=4, n_init=100, max_iter=400, init='k-means++', random_state=42).fit(pca_df_scale)

print('KMeans PCA Scaled Silhouette Score: {}'.format(silhouette_score(pca_df_scale, kmeans_pca_scale.labels_, metric='euclidean')))
labels_pca_scale = kmeans_pca_scale.labels_
clusters_pca_scale = pd.concat([pca_df_scale, 
                                pd.DataFrame({'pca_clusters':labels_pca_scale})], axis=1)

### 4.13. Examine PCA with n_components = 30

In [None]:
pca = PCA(n_components=30)
pca_scale = pca.fit_transform(df_players)
pca_df_scale = pd.DataFrame(pca_scale)
print(pca.explained_variance_ratio_)

### 4.14. Evaluate the elbow method distribution using PCA (30)

In [None]:
sse = []
k_list = range(1, 15)

for k in k_list:
    km = KMeans(n_clusters=k)
    km.fit(pca_df_scale)
    sse.append([k, km.inertia_])
    
pca_results = pd.DataFrame({'Cluster': range(1,15), 'SSE': sse})
plt.figure(figsize=(12,6))
plt.plot(pd.DataFrame(sse)[0], pd.DataFrame(sse)[1], marker='o', color='green')
plt.title('Optimal Number of Clusters using Elbow Method (PCA_Scaled Data)')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')

### 4.15. After applying PCA (30), recalculate the silhouette score using K-means++

In [None]:
kmeans_pca_scale = KMeans(n_clusters=4, n_init=100, max_iter=400, init='k-means++', random_state=42).fit(pca_df_scale)

print('KMeans PCA Scaled Silhouette Score: {}'.format(silhouette_score(pca_df_scale, kmeans_pca_scale.labels_, metric='euclidean')))
labels_pca_scale = kmeans_pca_scale.labels_
clusters_pca_scale = pd.concat([pca_df_scale, 
                                pd.DataFrame({'pca_clusters':labels_pca_scale})], axis=1)

### PCA n_components and silhouette score

In [None]:
# Use a separate name so the fitted PCA object `pca` is not overwritten by this summary
pca_summary = [
    {'Number of PCA': 2, 'Number of Clusters': 4, 'Silhouette Score': 0.411},
    {'Number of PCA': 3, 'Number of Clusters': 4, 'Silhouette Score': 0.458},
    {'Number of PCA': 4, 'Number of Clusters': 4, 'Silhouette Score': 0.410},
    {'Number of PCA': 5, 'Number of Clusters': 4, 'Silhouette Score': 0.374},
    {'Number of PCA': 30, 'Number of Clusters': 4, 'Silhouette Score': 0.158},
]
df = pd.DataFrame(pca_summary, index=['Principal Component Analysis'] * len(pca_summary))
df.head()

### 4.16. For K-means, PCA with n_components = 3 produced the best silhouette score.

In [None]:
pca = PCA(n_components=3)
pca_scale = pca.fit_transform(df_players)
pca_df_scale = pd.DataFrame(pca_scale,  columns=['pc1','pc2','pc3'])
print(pca.explained_variance_ratio_)

### 4.17. Evaluate the elbow method distribution using PCA (3)

In [None]:
sse = []
k_list = range(1, 15)

for k in k_list:
    km = KMeans(n_clusters=k)
    km.fit(pca_df_scale)
    sse.append([k, km.inertia_])
    
pca_results = pd.DataFrame({'Cluster': range(1,15), 'SSE': sse})
plt.figure(figsize=(12,6))
plt.plot(pd.DataFrame(sse)[0], pd.DataFrame(sse)[1], marker='o', color='green')
plt.title('Optimal Number of Clusters using Elbow Method (PCA_Scaled Data)')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')

### 4.18. After applying PCA (3), recalculate the silhouette score using K-means++
I tried a few other values for n_components, but **0.458** was the highest silhouette score I obtained.

In [None]:
kmeans_pca_scale = KMeans(n_clusters=4, n_init=100, max_iter=400, init='k-means++', random_state=42).fit(pca_df_scale)

print('KMeans PCA Scaled Silhouette Score: {}'.format(silhouette_score(pca_df_scale, kmeans_pca_scale.labels_, metric='euclidean')))
labels_pca_scale = kmeans_pca_scale.labels_
clusters_pca_scale = pd.concat([pca_df_scale, 
                                pd.DataFrame({'pca_clusters':labels_pca_scale})], axis=1)

### 4.19. Hyperparameter tuning on K-means after applying PCA (3)

In [None]:
parameters = {'n_clusters': [2, 3, 4, 5, 10, 20, 30]}

parameter_grid = ParameterGrid(parameters)

In [None]:
list(parameter_grid)

In [None]:
best_score = -1
model = KMeans()

### 4.20. Fine-tune the model

In [None]:
for g in parameter_grid:
    model.set_params(**g)
    model.fit(pca_df_scale)

    ss = metrics.silhouette_score(pca_df_scale, model.labels_)
    print('Parameter: ', g, 'Score: ', ss)
    if ss > best_score:
        best_score = ss
        best_grid = g

### 4.21. The best silhouette score for the K-means++ algorithm is obtained with (4) clusters

In [None]:
best_grid

### 4.22. Present a graph that was derived from PCA (3) using K-means++

In [None]:
plt.figure(figsize = (10,10))
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x=clusters_pca_scale.iloc[:,0],
                y=clusters_pca_scale.iloc[:,1], 
                hue=labels_pca_scale, palette='Set1', s=100, 
                alpha=0.2).set_title('KMeans Clusters (4) Derived from PCA', fontsize=15)
plt.legend()
plt.show()

### 4.23. Plot a 3-D graph of K-means clusters derived from PCA (3)

In [None]:
Scene = dict(xaxis = dict(title  = 'PC1'),yaxis = dict(title  = 'PC2'),zaxis = dict(title  = 'PC3'))
# Colour the points by the labels of the K-means++ model fitted on the PCA (3) data
labels = labels_pca_scale
trace = go.Scatter3d(x=pca_df_scale['pc1'], y=pca_df_scale['pc2'], z=pca_df_scale['pc3'], mode='markers',marker=dict(color = labels, colorscale='Viridis', size = 10, line = dict(color = 'gray',width = 5)))
layout = go.Layout(margin=dict(l=0,r=0),scene = Scene, height = 1000,width = 1000)
# Pass the trace list directly so the `data` name is not reused
fig = go.Figure(data=[trace], layout=layout)
fig.show()

# DBSCAN Algorithm

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core samples of high density and expands clusters from them. It works well for data that contains clusters of similar density.
Key concepts: dense region, sparse region, core point, border point, noise point, density edge, and density-connected points.

The DBSCAN algorithm uses two parameters:

**min points:** The minimum number of points (a threshold) clustered together for a region to be considered dense.

**epsilon (ε):** A distance measure that will be used to locate the points in the neighborhood of any point.
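
The two parameters map onto `min_samples` and `eps` in scikit-learn's `DBSCAN`. The toy coordinates below are illustrative only: two tight groups of four points and one isolated point, which DBSCAN marks as noise with the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of points plus one isolated point (illustrative coordinates).
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
                   [10.0, 0.0]])

# eps is the neighbourhood radius; min_samples counts the point itself.
demo_labels = DBSCAN(eps=0.3, min_samples=3).fit(points).labels_
print(demo_labels)  # [ 0  0  0  0  1  1  1  1 -1]
```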

The following is the analytical strategy used in the DBSCAN algorithm:


### 1. Apply elbow method using nearest neighbors
### 2. Applying DBSCAN to the original dataset
### 3. Construct a 3-D graph to illustrate each cluster distribution.
### 4. Applying PCA to DBSCAN (original dataset)
### 5. Hyperparameter tuning for DBSCAN(epsilon & minimum points)
### 6. Save the results to a file and then analyse them after adjusting the PCA parameters.
### 7. Construct a 3-D graph to illustrate each cluster distribution.
### 8. Apply a value range of (2,3, & 4) for the PCA n component and examine the results
### 9. Eliminate the rows containing noise (-1), store the model label back into a dataframe object



In [None]:
pca = PCA(n_components=3)
pca.fit(df_players)
pca_scale = pca.transform(df_players)
pca_df = pd.DataFrame(pca_scale, columns=['pc1', 'pc2', 'pc3'])
print(pca.explained_variance_ratio_)

### 1. Applying the elbow method using nearest neighbors (2)

In [None]:
# NearestNeighbors with n_neighbors=2 returns, for each point, the point itself (distance 0)
# and its closest other point; column 1 therefore holds the nearest-neighbour distance
neigh=NearestNeighbors(n_neighbors=2)
distance=neigh.fit(pca_df)
distances,indices=distance.kneighbors(pca_df)
sorting_distances=np.sort(distances,axis=0)
sorted_distances=sorting_distances[:,1]
plt.figure(figsize=(10,5))
plt.plot(sorted_distances)
plt.xlabel('Points sorted by nearest-neighbour distance')
plt.ylabel('Nearest-neighbour distance (candidate epsilon)')
plt.axhline(y=0.2, color='red', ls='--')
plt.grid()
plt.show()

### 2. Applying DBSCAN to the original dataset

In [None]:
dbscan = DBSCAN(eps=0.2, min_samples=4).fit(pca_df)
labels = dbscan.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(pca_df, labels))
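
Note that `silhouette_score` treats the noise label -1 as if it were an ordinary cluster. One option, sketched below on synthetic data, is to also compute the score on the non-noise points only. The blob centres, the added outliers, and the `eps` and `min_samples` values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6]],
                       cluster_std=0.5, random_state=0)
# Add a few far-away points that DBSCAN should flag as noise (illustrative values).
X_demo = np.vstack([X_demo, [[20, 20], [-15, 10], [12, -14]]])

demo_labels = DBSCAN(eps=0.8, min_samples=5).fit(X_demo).labels_
core_mask = demo_labels != -1

# With noise: the -1 points are scored as one (scattered) cluster.
score_with_noise = silhouette_score(X_demo, demo_labels)
# Without noise: only points assigned to a real cluster are scored.
score_without_noise = silhouette_score(X_demo[core_mask], demo_labels[core_mask])
print(score_with_noise, score_without_noise)
```

Which of the two scores to report is a design choice; the important point is to state whether noise points were included.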

### 3. Construct a 3-D graph to illustrate each cluster distribution (original dataset)

In [None]:
Scene = dict(xaxis = dict(title  = 'PC1'),yaxis = dict(title  = 'PC2'),zaxis = dict(title  = 'PC3'))
# Colour the points by their DBSCAN label so the clusters are visible
trace = go.Scatter3d(x=pca_df.iloc[:,0], y=pca_df.iloc[:,1], z=pca_df.iloc[:,2],
                     mode='markers',marker=dict(color=labels, colorscale='Greys', opacity=0.3, size = 10, ))
layout = go.Layout(margin=dict(l=0,r=0),scene = Scene, height = 1000,width = 1000)
fig = go.Figure(data=[trace], layout=layout)
fig.update_layout(title='DBSCAN clusters Derived from Original Data', font=dict(size=12,))
fig.show()

### 4. Applying PCA (3) to DBSCAN (original dataset)

In [None]:
pca_dbscan = PCA(n_components=3)
pca_dbscan.fit(df_players)
pca_scale_dbscan = pca_dbscan.transform(df_players)
pca_df = pd.DataFrame(pca_scale_dbscan, columns=['pc1', 'pc2', 'pc3'])
print(pca_dbscan.explained_variance_ratio_)

### 5. Hyperparameter tuning for DBSCAN (epsilon & minimum points)

In [None]:
pca_eps_values = np.arange(0.2,2.6,0.1) 
pca_min_samples = np.arange(2,11) 
pca_dbscan_params = list(product(pca_eps_values, pca_min_samples))
pca_no_of_clusters = []
pca_sil_score = []
pca_epsvalues = []
pca_min_samp = []
for p in pca_dbscan_params:
    pca_dbscan_cluster = DBSCAN(eps=p[0], min_samples=p[1]).fit(pca_df)
    pca_epsvalues.append(p[0])
    pca_min_samp.append(p[1])
    pca_no_of_clusters.append(len(np.unique(pca_dbscan_cluster.labels_)))
    pca_sil_score.append(silhouette_score(pca_df, pca_dbscan_cluster.labels_))
pca_eps_min = list(zip(pca_no_of_clusters, pca_sil_score, pca_epsvalues, pca_min_samp))
pca_eps_min_df = pd.DataFrame(pca_eps_min, columns=['no_of_clusters', 'silhouette_score', 'epsilon_values', 'minimum_points'])
pca_eps_min_df
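
A grid search like this can fail when a parameter combination yields fewer than two distinct labels, because `silhouette_score` raises an error in that case. The sketch below shows one way to guard against it on synthetic data; the parameter ranges, blob centres, and column names are assumptions for illustration, and the cluster count here excludes the noise label.

```python
import numpy as np
import pandas as pd
from itertools import product
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X_demo, _ = make_blobs(n_samples=200, centers=[[0, 0], [5, 5], [0, 5]],
                       cluster_std=0.5, random_state=0)

rows = []
for eps, min_samples in product(np.arange(0.2, 1.1, 0.2), range(2, 6)):
    grid_labels = DBSCAN(eps=float(eps), min_samples=min_samples).fit(X_demo).labels_
    n_labels = len(set(grid_labels))
    n_clusters = n_labels - (1 if -1 in grid_labels else 0)
    # silhouette_score needs between 2 and n_samples - 1 distinct labels; skip anything else.
    if n_clusters < 2 or n_labels > len(X_demo) - 1:
        continue
    rows.append({"eps": round(float(eps), 1), "min_samples": min_samples,
                 "n_clusters": n_clusters, "n_noise": int((grid_labels == -1).sum()),
                 "silhouette": silhouette_score(X_demo, grid_labels)})

grid_results = pd.DataFrame(rows).sort_values("silhouette", ascending=False)
print(grid_results.head())
```

Recording the noise count next to the silhouette score helps to spot settings that score well only because most points were discarded as noise.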

### 6. Save the result to file

In [None]:
pd.DataFrame(pca_eps_min_df).to_csv('dbscanresultpca1.csv', index=False)

### 6.1 I evaluated the obtained result to fine-tune the model

In [None]:
dbscan = DBSCAN(eps=1.3, min_samples=4).fit(pca_df)
labels = dbscan.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(pca_df, labels))

### 7. Construct a 3-D graph to illustrate (4) cluster distributions using the above parameters

In [None]:
Scene = dict(xaxis = dict(title  = 'PC1'),yaxis = dict(title  = 'PC2'),zaxis = dict(title  = 'PC3'))
labels = dbscan.labels_
trace = go.Scatter3d(x=pca_df.iloc[:,0], y=pca_df.iloc[:,1], z=pca_df.iloc[:,2], mode='markers', marker=dict(color = labels, colorscale='Viridis', size = 10, line = dict(color = 'gray',width = 5)))
layout = go.Layout(scene = Scene, height = 1000,width = 1000)
# Pass the trace list directly so the `data` name is not reused
fig = go.Figure(data=[trace], layout=layout)
fig.update_layout(title='DBSCAN clusters (4) Derived from PCA', font=dict(size=12,))
fig.show()

### 8. Apply a value range of (2,3, & 4) for the PCA n component and examine the results
Applying PCA (2) to DBSCAN (original dataset)


In [None]:
pca_dbscan = PCA(n_components=2)
pca_dbscan.fit(df_players)
pca_scale_dbscan = pca_dbscan.transform(df_players)
pca_df = pd.DataFrame(pca_scale_dbscan, columns=['pc1', 'pc2'])
print(pca_dbscan.explained_variance_ratio_)

### 8.1. Fit DBSCAN with trial epsilon and minimum-points values, then report the cluster count, noise count, and silhouette coefficient

In [None]:
dbscan = DBSCAN(eps=2, min_samples=2).fit(pca_df)
labels = dbscan.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(pca_df, labels))

### 8.2. Hyperparameter tuning for DBSCAN (epsilon & minimum points)

In [None]:
pca_eps_values = np.arange(0.2,2.1,0.1) 
pca_min_samples = np.arange(2,11) 
pca_dbscan_params = list(product(pca_eps_values, pca_min_samples))
pca_no_of_clusters = []
pca_sil_score = []
pca_epsvalues = []
pca_min_samp = []
for p in pca_dbscan_params:
    pca_dbscan_cluster = DBSCAN(eps=p[0], min_samples=p[1]).fit(pca_df)
    pca_epsvalues.append(p[0])
    pca_min_samp.append(p[1])
    pca_no_of_clusters.append(len(np.unique(pca_dbscan_cluster.labels_)))
    pca_sil_score.append(silhouette_score(pca_df, pca_dbscan_cluster.labels_))
pca_eps_min = list(zip(pca_no_of_clusters, pca_sil_score, pca_epsvalues, pca_min_samp))
pca_eps_min_df = pd.DataFrame(pca_eps_min, columns=['no_of_clusters', 'silhouette_score', 'epsilon_values', 'minimum_points'])
pca_eps_min_df

### 8.3. Save the result to file

In [None]:
pd.DataFrame(pca_eps_min_df).to_csv('dbscanresultpca2.csv', index=False)

### 8.4. I evaluated the obtained result to fine-tune the model

In [None]:
dbscan = DBSCAN(eps=1.2, min_samples=2).fit(pca_df)
labels = dbscan.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(pca_df, labels))

### 8.5. Construct a 2-D graph to illustrate (4) cluster distributions using the above parameters

In [None]:
# pca_df holds only two components here, so a 2-D scatter plot is used instead of Scatter3d
labels = dbscan.labels_
trace = go.Scatter(x=pca_df.iloc[:,0], y=pca_df.iloc[:,1], mode='markers', marker=dict(color = labels, colorscale='Viridis', size = 10, line = dict(color = 'gray',width = 1)))
layout = go.Layout(xaxis = dict(title = 'PC1'), yaxis = dict(title = 'PC2'), height = 1000,width = 1000)
fig = go.Figure(data=[trace], layout=layout)
fig.update_layout(title='DBSCAN clusters (4) Derived from PCA', font=dict(size=12,))
fig.show()

### 8.6. Applying PCA (4) to DBSCAN (original dataset)

In [None]:
pca_dbscan = PCA(n_components=4)
pca_dbscan.fit(df_players)
pca_scale_dbscan = pca_dbscan.transform(df_players)
pca_df = pd.DataFrame(pca_scale_dbscan, columns=['pc1', 'pc2','pc3','pc4'])
print(pca_dbscan.explained_variance_ratio_)

### 8.7. Fit DBSCAN with trial epsilon and minimum-points values, then report the cluster count, noise count, and silhouette coefficient

In [None]:
dbscan = DBSCAN(eps=3.3, min_samples=2).fit(pca_df)
labels = dbscan.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(pca_df, labels))

### 8.8 Hyperparameter tuning for DBSCAN (epsilon & minimum points)

In [None]:
pca_eps_values = np.arange(0.2,3.3,0.1) 
pca_min_samples = np.arange(2,11) 
pca_dbscan_params = list(product(pca_eps_values, pca_min_samples))
pca_no_of_clusters = []
pca_sil_score = []
pca_epsvalues = []
pca_min_samp = []
for p in pca_dbscan_params:
    pca_dbscan_cluster = DBSCAN(eps=p[0], min_samples=p[1]).fit(pca_df)
    pca_epsvalues.append(p[0])
    pca_min_samp.append(p[1])
    pca_no_of_clusters.append(len(np.unique(pca_dbscan_cluster.labels_)))
    pca_sil_score.append(silhouette_score(pca_df, pca_dbscan_cluster.labels_))
pca_eps_min = list(zip(pca_no_of_clusters, pca_sil_score, pca_epsvalues, pca_min_samp))
pca_eps_min_df = pd.DataFrame(pca_eps_min, columns=['no_of_clusters', 'silhouette_score', 'epsilon_values', 'minimum_points'])
pca_eps_min_df

### 8.9. Save the result to file

In [None]:
pd.DataFrame(pca_eps_min_df).to_csv('dbscanresultpca3.csv', index=False)

### 8.10. I evaluated the obtained result to fine-tune the model

In [None]:
dbscan = DBSCAN(eps=1.7, min_samples=3).fit(pca_df)
labels = dbscan.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(pca_df, labels))
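
The silhouette coefficient above is computed over every point, including those DBSCAN labelled as noise (-1), so the noise points are scored as if they formed a cluster of their own. A possible alternative, sketched here as an assumption rather than as the notebook's method, is to score only the non-noise points. `silhouette_without_noise` is a hypothetical helper and the blobs are synthetic stand-ins for `pca_df`.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def silhouette_without_noise(points, labels):
    """Silhouette score over non-noise points; None if fewer than two clusters remain."""
    points = np.asarray(points)
    labels = np.asarray(labels)
    mask = labels != -1  # keep only points assigned to a cluster
    if len(set(labels[mask])) < 2:
        return None
    return silhouette_score(points[mask], labels[mask])

# Synthetic stand-in for pca_df
demo_points, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=1)
demo_labels = DBSCAN(eps=0.5, min_samples=5).fit(demo_points).labels_
score = silhouette_without_noise(demo_points, demo_labels)
print("Silhouette (noise excluded):", score)
```

Reporting this value next to the noise count keeps the comparison honest, since a setting can reach a high score simply by discarding many points as noise.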

### 8.11. Construct a 3-D graph to illustrate the cluster distribution using the above parameters

In [None]:
import plotly.graph_objects as go

Scene = dict(xaxis = dict(title  = 'PC1'),yaxis = dict(title  = 'PC2'),zaxis = dict(title  = 'PC3'))
labels = dbscan.labels_
# Colour each point by its DBSCAN label (noise points carry the label -1)
trace = go.Scatter3d(x=pca_df.iloc[:,0], y=pca_df.iloc[:,1], z=pca_df.iloc[:,2], mode='markers',marker=dict(color = labels, colorscale='Viridis', size = 10, line = dict(color = 'gray',width = 5)))
layout = go.Layout(scene = Scene, height = 1000,width = 1000)
data = [trace]
fig = go.Figure(data = data, layout = layout)
# Use the cluster count computed in the previous cell instead of a hard-coded number
fig.update_layout(title=f"DBSCAN Clusters ({n_clusters_}) Derived from PCA", font=dict(size=12))
fig.show()

### 8.12. For the DBSCAN, PCA with n component value of 2 produced the best silhouette score.

In [None]:
pca_dbscan = PCA(n_components=2)
pca_dbscan.fit(df_players)
pca_scale_dbscan = pca_dbscan.transform(df_players)
pca_df = pd.DataFrame(pca_scale_dbscan)
print(pca_dbscan.explained_variance_ratio_)

### 8.13. Evaluate the obtained result to fine-tune the model

In [None]:
dbscan = DBSCAN(eps=1.2, min_samples=2).fit(pca_df)
labels = dbscan.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(pca_df, labels))

### 9. Eliminate the rows containing noise (-1), store the model label back into a dataframe object

In [None]:
df_players["Labels"] = labels
df_players.head(10)

### 9.1.  Eliminate rows with noise using DataFrame Object

In [None]:
n_noise_ = list(labels).count(-1)
print('Count:', n_noise_)
indexNames = df_players[df_players['Labels'] == -1 ].index
df_players.drop(indexNames , inplace=True)
df_players.head(10)

### 9.2. Confirm the size of the dataset after dropping the 5 rows with noise (-1)
**Removing the noisy data points has altered the dataset, so I will have to re-import it in order to build the classification model.**

In [None]:
df_players.shape

### 9.3. Examine the labels

In [None]:
df_players['Labels'].unique()

## Classification Model

This is the second phase of the experiment, which involves building a machine learning pipeline using _**Support Vector Machine**_ and _**Random Forest**_. I am going to use the cluster labels from the K-means algorithm because its results were the most useful from a business point of view: the K-means clusters were evenly distributed, which gives a clear picture of the football players.

The DBSCAN results were not evenly dispersed; over 90% of the football players were concentrated in a single cluster. This would not contribute to the model's core functionality.

In [None]:
data=pd.read_csv("data.csv")
data.head()

### Examine the size of the dataset to ensure that no changes occurred 

In [None]:
data.shape

### Check for NaN (Not a Number)

In [None]:
data.isnull().sum()

### To remove columns that will not be used in the model, I first need to strip the spaces from the column names

In [None]:
data.columns = [c.replace(' ', '') for c in data.columns]
data.columns

### Drop any columns that aren't necessary for the model

In [None]:
data=data.drop(['Name','Unnamed:0','ID','Photo','Flag','Overall','ClubLogo', 'Special', 'InternationalReputation', 'WeakFoot',
               'SkillMoves','WorkRate','BodyType','RealFace','JerseyNumber','Joined','LoanedFrom','ContractValidUntil',
                'Weight','Crossing','Finishing','HeadingAccuracy','ShortPassing','Volleys','Dribbling','Curve','FKAccuracy',
                'LongPassing','BallControl','Acceleration','SprintSpeed','Agility','Reactions','Balance','ShotPower','Jumping',
                'Stamina','Strength','LongShots','Aggression','Interceptions','Positioning','Vision','Penalties','Composure',
                'Marking','SlidingTackle','StandingTackle','ReleaseClause'], axis=1)

data.head(10)

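### Fill the missing values (NaN) with each column's mean

In [None]:
# numeric_only=True restricts the means to numeric columns, which recent pandas versions require when text columns are present
column_means = data.mean(numeric_only=True)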
data = data.fillna(column_means)
data

### Remove the euro symbol and unit letters from the players' wages and values

In [None]:
data.Wage = data.Wage.str.replace("€","")
data.Wage = data.Wage.str.replace("K","").astype("float")
data.Wage.head() 

In [None]:
data.Value = data.Value.str.replace("€","")
data.Value = data.Value.str.replace("M","")
data.Value = data.Value.str.replace("K","").astype("float")
data.Value.head() 
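
The cells above strip the "M" and "K" suffixes without rescaling. If the `Value` column mixes both suffixes, a player worth millions and a player worth thousands end up on different scales after the conversion. The sketch below shows one possible unit-aware conversion; `parse_euro_amount` is a hypothetical helper, and the sample strings only assume the "€110.5M" / "€600K" format suggested by the replacements above.

```python
import pandas as pd

def parse_euro_amount(text):
    """Convert strings such as '€110.5M' or '€600K' to a float number of euros."""
    multipliers = {"K": 1_000.0, "M": 1_000_000.0}
    cleaned = str(text).replace("€", "").strip()
    # Scale by the suffix when one is present, otherwise parse the number as-is
    if cleaned and cleaned[-1] in multipliers:
        return float(cleaned[:-1]) * multipliers[cleaned[-1]]
    return float(cleaned)

# Illustrative sample values, not rows taken from the dataset
sample = pd.Series(["€110.5M", "€600K", "€0"])
print(sample.map(parse_euro_amount).tolist())
```

Applied as `data.Value.map(parse_euro_amount)`, this would put every value in the same unit before scaling, which matters because the clustering is distance-based.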

### One-hot encoding
This converts every categorical variable into indicator variables (0s and 1s).

In [None]:
dummies=pd.get_dummies(data)
dummies

### Store the one-hot encoded DataFrame in variable X

In [None]:
X=dummies
X

### Normalize the dataset

In [None]:
players_scale = preprocessing.StandardScaler().fit(X).transform(X)
players_scale[0:5]

In [None]:
df_players = pd.DataFrame(players_scale, columns=X.columns)
df_players.head()

### Applying PCA to K-means++

In [None]:
#n_components=900 because we have 900 features in the dataset
pca = PCA(n_components=900)
pca.fit(df_players)
variance = pca.explained_variance_ratio_
var = np.cumsum(np.round(variance, 3)*100)
plt.figure(figsize=(12,6))
plt.ylabel('% Variance Explained')
plt.xlabel('# of Features')
plt.title('PCA Analysis')
plt.ylim(0,100.5)
plt.plot(var)

### Examine PCA with n_components=3, given that it produced the best silhouette score previously

In [None]:
pca = PCA(n_components=3)
pca_scale = pca.fit_transform(df_players)
pca_df_scale = pd.DataFrame(pca_scale,  columns=['pc1','pc2','pc3'])
print(pca.explained_variance_ratio_)
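
The cumulative-variance plot can also be turned into a number: the smallest count of components that reaches a chosen share of the variance. This is a minimal sketch, assuming a 90% threshold chosen only for illustration; `components_for_variance` is a hypothetical helper and the synthetic features stand in for `df_players`. The notebook itself selects 3 components based on the silhouette score, which is a different criterion.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def components_for_variance(features, threshold=0.90):
    """Smallest number of principal components whose cumulative explained variance reaches threshold."""
    pca_full = PCA().fit(features)
    cumulative = np.cumsum(pca_full.explained_variance_ratio_)
    # searchsorted returns the first index where cumulative >= threshold; +1 turns it into a count
    n_needed = min(int(np.searchsorted(cumulative, threshold)) + 1, len(cumulative))
    return n_needed, cumulative

# Synthetic stand-in for the scaled player features
demo_features, _ = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)
demo_scaled = StandardScaler().fit_transform(demo_features)
n_needed, cumulative = components_for_variance(demo_scaled, threshold=0.90)
print(n_needed, np.round(cumulative, 3))
```

Comparing this count with the 3 components used here shows how much variance is traded away for a clustering that is easier to visualise.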

### Choosing the number of clusters with the elbow method (inertia)

In [None]:
sse = []
k_list = range(1, 15)

for k in k_list:
    km = KMeans(n_clusters=k)
    km.fit(pca_df_scale)
    # inertia_ is the within-cluster sum of squared distances (SSE)
    sse.append(km.inertia_)

pca_results = pd.DataFrame({'Cluster': list(k_list), 'SSE': sse})
plt.figure(figsize=(12,6))
plt.plot(pca_results['Cluster'], pca_results['SSE'], marker='o', color='green')
plt.title('Optimal Number of Clusters using Elbow Method (PCA_Scaled Data)')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
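
The elbow plot above is based on inertia. A silhouette score per value of k offers a complementary check, since it has a defined range and can be compared directly. The sketch below assumes synthetic blobs in place of `pca_df_scale`, and `silhouette_by_k` is a hypothetical helper.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def silhouette_by_k(points, k_values, random_state=42):
    """Fit K-means++ for each k and return {k: silhouette score}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=random_state)
        labels = model.fit_predict(points)
        scores[k] = silhouette_score(points, labels)
    return scores

# Synthetic stand-in for pca_df_scale (three components)
demo_points, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, n_features=3, random_state=7)
scores = silhouette_by_k(demo_points, range(2, 8))  # the silhouette needs at least 2 clusters
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

When the elbow and the silhouette curve disagree, the business interpretation of the clusters is a reasonable tie-breaker.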

### Applying PCA to K-means++
Recall that PCA with 3 components gave us the best silhouette score.

In [None]:
kmeans_pca_scale = KMeans(n_clusters=4, n_init=100, max_iter=400, init='k-means++', random_state=42).fit(pca_df_scale)

print('KMeans PCA Scaled Silhouette Score: {}'.format(metrics.silhouette_score(pca_df_scale, kmeans_pca_scale.labels_, metric='euclidean')))
labels_pca_scale = kmeans_pca_scale.labels_
clusters_pca_scale = pd.concat([pca_df_scale, 
                                pd.DataFrame({'pca_clusters':labels_pca_scale})], axis=1)

### Execute the K-means++ model

In [None]:
clusterNum = 4
k_means = KMeans(init = "k-means++", n_clusters = clusterNum, n_init = 12)
k_means.fit(pca_df_scale)
labels = k_means.labels_
print(labels)

### Create a new column for the clustered labels

In [None]:
df_players["Clusters"] = labels
df_players.head(20)

### Examine the size of the dataset

In [None]:
df_players.shape

### Separate the data into features (X) and targets (clusters (y)).

In [None]:
X = df_players.drop('Clusters', axis=1)
y = df_players['Clusters']

### Over-sampling and under-sampling on unbalanced data

In [None]:

print(imblearn.__version__)

# SMOTE is instantiated here but never applied; only ENN under-sampling is used below
oversample = SMOTE()
enn = EditedNearestNeighbours()

# Label-encode the target variable
y = LabelEncoder().fit_transform(y)

# Under-sample with Edited Nearest Neighbours
X, y = enn.fit_resample(X, y)

# Summarize the class distribution
counter = Counter(y)
for k, v in counter.items():
    per = v / len(y) * 100
    print('Class=%d, n=%d (%.3f%%)' % (k, v, per))

# Plot the distribution
plt.bar(counter.keys(), counter.values())
plt.show()

### Train Test Split divides the dataset into 70 percent for training and 30 percent for testing.

In [None]:
x_train, x_test, y_train, y_test = train_test_split( X, y, test_size=0.3, random_state=50)
print ('Train set:', x_train.shape,  y_train.shape)
print ('Test set:', x_test.shape,  y_test.shape)
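
If the cluster sizes are uneven, a plain random split can leave a small cluster under-represented in the test set. Passing `stratify` to `train_test_split` keeps the class proportions similar in both parts. This is a sketch on toy arrays, not the split used in this notebook; `demo_X` and `demo_y` are illustrative stand-ins for `X` and `y`.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 100 samples in four deliberately unbalanced classes
demo_X = np.arange(200).reshape(100, 2)
demo_y = np.array([0] * 60 + [1] * 25 + [2] * 10 + [3] * 5)

# stratify=demo_y preserves the class proportions in both the train and test parts
Xtr, Xte, ytr, yte = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=50, stratify=demo_y
)
print("Train:", Counter(ytr))
print("Test:", Counter(yte))
```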

### Insights on the clustering pattern

In [None]:
# Create the figure before plotting so the chart is drawn on it rather than on a new empty figure
plt.figure(figsize=(12, 6))
ax = sns.countplot(x = df_players['Clusters'])
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
ax.set_xlabel('Clusters') 
ax.set_ylabel('Number of players')
plt.tight_layout()
#plt.title("Visualization of players based on their position")
plt.show()

## Support Vector Machine 
To build this model, I used the Support Vector Machine classifier.

In [None]:
clf = SVC()

### Grid search cross validation hyperparameter tuning will be used to improve our model's performance accuracy.

In [None]:
# defining parameter range
param_grid = {'C': [0.1, 1, 10, 100, 1000], 
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']} 
  
grid = GridSearchCV(clf, param_grid, refit = True, verbose = 3)
  
# fitting the model for grid search
grid.fit(x_train, y_train)

### This will produce the best parameters, estimator, and score for the SVM classifier

In [None]:
print('Best parameter:',grid.best_params_)
print('Grid best estimator:',grid.best_estimator_)
print('Best score:',grid.best_score_)

### Apply the above-mentioned parameters to the SVM classifier

In [None]:
clf = SVC(C=1, gamma=0.0001, kernel= 'rbf')
clf.fit(x_train, y_train) 
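
The cell above types the chosen `C` and `gamma` values in by hand. Because the grid search was created with `refit=True`, the tuned model is also available directly, which avoids copying parameters manually. The sketch below illustrates this on synthetic data with a deliberately small grid; `demo_X`, `demo_y`, and `small_grid` are illustrative stand-ins, not the notebook's data or grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the player features and cluster labels
demo_X, demo_y = make_classification(n_samples=200, n_features=6, n_informative=4,
                                     n_classes=2, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(demo_X, demo_y, test_size=0.3, random_state=50)

small_grid = {"C": [0.1, 1, 10], "gamma": [0.1, 0.01], "kernel": ["rbf"]}
search = GridSearchCV(SVC(), small_grid, cv=3, refit=True).fit(Xtr, ytr)

# With refit=True, best_estimator_ is already retrained on the whole training set
tuned_clf = search.best_estimator_

# Equivalent manual construction from the reported best parameters
manual_clf = SVC(**search.best_params_).fit(Xtr, ytr)
print(search.best_params_, tuned_clf.score(Xte, yte))
```

Using `best_estimator_` or `**best_params_` keeps the final model in sync with the search if the grid is changed later.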

### Apply the predict method to the test set

In [None]:
y_pred = clf.predict(x_test)

### Evaluate the model on the test set and cross-validate it on the training set

In [None]:
print(classification_report(y_test, y_pred))

print('Accuracy of SVM classifier on the training set: {:.2f}'.format(clf.score(x_train, y_train)))
print('Accuracy of SVM classifier on the test set: {:.2f}'.format(clf.score(x_test, y_test)))

# 10-fold cross-validation on the training set; scores far below the training accuracy would suggest overfitting

score = cross_val_score(clf, x_train, y_train, cv=10) 
print('Cross-validation score: ',score)
print('Cross-validation mean score: ',score.mean())

### Summarize the model's performance using different classification metrics

In [None]:
def summarize_classification(y_test,y_pred,avg_method='weighted'):
    acc = accuracy_score(y_test, y_pred,normalize=True)
    num_acc = accuracy_score(y_test, y_pred,normalize=False)
    f1= f1_score(y_test, y_pred, average=avg_method)
    prec = precision_score(y_test, y_pred, average=avg_method)
    recall = recall_score(y_test, y_pred, average=avg_method)
    jaccard = jaccard_score(y_test, y_pred, average=avg_method)
    
    print("Length of testing data: ", len(y_test))
    print("accuracy_count : " , num_acc)
    print("accuracy_score : " , acc)
    print("f1_score : " , f1)
    print("precision_score : " , prec)
    print("recall_score : ", recall)
    print("jaccard_score : ", jaccard)
    
summarize_classification(y_test, y_pred)

### To further evaluate our performance findings, let's build a confusion matrix that describes the false positive, true negative, true positive, and false negative

In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred, labels=[0,1])
np.set_printoptions(precision=2)

print (classification_report(y_test, y_pred))

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['False(0)','True(1)'],normalize= False,  title='Confusion matrix')
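
The matrix above is restricted to labels 0 and 1, while the clustering step produced four clusters. A possible way to inspect every class at once is to derive the label list from the data itself. The arrays below are toy values standing in for `y_test` and `y_pred`, chosen only to illustrate the call.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels standing in for y_test / y_pred with four clusters (illustrative values)
demo_true = np.array([0, 0, 1, 1, 2, 2, 3, 3, 3, 1])
demo_pred = np.array([0, 0, 1, 2, 2, 2, 3, 3, 0, 1])

# Collect every label that occurs in either array so no class is left out
all_labels = np.unique(np.concatenate([demo_true, demo_pred]))
full_cm = confusion_matrix(demo_true, demo_pred, labels=all_labels)
print(full_cm)

# Per-class recall: correct predictions (diagonal) divided by the actual count per class (row sums)
per_class_recall = full_cm.diagonal() / full_cm.sum(axis=1)
print(np.round(per_class_recall, 2))
```

The resulting matrix can be passed to `plot_confusion_matrix` with `classes=all_labels` to draw all four clusters in one figure.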

### Comparison of the actual test set to the predicted labels

In [None]:
# Actual and predicted labels side by side
pred_results = pd.DataFrame({'y_test': pd.Series(y_test),
                             'y_pred': pd.Series(y_pred)})

pred_results.sample(10)

## Random Forest 

In [None]:
rf=RandomForestClassifier()
rf

### Grid search cross validation hyperparameter tuning will be used to improve our model's performance accuracy

In [None]:
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 1000]
}

In [None]:
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 10, n_jobs = -1, verbose = 2)

In [None]:
grid_search.fit(x_train,y_train)

### This will recommend the best parameters, estimator, and score for the Random Forest classifier

In [None]:
print('Best parameter:', grid_search.best_params_)
print('Best grid estimator:', grid_search.best_estimator_)
print('Best score', grid_search.best_score_)

### I applied the above-mentioned parameters for the Random Forest classifier

In [None]:
rf=RandomForestClassifier(bootstrap=True, max_depth=100, max_features=3, 
                          min_samples_leaf=3, min_samples_split=8,
                          n_estimators=100).fit(x_train,y_train)
rf

### I applied the predict method to the test set

In [None]:
y_pred = rf.predict(x_test)

### Evaluate the model on the test set and cross-validate it on the training set

In [None]:
print(classification_report(y_test, y_pred))

print('Accuracy of Random Forest classifier on the training set: {:.2f}'.format(rf.score(x_train, y_train)))
print('Accuracy of Random Forest classifier on the test set: {:.2f}'.format(rf.score(x_test, y_test)))

# 5-fold cross-validation on the training set; tree-based models can overfit, so compare these scores with the training accuracy

score = cross_val_score(rf, x_train, y_train, cv=5) 
print('Cross-validation score: ',score)
print('Cross-validation mean score: ',score.mean())

### Summarize the model's performance using different classification metrics

In [None]:
def summarize_classification(y_test,y_pred,avg_method='weighted'):
    acc = accuracy_score(y_test, y_pred,normalize=True)
    num_acc = accuracy_score(y_test, y_pred,normalize=False)
    f1= f1_score(y_test, y_pred, average=avg_method)
    prec = precision_score(y_test, y_pred, average=avg_method)
    recall = recall_score(y_test, y_pred, average=avg_method)
    jaccard = jaccard_score(y_test, y_pred, average=avg_method)
    
    print("Length of testing data: ", len(y_test))
    print("accuracy_count : " , num_acc)
    print("accuracy_score : " , acc)
    print("f1_score : " , f1)
    print("precision_score : " , prec)
    print("recall_score : ", recall)
    print("jaccard_score : ", jaccard)
    
summarize_classification(y_test, y_pred)

### Comparison of the actual test set to the predicted labels

In [None]:
# Actual and predicted labels side by side
pred_results = pd.DataFrame({'y_test': pd.Series(y_test),
                             'y_pred': pd.Series(y_pred)})

pred_results.sample(10)

### To further evaluate our performance findings, let's build a confusion matrix that describes the false positive, true negative, true positive, and false negative.

In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred, labels=[0,1])
np.set_printoptions(precision=2)

print (classification_report(y_test, y_pred))

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['False(0)','True(1)'],normalize= False,  title='Confusion matrix')

### Summary of the confusion matrix
Look at the first row. It contains the players whose actual label in the test set is 0 (False). As you can see, 843 of the 1,441 players have an actual label of 0, and the classifier predicted all 843 of them as 0 and none of them as 1.

In other words, no player whose actual label was 0 was misclassified as 1, so the classifier performed excellently on the first row.

What about the players whose actual label is 1 (True)?

Look at the second row. There are 599 players whose actual label is 1. The classifier correctly identified 577 of them as 1 and wrongly predicted 21 of them as 0. It has therefore done a good job of predicting the players with label 1.

### Final review of the results and evaluations
This provides an understanding of the business perspective of our model. The clustering method (K-means) was able to categorise players according to their attributes. We may also conclude that the algorithm identified the average, undervalued, and overperforming players. The re-evaluation of the players also produced strong results, with 99 percent accuracy on the test set.

In [None]:
data["Clusters"] = labels
data.head(20)

### Save the clustered data to file for further evaluation

In [None]:
pd.DataFrame(data).to_csv('clustereddata.csv', index=False)
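
To read the clusters from a business angle, it can help to summarise each cluster and compare individual players against their cluster average. The sketch below is an assumption about how that might look: the toy frame borrows the column names `Potential`, `Value`, and `Clusters`, the numbers are invented, and the "below half the cluster mean" rule is a hypothetical screening threshold, not a finding of this notebook.

```python
import pandas as pd

# Toy frame reusing column names from `data`; the numbers are invented for illustration
toy = pd.DataFrame({
    "Potential": [90, 88, 70, 72, 65, 80],
    "Value": [100.0, 20.0, 5.0, 6.0, 1.0, 3.0],
    "Clusters": [0, 0, 1, 1, 2, 2],
})

# Average profile of each cluster
profile = toy.groupby("Clusters")[["Potential", "Value"]].mean()
print(profile)

# Hypothetical screening rule: flag players priced below half of their cluster's mean value
toy["cluster_mean_value"] = toy.groupby("Clusters")["Value"].transform("mean")
toy["below_cluster_value"] = toy["Value"] < 0.5 * toy["cluster_mean_value"]
print(toy[toy["below_cluster_value"]])
```

Players flagged this way are only candidates for a closer look; the threshold would need to be agreed with the managers who use the output.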

In [None]:
result=[{ 'Accuracy Score':'99%',
         'F1 Score': '99%',
         'Precision Score': '99%',
         'Recall Score': '99%',
         'Jaccard Score': '99%'},
        {'Accuracy Score':'98%',
         'F1 Score': '98%',
         'Precision Score': '98%',
         'Recall Score': '98%',
         'Jaccard Score': '97%'}]
df=pd.DataFrame(result, index=['Support Vector Machine','Random Forest'])
df.head()

# Conclusion 

K-means++ was the method adopted in this research. At first, one could have assumed that its poor performance was due to the dataset's susceptibility to noise, its high dimensionality, or even the cluster shape. Applying PCA before K-means++ resulted in a more balanced and business-friendly solution. The K-means++ algorithm satisfied the requirements of this research by giving managers insight into players' skill diversity, for example which players are underperforming, undervalued, average, or overperforming. It was successful in identifying possible groupings of players based on various attributes. Managers can now understand how the model works and make sound recommendations based on their preferences. The final phase of the project involves re-evaluating the players using a supervised machine learning technique. But, before I get into the next phase, let's have a look at the intelligent distribution of clusters that K-means++ has produced.

Having said that, I went ahead and used DBSCAN, a density-based clustering technique commonly employed on non-linear or non-spherical datasets. It requires two parameters: epsilon and minimum points. I also used PCA to reduce the number of dimensions to 3 principal components. Using the elbow method, I estimated an epsilon value of 0.2 and a minimum points value of 4. With these parameters I obtained 72 clusters, 1,406 noise points, and a silhouette score of -0.55. Admittedly, the findings were unimpressive. To fine-tune the epsilon and minimum points values, I used an iterative approach and chose an epsilon value of 1.2 and a minimum points value of 2. This produced 6 valid clusters, 5 noise points, and a silhouette score of 0.46. However, when the generated clusters were plotted, the first cluster contained 90% of the players. From a business perspective, I would like the clusters to be more evenly distributed so that they give us useful information about the players. Perhaps DBSCAN is not the best clustering technique for this dataset.

On the test set, the classification model generated excellent results. Because there is adequate data to train on, the model does not appear prone to overfitting. On the test set, Support Vector Machine and Random Forest achieve 99 and 98 percent accuracy, respectively, as reported in the results table above. For the SVM classifier, the F1 score and recall reached 100 percent for all classes. This addresses the manager's re-evaluation problem by predicting which category a given player’s skill set belongs to. Managers can now determine whether a player's release clause is genuinely worth the amount asked on the transfer market and which players should be rotated into other positions. The model helps managers diagnose a lack of skill diversity and can potentially influence transfer decisions. It now offers recommendations on players based on the manager's preferences.