# Explore football skills and cluster football players based on their attributes


The attributes used in the project are:
●	Name: Name of the player.
●	Age: Age of the player.
●	Height: Height of the player in inches (transformed to centimeters in preprocessing).
●	Overall: General performance quality and value of the player, representing key positional skills and international reputation, rated between 1-99. The Overall attribute is used only in the preprocessing and discussion stages, because including it in modelling could let this single feature dominate the result. The aim of the project is not simply to sort and categorize players by overall talent and international reputation, but to cluster them based on their whole skill set.
●	Potential: Maximum Overall rating a player is expected to reach at the peak of his career, rated between 1-99.
●	PreferredFoot: Right or Left. Label encoder is applied as 0 for left and 1 for right.
●	WeakFoot: Represents how well a player uses his weak foot (e.g. left for righties) rated between 1 to 5.
●	WorkRate: Degree of effort the player puts into attack and defense, rated as low, medium or high. This feature is split into two new features, AttackWorkRate and DefenseWorkRate, and each is encoded as 0 for low, 0.5 for medium and 1 for high.
●	Position: Position of the player on the pitch, which determines his role and responsibilities in the team. Forward positions in football and in FIFA 19 can be grouped as striker (ST: center striker, RS: right striker, LS: left striker), forward (CF: center forward, RF: right forward, LF: left forward) and winger (RW: right winger, LW: left winger). The word "forward" is used both as a general term and for a specific position. Strikers are positioned in front of forwards and wingers, very close to the opposing goal. Their main responsibilities are attacking and scoring goals, which is why their ball control, shooting and finishing skills are expected to be good. Center forwards are positioned right behind the strikers. They are expected to receive the ball from teammates and either provide assists or score goals. In addition to the skills expected from strikers, they have to be good at passing. Right forwards and left forwards are positioned to the right and left of the center forwards, with the same expectations. Wingers are positioned near the touchlines to create chances for strikers and forwards from the right and left sides of the field through breakthroughs and crosses, and to score goals. They are expected to be good at dribbling, acceleration, passing and crossing. Positions are used only in the preprocessing and discussion stages.
●	ST: Positional skill. Player’s general ability when playing in ST position rated between 1-99.
●	RS: Positional skill. Player’s general ability when playing in RS position rated between 1-99.
●	LS: Positional skill. Player’s general ability when playing in LS position rated between 1-99.
●	CF: Positional skill. Player’s general ability when playing in CF position rated between 1-99.
●	RF: Positional skill. Player’s general ability when playing in RF position rated between 1-99.
●	LF: Positional skill. Player’s general ability when playing in LF position rated between 1-99.
●	RW: Positional skill. Player’s general ability when playing in RW position rated between 1-99.
●	LW: Positional skill. Player’s general ability when playing in LW position rated between 1-99.
●	Crossing: Crossing skill of the player rated between 1-99. A cross is a long-range pass from the wings to the center.
●	Finishing: Finishing skill of the player rated between 1-99. Finishing in football refers to completing an attack by scoring a goal.
●	HeadingAccuracy: Player’s accuracy to pass or shoot by using his head rated between 1-99.
●	ShortPassing: Player’s accuracy for short passes rated between 1-99.
●	LongPassing: Player’s accuracy for long passes rated between 1-99.
●	Dribbling: Dribbling skill of the player rated between 1-99. Dribbling is carrying the ball without losing it while moving in one particular direction.
●	SprintSpeed: Speed rate of the player rated between 1-99.
●	Acceleration: Shows how fast a player can reach his maximum sprint speed rated between 1-99.
●	FKAccuracy: Player’s accuracy to score free kick goals rated between 1-99.
●	BallControl: Player’s ability to control the ball rated between 1-99.
●	Balance: Player’s ability to remain steady while running, carrying and controlling the ball rated between 1-99.
●	ShotPower: Player’s strength level of shooting the ball rated between 1-99.
●	Jumping: Player’s jumping skill rated between 1-99.
●	Penalties: Player’s accuracy to score goals from penalty rated between 1-99.
●	Strength: Physical strength of the player rated between 1-99.
●	Agility: Gracefulness and quickness of the player while controlling the ball rated between 1-99.
●	Reactions: Acting speed of the player to what happens in his environment rated between 1-99.
●	Aggression: Aggression level of the player while pushing, pulling and tackling rated between 1-99.
●	Positioning: Player’s ability to place himself in the right position to receive the ball or score goals rated between 1-99.
●	Vision: Player’s mental awareness about the other players in the team for passing rated between 1-99.
●	Volleys: Player’s ability to perform volleys rated between 1-99.
●	LongShots: Player’s accuracy of shots from long distances rated between 1-99.
●	Stamina: Player’s ability to sustain his stamina level during the match rated between 1-99. Players with lower stamina get tired fast.
●	Composure: Player’s ability to control his calmness and frustration during the match rated between 1-99.
●	Curve: Player’s ability to curve the ball while passing or shooting rated between 1-99.
●	Interceptions: Player’s ability to intercept the ball while opposite team’s players are passing rated between 1-99. It is a defensive skill.
●	StandingTackle: Player’s ability to perform tackle (take the ball from the opposite player) while standing rated between 1-99. It is a defensive skill.
●	SlidingTackle: Player’s ability to perform tackle by sliding rated between 1-99. It is a defensive skill.
●	Marking: Player’s ability to apply strategies to prevent opposing team from taking the ball rated between 1-99. It is a defensive skill.  
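
The encodings described above for PreferredFoot, WorkRate and Height can be sketched roughly as follows. This is only an illustration on a toy frame: the column name `HeightInches`, the output name `HeightCm` and the `Attack/Defense` string format of WorkRate are assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy frame; column names and the "Attack/Defense" string format are assumptions.
players = pd.DataFrame({
    "PreferredFoot": ["Right", "Left", "Right"],
    "WorkRate": ["High/Medium", "Low/Low", "Medium/High"],
    "HeightInches": [70.0, 74.0, 68.0],
})

# PreferredFoot: 0 for left, 1 for right.
players["PreferredFoot"] = players["PreferredFoot"].map({"Left": 0, "Right": 1})

# WorkRate: split into attack and defense parts, then map to 0 / 0.5 / 1.
rate_map = {"Low": 0.0, "Medium": 0.5, "High": 1.0}
work_rate_parts = players["WorkRate"].str.split("/", expand=True)
players["AttackWorkRate"] = work_rate_parts[0].str.strip().map(rate_map)
players["DefenseWorkRate"] = work_rate_parts[1].str.strip().map(rate_map)

# Height: inches to centimeters (1 inch = 2.54 cm).
players["HeightCm"] = players["HeightInches"] * 2.54
```

Mapping with an explicit dictionary keeps the low < medium < high order, which a plain label encoder would not guarantee.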



In [1]:
#import necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler,LabelEncoder,OneHotEncoder,OrdinalEncoder,MinMaxScaler,RobustScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

In [3]:
# Load the dataset
data=pd.read_csv(r'/content/players_20.csv', encoding='latin-1') # or encoding='ISO-8859-1'



### Basic checks & EDA

In [None]:
!pip install sweetviz


In [None]:
#View sample data
data.head(10)

In [None]:
#checking the dimension
data.shape

In [None]:
#checking for null values

# Total number of missing cells in the whole dataset
print(data.isnull().sum().sum())

In [None]:
#checking for duplicate values
data.duplicated().sum()

### Exploratory Data Analysis

In [None]:
#variable with floating values
data_flt = list(data.select_dtypes('float').columns)
#variable with integer values
data_int = list(data.select_dtypes('integer').columns)
#variable with string values
data_obj = list(data.select_dtypes('object').columns)

In [None]:
# Basic information about the dataset
data.info()
data[data_flt].info()

In [None]:
data[data_int].info()

In [None]:
data[data_obj].info()

In [None]:
#Key description\statistics of features of datatype float
data[data_flt].describe().T

In [None]:
data[data_flt].corr()

In [None]:
#Key description\statistics of features of datatype integer
data[data_int].describe().T

In [None]:
#Key description\statistics of features of datatype object
data[data_obj].describe().T

In [None]:
import sweetviz as sv
my_report=sv.analyze(data[data_flt])

my_report.show_html()

In [None]:
import sweetviz as sv
my_report=sv.analyze(data[data_int])

my_report.show_html()

In [None]:
import sweetviz as sv
my_report=sv.analyze(data[data_obj])

my_report.show_html()

### Preprocessing

In [None]:
# Remove columns that have more than 50% null values or are not relevant for clustering
k3=['sofifa_id','player_url','short_name','long_name','dob','club','value_eur','wage_eur','work_rate','real_face',
   'player_tags','team_jersey_number','joined','contract_valid_until','nation_position','nation_jersey_number','pace',
   'player_traits','nationality','gk_diving','gk_handling','loaned_from','gk_kicking','gk_reflexes','gk_speed','gk_positioning','release_clause_eur','body_type','player_positions']
data1=data.drop(k3,axis=1)
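
The 50% rule mentioned in the comment above could also be checked programmatically instead of listing the columns by hand. The snippet below is a minimal sketch on a made-up toy frame; the names `toy`, `missing_share` and `mostly_missing` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the player data (values are made up).
toy = pd.DataFrame({
    "loaned_from": [np.nan, np.nan, np.nan, "Club A"],
    "shooting": [70.0, np.nan, 65.0, 80.0],
    "age": [21, 25, 30, 33],
})

# Fraction of missing values per column.
missing_share = toy.isnull().mean()

# Columns where more than half of the values are missing.
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
toy_reduced = toy.drop(columns=mostly_missing)
```

On the real data, `data.isnull().mean()` would give the same per-column share.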


In [None]:
# Positional rating columns stored as strings (object dtype), to be converted to numeric
col_val=['ls', 'st', 'rs', 'lw', 'lf', 'cf', 'rf', 'rw', 'lam', 'cam', 'ram', 'lm', 'lcm',
                'cm', 'rcm', 'rm', 'lwb', 'ldm', 'cdm', 'rdm', 'rwb', 'lb', 'lcb', 'cb', 'rcb', 'rb']

In [None]:
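# The columns listed above hold the positional skill ratings of each player. Goalkeepers (player_positions GK)
# have no values in these non-goalkeeping columns, so missing entries are set to zero.

import re
def col_typ_chg(col_val):
    # Missing rating (goalkeepers): use 0. pd.isna is more reliable than "is np.nan".
    if pd.isna(col_val):
        return 0
    # Split e.g. "89+2" into the base rating and its modifier
    col_val_spl = re.split('[-+]', col_val)
    col_val_int = list(map(int, col_val_spl))
    # Plain rating without a modifier, e.g. "89"
    if len(col_val_int) == 1:
        return col_val_int[0]
    if "+" in col_val:
        return sum(col_val_int)
    return np.abs(np.diff(col_val_int))[0]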

In [None]:
for col2 in col_val:
    data1[col2] = data1[col2].apply(col_typ_chg)
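
To illustrate what this conversion does, here is a small self-contained variant applied to a toy Series. The helper name `parse_rating` and the sample strings are hypothetical; it assumes ratings look like `89+2`, `64-1` or a plain `75`.

```python
import re
import pandas as pd

def parse_rating(value):
    """Turn strings such as '89+2' or '64-1' into a single integer rating."""
    if pd.isna(value):
        return 0  # missing rating, treated as zero as in the notebook
    text = str(value)
    base, *modifier = re.split(r"[-+]", text)
    if not modifier:
        return int(base)  # plain rating without a modifier
    sign = 1 if "+" in text else -1
    return int(base) + sign * int(modifier[0])

ratings = pd.Series(["89+2", "64-1", "75", None])
parsed = ratings.apply(parse_rating)
# parsed holds 91, 63, 75 and 0
```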

In [None]:
data_flt1 = list(data1.select_dtypes('float').columns)
data_int1 = list(data1.select_dtypes('integer').columns)

In [None]:
data1[data_flt1].corr()

In [None]:
data1[data_int1].corr()

In [None]:
# Remove the columns which have high correlation with other columns
data2=data1.drop(['ls', 'rs', 'lw', 'lf', 'rf', 'rw', 'lam', 'ram', 'lm', 'lcm',
                 'rcm', 'rm', 'lwb', 'ldm', 'rdm', 'rwb', 'lb', 'lcb', 'cb', 'rcb',
                  'attacking_crossing','team_position','attacking_finishing'],axis=1)
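
The list of highly correlated columns above was chosen by inspecting the correlation matrices. One possible way to find such columns automatically is sketched below on synthetic data; the 0.95 threshold and the toy column values are assumptions, not the values used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "st": base,
    "ls": base + rng.normal(scale=0.01, size=200),  # near-duplicate of "st"
    "age": rng.normal(size=200),                    # unrelated column
})

threshold = 0.95  # assumed cut-off
corr = toy.corr().abs()

# Keep only the upper triangle so every pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is dropped if it is highly correlated with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
toy_reduced = toy.drop(columns=to_drop)
```

This keeps the first column of each correlated group, which matches keeping `st` and dropping `ls` and `rs` in the cell above.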

In [None]:
data2.shape

In [None]:
data2.isnull().sum()

In [None]:
# Missing values handling: impute the remaining numeric gaps with the column median

data2.fillna(value={'shooting':data2.shooting.median(),'passing':data2.passing.median(),'dribbling':data2.dribbling.median(),'physic':data2.physic.median(),'defending':data2.defending.median()},inplace=True)
#data2.fillna(value={'team_position':data2.team_position.mode()},inplace=True)

### Feature Engineering

In [None]:
enc=OneHotEncoder()


In [None]:
data_enc=pd.get_dummies(data2,columns=['preferred_foot'],dtype='int')

In [None]:
data_enc.head()

In [None]:
# MinMaxScaler is already imported at the top of the notebook
scaler = MinMaxScaler()
fifa_data_upd_scl = scaler.fit_transform(data_enc)
# Separate copy of the scaled array, kept unchanged for hierarchical clustering
fifa_data_upd_scl1 = fifa_data_upd_scl.copy()
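
For reference, min-max scaling maps each column to the range 0 to 1 using (x - min) / (max - min). The short check below, on made-up numbers, compares that formula with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[18.0, 150.0], [25.0, 180.0], [40.0, 200.0]])  # toy values

# Min-max scaling computed by hand, column by column.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The same transformation with scikit-learn.
scaled = MinMaxScaler().fit_transform(X)
```

Because every feature ends up in the same range, no single attribute dominates the Euclidean distances used by K-means.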

### Model Building

In [None]:
wcss = []
for cluster in range(2,20):
    kme_clu_dup_fea = KMeans(n_clusters=cluster, random_state=9)
    kme_clu_dup_fea.fit(fifa_data_upd_scl)
    # Access inertia_ from the KMeans object
    wcss.append(kme_clu_dup_fea.inertia_)

plt.plot(range(2,20), wcss)
plt.title("Elbow method")
plt.xlabel("Number of clusters")
plt.ylabel("WCSS(for duplicated features)")
plt.show()

In [None]:
# For the scaled array "fifa_data_upd_scl"

# For 2 clusters
kmea_clu_dup_fea_clu2 = KMeans(n_clusters=2, random_state=9)
kmea_clu_dup_fea_clu2.fit(fifa_data_upd_scl)
print("WCSS for dup fea with 2 clusters:", kmea_clu_dup_fea_clu2.inertia_)



# For 3 clusters
kmea_clu_dup_fea_clu3 = KMeans(n_clusters=3, random_state=9)
kmea_clu_dup_fea_clu3.fit(fifa_data_upd_scl)
print("WCSS for dup fea with 3 clusters:", kmea_clu_dup_fea_clu3.inertia_)
# For 4 clusters
kmea_clu_dup_fea_clu4 = KMeans(n_clusters=4, random_state=9)
kmea_clu_dup_fea_clu4.fit(fifa_data_upd_scl)
print("WCSS for dup fea with 4 clusters:", kmea_clu_dup_fea_clu4.inertia_)

# For 5 clusters
kmea_clu_dup_fea_clu5 = KMeans(n_clusters=5, random_state=9)
kmea_clu_dup_fea_clu5.fit(fifa_data_upd_scl)
print("WCSS for dup fea with 5 clusters:", kmea_clu_dup_fea_clu5.inertia_)

# For 6 clusters
kmea_clu_dup_fea_clu6 = KMeans(n_clusters=6, random_state=9)
kmea_clu_dup_fea_clu6.fit(fifa_data_upd_scl)
print("WCSS for dup fea with 6 clusters:", kmea_clu_dup_fea_clu6.inertia_)


In [None]:
#Calculate the silhouette score for 2 clusters
labels2 = kmea_clu_dup_fea_clu2.fit_predict(fifa_data_upd_scl)
silhouette_avg2 = silhouette_score(fifa_data_upd_scl, labels2)
print(silhouette_avg2)
#Calculate the silhouette score for 3 clusters
labels3 = kmea_clu_dup_fea_clu3.fit_predict(fifa_data_upd_scl)
silhouette_avg3 = silhouette_score(fifa_data_upd_scl, labels3)
print(silhouette_avg3)
#Calculate the silhouette score for 4 clusters
labels4 = kmea_clu_dup_fea_clu4.fit_predict(fifa_data_upd_scl)
silhouette_avg4 = silhouette_score(fifa_data_upd_scl, labels4)
print(silhouette_avg4)
#Calculate the silhouette score for 5 clusters
labels5 = kmea_clu_dup_fea_clu5.fit_predict(fifa_data_upd_scl)
silhouette_avg5 = silhouette_score(fifa_data_upd_scl, labels5)
print(silhouette_avg5)
#Calculate the silhouette score for 6 clusters (uses the 6-cluster model, not the 4-cluster one)
labels6 = kmea_clu_dup_fea_clu6.fit_predict(fifa_data_upd_scl)
silhouette_avg6 = silhouette_score(fifa_data_upd_scl, labels6)
print(silhouette_avg6)

I obtained a high silhouette score with 2 clusters, so the 2-cluster model is used below.
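
The per-k silhouette calculation can also be written as a loop, which avoids copy-and-paste mistakes between the cells. The sketch below runs on synthetic data from `make_blobs` as a stand-in for the scaled player features; the names `scores` and `best_k` are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled feature matrix.
X, _ = make_blobs(n_samples=300, centers=2, cluster_std=1.0, random_state=9)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=9).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Number of clusters with the highest average silhouette score.
best_k = max(scores, key=scores.get)
```

With the notebook's data, `X` would be replaced by `fifa_data_upd_scl`.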

In [None]:
label=kmea_clu_dup_fea_clu2.labels_
label

In [None]:
from collections import Counter
Counter(label)

In [None]:
# fifa_data_upd_scl is a NumPy array at this point; convert it to a DataFrame
fifa_data_upd_scl = pd.DataFrame(fifa_data_upd_scl)

# Now you can add the 'Labels' column
fifa_data_upd_scl['Labels'] = label
fifa_data_upd_scl
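
Once the labels are attached, one way to interpret the clusters is to compare the mean of each feature per cluster. The following is a minimal sketch on a made-up frame; the feature names and values are hypothetical.

```python
import pandas as pd

# Toy scaled features with cluster labels already attached (values are made up).
clustered = pd.DataFrame({
    "pace": [0.9, 0.8, 0.2, 0.3],
    "defending": [0.2, 0.3, 0.8, 0.9],
    "Labels": [0, 0, 1, 1],
})

# Mean of every feature inside each cluster.
profile = clustered.groupby("Labels").mean()

# Features ranked by how far apart the two cluster means are.
gap = (profile.loc[0] - profile.loc[1]).abs().sort_values(ascending=False)
```

Features with the largest gap are the ones that separate the two groups most clearly.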

In [None]:
# Hierarchical clustering
import scipy.cluster.hierarchy as shc
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 7))
plt.title('Dendrogram')
dend = shc.dendrogram(shc.linkage(fifa_data_upd_scl1, method='ward'))
plt.show()

In [None]:

hie = AgglomerativeClustering(n_clusters=2)
hie_labels = hie.fit_predict(fifa_data_upd_scl1)

In [None]:
sil_score = silhouette_score(fifa_data_upd_scl1, hie_labels)
sil_score



In [None]:
hie_labels

In [None]:
from collections import Counter
Counter(hie_labels)

### Which countries are producing the most footballers

In [None]:

country_count=data['nationality'].value_counts().head(10)
print("\nTop 10 Countries with Most Players:")
print(country_count)

### Plot the distribution of overall rating vs. age of players. Interpret what is the age after which a player stops improving?

In [None]:

x = data['age']  # X-axis points
y = data['overall']  # Y-axis points

plt.figure(figsize=(8, 5))
plt.bar(x, y, color='blue')
plt.xlabel('Age')
plt.ylabel('Overall rating')
plt.title('Overall rating vs. age')
plt.show()
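
To answer the question about the age after which players stop improving, it may help to look at the average overall rating per age and find where it peaks. The sketch below uses made-up numbers, so the resulting age is illustrative only.

```python
import pandas as pd

# Toy data; the notebook itself would use data['age'] and data['overall'].
toy = pd.DataFrame({
    "age": [18, 18, 22, 22, 27, 27, 31, 31, 35, 35],
    "overall": [58, 62, 66, 70, 72, 76, 73, 74, 68, 70],
})

# Average overall rating for every age.
mean_overall_by_age = toy.groupby("age")["overall"].mean()

# Age at which the average rating is highest.
peak_age = mean_overall_by_age.idxmax()
```

Calling `mean_overall_by_age.plot()` would draw the curve, and the age after the peak is where the average rating starts to fall.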

### Which type of offensive players tends to get paid the most: the striker, the right-winger, or the left-winger


In [None]:
offensive_positions = ['ST', 'RW', 'LW']
# .copy() avoids a SettingWithCopyWarning when the wage column is reassigned below
offensive_players = data[data['team_position'].isin(offensive_positions)].copy()

# Convert 'wage_eur' column to numeric, coercing errors to NaN
offensive_players['wage_eur'] = pd.to_numeric(offensive_players['wage_eur'], errors='coerce')

# Group by position and find the average wage, skipping NaN values
wage_comparison = offensive_players.groupby('team_position')['wage_eur'].mean(numeric_only=True).sort_values(ascending=False)
print("\nAverage Wage by Offensive Position:")
print(wage_comparison)

# Plotting the average wage comparison
wage_comparison.plot(kind='bar', title='Average Wage by Offensive Position', xlabel='Position', ylabel='Average Wage (EUR)')
plt.show()

### Model Comparison Report
With 2 clusters, K-means clustering gave a better silhouette score than hierarchical clustering.
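
A side-by-side comparison of the two models could look roughly like this. It is a sketch on synthetic `make_blobs` data, not the notebook's results; the adjusted Rand index is added here as an extra, assumed measure of how similar the two partitions are.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic stand-in for the scaled feature matrix.
X, _ = make_blobs(n_samples=300, centers=2, random_state=9)

models = {
    "kmeans": KMeans(n_clusters=2, n_init=10, random_state=9),
    "hierarchical": AgglomerativeClustering(n_clusters=2),  # Ward linkage by default
}

# Fit both models on the same data and score each partition.
labels = {name: model.fit_predict(X) for name, model in models.items()}
silhouettes = {name: silhouette_score(X, lab) for name, lab in labels.items()}

# Agreement between the two partitions (1.0 means identical up to relabeling).
agreement = adjusted_rand_score(labels["kmeans"], labels["hierarchical"])
```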

### Report on Challenges faced
1-The dataset has a lot of missing values. I removed the columns with more than 50% missing values and imputed the columns with less than 10% missing values using the median.
2-Many columns in the dataset are very highly correlated, so I removed them.