
Understanding Data Clustering with K-Means and PCA

Data analysis is an essential part of decision-making processes in various industries. Clustering, a form of unsupervised machine learning, plays a vital role in segmenting data into meaningful groups. In this article, we will explore the use of K-Means clustering and Principal Component Analysis (PCA) to gain insights from a dataset containing FIFA player statistics.

Getting Started

Before we dive into clustering, we need to prepare our environment by importing the necessary Python libraries and loading the dataset. For this tutorial, we'll be using the FIFA player dataset. Here's how we set things up:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
fifa_data = pd.read_csv('./FIFA.csv')

Data Preprocessing

Handling Missing Values

Data preprocessing is crucial to the accuracy of our analysis. We select the columns we need for clustering, fill the missing Club entries, and confirm that no missing values remain:

# Keep only the columns used for clustering (the subset is evident from the output below);
# .copy() avoids pandas' SettingWithCopyWarning on later assignments
cluster_data = fifa_data[['Age', 'Nationality', 'Overall', 'Club', 'Value', 'Wage']].copy()
cluster_data["Club"] = cluster_data["Club"].fillna("No Club")
cluster_data.isna().sum()
Age            0
Nationality    0
Overall        0
Club           0
Value          0
Wage           0
dtype: int64

We also need to fix the Value and Wage columns by converting them to numerical values:

Fix Value and Wage Columns

# Function to convert the `Value` and `Wage` strings to numbers
def convert_value(value):
    # Remove the euro symbol and leading/trailing whitespace
    value = value.replace('€', '').strip()
    
    # 'M' marks millions and 'K' thousands
    if value.endswith('M'):
        value = float(value.replace('M', '')) * 1e6
    elif value.endswith('K'):
        value = float(value.replace('K', '')) * 1e3
    else:
        # Plain numeric strings (e.g. '€0') are converted directly
        value = float(value)
    
    return value
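# A few hypothetical spot checks of the conversion:
#   convert_value('€110.5M') -> 110500000.0
#   convert_value('€565K')   -> 565000.0
#   convert_value('€0')      -> 0.0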

# Apply the conversion function to the 'Value' column
cluster_data['Value'] = cluster_data['Value'].apply(convert_value)

cluster_data["Wage"] = cluster_data["Wage"].apply(convert_value)
cluster_data
         Age  Nationality  Overall                 Club        Value      Wage
0         31    Argentina       94         FC Barcelona  110500000.0  565000.0
1         33     Portugal       94             Juventus   77000000.0  405000.0
2         26       Brazil       92  Paris Saint-Germain  118500000.0  290000.0
3         27        Spain       91    Manchester United   72000000.0  260000.0
4         27      Belgium       91      Manchester City  102000000.0  355000.0
...      ...          ...      ...                  ...          ...       ...
18202     19      England       47      Crewe Alexandra      60000.0    1000.0
18203     19       Sweden       47       Trelleborgs FF      60000.0    1000.0
18204     16      England       47     Cambridge United      60000.0    1000.0
18205     17      England       47      Tranmere Rovers      60000.0    1000.0
18206     16      England       46      Tranmere Rovers      60000.0    1000.0

18207 rows × 6 columns

We then cast the Value and Wage columns to integers:

cluster_data["Value"] = cluster_data["Value"].astype('int')
cluster_data["Wage"] = cluster_data["Wage"].astype('int')
cluster_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18207 entries, 0 to 18206
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Age          18207 non-null  int64 
 1   Nationality  18207 non-null  object
 2   Overall      18207 non-null  int64 
 3   Club         18207 non-null  object
 4   Value        18207 non-null  int32 
 5   Wage         18207 non-null  int32 
dtypes: int32(2), int64(2), object(2)
memory usage: 711.3+ KB

Encoding Categorical Data

Next, we one-hot encode the categorical columns (Nationality and Club), which yields a sparse matrix:

# Encode the data
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1,3])], remainder='passthrough')
X = ct.fit_transform(cluster_data)
print(X)
  (0, 6)	1.0
  (0, 376)	1.0
  (0, 816)	31.0
  (0, 817)	94.0
  (0, 818)	110500000.0
  (0, 819)	565000.0
  (1, 123)	1.0
  (1, 490)	1.0
  (1, 816)	33.0
  (1, 817)	94.0
  (1, 818)	77000000.0
  (1, 819)	405000.0
  (2, 20)	1.0
  (2, 600)	1.0
  (2, 816)	26.0
  (2, 817)	92.0
  (2, 818)	118500000.0
  (2, 819)	290000.0
  (3, 139)	1.0
  (3, 539)	1.0
  (3, 816)	27.0
  (3, 817)	91.0
  (3, 818)	72000000.0
  (3, 819)	260000.0
  (4, 13)	1.0
  :	:
  (18202, 819)	1000.0
  (18203, 144)	1.0
  (18203, 752)	1.0
  (18203, 816)	19.0
  (18203, 817)	47.0
  (18203, 818)	60000.0
  (18203, 819)	1000.0
  (18204, 46)	1.0
  (18204, 286)	1.0
  (18204, 816)	16.0
  (18204, 817)	47.0
  (18204, 818)	60000.0
  (18204, 819)	1000.0
  (18205, 46)	1.0
  (18205, 751)	1.0
  (18205, 816)	17.0
  (18205, 817)	47.0
  (18205, 818)	60000.0
  (18205, 819)	1000.0
  (18206, 46)	1.0
  (18206, 751)	1.0
  (18206, 816)	16.0
  (18206, 817)	46.0
  (18206, 818)	60000.0
  (18206, 819)	1000.0
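The sparse printout above refers to columns only by index. If your scikit-learn version is 1.0 or newer (an assumption; the original environment appears to be older), the fitted transformer can map those indices back to names:

# Recover human-readable names for the encoded matrix
# (ColumnTransformer.get_feature_names_out requires scikit-learn >= 1.0)
feature_names = ct.get_feature_names_out()
print(feature_names[816:])  # the four passthrough columns: Age, Overall, Value, Wage
print(feature_names[6])     # the one-hot column that is 1 in row 0 (nationality Argentina)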

Determining the Optimal Number of Clusters

Before applying K-Means clustering, we need to determine the optimal number of clusters. We can do this using the Elbow Method:

from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss, marker = 'o', linestyle = '--')
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

[Figure: The Elbow Method, WCSS plotted against the number of clusters]

K-Means Clustering

The elbow appears at four clusters, so we run K-Means with n_clusters = 4:

kmeans = KMeans(n_clusters = 4, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X)
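The elbow plot is suggestive rather than conclusive. As an optional sanity check (not in the original notebook), the silhouette score summarizes how well separated the clusters are; a sketch on a subsample, since the full pairwise computation is expensive:

from sklearn.metrics import silhouette_score

# Silhouette ranges from -1 to 1; higher values indicate better-separated clusters.
# sample_size keeps the O(n^2) distance computation tractable on ~18k rows.
score = silhouette_score(X, y_kmeans, sample_size=2000, random_state=42)
print(round(score, 3))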

Analyzing the Clusters

Let's take a closer look at our clusters and analyze the results:

kmeans.labels_
array([2, 2, 2, ..., 1, 1, 1])
df_cluster = cluster_data.copy()

df_cluster["K-means"] = kmeans.labels_
df_cluster_analysis = df_cluster.groupby(["K-means"]).mean(numeric_only=True)
df_cluster_analysis
               Age    Overall         Value           Wage
K-means
0        26.057471  75.959291  8.282807e+06   28850.574713
1        24.973728  64.516539  9.071590e+05    4809.478372
2        26.873016  87.984127  6.223016e+07  222793.650794
3        25.928571  82.327381  2.504762e+07   81241.071429

We also calculate the size and proportions of the clusters:

# Compute the size and proportions of the four clusters
df_cluster_analysis['N Obs'] = df_cluster[['K-means','Nationality']].groupby(['K-means']).count()
df_cluster_analysis['Prop Obs'] = df_cluster_analysis['N Obs'] / df_cluster_analysis['N Obs'].sum()
df_cluster_analysis
               Age    Overall         Value           Wage  N Obs  Prop Obs
K-means
0        26.057471  75.959291  8.282807e+06   28850.574713   2088  0.114681
1        24.973728  64.516539  9.071590e+05    4809.478372  15720  0.863404
2        26.873016  87.984127  6.223016e+07  222793.650794     63  0.003460
3        25.928571  82.327381  2.504762e+07   81241.071429    336  0.018454
df_cluster_analysis.rename({0:'Average Performers',
                            1:'Worst Performers',
                            2:'Best Performers',
                            3:'Performers'})
                          Age    Overall         Value           Wage  N Obs  Prop Obs
K-means
Average Performers  26.057471  75.959291  8.282807e+06   28850.574713   2088  0.114681
Worst Performers    24.973728  64.516539  9.071590e+05    4809.478372  15720  0.863404
Best Performers     26.873016  87.984127  6.223016e+07  222793.650794     63  0.003460
Performers          25.928571  82.327381  2.504762e+07   81241.071429    336  0.018454
# Add the segment labels to our table
df_cluster['Labels'] = df_cluster['K-means'].map({0:'Average Performers',
                                                  1:'Worst Performers',
                                                  2:'Best Performers',
                                                  3:'Performers'})
df_cluster
       Age Nationality  Overall                 Club      Value    Wage  K-means            Labels
0       31   Argentina       94         FC Barcelona  110500000  565000        2   Best Performers
1       33    Portugal       94             Juventus   77000000  405000        2   Best Performers
2       26      Brazil       92  Paris Saint-Germain  118500000  290000        2   Best Performers
3       27       Spain       91    Manchester United   72000000  260000        2   Best Performers
4       27     Belgium       91      Manchester City  102000000  355000        2   Best Performers
...    ...         ...      ...                  ...        ...     ...      ...               ...
18202   19     England       47      Crewe Alexandra      60000    1000        1  Worst Performers
18203   19      Sweden       47       Trelleborgs FF      60000    1000        1  Worst Performers
18204   16     England       47     Cambridge United      60000    1000        1  Worst Performers
18205   17     England       47      Tranmere Rovers      60000    1000        1  Worst Performers
18206   16     England       46      Tranmere Rovers      60000    1000        1  Worst Performers

18207 rows × 8 columns

Visualizing the Clusters

To visualize the clusters, we create scatterplots:

# We plot the results from the K-means algorithm. 
# Each point in our data set is plotted with the color of the clusters it has been assigned to.
x_axis = df_cluster['Wage']
y_axis = df_cluster['Overall']
plt.figure(figsize = (10, 8))
sns.scatterplot(data=df_cluster,x=x_axis, y=y_axis, 
                hue = df_cluster['Labels'], palette = ['g', 'r', 'c', 'm'])
plt.title('Segmentation K-means')
plt.show()

[Figure: Segmentation K-means, scatterplot of Overall against Wage colored by cluster label]

# Clubs with the best performing Players
df_cluster[df_cluster["Labels"]=="Best Performers"].groupby("Club", as_index=False)["Overall"].count()
                   Club  Overall
0               Arsenal        2
1       Atlético Madrid        5
2               Chelsea        2
3          FC Barcelona        7
4     FC Bayern München        5
5                 Inter        2
6              Juventus        4
7                 Lazio        2
8             Liverpool        4
9       Manchester City        6
10    Manchester United        3
11                Milan        1
12               Napoli        4
13   Olympique Lyonnais        1
14  Paris Saint-Germain        4
15          Real Madrid        9
16    Tottenham Hotspur        2
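Because count() is applied to the Overall column, the counts appear under that header, which is easy to misread. A small extension (not in the original) renames the result and sorts it to rank clubs:

# Rank clubs by their number of "Best Performers"
(df_cluster[df_cluster["Labels"] == "Best Performers"]
     .groupby("Club")["Overall"].count()
     .rename("Player Count")
     .sort_values(ascending=False)
     .head())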
# Nationalities with the best performing Players
df_cluster[df_cluster["Labels"]=="Best Performers"].groupby("Nationality", as_index=False)["Overall"].count()
           Nationality  Overall
0            Argentina        5
1              Belgium        5
2   Bosnia Herzegovina        1
3               Brazil        5
4             Colombia        1
5              Croatia        2
6              Denmark        1
7                Egypt        1
8              England        2
9               France        8
10               Gabon        1
11             Germany        5
12               Italy        3
13         Netherlands        1
14              Poland        1
15            Portugal        2
16             Senegal        2
17              Serbia        1
18            Slovakia        2
19            Slovenia        1
20               Spain        9
21             Uruguay        3
22               Wales        1

PCA

Next, we apply Principal Component Analysis to the full set of numerical player attributes to reduce dimensionality before clustering again.

from sklearn.decomposition import PCA

# Employ PCA to find a subset of components which explain most of the variance in the data.
pca = PCA()
# Select the numerical columns, dropping the two leading identifier columns so only player attributes remain
pca_data = fifa_data.select_dtypes(include=["int", "float"])
pca_data = pca_data.iloc[:, 2:]
pca_data_all = pca_data.copy()
pca_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18207 entries, 0 to 18206
Data columns (total 42 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Age                       18207 non-null  int64  
 1   Overall                   18207 non-null  int64  
 2   Potential                 18207 non-null  int64  
 3   Special                   18207 non-null  int64  
 4   International Reputation  18159 non-null  float64
 5   Weak Foot                 18159 non-null  float64
 6   Skill Moves               18159 non-null  float64
 7   Jersey Number             18147 non-null  float64
 8   Crossing                  18159 non-null  float64
 9   Finishing                 18159 non-null  float64
 10  HeadingAccuracy           18159 non-null  float64
 11  ShortPassing              18159 non-null  float64
 12  Volleys                   18159 non-null  float64
 13  Dribbling                 18159 non-null  float64
 14  Curve                     18159 non-null  float64
 15  FKAccuracy                18159 non-null  float64
 16  LongPassing               18159 non-null  float64
 17  BallControl               18159 non-null  float64
 18  Acceleration              18159 non-null  float64
 19  SprintSpeed               18159 non-null  float64
 20  Agility                   18159 non-null  float64
 21  Reactions                 18159 non-null  float64
 22  Balance                   18159 non-null  float64
 23  ShotPower                 18159 non-null  float64
 24  Jumping                   18159 non-null  float64
 25  Stamina                   18159 non-null  float64
 26  Strength                  18159 non-null  float64
 27  LongShots                 18159 non-null  float64
 28  Aggression                18159 non-null  float64
 29  Interceptions             18159 non-null  float64
 30  Positioning               18159 non-null  float64
 31  Vision                    18159 non-null  float64
 32  Penalties                 18159 non-null  float64
 33  Composure                 18159 non-null  float64
 34  Marking                   18159 non-null  float64
 35  StandingTackle            18159 non-null  float64
 36  SlidingTackle             18159 non-null  float64
 37  GKDiving                  18159 non-null  float64
 38  GKHandling                18159 non-null  float64
 39  GKKicking                 18159 non-null  float64
 40  GKPositioning             18159 non-null  float64
 41  GKReflexes                18159 non-null  float64
dtypes: float64(38), int64(4)
memory usage: 5.8 MB
# Impute missing values with the column median
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputer.fit(pca_data)
pca_data = imputer.transform(pca_data)
# Standardize the features to zero mean and unit variance before PCA
sc = StandardScaler()
pca_data_std = sc.fit_transform(pca_data)
pca_data_std
array([[ 1.25867833,  4.01828714,  3.69809177, ..., -0.07391134,
        -0.13957297, -0.48489768],
       [ 1.68696087,  4.01828714,  3.69809177, ..., -0.07391134,
        -0.13957297, -0.31761143],
       [ 0.18797198,  3.72879875,  3.53512784, ..., -0.07391134,
        -0.08079776, -0.31761143],
       ...,
       [-1.95344072, -2.78469008, -0.70193445, ..., -0.37725737,
        -0.60977466, -0.20608726],
       [-1.73929945, -2.78469008, -0.86489839, ..., -0.13458054,
        -0.49222424, -0.4291356 ],
       [-1.95344072, -2.92943428, -0.86489839, ..., -0.43792658,
        -0.25712339, -0.4291356 ]])
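The imputation and standardization steps can also be chained into a single reusable estimator. A minimal sketch using sklearn.pipeline.Pipeline; the step names and the pca_data_std_alt variable are our own additions, not part of the original notebook:

from sklearn.pipeline import Pipeline

# Chain median imputation and standardization into one preprocessor
preprocess = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy='median')),
    ('scale', StandardScaler()),
])
# Produces the same standardized matrix as pca_data_std above
pca_data_std_alt = preprocess.fit_transform(pca_data_all)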
# Fit PCA with our standardized data.
pca.fit(pca_data_std)
PCA()
# The attribute shows how much variance is explained by each of the 42 individual components.
pca.explained_variance_ratio_
array([4.97893735e-01, 1.20522384e-01, 9.46282620e-02, 4.31422784e-02,
       3.27713854e-02, 3.07283132e-02, 2.08695480e-02, 1.99185744e-02,
       1.74104984e-02, 1.40835060e-02, 1.06389704e-02, 8.61595863e-03,
       7.64499315e-03, 6.54800266e-03, 6.29149631e-03, 5.77424091e-03,
       5.42289076e-03, 5.08356777e-03, 4.92252161e-03, 4.63667148e-03,
       4.47230599e-03, 4.13637476e-03, 3.70159829e-03, 3.29256064e-03,
       2.95448496e-03, 2.92550280e-03, 2.64669981e-03, 2.48111575e-03,
       2.08116208e-03, 1.92259650e-03, 1.71205543e-03, 1.61107135e-03,
       1.53909831e-03, 1.44570868e-03, 1.12214070e-03, 8.90810720e-04,
       8.68520228e-04, 7.30638408e-04, 7.22031163e-04, 6.33660321e-04,
       5.42806030e-04, 1.92589381e-05])
# Plot the cumulative variance explained by total number of components.
# On this graph we choose the subset of components we want to keep. 
# Generally, we want to keep around 80 % of the explained variance.
plt.figure(figsize = (12,9))
plt.plot(range(1,43), pca.explained_variance_ratio_.cumsum(), marker = 'o', linestyle = '--')
plt.title('Explained Variance by Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
Text(0, 0.5, 'Cumulative Explained Variance')

[Figure: Explained Variance by Components, cumulative explained variance against the number of components]
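Instead of reading the cutoff off the graph, the smallest number of components reaching a chosen variance threshold can be computed directly; a small helper, not part of the original notebook:

# Smallest n whose cumulative explained variance reaches the threshold
cumulative = pca.explained_variance_ratio_.cumsum()
n_keep = int(np.argmax(cumulative >= 0.80)) + 1
print(n_keep, cumulative[n_keep - 1])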

# We choose ten components; together they explain roughly 89% of the variance,
# and the curve flattens out around 9 or 10 on the previous graph.
pca = PCA(n_components = 10)
# Fit the model to our data with the selected number of components, in our case ten.
pca.fit(pca_data_std)
PCA(n_components=10)

PCA Results

# Here we discuss the results from the PCA.
# The components_ attribute shows the loadings of each component on each of the 42 original features.
# The loadings are the correlations between the components and the original features.
np.round(pca.components_, 3)
array([[-0.027, -0.103, -0.072, -0.211, -0.051, -0.08 , -0.179,  0.02 ,
        -0.191, -0.172, -0.154, -0.201, -0.178, -0.204, -0.188, -0.176,
        -0.179, -0.21 , -0.155, -0.153, -0.159, -0.102, -0.137, -0.188,
        -0.059, -0.175, -0.029, -0.188, -0.139, -0.11 , -0.191, -0.159,
        -0.173, -0.154, -0.116, -0.109, -0.102,  0.18 ,  0.18 ,  0.179,
         0.179,  0.18 ],
       [ 0.084,  0.041, -0.021,  0.024,  0.003, -0.083, -0.12 , -0.075,
        -0.034, -0.211,  0.177,  0.04 , -0.166, -0.1  , -0.109, -0.086,
         0.076, -0.032, -0.13 , -0.108, -0.158,  0.041, -0.14 , -0.073,
         0.123,  0.105,  0.241, -0.133,  0.265,  0.348, -0.143, -0.135,
        -0.138,  0.029,  0.332,  0.354,  0.356, -0.073, -0.073, -0.073,
        -0.071, -0.073],
       [ 0.283,  0.402,  0.222,  0.099,  0.27 ,  0.051,  0.015, -0.121,
         0.001,  0.015, -0.035,  0.041,  0.05 , -0.035,  0.049,  0.066,
         0.077, -0.   , -0.123, -0.12 , -0.05 ,  0.37 , -0.113,  0.052,
         0.06 , -0.037,  0.154,  0.056,  0.041,  0.015, -0.004,  0.158,
         0.021,  0.251, -0.024, -0.033, -0.044,  0.235,  0.236,  0.235,
         0.24 ,  0.237],
       [ 0.253, -0.084, -0.281, -0.058, -0.028,  0.015, -0.044, -0.014,
        -0.124,  0.19 ,  0.275, -0.044,  0.177, -0.036, -0.009,  0.029,
        -0.111,  0.005, -0.247, -0.207, -0.248, -0.068, -0.311,  0.18 ,
        -0.017, -0.049,  0.367,  0.131,  0.066, -0.146,  0.087, -0.056,
         0.22 ,  0.026, -0.129, -0.152, -0.175, -0.111, -0.108, -0.106,
        -0.108, -0.109],
       [ 0.233, -0.148, -0.317,  0.022, -0.072,  0.006, -0.002, -0.038,
         0.16 , -0.052, -0.23 ,  0.1  ,  0.004,  0.011,  0.165,  0.256,
         0.242,  0.008, -0.237, -0.289, -0.074, -0.09 ,  0.028,  0.003,
        -0.475, -0.126, -0.271,  0.069, -0.039,  0.122, -0.018,  0.202,
        -0.005, -0.017,  0.088,  0.122,  0.121,  0.028,  0.032,  0.035,
         0.03 ,  0.028],
       [-0.461,  0.052,  0.488, -0.053,  0.173,  0.011,  0.068,  0.387,
        -0.057,  0.024,  0.081,  0.087,  0.021,  0.03 , -0.002,  0.003,
         0.07 ,  0.059, -0.172, -0.162, -0.223, -0.015, -0.217,  0.055,
        -0.367, -0.137,  0.004,  0.013, -0.043, -0.004, -0.017,  0.032,
         0.034,  0.011,  0.026,  0.038,  0.029, -0.051, -0.053, -0.054,
        -0.058, -0.052],
       [ 0.216, -0.029, -0.181,  0.048,  0.097, -0.215, -0.053,  0.859,
         0.026, -0.014, -0.021, -0.028,  0.025, -0.02 ,  0.037,  0.045,
        -0.001, -0.038, -0.007, -0.027,  0.054,  0.038,  0.113,  0.015,
         0.278, -0.004, -0.045,  0.026,  0.07 ,  0.037,  0.012,  0.01 ,
         0.017,  0.018,  0.009,  0.017,  0.027,  0.041,  0.039,  0.038,
         0.039,  0.04 ],
       [-0.033,  0.046,  0.049,  0.02 , -0.079, -0.956,  0.027, -0.156,
         0.036,  0.038,  0.   ,  0.012,  0.021,  0.038,  0.025,  0.003,
         0.003,  0.021,  0.048,  0.062,  0.001,  0.037, -0.053,  0.037,
        -0.143,  0.034,  0.042,  0.036, -0.009, -0.024,  0.049,  0.014,
         0.02 ,  0.018, -0.025, -0.024, -0.025,  0.007,  0.01 ,  0.008,
         0.01 ,  0.009],
       [ 0.077, -0.081, -0.061, -0.072,  0.841, -0.113,  0.   , -0.187,
        -0.02 , -0.026,  0.052, -0.034,  0.026, -0.018, -0.009, -0.007,
        -0.067, -0.012, -0.043, -0.071,  0.   , -0.075,  0.148, -0.059,
         0.101, -0.16 , -0.252, -0.071, -0.056, -0.02 , -0.034, -0.095,
         0.079, -0.023,  0.015, -0.007,  0.003, -0.098, -0.099, -0.103,
        -0.099, -0.102],
       [ 0.256,  0.095, -0.166, -0.029,  0.221,  0.088,  0.068,  0.163,
         0.103, -0.04 , -0.019, -0.063, -0.064,  0.052, -0.071, -0.163,
        -0.126,  0.011,  0.285,  0.34 ,  0.072,  0.097, -0.14 , -0.11 ,
        -0.617,  0.133,  0.224, -0.114, -0.007, -0.003,  0.045, -0.136,
        -0.1  ,  0.033,  0.006, -0.005, -0.006, -0.019, -0.019, -0.022,
        -0.015, -0.019]])
df_pca_comp = pd.DataFrame(data = np.round(pca.components_, 5),
                           columns = pca_data_all.columns.values,
                           index = ['Component 1', 'Component 2', 'Component 3',
                                    'Component 4', 'Component 5', 'Component 6',
                                    'Component 7', 'Component 8', 'Component 9',
                                    'Component 10'])
df_pca_comp
Age Overall Potential Special International Reputation Weak Foot Skill Moves Jersey Number Crossing Finishing ... Penalties Composure Marking StandingTackle SlidingTackle GKDiving GKHandling GKKicking GKPositioning GKReflexes
Component 1 -0.02720 -0.10256 -0.07215 -0.21072 -0.05145 -0.08010 -0.17947 0.02011 -0.19094 -0.17177 ... -0.17266 -0.15380 -0.11615 -0.10940 -0.10225 0.18018 0.17988 0.17917 0.17910 0.17999
Component 2 0.08357 0.04108 -0.02064 0.02415 0.00343 -0.08298 -0.12037 -0.07504 -0.03442 -0.21067 ... -0.13784 0.02887 0.33244 0.35426 0.35627 -0.07264 -0.07300 -0.07350 -0.07109 -0.07256
Component 3 0.28270 0.40170 0.22217 0.09904 0.27027 0.05064 0.01514 -0.12077 0.00121 0.01531 ... 0.02104 0.25084 -0.02398 -0.03315 -0.04353 0.23538 0.23645 0.23462 0.23966 0.23658
Component 4 0.25284 -0.08394 -0.28144 -0.05831 -0.02801 0.01538 -0.04357 -0.01444 -0.12449 0.18960 ... 0.21962 0.02631 -0.12874 -0.15184 -0.17486 -0.11124 -0.10843 -0.10613 -0.10796 -0.10913
Component 5 0.23293 -0.14833 -0.31722 0.02198 -0.07220 0.00592 -0.00157 -0.03751 0.15958 -0.05216 ... -0.00459 -0.01708 0.08794 0.12247 0.12118 0.02840 0.03193 0.03531 0.02969 0.02792
Component 6 -0.46107 0.05221 0.48806 -0.05284 0.17305 0.01129 0.06822 0.38718 -0.05734 0.02434 ... 0.03355 0.01101 0.02552 0.03837 0.02926 -0.05135 -0.05330 -0.05408 -0.05843 -0.05227
Component 7 0.21600 -0.02938 -0.18055 0.04820 0.09656 -0.21504 -0.05315 0.85867 0.02603 -0.01442 ... 0.01724 0.01750 0.00909 0.01742 0.02667 0.04118 0.03922 0.03789 0.03947 0.04038
Component 8 -0.03314 0.04628 0.04925 0.02015 -0.07902 -0.95620 0.02667 -0.15573 0.03582 0.03847 ... 0.01973 0.01812 -0.02532 -0.02355 -0.02519 0.00719 0.01033 0.00757 0.00953 0.00936
Component 9 0.07650 -0.08110 -0.06054 -0.07170 0.84083 -0.11349 0.00032 -0.18750 -0.02047 -0.02563 ... 0.07851 -0.02334 0.01543 -0.00696 0.00327 -0.09795 -0.09911 -0.10282 -0.09926 -0.10170
Component 10 0.25552 0.09523 -0.16574 -0.02898 0.22114 0.08777 0.06764 0.16324 0.10287 -0.03963 ... -0.09962 0.03265 0.00592 -0.00501 -0.00552 -0.01850 -0.01885 -0.02174 -0.01524 -0.01874

10 rows × 42 columns

# Heat map of the principal components against the original features, using the RdBu color scheme with the color scale bounded at -1 and 1.
sns.heatmap(df_pca_comp,
            vmin = -1, 
            vmax = 1,
            cmap = 'RdBu',
            annot = False)
plt.yticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 
           ['Component 1', 'Component 2', 'Component 3',
            'Component 4', 'Component 5', 'Component 6',
            'Component 7', 'Component 8', 'Component 9',
            'Component 10'],
#            rotation = 45,
           fontsize = 9)
plt.show()

[Figure: Heat map of the ten principal components against the 42 original features]

pca.transform(pca_data_std)
array([[-10.18880231,  -4.93442409,   8.23040794, ...,  -0.70215123,
          6.87604897,   1.60076215],
       [-10.23781879,  -3.43526277,   8.77187294, ...,  -0.79892601,
          6.48071261,   1.43109952],
       [ -9.74182097,  -5.0496503 ,   7.47746363, ...,  -2.08051888,
          6.71829725,   2.37869145],
       ...,
       [  3.59911747,  -2.9297779 ,  -4.47672732, ...,  -0.4713472 ,
          1.2786716 ,  -0.43008435],
       [  3.49509554,  -2.87087077,  -4.98175048, ...,  -0.40109016,
          1.12513874,  -0.07484549],
       [  3.0281007 ,   0.26125409,  -3.5111489 , ...,  -0.70666993,
          0.54033841,  -0.67180464]])
scores_pca = pca.transform(pca_data_std)

KMeans Clustering with PCA

# We fit K means using the transformed data from the PCA.
wcss = []
for i in range(1,11):
    kmeans_pca = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans_pca.fit(scores_pca)
    wcss.append(kmeans_pca.inertia_)
# Plot the Within-Cluster Sum of Squares for the K-means PCA model. Here we make a decision about the number of clusters.
# Again it looks like four is the best option.
plt.figure(figsize = (10,8))
plt.plot(range(1, 11), wcss, marker = 'o', linestyle = '--')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.title('K-means with PCA Clustering')
plt.show()

[Figure: K-means with PCA Clustering, WCSS against the number of clusters]

# We have chosen four clusters, so we run K-means with number of clusters equals four. 
# Same initializer and random state as before.
kmeans_pca = KMeans(n_clusters = 4, init = 'k-means++', random_state = 42)
# We fit our data with the k-means pca model
kmeans_pca.fit(scores_pca)
KMeans(n_clusters=4, random_state=42)
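To see what each PCA-space cluster looks like in the original units, the centroids can be mapped back through the inverse transforms. This step is our own addition, not part of the original notebook:

# Map centroids: PCA space -> standardized space -> original feature units
centers_std = pca.inverse_transform(kmeans_pca.cluster_centers_)
centers_original = sc.inverse_transform(centers_std)
pd.DataFrame(centers_original, columns=pca_data_all.columns)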

KMeans Clustering with PCA Results

# We create a new data frame with the original features and add the PCA scores and assigned clusters.
df_pca_kmeans = pd.concat([pca_data_all.reset_index(drop = True), pd.DataFrame(scores_pca)], axis = 1)
df_pca_kmeans.columns.values[-10: ] = ['Component 1', 'Component 2', 'Component 3',
                                       'Component 4', 'Component 5', 'Component 6',
                                       'Component 7', 'Component 8', 'Component 9',
                                       'Component 10']
# The last column we add contains the pca k-means clustering labels.
df_pca_kmeans['Segment K-means PCA'] = kmeans_pca.labels_
df_pca_kmeans
Age Overall Potential Special International Reputation Weak Foot Skill Moves Jersey Number Crossing Finishing ... Component 2 Component 3 Component 4 Component 5 Component 6 Component 7 Component 8 Component 9 Component 10 Segment K-means PCA
0 31 94 94 2202 5.0 4.0 4.0 10.0 84.0 95.0 ... -4.934424 8.230408 -1.435587 -1.317932 2.239490 0.229915 -0.702151 6.876049 1.600762 0
1 33 94 94 2228 5.0 4.0 5.0 7.0 84.0 94.0 ... -3.435263 8.771873 0.382124 -3.993349 1.430279 0.463430 -0.798926 6.480713 1.431100 0
2 26 92 93 2143 5.0 5.0 5.0 10.0 79.0 87.0 ... -5.049650 7.477464 -2.162656 -1.532512 2.727201 -0.544759 -2.080519 6.718297 2.378691 0
3 27 91 93 1471 4.0 3.0 1.0 1.0 17.0 13.0 ... -1.402391 10.971208 -3.924055 -2.276305 1.493787 -0.356265 -0.312283 3.885447 1.214150 1
4 27 91 92 2281 4.0 5.0 4.0 7.0 93.0 82.0 ... -1.510164 7.324699 -1.306482 -0.178675 2.627330 -0.706180 -2.041418 3.840916 1.282955 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
18202 19 47 65 1307 1.0 2.0 2.0 22.0 34.0 38.0 ... -0.213781 -3.724551 -0.374969 1.356970 0.438140 0.055971 0.796588 1.367476 -0.877149 3
18203 19 47 63 1098 1.0 2.0 2.0 21.0 23.0 52.0 ... -1.058603 -3.849036 3.356601 0.320174 1.400514 -0.676317 1.003101 1.202197 -0.289718 2
18204 16 47 67 1189 1.0 3.0 2.0 33.0 25.0 40.0 ... -2.929778 -4.476727 0.610368 -0.619479 0.908759 -0.121580 -0.471347 1.278672 -0.430084 2
18205 17 47 66 1228 1.0 3.0 2.0 34.0 44.0 50.0 ... -2.870871 -4.981750 1.009859 1.127637 1.510725 -0.364779 -0.401090 1.125139 -0.074845 2
18206 16 46 66 1321 1.0 3.0 2.0 33.0 41.0 34.0 ... 0.261254 -3.511149 0.015826 0.483510 1.014565 0.205704 -0.706670 0.540338 -0.671805 3

18207 rows × 53 columns

# We calculate the means by segments.
df_pca_kmeans_freq = df_pca_kmeans.groupby(['Segment K-means PCA']).mean()
df_pca_kmeans_freq
Age Overall Potential Special International Reputation Weak Foot Skill Moves Jersey Number Crossing Finishing ... Component 1 Component 2 Component 3 Component 4 Component 5 Component 6 Component 7 Component 8 Component 9 Component 10
Segment K-means PCA
0 27.117229 72.176377 74.593961 1874.646714 1.298224 3.192718 2.978863 16.688099 65.229840 59.123979 ... -3.983975 -0.096373 1.289207 -0.191330 0.277474 -0.007341 0.078892 0.008124 -0.091912 -0.099955
1 26.045903 64.599704 69.792695 1046.197927 1.095755 2.490128 1.000000 20.516543 14.257651 12.020237 ... 11.023646 -1.076094 2.390843 -0.496690 0.194175 -0.120286 0.095272 0.012216 -0.168322 -0.054010
2 23.030741 62.762946 70.082732 1573.315659 1.004539 3.035898 2.565298 23.803553 50.910047 58.811223 ... -0.681592 -2.177790 -1.264678 0.475369 -0.242780 0.060576 -0.062461 0.053443 0.053460 0.033800
3 24.602209 63.913745 69.641830 1541.305400 1.028465 2.790842 2.059406 18.398514 46.009901 32.687412 ... 0.595994 2.327926 -1.047017 -0.038680 -0.136539 -0.001504 -0.058631 -0.057771 0.105078 0.089121

4 rows × 52 columns

df_pca_kmeans['Legend'] = df_pca_kmeans['Segment K-means PCA'].map({0:'Best Performers',
                                                                    1:'Performers',
                                                                    2:'Below Average',
                                                                    3:'Average'})
# Plot the data by PCA components: Component 1 on the X axis and Component 2 on the Y axis.
x_axis = df_pca_kmeans['Component 1']
y_axis = df_pca_kmeans['Component 2']
plt.figure(figsize = (10, 8))
sns.scatterplot(data=df_pca_kmeans, x=x_axis, y=y_axis, 
                hue = df_pca_kmeans['Legend'], palette = ['g', 'r', 'c', 'm'])
plt.title('Clusters by PCA Components')
plt.show()

[Figure: Clusters by PCA Components, Component 2 against Component 1]

# Plot the data again, this time with Component 1 on the X axis and Component 3 on the Y axis.
x_axis = df_pca_kmeans['Component 1']
y_axis = df_pca_kmeans['Component 3']
plt.figure(figsize = (10, 8))
sns.scatterplot(data=df_pca_kmeans, x=x_axis, y=y_axis, 
                hue = df_pca_kmeans['Legend'], palette = ['g', 'r', 'c', 'm'])
plt.title('Clusters by PCA Components')
plt.show()

[Figure: Clusters by PCA Components, Component 3 against Component 1]

Data Export

Finally, we serialize the fitted scaler, PCA, and K-means models with pickle so they can be reused to score new data:

import pickle
pickle.dump(sc, open('scaler.pickle', 'wb'))
pickle.dump(pca, open('pca.pickle', 'wb'))
pickle.dump(kmeans_pca, open('kmeans_pca.pickle', 'wb'))
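To reuse these artifacts later, load them back and chain the transforms before predicting. A minimal sketch: pca_data[:5] merely stands in for new player rows with the same 42 columns, and pickles should be loaded with the same library versions used to save them.

# Reload the fitted objects and assign clusters to new observations
scaler = pickle.load(open('scaler.pickle', 'rb'))
pca_model = pickle.load(open('pca.pickle', 'rb'))
model = pickle.load(open('kmeans_pca.pickle', 'rb'))

new_scores = pca_model.transform(scaler.transform(pca_data[:5]))
print(model.predict(new_scores))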