BY: **RIYA JOSHI**

EMAIL: riya.joshi@somaiya.edu



---


### **Basic idea behind PCA**:
* Principal Component Analysis (PCA) is an algorithm that transforms the columns of a dataset into a new set of features called principal components. Most of the information (variance) in the full dataset is thereby compressed into fewer feature columns. This enables dimensionality reduction and makes it possible to visualize the separation of classes or clusters, if any.
* It reduces high-dimensional data to lower dimensions while capturing as much of the dataset's variability as possible.
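
As a minimal illustration of this idea (on synthetic data, not the FIFA dataset used below), the following sketch builds two strongly correlated features and checks how much of the total variance a single principal direction captures. The data, the seed, and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated features: the second is the first plus a little noise.
x = rng.normal(size=500)
X = np.column_stack([x, x + 0.1 * rng.normal(size=500)])

# Eigen-decomposition of the covariance matrix of the centered data.
X_centered = X - X.mean(axis=0)
eigen_values, _ = np.linalg.eigh(np.cov(X_centered, rowvar=False))

# Fraction of total variance along the direction with the largest eigenvalue.
top_share = eigen_values.max() / eigen_values.sum()
print(f"Variance captured by the first component: {top_share:.3f}")
```

Because the two features are almost redundant, one component should carry nearly all of the variance, which is the situation in which PCA compresses data well.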


### **Applications of PCA**:

* Data visualization is one of the most common applications of PCA.
* PCA is also used to speed up the training of an algorithm by reducing the number of dimensions of the data.
* Exploring high-dimensional data in banking and finance to reveal suspicious activity.




---




In [1]:
# Importing required libraries
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('players_20.csv') #importing dataset
df.head() # displaying first five records

Unnamed: 0,sofifa_id,player_url,short_name,long_name,age,dob,height_cm,weight_kg,nationality,club,overall,potential,value_eur,wage_eur,player_positions,preferred_foot,international_reputation,weak_foot,skill_moves,work_rate,body_type,real_face,release_clause_eur,player_tags,team_position,team_jersey_number,loaned_from,joined,contract_valid_until,nation_position,nation_jersey_number,pace,shooting,passing,dribbling,defending,physic,player_traits,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,skill_dribbling,skill_curve,skill_fk_accuracy,skill_long_passing,skill_ball_control,movement_acceleration,movement_sprint_speed,movement_agility,movement_reactions,movement_balance,power_shot_power,power_jumping,power_stamina,power_strength,power_long_shots,mentality_aggression,mentality_interceptions,mentality_positioning,mentality_vision,mentality_penalties,mentality_composure,defending_marking,defending_standing_tackle,defending_sliding_tackle,goalkeeping_diving,goalkeeping_handling,goalkeeping_kicking,goalkeeping_positioning,goalkeeping_reflexes
0,158023,https://sofifa.com/player/158023/lionel-messi/...,L. Messi,Lionel Andrés Messi Cuccittini,32,24-06-1987,170,72,Argentina,FC Barcelona,94,94,95500000,565000,"RW, CF, ST",Left,5,4,4,Medium/Low,Messi,Yes,195800000.0,"#Dribbler, #Distance Shooter, #Crosser, #FK Sp...",RW,10.0,,01-07-2004,2021.0,,,87.0,92.0,92.0,96.0,39.0,66.0,"Beat Offside Trap, Argues with Officials, Earl...",88,95,70,92,88,97,93,94,92,96,91,84,93,95,95,86,68,75,68,94,48,40,94,94,75,96,33,37,26,6,11,15,14,8
1,20801,https://sofifa.com/player/20801/c-ronaldo-dos-...,Cristiano Ronaldo,Cristiano Ronaldo dos Santos Aveiro,34,05-02-1985,187,83,Portugal,Juventus,93,93,58500000,405000,"ST, LW",Right,5,4,5,High/Low,C. Ronaldo,Yes,96500000.0,"#Speedster, #Dribbler, #Distance Shooter, #Acr...",LW,7.0,,10-07-2018,2022.0,LS,7.0,90.0,93.0,82.0,89.0,35.0,78.0,"Long Throw-in, Selfish, Argues with Officials,...",84,94,89,83,87,89,81,76,77,92,89,91,87,96,71,95,95,85,78,93,63,29,95,82,85,95,28,32,24,7,11,15,14,11
2,190871,https://sofifa.com/player/190871/neymar-da-sil...,Neymar Jr,Neymar da Silva Santos Junior,27,05-02-1992,175,68,Brazil,Paris Saint-Germain,92,92,105500000,290000,"LW, CAM",Right,5,5,5,High/Medium,Neymar,Yes,195200000.0,"#Speedster, #Dribbler, #Playmaker , #Crosser,...",CAM,10.0,,03-08-2017,2022.0,LW,10.0,91.0,85.0,87.0,95.0,32.0,58.0,"Power Free-Kick, Injury Free, Selfish, Early C...",87,87,62,87,87,96,88,87,81,95,94,89,96,92,84,80,61,81,49,84,51,36,87,90,90,94,27,26,29,9,9,15,15,11
3,200389,https://sofifa.com/player/200389/jan-oblak/20/...,J. Oblak,Jan Oblak,26,07-01-1993,188,87,Slovenia,Atlético Madrid,91,93,77500000,125000,GK,Right,3,3,1,Medium/Medium,Normal,Yes,164700000.0,,GK,13.0,,16-07-2014,2023.0,GK,1.0,,,,,,,"Flair, Acrobatic Clearance",13,11,15,43,13,12,13,14,40,30,43,60,67,88,49,59,78,41,78,12,34,19,11,65,11,68,27,12,18,87,92,78,90,89
4,183277,https://sofifa.com/player/183277/eden-hazard/2...,E. Hazard,Eden Hazard,28,07-01-1991,175,74,Belgium,Real Madrid,91,91,90000000,470000,"LW, CF",Right,4,4,4,High/Medium,Normal,Yes,184500000.0,"#Speedster, #Dribbler, #Acrobat",LW,7.0,,01-07-2019,2024.0,LF,10.0,91.0,83.0,86.0,94.0,35.0,66.0,"Beat Offside Trap, Selfish, Finesse Shot, Spee...",81,84,61,89,83,95,83,79,83,94,94,88,95,90,94,82,56,84,63,80,54,41,87,89,88,91,34,27,22,11,12,6,8,8


In [3]:
df.shape # in the form of (rows,cols)

(18278, 72)

In [4]:
# replacing null values with 0 (e.g. goalkeepers have no pace/shooting/passing ratings)
df = df.fillna(value=0)
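
Before filling, it can help to see which columns actually contain missing values, since 0 is a meaningful value on most rating scales. A minimal sketch on a tiny illustrative frame (the frame `toy` and its contents are assumptions for the example, not read from `players_20.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame: the second row mimics a goalkeeper with no outfield ratings.
toy = pd.DataFrame({
    "overall": [94, 91],
    "pace": [87.0, np.nan],
    "shooting": [92.0, np.nan],
})

# Count missing values per column, keeping only columns that have any.
null_counts = toy.isna().sum()
null_counts = null_counts[null_counts > 0]
print(null_counts)

# Same fill strategy as the notebook: replace every NaN with 0.
toy_filled = toy.fillna(value=0)
```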

In [5]:
# modifying dataset to contain only numeric attributes
df = df.drop(columns=[
    'sofifa_id', 'player_url', 'short_name', 'long_name', 'dob',
    'nationality', 'club', 'player_positions', 'preferred_foot',
    'work_rate', 'body_type', 'real_face', 'player_tags', 'team_position',
    'loaned_from', 'joined', 'nation_position', 'nation_jersey_number',
    'player_traits',
])
df.head()

Unnamed: 0,age,height_cm,weight_kg,overall,potential,value_eur,wage_eur,international_reputation,weak_foot,skill_moves,release_clause_eur,team_jersey_number,contract_valid_until,pace,shooting,passing,dribbling,defending,physic,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,skill_dribbling,skill_curve,skill_fk_accuracy,skill_long_passing,skill_ball_control,movement_acceleration,movement_sprint_speed,movement_agility,movement_reactions,movement_balance,power_shot_power,power_jumping,power_stamina,power_strength,power_long_shots,mentality_aggression,mentality_interceptions,mentality_positioning,mentality_vision,mentality_penalties,mentality_composure,defending_marking,defending_standing_tackle,defending_sliding_tackle,goalkeeping_diving,goalkeeping_handling,goalkeeping_kicking,goalkeeping_positioning,goalkeeping_reflexes
0,32,170,72,94,94,95500000,565000,5,4,4,195800000.0,10.0,2021.0,87.0,92.0,92.0,96.0,39.0,66.0,88,95,70,92,88,97,93,94,92,96,91,84,93,95,95,86,68,75,68,94,48,40,94,94,75,96,33,37,26,6,11,15,14,8
1,34,187,83,93,93,58500000,405000,5,4,5,96500000.0,7.0,2022.0,90.0,93.0,82.0,89.0,35.0,78.0,84,94,89,83,87,89,81,76,77,92,89,91,87,96,71,95,95,85,78,93,63,29,95,82,85,95,28,32,24,7,11,15,14,11
2,27,175,68,92,92,105500000,290000,5,5,5,195200000.0,10.0,2022.0,91.0,85.0,87.0,95.0,32.0,58.0,87,87,62,87,87,96,88,87,81,95,94,89,96,92,84,80,61,81,49,84,51,36,87,90,90,94,27,26,29,9,9,15,15,11
3,26,188,87,91,93,77500000,125000,3,3,1,164700000.0,13.0,2023.0,0.0,0.0,0.0,0.0,0.0,0.0,13,11,15,43,13,12,13,14,40,30,43,60,67,88,49,59,78,41,78,12,34,19,11,65,11,68,27,12,18,87,92,78,90,89
4,28,175,74,91,91,90000000,470000,4,4,4,184500000.0,7.0,2024.0,91.0,83.0,86.0,94.0,35.0,66.0,81,84,61,89,83,95,83,79,83,94,94,88,95,90,94,82,56,84,63,80,54,41,87,89,88,91,34,27,22,11,12,6,8,8
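
As an alternative to listing every non-numeric column by hand, pandas can select columns by dtype. The sketch below uses a small illustrative frame (`toy` is an assumption for the example). On the real dataset this approach would also keep numeric columns such as `sofifa_id` and `nation_jersey_number`, which the cell above drops explicitly, so it is not a drop-in replacement.

```python
import pandas as pd

toy = pd.DataFrame({
    "sofifa_id": [158023, 20801],
    "short_name": ["L. Messi", "Cristiano Ronaldo"],
    "age": [32, 34],
    "preferred_foot": ["Left", "Right"],
    "overall": [94, 93],
})

# Keep numeric columns only, then drop the identifier, which carries no signal.
numeric = toy.select_dtypes(include="number").drop(columns=["sofifa_id"])
print(list(numeric.columns))
```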


In [6]:
# changing datatype to float (astype returns a new DataFrame, so assign it back)
df = df.astype(float)
df

Unnamed: 0,age,height_cm,weight_kg,overall,potential,value_eur,wage_eur,international_reputation,weak_foot,skill_moves,release_clause_eur,team_jersey_number,contract_valid_until,pace,shooting,passing,dribbling,defending,physic,attacking_crossing,attacking_finishing,attacking_heading_accuracy,attacking_short_passing,attacking_volleys,skill_dribbling,skill_curve,skill_fk_accuracy,skill_long_passing,skill_ball_control,movement_acceleration,movement_sprint_speed,movement_agility,movement_reactions,movement_balance,power_shot_power,power_jumping,power_stamina,power_strength,power_long_shots,mentality_aggression,mentality_interceptions,mentality_positioning,mentality_vision,mentality_penalties,mentality_composure,defending_marking,defending_standing_tackle,defending_sliding_tackle,goalkeeping_diving,goalkeeping_handling,goalkeeping_kicking,goalkeeping_positioning,goalkeeping_reflexes
0,32.0,170.0,72.0,94.0,94.0,95500000.0,565000.0,5.0,4.0,4.0,195800000.0,10.0,2021.0,87.0,92.0,92.0,96.0,39.0,66.0,88.0,95.0,70.0,92.0,88.0,97.0,93.0,94.0,92.0,96.0,91.0,84.0,93.0,95.0,95.0,86.0,68.0,75.0,68.0,94.0,48.0,40.0,94.0,94.0,75.0,96.0,33.0,37.0,26.0,6.0,11.0,15.0,14.0,8.0
1,34.0,187.0,83.0,93.0,93.0,58500000.0,405000.0,5.0,4.0,5.0,96500000.0,7.0,2022.0,90.0,93.0,82.0,89.0,35.0,78.0,84.0,94.0,89.0,83.0,87.0,89.0,81.0,76.0,77.0,92.0,89.0,91.0,87.0,96.0,71.0,95.0,95.0,85.0,78.0,93.0,63.0,29.0,95.0,82.0,85.0,95.0,28.0,32.0,24.0,7.0,11.0,15.0,14.0,11.0
2,27.0,175.0,68.0,92.0,92.0,105500000.0,290000.0,5.0,5.0,5.0,195200000.0,10.0,2022.0,91.0,85.0,87.0,95.0,32.0,58.0,87.0,87.0,62.0,87.0,87.0,96.0,88.0,87.0,81.0,95.0,94.0,89.0,96.0,92.0,84.0,80.0,61.0,81.0,49.0,84.0,51.0,36.0,87.0,90.0,90.0,94.0,27.0,26.0,29.0,9.0,9.0,15.0,15.0,11.0
3,26.0,188.0,87.0,91.0,93.0,77500000.0,125000.0,3.0,3.0,1.0,164700000.0,13.0,2023.0,0.0,0.0,0.0,0.0,0.0,0.0,13.0,11.0,15.0,43.0,13.0,12.0,13.0,14.0,40.0,30.0,43.0,60.0,67.0,88.0,49.0,59.0,78.0,41.0,78.0,12.0,34.0,19.0,11.0,65.0,11.0,68.0,27.0,12.0,18.0,87.0,92.0,78.0,90.0,89.0
4,28.0,175.0,74.0,91.0,91.0,90000000.0,470000.0,4.0,4.0,4.0,184500000.0,7.0,2024.0,91.0,83.0,86.0,94.0,35.0,66.0,81.0,84.0,61.0,89.0,83.0,95.0,83.0,79.0,83.0,94.0,94.0,88.0,95.0,90.0,94.0,82.0,56.0,84.0,63.0,80.0,54.0,41.0,87.0,89.0,88.0,91.0,34.0,27.0,22.0,11.0,12.0,6.0,8.0,8.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18273,22.0,186.0,79.0,48.0,56.0,40000.0,2000.0,1.0,3.0,2.0,70000.0,36.0,2019.0,57.0,23.0,28.0,33.0,47.0,51.0,21.0,17.0,40.0,35.0,27.0,24.0,23.0,21.0,28.0,35.0,56.0,58.0,45.0,40.0,73.0,36.0,70.0,60.0,47.0,16.0,46.0,48.0,28.0,25.0,39.0,41.0,45.0,50.0,52.0,5.0,5.0,13.0,5.0,14.0
18274,22.0,177.0,66.0,48.0,56.0,40000.0,2000.0,1.0,2.0,2.0,72000.0,31.0,2022.0,58.0,24.0,33.0,35.0,48.0,48.0,24.0,20.0,42.0,43.0,28.0,32.0,24.0,29.0,39.0,31.0,55.0,61.0,43.0,41.0,76.0,33.0,72.0,55.0,44.0,20.0,42.0,49.0,23.0,25.0,37.0,35.0,42.0,53.0,57.0,13.0,6.0,14.0,11.0,9.0
18275,19.0,186.0,75.0,48.0,56.0,40000.0,1000.0,1.0,2.0,2.0,70000.0,38.0,2019.0,54.0,35.0,44.0,45.0,48.0,51.0,32.0,33.0,49.0,53.0,32.0,40.0,32.0,32.0,55.0,49.0,55.0,54.0,52.0,52.0,57.0,48.0,60.0,50.0,51.0,26.0,50.0,45.0,38.0,38.0,36.0,39.0,46.0,52.0,46.0,7.0,8.0,10.0,6.0,14.0
18276,18.0,185.0,74.0,48.0,54.0,40000.0,1000.0,1.0,2.0,2.0,70000.0,33.0,2022.0,59.0,35.0,47.0,47.0,45.0,52.0,39.0,34.0,47.0,54.0,28.0,42.0,37.0,39.0,48.0,49.0,55.0,63.0,55.0,54.0,59.0,46.0,61.0,42.0,55.0,28.0,57.0,49.0,31.0,48.0,36.0,40.0,39.0,44.0,54.0,14.0,9.0,13.0,13.0,13.0


In [7]:
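# Standardization: rescale every column to zero mean and unit variance
scale = StandardScaler()
df = scale.fit_transform(df)  # note: returns a NumPy array, not a DataFrame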

In [8]:
df

array([[ 1.44233274, -1.68159832, -0.46489103, ..., -0.07301019,
        -0.1381906 , -0.48287516],
       [ 1.87180879,  0.83439432,  1.0959349 , ..., -0.07301019,
        -0.1381906 , -0.31655621],
       [ 0.36864262, -0.94160048, -1.0324641 , ..., -0.07301019,
        -0.07983402, -0.31655621],
       ...,
       [-1.34926158,  0.68639475, -0.03921123, ..., -0.3739755 ,
        -0.60504324, -0.15023727],
       [-1.5639996 ,  0.53839519, -0.1811045 , ..., -0.19339631,
        -0.19654718, -0.20567692],
       [ 0.15390459,  0.09439648,  0.38646857, ..., -0.13320325,
        -0.4299735 , -0.37199586]])
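
For reference, standardization is a column-wise z-score: subtract the column mean and divide by the column standard deviation. The sketch below reproduces that arithmetic with plain NumPy on a few `height_cm` and `value_eur` values taken from the first rows shown earlier; the variable names are assumptions for the example.

```python
import numpy as np

# 4 samples, 2 features on very different scales (height_cm, value_eur).
X = np.array([[170.0, 95_500_000.0],
              [187.0, 58_500_000.0],
              [175.0, 105_500_000.0],
              [188.0, 77_500_000.0]])

# Column-wise z-score: subtract the mean, divide by the standard deviation.
# ddof=0 (population std) matches StandardScaler's default behaviour.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

Without this step, a feature measured in euros would dominate the covariance matrix simply because of its scale.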

In [9]:
def PCA(X , num_components):
     
    # Step-1: Subtract the mean of each variable so that the data is centered on the origin.
    X_meaned = X - np.mean(X , axis = 0)
     
    # Step-2: Calculate the covariance matrix of the mean-centered data.
    # The covariance matrix is a square matrix holding the covariance of every pair of variables.
    # The covariance of a variable with itself is its variance, so the diagonal holds the variances.
    # rowvar=False tells NumPy that columns are variables and rows are observations.
    cov_mat = np.cov(X_meaned , rowvar = False)
     
    # Step-3: Compute the eigenvalues and eigenvectors of the covariance matrix.
    # The eigenvectors are orthogonal to each other, and each one represents a principal axis.
    # A higher eigenvalue corresponds to higher variance along its axis,
    # so the principal axis with the largest eigenvalue captures the most variability in the data.
    eigen_values , eigen_vectors = np.linalg.eigh(cov_mat)
     
    # Step-4: Sort the eigenvalues in descending order, together with their corresponding eigenvectors.
    sorted_index = np.argsort(eigen_values)[::-1]
    sorted_eigenvalue = eigen_values[sorted_index]
    sorted_eigenvectors = eigen_vectors[:,sorted_index]
     
    # Step-5: Select the first num_components columns of the sorted eigenvector matrix.
    # For example, num_components = 2 keeps the first two principal components, so the data is reduced to 2 variables;
    # num_components = 3 reduces it to 3 variables.
    eigenvector_subset = sorted_eigenvectors[:,0:num_components]
     
    # Step-6: Finally, project the mean-centered data onto the selected eigenvectors.
    # This is equivalent to (eigenvector_subset.T @ X_meaned.T).T, written without the extra transposes.
    X_reduced = np.dot(X_meaned, eigenvector_subset)
     
    return X_reduced
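
The sorted eigenvalues from Step-4 also indicate how many components are worth keeping: each eigenvalue divided by the sum of all eigenvalues is the share of total variance explained by that component. The helper below is a minimal sketch of that calculation; `explained_variance_ratio` and the synthetic `X_demo` are hypothetical names introduced for illustration and are not part of the notebook's `PCA` function.

```python
import numpy as np

def explained_variance_ratio(X):
    """Return the share of total variance per principal component, largest first."""
    X_meaned = X - np.mean(X, axis=0)
    cov_mat = np.cov(X_meaned, rowvar=False)
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    eigen_values = np.linalg.eigvalsh(cov_mat)[::-1]
    return eigen_values / eigen_values.sum()

# Synthetic example: 3 features, the third is nearly a copy of the first.
rng = np.random.default_rng(42)
a = rng.normal(size=300)
b = rng.normal(size=300)
X_demo = np.column_stack([a, b, a + 0.05 * rng.normal(size=300)])

ratio = explained_variance_ratio(X_demo)
cumulative = np.cumsum(ratio)
print(ratio)
print(cumulative)
```

One common heuristic is to pick the smallest number of components whose cumulative share exceeds a chosen threshold, such as 0.95.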

In [10]:
reduced = PCA(df , 5)
 
# Creating dataFrame of the reduced Dataset
principal_df = pd.DataFrame(reduced , columns = ['PC1','PC2','PC3','PC4','PC5'])

principal_df.head()

Unnamed: 0,PC1,PC2,PC3,PC4,PC5
0,-14.283325,-4.213115,-25.146282,-12.635302,14.164378
1,-12.121091,-2.767809,-19.115795,-5.022045,9.866685
2,-12.956257,-5.324816,-21.248942,-11.15658,12.309517
3,8.185054,-0.74408,-19.279528,-9.345629,7.573565
4,-12.715291,-4.254214,-21.606704,-11.234742,13.260292
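
One way to sanity-check an eigen-decomposition PCA is to compare it with a projection computed from the singular value decomposition of the centered data; the two should agree up to the sign of each component. The sketch below does this on synthetic data. `pca_eigh`, `pca_svd`, and `X_demo` are hypothetical names for the example, and `pca_eigh` is a compact restatement of the steps in `PCA` above, not the notebook's function itself.

```python
import numpy as np

def pca_eigh(X, num_components):
    """Covariance / eigen-decomposition PCA, following the six steps above."""
    X_meaned = X - np.mean(X, axis=0)
    eigen_values, eigen_vectors = np.linalg.eigh(np.cov(X_meaned, rowvar=False))
    order = np.argsort(eigen_values)[::-1]
    return X_meaned @ eigen_vectors[:, order[:num_components]]

def pca_svd(X, num_components):
    """Same projection computed from the SVD of the centered data."""
    X_meaned = X - np.mean(X, axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_meaned, full_matrices=False)
    return X_meaned @ Vt[:num_components].T

rng = np.random.default_rng(1)
# Synthetic data with clearly different variances per direction.
X_demo = rng.normal(size=(200, 4)) * np.array([5.0, 3.0, 1.0, 0.5])

scores_eigh = pca_eigh(X_demo, 2)
scores_svd = pca_svd(X_demo, 2)

# Each component is only defined up to a sign flip, so compare magnitudes.
print(np.allclose(np.abs(scores_eigh), np.abs(scores_svd)))
```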
