## Univariate Statistics

- measure distribution of a single variable 

- 4 types of measures
    - central tendency
        - mean : average of all values
        - median : middle value when sorted
        - mode : most frequent value
    - dispersion
        - range : difference between max and min
        - variance : average of squared differences from mean
        - standard deviation : square root of variance
    - distribution shape
        - Skewness: Measure of asymmetry in the data
        - Kurtosis: Measure of the “tailedness” of the data
            - Indicates how much of the data lies in the tails, i.e. how outlier-prone the distribution is
            - Normal distribution has a kurtosis of 3 (Pearson); Fisher (excess) kurtosis = Pearson - 3
            - Higher kurtosis: more data in the tails
            - Lower kurtosis: less data in the tails
    - graphical representation
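
The central tendency and dispersion measures listed above can be computed by hand on a small sample. This is a minimal sketch using Python's standard `statistics` module; the `heights_cm` values are made up for illustration and do not come from the dataset used later.

```python
import statistics

# Hypothetical sample of player heights in cm (illustrative values only)
heights_cm = [178, 183, 183, 187, 190, 194]

# Central tendency
mean = statistics.mean(heights_cm)      # sum of values / number of values
median = statistics.median(heights_cm)  # middle value; here the average of the two middle values
mode = statistics.mode(heights_cm)      # most frequent value

# Dispersion
value_range = max(heights_cm) - min(heights_cm)  # difference between max and min
variance = statistics.pvariance(heights_cm)      # average of squared differences from the mean
std_dev = statistics.pstdev(heights_cm)          # square root of the variance above

print(mean, median, mode, value_range, variance, std_dev)
```

`pvariance` and `pstdev` divide by n (population formulas); `statistics.variance` and `statistics.stdev` divide by n - 1 instead.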


- variance use
    - Variance is used to measure how spread out the values in a data set are
    - A high variance indicates that the data points are very spread out from the mean
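
To make "average of squared differences from the mean" concrete, the sketch below works through the calculation on made-up values and compares it with pandas. One detail worth knowing: `Series.var()` defaults to the sample variance (`ddof=1`, dividing by n - 1), so `ddof=0` is needed to match the plain average.

```python
import pandas as pd

# Hypothetical values, chosen so the arithmetic is easy to follow
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()
squared_diffs = (values - mean) ** 2                  # squared distance of each point from the mean
population_var = squared_diffs.mean()                 # divide by n
sample_var = squared_diffs.sum() / (len(values) - 1)  # divide by n - 1

print(population_var, values.var(ddof=0))  # these two agree
print(sample_var, values.var())            # these two agree
```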

- standard deviation use
    - Expresses spread in the same units as the data, which makes it easier to interpret than variance
    - Standard deviation describes how far values typically deviate from the mean
    - For normally distributed data, each standard deviation away from the mean covers a known percentage of the data

- skewness : none (normal distribution, Gaussian, bell curve)
    - 68% of data falls within 1 standard deviation
    - 95% of data falls within 2 standard deviations
    - 99.7% of data falls within 3 standard deviations
    - Mean = Median = Mode
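
The 68 / 95 / 99.7 percentages above can be checked against simulated data. This is a rough sketch, assuming NumPy is available; `share_within` is a hypothetical helper and the sample is randomly generated, not taken from the soccer dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=0.0, scale=1.0, size=100_000)  # simulated normal data

mean, std = sample.mean(), sample.std()

def share_within(k: int) -> float:
    """Fraction of the sample within k standard deviations of the mean."""
    return float(np.mean(np.abs(sample - mean) <= k * std))

for k in (1, 2, 3):
    print(k, round(share_within(k), 3))
```

The printed shares should land close to 0.68, 0.95, and 0.997; they will not match exactly because the sample is finite.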

- skewness : positive / negative
    - Positive Skewness: Tail extends farther toward larger values (right side of the x-axis)
    - Negative Skewness: Tail extends farther toward smaller values (left side of the x-axis)
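
The sign convention can be seen on a toy example. This sketch uses made-up numbers; mirroring a right-tailed sample around zero gives a left-tailed one with the opposite skewness.

```python
import pandas as pd

# Hypothetical data: mostly small values with one large value (long right tail)
right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 20])
# Mirror image of the same data: long left tail
left_tailed = -right_tailed

print(right_tailed.skew())  # positive
print(left_tailed.skew())   # negative, same magnitude

# In the right-tailed sample the large value pulls the mean above the median
print(right_tailed.mean(), right_tailed.median())
```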

In [23]:
import pandas as pd

url = "https://raw.githubusercontent.com/nkmwicz/data-for-students/refs/heads/main/soccer-players.csv"
df = pd.read_csv(url)
df.head()


| | player_id | first_name | last_name | name | last_season | current_club_id | player_code | country_of_birth | city_of_birth | country_of_citizenship | ... | foot | height_in_cm | contract_expiration_date | agent_name | image_url | url | current_club_domestic_competition_id | current_club_name | market_value_in_eur | highest_market_value_in_eur |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | Miroslav | Klose | Miroslav Klose | 2015 | 398 | miroslav-klose | Poland | Opole | Germany | ... | right | 184.0 | | ASBW Sport Marketing | https://img.a.transfermarkt.technology/portrai... | https://www.transfermarkt.co.uk/miroslav-klose... | IT1 | Società Sportiva Lazio S.p.A. | 1000000.0 | 30000000.0 |
| 1 | 26 | Roman | Weidenfeller | Roman Weidenfeller | 2017 | 16 | roman-weidenfeller | Germany | Diez | Germany | ... | left | 190.0 | | Neubauer 13 GmbH | https://img.a.transfermarkt.technology/portrai... | https://www.transfermarkt.co.uk/roman-weidenfe... | L1 | Borussia Dortmund | 750000.0 | 8000000.0 |
| 2 | 65 | Dimitar | Berbatov | Dimitar Berbatov | 2015 | 1091 | dimitar-berbatov | Bulgaria | Blagoevgrad | Bulgaria | ... | | | | CSKA-AS-23 Ltd. | https://img.a.transfermarkt.technology/portrai... | https://www.transfermarkt.co.uk/dimitar-berbat... | GR1 | Panthessalonikios Athlitikos Omilos Konstantin... | 1000000.0 | 34500000.0 |
| 3 | 77 | | Lúcio | Lúcio | 2012 | 506 | lucio | Brazil | Brasília | Brazil | ... | | | | | https://img.a.transfermarkt.technology/portrai... | https://www.transfermarkt.co.uk/lucio/profil/s... | IT1 | Juventus Football Club | 200000.0 | 24500000.0 |
| 4 | 80 | Tom | Starke | Tom Starke | 2017 | 27 | tom-starke | East Germany (GDR) | Freital | Germany | ... | right | 194.0 | | IFM | https://img.a.transfermarkt.technology/portrai... | https://www.transfermarkt.co.uk/tom-starke/pro... | L1 | FC Bayern München | 100000.0 | 3000000.0 |


In [24]:
def convert_height(cm: float) -> float:
    """
    Convert centimeters to feet.

    Args:
        cm (float): height in centimeters
    Returns:
        float: height in feet
    """
    # 1 cm = 0.393701 in; 12 in = 1 ft
    return (cm * 0.393701) / 12

In [25]:
df["height_in_ft"] = df["height_in_cm"].apply(convert_height)
df["height_in_ft"]

0        6.036749
1        6.233599
2             NaN
3             NaN
4        6.364833
           ...   
32398    5.839898
32399    5.905515
32400    5.380580
32401         NaN
32402         NaN
Name: height_in_ft, Length: 32403, dtype: float64
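
The same conversion can also be written as plain column arithmetic instead of `apply`. This is a sketch on a small stand-in frame (`heights` is hypothetical, with values mirroring the first rows above); it divides by the exact 2.54 cm per inch, so results can differ from the `0.393701` version in the last few decimal places.

```python
import numpy as np
import pandas as pd

CM_PER_INCH = 2.54
INCHES_PER_FOOT = 12

# Tiny stand-in for the real DataFrame
heights = pd.DataFrame({"height_in_cm": [184.0, 190.0, np.nan, np.nan, 194.0]})

# Arithmetic on a Series is applied element-wise, and NaN stays NaN
heights["height_in_ft"] = heights["height_in_cm"] / CM_PER_INCH / INCHES_PER_FOOT

print(heights)
```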

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32403 entries, 0 to 32402
Data columns (total 24 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   player_id                             32403 non-null  int64  
 1   first_name                            30345 non-null  object 
 2   last_name                             32403 non-null  object 
 3   name                                  32403 non-null  object 
 4   last_season                           32403 non-null  int64  
 5   current_club_id                       32403 non-null  int64  
 6   player_code                           32403 non-null  object 
 7   country_of_birth                      29600 non-null  object 
 8   city_of_birth                         29942 non-null  object 
 9   country_of_citizenship                32023 non-null  object 
 10  date_of_birth                         32356 non-null  object 
 11  sub_position   

In [27]:
df.describe()

| | player_id | last_season | current_club_id | height_in_cm | market_value_in_eur | highest_market_value_in_eur | height_in_ft |
|---|---:|---:|---:|---:|---:|---:|---:|
| count | 32403.0 | 32403.0 | 32403.0 | 30059.0 | 30869.0 | 30869.0 | 30059.0 |
| mean | 344578.9 | 2019.337777 | 4809.252045 | 182.291161 | 1611350.0 | 3675045.0 | 5.980684 |
| std | 280683.1 | 3.965914 | 11601.982978 | 6.971224 | 6362017.0 | 9653897.0 | 0.228715 |
| min | 10.0 | 2012.0 | 3.0 | 17.0 | 10000.0 | 10000.0 | 0.557743 |
| 25% | 106698.0 | 2016.0 | 403.0 | 178.0 | 100000.0 | 275000.0 | 5.839898 |
| 50% | 282421.0 | 2020.0 | 1071.0 | 183.0 | 250000.0 | 800000.0 | 6.00394 |
| 75% | 524208.5 | 2023.0 | 3060.0 | 187.0 | 700000.0 | 2800000.0 | 6.135174 |
| max | 1309800.0 | 2024.0 | 110302.0 | 207.0 | 200000000.0 | 200000000.0 | 6.791342 |
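
One way to read the `mean`, `std`, `min`, and `max` rows above is to express the extremes as z-scores, i.e. the number of standard deviations from the mean. The sketch below is an illustration rather than part of the original notebook; `z_score` is a hypothetical helper, and the constants are copied from the `height_in_cm` column of the summary.

```python
# Summary values copied from the df.describe() output for height_in_cm
mean_cm = 182.291161
std_cm = 6.971224

def z_score(value: float, mean: float, std: float) -> float:
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / std

z_max = z_score(207.0, mean_cm, std_cm)  # tallest recorded height
z_min = z_score(17.0, mean_cm, std_cm)   # shortest recorded height

print(round(z_max, 2), round(z_min, 2))
```

The maximum sits a few standard deviations above the mean, while the 17 cm minimum sits more than twenty below it. That suggests the minimum may be a data-entry error worth inspecting before interpreting skewness or kurtosis.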


In [28]:
df["height_in_ft"].std()

0.22871483401772097

In [29]:
df["height_in_ft"].var()  # not echoed: a notebook cell displays only its last expression
df["height_in_ft"].std()

0.22871483401772097
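
Because a cell only echoes its last expression, one option for seeing variance and standard deviation together is `Series.agg` with a list of method names. This is a sketch on made-up values; `heights_ft` is a hypothetical stand-in for `df["height_in_ft"]`.

```python
import pandas as pd

# Hypothetical heights in feet
heights_ft = pd.Series([5.84, 5.91, 6.04, 6.23, 6.36])

# agg with a list of method names returns one labelled value per statistic
spread = heights_ft.agg(["var", "std"])
print(spread)
```

Squaring the `std` entry recovers the `var` entry, which is a quick sanity check on both.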

In [None]:
df["height_in_ft"].mean()

np.float64(5.980684355462036)

In [32]:
df["height_in_ft"].median()

6.00394025

In [33]:
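# mode(0) is read as dropna=False, so NaN (the missing heights) counts as a value and wins;
# df["height_in_ft"].mode()[0] would return the most frequent actual height instead
df["height_in_ft"].mode(0)[0]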

np.float64(nan)
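
The `nan` result above is consistent with how `Series.mode` handles missing values: its first parameter is `dropna`, so `mode(0)` is interpreted as `dropna=False`. A small sketch on made-up data shows the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical column where missing values outnumber any single real value
s = pd.Series([6.0, 6.0, 5.9, np.nan, np.nan, np.nan])

with_nan = s.mode(dropna=False)  # NaN is treated as a value and is the most frequent
without_nan = s.mode()           # default dropna=True ignores NaN

print(with_nan[0], without_nan[0])
```

`mode` returns a Series because several values can tie for most frequent; `[0]` picks the first of them.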

In [35]:
df["height_in_ft"].min()  # not echoed: a notebook cell displays only its last expression
df["height_in_ft"].max()

6.7913422500000005
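
The range from the notes (difference between max and min) follows directly from these two calls. This is a sketch on a hypothetical `heights_ft` series rather than the real column.

```python
import pandas as pd

# Hypothetical heights in feet, including one missing value
heights_ft = pd.Series([5.84, 5.91, 6.04, None, 6.36])

lowest = heights_ft.min()    # missing values are skipped by default
highest = heights_ft.max()
value_range = highest - lowest

print(lowest, highest, value_range)
```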

In [36]:
df["height_in_ft"].skew()

np.float64(-1.775147504851816)

In [37]:
from scipy import stats

In [38]:
# scipy's default is Fisher (excess) kurtosis, so a normal distribution scores 0
stats.kurtosis(df["height_in_cm"], nan_policy="omit")

np.float64(40.36548066183285)
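
`scipy.stats.kurtosis` uses the Fisher definition by default (`fisher=True`), so the value above is excess kurtosis relative to a normal distribution. The sketch below compares both definitions on simulated normal data; the sample is randomly generated and is not part of the soccer dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
normal_sample = rng.normal(size=200_000)  # simulated normal data

fisher = stats.kurtosis(normal_sample)                 # default fisher=True: excess kurtosis
pearson = stats.kurtosis(normal_sample, fisher=False)  # Pearson definition

print(round(fisher, 2), round(pearson, 2))
```

For normal data the Fisher value should be close to 0 and the Pearson value close to 3; the two always differ by exactly 3.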

In [None]:
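def eda(df) -> None:
    """
    Evaluate basic statistics on a dataframe

    Args:
        df (pandas.DataFrame): dataframe to evaluate

    Returns:
        None
    """
    print(f"{'=' * 5}DF Shape: {df.shape}{'=' * 5}")
    num_cols = []
    cat_cols = []
    for col in df.columns:
        # Check the column's data, not its name, to decide numeric vs. categorical
        if pd.api.types.is_numeric_dtype(df[col]):
            # The original cell stops at the `if`; the branches and prints below are an assumed completion
            num_cols.append(col)
        else:
            cat_cols.append(col)
    print(f"Numeric columns: {num_cols}")
    print(f"Categorical columns: {cat_cols}")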
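
Splitting columns into numeric and categorical groups can also be done without a loop, using `DataFrame.select_dtypes`. This is a sketch of that alternative on a small made-up frame (`sample_df` is hypothetical); it is not necessarily how the original function was meant to continue.

```python
import pandas as pd

# Small hypothetical frame with a mix of text and numeric columns
sample_df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "foot": ["right", "left", "right"],
    "height_in_cm": [184.0, 190.0, 194.0],
    "last_season": [2015, 2017, 2017],
})

# "number" matches all numeric dtypes (ints and floats)
num_cols = sample_df.select_dtypes(include="number").columns.tolist()
cat_cols = sample_df.select_dtypes(exclude="number").columns.tolist()

print(num_cols)
print(cat_cols)
```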