# **Measures of central tendency and dispersion summarize a distribution, but they are not sufficient to describe its shape.**

# **For this purpose, we use two further statistical measures that compare the shape with the normal curve: *Skewness* and *Kurtosis*.**

# **Normal Distribution**

# Normal distribution is a pattern of data where most values cluster around the average (mean), with fewer and fewer values as you move away from the average in either direction.

# It is often represented as a bell-shaped curve.

# Key features of the normal distribution:

# • Mode, median and mean are the same and are together in the centre of the curve

# • Symmetrical shape

# • There can only be one mode (i.e. there is only one value which is most frequently observed).

# Most of the data are clustered around the centre, while more extreme values on either side become rarer as the distance from the centre increases: about 68% of values lie within one standard deviation (σ) of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations. This is known as the empirical rule or the 3-sigma rule.
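# The empirical rule can be checked numerically. The sketch below draws a synthetic normal sample (the seed, mean, standard deviation and sample size are arbitrary assumptions, not values from this notebook) and measures the fraction of values within 1, 2 and 3 standard deviations of the mean.

```python
import numpy as np

rng = np.random.default_rng(0)                     # fixed seed so the run is repeatable
x = rng.normal(loc=50, scale=10, size=200_000)     # synthetic normal sample

mean, sigma = x.mean(), x.std()

# Fraction of values that fall within k standard deviations of the mean
coverage = {k: np.mean(np.abs(x - mean) <= k * sigma) for k in (1, 2, 3)}

for k, frac in coverage.items():
    print(f"within {k} sigma: {frac:.3f}")
```

# The printed fractions should land close to 0.68, 0.95 and 0.997; they will not match exactly because the sample is finite.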

# **a. Skewness**

# Skewness is a statistical number that tells us if a distribution is symmetric or not.

# A distribution is symmetric if the right side of the distribution is a mirror image of the left side.

# If a distribution is symmetric, its Skewness value is 0. For a symmetric distribution with a single peak, such as the normal distribution, mean = median = mode.

# • If Skewness is greater than 0, the distribution is right-skewed (positively skewed): the right tail is longer than the left tail.

# • If Skewness is less than 0, the distribution is left-skewed (negatively skewed): the left tail is longer than the right tail.
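# The sign convention can be illustrated with synthetic samples. This is a minimal sketch: the three series below are made-up data (normal, exponential, and a mirrored exponential), chosen only because their tail behaviour is known.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

symmetric = pd.Series(rng.normal(size=50_000))          # no long tail on either side
right_skewed = pd.Series(rng.exponential(size=50_000))  # long right tail
left_skewed = -right_skewed                             # mirror image: long left tail

print("symmetric   :", symmetric.skew())
print("right-skewed:", right_skewed.skew())
print("left-skewed :", left_skewed.skew())
```

# The first value should be close to 0, the second clearly positive, and the third the negative of the second.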

# **Pearson's Coefficient of skewness (Sk) = (Mean - Mode)/σ**
# **In practice, this coefficient usually lies between -1 and +1.**
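# Pearson's coefficient can be computed directly from the formula. The helper `pearson_mode_skewness` and the small sample below are hypothetical, written only for illustration. Note that pandas `.skew()` uses a moment-based, bias-corrected estimator rather than Pearson's coefficient, so the two numbers generally differ even when their signs agree.

```python
import pandas as pd

def pearson_mode_skewness(values: pd.Series) -> float:
    """(mean - mode) / standard deviation; takes the first mode if there are ties."""
    mode = values.mode().iloc[0]
    return (values.mean() - mode) / values.std()   # .std() is the sample std (ddof=1)

# Made-up sample with one large value pulling the mean to the right of the mode
sample = pd.Series([1, 2, 2, 2, 3, 3, 4, 5, 9])

sk = pearson_mode_skewness(sample)
moment_skew = sample.skew()

print("Pearson (mode) skewness:", sk)
print("pandas .skew()         :", moment_skew)
```

# Both values are positive here, which matches the long right tail created by the value 9.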

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

In [3]:
df = sns.load_dataset('mpg')
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino


In [4]:
df['mpg'].skew()   # positive skew for mpg column as skewness > 0

0.45706634399491913

In [7]:
df['horsepower'].skew()

1.0873262824048695

# Here the skewness is greater than 1, so the horsepower column is strongly right-skewed and far from a normal distribution. Many statistical methods work best on roughly normal data, so to reduce the skewness we can take the square root of the column:

In [12]:
df['horsepower_new'] = df['horsepower'] ** (1/2)  # store the square root in a new column so the original data stays unchanged

In [13]:
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name,horsepower_new
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu,11.401754
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320,12.845233
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite,12.247449
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst,12.247449
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino,11.83216


In [14]:
df['horsepower_new'].skew()   # skewness is now below 1: closer to symmetric, though still right-skewed

0.7327195325530649

# **To reduce the skewness further, we take the cube root of the horsepower column.**

In [15]:
df['horsepower_new'] = df['horsepower'] ** (1/3)

In [16]:
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name,horsepower_new
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu,5.065797
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320,5.484807
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite,5.313293
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst,5.313293
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino,5.192494


In [17]:
df['horsepower_new'].skew()   # skewness is now lower than the square-root result (0.73)

0.6132504951168221

# Conversely, squaring or cubing a positive, right-skewed column generally increases its skewness.
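# The effect of these power transformations can be compared side by side. The sketch below uses a synthetic, strictly positive, right-skewed sample (a gamma distribution with arbitrary parameters) instead of the mpg dataset, so it runs without downloading anything.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Synthetic right-skewed data (an assumption, not the horsepower column)
x = pd.Series(rng.gamma(shape=2.0, scale=50.0, size=20_000))

skews = {
    "x ** 2": (x ** 2).skew(),            # square: stretches the right tail
    "x": x.skew(),                        # original data
    "x ** (1/2)": (x ** (1 / 2)).skew(),  # square root: compresses the right tail
    "x ** (1/3)": (x ** (1 / 3)).skew(),  # cube root: compresses it further
}

for name, value in skews.items():
    print(f"{name:>10}: {value:.3f}")
```

# For this sample the skewness decreases from the squared data down to the cube root, mirroring the pattern seen on the horsepower column.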

# **b. Kurtosis**

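# Kurtosis is a statistical number that tells us how heavy the tails of a distribution are compared with a normal distribution. It is often described in terms of whether the peak is taller or flatter than that of a normal distribution.

# If a distribution is similar to the normal distribution, the (excess) Kurtosis value is 0. If Kurtosis is greater than 0, then it has a higher peak and heavier tails compared to the normal distribution.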

# If Kurtosis is less than 0, then it is flatter than a normal distribution.

# **Kurtosis = E[(X - μ)^4] / σ^4 - 3**
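# The formula can be applied directly with NumPy. The function `excess_kurtosis` below is a hypothetical helper that uses the plain population-style moments from the formula. pandas `.kurtosis()` applies a bias correction, so it can differ slightly from this value, especially on small samples.

```python
import numpy as np

def excess_kurtosis(x) -> float:
    """Estimate E[(X - mu)^4] / sigma^4 - 3 from a sample."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma = x.std()                      # ddof=0, matching sigma in the formula
    return np.mean((x - mu) ** 4) / sigma ** 4 - 3

rng = np.random.default_rng(3)

# Synthetic samples with known theoretical excess kurtosis: 0 (normal) and -1.2 (uniform)
normal_k = excess_kurtosis(rng.normal(size=200_000))
uniform_k = excess_kurtosis(rng.uniform(size=200_000))

print("normal :", normal_k)
print("uniform:", uniform_k)
```

# The normal sample should give a value near 0 and the uniform sample a value near -1.2, up to sampling noise.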

In [18]:
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name,horsepower_new
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu,5.065797
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320,5.484807
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite,5.313293
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst,5.313293
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino,5.192494


In [19]:
df['mpg'].kurtosis()  # kurtosis for the mpg column is below 0, so it is flatter than a normal distribution

-0.5107812652123154

In [20]:
df['acceleration'].kurtosis()  # kurtosis for the acceleration column is above 0, so it has a higher peak than a normal distribution

0.4194968829706842

# Distributions fall into three types by kurtosis:

# **Leptokurtic:** Sharply peaked with fat tails. Kurtosis > 0

# **Mesokurtic:** Medium peaked, same as the normal distribution. Kurtosis = 0

# **Platykurtic:** Flatter peak with thin tails. Kurtosis < 0
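# The three types can be illustrated with one synthetic sample each. This is a rough sketch: `kurtosis_type` is a hypothetical helper, and the tolerance `tol` used to decide what counts as "about 0" is an arbitrary assumption, since a real sample never has a kurtosis of exactly 0.

```python
import numpy as np
import pandas as pd

def kurtosis_type(series: pd.Series, tol: float = 0.1) -> str:
    """Label a sample by the sign of its excess kurtosis; tol is an arbitrary cutoff."""
    k = series.kurtosis()
    if k > tol:
        return "leptokurtic"
    if k < -tol:
        return "platykurtic"
    return "mesokurtic"

rng = np.random.default_rng(4)
n = 100_000

samples = {
    "laplace": pd.Series(rng.laplace(size=n)),   # heavy tails, excess kurtosis near 3
    "normal": pd.Series(rng.normal(size=n)),     # reference case, near 0
    "uniform": pd.Series(rng.uniform(size=n)),   # thin tails, near -1.2
}

labels = {name: kurtosis_type(s) for name, s in samples.items()}
print(labels)
```

# With these samples, the Laplace data is labelled leptokurtic, the normal data mesokurtic, and the uniform data platykurtic.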