# Descriptive Statistics

- **Statistics**: The science of collecting, classifying, analyzing, and interpreting numerical facts or data; in short, a way to get information from data. By using the mathematical theory of probability, it imposes order and regularity on aggregates of more or less disparate elements.

- **Descriptive Statistics**: It deals with methods of organizing, summarizing, and presenting data in a convenient and informative way.

- **Measures of Central Tendency:**
    - **Mean:** $\mu=\frac{\sum\limits _{i=1} ^{N}x_{i}}{N}$
    - **Median:** The middle value once all the observations are placed in ascending (or descending) order.
        - If $n$ is odd, the median is the observation that falls in the middle, i.e. the $\left(\frac{n+1}{2}\right)^{th}$ item.
        - If $n$ is even, the median is the average of the two observations in the middle, i.e. the mean of the $\left(\frac{n}{2}\right)^{th}$ and $\left(\frac{n}{2}+1\right)^{th}$ items.
    - **Mode:** The observation (or observations) with the highest frequency.
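
The three definitions above translate almost line for line into code. The following is a minimal sketch in plain Python; the helper names `mean`, `median`, and `modes` are illustrative only and are not part of pandas or any other library.

```python
from collections import Counter

def mean(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    return sum(values) / len(values)

def median(values):
    """Middle observation of the sorted data (average of the two middle ones if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the ((n + 1) / 2)-th item, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2  # mean of the (n/2)-th and (n/2 + 1)-th items

def modes(values):
    """All observations that share the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

data = [3, 0, 2, 3, 4, 1]
print(mean(data), median(data), modes(data))
```

`modes` returns a list because a data set can have more than one mode, which is also why the pandas `mode()` method returns a Series rather than a single number.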

## Calculating the measures of central tendency 

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Create a pandas Series
s1 = pd.Series([2,1,0,4,3,15])
s1

0     2
1     1
2     0
3     4
4     3
5    15
dtype: int64

### Calculating Mean

In [3]:
# Compute arithmetic mean
s1.mean().round(2)

4.17

In [4]:
from scipy import stats

# Compute trimmed mean: with a cut proportion of 0.3 and 6 values, int(0.3 * 6) = 1
# observation is removed from each end of the sorted data before averaging.
stats.trim_mean(s1, 0.3).round(2)

2.5
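
To see where 2.5 comes from, the trimming can be reproduced by hand. This is a sketch in plain Python, not SciPy's implementation; the function name `trimmed_mean` is hypothetical, and it assumes the number of observations cut from each end is the cut proportion times the sample size, truncated to an integer.

```python
def trimmed_mean(values, proportion_to_cut):
    """Mean after dropping int(proportion * n) observations from each end of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    cut = int(proportion_to_cut * n)   # truncates toward zero, e.g. int(1.8) == 1
    if 2 * cut >= n:
        raise ValueError("proportion_to_cut is too large for this sample size")
    kept = ordered[cut:n - cut]
    return sum(kept) / len(kept)

# Sorted data is [0, 1, 2, 3, 4, 15]; one value is cut from each end, leaving [1, 2, 3, 4]
print(trimmed_mean([2, 1, 0, 4, 3, 15], 0.3))   # 2.5
```

The outlier 15 is discarded, which is why the trimmed mean (2.5) is much lower than the arithmetic mean (4.17).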

In [5]:
# Compute Weighted Mean
def weighted_average(df, value, weight):
    val = df[value]
    wt = df[weight]
    return (val * wt).sum() / wt.sum()
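
The notebook defines `weighted_average` but does not call it. The sketch below shows one possible call; the DataFrame `scores` and its column names are hypothetical data invented for illustration, and the function is repeated so that the block runs on its own.

```python
import pandas as pd

def weighted_average(df, value, weight):
    # Same helper as above: sum(value * weight) / sum(weight)
    val = df[value]
    wt = df[weight]
    return (val * wt).sum() / wt.sum()

# Hypothetical data: average score of three classes, weighted by class size
scores = pd.DataFrame({
    'avg_score': [70, 80, 90],
    'students':  [10, 20, 30],
})

weighted = weighted_average(scores, 'avg_score', 'students')
unweighted = scores['avg_score'].mean()
print(round(weighted, 2), unweighted)
```

The weighted mean is pulled toward 90 because the largest class has the highest score, whereas the unweighted mean treats all three classes equally.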

In [6]:
s2 = pd.Series([2,1,3,5,0])
s2

0    2
1    1
2    3
3    5
4    0
dtype: int64

In [7]:
s2.sort_values()

4    0
1    1
0    2
2    3
3    5
dtype: int64

### Calculating Median

In [8]:
# Compute median
s2.median()

2.0

In [9]:
s3 = pd.Series([3,0,2,3,4,1])

In [10]:
# value_counts() sorts by frequency; with ascending=True the most frequent value (the mode) is listed last
s3.value_counts(ascending = True)   

0    1
2    1
4    1
1    1
3    2
Name: count, dtype: int64

### Calculating Mode

In [12]:
# Use the .mode() function
s3.mode()

0    3
dtype: int64

In [13]:
s4 = pd.Series([3, 0, 2, 3, 4, 1, 5, 10, 5, 6])
s4.mode()

0    3
1    5
dtype: int64

### Importing data and performing operations

In [15]:
!pip install openpyxl

Successfully installed et-xmlfile-1.1.0 openpyxl-3.1.3


In [16]:
# Importing and Reading data
df = pd.read_excel('ABC attrition data.xlsx', sheet_name='Employee data class training')
df.head()

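| | E.No | Department | Gender | Age | DistanceFromHome | Education | EnvironmentSatisfaction | JobInvolvement | JobLevel | JobSatisfaction | MonthlyIncome | NumCompaniesWorked | OverTime | Attrition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Sales | Female | 41 | 1 | 2 | 2 | 3 | 2 | 4 | 5993 | 8 | Yes | Yes |
| 1 | 2 | Research & Development | Male | 49 | 8 | 1 | 3 | 2 | 2 | 2 | 5130 | 1 | No | No |
| 2 | 3 | Research & Development | Male | 37 | 2 | 2 | 4 | 2 | 1 | 3 | 2090 | 6 | Yes | Yes |
| 3 | 4 | Research & Development | Female | 33 | 3 | 4 | 4 | 3 | 1 | 3 | 2909 | 1 | Yes | No |
| 4 | 5 | Research & Development | Male | 27 | 2 | 1 | 1 | 3 | 1 | 2 | 3468 | 9 | No | No |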


In [17]:
# Print the shape (Rows and Columns)
print('Shape:', df.shape)

Shape: (500, 14)


In [18]:
# Column names
print(df.columns)

Index(['E.No', 'Department', 'Gender', 'Age', 'DistanceFromHome', 'Education',
       'EnvironmentSatisfaction', 'JobInvolvement', 'JobLevel',
       'JobSatisfaction', 'MonthlyIncome', 'NumCompaniesWorked', 'OverTime',
       'Attrition'],
      dtype='object')


In [19]:
# Calculate average distance from home
avg_distance = df['DistanceFromHome'].mean()
print('Average Distance:', avg_distance)

Average Distance: 9.12


In [20]:
# Calculate median distance from home
med_distance = df['DistanceFromHome'].median()
print('Median Distance:', med_distance)

Median Distance: 6.0


In [21]:
# Count the employees in each department; the department with the highest count is the mode
df['Department'].value_counts()

Department
Research & Development    333
Sales                     153
Human Resources            14
Name: count, dtype: int64

In [22]:
df['Department'].mode()

0    Research & Development
Name: Department, dtype: object

In [28]:
# Description of the data
df.describe().round(2)  # summary statistics for the numerical columns in the data

| | E.No | Age | DistanceFromHome | Education | EnvironmentSatisfaction | JobInvolvement | JobLevel | JobSatisfaction | MonthlyIncome | NumCompaniesWorked |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 |
| mean | 250.5 | 36.9 | 9.12 | 2.89 | 2.68 | 2.73 | 2.1 | 2.8 | 6598.64 | 2.69 |
| std | 144.48 | 9.36 | 8.26 | 1.04 | 1.07 | 0.68 | 1.13 | 1.08 | 4814.58 | 2.51 |
| min | 1.0 | 18.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1102.0 | 0.0 |
| 25% | 125.75 | 30.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2900.25 | 1.0 |
| 50% | 250.5 | 36.0 | 6.0 | 3.0 | 3.0 | 3.0 | 2.0 | 3.0 | 4952.0 | 2.0 |
| 75% | 375.25 | 43.0 | 14.0 | 4.0 | 4.0 | 3.0 | 3.0 | 4.0 | 8741.75 | 4.0 |
| max | 500.0 | 60.0 | 29.0 | 5.0 | 4.0 | 4.0 | 5.0 | 4.0 | 19999.0 | 9.0 |


## Measures of Variability / Dispersion

- **Range**: The difference between the largest observation and the smallest observation in the data.
- **Variance**: A measure of how far the observations lie from the mean. The differences from the mean are called deviations, and the average squared deviation from the mean is called the variance. For a population of $N$ observations it is calculated by: $\sigma ^{2}=\frac{\sum\limits _{i=1} ^{N}(x_{i}-\mu) ^{2}}{N}$. Note that the pandas `var()` and `std()` methods compute the sample versions by default, dividing by $N-1$ instead of $N$.

- **Standard Deviation**: The square root of the variance. It is expressed in the same units as the data.
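
The difference between dividing by $N$ (population) and by $N-1$ (sample) is easy to check on a small series. The sketch below uses Python's standard `statistics` module on the same six values as `s1`; it is an illustration only and is not part of the original notebook.

```python
import statistics

data = [2, 1, 0, 4, 3, 15]

pop_var = statistics.pvariance(data)     # divides by N, as in the formula above
sample_var = statistics.variance(data)   # divides by N - 1
pop_sd = statistics.pstdev(data)         # square root of the population variance
sample_sd = statistics.stdev(data)       # square root of the sample variance
value_range = max(data) - min(data)      # largest minus smallest observation

print('Range:', value_range)
print('Variance (population, sample):', round(pop_var, 2), round(sample_var, 2))
print('Standard deviation (population, sample):', round(pop_sd, 2), round(sample_sd, 2))
```

The sample variance is always the larger of the two, and the gap shrinks as the number of observations grows, so for the 500-row employee data the two versions are close.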

In [24]:
# Get the average, median, min, max, range, var and sd for the monthly income
print('Mean:', df['MonthlyIncome'].mean().round(2))
print('Median:', df['MonthlyIncome'].median())
print('Minimum:', df['MonthlyIncome'].min())
print('Maximum:', df['MonthlyIncome'].max())
print('Range:', df['MonthlyIncome'].max() - df['MonthlyIncome'].min())
print('Variance:', df['MonthlyIncome'].var().round(2))
print('Standard Deviation:', df['MonthlyIncome'].std().round(2))

Mean: 6598.64
Median: 4952.0
Minimum: 1102
Maximum: 19999
Range: 18897
Variance: 23180200.71
Standard Deviation: 4814.58


### Chebyshev’s Inequality

<centre>![image.png](attachment:8f3b5d62-39f7-4731-8b3a-7aa0f4714ac3.png)</centre>

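For any data set, regardless of the shape of its distribution, it can be proved mathematically that at least $1-\frac{1}{k^{2}}$ of the data points lie within $k$ standard deviations of the mean (for $k>1$). In particular:
- At least 75% of all data points lie within 2 standard deviations of the mean.
- At least 89% lie within 3 standard deviations.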

#### Example: 
Suppose we have a data series with a minimum of 200, a maximum of 1500, a mean of 600, and a standard deviation of 80. Then:
- At least 75% of all the data points in the series will be within the range $(600 - 2 \times 80,\ 600 + 2 \times 80) = (440, 760)$.
- At least 89% of all the data points will be within the range $(600 - 3 \times 80,\ 600 + 3 \times 80) = (360, 840)$.
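
The percentages in the example follow from the bound $1-\frac{1}{k^{2}}$, where $k$ is the number of standard deviations. A minimal sketch, with hypothetical helper names `chebyshev_bound` and `chebyshev_interval`:

```python
def chebyshev_bound(k):
    """Minimum fraction of observations within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2

def chebyshev_interval(mean, sd, k):
    """Interval (mean - k*sd, mean + k*sd) to which the bound applies."""
    return (mean - k * sd, mean + k * sd)

# Mean and standard deviation taken from the example above
for k in (2, 3):
    low, high = chebyshev_interval(600, 80, k)
    print(f"k={k}: at least {chebyshev_bound(k):.1%} of the data lies in ({low}, {high})")
```

The bound holds for any distribution, so it is deliberately conservative; for bell-shaped data the actual share inside each interval is usually much higher.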

## Quantiles, Quartiles and Percentiles of data

![quan.png](attachment:855d7977-1611-4e64-9a73-9a89e26bb370.png)

- **Quantile**: a cut point that divides the sorted data into portions of a given size. For example, the 0.25 quantile is the value below which 25% of the observations fall.

- **Quartiles** are the three cut points that divide the sorted data into 4 equal parts. The **1st Quartile (Q1)** is the value below which the first 25% of the data lie (lowest value onwards). The **2nd Quartile (Q2)** is the value below which 50% of the data lie (this is also equal to the median). The **3rd Quartile (Q3)** is the value below which 75% of the data lie.

- **Percentiles** are the cut points that divide the sorted data into 100 equal parts. We can have P1 to P99. P25 is equal to Q1. P50 is equal to Q2, which is the same as the median. P75 is equal to Q3. The interquartile range (IQR) is Q3 - Q1.
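
The relationships between quartiles, percentiles, the median, and the IQR can be checked on a small sample. This sketch uses the standard library's `statistics.quantiles` with `method='inclusive'`, which interpolates linearly between neighbouring observations in the same way as the pandas `quantile()` default; the nine-value sample is hypothetical.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical sample

# n=4 returns the three cut points that split the data into four parts
q1, q2, q3 = statistics.quantiles(data, n=4, method='inclusive')
iqr = q3 - q1

# n=100 returns the 99 cut points P1 ... P99 (index 0 holds P1)
percentiles = statistics.quantiles(data, n=100, method='inclusive')

print(q1, q2, q3)                                     # 3.0 5.0 7.0
print(q2 == statistics.median(data))                  # Q2 equals the median
print(percentiles[24] == q1, percentiles[74] == q3)   # P25 equals Q1, P75 equals Q3
print(iqr)                                            # 4.0
```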

In [25]:
# This is Q1
print(df['Age'].quantile(0.25))

# This is Q2 and Median
print(df['Age'].quantile(0.5))

# This is Q3
print(df['Age'].quantile(0.75)) 

30.0
36.0
43.0


#### Inferences:
- There are 500 employees in the dataset. 25% of the employees (25% of 500 = 125 employees) are aged 30 (Q1) or younger.
- 50% of the employees (250 employees) are aged 36 or younger.
- 75% of the employees (375 employees) are aged 43 or younger.
- ***This suggests that the organization has a relatively young pool of employees.***

In [28]:
# IQR of Income
print('IQR of income:', df['MonthlyIncome'].quantile(0.75) - df['MonthlyIncome'].quantile(0.25))

IQR of income: 5841.5


In [30]:
print('Q1 of income:', df['MonthlyIncome'].quantile(0.25))
print('Q3 of income:', df['MonthlyIncome'].quantile(0.75))

Q1 of income: 2900.25
Q3 of income: 8741.75


The middle 50% of the employees (the 250 people between Q1 and Q3) have a monthly income in the range of roughly Rs. 2900/- to Rs. 8742/- (assuming the income is in rupees).

## Shape measures
### 1. Skewness
- Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and to the right of its center. The bell-shaped normal / Gaussian distribution is the standard example and has a skewness of 0.

![Screenshot 2024-06-09 130148.png](attachment:2c6534a3-e4bc-4894-929e-429399de078f.png)

- As a rule of thumb, if the skewness is between
    - **-0.5 and 0.5**, the data are nearly symmetrical.
    - **-1 and -0.5**, the distribution is slightly negatively skewed.
    - **0.5 and 1**, the distribution is slightly positively skewed.
- If the skewness is lower than -1 or greater than 1, the data are highly skewed.
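
The thresholds above are a rule of thumb, and they can be written as a small helper for labelling a computed skewness value. The function name `describe_skewness` is hypothetical, and the choice to treat -0.5 and 0.5 themselves as "nearly symmetrical" is an assumption, since the list does not say which side the boundaries belong to.

```python
def describe_skewness(skew):
    """Label a skewness value using the rule-of-thumb thresholds listed above."""
    if -0.5 <= skew <= 0.5:
        return "nearly symmetrical"
    if -1 <= skew < -0.5:
        return "slightly negatively skewed"
    if 0.5 < skew <= 1:
        return "slightly positively skewed"
    # Anything below -1 or above 1
    return "highly negatively skewed" if skew < 0 else "highly positively skewed"

for value in (-1.2, -0.7, 0.1, 0.8, 1.5):
    print(value, '->', describe_skewness(value))
```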

In [32]:
# Calculate the skewness in age
df['Age'].skew().round(2)

0.44

In [33]:
# Calculate the skewness in monthly income
df['MonthlyIncome'].skew().round(2)

1.33

### 2. Kurtosis
- Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack of outliers.

![Screenshot 2024-06-09 130445.png](attachment:06a6f1a7-72a0-4c22-8c69-1b1fb5a4749e.png)

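- The kurtosis of any normal distribution (not only the standard normal) is 3. Many libraries, including pandas, report *excess kurtosis* (kurtosis minus 3), for which a normal distribution scores 0.
- **Leptokurtic distributions** have kurtosis > 3 (excess kurtosis > 0) and **platykurtic distributions** have kurtosis < 3 (excess kurtosis < 0).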
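
Kurtosis can be computed directly from the central moments of the data. The sketch below uses the simple moment-based definition; the function names and the five-value sample are hypothetical. pandas applies a small-sample bias correction and reports excess kurtosis, so its numbers will not match this sketch exactly on small samples.

```python
def kurtosis(values):
    """Moment-based kurtosis: m4 / m2**2, where mk is the k-th central moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2

def excess_kurtosis(values):
    """Kurtosis relative to the normal distribution, whose kurtosis is 3."""
    return kurtosis(values) - 3

data = [1, 2, 3, 4, 5]   # hypothetical, evenly spread sample with no outliers
print(round(kurtosis(data), 2))          # 1.7, below 3, so platykurtic
print(round(excess_kurtosis(data), 2))   # -1.3, below 0, the same conclusion
```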

In [35]:
# Kurtosis of Age and Monthly Income (pandas reports excess kurtosis, so a normal distribution gives 0)
print(df['Age'].kurtosis().round(2))
print(df['MonthlyIncome'].kurtosis().round(2))

-0.38
0.82


#### Inferences:
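- Age is nearly symmetrical, with a slight positive skew (0.44).
- Monthly Income is highly positively skewed (1.33): there are a few people with a very high salary compared to the others.
- The pandas `kurtosis()` method returns excess kurtosis, so the values are compared with 0 rather than 3. Age (-0.38) is slightly platykurtic, which means lighter tails than a normal distribution. Monthly Income (0.82) is leptokurtic, which means heavier tails and is consistent with the few very high salaries.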

## Covariance And Correlation
- Covariance and correlation both assess the linear relationship between two variables.
    - Using covariance, we can only gauge the direction of the relationship (whether the variables tend to move in tandem or show an inverse relationship). Its magnitude depends on the units of the variables, so it does not indicate the strength of the relationship.
    - Correlation measures both the direction and the strength of the relationship. It is covariance scaled by the standard deviations of the two variables, which makes it dimensionless. In other words, the correlation coefficient is always a pure number and is not measured in any units.

![Screenshot 2024-06-09 131835.png](attachment:9282253b-489e-4eeb-8c74-073b488d4267.png)

- We can also write the correlation coefficient ($r$ or $\rho$) as -
![image.png](attachment:image.png)

- The value of correlation coefficient ranges from **-1 to +1**.

![image.png](attachment:image.png)

#### Example:
Consider the data related to age of the people and their glucose level. Find the correlation coefficient between these two variables. 

In [36]:
import numpy as np

Age = np.array([43, 21, 25, 42, 57, 59])
GlucoseLevel = np.array([99, 65, 79, 75, 87, 81])

In [37]:
# Calculate the correlation coefficient with Age and GlucoseLevel
R = np.corrcoef(Age, GlucoseLevel)
print(R)

[[1.        0.5298089]
 [0.5298089 1.       ]]


In [39]:
print('Correlation coefficient between Age and Glucose level: ', R[0][1].round(2))

Correlation coefficient between Age and Glucose level:  0.53
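
The coefficient printed above can be reproduced from the definitions, which also shows why covariance depends on units while correlation does not. This is a sketch in plain Python using the sample (n - 1) form of covariance; the helper names `covariance` and `correlation` are illustrative.

```python
import math

age = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

def covariance(x, y):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    sd_x = math.sqrt(covariance(x, x))   # covariance of a variable with itself is its variance
    sd_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (sd_x * sd_y)

print(round(covariance(age, glucose), 2))    # 95.6
print(round(correlation(age, glucose), 2))   # 0.53

# Expressing age in months multiplies the covariance by 12 but leaves the correlation unchanged
age_in_months = [a * 12 for a in age]
print(round(covariance(age_in_months, glucose), 2))
print(round(correlation(age_in_months, glucose), 2))
```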


In [40]:
# Correlation between Age and MonthlyIncome in the employee dataset
df['MonthlyIncome'].corr(df['Age']).round(2)

0.49

**Inference:** Age and Monthly Income are positively correlated, but only moderately (0.49). Income tends to rise with age, yet the relationship is not strong enough to say that an older employee will reliably earn more.
Had the correlation coefficient been around 0.9, we could have concluded that income increases closely with age.
In our data, there are likely to be younger employees with higher salaries and older employees with lower ones.