# Descriptive Statistics 

Descriptive statistics are a building block of data science, and advanced analytics is often incomplete without them. In simple terms, descriptive statistics are measures that summarize a dataset. They fall into two groups: measures of central tendency (mean, median, mode) and measures of dispersion (variance, standard deviation).

# Data loading

In [1]:
import pandas as pd
import numpy as np
import statistics as st 

# Load the data
data = pd.read_csv("data.csv")
print(data.info())
print(data.shape)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Mthly_HH_Income           50 non-null     int64 
 1   Mthly_HH_Expense          50 non-null     int64 
 2   No_of_Fly_Members         50 non-null     int64 
 3   Emi_or_Rent_Amt           50 non-null     int64 
 4   Annual_HH_Income          50 non-null     int64 
 5   Highest_Qualified_Member  50 non-null     object
 6   No_of_Earning_Members     50 non-null     int64 
dtypes: int64(6), object(1)
memory usage: 2.9+ KB
None
(50, 7)


# 1) Mean

The mean is the arithmetic average of the data: the sum of the values divided by their count. The cells below compute it for a single column in two ways, then for every numerical column.

Example below: mean of Mthly_HH_Income

In [2]:
x=np.array(data['Mthly_HH_Income'])
print(np.mean(x))

41558.0


In [3]:
mean2=sum(data['Mthly_HH_Income'])/len(data['Mthly_HH_Income'])
print(mean2)

41558.0


In [4]:
print("mean of all Numerical columns")
# numeric_only=True skips the non-numeric Highest_Qualified_Member column
data.mean(numeric_only=True)

mean of all Numerical columns


Mthly_HH_Income           41558.00
Mthly_HH_Expense          18818.00
No_of_Fly_Members             4.06
Emi_or_Rent_Amt            3060.00
Annual_HH_Income         490019.04
No_of_Earning_Members         1.46
dtype: float64

# 2) Median

The median represents the 50th percentile, or the middle value of the sorted data, which separates the distribution into two halves.

Example below: median of Mthly_HH_Expense

In [5]:
x = data['Mthly_HH_Expense']
# Series.median() does not need the data to be sorted first
print("Median is: ", x.median())

Median is:  15500.0


In [6]:
n = len(data['Mthly_HH_Expense'])
x = list(data['Mthly_HH_Expense'])
x.sort()

if n % 2 == 0:
    # Even count: average the two middle values
    median1 = x[n//2]
    median2 = x[n//2 - 1]
    median = (median1 + median2)/2
else:
    # Odd count: take the single middle value
    median = x[n//2]
print("Median is: " + str(median))

Median is: 15500.0


In [7]:
print("median of all Numerical columns")
# numeric_only=True skips the non-numeric Highest_Qualified_Member column
data.median(numeric_only=True)

median of all Numerical columns


Mthly_HH_Income           35000.0
Mthly_HH_Expense          15500.0
No_of_Fly_Members             4.0
Emi_or_Rent_Amt               0.0
Annual_HH_Income         447420.0
No_of_Earning_Members         1.0
dtype: float64
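
Comparing the two tables above, the mean of Mthly_HH_Income (41558) sits well above its median (35000), which usually indicates a few large values pulling the mean upward. The sketch below illustrates that effect on a small list of hypothetical values (not taken from data.csv): one outlier shifts the mean substantially but leaves the median unchanged.

```python
import statistics as st

# Hypothetical values for illustration only; they are not from data.csv
values = [10, 12, 13, 15, 20]
with_outlier = values[:-1] + [100]   # swap the largest value for an outlier

mean_before, median_before = st.mean(values), st.median(values)
mean_after, median_after = st.mean(with_outlier), st.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

This is one reason the median is often reported alongside the mean for income-like variables.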

# 3) Mode

The mode is the value that occurs most often in a set of values. A dataset can have more than one mode.

Example below: mode of Mthly_HH_Expense

In [8]:
x=data['Mthly_HH_Expense']
res=str(x.mode())
print("index  mode")
print(" "+res)

index  mode
 0    25000
dtype: int64


In [9]:
from collections import Counter
  
x=list(data['Mthly_HH_Expense'])
n = len(x)
  
counts = Counter(x)
max_count = max(counts.values())
# Keep every value that reaches the highest count (there may be ties)
mode = [k for k, v in counts.items() if v == max_count]

if len(mode) == n:
    get_mode = "No mode found"
else:
    get_mode = "Mode is/are: " + ', '.join(map(str, mode))

print(get_mode)

Mode is/are: 25000
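
The tie handling in the `Counter` approach above matters when several values share the highest count. As a minimal sketch on hypothetical values (not from data.csv), the standard library's `statistics.multimode` (available from Python 3.8) returns every such value, and a manual `Counter` pass gives the same list:

```python
import statistics as st
from collections import Counter

# Hypothetical values for illustration only: 3 and 1 both occur twice
values = [3, 1, 3, 2, 1, 5]

# multimode returns all values tied for the highest count
modes = st.multimode(values)

# Equivalent manual approach with Counter
counts = Counter(values)
top = max(counts.values())
manual_modes = [v for v, c in counts.items() if c == top]

print(modes)
print(manual_modes)
```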


In [10]:
print("mode of all columns")
data.mode()

mode of all columns


| | Mthly_HH_Income | Mthly_HH_Expense | No_of_Fly_Members | Emi_or_Rent_Amt | Annual_HH_Income | Highest_Qualified_Member | No_of_Earning_Members |
|---|---|---|---|---|---|---|---|
| 0 | 45000 | 25000 | 4 | 0 | 590400 | Graduate | 1 |


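# 4) Variance

Variance is a measure of dispersion: the average squared deviation from the mean. It is the square of the standard deviation and equals the covariance of a variable with itself. The population variance divides by $n$:

$\sigma^2=\frac{1}{n} \sum_{i=0}^{n-1}(x_i-\mu)^2 $

The sample variance, which is the default in both `statistics.variance` and pandas `var()`, divides by $n-1$ instead:

$s^2=\frac{1}{n-1} \sum_{i=0}^{n-1}(x_i-\bar{x})^2 $

This is why the library calls below report about 146173342.86 for Mthly_HH_Expense, while the manual calculation, which divides by $n$, reports 143249876.0.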
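
To make the effect of the divisor concrete, here is a minimal sketch on hypothetical values (not from data.csv) that computes the variance by hand with both $n$ and $n-1$ and compares the results with `statistics.pvariance` and `statistics.variance`:

```python
import statistics as st

# Hypothetical values for illustration only
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mu = sum(values) / n
squared_dev = sum((x - mu) ** 2 for x in values)

population_var = squared_dev / n        # divide by n
sample_var = squared_dev / (n - 1)      # divide by n - 1

print(population_var, st.pvariance(values))
print(sample_var, st.variance(values))
```

The two results always differ by the factor $(n-1)/n$, so the gap shrinks as the number of observations grows. In pandas, `Series.var(ddof=0)` gives the population version.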


Example below: Variance of Mthly_HH_Expense

In [31]:
x=data['Mthly_HH_Expense']
res=str(st.variance(x))
print("Variance is",res)

Variance is 146173342.85714287


In [29]:
x=list(data['Mthly_HH_Expense'])
mean = sum(x) / len(x)
# Population variance: divide by n rather than n - 1
res = sum((i - mean) ** 2 for i in x) / len(x)

print("The population variance is " + str(res))

The population variance is 143249876.0


In [12]:
print("Variance of all Numerical columns")
# numeric_only=True skips the non-numeric Highest_Qualified_Member column
data.var(numeric_only=True)

Variance of all Numerical columns


Mthly_HH_Income          6.811009e+08
Mthly_HH_Expense         1.461733e+08
No_of_Fly_Members        2.302449e+00
Emi_or_Rent_Amt          3.895551e+07
Annual_HH_Income         1.024869e+11
No_of_Earning_Members    5.391837e-01
dtype: float64

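# 5) Standard Deviation

The standard deviation is the square root of the variance, so it is expressed in the same units as the data. Like `var()`, pandas `std()` computes the sample version (dividing by $n-1$) by default.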

In [17]:
print("Standard deviation is ",data.loc[:,'Mthly_HH_Income'].std())

Standard deviation is  26097.908978713687
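
Because the standard deviation is the square root of the variance, the value above can be cross-checked against the variance reported earlier for Mthly_HH_Income (about 6.811009e+08). A minimal sketch of that relationship on hypothetical values (not from data.csv), using only the standard library:

```python
import math
import statistics as st

# Hypothetical values for illustration only
values = [2, 4, 4, 4, 5, 5, 7, 9]

sample_std = st.stdev(values)        # square root of the sample variance
population_std = st.pstdev(values)   # square root of the population variance

print(sample_std, math.sqrt(st.variance(values)))
print(population_std, math.sqrt(st.pvariance(values)))
```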


In [32]:
print("Standard deviation of all Numerical columns")
# std(), not var(); numeric_only=True skips the non-numeric column
data.std(numeric_only=True)