<table align="left" width=100%>
    <tr>
        <td width="20%">
            <img src="faculty.png">
        </td>
        <td>
            <div align="center">
                <font color="#21618C" size=8px>
                  <b> Faculty Notebook <br> (Session 1) </b>
                </font>
            </div>
        </td>
    </tr>
</table>

## Table of Contents

1. **[Import Libraries](#lib)**
2. **[Descriptive Statistics](#des)**
    - 2.1 - **[Measures of Central Tendency](#CT)**
    - 2.2 - **[Measures of Dispersion](#disp)**
    - 2.3 - **[Skewness and Kurtosis](#sk)**
    - 2.4 - **[Covariance and Correlation](#cc)**

<a id="lib"></a>
# 1. Import Libraries

**Let us import the required libraries.**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random # to introduce randomness in selection
import statistics
from scipy import stats # Library for statistical calculations
import warnings
warnings.filterwarnings("ignore")

The study of statistics is mainly divided into two parts: `Descriptive` and `Inferential`.

Here we mainly focus on `Inferential Statistics`. Before that, let us recall the descriptive statistics methods learned as a part of exploratory data analysis.

# 1. Sampling 

The weights of 16 bags are given.
1. Randomly select 7 bags with replacement
2. Randomly select 7 bags without replacement

- weights=[15.6,15.7,14.6,14.7,15.8,15.0,15.1,14.9,14.8,15.5,15.4,15.3,14.8,15.1,14.8,14.9]


In [2]:
weights=[15.6,15.7,14.6,14.7,15.8,15.0,15.1,14.9,14.8,15.5,15.4,15.3,14.8,15.1,14.8,14.9]


#Use random.choices(data, k) where k is the sample size
print("Sample with replacements:", random.choices(weights, k=7))


#random.sample never picks the same bag twice, but equal values can still appear because the dataset contains duplicate weights
print("Sample without replacements:", random.sample(weights, k=7))


Sample with replacements: [14.7, 14.6, 14.9, 14.9, 15.8, 14.8, 14.9]
Sample without replacements: [14.8, 15.0, 14.9, 15.5, 15.6, 15.1, 15.1]


In [3]:
#Set a seed so that the same random numbers are drawn on every run
random.seed(7)
print("Sample with replacements:", random.choices(weights, k=7))


print("Sample without replacements:", random.sample(weights, k=7))

Sample with replacements: [15.0, 14.6, 15.4, 15.7, 14.8, 15.0, 15.6]
Sample without replacements: [15.1, 15.6, 15.7, 14.9, 14.8, 15.1, 14.7]
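The difference between the two sampling schemes is easier to see when bag positions are sampled instead of weights. The following is a minimal sketch using only the standard `random` module; the private generator `rng`, the seed value 7, and the variable names are illustrative choices, not part of the exercise.

```python
import random

weights = [15.6, 15.7, 14.6, 14.7, 15.8, 15.0, 15.1, 14.9,
           14.8, 15.5, 15.4, 15.3, 14.8, 15.1, 14.8, 14.9]

# Private generator so the global random state is left untouched (seed is arbitrary)
rng = random.Random(7)

# Sample bag positions (indices) rather than weights, so repeats are easy to spot
idx_with = rng.choices(range(len(weights)), k=7)    # a position may be drawn again
idx_without = rng.sample(range(len(weights)), k=7)  # every position is distinct

print("With replacement   :", [(i, weights[i]) for i in idx_with])
print("Without replacement:", [(i, weights[i]) for i in idx_without])

# Without replacement no bag is picked twice, even if two bags weigh the same
print("Distinct bags drawn without replacement:", len(set(idx_without)))
```

Any repeated weight in the second sample therefore comes from two different bags that happen to weigh the same.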


<a id="des"></a>
# 2. Descriptive Statistics

Descriptive statistics summarizes or describes the given data. It includes measures of central tendency, measures of dispersion and distribution of the data.

<a id="CT"></a>
## 2.1 Measures of Central Tendency

A measure of central tendency is a value that identifies the central position of the data. It includes mean, median, mode and partition values of the data.

### Mean:
It is defined as the ratio of the sum of all the observations to the total number of observations. It is affected by the presence of outliers.

### Median:
It is the middlemost observation in the data when it is arranged in the increasing or decreasing order based on the values. It divides the dataset into two equal parts.

### Mode: 
It is defined as the value in the data with the highest frequency. There can be more than one mode in the data.

### Partition values:
Partition values are defined as the values that divide the data into equal parts. `Quartiles` divide the data into 4 equal parts, `Deciles` divide the data into 10 equal parts and `Percentiles` divide the data into 100 equal parts.
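As a rough illustration of partition values, the sketch below computes quartiles and deciles for the one-day sales figures used in the examples of this notebook. It relies on `statistics.quantiles` from the standard library; `method="inclusive"` is chosen here on the assumption that the data is the full population, and it interpolates linearly in the same way as NumPy's default quantile method.

```python
import statistics

sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

# n=4 returns the 3 cut points that split the data into 4 equal parts
quartiles = statistics.quantiles(sale, n=4, method="inclusive")

# n=10 returns the 9 cut points that split the data into 10 equal parts
deciles = statistics.quantiles(sale, n=10, method="inclusive")

print("Quartiles (Q1, Q2, Q3):", quartiles)   # [160.75, 173.0, 183.25]
print("Number of decile cut points:", len(deciles))

# The second quartile is the median
print("Median:", statistics.median(sale))
```

Dividing the data into k equal parts always needs k - 1 cut points, which is why there are 3 quartiles and 9 deciles.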

### Example:

#### 1. A manager handles 12 branches of a supermarket situated in the U.S.A. Consider one day's sales (in dollars) for all the branches. Calculate the mean, median and mode to find the average sale. Which measure would you report for the average sale?
    
    Sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

In [22]:
Sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175] # Population data

print("Mean:", round(np.mean(Sale),1))

print("Median:", round(np.median(Sale),1))

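# Every value in Sale occurs exactly once, so the data has no meaningful mode;
# stats.mode breaks the tie by returning the smallest value.
mode = stats.mode(Sale)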

print("Mode:", mode[0])


Mean: 169.3
Median: 173.0
Mode: 124
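To help decide which measure to report, the sketch below checks how many modes the data has and how the mean and median react to one extreme value. It uses only the standard `statistics` module; the value 2040 is a hypothetical data-entry error introduced purely for illustration and is not part of the exercise.

```python
import statistics

sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

# multimode returns every value that shares the highest frequency
modes = statistics.multimode(sale)
print("Number of modes:", len(modes))  # 12, because every value occurs once

# Hypothetical scenario: 204 is mistyped as 2040
sale_with_outlier = [2040 if x == 204 else x for x in sale]

print("Mean  :", round(statistics.mean(sale), 1), "->",
      round(statistics.mean(sale_with_outlier), 1))
print("Median:", statistics.median(sale), "->",
      statistics.median(sale_with_outlier))
```

Since every value is tied for the highest frequency, the mode is not informative here, and the median stays at 173.0 while the mean moves sharply with the single extreme value.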


<a id="disp"></a>
## 2.2 Measures of Dispersion

A measure of dispersion describes the variability in the data. Some of the measures of dispersion are range, variance, standard deviation, coefficient of variation, and IQR.

### Range:
It is defined as the difference between the largest and smallest observation in the data. It is affected by the presence of extreme observations. 

### Variance: 
It measures the dispersion of the data around the mean. It is defined as the average of the squared differences between each observation and the mean.

### Standard Deviation:
It is the positive square root of variance. The unit of standard deviation is the same as the unit of data points. A variable with near-zero standard deviation is almost constant, so it carries little information for the analysis.

### Coefficient of Variation
It is the ratio of the standard deviation to the mean, and measures the dispersion of data points relative to the mean. It is usually expressed as a percentage. Because it is unit-free, we can compare the coefficient of variation of two or more groups to identify the group with more spread.

### Interquartile Range (IQR):
It is defined as the difference between the third and first quartiles. It returns the range of the middle 50% of the data. IQR can be used to identify the outliers in the data.
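One common way to use the IQR for outlier detection is the 1.5 × IQR rule, sketched below on the one-day sales figures from this notebook. The multiplier 1.5 is a widely used convention rather than something fixed by the definition above, and the names `lower_fence` and `upper_fence` are illustrative.

```python
import statistics

sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

# First and third quartiles (linear interpolation, as in NumPy's default method)
q1, _, q3 = statistics.quantiles(sale, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the quartiles are flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in sale if x < lower_fence or x > upper_fence]

print("IQR:", iqr)                              # 22.5
print("Fences:", lower_fence, upper_fence)      # 127.0 217.0
print("Flagged as outliers:", outliers)         # [124]
```

Under this convention the branch with sales of 124 falls just below the lower fence, so it would be worth a closer look before being treated as a typical observation.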

### Example:

#### 1. A manager handles 12 branches of a supermarket situated in the U.S.A. Consider one day's sales (in dollars) for all the branches. Calculate the range, variance and standard deviation of the sale. Also, find the range in which the middle 50% of the sale would lie.
    
    Sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

In [54]:
#This is population data
sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

print("Range:", np.max(sale)-np.min(sale))

print("Mean:", round(np.mean(sale),1))

print("Variance:", round(np.var(sale),1))

print("Standard Deviation:", round(np.std(sale),1))

print("Coefficient of Variation :", round((np.std(sale)/np.mean(sale))*100,2),"%")

iqr = np.quantile(sale, 0.75) - np.quantile(sale, 0.25)

print("IQR/ Middle 50%:",iqr)



Range: 80
Mean: 169.3
Variance: 473.9
Standard Deviation: 21.8
Coefficient of Variation : 12.86 %
IQR/ Middle 50%: 22.5
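The variance reported above treats the 12 branches as the whole population, so the sum of squared deviations is divided by N. If the same figures were instead a sample from a larger set of branches (an assumption made here only for comparison), the divisor would be N - 1. A minimal sketch with the standard `statistics` module:

```python
import math
import statistics

sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]
n = len(sale)

pop_var = statistics.pvariance(sale)   # population variance: divide by N
samp_var = statistics.variance(sale)   # sample variance: divide by N - 1

print("Population variance:", round(pop_var, 1))   # 473.9
print("Sample variance    :", round(samp_var, 1))  # 517.0

# The two differ only by the factor N / (N - 1)
print("Ratio matches N/(N-1):", math.isclose(samp_var / pop_var, n / (n - 1)))
```

In NumPy the same switch is made with the `ddof` argument: `np.var(sale)` uses `ddof=0` by default, while `np.var(sale, ddof=1)` gives the sample variance.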



<a id="sk"></a>
## 2.3 Skewness and Kurtosis

### Skewness:
It measures the asymmetry of the data distribution about its mean; a symmetric distribution such as the normal distribution has zero skewness. The value of skewness can be `positive` (right-skewed), `negative` (left-skewed), or `zero`.

### Kurtosis:
It describes the peakedness and tail heaviness of the data distribution relative to the normal distribution. `stats.kurtosis` reports excess kurtosis by default: a positive value represents the `leptokurtic` distribution, a negative value represents the `platykurtic` distribution, and a zero value represents the `mesokurtic` distribution.

### Example:

#### 1. A manager handles 12 branches of a supermarket situated in the U.S.A. Consider one day's sales (in dollars) for all the branches. Identify the type of Skewness and Kurtosis for sales.
    
    Sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

In [57]:
sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

skewness = stats.skew(sale)

print("Skewness:",round(skewness,2))
print("-ve Skewness means Left Skewed")


kurtosis = stats.kurtosis(sale)
print("\nKurtosis:",round(kurtosis,2))
print("-ve Kurtosis means Platykurtic")



Skewness: -0.53
-ve Skewness means Left Skewed

Kurtosis: -0.38
-ve Kurtosis means Platykurtic
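The two values can also be reproduced from central moments. The sketch below uses the population (biased) moment formulas, which is what `stats.skew` and `stats.kurtosis` compute with their default settings as far as I am aware; the helper `central_moment` is a hypothetical name introduced for this illustration.

```python
import statistics

sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]
n = len(sale)
mean = statistics.fmean(sale)


def central_moment(k):
    """k-th central moment: average of (x - mean)**k over all observations."""
    return sum((x - mean) ** k for x in sale) / n


m2, m3, m4 = central_moment(2), central_moment(3), central_moment(4)

skewness = m3 / m2 ** 1.5            # third standardized moment
excess_kurtosis = m4 / m2 ** 2 - 3   # fourth standardized moment minus 3 (normal = 0)

print("Skewness:", round(skewness, 2))                # -0.53
print("Excess kurtosis:", round(excess_kurtosis, 2))  # -0.38
```

Subtracting 3 is what makes the normal distribution the zero reference point for kurtosis.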


<a id="cc"></a>
## 2.4 Covariance and Correlation

### Covariance:
It measures the degree to which two variables move together. The value of covariance can range from $-\infty$ to $\infty$. Its magnitude is not easy to interpret because it depends on the units of the two variables.

### Correlation:
It is the normalized value of covariance and always lies between -1 and +1. A correlation value near +1 indicates a `strong positive` correlation between the variables, and a value near -1 indicates a `strong negative` correlation.

### Example:

#### 1. A manager handles 12 branches of a supermarket situated in the U.S.A. Consider one day's sales (in dollars) and the working hours of all the branches. Find the relationship between the working hours of a store and its sales.
    Sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]
    Working hours = [7, 8.5, 8, 10, 9, 8, 8.5, 7.5, 9.5, 8.5, 8, 9]

In [63]:
sales = pd.Series([165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175])
hours = pd.Series([7, 8.5, 8, 10, 9, 8, 8.5, 7.5, 9.5, 8.5, 8, 9])


#covariance between hours and sales

covariance = hours.cov(sales)

print("Covariance: ", round(covariance,2))

print("There exists a positive relation between working hours and sales.\n")

correlation = hours.corr(sales)

print("Correlation: ", round(correlation,2))

print("There is a moderate-strong correlation between working hours and sales")

Covariance:  12.29
There exists a positive relation between working hours and sales.

Correlation:  0.64
There is a moderate-strong correlation between working hours and sales
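To make the link between the two measures concrete, the sketch below computes the sample covariance by hand (dividing by N - 1, which is also the pandas default) and then normalizes it by the two sample standard deviations. It uses only the standard library, and the variable names are illustrative.

```python
import statistics

sales = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]
hours = [7, 8.5, 8, 10, 9, 8, 8.5, 7.5, 9.5, 8.5, 8, 9]

n = len(sales)
mean_sales = statistics.fmean(sales)
mean_hours = statistics.fmean(hours)

# Sample covariance: average product of deviations, with divisor N - 1
covariance = sum((h - mean_hours) * (s - mean_sales)
                 for h, s in zip(hours, sales)) / (n - 1)

# Correlation: covariance divided by the product of the sample standard deviations
correlation = covariance / (statistics.stdev(hours) * statistics.stdev(sales))

print("Covariance :", round(covariance, 2))    # 12.29
print("Correlation:", round(correlation, 2))   # 0.64
```

Dividing by the standard deviations removes the units (dollar-hours), which is why the correlation can be compared across different pairs of variables while the covariance cannot.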


# THE END