# Research Topics
1. Types of Data
2. Sampling and Estimation
3. Mean, Variance, and Standard Deviation
4. Linear Transformation of Means
5. Linear Transformation of Variance
6. Range
7. Quartiles
8. Percentiles
9. Covariance and Pearson’s Correlation
10. Handling Missing Values
11. Detecting Outliers
12. Handling Outliers

## 1. Types of Data
There are two main types of data encountered in data science:
1. Qualitative Data
2. Quantitative Data

### Qualitative Data
Qualitative data, or **categorical data**, refers to data that fits into distinct categories and is non-numerical in nature.
E.g. Data about hair color (black, blonde, brunette, red) or sex (male, female).

There are two kinds of qualitative data:
1. **Ordinal**: This refers to qualitative data that follows a specific order or ranking. Certain values can be said to be greater than other values.
E.g. test grade = \[A, B, C, D, E], socioeconomic status = \[low, middle, high]

2. **Nominal**: This refers to qualitative data that does not follow any specific order or ranking. No value can be said to be greater than another value.
E.g. eye color = \[blue, black, brown, grey], weather = \[cloudy, sunny, windy, snowy]
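
The distinction between unordered and ordered categories can be made explicit in code. The following is a minimal sketch, assuming pandas is available; the sample values and variable names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Nominal: categories with no inherent order (hypothetical sample values)
eye_color = pd.Categorical(["blue", "brown", "grey", "brown"])

# Ordinal: categories with a meaningful order, declared explicitly
grades = pd.Categorical(
    ["B", "A", "D", "C"],
    categories=["E", "D", "C", "B", "A"],  # listed from lowest to highest
    ordered=True,
)

print(eye_color.ordered)  # unordered categories cannot be ranked
print(grades.ordered)     # ordered categories can be compared
print(grades.min(), grades.max())  # lowest and highest grade present
```

Because the grades are declared as ordered, operations such as `min()` and `max()` are meaningful for them, whereas they are not defined for the unordered eye colors.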


### Quantitative Data
Quantitative data or **numerical data** refers to data that is numerical in nature or which is expressed as numerical values.
E.g. temperature = \[101, 37.5, 25, 0]

There are two kinds of quantitative data:
1. **Discrete**: This refers to numerical data that can only take a limited number of values over any specific numeric interval. Such data is typically obtained by counting.
E.g. age = \[18, 9, 25, 88], number of students in different classes = \[15, 20, 17, 101]
2. **Continuous**: This refers to data that can take an uncountable number of values within any specific numeric interval. Such data is typically obtained by measuring.
E.g. Height of person = \[5, 5.5, 5.59, 5.5559, 9.23456783]

## 2. Sampling and Estimation
### What is sampling?
In statistical studies, the **population** refers to the entire set of individuals or items that the study wants to draw conclusions about.
Due to time, cost, and other constraints, it is usually difficult to collect data from every element of the population.
To solve this problem, a subset of the population, referred to as the **sample**, is selected and observed, and based on data obtained from the sample, an estimate is made concerning the entire population.

Sampling is therefore the act of taking a part or a portion of a population to represent the entire population during a research study. The sample is the group of individuals who will actually participate in the research. These people are only a subset of the total population the study wants to obtain information about.

There are two types of sampling methods:
1. Probability sampling methods
2. Non-Probability sampling methods

**Probability Sampling Methods**
These methods involve random selection and allow you to make strong statistical inferences about the whole group.
The different probability sampling methods are:
1. *Simple Random Sample*: In a simple random sample, every member of the population has an equal chance of being selected. Your sampling frame should include the whole population. To conduct this type of sampling, tools such as a random number generator are needed.
2. *Systematic sampling*: Systematic sampling is similar to simple random sampling, but it is usually slightly easier to conduct. Every member of the population is listed with a number, but instead of randomly generating numbers, individuals are chosen at regular intervals.
3. *Stratified sampling*: Stratified sampling involves dividing the population into subpopulations that may differ in important ways. It allows you to draw more precise conclusions by ensuring that every subgroup is properly represented in the sample.
4. *Cluster sampling*: Cluster sampling also involves dividing the population into subgroups, but each subgroup should have similar characteristics to the whole population. Instead of sampling individuals from each subgroup, you randomly select entire subgroups.
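
The four probability sampling methods above can be sketched in a few lines. This is only an illustration under stated assumptions: the population of 100 members, the `group` and `cluster` columns, and the sample sizes are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 100 members with two strata and ten clusters
population = pd.DataFrame({
    "member_id": np.arange(1, 101),
    "group": ["A"] * 60 + ["B"] * 40,
    "cluster": np.repeat(np.arange(10), 10),
})

# Simple random sample: every member has the same chance of selection
simple = population.sample(n=10, random_state=0)

# Systematic sample: random starting point, then every k-th member
k = 10
start = int(rng.integers(0, k))
systematic = population.iloc[start::k]

# Stratified sample: draw 10% from each group separately
stratified = population.groupby("group").sample(frac=0.1, random_state=0)

# Cluster sample: randomly pick 2 whole clusters and keep all their members
chosen = rng.choice(np.arange(10), size=2, replace=False)
cluster = population[population["cluster"].isin(chosen)]

print(len(simple), len(systematic), len(stratified), len(cluster))
```

Note how the stratified sample keeps the 60/40 proportion of the two groups, while the cluster sample contains every member of the selected clusters.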

**Non-Probability Sampling**
These methods involve non-random selection based on convenience or other criteria, allowing you to easily collect data.
The different non-probability sampling methods include:
1. *Convenience sampling*: A convenience sample includes the individuals who are most accessible to the researcher.
2. *Voluntary response sampling*: Similar to a convenience sample, a voluntary response sample is mainly based on ease of access. Instead of the researcher choosing participants and directly contacting them, people volunteer themselves.
3. *Purposive sampling*: This type of sampling, also known as judgment sampling, involves the researcher using their expertise to select a sample that is most useful to the purposes of the research.
4. *Snowball sampling*: If the population is hard to access, snowball sampling can be used to recruit participants via other participants. The number of people you have access to “snowballs” as you get in contact with more people.


### What is Estimation?
As alluded to earlier, estimation in statistics is any of the numerous procedures used to calculate the value of some property of a population from observations of a sample drawn from the population.

**Types of Estimation**
1. *Point Estimate*: A point estimate of a population parameter is a single value most likely to express the value of the parameter.
E.g. Based on the sample, we can estimate the mean age of a population to be 17.5 years.
2. *Interval Estimate*: An interval estimate is defined by two numbers, between which a population parameter is said to lie.
E.g. Based on the sample, we can estimate the mean age of a population to lie between 16 years and 18 years.
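
Both kinds of estimate can be computed from a sample. The sketch below uses a hypothetical sample of ages and, as a simplifying assumption, the normal critical value 1.96 for an approximate 95% interval; for a sample this small, a t-based critical value would give a somewhat wider interval.

```python
import numpy as np

# Hypothetical sample of ages drawn from a larger population
ages = np.array([16, 18, 17, 19, 17, 18, 16, 19, 18, 17])

n = len(ages)
point_estimate = ages.mean()                 # point estimate of the population mean
std_error = ages.std(ddof=1) / np.sqrt(n)    # standard error of the sample mean

# Interval estimate: approximate 95% interval around the point estimate
lower = point_estimate - 1.96 * std_error
upper = point_estimate + 1.96 * std_error

print("Point estimate:", point_estimate)
print("Interval estimate:", (lower, upper))
```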



## 3. Mean, Variance, and Standard Deviation
The mean, variance, and standard deviation are key features used to describe a data set.

### Mean
The mean is a measure of central tendency in descriptive statistics which shows the average value of a characteristic in a given statistical sample.
In simple terms, the mean is the average value of a particular data set.

Formula to calculate the arithmetic mean for an individual data series:
$$\bar{x} = \frac{x_1 + x_2 + ... + x_n}{n}$$
where $n$ is the number of values in the data set.

### Variance
Variance is another statistical feature of a data set that measures the spread or dispersion of values throughout the data set.
It is the average of the squared differences between each data value and the mean of the entire data set.
It is therefore defined in terms of the mean and can be calculated using the following formula:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
where $\mu$ is the mean of the data set and $N$ is the number of values in it.

### Standard Deviation
The standard deviation of a data set measures how much the values within the data set differ from the mean or average.
It is therefore a measure of how spread apart or dispersed a data set is.
The higher the standard deviation of a data set, the more spread apart the values within it are, and the lower the standard deviation of a data set, the more condensed the values within it are.

The formula for finding the standard deviation of a data set is:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
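
The three formulas in this section can be checked by computing them step by step. This is a minimal sketch with a hypothetical data set; it divides by $N$ (the population formulas above), which is also what NumPy's `var()` and `std()` do by default.

```python
import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical data set
N = len(x)

mean = x.sum() / N                       # sum of values divided by their count
variance = ((x - mean) ** 2).sum() / N   # average squared deviation from the mean
std = np.sqrt(variance)                  # square root of the variance

print("Mean:", mean)
print("Variance:", variance, "| NumPy:", x.var())
print("Standard deviation:", std, "| NumPy:", x.std())
```

Note that pandas uses a different default: `Series.var()` and `Series.std()` divide by $N - 1$ (the sample formulas) unless `ddof=0` is passed.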

## 4. Linear Transformation of Means
In linear algebra, a linear transformation (or simply transformation, sometimes called a linear map) is a mapping between two vector spaces:
it takes a vector as input and transforms it into a new output vector.
A function is said to be linear if it preserves addition and scalar multiplication, that is, the same result is obtained whether these operations are performed before or after the transformation.
In descriptive statistics, the term is used more loosely for any rescaling and shifting of a data set, as described next.


A linear transformation changes each original value in the data set $x$ into the new value $x_{new}$ given by an equation of the form

$$x_{new} = a + bx$$

For example, given a data set `lengths = [1, 2, 3, 4, 5]`, a linear transformation of lengths could be the addition of a constant 2 to each data value, so that $\text{lengths}_{new}$` = [3, 4, 5, 6, 7]`.

This is an example of a linear transformation because there exists a linear relationship between the old data set and the new data set as in the equation stated above with $a = 2$ and $b = 1$.

An additional example would be to increase each value in the original data set by $10\%$. For the same data set above, $\text{lengths}_{new}$` = [1.1, 2.2, 3.3, 4.4, 5.5]`.
We can confirm that this is also a linear transformation because there exists a linear relationship between the original data set and the new data set with $a = 0$ and $b = 1.1$.

**Effect of Linear Transformation on Mean**
Now that we understand what a linear transformation is, it is worth asking what effect a linear transformation has on the mean of the data set.
If a linear transformation $x_{new} = a + bx$ is performed on a data set, where $x$ is the original data set and $x_{new}$ is the new data set, the mean changes as follows:
 - The original mean is multiplied by the factor $b$
 - The constant $a$ is then added to the result
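
The effect on the mean follows directly from the definition of the mean. As a short derivation (a sketch using the same symbols as above, for a data set of $n$ values $x_1, \dots, x_n$):

$$\bar{x}_{new} = \frac{1}{n}\sum_{i=1}^{n}(a + bx_i) = \frac{na}{n} + b \cdot \frac{1}{n}\sum_{i=1}^{n}x_i = a + b\bar{x}$$

So the mean is transformed in exactly the same way as each individual value.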

The following code cell demonstrates this phenomenon for a data set `ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]`:

In [None]:
import numpy as np

# before linear transformation
ids = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # original data set
mean = ids.mean()  # original mean
print("Data set Before:", ids)
print("Mean Before:", mean)

# linear transformation
ids = 3 + (4 * ids)

# after linear transformation
mean = ids.mean()  # transformed mean
print("Data set After:", ids)
print("Mean After:", mean)

From the results above, when the data set was linearly transformed by multiplying each data value by 4 and adding 3, the mean of the transformed data set becomes the mean of the original data set multiplied by 4 and incremented by 3.

This confirms the effect of linear transformation on the mean of a data set.


## 5. Linear Transformation of Variance
As mentioned earlier, a linear transformation produces a new data set which has a linear relationship to the original.
The variance is a measure of variability. It is calculated by taking the average of squared deviations from the mean. Variance tells you the degree of spread in your data set. The more spread the data, the larger the variance is in relation to the mean.

**Effect of Linear Transformation on Variance**
As stated in the variance formula above, variance measures the difference between each data value and the mean of the data set.

 - If the linear transformation only involves the addition of a constant to each value in the data set, the variance of the transformed data set remains the same (because each value is increased by the same amount).

 - However, if the linear transformation involves the multiplication of the original data set by a factor $b$, the variance of the new data set is multiplied by the square of the factor, $b^2$.
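
Both cases can be combined into one result. As a short derivation (a sketch, writing $\mu$ for the original mean and using the fact that the transformed mean is $a + b\mu$):

$$\sigma_{new}^2 = \frac{1}{N}\sum_{i=1}^{N}\big((a + bx_i) - (a + b\mu)\big)^2 = \frac{1}{N}\sum_{i=1}^{N}b^2(x_i - \mu)^2 = b^2\sigma^2$$

The constant $a$ cancels out, and the factor $b$ appears squared. The standard deviation is therefore multiplied by $|b|$.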


The code cell below demonstrates this phenomenon for the data set `ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]`:

In [None]:
import numpy as np

# before linear transformation
ids = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # original data set
variance = ids.var()
print("Data Set Before:", ids)
print("Variance Before:", variance)

# linear transformation involving multiplication by factor
new_ids_1 = 0 + (4 * ids)

# linear transformation involving addition of constant
new_ids_2 = 3 + (1 * ids)

print(
    "\nAfter Linear Transformation Involving Addition of Constant\n----------------------------------------------------------")
variance = new_ids_2.var()
print("Data Set After:", new_ids_2)
print("Variance After:", variance)

# after linear transformation
print(
    "\nAfter Linear Transformation Involving Multiplication by Factor\n---------------------------------------------------------------")
variance = new_ids_1.var()
print("Data Set After:", new_ids_1)
print("Variance After:", variance)

As expected, the variance is not affected when the linear transformation only involves the addition of a constant.
However, when the linear transformation involves multiplication by a factor, the variance of the data set is multiplied by the square of that factor ($4^2 = 16$ in the example above, taking the variance from `8.25` to `132.0`).

## 6. Range
The range of a data set refers to the difference between the largest and smallest value in the data set.
It gives a rough indication of how well the central tendency represents the data. If the range is large, the central tendency is not as representative of the data as it would be if the range were small.

Example: In `{4, 6, 9, 3, 7}` the lowest value is `3`, and the highest is `9`. So the range is `9 − 3 = 6`.
The following code cell demonstrates the calculation of the range of a data set `numbers = [12, 4, 56, 8, 32]` with an expected range of `56 - 4 = 52`.

In [None]:
import numpy as np

numbers = np.array([12, 4, 56, 8, 32])
range_n = np.ptp(numbers)  # range = max - min; the ndarray.ptp() method was removed in NumPy 2.0
print(range_n)

## 7. Quartiles
A quartile is a statistical term that describes a division of observations into four defined intervals based on the values of the data and how they compare to the entire set of observations.
### How Quartiles work
There are three quartile values—a lower quartile, median, and upper quartile—to divide the data set into four ranges, each containing 25% of the data points. The lower quartile, or first quartile, is denoted as Q1 and is the middle number that falls between the smallest value of the dataset and the median. The second quartile, Q2, is also the median. The upper or third quartile, denoted as Q3, is the central point that lies between the median and the highest number of the distribution.
Now, we can map out the four groups formed from the quartiles. The first group of values contains the smallest number up to Q1; the second group includes Q1 to the median; the third set is the median to Q3; the fourth category comprises Q3 to the highest data point of the entire set.
Each interval contains 25% of the total observations. Generally, the data is arranged from smallest to largest:

- First interval: The set of data points between the minimum value and the first quartile.
- Second interval: The set of data points between the lower quartile and the median.
- Third interval: The set of data between the median and the upper quartile.
- Fourth interval: The set of data points between the upper quartile and the maximum value of the data set.


### Example of Quartile
Suppose the distribution of math scores in a class of 19 students in ascending order is:

59, 60, 65, 65, 68, 69, 70, 72, 75, 75, 76, 77, 81, 82, 84, 87, 90, 95, 98

First, mark down the median, Q2, which in this case is the 10th value: 75.

Q1 is the central point between the smallest score and the median. Excluding the median, the lower half consists of the first nine scores, and Q1 is their middle value, the fifth score: 68. (Note that the median can also be included when calculating Q1 or Q3 for an odd set of values. If we were to include the median on either side of the middle point, then Q1 will be the middle value between the first and 10th score, which is the average of the fifth and sixth score: (fifth + sixth)/2 = (68 + 69)/2 = 68.5).

Q3 is the middle value between Q2 and the highest score, which is the 15th score: 84. (Or if you include the median, Q3 = (82 + 84)/2 = 83).

Now that we have our quartiles, let’s interpret their numbers. A score of 68 (Q1) represents the first quartile and is the 25th percentile. 68 is the median of the lower half of the score set in the available data—that is, the median of the scores from 59 to 75.

Q1 tells us that 25% of the scores are less than 68 and 75% of the class scores are greater. Q2 (the median) is the 50th percentile and shows that 50% of the scores are less than 75, and 50% of the scores are above 75. Finally, Q3, the 75th percentile, reveals that 25% of the scores are greater and 75% are less than 84.

### What Is the Interquartile Range of a Data Set?
The interquartile range (IQR) covers the middle 50% of measurements in a data set. In other words, it is the range of data between the upper quartile and the lower quartile, $IQR = Q3 - Q1$. This is often more statistically meaningful than using the full range of data, because it is not affected by possible outliers.

The following code cell demonstrates how to extract quartiles from the data set `numbers = [59, 60, 65, 65, 68, 69, 70, 72, 75, 75, 76, 77, 81, 82, 84, 87, 90, 95, 98]`:

In [None]:
import numpy as np

numbers = np.array([59, 60, 65, 65, 68, 69, 70, 72, 75, 75, 76, 77, 81, 82, 84, 87, 90, 95, 98])

# first quartile(25th percentile)
q1 = np.percentile(numbers, 25)

# second quartile (50th percentile/median)
q2 = np.percentile(numbers, 50)

# third quartile (75th percentile)
q3 = np.percentile(numbers, 75)

print("Q1:", q1)
print("Q2:", q2)
print("Q3:", q3)
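
The interquartile range described above can be obtained from the same quartiles. This is a minimal sketch using the same scores; note that `np.percentile` interpolates linearly by default, so it returns the values obtained when the median is included in each half (68.5 and 83) rather than 68 and 84.

```python
import numpy as np

numbers = np.array([59, 60, 65, 65, 68, 69, 70, 72, 75, 75, 76, 77, 81, 82, 84, 87, 90, 95, 98])

# Lower and upper quartiles in a single call
q1, q3 = np.percentile(numbers, [25, 75])

# Interquartile range: spread of the middle 50% of the scores
iqr = q3 - q1
print("IQR:", iqr)
```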

## 8. Percentiles
#### What is a Percentile
Percentiles are used to understand and interpret data. A percentile is the value below which a given percentage of the data falls.

#### What is the formula for Percentile
Percentiles can be calculated using the formula n = (P/100) × N, where P = percentile, N = number of values in a data set (sorted from smallest to largest), and n = ordinal rank of a given value.

#### How is Percentile Calculated
The percentile formula can be applied in a few steps. If q is any number between zero and one hundred, the qth percentile is a value that divides the data into two parts: the lower part contains q percent of the data and the upper part contains the rest.

1. Arrange the data set in ascending order.
2. Count the number of values in the data set and represent it as r.
3. Calculate the value of q/100.
4. Multiply q/100 by r to obtain the rank.
5. If the rank is not a whole number, round it to a whole number (conventionally up), then count through the sorted values until you reach that position. That value is the qth percentile.
6. If the rank is a whole number, count through the sorted values to that position and take the mean of that value and the next one. That mean is the qth percentile.

#### Calculating Percentiles using values in a Data set

Three common definitions of the k’th percentile are:
1. The smallest value that is greater than k percent of the values.
2. The smallest value that is greater than or equal to k percent of the values.
3. An interpolated value between the two closest ranks.

To calculate percentiles using these three approaches, start by ranking your dataset from the lowest to the highest values.
Let’s use these three methods to find the 70th percentile of a ranked dataset with 11 values (n = 11), in which the 8th ranked value is 35 and the 9th ranked value is 40.

**Definition 1: Greater Than**
Using the first definition, we need to find the value that is greater than 70% of the values, and there are 11 values. Take 70% of 11, which is 7.7. Then, round 7.7 up to 8. Using the first definition, the value for the 70th percentile must be greater than eight values. Consequently, we pick the 9th ranked value in the dataset, which is 40.

**Definition 2: Greater Than or Equal To**
Using the second definition, we need to find the value that is greater than or equal to 70% of the values. Thanks to the “equal to” portion of the definition, we can use the 8th ranked value, which is 35.
Using the first two definitions, we have found two values for the 70th percentile: 35 and 40.

**Definition 3: Using an Interpolation Approach**
As you saw above, using either “greater” or “greater than or equal to” changes the results. Depending on the nature and size of your dataset, this difference can be substantial. Consequently, a third approach interpolates between two data values.
To calculate an interpolated percentile, do the following:
1. Calculate the rank to use for the percentile. Use: rank = p(n + 1), where p = the percentile and n = the sample size. For our example, to find the rank for the 70th percentile, we take 0.7 × (11 + 1) = 8.4.
2. If the rank in step 1 is an integer, find the data value that corresponds to that rank and use it for the percentile.
3. If the rank is not an integer, you need to interpolate between the two closest observations. For our example, 8.4 falls between 8 and 9, which corresponds to the data values of 35 and 40.
4. Take the difference between these two observations and multiply it by the fractional portion of the rank. For our example, this is: (40 − 35) × 0.4 = 2.
5. Take the lower-ranked value in step 3 and add the value from step 4 to obtain the interpolated value for the percentile. For our example, that value is 35 + 2 = 37.

Using three common calculations for percentiles, we find three different values for the 70th percentile: 35, 37, and 40.
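
The three calculations can be sketched in code. The data set below is hypothetical: it was chosen only so that it has 11 values with 35 as the 8th ranked value and 40 as the 9th, matching the worked example. The sketch does not handle edge cases such as the 100th percentile.

```python
import math

# Hypothetical ranked data set (n = 11); 8th value is 35, 9th value is 40
data = sorted([10, 12, 15, 20, 22, 25, 30, 35, 40, 45, 50])
n = len(data)
p = 70  # percentile to find

# Definition 1: smallest value greater than p percent of the values
count = math.ceil(p * n / 100)      # 7.7 rounds up to 8 values
greater_than = data[count]          # zero-based index 8 is the 9th ranked value

# Definition 2: smallest value greater than or equal to p percent of the values
greater_or_equal = data[count - 1]  # zero-based index 7 is the 8th ranked value

# Definition 3: interpolate at rank = p(n + 1)
rank = p * (n + 1) / 100            # 8.4
lower_rank = int(rank)              # whole part of the rank: 8
fraction = rank - lower_rank        # fractional part of the rank: about 0.4
lower_value = data[lower_rank - 1]  # 8th ranked value
upper_value = data[lower_rank]      # 9th ranked value
interpolated = lower_value + fraction * (upper_value - lower_value)

print(greater_than, greater_or_equal, round(interpolated, 2))
```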

#### Syntax of NumPy Percentile
`numpy.percentile(a, q, axis=None, out=None, overwrite_input=False, method='linear', keepdims=False)`

In NumPy versions before 1.22, the `method` parameter was named `interpolation`.

The main parameters are the following.
- `a`: Your input array.
- `q`: Percentile or sequence of percentiles to compute. Each value should be between 0 and 100 inclusive.
- `axis`: Axis or axes along which the percentiles should be calculated.
- `out`: Alternative output array in which to place the result.
- `overwrite_input`: If True, the input array may be modified by intermediate calculations.
- `method`: How the percentile is estimated when it falls between two data points. The default, `'linear'`, interpolates between them.

#### Example of implementation of NumPy percentile
Note: You must import the numpy module for it to function correctly.

In [None]:
import numpy as np

# Example: calculating a percentile of a 1D NumPy array
# We create a one-dimensional array and call np.percentile() on it.

array_1d = np.array([1, 2, 3, 4, 9, 12])
print(np.percentile(array_1d, 50))

**Output**:
For this example, the output is `3.5`, the 50th percentile (median) of the array.


#### Applications of Percentile
1. Test scores
2. Biometric measurements

## 9. Covariance and Pearson’s Correlation
### Covariance
Covariance measures the tendency of two variables to increase or decrease together with respect to their mean values.
Covariance can be positive, negative, or zero. Positive covariance indicates that if one of the two variables has an increment in its values, the other variable’s values also tend to increase.
Negative covariance indicates that if one of the two variables has an increment in its values, the other variable’s values tend to decrease.
Zero covariance indicates there is no linear relationship between the two variables.

The following code cell demonstrates how to calculate the covariance of two quantitative variables stored as columns in a pandas data frame:

In [None]:
import pandas as pd

data = pd.DataFrame({"hours_of_practice": [1, 5, 7, 2, 19, 67], "SAT Score": [970, 1230, 1000, 1200, 1490, 1590]})
print(data)
print(data.cov())

The off-diagonal entries of the covariance matrix are positive (the diagonal entries are the variances of each column). This indicates that `SAT Score` tends to be high when `hours_of_practice` is high, and vice versa.


### Correlation
Correlation is a normalized version of covariance. Covariance is a quantity that depends on the units of the variables used. Thus, if the units of either or both of the variables are changed, the value of the covariance changes too. Correlation was introduced to solve this problem of units. Correlation can be positive, negative, or neutral.

**Correlation coefficients** reflect the strength of the linear relationship between the two variables and the direction of their linear trend. In comparison, covariance coefficients are only used to highlight the direction (positive, negative, or neutral) of a linear relationship between the two variables. That is mainly because the correlation coefficient values are normalised using the product of standard deviations.

#### Pearson’s Correlation
The Pearson correlation coefficient (r) is the most common way of measuring a linear correlation between numerical variables. It is a number between −1 and 1 that measures the strength and direction of the relationship between two variables, where 0 is no linear correlation, 1 is total positive correlation, and −1 is total negative correlation.
For example, a correlation value of 0.7 between two variables would indicate that a fairly strong positive relationship exists between the two. A positive correlation signifies that if variable A goes up, then B tends to go up as well, whereas if the correlation is negative, B tends to decrease as A increases.
The Pearson coefficient is calculated between two numerical variables (e.g., age and salary); relationships between categorical variables (e.g., type of product and profession) require other measures of association. As well as the correlation, the covariance of two variables is often calculated. In contrast with the correlation value, which must be between −1 and 1, the covariance may assume any numerical value. The covariance indicates the degree to which the two variables vary together.

The following code cell demonstrates how to calculate the correlation of two numerical variables in a pandas data frame:

In [None]:
import pandas as pd

data = pd.DataFrame({"hours_of_practice": [1, 5, 7, 2, 19, 67], "SAT Score": [970, 1230, 1000, 1200, 1490, 1590]})
print(data)  # display the data set
print(data.corr())  # calculate correlation

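The results indicate a correlation coefficient of approximately 0.80 between `hours_of_practice` and `SAT Score`. In this data set, more hours of practice are therefore associated with higher SAT scores, although correlation alone does not show that one causes the other.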
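
The statement that correlation is covariance normalised by the product of the standard deviations can be checked by hand. The following is a minimal sketch using the same two columns as plain NumPy arrays; the variable names are chosen here for illustration.

```python
import numpy as np

hours = np.array([1, 5, 7, 2, 19, 67])
sat = np.array([970, 1230, 1000, 1200, 1490, 1590])

# Sample covariance (divides by n - 1, the same convention pandas uses)
cov = ((hours - hours.mean()) * (sat - sat.mean())).sum() / (len(hours) - 1)

# Pearson's r: covariance divided by the product of the standard deviations
r = cov / (hours.std(ddof=1) * sat.std(ddof=1))

print("Covariance:", cov)
print("Pearson r:", r)
```

Unlike the covariance, `r` would stay the same if the hours were expressed in minutes, because the change of units cancels out in the division.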

## 10. Handling Missing Values
When working with a data set, one often finds that some values are missing.

There are 2 primary ways of handling missing values:
- Deleting the Missing values
- Imputing the Missing Values

#### Deleting the Missing value
Generally, this approach is not recommended, because it also discards the valid information in the affected observations. It is one of the quick and dirty techniques one can use to deal with missing values.
It involves first identifying all observations in the data set containing missing values and removing them.
In pandas DataFrames, this is accomplished using the `dropna()` DataFrame method.

The following code cell demonstrates this approach for a data set containing student names and their grades with some students having missing grades:

In [None]:
import pandas as pd

# Before deletion
std_grades = pd.DataFrame([["Ekyi", 100], ["Fritz"], ["Rose", 95], ["Maria", 28], ["Sheena"]])
print("With Missing Values:\n--------------------\n",std_grades)

# Deleting missing values
std_grades.dropna(inplace=True)

# After deletion
print("\nWithout Missing Values:\n------------------------\n", std_grades)

As expected, the rows containing missing values were removed from the data set.
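
Before deleting anything, it helps to identify where the missing values are. The sketch below uses the same students and grades; the column names `name` and `grade` are hypothetical, added only for readability, and the missing grades are written explicitly as `None`.

```python
import pandas as pd

# Same students as above, with hypothetical column names
std_grades = pd.DataFrame(
    [["Ekyi", 100], ["Fritz", None], ["Rose", 95], ["Maria", 28], ["Sheena", None]],
    columns=["name", "grade"],
)

# Count the missing values in each column
missing_per_column = std_grades.isna().sum()
print(missing_per_column)

# Show only the rows that contain at least one missing value
rows_with_missing = std_grades[std_grades.isna().any(axis=1)]
print(rows_with_missing)
```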

#### Imputing the Missing Value
This method involves replacing all missing values.
There are different ways of replacing the missing values, and Python libraries such as pandas and scikit-learn provide tools for doing so:

1. **Replacing With Arbitrary Value**: If you can make an educated guess about the missing value, then you can replace it with that value.
2. **Replacing With Mean**: This is the most common method of imputing missing values of numeric columns. If there are outliers then the mean will not be appropriate. In such cases, outliers need to be treated first.
3. **Replacing With Mode**: Mode is the most frequently occurring value. It is used in the case of categorical features.
4. **Replacing With Median**: Median is the middlemost value. It’s better to use the median value for imputation in the case of outliers.

For pandas DataFrames, the `fillna()` method is used to replace all missing values in a DataFrame.
The following code cell demonstrates all 4 aforementioned approaches for a pandas DataFrame:

In [None]:
import pandas as pd

# Before impute
std_grades = pd.DataFrame([["Ekyi", 100], ["Fritz"], ["Rose", 95], ["Maria", 28], ["Sheena"], ["Ymir", 95]])
print("With Missing Values:\n--------------------\n",std_grades)

# Method 1: Arbitrary values
arb_val = 8.5
print("\nMethod 1:\n-----------------\n", std_grades.fillna(arb_val))  # replace all missing values with 8.5

# Method 2: Mean
print("\nMethod 2:\n-----------------")
mean = std_grades[1].mean()
print("Mean:", mean)
print("", std_grades.fillna(mean))  # replace all missing values with mean

# Method 3: Mode
print("\nMethod 3:\n-----------------")
mode = std_grades[1].mode()[0]
print("Mode:", mode)
print("", std_grades.fillna(mode))  # replace all missing values with mode

# Method 4: Median
print("\nMethod 4:\n-----------------")
median = std_grades[1].median()
print("Median:", median)
print("", std_grades.fillna(median))  # replace all missing values with median

## 11. Detecting Outliers
Outliers or anomalies in a data set refer to data values that are unusually separated from other values within the data set.
For instance, in the data set `scores = [1, 2, 3, 2, -1, 4, 5000]`, `5000` is an outlier as it is far larger than all other values within the data set.
It does not follow the general trend of the data set, and can even be said to "not belong" to the data set.
The scatter plot below visualizes the example data set above and helps to further clarify the nature of outliers:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

scores = pd.Series([1, 2, 3, 2, -1, 4, 5000])
x_values = pd.Series([1, 2, 3, 4, 5, 6, 7])  # x-axis values: position of each score
plt.scatter(x_values, scores)
plt.show()

As can be seen in the figure above, the data point `5000` lies unusually far away from all other points in the data set. It is therefore an outlier or anomaly.

Outliers can occur in any data set used for data science. Two common methods for detecting them are as follows:
1. Using the Standard Deviation
2. Visualizing the data set with a box plot

### Using Standard Deviation
As mentioned in section 3, the standard deviation of a data set measures how spread apart the data set is (i.e. how far the values in the data set typically are from the mean).

In statistics, a data value is considered to be an outlier if it lies more than **some number of standard deviations away from the mean** of the data set.
The number of standard deviations is typically referred to as the threshold.

That is, if one calculates the standard deviation of a data set to be $s$ and sets the threshold for detecting outliers to be 3, any value that is smaller than the mean by more than $3s$ or larger than the mean by more than $3s$ will be considered an outlier.

For instance, consider the following data set comprising salaries of 10 employees.
`salary = [100, 150, 85, 200, 105, 65, 275, 250, 220, 1000]`

Intuitively, `1000` looks like an outlier.
To check this, we shall calculate the standard deviation and compare all values against it to detect those which are outliers.

We first start by calculating the mean and standard deviation before comparing each data point as shown:

In [None]:
import pandas as pd
from math import sqrt

salary = pd.Series([100, 150, 85, 200, 105, 65, 275, 250, 220, 1000])

# calculating mean
total = 0
N = len(salary)
for amount in salary:
    total += amount
mean = total / N
print("Mean:", mean)

# calculating standard deviation
std_total = 0
for amount in salary:
    std_total += (amount - mean) ** 2
std = sqrt(std_total / N)
print("Standard Deviation:", std)

# comparing for outliers
threshold = 2  # using a threshold of 2 standard deviations from mean
for amount in salary:
    is_outlier = False
    if amount < (mean - threshold * std):
        is_outlier = True
    elif amount > (mean + threshold * std):
        is_outlier = True
    print(amount, is_outlier)

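As we predicted, the number 1000 has been revealed to be an outlier in our data set using a threshold of 2 standard deviations from the mean.

Note that, by default, the threshold for detecting outliers is usually set to 3 standard deviations. With this small data set, however, the outlier itself inflates the standard deviation (about 261), so 1000 falls just inside the upper limit of 245 + 3 × 261 ≈ 1028 and would not be flagged at that threshold.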
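
The same check can be written without explicit loops. This is a minimal sketch assuming pandas; it computes, for every salary, how many standard deviations it lies from the mean (often called a z-score) and keeps the values beyond the threshold.

```python
import pandas as pd

salary = pd.Series([100, 150, 85, 200, 105, 65, 275, 250, 220, 1000])

# ddof=0 gives the population standard deviation, matching the loop above
z_scores = (salary - salary.mean()) / salary.std(ddof=0)

threshold = 2  # number of standard deviations from the mean
outliers = salary[z_scores.abs() > threshold]
print(outliers.tolist())
```

Changing `threshold` makes it easy to see how sensitive the result is to the chosen cut-off.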

### Visualizing the Data Set with a Box Plot
A box plot is a special type of plot comprising a rectangular box with whiskers that provides a visual summary of a data set using 5 distinct numbers, namely
 - The Minimum
 - The 1st Quartile
 - The 2nd Quartile (Median)
 - The 3rd Quartile
 - The Maximum

**Parts of a Box Plot**
<img alt="parts of a box plot" src="boxplot.png" title="Parts of a Box Plot"/>

The Minimum is simply the smallest value within the data set.
The Maximum is simply the largest value within the data set.

If the sorted data set is divided into 4 equal parts or quarters:
 - The 1st Quartile (Q1) is the value that separates the 1st and 2nd quarters.
 - The 2nd Quartile (Q2) is the value that separates the 2nd and 3rd quarters.
 - The 3rd Quartile (Q3) is the value that separates the 3rd and 4th quarters.

The data set must be sorted before any of these 5 numbers can be calculated.

The difference between the 3rd quartile and the 1st quartile, $IQR = Q3 - Q1$, is referred to as the interquartile range (IQR).
This number is used to identify outliers in a data set.
Specifically, outliers in a data set are values that are either
 - below $Q1 - 1.5 \times IQR$
 - above $Q3 + 1.5 \times IQR$

Consider the following data set: `heights = [17.0, 18.5, 22.0, 12.3, 14.7, 19.1, 20.0, 13.5, 1, 43]`
The heights `1` and `43` stand out as likely outliers.
The code cell below constructs a box plot to confirm this:

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

heights = pd.Series([17.0, 18.5, 22.0, 12.3, 14.7, 19.1, 20.0, 13.5, 1, 43])
plt.boxplot(heights, vert=False)  # construct a horizontal boxplot using heights
plt.show()

As expected, the markers for the data points `1` and `43` are drawn apart from the box and whiskers, indicating that they are outliers or anomalies.

Note that the whiskers of this box plot do not extend to the minimum and maximum values of the data set (which are the outliers).
Instead, they end at the smallest and largest values that remain once the outliers are excluded.
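The $1.5 \times IQR$ rule can also be applied numerically, without drawing anything. The cell below is a minimal sketch, assuming NumPy's default (linear interpolation) percentile method, which should agree with the box plot for this data; the names `lower_fence`, `upper_fence`, and `inliers` are introduced here for illustration.

```python
import numpy as np

heights = np.array([17.0, 18.5, 22.0, 12.3, 14.7, 19.1, 20.0, 13.5, 1, 43])

# quartiles and interquartile range
q1, q3 = np.percentile(heights, [25, 75])
iqr = q3 - q1

# values beyond these limits are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (heights < lower_fence) | (heights > upper_fence)
outliers = heights[is_outlier]
inliers = heights[~is_outlier]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
# the whiskers end at the most extreme values that are not outliers
print("Whisker ends:", inliers.min(), inliers.max())
```

The two flagged values are `1` and `43`, and the whisker ends are the smallest and largest of the remaining heights.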

## 12. Handling Outliers
Once outliers have been detected with either of the methods described above, there are 3 common ways to handle them:
1. Deleting outliers from the data set.
2. Imputing with the 10th and 90th percentiles.
3. Imputing with the median.

### Deleting outliers
This method involves deleting all values that are identified as outliers from the data set.

**Deleting outliers detected via Boxplot**
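If the outliers in the data set were identified using a box plot, they can be read back from the value returned by `plt.boxplot()`. The function returns a dictionary whose `fliers` key holds the plotted outlier markers; the outlier values are the markers' x data for a horizontal box plot, or their y data for a vertical one.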

The following code cell retrieves outliers from a horizontal boxplot and removes them from the data set:

In [None]:
import matplotlib.pyplot as plt

heights = [17.0, 18.5, 22.0, 12.3, 14.7, 19.1, 20.0, 13.5, 1, 43, 78]
print("Data set with outliers:", heights)

# construct horizontal boxplot
bplot = plt.boxplot(heights, vert=False)

# retrieve outliers
outliers = bplot["fliers"][0].get_xdata()
print("Outliers:", outliers)

# remove outliers
for outlier in outliers:
    heights.remove(outlier)

print("Data set without outliers:", heights)

The following code cell accomplishes the same with a vertical boxplot instead:

In [None]:
import matplotlib.pyplot as plt

heights = [17.0, 18.5, 22.0, 12.3, 14.7, 19.1, 20.0, 13.5, 1, 43, 78]
print("Data set with outliers:", heights)

# construct vertical boxplot
bplot = plt.boxplot(heights, vert=True)  # NB: vert=False -> vert=True

# retrieve outliers
outliers = bplot["fliers"][0].get_ydata()  # NB: .get_xdata() -> .get_ydata()
print("Outliers:", outliers)

# remove outliers
for outlier in outliers:
    heights.remove(outlier)

print("Data set without outliers:", heights)

A similar approach can be used to delete outliers detected using the standard deviation.
Every value that lies more than `threshold * standard deviation` below or above the mean is deleted from the data set.
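As an illustration, the cell below applies this to the salary data from the previous section. It is a minimal sketch, assuming NumPy, a threshold of 2, and the population standard deviation; `lower_limit`, `upper_limit`, and `cleaned` are names introduced here for illustration.

```python
import numpy as np

salary = np.array([100, 150, 85, 200, 105, 65, 275, 250, 220, 1000])
threshold = 2  # number of standard deviations allowed on either side of the mean

mean = salary.mean()
std = salary.std()  # population standard deviation (divides by N)

lower_limit = mean - threshold * std
upper_limit = mean + threshold * std

# keep only the values that lie within the limits
cleaned = salary[(salary >= lower_limit) & (salary <= upper_limit)]

print("Data set with outliers:", salary.tolist())
print("Data set without outliers:", cleaned.tolist())
```

Filtering with a boolean mask builds a new array and leaves the original `salary` untouched, so the two can still be compared afterwards.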

### Imputing with 10th and 90th percentiles
Another way of dealing with outliers is to cap the data at the 10th and 90th percentiles: every value below the 10th percentile is replaced with the 10th percentile, and every value above the 90th percentile is replaced with the 90th percentile.

The following code cell demonstrates this for the data set `temperatures = [37, 28, 33, 35, 18, 9, 89, 34, -12, 40]`, whose box plot outliers are `-12` and `89`:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

temperatures = [37, 28, 33, 35, 18, 9, 89, 34, -12, 40]

# calculating percentiles
percentile10 = np.percentile(temperatures, 10)
percentile90 = np.percentile(temperatures, 90)

# box plot before imputation
plt.boxplot(temperatures, vert=False)
plt.title("Before Imputation")
plt.show()


# imputing outliers: cap values below the 10th or above the 90th percentile
for i in range(len(temperatures)):
    if temperatures[i] < percentile10:
        temperatures[i] = percentile10
    elif temperatures[i] > percentile90:
        temperatures[i] = percentile90



# box plot after imputation
plt.boxplot(temperatures, vert=False)
plt.title("After Imputation")
plt.show()

As expected, capping the values at the 10th and 90th percentiles removed the outliers from our data set.
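The capping step can also be done in a single call with `np.clip`, which limits every value to a given lower and upper bound. The cell below is a minimal sketch of that alternative, assuming NumPy's default percentile interpolation; `capped` is a name introduced here for illustration.

```python
import numpy as np

temperatures = np.array([37, 28, 33, 35, 18, 9, 89, 34, -12, 40])

# 10th and 90th percentiles of the data set
p10, p90 = np.percentile(temperatures, [10, 90])

# values below p10 become p10, values above p90 become p90
capped = np.clip(temperatures, p10, p90)

print("10th percentile:", p10)
print("90th percentile:", p90)
print("After capping:", capped)
```

Values that already lie between the two percentiles are left unchanged.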

### Imputing with Median
Another way of getting rid of outliers in a given data set is to replace every outlier with the median of the data set.
In the following code cell, the outliers in the data set `scores = [1020, 1350, 1490, 1350, 1480, 1390, 1500, 1400, 890, 2400]` are replaced with the median:

In [None]:
import matplotlib.pyplot as plt
import numpy as np

scores = [1020, 1350, 1490, 1350, 1480, 1390, 1500, 1400, 890, 2400]

# calculating median
median = np.median(scores)

# boxplot before imputation
bplot = plt.boxplot(scores, vert=False)
plt.title("Before Imputation")
plt.show()

# extracting outliers from boxplot
outliers = bplot["fliers"][0].get_xdata()
print("Outliers:", outliers)

# imputing median
for i in range(len(scores)):
    if scores[i] in outliers:
        scores[i] = median

# boxplot after imputation
plt.boxplot(scores, vert=False)
plt.title("After Imputation")
plt.show()

As expected, the box plot confirms that the outliers have been removed from our data set after they were replaced with the median.
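The box plot is only needed for the pictures; the outliers can also be located directly with the $1.5 \times IQR$ rule and then replaced. The cell below is a minimal sketch of that variant, assuming NumPy's default percentile interpolation; `lower_fence`, `upper_fence`, and `imputed` are names introduced here for illustration.

```python
import numpy as np

scores = np.array([1020, 1350, 1490, 1350, 1480, 1390, 1500, 1400, 890, 2400])

# median of the original data set
median = np.median(scores)

# limits given by the 1.5 x IQR rule
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# replace every value outside the limits with the median
is_outlier = (scores < lower_fence) | (scores > upper_fence)
imputed = np.where(is_outlier, median, scores)

print("Outliers:", scores[is_outlier])
print("After imputation:", imputed)
```

Here `np.where` picks the median wherever the mask is true and keeps the original score everywhere else.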

<h1 align="center"> END OF NOTEBOOK</h1>