This notebook explains, with examples, how to calculate the mean, median, mode, variance, standard deviation, and IQR, and how to remove outliers, using both pandas and NumPy. It closes with a summary of the differences between the two libraries.

In [1]:
# Example Dataset
import pandas as pd
import numpy as np

data = {'values': [12, 15, 12, 15, 18, 14, 14, 15, 16, 19, 22, 12, 100]}  # Last value is an outlier
df = pd.DataFrame(data)
arr = np.array(data['values'])

In [2]:
# Calculations Using Pandas

mean_p = df['values'].mean()
median_p = df['values'].median()
mode_p = df['values'].mode()   # Can return multiple modes as Series
variance_p = df['values'].var()    # Sample variance (ddof=1)
std_dev_p = df['values'].std()     # Sample std deviation (ddof=1)

# IQR Calculation:
Q1 = df['values'].quantile(0.25)
Q3 = df['values'].quantile(0.75)
IQR = Q3 - Q1

In [5]:
# Calculations Using Numpy

mean_n = np.mean(arr)
median_n = np.median(arr)

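# NumPy has no built-in mode; use scipy.stats.mode instead.
# Note: scipy returns only the smallest of tied modes (here 12),
# whereas pandas .mode() returns every tied value (here 12 and 15).
from scipy import stats
mode_n = stats.mode(arr).mode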

variance_n = np.var(arr, ddof=1)    # Sample variance: ddof=1 for consistency with pandas
std_dev_n = np.std(arr, ddof=1)     # Sample std deviation

# IQR Calculation:
Q1_n = np.percentile(arr, 25)
Q3_n = np.percentile(arr, 75)
IQR_n = Q3_n - Q1_n
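
As a quick sanity check, the pandas and NumPy results can be compared side by side. The following is a minimal sketch that rebuilds the same dataset so it runs on its own; the names `pandas_stats` and `numpy_stats` are introduced here for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

values = [12, 15, 12, 15, 18, 14, 14, 15, 16, 19, 22, 12, 100]
s = pd.Series(values)
arr = np.array(values)

# pandas: var/std use ddof=1 unless told otherwise
pandas_stats = {
    "mean": s.mean(),
    "median": s.median(),
    "variance": s.var(),
    "std": s.std(),
    "iqr": s.quantile(0.75) - s.quantile(0.25),
}

# NumPy: ddof=1 is passed explicitly to match pandas
numpy_stats = {
    "mean": np.mean(arr),
    "median": np.median(arr),
    "variance": np.var(arr, ddof=1),
    "std": np.std(arr, ddof=1),
    "iqr": np.percentile(arr, 75) - np.percentile(arr, 25),
}

for name in pandas_stats:
    match = np.isclose(pandas_stats[name], numpy_stats[name])
    print(f"{name:>8}: pandas={pandas_stats[name]:.4f}  "
          f"numpy={numpy_stats[name]:.4f}  match={match}")
```

With `ddof=1` set on the NumPy side, every statistic should agree between the two libraries for this dataset.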

In [9]:
# Removing Outliers Based on IQR (Pandas Example)

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

filtered_df = df[(df['values'] >= lower_bound) & (df['values'] <= upper_bound)]
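
To see what the filter actually does, it can help to print the bounds and the rows that were dropped. This is a small standalone sketch on the same dataset; the variable names `inside` and `dropped_df` are assumptions added for illustration, and `Series.between` is used here as an equivalent to the two chained comparisons.

```python
import pandas as pd

df = pd.DataFrame({"values": [12, 15, 12, 15, 18, 14, 14, 15, 16, 19, 22, 12, 100]})

q1 = df["values"].quantile(0.25)
q3 = df["values"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# between() is inclusive on both ends by default,
# which matches (>= lower_bound) & (<= upper_bound)
inside = df["values"].between(lower_bound, upper_bound)
filtered_df = df[inside]
dropped_df = df[~inside]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"bounds=[{lower_bound}, {upper_bound}]")
print("dropped:", dropped_df["values"].tolist())
print(f"mean before={df['values'].mean():.2f}, "
      f"after={filtered_df['values'].mean():.2f}")
```

For this dataset Q1 is 14 and Q3 is 18, so the IQR is 4 and the bounds are 8 and 24. Only the value 100 falls outside them, and removing it moves the mean from about 21.85 to about 15.33.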

In [10]:
# Removing Outliers Using Numpy

lower_bound_n = Q1_n - 1.5 * IQR_n
upper_bound_n = Q3_n + 1.5 * IQR_n

filtered_arr = arr[(arr >= lower_bound_n) & (arr <= upper_bound_n)]
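
If the same filter is needed in several places, the steps above can be wrapped in a function. The sketch below is one possible way to do it; the function name `remove_outliers_iqr` and its `k` parameter (the IQR multiplier, 1.5 in the cells above) are hypothetical and not taken from any library.

```python
import numpy as np

def remove_outliers_iqr(values, k=1.5):
    """Return the elements of `values` inside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])  # both quartiles in one call
    iqr = q3 - q1
    mask = (values >= q1 - k * iqr) & (values <= q3 + k * iqr)
    return values[mask]

arr = np.array([12, 15, 12, 15, 18, 14, 14, 15, 16, 19, 22, 12, 100])
cleaned = remove_outliers_iqr(arr)
print(cleaned)
```

A larger `k` makes the filter more tolerant; `k=1.5` is the conventional choice used in this notebook.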

| Operation        | Pandas                                    | Numpy                                   |
|------------------|-------------------------------------------|-----------------------------------------|
| Mean, Median     | Works on Series/DataFrame; skips NaN by default | Works on ndarray; NaN propagates unless `np.nanmean`/`np.nanmedian` is used |
| Mode             | `.mode()` returns Series of all tied mode(s) | No built-in mode; use `scipy.stats.mode`, which returns only the smallest tied mode |
| Variance, Std Dev| Default: sample variance (ddof=1)         | Default: population variance (ddof=0), but ddof can be set |
| IQR              | `.quantile()` for percentiles             | `np.percentile()`                        |
| Outlier Removal  | Boolean indexing on DataFrame             | Boolean indexing on ndarray              |
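
The variance row is the difference most likely to cause silent mismatches, so here is a minimal sketch of it on the same dataset. The variable names are illustrative assumptions; the calls themselves are the standard `Series.var` and `np.var`.

```python
import numpy as np
import pandas as pd

values = [12, 15, 12, 15, 18, 14, 14, 15, 16, 19, 22, 12, 100]
s = pd.Series(values)
arr = np.array(values)
n = len(values)

sample_var = s.var()          # pandas default: ddof=1 (divide by n - 1)
population_var = np.var(arr)  # NumPy default: ddof=0 (divide by n)

# The two defaults differ exactly by the factor (n - 1) / n
print(f"pandas default : {sample_var:.4f}")
print(f"numpy default  : {population_var:.4f}")
print(f"rescaled pandas: {sample_var * (n - 1) / n:.4f}")

# pandas reports every tied mode for this dataset
print("pandas modes:", s.mode().tolist())
```

Passing `ddof` explicitly on both sides (for example `s.var(ddof=0)` or `np.var(arr, ddof=1)`) avoids relying on either default.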


To judge from a scatterplot whether two variables are positively correlated, negatively correlated, or uncorrelated, use these guidelines:

| Pattern                               | Correlation Type    | Interpretation                          |
|---------------------------------------|-------------------|----------------------------------------|
| Upward slope (bottom-left to top-right) | Positive correlation | Variables increase together            |
| Downward slope (top-left to bottom-right) | Negative correlation | One variable increases, the other decreases |
| No clear slope or pattern              | No correlation     | Variables do not have a linear relationship |
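
The visual reading can be backed up numerically. As an addition not found in the original notebook, the sketch below computes the Pearson correlation coefficient with `np.corrcoef` on small made-up data; the arrays and the helper name `pearson_r` are hypothetical and chosen only to mirror the three rows of the table.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

y_up = 3 * x + 1      # upward slope
y_down = -2 * x + 5   # downward slope
y_curve = x ** 2      # clear pattern, but no linear trend

def pearson_r(a, b):
    """Pearson correlation coefficient, read from the 2x2 correlation matrix."""
    return np.corrcoef(a, b)[0, 1]

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_curve = pearson_r(x, y_curve)

for label, r in [("upward", r_up), ("downward", r_down), ("curve", r_curve)]:
    print(f"{label:>8}: r = {r:.2f}")
```

Values of r near +1 or -1 correspond to the first two rows of the table, and values near 0 to the third. The curved example is a reminder that r near 0 only rules out a linear relationship: the points still follow a clear pattern.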
