**Tutorial 1. Data Aggregation: Summarising Data with Mean, Median, Mode, Standard Deviation, Variance, Quantiles, and Percentiles**

***1.1. Mean, Median, Mode, Standard Deviation, Max, Min in Pandas DataFrame***

In [None]:
import pandas as pd
from IPython.display import display

# Read the CSV file from the directory
diabities_df = pd.read_csv("/data/chapter1/diabetes.csv")

print(f'\n Mean \n \n {diabities_df.mean()}')

print(f'\n Median \n \n {diabities_df.median()}')

print(f'\n Mode \n \n {diabities_df.mode()}')

print(f'\n Standard Deviation \n \n {diabities_df.std()}')

print(f'\n Variance \n \n {diabities_df.var()}')

print(f'\n Maximum \n \n {diabities_df.max()}')

print(f'\n Minimum \n \n {diabities_df.min()}')

***1.2. Mean, Median, Mode, Standard Deviation, Max, Min in a NumPy Array***

In [None]:
import numpy as np
import statistics as st

# Create a numpy array
data = np.array([12, 15, 20, 25, 30, 30, 35, 40, 45, 50])

# Mean
mean = np.mean(data)

# Median
median = np.median(data)

# Mode (most frequent value)
mode_result = st.mode(data)

# Standard Deviation (np.std uses the population formula, ddof=0, by default)
std_dev = np.std(data)

# Maximum
maximum = np.max(data)

# Minimum
minimum = np.min(data)

print("Mean:", mean)
print("Median:", median)
print("Mode:", mode_result)
print("Standard Deviation:", std_dev)
print("Maximum:", maximum)
print("Minimum:", minimum)
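
One detail worth checking when comparing these NumPy results with the Pandas results in 1.1: `np.std` uses the population formula by default (`ddof=0`), whereas `DataFrame.std()` uses the sample formula (`ddof=1`). The minimal sketch below uses only the standard-library `statistics` module on the same values to illustrate the difference. The variable names are illustrative.

```python
import statistics as st

# Same values as the array in the cell above
values = [12, 15, 20, 25, 30, 30, 35, 40, 45, 50]

# Population standard deviation: divides by n (the np.std default)
population_std = st.pstdev(values)

# Sample standard deviation: divides by n - 1 (the pandas .std() default)
sample_std = st.stdev(values)

print("Population std:", round(population_std, 4))
print("Sample std:", round(sample_std, 4))
```

The sample value is always slightly larger than the population value for the same data, so the two libraries will not agree unless `ddof` is set explicitly.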

***1.3. Variance, Quantiles, and Percentiles using `var()` and `quantile()` (`describe()` also reports quartiles)***

In [14]:
import pandas as pd
from IPython.display import display

# Read the CSV file from the directory
diabities_df = pd.read_csv("/workspaces/ImplementingStatisticsWithPython/data/chapter1/diabetes.csv")

# Variance
variance = diabities_df.var()

# Quantiles (25th, 50th, and 75th percentiles)
quantiles = diabities_df.quantile([0.25, 0.5, 0.75])

# Percentiles (90th and 95th percentiles)
percentiles = diabities_df.quantile([0.9, 0.95])

display("Variance:", variance)
display("Quantiles:", quantiles)
display("Percentiles:", percentiles)

'Variance:'

Pregnancies                    11.354056
Glucose                      1022.248314
BloodPressure                 374.647271
SkinThickness                 254.473245
Insulin                     13281.180078
BMI                            62.159984
DiabetesPedigreeFunction        0.109779
Age                           138.303046
Outcome                         0.227483
dtype: float64

'Quantiles:'

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0.25 | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 0.5 | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 0.75 | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |


'Percentiles:'

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0.9 | 9.0 | 167.0 | 88.0 | 40.0 | 210.0 | 41.5 | 0.8786 | 51.0 | 1.0 |
| 0.95 | 10.0 | 181.0 | 90.0 | 44.0 | 293.0 | 44.395 | 1.13285 | 58.0 | 1.0 |
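
The heading of 1.3 also mentions `describe()`, which the cell above does not call. As a rough sketch, `describe()` reports count, mean, standard deviation, minimum, maximum, and selected percentiles in one call (it reports standard deviation rather than variance). The small DataFrame below is made up for illustration and is not the diabetes dataset.

```python
import pandas as pd

# Hypothetical values for illustration only (not taken from diabetes.csv)
sample_df = pd.DataFrame({
    "Glucose": [85, 99, 117, 140, 167],
    "BMI": [23.3, 27.3, 32.0, 36.6, 41.5],
})

# Request the quartiles plus the 90th and 95th percentiles
summary = sample_df.describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.95])

print(summary)
```

The same call on `diabities_df` would produce one summary column per numeric column of the dataset.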


**Tutorial 2. Data Normalisation, Standardization, Transformation**

***2.1. Data Normalization on a Numpy array***

In [18]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Structured data (2D array)
structured_data = np.array([[1, 2], [3, 4], [5, 6]])
scaler = MinMaxScaler()
normalized_structured = scaler.fit_transform(structured_data)

print("Normalized Structured Data:")
print(normalized_structured)

Normalized Structured Data:
[[0.  0. ]
 [0.5 0.5]
 [1.  1. ]]
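
To make the result above easier to verify, here is a small sketch of the min-max formula that `MinMaxScaler` applies to each column with its default settings, `(x - min) / (max - min)`, written with plain NumPy. The variable names `col_min`, `col_max`, and `manual_normalized` are illustrative.

```python
import numpy as np

# Same 2D array as in the cell above
structured_data = np.array([[1, 2], [3, 4], [5, 6]])

# Column-wise minimum and maximum
col_min = structured_data.min(axis=0)
col_max = structured_data.max(axis=0)

# Min-max formula: rescale each column to the range [0, 1]
manual_normalized = (structured_data - col_min) / (col_max - col_min)

print(manual_normalized)
```

This sketch does not handle a constant column, where the maximum equals the minimum and the division would be by zero.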


***2.2. Data Normalization on Pandas DataFrame***

In [33]:
import pandas as pd
from IPython.display import display
from sklearn.preprocessing import MinMaxScaler

# Read the CSV file from the directory
diabities_df = pd.read_csv("/workspaces/ImplementingStatisticsWithPython/data/chapter1/diabetes.csv")

# Specify columns to normalize
columns_to_normalize = ['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age','Outcome']

scaler = MinMaxScaler()
diabities_df[columns_to_normalize] = scaler.fit_transform(diabities_df[columns_to_normalize])

print("Normalized Structured Data:")
diabities_df

Normalized Structured Data:


| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.352941 | 0.743719 | 0.590164 | 0.353535 | 0.000000 | 0.500745 | 0.234415 | 0.483333 | 1.0 |
| 1 | 0.058824 | 0.427136 | 0.540984 | 0.292929 | 0.000000 | 0.396423 | 0.116567 | 0.166667 | 0.0 |
| 2 | 0.470588 | 0.919598 | 0.524590 | 0.000000 | 0.000000 | 0.347243 | 0.253629 | 0.183333 | 1.0 |
| 3 | 0.058824 | 0.447236 | 0.540984 | 0.232323 | 0.111111 | 0.418778 | 0.038002 | 0.000000 | 0.0 |
| 4 | 0.000000 | 0.688442 | 0.327869 | 0.353535 | 0.198582 | 0.642325 | 0.943638 | 0.200000 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 0.588235 | 0.507538 | 0.622951 | 0.484848 | 0.212766 | 0.490313 | 0.039710 | 0.700000 | 0.0 |
| 764 | 0.117647 | 0.613065 | 0.573770 | 0.272727 | 0.000000 | 0.548435 | 0.111870 | 0.100000 | 0.0 |
| 765 | 0.294118 | 0.608040 | 0.590164 | 0.232323 | 0.132388 | 0.390462 | 0.071307 | 0.150000 | 0.0 |
| 766 | 0.058824 | 0.633166 | 0.491803 | 0.000000 | 0.000000 | 0.448584 | 0.115713 | 0.433333 | 1.0 |
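
The title of Tutorial 2 also names standardization, which the cells above do not demonstrate. As a hedged sketch, z-score standardization rescales each column to mean 0 and standard deviation 1 via `(x - mean) / std`. The DataFrame below is hypothetical, and the choice of `ddof=0` (population standard deviation) is an assumption made for this example.

```python
import pandas as pd

# Hypothetical values for illustration only (not taken from diabetes.csv)
sample_df = pd.DataFrame({
    "Glucose": [85.0, 99.0, 117.0, 140.0, 167.0],
    "Age": [21.0, 24.0, 29.0, 41.0, 58.0],
})

# Z-score standardization: subtract the column mean, divide by the column std
standardized_df = (sample_df - sample_df.mean()) / sample_df.std(ddof=0)

print(standardized_df.round(3))
```

Unlike min-max normalization, the standardized values are not confined to the range 0 to 1. They express how many standard deviations each value lies from its column mean.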


For unstructured data such as text, normalization may involve natural language processing steps like converting to lowercase, removing punctuation, and handling special characters such as extra whitespace, as shown in ***Tutorial 2.3.***
<br>
For image or audio data, it may involve rescaling pixel values or extracting features.
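
As a small illustration of rescaling pixel values, the sketch below maps 8-bit intensities from the range 0 to 255 onto the range 0 to 1. The tiny array stands in for a grayscale image and is invented for this example.

```python
import numpy as np

# Hypothetical 2x3 grayscale image with 8-bit pixel intensities (0-255)
pixels = np.array([[0, 128, 255], [64, 192, 32]], dtype=np.uint8)

# Convert to float first, then divide by the maximum possible intensity
scaled_pixels = pixels.astype(np.float32) / 255.0

print(scaled_pixels)
```

Converting to a float type before dividing keeps the result from being truncated back to integers if it is later stored in the original array's dtype.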

***2.3. Converting to lowercase, removing punctuation, and handling special characters like whitespace in unstructured text data***

In [36]:
import re

def normalize_text(text):
    # Convert to lowercase
    text = text.lower()
    
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Sample unstructured text data
unstructured_text = "This is an a text for book Implementing Stat with Python, with! various punctuation marks..."

normalized_text = normalize_text(unstructured_text)
print("Original Text:", unstructured_text)
print("Normalized Text:", normalized_text)

Original Text: This is an a text for book Implementing Stat with Python, with! various punctuation marks...
Normalized Text: this is an a text for book implementing stat with python with various punctuation marks


**Tutorial 3. Data Binning, Grouping and Encoding**

**Tutorial 4. Missing Data, Detecting and Treating Outliers**

**Tutorial 5. Histograms, Box plots, Scatter plots, Pie Charts, Bar Charts, X-Y Plots, Heatmaps**