IDSAI_2024_lecture3_DEMO2 -------------------------------- 01000110.01001010 ---- revised: Aug2024_F.Jalalypour

# Feature scaling

# Example 1
Different feature scaling methods
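
For reference, the three methods compared in the cell below can be written as follows. The notation is an assumption introduced here, not taken from the lecture: $x_i$ is one value of a feature, $\mu$ its mean, and $\sigma$ its standard deviation (computed with `np.std`, i.e. dividing by $n$).

$$
x_i^{\text{absmax}} = \frac{x_i}{\max_j |x_j|}, \qquad
x_i^{\text{minmax}} = \frac{x_i - \min_j x_j}{\max_j x_j - \min_j x_j}, \qquad
z_i = \frac{x_i - \mu}{\sigma}
$$

Absolute maximum scaling maps the values into $[-1, 1]$, min-max scaling maps them into $[0, 1]$, and standardization yields values with mean 0 and standard deviation 1.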

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Generating input data
np.random.seed(0)  # For reproducibility
feature_1 = np.random.randint(-200, 200, size=20)
feature_2 = np.random.randint(-100, 150, size=20)

# Absolute Maximum Scaling
feature_1_abs_max = feature_1 / np.max(np.abs(feature_1))
feature_2_abs_max = feature_2 / np.max(np.abs(feature_2))

# Min-Max Scaling (Normalization)
feature_1_min_max = (feature_1 - np.min(feature_1)) / (np.max(feature_1) - np.min(feature_1))
feature_2_min_max = (feature_2 - np.min(feature_2)) / (np.max(feature_2) - np.min(feature_2))

# Standardization (Z-score Normalization)
feature_1_mean, feature_2_mean = np.mean(feature_1), np.mean(feature_2)
feature_1_std, feature_2_std = np.std(feature_1), np.std(feature_2)

feature_1_zscore = (feature_1 - feature_1_mean) / feature_1_std
feature_2_zscore = (feature_2 - feature_2_mean) / feature_2_std

# Plotting: one panel per scaling method, laid out on a 2x2 grid
fig, axs = plt.subplots(2, 2, figsize=(14, 14))

# (title, x values, y values, color, axis-label prefix) for each panel
panels = [
    ('Original Data', feature_1, feature_2, 'blue', ''),
    ('Absolute Maximum Scaling', feature_1_abs_max, feature_2_abs_max, 'green', 'Scaled '),
    ('Min-Max Scaling (Normalization)', feature_1_min_max, feature_2_min_max, 'orange', 'Scaled '),
    ('Standardization (Z-score Normalization)', feature_1_zscore, feature_2_zscore, 'red', 'Scaled '),
]

# axs.flat iterates row by row: [0, 0], [0, 1], [1, 0], [1, 1]
for ax, (title, x, y, color, prefix) in zip(axs.flat, panels):
    ax.scatter(x, y, color=color)
    ax.set_title(title)
    ax.set_xlabel(f'{prefix}Feature 1')
    ax.set_ylabel(f'{prefix}Feature 2')
    ax.grid(True)
    # Draw the coordinate axes through the origin
    ax.axhline(0, color='black', lw=1.5)
    ax.axvline(0, color='black', lw=1.5)

plt.tight_layout()
plt.show()


In [None]:
import numpy as np

# Assuming feature_1_zscore is already calculated as shown previously
# Calculate the mean and standard deviation of feature_1_zscore
mean_feature_1_zscore = np.mean(feature_1_zscore)
std_feature_1_zscore = np.std(feature_1_zscore)

print(f"Mean of feature_1_zscore: {mean_feature_1_zscore}")
print(f"Standard Deviation of feature_1_zscore: {std_feature_1_zscore}")
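
The same kind of check can be run for all three scalers at once. The following is a minimal sketch that regenerates `feature_1` with the seed used above and prints the range, mean, and standard deviation of each scaled version; the variable names `abs_max`, `min_max`, and `z` are chosen here for illustration.

```python
import numpy as np

# Same synthetic data as in the first cell (same seed and range)
np.random.seed(0)
feature_1 = np.random.randint(-200, 200, size=20)

# The three scalings from Example 1
abs_max = feature_1 / np.max(np.abs(feature_1))
min_max = (feature_1 - np.min(feature_1)) / (np.max(feature_1) - np.min(feature_1))
z = (feature_1 - np.mean(feature_1)) / np.std(feature_1)

for name, values in [("abs-max", abs_max), ("min-max", min_max), ("z-score", z)]:
    print(f"{name:8s} min={values.min():.3f} max={values.max():.3f} "
          f"mean={values.mean():.3f} std={values.std():.3f}")
```

Only the z-score version is guaranteed to have mean 0 and standard deviation 1 (up to floating-point rounding); the other two fix the range instead.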


# Example 2

## Z-score calculation

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('penguins_size.csv')
# Drop any rows with missing values
df_clear = df.dropna()
# Define the numeric columns we want to focus on
numeric_columns = ['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g']
# Display the extracted columns
df_clear[numeric_columns]

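We can compute the z-score manually from the mean and the standard deviation. Passing `ddof=1` applies Bessel's correction, which divides by *n* - 1 instead of *n*. This is already the pandas default for `var` and `std`, and the correction makes little difference when the number of observations is large.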
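
To make the effect of `ddof` concrete, here is a small sketch on made-up numbers (not the penguin data). It compares the two conventions and shows that they differ by the factor sqrt(n / (n - 1)), which approaches 1 as *n* grows.

```python
import numpy as np
import pandas as pd

# Small made-up sample, chosen only to make the difference visible
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(values)

std_population = values.std(ddof=0)  # divides by n
std_sample = values.std(ddof=1)      # divides by n - 1 (the pandas default)

print(f"ddof=0: {std_population:.4f}")
print(f"ddof=1: {std_sample:.4f}")
print(f"ratio:  {std_sample / std_population:.4f}  (sqrt(n / (n - 1)) = {np.sqrt(n / (n - 1)):.4f})")

# NumPy uses the opposite default: np.std divides by n unless ddof is given
print(f"np.std default: {np.std(values.to_numpy()):.4f}")
```

Note the differing defaults: pandas uses `ddof=1`, while NumPy uses `ddof=0`, so it is safest to pass `ddof` explicitly when comparing results.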

# Calculate mean for each column

In [None]:
mean = df_clear[numeric_columns].mean()
mean

# Calculate sample variance for each column

In [None]:
variance_values = df_clear[numeric_columns].var()
variance_values

# Calculate standard deviation for each column

In [None]:
SD = df_clear[numeric_columns].std(ddof=1)
SD

# Calculate z-scores for each value in the numeric columns

### Manually calculating z-scores

In [None]:
z_scores_manual = (df_clear[numeric_columns] - df_clear[numeric_columns].mean()) / df_clear[numeric_columns].std(ddof=1)
print(z_scores_manual)

SciPy provides `scipy.stats.zscore`, which does the same. By default it does not apply Bessel's correction, but we can enable it with `ddof=1`. We can apply the function to every column of the dataframe with `apply`. If the dataframe contained non-numeric columns, we would need to select the numeric columns first.

In [None]:
from scipy.stats import zscore
# Apply zscore column by column; ddof=1 is forwarded to zscore
z_scores = df_clear[numeric_columns].apply(zscore, ddof=1)
z_scores
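
When the numeric column names are not known in advance, one option is to select them by dtype. The sketch below uses a small hypothetical dataframe `toy` with made-up values (a stand-in for the penguin data, so it runs without the CSV file) and `DataFrame.select_dtypes` to drop the text column before applying `zscore`.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical stand-in for the penguin data: one text column, two numeric columns
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "flipper_length_mm": [181.0, 186.0, 217.0, 230.0],
    "body_mass_g": [3750.0, 3800.0, 5000.0, 5700.0],
})

# Keep only the numeric columns; zscore cannot handle the strings in "species"
numeric = toy.select_dtypes(include="number")

# Standardize each remaining column, using Bessel's correction
toy_z = numeric.apply(zscore, ddof=1)
print(toy_z)
```

Each column of `toy_z` then has mean 0 and sample standard deviation 1.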

### Calculating z-scores with scikit-learn

Scikit-learn provides `StandardScaler`, which performs the same standardization but without Bessel's correction (it divides by *n*). The difference is negligible if *n* is large.

In [None]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
ss.fit_transform(df_clear[numeric_columns].to_numpy())
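
As a quick sanity check, the sketch below compares `StandardScaler` with the manual formula on a small made-up array `X` (illustrative values, not the penguin data). It shows that the scaler matches the `ddof=0` result, and that the `ddof=1` result differs only by the constant factor sqrt((n - 1) / n).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up measurements (rows = observations, columns = features)
X = np.array([[181.0, 3750.0],
              [186.0, 3800.0],
              [217.0, 5000.0],
              [230.0, 5700.0]])
n = X.shape[0]

scaled = StandardScaler().fit_transform(X)

# Manual z-scores with and without Bessel's correction
manual_ddof0 = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)
manual_ddof1 = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print("matches ddof=0:", np.allclose(scaled, manual_ddof0))
print("matches ddof=1 after rescaling:",
      np.allclose(manual_ddof1, scaled * np.sqrt((n - 1) / n)))
```

With only four rows the factor is noticeable (about 0.87); for a few hundred rows it is very close to 1.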

`StandardScaler` is particularly useful if we need to apply the same transformation to other data points (e.g., a separate test set), because we want to reuse the scale determined from the training set. Calling `fit` or `fit_transform` stores the mean and standard deviation in the scaler object. Later calls to `transform` (*without* `fit`) then reuse those stored values.
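
A minimal sketch of this workflow, assuming synthetic data in place of a real train/test split (the arrays `X_train` and `X_test` are generated here purely for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real train/test split
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
X_test = rng.normal(loc=50.0, scale=10.0, size=(20, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean_ and scale_ from the training data
X_test_scaled = scaler.transform(X_test)        # reuses them; the scaler is not refitted

print("stored mean: ", scaler.mean_)
print("stored scale:", scaler.scale_)
print("test mean after scaling:", X_test_scaled.mean(axis=0))
```

The scaled test set will generally not have mean exactly 0 or standard deviation exactly 1, because it is standardized with the training statistics. That is intended: the test data must not influence the transformation.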

Thank you for your attention :) 