## Data Preprocessing and Quality Assurance

We will focus on:

1. Data Hygiene: Handling missing values and "cleaning" sensor noise.

2. Exploratory Data Analysis (EDA): Visualizing data to find correlations and outliers.

3. Feature Scaling: Normalizing data so large values (like Pressure) don't overpower small values (like Vacuum).


### 1. Setup

In [None]:
 !pip install pandas matplotlib seaborn numpy scikit-learn


### 2. Loading the Libraries and Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer

sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

print("✅ Libraries imported successfully!")

# Set this to the location of the dataset CSV file before running the cell
raw_path = ""

try:
    df = pd.read_csv(raw_path)
    print("✅ Dataset loaded successfully!")
    print(f"Data Shape: {df.shape} (Rows, Columns)")
except Exception as e:
    print(f"❌ Error loading data: {e}")


### 3. Data Hygiene
Real-world sensor data often has "gaps" (missing values) or "noise" (random spikes).
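
The imputation steps below deal with the gaps. For the "noise" side, one common option is a rolling median filter. The snippet is only a sketch: the readings, the window size of 3 and the threshold of 50 are assumptions chosen for illustration, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical pressure readings with one injected spike at index 3
readings = pd.Series([1010.0, 1011.0, 1009.0, 1200.0, 1010.0, 1012.0, 1011.0])

# Centered rolling median over 3 samples; min_periods=1 keeps the edges defined
smoothed = readings.rolling(window=3, center=True, min_periods=1).median()

# Flag samples that deviate from the local median by more than a chosen threshold
threshold = 50.0  # assumed tolerance, in the same units as the readings
is_spike = (readings - smoothed).abs() > threshold

# Replace flagged spikes with the local median and leave other samples untouched
cleaned = readings.where(~is_spike, smoothed)

print(pd.DataFrame({"raw": readings, "spike": is_spike, "cleaned": cleaned}))
```

Replacing only the flagged samples, rather than smoothing the whole series, keeps the genuine readings unchanged.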

#### 3.1 Handling Missing Data (Imputation)


In [None]:
# 1. Visualize the specific rows that are broken
print("Rows with missing sensor data (Sample):")
# This filters to show only rows where at least one value is NaN
rows_with_nan = df[df.isnull().any(axis=1)]
display(rows_with_nan.head())
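
Before choosing an imputation strategy, it can help to see how many values are missing in each column. The sketch below uses a small made-up frame with the same column names (`AT`, `V`, `AP`, `RH`, `PE`); the values are hypothetical, and on the real data `sample` would be `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample using the CCPP column names; the values are made up
sample = pd.DataFrame({
    "AT": [14.96, np.nan, 5.11, 20.86],
    "V": [41.76, 62.96, np.nan, 57.32],
    "AP": [1024.07, 1020.04, 1012.16, np.nan],
    "RH": [73.17, 59.08, 92.14, 76.64],
    "PE": [463.26, 444.37, 488.56, 446.48],
})

missing_count = sample.isnull().sum()       # NaN count per column
missing_pct = sample.isnull().mean() * 100  # share of rows affected, in percent

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```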




In [None]:
# 2. Define the Imputer
# strategy='mean' fills each gap with the mean of that gap's own column
imputer = SimpleImputer(strategy='mean')

# 3. Apply the Imputer
# fit_transform() calculates the mean and fills the gaps in one step
df_clean_array = imputer.fit_transform(df)

# 4. Convert back to DataFrame
# (Scikit-Learn returns a plain array, so we add column names back)
df_clean = pd.DataFrame(df_clean_array, columns=df.columns)

print("\n" + "="*40)
print("🛠️ REPAIR COMPLETE 🛠️")
print("="*40)

# Check if any NaN remain
print("Missing values after imputation:")
print(df_clean.isnull().sum())

# Update our main variable 'df' to use the clean data
df = df_clean
print(df_clean.head())
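
To see what the imputer actually fills in, the sketch below fits `SimpleImputer` on a tiny hypothetical frame and reads the learned fill values from its `statistics_` attribute. It also compares `strategy="mean"` with `strategy="median"`, since the mean is pulled toward extreme readings. The numbers are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical readings; 60.0 and 1030.0 are exaggerated on purpose
toy = pd.DataFrame({
    "V": [40.0, 42.0, np.nan, 44.0, 60.0],
    "AP": [1010.0, np.nan, 1012.0, 1014.0, 1030.0],
})

mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")

toy_mean = pd.DataFrame(mean_imputer.fit_transform(toy), columns=toy.columns)
toy_median = pd.DataFrame(median_imputer.fit_transform(toy), columns=toy.columns)

# statistics_ holds the per-column fill value learned during fit
print("Mean fill values:  ", mean_imputer.statistics_)
print("Median fill values:", median_imputer.statistics_)
```

If a column contains occasional extreme readings, the median may be the more representative fill value.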

### 4. Feature Scaling


Many machine learning models, especially those trained by gradient descent or based on distances, perform better when all inputs are on a similar scale. In this dataset the features have very different magnitudes:

Ambient Pressure (AP): ~1000

Exhaust Vacuum (V): ~50

Without scaling, the model might treat Pressure as "20 times more important" than Vacuum just because the number is bigger.
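
One place where the magnitude gap shows up concretely is gradient-based training. For a linear model, the gradient with respect to each weight is proportional to that feature's raw value. The sketch below assumes a single made-up sample and a bias-free linear model starting from zero weights. It illustrates the effect and is not the training procedure of any particular library.

```python
import numpy as np

# One hypothetical sample: [AP, V], with a made-up target value
x = np.array([1000.0, 50.0])
y_true = 450.0

w = np.zeros(2)         # linear model without bias, weights start at zero
error = w @ x - y_true  # prediction error for this sample

# Gradient of 0.5 * error**2 with respect to each weight is error * x_j
grad = error * x
ratio = grad[0] / grad[1]

print("Gradient per weight:", grad)
print("AP gradient / V gradient:", ratio)
```

The update for the Pressure weight is larger than the one for Vacuum by the same factor as the raw values, which is the imbalance that scaling removes.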

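#### 4.1 Standard Scaler

StandardScaler standardizes each feature by subtracting its mean and dividing by its standard deviation, so every feature ends up with zero mean and unit variance. MinMaxScaler, shown alongside it for comparison, rescales each feature to the range 0 to 1.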

In [None]:

# Select features (Inputs) only, dropping the Target (PE)
X = df.drop(columns=['PE'])
y = df['PE']


# Initialize the Standard Scaler
scaler = StandardScaler()

# Apply Standard Scaling
X_scaled_array = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled_array, columns=X.columns)


# Initialize the MinMax Scaler
minmax_scaler = MinMaxScaler()

# Apply MinMax Scaling
X_minmax_array = minmax_scaler.fit_transform(X)
X_minmax = pd.DataFrame(X_minmax_array, columns=X.columns)

# Visual Proof
print("Original Data (First 3 rows):")
display(X.head(3))

print("\nStandard Scaled Data (First 3 rows) - Centered around 0:")
display(X_scaled.head(3))

print("\nMinMax Scaled Data (First 3 rows) - Values between 0 and 1:")
display(X_minmax.head(3))

# Verify the math
print(f"\nMean of Standard Scaled Temperature: {X_scaled['AT'].mean():.2f} (Should be ~0)")
print(f"Std Dev of Standard Scaled Temperature: {X_scaled['AT'].std():.2f} (Should be ~1)")

print(f"\nMin of MinMax Scaled Temperature: {X_minmax['AT'].min():.2f} (Should be 0)")
print(f"Max of MinMax Scaled Temperature: {X_minmax['AT'].max():.2f} (Should be 1)")
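
Both scalers apply simple per-column formulas: standardization computes `(x - mean) / std`, and min-max scaling computes `(x - min) / (max - min)`. The sketch below reproduces them by hand on a hypothetical single-feature array and compares the result with scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical temperature readings as a single-feature matrix
x = np.array([[5.0], [10.0], [15.0], [20.0], [30.0]])

# Standardization: z = (x - mean) / std, using the population std (ddof=0)
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

# Min-max scaling: (x - min) / (max - min)
mm_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

z_sklearn = StandardScaler().fit_transform(x)
mm_sklearn = MinMaxScaler().fit_transform(x)

print("Standardization matches:", np.allclose(z_manual, z_sklearn))
print("Min-max scaling matches:", np.allclose(mm_manual, mm_sklearn))
```

StandardScaler divides by the population standard deviation, while pandas `.std()` defaults to the sample standard deviation (`ddof=1`). On a small dataset, a pandas check of the scaled column can therefore land slightly above 1.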

### 5. Exploratory Data Analysis (EDA)


Now that the data is clean, we look for patterns.

Correlation Matrix: Does Temperature affect Power Output? Does Humidity?

Outlier Detection: Are there any physical impossibilities in the data?

In [None]:

# Calculate the correlation matrix
corr = df.corr()

# Plotting the Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f", linewidths=1)
plt.title("Correlation Heatmap: CCPP Features")
plt.show()

# Interpretation:
# 1. Look at the 'PE' (net electrical energy output) row.
# 2. 'AT' (Temperature) is -0.93 (Strong Negative).
#    Result: As Temperature rises, Power Output drops significantly.
# 3. 'RH' (Humidity) is +0.38 (Weak Positive).
#    Result: Humidity has a smaller effect on power output.
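
Reading one row of a heatmap by eye gets harder as the number of features grows. A small sketch of pulling out the `PE` column of the correlation matrix and ranking features by absolute correlation follows. The data is synthetic and the coefficients are invented, so the printed values will not match the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the plant data; the coefficients are invented
at = rng.uniform(5, 35, n)
rh = rng.uniform(30, 100, n)
pe = 500 - 2.0 * at + 0.3 * rh + rng.normal(0, 3, n)
demo = pd.DataFrame({"AT": at, "RH": rh, "PE": pe})

# Correlation of every feature with the target, strongest first by magnitude
pe_corr = demo.corr()["PE"].drop("PE")
ranked = pe_corr.reindex(pe_corr.abs().sort_values(ascending=False).index)

print(ranked)
```

On the real data, `demo` would be replaced with `df`.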

In [None]:

# Let's analyze Ambient Pressure (AP)
plt.figure(figsize=(12, 5))

# 1. Histogram (Distribution)
plt.subplot(1, 2, 1)
sns.histplot(df['AP'], kde=True, color='green')
plt.title("Distribution of Ambient Pressure (AP)")

# 2. Boxplot (Outliers)
plt.subplot(1, 2, 2)
sns.boxplot(x=df['AP'], color='lightgreen')
plt.title("Boxplot of Ambient Pressure (AP)")

plt.tight_layout()
plt.show()

# 3. Scatter plot of AP against observation index
plt.figure(figsize=(10, 4))
plt.scatter(range(len(df)), df['AP'], alpha=0.6)
plt.title("Ambient Pressure (AP) – Scatter Plot")
plt.xlabel("Observation Index")
plt.ylabel("Ambient Pressure (AP)")
plt.show()


# Interpretation:
# If you see dots outside the "whiskers" of the boxplot, those are statistical outliers.
# In sensor data, outliers could mean a malfunction or an extreme weather event.
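
The boxplot "whiskers" conventionally extend to 1.5 times the interquartile range (IQR) beyond the quartiles, which is also seaborn's default. The same rule can be applied numerically to list the outlying readings. The values below are hypothetical, with two extremes planted on purpose.

```python
import pandas as pd

# Hypothetical pressure readings; 1045.0 and 980.0 are planted as extremes
ap = pd.Series([1010.0, 1012.0, 1013.0, 1011.0, 1014.0, 1009.0, 1012.0, 1045.0, 980.0])

q1 = ap.quantile(0.25)
q3 = ap.quantile(0.75)
iqr = q3 - q1

# Whisker limits used by the conventional 1.5 * IQR boxplot rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = ap[(ap < lower) | (ap > upper)]

print(f"Bounds: [{lower:.2f}, {upper:.2f}]")
print(outliers)
```

A flagged value is only a statistical outlier. Whether it reflects a faulty sensor or a real event still needs a domain check.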

### 6. Conclusion

In this session, we have:

1. Loaded a raw sensor dataset.

2. Imputed missing values so that no rows had to be dropped.

3. Scaled the features to prepare for model training.

4. Visualized the physics of the plant (Temperature vs Power).
