<a href="https://colab.research.google.com/github/cdxplora/APIAssignment-7/blob/main/lab01_data_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 1 – Data Exploration & Preprocessing

This notebook is a **starter template** for the exercises.

> **Instructions:**
> - Read the tasks carefully.
> - Fill in the code where indicated by `# TODO` comments.
> - You may add extra cells/experiments, but do not remove the given headings.
> - When done, make sure all cells run **top to bottom** without errors.


## 1. Setup

Install and import the required libraries and data sets.

**Tasks/Steps**
- Ensure you can import `numpy`, `pandas`, `matplotlib`, and `scikit-learn`.
- Configure plots to appear inline.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# TODO: import our data sets here
from sklearn.datasets import fetch_california_housing, load_diabetes

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

%matplotlib inline

print("Libraries and data sets imported.")

Libraries and data sets imported.
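
If one of the imports above fails, it helps to know which packages are actually installed. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` and the `required` list are illustrative assumptions, not part of the template.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names as used by pip (note: "scikit-learn", not "sklearn").
required = ["numpy", "pandas", "matplotlib", "scikit-learn"]
for name, ver in installed_versions(required).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

A missing package can then be installed from a notebook cell with `%pip install <name>`.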


## 2. Load and Inspect the Dataset

Use a tabular dataset such as:
- Housing prices
- Iris / Penguins
- Any reasonably small CSV given by the instructor

**Tasks/Steps**
- Load the dataset into a pandas DataFrame.
- Inspect its shape, columns, and basic statistics.
- Identify potential issues (missing values, data types, etc.).


In [8]:
# TODO: load your dataset here
import urllib.error

try:
    data = fetch_california_housing(as_frame=True)
    df = data.frame.copy()
    target_name = data.target_names[0]
    print("Using California Housing dataset.")
except urllib.error.HTTPError as e:
    print("California Housing download failed. Falling back to Diabetes dataset.", e)
    data = load_diabetes(as_frame=True)
    df = data.frame.copy()
    target_name = "target"

#df = pd.DataFrame()  # replace with actual loading code
df.head()



California Housing download failed. Falling back to Diabetes dataset. HTTP Error 403: Forbidden


| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 | 135.0 |


In [9]:
# TODO: inspect your data set here
df.info()  # info() must be called; it prints dtypes and non-null counts
display(df.describe())

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 |
| mean | -2.511817e-19 | 1.23079e-17 | -2.245564e-16 | -4.79757e-17 | -1.381499e-17 | 3.918434e-17 | -5.777179e-18 | -9.04254e-18 | 9.293722e-17 | 1.130318e-17 | 152.133484 |
| std | 0.04761905 | 0.04761905 | 0.04761905 | 0.04761905 | 0.04761905 | 0.04761905 | 0.04761905 | 0.04761905 | 0.04761905 | 0.04761905 | 77.093005 |
| min | -0.1072256 | -0.04464164 | -0.0902753 | -0.1123988 | -0.1267807 | -0.1156131 | -0.1023071 | -0.0763945 | -0.1260971 | -0.1377672 | 25.0 |
| 25% | -0.03729927 | -0.04464164 | -0.03422907 | -0.03665608 | -0.03424784 | -0.0303584 | -0.03511716 | -0.03949338 | -0.03324559 | -0.03317903 | 87.0 |
| 50% | 0.00538306 | -0.04464164 | -0.007283766 | -0.005670422 | -0.004320866 | -0.003819065 | -0.006584468 | -0.002592262 | -0.001947171 | -0.001077698 | 140.5 |
| 75% | 0.03807591 | 0.05068012 | 0.03124802 | 0.03564379 | 0.02835801 | 0.02984439 | 0.0293115 | 0.03430886 | 0.03243232 | 0.02791705 | 211.5 |
| max | 0.1107267 | 0.05068012 | 0.1705552 | 0.1320436 | 0.1539137 | 0.198788 | 0.1811791 | 0.1852344 | 0.1335973 | 0.1356118 | 346.0 |
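
The inspection steps listed for this section can be sketched as follows. This is a minimal example, assuming the bundled Diabetes dataset (`load_diabetes` ships with scikit-learn and needs no download); `df_demo`, `n_missing`, and `feature_sq_sums` are illustrative names, not part of the template.

```python
from sklearn.datasets import load_diabetes

# Assumption: use the bundled Diabetes data so the sketch runs offline.
df_demo = load_diabetes(as_frame=True).frame

print(df_demo.shape)   # (rows, columns)
print(df_demo.dtypes)  # data type of each column
df_demo.info()         # info is a method, so it needs parentheses

# Total count of missing cells across the whole frame.
n_missing = int(df_demo.isna().sum().sum())
print("Total missing values:", n_missing)

# The feature columns come pre-centered and pre-scaled: each column's
# sum of squares is 1, which explains the identical std values.
feature_sq_sums = (df_demo.drop(columns="target") ** 2).sum()
print(feature_sq_sums.round(6))
```

The identical standard deviations in the summary statistics are therefore a property of how the dataset is distributed, not a sign of a data problem.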


## 3. Handling Missing Values & Encoding

**Tasks/Steps**
- Detect missing values.
- Decide on imputation strategies for numeric and categorical features.
- Encode categorical features using one-hot encoding.


In [10]:
# TODO: detect missing values
df.isna().sum()

| | 0 |
|---|---|
| age | 0 |
| sex | 0 |
| bmi | 0 |
| bp | 0 |
| s1 | 0 |
| s2 | 0 |
| s3 | 0 |
| s4 | 0 |
| s5 | 0 |
| s6 | 0 |
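
Because this dataset has no missing values, the imputation strategies never visibly act on it. A small sketch on a made-up frame shows what they would do; the `toy` DataFrame and its column names are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "height": [1.0, np.nan, 3.0],
    "color": pd.Series(["red", np.nan, "red"], dtype=object),
})
print(toy.isna().sum())  # one missing value in each column

# Numeric column: replace NaN with the column mean.
num_imputer = SimpleImputer(strategy="mean")
height_filled = num_imputer.fit_transform(toy[["height"]])

# Categorical column: replace NaN with the most frequent category.
cat_imputer = SimpleImputer(strategy="most_frequent")
color_filled = cat_imputer.fit_transform(toy[["color"]])

print(height_filled.ravel())
print(color_filled.ravel())
```

Mean imputation is sensitive to outliers; `strategy="median"` is a common alternative for skewed numeric columns.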


In [11]:

X = df.drop(columns=[target_name])
y = df[target_name]

# Identify numeric and categorical columns

numeric_features = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X.select_dtypes(exclude=[np.number]).columns.tolist()

numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
print(X)
print(y)
print(preprocessor)

          age       sex       bmi        bp        s1        s2        s3  \
0    0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401   
1   -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412   
2    0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356   
3   -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038   
4    0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142   
..        ...       ...       ...       ...       ...       ...       ...   
437  0.041708  0.050680  0.019662  0.059744 -0.005697 -0.002566 -0.028674   
438 -0.005515  0.050680 -0.015906 -0.067642  0.049341  0.079165 -0.028674   
439  0.041708  0.050680 -0.015906  0.017293 -0.037344 -0.013840 -0.024993   
440 -0.045472 -0.044642  0.039062  0.001215  0.016318  0.015283 -0.028674   
441 -0.045472 -0.044642 -0.073030 -0.081413  0.083740  0.027809  0.173816   

           s4        s5        s6  
0   -0.002592  0.019907 -0.017646  
1  
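
With the Diabetes fallback every column is numeric, so `categorical_features` is empty and the one-hot branch does nothing. The sketch below applies the same kind of preprocessor to a made-up frame that has one categorical column; `toy`, `toy_preprocessor`, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical column.
toy = pd.DataFrame({
    "size": [1.0, 2.0, 3.0, 4.0],
    "city": pd.Series(["A", "B", "A", "C"], dtype=object),
})

numeric_cols = toy.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = toy.select_dtypes(exclude=[np.number]).columns.tolist()

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="mean")),
            ("scaler", StandardScaler()),
        ]), numeric_cols),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ]
)

encoded = toy_preprocessor.fit_transform(toy)
# One scaled numeric column plus one indicator column per city (A, B, C).
print(encoded.shape)
```

Each distinct category becomes its own 0/1 column, so high-cardinality features can widen the matrix considerably.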

## 4. Train/Test Split & Preprocessing Pipeline

**Tasks/Steps**
- Split the data into train and test sets.
- Fit the preprocessing pipeline on the training data.
- Transform both train and test features.


In [12]:
# TODO: choose target column

target_column = target_name  # e.g., 'price'

# Perform train/test split
if target_column in df.columns:
    X = df.drop(columns=[target_column])
    y = df[target_column]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Fit and transform
    X_train_processed = preprocessor.fit_transform(X_train)
    X_test_processed = preprocessor.transform(X_test)

    print("Preprocessing complete.")
else:
    print("Please set the target_column correctly.")

Preprocessing complete.
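
Fitting on the training data only means the scaler's statistics come from the training rows, so the test set stays unseen. The sketch below checks this, assuming the bundled Diabetes dataset and a plain `StandardScaler` in place of the full preprocessor; all `*_demo`, `X_tr`, and `X_te` names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Diabetes data, same split settings as the lab cell.
X_demo, y_demo = load_diabetes(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Learn mean and std from the training rows only.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuses the training statistics

print(X_tr.shape, X_te.shape)
# Training columns are centered almost exactly at zero ...
print(np.abs(X_tr_scaled.mean(axis=0)).max())
# ... while test columns are only approximately centered.
print(np.abs(X_te_scaled.mean(axis=0)).max())
```

Calling `fit_transform` on the test set instead would leak test-set statistics into preprocessing and make later evaluation look better than it should.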


## 5. Reflection

Write a short reflection:
- What kind of data did we use? Describe it briefly.
- Were there any major data quality issues?
- Which preprocessing steps do you think were most important?
- How might poor preprocessing affect model performance later?
- How did we split the data into train and test sets?

*(You can answer in text here or in a separate report.)*
