# Phase 2: Data Summarization and Preprocessing

This notebook follows the Phase 2 instructions and is ready to run. Place your raw dataset file, named **`Raw_dataset.csv`**, in the same directory as this notebook, or update the `DATA_PATH` variable below to point to the file's location in your repo.

Sections included:

1. Load dataset
2. Data overview and statistical summaries (five-number summary, etc.)
3. Missing values analysis
4. Variable distributions and plots (histograms, boxplots, bar plots)
5. Class label distribution plot
6. Outlier detection
7. Preprocessing (missing value treatment, encoding, scaling, feature selection)
8. Save preprocessed dataset

Notes:
- Do not modify the original file; the processed output will be saved as `Preprocessed_dataset.csv`.
- Run each cell in order. If your dataset file has a different name or is inside a folder, change `DATA_PATH` accordingly.


In [None]:
# 1) Imports and load dataset
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set display options
pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 200)

# Path to your raw dataset (change if needed)
DATA_PATH = 'Raw_dataset.csv'  # <-- change if your file name/path differs

# Try loading dataset
if not os.path.exists(DATA_PATH):
    print(f"Warning: {DATA_PATH} not found in notebook directory.\nPlease put your Raw_dataset.csv next to this notebook or update DATA_PATH.")
else:
    df = pd.read_csv(DATA_PATH)
    print('Loaded dataset with shape:', df.shape)
    display(df.head(5))


## 2) Data overview and statistical summaries

Compute the number of instances and attributes, list the data types, and provide a five-number summary (min, Q1, median, Q3, max) for each numeric attribute.

In [None]:
# Number of instances and attributes, datatypes
try:
    print('Number of records (instances):', df.shape[0])
    print('Number of attributes (columns):', df.shape[1])
    print('\nColumn datatypes:')
    display(df.dtypes)
except NameError:
    print('Dataset not loaded. Run the load cell and ensure DATA_PATH is correct.')


In [None]:
# Five-number summary (min, Q1, median, Q3, max) for numeric columns
try:
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    print('Numeric columns detected:', numeric_cols)
    display(df[numeric_cols].describe().T[['min','25%','50%','75%','max']].rename(columns={'25%':'Q1','50%':'median','75%':'Q3'}))
except NameError:
    pass


## 3) Missing values analysis

Show total and percent of missing values per column and a simple strategy recommendation.

In [None]:
# Missing values table
try:
    miss = df.isnull().sum().to_frame('missing_count')
    miss['missing_pct'] = miss['missing_count'] / df.shape[0] * 100
    display(miss.sort_values('missing_pct', ascending=False))
except NameError:
    pass
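
The table above reports counts and percentages only. As a minimal sketch of how a per-column recommendation could be derived from it, the snippet below maps each column to a suggested treatment. The helper name `recommend_strategy` and the toy data are hypothetical; the 60% drop threshold simply mirrors the one used in the preprocessing cell later in this notebook and should be adjusted to your dataset.

```python
import numpy as np
import pandas as pd


def recommend_strategy(frame, drop_threshold_pct=60.0):
    """Suggest a missing-value treatment per column (illustrative thresholds)."""
    rows = []
    for col in frame.columns:
        pct = frame[col].isnull().mean() * 100
        if pct == 0:
            suggestion = 'none needed'
        elif pct > drop_threshold_pct:
            suggestion = 'consider dropping column'
        elif pd.api.types.is_numeric_dtype(frame[col]):
            suggestion = 'impute with median'
        else:
            suggestion = 'impute with most frequent value'
        rows.append((col, pct, suggestion))
    table = pd.DataFrame(rows, columns=['column', 'missing_pct', 'suggestion'])
    return table.set_index('column')


# Toy data for illustration only (not the course dataset)
toy = pd.DataFrame({
    'age': [25, np.nan, 31, 40],
    'city': ['A', None, 'B', 'A'],
    'mostly_empty': [np.nan, np.nan, np.nan, 1.0],
    'complete': [1, 2, 3, 4],
})
rec = recommend_strategy(toy)
print(rec)
```

To try it on the real data, call `recommend_strategy(df)` after the load cell has run.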


## 4) Variable distributions & plots

At least 3 different plotting types: histogram (numeric), boxplot (numeric/outliers), bar plot (categorical).

In [None]:
# Distribution plots: histograms and boxplots (numeric columns), bar plots (categorical columns)
try:
    import math
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    cat_cols = df.select_dtypes(exclude=[np.number]).columns.tolist()
    print('Numeric columns:', numeric_cols)
    print('Categorical columns:', cat_cols)

    # Histograms for up to 6 numeric columns
    n = min(6, len(numeric_cols))
    if n>0:
        cols = numeric_cols[:n]
        df[cols].hist(bins=15, figsize=(12, 3*n))
        plt.suptitle('Histograms (first numeric columns)')
        plt.show()

    # Boxplots for numeric columns (first 6)
    if n>0:
        fig, axes = plt.subplots(n, 1, figsize=(10, 4*n))
        if n==1:
            axes = [axes]
        for ax, col in zip(axes, cols):
            df.boxplot(column=col, ax=ax)
            ax.set_title(f'Boxplot - {col}')
        plt.tight_layout()
        plt.show()

    # Bar plots for top categorical columns (show value counts)
    m = min(4, len(cat_cols))
    if m>0:
        for col in cat_cols[:m]:
            vc = df[col].value_counts(dropna=False).nlargest(10)
            vc.plot(kind='bar', figsize=(8,4))
            plt.title(f'Value counts for {col} (top 10)')
            plt.ylabel('Count')
            plt.show()
    else:
        print('No categorical columns detected for bar plots.')
except NameError:
    print('Dataset not loaded. Run the load cell first.')


## 5) Class label distribution

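If your dataset has a class/target column, set the `TARGET_COL` variable to its name and visualize its distribution. If `TARGET_COL` is left as `None`, the cell tries a few common column names.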

In [None]:
# Class label distribution - update TARGET_COL if necessary
TARGET_COL = None  # <-- e.g. 'target' or 'class'. Set to column name if available.

try:
    if TARGET_COL is None:
        # try to guess a likely target column (common names)
        for guess in ['target','class','label','y','Outcome','outcome','grade']:
            if guess in df.columns:
                TARGET_COL = guess
                print('Auto-detected target column:', TARGET_COL)
                break

    if TARGET_COL is not None and TARGET_COL in df.columns:
        vc = df[TARGET_COL].value_counts(dropna=False)
        display(vc.to_frame('count'))
        vc.plot(kind='bar', figsize=(6,4))
        plt.title(f'Class distribution: {TARGET_COL}')
        plt.ylabel('Count')
        plt.show()
    else:
        print('No target column specified or detected. Set TARGET_COL variable to your class column name.')
except NameError:
    pass
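
Alongside the bar plot, it can help to quantify how balanced the classes are. The snippet below is a small sketch on made-up labels; the variable names and the choice of a majority-to-minority ratio as the summary are assumptions for illustration, not part of the Phase 2 requirements.

```python
import pandas as pd

# Toy labels for illustration only
labels = pd.Series(['pass', 'pass', 'pass', 'fail', 'pass', 'fail', 'pass', 'pass'])

counts = labels.value_counts()                      # absolute frequency per class
proportions = labels.value_counts(normalize=True)   # relative frequency per class
imbalance_ratio = counts.max() / counts.min()       # majority count / minority count

print(pd.DataFrame({'count': counts, 'proportion': proportions}))
print('Majority-to-minority ratio:', imbalance_ratio)
```

With the real data, replace `labels` with `df[TARGET_COL]`. A ratio close to 1 indicates balanced classes; larger values indicate imbalance.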


## 6) Outlier detection (IQR method)

Detect outliers using interquartile range for numeric features and report counts.

In [None]:
try:
    outlier_summary = []
    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - 1.5*IQR
        upper = Q3 + 1.5*IQR
        out_count = ((df[col] < lower) | (df[col] > upper)).sum()
        outlier_summary.append((col, int(out_count)))
    out_df = pd.DataFrame(outlier_summary, columns=['column','outlier_count']).sort_values('outlier_count', ascending=False)
    display(out_df)
except NameError:
    pass
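
To make the IQR rule concrete, here is a worked example on five toy values. It uses the same fences as the cell above (Q1 - 1.5·IQR and Q3 + 1.5·IQR). The final clipping step is only one possible treatment, shown as an assumption; the notebook itself only counts outliers.

```python
import pandas as pd

# Toy values for illustration only; 100 is an obvious outlier
values = pd.Series([1, 2, 3, 4, 100])

q1 = values.quantile(0.25)   # 2.0
q3 = values.quantile(0.75)   # 4.0
iqr = q3 - q1                # 2.0
lower = q1 - 1.5 * iqr       # -1.0
upper = q3 + 1.5 * iqr       # 7.0

is_outlier = (values < lower) | (values > upper)
print('Fences:', lower, upper)
print('Outliers:', values[is_outlier].tolist())  # [100]

# Optional treatment: clip values to the fences instead of dropping rows
clipped = values.clip(lower=lower, upper=upper)
print('Clipped:', clipped.tolist())
```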


## 7) Preprocessing

Apply at least three preprocessing tasks (not just removing attributes or splitting dataset). Examples included below: missing value imputation, categorical encoding, scaling/normalization, and feature selection.


In [None]:
# Preprocessing pipeline (example). Modify as needed for your dataset.
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif

# Make a copy to avoid modifying original dataframe
try:
    df_proc = df.copy()
except NameError:
    df_proc = None
    print('Dataset not loaded.')

# 1) Missing value handling recommendations & example (median for numeric, most_frequent for categorical)
try:
    num_cols = df_proc.select_dtypes(include=[np.number]).columns.tolist()
    cat_cols = df_proc.select_dtypes(exclude=[np.number]).columns.tolist()
    print('Numeric columns:', num_cols)
    print('Categorical columns:', cat_cols)
except Exception as e:
    print('Error detecting columns:', e)

# Example transformers. Defining the function is safe without data;
# call it only after the dataset is loaded and num_cols/cat_cols are set.
from sklearn.pipeline import Pipeline  # required by build_preprocessor

def build_preprocessor(use_scaler='standard', encode_type='onehot'):
    """Return a ColumnTransformer for the num_cols and cat_cols detected above."""
    # numeric pipeline: median imputation, then optional scaling
    numeric_pipeline = [('imputer', SimpleImputer(strategy='median'))]
    if use_scaler == 'standard':
        numeric_pipeline.append(('scaler', StandardScaler()))
    elif use_scaler == 'minmax':
        numeric_pipeline.append(('scaler', MinMaxScaler()))
    # categorical pipeline: most-frequent imputation, then encoding
    cat_pipeline = [('imputer', SimpleImputer(strategy='most_frequent'))]
    if encode_type == 'onehot':
        # The `sparse` keyword was renamed to `sparse_output` in newer scikit-learn
        # releases, so it is omitted here; sparse_threshold=0 below forces dense output.
        cat_pipeline.append(('encoder', OneHotEncoder(handle_unknown='ignore')))
    else:
        cat_pipeline.append(('encoder', OrdinalEncoder()))
    transformers = []
    if len(num_cols) > 0:
        transformers.append(('num', Pipeline(steps=numeric_pipeline), num_cols))
    if len(cat_cols) > 0:
        transformers.append(('cat', Pipeline(steps=cat_pipeline), cat_cols))
    col_transformer = ColumnTransformer(transformers=transformers, remainder='drop', sparse_threshold=0)
    return col_transformer

print('\nPreprocessing instructions:')
print('- Build a ColumnTransformer with numeric imputation + scaling and categorical imputation + encoding.')
print('- Optionally apply feature selection with SelectKBest (if you have a labeled TARGET_COL).')


In [None]:
# Example: drop mostly-missing columns, impute the remaining ones, and save the preprocessed dataset
try:
    if 'df_proc' in globals() and df_proc is not None:
        # Simple example: drop columns with > 60% missing values, then impute remaining
        thresh = 0.6
        to_drop = miss[miss['missing_pct'] > thresh*100].index.tolist()
        print('Dropping columns with > 60% missing:', to_drop)
        df_proc = df_proc.drop(columns=to_drop, errors='ignore')

        # Impute numeric and categorical using the simple strategies from above
        from sklearn.impute import SimpleImputer
        for c in df_proc.select_dtypes(include=[np.number]).columns:
            if df_proc[c].isnull().any():
                # ravel() flattens the (n, 1) result so it is assigned as a 1-D column
                df_proc[c] = SimpleImputer(strategy='median').fit_transform(df_proc[[c]]).ravel()
        for c in df_proc.select_dtypes(exclude=[np.number]).columns:
            if df_proc[c].isnull().any():
                df_proc[c] = SimpleImputer(strategy='most_frequent').fit_transform(df_proc[[c]]).ravel()

        print('After imputation, missing values per column:')
        display(df_proc.isnull().sum().to_frame('missing_count'))

        # Save a snapshot of raw (already available) and preprocessed dataset
        out_raw = 'Snapshot_raw_first5rows.csv'
        df.head(5).to_csv(out_raw, index=False)
        out_pre = 'Preprocessed_dataset.csv'
        df_proc.to_csv(out_pre, index=False)
        print('\nSaved snapshot of raw data to', out_raw)
        print('Saved preprocessed dataset to', out_pre)
    else:
        print('No dataset to preprocess. Load dataset first.')
except NameError:
    print('Dataset not loaded or error in preprocessing.')
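
Feature selection with `SelectKBest` is listed as an optional preprocessing task but is not run in the cells above. The snippet below is a minimal sketch on synthetic data, assuming a classification target and purely numeric features with no missing values (that is, applied after imputation and encoding). The column names and `k=1` are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data for illustration only: 'signal' tracks the label, the others are noise
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = pd.DataFrame({
    'signal': y + rng.normal(scale=0.1, size=100),
    'noise_a': rng.normal(size=100),
    'noise_b': rng.normal(size=100),
})

# Score each feature with the ANOVA F-test and keep the k best
selector = SelectKBest(score_func=f_classif, k=1)
selector.fit(X, y)

scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
selected = X.columns[selector.get_support()].tolist()
print(scores)
print('Selected features:', selected)
```

On the real data, `X` would be the preprocessed feature columns and `y` would be `df[TARGET_COL]`; choose `k` based on how many attributes you want to keep.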


## Checklist / Deliverables

- [ ] Notebook `Phase2.ipynb` with analysis and preprocessing steps.
- [ ] Plots showing variable distributions (histograms, boxplots, bar plots).
- [ ] Missing values analysis and handling.
- [ ] Statistical summaries (five-number summary for numeric attributes).
- [ ] Class label distribution plot.
- [ ] Preprocessed dataset exported as `Preprocessed_dataset.csv`.

### How to run
1. Place `Raw_dataset.csv` in the same folder as this notebook or change `DATA_PATH`.
2. Run the notebook top-to-bottom.
3. Review outputs, modify preprocessing choices, and re-run cells as needed.
