# Imports and Justifications

- **pandas as pd**
  Used for data manipulation and analysis. It provides powerful structures like DataFrames to load, explore, and prepare tabular datasets.

- **from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV**
  - `StratifiedShuffleSplit`: generates repeated random train/test splits that preserve the class proportions of the target, which is important in classification problems with imbalanced classes.
  - `GridSearchCV`: evaluates every combination in a hyperparameter grid with cross-validation and keeps the best-scoring one.

- **from sklearn.impute import IterativeImputer, SimpleImputer**
  - `IterativeImputer`: performs multivariate imputation by modeling each feature with missing values as a function of the other features. Suited to numerical data. It is still experimental in scikit-learn, so `from sklearn.experimental import enable_iterative_imputer` has to be imported before it.
  - `SimpleImputer`: fills missing values with a single per-column statistic (e.g., the most frequent value). Useful for categorical data.

- **from sklearn.preprocessing import RobustScaler, OneHotEncoder**
  - `RobustScaler`: centers each numerical feature on its median and scales it by its interquartile range, so outliers have much less influence than with mean/variance scaling.
  - `OneHotEncoder`: converts categorical variables into binary indicator columns, allowing models to process nominal categories.

- **from sklearn.pipeline import Pipeline**
  Enables chaining multiple preprocessing and modeling steps into a single workflow, ensuring transformations are applied consistently.

- **from ucimlrepo import fetch_ucirepo**
  Provides direct access to datasets from the UCI Machine Learning Repository, a common source for benchmark datasets.

- **from sklearn.linear_model import LogisticRegression**
  A widely used linear model for classification tasks. It serves as a strong baseline and interpretable classifier.

- **from sklearn.feature_selection import SelectKBest, f_classif**
  - `SelectKBest`: selects the top k features based on a scoring function.
  - `f_classif`: ANOVA F-test statistic, useful for measuring the relationship between numerical features and categorical targets.

- **from sklearn.compose import ColumnTransformer**
  Allows applying different preprocessing pipelines to different subsets of columns (e.g., numerical vs categorical), making preprocessing flexible and organized.
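
To make the role of `SelectKBest` and `f_classif` more concrete, here is a minimal sketch on synthetic data (the arrays and variable names are invented for illustration and are not part of the Automobile dataset). One column is built to shift with the class label while the other three are pure noise, so the F-test should score the first column far higher.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200

# Synthetic binary target and four candidate features (all hypothetical).
y_demo = rng.integers(0, 2, size=n)
informative = y_demo + rng.normal(0, 0.3, size=n)  # mean shifts with the class
noise = rng.normal(0, 1, size=(n, 3))              # unrelated to the class
X_demo = np.column_stack([informative, noise])

# Keep only the single column with the highest ANOVA F-score.
selector = SelectKBest(score_func=f_classif, k=1).fit(X_demo, y_demo)
selected_mask = selector.get_support()

print("F-scores:", selector.scores_.round(1))
print("Kept columns:", selected_mask)
```

The boolean mask returned by `get_support()` marks which columns survive the selection step.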

In [32]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import StratifiedShuffleSplit
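# IterativeImputer is experimental: this enabling import must come first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer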
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from ucimlrepo import fetch_ucirepo
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
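
As a small, self-contained illustration of the multivariate imputation described above, the sketch below fills one missing value in a toy array where the second column is exactly twice the first. The array is made up for this example, and the `enable_iterative_imputer` import is included on the assumption that `IterativeImputer` is still experimental in the installed scikit-learn version.

```python
import numpy as np
# The experimental flag has to be imported before IterativeImputer itself.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data: column 1 is 2 * column 0, with one value missing.
X_toy = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [5.0, 10.0],
])

imputer = IterativeImputer(max_iter=10, random_state=42)
X_filled = imputer.fit_transform(X_toy)

# The missing entry is estimated from the relationship between the columns.
print(X_filled[3])
```

Because the estimate is regressed from the other column, it should follow the linear trend rather than fall back to a column-wide mean, which is what a `SimpleImputer(strategy='mean')` would produce.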




# Data Separation

- **automobile = fetch_ucirepo(id=10)**
  Loads the Automobile dataset directly from the UCI Machine Learning Repository.

- **X = pd.DataFrame(automobile.data.features)**
  Creates a DataFrame containing all independent variables (features).

- **y = automobile.data.targets.iloc[:, 0]**
  Extracts the target variable (the column we want to predict).

- **numerical_features = X.select_dtypes(include='number').columns.tolist()**
  Automatically selects all columns with numeric data types (int, float).

- **categorical_features = X.select_dtypes(include='object').columns.tolist()**
  Automatically selects all columns with the `object` dtype, which is where string-valued categorical data usually ends up in a DataFrame.

This separation is essential because numerical and categorical features require different preprocessing steps before modeling.


In [21]:
automobile = fetch_ucirepo(id=10)

X = pd.DataFrame(automobile.data.features)
y = automobile.data.targets.iloc[:, 0]

numerical_features = X.select_dtypes(include='number').columns.tolist()
categorical_features = X.select_dtypes(include='object').columns.tolist()
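
The dtype-based split can be checked on a tiny DataFrame. This is only a sketch: the column names and values below are invented, and the string columns are given an explicit `object` dtype so the behavior does not depend on the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table with two numeric and two string columns.
toy = pd.DataFrame({
    "horsepower": [111.0, 154.0, np.nan],
    "num_doors": [2, 4, 4],
    "fuel_type": pd.Series(["gas", "diesel", np.nan], dtype=object),
    "body_style": pd.Series(["sedan", "hatchback", "sedan"], dtype=object),
})

toy_numerical = toy.select_dtypes(include="number").columns.tolist()
toy_categorical = toy.select_dtypes(include="object").columns.tolist()

print(toy_numerical)    # numeric columns (int and float)
print(toy_categorical)  # object-dtype columns
```

Missing values do not change the result: a float column with `NaN` is still numeric, and an object column with `NaN` is still selected as categorical.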


# Preprocessing Pipelines

- **numerical_transformer**
  A pipeline for numerical features:
  - `IterativeImputer`: fills missing values by modeling each feature as a function of the others.
  - `RobustScaler`: scales values while reducing the influence of outliers.

- **categorical_transformer**
  A pipeline for categorical features:
  - `SimpleImputer(strategy='most_frequent')`: replaces missing values with the most frequent category.
  - `OneHotEncoder(handle_unknown='ignore')`: converts categories into binary indicator columns. A category that was not seen during fitting is encoded as all zeros instead of raising an error.

- **preprocessing (ColumnTransformer)**
  Combines both pipelines:
  - Applies `numerical_transformer` to all numerical columns.
  - Applies `categorical_transformer` to all categorical columns.
  This ensures that each type of feature is preprocessed correctly before modeling.
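
To show what "reducing the influence of outliers" means in practice, the following sketch compares `RobustScaler` with a manual computation on a made-up column containing one extreme value. With its default settings the scaler subtracts the median and divides by the interquartile range.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up column with one obvious outlier (100).
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(values)

# Manual equivalent: (x - median) / (Q3 - Q1)
median = np.median(values)
iqr = np.percentile(values, 75) - np.percentile(values, 25)
manual = (values - median) / iqr

print(scaled.ravel())  # [-1.  -0.5  0.   0.5 48.5]
```

The outlier stays extreme after scaling, but it does not distort the center or spread used for the other four values, which is the point of using the median and IQR.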


In [27]:
numerical_transformer = Pipeline([
    ('imputer', IterativeImputer(max_iter=10, random_state=42)),
    ('scaler', RobustScaler())])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))])

preprocessing = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)])


# Modeling Pipeline with GridSearchCV

- **Pipeline**
  Combines all steps into a single workflow:
  - **preprocess**: applies a `ColumnTransformer` that handles numerical and categorical features.
    - Numerical: imputation with `IterativeImputer` + scaling with `RobustScaler`.
    - Categorical: imputation with `SimpleImputer` + encoding with `OneHotEncoder`.
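  - **feature_selection (SelectKBest)**: keeps the 15 highest-scoring columns of the preprocessed matrix (scaled numerical features plus one-hot indicator columns) according to the ANOVA F-test (`f_classif`).
  - **classifier (LogisticRegression)**: trains a logistic regression model with L2 regularization. The values `C=1.0`, `solver='lbfgs'`, and `max_iter=2000` set here act only as defaults, since the grid search overrides all three.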

- **StratifiedShuffleSplit**
  Defines the cross-validation strategy: 5 independent random splits, each holding out 20% of the data for validation while preserving class proportions.

- **param_grid**
  Specifies the hyperparameters of logistic regression to be optimized:
  - `classifier__C`: inverse of the regularization strength (smaller values mean stronger regularization).
  - `classifier__solver`: optimization algorithms (`lbfgs`, `liblinear`).
  - `classifier__max_iter`: maximum number of iterations.

- **GridSearchCV**
  Performs an exhaustive search over all hyperparameter combinations defined in `param_grid`, using stratified cross-validation.
  Evaluates each configuration with accuracy as the scoring metric.
  Returns the best parameters (`best_params_`) and the best average score (`best_score_`).

- **Results**
  Prints the best hyperparameters found and the corresponding mean validation accuracy across the 5 splits. Because the same splits were used to choose the hyperparameters, this score is an optimistic estimate of performance on unseen data.
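
The class-preserving behavior of `StratifiedShuffleSplit` can be verified on a made-up imbalanced target. In this sketch the labels are 80% class 0 and 20% class 1, and the feature matrix is a placeholder because only the labels affect the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced target: 80 samples of class 0, 20 of class 1.
y_imbalanced = np.array([0] * 80 + [1] * 20)
X_dummy = np.zeros((len(y_imbalanced), 1))  # features are irrelevant here

splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

# Share of class 1 in each held-out set.
test_minority_shares = []
for train_idx, test_idx in splitter.split(X_dummy, y_imbalanced):
    test_minority_shares.append(y_imbalanced[test_idx].mean())

print(test_minority_shares)  # [0.2, 0.2, 0.2, 0.2, 0.2]
```

Unlike k-fold cross-validation, the held-out sets are drawn independently, so the same sample can appear in more than one of them.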


In [31]:
pipeline = Pipeline(steps=[
    ('preprocess', preprocessing),
    ('feature_selection', SelectKBest(score_func=f_classif, k=15)),
    ('classifier', LogisticRegression(
        penalty='l2',
        C=1.0,
        solver='lbfgs',
        max_iter=2000))])

splitStratified = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)


param_grid = {
    'classifier__C': [0.01, 0.1, 1, 10],
    'classifier__solver': ['lbfgs', 'liblinear'],
    'classifier__max_iter': [500, 1000]
}

grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=splitStratified,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X, y)

print("The best params:", grid_search.best_params_)
print("The best average score:", grid_search.best_score_)


The best params: {'classifier__C': 10, 'classifier__max_iter': 500, 'classifier__solver': 'lbfgs'}
The best average score: 0.7609756097560976
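
As a rough guide to the cost of this search, the number of candidate configurations can be counted with `ParameterGrid`, which enumerates the same combinations that `GridSearchCV` evaluates. The sketch below repeats the grid from the cell above, and `n_splits = 5` mirrors the `StratifiedShuffleSplit` setting.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "classifier__C": [0.01, 0.1, 1, 10],
    "classifier__solver": ["lbfgs", "liblinear"],
    "classifier__max_iter": [500, 1000],
}

n_candidates = len(ParameterGrid(param_grid))  # 4 * 2 * 2 combinations
n_splits = 5
total_fits = n_candidates * n_splits

print(n_candidates, total_fits)  # 16 80
```

With the default `refit=True`, `GridSearchCV` additionally fits the best configuration once more on the full dataset after the search finishes.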
