
# Machine Learning Pipelines

### Detailed Explanation of Machine Learning Pipelines

A machine learning pipeline is a series of steps that data goes through, from its raw form to a trained and deployed model. It's a structured way to manage the entire machine learning workflow, ensuring consistency, reusability, and maintainability. Each step in the pipeline typically performs a specific, well-defined task.

#### Key Stages/Components of a Typical ML Pipeline:

1.  **Data Ingestion/Collection**: This is the very first step where raw data is gathered from various sources (databases, APIs, files, sensors, etc.). It involves connecting to data sources and loading the data into a usable format, often a DataFrame.

2.  **Data Cleaning/Preprocessing**: Raw data is rarely perfect. This stage involves:
    *   **Handling Missing Values**: Imputing (filling in) missing data using strategies like mean, median, mode, or more advanced methods, or removing rows/columns with too many missing values.
    *   **Handling Outliers**: Detecting and managing extreme values that can skew model training.
    *   **Data Type Conversion**: Ensuring features are in the correct data types (e.g., converting strings to numerical, parsing dates).
    *   **Removing Duplicates**: Identifying and eliminating redundant entries.
    *   **Text Preprocessing (for NLP)**: Tokenization, stemming, lemmatization, stop-word removal, etc.

3.  **Feature Engineering**: This is often the most creative and impactful stage. It involves transforming raw data into features that better represent the underlying problem to the machine learning model. This can include:
    *   **Creating New Features**: Combining existing features (e.g., `age * income`), extracting information from timestamps (e.g., `day_of_week`, `month`), or using domain knowledge.
    *   **Encoding Categorical Variables**: Converting categorical text or numerical labels into a numerical format that models can understand (e.g., One-Hot Encoding, Label Encoding).
    *   **Scaling/Normalization**: Bringing features to a similar scale to prevent features with larger values from dominating the learning process (e.g., `StandardScaler`, `MinMaxScaler`).
    *   **Dimensionality Reduction**: Reducing the number of features while retaining important information (e.g., PCA; t-SNE is used mainly for visualization rather than as a preprocessing step).

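4.  **Data Splitting**: Dividing the dataset into training, validation, and testing sets. This is crucial for evaluating the model's performance on unseen data and for detecting overfitting. Any step that learns parameters from the data (imputation, scaling, encoding) should be fit on the training set only, so in practice the split is made before those transformers are fit.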

5.  **Model Training**: The core machine learning algorithm is applied to the training data. The model learns patterns and relationships from the features to make predictions or classifications.

6.  **Model Evaluation**: Assessing the trained model's performance using appropriate metrics (e.g., accuracy, precision, recall, F1-score for classification; R-squared, RMSE for regression) on the validation set. This step helps in hyperparameter tuning and model selection.

7.  **Hyperparameter Tuning**: Optimizing the hyperparameters of the model to achieve the best possible performance. This often involves techniques like Grid Search, Random Search, or Bayesian Optimization.

8.  **Model Deployment (Productionization)**: Once the model is finalized and evaluated, it's integrated into an application or system where it can make predictions on new, real-time data.

9.  **Monitoring and Maintenance**: After deployment, the model's performance needs to be continuously monitored for degradation (e.g., concept drift, data drift) and retrained or updated as needed.

#### Why Use Pipelines?

*   **Consistency**: Ensures that all data transformations are applied uniformly across training, validation, and test datasets.
*   **Prevents Data Leakage**: Crucially, it prevents information from the test set (or validation set) from leaking into the training process. For example, if you scale your entire dataset *before* splitting, the scaling parameters (mean and standard deviation) would be influenced by the test set, leading to an over-optimistic evaluation.
*   **Reproducibility**: Makes your entire ML workflow reproducible. Anyone can take your pipeline, run it, and get the same results (given the same data and random seeds).
*   **Modularity and Reusability**: Each step is a module, making it easy to swap out components (e.g., try a different imputer or scaler) or reuse parts of the pipeline in other projects.
*   **Simplified Hyperparameter Tuning**: Tools like `scikit-learn`'s `GridSearchCV` or `RandomizedSearchCV` can optimize hyperparameters for *all* steps in the pipeline simultaneously, not just the final model.
*   **Improved Code Organization**: Keeps your machine learning code clean, organized, and easier to understand.
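
The hyperparameter-tuning point can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`; the names `tuning_pipe`, `X_demo`, `y_demo` and the grid values are assumptions chosen for the example, not settings from this notebook. Parameters of any step are addressed with the `<step name>__<parameter>` convention.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to make the sketch runnable.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

tuning_pipe = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])

# One grid covers a preprocessing step and the final estimator together.
param_grid = {
    'imputer__strategy': ['mean', 'median'],
    'model__C': [0.1, 1.0, 10.0],
}

# Each CV fold refits the whole pipeline, so preprocessing never sees the held-out fold.
search = GridSearchCV(tuning_pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Because the search refits the full pipeline inside every fold, the imputer and scaler are tuned and evaluated without leaking validation data.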

#### Example with `scikit-learn`'s `Pipeline`

`scikit-learn`'s `Pipeline` class provides a concise way to build these sequences. It treats all transformations and the final estimator as a single object, so you can `fit` and `predict` on the entire sequence with one call.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')), # Step 1: Handle missing values
    ('scaler', StandardScaler()),                 # Step 2: Scale features
    ('classifier', RandomForestClassifier())      # Step 3: Train a classifier
])

# When you call pipeline.fit(X_train, y_train):
# 1. imputer.fit(X_train) is called, then X_train is transformed.
# 2. scaler.fit() is called on the output of imputer.transform(), then that data is transformed.
# 3. classifier.fit() is called on the output of scaler.transform() and y_train.

# When you call pipeline.predict(X_test):
# 1. X_test is transformed by the *fitted* imputer.
# 2. The result is transformed by the *fitted* scaler.
# 3. The result is fed to the *fitted* classifier to make predictions.
```

This demonstrates how the `Pipeline` object intelligently handles the `fit` and `transform` calls for each step.
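
To make the `predict` sequence concrete, the sketch below replays it by hand with the fitted steps and compares the result with `pipeline.predict`. The arrays are made up for illustration, and `LogisticRegression` is swapped in for the random forest only to keep the example deterministic; `demo_pipe` and the other names are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small made-up arrays with one missing value in each split.
X_tr = np.array([[25.0, 50000.0], [30.0, 60000.0], [np.nan, 55000.0],
                 [40.0, 52000.0], [35.0, 65000.0], [28.0, 48000.0]])
y_tr = np.array([0, 1, 0, 1, 1, 0])
X_te = np.array([[32.0, np.nan], [22.0, 47000.0]])

demo_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])
demo_pipe.fit(X_tr, y_tr)

# Replay predict() manually: transform with each fitted step, then predict.
step1 = demo_pipe.named_steps['imputer'].transform(X_te)
step2 = demo_pipe.named_steps['scaler'].transform(step1)
manual_pred = demo_pipe.named_steps['classifier'].predict(step2)

# The manual chain and the one-call version should agree.
print(np.array_equal(manual_pred, demo_pipe.predict(X_te)))
```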

# Simple Example (Imputation + Scaling + Logistic Regression)

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Sample dataset
data = {
    'Age': [25, 30, None, 40, 35],
    'Salary': [50000, 60000, 55000, None, 65000],
    'Purchased': [0, 1, 0, 1, 1]
}

df = pd.DataFrame(data)

X = df[['Age', 'Salary']]
y = df['Purchased']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

pipe.fit(X_train, y_train)

print("Accuracy:", pipe.score(X_test, y_test))


Accuracy: 1.0


# Feature Engineering with Categorical + Numerical

In [None]:
data = {
    'Age': [25, 30, 35, 40, None],
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Female'],
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Pune', 'Mumbai'],
    'Salary': [50000, 60000, 65000, 70000, 62000],
    'Purchased': [0, 1, 1, 0, 1]
}

df = pd.DataFrame(data)

X = df.drop('Purchased', axis=1)
y = df['Purchased']


ColumnTransformer (Heart of Feature Engineering Pipelines)

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer


In [None]:
num_features = ['Age', 'Salary']
cat_features = ['Gender', 'City']

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', sparse_output=False))
])

preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features)
])


In [None]:
from sklearn.linear_model import LogisticRegression

final_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('model', LogisticRegression())
])

final_pipeline.fit(X, y)

print("Training Accuracy:", final_pipeline.score(X, y))


Training Accuracy: 1.0


# Feature Engineering Pipeline

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression


In [None]:
data = {
    'Age': [22, 25, 30, 35, np.nan, 40, 28, 32],
    'Experience': [1, 2, 5, 7, 4, np.nan, 3, 6],
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Pune', 'Mumbai', 'Delhi', 'Pune', 'Mumbai'],
    'Salary': [25000, 30000, 50000, 70000, 45000, 65000, 40000, 60000],
    'Purchased': [0, 0, 1, 1, 0, 1, 0, 1]
}

df = pd.DataFrame(data)
df


Unnamed: 0,Age,Experience,City,Salary,Purchased
0,22.0,1.0,Delhi,25000,0
1,25.0,2.0,Mumbai,30000,0
2,30.0,5.0,Delhi,50000,1
3,35.0,7.0,Pune,70000,1
4,,4.0,Mumbai,45000,0
5,40.0,,Delhi,65000,1
6,28.0,3.0,Pune,40000,0
7,32.0,6.0,Mumbai,60000,1


In [None]:
X = df.drop('Purchased', axis=1)
y = df['Purchased']


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)


In [None]:
num_features = ['Age', 'Experience', 'Salary']
cat_features = ['City']


In [None]:
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])


#### Why `SimpleImputer` and `StandardScaler` are both needed:

While `SimpleImputer` handles missing values (NaNs), `StandardScaler` serves a completely different, but equally important, purpose:

1.  **`SimpleImputer(strategy='mean')`**: Its primary role is to address missing data. When your dataset has `NaN` values, many machine learning algorithms cannot process them directly and would either raise an error or drop those rows/columns. The imputer fills these gaps (in this case, with the mean of the respective column), making the dataset complete.

2.  **`StandardScaler()`**: This step comes *after* imputation and is responsible for **feature scaling**. Its purpose is to transform your numerical features so they have a mean of 0 and a standard deviation of 1. It does *not* deal with missing values; it assumes the data it receives is complete.

**Why is scaling necessary if NaNs are already handled?**

Many machine learning algorithms are sensitive to the scale of features. For example:
*   **Distance-based algorithms** (like K-Nearest Neighbors, Support Vector Machines, K-Means Clustering) rely on the distance between data points. If one feature has a range of 0-1000 and another has a range of 0-1, the feature with the larger range will dominate the distance calculation.
*   **Gradient Descent-based algorithms** (like Logistic Regression, Neural Networks) converge much faster and more stably when features are on a similar scale.
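*   **Regularization techniques** (L1/L2 regularization) penalize the size of every coefficient in the same way. If features are not scaled, the penalty becomes uneven: a feature measured on a small scale needs a large coefficient to have the same effect and is penalized more heavily, while a feature with large values gets by with a small coefficient and is barely penalized.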

Therefore, first, we `impute` to ensure there are no missing values, and *then* we `scale` the now complete numerical features to bring them to a comparable range. Both steps are crucial for preparing numerical data for many ML models and address different aspects of data quality and suitability.

The `num_pipeline` is a `scikit-learn` Pipeline specifically crafted for handling numerical features. It consists of two sequential steps:

1.  **`('imputer', SimpleImputer(strategy='mean'))`**:
    *   **Purpose**: This step addresses missing values in the numerical features.
    *   **`SimpleImputer`**: A `scikit-learn` transformer that handles missing data.
    *   **`strategy='mean'`**: Specifies that any missing numerical values (`np.nan`) in the columns processed by this imputer will be replaced with the *mean* of the non-missing values in their respective columns. This is a common strategy for numerical data.

2.  **`('scaler', StandardScaler())`**:
    *   **Purpose**: This step scales the numerical features.
    *   **`StandardScaler`**: A `scikit-learn` transformer that standardizes features by removing the mean and scaling to unit variance.
    *   **Why scaling?**: Many machine learning algorithms (e.g., Logistic Regression, SVMs, neural networks) perform better or converge faster when numerical features are on a similar scale. `StandardScaler` transforms the data such that it has a mean of 0 and a standard deviation of 1, making features comparable regardless of their original units or magnitudes.
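
As a small worked check of the two steps, the sketch below runs an imputer-plus-scaler pipeline on a made-up array and inspects the result. `StandardScaler` applies $z = (x - \mu) / \sigma$ per column, with $\mu$ and $\sigma$ learned during `fit`. The values and the names `X_num` and `num_demo` are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Two numeric columns (age-like and salary-like) with one missing value.
X_num = np.array([[22.0, 25000.0],
                  [30.0, 50000.0],
                  [np.nan, 45000.0],
                  [40.0, 65000.0]])

num_demo = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])
X_scaled = num_demo.fit_transform(X_num)

# Column means learned by the imputer (first column uses the 3 known values).
print(num_demo.named_steps['imputer'].statistics_)
# After scaling, each column has mean ~0 and population standard deviation ~1.
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

Despite the very different original ranges, both columns end up on the same scale.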

In [None]:
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', sparse_output=False))
])


The `cat_pipeline` is a `scikit-learn` Pipeline tailored for processing categorical features. It comprises two sequential steps:

1.  **`('imputer', SimpleImputer(strategy='most_frequent'))`**:
    *   **Purpose**: This step addresses any missing values within the categorical features.
    *   **`SimpleImputer`**: A `scikit-learn` transformer used for handling missing data.
    *   **`strategy='most_frequent'`**: This strategy specifies that any `np.nan` values in the categorical columns will be replaced with the *most frequently occurring value* (the mode) in their respective columns. This is a common and effective approach for imputing missing categorical data.

2.  **`('encoder', OneHotEncoder(drop='first', sparse_output=False))`**:
    *   **Purpose**: This step converts categorical textual or numerical data into a numerical format that machine learning models can understand and process.
    *   **`OneHotEncoder`**: A `scikit-learn` transformer that converts categorical variables into a one-hot encoded numerical array.
    *   **`drop='first'`**: This parameter is used to avoid the [dummy variable trap](https://www.statisticssolutions.com/what-is-the-dummy-variable-trap/). When you one-hot encode a categorical feature with `N` categories, it creates `N` new binary features. However, `N-1` features are sufficient to represent all categories, because the dropped category is implied whenever the remaining `N-1` features are all zero. Dropping the first category prevents perfect multicollinearity, which can be an issue for some models (e.g., linear regression).
    *   **`sparse_output=False`**: By default, `OneHotEncoder` outputs a sparse matrix, which is memory-efficient for high-dimensional data. Setting `sparse_output=False` ensures that the output is a dense NumPy array, which is often easier to work with in subsequent pipeline steps or for simpler datasets.

In [None]:
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features)
])


The `preprocessor` object is an instance of `scikit-learn`'s `ColumnTransformer`. This is a powerful tool used to apply different transformations to different columns of your data, allowing for a flexible and robust preprocessing pipeline.

Here's what each part means:

*   **`ColumnTransformer([...])`**: This is the main component. It takes a list of tuples, where each tuple defines a specific transformation to be applied to a subset of columns.

*   **`('num', num_pipeline, num_features)`**:
    *   **`'num'`**: This is an arbitrary name for this transformer. It's good practice to use descriptive names.
    *   **`num_pipeline`**: This refers to the `Pipeline` object we defined earlier for numerical features. It contains a `SimpleImputer` (strategy='mean') for missing values and a `StandardScaler` for scaling.
    *   **`num_features`**: This is a list of column names (e.g., `['Age', 'Experience', 'Salary']`) to which the `num_pipeline` will be applied. The `ColumnTransformer` will select these columns, pass them through the `num_pipeline`, and then output the transformed numerical data.

*   **`('cat', cat_pipeline, cat_features)`**:
    *   **`'cat'`**: The name for the categorical features transformer.
    *   **`cat_pipeline`**: This refers to the `Pipeline` object defined for categorical features. It contains a `SimpleImputer` (strategy='most_frequent') for missing values and a `OneHotEncoder` for converting categorical data into a numerical format.
    *   **`cat_features`**: This is a list of column names (e.g., `['City']`) to which the `cat_pipeline` will be applied. The `ColumnTransformer` will select these columns, pass them through the `cat_pipeline`, and then output the one-hot encoded categorical data.

**How it works:**

When `preprocessor.fit_transform(X_train)` is called:
1.  The `num_pipeline` will be `fit` and `transform`ed on the `num_features` from `X_train`.
2.  The `cat_pipeline` will be `fit` and `transform`ed on the `cat_features` from `X_train`.
3.  The `ColumnTransformer` then concatenates the results of these two transformations into a single, unified feature array, which is then ready to be fed into a machine learning model. This ensures that numerical and categorical features are preprocessed independently but then combined correctly for model training.

In [None]:
final_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('model', LogisticRegression())
])


In [None]:
final_pipeline.fit(X_train, y_train)


In [None]:
train_acc = final_pipeline.score(X_train, y_train)
test_acc = final_pipeline.score(X_test, y_test)

print("Training Accuracy:", train_acc)
print("Testing Accuracy:", test_acc)


Training Accuracy: 1.0
Testing Accuracy: 1.0


In [None]:
import numpy as np
import pandas as pd

In [None]:
df1=pd.read_csv("Titanic.csv")

In [None]:
df1.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
df1.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True, errors='ignore')


In [None]:
df1.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


In [None]:
X = df1.drop('Survived', axis=1)
y = df1['Survived']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=42,
    stratify=y
)


In [None]:
X_train.head(2)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
486,1,female,35.0,1,0,90.0,S
238,2,male,19.0,0,0,10.5,S


In [None]:
y1=pd.DataFrame(y_train)
y1.head(2)

Unnamed: 0,Survived
486,1
238,0


In [None]:
df1.isnull().sum()


Unnamed: 0,0
Survived,0
Pclass,0
Sex,0
Age,177
SibSp,0
Parch,0
Fare,0
Embarked,2


In [None]:
df1.isnull().sum()[df1.isnull().sum() > 0]


Unnamed: 0,0
Age,177
Embarked,2


In [None]:
missing_percent = (df1.isnull().sum() / len(df1)) * 100
missing_percent[missing_percent > 0].sort_values(ascending=False)


Unnamed: 0,0
Age,19.86532
Embarked,0.224467


Applying Imputation

In [None]:
df1[['Age', 'Embarked']].isnull().sum()


Unnamed: 0,0
Age,177
Embarked,2


In [None]:
from sklearn.impute import SimpleImputer

si_age = SimpleImputer(strategy='mean')
si_embarked = SimpleImputer(strategy='most_frequent')

# ---- Fit only on TRAIN ----
X_train_age = si_age.fit_transform(X_train[['Age']])
X_train_embarked = si_embarked.fit_transform(X_train[['Embarked']])

# ---- Transform TEST using same fitted imputers ----
X_test_age = si_age.transform(X_test[['Age']])
X_test_embarked = si_embarked.transform(X_test[['Embarked']])


This code block demonstrates the manual application of `SimpleImputer` to handle missing values in the 'Age' and 'Embarked' columns, with a crucial focus on preventing data leakage.

### Code Breakdown:

1.  **`from sklearn.impute import SimpleImputer`**:
    *   Imports the `SimpleImputer` class from `scikit-learn`, which is used to fill in missing values.

2.  **`si_age = SimpleImputer(strategy='mean')`**:
    *   Creates an instance of `SimpleImputer` specifically for the 'Age' column.
    *   **`strategy='mean'`**: Specifies that any missing values (`NaN`) in the 'Age' column will be replaced with the *mean* of the non-missing values in that column. 'Mean' is a suitable strategy for numerical data.

3.  **`si_embarked = SimpleImputer(strategy='most_frequent')`**:
    *   Creates another instance of `SimpleImputer` specifically for the 'Embarked' column.
    *   **`strategy='most_frequent'`**: Specifies that any missing values (`NaN`) in the 'Embarked' column will be replaced with the *most frequently occurring value* (the mode) in that column. This is a standard strategy for categorical data.

4.  **`X_train_age = si_age.fit_transform(X_train[['Age']])`**:
    *   **`fit_transform` on `X_train` for 'Age'**: The `si_age` imputer *learns* the mean from the 'Age' column of the training data (`X_train`) and then *transforms* that column by filling its missing values with the learned mean. The result is stored in a separate array, `X_train_age`; `X_train` itself is not modified. It's crucial that the imputer only learns from the training data.

5.  **`X_train_embarked = si_embarked.fit_transform(X_train[['Embarked']])`**:
    *   **`fit_transform` on `X_train` for 'Embarked'**: Similarly, the `si_embarked` imputer *learns* the most frequent value from the 'Embarked' column of `X_train` and then *transforms* that column by filling its missing values.

6.  **`X_test_age = si_age.transform(X_test[['Age']])`**:
    *   **`transform` on `X_test` for 'Age'**: Here, notice that only `transform` is called, not `fit_transform`. This is critical for preventing **data leakage**. The `X_test` 'Age' column's missing values are filled using the *mean that was already learned from `X_train`*. The test set's mean is not calculated or used.

7.  **`X_test_embarked = si_embarked.transform(X_test[['Embarked']])`**:
    *   **`transform` on `X_test` for 'Embarked'**: The same logic applies here. The missing values in `X_test`'s 'Embarked' column are filled using the *most frequent value learned from `X_train`*.

### Why Separate `fit` and `transform` (or `fit_transform` then `transform`):

This explicit separation ensures that the imputation parameters (like the mean or most frequent value) are derived *only* from the training data. If the imputer were refit on the test data (for example, by calling `fit_transform` on it), the test set would be filled using its own statistics, so the evaluation would no longer reflect how the model behaves on genuinely unseen data and would tend to look better than it should. This is a fundamental principle in machine learning to maintain the integrity of your model evaluation.
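
A tiny numeric example makes the difference visible. The ages below are made up, and `imp` and `leaky` are hypothetical names; the sketch only contrasts reusing the training mean with refitting on the test split.

```python
import numpy as np
from sklearn.impute import SimpleImputer

train_age = np.array([[20.0], [30.0], [np.nan]])
test_age = np.array([[np.nan], [50.0]])

imp = SimpleImputer(strategy='mean')
imp.fit(train_age)              # learns the mean of the known training values
print(imp.statistics_)          # stored training mean
print(imp.transform(test_age))  # missing test value filled with the training mean

# Refitting on the test split fills the gap with the test split's own mean instead.
leaky = SimpleImputer(strategy='mean').fit_transform(test_age)
print(leaky)
```

The first version fills the missing test age with 25 (from training), the second with 50 (from the test split itself).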

In [None]:
X_train['Embarked'].head(2)

Unnamed: 0,Embarked
486,S
238,S


OneHot Encoding Sex and Embarked

In [None]:
from sklearn.preprocessing import OneHotEncoder

ohe_sex = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ohe_embarked = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

X_train_sex = ohe_sex.fit_transform(X_train[['Sex']])
X_train_embarked = ohe_embarked.fit_transform(X_train[['Embarked']])

X_test_sex = ohe_sex.transform(X_test[['Sex']])
X_test_embarked = ohe_embarked.transform(X_test[['Embarked']])


In [None]:
X_train_sex

array([[1., 0.],
       [0., 1.],
       [0., 1.],
       ...,
       [1., 0.],
       [0., 1.],
       [0., 1.]])

In [None]:
X_train_embarked

array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       ...,
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.]])

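This code block demonstrates the manual application of `OneHotEncoder` for the categorical features 'Sex' and 'Embarked', fitting on the training data and only transforming the test data to prevent data leakage. Note that the encoders are fit on the raw `X_train` columns, not on the imputed arrays from the previous step: `X_train_embarked` is overwritten here, and the missing 'Embarked' values are encoded as a category of their own, which is why the array above has four columns rather than three.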

### Code Breakdown:

1.  **`from sklearn.preprocessing import OneHotEncoder`**:
    *   Imports the `OneHotEncoder` class from `scikit-learn`, a transformer used to convert categorical features into a one-hot numerical array.

2.  **`ohe_sex = OneHotEncoder(sparse_output=False, handle_unknown='ignore')`**:
    *   Creates an instance of `OneHotEncoder` specifically for the 'Sex' column.
    *   **`sparse_output=False`**: By default, `OneHotEncoder` produces a sparse matrix, which is efficient for high-dimensional data. Setting this to `False` makes the output a dense NumPy array, which can be easier to work with for smaller datasets or when combining with other dense arrays.
    *   **`handle_unknown='ignore'`**: This parameter specifies how to handle unknown categories encountered during the `transform` step (e.g., in the test set) that were not present in the `fit` data (from the training set). Setting it to `'ignore'` means that a row with an unknown category is encoded as all zeros across that feature's one-hot columns, rather than raising an error. This is useful for robust deployment.

3.  **`ohe_embarked = OneHotEncoder(sparse_output=False, handle_unknown='ignore')`**:
    *   Creates another instance of `OneHotEncoder` for the 'Embarked' column, with the same configuration as for 'Sex'.

4.  **`X_train_sex = ohe_sex.fit_transform(X_train[['Sex']])`**:
    *   **`fit_transform` on `X_train` for 'Sex'**: The `ohe_sex` encoder *learns* the unique categories present in the 'Sex' column of the training data (`X_train`) and then *transforms* that column into a one-hot encoded numerical array. The encoder only learns from the training data.

5.  **`X_train_embarked = ohe_embarked.fit_transform(X_train[['Embarked']])`**:
    *   **`fit_transform` on `X_train` for 'Embarked'**: Similarly, `ohe_embarked` learns the unique 'Embarked' categories from `X_train` and transforms the column.

6.  **`X_test_sex = ohe_sex.transform(X_test[['Sex']])`**:
    *   **`transform` on `X_test` for 'Sex'**: This is crucial for preventing **data leakage**. The `X_test` 'Sex' column is transformed using the *categories learned from `X_train`*. The test set's categories are not used to fit the encoder again.

7.  **`X_test_embarked = ohe_embarked.transform(X_test[['Embarked']])`**:
    *   **`transform` on `X_test` for 'Embarked'**: The same logic applies here. The `X_test` 'Embarked' column is transformed using the *categories learned from `X_train`*.

### Why Separate `fit` and `transform` (or `fit_transform` then `transform`):

This explicit separation ensures that the encoding scheme (i.e., which categories correspond to which columns in the one-hot encoding) is derived *only* from the training data. If the encoder were refit on the test data, the set and order of output columns could differ from what the model was trained on, so the model would receive misaligned features and the evaluation would be meaningless. This is a fundamental practice in machine learning to maintain the integrity of your model evaluation.

In [None]:
X_train.head(2)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
486,1,female,35.0,1,0,90.0,S
238,2,male,19.0,0,0,10.5,S


In [None]:
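# 'Age' is not dropped here, so the raw column (with NaN) stays next to the imputed X_train_age.
X_train_rem = X_train.drop(columns=['Sex', 'Embarked'])
X_test_rem = X_test.drop(columns=['Sex', 'Embarked'])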

In [None]:
X_train_transformed = np.concatenate(
    (X_train_rem, X_train_age, X_train_sex, X_train_embarked),
    axis=1
)
X_test_transformed = np.concatenate(
    (X_test_rem, X_test_age, X_test_sex, X_test_embarked),
    axis=1
)


In [None]:
X_train_transformed.shape

(668, 12)

In [None]:
X_test_transformed.shape

(223, 12)

In [None]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(X_train_transformed, y_train)


In [None]:
y_pred=clf.predict(X_test_transformed)
y_pred

array([0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0,
       1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 0])

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.7668161434977578

In [None]:
acc = accuracy_score(y_test, y_pred)
print("Test Accuracy:", acc)


Test Accuracy: 0.7668161434977578


In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)


[[112  25]
 [ 27  59]]


In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.81      0.82      0.81       137
           1       0.70      0.69      0.69        86

    accuracy                           0.77       223
   macro avg       0.75      0.75      0.75       223
weighted avg       0.77      0.77      0.77       223
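
The figures in the report can be recomputed by hand from the confusion matrix printed above, which is a useful sanity check on how the metrics are defined. This sketch uses only the four counts shown there; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from above: rows are true classes, columns are predictions.
cm = np.array([[112, 25],
               [27, 59]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # correct predictions / all predictions
precision_1 = tp / (tp + fp)           # of predicted survivors, how many survived
recall_1 = tp / (tp + fn)              # of actual survivors, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

These values agree with the class `1` row of the classification report and with the test accuracy reported earlier.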



In [None]:
train_pred = clf.predict(X_train_transformed)

print("Train Accuracy:", accuracy_score(y_train, train_pred))
print("Test Accuracy :", accuracy_score(y_test, y_pred))


Train Accuracy: 0.9820359281437125
Test Accuracy : 0.7668161434977578


In [None]:
import pickle
import os

os.makedirs('models', exist_ok=True)

# Fitted preprocessing objects and the classifier, keyed by file name.
artifacts = {
    'si_age': si_age,
    'si_embarked': si_embarked,
    'ohe_sex': ohe_sex,
    'ohe_embarked': ohe_embarked,
    'titanic_model': clf,
    # Note: final_pipeline was last fit on the toy Age/Experience/City/Salary
    # data above, not on the Titanic split.
    'titanic_pipeline': final_pipeline,
}

# `with` closes each file handle once the object has been written.
for name, obj in artifacts.items():
    with open(f'models/{name}.pkl', 'wb') as f:
        pickle.dump(obj, f)
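
As a quick check that a pickled object behaves the same after loading, the sketch below saves and restores a small pipeline and compares predictions. The data, the file name `small_pipe.pkl`, and the temporary directory are assumptions made so the example is self-contained; it does not touch the `models/` folder.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up training data standing in for real features.
X_small = np.array([[22.0, 7.25], [38.0, 71.28], [np.nan, 7.92], [35.0, 53.1]])
y_small = np.array([0, 1, 1, 1])

small_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('model', DecisionTreeClassifier(random_state=0)),
]).fit(X_small, y_small)

# Round-trip through a file; `with` guarantees the handles are closed.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'small_pipe.pkl')  # hypothetical file name
    with open(path, 'wb') as f:
        pickle.dump(small_pipe, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

new_row = np.array([[31.0, 10.5]])
print(restored.predict(new_row), small_pipe.predict(new_row))
```

Saving one pipeline object keeps preprocessing and model together, so a single file is enough at prediction time. Pickle files should only be loaded from trusted sources and with a compatible `scikit-learn` version.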


In [None]:
test_input=np.array([2,'male',31.0,0,0,10.5,'S'],dtype=object).reshape(1,7)

In [None]:
test_input

array([[2, 'male', 31.0, 0, 0, 10.5, 'S']], dtype=object)

In [None]:
test_input_sex = ohe_sex.transform(test_input[:, [1]])

