<a href="https://colab.research.google.com/github/faisu6339-glitch/Machine-learning/blob/main/F7_Pipelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.feature_selection import SelectKBest, chi2

In [None]:
df=pd.read_csv('Titanic.csv')

In [None]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [None]:
df.drop(columns=['PassengerId','Name','Ticket','Cabin'],inplace=True)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('Survived', axis=1), df['Survived'], test_size=0.2, random_state=42)

In [None]:
X_train

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
331,1,male,45.5,0,0,28.5000,S
733,2,male,23.0,0,0,13.0000,S
382,3,male,32.0,0,0,7.9250,S
704,3,male,26.0,1,0,7.8542,S
813,3,female,6.0,4,2,31.2750,S
...,...,...,...,...,...,...,...
106,3,female,21.0,0,0,7.6500,S
270,1,male,,0,0,31.0000,S
860,3,male,41.0,2,0,14.1083,S
435,1,female,14.0,1,2,120.0000,S


In [None]:
X_train.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
331,1,male,45.5,0,0,28.5,S
733,2,male,23.0,0,0,13.0,S
382,3,male,32.0,0,0,7.925,S
704,3,male,26.0,1,0,7.8542,S
813,3,female,6.0,4,2,31.275,S


In [None]:
y_train.head()

Unnamed: 0,Survived
331,0
733,0
382,0
704,0
813,0


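Imputation transformer: fill missing `Age` values with the column mean (the `SimpleImputer` default) and missing `Embarked` values with the most frequent port.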

In [None]:
trf1=ColumnTransformer([
    ('impute_age',SimpleImputer(),[2]),
    ('impute_embarked',SimpleImputer(strategy='most_frequent'),[6])
],remainder='passthrough')
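A `ColumnTransformer` places the columns it transforms first and appends the passed-through columns after them, so the column positions change after this step. The sketch below illustrates that reordering on a small made-up frame (the `toy` data and the `trf1_demo` name are assumptions for illustration, not part of the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Made-up rows with the same column order as X_train.
toy = pd.DataFrame({
    'Pclass': [3, 1, 2],
    'Sex': ['male', 'female', 'male'],
    'Age': [22.0, np.nan, 30.0],
    'SibSp': [1, 0, 0],
    'Parch': [0, 0, 1],
    'Fare': [7.25, 71.28, 13.0],
    'Embarked': ['S', np.nan, 'S'],
})

# Same configuration as trf1: Age is column 2, Embarked is column 6.
trf1_demo = ColumnTransformer([
    ('impute_age', SimpleImputer(), [2]),
    ('impute_embarked', SimpleImputer(strategy='most_frequent'), [6]),
], remainder='passthrough')

out = trf1_demo.fit_transform(toy)

# Output order: Age, Embarked, then the untouched columns
# Pclass, Sex, SibSp, Parch, Fare in their original relative order.
print(out[0])
print("Embarked is now at index 1:", out[0, 1])
print("Sex is now at index 3:", out[0, 3])
```

Any later step that selects columns by position has to use these new positions rather than the positions in `X_train`.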

One-hot encoding

In [None]:
trf2=ColumnTransformer([
    ('ohe_sex_embarked',OneHotEncoder(handle_unknown='ignore'),[1,6])
],remainder='passthrough')

Scaling

In [None]:
trf3=ColumnTransformer([
    ('scale',MinMaxScaler(),slice(0,10))
])

Feature Selection

In [None]:
trf4=SelectKBest(score_func=chi2,k=5)
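One likely reason for placing `MinMaxScaler` before this step is that scikit-learn's `chi2` score function only accepts non-negative features. The sketch below checks this on random made-up data (the arrays and variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))          # made-up features, some negative
y = (X[:, 0] > 0).astype(int)         # made-up binary target

X_minmax = MinMaxScaler().fit_transform(X)   # every value lies in [0, 1]
X_std = StandardScaler().fit_transform(X)    # centered, so it has negatives

# chi2 works on the min-max scaled data.
selected = SelectKBest(score_func=chi2, k=2).fit_transform(X_minmax, y)
print(selected.shape)

# chi2 raises a ValueError on data that contains negative values.
try:
    SelectKBest(score_func=chi2, k=2).fit(X_std, y)
    chi2_rejected_negatives = False
except ValueError as err:
    chi2_rejected_negatives = True
    print("chi2 failed:", err)
```

Swapping `MinMaxScaler` for `StandardScaler` in `trf3` would therefore be expected to break `trf4` unless the score function is changed as well.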

In [None]:
trf5=DecisionTreeClassifier()

Pipeline

In [None]:
pipe=Pipeline([
    ('trf1',trf1),
    ('trf2',trf2),
    ('trf3',trf3),
    ('trf4',trf4),
    ('trf5',trf5)
])

# Task
Fit the previously defined machine learning pipeline (`pipe`) using the training data (`X_train`, `y_train`), then generate predictions on the test data (`X_test`), and finally, evaluate and summarize the model's performance on `y_test` using appropriate classification metrics.

## Fit the Pipeline

### Subtask:
Fit the defined machine learning pipeline (`pipe`) to the training data (`X_train`, `y_train`).


**Reasoning**:
Fit the machine learning pipeline to the training data as instructed.

**Reasoning**:
The first attempt to fit the pipeline failed because the `trf2` ColumnTransformer targeted the wrong column indices. `trf1` emits its transformed columns first and the passed-through columns after them, so its output order is Age, Embarked, Pclass, Sex, SibSp, Parch, Fare. After `trf1`, the 'Embarked' column is therefore at index 1 and the 'Sex' column is at index 3, while index 6 now holds 'Fare'. `trf2` needs to be updated to target indices 1 and 3 for one-hot encoding.



In [None]:
trf2=ColumnTransformer([
    ('ohe_sex_embarked',OneHotEncoder(handle_unknown='ignore'),[1,3])
],remainder='passthrough')
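Tracking positional indices across chained transformers is easy to get wrong, as the fix above shows. One possible alternative, sketched here on made-up data and not taken from the notebook, is to select columns by name in a single `ColumnTransformer` and nest the imputation and encoding for the categorical columns in a small `Pipeline` (the names `categorical` and `preprocess` are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows with the same columns as X_train.
toy = pd.DataFrame({
    'Pclass': [3, 1, 2, 3],
    'Sex': ['male', 'female', 'male', 'female'],
    'Age': [22.0, np.nan, 30.0, 41.0],
    'SibSp': [1, 0, 0, 2],
    'Parch': [0, 0, 1, 0],
    'Fare': [7.25, 71.28, 13.0, 14.1],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

# Impute, then one-hot encode, the categorical columns in one sub-pipeline.
categorical = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])

# Columns are addressed by name, so no index bookkeeping is needed.
preprocess = ColumnTransformer([
    ('age', SimpleImputer(), ['Age']),
    ('cat', categorical, ['Sex', 'Embarked']),
], remainder='passthrough')

encoded = preprocess.fit_transform(toy)
print(encoded.shape)  # 1 Age + 2 Sex + 2 Embarked + 4 passthrough columns
```

Selecting by name requires a pandas DataFrame as input, which is the case for `X_train` here.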

**Reasoning**:
The `trf2` ColumnTransformer was redefined in the previous step, but the `pipe` object still holds a reference to the old `trf2` object. The `pipe` therefore needs to be rebuilt so that it picks up the corrected transformer before we attempt to fit it again.



In [None]:
pipe=Pipeline([
    ('trf1',trf1),
    ('trf2',trf2),
    ('trf3',trf3),
    ('trf4',trf4),
    ('trf5',trf5)
])
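The reason the pipeline has to be rebuilt is that a `Pipeline` stores the transformer objects it was given, not the variable names. A minimal sketch of this behavior, using stand-in transformers and a hypothetical `demo_pipe` rather than the notebook's objects:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

scaler = MinMaxScaler()
demo_pipe = Pipeline([
    ('scale', scaler),
    ('tree', DecisionTreeClassifier()),
])

# Rebinding the name creates a new object; the pipeline is not updated.
scaler = StandardScaler()

step_name = type(demo_pipe.named_steps['scale']).__name__
print(step_name)  # MinMaxScaler
```

The same applies to `trf2`: assigning a new `ColumnTransformer` to the name leaves the one already inside `pipe` unchanged.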

**Reasoning**:
Now that the `pipe` object has been redefined with the corrected transformers, the next step is to fit this updated pipeline to the training data (`X_train`, `y_train`) to prepare it for making predictions.



In [None]:
pipe.fit(X_train, y_train)

## Make Predictions

### Subtask:
Use the fitted pipeline to make predictions on the test data (`X_test`).


**Reasoning**:
The subtask requires making predictions on the test data. The `pipe` object, which has already been fitted, can be used with its `.predict()` method to generate these predictions on `X_test`.



In [None]:
y_pred = pipe.predict(X_test)
y_pred

array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0,
       1, 1, 1])

## Evaluate Model Performance

### Subtask:
Evaluate and summarize the model's performance on `y_test` using appropriate classification metrics.

**Reasoning**:
To evaluate the model's performance, it's essential to calculate relevant classification metrics. Accuracy, precision, recall, and F1-score are standard metrics for classification tasks, and the confusion matrix provides a detailed breakdown of correct and incorrect classifications. These metrics will give a comprehensive understanding of how well the model performed on the test data.



In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print("Confusion Matrix:\n", conf_matrix)

Accuracy: 0.8045
Precision: 0.7910
Recall: 0.7162
F1-Score: 0.7518
Confusion Matrix:
 [[91 14]
 [21 53]]


## Final Task

### Subtask:
Provide a summary of the model's performance metrics.


## Summary:

### Q&A
The model's performance metrics are as follows:
*   Accuracy: 0.8045
*   Precision: 0.7910
*   Recall: 0.7162
*   F1-Score: 0.7518
*   The confusion matrix is `[[91 14], [21 53]]`, meaning there were 91 true negatives, 14 false positives, 21 false negatives, and 53 true positives.
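As a cross-check, the four scores can be recomputed by hand from the confusion matrix alone. The sketch below uses only the counts reported above, with scikit-learn's `[[TN, FP], [FN, TP]]` layout for binary labels:

```python
import numpy as np

# Confusion matrix reported above: rows are true labels, columns are predictions.
conf_matrix = np.array([[91, 14],
                        [21, 53]])
tn, fp, fn, tp = conf_matrix.ravel()

accuracy = (tp + tn) / conf_matrix.sum()   # correct predictions / all predictions
precision = tp / (tp + fp)                 # predicted survivors who did survive
recall = tp / (tp + fn)                    # actual survivors the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.4f}")    # 0.8045
print(f"Precision: {precision:.4f}")  # 0.7910
print(f"Recall: {recall:.4f}")        # 0.7162
print(f"F1-Score: {f1:.4f}")          # 0.7518
```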

### Data Analysis Key Findings
*   The initial attempt to fit the machine learning pipeline failed with a `ValueError` caused by string values ('male') in the 'Sex' column. `trf2` was pointed at the wrong column indices, so 'Sex' was passed through without being one-hot encoded.
*   The issue was resolved by correctly re-specifying the column indices for one-hot encoding in `trf2`, targeting the 'Embarked' column at index 1 and the 'Sex' column at index 3 after the initial transformations.
*   After correcting the pipeline configuration, the model successfully fit the training data and generated predictions on the test data.
*   The model achieved an accuracy of 0.8045, precision of 0.7910, recall of 0.7162, and an F1-Score of 0.7518 on the test set.
*   The confusion matrix indicates that the model correctly identified 91 negative cases and 53 positive cases, but it also produced 14 false positives and 21 false negatives.

### Insights or Next Steps
*   The model reaches about 80% accuracy on the test set, but it misses 21 of the 74 actual survivors (recall of 0.7162), so there is room to improve its ability to identify positive cases.
*   Future steps could involve exploring hyperparameter tuning of the current model, considering alternative classification algorithms, or conducting further feature engineering to potentially improve recall without significantly impacting precision.
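The hyperparameter tuning idea can be sketched with `GridSearchCV`, which addresses parameters of a pipeline step as `<step name>__<parameter>`. The example below rebuilds the same five-step pipeline on synthetic data so that it runs on its own. The data, the parameter grid, and the variable names are assumptions for illustration, and the resulting scores say nothing about the Titanic data:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (made-up values).
rng = np.random.default_rng(42)
n = 200
X_demo = pd.DataFrame({
    'Pclass': rng.integers(1, 4, n),
    'Sex': rng.choice(['male', 'female'], n),
    'Age': rng.uniform(1, 70, n),
    'SibSp': rng.integers(0, 3, n),
    'Parch': rng.integers(0, 3, n),
    'Fare': rng.uniform(5, 100, n),
    'Embarked': rng.choice(['S', 'C', 'Q'], n),
})
X_demo.loc[::10, 'Age'] = np.nan  # inject some missing ages
y_demo = ((X_demo['Sex'] == 'female') | (X_demo['Pclass'] == 1)).astype(int)

# Same steps and step names as the notebook's pipeline.
demo_pipe = Pipeline([
    ('trf1', ColumnTransformer([
        ('impute_age', SimpleImputer(), [2]),
        ('impute_embarked', SimpleImputer(strategy='most_frequent'), [6]),
    ], remainder='passthrough')),
    ('trf2', ColumnTransformer([
        ('ohe_sex_embarked', OneHotEncoder(handle_unknown='ignore'), [1, 3]),
    ], remainder='passthrough')),
    ('trf3', ColumnTransformer([('scale', MinMaxScaler(), slice(0, 10))])),
    ('trf4', SelectKBest(score_func=chi2, k=5)),
    ('trf5', DecisionTreeClassifier(random_state=0)),
])

# Hypothetical grid: tree depth and the number of selected features.
param_grid = {
    'trf5__max_depth': [2, 4, None],
    'trf4__k': [3, 5],
}

search = GridSearchCV(demo_pipe, param_grid, cv=3, scoring='accuracy')
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.4f}")
```

Because the whole pipeline is refit inside each fold, the imputation, scaling, and feature selection are learned from the training part of the fold only. To optimize recall rather than accuracy, `scoring='recall'` could be passed instead.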


## Make Prediction on New Input

### Subtask:
Allow user to input new data and predict survival.

**Reasoning**:
First we create a sample input: a one-row pandas DataFrame describing a single passenger. The columns must appear in the same order as in `X_train`, because the transformers in `pipe` select columns by position. You can modify the values to test different scenarios.

In [None]:
new_passenger_data = pd.DataFrame({
    'Pclass': [3], # Passenger Class (1, 2, or 3)
    'Sex': ['male'], # Sex (male or female)
    'Age': [25.0], # Age
    'SibSp': [0], # Number of Siblings/Spouses Aboard
    'Parch': [0], # Number of Parents/Children Aboard
    'Fare': [7.25], # Passenger Fare
    'Embarked': ['S'] # Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
})

display(new_passenger_data)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,male,25.0,0,0,7.25,S


**Reasoning**:
Now that we have the new passenger data, we can use our `pipe` object to make a prediction. The `pipe` will automatically handle all the preprocessing steps (imputation, one-hot encoding, scaling, feature selection) and then use the trained `DecisionTreeClassifier` to predict survival.

A prediction of `0` means the model predicts the passenger did not survive, and `1` means the model predicts the passenger survived.

In [None]:
new_prediction = pipe.predict(new_passenger_data)

if new_prediction[0] == 1:
    print("The model predicts the passenger survived.")
else:
    print("The model predicts the passenger did not survive.")

The model predicts the passenger did not survive.
