
In [1]:
# Importing Libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Load the dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
data = pd.read_csv(url)

# Display basic information
print("Dataset Head:\n", data.head())
print("\nDataset Info:\n")
data.info()  # info() prints its own report and returns None, so print() would add a stray "None"
print("\nMissing Values:\n", data.isnull().sum())

# Step 1: Handling Missing Values
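# Fill missing 'Age' with the median.
# Assign the result back rather than calling fillna(..., inplace=True) on data['Age']:
# the chained inplace form acts on an intermediate object and stops working in pandas 3.0.
data['Age'] = data['Age'].fillna(data['Age'].median())

# Fill missing 'Embarked' with the most frequent value
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])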

# Drop 'Cabin' column due to a large number of missing values
data.drop(columns=['Cabin'], inplace=True)

# Step 2: Converting Categorical Variables (One-Hot Encoding)
categorical_features = ['Sex', 'Embarked', 'Pclass']
numerical_features = ['Age', 'Fare', 'SibSp', 'Parch']

# Create Column Transformer for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),          # Standardize numerical features
        ('cat', OneHotEncoder(drop='first'), categorical_features) # One-Hot Encode categorical features
    ]
)

# Apply the transformations to the data
processed_data = preprocessor.fit_transform(data[numerical_features + categorical_features])

# Convert processed data to DataFrame with meaningful column names
processed_data = pd.DataFrame(processed_data, columns=[
    'Age', 'Fare', 'SibSp', 'Parch', 'Sex_male', 'Embarked_Q', 'Embarked_S', 'Pclass_2', 'Pclass_3'
])

print("\nProcessed Data Head:\n", processed_data.head())

# Combine processed features with the target variable (if available)
if 'Survived' in data.columns:
    final_data = pd.concat([processed_data, data['Survived']], axis=1)
    print("\nFinal Processed Data Head:\n", final_data.head())
else:
    final_data = processed_data

# Save the cleaned and preprocessed data
final_data.to_csv('cleaned_titanic.csv', index=False)  # file name matches the Titanic dataset loaded above
print("\nData Cleaning and Preprocessing Completed!")

Dataset Head:
    PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   Na

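
The cell imports `Pipeline` but never uses it, and fills missing values directly on the DataFrame before preprocessing. A minimal sketch of one alternative, assuming `SimpleImputer` from `sklearn.impute` is acceptable here, moves the median and most-frequent imputation into the transformer so that the fill values are learned during `fit`. The `toy` frame and the names `numeric_steps` and `categorical_steps` are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in rows with gaps in 'Age' and 'Embarked'
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Numeric columns: fill gaps with the median, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(drop="first")),
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_steps, ["Age", "Fare"]),
        ("cat", categorical_steps, ["Sex", "Embarked"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

processed = preprocessor.fit_transform(toy)
print(processed.shape)
print(np.isnan(processed).any())
```

With this layout the same fitted `preprocessor` can be applied to new rows through `transform`, which reuses the median and mode learned from the fitting data.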
