
# data_preprocess
## steps
1. Load the dataset
2. Inspect the dataset (shape, types, nulls, preview)
3. Handle missing values
4. Remove duplicates
5. Fix incorrect data types
6. Encode categorical variables
7. Feature engineering (create/update features)
8. Handle outliers (optional)
9. Normalize / Scale numerical features
10. Split dataset into train/test (and optionally validation)
11. Balance dataset (SMOTE or class weights) [optional for classification]
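
The later steps in the list (encode, split, scale) are not run in this notebook, so here is a minimal sketch of them on a made-up toy frame. The column names and values are assumptions for illustration only. The sketch splits before scaling so that the scaler is fitted on training rows only; that ordering is a choice made for this example, not something the list above prescribes.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data standing in for a real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0, 27.0],
    "sex": ["male", "female", "female", "female", "male", "male", "male", "female"],
    "target": [0, 1, 1, 1, 0, 0, 0, 1],
})

# Encode categorical variables: one-hot encode "sex".
encoded = pd.get_dummies(toy, columns=["sex"], dtype=int)

# Split into train/test; stratify keeps the class ratio similar in both parts.
X = encoded.drop(columns="target")
y = encoded["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the numerical feature: fit on train only, then apply to both splits.
scaler = StandardScaler()
X_train = X_train.copy()
X_test = X_test.copy()
X_train[["age"]] = scaler.fit_transform(X_train[["age"]])
X_test[["age"]] = scaler.transform(X_test[["age"]])

print(X_train.shape, X_test.shape)
```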

In [2]:
import pandas as pd
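# Why we do not write `import scikit-learn`:
# 1. "scikit-learn" is the name of the installable package (pip install scikit-learn); the module you import is called sklearn
# 2. hyphens (-) are not allowed in Python module names, so `import scikit-learn` is a syntax error
# So instead we import the specific function we need; `import sklearn` also works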
from sklearn.model_selection import train_test_split

In [3]:
df = pd.read_csv('/home/aman/Desktop/datascience/dataset/Titanic-Dataset.csv')

# inspection 

In [4]:
# inspecting the data
# preview the first five rows to see what the data looks like
print(df.head())


   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


In [5]:
print(df.info())  # df.info() prints the summary itself and returns None, hence the trailing "None" below


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None


In [22]:
# removing unwanted columns: PassengerId, Name, SibSp, Ticket, Fare

# df.drop(["PassengerId","Name","SibSp","Ticket","Fare"],axis=1)

# Note: if these columns have already been removed from df (for example by assigning the result
# back to df earlier in the same session), running the line above again raises
# KeyError: "['PassengerId', 'Name', 'SibSp', 'Ticket', 'Fare'] not found in axis".
# errors="ignore" skips labels that are no longer present, so the cell can be re-run safely.
# drop() returns a new DataFrame; df itself changes only if the result is assigned back.
df.drop(["PassengerId","Name","SibSp","Ticket","Fare"],axis=1,errors="ignore")

Unnamed: 0,Survived,Pclass,Sex,Age,Parch,Cabin,Embarked
0,0,3,male,22.0,0,,S
1,1,1,female,38.0,0,C85,C
2,1,3,female,26.0,0,,S
3,1,1,female,35.0,0,C123,S
4,0,3,male,35.0,0,,S
...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,,S
887,1,1,female,19.0,0,B42,S
888,0,3,female,,2,,S
889,1,1,male,26.0,0,C148,C
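
A small sketch of the re-run behaviour described in the note above, using a made-up three-column frame (the values are placeholders; only the column names mirror the Titanic data). It shows that `drop()` returns a new frame, that dropping already-missing labels raises `KeyError`, and that `errors="ignore"` avoids it.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the Titanic data.
toy = pd.DataFrame({"PassengerId": [1, 2], "Name": ["A", "B"], "Age": [22.0, 38.0]})

# drop() returns a new DataFrame; `toy` itself still has all three columns.
trimmed = toy.drop(["PassengerId", "Name"], axis=1)
print(list(toy.columns))
print(list(trimmed.columns))

# Dropping the same labels again from the already-trimmed frame raises KeyError...
try:
    trimmed.drop(["PassengerId", "Name"], axis=1)
    raised = False
except KeyError as exc:
    raised = True
    print("KeyError:", exc)

# ...unless errors="ignore" tells pandas to skip labels that are not present.
still_trimmed = trimmed.drop(["PassengerId", "Name"], axis=1, errors="ignore")
print(list(still_trimmed.columns))
```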


In [6]:
print(df.shape)
print("----------------------------------------------")
print(df.isnull().sum()) # checking for null values in each column
print("----------------------------------------------")




(891, 12)
----------------------------------------------
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
----------------------------------------------


# step-2 handling missing values
1. removing the data itself (easy)
2. imputation (filling missing values) ->
`constant`
`mean, median, mode`
`forward fill, backward fill`
3. model-based filling -> `KNN` `REGRESSION`
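
The imputation strategies in the list above can be sketched per column on a small toy frame. The values are made up; the point is that each column gets a fill value computed from that same column (median for a numeric column, mode for a categorical one).

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and a few gaps.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", np.nan, "S", "S"],
})

filled = toy.copy()
# Numeric column: fill with that column's own median.
filled["Age"] = toy["Age"].fillna(toy["Age"].median())
# Categorical column: fill with the most frequent value; mode() returns a Series.
filled["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Forward fill copies the previous row's value down;
# backward fill copies the next row's value up.
ffilled = toy.ffill()
bfilled = toy.bfill()

print(filled)
print(ffilled.isnull().sum())
print(bfilled.isnull().sum())
```

Backward fill leaves the last `Age` empty here because there is no later row to copy from, which is the same effect forward fill has on a gap in the first row.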


In [7]:
# REMOVING MISSING VALUES
df1 = df.dropna() # remove every row that has at least one missing value
print(df1.isnull().sum()) # null count per column after removing rows
print(df1.shape)
# a lot of data is lost in this case: only 183 of the 891 rows remain, mostly because Cabin is missing in 687 rows

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
(183, 12)
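
One way to lose fewer rows, sketched on a made-up toy frame: drop the sparsest column first, then drop rows only where selected columns are missing. Which columns to drop or require is an assumption of this example, not a recommendation from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; Cabin is mostly missing, as in the Titanic data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
})

# Plain dropna() keeps only rows with no missing value in any column.
all_complete = toy.dropna()
print(all_complete.shape)

# Drop the sparsest column first, then drop rows only where Age is missing.
kept = toy.drop(columns=["Cabin"]).dropna(subset=["Age"])
print(kept.shape)
```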


In [8]:
# imputation
# fill missing values with the mean of Age
# caution: df.fillna(value) fills NaNs in every column, so Cabin and Embarked also receive the Age mean here;
# to fill only Age, use df['Age'].fillna(df['Age'].mean())
df2 = df.fillna(df['Age'].mean())
print(df2.isnull().sum()) # null count per column after filling
print(df2.shape) # no rows are lost
# same idea with the median of Age
df3 = df.fillna(df['Age'].median())
print('----------------------------------------------')
print(df3.isnull().sum())
print(df3.shape) # no rows are lost

# the mode works the same way (Series.mode() returns a Series, so take its first element)

# forward fill and backward fill
# df.ffill() / df.bfill() replace fillna(method=...), which is deprecated in recent pandas
df4 = df.ffill() # forward fill: copy the previous row's value down
print('----------------------------------------------')
print(df4.isnull().sum())
print(df4.shape) # no rows are lost

df5 = df.bfill() # backward fill: copy the next row's value up
print('----------------------------------------------')
print(df5.isnull().sum())
print(df5.shape) # no rows are lost

# note: after forward fill, Cabin still has 1 missing value because the first row's Cabin is NaN
# and there is no earlier row to copy from

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
(891, 12)
----------------------------------------------
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
(891, 12)
----------------------------------------------
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          1
Embarked       0
dtype: int64
(891, 12)
----------------------------------------------
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin   



## knn imputer 

The KNN Imputer is a method used to fill missing values in a dataset using the K-Nearest Neighbors approach. It's available in scikit-learn (sklearn.impute.KNNImputer). Instead of using simple strategies like mean or median imputation, KNN Imputer looks at the nearest data points (neighbors) and fills missing values based on their values.
🔧 How It Works

For each missing value in a feature:

    It finds the k nearest samples (rows) that have a value for that feature.

    It uses the average (or distance-weighted average) of those neighbors to impute the missing value.

In code:

    KNNImputer(...) creates an instance of the KNNImputer class from sklearn.impute.

    imputer is just a variable name that holds this instance.

    This instance has methods like .fit(), .transform(), and .fit_transform() that are used to impute missing values.

So technically:

    🔹 imputer is an object of the KNNImputer class that you use to perform K-Nearest Neighbors-based imputation on a dataset.
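
A minimal sketch of the neighbor averaging described above, on a tiny made-up array. With `n_neighbors=2`, each gap is filled with the mean of that column over the two closest rows that have a value there. Distances are computed only on the columns both rows have in common.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up 4x3 array with two gaps.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

In the first row the two closest rows with a third-column value are rows 2 and 3 of the array (values 3 and 5), so the gap becomes 4. In the third row the two closest donors have first-column values 3 and 8, so the gap becomes 5.5.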




### .fit()



imputer.fit(data)

    Learns from the data.

    In KNNImputer, this mostly means storing the training rows so they can serve as candidate neighbors later; the distances to neighbors are computed in .transform().

    It does NOT change or return the data yet.

💬 Think of it as:

    “Hey imputer, look at this data and understand how it’s structured.”

###  .transform()

imputed_data = imputer.transform(data)

    Uses what was learned in .fit() to actually fill in the missing values.

    It gives you back a version of the dataset where NaNs are replaced.

💬 Think of it as:

    “Okay, now apply what you learned and fix the missing values.”
###  .fit_transform()

imputed_data = imputer.fit_transform(data)

    Does both steps in one line:

        First .fit(data) to learn

        Then .transform(data) to return the imputed data
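
A short sketch of why the two steps are separate, assuming made-up arrays named `train` and `new_rows`: the imputer is fitted once on training rows and then reused to fill rows it has not seen. It also checks that `fit_transform` matches `fit` followed by `transform` on the same data.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up training rows and one new row with a gap in the second column.
train = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, np.nan]])
new_rows = np.array([[2.5, np.nan]])

imputer = KNNImputer(n_neighbors=2)
imputer.fit(train)                         # store the training rows
new_filled = imputer.transform(new_rows)   # fill new rows using training neighbors

# fit_transform(train) gives the same result as fit(train) followed by transform(train).
one_step = KNNImputer(n_neighbors=2).fit_transform(train)
two_step = imputer.transform(train)
same = np.allclose(one_step, two_step)

print(new_filled)
print(same)
```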


In [10]:
# model-based filling
# KNN imputation
from sklearn.impute import KNNImputer
# KNNImputer is a class from sklearn.impute; imputer is an instance created with n_neighbors=4.
# Creating it does not fill anything yet. When fit_transform() is called, each missing value is
# replaced by the average of that feature over the 4 most similar rows that have it.
imputer = KNNImputer(n_neighbors=4)


###  note
KNNImputer only works on numerical values, so either encode the non-numeric columns as numbers or drop them before imputing.


✅ 2. What does "partial encoding" mean in this case?

“Partial encoding” means:

    We encode only the columns that help with similarity and are relevant for modeling or imputation.

This is good practice because:

    You keep useful categorical info (Sex, Embarked)

    You avoid noise from irrelevant fields (Name, Ticket)

🚫 Why encoding everything can hurt KNNImputer:

KNNImputer fills missing values by comparing rows using distances. If you include random or non-informative columns, those columns distort the distances, so KNN picks the wrong neighbors and the imputed values get worse.
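
A minimal sketch of partial encoding before KNN imputation, on a made-up toy frame whose column names mirror the Titanic data. The 0/1 mapping for `Sex` and the choice to drop `Name` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with made-up values; row 2 has a missing Age.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, np.nan, 35.0],
})

# Drop the free-text column and encode only the informative categorical one.
features = toy.drop(columns=["Name"])
features["Sex"] = features["Sex"].map({"male": 0, "female": 1})

# fit_transform returns a NumPy array, so wrap it back into a DataFrame.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(features),
    columns=features.columns,
    index=features.index,
)
print(imputed)
```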

In [None]:

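# KNNImputer needs numeric input, so impute only the numeric columns
numeric_df = df.select_dtypes(include="number")
df_imputed = pd.DataFrame(imputer.fit_transform(numeric_df), columns=numeric_df.columns, index=df.index)
print(df_imputed.isnull().sum())  # isnull is a method, so it needs parentheses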