<a href="https://colab.research.google.com/github/Vaishu242004-manoharan/Data-Science-Internship-Basics/blob/main/Data_Cleaning_%26_Missing_Value_Handling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import pandas as pd

# Load the Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic = pd.read_csv(url)

# Show the first few rows of the dataset
print("Initial Dataset:")
print(titanic.head())

# 1. Identify missing values
print("\nMissing Values:")
print(titanic.isnull().sum())  # Displays the number of missing values in each column

# 2. Handle missing values
# Option 1: Fill missing values (e.g., filling missing 'Age' with the mean of the column)
titanic['Age'].fillna(titanic['Age'].mean(), inplace=True)

# Option 2: Drop rows with missing 'Embarked' (assuming 'Embarked' is critical for analysis)
titanic.dropna(subset=['Embarked'], inplace=True)

# Alternatively, if you want to drop rows with any missing values, you can use:
# titanic.dropna(inplace=True)

# 3. Remove duplicate entries
titanic.drop_duplicates(inplace=True)

# 4. Standardize column names: Convert to lowercase and remove spaces
titanic.columns = titanic.columns.str.lower().str.replace(' ', '_')

# Show the cleaned dataset
print("\nCleaned Dataset:")
print(titanic.head())

# Display the number of missing values after cleaning
print("\nMissing Values After Cleaning:")
print(titanic.isnull().sum())


Initial Dataset:
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  titanic['Age'].fillna(titanic['Age'].mean(), inplace=True)
