### Data Cleaning & Preprocessing

A short program that demonstrates cleaning and preprocessing data in Python with pandas.

- Read a CSV file into a pandas DataFrame (df).
- Display the original DataFrame to see the initial state of the data.

Cleaning and preprocessing steps:
- Handle missing values: fill missing 'Age' entries with the column median and missing 'Income' entries with '$0'.
- Convert the 'Age' column to numeric values, coercing errors to NaN.
- Remove the dollar sign (and any thousands separators) from the 'Income' column and convert it to float.
- Remove duplicate rows from the DataFrame.
- Display the cleaned and preprocessed DataFrame.
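
The file data2clean.csv is not included with this notebook. To run the cells without it, the input can be rebuilt in memory. This is a minimal sketch, assuming the file holds exactly the five rows and five columns that appear in the notebook's output; `SAMPLE_CSV` is a hypothetical stand-in and the real file may differ.

```python
import io

import pandas as pd

# Hypothetical stand-in for data2clean.csv, reconstructed from the output
# displayed in this notebook; the real file may be laid out differently.
SAMPLE_CSV = """Name,Age,Gender,Income,Education
John,25,Male,$50000,Bachelor
Jane,30,Female,$60000,Master
Bob,,Male,$45000,High School
Alice,22,Female,,PhD
Eve,28,Female,$70000,Bachelor
"""

# read_csv accepts any file-like object, so StringIO can stand in for a path
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
print(df)
```

Empty fields are read as NaN, which is why 'Age' is parsed as float64 rather than as an integer column.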

In [1]:
import pandas as pd

# Read data from a CSV file
df = pd.read_csv('data2clean.csv')

# Display the original DataFrame
print("Original DataFrame:")
df

Original DataFrame:


    Name   Age  Gender  Income    Education
0   John  25.0    Male  $50000     Bachelor
1   Jane  30.0  Female  $60000       Master
2    Bob   NaN    Male  $45000  High School
3  Alice  22.0  Female     NaN          PhD
4    Eve  28.0  Female  $70000     Bachelor


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Name       5 non-null      object 
 1   Age        4 non-null      float64
 2   Gender     5 non-null      object 
 3   Income     4 non-null      object 
 4   Education  5 non-null      object 
dtypes: float64(1), object(4)
memory usage: 328.0+ bytes
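
The non-null counts reported by df.info() show that 'Age' and 'Income' each have one missing entry. The same information can be read off directly with isna(). The sketch below uses a small hand-built frame that mirrors the sample data; it is illustrative only and is not read from data2clean.csv.

```python
import numpy as np
import pandas as pd

# Illustrative frame mirroring the missing entries in the sample data
df = pd.DataFrame({
    "Name": ["John", "Jane", "Bob", "Alice", "Eve"],
    "Age": [25.0, 30.0, np.nan, 22.0, 28.0],
    "Income": ["$50000", "$60000", "$45000", np.nan, "$70000"],
})

# isna() marks missing cells; summing the booleans counts them per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Select the rows that contain at least one missing value
rows_with_gaps = df[df.isna().any(axis=1)]
print(rows_with_gaps)
```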


In [3]:
# Rename the 'Name' column to 'Firstname'
df = df.rename(columns={'Name': 'Firstname'})

In [4]:
# Cleaning and preprocessing steps
# Converting data types
# Convert 'Age' first so that unparseable entries become NaN before the median is taken
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

# Handling missing values
# Assign the result back; newer pandas versions warn about fillna(inplace=True) on a selected column
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Income'] = df['Income'].fillna('$0')  # Note: a missing income is treated as 0

# Remove '$' and ',' from 'Income', convert to float, and round to two decimals
df['Income'] = df['Income'].replace(r'[\$,]', '', regex=True).astype(float).round(2)

# Removing duplicates
df.drop_duplicates(inplace=True)

# Display the cleaned and preprocessed DataFrame
print("\nCleaned and Preprocessed DataFrame:")
df


Cleaned and Preprocessed DataFrame:


  Firstname   Age  Gender   Income    Education
0      John  25.0    Male  50000.0     Bachelor
1      Jane  30.0  Female  60000.0       Master
2       Bob  26.5    Male  45000.0  High School
3     Alice  22.0  Female      0.0          PhD
4       Eve  28.0  Female  70000.0     Bachelor
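
The sample data contains no unparseable ages, no thousands separators, and no duplicate rows, so those steps leave it unchanged. To see what they do, here is a minimal sketch on hypothetical messy input; the `messy` frame and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical messy input: a non-numeric age, commas in the incomes,
# and one exact duplicate row
messy = pd.DataFrame({
    "Firstname": ["Ann", "Ann", "Raj"],
    "Age": ["31", "31", "unknown"],
    "Income": ["$1,200.50", "$1,200.50", "$900"],
})

# errors='coerce' turns unparseable entries such as 'unknown' into NaN
messy["Age"] = pd.to_numeric(messy["Age"], errors="coerce")

# Strip '$' and ',' before converting the strings to float
messy["Income"] = (
    messy["Income"].str.replace(r"[\$,]", "", regex=True).astype(float)
)

# drop_duplicates keeps the first of each set of identical rows
deduped = messy.drop_duplicates()
print(deduped)
```

Rows are compared after the type conversions here, so two rows that differ only in formatting (for example '$900' and '$900.00') would also count as duplicates.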


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Firstname  5 non-null      object 
 1   Age        5 non-null      float64
 2   Gender     5 non-null      object 
 3   Income     5 non-null      float64
 4   Education  5 non-null      object 
dtypes: float64(2), object(3)
memory usage: 240.0+ bytes
