#### **🧭 Lesson 8: Data Type Conversion & Validation**

**🎯 Objective**
By the end of this lesson, you will be able to:

- Inspect and understand DataFrame dtypes
- Convert data types using `.astype()`
- Convert strings → numbers using `pd.to_numeric()`
- Convert strings → dates using `pd.to_datetime()`
- Handle errors in conversion (very important!)
- Detect “hidden bad data” (common in CSV/Excel files)
- Validate schema (professional practice)

**🧱 Why Do Data Types Matter?**

Data types impact:

➤ Performance

Vectorized calculations run fastest on native numeric dtypes; object columns fall back to slower Python-level operations.

➤ Memory optimization

The `category` dtype can cut memory use sharply for text columns with few distinct values.

➤ Correct analysis

String numbers ("12000") behave differently from integers (12000).

➤ Error prevention

Aggregation, merging, sorting, and plotting all depend on correct dtypes.
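To make the "string numbers" point concrete, here is a small sketch. The two-value series below are made up for illustration and are not taken from the lesson's dataset.

```python
import pandas as pd

# Hypothetical sample: the same two salaries stored as text and as integers
salaries_text = pd.Series(["12000", "8000"])
salaries_int = pd.Series([12000, 8000])

# "+" on text concatenates the strings; on integers it adds the values
doubled_text = salaries_text + salaries_text
doubled_int = salaries_int + salaries_int

# Sorting text is lexicographic, so "12000" sorts before "8000"
text_sorted = salaries_text.sort_values().tolist()
int_sorted = salaries_int.sort_values().tolist()

print(doubled_text.tolist())  # ['1200012000', '80008000']
print(doubled_int.tolist())   # [24000, 16000]
print(text_sorted)            # ['12000', '8000']
print(int_sorted)             # [8000, 12000]
```

Neither result on the text column raises an error, which is why a wrong dtype can go unnoticed until the numbers look off.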

**Import library**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
pburl = r"C:\Users\dhira\Desktop\python-mastery\pandas\dataset\raw\people_basic_data.xlsx"

df_excel = pd.read_excel(pburl)

print(df_excel.head(2))

     Name  Age       City  Salary_INR
0   Aarav   23     Mumbai      115059
1  Vivaan   50  Ahmedabad       93035


**Convert to CSV from Excel**

In [None]:
# File path of the Excel file to be converted
pburl = r"C:\Users\dhira\Desktop\python-mastery\pandas\dataset\raw\people_basic_data.xlsx"

# Read the Excel file into a Pandas DataFrame
df_excel = pd.read_excel(pburl)

# Print first 2 rows to verify the data
print("First 2 rows of the Excel file:\n", df_excel.head(2))

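# ------------------------
# Save the DataFrame as a CSV file
# index=False ensures that the row indices are not written into the CSV
# Note: this writes under ...\GENAI\..., while the Excel file above and the CSV read in the
# next section live under ...\python-mastery\...; point both at the same folder before running
csv_path = r"C:\Users\dhira\Desktop\GENAI\pandas\dataset\raw\peoplebasic_data.csv"
df_excel.to_csv(csv_path, index=False)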

# Print the path where the CSV file has been saved
print("CSV file saved at:", csv_path)

First 2 rows of the Excel file:
      Name  Age       City  Salary_INR
0   Aarav   23     Mumbai      115059
1  Vivaan   50  Ahmedabad       93035
CSV file saved at: C:\Users\dhira\Desktop\GENAI\pandas\dataset\raw\peoplebasic_data.csv


**🧩 Check Current Data Types**

In [None]:
import pandas as pd

# File path of the CSV file
peoplebasic_url = r"C:\Users\dhira\Desktop\python-mastery\pandas\dataset\raw\peoplebasic_data.csv"

# Read the CSV file into a Pandas DataFrame
df_pb = pd.read_csv(peoplebasic_url)

# Display concise summary of the DataFrame
# .info() shows:
# - Number of entries (rows)
# - Column names
# - Non-null counts
# - Data types of each column
df_pb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        20 non-null     object
 1   Age         20 non-null     int64 
 2   City        20 non-null     object
 3   Salary_INR  20 non-null     int64 
dtypes: int64(2), object(2)
memory usage: 772.0+ bytes
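Besides `.info()`, dtypes can also be inspected programmatically. The sketch below uses a hypothetical two-row stand-in for `df_pb` (built from the rows printed earlier) so that it runs without the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for df_pb, using the two rows shown in the earlier output
sample = pd.DataFrame({
    "Name": ["Aarav", "Vivaan"],
    "Age": [23, 50],
    "City": ["Mumbai", "Ahmedabad"],
    "Salary_INR": [115059, 93035],
})

# One dtype per column, returned as a Series indexed by column name
print(sample.dtypes)

# Names of the numeric columns only
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['Age', 'Salary_INR']

# Type check for a single column, independent of the exact integer width
age_is_integer = pd.api.types.is_integer_dtype(sample["Age"])
print(age_is_integer)  # True
```

`select_dtypes` is handy when a later step (scaling, aggregation) should run only on numeric columns.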


#### **🧩 Converting Data Types Using .astype()**

In [5]:
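# Convert the 'Age' column to integer type
# - Right-hand side: df_pb['Age'].astype(int) returns a new Series cast to int
# - Left-hand side: df_pb['Age'] = ... assigns the converted column back to the DataFrame
# - 'Age' was already read as int64, so the dtype below is unchanged; the same
#   pattern applies when a column arrives as float or as numeric text
df_pb['Age'] = df_pb['Age'].astype(int)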

# Display concise summary of the DataFrame to verify changes
# - .info() shows number of entries, column names, non-null counts, and data types
df_pb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        20 non-null     object
 1   Age         20 non-null     int64 
 2   City        20 non-null     object
 3   Salary_INR  20 non-null     int64 
dtypes: int64(2), object(2)
memory usage: 772.0+ bytes
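The conversion above succeeds because every `Age` value is present. As a hedged sketch of what happens otherwise, the example below uses a made-up series with one missing value; the nullable `"Int64"` dtype shown is one possible way to keep integers alongside gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry; the NaN makes this a float64 Series
ages = pd.Series([23, 50, np.nan])

try:
    ages.astype(int)
    cast_failed = False
except (ValueError, TypeError) as exc:
    # pandas refuses to cast NaN to a plain NumPy integer
    cast_failed = True
    print("astype(int) failed:", exc)

# The nullable "Int64" dtype (capital I) keeps the gap as <NA>
ages_nullable = ages.astype("Int64")
print(ages_nullable)
```

Checking `isna().sum()` before calling `.astype(int)` is a cheap way to avoid this failure in a pipeline.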


**Convert to float**

In [6]:
# ------------------------
# Print the structure of the original DataFrame
# df_pb.info() shows:
# - Number of rows
# - Column names
# - Non-null counts
# - Data types
print("\nOriginal DataFrame info:\n")
df_pb.info()

# ------------------------
# Convert 'Salary_INR' column to float type
# float64 allows fractional amounts and can hold NaN for missing values
# (int64 already supports sum, mean, etc.)
df_pb['Salary_INR'] = df_pb['Salary_INR'].astype(float)

# ------------------------
# Print the DataFrame info again to verify the change
# 'Salary_INR' should now show dtype as float64
print("\nDataFrame info after converting 'Salary_INR' to float:\n")
df_pb.info()


Original DataFrame info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        20 non-null     object
 1   Age         20 non-null     int64 
 2   City        20 non-null     object
 3   Salary_INR  20 non-null     int64 
dtypes: int64(2), object(2)
memory usage: 772.0+ bytes

DataFrame info after converting 'Salary_INR' to float:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Name        20 non-null     object 
 1   Age         20 non-null     int64  
 2   City        20 non-null     object 
 3   Salary_INR  20 non-null     float64
dtypes: float64(1), int64(1), object(2)
memory usage: 772.0+ bytes


**Convert to string**

In [7]:
# ------------------------
# Convert 'City' column to string type
# Every entry is treated as text afterwards, including any numbers
# Caution: in pandas versions before 3.0, astype(str) also turns missing values
# into the literal text 'nan', so check for NaNs first
df_pb['City'] = df_pb['City'].astype(str)

# ------------------------
# Display concise summary of the DataFrame to verify the change
# - 'City' still shows dtype 'object', which is how these pandas versions report Python strings
df_pb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Name        20 non-null     object 
 1   Age         20 non-null     int64  
 2   City        20 non-null     object 
 3   Salary_INR  20 non-null     float64
dtypes: float64(1), int64(1), object(2)
memory usage: 772.0+ bytes
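If a text column may contain missing values, the dedicated `"string"` dtype is one alternative to `astype(str)`. The sketch below uses a made-up three-value series, not the lesson's data.

```python
import pandas as pd

# Hypothetical city column with one missing entry
cities = pd.Series(["Mumbai", None, "Delhi"])

# The "string" extension dtype keeps the missing entry as a missing value
cities_string = cities.astype("string")
missing_count = int(cities_string.isna().sum())

# String methods skip the missing entry instead of processing a fake 'nan' label
upper = cities_string.str.upper()

print(cities_string.dtype)
print(missing_count)  # 1
print(upper.tolist())
```

This keeps `isna()` and `dropna()` meaningful after the conversion.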


**Convert to category (memory efficient)**

In [8]:
# Convert 'City' column to categorical type
# - 'category' dtype is memory-efficient for text columns with repeated values
# - Useful for analysis, grouping, and plotting
df_pb['City'] = df_pb['City'].astype('category')
df_pb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   Name        20 non-null     object  
 1   Age         20 non-null     int64   
 2   City        20 non-null     category
 3   Salary_INR  20 non-null     float64 
dtypes: category(1), float64(1), int64(1), object(1)
memory usage: 1004.0+ bytes
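Note that `info()` reports a larger footprint after the conversion (772.0+ bytes before, 1004.0+ bytes after): on a 20-row frame the bookkeeping of a categorical outweighs any saving. The sketch below, on a made-up series, shows what that bookkeeping is: each distinct label is stored once, and every row holds only a small integer code.

```python
import pandas as pd

# Hypothetical city values with repeats
city = pd.Series(["Mumbai", "Delhi", "Mumbai", "Chennai", "Delhi"]).astype("category")

# Distinct labels, stored once (sorted by default when the values are sortable)
labels = city.cat.categories.tolist()

# Per-row integer codes that index into the labels
codes = city.cat.codes.tolist()

print(labels)  # ['Chennai', 'Delhi', 'Mumbai']
print(codes)   # [2, 1, 2, 0, 1]
```

The saving grows with the number of rows per distinct label, which is why the benefit shows up on large, repetitive columns.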


### **🧩 Converting String Numbers → Actual Numbers**

In [9]:
# ------------------------
# Convert 'Salary_INR' column to numeric type
# - pd.to_numeric() parses each value as a number (int or float)
# - errors='coerce' replaces any value that cannot be parsed (text, special characters) with NaN
# - 'Salary_INR' is already float64 at this point, so nothing changes here;
#   the call matters when the column arrives as text
df_pb['Salary_INR'] = pd.to_numeric(df_pb['Salary_INR'], errors='coerce')

# ------------------------
# Display concise summary of the DataFrame to verify the change
# - Check that 'Salary_INR' dtype is now numeric (float64)
# - Also shows how many non-null values exist (useful to detect NaNs after coercion)
df_pb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   Name        20 non-null     object  
 1   Age         20 non-null     int64   
 2   City        20 non-null     category
 3   Salary_INR  20 non-null     float64 
dtypes: category(1), float64(1), int64(1), object(1)
memory usage: 1004.0+ bytes


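`errors` parameter:

| Value      | Meaning                                                  |
|------------|----------------------------------------------------------|
| `'raise'`  | Raise an error on bad values (the default)               |
| `'coerce'` | Convert bad values → NaN                                 |
| `'ignore'` | Return the input unchanged (deprecated since pandas 2.2) |

Prefer `errors='coerce'` while cleaning, then count the resulting NaNs to find the bad values. Use `'raise'` when a bad value should stop the pipeline.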
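The dataset used in this lesson is already clean, so the effect of `errors` is not visible in the output above. The sketch below uses made-up dirty values to show it, and applies the same parameter to `pd.to_datetime()`, which is listed in the objectives. The series names and the date format are assumptions for illustration.

```python
import pandas as pd

# Hypothetical dirty inputs (not from the lesson's dataset)
raw_salary = pd.Series(["115059", "93035", "N/A", "sixty thousand"])
raw_joined = pd.Series(["2023-01-15", "2023-02-30", "not a date"])

# errors='coerce': unparseable values become NaN (numbers) or NaT (dates)
salary = pd.to_numeric(raw_salary, errors="coerce")
joined = pd.to_datetime(raw_joined, format="%Y-%m-%d", errors="coerce")

print(salary.isna().sum())  # 2 values could not be parsed as numbers
print(joined.isna().sum())  # 2: February 30 does not exist, and one entry is plain text

# errors='raise' (the default) stops at the first bad value instead
try:
    pd.to_numeric(raw_salary, errors="raise")
    raised = False
except ValueError as exc:
    raised = True
    print("to_numeric raised:", exc)
```

Passing an explicit `format` to `pd.to_datetime()` makes the parsing rule visible and avoids guessing between day-first and month-first dates.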


#### **🧩 Detecting Hidden Bad Data**

In [10]:
# ------------------------
# Identify rows in 'Salary_INR' column that cannot be converted to numeric
# - pd.to_numeric(..., errors='coerce') converts invalid entries to NaN
# - .isna() returns True for these invalid entries
# - Using this boolean mask to filter the original DataFrame
bad_rows = df_pb[pd.to_numeric(df_pb['Salary_INR'], errors='coerce').isna()]

# Display rows with invalid/non-numeric 'Salary_INR' values
print(bad_rows)

# Check unique values in Salary_INR
print(df_pb['Salary_INR'].unique())

# Or check if there are any NaNs after coercion
print(pd.to_numeric(df_pb['Salary_INR'], errors='coerce').isna().sum())


Empty DataFrame
Columns: [Name, Age, City, Salary_INR]
Index: []
[115059.  93035.  61033. 187550. 162866. 139457.  98410. 169869. 122722.
 105380. 106578.  68630.  45663. 124279. 180941.  40848. 175909. 170697.
  77049. 137382.]
0
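The check above returns an empty frame because `Salary_INR` is already numeric. A hedged sketch of the same idea on raw text follows; the names and values are made up, and the cleanup rules (trimming spaces, dropping thousands separators) are assumptions about what "hidden" problems might look like in a CSV or Excel export.

```python
import pandas as pd

# Hypothetical raw import where salaries arrived as text
raw = pd.DataFrame({
    "Name": ["Aarav", "Vivaan", "Diya", "Kabir"],
    "Salary_INR": ["115059", " 93,035 ", "N/A", "61033"],
})

# Fix the recoverable problems first: stray spaces and thousands separators
cleaned = (
    raw["Salary_INR"]
    .str.strip()
    .str.replace(",", "", regex=False)
)
coerced = pd.to_numeric(cleaned, errors="coerce")

# Rows that held a value but still failed to parse are the truly bad ones.
# Build the mask BEFORE overwriting the column, so the original text stays visible.
bad_rows = raw[coerced.isna() & raw["Salary_INR"].notna()]
print(bad_rows)

# Store the parsed numbers in a new column, next to the original text
raw["Salary_clean"] = coerced
```

Keeping the original text column until the bad rows are reviewed makes it possible to see what was actually typed, not just that a NaN appeared.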


**🧩 Checking Memory Usage**

In [11]:
import time

# ------------------------
# Create a large DataFrame with repeated city names
n = 1_000_000
cities = ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Kolkata']
df = pd.DataFrame({
    'City_str': np.random.choice(cities, n),
    'City_cat': np.random.choice(cities, n)
})

# ------------------------
# Convert one column to string (object) and another to category
df['City_str'] = df['City_str'].astype(str)
df['City_cat'] = df['City_cat'].astype('category')

# ------------------------
# Check memory usage
print("Memory usage (object/string):")
print(df['City_str'].memory_usage(deep=True))

print("\nMemory usage (category):")
print(df['City_cat'].memory_usage(deep=True))

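# ------------------------
# Compare performance for a groupby operation
# time.perf_counter() is a monotonic, high-resolution clock intended for timing code
start = time.perf_counter()
df.groupby('City_str').size()
end = time.perf_counter()
print("\nTime for groupby on object/string:", round(end - start, 4), "seconds")

# observed=True groups only by categories that occur in the data and avoids the
# FutureWarning pandas emits when a categorical grouper relies on the default
start = time.perf_counter()
df.groupby('City_cat', observed=True).size()
end = time.perf_counter()
print("Time for groupby on category:", round(end - start, 4), "seconds")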


Memory usage (object/string):
55798604

Memory usage (category):
1000583

Time for groupby on object/string: 0.0461 seconds
Time for groupby on category: 0.008 seconds
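The objectives also list schema validation. One minimal way to do it, sketched here under assumptions, is to declare the expected dtype of each column and report every mismatch. The `EXPECTED_SCHEMA` mapping and the `validate_schema` helper are hypothetical names introduced for this example, not part of pandas.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical schema: column name -> check that the column must pass
EXPECTED_SCHEMA = {
    "Age": ptypes.is_integer_dtype,
    "Salary_INR": ptypes.is_float_dtype,
    "City": lambda col: isinstance(col.dtype, pd.CategoricalDtype),
}


def validate_schema(df, schema):
    """Return a list of problems; an empty list means the frame passed."""
    problems = []
    for column, check in schema.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif not check(df[column]):
            problems.append(f"unexpected dtype for {column}: {df[column].dtype}")
    return problems


# A frame shaped like df_pb after the conversions in this lesson
good = pd.DataFrame({
    "Age": [23, 50],
    "City": pd.Categorical(["Mumbai", "Ahmedabad"]),
    "Salary_INR": [115059.0, 93035.0],
})

# A frame with text ages and no City column
bad = pd.DataFrame({"Age": ["23", "50"], "Salary_INR": [115059.0, 93035.0]})

good_problems = validate_schema(good, EXPECTED_SCHEMA)
bad_problems = validate_schema(bad, EXPECTED_SCHEMA)
print(good_problems)  # []
print(bad_problems)
```

Returning a list instead of raising on the first problem lets a single run report everything that is wrong with an incoming file.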


