## Data Validation & Error Correction with Custom Functions


Data validation and error correction are key steps in data analysis. Validation checks that values are accurate and well-formed, while error correction fixes the inconsistencies that validation uncovers. In pandas, you can write custom functions for the validation and correction rules your dataset needs and apply them to individual columns.


In [1]:
from faker import Faker
import pandas as pd

# Initialize Faker
fake = Faker()

# Generate fake data
data = {
    'Name': [fake.name() for _ in range(10)],
    'Age': [fake.random_int(min=18, max=80) for _ in range(10)],
    'Email': [fake.email() for _ in range(10)],
    'Phone': [fake.phone_number() for _ in range(10)]
}

# Create DataFrame
df = pd.DataFrame(data)
print(df)


                 Name  Age                        Email                  Phone
0      Amanda Vasquez   41          fwilson@example.com     (531)368-6830x5212
1      Courtney Black   67          ljuarez@example.org        +1-564-264-6685
2      Hannah Mcgrath   60        patrick22@example.com       001-905-371-6278
3  Joseph Ramirez DDS   35        jessica23@example.org           303-888-2339
4         Laura Jones   20            pdiaz@example.net       001-623-823-1900
5      Larry Galloway   43  michaelmcdonald@example.com  001-833-795-8327x9961
6     Jimmy Gallagher   21      sydneysmith@example.org     236.670.3451x51351
7       Brenda Steele   33          yharris@example.net      439.594.8470x3280
8       Jeremy Golden   30       ginaconway@example.org   001-783-612-1250x849
9       Vanessa Hicks   23         dgardner@example.org     431.923.9833x36920


## Custom Functions for Data Validation & Error Correction
Now, let's create custom functions to perform data validation and error correction on the generated dataset.

In [2]:
# Custom function for validating email addresses.
# This only checks that '@' is present; it does not verify the domain or overall format.
def validate_email(email):
    return '@' in email

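# Custom function for correcting phone numbers
def correct_phone(phone):
    # Keep digits only; this also merges any extension (e.g. 'x5212') into the number
    phone = ''.join(filter(str.isdigit, phone))
    # Prepend the US country code to bare 10-digit numbers;
    # numbers that already carry a prefix (e.g. '1' or '001') are left unchanged
    if len(phone) == 10:
        phone = '+1' + phone
    return phone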
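
The two functions above are deliberately simple. `validate_email` accepts any string that contains `@`, and `correct_phone` folds extensions into the number and does not normalize `001` or `1` prefixes. The following is a minimal sketch of stricter variants, not part of the original notebook. The names `is_plausible_email` and `normalize_us_phone` are hypothetical. The sketch assumes every phone number is a US-style number, as Faker's default locale tends to generate, and the email pattern is only a shape check, not a full address validator.

```python
import re

# Hypothetical helpers for illustration; the names are not from the notebook.
# Shape check only: one '@', no whitespace, and a dot somewhere in the domain part.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_plausible_email(email):
    """Return True if the value looks like local@domain.tld."""
    return isinstance(email, str) and EMAIL_PATTERN.fullmatch(email) is not None


def normalize_us_phone(phone):
    """Return (number, extension).

    number is '+1' followed by 10 digits, or None if the input does not
    reduce to a 10-digit number (assumption: US-style numbers only).
    extension is the digit string after 'x', or None if there is none.
    """
    # Split off an extension such as 'x5212' before stripping punctuation.
    main, _, ext = phone.lower().partition('x')
    digits = ''.join(ch for ch in main if ch.isdigit())

    # Drop a leading international prefix ('001') or country code ('1').
    if len(digits) == 13 and digits.startswith('001'):
        digits = digits[3:]
    elif len(digits) == 11 and digits.startswith('1'):
        digits = digits[1:]

    extension = ext or None
    if len(digits) != 10:
        return None, extension
    return '+1' + digits, extension
```

With the DataFrame from this notebook, these could be applied in the same way as the original functions, for example `df['Email'].apply(is_plausible_email)` and `df['Phone'].apply(lambda p: normalize_us_phone(p)[0])`. Returning the extension separately keeps it available instead of merging it into the number.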


## Applying Custom Functions to the Dataset
Next, we'll apply the custom functions to validate and correct errors in the dataset.

In [3]:
# Validate email addresses
df['Valid Email'] = df['Email'].apply(validate_email)

# Correct phone numbers
df['Corrected Phone'] = df['Phone'].apply(correct_phone)

print(df)


                 Name  Age                        Email  \
0      Amanda Vasquez   41          fwilson@example.com   
1      Courtney Black   67          ljuarez@example.org   
2      Hannah Mcgrath   60        patrick22@example.com   
3  Joseph Ramirez DDS   35        jessica23@example.org   
4         Laura Jones   20            pdiaz@example.net   
5      Larry Galloway   43  michaelmcdonald@example.com   
6     Jimmy Gallagher   21      sydneysmith@example.org   
7       Brenda Steele   33          yharris@example.net   
8       Jeremy Golden   30       ginaconway@example.org   
9       Vanessa Hicks   23         dgardner@example.org   

                   Phone  Valid Email    Corrected Phone  
0     (531)368-6830x5212         True     53136868305212  
1        +1-564-264-6685         True        15642646685  
2       001-905-371-6278         True      0019053716278  
3           303-888-2339         True       +13038882339  
4       001-623-823-1900         True      001623823190