**Task 1**: Checking Null Values for Completeness

**Description**: Verify if there are any null values in a dataset, which indicate incomplete data.

In [None]:
# Write your code from here

In [1]:
import pandas as pd

# Example dataset
data = {
    'Name': ['Alice', 'Bob', None, 'David'],
    'Age': [25, None, 35, 40],
    'Email': ['alice@example.com', 'bob@example.com', None, 'david@example.com']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Check for null values
null_values = df.isnull().sum()

# Display the null values count for each column
print("Null values in each column:")
print(null_values)

Null values in each column:
Name     1
Age      1
Email    1
dtype: int64
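
Counts alone do not show how large the gap is or which rows are affected. A minimal sketch, reusing the same example data under the illustrative name `df_example`, that reports the share of missing values per column and lists the incomplete rows:

```python
import pandas as pd

# Same example data as the cell above (df_example is an illustrative name)
df_example = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'David'],
    'Age': [25, None, 35, 40],
    'Email': ['alice@example.com', 'bob@example.com', None, 'david@example.com'],
})

# Fraction of missing values per column (0.0 = complete, 1.0 = entirely empty)
missing_share = df_example.isna().mean()

# Rows that have at least one missing field
incomplete_rows = df_example[df_example.isna().any(axis=1)]

print(missing_share)
print(incomplete_rows)
```

With this data each column is 25% missing, and rows 1 and 2 are the incomplete ones.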


**Task 2**: Checking Data Type Validity

**Description**: Ensure that columns contain data of expected types, e.g., ages are integers.

In [None]:
# Write your code from here

In [2]:
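# Check if the data types of each column are as expected.
# 'Age' is expected to be float64 rather than an integer type because its missing value is stored as NaN.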
expected_dtypes = {
    'Name': 'object',
    'Age': 'float64',
    'Email': 'object'
}

# Compare the actual dtypes with the expected dtypes
dtype_validity = df.dtypes == pd.Series(expected_dtypes)

# Display the result of the data type validity check
print("Data type validity check:")
print(dtype_validity)

Data type validity check:
Name     True
Age      True
Email    True
dtype: bool
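
The description asks for integer ages, while the column is `float64` because of the missing value. One possible way to reconcile the two, sketched here under the assumption that pandas' nullable `Int64` dtype is acceptable for the dataset, is to confirm that every present value is a whole number and then convert:

```python
import pandas as pd

# The missing value forces float64 when the Series is built
ages = pd.Series([25, None, 35, 40], name='Age')

# Every value that is present must be a whole number to count as integer data
present = ages.dropna()
all_whole = bool((present % 1 == 0).all())

# Nullable integer dtype keeps the gap as <NA> instead of forcing floats
ages_int = ages.astype('Int64') if all_whole else ages

print(ages.dtype, ages_int.dtype)
```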


**Task 3**: Verify Uniqueness of Identifiers

**Description**: Check if a dataset has unique identifiers (e.g., emails).

In [None]:
# Write your code from here

In [3]:
# Check if the 'Email' column contains unique values
email_uniqueness = df['Email'].is_unique

# Display the result of the uniqueness check
print("Are email addresses unique?")
print(email_uniqueness)

Are email addresses unique?
True
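
`is_unique` compares raw values, so addresses that differ only in letter case or stray whitespace pass as distinct. The sketch below uses hypothetical data and assumes that case and surrounding whitespace should be ignored when comparing identifiers; that is a policy choice, not something the dataset dictates.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 differ only in case and a trailing space
emails = pd.Series(['alice@example.com', 'Alice@Example.com ', None, 'david@example.com'])

# The raw values are all distinct, so the plain check passes
print(emails.is_unique)

# Normalise before comparing; missing values are left out of the comparison
normalised = emails.dropna().str.strip().str.lower()

# keep=False marks every member of a duplicate group, not only the later ones
duplicate_mask = normalised.duplicated(keep=False)
print(normalised[duplicate_mask])
```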


**Task 4**: Validate Email Format Using Regex

**Description**: Validate if email addresses in a dataset have the correct format.

In [None]:
# Write your code from here

In [4]:
import re

# Define a function to validate email format using regex
def validate_email_format(email):
    if not isinstance(email, str):  # None and NaN both count as invalid
        return False
    email_regex = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(email_regex, email))

# Apply the function to the 'Email' column
df['Email_Valid'] = df['Email'].apply(validate_email_format)

# Display the DataFrame with the new column indicating email validity
print("DataFrame with email validity check:")
print(df)

DataFrame with email validity check:
    Name   Age              Email  Email_Valid
0  Alice  25.0  alice@example.com         True
1    Bob   NaN    bob@example.com         True
2   None  35.0               None        False
3  David  40.0  david@example.com         True
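
The same check can be written without a Python-level function. A minimal sketch using `Series.str.fullmatch`, with the pattern from the cell above minus the `^` and `$` anchors, which `fullmatch` makes unnecessary:

```python
import pandas as pd

emails = pd.Series(['alice@example.com', 'bob@example.com', None, 'david@example.com'])
email_regex = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# fullmatch requires the whole string to match; na=False reports missing values as invalid
email_valid = emails.str.fullmatch(email_regex, na=False)

print(email_valid.tolist())
```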


**Task 5**: Check for Logical Age Validity

**Description**: Ensure ages are within a reasonable human range (e.g., 0-120).

In [None]:
# Write your code from here

In [5]:
# Check if the values in the 'Age' column are within a reasonable human range (0-120)
def is_age_valid(age):
    if pd.isna(age):  # a missing age is stored as NaN, not None
        return False
    return 0 <= age <= 120

# Apply the function to the 'Age' column
df['Age_Valid'] = df['Age'].apply(is_age_valid)

# Display the DataFrame with the new column indicating age validity
print("DataFrame with age validity check:")
print(df)

DataFrame with age validity check:
    Name   Age              Email  Email_Valid  Age_Valid
0  Alice  25.0  alice@example.com         True       True
1    Bob   NaN    bob@example.com         True      False
2   None  35.0               None        False       True
3  David  40.0  david@example.com         True       True
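
For a whole column, `Series.between` expresses the same rule in one call. In this sketch the last two values are hypothetical bad entries added to show the failing case; missing values compare as `False`.

```python
import pandas as pd

# The last two values are hypothetical out-of-range entries
ages = pd.Series([25, None, 35, 40, -3, 150])

# Both bounds are inclusive by default; NaN never satisfies the comparison
age_valid = ages.between(0, 120)

print(age_valid.tolist())
```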


**Task 6**: Identify and Handle Missing Data

**Description**: Identify missing values in a dataset and impute them using a simple strategy (e.g., mean).

In [None]:
# Write your code from here

In [6]:
# Identify missing values and impute them using a simple strategy (e.g., mean for numerical columns)

# Impute missing values in the 'Age' column with the mean age.
# The result is assigned back to the column: calling fillna(..., inplace=True) on df['Age']
# raises a FutureWarning and, per that warning, will stop working in pandas 3.0.
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)

# Impute missing values in the 'Name' and 'Email' columns with a placeholder
df['Name'] = df['Name'].fillna('Unknown')
df['Email'] = df['Email'].fillna('unknown@example.com')

# Display the DataFrame after handling missing data
print("DataFrame after imputing missing values:")
print(df)

DataFrame after imputing missing values:
      Name        Age                Email  Email_Valid  Age_Valid
0    Alice  25.000000    alice@example.com         True       True
1      Bob  33.333333      bob@example.com         True      False
2  Unknown  35.000000  unknown@example.com        False       True
3    David  40.000000    david@example.com         True       True
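
Imputation overwrites the evidence that a value was ever missing. A minimal sketch, where `Age_Was_Missing` is an illustrative column name, that records the gap before filling it and assigns the filled column back rather than calling `fillna(..., inplace=True)` on a selection:

```python
import pandas as pd

df_example = pd.DataFrame({'Age': [25, None, 35, 40]})

# Record which values were missing before they are overwritten
df_example['Age_Was_Missing'] = df_example['Age'].isna()

# mean() skips NaN by default, so this is the mean of the three known ages
mean_age_example = df_example['Age'].mean()

# Assign the filled column back to the DataFrame
df_example['Age'] = df_example['Age'].fillna(mean_age_example)

print(df_example)
```

Keeping the indicator column means later checks can ask whether a value was imputed without comparing it to the mean.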

**Task 7**: Detect Duplicates

**Description**: Detect duplicate rows in the dataset.

In [None]:
# Write your code from here

In [7]:
# Detect duplicate rows in the dataset
duplicate_rows = df[df.duplicated()]

# Display the duplicate rows, if any
if not duplicate_rows.empty:
    print("Duplicate rows in the dataset:")
    print(duplicate_rows)
else:
    print("No duplicate rows found in the dataset.")

No duplicate rows found in the dataset.
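
`duplicated()` with no arguments only flags rows that match on every column. The sketch below uses hypothetical data to show how the `subset` and `keep` parameters catch records that share an identifier but differ elsewhere:

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 share an email but not a name
df_example = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice B.'],
    'Email': ['alice@example.com', 'bob@example.com', 'alice@example.com'],
})

# Full-row comparison finds nothing here
exact_duplicates = df_example[df_example.duplicated()]

# Comparing on 'Email' only, and keeping every member of each duplicate group
same_email = df_example[df_example.duplicated(subset=['Email'], keep=False)]

print(exact_duplicates.empty)
print(same_email)
```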


**Task 8**: Validate Correctness of Numerical Values

**Description**: Ensure numerical columns are within a specified range.

In [None]:
# Write your code from here

In [8]:
# Validate numerical columns to ensure they are within a specified range
# For this task, we will validate the 'Age' column to ensure all values are between 0 and 120

# Define a function to validate numerical values within a specified range
def validate_numerical_range(value, min_value, max_value):
    if pd.isna(value):  # missing numbers are stored as NaN, not None
        return False
    return min_value <= value <= max_value

# Apply the function to the 'Age' column
df['Age_In_Range'] = df['Age'].apply(lambda x: validate_numerical_range(x, 0, 120))

# Display the DataFrame with the new column indicating if 'Age' is within the specified range
print("DataFrame with numerical range validity check:")
print(df)

DataFrame with numerical range validity check:
      Name        Age                Email  Email_Valid  Age_Valid  \
0    Alice  25.000000    alice@example.com         True       True   
1      Bob  33.333333      bob@example.com         True      False   
2  Unknown  35.000000  unknown@example.com        False       True   
3    David  40.000000    david@example.com         True       True   

   Age_In_Range  
0          True  
1          True  
2          True  
3          True  
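
When several numerical columns need bounds, the limits can live in one table instead of one function call per column. A sketch under stated assumptions: the `Score` column, the sample values, and the `range_rules` dictionary are all hypothetical.

```python
import pandas as pd

# Hypothetical data with one out-of-range value per column
df_example = pd.DataFrame({'Age': [25, 130, 35], 'Score': [0.5, 0.9, 1.2]})

# Hypothetical rule table: column name -> (minimum, maximum), both inclusive
range_rules = {'Age': (0, 120), 'Score': (0.0, 1.0)}

# One boolean column per rule
in_range = pd.DataFrame({
    column: df_example[column].between(low, high)
    for column, (low, high) in range_rules.items()
})

# Rows that break at least one rule
out_of_range_rows = df_example[~in_range.all(axis=1)]

print(in_range)
print(out_of_range_rows)
```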


**Task 9**: Custom Completeness Rule Violation Report

**Description**: Create a report showing which rows violate specific completeness rules, such as mandatory fields being empty.

In [None]:
# Write your code from here

In [9]:
# Custom Completeness Rule Violation Report

# Define a function to check for completeness rule violations.
# The gaps were filled in Task 6, so missing fields are recognised by their placeholder values.
def check_completeness_violations(row):
    violations = []
    if row['Name'] == 'Unknown':  # Placeholder means Name was missing
        violations.append('Name is missing')
    if row['Age'] == mean_age:  # Equals the imputed mean; a genuine age equal to the mean would also match
        violations.append('Age was imputed')
    if row['Email'] == 'unknown@example.com':  # Placeholder means Email was missing
        violations.append('Email is missing or invalid')
    return ', '.join(violations) if violations else 'No violations'

# Apply the function to each row of the DataFrame
df['Completeness_Violations'] = df.apply(check_completeness_violations, axis=1)

# Display the DataFrame with the completeness violations report
print("DataFrame with completeness rule violations report:")
print(df[['Name', 'Age', 'Email', 'Completeness_Violations']])

DataFrame with completeness rule violations report:
      Name        Age                Email  \
0    Alice  25.000000    alice@example.com   
1      Bob  33.333333      bob@example.com   
2  Unknown  35.000000  unknown@example.com   
3    David  40.000000    david@example.com   

                        Completeness_Violations  
0                                 No violations  
1                               Age was imputed  
2  Name is missing, Email is missing or invalid  
3                                 No violations  
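
Detecting gaps through placeholder values depends on the imputation step and can misfire when a real value happens to equal the placeholder. An alternative sketch that builds the report from the data before imputation, assuming every column is mandatory (`raw`, `mandatory`, and `missing_fields` are illustrative names):

```python
import pandas as pd

# The example data before any imputation
raw = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'David'],
    'Age': [25, None, 35, 40],
    'Email': ['alice@example.com', 'bob@example.com', None, 'david@example.com'],
})

# Assumption: all three columns are mandatory
mandatory = ['Name', 'Age', 'Email']

def missing_fields(row):
    """Return the mandatory fields that are empty in this row, comma-separated."""
    missing = [column for column in mandatory if pd.isna(row[column])]
    return ', '.join(missing) if missing else 'No violations'

report = raw[mandatory].apply(missing_fields, axis=1)

print(report.tolist())
```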


**Task 10**: Advanced Regex for Data Validity Check

**Description**: Check for validity with advanced regex patterns, such as validating complex fields with multi-level rules.

In [None]:
# Write your code from here

In [10]:
# Import necessary libraries
import pandas as pd
import re

# Define the dataset
data = {
    'Name': ['Alice', 'Bob', None, 'David'],
    'Age': [25, None, 35, 40],
    'Email': ['alice@example.com', 'bob@example.com', None, 'david@example.com']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Task 1: Checking Null Values for Completeness
null_values = df.isnull().sum()
print("Null values in each column:")
print(null_values)

# Task 2: Checking Data Type Validity
expected_dtypes = {
    'Name': 'object',
    'Age': 'float64',
    'Email': 'object'
}
dtype_validity = df.dtypes == pd.Series(expected_dtypes)
print("Data type validity check:")
print(dtype_validity)

# Task 3: Verify Uniqueness of Identifiers
email_uniqueness = df['Email'].is_unique
print("Are email addresses unique?")
print(email_uniqueness)

# Task 4: Validate Email Format Using Regex
def validate_email_format(email):
    if not isinstance(email, str):  # None and NaN both count as invalid
        return False
    email_regex = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(email_regex, email))

df['Email_Valid'] = df['Email'].apply(validate_email_format)
print("DataFrame with email validity check:")
print(df)

# Task 5: Check for Logical Age Validity
def is_age_valid(age):
    if pd.isna(age):  # a missing age is stored as NaN, not None
        return False
    return 0 <= age <= 120

df['Age_Valid'] = df['Age'].apply(is_age_valid)
print("DataFrame with age validity check:")
print(df)

# Task 6: Identify and Handle Missing Data
# Assign the filled columns back; fillna(..., inplace=True) on df[col] is deprecated
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)
df['Name'] = df['Name'].fillna('Unknown')
df['Email'] = df['Email'].fillna('unknown@example.com')
print("DataFrame after imputing missing values:")
print(df)

# Task 7: Detect Duplicates
duplicate_rows = df[df.duplicated()]
if not duplicate_rows.empty:
    print("Duplicate rows in the dataset:")
    print(duplicate_rows)
else:
    print("No duplicate rows found in the dataset.")

# Task 8: Validate Correctness of Numerical Values
def validate_numerical_range(value, min_value, max_value):
    if pd.isna(value):  # missing numbers are stored as NaN, not None
        return False
    return min_value <= value <= max_value

df['Age_In_Range'] = df['Age'].apply(lambda x: validate_numerical_range(x, 0, 120))
print("DataFrame with numerical range validity check:")
print(df)

# Task 9: Custom Completeness Rule Violation Report
def check_completeness_violations(row):
    violations = []
    if row['Name'] == 'Unknown':  # Check if Name is missing
        violations.append('Name is missing')
    if row['Age'] == mean_age:  # Check if Age was imputed
        violations.append('Age was imputed')
    if row['Email'] == 'unknown@example.com':  # Check if Email was imputed
        violations.append('Email is missing or invalid')
    return ', '.join(violations) if violations else 'No violations'

df['Completeness_Violations'] = df.apply(check_completeness_violations, axis=1)
print("DataFrame with completeness rule violations report:")
print(df[['Name', 'Age', 'Email', 'Completeness_Violations']])

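
One possible reading of "multi-level rules" in the Task 10 description is a first pattern that checks the overall shape of a field and captures its parts, followed by stricter rules applied to each part. The sketch below applies that idea to email addresses. The specific rules and the names `EMAIL_SHAPE`, `DOMAIN_LABEL`, and `email_violations` are illustrative assumptions, not a complete implementation of any email standard.

```python
import re

# Level 1: overall shape, with the parts captured in named groups
EMAIL_SHAPE = re.compile(
    r'(?P<local>[A-Za-z0-9._%+-]+)@(?P<domain>[A-Za-z0-9.-]+)\.(?P<tld>[A-Za-z]{2,})'
)

# Level 2: each dot-separated domain label must start and end with a letter or digit
DOMAIN_LABEL = re.compile(r'[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?')

def email_violations(email):
    """Return a list of rule violations; an empty list means every level passed."""
    if not isinstance(email, str):
        return ['missing']
    match = EMAIL_SHAPE.fullmatch(email)
    if match is None:
        return ['malformed']
    problems = []
    local = match.group('local')
    # Local part: no leading, trailing, or consecutive dots
    if local.startswith('.') or local.endswith('.') or '..' in local:
        problems.append('bad dots in local part')
    # Domain: every label must satisfy the level 2 pattern
    labels = match.group('domain').split('.')
    if not all(DOMAIN_LABEL.fullmatch(label) for label in labels):
        problems.append('bad domain label')
    return problems

for candidate in ['alice@example.com', 'a..b@example.com', 'bob@-example.com', None]:
    print(candidate, email_violations(candidate))
```

Returning a list of named violations rather than a single boolean lets the result feed a per-row report like the one in Task 9.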