# Measuring Completeness

**Activity Overview**: Evaluate data completeness by checking missing data rates and handling partially available records.

## Title: Customer Profiles

**Task**: Calculate the missing data rate for customer profiles.

**Steps**:
1. List all required fields for a complete customer profile (e.g., name, address, email, phone number).
2. Analyze the dataset to count how many profiles have missing fields.
3. Calculate the percentage of missing data fields across all profiles.
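
The steps above reduce to a few ratios. As a sketch, let $N$ be the number of profiles, $F$ the number of required fields, and $m_{ij} = 1$ when field $j$ of profile $i$ is missing and $0$ otherwise. These symbols are introduced here for illustration and are not part of the exercise text.

$$
\text{field rate}_j = \frac{100}{N}\sum_{i=1}^{N} m_{ij}, \qquad
\text{profile rate} = \frac{100}{N}\,\Bigl|\Bigl\{\, i : \sum_{j=1}^{F} m_{ij} > 0 \Bigr\}\Bigr|, \qquad
\text{overall rate} = \frac{100}{N F}\sum_{i=1}^{N}\sum_{j=1}^{F} m_{ij}
$$

For the four sample profiles in this exercise ($N = 4$, $F = 4$, one missing email and one missing phone number), the field rate is 25% for both `email` and `phone number`, the profile rate is 50%, and the overall rate is $2/16 = 12.5\%$.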

In [1]:
# Write your code from here
import pandas as pd

# Build a small sample dataset; empty strings stand in for missing values
data = {
    'name': ['John Doe', 'Jane Smith', 'Bill Gates', 'Elon Musk'],
    'address': ['123 Main St', '456 Oak St', '789 Pine St', '101 Tech Ave'],
    'email': ['john@example.com', '', 'bill@microsoft.com', 'elon@spacex.com'],
    'phone number': ['555-1234', '555-5678', '', '555-9876']
}

df = pd.DataFrame(data)

# Save DataFrame to CSV
df.to_csv('customer_data.csv', index=False)

In [2]:
import pandas as pd

def load_data(file_path):
    """Loads data from a CSV file and handles exceptions."""
    try:
        # Treat 'na', 'n/a', and empty strings as missing, in addition to pandas' default NA markers
        df = pd.read_csv(file_path, na_values=['na', 'n/a', ''])
        return df
    except FileNotFoundError:
        print(f"Error: The file {file_path} does not exist.")
        return None
    except pd.errors.ParserError:
        print("Error: The file could not be parsed.")
        return None

def calculate_missing_data_percentage(df):
    """Calculates the missing data percentage for required fields."""
    required_fields = ['name', 'address', 'email', 'phone number']
    
    # Ensure required fields exist in the DataFrame
    for field in required_fields:
        if field not in df.columns:
            print(f"Error: Missing required field: {field}")
            # Return a pair so the caller's tuple unpacking does not raise a TypeError
            return None, None
    
    # Check for missing data in each field
    missing_data = df[required_fields].isnull()
    
    # Calculate the number of missing values per field
    missing_per_field = missing_data.sum()
    
    # Calculate the total number of profiles
    total_profiles = len(df)
    
    # Calculate the percentage of missing data per field
    missing_percentage = (missing_per_field / total_profiles) * 100
    
    # Calculate the overall percentage of profiles with missing data in any required field
    missing_profiles = missing_data.any(axis=1).sum()
    missing_data_percentage = (missing_profiles / total_profiles) * 100
    
    return missing_percentage, missing_data_percentage

def main():
    # Load the data
    df = load_data('customer_data.csv')
    if df is not None:
        # Calculate missing data percentages
        missing_percentage, missing_data_percentage = calculate_missing_data_percentage(df)
        
        if missing_percentage is not None:
            print(f"Missing data percentage per field:\n{missing_percentage}")
            print(f"Overall percentage of profiles with missing data: {missing_data_percentage:.2f}%")

if __name__ == "__main__":
    main()

Missing data percentage per field:
name             0.0
address          0.0
email           25.0
phone number    25.0
dtype: float64
Overall percentage of profiles with missing data: 50.00%
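
Step 3 asks for the share of missing fields across all profiles, which the output above does not report directly: it shows per-field rates and the share of profiles with at least one gap. The following is a minimal sketch of that cell-level rate, together with a per-profile completeness score for spotting partially available records. The helper names `missing_mask` and `completeness_report` are hypothetical, and counting whitespace-only strings as missing is an assumption rather than something the exercise specifies.

```python
import pandas as pd

REQUIRED_FIELDS = ['name', 'address', 'email', 'phone number']

def missing_mask(df, fields=REQUIRED_FIELDS):
    """Return a boolean DataFrame that is True where a required value is missing.

    Assumption: blank or whitespace-only strings count as missing, like NaN.
    """
    subset = df[fields]
    blank = subset.apply(
        lambda col: col.map(lambda v: isinstance(v, str) and v.strip() == "")
    ).astype(bool)
    return subset.isna() | blank

def completeness_report(df, fields=REQUIRED_FIELDS):
    """Return (cell-level missing rate in %, per-profile completeness in %)."""
    mask = missing_mask(df, fields)
    total_cells = mask.size  # profiles x required fields
    cell_rate = mask.to_numpy().sum() / total_cells * 100 if total_cells else 0.0
    per_profile = (1 - mask.mean(axis=1)) * 100
    return cell_rate, per_profile

# Same sample data as in the first cell
df = pd.DataFrame({
    'name': ['John Doe', 'Jane Smith', 'Bill Gates', 'Elon Musk'],
    'address': ['123 Main St', '456 Oak St', '789 Pine St', '101 Tech Ave'],
    'email': ['john@example.com', '', 'bill@microsoft.com', 'elon@spacex.com'],
    'phone number': ['555-1234', '555-5678', '', '555-9876'],
})

cell_rate, per_profile = completeness_report(df)
print(f"Missing fields across all profiles: {cell_rate:.2f}%")  # 12.50%

# Partially available records: profiles that are less than 100% complete
partial = df[per_profile < 100]
print(partial[['name']].assign(completeness=per_profile[per_profile < 100]))
```

With the sample data, 2 of the 16 required cells are empty, so the cell-level rate is 12.50%, and the profiles for Jane Smith and Bill Gates are each 75% complete. Whether such partial records should be kept, flagged, or dropped depends on how the fields are used downstream.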
