### Task 1: Measure Data Accuracy using a Trusted Source

**Description**: You have two datasets of product prices: `company_prices.csv` and
`trusted_prices.csv`. Check whether the prices in `company_prices.csv` match the prices in
`trusted_prices.csv`. Assume both files have "product_id" and "price" columns.

In [1]:
# Write your code from here
import pandas as pd

# Data for company_prices.csv
company_prices_data = {
    'product_id': [1, 2, 3, 4],
    'price': [10.99, 20.99, -5.99, 15.50]
}

# Data for trusted_prices.csv
trusted_prices_data = {
    'product_id': [1, 2, 3, 4],
    'price': [10.99, 19.99, 10.99, 15.50]
}

# Convert data into DataFrames
company_prices_df = pd.DataFrame(company_prices_data)
trusted_prices_df = pd.DataFrame(trusted_prices_data)

# Save the DataFrames to CSV files
company_prices_df.to_csv('company_prices.csv', index=False)
trusted_prices_df.to_csv('trusted_prices.csv', index=False)

print("CSV files have been created.")

# Task 1: Measure Data Accuracy using a Trusted Source

# Load datasets
company_prices = pd.read_csv('company_prices.csv')
trusted_prices = pd.read_csv('trusted_prices.csv')

# Merge data on product_id
merged_data = pd.merge(company_prices, trusted_prices, on='product_id', suffixes=('_company', '_trusted'))

# Check for mismatches between company prices and trusted prices
price_mismatches = merged_data[merged_data['price_company'] != merged_data['price_trusted']]

# Print mismatched rows
if not price_mismatches.empty:
    print("Price mismatches found:")
    print(price_mismatches[['product_id', 'price_company', 'price_trusted']])
else:
    print("All prices match with the trusted source.")


# Task 2: Detect Incorrect Values (Negative Price Values)

# Detect negative price values in company_prices.csv
negative_prices = company_prices[company_prices['price'] < 0]

# Print rows with negative prices
if not negative_prices.empty:
    print("\nNegative price values found:")
    print(negative_prices[['product_id', 'price']])
else:
    print("\nNo negative price values found.")

CSV files have been created.
Price mismatches found:
   product_id  price_company  price_trusted
1           2          20.99          19.99
2           3          -5.99          10.99

Negative price values found:
   product_id  price
2           3  -5.99
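
The comparison above uses an inner merge and exact float equality. An inner merge silently drops any product that is missing from one file, and exact equality can be brittle for floating-point prices. A minimal sketch of a more defensive variant follows. It rebuilds the same sample values in memory so it runs on its own, and the names `TOLERANCE`, `is_accurate`, and `accuracy_pct` are illustrative assumptions, not part of the task.

```python
import numpy as np
import pandas as pd

# Same sample values as the cell above, built in memory so the sketch is self-contained.
company = pd.DataFrame({"product_id": [1, 2, 3, 4], "price": [10.99, 20.99, -5.99, 15.50]})
trusted = pd.DataFrame({"product_id": [1, 2, 3, 4], "price": [10.99, 19.99, 10.99, 15.50]})

# An outer merge keeps products that appear in only one file;
# indicator=True adds a "_merge" column recording which side each row came from.
merged = company.merge(
    trusted,
    on="product_id",
    how="outer",
    suffixes=("_company", "_trusted"),
    indicator=True,
)

# Compare with a small absolute tolerance instead of exact float equality.
# TOLERANCE is an assumed value (half a cent); the task does not specify one.
TOLERANCE = 0.005
in_both = merged["_merge"] == "both"
prices_match = np.isclose(
    merged["price_company"], merged["price_trusted"], rtol=0, atol=TOLERANCE
)
merged["is_accurate"] = in_both & prices_match

# Accuracy rate: share of products whose company price agrees with the trusted price.
accuracy_pct = merged["is_accurate"].mean() * 100
print(f"Accuracy against trusted source: {accuracy_pct:.1f}%")
print(merged.loc[~merged["is_accurate"],
                 ["product_id", "price_company", "price_trusted", "_merge"]])
```

Reporting a single accuracy percentage alongside the mismatched rows makes the result easier to track over time than the list of mismatches alone.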


### Task 2: Detect Incorrect Values

**Description**: In `company_prices.csv`, detect any negative price values, which are invalid for prices.

In [None]:
# Write your code from here
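
The negative-price check for this task was already run in the Task 1 cell (the `negative_prices` filter). As a standalone sketch, the same idea can be extended to also flag prices that are missing or non-numeric. The sample values are the ones used in Task 1, and the column names `is_negative` and `is_missing_or_non_numeric` are assumptions made for illustration.

```python
import pandas as pd

# Same company prices as in Task 1, built in memory so the sketch runs on its own.
company = pd.DataFrame({"product_id": [1, 2, 3, 4], "price": [10.99, 20.99, -5.99, 15.50]})

# Coerce to numeric so that any non-numeric entry becomes NaN instead of raising.
price = pd.to_numeric(company["price"], errors="coerce")

# Flag each kind of problem in its own column so the cause stays visible.
issues = pd.DataFrame({
    "product_id": company["product_id"],
    "price": company["price"],
    "is_negative": price < 0,
    "is_missing_or_non_numeric": price.isna(),
})

# Keep only the rows with at least one problem and report the invalid rate.
invalid = issues[issues["is_negative"] | issues["is_missing_or_non_numeric"]]
invalid_pct = len(invalid) / len(issues) * 100

print(invalid)
print(f"Invalid price rate: {invalid_pct:.1f}%")
```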

### Task 3: Check Missing Data Rates

**Description**: Calculate the percentage of missing values in `customer_data.csv`.

In [2]:
# Write your code from here
import pandas as pd

# Create a sample dataset with missing data
data = {
    'customer_id': [1, 2, 3, 4, 5],
    'email': ['john.doe@example.com', 'jane.smith@example.com', None, 'mary.jones@example.com', None],
    'phone number': ['123-456-7890', '987-654-3210', '555-555-5555', None, None]
}

# Convert the data into a pandas DataFrame
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv('customer_data.csv', index=False)

# Print the DataFrame to verify the content
print("Sample customer_data.csv:")
print(df)

# Load the customer_data.csv file
df = pd.read_csv('customer_data.csv')

# Calculate missing data rates
missing_data = df.isnull().mean() * 100  # Percentage of missing data for each column

# Print the result
print("Percentage of missing data in each column:")
print(missing_data)

# Calculate the overall missing data percentage
overall_missing_data = df.isnull().mean().mean() * 100
print(f"\nOverall percentage of missing data in the dataset: {overall_missing_data:.2f}%")

Sample customer_data.csv:
   customer_id                   email  phone number
0            1    john.doe@example.com  123-456-7890
1            2  jane.smith@example.com  987-654-3210
2            3                    None  555-555-5555
3            4  mary.jones@example.com          None
4            5                    None          None
Percentage of missing data in each column:
customer_id      0.0
email           40.0
phone number    40.0
dtype: float64

Overall percentage of missing data in the dataset: 26.67%
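
The overall figure above is the share of missing cells (4 of 15). A complementary view is the share of records that have at least one gap, which is often the number that matters when deciding how many rows are usable. The sketch below rebuilds the same sample data in memory, and the names `summary` and `rows_with_gap_pct` are illustrative assumptions.

```python
import pandas as pd

# Same sample customers as the cell above, built in memory.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "email": ["john.doe@example.com", "jane.smith@example.com", None,
              "mary.jones@example.com", None],
    "phone number": ["123-456-7890", "987-654-3210", "555-555-5555", None, None],
})

missing_mask = customers.isna()

# Per-column counts and percentages in one table.
summary = pd.DataFrame({
    "missing_count": missing_mask.sum(),
    "missing_pct": missing_mask.mean() * 100,
})

# Cell-level rate: missing cells divided by all cells.
overall_pct = missing_mask.to_numpy().sum() / missing_mask.size * 100

# Row-level rate: records with at least one missing field.
rows_with_gap_pct = missing_mask.any(axis=1).mean() * 100

print(summary)
print(f"Missing cells: {overall_pct:.2f}%")
print(f"Records with at least one missing field: {rows_with_gap_pct:.2f}%")
```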


### Task 4: Handling Partially Available Records

**Description**: In `customer_data.csv`, identify records with missing "email" or "phone number" and decide whether to drop or fill them.
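
One way to make the drop-or-fill decision explicit is a per-record rule. The sketch below assumes a hypothetical policy that the task itself does not prescribe: a record with no contact channel at all is dropped, while a record with at least one is kept and its gap is filled with a placeholder. A flag column records which values were originally missing, so the placeholder cannot be mistaken for real data. The `*_missing` column names and the `"unknown"` placeholder are assumptions.

```python
import pandas as pd

# Same sample customers used throughout, built in memory.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "email": ["john.doe@example.com", "jane.smith@example.com", None,
              "mary.jones@example.com", None],
    "phone number": ["123-456-7890", "987-654-3210", "555-555-5555", None, None],
})
contact_cols = ["email", "phone number"]

# Count how many contact fields each record actually has.
n_contacts = customers[contact_cols].notna().sum(axis=1)

# Assumed policy: drop records with no contact channel, keep the rest.
kept = customers[n_contacts >= 1].copy()
dropped = customers[n_contacts == 0]

# Flag the original gaps before filling them with a placeholder.
for col in contact_cols:
    kept[f"{col}_missing"] = kept[col].isna()
    kept[col] = kept[col].fillna("unknown")

print("Kept records:")
print(kept)
print("\nDropped records:")
print(dropped)
```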

In [3]:
# Write your code from here
import pandas as pd

# Load the customer data from the CSV file
df = pd.read_csv('customer_data.csv')

# Display the DataFrame to check its contents
print("Original Data:")
print(df)

# Identify records with missing 'email' or 'phone number'
missing_contact_info = df[df['email'].isnull() | df['phone number'].isnull()]

# Print records with missing 'email' or 'phone number'
print("\nRecords with missing 'email' or 'phone number':")
print(missing_contact_info)

# Decision on how to handle missing data
# Option 1: Drop rows with missing 'email' or 'phone number'
df_cleaned_drop = df.dropna(subset=['email', 'phone number'])

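# Option 2: Fill missing values
# For this example, fill missing emails and phone numbers with placeholder values.
# Assign the result back instead of calling fillna(..., inplace=True) on a selected column:
# recent pandas versions warn about that pattern, and it may leave the DataFrame unchanged.
df_cleaned_fill = df.copy()
df_cleaned_fill['email'] = df_cleaned_fill['email'].fillna('unknown@example.com')
df_cleaned_fill['phone number'] = df_cleaned_fill['phone number'].fillna('000-000-0000')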

# Print the cleaned DataFrame (after dropping or filling missing values)
print("\nData after dropping rows with missing values:")
print(df_cleaned_drop)

print("\nData after filling missing values:")
print(df_cleaned_fill)

Original Data:
   customer_id                   email  phone number
0            1    john.doe@example.com  123-456-7890
1            2  jane.smith@example.com  987-654-3210
2            3                     NaN  555-555-5555
3            4  mary.jones@example.com           NaN
4            5                     NaN           NaN

Records with missing 'email' or 'phone number':
   customer_id                   email  phone number
2            3                     NaN  555-555-5555
3            4  mary.jones@example.com           NaN
4            5                     NaN           NaN

Data after dropping rows with missing values:
   customer_id                   email  phone number
0            1    john.doe@example.com  123-456-7890
1            2  jane.smith@example.com  987-654-3210

Data after filling missing values:
   customer_id                   email  phone number
0            1    john.doe@example.com  123-456-7890
1            2  jane.smith@example.com  987-654-3210
2    