## Ensuring Consistency in Multi-source Data Integration

**Description**: Validate the integration of two datasets, `products_A.csv` and `products_B.csv`, by checking that each product's "category" value is consistent across both.
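Before comparing categories, it can help to confirm that both sources cover the same products, since an inner join silently drops IDs found in only one file. The sketch below is one possible way to check this with an outer merge and pandas' `indicator=True` option. The frames `left` and `right` and their contents are hypothetical stand-ins, not the exercise data.

```python
import pandas as pd

# Hypothetical miniature frames; column names mirror the exercise's sample data.
left = pd.DataFrame({
    "product_id": [1, 2, 3],
    "category_A": ["Electronics", "Apparel", "Home Goods"],
})
right = pd.DataFrame({
    "product_id": [2, 3, 4],
    "category_B": ["Clothing", "Kitchen", "Shoes"],
})

# An outer merge keeps every product_id; the '_merge' column records
# whether each row came from the left frame, the right frame, or both.
coverage = pd.merge(left, right, on="product_id", how="outer", indicator=True)

only_in_a = coverage.loc[coverage["_merge"] == "left_only", "product_id"].tolist()
only_in_b = coverage.loc[coverage["_merge"] == "right_only", "product_id"].tolist()

print(f"Product IDs only in dataset A: {only_in_a}")
print(f"Product IDs only in dataset B: {only_in_b}")
```

If both lists are empty, an inner join on `product_id` loses no rows and the category comparison covers every product.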

# Write your code from here
import pandas as pd
from io import StringIO

# Sample data for products_A.csv (replace with your actual file)
products_A_data = """product_id,name,category_A
1,Laptop,Electronics
2,T-Shirt,Apparel
3,Coffee Maker,Home Goods
4,Running Shoes,Apparel
5,Smartphone,Electronics
"""
products_A_df = pd.read_csv(StringIO(products_A_data))

# Sample data for products_B.csv (replace with your actual file)
products_B_data = """product_id,description,category_B
1,High-performance laptop,Electronics
2,Cotton T-shirt,Clothing
3,Brewing machine,Kitchen
4,Athletic footwear,Shoes
5,Latest mobile device,Mobile
"""
products_B_df = pd.read_csv(StringIO(products_B_data))

# Merge the two dataframes on 'product_id' (an inner join keeps only IDs present in both datasets)
merged_df = pd.merge(products_A_df, products_B_df, on='product_id', how='inner')

# Identify inconsistencies in product categories (exact, case-sensitive string comparison)
inconsistent_categories_df = merged_df[merged_df['category_A'] != merged_df['category_B']]

print("Products with Inconsistent Category Information:")
print(inconsistent_categories_df)

# Calculate the number and percentage of inconsistencies
num_inconsistencies = len(inconsistent_categories_df)
total_products = len(merged_df)
percentage_inconsistent = (num_inconsistencies / total_products) * 100 if total_products > 0 else 0

print(f"\nNumber of Products with Inconsistent Categories: {num_inconsistencies}")
print(f"Percentage of Products with Inconsistent Categories: {percentage_inconsistent:.2f}%")

# Optional: Investigate the different categories for inconsistent products
if not inconsistent_categories_df.empty:
    print("\nDetailed Category Comparison for Inconsistent Products:")
    for _, row in inconsistent_categories_df.iterrows():
        print(f"Product ID: {row['product_id']}")
        print(f"  Category in Dataset A: {row['category_A']}")
        print(f"  Category in Dataset B: {row['category_B']}")
        print("-" * 30)

Products with Inconsistent Category Information:
   product_id           name   category_A           description category_B
1           2        T-Shirt      Apparel        Cotton T-shirt   Clothing
2           3   Coffee Maker   Home Goods       Brewing machine    Kitchen
3           4  Running Shoes      Apparel     Athletic footwear      Shoes
4           5     Smartphone  Electronics  Latest mobile device     Mobile

Number of Products with Inconsistent Categories: 4
Percentage of Products with Inconsistent Categories: 80.00%

Detailed Category Comparison for Inconsistent Products:
Product ID: 2
  Category in Dataset A: Apparel
  Category in Dataset B: Clothing
------------------------------
Product ID: 3
  Category in Dataset A: Home Goods
  Category in Dataset B: Kitchen
------------------------------
Product ID: 4
  Category in Dataset A: Apparel
  Category in Dataset B: Shoes
------------------------------
Product ID: 5
  Category in Dataset A: Electronics
  Category in Dataset