#### Objective:
To consolidate multiple records of the same recycler industry into a structured and readable format by merging common industry details 
while retaining waste-specific authorization details separately.

#### Brief Description:
This script processes recycler directory data by standardizing industry names (removing variations like Pvt Ltd, Ltd, etc.) and trimming location fields. 
Records are grouped based on Industry Name, District, and Regional Office.
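
As a rough illustration of the name standardization, the sketch below strips common company-suffix spellings with a single regular expression. The helper `normalize_name`, the `SUFFIX_PATTERN` constant, and the sample names are assumptions made for this example; they are not part of the script or the source data.

```python
import re

# Hypothetical pattern for illustration: an optional "Pvt" / "Private" / "P"
# prefix followed by "Ltd" or "Limited", each with an optional trailing dot.
SUFFIX_PATTERN = re.compile(
    r'\b(?:(?:Pvt|Private|P)\b\.?\s*)?(?:Ltd|Limited)\b\.?',
    flags=re.IGNORECASE,
)

def normalize_name(name: str) -> str:
    """Remove company-suffix variants and collapse leftover whitespace."""
    without_suffix = SUFFIX_PATTERN.sub('', name)
    return re.sub(r'\s+', ' ', without_suffix).strip(' ,')

samples = [
    'ABC Metals Pvt. Ltd.',
    'ABC Metals Private Limited',
    'abc metals ltd',
    'Unlimited Alloys',  # must stay intact: "limited" is not a separate word here
]
for sample in samples:
    print(repr(sample), '->', repr(normalize_name(sample)))
```

The word boundaries (`\b`) are what keep a name such as `Unlimited Alloys` from being truncated, so any pattern used for this step should anchor every alternative to a boundary.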

* When Industry Name, District, and Regional match, records are grouped.
* The first occurrence retains all industry-level information.
* For subsequent matching records, only authorization-specific columns (capacity, waste type, unit, type) are retained, while other fields are left blank.
* This ensures that industry identity appears once, while multiple authorizations remain clearly visible.
* The cleaned and merged output is exported to a new Excel file for final use in the recycler directory.
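
The grouping rules above can be sketched on a small toy table. The column names `Address`, `Capacity`, and `Waste Type`, the sample values, and the helper `blank_repeated_identity` are hypothetical stand-ins chosen for brevity; this is one possible way to get the same layout, not necessarily how the script does it.

```python
import pandas as pd

# Toy data; the column names are shortened stand-ins, not the real sheet headers.
toy = pd.DataFrame({
    'Industry Name': ['ABC Metals', 'XYZ Alloys', 'ABC Metals'],
    'District': ['Pune', 'Thane', 'Pune'],
    'Regional': ['Pune', 'Thane', 'Pune'],
    'Address': ['Plot 1', 'Plot 9', 'Plot 1'],
    'Capacity': [100, 250, 40],
    'Waste Type': ['Zinc ash', 'Copper dross', 'Lead scrap'],
})

key_columns = ['Industry Name', 'District', 'Regional']
auth_columns = ['Capacity', 'Waste Type']

def blank_repeated_identity(frame, keys, keep):
    """Blank every column not in `keep` on repeated rows of each key group."""
    # Sort by the keys plus the original position so that rows of one group
    # sit together and keep their original relative order.
    ordered = (
        frame.assign(_pos=range(len(frame)))
        .sort_values(keys + ['_pos'])
        .drop(columns='_pos')
        .reset_index(drop=True)
    )
    # True for the second and later rows of each key combination.
    repeated = ordered.duplicated(subset=keys, keep='first')
    blank_cols = [c for c in ordered.columns if c not in keep]
    # Object dtype lets numeric columns hold empty strings as well.
    result = ordered.astype(object)
    result.loc[repeated, blank_cols] = ''
    return result

merged = blank_repeated_identity(toy, key_columns, auth_columns)
print(merged.to_string(index=False))
```

Here the second `ABC Metals` row keeps only its capacity and waste type, while the `XYZ Alloys` row stays complete.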

In [51]:
import pandas as pd
import re

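def clean_industry_name(name):
    """Standardize Industry Name by removing Pvt Ltd variations and trimming spaces."""
    # Remove "Pvt Ltd", "Pvt. Ltd.", "P Ltd", "Pvt Limited", etc., including a trailing dot
    name = re.sub(r'\b(Pvt|P)\.?\s?(Ltd|Limited)\b\.?', '', name, flags=re.IGNORECASE)
    # Remove a standalone "Ltd", "Ltd." or "Limited"; grouping the alternatives applies
    # \b to both, so words such as "Unlimited" are left intact
    name = re.sub(r'\b(?:Ltd|Limited)\b\.?', '', name, flags=re.IGNORECASE)
    # Collapse any double spaces left behind by the removals
    name = re.sub(r'\s{2,}', ' ', name)
    return name.strip()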

# Load the Excel file
file_path = "C:/Users/Atique/Rutuja_Mam_Coding/Recycler_Directory/Apr03/V5_Final_Industry_Directory_R_SB_03_03_25.xlsx"  # Change this to your actual file path
df = pd.read_excel(file_path, sheet_name='NFerrous_withoutDrums')

# Columns used for merging
merge_columns = ['Industry Name', 'District', 'Regional']
exception_columns = [
    'Authorized Recycling $ / Utilization/ co-processing capacity (MTA)',
    'Type of Hazardous Waste authorised for Recycling / utilization/ co-processing',
    'UoM',
    'Type'
]

# Clean 'Industry Name' and trim spaces in 'District' and 'Regional'
df['Industry Name'] = df['Industry Name'].astype(str).apply(clean_industry_name)
df['District'] = df['District'].astype(str).str.strip()
df['Regional'] = df['Regional'].astype(str).str.strip()

# Group by matching criteria
grouped = df.groupby(merge_columns, as_index=False)

# Create a new dataframe for the processed output
output_rows = []

for _, group in grouped:
    # First row keeps all column values and must come first in the output
    output_rows.append(group.iloc[0].to_dict())

    # Remaining rows keep only exception column values
    for _, row in group.iloc[1:].iterrows():
        empty_row = {col: "" for col in df.columns}  # Make all columns empty
        for col in exception_columns:
            empty_row[col] = row[col]  # Keep exception columns
        output_rows.append(empty_row)

# Convert to DataFrame, preserving the original column order
output_df = pd.DataFrame(output_rows, columns=df.columns)

# Save to a new Excel file
output_file = "C:/Users/Atique/Rutuja_Mam_Coding/Recycler_Directory/Apr03/output_commacombine_merged_Apr03.xlsx"
output_df.to_excel(output_file, index=False)

print(f"Processing complete! Merged data saved to {output_file}.")


Processing complete! Merged data saved to C:/Users/Atique/Rutuja_Mam_Coding/Recycler_Directory/Apr03/output_commacombine_merged_Apr03.xlsx.
