# **head_of_household and household_id**

### **Logic 1**

The **householding** logic comprises several operations:

1. **Function Definition (`get_max_index`)**:
   - **Purpose**: Returns the index of the row within a group that has the maximum value in the "Total Lifetime Giving" column, or the index of the group's first row if the column is absent or no value in it is greater than zero.
   - **Usage**: Applied later within a grouped dataframe to identify the row with the highest lifetime giving.

2. **Data Preparation and Cleaning**:
   - **Filtering Rows (`df_cdi`)**: Creates `df_cdi` by keeping only the rows of `df_cd` for donors who are not deceased (`df_cd["is_deceased"]` is `False`) and are individuals (`df_cd["is_individual"]` is `True`), then sets a new column `head_of_household` to `True` for every remaining row.
   - **Drop NaN Rows**: Removes rows where both "Address 1" and "Address 2" are NaN.
   - **Fill NaN Values**: Fills NaN values in selected columns ("Address 1", "Address 2", "City", "State", "Zip") with a placeholder value ('missing').
   - **Filter Numeric Addresses**: Keeps only rows where at least one of "Address 1" or "Address 2" contains a number or a Roman numeral.

3. **Grouping and Sorting**:
   - **Identify Rows with Multiple Entries (`df_cdi_multiple`)**: Keeps only the rows of `df_cdi` whose combination of "Address 1", "Address 2", "City", "State", and "Zip" occurs more than once, and sorts them by those columns.
   - **Group by Address Details (`grouped`)**: Groups `df_cdi_multiple` by "Address 1", "Address 2", "City", "State", and "Zip".

4. **Assignment and Merging**:
   - **Identify Maximum Lifetime Giving (`idx_max_lifetime_giving`)**: Applies `get_max_index` to each group in `grouped` to find the index of the row with the maximum "Total Lifetime Giving" in that group.
   - **Set Attributes (`head_of_household`, `household_id`)**:
     - Sets `head_of_household` to `False` for all rows in `df_cdi_multiple`.
     - Sets `head_of_household` to `True` for rows identified in `idx_max_lifetime_giving`.
     - Assigns a unique `household_id` to each subgroup in `df_cdi_multiple`, numbered from 1 by group (`ngroup() + 1`). Donors who do not share an address with anyone else receive `household_id` 0.
   - **Merge Attributes Back (`df_cdi` into `df_cd`)**: Merges `head_of_household` and `household_id` back into the original `df_cd` based on "Unique Donor ID".

This process groups donors into households (**household_id**), designates one head of household per group (**head_of_household**), and writes both attributes back to the full dataframe (`df_cd`). Because the final merge is a left join, rows of `df_cd` that were filtered out of `df_cdi` (deceased donors, non-individuals, and rows removed by the address filters) end up with NaN in both columns.
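As a rough illustration of the steps above, the following sketch runs the same grouping logic on a small made-up dataframe. The data and the single `Address` key (standing in for the five address columns) are assumptions for illustration only, and plain `idxmax` is used in place of `get_max_index`, so the first-row fallback is not exercised here.

```python
import pandas as pd

# Toy data (hypothetical): three donors share one address, one lives alone
toy = pd.DataFrame(
    {
        "Unique Donor ID": [101, 102, 103, 104],
        "Address": ["1 Main St", "1 Main St", "1 Main St", "9 Oak Ave"],
        "Total Lifetime Giving": [50.0, 200.0, 0.0, 75.0],
    }
)

# Every donor starts as head of household with no household assigned
toy["head_of_household"] = True
toy["household_id"] = 0

# Rows whose address appears more than once form multi-member households
shared = toy[toy.groupby("Address")["Address"].transform("size") > 1]

# Within each shared address, the largest lifetime giver becomes the head
head_idx = shared.groupby("Address")["Total Lifetime Giving"].idxmax().tolist()
toy.loc[shared.index, "head_of_household"] = False
toy.loc[head_idx, "head_of_household"] = True

# Number the shared addresses 1, 2, ...; single-donor addresses keep 0
toy.loc[shared.index, "household_id"] = shared.groupby("Address").ngroup() + 1

print(toy[["Unique Donor ID", "head_of_household", "household_id"]])
```

Here donor 102 becomes the head of household 1, and donor 104, who shares an address with nobody, keeps `household_id` 0 and stays flagged as head.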

In [None]:
# Function to determine index of row with max "Total Lifetime Giving" or default to first row
def get_max_index(group):
    if "Total Lifetime Giving" in group.columns:
        if group["Total Lifetime Giving"].max() > 0:
            return group["Total Lifetime Giving"].idxmax()
    # If "Total Lifetime Giving" column is absent or has no value greater than zero, choose the first row
    return group.index[0]
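To make the fallback behaviour concrete, the sketch below repeats the function so it runs on its own and calls it on three made-up groups. The dataframes and index labels are hypothetical; only `get_max_index` comes from the notebook.

```python
import pandas as pd

def get_max_index(group):
    # Same logic as the notebook's function, repeated so this example is self-contained
    if "Total Lifetime Giving" in group.columns:
        if group["Total Lifetime Giving"].max() > 0:
            return group["Total Lifetime Giving"].idxmax()
    return group.index[0]

# Hypothetical groups; the index labels stand in for row labels of df_cdi
with_giving = pd.DataFrame({"Total Lifetime Giving": [10.0, 250.0, 40.0]}, index=[7, 8, 9])
all_zero = pd.DataFrame({"Total Lifetime Giving": [0.0, 0.0]}, index=[3, 4])
no_column = pd.DataFrame({"City": ["Springfield", "Springfield"]}, index=[5, 6])

print(get_max_index(with_giving))  # 8: label of the largest value
print(get_max_index(all_zero))     # 3: no positive value, so the first row
print(get_max_index(no_column))    # 5: column absent, so the first row
```

When several rows tie for the maximum, `idxmax` returns the first of them, so the head of household then depends on row order within the group.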

In [None]:
# Keep living individuals only; .copy() avoids SettingWithCopyWarning on the assignment below
df_cdi = df_cd[~df_cd["is_deceased"] & df_cd["is_individual"]].copy()
df_cdi["head_of_household"] = True
len(df_cd), len(df_cdi)
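The address filter in the next cell relies on a helper named `roman_or_numeral`, which is not defined in this section. Its real implementation is not shown here, so the following is only a hypothetical stand-in (deliberately given a different name, `roman_or_numeral_sketch`) that treats a value as acceptable when it contains a digit or an uppercase token that is a valid Roman numeral.

```python
import re

# Hypothetical stand-in; the notebook's actual roman_or_numeral may behave differently.
# Matches standard Roman numerals from 1 to 3999 (uppercase only).
_ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")

def roman_or_numeral_sketch(value):
    """Return True if value contains a digit or a standalone Roman numeral token."""
    text = str(value)
    if any(ch.isdigit() for ch in text):
        return True
    # Split into alphabetic tokens and test each one as a whole
    tokens = re.findall(r"[A-Za-z]+", text)
    return any(_ROMAN.fullmatch(token) is not None for token in tokens)

print(roman_or_numeral_sketch("12 Oak Ave"))   # True: contains digits
print(roman_or_numeral_sketch("Suite IV"))     # True: "IV" is a Roman numeral
print(roman_or_numeral_sketch("Main Street"))  # False
print(roman_or_numeral_sketch("missing"))      # False: the NaN placeholder is rejected
```

One limitation of this sketch is that single letters such as "I" or "C" count as Roman numerals, so an address like "Unit C" would pass the filter.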

In [None]:
# Drop rows where both Addr1 and Addr2 are NaN
df_cdi = df_cdi.dropna(subset=["Address 1", "Address 2"], how='all')

# Fill NaNs with a placeholder value (e.g., 'missing')
df_cdi = df_cdi.fillna({'Address 1': 'missing', 'Address 2': 'missing', 'City': 'missing', 'State': 'missing', 'Zip': 'missing'})

# Keep rows where Addr1 or Addr2 contains a number or a Roman numeral
# (roman_or_numeral is assumed to be defined earlier in the notebook)
df_cdi = df_cdi[df_cdi.apply(lambda row: roman_or_numeral(row["Address 1"]) or roman_or_numeral(row["Address 2"]), axis=1)]

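# Keep only rows whose full address is shared by more than one donor, sorted by address
df_cdi_multiple = df_cdi[df_cdi.groupby(["Address 1", "Address 2", "City", "State", "Zip"])["Address 1"]\
                            .transform('size') > 1]\
                            .sort_values(by=["Address 1", "Address 2", "City", "State", "Zip"])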

# Create Groupby Object
grouped = df_cdi_multiple.groupby(["Address 1", "Address 2", "City", "State", "Zip"])

# Identify index of rows with max "Total Lifetime Giving" or default to first row
idx_max_lifetime_giving = grouped.apply(get_max_index).values

# Set "head_of_household" to False for all entries in subgroups with multiple entries
df_cdi.loc[df_cdi_multiple.index, "head_of_household"] = False
df_cdi_multiple["head_of_household"] = False  # optional: keeps df_cdi_multiple in sync with df_cdi

# Set "head_of_household" to True for the rows identified with max "Total Lifetime Giving"
df_cdi.loc[idx_max_lifetime_giving, "head_of_household"] = True
df_cdi_multiple.loc[idx_max_lifetime_giving, "head_of_household"] = True  # optional, as above

# Assign unique household_id to each subgroup
df_cdi_multiple['household_id'] = df_cdi_multiple.groupby(["Address 1", "Address 2", "City", "State", "Zip"])\
                                               .ngroup() + 1

# Merge the household_id back into the original df_cdi
df_cdi = df_cdi.merge(df_cdi_multiple[["Unique Donor ID", "household_id"]], on="Unique Donor ID", how="left")

# Fill NaN values in household_id with 0 for those not in multiple entries groups
df_cdi["household_id"] = df_cdi["household_id"].fillna(0).astype(int)

# Merge head_of_household and household_id into df_cd
# (rows filtered out of df_cdi get NaN in both columns after the left merge)
df_cd = df_cd.merge(df_cdi[["Unique Donor ID", "head_of_household", "household_id"]],
                  on="Unique Donor ID", how="left")
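One way to sanity-check the outcome is to confirm that every multi-member household ends up with exactly one head. The sketch below does this on a made-up dataframe that only mimics the shape of the merged result; the values are hypothetical, and on the real `df_cd` the rows with NaN in these columns would need to be excluded first.

```python
import pandas as pd

# Hypothetical output shaped like the merged result (values are made up)
result = pd.DataFrame(
    {
        "Unique Donor ID": [1, 2, 3, 4, 5],
        "head_of_household": [True, False, True, False, True],
        "household_id": [1, 1, 2, 2, 0],
    }
)

# Multi-member households have household_id > 0; count the heads in each
multi = result[result["household_id"] > 0]
heads_per_household = multi.groupby("household_id")["head_of_household"].sum()
assert (heads_per_household == 1).all(), "every household should have exactly one head"

# Donors outside any shared address keep household_id 0 and remain their own head
singles = result[result["household_id"] == 0]
assert singles["head_of_household"].all()
```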

### **Logic 2**