# Play around with `avg_num_employees` agg

This notebook reviews two files: 
- **agg_df:** aggregated by year, utility, and plant type
- **full_df:** un-aggregated, but with a column `avg_num_employees_agg` that holds the aggregated employee value for each year, utility, plant, and plant-type group. I included this one so you can play around and make sure the totals flags are working properly, or change them if you don't like them. I'll show you how below.

It's important to remember that the `avg_num_employees_agg` values are calculated at the PLANT/PLANT-TYPE level, not the UTILITY level. The utility-level numbers in `agg_df` come from a second round of aggregation on top of those plant-level values. Keeping the two steps separate makes it easier to see what assumptions were made in creating the final utility-aggregated employee numbers.

In [1]:
import pandas as pd
import random

In [2]:
# Path to agg and full files; UPDATE as needed
agg_path = '/Users/aesharpe/Desktop/num_employees_agg.xlsx'
full_path = '/Users/aesharpe/Desktop/num_employees.xlsx'

# Load excel files into pandas
agg_df = pd.read_excel(agg_path).drop(columns=['Unnamed: 0'])
full_df = pd.read_excel(full_path).drop(columns=['Unnamed: 0'])

In [3]:
def get_random_group(df):
    """Show random year/utility groups that have multiple rows and at least one total.
    
    Use this function to see how the aggregation chose to allocate the avg_num_employees.
    You can compare the avg_num_employees column with the avg_num_employees_agg column.
    
    Args: 
        df (pandas.DataFrame): The num_employees.xlsx dataframe (i.e., the non 
            aggregated one).
    Returns:
        df (pandas.DataFrame): A random subset of the full dataframe that shows
            the records for a specific year and utility that has more than one
            record and at least one flagged total row in total_type.
    """
    # Add plant_id_pudl to the groupby columns if you want to narrow the groups
    groups = df.groupby(['report_year', 'utility_id_ferc1'])
    group_keys = list(groups.groups.keys())
    # Keep drawing until we hit a group with multiple rows and a flagged total
    while True:
        random_group = groups.get_group(random.choice(group_keys))
        more_than_one_row = len(random_group) > 1
        has_total = random_group.total_type.notna().any()
        if more_than_one_row and has_total:
            break
    return random_group[[
        'record_id', 'report_year', 'utility_id_ferc1', 'utility_name_ferc1', 
        'plant_id_pudl', 'plant_name_ferc1', 'total_type', 'avg_num_employees', 'avg_num_employees_agg',
        'avg_num_employees_flag', 'capacity_mw', 'installation_year', 'plant_type']]

Every time you run this you'll get a different subset of the full df. You can use it to look at how employee numbers were allocated and decide whether you agree. Remember, all aggregation allocation decisions are made at the year, utility, and plant level, so every value in `avg_num_employees_agg` is the summary value for that plant. That's why the values are repeated across multiple records in a plant. The utility-level aggregation (shown below) is calculated by adding up the designated employee count for each year, utility, and plant ID. In other words, you can't just sum the column.
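
To make the "can't just sum the column" point concrete, here is a minimal sketch on made-up data. The IDs and employee counts below are invented for illustration and are not taken from either spreadsheet. It compares a naive column sum with taking one value per plant first and then summing across plants.

```python
import pandas as pd

# Toy data, invented for illustration: plant 1 has two records that both
# carry the plant-level value 100; plant 2 has a single record.
toy = pd.DataFrame({
    'report_year': [2000, 2000, 2000],
    'utility_id_ferc1': [1, 1, 1],
    'plant_id_pudl': [1, 1, 2],
    'plant_type': ['steam', 'steam', 'steam'],
    'avg_num_employees_agg': [100, 100, 50],
})

# The naive sum double-counts plant 1 because its value repeats per record.
naive_total = toy['avg_num_employees_agg'].sum()

# Take one value per year/utility/plant/plant-type group first...
plant_cols = ['report_year', 'utility_id_ferc1', 'plant_id_pudl', 'plant_type']
per_plant = toy.groupby(plant_cols)['avg_num_employees_agg'].first()

# ...then sum those plant-level values up to the utility/plant-type level.
correct_total = per_plant.groupby(
    level=['report_year', 'utility_id_ferc1', 'plant_type']
).sum()

print(naive_total)            # 250
print(correct_total.iloc[0])  # 150
```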

In [4]:
# Generate a random year/utility group
peek = get_random_group(full_df)

# Show what plant types appear in that group
peek.plant_type.unique()

array(['combustion_turbine', 'steam', 'nuclear', 'unknown', 'storage',
       'run-of-river'], dtype=object)

In [5]:
# Look at one of those plant types for a snapshot of what's going on
peek[peek['plant_type']=='steam'].sort_values('plant_id_pudl')

| | record_id | report_year | utility_id_ferc1 | utility_name_ferc1 | plant_id_pudl | plant_name_ferc1 | total_type | avg_num_employees | avg_num_employees_agg | avg_num_employees_flag | capacity_mw | installation_year | plant_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3782 | f1_steam_1996_12_57_0_5 | 1996 | 57 | Georgia Power Company | 73 | bowen | | 423.0 | 423 | actual values provided | 3499.0 | 1975.0 | steam |
| 3783 | f1_steam_1996_12_57_1_5 | 1996 | 57 | Georgia Power Company | 246 | hammond | | 213.0 | 213 | actual values provided | 953.0 | 1970.0 | steam |
| 3784 | f1_steam_1996_12_57_1_4 | 1996 | 57 | Georgia Power Company | 250 | harllee branch | | 347.0 | 347 | actual values provided | 1746.0 | 1969.0 | steam |
| 3788 | f1_steam_1996_12_57_1_1 | 1996 | 57 | Georgia Power Company | 383 | mcdonough | | 177.0 | 177 | actual values provided | 598.0 | 1964.0 | steam |
| 3791 | f1_steam_1996_12_57_2_3 | 1996 | 57 | Georgia Power Company | 398 | mcmanus | | 43.0 | 43 | actual values provided | 144.0 | 1959.0 | steam |
| 3793 | f1_steam_1996_12_57_2_5 | 1996 | 57 | Georgia Power Company | 412 | mitchell | | 64.0 | 64 | actual values provided | 218.0 | 1964.0 | steam |
| 3794 | f1_steam_1996_12_57_3_1 | 1996 | 57 | Georgia Power Company | 526 | scherer | | 399.0 | 399 | actual values provided | 818.0 | 1988.0 | steam |
| 3801 | f1_steam_1996_12_57_2_1 | 1996 | 57 | Georgia Power Company | 656 | yates | | 317.0 | 317 | actual values provided | 1488.0 | 1974.0 | steam |
| 3803 | f1_steam_1996_12_57_3_4 | 1996 | 57 | Georgia Power Company | 658 | wansley | | 249.0 | 249 | actual values provided | 1019.0 | 1978.0 | steam |
| 3817 | f1_steam_1996_12_57_0_1 | 1996 | 57 | Georgia Power Company | 9611 | arkwright | | 80.0 | 80 | actual values provided | 181.0 | 1948.0 | steam |


## Recreate the agg_df from the full_df

If you see any values for `avg_num_employees_agg` that you do not think are representative of that year, utility, plant, and plant type, you can change them. Just make sure you change the `avg_num_employees_agg` value in the `full_df` spreadsheet (num_employees.xlsx) for **ALL** records in the year, utility, plant, and plant-type group. Then run the next cells, which recreate the aggregated spreadsheet and show you the difference between the original aggregated spreadsheet and the version with your changes. If you *don't* change the spreadsheet, this should output a blank df.
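
If you would rather make the change in pandas than in the spreadsheet itself, something like the sketch below could work. `set_group_agg` is a hypothetical helper written for this example (it is not part of the original notebook), and the demo frame uses invented values. The point is that the new value is written to every record in the group, which is what the consistency check in the next cell expects.

```python
import pandas as pd

def set_group_agg(df, report_year, utility_id_ferc1, plant_id_pudl, plant_type, new_value):
    """Hypothetical helper: overwrite avg_num_employees_agg for ALL records in one group.

    Returns a modified copy and leaves the input dataframe untouched.
    """
    in_group = (
        (df['report_year'] == report_year)
        & (df['utility_id_ferc1'] == utility_id_ferc1)
        & (df['plant_id_pudl'] == plant_id_pudl)
        & (df['plant_type'] == plant_type)
    )
    if not in_group.any():
        raise ValueError('no records match that year/utility/plant/plant-type group')
    out = df.copy()
    out.loc[in_group, 'avg_num_employees_agg'] = new_value
    return out

# Toy frame with invented values: plant 1 has two records, plant 2 has one.
demo = pd.DataFrame({
    'report_year': [2000, 2000, 2000],
    'utility_id_ferc1': [1, 1, 1],
    'plant_id_pudl': [1, 1, 2],
    'plant_type': ['steam', 'steam', 'steam'],
    'avg_num_employees_agg': [100, 100, 50],
})

# Both records for plant 1 get the new value; plant 2 is unchanged.
edited = set_group_agg(demo, 2000, 1, 1, 'steam', 120)
print(edited['avg_num_employees_agg'].tolist())  # [120, 120, 50]
```

With the real data you would pass `full_df` instead of `demo` and assign the result back to `full_df` before running the cells below.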

In [6]:
# Group by relevant columns. We include plant_id_pudl here because many of the totals are plant-level totals
groups = full_df.groupby(['report_year', 'utility_id_ferc1', 'plant_id_pudl', 'plant_type'])

# Test that every group defined above has a single value for avg_num_employees_agg
# This will raise an AssertionError if that's not true
assert not (groups.avg_num_employees_agg.nunique() > 1).any(), "groups don't have the same avg_num_employees_agg"

# Group by plant and grab the first value in each avg_num_employees group because we know they are all the same
plant_groups_df = groups.agg('first').reset_index()

# Now we'll aggregate up to the utility plant-type level, which is what we want for the final version.
# Sum only avg_num_employees_agg so that text columns (names, flags) are never passed to sum()
util_groups_df = (
    plant_groups_df
    .groupby(['report_year', 'utility_id_ferc1', 'plant_type'])['avg_num_employees_agg']
    .sum()
    .astype('Int64')
    .rename('avg_num_employees')
    .reset_index())

In [7]:
# Show the differences between the original agg_df and your newly aggregated full_df
# If you don't change anything, this should be empty
agg_no_flag = agg_df.drop(columns=['avg_num_employees_flag'])
pd.concat([agg_no_flag, util_groups_df]).drop_duplicates(keep=False)

| | report_year | utility_id_ferc1 | plant_type | avg_num_employees |
|---|---|---|---|---|
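
The `concat` plus `drop_duplicates(keep=False)` comparison lists both versions of a changed row but does not say which file each one came from. As an alternative sketch, an outer merge with `indicator=True` puts the original and recreated values side by side. The two frames below are invented stand-ins for `agg_no_flag` and `util_groups_df`, and the column suffixes are arbitrary choices made for this example.

```python
import pandas as pd

keys = ['report_year', 'utility_id_ferc1', 'plant_type']

# Invented stand-ins for agg_no_flag (original) and util_groups_df (recreated).
original = pd.DataFrame({
    'report_year': [2000, 2000],
    'utility_id_ferc1': [1, 1],
    'plant_type': ['steam', 'nuclear'],
    'avg_num_employees': [150, 80],
})
recreated = original.assign(avg_num_employees=[170, 80])

# Outer merge keeps rows that exist on only one side; _merge records which side.
compared = original.merge(
    recreated,
    on=keys,
    how='outer',
    suffixes=('_original', '_recreated'),
    indicator=True,
)

# Keep rows missing from one side or whose employee counts differ.
changed = compared[
    (compared['_merge'] != 'both')
    | (compared['avg_num_employees_original'] != compared['avg_num_employees_recreated'])
]
print(changed)
```

In this toy case only the steam row appears in `changed`, with 150 in the original column and 170 in the recreated one.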
