
Assign FERC Plant IDs #144

Closed
11 of 14 tasks
zaneselvans opened this issue Feb 19, 2018 · 7 comments
@zaneselvans (Member) commented Feb 19, 2018

Organizing several sub-issues related to the FERC Plant ID assignments:

Tasks

Fourteen sub-issues (11 of 14 completed), carrying combinations of the data-cleaning, ferc1, glue, bug, record-linkage, inframundo, performance, and time labels; most are assigned to zaneselvans.

Original Issue Description:

CPI has created a spreadsheet that combines the following fields from the f1_steam table to create a candidate ID that persists across years for reported FERC plants:

  • plant_name
  • year_constructed
  • plant_kind (within a well defined taxonomy)
  • respondent_id

The problem of matching/clustering entity names that aren't exactly the same (but should be) is common. To do it in a computationally efficient way, the problem is often vectorized. A popular approach is to use the TF-IDF (Term Frequency * Inverse Document Frequency) metric, with n-grams (either word or character based) as the "terms." A nice example of this process is outlined here.
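
For illustration, a minimal sketch of vectorizing plant names with character n-grams using scikit-learn (the sample names and the n-gram range are placeholders, not the real data or tuned parameters):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample of reported plant names:
plant_names = pd.Series(
    ["barry", "barry unit 2", "big bend", "big bend station", "bowen"]
)

# Character n-grams (2-4 characters here) tolerate small spelling
# differences between years better than whole-word tokens do.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
name_vectors = vectorizer.fit_transform(plant_names)
print(name_vectors.shape)  # (n_records, n_distinct_ngrams)
```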

In addition to the strings that are the reported plant names, we also have the other categorical features, which we can integrate separately as "one hot" features (that match, or not) in order to independently weight the different features, before applying a clustering algorithm. Here's how that works.

A common and computationally efficient metric of n-dimensional vector similarity is the cosine of the angle separating the vectors. Here's an explanation of using cosine similarity to compare vectorized documents.
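
A toy illustration of the metric itself (not tied to the FERC data):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two feature vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Vectors pointing in the same direction score 1.0; orthogonal vectors score 0.0.
print(cosine_sim(np.array([1.0, 2.0, 0.0]), np.array([2.0, 4.0, 0.0])))  # 1.0
print(cosine_sim(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])))  # 0.0
```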

Once we have the FERC plant features identified, and have chosen a metric of similarity to compare them, there are a bunch of different clustering algorithms we can use to pull them together... or we can just look at the top match for each year -- which will usually be a perfect match (since many utilities do use well defined names for their plants.) Once we've identified the FERC records that are similar enough for us to consider them the "same" plant, we can assign unique numerical IDs to each of them on a record-by-record basis.

Also need to incorporate the other filters & tests that CPI put into the spreadsheet, which include removing records with missing data and records that require human categorization (but which get flagged at ingest).

@zaneselvans zaneselvans added the ferc1 Anything having to do with FERC Form 1 label Feb 19, 2018
@zaneselvans zaneselvans self-assigned this Feb 19, 2018
@zaneselvans (Member Author) commented:

Okay, so this is a little more complicated than I thought (who is surprised by this?).

We have the plant name (a string) and the three categorical features (construction year, respondent ID, and plant kind), and for the purposes of clustering we really need to keep all of these features associated with each other for any given FERC plant record. But we also need to be able to scale their resulting feature vectors independently, so that the features can be weighted differently and the system can be tuned to identify plants most effectively.

For each apparently valid record in the FERC Plants table, we need to split out the features, vectorize & scale them, and then re-combine (and re-scale) the vectors before calculating the matrix of pairwise cosine similarities between all records. That matrix of record-to-record similarity is the raw material on which we would want to run the clustering analysis.
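
A rough sketch of that pipeline, assuming made-up column names, toy records, and arbitrary weights (the real feature set and weights would come from tuning):

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import OneHotEncoder

# Hypothetical FERC plant records:
df = pd.DataFrame({
    "plant_name": ["barry", "barry 2", "big bend", "big bend sta"],
    "respondent_id": [195, 195, 271, 271],
    "year_constructed": [1954, 1954, 1970, 1970],
    "plant_kind": ["steam", "steam", "steam", "steam"],
})

# Vectorize the name strings and one-hot encode the categorical columns:
name_vecs = TfidfVectorizer(
    analyzer="char_wb", ngram_range=(2, 4)
).fit_transform(df["plant_name"])
cat_vecs = OneHotEncoder().fit_transform(
    df[["respondent_id", "year_constructed", "plant_kind"]]
)

# Re-scale each feature block independently before re-combining, so the
# blocks can be weighted differently (weights here are arbitrary):
weights = {"name": 1.0, "categorical": 0.5}
features = hstack([weights["name"] * name_vecs,
                   weights["categorical"] * cat_vecs])

# Matrix of pairwise cosine similarities between all records:
similarity = cosine_similarity(features)
```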

The clustering has a lot of interesting constraints to consider, which are both complicating and helpful. If what we're trying to do is identify individual plants, then we might leave out the respondent ID, and only play with the plant type, plant name, and construction year. But this would result in clusters where the same plants are being reported by different utilities -- in the same year when they have joint ownership, and in different years when the plants change hands.

Those true plant clusters will be interesting (and give a good indication of whether this is working) but they don't get us a clean time series for each plant, which is what will allow us to do regressions and estimate the variable & fixed operating expenses associated with each plant historically. Within each plant cluster, we then need to organize the records into a collection of time series -- choosing no more than one record from each report year for each responding utility.

There are several parameters that can be tuned to enhance the results, but doing so would require having a set of labeled test data -- plants that we have identified by hand as being associated with each other -- and I don't know how much we would need. Tuning parameters would include:

  • The length of n-grams used to vectorize the plant names
  • The relative weights of the four inputs.

Those parameters could be optimized using a grid search or some other automated mechanism. The method of extracting features from the string input is fuzzier to optimize.
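
As a sketch of what that search might look like (cluster_and_score() below is a placeholder for the real vectorize-cluster-compare-to-hand-labels step, and the parameter values are arbitrary):

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical search space: maximum n-gram length for the name
# vectorizer, plus relative weights for the name and categorical blocks.
param_grid = ParameterGrid({
    "ngram_max": [3, 4, 5],
    "name_weight": [1.0, 2.0],
    "category_weight": [0.25, 0.5, 1.0],
})

def cluster_and_score(params):
    """Placeholder: vectorize the records with these parameters, run the
    clustering, and score the result against hand-labeled plant groupings.
    Returns a dummy value here just so the loop below runs."""
    return params["name_weight"] / params["ngram_max"]

best_params = max(param_grid, key=cluster_and_score)
print(best_params)
```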

@zaneselvans (Member Author) commented:

A good post on several different ways to do categorical feature encoding.

@zaneselvans (Member Author) commented:

All the feature vectorizations seem to be working... but need to figure out the right weights & synthesize the time series.

Refining the above clustering to take explicit account of the fact that we're trying to create a time series for each plant... sounds a lot like the original CPI approach (duh):

  1. Calculate the distance metric (cosine similarity, etc.) between all pairs of plant records, with some set of weights for name, respondent id, year constructed, and plant kind, and some set of name vectorization parameters (n-gram lengths, etc.)
  2. Now iterate over all of the years of data. For each record in each year, find the most similar record in each of the other years, above some threshold of similarity (so if there really isn't a good match, the best but still bad match gets discarded). This will generate a cluster of plants around each plant record for each year. (A sketch of this step follows the list.)
  3. Any cluster of plant records that is identical for all the years in which it has a member is a matched (inter-year) plant!
  4. Then we need to look at the leftovers to see what we are missing...
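
A minimal sketch of step 2 (the names `similarity` and `years` and the threshold value are assumptions, not the actual implementation):

```python
import numpy as np
import pandas as pd

def best_match_per_year(rec, similarity, years, threshold=0.75):
    """For one record index, find the most similar record in every other
    report year, keeping only matches above the similarity threshold.

    `similarity` is the (n_records x n_records) cosine similarity matrix,
    and `years` is a pandas Series of report years, aligned by position.
    """
    matches = {}
    for yr in years.unique():
        if yr == years.iloc[rec]:
            continue
        candidates = np.flatnonzero((years == yr).to_numpy())
        if len(candidates) == 0:
            continue
        best = candidates[np.argmax(similarity[rec, candidates])]
        if similarity[rec, best] >= threshold:
            matches[yr] = best
    return matches
```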

Possible refinement: We know that many plants report individual generation units. Often this information only shows up in the plant name, which ends in a number, e.g. "2", or "Unit 2" or "#2" or "(2)" -- this is a small amount of text, but very important information for the purpose of matching up time series, which ought to be weighted separately in an ideal world. We could split it off into a 5th categorical feature field (e.g. unit_id) and remove it from the plant names, if there are a small enough number of patterns to match.
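
A sketch of what that split might look like (the regular expression below only covers the patterns mentioned above and would need checking against the real FERC name variants):

```python
import re

# Match a trailing unit designation like "2", "Unit 2", "#2", or "(2)":
UNIT_PATTERN = re.compile(r"\s*(?:unit\s*)?[#(]?(\d+)\)?\s*$", re.IGNORECASE)

def split_unit_id(plant_name):
    """Return (cleaned_name, unit_id), where unit_id may be None."""
    match = UNIT_PATTERN.search(plant_name)
    if match:
        return plant_name[:match.start()].strip(), match.group(1)
    return plant_name, None

print(split_unit_id("Big Bend Unit 2"))  # ('Big Bend', '2')
print(split_unit_id("Barry #3"))         # ('Barry', '3')
print(split_unit_id("Bowen (1)"))        # ('Bowen', '1')
print(split_unit_id("Comanche"))         # ('Comanche', None)
```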

@zaneselvans (Member Author) commented:

There are two main constraints that we have to work with in selecting records within a cluster:

  1. Each cluster may contain at most one record from each year.
  2. Each record may be part of at most one cluster.

An outline of a process that uses the above constraints (a rough sketch in code follows the list):

  • Filter similarity matrix with a threshold for acceptably good match.
  • For each record, find the best match in each year, subject to some minimum threshold of similarity.
  • For each record, get the list of all other records whose best matches it appears in.
  • Take the intersection of those two sets of records -- this is a candidate cluster.
  • Use each record in the candidate cluster as a seed to generate its own candidate cluster.
  • If all records in the candidate cluster generate the same cluster, assign them all an id, and remove them from the list of records that need to be matched.
  • Recalculate which records are most similar to each other, given the removal of the clustered records.
  • Optionally, relax the constraint that each record can appear in at most one cluster by not removing them or re-calculating the best matches, to evaluate how much duplication exists and the impact of using a greedy algorithm on the clustering.
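
Roughly, in code (the data structures here are placeholders: `best_matches[r]` is the set of per-year best matches for record r, and `appears_in[r]` is the set of records whose best-match sets contain r; the re-calculation step after each accepted cluster is omitted for brevity):

```python
def candidate_cluster(seed, best_matches, appears_in):
    """Intersection of the records that `seed` best-matches with the
    records whose best-match sets contain `seed`, plus the seed itself."""
    return (best_matches[seed] & appears_in[seed]) | {seed}

def assign_plant_ids(records, best_matches, appears_in):
    """Greedily assign an ID to every candidate cluster whose members
    all regenerate the same cluster when used as seeds."""
    plant_ids, next_id = {}, 0
    unmatched = set(records)
    for seed in records:
        if seed not in unmatched:
            continue
        cluster = candidate_cluster(seed, best_matches, appears_in)
        if all(
            candidate_cluster(r, best_matches, appears_in) == cluster
            for r in cluster
        ):
            for r in cluster:
                plant_ids[r] = next_id
            next_id += 1
            unmatched -= cluster
    return plant_ids
```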

zaneselvans added a commit that referenced this issue Jun 29, 2018
Moved some notebook functions into the FERC transform module, and began
the work of trying to build a scikit-learn FERCPlantClassifier that can
find the best parameters for extracting clusters of FERC plants, based
on the following features:
 - plant name (using TF-IDF and variable ngram lengths)
 - plant kind (gas, coal, etc. categorical)
 - plant capacity (numeric)
 - year of construction (categorical)
 - respondent ID (categorical)
 - supplement number (categorical)
 - row number (categorical)

Partially addresses issue #144.
zaneselvans added a commit that referenced this issue Jun 30, 2018
Added some helper functions to the FERCPlantClassifier class to help
support the model training/fitting, adapted from the sklearn notebook.

Also created a new notebook with a couple of examples of how to manually
categorize FERC plants, so we can create a training set to work with.

Partially addresses issue #144
zaneselvans added a commit that referenced this issue Jul 5, 2018
Further additions to the (as of yet not working) sklearn classifier /
transformer classes.  Including long docstrings filled with plaintive
questions. Progress toward addressing issue #144.
zaneselvans added a commit that referenced this issue Sep 7, 2018
Got basic versions of fit() and predict() working -- storing the entire
cosine similarity matrix inside the object, and requiring the original
dataframe of information about the plants to be passed in as part of the
__init__() function, which is obviously not ideal -- but for this problem
it seems like we have almost nothing *but* hyperparameters to vary. And
the distance metric that results in the categories kind of depends on
all of the records being available. Which is fine, but not the way these
things normally work.

Have still left out the single numerical variable, since it requires
reshaping to work in the Pipeline. It's all so ugly right now. Blargh!

Working toward issue #144
@zaneselvans zaneselvans added the glue PUDL specific structures & metadata. Stuff that connects datasets together. label Sep 28, 2018
@zaneselvans zaneselvans added this to the v0.1.0 release milestone Sep 28, 2018
@zaneselvans (Member Author) commented Sep 30, 2018

The FERC time series generation is now structurally working -- pulling out collections of records, at most one per year, which are similar to each other, based on:

  • respondent_id (as a categorical feature)
  • plant_name (using TF-IDF on substrings)
  • tot_capacity (as a numeric feature)
  • plant_kind (as a categorical feature, after the field is coded using a thesaurus of strings)
  • yr_const (year of plant construction, as a categorical feature)
  • const_type (construction type, as a categorical feature)

How well the model chooses records to put in a time series depends on:

  • the relative weightings of the above features
  • the minimum required similarity for a record to be potentially included in a series
  • the range of substring lengths used in the plant name vectorization

@zaneselvans zaneselvans changed the title Create internally consistent FERC Plant IDs across years Assign FERC Plant IDs based on consistent multi-year time series Sep 30, 2018
@zaneselvans zaneselvans changed the title Assign FERC Plant IDs based on consistent multi-year time series Assign FERC Plant IDs based on multi-year time series Sep 30, 2018
@zaneselvans zaneselvans changed the title Assign FERC Plant IDs based on multi-year time series Assign FERC Plant IDs Sep 30, 2018
zaneselvans added a commit that referenced this issue Oct 2, 2018
Created a basic scoring function in the FERCPlantClassifier that looks
at how similar the "true" vs. recovered plant time series are.  Removed
some other unused code that had been superseded. Also fixed an
assignment-on-slice warning that was lingering.

Working toward #144
zaneselvans added a commit that referenced this issue Oct 2, 2018
 * Finally moved the boilerplate FERC plant classification model into
   the transform module.
 * Re-organized the functions that were being used into methods within
   the FERCPlantClassifier class.
 * Tried and failed to vectorize the predict() method. It's still quite
   slow, but hopefully easier for humans to read.
 * The score() method still needs to be updated to work with the new
   predict() method.

Working toward issues #144 and #216
zaneselvans added a commit that referenced this issue Oct 18, 2018
Updated the plant_id_ferc1 assignment process to work on the
plants_steam_ferc1 proto-table (the dataframe as it is at the end of the
current transform step), since that's where the ID assignment will
ultimately need to take place.

This included swapping out the CPI plant type assignments, and reverting
to using our own (which have more categories, and are thus likely more
useful for matching plant records together).  The CPI plant types have
been preserved in the ever growing constants module.

Changed the names of the columns that are used in the FERC Plant
Classifier class. Updated pre-processing of some of those columns in the
transform step to ensure they work with the sklearn module.

Removed the ad-hoc FERC plant cleaning function which was hanging out
with the FERC Plant Classifier in the transform module, since it now
uses the canonical proto-table as input.

Progress toward #219 and #144
zaneselvans added a commit that referenced this issue Oct 19, 2018
At long last, the process for identifying related records in the FERC
plant table has been integrated into the ETL process.

There's still tweaking to be done with the values of the parameters
which are used to determine the related groups, and the remaining ~10%
of all records which aren't assigned to any FERC plant (but which are
still valid records) need to be dealt with. Refinements are listed as
other issues. This commit makes progress on #144 and closes #219
zaneselvans added a commit that referenced this issue Oct 19, 2018
During the FERC Form 1 transform process, the plant_id which is
generated and assigned in the steam_plants_ferc1 table is now inserted
into the fuel_ferc1 table, by merging on the shared fields that are
co-reported to FERC in the two tables: report_year, utility_id_ferc, and
plant_name -- in this and only this instance, it appears that the plant
name can serve as part of a foreign key -- the fuel and steam tables get
the exact same values.

Note that this merge *need not* result in the same plant_id_ferc1 being
associated with records having the same plant name and utility ID.
Those fields are just used to connect the FERC steam and fuel tables,
bringing across whatever ID has been assigned on a record-by-record
basis in the plants_steam_ferc1 ETL process.

This commit closes #219 and makes progress toward #144.
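
For reference, the merge described above is conceptually something like this (toy data and a simplified column set, not the actual PUDL transform code):

```python
import pandas as pd

# Toy stand-ins for the two FERC Form 1 proto-tables:
plants_steam = pd.DataFrame({
    "report_year": [2017, 2017],
    "utility_id_ferc": [145, 145],
    "plant_name": ["barry", "gadsden"],
    "plant_id_ferc1": [12, 34],
})
fuel = pd.DataFrame({
    "report_year": [2017, 2017, 2017],
    "utility_id_ferc": [145, 145, 145],
    "plant_name": ["barry", "barry", "gadsden"],
    "fuel": ["coal", "gas", "coal"],
})

# Bring the record-by-record plant_id_ferc1 across on the co-reported
# fields; the ID itself was assigned upstream in the steam-table ETL.
fuel = fuel.merge(
    plants_steam[["report_year", "utility_id_ferc", "plant_name",
                  "plant_id_ferc1"]],
    on=["report_year", "utility_id_ferc", "plant_name"],
    how="left",
)
```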
zaneselvans added a commit that referenced this issue Oct 19, 2018
This is mostly some naming cleanup, including:
 * Changed respondent_id to utility_id_ferc1
 * Changed all of the utility_id_ferc values to utility_id_ferc1.
 * Changed respondent_name to utility_name_ferc1.
 * Expanded util_id_pudl and util_plant_assn to utility_* since we're
   using the full word in the context of the other tables.

Also brought plant_id_ferc1 into the FERC output tables, in order to be
able to play with them more easily while integrating the new plant IDs
into the FERC output routines.

Closes #228, works toward #144.
@zaneselvans zaneselvans added this to To do in FERC_Form_1 Mar 29, 2019
@zaneselvans zaneselvans moved this from To do to In progress in FERC_Form_1 Mar 29, 2019
@zaneselvans zaneselvans moved this from In progress to To do in FERC_Form_1 Mar 29, 2019
@zaneselvans zaneselvans moved this from To do to In progress in FERC_Form_1 Apr 30, 2019
@zaneselvans zaneselvans moved this from In progress to To do in FERC_Form_1 Apr 30, 2019
@zaneselvans zaneselvans modified the milestones: 0.1.0, 0.2.0 Jun 28, 2019
@cmgosnell cmgosnell removed this from the 0.3.0 milestone Oct 4, 2019
@zaneselvans zaneselvans added the epic Any issue whose primary purpose is to organize other issues into a group. label Oct 31, 2019
@cmgosnell (Member) commented:

@zaneselvans can we close this? This is a "nothing is ever done" type of thing, but we've been using these IDs for a very long time at this point, and if we want to revamp them we can always come back to this issue to see what wasn't incorporated and what you were thinking of for next steps.

@zaneselvans (Member Author) commented:

Yeah, I guess so. I pulled #475 and #216 into #2386 since the memory intensity and slowness are significant hangups in the DAG right now.

I hope that at some point we can revisit this system with our learnings from the other record linkage projects we've done, and maybe get @katie-lamb's eyes on it.
