
Assign FERC Plant IDs #144

Closed
11 of 14 tasks
zaneselvans opened this issue Feb 19, 2018 · 7 comments
@zaneselvans (Member) commented Feb 19, 2018

Organizing several sub-issues related to the FERC Plant ID assignments:

Tasks

Fourteen sub-issues (11 of 14 completed), carrying combinations of the data-cleaning, ferc1, glue, bug, record-linkage, inframundo, performance, and time labels; most are assigned to zaneselvans.

Original Issue Description:

CPI has created a spreadsheet that combines the following fields from the f1_steam table to create a candidate ID that persists across years for reported FERC plants:

  • plant_name
  • year_constructed
  • plant_kind (within a well defined taxonomy)
  • respondent_id

The problem of matching/clustering entity names that aren't exactly the same (but should be) is common. To do it in a computationally efficient way, the problem is often vectorized. A popular approach is to use the TF-IDF (Term Frequency * Inverse Document Frequency) metric, with n-grams (either word or character based) as the "terms." A nice example of this process is outlined here.
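
For illustration, a minimal sketch of vectorizing plant names with character n-grams using scikit-learn (the sample names and the n-gram range are placeholders, not the real data or tuned parameters):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample of reported plant names:
plant_names = pd.Series(
    ["barry", "barry unit 2", "big bend", "big bend station", "bowen"]
)

# Character n-grams (2-4 characters here) tolerate small spelling
# differences between years better than whole-word tokens do.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
name_vectors = vectorizer.fit_transform(plant_names)
print(name_vectors.shape)  # (n_records, n_distinct_ngrams)
```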

In addition to the strings that are the reported plant names, we also have the other categorical features, which we can integrate separately as "one hot" features (that match, or not) in order to independently weight the different features, before applying a clustering algorithm. Here's how that works.

A common and computationally efficient metric of n-dimensional vector similarity is the cosine of the angle separating the vectors. Here's an explanation of using cosine similarity to compare vectorized documents.
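
A toy illustration of the metric itself (not tied to the FERC data):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two feature vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Vectors pointing in the same direction score 1.0; orthogonal vectors score 0.0.
print(cosine_sim(np.array([1.0, 2.0, 0.0]), np.array([2.0, 4.0, 0.0])))  # 1.0
print(cosine_sim(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])))  # 0.0
```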

Once we have the FERC plant features identified, and have chosen a metric of similarity to compare them, there are a bunch of different clustering algorithms we can use to pull them together... or we can just look at the top match for each year -- which will usually be a perfect match (since many utilities do use well defined names for their plants.) Once we've identified the FERC records that are similar enough for us to consider them the "same" plant, we can assign unique numerical IDs to each of them on a record-by-record basis.

Also need to incorporate the other filters & tests that CPI put into the spreadsheet, which include removing records with missing data and records that require human categorization (but which get flagged at ingest).

@zaneselvans zaneselvans added the ferc1 Anything having to do with FERC Form 1 label Feb 19, 2018
@zaneselvans zaneselvans self-assigned this Feb 19, 2018
@zaneselvans (Member Author) commented:

Okay, so this is a little more complicated than I thought (who is surprised by this?).

We have the plant name (a string) and the three categorical features (construction year, respondent ID, and plant kind), and for the purposes of clustering we really need to keep all of these features associated with each other for any given FERC plant record. But we also need to be able to scale their resulting feature vectors independently, so that the features can be weighted differently and the system can be tuned to identify plants most effectively.

For each apparently valid record in the FERC Plants table, we need to split out the features, vectorize & scale them, and then re-combine (and re-scale) the vectors before calculating the matrix of pairwise cosine similarities between all records. That matrix of record-to-record similarity is the raw material on which we would want to run the clustering analysis.
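
A rough sketch of that pipeline, assuming made-up column names, toy records, and arbitrary weights (the real feature set and weights would come from tuning):

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import OneHotEncoder

# Hypothetical FERC plant records:
df = pd.DataFrame({
    "plant_name": ["barry", "barry 2", "big bend", "big bend sta"],
    "respondent_id": [195, 195, 271, 271],
    "year_constructed": [1954, 1954, 1970, 1970],
    "plant_kind": ["steam", "steam", "steam", "steam"],
})

# Vectorize the name strings and one-hot encode the categorical columns:
name_vecs = TfidfVectorizer(
    analyzer="char_wb", ngram_range=(2, 4)
).fit_transform(df["plant_name"])
cat_vecs = OneHotEncoder().fit_transform(
    df[["respondent_id", "year_constructed", "plant_kind"]]
)

# Re-scale each feature block independently before re-combining, so the
# blocks can be weighted differently (weights here are arbitrary):
weights = {"name": 1.0, "categorical": 0.5}
features = hstack([weights["name"] * name_vecs,
                   weights["categorical"] * cat_vecs])

# Matrix of pairwise cosine similarities between all records:
similarity = cosine_similarity(features)
```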

The clustering has a lot of interesting constraints to consider, which are both complicating and helpful. If what we're trying to do is identify individual plants, then we might leave out the respondent ID, and only play with the plant type, plant name, and construction year. But this would result in clusters where the same plants are being reported by different utilities -- in the same year when they have joint ownership, and in different years when the plants change hands.

Those true plant clusters will be interesting (and give a good indication of whether this is working) but they don't get us a clean time series for each plant, which is what will allow us to do regressions and estimate the variable & fixed operating expenses associated with each plant historically. Within each plant cluster, we then need to organize the records into a collection of time series -- choosing no more than one record from each report year for each responding utility.

There are several parameters that can be tuned to enhance the results, but doing so would require having a set of labeled test data -- plants that we have identified by hand as being associated with each other -- and I don't know how much we would need. Tuning parameters would include:

  • The length of n-grams used to vectorize the plant names
  • The relative weights of the four inputs.

Those parameters could be optimized using a grid search or some other automated mechanism. The method of extracting features from the string input is fuzzier to optimize.
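
As a sketch of what that search might look like (cluster_and_score() below is a placeholder for the real vectorize-cluster-compare-to-hand-labels step, and the parameter values are arbitrary):

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical search space: maximum n-gram length for the name
# vectorizer, plus relative weights for the name and categorical blocks.
param_grid = ParameterGrid({
    "ngram_max": [3, 4, 5],
    "name_weight": [1.0, 2.0],
    "category_weight": [0.25, 0.5, 1.0],
})

def cluster_and_score(params):
    """Placeholder: vectorize the records with these parameters, run the
    clustering, and score the result against hand-labeled plant groupings.
    Returns a dummy value here just so the loop below runs."""
    return params["name_weight"] / params["ngram_max"]

best_params = max(param_grid, key=cluster_and_score)
print(best_params)
```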

@zaneselvans (Member Author) commented:

A good post on several different ways to do categorical feature encoding.

@zaneselvans (Member Author) commented:

All the feature vectorizations seem to be working... but need to figure out the right weights & synthesize the time series.

Refining the above clustering to take explicit account of the fact that we're trying to create a time series for each plant... sounds a lot like the original CPI approach (duh):

  1. Calculate the distance metric (cosine similarity, etc.) between all pairs of plant records, with some set of weights for name, respondent id, year constructed, and plant kind, and some set of name vectorization parameters (n-gram lengths, etc.)
  2. Now iterate over all of the years of data. For each record in each year, find the most similar record in each of the other years, above some threshold of similarity (so if there really isn't a good match, the best but still bad match gets discarded). This will generate a cluster of plants around each plant record for each year. (A sketch of this step follows the list.)
  3. Any cluster of plant records that is identical for all the years in which it has a member is a matched (inter-year) plant!
  4. Then we need to look at the leftovers to see what we are missing...
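
A minimal sketch of step 2 (the names `similarity` and `years` and the threshold value are assumptions, not the actual implementation):

```python
import numpy as np
import pandas as pd

def best_match_per_year(rec, similarity, years, threshold=0.75):
    """For one record index, find the most similar record in every other
    report year, keeping only matches above the similarity threshold.

    `similarity` is the (n_records x n_records) cosine similarity matrix,
    and `years` is a pandas Series of report years, aligned by position.
    """
    matches = {}
    for yr in years.unique():
        if yr == years.iloc[rec]:
            continue
        candidates = np.flatnonzero((years == yr).to_numpy())
        if len(candidates) == 0:
            continue
        best = candidates[np.argmax(similarity[rec, candidates])]
        if similarity[rec, best] >= threshold:
            matches[yr] = best
    return matches
```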

Possible refinement: We know that many plants report individual generation units. Often this information only shows up in the plant name, which ends in a number, e.g. "2", or "Unit 2" or "#2" or "(2)" -- this is a small amount of text, but very important information for the purpose of matching up time series, which ought to be weighted separately in an ideal world. We could split it off into a 5th categorical feature field (e.g. unit_id) and remove it from the plant names, if there are a small enough number of patterns to match.
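
A sketch of what that split might look like (the regular expression below only covers the patterns mentioned above and would need checking against the real FERC name variants):

```python
import re

# Match a trailing unit designation like "2", "Unit 2", "#2", or "(2)":
UNIT_PATTERN = re.compile(r"\s*(?:unit\s*)?[#(]?(\d+)\)?\s*$", re.IGNORECASE)

def split_unit_id(plant_name):
    """Return (cleaned_name, unit_id), where unit_id may be None."""
    match = UNIT_PATTERN.search(plant_name)
    if match:
        return plant_name[:match.start()].strip(), match.group(1)
    return plant_name, None

print(split_unit_id("Big Bend Unit 2"))  # ('Big Bend', '2')
print(split_unit_id("Barry #3"))         # ('Barry', '3')
print(split_unit_id("Bowen (1)"))        # ('Bowen', '1')
print(split_unit_id("Comanche"))         # ('Comanche', None)
```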

@zaneselvans (Member Author) commented:

There are two main constraints that we have to work with in selecting records within a cluster:

  1. Each cluster may contain at most one record from each year.
  2. Each record may be part of at most one cluster.

An outline of a process that uses the above constraints (a rough sketch in code follows the list):

  • Filter similarity matrix with a threshold for acceptably good match.
  • For each record, find the best match in each year, subject to some minimum threshold of similarity.
  • For each record, get the list of all other records whose best matches it appears in.
  • Take the intersection of those two sets of records -- this is a candidate cluster.
  • Use each record in the candidate cluster as a seed to generate its own candidate cluster.
  • If all records in the candidate cluster generate the same cluster, assign them all an id, and remove them from the list of records that need to be matched.
  • Recalculate which records are most similar to each other, given the removal of the clustered records.
  • Optionally, relax the constraint that each record can appear in at most one cluster by not removing them or re-calculating the best matches, to evaluate how much duplication exists and the impact of using a greedy algorithm on the clustering.
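
Roughly, in code (the data structures here are placeholders: `best_matches[r]` is the set of per-year best matches for record r, and `appears_in[r]` is the set of records whose best-match sets contain r; the re-calculation step after each accepted cluster is omitted for brevity):

```python
def candidate_cluster(seed, best_matches, appears_in):
    """Intersection of the records that `seed` best-matches with the
    records whose best-match sets contain `seed`, plus the seed itself."""
    return (best_matches[seed] & appears_in[seed]) | {seed}

def assign_plant_ids(records, best_matches, appears_in):
    """Greedily assign an ID to every candidate cluster whose members
    all regenerate the same cluster when used as seeds."""
    plant_ids, next_id = {}, 0
    unmatched = set(records)
    for seed in records:
        if seed not in unmatched:
            continue
        cluster = candidate_cluster(seed, best_matches, appears_in)
        if all(
            candidate_cluster(r, best_matches, appears_in) == cluster
            for r in cluster
        ):
            for r in cluster:
                plant_ids[r] = next_id
            next_id += 1
            unmatched -= cluster
    return plant_ids
```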

zaneselvans added a commit that referenced this issue Jun 29, 2018
Moved some notebook functions into the FERC transform module, and began
the work of trying to build a scikit-learn FERCPlantClassifier that can
find the best parameters for extracting clusters of FERC plants, based
on the following features:
 - plant name (using TF-IDF and variable ngram lengths)
 - plant kind (gas, coal, etc. categorical)
 - plant capacity (numeric)
 - year of construction (categorical)
 - respondent ID (categorical)
 - supplement number (categorical)
 - row number (categorical)

Partially addresses issue #144.
zaneselvans added a commit that referenced this issue Jun 30, 2018
Added some helper functions to the FERCPlantClassifier class to help
support the model training/fitting, adapted from the sklearn notebook.

Also created a new notebook with a couple of examples of how to manually
categorize FERC plants, so we can create a training set to work with.

Partially addresses issue #144
zaneselvans added a commit that referenced this issue Jul 5, 2018
Further additions to the (as of yet not working) sklearn classifier /
transformer classes.  Including long docstrings filled with plaintive
questions. Progress toward addressing issue #144.
zaneselvans added a commit that referenced this issue Sep 7, 2018
Got basic versions of fit() and predict() working -- storing the entire
cosine similarity matrix inside the object, and requiring the original
dataframe of information about the plants to be passed in as part of the
__init__() function, which is obviously not ideal -- but for this problem
it seems like we have almost nothing *but* hyperparameters to vary. And
the distance metric that results in the categories kind of depends on
all of the records being available. Which is fine, but not the way these
things normally work.

Have still left out the single numerical variable, since it requires
reshaping to work in the Pipeline. It's all so ugly right now. Blargh!

Working toward issue #144
@zaneselvans zaneselvans added the glue PUDL specific structures & metadata. Stuff that connects datasets together. label Sep 28, 2018
@zaneselvans zaneselvans added this to the v0.1.0 release milestone Sep 28, 2018
@zaneselvans (Member Author) commented Sep 30, 2018

The FERC time series generation is now structurally working -- pulling out collections of records, at most one per year, which are similar to each other, based on:

  • respondent_id (as a categorical feature)
  • plant_name (using TF-IDF on substrings)
  • tot_capacity (as a numeric feature)
  • plant_kind (as a categorical feature, after the field is coded using a thesaurus of strings)
  • yr_const (year of plant construction, as a categorical feature)
  • const_type (construction type, as a categorical feature)

How well the model chooses records to put in a time series depends on:

  • the relative weightings of the above features
  • the minimum required similarity for a record to be potentially included in a series
  • the range of substring lengths used in the plant name vectorization

@zaneselvans zaneselvans changed the title Create internally consistent FERC Plant IDs across years Assign FERC Plant IDs based on consistent multi-year time series Sep 30, 2018
@zaneselvans zaneselvans changed the title Assign FERC Plant IDs based on consistent multi-year time series Assign FERC Plant IDs based on multi-year time series Sep 30, 2018
@zaneselvans zaneselvans changed the title Assign FERC Plant IDs based on multi-year time series Assign FERC Plant IDs Sep 30, 2018
zaneselvans added a commit that referenced this issue Oct 2, 2018
Created a basic scoring function in the FERCPlantClassifier that looks
at how similar the "true" vs. recovered plant time series are.  Removed
some other unused code that had been superseded. Also fixed an
assignment-on-slice warning that was lingering.

Working toward #144
zaneselvans added a commit that referenced this issue Oct 2, 2018
 * Finally moved the boilerplate FERC plant classification model into
   the transform module.
 * Re-organized the functions that were being used into methods within
   the FERCPlantClassifier class.
 * Tried and failed to vectorize the predict() method. It's still quite
   slow, but hopefully easier for humans to read.
 * The score() method still needs to be updated to work with the new
   predict() method.

Working toward issues #144 and #216
zaneselvans added a commit that referenced this issue Oct 18, 2018
Updated the plant_id_ferc1 assignment process to work on the
plants_steam_ferc1 proto-table (the dataframe as it is at the end of the
current transform step), since that's where the ID assignment will
ultimately need to take place.

This included swapping out the CPI plant type assignments, and reverting
to using our own (which have more categories, and are thus likely more
useful for matching plant records together).  The CPI plant types have
been preserved in the ever growing constants module.

Changed the names of the columns that are used in the FERC Plant
Classifier class. Updated pre-processing of some of those columns in the
transform step to ensure they work with the sklearn module.

Removed the ad-hoc FERC plant cleaning function which was hanging out
with the FERC Plant Classifier in the transform module, since it now
uses the canonical proto-table as input.

Progress toward #219 and #144
zaneselvans added a commit that referenced this issue Oct 19, 2018
At long last, the process for identifying related records in the FERC
plant table has been integrated into the ETL process.

There's still tweaking to be done with the values of the parameters
which are used to determine the related groups, and the remaining ~10%
of all records which aren't assigned to any FERC plant (but which are
still valid records) need to be dealt with. Refinements are listed as
other issues. This commit makes progress on #144 and closes #219
zaneselvans added a commit that referenced this issue Oct 19, 2018
During the FERC Form 1 transform process, the plant_id which is
generated and assigned in the steam_plants_ferc1 table is now inserted
into the fuel_ferc1 table, by merging on the shared fields that are
co-reported to FERC in the two tables: report_year, utility_id_ferc, and
plant_name -- in this and only this instance, it appears that the plant
name can serve as part of a foreign key -- the fuel and steam tables get
the exact same values.

Note that this merge *need not* result in the same plant_id_ferc1 being
associated with records having the same plant name and utility ID.
Those fields are just used to connect the FERC steam and fuel tables,
bringing across whatever ID has been assigned on a record-by-record
basis in the plants_steam_ferc1 ETL process.

This commit closes #219 and makes progress toward #144.
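
For reference, the merge described above is conceptually something like this (toy data and a simplified column set, not the actual PUDL transform code):

```python
import pandas as pd

# Toy stand-ins for the two FERC Form 1 proto-tables:
plants_steam = pd.DataFrame({
    "report_year": [2017, 2017],
    "utility_id_ferc": [145, 145],
    "plant_name": ["barry", "gadsden"],
    "plant_id_ferc1": [12, 34],
})
fuel = pd.DataFrame({
    "report_year": [2017, 2017, 2017],
    "utility_id_ferc": [145, 145, 145],
    "plant_name": ["barry", "barry", "gadsden"],
    "fuel": ["coal", "gas", "coal"],
})

# Bring the record-by-record plant_id_ferc1 across on the co-reported
# fields; the ID itself was assigned upstream in the steam-table ETL.
fuel = fuel.merge(
    plants_steam[["report_year", "utility_id_ferc", "plant_name",
                  "plant_id_ferc1"]],
    on=["report_year", "utility_id_ferc", "plant_name"],
    how="left",
)
```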
zaneselvans added a commit that referenced this issue Oct 19, 2018
This is mostly some naming cleanup, including:
 * Changed respondent_id to utility_id_ferc1
 * Changed all of the utility_id_ferc values to utility_id_ferc1.
 * Changed respondent_name to utility_name_ferc1.
 * Expanded util_id_pudl and util_plant_assn to utility_* since we're
   using the full word in the context of the other tables.

Also brought plant_id_ferc1 into the FERC output tables, in order to be
able to play with them more easily while integrating the new plant IDs
into the FERC output routines.

Closes #228, works toward #144.
@zaneselvans zaneselvans added this to To do in FERC_Form_1 Mar 29, 2019
@zaneselvans zaneselvans moved this from To do to In progress in FERC_Form_1 Mar 29, 2019
@zaneselvans zaneselvans moved this from In progress to To do in FERC_Form_1 Mar 29, 2019
@zaneselvans zaneselvans moved this from To do to In progress in FERC_Form_1 Apr 30, 2019
@zaneselvans zaneselvans moved this from In progress to To do in FERC_Form_1 Apr 30, 2019
@zaneselvans zaneselvans modified the milestones: 0.1.0, 0.2.0 Jun 28, 2019
@cmgosnell cmgosnell removed this from the 0.3.0 milestone Oct 4, 2019
@zaneselvans zaneselvans added the epic Any issue whose primary purpose is to organize other issues into a group. label Oct 31, 2019
@cmgosnell (Member) commented:

@zaneselvans can we close this? This is a "nothing is ever done" type of thing, but we've been using these IDs for a very long time at this point, and if we want to revamp them we can always come back to this issue to see what wasn't incorporated and what you were thinking of for next steps.

@zaneselvans (Member Author) commented:

Yeah, I guess so. I pulled #475 and #216 into #2386 since the memory intensity and slowness are significant hangups in the DAG right now.

I hope that at some point we can revisit this system with our learnings from the other record linkage projects we've done, and maybe get @katie-lamb's eyes on it.
