Contributing

catherinebirney edited this page Sep 25, 2023 · 8 revisions

Contributing to FLOWSA

FLOWSA contributions follow the standard GitHub workflow: fork this repository, make your changes, and open a pull request against the main repository for administrator review.

Please use separate commits for different functional changes so that individual changes can be applied with git cherry-pick.

Creating a Flow-By-Activity Dataset

  1. Write "instructions" for how to find the original data source being imported (e.g., a webpage). Write the instructions in a yaml file in the flowbyactivitymethods folder. See the README for information on configuring the yaml.

  2. Write any functions needed to pull, parse, and format the data in a single script with the same SourceName as that used in the flowbyactivitymethods yaml. Any functions written in this script should be called in the method yaml. See details in the README.

  3. Generate the Flow-By-Activity dataset by running

import flowsa 
flowsa.getFlowByActivity('EIA_MECS_Energy', '2018')
  4. Update the source catalog yaml with the information specific to the dataset.

FlowByActivity Naming Convention

Source dataset names are consistent across (1) the FlowByActivity dataset 'SourceName' columns, (2) the parquet file names, (3) the Crosswalk file names, and (4) the Source Catalog information. Source names are composed of two or three components. The first component is the agency that published the data. The second component is the name or acronym of the published dataset. The third component, if it exists, is the topic of the data parsed from the original dataset.

For example, of the four FlowByActivity datasets imported from the U.S. Department of Agriculture (USDA), three are pulled from the same dataset, the Census of Agriculture (CoA). To make data easier to find, the CoA data is separated by topic (Cropland, Livestock, Product Market Value). Because the FlowByActivity datasets are grouped by topic, some of the parquets contain multiple class types, so the Class type should be specified when calling the data. The USDA_CoA_Cropland dataframe, for instance, includes acreage information for crops (Class = Land) and the number of farms that grow a particular crop (Class = Other).
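The naming scheme and the need to select a Class can be sketched in a few lines of plain Python. This is only an illustration of the convention described above, not FLOWSA code: parse_source_name, filter_by_class, and the example rows are hypothetical, and the rows carry no real data.

```python
from typing import NamedTuple, Optional


class SourceName(NamedTuple):
    agency: str            # publishing agency, e.g. "USDA"
    dataset: str           # dataset name or acronym, e.g. "CoA"
    topic: Optional[str]   # optional topic, e.g. "Cropland"


def parse_source_name(name: str) -> SourceName:
    """Split a source name into agency, dataset, and optional topic.

    Hypothetical helper: assumes components are separated by underscores
    and that anything after the second underscore belongs to the topic.
    """
    parts = name.split("_", 2)
    if len(parts) < 2:
        raise ValueError(f"expected at least agency and dataset in {name!r}")
    topic = parts[2] if len(parts) == 3 else None
    return SourceName(parts[0], parts[1], topic)


def filter_by_class(rows, flow_class):
    """Keep only the rows whose 'Class' matches flow_class."""
    return [row for row in rows if row["Class"] == flow_class]


# Illustrative rows only; they stand in for records of a FlowByActivity table.
rows = [
    {"SourceName": "USDA_CoA_Cropland", "Class": "Land"},
    {"SourceName": "USDA_CoA_Cropland", "Class": "Other"},
]

name = parse_source_name("USDA_CoA_Cropland")
land_rows = filter_by_class(rows, "Land")
print(name.agency, name.dataset, name.topic, len(land_rows))
```

Treating everything after the second underscore as the topic keeps multi-word topics intact; that choice is an assumption of this sketch.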

Creating a Flow-By-Activity Crosswalk

A crosswalk linking a Flow-By-Activity's unique FlowNames to NAICS is required for each Flow-By-Activity unless the imported data are already attributed to NAICS.

  1. Generate a csv mapping each activity name in the Flow-By-Activity to NAICS 2012 codes and save it in the activitytosectormapping folder. These mapping files are only necessary for datasets that are not already NAICS based. Specify whether the activities are NAICS-like in the Source Catalog by defining activity_schema.
    • Most mapping files are created with a .py file in the Scripts folder; some crosswalks are created manually.
    • The mapping files do not include ratios for how an activity is mapped to a NAICS code when more than one NAICS code relates to an activity. Instead, ratios are created through the Flow-By-Sector methodology.
  2. If you create any of your own NAICS codes outside of the official NAICS, the master NAICS crosswalk must be recreated by running the functions in writeNAICScrosswalk.py.
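As a rough sketch of step 1, the snippet below writes an activity-to-NAICS mapping as csv using only the standard library. The column names, the write_crosswalk helper, the source name, and the example activity are assumptions made for illustration; check an existing file in the activitytosectormapping folder for the layout FLOWSA expects.

```python
import csv
import io

# Assumed column layout; compare with an existing crosswalk before reusing it.
FIELDNAMES = ["ActivitySourceName", "Activity", "SectorSourceName", "Sector"]


def write_crosswalk(mapping, source_name, out, sector_source="NAICS_2012_Code"):
    """Write one row per (activity, NAICS code) pair.

    An activity linked to several NAICS codes produces several rows and
    no ratio column, since ratios are derived later in Flow-By-Sector.
    """
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for activity, naics_codes in sorted(mapping.items()):
        for code in naics_codes:
            writer.writerow({
                "ActivitySourceName": source_name,
                "Activity": activity,
                "SectorSourceName": sector_source,
                "Sector": code,
            })


# Hypothetical activity mapped to two NAICS 2012 codes.
example_mapping = {"Oilseed crops": ["111110", "111120"]}

buffer = io.StringIO()
write_crosswalk(example_mapping, "EXAMPLE_Source", buffer)
print(buffer.getvalue())
```

Passing an open file handle instead of the StringIO buffer would write the same rows to disk.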

Creating a Flow-By-Sector Dataset

  1. Write "instructions" for how to attribute environmental data in the Flow-By-Activity datasets to economic sectors. Write the instructions in a yaml file in the flowbysectormethods folder. See the README for configuration information of the yaml.

    • Data in a Flow-By-Activity dataset can be attributed to NAICS based on the values in the 'FlowName' column. There are two options for identifying which FlowNames to attribute in an activity set in the method yaml. The first option is to list the FlowNames manually in the flowbysectormethods yaml. The second option is to create a csv listing the FlowNames and the activity set each belongs to, and to save the csv in the flowbysectoractivitysets folder. The scripts that generate the flowbysectoractivitysets files are in the Scripts folder.
  2. Add any functions required to help attribute a Flow-By-Activity dataset to NAICS in the same py file used to generate the Flow-By-Activity. These functions are optional, dependent on the data source.

  3. Generate the Flow-By-Sector dataset by running

import flowsa 
flowsa.getFlowBySector("Land_national_2012", download_FBAs_if_missing=True)
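The second option for defining activity sets (a csv of names and the activity set each belongs to) can be sketched as follows. The column names "name" and "activity_set", both helper functions, and the placeholder names are assumptions for illustration; the scripts in the Scripts folder define the format FLOWSA actually reads.

```python
import csv
import io
from collections import defaultdict


def write_activity_sets(assignments, out):
    """Write (name, activity_set) pairs as csv with a header row."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name", "activity_set"])
    writer.writerows(assignments)


def read_activity_sets(src):
    """Group names by activity set, preserving the order in the file."""
    groups = defaultdict(list)
    for row in csv.DictReader(src):
        groups[row["activity_set"]].append(row["name"])
    return dict(groups)


# Placeholder names; real files would list names from the dataset.
assignments = [
    ("example_name_a", "activity_set_1"),
    ("example_name_b", "activity_set_1"),
    ("example_name_c", "activity_set_2"),
]

buffer = io.StringIO()
write_activity_sets(assignments, buffer)
buffer.seek(0)
activity_sets = read_activity_sets(buffer)
print(activity_sets)
```

Reading the file back into a dictionary is a quick way to confirm that every name landed in exactly one activity set before referencing the csv from a method yaml.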

Flow-By-Activity and Flow-By-Sector Naming Convention

See explanation here.

NAICS Crosswalk

Included in the package is a NAICS Crosswalk, which maps NAICS across years for 2002, 2007, 2012, and 2017. At this time, all Flow-By-Sector dataframes are mapped to NAICS 2012 Codes. The basis of the crosswalk comes from USEEIO's mapping, which includes mapping NAICS to BEA codes.

The NAICS crosswalk contains some 7-digit NAICS 2012 Codes, which are not official US Census Codes. These 7-digit codes help link datasets when the available data cover only a component of a 6-digit NAICS (such as the USDA Irrigation and Water Management Survey and the USDA Census of Agriculture). The 7-digit codes can be aggregated to official NAICS, if specified in the FBS methodology.
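To make the aggregation idea concrete, here is a minimal sketch that rolls amounts keyed by 7-digit codes up to their 6-digit parent. It assumes the parent is simply the first six digits of the code; aggregate_to_official_naics, the 7-digit codes, and the amounts are hypothetical, and FLOWSA's own aggregation may work differently.

```python
from collections import defaultdict


def aggregate_to_official_naics(amounts, max_length=6):
    """Sum amounts after truncating each sector code to max_length digits.

    Assumes a 7-digit code's parent is its first six digits, which is an
    assumption of this sketch rather than documented FLOWSA behavior.
    """
    totals = defaultdict(float)
    for code, amount in amounts.items():
        totals[code[:max_length]] += amount
    return dict(totals)


# Made-up amounts: two hypothetical 7-digit codes and one 6-digit code.
example_amounts = {"1111101": 2.0, "1111102": 3.0, "111120": 4.0}

aggregated = aggregate_to_official_naics(example_amounts)
print(aggregated)
```

Codes that are already six digits or shorter pass through unchanged, so the function can be applied to a mix of official and unofficial codes.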

If you create any of your own NAICS codes, the NAICS crosswalk must be recreated by running the functions in writeNAICScrosswalk.py.