## Compiled Questions

## Question 1

### Dataset Overview 

Our project uses the dataset “Emissions by Unit and Fuel Type (Subparts C, D, AA)”, obtained from the U.S. Environmental Protection Agency (EPA) through its Greenhouse Gas Reporting Program (GHGRP). The dataset provides facility-level and unit-level greenhouse gas (GHG) emissions data from the reporting years 2011 through 2023, focusing on large stationary industrial sources that emit 25,000 metric tons or more of CO₂-equivalent (CO₂e) annually.

All emissions are reported in metric tons of CO₂e, calculated using global warming potentials (GWPs) from the IPCC’s Fourth Assessment Report (AR4) to standardize the impact of CO₂, CH₄, and N₂O across facilities and fuels.
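
As a rough illustration of how the CO₂e standardization works, the sketch below weights each gas by its 100-year AR4 global warming potential (CO₂ = 1, CH₄ = 25, N₂O = 298). The function name `to_co2e` and the example masses are hypothetical and are not taken from the dataset.

```python
# 100-year global warming potentials from the IPCC Fourth Assessment Report (AR4).
AR4_GWP = {"CO2": 1, "CH4": 25, "N2O": 298}


def to_co2e(mass_tonnes):
    """Convert a mapping of gas -> mass (metric tons) into total metric tons of CO2e."""
    return sum(AR4_GWP[gas] * mass for gas, mass in mass_tonnes.items())


# Hypothetical unit: these masses are illustrative only, not real reported values.
example_unit = {"CO2": 50_000.0, "CH4": 2.0, "N2O": 0.5}
total_co2e = to_co2e(example_unit)
print(total_co2e)  # 50000*1 + 2*25 + 0.5*298 = 50199.0
```

Because CH₄ and N₂O carry much larger weights than CO₂, even small masses of those gases can contribute noticeably to a unit's total CO₂e.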

### Dataset Structure 

The file we created, emissions_by_unit_and_fuel_type_c_d_aa.xlsb, contains four sheets that organize and describe greenhouse gas emissions data collected by the EPA:

- The UNIT_DATA sheet lists individual emission units within each facility (e.g., boilers, turbines, process heaters). It includes variables such as Facility ID, Facility Name, State, NAICS Code, Year, Unit Type, Reporting Method, Maximum Heat Input, and emissions for CO₂, CH₄, N₂O, Biogenic CO₂, and Total CO₂e, allowing comparison of emissions intensity across facilities and equipment types.
- The FUEL_DATA sheet links facility and unit emissions to the type of fuel used, with columns for Facility ID, Unit ID, Industry Type, General Fuel Type, Specific Fuel Type, Blend Fuel Name, Other Fuel Name, and CH₄ and N₂O emissions (mt CO₂e). Most “blend fuel” and “other fuel” fields are empty, indicating that single-fuel systems dominate.
- The Industry Type sheet defines each reporting subpart (C, D, AA) and connects it to its industrial category (such as Stationary Fuel Combustion, Electricity Generation, or Pulp and Paper Manufacturing), helping distinguish between different emission sources.
- The FAQs about this Data sheet provides definitions for key variables (Facility ID, FRS ID, NAICS Code), explains how biogenic CO₂ is reported, lists the Global Warming Potentials (GWPs) used in calculations, and includes official EPA links for verifying and exploring emissions data.

### Provenance: Who Collected the Data and Why 

The dataset was collected and published by the U.S. Environmental Protection Agency (EPA) under the Greenhouse Gas Reporting Program (GHGRP), which was established to track and reduce industrial greenhouse gas emissions.
Facilities that emit ≥25,000 metric tons CO₂e per year are legally required to report their emissions annually under 40 CFR Part 98.

The EPA collects this data to:

1. Quantify and monitor large-scale greenhouse gas emissions across U.S. industries.

2. Ensure transparency and compliance with federal climate policy.

3. Provide publicly accessible data for research, modeling, and policymaking.

Each submission is verified through EPA’s quality-assurance protocols, which are intended to keep the dataset accurate, standardized, and consistent across reporting years.

### Missing Data and Limitations 

The dataset is largely complete and reliable, though a few areas show minor gaps that could influence the analysis. The Biogenic CO₂ and blend fuel columns are missing for most facilities, since many industrial sites do not use biomass or mixed-fuel systems. The Maximum Heat Input Capacity column is occasionally unreported, as it is optional for some facilities. In a few cases, facility identifiers have been withheld by the EPA as Confidential Business Information to protect proprietary data. It is also worth noting that the dataset only includes facilities emitting 25,000 metric tons or more of CO₂-equivalent per year, so smaller emitters are not represented and the focus is on large industrial sources. Despite these limitations, the essential variables such as Facility ID, Reporting Year, Industry Type, and Total Emissions for CO₂, CH₄, and N₂O are over 99 percent complete. This makes the dataset highly dependable and provides a solid foundation for modeling greenhouse gas emissions and analyzing uncertainty.
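
A completeness figure like the one quoted above can be checked per column with a few lines of code. The sketch below is one possible approach, not necessarily the one we used; the column names and the four rows are hypothetical stand-ins for the real sheet.

```python
def completeness(rows, columns):
    """Return, for each column, the fraction of rows with a non-missing value."""
    n = len(rows)
    return {
        col: sum(1 for row in rows if row.get(col) not in (None, "")) / n
        for col in columns
    }


# Hypothetical rows standing in for UNIT_DATA; None marks an unreported field.
rows = [
    {"facility_id": 1, "total_co2e": 30000.0, "biogenic_co2": None, "max_heat_input": 120.0},
    {"facility_id": 2, "total_co2e": 45000.0, "biogenic_co2": None, "max_heat_input": None},
    {"facility_id": 3, "total_co2e": 27000.0, "biogenic_co2": 500.0, "max_heat_input": 80.0},
    {"facility_id": 4, "total_co2e": 90000.0, "biogenic_co2": None, "max_heat_input": 300.0},
]

report = completeness(rows, ["facility_id", "total_co2e", "biogenic_co2", "max_heat_input"])
for column, fraction in report.items():
    print(f"{column}: {fraction:.0%} complete")
```

In this toy example the identifier and total-emissions columns are fully populated while the biogenic CO₂ column is mostly empty, which mirrors the pattern described for the real data.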

## Question 2

[missing]

## Question 3*

How are you fitting your model to the phenomenon to get realistic properties of the data?
- We had to aggregate all the cities together to get one row per state and year.
- We focused on non-biogenic CO2 emissions because they come from non-renewable carbon sources such as coal, oil, natural gas, and petroleum products. These emissions are human-caused rather than the result of natural processes, so they can be addressed directly, which makes them compelling from a media and communications perspective.
- We wanted a sense of emissions per unit area. Some states have larger total emissions simply because they are larger, and it is not good practice to label such a state a "large emitter" without taking this into account. Normalizing by area gave us a more realistic picture of the data.
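
The aggregation and area normalization described above can be sketched as follows. This is a minimal illustration using only the standard library; the state codes `XX` and `YY`, the emission values, and the areas are all made up and do not come from the dataset.

```python
from collections import defaultdict

# Hypothetical records: (state, year, non-biogenic CO2 in metric tons).
records = [
    ("XX", 2022, 120_000.0),
    ("XX", 2022, 80_000.0),
    ("XX", 2023, 150_000.0),
    ("YY", 2022, 30_000.0),
    ("YY", 2023, 26_000.0),
]

# Hypothetical state areas in square miles (placeholders, not official figures).
state_area_sq_mi = {"XX": 200_000.0, "YY": 2_000.0}


def aggregate_by_state_year(records):
    """Sum emissions so that each (state, year) pair ends up with one total."""
    totals = defaultdict(float)
    for state, year, emissions in records:
        totals[(state, year)] += emissions
    return dict(totals)


def emissions_per_area(totals, areas):
    """Divide each (state, year) total by the area of that state."""
    return {key: total / areas[key[0]] for key, total in totals.items()}


totals = aggregate_by_state_year(records)
per_area = emissions_per_area(totals, state_area_sq_mi)
for (state, year), value in sorted(per_area.items()):
    print(state, year, round(value, 2))
```

In this toy data the larger state has the higher total but the lower emissions per square mile, which is the kind of reversal the normalization is meant to expose.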
 
What challenges did you have to overcome? 
- Emissions were large and spanned a wide range, so taking the log compressed the scale and made the data easier to visualize. The overall quality of the data was good, in part because we cleaned it before building the visualizations.
- State area was also large with a wide range; taking the log solved the same problems.
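
To show why the log transform helps, the sketch below compares the spread of some values before and after taking the log. The base-10 logarithm is an assumption (the base is not essential to the argument), and the emission values are hypothetical.

```python
import math

# Hypothetical state-level emissions in metric tons, spanning several orders of magnitude.
emissions = [2.6e4, 3.0e5, 4.5e6, 1.2e8]

# Base-10 log is an assumption here; a natural log would compress the scale similarly.
log_emissions = [math.log10(value) for value in emissions]

raw_ratio = max(emissions) / min(emissions)        # largest value relative to smallest
log_span = max(log_emissions) - min(log_emissions)  # same spread on the log scale

print(f"raw max/min ratio: {raw_ratio:,.0f}")
print(f"log10 span: {log_span:.2f}")
```

A ratio in the thousands on the raw scale becomes a span of fewer than four units on the log scale, so small and large emitters can be read off the same axis.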

## Question 4

## Question 5

When we bootstrapped by state, the resampled sequences did not preserve the properties of the training data. Each state has only about 10 entries, so the sample size per state was very small and the bootstrap kept drawing the same values over and over. This led to low variance within each state, indicated by the high peaks in the KDE and the smoother ECDF. These per-state estimates are therefore not credible or reliable, because the sample size per state is too small.

Looking at the overall KDE and ECDF of the underlying data, there was a high density of emitting states toward the right side of the graph (higher emissions). However, because the underlying KDE also had a secondary peak toward the left, the averaged bootstrap values had a higher density at a lower emission value. Moreover, the KDE of the computed statistics looked approximately normal, indicating that our bootstrap drew enough resamples for the distribution of the statistic to approach normality. The ECDF of the bootstrapped values likewise reflects a normal distribution. Overall, bootstrapping the full data provided more credible and reliable results than bootstrapping by state.
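
The contrast between the two bootstraps can be reproduced in miniature. The sketch below is not our actual analysis code: `bootstrap_means` is a hypothetical helper, the statistic is assumed to be the mean, and both samples are synthetic log-emission values invented for illustration.

```python
import random
import statistics


def bootstrap_means(values, n_resamples=1000, seed=0):
    """Resample `values` with replacement and return the mean of each resample."""
    rng = random.Random(seed)
    n = len(values)
    return [statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)]


# Hypothetical log-emissions for a single state with about 10 entries.
state_values = [6.10, 6.12, 6.08, 6.15, 6.11, 6.09, 6.13, 6.10, 6.14, 6.12]

# Hypothetical pooled sample: 50 made-up states with 10 entries each.
gen = random.Random(42)
pooled_values = []
for _ in range(50):
    state_level = gen.gauss(6.0, 1.0)  # synthetic state-specific level
    pooled_values.extend(gen.gauss(state_level, 0.05) for _ in range(10))

state_boot = bootstrap_means(state_values)
pooled_boot = bootstrap_means(pooled_values)

print("per-state bootstrap spread:", statistics.stdev(state_boot))
print("pooled bootstrap spread:   ", statistics.stdev(pooled_boot))
```

The per-state bootstrap means cluster very tightly because each resample can only recombine the same 10 nearby values. That narrow spread describes the handful of observed points rather than the real uncertainty about the state, which is why a sharp KDE peak in this setting should not be read as a precise estimate.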


## Question 6