# Greenhouse Gas Concentration Analysis  
**Data Source**:  
NOAA Global Monitoring Laboratory ([gml.noaa.gov ...](https://gml.noaa.gov/data/data.php?site=MLO&category=Greenhouse%2BGases))  

**Dataset README**:  
NOAA MLO Methane Flask dataset README ([gml.noaa.gov/aftp ...](https://gml.noaa.gov/aftp/data/trace_gases/ch4/flask/surface/README_ch4_surface-flask_ccgg.html))

## Dataset   
- Gas: CH4 (Methane)
- Measurement method: Surface Flask
- URL:  https://gml.noaa.gov/aftp/data/trace_gases/ch4/flask/surface/txt/ch4_mlo_surface-flask_1_ccgg_event.txt 
- Unit: ppb (nmol/mol)

## Key Notes  
- **`date` column**: Sampling date stored as timezone-naive `datetime64[ns]`, with the time of day dropped.
- **`value` column**: Gas concentration in dry air, in ppb (nmol/mol).
- **`value_unc` column**: The estimated uncertainty in the reported value, in ppb (nmol/mol). Missing data are coded as `-999.999`. The corresponding `value` entry for missing data is usually, but not always, `-999.99`.
- **`qcflag` column**: 3-character flags to indicate retained or rejected flask results as follows:

If the first character is not a period, the sample result should be rejected for scientific use due to sample collection and/or measurement issues. A second character other than a period indicates a sample that has no identifiable measurement or sampling errors but does not meet the selection criteria for representativeness (such as midday sampling or background air sampling), or is otherwise atypical for a given sampling location and season. A third character other than a period indicates noteworthy circumstances that are not known to affect the data quality but have the potential to.

| Indication | Flag | Description |
|------------|------|-------------|
| Retained | `...` | good pair, no other issues |
| Retained | `..*` | good pair, no definitive issues |
| Rejected | `M..` | sample measurement issue |
| Rejected | `C..` | sample collection issue |
| Rejected | `B..` | both measurement and collection issues |
| Selection | `.S.` | selection issue, for example a high or low mole fraction thought not to represent background conditions |
| Informational | `..M` | informational measurement tag or potential measurement issue |
| Informational | `..C` | informational collection tag or potential collection issue |

- **`method` column**: single-character code used to identify the sample collection method as follows:

| Code | Description |
|------|-------------|
| P | Sample collected using a portable, battery-powered pumping unit. Two flasks are connected in series, flushed with air, and then pressurized to 1.2 to 1.5 times ambient pressure. |
| D | Similar to P but the air passes through a condenser cooled to about 5 deg C to partially dry the sample. |
| G | Similar to D but with a gold-plated condenser. |
| T | Evacuated flask filled by opening an O-ring sealed stopcock. |
| S | Flasks filled at NOAA GML observatories by sampling air from the in situ CO2 measurement air intake system. |
| N | Before 1981, flasks filled using a hand-held aspirator bulb. After 1981, flasks filled using a pump different from those used in method P, D, or G. |
| F | Five-liter evacuated flasks filled by opening a ground-glass, greased stopcock. |
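
The positional rules for `qcflag` described above can be sketched as a small helper. This is only an illustration: the function name `classify_qcflag`, the class labels, and the sample rows are hypothetical and are not part of the NOAA dataset or its README.

```python
import pandas as pd


def classify_qcflag(flag: str) -> str:
    """Return a coarse class for a 3-character qcflag (labels are illustrative)."""
    if len(flag) != 3:
        raise ValueError(f"expected a 3-character flag, got {flag!r}")
    if flag[0] != ".":
        return "rejected"       # collection and/or measurement issue
    if flag[1] != ".":
        return "selection"      # valid sample, but not representative
    if flag[2] != ".":
        return "informational"  # retained, with a note attached
    return "retained"


# Made-up rows for illustration only; not taken from the dataset.
sample = pd.DataFrame(
    {
        "value": [1850.0, 1851.0, 1852.0, 1853.0],
        "qcflag": ["...", ".S.", "M..", "..C"],
    }
)
sample["qc_class"] = sample["qcflag"].map(classify_qcflag)

# Keep only rows that are neither rejected nor flagged for selection.
background = sample[sample["qcflag"].str[:2] == ".."]
print(background)
```

Whether selection-flagged samples should be dropped depends on the analysis; the filter on the first two characters is one possible choice, not a recommendation from the data provider.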

# Libraries

In [2]:
import pandas as pd
from pathlib import Path
import os

# Make Directories

In [3]:
repo_root = Path(__file__).resolve().parents[2] if "__file__" in globals() else Path.cwd().parent
output_dir = repo_root / "data" / "processed"
output_dir.mkdir(parents=True, exist_ok=True)

# Load Datasets and Save CSV

In [14]:
df = pd.read_csv(r'https://gml.noaa.gov/aftp/data/trace_gases/ch4/flask/surface/txt/ch4_mlo_surface-flask_1_ccgg_event.txt', 
                    sep=r'\s+', 
                    comment='#', 
                    header=0)

# select features/columns
cols_to_keep = ['datetime', 'value', 'value_unc', 'qcflag', 'method']
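# .copy() makes the subset independent, so adding columns below does not raise SettingWithCopyWarning
df = df[[c for c in cols_to_keep if c in df.columns]].copy()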

# convert datetime column to date, without time info, and set as the index 
df['date'] = pd.to_datetime(df['datetime'], errors='coerce').dt.tz_localize(None).dt.normalize()
df.drop(columns=['datetime'], inplace=True)
df = df.set_index('date')

# save a copy to csv
df.to_csv(output_dir / 'CH4_raw_dropped_cols.csv', index=True)

df

| date | value | value_unc | qcflag | method |
|------|-------|-----------|--------|--------|
| 1983-05-06 | 1659.334 | 3.300 | `...` | P |
| 1983-05-06 | 1654.272 | 3.300 | `...` | P |
| 1983-05-13 | 1645.170 | 3.300 | `...` | P |
| 1983-05-20 | 1631.007 | 3.300 | `...` | P |
| 1983-05-20 | 1627.970 | 3.300 | `...` | P |
| ... | ... | ... | ... | ... |
| 2024-12-24 | 1954.760 | 0.575 | `...` | S |
| 2024-12-31 | 1975.490 | 0.575 | `.S.` | P |
| 2024-12-31 | 1974.960 | 0.575 | `.S.` | P |
| 2024-12-31 | 1979.990 | 0.575 | `.S.` | S |
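
The load step above passes no `na_values` to `read_csv`, so the missing-data codes described under Key Notes are read as ordinary numbers. A minimal sketch of masking them follows. Because the notes say the `value` code is not always exactly `-999.99`, the sketch treats anything at or below a cutoff as missing; the cutoff `MISSING_THRESHOLD`, the helper name `mask_missing`, and the demo rows are assumptions made for illustration, not something specified by NOAA.

```python
import pandas as pd

# Assumption: real CH4 values and uncertainties are never this negative,
# so anything at or below the cutoff is treated as a missing-data code.
MISSING_THRESHOLD = -900.0


def mask_missing(frame: pd.DataFrame, columns=("value", "value_unc")) -> pd.DataFrame:
    """Return a copy of `frame` with sentinel codes in `columns` replaced by NaN."""
    cleaned = frame.copy()
    for col in columns:
        # Series.where keeps entries where the condition holds, NaN elsewhere.
        cleaned[col] = cleaned[col].where(cleaned[col] > MISSING_THRESHOLD)
    return cleaned


# Made-up rows for illustration only; not taken from the dataset.
demo = pd.DataFrame(
    {
        "value": [1850.0, -999.99, 1852.0],
        "value_unc": [3.3, -999.999, 0.575],
    },
    index=pd.to_datetime(["2000-01-01", "2000-01-08", "2000-01-15"]),
)
cleaned = mask_missing(demo)
print(cleaned)
```

Working on a copy leaves the raw frame untouched, so the saved CSV and the cleaned view can be compared later.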
