# Exploratory Data Analysis: Greenhouse Gases  
**Data Source**: Processed NOAA data (cleaned in `1_data_loading.ipynb`)  

## Dataset Overview  
| Gas | Unit | # Observations | Time Range |  
|------|------|---------------|------------|  
| CO2 (Carbon Dioxide) | ppm | `df[df['gas']=='CO2'].shape[0]` | `df[df['gas']=='CO2']['date'].min()` to `max()` |  
| CH4 (Methane) | ppb | `df[df['gas']=='CH4'].shape[0]` | `df[df['gas']=='CH4']['date'].min()` to `max()` |  
| N2O (Nitrous Oxide) | ppb | `df[df['gas']=='N2O'].shape[0]` | `df[df['gas']=='N2O']['date'].min()` to `max()` |  
| CO (Carbon Monoxide) | ppb | `df[df['gas']=='CO'].shape[0]` | `df[df['gas']=='CO']['date'].min()` to `max()` |  
| H2 (Hydrogen) | ppb | `df[df['gas']=='H2'].shape[0]` | `df[df['gas']=='H2']['date'].min()` to `max()` |  
| SF6 (Sulfur Hexafluoride) | ppt | `df[df['gas']=='SF6'].shape[0]` | `df[df['gas']=='SF6']['date'].min()` to `max()` |  
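
The per-gas expressions in the table can be collapsed into a single grouped aggregation. The following is a minimal sketch, assuming a long-format frame `df` with `gas`, `date`, and `value` columns as the table cells imply; the sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical long-format frame with the columns the table cells assume
df = pd.DataFrame({
    'gas':   ['CO2', 'CO2', 'CO2', 'CH4', 'CH4'],
    'date':  pd.to_datetime(['1969-08-20', '1969-08-27', '1969-09-02',
                             '1983-05-06', '1983-05-13']),
    'value': [320.9, 321.1, None, 1650.2, 1652.8],
})

# One row per gas: row count (same as shape[0]), first and last sampling date
overview = df.groupby('gas').agg(
    n_rows=('value', 'size'),
    start=('date', 'min'),
    end=('date', 'max'),
)
print(overview)
```

Using `'count'` instead of `'size'` would report only the non-null observations, which may be the more useful number once fill values have been converted to NaN.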

## Key Notes  
- **Focus Gas**: CH4 (Methane) - primary analysis target  
- **Data Quality**:  
  - Missing values: all rows kept (missing values are stored as NaN)
  - Negative values: converted to NaN  
- Raw data sources and cleaning steps documented in [`1_data_loading.ipynb`](../notebooks/1_data_loading.ipynb).  

# Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load Data

In [10]:
# define dataset URLs
datasets = {
    'CH4': 'https://gml.noaa.gov/aftp/data/trace_gases/ch4/flask/surface/txt/ch4_mlo_surface-flask_1_ccgg_event.txt',
    'N2O': 'https://gml.noaa.gov/aftp/data/trace_gases/n2o/flask/surface/txt/n2o_mlo_surface-flask_1_ccgg_event.txt',
    'SF6': 'https://gml.noaa.gov/aftp/data/trace_gases/sf6/flask/surface/txt/sf6_mlo_surface-flask_1_ccgg_event.txt',
    'CO2': 'https://gml.noaa.gov/aftp/data/trace_gases/co2/flask/surface/txt/co2_mlo_surface-flask_1_ccgg_event.txt',
    'CO': 'https://gml.noaa.gov/aftp/data/trace_gases/co/flask/surface/txt/co_mlo_surface-flask_1_ccgg_event.txt',
    'H2': 'https://gml.noaa.gov/aftp/data/trace_gases/h2/flask/surface/txt/h2_mlo_surface-flask_1_ccgg_event.txt'
}
# Note: all datasets have the same features, except CO, which lacks "value_unc".

# Load one raw dataset, keeping only the datetime and value columns
def load_gas_data(url, gas_type):
    df = pd.read_csv(url, sep=r'\s+', comment='#', header=0)

    # select features/columns (copy so the assignments below do not
    # trigger a SettingWithCopyWarning)
    cols_to_keep = ['datetime', 'value']
    df = df[cols_to_keep].copy()

    # convert datetime column to date, without time info
    df['date'] = pd.to_datetime(df['datetime']).dt.tz_localize(None).dt.normalize()
    df = df.drop(columns=['datetime'])

    # add a feature/column for gas type
    df['gas'] = gas_type

    return df

# load all datasets
df_list = [load_gas_data(url, gas) for gas, url in datasets.items()]

# concatenate all sets into a single dataframe
df_all = pd.concat(df_list)

# align data based on date (from datetime); pivot_table averages samples
# that share a date, because its default aggfunc is 'mean'
df_combined = df_all.pivot_table(index='date', columns='gas', values='value')

# reset the index
df_combined.reset_index(inplace=True)

# save a copy to csv
df_combined.to_csv('all_ghg_aligned.csv', index=False)

df_combined.head()

gas,date,CH4,CO,CO2,H2,N2O,SF6
0,1969-08-20,,,-5.27,,,
1,1969-08-27,,,-2.1625,,,
2,1969-09-02,,,-9.115,,,
3,1969-09-12,,,320.945,,,
4,1969-09-24,,,320.89,,,


# EDA 

## Overview

In [8]:
df_combined.describe()

gas,date,CH4,CO,CO2,H2,N2O,SF6
count,2562,2131.0,1825.0,2561.0,790.0,1498.0,1496.0
mean,2000-04-04 02:32:19.110070400,1786.120962,91.608401,366.723396,543.061065,323.71185,6.416216
min,1969-08-20 00:00:00,-99.433333,-999.99,-551.175667,226.026,58.226,-246.425
25%,1987-10-24 18:00:00,1753.305,77.1175,349.615,535.39875,318.040625,5.29
50%,2000-05-01 12:00:00,1791.18,89.965,370.0,544.2225,323.6725,7.17625
75%,2012-08-27 18:00:00,1842.66125,105.54,394.6925,553.233125,330.458125,9.4925
max,2025-04-03 00:00:00,1989.3775,248.6425,508.15175,596.36,339.2125,12.345
std,,133.264558,46.026747,62.016673,23.384055,15.493204,14.933516


## Datatypes

In [9]:
df_combined.dtypes

gas
date    datetime64[ns]
CH4            float64
CO             float64
CO2            float64
H2             float64
N2O            float64
SF6            float64
dtype: object

## Null Values

In [None]:
df_combined.isnull().sum()

In [None]:
# The fill value for 'value' (-999.999) is effectively a null value,
# so I will count how often it occurs.
# The describe() output above shows a CO minimum of -999.99, so that value is counted too.

fillvalue_counts = df_combined.iloc[:, 1:].isin([-999.99, -999.999]).sum()
fillvalue_counts
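
An exact-match count can miss fill values that were averaged with valid readings, since `pivot_table` takes the mean of samples sharing a date. This might explain minima in the `describe()` output such as -551.18 for CO2, which are negative but not equal to any sentinel. The sketch below uses made-up numbers and assumes the sentinel is -999.99 (inferred from the CO minimum, not confirmed against the NOAA file headers); `FILL_VALUE` and `df_long` are illustrative names.

```python
import numpy as np
import pandas as pd

FILL_VALUE = -999.99  # assumption: sentinel inferred from the CO minimum in describe()

# Made-up long-format data: two CO2 samples share a date, one is a fill value
df_long = pd.DataFrame({
    'date':  pd.to_datetime(['2000-01-01', '2000-01-01', '2000-01-08']),
    'gas':   ['CO2', 'CO2', 'CO2'],
    'value': [370.0, FILL_VALUE, 371.0],
})

# Pivot first: the sentinel is averaged with the valid reading
blended = df_long.pivot_table(index='date', columns='gas', values='value')

# Mask first: the sentinel becomes NaN and is skipped by the mean
masked = df_long.assign(value=df_long['value'].replace(FILL_VALUE, np.nan))
clean = masked.pivot_table(index='date', columns='gas', values='value')

print(blended['CO2'].iloc[0])  # mean of 370.0 and the fill value
print(clean['CO2'].iloc[0])    # the valid reading only
```

If this is what happens in the real data, replacing the sentinel inside the loading function, before concatenation and pivoting, would keep same-date valid readings that the later negative-value mask otherwise discards.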

Null (NaN) values are data points that were not collected or recorded. For every gas except CO2, many of the null values fall at the beginning of the time series, because measurement of the other gases began after the first CO2 measurement.

## Measurement Start Dates
- CO2: 1969-08-20
- CH4: 1983-05-06
- CO: 1989-07-07
- N2O: 1995-12-15
- SF6: 1995-12-15

I will not impute null values that fall before the first measurement date of each gas.
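
The start dates and the split between pre-start nulls and later gaps can be derived from the data rather than read off by hand. This is a minimal sketch on a made-up frame shaped like `df_combined`; the names `demo`, `start_dates`, and `gaps_to_impute` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up wide-format frame: a date column plus one column per gas
demo = pd.DataFrame({
    'date': pd.to_datetime(['1969-08-20', '1983-05-06', '1983-05-13', '1983-05-20']),
    'CO2':  [320.9, 343.1, np.nan, 343.6],
    'CH4':  [np.nan, 1650.2, np.nan, 1655.0],
})

indexed = demo.set_index('date')

# First date with a non-null value for each gas
start_dates = indexed.apply(lambda col: col.first_valid_index())

# Flag rows on or after each gas's start date
after_start = pd.DataFrame(
    {gas: indexed.index >= start_dates[gas] for gas in indexed.columns},
    index=indexed.index,
)

# NaNs on or after the start date are the candidates for imputation
gaps_to_impute = (indexed.isna() & after_start).sum()

print(start_dates)
print(gaps_to_impute)
```

Here the CH4 null on 1969-08-20 is excluded because it precedes the first CH4 measurement, while the null on 1983-05-13 is counted for both gases.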

## Negative Values

In [None]:
# There are some negative values for gas concentration.  This doesn't make physical sense. 
# One possible explanation is that the GC sensor was zeroed incorrectly. Either way, I will 
# likely set them to NaN.  First, inspect:

neg_value_count = (df_combined.iloc[:,1:] < 0).sum()
neg_value_count

In [None]:
# replace negative values with NaN

df_combined.iloc[:,1:] = df_combined.iloc[:,1:].mask(df_combined.iloc[:,1:] < 0, np.nan)
                          
new_neg_count = (df_combined.iloc[:,1:] < 0).sum()
new_neg_count

All NaN values that fall on or after each gas's measurement start date will be imputed during preprocessing.  

In [None]:
# Store the new DataFrame as CSV

df_combined.to_csv('all_ghg_aligned_nan.csv', index=False)

## Data Distribution

In [None]:
# inspect the distribution and outliers of each dataset

plt.figure(figsize=(12,8))
sns.boxplot(data=df_combined.iloc[:,1:])
plt.title('Boxplot of Gas Concentration')
plt.ylabel('Gas Concentration')
plt.xlabel('Gas Type')
plt.show()
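
Because the gases are reported in different units (ppm, ppb, ppt) and span very different magnitudes, a shared y-axis compresses the low-valued gases such as SF6 into a flat line. One possible alternative is a panel per gas, each with its own scale. The sketch below uses randomly generated stand-in data, not the NOAA measurements, and plain matplotlib; `demo` and `units` are illustrative names.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up wide-format data standing in for the gas columns of df_combined
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'CO2': rng.normal(370, 20, 200),
    'CH4': rng.normal(1790, 50, 200),
    'SF6': rng.normal(6, 2, 200),
})
units = {'CO2': 'ppm', 'CH4': 'ppb', 'SF6': 'ppt'}

# One panel per gas so each keeps its own y-axis scale
fig, axes = plt.subplots(1, len(demo.columns), figsize=(12, 4))
for ax, gas in zip(axes, demo.columns):
    ax.boxplot(demo[gas].dropna())  # drop NaN first; boxplot does not skip it
    ax.set_title(gas)
    ax.set_ylabel(f'Concentration ({units[gas]})')
    ax.set_xticks([])
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards.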

## Data Frequency (per year)

In [None]:
# The seasonality appears as a single cycle per year for each gas. 
# Confirm the number of datapoints per year for each gas.

df_counts = df_combined.copy()

df_counts['year'] = df_counts['date'].dt.year # extract the year
yearly_counts = df_counts.groupby('year').count()
yearly_counts

In [None]:
# As expected, the number of data points per year varies.
# I will take the most common (modal) yearly count for each gas and use it
# for signal decomposition, preprocessing, and modeling.

seasonal_mode = yearly_counts.replace(0, np.nan).mode().iloc[0] 
seasonal_mode
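
To make the modal-count step concrete, here is a small worked example on made-up data: one gas sampled every 7 days and another recorded only every second visit. The names `demo` and `counts` are illustrative, and the sampling pattern is an assumption, not the actual NOAA schedule.

```python
import numpy as np
import pandas as pd

# Made-up sampling schedule: one visit every 7 days over three full years
dates = pd.date_range('2000-01-01', '2002-12-31', freq='7D')
demo = pd.DataFrame({'date': dates})
demo['CH4'] = 1.0                                                  # measured every visit
demo['N2O'] = np.where(np.arange(len(demo)) % 2 == 0, 1.0, np.nan)  # every other visit

# Non-null observations per year for each gas
counts = (
    demo.assign(year=demo['date'].dt.year)
        .groupby('year')[['CH4', 'N2O']]
        .count()
)

# Most common yearly count per gas, ignoring years with no data
seasonal_mode = counts.replace(0, np.nan).mode().iloc[0]

print(counts)
print(seasonal_mode)
```

With this schedule the yearly counts are 53, 52, 52 for CH4 and 27, 26, 26 for N2O, so the modes are 52 and 26. Note that `mode()` can return several rows when counts tie, and `.iloc[0]` then picks the smallest tied value.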