**Breakdown**

**Data Preparation**
1. Load all data from the directory.
2. Organize the data using pandas. Save one CSV of the final feature columns and a separate CSV for `soil_temperature_level_1`, `soil_temperature_level_2`, `soil_temperature_level_3`, and `soil_temperature_level_4`.
3. Use the soil temperature entries to calculate Ti and Fi:
   * Ti: sum of positive values per year
   * Fi: sum of negative values per year
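
Step 2 could look roughly like the sketch below, shown on a toy DataFrame. The file names `features.csv` and `soil_temperature.csv`, the temporary output directory, and the sample values are placeholders chosen for illustration, not names used by this project.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the combined data (values are illustrative only)
df = pd.DataFrame({
    "date": ["2001-01-01", "2001-01-02"],
    "snow_depth": [0.31, 0.33],
    "soil_temperature_level_1": [-4.2, -4.8],
    "soil_temperature_level_2": [-3.1, -3.4],
    "soil_temperature_level_3": [-1.0, -1.1],
    "soil_temperature_level_4": [0.4, 0.4],
})

soil_cols = [f"soil_temperature_level_{i}" for i in range(1, 5)]

# Keep 'date' in both files so rows can be matched up again later
soil_df = df[["date"] + soil_cols]
feature_df = df.drop(columns=soil_cols)

out_dir = Path(tempfile.mkdtemp())  # placeholder output location
feature_df.to_csv(out_dir / "features.csv", index=False)
soil_df.to_csv(out_dir / "soil_temperature.csv", index=False)
```

Writing with `index=False` keeps the row index out of the files, so reloading them does not produce an extra `Unnamed: 0` column.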

Remaining steps:
- feature engineering
- feature analysis
- model training (PyTorch or scikit-learn)
- testing the results

# **Import Packages**

In [2]:
import pandas as pd
import numpy as np

# **1. Load Band Data as DataFrames**

In [18]:
band1 = pd.read_csv('../raw/band1_data.csv')
band2 = pd.read_csv('../raw/band2_data.csv')
band3 = pd.read_csv('../raw/band3_data.csv')
band4 = pd.read_csv('../raw/band4_data.csv')
band5 = pd.read_csv('../raw/band5_data.csv')
band6 = pd.read_csv('../raw/band6_data.csv')

# **2. Add Latitude/Band Information to DataFrames**

In [52]:
bands = {
    'band_1' : [band1, np.mean([61.2755545, 59.2632802])],
    'band_2' : [band2, np.mean([63.2878288, 61.2755545])],
    'band_3' : [band3, np.mean([65.3001031, 63.2878288])],
    'band_4' : [band4, np.mean([65.3001031, 67.3123774])],
    'band_5' : [band5, np.mean([67.3123774, 69.3246517])],
    'band_6' : [band6, np.mean([69.3246517, 71.336926])]
}
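
The average latitude of each band is the midpoint of its two boundaries. As a sketch, the same values can be derived from a single array of boundaries, which avoids typing each boundary twice. The names `edges`, `midpoints`, and `band_latitudes` are introduced here for illustration only.

```python
import numpy as np

# Band boundaries in degrees latitude, copied from the dictionary above
edges = np.array([59.2632802, 61.2755545, 63.2878288, 65.3001031,
                  67.3123774, 69.3246517, 71.336926])

# Midpoint of each band = mean of its lower and upper boundary
midpoints = (edges[:-1] + edges[1:]) / 2

# Map 'band_1' ... 'band_6' to their midpoint latitudes
band_latitudes = {f"band_{i}": lat for i, lat in enumerate(midpoints, start=1)}
```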

In [53]:
for key, value in bands.items():
  df = value[0]
  df['band'] = key
  df['average_latitude'] = value[1]
  # errors='ignore' lets this cell be re-run after the column is already gone
  df.drop(columns='Unnamed: 0', errors='ignore', inplace=True)
  print(f'{key} columns: ', df.columns)

band_1 columns:  Index(['date', 'lake_bottom_temperature', 'lake_ice_depth',
       'lake_ice_temperature', 'lake_mix_layer_depth',
       'lake_mix_layer_temperature', 'lake_shape_factor',
       'lake_total_layer_temperature', 'leaf_area_index_high_vegetation',
       'leaf_area_index_low_vegetation', 'snow_albedo', 'snow_cover',
       'snow_density', 'snow_depth', 'snow_depth_water_equivalent',
       'snowfall_sum', 'snowmelt_sum', 'temperature_of_snow_layer',
       'soil_temperature_level_1', 'soil_temperature_level_2',
       'soil_temperature_level_3', 'soil_temperature_level_4',
       'volumetric_soil_water_layer_1', 'volumetric_soil_water_layer_2',
       'volumetric_soil_water_layer_3', 'volumetric_soil_water_layer_4',
       'forecast_albedo', 'surface_latent_heat_flux_sum',
       'surface_net_solar_radiation_sum', 'surface_net_thermal_radiation_sum',
       'surface_sensible_heat_flux_sum',
       'surface_solar_radiation_downwards_sum',
       'surface_thermal_radiatio

# **3. Combine Bands to Create One DataFrame**

In [54]:
all_bands = [band1, band2, band3, band4, band5, band6]

In [30]:
permafrost_data = pd.concat(all_bands, axis=0, join='outer')
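
A small sketch of what `pd.concat` with `join='outer'` does, using two toy frames (the values are made up). It shows two behaviours worth keeping in mind: columns missing from one input are filled with NaN, and the original row labels are kept unless `ignore_index=True` is passed.

```python
import pandas as pd

a = pd.DataFrame({"date": ["2001-01-01"], "snow_depth": [0.3]})
b = pd.DataFrame({"date": ["2001-01-01"], "snow_cover": [95.0]})

# Outer join keeps the union of columns; gaps become NaN,
# and each frame keeps its own row labels (label 0 appears twice)
combined = pd.concat([a, b], axis=0, join="outer")

# ignore_index=True replaces the labels with a fresh 0..n-1 range
combined_reset = pd.concat([a, b], axis=0, join="outer", ignore_index=True)
```

Because all six band files share the same columns, the outer join here should not introduce NaN values, but the row labels repeat across bands.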

In [60]:
permafrost_data.columns

Index(['date', 'lake_bottom_temperature', 'lake_ice_depth',
       'lake_ice_temperature', 'lake_mix_layer_depth',
       'lake_mix_layer_temperature', 'lake_shape_factor',
       'lake_total_layer_temperature', 'leaf_area_index_high_vegetation',
       'leaf_area_index_low_vegetation', 'snow_albedo', 'snow_cover',
       'snow_density', 'snow_depth', 'snow_depth_water_equivalent',
       'snowfall_sum', 'snowmelt_sum', 'temperature_of_snow_layer',
       'soil_temperature_level_1', 'soil_temperature_level_2',
       'soil_temperature_level_3', 'soil_temperature_level_4',
       'volumetric_soil_water_layer_1', 'volumetric_soil_water_layer_2',
       'volumetric_soil_water_layer_3', 'volumetric_soil_water_layer_4',
       'forecast_albedo', 'surface_latent_heat_flux_sum',
       'surface_net_solar_radiation_sum', 'surface_net_thermal_radiation_sum',
       'surface_sensible_heat_flux_sum',
       'surface_solar_radiation_downwards_sum',
       'surface_thermal_radiation_downwards_sum',

In [62]:
permafrost_data.shape

(50262, 40)

The combined DataFrame has 50,262 rows and 40 columns.

In [94]:
permafrost_data.isna().sum()

Unnamed: 0,0
date,0
lake_bottom_temperature,0
lake_ice_depth,0
lake_ice_temperature,0
lake_mix_layer_depth,0
lake_mix_layer_temperature,0
lake_shape_factor,0
lake_total_layer_temperature,0
leaf_area_index_high_vegetation,0
leaf_area_index_low_vegetation,0


There are also no missing values.
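
Since the per-column listing can be long, one way to confirm this across every column at once is to reduce the counts to a single total. The sketch below uses a toy DataFrame with one deliberately missing value; `total_missing` and `cols_with_missing` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy data with one missing entry (values are illustrative only)
df = pd.DataFrame({
    "snow_depth": [0.3, np.nan, 0.5],
    "snow_cover": [95.0, 96.0, 97.0],
})

na_counts = df.isna().sum()            # missing values per column
total_missing = int(na_counts.sum())   # missing values in the whole frame
cols_with_missing = na_counts[na_counts > 0].index.tolist()
```

A `total_missing` of zero means no column contains missing values.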

In [93]:
permafrost_data.head()

Unnamed: 0,date,lake_bottom_temperature,lake_ice_depth,lake_ice_temperature,lake_mix_layer_depth,lake_mix_layer_temperature,lake_shape_factor,lake_total_layer_temperature,leaf_area_index_high_vegetation,leaf_area_index_low_vegetation,...,surface_sensible_heat_flux_sum,surface_solar_radiation_downwards_sum,surface_thermal_radiation_downwards_sum,dewpoint_temperature_2m,skin_temperature,temperature_2m,total_evaporation_sum,total_precipitation_sum,band,average_latitude
0,2001-01-01,2.45437,0.701248,-8.115607,3.97168,-0.000342,0.63736,1.291071,1.563471,1.533203,...,78524.0,442472.0,22520729.0,-10.185593,-9.533737,-9.097836,-273.149988,2.225,band_1,66.30624
1,2001-01-02,2.45437,0.707476,-10.248999,3.97168,-0.000342,0.63736,1.290979,1.560664,1.533203,...,379508.0,474116.0,20943421.0,-12.552611,-11.231655,-11.260786,-273.149962,1.696,band_1,66.30624
2,2001-01-03,2.454309,0.716535,-15.694647,3.97168,-0.000342,0.63736,1.291132,1.557973,1.533203,...,1318736.0,575140.0,17960761.0,-18.537108,-16.346471,-16.356472,-273.150102,3.979,band_1,66.30624
3,2001-01-04,2.454492,0.729136,-20.371283,3.97168,-0.000342,0.63736,1.291315,1.555171,1.533086,...,1736264.0,619788.0,15550905.0,-23.518153,-21.461701,-21.07465,-273.149993,3.579,band_1,66.30624
4,2001-01-05,2.454553,0.743706,-22.970068,3.97168,-0.000342,0.63736,1.291437,1.552363,1.533081,...,1000884.0,620828.0,15669307.0,-26.52617,-24.061777,-23.886537,-273.150028,0.367,band_1,66.30624


# **4. Create TBFI Columns**

For `skin_temperature`, `soil_temperature_level_1`, `soil_temperature_level_2`, `soil_temperature_level_3`, and `soil_temperature_level_4`, for each band and each year, compute:

$$
TBFI_{d_i, d_j} = \ln\left|\frac{Ti_{d_i, d_j}}{Fi_{d_i, d_j}}\right|
$$

where

$$
Ti_{d_i, d_j} = \sum_{k=1}^{n} T_k \cdot \mathbf{1}_{\{T_k > 0\}}
$$

and

$$
Fi_{d_i, d_j} = \sum_{k=1}^{n} T_k \cdot \mathbf{1}_{\{T_k < 0\}}
$$

Here $n$ is the number of days in the given year, and $T_k$ is the mean temperature on the $k$th day of that year for the soil layer spanning depths $d_i$ to $d_j$ (in centimeters). TBFI is undefined when either sum is zero; the code below records those cases as NaN.
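
As a worked example of the formula, the sketch below computes Ti, Fi, and TBFI for a made-up series of four daily temperatures. The values are illustrative and not taken from the data.

```python
import numpy as np
import pandas as pd

# Four illustrative daily mean temperatures
temps = pd.Series([-10.0, -10.0, 2.0, 8.0])

ti = temps[temps > 0].sum()   # 2 + 8 = 10
fi = temps[temps < 0].sum()   # -10 + -10 = -20
tbfi = np.log(abs(ti / fi))   # ln|10 / -20| = ln(0.5), about -0.693
```

TBFI is negative when the negative temperatures outweigh the positive ones in magnitude, zero when the two sums balance, and positive otherwise.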

In [95]:
permafrost_data['date'] = pd.to_datetime(permafrost_data['date']).dt.date

In [97]:
# Temperature columns for which TBFI is computed
temp_cols = [
    'skin_temperature',
    'soil_temperature_level_1',
    'soil_temperature_level_2',
    'soil_temperature_level_3',
    'soil_temperature_level_4'
]

# Copy the data so the TBFI columns are added to a separate DataFrame
result = permafrost_data.copy()

# Make sure 'date' is datetime64 so that .dt.year is available
permafrost_data['date'] = pd.to_datetime(permafrost_data['date'])

# Group by band and calendar year
grouped = permafrost_data.groupby(['band', permafrost_data['date'].dt.year])


def tbfi(x):
    """Return ln|Ti / Fi| for one band-year, or NaN if either sum is zero."""
    ti = x[x > 0].sum()  # Ti: sum of positive daily temperatures
    fi = x[x < 0].sum()  # Fi: sum of negative daily temperatures
    if ti == 0 or fi == 0:
        return np.nan
    return np.log(np.abs(ti / fi))


# transform broadcasts each band-year value back to every row of that group
for col in temp_cols:
    result[f'TBFI_{col}'] = grouped[col].transform(tbfi)

In [98]:
result.isna().sum()

Unnamed: 0,0
date,0
lake_bottom_temperature,0
lake_ice_depth,0
lake_ice_temperature,0
lake_mix_layer_depth,0
lake_mix_layer_temperature,0
lake_shape_factor,0
lake_total_layer_temperature,0
leaf_area_index_high_vegetation,0
leaf_area_index_low_vegetation,0


In [100]:
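# Dropped because it has too many missing values
# (TBFI is NaN whenever a band-year has no positive or no negative temperatures)
result.drop('TBFI_soil_temperature_level_4', axis=1, inplace=True)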
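
The decision to drop a column for missing values can be backed by the share of band-years in which TBFI is undefined. The sketch below does this on toy data; the `year` column, the helper `tbfi`, and all values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def tbfi(x):
    """ln|Ti / Fi| for one group, or NaN if either sum is zero."""
    ti = x[x > 0].sum()
    fi = x[x < 0].sum()
    if ti == 0 or fi == 0:
        return np.nan
    return np.log(abs(ti / fi))


# Two band-years with two days each (illustrative values)
toy = pd.DataFrame({
    "band": ["band_1"] * 4,
    "year": [2001, 2001, 2002, 2002],
    "soil_temperature_level_1": [-3.0, 4.0, -2.0, 5.0],
    "soil_temperature_level_4": [-1.0, -0.5, -0.8, 0.2],
})

cols = ["soil_temperature_level_1", "soil_temperature_level_4"]
g = toy.groupby(["band", "year"])

# One TBFI value per band-year and column
per_group = pd.DataFrame({c: g[c].agg(tbfi) for c in cols})

# Fraction of band-years where TBFI is undefined, per column
missing_share = per_group.isna().mean()
```

In this toy data the deepest layer never rises above zero in 2001, so half of its band-years have no TBFI value, while the shallow layer has none missing.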

In [102]:
# index=False avoids writing the row index as an extra 'Unnamed: 0' column
result.to_csv('clean/alaska_permafrost_data.csv', index=False)