# Part 2: Time-Series Feature Engineering and Imputation

This notebook continues from Part 1, using the `trade_data.csv` file. The primary goal is to enrich the dataset with features that capture the temporal dynamics of each trade relationship.

The objectives are:
1.  **Create Unique Trade Pair IDs**: For easier grouping and time-series operations on the 17,170 trade records.
2.  **Manual Time-Series Features**: Generate lagged trade amounts (up to 3 years) and rolling statistics (3-year mean, std dev) for each trade pair.
3.  **`tsfresh` Feature Extraction**: Utilize the `tsfresh` library to automatically extract a comprehensive set of **777 time-series characteristics** for each of the **3,681 unique trade pairs**.
4.  **Merge Features**: Combine manual and `tsfresh` features, expanding the DataFrame to **791 columns**.
5.  **Missing Value Imputation (MICE)**: Impute missing values across **783 numerical columns** using Multivariate Imputation by Chained Equations, ensuring a complete dataset for modeling.
6.  **Save Processed Data**: Output the final feature-engineered and imputed DataFrame to `trade_data_features.csv`.

In [1]:
import pandas as pd
import numpy as np
from tsfresh import extract_features
from tsfresh.feature_extraction import EfficientFCParameters
from tsfresh.utilities.dataframe_functions import impute as tsfresh_impute # tsfresh has its own simple imputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (side-effect import that enables IterativeImputer)
from sklearn.impute import IterativeImputer

# Configuration
INPUT_FILE = 'trade_data.csv'  # Output from Part 1
FINAL_OUTPUT_FILE = 'trade_data_features.csv' # Final output for Part 2

# Lags and rolling window configuration
N_LAGS = 3
ROLLING_WINDOW_SIZE = 3



## 1. Load Data and Create Trade Pair IDs

The processed data from Part 1, with a shape of (17170, 8), is loaded. To facilitate time-series operations, a `trade_pair_id` is created for each record by concatenating the importer and exporter country names with an underscore (e.g., 'Afghanistan_Andorra'), so that all records of the same importer-exporter pair share one ID. The DataFrame is then sorted by this ID and `year` to establish a clean temporal order, which is essential for accurate lag and rolling window calculations.
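Because the lag and rolling calculations later rely on this ordering, it can help to verify it before moving on. The following is a minimal sketch on hypothetical toy rows (the values and the names `n_duplicates` and `years_sorted` are illustrative, not part of the notebook): it builds the ID the same way, sorts, and checks that no (pair, year) combination is repeated.

```python
import pandas as pd

# Hypothetical toy rows, deliberately unsorted.
toy = pd.DataFrame({
    "importer": ["Afghanistan", "Afghanistan", "Afghanistan"],
    "exporter": ["Andorra", "Andorra", "Austria"],
    "year": [2020, 2019, 2019],
    "amount": [1.0, 0.0, 2.0],
})

# Same construction as in the notebook: importer and exporter joined by "_".
toy["trade_pair_id"] = toy["importer"] + "_" + toy["exporter"]
toy = toy.sort_values(["trade_pair_id", "year"]).reset_index(drop=True)

# Each (pair, year) combination should appear at most once.
n_duplicates = int(toy.duplicated(["trade_pair_id", "year"]).sum())

# Within each pair, years should be increasing after sorting.
years_sorted = bool(
    toy.groupby("trade_pair_id")["year"]
    .apply(lambda s: s.is_monotonic_increasing)
    .all()
)

print(n_duplicates, years_sorted)
```

If duplicates were present, a row-based `shift` would no longer correspond to "the previous year", so a check like this is cheap insurance.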

In [2]:
df = pd.read_csv(INPUT_FILE)
print(f"Loaded data from {INPUT_FILE}. Shape: {df.shape}")

# Create a unique ID for each importer-exporter pair
df['trade_pair_id'] = df['importer'] + '_' + df['exporter']

# Sort data by trade_pair_id and year for consistent time-series operations
df.sort_values(['trade_pair_id', 'year'], inplace=True)
df.reset_index(drop=True, inplace=True)

print("DataFrame head after creating 'trade_pair_id' and sorting:")
print(df.head())

Loaded data from trade_data.csv. Shape: (17170, 8)
DataFrame head after creating 'trade_pair_id' and sorting:
      importer        exporter  amount  year  importer_latitude  \
0  Afghanistan  American Samoa     0.0  2019               33.0   
1  Afghanistan         Andorra     0.0  2019               33.0   
2  Afghanistan       Argentina     0.0  2019               33.0   
3  Afghanistan       Australia     0.0  2019               33.0   
4  Afghanistan         Austria     0.0  2019               33.0   

   importer_longitude  exporter_latitude  exporter_longitude  \
0                65.0           -14.3333           -170.0000   
1                65.0            42.5000              1.6000   
2                65.0           -34.0000            -64.0000   
3                65.0           -27.0000            133.0000   
4                65.0            47.3333             13.3333   

                trade_pair_id  
0  Afghanistan_American Samoa  
1         Afghanistan_Andorra  
2     

## 2. Manual Time-Series Feature Engineering

To capture basic temporal dependencies, two types of manual time-series features are generated for the `amount` of each trade pair:
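-   **Lagged Features**: The trade amounts from the previous 1, 2, and 3 observations of each pair (one per year, assuming no gaps in the yearly records) are stored as `amount_lag_1`, `amount_lag_2`, and `amount_lag_3`.
-   **Rolling Statistics**: A rolling mean and standard deviation of the trade amount are calculated over a 3-year window (`ROLLING_WINDOW_SIZE`) with `min_periods=1`.

The lag features are `NaN` for the first years of each trade pair's history, and the rolling standard deviation is `NaN` for a pair's first observation, since it needs at least two values. The rolling mean is defined from the first observation onward because of `min_periods=1`. These missing values are addressed during the imputation step.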
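To make the per-pair behaviour concrete, here is a small sketch on hypothetical toy data (two made-up pairs, `A_B` and `A_C`). It uses the same `groupby(...).shift` and `groupby(...).rolling` pattern as the notebook, with the window hard-coded to 3 for illustration.

```python
import pandas as pd

# Hypothetical toy data: two trade pairs, already sorted by pair and year.
toy = pd.DataFrame({
    "trade_pair_id": ["A_B"] * 4 + ["A_C"] * 2,
    "year": [2019, 2020, 2021, 2022, 2019, 2020],
    "amount": [10.0, 20.0, 30.0, 40.0, 5.0, 7.0],
})

grouped = toy.groupby("trade_pair_id")["amount"]

# shift(1) moves values down one row within each pair,
# so the first row of every pair becomes NaN.
toy["amount_lag_1"] = grouped.shift(1)

# With min_periods=1 the rolling mean is defined from the first row onward,
# while the sample standard deviation needs at least two observations.
rolling = grouped.rolling(window=3, min_periods=1)
toy["amount_rolling_mean"] = rolling.mean().reset_index(level=0, drop=True)
toy["amount_rolling_std"] = rolling.std().reset_index(level=0, drop=True)

print(toy)
```

Note that the lag of the first `A_C` row is `NaN` rather than the last `A_B` amount: grouping keeps values from leaking across pairs.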

In [3]:
print("Generating manual time-series features...")

# Lagged features
for i in range(1, N_LAGS + 1):
    df[f'amount_lag_{i}'] = df.groupby('trade_pair_id')['amount'].shift(i)

# Rolling statistics
rolling_amount = df.groupby('trade_pair_id')['amount'].rolling(window=ROLLING_WINDOW_SIZE, min_periods=1)
df['amount_rolling_mean'] = rolling_amount.mean().reset_index(level=0, drop=True)
df['amount_rolling_std'] = rolling_amount.std().reset_index(level=0, drop=True)

print("Manual time-series features generated.")
print("DataFrame head after adding manual features (some NaNs are expected):")
print(df[['trade_pair_id', 'year', 'amount', 'amount_lag_1', 'amount_rolling_mean']].head(10)) # Show more rows to see effect

Generating manual time-series features...
Manual time-series features generated.
DataFrame head after adding manual features (some NaNs are expected):
                trade_pair_id  year  amount  amount_lag_1  amount_rolling_mean
0  Afghanistan_American Samoa  2019     0.0           NaN                  0.0
1         Afghanistan_Andorra  2019     0.0           NaN                  0.0
2       Afghanistan_Argentina  2019     0.0           NaN                  0.0
3       Afghanistan_Australia  2019     0.0           NaN                  0.0
4         Afghanistan_Austria  2019     0.0           NaN                  0.0
5      Afghanistan_Azerbaijan  2019     0.0           NaN                  0.0
6         Afghanistan_Bahrain  2019     0.0           NaN                  0.0
7      Afghanistan_Bangladesh  2019     0.0           NaN                  0.0
8         Afghanistan_Belarus  2019     0.0           NaN                  0.0
9         Afghanistan_Belgium  2019     0.0           NaN  

## 3. `tsfresh` Feature Extraction

The `tsfresh` library is employed to automatically generate a rich set of time-series features. Using the `EfficientFCParameters` setting for computational feasibility, the process extracts **777 statistical features** for each of the **3,681 unique trade pairs** based on their historical `amount` data. These features, which characterize properties like trend, seasonality, and volatility for the *entire history* of each pair, are then merged back into the main DataFrame. This operation significantly expands the feature space, increasing the total number of columns to **791**.
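The merge step can be illustrated without running `tsfresh` itself. The sketch below is an assumption-laden stand-in: a plain pandas aggregation plays the role of the extracted feature table (one row per pair, indexed by `trade_pair_id`), and the toy data and the two feature columns are hypothetical. Only the merge call mirrors the notebook.

```python
import pandas as pd

# Hypothetical toy data: several yearly rows per trade pair.
toy = pd.DataFrame({
    "trade_pair_id": ["A_B", "A_B", "A_B", "A_C", "A_C"],
    "year": [2019, 2020, 2021, 2019, 2020],
    "amount": [10.0, 20.0, 30.0, 5.0, 7.0],
})

# Stand-in for the tsfresh output (NOT tsfresh itself):
# one row per pair, indexed by trade_pair_id.
pair_features = toy.groupby("trade_pair_id")["amount"].agg(
    amount__mean="mean",
    amount__length="count",
)

# Same merge pattern as the notebook: pair-level rows are broadcast
# onto every yearly record of that pair.
merged = toy.merge(
    pair_features, left_on="trade_pair_id", right_index=True, how="left"
)

print(merged)
```

The row count is unchanged by the left merge, and every row of a pair carries identical feature values, whichever year it belongs to.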

In [4]:
print("\nStarting tsfresh feature extraction...")
# Prepare data for tsfresh: needs 'id', 'time', 'value'
# We'll extract features for the 'amount' column
df_tsfresh_input = df[['trade_pair_id', 'year', 'amount']].copy()

# Using EfficientFCParameters for a manageable number of features
# tsfresh's extract_features can take some time depending on data size and complexity
try:
    extracted_features = extract_features(df_tsfresh_input, 
                                          column_id='trade_pair_id', 
                                          column_sort='year', 
                                          column_value='amount',
                                          default_fc_parameters=EfficientFCParameters(),
                                          # tsfresh can impute NaNs generated during feature calculation
                                          impute_function=tsfresh_impute) 
    print(f"tsfresh extracted {extracted_features.shape[1]} features for {extracted_features.shape[0]} trade pairs.")

    # Merge tsfresh features back into the main DataFrame
    # The extracted_features are indexed by 'trade_pair_id' (after extract_features converts it to index)
    df = df.merge(extracted_features, left_on='trade_pair_id', right_index=True, how='left')
    print("tsfresh features merged into the main DataFrame.")
    print(f"DataFrame shape after merging tsfresh features: {df.shape}")

except Exception as e:
    print(f"Error during tsfresh extraction or merge: {e}")
    print("Skipping tsfresh feature addition.")

print(df.head())


Starting tsfresh feature extraction...


Feature Extraction: 100%|██████████| 30/30 [01:26<00:00,  2.89s/it]


tsfresh extracted 777 features for 3681 trade pairs.
tsfresh features merged into the main DataFrame.
DataFrame shape after merging tsfresh features: (17170, 791)
      importer        exporter  amount  year  importer_latitude  \
0  Afghanistan  American Samoa     0.0  2019               33.0   
1  Afghanistan         Andorra     0.0  2019               33.0   
2  Afghanistan       Argentina     0.0  2019               33.0   
3  Afghanistan       Australia     0.0  2019               33.0   
4  Afghanistan         Austria     0.0  2019               33.0   

   importer_longitude  exporter_latitude  exporter_longitude  \
0                65.0           -14.3333           -170.0000   
1                65.0            42.5000              1.6000   
2                65.0           -34.0000            -64.0000   
3                65.0           -27.0000            133.0000   
4                65.0            47.3333             13.3333   

                trade_pair_id  amount_lag_1  ... 

## 4. Missing Value Imputation (MICE)

The preceding feature engineering steps introduced `NaN` values at the beginning of each time series: the lag features and the rolling standard deviation are undefined for a pair's earliest observations. The `tsfresh` features were already passed through `tsfresh`'s own imputer during extraction. To create a complete dataset ready for modeling, the remaining missing values are imputed using Multivariate Imputation by Chained Equations (MICE). This method models each feature with missing values as a function of the other features and iterates to fill them in. A total of **783 numerical columns** were identified for imputation. The process completed successfully, resolving all `NaN` values in the selected columns.
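As a rough illustration of how chained-equation imputation behaves, the sketch below runs scikit-learn's `IterativeImputer` on a hypothetical three-column toy frame (the column names `x1`, `x2`, `x3` and their values are made up). It uses the same `max_iter` and `random_state` as the notebook, but it is only a small-scale sketch, not a reproduction of the 783-column run.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy frame with correlated columns and two missing cells.
toy = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 4.0, np.nan, 8.0, 10.0, 12.0],
    "x3": [np.nan, 1.0, 1.5, 2.0, 2.5, 3.0],
})

# Each column with missing values is regressed on the other columns,
# and the round of regressions is repeated up to max_iter times.
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(imputed.round(2))
```

Observed values pass through unchanged; only the missing cells are filled with model-based estimates, so the result depends on how well the other columns predict the missing one.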

In [5]:
print("\nStarting MICE imputation...")

# Identify numerical feature columns for imputation
# Exclude IDs, categorical data, and the target variable if pre-imputation analysis is needed
# For now, we impute all numerical features including lags and rolling stats.
cols_to_impute = df.select_dtypes(include=np.number).columns.tolist()
# Leave year out of imputation when it has no missing values
if 'year' in cols_to_impute and df['year'].isnull().sum() == 0:
    cols_to_impute.remove('year')

# Ensure original coordinates are not in cols_to_impute if they are complete and not to be changed
coord_cols = ['importer_latitude', 'importer_longitude', 'exporter_latitude', 'exporter_longitude']
for c_col in coord_cols:
    if c_col in cols_to_impute and df[c_col].isnull().sum() == 0:
        cols_to_impute.remove(c_col)

numerical_features_df = df[cols_to_impute]
print(f"Columns selected for MICE imputation: {cols_to_impute}")

if not numerical_features_df.empty and numerical_features_df.isnull().any().any():
    imputer = IterativeImputer(max_iter=10, random_state=0, verbose=1) # verbose=1 to see progress
    df_imputed_values = imputer.fit_transform(numerical_features_df)
    
    # Convert imputed numpy array back to DataFrame
    df_imputed_features = pd.DataFrame(df_imputed_values, columns=cols_to_impute, index=df.index)
    
    # Update the original DataFrame with imputed values
    for col in cols_to_impute:
        df[col] = df_imputed_features[col]
    
    print("MICE imputation completed.")
    print(f"NaNs remaining in imputed columns sum: {df[cols_to_impute].isnull().sum().sum()}")
else:
    print("No columns for MICE imputation or no NaNs found in selected numerical columns.")

print("DataFrame head after MICE imputation:")
print(df.head())


Starting MICE imputation...
Columns selected for MICE imputation: ['amount', 'amount_lag_1', 'amount_lag_2', 'amount_lag_3', 'amount_rolling_mean', 'amount_rolling_std', 'amount__variance_larger_than_standard_deviation', 'amount__has_duplicate_max', 'amount__has_duplicate_min', 'amount__has_duplicate', 'amount__sum_values', 'amount__abs_energy', 'amount__mean_abs_change', 'amount__mean_change', 'amount__mean_second_derivative_central', 'amount__median', 'amount__mean', 'amount__length', 'amount__standard_deviation', 'amount__variation_coefficient', 'amount__variance', 'amount__skewness', 'amount__kurtosis', 'amount__root_mean_square', 'amount__absolute_sum_of_changes', 'amount__longest_strike_below_mean', 'amount__longest_strike_above_mean', 'amount__count_above_mean', 'amount__count_below_mean', 'amount__last_location_of_maximum', 'amount__first_location_of_maximum', 'amount__last_location_of_minimum', 'amount__first_location_of_minimum', 'amount__percentage_of_reoccurring_values_to_

## 5. Save Processed Data

The final DataFrame, now including manual time-series features, `tsfresh` features, and with all missing values imputed, is saved to a new CSV file. The resulting dataset, `trade_data_features.csv`, has a final shape of **(17170, 791)** and will serve as the input for the next stage of the analysis.

In [6]:
df.to_csv(FINAL_OUTPUT_FILE, index=False)
print(f"\nFinal feature-engineered data saved to {FINAL_OUTPUT_FILE}")
print(f"Final DataFrame shape: {df.shape}")
print("Final DataFrame Info:")
df.info()


Final feature-engineered data saved to trade_data_features.csv
Final DataFrame shape: (17170, 791)
Final DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17170 entries, 0 to 17169
Columns: 791 entries, importer to amount__mean_n_absolute_max__number_of_maxima_7
dtypes: float64(787), int64(1), object(3)
memory usage: 103.6+ MB


## End of Part 2

The DataFrame `df` has been successfully enriched and saved to `trade_data_features.csv`. This file contains the multi-year trade data with a rich set of features, now fully prepared for the dynamic network analysis and predictive modeling tasks in the subsequent parts of the project.