# Notebook: Violence Dynamics Prediction (Mode-Based Sliding Window Approach)
## 1. Introduction
This notebook implements a predictive model for the dynamics of Selective Violence (VS), Indiscriminate Violence (VI), and Collective Violence (VC) in Colombia. Leveraging the previously calculated historical Escalation, Intensity, and "Previous State" metrics, the objective is to predict the overall "State", "Intensity", and "Escalation" for the next month, based on the modal behavior observed within a sliding time window.

The model is designed to be simple and interpretable, focusing on identifying dominant patterns in the recent past to infer the near-future behavior of violence at different geographical levels.

## 2. Prediction Methodology
For each month to be predicted, the model follows these steps:

Data Loading: The notebook loads the pre-processed violence data (VS, VI, and VC), which already contains the Escalation, Intensity, and "Previous State" metrics for each month.

Observation Window: A sliding window of the last 18 months of historical data will be used for observation.

"State" Prediction (A, B, C, etc.):

Within the 18-month observation window, the mode (most frequent value) of the observed "Previous States" (e.g., 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I') will be calculated.

The predicted "State" for the next month will be this modal value.

"Intensity" Prediction (1, 0, -1):

Once the modal "State" is identified, the data within the 18-month window will be filtered to include only those months whose "Previous State" matches the calculated modal state.

Within this subset of months (which share the modal state), the "share" (proportion) of each Intensity value (1, 0, -1) will be calculated.

The predicted "Intensity" value for the next month will be the value with the highest "share" (the mode) within this filtered subset.

"Escalation" Prediction (1, 0, -1):

Analogous to the Intensity calculation, the same subset of months (those with the modal "Previous State") will be used.

The "share" of each Escalation value (1, 0, -1) will be calculated within this subset.

The predicted "Escalation" value for the next month will be the value with the highest "share" (the mode) within this filtered subset.
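
The three steps above can be sketched on a single window. This is a minimal illustration, not the notebook's own function: the window is shortened to 6 months, all values are invented, and ties are not exercised.

```python
import pandas as pd

# Toy 6-month window (the notebook uses 18 months); all values are made up.
window = pd.DataFrame({
    "Previous State": ["E", "E", "A", "E", "B", "A"],
    "Intensity":      [0, 0, 1, 1, 1, 1],
    "Escalation":     [0, -1, 1, 0, 0, 1],
})

# Step 1: the modal "Previous State" becomes the predicted state.
predicted_state = window["Previous State"].mode().iloc[0]

# Step 2: keep only the months that share the modal state.
subset = window[window["Previous State"] == predicted_state]

# Step 3: compute the share of each value inside the subset; the largest share wins.
intensity_shares = subset["Intensity"].value_counts(normalize=True)
escalation_shares = subset["Escalation"].value_counts(normalize=True)
predicted_intensity = intensity_shares.idxmax()
predicted_escalation = escalation_shares.idxmax()

print(predicted_state, predicted_intensity, predicted_escalation)
```

Here state 'E' appears in three of the six months, and within those three months both Intensity 0 and Escalation 0 hold a two-thirds share, so the sketch predicts ('E', 0, 0).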

## 3. Geographical Scope
This prediction process will be applied, and results will be generated for the following geographical levels:

Country: Colombia

Departments: Each analyzed department.

Regions: Each defined region in the study.

## 4. Data Input and Output
Input: .tsv files containing violence data (VS, VI, and VC) with Escalation, Intensity, and Previous State columns, generated by the previous notebook. These files are expected to be organized by geographical level (country, department, region) and to end in `_victims_metrics.tsv`.

Output: The notebook writes one .tsv file per geographical unit containing the predicted "State", "Intensity", and "Escalation" for each predicted month, identified by its year (`Año`) and month (`Mes`).
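
As a rough illustration of the expected input layout, the sketch below parses a two-row tab-separated sample and checks for the columns that the prediction cell requires. The column names come from this notebook; the sample rows are invented, and `REQUIRED_COLS` is just a local name for this example.

```python
import io
import pandas as pd

# Columns the prediction cell checks for before processing a file.
REQUIRED_COLS = ["Año", "Mes", "violence type", "Intensity", "Escalation", "Previous State"]

# Invented sample rows, laid out the way an input .tsv file is expected to look.
sample_tsv = (
    "Año\tMes\tviolence type\tIntensity\tEscalation\tPrevious State\n"
    "2001\t1\tVS\t1\t0\tE\n"
    "2001\t2\tVS\t0\t-1\tB\n"
)

df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# An empty list means the file has every column the model needs.
missing = [col for col in REQUIRED_COLS if col not in df.columns]
print(missing)
```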

### 1. Initial Setup, Library Imports, and Path Configuration
This block performs the initial setup for the notebook environment. It includes importing all necessary Python libraries required for data handling, file system operations, and numerical computations. It also defines the relative paths for the input directories where the pre-processed data (with Escalation, Intensity, and Previous State) is located, and the base output directory where the prediction results will be saved. These paths are configured assuming a standard project structure relative to the notebook's location.

In [1]:
# 1. Initial Setup, Library Imports, and Path Configuration

import pandas as pd
import os
import numpy as np

# Define the base input directory where processed data (with metrics) is stored.
# Assumes the working directory is a subfolder of the project root (e.g. 'notebooks') and the data is one level up in 'Results/intensity & escalation/victims'.
# Example structure:
# Project_Root/
# ├── Results/
# │   └── intensity & escalation/
# │       └── victims/    # Input files are here (country, department, region subfolders)
# └── notebooks/        # This notebook is here
base_input_data_dir = os.path.join(os.getcwd(), '..', 'Results', 'intensity & escalation', 'victims')

# Define the base output directory for prediction results.
# Prediction results will be saved in subfolders within this directory.
prediction_results_base_dir = os.path.join(os.getcwd(), '..', 'Results', 'predictions', 'victims')

# Define the window size for the modal prediction.
# This specifies how many previous months to consider for finding the mode.
PREDICTION_WINDOW_SIZE = 18 # Months

# Define the mapping for [Intensity, Escalation] pairs to states (for reference, though not directly used for prediction input).
state_mapping = {
    (1, 1): 'A',
    (1, 0): 'B',
    (1, -1): 'C',
    (0, 1): 'D',
    (0, 0): 'E',
    (0, -1): 'F',
    (-1, 1): 'G',
    (-1, 0): 'H',
    (-1, -1): 'I'
}

print("Initial setup, library imports, and path configuration complete.")
print(f"Input data will be read from: {base_input_data_dir}")
print(f"Prediction results will be saved in: {prediction_results_base_dir}")
print(f"Prediction window size set to: {PREDICTION_WINDOW_SIZE} months.")


Initial setup, library imports, and path configuration complete.
Input data will be read from: /Users/diegohernandez/Documents/GitHub/VS_VI_Source_Code/Scripts/../Results/intensity & escalation/victims
Prediction results will be saved in: /Users/diegohernandez/Documents/GitHub/VS_VI_Source_Code/Scripts/../Results/predictions/victims
Prediction window size set to: 18 months.
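
For reference, the sketch below shows how `state_mapping` turns an (Intensity, Escalation) pair into a state letter. The lagging step is an assumption: it reflects the description that a month's "Previous State" is based on the prior month's pair, but the previous notebook's actual derivation is not shown here. The `State` column and the sample values are hypothetical.

```python
import pandas as pd

# Same mapping as in the setup cell: (Intensity, Escalation) -> state letter.
state_mapping = {
    (1, 1): 'A', (1, 0): 'B', (1, -1): 'C',
    (0, 1): 'D', (0, 0): 'E', (0, -1): 'F',
    (-1, 1): 'G', (-1, 0): 'H', (-1, -1): 'I',
}

# Three invented consecutive months.
df = pd.DataFrame({"Intensity": [1, 0, -1], "Escalation": [0, 0, 1]})

# State of each month, looked up from its own (Intensity, Escalation) pair.
pairs = zip(df["Intensity"].tolist(), df["Escalation"].tolist())
df["State"] = [state_mapping[pair] for pair in pairs]

# Assumed derivation: a month's "Previous State" is the state of the month before it.
df["Previous State"] = df["State"].shift(1)
print(df)
```

The first month has no predecessor, so its "Previous State" is missing under this assumption.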



### 2. Define Prediction Logic Functions
This block defines the core functions for the prediction model. These functions implement the logic of finding the mode of "Previous State" within a sliding window and then, based on that modal state, finding the modal "Intensity" and "Escalation" values.

get_mode_or_default(series): A helper function to find the mode of a pandas Series. If several values are tied for the mode, it picks the first one in sorted order. If the series is empty, it returns a default value (None unless the caller supplies one, such as 'Unknown' or 0).

predict_next_state_and_dynamics(df_time_series, window_size): This is the main prediction function. It iterates through the time series, applies the sliding window logic for each prediction point, and generates predictions for 'Previous State', 'Intensity', and 'Escalation' for the next month.
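
The tie-breaking rule matters for short windows, so it is worth seeing how `Series.mode()` behaves. The sketch below uses invented values and only standard pandas calls.

```python
import numpy as np
import pandas as pd

# 'A' and 'C' both appear twice, so both are modes.
tied = pd.Series(["C", "A", "C", "A", "B"])
modes = tied.mode()          # tied modes come back in sorted order
first_mode = modes.iloc[0]   # the helper would return this value

# An empty series has no mode at all.
empty_modes = pd.Series([], dtype=object).mode()

# NaN is ignored by default, so an all-NaN series also yields no mode.
all_nan_modes = pd.Series([np.nan, np.nan]).mode()

print(modes.tolist(), first_mode, empty_modes.empty, all_nan_modes.empty)
```

With ties, the result depends on sort order rather than on recency: 'A' wins over 'C' here even though 'C' appears first in the data.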

In [2]:
# 2. Define Prediction Logic Functions

print("\n--- Defining Prediction Logic Functions ---")

def get_mode_or_default(series, default_value=None):
    """
    Calculates the mode of a pandas Series. If there are multiple modes,
    it returns the first one. If the series is empty, it returns a default value.

    Args:
        series (pd.Series): The input series.
        default_value: The value to return if the series is empty or has no mode.

    Returns:
        The mode of the series, or the default_value.
    """
    if series.empty:
        return default_value
    # Series.mode() returns every value tied for the highest count, in sorted order
    # (NaN is ignored by default), so the first element is the smallest tied value.
    modes = series.mode()
    if not modes.empty:
        return modes.iloc[0] # Return the first mode if multiple exist
    return default_value # Reached when the series contains only NaN values

def predict_next_state_and_dynamics(df_time_series, window_size):
    """
    Predicts the 'Previous State', 'Intensity', and 'Escalation' for the next month
    based on the modal behavior within a sliding window.

    Args:
        df_time_series (pd.DataFrame): A DataFrame containing 'Año', 'Mes',
                                       'Previous State', 'Intensity', 'Escalation'
                                       for a specific violence type and geographical unit.
                                       Assumes it's sorted chronologically.
        window_size (int): The number of months in the sliding window to look back.

    Returns:
        pd.DataFrame: A DataFrame with 'Año', 'Mes', 'violence type',
                      'Predicted_State', 'Predicted_Intensity', 'Predicted_Escalation'.
    """
    predictions = []
    # Ensure the DataFrame is sorted by 'Año' and 'Mes' for correct window slicing
    df_time_series = df_time_series.sort_values(by=['Año', 'Mes']).reset_index(drop=True)

    # Walk through the series and predict each month from the window that precedes it.
    # The 'Previous State' at month t is based on (Intensity_t-1, Escalation_t-1).
    # For the row at position i, the window covers positions (i - window_size) to (i - 1),
    # and the prediction is made for the month at position i itself.

    # At least 'window_size' earlier months are needed, so the first prediction is made
    # at position window_size. For example, if window_size is 18, the first prediction
    # is for the 19th month, using data from months 1 to 18.
    for i in range(window_size, len(df_time_series)):
        # Define the historical window (last 'window_size' months *before* the current month 'i')
        # This window contains data from index (i - window_size) up to (i - 1)
        # The 'Previous State' column in df_time_series[i-window_size:i] refers to
        # the state of (Intensity, Escalation) for the month *prior* to that row's month.
        # So, if we want to predict for month 'i' (which is the next month after the window),
        # we look at the 'Previous State' column within the window.
        current_window = df_time_series.iloc[i - window_size : i]

        if current_window.empty or 'Previous State' not in current_window.columns:
            # This case should ideally not be hit if the input df_time_series is valid
            print(f"Warning: Empty or invalid window at index {i}. Skipping prediction.")
            continue

        # 1. Predict the 'Previous State' for the next month (i.e., the state of the *current* month 'i' based on its (I,E) pair)
        # Find the mode of 'Previous State' within the current historical window
        predicted_state_mode = get_mode_or_default(current_window['Previous State'], default_value='Unknown')

        # 2. Predict 'Intensity' and 'Escalation' for the next month
        # Filter the window to include only months whose 'Previous State' matches the modal state
        filtered_by_mode_state = current_window[current_window['Previous State'] == predicted_state_mode]

        predicted_intensity = get_mode_or_default(filtered_by_mode_state['Intensity'], default_value=0) # Default to 0 if no mode
        predicted_escalation = get_mode_or_default(filtered_by_mode_state['Escalation'], default_value=0) # Default to 0 if no mode

        # Get the month and year for which we are making the prediction (the month *after* the window)
        # This is the month 'i' in the original df_time_series
        predicted_year = df_time_series.loc[i, 'Año']
        predicted_month = df_time_series.loc[i, 'Mes']
        violence_type = df_time_series.loc[i, 'violence type'] # Keep the violence type consistent

        predictions.append({
            'Año': predicted_year,
            'Mes': predicted_month,
            'violence type': violence_type,
            'Predicted_State': predicted_state_mode,
            'Predicted_Intensity': predicted_intensity,
            'Predicted_Escalation': predicted_escalation
        })

    return pd.DataFrame(predictions)

print("Prediction logic functions defined.")



--- Defining Prediction Logic Functions ---
Prediction logic functions defined.
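
To check the windowing by hand, the following sketch restates the two functions in condensed form and runs them on a five-month toy series with a window of 3. The names `first_mode` and `predict_series` and all data values are hypothetical; the logic is meant to mirror the cell above, not replace it.

```python
import pandas as pd

def first_mode(series, default=None):
    """Return the first mode in sorted order, or `default` if there is none."""
    modes = series.mode()
    return modes.iloc[0] if not modes.empty else default

def predict_series(df, window_size):
    """Predict each month from the `window_size` months that precede it."""
    df = df.sort_values(["Año", "Mes"]).reset_index(drop=True)
    rows = []
    for i in range(window_size, len(df)):
        window = df.iloc[i - window_size:i]           # positions i-window_size .. i-1
        state = first_mode(window["Previous State"], "Unknown")
        subset = window[window["Previous State"] == state]
        rows.append({
            "Año": df.loc[i, "Año"],                   # the predicted month is row i
            "Mes": df.loc[i, "Mes"],
            "Predicted_State": state,
            "Predicted_Intensity": first_mode(subset["Intensity"], 0),
            "Predicted_Escalation": first_mode(subset["Escalation"], 0),
        })
    return pd.DataFrame(rows)

# Invented five-month series.
toy = pd.DataFrame({
    "Año": [2000] * 5,
    "Mes": [1, 2, 3, 4, 5],
    "Previous State": ["E", "E", "B", "B", "B"],
    "Intensity": [0, 1, 1, 1, 0],
    "Escalation": [0, 0, 0, 1, 1],
})
preds = predict_series(toy, window_size=3)
print(preds)
```

Only months 4 and 5 receive a prediction, because each one needs three earlier months. Predictions are produced for months already present in the series, which makes it possible to compare them with the observed values.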


### 3. Execute Prediction and Save Results
This block orchestrates the entire prediction process. It iterates through the pre-processed data files for each geographical level (Country, Department, Region). For each file, it loads the data, extracts the relevant time series for each violence type (VS, VI, VC), applies the predict_next_state_and_dynamics function (defined in Cell 2) to generate predictions, and finally saves these predictions into new TSV files in the designated output directory structure.

This step assumes that the input data files (containing 'Escalation', 'Intensity', and 'Previous State' columns) have been successfully generated and saved by the previous data processing notebook.
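
The file naming convention used by the cell below can be sketched in isolation. The suffixes are the ones used in this notebook; the list of filenames is invented for illustration (only `nariño_victims_metrics.tsv` appears in the recorded output).

```python
INPUT_SUFFIX = "_victims_metrics.tsv"
OUTPUT_SUFFIX = "_victims_predictions.tsv"

# Hypothetical directory listing: two metrics files and one unrelated file.
filenames = [
    "nariño_victims_metrics.tsv",
    "notes.txt",
    "valle del cauca_victims_metrics.tsv",
]

# Keep only the metrics files produced by the previous notebook.
processed = [name for name in filenames if name.endswith(INPUT_SUFFIX)]

# Strip the suffix to recover the unit name.
units = [name[: -len(INPUT_SUFFIX)] for name in processed]

# Output names are lowercase, without spaces, with the predictions suffix.
outputs = [unit.lower().replace(" ", "") + OUTPUT_SUFFIX for unit in units]
print(outputs)
```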

In [3]:
# 3. Execute Prediction and Save Results

print("\n--- Executing Prediction and Saving Results ---")

# Ensure prediction window size is defined
if 'PREDICTION_WINDOW_SIZE' not in globals():
    print("Error: PREDICTION_WINDOW_SIZE not defined. Please run Cell 1.")
elif 'predict_next_state_and_dynamics' not in globals():
    print("Error: Prediction logic functions not found. Please run Cell 2.")
else:
    # Define the levels and their corresponding input/output directories
    levels_info = [
        {'name': 'Country', 'input_dir': os.path.join(base_input_data_dir, 'country'), 'output_subdir': 'country'},
        {'name': 'Department', 'input_dir': os.path.join(base_input_data_dir, 'department'), 'output_subdir': 'department'},
        {'name': 'Region', 'input_dir': os.path.join(base_input_data_dir, 'region'), 'output_subdir': 'region'}
    ]

    for level in levels_info:
        level_name = level['name']
        input_dir = level['input_dir']
        output_subdir = level['output_subdir']
        output_dir = os.path.join(prediction_results_base_dir, output_subdir)

        print(f"\n--- Processing {level_name} Level for Predictions ---")

        # Create the output directory for this level if it doesn't exist
        os.makedirs(output_dir, exist_ok=True)
        print(f"Ensured output directory exists: {output_dir}")

        # Check if the input directory exists
        if not os.path.exists(input_dir):
            print(f"Error: Input data directory not found for {level_name}: {input_dir}. Skipping this level.")
            continue # Skip to the next level

        # List all TSV files in the input directory for this level
        # Assuming filenames end with '_victims_metrics.tsv' from the previous notebook's output
        processed_files = [f for f in os.listdir(input_dir) if f.endswith('_victims_metrics.tsv')]

        if not processed_files:
            print(f"Warning: No processed metrics TSV files found in {input_dir}. Skipping {level_name} prediction.")
        else:
            print(f"Found {len(processed_files)} processed {level_name} files. Generating predictions for each...")
            for filename in processed_files:
                file_path = os.path.join(input_dir, filename)
                # Extract unit name from filename (e.g., 'colombia', 'antioquia', 'pacifica')
                unit_name = filename.replace("_victims_metrics.tsv", "")

                print(f"\nGenerating predictions for {level_name}: {unit_name} from file: {os.path.basename(file_path)}")

                try:
                    # Read the processed TSV file
                    df_unit_metrics = pd.read_csv(file_path, sep='\t')

                    # Ensure required columns are present for prediction
                    required_cols = ['Año', 'Mes', 'violence type', 'Intensity', 'Escalation', 'Previous State']
                    if not all(col in df_unit_metrics.columns for col in required_cols):
                        print(f"Error: Missing required columns in {os.path.basename(file_path)}: {required_cols}. Skipping prediction for this unit.")
                        continue # Skip to the next file

                    # Ensure data types are correct.
                    # Coerce first, then drop incomplete rows before casting 'Año' and 'Mes' to int,
                    # because casting a column that still contains NaN to int raises an error.
                    df_unit_metrics['Año'] = pd.to_numeric(df_unit_metrics['Año'], errors='coerce')
                    df_unit_metrics['Mes'] = pd.to_numeric(df_unit_metrics['Mes'], errors='coerce')
                    df_unit_metrics['Intensity'] = pd.to_numeric(df_unit_metrics['Intensity'], errors='coerce').fillna(0).astype(int)
                    df_unit_metrics['Escalation'] = pd.to_numeric(df_unit_metrics['Escalation'], errors='coerce').fillna(0).astype(int)
                    df_unit_metrics = df_unit_metrics.dropna(subset=required_cols).astype({'Año': int, 'Mes': int})

                    # Get unique violence types (VS, VI, VC) from the loaded data
                    unique_violence_types_in_file = df_unit_metrics['violence type'].unique()

                    all_predictions_for_unit = []
                    for v_type in unique_violence_types_in_file:
                        df_type_series = df_unit_metrics[df_unit_metrics['violence type'] == v_type].copy()

                        # The window plus at least one month to predict is needed.
                        if len(df_type_series) <= PREDICTION_WINDOW_SIZE:
                            print(f"Warning: Not enough historical data for {v_type} in {unit_name} ({len(df_type_series)} months). Skipping prediction.")
                            continue # Skip if not enough data for the window

                        # Generate predictions for this violence type
                        predictions_df_type = predict_next_state_and_dynamics(df_type_series, PREDICTION_WINDOW_SIZE)
                        if not predictions_df_type.empty:
                            all_predictions_for_unit.append(predictions_df_type)

                    # Concatenate all predictions for the current unit (across VS, VI, VC)
                    if all_predictions_for_unit:
                        final_predictions_df = pd.concat(all_predictions_for_unit, ignore_index=True)
                        final_predictions_df = final_predictions_df.sort_values(by=['Año', 'Mes', 'violence type']).reset_index(drop=True)

                        # --- Save the predictions to TSV ---
                        # Generate the output filename: unit name (lowercase, no spaces) + "_victims_predictions.tsv"
                        output_filename = unit_name.lower().replace(" ", "") + "_victims_predictions.tsv"
                        output_path = os.path.join(output_dir, output_filename)

                        try:
                            final_predictions_df.to_csv(output_path, sep='\t', index=False)
                            print(f"Saved predictions for {level_name}: {unit_name} to {output_filename}")
                        except Exception as e:
                            print(f"Error saving predictions for {level_name}: {unit_name} to {output_filename}: {e}")
                    else:
                        print(f"No predictions generated for {level_name}: {unit_name}. Skipping save.")

                except Exception as e:
                    print(f"An unexpected error occurred during processing file {filename}: {e}")

            print(f"\n{level_name}-level prediction calculation and saving complete.")

    print("\n--- Overall Prediction Process Finished ---")



--- Executing Prediction and Saving Results ---

--- Processing Country Level for Predictions ---
Ensured output directory exists: /Users/diegohernandez/Documents/GitHub/VS_VI_Source_Code/Scripts/../Results/predictions/victims/country
Found 1 processed Country files. Generating predictions for each...

Generating predictions for Country: 1958_2022_victims_country.tsv from file: 1958_2022_victims_country.tsv_victims_metrics.tsv
Saved predictions for Country: 1958_2022_victims_country.tsv to 1958_2022_victims_country.tsv_victims_predictions.tsv

Country-level prediction calculation and saving complete.

--- Processing Department Level for Predictions ---
Ensured output directory exists: /Users/diegohernandez/Documents/GitHub/VS_VI_Source_Code/Scripts/../Results/predictions/victims/department
Found 35 processed Department files. Generating predictions for each...

Generating predictions for Department: nariño from file: nariño_victims_metrics.tsv
Saved predictions for Department: nariño 