# Code Description: Data Cleaning and Preprocessing Script | Cases 

## Purpose
This script is designed to perform initial data cleaning and preprocessing steps on the raw violence data for the project on Selective and Indiscriminate Violence (VS/VI) in Colombia. Its main goal is to prepare the data for subsequent analysis, metric calculation (Escalation, Intensity), and potential modeling.

## Workflow Stage
This script is in the Data Cleaning / Preprocessing stage. It takes the raw data, likely loaded from the combined DataFrame generated in the previous step, and transforms it into a clean, structured format suitable for further use in the analytical pipeline.

## About
This script will handle common data issues such as missing values, incorrect data types, and inconsistencies. It will standardize column names if necessary and potentially aggregate data by relevant temporal (e.g., month) and geographical (Country, Department, or Region) units, depending on the specific analysis level being targeted. The output will be a cleaned dataset ready for calculating metrics and generating features.


In [31]:
import pandas as pd
import numpy as np
import os
from itertools import product 

### 1. Initial Setup, Library Imports, and Path Configuration
This block performs the initial setup, including importing necessary libraries (pandas, os), defining the path to the raw data folder (one level up in 'Data/raw'), and listing the specific filenames expected for VI and VS violence types. It also defines the list of columns to be extracted from each file.

In [24]:
# Define the base columns to select from each Excel file (excluding actor columns for now).
columns_to_select_base = [
    "Año",
    "Mes",
    "Día",
    "ID Caso",
    "Municipio",
    "Departamento",
    "Región"
]

# Define the list of potential actor columns, in order of preference.
actor_column_candidates = ["Presunto Responsable", "Grupo Armado 1"]

# Define the lists of filenames corresponding to each violence type (VI and VS).
# These filenames are used to classify the data.
vi_files = [
    "Casos_Acciones_Belicas_202503.xlsx",
    "Casos_Ataques_a_Poblaciones_202503.xlsx",
    "Casos_Atentados_Terroristas_202503.xlsx",
    "Casos_MInas_202503.xlsx",
    "Casos_Reclutamiento_ninas_ninos_U_202503.xlsx"
]

vs_files = [
    "Caso_ Danos_a_Bienes_Civiles_202503.xlsx", # Note: Check for potential extra space in filename "Caso_ Danos..."
    "Casos_Asesinatos_Selectivo_202503.xlsx",
    "Casos_Desaparicion_Forzada _202503.xlsx", # Note: Check for potential extra space in filename "Desaparicion_Forzada _"
    "Casos_Masacre_202503.xlsx",
    "Casos_Secuestro_202503.xlsx",
    "Casos_Violencia_Sexual_202503.xlsx"
]

# --- Define lists for State and Non-State Actors based on 'Presunto Responsable' / 'Grupo Armado 1' values ---
# These lists classify the responsible party into two broad categories.
# Values not in these lists will be categorized as 'UNKNOWN_ACTOR_TYPE'.

STATE_ACTORS_RESPONSIBLE = [
    'AGENTE DEL ESTADO'
]

NON_STATE_ACTORS_RESPONSIBLE = [
    'GUERRILLA',
    'GRUPO PARAMILITAR',
    'GRUPO ARMADO NO IDENTIFICADO',
    'BANDOLERISMO',
    'GRUPO POSDESMOVILIZACIÓN',
    'CRIMEN ORGANIZADO',
    'AGENTE EXTRANJERO',
    'GRUPO PARAMILITAR - GUERRILLA',
    'GRUPO POSDESMOVILIZACIÓN - GUERRILLA',
    'AGENTE DEL ESTADO - GRUPO PARAMILITAR',
    'AGENTE DEL ESTADO - GRUPO POSDESMOVILIZACIÓN',
    'AGENTE DEL ESTADO - GUERRILLA',
    'DESCONOCIDO',
    'OTRO ¿CUÁL?'
]

# Helper function to classify the actor type
def classify_actor(responsible_party):
    """
    Classifies a responsible party string into 'STATE_ACTOR', 'NON_STATE_ACTOR',
    or 'UNKNOWN_ACTOR_TYPE'. Handles NaN/None values and standardizes input.
    """
    if pd.isna(responsible_party):
        return 'UNKNOWN_ACTOR_TYPE'
    responsible_party_upper = str(responsible_party).strip().upper()

    if responsible_party_upper in [s.upper() for s in STATE_ACTORS_RESPONSIBLE]:
        return 'STATE_ACTOR'
    elif responsible_party_upper in [ns.upper() for ns in NON_STATE_ACTORS_RESPONSIBLE]:
        return 'NON_STATE_ACTOR'
    else:
        return 'UNKNOWN_ACTOR_TYPE' # Default for unlisted or unknown parties

# Initialize an empty list to store the processed dataframes from each file.
all_dataframes = []

# Iterate through all files in the specified data folder.
for filename in os.listdir(data_folder_path):
    # Construct the full file path.
    file_path = os.path.join(data_folder_path, filename)

    # Check if the current item is a file and if it's an Excel file.
    if os.path.isfile(file_path) and filename.endswith('.xlsx'):
        print(f"Processing file: {filename}")

        try:
            # Read the Excel file into a pandas DataFrame.
            df = pd.read_excel(file_path)

            # Determine which actor column is available in the current file
            current_actor_column = None
            for col_candidate in actor_column_candidates:
                if col_candidate in df.columns:
                    current_actor_column = col_candidate
                    break # Found the column, use it and exit loop

            if current_actor_column is None:
                print(f"Warning: Neither '{actor_column_candidates[0]}' nor '{actor_column_candidates[1]}' found in file '{filename}'. Skipping this file.")
                continue # Skip this file if no relevant actor column is found

            # Construct the list of columns to select for THIS specific DataFrame
            # This includes the base columns and the identified actor column
            cols_to_load_for_this_df = columns_to_select_base + [current_actor_column]

            # Select only the required columns.
            df_selected = df[cols_to_load_for_this_df].copy()
            df_selected.rename(columns={"Grupo Armado 1":"Presunto Responsable"},inplace=True)

            # Rename the found actor column to a standardized name for classification
            df_selected['ResponsiblePartyRaw'] = df_selected[current_actor_column]

            # Apply the actor classification to create the 'ActorType' column
            df_selected['ActorType'] = df_selected['ResponsiblePartyRaw'].apply(classify_actor)

            # Drop the temporary raw column used for classification
            df_selected.drop(columns=['ResponsiblePartyRaw'], inplace=True)

            # Determine the violence type based on the filename and add the 'violence type' column.
            if filename in vi_files:
                df_selected['violence type'] = 'VI'
            elif filename in vs_files:
                df_selected['violence type'] = 'VS'
            else:
                print(f"Warning: File '{filename}' not classified as VI or VS. Skipping.")
                continue # Skip this file

            # Append the processed DataFrame to the list.
            all_dataframes.append(df_selected)

        except KeyError as ke:
            print(f"Error processing file {filename}: Missing required base column - {ke}. Please check if all base columns exist.")
        except Exception as e:
            print(f"Error processing file {filename}: An unexpected error occurred - {e}")


Processing file: Casos_Desaparicion_Forzada _202503.xlsx


  warn("Workbook contains no default style, apply openpyxl's default")


Processing file: Casos_Reclutamiento_ninas_ninos_U_202503.xlsx


  warn("Workbook contains no default style, apply openpyxl's default")


Processing file: Casos_Acciones_Belicas_202503.xlsx


  warn("Workbook contains no default style, apply openpyxl's default")


Error processing file Casos_Acciones_Belicas_202503.xlsx: Missing required base column - 'Grupo Armado 1'. Please check if all base columns exist.
Processing file: Casos_MInas_202503.xlsx


  warn("Workbook contains no default style, apply openpyxl's default")


Processing file: Casos_Ataques_a_Poblaciones_202503.xlsx


  warn("Workbook contains no default style, apply openpyxl's default")


Error processing file Casos_Ataques_a_Poblaciones_202503.xlsx: Missing required base column - 'Grupo Armado 1'. Please check if all base columns exist.
Processing file: Casos_Violencia_Sexual_202503.xlsx


  warn("Workbook contains no default style, apply openpyxl's default")


Processing file: Casos_Asesinatos_Selectivo_202503.xlsx


  warn("Workbook contains no default style, apply openpyxl's default")


Processing file: Casos_Atentados_Terroristas_202503.xlsx
Processing file: Casos_Danos_a_Bienes_Civiles_202503.xlsx


  warn("Workbook contains no default style, apply openpyxl's default")
  warn("Workbook contains no default style, apply openpyxl's default")


Processing file: Casos_Secuestro_202503.xlsx


  warn("Workbook contains no default style, apply openpyxl's default")


Processing file: Casos_Masacre_202503.xlsx


  warn("Workbook contains no default style, apply openpyxl's default")


### 2. Concatenate DataFrames and Display Summary
This block consolidates all individual DataFrames processed from the Excel files into a single combined_df. It includes a check to ensure data was processed before concatenation. Finally, it displays the head, info, and violence type counts of the combined DataFrame for initial verification.

In [25]:
# Concatenate all dataframes in the list into a single DataFrame.
if all_dataframes:
    combined_df = pd.concat(all_dataframes, ignore_index=True)

    # In case some files were skipped and 'ActorType' was not added to all, ensure it exists
    if 'ActorType' not in combined_df.columns:
        combined_df['ActorType'] = 'UNKNOWN_ACTOR_TYPE' # Fallback if no actor types were classified

    # Display basic info
    print("\nCombined DataFrame Head:")
    print(combined_df.head())
    print("\nCombined DataFrame Info:")
    combined_df.info()
    print("\nViolence Type Counts:")
    print(combined_df['violence type'].value_counts())
    print("\nActor Type Counts:")
    print(combined_df['ActorType'].value_counts())

    # Save the combined DataFrame as a raw processed file with ActorType.
    # This acts as a single source of truth for the combined raw data with violence/actor types.
    output_processed_base_dir = os.path.join(os.getcwd(), '..', 'Data', 'processed', 'cases')
    os.makedirs(output_processed_base_dir, exist_ok=True)
    combined_df.to_csv(os.path.join(output_processed_base_dir, 'raw_combined_cases_with_actors.tsv'), sep='\t', index=False)
    print(f"\nSaved raw combined cases with actor types to: {os.path.join(output_processed_base_dir, 'raw_combined_cases_with_actors.tsv')}")

else:
    print("\nNo Excel files were processed or found. 'combined_df' is not created.")


Combined DataFrame Head:
    Año  Mes  Día  ID Caso  Municipio  Departamento  \
0  1991    5   25   100265  JERUSALEN  CUNDINAMARCA   
1  2004   12    2   100282    LA MESA  CUNDINAMARCA   
2  1993    3    9   101616     YACOPI  CUNDINAMARCA   
3  1997    6    8   102204     BOJAYA         CHOCO   
4  2000    6   12   102489      LLORO         CHOCO   

                         Región Presunto Responsable        ActorType  \
0  SUROCCIDENTE DE CUNDINAMARCA            GUERRILLA  NON_STATE_ACTOR   
1  SUROCCIDENTE DE CUNDINAMARCA    GRUPO PARAMILITAR  NON_STATE_ACTOR   
2               MAGDALENA MEDIO    GRUPO PARAMILITAR  NON_STATE_ACTOR   
3                        ATRATO    GRUPO PARAMILITAR  NON_STATE_ACTOR   
4                        ATRATO    GRUPO PARAMILITAR  NON_STATE_ACTOR   

  violence type  
0            VS  
1            VS  
2            VS  
3            VS  
4            VS  

Combined DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306738 entries, 0 to

In [26]:
combined_df['ActorType']=combined_df['ActorType']
combined_df.to_csv('../Data/processed/cases/real_total_cases.tsv',sep='\t')
combined_df.head()

Unnamed: 0,Año,Mes,Día,ID Caso,Municipio,Departamento,Región,Presunto Responsable,ActorType,violence type
0,1991,5,25,100265,JERUSALEN,CUNDINAMARCA,SUROCCIDENTE DE CUNDINAMARCA,GUERRILLA,NON_STATE_ACTOR,VS
1,2004,12,2,100282,LA MESA,CUNDINAMARCA,SUROCCIDENTE DE CUNDINAMARCA,GRUPO PARAMILITAR,NON_STATE_ACTOR,VS
2,1993,3,9,101616,YACOPI,CUNDINAMARCA,MAGDALENA MEDIO,GRUPO PARAMILITAR,NON_STATE_ACTOR,VS
3,1997,6,8,102204,BOJAYA,CHOCO,ATRATO,GRUPO PARAMILITAR,NON_STATE_ACTOR,VS
4,2000,6,12,102489,LLORO,CHOCO,ATRATO,GRUPO PARAMILITAR,NON_STATE_ACTOR,VS


### 3. Group Cases Data by Year, Month, Violence Type, and Actor Type (1958-2022) for Country Level
This block processes the combined_df to filter and group cases data for the entire country (Colombia). It specifically groups by Año (Year), Mes (Month), violence type (VI/VS), and the created ActorType (STATE_ACTOR, NON_STATE_ACTOR).

The purpose is to count the total number of cases for each combination of these dimensions, creating complete monthly time series. It ensures all possible Year-Month-Violence Type-Actor Type combinations within the specified range (1958-2022) are present by imputing missing entries with a count of 0. The resulting time series DataFrames for each ActorType are then saved as separate TSV files in their respective output directories (country/state_actors and country/non_state_actors).

In [27]:
# 3. Group Cases Data by Year, Month, Violence Type, and Actor Type (1958-2022) for Country Level
combined_df = pd.read_csv('../Data/processed/cases/real_total_cases.tsv',sep='\t')
print("\n--- Grouping Cases data by Year, Month, Violence Type, and Actor Type for Country (1958-2022) ---")

# Ensure combined_df exists from the previous step (Cell 1)
if 'combined_df' not in locals() or combined_df.empty:
    print("Error: 'combined_df' not found or is empty. Please run the initial data loading and actor classification code block (Cell 1).")
else:
    # Define the output directory for country-level case data, now segmented by ActorType
    output_dir_country_base = os.path.join(os.getcwd(), '..', 'Data', 'processed', 'cases', 'country')

    # Define the year range for filtering and imputation
    min_year = 1958
    max_year = 2022

    # Ensure 'Año', 'Mes', 'violence type', and 'ActorType' columns are valid
    required_cols_for_grouping = ['Año', 'Mes', 'violence type', 'ActorType']
    df_filtered_cases = combined_df.copy()

    try:
        # Convert to numeric and drop NaNs for 'Año' and 'Mes'
        df_filtered_cases['Año'] = pd.to_numeric(df_filtered_cases['Año'], errors='coerce')
        df_filtered_cases['Mes'] = pd.to_numeric(df_filtered_cases['Mes'], errors='coerce')
        df_filtered_cases = df_filtered_cases.dropna(subset=['Año', 'Mes']).copy()

        # Filter by the specified year range
        df_filtered_cases = df_filtered_cases[
            (df_filtered_cases['Año'] >= min_year) & (df_filtered_cases['Año'] <= max_year)
        ].copy()

        # Drop rows where 'violence type' or 'ActorType' are missing/null
        df_filtered_cases = df_filtered_cases.dropna(subset=['violence type', 'ActorType']).copy()

        # Get unique violence types and actor types present in the filtered data
        unique_violence_types = df_filtered_cases['violence type'].unique()
        unique_actor_types = df_filtered_cases['ActorType'].unique()

        if len(unique_actor_types) == 0:
            print("Warning: No valid ActorTypes found after filtering. Skipping country-level grouping by actor.")
        else:
            print(f"Filtered cases data for years {min_year}-{max_year}. Shape: {df_filtered_cases.shape}")
            print(f"Unique Violence Types found: {unique_violence_types}")
            print(f"Unique Actor Types found: {unique_actor_types}")

            # Create a complete list of all expected Year-Month combinations for imputation
            full_date_range = pd.date_range(start=f'{min_year}-01-01', end=f'{max_year}-12-01', freq='MS')

            # Loop through each unique ActorType to process and save data separately
            for actor_type in unique_actor_types:
                actor_type_clean_name = actor_type.replace(" ", "_").lower() # e.g., 'state_actor', 'non_state_actor'
                output_dir_actor_type = os.path.join(output_dir_country_base, actor_type_clean_name)
                os.makedirs(output_dir_actor_type, exist_ok=True)
                print(f"\nEnsured output directory exists for {actor_type}: {output_dir_actor_type}")

                # Filter data for the current ActorType
                df_actor_type = df_filtered_cases[df_filtered_cases['ActorType'] == actor_type].copy()

                if df_actor_type.empty:
                    print(f"No data for ActorType: {actor_type}. Skipping processing for this actor type.")
                    continue

                # Group by 'Año', 'Mes', and 'violence type' and count the occurrences
                grouping_cols = ['Año', 'Mes', 'violence type']
                cases_by_month_year_type_actor = df_actor_type.groupby(grouping_cols).size()

                # Create a complete multi-index for this ActorType, including all Year-Month-Violence Type combinations
                all_combinations_actor = list(product(full_date_range.year, full_date_range.month, unique_violence_types))
                full_index_actor = pd.MultiIndex.from_tuples(
                    all_combinations_actor,
                    names=grouping_cols
                )

                # Reindex the monthly case counts using the complete index
                cases_by_month_year_complete_actor = cases_by_month_year_type_actor.reindex(full_index_actor)

                # Fill NaN values with 0 and convert to integer
                cases_by_month_year_complete_actor = cases_by_month_year_complete_actor.fillna(0).astype(int)

                # Convert Series back to a DataFrame with 'CaseCount'
                grouped_cases_country_monthly_actor = cases_by_month_year_complete_actor.reset_index(name='CaseCount')

                # Sort the DataFrame chronologically and by violence type
                grouped_cases_country_monthly_actor = grouped_cases_country_monthly_actor.sort_values(by=['Año', 'Mes', 'violence type']).reset_index(drop=True)

                # --- Deduplication (Crucial Step) ---
                initial_rows = len(grouped_cases_country_monthly_actor)
                grouped_cases_country_monthly_actor.drop_duplicates(subset=['Año', 'Mes', 'violence type'], inplace=True)
                if len(grouped_cases_country_monthly_actor) < initial_rows:
                    print(f"Warning: Removed {initial_rows - len(grouped_cases_country_monthly_actor)} duplicate rows for (Año, Mes, violence type) within {actor_type} data before saving.")
                else:
                    print(f"No duplicates found for (Año, Mes, violence type) within {actor_type} data before saving.")

                # --- Save the results to TSV ---
                # Filename: 'colombia_cases_actor_type.tsv'
                output_filename = f"colombia_cases_{actor_type_clean_name}.tsv"
                output_path = os.path.join(output_dir_actor_type, output_filename)

                try:
                    grouped_cases_country_monthly_actor.to_csv(output_path, sep='\t', index=False)
                    print(f"Saved data for Country ({actor_type}) to {output_filename}")
                except Exception as e:
                    print(f"Error saving data for Country ({actor_type}) to {output_filename}: {e}")

            print("\nCountry-level cases data processing and saving by ActorType complete.")

    except KeyError as e:
        print(f"Error: Required column not found - {e}. Please check column names in the combined_df or input Excel files.")
    except Exception as e:
        print(f"An unexpected error occurred during grouping: {e}")




--- Grouping Cases data by Year, Month, Violence Type, and Actor Type for Country (1958-2022) ---
Filtered cases data for years 1958-2022. Shape: (285325, 11)
Unique Violence Types found: ['VS' 'VI']
Unique Actor Types found: ['NON_STATE_ACTOR' 'STATE_ACTOR']

Ensured output directory exists for NON_STATE_ACTOR: /Users/diegohernandez/Documents/GitHub/VS_VI_Source_Code/Scripts/../Data/processed/cases/country/non_state_actor
Saved data for Country (NON_STATE_ACTOR) to colombia_cases_non_state_actor.tsv

Ensured output directory exists for STATE_ACTOR: /Users/diegohernandez/Documents/GitHub/VS_VI_Source_Code/Scripts/../Data/processed/cases/country/state_actor
Saved data for Country (STATE_ACTOR) to colombia_cases_state_actor.tsv

Country-level cases data processing and saving by ActorType complete.


In [28]:
#grouped_cases_country_monthly_actor.to_csv('../Data/processed/cases/country/1958_2022_cases_country.tsv',sep='\t')
grouped_cases_country_monthly_actor.head()

Unnamed: 0,Año,Mes,violence type,CaseCount
0,1958,1,VI,0
780,1958,1,VS,3
1560,1958,2,VI,0
2340,1958,2,VS,3
3120,1958,3,VI,0


In [20]:
print(output_path)

/Users/diegohernandez/Documents/GitHub/VS_VI_Source_Code/Scripts/../Data/processed/cases/country/state_actor/colombia_cases_state_actor.tsv


### 4. Generate Animated Line Plot of VI vs VS Cases (1958-2022)
This block creates an animated line plot visualizing the yearly trend of Selective Violence (VS) and Indiscriminate Violence (VI) cases in Colombia from 1958 to 2022. The animation shows how the cumulative case counts for each violence type evolve over time, providing a dynamic view of their historical trajectories. The output is saved as an MP4 video file.

In [20]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import os
import seaborn as sns # Import seaborn for aesthetics
from itertools import product # Import product for generating combinations (if needed for re-running previous cells)
import numpy as np # Import numpy for numerical operations, specifically arange

grouped_cases_country_monthly_type = pd.read_csv('../Data/processed/cases/country/1958_2022_cases_country.tsv',sep='\t')

# Ensure the grouped_cases_country_monthly_type DataFrame exists
if 'grouped_cases_country_monthly_type' not in locals():
    print("Error: 'grouped_cases_country_monthly_type' not found. Please run the previous code block to create it.")
else:
    # --- 4. Generate Animated Line Plot of VI vs VS Cases (1958-2022) ---

    print("\n--- Generating Animated Line Plot ---")

    # Prepare data: Group by Year and Violence Type and sum the monthly counts
    # This gives the total cases per year for each violence type
    yearly_cases_by_type = grouped_cases_country_monthly_type.groupby(['Año', 'violence type'])['CaseCount'].sum().reset_index()

    # Pivot the data for easier plotting
    # Years will be the index, violence types will be columns, and values will be CaseCount
    yearly_cases_pivot = yearly_cases_by_type.pivot(index='Año', columns='violence type', values='CaseCount').fillna(0)

    # Ensure both 'VI' and 'VS' columns exist, even if one had 0 cases for all years
    for v_type in ['VI', 'VS']:
        if v_type not in yearly_cases_pivot.columns:
            yearly_cases_pivot[v_type] = 0

    # Sort the pivot table by year
    yearly_cases_pivot = yearly_cases_pivot.sort_index()

    # Calculate cumulative sum for the animation
    # This shows the total cases up to a given year
    yearly_cases_cumulative = yearly_cases_pivot.cumsum()

    # Set up the figure and axes for the plot
    plt.style.use('seaborn-v0_8-darkgrid') # Use a nice seaborn style
    fig, ax = plt.subplots(figsize=(10, 4))

    # Set initial plot limits (adjust as needed)
    ax.set_xlim(yearly_cases_cumulative.index.min(), yearly_cases_cumulative.index.max())
    ax.set_ylim(0, yearly_cases_cumulative.values.max() * 1.1) # Add 10% padding to y-axis

    # Set titles and labels
    ax.set_title('Cumulative Cases of Selective (VS) and Indiscriminate (VI) Violence in Colombia (1958-2022)', fontsize=14)
    ax.set_xlabel('Year', fontsize=12)
    ax.set_ylabel('Cumulative Number of Cases', fontsize=12)
    ax.grid(True, linestyle='--', alpha=0.6)

    # --- X-axis Tick Adjustment ---
    # Determine the range of years
    min_year = yearly_cases_cumulative.index.min()
    max_year = yearly_cases_cumulative.index.max()

    # Set ticks at intervals (e.g., every 10 years)
    # Use numpy.arange for consistent spacing
    tick_years = np.arange(min_year, max_year + 1, 10) # Adjust the step (10) as needed

    # Ensure the last year is included if it's not exactly on an interval
    if max_year not in tick_years:
         tick_years = np.append(tick_years, max_year)

    ax.set_xticks(tick_years)
    ax.tick_params(axis='x', rotation=45) # Rotate labels slightly if needed for clarity

    # Initialize the lines for the plot
    line_vi, = ax.plot([], [], label='VI Cases', color='red', linewidth=2)
    line_vs, = ax.plot([], [], label='VS Cases', color='blue', linewidth=2)
    # --- ADJUSTMENT HERE: Change loc to 'lower right' ---
    ax.legend(loc='lower right')

    # Add a text annotation for the current year (will be updated in animation)
    year_text = ax.text(0.02, 0.95, '', transform=ax.transAxes, fontsize=15, color='gray')

    # Define the animation update function
    def update(frame):
        """
        Updates the plot data for each frame of the animation.
        frame: The current frame number (index of the year).
        """
        current_year_index = frame
        current_year = yearly_cases_cumulative.index[current_year_index]

        # Update data for VI line up to the current year
        line_vi.set_data(yearly_cases_cumulative.index[:current_year_index+1],
                         yearly_cases_cumulative['VI'].iloc[:current_year_index+1])

        # Update data for VS line up to the current year
        line_vs.set_data(yearly_cases_cumulative.index[:current_year_index+1],
                         yearly_cases_cumulative['VS'].iloc[:current_year_index+1])

        # Update the year text annotation
        year_text.set_text(f'Year: {current_year}')

        # Need to return all artists that were modified
        return line_vi, line_vs, year_text, ax.legend_

    # Create the animation
    # frames: number of frames (equal to the number of years)
    # interval: delay between frames in milliseconds
    # blit: True means only re-draw the parts that have changed (can be faster)
    ani = animation.FuncAnimation(fig, update, frames=len(yearly_cases_cumulative.index),
                                  interval=200, blit=True) # Adjust interval for speed

    # Define the output path for the video
    output_dir = os.path.join(os.getcwd(), '..', 'Images')
    output_filename = 'VI_VS_Colombia_Cases.mp4'
    output_path = os.path.join(output_dir, output_filename)

    # Create the output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    # Save the animation
    # Requires ffmpeg. If you don't have it, you might need to install it
    # (e.g., using conda install -c conda-forge ffmpeg or through a system package manager)
    try:
        print(f"\nSaving animation to {output_path}...")
        # Fix for Matplotlib Deprecation Warning: close figure before switching backend
        plt.close(fig) # Close the figure before switching
        plt.switch_backend('agg') # Switch backend for saving

        # Re-create the writer after closing and switching
        writer = animation.FFMpegWriter(fps=10) # frames per second
        # Re-create the animation object, or ensure the writer can handle the original fig
        # It's often better to just pass the fig directly to save if possible,
        # or ensure the backend is set correctly BEFORE figure creation if saving without displaying.
        # However, since the figure is needed for FuncAnimation, closing *before* saving
        # is the fix for the specific warning. Let's try saving the *original* animation object.
        ani.save(output_path, writer=writer)

        print("Animation saved successfully!")
    except Exception as e:
        print(f"\nError saving animation: {e}")
        print("Please ensure you have ffmpeg installed and accessible in your environment.")
        print("You might need to install it using: conda install -c conda-forge ffmpeg")
        print("Or using your system's package manager (e.g., sudo apt-get install ffmpeg on Ubuntu or brew install ffmpeg on macOS).")




--- Generating Animated Line Plot ---

Saving animation to /Users/diegohernandez/Documents/GitHub/VS_VI_Source_Code/Scripts/../Images/VI_VS_Colombia_Cases.mp4...
Animation saved successfully!


### 5. Group Cases Data by Year, Month, Violence Type, Department, and Actor Type (1958-2022)
This block processes the combined_df to filter and group cases data for each individual department in Colombia. For each department, it calculates the monthly case count for each violence type (VI/VS) and ActorType (STATE_ACTOR, NON_STATE_ACTOR).

The purpose is to create complete monthly time series for each combination of department, violence type, and actor type within the specified year range (1958-2022), ensuring all possible Year-Month-Violence Type-Actor Type combinations are present by imputing missing entries with a count of 0. The resulting time series DataFrames for each ActorType within each department are then saved as separate TSV files in their respective output directories (departments/state_actors and departments/non_state_actors).

In [29]:
combined_df = pd.read_csv('../Data/processed/cases/real_total_cases.tsv',sep='\t')

# 5. Group Cases Data by Year, Month, Violence Type, Department, and Actor Type (1958-2022)

print("\n--- Grouping Cases data by Department, Month, Violence Type, and Actor Type (1958-2022) ---")

# Ensure combined_df exists from the initial loading step (Cell 1)
if 'combined_df' not in locals() or combined_df.empty:
    print("Error: 'combined_df' not found or is empty. Please run the initial data loading and actor classification code block (Cell 1).")
else:
    # Define the base output directory for department-level case data, segmented by ActorType
    output_dir_departments_base = os.path.join(os.getcwd(), '..', 'Data', 'processed', 'cases', 'departments')

    # Define the year range for filtering and imputation
    min_year = 1958
    max_year = 2022

    # Ensure 'Año', 'Mes', 'violence type', 'Department', and 'ActorType' columns are valid
    required_cols_for_grouping = ['Año', 'Mes', 'violence type', 'Departamento', 'ActorType']
    df_filtered_cases_dept = combined_df.copy()

    try:
        # Convert to numeric and drop NaNs for 'Año' and 'Mes'
        df_filtered_cases_dept['Año'] = pd.to_numeric(df_filtered_cases_dept['Año'], errors='coerce')
        df_filtered_cases_dept['Mes'] = pd.to_numeric(df_filtered_cases_dept['Mes'], errors='coerce')
        df_filtered_cases_dept = df_filtered_cases_dept.dropna(subset=['Año', 'Mes']).copy()

        # Filter by the specified year range
        df_filtered_cases_dept = df_filtered_cases_dept[
            (df_filtered_cases_dept['Año'] >= min_year) & (df_filtered_cases_dept['Año'] <= max_year)
        ].copy()

        # Drop rows where 'violence type', 'Departamento' or 'ActorType' are missing/null
        df_filtered_cases_dept = df_filtered_cases_dept.dropna(subset=['violence type', 'Departamento', 'ActorType']).copy()

        # Get unique violence types and actor types present in the filtered data
        unique_violence_types = df_filtered_cases_dept['violence type'].unique()
        unique_actor_types = df_filtered_cases_dept['ActorType'].unique()
        unique_departments = df_filtered_cases_dept['Departamento'].unique()

        if len(unique_actor_types) == 0:
            print("Warning: No valid ActorTypes found after filtering. Skipping department-level grouping by actor.")
        elif len(unique_departments) == 0:
            print("Warning: No valid Departments found after filtering. Skipping department-level grouping.")
        else:
            print(f"Filtered cases data for years {min_year}-{max_year}. Shape: {df_filtered_cases_dept.shape}")
            print(f"Unique Violence Types found: {unique_violence_types}")
            print(f"Unique Actor Types found: {unique_actor_types}")
            print(f"Found {len(unique_departments)} unique departments. Processing each department...")

            # Create a complete list of all expected Year-Month combinations for imputation
            full_date_range = pd.date_range(start=f'{min_year}-01-01', end=f'{max_year}-12-01', freq='MS')

            # Loop through each unique ActorType to create separate output subdirectories
            for actor_type in unique_actor_types:
                actor_type_clean_name = actor_type.replace(" ", "_").lower() # e.g., 'state_actor', 'non_state_actor'
                output_dir_actor_type = os.path.join(output_dir_departments_base, actor_type_clean_name)
                os.makedirs(output_dir_actor_type, exist_ok=True)
                print(f"\nEnsured output directory exists for {actor_type} in departments: {output_dir_actor_type}")

                # Loop through each unique department
                for department in unique_departments:
                    department_str = str(department) # Ensure department name is string
                    print(f"\nProcessing department: {department_str} for ActorType: {actor_type}")

                    # Filter data for the current department and ActorType
                    df_dept_actor_type = df_filtered_cases_dept[
                        (df_filtered_cases_dept['Departamento'] == department) &
                        (df_filtered_cases_dept['ActorType'] == actor_type)
                    ].copy()

                    if df_dept_actor_type.empty:
                        print(f"No data for Department: {department_str} and ActorType: {actor_type}. Skipping processing.")
                        continue

                    # Group by 'Año', 'Mes', and 'violence type' and count the occurrences
                    grouping_cols = ['Año', 'Mes', 'violence type']
                    cases_by_month_year_type_dept_actor = df_dept_actor_type.groupby(grouping_cols).size()

                    # Create a complete multi-index for this Department-ActorType combination
                    # It includes all Year-Month-Violence Type combinations
                    all_combinations_dept_actor = list(product(full_date_range.year, full_date_range.month, unique_violence_types))
                    full_index_dept_actor = pd.MultiIndex.from_tuples(
                        all_combinations_dept_actor,
                        names=grouping_cols
                    )

                    # Reindex the monthly case counts using the complete index
                    cases_by_month_year_complete_dept_actor = cases_by_month_year_type_dept_actor.reindex(full_index_dept_actor)

                    # Fill NaN values with 0 and convert to integer
                    cases_by_month_year_complete_dept_actor = cases_by_month_year_complete_dept_actor.fillna(0).astype(int)

                    # Convert Series back to a DataFrame with 'CaseCount'
                    grouped_cases_dept_monthly_actor = cases_by_month_year_complete_dept_actor.reset_index(name='CaseCount')

                    # Sort the DataFrame chronologically and by violence type
                    grouped_cases_dept_monthly_actor = grouped_cases_dept_monthly_actor.sort_values(by=['Año', 'Mes', 'violence type']).reset_index(drop=True)

                    # --- Deduplication (Crucial Step) ---
                    initial_rows = len(grouped_cases_dept_monthly_actor)
                    grouped_cases_dept_monthly_actor.drop_duplicates(subset=['Año', 'Mes', 'violence type'], inplace=True)
                    if len(grouped_cases_dept_monthly_actor) < initial_rows:
                        print(f"Warning: Removed {initial_rows - len(grouped_cases_dept_monthly_actor)} duplicate rows for (Año, Mes, violence type) within {department_str} - {actor_type} data before saving.")
                    else:
                        print(f"No duplicates found for (Año, Mes, violence type) within {department_str} - {actor_type} data before saving.")

                    # --- Save the results to TSV ---
                    # Filename: 'departmentname_cases_actor_type.tsv'
                    filename_dept = department_str.replace(" ", "").lower() + f"_cases_{actor_type_clean_name}.tsv"
                    output_path_dept = os.path.join(output_dir_actor_type, filename_dept)

                    try:
                        grouped_cases_dept_monthly_actor.to_csv(output_path_dept, sep='\t', index=False)
                        print(f"Saved data for Department ({department_str}, {actor_type}) to {filename_dept}")
                    except Exception as e:
                        print(f"Error saving data for Department ({department_str}, {actor_type}) to {filename_dept}: {e}")

            print("\nDepartment-level cases data processing and saving by ActorType complete.")

    except KeyError as e:
        print(f"Error: Required column not found - {e}. Please check column names in the combined_df or input Excel files.")
    except Exception as e:
        print(f"An unexpected error occurred during processing: {e}")



--- Grouping Cases data by Department, Month, Violence Type, and Actor Type (1958-2022) ---
Filtered cases data for years 1958-2022. Shape: (285325, 11)
Unique Violence Types found: ['VS' 'VI']
Unique Actor Types found: ['NON_STATE_ACTOR' 'STATE_ACTOR']
Found 35 unique departments. Processing each department...

Ensured output directory exists for NON_STATE_ACTOR in departments: /Users/diegohernandez/Documents/GitHub/VS_VI_Source_Code/Scripts/../Data/processed/cases/departments/non_state_actor

Processing department: CUNDINAMARCA for ActorType: NON_STATE_ACTOR
Saved data for Department (CUNDINAMARCA, NON_STATE_ACTOR) to cundinamarca_cases_non_state_actor.tsv

Processing department: CHOCO for ActorType: NON_STATE_ACTOR
Saved data for Department (CHOCO, NON_STATE_ACTOR) to choco_cases_non_state_actor.tsv

Processing department: HUILA for ActorType: NON_STATE_ACTOR
Saved data for Department (HUILA, NON_STATE_ACTOR) to huila_cases_non_state_actor.tsv

Processing department: LA GUAJIRA for

### 6. Group Cases Data by Year, Month, Violence Type, Region, and Actor Type (1958-2022)
This block processes the combined_df to filter and group cases data for each individual region in Colombia. For each region, it calculates the monthly case count for each violence type (VI/VS) and ActorType (STATE_ACTOR, NON_STATE_ACTOR).

The purpose is to create complete monthly time series for each combination of region, violence type, and actor type within the specified year range (1958-2022), ensuring all possible Year-Month-Violence Type-Actor Type combinations are present by imputing missing entries with a count of 0. The resulting time series DataFrames for each ActorType within each region are then saved as separate TSV files in their respective output directories (regions/state_actors and regions/non_state_actors).

In [30]:
combined_df = pd.read_csv('../Data/processed/cases/real_total_cases.tsv',sep='\t')

# 6. Group Cases Data by Year, Month, Violence Type, Region, and Actor Type (1958-2022)

print("\n--- Grouping Cases data by Region, Month, Violence Type, and Actor Type (1958-2022) ---")

# Ensure combined_df exists from the initial loading step (Cell 1)
if 'combined_df' not in locals() or combined_df.empty:
    print("Error: 'combined_df' not found or is empty. Please run the initial data loading and actor classification code block (Cell 1).")
else:
    # Define the base output directory for region-level case data, segmented by ActorType
    output_dir_regions_base = os.path.join(os.getcwd(), '..', 'Data', 'processed', 'cases', 'regions')

    # Define the year range for filtering and imputation
    min_year = 1958
    max_year = 2022

    # Ensure 'Año', 'Mes', 'violence type', 'Región', and 'ActorType' columns are valid
    required_cols_for_grouping = ['Año', 'Mes', 'violence type', 'Región', 'ActorType']
    df_filtered_cases_region = combined_df.copy()

    try:
        # Convert to numeric and drop NaNs for 'Año' and 'Mes'
        df_filtered_cases_region['Año'] = pd.to_numeric(df_filtered_cases_region['Año'], errors='coerce')
        df_filtered_cases_region['Mes'] = pd.to_numeric(df_filtered_cases_region['Mes'], errors='coerce')
        df_filtered_cases_region = df_filtered_cases_region.dropna(subset=['Año', 'Mes']).copy()

        # Filter by the specified year range
        df_filtered_cases_region = df_filtered_cases_region[
            (df_filtered_cases_region['Año'] >= min_year) & (df_filtered_cases_region['Año'] <= max_year)
        ].copy()

        # Drop rows where 'violence type', 'Región' or 'ActorType' are missing/null
        df_filtered_cases_region = df_filtered_cases_region.dropna(subset=['violence type', 'Región', 'ActorType']).copy()

        # Get unique violence types and actor types present in the filtered data
        unique_violence_types = df_filtered_cases_region['violence type'].unique()
        unique_actor_types = df_filtered_cases_region['ActorType'].unique()
        unique_regions = df_filtered_cases_region['Región'].unique()

        if len(unique_actor_types) == 0:
            print("Warning: No valid ActorTypes found after filtering. Skipping region-level grouping by actor.")
        elif len(unique_regions) == 0:
            print("Warning: No valid Regions found after filtering. Skipping region-level grouping.")
        else:
            print(f"Filtered cases data for years {min_year}-{max_year}. Shape: {df_filtered_cases_region.shape}")
            print(f"Unique Violence Types found: {unique_violence_types}")
            print(f"Unique Actor Types found: {unique_actor_types}")
            print(f"Found {len(unique_regions)} unique regions. Processing each region...")

            # Create a complete list of all expected Year-Month combinations for imputation
            full_date_range = pd.date_range(start=f'{min_year}-01-01', end=f'{max_year}-12-01', freq='MS')

            # Loop through each unique ActorType to create separate output subdirectories
            for actor_type in unique_actor_types:
                actor_type_clean_name = actor_type.replace(" ", "_").lower() # e.g., 'state_actor', 'non_state_actor'
                output_dir_actor_type = os.path.join(output_dir_regions_base, actor_type_clean_name)
                os.makedirs(output_dir_actor_type, exist_ok=True)
                print(f"\nEnsured output directory exists for {actor_type} in regions: {output_dir_actor_type}")

                # Loop through each unique region
                for region in unique_regions:
                    region_str = str(region) # Ensure region name is string
                    print(f"\nProcessing region: {region_str} for ActorType: {actor_type}")

                    # Filter data for the current region and ActorType
                    df_region_actor_type = df_filtered_cases_region[
                        (df_filtered_cases_region['Región'] == region) &
                        (df_filtered_cases_region['ActorType'] == actor_type)
                    ].copy()

                    if df_region_actor_type.empty:
                        print(f"No data for Region: {region_str} and ActorType: {actor_type}. Skipping processing.")
                        continue

                    # Group by 'Año', 'Mes', and 'violence type' and count the occurrences
                    grouping_cols = ['Año', 'Mes', 'violence type']
                    cases_by_month_year_type_region_actor = df_region_actor_type.groupby(grouping_cols).size()

                    # Create a complete multi-index for this Region-ActorType combination
                    # It includes all Year-Month-Violence Type combinations
                    all_combinations_region_actor = list(product(full_date_range.year, full_date_range.month, unique_violence_types))
                    full_index_region_actor = pd.MultiIndex.from_tuples(
                        all_combinations_region_actor,
                        names=grouping_cols
                    )

                    # Reindex the monthly case counts using the complete index
                    cases_by_month_year_complete_region_actor = cases_by_month_year_type_region_actor.reindex(full_index_region_actor)

                    # Fill NaN values with 0 and convert to integer
                    cases_by_month_year_complete_region_actor = cases_by_month_year_complete_region_actor.fillna(0).astype(int)

                    # Convert Series back to a DataFrame with 'CaseCount'
                    grouped_cases_region_monthly_actor = cases_by_month_year_complete_region_actor.reset_index(name='CaseCount')

                    # Sort the DataFrame chronologically and by violence type
                    grouped_cases_region_monthly_actor = grouped_cases_region_monthly_actor.sort_values(by=['Año', 'Mes', 'violence type']).reset_index(drop=True)

                    # --- Deduplication (Crucial Step) ---
                    initial_rows = len(grouped_cases_region_monthly_actor)
                    grouped_cases_region_monthly_actor.drop_duplicates(subset=['Año', 'Mes', 'violence type'], inplace=True)
                    if len(grouped_cases_region_monthly_actor) < initial_rows:
                        print(f"Warning: Removed {initial_rows - len(grouped_cases_region_monthly_actor)} duplicate rows for (Año, Mes, violence type) within {region_str} - {actor_type} data before saving.")
                    else:
                        print(f"No duplicates found for (Año, Mes, violence type) within {region_str} - {actor_type} data before saving.")

                    # --- Save the results to TSV ---
                    # Filename: 'regionname_cases_actor_type.tsv'
                    filename_region = region_str.replace(" ", "").lower() + f"_cases_{actor_type_clean_name}.tsv"
                    output_path_region = os.path.join(output_dir_actor_type, filename_region)

                    try:
                        grouped_cases_region_monthly_actor.to_csv(output_path_region, sep='\t', index=False)
                        print(f"Saved data for Region ({region_str}, {actor_type}) to {filename_region}")
                    except Exception as e:
                        print(f"Error saving data for Region ({region_str}, {actor_type}) to {filename_region}: {e}")

            print("\nRegion-level cases data processing and saving by ActorType complete.")

    except KeyError as e:
        print(f"Error: Required column not found - {e}. Please check column names in the combined_df or input Excel files.")
    except Exception as e:
        print(f"An unexpected error occurred during processing: {e}")


--- Grouping Cases data by Region, Month, Violence Type, and Actor Type (1958-2022) ---
Filtered cases data for years 1958-2022. Shape: (284215, 11)
Unique Violence Types found: ['VS' 'VI']
Unique Actor Types found: ['NON_STATE_ACTOR' 'STATE_ACTOR']
Found 78 unique regions. Processing each region...

Ensured output directory exists for NON_STATE_ACTOR in regions: /Users/diegohernandez/Documents/GitHub/VS_VI_Source_Code/Scripts/../Data/processed/cases/regions/non_state_actor

Processing region: SUROCCIDENTE DE CUNDINAMARCA for ActorType: NON_STATE_ACTOR
Saved data for Region (SUROCCIDENTE DE CUNDINAMARCA, NON_STATE_ACTOR) to suroccidentedecundinamarca_cases_non_state_actor.tsv

Processing region: MAGDALENA MEDIO for ActorType: NON_STATE_ACTOR
Saved data for Region (MAGDALENA MEDIO, NON_STATE_ACTOR) to magdalenamedio_cases_non_state_actor.tsv

Processing region: ATRATO for ActorType: NON_STATE_ACTOR
Saved data for Region (ATRATO, NON_STATE_ACTOR) to atrato_cases_non_state_actor.tsv

Pro