# Analysis of Selective and Indiscriminate Violence (VS/VI) in Colombia

## Project: VS_VI_Source_Code
This Jupyter Notebook is part of the VS_VI_Source_Code project, which aims to analyze the dynamics of Selective Violence (VS) and Indiscriminate Violence (VI) in Colombia.

## Purpose
The primary purpose of this notebook is to process raw violence data, calculate key temporal metrics (Escalation and Intensity) for different violence types (VS, VI, and VC = VS + VI), and save the results for further analysis and visualization.

## Workflow
This notebook covers the initial data ingestion, cleaning, and metric calculation stages of the project's analytical workflow. It processes data at the Country, Department, and Region levels.

## About
This notebook implements the logic to read raw case data from Excel files, consolidate it, define and apply functions to calculate Escalation and Intensity based on monthly case counts, and store the processed data and metrics in a structured format.

### 1. Initial Setup, Library Imports, and Path Configuration
This block imports the libraries the notebook needs: pandas for data handling, os for file system interaction, numpy for numerical operations, and itertools (`product`) for generating Cartesian products. It also defines the paths for raw data input and processed results output, the columns to keep from each file, and the filenames that belong to each violence type. Both paths are built from the current working directory (`os.getcwd()`), which is assumed to be the notebook's folder.
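The path configuration described above can also be sketched with `pathlib`. This is a minimal illustration, not the notebook's own code: the helper name `project_paths` is hypothetical, and it assumes the same layout as the notebook (data and results folders one level above the notebook's folder).

```python
from pathlib import Path


def project_paths(notebook_dir):
    """Return the raw-data and results folders located one level above notebook_dir."""
    project_root = Path(notebook_dir).parent
    return {
        "data": project_root / "Data" / "raw",
        "results": project_root / "Results" / "intensity & escalation" / "cases",
    }


# Path.cwd() plays the role of os.getcwd() in the notebook.
paths = project_paths(Path.cwd())
print(paths["data"])
print(paths["results"])
```

If the results folder might not exist yet, calling `paths["results"].mkdir(parents=True, exist_ok=True)` before saving would create it.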

In [1]:
# 1. Initial Setup, Library Imports, and Path Configuration

import pandas as pd
import os
import numpy as np
from itertools import product

# Define the path to the folder containing the raw Excel files.
# Assumes the current working directory is the notebook's subfolder and the data is one level up in 'Data/raw'.
# Adjust '..' if your folder structure is different.
data_folder_path = os.path.join(os.getcwd(), '..', 'Data', 'raw')

# Define the base output directory for results.
# Assumes the results should be saved one level up in 'Results/intensity & escalation/cases'.
base_results_dir = os.path.join(os.getcwd(), '..', 'Results', 'intensity & escalation', 'cases')

# Define the list of columns to select from each Excel file.
columns_to_select = [
    "Año",
    "Mes",
    "Día",
    "ID Caso",
    "Municipio",
    "Departamento",
    "Región"
]

# Define the lists of filenames corresponding to each violence type (VI and VS).
# These filenames are used to classify the data during ingestion.
vi_files = [
    "Casos_Acciones_Belicas_202503.xlsx",
    "Casos_Ataques_a_Poblaciones_202503.xlsx",
    "Casos_Atentados_Terroristas_202503.xlsx",
    "Casos_MInas_202503.xlsx",
    "Casos_Reclutamiento_ninas_ninos_U_202503.xlsx"
]

vs_files = [
    "Caso_ Danos_a_Bienes_Civiles_202503.xlsx", # Note: Check for potential extra space in filename "Caso_ Danos..."
    "Casos_Asesinatos_Selectivo_202503.xlsx",
    "Casos_Desaparicion_Forzada _202503.xlsx", # Note: Check for potential extra space in filename "Desaparicion_Forzada _"
    "Casos_Masacre_202503.xlsx",
    "Casos_Secuestro_202503.xlsx",
    "Casos_Violencia_Sexual_202503.xlsx"
]

print("Initial setup, library imports, and path configuration complete.")

Initial setup, library imports, and path configuration complete.
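Because classification depends on exact filename matches, and two of the listed names contain a possibly unintended space, it may help to confirm that every expected file is present before reading anything. The sketch below is one way to do that. The helper `find_missing_files` is hypothetical, and the temporary folder only stands in for `Data/raw`.

```python
import os
import tempfile


def find_missing_files(folder, expected_names):
    """Return the expected filenames that are not present in folder, sorted."""
    present = set(os.listdir(folder))
    return sorted(name for name in expected_names if name not in present)


# Demonstration with a temporary folder standing in for Data/raw.
with tempfile.TemporaryDirectory() as demo_folder:
    # The file on disk has no space before "_202503", unlike the expected name below.
    open(os.path.join(demo_folder, "Casos_Desaparicion_Forzada_202503.xlsx"), "w").close()
    open(os.path.join(demo_folder, "Casos_Masacre_202503.xlsx"), "w").close()

    expected = ["Casos_Desaparicion_Forzada _202503.xlsx", "Casos_Masacre_202503.xlsx"]
    missing = find_missing_files(demo_folder, expected)
    print(missing)  # ['Casos_Desaparicion_Forzada _202503.xlsx']
```

In the notebook, the same check could be run as `find_missing_files(data_folder_path, vi_files + vs_files)`. An empty list would mean every expected file was found under its exact name.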



### 2. Read and Combine Raw Data from Excel Files
This block reads every `.xlsx` file in the configured data folder, keeps only the required columns, and labels each record as 'VI' (Indiscriminate Violence) or 'VS' (Selective Violence) according to its source filename. Files that appear in neither filename list are skipped with a warning. The labeled data is then concatenated into a single pandas DataFrame named combined_df.
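The filename-based classification can be pictured as a dictionary lookup. The sketch below is an illustration under assumptions rather than the notebook's implementation: `build_type_lookup` is a hypothetical helper, and the short `demo_vi` and `demo_vs` lists stand in for the full `vi_files` and `vs_files`.

```python
def build_type_lookup(vi_names, vs_names):
    """Map each filename to 'VI' or 'VS'; reject names listed under both types."""
    overlap = set(vi_names) & set(vs_names)
    if overlap:
        raise ValueError(f"Files listed as both VI and VS: {sorted(overlap)}")
    lookup = {name: "VI" for name in vi_names}
    lookup.update({name: "VS" for name in vs_names})
    return lookup


# Shortened lists for illustration; the notebook defines the full vi_files and vs_files.
demo_vi = ["Casos_MInas_202503.xlsx"]
demo_vs = ["Casos_Masacre_202503.xlsx", "Casos_Secuestro_202503.xlsx"]
lookup = build_type_lookup(demo_vi, demo_vs)

for name in ["Casos_MInas_202503.xlsx", "Casos_Secuestro_202503.xlsx", "Otros_202503.xlsx"]:
    # dict.get returns None for a filename that is in neither list.
    print(name, "->", lookup.get(name))
```

A `None` result corresponds to the "not classified" case, where the file is skipped. The overlap check guards against a file being counted under both violence types.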

In [None]:
# 2. Read and Combine Raw Data from Excel Files

print("\n--- Reading and Combining Raw Data ---")

# Initialize an empty list to store the processed dataframes from each file.
all_dataframes = []

# Iterate through all items in the specified data folder.
for filename in os.listdir(data_folder_path):
    # Construct the full file path.
    file_path = os.path.join(data_folder_path, filename)

    # Check if the current item is a file and if it's an Excel file.
    if os.path.isfile(file_path) and filename.endswith('.xlsx'):
        # Determine the violence type from the filename before reading,
        # so unclassified files are skipped without being loaded.
        if filename in vi_files:
            violence_type = 'VI'
        elif filename in vs_files:
            violence_type = 'VS'
        else:
            print(f"Warning: File '{filename}' not classified as VI or VS. Skipping.")
            continue  # Skip this file

        print(f"Processing file: {filename}")  # Print the filename being processed

        try:
            # Read the Excel file into a pandas DataFrame.
            df = pd.read_excel(file_path)

            # Select only the required columns.
            # Use .copy() to avoid SettingWithCopyWarning later.
            df_selected = df[columns_to_select].copy()

            # Add the 'violence type' column.
            df_selected['violence type'] = violence_type

            # Append the processed DataFrame to the list.
            all_dataframes.append(df_selected)

        except Exception as e:
            # Print an error message if reading or processing a file fails
            # (for example, a KeyError when a required column is missing).
            print(f"Error processing file {filename}: {e}")

# Concatenate all dataframes in the list into a single DataFrame.
# ignore_index=True resets the index of the resulting dataframe.
if all_dataframes:
    combined_df = pd.concat(all_dataframes, ignore_index=True)

    # Display the first few rows of the combined DataFrame.
    print("\nCombined DataFrame Head:")
    print(combined_df.head())

    # Display information about the combined DataFrame (column types, non-null counts).
    print("\nCombined DataFrame Info:")
    combined_df.info()

    # Optional: Display value counts for the 'violence type' column to verify classification.
    print("\nViolence Type Counts:")
    print(combined_df['violence type'].value_counts())

    print("\nRaw data reading and combining complete.")

else:
    print("\nNo Excel files were processed or found. 'combined_df' was not created.")

