# Predicting Soldier Defection Risk using a Rule-Based System

This notebook demonstrates how to generate a synthetic dataset of soldiers and predict their likelihood of defection based on key individual and external factors. We assign weights to each feature based on its positive or negative relationship with the defection risk and use these weights to calculate a defection risk score.

## 1. Import Required Libraries

We will start by importing the necessary libraries: `numpy` for numerical operations and `pandas` for data manipulation.


In [13]:
import numpy as np
import pandas as pd

# Set random seed for reproducibility
np.random.seed(0)


## 2. Initialize the Dataset

We will create an empty DataFrame and set the number of soldiers (`n`) to 10,000. This will serve as the basis for our synthetic dataset.


In [14]:
# Number of soldiers
n = 10000

# Initialize empty DataFrame
df = pd.DataFrame()

## 3. Generate Security Clearance Level

We generate a `security_clearance_level` column with values 'low', 'medium', and 'high'. These represent the security clearance status of each soldier, with the majority having 'low' clearance.


In [15]:
# Security Clearance Level ('low', 'medium', 'high'), sampled with probabilities 0.6, 0.3, 0.1
df['security_clearance_level'] = np.random.choice(['low', 'medium', 'high'], size=n, p=[0.6, 0.3, 0.1])

# Map to numerical values
security_clearance_mapping = {'low': 1, 'medium': 2, 'high': 3}
df['security_clearance_level_num'] = df['security_clearance_level'].map(security_clearance_mapping)
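
As a quick sanity check on the sampling step above, the following minimal sketch draws a stand-alone sample and compares the empirical shares with the requested probabilities. The names `rng`, `levels`, `shares`, and `level_num` are illustrative and not part of the notebook, and the sketch uses NumPy's `default_rng` generator rather than the global seed set earlier.

```python
import numpy as np
import pandas as pd

# Illustrative stand-alone sample (not the notebook's df)
rng = np.random.default_rng(0)
levels = pd.Series(
    rng.choice(['low', 'medium', 'high'], size=10_000, p=[0.6, 0.3, 0.1])
)

# Empirical share of each clearance level; should be close to 0.6 / 0.3 / 0.1
shares = levels.value_counts(normalize=True)
print(shares.round(3))

# Ordinal encoding with the same mapping as in the cell above
level_num = levels.map({'low': 1, 'medium': 2, 'high': 3})
```

With 10,000 draws the shares should land within a couple of percentage points of the targets.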

## 4. Generate Morale Score

We create a `morale_score` column with random values between 0 and 100, representing the morale level of each soldier. Since morale surveys are mandatory, there are no missing values.


In [16]:
# Morale Score (0 to 100)
df['morale_score'] = np.random.uniform(0, 100, size=n)

## 5. Generate Family Military History

We generate a binary column `family_military_history_bin` using a binomial distribution with a 10% chance of being `1`, indicating that only a few soldiers have a family member who served in the military.


In [17]:
# Family Military History (0 or 1)
df['family_military_history_bin'] = np.random.binomial(1, 0.1, size=n)

## 6. Generate Punishment Policy

We create a column `punishment_policy` representing the perceived level of strictness in anti-defection policies. The majority of soldiers perceive the policy as 'strict'.


In [18]:
# Punishment Policy ('strict', 'lenient')
df['punishment_policy'] = np.random.choice(['strict', 'lenient'], size=n, p=[0.8, 0.2])

# Encode punishment_policy
df['punishment_policy_num'] = df['punishment_policy'].map({'strict': 0, 'lenient': 1})

## 7. Generate Regime Type

We generate a `regime_type` column set to 'personalist', indicating a regime where power is concentrated in a single leader. This type of regime typically faces higher defection rates.


In [19]:
# Regime Type ('personalist', 'party-based')
df['regime_type'] = 'personalist'  # Can change in future data
# Encode regime_type
df['regime_type_num'] = df['regime_type'].map({'personalist': 1, 'party-based': 0})

## 8. Generate Trust in Leadership Score

We create a `trust_in_leadership_score` column with random values between 0 and 100, representing the level of trust soldiers have in their leadership.


In [20]:
# Trust in Leadership Score (0 to 100)
df['trust_in_leadership_score'] = np.random.uniform(0, 100, size=n)

## 9. Generate Military Structure

The `military_structure` column is set to 'patrimonial', indicating a military deeply tied to the ruling leader through informal connections, making them more likely to remain loyal.


In [21]:
# Military Structure ('patrimonial', 'institutionalized')
df['military_structure'] = 'patrimonial'  # Can change in future data
# Encode military_structure
df['military_structure_num'] = df['military_structure'].map({'patrimonial': 0, 'institutionalized': 1})

## 10. Set Defector Capture Rate

The `defector_capture_rate` is set to a constant value of 0.8, representing the perceived rate at which defectors are captured. This rate can change in future data generations.


In [22]:
# Defector Capture Rate (0 to 1)
df['defector_capture_rate'] = 0.8  # Can change in future data

## 11. Generate Promotion Fairness Score

We create a `promotion_fairness_score` column with most values between 0 and 40, reflecting a low chance of fair promotions in a patrimonial military structure.


In [23]:
# Promotion Fairness Score (0 to 100)
df['promotion_fairness_score'] = np.where(
    np.random.rand(n) < 0.8, 
    np.random.uniform(0, 40, size=n),  # 80% between 0 and 40
    np.random.uniform(40, 100, size=n)  # 20% between 40 and 100
)
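
The cell above builds a two-component mixture: each soldier gets a draw from the low range with probability 0.8 and from the high range otherwise. A minimal sketch, using illustrative names (`rng`, `scores`, `share_below_40`, `expected_mean`) and a separate generator, checks that the resulting distribution behaves as described:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Same construction as above: pick the low range for roughly 80% of rows
low_draw = rng.uniform(0, 40, size=n)
high_draw = rng.uniform(40, 100, size=n)
scores = np.where(rng.random(n) < 0.8, low_draw, high_draw)

# Share of scores below 40 should be close to 0.8
share_below_40 = (scores < 40).mean()

# Mean of the mixture: 0.8 * midpoint(0, 40) + 0.2 * midpoint(40, 100)
expected_mean = 0.8 * 20 + 0.2 * 70

print(round(share_below_40, 3), round(scores.mean(), 2), expected_mean)
```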

## 12. Generate Communication Quality Score

The `communication_quality_score` column represents the quality of communication within the military, with random values between 0 and 100.


In [24]:
# Communication Quality Score (0 to 100)
df['communication_quality_score'] = np.random.uniform(0, 100, size=n)

## 13. Generate Opportunity Cost

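We generate the `opportunity_cost` column with random values between 0 and 5000, representing the perceived financial or personal cost of remaining in service. The value is then scaled by security clearance level (unchanged for 'low', multiplied by 0.5 for 'medium', and by 0.2 for 'high') and min-max normalized to the range 0 to 1.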


In [25]:
# Opportunity Cost
df['opportunity_cost'] = np.random.uniform(0, 5000, size=n)
# Adjust based on security clearance
df['opportunity_cost'] = df.apply(
    lambda row: row['opportunity_cost'] if row['security_clearance_level'] == 'low' else 
                row['opportunity_cost'] * 0.5 if row['security_clearance_level'] == 'medium' else 
                row['opportunity_cost'] * 0.2,
    axis=1
)
# Normalize opportunity cost
df['opportunity_cost_norm'] = (df['opportunity_cost'] - df['opportunity_cost'].min()) / (df['opportunity_cost'].max() - df['opportunity_cost'].min())
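
The row-wise `apply` above can also be written as a vectorized multiplication, which tends to be faster on large frames. The sketch below works through the adjustment and the min-max normalization on a four-row toy frame; `demo`, `multiplier`, and the `adjusted*` columns are hypothetical names chosen for illustration.

```python
import pandas as pd

# Toy data: three soldiers with the same raw cost, one with a lower cost
demo = pd.DataFrame({
    'security_clearance_level': ['low', 'medium', 'high', 'low'],
    'opportunity_cost': [4000.0, 4000.0, 4000.0, 1000.0],
})

# Clearance-dependent multiplier, equivalent to the nested conditional above
multiplier = demo['security_clearance_level'].map(
    {'low': 1.0, 'medium': 0.5, 'high': 0.2}
)
demo['adjusted'] = demo['opportunity_cost'] * multiplier

# Min-max normalization: (x - min) / (max - min)
span = demo['adjusted'].max() - demo['adjusted'].min()
demo['adjusted_norm'] = (demo['adjusted'] - demo['adjusted'].min()) / span

print(demo)
```

Note that min-max normalization depends on the sample: the smallest adjusted cost always maps to 0 and the largest to 1, whatever their absolute values.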

## 14. Generate Has Enemy Connections

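The `has_enemy_connections` column is first generated with a 10% probability of being 'Yes' and 'No' otherwise. We then set about 95% of the values to NaN at random to reflect missing data. In the binary encoding `has_enemy_connections_bin`, missing values are filled with 0.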


In [26]:
# Variable with missing values: Has Enemy Connections
df['has_enemy_connections'] = np.where(
    np.random.binomial(1, 0.1, size=n) == 1, 
    'Yes', 
    'No'
)
# Randomly set 95% to NaN
missing_mask = np.random.rand(n) < 0.95
df.loc[missing_mask, 'has_enemy_connections'] = np.nan
# Encode has_enemy_connections
df['has_enemy_connections_bin'] = df['has_enemy_connections'].map({'Yes': 1, 'No': 0})
df['has_enemy_connections_bin'] = df['has_enemy_connections_bin'].fillna(0)
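
Filling missing values with 0 makes "unknown" indistinguishable from "No" in the encoded column. The sketch below reproduces the masking on a stand-alone series and shows one possible way to keep that information, a separate missing-value indicator. The indicator `was_missing` is an assumption for illustration and is not part of the notebook's dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# 'Yes' with probability 0.1, then mask roughly 95% of the answers
answers = pd.Series(
    np.where(rng.binomial(1, 0.1, size=n) == 1, 'Yes', 'No'), dtype='object'
)
answers[rng.random(n) < 0.95] = np.nan

missing_share = answers.isna().mean()

# Encode, record which rows were missing, then fill as in the cell above
encoded = answers.map({'Yes': 1, 'No': 0})
was_missing = encoded.isna().astype(int)  # hypothetical indicator column
filled = encoded.fillna(0)

# After filling, only observed 'Yes' answers remain as 1
yes_share_after_fill = filled.mean()
print(round(missing_share, 3), round(yes_share_after_fill, 4))
```

Because only about 5% of rows are observed and about 10% of those are 'Yes', the filled column is 1 for roughly 0.5% of soldiers.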

## 15. Assign Feature Weights and Normalize Negative Relationship Features

We assign a weight to each feature: a positive weight means the feature raises the defection risk score, a negative weight means it lowers it, and the magnitude reflects the feature's importance. Before weighting, the 0 to 100 scores are divided by 100 so that they fall between 0 and 1. The same cell also adds a `time_of_measurement` column, which is not used in the risk score.



In [27]:
# Normalize features where necessary
df['morale_score_norm'] = df['morale_score'] / 100
df['trust_in_leadership_score_norm'] = df['trust_in_leadership_score'] / 100
df['promotion_fairness_score_norm'] = df['promotion_fairness_score'] / 100
df['communication_quality_score_norm'] = df['communication_quality_score'] / 100
df['defector_capture_rate_norm'] = df['defector_capture_rate']  # Already between 0 and 1

# Time of Data Measurement
df['time_of_measurement'] = np.random.choice(['start of the war', 'middle of the war', 'end of the war'], size=n, p=[0.3, 0.4, 0.3])

# Assign Weights
weights = {
    'opportunity_cost_norm': 0.4,
    'family_military_history_bin': -0.2,
    'security_clearance_level_num': 0.3,
    'has_enemy_connections_bin': 0.5,
    'morale_score_norm': -0.3,
    'trust_in_leadership_score_norm': -0.3,
    'promotion_fairness_score_norm': -0.2,
    'communication_quality_score_norm': -0.1,
    'defector_capture_rate_norm': -0.4,
    'punishment_policy_num': 0.2,
    'regime_type_num': 0.3,
    'military_structure_num': 0.2,
}
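
Since every weighted input is scaled to lie between 0 and 1, the weights imply loose bounds on the risk score: the sum of the positive weights is an upper bound and the sum of the negative weights is a lower bound. The sketch below computes both; `upper_bound` and `lower_bound` are illustrative names. The bounds are not tight for this dataset, because some inputs are constant (for example `defector_capture_rate_norm` is always 0.8).

```python
# Same weights as in the cell above
weights = {
    'opportunity_cost_norm': 0.4,
    'family_military_history_bin': -0.2,
    'security_clearance_level_num': 0.3,
    'has_enemy_connections_bin': 0.5,
    'morale_score_norm': -0.3,
    'trust_in_leadership_score_norm': -0.3,
    'promotion_fairness_score_norm': -0.2,
    'communication_quality_score_norm': -0.1,
    'defector_capture_rate_norm': -0.4,
    'punishment_policy_num': 0.2,
    'regime_type_num': 0.3,
    'military_structure_num': 0.2,
}

# Loose bounds, assuming each feature can take any value in [0, 1]
upper_bound = sum(w for w in weights.values() if w > 0)
lower_bound = sum(w for w in weights.values() if w < 0)

print(round(lower_bound, 2), round(upper_bound, 2))
```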

## 17. Calculate Defection Risk Score

Using the assigned weights, we calculate a defection risk score for each soldier by summing the weighted values of each feature.


In [28]:
# Calculate Defection Risk Score
df['defection_risk_score'] = (
    weights['opportunity_cost_norm'] * df['opportunity_cost_norm'] +
    weights['family_military_history_bin'] * df['family_military_history_bin'] +
    weights['security_clearance_level_num'] * (df['security_clearance_level_num'] / df['security_clearance_level_num'].max()) +
    weights['has_enemy_connections_bin'] * df['has_enemy_connections_bin'] +
    weights['morale_score_norm'] * df['morale_score_norm'] +
    weights['trust_in_leadership_score_norm'] * df['trust_in_leadership_score_norm'] +
    weights['promotion_fairness_score_norm'] * df['promotion_fairness_score_norm'] +
    weights['communication_quality_score_norm'] * df['communication_quality_score_norm'] +
    weights['defector_capture_rate_norm'] * df['defector_capture_rate_norm'] +
    weights['punishment_policy_num'] * df['punishment_policy_num'] +
    weights['regime_type_num'] * df['regime_type_num'] +
    weights['military_structure_num'] * df['military_structure_num']
)
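
The weighted sum above can be expressed more compactly by multiplying the feature columns by a weight vector and summing across each row. The following sketch shows the idea on a two-row toy frame with three of the features; `demo`, `w`, and `score` are hypothetical names, and the clearance column is left out because it needs the extra division by its maximum.

```python
import pandas as pd

# Subset of the weights, for illustration only
w = pd.Series({
    'opportunity_cost_norm': 0.4,
    'morale_score_norm': -0.3,
    'has_enemy_connections_bin': 0.5,
})

demo = pd.DataFrame({
    'opportunity_cost_norm': [0.5, 1.0],
    'morale_score_norm': [0.8, 0.1],
    'has_enemy_connections_bin': [0, 1],
})

# Multiply each column by its weight (aligned on column names), then sum per row
score = demo[w.index].mul(w, axis='columns').sum(axis=1)
print(score)
```

The first row works out to 0.4 * 0.5 - 0.3 * 0.8 + 0.5 * 0 = -0.04, and the second to 0.4 * 1.0 - 0.3 * 0.1 + 0.5 * 1 = 0.87.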

## 18. Determine Defection Threshold

We set the threshold to the median defection risk score. Soldiers with a score strictly above this threshold are labeled as likely to defect, so roughly half of the dataset receives the label 'yes'.


In [29]:
# Determine Threshold for Defection
threshold = df['defection_risk_score'].median()

# Create 'will_defect' Column
df['will_defect'] = np.where(df['defection_risk_score'] > threshold, 'yes', 'no')
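
Using the median fixes the share of positive labels at about 50%, regardless of how the scores are distributed. The sketch below illustrates this on six toy scores and, as an assumed alternative that the notebook does not use, shows how a higher quantile would produce a smaller positive class. The names `scores`, `q_threshold`, and `labels_q` are illustrative.

```python
import numpy as np
import pandas as pd

scores = pd.Series([-0.4, -0.1, 0.0, 0.2, 0.5, 0.9])

# Median threshold, as in the cell above
threshold = scores.median()
labels = np.where(scores > threshold, 'yes', 'no')
positive_share = (labels == 'yes').mean()

# Hypothetical alternative: label only scores above the 75th percentile
q_threshold = scores.quantile(0.75)
labels_q = np.where(scores > q_threshold, 'yes', 'no')

print(threshold, positive_share, q_threshold, (labels_q == 'yes').sum())
```

Which threshold is appropriate depends on the base rate of defection one wants the synthetic data to reflect.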

## 20. Save the Dataset

We save the full generated dataset, including the intermediate encoded and normalized columns, to a CSV file for further analysis or modeling.


In [30]:
# Save dataset to CSV file
df.to_csv('soldier_defection_dataset_final.csv', index=False)

df.head()

| | security_clearance_level | security_clearance_level_num | morale_score | family_military_history_bin | punishment_policy | punishment_policy_num | regime_type | regime_type_num | trust_in_leadership_score | military_structure | ... | has_enemy_connections | has_enemy_connections_bin | morale_score_norm | trust_in_leadership_score_norm | promotion_fairness_score_norm | communication_quality_score_norm | defector_capture_rate_norm | time_of_measurement | defection_risk_score | will_defect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | low | 1 | 74.826798 | 0 | strict | 0 | personalist | 1 | 36.925632 | patrimonial | ... | NaN | 0.0 | 0.748268 | 0.369256 | 0.371182 | 0.011097 | 0.8 | middle of the war | -0.254191 | no |
| 1 | medium | 2 | 18.020271 | 0 | strict | 0 | personalist | 1 | 21.1326 | patrimonial | ... | NaN | 0.0 | 0.180203 | 0.211326 | 0.361175 | 0.00177 | 0.8 | start of the war | 0.00178 | yes |
| 2 | medium | 2 | 38.902314 | 1 | strict | 0 | personalist | 1 | 47.690477 | patrimonial | ... | NaN | 0.0 | 0.389023 | 0.476905 | 0.83325 | 0.155055 | 0.8 | end of the war | -0.374606 | no |
| 3 | low | 1 | 3.760018 | 0 | lenient | 1 | personalist | 1 | 8.223436 | patrimonial | ... | NaN | 0.0 | 0.0376 | 0.082234 | 0.204322 | 0.316761 | 0.8 | middle of the war | 0.459662 | yes |
| 4 | low | 1 | 1.178774 | 1 | strict | 0 | personalist | 1 | 23.765937 | patrimonial | ... | NaN | 0.0 | 0.011788 | 0.237659 | 0.23328 | 0.651845 | 0.8 | middle of the war | 0.023138 | yes |


## Remove Normalized Columns to Produce the Final Dataset

We now drop the normalized and numerically encoded columns, along with `time_of_measurement` and `defection_risk_score`, and keep only the raw features and the `will_defect` label. The result is the final dataset on which a model for predicting soldier defection can be trained and tested.

In [31]:
# Define the columns you want to keep
columns_to_keep = [
    'security_clearance_level', 'morale_score', 'family_military_history_bin',
    'punishment_policy', 'regime_type', 'trust_in_leadership_score',
    'military_structure', 'defector_capture_rate',
    'promotion_fairness_score', 'communication_quality_score',
    'opportunity_cost', 'has_enemy_connections', 'will_defect'
]

# Select only the columns you want to keep
df_cleaned = df[columns_to_keep]

# Save the cleaned dataset to a new CSV file
df_cleaned.to_csv('final_submission.csv', index=False)

# Display the first few rows of the cleaned DataFrame
df_cleaned.head()

| | security_clearance_level | morale_score | family_military_history_bin | punishment_policy | regime_type | trust_in_leadership_score | military_structure | defector_capture_rate | promotion_fairness_score | communication_quality_score | opportunity_cost | has_enemy_connections | will_defect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | low | 74.826798 | 0 | strict | personalist | 36.925632 | patrimonial | 0.8 | 37.118191 | 1.109675 | 955.619535 | NaN | no |
| 1 | medium | 18.020271 | 0 | strict | personalist | 21.1326 | patrimonial | 0.8 | 36.117479 | 0.176979 | 146.369162 | NaN | yes |
| 2 | medium | 38.902314 | 1 | strict | personalist | 47.690477 | patrimonial | 0.8 | 83.325013 | 15.505512 | 1092.026293 | NaN | no |
| 3 | low | 3.760018 | 0 | lenient | personalist | 8.223436 | patrimonial | 0.8 | 20.432228 | 31.676115 | 3601.483238 | NaN | yes |
| 4 | low | 1.178774 | 1 | strict | personalist | 23.765937 | patrimonial | 0.8 | 23.328005 | 65.184489 | 4122.058623 | NaN | yes |
