# Preprocessing Step: Synthetic Data Generation for the Global Terrorism Project

## Overview
This preprocessing step generates synthetic data for the Global Terrorism project. The synthetic data mimics the structure and characteristics of the Global Terrorism Database (GTD) and provides a foundation for testing and validating the project's preprocessing, analysis, and machine learning models.

## Purpose
Creating synthetic data allows us to:
- Validate the project's data processing pipelines without handling real-world sensitive information.
- Test and validate the improvement made by SingleStoreDB in using Shared Keys and 

## Data Generation Process
The Python script below uses Pandas, NumPy, and the standard-library `random` module to generate new rows from an existing GTD-style CSV file and append them to it. The main fields are:

- **Event Details**: `eventid` is a generated identifier for each synthetic incident; `iyear`, `imonth`, and `iday` are sampled from the corresponding columns of the original data.
- **Categorical Fields**: Categorical variables such as `country_txt`, `region_txt`, `attacktype1_txt`, and `targtype1_txt` are drawn from the unique values found in the original dataset, so they mirror real-world values in the GTD.
- **Numerical Fields**: Count fields such as `nkill` and `nwound` are drawn from Poisson distributions whose rate is the column mean in the original data. Monetary fields such as `propvalue` are drawn from a normal distribution with the column's mean and standard deviation.
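
The sampling scheme described above can be sketched on a toy frame. This is a minimal illustration, not the project's script: the helper `make_synthetic_rows`, the toy values, and the fixed seed are hypothetical, and it uses `np.random.default_rng` instead of the global `np.random` and `random` calls in the notebook cell.

```python
import numpy as np
import pandas as pd


def make_synthetic_rows(original: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Draw n_rows synthetic rows from the empirical values of `original` (hypothetical helper)."""
    rng = np.random.default_rng(seed)

    # Missing or negative values would distort the mean used as the Poisson rate.
    nkill = original["nkill"].fillna(0).clip(lower=0)
    propvalue = original["propvalue"].fillna(0).clip(lower=0)

    return pd.DataFrame({
        # Categorical field: uniform draw over the distinct values seen in the data.
        "country_txt": rng.choice(original["country_txt"].dropna().unique(), size=n_rows),
        # Date field: draw from the column itself, which preserves observed frequencies.
        "iyear": rng.choice(original["iyear"].to_numpy(), size=n_rows),
        # Count field: Poisson with the column mean as the rate.
        "nkill": rng.poisson(lam=nkill.mean(), size=n_rows),
        # Monetary field: normal draw, clipped at 0 so no negative amounts appear.
        "propvalue": np.clip(
            rng.normal(propvalue.mean(), propvalue.std(), n_rows), 0, None
        ).astype(int),
    })


# Toy stand-in for the real CSV (values are made up).
toy = pd.DataFrame({
    "country_txt": ["Iraq", "Peru", "Iraq", "India"],
    "iyear": [1990, 2001, 2014, 2014],
    "nkill": [0, 3, np.nan, 1],
    "propvalue": [0, 5000, np.nan, 120],
})

synthetic = make_synthetic_rows(toy, n_rows=5)
print(synthetic)
```

Drawing from the unique values gives every category the same weight, while drawing from the full column keeps the frequencies of the original data. Which one is appropriate depends on whether the test data needs realistic category proportions.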

## Output
The synthetic rows are appended to the original data, and the combined dataset is saved as a CSV file (`expanded_global_terrorism_data.csv`), which can then be loaded and used in subsequent steps of the project for testing and model development.

In [6]:
import pandas as pd
import numpy as np
import random

# Load the existing dataset
file_path = input('Please enter your csv file you wish to expand: ')
original_df = pd.read_csv(file_path)

# Number of new synthetic rows to add
n_new_rows = 1000000

# Fill missing values in the numerical columns with 0 and clamp negatives to 0,
# so that the column means are valid (non-negative) Poisson rates
numeric_columns = ['nkill', 'nwound', 'nkillus', 'nkillter', 'nwoundus', 'nwoundte',
                   'nhostkid', 'ransomamt', 'nreleased']
for col in numeric_columns + ['propvalue']:
    original_df[col] = original_df[col].fillna(0).clip(lower=0)

# Print the column means to verify they are non-negative
for col in numeric_columns:
    print(f"{col} mean:", original_df[col].mean())

# Create possible values for synthetic data generation based on your original dataset
countries = original_df['country_txt'].unique()
regions = original_df['region_txt'].unique()
attack_types = original_df['attacktype1_txt'].unique()
target_types = original_df['targtype1_txt'].unique()
weapon_types = original_df['weaptype1_txt'].unique()
success_values = [0, 1]
yes_no = [0, 1]

# Generate synthetic data 
synthetic_data = {
    'eventid': [f"EVT{str(i).zfill(5)}" for i in range(len(original_df), len(original_df) + n_new_rows)],
    'iyear': np.random.choice(original_df['iyear'], n_new_rows),
    'imonth': np.random.choice(original_df['imonth'], n_new_rows),
    'iday': np.random.choice(original_df['iday'], n_new_rows),
    'approxdate': [None] * n_new_rows,
    'extended': random.choices(yes_no, k=n_new_rows),
    'resolution': [None] * n_new_rows,
    'country': np.random.choice(original_df['country'], n_new_rows),
    'country_txt': random.choices(countries, k=n_new_rows),
    'region': np.random.choice(original_df['region'], n_new_rows),
    'region_txt': random.choices(regions, k=n_new_rows),
    'provstate': np.random.choice(original_df['provstate'], n_new_rows),
    'city': np.random.choice(original_df['city'], n_new_rows),
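    # Clip the normal draws so coordinates stay within valid geographic bounds
    'latitude': np.clip(np.random.normal(original_df['latitude'].mean(), original_df['latitude'].std(), n_new_rows), -90, 90),
    'longitude': np.clip(np.random.normal(original_df['longitude'].mean(), original_df['longitude'].std(), n_new_rows), -180, 180),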
    'specificity': np.random.choice(original_df['specificity'], n_new_rows),
    'vicinity': random.choices(yes_no, k=n_new_rows),
    'location': np.random.choice(original_df['location'], n_new_rows),
    'summary': ['Synthetic summary text.'] * n_new_rows,
    'crit1': random.choices(yes_no, k=n_new_rows),
    'crit2': random.choices(yes_no, k=n_new_rows),
    'crit3': random.choices(yes_no, k=n_new_rows),
    'doubtterr': random.choices(yes_no, k=n_new_rows),
    'alternative': random.choices([None, 1, 2], k=n_new_rows),
    'alternative_txt': [None] * n_new_rows,
    'multiple': random.choices(yes_no, k=n_new_rows),
    'success': random.choices(success_values, k=n_new_rows),
    'suicide': random.choices(yes_no, k=n_new_rows),
    'attacktype1': np.random.choice(original_df['attacktype1'], n_new_rows),
    'attacktype1_txt': random.choices(attack_types, k=n_new_rows),
    'targtype1': np.random.choice(original_df['targtype1'], n_new_rows),
    'targtype1_txt': random.choices(target_types, k=n_new_rows),
    'weaptype1': np.random.choice(original_df['weaptype1'], n_new_rows),
    'weaptype1_txt': random.choices(weapon_types, k=n_new_rows),
    'nkill': np.random.poisson(lam=original_df['nkill'].mean(), size=n_new_rows),
    'nkillus': np.random.poisson(lam=original_df['nkillus'].mean(), size=n_new_rows),
    'nkillter': np.random.poisson(lam=original_df['nkillter'].mean(), size=n_new_rows),
    'nwound': np.random.poisson(lam=original_df['nwound'].mean(), size=n_new_rows),
    'nwoundus': np.random.poisson(lam=original_df['nwoundus'].mean(), size=n_new_rows),
    'nwoundte': np.random.poisson(lam=original_df['nwoundte'].mean(), size=n_new_rows),
    'property': random.choices(yes_no, k=n_new_rows),
    'propextent': np.random.choice(original_df['propextent'], n_new_rows),
    'propextent_txt': np.random.choice(original_df['propextent_txt'], n_new_rows),
    # Normal draws can be negative, so clip at 0 before converting to integers
    'propvalue': np.clip(np.random.normal(original_df['propvalue'].mean(), original_df['propvalue'].std(), n_new_rows), 0, None).astype(int),
    'ishostkid': random.choices(yes_no, k=n_new_rows),
    'nhostkid': np.random.poisson(lam=original_df['nhostkid'].mean(), size=n_new_rows),
    'ransom': random.choices(yes_no, k=n_new_rows),
    'ransomamt': np.clip(np.random.normal(original_df['ransomamt'].mean(), original_df['ransomamt'].std(), n_new_rows), 0, None).astype(int),
    'hostkidoutcome': np.random.choice(original_df['hostkidoutcome'], n_new_rows),
    'nreleased': np.random.poisson(lam=original_df['nreleased'].mean(), size=n_new_rows),
}

# Create synthetic DataFrame
synthetic_df = pd.DataFrame(synthetic_data)

# Append synthetic data to the original data
expanded_df = pd.concat([original_df, synthetic_df], ignore_index=True)

# Save the expanded dataset
expanded_df.to_csv('/Users/gabrielfuentes/Project2024_Repo/SingleStoreDB Sample Project/data/processed/expanded_global_terrorism_data.csv', index=False)
print("Synthetic data appended and saved as 'expanded_global_terrorism_data.csv'")


nkill mean: 2.266900036326409
nwound mean: 2.8833316821329107
nkillus mean: 0.02967207159605033
nkillter mean: 0.320833746133439
nwoundus mean: 0.025076230419514987
nwoundte mean: 0.06638376099424281
nhostkid mean: 1.0126702112435741
ransomamt mean: 23573.307391268445
nreleased mean: 0.2793225675065773
Synthetic data appended and saved as 'expanded_global_terrorism_data.csv'


## Modifying `eventid` Column to Sequential Values

### Background
Loading the dataset into SingleStoreDB failed with errors on the `eventid` column. The error message indicated that some values in this column were "out of range."

### Solution
To address this, we replace the `eventid` column with sequential integers running from 1 to the total number of rows in the dataset. The original identifiers are discarded in the process. This approach:
1. Ensures the `eventid` values fit within standard integer ranges, eliminating "out of range" errors during loading.
2. Keeps a unique identifier for each event without affecting the other columns used in analysis.
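
One plausible reading of the "out of range" error can be checked with simple arithmetic. The sketch below assumes the `eventid` column was declared as a signed 32-bit `INT`, which the notebook does not state. The 12-digit sample ID (in the style of GTD event IDs) and the row count are illustrative values, and `fits_int32` is a hypothetical helper.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1  # 2147483647


def fits_int32(value: int) -> bool:
    """Return True if `value` can be stored in a signed 32-bit integer column."""
    return INT32_MIN <= value <= INT32_MAX


gtd_style_id = 197000000001   # illustrative 12-digit ID
sequential_id = 1_200_000     # hypothetical row count after expansion

print(fits_int32(gtd_style_id))   # False: larger than 2147483647
print(fits_int32(sequential_id))  # True
```

Under that assumption, sequential IDs stay far below the limit even for a dataset with millions of rows. Declaring the column as a 64-bit integer would be the alternative if the original identifiers had to be preserved.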

### Script Overview
The following script will:
1. Load the CSV file into a DataFrame.
2. Replace the `eventid` column with sequential integers, starting from 1.
3. Save the updated dataset to a new CSV file, which can be loaded into SingleStoreDB without causing range-related errors in the `eventid` column.



In [2]:
import pandas as pd

# Load the CSV file
file_path = '/Users/gabrielfuentes/Project2024_Repo/SingleStoreDB Sample Project/data/processed/expanded_global_terrorism_data.csv' 
df = pd.read_csv(file_path, encoding='utf-8', low_memory=False)

# Replace the 'eventid' column with sequential numbers
df['eventid'] = range(1, len(df) + 1)

# Save the updated DataFrame to a new CSV file
output_path = '/Users/gabrielfuentes/Project2024_Repo/SingleStoreDB Sample Project/data/processed/expanded_global_terrorism_data_sequential_eventid.csv'
df.to_csv(output_path, index=False, encoding='utf-8')

print(f"File saved as {output_path}")


File saved as /Users/gabrielfuentes/Project2024_Repo/SingleStoreDB Sample Project/data/processed/expanded_global_terrorism_data_sequential_eventid.csv
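
Before loading the new file, a quick sanity check on the renumbered column may be worthwhile. The following is a minimal sketch on a toy frame. `has_sequential_ids` is a hypothetical helper and the sample values are made up; on the real data the same function could be applied to `df`.

```python
import numpy as np
import pandas as pd


def has_sequential_ids(frame: pd.DataFrame, column: str = "eventid") -> bool:
    """Return True if `column` holds exactly 1, 2, ..., len(frame) in row order."""
    expected = np.arange(1, len(frame) + 1)
    return bool((frame[column].to_numpy() == expected).all())


# Toy frame mixing ID styles, as happens after appending synthetic rows.
toy = pd.DataFrame({
    "eventid": [197000000001, "EVT00002", "EVT00003"],
    "nkill": [0, 2, 1],
})

# Same renumbering as in the cell above.
toy["eventid"] = range(1, len(toy) + 1)

print(has_sequential_ids(toy))
print(toy["eventid"].is_unique)
```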
