# FBiH Cleaning (and some feature engineering)

Now that all the data is combined, I will decide what the final dataset should look like for my paper.

In [1]:
import pandas as pd
import numpy as np

from tools.config import SAVE_DIR

Load the combined data.

In [2]:
combined = pd.read_excel(SAVE_DIR + "combined.xlsx")
# combined

## 1. Timeline from 2012 - 2022

I will include 2010 and 2011 for now because I want to impute the 2012 values using the 2010 voting data.

In [3]:
combined = combined[combined["Year"].between(2010, 2022)].copy()

## 2. Create Independent Variables

### 2.1 Ethnic Fragmentation Constant

First, normalize the `percentage_bosnian_votes` and `percentage_croat_votes` columns so their values lie between 0 and 1 inclusive.

In [4]:
combined["percentage_bosnian_votes"] = combined["percentage_bosnian_votes"] / 100
combined["percentage_croat_votes"] = combined["percentage_croat_votes"] / 100

The 2013 census data should be expressed as percentages rather than raw counts.

In [5]:
combined["Total Ethnicity"] = combined["Bosniak"] + combined["Croat"] + combined["Serb"]

In [6]:
combined["Percentage Bosniak"] = combined["Bosniak"] / combined["Total Ethnicity"]
combined["Percentage Croat"] = combined["Croat"] / combined["Total Ethnicity"]
combined["Percentage Serb"] = combined["Serb"] / combined["Total Ethnicity"]

In [7]:
combined = combined.drop(columns=["Total Ethnicity", "Bosniak", "Croat", "Serb"])

In [8]:
# combined

Define a Herfindahl-Hirschman Index (HHI) function, the sum of squared shares, to transform the ethnic and voting percentage data into new variables.

In [9]:
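def calculate_ethnic_hhi(row):
    """Return the HHI (sum of squared shares) of the three census ethnicity shares."""
    percentages = [
        row['Percentage Bosniak'],
        row['Percentage Croat'],
        row['Percentage Serb']
    ]

    # Rows with a missing share stay NaN; they are filled per municipality later.
    if any(pd.isna(p) for p in percentages):
        return np.nan

    # The three shares sum to 1, so the HHI lies between 1/3 and 1.
    return sum(p**2 for p in percentages)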
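
For reference, the same sum of squared shares can be computed without a row-wise `apply`. This is only a minimal sketch on made-up shares (the `toy` frame and its numbers are hypothetical, not census values); it assumes the shares are already on a 0 to 1 scale.

```python
import numpy as np
import pandas as pd

# Hypothetical shares for three municipalities (not real census values).
toy = pd.DataFrame(
    {
        "Percentage Bosniak": [0.5, 1.0, np.nan],
        "Percentage Croat": [0.3, 0.0, 0.6],
        "Percentage Serb": [0.2, 0.0, 0.4],
    }
)

share_cols = ["Percentage Bosniak", "Percentage Croat", "Percentage Serb"]

# Square every share and sum across each row. skipna=False keeps the result
# NaN whenever any share is missing, mirroring the early return in the function.
toy["hhi"] = (toy[share_cols] ** 2).sum(axis=1, skipna=False)

print(toy["hhi"])
```

The first row mixes three groups and scores 0.25 + 0.09 + 0.04 = 0.38, the second is fully concentrated and scores 1, and the third stays missing because one share is NaN.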

Calculate the HHI from the 2013 census shares.

In [10]:
combined['ethnic_concentration_hhi'] = combined.apply(calculate_ethnic_hhi, axis=1)

Impute the variable as a constant for each municipality across all years.

In [11]:
combined['ethnic_concentration_hhi'] = combined.groupby('Municipality')['ethnic_concentration_hhi'].transform(lambda x: x.ffill().bfill())
combined = combined.drop(columns=['Percentage Bosniak', 'Percentage Croat', 'Percentage Serb'])
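
To illustrate what the group-wise fill does, here is a small sketch on made-up data (the `toy` frame, the `hhi_const` column name, and the values are hypothetical). Each municipality has one observed value, and `ffill().bfill()` spreads it to the years before and after within that municipality only.

```python
import numpy as np
import pandas as pd

# Hypothetical panel: one observed HHI per municipality, in 2013.
toy = pd.DataFrame(
    {
        "Municipality": ["a", "a", "a", "b", "b", "b"],
        "Year": [2012, 2013, 2014, 2012, 2013, 2014],
        "hhi": [np.nan, 0.9, np.nan, np.nan, 0.6, np.nan],
    }
)

# transform keeps the original index, so the result aligns row by row.
# ffill covers later years, bfill covers earlier years, both within a group.
toy["hhi_const"] = toy.groupby("Municipality")["hhi"].transform(
    lambda s: s.ffill().bfill()
)

print(toy)
```

Because the fill happens inside `groupby`, a value from municipality `a` never leaks into municipality `b`.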

### 2.2 Voting Fragmentation Variable

First, linearly interpolate the vote shares for years that did not have elections.

In [12]:
combined = combined.sort_values(by=['Municipality', 'Year'])
cols_to_interpolate = ['percentage_bosnian_votes', 'percentage_croat_votes']
combined[cols_to_interpolate] = combined.groupby('Municipality')[cols_to_interpolate].transform(lambda x: x.interpolate(method='linear'))
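
As a quick check of what the interpolation produces, the sketch below runs the same pattern on made-up vote shares (the `toy` frame, the `share` column, and the election years are assumptions for illustration). Values are observed in the first and last year of each group and the gaps are filled on a straight line.

```python
import numpy as np
import pandas as pd

# Hypothetical vote shares observed only in 2010 and 2014.
toy = pd.DataFrame(
    {
        "Municipality": ["a"] * 5 + ["b"] * 5,
        "Year": list(range(2010, 2015)) * 2,
        "share": [0.40, np.nan, np.nan, np.nan, 0.60,
                  0.80, np.nan, np.nan, np.nan, 0.40],
    }
)

# Sorting first matters: interpolation follows row order within each group.
toy = toy.sort_values(["Municipality", "Year"])

toy["share_interp"] = toy.groupby("Municipality")["share"].transform(
    lambda s: s.interpolate(method="linear")
)

print(toy)
```

Municipality `a` rises in equal steps of 0.05 per year and municipality `b` falls in steps of 0.10, with no values carried across the group boundary.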

Now create an HHI variable for political fragmentation. Because HHI increases with concentration, higher values indicate less fragmentation.

In [13]:
def calculate_political_hhi(row):
    """Return the HHI (sum of squared shares) of the two vote-share columns."""
    percentages = [
        row['percentage_bosnian_votes'],
        row['percentage_croat_votes'],
    ]

    # Propagate missing values instead of computing a partial score.
    if any(pd.isna(p) for p in percentages):
        return np.nan

    return sum(p**2 for p in percentages)

In [14]:
combined['political_fragmentation_hhi'] = combined.apply(calculate_political_hhi, axis=1)
combined = combined.drop(columns=['percentage_bosnian_votes', 'percentage_croat_votes'])

Restrict the timeline to 2012 through 2022.

In [None]:
combined = combined[combined["Year"].between(2012, 2022)].copy()
# combined

## 3. Percentage of Agricultural Entities

Used as a proxy for development and education, based on the assumption that areas with a higher share of agricultural businesses have less need for workers with high human capital.

This is used instead of a direct human capital variable because such data are hard to find at the municipal level.

In [None]:
combined["Percentage of Agricultural Businesses"] = combined["A-Agriculture, forestry and fishing"] / combined["Total"]

Now drop the other sector columns.

In [35]:
start_col = '00-Unclassified according to activities CEA 1)'
end_col = 'Total'

start_index = combined.columns.get_loc(start_col)
end_index = combined.columns.get_loc(end_col)

cols_to_drop = combined.columns[start_index : end_index + 1]

combined = combined.drop(columns=cols_to_drop)
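
The same contiguous block of columns can also be selected with a label slice. This is a sketch on a made-up frame (the `toy` frame and the `sector_*` column names are hypothetical); it relies on `.loc` label slices including both endpoints.

```python
import pandas as pd

# Hypothetical frame with a block of sector columns ending in "Total".
toy = pd.DataFrame(
    {
        "Municipality": ["a"],
        "sector_00": [1],
        "sector_A": [2],
        "Total": [3],
        "Employees": [10],
    }
)

# Label-based slicing with .loc includes both the start and end column.
cols_to_drop = toy.loc[:, "sector_00":"Total"].columns

toy = toy.drop(columns=cols_to_drop)

print(list(toy.columns))
```

Either approach depends on the sector columns being adjacent in the frame, so it is worth checking the column order before dropping.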

In [38]:
combined = combined[
    [
        "Municipality", 
        "Year", 
        "Gross Average Wage", 
        "ethnic_concentration_hhi", 
        "political_fragmentation_hhi", 
        "Percentage of Agricultural Businesses", 
        "Employees"
    ]
]
combined

,Municipality,Year,Gross Average Wage,ethnic_concentration_hhi,political_fragmentation_hhi,Percentage of Agricultural Businesses,Employees
564,banovici,2012,1220.0,0.953261,0.706868,0.019355,5056.0
643,banovici,2013,1248.0,0.953261,0.827413,0.022340,5214.0
722,banovici,2014,1261.0,0.953261,0.975511,0.023256,5167.0
801,banovici,2015,1270.0,0.953261,0.917824,0.025316,5230.0
880,banovici,2016,1276.0,0.953261,0.863865,0.025000,5169.0
...,...,...,...,...,...,...,...
1050,zivinice,2018,1187.0,0.905964,0.581126,0.038872,10137.0
1129,zivinice,2019,1185.0,0.905964,0.585950,0.036850,10911.0
1208,zivinice,2020,1208.0,0.905964,0.590914,0.035323,10916.0
1287,zivinice,2021,1241.0,0.905964,0.596017,0.030917,10929.0


## 4. Clean up remaining data and adjust datatypes

Sanity check: the dataset should contain exactly 79 municipalities.

In [56]:
len(combined["Municipality"].unique())

79

Same sanity check for years, also confirming that `Year` is stored as an integer for Stata.

In [59]:
len(combined["Year"].unique()), combined["Year"].dtype

(11, dtype('int64'))
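
Counting unique municipalities and years does not by itself show that every municipality has every year. A possible extra check, sketched here on made-up data (the `toy` frame and the variable names are hypothetical), compares the per-municipality year count with the overall number of years and looks for duplicate municipality-year pairs.

```python
import pandas as pd

# Hypothetical two-municipality, two-year panel.
toy = pd.DataFrame(
    {
        "Municipality": ["a", "a", "b", "b"],
        "Year": [2012, 2013, 2012, 2013],
    }
)

# Every municipality should cover the same number of distinct years.
years_per_municipality = toy.groupby("Municipality")["Year"].nunique()
is_balanced = (years_per_municipality == toy["Year"].nunique()).all()

# No municipality-year pair should appear more than once.
has_duplicates = toy.duplicated(subset=["Municipality", "Year"]).any()

print(is_balanced, has_duplicates)
```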

The only NaN in the `Gross Average Wage` column was for dobretici in 2006, which falls outside the 2012 to 2022 window. The following cell checks whether the column contains any values with a fractional part (anything other than .0) before casting to integers.

In [47]:
decimal_count = (combined['Gross Average Wage'] % 1 != 0).sum()

print(f"Number of values with decimals: {decimal_count}")

Number of values with decimals: 0


In [48]:
combined['Gross Average Wage'] = combined['Gross Average Wage'].astype('int64')

The only NaN in the `Employees` column was for dobretici in 2006, which falls outside the 2012 to 2022 window. The following cell checks whether the column contains any values with a fractional part (anything other than .0) before casting to integers.

In [50]:
decimal_count = (combined['Employees'] % 1 != 0).sum()

print(f"Number of values with decimals: {decimal_count}")

Number of values with decimals: 0


In [51]:
combined['Employees'] = combined['Employees'].astype('int64')
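
The check-then-cast pattern used for both columns could be wrapped in one function. This is a sketch only: `to_int_if_whole` is a hypothetical helper, not part of the notebook or of pandas, and the `toy` values are made up. It refuses to cast when a column has missing or fractional values.

```python
import pandas as pd


def to_int_if_whole(series: pd.Series) -> pd.Series:
    """Cast to int64 only if every value is present and whole (hypothetical helper)."""
    if series.isna().any():
        raise ValueError(f"{series.name}: contains missing values")
    if (series % 1 != 0).any():
        raise ValueError(f"{series.name}: contains fractional values")
    return series.astype("int64")


# Hypothetical values: Employees is whole, Wage has a fractional entry.
toy = pd.DataFrame({"Employees": [5056.0, 5214.0], "Wage": [1220.0, 1248.5]})

toy["Employees"] = to_int_if_whole(toy["Employees"])

try:
    to_int_if_whole(toy["Wage"])
except ValueError as err:
    print(err)
```

Raising instead of printing a count means a later re-run on updated data cannot silently truncate values.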

In [52]:
combined

,Municipality,Year,Gross Average Wage,ethnic_concentration_hhi,political_fragmentation_hhi,Percentage of Agricultural Businesses,Employees
564,banovici,2012,1220,0.953261,0.706868,0.019355,5056
643,banovici,2013,1248,0.953261,0.827413,0.022340,5214
722,banovici,2014,1261,0.953261,0.975511,0.023256,5167
801,banovici,2015,1270,0.953261,0.917824,0.025316,5230
880,banovici,2016,1276,0.953261,0.863865,0.025000,5169
...,...,...,...,...,...,...,...
1050,zivinice,2018,1187,0.905964,0.581126,0.038872,10137
1129,zivinice,2019,1185,0.905964,0.585950,0.036850,10911
1208,zivinice,2020,1208,0.905964,0.590914,0.035323,10916
1287,zivinice,2021,1241,0.905964,0.596017,0.030917,10929


Now save the cleaned dataset.

In [61]:
combined.to_excel(SAVE_DIR + "combined_clean.xlsx", index=False)

In [63]:
df = pd.read_excel(SAVE_DIR + "combined_clean.xlsx")
df

,Municipality,Year,Gross Average Wage,ethnic_concentration_hhi,political_fragmentation_hhi,Percentage of Agricultural Businesses,Employees
0,banovici,2012,1220,0.953261,0.706868,0.019355,5056
1,banovici,2013,1248,0.953261,0.827413,0.022340,5214
2,banovici,2014,1261,0.953261,0.975511,0.023256,5167
3,banovici,2015,1270,0.953261,0.917824,0.025316,5230
4,banovici,2016,1276,0.953261,0.863865,0.025000,5169
...,...,...,...,...,...,...,...
864,zivinice,2018,1187,0.905964,0.581126,0.038872,10137
865,zivinice,2019,1185,0.905964,0.585950,0.036850,10911
866,zivinice,2020,1208,0.905964,0.590914,0.035323,10916
867,zivinice,2021,1241,0.905964,0.596017,0.030917,10929
