## Population Data Processing

The population data for this study were obtained from the [GHSL - Global Human Settlement Layer](https://human-settlement.emergency.copernicus.eu/download.php?ds=pop), provided by the Copernicus Emergency Management Service. We downloaded the TIFF tiles covering the areas around our study regions. The data have a resolution of 100 m and consist of population estimates for the years 2000 to 2025 at five-year intervals (i.e., 2000, 2005, 2010, etc.).

After downloading the tiles that cover our study areas, we used the country shapefiles to extract the data within the national boundaries of Rwanda and Tanzania. To estimate population for years not directly available (e.g., 2001, 2002, 2003, and 2004), we performed per-pixel linear interpolation between the two nearest available years. This process involved reading the existing TIFF files for the known years, interpolating the values for the missing years, and writing new TIFF files. The result is a continuous annual dataset for our analysis. Linear interpolation assumes that each pixel's population changes at a constant rate between two known years, so the intermediate values are estimates rather than observations.
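The interpolation step can be sketched on toy arrays. This is a minimal illustration rather than the notebook's own function: the 2x2 arrays `pop_2010` and `pop_2015` and the helper name `interpolate_year` are hypothetical stand-ins for two rasters read from TIFF files.

```python
import numpy as np

def interpolate_year(year_a, raster_a, year_b, raster_b, target_year):
    """Linearly interpolate each pixel between two known years."""
    # Fraction of the way from year_a to year_b (0 at year_a, 1 at year_b)
    weight = (target_year - year_a) / (year_b - year_a)
    return raster_a + weight * (raster_b - raster_a)

# Hypothetical 2x2 population rasters for two known years
pop_2010 = np.array([[10.0, 20.0], [0.0, 5.0]])
pop_2015 = np.array([[20.0, 20.0], [5.0, 0.0]])

# 2012 lies 40% of the way from 2010 to 2015
pop_2012 = interpolate_year(2010, pop_2010, 2015, pop_2015, 2012)
print(pop_2012)
```

Each pixel is treated independently, so a pixel that grows and a pixel that shrinks between the two known years are both handled by the same expression.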

After obtaining the TIFF files for each country, we used district shapefiles to extract the population data for each district and subsequently computed several metrics: total population, population density, the Gini coefficient, and autocorrelation at lags of 2 and 5 (`autocorr2` and `autocorr5`), computed over each district's flattened pixel values.
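As a rough illustration of the district totals, the sketch below sums the valid pixels of a small masked array and converts the result to a density. The array, the no-data value `NODATA`, and the district area are made-up values. It also assumes the raster uses a projected CRS in metres, so that the polygon area is in square metres.

```python
import numpy as np

NODATA = -200.0  # hypothetical no-data marker; read the real one from the raster metadata

# Hypothetical masked raster for one district: pixel values are population counts
district_pixels = np.array([
    [NODATA, 12.0, 8.0],
    [3.0, 0.0, NODATA],
])

# Keep only valid pixels before summing
valid = district_pixels[district_pixels != NODATA]
population = valid.sum()

# Assumed polygon area in square metres (projected CRS), converted to km^2
district_area_m2 = 40_000.0
district_area_km2 = district_area_m2 / 1e6

density_per_km2 = population / district_area_km2
print(population, density_per_km2)
```

If the district geometry is in a geographic CRS (degrees), the polygon area is not a physical area, so the polygons would need to be reprojected to an equal-area CRS before computing a density.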

## Computed Metrics

1. Gini Coefficient

The Gini coefficient is best known from economics, where it measures income inequality. Applied to population data, it summarizes how unevenly the population is distributed across the pixels of a district. A value of 0 means every pixel holds the same population, and values approaching 1 mean the population is concentrated in very few pixels. An earlier entropy-based variant (`gini_entropy`) is kept in the helper functions but is disabled.

2. Autocorrelation (autocorr2 and autocorr5)

Autocorrelation measures the correlation of a signal with a delayed copy of itself as a function of the delay (the lag). Here, each district's pixels are flattened into a one-dimensional sequence, and `autocorr2` and `autocorr5` are the correlations between values 2 and 5 positions apart in that sequence. Positive autocorrelation means that similar population values tend to sit close together (clustering), while negative autocorrelation means that high and low values are interspersed.
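To make the Gini coefficient concrete, here is a small sketch that uses the same sorted-rank formula as the `gini_coefficient` helper defined later in this notebook. The function name `gini` and the input values are invented for illustration.

```python
import numpy as np

def gini(values):
    """Gini coefficient via the sorted-rank formula (non-negative inputs)."""
    data = np.sort(np.asarray(values, dtype=float))
    n = data.size
    total = data.sum()
    if n == 0 or total == 0:
        return np.nan  # undefined for empty or all-zero input
    ranks = np.arange(1, n + 1)
    return 2 * np.sum(ranks * data) / (n * total) - (n + 1) / n

even = gini([5, 5, 5, 5])            # identical pixels
concentrated = gini([0, 0, 0, 20])   # everyone in one pixel
print(even, concentrated)
```

With n values, this formula reaches at most (n - 1)/n, so four pixels with all population in one of them give 0.75 rather than 1.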
### Why Use These Metrics?

- Understanding Distribution: The Gini coefficient shows how evenly or unevenly the population is distributed across regions or districts. This can be crucial for resource allocation and policy-making.
- Spatial Patterns: Autocorrelation helps identify spatial patterns and dependencies in the population data. It indicates whether population densities are clustered or dispersed over space.
- Comparable Representation: Neither metric normalizes the data itself, but both are unitless summaries, so they can be compared across different regions or time periods.

### Example Usage in Population Data Analysis

In a typical population data analysis, these metrics are used as follows:

- Gini Coefficient: To measure and compare the inequality of population distribution in different districts or regions.
- Autocorr2 and Autocorr5: To analyze the spatial patterns of population density and identify whether population centers cluster together or spread out over space.
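The lagged correlation behind `autocorr2` and `autocorr5` can be sketched on a one-dimensional sequence. The function name `lagged_corr` and the sample sequence are hypothetical; the calculation is the Pearson correlation between the sequence and a copy of itself shifted by `lag` positions.

```python
import numpy as np

def lagged_corr(values, lag):
    """Pearson correlation between a sequence and itself shifted by `lag`."""
    x = np.asarray(values, dtype=float)
    if lag < 1 or x.size < lag + 2:
        return np.nan  # need at least two overlapping pairs
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

# Hypothetical sequence alternating between low and high values
sequence = [1, 9, 1, 9, 1, 9, 1, 9]

lag1 = lagged_corr(sequence, 1)  # neighbours always differ
lag2 = lagged_corr(sequence, 2)  # values two steps apart always match
print(lag1, lag2)
```

When a 2-D raster is flattened row by row, positions `lag` apart are `lag` pixels apart along a row except where a row wraps around, so a value computed this way is only an approximation of spatial autocorrelation.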

## Import Libraries

In [28]:
import pandas as pd
from pandas import read_csv
import numpy as np
import rasterio
import geopandas as gpd
from rasterio.merge import merge
import os
import re
import fiona
from scipy.interpolate import interp1d
import rasterio.features
from rasterio.mask import mask
from rasterio.warp import calculate_default_transform, reproject, Resampling
from fiona.crs import from_epsg
from shapely.geometry import mapping, shape
from sklearn.preprocessing import minmax_scale
from scipy.stats import entropy

## Helper Functions

In [4]:
def merge_tiff_files(tiff_files, output_path):
    """
    Merge multiple TIFF files into a single TIFF file.

    Parameters:
    tiff_files (list of str): List of file paths to the input TIFF files.
    output_path (str): File path to save the merged output TIFF file.
    """
    src_files_to_mosaic = []
    try:
        # Open the TIFF files
        for fp in tiff_files:
            src_files_to_mosaic.append(rasterio.open(fp))

        # Merge the TIFF files
        mosaic, out_trans = merge(src_files_to_mosaic)

        # Define the metadata for the output file, based on the last input tile
        # (tiles are assumed to share CRS, dtype, and band count)
        out_meta = src_files_to_mosaic[-1].meta.copy()
        out_meta.update({
            "driver": "GTiff",
            "height": mosaic.shape[1],
            "width": mosaic.shape[2],
            "transform": out_trans
        })

        # Write the merged TIFF file to disk
        with rasterio.open(output_path, "w", **out_meta) as dest:
            dest.write(mosaic)

        print(f"Data has been successfully merged and is available at: {output_path}")

    except Exception as e:
        print(f"An error occurred: {e}")

    finally:
        # Close all opened files, even if merging failed
        for src in src_files_to_mosaic:
            src.close()

In [5]:
def extract_data(tiff_path, shapefile_path, output_path):
    """
    Extract data from a TIFF file using a shapefile.

    Parameters:
    tiff_path (str): File path to the input TIFF file.
    shapefile_path (str): File path to the input shapefile.
    output_path (str): File path to save the extracted data TIFF file.
    """
    try:
        # Open the TIFF file to get the CRS
        with rasterio.open(tiff_path) as src:
            tiff_crs = src.crs

        # Read and reproject the shapefile to match the TIFF file's CRS
        shapefile = gpd.read_file(shapefile_path)
        shapefile = shapefile.to_crs(tiff_crs)

        # Get the geometry from the shapefile
        geometries = [mapping(geom) for geom in shapefile.geometry]

        # Open the TIFF file again to mask it with the reprojected shapefile geometries
        with rasterio.open(tiff_path) as src:
            out_image, out_transform = mask(src, geometries, crop=True)
            out_meta = src.meta.copy()
            out_meta.update({
                "driver": "GTiff",
                "height": out_image.shape[1],
                "width": out_image.shape[2],
                "transform": out_transform,
                "crs": tiff_crs
            })

        # Write the clipped TIFF file to disk
        with rasterio.open(output_path, "w", **out_meta) as dest:
            dest.write(out_image)

        # Print success message with output path
        print(f"Data successfully extracted and saved to {output_path}")

    except Exception as e:
        print(f"An error occurred: {e}")

In [6]:
def batch_data_extraction(tiff_files, shapefile_path, output_directory, prefix):
    """
    Process a list of TIFF files by extracting data using a shapefile and saving with a specified prefix.

    Parameters:
    tiff_files (list of str): List of file paths to the input TIFF files.
    shapefile_path (str): File path to the input shapefile.
    output_directory (str): Directory path to save the extracted data TIFF files.
    prefix (str): Prefix to use for the output TIFF files.
    """
    for tiff_path in tiff_files:
        # Generate the output file name and path
        file_name = os.path.basename(tiff_path).replace('merged', 'data')
        output_file_name = f"{prefix}_{file_name}"
        output_path = os.path.join(output_directory, output_file_name)

        # Call the extract_data function
        extract_data(tiff_path, shapefile_path, output_path)

In [44]:
# Metric functions
# An earlier entropy-based variant, kept for reference but disabled:
# def gini_entropy(data):
#     """Compute the Gini entropy of a dataset."""
#     if len(data) == 0:
#         return np.nan
#     scaled_data = minmax_scale(data, feature_range=(0, 1))
#     return entropy(scaled_data, base=2)

def gini_coefficient(data):
    """Compute the Gini coefficient of a dataset of non-negative values."""
    if len(data) == 0:
        return np.nan
    data = np.sort(data)
    n = len(data)
    total = np.sum(data)
    if total == 0:
        return np.nan  # Undefined when every value is zero
    index = np.arange(1, n + 1)
    gini_index = (2 * np.sum(index * data)) / (n * total) - (n + 1) / n
    return gini_index

def autocorr(x, lag):
    """Compute the autocorrelation of a dataset at a specified lag (lag >= 1)."""
    # At least two overlapping pairs are needed for a correlation
    if lag < 1 or len(x) < lag + 2:
        return np.nan
    return np.corrcoef(np.array([x[:-lag], x[lag:]]))[0, 1]

def autocorr2(data):
    """Compute the autocorrelation at lag 2."""
    return autocorr(data, 2)

def autocorr5(data):
    """Compute the autocorrelation at lag 5."""
    return autocorr(data, 5)

In [60]:
def extract_district_population(tiff_path, shapefile_path, output_directory):
    """Extract and process population data from TIFF file."""
    try:
        # Open the TIFF file to get the CRS and data
        with rasterio.open(tiff_path) as src:
            tiff_crs = src.crs

            # Read and reproject the shapefile to match the TIFF file's CRS
            shapefile = gpd.read_file(shapefile_path)
            shapefile = shapefile.to_crs(tiff_crs)

            results = []

            # Process each district
            for _, row in shapefile.iterrows():
                geometry = [mapping(row['geometry'])]
                region = row.get('region', row.get('province'))
                district = row['district']

                # Mask the TIFF file with the geometry
                out_image, out_transform = mask(src, geometry, crop=True)
                out_image = out_image[0]  # Extract the first layer if multi-dimensional
                
                # Sum only the valid pixel values
                population = np.sum(out_image[out_image != src.nodata])
                
                # Pixel values are assumed to be population counts, so their sum is the
                # district population. If they represent density instead, scale them first.
                # Note: the area is in squared CRS units (e.g. m^2 for a projected CRS in
                # metres), so the density is people per squared CRS unit, not per km^2.
                population_density = population / row['geometry'].area

                # Flatten the masked image for metric calculations
                flat_image = out_image.flatten()
                flat_image = flat_image[np.isfinite(flat_image)]  # Exclude NaN and inf values
                flat_image = flat_image[flat_image != src.nodata]  # Exclude no-data values

                if flat_image.size == 0:
                    continue  # Skip if no valid data

                # Compute metrics using the helper functions gini_coefficient, autocorr2, autocorr5
                gini = gini_coefficient(flat_image)
                ac2 = autocorr2(flat_image)
                ac5 = autocorr5(flat_image)

                results.append({
                    'year': re.search(r'(\d{4})', tiff_path).group(1),
                    'region': region,
                    'district': district,
                    'population': population,
                    'population_density': population_density,
                    'gini_coefficient': gini,
                    'autocorr2': ac2,
                    'autocorr5': ac5,
                    'geometry': row['geometry']
                })

        # Create DataFrame
        df = pd.DataFrame(results)

        # Save to CSV without geometry column
        df_without_geometry = df.drop(columns=['geometry'])
        year = re.search(r'(\d{4})', tiff_path).group(1)
        csv_path = os.path.join(output_directory, f"{year}_population_data.csv")
        df_without_geometry.to_csv(csv_path, index=False)

        # Save to GeoJSON with geometry column (geometries are in the raster's CRS)
        gdf = gpd.GeoDataFrame(df, geometry='geometry', crs=tiff_crs)
        geojson_path = os.path.join(output_directory, f"{year}_population_data.geojson")
        gdf.to_file(geojson_path, driver='GeoJSON')

        print(f"The CSV file of population data is successfully extracted and saved to {csv_path}")
        print(f"The geojson file of population data is successfully extracted and saved to {geojson_path}")
        return df_without_geometry
    except Exception as e:
        print(f"An error occurred: {e}")

In [61]:
def batch_extract_district_population(tiff_paths, shapefile_path, output_directory, prefix):
    """Batch extract district population data from a list of TIFF files."""
    all_data = []

    for tiff_path in tiff_paths:
        # Use the returned DataFrame directly; None signals a failed extraction,
        # which avoids picking up a stale CSV left over from an earlier run
        df = extract_district_population(tiff_path, shapefile_path, output_directory)
        if df is not None:
            all_data.append(df)
        else:
            print(f"Warning: no data extracted from {tiff_path}, skipping.")

    # Merge all data into a single DataFrame
    if all_data:
        merged_df = pd.concat(all_data, ignore_index=True)
        merged_csv_path = os.path.join(output_directory, f"{prefix}_population_data.csv")
        merged_df.to_csv(merged_csv_path, index=False)
        print(f"All data successfully merged and saved to {merged_csv_path}")
        
        # Save the merged data to an Excel file
        merged_excel_path = os.path.join(output_directory, f"{prefix}_population_data.xlsx")
        merged_df.to_excel(merged_excel_path, index=False)
        print(f"All data successfully merged and saved to {merged_excel_path}")
    else:
        print("No data was processed. Please check the input TIFF files and shapefile.")


In [62]:
def linear_interpolate_rasters(prefix, raster_paths, output_dir, target_years):
    """
    Linearly interpolate rasters between given years using interp1d in a vectorized manner.

    Parameters:
    prefix (str): Prefix for the output file names.
    raster_paths (dict): A dictionary with years as keys and file paths of existing TIFF files as values.
    output_dir (str): Directory to save the interpolated TIFF files.
    target_years (list of int): List of specific target years for interpolation.

    Returns:
    dict: Dictionary with interpolated years as keys and interpolated rasters as values.
    """
    known_years = sorted(raster_paths.keys())
    os.makedirs(output_dir, exist_ok=True)

    # Read the existing TIFF files
    data_dict = {}
    profile = None
    for year in known_years:
        with rasterio.open(raster_paths[year]) as src:
            data_dict[year] = src.read(1)
            if profile is None:
                profile = src.profile

    # Get the shape of the data
    sample_year = known_years[0]
    sample_data = data_dict[sample_year]
    height, width = sample_data.shape

    # Stack data into a single array for fitting
    data = np.stack([data_dict[year] for year in known_years], axis=0)

    # Define the years corresponding to your data
    years = np.array(known_years)

    # Reshape the data for vectorized interpolation
    data_reshaped = data.reshape((len(known_years), -1))

    # Initialize interpolated data array
    interpolated_rasters = {}

    # Create an interp1d object for vectorized interpolation
    interpolator = interp1d(years, data_reshaped, kind='linear', axis=0, fill_value='extrapolate')

    for target_year in target_years:
        if target_year in known_years:
            interpolated_rasters[target_year] = data_dict[target_year]
            continue

        # Perform vectorized interpolation
        target_values = interpolator(target_year)

        # Reshape the interpolated values back to the original shape
        target_values = target_values.reshape((height, width))

        interpolated_rasters[target_year] = target_values

        # Write the interpolated raster to a new file
        output_file = os.path.join(output_dir, f'{prefix}_{target_year}_population_data.tif')
        with rasterio.open(output_file, 'w', **profile) as dst:
            dst.write(target_values.astype(profile['dtype']), 1)  # Ensure dtype matches profile
        print(f"Generated {output_file} with interpolated population values.")

    print("Interpolation and TIFF file generation completed successfully.")
    return interpolated_rasters

### Merge the TIFF Files of Population Data

In [63]:
#merge the population data of 2000
t1 = 'population_data/2000/1.tif'
t2 = 'population_data/2000/2.tif'
t3 = 'population_data/2000/3.tif'
t4 = 'population_data/2000/4.tif'
t5 = 'population_data/2000/5.tif'
t6 = 'population_data/2000/6.tif'
tiff_files = [t1, t2, t3, t4, t5, t6]
output_path = 'population_data/2000_population_merged.tif'
#merge_tiff_files(tiff_files, output_path)

In [64]:
#merge the population data of 2005
t1 = 'population_data/2005/1.tif'
t2 = 'population_data/2005/2.tif'
t3 = 'population_data/2005/3.tif'
t4 = 'population_data/2005/4.tif'
t5 = 'population_data/2005/5.tif'
t6 = 'population_data/2005/6.tif'
tiff_files = [t1, t2, t3, t4, t5, t6]
output_path = 'population_data/2005_population_merged.tif'
#merge_tiff_files(tiff_files, output_path)

In [65]:
#merge the population data of 2010
t1 = 'population_data/2010/1.tif'
t2 = 'population_data/2010/2.tif'
t3 = 'population_data/2010/3.tif'
t4 = 'population_data/2010/4.tif'
t5 = 'population_data/2010/5.tif'
t6 = 'population_data/2010/6.tif'
tiff_files = [t1, t2, t3, t4, t5, t6]
output_path = 'population_data/2010_population_merged.tif'
#merge_tiff_files(tiff_files, output_path)

In [66]:
#merge the population data of 2015
t1 = 'population_data/2015/1.tif'
t2 = 'population_data/2015/2.tif'
t3 = 'population_data/2015/3.tif'
t4 = 'population_data/2015/4.tif'
t5 = 'population_data/2015/5.tif'
t6 = 'population_data/2015/6.tif'
tiff_files = [t1, t2, t3, t4, t5, t6]
output_path = 'population_data/2015_population_merged.tif'
#merge_tiff_files(tiff_files, output_path)

In [67]:
#merge the population data of 2020
t1 = 'population_data/2020/1.tif'
t2 = 'population_data/2020/2.tif'
t3 = 'population_data/2020/3.tif'
t4 = 'population_data/2020/4.tif'
t5 = 'population_data/2020/5.tif'
t6 = 'population_data/2020/6.tif'
tiff_files = [t1, t2, t3, t4, t5, t6]
output_path = 'population_data/2020_population_merged.tif'
#merge_tiff_files(tiff_files, output_path)

In [68]:
#merge the population data of 2025
t1 = 'population_data/2025/1.tif'
t2 = 'population_data/2025/2.tif'
t3 = 'population_data/2025/3.tif'
t4 = 'population_data/2025/4.tif'
t5 = 'population_data/2025/5.tif'
t6 = 'population_data/2025/6.tif'
tiff_files = [t1, t2, t3, t4, t5, t6]
output_path = 'population_data/2025_population_merged.tif'
#merge_tiff_files(tiff_files, output_path)

## Extract the Population Data for Tanzania

#### Get the TIFF Files of Population Data

In [69]:
tz_dir = 'tanzania_data/'
prefix = 'tz'
shapefile_path = tz_dir + 'shapefiles/tz_country.shp'
output_directory = tz_dir + 'population_data/'
t1 = 'population_data/2000_population_merged.tif'
t2 = 'population_data/2005_population_merged.tif'
t3 = 'population_data/2010_population_merged.tif'
t4 = 'population_data/2015_population_merged.tif'
t5 = 'population_data/2020_population_merged.tif'
t6 = 'population_data/2025_population_merged.tif'
tiff_files = [t1, t2, t3, t4, t5, t6]
#batch_data_extraction(tiff_files, shapefile_path, output_directory, prefix)

#### Interpolate the Missing Years of Population Data

In [70]:
prefix = 'tz'
output_dir = tz_dir + 'population_data/interpolated/'
t1 = tz_dir + 'population_data/tz_2000_population_data.tif'
t2 = tz_dir + 'population_data/tz_2005_population_data.tif'
t3 = tz_dir + 'population_data/tz_2010_population_data.tif'
t4 = tz_dir + 'population_data/tz_2015_population_data.tif'
t5 = tz_dir + 'population_data/tz_2020_population_data.tif'
t6 = tz_dir + 'population_data/tz_2025_population_data.tif'

In [71]:
# Example usage:
tif_paths = {
    2010: t3,
    2015: t4
}

#linear_interpolate_rasters(prefix,tif_paths, output_dir, target_years= [2011, 2012, 2013, 2014])

In [72]:
# Example usage:
tif_paths = {
    2015: t4,
    2020: t5
}

#linear_interpolate_rasters(prefix,tif_paths, output_dir, target_years= [2016, 2017, 2018, 2019])

In [73]:
# Example usage:
tif_paths = {
    2020: t5,
    2025: t6
}

#linear_interpolate_rasters(prefix,tif_paths, output_dir, target_years= [2021, 2022, 2023, 2024])

#### Extract the Population Data of Each District

In [84]:
prefix = 'tz'
shapefile_path = tz_dir + 'shapefiles/tz_districts.shp'
output_directory = tz_dir + 'population_data/processed/'
t1 = tz_dir + 'population_data/tz_2010_population_data.tif'
t2 = tz_dir + 'population_data/interpolated/tz_2011_population_data.tif'
t3 = tz_dir + 'population_data/interpolated/tz_2012_population_data.tif'
t4 = tz_dir + 'population_data/interpolated/tz_2013_population_data.tif'
t5 = tz_dir + 'population_data/interpolated/tz_2014_population_data.tif'
t6 = tz_dir + 'population_data/tz_2015_population_data.tif'
t7 = tz_dir + 'population_data/interpolated/tz_2016_population_data.tif'
t8 = tz_dir + 'population_data/interpolated/tz_2017_population_data.tif'
t9 = tz_dir + 'population_data/interpolated/tz_2018_population_data.tif'
t10 = tz_dir + 'population_data/interpolated/tz_2019_population_data.tif'
t11 = tz_dir + 'population_data/tz_2020_population_data.tif'
t12 = tz_dir + 'population_data/interpolated/tz_2021_population_data.tif'
t13 = tz_dir + 'population_data/interpolated/tz_2022_population_data.tif'
t14 = tz_dir + 'population_data/interpolated/tz_2023_population_data.tif'

tiff_paths = [t1, t2, t3, t4, t5, t6, t7,
              t8, t9, t10, t11, t12, t13, t14
             ]

#extract_district_population(tiff_path, shapefile_path, output_directory)
#batch_extract_district_population(tiff_paths, shapefile_path, output_directory, prefix)

## Extract the Population Data for Rwanda

#### Get the TIFF Files of Population Data

In [76]:
rw_dir = 'rwanda_data/'
prefix = 'rw'
shapefile_path = rw_dir + 'shapefiles/rw_country.shp'
output_directory = rw_dir + 'population_data/'
t1 = 'population_data/2000_population_merged.tif'
t2 = 'population_data/2005_population_merged.tif'
t3 = 'population_data/2010_population_merged.tif'
t4 = 'population_data/2015_population_merged.tif'
t5 = 'population_data/2020_population_merged.tif'
t6 = 'population_data/2025_population_merged.tif'
tiff_files = [t1, t2, t3, t4, t5, t6]
#batch_data_extraction(tiff_files, shapefile_path, output_directory, prefix)

#### Interpolate the Missing Years of Population Data

In [77]:
prefix = 'rw'
shapefile_path = rw_dir + 'shapefiles/rw_district.shp'
output_dir = rw_dir + 'population_data/interpolated/'
t1 = rw_dir + 'population_data/rw_2000_population_data.tif'
t2 = rw_dir + 'population_data/rw_2005_population_data.tif'
t3 = rw_dir + 'population_data/rw_2010_population_data.tif'
t4 = rw_dir + 'population_data/rw_2015_population_data.tif'
t5 = rw_dir + 'population_data/rw_2020_population_data.tif'
t6 = rw_dir + 'population_data/rw_2025_population_data.tif'
tiff_paths = [t1, t2, t3, t4, t5, t6]

In [78]:
# Example usage:
tif_paths = {
    2005: t2,
    2010: t3
}

#linear_interpolate_rasters(prefix,tif_paths, output_dir, target_years= [2006, 2007, 2008, 2009])

In [79]:
# Example usage:
tif_paths = {
    2010: t3,
    2015: t4
}

#linear_interpolate_rasters(prefix,tif_paths, output_dir, target_years= [2011, 2012, 2013, 2014])

In [80]:
# Example usage:
tif_paths = {
    2015: t4,
    2020: t5
}

#linear_interpolate_rasters(prefix,tif_paths, output_dir, target_years= [2016, 2017, 2018, 2019])

In [81]:
# Example usage:
tif_paths = {
    2020: t5,
    2025: t6
}

#linear_interpolate_rasters(prefix,tif_paths, output_dir, target_years= [2021, 2022, 2023, 2024])

#### Extract the Population Data of Each District

In [83]:
prefix = 'rw'
shapefile_path = rw_dir + 'shapefiles/rw_district.shp'
output_directory = rw_dir + 'population_data/processed/'
t1 = rw_dir + 'population_data/rw_2005_population_data.tif'
t2 = rw_dir + 'population_data/interpolated/rw_2006_population_data.tif'
t3 = rw_dir + 'population_data/interpolated/rw_2007_population_data.tif'
t4 = rw_dir + 'population_data/interpolated/rw_2008_population_data.tif'
t5 = rw_dir + 'population_data/interpolated/rw_2009_population_data.tif'
t6 = rw_dir + 'population_data/rw_2010_population_data.tif'
t7 = rw_dir + 'population_data/interpolated/rw_2011_population_data.tif'
t8 = rw_dir + 'population_data/interpolated/rw_2012_population_data.tif'
t9 = rw_dir + 'population_data/interpolated/rw_2013_population_data.tif'
t10 = rw_dir + 'population_data/interpolated/rw_2014_population_data.tif'
t11 = rw_dir + 'population_data/rw_2015_population_data.tif'
t12 = rw_dir + 'population_data/interpolated/rw_2016_population_data.tif'
t13 = rw_dir + 'population_data/interpolated/rw_2017_population_data.tif'
t14 = rw_dir + 'population_data/interpolated/rw_2018_population_data.tif'
t15 = rw_dir + 'population_data/interpolated/rw_2019_population_data.tif'
t16 = rw_dir + 'population_data/rw_2020_population_data.tif'
t17 = rw_dir + 'population_data/interpolated/rw_2021_population_data.tif'
t18 = rw_dir + 'population_data/interpolated/rw_2022_population_data.tif'
tiff_paths = [t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11, t12, t13, t14, t15, t16, t17, t18]

#batch_extract_district_population(tiff_paths, shapefile_path, output_directory, prefix)