# Timeseries Data Collection and Preprocessing

This notebook focuses on collecting and preprocessing time-series data relevant to food security indicators. It encompasses a diverse range of data sources, including temperature, rainfall, and food item prices. The collected data undergoes preprocessing to ensure its quality and suitability for predictive modeling of food security indicators. By using this comprehensive dataset, we aim to gain insights into the dynamic interplay between environmental factors, economic variables, and food security outcomes.

## Data Sources

This section links to the data sources used in our analysis. They span several domains relevant to food security dynamics, including meteorological data repositories, governmental databases, and market price indices. Drawing on reputable and varied sources supports the richness and reliability of the data underpinning the analysis. The data are publicly available from the following repository:

* [NASA's POWER (Prediction Of Worldwide Energy Resources)](https://power.larc.nasa.gov/data-access-viewer/): 
This resource is part of NASA Earth Science's Applied Sciences Program and supports energy-related applications by providing access to NASA's solar radiation and meteorological research data. Its focus areas include **Renewable Energy** and **Agricultural Needs**. It also provides [API](https://power.larc.nasa.gov/docs/services/api/) services for custom scripts and applications.


### Import Libraries

In [1]:
#pip install nasa-power

In [12]:
import pandas as pd
from pandas import read_csv
import geopandas as gpd
from geopandas import read_file
import matplotlib.pyplot as plt
import requests
import calendar
from shapely.ops import unary_union  # cascaded_union is deprecated in favor of unary_union
import numpy as np
from shapely.geometry import Point, Polygon
import time

## Helper Functions

In [4]:
def compute_centroids(shapefile_path, region_col, district_col):
    
    '''
    Compute centroids of districts in a shapefile.

    Args:
    - shapefile_path (str): Path to the shapefile.
    - region_col (str): Name of the column containing region or province information.
    - district_col (str): Name of the column containing district information.

    Returns:
    - centroid_df (GeoDataFrame): DataFrame containing region, district,
      centroid lat, and centroid lon.
    
    '''
    
    # Read the shapefile
    gdf = read_file(shapefile_path)
    
    # Ensure the columns exist in the DataFrame
    if region_col not in gdf.columns or district_col not in gdf.columns:
        raise ValueError("Columns not found in the shapefile.")
    
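    # Compute centroids. Note: geopandas warns that centroids computed in a
    # geographic (lat/lon) CRS are approximate; reproject first if precision matters.
    centroids = gdf.centroid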
    
    # Extract centroid coordinates
    centroid_coords = centroids.apply(lambda x: (x.y, x.x))
    
    # Create a DataFrame with region, district, and centroid coordinates
    centroid_df = gpd.GeoDataFrame({
        'region': gdf[region_col],
        'district': gdf[district_col],
        'centroid_lat': centroid_coords.apply(lambda x: x[0]),
        'centroid_lon': centroid_coords.apply(lambda x: x[1])
    })
    
    return centroid_df
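
To make explicit what a polygon centroid is, here is a minimal sketch of the planar, area-weighted centroid computed with the shoelace formula. It is an illustration only, not the routine geopandas uses internally. The function name `polygon_centroid` and the sample coordinates are hypothetical.

```python
def polygon_centroid(vertices):
    """Planar centroid of a simple polygon given as [(x, y), ...].

    The ring is treated as closed: the last vertex connects back to the first.
    """
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0          # shoelace term for this edge
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("Polygon has zero area.")
    # Cx = sum((x0 + x1) * cross) / (6 * A), with A = twice_area / 2
    return cx / (3 * twice_area), cy / (3 * twice_area)


# Hypothetical L-shaped polygon: the centroid is pulled toward the larger arm
# and differs from the midpoint of the bounding box, which would be (2, 2).
l_shape = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)]
centroid_x, centroid_y = polygon_centroid(l_shape)
```

For strongly concave districts the centroid can even fall outside the polygon, which is worth keeping in mind when a centroid stands in for a whole district.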


In [5]:
def compute_district_bounds(shapefile_path):
    
    '''
    Compute the bounds (minimum and maximum longitude and latitude) of districts in a shapefile.

    Args:
    - shapefile_path (str): Path to the shapefile.

    Returns:
    - district_bounds (DataFrame): DataFrame containing region, district,
      minimum longitude, maximum longitude, minimum latitude, and maximum latitude.
    
    '''
    
    # Read the shapefile
    gdf = read_file(shapefile_path)

    # Group by district and compute the maximum and minimum longitude and latitude
    district_bounds = gdf.groupby(['region', 'district'])['geometry'].apply(lambda x: x.bounds).reset_index()

    # Extract maximum and minimum longitude and latitude values
    district_bounds['min_lon'] = district_bounds['minx']
    district_bounds['max_lon'] = district_bounds['maxx']
    district_bounds['min_lat'] = district_bounds['miny']
    district_bounds['max_lat'] = district_bounds['maxy']

    # Select relevant columns
    district_bounds = district_bounds[['region', 'district', 'min_lon', 'max_lon', 'min_lat', 'max_lat']]

    return district_bounds
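
`GeoSeries.bounds` returns one row of `minx`, `miny`, `maxx`, `maxy` per geometry, so a district that is split across several rows of a shapefile would end up with several bounding boxes. Whether that occurs in this shapefile is not known. The sketch below shows one way such boxes could be merged into a single box per district. The helper names and sample values are hypothetical.

```python
from collections import defaultdict


def merge_bounds(bounds_list):
    """Merge several (minx, miny, maxx, maxy) boxes into one enclosing box."""
    minxs, minys, maxxs, maxys = zip(*bounds_list)
    return min(minxs), min(minys), max(maxxs), max(maxys)


def bounds_by_district(rows):
    """Group (district, bounds) pairs and return one enclosing box per district."""
    grouped = defaultdict(list)
    for district, bounds in rows:
        grouped[district].append(bounds)
    return {district: merge_bounds(boxes) for district, boxes in grouped.items()}


# Made-up example: district "A" appears twice, district "B" once
sample_rows = [
    ("A", (36.0, -3.6, 36.5, -3.1)),
    ("A", (36.4, -3.9, 36.9, -3.3)),
    ("B", (35.0, -3.5, 35.4, -3.2)),
]
merged = bounds_by_district(sample_rows)
```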


In [6]:
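def generate_random_points_in_polygon(polygon, num_points, min_distance=2, max_attempts=100000):
    '''
    Generate random points inside a polygon with a minimum distance between points.

    Parameters:
        polygon (Polygon): Polygon object representing the district boundary.
        num_points (int): Number of random points to generate.
        min_distance (float): Minimum distance between points, in the units of the
            polygon's coordinates (degrees for an unprojected lat/lon shapefile,
            not kilometers; 2 km is roughly 0.018 degrees).
        max_attempts (int): Maximum number of candidate points to draw before giving up.

    Returns:
        list: List of Point objects representing the random points inside the polygon.
    '''
    points = []
    min_x, min_y, max_x, max_y = polygon.bounds
    # buffer(0) repairs minor geometry invalidities before the containment test
    boundary_polygon = polygon.buffer(0)
    attempts = 0
    while len(points) < num_points:
        if attempts >= max_attempts:
            # Without this guard the loop never ends when min_distance is too
            # large for the polygon to hold num_points points
            raise RuntimeError(
                f"Only {len(points)} of {num_points} points placed after "
                f"{max_attempts} attempts; try a smaller min_distance."
            )
        attempts += 1
        # Rejection sampling: draw from the bounding box, keep points inside the polygon
        point = Point(np.random.uniform(min_x, max_x), np.random.uniform(min_y, max_y))
        if point.within(boundary_polygon):
            if all(point.distance(existing_point) > min_distance for existing_point in points):
                points.append(point)
    return points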
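
Shapely's `distance` is a planar distance in the units of the coordinates, so for latitude/longitude data it is measured in degrees rather than kilometers. As a minimal sketch, assuming a spherical Earth, the helpers below convert a kilometer threshold to approximate degrees of latitude and compute a great-circle distance in kilometers. The function names are hypothetical and are not part of the notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def km_to_degrees_lat(km):
    """Approximate number of degrees of latitude spanned by a distance in km."""
    km_per_degree = math.pi * EARTH_RADIUS_KM / 180.0
    return km / km_per_degree


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# A 2 km spacing corresponds to roughly 0.018 degrees of latitude
two_km_in_degrees = km_to_degrees_lat(2)
```

Near the equator, as in Tanzania, a degree of longitude is close in length to a degree of latitude, so a single degree threshold is a reasonable approximation. At higher latitudes the longitude spacing shrinks by the cosine of the latitude.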

In [7]:
def compute_district_points(shapefile_path, num_points_per_district, min_distance=2):
    '''
    Compute random points inside each district polygon in a shapefile with a minimum distance between points.

    Args:
        shapefile_path (str): Path to the shapefile.
        num_points_per_district (int): Number of random points to generate per district.
        min_distance (float): Minimum distance between points, in the units of the shapefile's coordinates (degrees for unprojected lat/lon data).

    Returns:
        DataFrame: DataFrame containing region, district,
                   longitude, and latitude of each random point.
    '''
    # Read the shapefile
    gdf = gpd.read_file(shapefile_path)

    # Initialize an empty list to store district points
    district_points = []

    # Iterate over each district
    for index, row in gdf.iterrows():
        district = row['district']
        region = row['region']
        polygon = row['geometry']

        # Generate random points inside the polygon with minimum distance between points
        points = generate_random_points_in_polygon(polygon, num_points_per_district, min_distance)

        # Extract longitude and latitude of each point
        for point in points:
            district_points.append({
                'region': region,
                'district': district,
                'longitude': point.x,
                'latitude': point.y
            })

    # Convert list of dictionaries to DataFrame
    district_points_df = pd.DataFrame(district_points)

    return district_points_df


In [9]:
def visualize_centroids(shapefile_path, region_col, district_col):
    
    '''
    Visualize centroids of districts on a map.

    Args:
    - shapefile_path (str): Path to the shapefile.
    - region_col (str): Name of the column containing region or province information.
    - district_col (str): Name of the column containing district information.
    
    '''
    
    # Compute centroids using compute_centroids function
    centroid_df = compute_centroids(shapefile_path, region_col, district_col)

    # Read the shapefile
    gdf = gpd.read_file(shapefile_path)

    # Plot districts
    ax = gdf.plot(figsize=(10, 10), edgecolor='black', alpha=0.5)

    # Plot centroids
    ax.scatter(centroid_df['centroid_lon'], centroid_df['centroid_lat'], color='red', s=50, label='Centroids')

    # Add labels (compute_centroids renames the district column to 'district')
    for x, y, label in zip(centroid_df['centroid_lon'], centroid_df['centroid_lat'], centroid_df['district']):
        ax.text(x, y, label, fontsize=10, ha='center', va='center')

    # Set plot title
    plt.title('Centroids of Districts')

    # Set x and y axis labels
    plt.xlabel('Longitude')
    plt.ylabel('Latitude')

    # Show legend
    plt.legend()

    # Show plot
    plt.show()


In [None]:
def aggregate_data(df, groupby_columns, mean_columns):
    
    '''
    Aggregate data in a DataFrame based on specified groupby columns and calculate the mean of specified columns.

    Parameters:
        df (DataFrame): Input DataFrame containing the data.
        groupby_columns (list): List of columns to group by.
        mean_columns (list): List of columns for which the mean needs to be calculated.

    Returns:
        DataFrame: DataFrame containing the aggregated data with means of specified columns.
    '''
    # Prepare dictionary with mean aggregation for specified columns
    aggregation_dict = {column: 'mean' for column in mean_columns}

    # Group by specified columns and calculate the mean
    aggregated_df = df.groupby(groupby_columns).agg(aggregation_dict).reset_index()

    return aggregated_df
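
As a small usage sketch, the example below averages values from several sampled points within each district and month. The function body mirrors `aggregate_data` above. The sample rows are made up purely for illustration.

```python
import pandas as pd


def aggregate_data(df, groupby_columns, mean_columns):
    # Same logic as the helper above: mean of each listed column per group
    aggregation_dict = {column: 'mean' for column in mean_columns}
    return df.groupby(groupby_columns).agg(aggregation_dict).reset_index()


# Made-up values: two sampled points in Karatu, one in Arusha
points = pd.DataFrame({
    'region': ['Arusha', 'Arusha', 'Arusha'],
    'district': ['Karatu', 'Karatu', 'Arusha'],
    'year': [1990, 1990, 1990],
    'month': [1, 1, 1],
    'T2M': [20.0, 22.0, 23.0],
})

district_means = aggregate_data(points, ['region', 'district', 'year', 'month'], ['T2M'])
```

Here the two Karatu points collapse into a single row holding their mean, while the single Arusha point is passed through unchanged.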

In [13]:
# min_distance is in coordinate units (degrees here); 0.02 degrees is roughly 2 km
data = compute_district_points('tanzania_data/tz_shapefiles/tz_districts.shp', 20, min_distance=0.02)


In [None]:
data.to_csv('tanzania_data/tz_districts_points.csv', index=False)

In [13]:
#visualize_centroids('tanzania_data/tz_shapefiles/tz_districts.shp','region','district')

## Collecting Temperature Data
Temperature patterns matter for agriculture because they affect crop growth, pest management, and water availability. Analyzing temperature trends over time lets us assess how vulnerable agricultural systems are to climate fluctuations and long-term change, which is essential for addressing food insecurity and improving agricultural resilience under evolving climate conditions. In this section we therefore use [NASA's POWER API](https://power.larc.nasa.gov/docs/services/api/temporal/monthly/) to gather monthly temperature data across geographic regions. For each year, the API provides the annual value and each month's average, maximum, and/or minimum values. Data can be requested for a single point or for a region.
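
To show what a monthly point request looks like, here is a minimal sketch that assembles the query URL without sending it. The query keys are the ones used by the fetch function in this notebook. The helper name `build_monthly_point_url` is hypothetical.

```python
from urllib.parse import urlencode

POWER_MONTHLY_POINT_URL = "https://power.larc.nasa.gov/api/temporal/monthly/point"


def build_monthly_point_url(lat, lon, start_year, end_year, variables):
    """Assemble (but do not send) a monthly point query for the POWER API."""
    query = {
        "latitude": lat,
        "longitude": lon,
        "start": start_year,                  # Start year (YYYY)
        "end": end_year,                      # End year (YYYY)
        "community": "AG",                    # Agroclimatology community
        "parameters": ",".join(variables),    # Comma-separated parameter names
        "format": "json",
    }
    return f"{POWER_MONTHLY_POINT_URL}?{urlencode(query)}"


url = build_monthly_point_url(-3.347952, 36.68817, 1990, 2022, ["T2M", "T2M_MAX", "T2M_MIN"])
```

Printing the URL before running a long collection loop is a cheap way to check a single request in a browser first.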

### Single Point 

In [37]:
def fetch_temperature_data_point(lat, lon):
    
    '''
    Fetches temperature data from the NASA POWER API for a given latitude and longitude.

    Parameters:
        lat (float): Latitude of the location.
        lon (float): Longitude of the location.

    Returns:
        dict: A dictionary containing the fetched data if successful, None otherwise.
    '''
    
    parameters = {
        'latitude': lat,
        'longitude': lon,
        'start': '1990',         # Start year (YYYY)
        'end': '2022',           # End year (YYYY)
        'community': 'AG',       # Agroclimatology Archive
        'parameters': ','.join([
            'T2M',                # MERRA-2 Temperature at 2 Meters (C)
            'T2M_MAX',            # MERRA-2 Temperature at 2 Meters Maximum (C)
            'T2M_MIN',            # MERRA-2 Temperature at 2 Meters Minimum (C)
        ]),
        'format': 'json',        # Data format (json or csv)
        'temporalAverage': 'monthly'  # Temporal resolution (monthly)
    }

    # Make a GET request to the NASA POWER API
    response = requests.get('https://power.larc.nasa.gov/api/temporal/monthly/point', params=parameters)

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Extract and return the data from the response
        return response.json()
    else:
        print("Failed to retrieve data. Status code:", response.status_code)
        return None
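
Because the fetch function returns `None` on failure and the collection loop issues one request per district, transient failures can leave gaps. One possible approach, sketched below, is to retry with exponential backoff. The helper `call_with_retries` is hypothetical and is not part of the notebook or of `requests`.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it returns a non-None value or attempts run out.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... seconds between attempts.
    The sleep argument can be swapped out, for example in tests.
    """
    for attempt in range(max_attempts):
        result = func()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None
```

It could be used as `call_with_retries(lambda: fetch_temperature_data_point(lat, lon))`, which leaves the fetch function itself unchanged.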

In [38]:
def compute_point_temperature(centroids_df):
    '''
    Calculates temperature data for each district centroid and returns a DataFrame.

    Parameters:
        centroids_df (DataFrame): DataFrame containing centroids (latitude and longitude) of each district.

    Returns:
        DataFrame: A DataFrame containing the calculated temperature data for each district centroid.
    '''
    # Initialize an empty list to store individual DataFrames
    all_data_frames = []

    # Loop over each row in the DataFrame
    for index, row in centroids_df.iterrows():
        latitude = row['centroid_lat']
        longitude = row['centroid_lon']
        district = row['district']
        region = row['region']

        # Fetch NASA data
        nasa_data = fetch_temperature_data_point(latitude, longitude)

        if nasa_data:
            # Convert the response to a pandas DataFrame
            nasa_df = pd.DataFrame(nasa_data['properties']['parameter'])

            # Extract year and month from the index
            nasa_df.index = nasa_df.index.astype(str)  # Convert index to string
            nasa_df['year'] = nasa_df.index.str[:4]    # Extract first 4 characters as year
            nasa_df['month'] = nasa_df.index.str[4:]   # Remaining characters are the month ("13" is the annual value)
            nasa_df['district'] = district
            nasa_df['region'] = region

            # Reorder columns
            nasa_df = nasa_df[['region', 'district', 'year', 'month', 'T2M', 'T2M_MAX', 'T2M_MIN']]

            # Append DataFrame to the list
            all_data_frames.append(nasa_df)

    # Concatenate all DataFrames into a single DataFrame
    all_data = pd.concat(all_data_frames, ignore_index=True)

    return all_data

### Regional

In [15]:
def fetch_temperature_data_region(lat_min, lat_max, lon_min, lon_max):
    
    '''
    Fetches temperature data for a given region from the NASA POWER API.

    Parameters:
        lat_min (float): The minimum latitude of the region.
        lat_max (float): The maximum latitude of the region.
        lon_min (float): The minimum longitude of the region.
        lon_max (float): The maximum longitude of the region.

    Returns:
        dict: A dictionary containing the fetched temperature data, or None if the request failed.
    '''
    parameters = {
        'latitude-min': lat_min,
        'latitude-max': lat_max,
        'longitude-min': lon_min,
        'longitude-max': lon_max,
        'start': 1990,  # Start year (YYYY)
        'end': 2020,    # End year (YYYY)
        'community': 'ag',    # Agroclimatology Archive
        'parameters': ','.join([
            'T2M',                 # MERRA-2 Temperature at 2 Meters (C)
            'T2M_MAX',             # MERRA-2 Temperature at 2 Meters Maximum (C)
            'T2M_MIN',             # MERRA-2 Temperature at 2 Meters Minimum (C)
        ]),
        'format': 'json',       # Data format (json or csv)
        'temporalAverage': 'monthly'  # Temporal resolution (monthly)
    }

    # Make a GET request to the NASA POWER API
    response = requests.get('https://power.larc.nasa.gov/api/temporal/monthly/regional', params=parameters)

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Extract and return the data from the response
        return response.json()
    else:
        print("Failed to retrieve data. Status code:", response.status_code)
        return None


In [47]:
def compute_regional_temperature(district_bounds_df):
    '''
    Calculates temperature data for each district region and returns a DataFrame.

    Parameters:
        district_bounds_df (DataFrame): DataFrame containing district bounds.

    Returns:
        DataFrame: A DataFrame containing the calculated temperature data for each district region.
    '''
    # Initialize an empty list to store individual DataFrames
    all_data_frames = []

    # Loop over each row in the DataFrame
    for index, row in district_bounds_df.iterrows():
        min_lat = row['min_lat']
        max_lat = row['max_lat']
        min_lon = row['min_lon']
        max_lon = row['max_lon']
        district = row['district']
        region = row['region']

        # Fetch NASA data
        nasa_data = fetch_temperature_data_region(min_lat, max_lat, min_lon, max_lon)

        if nasa_data:
            # Convert the response to a pandas DataFrame
            nasa_df = pd.DataFrame(nasa_data['properties']['parameter'])

            # Extract year and month from the index
            nasa_df.index = nasa_df.index.astype(str)  # Convert index to string
            nasa_df['year'] = nasa_df.index.str[:4]    # Extract first 4 characters as year
            nasa_df['month'] = nasa_df.index.str[4:]   # Remaining characters are the month ("13" is the annual value)
            nasa_df['district'] = district
            nasa_df['region'] = region

            # Reorder columns
            nasa_df = nasa_df[['region', 'district', 'year', 'month', 'T2M', 'T2M_MAX', 'T2M_MIN']]

            # Append DataFrame to the list
            all_data_frames.append(nasa_df)

    # Concatenate all DataFrames into a single DataFrame
    all_data = pd.concat(all_data_frames, ignore_index=True)

    return all_data

### Tanzania

In [39]:
# Compute the centroids of the districts
district_centroids = compute_centroids('tanzania_data/tz_shapefiles/tz_districts.shp','region','district')
district_centroids.head()




| | region | district | centroid_lat | centroid_lon |
|---|---|---|---|---|
| 0 | Arusha | Arusha | -3.347952 | 36.68817 |
| 1 | Arusha | Arusha Urban | -3.437964 | 36.675364 |
| 2 | Arusha | Karatu | -3.554529 | 35.43512 |
| 3 | Arusha | Lake Eyasi | -3.585394 | 35.109951 |
| 4 | Arusha | Lake Manyara | -3.515286 | 35.836992 |


In [48]:
#csv_data = fetch_temperature_data_point(latitude, longitude)
#csv_data=fetch_temperature_data_point(-3.347952, 36.688170)
districts_bounds = compute_district_bounds('tanzania_data/tz_shapefiles/tz_districts.shp')
districts_bounds.head()

| | region | district | min_lon | max_lon | min_lat | max_lat |
|---|---|---|---|---|---|---|
| 0 | Arusha | Arusha | 36.46981 | 36.889061 | -3.65555 | -3.066394 |
| 1 | Arusha | Arusha Urban | 36.583969 | 36.767124 | -3.558246 | -3.342853 |
| 2 | Arusha | Karatu | 34.750038 | 35.976353 | -3.946814 | -3.032225 |
| 3 | Arusha | Lake Eyasi | 34.78899 | 35.355385 | -3.825978 | -3.352262 |
| 4 | Arusha | Lake Manyara | 35.766632 | 35.891113 | -3.625842 | -3.421769 |


In [29]:
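# Fetch a single district centroid (Arusha) to inspect the raw response
csv_data = fetch_temperature_data_point(-3.347952, 36.688170)
if csv_data:
    # Convert the response to a pandas DataFrame
    nasa_df = pd.DataFrame(csv_data['properties']['parameter'])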

In [35]:
nasa_df.columns

Index(['T2M', 'T2M_MAX', 'T2M_MIN'], dtype='object')

In [32]:
nasa_df.index

Index(['199001', '199002', '199003', '199004', '199005', '199006', '199007',
       '199008', '199009', '199010',
       ...
       '202204', '202205', '202206', '202207', '202208', '202209', '202210',
       '202211', '202212', '202213'],
      dtype='object', length=429)
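
The index above has 429 entries, which is 33 years times 13, and ends in `202213`: each year carries twelve monthly keys plus a thirteenth key for the annual value. A minimal sketch of separating the two is shown below. The helper name `split_monthly_and_annual` is hypothetical and the sample values are made up.

```python
import pandas as pd

# Made-up miniature of the 'parameter' block of a response
sample_parameter = {
    "T2M": {"199001": 22.0, "199002": 23.0, "199013": 22.5},
    "T2M_MAX": {"199001": 32.0, "199002": 35.0, "199013": 35.0},
}


def split_monthly_and_annual(parameter):
    """Return (monthly, annual) DataFrames from a YYYYMM-keyed parameter dict."""
    df = pd.DataFrame(parameter)
    df.index = df.index.astype(str)
    df["year"] = df.index.str[:4].astype(int)
    df["month"] = df.index.str[4:].astype(int)
    df = df.sort_values(["year", "month"])
    is_annual = df["month"] == 13            # key YYYY13 holds the annual value
    monthly = df.loc[~is_annual].reset_index(drop=True)
    annual = df.loc[is_annual].drop(columns="month").reset_index(drop=True)
    return monthly, annual


monthly, annual = split_monthly_and_annual(sample_parameter)
```

Keeping the annual rows apart avoids treating "month 13" as a calendar month when the series is later converted to dates or aggregated.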

In [44]:
#Fetch the data
#tz_temperature = compute_point_temperature(district_centroids)

In [45]:
tz_temperature.head(14)

| | region | district | year | month | T2M | T2M_MAX | T2M_MIN |
|---|---|---|---|---|---|---|---|
| 0 | Arusha | Arusha | 1990 | 1 | 22.65 | 32.57 | 14.08 |
| 1 | Arusha | Arusha | 1990 | 2 | 23.51 | 35.32 | 16.51 |
| 2 | Arusha | Arusha | 1990 | 3 | 21.68 | 32.15 | 15.95 |
| 3 | Arusha | Arusha | 1990 | 4 | 20.7 | 27.09 | 15.88 |
| 4 | Arusha | Arusha | 1990 | 5 | 19.55 | 27.15 | 13.61 |
| 5 | Arusha | Arusha | 1990 | 6 | 18.64 | 28.42 | 11.02 |
| 6 | Arusha | Arusha | 1990 | 7 | 17.95 | 27.48 | 9.75 |
| 7 | Arusha | Arusha | 1990 | 8 | 19.59 | 29.85 | 12.51 |
| 8 | Arusha | Arusha | 1990 | 9 | 21.35 | 32.05 | 13.76 |
| 9 | Arusha | Arusha | 1990 | 10 | 22.27 | 32.19 | 13.47 |


In [46]:
# Save the combined monthly temperature data to a CSV file
tz_temperature.to_csv('tanzania_data/tz_monthly_temperature_data.csv', index=False)