# EXPLORATORY DATA ANALYSIS
1) Describe the Data
2) Trends
3) Summary tables
4) Drop unneeded columns
5) Drop duplicates
6) Drop outliers
 
## Update from Bhuwan & possible tasks for next week and beyond: (3/27/2024)
### * Task 1 - Create summary statistics from this data and look for any relationship between storm events and crop damage
  - I have added a storm event line shapefile. It has details about the storm intensity, etc.
  - The file is called “stormevents_line_details.shp” and it's in [this folder](https://drive.google.com/drive/u/1/folders/1EWWnjXzVrVY3GUTp3hwoWEh5UsmP2E_E).
  - Maybe we can create a smaller buffer. I used a 5-mile buffer around the line, and I think it was too big.
  - You will have to aggregate the storm events data to the county level to compare it with crop loss, because crop loss is available at the county level only.
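
The county-level aggregation described above can be sketched with pandas. The table here is made-up illustration data, and the column names `EVENT_ID`, `COUNTYFP`, and `END_YRMO` are assumptions about what the storm/county join produces; substitute whatever columns the real join returns.

```python
import pandas as pd

# Hypothetical storm events already matched to counties (illustration only).
events = pd.DataFrame({
    "EVENT_ID": [1, 1, 2, 3, 4],
    "COUNTYFP": ["125", "125", "125", "099", "099"],
    "END_YRMO": [201404, 201404, 201405, 201404, 201404],
})

# Count distinct events per county and month. nunique() avoids double counting
# an event that appears more than once for the same county.
county_counts = (
    events.groupby(["COUNTYFP", "END_YRMO"])["EVENT_ID"]
    .nunique()
    .reset_index(name="EVENT_COUNT")
)
print(county_counts)
```

The resulting table has one row per county and month, which is the same grain as a county-level crop loss table and can be merged with it.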

## Update from Bhuwan & possible tasks for next week and beyond: (3/30/2024)
### * Task 2 - Mean monthly wind speed (netcdf data)
  - Considering the time constraints, let’s use the mean monthly wind speed data from [this folder](https://drive.google.com/drive/folders/1zxvOB6XwXkiOWh1QhQyzTEd5VrL87Puw?usp=drive_link).
  - The data source is [climatologylab.org](https://www.climatologylab.org/gridmet.html) 
  - Pick any one year and analyze (1) the spatial and temporal trends of wind speed in the study area and (2) its relationship with crop loss (correlation coefficient).
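
For the correlation step, a minimal sketch of the Pearson correlation coefficient in plain Python is shown here. The numbers are invented for illustration and are not taken from the gridMET or crop loss files; `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up values: county mean wind speed (m/s) and crop loss (USD).
wind_speed = [3.1, 3.8, 4.4, 4.0, 3.5, 4.9]
crop_loss = [12000, 15500, 21000, 18000, 14000, 26000]

r = pearson_r(wind_speed, crop_loss)
print(f"Pearson r = {r:.3f}")
```

With the real data already in pandas, `Series.corr` computes the same statistic; the hand-written version only makes the formula explicit.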






# Getting started
Install the latest [MiniConda](https://docs.anaconda.com/free/miniconda/) for your operating system
- When asked, I recommend you install for 'Just me', not 'All users'
- This is important if you are not admin on your system

Create your environment from a terminal (for example, the Terminal tab in PyCharm)
- Replace 'myenv' with whatever name you want to give your new environment
- I'm calling mine 'WindBreaks'

Install these packages with the '--yes' option to keep it moving without asking questions
- ... or leave that option out if you want to see what is being upgraded, downgraded, or installed for compatibility and dependencies

Here is an example:

    # Create new env (include python so that `python -m pip` below runs inside the env)
    conda create --name myenv python
    # Activate new env 
    conda activate myenv
    # Install the packages
    python -m pip install jupyter
    conda install conda-content-trust --yes
    conda install -c anaconda ipykernel --yes
    conda install -c conda-forge IPython --yes
    conda install -c conda-forge netCDF4 --yes
    conda install -c conda-forge rasterio --yes
    conda install -c conda-forge numpy --yes
    conda install -c conda-forge xarray --yes
    conda install -c pyviz hvplot --yes
    conda install -c conda-forge holoviews --yes
    conda install -c anaconda pandas --yes
    conda install -c conda-forge geopandas --yes
    conda install -c conda-forge rasterstats --yes
    conda install -c anaconda seaborn --yes
    conda install -c conda-forge matplotlib --yes
    conda install -c anaconda regex --yes
    conda install -c conda-forge cartopy --yes
    conda install -c conda-forge ipywidgets --yes

    # Create the Jupyter kernel for your notebook
    python -m ipykernel install --user --name WindBreaks --display-name "WindBreaks"

    
## Possible additional resources:
### Tutorials
   #### Exploratory Analysis
   - [Tutorial 1](https://www.geeksforgeeks.org/quick-guide-to-exploratory-data-analysis-using-jupyter-notebook/)
   - [YOUR Data Teacher (YouTube video)](https://www.youtube.com/watch?v=iZ2MwVWKwr4)

# Project Notes:
## Standardization:
### Colors
  - April red hex E41A1C
  - May blue hex 377EB8
  - June green hex 4DAF4A
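
One way to keep plots consistent is to hold these colors in a single mapping and look them up by month. This is only a sketch; `MONTH_COLORS` and `month_color` are hypothetical names, not part of the project code.

```python
# Month-to-color mapping taken from the list above.
MONTH_COLORS = {
    "April": "#E41A1C",  # red
    "May": "#377EB8",    # blue
    "June": "#4DAF4A",   # green
}

def month_color(month):
    """Return the standard hex color for a month name (case-insensitive)."""
    return MONTH_COLORS[month.strip().capitalize()]

print(month_color("may"))
```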

# Data Sources:
- [NOAA Local Climate Data (LCD)](https://www.ncei.noaa.gov/maps/lcd/)
- [NOAA Storm Events (NCEI)](https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/)
- [Copernicus Climate Data Store (CDS)](https://cds.climate.copernicus.eu/cdsapp#!/home)
- [Climatology Lab](https://www.climatologylab.org/gridmet.html) 


# Prepare the EDA Environment

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import os
import glob
import regex as re
import netCDF4 as nc
import rasterio
import numpy as np
import xarray as xr
import hvplot.xarray
import holoviews as hv 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import geopandas as gpd
import cartopy.crs as ccrs
import cartopy.feature as cf
import ipywidgets as widgets
import matplotlib.ticker as mticker

In [None]:
from bokeh.io import output_notebook, show
from bokeh.resources import INLINE
from rasterio.transform import from_origin
from rasterstats import zonal_stats
from shapely.geometry import LineString
from matplotlib.path import Path
from matplotlib.colors import Normalize
from netCDF4 import Dataset
from pyproj import CRS
from IPython.display import display
import chardet
from cartopy.mpl.gridliner import LONGITUDE_FORMATTER, LATITUDE_FORMATTER

In [None]:
# Set up Bokeh to display plots inline in the notebook.
output_notebook(INLINE)
hv.extension('bokeh')
%matplotlib inline

## Set data source variables

In [None]:
# Path to the directory
directory = 'Data'

# Check if the directory exists
if os.path.isdir(directory):
    src_dir = directory
else:
    src_dir = None

print(src_dir)

## Set the extents of the Area of Interest (AOI)

In [None]:
# Project extents
extent_coords = {'min_lat': 36.998665, 'max_lat': 37.734463,
                 'min_lon': -95.964735, 'max_lon': -94.616789}

In [None]:
# Use Jupyter magic to list all variables and functions loaded in the interactive workspace
%whos

# Define Functions for EDA

## - Extent Filter Function that works with the variable 'extent_coords'

In [None]:
def filter_dataframe_on_extent(df, extent_coords, lat1, lon1, lat2, lon2):
    min_lat, max_lat = extent_coords['min_lat'], extent_coords['max_lat']
    min_lon, max_lon = extent_coords['min_lon'], extent_coords['max_lon']
    return df[
        ((df[lat1] >= min_lat) & (df[lat1] <= max_lat) &
         (df[lon1] >= min_lon) & (df[lon1] <= max_lon)) |
        ((df[lat2] >= min_lat) & (df[lat2] <= max_lat) &
         (df[lon2] >= min_lon) & (df[lon2] <= max_lon))
        ]
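
The logic of the extent filter can be checked on single coordinate pairs without pandas. The sketch below mirrors the same "begin point or end point inside the box" rule; `point_in_extent` and `segment_touches_extent` are hypothetical helper names, and the two tracks are invented.

```python
# Same extent as the notebook's `extent_coords`.
extent_coords = {'min_lat': 36.998665, 'max_lat': 37.734463,
                 'min_lon': -95.964735, 'max_lon': -94.616789}

def point_in_extent(lat, lon, extent):
    """True if (lat, lon) lies inside the extent, edges included."""
    return (extent['min_lat'] <= lat <= extent['max_lat'] and
            extent['min_lon'] <= lon <= extent['max_lon'])

def segment_touches_extent(begin, end, extent):
    """True if either (lat, lon) endpoint of a storm track is inside the extent."""
    return point_in_extent(*begin, extent) or point_in_extent(*end, extent)

# Hypothetical tracks: one starts inside the AOI, one is entirely outside.
inside = segment_touches_extent((37.2, -95.5), (38.1, -94.0), extent_coords)
outside = segment_touches_extent((39.0, -97.0), (39.5, -96.5), extent_coords)
print(inside, outside)
```

Because only the endpoints are tested, a track that crosses the AOI while starting and ending outside it is not selected. A geometric intersection test would be needed to catch that case.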



## - Convert Geodataframe to a Shapefile

In [None]:
def convert_geodataframe_to_shapefile(gdf, output_file, extent_coords):
    """
    Filter a GeoDataFrame to the records that begin or end inside the given extent
    and save the result as a shapefile.

    A row with missing coordinates in both the begin and end pair never passes the
    filter, because comparisons against NaN evaluate to False.

    Args:
    gdf: GeoDataFrame: The input GeoDataFrame.
    output_file: str: Path of the output shapefile.
    extent_coords: dict:
        Bounding coordinates of the filter area, applied to both the
        'BEGIN_LAT'/'BEGIN_LON' pair and the 'END_LAT'/'END_LON' pair.
        Keys are 'min_lat', 'max_lat', 'min_lon', 'max_lon'.
    Returns:
    None
    """

    # Filter the records that either start or end within the given extents
    gdf_filtered = filter_dataframe_on_extent(gdf, extent_coords, 'BEGIN_LAT', 'BEGIN_LON', 'END_LAT', 'END_LON')

    # Save the GeoDataFrame as a shapefile
    gdf_filtered.to_file(output_file)


## - Create Buffer Shapefile from Geodataframe

In [None]:
def create_buffer(gdf, distance, output_path):
    """
    Creates a buffer around the geometries in a given GeoDataFrame and saves the output to a shapefile.

    Args:
    gdf: A GeoDataFrame containing the geometries to buffer
    distance: The distance to buffer around each geometry. This can be a numeric value or a string. 
        If a numeric value, the units should be in miles for CRS in State Plane or kilometers for all others. 
        If a string, it should end with 'm' for meters or 'ft' for feet.
    output_path: The path to the output shapefile

    Returns:
    A new GeoDataFrame with buffered geometries in the original CRS. Also writes the new GeoDataFrame to a shapefile.
    """
    gdf = gdf.copy()  # work on a copy so the caller's GeoDataFrame is not modified
    original_crs = gdf.crs
    crs_unit = original_crs.axis_info[0].unit_name.lower()

    # Degrees are not a usable buffer unit, so buffer in World Mercator (metres) instead
    if crs_unit == 'degree':
        gdf = gdf.to_crs(epsg=3395)
        crs_unit = gdf.crs.axis_info[0].unit_name.lower()

    is_foot = crs_unit in ('us survey foot', 'foot')
    is_metre = crs_unit in ('metre', 'meter')  # pyproj reports 'metre'

    buffer_distance = distance
    if isinstance(distance, str):
        # Check 'ft' before 'm' and slice off the suffix instead of stripping characters
        if distance.endswith('ft'):
            buffer_distance = float(distance[:-2])
            if is_metre:
                buffer_distance /= 3.281
        elif distance.endswith('m'):
            buffer_distance = float(distance[:-1])
            if is_foot:
                buffer_distance *= 3.281
        else:
            raise ValueError("A string distance must end with 'm' or 'ft'")
    elif crs_unit == 'us survey foot':
        buffer_distance = distance * 5280.01016  # miles to feet
    elif crs_unit == 'foot':
        buffer_distance = distance * 5280  # miles to feet
    elif is_metre:
        buffer_distance = distance * 1000  # kilometers to meters

    gdf['geometry'] = gdf.geometry.buffer(buffer_distance)

    # Return to the original CRS if the data was reprojected for buffering
    if original_crs != gdf.crs:
        gdf = gdf.to_crs(original_crs)

    gdf.to_file(output_path)

    return gdf
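
The unit handling in `create_buffer` can be isolated into a small pure-Python helper, which makes it easy to check without any GIS data. This is a sketch rather than the notebook's implementation: `to_crs_units` and `FEET_PER_METRE` are hypothetical names, and the helper treats the US survey foot and the international foot as equal (they differ by about 2 parts per million).

```python
FEET_PER_METRE = 3.28084

def to_crs_units(distance, crs_unit):
    """Convert a buffer distance to the linear unit of a projected CRS.

    distance: a number (miles for a foot-based CRS, kilometres for a
              metre-based CRS) or a string such as '500m' or '3000ft'.
    crs_unit: lower-cased unit name, e.g. 'metre' or 'us survey foot'.
    """
    is_foot = crs_unit in ('us survey foot', 'foot')
    is_metre = crs_unit in ('metre', 'meter')
    if not (is_foot or is_metre):
        raise ValueError(f"unsupported CRS unit: {crs_unit!r}")

    if isinstance(distance, str):
        text = distance.strip().lower()
        if text.endswith('ft'):
            feet = float(text[:-2])
            return feet if is_foot else feet / FEET_PER_METRE
        if text.endswith('m'):
            metres = float(text[:-1])
            return metres if is_metre else metres * FEET_PER_METRE
        raise ValueError(f"distance string must end in 'm' or 'ft': {distance!r}")

    # Bare numbers: miles in a foot-based CRS, kilometres in a metre-based CRS.
    return distance * 5280 if is_foot else distance * 1000

print(to_crs_units(3, 'us survey foot'))  # 3 miles expressed in feet
print(to_crs_units('3000ft', 'metre'))    # 3000 feet expressed in metres
```

Keeping the conversion separate from the geometry work means the buffer call itself only ever sees a plain number in CRS units.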

## - Data munging function customized for Storm Event csv's from NCEI

In [None]:
# TODO: BROKEN, but I need to sleep. Tackle tomorrow

# def modify_csv(filename: str):
#     """
#     This function modifies the content of a CSV file from the Storm Events dataset.
# 
#     Parameters:
#     filename (str): The full path of the original CSV file to be modified.
# 
#     Returns:
#     None
#     """
# 
#     # Extract base name from filename without year and extension
#     base_name = os.path.splitext(os.path.basename(filename))[0].rsplit('_', 1)[0]
# 
#     # Skip processing if the file has already been processed
#     if base_name in existing_files:
#         print(f"{base_name} already exists. Skipping...")
#         return
# 
#     # If the file is a 'locations' file, check if the corresponding 'details' file has been processed
#     if "locations" in filename and base_name.replace('locations', 'details') not in existing_files:
#         print(f"The corresponding details file for {base_name} has not been processed yet. Skipping...")
#         return
# 
#     # Read the file using detected encoding
#     rawdata = open(filename, 'rb').read()
#     result = chardet.detect(rawdata)
#     encoding = result['encoding']
#     df = pd.read_csv(filename, encoding=encoding)
# 
#     # Modify the latitudes and longitudes in the dataframe
#     if 'LAT2' in df.columns:
#         df['LAT2'] = df['LAT2'].astype(str).str.lstrip('-').apply(
#             lambda x: x[:2] + '.' + x[2:].replace('.', '')).astype(float)
# 
#     if 'LON2' in df.columns:
#         df['LON2'] = df['LON2'].astype(str).str.lstrip('-').apply(
#             lambda x: '-' + x[:2] + '.' + x[2:].replace('.', '')).astype(float)
# 
#     # Prepare a new filename
#     new_filename = os.path.join(os.path.dirname(filename), f'{base_name}.csv')
# 
#     # Save the modified dataframe into the new file
#     df.to_csv(new_filename, index=False)
# 
#     print(f"File {base_name} has been processed and saved as {new_filename}.")

## - Join Storm Event Tables

In [None]:
# TODO: BROKEN, but I need to sleep. Tackle tomorrow

# def join_and_save(src_dir, year, event_id_exceptions=[], extent_coords=None):
#     # Define all matching CSV files
#     all_files = glob.glob(os.path.join(src_dir, 'Storm_event/StormEvents_*-ftp*d*_c*.csv'))
#     # Get a list of existing files without the directory and extension
#     existing_files = [os.path.splitext(os.path.basename(f))[0] for f in all_files]
# 
#     # Check if a file for this year's details already exists
#     if any(f'StormEvents_details_{year}' in filename for filename in existing_files):
#         print(f"File for details of year {year} already exists.")
#         details = pd.read_csv(os.path.join(src_dir, f'Storm_event/StormEvents_details_{year}.csv'))
#     else:
#         print(f"Unable to find details file for year {year}.")
#         return
# 
#     # Only process 'locations' file if 'details' file already exists and 'locations' does not
#     if any(f'StormEvents_locations_{year}' in filename for filename in existing_files):
#         print(f'File for locations of year {year} already exists. Skipping...')
#         return
#     else:
#         print(f'Processing locations file for year {year}.')
#         locations = pd.read_csv(os.path.join(src_dir, f'Storm_event/StormEvents_locations_{year}.csv')).drop(
#             columns='EPISODE_ID')
# 
#     # Separate out specific records
#     details_to_add = details[details['EVENT_ID'].isin(event_id_exceptions)]
#     locations_to_add = locations[locations['EVENT_ID'].isin(event_id_exceptions)]
# 
#     # Remove these specific records from details and locations before filtering
#     details = details[~details['EVENT_ID'].isin(event_id_exceptions)]
#     locations = locations[~locations['EVENT_ID'].isin(event_id_exceptions)]
# 
#     if extent_coords:
#         details = filter_dataframe_on_extent(details, extent_coords, 'BEGIN_LAT', 'BEGIN_LON', 'END_LAT', 'END_LON')
#         locations = filter_dataframe_on_extent(locations, extent_coords, 'LATITUDE', 'LONGITUDE', 'LAT2', 'LON2')
# 
#         # Append the separated records back to details and locations after filtering
#         details = pd.concat([details, details_to_add])
#         locations = pd.concat([locations, locations_to_add])
# 
#     details.columns = [col + '_det' if col not in ['EVENT_ID', 'EPISODE_ID'] else col for col in details.columns]
#     locations.columns = [
#         col + '_loc' if col not in ['EVENT_ID', 'EPISODE_ID', 'LATITUDE', 'LONGITUDE', 'LAT2', 'LON2'] else col for col
#         in locations.columns]
# 
#     merged_df = pd.merge(details, locations, on='EVENT_ID', how='outer')
# 
#     column_order = ['EVENT_ID', 'EPISODE_ID'] + [col for col in merged_df.columns if
#                                                  col not in ['EVENT_ID', 'EPISODE_ID']]
#     merged_df = merged_df[column_order]
# 
#     merged_df.to_csv(os.path.join(src_dir, f'Storm_event/StormEvents_{year}.csv'), index=False)

## Load basemaps and boundary files for AOI

In [None]:
# Load the county boundary shapefile
sixco_fn = os.path.join(src_dir, 'GIS_files/KS_six_co_bo.shp')
sixco_data = gpd.read_file(sixco_fn)

# Get the CRS
crs = CRS(sixco_data.crs)

# Name of the CRS
print("Name:", crs.name)
# EPSG of the CRS
if crs.to_epsg():
    print('EPSG:', crs.to_epsg())
else:
    print('No EPSG found for this CRS')
# Unit of the CRS
print("Unit:", crs.axis_info[0].unit_name)

# Preview Data
sixco_data.head()

In [None]:
sixco_data = sixco_data.to_crs("EPSG:6469")
sixco_data.crs

# Examine Storm Event Data

## - Examine Storm Event Shapefiles

In [None]:
# Load the Storm event data
input_shp_file = os.path.join(src_dir, 'Storm_event/stormevents_line_details.shp')

# Load original data
original_data = gpd.read_file(input_shp_file)

# Preview Data
original_data.head()

In [None]:
# get column names with their index
for i, col_name in enumerate(original_data.columns):
    print(f"Index: {i}, Column Name: {col_name}")

In [None]:
# Modify data as needed
# # Drop duplicates and redundant data
# # Shorten column names to the ten-character maximum for shapefiles
# # Standardize related data
modified_data = original_data.drop(columns='EVENT_ID_1').rename(columns={'BEGIN_YEAR': 'BEGIN_YRMO', 'END_YEARMO': 'END_YRMO'}) 

# Check your modifications
for i, col_name in enumerate(modified_data.columns):
    print(f"Index: {i}, Column Name: {col_name}")

In [None]:
# Get the CRS
crs = CRS(modified_data.crs)

# Name of the CRS
print("Name:", crs.name)
# EPSG of the CRS
if crs.to_epsg():
    print('EPSG:', crs.to_epsg())
else:
    print('No EPSG found for this CRS')
# Unit of the CRS
print("Unit:", crs.axis_info[0].unit_name)


In [None]:
# Reproject / transform to state plane
modified_data = modified_data.to_crs("EPSG:6469")
# Get the CRS
crs = CRS(modified_data.crs)

# Name of the CRS
print("Name:", crs.name)
# EPSG of the CRS
if crs.to_epsg():
    print('EPSG:', crs.to_epsg())
else:
    print('No EPSG found for this CRS')
# Unit of the CRS
print("Unit:", crs.axis_info[0].unit_name)

In [None]:
# Save modified data as a new Shapefile
modified_shp_file = os.path.join(src_dir, 'Storm_event/modified_storm_events.shp')
modified_data.to_file(modified_shp_file)

In [None]:
# Filter the GeoDataFrame based on extent and save as a new shapefile
output_shp_file = os.path.join(src_dir, 'Storm_event/filtered_storm_events.shp')
convert_geodataframe_to_shapefile(modified_data, output_shp_file, extent_coords)

# Load the filtered data
se_gdf_filtered = gpd.read_file(output_shp_file)
se_gdf_filtered.crs

In [None]:
# Specify the path to the input geodataframe
input_gdf = se_gdf_filtered
# Specify the buffer distance as a float or int
buf_dist = 3
# or you can specify a str if you want to buffer by meters or feet instead of kilometers and miles, respectively
# buf_dist = '3000ft'
buf_dist_str = str(buf_dist).replace('.', '_')
# Specify the path to the output shapefile
output_path = os.path.join(src_dir, f'Storm_event/storm_line_{buf_dist_str}_buf.shp')

# Create a buffered GeoDataFrame
# Note: For Geographic Coordinate Systems, provide buffer distance in kilometers. 
# For State Plane Coordinate Systems, provide buffer distance in miles.
buffered_storm_events = create_buffer(input_gdf, buf_dist, output_path)

In [None]:
# Report the date range of the filtered storm events
# (the grouping by county and month is commented out and not applied yet)
se_gdf = se_gdf_filtered #.groupby(['county', 'END_YEARMO']).sum().reset_index()

min_value = se_gdf['BEGIN_YRMO'].min()
max_value = se_gdf['END_YRMO'].max()

print('Minimum value in BEGIN_YRMO column:', min_value)
print('Maximum value in END_YRMO column:', max_value)


## - Create Wind Event Relate Tables (under construction)

In [None]:
# TODO: BROKEN, but I need to sleep. Tackle tomorrow

# # HERE ARE THREE OPTIONS FOR CALLING THE STORM EVENT DATA MUNGE!! PLEASE COMMENT OUT ALL BUT ONE METHOD!!
# 
# # Keep this part active....
# # Define all matching CSV files
# all_files = glob.glob(os.path.join(src_dir, 'Storm_event/StormEvents_*-ftp*d*_c*.csv'))
# # Get a list of existing files without the directory and extension
# existing_files = [os.path.splitext(os.path.basename(f))[0] for f in all_files]
# 
# Choose 1), 2), or 3). Comment the two unused options out.
# # 1) Define year specific files
# years = [2014, 2016, 2018]  # or any other specific years
# for year in years:
#     # Check if a file for this year's details already exists
#     if any(f'StormEvents_details_{year}' in filename for filename in existing_files):
#         print(f"File for details of year {year} already exists. Skipping...")
#     else:
#         # If the file doesn't exist, find the original file and modify it
#         detail_file = glob.glob(os.path.join(src_dir, f'Storm_event/StormEvents_details-ftp_v1.0_d{year}_c*.csv'))
#         for f in detail_file:
#             modify_csv(f)
#     # Only process 'locations' file if 'details' file already exists and 'locations' does not
#     if any(f'StormEvents_details_{year}' in filename for filename in existing_files) and not any(
#             f'StormEvents_locations_{year}' in filename for filename in existing_files):
#         # If the 'locations' file doesn't exist, find the original file and modify it
#         location_file = glob.glob(os.path.join(src_dir, f'Storm_event/StormEvents_locations-ftp_v1.0_d{year}_c*.csv'))
#         for f in location_file:
#             modify_csv(f)
# 
# # 2) Define range of year files
# for year in range(2011, 2021):  # or any other range of years
#     # Check if a file for this year's details already exists
#     if any(f'StormEvents_details_{year}' in filename for filename in existing_files):
#         print(f"File for details of year {year} already exists. Skipping...")
#     else:
#         # If the file doesn't exist, find the original file and modify it
#         detail_file = glob.glob(os.path.join(src_dir, f'Storm_event/StormEvents_details-ftp_v1.0_d{year}_c*.csv'))
#         for f in detail_file:
#             modify_csv(f)
#     # Only process 'locations' file if 'details' file already exists and 'locations' does not
#     if any(f'StormEvents_details_{year}' in filename for filename in existing_files) and not any(
#             f'StormEvents_locations_{year}' in filename for filename in existing_files):
#         # If the 'locations' file doesn't exist, find the original file and modify it
#         location_file = glob.glob(os.path.join(src_dir, f'Storm_event/StormEvents_locations-ftp_v1.0_d{year}_c*.csv'))
#         for f in location_file:
#             modify_csv(f)
# 
# # 3) Define all matching CSV files
# detail_files = glob.glob(os.path.join(src_dir, 'Storm_event/StormEvents_details-ftp*_d*_c*.csv'))
# location_files = glob.glob(os.path.join(src_dir, 'Storm_event/StormEvents_locations-ftp*_d*_c*.csv'))
# 
# # Filter filenames to get base name without year and extension
# existing_files = [os.path.splitext(os.path.basename(f))[0].rsplit('_', 1)[0] for f in all_files]
# 
# # Check for each 'details' file
# for f in detail_files:
#     base_name = os.path.splitext(os.path.basename(f))[0].rsplit('_', 1)[0]
#     if base_name not in existing_files:
#         modify_csv(f)
#         print(f"{base_name} processed.")
#     else:
#         print(f"{base_name} already exists. Skipping...")
# 
# # Check for each 'locations' file where 'details' file exists and 'locations' does not
# for f in location_files:
#     base_name = os.path.splitext(os.path.basename(f))[0].rsplit('_', 1)[0]
#     if base_name.replace('locations', 'details') in existing_files and base_name not in existing_files:
#         modify_csv(f)
#         print(f"{base_name} processed.")
#     else:
#         print(f"{base_name} already exists or corresponding details file does not exist. Skipping...")


In [None]:
# TODO: BROKEN, but I need to sleep. Tackle tomorrow

# # List of Event IDs to be kept even after extent filtering (optional)
# event_id_exceptions = []
# '''[498859, 508406, 508407, 508408, 508409, 508413, 508414, 508440, 508468, 508476, 508480, 508481,
#    508482, 516243, 516244, 516245, 516246, 526836, 526837, 533868, 542919, 543368, 543376, 543645,
#    543646, 543667, 627803, 627804, 627806, 630383, 630387, 632580, 633147, 636638, 655474, 659269,
#    659272, 659273, 659274, 659292, 660117, 660118, 661872, 661873, 662309, 662310, 663504, 663539,
#    663986, 663989, 663991, 754111, 755412, 756557, 756559, 756560, 756562, 756563, 756564, 756569,
#    756576, 756578, 756608, 756609, 756611, 756861, 765900, 765905, 772228, 772229, 774573, 774574,
#    779898, 779899, 780508, 780781, 787624, 787627, 787630, 792763]
#    '''
# # Define all matching CSV files
# all_files = glob.glob(os.path.join(src_dir, 'Storm_event/StormEvents_*d*_c*.csv'))
# details_files = [f for f in all_files if "details" in f]
# locations_files = [f for f in all_files if "locations" in f]
# 
# detail_names = [os.path.splitext(os.path.basename(f))[0] for f in details_files]
# location_names = [os.path.splitext(os.path.basename(f))[0] for f in locations_files]
# 
# # # 1) Define year specific files
# # years = [2014, 2016, 2018]  # or any other specific years
# # for year in years:
# #     if (f'StormEvents_details_{year}' in detail_names) and (f'StormEvents_locations_{year}' not in location_names):
# #         join_and_save(src_dir, year)
# # 
# # # 2) Define range of year files
# # for year in range(2011, 2021):  # or any other range of years
# #     if (f'StormEvents_details_{year}' in detail_names) and (f'StormEvents_locations_{year}' not in location_names):
# #         join_and_save(src_dir, year)
# 
# # 3) Define all matching CSV files
# for detail in detail_names:
#     year = detail.split("_")[-1]
#     if f'StormEvents_locations_{year}' not in location_names:
#         join_and_save(src_dir, year)


In [None]:
%whos

In [None]:
se_gdf_filtered.crs

In [None]:
# Load the storm line data from a shapefile
buf_dist_str = '3'
stormbuf_gdf = gpd.read_file(os.path.join(src_dir, f'Storm_event/storm_line_{buf_dist_str}_buf.shp'))
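# Perform the spatial join (intersection)
# `predicate` replaces the older `op` keyword, which current GeoPandas releases no longer accept
intersect_gdf = gpd.sjoin(sixco_data, stormbuf_gdf, how="inner", predicate='intersects')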
intersect_gdf.head()

In [None]:
print(sixco_data.crs)
print(stormbuf_gdf.crs)
print(intersect_gdf.columns.tolist())
print(stormbuf_gdf.columns.tolist())

In [None]:
events_count_by_year_month_county = intersect_gdf.groupby(['END_YRMO', 'COUNTYFP'])['EVENT_ID'].nunique()

print(events_count_by_year_month_county)

# Examine Crop Loss Data (under construction)

In [None]:
# Load the pipe delimited file into a DataFrame without headers
col_tbl = pd.read_csv(os.path.join(src_dir, 'crop_loss_COL/colsom_2014.txt'), sep='|', header=None)
col_tbl.head()

In [None]:
# Load the data dictionary (a CSV file) that holds the column names
code_header_tbl = pd.read_csv(os.path.join(src_dir, 'crop_loss_COL/dictionary_colsommonth_allyears.csv'), header=None)
code_header_tbl.head()

In [None]:
# Get the headers from the 2nd column (index 1) of the data dictionary DataFrame
tbl_headers = code_header_tbl.iloc[:, 1]

# Set the headers in the DataFrame
col_tbl.columns = tbl_headers
col_tbl.head()

In [None]:
# Get unique values
unique_states_col_tbl = col_tbl['State Code'].unique()
unique_states_sixco_data = sixco_data['STATEFP'].unique()
# Display unique values
print(unique_states_col_tbl)
print(unique_states_sixco_data)

In [None]:
# Convert 'County Code' values to string and pad with leading zeros
col_tbl['County Code'] = col_tbl['County Code'].astype(str).str.zfill(3)
col_tbl = col_tbl[col_tbl['State Code'].isin(sixco_data['STATEFP'].astype(int))]
# Further filter rows where 'County Code' matches 'COUNTYFP'
col_tbl = col_tbl[col_tbl['County Code'].isin(sixco_data['COUNTYFP'].astype(str))]
# # Convert 'County Code'  values back to int
# col_tbl['County Code'] = col_tbl['County Code'].astype(int)
col_tbl.head()
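
The zero-padding step matters because county FIPS codes in the boundary shapefile are three-character strings. A minimal sketch with made-up codes shows why the padded form is needed before matching:

```python
# Hypothetical integer county codes as they might appear in a crop loss table.
raw_county_codes = [1, 37, 125]

# zfill pads on the left with zeros up to a width of three characters.
fips = [str(code).zfill(3) for code in raw_county_codes]
print(fips)

# Without padding, the string '37' would never equal the shapefile's '037'.
shapefile_codes = {'037', '125'}
matches = [code for code in fips if code in shapefile_codes]
print(matches)
```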

In [None]:

# Get unique values
unique_counties = col_tbl['County Code'].unique()
unique_states = col_tbl['State Code'].unique()
# Display unique
print(unique_counties, unique_states)


In [None]:
# Sum 'Indemnity Amount' by 'County Code' and 'Commodity Year Identifier'
indemnity_sum = col_tbl.groupby(['Commodity Year Identifier', 'County Code'])['Indemnity Amount'].sum().reset_index()
print(indemnity_sum)

In [None]:
# TODO: Fix the NaN's
# Transform `events_count_by_year_month_county` into a DataFrame
events_df = events_count_by_year_month_county.reset_index()
events_df.columns = ['END_YRMO', 'COUNTYFP', 'EVENT_COUNT']

# Extract year from 'END_YRMO' and convert to int
# (use .astype(str) instead if 'Commodity Year Identifier' is a string)
events_df['Year'] = events_df['END_YRMO'].astype(str).str[:4].astype(int)

# Merge with `sixco_data` to get 'NAME'
merged_df = pd.merge(events_df, sixco_data[['COUNTYFP', 'NAME']].drop_duplicates(), on='COUNTYFP', how='left')

# Drop missing values from the 'indemnity_sum' dataset first, since NaN cannot be cast to int
indemnity_sum = indemnity_sum.dropna(subset=['County Code', 'Commodity Year Identifier'])
# Convert 'Commodity Year Identifier' in `indemnity_sum` to int
# (remove this line if 'Commodity Year Identifier' is a string)
indemnity_sum['Commodity Year Identifier'] = indemnity_sum['Commodity Year Identifier'].astype(int)

# Merge with `indemnity_sum` to get 'Indemnity Amount'
final_df = pd.merge(merged_df, indemnity_sum, left_on=['COUNTYFP', 'Year'],
                    right_on=['County Code', 'Commodity Year Identifier'], how='left')
print(final_df)
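
One way to investigate the NaNs flagged in the TODO is to ask pandas which rows found no partner in the merge. The sketch below uses invented tables with the same column names; it assumes, without having confirmed it on the real data, that the NaNs come from storm rows with no matching crop loss record (for example, years other than the single crop loss year loaded).

```python
import pandas as pd

# Made-up tables for illustration only.
events_df = pd.DataFrame({
    "COUNTYFP": ["037", "125", "125"],
    "Year": [2014, 2014, 2015],
    "EVENT_COUNT": [2, 1, 3],
})
indemnity_sum = pd.DataFrame({
    "County Code": ["037", "125"],
    "Commodity Year Identifier": [2014, 2014],
    "Indemnity Amount": [10500.0, 8200.0],
})

# indicator=True adds a `_merge` column saying where each row came from.
checked = pd.merge(
    events_df, indemnity_sum,
    left_on=["COUNTYFP", "Year"],
    right_on=["County Code", "Commodity Year Identifier"],
    how="left", indicator=True,
)

# Rows flagged 'left_only' had no crop loss match, so their indemnity is NaN.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["COUNTYFP", "Year", "EVENT_COUNT"]])
```

If the unmatched rows turn out to be expected gaps, switching to `how='inner'` or filling the amount with zero are both options, depending on what a missing record means in the crop loss data.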