# Combining Files - Creating and Cleaning Consolidated Data Files

This notebook adds two necessary columns to our data files and combines the individual `.txt` files into two larger `.txt` files. Execution of this notebook will create `Combined_Reduced_Trackfile.txt` and `Combined_WWLLN_Locations.txt`. This notebook should be executed after the `data_file_cleaning.ipynb` notebook.

We then perform some post-processing on the data by adding column headers, filtering to tropical cyclones that are category 1 or higher, and calculating the direct distance of each lightning strike to the TC storm center. This will create additional `Filtered_Reduced_Trackfile.csv` and `Filtered_WWLLN_Locations.txt` files for use in analysis. 

The last part of this notebook joins the trackfile and WWLLN location data: we bin the lightning data into 30-minute increments and join each bin to the closest storm center timestamp to get the wind speed and pressure data. This portion will create `some file` for use in analysis.

### Combining Files
We use the [Google Drive API](https://developers.google.com/drive/api/guides/about-sdk) to download the files previously uploaded in `data_upload.ipynb` to consolidate the individual files. The first half of the code works if the Google Drive API is already set up (refer to instructions in `data_upload.ipynb`). The code after we create the list of files is not dependent on the Google Drive API.

Let's start by installing necessary packages and then importing them.

In [None]:
# Install necessary packages (Google API clients plus the dataframe libraries used below)
!pip install google-api-python-client google-auth google-auth-oauthlib google-auth-httplib2 polars pandas numpy

In [1]:
# Import packages
import os
import pickle
from io import BytesIO

import polars as pl
from google.auth.transport.requests import Request
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

Similar to the function in `data_file_cleaning.ipynb`, we use the following function to authenticate the Google Drive API. This will open a browser to perform the authentication process. 

Check if a `token.pickle` file already exists before running the following code. If the file exists, it is recommended to delete it before running the code below.

In [17]:
# Scopes for accessing Google Drive
SCOPES = ['https://www.googleapis.com/auth/drive']

# Authenticate and create the service object
def authenticate_drive_api():
    creds = None
    # Token file for saving the authentication
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            creds = pickle.load(token)
    # If there are no credentials, perform authentication
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                'client_secrets.json', SCOPES)  # Ensure 'client_secrets.json' is downloaded from the Google API Console
            creds = flow.run_local_server(port=0)
        # Save the credentials for future use
        with open('token.pickle', 'wb') as token:
            pickle.dump(creds, token)
    return build('drive', 'v3', credentials=creds)

# Initialize the service object
service = authenticate_drive_api()




This next function grabs the list of all files in a specified folder that are not trashed and stores them into a list. Each file has an ID and name attribute that we will use later.

In [27]:
def find_files(folder_id):
    # Query to find files in the specified folder
    query = f"'{folder_id}' in parents and trashed=false"
    files = []

    # List files in the folder and append to list
    page_token = None
    while True:
        response = service.files().list(
            q=query,
            spaces='drive',
            fields='nextPageToken, files(id, name)',
            pageToken=page_token
        ).execute()

        files += response.get('files', [])

        page_token = response.get('nextPageToken', None)
        if page_token is None:
            break
    return files

Call the function to find files in the specified folder. The folder ID can be found as the string after the "folders/" part of the URL for the Google Drive folder. This will give us a list of files to iterate through for the rest of the notebook.

In [28]:
# Get the list of files in the folder
folder_id = '14idmMBbM5xXZg4b61iINHbBTl2z4yLeD'
files = find_files(folder_id)

Next, we split out the tropical cyclone ID and name from each of the files to add as a separate column. We then save the files in the `processed_files` directory for use later.

In [None]:
# Process each file to add cyclone ID and name as columns
# Directory to save the processed files locally
output_dir = "processed_files"
os.makedirs(output_dir, exist_ok=True)

# Process each file
for file in files:
    file_id = file['id']
    file_name = file['name']

    # Extract the cyclone ID and name from the filename
    cyclone_id = '_'.join(file_name.split('_')[:3])
    cyclone_name = file_name.split('_')[3]

    # Download the file content
    request = service.files().get_media(fileId=file_id)
    file_stream = BytesIO()
    downloader = MediaIoBaseDownload(file_stream, request)
    done = False
    while not done:
        status, done = downloader.next_chunk()
    file_stream.seek(0)

    # Add the cyclone ID and name as new columns using Polars
    # (the downloaded bytes are read directly, with no decode/encode round trip)
    df = pl.read_csv(file_stream, separator='\t', has_header=False)
    df = df.with_columns([
        pl.lit(cyclone_id).alias("cyclone_id"),
        pl.lit(cyclone_name).alias("cyclone_name")
    ])

    # Save the modified DataFrame locally
    output_file_path = os.path.join(output_dir, file_name)
    df.write_csv(output_file_path, separator='\t',include_header=False)

    print(f"Processed and saved: {output_file_path}")
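
The filename split used above can be checked in isolation. The following is a minimal sketch, assuming filenames shaped like `ATL_20_28_Zeta_Reduced_Trackfile.txt`; that example name is hypothetical, inferred from the storm codes and file patterns used elsewhere in this notebook, and `parse_cyclone_filename` is an illustrative helper rather than part of the original code.

```python
def parse_cyclone_filename(file_name):
    """Split a filename into (cyclone_id, cyclone_name).

    Assumes the first three underscore-separated tokens form the ID
    (basin, year, storm number) and the fourth token is the storm name.
    """
    parts = file_name.split('_')
    if len(parts) < 4:
        raise ValueError(f"Unexpected filename format: {file_name}")
    cyclone_id = '_'.join(parts[:3])
    cyclone_name = parts[3]
    return cyclone_id, cyclone_name


# Hypothetical filename, used only to illustrate the split
example = parse_cyclone_filename("ATL_20_28_Zeta_Reduced_Trackfile.txt")
print(example)
```

Note that a storm name containing an underscore would be cut at the first underscore under this scheme, so it may be worth spot-checking the `cyclone_name` column after processing.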

Next, we combine each of the trackfiles in the `processed_files` folder into one file, and each of the WWLLN location files into one file. This will give us two output files in the `combined_files` folder - `Combined_Reduced_Trackfile.txt` and `Combined_WWLLN_Locations.txt`. We will use these files as the basis for our subsequent analyses.

In [37]:
import glob

# Directories for processed files and output
input_dir = "processed_files"
output_dir = "combined_files"
os.makedirs(output_dir, exist_ok=True)

# File patterns to combine
patterns = {
    "Reduced_Trackfile": os.path.join(input_dir, "*Reduced_Trackfile*.txt"),
    "WWLLN_Locations": os.path.join(input_dir, "*WWLLN_Locations*.txt")
}

# Combine files based on patterns
for pattern_name, pattern_path in patterns.items():
    combined_content = []
    output_file_path = os.path.join(output_dir, f"Combined_{pattern_name}.txt")

    # Find all matching files
    matching_files = glob.glob(pattern_path)
    print(f"Combining {len(matching_files)} files for pattern '{pattern_name}'...")

    with open(output_file_path, "w") as output_file:
        for file_path in matching_files:
            with open(file_path, "r") as input_file:
                for line in input_file:
                    output_file.write(line)

    print(f"Combined file saved: {output_file_path}")

Combining 992 files for pattern 'Reduced_Trackfile'...
Combined file saved: combined_files/Combined_Reduced_Trackfile.txt
Combining 994 files for pattern 'WWLLN_Locations'...
Combined file saved: combined_files/Combined_WWLLN_Locations.txt


### Cleaning and Processing
In this section we add column headers to the files and filter down to TCs that are category 1 and above. Category 1 is defined by the [Saffir-Simpson Hurricane Wind Scale](https://www.nhc.noaa.gov/aboutsshws.php) as a maximum sustained wind speed of 64-82 kt. We calculate each TC's category from its maximum wind speed using the Saffir-Simpson Scale and save it in a new column. We then calculate the direct distance of each lightning strike to the storm center and flag the strike as inner core or rainband. This section outputs the `Filtered_Reduced_Trackfile.csv` and `Filtered_WWLLN_Locations.txt` files.

Start by importing the necessary libraries and files created earlier.

In [34]:
# import necessary libraries
import pandas as pd
import numpy as np
import polars as pl

In [14]:
# import txt files created earlier
# define the path to file below
trackfile_path = "Combined_Reduced_Trackfile.txt"
wwlln_path = "Combined_WWLLN_Locations.txt"

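# The combined files were written without a header row, so read with header=None
# (otherwise the first data row is consumed as column names)
track_file = pd.read_csv(trackfile_path, sep="\t", header=None)
track_file = track_file.drop(track_file.columns[8], axis=1) # drop the column of zeros

# The WWLLN file is large, so the chunked read below is commented out to avoid
# exhausting memory on smaller machines. Uncomment it to load the lightning data.
# chunksize = 100000  # Process 100,000 rows at a time
# chunks = []

# for chunk in pd.read_csv(
#     wwlln_path,
#     sep="\t",
#     header=None,
#     chunksize=chunksize
# ):
#     chunks.append(chunk)

# locations_WWLLN = pd.concat(chunks, ignore_index=True)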

Let's add headers to the two dataframes for better readability.

In [15]:
track_file.columns = ['year', 'month', 'day','hour','lat','lon','pressure', 'knots', 'storm_code', 'storm_name']
#locations_WWLLN.columns = ['year', 'month', 'day', 'hour', 'min', 'sec','lat','lon','distance_from_storm_center_km_east', 'distance_from_storm_center_km_north', 'storm_code','storm_name']

Next, we process the trackfile data by creating the list of storm codes that meet the category 1 or higher requirement. We will use this list to filter the wind speed/pressure data as well as the lightning data.

In [22]:
# calculate the max wind speed for each storm code
max_wind_speed = track_file.groupby('storm_code').agg(
    max_wind_speed=('knots', 'max')
).reset_index()
max_wind_speed.head()

Unnamed: 0,storm_code,max_wind_speed
0,ATL_10_1,85
1,ATL_10_10,30
2,ATL_10_11,135
3,ATL_10_12,120
4,ATL_10_13,105


In [44]:
# filter by max >= 64 knots
storm_filter = max_wind_speed[max_wind_speed["max_wind_speed"] >= 64].copy()

# calculate the TC category using the max wind speed
storm_filter["category"] = storm_filter["max_wind_speed"].apply(
    lambda x: 1 if 64 <= x <= 82 else (2 if 82 < x <= 95 else (3 if 95 < x <= 112 else (4 if 112 < x <= 136 else (5 if x > 136 else 0))))
)
storm_filter = storm_filter[["storm_code", "category"]]

# extract the basin (the prefix before the first underscore) from the storm code
storm_filter["basin"] = storm_filter["storm_code"].str.extract(r"^(.*?)_")

storm_filter.head()

Unnamed: 0,storm_code,category,basin
0,ATL_10_1,2,ATL
2,ATL_10_11,4,ATL
3,ATL_10_12,4,ATL
4,ATL_10_13,3,ATL
5,ATL_10_14,1,ATL


In [43]:
print(f"Overall number of TCs: {len(max_wind_speed)}, category 1 or higher number of TCs: {len(storm_filter)}")

Overall number of TCs: 982, category 1 or higher number of TCs: 473


In [45]:
# filter the trackfile data by the storm filter
track_file_filtered = track_file[track_file["storm_code"].isin(storm_filter["storm_code"])]
# join the category column by storm code
track_file_filtered = pd.merge(track_file_filtered, storm_filter, how='inner', on='storm_code')

track_file_filtered.head()

Unnamed: 0,year,month,day,hour,lat,lon,pressure,knots,storm_code,storm_name,category,basin
0,2020,10,20,0,12.1,-80.0,0,15,ATL_20_28,Zeta,2,ATL
1,2020,10,20,6,12.5,-80.1,0,15,ATL_20_28,Zeta,2,ATL
2,2020,10,20,12,12.8,-80.2,0,15,ATL_20_28,Zeta,2,ATL
3,2020,10,20,18,13.2,-80.3,0,15,ATL_20_28,Zeta,2,ATL
4,2020,10,21,0,13.8,-80.4,0,15,ATL_20_28,Zeta,2,ATL


Let's save this as a csv file for use in analysis.

In [46]:
track_file_filtered.to_csv('Filtered_Reduced_Trackfile.csv', index=False)

Next, let's focus on the WWLLN dataset. Start by filtering the WWLLN dataset by the storm filter created above.

In [None]:
# filter WWLLN dataset by the storm filter
# (.copy() so that columns added later do not trigger SettingWithCopyWarning)
locations_WWLLN_filtered = locations_WWLLN[locations_WWLLN["storm_code"].isin(storm_filter["storm_code"])].copy()
locations_WWLLN_filtered.head()
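
If the full WWLLN file does not fit in memory, one option is to apply the storm filter to each chunk while reading, so only matching rows are ever kept. The sketch below assumes the combined file is tab-separated with no header row and has the twelve columns listed in the commented-out header assignment earlier; `load_filtered_wwlln` and the sample rows are hypothetical and the values are made up for illustration.

```python
import io

import pandas as pd

WWLLN_COLUMNS = [
    'year', 'month', 'day', 'hour', 'min', 'sec', 'lat', 'lon',
    'distance_from_storm_center_km_east', 'distance_from_storm_center_km_north',
    'storm_code', 'storm_name',
]


def load_filtered_wwlln(source, storm_codes, chunksize=100_000):
    """Read a tab-separated WWLLN file in chunks, keeping only the given storms."""
    keep = set(storm_codes)
    filtered_chunks = []
    reader = pd.read_csv(
        source, sep="\t", header=None, names=WWLLN_COLUMNS, chunksize=chunksize
    )
    for chunk in reader:
        # Filter before storing so unwanted rows are released with each chunk
        filtered_chunks.append(chunk[chunk["storm_code"].isin(keep)])
    if not filtered_chunks:
        return pd.DataFrame(columns=WWLLN_COLUMNS)
    return pd.concat(filtered_chunks, ignore_index=True)


# Made-up sample rows standing in for Combined_WWLLN_Locations.txt
SAMPLE_TEXT = (
    "2020\t10\t20\t0\t5\t12\t12.1\t-80.0\t30.0\t40.0\tATL_20_28\tZeta\n"
    "2010\t9\t1\t3\t15\t40\t20.0\t-60.0\t10.0\t-5.0\tATL_10_10\tExample\n"
    "2020\t10\t20\t1\t45\t3\t12.3\t-80.1\t-150.0\t200.0\tATL_20_28\tZeta\n"
)

filtered = load_filtered_wwlln(io.StringIO(SAMPLE_TEXT), ["ATL_20_28"], chunksize=2)
print(filtered[["storm_code", "storm_name"]])
```

In the notebook itself, `source` would be `wwlln_path` and `storm_codes` would be `storm_filter["storm_code"]`.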

Now let's calculate the direct distance of each lightning strike from the storm center. We have the east and north distances from the center, so the direct distance is the hypotenuse given by the Pythagorean theorem. We also create one indicator for inner core lightning and another for rainband lightning. Inner core is defined as within 100 km of the storm center, while rainband is defined as 200-400 km from the storm center.

In [None]:
# Direct distance from the storm center (hypotenuse of the east/north offsets)
locations_WWLLN_filtered['hypotenuse_distance_from_storm_center'] = np.sqrt(
    locations_WWLLN_filtered['distance_from_storm_center_km_east'] ** 2
    + locations_WWLLN_filtered['distance_from_storm_center_km_north'] ** 2
)
# Inner core: within 100 km of the storm center
locations_WWLLN_filtered["inner_core_ind"] = locations_WWLLN_filtered["hypotenuse_distance_from_storm_center"].apply(
    lambda x: 1 if x <= 100 else 0
)
# Rainband: between 200 and 400 km from the storm center (inclusive)
locations_WWLLN_filtered["rainband_ind"] = locations_WWLLN_filtered["hypotenuse_distance_from_storm_center"].apply(
    lambda x: 1 if (x >= 200 and x <= 400) else 0
)
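
To sanity-check the distance and indicator logic, here is a small self-contained sketch on made-up offsets. It uses `np.hypot` and `Series.between` as a vectorized equivalent of the calculation; the demo frame and the `distance_km` column name are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up east/north offsets (km) chosen to give round distances
demo = pd.DataFrame({
    "distance_from_storm_center_km_east": [30.0, -150.0, 300.0, 0.0],
    "distance_from_storm_center_km_north": [40.0, 200.0, -400.0, 100.0],
})

# Hypotenuse of the east/north offsets; the sign of each offset does not matter
demo["distance_km"] = np.hypot(
    demo["distance_from_storm_center_km_east"],
    demo["distance_from_storm_center_km_north"],
)

# Inner core: within 100 km; rainband: 200-400 km (both bounds inclusive)
demo["inner_core_ind"] = (demo["distance_km"] <= 100).astype(int)
demo["rainband_ind"] = demo["distance_km"].between(200, 400).astype(int)

print(demo[["distance_km", "inner_core_ind", "rainband_ind"]])
```

Strikes between 100 km and 200 km, or beyond 400 km, receive 0 for both indicators under these definitions.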

Let's save this as a txt file for future use.

In [None]:
locations_WWLLN_filtered_pl = pl.from_pandas(locations_WWLLN_filtered)
locations_WWLLN_filtered_pl.write_csv('Filtered_WWLLN_Locations.txt')

### Joining the Data
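This section joins the trackfile and WWLLN location data: the lightning data is binned into 30-minute increments and each bin is joined to the closest storm center timestamp to get the wind speed and pressure data.

In [None]:
# TODO: implement the join described above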
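
One way the join described in the introduction could look is sketched below: lightning timestamps are floored to 30-minute bins, strikes are counted per storm and bin, and each bin is matched to the nearest trackfile timestamp with `pd.merge_asof`. This is an assumption about the approach, not the notebook's final implementation; the helper `build_timestamp`, the `time_bin` and `strike_count` columns, and all sample values are hypothetical.

```python
import pandas as pd


def build_timestamp(df, columns):
    """Assemble a datetime column from year/month/day/... component columns."""
    parts = df[columns].rename(columns={"min": "minute", "sec": "second"})
    # Cast to a common resolution so both merge keys share the same dtype
    return pd.to_datetime(parts).astype("datetime64[ns]")


# Made-up trackfile rows (storm center every 6 hours)
track = pd.DataFrame({
    "year": [2020, 2020], "month": [10, 10], "day": [20, 20], "hour": [0, 6],
    "pressure": [1008, 1006], "knots": [15, 20],
    "storm_code": ["ATL_20_28", "ATL_20_28"],
})
track["timestamp"] = build_timestamp(track, ["year", "month", "day", "hour"])

# Made-up lightning strikes
lightning = pd.DataFrame({
    "year": [2020] * 4, "month": [10] * 4, "day": [20] * 4,
    "hour": [0, 0, 2, 4], "min": [10, 20, 50, 40], "sec": [5, 30, 0, 15],
    "storm_code": ["ATL_20_28"] * 4,
})
lightning["timestamp"] = build_timestamp(
    lightning, ["year", "month", "day", "hour", "min", "sec"]
)

# Bin strikes into 30-minute increments and count strikes per storm and bin
lightning["time_bin"] = lightning["timestamp"].dt.floor("30min")
binned = (
    lightning.groupby(["storm_code", "time_bin"])
    .size()
    .reset_index(name="strike_count")
)

# Join each bin to the closest storm center timestamp for the same storm
joined = pd.merge_asof(
    binned.sort_values("time_bin"),
    track.sort_values("timestamp")[["timestamp", "storm_code", "knots", "pressure"]],
    left_on="time_bin",
    right_on="timestamp",
    by="storm_code",
    direction="nearest",
)
print(joined)
```

`merge_asof` requires both frames to be sorted on their time keys, which is why each side is sorted before the call. Whether to bin by floor, by nearest half hour, or to aggregate something other than a strike count is a design choice left open here.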