# Data Upload
This notebook uploads data (`.txt` files) from a USB drive to Google Drive. It uses the Google Drive API, through the PyDrive library, to interact with Google Drive, and multithreading to speed up the uploads.

### Setting Up Google Drive API
This code requires a Google Cloud project and an OAuth 2.0 Client ID.
 
1. Start by creating a Google Cloud project. We use the free tier. [Documentation](https://console.cloud.google.com/projectcreate?sjid=1200308559223672915-NC&inv=1&invt=AbjYiw)
2. Enable the Google Drive API. Navigate to the APIs & Services page, click the "Enable APIs and Services" button, and search for "Google Drive API". Click "Enable" if not already enabled.
3. Configure the OAuth consent screen as needed by navigating to APIs & Services > OAuth consent screen. For this project, we leave the app in testing mode and manually add the test users who need access. Users can be added in the "Test users" section. [Documentation](https://support.google.com/cloud/answer/10311615?hl=en&ref_topic=3473162&sjid=1200308559223672915-NC)
4. Create credentials. Navigate to APIs & Services > Credentials. Click "Create Credentials" and select "OAuth 2.0 Client ID". Select "Desktop app" as the application type and download the JSON file. Rename the file to `client_secrets.json` (the file name PyDrive looks for by default) and move it to the same directory as this notebook. [Documentation](https://support.google.com/cloud/answer/6158849?hl=en)
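
Before authenticating, it can help to confirm that the file downloaded in step 4 really is a Desktop-app client file. The sketch below is a minimal check, assuming the usual layout of that file, in which the credentials sit under a top-level `"installed"` key. The helper name `check_client_secrets` is hypothetical and not part of PyDrive; pass it whatever path you saved the file under.

```python
import json


def check_client_secrets(path):
    """Return the client ID if `path` looks like a Desktop-app OAuth client file.

    Hypothetical helper for illustration; it is not part of PyDrive.
    """
    with open(path) as f:
        config = json.load(f)

    # Desktop-app credentials are stored under the "installed" key;
    # web-application credentials use "web" instead.
    if "installed" not in config:
        raise ValueError(
            f"{path} is not a Desktop-app client file (top-level keys: {list(config)})"
        )

    section = config["installed"]
    missing = [key for key in ("client_id", "client_secret") if key not in section]
    if missing:
        raise ValueError(f"{path} is missing fields: {missing}")
    return section["client_id"]
```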

### Code
Let's get started!

Start by installing pydrive and importing necessary packages.

In [None]:
# Install necessary packages
!pip install pydrive

In [None]:
# Import packages
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
import os
from threading import Thread
from queue import Queue

Next, authenticate with Google Drive. This will open your browser and prompt you to log in and connect to the app. Make sure the login used is included in the list of test users.

In [None]:
# Authenticate Google Drive
gauth = GoogleAuth()
gauth.LocalWebserverAuth()  # Authenticate via browser
drive = GoogleDrive(gauth)

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?client_id=389849867563-4uggnm57nqe52156v32gj1lkosoqpoem.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8090%2F&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive&access_type=offline&response_type=code

Authentication successful.
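
Re-running the notebook repeats the browser login each time. A common way to avoid this is to cache the credentials on disk with PyDrive's `LoadCredentialsFile` and `SaveCredentialsFile`. The following is a sketch under stated assumptions, not what this notebook does: the function name `authenticate_with_cache` and the default cache file name `saved_credentials.json` are hypothetical, and refreshing an expired token only works if Google issued a refresh token at the first login.

```python
import os


def authenticate_with_cache(gauth, cache_path="saved_credentials.json"):
    """Authenticate `gauth` (a pydrive.auth.GoogleAuth), reusing cached credentials.

    Hypothetical helper; `cache_path` is an assumed file name.
    """
    if os.path.exists(cache_path):
        gauth.LoadCredentialsFile(cache_path)

    if gauth.credentials is None:
        # Nothing cached yet: fall back to the interactive browser flow
        gauth.LocalWebserverAuth()
    elif gauth.access_token_expired:
        # Cached token has expired: try to refresh it without a browser
        gauth.Refresh()
    else:
        # Cached token is still valid
        gauth.Authorize()

    gauth.SaveCredentialsFile(cache_path)
    return gauth
```

With this helper, the authentication cell would become `gauth = authenticate_with_cache(GoogleAuth())` followed by `drive = GoogleDrive(gauth)`. Treat the cache file like a password and keep it out of version control.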


Next, we set our parameters: the destination folder ID, the year, the basins, and the file extension and keywords used to filter files. The USB drive contains files we do not need for our analysis dataset, so we filter these out before uploading to Google Drive. We are only interested in the trackfiles and WWLLN location files for each storm.

The destination folder ID is the string that follows `folders/` in the URL of the Google Drive folder.
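
To avoid copy-and-paste mistakes, the ID can also be pulled out of the URL programmatically. This is a small sketch using only the standard library; the function name `folder_id_from_url` and the example ID `abc123` are hypothetical.

```python
from urllib.parse import urlparse


def folder_id_from_url(url):
    """Return the path segment that follows 'folders' in a Drive folder URL."""
    # Split the URL path into its non-empty segments; query strings such as
    # "?usp=sharing" are not part of the path and are ignored.
    parts = [segment for segment in urlparse(url).path.split("/") if segment]
    if "folders" not in parts or parts.index("folders") + 1 >= len(parts):
        raise ValueError(f"No folder ID found in {url!r}")
    return parts[parts.index("folders") + 1]


# "abc123" is a made-up ID used only for illustration
print(folder_id_from_url("https://drive.google.com/drive/folders/abc123?usp=sharing"))
```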

In [23]:
# Folder ID of the destination folder in Google Drive
destination_folder_id = '12y8waGZLY8Ikeko0bixoMlFk20wWMjcp'  # Replace with your folder ID

# Define parameters
# Two-digit year. We do not loop over years here because each year has its
# own Drive folder, and each folder has its own ID.
year = 10
# TODO: possibly combine all years into one folder, or rename the 10 folder?
# "ATL" is excluded for now because it was the test folder for year 10.
basins = ["CPAC", "EPAC", "IO", "SHEM", "WPAC"]
file_extension = ".txt"
# Keywords a file name must contain (note: this name shadows Python's built-in filter())
filter = ['Trackfile', 'Locations']
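
The filtering rule described above (right extension, plus at least one keyword in the name) can be sketched as a standalone predicate, which makes it easy to check against a few file names before uploading anything. The function name `is_wanted` and the `skip_appledouble` option are hypothetical additions. The option reflects an assumption: names starting with `._` are typically metadata files that macOS writes to non-Mac volumes, not real data, so you may want to leave them out.

```python
def is_wanted(name, extension=".txt", keywords=("Trackfile", "Locations"),
              skip_appledouble=False):
    """Return True if `name` has the extension and contains any keyword.

    Hypothetical helper; the defaults mirror the notebook's parameters.
    """
    if skip_appledouble and name.startswith("._"):
        return False
    return name.endswith(extension) and any(word in name for word in keywords)


# Names taken from this notebook's upload log
print(is_wanted("CPAC_10_1_Omeka_Reduced_Trackfile.txt"))
print(is_wanted("._CPAC_10_1_Omeka_WWLLN_Locations.txt"))
print(is_wanted("._CPAC_10_1_Omeka_WWLLN_Locations.txt", skip_appledouble=True))
```

The first two calls return `True` and the third returns `False`, since only the last one excludes the `._` file.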

Next, we create our function that we'll use to upload the files. We add some file name cleaning here to facilitate later processes.

In [20]:
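# Function to upload files (run by each worker thread)
from queue import Empty

def upload_file():
    while True:
        try:
            # Non-blocking get: checking empty() and then calling get() can hang
            # if another thread takes the last item between the two calls
            file_path = file_queue.get_nowait()
        except Empty:
            break
        try:
            file_name = os.path.basename(file_path)
            print(f"Uploading: {file_name}")
            # Strip leading '.' and '_' characters (e.g. a "._" prefix) from the Drive title
            gfile = drive.CreateFile({'title': file_name.lstrip('._'), 'parents': [{'id': destination_folder_id}]})
            gfile.SetContentFile(file_path)
            gfile.Upload()
            print(f"Uploaded: {file_name}")
        except Exception as e:
            print(f"Failed to upload {file_path}: {e}")
        finally:
            # Always mark the item as processed, even if the upload failed
            file_queue.task_done()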
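
One caveat on the file name cleaning: `str.lstrip('._')` removes every leading `.` and `_` character, not just the literal prefix `._`. For the storm files here the result is the same, but a name that legitimately starts with an underscore would be altered. The sketch below shows a stricter alternative; the function name `clean_title` and the file name `_notes.txt` are hypothetical.

```python
def clean_title(name):
    """Remove a single leading '._' prefix and leave everything else untouched."""
    return name[2:] if name.startswith("._") else name


storm_file = "._CPAC_10_1_Omeka_WWLLN_Locations.txt"  # name from the upload log
other_file = "_notes.txt"                             # made-up name for illustration

# Both approaches agree on the storm file...
assert storm_file.lstrip("._") == clean_title(storm_file)

# ...but lstrip also eats the single leading underscore of the other file
print(other_file.lstrip("._"))   # notes.txt
print(clean_title(other_file))   # _notes.txt
```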

Let's upload our files!

For each basin in our list, the upload loop looks for that basin's folder inside the year folder. If the folder exists, we add the files that match our filtering criteria to the queue. We then upload these files concurrently using multiple threads, each of which runs the `upload_file` function.
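
The notebook's own upload cell, further down, manages `Thread` objects and a `Queue` by hand. For comparison, the same fan-out can be sketched with the standard library's `concurrent.futures.ThreadPoolExecutor`. This is a swapped-in technique, not what the notebook uses: the names `upload_all` and `fake_upload` are hypothetical, and `fake_upload` stands in for a real Drive upload so the example runs without credentials.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def upload_all(paths, upload_one, max_workers=4):
    """Call upload_one(path) for every path on a pool of worker threads.

    Returns (succeeded, failed), where failed is a list of (path, exception).
    """
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted task back to the path it is uploading
        futures = {pool.submit(upload_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                succeeded.append(path)
            except Exception as exc:
                failed.append((path, exc))
    return succeeded, failed


def fake_upload(path):
    """Stand-in for a real upload; fails on one made-up name to show error handling."""
    if path == "bad.txt":
        raise IOError("simulated upload failure")


ok, errors = upload_all(["a.txt", "b.txt", "bad.txt"], fake_upload)
print(sorted(ok), [path for path, _ in errors])
```

Collecting failures in a list, instead of only printing them, makes it possible to retry just the files that did not upload.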

In [None]:
# Define the file upload queue
file_queue = Queue()

# Populate the queue with files to upload
for basin in basins:
    base_directory = f"/mnt/d/WWLLN_TC_Data_2010_2020/{year}/{basin}/"
    if not os.path.exists(base_directory):
        print(f"The path {base_directory} does not exist.")
    else:
        print(f"Attempting upload for 20{year} {basin} basin")

        for root, _, files in os.walk(base_directory):
            for file in files:
                if file.endswith(file_extension) and any(word in file for word in filter):  # Keep files with the right extension and a matching keyword
                    file_path = os.path.join(root, file)
                    file_queue.put(file_path)  # Add the file path to the queue

        # Check if files were added to the queue
        if file_queue.empty():
            print("No files were added to the queue.")
        else:
            print(f"Total files to upload for 20{year} {basin} basin: {file_queue.qsize()}")

            # Create threads for uploading
            threads = []
            num_threads = 4

            for _ in range(num_threads):
                thread = Thread(target=upload_file)
                threads.append(thread)
                thread.start()

            # Wait for all threads to complete
            for thread in threads:
                thread.join()

            print(f"Uploaded files for 20{year}, {basin}")


Attempting upload for 2010, CPAC
Total files to upload for 2010 CPAC basin: 6
Uploading: CPAC_10_1_Omeka_Reduced_Trackfile 2.txt
Uploading: CPAC_10_1_Omeka_WWLLN_Locations 2.txt
Uploading: ._CPAC_10_1_Omeka_WWLLN_Locations 2.txt
Uploading: CPAC_10_1_Omeka_WWLLN_Locations.txt
Uploaded: ._CPAC_10_1_Omeka_WWLLN_Locations 2.txt
Uploading: ._CPAC_10_1_Omeka_WWLLN_Locations.txt
Uploaded: CPAC_10_1_Omeka_Reduced_Trackfile 2.txt
Uploading: CPAC_10_1_Omeka_Reduced_Trackfile.txt
Uploaded: CPAC_10_1_Omeka_WWLLN_Locations.txt
Uploaded: CPAC_10_1_Omeka_WWLLN_Locations 2.txt
Uploaded: ._CPAC_10_1_Omeka_WWLLN_Locations.txt
Uploaded: CPAC_10_1_Omeka_Reduced_Trackfile.txt
Uploaded 0 for 2010, CPAC
Attempting upload for 2010, EPAC
Total files to upload for 2010 EPAC basin: 45
Uploading: EPAC_10_9_Frank_Reduced_Trackfile 2.txt
Uploading: EPAC_10_9_Frank_Reduced_Trackfile.txt
Uploading: EPAC_10_9_Frank_WWLLN_Locations.txt
Uploading: ._EPAC_10_9_Frank_WWLLN_Locations.txt
Uploaded: EPAC_10_9_Frank_Reduced_T