## Data Extraction Pipeline

#### This notebook automates collecting data from the YouTube API and then translating it. <span style="color:red">It is recommended only for users who are uncomfortable running the scripting steps manually from the command line, since the command line approach offers significantly more flexibility. </span><br>
#### Users MUST fill in 5 pieces of information for this to work:
* The location of the cloned repository
* The location of the downloaded dataset
* The path to the first API key (single line text document)
* The path to the second API key (single line text document)
* Whether to do a fresh lookup / translation. Set to True if fresh data is desired, leave as False to reuse existing data

#### By default, all output is placed in a folder called "intermediate" at the SAME level as the repository; you can then point the BertTopics portion of the pipeline to it. This makes cleanup easy: once you have finished topic modeling, simply delete the intermediate folder. You can change this behavior by adjusting the relevant portions of the magic commands below; however, any user comfortable with that approach is better served by the command line approach.
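The folder layout described above can be sketched as follows. This is a minimal illustration, assuming the three subfolder names used by the cells further down (`api_data`, `combined`, `translated`). The helpers `intermediate_dirs` and `remove_intermediate` are hypothetical and are not part of `pipeline_utility`.

```python
import shutil
import tempfile
from pathlib import Path


def intermediate_dirs(repo_location):
    """Return the output folders the notebook uses under <repo_location>/intermediate."""
    base = Path(repo_location) / "intermediate"
    return {
        "base": base,
        "api_data": base / "api_data",
        "combined": base / "combined",
        "translated": base / "translated",
    }


def remove_intermediate(repo_location):
    """Delete the whole intermediate folder (hypothetical cleanup helper)."""
    shutil.rmtree(intermediate_dirs(repo_location)["base"], ignore_errors=True)


# Demonstrate on a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    dirs = intermediate_dirs(tmp)
    for name in ("api_data", "combined", "translated"):
        dirs[name].mkdir(parents=True, exist_ok=True)
    created = dirs["base"].is_dir()
    remove_intermediate(tmp)
    removed = not dirs["base"].exists()
```

Keeping every output under one parent folder is what makes the cleanup a single delete.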

## Getting Started

In [1]:
# REQUIREMENT 1: path to where this repo exists
repo_location = "/nfs/turbo/seas-nhcarter/human_wildlife_interactions/repo/"
# REQUIREMENT 2: path to where the downloaded dataset is stored
storage_path = "/nfs/turbo/seas-nhcarter/human_wildlife_interactions"

# REQUIREMENT 3 / 4: API key file paths (each expects a 1 line .txt file)
# Two different keys are recommended because each key can only handle 10k lookups per day
# and each video requires 2 lookups. If you choose to use 1 key, enter its path twice.
api_key1_path = "/nfs/turbo/seas-nhcarter/human_wildlife_interactions/repo/human-wildlife-interactions/pipeline/youtubeapi1.txt"
api_key2_path = "/nfs/turbo/seas-nhcarter/human_wildlife_interactions/repo/human-wildlife-interactions/pipeline/youtubeapi2.txt"

# REQUIREMENT 5: do you want a new API lookup? (leave as False to use existing data)
fresh_data = False
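
Before running the rest of the notebook it can help to check the five inputs above. The sketch below is one way to do that, assuming the key files really are single line text files as stated. `read_single_line_key` and `missing_paths` are hypothetical helpers written for this example; they are not the `pipeline_utility.api_key` function used later.

```python
import tempfile
from pathlib import Path


def read_single_line_key(path):
    """Read an API key file and insist that it holds exactly one non-empty line."""
    lines = [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]
    if len(lines) != 1:
        raise ValueError(f"{path}: expected exactly 1 non-empty line, found {len(lines)}")
    return lines[0]


def missing_paths(*paths):
    """Return the subset of paths that do not exist on disk."""
    return [str(p) for p in paths if not Path(p).exists()]


# Demonstration with a temporary key file (placeholder value, not a real key).
with tempfile.TemporaryDirectory() as tmp:
    key_file = Path(tmp) / "youtubeapi1.txt"
    key_file.write_text("PLACEHOLDER_KEY\n")
    key = read_single_line_key(key_file)
    missing = missing_paths(key_file, Path(tmp) / "does_not_exist.txt")
```

In the notebook itself, `missing_paths(repo_location, storage_path, api_key1_path, api_key2_path)` should return an empty list before continuing.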


API Query Settings (optional; currently configured for our Wildlife lookups):

In [2]:
# desired topic (must be from the available yt8m topic list: default is 'Wildlife')
yt8m_topic = "Wildlife"
# genre the topic belongs to (as defined here: https://research.google.com/youtube8m/csv/2/vocabulary.csv)
yt8m_genre = "Pets & Animals"

## Imports / Dependencies

In [3]:
# general utility imports
import os
import re
import json
import pickle
from tqdm import tqdm
from pathlib import Path
# local utility imports
import pipeline_utility
# data manipulation imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [4]:
# show dependencies
print(pd.__version__)
print(np.__version__)
! python --version

1.4.2
1.22.4
Python 3.9.7


In [5]:
# path to the yt8m video training data
video_training_location = "/nfs/turbo/seas-nhcarter/human_wildlife_interactions/video/train"
# path to the yt8m frame training data
frame_training_location = "/nfs/turbo/seas-nhcarter/human_wildlife_interactions/frame/train"

repo_path = Path(repo_location)
api1_path = Path(api_key1_path)
api2_path = Path(api_key2_path)
storage_path = Path(storage_path)
video_path = storage_path / "video/train"
frame_path = storage_path / "frame/train"

## Get the Relevant Video IDs

In [None]:
# get the list of all relevant video ids
yt8m_ids = pipeline_utility.get_video_ids(yt8m_genre, yt8m_topic)
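
How `pipeline_utility.get_video_ids` works internally is not shown in this notebook. As a rough illustration of the first step such a lookup might need, the sketch below maps a genre and topic pair to label indices. It assumes the vocabulary file has `Index`, `Name` and `Vertical1` to `Vertical3` columns; the rows are made up for the example and are not real vocabulary entries. `topic_indices` is a hypothetical helper.

```python
import csv
import io

# Made-up rows for illustration only; the real file is the vocabulary.csv linked above.
# ASSUMPTION: the file has "Index", "Name" and "Vertical1".."Vertical3" columns.
SAMPLE_VOCABULARY = """Index,Name,Vertical1,Vertical2,Vertical3
101,Wildlife,Pets & Animals,,
102,Guitar,Arts & Entertainment,,
103,Wildlife,Games,,
"""


def topic_indices(vocabulary_csv, genre, topic):
    """Return label indices whose name matches `topic` and whose verticals include `genre`."""
    matches = []
    for row in csv.DictReader(io.StringIO(vocabulary_csv)):
        verticals = {row.get(f"Vertical{i}", "") for i in (1, 2, 3)}
        if row["Name"] == topic and genre in verticals:
            matches.append(int(row["Index"]))
    return matches


indices = topic_indices(SAMPLE_VOCABULARY, "Pets & Animals", "Wildlife")
```

Matching on both the name and the genre is why the notebook asks for `yt8m_genre` as well as `yt8m_topic`.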

## Get the Video Description and Comment information from the YouTube API

In [None]:
# set fresh_data to True to do a new API pull (recommend not running more than once per day due to API data limits)
if fresh_data:
    # make the intermediate directory if it does not already exist
    working_path = os.path.join(repo_location, "intermediate/api_data")
    os.makedirs(working_path, exist_ok=True)

    # grab the video titles / descriptions and comments and then write them to files for follow on steps
    # this step generally takes about 30 - 40 minutes depending on your connection / setup
    api1 = pipeline_utility.api_key(api_key1_path)
    api2 = pipeline_utility.api_key(api_key2_path)
    pipeline_utility.lookup_videos(yt8m_ids, working_path, api1, api2)
else:
    print("Using existing data.")
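
The daily limit mentioned above can be turned into a quick estimate of how long a fresh pull will take. This sketch uses only the two figures given in the configuration comments (10k lookups per key per day, 2 lookups per video). `days_needed` is a hypothetical helper, and actual quota accounting may differ.

```python
import math

DAILY_LOOKUPS_PER_KEY = 10_000  # figure taken from the notebook's comment
LOOKUPS_PER_VIDEO = 2           # figure taken from the notebook's comment


def days_needed(n_videos, n_keys=2):
    """Estimate how many daily quota windows a full pull would take."""
    videos_per_day = (DAILY_LOOKUPS_PER_KEY * n_keys) // LOOKUPS_PER_VIDEO
    return math.ceil(n_videos / videos_per_day)


videos_per_day_two_keys = (DAILY_LOOKUPS_PER_KEY * 2) // LOOKUPS_PER_VIDEO
```

Under these figures, two keys cover 10,000 videos per day and a single key covers 5,000, so `days_needed(len(yt8m_ids))` shows whether the pull fits in one day.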

## Send to Translation and Topic Modeling Pipelines

In [None]:
# combine the files produced by the API crawl
if fresh_data:
    script_path = repo_path / "human-wildlife-interactions/src/data/combine"
    input_dir = repo_path / "intermediate/api_data"
    output_dir = repo_path / "intermediate/combined"
    output_dir.mkdir(parents=True, exist_ok=True)
    # this runs on Great Lakes, so unfortunately the paths below cannot be relative
    %run /nfs/turbo/seas-nhcarter/human_wildlife_interactions/repo/human-wildlife-interactions/src/data/combine /nfs/turbo/seas-nhcarter/human_wildlife_interactions/repo/intermediate/api_data /nfs/turbo/seas-nhcarter/human_wildlife_interactions/repo/intermediate/combined
else:
    print("Using existing combined data.")

In [None]:
# This step is very slow; only run it if it is really necessary
if fresh_data:
    # translate the title for topic modeling
    translated_dir = repo_path / "intermediate/translated"
    translated_dir.mkdir(parents=True, exist_ok=True)
    # replace this with relative paths once it is in the correct directory
    %run /nfs/turbo/seas-nhcarter/human_wildlife_interactions/repo/human-wildlife-interactions/src/features/translation /nfs/turbo/seas-nhcarter/human_wildlife_interactions/repo/intermediate/combined/videoDets.pkl snippet.title title /nfs/turbo/seas-nhcarter/human_wildlife_interactions/repo/intermediate/combined/translatedTitle.pkl
    
    # translate the description for topic modeling
    %run /nfs/turbo/seas-nhcarter/human_wildlife_interactions/repo/human-wildlife-interactions/src/features/translation /nfs/turbo/seas-nhcarter/human_wildlife_interactions/repo/intermediate/combined/translatedTitle.pkl snippet.description descrip /nfs/turbo/seas-nhcarter/human_wildlife_interactions/repo/intermediate/combined/desc_title_translated.pkl
    !rm /nfs/turbo/seas-nhcarter/human_wildlife_interactions/repo/intermediate/combined/translatedTitle.pkl
else:
    print("Using existing translated data.")
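
The hard-coded `%run` arguments in the translation cell are long and easy to mistype. The sketch below rebuilds them from `repo_path` so they can be compared against the lines in the cell. It assumes the first argument is the input file and the last is the output file, which the chaining through `translatedTitle.pkl` suggests; the two middle arguments are passed through exactly as written, and their meaning is defined by the translation script.

```python
from pathlib import Path

# Same value as repo_location in the first cell.
repo_path = Path("/nfs/turbo/seas-nhcarter/human_wildlife_interactions/repo/")

script = repo_path / "human-wildlife-interactions/src/features/translation"
combined = repo_path / "intermediate/combined"

# (input file, two middle arguments as written in the cell, output file)
steps = [
    (combined / "videoDets.pkl", "snippet.title", "title",
     combined / "translatedTitle.pkl"),
    (combined / "translatedTitle.pkl", "snippet.description", "descrip",
     combined / "desc_title_translated.pkl"),
]

# One command line per translation step, matching the text after "%run" in the cell.
commands = [" ".join(str(part) for part in (script, *step)) for step in steps]
for command in commands:
    print(command)
```

The output of the first step is the input of the second, which is why the cell deletes `translatedTitle.pkl` once both steps have finished.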