# Motivation:
YouTube comments are an online source where skaters' communication with each other is openly displayed and recorded. Under skate videos, many people interested in the content hold discussions. While these comments are written rather than spoken, they are indicative of the speech patterns and slang used by the skateboarders interested in these videos.

Because of the massive availability of public data in this domain, I figured it was an excellent place to start.

# Scraping data:
Thrasher will be the starting point for scraping this data. To do this, I am using [Simple-Youtube-Comment-Downloader](https://github.com/hangyeoldora/Simple-Youtube-Comment-Downloader.git) to scrape comments from YouTube.

## Setup

We need some variables to indicate *if* we should scrape and where the data will be located.

We also need to install the packages required to work with this:

In [1]:
install_packages = True
delete_existing_data = True
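
The cell above defines the flags but not where the data lives. As a minimal sketch of how `delete_existing_data` could be honored, assuming a hypothetical `data_dir` location and a hypothetical helper `prepare_data_dir` (neither is part of the original notebook):

```python
import shutil
from pathlib import Path

# Hypothetical location; the notebook does not state where the data is stored.
data_dir = Path("data/youtube_comments")


def prepare_data_dir(path: Path, delete_existing: bool) -> Path:
    """Optionally wipe, then (re)create, the directory that will hold scraped data."""
    if delete_existing and path.exists():
        shutil.rmtree(path)  # remove everything left over from earlier runs
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Calling `prepare_data_dir(data_dir, delete_existing_data)` before scraping would then give each run a known starting state.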

### Install packages

In [2]:
if install_packages:
    ! pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting google-api-python-client (from -r requirements.txt (line 3))
  Downloading google_api_python_client-2.125.0-py2.py3-none-any.whl.metadata (6.6 kB)
Collecting google-auth-oauthlib (from -r requirements.txt (line 4))
  Downloading google_auth_oauthlib-1.2.0-py2.py3-none-any.whl.metadata (2.7 kB)
Collecting google-auth-httplib2 (from -r requirements.txt (line 5))
  Downloading google_auth_httplib2-0.2.0-py2.py3-none-any.whl.metadata (2.2 kB)
Collecting httplib2<1.dev0,>=0.19.0 (from google-api-python-client->-r requirements.txt (line 3))
  Downloading httplib2-0.22.0-py3-none-any.whl.metadata (2.6 kB)
Collecting google-auth!=2.24.0,!=2.25.0,<3.0.0.dev0,>=1.32.0 (from google-api-python-client->-r requirements.txt (line 3))
  Downloading google_auth-2.29.0-py2.py3-none-any.whl.metadata (4.7 kB)
Collecting google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0.dev0,>=1.31.5 (from google-api-python-client->-r r

## Import Packages

In [3]:
import time
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

## Web Crawling

### Defining YouTube API Key

In [4]:
api_key = "YOUR_YOUTUBE_API_KEY"  # placeholder; do not commit a real key to the notebook
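
A key stored in the notebook itself is easy to leak when the notebook is shared. A minimal sketch of reading it from the environment instead, assuming a hypothetical variable name `YOUTUBE_API_KEY` and a hypothetical helper `load_api_key` (neither appears in the original notebook):

```python
import os


def load_api_key(env_var: str = "YOUTUBE_API_KEY") -> str:
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key to the API.
        raise RuntimeError(f"Set the {env_var} environment variable to your YouTube Data API key.")
    return key
```

With this in place, the cell above could read `api_key = load_api_key()`.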

### Defining Scraping Functions

In [5]:
def get_comments_yt_video(video_id, max_results=100):
    youtube = build('youtube', 'v3', developerKey=api_key)

    try:
        # Retrieve comments for the given video ID
        comments = []
        next_page_token = None

        while True:
            response = youtube.commentThreads().list(
                part='snippet',
                videoId=video_id,
                textFormat='plainText',
                maxResults=max_results,
                pageToken=next_page_token
            ).execute()

            # Collect comments
            for item in response['items']:
                comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
                comments.append(comment)

            # Check if there are more comments to fetch
            next_page_token = response.get('nextPageToken')
            if not next_page_token:
                break

        return comments

    except HttpError as e:
        print(f'An HTTP error {e.resp.status} occurred: {e.content}')
        return None

def get_popular_yt_videos(channel_id, max_results=10):
    # Initialize the YouTube Data API client
    youtube = build("youtube", "v3", developerKey=api_key)

    # Retrieve most popular videos for the given channel
    request = youtube.search().list(
        part="snippet",
        channelId=channel_id,
        type="video",
        order="viewCount",
        maxResults=max_results
    )

    response = request.execute()

    # Extract video details from the response
    videos = []
    for item in response["items"]:
        video_id = item["id"]["videoId"]
        title = item["snippet"]["title"]
        videos.append({"title": title, "video_id": video_id})

    return videos
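
Note that `max_results` in `get_comments_yt_video` is passed as the page size, so the loop keeps requesting pages until the video has no more comments. If a bounded number of comments per video is enough, one possible variant is sketched below. The name `collect_comments` and its `limit` and `page_size` parameters are assumptions for illustration, and the client is passed in as an argument rather than built inside the function:

```python
def collect_comments(youtube, video_id, limit=500, page_size=100):
    """Fetch at most `limit` top-level comments for one video."""
    comments = []
    next_page_token = None

    while len(comments) < limit:
        response = youtube.commentThreads().list(
            part="snippet",
            videoId=video_id,
            textFormat="plainText",
            # Never ask for more than is still needed to reach the limit
            maxResults=min(page_size, limit - len(comments)),
            pageToken=next_page_token,
        ).execute()

        for item in response.get("items", []):
            comments.append(item["snippet"]["topLevelComment"]["snippet"]["textDisplay"])

        # Stop when the API reports no further pages
        next_page_token = response.get("nextPageToken")
        if not next_page_token:
            break

    return comments[:limit]
```

It would be called with a client built as in the functions above, for example `collect_comments(build('youtube', 'v3', developerKey=api_key), video_id, limit=500)`.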

### Scraping data

First, we need to find Thrasher's most popular videos.

In [7]:
channel_id = "UCt16NSYjauKclK67LCXvQyA"

popular_videos = get_popular_yt_videos(channel_id, 10)

Then we need to scrape the comments for each of the most popular videos.

In [8]:
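comments = []

for video in popular_videos:
    # get_comments_yt_video returns None on an HTTP error (for example, when
    # comments are disabled), so store an empty list to keep one entry per video
    video_comments = get_comments_yt_video(video['video_id'], 100)
    comments.append(video_comments or [])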
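
The loop above leaves `comments` as one list per video, in the same order as `popular_videos`. For later analysis it may be handier to have one record per comment. A minimal sketch, assuming a hypothetical helper `flatten_comments` that is not part of the original notebook and that treats a missing (`None`) entry as "no comments":

```python
def flatten_comments(videos, comments_per_video):
    """Pair every comment with the video it came from, one record per comment."""
    records = []
    for video, video_comments in zip(videos, comments_per_video):
        # A failed request may have left None instead of a list
        for text in video_comments or []:
            records.append({
                "video_id": video["video_id"],
                "title": video["title"],
                "comment": text,
            })
    return records
```

For example, `records = flatten_comments(popular_videos, comments)` yields a flat list of dictionaries with the keys `video_id`, `title`, and `comment`.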