# Motivation:
YouTube comments are an online source where skaters' communication with one another is openly displayed and recorded. Under skate videos, many people interested in the content hold discussions. While these comments do not reflect spoken language directly, they are indicative of the speech patterns and slang used by skateboarders interested in these videos.

Because of the massive availability of public data in this domain, I figured it was an excellent place to start.

# Setup

## Setup flags

In [1]:
install_packages = True
delete_existing_data = False

## Installing packages

In [2]:
if install_packages:
    ! pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


## Importing packages

In [3]:
import time
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from IPython.display import Markdown

from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
import re
import html
import json
import shutil
import os

## Cleaning Dataset

In [4]:
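if delete_existing_data:
    # Remove each folder independently so a missing ./data does not skip ./meta
    for folder in ('./data', './meta'):
        try:
            shutil.rmtree(folder)
        except FileNotFoundError:
            display(Markdown(f'### No such folder to delete: {folder}'))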

# Scraping data:
Thrasher's YouTube channel will be the starting point for scraping this data. To do this, I am using [simple-youtube-comment-crawler](https://github.com/hangyeoldora/Simple-Youtube-Comment-Downloader.git) to scrape comments from YouTube.

## Web Crawling

### Defining Youtube API Key

In [5]:
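# Read the key from the environment (YOUTUBE_API_KEY is an assumed variable name) instead of hard-coding a secret
api_key = os.environ["YOUTUBE_API_KEY"]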

### Defining Scraping Functions

In [12]:
def get_comments_yt_video(video_id, max_results=50):
    youtube = build('youtube', 'v3', developerKey=api_key)

    try:
        # Retrieve comments for the given video ID
        comments = []
        next_page_token = None

        while True:
            response = youtube.commentThreads().list(
                part='snippet',
                videoId=video_id,
                textFormat='plainText',
                maxResults=max_results,
                pageToken=next_page_token
            ).execute()

            # Collect comments
            for item in response['items']:
                comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
                comments.append(comment)

            # Check if there are more comments to fetch
            next_page_token = response.get('nextPageToken')
            if not next_page_token:
                break

        return comments

    except HttpError as e:
        print(f'An HTTP error {e.resp.status} occurred: {e.content}')
        return None

def get_popular_yt_videos(channel_id, max_results=50):
    # Initialize the YouTube Data API client
    youtube = build("youtube", "v3", developerKey=api_key)

    page_token = None
    videos = []

    try:
        while len(videos) < max_results:
            # Retrieve the channel's most viewed videos; the API returns at most 50 per page
            response = youtube.search().list(
                part="snippet",
                channelId=channel_id,
                type="video",
                order="viewCount",
                maxResults=min(50, max_results - len(videos)),
                pageToken=page_token
            ).execute()

            for item in response["items"]:
                video_id = item["id"]["videoId"]
                title = item["snippet"]["title"]
                videos.append({"title": title, "video_id": video_id})

            # Read the token only after the items are stored so the last page is kept
            page_token = response.get("nextPageToken")
            if not page_token:
                break
    except HttpError as e:
        # Fall through and return whatever was collected before the error
        print(f'An HTTP error {e.resp.status} occurred: {e.content}')

    return videos

def get_video_metadata(video_id):
    youtube = build('youtube', 'v3', developerKey=api_key)

    video_response = youtube.videos().list(
        part='snippet,statistics',
        id=video_id
    ).execute()

    video_details = video_response['items'][0]
    snippet = video_details['snippet']
    statistics = video_details['statistics']

    metaData = dict()

    metaData['Title'] = snippet['title']
    metaData['Description'] = snippet['description']
    metaData['Published'] = snippet['publishedAt']
    # Some statistics may be omitted by the API (e.g. hidden likes or disabled comments)
    metaData['Views'] = statistics.get('viewCount')
    metaData['Likes'] = statistics.get('likeCount')
    metaData['Comments'] = statistics.get('commentCount')
    return metaData
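
Both listing functions above rely on following `nextPageToken` until the API stops returning one. As a minimal sketch of that loop in isolation, the example below uses a hypothetical `iter_pages` helper and a fake two-page response (`FAKE_PAGES`) instead of the real API, so neither name is part of `googleapiclient`. It is only meant to show the order of operations: consume the items first, then look for the next token.

```python
from typing import Callable, Iterator, Optional


def iter_pages(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every item from a paginated API by following nextPageToken."""
    token = None
    while True:
        response = fetch_page(token)
        # Consume the items first so the final page is never dropped.
        yield from response.get("items", [])
        token = response.get("nextPageToken")
        if not token:
            break


# Hypothetical stand-in for an API call: two pages, keyed by page token.
FAKE_PAGES = {
    None: {"items": ["a", "b"], "nextPageToken": "page-2"},
    "page-2": {"items": ["c"]},
}

all_items = list(iter_pages(lambda token: FAKE_PAGES[token]))
print(all_items)
```

Reading the token with `.get` matters on the last page, where the key is simply absent.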

### Scraping data

We need a helper function to turn YouTube titles into file-friendly strings.

In [7]:
def clean_string(input_string):
    cleaned_string = html.unescape(input_string)
    cleaned_string = cleaned_string.lower()
    cleaned_string = cleaned_string.replace(' ', '-')
    cleaned_string = re.sub(r'[^a-zA-Z0-9\-]', '', cleaned_string)
    return cleaned_string
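
To see what the helper does, here is a small self-contained check. It repeats the function so the snippet runs on its own and feeds it a title in the HTML-escaped form the API returns. The second comparison uses a made-up variant of a title to illustrate a caveat: two different titles can collapse to the same file name.

```python
import html
import re


def clean_string(input_string):
    # Same steps as the helper above: unescape, lowercase, hyphenate, strip the rest
    cleaned_string = html.unescape(input_string)
    cleaned_string = cleaned_string.lower()
    cleaned_string = cleaned_string.replace(' ', '-')
    cleaned_string = re.sub(r'[^a-zA-Z0-9\-]', '', cleaned_string)
    return cleaned_string


escaped_title = 'Evan Smith&#39;s &quot;Sky Bird People Up-Jump&quot; Grimple Part'
print(clean_string(escaped_title))
# evan-smiths-sky-bird-people-up-jump-grimple-part

# Punctuation is dropped, so titles that differ only in punctuation collide.
print(clean_string('My War: Ryan Decenzo') == clean_string('My War Ryan Decenzo'))
# True
```

Because the cleaned string is used as a dictionary key and a file name, a collision like this would silently overwrite an earlier video.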

First, we need to find Thrasher's most popular videos.

In [8]:
thrasher_id = "UCt16NSYjauKclK67LCXvQyA"
newschoolers_id = "UC27MqEZBoxV0Cs7ZbMtBk1g"

popular_videos = dict()

if os.path.exists('./videos'):
    display(Markdown('**Video data already pulled. Fetching from files**'))
    for filename in os.listdir('./videos'):
        with open(f'./videos/{filename}') as f:
            popular_videos[os.path.splitext(filename)[0]] = json.loads(f.read())
else:
    os.mkdir('./videos')
    videos_raw = get_popular_yt_videos(thrasher_id, 500)
    for video in videos_raw:
        filename = clean_string(video['title'])
        popular_videos[filename] = video
        with open(f'./videos/{filename}.json', 'w') as f:
            f.write(json.dumps(video))

**Video data already pulled. Fetching from files**

In [9]:
display(Markdown('**Video titles**'))
display(Markdown(f'**Total number:** {len(popular_videos)}'))
for video in list(popular_videos.keys())[:10]:
    display(Markdown(f"- {popular_videos[video]['title']}"))

**Video titles**

**Total number:** 499

- SKATELINE - Jamie Foy El Toro Front Krook War? Tom Schaar, Antonio Durao & More

- King of the Road 2012: Webisode 3

- "Revolutions on Granite" Ukraine Skate Documentary

- The Basement Tapes Marc Johnson

- SKATELINE - Hit Makers Jereme Rogers & Steven Fernandez, Erick Winkowski, Blake Johnson Pro

- Evan Smith's "Sky Bird People Up-Jump" Grimple Part

- My War: Ryan Decenzo

- King of the Road 2010: Episode 1

- Emerica's "THIS" Video

- SKATELINE - Shane ONeill, Nick Tucker, Andrew Brophy, Marc Johnson, GX 1000

In [10]:
comments = dict()
if os.path.exists('./data'):
    display(Markdown('### Data already exists, skipping scraping'))
    for file in os.listdir('./data'):
        with open(f'./data/{file}', 'r') as f:
            comments[os.path.splitext(file)[0]] = f.readlines()
else:
    os.mkdir('./data')

    for videoKey in popular_videos:
        video = popular_videos[videoKey]
        # get_comments_yt_video returns None on an HTTP error (e.g. comments disabled)
        video_comments = get_comments_yt_video(video['video_id']) or []
        with open(f"./data/{clean_string(video['title'])}.txt", 'w') as f:
            f.writelines('%s\n' % comment for comment in video_comments)
        # comments is a dict keyed by file name, matching the cached branch above
        comments[clean_string(video['title'])] = video_comments
        display(Markdown(f"**{video['title']}** scraping complete"))

### Data already exists, skipping scraping

### Get video metadata

In [11]:
metaData = dict()
if os.path.exists('./meta'): 
    display(Markdown('**Metadata already exists: pulling in from files**'))
    for metadata_file in os.listdir('./meta'):
        with open(f'./meta/{metadata_file}', 'r') as f:
            metaData[os.path.splitext(metadata_file)[0]] = json.loads(f.read())
else: 
    os.mkdir('./meta')
    # popular_videos maps file names to video records, so iterate over the values
    for video in popular_videos.values():
        videoFileName = clean_string(video['title'])
        metaDataItem = get_video_metadata(video['video_id'])
        with open(f'./meta/{videoFileName}.txt', 'w') as f:
            f.write(json.dumps(metaDataItem))
        metaData[videoFileName] = metaDataItem

**Metadata already exists: pulling in from files**
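
The video, comment, and metadata cells all follow the same idea: read from a folder if it exists, otherwise call the API and write the results to disk. A minimal sketch of that pattern as one reusable function might look like the following. `load_or_fetch` and `fake_fetch` are hypothetical names introduced only for this illustration, and the sketch assumes every record is stored as JSON.

```python
import json
import os
import tempfile
from typing import Callable, Dict


def load_or_fetch(directory: str, fetch: Callable[[], Dict[str, dict]]) -> Dict[str, dict]:
    """Return cached JSON records from `directory`, or fetch and cache them."""
    if os.path.isdir(directory):
        records = {}
        for filename in os.listdir(directory):
            with open(os.path.join(directory, filename), encoding="utf-8") as f:
                records[os.path.splitext(filename)[0]] = json.load(f)
        return records

    # Fetch before creating the folder, so a failed fetch leaves no empty cache behind.
    records = fetch()
    os.makedirs(directory)
    for name, record in records.items():
        with open(os.path.join(directory, f"{name}.json"), "w", encoding="utf-8") as f:
            json.dump(record, f)
    return records


# Hypothetical fetcher standing in for the API calls; it records each invocation.
calls = []


def fake_fetch():
    calls.append(1)
    return {"my-war-ryan-decenzo": {"Title": "My War: Ryan Decenzo"}}


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = os.path.join(tmp, "meta")
    first = load_or_fetch(cache_dir, fake_fetch)   # fetches and writes files
    second = load_or_fetch(cache_dir, fake_fetch)  # reads the files back

print(first == second, len(calls))
```

Creating the folder only after a successful fetch avoids one pitfall of the cells above, where an interrupted run leaves a partly filled folder that later runs treat as a complete cache.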

# Analysis

## Qualitative content analysis

## Patterns and trends

## Sentiment analysis

## Network analysis

## Compare and contrast

# Sources

- https://arxiv.org/pdf/1601.01126v1.pdf