# 1: Extract Video Comments

<div style="text-align: right"> <b>MODULE 1</b></div>
<div style="text-align: right"> <b>Authors:</b></div>
<div style="text-align: right"> Vassil Dimitrov</div>
<div style="text-align: right"> Sergiy Chepiga</div>
<div style="text-align: right"> Parsa Kamali</div>
<div style="text-align: right"> <b>Date:</b> 2023-09-01 </div>

---

This first module selects a video by its video ID and extracts all comments and comment replies, together with the associated user names and user channel IDs. The user names and user channel IDs are used later for network analysis.

## Load libraries

In [1]:
from googleapiclient.discovery import build
import pandas as pd

## Prep

The YouTube Data API v3 was enabled and a project was created in order to obtain an API key. The key gives access to **public** information only, and certain restrictions apply. *This is the reason why the channel IDs were also obtained for network analysis*.

### Define parameters for API

In [2]:
API_KEY = '!!!!yourAPIkey!!!!'

# Create a YouTube API client
youtube = build('youtube', 'v3', developerKey=API_KEY)

# Video ID for the video you're interested in
video_id = 'Et7l3Fjsjao'
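
Hard-coding the key in the notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the key could instead be read from an environment variable. The variable name `YOUTUBE_API_KEY` and the helper `load_api_key` are assumptions made for illustration; they are not part of the original notebook.

```python
import os


def load_api_key(env_var="YOUTUBE_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is a hypothetical choice; use whatever name you
    exported the key under.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Possible usage in the cell above (requires the variable to be set):
# API_KEY = load_api_key()
# youtube = build('youtube', 'v3', developerKey=API_KEY)
```

Failing early with a clear message avoids a less obvious authentication error on the first API request.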

## Extract comments and user data

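For each comment, the user name and user channel ID are also extracted. The same applies to comment replies. The loop iterates over all result pages in order to extract as many comments as possible.
>Note that `maxResults=100` limits each request (page) to at most 100 comment threads. It does not cap the total, because the loop follows `nextPageToken` until no pages remain. The replies embedded in each thread may be only a subset of all replies to that comment.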

In [26]:
# Extract comments:
next_page_token = None

# Prep data for dataframe:
comments = []

while True:
    # Fetch comments for video
    comments_response = youtube.commentThreads().list(
        part='snippet,replies',
        videoId=video_id,
        maxResults=100,
        pageToken=next_page_token  # None on the first request; the client drops None-valued parameters
    ).execute()

    # Iterate over items to extract info:
    for comment_item in comments_response['items']:
        comment = comment_item['snippet']['topLevelComment']['snippet']
        comment_id = comment_item['id']
        user = comment['authorDisplayName']
        # Some comments carry no author channel ID, so fall back to a placeholder
        user_channel_id = comment['authorChannelId']['value'] if 'authorChannelId' in comment else 'Not available'
        comment_text = comment['textOriginal']
        comments.append({
            'comment_id': comment_id,
            'user': user,
            'comment_text': comment_text,
            'user_channel_id': user_channel_id
        })
        
        # Fetch replies if they exist
        replies = comment_item.get('replies', {}).get('comments', [])
        for reply in replies:
            reply_id = reply['id']
            # Author name, channel ID and text all live under the reply's 'snippet'
            reply_snippet = reply['snippet']
            reply_user = reply_snippet.get('authorDisplayName', 'Unknown')
            reply_user_channel_id = reply_snippet['authorChannelId']['value'] if 'authorChannelId' in reply_snippet else 'Not available'
            reply_text = reply_snippet.get('textOriginal', 'No text available')  # Use 'textOriginal' for full reply text
            comments.append({
                'comment_id': reply_id,
                'user': reply_user,
                'comment_text': reply_text,
                'user_channel_id' : reply_user_channel_id
            })
            
    # Check if there are more pages of comments
    next_page_token = comments_response.get('nextPageToken')
    if not next_page_token:
        break

# Create dataframe:
df = pd.DataFrame(comments)
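
The replies embedded in a `commentThreads` response may not include every reply to a comment. As a hedged sketch, the remaining replies could be retrieved per top-level comment with `comments().list` and its `parentId` parameter, following `nextPageToken` the same way as the loop above. The helper name `fetch_all_replies` is hypothetical, and each call costs additional API quota.

```python
def fetch_all_replies(youtube, parent_id):
    """Collect every reply to one top-level comment via comments().list.

    `youtube` is the client created with build(); `parent_id` is the ID
    of the top-level comment (comment_item['id'] in the loop above).
    """
    replies = []
    page_token = None
    while True:
        response = youtube.comments().list(
            part='snippet',
            parentId=parent_id,
            maxResults=100,
            pageToken=page_token,  # None on the first request
        ).execute()

        for item in response.get('items', []):
            snippet = item['snippet']
            replies.append({
                'comment_id': item['id'],
                'user': snippet.get('authorDisplayName', 'Unknown'),
                'comment_text': snippet.get('textOriginal', 'No text available'),
                'user_channel_id': snippet.get('authorChannelId', {}).get('value', 'Not available'),
            })

        # Stop once the API reports no further page
        page_token = response.get('nextPageToken')
        if not page_token:
            return replies
```

The returned dictionaries use the same keys as the `comments` list, so they could be appended to it before the dataframe is built.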

## Save in a csv file

In [35]:
df.to_csv('Et7l3Fjsjao_100_20230901.csv')
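
The file name above combines the video ID, a label and the extraction date. A small sketch of building that name programmatically, so it stays in sync with `video_id`, is shown here. The helper `comments_csv_name` and its `label` argument are assumptions made for illustration.

```python
from datetime import date


def comments_csv_name(video_id, label, on=None):
    """Build '<video_id>_<label>_<YYYYMMDD>.csv'; defaults to today's date."""
    on = on or date.today()
    return f"{video_id}_{label}_{on:%Y%m%d}.csv"


# Possible usage with the dataframe created above:
# df.to_csv(comments_csv_name(video_id, 100), index=False)
```

Passing `index=False` to `to_csv` leaves out the dataframe's row index, which otherwise appears as an unnamed first column in the file.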

---
<div style="text-align: right"> <b>MODULE 1</b></div>
<div style="text-align: right"> <b>Date END:</b> 2023-09-01 </div>