### Reddit and YouTube Data Collection Script

This script collects data from Reddit and YouTube for a set of keywords related to gaming culture (e.g., AI in games, mental health in gaming). For each matching post or video, it also collects the associated comments. The collected data is then saved to JSON files.

In [1]:
# Import necessary libraries
import json
import praw
from datetime import datetime
from redditClient import redditClient
from youtubeClient import youtubeClient
from googleapiclient.errors import HttpError

In [2]:
# Initialise clients
redditClient = redditClient()
youtubeClient = youtubeClient()

In [3]:
# Define the keywords and time period
keywords = ['AI in games', 'mental health in gaming', 'cyberbullying in gaming', 'stress relief in gaming', 'toxicity in gaming', 'upcoming games']
year = 2024

In [4]:
# Initialise lists to store Reddit and YouTube data
redditData = []
youtubeData = []

### Reddit Data Collection

Reddit data is collected with the `search` method of the Reddit API, querying all subreddits for each entry in the `keywords` list. For each post we keep only the title, author, date, score, and up to 50 top-level comments (with their authors), and we retain only posts created in 2024.

In [5]:
# Initialise ID as 0
ID = 0  

# Iterate through the keywords and collect relevant Reddit posts and comments
for keyword in keywords:
    
    # Search posts in 'all' subreddit based on the keywords 
    for submission in redditClient.subreddit('all').search(keyword, time_filter='year', limit=1000):
        
        # Ensure the data is from 2024
        if datetime.utcfromtimestamp(submission.created_utc).year == year:
            
            # Get author of the submission (Assign Anonymous if not present)
            sub_author = submission.author.name if submission.author else 'Anonymous'
            
            # Extracting comments with their authors
            comments = []
            for comment in submission.comments[:50]:
                if isinstance(comment, praw.models.Comment):
                    
                    # Get author of the comment (Assign Anonymous if not present)
                    comment_author = comment.author.name if comment.author else 'Anonymous'
                    comments.append({
                        'comment_body': comment.body,
                        'comment_author': comment_author
                    })
            
            # Add submission data into a dictionary
            subData = {
                'title': submission.title,
                'author': sub_author,
                'ID': ID,
                'date': datetime.utcfromtimestamp(submission.created_utc).strftime('%Y-%m-%d %H:%M:%S'),  # UTC, matching the year check above
                'keyword': keyword,
                'score': submission.score,
                'comments': comments
            }
            ID += 1
            
            # Store the data
            redditData.append(subData)
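
The year check above reads `created_utc` as UTC. `datetime.utcfromtimestamp` is deprecated from Python 3.12 onwards, so the following is a minimal sketch of timezone-aware helpers that could do the same job. The function names `created_in_year` and `format_utc` are hypothetical and are not part of the original notebook.

```python
from datetime import datetime, timezone

def created_in_year(created_utc: float, year: int) -> bool:
    """Return True if a Unix timestamp falls in `year` when read as UTC."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).year == year

def format_utc(created_utc: float) -> str:
    """Format a Unix timestamp as 'YYYY-MM-DD HH:MM:SS' in UTC."""
    moment = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')

# 1704067200 is the first second of 2024 in UTC
print(created_in_year(1704067200, 2024))  # True
print(format_utc(1704067200))             # 2024-01-01 00:00:00
```

Using the same UTC conversion for both the filter and the stored `date` string keeps posts near the year boundary from being labelled with a different year than the one they were filtered on.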

Now print the total number of posts and comments that we collected from Reddit.

In [6]:
# Print the number of posts
print(f"Total number of posts: {len(redditData)}")

# Calculate the total number of comments
print(f"Total number of comments: {sum(len(submission['comments']) for submission in redditData)}")

Total number of posts: 1269
Total number of comments: 44401
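
Because every record stores the `keyword` that produced it, the totals can also be broken down per keyword. The sketch below assumes records shaped like the dictionaries built above. The helper name `summarise_by_keyword` and the sample records are made up for illustration and are not collected data.

```python
from collections import Counter

def summarise_by_keyword(records):
    """Count posts and comments per keyword for a list of collected records."""
    posts = Counter()
    comments = Counter()
    for record in records:
        posts[record['keyword']] += 1
        comments[record['keyword']] += len(record['comments'])
    return posts, comments

# Illustrative records only (not real collected data)
sample = [
    {'keyword': 'AI in games', 'comments': [{'comment_body': 'a'}, {'comment_body': 'b'}]},
    {'keyword': 'AI in games', 'comments': []},
    {'keyword': 'upcoming games', 'comments': [{'comment_body': 'c'}]},
]

posts, comments = summarise_by_keyword(sample)
for keyword in posts:
    print(f"{keyword}: {posts[keyword]} posts, {comments[keyword]} comments")
```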


### YouTube Data Collection

As with Reddit, YouTube data is collected with the `search` method of the YouTube API for each entry in the `keywords` list. For each video we keep only the title, author (channel name), date, and the top-level comments returned by a single `commentThreads` request (with their authors), and we retain only videos published in 2024.

In [7]:
# Initialise ID as 0
ID = 0  

# Iterate through the keywords and collect YouTube videos and their comments
for keyword in keywords:
    
    # Search YouTube videos based on the keyword
    videos = youtubeClient.search().list(q=keyword, part='snippet', type='video', maxResults=1000).execute()

    for video in videos['items']:
        video_id = video['id']['videoId']
        title = video['snippet']['title']
        video_author = video['snippet']['channelTitle']  # Get the uploader (channel name)
        published_at = video['snippet']['publishedAt'][:10]  # Get date in YYYY-MM-DD format
        y = int(published_at.split('-')[0])
        
        # Filter by the current year (2024)
        if y == year:
            try:
                comments_response = youtubeClient.commentThreads().list(part='snippet', videoId=video_id, maxResults=1000).execute()
                
                # Extract comments along with their authors
                comments = []
                for comment in comments_response['items']:
                    comment_text = comment['snippet']['topLevelComment']['snippet']['textOriginal']
                    comment_author = comment['snippet']['topLevelComment']['snippet']['authorDisplayName']  # Get comment author name
                    comments.append({
                        'comment_body': comment_text,
                        'comment_author': comment_author
                    })
                
                # Add the video data into a dictionary 
                video_data = {
                    'title': title,
                    'author': video_author,  # The uploader of the video (channel name)
                    'ID': ID,
                    'date': published_at,
                    'keyword': keyword,
                    'comments': comments
                }
            except HttpError:
                # Any HttpError is treated here as "comments are disabled"; the video is skipped
                print(f"Comments are disabled for video {video_id}. Skipping...")
                continue
            ID += 1
            
            # Store the data
            youtubeData.append(video_data)


Comments are disabled for video 0dEm2lF2dH4. Skipping...
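
The YouTube Data API returns results one page at a time and documents a per-request cap on `maxResults` (50 for `search`, 100 for `commentThreads`), so one request cannot return 1000 items. Further pages are reached through the `nextPageToken` field of each response. The following is a minimal sketch of that loop, not the notebook's method: `collect_pages` and `fetch_page` are hypothetical names, and a fake in-memory fetcher stands in for the real API so the example runs offline.

```python
def collect_pages(fetch_page, max_items):
    """Follow nextPageToken until max_items are collected or pages run out.

    fetch_page(page_token) is a hypothetical callable that returns a dict
    with an 'items' list and, when more pages exist, a 'nextPageToken'.
    """
    items = []
    page_token = None
    while len(items) < max_items:
        response = fetch_page(page_token)
        items.extend(response.get('items', []))
        page_token = response.get('nextPageToken')
        if not page_token:
            break  # no further pages
    return items[:max_items]

# Stand-in for the API: three pages keyed by the token that requests them
_fake_pages = {
    None: {'items': [1, 2, 3], 'nextPageToken': 'p2'},
    'p2': {'items': [4, 5, 6], 'nextPageToken': 'p3'},
    'p3': {'items': [7]},
}

def fake_fetch(page_token):
    return _fake_pages[page_token]

print(collect_pages(fake_fetch, max_items=5))  # [1, 2, 3, 4, 5]
```

With the real client, `fetch_page` could wrap a call such as `youtubeClient.commentThreads().list(part='snippet', videoId=video_id, maxResults=100, pageToken=page_token).execute()`, assuming `youtubeClient` is a `googleapiclient` resource as the calls in the cell above suggest.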


Now display the number of videos collected and the total number of comments.

In [8]:
# Print the number of videos (the label says "posts" to mirror the Reddit summary)
print(f"Total number of posts: {len(youtubeData)}")

# Calculate the total number of comments
print(f"Total number of comments: {sum(len(video['comments']) for video in youtubeData)}")

Total number of posts: 109
Total number of comments: 6900


### Save The Data

Next, we save the collected data as JSON files, which will be used for preprocessing.

In [9]:
# Save Reddit data to JSON
with open('redditGamingData.json', 'w') as jsonFile:
    json.dump(redditData, jsonFile, indent=4)

# Save YouTube data to JSON
with open('youtubeGamingData.json', 'w') as jsonFile:
    json.dump(youtubeData, jsonFile, indent=4)

print("Reddit and YouTube data saved!!!")

Reddit and YouTube data saved!!!
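
Post titles and comments often contain non-ASCII characters. As a hedged variant of the save step, the sketch below writes with an explicit UTF-8 encoding and `ensure_ascii=False` so those characters stay readable in the file, then reads the file back to confirm the round trip. The helper names `save_json` and `load_json`, the sample record, and the temporary path are illustrative assumptions.

```python
import json
import os
import tempfile

def save_json(records, path):
    """Write records as indented UTF-8 JSON without escaping non-ASCII text."""
    with open(path, 'w', encoding='utf-8') as json_file:
        json.dump(records, json_file, indent=4, ensure_ascii=False)

def load_json(path):
    """Read records back from a UTF-8 JSON file."""
    with open(path, 'r', encoding='utf-8') as json_file:
        return json.load(json_file)

# Illustrative record only (not real collected data)
sample = [{'title': 'Café gaming night', 'comments': []}]
path = os.path.join(tempfile.mkdtemp(), 'sample.json')

save_json(sample, path)
print(load_json(path) == sample)  # True
```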
