### Text Preprocessing

This script preprocesses the text of the Reddit and YouTube data that we collected and stored in `json` files. The main goal is to clean the data by removing unwanted elements such as URLs and mentions. Below is a step-by-step breakdown of the process.


### Import Libraries

Import necessary libraries for text processing and regular expression operations.

In [1]:
# import necessary libraries
import json
import re

from collections import Counter
from datetime import datetime

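import nltk

# Download the NLTK resources used below (quick if they are already installed)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer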

### Load the Data

Load the data that we collected from Reddit and YouTube from the `json` files.

In [2]:
# Load the data from the JSON file
with open('redditGamingData.json', 'r') as jsonFile:
    redditData = json.load(jsonFile)
    
with open('youtubeGamingData.json', 'r') as jsonFile:
    youtubeData = json.load(jsonFile)
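
The cells that follow rely on a particular record layout. As a rough sketch inferred from the keys accessed later in this script (`title`, `date`, `ID`, `keyword`, an optional `score`, and `comments` with `comment_author` and `comment_body`), each file is assumed to hold a list of records like the one below. All values are made up for illustration, and the real files may carry additional fields.

```python
# Hypothetical record: the keys mirror those accessed later in this script,
# the values are invented for illustration only.
sample_submission = {
    "title": "Best co-op games this year?",
    "date": "2024-01-01",   # the actual date format in the files is not shown here
    "ID": 1,
    "keyword": "gaming",
    "score": 42,            # optional: later code only checks `'score' in submission`
    "comments": [
        {"comment_author": "user_a", "comment_body": "Check https://example.com/list"},
        {"comment_author": "user_b", "comment_body": "Thanks @user_a!"},
    ],
}

sample_data = [sample_submission]

# The same aggregate used in the overview cell: total comments across all records
total_comments = sum(len(s["comments"]) for s in sample_data)
print(total_comments)
```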

### Data Overview

Let's check some basic information about the data, such as the number of posts and comments, the number of words, the number of URLs, and the number of mentions.

In [3]:
# Print the number of posts and comments in Reddit data
print("Reddit:")
print(f"Total number of posts: {len(redditData)}")
print(f"Total number of comments: {sum(len(submission['comments']) for submission in redditData)}")

# Print the number of posts and comments in YouTube data
print("\nYouTube:")
print(f"Total number of posts: {len(youtubeData)}")
print(f"Total number of comments: {sum(len(videos['comments']) for videos in youtubeData)}")

Reddit:
Total number of posts: 1269
Total number of comments: 44401

YouTube:
Total number of posts: 109
Total number of comments: 6900


Define a function `count_words` to count the number of words in the titles and comments separately.

In [4]:
# Function to count the number of words
def count_words(data):
    
    # initialise the counts with 0
    subWordCount = 0
    comWordCount = 0

    # Iterate through all submissions to count the words in each submission
    for submission in data:
        
        # Count words in the title
        subWordCount += len(submission['title'].split())
    
        # count words in the comments
        for comment in submission['comments']:
            comWordCount += len(comment['comment_body'].split())

    print(f"Total number of words in posts: {subWordCount}")
    print(f"Total number of words in comments: {comWordCount}")
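
Because `count_words` only prints its totals, comparing the counts before and after preprocessing means reading them off the output. A possible variant, sketched here under the hypothetical name `word_counts`, returns the two totals instead so they can be stored and compared in code. It uses the same whitespace-based `split()` definition of a word.

```python
def word_counts(data):
    """Return (title_words, comment_words) using whitespace tokenisation."""
    title_words = sum(len(s["title"].split()) for s in data)
    comment_words = sum(
        len(c["comment_body"].split())
        for s in data
        for c in s["comments"]
    )
    return title_words, comment_words


# Toy data with the same keys as the real records
toy = [
    {
        "title": "two words",
        "comments": [{"comment_body": "three more words"}, {"comment_body": ""}],
    }
]
print(word_counts(toy))  # (2, 3)
```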

In [5]:
# Print the number of words in posts and comments in Reddit data before preprocessing
print("Reddit:")
count_words(redditData)

# Print the number of words in posts and comments in YouTube data before preprocessing
print("\nYouTube:")
count_words(youtubeData)

Reddit:
Total number of words in posts: 19306
Total number of words in comments: 1963998

YouTube:
Total number of words in posts: 1035
Total number of words in comments: 124575


Define a function `count_urls` to count the number of URLs present in the data.

In [6]:
# Function to count the number of URLs in the data
def count_urls(data):
    # Regular expression to match URLs
    url_pattern = r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+|www\.(?:[-\w.]|(?:%[\da-fA-F]{2}))+'
    
    total_urls = 0

    for submission in data:
        # Count URLs in the title
        total_urls += len(re.findall(url_pattern, submission['title'], flags=re.MULTILINE))
    
        # Count URLs in the comments
        for comment in submission['comments']:
            total_urls += len(re.findall(url_pattern, comment['comment_body'], flags=re.MULTILINE))

    print(f"Total number of URLs in the data: {total_urls}")
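
It may help to see what this pattern actually matches. The character class after the scheme covers host-name characters only, so a match stops at the first `/`, `?` or `=`. That makes no difference when counting URLs, but it matters wherever the same pattern is used for removal (as in `preprocess_text` later), because the path part stays in the text. The sketch below compares it with a broader pattern, `GREEDY_URL_PATTERN`, which is our own assumption and not part of the script.

```python
import re

# Pattern copied from count_urls
URL_PATTERN = r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+|www\.(?:[-\w.]|(?:%[\da-fA-F]{2}))+'

# Alternative (an assumption, not the script's pattern): match up to the next whitespace
GREEDY_URL_PATTERN = r'https?://\S+|www\.\S+'

sample = "see https://example.com/watch?v=abc123 and www.example.org/page"

host_only = re.findall(URL_PATTERN, sample)
full_urls = re.findall(GREEDY_URL_PATTERN, sample)
leftover = re.sub(URL_PATTERN, '', sample)

print(host_only)  # ['https://example.com', 'www.example.org']
print(full_urls)  # ['https://example.com/watch?v=abc123', 'www.example.org/page']
print(leftover)   # see /watch?v=abc123 and /page
```

The broader pattern has its own trade-off: `\S+` also swallows punctuation that directly follows a URL, such as a closing parenthesis or a full stop.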

In [7]:
# Print the number of urls in posts and comments in Reddit data before preprocessing
print("Reddit:")
count_urls(redditData)

# Print the number of urls in posts and comments in YouTube data before preprocessing
print("\nYouTube:")
count_urls(youtubeData)

Reddit:
Total number of URLs in the data: 2259

YouTube:
Total number of URLs in the data: 21


The above results indicate that the data contains URLs, which need to be removed.

Define a function `check_mentions` to check for mentions (for example, `@username`) in the data.

In [8]:
def check_mentions(data):
    pattern = r'@\w+'  # Pattern to match mentions like @username
    mentions = []  # List to store all found mentions
    
    for submission in data:
        # Check for mentions in the title
        title_mentions = re.findall(pattern, submission['title'])
        if title_mentions:
            mentions.extend(title_mentions)
            print(f"Mentions found in post ID {submission['ID']} (Title): {title_mentions}")
        
        # Check for mentions in the comments
        for comment in submission['comments']:
            comment_mentions = re.findall(pattern, comment['comment_body'])
            if comment_mentions:
                mentions.extend(comment_mentions)
                print(f"Mentions found in post ID {submission['ID']} (Comment Author: {comment['comment_author']}): {comment_mentions}")
    
    print(f"\nTotal number of mentions found: {len(mentions)}")
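
The pattern `@\w+` matches any `@` followed by word characters, so it also picks up fragments that are probably not user mentions, such as the domain part of an e-mail address. The sketch below contrasts it with a stricter variant that requires the `@` not to follow a word character. `STRICT_MENTION_PATTERN` is a hypothetical alternative for illustration, not something the script uses.

```python
import re

MENTION_PATTERN = r'@\w+'                 # pattern used by check_mentions
STRICT_MENTION_PATTERN = r'(?<!\w)@\w+'   # hypothetical: '@' must not follow a word character

sample = "mail user@example.com or ping @alice"

loose = re.findall(MENTION_PATTERN, sample)
strict = re.findall(STRICT_MENTION_PATTERN, sample)

print(loose)   # ['@example', '@alice']
print(strict)  # ['@alice']
```

Since mentions are removed rather than analysed, the looser pattern is a reasonable choice for cleaning. The distinction matters mainly when interpreting the reported number of mentions.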

In [9]:
# Check for mentions in Reddit data
print("Checking for mentions in Reddit data:")
check_mentions(redditData)

# Check for mentions in YouTube data
print("\nChecking for mentions in YouTube data:")
check_mentions(youtubeData)

Checking for mentions in Reddit data:
Mentions found in post ID 48 (Comment Author: PandaCheese2016): ['@Civilization']
Mentions found in post ID 100 (Comment Author: AutoModerator): ['@openai']
Mentions found in post ID 116 (Comment Author: wanderingnexus): ['@ss']
Mentions found in post ID 161 (Comment Author: AutoModerator): ['@openai']
Mentions found in post ID 185 (Comment Author: ChocolateAxis): ['@zigzagame']
Mentions found in post ID 219 (Comment Author: AutoModerator): ['@openai']
Mentions found in post ID 348 (Comment Author: PCMRBot): ['@home']
Mentions found in post ID 377 (Comment Author: PrepperLady999): ['@ExpertlyAmateur']
Mentions found in post ID 382 (Comment Author: Affectionate_West725): ['@Tammy']
Mentions found in post ID 653 (Comment Author: TELETUBB13S): ['@plhought']
Mentions found in post ID 697 (Comment Author: Intrepid-Extent6611): ['@r']
Mentions found in post ID 775 (Comment Author: AutoModerator): ['@hiraedu']
Mentions found in post ID 799 (Comment Author

From the above overview, we found that our data contains URLs and mentions, which are irrelevant for our analysis and will be removed.


### Extracting Hashtags

The data contains hashtags (`#`), which are used on Reddit and YouTube for several reasons, such as increasing audience reach and categorizing content. Let's extract these hashtags and check the 10 most popular ones.

For that, we create a function `extract_hashtags` that finds these hashtags and returns the 10 most common ones.

In [10]:
def extract_hashtags(data):
    hashtags = []
    hashtag_pattern = r'#\b[a-zA-Z]+\b'
    
    for submission in data:
        hashtags += re.findall(hashtag_pattern, submission['title'])
        for comment in submission['comments']:
            hashtags += re.findall(hashtag_pattern, comment['comment_body'])
    
    return Counter(hashtags).most_common(10)
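
To make the behaviour of the hashtag pattern concrete, the sketch below runs it on a few made-up strings. It only matches a `#` immediately followed by letters, and the trailing `\b` means a tag followed by digits is skipped entirely while a word with an apostrophe is cut short. The last point may explain entries such as `#Don` in the Reddit results, although that is a guess and not something we verified against the data.

```python
import re

HASHTAG_PATTERN = r'#\b[a-zA-Z]+\b'  # pattern used by extract_hashtags

# Made-up sample strings
samples = [
    "#gaming is fun",
    "#Don't do this",
    "#Do not spam",
    "# Heading",
    "#gaming2024",
    "issue #42",
]

matches = {text: re.findall(HASHTAG_PATTERN, text) for text in samples}

for text, found in matches.items():
    print(f"{text!r:<20} -> {found}")
```

Only the first three strings produce a match (`#gaming`, `#Don` and `#Do`); the other three yield an empty list.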

In [11]:
print("Top 10 hashtags in Reddit data:")
print(f"{'Hashtag':<30}{'Count'}")
print("-" * 50)
for hashtag, count in extract_hashtags(redditData):
    print(f"{hashtag:<30}{count}")

print("\nTop 10 hashtags in YouTube data:")
print(f"{'Hashtag':<30}{'Count'}")
print("-" * 50)
for hashtag, count in extract_hashtags(youtubeData):
    print(f"{hashtag:<30}{count}")

Top 10 hashtags in Reddit data:
Hashtag                       Count
--------------------------------------------------
#Do                           140
#Don                          10
#Subreddit                    10
#If                           3
#MOAR                         3
#advanced                     2
#mods                         2
#button                       2
#cyberbullying                2
#survey                       2

Top 10 hashtags in YouTube data:
Hashtag                       Count
--------------------------------------------------
#gaming                       9
#shorts                       6
#vr                           3
#clips                        2
#minecraft                    2
#overwatch                    2
#games                        2
#gameplay                     2
#funny                        2
#gameshorts                   2


### Data Preprocessing

This step includes:
* Convert text to lowercase.
* Remove URLs.
* Remove mentions.
* Remove non-alphabetic characters.
* Remove stopwords.
* Lemmatize words to their base form.

We create a function `preprocess_text` to carry out this process.

In [12]:
# Build the stop-word set and the lemmatizer once, rather than on every call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

# Function to preprocess text
def preprocess_text(text):

    # Convert text to lowercase
    text = text.lower()
    
    # Remove URLs
    text = re.sub(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+|www\.(?:[-\w.]|(?:%[\da-fA-F]{2}))+', '', text, flags=re.MULTILINE)
    
    # Remove mentions (e.g., @username)
    text = re.sub(r'@\w+', '', text)
    
    # Keep only alphabetic characters and whitespace
    text = re.sub(r'[^a-z\s]', '', text)
    
    # Remove stopwords
    words = [word for word in text.split() if word not in STOP_WORDS]
    
    # Lemmatize words
    lemmatized_words = [LEMMATIZER.lemmatize(word) for word in words]
    
    return ' '.join(lemmatized_words)
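
The order of the steps has a side effect worth knowing about. Stop words are filtered after punctuation has been stripped, so a contraction like "don't" has already become "dont" and no longer equals a stop-word entry that keeps its apostrophe (NLTK's English list stores such contractions with the apostrophe, though this is worth checking against the installed version). The sketch below reproduces the regex stages with the standard library only. It uses a small stand-in stop-word set and the hypothetical helper `clean_tokens`, and it leaves out URL removal and lemmatization.

```python
import re

# Stand-in stop-word set for illustration only; the script uses NLTK's English list
STOP_WORDS = {"don't", "s", "at", "the"}


def clean_tokens(text, stop_words):
    """Lowercase, drop mentions and non-letters, then filter stop words."""
    text = text.lower()
    text = re.sub(r'@\w+', '', text)       # remove mentions
    text = re.sub(r'[^a-z\s]', '', text)   # keep letters and whitespace only
    return [w for w in text.split() if w not in stop_words]


sample = "Don't miss @alice's 10 tips!!!"

as_is = clean_tokens(sample, STOP_WORDS)
print(as_is)  # ['dont', 'miss', 'tips']

# Possible remedy: strip non-letters from the stop words too, so "don't" is stored as "dont"
normalized_stop_words = {re.sub(r'[^a-z]', '', w) for w in STOP_WORDS}
normalized = clean_tokens(sample, normalized_stop_words)
print(normalized)  # ['miss', 'tips']
```

Whether the leftover tokens matter depends on the downstream analysis. For frequency-based methods, words such as "dont" or "im" can end up among the most common terms.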

Define a function `clean_data` that goes through the title and comments of each post and preprocesses them using the `preprocess_text` function.

In [13]:
def clean_data(data):
    # Create an empty list to store clean data
    cleanData = []

    # Iterate through submissions to preprocess the data
    for submission in data:
        # Process title
        title = preprocess_text(submission['title'])

        # Add only if title is present
        if not title.strip():
            continue

        # Process comments, copying each one so the input data is not modified
        comments = []
        for comment in submission['comments']:
            body = preprocess_text(comment['comment_body'])

            # Add only if comment is present
            if body.strip():
                comments.append({**comment, 'comment_body': body})

        cleanSubmission = {
            'title': title,
            'date': submission['date'],
            'ID': submission['ID'],
            'keyword': submission['keyword']
        }

        # Keep the score when the submission has one
        if 'score' in submission:
            cleanSubmission['score'] = submission['score']

        cleanSubmission['comments'] = comments
        cleanData.append(cleanSubmission)

    return cleanData

In [14]:
# Preprocess Reddit data
cleanRedditData = clean_data(redditData)

# Preprocess YouTube data
cleanYTData = clean_data(youtubeData)

Now let's check the basic information about the data after preprocessing.

In [15]:
# Print the number of posts and comments in Reddit data after preprocessing
print("Reddit:")
print(f"Total number of posts: {len(cleanRedditData)}")
print(f"Total number of comments: {sum(len(submission['comments']) for submission in cleanRedditData)}")

# Print the number of posts and comments in YouTube data after preprocessing
print("\nYouTube:")
print(f"Total number of posts: {len(cleanYTData)}")
print(f"Total number of comments: {sum(len(videos['comments']) for videos in cleanYTData)}")

Reddit:
Total number of posts: 1269
Total number of comments: 44272

YouTube:
Total number of posts: 109
Total number of comments: 6768


The number of comments has decreased (by 129 for Reddit and 132 for YouTube), because comments that were left empty after preprocessing were dropped.

In [16]:
# Print the number of words in posts and comments in Reddit data after preprocessing
print("Reddit:")
count_words(cleanRedditData)

# Print the number of words in posts and comments in YouTube data after preprocessing
print("\nYouTube:")
count_words(cleanYTData)

Reddit:
Total number of words in posts: 10852
Total number of words in comments: 999306

YouTube:
Total number of words in posts: 691
Total number of words in comments: 68480


Because preprocessing removes irrelevant items such as URLs, mentions, non-alphabetic characters, and stopwords, the number of words in both titles and comments has decreased.
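
To put the decrease in perspective, the reported counts can be turned into percentages. The figures below are copied from the outputs of the word-count cells before and after preprocessing; `reduction_pct` is a small helper introduced here for the arithmetic.

```python
# Word counts reported by the cells above: (before, after)
reported = {
    "Reddit posts":     (19306, 10852),
    "Reddit comments":  (1963998, 999306),
    "YouTube posts":    (1035, 691),
    "YouTube comments": (124575, 68480),
}


def reduction_pct(before, after):
    """Percentage of words removed, rounded to one decimal place."""
    return round(100 * (before - after) / before, 1)


reductions = {name: reduction_pct(b, a) for name, (b, a) in reported.items()}

for name, pct in reductions.items():
    print(f"{name:<18}{pct}%")
```

This works out to roughly 44% and 49% fewer words in Reddit posts and comments, and roughly 33% and 45% fewer in YouTube posts and comments.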

In [17]:
# Print the number of URLs in posts and comments in Reddit data after preprocessing
print("Reddit:")
count_urls(cleanRedditData)

# Print the number of URLs in posts and comments in YouTube data after preprocessing
print("\nYouTube:")
count_urls(cleanYTData)

Reddit:
Total number of URLs in the data: 0

YouTube:
Total number of URLs in the data: 0


In [18]:
# Check for mentions in Reddit data
print("Checking for mentions in Reddit data:")
check_mentions(cleanRedditData)

# Check for mentions in YouTube data
print("\nChecking for mentions in YouTube data:")
check_mentions(cleanYTData)

Checking for mentions in Reddit data:

Total number of mentions found: 0

Checking for mentions in YouTube data:

Total number of mentions found: 0


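We can also observe that no URL or mention pattern matches in the cleaned data. This is expected, since `preprocess_text` keeps only lowercase letters and whitespace, so characters such as `:`, `/`, `.` and `@` no longer occur.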

### Save the Data
Next, we save the preprocessed data as `json` files for use in later steps.

In [19]:
# Save preprocessed Reddit data to JSON
with open('preprocessedRedditData.json', 'w') as jsonFile:
    json.dump(cleanRedditData, jsonFile, indent=4)

# Save preprocessed YouTube data to JSON
with open('preprocessedYoutubeData.json', 'w') as jsonFile:
    json.dump(cleanYTData, jsonFile, indent=4)

print("Reddit and YouTube data saved!!!")

Reddit and YouTube data saved!!!
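
A quick way to gain confidence in the saved files is a round-trip check: write the data, read it back, and compare. The sketch below does this on a toy record in a temporary directory. `save_json` and `load_json` are hypothetical helpers, and the explicit `encoding='utf-8'` with `ensure_ascii=False` is our own choice to keep non-ASCII author names readable in the file, not something the script does.

```python
import json
import os
import tempfile


def save_json(data, path):
    """Write data as UTF-8 JSON (hypothetical helper, not part of the script)."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=4, ensure_ascii=False)


def load_json(path):
    """Read UTF-8 JSON back into Python objects."""
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)


# Toy record standing in for the cleaned data
toy = [
    {
        "title": "best coop game",
        "ID": 1,
        "comments": [{"comment_author": "José", "comment_body": "thanks"}],
    }
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.json")
    save_json(toy, path)
    round_trip_ok = load_json(path) == toy

print(round_trip_ok)  # True
```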
