<a href="https://colab.research.google.com/github/ezzy4me/youtube_comment_tm/blob/main/step_2%263_youtube_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 2&3. Preprocessing through YouTube Comments on Living Alone

The data, including YouTube comments stored in CSV format, is loaded from a specified drive path. Comments and titles are further cleaned, masked, and processed. Quantiles are computed for length-based filtering, and the dataset is sampled and concatenated. The final DataFrame is used to combine titles and comments.


## load CSV file


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# environment setting
import csv
import pandas as pd
import numpy as np
import re

In [8]:
# load data
from glob import glob

# Set up a csv path with a comment csv file
to_sc = glob('/content/drive/MyDrive/Colab Notebooks/eng_youtube/*.csv')
to_sc

['/content/drive/MyDrive/Colab Notebooks/eng_youtube/developed_infos.csv',
 '/content/drive/MyDrive/Colab Notebooks/eng_youtube/nomad_infos.csv',
 '/content/drive/MyDrive/Colab Notebooks/eng_youtube/harun_infos.csv',
 '/content/drive/MyDrive/Colab Notebooks/eng_youtube/stephie_infos.csv',
 '/content/drive/MyDrive/Colab Notebooks/eng_youtube/rusty_infos.csv']

In [9]:
total_len = 0
for i in range(len(to_sc)):
    filename = to_sc[i]

    # Try to read the file using utf8 encoding. If that fails, use 'utf-8-sig' encoding instead.
    try:
        globals()[f'df_{i}'] = pd.read_csv(to_sc[i], encoding='utf8')  # Load the CSV file into a DataFrame with UTF-8 encoding.
    except:
        globals()[f'df_{i}'] = pd.read_csv(to_sc[i], encoding='utf-8-sig')  # If there's an encoding error, reload the CSV file using 'utf-8-sig' encoding. ex)koean

    print(f'length of {filename}: ', len(globals()[f'df_{i}']))  # Print the length of the DataFrame (i.e., the number of rows).
    total_len += len(globals()[f'df_{i}'])  # Add the length of the DataFrame to the total length.

print('total length of comments: ', total_len)

length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/developed_infos.csv:  745
length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/nomad_infos.csv:  259
length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/harun_infos.csv:  645
length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/stephie_infos.csv:  380
length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/rusty_infos.csv:  9030
total length of comments:  11059


# 2. cleaning&preprocessing

### Describtion of Preprocessing
1. **Basic Cleaning for Comments and Titles**

2. **Text Length Filtering**: Filter with results for statistics in sentences and titles with short tokens

3. **LogSamping**
    - Sampling according to '% of log comments' to ease imbalancing
    - In principle, raw data is imported from channels that are higher than raw data (not in this example)
    - Approximately 700,000 -> 20,000


4. **Text Concatanation** : [title;comment]




### summary of length changes
### after 1. Basic Cleaning for comments

- developed_infos.csv:  80033
- nomad_infos.csv:  13430
- harun_infos.csv:  98262
- stephie_infos.csv:  14971
- rusty_infos.csv:  287215

### after 2. Text Length Filtering
- developed_infos.csv:  30140
- nomad_infos.csv:  3452
- harun_infos.csv:  36386
- stephie_infos.csv:  6096
- rusty_infos.csv:  33486

### after 3. Long sampling
- developed_infos.csv:  4289
- nomad_infos.csv:  3388
- harun_infos.csv:  4367
- stephie_infos.csv:  3624
- rusty_infos.csv:  4332

### 1. Basic Cleaning for Comments and Titles

- a) **Basic Cleaning**: Removal of emojis, numbers, and other non-essential characters.

- b) **Duplicate Removal**: Removal of duplicate comments reduces the dataset's size. 700,953 -> 493,920

- c) **URL Removal**: Comments containing "http" or "www" are removed. -> 493,911

In [10]:
import pandas as pd
import re

def remove_special_characters(text):
    # Remove special characters except for certain punctuation, remove digits, and convert to lowercase.
    cleaned_text = re.sub(r'[^\w\s\'!,.?]', '', text)
    cleaned_text = re.sub(r'\d', '', cleaned_text)
    cleaned_text = cleaned_text.lower()
    return cleaned_text

def replace_single_u(comment, replacement='you'):
    # Replace standalone 'u' with 'you' (or another specified replacement).
    u_pattern = re.compile(r'\bu\b')  # Find instances where 'u' appears independently.
    return u_pattern.sub(replacement, comment)

def remove_pattern(comment, pattern):
    # Remove all instances of a specified pattern from the comment.
    pattern = re.compile(pattern)
    return pattern.sub('', comment)


In [11]:
total_len = 0

for i in range(len(to_sc)):

    # Apply text cleaning functions on the 'comments' column.
    globals()[f'df_{i}']['comments'] = globals()[f'df_{i}']['comments'].apply(replace_single_u)  # Replace 'u' with 'you'.
    globals()[f'df_{i}']['comments'] = globals()[f'df_{i}']['comments'].apply(lambda x: remove_pattern(x, r'<[^>]*?>'))  # Remove HTML tags.
    globals()[f'df_{i}']['comments'] = globals()[f'df_{i}']['comments'].apply(remove_special_characters)  # Remove special characters.

    # Apply text cleaning on the 'title' column.
    globals()[f'df_{i}']['title'] = globals()[f'df_{i}']['title'].apply(remove_special_characters)  # Remove special characters.

    # Drop rows with NaN values in the 'comments' column.
    globals()[f'df_{i}'] = globals()[f'df_{i}'].dropna(subset=['comments'])

    # Print the length of comments for each YouTube channel.
    print(f'length of {to_sc[i]}: ', len(globals()[f'df_{i}']))
    total_len += len(globals()[f'df_{i}'])  # Update the total length of comments.

print('total length of comments: ', total_len)  # Print the total length of comments.


length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/developed_infos.csv:  745
length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/nomad_infos.csv:  259
length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/harun_infos.csv:  645
length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/stephie_infos.csv:  380
length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/rusty_infos.csv:  9030
total length of comments:  11059


In [12]:
df_1.head()

Unnamed: 0,title,url,comments
0,remember these things....,https://www.youtube.com/watch?v=hJLUl8QEHJQ,'subscribe amp join the channel click the join...
1,stop investing....,https://www.youtube.com/watch?v=ftC_KPLzdSA,'subscribe amp join the channel click the join...
2,sold... k van stealth overland motorhome,https://www.youtube.com/watch?v=vISmsBVYWFM,'subscribe amp join the channel click the join...
3,no hombase... no bueno! living in a car or li...,https://www.youtube.com/watch?v=EM_rzCtuUwA,'subscribe amp join the channel click the join...
4,a regular car better than a rv???? living in ...,https://www.youtube.com/watch?v=JJqiKkkiHMw,'subscribe amp join the channel click the join...


In [13]:
df_1['comments'][20]

"'subscribe amp join the channel click the join button for unlimited live chat access, a membership badge and custom emojis.if you cant see or use the join button on your phone try using your laptop and click this link httpswww.youtube.comchannelucwd_qzlspwwzshupuzqjoinall my youtube videos are free..xaif you would like to support the channel in an additional way you can contribute via paypal by clicking this linkxahttpswww.paypal.meinspirationalnomadthank you for the love amp support ', '', 'thank you brother.. keep pushing forward ', 'need deep pockets for mercedes repairs', 'true, definitely a factor that should be considered'"

In [15]:
import shlex

In [16]:
total_len = 0

for i in range(len(to_sc)):  # Iterate over each item in to_sc.
    print(i)
    # Split each comment into separate words using shlex.split, which preserves quoted substrings.
    globals()[f'df_{i}']['comments'] = globals()[f'df_{i}']['comments'].apply(lambda x: shlex.split(x))
    # Explode each list of words into separate rows.
    globals()[f'df_{i}'] = globals()[f'df_{i}'].explode('comments')
    print(f'length of {to_sc[i]}: ', len(globals()[f'df_{i}']))

    # Drop duplicate comments.
    globals()[f'df_{i}'] = globals()[f'df_{i}'].drop_duplicates(subset='comments')
    # Replace empty strings in the 'comments' column with NaN.
    globals()[f'df_{i}']['comments'].replace('', np.nan, inplace=True)
    # Drop rows with NaN in the 'comments' column.
    globals()[f'df_{i}'].dropna(subset=['comments'], inplace=True)
    # Reset the DataFrame's index.
    globals()[f'df_{i}'] = globals()[f'df_{i}'].reset_index(drop=True)

    # Print the number of comments after removing duplicates for each YouTube channel.
    print(f'after drop_duplicates_length of {to_sc[i]}: ', len(globals()[f'df_{i}']))
    total_len += len(globals()[f'df_{i}'])  # Update the total length of comments.

print('total length of comments: ', total_len)  # Print the total length of comments.


0
length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/developed_infos.csv:  87203
after drop_duplicates_length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/developed_infos.csv:  80033
1
length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/nomad_infos.csv:  15946
after drop_duplicates_length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/nomad_infos.csv:  13430
2
length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/harun_infos.csv:  107807
after drop_duplicates_length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/harun_infos.csv:  98262
3


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  globals()[f'df_{i}']['comments'].replace('', np.nan, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  globals()[f'df_{i}'].dropna(subset=['comments'], inplace=True)


length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/stephie_infos.csv:  16293
after drop_duplicates_length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/stephie_infos.csv:  14971
4
length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/rusty_infos.csv:  473704
after drop_duplicates_length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/rusty_infos.csv:  287215
total length of comments:  493911


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  globals()[f'df_{i}']['comments'].replace('', np.nan, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  globals()[f'df_{i}'].dropna(subset=['comments'], inplace=True)


In [17]:
df_1['comments'][0]

'subscribe amp join the channel click the join button for unlimited live chat access, a membership badge and custom emojis.if you cant see or use the join button on your phone try using your laptop and click this link httpswww.youtube.comchannelucwd_qzlspwwzshupuzqjoinall my youtube videos are free..xaif you would like to support the channel in an additional way you can contribute via paypal by clicking this linkxahttpswww.paypal.meinspirationalnomadthank you for the love amp support ,'

In [18]:
def remove_http_www(text):
    # Using regex, this function removes any URL that starts with 'http' or 'www' from the input text.
    pattern = r'\b(?:http\S*|www\S*)\b'
    return re.sub(pattern, '', text)

def mask_youtube_names(comment, names, mask='youtuber'):
    # This function replaces each instance of a YouTuber's name in the comment with the word 'youtuber'.
    for name in names:
        name_pattern = re.compile(re.escape(name))
        comment = name_pattern.sub(mask, comment)
    return comment

def process_string(s):
    # This function performs a series of replacements in the string:
    # replaces ',.' and '.,' with '.', '....' (or more dots) with '.', two or more spaces with one, and two or more '!' with one.
    s = re.sub('.,', '.', s)
    s = re.sub(',.', '.', s)
    s = re.sub('\.{2,}', '.', s)
    s = re.sub(' {2,}', ' ', s)
    s = re.sub('\!{2,}', '!', s)
    return s

In [19]:
import numpy as np

total_len = 0
log_sample = []
total_log = 0
all_dfs = []

for i in range(len(to_sc)):

    # The following blocks of code process both 'comments' and 'title' in similar ways:
    # remove URLs, mask YouTuber names, process strings, and compute the length of the processed string.
    # 'comments'
    names = ['sam', 'steph', 'david', 'rusty', 'aysha'] # predefine the youtuber names
    globals()[f'df_{i}']['comments'] = globals()[f'df_{i}']['comments'].apply(remove_http_www)
    globals()[f'df_{i}']['comments'] = globals()[f'df_{i}']['comments'].apply(lambda x : mask_youtube_names(comment=x, names = names))
    globals()[f'df_{i}']['comments'] = globals()[f'df_{i}']['comments'].apply(process_string)
    globals()[f'df_{i}']['comments_length'] = globals()[f'df_{i}']['comments'].str.len()

    # 'title'
    globals()[f'df_{i}']['title'] = globals()[f'df_{i}']['title'].apply(remove_http_www)
    globals()[f'df_{i}']['title'] = globals()[f'df_{i}']['title'].apply(lambda x : mask_youtube_names(comment=x, names = names))
    globals()[f'df_{i}']['title'] = globals()[f'df_{i}']['title'].apply(process_string)
    globals()[f'df_{i}']['title_length'] = globals()[f'df_{i}']['title'].str.len()

    # Print the length of the current dataframe and add it to the total length.
    print(f'length of {to_sc[i]}: ', len(globals()[f'df_{i}']))
    total_len += len(globals()[f'df_{i}'])

    # Append the current dataframe to the list of all dataframes.
    all_dfs.append(globals()[f'df_{i}'])

# Create a new dataframe by concatenating all dataframes in the list and resetting the index.
new_df = pd.concat(all_dfs).reset_index(drop=True)

# Filter the data based on the computed quantiles for 'comments_length' and 'title_length'.
lower_quantile = new_df['comments_length'].quantile(0.50)
high_quantile = new_df['comments_length'].quantile(0.99)
lower_quantile_title = new_df['title_length'].quantile(0.50)
high_quantile_title = new_df['title_length'].quantile(0.99)

length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/developed_infos.csv:  80033
length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/nomad_infos.csv:  13430
length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/harun_infos.csv:  98262
length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/stephie_infos.csv:  14971
length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/rusty_infos.csv:  287215


### 2. **Text Length Filtering**
- Filter with results for statistics in sentences and titles with short tokens

In [20]:
for i in range(len(to_sc)):

    # Filter 'comments_length' and 'title_length' in the dataframe between the lower and high quantile, both inclusive.
    globals()[f'df_{i}'] = globals()[f'df_{i}'][globals()[f'df_{i}']['comments_length'] > lower_quantile]
    globals()[f'df_{i}'] = globals()[f'df_{i}'][globals()[f'df_{i}']['comments_length'] < high_quantile]
    globals()[f'df_{i}'] = globals()[f'df_{i}'][globals()[f'df_{i}']['title_length'] > lower_quantile_title]
    globals()[f'df_{i}'] = globals()[f'df_{i}'][globals()[f'df_{i}']['title_length'] < high_quantile_title]

    # Print the length of the current dataframe.
    print(f'length of {to_sc[i]}: ', len(globals()[f'df_{i}']))

    # Append the log of the length of the current dataframe to the log_sample list.
    log_sample.append(np.log(len(globals()[f'df_{i}'])))

    # Print the log length of the current dataframe.
    print(f'log length of {to_sc[i]}: ', np.log(len(globals()[f'df_{i}'])))

    # Add the log length of the current dataframe to the total log.
    total_log += np.log(len(globals()[f'df_{i}']))


length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/developed_infos.csv:  30140
log length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/developed_infos.csv:  10.313608472180485
length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/nomad_infos.csv:  3452
log length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/nomad_infos.csv:  8.146709052203319
length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/harun_infos.csv:  36386
log length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/harun_infos.csv:  10.50193936425675
length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/stephie_infos.csv:  6096
log length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/stephie_infos.csv:  8.715388097366482
length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/rusty_infos.csv:  33486
log length of /content/drive/MyDrive/Colab Notebooks/eng_youtube/rusty_infos.csv:  10.418882720016489


In [21]:
# '20000' : This scales these proportions up by a factor of '20000',
# which is presumably the total number of samples that will be drawn in the next step of the process.
log_sample_cnt = np.round((log_sample / total_log) * 20000).astype(int)

In [22]:
log_sample_cnt

array([4289, 3388, 4367, 3624, 4332])

### 3. LogSampling

In [23]:
# it samples a subset of rows without replacement, resulting in a list of filtered and potentially down-sampled dataframes, which is stored in filtered_dfs.
filtered_dfs = []

for i, size in enumerate(log_sample_cnt):
    if len(globals()[f'df_{i}']) < log_sample_cnt[i]:
        sampled_df = globals()[f'df_{i}']
    else:
        sampled_df = globals()[f'df_{i}'].sample(n=size, replace=False, random_state=42)  # replace=False -> avoid duplicate

    filtered_dfs.append(sampled_df)

### 4. **Text Concatanation**
- Proceed with [Video Title; Comments] process for data that went through the previous process**


In [25]:
# Create filtered data frames
filtered_df = pd.concat(filtered_dfs).reset_index(drop=True)

# Concatanate table and comments columns
filtered_df['title_comments'] = filtered_df['title'] + " " + filtered_df['comments']

In [26]:
filtered_df

Unnamed: 0,title,url,comments,comments_length,title_length,title_comments
0,be more judgmental! society is keeping you weak.,https://www.youtube.com/watch?v=EJjWC0f2fKk,im a lonely person the only thing im doing is ...,110,48,be more judgmental! society is keeping you wea...
1,dating coaches are lying to you! when will you...,https://www.youtube.com/watch?v=JBcGXR-Usec,hey youtuber would love if you can make a vide...,119,55,dating coaches are lying to you! when will you...
2,have no friends! the world has lied to you.,https://www.youtube.com/watch?v=6NaTTs3KTI0,god lord when you are in the flow centered in ...,187,43,have no friends! the world has lied to you. go...
3,why i'm staying single for life! it's just not...,https://www.youtube.com/watch?v=liNBlxz61Is,i ride alone. so many ppl are afraid of being ...,264,56,why i'm staying single for life! it's just not...
4,biblical masculinity vs. red pill! the tides a...,https://www.youtube.com/watch?v=OpscWwKkygg,i might yes! the entire episode is always avai...,86,58,biblical masculinity vs. red pill! the tides a...
...,...,...,...,...,...,...
19995,future plans?.maybe a small travel trailer???,https://www.youtube.com/watch?v=cA5SX82Oz2E,im with yo.cooking outside isnt what its made ...,102,45,future plans?.maybe a small travel trailer??? ...
19996,rpod or winnebago drop or jayco hummingbird???,https://www.youtube.com/watch?v=kUu0bfIBkBQ,now youre talking.this sounds like a plan! im ...,294,46,rpod or winnebago drop or jayco hummingbird???...
19997,an easy bed solution for rv or home.and more!,https://www.youtube.com/watch?v=uHnnin8-Tv0,wo. no junk or debris cluttering up the are. n...,77,45,an easy bed solution for rv or home.and more! ...
19998,cooking with youtuber.mostly using instant pot,https://www.youtube.com/watch?v=LtX-jRW584Y,good morning rangers and the stinking goat !.g...,124,46,cooking with youtuber.mostly using instant pot...


In [27]:
len(filtered_df)

20000

In [33]:
# save as csv
save_path = '/content/drive/MyDrive/Colab Notebooks/eng_youtube/'
filtered_df.to_csv(path_or_buf=save_path + 'model_input.csv')

In [34]:
check_file = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/eng_youtube/model_input.csv')