# DSCI 511: Data Acquisition and Pre-Processing <br> Term Project Phase 2: Pre-processing Valorant Comments for Sentiment Analysis

## Group members 
- Group member 
    - Name: Amira Bendjama
    - Email: ab4745@drexel.edu
- Group member 
    - Name: Nicole Padilla 
    - Email: np858@drexel.edu

# Data pre-processing for sentiment analysis

The goal of collecting the Valorant comment data from the YouTube API is to make it available for sentiment analysis. Such an analysis may interest marketers of games and of products targeted at gamers, streamers looking to improve their popularity, and gaming companies seeking feedback.

Before processing the data for analysis, we offer the option to group the data by one of two variables: Channel ID or Video ID. For this final dataset we are using Channel ID.

The following sources were referenced for determining the criteria and methods for data pre-processing for sentiment analysis:
[Article 1](https://towardsdatascience.com/cleaning-preprocessing-text-data-for-sentiment-analysis-382a41f150d6), [Article 2](https://towardsdatascience.com/how-to-build-your-own-dataset-of-youtube-comments-39a1e57aade), [Article 3](https://neptune.ai/blog/tokenization-in-nlp).

Based on these sources, and our own determinations, this pre-processing takes the following steps:
1. Remove Emojis
2. Strip URLs
3. Clean up HTML text
4. Convert all text to lowercase
5. Handle contractions (replace contractions with full words, e.g. you're >> you are)
6. Strip remaining extra characters
7. Remove stop words
8. Lemmatization (defined per Article 1 as "Lemmatization removes the grammar tense and transforms each word into its original form") 
9. Tokenization


### Grouping by YouTube IDs
First, parse the file containing comment, channel, and video data (generated from the YouTube API) for unique IDs. The unique IDs are used to group the comment data. For example, grouping comment data by channel ID allows for sentiment analysis of an entire channel.

Next, load a dictionary with each ID as a key and the list of its comments as the value.
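
The grouping idea can be sketched in a single pass over the rows using only the standard library. This is a minimal illustration, not the notebook's pandas implementation: the sample rows and the helper name `group_comments` are made up, and only the column names (`Channel ID`, `Video ID`, `Comment`) come from the project file.

```python
import csv
import io
from collections import defaultdict

# Made-up rows using the same column names as the project file
sample_csv = io.StringIO(
    "Channel ID,Video ID,Comment\n"
    "chan_1,vid_1,great play\n"
    "chan_2,vid_2,nice crosshair\n"
    "chan_1,vid_3,clean ace\n"
)

def group_comments(csv_file, id_column='Channel ID'):
    # Hypothetical helper: one pass over the rows, appending each comment to its ID's list
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row[id_column]].append(row['Comment'])
    return dict(grouped)

comments_by_id = group_comments(sample_csv)
print(comments_by_id)
```

Passing `id_column='Video ID'` would group the same rows by video instead.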

In [3]:
import pandas as pd
#load unique ids from csv file based on either channel or video id
def get_id_list(file_path, id_type = 'channel'):
    
    data = pd.read_csv(file_path, sep = ",", header = 0)
    
    #set id_name based on id_type; reject anything else instead of failing later
    if id_type == 'channel':
        id_name = 'Channel ID'
    elif id_type == 'video':
        id_name = 'Video ID'
    else:
        raise ValueError("id_type must be 'channel' or 'video'")
        
    #collect unique IDs from file, in order of first appearance
    return data[id_name].unique().tolist()

In [4]:
#Return channel IDs list
channel_ids = get_id_list('data/comments_videos_channel_info.csv', 'channel')
channel_ids

['UCoz3Kpu5lv-ALhR4h9bDvcw',
 'UCRAEUAmW9kletIzOxhpLRFw',
 'UC5v2QgY2D5tlu8uws23MG4Q',
 'UCckPYr9b_iVucz8ID1Q67sw',
 'UCIfAlCwj-ZPZq5fqjpYDX3w',
 'UCWphjEePrzIrRA5mwcOt_4Q',
 'UCxjdy5n9BxX_6RTL8Dt_7pg',
 'UCujyjxsq5FZNVnQro51zKSQ',
 'UCTbtlMEiBfs0zZLQyJzR0Uw',
 'UCgtbMb3djcXKj6CHerHwZ-A',
 'UCQ4dS_JStXcK3A30isduBbg',
 'UCqoJxH5s6xAiJ6QyevmuG7Q',
 'UC_wSuaxwUYsJOBZDWwHIQZg',
 'UCdSad9tpJS8V8rL-4iuRuYw',
 'UCRN1XC7PnnTL5R_GbYOMCZg',
 'UCFJ1pr8iwWPeQjmeHnPhqvA',
 'UCeIcwvxA_e5Dvrqg3rsuN1w',
 'UCQ8VQZoYPeXF_q0E19UDGYQ',
 'UCdH7fwkQ5RGVAMIAN2ufm4Q',
 'UCCBJqqk5h2hh8_WDGzrkRCQ',
 'UC-_1FH52GIOFGu4a8PzwRzQ',
 'UCtTWOND3uyl4tVc_FarDmpw']

In [5]:
#load comments from file, grouped by unique id

def get_comments_by_id(file_path, id_list, id_type = 'channel'):
    
    data = pd.read_csv(file_path, sep = ",", header = 0) 
    
    #set id_name based on id_type; reject anything else instead of failing later
    if id_type == 'channel':
        id_name = 'Channel ID'
    elif id_type == 'video':
        id_name = 'Video ID'
    else:
        raise ValueError("id_type must be 'channel' or 'video'")
        
    #select each ID's comments with a boolean mask instead of looping over every row
    comments_by_id = {}
    for unique_id in id_list:
        comments_by_id[unique_id] = data.loc[data[id_name] == unique_id, 'Comment'].tolist()
    
    return comments_by_id
        

In [6]:
#return comments by channel
comments_by_channel = get_comments_by_id('data/comments_videos_channel_info.csv',channel_ids, 'channel')
comments_by_channel

{'UCoz3Kpu5lv-ALhR4h9bDvcw': ["says if you share this video with your friends, they won't let you solo-queue again",
  'No doubt if you even take a break you will always be a pro player not only in valo in every game bro we all know you,you are the best.And bro what is the crosshair settings',
  "I wanna see Shroud's Curved wall plays",
  'I play valorant because of you. I hope you can give us some tutorials.',
  'From my experience when it comes to chat and comments, shrouds chat and comments are chill, tarik trolling and tenz fanboy or toxic.',
  'SHROUD! i WANNA SEE YOU BACK IN PUBG MY BROTHER',
  "3:07 So Harbor's Bubble doesn't block projectiles. Good to know.",
  '8:46 I swear he makes an Austin powers expression',
  'just ordered your mouse shroud cant wait',
  'does he stop the walls or am i tripping? like can you choose when the wall stops on harbours C?',
  'Clean gameplay as always, but I gotta mention the editor as well. Holy crap it’s sooo good and so nice to watch! I love

### Cleaning the Comments

Data cleaning is important for accurate sentiment analysis. If the text contains large amounts of superfluous words, the sentiment analysis will be less accurate.

Based on the unique aspects of this dataset (for example, it includes HTML markup) and researched best practices for data cleaning, we are taking the following steps to clean the text:

1. Remove Emojis
2. Strip URLs
3. Clean up HTML text
4. Convert all text to lowercase
5. Handle contractions (replace contractions with full words, e.g. you're >> you are)
6. Strip remaining extra characters
7. Remove stop words
8. Lemmatization (defined per Article 1 as "Lemmatization removes the grammar tense and transforms each word into its original form")
9. Tokenization

Reference for why data cleaning is important: https://www.repustate.com/blog/data-cleaning-in-sentiment-analysis/



#### Remove emojis with demoji

Demoji allows for the removal of emojis from text.

Demoji documentation can be found here: https://pypi.org/project/demoji/

In [None]:
#Install demoji for identifying emojis
#!pip install demoji

In [7]:
#Remove Emojis

import demoji

#Download codes (only once): demoji.download_codes()
##https://pypi.org/project/demoji/

#replace every emoji in the text with an empty string
def remove_emojis(text):
    return demoji.replace(text, "")


#### Remove URLs

Comment data contains a mix of working and broken links, all wrapped in HTML anchor markup. We use that markup to identify the links.

In [8]:
#Remove URLs
# Use <a href to capture both actual URLs and broken links

import re

def remove_urls(text):
    #raw string so that \S is passed to the regex engine unchanged
    return re.sub(r'<a href\S+', ' ', text)
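
As a quick check of what this pattern removes, the sketch below applies it to two made-up comments (the sample strings are assumptions, not taken from the dataset). Because `\S+` stops at the first whitespace character, a link whose visible text contains a space is only partly removed.

```python
import re

def remove_urls(text):
    # Same pattern as above: "<a href" followed by a run of non-whitespace characters
    return re.sub(r'<a href\S+', ' ', text)

# Hypothetical timestamp link in the style of YouTube comment markup
timestamp_link = '<a href="https://www.youtube.com/watch?v=abc123&amp;t=3m07s">3:07</a> nice wall'
timestamp_result = remove_urls(timestamp_link)
print(timestamp_result.split())

# Link text with a space: only the part up to the first space is removed
spaced_link = 'see <a href="https://example.com">my channel</a>'
spaced_result = remove_urls(spaced_link)
print(spaced_result.split())
```

In the second case the fragment `channel</a>` survives this step, so it is worth spot-checking cleaned comments for leftovers of this kind.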

#### Remove HTML chars

This step removes the break (`<br>`) and quote (`&quot;`) HTML markup. It must run before special characters are removed; otherwise the leftover letters (`br`, `quot`) would be merged into the surrounding words.

In [9]:
#remove HTML chars ('quot', 'br') MUST DO BEFORE REMOVE SPECIAL CHARACTERS

def remove_html(text):
    remove_br = re.sub('<br>', ' ', text)
    remove_quot = re.sub('&quot', ' ', remove_br)
    return remove_quot
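
As an alternative sketch (not the approach used in this notebook), Python's standard `html.unescape` decodes all HTML entities in one call, and a slightly broader pattern also catches `<br/>` and `<br />`. The helper name `remove_html_alt` and the sample comment are made up for illustration.

```python
import html
import re

def remove_html_alt(text):
    # Hypothetical helper: decode entities such as &quot; &#39; and &amp;
    decoded = html.unescape(text)
    # Replace <br>, <br/> and <br /> with a space
    return re.sub(r'<br\s*/?>', ' ', decoded)

sample = 'he said &quot;gg&quot;<br>don&#39;t peek &amp; hold'
cleaned = remove_html_alt(sample)
print(cleaned)
```

Decoding `&#39;` at this stage would make the later apostrophe substitution in the contractions step unnecessary, which is a trade-off to weigh before switching approaches.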

#### Make all text lowercase

When all text is the same case, words are recognized as the same regardless of capitalization (i.e. Word != word, but word == word).

In [10]:
#make all text lowercase

def text_lower(text):
    new_text = [x.lower() for x in text.split()]
    return ' '.join(new_text)


#### Handle contractions

Using the contractions library, as described here: https://www.geeksforgeeks.org/nlp-expand-contractions-in-text-processing/

Replace contractions with the full words, e.g. you're >> you are

In [132]:
#install contractions
#!pip install contractions

In [11]:
#handle contractions
#YouTube comment data encodes apostrophes as the HTML entity &#39;
import contractions

def fix_contractions(text):
 
    # create an empty list for expanded words
    expanded_words = [] 
    
    for word in text.split():
        #sub contraction characters with apostrophe
        clean_word = re.sub('&#39;', "'", word)
        
        # using contractions.fix to expand the shortened words
        expanded_words.append(contractions.fix(clean_word))  

    expanded_text = ' '.join(expanded_words)
    return expanded_text
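
To illustrate the two stages without the third-party dependency, the sketch below restores the apostrophe and then expands words from a tiny hand-written lookup table. The table and the helper name are assumptions for illustration only; `contractions.fix` covers far more cases.

```python
import re

# Tiny illustrative lookup table (hypothetical; not the contractions library's data)
SAMPLE_CONTRACTIONS = {"you're": "you are", "don't": "do not", "it's": "it is"}

def expand_sample_contractions(text):
    expanded_words = []
    for word in text.split():
        # YouTube comment text encodes the apostrophe as &#39;
        clean_word = re.sub('&#39;', "'", word)
        # Fall back to the original word when it is not in the table
        expanded_words.append(SAMPLE_CONTRACTIONS.get(clean_word.lower(), clean_word))
    return ' '.join(expanded_words)

expanded = expand_sample_contractions("you&#39;re the best and it&#39;s not close")
print(expanded)
```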


#### Remove special characters

Do this after the other clean-up steps. Once the individual cases (URLs, emojis, HTML markup, contractions) have been handled, remove any remaining special characters.

In [12]:
#remove special characters
import re

def remove_special_char(text):
    new_text = re.sub('[^A-Za-z0-9 ]+', '', text)
    return new_text
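
The sketch below shows why the order of these steps matters, using copies of the two functions and a made-up comment. It also shows a side effect of deleting punctuation outright: words separated only by punctuation are fused together (the last string is adapted from a comment in the sample output above).

```python
import re

def remove_html(text):
    remove_br = re.sub('<br>', ' ', text)
    return re.sub('&quot', ' ', remove_br)

def remove_special_char(text):
    return re.sub('[^A-Za-z0-9 ]+', '', text)

# Made-up comment for illustration
sample = 'he said &quot;gg&quot;<br>nice'

# Wrong order: markup letters survive and fuse with neighbouring words
wrong_order = remove_special_char(sample)
# Intended order: strip the markup first, then the leftover punctuation
right_order = ' '.join(remove_special_char(remove_html(sample)).split())

print(wrong_order)   # he said quotggquotbrnice
print(right_order)   # he said gg nice

# Punctuation with no following space fuses the words on either side
fused = remove_special_char('we all know you,you are the best.And bro')
print(fused)         # we all know youyou are the bestAnd bro
```

If fused words turn out to be a problem for the sentiment model, replacing special characters with a space instead of an empty string is one option to consider.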


In [13]:
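#Clean text: apply every cleaning step, in order, to each comment

def clean_text(comment_list):
    
    clean_comments = []
    for comment in comment_list:
        text = remove_urls(comment)        #needs the HTML markup to still be intact
        text = remove_emojis(text)
        text = remove_html(text)
        text = fix_contractions(text)
        text = remove_special_char(text)
        text = text_lower(text)            #also collapses repeated whitespace
        clean_comments.append(text)
                            
    return clean_comments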

In [14]:
#Clean text and return in dictionary format by ID

def clean_text_dictionary(comments_dict, id_list):
    clean_comments = {}
    for unique_id in id_list:
        clean_comments[unique_id] = clean_text(comments_dict[unique_id])
    return clean_comments

In [15]:
#Clean every comment in the dictionary

comment_dict = clean_text_dictionary(comments_by_channel, channel_ids)


### Remove stop words

Use the Natural Language Toolkit (NLTK) to define the list of stop words to remove.

https://www.nltk.org/

In [None]:
#Download resources from Natural Language Toolkit (nltk)

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

In [17]:
from nltk.corpus import stopwords

#build the stop word set once; set membership checks are faster than list lookups
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords(text):
    new_text = [x for x in text.split() if x not in STOP_WORDS]
    return ' '.join(new_text)

#print(remove_stopwords('so right before the tournament i said to myself if the team rae is on goes undefeated i will download valorant and start playing and what do you know'))

#### Lemmatization
Reference: https://pub.towardsai.net/how-and-why-to-implement-stemming-and-lemmatization-from-nltk-5f0cc69d2af

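Lemmatization resolves words to their dictionary form (for example, stripes becomes stripe). Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is passed, so verb forms such as "playing" are left unchanged by the function below.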


In [18]:
from nltk.stem import WordNetLemmatizer

def lemmatize(text):
    lemmatizer = WordNetLemmatizer()
    lemmatized_text = [lemmatizer.lemmatize(word) for word in text.split()]
    return ' '.join(lemmatized_text)

In [19]:
#remove stop words and lemmatize the cleaned text

def ready_to_tokenize(comment_list):
    #use a distinct name so the list does not shadow the function itself
    prepared_comments = []
    for comment in comment_list:   
        prepared_comments.append(lemmatize(remove_stopwords(comment)))
    return prepared_comments

In [20]:
#Prepare comment dictionary for tokenization by removing stop words and lemmatizing each comment

def ready_to_tokenize_dict(comment_dict, id_list):
    dict_to_tokenize = {}
    for unique_id in id_list:
        dict_to_tokenize[unique_id] = ready_to_tokenize(comment_dict[unique_id])
    return dict_to_tokenize

In [21]:
#Return tokenization prepared dictionary from comment dictionary
dict_to_tokenize = ready_to_tokenize_dict(comment_dict, channel_ids)

### Tokenization
Because we have removed emojis and special characters prior to tokenization, simple whitespace tokenization is an effective method of tokenization for this dataset.

The tokenization method was chosen based on this study:
http://sentiment.christopherpotts.net/tokenizing.html

In [22]:
#tokenize text
def tokenization(text):
    #split() with no argument never yields empty-string tokens
    return text.split()
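
Whitespace tokenization with Python's `str.split` behaves differently depending on whether a separator is passed. The sketch below, using made-up strings, shows the difference, which matters for comments that end up empty after cleaning.

```python
# Made-up inputs: a normal comment, one with a doubled space, and an empty one
samples = ['clean gameplay always', 'clean  gameplay', '']

# Explicit separator: consecutive spaces and empty strings produce '' tokens
explicit = {text: text.split(' ') for text in samples}
# No separator: runs of whitespace count as one, and '' gives an empty list
default = {text: text.split() for text in samples}

for text in samples:
    print(repr(text), explicit[text], default[text])
```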

In [23]:
def tokenize_comments(comment_list):
    tokenized_comments = []
    for line in comment_list:
        tokenized_comments.append(tokenization(line))
    return tokenized_comments

In [24]:
#Tokenize comment dictionary

def tokenize_dict(comment_dict, id_list):
    tokenized_dict = {}
    for unique_id in id_list:
        tokenized_dict[unique_id] = tokenize_comments(comment_dict[unique_id])
    return tokenized_dict

In [25]:
#Return tokenized dictionary

tokenized_dict = tokenize_dict(dict_to_tokenize, channel_ids)

#### Save results in separate text files, by ID

Files can then be fed into a sentiment analysis API; the results give the sentiment of the specified group of text.

In [27]:
#Get Channel Names to create clean file names

file_path = 'data/channels.csv'

data = pd.read_csv(file_path, sep = ",", header = 0, dtype = 'str')

channel_names = []
for unique_id in channel_ids:
    for index, row in data.iterrows():
        if row['channel_id'] == unique_id:
            if (row['channel_title'], unique_id) not in channel_names:
                channel_names.append((row['channel_title'], unique_id))
print(channel_names)

[('Shroud', 'UCoz3Kpu5lv-ALhR4h9bDvcw'), ('Sykkuno', 'UCRAEUAmW9kletIzOxhpLRFw'), ('iiTzTimmy', 'UC5v2QgY2D5tlu8uws23MG4Q'), ('TenZ', 'UCckPYr9b_iVucz8ID1Q67sw'), ('Flights', 'UCIfAlCwj-ZPZq5fqjpYDX3w'), ('Grim', 'UCWphjEePrzIrRA5mwcOt_4Q'), ('Kyedae', 'UCxjdy5n9BxX_6RTL8Dt_7pg'), ('fuslie', 'UCujyjxsq5FZNVnQro51zKSQ'), ('tarik', 'UCTbtlMEiBfs0zZLQyJzR0Uw'), ('MrLowlander', 'UCgtbMb3djcXKj6CHerHwZ-A'), ('noted', 'UCQ4dS_JStXcK3A30isduBbg'), ('Flexinja', 'UCqoJxH5s6xAiJ6QyevmuG7Q'), ('QuarterJade', 'UC_wSuaxwUYsJOBZDWwHIQZg'), ('xirena', 'UCdSad9tpJS8V8rL-4iuRuYw'), ('Hiko', 'UCRN1XC7PnnTL5R_GbYOMCZg'), ('Red', 'UCFJ1pr8iwWPeQjmeHnPhqvA'), ('Keeoh', 'UCeIcwvxA_e5Dvrqg3rsuN1w'), ('Ziptie', 'UCQ8VQZoYPeXF_q0E19UDGYQ'), ('xChocoBars', 'UCdH7fwkQ5RGVAMIAN2ufm4Q'), ('vkimm', 'UCCBJqqk5h2hh8_WDGzrkRCQ'), ('Peak', 'UC-_1FH52GIOFGu4a8PzwRzQ'), ('Sydeon', 'UCtTWOND3uyl4tVc_FarDmpw')]


In [28]:
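#save results to .txt

def save_files_by_channel(final_comment_dict, id_list):
    
    #channel_names holds (channel_title, channel_id) pairs built in the previous cell
    for channel_title, channel_id in channel_names:
        if channel_id not in id_list:
            continue
        file_name = 'data/'+str(channel_title)+'.txt'
        with open(file_name, 'w', encoding='utf-8') as f:
            #look up this channel's own comments rather than a leftover loop variable
            for comment in final_comment_dict[channel_id]:
                f.write(' '.join(comment))
                f.write(' ')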

In [30]:
save_files_by_channel(tokenized_dict, channel_ids)
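
A possible variation, sketched below under stated assumptions, writes one comment per line so that comment boundaries can be recovered later, and replaces characters in the channel title that may not be valid in file names. The helper name, the sample data, and the temporary output directory are all made up for illustration; the notebook itself writes to `data/`.

```python
import re
import tempfile
from pathlib import Path

def save_comments_per_line(tokenized_by_id, names_and_ids, out_dir):
    # Hypothetical helper: names_and_ids is a list of (channel_title, channel_id) pairs
    written = []
    for channel_title, channel_id in names_and_ids:
        # Keep only characters that are safe in file names on common systems
        safe_name = re.sub(r'[^A-Za-z0-9_-]+', '_', str(channel_title))
        file_path = Path(out_dir) / (safe_name + '.txt')
        with open(file_path, 'w', encoding='utf-8') as f:
            for tokens in tokenized_by_id[channel_id]:
                # One comment per line keeps comment boundaries recoverable
                f.write(' '.join(tokens) + '\n')
        written.append(file_path)
    return written

# Made-up data in the same shape as tokenized_dict and channel_names
demo_tokens = {'id_a': [['clean', 'gameplay'], ['nice', 'wall']], 'id_b': [['gg']]}
demo_names = [('Channel A', 'id_a'), ('Channel/B', 'id_b')]

with tempfile.TemporaryDirectory() as tmp_dir:
    paths = save_comments_per_line(demo_tokens, demo_names, tmp_dir)
    file_names = [p.name for p in paths]
    first_file_text = paths[0].read_text(encoding='utf-8')

print(file_names)
print(first_file_text)
```

Whether one comment per line or one block of text per channel is preferable depends on what the downstream sentiment analysis API expects as input.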