# DSCI 511: Data Acquisition and Pre-Processing <br> Term Project Phase 2: Pre-processing Valorant Comments for Sentiment Analysis

## Group members 
- Group member 
    - Name: Amira Bendjama
    - Email: ab4745@drexel.edu
- Group member 
    - Name: Nicole Padilla 
    - Email: np858@drexel.edu

# Data pre-processing for sentiment analysis

The goal of collecting Valorant comment data through the YouTube API is to make it available for sentiment analysis. Such analysis may interest marketers of games and of products aimed at gamers, streamers looking to grow their audience, and game companies seeking player feedback.

Before processing the data for analysis, we offer the option to group it by several variables: YouTube variables (channel ID, video ID) and game-specific variables (agent, map, weapon). This lets the end user of the dataset run sentiment analysis on different groupings of text.

The following sources were referenced when determining the criteria and methods for pre-processing data for sentiment analysis:
- Article 1: https://towardsdatascience.com/cleaning-preprocessing-text-data-for-sentiment-analysis-382a41f150d6
- Article 2: https://towardsdatascience.com/how-to-build-your-own-dataset-of-youtube-comments-39a1e57aade
- Article 3: https://neptune.ai/blog/tokenization-in-nlp

Based on these sources and our own judgment, the pre-processing takes the following steps:
1. Remove emojis
2. Strip URLs
3. Clean up HTML text
4. Convert all text to lowercase
5. Handle contractions (replace contractions with full words, e.g. you're >> you are)
6. Strip remaining extra characters
7. Lemmatization (defined in Article 1: "Lemmatization removes the grammar tense and transforms each word into its original form")
8. Tokenization
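
To make step 7 concrete, here is a minimal sketch of lemmatization by table lookup. The `LEMMAS` table and the `lemmatize_tokens` helper are hypothetical names introduced only for illustration; a real run would likely use a library lemmatizer (for example NLTK's `WordNetLemmatizer`) instead of a hand-written table.

```python
# Toy lemma table for illustration only; a real run would need a full lexicon.
LEMMAS = {
    "played": "play",
    "playing": "play",
    "was": "be",
    "were": "be",
    "characters": "character",
    "teams": "team",
}

def lemmatize_tokens(tokens):
    """Replace each token with its lemma when known, else keep it unchanged."""
    return [LEMMAS.get(token, token) for token in tokens]

tokens = "all of them played so well".split()
lemmas = lemmatize_tokens(tokens)
print(lemmas)
```

Tokens missing from the table pass through unchanged, so the sketch never drops words; it only normalizes the forms it knows.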



In [131]:
##TO DO##

#Reference for what to remove: https://towardsdatascience.com/cleaning-preprocessing-text-data-for-sentiment-analysis-382a41f150d6

#load comments file
##From comments file get list of unique IDs to group text by for analysis (Channel or Video)

#Process selected Text
##Requirements for NLP Processing
###remove emojis
###strip URLs
###clean up HTML text
###handle contractions
###strip extra characters
###Lemmatization
##Tokenization

#Return processed data

In [25]:
import pandas as pd

#load unique ids from csv file based on either channel or video id
def get_id_list(file_path, id_type = 'channel'):
    
    #map id_type to the matching column name; fail early on an unknown type
    id_columns = {'channel': 'Channel ID', 'video': 'Video ID'}
    if id_type not in id_columns:
        raise ValueError("id_type must be 'channel' or 'video'")
    id_name = id_columns[id_type]
    
    data = pd.read_csv(file_path, sep = ",", header = 0)
    
    #collect unique IDs from the selected column, in order of first appearance
    return data[id_name].unique().tolist()

In [27]:
#note: id_type is 'video', so channel_ids holds video IDs despite its name
channel_ids = get_id_list('data/comments.csv', 'video')

In [None]:
#separate data by gaming variables (agents, weapons, maps)
##TO DO
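
One possible way to approach this TO DO is keyword matching: tag each comment with the game terms it mentions, then group comments by term. The sketch below is an assumption about how that could work, not the project's method. `GAME_TERMS`, `tag_comment`, and `group_by_term` are hypothetical names, and the short term lists are placeholders rather than complete rosters.

```python
import re

# Hypothetical keyword lists for illustration; extend with the full rosters.
GAME_TERMS = {
    "agent": ["jett", "sage", "phoenix"],
    "map": ["ascent", "bind", "haven"],
    "weapon": ["vandal", "phantom", "operator"],
}

def tag_comment(comment):
    """Return {category: sorted list of matched terms} for one comment."""
    words = set(re.findall(r"[a-z0-9]+", comment.lower()))
    tags = {}
    for category, terms in GAME_TERMS.items():
        matched = sorted(term for term in terms if term in words)
        if matched:
            tags[category] = matched
    return tags

def group_by_term(comments, category):
    """Group comments under each term of the given category that they mention."""
    groups = {}
    for comment in comments:
        for term in tag_comment(comment).get(category, []):
            groups.setdefault(term, []).append(comment)
    return groups

sample = ["Jett with the Vandal on Ascent is unfair", "Sage wall saved the round"]
agent_groups = group_by_term(sample, "agent")
print(agent_groups)
```

Whole-word matching keeps "bind" from matching inside "binding", but ordinary words such as "operator" or "phoenix" can still produce false positives, so the term lists would need review against real comments.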

In [43]:
#load comments from file, grouped by unique id

def get_comments_by_id(file_path, id_list, id_type = 'channel'):
    
    #map id_type to the matching column name; fail early on an unknown type
    id_columns = {'channel': 'Channel ID', 'video': 'Video ID'}
    if id_type not in id_columns:
        raise ValueError("id_type must be 'channel' or 'video'")
    id_name = id_columns[id_type]
    
    data = pd.read_csv(file_path, sep = ",", header = 0)
    
    #select each id's comments with a boolean mask instead of scanning every row with iterrows
    comments_by_id = {}
    for unique_id in id_list:
        comments_by_id[unique_id] = data.loc[data[id_name] == unique_id, 'Comment'].tolist()
    
    return comments_by_id
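
For comparison, the same grouping can be done in a single pass over the rows with only the standard library. This is a sketch under assumptions: `group_comments` is a hypothetical helper, and the in-memory rows are made up to stand in for `data/comments.csv`; only the column names (`Channel ID`, `Video ID`, `Comment`) come from the code above.

```python
import csv
import io
from collections import defaultdict

def group_comments(csv_file, id_name="Video ID"):
    """Group the 'Comment' column by id_name in one pass over the rows."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row[id_name]].append(row["Comment"])
    return dict(grouped)

# Made-up rows standing in for data/comments.csv
sample_csv = io.StringIO(
    "Channel ID,Video ID,Comment\n"
    "c1,v1,great round\n"
    "c1,v2,nice clutch\n"
    "c1,v1,gg\n"
)
by_video = group_comments(sample_csv)
print(by_video)
```

A single pass avoids rescanning the whole file once per ID, which matters as the number of videos grows.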
        

In [60]:
comments_by_video = get_comments_by_id('data/comments.csv', channel_ids, 'video')

#print(comments_by_video['Kl2XzD5DMoY'])

sykkuno_comments = comments_by_video['Kl2XzD5DMoY']

In [117]:
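#Remove URLs
# https://medium.com/r/?url=https%3A%2F%2Fstackoverflow.com%2Fa%2F40823105
# \S+ matches everything up to the next whitespace, so this strips each
# '<a href=...' link fragment found in the comment text

import re

def remove_urls(text):
    #raw string keeps \S from being read as an (invalid) string escape
    return re.sub(r'<a href\S+', ' ', text)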
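
The pattern above stops at the first whitespace, so link text containing spaces, or a bare URL with no anchor tag, would survive. A stricter variant is sketched below as one possible alternative; `remove_urls_strict`, `ANCHOR_RE`, and `BARE_URL_RE` are hypothetical names, and the sample string is made up.

```python
import re

# Whole anchor element: opening tag, link text, closing tag
ANCHOR_RE = re.compile(r"<a\s[^>]*>.*?</a>", flags=re.IGNORECASE | re.DOTALL)
# Bare links such as https://example.com/page or www.example.com
BARE_URL_RE = re.compile(r"(?:https?://|www\.)\S+", flags=re.IGNORECASE)

def remove_urls_strict(text):
    """Replace anchor elements and bare URLs with a single space."""
    text = ANCHOR_RE.sub(" ", text)
    return BARE_URL_RE.sub(" ", text)

sample = 'watch <a href="https://youtu.be/abc">https://youtu.be/abc</a> and www.example.com now'
cleaned = " ".join(remove_urls_strict(sample).split())
print(cleaned)
```

Replacing with a space rather than an empty string keeps the words on either side of a link from being glued together.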


In [168]:
#Install demoji for identifying emojis
#!pip install demoji

Collecting demoji
  Downloading demoji-1.1.0-py3-none-any.whl (42 kB)
Installing collected packages: demoji
Successfully installed demoji-1.1.0


In [53]:
#Remove Emojis

import demoji

#Older demoji releases needed demoji.download_codes() once; releases from 1.0 on bundle the codes
##https://pypi.org/project/demoji/

#replace every emoji in the text with an empty string
def remove_emojis(text):
    return demoji.replace(text, "")
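
If installing `demoji` is not an option, a rough standard-library approximation is a regular expression over a few common emoji code-point ranges. This is a swapped-in technique, not what `demoji` does internally, and the ranges below are an assumption that covers many but not all emojis. `EMOJI_RE` and `strip_emojis` are hypothetical names.

```python
import re

# Rough approximation: a few common emoji blocks plus joiner/variation marks.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "\u200d\ufe0f"           # zero-width joiner, variation selector-16
    "]+"
)

def strip_emojis(text):
    """Delete every character that falls in the ranges above."""
    return EMOJI_RE.sub("", text)

sample = "gg \U0001F602\U0001F525 nice \u2764\ufe0f"
stripped = strip_emojis(sample)
print(repr(stripped))
```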


In [74]:
#make all text lowercase (split/join also collapses repeated whitespace)

def text_lower(text):
    return ' '.join(text.lower().split())


In [89]:
#install contractions
!pip install contractions

Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting anyascii
  Downloading anyascii-0.3.1-py3-none-any.whl (287 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-1.4.4-cp39-cp39-macosx_10_9_x86_64.whl (32 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.1 contractions-0.1.73 pyahocorasick-1.4.4 textsearch-0.0.24


In [127]:
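#remove HTML remnants ('&quot;', '<br>') MUST DO BEFORE REMOVE SPECIAL CHARACTERS

def remove_html(text):
    #replace line-break tags with a space so adjacent words stay separated
    remove_br = re.sub(r'<br\s*/?>', ' ', text)
    #replace the quote entity, including its trailing semicolon when present
    remove_quot = re.sub(r'&quot;?', ' ', remove_br)
    return remove_quot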
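
Handling entities one at a time means any entity not listed (for example `&amp;`) slips through. A more general option is the standard library's `html.unescape`, which decodes named and numeric entities in one call. The sketch below assumes a hypothetical helper `decode_html` and a made-up sample string.

```python
import html
import re

def decode_html(text):
    """Turn <br> tags into spaces, then decode entities such as &quot; and &#39;."""
    without_breaks = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)
    return html.unescape(without_breaks)

sample = "he said &quot;gg&quot;<br>you&#39;re cracked &amp; calm"
decoded = decode_html(sample)
print(decoded)
```

Because `&#39;` is decoded to a real apostrophe here, a contraction-expansion step run afterwards would see `you're` directly.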

In [96]:
#handle contractions (apostrophes arrive as the HTML entity &#39;)
#Reference https://www.geeksforgeeks.org/nlp-expand-contractions-in-text-processing/
#YouTube comment data shows apostrophes as &#39;, e.g. you&#39;re
import contractions

def fix_contractions(text):
 
    # create an empty list for expanded words
    expanded_words = [] 
    
    for word in text.split():
        #swap the apostrophe entity for a real apostrophe
        clean_word = word.replace('&#39;', "'")
        
        # using contractions.fix to expand the shortened words
        expanded_words.append(contractions.fix(clean_word))  

    expanded_text = ' '.join(expanded_words)
    return expanded_text


In [86]:
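#remove special characters
import re

def remove_special_char(text):
    #delete every character that is not a letter, digit, or space
    #note: characters are deleted, not replaced, so words separated only by punctuation are merged
    new_text = re.sub(r'[^A-Za-z0-9 ]+', '', text)
    return new_text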
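
Deleting special characters outright can merge neighbouring words, which is one plausible cause of tokens such as `buti` in the cleaned output. The sketch below contrasts deletion with replacing each run of special characters by a space. `replace_special_char` is a hypothetical variant, not the function used in this notebook, and the input string is made up.

```python
import re

def replace_special_char(text):
    """Replace runs of non-alphanumeric characters with one space."""
    spaced = re.sub(r"[^A-Za-z0-9]+", " ", text)
    return spaced.strip()

# Deletion, as in remove_special_char: punctuation vanishes and words fuse
merged = re.sub(r"[^A-Za-z0-9 ]+", "", "but...i knifed toast")
# Replacement: punctuation becomes a word boundary
separated = replace_special_char("but...i knifed toast")
print(merged)
print(separated)
```

The trade-off is that any apostrophe still present also becomes a space (`team's` turns into `team s`), so this variant works best after contractions have been expanded.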


In [128]:
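#Clean text: URLs, emojis, HTML remnants, contractions, special characters, then lowercase

def clean_text(comment_list):
    
    clean_comments = []
    for comment in comment_list:
        #apply the cleaning steps one at a time, in order
        text = remove_urls(comment)
        text = remove_emojis(text)
        text = remove_html(text)
        text = fix_contractions(text)
        text = remove_special_char(text)
        text = text_lower(text)
        clean_comments.append(text)
                            
    return clean_comments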
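
Since the order of the cleaning steps matters (HTML remnants must go before special characters, for instance), one design option is to hold the steps in an ordered list and apply them generically. The sketch below shows that idea with stand-in steps; `make_pipeline`, `strip_exclamations`, and `collapse_spaces` are hypothetical names, and the real steps would be the functions defined earlier in this notebook.

```python
from functools import reduce

def make_pipeline(*steps):
    """Return a function that applies each step to the text, left to right."""
    def run(text):
        return reduce(lambda current, step: step(current), steps, text)
    return run

# Stand-in steps for illustration only
def strip_exclamations(text):
    return text.replace("!", "")

def collapse_spaces(text):
    return " ".join(text.split())

clean_one = make_pipeline(strip_exclamations, str.lower, collapse_spaces)
result = clean_one("GG!!   Well   Played!")
print(result)
```

With this shape, reordering or dropping a step is a one-line change to the argument list.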
        

In [129]:
#Clean the selected comments with the full cleaning pipeline

clean_sykkuno_comments = clean_text(sykkuno_comments)

print(clean_sykkuno_comments)


['their synergy was so good and all of them played so well', 'when you are the lowest team but you are a main character in an anime but they lost the finals and hey main characters do not always win maybe someday you guys will be legends', 'i can relate to this vid my team we are winning me i knowi am scared too', 'team scarra had like the highest average in player skill if you ask me they were pretty stacked the fact that team lily that was kind of looking like the weakest on paper going into it ended up going to the finals and getting that close shows how good their teamwork was they all understood their role in the team as players and in terms of who they were playing team lily was by far the best team to watch this tournament because they kept defying expectations would be cool to see the tournament happen again with the same teams even though it would never happen', 'sykkuno be like we are the lowest ranked team in valorant buti knifed toast guysi knifed toast hahahahaha', 'but yo

### Tokenization
Because emojis and special characters are removed before tokenization, simple whitespace tokenization is effective for this dataset.

The tokenization method was chosen based on this reference:
http://sentiment.christopherpotts.net/tokenizing.html
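
To illustrate why prior cleaning makes whitespace tokenization sufficient, the sketch below compares it with a simple regex tokenizer on cleaned and uncleaned text. Both helpers and both sample strings are hypothetical, and the regex is only one of many possible token patterns.

```python
import re

def whitespace_tokenize(text):
    """Split on any run of whitespace; empty text gives an empty list."""
    return text.split()

def regex_tokenize(text):
    """Keep words (with inner apostrophes) and single punctuation marks as tokens."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", text)

cleaned = "their synergy was so good"
raw = "GG, well played!"

print(whitespace_tokenize(cleaned))
print(whitespace_tokenize(raw))
print(regex_tokenize(raw))
```

On cleaned text both approaches give the same tokens; on raw text, whitespace splitting leaves punctuation attached to words (`GG,`), which is the case the cleaning steps are meant to rule out.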

In [101]:
#tokenize text on whitespace
def tokenization(text):
    #split() with no argument ignores repeated whitespace and returns [] for empty text
    return text.split()

In [102]:
def tokenize_comments(comment_list):
    #tokenize each cleaned comment in turn
    return [tokenization(line) for line in comment_list]

In [130]:
tokenized_comments = tokenize_comments(clean_sykkuno_comments)
print(tokenized_comments)

[['their', 'synergy', 'was', 'so', 'good', 'and', 'all', 'of', 'them', 'played', 'so', 'well'], ['when', 'you', 'are', 'the', 'lowest', 'team', 'but', 'you', 'are', 'a', 'main', 'character', 'in', 'an', 'anime', 'but', 'they', 'lost', 'the', 'finals', 'and', 'hey', 'main', 'characters', 'do', 'not', 'always', 'win', 'maybe', 'someday', 'you', 'guys', 'will', 'be', 'legends'], ['i', 'can', 'relate', 'to', 'this', 'vid', 'my', 'team', 'we', 'are', 'winning', 'me', 'i', 'knowi', 'am', 'scared', 'too'], ['team', 'scarra', 'had', 'like', 'the', 'highest', 'average', 'in', 'player', 'skill', 'if', 'you', 'ask', 'me', 'they', 'were', 'pretty', 'stacked', 'the', 'fact', 'that', 'team', 'lily', 'that', 'was', 'kind', 'of', 'looking', 'like', 'the', 'weakest', 'on', 'paper', 'going', 'into', 'it', 'ended', 'up', 'going', 'to', 'the', 'finals', 'and', 'getting', 'that', 'close', 'shows', 'how', 'good', 'their', 'teamwork', 'was', 'they', 'all', 'understood', 'their', 'role', 'in', 'the', 'team', 