<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# DSI 37 Capstone Project

<a id='part_ii'></a>
[Part I](Part_1-Data_Collection.ipynb#part_i) <br>
[Part III](Part_3-EDA_and_Modeling.ipynb#part_iii)

# Part II: Prepare Training Data
This notebook is Part 2 of 3 of the project code and prepares the training data.

## Contents

[1. Glossary](#glossary)<br>
[2. Import Libraries](#import)<br>
[3. Load Datasets](#load_datasets)<br>
[4. Preprocessing](#preprocessing)<br>
[5. Run Topic Based Sentiment Analysis on Posts](#sentiment_posts)<br>
[6. Run Topic Based Sentiment Analysis on Comments](#sentiment_comments)<br>

<a id='glossary'></a>

## 1. Glossary

### Data Dictionary
The data dictionary for the 3 datasets utilized in this section is provided below for reference.

`video_info_audio_caption_cleaned_df`

|Feature|Type|Description|
|:---|:---:|:---|
|<b>id</b>|*object*|Id of the TikTok post|
|<b>url</b>|*object*|URL address of the TikTok post|
|<b>account_name</b>|*object*|Account name of the TikTok post uploader|
|<b>following_count</b>| *int64*|Following count of the TikTok post uploader|
|<b>follower_count</b>|*object*|Follower count of the TikTok post uploader|
|<b>total_like_count</b>|*object*|Total like count of the TikTok post uploader|
|<b>date</b>|*object*|Date on which the TikTok post was uploaded|
|<b>href</b>|*object*|The href needed to access the link to the TikTok post uploader's account page|
|<b>handle</b>|*object*|TikTok post uploader's account handle|
|<b>description</b>|*object*|Description of the TikTok post|
|<b>hashtag</b>|*object*|Hashtags of the TikTok post|
|<b>like_count</b>|*object*|Like count of the TikTok post|
|<b>bookmark_count</b>|*object*|Bookmark count of the TikTok post|
|<b>share_count</b>|*object*|Share count of the TikTok post|
|<b>comment_count</b>|*object*|Comment count of the TikTok post|
|<b>final_text</b>|*object*|Text from speech-to-text; if that is empty, text from caption-to-text is used instead|

<br>

`comments_df`

|Feature|Type|Description|
|:---|:---:|:---|
|<b>id</b>|*object*|Id of the post the comment belongs to, with a suffix marking the row as a comment rather than a post (e.g. `435_cmt`)|
|<b>url</b>|*object*|URL address of the TikTok post|
|<b>handle</b>|*object*|TikTok post uploader's account handle|
|<b>comment_count</b>| *object*|Comment count of the TikTok post|
|<b>comment</b>|*object*|Comment text|

<br>

`sg_entities_patterns_df`

|Feature|Type|Description|
|:---|:---:|:---|
|<b>label</b>|*object*|Label of the tourist attraction (entity)|
|<b>pattern</b>|*object*|Pattern to recognize as tourist attraction (entity)|
|<b>sublocation</b>|*object*|Actual name of the tourist attraction (entity)|
|<b>interest_1</b>| *object*|Main category that the tourist attraction falls under|
|<b>interest_2</b>|*object*|Secondary category that the tourist attraction falls under|
|<b>indoor_outdoor</b>|*object*|Whether the tourist attraction is indoors or outdoors|

<br>

<a id='import'></a>

## 2. Import Libraries

In [1]:
# pip install spacy
# pip install ipywidgets

In [36]:
import pandas as pd
import re
import ast
import string

import spacy
from textblob import TextBlob

from sklearn.model_selection import train_test_split

<a id='load_datasets'></a>

## 3. Load Datasets

In [3]:
video_info_audio_caption_cleaned_df = pd.read_csv('../datasets/video_info_audio_caption_cleaned.csv')
sg_entities_patterns_df = pd.read_csv('../datasets/sg_entities_patterns.csv')
comments_df = pd.read_csv('../datasets/comments.csv')

<a id='preprocessing'></a>

## 4. Preprocessing

In [4]:
pd.set_option('max_colwidth', None)

#### Preprocess `sg_entities_patterns_df`. This dataframe contains a list of Singapore's tourist attractions that we prepared. We lemmatize the patterns so that they can be matched against the lemmatized text later on.

In [5]:
# load spaCy English model
nlp = spacy.load('en_core_web_sm')

# define function for lemmatization using spaCy
def lemmatize_text(text):
    doc = nlp(text)
    lemmatized_text = ' '.join(token.lemma_ for token in doc)
    return lemmatized_text

In [6]:
# convert all text to lowercase
sg_entities_patterns_df['pattern'] = sg_entities_patterns_df['pattern'].apply(lambda x: x.lower())
# lemmatize text using function defined above
sg_entities_patterns_df['pattern'] = sg_entities_patterns_df['pattern'].apply(lambda x: lemmatize_text(x))

In [7]:
sg_entities_patterns_df.to_csv('../datasets/sg_entities_patterns.csv',index=False)

#### Preprocess `video_info_audio_caption_cleaned_df`

In [8]:
video_info_audio_caption_cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 401 entries, 0 to 400
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   id                401 non-null    object
 1   url               401 non-null    object
 2   account_name      401 non-null    object
 3   following_count   401 non-null    int64 
 4   follower_count    401 non-null    object
 5   total_like_count  401 non-null    object
 6   date              401 non-null    object
 7   href              401 non-null    object
 8   handle            401 non-null    object
 9   description       387 non-null    object
 10  hashtag           401 non-null    object
 11  like_count        401 non-null    object
 12  bookmark_count    401 non-null    object
 13  share_count       401 non-null    object
 14  comment_count     401 non-null    object
 15  final_text        358 non-null    object
dtypes: int64(1), object(15)
memory usage: 50.2+ KB


In [9]:
video_info_audio_caption_cleaned_df.head()

| |id|url|account_name|following_count|follower_count|total_like_count|date|href|handle|description|hashtag|like_count|bookmark_count|share_count|comment_count|final_text|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|0_post|https://www.tiktok.com/@montanadarby/video/7232388092764671258|Montana \| Travels|149|95.5K|4.8M|2023-5-12|/@montanadarby|montanadarby|The perfect 48 hour itinerary for Singapore!|['singapore', 'singaporetravel', 'travel']|36.7K|25K|5008|208|How to spend the perfect 48 hours in Singapore? First, head to your waterfall at Singapore airport, it's absolutely incredible and not to be missed. Then, head to the center and drop by the Future World Exhibition at the Art Science Museum. Next, you have to head to Gardens by the Bay, it's completely free to walk around here and just look how amazing it is. Don't forget to buy tickets for the Skyway for the most insane views of the Floral Fantasy, for a magical indoor garden escape. And of course, you have to catch one of the free light shows. Start early to head to one of the most iconic photo spots in Singapore, the Fort Canning tunnel. Then it's time for a quick stroll along Arab Street for the most incredible foods of Sultan mosque before stopping by Crack.|
|1|1_post|https://www.tiktok.com/@trippingmillennial/video/7194849267461426478|Travel with Rachael ✈️|474|57.1K|1.8M|2023-1-31|/@trippingmillennial|trippingmillennial|Hawker centers are my love language|['southeastasia', 'singapore', 'travelsingapore']|4913|3686|806|31|Skip to Singapore, kind of feels like a trip into the future, and if you visit, here are five can't-miss things that you should do. First, Singapore is pricey, but thankfully, the Hawker centres are not. These food halls are a great way to sample some of Singapore's Best Foods; it's really cheap; some of them are Michelin-starred, and it's a good way to get to the local culture. To using the incredible Gardens by the bay. While most of this area is actually free to explore, I recommend springing for tickets to the Skywalk along with the cloud forest and the beautiful Flower Dome. Don't miss the night show here either (more on that later). Period which is a beautiful, quirky, colorful, fun place to explore, even when it's raining; it's a good place to shop and get a beer and people watched. The trip to the iconic Marina Bay Sands Hotel; definitely don't skip on the historic Raffles hotel, which is a beautiful example of British colonial architecture, and don't miss their Long Bar, which is famous for the Singapore Sling cocktail. Thanks and follow for More Travel content.|
|2|2_post|https://www.tiktok.com/@montanadarby/video/7217039897687887109|Montana \| Travels|149|95.5K|4.8M|2023-4-1|/@montanadarby|montanadarby|The ultimate guide on the BEST things to do Singapore! Can you believe most of these things are completely FREE!!|['singapore', 'singaporethingstodo', 'marinabaysands', 'jewelchangi', 'gardensbythebay', 'backpacking']|73.7K|40.3K|10.5K|609|Sky Garden at Capita Spring free. Jewel fountain at Changi Airport free. OCBC Skyway £7.30. fort canning walk / tunnel free. arab street free. Gardens by the bay light show free. level 33 rooftop bar free entry. floral fantasy exhibition £9.10. spectra light show free. future world at art science museum £9.20.|
|3|3_post|https://www.tiktok.com/@aktravelss/video/7208911518451223813|AKTRAVELS ✈️ 🌍|129|130.2K|8.4M|2023-3-10|/@aktravelss|aktravelss|Singapore is just different 🇸🇬|['singapore', 'singaporetiktok', 'singaporelife', 'shopping', 'traveltiktok', 'aktravels']|2.1M|292.8K|25.1K|11.7K|shopping in singapore, LOUIS VUITTON|
|4|4_post|https://www.tiktok.com/@belemartorres2/video/7218172665222237441|Belemar Torres|440|1284|37.2K|2023-4-4|/@belemartorres2|belemartorres2|Here’s my top 20 most visited place in Singapore!|['singapore2023', 'travelsingapore', 'solotraveling', 'fyp']|10.1K|7756|1935|71|top 20 places to visit, merlion park, marina bay sands fountain show, skypark / celavi, helix bridge, gardens by the bay, haji lane, bali lane, sultan mosque, chinatown, little india, arab street, orchard, raffles hotel, sentosa sea aquarium, universal studios, cable car, fort canning tree tunnel, jewel changi airport|


#### Combine the caption / audio-to-text output (`final_text`) with the description. Many videos do not repeat the information given in the description, whereas the caption and the audio-to-text output are typically very similar to each other.

In [10]:
video_info_audio_caption_cleaned_df['final_text_description'] = video_info_audio_caption_cleaned_df['final_text'] + ' ' + video_info_audio_caption_cleaned_df['description']
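
One thing to keep in mind with the `+` above: in pandas, adding a string to a missing value yields a missing value, so a post that lacks either `final_text` or `description` ends up with an empty `final_text_description` and is skipped by the `type(text) == str` check in section 5. If keeping those posts is preferred, a minimal sketch of a null-tolerant join is shown below. The helpers `is_missing` and `combine_text` are hypothetical and not part of the notebook, and using them would change the row counts reported later.

```python
import math

def is_missing(value):
    """Return True for None or a float NaN (how pandas marks an empty cell)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def combine_text(final_text, description):
    """Join the two fields with a space, skipping whichever one is missing."""
    parts = [str(p).strip() for p in (final_text, description) if not is_missing(p)]
    # keep NaN when both fields are missing, so later type checks still work
    return ' '.join(parts) if parts else float('nan')

combined_both = combine_text('shopping in singapore', 'Singapore is just different')
combined_one = combine_text(float('nan'), 'Hawker centers are my love language')
combined_none = combine_text(float('nan'), None)
```

Such a helper could be applied row-wise, for example with `df.apply(lambda row: combine_text(row['final_text'], row['description']), axis=1)`.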

In [11]:
video_info_audio_caption_cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 401 entries, 0 to 400
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      401 non-null    object
 1   url                     401 non-null    object
 2   account_name            401 non-null    object
 3   following_count         401 non-null    int64 
 4   follower_count          401 non-null    object
 5   total_like_count        401 non-null    object
 6   date                    401 non-null    object
 7   href                    401 non-null    object
 8   handle                  401 non-null    object
 9   description             387 non-null    object
 10  hashtag                 401 non-null    object
 11  like_count              401 non-null    object
 12  bookmark_count          401 non-null    object
 13  share_count             401 non-null    object
 14  comment_count           401 non-null    object
 15  final_

#### Convert `following_count`, `follower_count`, `total_like_count`, `like_count`, `bookmark_count`, `share_count` and `comment_count` to integers. Counts in the thousands or millions are stored as strings made up of a float followed by a 'K' or 'M' suffix (e.g. '95.5K', '4.8M').

In [12]:
video_info_audio_caption_cleaned_df['following_count'] = video_info_audio_caption_cleaned_df['following_count'].apply(lambda x: str(x))

In [13]:
# define a function that converts string that ends with 'M' or 'K' to an integer
def convert_value(value):
    if value.endswith('M'):
        return int(float(value[:-1]) * 1000000)
    elif value.endswith('K'):
        return int(float(value[:-1]) * 1000)
    else:
        return int(value)

In [14]:
# apply function to all columns with counts
lst_counts = ['following_count','follower_count','total_like_count','like_count','bookmark_count','share_count',
              'comment_count']
for i in lst_counts:
    video_info_audio_caption_cleaned_df[i] = video_info_audio_caption_cleaned_df[i].apply(lambda x: convert_value(x))
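
As a possible variation on `convert_value`, the sketch below accepts integers as well as strings (so the earlier cast of `following_count` to `str` would not be needed) and uses `round()` instead of `int()`, since `int()` truncates and a product that lands just below a whole number because of floating-point error would lose one unit. The name `convert_count` is hypothetical; this is an illustration, not the function used to produce the saved dataset.

```python
def convert_count(value):
    """Convert counts such as '95.5K', '4.8M', '5008' or 149 to an int."""
    text = str(value).strip().upper()
    multipliers = {'K': 1_000, 'M': 1_000_000}
    if text and text[-1] in multipliers:
        # round() rather than int(), so float error cannot truncate the result
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

print(convert_count('95.5K'))  # 95500
print(convert_count('4.8M'))   # 4800000
print(convert_count(149))      # 149
```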

#### Create a new column that has the number of hashtags used in the post

In [15]:
video_info_audio_caption_cleaned_df['hashtag'].loc[0]

"['singapore', 'singaporetravel', 'travel']"

In [16]:
# convert string of list into a list
video_info_audio_caption_cleaned_df['hashtag'] = video_info_audio_caption_cleaned_df['hashtag'].apply(lambda x: ast.literal_eval(x))

In [17]:
# check if it is now a list
video_info_audio_caption_cleaned_df['hashtag'].loc[0]

['singapore', 'singaporetravel', 'travel']

In [18]:
# count the number of hashtags
video_info_audio_caption_cleaned_df['num_hashtags'] = video_info_audio_caption_cleaned_df['hashtag'].apply(lambda x: len(x))

In [19]:
video_info_audio_caption_cleaned_df['num_hashtags']

0      3
1      3
2      6
3      6
4      4
      ..
396    0
397    4
398    3
399    3
400    6
Name: num_hashtags, Length: 401, dtype: int64
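
`ast.literal_eval` is used here rather than `eval` because it only evaluates Python literals (lists, strings, numbers and so on), so arbitrary code in a CSV cell cannot be executed. It does raise on malformed input, though. A minimal sketch of a tolerant wrapper follows; `parse_hashtags` is a hypothetical helper, and returning an empty list on failure is an assumption about the desired behaviour.

```python
import ast

def parse_hashtags(raw):
    """Parse a stringified list such as "['a', 'b']"; return [] if it is not a list."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return []
    return parsed if isinstance(parsed, list) else []

tags = parse_hashtags("['singapore', 'singaporetravel', 'travel']")
print(len(tags))  # 3
```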

#### Save as csv file.

In [104]:
video_info_audio_caption_cleaned_df.to_csv('../datasets/video_info_audio_caption_cleaned_final.csv',index=False)

<a id='sentiment_posts'></a>

## 5. Run Topic Based Sentiment Analysis on Posts

#### Instantiate `train_df`

In [20]:
# instantiate dataframe
train_df = pd.DataFrame(columns=['post_comment','id','sentence','entity','pos_sentiment_words','neg_sentiment_words','sentiment'])
train_df

| |post_comment|id|sentence|entity|pos_sentiment_words|neg_sentiment_words|sentiment|
|:---|:---|:---|:---|:---|:---|:---|:---|


#### Using spaCy’s rule-based matcher and `sg_entities_patterns_df`, we can extract Singapore's tourist attractions as entities. To do so, we convert `sg_entities_patterns_df` to a list of dictionaries and add them as patterns to the rule-based matcher.

In [21]:
# load spaCy English model
nlp = spacy.load('en_core_web_sm')

# create entity ruler (the component is registered under the name 'ruleActions')
ruler = nlp.add_pipe('entity_ruler', name='ruleActions', config={'overwrite_ents': True})

# list of entities and patterns
# note that the text is lemmatized before entities are extracted, so patterns must be in root (lemma) form
lst_of_patterns = sg_entities_patterns_df.to_dict('records') # convert df to a list of dictionaries
patterns = lst_of_patterns

ruler.add_patterns(patterns)
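
To make the shape of `patterns` concrete: `to_dict('records')` yields one dictionary per row, and the entity ruler looks for a `label` and a `pattern` key in each. The sketch below mimics that shape with the standard library only. The two rows and the `ATTRACTION` label are hypothetical placeholders, not values taken from `sg_entities_patterns.csv`, and the other columns of the dataframe (which are also carried along in each dictionary) are omitted.

```python
import csv
import io

# hypothetical two-row extract, limited to the 'label' and 'pattern' columns
csv_text = """label,pattern
ATTRACTION,garden by the bay
ATTRACTION,fort canning
"""

# same shape as DataFrame.to_dict('records'): one dictionary per row
patterns = list(csv.DictReader(io.StringIO(csv_text)))
print(patterns[0])  # {'label': 'ATTRACTION', 'pattern': 'garden by the bay'}
```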

#### Here, we will split the posts into sentences. For each sentence, we will extract entities, positive and/or negative sentiment words and sentiment score 

In [22]:
# find no. of rows in 'video_info_audio_caption_cleaned_df'
n_rows = len(video_info_audio_caption_cleaned_df)

for i in range(0,n_rows):
    # define text
    text = video_info_audio_caption_cleaned_df['final_text_description'].loc[i]
    
    # if there are no words in 'final_text_description', it becomes a math nan which is a float
    # thus, to check if there are words, check if type is string or float
    if type(text) == str:
        # split text into smaller sentences on any character that is NOT a word character,
        # whitespace or one of ' ( ) * + ,  (inside the class, '-, acts as a range, so hyphens also split)
        sentences = re.split(r'[^\w\s\'-,]', text)

        for sentence in sentences:
            # instantiate a dictionary
            dictionary = {}

            # since this is a post, save this in 'post_comment'
            dictionary['post_comment'] = 'post' 
            
            # save the 'post_id'
            dictionary['id'] = video_info_audio_caption_cleaned_df['id'].loc[i]

            # convert characters to lowercase
            sentence = sentence.lower()
            # remove punctuations
            sentence = sentence.translate(str.maketrans('','',string.punctuation))

            # check if there are words in the sentence
            if sentence != '' and sentence != ' ':
                # save the sentence in dictionary
                dictionary['sentence'] = sentence

                # instantiate 'pos_sentiment_words' and 'neg_sentiment_words'
                dictionary['pos_sentiment_words'] = []
                dictionary['neg_sentiment_words'] = []
                for token in sentence.split():
                    textblob_token = TextBlob(token)
                    if textblob_token.sentiment.polarity > 0:
                        dictionary['pos_sentiment_words'].append(token)
                    elif textblob_token.sentiment.polarity < 0:
                        dictionary['neg_sentiment_words'].append(token)

                # lemmatize text using function defined above
                lemmatized_text = lemmatize_text(sentence)
                lemmatized_doc = nlp(lemmatized_text)

                # extract entities 
                entities = [entity for entity in lemmatized_doc.ents]

                for entity in entities:

                    # if word happens to be in the name of entity, it should not be a sentiment word, remove from pos words
                    for word in dictionary['pos_sentiment_words']:
                        if word in str(entity):
                            dictionary['pos_sentiment_words'].remove(word)
                            sentence = sentence.replace(word,'')
                    # if word happens to be in the name of entity, it should not be a sentiment word, remove from neg words
                    for word in dictionary['neg_sentiment_words']:
                        if word in str(entity):
                            dictionary['neg_sentiment_words'].remove(word)
                            sentence = sentence.replace(word,'')

                textblob_sentence = TextBlob(sentence)
                dictionary['sentiment'] = textblob_sentence.sentiment.polarity

                dictionary['entity'] = []
                for entity in entities:
                    # save the entity in the dictionary
                    dictionary['entity'].append(str(entity))

                # save the entire dictionary to train_df
                train_df.loc[len(train_df)] = dictionary
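
A note on the splitting pattern used in the loop above: inside a character class, `'-,` is interpreted as the range from `'` to `,` (that is `'`, `(`, `)`, `*`, `+`, `,`), so the hyphen itself is not protected and hyphenated words are split. The sketch below compares that pattern with a variant in which the hyphen is escaped. The variant is only an illustration; switching to it would change the sentence counts shown in this notebook.

```python
import re

original = r"[^\w\s\'-,]"   # '-, is read as the range ' ( ) * + ,
escaped = r"[^\w\s',\-]"    # hyphen escaped, so it is kept as a literal

text = "state-of-the-art views! wow"
print(re.split(original, text))  # ['state', 'of', 'the', 'art views', ' wow']
print(re.split(escaped, text))   # ['state-of-the-art views', ' wow']
```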

#### Check the dataframe

In [23]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1411 entries, 0 to 1410
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   post_comment         1411 non-null   object 
 1   id                   1411 non-null   object 
 2   sentence             1411 non-null   object 
 3   entity               1411 non-null   object 
 4   pos_sentiment_words  1411 non-null   object 
 5   neg_sentiment_words  1411 non-null   object 
 6   sentiment            1411 non-null   float64
dtypes: float64(1), object(6)
memory usage: 88.2+ KB


In [24]:
train_df.tail()

| |post_comment|id|sentence|entity|pos_sentiment_words|neg_sentiment_words|sentiment|
|:---|:---|:---|:---|:---|:---|:---|:---|
|1406|post|435_post|i feel like its the las vegas of asia but actually not quite there yet|[the las vegas, asia]|[]|[]|0.0|
|1407|post|435_post|dont get me wrong their gardens are so pretty and i loved walking through everything but there is just no natural substance to the country|[]|[pretty, loved, natural]|[wrong]|0.1|
|1408|post|435_post|so these are the reasons why i dont like singapore|[singapore]|[]|[]|0.0|
|1409|post|435_post|if you are actually from singapore and disagree on what i just said i am open to hearing your feedback so comment down below|[singapore]|[]|[down]|-0.051852|
|1410|post|435_post|im sorry singapore|[singapore]|[]|[sorry]|-0.5|


#### Save the dataframe to a csv file

In [25]:
train_df.to_csv('../datasets/train_df.csv',index=False)

<a id='sentiment_comments'></a>

## 6. Run Topic Based Sentiment Analysis on Comments

In [26]:
comments_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20015 entries, 0 to 20014
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             20015 non-null  object
 1   url            20015 non-null  object
 2   handle         20015 non-null  object
 3   comment_count  20015 non-null  object
 4   comment        20015 non-null  object
dtypes: object(5)
memory usage: 782.0+ KB


#### Here, we will split the comments into sentences. For each sentence, we will extract entities, positive and/or negative sentiment words and sentiment score 

In [27]:
# find no. of rows in 'comments_df'
n_rows = len(comments_df)

for i in range(0,n_rows):
    # define text
    text = comments_df['comment'].loc[i]
    
    # if there are no words in 'comment', it becomes a math nan which is a float
    # thus, to check if there are words, check if type is string or float
    if type(text) == str:
        # split text into smaller sentences on any character that is NOT a word character,
        # whitespace or one of ' ( ) * + ,  (inside the class, '-, acts as a range, so hyphens also split)
        sentences = re.split(r'[^\w\s\'-,]', text)

        for sentence in sentences:
            # instantiate a dictionary
            dictionary = {}

            # note: comment rows are also labelled 'post' in 'post_comment' here;
            # they are told apart from posts by the '_cmt' suffix in 'id'
            dictionary['post_comment'] = 'post'

            # save the id of the comment
            dictionary['id'] = comments_df['id'].loc[i]

            # convert characters to lowercase
            sentence = sentence.lower()
            # remove punctuations
            sentence = sentence.translate(str.maketrans('','',string.punctuation))

            # check if there are words in the sentence
            if sentence != '' and sentence != ' ':
                # save the sentence in dictionary
                dictionary['sentence'] = sentence

                # instantiate 'pos_sentiment_words' and 'neg_sentiment_words'
                dictionary['pos_sentiment_words'] = []
                dictionary['neg_sentiment_words'] = []
                for token in sentence.split():
                    textblob_token = TextBlob(token)
                    if textblob_token.sentiment.polarity > 0:
                        dictionary['pos_sentiment_words'].append(token)
                    elif textblob_token.sentiment.polarity < 0:
                        dictionary['neg_sentiment_words'].append(token)

                # lemmatize text using function defined above
                lemmatized_text = lemmatize_text(sentence)
                lemmatized_doc = nlp(lemmatized_text)

                # extract entities 
                entities = [entity for entity in lemmatized_doc.ents]

                for entity in entities:

                    # if word happens to be in the name of entity, it should not be a sentiment word, remove from pos words
                    for word in dictionary['pos_sentiment_words']:
                        if word in str(entity):
                            dictionary['pos_sentiment_words'].remove(word)
                            sentence = sentence.replace(word,'')
                    # if word happens to be in the name of entity, it should not be a sentiment word, remove from neg words
                    for word in dictionary['neg_sentiment_words']:
                        if word in str(entity):
                            dictionary['neg_sentiment_words'].remove(word)
                            sentence = sentence.replace(word,'')

                textblob_sentence = TextBlob(sentence)
                dictionary['sentiment'] = textblob_sentence.sentiment.polarity

                dictionary['entity'] = []
                for entity in entities:
                    # save the entity in the dictionary
                    dictionary['entity'].append(str(entity))

                # save the entire dictionary to train_df
                train_df.loc[len(train_df)] = dictionary
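
Both loops (sections 5 and 6) remove words from `pos_sentiment_words` and `neg_sentiment_words` while iterating over those same lists. In Python this skips the element that follows each removal, so a second entity word can survive. The sketch below reproduces the effect and shows one possible alternative that builds a new list instead. `drop_entity_words` is a hypothetical helper, the entity name `'great world city'` is only an example, and the helper matches whole tokens, which is a deliberate change from the substring test (`word in str(entity)`) used in the loops.

```python
def drop_entity_words(words, entity_texts):
    """Return the words that are not whole tokens of any entity name."""
    entity_tokens = {tok for ent in entity_texts for tok in ent.split()}
    return [w for w in words if w not in entity_tokens]

# mutating a list while looping over it skips the element after each removal
words = ['great', 'world', 'nice']
for w in words:
    if w in 'great world city':
        words.remove(w)
print(words)  # ['world', 'nice']

cleaned = drop_entity_words(['great', 'world', 'nice'], ['great world city'])
print(cleaned)  # ['nice']
```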

In [74]:
train_df.tail()

| |post_comment|id|sentence|entity|pos_sentiment_words|neg_sentiment_words|sentiment|
|:---|:---|:---|:---|:---|:---|:---|:---|
|31335|post|435_cmt|locals|[]|[]|[]|0.0|
|31336|post|435_cmt|you complained singaporeans are rude bec of your experience with 1 taxi driver|[1]|[]|[complained, rude]|-0.3|
|31337|post|435_cmt|i felt sorry for your friend if he is a singaporean|[singaporean]|[]|[sorry]|-0.5|
|31338|post|435_cmt|next time do a bit of googling before going to a new place|[]|[new]|[]|0.068182|
|31339|post|435_cmt|these opinions sound ignorant|[]|[sound]|[]|0.4|


#### Save the dataframe to a csv file

In [31]:
train_df.to_csv('../datasets/train_df.csv',index=False)

## <b> End of Part II</b> <br>
[Part I](Part_1-Data_Collection.ipynb#part_i) <br>
[Part III](Part_3-EDA_and_Modeling.ipynb#part_iii)