# Data Import & Cleaning

### Contents:

- [Data Import](#Data-Import)
- [Data Cleaning](#Data-Cleaning)
- [Feature Engineering](#Feature-Engineering)

### Import Libraries

In [1]:
#import standard libraries
import pandas as pd
import numpy as np

#import emoji
import emoji

from natsort import natsorted, index_natsorted, order_by_index

#import warnings to ignore flags when the project is complete
#import warnings
#warnings.filterwarnings('ignore')

#import pre-processing libraries for data cleaning
import string
import re
import nltk
from nltk.tokenize import RegexpTokenizer, sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

## Data Import

**Read scrapped data for the following videos**

In [2]:
va = pd.read_csv('../../data/scrapped_data/va_ebeveadmin_252840063350669.csv')

In [3]:
va

Unnamed: 0,video_for,totalEmojiReaction,views
0,ebeveadmin/videos/252840063350669,30,1.6K


In [4]:
#retrieve the number of views for the video
va['views'].iloc[0]

'1.6K'

In [5]:
#drop the K and replace it with 2 '0's behind
va['views'] = va['views'].str.replace("K", "00", regex=True)
#drop the dot
va['views'] = va['views'].str.replace(".", "", regex=True)
#change the string to be an integer
va['views'] = int(va['views'].iloc[0])

In [6]:
va

Unnamed: 0,video_for,totalEmojiReaction,views
0,ebeveadmin/videos/252840063350669,30,1600


In [7]:
va.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   video_for           1 non-null      object
 1   totalEmojiReaction  1 non-null      int64 
 2   views               1 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 152.0+ bytes


In [8]:
df = pd.read_csv('../../data/scrapped_data/ebeveadmin_252840063350669.csv', encoding='utf-8')

In [9]:
df

Unnamed: 0,postComment,postCommentAuthor,postCommentTime
0,LNS,Razip Yusuf,0:36
1,Baobeiii. Sooo chiooo,Vincent Wan,0:40
2,Jambu siol lols,Ellie Lee,0:44
3,Lns done..,Aysha Khamarudin Al Takhi,0:44
4,Got short plate ?,Aysha Khamarudin Al Takhi,0:50
5,"Let's go dating <span class=""pq6dq46d tbxw36s4 knj5qynh kvgmc6g5 ditlmg2l oygrvhab nvdbi5me sf5mxxl7 gl3lb2sf hhz5lgdu""><img alt=""😂"" height=""16"" referrerpolicy=""origin-when-cross-origin"" src=""https://static.xx.fbcdn.net/images/emoji.php/v9/t6f/2/16/1f602.png"" width=""16""/></span> sorry Chong <span class=""pq6dq46d tbxw36s4 knj5qynh kvgmc6g5 ditlmg2l oygrvhab nvdbi5me sf5mxxl7 gl3lb2sf hhz5lgdu""><img alt=""😅"" height=""16"" referrerpolicy=""origin-when-cross-origin"" src=""https://static.xx.fbcdn.net/images/emoji.php/v9/tf2/2/16/1f605.png"" width=""16""/></span>",Jeffrey Ng,0:56
6,No wonder u n chonghao all shocked for so long,き リーサン,1:00:07
7,How to cook,Irene Lee,1:00:55
8,Razor +3,Vincent Wan,1:01:27
9,Razor+1,Irene Lee,1:02:14


In [10]:
#https://stackoverflow.com/questions/32072076/find-the-unique-values-in-a-column-and-then-sort-them
#check if the seller uses multiple accounts to reply
postCommentAuthor_unique = df['postCommentAuthor'].unique()
print(sorted(postCommentAuthor_unique))

['Ad JC', "Anep Q'Ratu", 'Aysha Khamarudin Al Takhi', 'Catherine Koo', 'Dedy Hui', 'E-Beve', 'Ellie Lee', 'Ernest Tan', 'Geraldine Chan', 'Goh Tat Hin', 'Irene Lee', 'Jeffrey Ng', 'Kwok Ying Ying', 'Lily Koh', 'Maurice Wilson', 'Min Xuan', 'Razip Yusuf', 'Richard Ling', 'Shedah Rahman', 'Shuganya Devi', 'Vincent Wan', 'き リーサン']


In [11]:
#find comments posted by the seller only
df.loc[df['postCommentAuthor'] == 'E-Beve']

Unnamed: 0,postComment,postCommentAuthor,postCommentTime
14,【Product】Half Shell Mussel - S$9.50 | Keyword: HSM,E-Beve,1:05:08
21,【Product】Snow Crab 500g - S$19.00 | Keyword: SNOW,E-Beve,1:09:18
25,【Product】Squid Ring - S$8.00 | Keyword: SQUIDR,E-Beve,1:12:20
27,【Product】Breaded scallops - S$2.50 | Keyword: BREADS,E-Beve,1:15:27
42,vegetables 2nd,E-Beve,6:08
53,【Product】Freezepack Nuggets - S$8.00 | Keyword: NUG,E-Beve,8:39
68,【Product】Chicken Steak - S$8.50 | Keyword: CKCS,E-Beve,12:38
72,【Product】3 joint wings - S$10.50 | Keyword: 3JW,E-Beve,15:57
75,【Product】Popcorn chicken - S$10.50 | Keyword: POPC,E-Beve,18:21
86,【Product】Chicken Drumlets - S$5.50 | Keyword: CKCD,E-Beve,22:10


In [12]:
#check the comments to have a gauge
postComment_unique = df['postComment'].unique()
print(sorted(postComment_unique))

['20', '50 to 60pcs', '<span class="pq6dq46d tbxw36s4 knj5qynh kvgmc6g5 ditlmg2l oygrvhab nvdbi5me sf5mxxl7 gl3lb2sf hhz5lgdu"><img alt="👇" height="16" referrerpolicy="origin-when-cross-origin" src="https://static.xx.fbcdn.net/images/emoji.php/v9/tee/2/16/1f447.png" width="16"/></span><span class="pq6dq46d tbxw36s4 knj5qynh kvgmc6g5 ditlmg2l oygrvhab nvdbi5me sf5mxxl7 gl3lb2sf hhz5lgdu"><img alt="👇" height="16" referrerpolicy="origin-when-cross-origin" src="https://static.xx.fbcdn.net/images/emoji.php/v9/tee/2/16/1f447.png" width="16"/></span><span class="pq6dq46d tbxw36s4 knj5qynh kvgmc6g5 ditlmg2l oygrvhab nvdbi5me sf5mxxl7 gl3lb2sf hhz5lgdu"><img alt="👇" height="16" referrerpolicy="origin-when-cross-origin" src="https://static.xx.fbcdn.net/images/emoji.php/v9/tee/2/16/1f447.png" width="16"/></span> Recommended Product <span class="pq6dq46d tbxw36s4 knj5qynh kvgmc6g5 ditlmg2l oygrvhab nvdbi5me sf5mxxl7 gl3lb2sf hhz5lgdu"><img alt="👇" height="16" referrerpolicy="origin-when-cross-orig

## Data Cleaning

### Convert the emojis to text for easy cleaning

We noticed that the emojis have html parsers attached to it. Hence, we will convert the images of the emojis to text first, to remove the html parsers to the emojis, while retaining the emoji's text. We will convert it back to emoji afterwards.

In [13]:
df['postComment'] = df['postComment'].apply(emoji.demojize)

In [14]:
postComment_unique = df['postComment'].unique()
print(sorted(postComment_unique))

['20', '50 to 60pcs', '<span class="pq6dq46d tbxw36s4 knj5qynh kvgmc6g5 ditlmg2l oygrvhab nvdbi5me sf5mxxl7 gl3lb2sf hhz5lgdu"><img alt=":backhand_index_pointing_down:" height="16" referrerpolicy="origin-when-cross-origin" src="https://static.xx.fbcdn.net/images/emoji.php/v9/tee/2/16/1f447.png" width="16"/></span><span class="pq6dq46d tbxw36s4 knj5qynh kvgmc6g5 ditlmg2l oygrvhab nvdbi5me sf5mxxl7 gl3lb2sf hhz5lgdu"><img alt=":backhand_index_pointing_down:" height="16" referrerpolicy="origin-when-cross-origin" src="https://static.xx.fbcdn.net/images/emoji.php/v9/tee/2/16/1f447.png" width="16"/></span><span class="pq6dq46d tbxw36s4 knj5qynh kvgmc6g5 ditlmg2l oygrvhab nvdbi5me sf5mxxl7 gl3lb2sf hhz5lgdu"><img alt=":backhand_index_pointing_down:" height="16" referrerpolicy="origin-when-cross-origin" src="https://static.xx.fbcdn.net/images/emoji.php/v9/tee/2/16/1f447.png" width="16"/></span> Recommended Product <span class="pq6dq46d tbxw36s4 knj5qynh kvgmc6g5 ditlmg2l oygrvhab nvdbi5me sf5m

### Clean the comments using Regex

From the above unique values of the comments, there are several cleaning issues that needs to be addressed.

    1. Removal of links or URLs. 
        - URLs are present when the Facebook users manually type a link in the comment. 
        - Additionally, when users are tagged, their Facebook profile URL is printed as a result.
        - Similarly, there is a unique URL linked to each emoji as well. 
        
    2. Removal of HTML special entities
        - Examples of HTML special entities are '&amp' and '&gt'. 
        
    3. Removal of other HTML special terms
        - Examples of other HTML special entities are whitespace HTML special entities like '#x200B' and '#xa0'.
        
    4. Removal of HTML Parsers to the Emojis
        - Previously, the emojis have been demojized to text already. However, the HTML parsers to the mojis remain. Hence, they are required to be removed as well.
        
    5. Removal of HTML Parcers to Whitespaces
        - When a next line is entered in the same comment, there will be a HTML parser to the this whitespace. Hence, this are to be removed as well.
        
    6. Removal of HTML Parsers to tagged names
        - When users are tagged in the comments, in addition to their Facebook profile URL, there will be HTML parsers to the tagged names as well. Hence, this are to be removed as well.
        
    7. Removsl of other HTML Parsers   

In [15]:
def clean(row):
    
    # Remove links or URLs
    row['postComment'] = re.sub(
        pattern=r'https?:\/\/.*\/\w*', 
        repl='', 
        string=row['postComment'],
        flags=re.M)
    
    # Remove HTML special entities (e.g.. &amp, &gt;)
    row['postComment'] = re.sub(
        pattern=r'\&\w*;',
        repl='',
        string=row['postComment'],
        flags=re.M)    
    
    # Remove emoji html parsers
    #https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
    row['postComment'] = re.sub(
        pattern=r'<span\sclass=\"([a-z0-9]{8}\s)+[a-z0-9]{8}\">',
        repl='', 
        string=row['postComment'],
        flags=re.M)
 
    # Remove emoji html parsers
    row['postComment'] = re.sub(
        pattern = r'<img\salt=\"(\:\w*)(\-?)(\w*\:)\"\s[a-z]{6}=\"\d\d\"\s[a-z]{14}=.{26}\s[a-z]{3}=\">',
        repl=r'\1\2\3',
        string=row['postComment'],
        flags=re.M)
    
    # Remove whitespaces
    row['postComment'] = re.sub(
        pattern=r'<\/div><div\s.*\s.*>',
        repl=' ',
        string=row['postComment'],
        flags=re.M)

    # Remove whitespaces
    row['postComment'] = re.sub(
        pattern=r'<div\sdir=\"auto\"\s.*\s.*>',
        repl=' ',
        string=row['postComment'],
        flags=re.M)

    # Remove tagged names
    row['postComment'] = re.sub(
        pattern=r'<a\sclass=\"([a-z0-9]{8}\s)+[a-z0-9]{8}\"\shref=\">',
        repl='',
        string=row['postComment'],
        flags=re.M)
    

    # Remove consecutive non-ASCII characters
    # This will remove the chinese comments
    #https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
    row['postComment'] = re.sub(
        pattern=r'[^\x00-\x7F]+', 
        repl=' ', 
        string=row['postComment'],
        flags=re.M)
    
    return row

In [16]:
df2 = df.apply(clean, axis=1)

In [17]:
postComment_unique2 = df2['postComment'].unique()
print(sorted(postComment_unique2))

[' Product 3 joint wings - S$10.50 | Keyword: 3JW', ' Product Beef Chop 500g - S$17.00 | Keyword: BEEFST', ' Product Beef Cube - S$17.00 | Keyword: BEEFCUB', ' Product Breaded scallops - S$2.50 | Keyword: BREADS', ' Product Chicken Drumlets - S$5.50 | Keyword: CKCD', ' Product Chicken Steak - S$8.50 | Keyword: CKCS', ' Product Chicken hotdog X 2 - S$2.80 | Keyword: HOTD', ' Product Chicken katsudon - S$11.50 | Keyword: CKCK', ' Product Freezepack Nuggets - S$8.00 | Keyword: NUG', ' Product French fries - S$4.50 | Keyword: FF', ' Product Half Shell Mussel - S$9.50 | Keyword: HSM', ' Product Mixed veggies - S$2.50 | Keyword: MV', ' Product Onion Rings - S$6.50 | Keyword: ONION', ' Product Popcorn chicken - S$10.50 | Keyword: POPC', ' Product Razor Clams 1kg - S$6.00 | Keyword: RAZOR', ' Product Seaweed chicken - S$12.00 | Keyword: SEAC', ' Product Snow Crab 500g - S$19.00 | Keyword: SNOW', ' Product Squid Ring - S$8.00 | Keyword: SQUIDR', ' Product Whole chicken - S$4.80 | Keyword: WCKC'

**Convert encoded emoji text back to emojis**

In [18]:
df2['postComment'] = df2['postComment'].apply(emoji.emojize)

In [19]:
postComment_unique2 = df2['postComment'].unique()
print(sorted(postComment_unique2))

[' Product 3 joint wings - S$10.50 | Keyword: 3JW', ' Product Beef Chop 500g - S$17.00 | Keyword: BEEFST', ' Product Beef Cube - S$17.00 | Keyword: BEEFCUB', ' Product Breaded scallops - S$2.50 | Keyword: BREADS', ' Product Chicken Drumlets - S$5.50 | Keyword: CKCD', ' Product Chicken Steak - S$8.50 | Keyword: CKCS', ' Product Chicken hotdog X 2 - S$2.80 | Keyword: HOTD', ' Product Chicken katsudon - S$11.50 | Keyword: CKCK', ' Product Freezepack Nuggets - S$8.00 | Keyword: NUG', ' Product French fries - S$4.50 | Keyword: FF', ' Product Half Shell Mussel - S$9.50 | Keyword: HSM', ' Product Mixed veggies - S$2.50 | Keyword: MV', ' Product Onion Rings - S$6.50 | Keyword: ONION', ' Product Popcorn chicken - S$10.50 | Keyword: POPC', ' Product Razor Clams 1kg - S$6.00 | Keyword: RAZOR', ' Product Seaweed chicken - S$12.00 | Keyword: SEAC', ' Product Snow Crab 500g - S$19.00 | Keyword: SNOW', ' Product Squid Ring - S$8.00 | Keyword: SQUIDR', ' Product Whole chicken - S$4.80 | Keyword: WCKC'

### Reindexing the dataframe 
**New Column to reindex the dataframe in accordance to time**

From the data, we can tell that there is an inconsistent timestamp being used in the column 'postCommentTime'. For example, there are times like '0:57' and '1:00:14' which indicates 0 hour 0 mins 57 secs, and 1 hour 0 mins 14 secs. 

Hence, a new column 'postCommentTime_final' is created to ensure that a timestamp of HH:MM:SS is being used consistently throughout. However, we note that the number of days is being included for TimeDeltaIndex.

As a result, we will read the timestamp, by excluding the number of days.  

In [20]:
#TimedeltaIndex
#https://stackoverflow.com/questions/54877467/pandas-convert-hhmm-and-hhmmss-to-standard-hhmmss-in-python
# for example, time of '0:57' will then be 00:00:57; 0 hours 0 mins 57 secs
df2['postCommentTime_final'] = pd.to_timedelta(np.where(df2['postCommentTime'].str.count(':') == 1, '00:' + df2['postCommentTime'], df2['postCommentTime']))

In [21]:
df2.head()

Unnamed: 0,postComment,postCommentAuthor,postCommentTime,postCommentTime_final
0,LNS,Razip Yusuf,0:36,0 days 00:00:36
1,Baobeiii. Sooo chiooo,Vincent Wan,0:40,0 days 00:00:40
2,Jambu siol lols,Ellie Lee,0:44,0 days 00:00:44
3,Lns done..,Aysha Khamarudin Al Takhi,0:44,0 days 00:00:44
4,Got short plate ?,Aysha Khamarudin Al Takhi,0:50,0 days 00:00:50


In [22]:
df2['postCommentTime_final'] = df2['postCommentTime_final'].astype(str).map(lambda x: x[7:])

In [23]:
df2

Unnamed: 0,postComment,postCommentAuthor,postCommentTime,postCommentTime_final
0,LNS,Razip Yusuf,0:36,00:00:36
1,Baobeiii. Sooo chiooo,Vincent Wan,0:40,00:00:40
2,Jambu siol lols,Ellie Lee,0:44,00:00:44
3,Lns done..,Aysha Khamarudin Al Takhi,0:44,00:00:44
4,Got short plate ?,Aysha Khamarudin Al Takhi,0:50,00:00:50
5,Let's go dating 😂,Jeffrey Ng,0:56,00:00:56
6,No wonder u n chonghao all shocked for so long,き リーサン,1:00:07,01:00:07
7,How to cook,Irene Lee,1:00:55,01:00:55
8,Razor +3,Vincent Wan,1:01:27,01:01:27
9,Razor+1,Irene Lee,1:02:14,01:02:14


In [24]:
#reindex according to postCommentTime_final
#previous natsort in data collection didnt take into account the different timestamp format
df3 = df2.reindex(index=order_by_index(df2.index, index_natsorted(df2.postCommentTime_final)))

In [25]:
df3

Unnamed: 0,postComment,postCommentAuthor,postCommentTime,postCommentTime_final
0,LNS,Razip Yusuf,0:36,00:00:36
1,Baobeiii. Sooo chiooo,Vincent Wan,0:40,00:00:40
2,Jambu siol lols,Ellie Lee,0:44,00:00:44
3,Lns done..,Aysha Khamarudin Al Takhi,0:44,00:00:44
4,Got short plate ?,Aysha Khamarudin Al Takhi,0:50,00:00:50
5,Let's go dating 😂,Jeffrey Ng,0:56,00:00:56
32,Lns done baobei,Geraldine Chan,3:02,00:03:02
33,LS,Aysha Khamarudin Al Takhi,3:07,00:03:07
34,Steal ur gf,Ernest Tan,3:51,00:03:51
35,Lns done,Richard Ling,3:51,00:03:51


In [26]:
#reset the index for the dataframe
#https://stackoverflow.com/questions/20490274/how-to-reset-index-in-a-pandas-dataframe

df3 = df3.reset_index(drop=True)

In [27]:
df3

Unnamed: 0,postComment,postCommentAuthor,postCommentTime,postCommentTime_final
0,LNS,Razip Yusuf,0:36,00:00:36
1,Baobeiii. Sooo chiooo,Vincent Wan,0:40,00:00:40
2,Jambu siol lols,Ellie Lee,0:44,00:00:44
3,Lns done..,Aysha Khamarudin Al Takhi,0:44,00:00:44
4,Got short plate ?,Aysha Khamarudin Al Takhi,0:50,00:00:50
5,Let's go dating 😂,Jeffrey Ng,0:56,00:00:56
6,Lns done baobei,Geraldine Chan,3:02,00:03:02
7,LS,Aysha Khamarudin Al Takhi,3:07,00:03:07
8,Steal ur gf,Ernest Tan,3:51,00:03:51
9,Lns done,Richard Ling,3:51,00:03:51


**Obtain the length of the video**

Assuming that the last comment of the video is the total length of the video, we will input the length of the video from the comments dataframe into the video attributes dataframe.

In [28]:
#retrieve last comment to obtain the length of the video
df3['postCommentTime_final'].iloc[-1]

'01:19:24'

In [29]:
#https://stackoverflow.com/questions/6402812/how-to-convert-an-hmmss-time-string-to-seconds-in-python
def get_sec(time_str):
    """Get Seconds from time."""
    h, m, s = time_str.split(':')
    return int(h) * 3600 + int(m) * 60 + int(s)

In [30]:
#retrieve last comment to obtain the length of the video in seconds
va['videoLength']= get_sec(df3['postCommentTime_final'].iloc[-1])

In [31]:
va

Unnamed: 0,video_for,totalEmojiReaction,views,videoLength
0,ebeveadmin/videos/252840063350669,30,1600,4764


## Feature Engineering

### Comments made by the Seller

**New Column to identify the number of comments made by the seller in the video**

From the total sum of comments made by the seller, we will input it into the video attributes dataframe.

In [32]:
(df3['postCommentAuthor']=='E-Beve').sum()

29

In [33]:
va['numSellerComments'] = (df3['postCommentAuthor']=='E-Beve').sum()

In [34]:
va

Unnamed: 0,video_for,totalEmojiReaction,views,videoLength,numSellerComments
0,ebeveadmin/videos/252840063350669,30,1600,4764,29


**New Column to identify if the comment is made by the Seller or not**

In [35]:
#create a new column to show if the comment is made by the seller or not
df3['isSeller'] = df3['postCommentAuthor'].map(lambda x:1 if x =='E-Beve' else 0)

In [36]:
df3.head()

Unnamed: 0,postComment,postCommentAuthor,postCommentTime,postCommentTime_final,isSeller
0,LNS,Razip Yusuf,0:36,00:00:36,0
1,Baobeiii. Sooo chiooo,Vincent Wan,0:40,00:00:40,0
2,Jambu siol lols,Ellie Lee,0:44,00:00:44,0
3,Lns done..,Aysha Khamarudin Al Takhi,0:44,00:00:44,0
4,Got short plate ?,Aysha Khamarudin Al Takhi,0:50,00:00:50,0


In [37]:
df3['isSeller'].value_counts()

0    162
1     29
Name: isSeller, dtype: int64

In [38]:
#show all the seller's comments
df3.loc[df3['isSeller'] == 1]

Unnamed: 0,postComment,postCommentAuthor,postCommentTime,postCommentTime_final,isSeller
16,vegetables 2nd,E-Beve,6:08,00:06:08,1
27,Product Freezepack Nuggets - S$8.00 | Keyword: NUG,E-Beve,8:39,00:08:39,1
42,Product Chicken Steak - S$8.50 | Keyword: CKCS,E-Beve,12:38,00:12:38,1
46,Product 3 joint wings - S$10.50 | Keyword: 3JW,E-Beve,15:57,00:15:57,1
49,Product Popcorn chicken - S$10.50 | Keyword: POPC,E-Beve,18:21,00:18:21,1
60,Product Chicken Drumlets - S$5.50 | Keyword: CKCD,E-Beve,22:10,00:22:10,1
62,Product Chicken Drumlets - S$5.50 | Keyword: CKCD,E-Beve,22:18,00:22:18,1
68,Product Whole chicken - S$4.80 | Keyword: WCKC,E-Beve,24:38,00:24:38,1
75,Product Seaweed chicken - S$12.00 | Keyword: SEAC,E-Beve,27:52,00:27:52,1
95,Product Chicken hotdog X 2 - S$2.80 | Keyword: HOTD,E-Beve,34:37,00:34:37,1


### Length of comments

**New Column to identify the length of each comment**

From the comments, we take the length of each comment as the total number of words each comment has.

In [39]:
#length of each comment
#https://stackoverflow.com/questions/37483470/how-to-calculate-number-of-words-in-a-string-in-dataframe
df3['postCommentLength'] = df3['postComment'].str.split().str.len()

In [40]:
df3.head(10)

Unnamed: 0,postComment,postCommentAuthor,postCommentTime,postCommentTime_final,isSeller,postCommentLength
0,LNS,Razip Yusuf,0:36,00:00:36,0,1
1,Baobeiii. Sooo chiooo,Vincent Wan,0:40,00:00:40,0,3
2,Jambu siol lols,Ellie Lee,0:44,00:00:44,0,3
3,Lns done..,Aysha Khamarudin Al Takhi,0:44,00:00:44,0,2
4,Got short plate ?,Aysha Khamarudin Al Takhi,0:50,00:00:50,0,4
5,Let's go dating 😂,Jeffrey Ng,0:56,00:00:56,0,4
6,Lns done baobei,Geraldine Chan,3:02,00:03:02,0,3
7,LS,Aysha Khamarudin Al Takhi,3:07,00:03:07,0,1
8,Steal ur gf,Ernest Tan,3:51,00:03:51,0,3
9,Lns done,Richard Ling,3:51,00:03:51,0,2


**New Column to identify the total number of comments in the video**

From the total number of comments in the video, we will input it into the video attributes dataframe.

In [41]:
#total number of comments
df3['postCommentLength'].sum()

825

In [42]:
va['numComments'] = df3['postCommentLength'].sum()

In [43]:
va

Unnamed: 0,video_for,totalEmojiReaction,views,videoLength,numSellerComments,numComments
0,ebeveadmin/videos/252840063350669,30,1600,4764,29,825


### LNS

LNS is an acronym that stands for 'like and share'. It is a form of customer engagement as it indicates by the customers to the sellers that they have liked and shared the video on their Facebook wall. 

**New Column to identify if Customers are engaging in liking and sharing the video**

In [44]:
#if the customer has commented 'lns' or 'ls' which stands for 'like & shared' & 'like shared' respectively
def lns(comment):
    if re.search(r'(l)(n?)(s)', comment, re.IGNORECASE):
        return int(1)
    else:
        return int(0)

In [45]:
df3['lns'] = df3['postComment'].map(lambda x:lns(x))

In [46]:
df3.head()

Unnamed: 0,postComment,postCommentAuthor,postCommentTime,postCommentTime_final,isSeller,postCommentLength,lns
0,LNS,Razip Yusuf,0:36,00:00:36,0,1,1
1,Baobeiii. Sooo chiooo,Vincent Wan,0:40,00:00:40,0,3,0
2,Jambu siol lols,Ellie Lee,0:44,00:00:44,0,3,1
3,Lns done..,Aysha Khamarudin Al Takhi,0:44,00:00:44,0,2,1
4,Got short plate ?,Aysha Khamarudin Al Takhi,0:50,00:00:50,0,4,0


**New Column to identify if the number of Customers who explicitly inform the sellers that they are engaging in liking and sharing the video**

In [47]:
#range of customer's engagement for LNS
df3['lns'].value_counts()

0    164
1     27
Name: lns, dtype: int64

In [48]:
(df3['lns']==1).sum()

27

In [49]:
va['lnsQuantity'] = (df3['lns']==1).sum()

In [50]:
va

Unnamed: 0,video_for,totalEmojiReaction,views,videoLength,numSellerComments,numComments,lnsQuantity
0,ebeveadmin/videos/252840063350669,30,1600,4764,29,825,27


## Sales Quantity

**New Columns to identify the quantity of sales made**

From the comments, using regex, we first see an overview of the comments that are related to the sale of the products. 

In [51]:
#products offered by the seller
df3[df3['postComment'].str.contains('(Keyword: )(\w*)', regex=True)]

  df3[df3['postComment'].str.contains('(Keyword: )(\w*)', regex=True)]


Unnamed: 0,postComment,postCommentAuthor,postCommentTime,postCommentTime_final,isSeller,postCommentLength,lns
27,Product Freezepack Nuggets - S$8.00 | Keyword: NUG,E-Beve,8:39,00:08:39,1,8,0
42,Product Chicken Steak - S$8.50 | Keyword: CKCS,E-Beve,12:38,00:12:38,1,8,0
46,Product 3 joint wings - S$10.50 | Keyword: 3JW,E-Beve,15:57,00:15:57,1,9,0
49,Product Popcorn chicken - S$10.50 | Keyword: POPC,E-Beve,18:21,00:18:21,1,8,0
60,Product Chicken Drumlets - S$5.50 | Keyword: CKCD,E-Beve,22:10,00:22:10,1,8,0
62,Product Chicken Drumlets - S$5.50 | Keyword: CKCD,E-Beve,22:18,00:22:18,1,8,0
68,Product Whole chicken - S$4.80 | Keyword: WCKC,E-Beve,24:38,00:24:38,1,8,0
75,Product Seaweed chicken - S$12.00 | Keyword: SEAC,E-Beve,27:52,00:27:52,1,8,0
95,Product Chicken hotdog X 2 - S$2.80 | Keyword: HOTD,E-Beve,34:37,00:34:37,1,10,0
105,Product Beef Cube - S$17.00 | Keyword: BEEFCUB,E-Beve,38:12,00:38:12,1,8,0


In [52]:
#overview of the sales
df3[df3['postComment'].str.contains('(\w*)(\s)*(\+)(\s)*(\d*)', regex=True)]

  df3[df3['postComment'].str.contains('(\w*)(\s)*(\+)(\s)*(\d*)', regex=True)]


Unnamed: 0,postComment,postCommentAuthor,postCommentTime,postCommentTime_final,isSeller,postCommentLength,lns
18,"👇 [Order rules] Leave a message ""keyword"" or ""keyword + quantity""",Aysha Khamarudin Al Takhi,6:59,00:06:59,0,11,0
57,Ckcs+1,Irene Lee,21:13,00:21:13,0,1,0
59,Popcorn+1,Irene Lee,21:26,00:21:26,0,1,0
61,Popcorn+1,Irene Lee,22:12,00:22:12,0,1,0
63,Popc +1,Irene Lee,22:33,00:22:33,0,2,0
66,Bug+1,Irene Lee,24:08,00:24:08,0,1,0
67,Nug +1,Irene Lee,24:17,00:24:17,0,2,0
79,Seac +1,Irene Lee,29:16,00:29:16,0,2,0
106,HOTD+1,Irene Lee,38:25,00:38:25,0,1,0
143,Onion +1,Irene Lee,49:16,00:49:16,0,2,0


In [53]:
def sale(comment):
    if re.findall(r'[Xx\+]\s?\d', comment):
        results = re.findall(r'[Xx\+]\s?\d', comment)
        total = 0
        for r in results:
            total += int(r[-1])
        return total
    else:
        return int(0)

In [54]:
df3['salesQuantity'] = df3['postComment'].apply(lambda x:sale(x))

In [55]:
df3

Unnamed: 0,postComment,postCommentAuthor,postCommentTime,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity
0,LNS,Razip Yusuf,0:36,00:00:36,0,1,1,0
1,Baobeiii. Sooo chiooo,Vincent Wan,0:40,00:00:40,0,3,0,0
2,Jambu siol lols,Ellie Lee,0:44,00:00:44,0,3,1,0
3,Lns done..,Aysha Khamarudin Al Takhi,0:44,00:00:44,0,2,1,0
4,Got short plate ?,Aysha Khamarudin Al Takhi,0:50,00:00:50,0,4,0,0
5,Let's go dating 😂,Jeffrey Ng,0:56,00:00:56,0,4,0,0
6,Lns done baobei,Geraldine Chan,3:02,00:03:02,0,3,1,0
7,LS,Aysha Khamarudin Al Takhi,3:07,00:03:07,0,1,1,0
8,Steal ur gf,Ernest Tan,3:51,00:03:51,0,3,0,0
9,Lns done,Richard Ling,3:51,00:03:51,0,2,1,0


The cells at row 28, 86, 101, 137 and 144 are ordered without the '+' to the product code. For products code mentioned without the quantity, we assume that the product is ordered for a quantity of 1. Hence, the sales quantity are manually filled in.

In [56]:
df3.loc[28, 'salesQuantity'] = int(3)
df3.loc[86, 'salesQuantity'] = int(1)
df3.loc[101, 'salesQuantity'] = int(1)
df3.loc[137, 'salesQuantity'] = int(1)
df3.loc[144, 'salesQuantity'] = int(1)

In [57]:
df3

Unnamed: 0,postComment,postCommentAuthor,postCommentTime,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity
0,LNS,Razip Yusuf,0:36,00:00:36,0,1,1,0
1,Baobeiii. Sooo chiooo,Vincent Wan,0:40,00:00:40,0,3,0,0
2,Jambu siol lols,Ellie Lee,0:44,00:00:44,0,3,1,0
3,Lns done..,Aysha Khamarudin Al Takhi,0:44,00:00:44,0,2,1,0
4,Got short plate ?,Aysha Khamarudin Al Takhi,0:50,00:00:50,0,4,0,0
5,Let's go dating 😂,Jeffrey Ng,0:56,00:00:56,0,4,0,0
6,Lns done baobei,Geraldine Chan,3:02,00:03:02,0,3,1,0
7,LS,Aysha Khamarudin Al Takhi,3:07,00:03:07,0,1,1,0
8,Steal ur gf,Ernest Tan,3:51,00:03:51,0,3,0,0
9,Lns done,Richard Ling,3:51,00:03:51,0,2,1,0


In [58]:
#range of sales quantity
df3['salesQuantity'].value_counts()

0    162
1     17
2      8
3      3
4      1
Name: salesQuantity, dtype: int64

In [59]:
#total number of orders made
df3['salesQuantity'].sum()

46

In [60]:
va['salesQuantity'] = df3['salesQuantity'].sum()

In [61]:
va

Unnamed: 0,video_for,totalEmojiReaction,views,videoLength,numSellerComments,numComments,lnsQuantity,salesQuantity
0,ebeveadmin/videos/252840063350669,30,1600,4764,29,825,27,46


## Products 

The seller will comment and post the unique product codes for each product. Thereafter, customers who are keen on purchasing the products will explicitly comment out the specific & unique product codes, in addition to the quanityt of each product that they wish to purchase.

**New Columns to identify the products purchased by the Customers**

Regex is used to identify the products being offered and the products being purchased as well.

In [62]:
#function to identify the code of the product bought
def sale2(comment):
    if re.search(r'(\w*)(\s?)([Xx\+])(\s?)(\d)', comment):
        return str(re.search(r'(\w*)(\s?)([Xx\+])(\s?)(\d)', comment).group(0)[:-2])
    else:
        return int(0)


In [63]:
#identifies all comments that have the codes of the products purchased by the Customers
#this column will be dropped afterwards.
df3['productBought'] = df3['postComment'].apply(lambda x:sale2(x))

In [64]:
df3

Unnamed: 0,postComment,postCommentAuthor,postCommentTime,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,productBought
0,LNS,Razip Yusuf,0:36,00:00:36,0,1,1,0,0
1,Baobeiii. Sooo chiooo,Vincent Wan,0:40,00:00:40,0,3,0,0,0
2,Jambu siol lols,Ellie Lee,0:44,00:00:44,0,3,1,0,0
3,Lns done..,Aysha Khamarudin Al Takhi,0:44,00:00:44,0,2,1,0,0
4,Got short plate ?,Aysha Khamarudin Al Takhi,0:50,00:00:50,0,4,0,0,0
5,Let's go dating 😂,Jeffrey Ng,0:56,00:00:56,0,4,0,0,0
6,Lns done baobei,Geraldine Chan,3:02,00:03:02,0,3,1,0,0
7,LS,Aysha Khamarudin Al Takhi,3:07,00:03:07,0,1,1,0,0
8,Steal ur gf,Ernest Tan,3:51,00:03:51,0,3,0,0,0
9,Lns done,Richard Ling,3:51,00:03:51,0,2,1,0,0


In [65]:
df3['productBought'].unique()

array([0, 'Nug X', 'PopC ', 'Ckcs', 'Popcorn', 'Popc ', 'Bug', 'Nug ',
       'Wckc ', 'Seac X', 'Seac ', 'hotdog X', 'hotd X', 'HOTD', 'HOTD X',
       'Nug x', 'Onion ', 'FF', 'MV', 'Razor ', 'Razor', 'Breads'],
      dtype=object)

The cells at rows 28, 86, 101, 137 & 144 are ordered without the '+' to the product code. 

Hence, they are manually filled in.

In [66]:
#https://stackoverflow.com/questions/13842088/set-value-for-particular-cell-in-pandas-dataframe-using-index
df3.loc[28, 'productBought'] = 'NUG'
df3.loc[86, 'productBought'] = 'SEAC'
df3.loc[101, 'productBought'] = 'HOTD'
df3.loc[137, 'productBought'] = 'CKCS'
df3.loc[144, 'productBought'] = 'ONION'

Notwithstanding the above, we noticed that some orders were made with an 'X' or 'x' character instead of the '+' symbol. Hence, we will remove the additional X' or 'x' characters.

In [67]:
#keeping the first word only
#https://stackoverflow.com/questions/37504672/pandas-dataframe-return-first-word-in-string-for-column
df3['productBought'] = df3['productBought'].str.split().str.get(0)

In [68]:
df3['productBought'] = df3['productBought'].fillna(0)

In [69]:
df3['productBought'].unique()

array([0, 'NUG', 'Nug', 'PopC', 'Ckcs', 'Popcorn', 'Popc', 'Bug', 'Wckc',
       'Seac', 'SEAC', 'hotdog', 'hotd', 'HOTD', 'CKCS', 'Onion', 'ONION',
       'FF', 'MV', 'Razor', 'Breads'], dtype=object)

We noticed that some customers have inconsistently used the full name of the product instead of the product code to order the products. Hence, we will replace them with the product codes.

In [70]:
df3['productBought'] = df3['productBought'].replace(to_replace= 'hotdog', value= 'HOTD', regex= True)
df3['productBought'] = df3['productBought'].replace(to_replace= 'Popcorn', value= 'POPC', regex= True)

In [71]:
df3['productBought'].unique()

array([0, 'NUG', 'Nug', 'PopC', 'Ckcs', 'POPC', 'Popc', 'Bug', 'Wckc',
       'Seac', 'SEAC', 'HOTD', 'hotd', 'CKCS', 'Onion', 'ONION', 'FF',
       'MV', 'Razor', 'Breads'], dtype=object)

Change the produce codes to be uppercase for consistency

In [72]:
#change the produce codes to be uppercase for consistency, and since python is case sensitive.
#https://stackoverflow.com/questions/39512002/convert-whole-dataframe-from-lower-case-to-upper-case-with-pandas
df3['productBought'] = df3['productBought'].astype(str).str.upper()

In [73]:
df3['productBought'].unique()

array(['0', 'NUG', 'POPC', 'CKCS', 'BUG', 'WCKC', 'SEAC', 'HOTD', 'ONION',
       'FF', 'MV', 'RAZOR', 'BREADS'], dtype=object)

### Price of Products

**New Column to identify the price of the products**

This new column is created to identify the regex of the unique product codes and their corresponding prices, as adviced by the seller.

In [74]:
#products offered by the seller
df3[df3['postComment'].str.contains('(Keyword: )(\w*)', regex=True)]

  df3[df3['postComment'].str.contains('(Keyword: )(\w*)', regex=True)]


Unnamed: 0,postComment,postCommentAuthor,postCommentTime,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,productBought
27,Product Freezepack Nuggets - S$8.00 | Keyword: NUG,E-Beve,8:39,00:08:39,1,8,0,0,0
42,Product Chicken Steak - S$8.50 | Keyword: CKCS,E-Beve,12:38,00:12:38,1,8,0,0,0
46,Product 3 joint wings - S$10.50 | Keyword: 3JW,E-Beve,15:57,00:15:57,1,9,0,0,0
49,Product Popcorn chicken - S$10.50 | Keyword: POPC,E-Beve,18:21,00:18:21,1,8,0,0,0
60,Product Chicken Drumlets - S$5.50 | Keyword: CKCD,E-Beve,22:10,00:22:10,1,8,0,0,0
62,Product Chicken Drumlets - S$5.50 | Keyword: CKCD,E-Beve,22:18,00:22:18,1,8,0,0,0
68,Product Whole chicken - S$4.80 | Keyword: WCKC,E-Beve,24:38,00:24:38,1,8,0,0,0
75,Product Seaweed chicken - S$12.00 | Keyword: SEAC,E-Beve,27:52,00:27:52,1,8,0,0,0
95,Product Chicken hotdog X 2 - S$2.80 | Keyword: HOTD,E-Beve,34:37,00:34:37,1,10,0,2,HOTD
105,Product Beef Cube - S$17.00 | Keyword: BEEFCUB,E-Beve,38:12,00:38:12,1,8,0,0,0


In [75]:
def price(comment):
    if re.search(r'(S)(\$)(\d*)(.*)(\s|\s)(.*)(\:)(\s*)(.*)', comment):
        return str(re.search(r'(\$)(\d*)(.*)(\s|\s)(.*)(\:)(\s*)(.*)', comment).group(0))
    else:
        return int(0)

In [76]:
df3['productPrice'] = df3['postComment'].apply(lambda x:price(x))

In [77]:
df3

Unnamed: 0,postComment,postCommentAuthor,postCommentTime,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,productBought,productPrice
0,LNS,Razip Yusuf,0:36,00:00:36,0,1,1,0,0,0
1,Baobeiii. Sooo chiooo,Vincent Wan,0:40,00:00:40,0,3,0,0,0,0
2,Jambu siol lols,Ellie Lee,0:44,00:00:44,0,3,1,0,0,0
3,Lns done..,Aysha Khamarudin Al Takhi,0:44,00:00:44,0,2,1,0,0,0
4,Got short plate ?,Aysha Khamarudin Al Takhi,0:50,00:00:50,0,4,0,0,0,0
5,Let's go dating 😂,Jeffrey Ng,0:56,00:00:56,0,4,0,0,0,0
6,Lns done baobei,Geraldine Chan,3:02,00:03:02,0,3,1,0,0,0
7,LS,Aysha Khamarudin Al Takhi,3:07,00:03:07,0,1,1,0,0,0
8,Steal ur gf,Ernest Tan,3:51,00:03:51,0,3,0,0,0,0
9,Lns done,Richard Ling,3:51,00:03:51,0,2,1,0,0,0


We noticed that each comment has the regex "\s\|\s\w*\:", where "\w*" is equivalent to the word 'Keyword' in the middle of the extracted string of comments for the column 'productPrice'. Hence, we will remove the mentioned regex.

In [78]:
#https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-column-of-a-pandas-dataframe
df3['productPrice'] = df3['productPrice'].str.replace("\s\|\s\w*\:", "", regex=True)

In [79]:
df3

Unnamed: 0,postComment,postCommentAuthor,postCommentTime,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,productBought,productPrice
0,LNS,Razip Yusuf,0:36,00:00:36,0,1,1,0,0,
1,Baobeiii. Sooo chiooo,Vincent Wan,0:40,00:00:40,0,3,0,0,0,
2,Jambu siol lols,Ellie Lee,0:44,00:00:44,0,3,1,0,0,
3,Lns done..,Aysha Khamarudin Al Takhi,0:44,00:00:44,0,2,1,0,0,
4,Got short plate ?,Aysha Khamarudin Al Takhi,0:50,00:00:50,0,4,0,0,0,
5,Let's go dating 😂,Jeffrey Ng,0:56,00:00:56,0,4,0,0,0,
6,Lns done baobei,Geraldine Chan,3:02,00:03:02,0,3,1,0,0,
7,LS,Aysha Khamarudin Al Takhi,3:07,00:03:07,0,1,1,0,0,
8,Steal ur gf,Ernest Tan,3:51,00:03:51,0,3,0,0,0,
9,Lns done,Richard Ling,3:51,00:03:51,0,2,1,0,0,


In [80]:
df3['productPrice'].unique()

array([nan, '$8.00 NUG', '$8.50 CKCS', '$10.50 3JW', '$10.50 POPC',
       '$5.50 CKCD', '$4.80 WCKC', '$12.00 SEAC', '$2.80 HOTD',
       '$17.00 BEEFCUB', '$17.00 BEEFST', '$11.50 CKCK', '$6.50 ONION',
       '$4.50 FF', '$2.50 MV', '$6.00 RAZOR', '$9.50 HSM', '$19.00 SNOW',
       '$8.00 SQUIDR', '$2.50 BREADS'], dtype=object)

Replace NaN values with an integer '0'

In [81]:
df3['productPrice'] = df3['productPrice'].fillna(0)

In [82]:
df3['productPrice'].unique()

array([0, '$8.00 NUG', '$8.50 CKCS', '$10.50 3JW', '$10.50 POPC',
       '$5.50 CKCD', '$4.80 WCKC', '$12.00 SEAC', '$2.80 HOTD',
       '$17.00 BEEFCUB', '$17.00 BEEFST', '$11.50 CKCK', '$6.50 ONION',
       '$4.50 FF', '$2.50 MV', '$6.00 RAZOR', '$9.50 HSM', '$19.00 SNOW',
       '$8.00 SQUIDR', '$2.50 BREADS'], dtype=object)

Excluding the comments where there was no product codes as adviced by the seller, we find the number of unique products offered by the seller.

In [83]:
#number of unique products offered by the seller
int(df3['productPrice'].nunique()) - int(1)

19

In [84]:
#total number of products offered
va['numProducts'] = int(df3['productPrice'].nunique()) - int(1)

In [85]:
va

Unnamed: 0,video_for,totalEmojiReaction,views,videoLength,numSellerComments,numComments,lnsQuantity,salesQuantity,numProducts
0,ebeveadmin/videos/252840063350669,30,1600,4764,29,825,27,46,19


**Drop irrelevant columns**

The following column was dropped for the following reasons:

1. 'postCommentTime'
- Since a new column 'postCommentTime_final' was created to ensure that a consistent timestamp of HH:MM:SS is used consistently throughout the dataframe, and the dataframe has been thereafter reindex and sorted in accordance to time in ascending order, we dropped the original inconsistent time column 'postCommentTime' as it had varying timestamp formats of HH:MM:SS, MM:SS and M:SS.

In [86]:
#drop unwanted columns
df3.drop(['postCommentTime'], axis=1, inplace=True)

In [87]:
df3.head()

Unnamed: 0,postComment,postCommentAuthor,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,productBought,productPrice
0,LNS,Razip Yusuf,00:00:36,0,1,1,0,0,0
1,Baobeiii. Sooo chiooo,Vincent Wan,00:00:40,0,3,0,0,0,0
2,Jambu siol lols,Ellie Lee,00:00:44,0,3,1,0,0,0
3,Lns done..,Aysha Khamarudin Al Takhi,00:00:44,0,2,1,0,0,0
4,Got short plate ?,Aysha Khamarudin Al Takhi,00:00:50,0,4,0,0,0,0


### Revenue from the sale of the products

**Dummify the products bought to find the revenue**

In [88]:
#getdummies the products bought
df3 = pd.get_dummies(df3, columns = ['productBought'], drop_first = True)

In [89]:
df3.head()

Unnamed: 0,postComment,postCommentAuthor,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,productPrice,productBought_BREADS,productBought_BUG,productBought_CKCS,productBought_FF,productBought_HOTD,productBought_MV,productBought_NUG,productBought_ONION,productBought_POPC,productBought_RAZOR,productBought_SEAC,productBought_WCKC
0,LNS,Razip Yusuf,00:00:36,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Baobeiii. Sooo chiooo,Vincent Wan,00:00:40,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Jambu siol lols,Ellie Lee,00:00:44,0,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Lns done..,Aysha Khamarudin Al Takhi,00:00:44,0,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Got short plate ?,Aysha Khamarudin Al Takhi,00:00:50,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [90]:
# iterating the columns
for col in df3.columns:
    print(col)

postComment
postCommentAuthor
postCommentTime_final
isSeller
postCommentLength
lns
salesQuantity
productPrice
productBought_BREADS
productBought_BUG
productBought_CKCS
productBought_FF
productBought_HOTD
productBought_MV
productBought_NUG
productBought_ONION
productBought_POPC
productBought_RAZOR
productBought_SEAC
productBought_WCKC


After the column 'productBought' has been dummified, the cells will return a '1' if that particular item is bought, and a '0' if it is not. Hence, we will replace all the '1' with the price of the product.

Then, we will create a new revenue column which is a multiplication of the column 'salesQuantity' against the price of the product (i.e. the dummified 'productBought' columnns).

Product BREADS

In [91]:
df3[df3['postComment'].str.contains('BREADS', na = False, regex = False)]

Unnamed: 0,postComment,postCommentAuthor,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,productPrice,productBought_BREADS,productBought_BUG,productBought_CKCS,productBought_FF,productBought_HOTD,productBought_MV,productBought_NUG,productBought_ONION,productBought_POPC,productBought_RAZOR,productBought_SEAC,productBought_WCKC
186,Product Breaded scallops - S$2.50 | Keyword: BREADS,E-Beve,01:15:27,1,8,0,0,$2.50 BREADS,0,0,0,0,0,0,0,0,0,0,0,0


In [92]:
df3['productBought_BREADS'] = df3['productBought_BREADS'].map(lambda x:float(2.50) if x == int(1) else 0)

In [93]:
df3['revenue_BREADS'] = np.multiply(df3['productBought_BREADS'], df3['salesQuantity'])

In [94]:
revenue_BREADS = "The total revenue from the sale of the product {} is ${}". format ("BREADS", format(df3['revenue_BREADS'].sum(), '.2f'))
print(revenue_BREADS)


The total revenue from the sale of the product BREADS is $5.00


Product BUG

In [95]:
df3[df3['postComment'].str.contains('BUG', na = False, regex = False)]

Unnamed: 0,postComment,postCommentAuthor,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,productPrice,productBought_BREADS,productBought_BUG,...,productBought_FF,productBought_HOTD,productBought_MV,productBought_NUG,productBought_ONION,productBought_POPC,productBought_RAZOR,productBought_SEAC,productBought_WCKC,revenue_BREADS


We noticed that the seller does not have a product code 'BUG'. Perhaps a wrong product code has been typed out and posted.

Product CKCS

In [96]:
df3[df3['postComment'].str.contains('CKCS', na = False, regex = False)]

Unnamed: 0,postComment,postCommentAuthor,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,productPrice,productBought_BREADS,productBought_BUG,...,productBought_FF,productBought_HOTD,productBought_MV,productBought_NUG,productBought_ONION,productBought_POPC,productBought_RAZOR,productBought_SEAC,productBought_WCKC,revenue_BREADS
42,Product Chicken Steak - S$8.50 | Keyword: CKCS,E-Beve,00:12:38,1,8,0,0,$8.50 CKCS,0.0,0,...,0,0,0,0,0,0,0,0,0,0.0
129,Product Chicken Steak - S$8.50 | Keyword: CKCS,E-Beve,00:45:47,1,8,0,0,$8.50 CKCS,0.0,0,...,0,0,0,0,0,0,0,0,0,0.0


In [97]:
df3['productBought_CKCS'] = df3['productBought_CKCS'].map(lambda x:float(8.50) if x == int(1) else 0)

In [98]:
df3['revenue_CKCS'] = np.multiply(df3['productBought_CKCS'], df3['salesQuantity'])

In [99]:
revenue_CKCS = "The total revenue from the sale of the product {} is ${}". format ("CKCS", format(df3['revenue_CKCS'].sum(), '.2f'))
print(revenue_CKCS)


The total revenue from the sale of the product CKCS is $17.00


Product FF

In [100]:
df3[df3['postComment'].str.contains('FF', na = False, regex = False)]

Unnamed: 0,postComment,postCommentAuthor,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,productPrice,productBought_BREADS,productBought_BUG,...,productBought_HOTD,productBought_MV,productBought_NUG,productBought_ONION,productBought_POPC,productBought_RAZOR,productBought_SEAC,productBought_WCKC,revenue_BREADS,revenue_CKCS
152,Product French fries - S$4.50 | Keyword: FF,E-Beve,00:52:44,1,8,0,0,$4.50 FF,0.0,0,...,0,0,0,0,0,0,0,0,0.0,0.0
153,FF+1,Irene Lee,00:53:22,0,1,0,1,0,0.0,0,...,0,0,0,0,0,0,0,0,0.0,0.0


In [101]:
df3['productBought_FF'] = df3['productBought_FF'].map(lambda x:float(4.50) if x == int(1) else 0)

In [102]:
df3['revenue_FF'] = np.multiply(df3['productBought_FF'], df3['salesQuantity'])

In [103]:
revenue_FF = "The total revenue from the sale of the product {} is ${}". format ("FF", format(df3['productBought_FF'].sum(), '.2f'))
print(revenue_FF)


The total revenue from the sale of the product FF is $4.50


Product HOTD

In [104]:
df3[df3['postComment'].str.contains('HOTD', na = False, regex = False)]

Unnamed: 0,postComment,postCommentAuthor,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,productPrice,productBought_BREADS,productBought_BUG,...,productBought_MV,productBought_NUG,productBought_ONION,productBought_POPC,productBought_RAZOR,productBought_SEAC,productBought_WCKC,revenue_BREADS,revenue_CKCS,revenue_FF
95,Product Chicken hotdog X 2 - S$2.80 | Keyword: HOTD,E-Beve,00:34:37,1,10,0,2,$2.80 HOTD,0.0,0,...,0,0,0,0,0,0,0,0.0,0.0,0.0
101,HOTD,Shuganya Devi,00:37:00,0,1,0,1,0,0.0,0,...,0,0,0,0,0,0,0,0.0,0.0,0.0
106,HOTD+1,Irene Lee,00:38:25,0,1,0,1,0,0.0,0,...,0,0,0,0,0,0,0,0.0,0.0,0.0
110,HOTD X 2,Aysha Khamarudin Al Takhi,00:39:22,0,3,0,2,0,0.0,0,...,0,0,0,0,0,0,0,0.0,0.0,0.0


In [105]:
df3['productBought_HOTD'] = df3['productBought_HOTD'].map(lambda x:float(2.80) if x == int(1) else 0)

In [106]:
df3['revenue_HOTD'] = np.multiply(df3['productBought_HOTD'], df3['salesQuantity'])

In [107]:
revenue_HOTD = "The total revenue from the sale of the product {} is ${}". format ("HOTD", format(df3['revenue_HOTD'].sum(), '.2f'))
print(revenue_HOTD)


The total revenue from the sale of the product HOTD is $28.00


Product MV

In [108]:
df3[df3['postComment'].str.contains('MV', na = False, regex = False)]

Unnamed: 0,postComment,postCommentAuthor,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,productPrice,productBought_BREADS,productBought_BUG,...,productBought_NUG,productBought_ONION,productBought_POPC,productBought_RAZOR,productBought_SEAC,productBought_WCKC,revenue_BREADS,revenue_CKCS,revenue_FF,revenue_HOTD
155,Product Mixed veggies - S$2.50 | Keyword: MV,E-Beve,00:55:37,1,8,0,0,$2.50 MV,0.0,0,...,0,0,0,0,0,0,0.0,0.0,0.0,0.0
156,Product Mixed veggies - S$2.50 | Keyword: MV,E-Beve,00:55:46,1,8,0,0,$2.50 MV,0.0,0,...,0,0,0,0,0,0,0.0,0.0,0.0,0.0
157,MV+1,Irene Lee,00:56:15,0,1,0,1,0,0.0,0,...,0,0,0,0,0,0,0.0,0.0,0.0,0.0


In [109]:
df3['productBought_MV'] = df3['productBought_MV'].map(lambda x:float(2.50) if x == int(1) else 0)

In [110]:
df3['revenue_MV'] = np.multiply(df3['productBought_MV'], df3['salesQuantity'])

In [111]:
revenue_MV = "The total revenue from the sale of the product {} is ${}". format ("MV", format(df3['revenue_MV'].sum(), '.2f'))
print(revenue_MV)


The total revenue from the sale of the product MV is $2.50


Product NUG

In [112]:
df3[df3['postComment'].str.contains('NUG', na = False, regex = False)]

Unnamed: 0,postComment,postCommentAuthor,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,productPrice,productBought_BREADS,productBought_BUG,...,productBought_ONION,productBought_POPC,productBought_RAZOR,productBought_SEAC,productBought_WCKC,revenue_BREADS,revenue_CKCS,revenue_FF,revenue_HOTD,revenue_MV
27,Product Freezepack Nuggets - S$8.00 | Keyword: NUG,E-Beve,00:08:39,1,8,0,0,$8.00 NUG,0.0,0,...,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0
126,Product Freezepack Nuggets - S$8.00 | Keyword: NUG,E-Beve,00:45:38,1,8,0,0,$8.00 NUG,0.0,0,...,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0


In [113]:
df3['productBought_NUG'] = df3['productBought_NUG'].map(lambda x:float(8.00) if x == int(1) else 0)

In [114]:
df3['revenue_NUG'] = np.multiply(df3['productBought_NUG'], df3['salesQuantity'])

In [115]:
revenue_NUG = "The total revenue from the sale of the product {} is ${}". format ("NUG", format(df3['revenue_NUG'].sum(), '.2f'))
print(revenue_NUG)


The total revenue from the sale of the product NUG is $72.00


Product ONION

In [116]:
df3[df3['postComment'].str.contains('ONION', na = False, regex = False)]

Unnamed: 0,postComment,postCommentAuthor,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,productPrice,productBought_BREADS,productBought_BUG,...,productBought_POPC,productBought_RAZOR,productBought_SEAC,productBought_WCKC,revenue_BREADS,revenue_CKCS,revenue_FF,revenue_HOTD,revenue_MV,revenue_NUG
142,Product Onion Rings - S$6.50 | Keyword: ONION,E-Beve,00:48:39,1,8,0,0,$6.50 ONION,0.0,0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [117]:
df3['productBought_ONION'] = df3['productBought_ONION'].map(lambda x:float(6.50) if x == int(1) else 0)

In [118]:
df3['revenue_ONION'] = np.multiply(df3['productBought_ONION'], df3['salesQuantity'])

In [119]:
revenue_ONION = "The total revenue from the sale of the product {} is ${}". format ("ONION", format(df3['revenue_ONION'].sum(), '.2f'))
print(revenue_ONION)


The total revenue from the sale of the product ONION is $13.00


Product POPC

In [120]:
df3[df3['postComment'].str.contains('POPC', na = False, regex = False)]

Unnamed: 0,postComment,postCommentAuthor,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,productPrice,productBought_BREADS,productBought_BUG,...,productBought_RAZOR,productBought_SEAC,productBought_WCKC,revenue_BREADS,revenue_CKCS,revenue_FF,revenue_HOTD,revenue_MV,revenue_NUG,revenue_ONION
49,Product Popcorn chicken - S$10.50 | Keyword: POPC,E-Beve,00:18:21,1,8,0,0,$10.50 POPC,0.0,0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
131,Product Popcorn chicken - S$10.50 | Keyword: POPC,E-Beve,00:45:51,1,8,0,0,$10.50 POPC,0.0,0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [121]:
df3['productBought_POPC'] = df3['productBought_POPC'].map(lambda x:float(10.50) if x == int(1) else 0)

In [122]:
df3['revenue_POPC'] = np.multiply(df3['productBought_POPC'], df3['salesQuantity'])

In [123]:
revenue_POPC = "The total revenue from the sale of the product {} is ${}". format ("POPC", format(df3['revenue_POPC'].sum(), '.2f'))
print(revenue_POPC)


The total revenue from the sale of the product POPC is $42.00


Product RAZOR

In [124]:
df3[df3['postComment'].str.contains('RAZOR', na = False, regex = False)]

Unnamed: 0,postComment,postCommentAuthor,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,productPrice,productBought_BREADS,productBought_BUG,...,productBought_SEAC,productBought_WCKC,revenue_BREADS,revenue_CKCS,revenue_FF,revenue_HOTD,revenue_MV,revenue_NUG,revenue_ONION,revenue_POPC
162,Product Razor Clams 1kg - S$6.00 | Keyword: RAZOR,E-Beve,00:58:23,1,9,0,0,$6.00 RAZOR,0.0,0,...,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [125]:
df3['productBought_RAZOR'] = df3['productBought_RAZOR'].map(lambda x:float(6.00) if x == int(1) else 0)

In [126]:
df3['revenue_RAZOR'] = np.multiply(df3['productBought_RAZOR'], df3['salesQuantity'])

In [127]:
revenue_RAZOR = "The total revenue from the sale of the product {} is ${}". format ("RAZOR", format(df3['revenue_RAZOR'].sum(), '.2f'))
print(revenue_RAZOR)


The total revenue from the sale of the product RAZOR is $36.00


Product SEAC

In [128]:
df3[df3['postComment'].str.contains('SEAC', na = False, regex = False)]

Unnamed: 0,postComment,postCommentAuthor,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,productPrice,productBought_BREADS,productBought_BUG,...,productBought_WCKC,revenue_BREADS,revenue_CKCS,revenue_FF,revenue_HOTD,revenue_MV,revenue_NUG,revenue_ONION,revenue_POPC,revenue_RAZOR
75,Product Seaweed chicken - S$12.00 | Keyword: SEAC,E-Beve,00:27:52,1,8,0,0,$12.00 SEAC,0.0,0,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
130,Product Seaweed chicken - S$12.00 | Keyword: SEAC,E-Beve,00:45:51,1,8,0,0,$12.00 SEAC,0.0,0,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [129]:
df3['productBought_SEAC'] = df3['productBought_SEAC'].map(lambda x:float(12.00) if x == int(1) else 0)

In [130]:
df3['revenue_SEAC'] = np.multiply(df3['productBought_SEAC'], df3['salesQuantity'])

In [131]:
revenue_SEAC = "The total revenue from the sale of the product {} is ${}". format ("SEAC", format(df3['revenue_SEAC'].sum(), '.2f'))
print(revenue_SEAC)


The total revenue from the sale of the product SEAC is $72.00


Product WCKC

In [132]:
df3[df3['postComment'].str.contains('WCKC', na = False, regex = False)]

Unnamed: 0,postComment,postCommentAuthor,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,productPrice,productBought_BREADS,productBought_BUG,...,revenue_BREADS,revenue_CKCS,revenue_FF,revenue_HOTD,revenue_MV,revenue_NUG,revenue_ONION,revenue_POPC,revenue_RAZOR,revenue_SEAC
68,Product Whole chicken - S$4.80 | Keyword: WCKC,E-Beve,00:24:38,1,8,0,0,$4.80 WCKC,0.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
134,Product Whole chicken - S$4.80 | Keyword: WCKC,E-Beve,00:45:55,1,8,0,0,$4.80 WCKC,0.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [133]:
df3['productBought_WCKC'] = df3['productBought_WCKC'].map(lambda x:float(4.80) if x == int(1) else 0)

In [134]:
df3['revenue_WCKC'] = np.multiply(df3['productBought_WCKC'], df3['salesQuantity'])

In [135]:
revenue_WCKC = "The total revenue from the sale of the product {} is ${}". format ("WCKC", format(df3['revenue_WCKC'].sum(), '.2f'))
print(revenue_WCKC)


The total revenue from the sale of the product WCKC is $9.60


In [136]:
# iterating the columns
for col in df3.columns:
    print(col)

postComment
postCommentAuthor
postCommentTime_final
isSeller
postCommentLength
lns
salesQuantity
productPrice
productBought_BREADS
productBought_BUG
productBought_CKCS
productBought_FF
productBought_HOTD
productBought_MV
productBought_NUG
productBought_ONION
productBought_POPC
productBought_RAZOR
productBought_SEAC
productBought_WCKC
revenue_BREADS
revenue_CKCS
revenue_FF
revenue_HOTD
revenue_MV
revenue_NUG
revenue_ONION
revenue_POPC
revenue_RAZOR
revenue_SEAC
revenue_WCKC


**Sum of total revenue from the video**

In [137]:
#total revenue from the video
total_revenue = df3.loc[:, 'revenue_BREADS': 'revenue_WCKC'].values.sum()

#round the total revenue to 2 decimals places
total_revenue_rounded = format(total_revenue, '.2f')

print(f"The total revenue for the sale of products for the video is ${total_revenue_rounded}")


The total revenue for the sale of products for the video is $301.60


In [138]:
va['totalRevenue'] = total_revenue_rounded

In [139]:
va

Unnamed: 0,video_for,totalEmojiReaction,views,videoLength,numSellerComments,numComments,lnsQuantity,salesQuantity,numProducts,totalRevenue
0,ebeveadmin/videos/252840063350669,30,1600,4764,29,825,27,46,19,301.6


**New Column for the total revenue at that comment**

In [140]:
#https://stackoverflow.com/questions/42063716/pandas-sum-up-multiple-columns-into-one-column-without-last-column
df3['revenue'] = df3.loc[:, 'revenue_BREADS': 'revenue_WCKC'].sum(axis=1)
df3

Unnamed: 0,postComment,postCommentAuthor,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,productPrice,productBought_BREADS,productBought_BUG,...,revenue_FF,revenue_HOTD,revenue_MV,revenue_NUG,revenue_ONION,revenue_POPC,revenue_RAZOR,revenue_SEAC,revenue_WCKC,revenue
0,LNS,Razip Yusuf,00:00:36,0,1,1,0,0,0.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Baobeiii. Sooo chiooo,Vincent Wan,00:00:40,0,3,0,0,0,0.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Jambu siol lols,Ellie Lee,00:00:44,0,3,1,0,0,0.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Lns done..,Aysha Khamarudin Al Takhi,00:00:44,0,2,1,0,0,0.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Got short plate ?,Aysha Khamarudin Al Takhi,00:00:50,0,4,0,0,0,0.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Let's go dating 😂,Jeffrey Ng,00:00:56,0,4,0,0,0,0.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Lns done baobei,Geraldine Chan,00:03:02,0,3,1,0,0,0.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,LS,Aysha Khamarudin Al Takhi,00:03:07,0,1,1,0,0,0.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Steal ur gf,Ernest Tan,00:03:51,0,3,0,0,0,0.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Lns done,Richard Ling,00:03:51,0,2,1,0,0,0.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [141]:
#https://www.geeksforgeeks.org/how-to-move-a-column-to-first-position-in-pandas-dataframe/
#shift the revenue column to be the 8th column, i.e at position 7
eighth_column = df3.pop('revenue')

# insert column using insert(position,column_name,ninth_column) function
df3.insert(7, 'revenue', eighth_column)
df3

Unnamed: 0,postComment,postCommentAuthor,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,revenue,productPrice,productBought_BREADS,...,revenue_CKCS,revenue_FF,revenue_HOTD,revenue_MV,revenue_NUG,revenue_ONION,revenue_POPC,revenue_RAZOR,revenue_SEAC,revenue_WCKC
0,LNS,Razip Yusuf,00:00:36,0,1,1,0,0.0,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Baobeiii. Sooo chiooo,Vincent Wan,00:00:40,0,3,0,0,0.0,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Jambu siol lols,Ellie Lee,00:00:44,0,3,1,0,0.0,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Lns done..,Aysha Khamarudin Al Takhi,00:00:44,0,2,1,0,0.0,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Got short plate ?,Aysha Khamarudin Al Takhi,00:00:50,0,4,0,0,0.0,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Let's go dating 😂,Jeffrey Ng,00:00:56,0,4,0,0,0.0,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Lns done baobei,Geraldine Chan,00:03:02,0,3,1,0,0.0,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,LS,Aysha Khamarudin Al Takhi,00:03:07,0,1,1,0,0.0,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Steal ur gf,Ernest Tan,00:03:51,0,3,0,0,0.0,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Lns done,Richard Ling,00:03:51,0,2,1,0,0.0,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


After the revenue of the sales have been calculated, all the dummified product & revenue columns and the column 'productPrice' will be dropped as they are no longer required to be checked against to identify the price of the product. Additionally, we would not require the dummified revenue columns as well.

In [142]:
df3 = df3.loc[: ,'postComment':'revenue']
df3

Unnamed: 0,postComment,postCommentAuthor,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,revenue
0,LNS,Razip Yusuf,00:00:36,0,1,1,0,0.0
1,Baobeiii. Sooo chiooo,Vincent Wan,00:00:40,0,3,0,0,0.0
2,Jambu siol lols,Ellie Lee,00:00:44,0,3,1,0,0.0
3,Lns done..,Aysha Khamarudin Al Takhi,00:00:44,0,2,1,0,0.0
4,Got short plate ?,Aysha Khamarudin Al Takhi,00:00:50,0,4,0,0,0.0
5,Let's go dating 😂,Jeffrey Ng,00:00:56,0,4,0,0,0.0
6,Lns done baobei,Geraldine Chan,00:03:02,0,3,1,0,0.0
7,LS,Aysha Khamarudin Al Takhi,00:03:07,0,1,1,0,0.0
8,Steal ur gf,Ernest Tan,00:03:51,0,3,0,0,0.0
9,Lns done,Richard Ling,00:03:51,0,2,1,0,0.0


**New Column to identify the frequency of the seller's comments in the video**

In [143]:
#frequency of seller's comments
va['frequencySeller']= np.divide((va['videoLength'].iloc[0]),va['numSellerComments'])
#seller's comment appears on average of every 164 seconds

In [144]:
va

Unnamed: 0,video_for,totalEmojiReaction,views,videoLength,numSellerComments,numComments,lnsQuantity,salesQuantity,numProducts,totalRevenue,frequencySeller
0,ebeveadmin/videos/252840063350669,30,1600,4764,29,825,27,46,19,301.6,164.275862


**New Column to identify the seller**

In [145]:
df3['seller'] = 'ebeveadmin'

In [146]:
df3.head()

Unnamed: 0,postComment,postCommentAuthor,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,revenue,seller
0,LNS,Razip Yusuf,00:00:36,0,1,1,0,0.0,ebeveadmin
1,Baobeiii. Sooo chiooo,Vincent Wan,00:00:40,0,3,0,0,0.0,ebeveadmin
2,Jambu siol lols,Ellie Lee,00:00:44,0,3,1,0,0.0,ebeveadmin
3,Lns done..,Aysha Khamarudin Al Takhi,00:00:44,0,2,1,0,0.0,ebeveadmin
4,Got short plate ?,Aysha Khamarudin Al Takhi,00:00:50,0,4,0,0,0.0,ebeveadmin


**New Column for the Average Compound Score from Sentiment Analysis on raw & uncleaned comments for video**

In [147]:
# Instantiate Sentiment Intensity Analyzer
sent = SentimentIntensityAnalyzer()

In [148]:
df3['sentiment_score'] = df3['postComment'].apply(sent.polarity_scores)
df3['compound'] = [sent.polarity_scores(x)['compound'] for x in df3['postComment']]
df3.head()

Unnamed: 0,postComment,postCommentAuthor,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,revenue,seller,sentiment_score,compound
0,LNS,Razip Yusuf,00:00:36,0,1,1,0,0.0,ebeveadmin,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.0
1,Baobeiii. Sooo chiooo,Vincent Wan,00:00:40,0,3,0,0,0.0,ebeveadmin,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.0
2,Jambu siol lols,Ellie Lee,00:00:44,0,3,1,0,0.0,ebeveadmin,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.0
3,Lns done..,Aysha Khamarudin Al Takhi,00:00:44,0,2,1,0,0.0,ebeveadmin,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.0
4,Got short plate ?,Aysha Khamarudin Al Takhi,00:00:50,0,4,0,0,0.0,ebeveadmin,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}",0.0


In [149]:
#average compound scores for the video
#df.shape[0] calculates the total number of rows in the dataframe
va['averageCompound']= (df3['compound'].sum())/(df3['compound'].sum())/df.shape[0]
va

Unnamed: 0,video_for,totalEmojiReaction,views,videoLength,numSellerComments,numComments,lnsQuantity,salesQuantity,numProducts,totalRevenue,frequencySeller,averageCompound
0,ebeveadmin/videos/252840063350669,30,1600,4764,29,825,27,46,19,301.6,164.275862,0.005236


In [150]:
#drop the columns with regarding to the sentiment analysis on the raw & uncleaned comments
#as we will perform sentiment analysis at the comment level on processed comments
df3 = df3.loc[: ,'postComment':'seller']
df3.head()

Unnamed: 0,postComment,postCommentAuthor,postCommentTime_final,isSeller,postCommentLength,lns,salesQuantity,revenue,seller
0,LNS,Razip Yusuf,00:00:36,0,1,1,0,0.0,ebeveadmin
1,Baobeiii. Sooo chiooo,Vincent Wan,00:00:40,0,3,0,0,0.0,ebeveadmin
2,Jambu siol lols,Ellie Lee,00:00:44,0,3,1,0,0.0,ebeveadmin
3,Lns done..,Aysha Khamarudin Al Takhi,00:00:44,0,2,1,0,0.0,ebeveadmin
4,Got short plate ?,Aysha Khamarudin Al Takhi,00:00:50,0,4,0,0,0.0,ebeveadmin


### Saving the cleaned dataframes

In [151]:
# export to csv - change the name of the data file for each video
va.to_csv('../../data/cleaned_data/cleaned_va_ebeveadmin_252840063350669.csv', index=False)

In [152]:
#check for nulls
#displaying only the columns with nulls and their sum
df3[df3.columns[df3.isnull().any()]].isnull().sum()

Series([], dtype: float64)

In [153]:
# export to csv - change the name of the data file for each video
df3.to_csv('../../data/cleaned_data/cleaned_ebeveadmin_252840063350669.csv', index=False)