# **Preprocessing**

**Install packages and import file**

In [1]:
import pandas as pd
import plotly.express as px
import plotly.io as pio
from datetime import datetime

In [6]:
url='https://drive.google.com/file/d/13vvFBR9l3sbbbxTyR4P_wzwPHC0TVEPH/view?usp=drive_link'
url='https://drive.google.com/uc?id=' + url.split('/')[-2]
df = pd.read_csv(url)

Unnamed: 0.1,Unnamed: 0,Following,Followers,Likes,Replies,Retweet Count,Quote Count
count,295.0,295.0,295.0,295.0,295.0,295.0,295.0
mean,70.345763,787.39322,164450.1,15.40339,1.59661,3.755932,0.535593
std,63.864513,1592.316947,831675.9,79.131919,6.557178,23.057782,2.053184
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,12.0,55.0,32.5,0.0,0.0,0.0,0.0
50%,52.0,207.0,296.0,0.0,0.0,0.0,0.0
75%,125.5,833.0,1625.0,2.0,1.0,0.0,0.0
max,199.0,14389.0,8974332.0,1008.0,75.0,352.0,22.0


In [8]:
df.shape

(295, 25)

In [9]:
df.columns

Index(['Unnamed: 0', 'Timestamp scraped', 'Collector', 'Topic', 'Keyword',
       'Account Handle', 'Account Name', 'Account Bio', 'Account Bio URLs',
       'Verified', 'Joined', 'Following', 'Followers', 'Location',
       'Tweet Content', 'Tweet Rendered Content', 'Tweet Type', 'Date Posted',
       'URL', 'Content Type', 'Likes', 'Replies', 'Retweet Count',
       'Quote Count', 'Reasoning'],
      dtype='object')

**Handling missing values**

To handle missing values, we first examined our dataset‚Äôs columns to determine which ones have null values.


In [10]:
df.isnull().sum()

Unnamed: 0                  0
Timestamp scraped           0
Collector                   0
Topic                       0
Keyword                     0
Account Handle              0
Account Name                0
Account Bio                84
Account Bio URLs          281
Verified                    0
Joined                      0
Following                   0
Followers                   0
Location                   96
Tweet Content               0
Tweet Rendered Content      0
Tweet Type                267
Date Posted                 0
URL                         0
Content Type                0
Likes                       0
Replies                     0
Retweet Count               0
Quote Count                 0
Reasoning                   0
dtype: int64

In this case, non-optional columns e.g. Account Bio, Account Bio URLs, Location, and Tweet Type have high number of rows with null values but we decided not to impute them as these features are not significant to answering our problem statement.

**Outliers**

Reference: https://careerfoundry.com/en/blog/data-analytics/how-to-find-outliers/#:~:text=Using%20pandas%20describe()%20to,not%20the%20dataset%20has%20outliers.

To handle outliers, we tried to determine which columns have outliers, but we did not impute them because we wanted to investigate further the existence of these outliers.

We used .describe() to get an idea of the mean value for each numerical column and how their maximum and minimum values deviate from the mean.

In [11]:
! pip install -U kaleido

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting kaleido
  Downloading kaleido-0.2.1-py2.py3-none-manylinux1_x86_64.whl (79.9 MB)
[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m79.9/79.9 MB[0m [31m10.1 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: kaleido
Successfully installed kaleido-0.2.1


In [13]:
df.describe()

Unnamed: 0.1,Unnamed: 0,Following,Followers,Likes,Replies,Retweet Count,Quote Count
count,295.0,295.0,295.0,295.0,295.0,295.0,295.0
mean,70.345763,787.39322,164450.1,15.40339,1.59661,3.755932,0.535593
std,63.864513,1592.316947,831675.9,79.131919,6.557178,23.057782,2.053184
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,12.0,55.0,32.5,0.0,0.0,0.0,0.0
50%,52.0,207.0,296.0,0.0,0.0,0.0,0.0
75%,125.5,833.0,1625.0,2.0,1.0,0.0,0.0
max,199.0,14389.0,8974332.0,1008.0,75.0,352.0,22.0


**Find outliers using histogram and box plot**

Most numerical features e.g. Following, Followers, Likes, Replies, Retweet Count and Quote Count do not have a normal distribution. They are mostly skewed to the left and all features have outliers as you can see in the histogram images below:

In [20]:
# Plot the histogram
fig = px.histogram(df, x='Following', title="Distribution of Following Accounts")
fig.update_layout(xaxis_title_text='Following Accounts', yaxis_title_text='Count', title_x=0.5)

# Save the figure as a PNG file
pio.write_image(fig, 'Distribution of Following Accounts.png')

# Display the graph
fig.show()

In [18]:
# Plot the histogram
fig = px.histogram(df, x='Followers', title="Distribution of Followers")
fig.update_layout(xaxis_title_text='Followers', yaxis_title_text='Count', title_x=0.5)

# Save the figure as a PNG file
pio.write_image(fig, 'Distribution of Followers.png')

# Display the graph
fig.show()

In [21]:
# Plot the histogram
fig = px.histogram(df, x='Likes', title="Distribution of Like Count")
fig.update_layout(xaxis_title_text='Likes', yaxis_title_text='Count', title_x=0.5)

# Save the figure as a PNG file
pio.write_image(fig, 'Distribution of Like Count.png')

# Display the graph
fig.show()

In [22]:
# Plot the histogram
fig = px.histogram(df, x='Replies', title="Distribution of Reply Count")
fig.update_layout(xaxis_title_text='Replies', yaxis_title_text='Count', title_x=0.5)

# Save the figure as a PNG file
pio.write_image(fig, 'Distribution of Reply Count.png')

# Display the graph
fig.show()

In [23]:
# Plot the histogram
fig = px.histogram(df, x='Retweet Count', title="Distribution of Retweet Count")
fig.update_layout(xaxis_title_text='Retweets', yaxis_title_text='Count', title_x=0.5)

# Save the figure as a PNG file
pio.write_image(fig, 'Distribution of Retweet Count.png')

# Display the graph
fig.show()

In [25]:
# Plot the histogram
fig = px.histogram(df, x='Quote Count', title="Distribution of Quote Retweets")
fig.update_layout(xaxis_title_text='Quote Retweets', yaxis_title_text='Count', title_x=0.5)

# Save the figure as a PNG file
pio.write_image(fig, 'Distribution of Quote Retweet Count.png')

# Display the graph
fig.show()

In [27]:
# Plot the box plot
fig = px.box(df, y='Following', title="Box Plot of Following Accounts")
fig.update_layout(yaxis_title_text='Count of Following Accounts', title_x=0.5)

# Save the figure as a PNG file
pio.write_image(fig, 'Variation of Following Accounts.png')

# Display the graph
fig.show()


In [28]:
# Plot the box plot
fig = px.box(df, y='Followers', title="Box Plot of Followers")
fig.update_layout(yaxis_title_text='Count of Followers', title_x=0.5)

# Save the figure as a PNG file
pio.write_image(fig, 'Variation of Followers.png')

# Display the graph
fig.show()


In [31]:
# Plot the box plot
fig = px.box(df, y='Likes', title="Box Plot of Like Count")
fig.update_layout(yaxis_title_text='Count of Likes', title_x=0.5)

# Save the figure as a PNG file
pio.write_image(fig, 'Variation of Like Count.png')

# Display the graph
fig.show()


In [32]:
# Plot the box plot
fig = px.box(df, y='Replies', title="Box Plot of Reply Count")
fig.update_layout(yaxis_title_text='Count of Replies', title_x=0.5)

# Save the figure as a PNG file
pio.write_image(fig, 'Variation of Reply Count.png')

# Display the graph
fig.show()


In [33]:
# Plot the box plot
fig = px.box(df, y='Retweet Count', title="Box Plot of Retweet Count")
fig.update_layout(yaxis_title_text='Count of Retweets', title_x=0.5)

# Save the figure as a PNG file
pio.write_image(fig, 'Variation of Retweet Count.png')

# Display the graph
fig.show()


In [34]:
# Plot the box plot
fig = px.box(df, y='Quote Count', title="Box Plot of Quote Retweet Count")
fig.update_layout(yaxis_title_text='Count of Quote Retweets', title_x=0.5)

# Save the figure as a PNG file
pio.write_image(fig, 'Variation of Quote Retweet Count.png')

# Display the graph
fig.show()


**Investigating outliers**

In [35]:
df[df['Following'] == 14389]

Unnamed: 0.1,Unnamed: 0,Timestamp scraped,Collector,Topic,Keyword,Account Handle,Account Name,Account Bio,Account Bio URLs,Verified,...,Tweet Rendered Content,Tweet Type,Date Posted,URL,Content Type,Likes,Replies,Retweet Count,Quote Count,Reasoning
154,154,14/06/2023 03:33:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros maute,WoodstockPro,Woodstock Production,Promoting your video game is effortless with o...,,False,...,RT lugto13: RISA HONTIVEROS MAS NAAWA PA SA MA...,,2017-06-07 17:01:59+00:00,https://twitter.com/WoodstockPro/status/872498...,To be checked,0,0,0,0,Scraped
155,155,14/06/2023 03:33:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros maute,WoodstockPro,Woodstock Production,Promoting your video game is effortless with o...,,False,...,RT redcupidz_pro: RISA HONTIVEROS MAS NAAWA PA...,,2017-06-07 17:01:59+00:00,https://twitter.com/WoodstockPro/status/872498...,To be checked,0,0,0,0,Scraped


In [36]:
df[df['Likes'] == 1008]

Unnamed: 0.1,Unnamed: 0,Timestamp scraped,Collector,Topic,Keyword,Account Handle,Account Name,Account Bio,Account Bio URLs,Verified,...,Tweet Rendered Content,Tweet Type,Date Posted,URL,Content Type,Likes,Replies,Retweet Count,Quote Count,Reasoning
265,20,14/06/2023 04:31:46,Group 49,Risa Hontiveros supports rebels,risa hontiveros enabler,barnabychuck,Barnaby Lo Âê≥ÂÆóÈ¥ª,Al Jazeera correspondent covering the Philippi...,,False,...,Sen. Risa Hontiveros on learning materials tha...,[Photo(previewUrl='https://pbs.twimg.com/media...,2022-10-24 06:07:02+00:00,https://twitter.com/barnabychuck/status/158442...,To be checked,1008,20,352,15,Scraped


**Ensuring formatting consistency**

Format dates

In [38]:
joinedRaw = df['Joined'].tolist()
joinedNew = []

for j in joinedRaw:
  dt = datetime.strptime(j, '%Y-%m-%d %H:%M:%S%z')
  formatted_str = dt.strftime('%m/%y')
  joinedNew.append(formatted_str)

postedRaw = df['Date Posted'].tolist()
postedNew = []
for p in postedRaw:
  dt1 = datetime.strptime(p, '%Y-%m-%d %H:%M:%S%z')
  formatted_str1 = dt1.strftime('%d/%m/%y %H:%M')
  postedNew.append(formatted_str1)


In [39]:
df['Joined'] = df['Joined'].replace(dict(zip(joinedRaw, joinedNew)))

df['Date Posted'] = df['Date Posted'].replace(dict(zip(postedRaw, postedNew)))

In [78]:
df

Unnamed: 0.1,Unnamed: 0,Timestamp scraped,Collector,Topic,Keyword,Account Handle,Account Name,Account Bio,Account Bio URLs,Verified,...,Tweet Rendered Content,Tweet Type,Date Posted,URL,Content Type,Likes,Replies,Retweet Count,Quote Count,Reasoning
0,0,14/06/2023 03:33:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros maute,,marilyn redwood,,,False,...,sSO MGA DRUG ADDICT ANG MAUTE GROUP.PATI KASAM...,,27/05/17 06:20,https://twitter.com/lynredw/status/86835114345...,To be checked,0,0,0,0,Scraped
1,1,14/06/2023 03:33:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros maute,@AlvinLabios1,Jeprox Ako Anong Pake Mo TV,"News, Stories and Tutorials",,False,...,I added a video to a @YouTube playlist youtu.b...,,27/06/17 18:58,https://twitter.com/AlvinLabios1/status/879776...,To be checked,0,0,0,0,Scraped
2,2,14/06/2023 03:33:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros maute,@AlvinLabios1,Jeprox Ako Anong Pake Mo TV,"News, Stories and Tutorials",,False,...,I added a video to a @YouTube playlist youtu.b...,,27/06/17 18:58,https://twitter.com/AlvinLabios1/status/879776...,To be checked,0,0,0,0,Scraped
3,3,14/06/2023 03:33:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros maute,@AlvinLabios1,Jeprox Ako Anong Pake Mo TV,"News, Stories and Tutorials",,False,...,I added a video to a @YouTube playlist youtu.b...,,22/06/17 23:44,https://twitter.com/AlvinLabios1/status/878036...,To be checked,0,0,0,0,Scraped
4,4,14/06/2023 03:33:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros maute,@AlvinLabios1,Jeprox Ako Anong Pake Mo TV,"News, Stories and Tutorials",,False,...,I added a video to a @YouTube playlist youtu.b...,,22/06/17 23:44,https://twitter.com/AlvinLabios1/status/878036...,To be checked,0,0,0,0,Scraped
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
290,19,14/06/2023 04:53:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros NDF CPP,@shiningtwicexo,‚úåüëäüö´,"Happiness, personality and productivity really...",,False,...,These are the following current senators na po...,,01/09/22 01:45,https://twitter.com/shiningtwicexo/status/1565...,To be checked,2,4,1,0,Scraped
291,20,14/06/2023 04:53:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros NDF CPP,@jpii041122,JPII041122,,,False,...,@AdmarVilando @KabataanPL @risahontiveros GAGO...,,29/10/22 08:03,https://twitter.com/jpii041122/status/15862674...,To be checked,0,0,0,0,Scraped
292,21,14/06/2023 04:53:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros NDF CPP,@jpii041122,JPII041122,,,False,...,@_miggypot29 PAANO MAPRO PROTEKTAHAN ANG PAMIL...,,25/10/22 06:43,https://twitter.com/jpii041122/status/15847977...,To be checked,4,0,0,0,Scraped
293,22,14/06/2023 04:53:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros NDF CPP,@NationPinoy,PINOY NATIONüáµüá≠,Walang kinikilingan\nNOTO CPP-NDF-NPA \nNEVER ...,,False,...,Tamimi si Risa Hontiveros akala nya kasing Bob...,[Photo(previewUrl='https://pbs.twimg.com/media...,24/11/22 03:13,https://twitter.com/NationPinoy/status/1595616...,To be checked,0,1,0,0,Scraped


Standardization

In [80]:
from sklearn.preprocessing import StandardScaler

# select columns to standardize
cols_to_standardize = ['Following', 'Followers', 'Likes', 'Replies', 'Retweet Count', 'Quote Count']

# standardize selected columns
scaler = StandardScaler()
df[cols_to_standardize] = scaler.fit_transform(df[cols_to_standardize])

# print updated dataframe
df

Unnamed: 0.1,Unnamed: 0,Timestamp scraped,Collector,Topic,Keyword,Account Handle,Account Name,Account Bio,Account Bio URLs,Verified,...,Tweet Rendered Content,Tweet Type,Date Posted,URL,Content Type,Likes,Replies,Retweet Count,Quote Count,Reasoning
0,0,14/06/2023 03:33:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros maute,,marilyn redwood,,,False,...,sSO MGA DRUG ADDICT ANG MAUTE GROUP.PATI KASAM...,,27/05/17 06:20,https://twitter.com/lynredw/status/86835114345...,To be checked,-0.194985,-0.243904,-0.163169,-0.261303,Scraped
1,1,14/06/2023 03:33:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros maute,@AlvinLabios1,Jeprox Ako Anong Pake Mo TV,"News, Stories and Tutorials",,False,...,I added a video to a @YouTube playlist youtu.b...,,27/06/17 18:58,https://twitter.com/AlvinLabios1/status/879776...,To be checked,-0.194985,-0.243904,-0.163169,-0.261303,Scraped
2,2,14/06/2023 03:33:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros maute,@AlvinLabios1,Jeprox Ako Anong Pake Mo TV,"News, Stories and Tutorials",,False,...,I added a video to a @YouTube playlist youtu.b...,,27/06/17 18:58,https://twitter.com/AlvinLabios1/status/879776...,To be checked,-0.194985,-0.243904,-0.163169,-0.261303,Scraped
3,3,14/06/2023 03:33:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros maute,@AlvinLabios1,Jeprox Ako Anong Pake Mo TV,"News, Stories and Tutorials",,False,...,I added a video to a @YouTube playlist youtu.b...,,22/06/17 23:44,https://twitter.com/AlvinLabios1/status/878036...,To be checked,-0.194985,-0.243904,-0.163169,-0.261303,Scraped
4,4,14/06/2023 03:33:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros maute,@AlvinLabios1,Jeprox Ako Anong Pake Mo TV,"News, Stories and Tutorials",,False,...,I added a video to a @YouTube playlist youtu.b...,,22/06/17 23:44,https://twitter.com/AlvinLabios1/status/878036...,To be checked,-0.194985,-0.243904,-0.163169,-0.261303,Scraped
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
290,19,14/06/2023 04:53:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros NDF CPP,@shiningtwicexo,‚úåüëäüö´,"Happiness, personality and productivity really...",,False,...,These are the following current senators na po...,,01/09/22 01:45,https://twitter.com/shiningtwicexo/status/1565...,To be checked,-0.169668,0.367151,-0.119726,-0.261303,Scraped
291,20,14/06/2023 04:53:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros NDF CPP,@jpii041122,JPII041122,,,False,...,@AdmarVilando @KabataanPL @risahontiveros GAGO...,,29/10/22 08:03,https://twitter.com/jpii041122/status/15862674...,To be checked,-0.194985,-0.243904,-0.163169,-0.261303,Scraped
292,21,14/06/2023 04:53:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros NDF CPP,@jpii041122,JPII041122,,,False,...,@_miggypot29 PAANO MAPRO PROTEKTAHAN ANG PAMIL...,,25/10/22 06:43,https://twitter.com/jpii041122/status/15847977...,To be checked,-0.144351,-0.243904,-0.163169,-0.261303,Scraped
293,22,14/06/2023 04:53:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros NDF CPP,@NationPinoy,PINOY NATIONüáµüá≠,Walang kinikilingan\nNOTO CPP-NDF-NPA \nNEVER ...,,False,...,Tamimi si Risa Hontiveros akala nya kasing Bob...,[Photo(previewUrl='https://pbs.twimg.com/media...,24/11/22 03:13,https://twitter.com/NationPinoy/status/1595616...,To be checked,-0.194985,-0.091140,-0.163169,-0.261303,Scraped


**NLP**

In [82]:
import nltk

# download nltk tokenizer
nltk.download('punkt')

nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Punctuation removal

In [84]:
#library that contains punctuation
import string
string.punctuation
#defining the function to remove punctuation
def remove_punctuation(text):
    punctuationfree="".join([i for i in text if i not in string.punctuation])
    return punctuationfree
#storing the puntuation free text
df['Tweet Content']= df['Tweet Content'].apply(lambda x:remove_punctuation(x))
df['Tweet Content'].tolist()

['sSO MGA DRUG ADDICT ANG MAUTE GROUPPATI KASAMA NILA MGA SABOG KAYA GANITO WALA NA ALAM KUNDI GUMAWA NG KRIMENANO NA RISA HONTIVEROS httpstcoTQTmqRjEGA',
 'I added a video to a YouTube playlist httpstcod5oexXUDIA DUTERTE NAGSALITA NA LABAN KAY RISA HONTIVEROS KINAMPIHAN ANG MAUTE',
 'I added a video to a YouTube playlist httpstcod5oexXD2R2 DUTERTE NAGSALITA NA LABAN KAY RISA HONTIVEROS KINAMPIHAN ANG MAUTE',
 'I added a video to a YouTube playlist httpstcoxnt6ran9KY DUTERTE NAGSALITA NA LABAN KAY RISA HONTIVEROS KINAMPIHAN ANG MAUTE',
 'I added a video to a YouTube playlist httpstcoxnt6raEL9y DUTERTE NAGSALITA NA LABAN KAY RISA HONTIVEROS KINAMPIHAN ANG MAUTE',
 'Can we make Risa Hontiveros as soldiers Human shield to fight Maute terroristsüòíüòèüòíüòètutal nmn wla nmn cia utak at walng silbi na Senadora',
 'etulfo2011 Ang basa nga ng mga Ordinaryong Filipino si Risa Hontiveros ay Protector ng Maute or isa sya sa TerroristEd Langaw layas sa TV5',
 'DUTERTE NAGSALITA NA LABAN KAY R

In [86]:
df['Tweet Content']= df['Tweet Content'].apply(lambda x: x.lower())
df['Tweet Content']= df['Tweet Content'].str.replace(r'https.*', '')
df['Tweet Content'].tolist()


The default value of regex will change from True to False in a future version.



['sso mga drug addict ang maute grouppati kasama nila mga sabog kaya ganito wala na alam kundi gumawa ng krimenano na risa hontiveros ',
 'i added a video to a youtube playlist ',
 'i added a video to a youtube playlist ',
 'i added a video to a youtube playlist ',
 'i added a video to a youtube playlist ',
 'can we make risa hontiveros as soldiers human shield to fight maute terroristsüòíüòèüòíüòètutal nmn wla nmn cia utak at walng silbi na senadora',
 'etulfo2011 ang basa nga ng mga ordinaryong filipino si risa hontiveros ay protector ng maute or isa sya sa terroristed langaw layas sa tv5',
 'duterte nagsalita na laban kay risa hontiveros kinampihan ang maute group more info ',
 'galit si sara duterte sa kabugokan ni risa hontiveros spokesperson ng maute group please watch and share this ',
 'duterte nagsalita na laban kay risa hontiveros sa pagkampi nito sa maute group please watch and share this video ',
 'para sa mga nagsabi na sinusuportahan ni risa hontiveros ang ginagawa ng

In [87]:
df['Tweet Content']= df['Tweet Content'].str.replace('\n', ' ')
df['Tweet Content'].tolist()

['sso mga drug addict ang maute grouppati kasama nila mga sabog kaya ganito wala na alam kundi gumawa ng krimenano na risa hontiveros ',
 'i added a video to a youtube playlist ',
 'i added a video to a youtube playlist ',
 'i added a video to a youtube playlist ',
 'i added a video to a youtube playlist ',
 'can we make risa hontiveros as soldiers human shield to fight maute terroristsüòíüòèüòíüòètutal nmn wla nmn cia utak at walng silbi na senadora',
 'etulfo2011 ang basa nga ng mga ordinaryong filipino si risa hontiveros ay protector ng maute or isa sya sa terroristed langaw layas sa tv5',
 'duterte nagsalita na laban kay risa hontiveros kinampihan ang maute group more info ',
 'galit si sara duterte sa kabugokan ni risa hontiveros spokesperson ng maute group please watch and share this ',
 'duterte nagsalita na laban kay risa hontiveros sa pagkampi nito sa maute group please watch and share this video ',
 'para sa mga nagsabi na sinusuportahan ni risa hontiveros ang ginagawa ng

Tokenization

In [89]:
#defining function for tokenization
import re
def tokenization(text):
    tokens = re.split('W+',text)
    return tokens
#applying function to the column
df['Tweet Content'].apply(lambda x: tokenization(x))

0      [sso mga drug addict ang maute grouppati kasam...
1               [i added a video to a youtube playlist ]
2               [i added a video to a youtube playlist ]
3               [i added a video to a youtube playlist ]
4               [i added a video to a youtube playlist ]
                             ...                        
290    [these are the following current senators na p...
291    [admarvilando kabataanpl risahontiveros gagong...
292    [miggypot29 paano mapro protektahan ang pamily...
293    [tamimi si risa hontiveros akala nya kasing bo...
294    [this is so true  youll really wonder bakit ga...
Name: Tweet Content, Length: 295, dtype: object

In [90]:
# create example dataframe
# tokenize text column
df['Tokenized Tweet'] = df['Tweet Content'].apply(lambda x: nltk.word_tokenize(x))

# display result
df.filter(items=['Tweet Content', 'Tokenized Tweet'])


Unnamed: 0,Tweet Content,Tokenized Tweet
0,sso mga drug addict ang maute grouppati kasama...,"[sso, mga, drug, addict, ang, maute, grouppati..."
1,i added a video to a youtube playlist,"[i, added, a, video, to, a, youtube, playlist]"
2,i added a video to a youtube playlist,"[i, added, a, video, to, a, youtube, playlist]"
3,i added a video to a youtube playlist,"[i, added, a, video, to, a, youtube, playlist]"
4,i added a video to a youtube playlist,"[i, added, a, video, to, a, youtube, playlist]"
...,...,...
290,these are the following current senators na po...,"[these, are, the, following, current, senators..."
291,admarvilando kabataanpl risahontiveros gagong ...,"[admarvilando, kabataanpl, risahontiveros, gag..."
292,miggypot29 paano mapro protektahan ang pamilya...,"[miggypot29, paano, mapro, protektahan, ang, p..."
293,tamimi si risa hontiveros akala nya kasing bob...,"[tamimi, si, risa, hontiveros, akala, nya, kas..."


Remove stopwords

In [91]:
! pip install stopwordsiso

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting stopwordsiso
  Downloading stopwordsiso-0.6.1-py3-none-any.whl (73 kB)
[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m73.5/73.5 kB[0m [31m3.6 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: stopwordsiso
Successfully installed stopwordsiso-0.6.1


In [92]:
from stopwordsiso import stopwords as fil
from nltk.corpus import stopwords as eng

stop_words = set(eng.words("english"))

def remove_stopwords(text):
    # Tokenize the text
    tokens = nltk.word_tokenize(text.lower())

    # Remove punctuation
    tokens = [token for token in tokens if token not in string.punctuation]

    # Remove Filipino stop words
    tokens = [token for token in tokens if token not in fil('tl')]

    # Remove English stopwords
    tokens = [token for token in tokens if token not in stop_words]

    # Join the tokens back into a string
    return ' '.join(tokens)

# Apply the function to the 'text' column
df['Clean Tweets'] = df['Tweet Content'].apply(remove_stopwords)

df['Clean Tweets'].tolist()

['sso drug addict maute grouppati kasama sabog ganito wala alam kundi krimenano risa hontiveros',
 'added video youtube playlist',
 'added video youtube playlist',
 'added video youtube playlist',
 'added video youtube playlist',
 'make risa hontiveros soldiers human shield fight maute terroristsüòíüòèüòíüòètutal nmn wla nmn cia utak walng silbi senadora',
 'etulfo2011 basa nga ordinaryong filipino si risa hontiveros protector maute sya terroristed langaw layas tv5',
 'duterte nagsalita kay risa hontiveros kinampihan maute group info',
 'galit si sara duterte kabugokan risa hontiveros spokesperson maute group please watch share',
 'duterte nagsalita kay risa hontiveros pagkampi maute group please watch share video',
 'nagsabi sinusuportahan risa hontiveros maute marawi magbasa nang malinawan',
 'netijen pilipin lg rame soal risa hontiveros senator filipin yang dianggap tidak mendukung langkah duterte memerangi al maute di marawi',
 'liked youtube video',
 'duterte nagsalita kay ris

In [93]:
#importing the Stemming function from nltk library
from nltk.stem.porter import PorterStemmer
#defining the object for stemming
porter_stemmer = PorterStemmer()
#defining a function for stemming
def stemming(text):
  stem_text = [porter_stemmer.stem(word) for word in text]
  return stem_text
df['Stemmed Tweets']=df['Clean Tweets'].apply(lambda x: stemming(x))

import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [94]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer

# initialize lemmatizer and stemmer
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# define a function to apply lemmatization and stemming to each row
def lemmatize_and_stem(text):
    tokens = word_tokenize(text) # tokenize the text
    lemmatized = [lemmatizer.lemmatize(token) for token in tokens] # lemmatize the tokens
    #stemmed = [stemmer.stem(token) for token in lemmatized] # stem the lemmatized tokens
    return " ".join(lemmatized) # join the stemmed tokens back into a string

# apply the lemmatization and stemming function to the "text" column
df["Lemmatized Tweets"] = df["Clean Tweets"].apply(lemmatize_and_stem)

df

Unnamed: 0.1,Unnamed: 0,Timestamp scraped,Collector,Topic,Keyword,Account Handle,Account Name,Account Bio,Account Bio URLs,Verified,...,Content Type,Likes,Replies,Retweet Count,Quote Count,Reasoning,Tokenized Tweet,Clean Tweets,Stemmed Tweets,Lemmatized Tweets
0,0,14/06/2023 03:33:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros maute,,marilyn redwood,,,False,...,To be checked,-0.194985,-0.243904,-0.163169,-0.261303,Scraped,"[sso, mga, drug, addict, ang, maute, grouppati...",sso drug addict maute grouppati kasama sabog g...,"[s, s, o, , d, r, u, g, , a, d, d, i, c, t, ...",sso drug addict maute grouppati kasama sabog g...
1,1,14/06/2023 03:33:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros maute,@AlvinLabios1,Jeprox Ako Anong Pake Mo TV,"News, Stories and Tutorials",,False,...,To be checked,-0.194985,-0.243904,-0.163169,-0.261303,Scraped,"[i, added, a, video, to, a, youtube, playlist]",added video youtube playlist,"[a, d, d, e, d, , v, i, d, e, o, , y, o, u, ...",added video youtube playlist
2,2,14/06/2023 03:33:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros maute,@AlvinLabios1,Jeprox Ako Anong Pake Mo TV,"News, Stories and Tutorials",,False,...,To be checked,-0.194985,-0.243904,-0.163169,-0.261303,Scraped,"[i, added, a, video, to, a, youtube, playlist]",added video youtube playlist,"[a, d, d, e, d, , v, i, d, e, o, , y, o, u, ...",added video youtube playlist
3,3,14/06/2023 03:33:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros maute,@AlvinLabios1,Jeprox Ako Anong Pake Mo TV,"News, Stories and Tutorials",,False,...,To be checked,-0.194985,-0.243904,-0.163169,-0.261303,Scraped,"[i, added, a, video, to, a, youtube, playlist]",added video youtube playlist,"[a, d, d, e, d, , v, i, d, e, o, , y, o, u, ...",added video youtube playlist
4,4,14/06/2023 03:33:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros maute,@AlvinLabios1,Jeprox Ako Anong Pake Mo TV,"News, Stories and Tutorials",,False,...,To be checked,-0.194985,-0.243904,-0.163169,-0.261303,Scraped,"[i, added, a, video, to, a, youtube, playlist]",added video youtube playlist,"[a, d, d, e, d, , v, i, d, e, o, , y, o, u, ...",added video youtube playlist
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
290,19,14/06/2023 04:53:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros NDF CPP,@shiningtwicexo,‚úåüëäüö´,"Happiness, personality and productivity really...",,False,...,To be checked,-0.169668,0.367151,-0.119726,-0.261303,Scraped,"[these, are, the, following, current, senators...",following current senators posibleng panigurad...,"[f, o, l, l, o, w, i, n, g, , c, u, r, r, e, ...",following current senator posibleng panigurado...
291,20,14/06/2023 04:53:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros NDF CPP,@jpii041122,JPII041122,,,False,...,To be checked,-0.194985,-0.243904,-0.163169,-0.261303,Scraped,"[admarvilando, kabataanpl, risahontiveros, gag...",admarvilando kabataanpl risahontiveros gagong ...,"[a, d, m, a, r, v, i, l, a, n, d, o, , k, a, ...",admarvilando kabataanpl risahontiveros gagong ...
292,21,14/06/2023 04:53:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros NDF CPP,@jpii041122,JPII041122,,,False,...,To be checked,-0.144351,-0.243904,-0.163169,-0.261303,Scraped,"[miggypot29, paano, mapro, protektahan, ang, p...",miggypot29 mapro protektahan pamilyang lapid e...,"[m, i, g, g, y, p, o, t, 2, 9, , m, a, p, r, ...",miggypot29 mapro protektahan pamilyang lapid e...
293,22,14/06/2023 04:53:42,Group 49,Risa Hontiveros supports rebels,risa hontiveros NDF CPP,@NationPinoy,PINOY NATIONüáµüá≠,Walang kinikilingan\nNOTO CPP-NDF-NPA \nNEVER ...,,False,...,To be checked,-0.194985,-0.091140,-0.163169,-0.261303,Scraped,"[tamimi, si, risa, hontiveros, akala, nya, kas...",tamimi si risa hontiveros akala nya kasing bob...,"[t, a, m, i, m, i, , s, i, , r, i, s, a, , ...",tamimi si risa hontiveros akala nya kasing bob...


In [95]:
df.to_csv('NLP_related.csv')