# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Capstone Project - Malay Language Social Media Sentiment Analysis

## Part 1 - Data Acquisition and Cleaning

## 1. Introduction and Problem Statement

Sentiment analysis is a subfield of Natural Language Processing (NLP) that focuses on detecting the sentiment expressed in text data. It is one of the most widely used applications of NLP, and is important in a variety of areas including social media content, customer reviews, speech conversations, and news.

Sentiment analysis allows organisations and individuals to understand how others view them and their actions. For those who want to track public opinion about themselves (i.e. reputation management), sentiment analysis is vital for filtering through enormous amounts of unstructured information.

While sentiment analysis has been researched quite comprehensively for major languages such as English and Chinese, work on less widely studied languages such as Malay remains scarce. As of writing, there is only one well-known natural language toolkit library for Bahasa Melayu, which is [Malaya](https://malaya.readthedocs.io/en/stable/). The library offers a number of modules, including one for sentiment analysis.

However, the sentiment analysis model used in Malaya appears to be lexicon-based. While lexicon-based models are common (e.g. VADER) and interpretable, they are limited when it comes to contextual understanding and vocabulary gaps. To address the latter, the list of words in the lexicon must be varied and robust enough to detect the right sentiment across many types of sentences.

Looking into the [sentiment-specific lexicon](https://github.com/huseinzol05/malaysian-dataset/blob/master/lexicon/sentiment.json), however, it appears to be largely limited to formal Malay words, and is unlikely to detect sentiment in texts that include slang or short forms, which are very common on Malay social media.
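
To make the vocabulary-gap problem concrete, here is a minimal sketch of a lexicon-based scorer. The lexicon `TOY_LEXICON` and the function `lexicon_score` are hypothetical and written for illustration only; they are not taken from Malaya or its lexicon file.

```python
# Toy lexicon for illustration only; it is NOT the Malaya lexicon.
TOY_LEXICON = {
    "terbaik": 1,   # "the best"
    "bagus": 1,     # "good"
    "teruk": -1,    # "terrible"
    "buruk": -1,    # "bad"
}


def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the polarity of every token found in the lexicon."""
    tokens = text.lower().split()
    # Strip surrounding punctuation so "terbaik!" still matches "terbaik"
    tokens = [t.strip(".,!?") for t in tokens]
    return sum(lexicon.get(t, 0) for t in tokens)


formal = "Terbaik la tindakan sist"   # formal spelling
slang = "Terbaek KC"                  # slang spelling of the same word

print(lexicon_score(formal))  # 1: "terbaik" is in the lexicon
print(lexicon_score(slang))   # 0: "terbaek" is missing, so the comment looks neutral
```

Both comments express praise, but only the formal spelling is scored as positive. A lexicon would need an entry for every slang variant to close this gap.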

Based on this, the problem statement for my capstone is:

**<center>Can we create a best-in-class classification model to identify sentiments in Malay social media comments?</center>**

The goal is to create a sentiment analysis model for Bahasa Melayu that is reasonably accurate for social media comments. In this project, I will compare the accuracy of the Malaya model in detecting sentiments against that of ChatGPT 3.5, an existing Large Language Model that is trained on Bahasa Melayu (albeit less so than on English). Whichever model performs better will be used as a baseline to train our model on.

The primary audience will be individuals and organisations that wish to understand views of Malay-speakers on social media. These individuals and organisations will be primarily based in the Malay archipelago, which includes countries such as Malaysia, Indonesia, Singapore and Brunei, and has a combined population of close to 400 million people.

## 2. Data Acquisition - YouTube Comments

In this project, we will be working with one dataset - YouTube comments. The YouTube comments will be pulled from various videos which have comments primarily in Bahasa Melayu, and our goal will be to get sufficient comments with the various sentiments: positive, negative, and neutral.

In this notebook, we will import the data before commencing data cleaning.

<mark>**Note:**</mark> Before importing libraries, I'd recommend creating a new environment due to certain version conflicts with pip installing and importing the Malaya model.

To do so, run the following commands in the terminal in the same folder as the requirements.txt (assuming we are using mamba as an environment manager).

1.
```bash
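# Install python and pip into the new environment so that step 3 uses this environment's pip
mamba create --name malay_sentiment_project python pip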
```

2.
```bash
mamba activate malay_sentiment_project
```

3.
```bash
pip install -r requirements.txt
```

In [1]:
#importing libraries
import googleapiclient.discovery
import pandas as pd
from bs4 import BeautifulSoup
import openai

### 2.1 Comments Pull - Round 1

We will pull the comments in two rounds, on two separate days, so as not to reach our daily API limit.

In [2]:
# YouTube API Credentials
api_service_name = "youtube"
api_version = "v3"
DEVELOPER_KEY = "insert developer key here"

In [3]:
# Using YouTube API
def get_video_comments(video_id, max_results=100, max_comments=10000):
    youtube = googleapiclient.discovery.build(api_service_name, api_version, developerKey=DEVELOPER_KEY)
    
    comments = []
    next_page_token = None
    total_comments_fetched = 0
    
    while total_comments_fetched < max_comments:
        remaining_comments_to_fetch = max_comments - total_comments_fetched
        comments_to_fetch = min(max_results, remaining_comments_to_fetch)
        
        request = youtube.commentThreads().list(
            part="snippet",
            videoId=video_id,
            maxResults=comments_to_fetch,
            pageToken=next_page_token if next_page_token else None
        )
        
        response = request.execute()
        
        # Get the video information (title and upload date) for the first batch of comments
        if total_comments_fetched == 0:
            video_response = youtube.videos().list(
                part="snippet",
                id=video_id
            ).execute()
            
            video_title = video_response['items'][0]['snippet']['title']
            video_published_at = video_response['items'][0]['snippet']['publishedAt']
        
        for item in response['items']:
            comment = item['snippet']['topLevelComment']['snippet']
            comments.append([
                comment['authorDisplayName'],
                comment['publishedAt'],
                comment['likeCount'],
                comment['textDisplay'],
                video_title,  # Adding video title for each comment
                video_published_at  # Adding video upload date for each comment
            ])
        
        # Count the comments actually returned, which can be fewer than requested
        total_comments_fetched += len(response['items'])
        next_page_token = response.get('nextPageToken')
        if not next_page_token:
            break
    
    return comments

# List of video IDs for which you want to fetch comments
video_ids = ["aUZoLk01OPo", "mZN0xnK4hBU", "TJ2INthzEis", "2ZpcdcjJ2hs", "JgyQ0v-7DS4", "GlO2sSuezXw"]

# Initialize an empty list to store all comments from multiple videos
all_comments = []

for video_id in video_ids:
    comments_for_video = get_video_comments(video_id)
    all_comments.extend(comments_for_video)

# Create a DataFrame from all the comments
df = pd.DataFrame(all_comments, columns=['author', 'published_at', 'like_count', 'text', 'video_title', 'video_published_at'])

# Display the first 10 rows of the DataFrame
df.head(10)

Unnamed: 0,author,published_at,like_count,text,video_title,video_published_at
0,Hafiz Khan,2023-09-06T07:52:06Z,0,Org kelantan ke apa ji,Respek kakak ni bertarung nyawa dan harta,2019-09-14T04:04:25Z
1,Mia Adriana,2023-08-18T05:07:29Z,0,Kalau lama pecah cermin tu n pichang la org dl...,Respek kakak ni bertarung nyawa dan harta,2019-09-14T04:04:25Z
2,abdul mutalib,2023-08-10T11:29:37Z,0,Patut kak tu diberi lesen mcine gun tembak mat...,Respek kakak ni bertarung nyawa dan harta,2019-09-14T04:04:25Z
3,abdul mutalib,2023-08-10T11:28:47Z,0,Mau langar smpai patah kaki ja... Tada kj samu...,Respek kakak ni bertarung nyawa dan harta,2019-09-14T04:04:25Z
4,Syam Ryin,2023-08-08T12:50:10Z,0,Mangsa Dalam kereta tu macam mna?diam je x beg...,Respek kakak ni bertarung nyawa dan harta,2019-09-14T04:04:25Z
5,NT H,2023-08-06T12:00:40Z,0,langgar je kasi mati terus. bugima btl,Respek kakak ni bertarung nyawa dan harta,2019-09-14T04:04:25Z
6,Farizan Rili,2023-07-22T07:13:55Z,0,Langar mati pun tak apa,Respek kakak ni bertarung nyawa dan harta,2019-09-14T04:04:25Z
7,Mohamad Anwar,2023-06-16T11:10:39Z,0,Terbaik la tindakan sist.....,Respek kakak ni bertarung nyawa dan harta,2019-09-14T04:04:25Z
8,Ezaaq Eida,2023-06-16T09:44:11Z,0,Kalu aku memang non stop doh tu langgar … biar...,Respek kakak ni bertarung nyawa dan harta,2019-09-14T04:04:25Z
9,dwi laksono,2023-06-10T14:35:51Z,0,Knp dia stop ehh..klu aku dh tekan hon biar se...,Respek kakak ni bertarung nyawa dan harta,2019-09-14T04:04:25Z


In [4]:
df.shape

(15211, 6)

There are over 15,000 comments.
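
The pagination pattern used in `get_video_comments` can be tried out without an API key. The sketch below is an assumption-laden stand-in: `fetch_all` and `make_fake_api` are hypothetical helpers, and the fake page source only imitates the "items plus next-page token" shape of a paged response.

```python
def fetch_all(fetch_page, page_size=100, max_items=10000):
    """Collect items page by page until max_items is reached or pages run out.

    fetch_page(page_size, token) is a hypothetical callable returning
    (items, next_token); it stands in for a paged API request.
    """
    items, token = [], None
    while len(items) < max_items:
        want = min(page_size, max_items - len(items))
        page, token = fetch_page(want, token)
        items.extend(page)   # progress is measured by what was actually returned
        if not token:        # no further pages
            break
    return items


def make_fake_api(total):
    """Build a fake paged source holding `total` integer 'comments'."""
    data = list(range(total))

    def fetch_page(size, token):
        start = token or 0
        page = data[start:start + size]
        nxt = start + size if start + size < total else None
        return page, nxt

    return fetch_page


print(len(fetch_all(make_fake_api(250))))                 # 250
print(len(fetch_all(make_fake_api(250), max_items=120)))  # 120
```

Tracking progress with the number of items received, rather than the number requested, keeps the cap accurate when a page comes back shorter than asked for.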

In [5]:
# Saving to csv
df.to_csv('../data/01_raw_malay_comments_1.csv', index=False)

### 2.2 Comments Pull - Round 2

In [30]:
# Using YouTube API
def get_video_comments(video_id, max_results=100, max_comments=10000):
    youtube = googleapiclient.discovery.build(api_service_name, api_version, developerKey=DEVELOPER_KEY)
    
    comments = []
    next_page_token = None
    total_comments_fetched = 0
    
    while total_comments_fetched < max_comments:
        remaining_comments_to_fetch = max_comments - total_comments_fetched
        comments_to_fetch = min(max_results, remaining_comments_to_fetch)
        
        request = youtube.commentThreads().list(
            part="snippet",
            videoId=video_id,
            maxResults=comments_to_fetch,
            pageToken=next_page_token if next_page_token else None
        )
        
        response = request.execute()
        
        # Get the video information (title and upload date) for the first batch of comments
        if total_comments_fetched == 0:
            video_response = youtube.videos().list(
                part="snippet",
                id=video_id
            ).execute()
            
            video_title = video_response['items'][0]['snippet']['title']
            video_published_at = video_response['items'][0]['snippet']['publishedAt']
        
        for item in response['items']:
            comment = item['snippet']['topLevelComment']['snippet']
            comments.append([
                comment['authorDisplayName'],
                comment['publishedAt'],
                comment['likeCount'],
                comment['textDisplay'],
                video_title,  # Adding video title for each comment
                video_published_at  # Adding video upload date for each comment
            ])
        
        # Count the comments actually returned, which can be fewer than requested
        total_comments_fetched += len(response['items'])
        next_page_token = response.get('nextPageToken')
        if not next_page_token:
            break
    
    return comments

# List of video IDs for which you want to fetch comments
video_ids = ["zMKC2nk0KoU", "W8cnupHVrXE", "Xtx5L4NEemg", "8VMFEuEQvd4"]

# Initialize an empty list to store all comments from multiple videos
all_comments = []

for video_id in video_ids:
    comments_for_video = get_video_comments(video_id)
    all_comments.extend(comments_for_video)

# Create a DataFrame from all the comments
df = pd.DataFrame(all_comments, columns=['author', 'published_at', 'like_count', 'text', 'video_title', 'video_published_at'])

# Display the first 10 rows of the DataFrame
df.head(10)

Unnamed: 0,author,published_at,like_count,text,video_title,video_published_at
0,David Arnajirson,2023-08-18T07:56:39Z,0,"Setu saya nak bagi tahu, Hal kecil pun jadi b...",“Dia nak selamatkan muka dia” - Diana Danielle...,2023-03-28T05:21:54Z
1,Zain Isa,2023-04-21T07:20:34Z,0,@SyamsulYusuf. 66 oi&amp;gas!,“Dia nak selamatkan muka dia” - Diana Danielle...,2023-03-28T05:21:54Z
2,Oudsatinmood A,2023-04-16T16:36:24Z,0,🤗,“Dia nak selamatkan muka dia” - Diana Danielle...,2023-03-28T05:21:54Z
3,lanun si kucing,2023-04-09T01:44:40Z,0,Pasal cincin ke biar betul mat. Ni mcm kes be...,“Dia nak selamatkan muka dia” - Diana Danielle...,2023-03-28T05:21:54Z
4,Yg Raja Putra,2023-04-07T05:34:00Z,0,"Biasalah artist mcm Diana, bila suami dak susa...",“Dia nak selamatkan muka dia” - Diana Danielle...,2023-03-28T05:21:54Z
5,Ahmad Yusof,2023-04-05T12:25:59Z,0,Kalau seorang isteri ditanya tentang org ketig...,“Dia nak selamatkan muka dia” - Diana Danielle...,2023-03-28T05:21:54Z
6,Pencinta Ustaz Wadi Annuar,2023-04-05T03:04:46Z,0,Apa yang terbaik pasti akan terjadi seadanya 1...,“Dia nak selamatkan muka dia” - Diana Danielle...,2023-03-28T05:21:54Z
7,oldmansingin love,2023-04-03T05:50:29Z,0,Selamatlh mnjadi bini setan,“Dia nak selamatkan muka dia” - Diana Danielle...,2023-03-28T05:21:54Z
8,FA betta chanel,2023-04-02T18:41:16Z,0,Wartawan pon batu api 🔥... Dalam rumah tangga...,“Dia nak selamatkan muka dia” - Diana Danielle...,2023-03-28T05:21:54Z
9,Laili Mohd,2023-04-02T15:30:33Z,0,No3. To late the hero.,“Dia nak selamatkan muka dia” - Diana Danielle...,2023-03-28T05:21:54Z


In [33]:
df.shape

(10377, 6)

There are over 10,000 comments.

In [32]:
# Saving to csv
df.to_csv('../data/01_raw_malay_comments_2.csv', index=False)

## 3. Data Cleaning

We will start off with data cleaning - focusing on null values, duplicated values, data types and HTML-encoded entities. But first, we will concatenate both datasets.

### 3.1 Concatenating DataFrames

In [3]:
df1 = pd.read_csv('../data/01_raw_malay_comments_1.csv')

In [4]:
df2 = pd.read_csv('../data/01_raw_malay_comments_2.csv')

In [5]:
df1.head()

Unnamed: 0,author,published_at,like_count,text,video_title,video_published_at
0,Hafiz Khan,2023-09-06T07:52:06Z,0,Org kelantan ke apa ji,Respek kakak ni bertarung nyawa dan harta,2019-09-14T04:04:25Z
1,Mia Adriana,2023-08-18T05:07:29Z,0,Kalau lama pecah cermin tu n pichang la org dl...,Respek kakak ni bertarung nyawa dan harta,2019-09-14T04:04:25Z
2,abdul mutalib,2023-08-10T11:29:37Z,0,Patut kak tu diberi lesen mcine gun tembak mat...,Respek kakak ni bertarung nyawa dan harta,2019-09-14T04:04:25Z
3,abdul mutalib,2023-08-10T11:28:47Z,0,Mau langar smpai patah kaki ja... Tada kj samu...,Respek kakak ni bertarung nyawa dan harta,2019-09-14T04:04:25Z
4,Syam Ryin,2023-08-08T12:50:10Z,0,Mangsa Dalam kereta tu macam mna?diam je x beg...,Respek kakak ni bertarung nyawa dan harta,2019-09-14T04:04:25Z


In [6]:
df2.head()

Unnamed: 0,author,published_at,like_count,text,video_title,video_published_at
0,David Arnajirson,2023-08-18T07:56:39Z,0,"Setu saya nak bagi tahu, Hal kecil pun jadi b...",“Dia nak selamatkan muka dia” - Diana Danielle...,2023-03-28T05:21:54Z
1,Zain Isa,2023-04-21T07:20:34Z,0,@SyamsulYusuf. 66 oi&amp;gas!,“Dia nak selamatkan muka dia” - Diana Danielle...,2023-03-28T05:21:54Z
2,Oudsatinmood A,2023-04-16T16:36:24Z,0,🤗,“Dia nak selamatkan muka dia” - Diana Danielle...,2023-03-28T05:21:54Z
3,lanun si kucing,2023-04-09T01:44:40Z,0,Pasal cincin ke biar betul mat. Ni mcm kes be...,“Dia nak selamatkan muka dia” - Diana Danielle...,2023-03-28T05:21:54Z
4,Yg Raja Putra,2023-04-07T05:34:00Z,0,"Biasalah artist mcm Diana, bila suami dak susa...",“Dia nak selamatkan muka dia” - Diana Danielle...,2023-03-28T05:21:54Z


In [7]:
df = pd.concat([df1, df2], axis=0)

In [12]:
df.tail()

Unnamed: 0,author,published_at,like_count,text,video_title,video_published_at
10380,Muhammad Shafizan,2023-04-06T04:01:36Z,2,nasib baik button dislike tkboleh tngk berapa ...,Dato Sri Aliff Syukri - Chu Ku Chuk Raya [Offi...,2023-04-06T04:00:10Z
10381,Mr_Adam 92,2023-04-06T04:01:19Z,0,kepala cengkirit..tu je bleh cakap kbai,Dato Sri Aliff Syukri - Chu Ku Chuk Raya [Offi...,2023-04-06T04:00:10Z
10382,Sanjut x Gaming,2023-04-06T04:00:57Z,0,ayuh kasi dislike,Dato Sri Aliff Syukri - Chu Ku Chuk Raya [Offi...,2023-04-06T04:00:10Z
10383,nurhazirah lim,2023-04-06T04:00:46Z,1,😂😂😂😂,Dato Sri Aliff Syukri - Chu Ku Chuk Raya [Offi...,2023-04-06T04:00:10Z
10384,siti nurzafirah,2023-04-06T04:00:37Z,1,No syatuuuuu wak.😂😂😂,Dato Sri Aliff Syukri - Chu Ku Chuk Raya [Offi...,2023-04-06T04:00:10Z


In [13]:
# Saving to csv
df.to_csv('../data/01_raw_malay_comments.csv', index = False)
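
One detail worth knowing about `pd.concat`: by default each frame keeps its own index labels, so labels can repeat in the combined frame. The small sketch below uses two made-up frames (not the real comment files) to show the difference that `ignore_index=True` makes.

```python
import pandas as pd

# Two tiny stand-in frames (hypothetical data, not the real comment files)
a = pd.DataFrame({"text": ["satu", "dua"]})
b = pd.DataFrame({"text": ["tiga", "empat", "lima"]})

stacked = pd.concat([a, b], axis=0)                         # keeps each frame's own labels
renumbered = pd.concat([a, b], axis=0, ignore_index=True)   # fresh 0..n-1 labels

print(stacked.index.tolist())     # [0, 1, 0, 1, 2]
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
```

In this notebook the repeated labels do no harm, because the frame is written with `index=False` and read back, which yields a fresh index.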

### 3.2 Handling Null Values

In [14]:
df = pd.read_csv('../data/01_raw_malay_comments.csv')

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25613 entries, 0 to 25612
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   author              25612 non-null  object
 1   published_at        25592 non-null  object
 2   like_count          25592 non-null  object
 3   text                25581 non-null  object
 4   video_title         25584 non-null  object
 5   video_published_at  25584 non-null  object
dtypes: object(6)
memory usage: 1.2+ MB


In [19]:
# Looking for columns with null values
df.isnull().sum()

author                 1
published_at          21
like_count            21
text                  32
video_title           29
video_published_at    29
dtype: int64

A small number of rows contain null values. Dropping every row with at least one null removes 37 of the 25,613 rows (about 0.1%), so we will drop them.

In [20]:
# Dropping nulls
df.dropna(inplace = True)

In [21]:
# Double checking
df.isnull().sum()

author                0
published_at          0
like_count            0
text                  0
video_title           0
video_published_at    0
dtype: int64

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 25576 entries, 0 to 25612
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   author              25576 non-null  object
 1   published_at        25576 non-null  object
 2   like_count          25576 non-null  object
 3   text                25576 non-null  object
 4   video_title         25576 non-null  object
 5   video_published_at  25576 non-null  object
dtypes: object(6)
memory usage: 1.4+ MB
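
The decision to drop nulls rests on how many rows are affected, which can be measured directly. This is a minimal sketch on a made-up frame; `demo` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the comments table, with two incomplete rows
demo = pd.DataFrame({
    "author": ["a", "b", None, "d"],
    "text": ["bagus", None, "teruk", "ok"],
})

rows_with_null = demo.isnull().any(axis=1)   # True where any column is missing
share = rows_with_null.mean()                # fraction of affected rows

print(rows_with_null.sum())   # 2
print(f"{share:.1%}")         # 50.0%

cleaned = demo.dropna()       # returns a new frame; demo itself is unchanged
print(len(cleaned))           # 2
```

Per-column counts from `isnull().sum()` can overlap on the same row, so the row-level share is the figure that matches what `dropna()` removes.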


### 3.3 Handling Duplicated Values

In [23]:
# Looking for the number of rows
df.shape

(25576, 6)

In [24]:
# Looking for number of duplicated comments
df['text'].duplicated().sum()

815

In [25]:
df[df['text'].duplicated()]

Unnamed: 0,author,published_at,like_count,text,video_title,video_published_at
516,Sherli Malinda,2020-11-19T14:04:55Z,0,SEORANG GADIS DI PERKOSA RAMAI RAMAI DAN DI RE...,Respek kakak ni bertarung nyawa dan harta,2019-09-14T04:04:25Z
542,yt 12,2020-11-16T01:00:08Z,0,"<a href=""https://youtu.be/dClnptakA4g"">https:/...",Respek kakak ni bertarung nyawa dan harta,2019-09-14T04:04:25Z
543,Aci Uwa,2020-11-16T00:08:05Z,0,"<a href=""https://youtu.be/dClnptakA4g"">https:/...",Respek kakak ni bertarung nyawa dan harta,2019-09-14T04:04:25Z
544,Aci Uwa,2020-11-16T00:07:59Z,0,"<a href=""https://youtu.be/dClnptakA4g"">https:/...",Respek kakak ni bertarung nyawa dan harta,2019-09-14T04:04:25Z
683,Ali Redha,2020-10-23T12:28:31Z,0,"Jalankan undang-undang Islam, baru tak ada ora...",Respek kakak ni bertarung nyawa dan harta,2019-09-14T04:04:25Z
...,...,...,...,...,...,...
25596,Rizu Iskandar,2023-04-06T04:04:14Z,2,🎉🎉,Dato Sri Aliff Syukri - Chu Ku Chuk Raya [Offi...,2023-04-06T04:00:10Z
25603,Chaimah gaming,2023-04-06T04:02:44Z,3,Done dislike,Dato Sri Aliff Syukri - Chu Ku Chuk Raya [Offi...,2023-04-06T04:00:10Z
25606,Julisham ajirul,2023-04-06T04:01:53Z,2,Done dislike,Dato Sri Aliff Syukri - Chu Ku Chuk Raya [Offi...,2023-04-06T04:00:10Z
25607,eyman. ucop,2023-04-06T04:01:39Z,3,❤❤❤,Dato Sri Aliff Syukri - Chu Ku Chuk Raya [Offi...,2023-04-06T04:00:10Z


There are 815 duplicated comments, about 3% of the 25,576 rows. Because they make up a small share of the data, we will drop them and keep the first occurrence of each.

In [26]:
df.drop_duplicates(subset='text', keep='first', inplace=True)

In [28]:
df.reset_index(drop = True, inplace = True)
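
`duplicated()` only catches exact matches. As an optional extension, the sketch below also treats comments that differ only in letter case or surrounding spaces as duplicates. That looser rule is an assumption of this example, not something the notebook applies, and the sample values are made up.

```python
import pandas as pd

demo = pd.DataFrame({"text": ["Done dislike", "Done dislike", "done dislike ", "😂😂"]})

exact = demo["text"].duplicated()   # flags repeats after the first occurrence
print(exact.tolist())               # [False, True, False, False]

# Assumption: comments differing only in case or surrounding spaces are the same
key = demo["text"].str.strip().str.lower()
loose = key.duplicated()
print(loose.tolist())               # [False, True, True, False]

deduped = demo[~loose].reset_index(drop=True)
print(deduped["text"].tolist())     # ['Done dislike', '😂😂']
```

Whether the looser rule is appropriate depends on the task; for sentiment labelling, near-identical comments add little new information.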

### 3.4 Handling Data Types

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24761 entries, 0 to 24760
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   author              24761 non-null  object
 1   published_at        24761 non-null  object
 2   like_count          24761 non-null  object
 3   text                24761 non-null  object
 4   video_title         24761 non-null  object
 5   video_published_at  24761 non-null  object
dtypes: object(6)
memory usage: 1.1+ MB


We have two columns that are meant to be datetimes:
1. 'published_at'
2. 'video_published_at'

Let us change the type.

In [30]:
# Converting to datetime
df['published_at'] = pd.to_datetime(df['published_at'])
df['video_published_at'] = pd.to_datetime(df['video_published_at'])

In [31]:
# Checking the dtype
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24761 entries, 0 to 24760
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   author              24761 non-null  object             
 1   published_at        24761 non-null  datetime64[ns, UTC]
 2   like_count          24761 non-null  object             
 3   text                24761 non-null  object             
 4   video_title         24761 non-null  object             
 5   video_published_at  24761 non-null  datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](2), object(4)
memory usage: 1.1+ MB
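
The `like_count` column is still of `object` dtype after this step. If it is needed as a number later, one possible approach is sketched below, assuming the column holds digit strings with the occasional unparseable cell. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical values: counts read back from CSV as strings, plus one bad cell
likes = pd.Series(["0", "3", "12", "not a number"], dtype="object")

numeric = pd.to_numeric(likes, errors="coerce")   # unparseable cells become NaN
print(numeric.isna().sum())                       # 1

# Drop the bad cell, then store the rest as integers
clean = numeric.dropna().astype("int64")
print(clean.tolist())                             # [0, 3, 12]
```

Using `errors="coerce"` makes bad cells visible as `NaN` so they can be counted and inspected before being dropped.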


### 3.5 Handling HTML-encoded entities

Some comments are hard to read because they contain HTML markup, such as `<br>` tags and encoded entities like `&amp;`. We will use BeautifulSoup to strip the tags and decode the entities.

In [34]:
# Showing example of HTML-encoded text
df_html = pd.concat([df.iloc[32:33], df.iloc[5076:5077]])
df_html

Unnamed: 0,author,published_at,like_count,text,video_title,video_published_at
32,Z,2023-02-01 09:57:11+00:00,0,"Kamus Dewan Edisi 17, <br>~ Respek - Bahasa pa...",Respek kakak ni bertarung nyawa dan harta,2019-09-14 04:04:25+00:00
5076,Marvel Hulk87,2019-05-25 15:58:24+00:00,3,Terbaek KC 💪💪💪<br>Tk pernah hampa kan peminat ...,Cerita dalam Kereta S2 Ep05 (Wanita Mati Dibun...,2019-05-25 14:28:49+00:00


In [35]:
# Using BeautifulSoup to handle HTML-encoded entities
df['text'] = df['text'].apply(lambda x: BeautifulSoup(x, "html.parser").get_text())


In [36]:
# Showing the changes on HTML-encoded text
df_html = pd.concat([df.iloc[32:33], df.iloc[5076:5077]])
df_html

Unnamed: 0,author,published_at,like_count,text,video_title,video_published_at
32,Z,2023-02-01 09:57:11+00:00,0,"Kamus Dewan Edisi 17, ~ Respek - Bahasa pasar ...",Respek kakak ni bertarung nyawa dan harta,2019-09-14 04:04:25+00:00
5076,Marvel Hulk87,2019-05-25 15:58:24+00:00,3,Terbaek KC 💪💪💪Tk pernah hampa kan peminat Respect,Cerita dalam Kereta S2 Ep05 (Wanita Mati Dibun...,2019-05-25 14:28:49+00:00
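
In the second row above, removing `<br>` joins the neighbouring words ("💪💪💪Tk"). The sketch below shows one way to keep a space at line breaks. It swaps BeautifulSoup for the standard library (`re` and `html.unescape`), so it is a simplified alternative rather than the method used in this notebook; `clean_comment` is a hypothetical helper, the regular expressions do not handle every form of HTML, and the sample strings are shortened versions of comments shown earlier.

```python
import html
import re

BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")


def clean_comment(raw):
    """Turn <br> into spaces, drop other tags, then decode entities."""
    text = BR_TAG.sub(" ", raw)        # keep a word boundary at line breaks
    text = ANY_TAG.sub("", text)       # remove remaining tags such as <a ...>
    text = html.unescape(text)         # &amp; -> &, &quot; -> ", and so on
    return re.sub(r"\s+", " ", text).strip()


print(clean_comment("Terbaek KC 💪💪💪<br>Tk pernah hampa kan peminat"))
# Terbaek KC 💪💪💪 Tk pernah hampa kan peminat
print(clean_comment("@SyamsulYusuf. 66 oi&amp;gas!"))
# @SyamsulYusuf. 66 oi&gas!
```

Keeping the word boundary matters for the word-count feature and for any tokeniser applied later.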


## 4. Feature Engineering

To describe each comment in more detail, let us create two columns: comment length (in characters) and comment word count.

In [37]:
# Creating a column called comment_length (number of characters)
df['comment_length'] = df['text'].str.len()

In [38]:
# Creating a column called comment_word_count (whitespace-separated tokens)
df['comment_word_count'] = df['text'].str.split().str.len()

In [40]:
df.head()

Unnamed: 0,author,published_at,like_count,text,video_title,video_published_at,comment_length,comment_word_count
0,Hafiz Khan,2023-09-06 07:52:06+00:00,0,Org kelantan ke apa ji,Respek kakak ni bertarung nyawa dan harta,2019-09-14 04:04:25+00:00,22,5
1,Mia Adriana,2023-08-18 05:07:29+00:00,0,Kalau lama pecah cermin tu n pichang la org dl...,Respek kakak ni bertarung nyawa dan harta,2019-09-14 04:04:25+00:00,50,11
2,abdul mutalib,2023-08-10 11:29:37+00:00,0,Patut kak tu diberi lesen mcine gun tembak mat...,Respek kakak ni bertarung nyawa dan harta,2019-09-14 04:04:25+00:00,143,25
3,abdul mutalib,2023-08-10 11:28:47+00:00,0,Mau langar smpai patah kaki ja... Tada kj samu...,Respek kakak ni bertarung nyawa dan harta,2019-09-14 04:04:25+00:00,54,10
4,Syam Ryin,2023-08-08 12:50:10+00:00,0,Mangsa Dalam kereta tu macam mna?diam je x beg...,Respek kakak ni bertarung nyawa dan harta,2019-09-14 04:04:25+00:00,52,9
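
The two features can be checked by hand on single strings. In this small sketch, `comment_features` is a hypothetical helper that mirrors the two columns; the first sample is the first comment in the table above.

```python
def comment_features(text):
    """Return (character count, whitespace-separated word count)."""
    return len(text), len(text.split())


print(comment_features("Org kelantan ke apa ji"))  # (22, 5)
print(comment_features("😂😂😂😂"))                  # (4, 1)
print(comment_features("x  tau   la"))             # (11, 3)
```

`len` counts Unicode code points, so each of these emoji counts as one character, and `split()` with no argument collapses repeated spaces, so extra whitespace adds to the length but not to the word count.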


In [41]:
# Saving to csv file
df.to_csv('../data/01_cleaned_malay_comments.csv', index=False)