![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)


# Cleaning the raw data


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

### Import the required libraries


In [110]:
import pandas as pd
import numpy as np
from pprint import pprint
import json
import re
import datetime

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)


### Create a dataframe from the csv file


In [111]:
raw_df = pd.read_csv("Data/weibovietnam.csv")

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)


### Let's have a look at the raw data before cleaning it


In [112]:
# the first 5 rows of the data
raw_df.head()

Unnamed: 0,post_id,text,post_text,shared_text,original_text,time,timestamp,image,image_lowquality,images,...,w3_fb_url,reactions,reaction_count,with,page_id,sharers,image_id,image_ids,was_live,fetched_time
0,731043432389703,NHẬT KÝ ĐI LẤY CHỒNG XA: KHÔNG DẬY SỚM BỊ CHỒN...,NHẬT KÝ ĐI LẤY CHỒNG XA: KHÔNG DẬY SỚM BỊ CHỒN...,,,2023-11-29 23:58:05,1701277085,https://scontent.fhan14-2.fna.fbcdn.net/v/t39....,https://scontent.fhan14-2.fna.fbcdn.net/v/t39....,['https://scontent.fhan14-2.fna.fbcdn.net/v/t3...,...,https://www.facebook.com/weibovietnam/posts/73...,"{'thích': 3003, 'yêu thích': 6, 'haha': 21, 'w...",4155,,138841156165916,,731040000000000.0,['731040002390046'],False,2023-11-30 10:48:43.213665
1,731015475725832,Bạn cùng phòng chỉ cho tôi cách dùng khăn giấy...,Bạn cùng phòng chỉ cho tôi cách dùng khăn giấy...,,,2023-11-29 22:54:15,1701273255,,https://scontent.fhan14-2.fna.fbcdn.net/v/t39....,[],...,https://www.facebook.com/weibovietnam/posts/73...,"{'thích': 699, 'yêu thích': 3, 'haha': 564, 'w...",1275,,138841156165916,,731014600000000.0,['731014649059248'],False,2023-11-30 10:49:46.221907
2,731004215726958,NHỮNG TRÁI TIM BĂNG GIÁ GIỮA TRỜI ĐÔNG THỦ ĐÔ ...,NHỮNG TRÁI TIM BĂNG GIÁ GIỮA TRỜI ĐÔNG THỦ ĐÔ ...,,,2023-11-29 22:30:47,1701271847,https://m.facebook.com/photo/view_full_size/?f...,https://scontent.fhan14-4.fna.fbcdn.net/v/t39....,['https://m.facebook.com/photo/view_full_size/...,...,https://www.facebook.com/weibovietnam/posts/73...,"{'thích': 1976, 'yêu thích': 970, 'haha': 949,...",3896,,138841156165916,,731004000000000.0,['731004012393645'],False,2023-11-30 10:50:13.421866
3,730966445730735,Đôi mắt chăm chú và biểu cảm tràn đầy yêu thíc...,Đôi mắt chăm chú và biểu cảm tràn đầy yêu thíc...,,,2023-11-29 21:11:50,1701267110,https://m.facebook.com/photo/view_full_size/?f...,https://scontent.fhan14-2.fna.fbcdn.net/v/t39....,['https://m.facebook.com/photo/view_full_size/...,...,https://www.facebook.com/weibovietnam/posts/73...,"{'thích': 2846, 'yêu thích': 737, 'haha': 21, ...",3621,,138841156165916,,730966400000000.0,['730966409064072'],False,2023-11-30 10:50:30.764316
4,730945852399461,Sau này đám cưới của tui và anh iu sẽ kiểu 🤣🤣🤣...,Sau này đám cưới của tui và anh iu sẽ kiểu 🤣🤣🤣...,,,2023-11-29 20:31:34,1701264694,https://scontent.fhan14-3.fna.fbcdn.net/v/t39....,https://scontent.fhan14-3.fna.fbcdn.net/v/t39....,['https://scontent.fhan14-3.fna.fbcdn.net/v/t3...,...,https://www.facebook.com/weibovietnam/posts/73...,"{'thích': 256, 'yêu thích': 5, 'haha': 164, 't...",427,,138841156165916,,730945500000000.0,['730945492399497'],False,2023-11-30 10:50:41.289521


In [113]:
# information of the data
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 51 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   post_id                        1000 non-null   int64  
 1   text                           953 non-null    object 
 2   post_text                      953 non-null    object 
 3   shared_text                    1 non-null      object 
 4   original_text                  0 non-null      float64
 5   time                           1000 non-null   object 
 6   timestamp                      1000 non-null   int64  
 7   image                          897 non-null    object 
 8   image_lowquality               1000 non-null   object 
 9   images                         954 non-null    object 
 10  images_description             954 non-null    object 
 11  images_lowquality              1000 non-null   object 
 12  images_lowquality_description  1000 non-null   ob

#### -> The data is not very clean: it contains many missing values, and several columns have very few non-null values. Some columns will not be used later.
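
To make the "many missing values" observation concrete, the share of missing values per column can be computed directly. This is a minimal sketch on a small made-up frame: `demo_df` and its values are hypothetical, and on the real data the same methods would be called on `raw_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for raw_df, with the same kind of gaps
demo_df = pd.DataFrame(
    {
        "post_id": [1, 2, 3, 4],
        "text": ["a", None, "c", "d"],
        "original_text": [np.nan, np.nan, np.nan, np.nan],
    }
)

# Fraction of missing values per column, largest first
missing_ratio = demo_df.isna().mean().sort_values(ascending=False)
print(missing_ratio)

# Columns that are entirely empty are natural candidates for deletion
empty_cols = missing_ratio[missing_ratio == 1.0].index.tolist()
print(empty_cols)
```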


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)


#### Delete unnecessary columns, then add and alter dataframes


#### 1) Delete useless columns:


In [114]:
delete_cols = [
    "post_text",
    "image",
    "images_lowquality",
    "post_url",
    "timestamp",
    "username",
    "likes",
    "shares",
    "images_description",
    "images_lowquality_description",
    "shared_text",
    "original_text",
    "video_size_MB",
    "image_lowquality",
    "images",
    "video",
    "video_duration_seconds",
    "video_height",
    "video_id",
    "video_quality",
    "video_thumbnail",
    "video_watches",
    "video_width",
    "link",
    "links",
    "user_url",
    "is_live",
    "factcheck",
    "shared_post_id",
    "shared_time",
    "shared_user_id",
    "shared_username",
    "shared_post_url",
    "available",
    "reactors",
    "w3_fb_url",
    "with",
    "page_id",
    "sharers",
    "image_id",
    "user_id",
    "image_ids",
    "was_live",
    "fetched_time",
]
# create clean dataframe by deleting unnecessary columns
clean_df = raw_df.drop(delete_cols, axis=1)
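
An equivalent way to express this step is to list the columns to keep instead of the ones to drop, which is shorter here because only seven columns survive. The sketch below assumes those seven column names; `demo_raw` is a hypothetical stand-in for `raw_df`.

```python
import pandas as pd

# Hypothetical stand-in for raw_df with two columns we do not need
demo_raw = pd.DataFrame(
    {
        "post_id": [1, 2],
        "text": ["a", "b"],
        "time": ["2023-11-29 23:58:05", "2023-11-29 22:54:15"],
        "comments": [3, 0],
        "comments_full": ["[]", "[]"],
        "reactions": ["{}", None],
        "reaction_count": [10, 0],
        "video": [None, None],
        "was_live": [False, False],
    }
)

keep_cols = [
    "post_id",
    "text",
    "time",
    "comments",
    "comments_full",
    "reactions",
    "reaction_count",
]

# Selecting with a list returns a new DataFrame;
# .copy() avoids chained-assignment warnings in later steps
demo_clean = demo_raw[keep_cols].copy()
print(demo_clean.columns.tolist())
```

A keep-list also fails loudly with a `KeyError` if an expected column is missing from the CSV.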

In [115]:
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   post_id         1000 non-null   int64 
 1   text            953 non-null    object
 2   time            1000 non-null   object
 3   comments        1000 non-null   int64 
 4   comments_full   1000 non-null   object
 5   reactions       239 non-null    object
 6   reaction_count  1000 non-null   int64 
dtypes: int64(3), object(4)
memory usage: 54.8+ KB


#### Remove posts that have no comments

In [116]:
# drop rows with 0 comments
clean_df = clean_df[clean_df["comments"] != 0]
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 237 entries, 0 to 988
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   post_id         237 non-null    int64 
 1   text            237 non-null    object
 2   time            237 non-null    object
 3   comments        237 non-null    int64 
 4   comments_full   237 non-null    object
 5   reactions       237 non-null    object
 6   reaction_count  237 non-null    int64 
dtypes: int64(3), object(4)
memory usage: 14.8+ KB


#### 2) comments_full to a new DataFrame: Comment_frame
##### Convert the comments_full column into a DataFrame with 4 fields:
- comment_id
- commenter_id
- comment_text
- post_id

In [117]:
# function to extract comments from a string
def get_comments(text):
    # match every innermost {...} group (a brace pair with no braces inside)
    pattern = r"\{[^{}]*\}"
    matches = re.findall(pattern, text)
    # NOTE: eval executes arbitrary code, so use it only on trusted data
    result = [eval(match) for match in matches]
    return result
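
To see what the pattern in `get_comments` actually captures, the sketch below runs the same regular expression on small made-up strings. It uses `ast.literal_eval` instead of `eval`, which is an assumption that only holds when every value is a plain Python literal. The sample strings and their field values are hypothetical.

```python
import ast
import re

# Hypothetical comments_full-like string: a list of flat dicts
sample = (
    "[{'comment_id': '1', 'commenter_id': '10', 'comment_text': 'hello'}, "
    "{'comment_id': '2', 'commenter_id': '20', 'comment_text': 'world'}]"
)

# Same pattern as get_comments: a brace pair with no braces inside
pattern = r"\{[^{}]*\}"
matches = re.findall(pattern, sample)

# literal_eval only accepts literals (str, int, dict, ...),
# so it cannot execute arbitrary code
parsed = [ast.literal_eval(m) for m in matches]
print(len(parsed), parsed[0]["comment_text"])

# A dict that contains another dict is not matched as a whole:
# only the innermost brace pair is returned.
nested = "{'comment_id': '3', 'reactions': {'like': 2}}"
print(re.findall(pattern, nested))
```

A comment record that contains a nested dict therefore yields only the inner dict, which lacks the expected keys.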

In [118]:
def create_comments_frame(df):
    df_result = {
        "comment_id": [],
        "commenter_id": [],
        "comment_text": [],
        "post_id": [],
    }

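    # keep only posts that actually have a comments_full value
    new_df = df[df["comments_full"].notna()]
    for _, row in new_df.iterrows():
        for comment in get_comments(row["comments_full"]):
            try:
                # read all fields first, so a missing key cannot
                # leave the lists with unequal lengths
                comment_id = comment["comment_id"]
                comment_text = comment["comment_text"]
                commenter_id = comment["commenter_id"]
            except (KeyError, TypeError):
                # skip matches that are not complete comment records
                continue
            df_result["comment_id"].append(comment_id)
            df_result["comment_text"].append(comment_text)
            df_result["commenter_id"].append(commenter_id)
            df_result["post_id"].append(row["post_id"])
    return pd.DataFrame(df_result)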

In [119]:
Comment_frame = create_comments_frame(clean_df)
Comment_frame

Unnamed: 0,comment_id,commenter_id,comment_text,post_id
0,659041829740311,100004523313287,Chả bù mình đi lấy chồng ngủ đến 11h trưa. Từ ...,731043432389703
1,1058980365296051,100004428425175,Đa số mà bố mẹ ngăn cản thì là bố mẹ đúng đấy,731043432389703
2,1049705363008340,100005874492328,Đúng là lấy chồng như canh bạc,731043432389703
3,847616997155316,100000337745111,Hải Phong tốt nhất là lấy chồng cũng mê ngủ gi...,731043432389703
4,3561634410773586,1251407235,Cãi ba mẹ lấy cho đã rồi lúc về lại nhà ba mẹ ...,731043432389703
...,...,...,...,...
3622,3405980696397936,100035876087877,Mặc quần đùi đi đá bóng có coi là lộ chân ko nhỉ?,6563791543670813
3623,1345872649536668,100017682734580,cái post này mà cmt ảnh được là phần cmt toàn ...,6563791543670813
3624,1363567874435450,100012723275222,"Này là chân, còn của tui là cột đình👀",6563791543670813
3625,752935386491017,100045700756125,Trang Le chân kia nhỏ hơn bắp tay mình luôn í=))),6563791543670813


#### 3) Create a DataFrame for reactions from the column "reactions"

In [120]:
import ast

dfs = []
for index, row in clean_df.iterrows():
    # skip posts without a reactions string
    if pd.notna(row['reactions']):
        # Parse the reactions dict (literals only) into a one-row DataFrame
        df = pd.DataFrame(ast.literal_eval(row['reactions']), index=[0])
        df['post_id'] = row['post_id']
        dfs.append(df)

# Concatenate the one-row frames into a single DataFrame
Reactions_Frame = pd.concat(dfs, ignore_index=True)

# Replace NaN values (reaction types a post did not receive) with 0
Reactions_Frame = Reactions_Frame.fillna(0)

# Convert data type to int
Reactions_Frame = Reactions_Frame.astype(int)

# Add up the reactions of each post
Reactions_Frame['total'] = Reactions_Frame.drop('post_id', axis=1).sum(axis=1)
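
The same table can also be built without concatenating one small DataFrame per row: parse every reactions string into a dict, then hand the list of dicts to `pd.DataFrame`. This sketch assumes each reactions value is the string form of a plain dict of integers. `demo_posts` and the English reaction names are hypothetical.

```python
import ast
import pandas as pd

# Hypothetical stand-in for clean_df
demo_posts = pd.DataFrame(
    {
        "post_id": [101, 102, 103],
        "reactions": ["{'like': 5, 'haha': 2}", None, "{'like': 1, 'sad': 4}"],
    }
)

# Keep only posts that have a reactions string
with_reactions = demo_posts[demo_posts["reactions"].notna()]

# One dict per post; reaction types missing from a post become NaN, then 0
records = [ast.literal_eval(s) for s in with_reactions["reactions"]]
demo_reactions = pd.DataFrame(records).fillna(0).astype(int)

# Compute the total before adding post_id, so the id is not summed
demo_reactions["total"] = demo_reactions.sum(axis=1)
demo_reactions["post_id"] = with_reactions["post_id"].to_numpy()
print(demo_reactions)
```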


In [121]:
Reactions_Frame

Unnamed: 0,thích,yêu thích,haha,wow,thương thương,buồn,phẫn nộ,post_id,total
0,3003,6,21,13,4,1089,19,731043432389703,4155
1,699,3,564,6,1,1,1,731015475725832,1275
2,2846,737,21,2,14,0,1,730966445730735,3621
3,256,5,164,0,1,0,1,730945852399461,427
4,1014,7,1271,0,1,3,1,730934009067312,2297
...,...,...,...,...,...,...,...,...,...
232,2428,3,1,0,0,1,0,6564292776954023,2433
233,3682,955,1755,2,87,2,0,6564246580291976,6483
234,2721,2,11,1,0,1,0,6564158210300813,2736
235,2824,22,1209,7,6,511,0,6563918830324751,4579


#### Delete 3 columns: "comments_full", "reactions", "reaction_count"

In [122]:
fin_delete_cols = ["comments_full", "reactions", "reaction_count"]
# drop the columns that now live in their own DataFrames
clean_df = clean_df.drop(fin_delete_cols, axis=1)
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 237 entries, 0 to 988
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   post_id   237 non-null    int64 
 1   text      237 non-null    object
 2   time      237 non-null    object
 3   comments  237 non-null    int64 
dtypes: int64(2), object(2)
memory usage: 9.3+ KB


In [123]:
# first 5 rows of the data
clean_df.head()

Unnamed: 0,post_id,text,time,comments
0,731043432389703,NHẬT KÝ ĐI LẤY CHỒNG XA: KHÔNG DẬY SỚM BỊ CHỒN...,2023-11-29 23:58:05,603
1,731015475725832,Bạn cùng phòng chỉ cho tôi cách dùng khăn giấy...,2023-11-29 22:54:15,37
3,730966445730735,Đôi mắt chăm chú và biểu cảm tràn đầy yêu thíc...,2023-11-29 21:11:50,58
4,730945852399461,Sau này đám cưới của tui và anh iu sẽ kiểu 🤣🤣🤣...,2023-11-29 20:31:34,32
5,730934009067312,HÌNH TƯỢNG KHI ĐI LÀM VÀ SAU KHI TAN LÀM CỦA G...,2023-11-29 20:07:05,71


#### Remove line breaks (\n), tabs (\t), and emojis from the text

In [124]:
def clean_text(text):
    text = text.replace("\n", " ")
    text = text.replace("\t", " ")

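    # Remove emojis and other symbols.
    # Note: the wide ranges below (e.g. U+24C2 to U+1F251) also remove CJK characters.
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (regional indicators)
        u"\U00002500-\U00002BEF"  # box drawing through misc symbols and arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # enclosed characters (very wide range)
        u"\U0001f926-\U0001f937"  # gestures (facepalm to shrug)
        u"\U00010000-\U0010ffff"  # all supplementary planes
        u"\u2640-\u2642"          # female and male signs
        u"\u2600-\u2B55"          # misc symbols
        u"\u200d"                 # zero-width joiner
        u"\u23cf"                 # eject symbol
        u"\u23e9"                 # fast-forward symbol
        u"\u231a"                 # watch
        u"\ufe0f"                 # variation selector-16
        u"\u3030"                 # wavy dash
                      "]+", re.UNICODE)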
    text = re.sub(emoj, '', text)

    return text
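
A quick check of the idea behind `clean_text` on a made-up string. The sketch uses a much narrower pattern than the one above (three emoji blocks plus the variation selector), so it illustrates the technique rather than replacing the function. `strip_emojis` and `EMOJI_RE` are hypothetical names.

```python
import re

# Narrow, illustrative emoji pattern (assumption: not exhaustive)
EMOJI_RE = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\ufe0f"                 # variation selector-16
    "]+"
)


def strip_emojis(text: str) -> str:
    # Replace line breaks and tabs with spaces, then drop emoji runs
    text = text.replace("\n", " ").replace("\t", " ")
    return EMOJI_RE.sub("", text)


sample = "đám cưới 🤣🤣🤣\nvui quá 😍"
out = strip_emojis(sample)
print(repr(out))
```

Removing an emoji between two spaces leaves a double space behind. If that matters, `" ".join(out.split())` collapses runs of whitespace.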

#### Convert time to datetime

In [125]:
# convert time to datetime
clean_df["time"] = pd.to_datetime(clean_df["time"])
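
By default `pd.to_datetime` raises an error on strings it cannot parse. If malformed timestamps are a concern, `errors="coerce"` turns them into `NaT` so they can be inspected afterwards. A sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical time values, the last one deliberately malformed
demo_time = pd.Series(["2023-11-29 23:58:05", "2023-11-29 22:54:15", "not a date"])

# Unparseable entries become NaT instead of raising an error
parsed = pd.to_datetime(demo_time, errors="coerce")
print(parsed)

# Once the column is datetime, the .dt accessor exposes its parts
print(parsed.dt.hour.tolist())
```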


In [126]:
# clean the text
clean_df["text"] = clean_df["text"].apply(clean_text)
Comment_frame["comment_text"] = Comment_frame["comment_text"].apply(clean_text)

In [127]:
clean_df.head()

Unnamed: 0,post_id,text,time,comments
0,731043432389703,NHẬT KÝ ĐI LẤY CHỒNG XA: KHÔNG DẬY SỚM BỊ CHỒN...,2023-11-29 23:58:05,603
1,731015475725832,Bạn cùng phòng chỉ cho tôi cách dùng khăn giấy...,2023-11-29 22:54:15,37
3,730966445730735,Đôi mắt chăm chú và biểu cảm tràn đầy yêu thíc...,2023-11-29 21:11:50,58
4,730945852399461,Sau này đám cưới của tui và anh iu sẽ kiểu C...,2023-11-29 20:31:34,32
5,730934009067312,HÌNH TƯỢNG KHI ĐI LÀM VÀ SAU KHI TAN LÀM CỦA G...,2023-11-29 20:07:05,71


In [128]:
Comment_frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3627 entries, 0 to 3626
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   comment_id    3627 non-null   object
 1   commenter_id  3627 non-null   object
 2   comment_text  3627 non-null   object
 3   post_id       3627 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 113.5+ KB


In [129]:
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 237 entries, 0 to 988
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   post_id   237 non-null    int64         
 1   text      237 non-null    object        
 2   time      237 non-null    datetime64[ns]
 3   comments  237 non-null    int64         
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 9.3+ KB


In [130]:
Reactions_Frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 237 entries, 0 to 236
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   thích          237 non-null    int64
 1   yêu thích      237 non-null    int64
 2   haha           237 non-null    int64
 3   wow            237 non-null    int64
 4   thương thương  237 non-null    int64
 5   buồn           237 non-null    int64
 6   phẫn nộ        237 non-null    int64
 7   post_id        237 non-null    int64
 8   total          237 non-null    int64
dtypes: int64(9)
memory usage: 16.8 KB


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)


### Save the dataframes

In [131]:
# save to csv 
clean_df.to_csv("Data/clean_df.csv", index=False)
Comment_frame.to_csv("Data/Comment_frame.csv", index=False)
Reactions_Frame.to_csv("Data/Reactions_Frame.csv", index=False)
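
One thing to keep in mind when reading these files back: CSV stores no dtypes, so the `time` column comes back as plain strings unless `parse_dates` is passed to `pd.read_csv`. A sketch using an in-memory buffer in place of a file on disk (`demo_df` is a hypothetical two-row frame):

```python
import io
import pandas as pd

demo_df = pd.DataFrame(
    {
        "post_id": [731043432389703, 731015475725832],
        "time": pd.to_datetime(["2023-11-29 23:58:05", "2023-11-29 22:54:15"]),
    }
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)

# Without parse_dates the time column is read back as strings
buffer.seek(0)
plain = pd.read_csv(buffer)

# With parse_dates it is converted to datetime again
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["time"])
print(plain["time"].dtype, restored["time"].dtype)
```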