# Organize Article Dataset
The raw dataset is a little messy. In this notebook, we process it into a more structured form. The main steps are:
1. Remove unused columns
2. Convert data types (e.g., post_time to datetime)
3. Write a new CSV file for each year and kind of article
4. Write each year's stock data to a separate CSV file

In [1]:
import pandas as pd
from pathlib import Path
from typing import Literal
import os

folder_name = "organize"
if folder_name in os.getcwd():
    os.chdir(os.path.abspath(os.pardir))
%pwd

'C:\\Users\\break\\Projects\\College\\大四下\\DatAnalysis\\2023-2-data-analyze-midterm'

# Configuration

In [2]:
DATA_DIR = "./bda2023_mid_dataset"
ORGANIZED_DATASET_DIR = "./organized_data"
ORGANIZED_DATASET_NAME = "documents.csv"
RAW_DATASET_NAMES = {
    'bbs_2019_2021': "bda2023_mid_bbs_2019-2021.csv",
    'bbs_2022_2023' : "bda2023_mid_bbs_2022-2023.csv",
    'forum_2019' : "bda2023_mid_forum_2019.csv",
    'forum_2020' : "bda2023_mid_forum_2020.csv",
    'forum_2021' : "bda2023_mid_forum_2021.csv",
    'forum_2022_2023' : "bda2023_mid_forum_2022-2023.csv",
    'news_2019' : "bda2023_mid_news_2019.csv",
    'news_2020' : "bda2023_mid_news_2020.csv",
    'news_2021' : "bda2023_mid_news_2021.csv",
    'news_2022' : "bda2023_mid_news_2022.csv",
    'news_2022_2023' : "bda2023_mid_news_2022-2023.csv",
    'news_2023' : "bda2023_mid_news_2023.csv",
}

# Utility functions

In [3]:
def have_same_columns(*dfs):
    """Check whether all given DataFrames have the same columns (same names, same order)."""
    for i in range(len(dfs)):
        for j in range(i + 1, len(dfs)):
            if not dfs[i].columns.equals(dfs[j].columns):
                print(f"columns of {i} and {j} are not equal")
                return False
    return True


def get_article_dfs_by_type(article_type: Literal['bbs', 'forum', 'news']) -> pd.DataFrame:
    """
    Load every raw file of the given article type (bbs, forum, or news)
    and return a single DataFrame that contains all data of that type.
    e.g. get_article_dfs_by_type('bbs') returns a DataFrame with all bbs data from 2019 to 2023
    """
    raw_dataset_paths = {k: Path(DATA_DIR, v) for k, v in RAW_DATASET_NAMES.items()}
    dfs = [pd.read_csv(file_path) for k, file_path in raw_dataset_paths.items() if k.startswith(article_type)]
    assert have_same_columns(*dfs)
    df = pd.concat(dfs)
    return df
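
As a quick sanity check of the helper above, the sketch below runs a copy of `have_same_columns` on toy frames. The frames `a`, `b`, and `c` are made up for illustration. Note that `Index.equals` is order-sensitive, so frames with the same column names in a different order count as unequal.

```python
import pandas as pd


def have_same_columns(*dfs):
    """Check whether all given DataFrames have the same columns (same names, same order)."""
    for i in range(len(dfs)):
        for j in range(i + 1, len(dfs)):
            if not dfs[i].columns.equals(dfs[j].columns):
                print(f"columns of {i} and {j} are not equal")
                return False
    return True


# toy frames, not taken from the dataset
a = pd.DataFrame({"id": [1], "title": ["x"]})
b = pd.DataFrame({"id": [2], "title": ["y"]})
c = pd.DataFrame({"title": ["z"], "id": [3]})  # same names, different order

same_ab = have_same_columns(a, b)
same_ac = have_same_columns(a, c)
print(same_ab, same_ac)
```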

# Research
Before preprocessing the dataset, we first need a rough idea of what it looks like.

In [4]:
# load all article df
bbs_df = get_article_dfs_by_type('bbs')
forum_df = get_article_dfs_by_type('forum')
news_df = get_article_dfs_by_type('news')
display(bbs_df.head())
display(forum_df.head())
display(news_df.head())

Unnamed: 0,id,p_type,s_name,s_area_name,post_time,title,author,content,page_url
0,1546274852018_PTT02R,bbs,Ptt,Stock,2019-01-01 00:31:32,[公告] n199808m HitMaker 警告一次,eyespot,1. 主旨：n199808m 違反板規4-2-1 警告一次 HitMake...,http://www.ptt.cc/bbs/Stock/M.1546273895.A.81F...
1,1546278287622_PTT02R,bbs,Ptt,Stock,2019-01-01 01:28:28,Re: [新聞] 貿戰讓台商錢匯不出？ 海基會：漣漪效應,CGDGAD,小弟有個想法不知可不可行 如果有人民幣想洗出來 出國一趟，比方去歐洲 用海外刷卡買黃金，存在...,http://www.ptt.cc/bbs/Stock/M.1546277311.A.1D3...
2,1546278288500_PTT02R,bbs,Ptt,Stock,2019-01-01 01:32:39,Re: [新聞] 貿易戰搶出口 透支效應2019衝擊中國經濟!,americ,分身帳號好像要連坐水桶 《ＩＤ暱稱》tangolosss (配息配股變成大富翁)《經濟狀況...,http://www.ptt.cc/bbs/Stock/M.1546277562.A.F7E...
3,1546298530556_PTT02R,bbs,Ptt,Stock,2019-01-01 07:07:37,Re: [新聞] 陸媒：俄羅斯想聯手中國去美元化,taco13,所以說不要小看俄羅斯的險惡奸詐 俄國一直鼓勵中國發展人民幣石油 去美元化的種種行為 俄羅...,http://www.ptt.cc/bbs/Stock/M.1546297660.A.928...
4,1546299585726_PTT02R,bbs,Ptt,Stock,2019-01-01 07:35:29,[標的] (伺機作多)日元正二,hrma,1. 標的：元大日元指數正二 2. 分類：(伺機作多)多 3. 分析/正文： (...,http://www.ptt.cc/bbs/Stock/M.1546299333.A.8D3...


Unnamed: 0,id,p_type,s_name,s_area_name,post_time,title,author,content,page_url,content_type,comment_count
0,1546273483220_F01,forum,mobile01,閒聊_投資與理財,2019-01-01 00:10:00,今日華固大跌9%，有人知道為什麼嗎？,d885668,到現在還不知為何暴跌,https://www.mobile01.com/topicdetail.php?p=2&f...,reply,15
1,1546273487328_F01,forum,mobile01,閒聊_投資與理財,2019-01-01 00:16:00,個人研究觀察記錄篇(宏碁轉型之路篇),Roger0607,2019 新年快樂…祝A大各位雞友2019財源廣進通四海,https://www.mobile01.com/topicdetail.php?p=109...,reply,13176
2,1546274269262_F01,forum,mobile01,閒聊_投資與理財,2019-01-01 00:30:00,一則美債新聞分享,杜鵑泣血,不用先去煩惱美債，中國違約的債卷還更多，我們就看下去吧,https://www.mobile01.com/topicdetail.php?f=291...,reply,5
3,1546275066274_F01,forum,mobile01,閒聊_投資與理財,2019-01-01 00:35:00,關於美股走空、黃豆、白銀、原油、上證指數、美元走勢,sphenoidarthur,白銀如預期漲幅已到達年線<BR>短期漲幅滿足點到達<BR><BR><BR>黃豆MA60. M...,https://www.mobile01.com/topicdetail.php?p=4&f...,reply,62
4,1546276014397_F01,forum,mobile01,閒聊_投資與理財,2019-01-01 00:50:00,看到台湾人都这么关心大陆的贸易战，作为大陆人我也简单谈一下，目前看形势一片大好,四少爺,才過了幾個月，「中國或成最大贏家」！ 哈哈哈。,https://www.mobile01.com/topicdetail.php?p=49&...,reply,647


Unnamed: 0,id,p_type,s_name,s_area_name,post_time,title,author,content,page_url
0,1546294835402_N01,news,yahoo股市,最新財經新聞,2019-01-01 03:45:00,【歐股盤後】氣氛樂觀 盤勢走穩,中央社 中央社,（中央社台北2019年1月1日電）即將舉行的美中貿易談判為投資人帶來希望，歐洲股市在封關前最...,https://tw.stock.yahoo.com/news/歐股盤後-氣氛樂觀-盤勢走穩...
1,1546293936100_N01,news,yahoo股市,重大要聞,2019-01-01 05:18:00,台股元旦休市期間 美股累計漲跌幅--12月31日,鉅亨網 鉅亨網編譯郭照青,-------------------12 月 27 日 -------12 月 31 日<...,https://tw.stock.yahoo.com/news/台股元旦休市期間-美股累計漲...
2,1546302955899_N01,news,yahoo股市,重大要聞,2019-01-01 05:50:00,2019最受期待10款新車亮相 國產ALTIS和FOCUS成焦點 雙B大型豪華SUV對戰,中時電子報 報導陳大任,中國時報 延續2018競爭激烈的車市氛圍來到2019，今年將有多款新車等著跳上車市擂台一較高...,https://tw.stock.yahoo.com/news/2019最受期待10款新車亮...
3,1546296648699_N01,news,yahoo股市,最新財經新聞,2019-01-01 06:03:00,【美股盤後】封關收紅,中央社 中央社,（中央社台北2019年1月1日電）美股31日在2018年最後一個交易日收漲，不過這是10年前...,https://tw.stock.yahoo.com/news/美股盤後-封關收紅-2203...
4,1546296650082_N01,news,yahoo股市,最新財經新聞,2019-01-01 06:30:00,【能源盤後】年終收紅,中央社 中央社,PR2F3301.DBP.US.GB.OIL.ECO.（中央社台北2019年1月1日電）國際...,https://tw.stock.yahoo.com/news/能源盤後-年終收紅-2230...


In [5]:
article_df = pd.concat([bbs_df, forum_df, news_df])
# check which columns have null values
print('columns null value count:')
article_df.isnull().sum()

columns null value count:


id                     0
p_type                 0
s_name                 0
s_area_name            0
post_time              0
title                  0
author             53652
content             9896
page_url               0
content_type     1314396
comment_count    1314396
dtype: int64
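
The large null counts for `content_type` and `comment_count` most likely come from the concatenation itself: in the previews above only the forum data has these two columns, and `pd.concat` fills columns that are missing from a frame with NaN. A toy sketch with made-up rows (only the column names come from the dataset):

```python
import pandas as pd

# toy rows; bbs has no comment_count column, forum does
bbs = pd.DataFrame({"id": ["a1"], "p_type": ["bbs"]})
forum = pd.DataFrame({"id": ["f1"], "p_type": ["forum"], "comment_count": [15]})

# ignore_index only gives the toy frame a clean 0..n-1 index
stacked = pd.concat([bbs, forum], ignore_index=True)
null_counts = stacked.isnull().sum()

print(stacked)
print(null_counts)
```

This is also why `comment_count` shows up as a float (for example `15.0`) after stacking: an integer column that gains NaN values is upcast to `float64`.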

After getting a first look at the dataset, we decided to do the following:
1. Stack the three article DataFrames together, since their columns are almost the same
2. Remove duplicates (by id)
3. Remove unneeded columns; we don't need id and page_url
4. Convert post_time to a datetime object and sort by it

NOTE: Apart from filling missing author, title, and content values with a placeholder string, we don't handle null values here, because the right way to handle them depends on the algorithm we want to use.
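
The four steps above can be sketched on a toy frame as follows. The rows and the `example.com` URLs are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# toy data, not taken from the dataset
bbs = pd.DataFrame({
    "id": ["b1", "b2"],
    "p_type": ["bbs", "bbs"],
    "post_time": ["2019-01-01 07:00:00", "2019-01-01 00:31:32"],
    "page_url": ["http://example.com/1", "http://example.com/2"],
})
forum = pd.DataFrame({
    "id": ["f1", "f1"],  # duplicated id on purpose
    "p_type": ["forum", "forum"],
    "post_time": ["2019-01-01 00:10:00", "2019-01-01 00:10:00"],
    "page_url": ["http://example.com/3", "http://example.com/3"],
})

toy = pd.concat([bbs, forum], ignore_index=True)     # 1. stack the frames
toy = toy.drop_duplicates(subset=["id"])             # 2. remove duplicates by id
toy = toy.drop(columns=["id", "page_url"])           # 3. drop unneeded columns
toy["post_time"] = pd.to_datetime(toy["post_time"])  # 4. convert, then sort
toy = toy.sort_values(by="post_time").reset_index(drop=True)

print(toy)
```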

# Preprocessing

In [66]:
article_df = pd.concat([bbs_df, forum_df, news_df])

# remove duplicated rows by id
original_rows = article_df.shape[0]
article_df = article_df.drop_duplicates(subset=['id'])
dropped_rows = original_rows - article_df.shape[0]
print(f"{dropped_rows} rows were dropped due to duplicates.")

# drop id and page url
print("drop id and page url")
article_df = article_df.drop(columns=['id', 'page_url'])

# convert post_time to datetime object and sort by post_time
print("convert post_time to datetime object and sort by post_time")
article_df['post_time'] = pd.to_datetime(article_df['post_time'])
article_df = article_df.sort_values(by='post_time')

# if author, title, or content is NaN, replace it with the literal placeholder string '""'
article_df['author'] = article_df['author'].fillna('""')
article_df['title'] = article_df['title'].fillna('""')
article_df['content'] = article_df['content'].fillna('""')

article_df.set_index('post_time', inplace=True)
# check for duplicated timestamps in the index
if article_df.index.duplicated().any():
    # keep only the first row for each timestamp
    article_df = article_df[~article_df.index.duplicated()]

article_df.head()

293974 rows were dropped due to duplicates.
drop id and page url
convert post_time to datetime object and sort by post_time


post_time,p_type,s_name,s_area_name,title,author,content,content_type,comment_count
2019-01-01 00:10:00,forum,mobile01,閒聊_投資與理財,今日華固大跌9%，有人知道為什麼嗎？,d885668,到現在還不知為何暴跌,reply,15.0
2019-01-01 00:16:00,forum,mobile01,閒聊_投資與理財,個人研究觀察記錄篇(宏碁轉型之路篇),Roger0607,2019 新年快樂…祝A大各位雞友2019財源廣進通四海,reply,13176.0
2019-01-01 00:30:00,forum,mobile01,閒聊_投資與理財,一則美債新聞分享,杜鵑泣血,不用先去煩惱美債，中國違約的債卷還更多，我們就看下去吧,reply,5.0
2019-01-01 00:31:32,bbs,Ptt,Stock,[公告] n199808m HitMaker 警告一次,eyespot,1. 主旨：n199808m 違反板規4-2-1 警告一次 HitMake...,,
2019-01-01 00:35:00,forum,mobile01,閒聊_投資與理財,關於美股走空、黃豆、白銀、原油、上證指數、美元走勢,sphenoidarthur,白銀如預期漲幅已到達年線<BR>短期漲幅滿足點到達<BR><BR><BR>黃豆MA60. M...,reply,62.0
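
Note that deduplicating on the `post_time` index keeps only the first row for each timestamp, so distinct articles that share a timestamp are dropped as well. A toy sketch with made-up rows shows the effect:

```python
import pandas as pd

# toy frame: two different articles share the same timestamp
toy = pd.DataFrame(
    {"title": ["first", "second", "third"]},
    index=pd.to_datetime([
        "2019-01-01 00:10:00",
        "2019-01-01 00:10:00",  # same timestamp as the row above
        "2019-01-01 00:16:00",
    ]),
)
toy.index.name = "post_time"

# default keep="first": every repeat after the first occurrence is marked True
dup_mask = toy.index.duplicated()
deduped = toy[~dup_mask]

print(dup_mask)
print(deduped)
```

If rows with equal timestamps should survive, one option would be to keep `post_time` as a regular column instead of deduplicating on it; whether that matters depends on the downstream analysis.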


The preprocessed data looks good, so we store it as a CSV file.

In [67]:
article_csv_path = Path(ORGANIZED_DATASET_DIR, ORGANIZED_DATASET_NAME)
# create the organized dataset dir if it does not exist
Path(ORGANIZED_DATASET_DIR).mkdir(parents=True, exist_ok=True)
# save to csv file
article_df.to_csv(article_csv_path)
print(f"article df is saved to {article_csv_path}")
# check everything is ok
# display(pd.read_csv(article_csv_path, index_col=0).head())
# display(article_df.columns)

article df is saved to organized_data\documents.csv


Everything seems good!

In [68]:
df = pd.read_csv(article_csv_path, index_col=0)
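
Two details of the CSV round trip are worth checking. Reading with only `index_col=0` leaves the index as plain strings unless the dates are parsed, and an empty string is read back as NaN while the `'""'` placeholder is preserved. A small in-memory sketch, using toy rows and `io.StringIO` in place of the real file:

```python
import io

import pandas as pd

# toy frame: one placeholder, one empty string, one normal value
toy = pd.DataFrame(
    {"author": ['""', "", "eyespot"]},
    index=pd.to_datetime([
        "2019-01-01 00:10:00",
        "2019-01-01 00:16:00",
        "2019-01-01 00:31:32",
    ]),
)
toy.index.name = "post_time"

buf = io.StringIO()
toy.to_csv(buf)

# read back without date parsing: the index holds strings
buf.seek(0)
plain = pd.read_csv(buf, index_col=0)

# read back with date parsing: the index is a DatetimeIndex again
buf.seek(0)
parsed = pd.read_csv(buf, index_col="post_time", parse_dates=["post_time"])

print(type(plain.index).__name__)
print(type(parsed.index).__name__)
print(parsed["author"].isnull().tolist())
```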

In [70]:
article_df.isnull().sum()
df.isnull().sum()
display(df)
display(article_df)

p_type                0
s_name                0
s_area_name           0
title                 0
author                0
content               0
content_type     707405
comment_count    707405
dtype: int64
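
Step 3 of the plan at the top (one CSV file per year and kind) is not carried out in this section. A minimal sketch of how it could look, assuming a frame indexed by `post_time` with a `p_type` column; the toy rows, the temporary output directory, and the `{p_type}_{year}.csv` naming are all hypothetical:

```python
import tempfile
from pathlib import Path

import pandas as pd

# toy frame, not taken from the dataset
toy = pd.DataFrame(
    {"p_type": ["bbs", "news", "bbs"], "title": ["a", "b", "c"]},
    index=pd.to_datetime(["2019-03-01", "2019-05-01", "2020-01-15"]),
)
toy.index.name = "post_time"

out_dir = Path(tempfile.mkdtemp())  # stand-in for the real output folder

written = []
# group by the year of the index and by article kind
for (year, kind), group in toy.groupby([toy.index.year, "p_type"]):
    path = out_dir / f"{kind}_{year}.csv"  # hypothetical naming scheme
    group.to_csv(path)
    written.append(path.name)

print(sorted(written))
```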