<a href="https://colab.research.google.com/github/angel870326/Monthly-Revenue-Forecasting/blob/main/015_industry_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

> 2023.05.06 Ssu-Yun Wang<br/>
[Github @angel870326](https://github.com/angel870326)

# **Variable for Monthly Revenue Forecasting - Industry**

### Contents
1. Read Data
    * 1.1 Monthly Revenue
    * 1.2 Industry Variable
2. Method 1: Original Industry Categories
3. Method 2: Original Industry Categories + Frequency Encoder
4. Method 3: Four Major Groups
5. Method 4: Four Major Groups + Frequency Encoder

In [None]:
# Connect to Google Drive
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [None]:
import pandas as pd
import numpy as np
import os

## **1. Read Data**


In [None]:
project_path = '/content/gdrive/Shareddrives/Me/論文'

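### **1.1 Monthly Revenue**

**Monthly Revenue and Earnings (2013-2022)**

Data period: January 2013 to December 2022 (120 months)

Scope: TWSE- and TPEx-listed companies (excluding financials, biotechnology and medical care, building materials and construction, DR, and KY companies)

Sources: TEJ Company DB, Market Observation Post System (公開資訊觀測站)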

In [None]:
org_data = pd.read_excel(os.path.join(project_path, '資料集/007_v1/201301-202212上市櫃公司月營收_非金融業.xlsx'), index_col=0)
org_data.columns = pd.to_datetime(org_data.columns, format="%Y-%m-%d").to_period('M')
org_data

公司,2013-01,2013-02,2013-03,2013-04,2013-05,2013-06,2013-07,2013-08,2013-09,2013-10,...,2022-03,2022-04,2022-05,2022-06,2022-07,2022-08,2022-09,2022-10,2022-11,2022-12
1101 台泥,9134465,5540346,9457971,9919269,9543782,9517630,9875888,9835143,10060975,10654077,...,9971650,8319342,7733787,9145989,10102468,10689860,10404901,11368096,9674576,12584154
1102 亞泥,6018213,2552357,5428755,5930748,6239676,5952754,5942364,5786107,5879394,6478670,...,8160414,8710220,8000427,7776413,7864622,7069221,6994078,7601097,8306062,8340507
1103 嘉泥,288455,166638,286007,365292,382601,302995,294781,336088,314563,429783,...,220463,168089,163521,183177,178825,182371,205264,209429,221763,228644
1104 環泥,486481,299860,461732,394631,406677,415968,453397,393203,448691,521445,...,591593,638493,537082,573028,580420,605512,597159,634981,631827,725055
1108 幸福,481802,276936,444917,362054,381384,368109,439572,379115,387362,450770,...,345612,335518,332258,334113,326691,390053,346635,401202,383773,418326
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9951 皇田,201785,167967,240746,243935,238296,193880,198427,256724,228796,250756,...,374229,302262,323433,371791,337581,468608,464373,432835,500111,506796
9955 佳龍,394489,383183,428478,564053,336622,295391,434605,306534,266617,363766,...,96200,101850,95096,80726,85625,81881,79179,80630,91270,84115
9958 世紀鋼,198944,166364,351222,280864,289332,426371,213281,302589,401695,255738,...,626104,401960,673479,665459,651699,757968,903198,911834,944060,1082675
9960 邁達康,52534,41935,61642,70998,81508,64525,62085,60960,60309,61582,...,60275,86754,69752,103280,64983,105969,113755,78996,96570,58764
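
As a small illustration of the column conversion used when loading the revenue table, the sketch below applies the same `pd.to_datetime(...).to_period('M')` call to a hypothetical three-month table. The frame `toy` and its values are made up for illustration; only the conversion call comes from the notebook.

```python
import pandas as pd

# Hypothetical toy revenue table: rows are companies, columns are month-start dates.
toy = pd.DataFrame(
    [[100, 120, 90], [50, 55, 60]],
    index=["A", "B"],
    columns=["2013-01-01", "2013-02-01", "2013-03-01"],
)

# Same conversion as in the cell above: parse the labels, then keep only year and month.
toy.columns = pd.to_datetime(toy.columns, format="%Y-%m-%d").to_period("M")

# A single month can then be selected with a monthly Period key.
feb = toy[pd.Period("2013-02", freq="M")]
print(list(toy.columns.astype(str)))
print(feb.tolist())
```

Using monthly periods as column labels keeps the 120 months unambiguous regardless of which day of the month the source file records.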


In [None]:
print("Data shape:", org_data.shape)
print("Data size:", org_data.size)

Data shape: (1240, 120)
Data size: 148800


### **1.2 Industry Variable**


In [None]:
industry = pd.read_excel(os.path.join(project_path, '資料集/007_v2/201301-202212上市櫃公司產業配對_非金融業.xlsx')).drop('TSE舊產業_名稱', axis=1)
industry

Unnamed: 0,公司簡稱,TSE 產業別,TSE新產業_名稱
0,1101 台泥,1,水泥工業
1,1102 亞泥,1,水泥工業
2,1103 嘉泥,1,水泥工業
3,1104 環泥,1,水泥工業
4,1108 幸福,1,水泥工業
...,...,...,...
1235,9949 琉園,32,文化創意業
1236,9950 萬國通,3,塑膠工業
1237,9951 皇田,5,電機機械
1238,9960 邁達康,18,貿易百貨


In [None]:
# List the TSE industry categories and the number of companies in each
industryCounts = industry[['TSE新產業_名稱']].value_counts().reset_index(name='counts')
industryCounts

Unnamed: 0,TSE新產業_名稱,counts
0,電子零組件,182
1,半導體,129
2,光電業,99
3,電腦及週邊,95
4,其他,91
5,通信網路業,79
6,電機機械,71
7,其他電子業,68
8,紡織纖維,51
9,鋼鐵工業,43


## **2. Method 1: Original Industry Categories**

In [None]:
# Original TSE industry codes (sorted, with gaps)
industry_category = industry['TSE 產業別'].astype("category").cat.categories
industry_category

Int64Index([ 1,  2,  3,  4,  5,  6,  8,  9, 10, 11, 12, 15, 16, 18, 20, 21, 23,
            24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34],
           dtype='int64')

In [None]:
# New consecutive codes, starting from 1
industry_category_new = np.array(range(1, len(industry_category)+1))
industry_category_new

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28])
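
The renumbering can also be written as an explicit old-to-new lookup, which makes the mapping easy to inspect. The sketch below is one possible way to do it, not the notebook's own code: `code_map` and `sample` are hypothetical names, and the code list is copied from the output above.

```python
import pandas as pd

# TSE industry codes as shown in the output above (note the gaps, e.g. 7, 13, 14).
codes = [1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 15, 16, 18, 20, 21, 23,
         24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]

# Explicit old -> new lookup: the sorted codes are numbered 1..28.
code_map = {old: new for new, old in enumerate(sorted(codes), start=1)}

# Hypothetical sample of raw codes for a few companies.
sample = pd.Series([1, 32, 3, 5, 18], name="TSE 產業別")
remapped = sample.map(code_map)
print(remapped.tolist())  # [1, 26, 3, 5, 14]
```

With `Series.map`, a code that is missing from the lookup becomes `NaN` instead of passing through unchanged, so unexpected codes are easier to spot.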

In [None]:
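# Build the industry variable: map the original TSE codes to the new consecutive codes
industry_data = industry[['TSE 產業別']].copy().replace(industry_category, industry_category_new)
# The index is assigned by position, which assumes `industry` and `org_data` list the same companies in the same order
industry_data.index = org_data.index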
industry_data.columns = ['industry']
industry_data

公司,industry
1101 台泥,1
1102 亞泥,1
1103 嘉泥,1
1104 環泥,1
1108 幸福,1
...,...
9951 皇田,26
9955 佳龍,3
9958 世紀鋼,5
9960 邁達康,14
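
Because the index is copied from `org_data` by position, it may be worth checking that both files list the same companies. In the displayed tails, the revenue table has 9955 佳龍 and 9958 世紀鋼 between 9951 皇田 and 9960 邁達康, while the industry table goes from 9951 皇田 (row 1237) straight to 9960 邁達康 (row 1238), so the two company lists appear to differ. The following is a minimal sketch of aligning by company name instead; `revenue` and `industry_toy` are hypothetical stand-ins for `org_data` and `industry`.

```python
import pandas as pd

# Hypothetical stand-in for `org_data`: indexed by company name.
revenue = pd.DataFrame(
    {"2013-01": [10, 20, 30]},
    index=pd.Index(["9951 皇田", "9955 佳龍", "9960 邁達康"], name="公司"),
)

# Hypothetical stand-in for `industry`: same length, different company list.
industry_toy = pd.DataFrame({
    "公司簡稱": ["9949 琉園", "9951 皇田", "9960 邁達康"],
    "industry": [26, 5, 14],
})

# Look the industry up by company name rather than by row position.
by_name = industry_toy.set_index("公司簡稱")["industry"]

# Companies with revenue data but no industry record.
missing = revenue.index.difference(by_name.index)

# Missing companies become NaN instead of silently taking another company's code.
aligned = by_name.reindex(revenue.index)
print(missing.tolist())
print(aligned)
```

If `missing` is empty, name-based alignment and the positional assignment agree whenever the row order is also the same.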


## **3. Method 2: Original Industry Categories + Frequency Encoder**

In [None]:
!pip install category_encoders

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting category_encoders
  Downloading category_encoders-2.6.0-py2.py3-none-any.whl (81 kB)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.6.0


In [None]:
from category_encoders import CountEncoder

encoder = CountEncoder(cols=['industry'])
industry_data = encoder.fit_transform(industry_data)

industry_data

公司,industry
1101 台泥,7
1102 亞泥,7
1103 嘉泥,7
1104 環泥,7
1108 幸福,7
...,...
9951 皇田,18
9955 佳龍,22
9958 世紀鋼,71
9960 邁達康,24
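
Frequency (count) encoding replaces each category with the number of rows that share it, which is why the five cement companies above all receive the value 7. A plain-pandas sketch of the same idea follows. It is an assumed equivalent for illustration, not the `category_encoders` implementation, and the toy data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: seven companies in four industries.
toy = pd.DataFrame({"industry": [1, 1, 1, 2, 2, 3, 4]}, index=list("ABCDEFG"))

# Number of companies per industry code.
counts = toy["industry"].value_counts()

# Replace each code with the size of its industry.
encoded = toy["industry"].map(counts)
print(encoded.tolist())  # [3, 3, 3, 2, 2, 1, 1]

# Industries 3 and 4 both have one company, so they share the encoded value 1.
print(toy["industry"].nunique(), encoded.nunique())  # 4 3
```

One consequence of this encoding is that industries with the same number of companies become indistinguishable after the transformation.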


## **4. Method 3: Four Major Groups**

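The industries are divided into four major groups (company counts in parentheses):
1. Electronic Components (電子零組件) (182)
2. Semiconductors (半導體) (129)
3. Other Electronics (其他電子業) (411): Optoelectronics, Computers and Peripherals, Communications and Internet, Information Services, Electronic Products Distribution, E-commerce, Other Electronics
4. Non-electronics (非電子業) (518): Electric Machinery, Textiles, Iron and Steel, Tourism, Chemicals, Trading and Consumer Goods, Automobiles, Shipping and Transportation, Food, Plastics, Cultural and Creative, Electrical Appliances and Cable, Oil, Gas and Electricity, Rubber, Depositary Receipts, Paper and Pulp, Cement, Glass and Ceramics, Agricultural Technology, Other

Note: Financials, Building Materials and Construction, and Biotechnology and Medical Care are excluded.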
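
The grouping rule above can be sketched as a single function applied to the TSE industry name. This is only an illustrative alternative under the same rule: `to_group`, `OTHER_ELECTRONICS`, and the sample series are hypothetical names, and the Chinese strings are the category values used in the data.

```python
import pandas as pd

# Industry names that are merged into the "other electronics" group.
OTHER_ELECTRONICS = ['光電業', '電腦及週邊', '通信網路業', '資訊服務業',
                     '電子通路業', '電子商務', '其他電子業']

def to_group(name: str) -> str:
    """Map a TSE industry name to one of the four major groups."""
    if name in ('電子零組件', '半導體'):
        return name          # these two industries stay as their own groups
    if name in OTHER_ELECTRONICS:
        return '其他電子業'    # other electronics
    return '非電子業'          # everything else is non-electronics

# Hypothetical sample of industry names.
sample = pd.Series(['水泥工業', '半導體', '光電業', '電子零組件', '其他'])
groups = sample.map(to_group)
print(groups.tolist())
```

Treating non-electronics as the fallback means any industry name not listed explicitly ends up in group 4, which matches the rule but also hides misspelled names.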

In [None]:
# Return the companies that belong to any of the given industries
def companyList(industryList: list) -> list:
    mask = industry['TSE新產業_名稱'].isin(industryList)
    return industry.loc[mask, '公司簡稱'].tolist()

In [None]:
# Split all companies into the four industry groups
components = companyList(['電子零組件'])
semiconductor = companyList(['半導體'])
electronics = companyList(['光電業', '電腦及週邊', '通信網路業', '資訊服務業', '電子通路業', '電子商務', '其他電子業'])
others = industry.loc[~industry['公司簡稱'].isin(components+semiconductor+electronics),:]['公司簡稱'].tolist()

if len(components+semiconductor+electronics+others) == industry.shape[0]:
    print(f"電子零組件：{len(components)} \n半導體：{len(semiconductor)} \n其他電子業：{len(electronics)} \n非電子業：{len(others)}")

電子零組件：182 
半導體：129 
其他電子業：411 
非電子業：518


In [None]:
# Assign each company to one of the four industry groups
industry_data = industry[['TSE新產業_名稱']].copy()
industry_data.columns = ['industry']
industry_data = industry_data.replace(['光電業', '電腦及週邊', '通信網路業', '資訊服務業', '電子通路業', '電子商務', '其他電子業'], '其他電子業')
industry_data = industry_data.where(industry_data.isin(['電子零組件', '半導體', '其他電子業']), other = '非電子業')
# Check that the group sizes match the company counts printed above
industryCounts = industry_data.value_counts().reset_index(name='counts')
industryCounts

Unnamed: 0,industry,counts
0,非電子業,518
1,其他電子業,411
2,電子零組件,182
3,半導體,129


In [None]:
# Get the industry group names (sorted)
industry_category = industry_data['industry'].astype("category").cat.categories
industry_category

Index(['其他電子業', '半導體', '電子零組件', '非電子業'], dtype='object')

In [None]:
# New integer codes, starting from 0
industry_category_new = np.array(range(0, len(industry_category)))
industry_category_new

array([0, 1, 2, 3])

In [None]:
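# Build the industry variable: map the group names to the new integer codes
industry_data = industry_data.replace(industry_category, industry_category_new)
# The index is assigned by position, which assumes `industry` and `org_data` list the same companies in the same order
industry_data.index = org_data.index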
industry_data

公司,industry
1101 台泥,3
1102 亞泥,3
1103 嘉泥,3
1104 環泥,3
1108 幸福,3
...,...
9951 皇田,3
9955 佳龍,3
9958 世紀鋼,3
9960 邁達康,3
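
The name-to-integer step can also be done through the categorical dtype, whose `cat.codes` follow the same sorted category order as shown above. A minimal sketch on a hypothetical sample (`groups` is an illustrative series, not the notebook's `industry_data`):

```python
import pandas as pd

# Hypothetical sample of group labels.
groups = pd.Series(['非電子業', '其他電子業', '電子零組件', '半導體', '非電子業'])

# Converting to a categorical sorts the categories; codes are their positions.
cat = groups.astype('category')
codes = cat.cat.codes

# Keep the code -> name lookup so the integers can be decoded later.
mapping = dict(enumerate(cat.cat.categories))
print(codes.tolist())  # [3, 0, 2, 1, 3]
print(mapping)
```

Keeping `mapping` alongside the encoded column documents which integer stands for which group.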


## **5. Method 4: Four Major Groups + Frequency Encoder**

In [None]:
encoder = CountEncoder(cols=['industry'])
industry_data = encoder.fit_transform(industry_data)

industry_data

公司,industry
1101 台泥,518
1102 亞泥,518
1103 嘉泥,518
1104 環泥,518
1108 幸福,518
...,...
9951 皇田,518
9955 佳龍,518
9958 世紀鋼,518
9960 邁達康,518
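
The encoded values here are raw group sizes, so their scale depends on the number of companies in the data set. As a hedged illustration, the sketch below converts the four group counts reported in Method 3 into shares of all companies. This normalization step is an assumption added for illustration and is not part of the notebook.

```python
import pandas as pd

# Group sizes taken from the Method 3 output.
group_counts = pd.Series({'非電子業': 518, '其他電子業': 411, '電子零組件': 182, '半導體': 129})

# Share of all companies in each group (relative frequency).
shares = group_counts / group_counts.sum()
print(int(group_counts.sum()))  # 1240
print(shares.round(4))
```

The total of 1240 matches the number of rows in the revenue table, and the shares carry the same ordering information as the counts on a 0 to 1 scale.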
