# Technical Paper:
# Text Mining and Document Classification Workflows for Chinese Administrative Documents

## File 5.2.a - cleaning and splitting the data

Proceeding in 4 steps: 
1. load packages
2. load data
3. segmentation and cleaning
4. split into training and test data

### 1. load packages

In [None]:
# packages
import os # connection to OS
import numpy as np
import pandas as pd # data manipulation
import jieba # segmenting Chinese text
import re # regular expressions
from sklearn.model_selection import train_test_split # splitting data

Python Version: 3.10.13

In [65]:
# versions
np.__version__, pd.__version__, jieba.__version__, re.__version__

('1.26.4', '2.2.3', '0.42.1', '2.2.1')

### 2. load the data

Proceed in 2 steps:
1. load the labelled data and inspect it
2. load the unlabelled data

#### 2.1 load labelled data

In [None]:
# set working directory
os.chdir("working_directory_path")  

In [3]:
# import data (Windows)
data = pd.read_csv('./QRY_export_train_dta.csv', encoding='utf-8')

In [4]:
# check the shape
data.shape

(20458, 6)

In [5]:
# check data types
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20458 entries, 0 to 20457
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   doc_index          20458 non-null  int64 
 1   sen_index          20458 non-null  int64 
 2   ran200doc          20458 non-null  int64 
 3   sentences          20288 non-null  object
 4   coverage_broad_AM  20458 non-null  object
 5   coverage_AM        20458 non-null  object
dtypes: int64(3), object(3)
memory usage: 959.1+ KB


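Columns 3 to 5 (`sentences`, `coverage_broad_AM` and `coverage_AM`) are stored as "object", i.e. as strings. The non-null count of `sentences` (20,288 of 20,458 rows) points to missing observations.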

In [6]:
# check for missing observations
data.isnull().sum()

doc_index              0
sen_index              0
ran200doc              0
sentences            170
coverage_broad_AM      0
coverage_AM            0
dtype: int64

There are 170 empty fields in the `sentences` column.

In [7]:
data.head()

Unnamed: 0,doc_index,sen_index,ran200doc,sentences,coverage_broad_AM,coverage_AM
0,9389,272309,2,（三）社会捐赠及其它资金,FALSCH,FALSCH
1,9389,272325,2,第三十条　本办法自2008年12月1日起施行,FALSCH,FALSCH
2,9571,278003,1,恶性肿瘤内分泌治疗定点医疗机构为具有治疗该病资质的医疗保险定点医疗机构，参保人员进行恶性肿瘤...,FALSCH,FALSCH
3,9687,282269,2,定点医疗机构在自愿基础上，从中选择适宜病种实施，逐步扩大实施病种范围,FALSCH,FALSCH
4,9687,282270,2,（二）专家论证，科学合理,FALSCH,FALSCH


In [8]:
# overview of narrow y
data['coverage_AM'].value_counts() 

coverage_AM
FALSCH    20040
WAHR        418
Name: count, dtype: int64

In [9]:
# overview of broad y
data['coverage_broad_AM'].value_counts() 

coverage_broad_AM
FALSCH    19308
WAHR       1150
Name: count, dtype: int64

In [11]:
# check the data type of the coverage variable
type(data['coverage_AM'][1])

str

The coverage variable is formatted as a string.

Drop the missing observations and transform y into integer format:

In [14]:
data_labelled = data.dropna(axis = 'index', how = 'any').copy() # remove missing observations
data_labelled['y_narrow'] = data_labelled['coverage_AM'].replace({'WAHR': 1, 'FALSCH': 0}) # transform narrow coverage 
data_labelled['y_broad'] = data_labelled['coverage_broad_AM'].replace({'WAHR': 1, 'FALSCH': 0}) # transform broad coverage 
data_labelled = data_labelled.drop(['coverage_AM', 'coverage_broad_AM'], axis = 1) # drop the original y columns
data_labelled



Unnamed: 0,doc_index,sen_index,ran200doc,sentences,y_narrow,y_broad
0,9389,272309,2,（三）社会捐赠及其它资金,0,0
1,9389,272325,2,第三十条　本办法自2008年12月1日起施行,0,0
2,9571,278003,1,恶性肿瘤内分泌治疗定点医疗机构为具有治疗该病资质的医疗保险定点医疗机构，参保人员进行恶性肿瘤...,0,0
3,9687,282269,2,定点医疗机构在自愿基础上，从中选择适宜病种实施，逐步扩大实施病种范围,0,0
4,9687,282270,2,（二）专家论证，科学合理,0,0
...,...,...,...,...,...,...
20453,31060,999534,40,二、调整城乡居民医保待遇保障标准 （一）稳步提升基本医保待遇水平,0,0
20454,31060,999541,40,2020年起在一个医疗保险年度内，参保人员住院报销和门诊报销医疗费累计最高支付限额为20万元,0,0
20455,31060,999542,40,起付线以上、最高支付限额以内的合规医疗费用，由基本医疗保险统筹基金和参保个人按比例支付,0,0
20456,31060,999562,40,同时，要着眼促进乡村振兴战略实施，建立防范和化解因病致贫、因病返贫的长效机制,0,0
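
The labels `WAHR` and `FALSCH` are the German spellings of TRUE and FALSE. As a minimal sketch of the recoding step, the following stand-alone snippet maps them to integers and fails loudly on any unexpected value. `LABEL_MAP` and `encode_labels` are illustrative names and are not part of the original notebook.

```python
# Hypothetical helper (not in the original notebook): recode the German
# boolean strings used in the coverage columns into 0/1 integers.
LABEL_MAP = {"WAHR": 1, "FALSCH": 0}


def encode_labels(values):
    """Return a list of 0/1 integers; raise if a value is not WAHR/FALSCH."""
    unknown = set(values) - set(LABEL_MAP)
    if unknown:
        raise ValueError(f"unexpected label(s): {sorted(unknown)}")
    return [LABEL_MAP[value] for value in values]


encoded = encode_labels(["FALSCH", "WAHR", "FALSCH"])
print(encoded)
```

On a pandas Series, `Series.map(LABEL_MAP)` performs the same lookup but leaves unmapped values as `NaN`, so a subsequent check for missing values would serve the same purpose as the explicit error here.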


#### 2.2 Load unlabelled data

Load the unlabelled data and check for missing observations.

In [18]:
data2 = pd.read_csv('./health_sentences_20231123.csv', encoding='utf-8')
data2

Unnamed: 0,doc_index,ran200doc,sentences,ran20sen,sen_index
0,1,137,《中外合资、合作医疗机构管理暂行办法》的补充规定 中华...,16,1
1,1,137,卫生部部长：陈竺商务部部长：陈德铭二○○七年十二月三十日 《中外合资、合作医疗机构管理暂行...,18,2
2,1,137,二、本规定中香港、澳门服务提供者应分别符合《内地与香港关于建立更紧密经贸关系的安排》及《...,18,3
3,1,137,三、香港、澳门服务提供者在内地设立合资、合作医疗机构的其他规定，仍参照《中外合资、合作医...,7,4
4,1,137,四、本规定自2008年1月1日起施行,16,5
...,...,...,...,...,...
1003131,31138,35,第四章　附则 第二十四条　本技术方案从二00八年六月一日起统一执行，同时原制定的补偿方案作废,8,1003132
1003132,31138,35,第二十五条　本方案由龙胜各族自治县新型农村合作医疗管理办公室负责解释,9,1003133
1003133,31138,35,,9,1003134
1003134,31139,91,龙胜各族自治县人民政府关于成立自治县“健康扶贫·医疗救助”公益基金管理工作领导小组的通知龙胜...,14,1003135


There are 1,003,136 observations ...

In [16]:
data2.isnull().sum() # check missing observations

doc_index       0
ran200doc       0
sentences    8632
ran20sen        0
sen_index       0
dtype: int64

... 8,632 of which include no data in the sentences column.

In [20]:
# drop missing observations
data_unlabelled = data2.dropna(axis = 'index', how = 'any').copy()
data_unlabelled

Unnamed: 0,doc_index,ran200doc,sentences,ran20sen,sen_index
0,1,137,《中外合资、合作医疗机构管理暂行办法》的补充规定 中华...,16,1
1,1,137,卫生部部长：陈竺商务部部长：陈德铭二○○七年十二月三十日 《中外合资、合作医疗机构管理暂行...,18,2
2,1,137,二、本规定中香港、澳门服务提供者应分别符合《内地与香港关于建立更紧密经贸关系的安排》及《...,18,3
3,1,137,三、香港、澳门服务提供者在内地设立合资、合作医疗机构的其他规定，仍参照《中外合资、合作医...,7,4
4,1,137,四、本规定自2008年1月1日起施行,16,5
...,...,...,...,...,...
1003130,31138,35,否则不予报销,7,1003131
1003131,31138,35,第四章　附则 第二十四条　本技术方案从二00八年六月一日起统一执行，同时原制定的补偿方案作废,8,1003132
1003132,31138,35,第二十五条　本方案由龙胜各族自治县新型农村合作医疗管理办公室负责解释,9,1003133
1003134,31139,91,龙胜各族自治县人民政府关于成立自治县“健康扶贫·医疗救助”公益基金管理工作领导小组的通知龙胜...,14,1003135


After dropping these empty sentences, 994,504 observations remain.

### 3. Segmentation and cleaning

The data requires further preparation:
1. The text needs to be segmented, because Chinese does not separate words with spaces.
2. Some cleaning is in order to remove ideographic spaces, punctuation and stop words.


In [21]:
# display example sentence with ideographic space
data_unlabelled['sentences'][555]

'\u3000\u3000第十一条\u3000医疗保障部门要建立健全医疗救助基金预算制度、财务会计制度和内部审计制度'
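
The ideographic space (U+3000) is treated as whitespace by Python's string methods, which is what the later `word.strip()` check relies on. The short sketch below illustrates this on a shortened version of the example sentence; the token list is made up for illustration.

```python
sentence = "\u3000\u3000第十一条\u3000医疗保障部门"

# U+3000 counts as whitespace for Python string methods.
print("\u3000".isspace())

# strip() removes leading and trailing ideographic spaces, but not inner ones.
print(repr(sentence.strip()))

# Tokens that consist only of whitespace become empty after strip()
# and can be dropped with a simple truthiness test.
tokens = ["\u3000", "第十一条", "\u3000", "医疗"]
kept = [token for token in tokens if token.strip()]
print(kept)
```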

#### 3.1 Segmentation

Use the jieba library to tokenize the sentences. The output will be stored as a list of words.

In [22]:
# tokenize the sentences
data_unlabelled['tokenized_sen'] = [jieba.lcut(Content) for Content in data_unlabelled['sentences']]

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Paul\AppData\Local\Temp\jieba.cache
Loading model cost 0.630 seconds.
Prefix dict has been built successfully.


In [28]:
# inspect the data
data_unlabelled['tokenized_sen'][555]

['\u3000',
 '\u3000',
 '第十一条',
 '\u3000',
 '医疗',
 '保障部门',
 '要',
 '建立健全',
 '医疗',
 '救助',
 '基金',
 '预算',
 '制度',
 '、',
 '财务会计',
 '制度',
 '和',
 '内部',
 '审计',
 '制度']

#### 3.2 Remove punctuation and stop words

For the approaches based on bag-of-words, stop words and punctuation should be removed.
Here, a general list of stop words is removed, as well as Latin characters (typically names of pharmaceutical or chemical substances) and numbers. These are not relevant predictors for coverage rules.

Percentages are retained, as they occasionally depict coverage targets.

In [45]:
# Create a list of all unique tokens
token_list = list(set([token for sentence in data_unlabelled['tokenized_sen'] for token in sentence]))
token_list[1000:1150]


['1807',
 '潮王',
 '13877871800',
 '组和未入',
 '2.927',
 '良好',
 '20163770331',
 '蒋江涛',
 '地黄',
 '赋安堂',
 'd25292',
 '250402062',
 '85504501',
 '33.80',
 '为民服务',
 '殷金莎',
 '2200220022004031090300403',
 '石贯子',
 'Dixon',
 '丁二钠',
 '元宗',
 '王民登',
 '无渣',
 'AG145000AG145000',
 '免疫检测',
 '豆衣',
 '8400800',
 '乡林峒',
 'S01EC',
 '是现',
 '1212T450703206',
 '49000907',
 '镇小和山',
 '波形蛋白',
 '3.377%',
 '低病',
 '26982428.22158',
 'psnfixmedin',
 '150001200030001300078005200',
 '剂量测定',
 '梅格施',
 '诺氟沙星',
 '14996118',
 '陕价费',
 '康妮',
 '胃丸',
 '为卫计',
 '肛',
 '生产资料',
 '节点均',
 '33X0236',
 '4792',
 '受益者',
 '11122dhlw0dhlwcyf0146903',
 '晚重',
 '2016377202060231',
 '卢军伟',
 '受伤者',
 '124ldky0ldkyxyf0296301',
 '减下',
 '500500002055',
 '1215218',
 '乔庄',
 '函杨鹏',
 '50013120040000002',
 '黎渊弘',
 '条本',
 '460401005103',
 '258.00519',
 '3260177',
 '061912402004000091',
 '五月份',
 'lsyd272Abc123456253',
 '内障',
 '孔繁森',
 '450230200012',
 '84GTX79',
 '陈昌旭',
 '手持',
 '68322',
 '31140005602',
 '59S019D0836286',
 '5962897639612041',
 '6213488',
 '84120

Check the length: there are 336,989 different tokens in the list.

In [46]:
len(token_list)

336989

Load a prepared stop-word list that contains commonly used words as well as punctuation marks and Latin characters.

In [None]:
with open("./stopwords/zh.txt", 'r', encoding='utf-8') as file:
    stop_words = {line.strip() for line in file if line.strip()}

In [48]:
stop_words

{'"',
 '%%%%',
 '%%%%%',
 '%-',
 '(',
 ')',
 '-',
 '----%',
 '.',
 '[',
 ']',
 '×',
 '“',
 '”',
 '≥',
 '①',
 '②',
 '⑵',
 '⒈',
 '、',
 '。',
 '〈',
 '〉',
 '《',
 '》',
 '〔',
 '〕',
 '一',
 '一个',
 '一些',
 '一何',
 '一切',
 '一则',
 '一方面',
 '一旦',
 '一来',
 '一样',
 '一种',
 '一般',
 '一转眼',
 '七',
 '万一',
 '三',
 '上',
 '上下',
 '下',
 '不',
 '不仅',
 '不但',
 '不光',
 '不单',
 '不只',
 '不外乎',
 '不如',
 '不妨',
 '不尽',
 '不尽然',
 '不得',
 '不怕',
 '不惟',
 '不成',
 '不拘',
 '不料',
 '不是',
 '不比',
 '不然',
 '不特',
 '不独',
 '不管',
 '不至于',
 '不若',
 '不论',
 '不过',
 '不问',
 '与',
 '与其',
 '与其说',
 '与否',
 '与此同时',
 '且',
 '且不说',
 '且说',
 '两者',
 '个',
 '个别',
 '中',
 '临',
 '为',
 '为了',
 '为什么',
 '为何',
 '为止',
 '为此',
 '为着',
 '乃',
 '乃至',
 '乃至于',
 '么',
 '之',
 '之一',
 '之所以',
 '之类',
 '乌乎',
 '乎',
 '乘',
 '九',
 '也',
 '也好',
 '也罢',
 '了',
 '二',
 '二来',
 '于',
 '于是',
 '于是乎',
 '云云',
 '云尔',
 '五',
 '些',
 '亦',
 '人',
 '人们',
 '人家',
 '什',
 '什么',
 '什么样',
 '今',
 '介于',
 '仍',
 '仍旧',
 '从',
 '从此',
 '从而',
 '他',
 '他人',
 '他们',
 '他们们',
 '以',
 '以上',
 '以为',
 '以便',
 '以免',
 '以及',
 '以故',
 '以期',
 '以来',
 '以至',
 '以

Filter out stop words, numbers, Latin-character tokens, whitespace and empty tokens:

In [None]:
# Function to check if a string is a number
def is_number(value):
    try:
        float(value)  # try converting to a float
        return True
    except ValueError:
        return False

data_unlabelled['tokenized_sen_filtered'] = data_unlabelled['tokenized_sen'].apply(
    lambda tokens: [word for word in tokens
                    if word # no None or empty tokens
                    and word.strip() # no whitespace-only tokens (including ideographic spaces)
                    and not re.match(r'^[a-zA-Z0-9 !#*\'&_\-,\.+]+$', word) # remove tokens made up of Latin characters, ASCII digits and punctuation
                    and not re.match(r'^[\d\s\.]+$', word) # remove tokens made up of digits, whitespace and dots
                    and not is_number(word) # remove floats coded as strings
                    and word not in stop_words] # no stop words
)

Take a look at the list of tokens: their number has been reduced considerably. Common stop words have been removed, as well as much (though not all) of the noise.
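
To make the effect of the individual conditions easier to follow, here is a small stand-alone sketch that applies the same checks to a handful of tokens taken from the listings above. The helper `keep_token`, the pattern names and the three-entry stop-word set are assumptions for illustration only; the real stop-word file is much longer.

```python
import re

# Patterns restating the two regular expressions used in the filter.
LATIN_NUM_PUNCT = re.compile(r"^[a-zA-Z0-9 !#*'&_\-,.+]+$")
DIGITS_SPACES = re.compile(r"^[\d\s.]+$")


def is_number(value):
    """Return True if the string can be converted to a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def keep_token(token, stop_words):
    """Hypothetical helper: True if the token survives all filter conditions."""
    return bool(
        token
        and token.strip()
        and not LATIN_NUM_PUNCT.match(token)
        and not DIGITS_SPACES.match(token)
        and not is_number(token)
        and token not in stop_words
    )


stop_words = {"、", "和", "要"}  # tiny stand-in for the full stop-word file
sample = ["\u3000", "第十一条", "医疗", "、", "1807", "Dixon",
          "S01EC", "3.377%", "和", "诺氟沙星"]
kept = [token for token in sample if keep_token(token, stop_words)]
print(kept)
```

The percentage `3.377%` passes because `%` is not part of either character class and the string cannot be converted to a float, which is consistent with the decision to retain percentages.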

In [50]:
# Create a list of all unique tokens that remain after filtering
token_list_2 = list(set([token for sentence in data_unlabelled['tokenized_sen_filtered'] for token in sentence]))
len(token_list_2)

150436

In [51]:
# Sort the unique tokens in descending order (by Unicode code point)
sorted_list = sorted(token_list_2, reverse=True)
sorted_list[1000:1150]

['黎炳兰',
 '黎渊弘',
 '黎海澜',
 '黎民',
 '黎正刚',
 '黎梅',
 '黎树华',
 '黎朝',
 '黎晔',
 '黎晓燕',
 '黎晓兴',
 '黎春林',
 '黎明村',
 '黎明',
 '黎旭',
 '黎族',
 '黎承杨',
 '黎惠声',
 '黎思艳',
 '黎志凌',
 '黎庆娟',
 '黎平县',
 '黎平',
 '黎巴嫩',
 '黎巧',
 '黎川县',
 '黎川',
 '黎展',
 '黎小明',
 '黎宏颖',
 '黎宏祥',
 '黎子生',
 '黎培戈',
 '黎城县',
 '黎坤',
 '黎国',
 '黎四龙',
 '黎善',
 '黎卫昌',
 '黎华店',
 '黎华',
 '黎勒',
 '黎力',
 '黎公姑',
 '黎克江',
 '黎仲光区',
 '黎仕书',
 '黎亿峰',
 '黎习',
 '黎乐宁',
 '黎丽',
 '黎一华',
 '黎',
 '黍米',
 '黉',
 '黄龙镇',
 '黄龙县',
 '黄龙',
 '黄黛片',
 '黄黛晴',
 '黄黑牙',
 '黄黎娟',
 '黄麻',
 '黄麓',
 '黄鹤楼',
 '黄鹏',
 '黄鹂',
 '黄鸿海',
 '黄鸿',
 '黄鸣鹤',
 '黄鳝',
 '黄高壮',
 '黄高',
 '黄骅市',
 '黄颖倩市',
 '黄颖',
 '黄顺林',
 '黄韬',
 '黄韦吉',
 '黄静文',
 '黄静',
 '黄靖',
 '黄霜霞',
 '黄雪颖',
 '黄雪雁',
 '黄雪智',
 '黄雪媚',
 '黄雪',
 '黄雨顶',
 '黄雄兰',
 '黄雄',
 '黄隆',
 '黄陶',
 '黄陵县',
 '黄陵',
 '黄陂区',
 '黄陂',
 '黄阁',
 '黄闻欢',
 '黄门',
 '黄镇',
 '黄锦辉',
 '黄锦玲',
 '黄锦泰',
 '黄锦师',
 '黄锦峰',
 '黄锦',
 '黄锋杰',
 '黄锋',
 '黄铭杰',
 '黄铜',
 '黄铁矿',
 '黄铁明',
 '黄钠',
 '黄鑫',
 '黄金村',
 '黄金时代',
 '黄金搭档',
 '黄金山',
 '黄金周',
 '黄金',
 '黄里明',
 '黄酮醇',
 '黄酮',
 '黄酒',
 '黄邵先',
 '黄道',
 '黄通秘',
 '黄连素',
 '黄连',
 '黄远林',
 '黄

Let's check the entries that were emptied in the process: some 978 rows contained nothing but dates, marks, punctuation and the like. They are removed from the data.

In [53]:
# Check for empty lists in the Series
empty_lists_mask = data_unlabelled['tokenized_sen_filtered'].apply(lambda x: len(x) == 0)
data_unlabelled[empty_lists_mask]


Unnamed: 0,doc_index,ran200doc,sentences,ran20sen,sen_index,tokenized_sen,tokenized_sen_filtered
2431,59,21,,4,2432,[ ],[]
5174,135,119,,9,5175,[ ],[]
18978,1110,16,)”,7,18979,"[), ”]",[]
18985,1110,16,”,19,18986,[”],[]
18989,1110,16,”,11,18990,[”],[]
...,...,...,...,...,...,...,...
990649,30819,90,2012年12月19日,3,990650,"[　, , 2012, 年, 12, 月, 19, 日]",[]
990732,30821,182,2007年12月3日,15,990733,"[　, , 2007, 年, 12, 月, 3, 日]",[]
990740,30822,39,2004年11月2日,13,990741,"[ , , , , , , , , , , , , , , , ...",[]
995708,30967,58,其他同上,17,995709,"[其他, 同, 上]",[]


In [55]:
# remove the empty lists
data_unlabelled = data_unlabelled[~empty_lists_mask]
data_unlabelled

Unnamed: 0,doc_index,ran200doc,sentences,ran20sen,sen_index,tokenized_sen,tokenized_sen_filtered
0,1,137,《中外合资、合作医疗机构管理暂行办法》的补充规定 中华...,16,1,"[ , , , , , , , , , , , , 《, 中外合资, ...","[中外合资, 合作医疗, 机构, 管理, 暂行办法, 补充规定, 中华人民共和国, 卫生部,..."
1,1,137,卫生部部长：陈竺商务部部长：陈德铭二○○七年十二月三十日 《中外合资、合作医疗机构管理暂行...,18,2,"[卫生部, 部长, ：, 陈竺, 商务部, 部长, ：, 陈德铭, 二, ○, ○, 七年,...","[卫生部, 部长, 陈竺, 商务部, 部长, 陈德铭, ○, ○, 七年, 十二月, 三十日..."
2,1,137,二、本规定中香港、澳门服务提供者应分别符合《内地与香港关于建立更紧密经贸关系的安排》及《...,18,3,"[　, , 二, 、, 本, 规定, 中, 香港, 、, 澳门, 服务提供者, 应, 分别...","[规定, 香港, 澳门, 服务提供者, 应, 符合, 内地, 香港, 建立, 紧密, 经贸关..."
3,1,137,三、香港、澳门服务提供者在内地设立合资、合作医疗机构的其他规定，仍参照《中外合资、合作医...,7,4,"[　, , 三, 、, 香港, 、, 澳门, 服务提供者, 在, 内地, 设立, 合资, ...","[香港, 澳门, 服务提供者, 内地, 设立, 合资, 合作医疗, 机构, 规定, 参照, ..."
4,1,137,四、本规定自2008年1月1日起施行,16,5,"[　, , 四, 、, 本, 规定, 自, 2008, 年, 1, 月, 1, 日起, 施行]","[规定, 日起, 施行]"
...,...,...,...,...,...,...,...
1003130,31138,35,否则不予报销,7,1003131,"[否则, 不予, 报销]","[不予, 报销]"
1003131,31138,35,第四章　附则 第二十四条　本技术方案从二00八年六月一日起统一执行，同时原制定的补偿方案作废,8,1003132,"[　, 第四章, , 附则, , , 第二十四条, , 本, 技术, 方案, 从, ...","[第四章, 附则, 第二十四条, 技术, 方案, 八年, 六月, 一日, 统一, 执行, 原..."
1003132,31138,35,第二十五条　本方案由龙胜各族自治县新型农村合作医疗管理办公室负责解释,9,1003133,"[　, , 第二十五条, , 本, 方案, 由, 龙胜各族自治县, 新型农村, 合作医疗...","[第二十五条, 方案, 龙胜各族自治县, 新型农村, 合作医疗, 管理, 办公室, 负责, 解释]"
1003134,31139,91,龙胜各族自治县人民政府关于成立自治县“健康扶贫·医疗救助”公益基金管理工作领导小组的通知龙胜...,14,1003135,"[龙胜各族自治县, 人民政府, 关于, 成立, 自治县, “, 健康, 扶贫, ·, 医疗,...","[龙胜各族自治县, 人民政府, 成立, 自治县, 健康, 扶贫, ·, 医疗, 救助, 公益..."


#### 3.3 save unlabelled data

In [None]:
# save
data_unlabelled.to_csv('./data_unlabelled_tok_fil.csv', index=False)

In [15]:
# load again
data_unlabelled = pd.read_csv('./data_unlabelled_tok_fil.csv', encoding='utf-8')

In [57]:
data_unlabelled

Unnamed: 0,doc_index,ran200doc,sentences,ran20sen,sen_index,tokenized_sen,tokenized_sen_filtered
0,1,137,《中外合资、合作医疗机构管理暂行办法》的补充规定 中华...,16,1,"[ , , , , , , , , , , , , 《, 中外合资, ...","[中外合资, 合作医疗, 机构, 管理, 暂行办法, 补充规定, 中华人民共和国, 卫生部,..."
1,1,137,卫生部部长：陈竺商务部部长：陈德铭二○○七年十二月三十日 《中外合资、合作医疗机构管理暂行...,18,2,"[卫生部, 部长, ：, 陈竺, 商务部, 部长, ：, 陈德铭, 二, ○, ○, 七年,...","[卫生部, 部长, 陈竺, 商务部, 部长, 陈德铭, ○, ○, 七年, 十二月, 三十日..."
2,1,137,二、本规定中香港、澳门服务提供者应分别符合《内地与香港关于建立更紧密经贸关系的安排》及《...,18,3,"[　, , 二, 、, 本, 规定, 中, 香港, 、, 澳门, 服务提供者, 应, 分别...","[规定, 香港, 澳门, 服务提供者, 应, 符合, 内地, 香港, 建立, 紧密, 经贸关..."
3,1,137,三、香港、澳门服务提供者在内地设立合资、合作医疗机构的其他规定，仍参照《中外合资、合作医...,7,4,"[　, , 三, 、, 香港, 、, 澳门, 服务提供者, 在, 内地, 设立, 合资, ...","[香港, 澳门, 服务提供者, 内地, 设立, 合资, 合作医疗, 机构, 规定, 参照, ..."
4,1,137,四、本规定自2008年1月1日起施行,16,5,"[　, , 四, 、, 本, 规定, 自, 2008, 年, 1, 月, 1, 日起, 施行]","[规定, 日起, 施行]"
...,...,...,...,...,...,...,...
1003130,31138,35,否则不予报销,7,1003131,"[否则, 不予, 报销]","[不予, 报销]"
1003131,31138,35,第四章　附则 第二十四条　本技术方案从二00八年六月一日起统一执行，同时原制定的补偿方案作废,8,1003132,"[　, 第四章, , 附则, , , 第二十四条, , 本, 技术, 方案, 从, ...","[第四章, 附则, 第二十四条, 技术, 方案, 八年, 六月, 一日, 统一, 执行, 原..."
1003132,31138,35,第二十五条　本方案由龙胜各族自治县新型农村合作医疗管理办公室负责解释,9,1003133,"[　, , 第二十五条, , 本, 方案, 由, 龙胜各族自治县, 新型农村, 合作医疗...","[第二十五条, 方案, 龙胜各族自治县, 新型农村, 合作医疗, 管理, 办公室, 负责, 解释]"
1003134,31139,91,龙胜各族自治县人民政府关于成立自治县“健康扶贫·医疗救助”公益基金管理工作领导小组的通知龙胜...,14,1003135,"[龙胜各族自治县, 人民政府, 关于, 成立, 自治县, “, 健康, 扶贫, ·, 医疗,...","[龙胜各族自治县, 人民政府, 成立, 自治县, 健康, 扶贫, ·, 医疗, 救助, 公益..."


### 4. Prepare training and test data

#### 4.1 add tokenized sentences to training data

For the training data, we need tokenized and filtered sentences. Since we already created those for the entire dataset above, we can simply merge them into the training data.

In [41]:
data_labelled

Unnamed: 0,doc_index,sen_index,ran200doc,sentences,y_narrow,y_broad
0,9389,272309,2,（三）社会捐赠及其它资金,0,0
1,9389,272325,2,第三十条　本办法自2008年12月1日起施行,0,0
2,9571,278003,1,恶性肿瘤内分泌治疗定点医疗机构为具有治疗该病资质的医疗保险定点医疗机构，参保人员进行恶性肿瘤...,0,0
3,9687,282269,2,定点医疗机构在自愿基础上，从中选择适宜病种实施，逐步扩大实施病种范围,0,0
4,9687,282270,2,（二）专家论证，科学合理,0,0
...,...,...,...,...,...,...
20453,31060,999534,40,二、调整城乡居民医保待遇保障标准 （一）稳步提升基本医保待遇水平,0,0
20454,31060,999541,40,2020年起在一个医疗保险年度内，参保人员住院报销和门诊报销医疗费累计最高支付限额为20万元,0,0
20455,31060,999542,40,起付线以上、最高支付限额以内的合规医疗费用，由基本医疗保险统筹基金和参保个人按比例支付,0,0
20456,31060,999562,40,同时，要着眼促进乡村振兴战略实施，建立防范和化解因病致贫、因病返贫的长效机制,0,0


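In the process, we lose a few labelled observations: sentences that contained only stop words or other filtered tokens were dropped from the unlabelled data above and therefore find no match in the merge.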

In [59]:
# count labelled sentences without a match in the filtered unlabelled data
data_labelled[~data_labelled['sen_index'].isin(data_unlabelled["sen_index"])].shape

(24, 6)

In [60]:
data_merge = data_unlabelled[['sen_index', 'tokenized_sen_filtered']] # extract tokenized sentences
data_labelled = data_merge.merge(data_labelled, how= "inner", on = ["sen_index"]) # merge with training data
data_labelled

Unnamed: 0,sen_index,tokenized_sen_filtered,doc_index,ran200doc,sentences,y_narrow,y_broad
0,876,"[牵头, 单位, 市, 财政局, 责任, 单位, 市人, 社局, 市, 医保, 局, 市卫,...",21,31,（牵头单位：市财政局；责任单位：市人社局、市医保局、市卫计委、市审计局、各区、县政府） 五...,0,0
1,907,"[基金, 独立核算, 专户, 管理, 单位, 个人, 挤占, 挪用]",21,31,基金独立核算、专户管理，任何单位和个人不得挤占挪用,0,0
2,921,"[完善, 支付, 方式]",21,31,（四）完善支付方式,0,0
3,926,"[卫生, 计生, 行政部门, 结合, 医药卫生, 体制改革, 加强, 医疗, 服务, 监管,...",21,31,卫生计生行政部门要结合医药卫生体制改革，加强医疗服务监管，规范医疗服务行为，提升医疗服务水平,0,0
4,949,"[市人, 社局, 牵头, 市, 编办, 市卫, 计委, 市, 财政局, 市, 医保, 局, ...",21,31,（市人社局牵头，市编办、市卫计委、市财政局、市医保局、各区、县政府配合） （三）基金移交,0,0
...,...,...,...,...,...,...,...
20259,1003042,"[家庭, 帐户, 占, 基金]",31138,35,家庭帐户占基金的12.5％,0,0
20260,1003059,"[报账, 需提交, 材料, 户口簿, 合作医疗, 证, 当年, 参加, 新型农村, 合作医疗...",31138,35,（二）报账需提交材料：户口簿、合作医疗证、当年参加新型农村合作医疗的缴费发票、疾病诊断书...,0,0
20261,1003060,"[第十二条, 参加, 新型农村, 合作医疗, 农村, 住院, 分娩, 产妇, 每例, 补助,...",31138,35,第十二条　参加新型农村合作医疗的农村住院分娩的产妇每例补助100元（剖宫产、产妇若合并大出血...,0,0
20262,1003078,"[县级, 及县, 外, 医疗机构, 床位费, 元, 计入, 乡, 镇, 医院, 实际, 床位...",31138,35,（三）县级及县外医疗机构床位费按12元计入，乡、镇医院按实际床位收费标准计入报账基数,0,0


#### 4.2 split the data along y_broad

There are only a few positive observations relative to the sample size.
It is important to keep their share the same in the training and test data, which is why the split is stratified on `y_broad`.
Here, we assign roughly one third of the observations (`test_size=0.33`) to the test data.

In [61]:
pd.crosstab(data_labelled['y_narrow'], data_labelled['y_broad'])

y_broad,0,1
y_narrow,Unnamed: 1_level_1,Unnamed: 2_level_1
0,19114,732
1,0,418


In [66]:
# Perform stratified split for bag of words
X_toksen_train, X_toksen_test, y_broad_train, y_broad_test = train_test_split(
    data_labelled['tokenized_sen_filtered'], data_labelled['y_broad'], 
    test_size=0.33, stratify=data_labelled['y_broad'], random_state=42)
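
What stratification achieves can be illustrated with a hand-written sketch that uses only the standard library. This is not scikit-learn's implementation; `stratified_split` and `positive_share` are hypothetical helpers, and only the class counts are taken from the cross-tabulation above (19,114 negative and 732 + 418 = 1,150 positive `y_broad` labels).

```python
import random


def stratified_split(labels, test_size=0.33, seed=42):
    """Split row indices so that each class sends about test_size of its rows to the test set."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random order within each class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def positive_share(labels, indices):
    """Share of positive labels among the given row indices."""
    return sum(labels[i] for i in indices) / len(indices)


# Class counts for y_broad as reported in the cross-tabulation.
labels = [0] * 19114 + [1] * 1150
train_idx, test_idx = stratified_split(labels)

print(round(positive_share(labels, train_idx), 4),
      round(positive_share(labels, test_idx), 4))
```

Because the split is carried out within each class, the share of positive observations is nearly identical in both parts, which a purely random split does not guarantee for a rare class.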

For Transformers, we can directly use the entire sentences, without tokenization and cleaning.

In [67]:
# Perform stratified split for full sentences
X_sen_train, X_sen_test, y_broad_train, y_broad_test = train_test_split(
    data_labelled['sentences'], data_labelled['y_broad'],
      test_size=0.33, stratify=data_labelled['y_broad'], random_state=42)

Save the training and test data as CSV files:

In [None]:
# Write each split to the y_broad folder
X_toksen_train.to_csv('./y_broad/X_toksen_train.csv', index=False)
X_toksen_test.to_csv('./y_broad/X_toksen_test.csv', index=False)
X_sen_train.to_csv('./y_broad/X_sen_train.csv', index=False)
X_sen_test.to_csv('./y_broad/X_sen_test.csv', index=False)
y_broad_train.to_csv('./y_broad/y_broad_train.csv', index=False)
y_broad_test.to_csv('./y_broad/y_broad_test.csv', index=False)

In [25]:
X_sen_train.head()

16653    加强部门间工作协同，全面对接社会救助经办服务，各地（市）在制定依申请医疗救助规程中，要按照部...
12681                   按规定定期向社会公布基金收支情况和参合人员待遇享受情况，接受社会监督
15709    事实无人抚养儿童监护人或受监护人委托的近亲属填写《事实无人抚养儿童基本生活补贴申请表》（见附...
3021                       　　一、慢性病种补偿名录　　（一）呼吸系统：慢性支气管炎肺气肿
19133           　　（二）市外省内定点医疗机构住院医疗待遇的起付标准和支付比例均不分定点医疗机构级别
Name: sentences, dtype: object

In [26]:
X_toksen_train.head()

16653    ['加强', '部门', '间', '工作', '协同', '全面', '对接', '社会'...
12681    ['按规定', '定期', '社会', '公布', '基金', '收支', '情况', '参...
15709    ['事实', '无人', '抚养', '儿童', '监护人', '受', '监护人', '委...
3021     ['慢性病', '种', '补偿', '名录', '呼吸系统', '慢性', '支气管炎',...
19133    ['市', '外', '省内', '定点', '医疗机构', '住院', '医疗', '待遇...
Name: tokenized_sen_filtered, dtype: object
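
The quoted tokens in this output indicate that each cell now holds the string representation of a list rather than a list: the column was written to CSV and read back in section 3.3, and CSV stores every cell as text. A minimal sketch of how such a cell could be turned back into a list, assuming the cells are valid Python list literals:

```python
import ast

# One cell as it comes back from the CSV file: a string, not a list.
raw = "['按规定', '定期', '社会', '公布', '基金']"
print(type(raw).__name__)

# ast.literal_eval parses Python literals without executing arbitrary code.
tokens = ast.literal_eval(raw)
print(type(tokens).__name__, tokens)
```

Applied to the whole column, this would amount to something like `data_unlabelled['tokenized_sen_filtered'].apply(ast.literal_eval)` directly after `pd.read_csv`; whether this is needed depends on how the downstream bag-of-words step reads the files.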

#### 4.3 Unlabelled data for Transformers

In [None]:
# import data (Windows)
data_unlabelled = pd.read_csv('./health_sentences_20231123.csv', encoding='utf-8')
data_unlabelled



Unnamed: 0,doc_index,ran200doc,sentences,ran20sen,sen_index
0,1,137,《中外合资、合作医疗机构管理暂行办法》的补充规定 中华...,16,1
1,1,137,卫生部部长：陈竺商务部部长：陈德铭二○○七年十二月三十日 《中外合资、合作医疗机构管理暂行...,18,2
2,1,137,二、本规定中香港、澳门服务提供者应分别符合《内地与香港关于建立更紧密经贸关系的安排》及《...,18,3
3,1,137,三、香港、澳门服务提供者在内地设立合资、合作医疗机构的其他规定，仍参照《中外合资、合作医...,7,4
4,1,137,四、本规定自2008年1月1日起施行,16,5
...,...,...,...,...,...
1003131,31138,35,第四章　附则 第二十四条　本技术方案从二00八年六月一日起统一执行，同时原制定的补偿方案作废,8,1003132
1003132,31138,35,第二十五条　本方案由龙胜各族自治县新型农村合作医疗管理办公室负责解释,9,1003133
1003133,31138,35,,9,1003134
1003134,31139,91,龙胜各族自治县人民政府关于成立自治县“健康扶贫·医疗救助”公益基金管理工作领导小组的通知龙胜...,14,1003135


In [4]:
data_unlabelled = data_unlabelled.drop(['ran200doc', 'ran20sen'], axis = 1)
data_unlabelled.head()

Unnamed: 0,doc_index,sentences,sen_index
0,1,《中外合资、合作医疗机构管理暂行办法》的补充规定 中华...,1
1,1,卫生部部长：陈竺商务部部长：陈德铭二○○七年十二月三十日 《中外合资、合作医疗机构管理暂行...,2
2,1,二、本规定中香港、澳门服务提供者应分别符合《内地与香港关于建立更紧密经贸关系的安排》及《...,3
3,1,三、香港、澳门服务提供者在内地设立合资、合作医疗机构的其他规定，仍参照《中外合资、合作医...,4
4,1,四、本规定自2008年1月1日起施行,5


In [None]:
# save to upload from Mac OS
data_unlabelled.to_csv('./X_sen_unlabelled.csv', index=False)