# CKIP Tagger {#ckiptagger}

The current state-of-the-art Chinese word segmenter available for Taiwan Mandarin is probably the [CKIP tagger](https://github.com/ckiplab/ckiptagger), created by the [Chinese Knowledge and Information Processing (CKIP)](https://ckip.iis.sinica.edu.tw/) group at Academia Sinica.

`ckiptagger` is released as a Python module. In this chapter, I will demonstrate how to use the module for Chinese word segmentation from within an R environment, i.e., how to integrate Python modules into an R workflow to perform complex tasks.

## Installation

Because `ckiptagger` is built in Python, we need to have Python installed in our working environment. Please install the following applications on your own before you start:

- [Anaconda + Python 3.6+](https://www.anaconda.com/distribution/)
- The `ckiptagger` module in Python (please install the module using `Anaconda Navigator` or `pip install` in the terminal)

(**Please consult the [`ckiptagger` GitHub repository](https://github.com/ckiplab/ckiptagger) for more details on installation.**)

```{note}
For some reason, the module `ckiptagger` may not be found in the base channel. If you cannot find it in `Anaconda Navigator`, please add the following channel to the environment so that Anaconda can find the `ckiptagger` module:

`https://conda.anaconda.org/roccqqck`
```

## Download the Model Files

Every NLP application relies on trained models behind its performance. To use the tagger provided in `ckiptagger`, we need to download its pre-trained model files.

Please go to the [GitHub repository of CKIP tagger](https://github.com/ckiplab/ckiptagger) to download the model files, which are provided as a zipped file. (The file is very large, so the download takes a while.)

After you download the zipped file, unzip it into the `data/` directory under your working directory.
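Before loading any model, it can help to confirm that the unzipped files ended up where the rest of the chapter expects them. The following is a minimal sketch: the helper name `missing_model_dirs` is hypothetical, and the list of expected subdirectory names is an assumption about the layout of the unzipped archive, so please adjust it to what you actually see in your `data/` directory.

```python
from pathlib import Path

# ASSUMPTION: subdirectory names expected inside the unzipped model archive.
# Adjust this tuple if your copy of the model files is organized differently.
EXPECTED_SUBDIRS = (
    "embedding_character",
    "embedding_word",
    "model_ws",
    "model_pos",
    "model_ner",
)


def missing_model_dirs(model_path, expected=EXPECTED_SUBDIRS):
    """Return the expected subdirectory names that are absent under `model_path`."""
    root = Path(model_path)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_model_dirs("./data")
    if missing:
        print("Missing under ./data:", ", ".join(missing))
    else:
        print("All expected model directories were found.")
```

An empty list means every expected directory is present; anything else names the pieces that still need to be unzipped or moved.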

## Word Segmentation

Before we proceed, please check if you have everything ready (The following includes the versions of the modules used for this session):

- Anaconda + Python 3.6+ (`Python 3.6.10`)
- Python modules: `ckiptagger` (`ckiptagger 0.1.1` + `tensorflow 1.13.1`)
- CKIP model files under your working directory `./data`

If yes, then we are ready to go.


## Creating Conda Environment for `ckiptagger`

I suggest installing all necessary Python modules in a conda environment for easier use. 

In the following demonstration, I assume that you have created a conda environment `ckiptagger`, where all the necessary modules (i.e., `ckiptagger`, `tensorflow`) have been pip-installed.

```
# install in terminal
## create new env
conda create --name ckiptagger python=3.6
conda activate ckiptagger
pip install -U ckiptagger
## and install everything else needed for your project

## deactivate env when you are done
conda deactivate
```
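Once the environment is active, a quick check from Python can confirm that the modules listed earlier are importable. This is only a sketch: the helper name `missing_modules` is hypothetical, and the `REQUIRED` tuple simply mirrors the modules mentioned in this chapter.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Modules this chapter relies on; extend the tuple for your own project.
REQUIRED = ("ckiptagger", "tensorflow")

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("Missing modules:", missing if missing else "none")
```

Because `find_spec()` only looks a module up without importing it, the check is fast even for heavy packages such as `tensorflow`.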

## Segmenting Texts

Once initialized, the word segmenter object, `ws()`, tokenizes an input **character vector** into a list of **word vectors** of the same length.

```python
from ckiptagger import data_utils, construct_dictionary, WS, POS, NER

# Path to the directory that holds the unzipped model files
MODEL_PATH = '../../../NTNU/CorpusLinguistics/CorpusLinguistics_bookdown/data/'

# Load the models for word segmentation, POS tagging, and NER
ws = WS(MODEL_PATH)
pos = POS(MODEL_PATH)
ner = NER(MODEL_PATH)
```

```python
## Raw text corpus
sentence_list = ['傅達仁今將執行安樂死，卻突然爆出自己20年前遭緯來體育台封殺，他不懂自己哪裡得罪到電視台。',
              '美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會，預料她將會很順利通過參議院支持，成為該國有史以來第一位的華裔女性內閣成員。',
              '土地公有政策?？還是土地婆有政策。',
              '… 你確定嗎… 不要再騙了……他來亂的啦',
              '最多容納59,000個人,或5.9萬人,再多就不行了.這是環評的結論.',
              '科長說:1,坪數對人數為1:3。2,可以再增加。']

## Other optional parameters of `ws()`:
# sentence_segmentation = True, # To consider delimiters
# segment_delimiter_set = {",", "。", ":", "?", "!", ";"}, # This is the default set of delimiters
# recommend_dictionary = dictionary1, # words in this dictionary are encouraged
# coerce_dictionary = dictionary2, # words in this dictionary are forced

# Store the results under new names so that the `pos` and `ner`
# tagger objects are not overwritten by their own output
word_list = ws(sentence_list)
pos_list = pos(word_list)
entity_list = ner(word_list, pos_list)
```

```python
def print_word_pos_sentence(word_sentence, pos_sentence):
    """Print each word of a sentence followed by its POS tag."""
    assert len(word_sentence) == len(pos_sentence)
    for word, tag in zip(word_sentence, pos_sentence):
        print(f"{word}({tag})", end="\u3000")
    print()

# Print every sentence with its tagged words and its named entities
for i, sentence in enumerate(sentence_list):
    print()
    print(f"'{sentence}'")
    print_word_pos_sentence(word_list[i], pos_list[i])
    for entity in sorted(entity_list[i]):
        print(entity)
```

```
'傅達仁今將執行安樂死，卻突然爆出自己20年前遭緯來體育台封殺，他不懂自己哪裡得罪到電視台。'
傅達仁(Nb)　今(Nd)　將(D)　執行(VC)　安樂死(Na)　，(COMMACATEGORY)　卻(D)　突然(D)　爆出(VJ)　自己(Nh)　20(Neu)　年(Nf)　前(Ng)　遭(P)　緯來(Nb)　體育台(Na)　封殺(VC)　，(COMMACATEGORY)　他(Nh)　不(D)　懂(VK)　自己(Nh)　哪裡(Ncd)　得罪到(VJ)　電視台(Nc)　。(PERIODCATEGORY)　
(0, 3, 'PERSON', '傅達仁')
(18, 22, 'DATE', '20年前')
(23, 28, 'ORG', '緯來體育台')

'美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會，預料她將會很順利通過參議院支持，成為該國有史以來第一位的華裔女性內閣成員。'
美國(Nc)　參議院(Nc)　針對(P)　今天(Nd)　總統(Na)　布什(Nb)　所(D)　提名(VC)　的(DE)　勞工部長(Na)　趙小蘭(Nb)　展開(VC)　認可(VC)　聽證會(Na)　，(COMMACATEGORY)　預料(VE)　她(Nh)　將(D)　會(D)　很(Dfa)　順利(VH)　通過(VC)　參議院(Nc)　支持(VC)　，(COMMACATEGORY)　成為(VG)　該(Nes)　國(Nc)　有史以來(D)　第一(Neu)　位(Nf)　的(DE)　華裔(Na)　女性(Na)　內閣(Na)　成員(Na)　。(PERIODCATEGORY)　
(0, 2, 'GPE', '美國')
(2, 5, 'ORG', '參議院')
(7, 9, 'DATE', '今天')
(11, 13, 'PERSON', '布什')
(17, 21, 'ORG', '勞工部長')
(21, 24, 'PERSON', '趙小蘭')
(42, 45, 'ORG', '參議院')
(56, 58, 'ORDINAL', '第一')
(60, 62, 'NORP', '華裔')

'土地公有政策?？還是土地婆有政策。'
土地公有(VH)　政策(Na)　?(QUESTIONCATEGORY)　？(QUESTIONCATEGORY)　還是(Caa)　土地(Na)　婆(Na)　有(V
```

The word segmenter `ws()` returns a `list` object, each element of which is a word-based vector of the original sentence.

## Define Own Dictionary

The performance of a Chinese word segmenter depends highly on its dictionary. Texts in different disciplines may have very domain-specific vocabulary. By prioritizing a set of words in a dictionary, we can further improve the accuracy of the word segmentation.

To create a dictionary for `ckiptagger`, we need a **named list**, i.e., a `list` whose element **names** are the new words and whose element **values** are their weights.

Then we use the Python function `ckip$construct_dictionary()` to create the `dictionary` Python object, which is passed to the word segmenter as `ws(..., recommend_dictionary = ...)`.

```{r}
# Define new words in own dictionary
new_words <- c("土地公有",
               "土地公",
               "土地婆",
               "來亂的",
               "啦",
               "緯來體育台")

# Transform the `vector` into `list` for Python
new_words_py <- c(2, 1, 1, 1, 1, 1) %>% as.list

  # cf. `as.list(rep(1, length(new_words)))` for equal weights
names(new_words_py) <- new_words

  # To create a dictionary for `construct_dictionary()`
  # We need a list, with names as the words and list elements as the weights in the dictionary

# Create Python `dictionary` object, required by the `ckiptagger` word segmenter `WS`
# (`ckip` is the imported `ckiptagger` module; `ws` and `texts` are assumed to be defined already)
dictionary <- ckip$construct_dictionary(new_words_py)

# Segment texts using dictionary
words_1 <- ws(texts, recommend_dictionary = dictionary)
words_1
```
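For readers who call `ckiptagger` from Python directly, the named list above corresponds to a plain `dict` that maps each word to its weight. The sketch below rebuilds the same mapping; the wrapper `build_ckip_dictionary` is a hypothetical helper and is only defined here, not called, because it needs `ckiptagger` to be installed.

```python
# New words and their weights, in the same order as in the R example
new_words = ["土地公有", "土地公", "土地婆", "來亂的", "啦", "緯來體育台"]
weights = [2, 1, 1, 1, 1, 1]

# Pair every word with its weight: the Python counterpart of a named list
word_to_weight = dict(zip(new_words, weights))


def build_ckip_dictionary(mapping):
    """Turn a word-to-weight mapping into a `ckiptagger` dictionary object."""
    # Imported lazily so the rest of this sketch runs without ckiptagger
    from ckiptagger import construct_dictionary
    return construct_dictionary(mapping)


if __name__ == "__main__":
    print(word_to_weight)
```

With the models loaded, the result of `build_ckip_dictionary(word_to_weight)` would be passed as `recommend_dictionary` to the word segmenter, just as in the R code.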

```{exercise}
We usually have a list of new words saved in a text file. Can you write an R function that loads the words in `demo_data/dict-sample.txt` into a named `list`, i.e., `new_words`, which can then serve as the input for `ckip$construct_dictionary()` to create the Python `dictionary` object? (Note: All weights default to 1.)
```

```{r echo = F, eval = T, purl=F}
loadDictionary <- function(input = ""){
  words <- readLines(input)
  weights <- as.list(rep(1, length(words)))
  names(weights)<-words
  return(weights)  
}# endfunc
```


```{r}
new_words<-loadDictionary(input = "demo_data/dict-sample.txt") 
dictionary<-ckip$construct_dictionary(new_words)
# Segment texts using dictionary
words_2 <- ws(texts, recommend_dictionary = dictionary)
words_2
```
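The same idea can be sketched on the Python side. The function name `load_dictionary` is hypothetical, and the sketch assumes a UTF-8 text file with one word per line, which is an assumption about the file format rather than something the sample file guarantees.

```python
def load_dictionary(path, default_weight=1):
    """Read one word per line and map each word to `default_weight`."""
    word_to_weight = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            word = line.strip()
            if word:  # skip blank lines
                word_to_weight[word] = default_weight
    return word_to_weight
```

Blank lines and surrounding whitespace are dropped so that stray empty entries do not end up in the dictionary.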

## Beyond Word Boundaries

In addition to basic word segmentation, `ckiptagger` also provides parts-of-speech tagging for words and named entity recognition for texts. `ckiptagger` follows the pipeline below for text processing.

```{r eval= T, echo = F, purl=F}
library(DiagrammeR)
grViz("digraph flowchart {
      # node definitions with substituted label text
      node [fontname = Helvetica, shape = rectangle]        
      tab1 [label = '@@1']
      tab2 [label = '@@2']
      tab3 [label = '@@3']
      tab4 [label = '@@4']


      # edge definitions with the node IDs
      tab1 -> tab2 -> tab3 -> tab4;
      }

      [1]: 'Raw Texts'
      [2]: 'Words'
      [3]: 'Parts-of-Speech'
      [4]: 'Named Entity Recognition (NER)'
      ")
```

- Load the models

To perform these additional tasks, we first need to load the necessary models (pre-trained and provided by the CKIP group) as well. 

They should all have been included in the model directory you unzipped earlier (cf. `./data`).

```{r eval = T}
# loading other necessary models
system.time((pos <- ckip$POS("./data"))) # POS tagging, about 6s
system.time((ner <- ckip$NER("./data"))) # named entity recognition, about 8s
```

- POS tagging and NER

```{r eval = T}
# Parts-of-speech Tagging
pos_words <- pos(words_1)
pos_words

# Named Entity Recognition
# (note: this overwrites the `ner` tagger object with its results)
ner <- ner(words_1, pos_words)
ner
```

## Tidy Up the Results

We can tidy up the results provided by `ckiptagger` and create a word-based tidy structure of our data:

```{r eval = T}
word_df <- data.frame(text_id = mapply(rep, c(1:length(texts)), sapply(words_1, length)) %>% unlist,
                        words = do.call(c, words_1),
                        pos = do.call(c, pos_words))
word_df
```
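The logic behind `word_df` is a flattening step: every word keeps the index of the text it came from, together with its tag. A minimal Python sketch of the same step follows; the function name `to_word_rows` is hypothetical, and the sample words and tags are a short excerpt of the tagger output shown earlier.

```python
def to_word_rows(word_list, pos_list):
    """Flatten per-text word and POS lists into one row per word."""
    rows = []
    for text_id, (words, tags) in enumerate(zip(word_list, pos_list), start=1):
        if len(words) != len(tags):
            raise ValueError(f"text {text_id}: words and tags differ in length")
        for word, tag in zip(words, tags):
            rows.append({"text_id": text_id, "word": word, "pos": tag})
    return rows


# Short excerpt of the output shown earlier in this chapter
sample_words = [["傅達仁", "今", "將", "執行", "安樂死"], ["美國", "參議院"]]
sample_pos = [["Nb", "Nd", "D", "VC", "Na"], ["Nc", "Nc"]]

rows = to_word_rows(sample_words, sample_pos)

if __name__ == "__main__":
    for row in rows:
        print(row)
```

Each dictionary in `rows` plays the role of one row of `word_df`, with `text_id` counting texts from 1 as in the R version.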

***

```{exercise}
With a word-based tidy structure of the corpus, it is easy to convert it into a text-based one that keeps the information on both word boundaries and parts-of-speech tags. 

Please convert the above `word_df` into a text-based data frame, as shown below.
```

```{r echo=F, purl=F}
library(dplyr)
library(stringr)
word_df %>%
  group_by(text_id) %>%
  summarize(text = str_c(words, pos, sep="/") %>% str_c(collapse="\u3000")) %>%
  ungroup
```

```{exercise}
How can we tidy up the results of `ner` so that we can include the recognized named entities in the same word-based data frame `word_df`?
```

- You may need to convert the output of `ner` from ckiptagger into a data frame like this:

```{r eval=T, echo=F, purl=F}
library(dplyr)
library(stringr)
library(readr)

extract_ner_df <- function(x){
  x %>% str_extract_all("\\([^\\)]+?\\)") %>% 
    unlist %>%
    str_replace_all("\\((\\d+), (\\d+), '(\\w+)', '([^']+)'\\)","\\1\t\\2\t\\3\t\\4") 
} # endfunc

# extract ner
tibble(ner_raw = ner) %>%
  mutate(ner_extract = sapply(ner, extract_ner_df)) %>%
  mutate(text_id = row_number()) %>%
  dplyr::select(-ner_raw) %>%
  tidyr::unnest(ner_extract) %>%
  tidyr::separate(col="ner_extract", 
                  into = c("start","end", "ner_type", "string"),
                  sep ="\t") %>%
  mutate(text_id_start = paste(text_id, start, sep="-"),
         text_id_end = paste(text_id, end, sep="-")) ->ner_df

ner_df %>% arrange(text_id, start) %>%
  select(text_id, start, end, ner_type, string)
#%>%
  #select(text_id_start, text_id_end, ner_type, string)
```

- And figure out a way to add the annotations of named entities in the word-based data frame, `word_df`, by including another column, as shown below:

```{r eval=T, echo=F, purl=F}
library(dplyr)
library(tidytext)
library(tidyr)
library(purrr) # for `map_chr()`

# word-based to char-based
word_df %>%
  mutate(word_id = row_number()) %>%
  unnest_tokens(characters,
               words,
               token = function(x) str_split(x, "")) %>%
  group_by(text_id) %>%
  mutate(char_id = row_number()) %>%
  ungroup %>%
  mutate(text_char_id = paste(text_id, char_id, sep="-"))-> char_df

# create `tag`
tag <- rep("O", nrow(char_df))
names(tag) <- char_df$text_char_id

# for each ner
for(i in 1:nrow(ner_df)){
  cur_textid <- ner_df$text_id[i]
  char_start <- paste(cur_textid, as.numeric(ner_df$start[i])+1, sep="-")
  char_end <- paste(cur_textid, as.numeric(ner_df$end[i]), sep="-")
  tag[which(names(tag)==char_start):which(names(tag)==char_end)]<-sprintf("B-%s", ner_df$ner_type[i])
  if(which(names(tag)==char_start)!=which(names(tag)==char_end))
    tag[(which(names(tag)==char_start)+1):which(names(tag)==char_end)] <- sprintf("I-%s", ner_df$ner_type[i])
}# endfor
  
char_df %>% mutate(tag = tag) %>%
  select(-char_id, -text_char_id) %>%
  group_by(text_id, pos, word_id) %>%
  nest %>%
  mutate(words = map_chr(data, function(x) x$characters %>% unlist %>% str_c(collapse="")),
         tag = map_chr(data, function(x) x$tag %>% unlist %>% .[1])) %>%
  select(-data) -> word_df2

word_df2 %>%
  select(text_id, word_id, words, pos, tag)
```

***

```{block, type="info"}
The above result data frame makes use of the **IOB format** (short for inside, outside, beginning) for the annotations of the named entities. 

It is a common format for tagging (multiword) tokens in chunking tasks in computational linguistics (e.g., NP-chunking, named entity recognition, semantic role labeling). 

- The **B-** prefix before a tag indicates that the tag is the *beginning* of a chunk. 
- The **I-** prefix before a tag indicates that the tag is *inside* a chunk. 
- The **O** tag indicates that a token belongs to no chunk (i.e., *outside* of all relevant chunks).
```
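To make the format concrete, here is a small Python sketch that assigns IOB tags to the words of one sentence from character-based entity spans. The function name `iob_tags` is hypothetical. The sketch assumes that the entity offsets count characters in the concatenation of the words, and it tags a word by the position of its first character. The sample data are taken from the first sentence of the output shown earlier.

```python
def iob_tags(words, entities):
    """Assign one IOB tag per word from (start, end, type, string) entity spans."""
    # Character offset at which each word starts in the concatenated sentence
    starts = []
    offset = 0
    for word in words:
        starts.append(offset)
        offset += len(word)

    tags = ["O"] * len(words)
    for ent_start, ent_end, ent_type, _ in sorted(entities):
        first = True
        for i, word_start in enumerate(starts):
            # A word belongs to the entity if its first character falls in the span
            if ent_start <= word_start < ent_end:
                tags[i] = ("B-" if first else "I-") + ent_type
                first = False
    return tags


# Words and entities from the first example sentence (excerpt)
words = ["傅達仁", "今", "將", "執行", "安樂死", "，", "卻", "突然", "爆出",
         "自己", "20", "年", "前", "遭", "緯來", "體育台", "封殺"]
entities = {(0, 3, "PERSON", "傅達仁"),
            (18, 22, "DATE", "20年前"),
            (23, 28, "ORG", "緯來體育台")}

tags = iob_tags(words, entities)

if __name__ == "__main__":
    for word, tag in zip(words, tags):
        print(word, tag)
```

In this excerpt, `20`, `年`, and `前` form one `DATE` chunk, so the first word is tagged `B-DATE` and the other two `I-DATE`, while words outside any entity stay `O`.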