# Google Colab

- As we work with larger and larger datasets, we may need GPU computing for quicker processing.
- This lecture note shows how to use the free GPU computing provided by Google Colab to speed up Chinese word segmentation with `ckip-transformers`.

## Prepare Google Drive

- Create a working directory under your Google Drive, named `ENC2045_demo_data`.
- Save the corpus files needed in that Google Drive directory.
- We can access the files on our Google Drive from Google Colab. This can be useful when you need to load your own data in Google Colab.

:::{note}

You can of course name the directory whatever you like. The key is that the data files must be on Google Drive so that we can access them from Google Colab. If you choose a different name, use that same name (paths are case-sensitive) wherever the directory appears in the code.

:::

## Run Notebook in Google Colab

- Click on the button on top of the lecture notes website to open this notebook in Google Colab.

## Setting Google Colab Environment

- Important Steps for Google Colab Environment Setting
    - Change the Runtime for GPU
    - Install Modules
    - Mount Google Drive
    - Set Working Directory

## Change Runtime for GPU

- [Runtime] -> [Change runtime type]
- For [Hardware accelerator], choose [GPU]

In [None]:
!nvidia-smi

Fri Jul 14 09:48:15 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   54C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
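
The same check can be done from Python, which is handy when a later cell should behave differently with and without a GPU. The following is a minimal sketch, not part of the original notebook: the helper name `gpu_names` is made up for illustration, and it assumes only that `nvidia-smi` is on the PATH whenever a GPU runtime is active.

```python
import shutil
import subprocess

def gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA driver tools available on this machine
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []  # the tool exists but could not talk to a GPU
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

names = gpu_names()
print(names if names else "No GPU detected; check the runtime type.")
```

An empty list usually means the runtime type is still set to CPU.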

## Install Modules

- Google Colab comes pre-installed with several popular modules for machine learning and deep learning (e.g., `nltk`, `sklearn`, `tensorflow`, `pytorch`, `numpy`, `spacy`).
- We can check the pre-installed modules here.

In [None]:
!pip list

Package                          Version
-------------------------------- ---------------------
absl-py                          1.4.0
aiohttp                          3.8.4
aiosignal                        1.3.1
alabaster                        0.7.13
albumentations                   1.2.1
altair                           4.2.2
anyio                            3.7.1
appdirs                          1.4.4
argon2-cffi                      21.3.0
argon2-cffi-bindings             21.2.0
array-record                     0.4.0
arviz                            0.15.1
astropy                          5.2.2
astunparse                       1.6.3
async-timeout                    4.0.2
attrs                            23.1.0
audioread                        3.0.0
autograd                         1.6.2
Babel                            2.12.1
backcall                         0.2.0
beautifulsoup4                   4.11.2
bleach                           6.0.0
blis                             0.7.9


- We only need to install modules that are not pre-installed in Google Colab (e.g., `ckip-transformers`).
- This installation has to be done every time we work with Google Colab. But don't worry. It's quick.
- This is how we install the package on Google Colab. The `!` prefix runs a shell command from a notebook cell, so the command is exactly the same as the one we use in our terminal.

In [None]:
## Install ckip-transformers
!pip install ckip-transformers

Collecting ckip-transformers
  Downloading ckip_transformers-0.3.4-py3-none-any.whl (26 kB)
Collecting transformers>=3.5.0 (from ckip-transformers)
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
     7.2/7.2 MB 24.6 MB/s eta 0:00:00
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers>=3.5.0->ckip-transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
     268.8/268.8 kB 30.9 MB/s eta 0:00:00
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers>=3.5.0->ckip-transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
     7.8/7.8 MB 87.4 MB/s eta 0:00:00
Collecting safetensors>=0.3.1 (from transformers>=3.5.0->ckip-transformers)
  Downloadi
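
Before installing, it can be useful to check whether a module is already importable in the current session. The sketch below uses only the standard library; the helper name `is_installed` is an illustrative assumption. Note that the import name (`ckip_transformers`) differs from the pip package name (`ckip-transformers`).

```python
import importlib.util

def is_installed(module_name):
    """Return True if the top-level module `module_name` can be imported here."""
    return importlib.util.find_spec(module_name) is not None

# `json` is in the standard library; `ckip_transformers` must be installed first.
for name in ["json", "ckip_transformers"]:
    status = "available" if is_installed(name) else "missing (install it with pip)"
    print(f"{name}: {status}")
```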

## Mount Google Drive
    

- To mount our Google Drive on the current Google Colab server, we need the following code.
- The default directory of Google Colab is `/content/`. (There is a sub-directory by default, i.e., `/content/sample_data`.)
- We specify the mount point as `/content/drive`; the root directory of your Google Drive is then available at `/content/drive/MyDrive`.

In [8]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


- After we run the above code, we need to click on the link presented, log in with our Google Account in the new window, and get the authorization code.
- Then copy the authorization code from the new window and paste it back into the text box in the notebook window.
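
If the same notebook is sometimes run outside Colab (e.g., on a local machine), the mount step can be guarded so that it does not fail there. This is a minimal sketch under that assumption; the wrapper name `mount_drive` is hypothetical, while `drive.mount` is the call used above.

```python
def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running in Colab; return False elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)
    return True

mounted = mount_drive()
print("Drive mounted" if mounted else "Not running in Colab; skipping the mount step")
```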

## Set Working Directory

- Change the Colab working directory to the `ENC2045_demo_data` folder on Google Drive.

In [5]:
import os
os.chdir('/content/drive/MyDrive/ENC2045_demo_data')
print(os.getcwd())


/content/drive/MyDrive/ENC2045_demo_data
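
A mistyped folder name is a common source of errors at this step, so it can help to verify that the directory exists before changing into it. The following sketch is illustrative only: `set_working_dir` is a made-up helper, and a temporary directory stands in for the Drive folder so that the example runs anywhere.

```python
import os
import tempfile
from pathlib import Path

def set_working_dir(path):
    """Change to `path` if it is an existing directory; raise a clear error otherwise."""
    target = Path(path)
    if not target.is_dir():
        raise FileNotFoundError(f"Directory not found: {target}")
    os.chdir(target)
    return Path.cwd()

# Demonstration: a temporary directory stands in for the Google Drive folder.
original = Path.cwd()
demo_dir = tempfile.mkdtemp()
print(set_working_dir(demo_dir))
os.chdir(original)  # restore the previous working directory
```

In Colab, the call would be `set_working_dir('/content/drive/MyDrive/ENC2045_demo_data')`, run after the Drive has been mounted.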


## Try `ckip-transformers` with GPU

### Initialize the `ckip-transformers`

In [None]:
import ckip_transformers
from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger
# Initialize drivers; device=0 runs the models on the first GPU
ws_driver = CkipWordSegmenter(model="bert-base", device=0)
pos_driver = CkipPosTagger(model="bert-base", device=0)


Downloading (…)lve/main/config.json:   0%|          | 0.00/804 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/407M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/301 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/110k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/2.86k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/407M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/301 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/110k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]
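
The drivers above hard-code `device=0`, which assumes a GPU runtime. A possible way to make the notebook also run on a CPU runtime is to choose the device at run time. This is a sketch, assuming (as in the `ckip-transformers` documentation, to the best of my knowledge) that `device=-1` selects the CPU; the helper name `pick_device` is hypothetical.

```python
def pick_device():
    """Return 0 (the first GPU) if PyTorch sees CUDA, otherwise -1 (assumed to mean CPU)."""
    try:
        import torch
    except ImportError:
        return -1  # without PyTorch there is no CUDA device to use
    return 0 if torch.cuda.is_available() else -1

device = pick_device()
print(f"Using device={device}")
```

The result could then be passed on, e.g. `CkipWordSegmenter(model="bert-base", device=device)`.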

In [None]:
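def my_tokenizer(doc):
    # `doc`: a list of corpus documents (each element is one document as a long string)
    # Segment words, treating line breaks as delimiters
    cur_ws = ws_driver(doc, use_delim=True, delim_set='\n')
    # Tag each segmented word with its part of speech
    cur_pos = pos_driver(cur_ws)
    # Pair each word with its tag, document by document
    doc_seg = [list(zip(w, p)) for (w, p) in zip(cur_ws, cur_pos)]
    return doc_seg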
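
To see what the pairing step in `my_tokenizer` produces without loading any model, here is a toy sketch. The two nested lists are made-up stand-ins for the outputs of `ws_driver` and `pos_driver`, and the tags are illustrative rather than real model output.

```python
# Hypothetical stand-ins for the word segmentation and POS tagging outputs:
# one inner list per document, aligned token by token.
toy_ws = [["我", "喜歡", "語言學"], ["今天", "很", "熱"]]
toy_pos = [["Nh", "VK", "Na"], ["Nd", "Dfa", "VH"]]

def pair_tokens(ws, pos):
    """Zip words with tags inside each document, keeping the document boundaries."""
    return [list(zip(words, tags)) for words, tags in zip(ws, pos)]

toy_seg = pair_tokens(toy_ws, toy_pos)
print(toy_seg[0])  # [('我', 'Nh'), ('喜歡', 'VK'), ('語言學', 'Na')]
```

The outer `zip` walks through the documents, and the inner `zip` aligns each word with its tag.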

### Tokenizing Chinese Texts

In [None]:
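import pandas as pd

# Load the corpus file from the working directory
df = pd.read_csv('dcard-top100.csv')
# Keep only the text column and preview the first ten documents
corpus = df['content']
corpus[:10]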

0    部分回應在B117 \n謝謝各位的留言，我都有看完\n好的不好的，我都接受謝謝大家🙇‍♀️\...
1    https://i.imgur.com/REIEzSd.jpg\n\n身高195公分的男大生...
2    看過這麼多在Dcard、PTT上的感情渣事和創作文\n從沒想過如此荒謬像八點檔的事情居然會發...
3    剛剛吃小火鍋，跟店員說不要金針菇（怕卡牙縫），於是店員幫我換其他配料..…\n\n沒想到餐一...
4    已經約好見面，到了當天晚上七點半才回，我是被耍了嗎 \n如下圖\n\n\nhttps://i...
5    嗨！巨砲哥 答應你的文來了😆\n這是一段與約砲小哥哥談心的奇幻旅程\n\n可憐的我情人節當天...
6    https://i.imgur.com/HCTwyAH.jpg\n（圖片非本人）\n今天逛街...
7    https://i.imgur.com/RWJLK2v.jpg\n\n因為馬鞍很寬\n想請問...
8    手機排版請見諒😖🙏🏻（圖多）\n先說這不是我第一次訂購訂製蛋糕\n也了解訂製蛋糕不可能跟圖上...
9    https://i.imgur.com/6Yk9etg.jpg\n想在這裡問大家有沒有接到這...
Name: content, dtype: object

In [None]:
%%time
corpus_seg = my_tokenizer(corpus)

Tokenization: 100%|██████████| 100/100 [00:00<00:00, 370.48it/s]
Inference: 100%|██████████| 16/16 [01:57<00:00,  7.34s/it]
Tokenization: 100%|██████████| 100/100 [00:00<00:00, 1454.07it/s]
Inference: 100%|██████████| 10/10 [01:16<00:00,  7.60s/it]

CPU times: user 3min 14s, sys: 231 ms, total: 3min 14s
Wall time: 3min 15s





In [None]:
corpus_seg[0][:50]

[('部分', 'Neqa'),
 ('回應', 'VC'),
 ('在', 'P'),
 ('B117 \n', 'FW'),
 ('謝謝', 'VJ'),
 ('各位', 'Nh'),
 ('的', 'DE'),
 ('留言', 'Na'),
 ('，', 'COMMACATEGORY'),
 ('我', 'Nh'),
 ('都', 'D'),
 ('有', 'D'),
 ('看完', 'VC'),
 ('\n', 'WHITESPACE'),
 ('好', 'VH'),
 ('的', 'DE'),
 ('不', 'D'),
 ('好', 'VH'),
 ('的', 'T'),
 ('，', 'COMMACATEGORY'),
 ('我', 'Nh'),
 ('都', 'D'),
 ('接受', 'VC'),
 ('謝謝', 'VJ'),
 ('大家', 'Nh'),
 ('🙇', 'FW'),
 ('\u200d♀️\n', 'DASHCATEGORY'),
 ('\n', 'WHITESPACE'),
 ('\n', 'WHITESPACE'),
 ('（', 'PARENTHESISCATEGORY'),
 ('第三', 'Neu'),
 ('次', 'Nf'),
 ('更新', 'VC'),
 ('在', 'P'),
 ('這邊', 'Ncd'),
 ('）', 'PARENTHESISCATEGORY'),
 ('\n', 'WHITESPACE'),
 ('B258 ', 'FW'),
 ('這邊', 'Ncd'),
 ('也', 'D'),
 ('有', 'V_2'),
 ('講到', 'VE'),
 ('怎麼', 'D'),
 ('逃生', 'VA'),
 ('\n', 'WHITESPACE'),
 ('很多', 'Neqa'),
 ('人', 'Na'),
 ('好奇', 'VH'),
 ('我', 'Nh'),
 ('是', 'SHI')]
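
As the output shows, line breaks are kept as tokens tagged `WHITESPACE`. For many downstream tasks we may want to drop them before counting words or tags. The following is one possible sketch, not part of the original notebook: the sample pairs are copied from the output above, and the helper name `drop_whitespace` is hypothetical.

```python
from collections import Counter

# A few (word, tag) pairs copied from the segmented output above.
sample = [
    ("謝謝", "VJ"), ("各位", "Nh"), ("的", "DE"), ("留言", "Na"),
    ("，", "COMMACATEGORY"), ("我", "Nh"), ("都", "D"), ("有", "D"),
    ("看完", "VC"), ("\n", "WHITESPACE"),
]

def drop_whitespace(tokens):
    """Remove tokens tagged WHITESPACE or consisting only of whitespace."""
    return [(w, t) for (w, t) in tokens if t != "WHITESPACE" and w.strip()]

clean = drop_whitespace(sample)
tag_counts = Counter(tag for _, tag in clean)
print(len(clean))          # 9
print(tag_counts["Nh"])    # 2
```

For the whole corpus, the same function could be applied to each document, e.g. `[drop_whitespace(d) for d in corpus_seg]`.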