# Google Colab

- As we are working with more and more data, we may need GPU computing for quicker processing.
- This lecture note shows how we can capitalize on the free GPU computing provided by Google Colab and speed up the Chinese word segmentation of `ckip-transformers`.

## Prepare Google Drive

- Create a working directory under your Google Drive, named `ENC2045_demo_data`.
- Save the corpus files needed in that Google Drive directory.
- We can access the files on our Google Drive from Google Colab. This can be useful when you need to load your own data in Google Colab.

:::{note}

You can of course name the directory whatever you like. The key is to put the data files on Google Drive so that we can access them from Google Colab.

:::

## Run Notebook in Google Colab

- Click on the button on top of the lecture notes website to open this notebook in Google Colab.

## Setting Google Colab Environment

- Setting up the Google Colab environment involves four steps:
    - Change the runtime type to GPU
    - Install Modules
    - Mount Google Drive
    - Set Working Directory

## Change Runtime for GPU

- [Runtime] -> [Change runtime type]
- For [Hardware accelerator], choose [GPU]

In [1]:
!nvidia-smi

Thu Mar 25 11:05:53 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
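
The same check can be done from Python, which is handy when a notebook should warn the reader before a long job starts on a CPU-only runtime. The following is a minimal sketch: `gpu_summary` is a hypothetical helper name (not part of the original notes), and it assumes that `nvidia-smi` is on the `PATH` whenever a GPU is attached.

```python
import shutil
import subprocess


def gpu_summary():
    """Return one line per GPU reported by nvidia-smi, or [] if none is found."""
    # Assumption: a missing nvidia-smi executable means no GPU runtime.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = gpu_summary()
print(gpus if gpus else "No GPU detected; check the runtime type.")
```

If the list is empty, the runtime type has probably not been switched to GPU yet.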

## Install Modules

- Google Colab comes pre-installed with several popular modules for machine learning and deep learning (e.g., `nltk`, `sklearn`, `tensorflow`, `pytorch`, `numpy`, `spacy`).
- We can list the pre-installed modules with `pip list`.

In [2]:
!pip list

Package                       Version       
----------------------------- --------------
absl-py                       0.10.0        
alabaster                     0.7.12        
albumentations                0.1.12        
altair                        4.1.0         
appdirs                       1.4.4         
argon2-cffi                   20.1.0        
asgiref                       3.3.1         
astor                         0.8.1         
astropy                       4.2           
astunparse                    1.6.3         
async-generator               1.10          
atari-py                      0.2.6         
atomicwrites                  1.4.0         
attrs                         20.3.0        
audioread                     2.1.9         
autograd                      1.3           
Babel                         2.9.0         
backcall                      0.2.0         
beautifulsoup4                4.6.3         
bleach                        3.3.0         
blis      

- We only need to install modules that are not pre-installed in Google Colab (e.g., `ckip-transformers`).
- This installation has to be repeated in every new Colab session because each session starts from a fresh environment. But don't worry. It's quick.
- We install a package in Google Colab with the same `pip install` command we use in a terminal, prefixed with `!`.

In [3]:
## Install ckip-transformers
!pip install ckip-transformers

Collecting ckip-transformers
  Downloading https://files.pythonhosted.org/packages/ed/94/ae6a7d7b7e0785ee47663e879982ecff1906b2614a5bc441e73857b94145/ckip_transformers-0.2.3-py3-none-any.whl
Collecting transformers>=3.5.0
  Downloading https://files.pythonhosted.org/packages/ed/d5/f4157a376b8a79489a76ce6cfe147f4f3be1e029b7144fa7b8432e8acb26/transformers-4.4.2-py3-none-any.whl (2.0MB)
     |████████████████████████████████| 2.0MB 12.4MB/s 
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
     |████████████████████████████████| 890kB 49.5MB/s 
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
     |████████████████████████████████| 3.2MB 55.1MB/s 
Building wheels f
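
To avoid reinstalling packages that are already available, we could first check which modules are missing. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper name, and it assumes that the import name of `ckip-transformers` is `ckip_transformers` (as used later in this note).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Anything listed here would still need a `!pip install`.
print(missing_modules(["nltk", "numpy", "ckip_transformers"]))
```

Note that the import name can differ from the package name on PyPI, so the names passed in should be the ones used in `import` statements.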

## Mount Google Drive
    

- To mount our Google Drive to the current Google Colab server, we need the following code.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


- After we run the above code, we need to click on the link presented, log in with our Google account in the new window, and get the authorization code.
- Then copy the authorization code from the new window and paste it back to the text box in the notebook window.
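
Because `google.colab` exists only inside Colab, the mounting cell fails when the same notebook is run locally. A minimal sketch of a guard, assuming we simply want to skip mounting outside Colab (`mount_drive_if_colab` is a hypothetical helper name):

```python
import os


def mount_drive_if_colab(mount_point='/content/drive'):
    """Mount Google Drive when running in Colab; do nothing elsewhere."""
    try:
        from google.colab import drive  # this module is available only in Colab
    except ImportError:
        return False
    drive.mount(mount_point)
    return os.path.isdir(mount_point)


if mount_drive_if_colab():
    print('Google Drive is mounted.')
else:
    print('Google Drive is not mounted (not in Colab, or mounting failed).')
```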

## Set Working Directory

- Change the Colab working directory to the `ENC2045_demo_data` folder on Google Drive.

In [5]:
import os
os.chdir('/content/drive/MyDrive/ENC2045_demo_data')
print(os.getcwd())


/content/drive/MyDrive/ENC2045_demo_data
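
`os.chdir()` raises `FileNotFoundError` if the directory name is misspelled or Google Drive is not mounted. A small sketch that checks the path first; `enter_data_dir` is a hypothetical helper name, and the path is the one used above:

```python
import os
from pathlib import Path


def enter_data_dir(path):
    """Change into `path` if it exists; report the problem and return False otherwise."""
    data_dir = Path(path)
    if not data_dir.is_dir():
        print(f'Directory not found: {data_dir}')
        return False
    os.chdir(data_dir)
    return True


if enter_data_dir('/content/drive/MyDrive/ENC2045_demo_data'):
    print(os.getcwd())
```

Folder names on the mounted drive are case-sensitive, so the name in the path has to match the folder name on Google Drive exactly.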


## Try `ckip-transformers` with GPU

### Initialize `ckip-transformers`

In [6]:
import ckip_transformers
from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger
# Initialize drivers
ws_driver = CkipWordSegmenter(level=3, device=0)
pos_driver = CkipPosTagger(level=3, device=0)
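
The drivers above use `device=0`, which places the models on the first GPU. As a sketch for notebooks that may also run without a GPU, the device could be chosen at run time. This assumes that `device=-1` selects the CPU in `ckip-transformers` (please check the documentation of the installed version); `pick_device` is a hypothetical helper name.

```python
def pick_device():
    """Return 0 (first GPU) if CUDA is usable, otherwise -1 (assumed CPU value)."""
    try:
        import torch  # pre-installed in Colab
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print(device)

# The drivers could then be created with the chosen device, e.g.:
# ws_driver = CkipWordSegmenter(level=3, device=device)
# pos_driver = CkipPosTagger(level=3, device=device)
```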



In [7]:
def my_tokenizer(doc):
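    # `doc`: a list of corpus documents (each element is one document as a single long string)
    # Segment each document into words, using line breaks as sentence delimiters
    cur_ws = ws_driver(doc, use_delim=True, delim_set='\n')
    # Tag every segmented word with its part of speech
    cur_pos = pos_driver(cur_ws)
    # Pair each word with its tag, document by document
    doc_seg = [list(zip(w, p)) for (w, p) in zip(cur_ws, cur_pos)]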
    return doc_seg
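
To see what the pairing step in `my_tokenizer` produces without loading any model, here is a toy sketch. The two input lists are hand-written stand-ins for the outputs of the word segmenter and the POS tagger (they are illustrative assumptions, not real model output), and `pair_words_with_tags` is a hypothetical name.

```python
def pair_words_with_tags(ws, pos):
    """Zip parallel word lists and tag lists into (word, tag) tuples per document."""
    return [list(zip(words, tags)) for words, tags in zip(ws, pos)]


# Hand-written stand-ins: two tiny "documents" and their tags.
toy_ws = [['部分', '回應'], ['謝謝', '各位']]
toy_pos = [['Neqa', 'VC'], ['VJ', 'Nh']]

toy_seg = pair_words_with_tags(toy_ws, toy_pos)
print(toy_seg)
```

The result is a list with one element per document, and each element is a list of `(word, tag)` tuples.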

### Tokenize Chinese Texts

In [8]:
import pandas as pd

df = pd.read_csv('dcard-top100.csv')
df.head()
corpus = df['content']
corpus[:10]

0    部分回應在B117 \n謝謝各位的留言，我都有看完\n好的不好的，我都接受謝謝大家🙇‍♀️\...
1    https://i.imgur.com/REIEzSd.jpg\n\n身高195公分的男大生...
2    看過這麼多在Dcard、PTT上的感情渣事和創作文\n從沒想過如此荒謬像八點檔的事情居然會發...
3    剛剛吃小火鍋，跟店員說不要金針菇（怕卡牙縫），於是店員幫我換其他配料..…\n\n沒想到餐一...
4    已經約好見面，到了當天晚上七點半才回，我是被耍了嗎 \n如下圖\n\n\nhttps://i...
5    嗨！巨砲哥 答應你的文來了😆\n這是一段與約砲小哥哥談心的奇幻旅程\n\n可憐的我情人節當天...
6    https://i.imgur.com/HCTwyAH.jpg\n（圖片非本人）\n今天逛街...
7    https://i.imgur.com/RWJLK2v.jpg\n\n因為馬鞍很寬\n想請問...
8    手機排版請見諒😖🙏🏻（圖多）\n先說這不是我第一次訂購訂製蛋糕\n也了解訂製蛋糕不可能跟圖上...
9    https://i.imgur.com/6Yk9etg.jpg\n想在這裡問大家有沒有接到這...
Name: content, dtype: object

In [9]:
%%time
corpus_seg = my_tokenizer(corpus)

Tokenization: 100%|██████████| 100/100 [00:00<00:00, 425.12it/s]
Inference: 100%|██████████| 16/16 [01:08<00:00,  4.28s/it]
Tokenization: 100%|██████████| 100/100 [00:00<00:00, 1240.91it/s]
Inference: 100%|██████████| 10/10 [00:42<00:00,  4.24s/it]


CPU times: user 1min 3s, sys: 49 s, total: 1min 52s
Wall time: 1min 51s


In [10]:
corpus_seg[0][:50]

[('部分', 'Neqa'),
 ('回應', 'VC'),
 ('在', 'P'),
 ('B117 \n', 'FW'),
 ('謝謝', 'VJ'),
 ('各位', 'Nh'),
 ('的', 'DE'),
 ('留言', 'Na'),
 ('，', 'COMMACATEGORY'),
 ('我', 'Nh'),
 ('都', 'D'),
 ('有', 'D'),
 ('看完', 'VC'),
 ('\n', 'WHITESPACE'),
 ('好', 'VH'),
 ('的', 'DE'),
 ('不', 'D'),
 ('好', 'VH'),
 ('的', 'T'),
 ('，', 'COMMACATEGORY'),
 ('我', 'Nh'),
 ('都', 'D'),
 ('接受', 'VC'),
 ('謝謝', 'VJ'),
 ('大家', 'Nh'),
 ('🙇', 'FW'),
 ('\u200d♀️\n', 'DASHCATEGORY'),
 ('\n', 'WHITESPACE'),
 ('\n', 'WHITESPACE'),
 ('（', 'PARENTHESISCATEGORY'),
 ('第三', 'Neu'),
 ('次', 'Nf'),
 ('更新', 'VC'),
 ('在', 'P'),
 ('這邊', 'Ncd'),
 ('）', 'PARENTHESISCATEGORY'),
 ('\n', 'WHITESPACE'),
 ('B258 ', 'FW'),
 ('這邊', 'Ncd'),
 ('也', 'D'),
 ('有', 'V_2'),
 ('講到', 'VE'),
 ('怎麼', 'D'),
 ('逃生', 'VA'),
 ('\n', 'WHITESPACE'),
 ('很多', 'Neqa'),
 ('人', 'Na'),
 ('好奇', 'VH'),
 ('我', 'Nh'),
 ('是', 'SHI')]
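
The output shows that line breaks survive as `WHITESPACE` tokens and that some words carry trailing whitespace (e.g., `'B117 \n'`). As one possible clean-up step, sketched here under the assumption that such tokens are not needed downstream, we could strip and filter them before counting tags. `clean_tokens` is a hypothetical helper name, and `sample` copies the first few tuples shown above.

```python
from collections import Counter


def clean_tokens(tagged_doc):
    """Strip whitespace around words and drop empty or WHITESPACE-tagged tokens."""
    cleaned = []
    for word, tag in tagged_doc:
        word = word.strip()
        if word and tag != 'WHITESPACE':
            cleaned.append((word, tag))
    return cleaned


sample = [('部分', 'Neqa'), ('回應', 'VC'), ('在', 'P'), ('B117 \n', 'FW'),
          ('謝謝', 'VJ'), ('各位', 'Nh'), ('的', 'DE'), ('留言', 'Na'),
          ('，', 'COMMACATEGORY'), ('我', 'Nh'), ('都', 'D'), ('有', 'D'),
          ('看完', 'VC'), ('\n', 'WHITESPACE')]

tokens = clean_tokens(sample)
tag_counts = Counter(tag for _, tag in tokens)
print(tokens[:5])
print(tag_counts.most_common(3))
```

On the full corpus, the same function could be applied to each document, e.g., `[clean_tokens(d) for d in corpus_seg]`.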