# SentencePiece
# A library with built-in subword tokenization algorithms

In [1]:
import sentencepiece as spm
import pandas as pd
import urllib.request
import csv

In [2]:
urllib.request.urlretrieve("https://raw.githubusercontent.com/LawrenceDuan/IMDb-Review-Analysis/master/IMDb_Reviews.csv", filename="IMDb_Reviews.csv")

('IMDb_Reviews.csv', <http.client.HTTPMessage at 0x28d9bee2700>)

In [3]:
train_df = pd.read_csv('IMDb_Reviews.csv')
train_df['review']

0        My family and I normally do not watch local mo...
1        Believe it or not, this was at one time the wo...
2        After some internet surfing, I found the "Home...
3        One of the most unheralded great works of anim...
4        It was the Sixties, and anyone with long hair ...
                               ...                        
49995    the people who came up with this are SICK AND ...
49996    The script is so so laughable... this in turn,...
49997    "So there's this bride, you see, and she gets ...
49998    Your mind will not be satisfied by this nobud...
49999    The chaser's war on everything is a weekly sho...
Name: review, Length: 50000, dtype: object

In [4]:
print("Number of reviews:", len(train_df))  # print the number of reviews

Number of reviews: 50000


In [5]:
with open('imdb_review.txt', 'w', encoding='utf8') as f:
    f.write('\n'.join(train_df['review']))

Train SentencePiece to build the subword vocabulary and assign a unique integer ID to each entry.

In [6]:
spm.SentencePieceTrainer.Train('--input=imdb_review.txt --model_prefix=imdb --vocab_size=5000 --model_type=bpe --max_sentence_length=9999')

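input: the text file to train on (one sentence per line)

model_prefix: prefix for the output files (here, imdb.model and imdb.vocab)

vocab_size: size of the vocabulary

model_type: model to use (unigram (default), bpe, char, word)

max_sentence_length: maximum sentence length in bytes; longer lines are skipped during training

pad_id, pad_piece: ID and string of the padding token

unk_id, unk_piece: ID and string of the unknown token

bos_id, bos_piece: ID and string of the beginning-of-sentence token

eos_id, eos_piece: ID and string of the end-of-sentence token

user_defined_symbols: user-defined tokens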
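
As a rough illustration of how these options map onto the flag string passed to Train, the sketch below assembles that string from keyword arguments. build_trainer_args is a hypothetical helper, not part of SentencePiece, and the pad_id value and the <sep>/<cls> symbols are assumptions chosen only for this example.

```python
def build_trainer_args(**options):
    """Join keyword options into a '--key=value' flag string (hypothetical helper)."""
    parts = []
    for key, value in options.items():
        # List-valued options such as user_defined_symbols are comma-separated.
        if isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        parts.append(f"--{key}={value}")
    return " ".join(parts)


args = build_trainer_args(
    input="imdb_review.txt",
    model_prefix="imdb",
    vocab_size=5000,
    model_type="bpe",
    max_sentence_length=9999,
    pad_id=3,                                 # assumption: enable padding at ID 3
    pad_piece="<pad>",
    user_defined_symbols=["<sep>", "<cls>"],  # hypothetical extra tokens
)
print(args)
```

The resulting string could then be passed to spm.SentencePieceTrainer.Train(args). By default the padding token is disabled (pad_id=-1), so a model that needs padding has to be given an explicit ID for it.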

In [7]:
vocab_list = pd.read_csv('imdb.vocab', sep='\t', header=None, quoting=csv.QUOTE_NONE)
vocab_list.sample(10)

               0      1
3083        room  -3080
3719  ▁confusing  -3716
400        ▁make   -397
123          ill   -120
263          ▁go   -260
3869       ▁hasn  -3866
733      ▁screen   -730
1099        ▁All  -1096
2767       inary  -2764
3056    ▁amusing  -3053
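
To help read this table, the sketch below checks how the row index and the score column relate in the ten sampled rows. The rows are copied from the output above. The interpretation in the comments (that the score is the negated BPE merge rank, offset by three reserved special tokens) is an assumption based on SentencePiece's default settings, not something the output itself states.

```python
# (row index, piece, score) triples copied from the sample above.
sample = [
    (3083, "room", -3080),
    (3719, "▁confusing", -3716),
    (400, "▁make", -397),
    (123, "ill", -120),
    (263, "▁go", -260),
    (3869, "▁hasn", -3866),
    (733, "▁screen", -730),
    (1099, "▁All", -1096),
    (2767, "inary", -2764),
    (3056, "▁amusing", -3053),
]

# If the score is minus the merge rank and IDs 0-2 are reserved for
# <unk>, <s>, </s> (the defaults), then index + score should always be 3.
offsets = {index + score for index, _, score in sample}
print(offsets)  # {3}
```

Under that reading, the row index is the integer ID that encode_as_ids returns for the piece, and pieces with a smaller index were merged earlier during BPE training.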


In [8]:
len(vocab_list)

5000

In [10]:
sp = spm.SentencePieceProcessor()
vocab_file = "imdb.model"
sp.load(vocab_file)

True

In [11]:
lines = [
  "I didn't at all think of it this way.",
  "I have waited a long time for someone to film"
]
for line in lines:
  print(line)
  print(sp.encode_as_pieces(line))
  print(sp.encode_as_ids(line))
  print()

I didn't at all think of it this way.
['▁I', '▁didn', "'", 't', '▁at', '▁all', '▁think', '▁of', '▁it', '▁this', '▁way', '.']
[41, 623, 4950, 4926, 138, 169, 378, 30, 58, 73, 413, 4945]

I have waited a long time for someone to film
['▁I', '▁have', '▁wa', 'ited', '▁a', '▁long', '▁time', '▁for', '▁someone', '▁to', '▁film']
[41, 141, 1364, 1120, 4, 666, 285, 92, 1078, 33, 91]



encode_as_pieces: converts an input sentence into a sequence of subwords.

encode_as_ids: converts an input sentence into a sequence of integer IDs.
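
The pieces above carry enough information to recover the original sentence, because the ▁ symbol marks where a space was. The following is a minimal pure-Python sketch of that reverse step, assuming default normalization; pieces_to_text is a hypothetical name used only here.

```python
WORD_BOUNDARY = "\u2581"  # '▁', the symbol SentencePiece uses in place of a space


def pieces_to_text(pieces):
    """Rebuild text from subword pieces (a sketch of the decoding step)."""
    text = "".join(pieces).replace(WORD_BOUNDARY, " ")
    # The first piece starts with '▁', which leaves one leading space to drop.
    return text[1:] if text.startswith(" ") else text


pieces = ['▁I', '▁didn', "'", 't', '▁at', '▁all', '▁think',
          '▁of', '▁it', '▁this', '▁way', '.']
print(pieces_to_text(pieces))  # I didn't at all think of it this way.
```

With the loaded model, sp.decode_pieces(pieces) and sp.decode_ids(ids) perform this conversion directly.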

In [12]:
sp.GetPieceSize()

5000