
In [1]:
import pandas as pd
import urllib.request
import tensorflow_datasets as tfds

In [2]:
urllib.request.urlretrieve("https://raw.githubusercontent.com/LawrenceDuan/IMDb-Review-Analysis/master/IMDb_Reviews.csv", filename="IMDb_Reviews.csv")
train_df=pd.read_csv('IMDb_Reviews.csv')

In [3]:
train_df.head()

Unnamed: 0,review,sentiment
0,My family and I normally do not watch local mo...,1
1,"Believe it or not, this was at one time the wo...",0
2,"After some internet surfing, I found the ""Home...",0
3,One of the most unheralded great works of anim...,1
4,"It was the Sixties, and anyone with long hair ...",0


In [5]:
tokenizer=tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(train_df['review'], target_vocab_size=2**13)
# Build the vocabulary and assign a unique integer ID to each subword

In [6]:
print(tokenizer.subwords[:20])

['the_', ', ', '. ', 'a_', 'and_', 'of_', 'to_', 's_', 'is_', 'br', 'in_', 'I_', 'that_', 'this_', 'it_', ' /><', ' />', 'was_', 'The_', 't_']
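Several of the most frequent subwords above ('br', ' /><', ' />') look like fragments of HTML `<br />` line breaks in the raw reviews, so they occupy vocabulary slots without carrying meaning. One possible cleanup, sketched here with the standard `re` module, is to replace those tags with a space before building the vocabulary. The helper name `strip_br_tags` is hypothetical and this step is not part of the original notebook.

```python
import re

def strip_br_tags(text):
    """Replace <br>, <br/> and <br /> tags with a single space (hypothetical helper)."""
    return re.sub(r"<br\s*/?>", " ", text)

# Tail of a review in the IMDb CSV, which ends with two break tags before the rating.
sample = 'it seems.<br /><br />*1/2 (of Four)'
print(strip_br_tags(sample))
```

Assuming the same DataFrame, the helper could be applied with `train_df['review'].apply(strip_br_tags)` before calling `build_from_corpus`.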


In [7]:
print(train_df['review'][20])

Pretty bad PRC cheapie which I rarely bother to watch over again, and it's no wonder -- it's slow and creaky and dull as a butter knife. Mad doctor George Zucco is at it again, turning a dimwitted farmhand in overalls (Glenn Strange) into a wolf-man. Unfortunately, the makeup is virtually non-existent, consisting only of a beard and dimestore fangs for the most part. If it were not for Zucco and Strange's presence, along with the cute Anne Nagel, this would be completely unwatchable. Strange, who would go on to play Frankenstein's monster for Unuiversal in two years, does a Lenny impression from "Of Mice and Men", it seems.<br /><br />*1/2 (of Four)


In [8]:
print(tokenizer.encode(train_df['review'][20]))

[1590, 4162, 132, 7107, 1892, 2983, 578, 76, 12, 4632, 3422, 7, 160, 175, 372, 2, 5, 39, 8051, 8, 84, 2652, 497, 39, 8051, 8, 1374, 5, 3461, 2012, 48, 5, 2263, 21, 4, 2992, 127, 4729, 711, 3, 1391, 8044, 3557, 1277, 8102, 2154, 5681, 9, 42, 15, 372, 2, 3773, 4, 3502, 2308, 467, 4890, 1503, 11, 3347, 1419, 8127, 29, 5539, 98, 6099, 58, 94, 4, 1388, 4230, 8057, 213, 3, 1966, 2, 1, 6700, 8044, 9, 7069, 716, 8057, 6600, 2, 4102, 36, 78, 6, 4, 1865, 40, 5, 3502, 1043, 1645, 8044, 1000, 1813, 23, 1, 105, 1128, 3, 156, 15, 85, 33, 23, 8102, 2154, 5681, 5, 6099, 8051, 8, 7271, 1055, 2, 534, 22, 1, 3046, 5214, 810, 634, 8120, 2, 14, 71, 34, 436, 3311, 5447, 783, 3, 6099, 2, 46, 71, 193, 25, 7, 428, 2274, 2260, 6487, 8051, 8, 2149, 23, 1138, 4117, 6023, 163, 11, 148, 735, 2, 164, 4, 5277, 921, 3395, 1262, 37, 639, 1349, 349, 5, 2460, 328, 15, 5349, 8127, 24, 10, 16, 10, 17, 8054, 8061, 8059, 8062, 29, 6, 6607, 8126, 8053]


In [9]:
sample_string="It's mind-blowing to me that this film was even made."

tokenized_string=tokenizer.encode(sample_string)
print("Sentence after integer encoding : ", tokenized_string)

original_string=tokenizer.decode(tokenized_string)
print("Original sentence : ", original_string)

Sentence after integer encoding :  [137, 8051, 8, 910, 8057, 2169, 36, 7, 103, 13, 14, 32, 18, 79, 681, 8058]
Original sentence :  It's mind-blowing to me that this film was even made.
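The decoded string matches the input exactly, so the encoding is lossless for this sentence. A minimal sketch of how that round trip could be checked programmatically follows. `round_trips` is a hypothetical helper that accepts any object with `encode` and `decode` methods, and `ToyByteEncoder` is only a stand-in so the snippet runs without TensorFlow Datasets; it is not the tfds class.

```python
class ToyByteEncoder:
    """Hypothetical stand-in with the same encode/decode shape; not the tfds encoder."""

    def encode(self, text):
        # Shift every UTF-8 byte by 1 so that ID 0 stays unused.
        return [b + 1 for b in text.encode("utf-8")]

    def decode(self, ids):
        # Undo the shift and rebuild the string from the bytes.
        return bytes(i - 1 for i in ids).decode("utf-8")


def round_trips(encoder, text):
    """Return True if decoding the encoded text reproduces it exactly."""
    return encoder.decode(encoder.encode(text)) == text


sample_string = "It's mind-blowing to me that this film was even made."
print(round_trips(ToyByteEncoder(), sample_string))
```

With the real tokenizer in scope, `round_trips(tokenizer, sample_string)` should perform the same comparison.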


In [10]:
print("Vocabulary size : ", tokenizer.vocab_size)

Vocabulary size :  8268


In [11]:
for ts in tokenized_string:
  print('{}-------->{}'.format(ts, tokenizer.decode([ts])))

137-------->It
8051-------->'
8-------->s 
910-------->mind
8057-------->-
2169-------->blow
36-------->ing 
7-------->to 
103-------->me 
13-------->that 
14-------->this 
32-------->film 
18-------->was 
79-------->even 
681-------->made
8058-------->.
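Comparing these IDs with the `tokenizer.subwords[:20]` output suggests two conventions: ID n corresponds to `subwords[n - 1]` (ID 0 appears to be left unused, presumably for padding), and a trailing `_` in a subword is rendered as a space when decoded. The sketch below checks this against the printed values only. `piece_for_id` is a hypothetical helper, not a library function, and it covers only the first 20 subwords.

```python
# First 20 subwords, copied from the printed tokenizer.subwords[:20] output.
first_subwords = ['the_', ', ', '. ', 'a_', 'and_', 'of_', 'to_', 's_', 'is_', 'br',
                  'in_', 'I_', 'that_', 'this_', 'it_', ' /><', ' />', 'was_', 'The_', 't_']

def piece_for_id(token_id):
    """Hypothetical helper: map ID n to subwords[n - 1], showing a trailing '_' as a space."""
    piece = first_subwords[token_id - 1]
    return piece[:-1] + ' ' if piece.endswith('_') else piece

# IDs from the breakdown above that fall within the first 20 subwords.
for token_id in [8, 7, 13, 14, 18]:
    print(token_id, repr(piece_for_id(token_id)))
```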


In [12]:
# Arbitrarily append "xyz" to a word in the sentence, then encode and decode it
sample_string="It's mind-blowing to me that this film was evenxyz made."
tokenized_string=tokenizer.encode(sample_string)
print("Sentence after integer encoding : ", tokenized_string)
original_string=tokenizer.decode(tokenized_string)
print("Original sentence : ", original_string)

Sentence after integer encoding :  [137, 8051, 8, 910, 8057, 2169, 36, 7, 103, 13, 14, 32, 18, 7974, 8132, 8133, 997, 681, 8058]
Original sentence :  It's mind-blowing to me that this film was evenxyz made.


In [13]:
for ts in tokenized_string:
  print("{}------>{}".format(ts, tokenizer.decode([ts])))

137------>It
8051------>'
8------>s 
910------>mind
8057------>-
2169------>blow
36------>ing 
7------>to 
103------>me 
13------>that 
14------>this 
32------>film 
18------>was 
7974------>even
8132------>x
8133------>y
997------>z 
681------>made
8058------>.
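The unseen string "xyz" is not rejected. It is split into smaller pieces, and `x` and `y` receive IDs near the top of the ID range. Those high IDs are consistent with a byte-level fallback: if the last 256 IDs are assumed to be reserved for raw byte values, the offset is 8268 - 256 = 8012, and subtracting it from each ID gives the character's byte value. The sketch below checks that arithmetic against the printed outputs. The offset is inferred from this notebook's numbers, not read from the library.

```python
VOCAB_SIZE = 8268   # value printed by tokenizer.vocab_size
NUM_BYTES = 256     # assumption: one fallback ID per possible byte value
BYTE_OFFSET = VOCAB_SIZE - NUM_BYTES  # 8012, inferred from the outputs

# Single-character pieces and their IDs, copied from the breakdowns above.
observed = {8051: "'", 8057: "-", 8058: ".", 8132: "x", 8133: "y"}

for token_id, shown in observed.items():
    guessed = chr(token_id - BYTE_OFFSET)
    print(token_id, repr(guessed), guessed == shown)
```

The letter `z` does not follow this pattern because it was matched together with the following space as an ordinary subword (ID 997).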


In [14]:
import pandas as pd
import urllib.request
import tensorflow_datasets as tfds

In [16]:
urllib.request.urlretrieve("https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt", filename='ratings_train.txt')
train_data=pd.read_table('ratings_train.txt')

In [17]:
train_data=train_data.dropna(how='any')

In [19]:
tokenizer=tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(train_data['document'], target_vocab_size=2**13)

In [20]:
print(train_data['document'][10])

걍인피니트가짱이다.진짜짱이다♥


In [23]:
print(tokenizer.encode(train_data['document'][10]))

[3885, 113, 220, 105, 199, 34, 7707, 8044, 172, 7707, 561]


In [24]:
sample_string=train_data['document'][21]
tokenized_string=tokenizer.encode(sample_string)
print("Sentence encoded as integers : ", tokenized_string)
original_string=tokenizer.decode(tokenized_string)
print("Original sentence : ", original_string)

Sentence encoded as integers :  [570, 892, 36, 584, 159, 7091, 201]
Original sentence :  보면서 웃지 않는 건 불가능하다


In [28]:
for ts in tokenized_string:
  print("{}--->{}".format(ts, tokenizer.decode([ts])))

570--->보면서 
892--->웃
36--->지 
584--->않는 
159--->건 
7091--->불가능
201--->하다
