# Sentence Tokenization

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
text="I never thought through love we'd be. Making one as lovely as she. But isn't she lovely made from love."

In [None]:
from nltk.tokenize import sent_tokenize

sent_tokenize(text)

["I never thought through love we'd be.",
 'Making one as lovely as she.',
 "But isn't she lovely made from love."]

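To make the splitting step concrete, here is a minimal sketch of a naive sentence splitter in plain Python. It is not how `sent_tokenize` works: NLTK's punkt tokenizer is a trained model that also handles abbreviations and similar edge cases, while this hypothetical helper (`naive_sent_tokenize`) simply splits after `.`, `!`, or `?` followed by whitespace. For the simple text above, both approaches give the same three sentences.

```python
import re

def naive_sent_tokenize(text):
    """Split after sentence-ending punctuation followed by whitespace.

    Illustrative only: abbreviations such as "Dr." would be split incorrectly.
    """
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = ("I never thought through love we'd be. "
        "Making one as lovely as she. "
        "But isn't she lovely made from love.")

for sentence in naive_sent_tokenize(text):
    print(sentence)
```
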
# Using the Keras Tokenizer
- Integer encoding
- Decoding support
- Word tokenization support

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()

In [None]:
# Fit the tokenizer on the corpus so it can build the vocabulary
tokenizer.fit_on_texts(sent_tokenize(text))

In [None]:
# Inspect the generated vocabulary
tokenizer.word_index

{'love': 1,
 'as': 2,
 'lovely': 3,
 'she': 4,
 'i': 5,
 'never': 6,
 'thought': 7,
 'through': 8,
 "we'd": 9,
 'be': 10,
 'making': 11,
 'one': 12,
 'but': 13,
 "isn't": 14,
 'made': 15,
 'from': 16}

In [None]:
# Check the word frequencies
tokenizer.word_counts

OrderedDict([('i', 1),
             ('never', 1),
             ('thought', 1),
             ('through', 1),
             ('love', 2),
             ("we'd", 1),
             ('be', 1),
             ('making', 1),
             ('one', 1),
             ('as', 2),
             ('lovely', 2),
             ('she', 2),
             ('but', 1),
             ("isn't", 1),
             ('made', 1),
             ('from', 1)])

In [None]:
# Encoding
corpus = ["she isn't lovely but i love she"]
print(tokenizer.texts_to_sequences(corpus))

[[4, 14, 3, 13, 5, 1, 4]]


In [None]:
# Decoding
print(tokenizer.sequences_to_texts([[5, 1, 6, 14, 16, 8, 10]]))

["i love never isn't from through be"]
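The vocabulary and the encoded sequences above can be reproduced with a few lines of plain Python. The sketch below is not the Keras implementation. It assumes the default `Tokenizer` behavior (lowercasing, stripping punctuation except the apostrophe, ranking words by descending frequency with ties kept in first-seen order), and the helper names `simple_tokenize`, `build_word_index`, `encode`, and `decode` are hypothetical.

```python
from collections import Counter

# Characters replaced by spaces before splitting (assumed to mirror the Keras
# default filter list); the apostrophe is kept so "isn't" stays one token.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_tokenize(sentence):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return sentence.lower().translate(table).split()

def build_word_index(sentences):
    """Assign 1-based indices by descending frequency (stable for ties)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(simple_tokenize(sentence))
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

def encode(sentence, word_index):
    """Map known words to their indices; unknown words are dropped."""
    return [word_index[w] for w in simple_tokenize(sentence) if w in word_index]

def decode(sequence, word_index):
    """Map indices back to words."""
    index_word = {i: w for w, i in word_index.items()}
    return " ".join(index_word[i] for i in sequence if i in index_word)

sentences = [
    "I never thought through love we'd be.",
    "Making one as lovely as she.",
    "But isn't she lovely made from love.",
]
word_index = build_word_index(sentences)

print(encode("she isn't lovely but i love she", word_index))  # [4, 14, 3, 13, 5, 1, 4]
print(decode([5, 1, 6, 14, 16, 8, 10], word_index))  # i love never isn't from through be
```

Index 0 is never assigned to a word, which leaves it free to serve as the padding value.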


# Handling OOV (Out-Of-Vocabulary) Words

In [None]:
# Vocabulary size: vocab_size
vocab_size = 5

# num_words covers the 5 words actually used plus 2 reserved indices: padding (0) and the OOV token (1)
tokenizer = Tokenizer(num_words=vocab_size+2, oov_token="<oov>")
tokenizer.fit_on_texts(sent_tokenize(text))

In [None]:
tokenizer.word_index

{'<oov>': 1,
 'love': 2,
 'as': 3,
 'lovely': 4,
 'she': 5,
 'i': 6,
 'never': 7,
 'thought': 8,
 'through': 9,
 "we'd": 10,
 'be': 11,
 'making': 12,
 'one': 13,
 'but': 14,
 "isn't": 15,
 'made': 16,
 'from': 17}

In [None]:
corpus = ["she is bowwow"]
tokenizer.texts_to_sequences(corpus)

[[5, 1, 1]]
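Note that `word_index` still lists every word seen during fitting; the `num_words` limit is applied only when texts are converted to sequences. A minimal sketch of that lookup rule, assuming any index greater than or equal to `num_words` is replaced by the OOV index, is shown below. The function `encode_with_oov` is hypothetical, and the dictionary is a shortened copy of the output above.

```python
def encode_with_oov(tokens, word_index, num_words, oov_token="<oov>"):
    """Replace unknown or out-of-range words with the OOV index."""
    oov_index = word_index[oov_token]
    encoded = []
    for token in tokens:
        index = word_index.get(token)
        if index is None or index >= num_words:
            encoded.append(oov_index)  # unseen word, or ranked below the cutoff
        else:
            encoded.append(index)
    return encoded

# Shortened copy of the word_index shown above
word_index = {"<oov>": 1, "love": 2, "as": 3, "lovely": 4, "she": 5, "i": 6, "never": 7}

vocab_size = 5
num_words = vocab_size + 2  # indices 0..6 are usable: padding, OOV, and 5 words

print(encode_with_oov(["she", "is", "bowwow"], word_index, num_words))  # [5, 1, 1]
print(encode_with_oov(["i", "never", "love"], word_index, num_words))   # [6, 1, 2]
```

In the second call, `never` is in `word_index` with index 7, but it falls outside the limit and is therefore encoded as OOV.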

# Padding

In [None]:
integer_tokens = tokenizer.texts_to_sequences(corpus)
integer_tokens

[[5, 1, 1]]

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# The default is pre-padding (zeros are added at the front)
padded_tokens = pad_sequences(integer_tokens, maxlen=5)
padded_tokens

array([[0, 0, 5, 1, 1]], dtype=int32)

In [None]:
padded_tokens = pad_sequences(integer_tokens, maxlen=5, padding='post')
padded_tokens

array([[5, 1, 1, 0, 0]], dtype=int32)
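The two padding modes can be sketched for a single sequence in plain Python. This is an illustration, not the `pad_sequences` implementation; the helper `pad` is hypothetical and assumes `maxlen >= 1`. Sequences longer than `maxlen` are cut from the front here, which matches the Keras default `truncating='pre'`.

```python
def pad(sequence, maxlen, padding="pre", value=0):
    """Pad (or truncate from the front) a list of ints to exactly maxlen items."""
    trimmed = sequence[-maxlen:]              # keep the last maxlen items
    fill = [value] * (maxlen - len(trimmed))  # padding needed to reach maxlen
    return fill + trimmed if padding == "pre" else trimmed + fill

print(pad([5, 1, 1], maxlen=5))                  # [0, 0, 5, 1, 1]
print(pad([5, 1, 1], maxlen=5, padding="post"))  # [5, 1, 1, 0, 0]
print(pad([1, 2, 3, 4, 5, 6], maxlen=5))         # [2, 3, 4, 5, 6]
```
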

# Text Vectorization

In [None]:
text1 = """I'm at a payphone trying to call home
All of my change I spent on you
Where have the times gone?
Baby, it's all wrong
Where are the plans we made for two?
Yeah, I, I know it's hard to remember
The people we used to be
It's even harder to picture
That you're not here next to me
You say it's too late to make it
But is it too late to try?
And in our time that you wasted
All of our bridges burned down
I've wasted my nights
You turned out the lights
Now I'm paralyzed
Still stuck in that time, when we called it love
But even the sun sets in paradise
I'm at a payphone, trying to call home
All of my change I spent on you
Where have the times gone?
Baby, it's all wrong
Where are the plans we made for two?
If "happy ever after" did exist
I would still be holding you like this
All those fairy tales are full of shit
One more fucking love song, I'll be sick, oh
You turned your back on tomorrow
'Cause you forgot yesterday
I gave you my love to borrow
But you just gave it away
You can't expect me to be fine
I don't expect you to care
I know I've said it before
But all of our bridges burned down
I've wasted my nights
You turned out the lights
Now I'm paralyzed
Still stuck in that time
When we called it love
But even the sun sets in paradise
I'm at a payphone trying to call home
All of my change I spent on you
Where have the times gone?
Baby, it's all wrong
Where are the plans we made for two?
If "happy ever after" did exist
I would still be holding you like this
And all those fairy tales are full of shit
One more fucking love song, I'll be sick
Now I'm at a payphone
Man, fuck that shit
I'll be out spending all this money while you sitting round
Wondering why wasn't you who came up from nothing
Made it from the bottom, now when you see me I'm strutting
And all of my cars start with a push of a button
Telling me the chances I blew up or whatever you call it
Switched the number to my phone so you never could call it
Don't need my name on my show you can tell it I'm ballin'
Swish, what a shame could have got picked
Had a really good game but you missed your last shot
So you talk about who you see at the top
Or what you could have saw, but sad to say it's over for
Phantom pulled valet open doors
Wiz like go away got what you was looking for
Now it's me who they want, so you can go
And take that little piece of shit with you
I'm at a payphone, trying to call home
All of my change I spent on you
Where have the times gone?
Baby, it's all wrong
Where are the plans we made for two?
If "happy ever after" did exist
I would still be holding you like this
All those fairy tales are full of shit
One more fucking love song, I'll be sick
Now I'm at a payphone"""

text2 = """Spent 24 hours, I need more hours with you
You spent the weekend getting even, ooh
We spent the late nights making things right between us
But now it's all good, babe
Roll that back wood, babe
And play me close
'Cause girls like you run 'round with guys like me
'Til sun down when I come through
I need a girl like you, yeah yeah
Girls like you love fun and, yeah, me too
What I want when I come through
I need a girl like you, yeah yeah
Yeah yeah yeah, yeah yeah yeah
I need a girl like you, yeah yeah
Yeah yeah yeah, yeah yeah yeah
I need a girl like you
I spent last night on the last flight to you (ey ya)
Took a whole day up trying to get way up, ooh
We spent the daylight trying to make things right between us
But now it's all good, babe
Roll that back wood, babe
And play me close, yeah
'Cause girls like you run 'round with guys like me
'Til sun down when I come through
I need a girl like you, yeah yeah
Girls like you love fun and, yeah, me too
What I want when I come through
I need a girl like you, yeah yeah
Yeah yeah yeah, yeah yeah yeah
I need a girl like you, yeah yeah
Yeah yeah yeah, yeah yeah yeah
I need a girl like you, yeah yeah
I need a girl like you, yeah yeah
I need a girl like you
Maybe it's 6:45
Maybe I'm barely alive
Maybe you've taken my shit for the last time, yeah
Maybe I know that I'm drunk
Maybe I know you're the one
Maybe you thinking it's better if you drive
Oh, 'cause girls like you run 'round with guys like me
'Til sun down when I come through
I need a girl like you, yeah
'Cause girls like you run 'round with guys like me
'Til sun down when I come through
I need a girl like you, yeah yeah
Girls like you love fun and, yeah, me too
What I want when I come through
I need a girl like you, yeah yeah
Yeah yeah yeah, yeah yeah yeah
I need a girl like you, yeah yeah
Yeah yeah yeah, yeah yeah yeah
I need a girl like you"""

In [None]:
# Build the corpus
corpus = [text1, text2]
len(corpus)

2

# DTM (Document Term Matrix)
- The frequency of each word within each document


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cnt_vector = CountVectorizer()
cnt_vector.fit(corpus)

In [None]:
feature_vector = cnt_vector.transform(corpus)
feature_vector

<2x225 sparse matrix of type '<class 'numpy.int64'>'
	with 271 stored elements in Compressed Sparse Row format>

In [None]:
feature_vector.toarray()

array([[ 0,  0,  1,  3,  0, 15,  4,  7,  7,  2,  0,  4,  1,  1,  0,  9,
         1,  0,  0,  1,  1,  1,  2,  2,  7,  1,  6,  2,  1,  3,  1,  1,
         1,  1,  4,  0,  0,  3,  0,  0,  3,  2,  1,  2,  0,  0,  3,  3,
         3,  2,  0,  3,  1,  0,  6,  1,  2,  1,  3,  3,  0,  1,  2,  0,
         0,  0,  0,  2,  4,  1,  2,  0,  1,  3,  1,  1,  6,  1,  3,  4,
         0,  3,  5,  1, 19,  1,  2,  1,  2,  2,  4,  1,  4,  1,  6,  5,
         1,  0,  1,  0,  5,  1,  1,  3, 11,  1,  1,  1,  1,  0,  2,  1,
         1,  6,  1, 12,  1,  6,  3,  0,  1,  2,  3,  3,  1,  2,  2,  6,
         1,  1,  1,  1,  1,  1,  4,  0,  1,  1,  1,  1,  1,  0,  0,  1,
         0,  1,  1,  1,  2,  2,  2,  1,  5,  1,  1,  3,  1,  3,  3,  1,
         4,  1,  5,  1,  2,  2,  1,  1,  1,  0,  3,  1,  1,  1,  6, 17,
         1,  0,  0,  4,  3,  0,  0,  3,  4, 15,  1,  2,  0,  1,  1,  4,
         3,  4,  2,  0,  1,  1,  3,  1,  1,  1,  3,  0,  7,  0,  3,  1,
         3,  8,  1,  3,  0,  1,  2,  1,  1,  0,  3,  4,  0,  1, 

In [None]:
import pandas as pd

# Inspect the vocabulary
vocabs = sorted(cnt_vector.vocabulary_.items()) # sort alphabetically by word, which matches the column order
vocabs = [ item[0] for item in vocabs ] # keep only the words

dtm = pd.DataFrame(
    columns=vocabs,
    data=feature_vector.toarray()
)
dtm

Unnamed: 0,24,45,about,after,alive,all,and,are,at,away,babe,baby,back,ballin,barely,be,before,better,between,blew,borrow,bottom,bridges,burned,but,button,call,called,came,can,care,cars,cause,chances,change,close,come,could,day,daylight,...,to,tomorrow,too,took,top,try,trying,turned,two,up,us,used,valet,ve,want,was,wasn,wasted,way,we,weekend,what,whatever,when,where,while,who,whole,why,with,wiz,wondering,wood,would,wrong,ya,yeah,yesterday,you,your
0,0,0,1,3,0,15,4,7,7,2,0,4,1,1,0,9,1,0,0,1,1,1,2,2,7,1,6,2,1,3,1,1,1,1,4,0,0,3,0,0,...,15,1,2,0,1,1,4,3,4,2,0,1,1,3,1,1,1,3,0,7,0,3,1,3,8,1,3,0,1,2,1,1,0,3,4,0,1,1,31,2
1,1,1,0,0,1,2,5,0,0,0,4,0,2,0,1,0,0,1,2,0,0,0,0,0,2,0,0,0,0,0,0,0,4,0,0,2,7,0,1,1,...,3,0,3,1,0,0,2,0,0,2,2,0,0,1,3,0,0,0,1,2,1,3,0,7,0,0,0,1,0,5,0,0,2,0,0,1,64,0,29,0
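To see where the columns and counts come from, here is a small plain-Python sketch of a document-term matrix. It is not the `CountVectorizer` implementation. It assumes the default settings: text is lowercased and only tokens of two or more word characters are kept, which is why the table above has no column for `i` or `a` but does have fragments such as `ve` and `wasn`. The helper names and the two short documents (the first lines of each song) are chosen only for illustration.

```python
import re
from collections import Counter

# Assumed equivalent of the default token pattern: 2+ word characters
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(doc):
    return TOKEN_PATTERN.findall(doc.lower())

def document_term_matrix(docs):
    """Return (sorted vocabulary, one row of term counts per document)."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocab])
    return vocab, rows

docs = [
    "I'm at a payphone trying to call home",
    "I need a girl like you, yeah yeah",
]
vocab, dtm_rows = document_term_matrix(docs)

print(vocab)
for row in dtm_rows:
    print(row)
```

Each row has one entry per vocabulary term, so most entries are zero for a large vocabulary; that is why the vectorizer returns a sparse matrix.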


# TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer()
tfidf_vect.fit(corpus)

In [None]:
feature_vect = tfidf_vect.transform(corpus)
feature_vect.toarray()

array([[0.        , 0.        , 0.02024321, 0.06072962, 0.        ,
        0.21604811, 0.05761283, 0.14170244, 0.14170244, 0.04048641,
        0.        , 0.08097282, 0.01440321, 0.02024321, 0.        ,
        0.18218885, 0.02024321, 0.        , 0.        , 0.02024321,
        0.02024321, 0.02024321, 0.04048641, 0.04048641, 0.10082245,
        0.02024321, 0.12145923, 0.04048641, 0.02024321, 0.06072962,
        0.02024321, 0.02024321, 0.01440321, 0.02024321, 0.08097282,
        0.        , 0.        , 0.06072962, 0.        , 0.        ,
        0.06072962, 0.04048641, 0.02024321, 0.02880641, 0.        ,
        0.        , 0.04320962, 0.06072962, 0.06072962, 0.04048641,
        0.        , 0.06072962, 0.02024321, 0.        , 0.08641924,
        0.02024321, 0.04048641, 0.02024321, 0.06072962, 0.06072962,
        0.        , 0.02024321, 0.04048641, 0.        , 0.        ,
        0.        , 0.        , 0.04048641, 0.08097282, 0.01440321,
        0.04048641, 0.        , 0.02024321, 0.06

In [None]:
vocabs = sorted(tfidf_vect.vocabulary_.items()) # sort alphabetically by word, which matches the column order
vocabs = [ item[0] for item in vocabs ] # keep only the words

tfidf = pd.DataFrame(
    columns=vocabs,
    data=feature_vect.toarray()
)

tfidf

Unnamed: 0,24,45,about,after,alive,all,and,are,at,away,babe,baby,back,ballin,barely,be,before,better,between,blew,borrow,bottom,bridges,burned,but,button,call,called,came,can,care,cars,cause,chances,change,close,come,could,day,daylight,...,to,tomorrow,too,took,top,try,trying,turned,two,up,us,used,valet,ve,want,was,wasn,wasted,way,we,weekend,what,whatever,when,where,while,who,whole,why,with,wiz,wondering,wood,would,wrong,ya,yeah,yesterday,you,your
0,0.0,0.0,0.020243,0.06073,0.0,0.216048,0.057613,0.141702,0.141702,0.040486,0.0,0.080973,0.014403,0.020243,0.0,0.182189,0.020243,0.0,0.0,0.020243,0.020243,0.020243,0.040486,0.040486,0.100822,0.020243,0.121459,0.040486,0.020243,0.06073,0.020243,0.020243,0.014403,0.020243,0.080973,0.0,0.0,0.06073,0.0,0.0,...,0.216048,0.020243,0.028806,0.0,0.020243,0.020243,0.057613,0.06073,0.080973,0.028806,0.0,0.020243,0.020243,0.04321,0.014403,0.020243,0.020243,0.06073,0.0,0.100822,0.0,0.04321,0.020243,0.04321,0.161946,0.020243,0.06073,0.0,0.020243,0.028806,0.020243,0.020243,0.0,0.06073,0.080973,0.0,0.014403,0.020243,0.446499,0.040486
1,0.016364,0.016364,0.0,0.0,0.016364,0.023286,0.058216,0.0,0.0,0.0,0.065456,0.0,0.023286,0.0,0.016364,0.0,0.0,0.016364,0.032728,0.0,0.0,0.0,0.0,0.0,0.023286,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.046573,0.0,0.0,0.032728,0.114549,0.0,0.016364,0.016364,...,0.03493,0.0,0.03493,0.016364,0.0,0.0,0.023286,0.0,0.0,0.023286,0.032728,0.0,0.0,0.011643,0.03493,0.0,0.0,0.0,0.016364,0.023286,0.016364,0.03493,0.0,0.081502,0.0,0.0,0.0,0.016364,0.0,0.058216,0.0,0.0,0.032728,0.0,0.0,0.016364,0.745164,0.0,0.337652,0.0
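The weights above can be reproduced from raw counts. The sketch below assumes the `TfidfVectorizer` defaults: raw term counts as TF, a smoothed IDF of $\ln\frac{1+n}{1+\mathrm{df}(t)} + 1$ (with $n$ documents and $\mathrm{df}(t)$ documents containing term $t$), and L2 normalization of each row. The function `tfidf_rows` is hypothetical, and the input uses only three columns taken from the DTM above (`about`, `back`, `you`), so the absolute values differ from the full table.

```python
import math

def tfidf_rows(count_rows):
    """Smoothed TF-IDF with L2-normalized rows, computed from raw counts."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of documents in which each term appears
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in count_rows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted))
        result.append([x / norm if norm else 0.0 for x in weighted])
    return result

# Columns: about, back, you (counts copied from the DTM above)
counts = [
    [1, 1, 31],
    [0, 2, 29],
]
rows = tfidf_rows(counts)

# "about" occurs in one document, "back" in both, each once in document 0,
# so their weights differ only by the IDF factor ln(3/2) + 1.
print(rows[0][0] / rows[0][1])
```

The printed ratio is about 1.4055, consistent with the full table, where `about` scores 0.020243 and `back` scores 0.014403 in the first document.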


# Using Vectorized Text

In [None]:
from sklearn.datasets import fetch_20newsgroups

news_data = fetch_20newsgroups(subset='all', random_state=42)

In [None]:
# Check the target names
news_data.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [None]:
# Inspect the data
print(news_data['data'][0])

From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!




In [None]:
news_data['target'][0]

10

In [None]:
# Load the training data with headers, footers, and quoted replies removed
train_news = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), random_state=42)
X_train = train_news['data']
y_train = train_news['target']

test_news = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), random_state=42)
X_test = test_news['data']
y_test = test_news['target']

In [None]:
X_train[:3]

['I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.',
 "A fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't an

In [None]:
tfidf_vect = TfidfVectorizer()

X_train_tfidf_vect = tfidf_vect.fit_transform(X_train)
X_test_tfidf_vect  = tfidf_vect.transform(X_test)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lr_clf = LogisticRegression()

lr_clf.fit(X_train_tfidf_vect, y_train)
pred = lr_clf.predict(X_test_tfidf_vect)

accuracy_score(y_test, pred)

0.6736590546999469

In [None]:
news_baseball_0427="""CNN
 —
It was one of those perfect, historic nights for Adolis García on Saturday as he slugged three home runs in his five hits and added eight runs as the Texas Rangers dismantled the Oakland Athletics 18-3.

Each homer was projected at 400+ feet, combining for an incredible 1,252 feet of home run distance. It capped the Rangers’ dominant night, after they lost the series opener 5-4 on Friday, and marked a career-best performance for García as well as the first eight RBI game by a Ranger since Nelson Cruz more than a decade ago.

“It was an incredible night for me,” García said through interpreter Raul Cardenas, according to MLB.com. “I didn’t expect something like this to happen, but I’m really blessed and thankful for it.

“I was just looking for certain pitches, in a certain zone. I wasn’t trying to do too much and not overthinking it, just trying to make good contact.”

It was an astonishing night for the right fielder. He hit a two-run home run in the first, letting it fly high into the crowd to tie the score, was hit by a pitch in the second, then hit another couple of two-run homers in the third and fifth."""

# Build the TF-IDF vector with the already fitted vectorizer
test_vector = tfidf_vect.transform([news_baseball_0427])

# Predict
lr_clf.predict(test_vector)

array([9])
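The predicted value is an index into `target_names`. The short lookup below uses a copy of the list printed earlier in the notebook to show which category label 9 refers to.

```python
# Copy of news_data.target_names as printed earlier
target_names = [
    'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
    'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
    'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
    'sci.space', 'soc.religion.christian', 'talk.politics.guns',
    'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc',
]

predicted_label = 9
print(target_names[predicted_label])  # rec.sport.baseball
```

The classifier therefore assigned the baseball article to `rec.sport.baseball`. By the same lookup, the target `10` of the first sample shown earlier corresponds to `rec.sport.hockey`.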