<a href="https://colab.research.google.com/github/dscoool/datastructure/blob/main/TF_IDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](https://jeongjaem.in/datastructure/1.png "TF-IDF")

 ◎ TF-IDF (Term Frequency - Inverse Document Frequency):

A statistical measure, based on term frequency (TF) and inverse document frequency (IDF), of how important a word is within a particular document.

> A word that rarely appears in other documents but is used unusually often in this one is a keyword of this document.

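▷ TF (Term Frequency) is the number of times a given word occurs in a document.

The document-term matrix is exactly the table of TF values for every word.

▷ DF (Document Frequency) is the number of documents in which a given word appears. IDF (Inverse Document Frequency) is derived from the inverse of DF (log-scaled in the code below).

A word that appears in most documents, rather than in one particular document, is so common that it can be considered to carry little meaning.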
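
The three quantities can be computed by hand on a very small example. The sketch below uses a made-up, pre-tokenized toy corpus (the documents and variable names are assumptions for illustration only) and the same smoothed IDF formula that the `idf` cell of this notebook uses.

```python
import math

# Toy corpus, already tokenized (hypothetical, for illustration only)
docs = [
    ["chart", "single", "chart"],
    ["single", "song"],
    ["chart", "platform"],
]

n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# TF: raw count of each term in each document
tf = [{term: doc.count(term) for term in vocab} for doc in docs]

# DF: number of documents that contain the term at least once
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed IDF: ln((D + 1) / (df + 1)) + 1
idf = {term: math.log((n_docs + 1) / (df[term] + 1)) + 1 for term in vocab}

for term in vocab:
    print(term, df[term], round(idf[term], 4))
```

Terms that occur in fewer documents ("platform", "song") receive a larger IDF than terms that occur in more documents ("chart", "single").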

![alt text](https://jeongjaem.in/datastructure/2.png "Title")

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np
import json

In [15]:
# Replacement corpus
# Corpus contents entered by hand
corpus = [
'그룹 방탄소년단(BTS) 정국의 두 번째 솔로 싱글 3D가 세계 양대 차트로 불리는 영국 오피셜 싱글 차트 톱 100에 3주 연속 진입했다.20일(현지시간) 공개된 최신 차트에 따르면 3D는 전주보다 열 계단 하락한 32위를 기록했다.',
'3D는 닿을 수 없는 상대방에 대한 마음을 1·2·3차원의 시선이란 소재로 재미있게 풀어낸 알앤비(R&B) 팝 장르 노래다.',
'이 노래는 영국 싱글 차트 5위로 처음 진입해 22위, 32위를 기록하며 인기를 이어가고 있다.',
'세계를 무대로 활동하는 한국인 DJ 겸 프로듀서 페기 구의 나나나(NANANA)는 42위로 18주 연속 진입해 롱런을 이어갔다.',
'한편, 정국의 솔로 데뷔곡 세븐(Seven)은 세계 최대 스트리밍 플랫폼 스포티파이가 발표한 위클리 톱 송 글로벌에서 2위로 14주 연속 진입했다. 3D는 10위로 3주 연속 톱 10을 지켰다.'
]

News source: https://news.kbs.co.kr/news/pc/view/view.do?ncd=7798660


In [16]:
# Build the document-term matrix (raw term counts), the TF step of TF-IDF
vect = CountVectorizer()                                # use CountVectorizer
document_term_matrix = vect.fit_transform(corpus)

In [17]:
# Convert document_term_matrix into a DataFrame
tf = pd.DataFrame(document_term_matrix.toarray(), columns=vect.get_feature_names_out())
# Display the converted result
tf

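|  | 100에 | 10위로 | 10을 | 14주 | 18주 | 20일 | 22위 | 2위로 | 32위를 | 3d가 | ... | 최신 | 페기 | 풀어낸 | 프로듀서 | 플랫폼 | 하락한 | 한국인 | 한편 | 현지시간 | 활동하는 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 4 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |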


In [18]:
D = len(tf)
df = tf.astype(bool).sum(axis=0)
idf = np.log((D+1) / (df+1)) + 1

In [19]:
df

100에    1
10위로    1
10을     1
14주     1
18주     1
       ..
하락한     1
한국인     1
한편      1
현지시간    1
활동하는    1
Length: 83, dtype: int64

In [21]:
idf

100에    2.098612
10위로    2.098612
10을     2.098612
14주     2.098612
18주     2.098612
          ...   
하락한     2.098612
한국인     2.098612
한편      2.098612
현지시간    2.098612
활동하는    2.098612
Length: 83, dtype: float64
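
The value 2.098612 printed above can be checked by hand. With $D = 5$ documents and a term that occurs in exactly one of them ($\mathrm{df}(t) = 1$), the smoothed formula from the `idf` cell gives:

$$
\mathrm{idf}(t) = \ln\frac{D + 1}{\mathrm{df}(t) + 1} + 1 = \ln\frac{6}{2} + 1 = \ln 3 + 1 \approx 2.098612
$$

For a term that occurs in two documents the same formula gives $\ln\frac{6}{3} + 1 = \ln 2 + 1 \approx 1.693147$, and for a term that occurs in all five documents it gives $\ln\frac{6}{6} + 1 = 1$. The trailing $+1$ therefore keeps even the most common terms from being weighted down to zero.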

In [22]:
# TF-IDF (Term Frequency-Inverse Document Frequency)
tfidf = tf * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True) # L2-normalize each row (document)

In [23]:
tfidf

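|  | 100에 | 10위로 | 10을 | 14주 | 18주 | 20일 | 22위 | 2위로 | 32위를 | 3d가 | ... | 최신 | 페기 | 풀어낸 | 프로듀서 | 플랫폼 | 하락한 | 한국인 | 한편 | 현지시간 | 활동하는 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.186637 | 0.0 | 0.0 | 0.0 | 0.0 | 0.186637 | 0.0 | 0.0 | 0.150578 | 0.186637 | ... | 0.186637 | 0.0 | 0.0 | 0.0 | 0.0 | 0.186637 | 0.0 | 0.0 | 0.186637 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.272686 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.298082 | 0.0 | 0.24049 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.257347 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.257347 | 0.0 | 0.257347 | 0.0 | 0.0 | 0.257347 | 0.0 | 0.0 | 0.257347 |
| 4 | 0.0 | 0.21568 | 0.21568 | 0.21568 | 0.0 | 0.0 | 0.0 | 0.21568 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.21568 | 0.0 | 0.0 | 0.21568 | 0.0 | 0.0 |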


* Reference: [Python] Feature Extraction - TF-IDF <br>
https://python-explorer.tistory.com/31

## Example 2
This time, let's run TF-IDF on a broadcast-content corpus taken from aihub.or.kr. 😊

** Source data: Broadcast content Korean-Chinese / Korean-Japanese translation parallel corpus <br>
(https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=71263)
<br>
** File: broadcast.json <br>
Download - (https://bit.ly/3Dh5lwU)

In [26]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np
import json

# Load broadcast.json (download it from the course board first).
# JSON is a semi-structured text format; document-oriented NoSQL stores commonly use it.
with open('broadcast.json', encoding='UTF-8') as json_file:
    json_data = json.load(json_file)

# The records are stored under the "data" key
broadcasting_script = json_data["data"]
# Display the loaded records
broadcasting_script


Here we will take only the 'ko' field.

In [None]:
broadcasting_script[2]['ko']

'BBB의 방송은 딱딱한 내용이 적고, 한국의 사회상황이나 문화적인 내용이 많기 때문에 항상 즐겁게 듣고 있습니다.'

In [None]:
len(broadcasting_script)

360000
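
The load cell above needs `broadcast.json` to be present in the working directory. A guarded loader can make a missing file easier to diagnose. The sketch below is only an illustration: `load_records` is a hypothetical helper, and the sample file it writes contains nothing but the `"data"` list and `'ko'` field that this notebook actually uses.

```python
import json
import tempfile
from pathlib import Path

def load_records(path):
    """Load the list stored under the "data" key of a JSON file."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(
            f"{path} not found: download or upload the file before running this cell"
        )
    with path.open(encoding="utf-8") as f:
        return json.load(f)["data"]

# Self-contained demonstration with a tiny stand-in file (hypothetical content)
sample = {"data": [{"ko": "안녕하세요"}, {"ko": "감사합니다"}]}
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.json"
    sample_path.write_text(json.dumps(sample, ensure_ascii=False), encoding="utf-8")
    records = load_records(sample_path)

    missing_raises = False
    try:
        load_records(Path(tmp) / "missing.json")
    except FileNotFoundError:
        missing_raises = True

print(len(records), records[0]["ko"])
```

With the real file, `broadcasting_script = load_records('broadcast.json')` would play the same role as the load cell.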

The data is long, so we will take only the first 100 records.

In [None]:
bs = broadcasting_script[:100]

Using a for loop, let's read every [n]['ko'] value and store it in a list named 'corpus2'. <br>
In other words, we build the corpus by extracting only the broadcasting_script[n]['ko'] text from each record.

In [None]:
# Collect the Korean sentence ('ko') of each record
corpus2 = []
for cor in bs:
    corpus2.append(cor['ko'])
corpus2

['그 사이트에서는 인터넷에 의한 라디오 방송도 보급되어 있기 때문에 프로그램이 없어지는 것이 아니라면 별 문제는 없는 것이 아닐까라고 쓰여져 있었습니다.',
 '> 알겠습니다.',
 'BBB의 방송은 딱딱한 내용이 적고, 한국의 사회상황이나 문화적인 내용이 많기 때문에 항상 즐겁게 듣고 있습니다.',
 '그 아티스트를 상징하는 색깔이 있다고 해요.',
 "여러분의 '버킷리스트'는 무엇인가요?",
 "마지막 커튼콜에서 AAA씨가 등장한 순간, 객석이 모두 일어서 박수 함성을 보냈고, 회장에서는 '카가와!'라는 외치는 소리도 나오고 있었습니다.",
 '미국과 멕시코 출신의 2인조 아티스트 friends with you가 제작한 오브제.',
 '품위없이 헛들은 한글이라 죄송합니다.',
 '여기서는 보였죠.',
 '무조건 소리 질러주세요.',
 '정말 감사했습니다.',
 "바로 진실 씨의 '생일 코너'입니다.",
 '>일단 조건만 되면 응모해 본다 이런 걸까요?',
 '마론 씨, 안녕하세요',
 "트럼프 대통령은 대화에 전향적인 자세를 보였지만 '비핵화 의사 표시와 구체적 행동'이라는 대화의 조건이 달라진 것은 아닙니다.",
 '> 몰랐어요..',
 '"듣기 힘들 때도 있지만 매일 즐겁게 듣고 있습니다"',
 '일본식민지시대였던 1943년 기생을 교육하는 권봉이라는 기관을 무대로 한 이야기입니다.',
 '헌법 9조에 자위대 근거 조항 추가 움직임을 우리 국민들은 어떻게 보고 있습니까?',
 '이런 주제인데 저도 모르게 길어지고 말았습니다.',
 '아, 하지만 지금으로서는 돈도 모으지 않았고, 주식도 금괴도 아무것도 없으니까 필요없는건가(웃음).',
 '>차별이네요.',
 "악기소리에 집중해서 들었는데, AAA 교수의 '하루'라는 곡은 아침부터 저녁까지 하루의 변화, 감성의 변화를 음악에 잘 녹여낸 것 같아요.",
 '> 서울대 주한규 교수는 "태양광은 연평균 이용률이 15% 정도로 낮은 편이며 패널이 더러워지면 발전량도 크게 감소할 수밖에 없다"고 설명했다.',
 

Now let's use CountVectorizer() to <br>
apply TF-IDF to the loaded sentences.

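#### TF (Term Frequency) is a measure of how many times, that is how often, a term appears in a document.
For example, if the word 'apple' appears 3 times in document 1, <br>
the TF of 'apple' in DOC1 is 3. <br>
Raw counts like this are often normalized rather than reported as-is. In this notebook the raw counts are kept as TF, and normalization is applied to the final TF-IDF rows.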
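
The 'apple' example can be reproduced in a few lines. This is a minimal sketch with a made-up token list. Dividing by the document length is shown as one common way to normalize TF and is an assumption here, not the normalization this notebook applies.

```python
from collections import Counter

# Hypothetical tokenized document, for illustration only
doc1 = ["apple", "pear", "apple", "banana", "apple", "pear"]

counts = Counter(doc1)
raw_tf = counts["apple"]            # number of occurrences of the term
relative_tf = raw_tf / len(doc1)    # one common length normalization

print(raw_tf)       # 3
print(relative_tf)  # 0.5
```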


In [None]:
# Document-term matrix (the TF step of TF-IDF)
vect = CountVectorizer()                                # use CountVectorizer
document_term_matrix = vect.fit_transform(corpus2)       # build the document-term matrix

The document-term matrix has been built. <br>
Next, the resulting document_term_matrix is
1. converted into a DataFrame named tf, and
2. displayed.

In [None]:
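# Convert document_term_matrix into a DataFrame
# (get_feature_names() was removed from recent scikit-learn; use get_feature_names_out())
tf = pd.DataFrame(document_term_matrix.toarray(), columns=vect.get_feature_names_out())
# Display the converted result
tf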



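|  | 000엔에 | 100회 | 10을 | 10의 | 12시간의 | 15 | 16 | 1943년 | 1996년부터 | 1월 | ... | 행복감을 | 행위는 | 헌법 | 헛들은 | 형용사 | 활약을 | 활약하고 | 회장에서는 | 흥미로웠어요 | 힘들 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 96 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 97 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 98 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |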


In [None]:
D = len(tf)
df = tf.astype(bool).sum(axis=0)
idf = np.log((D+1) / (df+1)) + 1             # IDF (Inverse Document Frequency)


In [None]:
# TF-IDF (Term Frequency-Inverse Document Frequency)
tfidf = tf * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

In [None]:
tfidf

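|  | 000엔에 | 100회 | 10을 | 10의 | 12시간의 | 15 | 16 | 1943년 | 1996년부터 | 1월 | ... | 행복감을 | 행위는 | 헌법 | 헛들은 | 형용사 | 활약을 | 활약하고 | 회장에서는 | 흥미로웠어요 | 힘들 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 96 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 97 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 98 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |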


Now let's look up the TF-IDF values for the search term '라디오' (radio).


In [None]:
tfidf['라디오']

0     0.233285
1     0.000000
2     0.000000
3     0.000000
4     0.000000
        ...   
95    0.000000
96    0.000000
97    0.000000
98    0.000000
99    0.000000
Name: 라디오, Length: 100, dtype: float64

Searching for the word '라디오' tells us
which document numbers contain it.

In [None]:
search_word=['라디오']
result=tfidf[search_word]
r=result[(result > 0).any(axis=1)]   # keep rows where the word has a positive weight
d=r.index.values.astype(int)[0]      # index of the first matching document
print("Search term:", search_word)
print("Search result: \n")
corpus2[d]


Search term: ['라디오']
Search result: 


'그 사이트에서는 인터넷에 의한 라디오 방송도 보급되어 있기 때문에 프로그램이 없어지는 것이 아니라면 별 문제는 없는 것이 아닐까라고 쓰여져 있었습니다.'

Searching for the word '패널이' ("panel" with its subject particle) tells us
which document numbers contain it.

In [None]:
search_word=['패널이']
result=tfidf[search_word]
r=result[(result > 0).any(axis=1)]   # keep rows where the word has a positive weight
d=r.index.values.astype(int)[0]      # index of the first matching document
print("Search term:", search_word)
print("Search result: \n")
corpus2[d]

Search term: ['패널이']
Search result: 


'> 서울대 주한규 교수는 "태양광은 연평균 이용률이 15% 정도로 낮은 편이며 패널이 더러워지면 발전량도 크게 감소할 수밖에 없다"고 설명했다.'
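
The two search cells above repeat the same steps and return only the first match. They could be wrapped in a small helper that ranks every matching document by its TF-IDF weight. The sketch below is an illustration only: `search_documents` and the tiny demo table are hypothetical, and the helper assumes the DataFrame has the default 0..n-1 index so that row labels line up with positions in the corpus list.

```python
import pandas as pd

def search_documents(tfidf, corpus, word, top_k=3):
    """Return (row index, score, text) for rows with a positive weight for `word`."""
    if word not in tfidf.columns:
        return []  # the word is not in the vocabulary
    scores = tfidf[word]
    hits = scores[scores > 0].sort_values(ascending=False).head(top_k)
    return [(int(i), float(s), corpus[int(i)]) for i, s in hits.items()]

# Tiny hand-made weight table standing in for the real tfidf DataFrame
demo_corpus = ["radio broadcast online", "solar panel output", "radio radio show"]
demo_tfidf = pd.DataFrame(
    {"radio": [0.4, 0.0, 0.9], "panel": [0.0, 0.7, 0.0]}
)

results = search_documents(demo_tfidf, demo_corpus, "radio")
for idx, score, text in results:
    print(idx, score, text)
```

With the notebook's own objects, the call would be `search_documents(tfidf, corpus2, '라디오')`. Returning an empty list for an unknown word avoids the `KeyError` that direct column indexing would raise.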

Reference:

1. Searching for similar documents with the NMF algorithm, and its implementation, https://bcho.tistory.com/1216

2. AiHub, https://aihub.or.kr/aihubdata/data/list.do?pageIndex=1&currMenu=115&topMenu=100&dataSetSn=&srchdataClCode=DATACL001&srchOrder=&SrchdataClCode=DATACL002&searchKeyword=&srchDataRealmCode=REALM002&srchDataTy=DATA003

3. A keyword extraction technique for electronic news using variants of TF-IDF, https://koreascience.kr/article/JAKO200910348031067.pdf