<br></br>
# **tf-idf**

## **1 Implementing tf-idf by hand**
Compute tf-idf with a user-defined function.

In [1]:
# The tf-idf formula is implemented directly in a user-defined function
# tf_idf('token', 'document to analyze', 'token list of the corpus documents')

from txtutil import tf_idf
tf_idf('it', 'it can it it', ['can', 'can', 'can', 'it', 'it', 'it'])

0.1013662770270411
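
The source of `txtutil.tf_idf` is not shown in this notebook. As a rough sketch only, the hypothetical `tf_idf_sketch` below (not the author's code) assumes that tf is the substring count of the token divided by the character length of the document string, and that idf is the natural log of the corpus token count divided by one plus the token's count in that list. Under those assumptions it returns the same value as the call above, ln(1.5)/4, but the real implementation may differ.

```python
import math

def tf_idf_sketch(token, document, corpus_tokens):
    """Hypothetical stand-in for txtutil.tf_idf (assumed formula, see text)."""
    # Assumed tf: substring occurrences divided by the string's character length
    tf = document.count(token) / len(document)
    # Assumed idf: log of corpus size over (1 + occurrences of the token)
    idf = math.log(len(corpus_tokens) / (1 + corpus_tokens.count(token)))
    return tf * idf

value = tf_idf_sketch('it', 'it can it it', ['can', 'can', 'can', 'it', 'it', 'it'])
print(value)
```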

In [2]:
# Load the English stopword list and add a few domain-specific words
from nltk.corpus import stopwords
stopwords_eng = stopwords.words('english') +\
                ['samsung','krw','quarter','full','year']
stopwords_eng[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
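
Because the stopword list is consulted once per token in the filtering steps below, a set can make each lookup cheaper than scanning a list. The following is a minimal sketch of that idea; the short `base_stopwords` list is a made-up stand-in for `stopwords.words('english')` so that the example runs without NLTK data.

```python
# Hypothetical short list standing in for stopwords.words('english')
base_stopwords = ['i', 'me', 'my', 'the', 'and']
extra_stopwords = ['samsung', 'krw', 'quarter', 'full', 'year']

# A set gives average O(1) membership tests; a list is scanned linearly
stopwords_set = set(base_stopwords) | set(extra_stopwords)

tokens = ['samsung', 'posted', 'the', 'revenue']
kept = [t for t in tokens if t not in stopwords_set]
print(kept)
```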

<br></br>
## **2 idf: collecting the corpus documents**

In [3]:
# Find the document files
from glob import glob
filelist    = glob('./data/News201*.txt')
filelist

['./data/News2016.txt', './data/News2015.txt', './data/News2017.txt']

In [4]:
# Document -> filtering (stopwords) -> token extraction
import re
docs_tokens = []

for file in filelist:
    with open(file, 'r', encoding='utf-8') as f:
        texts = f.read()
        
    # Preprocess the document
    texts     = texts.lower()
    texts     = texts.replace('\n', ' ')
    tokenizer = re.compile(r'[a-z]\w+') # extract English words
    tokens    = tokenizer.findall(texts)    
    tokens    = [token  for token in tokens   
                 if (len(token) > 2) and (token not in stopwords_eng)]
    docs_tokens += tokens
    
docs_tokens[:5]

['posted', 'trillion', 'consolidated', 'revenue', 'trillion']
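
The same lowercase, regex, and filter steps are applied again to the single target document in the next section, so they could be collected into one helper. The sketch below shows one way to do that; the function name `tokenize` and the tiny stopword set are illustrative assumptions, not part of the notebook.

```python
import re

# Same pattern as the notebook: a lowercase letter followed by word characters
TOKEN_PATTERN = re.compile(r'[a-z]\w+')

def tokenize(text, stopwords):
    """Lowercase the text, extract word tokens, and drop short or stop words."""
    text = text.lower().replace('\n', ' ')
    return [token for token in TOKEN_PATTERN.findall(text)
            if len(token) > 2 and token not in stopwords]

# Made-up sentence and stopword set for illustration
sample_tokens = tokenize('Samsung posted 53 trillion\nin revenue', {'samsung', 'in'})
print(sample_tokens)
```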

In [5]:
# Count the frequency of the extracted tokens
from nltk import FreqDist
import pandas as pd
pd.Series(FreqDist(docs_tokens)).sort_values(ascending=False)[:10]

demand      85
business    80
products    66
company     63
sales       50
earnings    49
market      47
trillion    47
new         40
growth      40
dtype: int64
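
If NLTK and pandas are not needed for anything else, the same kind of frequency table can be obtained with the standard library. This is a small sketch on made-up tokens, using `collections.Counter` (which `FreqDist` subclasses):

```python
from collections import Counter

# Made-up tokens for illustration; in the notebook this would be docs_tokens
sample_tokens = ['demand', 'business', 'demand', 'sales', 'business', 'demand']

counts = Counter(sample_tokens)
top = counts.most_common(2)  # list of (token, count) pairs, most frequent first
print(top)
```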

<br></br>
## **3 tf: extracting tokens from the target document**

In [6]:
# Step 1: load the document
with open('./data/News2017.txt', 'r', encoding='utf-8') as f:
    texts = f.read()
    texts = texts.lower()

# Step 2: extract tokens
texts     = texts.replace('\n', ' ')
tokenizer = re.compile(r'[a-z]\w+')
tokens    = tokenizer.findall(texts)

# Step 3: filter out stopwords and keep only words longer than 2 characters
tokens = [token  for token in tokens   
                 if (len(token) > 2) and (token not in stopwords_eng)]
token_string = " ".join(tokens)
token_string[:300]

'electronics posted trillion consolidated revenue trillion operating profit fourth overall company reported revenue trillion operating profit trillion fourth earnings driven components business largest contribution coming memory business manufactures dram nand orders high performance memory products '

<br></br>
## **4 Computing tf-idf**
Use the idf information and the tf information extracted above to compute and display tf-idf.

In [7]:
%%time
token_set   = list(set(tokens))
result_dict = {}

# Compute tf-idf for every unique token and store the results
for txt in token_set:
    result_dict[txt] = tf_idf(txt, token_string, docs_tokens)
    
print('Calculating is Done.')

Calculating is Done.
CPU times: user 438 ms, sys: 182 µs, total: 438 ms
Wall time: 438 ms
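
The loop above calls `tf_idf` once per unique token, and each call presumably rescans the document and the corpus. As a sketch of an alternative, the hypothetical `tf_idf_all` below counts both token lists once with `Counter` and then scores every token from those counts. It assumes a token-level formula (tf = count / document length in tokens, idf = ln(N / (1 + corpus count))), which is not necessarily what `txtutil.tf_idf` computes, so the scores are not expected to match the notebook's output.

```python
import math
from collections import Counter

def tf_idf_all(doc_tokens, corpus_tokens):
    """Token-level tf-idf for every unique token in doc_tokens (assumed formula)."""
    doc_counts = Counter(doc_tokens)        # document counted once
    corpus_counts = Counter(corpus_tokens)  # corpus counted once
    n_doc, n_corpus = len(doc_tokens), len(corpus_tokens)
    return {
        token: (count / n_doc) * math.log(n_corpus / (1 + corpus_counts[token]))
        for token, count in doc_counts.items()
    }

scores = tf_idf_all(['it', 'can', 'it', 'it'],
                    ['can', 'can', 'can', 'it', 'it', 'it'])
print(scores)
```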


In [8]:
# Display the tf-idf results as a pandas Series
import pandas as pd
tfidf = pd.Series(result_dict)
tfidf.sort_values(ascending=False)[:10]

product     0.013346
business    0.011398
demand      0.010724
products    0.008768
expect      0.008601
one         0.008348
market      0.008310
season      0.007986
earnings    0.007957
sales       0.007560
dtype: float64
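
In this ranking `product` scores higher than `products`, and the short word `one` also ranks highly. One possible explanation, offered only as an assumption because the source of `tf_idf` is not shown, is that the term frequency is counted by substring matching on `token_string`, so `product` is also counted inside `products`. The sketch below uses a made-up string to contrast substring counts with whole-token counts.

```python
# Made-up text for illustration
sample = 'products demand product components smartphone one'
sample_tokens = sample.split()

# str.count matches substrings, so 'product' is also found inside 'products'
product_substring = sample.count('product')
product_token = sample_tokens.count('product')

# 'one' also occurs inside 'components' and 'smartphone'
one_substring = sample.count('one')
one_token = sample_tokens.count('one')

print(product_substring, product_token, one_substring, one_token)
```

If whole-word frequencies are what is wanted, counting over the token list rather than the joined string avoids this inflation.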