<br></br>
# **tf-idf**

## **1 Loading the data**
[**Annual earnings report**](https://news.samsung.com/global/samsung-electronics-announces-fourth-quarter-and-fy-2017-results)

In [1]:
# 2017 annual earnings report
with open('./data/News2017.txt', 'r', encoding='utf-8') as f:
    texts = f.read()
texts = texts.lower()

In [2]:
from nltk.tokenize import RegexpTokenizer
re_capt = RegexpTokenizer(r'[a-z]\w+')
token   = re_capt.tokenize(texts)
token[:5]

['samsung', 'electronics', 'posted', 'krw', 'trillion']

In [3]:
from nltk.corpus import stopwords
stopwords_eng = stopwords.words('english') +\
                ['samsung','krw','profit','quarter','full','year']
stopwords_eng[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [4]:
texts = [txt    for txt in token   
                if txt not in stopwords_eng]

from nltk import FreqDist
import pandas as pd
token      = FreqDist(texts)
token_freq = pd.Series(token).sort_values(ascending=False)
token_freq[:10]

demand      33
business    32
products    26
earnings    22
sales       21
mobile      19
company     18
new         18
due         17
expected    17
dtype: int64

In [5]:
document = ' '.join(texts)
document[:500]

'electronics posted trillion consolidated revenue trillion operating fourth overall company reported revenue trillion operating trillion fourth earnings driven components business largest contribution coming memory business manufactures dram nand orders high performance memory products servers mobile storage strong however weak seasonality impacted growth system lsi foundry businesses display panel business manufactures oled lcd screens saw increased shipments oled panels premium smartphones prof'

<br></br>
## **2 tf idf**
Reweight the word frequencies in the report text.

In [6]:
# ! pip3 install scikit-learn

In [7]:
# ! pip3 install scipy

In [8]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec   = TfidfVectorizer()
transformed = tfidf_vec.fit_transform(raw_documents = [document])
tfidf_vec.vocabulary_

{'electronics': 148,
 'posted': 348,
 'trillion': 486,
 'consolidated': 88,
 'revenue': 401,
 'operating': 318,
 'fourth': 190,
 'overall': 327,
 'company': 76,
 'reported': 393,
 'earnings': 141,
 'driven': 135,
 'components': 82,
 'business': 54,
 'largest': 237,
 'contribution': 95,
 'coming': 73,
 'memory': 282,
 'manufactures': 271,
 'dram': 133,
 'nand': 296,
 'orders': 323,
 'high': 205,
 'performance': 337,
 'products': 362,
 'servers': 431,
 'mobile': 289,
 'storage': 459,
 'strong': 467,
 'however': 210,
 'weak': 508,
 'seasonality': 416,
 'impacted': 215,
 'growth': 202,
 'system': 471,
 'lsi': 259,
 'foundry': 189,
 'businesses': 55,
 'display': 125,
 'panel': 331,
 'oled': 313,
 'lcd': 243,
 'screens': 413,
 'saw': 410,
 'increased': 222,
 'shipments': 434,
 'panels': 332,
 'premium': 352,
 'smartphones': 444,
 'profitability': 363,
 'decreased': 112,
 'due': 137,
 'dampened': 105,
 'sales': 409,
 'asp': 37,
 'communications': 75,
 'division': 128,
 'im': 212,
 'declined':

In [9]:
# Convert the sparse matrix to a dense array (one row per document)
transformed   = transformed.toarray()
# Invert the vocabulary: column index -> term
index_value   = {i[1]:i[0] for i in tfidf_vec.vocabulary_.items()}
# Map each term to its tf-idf weight
fully_indexed = {index_value[column]:value  for row in transformed  
                                            for (column,value) in enumerate(row)}

token_tfidf = pd.Series(fully_indexed).sort_values(ascending=False)
token_tfidf[:10]

demand      0.293336
business    0.284447
products    0.231113
earnings    0.195557
sales       0.186669
mobile      0.168891
new         0.160002
company     0.160002
market      0.151113
due         0.151113
dtype: float64
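
Because only one document is passed to `fit_transform`, every term has the same document frequency, so the idf factor is constant and the weights above are just the raw counts scaled to unit length (note that 0.293336 / 0.284447 = 33 / 32). A minimal sketch of that arithmetic follows, assuming `TfidfVectorizer`'s default settings (smoothed idf, L2 normalization). The helper name `single_doc_tfidf` and the three-word count table are illustrative only, so the absolute values differ from the full-vocabulary output.

```python
import math

def single_doc_tfidf(counts):
    """Weights for a corpus of exactly one document (hypothetical helper)."""
    n_docs, df = 1, 1                            # one document, so every term has df = 1
    idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf; equals 1.0 here
    weighted = {term: tf * idf for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weighted.values()))  # L2 norm
    return {term: w / norm for term, w in weighted.items()}

# Counts copied from the frequency table in section 1; the rest of the vocabulary is omitted
counts  = {'demand': 33, 'business': 32, 'products': 26}
weights = single_doc_tfidf(counts)
print(weights['demand'] / weights['business'])   # same ratio as the raw counts, 33 / 32
```

With a single document the ranking therefore matches the plain frequency ranking; the idf term only starts to matter once several documents are fitted together.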

<br></br>
## **3 Implementing tf-idf directly**
Compute tf-idf with a user-defined function.

In [10]:
# The first step is to build the list of documents
from py.txtutil import tf_idf
tf_idf('it', 'it can it it', ['can', 'can','can', 'it', 'it', 'it' ])

0.1013662770270411
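
The `py.txtutil` module is not shown in this notebook. The sketch below is one hypothetical definition that happens to reproduce the value above. It assumes the term frequency is a substring count divided by the character length of the text, and the idf is the log of the list length over one plus the number of entries containing the word. Treat it as a guess at the arithmetic, not the module's actual code.

```python
import math

def tf_idf_sketch(word, text, docs):
    """Hypothetical stand-in for py.txtutil.tf_idf (the real source is not shown)."""
    tf = text.count(word) / len(text)                # substring count / number of characters
    n_containing = sum(1 for doc in docs if word in doc)
    idf = math.log(len(docs) / (1 + n_containing))   # +1 avoids division by zero
    return tf * idf

score = tf_idf_sketch('it', 'it can it it', ['can', 'can', 'can', 'it', 'it', 'it'])
print(score)
```

Under these assumptions both factors work on substrings, so `'product'` would also be counted inside `'products'`; a word-level version would split the text into tokens first.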

In [11]:
import re
from glob import glob
filelist    = glob('./data/News201*.txt')
docs_tokens = []
tokenizer   = re.compile(r'[a-z]\w+') # extract English words

for file in filelist:
    with open(file, 'r', encoding='utf-8') as f:
        texts = f.read()
    texts  = texts.lower().replace('\n', ' ')
    tokens = tokenizer.findall(texts)
    tokens = [token  for token in tokens
              if (len(token) > 2) and (token not in stopwords_eng)]
    # Note: tokens from every file are concatenated into one flat list
    docs_tokens += tokens

In [12]:
from nltk import FreqDist
import pandas as pd
pd.Series(FreqDist(docs_tokens))[:5]

posted          18
trillion        47
consolidated    15
revenue         23
operating       21
dtype: int64

In [13]:
# Target document to analyze
with open('./data/News2017.txt', 'r', encoding='utf-8') as f:
    texts = f.read()
texts = texts.lower().replace('\n', ' ')
tokenizer = re.compile(r'[a-z]\w+') # extract English words
tokens = tokenizer.findall(texts)

tokens = [token  for token in tokens   
                 if (len(token) > 2) and (token not in stopwords_eng)]

token_string = " ".join(tokens)
token_string[:500]

'electronics posted trillion consolidated revenue trillion operating fourth overall company reported revenue trillion operating trillion fourth earnings driven components business largest contribution coming memory business manufactures dram nand orders high performance memory products servers mobile storage strong however weak seasonality impacted growth system lsi foundry businesses display panel business manufactures oled lcd screens saw increased shipments oled panels premium smartphones prof'

In [14]:
%%time
from py.txtutil import tf_idf
token_set = list(set(tokens))
result_dict = {}

# Run the computation for each unique word of three or more letters
for txt in token_set:
    result_dict[txt] = tf_idf(txt, texts, docs_tokens)
print('Calculating is Done.')

Calculating is Done.
CPU times: user 517 ms, sys: 416 µs, total: 518 ms
Wall time: 517 ms


In [15]:
# Display the computed TF-IDF results with pandas
import pandas as pd
tfidf = pd.Series(result_dict)
tfidf.sort_values(ascending=False)[:20]

product        0.009714
business       0.008296
demand         0.007806
products       0.006383
expect         0.006262
one            0.006077
market         0.006049
season         0.005815
earnings       0.005793
sales          0.005504
new            0.005422
mobile         0.005261
expand         0.005204
seasonal       0.005162
smartphone     0.005125
seasonality    0.004928
strong         0.004928
grow           0.004892
high           0.004855
expected       0.004760
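
In the loop that builds `docs_tokens`, the tokens of all files are concatenated into one flat list, so whatever `tf_idf` computes over it, the entries it sees are single tokens rather than whole reports. If a per-report idf is wanted, one option is to keep one token list per file. A minimal word-level sketch under that assumption follows; `tf_idf_by_document` and the toy documents are made up for illustration and are not part of `py.txtutil`.

```python
import math
from collections import Counter

def tf_idf_by_document(docs):
    """docs: list of token lists, one per document. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in docs for term in set(tokens))
    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total  = len(tokens)
        scores.append({term: (count / total) * math.log(n_docs / df[term])
                       for term, count in counts.items()})
    return scores

# Toy documents (invented for illustration, not the Samsung reports)
docs = [['memory', 'demand', 'demand'],
        ['memory', 'display', 'panel'],
        ['memory', 'mobile', 'demand']]
scores = tf_idf_by_document(docs)
print(scores[0])
```

A term that appears in every document, such as `'memory'` here, gets an idf of zero and drops out of the ranking, which is the reweighting effect that a single flat token list cannot show.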
dtype: float64