# Data Preprocessing

Yiting (Elle) Tsai (yt9mh@virginia.edu)<br>
DS 5001<br>
28 April 2020<br>


## Data

The dataset consists of articles scraped from six news websites. Each record includes the article's **content, date, title, and URL**. There are more than 4,000 articles, all related to the coronavirus. The news sources used in this study are listed below.

- [US News](https://www.usnews.com/news/world/articles)
- [Breitbart](https://www.breitbart.com/)
- [CNN](https://www.cnn.com/)
- [Fox](https://www.foxnews.com/)
- [PowerLine](https://www.powerlineblog.com/)
- [Politico Magazine](https://www.politico.com/section/magazine)

## Overview
    
**Research Question**

1. How does sentiment change from January to March 2020, especially before and after the WHO declared COVID-19 a pandemic on March 11, 2020?
2. Did the announcement of a global pandemic increase public fear?
3. How similar are the news articles over this period? Are articles from different sources similar within the same time window?
4. Which words appear most frequently in the articles over time?
5. Are there distinct topics in the coronavirus news (e.g., medical, physics, political)?

This notebook focuses on data preprocessing. It walks through how I convert the article content into a structured format organized by document and paragraph, and how I build a vocabulary table with stemming. Finally, a TF-IDF table is created for later analysis. The following tables are created in this notebook:

    1. token table
    2. vocabulary table (stemming, lemmatizing)
    3. document-term matrix
    4. TF-IDF table


## Load Packages and Settings

In [2]:
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
import re

In [3]:
OHCO = ['doc_id', 'paragraph_num', 'sentence_num', 'token_num']
SENTS = OHCO[:3]
PARAS = OHCO[:2]
DOCS = OHCO[:1]
bag = DOCS

gradient_cmap = 'YlGnBu'  # cmap for visualization


In [4]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('tagsets')

[nltk_data] Downloading package punkt to /Users/ellesmac/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/ellesmac/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ellesmac/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package tagsets to
[nltk_data]     /Users/ellesmac/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


True

## Read file

In [5]:
df = pd.read_csv('covid19_news.csv')
df.head()

Unnamed: 0,doc_content,doc_date,doc_id,doc_source,doc_title,doc_url,doc_lemma
0,By YANAN WANG and MIKE STOBBE\nBEIJING (AP) — ...,2020-01-08,983556,US News,Chinese Report Says Illnesses May Be From New ...,https://www.usnews.com/news/world/articles/202...,yanan wang mike stobbe beijing ( ap ) — prelim...
1,BEIJING (AP) — Health authorities in a central...,2020-01-10,985558,US News,China Reports 1st Death From New Type of Coron...,https://www.usnews.com/news/health-news/articl...,beijing ( ap ) — health authority central chin...
2,Here are some of the latest health and medical...,2020-01-13,987316,US News,"Health Highlights: Jan. 13, 2020",https://www.usnews.com/news/health-news/articl...,"late health medical news development , compile..."
3,Here are some of the latest health and medical...,2020-01-14,988361,US News,"Health Highlights: Jan. 14, 2020",https://www.usnews.com/news/health-news/articl...,"late health medical news development , compile..."
4,"By MARI YAMAGUCHI, Associated Press\nTOKYO (AP...",2020-01-16,989997,US News,Patient in Japan Confirmed as Having New Virus...,https://www.usnews.com/news/world/articles/202...,"mari yamaguchi , associate press tokyo ( ap ) ..."


In [6]:
df.doc_source.value_counts()

US News              2886
Breitbart             901
CNN                   352
Fox                   330
PowerLine              88
Politico Magazine      62
Name: doc_source, dtype: int64

## Create Library

In [7]:
LIB = df[['doc_id', 'doc_title', 'doc_source']]

## Convert content to OHCO format
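| News source | Paragraph delimiter |
| --- | --- |
| US News | `\n` |
| Breitbart | `\n` |
| CNN | `\n` |
| Fox | none (each article is kept as a single paragraph) |
| PowerLine | none (each article is kept as a single paragraph) |
| Politico Magazine | `.`, `!`, or `?` followed directly by a capital letter, with no space |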


In [8]:
# US News, Breitbart, CNN: paragraphs are separated by '\n'

ubc = df[df['doc_source'].isin(['US News', 'Breitbart', 'CNN'])]  # subset dataframe
para_frames = []  # one paragraph dataframe per article

for i in ubc.index:
    text = ubc.loc[i, 'doc_content']   # get content
    para = [string for string in text.split('\n') if string != ""]  # split by \n and drop empty strings
    para = pd.DataFrame(para, columns=['para_str'])  # save paragraphs to dataframe
    para_frames.append(pd.concat([para], keys=[ubc.loc[i, 'doc_id']], names=['doc_id']))  # add doc_id as the outer index level

para_ubc = pd.concat(para_frames)  # DataFrame.append was removed in pandas 2.0
    
    
    

In [9]:
# Fox, PowerLine: no paragraph delimiter, so each article becomes one paragraph
fp = df[df['doc_source'].isin(['Fox', 'PowerLine'])]
para_frames = []

for i in fp.index:
    para = pd.DataFrame([fp.loc[i, 'doc_content']], columns=['para_str'])
    para_frames.append(pd.concat([para], keys=[fp.loc[i, 'doc_id']], names=['doc_id']))

para_fp = pd.concat(para_frames)  # DataFrame.append was removed in pandas 2.0
    
    

In [10]:
# Politico Magazine: a paragraph break is a '.', '!' or '?' followed directly by a capital letter

pm = df[df['doc_source'] == 'Politico Magazine']
punc_filter = re.compile('([.!?][^A-Z]*)')  # delimiter plus any following non-capital characters
para_frames = []

for pm_i in pm.index:
    text = pm.loc[pm_i, 'doc_content']
    split = punc_filter.split(text)

    # a piece that is only the delimiter means a capital letter (or the end of the text) followed it directly
    sen_index = [0]
    for i, d in enumerate(split):
        if d in ('.', '!', '?'):
            sen_index.append(i + 1)
    sen_index.append(len(split))

    # join the pieces between consecutive break positions into paragraphs
    para = [''.join(split[start:end]) for start, end in zip(sen_index[:-1], sen_index[1:])]
    para = [string for string in para if string != ""]
    para = pd.DataFrame(para, columns=['para_str'])

    para_frames.append(pd.concat([para], keys=[pm.loc[pm_i, 'doc_id']], names=['doc_id']))

para_p = pd.concat(para_frames)  # DataFrame.append was removed in pandas 2.0
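The Politico Magazine rule is easier to follow on a toy string. The sketch below applies the same regular expression to a made-up sentence (the text is hypothetical, not taken from the dataset) in which two paragraphs were fused without a space.

```python
import re

# Toy text (hypothetical): the paragraph break after "rose." lost its space
text = "The virus is spreading. Cases rose.Officials responded quickly."

# Capture a delimiter together with any following non-capital characters
punc_filter = re.compile(r'([.!?][^A-Z]*)')
pieces = punc_filter.split(text)

# A piece that is exactly '.', '!' or '?' was followed directly by a capital letter
breaks = [0] + [i + 1 for i, p in enumerate(pieces) if p in ('.', '!', '?')] + [len(pieces)]
paragraphs = [''.join(pieces[a:b]) for a, b in zip(breaks[:-1], breaks[1:])]
paragraphs = [p for p in paragraphs if p != ""]
print(paragraphs)
```

The ordinary sentence boundary ("spreading. Cases") is captured together with its trailing space, so it is not treated as a paragraph break.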






### Concatenate all paragraph dataframes

In [11]:
para_tot = pd.concat([para_ubc, para_fp, para_p])
para_tot.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,para_str
doc_id,Unnamed: 1_level_1,Unnamed: 2_level_1
983556,0,By YANAN WANG and MIKE STOBBE
983556,1,BEIJING (AP) — A preliminary investigation int...
983556,2,Chinese health authorities did not immediately...
983556,3,Coronaviruses are spread through coughing or s...
983556,4,The novel coronavirus is different from those ...


## Tokenize and Annotate

In [12]:
def tokenize(doc_df, OHCO=OHCO, remove_pos_tuple=False, ws=False):
    
    # Paragraphs to Sentences
    df = doc_df.para_str\
        .apply(lambda x: pd.Series(nltk.sent_tokenize(x)))\
        .stack()\
        .to_frame()\
        .rename(columns={0:'sent_str'})
    
    # Sentences to Tokens
    # Local function to pick tokenizer
    def word_tokenize(x):
        if ws:
            s = pd.Series(nltk.pos_tag(nltk.WhitespaceTokenizer().tokenize(x)))
        else:
            s = pd.Series(nltk.pos_tag(nltk.word_tokenize(x))) # Discards stuff in between
        return s
            
    df = df.sent_str\
        .apply(word_tokenize)\
        .stack()\
        .to_frame()\
        .rename(columns={0:'pos_tuple'})
    
    # Grab info from tuple
    df['pos'] = df.pos_tuple.apply(lambda x: x[1])
    df['token_str'] = df.pos_tuple.apply(lambda x: x[0])
    if remove_pos_tuple:
        df = df.drop(columns='pos_tuple')
    
    # Add index
    df.index.names = OHCO
    
    return df

In [13]:
t = tokenize(para_tot)
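The `apply(...).stack()` idiom in `tokenize` is what turns nested lists into extra index levels. Here is a minimal sketch of that idiom on toy data. To stay runnable without NLTK data it assumes two simple regex splitters (`split_sentences` and `split_words`, both hypothetical stand-ins for `nltk.sent_tokenize` and `nltk.word_tokenize`) and skips POS tagging.

```python
import re
import pandas as pd

# Toy paragraph table indexed by (doc_id, paragraph_num); values are hypothetical
paras = pd.DataFrame(
    {'para_str': ["Cases rose. Officials responded.", "Markets fell."]},
    index=pd.MultiIndex.from_tuples([(1, 0), (1, 1)], names=['doc_id', 'paragraph_num']),
)

# Stand-ins for the NLTK tokenizers (assumption: simple regexes are good enough for a demo)
def split_sentences(s):
    return [x for x in re.split(r'(?<=[.!?])\s+', s) if x]

def split_words(s):
    return re.findall(r"\w+|[^\w\s]", s)

# apply returns one Series per row; stack() moves the positions into a new index level.
# dropna() removes the padding created when rows have different lengths.
sents = paras.para_str.apply(lambda x: pd.Series(split_sentences(x))).stack().dropna().to_frame('sent_str')
toks = sents.sent_str.apply(lambda x: pd.Series(split_words(x))).stack().dropna().to_frame('token_str')
toks.index.names = ['doc_id', 'paragraph_num', 'sentence_num', 'token_num']
print(toks)
```

Each level of the resulting index corresponds to one entry of `OHCO`, which is why grouping by a prefix of that list later selects documents, paragraphs, or sentences as the bag.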

### Create VOCAB table


In [14]:
# tokens = pd.read_csv('data/TOKEN.csv')
# t.set_index(OHCO, inplace = True)
t.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,pos_tuple,pos,token_str
doc_id,paragraph_num,sentence_num,token_num,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
983556,0,0,0,"(By, IN)",IN,By
983556,0,0,1,"(YANAN, NNP)",NNP,YANAN
983556,0,0,2,"(WANG, NNP)",NNP,WANG
983556,0,0,3,"(and, CC)",CC,and
983556,0,0,4,"(MIKE, NNP)",NNP,MIKE


In [15]:
t['term_str'] = t['token_str'].str.lower().str.replace(r'[\W_]', '', regex=True) # lowercase token and strip non-word characters and underscores
t

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,pos_tuple,pos,token_str,term_str
doc_id,paragraph_num,sentence_num,token_num,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
983556,0,0,0,"(By, IN)",IN,By,by
983556,0,0,1,"(YANAN, NNP)",NNP,YANAN,yanan
983556,0,0,2,"(WANG, NNP)",NNP,WANG,wang
983556,0,0,3,"(and, CC)",CC,and,and
983556,0,0,4,"(MIKE, NNP)",NNP,MIKE,mike
...,...,...,...,...,...,...,...
1040722,8,2,2,"(RSS, NNP)",NNP,RSS,rss
1040722,8,2,3,"(feed, NN)",NN,feed,feed
1040722,8,2,4,"(hoards, NNS)",NNS,hoards,hoards
1040722,8,2,5,"(nothing, NN)",NN,nothing,nothing


In [16]:
t.dropna(subset = ['token_str'], inplace = True)

In [18]:
# t.to_csv('TOKEN.csv')
tokens = pd.read_csv('TOKEN.csv')

### Flag numeric tokens

In [19]:
vocab = tokens.term_str.value_counts().sort_index()\
    .rename_axis('term_str').reset_index(name='n')  # one row per distinct term with its count n
vocab.index.name = 'term_id'

In [20]:
vocab['num'] = vocab.term_str.str.match(r"\d+").astype('int') # 1 if the term starts with one or more digits [0-9]
vocab.sample(5)

Unnamed: 0_level_0,term_str,n,num
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
22537,haushalter,4,0
27354,laguna,1,0
1453,2200,20,1
24463,inclusion,10,0
51696,winterseason,1,0


### Get stopwords

In [21]:
sw = pd.DataFrame(nltk.corpus.stopwords.words('english'), columns=['term_str'])
sw = sw.reset_index().set_index('term_str')
sw.columns = ['dummy']
sw['dummy'] = 1
sw.head()

Unnamed: 0_level_0,dummy
term_str,Unnamed: 1_level_1
i,1
me,1
my,1
myself,1
we,1


In [22]:
vocab['stop'] = vocab.term_str.map(sw.dummy)
vocab['stop'] = vocab['stop'].fillna(0).astype('int')
vocab.sample(5)

Unnamed: 0_level_0,term_str,n,num,stop
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
22831,hendra,1,0,0
21923,gunman,4,0,0
51397,what,3162,0,1
31694,murmurs,1,0,0
7579,behaves,3,0,0
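The stopword flag can also be computed without the intermediate `sw` table. The following sketch uses `isin` on a toy vocabulary; `stop_list` is a short hypothetical stand-in for `nltk.corpus.stopwords.words('english')`.

```python
import pandas as pd

# Hypothetical stand-in for the NLTK English stopword list
stop_list = ['the', 'what', 'and']

vocab_demo = pd.DataFrame({'term_str': ['gunman', 'what', 'behaves', 'the']})
vocab_demo['stop'] = vocab_demo.term_str.isin(stop_list).astype('int')  # 1 for stopwords, 0 otherwise
print(vocab_demo)
```

Both approaches should give the same 0/1 column; the `map` version above is kept because it matches the rest of the notebook.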


### Add stems

In [23]:
from nltk.stem.porter import PorterStemmer
stemmer1 = PorterStemmer()
vocab['stem_porter'] = vocab.term_str.apply(stemmer1.stem)

from nltk.stem.snowball import SnowballStemmer
stemmer2 = SnowballStemmer("english")
vocab['stem_snowball'] = vocab.term_str.apply(stemmer2.stem)

from nltk.stem.lancaster import LancasterStemmer
stemmer3 = LancasterStemmer()
vocab['stem_lancaster'] = vocab.term_str.apply(stemmer3.stem)
vocab.sample(5)

Unnamed: 0_level_0,term_str,n,num,stop,stem_porter,stem_snowball,stem_lancaster
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
44718,stabilised,1,0,0,stabilis,stabilis,stabl
31072,mohammad,32,0,0,mohammad,mohammad,mohammad
9026,broadwaythe,1,0,0,broadwayth,broadwayth,broadwayth
32862,nonproduction,1,0,0,nonproduct,nonproduct,nonproduc
32522,nicosia,3,0,0,nicosia,nicosia,nicos


### Add pos_max (most frequent POS tag per term)

In [24]:
tokens['term_id'] = tokens['term_str'].map(vocab.reset_index().set_index('term_str').term_id) # map token with token id (vocab)
vocab['pos_max'] = tokens.groupby(['term_id', 'pos']).pos.count().unstack().idxmax(1) 
vocab.sample(5)

Unnamed: 0_level_0,term_str,n,num,stop,stem_porter,stem_snowball,stem_lancaster,pos_max
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
23639,houseeven,1,0,0,houseeven,houseeven,houseev,NNP
10242,cavalier,5,0,0,cavali,cavali,cava,JJR
28244,linens,5,0,0,linen,linen,lin,NNS
23354,homeas,1,0,0,homea,homea,homea,NNP
16052,doles,2,0,0,dole,dole,dol,VBZ
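The `pos_max` column holds, for each term, the POS tag it received most often. A minimal sketch of the same group, unstack, and `idxmax` pattern on toy tokens follows; it uses `size()` to count rows, which should be equivalent to the `count()` call above when no tags are missing.

```python
import pandas as pd

# Toy token table (hypothetical values): term 0 is tagged NN twice and VB once
tokens_demo = pd.DataFrame({
    'term_id': [0, 0, 0, 1, 1],
    'pos':     ['NN', 'VB', 'NN', 'JJ', 'JJ'],
})

counts = tokens_demo.groupby(['term_id', 'pos']).size().unstack()  # term_id x pos, NaN where a tag never occurs
pos_max = counts.idxmax(axis=1)                                     # column label of the largest count per row
print(pos_max)
```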


### Add term rank

In [25]:
if 'term_rank' not in vocab.columns:
    vocab = vocab.sort_values('n', ascending=False).reset_index()
    vocab.index.name = 'term_rank'
    vocab = vocab.reset_index()
    vocab = vocab.set_index('term_id')
    vocab['term_rank'] = vocab['term_rank'] + 1
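As an alternative sketch, the rank can be computed without re-sorting the table by using `Series.rank`. The toy vocabulary below is hypothetical, and tied counts may be ordered differently than in the sort-based cell above.

```python
import pandas as pd

vocab_demo = pd.DataFrame({'term_str': ['of', 'the', 'to'], 'n': [5, 9, 7]})
vocab_demo.index.name = 'term_id'

# Rank 1 goes to the most frequent term; ties are broken by order of appearance
vocab_demo['term_rank'] = vocab_demo['n'].rank(method='first', ascending=False).astype('int')
print(vocab_demo)
```

Unlike the cell above, this keeps the rows in `term_id` order rather than in rank order.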

In [26]:
vocab.head()

Unnamed: 0_level_0,term_rank,term_str,n,num,stop,stem_porter,stem_snowball,stem_lancaster,pos_max
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
47255,1,the,158255,0,1,the,the,the,DT
47794,2,to,78132,0,1,to,to,to,TO
5390,3,and,64562,0,1,and,and,and,CC
33424,4,of,62684,0,1,of,of,of,IN
24372,5,in,52259,0,1,in,in,in,IN


In [None]:
# tokens.to_csv('data/TOKEN.csv')

In [27]:
tokens = pd.read_csv('TOKEN.csv')
vocab = pd.read_csv('VOCAB.csv').set_index('term_id')
vocab

Unnamed: 0_level_0,term_rank,term_str,n,num,stop,stem_porter,stem_snowball,stem_lancaster,pos_max,df,idf,tfidf_sum
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
0,1,,375706,0,0,,,,",",4618,0.000000,0.000000
47256,2,the,158255,0,1,the,the,the,DT,4615,0.000282,0.069261
47795,3,to,78132,0,1,to,to,to,TO,4490,0.012208,1.398970
5391,4,and,64562,0,1,and,and,and,CC,4440,0.017071,1.611643
33425,5,of,62684,0,1,of,of,of,IN,4506,0.010663,1.030711
...,...,...,...,...,...,...,...,...,...,...,...,...
36319,52798,platformsnetflix,1,0,0,platformsnetflix,platformsnetflix,platformsnetflix,NN,1,3.664454,0.002714
21246,52799,godit,1,0,0,godit,godit,godit,NNP,1,3.664454,0.003558
3620,52800,8after,1,1,0,8after,8after,8after,CD,1,3.664454,0.007650
36316,52801,platesthe,1,0,0,platesth,platesth,platesth,JJ,1,3.664454,0.006799


## Create TF-IDF

<mark>A high TF-IDF weight is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents.</mark>

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)<br>
IDF(t) = log(Total number of documents / Number of documents with term t in it)

This formula is usually stated with the natural logarithm (log_e); the code below uses base 10 (`np.log10`), which only rescales every IDF value by the same constant factor.

Reference: https://towardsdatascience.com/text-summarization-using-tf-idf-e64a0644ace3
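To make the two formulas concrete, here is a small worked sketch on a toy document-term count matrix (three made-up documents and three terms). It uses base-10 logarithms, matching the `np.log10` calls later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy document-term count matrix: rows are documents, columns are terms (hypothetical values)
dtcm = pd.DataFrame(
    {'virus': [2, 1, 1], 'pandemic': [0, 1, 0], 'market': [0, 0, 3]},
    index=pd.Index(['d1', 'd2', 'd3'], name='doc_id'),
)

tf = dtcm.div(dtcm.sum(axis=1), axis=0)   # term count / total terms in the document
df_counts = (dtcm > 0).sum()              # number of documents containing each term
idf = np.log10(len(dtcm) / df_counts)     # log10(N / DF)
tfidf = tf * idf                          # columns of tf align with the idf index

print(tfidf.round(3))
```

A term that appears in every document, such as `virus` here, gets an IDF of 0 and therefore a TF-IDF of 0 everywhere, which is the same effect seen for the most common terms in the tables below.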

### Bag of Words

In [28]:
BOW = tokens.groupby(bag+['term_id']).term_id.count().to_frame().rename(columns={'term_id':'n'}) # document as a bag

BOW['c'] = BOW.n.astype('bool').astype('int')
BOW.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,n,c
doc_id,term_id,Unnamed: 2_level_1,Unnamed: 3_level_1
983556,0.0,70,1
983556,1245.0,1,1
983556,1254.0,1,1
983556,1282.0,1,1
983556,1305.0,1,1


### Document-Term Matrix

In [29]:
count_method = 'n' # 'n' = number of times the term occurs in the document, 'c' = 1 if the term occurs at all
DTCM = BOW[count_method].unstack().fillna(0).astype('int')
DTCM.head()

term_id,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,...,52792.0,52793.0,52794.0,52795.0,52796.0,52797.0,52798.0,52799.0,52800.0,52801.0
doc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
983556,70,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
985558,67,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
987316,25,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
988361,59,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
989997,51,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
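The step from tokens to bag of words to document-term matrix can be sketched on a handful of toy tokens. The ids below are hypothetical, and `size()` is used to count rows per `(doc_id, term_id)` pair.

```python
import pandas as pd

# Toy token table (hypothetical ids): document 1 uses term 10 twice
tokens_demo = pd.DataFrame({
    'doc_id':  [1, 1, 1, 2, 2],
    'term_id': [10, 10, 11, 11, 12],
})

bow = tokens_demo.groupby(['doc_id', 'term_id']).size().to_frame('n')  # long format: one row per (doc, term)
dtcm = bow['n'].unstack().fillna(0).astype('int')                       # wide format: docs x terms, 0 where absent
print(dtcm)
```

The long `bow` table stores only the pairs that occur, while the wide `dtcm` matrix is mostly zeros, so for a vocabulary of this size the wide form uses far more memory.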


In [30]:
DTCM.head(1000).sum(axis=1).plot(); # tokens per article for the first 1,000 articles; most articles have a similar length

### Compute TF

In [31]:
tf_method = 'sum' # sum, max, log, double_norm, raw, binary
tf_norm_k = .5 # only used for double_norm
idf_method = 'standard' # standard, max, smooth

if tf_method == 'sum':
    TF = DTCM.T / DTCM.T.sum()

elif tf_method == 'max':
    TF = DTCM.T / DTCM.T.max()

elif tf_method == 'log':
    TF = np.log10(1 + DTCM.T)
    
elif tf_method == 'raw':
    TF = DTCM.T

elif tf_method == 'double_norm':
    TF = DTCM.T / DTCM.T.max()
    TF = tf_norm_k + (1 - tf_norm_k) * TF[TF > 0] # only nonzero entries are rescaled; zero counts become NaN here

elif tf_method == 'binary':
    TF = DTCM.T.astype('bool').astype('int')
    
TF = TF.T
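The `double_norm` branch masks the matrix with `TF[TF > 0]`, so terms that do not occur in a document end up as NaN rather than 0. The sketch below shows that effect on a toy matrix and one possible alternative that keeps absent terms at 0. The alternative is an assumption about the intended behavior, not part of the original notebook.

```python
import pandas as pd

dtcm = pd.DataFrame({'a': [2, 0], 'b': [1, 4]}, index=['d1', 'd2'])  # toy counts
k = 0.5

scaled = dtcm.T / dtcm.T.max()                       # each document's counts divided by its largest count
masked = k + (1 - k) * scaled[scaled > 0]            # as in the cell above: zero counts become NaN
kept = (k + (1 - k) * scaled).where(scaled > 0, 0)   # alternative: leave zero counts at 0

print(masked.T)
print(kept.T)
```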

In [32]:
TF.head()

term_id,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,...,52792.0,52793.0,52794.0,52795.0,52796.0,52797.0,52798.0,52799.0,52800.0,52801.0
doc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
983556,0.102339,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
985558,0.104688,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
987316,0.116822,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
988361,0.13785,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
989997,0.097889,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Compute IDF

In [33]:
DF = DTCM[DTCM > 0].count()
DF.head()

term_id
0.0    4618
1.0      34
2.0       1
3.0       1
4.0       1
dtype: int64

In [34]:
N = DTCM.shape[0]

In [35]:
print('IDF method:', idf_method)

if idf_method == 'standard':
    IDF = np.log10(N / DF)

elif idf_method == 'max':
    IDF = np.log10(DF.max() / DF) 

elif idf_method == 'smooth':
    IDF = np.log10((1 + N) / (1 + DF)) + 1 # same form as scikit-learn's smooth_idf, but with log base 10 instead of the natural log

IDF method: standard
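To compare the three IDF options side by side, the sketch below evaluates them on a toy set of document frequencies. The values of `N` and `DF` are made up for illustration.

```python
import numpy as np
import pandas as pd

N = 4                                                  # toy number of documents
DF = pd.Series({'the': 4, 'virus': 2, 'laguna': 1})    # toy document frequencies

idf = pd.DataFrame({
    'standard': np.log10(N / DF),
    'max':      np.log10(DF.max() / DF),
    'smooth':   np.log10((1 + N) / (1 + DF)) + 1,
})
print(idf.round(3))
```

With the `standard` method a term that occurs in every document gets a weight of exactly 0, while the `smooth` method keeps it at 1, so such terms still contribute to TF-IDF sums. The `max` method matches `standard` whenever at least one term occurs in all documents.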


### Compute TFIDF

In [36]:
TFIDF = TF * IDF
TFIDF.head()

term_id,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,...,52792.0,52793.0,52794.0,52795.0,52796.0,52797.0,52798.0,52799.0,52800.0,52801.0
doc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
983556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
985558,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
987316,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
988361,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
989997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [37]:
vocab['df'] = DF
vocab['idf'] = IDF
vocab.head()

Unnamed: 0_level_0,term_rank,term_str,n,num,stop,stem_porter,stem_snowball,stem_lancaster,pos_max,df,idf,tfidf_sum
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
0,1,,375706,0,0,,,,",",4618,0.0,0.0
47256,2,the,158255,0,1,the,the,the,DT,4615,0.000282,0.069261
47795,3,to,78132,0,1,to,to,to,TO,4490,0.012208,1.39897
5391,4,and,64562,0,1,and,and,and,CC,4440,0.017071,1.611643
33425,5,of,62684,0,1,of,of,of,IN,4506,0.010663,1.030711


In [38]:
BOW['tf'] = TF.stack()
BOW['tfidf'] = TFIDF.stack()
BOW.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,n,c,tf,tfidf
doc_id,term_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
983556,0.0,70,1,0.102339,0.0
983556,1245.0,1,1,0.001462,0.003031
983556,1254.0,1,1,0.001462,0.003558
983556,1282.0,1,1,0.001462,0.002779
983556,1305.0,1,1,0.001462,0.002093


### Apply TFIDF sum to VOCAB

In [39]:
vocab['tfidf_sum'] = TFIDF.sum()

In [40]:
vocab[['term_rank','term_str','pos_max','tfidf_sum']]\
    .sort_values('tfidf_sum', ascending=False).head(20)\
    .style.background_gradient(cmap=gradient_cmap, high=1)
vocab.head()

Unnamed: 0_level_0,term_rank,term_str,n,num,stop,stem_porter,stem_snowball,stem_lancaster,pos_max,df,idf,tfidf_sum
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
0,1,,375706,0,0,,,,",",4618,0.0,0.0
47256,2,the,158255,0,1,the,the,the,DT,4615,0.000282,0.069261
47795,3,to,78132,0,1,to,to,to,TO,4490,0.012208,1.39897
5391,4,and,64562,0,1,and,and,and,CC,4440,0.017071,1.611643
33425,5,of,62684,0,1,of,of,of,IN,4506,0.010663,1.030711


In [None]:
# vocab = pd.read_csv('data/VOCAB.csv')