# Hierarchical TF-IDF

1. Clustering in a graph by topic
2. Finding key words within a topic

[Data source](https://www.kaggle.com/snapcrack/all-the-news#articles1.csv)

Load the data into a pandas DataFrame.

In [1]:
import pandas as pd
dataset = pd.read_csv("data/articles1.csv")
dataset.head()

| Unnamed: 0.1 | Unnamed: 0 | id | title | publication | author | date | year | month | url | content |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 17283 | House Republicans Fret About Winning Their Hea... | New York Times | Carl Hulse | 2016-12-31 | 2016.0 | 12.0 | | WASHINGTON — Congressional Republicans have... |
| 1 | 1 | 17284 | Rift Between Officers and Residents as Killing... | New York Times | Benjamin Mueller and Al Baker | 2017-06-19 | 2017.0 | 6.0 | | After the bullet shells get counted, the blood... |
| 2 | 2 | 17285 | Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ... | New York Times | Margalit Fox | 2017-01-06 | 2017.0 | 1.0 | | When Walt Disney’s “Bambi” opened in 1942, cri... |
| 3 | 3 | 17286 | Among Deaths in 2016, a Heavy Toll in Pop Musi... | New York Times | William McDonald | 2017-04-10 | 2017.0 | 4.0 | | Death may be the great equalizer, but it isn’t... |
| 4 | 4 | 17287 | Kim Jong-un Says North Korea Is Preparing to T... | New York Times | Choe Sang-Hun | 2017-01-02 | 2017.0 | 1.0 | | SEOUL, South Korea — North Korea’s leader, ... |


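Tokenize and clean the articles with NLTK: lowercase, strip punctuation, remove stop words, and stem.

**Warning:** this step takes a while to run, so the cell below is disabled (commented out and wrapped in a string literal).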

In [2]:
#import nltk
#from nltk.corpus import stopwords
#from nltk.stem.porter import PorterStemmer
#import string

#texts = list(dataset.content)
#cleaned_texts = []
'''
stopword_set = set(stopwords.words('english'))

stemmer = PorterStemmer()

for text in texts[1:5]:
    clean = text.lower()
    
    # https://stackoverflow.com/questions/15547409/how-to-get-rid-of-punctuation-using-nltk-tokenizer
    translate_table = dict((ord(char), None) for char in string.punctuation)
    
    clean = clean.translate(translate_table)
    clean = nltk.word_tokenize(clean)
    clean = set(clean).difference(stopword_set)
    stemmed = []
    for token in clean:
        stemmed.append(stemmer.stem(token))
    cleaned_texts.append(stemmed)
    '''

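The disabled cell strips punctuation with a translation table built from `string.punctuation`. A minimal standalone sketch of that one step is shown here; the helper name `strip_punctuation` and the sample strings are made up for illustration, and `str.maketrans` is used as an equivalent way to build the same table.

```python
import string

# Map every ASCII punctuation character to None (i.e. delete it).
# Built once, outside any loop over articles.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Lowercase the text and delete ASCII punctuation characters."""
    return text.lower().translate(PUNCT_TABLE)

print(strip_punctuation("Hello, World! It's 2017."))  # hello world its 2017
print(strip_punctuation("isn’t"))                     # isn’t (unchanged)
```

`string.punctuation` covers ASCII punctuation only, so the curly quotes and long dashes that appear in these articles would pass through this step untouched.
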
Build the TF-IDF matrix with scikit-learn's `TfidfVectorizer`, which also accepts a custom tokenizer function.

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk import RegexpTokenizer
import numpy as np

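# Requires the NLTK stop-word corpus (nltk.download('stopwords')).
stopword_set = set(stopwords.words('english'))
tokenizer = RegexpTokenizer(r'\w+')  # runs of word characters; drops punctuation

def tokenize(textthing):
    """Lowercase, split into word tokens, and drop English stop words."""
    textthing = textthing.lower()
    #textthing = nltk.word_tokenize(textthing)
    textthing = tokenizer.tokenize(textthing)
    # Note: set() also removes repeated tokens, so each term is counted at
    # most once per article and the term-frequency part is effectively binary.
    textthing = list(set(textthing).difference(stopword_set))
    return textthing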

texts = list(dataset.content)
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words=None)
tfs = tfidf.fit_transform(texts)
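
Because `tokenize` passes its tokens through `set()`, repeated words are collapsed before they are counted. The following is a minimal sketch, in plain Python rather than scikit-learn, of how that affects the weights. It assumes scikit-learn's default smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization; the three toy documents, the miniature stop-word set, and the `dedupe` flag are all made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical miniature stop-word list; the notebook uses NLTK's English list.
STOPWORDS = {"the", "a", "of", "and", "to", "in"}

def tokenize(text, dedupe=True):
    """Lowercase, split on runs of word characters, drop stop words."""
    tokens = [t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS]
    # dedupe=True mimics the notebook's set() step.
    return sorted(set(tokens)) if dedupe else tokens

def tfidf_weights(docs, dedupe=True):
    """Return one {term: weight} dict per document (smoothed IDF, L2 norm)."""
    tokenized = [tokenize(d, dedupe) for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weights.append({t: w / norm for t, w in raw.items()})
    return weights

docs = [
    "the senate vote and the senate budget and a filibuster",
    "the senate budget",
    "a budget vote in the senate",
]

deduped = tfidf_weights(docs, dedupe=True)[0]
counted = tfidf_weights(docs, dedupe=False)[0]
print(max(deduped, key=deduped.get))  # filibuster
print(max(counted, key=counted.get))  # senate
```

With deduplication, every term in a document has a term frequency of 1, so the ranking within a document follows IDF alone. That is consistent with the rare words at the top of the lists printed further down.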

In [4]:
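def top_words(articletext, n):
    """Return the n terms with the highest TF-IDF weight in articletext."""
    article = tfidf.transform([articletext])
    #print(article)

    # https://stackoverflow.com/questions/34232190/scikit-learn-tfidfvectorizer-how-to-get-top-n-terms-with-highest-tf-idf-score
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() (available since 1.0) replaces it.
    feature_array = np.array(tfidf.get_feature_names_out())
    # Vocabulary indices sorted from highest to lowest weight.
    tfidf_sorting = np.argsort(article.toarray()).flatten()[::-1]

    top = feature_array[tfidf_sorting][:n]
    return top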
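
`top_words` sorts the whole dense row, so an article with fewer than `n` distinct terms would have its result padded with vocabulary terms whose weight is zero. One possible variant, sketched here on a plain dict standing in for a single TF-IDF row, ranks only the non-zero entries. The helper name `top_terms` and the weights are hypothetical.

```python
import heapq

def top_terms(weights, n):
    """Return up to n (term, weight) pairs with the largest non-zero weights."""
    nonzero = ((term, w) for term, w in weights.items() if w > 0)
    return heapq.nlargest(n, nonzero, key=lambda pair: pair[1])

# Hypothetical weights for one article; terms absent from it have weight 0.
article_weights = {"senate": 0.41, "budget": 0.0, "filibuster": 0.77, "vote": 0.49}

print(top_terms(article_weights, 2))   # [('filibuster', 0.77), ('vote', 0.49)]
print(top_terms(article_weights, 10))  # only the three non-zero terms
```

On a SciPy CSR row the same idea would presumably use the row's `indices` and `data` arrays instead of calling `toarray()`, which avoids materializing the full vocabulary for every article.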

In [5]:
for i in range(0, 10):
    print(top_words(texts[i], 10))

['collyer' 'blando' 'pileup' 'prerogatives' 'appropriating' 'cascading'
 'implode' 'divulge' 'illustrating' 'appropriated']
['argenis' 'meenagh' 'overscrutinized' 'landesberg' 'espada' '146th'
 'gola' 'hypodermic' 'giacalone' 'courtlandt']
['gaing' 'salten' 'romanized' 'okubo' 'funicello' 'desaturated'
 'canemaker' 'kitemaker' 'calligrapher' 'laundryman']
['snape' 'strivings' 'lovelette' 'greenswards' 'stettner' 'schallert'
 'telegenically' 'kiarostami' 'brookner' 'motormouth']
['unha' 'launchings' 'kctv' 'sejong' 'cesspool' 'cheong' 'reconfigured'
 'exhorting' 'habitually' 'musudan']
['sandringham' 'recuperating' 'adulyadej' 'bhumibol' 'ascended'
 'buckingham' 'norfolk' 'monarch' 'throne' '101']
['príncipe' 'tomé' 'stopovers' 'coercing' 'antagonized' 'glaser'
 'underpinned' 'destabilized' 'taipei' 'dispatching']
['proietto' 'canagliflozin' 'algaier' 'erinn' 'leptin' 'starburst'
 'metabolisms' 'egbert' 'arduously' 'dieters']
['raval' 'mazzy' 'belz' 'polihale' 'tigger' 'bedbedbedbedbed'

In [6]:
texts[0]

'WASHINGTON  —   Congressional Republicans have a new fear when it comes to their    health care lawsuit against the Obama administration: They might win. The incoming Trump administration could choose to no longer defend the executive branch against the suit, which challenges the administration’s authority to spend billions of dollars on health insurance subsidies for   and   Americans, handing House Republicans a big victory on    issues. But a sudden loss of the disputed subsidies could conceivably cause the health care program to implode, leaving millions of people without access to health insurance before Republicans have prepared a replacement. That could lead to chaos in the insurance market and spur a political backlash just as Republicans gain full control of the government. To stave off that outcome, Republicans could find themselves in the awkward position of appropriating huge sums to temporarily prop up the Obama health care law, angering conservative voters who have been 