# Keyword Analysis

Tasks:
- Find top words by media ideology through IDF scores (https://kavita-ganesan.com/python-keyword-extraction/#.YnUzEdrMJPY)
- TF-IDF scores (https://github.com/kavgan/nlp-in-practice/blob/master/tf-idf/Keyword%20Extraction%20with%20TF-IDF%20and%20SKlearn.ipynb, https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/03-TF-IDF-Scikit-Learn.html)

### TF-IDF

TF-IDF (term frequency weighted by inverse document frequency) is a score that measures how relevant a word is to a particular document, relative to the rest of the corpus. Take the word “the”: within a given document it appears many times, so its TF is high, but it also appears in almost every document, so its IDF is very low. Overall, “the” receives a low TF-IDF score compared with words that are frequent in one document and rare elsewhere. We assume that the keywords of a document are the terms with the highest TF-IDF scores. The reason we take the inverse, or flipped fraction, of document frequency is to boost the rarer words that occur in relatively few documents: the higher the IDF weight, the rarer the term, and the smaller the weight, the more common the term. When TF-IDF scores are used as features instead of raw word counts, stopwords are strongly down-weighted, although they are not necessarily removed altogether.

tf-idf = term_frequency * inverse_document_frequency

term_frequency = number of times a given term appears in document

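inverse_document_frequency = log(total number of documents / number of documents with term) + 1

With smooth_idf=True, as used below, scikit-learn instead computes log((1 + total number of documents) / (1 + number of documents with term)) + 1.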
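
As a worked illustration of the formulas above, the following sketch computes the scores by hand for a three-document toy corpus. The corpus and the helper names (`toy_docs`, `tf`, `idf`, `tf_idf`) are invented for illustration, and the natural logarithm is assumed. The smoothed variant is the one scikit-learn documents for smooth_idf=True.

```python
import math

# Toy corpus, invented for illustration: each document is a list of tokens.
toy_docs = [
    ["the", "voter", "id", "law"],
    ["the", "court", "blocked", "the", "law"],
    ["the", "senate", "passed", "the", "bill"],
]

def tf(term, doc):
    """Raw count of `term` in one document."""
    return doc.count(term)

def idf(term, docs, smooth=False):
    """Inverse document frequency (natural log assumed)."""
    n = len(docs)
    df = sum(term in doc for doc in docs)  # number of documents containing the term
    if smooth:
        # variant documented by scikit-learn for smooth_idf=True
        return math.log((1 + n) / (1 + df)) + 1
    return math.log(n / df) + 1

def tf_idf(term, doc, docs, smooth=False):
    return tf(term, doc) * idf(term, docs, smooth)

doc = toy_docs[1]
for term in ["the", "court"]:
    print(term, "tf =", tf(term, doc),
          "idf =", round(idf(term, toy_docs), 3),
          "tf-idf =", round(tf_idf(term, doc, toy_docs), 3))
```

Because of the `+ 1`, a term that appears in every document keeps an IDF of 1 rather than 0. In this toy example “the” appears twice in the document and “court” only once, yet “court” still ends up with the slightly higher score because its IDF is roughly twice as large.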

The process of using TF-IDF is:
1. Preprocess the data: clean it (standardise), normalize it (all lower case), and lemmatize it (reduce all words to their root form).
2. Tokenize words with frequency
3. Find TF for words
4. Find IDF for words
5. Vectorize vocabulary
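
The steps above can be sketched in plain Python on two invented sentences. This is only a rough illustration: the `tokenize` helper and the toy texts are hypothetical, lemmatization is left out because it would need an NLP library, and the IDF step is omitted to keep the sketch short.

```python
import re
from collections import Counter

# Toy stories, invented for illustration.
raw_docs = [
    "The Court BLOCKED the voter-ID law.",
    "Voters backed the new law!",
]

def tokenize(text):
    """Steps 1-2: lower-case the text and keep alphabetic tokens of two or more letters."""
    return re.findall(r"[a-z]{2,}", text.lower())

# Step 2: token frequencies per document
counts = [Counter(tokenize(doc)) for doc in raw_docs]

# Step 5: a sorted vocabulary fixes one vector position per term
vocabulary = sorted(set().union(*counts))

# Step 3: each document becomes a vector of term counts (TF) over the vocabulary
vectors = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
print(vectors)
```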

With our data we define:
- document = news media story
- corpus = entire collection of stories

In [70]:
# import modules
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer 


In [71]:
df = pd.read_csv('C:/Users/analo/OneDrive - University of Glasgow/University of Glasgow/Amsterdam Visit/ASCoR-Project/Analysis/Descriptive-Stats/preprocessed.csv', encoding='latin-1')
df.head()  

Unnamed: 0,date,maintext,title,source,media_name,ideology,Congress,text
0,2015/02/13,Advertisement\r\r\nIn honor of our 95 annivers...,6 Ways the League of Women Voters Has Impacted...,http://www.huffingtonpost.com/elisabeth-macnam...,Huffington Post,left,114th,honor anniversary list things americans part l...
1,2015/02/10,"As state legislatures shift into high gear, ma...",Opportunities for Effective Election Reforms C...,http://www.huffingtonpost.com/robert-m-brandon...,Huffington Post,left,114th,state legislatures shift high gear election re...
2,2015/02/22,"FILE - In a Tuesday, Nov. 4, 2014 file photo, ...","Scott Walker Pushes ALEC 'Right to Work' Bill,...",http://www.huffingtonpost.com/mary-bottari/sco...,Huffington Post,left,114th,file tuesday nov file photo wisconsin republic...
3,2015/02/25,Former Ohio Gov. Ted Strickland (D) announced ...,Ted Strickland Announces He's Running For The ...,http://www.huffingtonpost.com/2015/02/25/ted-s...,Huffington Post,left,114th,ohio gov ted strickland announced wednesday ll...
4,2015/02/26,Nevada Senate Minority Leader Michael Roberson...,Nevada GOP Pushes New Gun Law Reminiscent Of '...,http://www.huffingtonpost.com/2015/02/26/nevad...,Huffington Post,left,114th,nevada senate minority leader michael roberson...


In [72]:
# creating two different datasets for media split by ideology
ideology = df.groupby("ideology")
left = ideology.get_group('left')
right = ideology.get_group('right')

### Creating vocabulary
We use CountVectorizer to create a vocabulary and generate word counts. While cv.fit(...) would only create the vocabulary, cv.fit_transform(...) creates the vocabulary and also returns a document-term matrix, which is what we want. Each column in the matrix represents a word in the vocabulary and each row represents a document in our dataset; the values are the word counts. Note that with this representation the count is 0 whenever a word does not appear in the corresponding document.

We ignore all words that appear in more than 95% of the documents, since those are too common to be informative, and all words that appear in fewer than 1% of the documents.
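
To make the effect of the document-frequency cut-offs concrete, here is a small sketch on an invented four-document corpus. The variable names are hypothetical, and min_df is raised to 0.5 (instead of the 0.01 used in this notebook) purely so that pruning is visible on so few documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration (four very short "documents").
toy_docs = [
    "vote law court",
    "vote law texas",
    "vote bill texas",
    "vote bill georgia",
]

# No pruning: every term is kept.
cv_all = CountVectorizer()
counts_all = cv_all.fit_transform(toy_docs)  # learns the vocabulary and returns the counts
print(counts_all.shape)                      # (documents, terms)
print(sorted(cv_all.vocabulary_))

# max_df=0.95 drops terms found in more than 95% of documents ("vote", in 4 of 4).
# min_df=0.5 drops terms found in fewer than 50% of documents ("court", "georgia").
cv_pruned = CountVectorizer(max_df=0.95, min_df=0.5)
counts_pruned = cv_pruned.fit_transform(toy_docs)
print(sorted(cv_pruned.vocabulary_))
print(counts_pruned.toarray())               # rows = documents, columns = remaining terms
```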

In [73]:
# text = df['text'].apply(str) # all documents
# text =left['text'].apply(str) # left leaning media
text =right['text'].apply(str) # right leaning media

In [74]:
# ignore words that appear in more than 95% of documents or in fewer than 1% of documents
cv=CountVectorizer(max_df=0.95, min_df=0.01)
word_count_vector=cv.fit_transform(text)
list(cv.vocabulary_.keys())[:20] # first 20 words from our vocabulary

['labor',
 'giant',
 'service',
 'employees',
 'international',
 'union',
 'soul',
 'day',
 'drive',
 'hispanic',
 'turnout',
 'midterm',
 'elections',
 'distributing',
 'language',
 'public',
 'tuesday',
 'honor',
 'dead',
 'relatives']

### Creating the IDF
We use TfidfTransformer to compute the inverse document frequency (IDF). In the code below, we pass the sparse count matrix from CountVectorizer (word_count_vector) to tfidf_transformer.fit(...), which computes the IDF weights. The table below lists the ten lowest IDF values, sorted in ascending order. The lower the IDF value of a word, the less unique it is to any particular document.

In [75]:
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True) 
tfidf_transformer.fit(word_count_vector) # all

TfidfTransformer()

In [76]:
# print idf values for whole corpus
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names_out(),columns=["idf_weights"]) 

# sort ascending 
df_idf.sort_values(by=['idf_weights'])[:10]

Unnamed: 0,idf_weights
voter,1.060078
id,1.146021
state,1.241533
election,1.263298
voting,1.307508
vote,1.320278
people,1.363781
law,1.4755
voters,1.483033
laws,1.49444


### Computing TF-IDF and Extracting Keywords of all articles
Once we have our IDF computed, we are ready to compute TF-IDF and extract the top keywords for each article. We start by putting the article texts selected above (the text column) into a list.

In [77]:
# get articles into a list
docs_test=text.tolist() 

In [78]:
def sort_coo(coo_matrix):
    """sort (column index, score) pairs by descending score"""
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    
    #use only topn items from vector
    sorted_items = sorted_items[:topn]

    #map each feature name to its rounded score, keeping the sorted order
    results = {}
    for idx, score in sorted_items:
        results[feature_names[idx]] = round(score, 3)
    
    return results

In [79]:
feature_names=cv.get_feature_names_out()

In [80]:
def get_keywords(idx):

    #generate tf-idf for the given document
    tf_idf_vector=tfidf_transformer.transform(cv.transform([docs_test[idx]]))

    #sort the tf-idf vectors by descending order of scores
    sorted_items=sort_coo(tf_idf_vector.tocoo())

    #extract only the top n; n here is 10
    keywords=extract_topn_from_vector(feature_names,sorted_items,10)
    
    return keywords

def print_results(idx,keywords):

    for k in keywords:
        print(k,keywords[k])

In [81]:
idx=1 #article number
keywords=get_keywords(idx)
print_results(idx,keywords)

drug 0.288
indiana 0.262
moore 0.235
city 0.227
nfl 0.22
abuse 0.168
coalition 0.14
treatment 0.131
margins 0.11
users 0.105


In [82]:
#generate tf-idf for all news articles. 
tf_idf_vector=tfidf_transformer.transform(cv.transform(docs_test))

results=[]
for i in range(tf_idf_vector.shape[0]):
    
    # get vector for a single document
    curr_vector=tf_idf_vector[i]
    
    #sort the tf-idf vector by descending order of scores
    sorted_items=sort_coo(curr_vector.tocoo())

    #extract only the top n; n here is 10
    keywords=extract_topn_from_vector(feature_names,sorted_items,10)
    
    
    results.append(keywords)

# create dataframe with top keywords per article
df_tfidf=pd.DataFrame(zip(docs_test,results),columns=['doc','keywords'])
df_tfidf

Unnamed: 0,doc,keywords
0,labor giant service employees international un...,"{'hispanic': 0.501, 'soul': 0.23, 'pew': 0.229..."
1,tribune star terre haute slim margins change v...,"{'drug': 0.288, 'indiana': 0.262, 'moore': 0.2..."
2,raleigh ap year marked hundreds arrests nation...,"{'moral': 0.279, 'raleigh': 0.24, 'saturday': ..."
3,austin texas federal appeals court struck texa...,"{'appeals': 0.344, 'texas': 0.329, 'violates':..."
4,thursday ceremony commemorate anniversary voti...,"{'folks': 0.418, 'turns': 0.31, 'obama': 0.205..."
...,...,...
1709,watching media rush defend president biden wei...,"{'biden': 0.381, 'putin': 0.365, 'approval': 0..."
1710,year list hollywood actor smith blew iconic ca...,"{'smith': 0.655, 'hollywood': 0.208, 'theory':..."
1711,listen fox news articles eyes world focus suff...,"{'silence': 0.301, 'tech': 0.288, 'image': 0.2..."
1712,democratic congressman jamaal bowman york arre...,"{'capitol': 0.47, 'police': 0.382, 'arrested':..."


The keywords column above contains one dict of top keywords (with their scores) per article. I extracted these keywords and counted how often each one occurs across articles below.

In [83]:
import re

listkey = df_tfidf['keywords'].to_list() # list of per-article keyword dicts
stringkey = str(listkey) # convert the whole list to a single str

def preprocess(raw_text):
    # replace every non-letter character (braces, quotes, scores) with a space
    letters_only_text = re.sub("[^a-zA-Z]", " ", raw_text)
    # lower-case and collapse runs of whitespace
    words = letters_only_text.lower().split()
    return " ".join(words)

# keep only words from the str
keywords = preprocess(stringkey)

In [84]:
from collections import Counter
# count how often each term appears among the per-article top keywords (right-leaning media here)
cnt_df_right = pd.DataFrame(Counter(keywords.split()).most_common(20)) # 20 most frequent keywords
cnt_df_right.columns=['keyword', 'freq']
cnt_df_right

Unnamed: 0,keyword,freq
0,election,216
1,voting,170
2,trump,159
3,georgia,158
4,bill,157
5,biden,150
6,law,134
7,democrats,134
8,texas,123
9,court,122
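
An alternative that avoids the round trip through a string is to count the dict keys directly. This is a minimal sketch on invented data standing in for df_tfidf['keywords']. Its counts may differ slightly from the regex approach, since that one drops any digits or underscores inside a keyword.

```python
from collections import Counter

# Toy stand-in for df_tfidf['keywords']: one {keyword: score} dict per article (invented values).
toy_keywords = [
    {"election": 0.41, "georgia": 0.33, "bill": 0.20},
    {"election": 0.52, "court": 0.30},
    {"georgia": 0.47, "election": 0.25},
]

# Count in how many articles each term is among the top keywords.
keyword_counts = Counter(term for doc_keywords in toy_keywords for term in doc_keywords)

print(keyword_counts.most_common(2))
```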


In [87]:
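# merge cnt_df, cnt_df_left, cnt_df_right
# (cnt_df and cnt_df_left are assumed to come from re-running In [73] to In [84] with the other text selections)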
cnt_df_right.columns=['right', 'freq']
cnt_df_left.columns=['left', 'freq']
cnt_df.columns=['df', 'freq']
results = pd.concat([cnt_df, cnt_df_left, cnt_df_right], axis=1)
results



Unnamed: 0,df,freq,left,freq.1,right,freq.2
0,voting,864,voting,698,election,216
1,election,604,trump,409,voting,170
2,trump,565,election,389,trump,159
3,state,483,state,370,georgia,158
4,court,445,court,320,bill,157
5,law,395,rights,318,biden,150
6,rights,368,law,278,law,134
7,voter,362,voters,255,democrats,134
8,voters,352,texas,218,texas,123
9,texas,344,georgia,186,court,122
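
Since all three frames carry a column named freq, one way to keep them apart when placing them side by side is the keys argument of pd.concat. The sketch below uses invented counts and hypothetical frame names (`toy_all`, `toy_left`, `toy_right`) in place of cnt_df, cnt_df_left and cnt_df_right.

```python
import pandas as pd

# Toy stand-ins for the three keyword-count frames (invented counts).
toy_all = pd.DataFrame({"keyword": ["voting", "election"], "freq": [10, 7]})
toy_left = pd.DataFrame({"keyword": ["voting", "trump"], "freq": [8, 4]})
toy_right = pd.DataFrame({"keyword": ["election", "voting"], "freq": [5, 2]})

# keys= adds an outer column level, so the three 'freq' columns stay distinguishable.
comparison = pd.concat([toy_all, toy_left, toy_right], axis=1,
                       keys=["all", "left", "right"])

print(comparison)
print(comparison[("right", "keyword")].tolist())
```

With this layout each column is addressed by a (group, field) pair, for example `comparison[("left", "freq")]`, so no renaming is needed before the merge.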
