Arpitha Gurumurthy <br>
Team: Amalgam
### **Factor:**
Style Based approaches for fake news detection

### **Micro factors for Style based:**
* Hyperpartisan: Extremely one sided
* Yellow Journalism: relying on eye-catching headlines
* Deception / lying in text

### **Dates:**
Scraped on April 20th. The `Posted` column below gives each article's age at scrape time, ranging from minutes up to 7 days.


# **Topic Modeling and Latent Dirichlet Allocation (LDA)**

In [None]:
#Importing data from google sheets - politifact dataset
from io import BytesIO
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from sklearn.cluster import KMeans
import seaborn as sns
import tensorflow.compat.v1 as tf
r = requests.get('https://docs.google.com/spreadsheets/d/e/2PACX-1vQ9xbQF0uRmyBhtehROE5uTac8JbvNd-jq-NMD99y6HVuungzxDuftmYiY74ZWrenpLyDFtGToiFeMo/pub?gid=745557768&single=true&output=csv')
data = r.content
df_distillation = pd.read_csv(BytesIO(data))

In [None]:
df_distillation.head()

Unnamed: 0,Headline,Source,Posted,Link,Summary
0,Covid in Uttar Pradesh: Coronavirus overwhelms...,BBC via Yahoo News,4 hours ago,https://news.yahoo.com/covid-uttar-pradesh-cor...,"Uttar Pradesh, India's most populous state, is..."
1,"Man who allegedly told U.S. Olympian ""go home,...",Newsweek,2 hours ago,https://www.newsweek.com/california-man-attack...,
2,Corona man arrested after punching Asian Ameri...,KTLA-TV Los Angeles,21 hours ago,https://ktla.com/news/local-news/corona-man-ar...,A Corona man accused of physically assaulting ...
3,What should investors do after the 4600-point ...,MSN News,6 hours ago,https://www.msn.com/en-in/money/topstories/wha...,© Kshitij Anand What should investors do after...
4,Construction starts on 91-15 freeways toll-lan...,The Press-Enterprise,18 hours ago,https://www.pe.com/2021/04/19/construction-sta...,"Construction was set to start Monday night, Ap..."


In [None]:
df_distillation['Posted'].unique()

array(['4 hours ago', '2 hours ago', '21 hours ago', '6 hours ago',
       '18 hours ago', '22 hours ago', '13 hours ago', '17 hours ago',
       '5 days ago', '4 days ago', '1 day ago', '5 hours ago',
       '24 hours ago', '10 hours ago', '6 days ago', '10 minutes ago',
       '2 days ago', '3 days ago', '7 days ago', '16 hours ago',
       '11 hours ago', '12 hours ago', '7 hours ago', '3 hours ago',
       '9 hours ago', '20 hours ago', '14 hours ago', '15 hours ago',
       '23 hours ago', '19 hours ago', '8 hours ago', 'Posted',
       '1 hour ago', '55 minutes ago', '49 minutes ago', '50 minutes ago',
       '7 minutes ago', '19 minutes ago', '35 minutes ago'], dtype=object)
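
The `Posted` values are relative ages rather than timestamps, and the list also contains a stray `'Posted'` value that looks like a repeated header row. A minimal sketch, assuming every real value follows the `N unit(s) ago` pattern seen above, of turning these strings into `timedelta` objects so article age can be checked; `parse_posted` is a hypothetical helper and not part of the original notebook:

```python
import re
from datetime import timedelta

# Hypothetical helper: converts strings such as '4 hours ago' into a timedelta.
# Returns None for values that do not match (e.g. the stray 'Posted' header row).
_PATTERN = re.compile(r"^(\d+)\s+(minute|hour|day)s?\s+ago$")

def parse_posted(value):
    match = _PATTERN.match(value.strip())
    if match is None:
        return None
    amount, unit = int(match.group(1)), match.group(2)
    # timedelta accepts the plural keyword names: minutes, hours, days.
    return timedelta(**{unit + "s": amount})

samples = ['4 hours ago', '1 day ago', '7 days ago', '10 minutes ago', 'Posted']
ages = [parse_posted(s) for s in samples]

# Values older than two days at scrape time.
older_than_two_days = [s for s, a in zip(samples, ages)
                       if a is not None and a > timedelta(days=2)]
print(older_than_two_days)
```

Applied to the full column, the same helper could be used with `df_distillation['Posted'].map(parse_posted)` to drop unparseable rows or to filter by age.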

## **Data Pre-processing**
We will perform the following steps:
* Tokenization: Splitting the text into sentences and the sentences into words. Lowercasing the words and removing punctuation.
* Words with 3 or fewer characters are removed.
* All stopwords are removed.
* Words are lemmatized — words in third person are changed to first person and verbs in past and future tenses are changed into present.
* Words are stemmed — words are reduced to their root form.
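
The tokenization and filtering steps above can be sketched with the standard library alone. This is only an approximation under stated assumptions: `TOY_STOPWORDS` is a tiny illustrative set rather than gensim's `STOPWORDS`, the regex stands in for `simple_preprocess`, and lemmatization and stemming are left out entirely. `toy_preprocess` is a hypothetical name.

```python
import re

# Assumption: a tiny illustrative stopword set; the notebook uses gensim's STOPWORDS.
TOY_STOPWORDS = {"in", "most", "the", "and", "with", "from"}

def toy_preprocess(text, min_len=4):
    # Lowercase and keep alphabetic runs only, which also drops punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Keep tokens that are not stopwords and have at least min_len characters
    # (equivalent to the len(token) > 3 check used in this notebook).
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) >= min_len]

headline = "Covid in Uttar Pradesh: Coronavirus overwhelms India's most populous state"
print(toy_preprocess(headline))
```

The result still contains full word forms such as `coronavirus` and `overwhelms`; reducing them to `coronaviru` and `overwhelm` is the job of the lemmatizer and stemmer.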


In [None]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(2018)
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
from nltk import PorterStemmer

In [None]:
def lemmatize_stemming(text):
    return PorterStemmer().stem(WordNetLemmatizer().lemmatize(text, pos='v'))
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [None]:
##Testing the above function on a sample document
doc_sample = df_distillation['Headline'][0]
print('original document: ')
words = doc_sample.split(' ')
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

original document: 
['Covid', 'in', 'Uttar', 'Pradesh:', 'Coronavirus', 'overwhelms', "India's", 'most', 'populous', 'state']


 tokenized and lemmatized document: 
['covid', 'uttar', 'pradesh', 'coronaviru', 'overwhelm', 'india', 'popul', 'state']


In [None]:
##saving the preprocessed headline text as ‘processed_docs’
processed_docs = df_distillation['Headline'].map(preprocess)
processed_docs[:10]

0    [covid, uttar, pradesh, coronaviru, overwhelm,...
1    [allegedli, tell, olympian, home, punch, coupl...
2    [corona, arrest, punch, asian, american, coupl...
3                      [investor, point, fall, sensex]
4    [construct, start, freeway, toll, lane, connec...
5    [arrest, allegedli, assault, elderli, korean, ...
6    [constel, brand, domin, beer, market, motley, ...
7    [patrol, recruit, program, whitfield, sheriff,...
8    [corona, accus, sexual, exploit, girl, arraign...
9    [caption, stay, home, dalam, bahasa, inggri, b...
Name: Headline, dtype: object

## **Bag of Words on the Dataset**

In [None]:
dictionary = gensim.corpora.Dictionary(processed_docs)
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 coronaviru
1 covid
2 india
3 overwhelm
4 popul
5 pradesh
6 state
7 uttar
8 allegedli
9 arrest
10 coupl


**Gensim filter_extremes:**
Filtering out tokens that appear in:
* fewer than 15 documents (absolute number), or
* more than 50% of documents (`no_above=0.5` is a fraction of total corpus size, not an absolute number).

After the above two steps, only the 100000 most frequent tokens are kept.
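
A rough sketch of the document-frequency filter described above, written in plain Python on a toy corpus. It is meant to illustrate the idea and is not gensim's implementation; `filter_by_doc_freq` is a hypothetical name, and truncating the upper bound to an integer is an assumption about how the fraction is applied.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above, keep_n):
    # Document frequency: number of documents containing each token at least once.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    # Assumption: the fractional upper bound is truncated to a whole document count.
    max_docs = int(no_above * len(docs))
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep the keep_n tokens with the highest document frequency (ties broken alphabetically).
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept[:keep_n]

toy_docs = [["stock", "rise"], ["stock", "covid"], ["covid", "vaccin"], ["stock", "draft"]]
print(filter_by_doc_freq(toy_docs, no_below=2, no_above=0.5, keep_n=10))
```

In this toy corpus `stock` is dropped for being too common (3 of 4 documents) and `rise`, `vaccin` and `draft` for being too rare, leaving only `covid`.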


In [None]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

In [None]:
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 covid
1 state
2 arrest
3 american
4 corona
5 investor
6 start
7 fool
8 market
9 motley
10 draft


**Gensim doc2bow**
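For each document we create a list of `(token_id, count)` pairs reporting which dictionary words appear and how many times. Tokens removed by `filter_extremes` are ignored, which is why the sample headline below keeps only two of its eight tokens. We save the result to ‘bow_corpus’, then check the sample document selected earlier.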
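
To make the bag-of-words representation concrete, here is a minimal plain-Python sketch of the same idea. `toy_doc2bow` and the two-entry `token2id` mapping are illustrative assumptions; the ids mirror the first two entries of the filtered dictionary shown above.

```python
from collections import Counter

def toy_doc2bow(token2id, tokens):
    # Count only tokens present in the vocabulary; unknown tokens are ignored.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    # Return (token_id, count) pairs sorted by token id.
    return sorted(counts.items())

# Assumption: a two-word vocabulary matching ids 0 and 1 of the filtered dictionary.
token2id = {"covid": 0, "state": 1}
doc = ['covid', 'uttar', 'pradesh', 'coronaviru', 'overwhelm', 'india', 'popul', 'state']
print(toy_doc2bow(token2id, doc))
```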


In [None]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[0]

[(0, 1), (1, 1)]

Preview Bag Of Words for our sample preprocessed document.


In [None]:
processed_docs[0]

['covid',
 'uttar',
 'pradesh',
 'coronaviru',
 'overwhelm',
 'india',
 'popul',
 'state']

In [None]:
bow_doc_0 = bow_corpus[0]

for i in range(len(bow_doc_0)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_0[i][0], 
                                                     dictionary[bow_doc_0[i][0]], 
                                                     bow_doc_0[i][1]))

Word 0 ("covid") appears 1 time.
Word 1 ("state") appears 1 time.


## **TF-IDF**
We create a TF-IDF model object by applying `models.TfidfModel` to ‘bow_corpus’ and save it as ‘tfidf’, then apply the transformation to the entire corpus and call the result ‘corpus_tfidf’. Finally, we preview the TF-IDF scores for our first document.

In [None]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)

In [None]:
corpus_tfidf = tfidf[bow_corpus]

In [None]:
len(corpus_tfidf)

1361

In [None]:
corpus_tfidf[0]

[(0, 0.6055853562256109), (1, 0.7957803568354147)]
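
The two scores above have squares that sum to 1, which is consistent with each document vector being scaled to unit length. A minimal sketch of that weighting, assuming the scheme is term count times `log2(N / df)` followed by L2 normalisation; `toy_tfidf` is a hypothetical helper and the document frequencies below are made up, since the real ones are stored in `dictionary.dfs`:

```python
import math

def toy_tfidf(bow, doc_freq, num_docs):
    """Weight (token_id, count) pairs by count * log2(num_docs / df), then L2-normalise."""
    raw = [(tid, count * math.log2(num_docs / doc_freq[tid])) for tid, count in bow]
    norm = math.sqrt(sum(w * w for _, w in raw))
    if norm == 0:
        return raw
    return [(tid, w / norm) for tid, w in raw]

# Hypothetical document frequencies; 1361 is the corpus size reported above.
doc_freq = {0: 120, 1: 60}
weights = toy_tfidf([(0, 1), (1, 1)], doc_freq, num_docs=1361)
print(weights)
```

The rarer token (id 1 in this toy setup) receives the larger weight, matching the pattern in the notebook output where `state` scores higher than `covid`.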

## **Running LDA using Bag of Words**
Training our LDA model using gensim.models.LdaMulticore and saving it to ‘lda_model’.

In [None]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

For each topic, we will explore the words occurring in that topic and their relative weights.


In [None]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.180*"polit" + 0.128*"stock" + 0.093*"watch" + 0.058*"want" + 0.054*"invest" + 0.052*"april" + 0.049*"sport" + 0.043*"favr" + 0.043*"brett" + 0.040*"news"
Topic: 1 
Words: 0.208*"stock" + 0.092*"earn" + 0.088*"rise" + 0.060*"reuter" + 0.052*"record" + 0.034*"world" + 0.033*"gain" + 0.030*"say" + 0.027*"strong" + 0.026*"high"
Topic: 2 
Words: 0.115*"open" + 0.095*"stock" + 0.083*"lower" + 0.068*"year" + 0.067*"season" + 0.063*"arrest" + 0.054*"record" + 0.040*"index" + 0.039*"high" + 0.039*"close"
Topic: 3 
Words: 0.104*"covid" + 0.093*"vaccin" + 0.092*"school" + 0.072*"polit" + 0.070*"high" + 0.059*"state" + 0.037*"sport" + 0.036*"stock" + 0.032*"april" + 0.028*"mix"
Topic: 4 
Words: 0.139*"draft" + 0.109*"prospect" + 0.107*"stock" + 0.064*"viru" + 0.053*"surg" + 0.050*"biden" + 0.048*"hospit" + 0.041*"amid" + 0.038*"covid" + 0.030*"polit"
Topic: 5 
Words: 0.293*"stock" + 0.065*"high" + 0.056*"motley" + 0.056*"fool" + 0.054*"record" + 0.034*"best" + 0.023*"dividend" +
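
Each topic is printed as a single string of `weight*"word"` terms. A small sketch of parsing such a string into `(word, weight)` pairs for further analysis; `parse_topic` is a hypothetical helper, and the sample string is a shortened copy of Topic 0 above. gensim's `show_topic` method returns such pairs directly, so parsing is mainly useful when only the printed output is available.

```python
import re

def parse_topic(topic_str):
    # Each term looks like 0.180*"polit"; capture the weight and the quoted word.
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)
    return [(word, float(weight)) for weight, word in pairs]

topic_0 = '0.180*"polit" + 0.128*"stock" + 0.093*"watch"'
print(parse_topic(topic_0))
```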

## **Running LDA using TF-IDF**

In [None]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)

In [None]:
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.101*"watch" + 0.071*"mondal" + 0.066*"walter" + 0.060*"rise" + 0.059*"stock" + 0.057*"vice" + 0.057*"presid" + 0.054*"die" + 0.040*"carter" + 0.037*"april"
Topic: 1 Word: 0.134*"draft" + 0.092*"prospect" + 0.072*"stock" + 0.065*"earn" + 0.043*"news" + 0.039*"look" + 0.038*"invest" + 0.033*"tech" + 0.030*"bank" + 0.028*"european"
Topic: 2 Word: 0.245*"stock" + 0.055*"motley" + 0.055*"fool" + 0.053*"energi" + 0.036*"wall" + 0.033*"bank" + 0.032*"mix" + 0.029*"street" + 0.027*"reuter" + 0.023*"record"
Topic: 3 Word: 0.082*"say" + 0.067*"viru" + 0.056*"american" + 0.043*"amid" + 0.042*"report" + 0.040*"democrat" + 0.038*"polit" + 0.037*"surg" + 0.036*"earn" + 0.034*"stock"
Topic: 4 Word: 0.064*"pandem" + 0.061*"april" + 0.050*"share" + 0.050*"gain" + 0.050*"start" + 0.049*"china" + 0.045*"world" + 0.043*"data" + 0.043*"stock" + 0.039*"record"
Topic: 5 Word: 0.071*"vote" + 0.061*"trade" + 0.058*"close" + 0.055*"record" + 0.051*"stock" + 0.048*"lower" + 0.047*"pull" + 0.047*

**Performance evaluation by classifying sample document using LDA TF-IDF model**


In [None]:
for index, score in sorted(lda_model_tfidf[bow_corpus[0]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.4570715129375458	 
Topic: 0.117*"state" + 0.090*"school" + 0.075*"year" + 0.070*"high" + 0.068*"arrest" + 0.058*"crash" + 0.042*"die" + 0.037*"ralli" + 0.037*"presid" + 0.030*"american"

Score: 0.2762371599674225	 
Topic: 0.257*"polit" + 0.087*"covid" + 0.079*"week" + 0.052*"season" + 0.050*"biden" + 0.044*"open" + 0.032*"lead" + 0.028*"sport" + 0.027*"brett" + 0.027*"favr"

Score: 0.033342018723487854	 
Topic: 0.079*"invest" + 0.069*"stock" + 0.064*"right" + 0.046*"higher" + 0.044*"covid" + 0.044*"growth" + 0.040*"high" + 0.036*"fool" + 0.036*"motley" + 0.034*"econom"

Score: 0.03333877772092819	 
Topic: 0.180*"vaccin" + 0.095*"best" + 0.093*"time" + 0.072*"polit" + 0.061*"stock" + 0.058*"dividend" + 0.036*"mix" + 0.025*"covid" + 0.024*"close" + 0.021*"latest"

Score: 0.03333811089396477	 
Topic: 0.134*"draft" + 0.092*"prospect" + 0.072*"stock" + 0.065*"earn" + 0.043*"news" + 0.039*"look" + 0.038*"invest" + 0.033*"tech" + 0.030*"bank" + 0.028*"european"

Score: 0.03333691135
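
Indexing the model with a document yields `(topic_id, score)` pairs, which the cell above sorts by score. A minimal sketch of picking the dominant topic from such a list, with an optional confidence threshold; `dominant_topic` is a hypothetical helper, and the topic ids below are placeholders because the printed output shows topic words rather than ids.

```python
def dominant_topic(topic_scores, min_score=0.0):
    """Return the (topic_id, score) pair with the highest score, or None if below min_score."""
    if not topic_scores:
        return None
    best = max(topic_scores, key=lambda pair: pair[1])
    return best if best[1] >= min_score else None

# Scores rounded from the output above; topic ids are placeholders.
scores = [(0, 0.0333), (1, 0.4571), (2, 0.2762)]
print(dominant_topic(scores))
print(dominant_topic(scores, min_score=0.5))
```

With a threshold of 0.5 the sample headline would be left unassigned, which may be a reasonable choice for very short documents where no topic clearly dominates.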

## **References:**
* https://github.com/susanli2016/NLP-with-Python/blob/master/LDA_news_headlines.ipynb
* https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24
* https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/

NOTES:
* Knowledge graph: https://programmerbackpack.com/python-knowledge-graph-understanding-semantic-relationships/
* http://www.martingrandjean.ch/network-visualization-shakespeare/
* question answering: https://towardsdatascience.com/question-answering-with-pretrained-transformers-using-pytorch-c3e7a44b4012