Arpitha Gurumurthy <br>
Team: Amalgam
### **Factor:**
Style-based approaches for fake news detection

### **Micro factors for style-based detection:**
* Hyperpartisan: extremely one-sided
* Yellow journalism: relying on eye-catching headlines
* Deception / lying in text

### **Dates:**
Scraped on April 20th; all of the news items were posted within 2 days of the scrape.


# **Named Entity Recognition with NLTK and SpaCy**
Named entity recognition (NER) is the first step towards information extraction. It seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of time, quantities, monetary values, and percentages.

In [None]:
#Importing data from google sheets - politifact dataset
from io import BytesIO
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from sklearn.cluster import KMeans
import seaborn as sns
import tensorflow.compat.v1 as tf
r = requests.get('https://docs.google.com/spreadsheets/d/e/2PACX-1vQ9xbQF0uRmyBhtehROE5uTac8JbvNd-jq-NMD99y6HVuungzxDuftmYiY74ZWrenpLyDFtGToiFeMo/pub?gid=745557768&single=true&output=csv')
data = r.content
df_distillation = pd.read_csv(BytesIO(data))

In [None]:
df_distillation.head()

| Unnamed: 0 | Headline | Source | Posted | Link | Summary |
|---|---|---|---|---|---|
| 0 | Covid in Uttar Pradesh: Coronavirus overwhelms... | BBC via Yahoo News | 4 hours ago | https://news.yahoo.com/covid-uttar-pradesh-cor... | Uttar Pradesh, India's most populous state, is... |
| 1 | Man who allegedly told U.S. Olympian "go home,... | Newsweek | 2 hours ago | https://www.newsweek.com/california-man-attack... | |
| 2 | Corona man arrested after punching Asian Ameri... | KTLA-TV Los Angeles | 21 hours ago | https://ktla.com/news/local-news/corona-man-ar... | A Corona man accused of physically assaulting ... |
| 3 | What should investors do after the 4600-point ... | MSN News | 6 hours ago | https://www.msn.com/en-in/money/topstories/wha... | © Kshitij Anand What should investors do after... |
| 4 | Construction starts on 91-15 freeways toll-lan... | The Press-Enterprise | 18 hours ago | https://www.pe.com/2021/04/19/construction-sta... | Construction was set to start Monday night, Ap... |


# NLTK

## **Information Extraction**

In [None]:
import matplotlib
matplotlib.use('Agg')

In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [None]:
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk

In [None]:
df_distillation['Headline_pos_wt'] = df_distillation['Headline'].apply(lambda x: ne_chunk(pos_tag(word_tokenize(x))))

In [None]:
df_distillation['Headline_pos_wt']

0       [[(Covid, NNP)], (in, IN), [(Uttar, NNP)], (Pr...
1       [(Man, NN), (who, WP), (allegedly, RB), (told,...
2       [[(Corona, NNP)], (man, NN), (arrested, VBD), ...
3       [(What, WP), (should, MD), (investors, NNS), (...
4       [[(Construction, NN)], (starts, VBZ), (on, IN)...
                              ...                        
1356    [[(Credit, NNP)], [(Suisse, NNP)], (just, RB),...
1357    [[(UPDATE, JJ)], (2-London, JJ), (stocks, NNS)...
1358    [[(China, NNP)], (stocks, NNS), (fall, VBP), (...
1359    [(GLOBAL, JJ), (MARKETS-World, NNP), (stocks, ...
1360    [[(TFSA, NNP)], (Investing, NNP), (:, :), (2, ...
Name: Headline_pos_wt, Length: 1361, dtype: object

In [None]:
##Applying word tokenization and part-of-speech tagging to the sentence.
def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent

In [None]:
df_distillation['Headline_pos_wt_2'] = df_distillation['Headline'].apply(lambda x: preprocess(x))

For each headline, we get a list of tuples pairing each word with its part-of-speech tag.

In [None]:
df_distillation['Headline_pos_wt_2']

0       [(Covid, NNP), (in, IN), (Uttar, NNP), (Prades...
1       [(Man, NN), (who, WP), (allegedly, RB), (told,...
2       [(Corona, NNP), (man, NN), (arrested, VBD), (a...
3       [(What, WP), (should, MD), (investors, NNS), (...
4       [(Construction, NN), (starts, VBZ), (on, IN), ...
                              ...                        
1356    [(Credit, NNP), (Suisse, NNP), (just, RB), (pu...
1357    [(UPDATE, JJ), (2-London, JJ), (stocks, NNS), ...
1358    [(China, NNP), (stocks, NNS), (fall, VBP), (as...
1359    [(GLOBAL, JJ), (MARKETS-World, NNP), (stocks, ...
1360    [(TFSA, NNP), (Investing, NNP), (:, :), (2, CD...
Name: Headline_pos_wt_2, Length: 1361, dtype: object
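
To make the tuple structure concrete, here is a minimal sketch that unpacks one tagged headline and tallies its tags. The `(word, tag)` pairs are copied by hand from the tagger output for headline 1 in this notebook, so no tagger download is needed. The names `tagged`, `words`, `tags`, and `tag_counts` are illustrative and not part of the original notebook.

```python
from collections import Counter

# (word, tag) pairs copied from the tagger output for headline 1
tagged = [("Man", "NN"), ("who", "WP"), ("allegedly", "RB"),
          ("told", "VBD"), ("U.S.", "NNP"), ("Olympian", "NNP")]

# Split the pairs into parallel sequences of words and tags
words, tags = zip(*tagged)

# Count how often each part-of-speech tag occurs
tag_counts = Counter(tags)
print(tag_counts["NNP"])
```

The same unpacking could be applied to every row of `Headline_pos_wt_2` with `.apply`, should per-headline tag counts be wanted.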

## **Chunking**

Now we’ll implement noun phrase chunking, a step toward identifying named entities, using a regular-expression grammar whose rules indicate how sentences should be chunked.
Our chunk pattern consists of one rule: a noun phrase, NP, is formed whenever the chunker finds an optional determiner, DT, followed by any number of adjectives, JJ, and then a noun, NN. Because the rule matches only the tag NN, proper nouns (NNP) and plural nouns (NNS) are left outside any chunk.

In [None]:
pattern = 'NP: {<DT>?<JJ>*<NN>}'

In [None]:
NPChunker = nltk.RegexpParser(pattern) 
df_distillation['Chunk_result'] = df_distillation['Headline_pos_wt_2'].apply(lambda x: NPChunker.parse(x))
# print(result)

In [None]:
df_distillation['Chunk_result']

0       [(Covid, NNP), (in, IN), (Uttar, NNP), (Prades...
1       [[(Man, NN)], (who, WP), (allegedly, RB), (tol...
2       [(Corona, NNP), [(man, NN)], (arrested, VBD), ...
3       [(What, WP), (should, MD), (investors, NNS), (...
4       [[(Construction, NN)], (starts, VBZ), (on, IN)...
                              ...                        
1356    [(Credit, NNP), (Suisse, NNP), (just, RB), (pu...
1357    [(UPDATE, JJ), (2-London, JJ), (stocks, NNS), ...
1358    [(China, NNP), (stocks, NNS), (fall, VBP), (as...
1359    [(GLOBAL, JJ), (MARKETS-World, NNP), (stocks, ...
1360    [(TFSA, NNP), (Investing, NNP), (:, :), (2, CD...
Name: Chunk_result, Length: 1361, dtype: object
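
To see what the single-rule grammar does on a small input, the sketch below runs it on hand-tagged tokens, so no tagger model is required. The toy sentence and the names `chunker`, `toy`, `np_chunks`, and `proper_chunks` are assumptions made for illustration. The second input reuses the first tags of headline 0.

```python
import nltk

# Same single-rule grammar as the notebook's `pattern`
chunker = nltk.RegexpParser('NP: {<DT>?<JJ>*<NN>}')

# Hand-tagged toy sentence (illustrative, not taken from the dataset)
toy = [("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD")]
tree = chunker.parse(toy)

# Collect the words of every NP subtree
np_chunks = [[word for word, tag in st.leaves()]
             for st in tree.subtrees() if st.label() == "NP"]
print(np_chunks)

# Proper nouns are tagged NNP, which the tag pattern <NN> does not match
proper = chunker.parse([("Covid", "NNP"), ("in", "IN"), ("Uttar", "NNP")])
proper_chunks = [st for st in proper.subtrees() if st.label() == "NP"]
print(proper_chunks)
```

This is consistent with the `Chunk_result` output, where `Covid` and `Uttar` stay unchunked. If proper nouns should be grouped as well, a broader tag pattern such as `<NN.*>` is one possible option.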

In [None]:
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint

df_distillation['iob_tagged'] = df_distillation['Chunk_result'].apply(lambda x: tree2conlltags(x)) 

In [None]:
df_distillation['iob_tagged']

0       [(Covid, NNP, O), (in, IN, O), (Uttar, NNP, O)...
1       [(Man, NN, B-NP), (who, WP, O), (allegedly, RB...
2       [(Corona, NNP, O), (man, NN, B-NP), (arrested,...
3       [(What, WP, O), (should, MD, O), (investors, N...
4       [(Construction, NN, B-NP), (starts, VBZ, O), (...
                              ...                        
1356    [(Credit, NNP, O), (Suisse, NNP, O), (just, RB...
1357    [(UPDATE, JJ, O), (2-London, JJ, O), (stocks, ...
1358    [(China, NNP, O), (stocks, NNS, O), (fall, VBP...
1359    [(GLOBAL, JJ, O), (MARKETS-World, NNP, O), (st...
1360    [(TFSA, NNP, O), (Investing, NNP, O), (:, :, O...
Name: iob_tagged, Length: 1361, dtype: object

In [None]:
df_distillation['iob_tagged'][1]

[('Man', 'NN', 'B-NP'),
 ('who', 'WP', 'O'),
 ('allegedly', 'RB', 'O'),
 ('told', 'VBD', 'O'),
 ('U.S.', 'NNP', 'O'),
 ('Olympian', 'NNP', 'O'),
 ('``', '``', 'O'),
 ('go', 'VB', 'O'),
 ('home', 'NN', 'B-NP'),
 (',', ',', 'O'),
 ("''", "''", 'O'),
 ('punched', 'VBD', 'O'),
 ('couple', 'NN', 'B-NP'),
 (',', ',', 'O'),
 ('is', 'VBZ', 'O'),
 ('arrested', 'VBN', 'O')]
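
In the IOB scheme used above, `B-NP` marks the first token of a chunk, `I-NP` marks a token inside the same chunk, and `O` marks a token outside any chunk. The following sketch shows the conversion in both directions on a hand-built tree. The toy tree and the names `iob` and `rebuilt` are illustrative assumptions.

```python
from nltk.chunk import conlltags2tree, tree2conlltags
from nltk.tree import Tree

# Hand-built chunk tree with one three-token noun phrase
tree = Tree("S", [
    Tree("NP", [("the", "DT"), ("little", "JJ"), ("dog", "NN")]),
    ("barked", "VBD"),
])

# Flatten the tree into (word, pos, iob) triples
iob = tree2conlltags(tree)
print(iob)

# Rebuild the tree from the triples
rebuilt = conlltags2tree(iob)
```

The flat triples are convenient to store in a DataFrame column, while the tree form is easier to traverse.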

## **Alternative way - Trial**

In [None]:
##TESTING
my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."

In [None]:
##TESTING
for sent in nltk.sent_tokenize(my_sent):
    chunklist = []
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            # print(chunk.label(), ' '.join(c[0] for c in chunk))
            chunkstr = ' '.join(c[0] for c in chunk)
            chunkfinal = chunkstr + ":" + chunk.label()
            chunklist.append(chunkfinal)
    print(chunklist)

['WASHINGTON:GPE', 'New York:GPE', 'Loretta E. Lynch:PERSON', 'Brooklyn:GPE']
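
The loop above relies on the shape of the `ne_chunk` result: named entities appear as labelled subtrees, and all other tokens remain plain `(word, tag)` tuples. The sketch below reproduces that logic on a hand-built tree, which is assumed to mirror the real output, so it runs without the chunker model. It uses `isinstance(node, Tree)` in place of `hasattr(chunk, 'label')`, which selects the same nodes here.

```python
from nltk.tree import Tree

# Hand-built stand-in for an ne_chunk result (assumed shape)
chunked = Tree("S", [
    Tree("GPE", [("New", "NNP"), ("York", "NNP")]),
    ("police", "NN"),
    ("officers", "NNS"),
    Tree("PERSON", [("Loretta", "NNP"), ("E.", "NNP"), ("Lynch", "NNP")]),
])

entities = []
for node in chunked:
    # Subtrees are named entities; tuples are ordinary tokens
    if isinstance(node, Tree):
        text = " ".join(word for word, tag in node)
        entities.append(f"{text}:{node.label()}")
print(entities)
```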


In [None]:
##Defining a function to label the word tokens with named entities
def ner(headline):
  ##Collecting "text:LABEL" strings for every named entity in the headline
  chunklist = []
  for sent in nltk.sent_tokenize(headline):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
      ##Named-entity subtrees carry a label; plain (word, tag) tuples do not
      if hasattr(chunk, 'label'):
        chunkstr = ' '.join(c[0] for c in chunk)
        chunkfinal = chunkstr + ":" + chunk.label()
        chunklist.append(chunkfinal)
  ##Returning after the loop so that every sentence of the headline is covered
  return chunklist

In [None]:
##To calculate the number of GPEs in each headline
def gpe_count(headline):
  ##Counting across all sentences of the headline
  count = 0
  for sent in nltk.sent_tokenize(headline):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
      if hasattr(chunk, 'label'):
        if (chunk.label() == 'GPE'):
          count = count + 1
  ##Returning after the loop; a headline with no sentences yields 0
  return count

In [None]:
##Storing the named entity-GPE count as a feature in the dataframe
df_distillation['gpe_count'] = df_distillation['Headline'].apply(lambda x: gpe_count(x))

In [None]:
df_distillation['gpe_count']

0       3
1       1
2       5
3       1
4       1
       ..
1356    0
1357    1
1358    1
1359    0
1360    0
Name: gpe_count, Length: 1361, dtype: int64

In [None]:
##Storing the list of word token and their named entities as a feature
df_distillation['ner'] = df_distillation['Headline'].apply(lambda x: ner(x))

In [None]:
print(df_distillation['ner'])

0                       [Covid:GPE, Uttar:GPE, India:GPE]
1                                              [U.S.:GPE]
2       [Corona:GPE, Asian:GPE, American:GPE, U.S.:GPE...
3                                            [Sensex:GPE]
4                          [Construction:GSP, Corona:GPE]
                              ...                        
1356                 [Credit:ORGANIZATION, Suisse:PERSON]
1357                     [UPDATE:ORGANIZATION, Tesco:GPE]
1358                        [China:GPE, GDP:ORGANIZATION]
1359                                                   []
1360        [TFSA:ORGANIZATION, Motley Fool:ORGANIZATION]
Name: ner, Length: 1361, dtype: object


In [None]:
##Verification - 1 
print(df_distillation['ner'][0])

['Covid:GPE', 'Uttar:GPE', 'India:GPE']


In [None]:
##Verification - 2
print(df_distillation['gpe_count'][0])

3
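
Since every entry in the `ner` column already has the form `text:LABEL`, the GPE count can also be derived from that column without running the tagger and chunker a second time. The sketch below is one possible way to do this. The helper `count_label` is hypothetical and not part of the original notebook, and the sample rows are copied from the `ner` output shown above.

```python
def count_label(entities, label="GPE"):
    """Count 'text:LABEL' strings whose label part equals `label`."""
    return sum(1 for ent in entities if ent.rsplit(":", 1)[-1] == label)

# Rows 0, 1, 4 and 1359 of the `ner` output shown above
rows = [
    ["Covid:GPE", "Uttar:GPE", "India:GPE"],
    ["U.S.:GPE"],
    ["Construction:GSP", "Corona:GPE"],
    [],
]
counts = [count_label(row) for row in rows]
print(counts)
```

On the full DataFrame this would be `df_distillation['ner'].apply(count_label)`. The same helper could count other labels, for example `count_label(row, "ORGANIZATION")`.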


## **References:**
* https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da
* https://github.com/susanli2016/NLP-with-Python/blob/master/NER_NLTK_Spacy.ipynb

## **Notes:**
* https://www.freecodecamp.org/news/an-introduction-to-part-of-speech-tagging-and-the-hidden-markov-model-953d45338f24
* https://www.analyticsvidhya.com/blog/2021/05/top-8-python-libraries-for-natural-language-processing-nlp-in-2021/