# CMPE 257 - ML Spring 2021 Cohort
## Objective: Detect fake news in political datasets using factors

## Factors by Team
## Team Equality - Abhishek Bais, Haley Feng, Jimmy Liang, Shannon Phu
### Abhishek - Misleading Intentions
#### Microfactors:  
Sentiment Analysis  
Sensationalism  
Click Bait 

Datasets:
1. [Politifact](https://drive.google.com/file/d/1LUTnGJ1c8WDwcEmed85GhfoEUm2sJFzN/view)  
2. [Amalgamated dataset from varied newsAPI feed](https://docs.google.com/spreadsheets/d/1jJflezhjlTPRoVHvj7UvQwllssq66zhZNPz7GI_Zwhc/edit#gid=22382224)  
3. [Sensational words corpus](https://drive.google.com/file/d/1JIes9QhZw7EUt59EBDUgrdFMokoT1W8u/view)  

### Shannon - Stance Detection
#### Microfactors:  
Sentiment Analysis  
Subjectivity Score  
BERT Embeddings  

Datasets:
1. [Fake News Challenge](https://www.kaggle.com/c/fakenewskdd2020)  
2. [Politifact](https://drive.google.com/file/d/1LUTnGJ1c8WDwcEmed85GhfoEUm2sJFzN/view)  
3. [Sentiment words corpus](https://drive.google.com/file/d/1JIes9QhZw7EUt59EBDUgrdFMokoT1W8u/view)  

### Haley - Political Bias
#### Microfactors: 
Sentiment Analysis  
Party Affiliation  
Vocab Selection Bias  

Datasets:
1. Politifact
2. [GoogleNews API](https://docs.google.com/spreadsheets/d/1Uu-266Q0ab88fnnjtrZ8MMMV8KGNzGoiGKXbBssza2s/edit?usp=sharing)
3. [Ideological Book Corpus](https://people.cs.umass.edu/~miyyer/ibc/index.html) / [Kaggle Tweets](https://docs.google.com/spreadsheets/d/14KRtIdMqbp1Tnd7AraR-ROtPDTSQgc--hMmm0L7baDc/edit?usp=sharing)

### Jimmy - Naive Realism
#### Microfactors:  
Topic Centrality  
Polarization  
Source Centrality  

## Team DataCorps - Yuxing Wang, Arun Talkad, Mayuri Lalwani

### Yuxing - Psychology Utilities
#### Microfactors:
Group confirmation<br />
Opinion leader<br />
Sentiment<br />
Datasets:
1. Politifact
2. Twitter API
3. News API

### Mayuri - Intent
#### Microfactors:
Utterance<br />
Speech<br />
Sentiment<br />
Datasets:
1. politifact
2. twitter
3. newsapi

### Arun - Source Reputation, Source Reliability
#### Microfactors:
Provenance Analysis <br />
News Subjectivity   <br />             
News Credibility  <br />              
News Veracity Detection <br />
Datasets:
1. Politifact
2. Twitter API
3. News API <br />

## Team Sparrow 
### Princy
#### Microfactors
Text similarity<br/>
Sentiment Polarity <br/>
Datasets:
1. [Stance Dataset](http://www.fakenewschallenge.org)
2. [ISOT Fake News Dataset](https://www.uvic.ca/engineering/ece/isot/datasets/fake-news/index.php)
3. [Kaggle](https://www.kaggle.com/c/fake-news/)

## Team Amalgam
### Surabhi: Credibility
#### Microfactors
Author Expertise<br />
Content Credibility<br />
Text Readability<br />
Datasets Used: 
1. Scraped Data from Politifact website 
2. Scraped news article from web

### Arpitha:  Style based approaches
#### Microfactors
Hyperpartisan<br />
Yellow Journalism<br />
Deception/Lying in text<br />
Datasets Used: 
1. Kaggle fake news dataset: https://www.kaggle.com/surekharamireddy/fake-news-detection
2. SemEval Hyperpartisan News Detection task dataset: https://pan.webis.de/semeval19/semeval19-web/

### Gayathri: Authenticity
#### Microfactors
Flesch Reading Ease Score<br />
Polarity score<br />
Subjectivity Score<br />
Datasets Used: 
1. Scraped Data from Politifact website



## Team Underdog 
### Jocelyn 
### Source Reputation
#### Microfactors
Source Ratings Score<br />
Reputation Score<br />
Sentence Similarity
### Datasets used:
1. Scraped Data from Politifact and FoxNews Website

________
## Team Musketeers
### Raghava Devaraje Urs
#### Microfactors
**Political Affiliation**
1. Sentiment analysis
2. Party affiliations 
3. Click Bait

### Kumuda Benakanahalli 
#### Microfactors
**Spam**
1. Source Reputation
2. Spam word percentage
3. Comprehensive index

### Shiv Kumar Ganesh
#### Microfactors
**Writing Style**
1. Vocab Analysis
2. Lexical Analysis
3. Readability Analysis
_____
## Team ml-coders
###  
#### Microfactors
1. Sentiment Intensity
2. Political Bias
3. Readability Score
1. Clickbait
2. Toxicity
3. Subjectivity
4. Sentiment Polarity
1. Sensationalism
2. Linguistic Bias
3. Vagueness
1. Sentiment Analysis
2. Readability Analysis


_____

In [1]:
!pip install sentence-transformers
!pip install transformers



In [2]:
# Import standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from io import BytesIO
import requests
import pickle
import nltk
from transformers import pipeline
nltk.download('punkt')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
!pip install -U -q pyDrive

In [4]:
# Import packages for google drive, auth
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
gdrive = GoogleDrive(gauth)

In [5]:
from sklearn.preprocessing import PolynomialFeatures
from textblob import TextBlob
from sentence_transformers import SentenceTransformer

# 1.0. Read in streaming news headlines

a. Streaming news headlines are from https://newsapi.org/  
b. Streaming news headlines are from CNN, Breitbart News and Fox News, 2021/4/25 - 2021/4/26

In [6]:
r = requests.get('https://docs.google.com/spreadsheets/d/e/2PACX-1vQoXVHhfQlxAlQ8b3eHot7dDhXmCYM9iYC7i0mZMMpzwejhvCjMeEEHPTRhI7KCqOkRbmHBfsxKp0gw/pub?gid=1486725861&single=true&output=tsv')
data = r.content
df_test_headlines = pd.read_csv(BytesIO(data), sep='\t')
df_test_headlines

Unnamed: 0,date,text,body,source,preprocessed_statement_text,preprocessed_body
0,2021-04-26T23:59:00Z,India is spiraling deeper into Covid-19 crisis...,India is experiencing the world's worst Covid-...,CNN,india spiral deeper covid crisi need know,india experienc world worst covid outbreak rec...
1,2021-04-26T23:52:26Z,New York to lose House seat -- and an Electora...,New York state came just 89 residents short of...,CNN,new york lose hous seat elector colleg vote fa...,new york state came resid short maintain congr...
2,2021-04-26T23:29:09Z,Trump's effort to overturn loss becomes 2022 G...,Former North Carolina Gov. Pat McCrory acknowl...,CNN,trump effort overturn loss becom gop litmus te...,former north carolina gov pat mccrori acknowle...
3,2021-04-26T23:27:31Z,Fox News host admits his show was wrong about ...,A Fox News anchor admitted on air on Monday th...,CNN,fox news host admit show wrong biden limit red...,fox news anchor admit air monday show inaccur ...
4,2021-04-26T22:40:13Z,The Chauvin trial produced a new liberal icon,The conviction last week of former Minneapolis...,CNN,chauvin trial produc new liber icon,convict last week former minneapoli polic offi...
...,...,...,...,...,...,...
294,2021-04-26T16:57:43Z,Disney World is hiring in preparation for capa...,Things are apparently getting busier at Disney...,Fox News,disney world hire prepar capac increas report,thing appar get busier disney world
295,2021-04-26T16:53:12Z,New York Times 'buried' bombshell that John Ke...,"The New York Times is taking criticism for ""bu...",Fox News,new york time buri bombshel john kerri told ir...,new york time take critic buri report former s...
296,2021-04-26T16:39:50Z,"Arizona's Flag Fire balloons in size, promptin...",The Flag Fire in Arizona has swelled to a dang...,Fox News,arizona flag fire balloon size prompt evacu,flag fire arizona swell danger level prompt of...
297,2021-04-26T16:36:44Z,Used pickup prices are skyrocketing amid new v...,A shortage of new vehicles has led to steep in...,Fox News,use pickup price skyrocket amid new vehicl sho...,shortag new vehicl led steep increas price use...


# 2.0. Predict whether a news headline is true/false by ensembling factors

## 2.a. Define a false-o-meter
1. Associate a weight with each micro-factor, proportional to its model accuracy.
2. The probability that a news item is false is obtained by ensembling the micro-factors as follows.  
Define a [false-o-meter] as the weighted sum `f(p) = p0*w0 + p1*w1 + p2*w2 + ...`, where  
i. p is the predicted probability of a micro-factor  
ii. w is the normalized weight of that micro-factor, proportional to the accuracy of its prediction

3. Label the news item based on the false-o-meter reading f(p):  
i. Pants on Fire - if false-o-meter > 0.9  
ii. Somewhat False - if 0.7 < false-o-meter < 0.9  
iii. Mostly False - if 0.5 < false-o-meter < 0.7  
iv. Half True - if 0.3 < false-o-meter < 0.5  
v. Mostly True - if 0.1 < false-o-meter < 0.3  
vi. True - if false-o-meter < 0.1
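
The ensembling and labeling rules above can be sketched in plain Python. This is a minimal illustration rather than the notebook's implementation: the names `false_o_meter` and `label_for`, along with the example probabilities and accuracies, are hypothetical. Boundary values (for example a reading of exactly 0.9) are assigned to the lower band, which is an assumption because the list above leaves them unspecified.

```python
def false_o_meter(probs, accuracies):
    """Weighted sum of micro-factor probabilities.

    Weights are the accuracies normalized to sum to 1.
    """
    total = sum(accuracies)
    weights = [acc / total for acc in accuracies]
    return sum(p * w for p, w in zip(probs, weights))


def label_for(reading):
    """Map a false-o-meter reading to a label (boundaries go to the lower band)."""
    if reading > 0.9:
        return "Pants on Fire"
    if reading > 0.7:
        return "Somewhat False"
    if reading > 0.5:
        return "Mostly False"
    if reading > 0.3:
        return "Half True"
    if reading > 0.1:
        return "Mostly True"
    return "True"


# Hypothetical probabilities and accuracies for three micro-factors
reading = false_o_meter([0.9, 0.2, 0.6], [0.85, 0.73, 0.89])
print(round(reading, 3), label_for(reading))
```

Because the weights sum to 1 and every probability lies in [0, 1], a reading computed this way also stays in [0, 1].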

## 2.b. Define Stance Predictor

In [7]:
def apply_stance_detection_featurization(df_, sentiment_analyzer, headlineCol='Headline', bodyCol='ArticleBody'):
  orig_cols = df_.copy().columns
  df_['body_sentiment_score'] = df_[bodyCol].apply(lambda text: sentiment_analyzer.polarity_scores(text)['compound'])
  df_['body_subjectivity_score'] = df_[bodyCol].apply(lambda text: TextBlob(text).sentiment[1])
  df_['title_sentiment_score'] = df_[headlineCol].apply(lambda text: sentiment_analyzer.polarity_scores(text)['compound'])
  df_['title_subjectivity_score'] = df_[headlineCol].apply(lambda text: TextBlob(text).sentiment[1])
  df_ = df_.reset_index()

  feature_names = ['body_sentiment_score', 'body_subjectivity_score', 'title_sentiment_score', 'title_subjectivity_score']
  poly = PolynomialFeatures(interaction_only=True)
  interaction_features = pd.DataFrame(poly.fit_transform(df_[feature_names].to_numpy()))
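  # get_feature_names was removed in scikit-learn 1.2; prefer get_feature_names_out when available
  if hasattr(poly, 'get_feature_names_out'):
    interaction_feature_names = list(poly.get_feature_names_out(feature_names))
  else:
    interaction_feature_names = poly.get_feature_names(input_features=feature_names)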
  interaction_features.columns = interaction_feature_names
  interaction_features = interaction_features.drop(['1'], axis=1)
  interaction_feature_names.remove('1')

  headline_sentence_embeddings = pd.DataFrame(np.stack(df_[headlineCol].apply(transformer_model.encode).to_numpy()), columns=['heademb_{}'.format(i) for i in range(768)])
  article_sentence_embeddings = pd.DataFrame(np.stack(df_[bodyCol].apply(transformer_model.encode).to_numpy()), columns=['artemb_{}'.format(i) for i in range(768)])

  features = pd.concat([df_[orig_cols], interaction_features, headline_sentence_embeddings, article_sentence_embeddings], axis=1)
  return features

In [8]:
file_id = '1DV5hmLvLJWYBviF6ps0nXV5FCu1URXyL'
model_filename = 'stance_detection.pkl'
downloaded = gdrive.CreateFile({'id': file_id})
downloaded.GetContentFile(model_filename)
pickle_filepath = '/content/{}'.format(model_filename)
stance_detection_model = pickle.load(open(pickle_filepath, 'rb'))

sentiment_analyzer = SentimentIntensityAnalyzer()
transformer_model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

def getStancePrediction(X_headline, X_body):
  prob = [0, 0, 0]
  if X_headline.size == 1:
    # prepare data for stance prediction
    df = pd.DataFrame([(X_headline.iloc[0], X_body.iloc[0])], columns=['Headline', 'ArticleBody']) 
    stance_detection_features = apply_stance_detection_featurization(df, sentiment_analyzer, headlineCol='Headline', bodyCol='ArticleBody')
    stance_detection_X = stance_detection_features.drop(['Headline', 'ArticleBody'], axis=1).to_numpy()
    stance_detection_prediction_score = stance_detection_model.predict_proba(stance_detection_X)[0]
    # make a prediction
    prob = stance_detection_prediction_score

  return prob
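
To make the interaction features concrete, the sketch below reproduces by hand what a degree-2, interaction-only expansion yields once the constant `'1'` column is dropped: the four original scores plus their six pairwise products. The helper `interaction_features` and the example scores are hypothetical, and the space-separated naming of product columns only mirrors the convention scikit-learn is understood to use.

```python
from itertools import combinations


def interaction_features(values, names):
    """Return (names, values) for the original features plus all pairwise products."""
    out_names = list(names)
    out_values = list(values)
    for i, j in combinations(range(len(values)), 2):
        out_names.append(f"{names[i]} {names[j]}")
        out_values.append(values[i] * values[j])
    return out_names, out_values


names = ['body_sentiment_score', 'body_subjectivity_score',
         'title_sentiment_score', 'title_subjectivity_score']
values = [0.5, 0.2, -0.4, 0.1]  # hypothetical scores for one headline/body pair

feat_names, feat_values = interaction_features(values, names)
print(len(feat_names))  # 4 original scores + 6 pairwise products
for name, value in zip(feat_names, feat_values):
    print(name, value)
```

In the featurization above, these 10 columns are concatenated with the two 768-dimensional sentence embeddings before being passed to the stance model.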

## 2.c. Define Sentiment Predictor

In [9]:
def getSentimentPrediction(X_news):
  prob = 0
  if X_news.size == 1:
    file_id = '1eZ0TycVjHAyaFh8eKDmyLiQ_DN8rOcbI'
    model_filename = 'Best_Sentiment_Analysis_Model_Misleading_Intentions.pkl'
    downloaded = gdrive.CreateFile({'id': file_id})
    downloaded.GetContentFile(model_filename)
    pickle_filepath = '/content/{}'.format(model_filename)
    best_sentiment_model = pickle.load(open(pickle_filepath, 'rb'))
    prob = best_sentiment_model.predict_proba(X_news)[:,1]
  return float(prob)

## 2.d. Define Sensationalism Predictor

In [10]:
def getSensationalismPrediction(X_news):
  prob = 0
  if X_news.size == 1:
    file_id = '1XEYOqUEkI52tW7ZWtIGRq0Qe5dOd2I_S'
    model_filename = 'Best_Sensationalism_Analysis_Model_Misleading_Intentions.pkl'
    downloaded = gdrive.CreateFile({'id': file_id})
    downloaded.GetContentFile(model_filename)
    pickle_filepath = '/content/{}'.format(model_filename)
    best_sensationalism_model = pickle.load(open(pickle_filepath, 'rb'))
    prob = best_sensationalism_model.predict_proba(X_news)[:,1]
  return float(prob)

## 2.e. Define ClickBait Predictor

In [11]:
def getDistilledClickBaitPrediction(X_news):
  prob = 0
  if X_news.size == 1:
    file_id = '1pgSrMJD0m_7Cd1fg1xoZEN2P_CjnpUkb'
    model_filename = 'Best_Clickbait_Analysis_Model_Misleading_Intentions.pkl'
    downloaded = gdrive.CreateFile({'id': file_id})
    downloaded.GetContentFile(model_filename)
    pickle_filepath = '/content/{}'.format(model_filename)
    best_distilled_clickbait_model = pickle.load(open(pickle_filepath, 'rb'))
    prob = best_distilled_clickbait_model.predict_proba(X_news)[:,1]
  return float(prob)
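
Each of the three predictors above downloads and unpickles its model on every call. One possible way to avoid the repeated work is to memoize the loader. The sketch below uses `functools.lru_cache` around a stand-in loader; `fetch_model`, `load_model_cached`, and the placeholder ID are hypothetical and only mimic the download-and-unpickle step.

```python
from functools import lru_cache

load_count = {"n": 0}  # counts how often the expensive load actually runs


def fetch_model(file_id):
    """Stand-in for the Drive download plus pickle.load step (hypothetical)."""
    load_count["n"] += 1
    return {"file_id": file_id}


@lru_cache(maxsize=None)
def load_model_cached(file_id):
    """Load a model once per file_id and reuse it on later calls."""
    return fetch_model(file_id)


m1 = load_model_cached("example-id")
m2 = load_model_cached("example-id")
print(m1 is m2, load_count["n"])
```

With this pattern, a predictor would call the cached loader instead of repeating the download, so scoring 20 headlines would load each model once rather than 20 times.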

## 2.f. Define Political Bias



In [12]:
# Microfactor 1
def get_BSF(df): # Balance Sentiment Factor
  if len(df) == 1:
    df['BSF'] = df['Positive']/df['Negative']
  else:
    pos_mean = df['Positive'].mean()
    neg_mean = df['Negative'].mean()
    balance_sentiment = (abs(pos_mean - df['Positive'])+abs(neg_mean - df['Negative']))/2
    df['BSF'] = balance_sentiment 
  return df

# Microfactor 2
def get_SPR(df): # Standardized Party Ratio
  # Create a ratio to measure if text has a leniency towards a particular party
  party_ratio = df['Democrat']/df['Republican']
  # Standardized the ratio to make use of the overall mean and stand deviation
  if len(df) == 1:
    df['SPR'] = party_ratio
  else:
    df['SPR'] = abs(party_ratio - np.mean(party_ratio))/np.std(party_ratio)
  return df

# Microfactor 3
def get_selection_bias(df, text_col):
  # Tokenize the (already cleaned) text column
  clean_text_token = df[text_col].apply(nltk.word_tokenize)

  def count_bias_vocab(target, bias_list):
    count = 0 
    for vocab in bias_list:
      if vocab in target:
        count += 1
    return count/len(bias_list)
  
  file_id = '15DBBkgI0TVfciwwhDptWoblvhIHGKmq9'
  model_filename = 'vocab_selection.pkl'
  downloaded = gdrive.CreateFile({'id': file_id})
  downloaded.GetContentFile(model_filename)
  pickle_filepath = '/content/{}'.format(model_filename)
  vocab_selection = pickle.load(open(pickle_filepath, 'rb'))

  lib_vocab_rate = clean_text_token.apply(count_bias_vocab,bias_list=vocab_selection['liberal'])
  con_vocab_rate = clean_text_token.apply(count_bias_vocab,bias_list=vocab_selection['conservative'])
  dem_vocab_rate = clean_text_token.apply(count_bias_vocab,bias_list=vocab_selection['democrat'])
  rep_vocab_rate = clean_text_token.apply(count_bias_vocab,bias_list=vocab_selection['republican'])

  # Create two new features
  # Weight more on liberal and conservative vocabs
  df['Dem_Vocab_Freq'] = 0.6*lib_vocab_rate+0.4*dem_vocab_rate
  df['Rep_Vocab_Freq'] = 0.6*con_vocab_rate+0.4*rep_vocab_rate
  # Create a selection bias feature
  df['Selection_Bias'] = df[["Dem_Vocab_Freq", "Rep_Vocab_Freq"]].max(axis=1)
  return df

# Combine all microfactors
def get_political_bias(df):
  # Microfactor final calculation
  if len(df) == 1:
    BSF_diff = df['BSF']
    SPR_diff = df['SPR']
  else:
    BSF_diff = abs(df['BSF'].mean() - df['BSF'])
    BSF_diff = (BSF_diff-min(BSF_diff))/(max(BSF_diff)-min(BSF_diff))
    SPR_diff = abs(df['SPR'].mean() - df['SPR'])
    SPR_diff = (SPR_diff-min(SPR_diff))/(max(SPR_diff)-min(SPR_diff))
  # Combine all the microfactors together
  political_bias = (0.2*BSF_diff+0.2*SPR_diff+0.2*(1-df['Neutral'])+0.4*df['Selection_Bias'])
  df['Political_Bias'] = political_bias
  return df

To avoid reloading the zero-shot model at inference time, and because some microfactors depend on statistics computed over the whole dataframe, the feature generation step (`zero_shot_microfactors`) is performed in the data prep notebook.

In [13]:
def polit_bias_pipeline(df, clean_text_col):
  #df = zero_shot_microfactor(df, text_col) 
  df = get_BSF(df)
  df = get_SPR(df)
  df = get_selection_bias(df, clean_text_col)
  df = get_political_bias(df)
  return df, df['Political_Bias']
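
As a worked example of the final combination in `get_political_bias`, the sketch below applies min-max scaling and then the 0.2 / 0.2 / 0.2 / 0.4 weighting to hand-picked numbers. The helper names `min_max` and `political_bias` and all input values are hypothetical.

```python
def min_max(values):
    """Scale values to [0, 1]; assumes they are not all equal (avoids division by zero)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def political_bias(bsf_diff, spr_diff, neutral, selection_bias):
    """Weighted combination of the microfactors, using the same weights as get_political_bias."""
    return 0.2 * bsf_diff + 0.2 * spr_diff + 0.2 * (1 - neutral) + 0.4 * selection_bias


scaled = min_max([2.0, 4.0, 6.0])            # deviations rescaled to [0.0, 0.5, 1.0]
score = political_bias(0.5, 0.5, 0.5, 0.25)  # 0.1 + 0.1 + 0.1 + 0.1
print(scaled, round(score, 3))
```

The four weights sum to 1, so the score stays in [0, 1] as long as every input does. In the single-row branch of the notebook code the raw `BSF` and `SPR` ratios are used without scaling, so that bound is not guaranteed there.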

## 2.g. Title-Body Similarity Predictor

In [14]:
!pip install nltk==3.4 --quiet

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.util import ngrams
from scipy.sparse import vstack
from scipy.spatial.distance import cosine
from sklearn import preprocessing
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
import warnings
warnings.filterwarnings('ignore')

def cnt_sentences(df):
  df['cnt_title_sentences'] = df['clean_title'].apply(lambda x: len(sent_tokenize(x)))
  df['cnt_text_sentences'] = df['clean_body'].apply(lambda x: len(sent_tokenize(x)))

def ngram(text, n):
    n_grams = ngrams(word_tokenize(text), n)
    return [ '_'.join(grams) for grams in n_grams]

# Uni, Bi, Tri grams to get common word count features
def generate_ngrams(df):
  df["title_uni"] = df["clean_title"].map(lambda x: ngram(x, 1))
  df["body_uni"] = df["clean_body"].map(lambda x: ngram(x, 1))
  df["cnt_title_uni"] = list(df.apply(lambda x: len(x['title_uni']), axis=1))
  df["cnt_body_uni"] = list(df.apply(lambda x: len(x['body_uni']), axis=1))
  df["unq_cnt_title_uni"] = list(df.apply(lambda x: len(set(x['title_uni'])), axis=1))
  df["unq_cnt_body_uni"] = list(df.apply(lambda x: len(set(x['body_uni'])), axis=1))

  df["title_bi"] = df["clean_title"].map(lambda x: ngram(x, 2))
  df["body_bi"] = df["clean_body"].map(lambda x: ngram(x, 2))
  df["cnt_title_bi"] = list(df.apply(lambda x: len(x['title_bi']), axis=1))
  df["cnt_body_bi"] = list(df.apply(lambda x: len(x['body_bi']), axis=1))
  df["unq_cnt_title_bi"] = list(df.apply(lambda x: len(set(x['title_bi'])), axis=1))
  df["unq_cnt_body_bi"] = list(df.apply(lambda x: len(set(x['body_bi'])), axis=1))

  df["title_tri"] = df["clean_title"].map(lambda x: ngram(x, 3))
  df["body_tri"] = df["clean_body"].map(lambda x: ngram(x, 3))
  df["cnt_title_tri"] = list(df.apply(lambda x: len(x['title_tri']), axis=1))
  df["cnt_body_tri"] = list(df.apply(lambda x: len(x['body_tri']), axis=1))
  df["unq_cnt_title_tri"] = list(df.apply(lambda x: len(set(x['title_tri'])), axis=1))
  df["unq_cnt_body_tri"] = list(df.apply(lambda x: len(set(x['body_tri'])), axis=1))

def common_ngrams_in_body(df):
  df["cnt_title_unis_in_body"] =  list(df.apply(lambda x: sum([1. for w in x['title_uni'] if w in set(x['body_uni'])]), axis=1))
  df["cnt_title_bis_in_body"] =  list(df.apply(lambda x: sum([1. for w in x['title_bi'] if w in set(x['body_bi'])]), axis=1))
  df["cnt_title_tris_in_body"] =  list(df.apply(lambda x: sum([1. for w in x['title_tri'] if w in set(x['body_tri'])]), axis=1))

def concat_title_body(df):
  df['clean_title_body'] = df['clean_title'] + ' ' + df['clean_body']

def tf_idf(df):
  concat_title_body(df)
  combined_vectors = TfidfVectorizer(ngram_range=(1, 2), min_df=1, max_df=1, use_idf=True, smooth_idf=True)
  combined_vectors.fit(df["clean_title_body"])
  combined_vectors_dictionary = combined_vectors.vocabulary_
  title_vectors = TfidfVectorizer(ngram_range=(1, 2), min_df=1, max_df=1, use_idf=True, smooth_idf=True, vocabulary=combined_vectors_dictionary)
  title_tfidf_vectors = title_vectors.fit_transform(df['clean_title'])
  text_vectors = TfidfVectorizer(ngram_range=(1, 2), min_df=1, max_df=1, use_idf=True, smooth_idf=True, vocabulary=combined_vectors_dictionary)
  text_tfidf_vectors = text_vectors.fit_transform(df['clean_body'])
  return title_tfidf_vectors, text_tfidf_vectors

def similarity_score(df, title_vectors, text_vectors):
  similarity_score = []
  for i in range(len(df)):
      similarity_score.append(1 - cosine(title_vectors[i], text_vectors[i]))
  return similarity_score

def tf_idf_similarity(df):
  title_tfidf_vectors, text_tfidf_vectors = tf_idf(df)
  df['similarity_title_body'] = similarity_score(df, title_tfidf_vectors.toarray(), text_tfidf_vectors.toarray())
  return title_tfidf_vectors, text_tfidf_vectors

def svd(data, title_tfidf_vectors, text_tfidf_vectors):
  truncated_svd = TruncatedSVD(n_components=2, n_iter=10)
  combined_vectors = vstack([title_tfidf_vectors, text_tfidf_vectors])
  truncated_svd.fit(combined_vectors)
  title_svd = truncated_svd.transform(title_tfidf_vectors)
  text_svd = truncated_svd.transform(text_tfidf_vectors)
  return title_svd, text_svd

def topic_similarity(data, title_tfidf_vectors, text_tfidf_vectors):
  title_svd_vectors, text_svd_vectors = svd(data, title_tfidf_vectors, text_tfidf_vectors)
  data['topics_similarity_title_body'] = similarity_score(data, title_svd_vectors, text_svd_vectors)

def get_distilled_dataset(title, text):
  data = {'clean_title': [title.iloc[0]], 'clean_body': [text.iloc[0]]}
  df_test = pd.DataFrame(data)
  cnt_sentences(df_test)
  generate_ngrams(df_test)
  common_ngrams_in_body(df_test)
  title_tfidf_vectors, text_tfidf_vectors = tf_idf_similarity(df_test)
  topic_similarity(df_test, title_tfidf_vectors, text_tfidf_vectors)
  X_cols = [x for i,x in enumerate(features) if x!='label']
  return df_test[X_cols]

features =     ['label',  'cnt_title_uni', 'cnt_body_uni',
                'unq_cnt_title_uni', 'unq_cnt_body_uni', 'cnt_title_bi', 'cnt_body_bi',
                'unq_cnt_title_bi', 'unq_cnt_body_bi', 'cnt_title_tri', 'cnt_body_tri',
                'unq_cnt_title_tri', 'unq_cnt_body_tri', 'cnt_title_unis_in_body', 
                'cnt_title_bis_in_body', 'cnt_title_tris_in_body', 'similarity_title_body',
                'topics_similarity_title_body',
                ]

le = preprocessing.LabelEncoder()
le.fit(['agree', 'disagree', 'discuss', 'unrelated'])
dict(zip(le.classes_, le.transform(le.classes_)))

{'agree': 0, 'disagree': 1, 'discuss': 2, 'unrelated': 3}

In [15]:
def getTitleVsBodyPrediction(title, body):
  file_id = '1bwvFThCwg6pgM99R6p7Ly7K5vXPwRj6q'
  model_filename = 'title_body_similarity_model.pkl'
  downloaded = gdrive.CreateFile({'id': file_id})
  downloaded.GetContentFile(model_filename)
  pickle_filepath = '/content/{}'.format(model_filename)
  title_body_similarity_model = pickle.load(open(pickle_filepath, 'rb'))
  df_test = get_distilled_dataset(title, body)
  return title_body_similarity_model.predict(df_test), title_body_similarity_model.predict_proba(df_test)
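
The n-gram overlap counts that feed this model can be illustrated without NLTK. The sketch below splits on whitespace instead of calling `word_tokenize` (an assumption made to keep it dependency-free), and the helper names and example strings are hypothetical.

```python
def ngram_list(tokens, n):
    """Return the n-grams of a token list, joined with underscores."""
    return ['_'.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_title_ngrams_in_body(title, body, n):
    """Count how many title n-grams also occur in the body."""
    title_grams = ngram_list(title.split(), n)
    body_grams = set(ngram_list(body.split(), n))
    return sum(1 for gram in title_grams if gram in body_grams)


title = "flag fire arizona"
body = "flag fire arizona swell danger level"

uni = count_title_ngrams_in_body(title, body, 1)
bi = count_title_ngrams_in_body(title, body, 2)
tri = count_title_ngrams_in_body(title, body, 3)
print(uni, bi, tri)
```

A title whose unigrams, bigrams, and trigrams all reappear in the body, as here, suggests that headline and article are about the same thing; low counts point toward the `unrelated` class.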

## Define Spam Score

In [16]:
!pip install textstat



In [17]:
from joblib import dump,load
def load_ham_model():
  !cp '/content/drive/MyDrive/MLSpring-2021/TeamIntegration_MLSpring2021/models/Kumuda _SpamFactor/ham_vectorizer' -d /content/ham_vectorizer
  ! cp  '/content/drive/MyDrive/MLSpring-2021/TeamIntegration_MLSpring2021/models/Kumuda _SpamFactor/ham_classifier.model' -d /content/ham_classifier.model
  ham_classifier=load('/content/ham_classifier.model')
  ham_vectorizer=load('/content/ham_vectorizer')
  return ham_vectorizer,ham_classifier

def load_spam_model():
  !cp '/content/drive/MyDrive/MLSpring-2021/TeamIntegration_MLSpring2021/models/Kumuda _SpamFactor/spam_vectorizer' -d /content/spam_vectorizer
  ! cp  '/content/drive/MyDrive/MLSpring-2021/TeamIntegration_MLSpring2021/models/Kumuda _SpamFactor/spam_classifier.model' -d /content/spam_classifier.model
  spam_classifier=load('/content/spam_classifier.model')
  spam_vectorizer=load('/content/spam_vectorizer')
  return spam_vectorizer,spam_classifier

In [18]:
ham_vectorizer,ham_classifier=load_ham_model()
spam_vectorizer,spam_classifier=load_spam_model()

In [19]:
def calculate_ham_score(X_news):
  X_train=ham_vectorizer.transform(X_news)
  ham_score=ham_classifier.predict_proba(X_train)[:,1]
  return float(ham_score[0])

In [20]:
def calculate_spam_score(X_news):
  X_train=spam_vectorizer.transform(X_news)
  spam_score=spam_classifier.predict_proba(X_train)[:,1]
  return float(spam_score[0])

In [21]:
import textstat

def get_reading_ease(X_news):
  # Flesch reading ease of the single news item in X_news
  reading_ease = X_news.apply(textstat.flesch_reading_ease)
  return float(reading_ease.iloc[0])

In [22]:
import textstat
def generateSpamScore(X_news):
  accuracy = [0.4,0.4, 0.001]
  w = [float(i)/sum(accuracy) for i in accuracy]
  sumW = 0
  prob = []
  ham_score=calculate_ham_score(X_news)
  prob.append(w[0] *(1- ham_score))#Ham Score
  sumW =sumW + w[0]

  spam_score=calculate_spam_score(X_news)
  prob.append(w[1] * spam_score)#Spam Score
  sumW += w[1]
  prob.append(w[2] * get_reading_ease(X_news)) #Reading Ease
  sumW += w[2]
   
  probTotal = sum(prob[0:len(prob)]) / sumW
  return probTotal
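
The arithmetic in `generateSpamScore` can be checked by hand. The sketch below normalizes the same `[0.4, 0.4, 0.001]` list and combines three hypothetical inputs; the helper name `spam_score` and the input values are assumptions for illustration.

```python
accuracy = [0.4, 0.4, 0.001]
w = [a / sum(accuracy) for a in accuracy]  # normalized weights, summing to 1


def spam_score(ham, spam, reading_ease):
    """Combine (1 - ham), spam and reading ease with the normalized weights."""
    return w[0] * (1 - ham) + w[1] * spam + w[2] * reading_ease


# Hypothetical inputs: ham and spam probabilities of 0.5, reading ease of 60
score = spam_score(0.5, 0.5, 60.0)
print([round(x, 5) for x in w], round(score, 4))
```

Two points follow from this. Since the weights already sum to 1, the final division by `sumW` leaves the total unchanged. The reading-ease term is not a probability (Flesch scores are typically between 0 and 100), so its small weight acts as a rescaling factor rather than as a measure of accuracy.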

## 2.h. Define a false-o-meter

In [23]:
# get false-o-meter reading of news item
def get_false_o_meter_reading(df, index_num, x_news, x_body):
    X_news = df[x_news]
    X_body = df[x_body]
    model_accuracy = [0.85, 0.73, 0.89, 0.92, 0.86, 0.97, 0.6, 0.7,0.83] 
    model_weight = [acc/sum(model_accuracy) for acc in model_accuracy]
    probablity_false_news = []

    # get sentiment reading
    sentiment_prob = getSentimentPrediction(X_news)
    probablity_false_news.append(model_weight[0] * sentiment_prob)
    print('Sentiment [false-o-meter] reading is %f ' %(sentiment_prob))

    # get sensationalism reading
    sensationalism_prob = getSensationalismPrediction(X_news)
    probablity_false_news.append(model_weight[1] * sensationalism_prob)
    print('Sensationalism [false-o-meter] reading is %f ' %(sensationalism_prob))

    # get distilled clickbait reading
    clickbait_prob = getDistilledClickBaitPrediction(X_news)
    probablity_false_news.append(model_weight[2] * clickbait_prob)
    print('Distilled Clickbait [false-o-meter] reading is %f ' %(clickbait_prob))

    # get stance reading
    (agree_stance_prob, disagree_stance_prob, neutral_stance_prob) = getStancePrediction(X_news, X_body)
    probablity_false_news.append(model_weight[3] * agree_stance_prob) # agree stance
    probablity_false_news.append(model_weight[4] * disagree_stance_prob) # disagree stance
    probablity_false_news.append(model_weight[5] * neutral_stance_prob) # neutral stance
    print('Stance [false-o-meter] reading is (agree: %f, disagree: %f, neutral: %f)' % (agree_stance_prob, disagree_stance_prob, neutral_stance_prob))

    # get political bias
    r = requests.get('https://docs.google.com/spreadsheets/d/e/2PACX-1vQd6WhaekUPRDxUIYXgx_zI_zAHodXl3__bfAnEa_GWT_eR9dVO55HALi_3jjnZmEwbZ_4YvUkG7Qtx/pub?gid=896471122&single=true&output=tsv')
    data = r.content
    zero_shot_microfactors = pd.read_csv(BytesIO(data), sep='\t')
    pb_df, political_bias_prob = polit_bias_pipeline(zero_shot_microfactors, x_news)
    probablity_false_news.append(model_weight[6] * political_bias_prob.loc[index_num])
    print('Political Bias [false-o-meter] reading is %f ' %(political_bias_prob.loc[index_num]))

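    # get title body similarity
    # NOTE: the expression below adds 0.4 (possibly a typo for "* 0.4"), so the score can
    # exceed 1 (see the 1.400000 reading in the output below). model_weight[6] is also
    # reused here, while model_weight[7] is never applied.
    title_body_pred, title_body_pred_prob = getTitleVsBodyPrediction(X_news, X_body)
    t_b_fake_score = (title_body_pred_prob[0][1] * 0.6 + title_body_pred_prob[0][3] + 0.4)
    probablity_false_news.append(model_weight[6] * t_b_fake_score)
    print('Title-Body incongruence [false-o-meter] reading is %f ' %(t_b_fake_score))

    # get distilled psychology utility reading
    # NOTE: reuses model_weight[2] (the clickbait weight); no separate accuracy is listed for this factor
    psychology_prob = getPsychologyUtilitiesPrediction(X_news)
    probablity_false_news.append(model_weight[2] * psychology_prob)
    print('Distilled Psychology Utility [false-o-meter] reading is %f ' %(psychology_prob))

    # get distilled intent reading
    # NOTE: reuses model_weight[0] (the sentiment weight); no separate accuracy is listed for this factor
    intent_prob = getIntentPrediction(X_news)
    probablity_false_news.append(model_weight[0] * intent_prob)
    print('Distilled Intent [false-o-meter] reading is %f ' %(intent_prob))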

    # get spam score
    spam_score= generateSpamScore(X_news)
    probablity_false_news.append(model_weight[8] * spam_score)
    print('Spam score is %f' %(spam_score))

    cummalative_probablity_false_news = sum(probablity_false_news)
    print('Ensembled [false-o-meter] reading is %f ' %(cummalative_probablity_false_news))

    return cummalative_probablity_false_news

## 2.i. Define Psychology Predictor

In [24]:
import string
import joblib

def get_text_processing(text):
  stop_words = stopwords.words('english')
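  # extend (not append) so the word is added as a stop word instead of a nested list;
  # matching is done on word.lower(), so the lowercase form is sufficient
  stop_words.extend(['breaking'])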
  no_punctuation = [char for char in text if char not in string.punctuation]
  no_punctuation = ''.join(no_punctuation)
  return ' '.join([word for word in no_punctuation.split() if word.lower() not in stop_words])

def getPsychologyUtilitiesPrediction(X_news):
  prob = 0
  X_news = X_news.apply(get_text_processing)
  if X_news.size == 1:
    file_id = '16egOQ8zTftur5jPFxfWTwYDTOOjdhQ6e'
    model_filename = 'PsychologyUtilites_pipeline.pkl'
    downloaded = gdrive.CreateFile({'id': file_id})
    downloaded.GetContentFile(model_filename)
    pickle_filepath = '/content/{}'.format(model_filename)
    best_distilled_psychology_model = joblib.load(open(pickle_filepath, 'rb'))
    prob = best_distilled_psychology_model.predict(X_news)
  return 1 if prob == 'Positive' else 0

## 2.j. Define Intent Predictor

In [25]:
def getIntentPrediction(X_news):
  prob = 0
  X_news = X_news.apply(get_text_processing)
  if X_news.size == 1:
    file_id = '1BFXgdw2MvJZl39CUx0jvfms1nzQI810j'
    model_filename = 'Intent_pipeline.pkl'
    downloaded = gdrive.CreateFile({'id': file_id})
    downloaded.GetContentFile(model_filename)
    pickle_filepath = '/content/{}'.format(model_filename)
    best_distilled_intent_model = joblib.load(open(pickle_filepath, 'rb'))
    prob = best_distilled_intent_model.predict(X_news)
  return 1 if prob == 'Positive' else 0
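
The preprocessing shared by the psychology and intent predictors can be tried without NLTK by substituting a small hand-written stop-word set. In the sketch below, `STOP_WORDS` and `strip_punct_and_stop_words` are hypothetical stand-ins for the NLTK English list and `get_text_processing`.

```python
import string

# Hypothetical stop-word set; the notebook uses NLTK's English stop words instead
STOP_WORDS = {"the", "is", "a", "breaking"}


def strip_punct_and_stop_words(text, stop_words=STOP_WORDS):
    """Remove punctuation characters, then drop words found in the stop-word set."""
    no_punct = ''.join(ch for ch in text if ch not in string.punctuation)
    return ' '.join(w for w in no_punct.split() if w.lower() not in stop_words)


cleaned = strip_punct_and_stop_words("BREAKING: The senate vote is delayed!")
print(cleaned)  # senate vote delayed
```

Words are lowercased only for the stop-word comparison, so the surviving words keep their original casing.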

# 3.0. Automated Inference Pipeline

## 3.1. Helper: Pick a random news item

In [26]:
def get_random_news_items(num_items):
  random_news_items = df_test_headlines.sample(n=num_items)
  return random_news_items

## 3.2. Helper: Print prediction of news item

In [27]:
def print_prediction(reading):
  # Map the reading to a label; a reading exactly on a boundary falls into the lower band
  if reading > 0.9:
    print ("This news headline is: Pants on Fire")
  elif reading > 0.7:
    print ("This news headline is: Somewhat False")
  elif reading > 0.5:
    print ("This news headline is: Mostly False")
  elif reading > 0.3:
    print ("This news headline is: Half True")
  elif reading > 0.1:
    print ("This news headline is: Mostly True")
  else:
    print ("This news headline is: True")

## 3.3. Helper: Run automated pipeline

In [28]:
def run_automated_inference_pipeline(num_items):
  X_headline = 'preprocessed_statement_text'
  X_body = 'preprocessed_body'
  random_news_items = get_random_news_items(num_items)

  # Iterate over the sampled rows so that each item is scored exactly once
  for i in range(len(random_news_items)):
    news = random_news_items.iloc[[i]]
    print('\nRunning [false-o-meter] on - %s ' % (news['text']))
    reading = get_false_o_meter_reading(news, news.index[0], X_headline, X_body)
    print('[false-o-meter] reading is %f ' %(reading))
    print_prediction(reading)

## 3.4. Invoke automated pipeline on 20 random news items

In [None]:
num_news = 20
run_automated_inference_pipeline(num_news)


Running [false-o-meter] on - 177    Fashion Notes: The 9 Best and Worst Dressed Ac...
Name: text, dtype: object 
Sentiment [false-o-meter] reading is 0.483895 
Sensationalism [false-o-meter] reading is 0.315909 
Distilled Clickbait [false-o-meter] reading is 0.971518 
Stance [false-o-meter] reading is (agree: 0.759230, disagree: 0.000043, neutral: 0.240727)
Political Bias [false-o-meter] reading is 0.295789 
Title-Body incongruence [false-o-meter] reading is 1.400000 
Distilled Psychology Utility [false-o-meter] reading is 0.000000 
Distilled Intent [false-o-meter] reading is 0.000000 
Spam score is 0.461798
Ensembled [false-o-meter] reading is 0.522364 
[false-o-meter] reading is 0.522364 
This news headline is: Mostly False

Running [false-o-meter] on - 236    VA's implant tests could help paralyzed vetera...
Name: text, dtype: object 
Sentiment [false-o-meter] reading is 0.502400 
Sensationalism [false-o-meter] reading is 0.315909 
Distilled Clickbait [false-o-meter] reading is 0.6