<a href="https://colab.research.google.com/github/YuxingW/alternusvera-spring-2021/blob/main/all_factor_ensemble_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## CMPE 257 - ML Spring 2021 Cohort
Objective: Detect fake news in political datasets using factors and
microfactors <br />
Team DataCorps - Yuxing Wang, Arun Talkad, Mayuri Lalwani

## True-o-meter Pipeline

Microfactors for Factor - Psychology Utilities, Yuxing
* Sentiment
* Group confirmation
* Opinion leader

Microfactors for Factor - Intent, Mayuri
* Utterance
* Speech
* Sentiment

Microfactors for Factor - Incredibility, Arun
* Incredibility

Reference: 
* https://towardsai.net/p/nlp/sentiment-analysis-opinion-mining-with-python-nlp-tutorial-d1f173ca4e3c
* https://github.com/towardsai/tutorials/tree/master/sentiment_analysis_tutorial
* https://machinelearningmastery.com/stacking-ensemble-machine-learning-with-python/


## 1.Preparation
* Scrape data from PolitiFact
* Fetch tweets through the Twitter APIs

### 1.1.Scrape data from politifact

In [1]:
!pip install -q beautifulsoup4
!pip install -q vaderSentiment

**Import Required Packages**

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
from bs4 import BeautifulSoup
import re
import requests
import string
from string import punctuation
from sklearn.preprocessing import StandardScaler
from io import BytesIO
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')


import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
def scrape_data_from_politifact(start=1, end=11):
  url = "https://www.politifact.com/issues/"
  issues = []
  r = requests.get(url)
  soup = BeautifulSoup(r.text,'html.parser')
  results = soup.find_all('div', attrs={'class':'c-chyron__value'})
  for result in results:
    name = result.find('a').text
    issue = result.find('a')['href'].replace("/","")
    issues.append((name, issue))
  url = "https://www.politifact.com/factchecks/list/?page={pgno}&category={category}"

  records = []  

  for i in range(start,end):
    for issue, issue_url in issues[0:5]:
      fUrl = url.format(pgno=str(i), category=issue_url)
      r = requests.get(fUrl)
      soup = BeautifulSoup(r.text, 'html.parser')  
      results = soup.find_all('article', attrs={'class':'m-statement'})
      for result in results:
        date = result.find('footer',attrs={'class':'m-statement__footer'}).text.split("•")[1].rstrip("\n")
        reporter = result.find('footer',attrs={'class':'m-statement__footer'}).text.split("•")[0].replace("\nBy","")   
        author = result.find('a',attrs={'class':'m-statement__name'}).text.replace("\n","")
        statement =  result.find('div', attrs = {'class':'m-statement__quote'}).find('a').text.replace("\n","")
        statement_descr = result.find('div', {'class':'m-statement__desc'}).text.replace("\n","")
        article_url =  result.find('a')['href']
        verdict = result.find('img', attrs = {'class':'c-image__thumb'}, alt=True).attrs['alt']
        records.append(( date, issue, reporter, author, statement, statement_descr, verdict, article_url))
  return records

records = scrape_data_from_politifact()
df_politifact = pd.DataFrame(records,
                         columns=['Date', 'Issue','Reporter','Author', 'Statement', 'Description', 'Verdict', 'Url'])  
df_politifact.head()

Unnamed: 0,Date,Issue,Reporter,Author,Statement,Description,Verdict,Url
0,"May 7, 2021",Abortion,Bill McCarthy,Facebook posts,"Says Chelsea Clinton tweeted, ""If Jesus were a...","stated on May 7, 2021 in a Facebook post:",pants-fire,/personalities/facebook-posts/
1,"March 31, 2021",Abortion,Tom Kertscher,Facebook posts,“Joe Biden puts pro-life groups on domestic ex...,"stated on March 29, 2021 in a Facebook post:",barely-true,/personalities/facebook-posts/
2,"February 12, 2021",Abortion,Brandon Mulder,Greg Abbott,“Innocent lives will be saved” by ending taxpa...,"stated on January 24, 2021 in a tweet:",false,/personalities/greg-abbott/
3,"November 18, 2020",Abortion,Noah Y. Kim,Facebook posts,There is “aborted male fetus” in the Oxford-As...,"stated on November 15, 2020 in a Facebook post:",false,/personalities/facebook-posts/
4,"October 14, 2020",Abortion,Tom Kertscher,Tommy Tuberville,"Says Doug Jones ""has voted to spend our tax do...","stated on October 8, 2020 in an ad:",false,/personalities/tommy-tuberville/
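
The scraper above derives `Reporter` and `Date` by splitting the footer text on the bullet character. The sketch below isolates that step so it can be checked without a network call. The helper name `parse_footer` and the sample footer string are hypothetical, and the exact whitespace in the real page may differ.

```python
def parse_footer(footer_text):
    """Split a footer of the form 'By <reporter> • <date>' into its two parts."""
    # Split once on the bullet so a bullet inside the date part would be kept.
    reporter_part, date_part = footer_text.split("•", 1)
    reporter = reporter_part.strip()
    # Drop the leading "By " if present.
    if reporter.startswith("By "):
        reporter = reporter[3:]
    return reporter.strip(), date_part.strip()


# Hypothetical footer text, shaped like what the scraper expects.
sample = "\nBy Bill McCarthy • May 7, 2021\n"
print(parse_footer(sample))
```

Stripping both parts removes the spaces around the bullet, which `rstrip("\n")` and `replace("\nBy", "")` alone leave in place.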


### 1.2.Define callables

In [4]:
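def get_text_processing(text):
    # Start from NLTK's English stop words and add the project-specific ones.
    stop_words = stopwords.words('english')
    # extend() adds each word; append() would add the whole list as one item.
    stop_words.extend(['breaking', 'BREAKING'])
    # Drop ASCII punctuation, then drop stop words (case-insensitive match).
    no_punctuation = [char for char in text if char not in string.punctuation]
    no_punctuation = ''.join(no_punctuation)
    return ' '.join([word for word in no_punctuation.split() if word.lower() not in stop_words])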
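
To see what this preprocessing does to a statement, here is a minimal self-contained sketch. It assumes a tiny hand-picked stop word set in place of NLTK's English list, and the function name `strip_punct_and_stopwords` and the sample sentence are made up for illustration.

```python
import string

# Assumption: a small stand-in for NLTK's English stop words plus 'breaking'.
STOP_WORDS = {"the", "a", "is", "in", "to", "of", "breaking"}


def strip_punct_and_stopwords(text, stop_words=STOP_WORDS):
    # Remove every ASCII punctuation character in one pass.
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # Keep only words whose lowercase form is not a stop word.
    return " ".join(w for w in no_punct.split() if w.lower() not in stop_words)


cleaned = strip_punct_and_stopwords('BREAKING: The bill is "pro-life", says Abbott.')
print(cleaned)
```

Note that `string.punctuation` covers ASCII characters only, so typographic quotes such as “ and ” pass through unchanged.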

In [9]:
def positive_to_num(_df, micfactor):
  # Map 'Positive'/'Negative' labels to 1/0 in place; other columns are left as-is.
  if _df[micfactor].iloc[0] in ['Positive', 'Negative']:
    _df[micfactor] = _df[micfactor].apply(lambda x: 1 if x == 'Positive' else 0)
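
`positive_to_num` looks only at the first row to decide whether to convert, and it maps every value other than `'Positive'` to 0. As a hedged alternative, an explicit mapping makes unexpected labels visible as `NaN` instead of silently turning them into 0. The label values below are invented for illustration.

```python
import pandas as pd

# Hypothetical model output: string labels for four statements.
labels = pd.Series(['Positive', 'Negative', 'Negative', 'Positive'])

# Explicit mapping; any label missing from the dict becomes NaN rather than 0.
label_to_num = labels.map({'Positive': 1, 'Negative': 0})
print(label_to_num.tolist())
```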

## 2.All Microfactors Generation

In [6]:
!pip install -U -q pyDrive

In [7]:
# Import packages for google drive, auth
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
gdrive = GoogleDrive(gauth)

### 2.1.Define a dictionary of all microfactors

In [11]:
MICROFACTORS ={
    'Sentiment': {
        'pickle_id': '1eZ0TycVjHAyaFh8eKDmyLiQ_DN8rOcbI',},
    'Sensationalism': {
        'pickle_id': '1XEYOqUEkI52tW7ZWtIGRq0Qe5dOd2I_S',},              
    'Clickbait': {
        'pickle_id': '1pgSrMJD0m_7Cd1fg1xoZEN2P_CjnpUkb',},
    'Confirmation': {
        'pickle_id': '1v4M0PuGBxhAuxSC0gghXG56IJ_lpj1_-',
        'callable': positive_to_num},
    'OpinionLeader': {
        'pickle_id': '1-FSuhaFUQ5RXQb8ehgxa3fLhct6RH6Xn',},
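    # NOTE: 'Sentiment' is already defined above; in a dict literal the last
    # duplicate wins, so this pickle_id replaces the earlier one.
    'Sentiment': {
        'pickle_id': '1lmskvpX4VA4srCbjeN3x82PLwckgKfyE',},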
    'Speech': {
        'pickle_id': '1wSbIsic1fw0tzyqpxsmpKV5XV-QkxYdz',},
    'Utterance': {
        'pickle_id': '1GKrJgg-UfLobb5DuYe1JkCXUtL3sdg0T',}, 
    }
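
The literal above lists `'Sentiment'` twice, once for each factor that uses it. Python accepts this without complaint, which is why only seven microfactors are generated later on. The sketch below shows the behavior with placeholder IDs; only the structure mirrors `MICROFACTORS`.

```python
# Hypothetical IDs; only the shape of the dictionary matches the notebook.
registry = {
    'Sentiment': {'pickle_id': 'first-id'},
    'Speech': {'pickle_id': 'speech-id'},
    'Sentiment': {'pickle_id': 'second-id'},
}

# The key keeps its first position, but the value comes from the last entry.
print(list(registry))
print(registry['Sentiment']['pickle_id'])
```

If both sentiment models are meant to contribute a column, each would need its own distinct key.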


### 2.2.Load the pickled models for microfactors

In [23]:
import joblib, pickle

microfactor_list = MICROFACTORS.keys()

def generate_micrafactor_from_pickles(df):
  # load pickle function
  def load_pickle(file_id, microfactor):
      downloaded = gdrive.CreateFile({'id': file_id})
      downloaded.GetContentFile(microfactor + '.pkl')
      pickle_filepath = '/content/{}.pkl'.format(microfactor)
      try:
        with open(pickle_filepath, 'rb') as f:
          model = pickle.load(f)
      except Exception:
        # Not a plain pickle: fall back to joblib, which accepts a file path
        model = joblib.load(pickle_filepath)
      return model

  # Preprocessing
  df['Statement'] = df['Statement'].apply(get_text_processing)
  statement = df['Statement'].to_numpy()
  
  for microfactor in microfactor_list:
    print('generating microfactor :', microfactor)
    # load microfactor pickle
    microfactor_model = load_pickle(MICROFACTORS[microfactor]['pickle_id'],
                                    microfactor)
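    # create new microfactor column
    try:
      # NOTE: [0] keeps only the first statement's class probabilities, so the
      # assignment raises on a length mismatch and the fallback below runs.
      df[microfactor]  = microfactor_model.predict_proba(statement)[0]
      print('predict_proba run normally')
    except Exception:
      # Fall back to hard class labels
      df[microfactor]  = microfactor_model.predict(statement)
      if df[microfactor][0] in ['Positive', 'Negative'] :
        df[microfactor] = df[microfactor].apply(lambda x: 1 if x == 'Positive' else 0)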
    
    if 'callable' in MICROFACTORS[microfactor]:
      MICROFACTORS[microfactor]['callable'](df, microfactor)
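
For scikit-learn classifiers, `predict_proba` returns one row per sample and one column per class. The sketch below shows the difference between indexing with `[0]` (first sample only) and `[:, 1]` (one probability per sample for the second class). The toy texts, labels, and the choice of `CountVectorizer` with `LogisticRegression` are assumptions for illustration and are not the pickled microfactor models.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only.
texts = ['shocking secret revealed', 'senate passes budget bill',
         'you will not believe this', 'governor signs education law']
labels = [1, 0, 1, 0]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

proba = model.predict_proba(texts)  # shape: (n_samples, n_classes)
first_row = proba[0]                # class probabilities of the first text only
positive_col = proba[:, 1]          # one probability per text for class 1

print(proba.shape, first_row.shape, positive_col.shape)
```

Only `positive_col` has the same length as the input, so it is the form that can be assigned to a DataFrame column row by row.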


**Use the PolitiFact dataset to generate microfactors and train on Verdict**

In [24]:
df = df_politifact.copy()
generate_micrafactor_from_pickles(df)

generating microfactor : Sentiment
generating microfactor : Sensationalism
generating microfactor : Clickbait
generating microfactor : Confirmation
generating microfactor : OpinionLeader
generating microfactor : Speech
generating microfactor : Utterance


In [25]:
df.head()

Unnamed: 0,Date,Issue,Reporter,Author,Statement,Description,Verdict,Url,Sentiment,Sensationalism,Clickbait,Confirmation,OpinionLeader,Speech,Utterance
0,"May 7, 2021",Abortion,Bill McCarthy,Facebook posts,Says Chelsea Clinton tweeted Jesus alive today...,"stated on May 7, 2021 in a Facebook post:",pants-fire,/personalities/facebook-posts/,0,0,1,0,0,0,1
1,"March 31, 2021",Abortion,Tom Kertscher,Facebook posts,“Joe Biden puts prolife groups domestic extrem...,"stated on March 29, 2021 in a Facebook post:",barely-true,/personalities/facebook-posts/,0,0,1,0,0,0,1
2,"February 12, 2021",Abortion,Brandon Mulder,Greg Abbott,“Innocent lives saved” ending taxpayer funding...,"stated on January 24, 2021 in a tweet:",false,/personalities/greg-abbott/,0,0,0,0,0,0,1
3,"November 18, 2020",Abortion,Noah Y. Kim,Facebook posts,“aborted male fetus” OxfordAstraZeneca “Covid ...,"stated on November 15, 2020 in a Facebook post:",false,/personalities/facebook-posts/,0,0,0,0,0,0,1
4,"October 14, 2020",Abortion,Tom Kertscher,Tommy Tuberville,Says Doug Jones voted spend tax dollars latete...,"stated on October 8, 2020 in an ad:",false,/personalities/tommy-tuberville/,0,0,0,0,0,0,1


## 3.True-o-meter Pipeline

### 3.1. Ensemble a multiclass classification pipeline

**Use the ensembled model and the microfactors to predict Verdict; the trained model is pickled inside data_to_pipeline**

In [26]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import joblib

pipeline = Pipeline([
    ('standardscaler', StandardScaler()),
    ('svm', SVC(C=1, probability=True))
])

def data_to_pipeline(df, _source=None, _target=None):
  # Split data for training and validation
  X, y = df[_source].values, df[_target].values
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42) 
  
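  # Training and prediction
  pipeline.fit(X_train, y_train)
  # NOTE: cross_val_score clones the pipeline and refits it on folds of the
  # held-out split, so these scores do not come from the model fitted above.
  scores = cross_val_score(pipeline, X_test, y_test, cv=3)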
  print('now using %s to predict %s' % (_source, _target))
  print('cross validation scores:', scores)
  print('prediction score:', pipeline.score(X_test, y_test))

  # Save the pickle file
  pickle_filepath = '/content/' + _target + '_pipeline.pkl'
  joblib.dump(pipeline, pickle_filepath)
  print('pickle file is created:', pickle_filepath)
  print('\n')

data_to_pipeline(df, _source = microfactor_list, _target = 'Verdict')

now using dict_keys(['Sentiment', 'Sensationalism', 'Clickbait', 'Confirmation', 'OpinionLeader', 'Speech', 'Utterance']) to predict Verdict
cross validation scores: [0.25490196 0.16       0.24      ]
prediction score: 0.2251655629139073
pickle file is created: /content/Verdict_pipeline.pkl
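
A common alternative to the evaluation above is to cross-validate on the training split and keep the test split for a single final score. The sketch below shows that arrangement with the same scaler and SVM steps. The data comes from `make_classification` as a synthetic stand-in for the microfactor matrix, and the sample count, the class count, and the `stratify` option are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the microfactor features (7 columns, 3 classes).
X, y = make_classification(n_samples=300, n_features=7, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

pipe = Pipeline([('standardscaler', StandardScaler()),
                 ('svm', SVC(C=1, probability=True))])

# Cross-validate on the training split only; each fold refits a clone.
cv_scores = cross_val_score(pipe, X_train, y_train, cv=3)

# Fit once on all training data, then score once on the untouched test split.
pipe.fit(X_train, y_train)
test_score = pipe.score(X_test, y_test)

print('cv scores:', cv_scores)
print('test score:', test_score)
```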






### 3.2.Automated Inference Pipeline


**Load pickled true-o-meter pipeline**

In [27]:
pickle_filepath = '/content/Verdict_pipeline.pkl'
o_meter_model = joblib.load(open(pickle_filepath, 'rb'))

**Read a new page of data from PolitiFact**

In [28]:
records = scrape_data_from_politifact(start=12, end=13)
df = pd.DataFrame(records, columns=['Date', 'Issue','Reporter','Author', 'Statement', 'Description', 'Verdict', 'Url'])  

**Predict and compare the Verdict**

In [29]:
generate_micrafactor_from_pickles(df)
df['pred_verdict'] = o_meter_model.predict(df[microfactor_list])
df_interest = df[['Statement', 'Verdict', 'pred_verdict']]
df_interest.head()

generating microfactor : Sentiment
generating microfactor : Sensationalism
generating microfactor : Clickbait
generating microfactor : Confirmation
generating microfactor : OpinionLeader
generating microfactor : Speech
generating microfactor : Utterance


Unnamed: 0,Statement,Verdict,pred_verdict
0,Central Health hospital district Texas spends ...,true,False
1,US Government Accountability Office report say...,pants-fire,False
2,Studies shown absence federal reproductive hea...,half-true,False
3,Says bill HB 97 would prevent use taxpayer dol...,false,False
4,child born prematurely according president wor...,pants-fire,False


In [30]:
df['verdict_norm'] = df['Verdict'].apply(lambda x: 'True' if x in ['true', 'mostly-true'] else 'False')
df['pred_verdict_norm'] = df['pred_verdict'].apply(lambda x: 'True' if x in ['true', 'mostly-true'] else 'False')
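
Both `apply` calls above use the same rule: `true` and `mostly-true` count as `'True'`, everything else as `'False'`. A small sketch of that rule as one reusable helper follows; the names `TRUE_LABELS` and `normalize_verdict` are hypothetical.

```python
# Verdict labels treated as true, taken from the lambda in the cell above.
TRUE_LABELS = {'true', 'mostly-true'}


def normalize_verdict(verdict):
    """Collapse a PolitiFact verdict label into 'True' or 'False'."""
    return 'True' if verdict in TRUE_LABELS else 'False'


for label in ['true', 'mostly-true', 'half-true', 'barely-true', 'false', 'pants-fire']:
    print(label, '->', normalize_verdict(label))
```

Under this rule `half-true` lands on the `'False'` side, which is a modelling choice worth keeping in mind when reading the agreement counts.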

In [31]:
# Compare the normalized labels row by row and count matches and mismatches
matches = df['verdict_norm'] == df['pred_verdict_norm']
num_of_equals = int(matches.sum())
num_of_diffs = int((~matches).sum())

print('num_of_equals: %s, num_of_diffs: %s' % (num_of_equals, num_of_diffs))

num_of_equals: 17, num_of_diffs: 13
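
The two counts above can be turned into an agreement rate on the binarized labels. The arithmetic is sketched below using the numbers printed by the cell.

```python
# Counts copied from the output above.
num_of_equals, num_of_diffs = 17, 13

# Share of statements where the binarized prediction matches the binarized verdict.
agreement = num_of_equals / (num_of_equals + num_of_diffs)
print(f"agreement: {agreement:.3f}")  # agreement: 0.567
```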


In [32]:
from google.colab import drive
drive.mount('/content/drive')
! cp /content/*.pkl /content/drive/MyDrive/pickles_nlp/

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [33]:
! ls /content/*.pkl 

/content/Clickbait.pkl		  /content/Sensationalism.pkl
/content/Confirmation.pkl	  /content/Sentiment.pkl
/content/Intent.pkl		  /content/Speech.pkl
/content/OpinionLeader.pkl	  /content/Utterance.pkl
/content/PsychologyUtilities.pkl  /content/Verdict_pipeline.pkl
