# CUNEF MUCD 2021/2022
## News Classification
Author:  
- Antonio Tello Gómez

# 5. Deployment Simulation

In this notebook, we simulate the deployment of the model. In this hypothetical scenario, we receive a CSV file with 30 news articles and return another CSV file with the model's predictions. Ideally this would happen automatically; here we do it manually to test whether the model would work in such an environment. To this end, we generate a new dataset of fake and real news, use the pipeline with the model to preprocess and predict the new data, save the predictions, and evaluate the model's performance.
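
The automated version of this step could look roughly like the sketch below. It is a minimal sketch, assuming the pickled pipeline accepts raw text and exposes `predict` and `predict_proba`, and that the incoming CSV has a `full_text` column. The function name `predict_csv` and its parameters are hypothetical and not part of the project code.

```python
import pandas as pd


def predict_csv(model, input_path, output_path, text_col="full_text"):
    """Read a CSV of news, score it with `model`, and write the predictions.

    `model` is assumed to be a fitted scikit-learn style pipeline that
    accepts raw text and implements predict / predict_proba.
    """
    news = pd.read_csv(input_path)
    # Replace missing texts with empty strings so the pipeline gets strings only.
    texts = news[text_col].fillna("")
    out = pd.DataFrame({
        text_col: texts,
        "label": model.predict(texts),
        # Probability of the positive class (1 = fake).
        "probability": model.predict_proba(texts)[:, 1],
    })
    out.to_csv(output_path, index=False)
    return out
```

A scheduler or a small web service could then call `predict_csv` for every incoming file; the manual steps in the rest of this notebook follow the same read, predict, and write sequence.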

In [1]:
#Basics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle 

#Functionalities
import requests
import os
import sys
import openai
openai.api_key = os.getenv("OPENAI_API_KEY")


#NLP
import string
import re
import nltk

#Model
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.pipeline import Pipeline 
from sklearn.model_selection import GridSearchCV
sys.path.append(os.path.abspath('..'))
sys.path.append(os.path.abspath('../src'))
from src.Preprocessor import TextPreprocessor
from lightgbm import LGBMClassifier

# Metrics
from sklearn.metrics import (roc_curve, roc_auc_score, f1_score , confusion_matrix, recall_score, 
                             precision_score, classification_report, precision_recall_curve, accuracy_score)


# Creation of a New Dataset

## Generating Fake News

https://python.plainenglish.io/create-ai-content-generator-with-python-flask-and-openai-gpt-3-407a19f096b

I am going to generate fake news using the OpenAI API. There are two options:
 1. (Supervised) Generate fake news manually from implausible stories related to topics in the original data.
 2. (Unsupervised) Generate fake news programmatically using the topics extracted from topic modeling or the most frequent entities found by NER.  
 
With option 2 you can generate a virtually unlimited amount of fake news. However, the results can be very realistic and plausible.  
Therefore, I will use option 1, in which I make sure the stories are very implausible, i.e. fake.
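
For comparison, option 2 could be sketched as follows. The `entities` and `events` lists are hypothetical placeholders standing in for the output of topic modeling or NER, and `build_prompts` is an illustrative helper; nothing here calls the OpenAI API.

```python
from itertools import product

intro = "Write a fake news article about"

# Hypothetical inputs: in practice these would come from NER or topic modeling.
entities = ["Obama", "Hillary Clinton", "Ted Cruz"]
events = ["resigning", "moving to Canada"]


def build_prompts(entities, events, intro=intro):
    """Combine every entity with every event into one prompt per pair."""
    return [f"{intro} {entity} {event}" for entity, event in product(entities, events)]


prompts = build_prompts(entities, events)
print(len(prompts))  # 6 prompts from 3 entities x 2 events
print(prompts[0])    # Write a fake news article about Obama resigning
```

Because the combinations are produced blindly, some of them may describe events that could actually happen, which is the plausibility problem mentioned above.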

In [158]:
intro = 'Write a fake news article about'
fn1 = f"{intro} Obama becoming a dictator in 2016"
fn2 = f"{intro} Trump becoming a dictator in 2016"
fn3 = f"{intro} Hillary Clinton bombing the White House"
fn4 = f"{intro} Iran invading the US"
fn5 = f"{intro} FBI closing down Fox News"
fn6 = f"{intro} Ted Cruz becoming Muslim"
fn7 = f"{intro} Donald Trump leading a Mexican Cartel"
fn8 = f"{intro} Bernie Sanders working for Putin"
fn9 = f"{intro} Russia hacking the US 2016 elections"
fn10 = f"{intro} Russia fighting China in 2016"
fn11 = f"{intro} Soros buying CNN"
fn12 = f"{intro} Hillary Clinton and Donald Trump having an affair"
fn13 = f"{intro} ISIS leader calls for American Muslim voters to support Hillary Clinton"
fn14 = f"{intro} Hillary Clinton and Pizzagate"
fn15 = f"{intro} Donald Trump and Obama having an affair"
fn16 = f"{intro} Trump Offering Free One-Way Tickets to Mexico for Those Who Wanna Leave America"
fn17 = f"{intro} President Obama Confirms He Will Refuse To Leave Office If Trump Is Elected"
fn18 = f"{intro} Trump declaring war on North Korea"
fn19 = f"{intro} Soros becoming president"
fn20 = f"{intro} Obama occupying the White House"
fn21 = f"{intro} Mexico president resigning"
fn22 = f"{intro} Mexico and Canada leaving NAFTA"
fn23 = f"{intro} Obama becoming president of Canada"
fn24 = f"{intro} WikiLeaks and Fox News"
fn25 = f"{intro} Putin resigning"
fn26 = f"{intro} Mexico building their own wall"
fn27 = f"{intro} Obama going to jail for corruption"
fn28 = f"{intro} Republican Party starting civil war"
fn29 = f"{intro} Donald Trump switching to Democratic Party"
fn30 = f"{intro} Hillary Clinton switching to Republican Party"

In [159]:
fake_news = [fn1, fn2, fn3, fn4, fn5, fn6, fn7, fn8, fn9, fn10, fn11, fn12, fn13, fn14, fn15, fn16, fn17, fn18, fn19, fn20, fn21, fn22, fn23, fn24, fn25, fn26, fn27, fn28, fn29, fn30]

In [160]:
rows = []
for fn in fake_news:
    # One completion per prompt; keep only the generated text.
    response = openai.Completion.create(
        engine="davinci-instruct-beta-v3",
        prompt=fn,
        temperature=0.7,
        max_tokens=2000,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    rows.append(response['choices'][0]['text'])

In [162]:
fake_news = pd.DataFrame(rows, columns=["full_text"])
fake_news["label"] = 1

In [166]:
fake_news.to_csv("../data/new_data/fake_news.csv", index=False)

## Getting Real News

We are going to use the NewsAPI API to get real news. Unfortunately, the free version does not allow retrieving historical news or the full text of an article. Hence, the resulting dataset will contain many tokens that do not appear in the training data.
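
One way to quantify that mismatch is the share of tokens in the new texts that never appeared in training. The sketch below is only an illustration under stated assumptions: `oov_rate` is a hypothetical helper, the regex tokenizer merely approximates the project's `TextPreprocessor`, and `train_vocab` is a made-up stand-in for the vocabulary learned during training.

```python
import re

# Simplified tokenizer (assumption): lowercase words and apostrophes only.
TOKEN_RE = re.compile(r"[a-z']+")


def oov_rate(texts, train_vocab):
    """Fraction of tokens in `texts` that are absent from `train_vocab`."""
    tokens = [tok for text in texts for tok in TOKEN_RE.findall(text.lower())]
    if not tokens:
        return 0.0
    unseen = sum(tok not in train_vocab for tok in tokens)
    return unseen / len(tokens)


# Toy example with a made-up vocabulary.
train_vocab = {"trump", "clinton", "election", "president"}
new_texts = ["Trump endorses Vance", "President wins election"]
print(oov_rate(new_texts, train_vocab))  # 3 of 6 tokens are unseen -> 0.5
```

With a fitted `TfidfVectorizer`, the keys of its `vocabulary_` attribute could serve as `train_vocab`.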

In [13]:
def get_file_contents(filename):
    """ Given a filename,
        return the contents of that file
    """
    try:
        with open(filename, 'r') as f:
            return f.read().strip()
    except FileNotFoundError:
        print("'%s' file not found" % filename)

In [6]:
key = get_file_contents('../.apikey')

In [144]:
url1 = (f'https://newsapi.org/v2/everything?q=2016+elections&sortBy=popularity&apiKey={key}')
url2 = (f'https://newsapi.org/v2/everything?q=Clintons&sortBy=popularity&apiKey={key}')
url3 = (f'https://newsapi.org/v2/everything?q=Obama&sortBy=popularity&apiKey={key}')
url4 = (f'https://newsapi.org/v2/everything?q=Ted+Cruz&sortBy=popularity&apiKey={key}')
url5 = (f'https://newsapi.org/v2/everything?q=november+2016&sortBy=popularity&apiKey={key}')
url6 = (f'https://newsapi.org/v2/everything?q=white+house&sortBy=popularity&apiKey={key}')
url7 = (f'https://newsapi.org/v2/everything?q=president&sortBy=popularity&apiKey={key}')
url8 = (f'https://newsapi.org/v2/everything?q=bernie6&sortBy=popularity&apiKey={key}')
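
The same URLs could also be built without repeating the template. This is a sketch using `urllib.parse.urlencode` from the standard library; the helper name `build_url` and the placeholder key are hypothetical, and the query parameters are the ones already used above.

```python
from urllib.parse import urlencode

BASE_URL = "https://newsapi.org/v2/everything"


def build_url(query, api_key, sort_by="popularity"):
    """Return the NewsAPI 'everything' URL for a search query."""
    params = {"q": query, "sortBy": sort_by, "apiKey": api_key}
    # urlencode escapes spaces as '+', matching the hand-written URLs.
    return f"{BASE_URL}?{urlencode(params)}"


queries = ["2016 elections", "Clintons", "Obama", "Ted Cruz"]
example_urls = [build_url(q, api_key="YOUR_KEY") for q in queries]
print(example_urls[0])
# https://newsapi.org/v2/everything?q=2016+elections&sortBy=popularity&apiKey=YOUR_KEY
```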

In [145]:
urls = [url1, url2, url3, url4, url5, url6, url7, url8]
jsons = {}

In [146]:
for idx, url in enumerate(urls): 
    r = requests.get(url)
    jsons["json{}".format(idx)] = r.json()

In [147]:
rows = []
for json in jsons.values():
    for article in json['articles']:
        rows.append([article["title"], article["description"]])

In [148]:
real_news = pd.DataFrame(rows, columns=["title", "description"])
real_news["full_text"] = real_news["title"] +" "+ real_news["description"]
real_news["label"] = 0
real_news.drop(['title','description'],axis=1,inplace=True)

In [149]:
real_news.drop_duplicates(inplace=True)

In [155]:
# na=False treats rows with missing text as non-matching instead of raising on NaN.
real_news = real_news[real_news["full_text"].str.contains("Trump|Clinton|Obama|Cruz|Elections|Bernie|president", na=False)].reset_index(drop=True)

In [165]:
len(real_news)

30

In [154]:
real_news.to_csv("../data/new_data/real_news.csv", index=False)

## Merge

In [2]:
fake_news = pd.read_csv("../data/new_data/fake_news.csv")
real_news = pd.read_csv("../data/new_data/real_news.csv")

In [3]:
df = pd.concat([fake_news, real_news], ignore_index=True)
df = df.sample(frac=1, random_state=2022).reset_index(drop=True)

In [4]:
df.head()

                                           full_text  label
0  \n\nPresident Donald Trump has been elected to...      1
1  Analysis: Why you can't rely on Trump to help ...      0
2  Trump endorses Vance in Ohio's Republican Sena...      0
3  \n\nAccording to sources, Iran has invaded the...      1
4  \n\nRussia and China have been in a heated bat...      1


In [5]:
X = df['full_text']
y = df['label']

In [6]:
# Load the fitted pipeline; the context manager closes the file afterwards.
with open('../models/optim_lgbm.pkl', 'rb') as f:
    lgbm = pickle.load(f)

In [7]:
ypred = lgbm.predict(X)
ypred_proba= lgbm.predict_proba(X)

In [None]:
predictions = pd.DataFrame({"full_text": X, "label": ypred, "probability": ypred_proba[:, 1]})
# index=False keeps the output consistent with the other CSV files written above.
predictions.to_csv("../data/new_data/predictions.csv", index=False)

In [8]:
def evaluate_model(ytest, ypred, ypred_proba = None):
    if ypred_proba is not None:
        print('ROC-AUC score of the model: {}'.format(roc_auc_score(ytest, ypred_proba[:, 1])))
    print('Accuracy of the model: {}\n'.format(accuracy_score(ytest, ypred)))
    print('Classification report: \n{}\n'.format(classification_report(ytest, ypred)))
    print('Confusion matrix: \n{}\n'.format(confusion_matrix(ytest, ypred)))

In [9]:
evaluate_model(y,ypred, ypred_proba)

ROC-AUC score of the model: 0.47444444444444445
Accuracy of the model: 0.4666666666666667

Classification report: 
              precision    recall  f1-score   support

           0       0.45      0.33      0.38        30
           1       0.47      0.60      0.53        30

    accuracy                           0.47        60
   macro avg       0.46      0.47      0.46        60
weighted avg       0.46      0.47      0.46        60


Confusion matrix: 
[[10 20]
 [12 18]]
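
The figures in the report can be recomputed by hand from the confusion matrix, which helps when reading it. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and uses only the counts printed above.

```python
# Confusion matrix from the output above: rows = true label, columns = predicted.
tn, fp = 10, 20  # real news (0): 10 kept as real, 20 flagged as fake
fn, tp = 12, 18  # fake news (1): 12 missed, 18 caught

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision_fake = tp / (tp + fp)
recall_fake = tp / (tp + fn)
precision_real = tn / (tn + fn)
recall_real = tn / (tn + fp)

print(f"accuracy       = {accuracy:.2f}")        # 0.47
print(f"precision (1)  = {precision_fake:.2f}")  # 0.47
print(f"recall (1)     = {recall_fake:.2f}")     # 0.60
print(f"precision (0)  = {precision_real:.2f}")  # 0.45
print(f"recall (0)     = {recall_real:.2f}")     # 0.33
```

Two thirds of the real news (20 of 30) are flagged as fake, which is why recall for class 0 is so low.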



## Conclusion

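Our model does not generalize beyond the dataset it was originally trained on: on the 60 new articles it reaches an accuracy of 0.47 and a ROC-AUC of 0.47, which is no better than random guessing. This might be because the training dataset is too small or because of how it was labeled. The new data also differs from the training data, since the real news consists only of titles and descriptions and the fake news was generated by a language model. Therefore, we would not deploy this model in a real-world scenario.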
