<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# GA Capstone: Fake News Classifier

Author: Tan Kai Yong Alvin

# Notebook 4: Deployment

In [1]:
# run 2 additional cells in ML OPS 0 materials

In [1]:
# Connect this Jupyter notebook to the running MLFlow server
import mlflow # import mlflow python package

# save all experiments and runs in mlflow.db
mlflow.set_tracking_uri("sqlite:///mlflow.db") # where MLflow logs all the runs on your local computer; here, a SQLite DB called mlflow.db
# link inside set_tracking_uri will be replaced if using dagshub with what the platform provides as mlflow runs on their website
# experiments will then be logged on dagshub cloud instead of on your local computer

# Set the name of the experiment we're running in this notebook
# MLFlow will connect to an existing experiment if the name passed already exists, 
# or create a new one if the experiment is not already present
mlflow.set_experiment("Fake-News-Classifier") # the first execution logs "experiment does not exist, creating new"; subsequent executions reuse the existing experiment and do not show this message

# refresh the MLflow webpage if necessary to view this experiment name. Future runs (say, after shutting down and reopening MLflow) are logged under the same experiment as long as the name isn't changed

# Start automatically logging all runs below to the created MLFlow experiment
mlflow.autolog()

## Libraries

In [4]:
# Imports
import pandas as pd
import numpy as np
import requests
import matplotlib.pyplot as plt
import seaborn as sns
import time
import re
import joblib

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline # to tag preprocessing transformers + estimators
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer # we'll introduce this later
from lightgbm import LGBMClassifier
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

## Import

In [3]:
classification_df = pd.read_csv('./datasets/classification_df.csv')

In [4]:
classification_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44689 entries, 0 to 44688
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   date                 44679 non-null  object
 1   original_title_text  44689 non-null  object
 2   classification_text  44680 non-null  object
 3   label                44689 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 1.4+ MB


In [5]:
classification_df.head()

|   | date | original_title_text | classification_text | label |
|---|------|---------------------|---------------------|-------|
| 0 | 2017-12-31 | As U.S. budget fight looms, Republicans flip t... | u budget fight loom republican flip fiscal scr... | 0 |
| 1 | 2017-12-29 | U.S. military to accept transgender recruits o... | u military accept transgender recruit monday p... | 0 |
| 2 | 2017-12-31 | Senior U.S. Republican senator: 'Let Mr. Muell... | senior u republican senator let mr mueller job... | 0 |
| 3 | 2017-12-30 | FBI Russia probe helped by Australian diplomat... | fbi russia probe helped australian diplomat ti... | 0 |
| 4 | 2017-12-29 | Trump wants Postal Service to charge 'much mor... | trump want postal service charge much amazon s... | 0 |


In [7]:
# define X and y
X = classification_df['original_title_text']
y = classification_df['label']

### Train/ Test Split

In [8]:
# Define training and testing sets. Train/test split.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25,
                                                    stratify=y,
                                                    random_state=42)
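
To illustrate what `stratify=y` buys us, here is a minimal sketch on a made-up label vector (the toy arrays `X_toy` and `y_toy` are assumptions for illustration, not the capstone data). It checks that the share of class 1 is the same in the training and testing sets after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 80% class 0 and 20% class 1 (not the capstone data)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

# With stratification, both splits keep the original class balance
train_share = y_tr.mean()
test_share = y_te.mean()
print(f"share of class 1 -> train: {train_share:.2f}, test: {test_share:.2f}")
```

Without `stratify`, a random split can leave one set with noticeably more fake (or real) articles than the other, which makes the test score harder to interpret.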

## Modeling

In [9]:
stopwords_list = stopwords.words("english")

# some words give away the source of the article, which may explicitly reveal whether the news is fake or real. Such words are added to the stopword list so that they can be omitted, making the model less biased towards the source
add_stopwords = ["21wire", "twitter", "reuters", '21WIRE', '21st', 'Century',  'Wire', 'somodevilla', 'getty', 'images', 'subscribe', 'member', 'realdonaldtrump']
stopwords_list.extend(add_stopwords)
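
A small sketch of how a custom stopword list interacts with `CountVectorizer`, assuming two made-up documents and a shortened custom list (both hypothetical). It uses scikit-learn's own `ENGLISH_STOP_WORDS` instead of the NLTK list so that it runs without downloading NLTK data; the two lists are not identical. Because `CountVectorizer` lowercases text before filtering, only lowercase entries can match, and scikit-learn may warn about the capitalised entry.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical documents and a shortened custom list, for illustration only
docs = [
    "WASHINGTON (Reuters) - Senate passes budget",
    "21WIRE says the budget is a hoax",
]
custom_stopwords = ["reuters", "21wire"]

# Built-in English list only: source words such as 'reuters' stay in the vocabulary
default_vec = CountVectorizer(stop_words="english", token_pattern=r"\w+").fit(docs)

# Built-in list plus the custom entries: source words are dropped
combined = list(ENGLISH_STOP_WORDS) + custom_stopwords
custom_vec = CountVectorizer(stop_words=combined, token_pattern=r"\w+").fit(docs)

# A capitalised entry never matches, because tokens are lowercased first
upper_vec = CountVectorizer(stop_words=["Reuters"], token_pattern=r"\w+").fit(docs)

default_vocab = set(default_vec.vocabulary_)
custom_vocab = set(custom_vec.vocabulary_)
upper_vocab = set(upper_vec.vocabulary_)
print("reuters" in default_vocab, "reuters" in custom_vocab, "reuters" in upper_vocab)
```

The list only takes effect if it is passed to the vectorizer through `stop_words`; the string `'english'` selects the built-in list alone.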

In [10]:
pipe_cvec_lr = Pipeline([
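    # These CountVectorizer settings are the change vs. our previous GridSearch done with a default CountVectorizer().
    # Note: stop_words='english' uses scikit-learn's built-in list; the extended stopwords_list defined above is not passed in here.
    ('cvec', CountVectorizer(ngram_range=(1, 3), min_df=0.001, max_features=3000, stop_words='english', token_pattern=r'\w+')),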
    ('lr', LogisticRegression(C=0.118, class_weight={}, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False))
])

params = {
    'cvec__max_features': [3000, 3500], 
}


In [11]:
# Instantiate GridSearchCV.
gs_cvec_lr = GridSearchCV(pipe_cvec_lr, # the object that we are optimizing
                  param_grid=params, 
                  cv=5) # 5-fold cross-validation.

In [13]:
with mlflow.start_run():
    gs_cvec_lr.fit(X_train, y_train) # fit model on train data

2022/11/10 15:14:21 INFO mlflow.sklearn.utils: Logging the 5 best runs, no runs will be omitted.
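
Once a grid search has been fitted, its results can be inspected directly. The sketch below is a minimal, self-contained version of the same pattern on a made-up corpus (the texts, labels and the small `max_features` values are assumptions chosen only so that it runs quickly); it is not the capstone model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 0 = real, 1 = fake
texts = [
    "senate passes budget bill", "court upholds election ruling",
    "central bank raises interest rates", "governor signs relief package",
    "aliens secretly control the senate", "miracle pill cures every disease",
    "celebrity clone replaces president", "moon landing was staged hoax",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("lr", LogisticRegression(max_iter=1000)),
])

# Same idea as above: search over the vectorizer's max_features
gs = GridSearchCV(pipe, param_grid={"cvec__max_features": [5, 10]}, cv=2)
gs.fit(texts, labels)

print(gs.best_params_)           # the winning parameter combination
print(round(gs.best_score_, 3))  # mean cross-validated accuracy of that combination
best_model = gs.best_estimator_  # pipeline refitted on all of the training data
```

In the notebook itself, `gs_cvec_lr.best_params_`, `gs_cvec_lr.best_score_` and `gs_cvec_lr.score(X_test, y_test)` give the equivalent information for the real data.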


In [14]:
from mlflow.artifacts import download_artifacts

# Download the desired model from MLFlow to local directory
# Get the URL by following instructions in above image (full path will be from 'model' folder instead for non-hyperparameter runs)
full_path = './mlruns/1/573457d99490409c854f3433f9793e32/artifacts/best_estimator' # paste copied full path from best_estimator
download_artifacts(full_path, dst_path='.') # download from source: full_path, destination path: where this solution code notebook is located (can reference with '.')

'C:\\Users\\alvintky89\\Documents\\GA\\my_materials\\12.01-mlops\\solution-code\\best_estimator'

In [None]:
# Path of the downloaded model: 12.01-mlops/solution-code/best_estimator/model.pkl

In [15]:
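import joblib
filename = './best_estimator/model.pkl'
# Note: this overwrites the model.pkl downloaded from MLflow with the full fitted GridSearchCV object
joblib.dump(gs_cvec_lr, filename)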

['./best_estimator/model.pkl']
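
As a sanity check on persistence, a saved model should give the same predictions after it is loaded back. The sketch below shows that round trip with a tiny stand-in pipeline and a temporary directory (both are assumptions for illustration; the notebook saves the fitted `GridSearchCV` object to `./best_estimator/model.pkl`).

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 0 = real, 1 = fake
texts = [
    "markets rally on strong earnings", "senate passes budget bill",
    "aliens secretly run the government", "miracle pill cures everything overnight",
]
labels = [0, 0, 1, 1]

toy_pipe = Pipeline([("cvec", CountVectorizer()), ("lr", LogisticRegression())])
toy_pipe.fit(texts, labels)

# Save to a temporary folder, then load the model back from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(toy_pipe, path)
    restored = joblib.load(path)

same_predictions = bool((restored.predict(texts) == toy_pipe.predict(texts)).all())
print(same_predictions)
```

Pickle-based files should only be loaded from trusted sources and with the same library versions used for saving, since loading can execute arbitrary code and may break across scikit-learn versions.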

### Testing

In [1]:
# load the model from disk
import joblib
model_classify = joblib.load("./best_estimator/model.pkl")

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


In [2]:
user_input = {"text":"Schumer calls on Trump to appoint official to oversee Puerto Rico relief WASHINGTON (Reuters) - Charles Schumer, the top Democrat in the U.S. Senate, called on President Donald Trump on Sunday to name a single official to oversee and coordinate relief efforts in hurricane-ravaged Puerto Rico. Schumer, along with Representatives Nydia Velázquez and Jose Serrano, said a “CEO of response and recovery” is needed to manage the complex and ongoing federal response in the territory, where millions of Americans remain without power and supplies. In a statement, Schumer said the current federal response to Hurricane Maria’s impact on the island had been “disorganized, slow-footed and mismanaged.” “This person will have the ability to bring all the federal agencies together, cut red tape on the public and private side, help turn the lights back on, get clean water flowing and help bring about recovery for millions of Americans who have gone too long in some of the worst conditions,” he said. The White House did not immediately respond to a request for comment. The Democrats contended that naming a lone individual to manage the government’s relief efforts was critical, particularly given that the Federal Emergency Management Agency is already stretched thin from dealing with other crises, such as the aftermath of Hurricane Harvey in Texas and the wildfires in California. The severity of the Puerto Rico crisis, where a million people do not have clean water and millions are without power nearly a month after Hurricane Maria made landfall, demand a single person to focus exclusively on relief and recovery, the Democrats said. Forty-nine people have died in Puerto Rico officially, with dozens more missing. The hurricane did extensive damage to the island’s power grid, destroying homes, roads and other vital infrastructure. Now, the bankrupt territory is struggling to provide basic services like running water, and pay its bills. “It’s tragically clear this Administration was caught flat footed when Maria hit Puerto Rico,” said Velázquez. “Appointing a CEO of Response and Recovery will, at last, put one person with authority in charge to manage the response and ensure we are finally getting the people of Puerto Rico the aid they need.” On Thursday, Trump said the federal response has been a “10” on a scale of one to 10 at a meeting with Puerto Rico Governor Ricardo Rossello. The governor has asked the White House and Congress for at least $4.6 billion in block grants and other types of funding. Senator Marco Rubio called on Congress to modify an $18.7 billion aid package for areas damaged by a recent swath of hurricanes to ensure that Puerto Rico can quickly access the funds. "}

In [5]:
model_classify.predict(pd.Series(user_input['text']))

array([0], dtype=int64)
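
The raw output `array([0])` can be turned into a readable verdict with a confidence score. The helper below is a sketch under stated assumptions: `describe_prediction` is a hypothetical name, the toy pipeline stands in for the loaded model, and label 1 is taken to mean fake news (as in the Flask route further down).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (0 = real, 1 = fake), only to make the sketch runnable
toy_texts = [
    "senate passes budget bill", "court upholds election ruling",
    "aliens secretly control the senate", "miracle pill cures every disease",
]
toy_labels = [0, 0, 1, 1]
toy_model = Pipeline([("cvec", CountVectorizer()), ("lr", LogisticRegression())])
toy_model.fit(toy_texts, toy_labels)


def describe_prediction(model, text):
    """Return (verdict, probability of that verdict) for a single article text."""
    probabilities = model.predict_proba([text])[0]
    best = probabilities.argmax()
    verdict = "FAKE" if model.classes_[best] == 1 else "REAL"
    return verdict, float(probabilities[best])


verdict, confidence = describe_prediction(toy_model, "senate passes budget bill")
print(verdict, round(confidence, 2))
```

The same helper should work with `model_classify`, since a fitted `GridSearchCV` wrapping a logistic regression pipeline also exposes `predict_proba` and `classes_`.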

## Model Deployment with Flask

In [34]:
%%writefile inference.py 
from flask import Flask, request # import flask class, request module (to accept user inputs)
import pandas as pd # to work with dataframe
import os # to get port number that we'll hard-code to 8080  for now - important for deployment on Google Cloud (port no. will not be hard-coded there and sends a variable called port!)
import mlflow.pyfunc # to load downloaded model for making predictions (same as mlops-1 'Making Predictions using the local downloaded model')
import joblib
from io import StringIO


# Step 2: Instantiate the Flask API with name 'ModelEndpoint' ('api' is an object of the Flask() class)
api = Flask('ModelEndpoint') # 'ModelEndpoint' can be called anything else as well

# Step 3: Load the model from best_estimator folder for subsequently making predictions
# model = mlflow.pyfunc.load_model(model_uri="./best_estimator") # same code as mlops-1 'Making Predictions using the local downloaded model')
# doing this as the above line does not work
model_classify = joblib.load("./best_estimator/model.pkl")

# Step 4: Create the routes (we can create multiple! similar to multiple functions in a Python script)
# Note: we'll need to name each route differently, similar to naming individual functions differently in a Python script

## route 1: Health check. Just return success if the API is running
@api.route('/') # this is a decorator (@api - using the Flask class' object instantiated above and creating a 'route' on this 'api' flask object. then we pass a name for this route as '/' just means home page)
def home(): # just a normal Python function called 'home' with a decorator addition above
    # return a simple dictionary; Flask converts it to a JSON response
    return {"message": "Hi there!", "success": True}, 200
# returns a dictionary with a message and a success flag, plus HTTP status code 200 (https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
# a user going to the home page of this API will see the message defined in the dictionary (we'll see this in action soon!)

## route 2: accept input data, convert from JSON to dataframe, run predictions on the df, convert predictions to a list & return as dictionary. flask will take care of conversion to JSON object
# 'POST' method is used when we want to receive some data from the user and POST it to the API. when we want to access the route '/predict' route, we'll always need to post some data to it, else it'll error
@api.route('/predict', methods = ['POST']) # naming this 2nd route as /predict. so, in https://en.wikipedia.org/wiki, this is equivalent of the '/wiki', while the home page is what comes before '/wiki'
def make_predictions(): # create a normal Python function for predictions
    # step 1: Get the JSON object input data sent over the API
    user_input = request.get_json(force=True) # use the .get_json method of the 'request' object (imported from Flask earlier)
    # force=True makes Flask parse the request body as JSON even if the Content-Type header is not application/json
    # the client sends json.dumps(...) through the json= argument of requests.post, so the body is double-encoded and 'user_input' arrives here as a JSON string
    import sys
    # print("***********************")
    # print(type(user_input), user_input, file = sys.stderr)
    # step 2: Convert user inputs (JSON string from step#1) to pandas dataframe
    df_schema = {"article":str} # ensure the 'article' column is read as a string, because Pandas otherwise infers the dtype of every column when converting from JSON
    user_input_df = pd.read_json(StringIO(user_input), lines=True, dtype=df_schema) # Convert JSONL to a dataframe, passing the dtype we expect the API to handle so model predictions work fine
    print("***********************")
    print(type(user_input_df), user_input_df, file = sys.stderr)
    # step 3: Run predictions using the loaded 'model' on user_input_df and convert predictions output from numpy array to list
    predictions = model_classify.predict(pd.Series(user_input_df["article"][0])).tolist()
    
    if predictions[0] == 1:
        return {'prediction': 'This is a FAKE news'}
    else:
        return {'prediction': 'This is a REAL news'}
    
    #return {'predictions': predictions} # return output of 'predict' route as a dictionary for Flask to convert to JSON object & send back to user at the '/predict' route. dictionary's key (can be any name) as 'predictions', values as list of model predictions
    

# Step 5: Main block that actually runs the API; it can be reused as-is for other API scripts
if __name__ == '__main__': # good practice to have this main block whenever creating a .py file that is run as a script
    api.run(host='0.0.0.0', # listen on all network interfaces; on this computer the API is reachable at http://localhost:8080
            debug=True, # debug=True reloads the running API automatically whenever inference.py changes; turn it off outside local development
            port=int(os.environ.get("PORT", 8080)) # use the PORT environment variable if it is set, else 8080
           ) 

Overwriting inference.py
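
Routes like these can be exercised without starting a server by using Flask's built-in test client. The sketch below is an alternative, simplified version of the two routes, not the `inference.py` written above: it assumes the client posts a plain JSON object (`json={"article": ...}` rather than a double-encoded string), and `StubModel` and `create_api` are hypothetical names standing in for the loaded pipeline and the app setup.

```python
from flask import Flask, request


class StubModel:
    """Hypothetical stand-in for the loaded pipeline: flags texts containing 'hoax'."""

    def predict(self, texts):
        return [1 if "hoax" in text.lower() else 0 for text in texts]


def create_api(model):
    api = Flask("ModelEndpointSketch")

    @api.route("/")
    def home():
        # Health check: confirms the API is up
        return {"message": "Hi there!", "success": True}, 200

    @api.route("/predict", methods=["POST"])
    def make_predictions():
        # silent=True returns None instead of raising on malformed JSON
        payload = request.get_json(force=True, silent=True)
        if not isinstance(payload, dict) or not isinstance(payload.get("article"), str):
            return {"error": "expected a JSON object with an 'article' string"}, 400
        label = model.predict([payload["article"]])[0]
        verdict = "This is FAKE news" if label == 1 else "This is REAL news"
        return {"prediction": verdict}, 200

    return api


# Exercise both routes in-process
client = create_api(StubModel()).test_client()
health = client.get("/")
ok = client.post("/predict", json={"article": "The moon landing was a hoax"})
bad = client.post("/predict", json=["not", "a", "dict"])
print(health.status_code, ok.get_json(), bad.status_code)
```

Validating the payload and returning a 400 status gives the caller a clear message instead of a server error when the `article` field is missing.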


In [35]:
# real news: Senior U.S. Republican senator: 'Let Mr. Mueller do his job' WASHINGTON (Reuters) - The special counsel investigation of links between Russia and President Trump’s 2016 election campaign should continue without interference in 2018, despite calls from some Trump administration allies and Republican lawmakers to shut it down, a prominent Republican senator said on Sunday. Lindsey Graham, who serves on the Senate armed forces and judiciary committees, said Department of Justice Special Counsel Robert Mueller needs to carry on with his Russia investigation without political interference. “This investigation will go forward. It will be an investigation conducted without political influence,” Graham said on CBS’s Face the Nation news program. “And we all need to let Mr. Mueller do his job. I think he’s the right guy at the right time.” The question of how Russia may have interfered in the election, and how Trump’s campaign may have had links with or co-ordinated any such effort, has loomed over the White House since Trump took office in January. It shows no sign of receding as Trump prepares for his second year in power, despite intensified rhetoric from some Trump allies in recent weeks accusing Mueller’s team of bias against the Republican president. Trump himself seemed to undercut his supporters in an interview last week with the New York Times in which he said he expected Mueller was “going to be fair.” Russia’s role in the election and the question of possible links to the Trump campaign are the focus of multiple inquiries in Washington. Three committees of the Senate and the House of Representatives are investigating, as well as Mueller, whose team in May took over an earlier probe launched by the U.S. Federal Bureau of Investigation (FBI). Several members of the Trump campaign and administration have been convicted or indicted in the investigation.
# Trump and his allies deny any collusion with Russia during the campaign, and the Kremlin has denied meddling in the election. Graham said he still wants an examination of the FBI’s use of a dossier on links between Trump and Russia that was compiled by a former British spy, Christopher Steele, which prompted Trump allies and some Republicans to question Mueller’s inquiry. On Saturday, the New York Times reported that it was not that dossier that triggered an early FBI probe, but a tip from former Trump campaign foreign policy adviser George Papadopoulos to an Australian diplomat that Russia had damaging information about former Trump rival Hillary Clinton. “I want somebody to look at the way the Department of Justice used this dossier. It bothers me greatly the way they used it, and I want somebody to look at it,” Graham said. But he said the Russia investigation must continue. “As a matter of fact, it would hurt us if we ignored it,” he said.
empty = {"article": "Trump is dead"}

In [36]:
import requests, json

api_url = 'http://localhost:8080' # specify the URL to access
api_route = '/predict' # specify the `route` to access in the URL

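# we'll need to use `requests.post()` because the `/predict` route was set up earlier to accept only `POST` requests
# json.dumps(...) double-encodes the payload on purpose: inference.py expects a JSON string that it then parses with pd.read_json
response = requests.post(f'{api_url}{api_route}', json=json.dumps(empty))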
predictions = response.json()

print(predictions)

{'Prediction': 'This is a FAKE news'}
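
Passing `json=json.dumps(...)` to `requests.post` encodes the payload twice, because `requests` serialises whatever it receives in `json=` once more. The standard-library sketch below reproduces both encodings side by side (the variable names are illustrative) to show why the server ends up with a string rather than a dictionary.

```python
import json

payload = {"article": "Trump is dead"}

# Body sent for json=json.dumps(payload): a JSON string that itself contains JSON
double_encoded_body = json.dumps(json.dumps(payload))
# Body that would be sent for json=payload: a JSON object
single_encoded_body = json.dumps(payload)

first_decode = json.loads(double_encoded_body)  # what request.get_json() yields on the server
second_decode = json.loads(first_decode)        # the extra parse (done via pd.read_json in inference.py)

print(type(first_decode).__name__, type(second_decode).__name__)  # str dict
print(type(json.loads(single_encoded_body)).__name__)             # dict
```

Sending the dictionary directly would remove the need for the second parse, but the route would then have to read `user_input["article"]` instead of going through `pd.read_json`.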


## Create Streamlit

In [42]:
%%writefile streamlit_app.py
import streamlit as st
import requests
import json
from PIL import Image
import base64

def add_bg_from_local(image_file):
    with open(image_file, "rb") as image_file:
        encoded_string = base64.b64encode(image_file.read())
    st.markdown(
    f"""
    <style>
    .stApp {{
        background-image: url(data:image/{"png"};base64,{encoded_string.decode()});
        background-size: cover
    }}
    </style>
    """,
    unsafe_allow_html=True
    )
add_bg_from_local('background4.jpeg') 

# Title of the webpage
st.title("Fake News Classifier")

st.subheader("""Use this application to differentiate REAL and FAKE news.""")

with st.form(key='myform', clear_on_submit = True):
    article = st.text_input('Enter the text of the news')
    submit = st.form_submit_button("Predict")

user_input = {'article': article}

st.write(user_input)

Overwriting streamlit_app.py


In [43]:
%%writefile -a streamlit_app.py
if submit:
    with st.spinner('Reading the news...'):
        api_url = 'http://localhost:8080' # specify the URL to access
        api_route = '/predict' # specify the `route` to access in the URL
        
        # we'll need to use `requests.post()` because the `/predict` route only accepts `POST` requests; json.dumps(...) matches the JSON string that inference.py expects
        response = requests.post(f'{api_url}{api_route}', json=json.dumps(user_input))
        predictions = response.json()
        
        st.success('Completed')
        st.header('Verdict:')
        st.write(predictions['prediction'])
        #st.write(f"Prediction: {predictions['predictions'][0]}")

Appending to streamlit_app.py
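
If the Flask API is not running, the `requests.post` call in the app raises an exception and Streamlit shows a traceback. One possible way to handle this is sketched below; `fetch_verdict` is a hypothetical helper, the 10-second timeout is an arbitrary choice, and the `post` parameter exists only so that the example can be run with a stub instead of a live server.

```python
import json

import requests


def fetch_verdict(article, api_url="http://localhost:8080/predict", post=requests.post):
    """Return (ok, message): the API's verdict on success, or an error description."""
    try:
        # json.dumps mirrors the double-encoded payload that inference.py expects
        response = post(api_url, json=json.dumps({"article": article}), timeout=10)
        response.raise_for_status()
        return True, response.json()["prediction"]
    except (requests.RequestException, KeyError, ValueError) as exc:
        return False, f"Could not get a prediction: {exc}"


class _StubResponse:
    """Hypothetical response object used in place of a live API."""

    def raise_for_status(self):
        return None

    def json(self):
        return {"prediction": "This is a REAL news"}


def _stub_post(url, **kwargs):
    return _StubResponse()


ok, message = fetch_verdict("Some article text", post=_stub_post)
print(ok, message)
```

In the app, the result could be shown with `st.write(message)` when `ok` is true and `st.error(message)` otherwise.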


Instructions to run this on the local host:

- Activate the conda environment: `conda activate dsi-sg-capstone1` or `mamba activate dsi-sg`, depending on your installation. This switches your terminal window from the `base` environment to the activated one.
- Start the API by running the file as a normal Python file in your terminal window: `python inference.py`

Next, in a second terminal window (the API keeps running in the first one):

- Activate the same environment: `conda activate dsi-sg-capstone1`
- Run: `streamlit run streamlit_app.py`

Local host link: http://localhost:8501/