train.csv: A full training dataset with the following attributes:

* id: unique id for a news article
* title: the title of a news article
* author: author of the news article
* text: the text of the article; could be incomplete
* label: a label that marks the article as potentially unreliable
  * 1: unreliable
  * 0: reliable

source: https://www.kaggle.com/c/fake-news/overview

In [None]:
# import dependencies
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns 
from wordcloud import WordCloud
import re

# 1. Data Exploration

In [None]:
df = pd.read_csv('dataset/train.csv')
df.head()

In [None]:
df.shape

In [None]:
df.isnull().sum()

In [None]:
df.info()

In [None]:
# adding a new column that combines all the fields: title, author, and text
df['all'] = df['title'] + ' ' + df['author'] + ' ' + df['text']
df.head()

In [None]:
df.isnull().sum()

In [None]:
df.info()

In [None]:
# dropping rows where the combined 'all' column is NaN
# (string concatenation yields NaN if title, author, or text is missing)
df_drop = df.dropna(subset=['all']).reset_index(drop=True)
df_drop.info()

In [None]:
df_drop.head(10)

In [None]:
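### using wordcloud to visualize common words for both reliable and unreliable news ###
# build the masks from df_drop itself: its index was reset, so a mask built
# from df would not align with its rows
reliable = df_drop[df_drop['label'] == 0]
unreliable = df_drop[df_drop['label'] == 1]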

In [None]:
# converting to list
rel_words = reliable['all'].astype(str).tolist()
unrel_words = unreliable['all'].astype(str).tolist()

In [None]:
# joining into one string
rel_words_onestring = " ".join(rel_words)
unrel_words_onestring = " ".join(unrel_words)

In [None]:
# plotting reliable news
plt.figure(figsize=(20,20));
plt.imshow(WordCloud().generate(rel_words_onestring));
plt.show();

In [None]:
# plotting unreliable news
plt.figure(figsize=(20,20));
plt.imshow(WordCloud().generate(unrel_words_onestring));
plt.show();

In [None]:
# reliable vs unreliable split
print( 'Unreliable percentage =', round((len(unreliable) / len(df_drop) )*100, 2),"%")
print( 'Reliable percentage =', round((len(reliable) / len(df_drop) )*100, 2),"%")

In [None]:
# visualizing reliable vs unreliable (using df_drop so the plot matches the percentages above)
sns.countplot(x='label', data=df_drop);

# 2. Preprocessing

In [None]:
# Make a new copy of the dataframe
df_clean = df_drop.copy()

# Convert all characters to lowercase - this may not be necessary if we let 
# CountVectorizer do it for us, but it doesn't take long enough to worry about.
df_clean['all'] = df_clean['all'].str.lower()

# removing possessive/contracted ’s (curly apostrophe)
df_clean['all'] = df_clean['all'].replace("’s","", regex=True)

# replacing '\n' with blank space
df_clean['all'] = df_clean['all'].replace('\n',' ', regex=True)

# removing special characters (regex); raw string avoids an invalid-escape warning for \s
df_clean['all'] = df_clean['all'].replace(r'[^A-Za-z0-9\s]+', '', regex=True)

# removing leading and trailing spaces
df_clean['all'] = df_clean['all'].str.strip()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

# Create the vectorizer by letting CountVectorizer handle tokenization and
# stop-words removal. Note that this will not update the original dataframe,
# but will instead create X. 
vectorizer = CountVectorizer(ngram_range=(1,2), stop_words=stopwords.words('english')).fit(df_clean['all'])

X = vectorizer.transform(df_clean['all'])
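
To make the effect of `ngram_range=(1,2)` and stop-word removal concrete, here is a minimal pure-Python sketch of the counting idea. It is not scikit-learn's implementation: the helper name `ngram_counts` is hypothetical, and it splits on whitespace, whereas `CountVectorizer` uses a regex token pattern that also drops single-character tokens.

```python
from collections import Counter

def ngram_counts(text, stop_words=frozenset(), ngram_range=(1, 2)):
    """Count word n-grams in `text` after dropping stop words (illustrative only)."""
    # Stop words are removed first, so bigrams are formed from the remaining tokens.
    tokens = [t for t in text.lower().split() if t not in stop_words]
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Toy sentence and stop-word set, chosen only for illustration
example = ngram_counts("the senate passed the bill", stop_words={"the"})
print(example)
```

Each distinct unigram and bigram becomes one column of `X`, which is why the feature matrix is much wider than the vocabulary of single words.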

# 3. Training the Model

In [None]:
y = df_drop['label']

In [None]:
display(X.shape, y.shape)

In [None]:
# split the data set into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
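
`train_test_split` shuffles the rows before splitting, so without a fixed `random_state` the score reported below can change from run to run. The following sketch shows the idea of a seeded shuffle-and-split using only the standard library; `split_indices` is a hypothetical helper, not the scikit-learn implementation, and the seed value is an arbitrary assumption.

```python
import math
import random

def split_indices(n_samples, test_size=0.3, seed=0):
    """Return (train_indices, test_indices) from a seeded shuffle (illustrative only)."""
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

Calling the helper twice with the same seed returns the same partition, which is the property `random_state` provides in scikit-learn.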

In [None]:
# applying Naive Bayes classifier to the training data
from sklearn.naive_bayes import MultinomialNB

NB_classifier = MultinomialNB()
model = NB_classifier.fit(X_train, y_train)

In [None]:
# predicting on testing data and getting the model score
predicted = model.predict(X_test)

print(np.mean(predicted == y_test))

# 4. Classification Report & Confusion Matrix

In [None]:
# dependencies
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
# confusion matrix for the Training set
y_predict_train = NB_classifier.predict(X_train)
cm = confusion_matrix(y_train, y_predict_train)
sns.heatmap(cm, annot=True);

In [None]:
# confusion matrix when predicting the Test set results
y_predict_test = NB_classifier.predict(X_test)
cm = confusion_matrix(y_test, y_predict_test)
sns.heatmap(cm, annot=True);

In [None]:
# checking the classification report
print(classification_report(y_test, y_predict_test))
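
The figures in the classification report can be derived directly from the confusion matrix. The sketch below assumes the binary layout `[[tn, fp], [fn, tp]]` that `confusion_matrix` produces for labels 0 and 1; the function name `binary_report` and the counts are hypothetical and are not results from this notebook.

```python
def binary_report(cm):
    """Compute metrics for the positive class (label 1) from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)   # of the articles flagged unreliable, how many were
    recall = tp / (tp + fn)      # of the truly unreliable articles, how many were flagged
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts, for illustration only
report = binary_report([[50, 10], [5, 35]])
print(report)
```

The sketch does not guard against division by zero, which would occur if the model never predicted label 1 or if label 1 were absent from the test set.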

# 5. Saving the Model

In [None]:
import joblib

In [None]:
# saving the model in the current working directory
joblib_file = "News_ish.pkl"
joblib.dump(model, joblib_file)

In [None]:
# loading model from file
joblib_file = "News_ish.pkl"
loaded_model = joblib.load(joblib_file)

In [None]:
# saving the vectorizer in the current working directory
joblib_vector_file = "vectorizer.pkl"
joblib.dump(vectorizer, joblib_vector_file)

In [None]:
# loading vectorizer from file
joblib_vector_file = "vectorizer.pkl"
loaded_vectorizer = joblib.load(joblib_vector_file)

In [None]:
# model score
score = loaded_model.score(X_test, y_test)
print("Test score: {0:.2f} %".format(100 * score))
# y_predict = loaded_model.predict(X_test)

In [None]:
# testing with a new input from StarTribune news website.
input_message = str(input())

In [None]:
'''Minnesota is reporting 45 new COVID-19 deaths and more than 9,000 coronavirus cases in an unusual release Saturday that covers two days worth of data.
The latest figures cap a week when the number of COVID-19 deaths reported by the state each day fluctuated greatly.
The Minnesota Department of Health reported 72 deaths for the 24-hour period ending at 4 p.m. on Tuesday, and a record 101 deaths reported for the 24-hour period ending at 4 p.m. Wednesday. For the 48-hour period ending Friday afternoon, the state reported fewer than 50 deaths.
Funeral home directors and medical examiners need to file reports within five days of death, according to the Health Department. It's possible they pushed to file reports before Thanksgiving, so they wouldn't have to do so on the holiday weekend, said Kris Ehresmann, the state's director for infectious diseases.
It's harder to say why the two-day totals released Saturday for new cases and completed tests were low, Ehresmann said, but the holiday could have influenced decisions about whether people sought testing. Throughout the pandemic, COVID numbers released on Mondays have tended to be lower due to reduced testing and reporting activity on weekends.
With the latest figures, Minnesota has now seen 304,023 positive cases, 16,423 hospitalizations and 3,521 deaths since the pandemic arrived here in March.
Residents of long-term care and assisted-living facilities accounted for 23 of the newly announced deaths, and 2,378 deaths since the start of the pandemic.
The state's two-day count of 9,040 new cases came on a low volume of 36,601 newly completed tests, according to the Star Tribune's coronavirus tracker.
Minnesota did not plan to update its dashboard for hospital capacity on Saturday, but the Star Tribune tracker shows 380 new admissions reported over the two-day period. The one-day figures on each of the last three Saturdays were 283, 271 and 201 new admissions.
Daily reports of new admissions typically include patients who have entered the hospital at some point over the last several days — not just on the most recent day.
Numbers released Saturday show health care workers have accounted for 22,292 positive cases — up by more than 200 cases from last week. More than 257,000 people who were infected no longer need to be isolated.
COVID-19 is a viral respiratory illness caused by a new coronavirus that surfaced late last year. People at greatest risk include those 65 and older, residents of long-term care facilities and those with underlying medical conditions.
Those health problems range from lung disease and serious heart conditions to severe obesity and diabetes. People undergoing treatment for failing kidneys also run a greater risk, as do those with cancer and other conditions where treatments suppress immune systems.
Most patients with COVID-19 don't need to be hospitalized. Most illnesses involve mild or moderate symptoms; many cases are asymptomatic.'''

In [None]:
# importing string punctuation for characters removal
import string
string.punctuation

In [None]:
# list of characters to be removed. The app.py function includes '—'
bad_char = [i for i in string.punctuation]
print(bad_char)

In [None]:
# function to preprocess new text
def text_prepro(message):

    # use the function argument, not the global input_message
    message = message.lower()
    
    # removing possessive 's
    message = message.replace("'s","")
    
    # removing long dash (not in string.punctuation)
    message = message.replace("—","")
    
    # replacing '\n' with blank space
    message = message.replace('\n',' ')
    
    # removing punctuation characters listed in bad_char
    message_nochar = ''.join(filter(lambda i: i not in bad_char, message))
    
    # removing leading and trailing spaces
    message_nospace = message_nochar.strip()
    
    return message_nospace
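
The same cleaning steps can also be written with `str.translate`, which deletes all unwanted characters in one pass. This is a sketch under stated assumptions rather than the function used by the app: the name `clean_text` is hypothetical, and handling both the straight and the curly apostrophe is an added assumption.

```python
import string

# Translation table that deletes ASCII punctuation plus the long dash
_DELETE = str.maketrans("", "", string.punctuation + "—")

def clean_text(message):
    """Lowercase, drop possessive 's, flatten newlines, strip punctuation and edges."""
    message = message.lower()
    # Assumption: remove both straight and curly possessives
    message = message.replace("'s", "").replace("’s", "")
    message = message.replace("\n", " ")
    return message.translate(_DELETE).strip()

# Made-up sample text, for illustration only
sample = "The State's count — 9,040 new cases!\n"
cleaned = clean_text(sample)
print(cleaned)
```

Because the function depends only on its argument, it can be reused on any new article without relying on notebook-level variables.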

In [None]:
# new input preprocessing
message_preprocessed = text_prepro(input_message)

In [None]:
type(message_preprocessed)

In [None]:
# wrap the string in a list: the vectorizer expects an iterable of documents
list_test = [message_preprocessed]
list_test

In [None]:
# new input classification
result = loaded_model.predict(loaded_vectorizer.transform(list_test))

In [None]:
print(result)

In [None]:
result