## Exploring the Prediction

This step builds the prediction model for the fake news classifier.

1. The data is text, so we need to tokenize the input before a model can use it.
2. We also want to save the trained model to a pickle file, so that our business logic can load and call it.

In [10]:
import pandas as pd
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [11]:
df = pd.read_csv("../../data/processed/fake-news-eda.csv")
# Drop the two leftover "Unnamed" columns (most likely row indexes saved by earlier CSV exports)
df = df.drop(columns=["Unnamed: 0", "Unnamed: 0.1"])

df.head()

,title,text,label,title_length,text_length,title_words_count,text_words_count,text_sentence_count,text_clean
0,Palestinians switch off Christmas lights in Be...,"RAMALLAH, West Bank (Reuters) - Palestinians s...",1,75,1335,10,212,7,ramallah west bank reuters - palestinians swit...
1,China says Trump call with Taiwan president wo...,BEIJING (Reuters) - U.S. President-elect Donal...,1,72,373,11,59,2,beijing reuters - us president-elect donald tr...
2,FAIL! The Trump Organization’s Credit Score W...,While the controversy over Trump s personal ta...,0,64,2072,10,354,11,controversy trump personal tax returns continu...
3,Zimbabwe military chief's China trip was norma...,BEIJING (Reuters) - A trip to Beijing last wee...,1,67,2692,10,428,13,beijing reuters - trip beijing last week zimba...
4,THE MOST UNCOURAGEOUS PRESIDENT EVER Receives ...,There has never been a more UNCOURAGEOUS perso...,0,103,1946,14,314,8,never uncourageous person white house barack o...


In [12]:
# Goal: turn the "text_clean" column into TF-IDF features.
# The model predicts from the words in the text, so the text must first be converted to numeric vectors.
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')

X = df["text_clean"].astype('U').values
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
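To make the vectorization step more concrete, here is a minimal pure-Python sketch of how a TF-IDF weight can be computed. It assumes plain whitespace tokenization and the smoothed IDF formula with L2 normalization that scikit-learn documents as its default. The function `tfidf_rows` and the toy documents are hypothetical, and the sketch ignores `stop_words` and `max_features`, so it will not reproduce `TfidfVectorizer` exactly.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    # Simplified tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        # L2 normalization so every document vector has unit length.
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


# Hypothetical toy corpus, not taken from the dataset.
docs = ["trump tax returns", "reuters trump call", "reuters china trip"]
vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:8s} {weight:.3f}")
```

Terms that occur in fewer documents receive a higher weight than terms shared by many documents, which is what lets the classifier focus on distinctive words.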

In [13]:
# Initialize and train the logistic regression model

lg = LogisticRegression(random_state=42)
lg.fit(X_train_tfidf, y_train)

In [14]:
y_pred = lg.predict(X_test_tfidf)  # predict on the vectorized test split
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.9698732599210471

Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.97      0.97      2188
           1       0.97      0.97      0.97      2625

    accuracy                           0.97      4813
   macro avg       0.97      0.97      0.97      4813
weighted avg       0.97      0.97      0.97      4813
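As a reading aid for the report above, the following sketch shows how accuracy, precision, recall, and F1 can be derived from true and predicted labels for one class. The helper `binary_report` and the toy labels are hypothetical and are not the notebook's data; `classification_report` computes the same quantities per class.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy labels for illustration only.
y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 1, 0, 0, 0, 1, 1, 0]
report = binary_report(y_true_toy, y_pred_toy)
print(report)
```

Precision answers "of the items predicted as class 1, how many were really class 1", while recall answers "of the real class 1 items, how many were found".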



## Testing with our CSV test data

After evaluating on our X_test split, we want to see how the model performs on the separate test CSV.

In [15]:
test_data = pd.read_csv("../../data/raw/fake-news-test.csv", sep=";")
test_data_y = test_data["label"]
test_data_x = test_data.drop("label", axis=1)

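# Note: the vectorizer was fitted on "text_clean", but here it receives the raw "text" column.
# TfidfVectorizer lowercases and drops punctuation by default, which narrows the gap,
# but applying the same cleaning as in training would be more consistent.
test_data_tfidf = vectorizer.transform(test_data_x["text"])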

y_pred = lg.predict(test_data_tfidf)

In [16]:
print("Accuracy:", accuracy_score(test_data_y, y_pred))
print("\nClassification Report:")
print(classification_report(test_data_y, y_pred))

Accuracy: 0.964149316249846

Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.97      0.96      3753
           1       0.97      0.96      0.97      4364

    accuracy                           0.96      8117
   macro avg       0.96      0.96      0.96      8117
weighted avg       0.96      0.96      0.96      8117



## Saving model to pkl file

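The goal is to be able to call the model without retraining it.
The model is written to `src/models/fake-news.pkl`. The fitted vectorizer is saved next to it as `vectorizer.pkl`, because new text must be transformed with the same vocabulary before prediction.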

In [20]:
with open("../models/fake-news.pkl", "wb") as f:
  pickle.dump(lg, f)


with open("../models/vectorizer.pkl", "wb") as f:
  pickle.dump(vectorizer, f)
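The business logic could then load both files and predict roughly as follows. This is a minimal sketch: the helper names `load_artifacts` and `predict_label` are hypothetical, and it only assumes that the loaded objects expose `transform` and `predict`, as the vectorizer and model saved above do.

```python
import pickle


def load_artifacts(model_path, vectorizer_path):
    """Load the pickled model and vectorizer from disk."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    with open(vectorizer_path, "rb") as f:
        vectorizer = pickle.load(f)
    return model, vectorizer


def predict_label(text, model, vectorizer):
    """Vectorize one text with the fitted vectorizer and return the predicted label."""
    features = vectorizer.transform([text])  # expects an iterable of documents
    return int(model.predict(features)[0])


# Example usage (paths as written by the cell above, relative to the notebook):
# model, vectorizer = load_artifacts("../models/fake-news.pkl", "../models/vectorizer.pkl")
# print(predict_label("some article text", model, vectorizer))
```

scikit-learn must be installed wherever the files are loaded, ideally in the same version used for training. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.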