# Model Training

The goal is to classify a news item, given its title and text, as real or fake. The data source has a
shape of 7796×4. The first column identifies the news item, the second and third columns hold the title and
text, and the fourth column holds the label denoting whether the news is “REAL” or “FAKE”.

Dataset link: https://www.kaggle.com/datasets/hassanamin/textdb3

In [10]:
# Imports
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.compose import ColumnTransformer

from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
import joblib


## 1. Data Preprocessing 
● Download the data source, named news.csv. \
● Split the dataset into an 80% training set and a 20% testing set. \
● Check and report that the ratio of real to fake news is roughly the same in both the training
and testing sets.
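
The effect of a stratified split can be sketched on a small stand-in dataset. The toy DataFrame below is hypothetical and only mimics the `title`, `text`, and `label` columns of news.csv; it is an illustration under that assumption, not the actual data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for news.csv: 10 REAL and 10 FAKE rows.
toy = pd.DataFrame({
    "title": [f"title {i}" for i in range(20)],
    "text": [f"text {i}" for i in range(20)],
    "label": ["REAL", "FAKE"] * 10,
})

# stratify keeps the class proportions the same in both parts of the split.
train_df, test_df = train_test_split(
    toy, test_size=0.2, stratify=toy["label"], random_state=1
)

# Share of each class in the training and testing parts.
train_ratio = train_df["label"].value_counts(normalize=True)
test_ratio = test_df["label"].value_counts(normalize=True)
print(train_ratio)
print(test_ratio)
```

Without `stratify`, a random split of a small or imbalanced dataset may leave the two parts with noticeably different class ratios.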


In [11]:
def preProcess(dataPath):
    """ 
    Data preprocessing. 
    
    :param dataPath: str, the path of data file
    :returns (x_train, x_test, y_train, y_test): tuple, features and labels for training and testing sets

    """
    print("Data Prerpocessing...")
    # read data from file
    df = pd.read_csv(dataPath)

    # extract features and labels
    X = df[['title', 'text']]
    y = df['label']

    # split the dataset
    x_train, x_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=1)

    # check and report the ratio of real-to-fake news in both training and testing sets
    print("The ratio of real and fake news in training sets:")
    print(pd.Series(y_train).value_counts(normalize=True), '\n')
    print("The ratio of real and fake news in testing sets:")
    print(pd.Series(y_test).value_counts(normalize=True), '\n')

    return x_train, x_test, y_train, y_test


## 2. Training Logistic Regression Models with Bi-Grams 
● Build pipelines using sklearn’s CountVectorizer and TfidfVectorizer.\
● Add bigrams to both vectorizers.\
● Train logistic regression classifiers using the training set.\
● Compute (i) accuracy, (ii) precision and (iii) recall on the testing set.\
● Save the models to .pkl files using joblib.
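
To make the bigram step concrete, the sketch below compares the vocabulary learned with and without bigrams. The two sentences are made up for illustration; `TfidfVectorizer` accepts the same `ngram_range` parameter.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, for illustration only.
docs = ["fake news spreads fast", "real news spreads slowly"]

# ngram_range=(1, 1): single words only.
unigram = CountVectorizer(ngram_range=(1, 1)).fit(docs)
# ngram_range=(1, 2): single words plus pairs of adjacent words.
bigram = CountVectorizer(ngram_range=(1, 2)).fit(docs)

unigram_features = sorted(unigram.vocabulary_)
bigram_features = sorted(bigram.vocabulary_)

print(unigram_features)
print(bigram_features)
```

Bigrams such as `fake news` give the classifier features that depend on word order, at the cost of a much larger vocabulary.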

### 2.1 Train a logistic regression classifier with count vectorizer.

In [12]:
def logistic_regression_count(x_train, x_test, y_train, y_test):
    """ 
    Train a logistic regression classifier with count vectorizer.

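    :param x_train: pandas DataFrame, training features ('title' and 'text' columns)
    :param x_test: pandas DataFrame, testing features ('title' and 'text' columns)
    :param y_train: pandas Series, training labels
    :param y_test: pandas Series, testing labels
    :returns: fitted sklearn Pipeline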

    """

    print('Logistic regression classifier with count vectorizer, model training...')

    # creating a pipeline
    vectorizer = ColumnTransformer([('title_vect', CountVectorizer(ngram_range=(1, 2)), 'title'),
                                    ('text_vect', CountVectorizer(ngram_range=(1, 2)), 'text')
                                ])
    clf = Pipeline([
        ('vec', vectorizer),
        ('log', LogisticRegression(max_iter=1000)) 
    ])

    # Use the training data to fit the model
    clf.fit(x_train, y_train)
    print('Model training is over.\n')
    
    # make prediction
    y_pred = clf.predict(x_test)

    # evaluation
    print('Evaluation report:')
    print(classification_report(y_test, y_pred))

    # Save the model into a file named 'count_LR_model.pkl'
    joblib.dump(clf, "count_LR_model.pkl")

    return clf
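
`classification_report` prints all three metrics at once. If the individual numbers are needed, they might be computed separately as in the sketch below. The labels and predictions here are hypothetical and are not taken from the news dataset.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true labels and predictions, for illustration only.
y_true = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "FAKE"]
y_hat = ["FAKE", "REAL", "REAL", "REAL", "FAKE", "FAKE"]

# Fraction of all predictions that are correct.
accuracy = accuracy_score(y_true, y_hat)
# Of the items predicted FAKE, the fraction that are truly FAKE.
precision_fake = precision_score(y_true, y_hat, pos_label="FAKE")
# Of the truly FAKE items, the fraction that were predicted FAKE.
recall_fake = recall_score(y_true, y_hat, pos_label="FAKE")

print(accuracy, precision_fake, recall_fake)
```

Because the labels are strings, `pos_label` has to name the class being scored; passing `pos_label="REAL"` gives the metrics for the other class.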



### 2.2 Train a logistic regression classifier with tfidf vectorizer.

In [13]:
def logistic_regression_tfidf(x_train, x_test, y_train, y_test):
    """ 
    Train a logistic regression classifier with tfidf vectorizer.

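    :param x_train: pandas DataFrame, training features ('title' and 'text' columns)
    :param x_test: pandas DataFrame, testing features ('title' and 'text' columns)
    :param y_train: pandas Series, training labels
    :param y_test: pandas Series, testing labels
    :returns: fitted sklearn Pipeline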

    """

    print('Logistic regression classifier with tfidf vectorizer, model training...')

    # creating a pipeline
    vectorizer = ColumnTransformer([('title_vect', TfidfVectorizer(ngram_range=(1, 2)), 'title'),
                                    ('text_vect', TfidfVectorizer(ngram_range=(1, 2)), 'text')
                                ])
    clf = Pipeline([
        ('vec', vectorizer),
        ('log', LogisticRegression()) # The logistic regression using default params
    ])

    # Use the training data to fit the model
    clf.fit(x_train, y_train)
    print('Model training is over.\n')

    # make prediction
    y_pred = clf.predict(x_test)

    # evaluation
    print('Evaluation report:')
    print(classification_report(y_test, y_pred))

    # Save the model into a file named 'tfidf_LR_model.pkl'
    joblib.dump(clf, "tfidf_LR_model.pkl")

    return clf
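
A pipeline saved with `joblib.dump` can later be restored with `joblib.load` and used for prediction without retraining. The sketch below shows the round trip on a tiny, made-up dataset; the file name `toy_model.pkl` and the example rows are assumptions for illustration.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up rows with the same column names as the news data.
toy = pd.DataFrame({
    "title": ["aliens land", "budget passed", "aliens vote", "budget debate"],
    "text": ["aliens took over the city", "senate passed the budget",
             "aliens rigged the vote", "senate debated the budget"],
})
labels = ["FAKE", "REAL", "FAKE", "REAL"]

# Same pipeline layout as above: one vectorizer per text column.
pipe = Pipeline([
    ("vec", ColumnTransformer([
        ("title_vect", TfidfVectorizer(ngram_range=(1, 2)), "title"),
        ("text_vect", TfidfVectorizer(ngram_range=(1, 2)), "text"),
    ])),
    ("log", LogisticRegression()),
])
pipe.fit(toy, labels)

# Save to a temporary file, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline should predict exactly like the original one.
original_pred = pipe.predict(toy)
restored_pred = restored.predict(toy)
print(list(restored_pred))
```

Since the `ColumnTransformer` selects columns by name, any input passed to the restored pipeline needs to be a DataFrame with `title` and `text` columns.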

## 3. Run and Compare All Models

In [14]:
x_train, x_test, y_train, y_test = preProcess('./news.csv')
count_LR_model = logistic_regression_count(x_train, x_test, y_train, y_test)
tfidf_LR_model = logistic_regression_tfidf(x_train, x_test, y_train, y_test)

Data Prerpocessing...
The ratio of real and fake news in training sets:
REAL    0.500592
FAKE    0.499408
Name: label, dtype: float64 

The ratio of real and fake news in testing sets:
REAL    0.500395
FAKE    0.499605
Name: label, dtype: float64 

Logistic regression classifier with count vectorizer, model training...
Model training is over.

Evaluation report:
              precision    recall  f1-score   support

        FAKE       0.92      0.95      0.94       633
        REAL       0.95      0.92      0.93       634

    accuracy                           0.94      1267
   macro avg       0.94      0.94      0.94      1267
weighted avg       0.94      0.94      0.94      1267

Logistic regression classifier with tfidf vectorizer, model training...
Model training is over.

Evaluation report:
              precision    recall  f1-score   support

        FAKE       0.92      0.93      0.93       633
        REAL       0.93      0.92      0.93       634

    accuracy                           0.93      1267

| **Model**      | **Accuracy** | **Precision (fake)** | **Recall (fake)** | **Precision (real)** | **Recall (real)** |
|----------------|--------------|----------------------|-------------------|----------------------|-------------------|
| Logistic-Count | 0.94         | 0.92                 | 0.95              | 0.95                 | 0.92              |
| Logistic-Tfidf | 0.93         | 0.92                 | 0.93              | 0.93                 | 0.92              |
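
A comparison table like the one above could also be assembled programmatically from `classification_report(..., output_dict=True)`. The helper `summarize` and the toy labels below are hypothetical; this is a sketch of one possible approach, not how the table above was produced.

```python
import pandas as pd
from sklearn.metrics import classification_report


def summarize(name, y_true, y_pred):
    """Collect accuracy and per-class precision/recall into one table row."""
    report = classification_report(y_true, y_pred, output_dict=True)
    return {
        "Model": name,
        "Accuracy": report["accuracy"],
        "Precision (fake)": report["FAKE"]["precision"],
        "Recall (fake)": report["FAKE"]["recall"],
        "Precision (real)": report["REAL"]["precision"],
        "Recall (real)": report["REAL"]["recall"],
    }


# Hypothetical labels and predictions from two models, for illustration only.
y_true = ["FAKE", "FAKE", "REAL", "REAL"]
pred_a = ["FAKE", "FAKE", "REAL", "FAKE"]
pred_b = ["FAKE", "REAL", "REAL", "REAL"]

summary = pd.DataFrame([
    summarize("Model-A", y_true, pred_a),
    summarize("Model-B", y_true, pred_b),
]).set_index("Model")

print(summary.round(2))
```

With the real models, `y_pred` would come from `count_LR_model.predict(x_test)` and `tfidf_LR_model.predict(x_test)`.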