# Model Training for Score Relevance

## Project Overview

This notebook is part of a series dedicated to building a relevance score model. The goal is to distinguish crypto-related news headlines from general news, using headlines from two sources as the class labels:
- Target 1: News from [Yahoo Finance on Crypto](https://finance.yahoo.com/topic/crypto) 
- Target 0: General news from [Yahoo News](https://news.yahoo.com)

The model employs a pipeline combining `TfidfVectorizer` and `MultinomialNB` (Naive Bayes) for classification.

## Notebook Objective

`Relevance_Analysis_GA.ipynb` trains a supervised relevance score model on the `all_score_training.csv` dataset. The trained model is then applied to news articles downloaded from GDELT, adding relevance columns to the existing data.

- **Development Dataset**: `all_score_training.csv`
- **Deployment Dataset**: 
    - **Input**: `GA_data.csv`
    - **Output**: `GA_data_wrelevance.csv`

## Model Training Approach

The training involves several key steps:
1. **Data Preprocessing**: Using custom transformers for text cleaning, stop-word removal, and lemmatization.
2. **Feature Extraction**: Applying `TfidfVectorizer` to convert text data into a format suitable for model training.
3. **Model Training**: Employing `MultinomialNB` within a pipeline for classification.
4. **Evaluation**: Assessing the model's performance using metrics such as accuracy, precision, recall, F1 score, and ROC AUC.
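
The four steps above can be sketched on a handful of made-up headlines. This is a minimal illustration, not the notebook's pipeline: the toy headlines and labels are invented, and the custom cleaning transformers are replaced by `TfidfVectorizer`'s built-in lowercasing and English stop-word list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data (1 = crypto-related, 0 = general news).
headlines = [
    "Bitcoin price falls as crypto traders sell",
    "Ethereum and bitcoin rally after crypto ruling",
    "Crypto exchange lists new bitcoin fund",
    "City council approves new park budget",
    "Storm brings heavy rain to the coast",
    "Local team wins the championship game",
]
labels = [1, 1, 1, 0, 0, 0]

# Steps 1-3: built-in preprocessing, TF-IDF features, Naive Bayes classifier.
toy_pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("classifier", MultinomialNB()),
])
toy_pipeline.fit(headlines, labels)

# Step 4: evaluation (on the training data here, only to show the API).
train_predictions = toy_pipeline.predict(headlines)
train_accuracy = accuracy_score(labels, train_predictions)
print(train_accuracy)
```

Scoring on the training data says little about generalization; the notebook itself evaluates on a held-out test split.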

## Implementation Details

The process begins with essential imports and data loading. Custom transformers (`TextCleaner`, `StopWordsRemover`, `Lemmatizer`) are defined for data preprocessing, followed by the establishment of a pipeline integrating these transformers with `TfidfVectorizer` and `MultinomialNB`.

The dataset `all_score_training.csv` supplies the labeled data; its `target` column marks each headline as relevant (1) or non-relevant (0). The data is split 80/20 into stratified training and test sets, the pipeline is fit on the training portion, and performance metrics are reported for both sets.

Finally, the trained model is saved for future use in classifying new datasets, demonstrated by applying it to `GA_data.csv` and producing `GA_data_wrelevance.csv`.

## Conclusion

This notebook walks through the end-to-end process of building a relevance score model, from preprocessing and feature extraction to training, evaluation, and deployment. Because preprocessing, vectorization, and classification live in a single pipeline object, the same steps run identically at training time and when scoring new articles.

## Training a simple model 

Imports

In [1]:
## Imports

# Standard library
import pickle
import re

# Data handling and plotting
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# NLP preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Modeling and evaluation
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline, make_pipeline

In [2]:
training_df = pd.read_csv('all_score_training.csv')

In [3]:
training_df.head()

Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,headlines,target
0,0,0,0.0,‘Obviously I don’t think women are any less in...,1.0
1,1,1,1.0,Layoffs hit crypto and real estate tech partic...,1.0
2,2,2,2.0,A brief recap of one of the worst weeks in cry...,1.0
3,3,3,3.0,Wells Fargo economist likens crypto collapse t...,1.0
4,4,4,4.0,Crypto's Excruciating Week Has Traders Bracing...,1.0


In [4]:
c = training_df['target'].value_counts()
p = training_df['target'].value_counts(normalize=True)
# normalize=True returns fractions of the total, not percentages
pd.concat([c, p], axis=1, keys=['counts', 'proportion'])

target,counts,proportion
0.0,1187,0.684939
1.0,546,0.315061
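
With roughly 68% of the headlines in class 0, accuracy alone can flatter a model, so it helps to know what a classifier that always predicts the majority class would score. A small sketch using only the counts shown above:

```python
# Class counts taken from the value_counts() output above.
counts = {0.0: 1187, 1.0: 546}

total = sum(counts.values())
# Accuracy of a classifier that always predicts the majority class.
majority_baseline = max(counts.values()) / total

print(total)                        # 1733
print(round(majority_baseline, 6))  # 0.684939
```

Any accuracy figure reported later is best read against this baseline rather than against 0.5.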


In [5]:
# Drop the leftover unnamed columns in a single call
training_df = training_df.drop(columns=['Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0.2'])

In [6]:
training_df.head()

Unnamed: 0,headlines,target
0,‘Obviously I don’t think women are any less in...,1.0
1,Layoffs hit crypto and real estate tech partic...,1.0
2,A brief recap of one of the worst weeks in cry...,1.0
3,Wells Fargo economist likens crypto collapse t...,1.0
4,Crypto's Excruciating Week Has Traders Bracing...,1.0
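
The `Unnamed: 0*` columns dropped above are typical of a CSV that was repeatedly written with its index and read back without `index_col`. The sketch below reproduces the effect on a made-up two-row frame (an assumption about how the columns arose, not something the notebook confirms) and shows two ways to avoid it.

```python
import io

import pandas as pd

# Hypothetical stand-in for the training data.
df = pd.DataFrame({"headlines": ["a headline", "another headline"],
                   "target": [1.0, 0.0]})

# Writing with the default index=True stores the index as an unnamed column...
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)
print(list(reloaded.columns))  # ['Unnamed: 0', 'headlines', 'target']

# ...which index=False avoids at write time,
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
clean = pd.read_csv(without_index)
print(list(clean.columns))     # ['headlines', 'target']

# and which can be filtered out at read time if the file already exists.
filtered = reloaded.loc[:, ~reloaded.columns.str.startswith("Unnamed")]
print(list(filtered.columns))  # ['headlines', 'target']
```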


In [7]:
# Ensure necessary NLTK downloads
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

# Custom transformer for text cleaning
class TextCleaner(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Lowercase, then replace every run of non-alphanumeric characters with one space
        cleaned_data = X.astype(str).map(lambda x: x.lower())
        cleaned_data = cleaned_data.map(lambda x: re.sub(r'[^A-Za-z0-9]+', ' ', x))
        return cleaned_data

# Custom transformer for stop words removal
class StopWordsRemover(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.stop_words = set(stopwords.words('english'))

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        removed_stop_words = X.apply(lambda x: ' '.join([word for word in x.split() if word not in self.stop_words]))
        return removed_stop_words

# Custom transformer for lemmatization
class Lemmatizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        lemmatized_data = X.apply(lambda x: ' '.join([self.lemmatizer.lemmatize(word) for word in word_tokenize(x)]))
        return lemmatized_data

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/arturoolivera/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/arturoolivera/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/arturoolivera/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
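
To see what `TextCleaner.transform` does to a single headline, its two `map` calls can be run on a small Series. The example headline is invented; the lowercasing step and the regular expression are the ones defined above.

```python
import re

import pandas as pd

# Made-up headline for illustration.
sample = pd.Series(["Crypto's Excruciating Week: BTC down 20%!"])

# Same two steps as TextCleaner.transform: lowercase, then replace every run
# of non-alphanumeric characters with a single space.
lowered = sample.astype(str).map(lambda x: x.lower())
cleaned = lowered.map(lambda x: re.sub(r"[^A-Za-z0-9]+", " ", x))

print(repr(cleaned.iloc[0]))  # 'crypto s excruciating week btc down 20 '
```

Note that the apostrophe splits `crypto's` into `crypto` and `s`, and trailing punctuation leaves a trailing space. Both are harmless here because the later steps split on whitespace.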


In [8]:
# Define the pipeline
pipeline = Pipeline([
    ('text_cleaner', TextCleaner()),
    ('stop_words_remover', StopWordsRemover()),
    ('lemmatizer', Lemmatizer()),
    ('vectorizer', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])

In [9]:
# `training_df` holds the labeled data; `headlines` is the text column
X = training_df['headlines']
y = training_df['target']

# Stratified 80/20 split keeps the class ratio the same in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Train the pipeline
pipeline.fit(X_train, y_train)

# Make predictions
predictions = pipeline.predict(X_test)

# Evaluate the model
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))


              precision    recall  f1-score   support

         0.0       0.90      0.99      0.94       238
         1.0       0.98      0.76      0.86       109

    accuracy                           0.92       347
   macro avg       0.94      0.88      0.90       347
weighted avg       0.92      0.92      0.92       347



In [10]:
# Predictions on training set
y_train_pred = pipeline.predict(X_train)

# Training set scores
print("Training Set Scores:")
print(f"Accuracy: {accuracy_score(y_train, y_train_pred)}")
print(f"Precision: {precision_score(y_train, y_train_pred)}")
print(f"Recall: {recall_score(y_train, y_train_pred)}")
print(f"F1 Score: {f1_score(y_train, y_train_pred)}")
print(f"ROC AUC Score: {roc_auc_score(y_train, y_train_pred)}")

# Predictions on test set
y_pred = pipeline.predict(X_test)

# Test set scores
print("\nTest Set Scores:")
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Precision: {precision_score(y_test, y_pred)}")
print(f"Recall: {recall_score(y_test, y_pred)}")
print(f"F1 Score: {f1_score(y_test, y_pred)}")
print(f"ROC AUC Score: {roc_auc_score(y_test, y_pred)}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}")

Training Set Scores:
Accuracy: 0.974025974025974
Precision: 0.9878345498783455
Recall: 0.9290617848970252
F1 Score: 0.9575471698113207
ROC AUC Score: 0.961896540499092

Test Set Scores:
Accuracy: 0.9193083573487032
Precision: 0.9764705882352941
Recall: 0.7614678899082569
F1 Score: 0.8556701030927835
ROC AUC Score: 0.8765322642818596
Confusion Matrix:
[[236   2]
 [ 26  83]]
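
The test-set scores can be re-derived from the confusion matrix, which is a useful check on how each metric is defined. The sketch below rebuilds label vectors from the four counts printed above. It also illustrates that `roc_auc_score` applied to hard 0/1 predictions, as in the cell above, coincides with balanced accuracy; a threshold-independent AUC would instead take scores such as `pipeline.predict_proba(X_test)[:, 1]`, which this notebook does not compute.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Counts from the test-set confusion matrix above: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 236, 2, 26, 83

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)

# Rebuild label vectors that produce the same confusion matrix.
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_hard = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

# With hard predictions the ROC curve has a single operating point, so the
# "AUC" reduces to the mean of recall and specificity (balanced accuracy).
auc_from_labels = roc_auc_score(y_true, y_hard)
balanced = balanced_accuracy_score(y_true, y_hard)
print(auc_from_labels, balanced)
```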


In [11]:
# Storing the model for later use
with open('RelScoreModel_GA.pkl', 'wb') as f:
    pickle.dump(pipeline, f)
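
Loading the model back is the mirror image of the cell above. The sketch below round-trips a small stand-in pipeline through a temporary file; the toy data and the file name `toy_model.pkl` are made up. One caveat for the real model: pickle stores custom classes by reference, so `TextCleaner`, `StopWordsRemover`, and `Lemmatizer` need to be defined or importable in the session that loads `RelScoreModel_GA.pkl`.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the trained pipeline.
toy_model = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", MultinomialNB()),
])
toy_model.fit(["bitcoin price drops", "rain expected tomorrow"], [1, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")

    # Context managers close the file handles even if an error occurs.
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

new_titles = ["bitcoin rallies again", "heavy rain in the city"]
same = (restored.predict(new_titles) == toy_model.predict(new_titles)).all()
print(same)  # True
```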

### Check Case Study File 

In [12]:
df_study = pd.read_csv('GA_data.csv')

In [13]:
df_study.head()

Unnamed: 0,url,url_mobile,title,seendate,socialimage,domain,language,sourcecountry
0,https://news.yahoo.com/ai-scams-missouri-warns...,,AI Scams : Missouri warns voices of loved ones...,20240219T204500Z,https://media.zenfs.com/en/ktvi_articles_498/2...,news.yahoo.com,English,United States
1,https://www.americanbanker.com/opinion/regulat...,,Regulators should reexamine their assumptions ...,20240219T194500Z,https://source-media-brightspot.s3.us-east-1.a...,americanbanker.com,English,United States
2,https://biztoc.com/x/97e1450bfef84362,,South Korean Political Party Eyes Crypto Revol...,20240219T130000Z,https://c.biztoc.com/p/97e1450bfef84362/s.webp,biztoc.com,English,
3,https://biztoc.com/x/5c2110519540e5cf,,Unraveling the Mystery Behind XRP Price Underp...,20240219T103000Z,https://c.biztoc.com/p/5c2110519540e5cf/s.webp,biztoc.com,English,
4,https://biztoc.com/x/2f038851769a9841,,Cryptocurrency Rankings : Solana Claims the Co...,20240219T181500Z,https://c.biztoc.com/p/2f038851769a9841/s.webp,biztoc.com,English,


In [14]:
df_study.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14004 entries, 0 to 14003
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   url            14004 non-null  object
 1   url_mobile     2194 non-null   object
 2   title          14002 non-null  object
 3   seendate       14004 non-null  object
 4   socialimage    7936 non-null   object
 5   domain         14004 non-null  object
 6   language       14004 non-null  object
 7   sourcecountry  11462 non-null  object
dtypes: object(8)
memory usage: 875.4+ KB
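
`df_study.info()` reports 14002 non-null titles out of 14004 rows, so two titles are missing. In the run shown here the pipeline still scores those rows, since `TextCleaner` casts its input with `astype(str)`, but they are then scored on a placeholder string rather than on real text. One option, sketched on a made-up frame, is to make the handling explicit before scoring; whether to drop or fill such rows is a judgment call the notebook does not address.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df_study with one missing title.
toy_study = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    "title": ["Bitcoin climbs", np.nan, "Rain expected"],
})

n_missing = int(toy_study["title"].isna().sum())
print(n_missing)  # 1

# Option A: keep only rows that have a title.
scored_rows = toy_study.dropna(subset=["title"])

# Option B: keep every row but replace missing titles with an empty string.
filled_titles = toy_study["title"].fillna("")
```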


In [15]:
study_proba = pipeline.predict_proba(df_study['title'])

In [16]:
study_proba

array([[0.78981543, 0.21018457],
       [0.4341361 , 0.5658639 ],
       [0.30248348, 0.69751652],
       ...,
       [0.82172314, 0.17827686],
       [0.88688625, 0.11311375],
       [0.59431383, 0.40568617]])
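
Each row of `study_proba` holds one probability per class, with columns ordered as in the classifier's `classes_` attribute, so column 1 corresponds to `target == 1.0` only because the labels sort as `[0.0, 1.0]`. A sketch on made-up data that looks the column up instead of hard-coding it:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy model with the same label values as the notebook.
toy_model = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", MultinomialNB()),
])
toy_model.fit(
    ["bitcoin price drops", "crypto market rallies",
     "rain expected tomorrow", "council approves budget"],
    [1.0, 1.0, 0.0, 0.0],
)

proba = toy_model.predict_proba(["bitcoin market update"])

# classes_ gives the label for each predict_proba column.
classes = toy_model.classes_
relevant_col = int(np.where(classes == 1.0)[0][0])
relevance_probability = proba[:, relevant_col]

print(list(classes))  # [0.0, 1.0]
print(relevant_col)   # 1
```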

In [17]:
df_study['relevance_probability'] = study_proba[:, 1]

In [18]:
strength_study = pipeline.predict(df_study['title'])

In [19]:
df_study['relevance_class'] = strength_study

In [20]:
df_study.head()

Unnamed: 0,url,url_mobile,title,seendate,socialimage,domain,language,sourcecountry,relevance_probability,relevance_class
0,https://news.yahoo.com/ai-scams-missouri-warns...,,AI Scams : Missouri warns voices of loved ones...,20240219T204500Z,https://media.zenfs.com/en/ktvi_articles_498/2...,news.yahoo.com,English,United States,0.210185,0.0
1,https://www.americanbanker.com/opinion/regulat...,,Regulators should reexamine their assumptions ...,20240219T194500Z,https://source-media-brightspot.s3.us-east-1.a...,americanbanker.com,English,United States,0.565864,1.0
2,https://biztoc.com/x/97e1450bfef84362,,South Korean Political Party Eyes Crypto Revol...,20240219T130000Z,https://c.biztoc.com/p/97e1450bfef84362/s.webp,biztoc.com,English,,0.697517,1.0
3,https://biztoc.com/x/5c2110519540e5cf,,Unraveling the Mystery Behind XRP Price Underp...,20240219T103000Z,https://c.biztoc.com/p/5c2110519540e5cf/s.webp,biztoc.com,English,,0.369046,0.0
4,https://biztoc.com/x/2f038851769a9841,,Cryptocurrency Rankings : Solana Claims the Co...,20240219T181500Z,https://c.biztoc.com/p/2f038851769a9841/s.webp,biztoc.com,English,,0.512854,1.0
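
For this two-class model, `predict` returns the class with the larger probability, so `relevance_class` is 1.0 exactly when `relevance_probability` exceeds 0.5 (ties aside). Because the probability is stored alongside the class, a stricter or looser cut-off can be applied later without re-running the model. The sketch below uses the five probabilities shown above; the 0.6 threshold is an arbitrary illustration, not a recommendation.

```python
import pandas as pd

# relevance_probability values from the df_study.head() output above.
toy = pd.DataFrame({
    "relevance_probability": [0.210185, 0.565864, 0.697517, 0.369046, 0.512854],
})

# Default behaviour of predict() for two classes: pick the likelier class.
toy["relevance_class"] = (toy["relevance_probability"] > 0.5).astype(float)
print(toy["relevance_class"].tolist())  # [0.0, 1.0, 1.0, 0.0, 1.0]

# Hypothetical stricter cut-off: fewer rows are flagged as relevant.
threshold = 0.6
toy["relevance_strict"] = (toy["relevance_probability"] >= threshold).astype(float)
print(toy["relevance_strict"].tolist())  # [0.0, 0.0, 1.0, 0.0, 0.0]
```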


In [21]:
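# index=False keeps the row index out of the file, so no 'Unnamed: 0' column appears on reload
df_study.to_csv('GA_data_wrelevance.csv', index=False)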