## Project Overview

In this project, a detailed machine learning pipeline named `ML_CryptoNews_Pipeline` has been developed to systematically analyze news articles related to cryptocurrencies. The pipeline integrates advanced Natural Language Processing (NLP) techniques for text data preprocessing and employs FinBERT, a financial-context-specific adaptation of the BERT model, for sentiment analysis. The objective is to predict cryptocurrency price momentum by analyzing the sentiment and content of these articles.

Several machine learning models were evaluated and tuned to find the most effective approach for predicting price momentum: Random Forest, Support Vector Machine (SVM), Gradient Boosting Machine (GBM), and Deep Neural Network (DNN). Their accuracies on the test set, rounded to four decimal places, were:

- Random Forest: 0.6081
- SVM: 0.5983
- GBM: 0.6131
- DNN: 0.6075

Based on these evaluations, the Gradient Boosting Machine (GBM) model was selected for the pipeline because it achieved the highest accuracy on the test set, although its margin over Random Forest and DNN is less than one percentage point.
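The selection step can be illustrated with a few lines of plain Python. This is only a sketch of the arithmetic behind the choice, using the reported test accuracies rounded to four decimals; the variable names are illustrative and do not come from the original notebook.

```python
# Reported test-set accuracies, rounded to four decimal places.
test_accuracy = {
    "Random Forest": 0.6081,
    "SVM": 0.5983,
    "GBM": 0.6131,
    "DNN": 0.6075,
}

# Pick the model with the highest test accuracy.
best_model = max(test_accuracy, key=test_accuracy.get)
best_accuracy = test_accuracy[best_model]

# Gap between each model and the best one, in percentage points.
gaps_pp = {name: (best_accuracy - acc) * 100 for name, acc in test_accuracy.items()}

for name, acc in sorted(test_accuracy.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<14}{acc:.4f}  gap to best: {gaps_pp[name]:.2f} pp")
print(f"Selected model: {best_model}")
```

Printing the gaps makes it easy to see how close the candidates are, which is worth keeping in mind when interpreting the choice of a single "best" model.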

### Core Components

The pipeline integrates several key components:

- **Comprehensive Sentiment Analysis**:

    - **General Sentiment with FinBERT**: FinBERT, a BERT model fine-tuned for financial text, classifies each cryptocurrency news article as positive, neutral, or negative. This provides a first layer of insight into market sentiment.
    - **Aspect-Based Sentiment Analysis**: To complement the article-level label, the pipeline also performs aspect-based sentiment analysis, which examines the sentiment attached to specific topics or entities mentioned in the text. SpaCy extracts noun chunks as candidate aspects and FinBERT classifies each one. The per-aspect results are then summarized as an average sentiment score (stored as `weighted_score` in the code, although it is computed as an unweighted mean) and as the proportions of positive, neutral, and negative aspects. This second layer gives a more granular view of how a news item may affect the market.

- **Topic Modeling via LDA**: Beyond sentiment, understanding the thematic content of news articles is crucial. We employ Latent Dirichlet Allocation (LDA) for topic modeling, identifying dominant topics and themes. This analysis provides valuable insights into the focal points of discourse within the cryptocurrency market, enriching our predictive model with contextual depth.

- **Price Momentum Prediction**: The final step predicts cryptocurrency price movement from the sentiment and topic features described above, combined with averaged Word2Vec embeddings of the article text. A Gradient Boosting Machine (GBM) trained on historical data maps these features to a binary outcome: price likely to rise or likely to drop.

This enriched sentiment analysis framework, paired with advanced topic modeling and predictive capabilities, forms the backbone of our `ML_CryptoNews_Pipeline`. It exemplifies our commitment to leveraging machine learning and NLP techniques to distill actionable insights from cryptocurrency news, aiming to offer a sophisticated tool for market analysis and prediction.
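The aspect-level summary described above can be sketched without loading SpaCy or FinBERT. The example below assumes the per-aspect labels have already been produced by a classifier; `aggregate_aspect_labels` and the sample labels are hypothetical and serve only to show how labels become a mean score and three proportions.

```python
# Label-to-score mapping: positive = 1, neutral = 0, negative = -1.
LABEL_TO_SCORE = {"positive": 1, "neutral": 0, "negative": -1}


def aggregate_aspect_labels(labels):
    """Hypothetical helper: summarize per-aspect sentiment labels.

    Returns the unweighted mean score and the share of positive,
    neutral, and negative aspects. Unknown labels count as neutral.
    """
    if not labels:
        return {"mean_score": 0.0, "positive": 0.0, "neutral": 0.0, "negative": 0.0}

    scores = [LABEL_TO_SCORE.get(label, 0) for label in labels]
    total = len(scores)
    return {
        "mean_score": sum(scores) / total,
        "positive": scores.count(1) / total,
        "neutral": scores.count(0) / total,
        "negative": scores.count(-1) / total,
    }


# Made-up labels for four noun chunks from one article.
example_labels = ["positive", "neutral", "positive", "negative"]
summary = aggregate_aspect_labels(example_labels)
print(summary)
```

Because every aspect contributes equally, a long article with many neutral noun chunks will pull the mean score toward zero even if a few aspects are strongly positive or negative.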

### Project Deliverables

This project demonstrates an end-to-end data science workflow: data ingestion, processing, modeling, and generating actionable insights. Each article processed through the pipeline receives a sentiment label (positive, neutral, or negative) and a predicted price momentum (likely to rise or likely to drop).

### Model Performance and Real-World Application

Note that the current models achieve test accuracies of roughly 60% to 61%. While this indicates some predictive capability, it falls short of the reliability required for real-world financial decision-making. Consequently, the project should be viewed as an educational exercise: a demonstration of how to construct a machine learning workflow from scratch, encompassing data preprocessing, modeling, and interpretation in the context of cryptocurrency markets.
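One way to put the accuracy figure in perspective is to attach an approximate confidence interval to it. The sketch below computes a 95% Wilson score interval for the GBM accuracy. The test-set size is not stated in this project, so `N_TEST` is an assumption: 1,623 is a value consistent with the reported accuracy fractions, not a confirmed figure. Comparing the interval with 50% is only meaningful if the two classes are roughly balanced, which is also not stated here.

```python
import math


def wilson_interval(accuracy, n_samples, z=1.96):
    """Approximate Wilson score interval for a proportion (z=1.96 gives ~95%)."""
    denom = 1 + z**2 / n_samples
    center = (accuracy + z**2 / (2 * n_samples)) / denom
    half_width = (
        z
        * math.sqrt(accuracy * (1 - accuracy) / n_samples + z**2 / (4 * n_samples**2))
        / denom
    )
    return center - half_width, center + half_width


N_TEST = 1623  # ASSUMPTION: test-set size is not given in the source.
GBM_ACCURACY = 0.6130622304374614  # reported GBM test accuracy

lower, upper = wilson_interval(GBM_ACCURACY, N_TEST)
print(f"Approximate 95% interval: {lower:.3f} to {upper:.3f}")
```

Under these assumptions the interval stays above 0.5 but remains far from the level of reliability a trading decision would need, which is in line with treating the project as an educational exercise.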

### Future Directions

Despite the limitations in model accuracy, this project lays a solid foundation for further exploration and improvement. Future work could involve refining the models, exploring additional features, and incorporating more complex NLP techniques to better capture the nuances of financial news sentiment and its impact on market movements.

This project was developed as part of the final portfolio for the Machine Learning career path at Codecademy, aimed at mastering the workflow of machine learning projects from initial concept to actionable insights.

```python
import warnings
# Silence scikit-learn UserWarnings (set before the other imports and model loading)
warnings.filterwarnings('ignore', category=UserWarning, module='sklearn')
import joblib
import numpy as np
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt', quiet=True)
from sklearn.preprocessing import StandardScaler
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer
import spacy

# Load trained models (absolute paths from the author's machine; adjust as needed)
lda_model = joblib.load('C:/Users/adrco/Final_Project-env/LDA_Model/lda_model.pkl')
tfidf_vectorizer = joblib.load('C:/Users/adrco/Final_Project-env/TF-IDF_Vectorizer/tfidf_vectorizer.pkl')
word2vec_model = Word2Vec.load('C:/Users/adrco/Final_Project-env/Word2Vec_Model/word2vec_model.model')
gbm_best_model = joblib.load('C:/Users/adrco/Final_Project-env/GBM_Model/gbm_best_model.joblib')

# Fit the scaler on the saved training features so new inputs are scaled the same way
X_train = joblib.load('C:/Users/adrco/Final_Project-env/Datasets/Splits/X_train_updated.pkl')
scaler = StandardScaler().fit(X_train)

# Set up FinBERT
finbert_tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
finbert_model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")
finbert_pipeline = pipeline("sentiment-analysis", model=finbert_model, tokenizer=finbert_tokenizer)

# Load SpaCy for aspect-based sentiment analysis
nlp = spacy.load("en_core_web_sm")

def preprocess_text(text):
    """Lowercase the text before topic and embedding feature extraction."""
    return text.lower()

def get_lda_features(text):
    """Return the LDA topic distribution and the 1-based dominant topic."""
    # Transform the text to TF-IDF features
    tfidf_features = tfidf_vectorizer.transform([text])
    # Get the LDA topic distribution for the TF-IDF features
    lda_features = lda_model.transform(tfidf_features)[0]
    # Dominant topic, shifted by 1 so topics are numbered from 1
    dominant_topic = np.argmax(lda_features) + 1
    return lda_features, dominant_topic

def get_word2vec_features(text):
    """Return the Word2Vec vector of the text, averaged over all tokens."""
    words = word_tokenize(text)
    vector_size = word2vec_model.vector_size
    word_vectors = np.zeros((vector_size,), dtype="float32")
    for word in words:
        if word in word2vec_model.wv:
            word_vectors += word2vec_model.wv[word]
    # The sum is divided by the total token count, including out-of-vocabulary tokens
    return word_vectors / len(words) if words else word_vectors

def analyze_sentiment_with_finbert(text):
    """Map the FinBERT label to a score: positive=1, neutral=0, negative=-1."""
    # truncation=True cuts long articles to the model's maximum input length
    # instead of raising an error
    result = finbert_pipeline(text, truncation=True)[0]
    sentiment_score = {'positive': 1, 'neutral': 0, 'negative': -1}.get(result['label'], 0)
    return sentiment_score

def aspect_based_analysis(text):
    """Score each noun chunk with FinBERT and summarize the results."""
    doc = nlp(text)
    aspects = [chunk.text for chunk in doc.noun_chunks]
    aspect_sentiments = [analyze_sentiment_with_finbert(aspect) for aspect in aspects]
    if not aspect_sentiments:
        return 0, 0, 0, 0
    n_aspects = len(aspect_sentiments)
    # Unweighted mean of the per-aspect scores
    weighted_score = np.mean(aspect_sentiments)
    positive_prop = aspect_sentiments.count(1) / n_aspects
    neutral_prop = aspect_sentiments.count(0) / n_aspects
    negative_prop = aspect_sentiments.count(-1) / n_aspects
    return weighted_score, positive_prop, neutral_prop, negative_prop

def prepare_features(text):
    """Build the scaled 1 x N feature vector expected by the GBM model."""
    preprocessed_text = preprocess_text(text)
    lda_features, dominant_topic = get_lda_features(preprocessed_text)
    word2vec_features = get_word2vec_features(preprocessed_text)
    # Sentiment features use the original (non-lowercased) text
    finbert_score = analyze_sentiment_with_finbert(text)
    weighted_score, positive_prop, neutral_prop, negative_prop = aspect_based_analysis(text)

    all_features = np.hstack([
        lda_features,  # Topic distribution weights
        [dominant_topic],  # Dominant topic as a single feature
        word2vec_features,  # Word2Vec features
        [finbert_score, weighted_score, positive_prop, neutral_prop, negative_prop]  # Sentiment scores
    ])

    # Ensure the feature vector is 2D (1 sample x N features) before scaling
    all_features = all_features.reshape(1, -1)
    scaled_features = scaler.transform(all_features)
    return scaled_features

def predict_and_output(new_text):
    """Return a one-line summary with the sentiment label and the predicted price movement."""
    scaled_features = prepare_features(new_text)
    prediction = gbm_best_model.predict(scaled_features)[0]

    # Run FinBERT on the full text again to get the label for the output message
    sentiment_score = analyze_sentiment_with_finbert(new_text)
    sentiment_label = 'neutral'
    if sentiment_score > 0:
        sentiment_label = 'positive'
    elif sentiment_score < 0:
        sentiment_label = 'negative'

    # Combine the sentiment label and the price movement prediction in the output
    price_movement_prediction = "Price will likely rise." if prediction == 1 else "Price will likely drop."
    return f"Sentiment: {sentiment_label}. {price_movement_prediction}"

# Example usage, paste any news article text here:
new_text = "Bitcoin Surges to New Heights: A New Era for Cryptocurrency Investors. In an unprecedented rally, Bitcoin has once again proven its resilience and potential for investors worldwide. The cryptocurrency giant has not only surpassed previous records but has also instilled renewed confidence among its supporters. Experts attribute this surge to a combination of factors including increased adoption by mainstream financial institutions, positive regulatory news, and a growing recognition of Bitcoin as a 'digital gold' in times of economic uncertainty. This positive momentum is seen as a significant milestone for Bitcoin and the wider cryptocurrency market, signaling a strong future ahead."
print(predict_and_output(new_text))
```





Sentiment: positive. Price will likely rise.
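
The order in which the features are concatenated matters, because the scaler and the GBM model expect the same column layout they saw during training. The sketch below reproduces that layout with plain Python lists and made-up numbers. `assemble_features` is a hypothetical helper, and the real dimensions depend on the trained LDA and Word2Vec models, which are not stated here.

```python
def assemble_features(topic_weights, embedding, sentiment_features):
    """Hypothetical helper mirroring the feature order used in the pipeline:
    topic weights, 1-based dominant topic, averaged embedding, sentiment features.
    """
    # Index of the largest topic weight, shifted to 1-based numbering.
    dominant_topic = topic_weights.index(max(topic_weights)) + 1
    return list(topic_weights) + [dominant_topic] + list(embedding) + list(sentiment_features)


# Made-up values for illustration only.
topic_weights = [0.1, 0.7, 0.2]                   # 3-topic distribution
embedding = [0.05, -0.02, 0.11, 0.0]              # 4-dimensional averaged word vector
sentiment_features = [1, 0.25, 0.5, 0.25, 0.25]   # article score, mean aspect score, 3 proportions

features = assemble_features(topic_weights, embedding, sentiment_features)
print(len(features))  # 3 topic weights + 1 dominant topic + 4 embedding dims + 5 sentiment features
print(features)
```

If the topic count or the embedding size changes after retraining, the scaler must be refit on features built with the new layout, otherwise `scaler.transform` will reject the input or silently misalign columns of equal length.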
