# Feature Engineering

In this notebook, we perform feature engineering on the news articles. Our goal is to uncover latent topics within these articles by vectorizing the text with TF-IDF and then fitting an LDA (Latent Dirichlet Allocation) topic model. This helps us identify prevalent themes and may improve our predictive model's ability to forecast cryptocurrency price momentum. After integrating sentiment scores into our dataset, we will explore correlation analysis to identify relationships between article sentiments, features, and price movements.

In [29]:
# Import necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Base path for datasets
dataset_path = "C:/Users/adrco/Final_Project-env/Datasets/"

# Path for the combined dataset
combined_file = dataset_path + "full_set.csv"

# Load the dataset
df = pd.read_csv(combined_file)

# Display the first rows 
print("Combined dataset loaded. Here are the first rows:\n")
print(df.head())


Combined dataset loaded. Here are the first rows:

     datetime                                               text  \
0  2022-10-14  despite fact blockchainbased carbon credit mar...   
1  2022-10-14  trader gained huge kudos space predicting drop...   
2  2022-10-14  always worked sticking plan clear invalidation...   
3  2022-10-14  fact broke level system giving bullish signals...   
4  2022-10-14  demand coming confirms theres fuel keep going ...   

                                                 url  price_momentums  
0  https://cryptonews.com/news/bitcoin-price-and-...                1  
1  https://cryptonews.com/news/bitcoin-price-pred...                1  
2  https://cryptonews.com/news/bitcoin-price-pred...                1  
3  https://cryptonews.com/news/bitcoin-price-pred...                1  
4  https://cryptonews.com/news/bitcoin-price-pred...                1  


## Text Preprocessing

Upon initial observation, the `text` column in our dataset seems to have undergone some preprocessing steps already, such as removal of certain characters and lowercasing. However, for the sake of thoroughness and to ensure our analysis pipeline is robust and scalable, we will outline a comprehensive preprocessing step here. 

This is an essential practice to make our model more adaptable, especially when integrating new, unprocessed text data in the future. It ensures consistency in text handling and can improve the model's performance and accuracy. Our preprocessing will include converting text to lowercase, removing punctuation and special characters, tokenization, and optionally removing stopwords.

Although this step might seem redundant for this specific dataset, it sets a precedent for handling raw text data that may be introduced into our system at a later stage.


In [30]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

# Build the stopword set once; calling stopwords.words() for every token is very slow
STOP_WORDS = set(stopwords.words('english'))

# Adjusting the preprocessing function
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and special characters, keeping spaces.
    # Note: \w also matches digits and underscores, so those are kept.
    text = re.sub(r"[^\w\s]", '', text)
    # Tokenize text
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]
    # Join the tokens back into a string with spaces
    return " ".join(tokens)

# Apply preprocessing to each article's text
df['processed_text'] = df['text'].apply(preprocess_text)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\adrco\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\adrco\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
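
To make the effect of the cleaning regex concrete, here is a minimal sketch on a made-up sentence (the sample string is an assumption, not taken from the dataset). It shows that `[^\w\s]` strips punctuation but keeps digits, which is why tokens such as `24hour` and `2021` can appear in the vocabulary. A stricter letters-only variant is included purely for comparison.

```python
import re

# Hypothetical raw sentence, used only for illustration
sample = "BTC hit $24,000 in 2021 (24-hour high)!"

# Same pattern as in preprocess_text: punctuation goes, digits stay
cleaned = re.sub(r"[^\w\s]", "", sample.lower())
print(cleaned)  # btc hit 24000 in 2021 24hour high

# Alternative (not used in this notebook): keep letters and whitespace only,
# then collapse the repeated spaces left behind by the removed digits
letters_only = " ".join(re.sub(r"[^a-z\s]", "", sample.lower()).split())
print(letters_only)  # btc hit in hour high
```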


## Applying TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that reflects how important a word is to a document in a collection or corpus. We'll use it to convert our textual data into a format that's suitable for topic modeling, focusing on the `processed_text` data.
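
As a rough illustration of how these weights arise, the sketch below computes TF-IDF by hand on three made-up documents. It assumes the smoothed IDF formula and L2 row normalization that scikit-learn's `TfidfVectorizer` applies by default; the toy corpus and helper names are hypothetical.

```python
import math
from collections import Counter

# Three toy "articles" (hypothetical, already preprocessed)
docs = [
    "bitcoin price rally",
    "bitcoin mining difficulty",
    "bitcoin price drop price",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    # Raw term count times IDF, then scaled so the vector has unit L2 norm
    counts = Counter(tokens)
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the first document, `rally` (found in one document) receives a higher weight than `price` (two documents), which in turn outweighs `bitcoin` (all three documents).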

In [31]:
# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.90, min_df=50, stop_words='english')

# Apply TF-IDF to the processed text
tfidf_matrix = tfidf_vectorizer.fit_transform(df['processed_text'])

# View shape of TF-IDF matrix
print(f"TF-IDF Matrix Shape: {tfidf_matrix.shape}")


TF-IDF Matrix Shape: (8113, 571)


### Interpretation of TF-IDF Matrix Shape

The TF-IDF matrix has a shape of (8113, 571): 8113 rows, one per cryptocurrency news article, and 571 columns, one per term retained in the vocabulary. A term is retained only if it appears in at least 50 documents (`min_df=50`) and in no more than 90% of them (`max_df=0.90`). Each value is the term's weight in that document: its frequency within the document, scaled down by how common the term is across the corpus. This keeps the vocabulary small enough to be manageable while still covering the recurring themes in the dataset, and it is the input to the topic modeling that follows.

## Latent Dirichlet Allocation (LDA) for Topic Modeling

LDA is a statistical model that explains sets of observations through unobserved groups, which account for why some parts of the data are similar. For topic modeling, the observations are the words in each article and the unobserved groups are topics: each article is treated as a mixture of topics, and each topic as a distribution over words. We fit it on the TF-IDF transformed `processed_text` data to discover the topics that run through the cryptocurrency news articles.

In [32]:
# Initialize LDA Model
lda_model = LatentDirichletAllocation(n_components=15, learning_decay=0.7, random_state=42, verbose=10, max_iter=20)


# Fit LDA model to the TF-IDF matrix
lda_model.fit(tfidf_matrix)

# Function to display topics
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx + 1}:")  # Adjust topic index for display
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

# Displaying the topics
display_topics(lda_model, tfidf_vectorizer.get_feature_names_out(), 10)


[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:    0.6s

iteration: 1 of max_iter: 20

[... log lines for iterations 2 to 19 omitted; each step took between 0.4s and 0.6s ...]

iteration: 20 of max_iter: 20

[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:    0.3s


Topic 1:
bank news central crypto currency country stated president claims legal
Topic 2:
bitcoin btc high number miners addresses transactions time mining 2021
Topic 3:
saying added tax business claimed explained sector used impt public
Topic 4:
digital bitcoin company investment mining report asset assets firm services
Topic 5:
24 hours value price day market cryptocurrencies increased dropped trade
Topic 6:
fed rate inflation rates federal reserve markets data expected year
Topic 7:
level support bullish resistance bearish break moving btc trend price
Topic 8:
crypto said south police money government korean exchange companies securities
Topic 9:
billion price volume prediction ethereum bitcoin btc trading 24hour current
Topic 10:
ftx exchange crypto collapse binance ceo look bankruptcy volume lets
Topic 11:
bitcoin market price bull bear bitcoins onchain indicators btc volatility
Topic 12:
bitcoin gold area price btc long key drop short dollar
Topic 13:
presale million stage token 

## Optimizing Topic Modeling 

After extensive experimentation with various parameters for TF-IDF vectorization and Latent Dirichlet Allocation (LDA) modeling, we have identified a configuration that appears to be well-suited for our project needs. Our goal was to uncover meaningful and distinct topics within a large dataset of cryptocurrency news articles, which required a careful balance between the breadth of vocabulary and the depth of topics extracted.

### Final Parameters

The final parameters that yielded the most relevant and interpretable topics were as follows:
- **TF-IDF Vectorization**: `max_df=0.90` and `min_df=50`, with English stop words excluded. This setup produced a TF-IDF matrix with a shape of (8113, 571), indicating that our corpus of 8113 documents was distilled down to 571 significant terms. These parameters helped in focusing on words that are prevalent enough to be informative but not so common as to be ubiquitous across documents.
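- **LDA Modeling**: We chose `n_components=15` to identify a diverse range of topics and `max_iter=20` to cap the number of passes over the corpus (an upper bound on training time, not a guarantee of convergence). `learning_decay` was set to 0.7; note that in scikit-learn this parameter only affects the online learning method, so with the default batch method used here it has no effect on the fit. The `random_state` was set to 42 for reproducibility, and `verbose=10` for detailed logging during the model fitting process.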
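
To illustrate how `min_df` and `max_df` prune the vocabulary, here is a minimal sketch of the document-frequency filter on a toy corpus. The function name `filter_vocabulary` and the corpus are hypothetical; the thresholds follow the convention used above, where an integer `min_df` is an absolute document count and a float `max_df` is a proportion of documents.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, min_df, max_df):
    """Keep terms whose document frequency lies in [min_df, max_df * n_docs]."""
    n_docs = len(tokenized_docs)
    # Count each term at most once per document
    doc_freq = Counter(t for tokens in tokenized_docs for t in set(tokens))
    return sorted(t for t, c in doc_freq.items() if min_df <= c <= max_df * n_docs)

# Toy corpus of 10 hypothetical documents
corpus = (
    [["crypto", "bitcoin"]] * 6
    + [["crypto", "fed"]] * 3
    + [["crypto", "presale"]]
)

# "crypto" is in all 10 documents (above 90%), "presale" in only 1 (below 2)
print(filter_vocabulary(corpus, min_df=2, max_df=0.90))  # ['bitcoin', 'fed']
```

The same logic at the scale of this project (`min_df=50`, `max_df=0.90` over 8113 documents) is what reduces the vocabulary to 571 terms.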

### Interpretation of Model Outputs

The LDA model revealed 15 topics, each representing a different aspect of the cryptocurrency domain. We can group several of them into 5 overarching themes based on their content:

- **Topics Overview**:
  - **Regulatory and Legal Aspects**: Topics like banking regulations and legal claims within the crypto space (Topic 1).
  - **Technical and Market Analysis**: Deep dives into bitcoin mining, transaction analysis, and market movements (Topics 2, 5, 7, 11).
  - **Economic Factors**: Examination of federal rates, inflation, and their impact on cryptocurrencies (Topic 6).
  - **Specific Events and Sectors**: Insight into significant events like the FTX collapse (Topic 10) and a focus on sectors such as digital asset investment (Topic 4).
  - **Innovation and Development**: Discussions on blockchain technology and its future developments (Topic 14).

The topics identified provide a comprehensive overview of the cryptocurrency news landscape, from economic and regulatory discussions to technological advancements and market analysis.


The chosen parameters for TF-IDF vectorization and LDA modeling produced topics that are interpretable and that cover a broad range of discussions within the cryptocurrency domain, which makes them useful inputs for the later stages of the project.

### Saving LDA Model & TF-IDF Vectorizer

In [33]:
import joblib

# Define the path for saving the LDA model
lda_model_path = "C:/Users/adrco/Final_Project-env/LDA_Model/lda_model.pkl"

# Save the LDA model
joblib.dump(lda_model, lda_model_path)

# Print confirmation
print(f"'lda_model.pkl' saved in {lda_model_path}")



'lda_model.pkl' saved in C:/Users/adrco/Final_Project-env/LDA_Model/lda_model.pkl


In [34]:
# Define the path for saving the TF-IDF Vectorizer
tfidf_vectorizer_path = "C:/Users/adrco/Final_Project-env/TF-IDF_Vectorizer/tfidf_vectorizer.pkl"

# Save the TF-IDF Vectorizer
joblib.dump(tfidf_vectorizer, tfidf_vectorizer_path)

# Print confirmation
print(f"'tfidf_vectorizer.pkl' saved in {tfidf_vectorizer_path}")


'tfidf_vectorizer.pkl' saved in C:/Users/adrco/Final_Project-env/TF-IDF_Vectorizer/tfidf_vectorizer.pkl


### Saving the Document-Topic Distribution Matrix

The document-topic distribution matrix, which indicates the distribution of topics across our articles, is saved for easy retrieval. This matrix is fundamental for further analysis and understanding the prevalence of topics within our dataset.

In [35]:
import pandas as pd

# Generate the document-topic distribution matrix
doc_topic_dist = lda_model.transform(tfidf_matrix)

# Convert document-topic distribution matrix to DataFrame
doc_topic_dist_df = pd.DataFrame(doc_topic_dist)


[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:    0.3s
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:    0.3s
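
One simple way to read such a matrix is to average each column, which gives a rough measure of how prevalent each topic is across the corpus. The sketch below does this on a small made-up matrix; the values are hypothetical and assume, as with the output of `lda_model.transform`, that each row sums to approximately 1.

```python
# Hypothetical document-topic matrix: 4 documents x 3 topics
doc_topic = [
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.60, 0.30, 0.10],
    [0.20, 0.20, 0.60],
]

n_docs = len(doc_topic)
n_topics = len(doc_topic[0])

# Mean weight of each topic across documents (topic numbering starts at 1)
prevalence = {
    f"Topic_{k + 1}": sum(row[k] for row in doc_topic) / n_docs
    for k in range(n_topics)
}

# Print topics from most to least prevalent
for name, share in sorted(prevalence.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share:.3f}")
```

With a pandas DataFrame such as `doc_topic_dist_df`, the equivalent would be a column-wise mean.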


### Append Dominant Topic to DataFrame

The dominant topic is determined by identifying the topic with the highest weight for each document in the topic distribution matrix produced by the LDA model.

In [36]:
import numpy as np

# Identifying and appending the dominant topic for each document
dominant_topic = np.argmax(doc_topic_dist, axis=1)
df['Dominant_Topic'] = dominant_topic + 1 
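
The `+ 1` shifts the zero-based index returned by `argmax` to the one-based topic numbering used when the topics are displayed. The following sketch reproduces the logic on a made-up matrix in plain Python; the helper name `dominant_topic_of` and the values are hypothetical.

```python
# Hypothetical document-topic weights: 3 documents x 3 topics
doc_topic = [
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.45, 0.45, 0.10],  # tie between the first two topics
]

def dominant_topic_of(row):
    # 0-based index of the largest weight; on ties the first index wins,
    # which matches the behaviour of np.argmax
    best = max(range(len(row)), key=row.__getitem__)
    # Shift to the 1-based numbering used for display
    return best + 1

print([dominant_topic_of(row) for row in doc_topic])  # [1, 2, 1]
```

The tie in the last row is a reminder that the dominant topic alone hides how close the runner-up was, which is one reason the full weight columns are worth keeping as well.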


### Incorporate Topic Distribution Weights into DataFrame

Next, we include the distribution weights of all topics for each article as new columns in the DataFrame. These weights represent the proportion of each topic within an article, providing a detailed view of the article's thematic structure.

In [37]:
# Adding topic distribution weights as new columns to the DataFrame
# (the topic count is read from the fitted model instead of being hard-coded)
topic_weight_columns = [f'Topic_{i + 1}_Weight' for i in range(lda_model.n_components)]
topic_weights_df = pd.DataFrame(doc_topic_dist, columns=topic_weight_columns)
enhanced_df = pd.concat([df, topic_weights_df], axis=1)

### Save the Enhanced DataFrame

In [38]:
# Saving the enhanced DataFrame with topic information to CSV in the specified dataset path
enhanced_df.to_csv(dataset_path + 'enhanced_dataset_with_topics.csv', index=False)

print(f"enhanced_df saved as 'enhanced_dataset_with_topics.csv' in {dataset_path}")

# Saving the document-topic distribution matrix
doc_topic_dist_df.to_csv(dataset_path + 'doc_topic_distribution.csv', index=False)

print(f"doc_topic_dist_df saved as 'doc_topic_distribution.csv' in {dataset_path}")

enhanced_df saved as 'enhanced_dataset_with_topics.csv' in C:/Users/adrco/Final_Project-env/Datasets/
doc_topic_dist_df saved as 'doc_topic_distribution.csv' in C:/Users/adrco/Final_Project-env/Datasets/


The enhanced DataFrame, which now includes the dominant topic for each document along with the topic distribution weights, is saved to our project's dataset directory. This gives us a permanent record of the topic modeling features added to the dataset, so they can be loaded directly in subsequent stages of the project.

### Documenting Topics and Their Top Words

For reference and further analysis, we document the topics identified by our LDA model along with their top words. Saving this information to a text file provides a quick overview of each topic's main themes, aiding in the interpretation and communication of our topic modeling results.


In [39]:
# Save the topics and their top words to a text file
topics_summary_path = dataset_path + 'topics_summary.txt'
# Fetch the vocabulary once instead of on every loop iteration
feature_names = tfidf_vectorizer.get_feature_names_out()
with open(topics_summary_path, 'w', encoding='utf-8') as f:
    for i, topic in enumerate(lda_model.components_):
        # Indices of the 10 highest-weighted terms, in descending order
        top_features_ind = topic.argsort()[:-11:-1]
        top_features = [feature_names[j] for j in top_features_ind]
        # Adjust topic numbering to start from 1
        topic_str = "Topic {}: {}\n".format(i + 1, ' '.join(top_features))
        f.write(topic_str)

# Print confirmation
print(f"Topics summary saved to {topics_summary_path}")


Topics summary saved to C:/Users/adrco/Final_Project-env/Datasets/topics_summary.txt


### Mapping Numeric Topics to Descriptive Labels

To enhance interpretability of the dominant topics in our dataset, we will map the numeric topic identifiers to descriptive labels. This step transforms the abstract topic numbers into meaningful labels that reflect the primary theme or subject matter of each topic, facilitating easier analysis and communication of the results.


### Determination of Topic Names

The names assigned to each topic were derived from analyzing the top representative words identified by the Latent Dirichlet Allocation (LDA) model. These names are intended to capture the essence of the themes or subject matters that are prevalent within each topic, based on the clustering of similar words. Here is how we labeled each topic from 1 to 15, with a brief description of their focus areas:

- **Topic 1 - Banking and Legal Aspects**: Focuses on banks, legal claims, and regulatory statements in the crypto space.
- **Topic 2 - Bitcoin Mining**: Covers aspects of Bitcoin mining, including miner activity and transaction details.
- **Topic 3 - Business and Taxation**: Related to business operations, taxation, and public sector involvement in crypto.
- **Topic 4 - Investment and Digital Assets**: Discusses company investments in digital assets, including asset management and services.
- **Topic 5 - Market Dynamics**: Captures short-term market movements, including daily price changes and trading volume.
- **Topic 6 - Economic Indicators**: Concerns macroeconomic indicators such as inflation, interest rates, and Federal Reserve policy.
- **Topic 7 - Technical Analysis**: Deals with technical analysis indicators like support/resistance levels and price trends.
- **Topic 8 - Government and Security**: Focuses on government actions, security issues, and regulatory responses to crypto.
- **Topic 9 - Market Capitalization and Trading**: Involves market cap, trading volumes, and price predictions for cryptocurrencies.
- **Topic 10 - Exchanges and Crises**: Pertains to crypto exchanges, notable crises like the FTX collapse, and their implications.
- **Topic 11 - Market Sentiment**: Reflects on overall market sentiment, including bull/bear market indicators and volatility.
- **Topic 12 - Price Comparisons**: Discusses Bitcoin's price movements in relation to other assets like gold.
- **Topic 13 - Token Sales and Funding**: Covers topics related to presales, token fundraising, and initial coin offerings (ICOs).
- **Topic 14 - Blockchain Technology**: Focuses on blockchain technology developments, user engagement, and future prospects.
- **Topic 15 - Trading Platforms**: Relates to the use of trading platforms, social trading, and decision-making tools for traders.

This mapping from topics to descriptive labels enhances the interpretability of our dataset, facilitating clearer communication of findings and supporting more nuanced analysis.


In [40]:
# Mapping of topic numbers to descriptive labels 
topic_labels = {
    1: "Banking and Legal Aspects",
    2: "Bitcoin Mining",
    3: "Business and Taxation",
    4: "Investment and Digital Assets",
    5: "Market Dynamics",
    6: "Economic Indicators",
    7: "Technical Analysis",
    8: "Government and Security",
    9: "Market Capitalization and Trading",
    10: "Exchanges and Crises",
    11: "Market Sentiment",
    12: "Price Comparisons",
    13: "Token Sales and Funding",
    14: "Blockchain Technology",
    15: "Trading Platforms"
}

# Apply the mapping to replace numeric topic identifiers with descriptive labels
enhanced_df['Dominant_Topic_Label'] = enhanced_df['Dominant_Topic'].map(topic_labels)


In [41]:
# Saving the updated enhanced DataFrame with adjusted topic numbering and descriptive labels
enhanced_df.to_csv(dataset_path + 'enhanced_dataset_with_topics.csv', index=False)

# Print confirmation
print("Enhanced dataset with adjusted topic numbering and descriptive labels saved successfully.")


Enhanced dataset with adjusted topic numbering and descriptive labels saved successfully.


# Sentiment Analysis with FinBERT

In this section, we will perform sentiment analysis on the "processed_text" column of our dataset using FinBERT. FinBERT is a pre-trained NLP model specialized for financial texts. This analysis will help us understand the overall sentiment (positive, neutral, negative) conveyed in cryptocurrency news articles. The sentiment scores will be stored in a new column named "finBERT_sentiment_score", with values -1 (negative), 0 (neutral), and 1 (positive).


In [42]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from tqdm.auto import tqdm

# Path to the enhanced dataset file (kept separate from the `dataset_path` directory variable)
enhanced_file = "C:/Users/adrco/Final_Project-env/Datasets/enhanced_dataset_with_topics.csv"

# Load the dataset
enhanced_df = pd.read_csv(enhanced_file)

# Display the first few rows to ensure it's loaded correctly
print(enhanced_df.head())


     datetime                                               text  \
0  2022-10-14  despite fact blockchainbased carbon credit mar...   
1  2022-10-14  trader gained huge kudos space predicting drop...   
2  2022-10-14  always worked sticking plan clear invalidation...   
3  2022-10-14  fact broke level system giving bullish signals...   
4  2022-10-14  demand coming confirms theres fuel keep going ...   

                                                 url  price_momentums  \
0  https://cryptonews.com/news/bitcoin-price-and-...                1   
1  https://cryptonews.com/news/bitcoin-price-pred...                1   
2  https://cryptonews.com/news/bitcoin-price-pred...                1   
3  https://cryptonews.com/news/bitcoin-price-pred...                1   
4  https://cryptonews.com/news/bitcoin-price-pred...                1   

                                      processed_text  Dominant_Topic  \
0  despite fact blockchainbased carbon credit mar...               2   
1  trade

### Initializing FinBERT

Before analyzing the sentiments of our dataset, we need to initialize the FinBERT model and tokenizer. These components are crucial for preparing our text data (tokenization) and for performing the sentiment analysis through the model's pipeline.


In [43]:
# Initialize tokenizer and model for FinBERT
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

# Initialize the sentiment analysis pipeline
finbert_pipeline = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)


### Performing Sentiment Analysis

We will now apply the FinBERT model to analyze sentiments of the processed texts. The sentiment output from FinBERT will be mapped to numerical scores: -1 for negative, 0 for neutral, and 1 for positive sentiments. This process enriches our dataset with valuable sentiment insights directly relevant to our analysis of cryptocurrency market dynamics.


In [44]:
# Map FinBERT labels to numerical scores (defined once, outside the function)
sentiment_mapping = {'positive': 1, 'neutral': 0, 'negative': -1}

# Define a function to apply FinBERT sentiment analysis and map to numerical scores
def analyze_sentiment(text):
    try:
        # Truncate to the model's 512-token limit so that long articles do not fail
        result = finbert_pipeline(text, truncation=True, max_length=512)[0]
        return sentiment_mapping[result['label']]
    except Exception as e:
        # Print the error and the text for better debugging
        print(f"Error analyzing text: {text}\nException: {e}")
        # Return None in case of an error
        return None

# Apply sentiment analysis to the 'processed_text' column with progress bar
tqdm.pandas(desc="Analyzing Sentiments")
enhanced_df['finBERT_sentiment_score'] = enhanced_df['processed_text'].progress_apply(analyze_sentiment)

# Display the first few rows to verify the sentiment scores
print(enhanced_df.head())


Analyzing Sentiments:   0%|          | 0/8113 [00:00<?, ?it/s]

     datetime                                               text  \
0  2022-10-14  despite fact blockchainbased carbon credit mar...   
1  2022-10-14  trader gained huge kudos space predicting drop...   
2  2022-10-14  always worked sticking plan clear invalidation...   
3  2022-10-14  fact broke level system giving bullish signals...   
4  2022-10-14  demand coming confirms theres fuel keep going ...   

                                                 url  price_momentums  \
0  https://cryptonews.com/news/bitcoin-price-and-...                1   
1  https://cryptonews.com/news/bitcoin-price-pred...                1   
2  https://cryptonews.com/news/bitcoin-price-pred...                1   
3  https://cryptonews.com/news/bitcoin-price-pred...                1   
4  https://cryptonews.com/news/bitcoin-price-pred...                1   

                                      processed_text  Dominant_Topic  \
0  despite fact blockchainbased carbon credit mar...               2   
1  trade

# Aspect-Based Sentiment Analysis

This section focuses on extracting aspects from the cryptocurrency news articles and analyzing the sentiment of each aspect using FinBERT. Aspect-based sentiment analysis allows us to understand the sentiment towards specific entities or topics within the text, providing deeper insights into the dataset. We will use spaCy for aspect extraction and then apply FinBERT to assess the sentiment of sentences or fragments containing those aspects.


In [45]:
import pandas as pd
import spacy
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from tqdm.auto import tqdm
tqdm.pandas()

# Initialize tokenizer and model for FinBERT
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

# Initialize the sentiment analysis pipeline with FinBERT
finbert_pipeline = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)


### Defining the Aspect Sentiment Analysis Function

We will define a function to perform two main tasks: extract aspects using spaCy and analyze the sentiment associated with these aspects using FinBERT. The function processes each text and identifies nouns and proper nouns as aspects. For each aspect, it replaces the aspect with a placeholder and scores the resulting text with FinBERT, so each score reflects the sentiment of the text surrounding the aspect. The output is a list of tuples, each containing an aspect and its associated sentiment score.


In [46]:
# Load the spaCy model once; loading it inside the function would repeat the work for every row
nlp = spacy.load("en_core_web_sm")

# Map FinBERT labels to numerical scores
sentiment_mapping = {'positive': 1, 'neutral': 0, 'negative': -1}

def extract_aspect_sentiments(text):
    # Extract aspects (nouns and proper nouns) from the text
    doc = nlp(text)
    aspects = [token.text for token in doc if token.pos_ in ["NOUN", "PROPN"]]

    # Initialize a list to hold the sentiment scores for each aspect
    aspect_sentiments = []

    # Analyze sentiment for each aspect
    for aspect in aspects:
        # Replace the aspect in the text with a placeholder for targeted sentiment analysis.
        # Note: str.replace matches substrings, so "coin" also masks part of "bitcoin".
        modified_text = text.replace(aspect, "<aspect>")
        # Truncate to the model's 512-token limit so that long articles do not fail
        sentiment_result = finbert_pipeline(modified_text, truncation=True, max_length=512)[0]

        # Map the sentiment label to a numerical score
        sentiment_score = sentiment_mapping[sentiment_result['label']]

        # Append the aspect and its sentiment score to the list
        aspect_sentiments.append((aspect, sentiment_score))

    return aspect_sentiments

# Apply the function to each row in the dataset
tqdm.pandas(desc="Extracting Aspect Sentiments")
enhanced_df['aspect_sentiments'] = enhanced_df['processed_text'].progress_apply(extract_aspect_sentiments)

Extracting Aspect Sentiments:   0%|          | 0/8113 [00:00<?, ?it/s]

In [47]:
# Display the first few rows to verify 
print(enhanced_df.head())

     datetime                                               text  \
0  2022-10-14  despite fact blockchainbased carbon credit mar...   
1  2022-10-14  trader gained huge kudos space predicting drop...   
2  2022-10-14  always worked sticking plan clear invalidation...   
3  2022-10-14  fact broke level system giving bullish signals...   
4  2022-10-14  demand coming confirms theres fuel keep going ...   

                                                 url  price_momentums  \
0  https://cryptonews.com/news/bitcoin-price-and-...                1   
1  https://cryptonews.com/news/bitcoin-price-pred...                1   
2  https://cryptonews.com/news/bitcoin-price-pred...                1   
3  https://cryptonews.com/news/bitcoin-price-pred...                1   
4  https://cryptonews.com/news/bitcoin-price-pred...                1   

                                      processed_text  Dominant_Topic  \
0  despite fact blockchainbased carbon credit mar...               2   
1  trade

In [48]:

dataset_path = "C:/Users/adrco/Final_Project-env/Datasets/"  # Re-asserting path to avoid reloading all previous notebooks

# Directly save the DataFrame with the new filename
enhanced_df.to_csv(dataset_path + "enhanced_dataset.csv", index=False)

# Print confirmation
print(f"Enhanced dataset saved as 'enhanced_dataset.csv' in {dataset_path}")



Enhanced dataset saved as 'enhanced_dataset.csv' in C:/Users/adrco/Final_Project-env/Datasets/
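
One thing to keep in mind when this file is loaded again: a CSV cell cannot hold a Python list, so the `aspect_sentiments` column is written as its string representation and comes back as plain text. The sketch below shows the round trip with the standard library only, using a made-up row; `ast.literal_eval` is one safe way to turn the string back into a list of tuples.

```python
import ast
import csv
import io

# A hypothetical value, shaped like the output of extract_aspect_sentiments
aspect_sentiments = [("bitcoin", 1), ("fed", -1), ("market", 0)]

# Writing to CSV stores the list as its string representation
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["aspect_sentiments"])
writer.writerow([str(aspect_sentiments)])

# Reading it back yields a plain string, not a list
buffer.seek(0)
rows = list(csv.reader(buffer))
raw_value = rows[1][0]
print(type(raw_value).__name__)  # str

# ast.literal_eval parses Python literals only, so it is safe on untrusted text
restored = ast.literal_eval(raw_value)
print(restored == aspect_sentiments)  # True
```

With pandas, the same idea would be to apply `ast.literal_eval` to the column after `pd.read_csv`.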


## Generating Word Embeddings with Word2Vec

Word embeddings represent text in numerical form while capturing the context and semantic relationships between words. By using Word2Vec, we can convert the processed text from our articles into vectors that encode these relationships. These vectors can then be used as features in machine learning models, potentially improving our ability to predict cryptocurrency price momentum from the content of news articles. In this section, we will train a Word2Vec model on our dataset and explore how to use these embeddings as features for our predictive modeling tasks.


### Optimizing Word2Vec Parameters for Cryptocurrency News

To enhance the quality of word embeddings generated from our cryptocurrency news articles dataset, we've carefully selected a set of Word2Vec parameters tailored to the characteristics of our text data. These optimized parameters aim to capture the nuanced semantic relationships and specific vocabulary present in the financial domain, particularly in cryptocurrency news. Here's a brief overview of the chosen parameters and their significance:

- **`vector_size=300`:** We increased the dimensionality of the word vectors to 300. This higher dimensionality allows the model to capture a richer set of semantic relationships and nuances in the text, which is particularly beneficial for the specialized vocabulary of cryptocurrency news.

- **`window=10`:** The window size was set to 10, allowing the model to consider a broader context when predicting words. This is crucial for capturing the complex relationships between terms that are further apart in the text, enhancing the embeddings' ability to reflect the context-dependent meanings of words in financial news.

- **`min_count=2`:** We set the minimum word frequency threshold to 2, ignoring words that appear only once in the dataset. This helps focus the model on more relevant and recurrent vocabulary, reducing noise from very rare terms and improving the overall quality of the embeddings.

- **`sg=1`:** We chose the Skip-gram architecture over CBOW because Skip-gram is generally more effective for datasets with specialized vocabularies and rare words, common in cryptocurrency news. Skip-gram focuses on predicting context words from target words, which helps in learning high-quality embeddings for less frequent terms.

- **`workers=4`:** The number of worker threads was set to 4 to leverage multi-core processing and speed up training. Adjust this number to match your machine's CPU cores. Note that with more than one worker, thread scheduling means training runs are not exactly reproducible, even with a fixed `seed`.

By adjusting these parameters, our Word2Vec model is better equipped to generate meaningful and informative word embeddings from our cryptocurrency news dataset, providing a solid foundation for subsequent predictive modeling tasks.
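To make the roles of `window` and `sg=1` more concrete, the sketch below enumerates the (target, context) pairs that a Skip-gram model would try to predict for one short token list. It is a simplified illustration, not gensim's implementation: `skipgram_pairs` and the token list are hypothetical, and real training also applies steps such as negative sampling and, by default, a randomly reduced effective window.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs using a fixed symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

# Hypothetical token list, not taken from the dataset
tokens = ["bitcoin", "price", "broke", "resistance", "level"]

pairs_small = skipgram_pairs(tokens, window=1)
pairs_large = skipgram_pairs(tokens, window=10)

# A larger window links words that sit further apart in the text
print(len(pairs_small), len(pairs_large))
print(("bitcoin", "level") in pairs_small, ("bitcoin", "level") in pairs_large)
```

With a window of 1 only adjacent words are paired, while a window of 10 pairs every word in this short list with every other word, which is the "broader context" effect described above.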


In [49]:
from gensim.models import Word2Vec
import nltk
from nltk.tokenize import word_tokenize

# Ensure NLTK's tokenizer is available
nltk.download('punkt')

# The 'processed_text' column in 'enhanced_df' holds plain strings, not lists of tokens,
# so we tokenize each string into a list of words.
# This prepares the dataset for training the Word2Vec model.
enhanced_df['tokenized_text'] = enhanced_df['processed_text'].apply(word_tokenize)

# Training the Word2Vec model with optimized parameters. This model will learn word embeddings from our tokenized texts,
# capturing the semantic relationships between words based on their co-occurrences in the dataset.
model = Word2Vec(sentences=enhanced_df['tokenized_text'], vector_size=300, window=10, min_count=2, sg=1, workers=4)

# Summarize the trained model to verify its configuration and the vocabulary size.
print(model)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\adrco\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Word2Vec<vocab=8084, vector_size=300, alpha=0.025>


### Interpreting the Word2Vec Model Summary

The output of our Word2Vec model training provides a concise summary of the model's characteristics. Specifically, the summary `Word2Vec<vocab=8084, vector_size=300, alpha=0.025>` indicates several key aspects of the trained model:

- **`vocab=8084`**: This number represents the size of the model's vocabulary. Our Word2Vec model has learned embeddings for 8,084 unique words from the `processed_text` column, counting only words that occur at least twice because of `min_count=2`. This vocabulary size reflects the diversity of the textual content we're analyzing.

- **`vector_size=300`**: This parameter indicates the dimensionality of the word vectors generated by the model. Each word in the model's vocabulary is represented as a 300-dimensional vector. A higher dimensionality allows the model to capture more nuanced semantic relationships between words, though it also increases the model's complexity and memory requirements.

- **`alpha=0.025`**: This is the initial learning rate for the training algorithm. The learning rate controls how much the model's weights are adjusted at each update with respect to the loss gradient. 0.025 is gensim's default starting value. During training, the rate is not tuned based on model performance; it decays linearly from `alpha` toward `min_alpha` (0.0001 by default) as training progresses.

In summary, the model's output gives us insight into the scale of our text data (through the vocabulary size), the complexity of the word embeddings (via the vector size), and a glimpse into the training process (through the initial learning rate). With this model, we're now equipped to explore the semantic space of cryptocurrency news articles, leveraging these embeddings for downstream tasks such as sentiment analysis, clustering, or predictive modeling.
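The summary only reports the starting learning rate. As a rough illustration of how that rate changes over a run, the sketch below assumes gensim's documented default behaviour of a linear decay from `alpha` to `min_alpha=0.0001`. The helper `linear_alpha` is hypothetical and is not part of gensim, which tracks progress internally.

```python
def linear_alpha(progress, alpha=0.025, min_alpha=0.0001):
    """Learning rate after a given fraction of training (0.0 to 1.0)."""
    progress = min(max(progress, 0.0), 1.0)  # clamp to the valid range
    return alpha - (alpha - min_alpha) * progress

# Learning rate at the start, midpoint, and end of training
for p in (0.0, 0.5, 1.0):
    print(f"progress={p:.1f}  alpha={linear_alpha(p):.5f}")
```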


### Saving the Model

In [50]:
# Define the path where you want to save the Word2Vec model
model_save_path = "C:\\Users\\adrco\\Final_Project-env\\Word2Vec_Model\\word2vec_model.model"

# Saving the model
model.save(model_save_path)

# Print confirmation 
print(f"Word2Vec model saved to {model_save_path}")


Word2Vec model saved to C:\Users\adrco\Final_Project-env\Word2Vec_Model\word2vec_model.model


## Preparing Word Embeddings for Machine Learning

Having trained our Word2Vec model on the cryptocurrency news articles, we now possess a rich set of word embeddings that capture the semantic nuances of the financial domain. Each word in the model's vocabulary is represented as a 300-dimensional vector, encapsulating its context and relationships with other words.

The next step involves leveraging these embeddings to represent entire articles, not just individual words, in a format suitable for machine learning models. Since models require fixed-size input vectors, we cannot directly feed them lists of varying-length word vectors. To address this, we'll aggregate word vectors within an article to create a single, fixed-size vector representation for the entire text.

### Average Word Vectors

A straightforward and effective approach to achieve this is by calculating the average word vector for each article. This method involves summing the vectors of all words in an article and dividing by the number of words, resulting in a single vector that represents the semantic essence of the text. This average vector can then be used as a feature set for predictive modeling tasks, such as forecasting cryptocurrency price movements based on news content.

In the following code cell, we define a function that computes the average word vector for a list of words (tokens) from our processed text. This function ensures that only words present in our Word2Vec model's vocabulary are considered, avoiding any errors due to out-of-vocabulary words.


In [51]:
import numpy as np

def average_word_vectors(words, model, vocabulary, vector_size):
    feature_vec = np.zeros((vector_size,), dtype="float32")  # Initialize a zero vector of the same dimensionality as Word2Vec vectors
    nwords = 0

    for word in words:
        if word in vocabulary:  # Check if the word is in the Word2Vec model's vocabulary
            nwords += 1
            feature_vec = np.add(feature_vec, model.wv[word])  # Add the word's vector to the feature vector
    
    if nwords > 0:
        feature_vec = np.divide(feature_vec, nwords)  # Average the feature vector by the number of words
    return feature_vec
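To see what this averaging produces on a concrete input, here is a small self-contained sketch. It uses a hypothetical three-dimensional embedding table (`toy_vectors`) in place of `model.wv`, and a compact helper (`toy_average`) that follows the same logic as the function above: skip out-of-vocabulary tokens and fall back to a zero vector when nothing matches.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; real vectors come from model.wv
toy_vectors = {
    "bitcoin": np.array([1.0, 0.0, 2.0], dtype="float32"),
    "rally": np.array([3.0, 2.0, 0.0], dtype="float32"),
}

def toy_average(tokens, vectors, vector_size):
    """Mean of the vectors of in-vocabulary tokens, or zeros if none match."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros((vector_size,), dtype="float32")
    return np.mean(known, axis=0)

avg = toy_average(["bitcoin", "rally", "unseenword"], toy_vectors, 3)
empty = toy_average(["unseenword"], toy_vectors, 3)

print(avg)    # the out-of-vocabulary token is skipped
print(empty)  # all-zero fallback
```

One consequence worth keeping in mind is that an article with no in-vocabulary tokens ends up with an all-zero vector, so such rows may need separate handling before modeling.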


### Applying Word Embeddings to the Dataset

With our Word2Vec model trained and the vocabulary set prepared, we're now ready to transform our text data into a format that captures the semantic richness encoded in the word embeddings. By applying the average word vectors function to each article in our dataset, we consolidate the vast information contained in individual word vectors into a single, cohesive vector representation for each article. This process enables us to represent the entire textual content of an article with a fixed-size vector.


In [52]:
# Retrieve the model's vocabulary. It's important for filtering words when calculating averages.
vocabulary = set(model.wv.index_to_key)

# Vector size used in the Word2Vec model (read from the model so it stays in sync with training)
vector_size = model.vector_size

# Apply the function to each row in the DataFrame. This will create a new column with the averaged word vectors.
enhanced_df['word_vector'] = enhanced_df['tokenized_text'].apply(lambda x: average_word_vectors(x, model, vocabulary, vector_size))


### Saving the Enhanced Dataset

In [53]:

# Saving the enhanced DataFrame with Word2Vec features
enhanced_df.to_pickle(dataset_path + "enhanced_df_with_word_vectors.pkl")  # Using pickle to preserve the vector format

# Print confirmation 
print(f"Enhanced DataFrame with Word2Vec features saved to {dataset_path}")


Enhanced DataFrame with Word2Vec features saved to C:/Users/adrco/Final_Project-env/Datasets/
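
The choice of pickle over CSV matters for array-valued columns. The sketch below, which uses a hypothetical one-row DataFrame rather than `enhanced_df`, compares the two round trips in memory: CSV stores the array as its string representation, while pickle restores the original NumPy object.

```python
import io
import pickle

import numpy as np
import pandas as pd

# Hypothetical one-row frame with an array-valued column
demo = pd.DataFrame({"word_vector": [np.array([0.1, 0.2, 0.3], dtype="float32")]})

# CSV round trip: the array is written as its string representation
csv_buffer = io.StringIO()
demo.to_csv(csv_buffer, index=False)
csv_buffer.seek(0)
from_csv = pd.read_csv(csv_buffer)

# Pickle round trip: the Python objects are restored as they were
from_pickle = pickle.loads(pickle.dumps(demo))

print(type(from_csv["word_vector"].iloc[0]))
print(type(from_pickle["word_vector"].iloc[0]))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.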


### Verifying the Enhanced Dataset

In [54]:
# Display the first rows to verify the word vector column
print(enhanced_df.head())


     datetime                                               text  \
0  2022-10-14  despite fact blockchainbased carbon credit mar...   
1  2022-10-14  trader gained huge kudos space predicting drop...   
2  2022-10-14  always worked sticking plan clear invalidation...   
3  2022-10-14  fact broke level system giving bullish signals...   
4  2022-10-14  demand coming confirms theres fuel keep going ...   

                                                 url  price_momentums  \
0  https://cryptonews.com/news/bitcoin-price-and-...                1   
1  https://cryptonews.com/news/bitcoin-price-pred...                1   
2  https://cryptonews.com/news/bitcoin-price-pred...                1   
3  https://cryptonews.com/news/bitcoin-price-pred...                1   
4  https://cryptonews.com/news/bitcoin-price-pred...                1   

                                      processed_text  Dominant_Topic  \
0  despite fact blockchainbased carbon credit mar...               2   
1  trade

### Analyzing the Enhanced DataFrame with Word2Vec Features

- **`word_vector`:** This new column contains the average Word2Vec vector for each article. Each entry is a NumPy array of 300 numerical values, one per dimension of the vector space into which the Word2Vec model maps words. Articles with no in-vocabulary tokens receive an all-zero vector. These vectors encapsulate the semantic essence of the article's text, making them valuable features for machine learning models.

By adding these word vectors to our dataset, we've enriched it with quantitative features that capture the semantic properties of the articles' text. 
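
Most machine learning libraries expect a two-dimensional feature matrix rather than a column of arrays. One possible way to make that conversion is sketched below on a hypothetical stand-in DataFrame (`demo_df`, with 4-dimensional vectors instead of 300). The column names mirror `enhanced_df`, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for enhanced_df with 4-dimensional vectors
demo_df = pd.DataFrame({
    "word_vector": [
        np.ones(4, dtype="float32"),
        np.zeros(4, dtype="float32"),
        np.full(4, 0.5, dtype="float32"),
    ],
    "price_momentums": [1, 0, 1],
})

# Stack the per-article vectors into one (n_articles, vector_size) matrix
X = np.vstack(demo_df["word_vector"].tolist())
y = demo_df["price_momentums"].to_numpy()

print(X.shape, y.shape)
```

Applied to the real data, the same stacking step would yield one row per article and 300 columns.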

## Conclusion on Feature Engineering

The feature engineering stage of our project is complete. We have added several important features to our dataset, including FinBERT sentiment scores, aspect-based sentiments, Word2Vec embeddings, identified topics, and their corresponding weights. These features are intended to capture the detailed content and sentiment of cryptocurrency news articles, which are vital for our subsequent analysis.

### Next Steps

The next phase involves feature selection and comparative analysis, which will be conducted in a new notebook titled `ML_CryptoNews_FS`. Our aim is to identify the features that most effectively predict our target variable, `price_momentums`. We will use various feature selection techniques and tools to examine the relationship between our engineered features and the target variable. This step is crucial for optimizing our predictive models to accurately forecast price momentum based on news article content.