## **Problem Statement**

### Business Context

Stock prices of companies listed on a global exchange are influenced by many factors, chief among them the company's financial performance, innovations and collaborations, and market sentiment. News and media reports can rapidly affect investor perceptions and, consequently, stock prices in the highly competitive financial industry. With the sheer volume of news and opinions from a wide variety of sources, investors and financial analysts often struggle to stay updated and to interpret accurately how this news affects the market. As a result, investment firms need sophisticated tools to analyze market sentiment and integrate this information into their investment strategies.

### Problem Definition

With an ever-rising number of news articles and opinions, an investment startup aims to leverage artificial intelligence to address the challenge of interpreting stock-related news and its impact on stock prices. They have collected historical daily news for a specific company listed under NASDAQ, along with data on its daily stock price and trade volumes.

As a member of the Data Science and AI team in the startup, you have been tasked with analyzing the data, developing an AI-driven sentiment analysis system that will automatically process and analyze news articles to gauge market sentiment, and summarizing the news at a weekly level to enhance the accuracy of their stock price predictions and optimize investment strategies. This will empower their financial analysts with actionable insights, leading to more informed investment decisions and improved client outcomes.

### Data Dictionary

* `Date` : The date the news was released
* `News` : The content of news articles that could potentially affect the company's stock price
* `Open` : The stock price (in \$) at the beginning of the day
* `High` : The highest stock price (in \$) reached during the day
* `Low` :  The lowest stock price (in \$) reached during the day
* `Close` : The adjusted stock price (in \$) at the end of the day
* `Volume` : The number of shares traded during the day
* `Label` : The sentiment polarity of the news content
    * 1: positive
    * 0: neutral
    * -1: negative

## **Installing Necessary Libraries**

In [None]:
# Install the sentence-transformers library, which provides easy access to a variety of pre-trained transformer models.
# These models are optimized for creating high-quality sentence embeddings.
!pip install -U sentence-transformers -q

# Install the gensim library, a well-known toolkit for natural language processing that includes word embedding models such as Word2Vec.
!pip install -U gensim -q

# Install the transformers library, maintained by Hugging Face, for a wide range of pre-trained transformer-based models.
!pip install -U transformers -q

# Install tqdm, a Python package that provides progress bars for tracking long-running loops such as model training or batch inference.
!pip install -U tqdm -q

## **Importing Necessary Libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

## **Loading the dataset**

In [None]:
from google.colab import drive

try:
  # Force remount Google Drive to refresh the connection
  drive.mount('/content/drive', force_remount=True)
  print("Drive mounted successfully!")
except ValueError as e:
  print(f"Mount failed: {e}")

In [None]:
path='/content/drive/My Drive/AIMLTRAINING/collabdata/NLP/stock_news.csv'
df=pd.read_csv(path)

## **Data Overview**

In [None]:
# Display basic information about the DataFrame
print("Data Information:")
print("="*30)
print(df.info())

# Show summary statistics for numerical columns
print("\nData Description:")
print("="*30)
print(df.describe().T)

# Display shape of the dataset (number of rows and columns)
print("\nDataset Shape:")
print("="*30)
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

# Preview the first few rows of the dataset
print("\nFirst 5 Rows:")
print("="*30)
print(df.head())

# Check for any missing values in each column
print("\nMissing Values per Column:")
print("="*30)
print(df.isnull().sum())

# Count the number of duplicate rows in the dataset
print("\nNumber of Duplicate Rows:")
print("="*30)
print(df.duplicated().sum())


### Data Insights:

- **Data Information:**
  - The dataset contains 349 entries and 8 columns.
  - Columns include stock prices, volume, date, news, and sentiment label.

- **Data Description:**
  - Stock prices (Open, High, Low, Close) range from $36.25 to $68.81.
  - Average volume traded: ~128M, with a range up to 244M.
  - Sentiment label is mostly neutral with a mean of -0.05.

- **Dataset Shape:**
  - 349 rows, 8 columns (manageable for analysis).

- **First 5 Rows:**
  - Stock prices remain the same for multiple news articles on the same date.

- **Missing Values:**
  - No missing values in any column.

- **Duplicate Rows:**
  - No duplicate rows, ensuring data uniqueness.

## **Exploratory Data Analysis**

### Univariate Analysis

* Distribution of individual variables
* Compute and check the distribution of the length of news content

### **Date** Univariate Analysis
- Since the `Date` column is of object type, let's convert it to `datetime` for easier manipulation and analysis.

In [None]:
#convert Date column into 'date' type
df['Date']=pd.to_datetime(df['Date'])


#Check the data type of "Date"
print(df['Date'].dtype)

#print sample
df.head()

In [None]:
# Univariate analysis of the Date column
# Print basic information and key insights for the 'Date' column
print('Number of Unique dates: ', len(df['Date'].unique()))
print('Start and end date: ', df['Date'].min(), df['Date'].max())
print('Range of the date: ', df['Date'].max() - df['Date'].min())
print('Records whose date already appeared in an earlier row:', df['Date'].duplicated().sum())

# Print the top 5 days with the most records (value_counts already sorts in descending order)
print("Top 5 Days with the Most Records:")
print(df['Date'].value_counts().head(5))


In [None]:
# Group by Date and count the number of occurrences for each date
date_counts = df['Date'].value_counts().sort_values(ascending=False)

# Plotting a bar chart to visualize the top 10 days with the most records
plt.figure(figsize=(10, 6))
date_counts.head(10).plot(kind='bar', color='orange')
plt.title("Top 10 Days with the Most Records", fontsize=14)
plt.xlabel("Date", fontsize=12)
plt.ylabel("Number of Records", fontsize=12)
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()


### Date Column Insights:

1. The dataset spans **118 days** (from January 2 to April 30, 2019) with **71 unique dates**, so news is not available for every day in the period; weekends and market holidays, when no prices are recorded, account for part of the gap.

2. **January 3rd, 2019** has the highest number of records (28), suggesting significant events or noteworthy occurrences on that day.

3. A total of **278 records** (349 rows minus 71 unique dates) fall on a date that already appears in an earlier row, showing that multiple news entries per day are common.
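
The figure reported by `duplicated().sum()` is easy to confuse with the number of dates that carry several records. The sketch below separates the two on a small hypothetical `toy` frame (made-up dates, not the project data); the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: 6 records spread over 3 dates
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2019-01-02", "2019-01-02", "2019-01-02",
         "2019-01-03", "2019-01-03", "2019-01-04"]
    )
})

records_per_date = toy["Date"].value_counts()

# Rows whose date already appeared earlier: total rows minus unique dates
repeat_rows = int(toy["Date"].duplicated().sum())

# Dates that carry more than one record
dates_with_multiple = int((records_per_date > 1).sum())

# Rows that fall on such dates
rows_on_shared_dates = int(records_per_date[records_per_date > 1].sum())

print(repeat_rows, dates_with_multiple, rows_on_shared_dates)
```

On the real data, the first quantity corresponds to the 278 reported above; the other two would need to be computed separately.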

### **News** Univariate Analysis

In [None]:
from collections import Counter
from nltk.corpus import stopwords
import nltk
import pprint
df_news = pd.DataFrame()
# Calculate the length of each news article and create a new column 'news_length'
df_news['news_length'] = df['News'].apply(len)

# Show basic descriptive statistics for the 'news_length' column and print them in a readable format
print("Descriptive Statistics for 'news_length':")
print(df_news['news_length'].describe().to_string())

# Download the list of stopwords if not already downloaded
nltk.download('stopwords')

# Define the set of English stopwords to exclude common words from word count analysis
stop = set(stopwords.words('english'))

# Split news text into individual words, remove stop words, and count word frequencies
word_counts = Counter(
    word.lower() for text in df['News'] for word in text.split() if word.lower() not in stop
)

# Display the 10 most common words in the news articles with pretty print
print("\nTop 10 Most Common Words in News Articles:")
pprint.pprint(word_counts.most_common(10))


### `News` insights:

1. **Article Length Consistency**: The average length of news articles is around 311 characters (`len` applied to a string counts characters, not words), with most articles falling between 290 and 336 characters. This suggests a fairly consistent length, in line with a concise news format.

2. **Focus on Apple and Related Topics**: The word frequency shows "apple" as the most common term, appearing 152 times, with related words like "revenue" and "trade" also frequent. This indicates that Apple’s financial performance and market-related topics are central themes in the dataset.

3. **Emphasis on Market and Economic Conditions**: Other frequent terms, such as "stock," "U.S.," and "company," point to a broader focus on stock market trends and economic conditions that may impact Apple's business or the tech sector in general.
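
Two details of the counts above are worth checking on a small example: `len` on a string measures characters, and splitting on whitespace keeps punctuation attached to words (so "forecast," and "forecast" would be counted separately). The sketch below uses two made-up headlines and a tiny illustrative stop-word set in place of NLTK's list; the regular expression is one possible tokenization choice, not the one used in the notebook.

```python
import re
from collections import Counter

# Hypothetical headlines, not taken from the dataset
texts = [
    "Apple cut its revenue forecast, citing weak demand.",
    "Trade talks lift stocks; Apple shares rise.",
]

# Small illustrative stop-word set (the notebook uses NLTK's English list)
stop = {"its", "the", "a", "an", "and"}

char_lengths = [len(t) for t in texts]          # what .apply(len) measures
word_lengths = [len(t.split()) for t in texts]  # whitespace-delimited words

# Lowercase first, then keep only runs of letters, digits, and apostrophes,
# which drops trailing commas, semicolons, and periods
tokens = [
    w
    for t in texts
    for w in re.findall(r"[a-z0-9']+", t.lower())
    if w not in stop
]
word_counts = Counter(tokens)

print(char_lengths, word_lengths)
print(word_counts.most_common(3))
```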

In [None]:
# matplotlib, numpy, and pandas are already imported above; this cell only plots
# the numeric columns and the sentiment label, so no NLTK resources are needed here

# # Summaries for the Stock Price and Volume Columns
# print("Descriptive Statistics for Stock Prices and Volume:")
# print(df[['Open', 'High', 'Low', 'Close', 'Volume']].describe().to_string())

# Histograms and Summary for each Stock Price Column
stock_price_columns = ['Open', 'High', 'Low', 'Close']
for col in stock_price_columns:
    # print(f"\nData Summary for {col}:")
    # print(df[col].describe().to_string())  # Print data summary

    # Plot histogram
    plt.figure(figsize=(8, 5))
    ax = df[col].hist(bins=20, color='skyblue', edgecolor='black')
    plt.title(f"Distribution of {col} Price")
    plt.xlabel(f"{col} Price ($)")
    plt.ylabel("Frequency")

    # Add frequency count on top of each bar
    for p in ax.patches:
        ax.annotate(f'{p.get_height():.0f}', (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha='center', va='center', fontsize=10, color='black', xytext=(0, 10),
                    textcoords='offset points')

    plt.show()

# Volume Column Analysis
# print("\nData Summary for Volume:")
# print(df['Volume'].describe().to_string())  # Print data summary

# Histogram for Volume
plt.figure(figsize=(8, 5))
ax = df['Volume'].hist(bins=20, color='salmon', edgecolor='black')
plt.title("Distribution of Trade Volume")
plt.xlabel("Volume")
plt.ylabel("Frequency")

# Add frequency count on top of each bar
for p in ax.patches:
    ax.annotate(f'{p.get_height():.0f}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=10, color='black', xytext=(0, 10),
                textcoords='offset points')

plt.show()

# Box Plots for Stock Price Columns to Identify Outliers
for col in stock_price_columns:
    plt.figure(figsize=(8, 5))
    ax = df.boxplot(column=col)
    plt.title(f"{col} Price Box Plot")
    plt.ylabel("Price ($)")

    # Calculate the outliers using the IQR method
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Find outliers
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    # print(f"\nOutliers in {col}:")
    # print(outliers[[col]])

    plt.show()

# Sentiment Label Analysis
# value_counts() orders by frequency, so reindex to a fixed label order
# that matches the colors and tick labels used below
label_order = [1, 0, -1]
label_counts = df['Label'].value_counts().reindex(label_order)
# print(label_counts)

# Bar Chart for Sentiment Label Distribution
plt.figure(figsize=(6, 4))
ax = label_counts.plot(kind='bar', color=['green', 'gray', 'red'])
plt.title("Distribution of Sentiment Labels")
plt.xlabel("Sentiment Label")
plt.ylabel("Frequency")
plt.xticks(ticks=[0, 1, 2], labels=['Positive', 'Neutral', 'Negative'], rotation=0)

# Add counter at the top of each bar
for p in ax.patches:
    ax.annotate(f'{p.get_height():.0f}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=10, color='black', xytext=(0, 10),
                textcoords='offset points')

plt.show()


### **Insights for univariate analysis:**
**Open Price**:
   - The average opening price is **46.23**, with a standard deviation of **6.44**, indicating moderate volatility in the stock’s opening price over the period.
   - There are **outliers** at **66.82** and **64.33**, suggesting occasional price spikes that could be of interest for further analysis, as these prices are significantly higher than the 75th percentile of **50.71**.

**High Price**:
   - The mean high price is **46.70**, and the **max price** is **67.06**, indicating that the stock occasionally peaks at higher values, possibly due to market events or high demand.
   - Similar to the Open price, **outliers** are observed at **67.06** and **64.88**, signaling unusual market behavior during certain periods.

**Low Price**:
   - The low price has an average value of **45.75**, with prices typically fluctuating between **41.48** and **49.78**.
   - Outliers are visible at **65.86** and **62.29**, well above the typical range for low prices; like the other price columns, these point to a short period of unusually high prices rather than to price drops.

**Close Price**:
   - The average closing price is **44.93**, slightly lower than the other price columns. Because `Close` is an adjusted price (see the data dictionary), this gap may reflect the adjustment rather than a consistent decline toward the end of the trading session.
   - Outliers are at **64.81** and **62.57**, well above the typical closing range of **40.25** to **49.11**, indicating rare spikes in the stock's closing value.

**Volume:**
1. The mean trading volume is **128.95 million** shares with a standard deviation of **43.17 million**, indicating substantial variation in the volume of trades over the period.
2. The minimum volume observed is **45.45 million**, while the maximum volume peaks at **244.44 million**, signaling periods of both low and very high trading activity, which may be linked to key events affecting stock prices.

**Sentiment Labels:**
1. The sentiment distribution shows that **170 instances** (about **49%**) are labeled as **neutral** (Label 0), making it the most common class.
2. The **negative sentiment** label (Label -1) appears in **99 instances** (about **28%**), a notable share of news associated with pessimism. **Positive sentiment** (Label 1) is the least frequent, with **80 instances** (about **23%**), so the classes are moderately imbalanced.
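
The outliers quoted above come from the 1.5 × IQR rule computed in the box-plot cell. As a minimal sketch of that rule, the hypothetical helper `iqr_outliers` below flags values outside the fences; the price values are made up for illustration.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Hypothetical prices with one clear spike
prices = pd.Series([40.0, 42.0, 44.0, 45.0, 46.0, 48.0, 50.0, 66.0])
flagged = iqr_outliers(prices)
print(flagged.tolist())
```

Here Q1 is 43.5 and Q3 is 48.5, so the fences sit at 36 and 56 and only the 66.0 value is flagged.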

### Bivariate Analysis

* Correlation
* Sentiment Polarity vs Price
* Date vs Price

**Note**: The above points are listed to provide guidance on how to approach bivariate analysis. Analysis has to be done beyond the above listed points to get maximum scores.

### 1. Correlation Analysis
- **Goal**: Examine correlations between numerical variables (e.g., Open, High, Low, Close, Volume) to understand how stock prices and trading volumes are interrelated.

In [None]:
#correlation matrix between numerical columns (Low included, as listed in the goal above)
corr_matrix=df[['Open','High','Low','Close','Volume']].corr()
# print(corr_matrix)
plt.figure(figsize=(10,8))
sns.heatmap(corr_matrix,annot=True,cmap='coolwarm',center=0)
plt.title("Correlation Matrix of Stock Prices and Volume")
plt.show()

**Correlation insights**:
   - The `Open`, `High`, and `Close` prices show a very strong positive correlation with each other (close to 1.0), indicating they move together closely on a daily basis.
   - `Volume` has a very weak negative correlation with `Open`, `High`, and `Close` prices (around -0.07 to -0.08), suggesting almost no linear relationship between trading volume and price level.
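
Because prices and volume repeat for every news item published on the same date, row-level correlations weight each day by its number of articles. One way to check how much this matters, sketched here on a hypothetical `toy` frame with made-up values, is to keep one row per date before calling `corr`:

```python
import pandas as pd

# Hypothetical data: the first date has three news rows with identical prices
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2019-01-02"] * 3 + ["2019-01-03", "2019-01-04", "2019-01-07"]
    ),
    "Close": [40.0] * 3 + [38.0, 39.5, 41.0],
    "Volume": [150e6] * 3 + [200e6, 120e6, 110e6],
})
cols = ["Close", "Volume"]

# Correlation over all news rows (days with more news count more)
row_level = toy[cols].corr().loc["Close", "Volume"]

# Correlation over one row per trading date
daily = toy.drop_duplicates(subset="Date")
day_level = daily[cols].corr().loc["Close", "Volume"]

print(round(row_level, 3), round(day_level, 3))
```

If the two values differ noticeably on the real data, the per-date version is arguably the better description of how price and volume relate.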

### 2. Sentiment Polarity vs. Price
- **Goal**: Determine how sentiment (represented by Label: positive, neutral, or negative) impacts stock prices.

In [None]:
# Group data by 'Label' and calculate mean prices for each sentiment category
sentiment_price_summary = df.groupby('Label')[['Open', 'High', 'Low', 'Close']].mean()
# print(sentiment_price_summary)

# Visualize sentiment impact on closing price
plt.figure(figsize=(8, 5))
sns.barplot(data=sentiment_price_summary.reset_index(), x='Label', y='Close', palette="viridis")
plt.title("Average Closing Price by Sentiment Label")
plt.xlabel("Sentiment Label")
plt.ylabel("Average Closing Price ($)")
plt.xticks(ticks=[0, 1, 2], labels=['Negative', 'Neutral', 'Positive'])  # bars are ordered by label value: -1, 0, 1
plt.show()


**Sentiment Vs Price insights**:
   - For sentiment label -1 (negative sentiment), `Close` prices are generally lower, suggesting that negative sentiment might correspond to reduced stock prices.
   - Neutral sentiment (label 0) and positive sentiment (label 1) show higher average `Close` prices compared to negative sentiment, potentially indicating that neutral or positive news has a stabilizing or slightly positive effect on stock prices.
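
Hand-written tick labels are easy to get out of step with the order of the bars. A minimal sketch of one alternative, assuming the label coding from the data dictionary, maps the numeric labels to named, ordered categories before grouping; the `toy` frame and the `Sentiment` column are hypothetical.

```python
import pandas as pd

# Label coding taken from the data dictionary
label_names = {-1: "Negative", 0: "Neutral", 1: "Positive"}
order = ["Negative", "Neutral", "Positive"]

# Hypothetical rows, not project data
toy = pd.DataFrame({
    "Label": [1, 0, -1, 0, 1, -1],
    "Close": [46.0, 45.0, 41.0, 44.0, 47.0, 42.0],
})

# An ordered categorical keeps the groups in a fixed, readable order
toy["Sentiment"] = pd.Categorical(
    toy["Label"].map(label_names), categories=order, ordered=True
)
mean_close = toy.groupby("Sentiment", observed=False)["Close"].mean()
print(mean_close)
```

Plotting `mean_close` directly then yields axis labels that come from the data rather than from a separate list.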

### 3. Date vs. Price
- **Goal**: Understand how stock prices fluctuate over time and identify any noticeable trends or seasonality.

In [None]:
# Print the date range in the data to understand the time span covered
print("Date range in the dataset:")
print(f"Start date: {df['Date'].min()}")
print(f"End date: {df['Date'].max()}")
print("\n")

# # Print the first and last few rows of the DataFrame to verify date ordering
# print("First few rows of the dataset:")
# print(df.head())
# print("\n")
# print("Last few rows of the dataset:")
# print(df.tail())
# print("\n")

# Calculate and print summary statistics for the 'Close' price
print("Summary statistics for 'Close' price:")
print(df['Close'].describe())
print("\n")

# Plot Close price over time
plt.figure(figsize=(12, 6))
plt.plot(df['Date'], df['Close'], color='blue', label='Close Price')
plt.title("Stock Closing Price Over Time")
plt.xlabel("Date")
plt.ylabel("Closing Price ($)")
plt.legend()
plt.show()


**Date Range and Price Summary**:
   - The dataset covers a four-month period from January 2, 2019, to April 30, 2019, which provides a short-term perspective on stock price and sentiment trends.
   - The average `Close` price over this period is approximately \$44.93, with a standard deviation of 6.4, showing moderate price volatility.
   - The highest `Close` price is \$64.80, indicating some significant upward price movements within this time frame.

### 4. Sentiment Polarity vs. Volume
- **Goal**: Assess if trading volume differs across sentiment labels, as higher volumes may indicate heightened investor reaction to certain types of news.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Print basic statistics for trading volume by sentiment polarity
# print("Summary Statistics for Trading Volume by Sentiment Polarity:\n")
volume_stats = df.groupby('Label')['Volume'].describe()
# print(volume_stats)
# print("\nAverage Trading Volume for each Sentiment Label:\n")
# print(volume_stats['mean'])

# Set the style for the plots
sns.set(style="whitegrid")

# Boxplot: Volume vs Sentiment Polarity
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='Label', y='Volume', palette="coolwarm")
plt.title("Trading Volume by Sentiment Polarity")
plt.xlabel("Sentiment Polarity")
plt.ylabel("Trading Volume")
plt.xticks([0, 1, 2], ["Negative", "Neutral", "Positive"])
plt.show()

# Bar Plot: Average Volume by Sentiment Polarity
volume_by_sentiment = df.groupby('Label')['Volume'].mean().reset_index()
plt.figure(figsize=(10, 6))
sns.barplot(data=volume_by_sentiment, x='Label', y='Volume', palette="coolwarm")
plt.title("Average Trading Volume by Sentiment Polarity")
plt.xlabel("Sentiment Polarity")
plt.ylabel("Average Trading Volume")
plt.xticks([0, 1, 2], ["Negative", "Neutral", "Positive"])
plt.show()


**Insights from Trading Volume by Sentiment Polarity**

1. **Neutral Sentiment Shows the Highest Volume**: Neutral sentiment is associated with the highest average trading volume, suggesting significant market activity on days with ambiguous news.

2. **Stable Volume for Positive Sentiment**: Positive news is associated with comparatively stable trading volumes, which may indicate consistent investor confidence.

3. **Negative Sentiment Shows High, Varied Volume**: Negative sentiment is associated with high trading volumes and more fluctuation, which may reflect stronger reactions to bad news.

In [None]:

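# Additional Feature Calculations
df['Price_Range'] = df['High'] - df['Low']  # Price volatility (High - Low)

# Compute moving averages on one Close value per date (prices repeat across same-day
# news rows), so that the windows span trading days rather than news rows
daily_close = df.drop_duplicates(subset='Date').set_index('Date')['Close'].sort_index()
df['Short_Term_MA'] = df['Date'].map(daily_close.rolling(window=10).mean())  # 10-day moving average
df['Long_Term_MA'] = df['Date'].map(daily_close.rolling(window=50).mean())  # 50-day moving average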

# Summary Statistics for Volume by Sentiment Polarity
# print("\nSummary Statistics for Trading Volume by Sentiment Polarity:")
# print(df.groupby('Label')['Volume'].describe())

# Sentiment Polarity vs Trading Volume Boxplot
plt.figure(figsize=(8, 6))
sns.boxplot(x='Label', y='Volume', data=df)
plt.title('Sentiment Polarity vs. Trading Volume')
plt.show()

# Sentiment Polarity vs Close Price Boxplot
plt.figure(figsize=(12, 8))
sns.boxplot(x='Label', y='Close', data=df)
plt.title('Sentiment Polarity vs. Close Price')
plt.show()

# Sentiment Polarity vs Price Volatility (Price Range)
plt.figure(figsize=(10, 6))
sns.boxplot(x='Label', y='Price_Range', data=df)
plt.title('Sentiment Polarity vs. Price Volatility (Price Range)')
plt.show()

# Sentiment Polarity and Moving Averages Plot
plt.figure(figsize=(12, 6))
sns.lineplot(x='Date', y='Short_Term_MA', data=df, label='10-day MA')
sns.lineplot(x='Date', y='Long_Term_MA', data=df, label='50-day MA')
sns.scatterplot(x='Date', y='Label', data=df, label='Sentiment Polarity', color='blue', alpha=0.5)
plt.title('Sentiment Polarity and Moving Averages')
plt.show()

# Rolling Correlation between Sentiment and Close Price
df['Rolling_Correlation'] = df['Label'].rolling(window=30).corr(df['Close'])
plt.figure(figsize=(12, 6))
plt.plot(df['Date'], df['Rolling_Correlation'], label='Rolling Correlation')
plt.title('Rolling Correlation between Sentiment and Close Price')
plt.xlabel('Date')
plt.ylabel('Correlation')
plt.show()

# Sentiment Polarity and Trading Volume Over Time
plt.figure(figsize=(12, 6))
sns.lineplot(x='Date', y='Volume', data=df, label='Volume')
sns.scatterplot(x='Date', y='Label', data=df, color='blue', alpha=0.5, label='Sentiment Polarity')
plt.title('Sentiment Polarity and Trading Volume Over Time')
plt.show()

# Additional Summary for Insights
print("\nAverage Trading Volume by Sentiment Label:")
print(df.groupby('Label')['Volume'].mean())

print("\nCorrelation between Sentiment and Close Price:")
print(df[['Label', 'Close']].corr())

### Insights

1. **Trading Volume by Sentiment**:
   - Average trading volume is highest for neutral sentiment (`Label = 0`) at ~132M shares.
   - Negative (`Label = -1`) and positive (`Label = 1`) sentiments show similar average trading volumes (~126M shares for negative, ~124M for positive).
   - Neutral sentiment may generate higher trading activity, suggesting market stability or cautious investor behavior.

2. **Volume Distribution by Sentiment**:
   - The 25th and 75th percentiles for all sentiments fall within similar ranges, indicating relatively consistent volume across sentiments.
   - Volume variability (std deviation) is slightly higher for neutral sentiment, suggesting more fluctuation when news is perceived as neutral.

3. **Sentiment and Stock Price Correlation**:
   - The correlation between sentiment (`Label`) and closing price (`Close`) is low (0.06), suggesting weak direct association.
   - This may imply that news sentiment alone has limited immediate impact on daily stock prices, or that other factors (e.g., broader market conditions) dominate price movements.

4. **Price Volatility and Sentiment** (from additional boxplots and statistics):
   - Negative sentiment appears associated with higher price volatility (`Price_Range`), which could indicate investor reactions to negative news.
   - Positive sentiment shows slightly lower volatility, aligning with potentially more stable or optimistic market conditions.

5. **Additional Observations for Trading Strategies**:
   - Higher trading volumes under neutral sentiment could imply a consolidation phase where prices fluctuate within a range.
   - Low correlation may indicate the potential for lagged effects, suggesting further analysis with time-lagged sentiment features for predictive modeling.
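
The time-lagged analysis suggested in the last point could start from a daily aggregate. The sketch below, on a hypothetical `toy` frame with made-up values, averages sentiment per date, shifts it by one row (the previous date present in the data, not necessarily the previous calendar day), and correlates it with the daily return. The column names `mean_sentiment`, `return`, and `sentiment_lag1` are illustrative.

```python
import pandas as pd

# Hypothetical news rows: several items can share a date
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2019-01-02", "2019-01-02", "2019-01-03",
        "2019-01-04", "2019-01-04", "2019-01-07",
    ]),
    "Label": [-1, -1, 0, 1, 0, 1],
    "Close": [40.0, 40.0, 38.0, 39.9, 39.9, 41.0],
})

# One row per date: average sentiment and that day's close
daily = (
    toy.groupby("Date")
    .agg(mean_sentiment=("Label", "mean"), close=("Close", "last"))
    .sort_index()
)

# Daily return and the previous date's sentiment
daily["return"] = daily["close"].pct_change()
daily["sentiment_lag1"] = daily["mean_sentiment"].shift(1)

# corr ignores the rows where either value is missing
lagged_corr = daily["sentiment_lag1"].corr(daily["return"])
print(daily)
print(lagged_corr)
```

With only a few months of data, any such correlation should be read as exploratory rather than as evidence of predictive power.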

## **Data Preprocessing**

**Split the Target Variable and Predictors**

In [None]:
# Assuming 'df' is your DataFrame containing the data
X = df.drop(columns=['Label'])  # Predictors
y = df['Label']                 # Target variable

**Split the Data into Train, Validation, and Test Sets**
- Splitting the data helps to ensure the model is trained, validated, and tested on separate portions of the data.
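
Since the sentiment classes are not balanced, a plain random split can leave the small validation and test sets with skewed class proportions. A stratified variant is sketched below on hypothetical `toy` data; `stratify` is a standard `train_test_split` parameter, while the 50/30/20 class counts are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 50 neutral, 30 negative, 20 positive
toy = pd.DataFrame({
    "text": [f"news item {i}" for i in range(100)],
    "Label": [0] * 50 + [-1] * 30 + [1] * 20,
})
X, y = toy[["text"]], toy["Label"]

# 70% train, 30% held out, keeping class proportions in both parts
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Split the held-out part evenly into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

print(y_train.value_counts(normalize=True).sort_index())
```

For time-ordered data such as this, a chronological split is another option worth considering, since a random split lets the model see news from dates later than some of its evaluation rows.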

In [None]:
from sklearn.model_selection import train_test_split

# Split data into training and temporary (for validation and test) sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)

# Split the temporary set into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)


In [None]:
#copy the original
X_train_orig=X_train.copy()
X_val_orig=X_val.copy()
X_test_orig=X_test.copy()

**Preprocessing the Text Data**
- Preprocessing here strips non-alphanumeric characters, lowercases the text, collapses extra whitespace, removes stopwords, and lemmatizes each word. This ensures that the model gets clean and meaningful inputs.


In [None]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Download required NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english')).union(set(ENGLISH_STOP_WORDS))

# Preprocess text function
def preprocess_text(text):
    text = re.sub(r'\W', ' ', str(text))  # Remove non-alphanumeric characters
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\s+', ' ', text)  # Remove extra spaces
    text = [lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words]
    return ' '.join(text)

# Apply preprocessing to news articles
X_train['processed_news'] = X_train['News'].apply(preprocess_text)
X_val['processed_news'] = X_val['News'].apply(preprocess_text)
X_test['processed_news'] = X_test['News'].apply(preprocess_text)


In [None]:
#copy the preprocessed data
X_train_prepro=X_train.drop(columns=['News'])
X_val_prepro= X_val.drop(columns=['News'])
X_test_prepro= X_test.drop(columns=['News'])

In [None]:
X_train_prepro.sample(3)

In [None]:
# Sample prints for Training Data
print("\nSample of Processed News from Training Set:")
print(X_train['processed_news'].head(3))

# Sample prints for Validation Data
print("\nSample of Processed News from Validation Set:")
print(X_val['processed_news'].head(3))

# Sample prints for Test Data
print("\nSample of Processed News from Test Set:")
print(X_test['processed_news'].head(3))

## **Word Embeddings**

- In this section, we will use three embedding techniques: Word2Vec, GloVe, and Sentence Transformer.

**Tokenization**

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize

# split the text data into tokens (words) using NLTK's word_tokenize
X_train['tokenized'] = X_train['processed_news'].apply(word_tokenize)
X_val['tokenized'] = X_val['processed_news'].apply(word_tokenize)
X_test['tokenized'] = X_test['processed_news'].apply(word_tokenize)

In [None]:
#sample of a tokenized news item (positional access, since row label 0 may not be in the shuffled training split)
X_train['tokenized'].iloc[0]

**1. Word2Vec**
- Word2Vec is a technique to map words into continuous vector representations based on their context in a corpus of text
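
Word vectors describe single words, so a document-level feature is usually built by pooling them. A minimal sketch of mean pooling follows, using a hypothetical three-dimensional `word_vectors` lookup in place of a trained model; words missing from the lookup are skipped, and a document with no known words falls back to a zero vector.

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for a trained model
word_vectors = {
    "apple": np.array([1.0, 0.0, 2.0]),
    "revenue": np.array([3.0, 2.0, 0.0]),
}


def document_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


doc = document_vector(["apple", "revenue", "forecast"], word_vectors, dim=3)
empty = document_vector(["unknown"], word_vectors, dim=3)
print(doc)    # average of the "apple" and "revenue" vectors
print(empty)  # zero vector
```

Averaging discards word order, which is one reason sentence-level encoders are compared against it later in this section.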

In [None]:
from gensim.models import Word2Vec

# 1. Train a Word2Vec model on the tokenized training data to learn word embeddings
# (vector representations of words).
model_w2v = Word2Vec(sentences=X_train['tokenized'].tolist(), vector_size=100, window=5, min_count=1, workers=4)

# Get the word vectors for the training set (average of word vectors for each document)
def get_word2vec_embeddings(tokens):
    embeddings = [model_w2v.wv[word] for word in tokens if word in model_w2v.wv]
    if embeddings:
        return np.mean(embeddings, axis=0)
    else:
        return np.zeros(model_w2v.vector_size)

# 2. Apply the model to transform each document into a vector by averaging the embeddings of the words in the document.
X_train['w2v_embeddings'] = X_train['tokenized'].apply(get_word2vec_embeddings)
X_val['w2v_embeddings'] = X_val['tokenized'].apply(get_word2vec_embeddings)
X_test['w2v_embeddings'] = X_test['tokenized'].apply(get_word2vec_embeddings)


In [None]:
# Sample of 10 Word2Vec embeddings (first 10 values) from the training set
print("Sample of Word2Vec embeddings from the training data (first 10 values):")
print(X_train['w2v_embeddings'].head(1).apply(lambda x: x[:10]).to_list())

# Sample of 10 Word2Vec embeddings (first 10 values) from the validation set
print("\nSample of Word2Vec embeddings from the validation data (first 10 values):")
print(X_val['w2v_embeddings'].head(1).apply(lambda x: x[:10]).to_list())

# Sample of 10 Word2Vec embeddings (first 10 values) from the test set
print("\nSample of Word2Vec embeddings from the test data (first 10 values):")
print(X_test['w2v_embeddings'].head(1).apply(lambda x: x[:10]).to_list())

# Print size of the embedding vector (positional access, since row label 0 may not be in the training split)
print("\nSize of the Word2Vec embedding vector:", len(X_train['w2v_embeddings'].iloc[0]))


**2. GloVe**
- We are going to use a pre-trained GloVe embedding file: glove.6B.100d.txt. This file contains 100-dimensional word embeddings (vector representations) trained on a corpus of 6 billion tokens (Wikipedia 2014 and Gigaword 5), covering a vocabulary of about 400,000 uncased words.
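
GloVe text files store one word per line followed by its vector components. Files converted to word2vec text format additionally begin with a header line holding the vocabulary size and dimension. The sketch below parses such lines from an in-memory sample and skips any line that does not have the expected number of fields; `parse_embedding_lines` and the three-dimensional sample are hypothetical.

```python
import io

import numpy as np


def parse_embedding_lines(lines, expected_dim):
    """Parse 'word v1 v2 ...' lines, skipping headers and malformed rows."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != expected_dim + 1:
            continue  # e.g. a "<vocab_size> <dim>" header line
        embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return embeddings


# Hypothetical sample in word2vec text format: header, then two words
sample = io.StringIO("2 3\napple 0.1 0.2 0.3\nstock -0.5 0.0 0.25\n")
vectors = parse_embedding_lines(sample, expected_dim=3)
print(sorted(vectors))
print(vectors["stock"])
```

Checking the field count also guards against a header being stored as if it were a word with a one-element vector.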

In [None]:
# 1. Load pre-trained GloVe embeddings
def load_glove_embeddings(file_path):
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            # A word2vec-format file starts with a "<vocab_size> <dim>" header; skip it
            if len(values) == 2:
                continue
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

# Path to the GloVe embeddings file
glove_path = '/content/drive/My Drive/AIMLTRAINING/collabdata/NLP/glove.6B.100d.txt.word2vec'

# Load GloVe embeddings from the provided path
glove_embeddings = load_glove_embeddings(glove_path)

# 2. Get GloVe embeddings for the training set (average of word vectors for each document)
def get_glove_embeddings(tokens):
    embeddings = [glove_embeddings[word] for word in tokens if word in glove_embeddings]
    if embeddings:
        return np.mean(embeddings, axis=0)
    else:
        return np.zeros(100)  # 100 is the embedding size in this case

# Apply GloVe embeddings to training, validation, and test sets
X_train['glove_embeddings'] = X_train['tokenized'].apply(get_glove_embeddings)
X_val['glove_embeddings'] = X_val['tokenized'].apply(get_glove_embeddings)
X_test['glove_embeddings'] = X_test['tokenized'].apply(get_glove_embeddings)


In [None]:
# Sample of 10 GloVe embeddings (first 10 values) from the training set
print("Sample of GloVe embeddings from the training data (first 10 values):")
print(X_train['glove_embeddings'].head(1).apply(lambda x: x[:10]).to_list())

# Sample of 10 GloVe embeddings (first 10 values) from the validation set
print("\nSample of GloVe embeddings from the validation data (first 10 values):")
print(X_val['glove_embeddings'].head(1).apply(lambda x: x[:10]).to_list())

# Sample of 10 GloVe embeddings (first 10 values) from the test set
print("\nSample of GloVe embeddings from the test data (first 10 values):")
print(X_test['glove_embeddings'].head(1).apply(lambda x: x[:10]).to_list())

# Print size of the embedding vector (.iloc[0] is positional; label 0 may not be in the training split)
print("\nSize of the GloVe embedding vector:", len(X_train['glove_embeddings'].iloc[0]))


**3. Sentence Transformer**
- Sentence Transformers provide embeddings for entire sentences or documents, rather than individual words, making them ideal for tasks like document classification.

In [None]:
from sentence_transformers import SentenceTransformer

# Initialize the model
# The all-MiniLM-L6-v2 model is an all-round (all) model trained on a large and diverse dataset of over 1 billion training samples
# and generates state-of-the-art sentence embeddings of 384 dimensions.
model_st = SentenceTransformer('all-MiniLM-L6-v2')

# Get sentence embeddings for the training set
X_train['st_embeddings'] = X_train['processed_news'].apply(lambda x: model_st.encode(x))
X_val['st_embeddings'] = X_val['processed_news'].apply(lambda x: model_st.encode(x))
X_test['st_embeddings'] = X_test['processed_news'].apply(lambda x: model_st.encode(x))


In [None]:
# Sample of 10 Sentence Transformer embeddings (first 10 values) from the training set
print("Sample of Sentence Transformer embeddings from the training data (first 10 values):")
print(X_train['st_embeddings'].head(1).apply(lambda x: x[:10]).to_list())

# Sample of 10 Sentence Transformer embeddings (first 10 values) from the validation set
print("\nSample of Sentence Transformer embeddings from the validation data (first 10 values):")
print(X_val['st_embeddings'].head(1).apply(lambda x: x[:10]).to_list())

# Sample of 10 Sentence Transformer embeddings (first 10 values) from the test set
print("\nSample of Sentence Transformer embeddings from the test data (first 10 values):")
print(X_test['st_embeddings'].head(1).apply(lambda x: x[:10]).to_list())

# Print size of the embedding vector (.iloc[0] is positional; label 0 may not be in the training split)
print("\nSize of the Sentence Transformer embedding vector:", len(X_train['st_embeddings'].iloc[0]))


## **Sentiment Analysis**

**Goal**
- The goal of this project is to develop an AI-driven sentiment analysis system that analyzes stock-related news to gauge market sentiment, supporting improved stock price predictions and optimized investment strategies for more informed, effective decision-making.

**Primary Metric Choice** - **F1-Score**

**Why F1-Score?**
- Given the three-class classification problem with an imbalanced dataset (170 neutral, 99 negative, and 80 positive labels), the F1-score is an ideal metric. It balances precision and recall, emphasizing the model's ability to correctly classify each sentiment label (positive, neutral, negative), while accounting for class imbalance.

**Complementary Metric: Accuracy**

**Why Accuracy?**
- Accuracy offers a quick overview of correctness but can be misleading in imbalanced data, where the model might favor the majority class (neutral). As such, accuracy can serve as a supplementary measure to provide a broader perspective but isn’t suitable as the primary metric.
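
To illustrate why accuracy alone can mislead here, the sketch below scores a classifier that always predicts the majority class, using the class counts quoted above. It is a pure-Python illustration, not the notebook's evaluation code. The label encoding (0 = neutral, -1 = negative, 1 = positive) and the choice of macro averaging (the unweighted mean of per-class F1) are assumptions made for this example.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating that class as 'positive' and the rest as 'negative'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct prediction for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, label) for label in labels) / len(labels)

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Class counts quoted above: 170 neutral, 99 negative, 80 positive.
# Assumed encoding for this sketch: 0 = neutral, -1 = negative, 1 = positive.
y_true = [0] * 170 + [-1] * 99 + [1] * 80
y_always_neutral = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, y_always_neutral)
baseline_macro_f1 = macro_f1(y_true, y_always_neutral)
print(f"accuracy={baseline_accuracy:.3f}, macro F1={baseline_macro_f1:.3f}")
```

The always-neutral baseline reaches an accuracy close to one half while its macro F1 stays far lower, because the two minority classes each contribute an F1 of zero.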

**1. Basic Sentiment Analysis Using Word2Vec Embeddings**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import numpy as np

# Extract features (Word2Vec embeddings) from X_train, X_val, and X_test
X_train_w2v = np.stack(X_train['w2v_embeddings'].values)
X_val_w2v = np.stack(X_val['w2v_embeddings'].values)
X_test_w2v = np.stack(X_test['w2v_embeddings'].values)

# Initialize the Random Forest classifier
model_w2v = RandomForestClassifier(random_state=42)

# Train the model
model_w2v.fit(X_train_w2v, y_train)

# Predict on training, validation, and test sets
y_train_pred = model_w2v.predict(X_train_w2v)
y_val_pred = model_w2v.predict(X_val_w2v)
y_test_pred = model_w2v.predict(X_test_w2v)

# Print classification report for each set
print("\nTraining Set Classification Report:")
print("-" * 40)
print(classification_report(y_train, y_train_pred))

print("\nValidation Set Classification Report:")
print("-" * 40)
print(classification_report(y_val, y_val_pred))

print("\nTest Set Classification Report:")
print("-" * 40)
print(classification_report(y_test, y_test_pred))


### Basic Word2Vec Model Performance Observations:

1. **Overfitting**: Perfect training performance (F1, precision, recall = 1.0) suggests overfitting, indicating poor generalization to new data.

2. **Poor Generalization**: Validation and test metrics are much lower (accuracy around 42-44%), showing the model struggles with unseen data.

3. **Precision and Recall Imbalance**: The model has low precision and recall for minority classes, especially class `1`, indicating bias toward the majority class and potential issues with false positives and negatives.

4. **Class Sensitivity**: Performance is better on the majority class (class `0`), indicating sensitivity to class distribution.

**2. Basic Sentiment Analysis Using GloVe**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import numpy as np

# Extract GloVe embeddings for training, validation, and test sets
X_train_glove = np.stack(X_train['glove_embeddings'].values)
X_val_glove = np.stack(X_val['glove_embeddings'].values)
X_test_glove = np.stack(X_test['glove_embeddings'].values)

# Initialize the Random Forest classifier
model_glove = RandomForestClassifier(random_state=42)

# Train the model
model_glove.fit(X_train_glove, y_train)

# Predict on training, validation, and test sets
y_train_pred = model_glove.predict(X_train_glove)
y_val_pred = model_glove.predict(X_val_glove)
y_test_pred = model_glove.predict(X_test_glove)

# Print classification report for each set
print("\nTraining Set Classification Report:")
print("-" * 40)
print(classification_report(y_train, y_train_pred))

print("\nValidation Set Classification Report:")
print("-" * 40)
print(classification_report(y_val, y_val_pred))

print("\nTest Set Classification Report:")
print("-" * 40)
print(classification_report(y_test, y_test_pred))


### Basic GloVe Model Performance Observations:

1. **Overfitting**: Perfect scores on the training set (F1, precision, recall = 1.0) indicate strong overfitting, limiting generalization.

2. **Poor Generalization**: Significant performance drop on validation and test sets, with low F1 scores (validation ~0.33, test ~0.23), showing the model struggles with unseen data.

3. **High Recall, Low Precision**: While recall is higher for some classes, low precision on validation and test sets suggests many false positives, indicating imbalanced performance across classes.

**3. Basic Sentiment Analysis Using Sentence Transformer**

In [None]:
#  Import necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import numpy as np

# Extract the sentence embeddings for each set
X_train_st = np.vstack(X_train['st_embeddings'].values)
X_val_st = np.vstack(X_val['st_embeddings'].values)
X_test_st = np.vstack(X_test['st_embeddings'].values)

# Initialize a Random Forest classifier (named clf_st so it does not overwrite
# the SentenceTransformer stored in model_st)
clf_st = RandomForestClassifier(random_state=42)

# Train the classifier
clf_st.fit(X_train_st, y_train)

# Predict on training, validation, and test sets
y_train_pred = clf_st.predict(X_train_st)
y_val_pred = clf_st.predict(X_val_st)
y_test_pred = clf_st.predict(X_test_st)

# Print classification report for each set
print("\nTraining Set Classification Report:")
print("-" * 40)
print(classification_report(y_train, y_train_pred))

print("\nValidation Set Classification Report:")
print("-" * 40)
print(classification_report(y_val, y_val_pred))

print("\nTest Set Classification Report:")
print("-" * 40)
print(classification_report(y_test, y_test_pred))


### Basic Sentence Transformer Model Performance Observations:

1. **Overfitting**: The model achieves perfect scores on the training set (F1, accuracy = 1.0), which suggests overfitting and poor generalization to new data.

2. **Poor Generalization**: Performance drops significantly on the validation (F1 ~0.49, accuracy ~0.54) and test sets (F1 ~0.24, accuracy ~0.38), indicating the model struggles to generalize effectively.

3. **Class Imbalance Issues**: Precision for class 1 is zero, and recall is extremely low for certain classes, suggesting poor model performance on underrepresented classes.

### Hyperparameter Tuning
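
The cells in this section plug in the best hyperparameters directly; the search that produced them is not shown. As a rough idea of what such a search involves, here is a minimal, library-free sketch of an exhaustive grid search. The function `grid_search`, the grid values, and the stand-in scorer `toy_evaluate` are all hypothetical; in the notebook the scorer would fit a classifier on the training split and return the validation F1.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every parameter combination and keep the best-scoring one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid; the values actually searched are not recorded in this notebook.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10, None],
}

def toy_evaluate(params):
    # Stand-in scorer so the sketch runs on its own. A real scorer would train
    # a model with `params` and return its F1 on the validation split.
    depth_bonus = 0.05 if params["max_depth"] == 10 else 0.0
    return 0.30 + depth_bonus - params["n_estimators"] / 10000

best_params, best_score = grid_search(param_grid, toy_evaluate)
print(best_params, round(best_score, 3))
```

Scoring each candidate on the validation split, rather than the test split, keeps the test set untouched for the final comparison.
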
**1. Tuned Sentiment Analysis Using Word2Vec Embeddings**

In [None]:
# 1. Tuned sentiment analysis using Word2Vec embeddings

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import numpy as np

# Extract features (Word2Vec embeddings) from X_train, X_val, and X_test
X_train_w2v = np.stack(X_train['w2v_embeddings'].values)
X_val_w2v = np.stack(X_val['w2v_embeddings'].values)
X_test_w2v = np.stack(X_test['w2v_embeddings'].values)

# Set the best hyperparameters directly based on previous tuning
best_params = {
    'n_estimators': 50,
    'max_depth': 10,
    'min_samples_split': 2,
    'min_samples_leaf': 1,
    'max_features': 'log2'
}

# Initialize the Random Forest classifier with the best parameters
model_w2v_tuned = RandomForestClassifier(random_state=42, **best_params)

# Fit the model on the training data
model_w2v_tuned.fit(X_train_w2v, y_train)

# Predict on training, validation, and test sets
y_train_pred = model_w2v_tuned.predict(X_train_w2v)
y_val_pred = model_w2v_tuned.predict(X_val_w2v)
y_test_pred = model_w2v_tuned.predict(X_test_w2v)

# Print classification report for each set
print("\nTraining Set Classification Report:")
print("-"*40)
print(classification_report(y_train, y_train_pred))

print("\nValidation Set Classification Report:")
print("-"*40)
print(classification_report(y_val, y_val_pred))

print("\nTest Set Classification Report:")
print("-"*40)
print(classification_report(y_test, y_test_pred))

# Display best hyperparameters
print("\nBest Hyperparameters:")
print(best_params)


### Tuned Word2Vec Model Performance Insights:

1. **Overfitting**: The model achieves perfect scores (F1 = 1.0) on the training set, indicating overfitting. This means the model fits too closely to the training data and likely struggles with generalization.

2. **Improvement Over Untuned Model**: The tuned model shows slight improvements in validation (F1 = 0.395) and test performance (F1 = 0.341) compared to the untuned model, but overfitting still persists.

3. **Class Imbalance**: The model struggles with class imbalance, particularly for class 1, where both precision and recall are low, especially on the test set.

### Comparison with Untuned Word2Vec Model

- **Validation and Test Performance**: Both the tuned and untuned models show poor generalization, with relatively low F1 scores and accuracy on the test set (F1 ~0.34 for tuned vs. ~0.33 for untuned).
- **Slight Improvement**: The tuned model offers marginal improvement in precision and recall over the untuned model, but both still exhibit signs of overfitting.

**2. Tuned Sentiment Analysis Using GloVe**

In [None]:
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
import joblib

# Extract GloVe embeddings for training, validation, and test sets
X_train_glove = np.stack(X_train['glove_embeddings'].values)
X_val_glove = np.stack(X_val['glove_embeddings'].values)
X_test_glove = np.stack(X_test['glove_embeddings'].values)

# Set the best hyperparameters directly based on previous tuning
best_params_glove = {
    'n_estimators': 50,
    'learning_rate': 0.1,
    'max_depth': 7,
    'min_samples_split': 5,
    'min_samples_leaf': 1
}

# Initialize the Gradient Boosting classifier with the best parameters
model_glove_tuned = GradientBoostingClassifier(random_state=42, **best_params_glove)

# Train the model on the training set
model_glove_tuned.fit(X_train_glove, y_train)

# Predict on training, validation, and test sets
y_train_pred = model_glove_tuned.predict(X_train_glove)
y_val_pred = model_glove_tuned.predict(X_val_glove)
y_test_pred = model_glove_tuned.predict(X_test_glove)

# Print classification report for each set
print("\nTraining Set Classification Report:")
print("-"*40)
print(classification_report(y_train, y_train_pred))

print("\nValidation Set Classification Report:")
print("-"*40)
print(classification_report(y_val, y_val_pred))

print("\nTest Set Classification Report:")
print("-"*40)
print(classification_report(y_test, y_test_pred))


### Tuned GloVe Model Insights:
- **Training Set**: Perfect performance (F1 = 1.0, Accuracy = 1.0), indicating overfitting.
- **Validation Set**: Large performance drop (F1 = 0.44, Accuracy = 0.46) relative to the perfect training scores, confirming overfitting.
- **Test Set**: Further drop (F1 = 0.33, Accuracy = 0.36), with the model continuing to struggle with generalization to unseen data, especially with low recall for class -1 and class 1.

### Comparison with Untuned/Basic GloVe Model:
- The tuned model improves on the untuned model on both the validation (F1 = 0.44 vs. 0.25) and test (F1 = 0.33 vs. 0.21) sets, but both still generalize poorly.

**3. Tuned Sentiment Analysis Using Sentence Transformer**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import numpy as np
import joblib

# Extract the sentence embeddings for each set
X_train_st = np.vstack(X_train['st_embeddings'].values)
X_val_st = np.vstack(X_val['st_embeddings'].values)
X_test_st = np.vstack(X_test['st_embeddings'].values)

# Combine train and validation sets for cross-validation
X_combined = np.vstack((X_train_st, X_val_st))
y_combined = np.hstack((y_train, y_val))

# Set the best hyperparameters based on previous tuning results
best_params_rf = {
    'n_estimators': 50,
    'max_depth': None,
    'min_samples_split': 10,
    'min_samples_leaf': 1,
    'max_features': 'sqrt'
}

# Initialize the Random Forest classifier with the best parameters
model_st_tuned = RandomForestClassifier(random_state=42, **best_params_rf)

# Train the model on the training set
model_st_tuned.fit(X_train_st, y_train)

# Predict on training, validation, and test sets
y_train_pred = model_st_tuned.predict(X_train_st)
y_val_pred = model_st_tuned.predict(X_val_st)
y_test_pred = model_st_tuned.predict(X_test_st)

# Print classification report for each set
print("\nTraining Set Classification Report:")
print("-"*40)
print(classification_report(y_train, y_train_pred))

print("\nValidation Set Classification Report:")
print("-"*40)
print(classification_report(y_val, y_val_pred))

print("\nTest Set Classification Report:")
print("-"*40)
print(classification_report(y_test, y_test_pred))

### Insights on the Untuned Sentence Transformer Model:
- **Training Set**: Perfect performance (F1, Accuracy, Precision, Recall all 1.0), indicating overfitting.
- **Validation Set**: Moderate performance (F1: 0.38, Accuracy: 0.52), with recall for class 0 being high (0.88), but very low recall for class 1 (0.00), suggesting a bias toward class 0.
- **Test Set**: Poor performance (F1: 0.28, Accuracy: 0.40), with low precision for class 0 (0.38) and class 1 (0.00), indicating difficulty in generalizing to unseen data.

### Comparison (Tuned vs Untuned):
- **Test Set**: The tuned model improves significantly in precision (0.60 vs. 0.21), but recall remains low (0.42), indicating better focus on reducing false positives while missing some true positives.

### Model Selection Based on our Criteria

#### Summary Table

| Model                      | Dataset       | Accuracy | F1 Score | Precision | Recall |
|----------------------------|---------------|----------|----------|-----------|--------|
| **Untuned GloVe**          | Training      | 1.00     | 1.00     | 1.00      | 1.00   |
|                            | Validation    | 0.46     | 0.25     | -         | -      |
|                            | Test          | 0.36     | 0.21     | -         | -      |
| **Tuned GloVe**            | Training      | 1.00     | 1.00     | 1.00      | 1.00   |
|                            | Validation    | 0.46     | 0.44     | -         | -      |
|                            | Test          | 0.36     | 0.33     | -         | -      |
| **Untuned Sentence Transformer** | Training | 1.00     | 1.00     | 1.00      | 1.00   |
|                            | Validation    | 0.52     | 0.38     | -         | -      |
|                            | Test          | 0.40     | 0.28     | -         | -      |
| **Tuned Sentence Transformer** | Training  | 1.00     | 1.00     | 1.00      | 1.00   |
|                            | Validation    | 0.54     | 0.49     | 0.65      | 0.54   |
|                            | Test          | 0.38     | 0.24     | 0.60      | 0.42   |
| **Untuned Word2Vec**       | Training      | 1.00     | 1.00     | 1.00      | 1.00   |
|                            | Validation    | 0.45     | 0.22     | -         | -      |
|                            | Test          | 0.35     | 0.20     | -         | -      |
| **Tuned Word2Vec**         | Training      | 1.00     | 1.00     | 1.00      | 1.00   |
|                            | Validation    | 0.49     | 0.39     | -         | -      |
|                            | Test          | 0.37     | 0.27     | -         | -      |

#### Model Selection Analysis

1. **F1-Score (Primary Metric)**:
   - The **Tuned GloVe** model has the highest F1-score on the test set (0.33), which shows better generalization on unseen data compared to other models. It also has a relatively good F1-score (0.44) on the validation set.
   - The **Tuned Sentence Transformer** model has the highest F1-score on the validation set (0.49), showing it may perform slightly better in detecting the minority classes, although its test F1-score is lower (0.24).

2. **Accuracy (Complementary Metric)**:
   - **Tuned GloVe** and **Tuned Sentence Transformer** have similar test accuracy (0.36 and 0.38). Both figures are below the roughly 0.49 that always predicting neutral would reach on the overall class distribution (170 of 349 labels), so accuracy does not separate the two models and confirms that neither generalizes well.

3. **Model Recommendation**:
   - Given the F1-score priority and the slightly better balance of precision and recall, **Tuned GloVe** appears to be the best choice. It performs consistently across validation and test sets, handling the imbalanced dataset better than the alternatives.
   - **Tuned Sentence Transformer** is a strong alternative if further tuning is possible, as its high precision on the validation set suggests it could be refined for potentially better minority class detection.

### Conclusion

The **Tuned GloVe** model aligns best with our criteria by offering the most consistent F1-scores across validation and test datasets, making it the recommended model for our sentiment analysis system.
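
The selection logic can be replayed from the summary table. The sketch below copies the validation and test F1 values from that table, ranks the models, and computes the validation-to-test gap as a rough consistency measure; the variable names are illustrative.

```python
# F1 values copied from the summary table above: (validation, test).
f1_scores = {
    "Untuned GloVe": (0.25, 0.21),
    "Tuned GloVe": (0.44, 0.33),
    "Untuned Sentence Transformer": (0.38, 0.28),
    "Tuned Sentence Transformer": (0.49, 0.24),
    "Untuned Word2Vec": (0.22, 0.20),
    "Tuned Word2Vec": (0.39, 0.27),
}

# Rank by each split separately.
best_on_validation = max(f1_scores, key=lambda name: f1_scores[name][0])
best_on_test = max(f1_scores, key=lambda name: f1_scores[name][1])

# Gap between validation and test F1; a smaller gap means more consistent behavior.
gaps = {name: round(val - test, 2) for name, (val, test) in f1_scores.items()}

print("Best on validation:", best_on_validation)
print("Best on test:", best_on_test)
print("Validation-test gaps:", gaps)
```

Ranking by validation F1 alone would instead favor the Tuned Sentence Transformer, which is why both models are discussed above.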

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Assuming 'model_glove_tuned' is your final trained and tuned GloVe model
final_model = model_glove_tuned

# Predicting on validation and test sets
val_preds = final_model.predict(X_val_glove)
test_preds = final_model.predict(X_test_glove)

# Creating confusion matrices
val_conf_matrix = confusion_matrix(y_val, val_preds)
test_conf_matrix = confusion_matrix(y_test, test_preds)

# Printing the actual confusion matrix data
print("Validation Set Confusion Matrix Data:")
print("Negative\tNeutral\tPositive")
for row in val_conf_matrix:
    print("\t".join(map(str, row)))

print("\nTest Set Confusion Matrix Data:")
print("Negative\tNeutral\tPositive")
for row in test_conf_matrix:
    print("\t".join(map(str, row)))

# Plotting confusion matrices for validation and test sets
fig, ax = plt.subplots(1, 2, figsize=(14, 6))

# Validation set confusion matrix
disp_val = ConfusionMatrixDisplay(confusion_matrix=val_conf_matrix, display_labels=['Negative', 'Neutral', 'Positive'])
disp_val.plot(ax=ax[0], cmap='Blues', colorbar=False)  # Display without color bar to avoid duplication
ax[0].set_title("Validation Set Confusion Matrix")

# Test set confusion matrix
disp_test = ConfusionMatrixDisplay(confusion_matrix=test_conf_matrix, display_labels=['Negative', 'Neutral', 'Positive'])
disp_test.plot(ax=ax[1], cmap='Blues', colorbar=True)  # Display with color bar for consistency
ax[1].set_title("Test Set Confusion Matrix")

plt.tight_layout()  # Adjust layout for readability
plt.show()


### Observations on the final model:

1. **Weakness in Predicting Negative Sentiment**: The model struggles to correctly classify negative sentiment, with very few negative articles correctly identified in both validation and test sets.

2. **Good Performance on Neutral Sentiment**: The model performs better on neutral sentiment, with more correct predictions, though misclassifications still occur.

3. **Challenges with Positive Sentiment**: The model has difficulty predicting positive sentiment, often misclassifying it as neutral or negative, especially in the test set.
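
Observations like these can be read off a confusion matrix as per-class recall (row-wise) and precision (column-wise). The sketch below uses a made-up 3x3 matrix purely for illustration; it is not the notebook's output. It follows the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
# Made-up confusion matrix for illustration only (not the notebook's results).
# Rows are true classes, columns are predicted classes.
labels = ["Negative", "Neutral", "Positive"]
matrix = [
    [2, 6, 2],
    [3, 12, 3],
    [1, 5, 2],
]

def per_class_recall(matrix, labels):
    """Diagonal entry divided by its row total: share of each true class recovered."""
    recall = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        recall[label] = matrix[i][i] / row_total if row_total else 0.0
    return recall

def per_class_precision(matrix, labels):
    """Diagonal entry divided by its column total: share of each prediction that is right."""
    precision = {}
    for j, label in enumerate(labels):
        column_total = sum(row[j] for row in matrix)
        precision[label] = matrix[j][j] / column_total if column_total else 0.0
    return precision

recall = per_class_recall(matrix, labels)
precision = per_class_precision(matrix, labels)
print("Recall:", recall)
print("Precision:", precision)
```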

## **Weekly News Summarization**

**Important Note**: It is recommended to run this section of the project independently from the previous sections in order to avoid runtime crashes due to RAM overload.

Load the Data

In [None]:
import pandas as pd  # needed here because this section is run independently
from google.colab import drive
drive.mount('/content/drive',force_remount=True)

path2='/content/drive/My Drive/AIMLTRAINING/collabdata/NLP/stock_news.csv'
df_llm=pd.read_csv(path2)
# df_llm=pd.read_csv("stock_news.csv")

Convert the Date Column to Pandas Date Format

In [None]:
df_llm['Date'] = pd.to_datetime(df_llm['Date'])
# Print dataset information and summaries
print("Dataset Information:")
print("-"*30)
print(df_llm.info())  # Provides an overview of the dataframe, including data types and non-null counts

print("\nFirst Five Rows of the Dataset:")
print("-"*30)
print(df_llm.head())  # Displays the first 5 rows of the dataset

print("\nShape of the Dataset:")
print("-"*30)
print(f"Rows: {df_llm.shape[0]}, Columns: {df_llm.shape[1]}")  # Displays the number of rows and columns

print("\nCount of Missing Values in Each Column:")
print("-"*30)
print(df_llm.isnull().sum())  # Displays the count of missing values in each column

**Group the Data by Week**
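
The grouping below relies on `freq='W'`, which pandas treats as weeks ending on Sunday, so each date from Monday to Sunday falls into the bin labeled with that Sunday. The library-free sketch here mimics that binning on a few hypothetical headlines to show which days end up together; `week_ending_sunday` is an illustrative helper, not a pandas function.

```python
from collections import defaultdict
from datetime import date, timedelta

def week_ending_sunday(day):
    """Return the Sunday that closes the Monday-to-Sunday week containing `day`."""
    return day + timedelta(days=(6 - day.weekday()) % 7)

# Hypothetical headlines; the real rows come from the News column.
daily_news = [
    (date(2019, 1, 2), "headline A"),  # Wednesday
    (date(2019, 1, 4), "headline B"),  # Friday
    (date(2019, 1, 6), "headline C"),  # Sunday
    (date(2019, 1, 7), "headline D"),  # Monday of the following week
]

# Collect each day's headline under its week-ending Sunday.
grouped = defaultdict(list)
for day, headline in daily_news:
    grouped[week_ending_sunday(day)].append(headline)

# Join each week's headlines with the same ' || ' delimiter used below.
weekly_news = {week: " || ".join(items) for week, items in grouped.items()}
for week, text in sorted(weekly_news.items()):
    print(week, "->", text)
```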

In [None]:
# Group the data by week and aggregate
df_weekly = df_llm.groupby(pd.Grouper(key='Date', freq='W')).agg({
    'News': ' || '.join,  # Combine daily news using '||' as the delimiter
    'Open': 'first',      # Use the opening price of the first day of the week
    'High': 'max',        # Use the highest price during the week
    'Low': 'min',         # Use the lowest price during the week
    'Close': 'last',      # Use the closing price of the last day of the week
    'Volume': 'sum',      # Sum up the volume traded during the week
    'Label': 'last'       # Use the label of the last day of the week
}).reset_index()

# Display the resulting DataFrame
print("Weekly Aggregated Data:")
print("-"*30)
print(df_weekly.head())  # Print first 5 rows to preview

# Preview the shape of the aggregated data
print("\nShape of Aggregated Weekly Data:")
print("-"*30)
print(df_weekly.shape)

# Check for any missing values in the aggregated data
print("\nMissing Values in Aggregated Weekly Data:")
print("-"*30)
print(df_weekly.isnull().sum())

# Display information about the aggregated data
print("\nAggregated Weekly Data Info:")
print("-"*30)
print(df_weekly.info())

# Display first few rows of the dataset after grouping and aggregation
print("\nFirst Five Rows After Grouping and Aggregation:")
print("-"*30)
print(df_weekly.head())

# Check how the 'News' column looks after concatenation
print("\nExample of 'News' Column After Aggregation:")
print("-"*30)
print(df_weekly['News'].head().T)


In [None]:
from google.colab import drive

try:
    # Force remount Google Drive to refresh the connection
    drive.mount('/content/drive', force_remount=True)
    print("Drive mounted successfully!")
except ValueError as e:
    print(f"Mount failed: {e}")
    print("Check your authentication and try again. You might need to authorize access to your Google Drive.")

In [None]:
# Define the file path for saving the DataFrame as a CSV file
file_path = '/content/drive/My Drive/AIMLTRAINING/collabdata/NLP/df_weekly.csv'

# Save the entire DataFrame to a CSV file
df_weekly.to_csv(file_path, index=False, encoding='utf-8')

print(f"DataFrame saved to {file_path}")


#### Installing and Importing the necessary libraries

In [None]:
!pip install git+https://github.com/abetlen/llama-cpp-python.git
# !CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.2.45 --force-reinstall --no-cache-dir -q

#### Summarization

**Note**:

- The model is expected to summarize the news from the week by identifying the top three positive and negative events that are most likely to impact the price of the stock.

- As an output, the model is expected to return a JSON containing two keys, one for Positive Events and one for Negative Events.

For the project, we need to define the prompt to be fed to the LLM to help it understand the task to perform. The following should be the components of the prompt:

1. **Role**: Specifies the role the LLM will be taking up to perform the specified task, along with any specific details regarding the role

  - **Example**: `You are an expert data analyst specializing in news content analysis.`

2. **Task**: Specifies the task to be performed and outlines what needs to be accomplished, clearly defining the objective

  - **Example**: `Analyze the provided news headline and return the main topics contained within it.`

3. **Instructions**: Provides detailed guidelines on how to perform the task, which includes steps, rules, and criteria to ensure the task is executed correctly

  - **Example**:

```
Instructions:
1. Read the news headline carefully.
2. Identify the main subjects or entities mentioned in the headline.
3. Determine the key events or actions described in the headline.
4. Extract relevant keywords that represent the topics.
5. List the topics in a concise manner.
```

4. **Output Format**: Specifies the format in which the final response should be structured, ensuring consistency and clarity in the generated output

  - **Example**: `Return the output in JSON format with keys as the topic number and values as the actual topic.`

**Full Prompt Example**:

```
You are an expert data analyst specializing in news content analysis.

Task: Analyze the provided news headline and return the main topics contained within it.

Instructions:
1. Read the news headline carefully.
2. Identify the main subjects or entities mentioned in the headline.
3. Determine the key events or actions described in the headline.
4. Extract relevant keywords that represent the topics.
5. List the topics in a concise manner.

Return the output in JSON format with keys as the topic number and values as the actual topic.
```

**Sample Output**:

`{"1": "Politics", "2": "Economy", "3": "Health" }`
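
The four components can also be assembled programmatically, which makes it easier to vary one part while keeping the others fixed. The helper `build_prompt` below is a hypothetical convenience function, not part of the project template; it reproduces the full prompt example above from its parts.

```python
def build_prompt(role, task, instructions, output_format):
    """Assemble a prompt from role, task, numbered instructions, and output format."""
    numbered = "\n".join(
        f"{number}. {step}" for number, step in enumerate(instructions, start=1)
    )
    return f"{role}\n\nTask: {task}\n\nInstructions:\n{numbered}\n\n{output_format}"

role = "You are an expert data analyst specializing in news content analysis."
task = "Analyze the provided news headline and return the main topics contained within it."
instructions = [
    "Read the news headline carefully.",
    "Identify the main subjects or entities mentioned in the headline.",
    "Determine the key events or actions described in the headline.",
    "Extract relevant keywords that represent the topics.",
    "List the topics in a concise manner.",
]
output_format = (
    "Return the output in JSON format with keys as the topic number "
    "and values as the actual topic."
)

example_prompt = build_prompt(role, task, instructions, output_format)
print(example_prompt)
```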

#### Loading the model

In [None]:
# Function to download the model from the Hugging Face model hub
from huggingface_hub import hf_hub_download

# Importing the Llama class from the llama_cpp module
from llama_cpp import Llama

# Importing the library for data manipulation
import pandas as pd

from tqdm import tqdm # For progress bar related functionalities
tqdm.pandas()

In [None]:
import os
print(f"Number of CPU cores: {os.cpu_count()}")


In [None]:
# Import the download helper and the llama.cpp Python bindings
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Model configuration
model_name_or_path = "TheBloke/Llama-2-13B-chat-GGUF"
model_basename = "llama-2-13b-chat.Q5_K_M.gguf"

# Download the model file from the Hugging Face Hub
model_path = hf_hub_download(
    repo_id=model_name_or_path,
    filename=model_basename
)

# Initialize the Llama model through llama_cpp
lcpp_llm = Llama(
    model_path=model_path,
    n_threads=2,  # Number of CPU threads (adjust as per your system)
    n_batch=512,  # Number of prompt tokens processed per batch; lower it if memory runs out
    n_gpu_layers=10,  # Number of model layers offloaded to the GPU; adjust to the available VRAM
    n_ctx=5096,  # Context window in tokens; note that Llama 2 was trained with a 4096-token context
)

##### Utility Functions

In [None]:
# defining a function to parse the JSON output from the model
def extract_json_data(json_str):
    import json
    try:
        # Find the indices of the opening and closing curly braces
        json_start = json_str.find('{')
        json_end = json_str.rfind('}')

        if json_start != -1 and json_end != -1:
            extracted_category = json_str[json_start:json_end + 1]  # Extract the JSON object
            data_dict = json.loads(extracted_category)
            return data_dict
        else:
            print(f"Warning: JSON object not found in response: {json_str}")
            return {}
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON: {e}")
        return {}

##### Defining the response function

In [None]:
# Function to generate and return the response from the Llama 2 chat model
def response_llama(prompt, news, model):

    # System message to guide the model's behavior
    system_message = "Respond to the user question based on the user prompt"

    # Llama 2 chat template: the system message sits inside <<SYS>> tags and the
    # whole user turn (instructions plus news) is wrapped in [INST] ... [/INST]
    full_prompt = (
        f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n"
        f"{prompt}\nNews Articles: {news} [/INST]"
    )

    # Generate the response
    response = model(
        prompt=full_prompt,
        max_tokens=512,            # Upper limit on the number of generated tokens
        temperature=0.01,          # Near-zero temperature for near-deterministic output
        top_p=0.95,                # Nucleus sampling: keep tokens within 95% cumulative probability
        repeat_penalty=1.2,        # Penalize repeated tokens
        top_k=50,                  # Limit sampling to the 50 most likely tokens
        stop=['INST'],             # Stop if the model starts a new instruction block
        echo=False                 # Do not echo the input in the output
    )

    # Extract and return the response text
    response_text = response["choices"][0]["text"]
    return response_text


**Note**: Use this section to test out the prompt with one instance before using it for the entire weekly data.
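
When testing on a single instance, it can help to check that the raw response actually parses and carries the keys requested in the prompt defined in the next cell. The sketch below is one possible check; `check_response` and the sample strings are hypothetical, and the key names are taken from the example JSON in that prompt.

```python
import json

# Keys requested in the example JSON of the prompt.
EXPECTED_KEYS = ("top_positive_events", "top_negative_events", "top_10_keywords")

def check_response(raw_text):
    """Return (parsed_dict_or_None, list_of_problems) for one raw model response."""
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end <= start:
        return None, ["no JSON object found"]
    try:
        parsed = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    problems = [f"missing key: {key}" for key in EXPECTED_KEYS if key not in parsed]
    problems += [
        f"{key} is not a list"
        for key in EXPECTED_KEYS
        if key in parsed and not isinstance(parsed[key], list)
    ]
    return parsed, problems

# Hypothetical responses: one with chatter around the JSON, one without any JSON.
sample_ok = 'Sure! {"top_positive_events": ["a"], "top_negative_events": ["b"]}'
sample_bad = "I could not find any relevant events."

parsed_ok, problems_ok = check_response(sample_ok)
parsed_bad, problems_bad = check_response(sample_bad)
print(problems_ok)
print(problems_bad)
```

An empty problem list suggests the prompt is producing the intended structure before it is run on the full weekly data.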

In [None]:
prompt='''Analyze the provided news articles and do the following:

1. Identify the top three positive events that are likely to impact the stock price.

2. Identify the top three negative events that are likely to impact the stock price.

3. Select the top 10 keywords in the News

4. PROVIDE YOUR RESPONSE IN JSON FORMAT ONLY

5. Do not provide any additional content

6. Follow the output example below
7. STRICTLY PROVIDE JSON RESPONSE ONLY.

Example JSON Output:
{

  "top_positive_events": [

    "Company announces record-breaking quarterly earnings.",

    "Successful product launch with positive reviews.",

    "Strategic partnership with a major industry player."

  ],

  "top_negative_events": [

    "Cybersecurity breach exposes sensitive customer data.",

    "CEO resigns amid scandal.",

    "Economic downturn impacting industry sales."

  ],

  "top_10_keywords": [

    "company", "product", "earnings", "cybersecurity", "breach", "CEO", "recession", "economy", "sales", "partnership"

  ]

}'''

##### Checking the model output on a sample

In [None]:
#Test response for just one row
news=df_weekly['News'][1]
# print(news)
response=response_llama(prompt,news,lcpp_llm)
print(response)
df_weekly.at[1, 'Model_Response'] = response

Run for the entire data

In [None]:
# Process the remaining rows in the DataFrame
for index, record in df_weekly.iterrows():
    news = record['News']  # Extract the 'News' data from the current row
    if index == 1:
        continue  # Row 1 was already processed in the sample check above
    try:
        # Get the model's response for this week's news
        response = response_llama(prompt, news, lcpp_llm)
        df_weekly.at[index, 'Model_Response'] = response
    except Exception as e:
        print(f"Error processing row {index}: {e}")
        df_weekly.at[index, 'Model_Response'] = "Error"

# Display the updated DataFrame with the model's response as a new column
print(df_weekly.head())


In [None]:
# Save the updated DataFrame to Google Drive as a CSV file

path = '/content/drive/My Drive/AIMLTRAINING/collabdata/NLP/Response_weekly2.csv'
df_weekly.to_csv(path, index=False)

In [None]:

from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive',force_remount=True)

path = '/content/drive/My Drive/AIMLTRAINING/collabdata/NLP/Response_weekly2.csv'
df_weekly_processed=pd.read_csv(path)

In [None]:
# Drop the raw 'News' column; only the model responses are needed from here on
df_weekly_processed.drop(columns=['News'], inplace=True)


##### Checking the model output on the weekly data

In [None]:
df_weekly_processed.sample(2).T

##### Formatting the model output

In [None]:
import json
import re

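def extract_json(text):
    # Rows read back from the CSV can hold NaN (a float) instead of a string
    if not isinstance(text, str):
        print("No JSON found.")
        return None

    # Greedy pattern: matches from the first '{' to the last '}' in the text
    json_pattern = r'\{.*\}'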

    # Search for the JSON part
    match = re.search(json_pattern, text, re.DOTALL)

    if match:
        # Extract and return the JSON string
        json_string = match.group(0)

        # Convert the JSON string to a Python dictionary
        try:
            return json.loads(json_string)
        except json.JSONDecodeError:
            print("Error decoding JSON.")
            return None
    else:
        print("No JSON found.")
        return None

# Create DataFrame
df = df_weekly_processed.copy()

# Apply the extract_json function to the 'Model_Response' column and create a new 'Response' column
df['Response'] = df['Model_Response'].apply(extract_json)

# Drop the 'Model_Response' column
df.drop(columns=['Model_Response'], inplace=True)
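
The greedy pattern `\{.*\}` used in `extract_json` spans from the first `{` to the last `}`, so it fails to decode when the model appends further text containing braces after the JSON object. One possible alternative, shown here only as a sketch, is to let the standard library's `json.JSONDecoder.raw_decode` parse a single object starting at each `{` until one succeeds. The function name `extract_first_json_object` and the sample strings are hypothetical.

```python
import json


def extract_first_json_object(text):
    """Hypothetical alternative: return the first decodable JSON object in `text`, or None."""
    if not isinstance(text, str):
        return None  # e.g. NaN read back from a CSV file

    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start`
            # and ignores whatever follows it
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not a valid object here; try the next '{'
        start = text.find("{", start + 1)
    return None


# Made-up text for illustration: valid JSON followed by stray braces
messy = 'Here you go: {"top_10_keywords": ["apple", "china"]} {end of answer}'
print(extract_first_json_object(messy))
```

The trade-off is that this version returns the first object it can decode, which may not be the intended one if the model emits several objects in one response.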



In [None]:
df.sample(2).T

In [None]:
# Sample a random entry from 'Response'
sample_response = df['Response'].sample(1).values[0]

# Pretty print the JSON content
pretty_json = json.dumps(sample_response, indent=4)

# Display the pretty-printed JSON
print(pretty_json)


**Processed News Sentiment Analysis**

In [None]:
# Count the positive and negative events in each weekly response
def analyze_responses(df):
    analysis_results = []

    for response in df['Response']:
        # Rows whose JSON could not be parsed hold None; treat them as empty
        if not isinstance(response, dict):
            response = {}

        # Extract positive and negative events
        positive_events = response.get("top_positive_events", [])
        negative_events = response.get("top_negative_events", [])

        # Count the number of positive and negative events
        positive_count = len(positive_events)
        negative_count = len(negative_events)

        # Append analysis results
        analysis_results.append({
            "positive_count": positive_count,
            "negative_count": negative_count
        })

    # Add the analysis results as a new column to the dataframe
    df['Analysis'] = analysis_results

    return df


In [None]:
# Apply the analysis, then inspect the per-week event counts
df = analyze_responses(df)
df['Analysis']

**Count Most Common Topic Pairs from Positive and Negative Events**

In [None]:
import itertools
from collections import Counter

# Create a list to hold pairs of topics
topic_pairs = []

# Loop through each entry to generate topic pairs
for response in df['Response']:
    try:
        # Extract the relevant topic fields directly from the dictionary
        positive_events = response.get("top_positive_events", [])
        negative_events = response.get("top_negative_events", [])

        # Combine both positive and negative events into a single list
        all_events = positive_events + negative_events

        # Get all unique pairs of topics (sorted to avoid duplicates)
        topic_pairs.extend(itertools.combinations(sorted(all_events), 2))
    except Exception as e:
        print(f"Error processing entry: {e}")

# Count the frequency of topic pairs
pair_counter = Counter(topic_pairs)

# Display the most common topic pairs
top_pairs = pair_counter.most_common(10)

# Print the top 10 most common topic pairs
print(top_pairs)


**Insights**

- **Sentiment Distribution:** No week shows a strongly positive or strongly negative balance of events. Because the prompt asks for three positive and three negative events per week, near-equal counts are the expected outcome, so this measure alone cannot reveal a clear sentiment trend. The balance may also reflect mixed market perceptions during the given timeframe.
- **Apple's Revenue and Economic Concerns:** Apple's revenue warning is frequently paired with concerns about the global economy and stock market declines.
- **Trade Optimism vs Market Reactions:** Trade optimism is often linked to contrasting market movements, such as stock declines and oil price rebounds.
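
The sentiment-distribution insight above can be checked by labeling each week according to which event list is longer. The sketch below assumes each parsed response is a dictionary with the keys from the prompt, or `None` when parsing failed. `sentiment_balance` is a hypothetical helper and the `weeks` list is made-up data, not results from the notebook.

```python
from collections import Counter


def sentiment_balance(responses):
    """Hypothetical helper: count weeks by which event list is longer."""
    labels = []
    for response in responses:
        if not isinstance(response, dict):
            labels.append("unparsed")  # JSON extraction failed for this week
            continue
        positives = len(response.get("top_positive_events") or [])
        negatives = len(response.get("top_negative_events") or [])
        if positives > negatives:
            labels.append("positive-leaning")
        elif negatives > positives:
            labels.append("negative-leaning")
        else:
            labels.append("balanced")
    return Counter(labels)


# Made-up responses for illustration only
weeks = [
    {"top_positive_events": ["a", "b", "c"], "top_negative_events": ["d", "e", "f"]},
    {"top_positive_events": ["a"], "top_negative_events": ["d", "e"]},
    None,
]
print(sentiment_balance(weeks))
```

Since the prompt requests three events of each kind, a "balanced" label is what a compliant response produces by construction. Weeks that deviate from it are the ones worth inspecting by hand.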

**Keyword Analysis**
- Occurrences in the weekly News data

In [None]:
from collections import Counter

# Initialize a Counter to track keyword frequencies
keyword_counter = Counter()

# Loop through each entry in the column
for response in df['Response']:
    try:
        # Directly access the 'top_10_keywords' list from the dictionary
        keywords = response.get("top_10_keywords", [])

        # Update the counter with the keywords in this entry
        keyword_counter.update(keywords)
    except Exception as e:
        print(f"Error processing entry: {e}")

# Show the most common keywords
print(keyword_counter.most_common(10))


In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Sample a random entry from 'Response' column
sample_response = df['Response'].sample(1).values[0]

# Access the 'top_10_keywords' list directly from the dictionary
keywords = sample_response.get("top_10_keywords", [])

# Count the frequency of each keyword (using 1 as a dummy count for this sample)
keyword_counter = {keyword: 1 for keyword in keywords}

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400).generate_from_frequencies(keyword_counter)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')  # Turn off the axis
plt.show()


**Keyword Pair Frequency Analysis**

In [None]:
import itertools
from collections import Counter

# Create a list to hold pairs of keywords
keyword_pairs = []

# Loop through each entry to generate keyword pairs
for response in df['Response']:
    try:
        # Extract the top 10 keywords directly from the dictionary
        keywords = response.get("top_10_keywords", [])

        # Get all unique pairs of keywords (sorted to avoid duplicates)
        keyword_pairs.extend(itertools.combinations(sorted(keywords), 2))
    except Exception as e:
        print(f"Error processing entry: {e}")

# Count the frequency of keyword pairs
pair_counter = Counter(keyword_pairs)

# Display the most common keyword pairs
top_keyword_pairs = pair_counter.most_common(10)

# Print the top 10 most common keyword pairs
print(top_keyword_pairs)


**Insights on keyword occurrences:**
- **Dominance of Apple and China:** The frequent mention of "Apple" and "China" highlights their central role in the topics, indicating their significance in trade and economic discussions.
- **Tech and Trade Themes:** Keywords like "Trade," "Tech," and "Stocks" suggest a strong focus on technology and international trade, with an emphasis on market movements.
- **Apple and China are frequently mentioned together:** they form the most common keyword pair, appearing 11 times, which highlights their strong connection in the dataset.
- **"Apple", "China", and "Trade" are often paired together**, indicating a significant focus on the intersection of technology, international trade, and market dynamics.
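
Pair counts like the ones above depend on exact string matches, so "Apple" and "apple" would be counted as different keywords. A minimal sketch of a case-normalized pair count follows, assuming the same `top_10_keywords` structure as the parsed responses. `count_keyword_pairs` is a hypothetical helper and the sample data is made up.

```python
import itertools
from collections import Counter


def count_keyword_pairs(responses):
    """Hypothetical helper: count keyword pairs after lowercasing and trimming."""
    pair_counter = Counter()
    for response in responses:
        if not isinstance(response, dict):
            continue  # skip weeks whose JSON could not be parsed
        # A set removes duplicates that only differ by case or whitespace
        keywords = {str(k).strip().lower() for k in response.get("top_10_keywords") or []}
        keywords.discard("")
        # Sorting makes each pair appear in one canonical order
        pair_counter.update(itertools.combinations(sorted(keywords), 2))
    return pair_counter


# Made-up responses for illustration only
sample_responses = [
    {"top_10_keywords": ["Apple", "China", "trade"]},
    {"top_10_keywords": ["apple", "china ", "Apple"]},
    None,
]
print(count_keyword_pairs(sample_responses).most_common(3))
```

Whether lowercasing is appropriate depends on the data: it merges spelling variants, but it would also merge a ticker or acronym with an ordinary word of the same spelling.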

## **Conclusions and Recommendations**

### Conclusion: Stock Market Sentiment Analysis

#### Summary
This project aimed to analyze the relationship between news articles and stock market sentiment, leveraging natural language processing (NLP) and machine learning techniques to generate actionable insights.

#### Key Findings
- **Dominant Themes:** Business, Finance, Technology, China, and Economy were the most prevalent themes in the news articles, playing a significant role in influencing stock prices and market sentiment.
  
- **Sentiment Analysis:** News sentiment appears to influence stock prices, with negative sentiment associated with larger price movements than positive or neutral sentiment.

- **Topic Modeling:** Common topic pairs such as **Business-Finance**, **Business-Technology**, and **China-Economy** exhibit strong correlations, highlighting areas that influence investor sentiment and market dynamics. Apple and China were particularly central, with **"Apple"** and **"China"** frequently mentioned together, pointing to their significance in trade and economic discussions.

- **Model Performance:** The **Tuned GloVe** model achieved the highest F1-score on the test set (0.33) among the models compared. Although 0.33 is modest in absolute terms, the model performed consistently across the validation and test sets despite the imbalanced dataset, making it the most reliable of the models evaluated for stock market sentiment analysis.

#### Recommendations

1. **Integrate NLP into Stock Market Analysis:**
   - Leverage sentiment analysis and topic modeling to enhance stock market prediction systems. By evaluating news sentiment, you can better understand market trends and improve forecasting accuracy.

2. **Monitor Key Topic Pairs:**
   - Focus on monitoring **Business-Finance**, **Business-Technology**, and **China-Economy** as these topics strongly correlate with stock price fluctuations and investor sentiment. Tracking sentiment around these topics can provide deeper insights into market movements.

3. **Implement Sentiment Analysis for Investment Decisions:**
   - Use sentiment analysis to aid in short-term investment strategies. For instance, **positive** or **negative** sentiment in news articles can serve as early indicators for potential market changes. Investors could incorporate sentiment signals to adjust their portfolios accordingly.





