

**Phase 1: Financial News Summarization**

Objective:
To develop a system that accurately summarizes financial news articles gathered from various sources through web scraping and news APIs.

Tools and Libraries:

* Beautiful Soup for web scraping
* Scrapy for efficient web crawling
* News API for fetching financial news
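* ProsusAI FinBERT for financial sentiment classification (it is an encoder-only classifier and does not generate text, so a sequence-to-sequence model such as T5 is needed for the summaries themselves)
* Hugging Face Transformers for loading and fine-tuning the models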
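The tool list mentions News API, but the code in this phase only covers scraping. The following is a minimal sketch of the API route, assuming the `https://newsapi.org/v2/everything` endpoint and its usual JSON response with an `articles` list; the helper names `build_news_url`, `parse_articles`, and `fetch_news` are hypothetical, and the standard library is used here instead of `requests` to keep the sketch dependency-free.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; check the News API documentation for your plan's limits
NEWS_API_URL = "https://newsapi.org/v2/everything"


def build_news_url(query, api_key, page_size=20):
    """Build the request URL for a keyword search (parameter names are assumptions)."""
    params = {"q": query, "language": "en", "pageSize": page_size, "apiKey": api_key}
    return NEWS_API_URL + "?" + urllib.parse.urlencode(params)


def parse_articles(payload):
    """Convert an API response into the {'title', 'content'} dicts used by the scraper."""
    articles = []
    for item in payload.get("articles", []):
        title = (item.get("title") or "").strip()
        # Fall back to the description when the content field is empty
        content = (item.get("content") or item.get("description") or "").strip()
        if title and content:
            articles.append({"title": title, "content": content})
    return articles


def fetch_news(query, api_key):
    """Fetch and parse articles; requires network access and a valid API key."""
    with urllib.request.urlopen(build_news_url(query, api_key), timeout=10) as response:
        return parse_articles(json.load(response))
```

Because `parse_articles` returns the same dictionary shape as the scraper, API results and scraped articles can flow through the same preprocessing and summarization steps.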

Workflow Pipeline:

1. **Web Scraping**: Use Beautiful Soup and Scrapy to extract financial news articles from websites like Bloomberg, Reuters, Financial Times, and Finance.si.
2. **Article Processing**: Preprocess the extracted articles by removing special characters, HTML tags, and stopwords. Normalize the text using lowercasing and stemming.
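3. **Summarization**: Use a sequence-to-sequence model such as T5 to summarize the processed articles, and ProsusAI FinBERT to classify their sentiment (FinBERT is a classifier and cannot generate summaries on its own). Fine-tune the models on the fetched data to improve performance.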
4. **Post-processing**: Refine the generated summaries to ensure they accurately reflect the main points and essential details of the original article.
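The post-processing step is not implemented anywhere in this phase, so here is one possible sketch of it. It assumes three simple checks: whitespace cleanup, removal of repeated or cut-off sentences, and a list of numbers that appear in the summary but not in the source article. The function names are hypothetical and the sentence splitter is deliberately naive.

```python
import re

NUMBER_PATTERN = r"\d+(?:\.\d+)?"


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]


def postprocess_summary(summary, source):
    """Return (cleaned_summary, unsupported_numbers)."""
    cleaned = re.sub(r"\s+", " ", summary).strip()
    seen = set()
    kept = []
    for sentence in split_sentences(cleaned):
        if sentence[-1] not in ".!?":
            continue  # drop a trailing fragment cut off by the length limit
        key = sentence.lower()
        if key in seen:
            continue  # drop exact repeats, a common generation artifact
        seen.add(key)
        kept.append(sentence)
    result = " ".join(kept)
    # Figures that never occur in the source deserve a manual check
    source_numbers = set(re.findall(NUMBER_PATTERN, source))
    unsupported = [n for n in re.findall(NUMBER_PATTERN, result) if n not in source_numbers]
    return result, unsupported
```

For financial news the number check is the most useful part, since a summary that changes a price or a percentage is worse than a slightly awkward one.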

Implementation Steps:

```python
import os
import pandas as pd
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Step 1: Web Scraping
def scrape_news(website):
    # Fetch the page and parse it with Beautiful Soup
    response = requests.get(website, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    articles = []
    for article in soup.find_all('article'):
        # The 'h1' and 'div.body' selectors are site-specific; skip articles that lack either
        title = article.find('h1')
        content = article.find('div', class_='body')
        if title is None or content is None:
            continue
        articles.append({'title': title.text.strip(), 'content': content.text.strip()})
    return articles

# Step 2: Article Processing
def preprocess_article(article):
    # Lowercase, drop punctuation and stopwords, then stem
    # (HTML tags were already removed by .text during scraping)
    # Requires the NLTK 'punkt' and 'stopwords' data packages
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    tokens = word_tokenize(article['content'].lower())
    filtered_tokens = [token for token in tokens if token.isalnum() and token not in stop_words]
    stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
    article['content'] = ' '.join(stemmed_tokens)
    return article

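# Step 3: Summarization
def summarize_article(article, model_name='t5-small'):
    # FinBERT is a classifier and cannot generate text, so this sketch uses T5,
    # which the imports above already pull in. 't5-small' is an assumed checkpoint.
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    # T5 expects a task prefix; inputs longer than 512 tokens are truncated
    inputs = tokenizer('summarize: ' + article['content'], return_tensors='pt',
                       max_length=512, truncation=True)
    summary_ids = model.generate(inputs['input_ids'], max_length=150, num_beams=4,
                                 early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

The next cell loads the labeled data and prepares FinBERT for fine-tuning as a sentiment classifier:

```python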
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from collections import Counter
import torch.nn as nn
import torch.optim as optim
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from collections import defaultdict
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from pytz import timezone
from datetime import datetime
import time
from tqdm import tqdm

# Load the training dataset
train_data = pd.read_csv("training_dataset.csv")

# Prepare the data for modeling
train_texts = train_data["text"]
train_labels = train_data["label"]

# Split the data into training and validation sets
train_texts_train, val_texts, train_labels_train, val_labels = train_test_split(train_texts, train_labels, test_size=0.2, random_state=42)

# Create a label encoder to convert labels to numerical values
le = LabelEncoder()
train_labels_train = le.fit_transform(train_labels_train)
val_labels = le.transform(val_labels)

# Initialize the tokenizer
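# The FinBERT checkpoint on the Hugging Face Hub is published as "ProsusAI/finbert"
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
# FinBERT ships with a 3-class head; a dataset with a different number of classes
# would also need ignore_mismatched_sizes=True to reinitialize that head
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert", num_labels=len(le.classes_))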

# Function to preprocess the text data
def preprocess_text(text):
    encoding = tokenizer.encode_plus(
        text,
        max_length=512,
        padding="max_length",
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt',
    )
    
    return {
        'input_ids': encoding['input_ids'].flatten(),
        'attention_mask': encoding['attention_mask'].flatten()
    }

```

Running the code above raised the following error:

`ModuleNotFoundError: No module named 'transformers'`

It seems that the `transformers` library is not installed. To install it, run the following command in your terminal:

```
pip install transformers
```

Also make sure the remaining libraries and the NLTK data packages are installed. The FinBERT weights do not need a separate download step: `from_pretrained` fetches them from the Hugging Face Hub the first time it runs.

```bash
pip install torch sentencepiece nltk scikit-learn pandas beautifulsoup4 requests
pip install matplotlib seaborn sentence-transformers pytz tqdm
python -m nltk.downloader punkt punkt_tab stopwords vader_lexicon
```

Once you've installed the necessary libraries and downloaded the models, you can continue with the rest of the code.
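To avoid hitting another `ModuleNotFoundError` halfway through a run, it can help to check for the required packages up front. This is a small sketch using only the standard library; the `REQUIRED_MODULES` list is an assumption based on the imports used in this phase, and `missing_modules` is a hypothetical helper name.

```python
import importlib.util

# Top-level import names (not pip package names) used by the code in this phase
REQUIRED_MODULES = ["transformers", "torch", "nltk", "bs4", "sklearn", "pandas"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```

Note that import names and pip package names can differ: `bs4` is installed as `beautifulsoup4` and `sklearn` as `scikit-learn`.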

Now, let's continue with the code. Here's the rest of the implementation:

```python
# Preprocess the training data
# Each example is a dict holding input_ids, attention_mask, and its label
train_examples = []
for text, label in zip(train_texts_train, train_labels_train):
    encoded_dict = preprocess_text(text)
    encoded_dict['labels'] = torch.tensor(label, dtype=torch.long)
    train_examples.append(encoded_dict)

# Preprocess the validation data
val_examples = []
for text, label in zip(val_texts, val_labels):
    encoded_dict = preprocess_text(text)
    encoded_dict['labels'] = torch.tensor(label, dtype=torch.long)
    val_examples.append(encoded_dict)

# Create a data loader for the training data
# A plain list works as a map-style dataset; the default collate function
# stacks each field of the dicts into a batch tensor
train_loader = torch.utils.data.DataLoader(train_examples, batch_size=16, shuffle=True)

# Create a data loader for the validation data
val_loader = torch.utils.data.DataLoader(val_examples, batch_size=16, shuffle=False)

# Set the device (CPU or GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model to the device
model.to(device)

# Define the optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss()

# Train the model
model.train()
for epoch in range(5):  # Loop over the dataset multiple times

    running_loss = 0.0
    for i, batch in enumerate(train_loader):
        # Each batch is a dict of tensors; move every field to the device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        # The model returns an output object; the class scores are in .logits
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_fn(outputs.logits, labels)
        loss.backward()
        optimizer.step()

        # print statistics every 200 batches
        running_loss += loss.item()
        if i % 200 == 199:
            print(f'Epoch {epoch+1}, Batch {i+1}: Loss = {running_loss / 200}')
            running_loss = 0.0
```

Here is an updated version of the preprocessing and training code. It builds the batches with `TensorDataset`, logs the loss every 100 batches, and evaluates on the validation set after each epoch.

```python
# Preprocess the training data
# preprocess_text already returns tensors, so the input IDs can be collected directly
inputs = [preprocess_text(text)['input_ids'] for text in train_texts_train]
labels = torch.tensor(train_labels_train, dtype=torch.long)

# Preprocess the validation data
# Keep the encoded labels in val_labels and build a separate tensor from them
val_inputs = [preprocess_text(text)['input_ids'] for text in val_texts]
val_label_tensor = torch.tensor(val_labels, dtype=torch.long)

# Create a data loader for the training data
batch_size = 16
train_dataset = torch.utils.data.TensorDataset(torch.stack(inputs), labels)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

# Create a data loader for the validation data
val_dataset = torch.utils.data.TensorDataset(torch.stack(val_inputs), val_label_tensor)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

# Set the device (CPU or GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model to the device
model.to(device)

# Define the optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss()

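# Train the model
for epoch in range(5):  # Loop over the dataset multiple times
    model.train()  # Re-enable dropout after the previous epoch's evaluation
    running_loss = 0.0
    for i, batch in enumerate(train_loader):

        # Get the inputs and labels for the current batch
        input_ids, labels = batch

        # Move the inputs and labels to the device
        input_ids, labels = input_ids.to(device), labels.to(device)

        # Mask out padding tokens so attention ignores them
        attention_mask = (input_ids != tokenizer.pad_token_id).long()

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass (the class scores are in outputs.logits)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_fn(outputs.logits, labels)

        # Backward pass
        loss.backward()

        # Update the parameters
        optimizer.step()

        # Print the loss for every 100 batches
        running_loss += loss.item()
        if i % 100 == 99:
            print(f'Epoch {epoch+1}, Batch {i+1}: Loss = {running_loss / 100}')
            running_loss = 0.0

    # Evaluate the model on the validation set for every epoch
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for input_ids, labels in val_loader:
            input_ids, labels = input_ids.to(device), labels.to(device)
            attention_mask = (input_ids != tokenizer.pad_token_id).long()
            logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
            correct += (logits.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    print(f'Epoch {epoch+1}: Validation accuracy = {correct / total:.4f}')
```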

This code snippet fine-tunes the sentiment classifier on the provided data and evaluates it on the validation set. It is a simplified example and may need to be adapted to your specific requirements and dataset.

Regarding the sentiment analysis part, we can add the following code to calculate the sentiment intensity scores using the VADER lexicon:

```python
# Calculate sentiment intensity scores
# Requires the VADER lexicon: nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

# Build dataframes from the training and validation splits created earlier
train_df = pd.DataFrame({'text': train_texts_train})
val_df = pd.DataFrame({'text': val_texts})

# Add the compound score (-1 is most negative, +1 is most positive) to each dataframe
train_df['sentiment'] = [sia.polarity_scores(text)['compound'] for text in train_df['text']]

# Repeat the same process for the validation set
val_df['sentiment'] = [sia.polarity_scores(text)['compound'] for text in val_df['text']]
```

Please note that this is just an example and might need to be adjusted according to your specific use case.
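The compound score is continuous, while the classifier above works with discrete labels. A common convention for VADER is to treat scores of 0.05 and above as positive and -0.05 and below as negative; the sketch below applies that rule so the two can be compared. The threshold is a tunable assumption and the function names are hypothetical.

```python
from collections import Counter


def label_from_compound(score, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a discrete sentiment label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


def label_distribution(scores, threshold=0.05):
    """Count how many scores fall into each sentiment label."""
    return Counter(label_from_compound(score, threshold) for score in scores)
```

With the dataframes from the previous block, `label_distribution(train_df['sentiment'])` gives a quick count per label that can be checked against the class balance of the training labels.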

For the visualization part, you can use the following code to plot a histogram of the sentiment scores in the training and validation sets:

```python
# Plot the sentiment distribution
# stat='density' puts the histogram on the same scale as the KDE curve
plt.figure(figsize=(10, 6))
sns.histplot(train_df['sentiment'], bins=50, stat='density', alpha=0.7, color='skyblue', edgecolor='black')
sns.kdeplot(train_df['sentiment'], ax=plt.gca(), color='red', lw=3)
plt.title('Sentiment Distribution in Training Set')
plt.show()

plt.figure(figsize=(10, 6))
sns.histplot(val_df['sentiment'], bins=50, stat='density', alpha=0.7, color='skyblue', edgecolor='black')
sns.kdeplot(val_df['sentiment'], ax=plt.gca(), color='red', lw=3)
plt.title('Sentiment Distribution in Validation Set')
plt.show()
```

This code creates two histograms showing the distribution of sentiment intensities in the training and validation sets. The red curve represents the kernel density estimate of the sentiment distribution.