# 1. Introduction
## 1.1 Stock Sentiment Analysis Using Financial News
Sentiment analysis of Indian stock market news draws on headlines or articles from reputed sources such as **Economic Times, Financial Express, or Bloomberg Quint**. The goal is to process that text, extract its sentiment, and analyze how the sentiment relates to stock prices.

## 1.2 Steps Involved
1. **Data Collection:** Collect financial news data (headlines/articles) from reliable sources using APIs, web scraping, or publicly available datasets.
2. **Preprocessing:** Clean and preprocess the text data.
3. **Sentiment Analysis:** Use natural language processing (NLP) models to analyze the sentiment (positive, negative, or neutral) of the news.
4. **Stock Analysis:** Correlate the sentiment with stock market trends (e.g., stock price changes).
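
As a small illustration of step 3, the sketch below maps a numeric sentiment score to one of the three labels. It assumes the model emits a single score in the range [-1, 1], as VADER's compound score does; the function name `label_from_score` and the 0.05 cut-off are assumptions made for this example (the cut-off is a convention often used with VADER), not something fixed by this notebook.

```python
def label_from_score(score, threshold=0.05):
    """Map a sentiment score in [-1, 1] to a label.

    The threshold is an assumed cut-off; tune it for the model in use.
    """
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical scores, for illustration only
for score in (0.62, -0.40, 0.01):
    print(score, "->", label_from_score(score))
```

Scores close to zero fall into the neutral band, so widening the threshold makes the classifier more conservative about calling a headline positive or negative.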

## 1.3 Implementation Requirements
### Libraries:
* `pandas`, `nltk`, `transformers`, `beautifulsoup4`, and `requests` (the last two for scraping); a news API can replace scraping for data gathering.
* **Sentiment analysis models:** Traditional models like **VADER** or advanced models like **BERT** for financial sentiment.
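
Before running the notebook, it can help to confirm that the listed packages are installed. The following is a minimal sketch using only the standard library; the helper name `check_dependencies` and the `REQUIRED` mapping are assumptions introduced here. Note that the `beautifulsoup4` package is imported under the module name `bs4`.

```python
from importlib.util import find_spec

# Maps the pip package name to the module name used in import statements.
REQUIRED = {
    "pandas": "pandas",
    "nltk": "nltk",
    "transformers": "transformers",
    "beautifulsoup4": "bs4",
    "requests": "requests",
}


def check_dependencies(required=REQUIRED):
    """Return {package_name: True/False} depending on whether it is importable."""
    return {pkg: find_spec(module) is not None for pkg, module in required.items()}


status = check_dependencies()
missing = [pkg for pkg, available in status.items() if not available]
print("Missing packages:", missing or "none")
```

Any package reported as missing can be installed with `pip install <package>` before continuing.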

# 2. Import libraries
These libraries are required to scrape webpages (stock news, blogs, articles, etc.) and convert the extracted content into a structured format.


In [1]:
# for sending HTTP requests; it allows us to access and retrieve data from the web
import requests

# for parsing HTML; it helps us parse and navigate HTML, making it easy to extract data like news headlines.
from bs4 import BeautifulSoup

# for data manipulation; it allows us to organize and manipulate this data in tabular form
import pandas as pd

# Comprehensive library for natural language processing tasks, where text data needs to be cleaned, tokenized, and transformed.
import nltk

# helps to remove stopwords like 'and', 'the', 'is', which do not contribute much to sentiment
from nltk.corpus import stopwords

# breaking text into individual words (tokenization).
from nltk.tokenize import word_tokenize

# Regular expression operations from the 're' module, to check if a string contains the specified search pattern.
# https://regex101.com/
import re

# 3. Data Collection
## Step 1: Data collection with web scraping
* Collect financial news data, scraping from **Economic Times** as an example.
* Send an HTTP GET request to the Economic Times website to retrieve the page’s HTML content, then parse it with `BeautifulSoup`.

In [2]:
url = "https://economictimes.indiatimes.com/markets/stocks"

# Send an HTTP GET request to the Economic Times stocks page and retrieve its HTML.
# The timeout keeps the call from hanging indefinitely if the server does not respond.
response = requests.get(url, timeout=10)

# response is then passed to 'BeautifulSoup' for parsing so that we can access individual HTML elements, such as the headlines.
soup = BeautifulSoup(response.content, "html.parser")
# soup
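
A plain GET request can fail without raising an exception, for example when the site returns an error page or rejects requests that lack a browser-like `User-Agent`. The sketch below is one more defensive way to fetch the page. The helper name `fetch_html`, the header value, and the 10-second timeout are assumptions chosen for illustration; whether this site needs a custom header is not established here.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the raw HTML bytes for `url`, raising an exception on HTTP errors."""
    http = session or requests.Session()
    # Assumed header value; some sites reject requests without a User-Agent.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; news-sentiment-notebook)"}
    response = http.get(url, headers=headers, timeout=timeout)
    # Turn 4xx/5xx responses into requests.HTTPError instead of parsing an error page.
    response.raise_for_status()
    return response.content


# Usage (requires network access):
# html = fetch_html("https://economictimes.indiatimes.com/markets/stocks")
# soup = BeautifulSoup(html, "html.parser")
```

Accepting an optional `session` makes the function easy to exercise with a stub object, so the parsing logic can be tested without touching the network.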

* Extract all the **h3** tags from the parsed HTML content.
* On most news websites, headlines are wrapped in specific HTML tags such as **h3**; extracting these tags yields the headline text.
* The same approach works for any other HTML tag that contains the information of interest (e.g., headlines, articles); the right tag depends on the site's structure.


In [3]:
# Find and extract all headline elements (this depends on the website structure)
headlines = soup.find_all('h3')  # Modify tag/structure based on source

# Extract the text from the headlines
news_data = []
for headline in headlines:
    news_data.append(headline.get_text())
# news_data
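
Text returned by `get_text()` keeps whatever newlines and indentation surround it in the page markup. The following sketch shows one way to normalize that whitespace and skip empty tags during extraction. It runs on a small inline HTML sample rather than the live page; `sample_html` and the helper name `extract_headlines` are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Inline sample standing in for a real page (hypothetical markup)
sample_html = """
<div>
  <h3>BULL'S EYE</h3>
  <h3>
      Must Watch
  </h3>
  <h3>   </h3>
</div>
"""


def extract_headlines(html, tag="h3"):
    """Return whitespace-normalized, non-empty text from every `tag` element."""
    soup = BeautifulSoup(html, "html.parser")
    cleaned = []
    for node in soup.find_all(tag):
        # split() + join collapses newlines and repeated spaces into single spaces
        text = " ".join(node.get_text().split())
        if text:
            cleaned.append(text)
    return cleaned


headlines = extract_headlines(sample_html)
print(headlines)
```

The `tag` parameter can be changed when a source wraps its headlines in a different element.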

## Step 2: Data organization using **Pandas** dataframe
* Convert the list of headlines into a pandas `DataFrame` with a single column named **“Headline”**.
* The tabular form makes the data easier to organize, manipulate, and analyze with pandas.


In [4]:
news_df = pd.DataFrame(news_data, columns=["Headline"])
print(news_df.head())

                                            Headline
0                                         BULL'S EYE
1  \n                            Must Watch\n    ...
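
The second row of the output above still carries newlines and padding from the page markup. If that whitespace is not removed at extraction time, it can be tidied inside the DataFrame. The sketch below is one possible approach on a small hand-written list; the sample values and the name `tidy_df` are assumptions for illustration.

```python
import pandas as pd

# Hand-written sample mimicking scraped text (illustration only)
raw_headlines = ["BULL'S EYE", "\n     Must Watch\n   ", "", "BULL'S EYE"]
tidy_df = pd.DataFrame(raw_headlines, columns=["Headline"])

# Collapse runs of whitespace into one space and trim both ends
tidy_df["Headline"] = (
    tidy_df["Headline"].str.replace(r"\s+", " ", regex=True).str.strip()
)

# Drop empty rows and exact duplicates, then renumber the index
tidy_df = (
    tidy_df[tidy_df["Headline"] != ""]
    .drop_duplicates(subset="Headline")
    .reset_index(drop=True)
)
print(tidy_df)
```

Removing duplicates matters later on, because a headline repeated in several page sections would otherwise be counted more than once in any sentiment aggregate.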


# 4. Preprocessing the Text Data
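Before cleaning the headlines, download the NLTK resources that the preprocessing function relies on: the English stopword list and the `punkt` tokenizer models.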

In [5]:
# Download stopwords once (if not already installed)
nltk.download('stopwords')

# Download punkt tokenizer once (if not already installed)
# Note: newer NLTK releases may also require nltk.download('punkt_tab') for word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## 4.1 Function for text preprocessing
* Clean the text by removing unwanted characters, converting it to lowercase, and removing stopwords.
* This reduces noise in the data and leaves the headlines cleaned and tokenized, ready for sentiment analysis.


In [6]:
def preprocess_text(text):

    # Replace punctuation and other special characters with spaces, then collapse extra whitespace.
    # Patterns can be tested at https://regex101.com/

    # "\W" matches any non-word character (equivalent to [^a-zA-Z0-9_] for ASCII text), including newlines;
    # digits and underscores are word characters, so they are kept.
    text = re.sub(r'\W', ' ', text)

    # "\s+" matches one or more whitespace characters (spaces, tabs, newlines); each run is replaced by a single space.
    text = re.sub(r'\s+', ' ', text)

    # Convert to lowercase
    text = text.lower()

    # Tokenize and remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text)
    filtered_tokens = [word for word in tokens if word not in stop_words]

    return ' '.join(filtered_tokens)
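
Because `\W` treats digits and underscores as word characters, the two substitutions in the function above leave them in the text. The sketch below isolates the regex steps (without tokenization or stopword removal) and contrasts them with a stricter, letters-only variant. The helper names `basic_clean` and `letters_only_clean` and the sample string are assumptions made for this example.

```python
import re


def basic_clean(text):
    """Same regex steps as preprocess_text: keeps digits and underscores."""
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.lower().strip()


def letters_only_clean(text):
    """Stricter variant: keeps only ASCII letters and whitespace."""
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.lower().strip()


sample = "Nifty_50 up 1.2%!\nSensex"
print(basic_clean(sample))         # nifty_50 up 1 2 sensex
print(letters_only_clean(sample))  # nifty up sensex
```

Which variant is preferable depends on the downstream model: figures such as percentage moves can carry signal in financial headlines, so dropping digits is a trade-off rather than a default.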


## 4.2 Apply preprocessing to the news headlines
* Apply the **preprocess_text** function to every row of the **Headline** column.
* Store the cleaned results in a new column, **Cleaned_Headline**, for later use in sentiment analysis.


In [7]:
news_df['Cleaned_Headline'] = news_df['Headline'].apply(preprocess_text)
print(news_df.head())

                                            Headline Cleaned_Headline
0                                         BULL'S EYE         bull eye
1  \n                            Must Watch\n    ...       must watch
