# Analyzing the Impact of Qualitative Financial News on Quantitative Stock Market Movements

## Project Overview
This project explores how financial news sentiment influences stock market movements by combining the **FNSPID dataset** with the **FinBERT** model.

- **FNSPID Dataset**:  
  - 29.7 million stock prices  
  - 15.7 million financial news articles  
  - 4,775 S&P 500 companies (1999–2023)  
  - News sourced from four major stock market websites  
  - Includes labeled sentiment information  

- **FinBERT Model**:
  - Fine-tuned version of BERT for financial sentiment analysis
  - Trained on the Financial PhraseBank
  - Outputs a probability for each of three sentiment classes: **positive**, **negative**, and **neutral**

The project aims to connect **qualitative news sentiment** with **quantitative stock price movements** by processing large-scale data, analyzing correlations, and building predictive models.
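
To make the FinBERT description above concrete, the following is a minimal sketch of scoring a batch of texts with the Hugging Face Transformers library. The checkpoint name `ProsusAI/finbert` and the helper names `softmax` and `finbert_probabilities` are assumptions for illustration, not necessarily what this project used. The heavy imports are deferred so that defining the functions has no side effects; calling `finbert_probabilities` downloads model weights on first use.

```python
import math


def softmax(logits):
    """Turn raw model logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def finbert_probabilities(texts, model_name="ProsusAI/finbert"):
    """Return one {label: probability} dict per input text.

    model_name is an assumed checkpoint; swap in the one you actually use.
    """
    # Imported lazily: both packages are large and only needed here.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    batch = tokenizer(
        list(texts), padding=True, truncation=True, max_length=512, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(**batch).logits

    results = []
    for row in logits.tolist():
        probs = softmax(row)
        # Label names come from the checkpoint's own config, not hard-coded.
        results.append({model.config.id2label[i]: p for i, p in enumerate(probs)})
    return results
```

A call such as `finbert_probabilities(["Tesla shares fell after the recall."])` would return a list with one dictionary mapping each label to its probability.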

---

## Machine Learning Environment
The project was developed on the **Learn Constructor Cloud Platform** with the following configuration:

- **Python version**: 3.9
- **Deep learning framework**: PyTorch 2.0
- **Hardware**:
  - CPU: 6 cores
  - RAM: 80 GB
  - GPU: Nvidia A100 (40 GB, 3g partition)
- **CUDA drivers** installed for GPU acceleration
- Pre-installed machine learning libraries (NumPy, Pandas, Scikit-Learn, Transformers, etc.)
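
A quick way to confirm that an environment matches the configuration listed above is to print the interpreter and library versions. This is a small sketch, not part of the original project; the function name `describe_environment` and the default package list are assumptions.

```python
import importlib
import platform


def describe_environment(packages=("numpy", "pandas", "sklearn", "transformers", "torch")):
    """Report the Python version and the version of each importable package."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None  # package is not installed
            continue
        report[name] = getattr(module, "__version__", "unknown")
        if name == "torch":
            # True only when PyTorch can see a CUDA-capable GPU.
            report["cuda_available"] = module.cuda.is_available()
    return report
```

Printing the items of `describe_environment()` in a notebook cell gives a compact record of the setup to keep alongside the results.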

---

## Data Acquisition
The following datasets were downloaded from Hugging Face:

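**Stock price data** (`full_history.zip`) and **stock news data** (`nasdaq_exteral_data.csv`). The final command installs cuDF, a GPU DataFrame library, rather than a dataset:

```bash
# Stock price data
wget https://huggingface.co/datasets/Zihan1004/FNSPID/resolve/main/Stock_price/full_history.zip
# Stock news data
wget https://huggingface.co/datasets/Zihan1004/FNSPID/resolve/main/Stock_news/nasdaq_exteral_data.csv
# cuDF (CUDA 12 build) for GPU-accelerated DataFrames
pip install cudf-cu12 --extra-index-url=https://pypi.nvidia.com
```
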
In [1]:
import zipfile
import os

zip_path = "full_history.zip"  
extract_to = "Price_History"  

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_to)

os.remove(zip_path)

print("Extraction complete and ZIP file removed!")

Extraction complete and ZIP file removed!
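
The per-ticker news files read in later cells are not produced by the extraction step itself. One way such a file could be derived from the large news CSV is a chunked filter, sketched below. This is an assumption about the workflow, not the author's script: the function name `filter_news_by_ticker` and the chunk size are hypothetical, the sample rows are made up, and the column name `Stock_symbol` is taken from the DataFrame previews in this document.

```python
import io

import pandas as pd


def filter_news_by_ticker(csv_source, ticker, chunksize=100_000):
    """Stream a large news CSV and keep only the rows for one ticker."""
    kept = []
    # With chunksize set, read_csv yields DataFrames instead of loading the whole file.
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        kept.append(chunk[chunk["Stock_symbol"] == ticker])
    return pd.concat(kept, ignore_index=True)


# Tiny in-memory stand-in for the news file (made-up rows).
sample = io.StringIO(
    "Date,Article_title,Stock_symbol\n"
    "2022-05-02 00:00:00 UTC,Example headline A,TSLA\n"
    "2022-05-02 00:00:00 UTC,Example headline B,BA\n"
    "2022-05-03 00:00:00 UTC,Example headline C,TSLA\n"
)
tsla_news = filter_news_by_ticker(sample, "TSLA", chunksize=2)
print(len(tsla_news))
```

Reading in chunks keeps memory use bounded, which matters for a file with millions of articles.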


In [1]:
import pandas as pd

# Define the file path
file_path = 'cleaned_data/TSLA/TSLA_with_articles.csv'

# Read the CSV file into a DataFrame
df = pd.read_csv(file_path)
df.info()


FileNotFoundError: [Errno 2] No such file or directory: 'cleaned_data/TSLA/TSLA_with_articles.csv'

In [3]:
df.head()

| | Date | Article_title | Stock_symbol | Url | Lsa_summary | Luhn_summary | Textrank_summary | Lexrank_summary |
|---|---|---|---|---|---|---|---|---|
| 0 | 2022-05-02 00:00:00 UTC | US STOCKS-Wall Street up before Fed meet as te... | TSLA | https://www.nasdaq.com/articles/us-stocks-wall... | After spending much of the day in the red, Tes... | After spending much of the day in the red, Tes... | After spending much of the day in the red, Tes... | After spending much of the day in the red, Tes... |
| 1 | 2022-05-02 00:00:00 UTC | Supply chains snarl Taiwan tech firms as some ... | TSLA | https://www.nasdaq.com/articles/supply-chains-... | While AUO supplies displays for top carmakers,... | While AUO supplies displays for top carmakers,... | While AUO supplies displays for top carmakers,... | While AUO supplies displays for top carmakers,... |
| 2 | 2022-05-02 00:00:00 UTC | Twitter estimates spam, fake accounts comprise... | TSLA | https://www.nasdaq.com/articles/twitter-estima... | The disclosure came days after Tesla Inc TSLA.... | The disclosure came days after Tesla Inc TSLA.... | The disclosure came days after Tesla Inc TSLA.... | The disclosure came days after Tesla Inc TSLA.... |
| 3 | 2022-05-02 00:00:00 UTC | EXCLUSIVE-Musk seeks to put in less money in n... | TSLA | https://www.nasdaq.com/articles/exclusive-musk... | Yet most of his wealth is tied up in the share... | Yet most of his wealth is tied up in the share... | Yet most of his wealth is tied up in the share... | Yet most of his wealth is tied up in the share... |
| 4 | 2022-05-02 00:00:00 UTC | EXCLUSIVE-Musk in talks for new Twitter financ... | TSLA | https://www.nasdaq.com/articles/exclusive-musk... | Musk has also pledged some of this Tesla Inc T... | Musk has also pledged some of this Tesla Inc T... | Musk has also pledged some of this Tesla Inc T... | Musk has also pledged some of this Tesla Inc T... |


To run the FinBERT sentiment analysis script, create a conda environment, install the dependencies, and run the script:

> conda create -n finbert python=3.9
> conda activate finbert
> pip install transformers torch pandas
> python finbert_sentiment_analysis.py



In [6]:
import pandas as pd

# Define the file path
file_path = 'sentiment_scores/articles/TSLA/TSLA_with_full_article_sentiment.csv'

# Read the CSV file into a DataFrame
df = pd.read_csv(file_path)
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9308 entries, 0 to 9307
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Date              9308 non-null   object 
 1   Article_title     9308 non-null   object 
 2   Stock_symbol      9308 non-null   object 
 3   Url               9308 non-null   object 
 4   Lsa_summary       9308 non-null   object 
 5   Luhn_summary      9308 non-null   object 
 6   Textrank_summary  9308 non-null   object 
 7   Lexrank_summary   9308 non-null   object 
 8   sentiment_label   9308 non-null   object 
 9   sentiment_score   9308 non-null   float64
 10  raw_score         9308 non-null   float64
dtypes: float64(2), object(9)
memory usage: 800.0+ KB


In [7]:
df.head(5)

| | Date | Article_title | Stock_symbol | Url | Lsa_summary | Luhn_summary | Textrank_summary | Lexrank_summary | sentiment_label | sentiment_score | raw_score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-05-02 00:00:00 UTC | US STOCKS-Wall Street up before Fed meet as te... | TSLA | https://www.nasdaq.com/articles/us-stocks-wall... | After spending much of the day in the red, Tes... | After spending much of the day in the red, Tes... | After spending much of the day in the red, Tes... | After spending much of the day in the red, Tes... | negative | -0.608806 | 0.608806 |
| 1 | 2022-05-02 00:00:00 UTC | Supply chains snarl Taiwan tech firms as some ... | TSLA | https://www.nasdaq.com/articles/supply-chains-... | While AUO supplies displays for top carmakers,... | While AUO supplies displays for top carmakers,... | While AUO supplies displays for top carmakers,... | While AUO supplies displays for top carmakers,... | negative | -0.949896 | 0.949896 |
| 2 | 2022-05-02 00:00:00 UTC | Twitter estimates spam, fake accounts comprise... | TSLA | https://www.nasdaq.com/articles/twitter-estima... | The disclosure came days after Tesla Inc TSLA.... | The disclosure came days after Tesla Inc TSLA.... | The disclosure came days after Tesla Inc TSLA.... | The disclosure came days after Tesla Inc TSLA.... | negative | -0.923227 | 0.923227 |
| 3 | 2022-05-02 00:00:00 UTC | EXCLUSIVE-Musk seeks to put in less money in n... | TSLA | https://www.nasdaq.com/articles/exclusive-musk... | Yet most of his wealth is tied up in the share... | Yet most of his wealth is tied up in the share... | Yet most of his wealth is tied up in the share... | Yet most of his wealth is tied up in the share... | neutral | 0.0 | 0.88253 |
| 4 | 2022-05-02 00:00:00 UTC | EXCLUSIVE-Musk in talks for new Twitter financ... | TSLA | https://www.nasdaq.com/articles/exclusive-musk... | Musk has also pledged some of this Tesla Inc T... | Musk has also pledged some of this Tesla Inc T... | Musk has also pledged some of this Tesla Inc T... | Musk has also pledged some of this Tesla Inc T... | neutral | 0.0 | 0.811385 |
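
The rows above suggest how the two score columns relate: `raw_score` looks like the probability of the predicted label, and `sentiment_score` is that value with a sign (positive for `positive`, negative for `negative`, zero for `neutral`). Below is a minimal sketch of that mapping, inferred from the sample output rather than taken from `finbert_sentiment_analysis.py`; the function name `signed_sentiment` is hypothetical.

```python
def signed_sentiment(label, raw_score):
    """Map a predicted label and its confidence to a signed score."""
    signs = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}
    if label not in signs:
        raise ValueError(f"unexpected label: {label!r}")
    return signs[label] * raw_score


# Values copied from the preview above.
print(signed_sentiment("negative", 0.608806))  # -0.608806
print(signed_sentiment("neutral", 0.88253))    # 0.0
```

Setting neutral articles to zero means they pull a daily average toward zero without contributing a direction of their own.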


In [None]:
merged_data_TSLA.csv

In [24]:
import pandas as pd

# Define the file path
file_path = 'merged_data_BA.csv'

# Read the CSV file into a DataFrame
df = pd.read_csv(file_path)
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 916 entries, 0 to 915
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           916 non-null    object 
 1   close          916 non-null    float64
 2   volume         916 non-null    int64  
 3   log_return     916 non-null    float64
 4   simple_return  916 non-null    float64
 5   avg_sentiment  916 non-null    float64
 6   article_count  916 non-null    int64  
dtypes: float64(4), int64(2), object(1)
memory usage: 50.2+ KB
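
The column layout above (one row per trading day, with `avg_sentiment` and `article_count`) is consistent with aggregating article-level scores by date and joining them to daily prices. A hedged sketch of such a step follows; the mean as the aggregation, the inner join, and the function name `merge_daily_sentiment` are assumptions, and the sample rows are made up.

```python
import pandas as pd


def merge_daily_sentiment(prices, articles):
    """Aggregate article sentiment per day and join it onto daily prices."""
    articles = articles.copy()
    # Article dates look like "2022-05-02 00:00:00 UTC"; keep only the calendar day.
    articles["Date"] = pd.to_datetime(articles["Date"], utc=True).dt.strftime("%Y-%m-%d")
    daily = (
        articles.groupby("Date")["sentiment_score"]
        .agg(avg_sentiment="mean", article_count="count")
        .reset_index()
    )
    # Inner join: keep only trading days that have at least one article.
    return prices.merge(daily, on="Date", how="inner")


# Made-up sample data for illustration.
prices = pd.DataFrame(
    {"Date": ["2022-05-02", "2022-05-03", "2022-05-04"], "close": [100.0, 101.0, 99.0]}
)
articles = pd.DataFrame(
    {
        "Date": ["2022-05-02 00:00:00 UTC", "2022-05-02 00:00:00 UTC", "2022-05-04 00:00:00 UTC"],
        "sentiment_score": [0.5, -0.1, 0.0],
    }
)
merged = merge_daily_sentiment(prices, articles)
print(merged)
```

An inner join drops days without news; a left join with a fill value would keep them, which changes what `article_count` and any later correlation mean.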


In [25]:
df

| | Date | close | volume | log_return | simple_return | avg_sentiment | article_count |
|---|---|---|---|---|---|---|---|
| 0 | 2018-05-24 | 359.000000 | 3932100 | -0.000585 | -0.000585 | 0.338118 | 5 |
| 1 | 2018-05-25 | 360.089996 | 2463700 | 0.003032 | 0.003036 | 0.519751 | 6 |
| 2 | 2018-05-29 | 352.480011 | 4249600 | -0.021360 | -0.021134 | 0.000000 | 2 |
| 3 | 2018-05-30 | 358.190002 | 2825200 | 0.016070 | 0.016199 | 0.181975 | 4 |
| 4 | 2018-05-31 | 352.160004 | 4406300 | -0.016978 | -0.016835 | 0.243231 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 911 | 2023-12-07 | 237.330002 | 6363700 | 0.001856 | 0.001857 | -0.193893 | 8 |
| 912 | 2023-12-11 | 248.080002 | 7545000 | 0.013718 | 0.013813 | 0.040945 | 15 |
| 913 | 2023-12-12 | 248.630005 | 5719200 | 0.002215 | 0.002217 | 0.093797 | 12 |
| 914 | 2023-12-13 | 250.910004 | 5513400 | 0.009128 | 0.009170 | 0.076529 | 7 |
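
The `simple_return` and `log_return` values shown above are consistent with the usual definitions applied to consecutive closes: `close[t] / close[t-1] - 1` for the simple return and `ln(close[t] / close[t-1])` for the log return. The short check below recomputes them from the first three closes in the table; it assumes the returns were derived from `close` in this way, and the function name `add_returns` is hypothetical.

```python
import numpy as np
import pandas as pd


def add_returns(prices):
    """Add simple and log returns computed from consecutive closes."""
    out = prices.copy()
    out["simple_return"] = out["close"].pct_change()
    out["log_return"] = np.log(out["close"]).diff()
    return out


# First three closes from the table above.
ba = pd.DataFrame(
    {
        "Date": ["2018-05-24", "2018-05-25", "2018-05-29"],
        "close": [359.000000, 360.089996, 352.480011],
    }
)
ba_returns = add_returns(ba)
print(ba_returns.round(6))
```

The first row has no previous close, so both returns are `NaN` there. For small moves the two measures nearly coincide, which is why the columns differ only in the later decimal places.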


- **AAPL**: Apple Inc.
- **GOOG / GOOGL**: Alphabet Inc. (Google)
- **MSFT**: Microsoft Corporation
- **AMZN**: Amazon.com, Inc.
- **TSLA**: Tesla, Inc.
- **BRK.A / BRK.B**: Berkshire Hathaway
- **FB (now META)**: Meta Platforms, Inc.
- **NVDA**: NVIDIA Corporation
- **INTC**: Intel Corporation
- **SPY**: SPDR S&P 500 ETF Trust (a popular ETF)
- **AMD**: Advanced Micro Devices, Inc.
- **WMT**: Walmart Inc.
- **DIS**: The Walt Disney Company
- **IBM**: International Business Machines Corporation
- **BA**: The Boeing Company
- **JNJ**: Johnson & Johnson
- **V**: Visa Inc.
- **UNH**: UnitedHealth Group Incorporated

## Question 2

In [3]:
import pandas as pd

# Define the file path
file_path = 'merged_data_TSLA.csv'

# Read the CSV file into a DataFrame
df = pd.read_csv(file_path)
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7927 entries, 0 to 7926
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Date            7927 non-null   object 
 1   close           7927 non-null   float64
 2   volume          7927 non-null   int64  
 3   log_return      7927 non-null   float64
 4   simple_return   7927 non-null   float64
 5   Article_title   7927 non-null   object 
 6   Stock_symbol    7927 non-null   object 
 7   title_label     7927 non-null   object 
 8   title_score     7927 non-null   float64
 9   lsa_label       7927 non-null   object 
 10  lsa_score       7927 non-null   float64
 11  luhn_label      7927 non-null   object 
 12  luhn_score      7927 non-null   float64
 13  textrank_label  7927 non-null   object 
 14  textrank_score  7927 non-null   float64
 15  lexrank_label   7927 non-null   object 
 16  lexrank_score   7927 non-null   float64
dtypes: float64(8), int64(1), object(8)

In [4]:
df

| | Date | close | volume | log_return | simple_return | Article_title | Stock_symbol | title_label | title_score | lsa_label | lsa_score | luhn_label | luhn_score | textrank_label | textrank_score | lexrank_label | lexrank_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-05-02 | 300.980011 | 75781500 | 0.036290 | 0.036956 | US STOCKS-Wall Street up before Fed meet as te... | TSLA | positive | 0.680717 | negative | -0.918020 | negative | -0.627872 | positive | 0.837929 | negative | -0.608806 |
| 1 | 2022-05-02 | 300.980011 | 75781500 | 0.036290 | 0.036956 | Supply chains snarl Taiwan tech firms as some ... | TSLA | positive | 0.771192 | negative | -0.960010 | positive | 0.908334 | positive | 0.442370 | negative | -0.949896 |
| 2 | 2022-05-02 | 300.980011 | 75781500 | 0.036290 | 0.036956 | Twitter estimates spam, fake accounts comprise... | TSLA | negative | -0.601211 | negative | -0.923227 | negative | -0.953690 | negative | -0.953690 | negative | -0.923227 |
| 3 | 2022-05-02 | 300.980011 | 75781500 | 0.036290 | 0.036956 | EXCLUSIVE-Musk seeks to put in less money in n... | TSLA | neutral | 0.000000 | positive | 0.571970 | negative | -0.536489 | positive | 0.571970 | neutral | 0.000000 |
| 4 | 2022-05-02 | 300.980011 | 75781500 | 0.036290 | 0.036956 | EXCLUSIVE-Musk in talks for new Twitter financ... | TSLA | neutral | 0.000000 | neutral | 0.000000 | negative | -0.655296 | negative | -0.655296 | neutral | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7922 | 2023-12-14 | 251.050003 | 160829200 | 0.047976 | 0.049145 | Noteworthy Thursday Option Activity: TSLA, AYX... | TSLA | neutral | 0.000000 | positive | 0.575542 | neutral | 0.000000 | neutral | 0.000000 | neutral | 0.000000 |
| 7923 | 2023-12-14 | 251.050003 | 160829200 | 0.047976 | 0.049145 | Dutch vehicle authority RDW says no Tesla reca... | TSLA | negative | -0.592259 | negative | -0.949792 | negative | -0.874185 | negative | -0.637674 | negative | -0.934566 |
| 7924 | 2023-12-15 | 253.500000 | 135720800 | 0.009712 | 0.009759 | Elon Musk says oil and gas should not be demon... | TSLA | neutral | 0.000000 | neutral | 0.000000 | neutral | 0.000000 | negative | -0.880835 | neutral | 0.000000 |
| 7925 | 2023-12-15 | 253.500000 | 135720800 | 0.009712 | 0.009759 | Lucid Motors Stock: What Went Wrong With This ... | TSLA | negative | -0.513092 | neutral | 0.000000 | neutral | 0.000000 | neutral | 0.000000 | neutral | 0.000000 |
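
With one row per article and a score column per text source (`title_score`, `lsa_score`, `luhn_score`, `textrank_score`, `lexrank_score`), one way to compare the sources is to average each score per day and correlate it with that day's `log_return`. The sketch below illustrates that idea under stated assumptions and is not the analysis actually performed here; `daily_score_correlations` is a hypothetical name and the sample values are made up.

```python
import pandas as pd

SCORE_COLUMNS = ("title_score", "lsa_score", "luhn_score", "textrank_score", "lexrank_score")


def daily_score_correlations(df, score_columns=SCORE_COLUMNS):
    """Pearson correlation of each daily mean score with the daily log return."""
    score_columns = list(score_columns)
    # log_return repeats on every article row of a day, so "first" recovers it.
    agg = {col: "mean" for col in score_columns}
    agg["log_return"] = "first"
    daily = df.groupby("Date").agg(agg)
    return daily[score_columns].corrwith(daily["log_return"])


# Made-up sample with two score columns and three trading days.
sample = pd.DataFrame(
    {
        "Date": ["2022-05-02", "2022-05-02", "2022-05-03", "2022-05-04"],
        "log_return": [0.01, 0.01, -0.02, 0.03],
        "title_score": [0.2, 0.4, -0.5, 0.9],
        "lsa_score": [-0.3, -0.1, 0.4, -0.6],
    }
)
corr = daily_score_correlations(sample, score_columns=("title_score", "lsa_score"))
print(corr)
```

Aggregating to one row per day first avoids counting the same daily return once per article, which would otherwise give heavy-news days extra weight in the correlation.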
