# Sentiment Analysis

From the data that we preprocessed through cleaning, then linguistic processing; we are now able to gather the sentiment of the news articles.

The sentiment scores returned by `NLTK's VADER SentimentIntensityAnalyzer` range from -1 to 1. (VADER: Valence Aware Dictionary and sEntiment Reasoner)  
- Scores between -1 and -0.05 general indicate negative sentiment
- Scores between -0.05 and 0.05 are considered neutrual
- Sores between 0.05 and 1 indicate positive sentiment  

The more extreme the score, the stronger the sentiment.  

In [79]:
# import autotokenizer and automodelforsequenceclassification modules from transformers package
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [80]:
# import pytorch package
import torch
from torch.nn.functional import softmax

In [81]:
model_name = "ProsusAI/finbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

In [82]:
def get_sentiment(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512, padding=True)
    outputs = model(**inputs)
    probabilities = softmax(outputs.logits, dim=1)
    sentiment_score = probabilities[0].tolist()
    sentiment_labels = ['negative', 'neutral', 'positive']
    sentiment = sentiment_labels[sentiment_score.index(max(sentiment_score))]
    return sentiment, max(sentiment_score)

In [83]:
# import pandas
import pandas as pd

In [84]:
cleaned_articles_df = pd.read_csv('../DataFrames/cleaned_articles_df.tsv')

In [85]:
cleaned_articles_df

Unnamed: 0,title,content,publishedAt
0,DOJ subpoenas NVIDIA as part of antitrust prob...,If you click Accept all we and our partners in...,2024-09-04
1,The Leaked Nvidia RTX Has So Many Cores It Act...,The GeForce RTX is already so big that any PC ...,2024-09-27
2,ByteDance will reportedly use Huawei chips to ...,If you click Accept all we and our partners in...,2024-09-30
3,This chart shows one potential advantage AWSs ...,Noah BergerGetty Images Big tech cloud provide...,2024-09-26
4,Nvidias days of absolute dominance in AI could...,Rodrigo Liang CEO and cofounder of SambaNova S...,2024-09-26
...,...,...,...
94,Nvidia is back in the trillion club and Jensen...,In This Story Nvidia NVDA chief executive Jens...,2024-09-26
95,Sony confirms PS Pro has nextgen raytracing te...,When you buy through links on our articles Fut...,2024-09-13
96,Acers formidable new Predator Orion desktop ca...,What you need to know Acer announced two new g...,2024-09-04
97,LGs entrylevel OLED is a great gaming TV and a...,If youve been curious about buying an OLED TV ...,2024-09-19


In [86]:
cleaned_articles_df['publishedAt'] = pd.to_datetime(cleaned_articles_df['publishedAt'])

In [87]:
cleaned_articles_df['publishedAt'] = cleaned_articles_df['publishedAt'].dt.date

In [88]:
cleaned_articles_df

Unnamed: 0,title,content,publishedAt
0,DOJ subpoenas NVIDIA as part of antitrust prob...,If you click Accept all we and our partners in...,2024-09-04
1,The Leaked Nvidia RTX Has So Many Cores It Act...,The GeForce RTX is already so big that any PC ...,2024-09-27
2,ByteDance will reportedly use Huawei chips to ...,If you click Accept all we and our partners in...,2024-09-30
3,This chart shows one potential advantage AWSs ...,Noah BergerGetty Images Big tech cloud provide...,2024-09-26
4,Nvidias days of absolute dominance in AI could...,Rodrigo Liang CEO and cofounder of SambaNova S...,2024-09-26
...,...,...,...
94,Nvidia is back in the trillion club and Jensen...,In This Story Nvidia NVDA chief executive Jens...,2024-09-26
95,Sony confirms PS Pro has nextgen raytracing te...,When you buy through links on our articles Fut...,2024-09-13
96,Acers formidable new Predator Orion desktop ca...,What you need to know Acer announced two new g...,2024-09-04
97,LGs entrylevel OLED is a great gaming TV and a...,If youve been curious about buying an OLED TV ...,2024-09-19


In [89]:
sentiment_df = pd.DataFrame()

In [90]:
sentiment_df['title_sentiment'], sentiment_df['title_sentiment_score'] = zip(*cleaned_articles_df['title'].apply(get_sentiment))
sentiment_df['content_sentiment'], sentiment_df['content_sentiment_score'] = zip(*cleaned_articles_df['content'].apply(get_sentiment))
sentiment_df['publishedAt'] = cleaned_articles_df['publishedAt']

In [91]:
sentiment_df

Unnamed: 0,title_sentiment,title_sentiment_score,content_sentiment,content_sentiment_score,publishedAt
0,neutral,0.909428,positive,0.952919,2024-09-04
1,positive,0.533817,neutral,0.631737,2024-09-27
2,positive,0.756288,positive,0.952919,2024-09-30
3,negative,0.826232,positive,0.770212,2024-09-26
4,negative,0.590499,neutral,0.531220,2024-09-26
...,...,...,...,...,...
94,positive,0.816141,negative,0.951209,2024-09-26
95,positive,0.910873,positive,0.904911,2024-09-13
96,positive,0.807024,positive,0.884717,2024-09-04
97,positive,0.797082,positive,0.880443,2024-09-19


# Stock Price Data Collection

The next step in building our analysis model is collecting the stock price data for the specified company.  

In [92]:
# import yahoo finance package
import yfinance as yf

In [93]:
# import timedelta modules from datetime
from datetime import timedelta

In [94]:
# gather stock data
def get_stock_data(ticker, start_date, end_date):
    stock = yf.Ticker(ticker)
    data = stock.history(start=start_date, end=end_date)
    return data

In [95]:
earliest_date = sentiment_df['publishedAt'].min()
latest_date = sentiment_df['publishedAt'].max()

In [96]:
start_date = earliest_date - timedelta(days=5)
end_date = latest_date + timedelta(days=5)

In [97]:
stock_data = get_stock_data('NVDA', start_date, end_date)

In [98]:
stock_data

Unnamed: 0_level_0,Open,High,Low,Close,Volume,Dividends,Stock Splits
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2024-08-30 00:00:00-04:00,119.519777,121.739588,117.209977,119.359795,333751600,0.0,0.0
2024-09-03 00:00:00-04:00,116.000078,116.200058,107.280822,107.990761,477155100,0.0,0.0
2024-09-04 00:00:00-04:00,105.400985,113.260306,104.111095,106.200912,372470300,0.0,0.0
2024-09-05 00:00:00-04:00,104.981017,109.640622,104.751041,107.200829,306850700,0.0,0.0
2024-09-06 00:00:00-04:00,108.030759,108.14075,100.941361,102.821205,413638100,0.0,0.0
2024-09-09 00:00:00-04:00,104.871024,106.540887,103.681131,106.460892,273912000,0.0,0.0
2024-09-10 00:00:00-04:00,107.800776,109.390643,104.94102,108.090752,268283700,0.0,0.0
2024-09-11 00:00:00-04:00,109.380641,117.179976,107.410808,116.900002,441422400,0.0,0.0
2024-09-12 00:00:00-04:00,116.839996,120.790001,115.379997,119.139999,367100500,0.01,0.0
2024-09-13 00:00:00-04:00,119.080002,119.959999,117.599998,119.099998,238358300,0.0,0.0


**Note:**  
Some days are missing for stock price data because the Stock market is open only during the weekdays, not including holidays.  

We also fix the formatting for the date for the stock prices. This way we ensure that the data matches with the sentiment data.

In [99]:
# reformat stock data dates
stock_data.index = pd.to_datetime(stock_data.index)

In [100]:
stock_data.index = pd.to_datetime(stock_data.index).date

In [101]:
stock_data = stock_data.reset_index()

In [102]:
stock_data = stock_data.rename(columns={'index': 'date'})

In [103]:
stock_data

Unnamed: 0,date,Open,High,Low,Close,Volume,Dividends,Stock Splits
0,2024-08-30,119.519777,121.739588,117.209977,119.359795,333751600,0.0,0.0
1,2024-09-03,116.000078,116.200058,107.280822,107.990761,477155100,0.0,0.0
2,2024-09-04,105.400985,113.260306,104.111095,106.200912,372470300,0.0,0.0
3,2024-09-05,104.981017,109.640622,104.751041,107.200829,306850700,0.0,0.0
4,2024-09-06,108.030759,108.14075,100.941361,102.821205,413638100,0.0,0.0
5,2024-09-09,104.871024,106.540887,103.681131,106.460892,273912000,0.0,0.0
6,2024-09-10,107.800776,109.390643,104.94102,108.090752,268283700,0.0,0.0
7,2024-09-11,109.380641,117.179976,107.410808,116.900002,441422400,0.0,0.0
8,2024-09-12,116.839996,120.790001,115.379997,119.139999,367100500,0.01,0.0
9,2024-09-13,119.080002,119.959999,117.599998,119.099998,238358300,0.0,0.0


#### Price Change

We need to create a new column in our `stock_data` table that calculates the price change of the stock from the next day.

In [104]:
stock_data['price_change'] = stock_data['Close'].pct_change(fill_method=None).shift(-1)

In [105]:
stock_data

Unnamed: 0,date,Open,High,Low,Close,Volume,Dividends,Stock Splits,price_change
0,2024-08-30,119.519777,121.739588,117.209977,119.359795,333751600,0.0,0.0,-0.09525
1,2024-09-03,116.000078,116.200058,107.280822,107.990761,477155100,0.0,0.0,-0.016574
2,2024-09-04,105.400985,113.260306,104.111095,106.200912,372470300,0.0,0.0,0.009415
3,2024-09-05,104.981017,109.640622,104.751041,107.200829,306850700,0.0,0.0,-0.040854
4,2024-09-06,108.030759,108.14075,100.941361,102.821205,413638100,0.0,0.0,0.035398
5,2024-09-09,104.871024,106.540887,103.681131,106.460892,273912000,0.0,0.0,0.015309
6,2024-09-10,107.800776,109.390643,104.94102,108.090752,268283700,0.0,0.0,0.081499
7,2024-09-11,109.380641,117.179976,107.410808,116.900002,441422400,0.0,0.0,0.019162
8,2024-09-12,116.839996,120.790001,115.379997,119.139999,367100500,0.01,0.0,-0.000336
9,2024-09-13,119.080002,119.959999,117.599998,119.099998,238358300,0.0,0.0,-0.019479


# Merge Data

The next step is to merge the stock price data with the sentiment analysis data.  

Once the data is merged, we are able to create our model.  

In [106]:
# merge the sentiment scores and stock data on their dates
merged_data_df = pd.merge(sentiment_df, stock_data, left_on='publishedAt', right_on='date', how='left')

In [107]:
merged_data_df = merged_data_df.sort_values('date')

In [108]:
merged_data_df = merged_data_df.drop('date', axis=1)

In [109]:
merged_data_df = merged_data_df.rename(columns={'publishedAt': 'date'})

In [110]:
merged_data_df

Unnamed: 0,title_sentiment,title_sentiment_score,content_sentiment,content_sentiment_score,date,Open,High,Low,Close,Volume,Dividends,Stock Splits,price_change
0,neutral,0.909428,positive,0.952919,2024-09-04,105.400985,113.260306,104.111095,106.200912,372470300.0,0.0,0.0,0.009415
96,positive,0.807024,positive,0.884717,2024-09-04,105.400985,113.260306,104.111095,106.200912,372470300.0,0.0,0.0,0.009415
64,neutral,0.907637,negative,0.619797,2024-09-04,105.400985,113.260306,104.111095,106.200912,372470300.0,0.0,0.0,0.009415
76,neutral,0.963091,positive,0.896472,2024-09-04,105.400985,113.260306,104.111095,106.200912,372470300.0,0.0,0.0,0.009415
90,positive,0.861416,positive,0.935935,2024-09-04,105.400985,113.260306,104.111095,106.200912,372470300.0,0.0,0.0,0.009415
...,...,...,...,...,...,...,...,...,...,...,...,...,...
61,neutral,0.910408,neutral,0.950407,2024-09-21,,,,,,,,
63,positive,0.923365,positive,0.923918,2024-09-21,,,,,,,,
72,negative,0.625904,positive,0.952919,2024-09-29,,,,,,,,
75,positive,0.898789,positive,0.924873,2024-09-28,,,,,,,,


In [111]:
import os

In [112]:
merged_data_df.to_csv(os.path.join('../DataFrames', 'merged_data_df.tsv'), index=False)