# Sentiment Analysis

After cleaning the data and applying linguistic processing, we can now compute the sentiment of the news articles.

The compound sentiment scores returned by NLTK's VADER `SentimentIntensityAnalyzer` range from -1 to 1 (VADER: Valence Aware Dictionary and sEntiment Reasoner).  
- Scores between -1 and -0.05 generally indicate negative sentiment
- Scores between -0.05 and 0.05 are considered neutral
- Scores between 0.05 and 1 indicate positive sentiment  

The more extreme the score, the stronger the sentiment.  
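
The thresholds above can be expressed as a small helper. This is a minimal sketch: `label_sentiment` is a hypothetical name that is not used elsewhere in this notebook, and counting scores of exactly -0.05 or 0.05 as non-neutral is an assumption that follows the convention commonly used with VADER.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label.

    Assumption: scores at or beyond +/-threshold count as positive/negative.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Example compound scores taken from the kind of values VADER returns.
for score in (-0.6808, 0.0, 0.5423):
    print(score, label_sentiment(score))
```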

In [242]:
# import nltk and its sentiment intensity analyzer class
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

In [243]:
# download nltk VADER Lexicon data package
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/justinhoang/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [244]:
# create the analyzer once so it is not rebuilt for every text
sia = SentimentIntensityAnalyzer()

# get the compound sentiment score of a text from nltk's sentiment intensity analyzer
def get_sentiment(text):
    return sia.polarity_scores(text)['compound']

In [245]:
# import pandas
import pandas as pd

In [246]:
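# load the preprocessed articles (read with the default comma separator, despite the .tsv extension)
processed_articles_df = pd.read_csv('../DataFrames/processed_articles_df.tsv')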

In [247]:
# fix datetime formatting
# CSV files store datetimes as plain strings, so convert the column back to datetime
processed_articles_df['publishedAt'] = pd.to_datetime(processed_articles_df['publishedAt'])

In [248]:
# keep only the calendar date so it can be matched against daily stock data
processed_articles_df['publishedAt'] = processed_articles_df['publishedAt'].dt.date

In [249]:
processed_articles_df['sentiment_title'] = processed_articles_df['processed_title'].apply(get_sentiment)
processed_articles_df['sentiment_content'] = processed_articles_df['processed_content'].apply(get_sentiment)

In [250]:
sentiments_df = processed_articles_df[['sentiment_title', 'sentiment_content', 'publishedAt']]

In [251]:
sentiments_df

| | sentiment_title | sentiment_content | publishedAt |
| --- | --- | --- | --- |
| 0 | 0.0000 | 0.5423 | 2024-09-04 |
| 1 | -0.6808 | 0.2263 | 2024-09-27 |
| 2 | 0.0000 | 0.5423 | 2024-09-30 |
| 3 | -0.4019 | 0.5106 | 2024-09-01 |
| 4 | 0.2500 | 0.1779 | 2024-09-26 |
| ... | ... | ... | ... |
| 94 | 0.2023 | 0.8126 | 2024-09-21 |
| 95 | 0.0772 | 0.0000 | 2024-09-04 |
| 96 | 0.0000 | -0.4019 | 2024-09-04 |
| 97 | 0.3612 | 0.0000 | 2024-09-04 |


# Stock Price Data Collection

The next step in building our analysis model is to collect stock price data for the company of interest (here NVIDIA, ticker `NVDA`).  

In [252]:
# import yahoo finance package
import yfinance as yf

In [253]:
# import timedelta from the datetime module
from datetime import timedelta

In [254]:
# fetch the daily price history for a ticker between start_date and end_date
def get_stock_data(ticker, start_date, end_date):
    stock = yf.Ticker(ticker)
    data = stock.history(start=start_date, end=end_date)
    return data

In [255]:
earliest_date = sentiments_df['publishedAt'].min()
latest_date = sentiments_df['publishedAt'].max()

In [256]:
start_date = earliest_date - timedelta(days=5)
end_date = latest_date + timedelta(days=5)
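
As a quick check of the padding arithmetic, the sketch below uses the earliest article date visible in the displayed sentiment rows (2024-09-01). The variable names are illustrative only and are not used by the rest of the notebook. Subtracting five days gives 2024-08-27, which is the first date in the stock data returned further down.

```python
from datetime import date, timedelta

# Earliest article date visible in the sentiment table (illustrative value).
example_earliest = date(2024, 9, 1)

# Same padding as the notebook: five calendar days before the first article.
example_start = example_earliest - timedelta(days=5)
print(example_start)
```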

In [257]:
stock_data = get_stock_data('NVDA', start_date, end_date)

In [258]:
stock_data

| Date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2024-08-27 00:00:00-04:00 | 125.03931 | 129.188949 | 123.869404 | 128.289032 | 303134600 | 0.0 | 0.0 |
| 2024-08-28 00:00:00-04:00 | 128.109038 | 128.319027 | 122.629511 | 125.599258 | 448101100 | 0.0 | 0.0 |
| 2024-08-29 00:00:00-04:00 | 121.349623 | 124.41936 | 116.700019 | 117.579941 | 453023300 | 0.0 | 0.0 |
| 2024-08-30 00:00:00-04:00 | 119.519777 | 121.739588 | 117.209977 | 119.359795 | 333751600 | 0.0 | 0.0 |
| 2024-09-03 00:00:00-04:00 | 116.000078 | 116.200058 | 107.280822 | 107.990761 | 477155100 | 0.0 | 0.0 |
| 2024-09-04 00:00:00-04:00 | 105.400985 | 113.260306 | 104.111095 | 106.200912 | 372470300 | 0.0 | 0.0 |
| 2024-09-05 00:00:00-04:00 | 104.981017 | 109.640622 | 104.751041 | 107.200829 | 306850700 | 0.0 | 0.0 |
| 2024-09-06 00:00:00-04:00 | 108.030759 | 108.14075 | 100.941361 | 102.821205 | 413638100 | 0.0 | 0.0 |
| 2024-09-09 00:00:00-04:00 | 104.871024 | 106.540887 | 103.681131 | 106.460892 | 273912000 | 0.0 | 0.0 |
| 2024-09-10 00:00:00-04:00 | 107.800776 | 109.390643 | 104.94102 | 108.090752 | 268283700 | 0.0 | 0.0 |


**Note:**  
Some dates are missing from the stock price data because the stock market is open only on weekdays, excluding holidays.  
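
To illustrate, a simple weekday check flags article dates that fall on a weekend. This is only a sketch: `is_weekend` is a hypothetical helper, and it cannot detect market holidays (2024-09-02, Labor Day in the US, is a Monday and is also absent from the stock data).

```python
from datetime import date


def is_weekend(day):
    """Return True for Saturday or Sunday (weekday() is 5 or 6)."""
    return day.weekday() >= 5


# A Tuesday, a Saturday and a Sunday from the article date range.
for day in (date(2024, 9, 3), date(2024, 9, 21), date(2024, 9, 29)):
    print(day, is_weekend(day))
```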

We also convert the stock price dates to plain calendar dates so that they match the `publishedAt` dates in the sentiment data.

In [259]:
# reformat stock data dates
stock_data.index = pd.to_datetime(stock_data.index)

In [260]:
# drop the time and timezone, keeping only the calendar date
stock_data.index = pd.to_datetime(stock_data.index).date

In [261]:
# move the date index into a regular column (named 'index' by default)
stock_data = stock_data.reset_index()

In [262]:
# give the new column a meaningful name
stock_data = stock_data.rename(columns={'index': 'date'})
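
The reason for reducing the index to plain dates can be seen with the standard library alone. The sketch below parses a timestamp in the same form as the original stock index and compares it with a plain date; the variable names are illustrative and not part of the notebook.

```python
from datetime import date, datetime

# A timezone-aware timestamp in the form returned for the stock index.
stock_ts = datetime.fromisoformat("2024-08-27 00:00:00-04:00")
article_day = date(2024, 8, 27)

# A datetime is never equal to a plain date, even on the same calendar day.
print(stock_ts == article_day)

# After keeping only the date part, the two values compare equal.
print(stock_ts.date() == article_day)
```

Keys that do not compare equal cannot be matched when joining, which is why both tables are brought to the same date type first.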

In [263]:
stock_data

| | date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2024-08-27 | 125.03931 | 129.188949 | 123.869404 | 128.289032 | 303134600 | 0.0 | 0.0 |
| 1 | 2024-08-28 | 128.109038 | 128.319027 | 122.629511 | 125.599258 | 448101100 | 0.0 | 0.0 |
| 2 | 2024-08-29 | 121.349623 | 124.41936 | 116.700019 | 117.579941 | 453023300 | 0.0 | 0.0 |
| 3 | 2024-08-30 | 119.519777 | 121.739588 | 117.209977 | 119.359795 | 333751600 | 0.0 | 0.0 |
| 4 | 2024-09-03 | 116.000078 | 116.200058 | 107.280822 | 107.990761 | 477155100 | 0.0 | 0.0 |
| 5 | 2024-09-04 | 105.400985 | 113.260306 | 104.111095 | 106.200912 | 372470300 | 0.0 | 0.0 |
| 6 | 2024-09-05 | 104.981017 | 109.640622 | 104.751041 | 107.200829 | 306850700 | 0.0 | 0.0 |
| 7 | 2024-09-06 | 108.030759 | 108.14075 | 100.941361 | 102.821205 | 413638100 | 0.0 | 0.0 |
| 8 | 2024-09-09 | 104.871024 | 106.540887 | 103.681131 | 106.460892 | 273912000 | 0.0 | 0.0 |
| 9 | 2024-09-10 | 107.800776 | 109.390643 | 104.94102 | 108.090752 | 268283700 | 0.0 | 0.0 |


#### Price Change

We need to add a new column to our `stock_data` table that holds the percentage change in closing price from each trading day to the next trading day.

In [264]:
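# pct_change() gives each day's change relative to the previous row;
# shift(-1) moves the next day's change up onto the current row
stock_data['Price Change'] = stock_data['Close'].pct_change(fill_method=None).shift(-1)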
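
To make the direction of the shift concrete, here is a plain-Python sketch of the same calculation on the first three closing prices from the stock table. `next_day_change` is a hypothetical helper, not part of pandas; the last row has no following day, so it gets `None` where pandas would show `NaN`.

```python
def next_day_change(closes):
    """Fractional change from each close to the following close."""
    changes = [(nxt - cur) / cur for cur, nxt in zip(closes, closes[1:])]
    return changes + [None]  # no next day for the final row


# Closing prices for 2024-08-27, 2024-08-28 and 2024-08-29.
closes = [128.289032, 125.599258, 117.579941]
for close, change in zip(closes, next_day_change(closes)):
    print(close, None if change is None else round(change, 6))
```

The first two results should agree, up to rounding, with the first two `Price Change` values produced by the pandas expression.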

In [265]:
stock_data

| | date | Open | High | Low | Close | Volume | Dividends | Stock Splits | Price Change |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2024-08-27 | 125.03931 | 129.188949 | 123.869404 | 128.289032 | 303134600 | 0.0 | 0.0 | -0.020967 |
| 1 | 2024-08-28 | 128.109038 | 128.319027 | 122.629511 | 125.599258 | 448101100 | 0.0 | 0.0 | -0.063848 |
| 2 | 2024-08-29 | 121.349623 | 124.41936 | 116.700019 | 117.579941 | 453023300 | 0.0 | 0.0 | 0.015137 |
| 3 | 2024-08-30 | 119.519777 | 121.739588 | 117.209977 | 119.359795 | 333751600 | 0.0 | 0.0 | -0.09525 |
| 4 | 2024-09-03 | 116.000078 | 116.200058 | 107.280822 | 107.990761 | 477155100 | 0.0 | 0.0 | -0.016574 |
| 5 | 2024-09-04 | 105.400985 | 113.260306 | 104.111095 | 106.200912 | 372470300 | 0.0 | 0.0 | 0.009415 |
| 6 | 2024-09-05 | 104.981017 | 109.640622 | 104.751041 | 107.200829 | 306850700 | 0.0 | 0.0 | -0.040854 |
| 7 | 2024-09-06 | 108.030759 | 108.14075 | 100.941361 | 102.821205 | 413638100 | 0.0 | 0.0 | 0.035398 |
| 8 | 2024-09-09 | 104.871024 | 106.540887 | 103.681131 | 106.460892 | 273912000 | 0.0 | 0.0 | 0.015309 |
| 9 | 2024-09-10 | 107.800776 | 109.390643 | 104.94102 | 108.090752 | 268283700 | 0.0 | 0.0 | 0.081499 |


# Merge Data

The next step is to merge the stock price data with the sentiment analysis data.  

Once the data is merged, we are able to create our model.  

In [266]:
# merge the sentiment scores and stock data on their dates
merged_data_df = pd.merge(sentiments_df, stock_data, left_on='publishedAt', right_on='date', how='left')

In [267]:
# sort by trading date; rows without a matching trading day (NaN) are placed last
merged_data_df = merged_data_df.sort_values('date')

In [268]:
merged_data_df = merged_data_df.drop('date', axis=1)

In [269]:
merged_data_df = merged_data_df.rename(columns={'publishedAt': 'date'})

In [270]:
merged_data_df

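| | sentiment_title | sentiment_content | date | Open | High | Low | Close | Volume | Dividends | Stock Splits | Price Change |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 29 | 0.0000 | 0.5423 | 2024-09-03 | 116.000078 | 116.200058 | 107.280822 | 107.990761 | 477155100.0 | 0.0 | 0.0 | -0.016574 |
| 40 | 0.0000 | 0.0000 | 2024-09-03 | 116.000078 | 116.200058 | 107.280822 | 107.990761 | 477155100.0 | 0.0 | 0.0 | -0.016574 |
| 42 | 0.5423 | 0.5423 | 2024-09-03 | 116.000078 | 116.200058 | 107.280822 | 107.990761 | 477155100.0 | 0.0 | 0.0 | -0.016574 |
| 6 | 0.0000 | 0.0516 | 2024-09-03 | 116.000078 | 116.200058 | 107.280822 | 107.990761 | 477155100.0 | 0.0 | 0.0 | -0.016574 |
| 86 | 0.0000 | 0.3612 | 2024-09-03 | 116.000078 | 116.200058 | 107.280822 | 107.990761 | 477155100.0 | 0.0 | 0.0 | -0.016574 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 67 | -0.5106 | -0.0516 | 2024-09-21 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 69 | 0.4939 | 0.3612 | 2024-09-21 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 78 | -0.0516 | 0.5423 | 2024-09-29 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 81 | 0.0000 | 0.0000 | 2024-09-28 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |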
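
The rows at the bottom of the table with no price values belong to articles published on days with no trading data. One way to count such rows is the `indicator` option of `pd.merge`. The sketch below uses two tiny hypothetical frames (`articles` and `prices`) rather than the notebook's data, so it is an illustration and not part of the pipeline.

```python
from datetime import date

import pandas as pd

# Hypothetical miniature versions of sentiments_df and stock_data.
articles = pd.DataFrame({
    "publishedAt": [date(2024, 9, 3), date(2024, 9, 21)],  # Tuesday, Saturday
    "sentiment_title": [0.0, -0.5106],
})
prices = pd.DataFrame({
    "date": [date(2024, 9, 3)],
    "Close": [107.990761],
})

# indicator=True adds a `_merge` column: 'both' or 'left_only' for a left join.
check = pd.merge(articles, prices, left_on="publishedAt", right_on="date",
                 how="left", indicator=True)
unmatched = check[check["_merge"] == "left_only"]
print(len(unmatched), "article(s) without stock data")
```

Whether such rows should be dropped or mapped to the next trading day depends on the model that follows, so the sketch only reports them.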


In [271]:
import os

In [272]:
# save the merged data (written comma-separated, despite the .tsv extension)
merged_data_df.to_csv(os.path.join('../DataFrames', 'merged_data_df.tsv'), index=False)