# Financial Sentiment Analysis (Single)

In this program, I run a sentiment analysis of a single company based on financial news articles.

The company that I am targeting is Nvidia [NVDA]  

From that sentiment, I will use the data to try and predict the price movement of that company's stock price.  

The packages that I am using are:  
`os`, `dotenv`, `datetime`, `newsapi`, `pandas`, `nltk`, `re`, `string`, `yfinance`

## Fetching News Articles

The first step is to fetch the news articles.  

I am using `NewsAPI` to get articles quickly and easily. Then, I use `pandas` to put the articles into a dataframe, where I can collect and read the data easier.  

**Filtering articles:**  
Filter articles that only exist  
- `NewsAPI` sometimes fetches articles that were removed  

**Extracting the data:**  
Extract only the necessary data from the articles
- Title
- Description
- Content

All others can be discarded.  

Both of these steps are part of the cleaning data step that is next in text preprocessing.

In [98]:
import os
from dotenv import load_dotenv

In [99]:
# get path to the environment file
env_path = '../config/.env'
load_dotenv(env_path)

True

In [100]:
# import datetime, timedelta modules from datetime
from datetime import datetime, timedelta

In [101]:
# import newsapi package
from newsapi import NewsApiClient

In [102]:
# init newsapi
newsapi = NewsApiClient(api_key=os.getenv('NEWS_API_KEY'))

In [103]:
company = "Nvidia"
days_back = 29
end_date = datetime.now()
start_date = end_date - timedelta(days=days_back)

In [104]:
# fetch all articles that mention Nvidia
all_articles = newsapi.get_everything(q=company,
                                      from_param=start_date.strftime('%Y-%m-%d'),
                                      to=end_date.strftime('%Y-%m-%d'),
                                      language='en')

In [105]:
import pandas as pd
pd.__version__

'2.2.3'

In [106]:
# place all_articles into a dataframe
all_articles_df = pd.DataFrame(all_articles['articles'])
all_articles_df

Unnamed: 0,source,author,title,description,url,urlToImage,publishedAt,content
0,"{'id': None, 'name': 'Yahoo Entertainment'}",Lawrence Bonk,DOJ subpoenas NVIDIA as part of antitrust prob...,The DOJ has sent subpoenas to NVIDIA and other...,https://consent.yahoo.com/v2/collectConsent?se...,,2024-09-04T15:34:35Z,"If you click 'Accept all', we and our partners..."
1,"{'id': None, 'name': 'Gizmodo.com'}",Kyle Barr,The Leaked Nvidia RTX 5090 Has So Many Cores I...,Get ready to watch the lights on your block di...,https://gizmodo.com/the-leaked-nvidia-rtx-5090...,https://gizmodo.com/app/uploads/2024/09/Nvidia...,2024-09-27T13:35:22Z,The GeForce RTX 4090 is already so big that an...
2,"{'id': None, 'name': 'Yahoo Entertainment'}",Jeremy Gan,ByteDance will reportedly use Huawei chips to ...,"As first reported by Reuters, ByteDance, the C...",https://consent.yahoo.com/v2/collectConsent?se...,,2024-09-30T15:48:46Z,"If you click 'Accept all', we and our partners..."
3,"{'id': 'business-insider', 'name': 'Business I...",Emma Cosgrove,Nvidia might actually lose in this key part of...,"As AI matures, Nvidia, Groq, and Cerebras focu...",https://www.businessinsider.com/nvidia-may-los...,https://i.insider.com/66d0c408392a3bda9f2349e3...,2024-09-01T13:00:02Z,Justin Sullivan/Getty\r\n<ul><li>Inference mad...
4,"{'id': 'business-insider', 'name': 'Business I...",Eugene Kim,This chart shows one potential advantage AWS's...,"AI chip investments by Amazon, Google, and Mic...",https://www.businessinsider.com/aws-ai-chips-w...,https://i.insider.com/6622c44b23b29110d3011ce1...,2024-09-26T09:00:02Z,Noah Berger/Getty Images\r\n<ul><li>Big tech c...
...,...,...,...,...,...,...,...,...
95,"{'id': None, 'name': 'Yahoo Entertainment'}","Sean Williams, The Motley Fool","Billionaires Warren Buffett, David Tepper, and...",Some of Wall Street's most successful value-se...,https://finance.yahoo.com/news/billionaires-wa...,https://s.yimg.com/cv/apiv2/social/images/yaho...,2024-09-21T09:06:00Z,"For the better part of two years, the bulls ha..."
96,"{'id': None, 'name': 'Theregister.com'}",Thomas Claburn,OpenAI allegedly wants TSMC 1.6nm for in-house...,"Another job for Broadcom, then\nOpenAI's first...",https://www.theregister.com/2024/09/04/openai_...,https://regmedia.co.uk/2021/01/12/shutterstock...,2024-09-04T02:29:19Z,OpenAI's first custom-designed silicon chips a...
97,"{'id': None, 'name': 'Theregister.com'}",Liam Proven,Double Debian update: 11.11 and 12.7 arrive at...,But Bullseye's days are numbered and it's time...,https://www.theregister.com/2024/09/04/double_...,https://regmedia.co.uk/2021/08/16/shutterstock...,2024-09-04T11:28:06Z,"The latest update to Debian ""Bookworm"" arrives..."
98,"{'id': 'business-insider', 'name': 'Business I...",Dan DeFrancesco,The tech industry is ready for robot taxis. Bu...,"Driverless cars are gaining momentum, with Tes...",https://www.businessinsider.com/waymo-robot-ta...,https://i.insider.com/66a90e9a1a227600e632ca38...,2024-09-04T12:48:22Z,Waymo's fully autonomous Jaguar I-PACEBlue Pla...


In [107]:
# filter articles function
# only filters valid articles
# valid meaning: article exists and description of article exists
def filter_removed_articles(articles):
    return [article for article in articles if article.get('title') != '[Removed]']

In [108]:
# filter the all_articles
valid_articles = filter_removed_articles(all_articles['articles'])

In [109]:
valid_articles_df = pd.DataFrame(valid_articles)
valid_articles_df

Unnamed: 0,source,author,title,description,url,urlToImage,publishedAt,content
0,"{'id': None, 'name': 'Yahoo Entertainment'}",Lawrence Bonk,DOJ subpoenas NVIDIA as part of antitrust prob...,The DOJ has sent subpoenas to NVIDIA and other...,https://consent.yahoo.com/v2/collectConsent?se...,,2024-09-04T15:34:35Z,"If you click 'Accept all', we and our partners..."
1,"{'id': None, 'name': 'Gizmodo.com'}",Kyle Barr,The Leaked Nvidia RTX 5090 Has So Many Cores I...,Get ready to watch the lights on your block di...,https://gizmodo.com/the-leaked-nvidia-rtx-5090...,https://gizmodo.com/app/uploads/2024/09/Nvidia...,2024-09-27T13:35:22Z,The GeForce RTX 4090 is already so big that an...
2,"{'id': None, 'name': 'Yahoo Entertainment'}",Jeremy Gan,ByteDance will reportedly use Huawei chips to ...,"As first reported by Reuters, ByteDance, the C...",https://consent.yahoo.com/v2/collectConsent?se...,,2024-09-30T15:48:46Z,"If you click 'Accept all', we and our partners..."
3,"{'id': 'business-insider', 'name': 'Business I...",Emma Cosgrove,Nvidia might actually lose in this key part of...,"As AI matures, Nvidia, Groq, and Cerebras focu...",https://www.businessinsider.com/nvidia-may-los...,https://i.insider.com/66d0c408392a3bda9f2349e3...,2024-09-01T13:00:02Z,Justin Sullivan/Getty\r\n<ul><li>Inference mad...
4,"{'id': 'business-insider', 'name': 'Business I...",Eugene Kim,This chart shows one potential advantage AWS's...,"AI chip investments by Amazon, Google, and Mic...",https://www.businessinsider.com/aws-ai-chips-w...,https://i.insider.com/6622c44b23b29110d3011ce1...,2024-09-26T09:00:02Z,Noah Berger/Getty Images\r\n<ul><li>Big tech c...
...,...,...,...,...,...,...,...,...
94,"{'id': None, 'name': 'Yahoo Entertainment'}","Sean Williams, The Motley Fool","Billionaires Warren Buffett, David Tepper, and...",Some of Wall Street's most successful value-se...,https://finance.yahoo.com/news/billionaires-wa...,https://s.yimg.com/cv/apiv2/social/images/yaho...,2024-09-21T09:06:00Z,"For the better part of two years, the bulls ha..."
95,"{'id': None, 'name': 'Theregister.com'}",Thomas Claburn,OpenAI allegedly wants TSMC 1.6nm for in-house...,"Another job for Broadcom, then\nOpenAI's first...",https://www.theregister.com/2024/09/04/openai_...,https://regmedia.co.uk/2021/01/12/shutterstock...,2024-09-04T02:29:19Z,OpenAI's first custom-designed silicon chips a...
96,"{'id': None, 'name': 'Theregister.com'}",Liam Proven,Double Debian update: 11.11 and 12.7 arrive at...,But Bullseye's days are numbered and it's time...,https://www.theregister.com/2024/09/04/double_...,https://regmedia.co.uk/2021/08/16/shutterstock...,2024-09-04T11:28:06Z,"The latest update to Debian ""Bookworm"" arrives..."
97,"{'id': 'business-insider', 'name': 'Business I...",Dan DeFrancesco,The tech industry is ready for robot taxis. Bu...,"Driverless cars are gaining momentum, with Tes...",https://www.businessinsider.com/waymo-robot-ta...,https://i.insider.com/66a90e9a1a227600e632ca38...,2024-09-04T12:48:22Z,Waymo's fully autonomous Jaguar I-PACEBlue Pla...


In [110]:
# extract article essentials function
# extract only the title, description, and content from the articles
def extract_article_essentials(articles):
    return [{'title': article['title'], 'content': article['content'], 'publishedAt': article['publishedAt']} for article in articles]

In [111]:
extracted_articles = extract_article_essentials(valid_articles)

In [112]:
extracted_articles_df = pd.DataFrame(extracted_articles)
extracted_articles_df

Unnamed: 0,title,content,publishedAt
0,DOJ subpoenas NVIDIA as part of antitrust prob...,"If you click 'Accept all', we and our partners...",2024-09-04T15:34:35Z
1,The Leaked Nvidia RTX 5090 Has So Many Cores I...,The GeForce RTX 4090 is already so big that an...,2024-09-27T13:35:22Z
2,ByteDance will reportedly use Huawei chips to ...,"If you click 'Accept all', we and our partners...",2024-09-30T15:48:46Z
3,Nvidia might actually lose in this key part of...,Justin Sullivan/Getty\r\n<ul><li>Inference mad...,2024-09-01T13:00:02Z
4,This chart shows one potential advantage AWS's...,Noah Berger/Getty Images\r\n<ul><li>Big tech c...,2024-09-26T09:00:02Z
...,...,...,...
94,"Billionaires Warren Buffett, David Tepper, and...","For the better part of two years, the bulls ha...",2024-09-21T09:06:00Z
95,OpenAI allegedly wants TSMC 1.6nm for in-house...,OpenAI's first custom-designed silicon chips a...,2024-09-04T02:29:19Z
96,Double Debian update: 11.11 and 12.7 arrive at...,"The latest update to Debian ""Bookworm"" arrives...",2024-09-04T11:28:06Z
97,The tech industry is ready for robot taxis. Bu...,Waymo's fully autonomous Jaguar I-PACEBlue Pla...,2024-09-04T12:48:22Z


## Preprocess Text
***This is a crucial***  
Proprocessing helps clean and normalize the text data making it more suitable for analysis.  

After getting the articles, I can now preprocess the text in the articles.    


### Data Cleaning 
**Identify and remove noise:**  
We want to first remove all noise from the data.  
- Punction
- Extra whitespace

**Text normalization:**  
- Stopwords
    - Remove common/irrelevent words that are unlikely to convey much sentiment.  
- Capital letters
    - All letters should be the same case so all words are treated the same in the tokenization process.  

**Data masking:**  
Data masking is not needed in this context.  

Clean text should result.

In [113]:
# import re package (regular expressions)
import re

In [114]:
import string

In [115]:
# import nltk packages (Natural Language Toolkit)
import nltk
from nltk.corpus import stopwords

In [116]:
# download nltk data packages
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/justinhoang/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [117]:
# get all stopwords from data package
stop_words = set(stopwords.words("english"))
print(stop_words)

{'few', 'll', "mustn't", 'ours', 'or', "won't", 'below', 'doing', 'were', 'ain', 'we', 'needn', "haven't", "you're", 'has', 'after', "that'll", 'under', "shan't", 'weren', 'they', "it's", 'until', 'hasn', 't', 'did', "wasn't", 'itself', 'm', 'been', 'yourselves', "needn't", 'their', 'doesn', 'd', "you'd", 're', 'further', "shouldn't", "you'll", 'y', 'only', 'each', "wouldn't", 'how', 'themselves', 'its', 'isn', 'from', "couldn't", 'if', 'where', 'during', 'all', 'whom', 'such', 'so', 'most', "she's", 'to', "doesn't", "hadn't", 'was', 'aren', 'about', 'shouldn', 'both', 'same', 'he', 'our', 'my', 'down', 'which', 'there', 'herself', 'into', 'some', 'her', 'ma', 'by', 'above', 'your', 'does', 'them', 'have', 'in', 'theirs', 'him', 'when', 'had', 'more', 'won', 'too', 'these', "hasn't", 'up', 'can', 'yourself', 've', 'very', 'wouldn', 'just', 'out', 'shan', 'with', 'couldn', 'an', 'that', 'being', 'here', 'me', 'she', 'should', 'now', 'it', 'against', 'off', "should've", 'himself', 'no', 

In [118]:
# clean text function
# cleans the data (text)
def clean_text(text):
    # remove extra whitespace
    cleaned_text = re.sub(r'\s+', ' ', text).strip()
    # remove HTML tags
    cleaned_text = re.sub(r'<[^>]+>', '', cleaned_text)
    # remove patterns
    cleaned_text = re.sub(r'/\w+\s+\w+', '', cleaned_text)
    # remove URLs
    cleaned_text = re.sub(r'http\S+|www\S+|https\S+', '', cleaned_text, flags=re.MULTILINE)
    # remove punctuation
    cleaned_text = ''.join([char for char in text if char not in string.punctuation])
    
    # lower case all text
    cleaned_text = text.lower()
    # remove stop words from text
    cleaned_text = ' '.join([word for word in cleaned_text.split() if word not in stop_words])
    
    return cleaned_text

In [119]:
columns_to_process = ['title', 'content']

In [120]:
cleaned_articles_df = extracted_articles_df.copy()

In [121]:
for column in columns_to_process:
    cleaned_articles_df[column] = extracted_articles_df[column].apply(clean_text)

In [122]:
cleaned_articles_df

Unnamed: 0,title,content,publishedAt
0,doj subpoenas nvidia part antitrust probe rega...,"click 'accept all', partners, including 239 pa...",2024-09-04T15:34:35Z
1,leaked nvidia rtx 5090 many cores actually scares,geforce rtx 4090 already big pc builder worth ...,2024-09-27T13:35:22Z
2,bytedance reportedly use huawei chips train ne...,"click 'accept all', partners, including 240 pa...",2024-09-30T15:48:46Z
3,nvidia might actually lose key part ai chip bu...,justin sullivan/getty <ul><li>inference made 4...,2024-09-01T13:00:02Z
4,chart shows one potential advantage aws's ai c...,noah berger/getty images <ul><li>big tech clou...,2024-09-26T09:00:02Z
...,...,...,...
94,"billionaires warren buffett, david tepper, ter...","better part two years, bulls firmly control wa...",2024-09-21T09:06:00Z
95,openai allegedly wants tsmc 1.6nm in-house ai ...,openai's first custom-designed silicon chips a...,2024-09-04T02:29:19Z
96,double debian update: 11.11 12.7 arrive,"latest update debian ""bookworm"" arrives time l...",2024-09-04T11:28:06Z
97,tech industry ready robot taxis. rest us willi...,waymo's fully autonomous jaguar i-paceblue pla...,2024-09-04T12:48:22Z


### Linguistic Proccesing

I break up each of these processes into their own methods to ensure they are being processed correctly.  
Then I apply them all at once.

**Tokenization**  
Breaking down text into individual words or tokens. Helps separate individual words from the raw text, which makes it easier to analyze and understand.

**POS Tagging**  
Part-of-Speech Tagging

**Lemmatization**  
A technique to reduce words to their root forms based on their part of speech and the meaning of a word based on its context.  
An algorithmic process of identifying an inflected word's "lemma" based on its intended meaning. The algorithm we are utilizing is apart of the `NLTK` library.

**Named-Entity Recoginition**


#### Tokenization

In [123]:
# import nltk tokenization module
from nltk.tokenize import word_tokenize


In [124]:
# import nltk punkt word tokenization data package
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/justinhoang/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [125]:
# tokenize words in text
def tokenize(text):
    return word_tokenize(text)

In [126]:
tokenized_articles_df = cleaned_articles_df.copy()

In [127]:
for column in columns_to_process:
    for index, row in cleaned_articles_df.iterrows():
        try:
            tokenized_articles_df.at[index, column] = tokenize(row[column])
        except Exception as e:
            print(e)

In [128]:
tokenized_articles_df

Unnamed: 0,title,content,publishedAt
0,"[doj, subpoenas, nvidia, part, antitrust, prob...","[click, 'accept, all, ', ,, partners, ,, inclu...",2024-09-04T15:34:35Z
1,"[leaked, nvidia, rtx, 5090, many, cores, actua...","[geforce, rtx, 4090, already, big, pc, builder...",2024-09-27T13:35:22Z
2,"[bytedance, reportedly, use, huawei, chips, tr...","[click, 'accept, all, ', ,, partners, ,, inclu...",2024-09-30T15:48:46Z
3,"[nvidia, might, actually, lose, key, part, ai,...","[justin, sullivan/getty, <, ul, >, <, li, >, i...",2024-09-01T13:00:02Z
4,"[chart, shows, one, potential, advantage, aws,...","[noah, berger/getty, images, <, ul, >, <, li, ...",2024-09-26T09:00:02Z
...,...,...,...
94,"[billionaires, warren, buffett, ,, david, tepp...","[better, part, two, years, ,, bulls, firmly, c...",2024-09-21T09:06:00Z
95,"[openai, allegedly, wants, tsmc, 1.6nm, in-hou...","[openai, 's, first, custom-designed, silicon, ...",2024-09-04T02:29:19Z
96,"[double, debian, update, :, 11.11, 12.7, arrive]","[latest, update, debian, ``, bookworm, '', arr...",2024-09-04T11:28:06Z
97,"[tech, industry, ready, robot, taxis, ., rest,...","[waymo, 's, fully, autonomous, jaguar, i-paceb...",2024-09-04T12:48:22Z


#### POS Tagging

In [129]:
# import nltk pos tagging module
from nltk import pos_tag

In [130]:
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/justinhoang/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [131]:
# tag tokens
def pos_tag_text(tokens):
    return pos_tag(tokens)

In [132]:
pos_tagged_articles_df = tokenized_articles_df.copy()

In [133]:
for column in columns_to_process:
    for index, row in tokenized_articles_df.iterrows():
        try:
            pos_tagged_articles_df.at[index, column] = pos_tag_text(row[column])
        except Exception as e:
            print(e)

In [134]:
pos_tagged_articles_df

Unnamed: 0,title,content,publishedAt
0,"[(doj, NN), (subpoenas, NNS), (nvidia, VBP), (...","[(click, NN), ('accept, IN), (all, DT), (', ''...",2024-09-04T15:34:35Z
1,"[(leaked, VBN), (nvidia, RB), (rtx, JJ), (5090...","[(geforce, NN), (rtx, NN), (4090, CD), (alread...",2024-09-27T13:35:22Z
2,"[(bytedance, NN), (reportedly, RB), (use, JJ),...","[(click, NN), ('accept, IN), (all, DT), (', ''...",2024-09-30T15:48:46Z
3,"[(nvidia, NN), (might, MD), (actually, RB), (l...","[(justin, NN), (sullivan/getty, JJ), (<, NNP),...",2024-09-01T13:00:02Z
4,"[(chart, NN), (shows, VBZ), (one, CD), (potent...","[(noah, JJ), (berger/getty, NN), (images, NNS)...",2024-09-26T09:00:02Z
...,...,...,...
94,"[(billionaires, NNS), (warren, VBP), (buffett,...","[(better, RBR), (part, NN), (two, CD), (years,...",2024-09-21T09:06:00Z
95,"[(openai, RB), (allegedly, RB), (wants, VBZ), ...","[(openai, NN), ('s, POS), (first, JJ), (custom...",2024-09-04T02:29:19Z
96,"[(double, JJ), (debian, JJ), (update, NN), (:,...","[(latest, JJS), (update, NN), (debian, JJ), (`...",2024-09-04T11:28:06Z
97,"[(tech, NN), (industry, NN), (ready, JJ), (rob...","[(waymo, NN), ('s, POS), (fully, RB), (autonom...",2024-09-04T12:48:22Z


#### Lemmatization

In [135]:
# import nltk lemmatization modules
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

In [136]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/justinhoang/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [137]:
# define wordnet
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

In [138]:
# lemmatize words from tagged tokens
def lemmatize(tagged_tokens):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged_tokens]

In [139]:
lemmatized_articles_df = pos_tagged_articles_df.copy()

In [140]:
for column in columns_to_process:
    for index, row in pos_tagged_articles_df.iterrows():
        try:
            lemmatized_articles_df.at[index, column] = lemmatize(row[column])
        except Exception as e:
            print(e)

In [141]:
lemmatized_articles_df

Unnamed: 0,title,content,publishedAt
0,"[doj, subpoena, nvidia, part, antitrust, probe...","[click, 'accept, all, ', ,, partner, ,, includ...",2024-09-04T15:34:35Z
1,"[leak, nvidia, rtx, 5090, many, core, actually...","[geforce, rtx, 4090, already, big, pc, builder...",2024-09-27T13:35:22Z
2,"[bytedance, reportedly, use, huawei, chip, tra...","[click, 'accept, all, ', ,, partner, ,, includ...",2024-09-30T15:48:46Z
3,"[nvidia, might, actually, lose, key, part, ai,...","[justin, sullivan/getty, <, ul, >, <, li, >, i...",2024-09-01T13:00:02Z
4,"[chart, show, one, potential, advantage, aws, ...","[noah, berger/getty, image, <, ul, >, <, li, >...",2024-09-26T09:00:02Z
...,...,...,...
94,"[billionaire, warren, buffett, ,, david, teppe...","[well, part, two, year, ,, bull, firmly, contr...",2024-09-21T09:06:00Z
95,"[openai, allegedly, want, tsmc, 1.6nm, in-hous...","[openai, 's, first, custom-designed, silicon, ...",2024-09-04T02:29:19Z
96,"[double, debian, update, :, 11.11, 12.7, arrive]","[late, update, debian, ``, bookworm, '', arriv...",2024-09-04T11:28:06Z
97,"[tech, industry, ready, robot, taxi, ., rest, ...","[waymo, 's, fully, autonomous, jaguar, i-paceb...",2024-09-04T12:48:22Z


#### Named Entity Recognition

In [142]:
# import nltk named entity modules
from nltk.chunk import ne_chunk

In [143]:
# download nltk ne data packages
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /Users/justinhoang/nltk_data...
[nltk_data]   Package maxent_ne_chunker_tab is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     /Users/justinhoang/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [144]:
# create named entity recognition trees from the tagged tokens
def ner(pos_tags):
    return ne_chunk(pos_tags)

In [146]:
ner_articles_df = pos_tagged_articles_df.copy()

In [147]:
for column in columns_to_process:
    ner_articles_df[column] = pos_tagged_articles_df[column].apply(ner)

In [148]:
ner_articles_df

Unnamed: 0,title,content,publishedAt
0,"[(doj, NN), (subpoenas, NNS), (nvidia, VBP), (...","[(click, NN), ('accept, IN), (all, DT), (', ''...",2024-09-04T15:34:35Z
1,"[(leaked, VBN), (nvidia, RB), (rtx, JJ), (5090...","[(geforce, NN), (rtx, NN), (4090, CD), (alread...",2024-09-27T13:35:22Z
2,"[(bytedance, NN), (reportedly, RB), (use, JJ),...","[(click, NN), ('accept, IN), (all, DT), (', ''...",2024-09-30T15:48:46Z
3,"[(nvidia, NN), (might, MD), (actually, RB), (l...","[(justin, NN), (sullivan/getty, JJ), (<, NNP),...",2024-09-01T13:00:02Z
4,"[(chart, NN), (shows, VBZ), (one, CD), (potent...","[(noah, JJ), (berger/getty, NN), (images, NNS)...",2024-09-26T09:00:02Z
...,...,...,...
94,"[(billionaires, NNS), (warren, VBP), (buffett,...","[(better, RBR), (part, NN), (two, CD), (years,...",2024-09-21T09:06:00Z
95,"[(openai, RB), (allegedly, RB), (wants, VBZ), ...","[(openai, NN), ('s, POS), (first, JJ), (custom...",2024-09-04T02:29:19Z
96,"[(double, JJ), (debian, JJ), (update, NN), (:,...","[(latest, JJS), (update, NN), (debian, JJ), (`...",2024-09-04T11:28:06Z
97,"[(tech, NN), (industry, NN), (ready, JJ), (rob...","[(waymo, NN), ('s, POS), (fully, RB), (autonom...",2024-09-04T12:48:22Z


In [149]:
# ensure ner function is creating ner trees
print(ner_articles_df['title'][0])
print(ner_articles_df['content'][0])

(S
  doj/NN
  subpoenas/NNS
  nvidia/VBP
  part/NN
  antitrust/JJ
  probe/NN
  regarding/VBG
  ai/JJ
  processors/NNS)
(S
  click/NN
  'accept/IN
  all/DT
  '/''
  ,/,
  partners/NNS
  ,/,
  including/VBG
  239/CD
  part/NN
  iab/NN
  transparency/NN
  &/CC
  amp/NN
  ;/:
  consent/NN
  framework/NN
  ,/,
  also/RB
  store/VBP
  and/or/JJ
  access/NN
  information/NN
  device/NN
  (/(
  in/IN
  words/NNS
  ,/,
  use/NN
  …/JJ
  [/NNP
  +678/NNP
  chars/VBZ
  ]/VB)


### Preprocess the Text

After creating all the functions necessary for linguistic processing, we can now apply all the entire suite of preprocessing text functions on our articles.  

After processing, we can move on to gathering the sentiment of articles.

#### Linguisitic Processing

In [150]:
# apply all linguistic processing functions on text
def linguistic_process(text):
    tokens = tokenize(text)
    pos_tags = pos_tag_text(tokens)
    lemmatized = lemmatize(pos_tags)
    ner_tree = ner(pos_tags)
    
    processed_text = ' '.join(lemmatized)
    return processed_text, ner_tree

#### Apply Preprocessing

In [151]:
# apply entire preprocessing process on text
def preprocess_text(text):
    cleaned = clean_text(text)
    processed, ner = linguistic_process(cleaned)
    return processed, ner

In [152]:
processed_articles_df = extracted_articles_df.copy()

In [153]:
for column in columns_to_process:
    processed_articles_df[f'processed_{column}'], processed_articles_df[f'ner_{column}'] = zip(*extracted_articles_df[column].apply(preprocess_text))

In [154]:
processed_articles_df

Unnamed: 0,title,content,publishedAt,processed_title,ner_title,processed_content,ner_content
0,DOJ subpoenas NVIDIA as part of antitrust prob...,"If you click 'Accept all', we and our partners...",2024-09-04T15:34:35Z,doj subpoena nvidia part antitrust probe regar...,"[(doj, NN), (subpoenas, NNS), (nvidia, VBP), (...","click 'accept all ' , partner , include 239 pa...","[(click, NN), ('accept, IN), (all, DT), (', ''..."
1,The Leaked Nvidia RTX 5090 Has So Many Cores I...,The GeForce RTX 4090 is already so big that an...,2024-09-27T13:35:22Z,leak nvidia rtx 5090 many core actually scare,"[(leaked, VBN), (nvidia, RB), (rtx, JJ), (5090...",geforce rtx 4090 already big pc builder worth ...,"[(geforce, NN), (rtx, NN), (4090, CD), (alread..."
2,ByteDance will reportedly use Huawei chips to ...,"If you click 'Accept all', we and our partners...",2024-09-30T15:48:46Z,bytedance reportedly use huawei chip train new...,"[(bytedance, NN), (reportedly, RB), (use, JJ),...","click 'accept all ' , partner , include 240 pa...","[(click, NN), ('accept, IN), (all, DT), (', ''..."
3,Nvidia might actually lose in this key part of...,Justin Sullivan/Getty\r\n<ul><li>Inference mad...,2024-09-01T13:00:02Z,nvidia might actually lose key part ai chip bu...,"[(nvidia, NN), (might, MD), (actually, RB), (l...",justin sullivan/getty < ul > < li > inference ...,"[(justin, NN), (sullivan/getty, JJ), (<, NNP),..."
4,This chart shows one potential advantage AWS's...,Noah Berger/Getty Images\r\n<ul><li>Big tech c...,2024-09-26T09:00:02Z,chart show one potential advantage aws 's ai c...,"[(chart, NN), (shows, VBZ), (one, CD), (potent...",noah berger/getty image < ul > < li > big tech...,"[(noah, JJ), (berger/getty, NN), (images, NNS)..."
...,...,...,...,...,...,...,...
94,"Billionaires Warren Buffett, David Tepper, and...","For the better part of two years, the bulls ha...",2024-09-21T09:06:00Z,"billionaire warren buffett , david tepper , te...","[(billionaires, NNS), (warren, VBP), (buffett,...","well part two year , bull firmly control wall ...","[(better, RBR), (part, NN), (two, CD), (years,..."
95,OpenAI allegedly wants TSMC 1.6nm for in-house...,OpenAI's first custom-designed silicon chips a...,2024-09-04T02:29:19Z,openai allegedly want tsmc 1.6nm in-house ai c...,"[(openai, RB), (allegedly, RB), (wants, VBZ), ...",openai 's first custom-designed silicon chip a...,"[(openai, NN), ('s, POS), (first, JJ), (custom..."
96,Double Debian update: 11.11 and 12.7 arrive at...,"The latest update to Debian ""Bookworm"" arrives...",2024-09-04T11:28:06Z,double debian update : 11.11 12.7 arrive,"[(double, JJ), (debian, JJ), (update, NN), (:,...",late update debian `` bookworm '' arrive time ...,"[(latest, JJS), (update, NN), (debian, JJ), (`..."
97,The tech industry is ready for robot taxis. Bu...,Waymo's fully autonomous Jaguar I-PACEBlue Pla...,2024-09-04T12:48:22Z,tech industry ready robot taxi . rest u willin...,"[(tech, NN), (industry, NN), (ready, JJ), (rob...",waymo 's fully autonomous jaguar i-paceblue pl...,"[(waymo, NN), ('s, POS), (fully, RB), (autonom..."


In [181]:
# remove the time variable from 'publishedAt'
processed_articles_df['publishedAt'] = pd.to_datetime(processed_articles_df['publishedAt'])

In [156]:
processed_articles_df['publishedAt'] = processed_articles_df['publishedAt'].dt.date

In [157]:
processed_articles_df

Unnamed: 0,title,content,publishedAt,processed_title,ner_title,processed_content,ner_content
0,DOJ subpoenas NVIDIA as part of antitrust prob...,"If you click 'Accept all', we and our partners...",2024-09-04,doj subpoena nvidia part antitrust probe regar...,"[(doj, NN), (subpoenas, NNS), (nvidia, VBP), (...","click 'accept all ' , partner , include 239 pa...","[(click, NN), ('accept, IN), (all, DT), (', ''..."
1,The Leaked Nvidia RTX 5090 Has So Many Cores I...,The GeForce RTX 4090 is already so big that an...,2024-09-27,leak nvidia rtx 5090 many core actually scare,"[(leaked, VBN), (nvidia, RB), (rtx, JJ), (5090...",geforce rtx 4090 already big pc builder worth ...,"[(geforce, NN), (rtx, NN), (4090, CD), (alread..."
2,ByteDance will reportedly use Huawei chips to ...,"If you click 'Accept all', we and our partners...",2024-09-30,bytedance reportedly use huawei chip train new...,"[(bytedance, NN), (reportedly, RB), (use, JJ),...","click 'accept all ' , partner , include 240 pa...","[(click, NN), ('accept, IN), (all, DT), (', ''..."
3,Nvidia might actually lose in this key part of...,Justin Sullivan/Getty\r\n<ul><li>Inference mad...,2024-09-01,nvidia might actually lose key part ai chip bu...,"[(nvidia, NN), (might, MD), (actually, RB), (l...",justin sullivan/getty < ul > < li > inference ...,"[(justin, NN), (sullivan/getty, JJ), (<, NNP),..."
4,This chart shows one potential advantage AWS's...,Noah Berger/Getty Images\r\n<ul><li>Big tech c...,2024-09-26,chart show one potential advantage aws 's ai c...,"[(chart, NN), (shows, VBZ), (one, CD), (potent...",noah berger/getty image < ul > < li > big tech...,"[(noah, JJ), (berger/getty, NN), (images, NNS)..."
...,...,...,...,...,...,...,...
94,"Billionaires Warren Buffett, David Tepper, and...","For the better part of two years, the bulls ha...",2024-09-21,"billionaire warren buffett , david tepper , te...","[(billionaires, NNS), (warren, VBP), (buffett,...","well part two year , bull firmly control wall ...","[(better, RBR), (part, NN), (two, CD), (years,..."
95,OpenAI allegedly wants TSMC 1.6nm for in-house...,OpenAI's first custom-designed silicon chips a...,2024-09-04,openai allegedly want tsmc 1.6nm in-house ai c...,"[(openai, RB), (allegedly, RB), (wants, VBZ), ...",openai 's first custom-designed silicon chip a...,"[(openai, NN), ('s, POS), (first, JJ), (custom..."
96,Double Debian update: 11.11 and 12.7 arrive at...,"The latest update to Debian ""Bookworm"" arrives...",2024-09-04,double debian update : 11.11 12.7 arrive,"[(double, JJ), (debian, JJ), (update, NN), (:,...",late update debian `` bookworm '' arrive time ...,"[(latest, JJS), (update, NN), (debian, JJ), (`..."
97,The tech industry is ready for robot taxis. Bu...,Waymo's fully autonomous Jaguar I-PACEBlue Pla...,2024-09-04,tech industry ready robot taxi . rest u willin...,"[(tech, NN), (industry, NN), (ready, JJ), (rob...",waymo 's fully autonomous jaguar i-paceblue pl...,"[(waymo, NN), ('s, POS), (fully, RB), (autonom..."


# Sentiment Analysis

From the data that we preprocessed through cleaning, then linguistic processing; we are now able to gather the sentiment of the news articles.

The sentiment scores returned by `NLTK's VADER SentimentIntensityAnalyzer` range from -1 to 1. (VADER: Valence Aware Dictionary and sEntiment Reasoner)  
- Scores between -1 and -0.05 general indicate negative sentiment
- Scores between -0.05 and 0.05 are considered neutrual
- Sores between 0.05 and 1 indicate positive sentiment  

The more extreme the score, the stronger the sentiment.  

In [158]:
# import nltk sentiment intsensity analyzer module
from nltk.sentiment import SentimentIntensityAnalyzer

In [159]:
# download nltk VADER Lexicon data package
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/justinhoang/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [160]:
# get the sentment of the texts from nltk's sentiment intensity analyzer
def get_sentiment(text):
    sia = SentimentIntensityAnalyzer()
    return sia.polarity_scores(text)['compound']

In [161]:
processed_articles_df['sentiment_title'] = processed_articles_df['processed_title'].apply(get_sentiment)
processed_articles_df['sentiment_content'] = processed_articles_df['processed_content'].apply(get_sentiment)

In [162]:
sentiments_df = processed_articles_df[['sentiment_title', 'sentiment_content', 'publishedAt']]

In [163]:
sentiments_df

Unnamed: 0,sentiment_title,sentiment_content,publishedAt
0,0.0000,0.5423,2024-09-04
1,-0.6808,0.2263,2024-09-27
2,0.0000,0.5423,2024-09-30
3,-0.4019,0.5106,2024-09-01
4,0.2500,0.1779,2024-09-26
...,...,...,...
94,0.2023,0.8126,2024-09-21
95,0.0772,0.0000,2024-09-04
96,0.0000,-0.4019,2024-09-04
97,0.3612,0.0000,2024-09-04


# Stock Price Data Collection

The next step in building our analysis model is collecting the stock price data for the specified company.  

In [164]:
# import yahoo finance package
import yfinance as yf

In [165]:
# gather stock data
def get_stock_data(ticker, start_date, end_date):
    stock = yf.Ticker(ticker)
    data = stock.history(start=start_date, end=end_date)
    return data

In [166]:
earliest_date = sentiments_df['publishedAt'].min()
latest_date = sentiments_df['publishedAt'].max()

In [167]:
start_date = earliest_date
end_date = latest_date

In [168]:
stock_data = get_stock_data('NVDA', start_date, end_date)

In [169]:
stock_data

Unnamed: 0_level_0,Open,High,Low,Close,Volume,Dividends,Stock Splits
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2024-09-03 00:00:00-04:00,116.000078,116.200058,107.280822,107.990761,477155100,0.0,0.0
2024-09-04 00:00:00-04:00,105.400985,113.260306,104.111095,106.200912,372470300,0.0,0.0
2024-09-05 00:00:00-04:00,104.981017,109.640622,104.751041,107.200829,306850700,0.0,0.0
2024-09-06 00:00:00-04:00,108.030759,108.14075,100.941361,102.821205,413638100,0.0,0.0
2024-09-09 00:00:00-04:00,104.871024,106.540887,103.681131,106.460892,273912000,0.0,0.0
2024-09-10 00:00:00-04:00,107.800776,109.390643,104.94102,108.090752,268283700,0.0,0.0
2024-09-11 00:00:00-04:00,109.380641,117.179976,107.410808,116.900002,441422400,0.0,0.0
2024-09-12 00:00:00-04:00,116.839996,120.790001,115.379997,119.139999,367100500,0.01,0.0
2024-09-13 00:00:00-04:00,119.080002,119.959999,117.599998,119.099998,238358300,0.0,0.0
2024-09-16 00:00:00-04:00,116.790001,118.18,114.360001,116.779999,248772300,0.0,0.0


**Note:**  
Some days are missing for stock price data because the Stock market is open only during the weekdays, not including holidays.  

We also fix the formatting for the date for the stock prices. This way we ensure that the data matches with the sentiment data.

In [170]:
# reformat stock data dates
stock_data.index = pd.to_datetime(stock_data.index)

In [171]:
stock_data.index = pd.to_datetime(stock_data.index).date

In [172]:
stock_data = stock_data.reset_index()

In [173]:
stock_data = stock_data.rename(columns={'index': 'date'})

In [174]:
stock_data

Unnamed: 0,date,Open,High,Low,Close,Volume,Dividends,Stock Splits
0,2024-09-03,116.000078,116.200058,107.280822,107.990761,477155100,0.0,0.0
1,2024-09-04,105.400985,113.260306,104.111095,106.200912,372470300,0.0,0.0
2,2024-09-05,104.981017,109.640622,104.751041,107.200829,306850700,0.0,0.0
3,2024-09-06,108.030759,108.14075,100.941361,102.821205,413638100,0.0,0.0
4,2024-09-09,104.871024,106.540887,103.681131,106.460892,273912000,0.0,0.0
5,2024-09-10,107.800776,109.390643,104.94102,108.090752,268283700,0.0,0.0
6,2024-09-11,109.380641,117.179976,107.410808,116.900002,441422400,0.0,0.0
7,2024-09-12,116.839996,120.790001,115.379997,119.139999,367100500,0.01,0.0
8,2024-09-13,119.080002,119.959999,117.599998,119.099998,238358300,0.0,0.0
9,2024-09-16,116.790001,118.18,114.360001,116.779999,248772300,0.0,0.0


# Merge Data

The next step is to merge the stock price data with the sentiment analysis data.  

Once the data is merged, we are able to create our model.  

In [175]:
# merge the sentiment scores and stock data on their dates
merged_data_df = pd.merge(sentiments_df, stock_data, left_on='publishedAt', right_on='date', how='left')

In [176]:
merged_data_df = merged_data_df.sort_values('date')

In [177]:
merged_data_df = merged_data_df.drop('date', axis=1)

In [178]:
merged_data_df = merged_data_df.rename(columns={'publishedAt': 'date'})

In [179]:
merged_data_df

Unnamed: 0,sentiment_title,sentiment_content,date,Open,High,Low,Close,Volume,Dividends,Stock Splits
40,0.0000,0.0000,2024-09-03,116.000078,116.200058,107.280822,107.990761,477155100.0,0.0,0.0
42,0.5423,0.5423,2024-09-03,116.000078,116.200058,107.280822,107.990761,477155100.0,0.0,0.0
9,0.3612,0.0000,2024-09-03,116.000078,116.200058,107.280822,107.990761,477155100.0,0.0,0.0
86,0.0000,0.3612,2024-09-03,116.000078,116.200058,107.280822,107.990761,477155100.0,0.0,0.0
20,-0.1531,0.2500,2024-09-03,116.000078,116.200058,107.280822,107.990761,477155100.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
67,-0.5106,-0.0516,2024-09-21,,,,,,,
69,0.4939,0.3612,2024-09-21,,,,,,,
78,-0.0516,0.5423,2024-09-29,,,,,,,
81,0.0000,0.0000,2024-09-28,,,,,,,
