# Financial Sentiment Analysis (Single)

In this program, I run a sentiment analysis of a single company based on financial news articles.

The company that I am targeting is Nvidia (NVDA).

## Fetching News Articles

The first step is to fetch the news articles.  

I am using `NewsAPI` to get articles quickly and easily. Then, I use `pandas` to put the articles into a dataframe, where I can collect and read the data more easily.  

**Filtering articles:**  
Keep only the articles that still exist  
- `NewsAPI` sometimes returns articles that were removed; their title is the placeholder `[Removed]`  

**Extracting the data:**  
Extract only the necessary data from the articles
- Title
- Description
- Content

All other fields can be discarded.  

Both of these steps are part of data cleaning, which is the next step in text preprocessing.

In [1]:
import os
from dotenv import load_dotenv

In [2]:
# get path to the environment file
env_path = '../config/.env'
load_dotenv(env_path)

True

In [3]:
# import newsapi package
from newsapi import NewsApiClient

In [4]:
# init newsapi
newsapi = NewsApiClient(api_key=os.getenv('NEWS_API_KEY'))
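
`os.getenv` returns `None` when a variable is missing, so a wrong `.env` path only shows up later as an API error. As a minimal sketch, a small guard can fail early instead. The helper name `require_env` is hypothetical and not part of `python-dotenv` or `newsapi`.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check the .env file path."
        )
    return value


# Possible usage (not executed here):
# newsapi = NewsApiClient(api_key=require_env('NEWS_API_KEY'))
```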

In [6]:
# fetch English-language articles that mention Nvidia (a single page of results)
all_articles = newsapi.get_everything(q='Nvidia',
                                      language='en')
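
The call above returns one page of results. If more articles are needed, one option is to loop over the `page` and `page_size` parameters of `get_everything`. The sketch below assumes the response dictionary carries `articles` and `totalResults` keys. The helper name `fetch_all_pages` and its defaults are hypothetical, and depending on the NewsAPI plan, requests beyond the first page may be rejected, so `max_pages` defaults to 1.

```python
def fetch_all_pages(client, query, language='en', page_size=100, max_pages=1):
    """Collect articles across result pages from a NewsApiClient-like object."""
    articles = []
    for page in range(1, max_pages + 1):
        response = client.get_everything(q=query, language=language,
                                         page=page, page_size=page_size)
        batch = response.get('articles', [])
        articles.extend(batch)
        # stop when the page came back short or every reported result is collected
        if len(batch) < page_size or len(articles) >= response.get('totalResults', 0):
            break
    return articles


# Possible usage (not executed here):
# articles = fetch_all_pages(newsapi, 'Nvidia')
```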

In [7]:
import pandas as pd
pd.__version__

'2.2.3'

In [8]:
# place all_articles into a dataframe
all_articles_df = pd.DataFrame(all_articles['articles'])
print(all_articles_df[['title']].head())

                                               title
0  DOJ subpoenas NVIDIA as part of antitrust prob...
1                                          [Removed]
2  Nvidia might actually lose in this key part of...
3  Nvidia CEO Jensen Huang says the payback on AI...
4  Stock market today: Dow hits record high while...


In [9]:
# filter articles function
# keeps only valid articles
# valid meaning: the article's title is not the '[Removed]' placeholder
def filter_removed_articles(articles):
    return [article for article in articles if article.get('title') != '[Removed]']

In [10]:
# filter all_articles
valid_articles = filter_removed_articles(all_articles['articles'])

In [11]:
valid_articles_df = pd.DataFrame(valid_articles)
print(valid_articles_df[['title']].head())

                                               title
0  DOJ subpoenas NVIDIA as part of antitrust prob...
1  Nvidia might actually lose in this key part of...
2  Nvidia CEO Jensen Huang says the payback on AI...
3  Stock market today: Dow hits record high while...
4  Nvidia Hit With DOJ Subpoena In Escalating Ant...


In [12]:
# extract article essentials function
# extract only the title, description, and content from the articles
def extract_article_essentials(articles):
    return [
        {'title': article['title'],
         'description': article['description'],
         'content': article['content']}
        for article in articles
    ]

In [13]:
extracted_articles = extract_article_essentials(valid_articles)

In [14]:
extracted_articles_df = pd.DataFrame(extracted_articles)
print(extracted_articles_df[['title']].head())

                                               title
0  DOJ subpoenas NVIDIA as part of antitrust prob...
1  Nvidia might actually lose in this key part of...
2  Nvidia CEO Jensen Huang says the payback on AI...
3  Stock market today: Dow hits record high while...
4  Nvidia Hit With DOJ Subpoena In Escalating Ant...
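
Some fields of an article may be missing or `None`, which would make later string operations such as `re.sub` fail. A minimal sketch of a more defensive extraction is shown below. The function name `extract_article_essentials_safe` and the sample records are made up for illustration.

```python
def extract_article_essentials_safe(articles):
    """Keep title, description, and content, replacing missing values with ''."""
    fields = ('title', 'description', 'content')
    return [{field: (article.get(field) or '') for field in fields}
            for article in articles]


# made-up sample records, not real NewsAPI output
sample = [
    {'title': 'Example headline', 'description': None,
     'content': 'Body text', 'url': 'https://example.com'},
    {'title': 'Another headline', 'description': 'Short summary'},
]
extracted = extract_article_essentials_safe(sample)
```

Every record then has the same three string fields, so the cleaning step can be applied without checking for `None` first.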


## Preprocess Text
***This is a crucial step.***  
Preprocessing helps clean and normalize the text data, making it more suitable for analysis.  

After getting the articles, I can now preprocess their text.  

### Data Cleaning 
**Identify and remove noise:**  
First, remove all noise from the data.  
- Punctuation
- Extra whitespace

**Text normalization:**  
- Remove stopwords
- Lowercase capital letters
    - All letters should be the same case so that all words are treated the same in the tokenization process.  

**Data masking:**  
Data masking is not needed in this context.  

The result should be clean text.  
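
To make the cleaning steps listed above concrete, here is a small sketch that records the text after each step. The sentence is made up, and `SAMPLE_STOPWORDS` is a tiny illustrative set, not the NLTK English list.

```python
import re
import string

# tiny illustrative stopword set (assumption); NLTK's English list is much longer
SAMPLE_STOPWORDS = {'the', 'a', 'in', 'of', 'is', 'as'}


def show_cleaning_steps(text):
    """Return the intermediate result of each cleaning step, keyed by step name."""
    steps = {}
    # remove punctuation
    steps['no_punctuation'] = text.translate(str.maketrans('', '', string.punctuation))
    # collapse runs of whitespace into single spaces
    steps['single_spaces'] = re.sub(r'\s+', ' ', steps['no_punctuation']).strip()
    # lowercase everything
    steps['lowercase'] = steps['single_spaces'].lower()
    # drop stopwords
    steps['no_stopwords'] = ' '.join(
        word for word in steps['lowercase'].split() if word not in SAMPLE_STOPWORDS
    )
    return steps


steps = show_cleaning_steps("The DOJ  subpoenas NVIDIA, as part of a probe!")
# steps['no_stopwords'] == 'doj subpoenas nvidia part probe'
```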

**Tokenization:**  
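
The notebook does not settle on a tokenizer yet. As one simple option, assuming the text has already been cleaned as described above, splitting on whitespace gives word tokens. The function name `tokenize` is hypothetical; NLTK also offers `word_tokenize`, which needs additional NLTK data to be downloaded.

```python
def tokenize(cleaned_text):
    """Split already-cleaned text into word tokens on whitespace."""
    return cleaned_text.split()


tokens = tokenize('doj subpoenas nvidia part probe')
# tokens == ['doj', 'subpoenas', 'nvidia', 'part', 'probe']
```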


In [15]:
# import re package (regular expressions)
import re

In [16]:
import string

In [17]:
# import nltk packages (Natural Language Toolkit)
import nltk
from nltk.corpus import stopwords

In [18]:
# download nltk data package 'stopwords'
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/justinhoang/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [19]:
# clean text function
# cleans the data (text)
def clean_text(text):
    # identify and remove noise
    # remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # remove extra whitespace (after punctuation removal, which can leave double spaces)
    text = re.sub(r'\s+', ' ', text).strip()

    # normalize text
    # lower case all text
    text = text.lower()
    # remove stopwords (a set gives fast membership checks)
    stop_words = set(stopwords.words('english'))
    text = ' '.join(word for word in text.split() if word not in stop_words)

    # tokenize words

    return text