Financial Stock Price Dataset: Daily or intraday data for various stocks, which can be obtained from financial databases like Yahoo Finance or Alpha Vantage.

In [1]:
# Example: Fetching historical stock prices from Yahoo Finance
import yfinance as yf

# Define the stock ticker and time period
ticker = 'AAPL'
data = yf.download(ticker, start='2023-01-01', end='2024-05-01')
data["Company"] = "Apple"

ticker = 'MSFT'
data2 = yf.download(ticker, start='2023-01-01', end='2024-05-01')
data2["Company"] = "Microsoft"

print(data2.head())

[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed

                  Open        High         Low       Close   Adj Close  \
Date                                                                     
2023-01-03  243.080002  245.750000  237.399994  239.580002  236.183502   
2023-01-04  232.279999  232.869995  225.960007  229.100006  225.852097   
2023-01-05  227.199997  227.550003  221.759995  222.309998  219.158356   
2023-01-06  223.000000  225.759995  219.350006  224.929993  221.741211   
2023-01-09  226.449997  231.240005  226.410004  227.119995  223.900162   

              Volume    Company  
Date                             
2023-01-03  25740000  Microsoft  
2023-01-04  50623400  Microsoft  
2023-01-05  39585600  Microsoft  
2023-01-06  43613600  Microsoft  
2023-01-09  27369800  Microsoft  





In [2]:
data.to_csv("data/apple_stock.csv")
data2.to_csv("data/msft_stock.csv")

Apple data

In [16]:
data.describe()

Unnamed: 0,Open,High,Low,Close,Adj Close,Volume
count,333.0,333.0,333.0,333.0,333.0,333.0
mean,173.83003,175.402523,172.508378,174.054655,173.113882,59554360.0
std,15.985741,15.743105,15.953854,15.806214,15.920892,17972610.0
min,126.010002,127.769997,124.169998,125.019997,123.855095,24048300.0
25%,166.539993,168.559998,165.649994,167.039993,166.321625,47720200.0
50%,175.720001,177.580002,174.050003,175.839996,174.944794,54945800.0
75%,185.889999,187.330002,184.270004,186.009995,185.217163,66133400.0
max,198.020004,199.619995,197.0,198.110001,197.361084,154357300.0


For Stock Prices:

- Handle missing values by forward-filling or removing them.
- Normalize prices or returns to bring them into a common scale.

In [17]:
# Example: Cleaning and normalizing stock price data
data.fillna(method='ffill', inplace=True)  # Forward-fill missing values
data['Returns'] = data['Adj Close'].pct_change().fillna(0)  # Calculate daily returns
data['Normalized_Returns'] = (data['Returns'] - data['Returns'].mean()) / data['Returns'].std()

print(data[['Returns', 'Normalized_Returns']].head())


             Returns  Normalized_Returns
Date                                    
2023-01-03  0.000000           -0.080077
2023-01-04  0.010314            0.720935
2023-01-05 -0.010605           -0.903638
2023-01-06  0.036794            2.777383
2023-01-09  0.004089            0.237454


  data.fillna(method='ffill', inplace=True)  # Forward-fill missing values


Financial News Articles: News articles related to the stock market, which can be sourced from news aggregators like Google News, or specific financial news providers like Bloomberg or Reuters.

Apple news

In [30]:
import newspaper
from newspaper import Article
import pandas as pd
from datetime import datetime

# Define news sources to scrape
news_sources = [
    "https://www.reuters.com/technology/",
    "https://www.cnbc.com/technology/",
    "https://www.bloomberg.com/technology"
    "https://www.forbes.com/sites/technology/"
    "https://www.businesstoday.in/latest/economy/"
]

# Keywords to filter articles
keywords = ["Apple", "iPhone","Apple Vision Pro","AAPL","MacBook", "iPad",]

# Date range for filtering
start_date = datetime(2022, 8, 1)
end_date = datetime(2024, 5, 1)

# Function to collect articles
def collect_articles(news_sources, keywords, start_date, end_date):
    articles = []
    
    for source in news_sources:
        paper = newspaper.build(source, memoize_articles=False)
        
        for article in paper.articles:
            try:
                article.download()
                article.parse()

                # Check if the article's publication date is within the desired range
                if article.publish_date and start_date <= article.publish_date <= end_date:
                    # Check if the article contains any of the keywords
                    if any(keyword in article.text for keyword in keywords):
                        articles.append({
                            "title": article.title,
                            "date": article.publish_date,
                            "text": article.text,
                            "source": source
                        })
            except Exception as e:
                print(f"Failed to download article: {e}")
    
    return articles

# Collect articles
articles = collect_articles(news_sources, keywords, start_date, end_date)

# Convert to DataFrame for easier handling
df = pd.DataFrame(articles)
df.dropna(subset=['date'], inplace=True)  # Drop articles without a publish date

print(df.head())

# Save the dataset to a CSV file
#df.to_csv('apple_microsoft_financial_news.csv', index=False)


Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to downlo

In [34]:
df.head()

Unnamed: 0,title,date,text,source
0,Apple announces new MacBook Air laptops with i...,2024-03-04,Apple on Monday announced new versions of its ...,https://www.cnbc.com/technology/
1,Here's what Meta CEO Mark Zuckerberg has to sa...,2024-02-14,Meta CEO Mark Zuckerberg demonstrates an Oculu...,https://www.cnbc.com/technology/
2,Apple's Vision Pro virtual reality headset lau...,2024-02-02,The first customer walks out of the Apple Stor...,https://www.cnbc.com/technology/
3,Apple Vision Pro review: This is the future of...,2024-01-30,In this article AAPL Follow your favorite stoc...,https://www.cnbc.com/technology/
4,"Apple $3,499 Vision Pro headset now available ...",2024-01-19,"Preorders for Apple 's $3,499 Vision Pro heads...",https://www.cnbc.com/technology/


In [35]:
df.to_csv('data/apple_financial_news.csv', index=False)

Microsoft News

In [36]:
import newspaper
from newspaper import Article
import pandas as pd
from datetime import datetime

# Define news sources to scrape
news_sources = [
    "https://www.reuters.com/technology/",
    "https://www.cnbc.com/technology/",
    "https://www.bloomberg.com/technology"
    "https://www.forbes.com/sites/technology/"
    "https://www.businesstoday.in/latest/economy/"
]

# Keywords to filter articles
keywords = ["Microsoft"]

# Date range for filtering
start_date = datetime(2022, 8, 1)
end_date = datetime(2024, 5, 1)

# Function to collect articles
def collect_articles(news_sources, keywords, start_date, end_date):
    articles = []
    
    for source in news_sources:
        paper = newspaper.build(source, memoize_articles=False)
        
        for article in paper.articles:
            try:
                article.download()
                article.parse()

                # Check if the article's publication date is within the desired range
                if article.publish_date and start_date <= article.publish_date <= end_date:
                    # Check if the article contains any of the keywords
                    if any(keyword in article.text for keyword in keywords):
                        articles.append({
                            "title": article.title,
                            "date": article.publish_date,
                            "text": article.text,
                            "source": source
                        })
            except Exception as e:
                print(f"Failed to download article: {e}")
    
    return articles

# Collect articles
articles = collect_articles(news_sources, keywords, start_date, end_date)

# Convert to DataFrame for easier handling
df_msft = pd.DataFrame(articles)
df_msft.dropna(subset=['date'], inplace=True)  # Drop articles without a publish date

print(df_msft.head())

# Save the dataset to a CSV file
df_msft.to_csv('data/microsoft_financial_news.csv', index=False)


Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to download article: can't compare offset-naive and offset-aware datetimes
Failed to downlo

In [None]:
df_msft