EDA on stock news data
- fix datetime conversion issues
- check the number of rows with blank stock tickers
- check if any stock tickers have no news data
- check to see the date ranges for each stock ticker

Data Source - Filtered version in google drive:

Reference URL:
https://huggingface.co/datasets/Zihan1004/FNSPID

Citation:
@misc{dong2024fnspid,
  title={FNSPID: A Comprehensive Financial News Dataset in Time Series},
  author={Zihan Dong and Xinyu Fan and Zhiyuan Peng},
  year={2024},
  eprint={2402.06698},
  archivePrefix={arXiv},
  primaryClass={q-fin.ST}
}

In [None]:
# import statements

# to be able to access files from my drive
from google.colab import drive
# to be able to read the csv files
import csv
import pandas as pd
import numpy as np

In [None]:
# mount my google drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# load the csv file of S&P 100 news
file_path_sp100 = '/content/drive/MyDrive/DATASCI_266_NLP/sp100_news_2018_2023.csv'

df_sp100 = pd.read_csv(file_path_sp100)



In [None]:
# load the csv file of S&P 500 tech news
file_path_sptech = '/content/drive/MyDrive/DATASCI_266_NLP/sptech_news_2018_2023.csv'

df_sptech = pd.read_csv(file_path_sptech)



In [None]:
# fix the date formatting issue for dates that lack time component
df_sp100['Datetime'] = df_sp100['Date'].apply(
    lambda x: x if ' ' in x else f'{x} 00:00:00'
)

df_sptech['Datetime'] = df_sptech['Date'].apply(
    lambda x: x if ' ' in x else f'{x} 00:00:00'
)

# convert to datetime
df_sptech['Datetime'] = pd.to_datetime(df_sptech['Datetime'])
df_sp100['Datetime'] = pd.to_datetime(df_sp100['Datetime'])
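
The padding step above assumes every `Date` value is a string; a missing value would make `' ' in x` raise a `TypeError`. A minimal sketch of a more defensive variant, assuming timestamps follow `YYYY-MM-DD HH:MM:SS` as in the rows previewed in this notebook. `pad_time` and `normalize_dates` are hypothetical helper names, not part of the original notebook:

```python
import pandas as pd

def pad_time(value):
    """Append a midnight time to date-only strings; pass non-strings through."""
    if not isinstance(value, str):
        return value  # NaN/None are left alone and become NaT on conversion
    value = value.strip()
    return value if " " in value else f"{value} 00:00:00"

def normalize_dates(dates: pd.Series) -> pd.Series:
    # Assumed format; values that do not match are coerced to NaT
    return pd.to_datetime(
        dates.map(pad_time), format="%Y-%m-%d %H:%M:%S", errors="coerce"
    )

# toy data, not taken from the dataset
sample = pd.Series(["2023-12-16 22:00:00", "2020-03-09", None, "garbage"])
parsed = normalize_dates(sample)
print(parsed)
```

With `errors="coerce"`, unparseable values show up as `NaT`, so the n/a count in the next cell would report them rather than the conversion failing outright.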

In [None]:
# ensure there are no n/a values
print(f'number of n/a datetimes in S&P tech: {len(df_sptech[df_sptech["Datetime"].isna()])}')
print(f'number of n/a datetimes in S&P 100: {len(df_sp100[df_sp100["Datetime"].isna()])}')

number of n/a datetimes in S&P tech: 0
number of n/a datetimes in S&P 100: 0


In [None]:
# ensure S&P 100 files loaded correctly
df_sp100.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Date,Article_title,Stock_symbol,Url,Publisher,Author,Article,Lsa_summary,Luhn_summary,Textrank_summary,Lexrank_summary,Datetime
0,12025,12025.0,2023-12-16 22:00:00,My 6 Largest Portfolio Holdings Heading Into 2...,AAPL,https://www.nasdaq.com/articles/my-6-largest-p...,,,"After an absolute disaster of a year in 2022, ...",3: Apple There's little question that Apple (N...,3: Apple There's little question that Apple (N...,3: Apple There's little question that Apple (N...,3: Apple There's little question that Apple (N...,2023-12-16 22:00:00
1,12026,12026.0,2023-12-16 22:00:00,Brokers Suggest Investing in Apple (AAPL): Rea...,AAPL,https://www.nasdaq.com/articles/brokers-sugges...,,,"When deciding whether to buy, sell, or hold a ...",Let's take a look at what these Wall Street he...,Click to get this free report Apple Inc. (AAPL...,Let's take a look at what these Wall Street he...,Brokerage Recommendation Trends for AAPL Let's...,2023-12-16 22:00:00
2,12027,12027.0,2023-12-16 21:00:00,"Company News for Dec 19, 2023",AAPL,https://www.nasdaq.com/articles/company-news-f...,,,Shares of Apple Inc. AAPL lost 0.9% on China’s...,Shares of Apple Inc. AAPL lost 0.9% on China’s...,Click to get this free report Apple Inc. (AAPL...,Click to get this free report Apple Inc. (AAPL...,Click to get this free report Apple Inc. (AAPL...,2023-12-16 21:00:00
3,12028,12028.0,2023-12-16 21:00:00,NVIDIA (NVDA) Up 243% YTD: Will It Carry Momen...,AAPL,https://www.nasdaq.com/articles/nvidia-nvda-up...,,,NVIDIA Corporation NVDA has witnessed a remark...,Other Stocks in the $1T Club Apart from NVIDIA...,Other Stocks in the $1T Club Apart from NVIDIA...,Other Stocks in the $1T Club Apart from NVIDIA...,Other Stocks in the $1T Club Apart from NVIDIA...,2023-12-16 21:00:00
4,12029,12029.0,2023-12-16 21:00:00,"Pre-Market Most Active for Dec 19, 2023 : BMY,...",AAPL,https://www.nasdaq.com/articles/pre-market-mos...,,,The NASDAQ 100 Pre-Market Indicator is up 10.1...,"Apple Inc. (AAPL) is +0.86 at $196.75, with 1,...","Apple Inc. (AAPL) is +0.86 at $196.75, with 1,...","Apple Inc. (AAPL) is +0.86 at $196.75, with 1,...","Apple Inc. (AAPL) is +0.86 at $196.75, with 1,...",2023-12-16 21:00:00


In [None]:
# ensure S&P tech stocks loaded correctly
df_sptech.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Date,Article_title,Stock_symbol,Url,Publisher,Author,Article,Lsa_summary,Luhn_summary,Textrank_summary,Lexrank_summary,Datetime
0,12025,12025.0,2023-12-16 22:00:00,My 6 Largest Portfolio Holdings Heading Into 2...,AAPL,https://www.nasdaq.com/articles/my-6-largest-p...,,,"After an absolute disaster of a year in 2022, ...",3: Apple There's little question that Apple (N...,3: Apple There's little question that Apple (N...,3: Apple There's little question that Apple (N...,3: Apple There's little question that Apple (N...,2023-12-16 22:00:00
1,12026,12026.0,2023-12-16 22:00:00,Brokers Suggest Investing in Apple (AAPL): Rea...,AAPL,https://www.nasdaq.com/articles/brokers-sugges...,,,"When deciding whether to buy, sell, or hold a ...",Let's take a look at what these Wall Street he...,Click to get this free report Apple Inc. (AAPL...,Let's take a look at what these Wall Street he...,Brokerage Recommendation Trends for AAPL Let's...,2023-12-16 22:00:00
2,12027,12027.0,2023-12-16 21:00:00,"Company News for Dec 19, 2023",AAPL,https://www.nasdaq.com/articles/company-news-f...,,,Shares of Apple Inc. AAPL lost 0.9% on China’s...,Shares of Apple Inc. AAPL lost 0.9% on China’s...,Click to get this free report Apple Inc. (AAPL...,Click to get this free report Apple Inc. (AAPL...,Click to get this free report Apple Inc. (AAPL...,2023-12-16 21:00:00
3,12028,12028.0,2023-12-16 21:00:00,NVIDIA (NVDA) Up 243% YTD: Will It Carry Momen...,AAPL,https://www.nasdaq.com/articles/nvidia-nvda-up...,,,NVIDIA Corporation NVDA has witnessed a remark...,Other Stocks in the $1T Club Apart from NVIDIA...,Other Stocks in the $1T Club Apart from NVIDIA...,Other Stocks in the $1T Club Apart from NVIDIA...,Other Stocks in the $1T Club Apart from NVIDIA...,2023-12-16 21:00:00
4,12029,12029.0,2023-12-16 21:00:00,"Pre-Market Most Active for Dec 19, 2023 : BMY,...",AAPL,https://www.nasdaq.com/articles/pre-market-mos...,,,The NASDAQ 100 Pre-Market Indicator is up 10.1...,"Apple Inc. (AAPL) is +0.86 at $196.75, with 1,...","Apple Inc. (AAPL) is +0.86 at $196.75, with 1,...","Apple Inc. (AAPL) is +0.86 at $196.75, with 1,...","Apple Inc. (AAPL) is +0.86 at $196.75, with 1,...",2023-12-16 21:00:00
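
The `Unnamed: 0`, `Unnamed: 0.1`, and `Unnamed: 0.2` columns in the previews above look like index columns left over from earlier CSV saves (an assumption; the notebook does not say how the files were written). A small sketch of dropping them, using a hypothetical helper `drop_unnamed` on a toy frame:

```python
import pandas as pd

def drop_unnamed(df: pd.DataFrame) -> pd.DataFrame:
    """Return a frame without columns whose name starts with 'Unnamed:'."""
    keep = [col for col in df.columns if not str(col).startswith("Unnamed:")]
    return df[keep]

# toy frame that mimics the leftover index columns
toy = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [12025, 12026],
    "Stock_symbol": ["AAPL", "AAPL"],
})
cleaned = drop_unnamed(toy)
print(list(cleaned.columns))
```

Writing with `df.to_csv(path, index=False)` avoids creating such columns on later saves.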


## **This next section checks the coverage of the news data against the tickers and the date range**

In [None]:
# load the ticker symbols

# Pulled Dec 2nd
sp100_tickers = [
    'AAPL', 'ABBV', 'ABT', 'ACN', 'ADBE', 'AIG', 'AMD', 'AMGN', 'AMT', 'AMZN',
    'AVGO', 'AXP', 'BA', 'BAC', 'BK', 'BKNG', 'BLK', 'BMY', 'BRK.B', 'C',
    'CAT', 'CHTR', 'CL', 'CMCSA', 'COF', 'COP', 'COST', 'CRM', 'CSCO', 'CVS',
    'CVX', 'DE', 'DHR', 'DIS', 'DOW', 'DUK', 'EMR', 'F', 'FDX', 'GD',
    'GE', 'GILD', 'GM', 'GOOG', 'GOOGL', 'GS', 'HD', 'HON', 'IBM', 'INTC',
    'INTU', 'JNJ', 'JPM', 'KHC', 'KO', 'LIN', 'LLY', 'LMT', 'LOW', 'MA',
    'MCD', 'MDLZ', 'MDT', 'MET', 'META', 'MMM', 'MO', 'MRK', 'MS', 'MSFT',
    'NEE', 'NFLX', 'NKE', 'NVDA', 'ORCL', 'PEP', 'PFE', 'PG', 'PM', 'PYPL',
    'QCOM', 'RTX', 'SBUX', 'SCHW', 'SO', 'SPG', 'T', 'TGT', 'TMO', 'TMUS',
    'TSLA', 'TXN', 'UNH', 'UNP', 'UPS', 'USB', 'V', 'VZ', 'WFC', 'WMT', 'XOM'
]

# Pulled on Dec 2nd
sp500_tech_tickers = [
    'AAPL', 'ACN', 'ADBE', 'ADI', 'ADSK', 'AKAM', 'AMAT', 'AMD', 'ANET', 'ANSS',
    'APH', 'AVGO', 'CDNS', 'CDW', 'CRM', 'CRWD', 'CSCO', 'CTSH', 'DELL', 'ENPH',
    'EPAM', 'FFIV', 'FICO', 'FSLR', 'FTNT', 'GDDY', 'GEN', 'GLW', 'HPE', 'HPQ',
    'IBM', 'INTC', 'INTU', 'IT', 'JBL', 'JNPR', 'KEYS', 'KLAC', 'LRCX', 'MCHP',
    'MPWR', 'MSFT', 'MSI', 'MU', 'NOW', 'NTAP', 'NVDA', 'NXPI', 'ON', 'ORCL',
    'PANW', 'PLTR', 'PTC', 'QCOM', 'QRVO', 'ROP', 'SMCI', 'SNPS', 'STX', 'SWKS',
    'TDY', 'TEL', 'TER', 'TRMB', 'TXN', 'TYL', 'VRSN', 'WDC', 'ZBRA'
]

print(f'S&P 100 Ticker length: {len(sp100_tickers)}')
print(f'Tech Ticker length: {len(sp500_tech_tickers)}')
print(f'S&P 100 dataframe unique ticker length: {len(df_sp100["Stock_symbol"].unique())}')
print(f'Tech dataframe unique ticker length: {len(df_sptech["Stock_symbol"].unique())}')

S&P 100 Ticker length: 101
Tech Ticker length: 69
S&P 100 dataframe unique ticker length: 98
Tech dataframe unique ticker length: 68
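
When reading these counts, note that `Series.unique()` includes a missing value as one of its entries, while `Series.nunique()` skips missing values by default, so the two counts can differ by one when null tickers are present. A toy illustration (the data here is made up):

```python
import pandas as pd

symbols = pd.Series(["AAPL", "MSFT", None, "AAPL"])

with_null = len(symbols.unique())   # counts the missing value as an entry
without_null = symbols.nunique()    # dropna=True by default

print(with_null, without_null)
```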


In [None]:
# rows with null values
print(f'S&P 100 Ticker null row count: {sum(df_sp100["Stock_symbol"].isna())}\n')
print(f'Tech Ticker null row count: {sum(df_sptech["Stock_symbol"].isna())}\n')

# print the expected tickers that have no news rows
sp100_symbols = set(df_sp100['Stock_symbol'].dropna())
sptech_symbols = set(df_sptech['Stock_symbol'].dropna())

for value in sp100_tickers:
    if value not in sp100_symbols:
        print(f'Missing S&P Ticker: {value}')
for value in sp500_tech_tickers:
    if value not in sptech_symbols:
        print(f'Missing Tech Ticker: {value}')

S&P 100 Ticker null row count: 332788

Tech Ticker null row count: 142198

Missing S&P Ticker: HON
Missing S&P Ticker: META
Missing S&P Ticker: RTX
Missing Tech Ticker: PLTR
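
The same coverage check can be phrased as set differences, which also surfaces symbols that appear in the data but not in the expected list. A sketch under the assumption that symbols match exactly (same case, same share-class notation such as `BRK.B`); `ticker_coverage` is a hypothetical helper, shown on toy data:

```python
import pandas as pd

def ticker_coverage(expected, symbols: pd.Series) -> dict:
    """Compare an expected ticker list with the symbols found in a column."""
    observed = set(symbols.dropna())
    expected = set(expected)
    return {
        "missing": sorted(expected - observed),     # expected, but no rows
        "unexpected": sorted(observed - expected),  # rows, but not expected
        "null_rows": int(symbols.isna().sum()),     # rows with a blank ticker
    }

# toy data; "FB" stands in for any symbol outside the expected list
toy_symbols = pd.Series(["AAPL", "AAPL", None, "MSFT", "FB"])
report = ticker_coverage(["AAPL", "MSFT", "META"], toy_symbols)
print(report)
```

Looking at the "unexpected" bucket may help explain a missing ticker, for instance if the data uses a different symbol for the same company, though that would need to be confirmed against the dataset itself.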

In [None]:
# get the date range for each stock symbol S&P 100
date_ranges_sp100 = df_sp100.groupby('Stock_symbol')['Datetime'].agg(['min', 'max'])

# check that the dates cover Jan 2018 through Dec 2023
print(f'S&P 100 Stocks with no news in Jan 2018 {date_ranges_sp100[date_ranges_sp100["min"] >= "2018-02-01"]}')
print(f'S&P 100 Stocks with no news in Dec 2023 {date_ranges_sp100[date_ranges_sp100["max"] < "2023-12-01"]}')

S&P 100 Stocks with no news in Jan 2018                              min                 max
Stock_symbol                                        
AAPL         2020-03-09 00:00:00 2023-12-16 22:00:00
AMZN         2020-04-27 00:00:00 2023-12-16 23:00:00
BA           2018-05-24 00:00:00 2023-12-16 23:00:00
BKNG         2018-02-27 00:00:00 2023-12-16 21:00:00
DIS          2019-06-13 00:00:00 2023-12-16 19:00:00
DOW          2018-06-29 00:00:00 2023-12-16 00:00:00
F            2019-04-30 00:00:00 2023-12-16 13:00:00
GM           2020-06-05 13:19:01 2023-12-16 21:00:00
GOOG         2018-11-13 00:00:00 2023-12-16 23:00:00
GOOGL        2018-07-25 00:00:00 2020-06-10 11:25:13
JPM          2018-09-15 00:00:00 2020-06-11 06:05:13
LIN          2023-12-14 00:00:00 2023-12-16 09:00:00
MSFT         2022-04-26 00:00:00 2023-12-16 23:02:00
SCHW         2020-05-07 00:00:00 2020-06-11 05:58:15
TSLA         2019-07-01 00:00:00 2023-12-16 22:00:00
UNH          2019-07-08 00:00:00 2020-06-11 06:22:31
WMT   
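
A sketch of this kind of date-range check wrapped in a function, so the window bounds are explicit. It assumes a `Datetime` column already converted with `pd.to_datetime`; `coverage_gaps`, its default bounds, and the toy frame are illustrative and not from the notebook:

```python
import pandas as pd

def coverage_gaps(df, start="2018-02-01", end="2023-12-01"):
    """Return (late_start, early_end) frames of per-ticker min/max dates.

    late_start: first article on or after `start` (no news in Jan 2018).
    early_end:  last article before `end` (no news in Dec 2023).
    """
    ranges = df.groupby("Stock_symbol")["Datetime"].agg(["min", "max"])
    late_start = ranges[ranges["min"] >= pd.Timestamp(start)]
    early_end = ranges[ranges["max"] < pd.Timestamp(end)]
    return late_start, early_end

# toy data: AAA spans the full window, BBB starts late and ends early
toy = pd.DataFrame({
    "Stock_symbol": ["AAA", "AAA", "BBB", "BBB"],
    "Datetime": pd.to_datetime(
        ["2018-01-05", "2023-12-10", "2019-06-01", "2020-06-11"]
    ),
})
late_start, early_end = coverage_gaps(toy)
print(late_start.index.tolist())
print(early_end.index.tolist())
```

Returning the two frames instead of printing them inside an f-string keeps the tables available for further filtering or for saving alongside the analysis.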

In [None]:
# get the date range for each stock symbol S&P Tech
date_ranges_sptech = df_sptech.groupby('Stock_symbol')['Datetime'].agg(['min', 'max'])

# check that the dates cover Jan 2018 through Dec 2023
print(f'S&P Tech Stocks with no news in Jan 2018 {date_ranges_sptech[date_ranges_sptech["min"] >= "2018-02-01"]}')
print(f'S&P Tech Stocks with no news in Dec 2023 {date_ranges_sptech[date_ranges_sptech["max"] < "2023-12-01"]}')

S&P Tech Stocks with no news in Jan 2018                     min                 max
Stock_symbol                               
AAPL         2020-03-09 2023-12-16 22:00:00
CRWD         2019-06-06 2023-12-16 14:00:00
DELL         2018-12-28 2023-12-16 05:00:00
GEN          2018-02-22 2023-12-13 00:00:00
MSFT         2022-04-26 2023-12-16 23:02:00
S&P Tech Stocks with no news in Dec 2023                     min                 max
Stock_symbol                               
CDW          2018-01-16 2020-06-09 04:22:22
CSCO         2018-01-04 2020-06-10 05:36:24
INTU         2018-01-04 2020-06-04 08:27:36
IT           2018-01-05 2020-05-29 00:00:00
JBL          2018-01-02 2020-06-11 06:25:52
JNPR         2018-01-03 2020-05-28 00:00:00
KEYS         2018-01-11 2020-06-11 06:25:52
MCHP         2018-01-08 2020-06-05 05:54:02
NXPI         2018-01-08 2020-06-08 04:53:42
PTC          2018-01-02 2020-06-10 06:31:13
SNPS         2018-01-03 2020-06-10 12:10:22
STX          2018-01-05 2020-06-08 09:3