# Initial Kaggle Dataset: CSV/JSON Ingestion and Formatting

The <a href="https://www.kaggle.com/datasets/paultimothymooney/stock-market-data">Kaggle stock market dataset</a> consists of four groupings of stock information:
- NASDAQ
    - Market participants transact trades through dealers rather than directly with each other.
    - Utilizes market makers (MMs) who maintain inventories of stock to buy and sell from their own accounts in transactions with individual customers and other dealers.
    - MMs give two-sided quotes, a bid (buy) and an ask (sell) price, for each security in which they are making a market. More than 260 market-making firms provide liquidity for NASDAQ-listed stocks.
    - Known for technology- and innovation-related stocks, which tend to be growth-oriented and more volatile.
- NYSE
    - Sells stocks via the auction method: Market participants transact trades directly with each other.
    - Trading runs from the open at 9:30 am EST to the close at 4:00 pm EST.
    - Orders can be placed as early as 6:30 am EST; because the market is not yet open, these orders are paired using the highest bid price and the lowest ask price.
    - The NYSE has 'designated market makers' (DMMs), who serve as a human point of contact for the listed company on the NYSE trading floor.
    - DMMs provided ~17% of liquidity in NYSE trading in 2019.
    - The NYSE is home to many blue chips and industrials, which tend to be (though are not necessarily) more stable and well established.
- SP500
    - More information will be added later
- Forbes2000
    - More information will be added later
    
Two of these entities are stock exchanges, while the other two are indexes. For the initial pull of the data, we will work only with the NASDAQ and NYSE datasets. Once we have a more grounded understanding of the data, we will run additional analyses on the SP500 and Forbes2000 stocks.

In [1]:
import os
import pandas as pd
import warnings

In [2]:
nasdaq = os.listdir("./stock_market_data/nasdaq/csv")
nasdaq_csv = pd.DataFrame()
nyse = os.listdir("./stock_market_data/nyse/csv")
nyse_csv = pd.DataFrame()

Now that we have the file names in each directory of CSVs saved, the following cells will:
1. Loop through a given directory
2. Create a path for each csv
3. Read that csv as a pandas dataframe
4. Append that data into one large data frame

Initially, we will keep separate files for NASDAQ and NYSE. They can be combined for later analyses if needed.
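
The four steps above can also be sketched as a single reusable function. This is only an illustration under a few assumptions: `load_csv_dir` is a hypothetical helper name that does not appear in the notebook, files are selected by their `.csv` suffix (so entries such as `.DS_Store` are skipped without a `try`/`except`), and the frames are collected in a list and concatenated once with a fresh index rather than appended one by one.

```python
from pathlib import Path

import pandas as pd


def load_csv_dir(directory):
    """Read every *.csv file in `directory` into one DataFrame.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    frames = []
    # Steps 1-2: loop through the directory and build a path for each CSV.
    for path in sorted(Path(directory).glob("*.csv")):
        # Step 3: read the CSV as a pandas DataFrame.
        frame = pd.read_csv(path)
        # Tag each row with the ticker taken from the file name (CSCO.csv -> CSCO).
        frame["ticker"] = path.name.split(".")[0]
        frames.append(frame)
    if not frames:
        # Nothing to combine: return an empty frame instead of failing in concat.
        return pd.DataFrame()
    # Step 4: combine everything with a single concat call and a fresh index.
    return pd.concat(frames, ignore_index=True)
```

Under these assumptions, `load_csv_dir("./stock_market_data/nasdaq/csv")` would play the role of the NASDAQ loop in the cells that follow, with the difference that the row index is renumbered across all tickers.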

In [3]:
# in case pd.concat emits FutureWarnings: dummy function that raises one
def fxn():
    warnings.warn("Future", FutureWarning)

In [4]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fxn()
    for csv in nasdaq:
        try:
            temp = pd.read_csv(f"./stock_market_data/nasdaq/csv/{csv}")
            temp["ticker"] = csv.split(".")[0]
            nasdaq_csv = pd.concat([nasdaq_csv, temp])
        except Exception:
            # in case of any .ipynb_checkpoints/.DS_Store entries that are not readable CSVs
            continue

    for csv in nyse:
        try:
            temp = pd.read_csv(f"./stock_market_data/nyse/csv/{csv}")
            temp["ticker"] = csv.split(".")[0]
            nyse_csv = pd.concat([nyse_csv, temp])
        except Exception:
            # in case of any .ipynb_checkpoints/.DS_Store entries that are not readable CSVs
            continue
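
The cell above silences every warning category while the files are read. If only `FutureWarning` needs to be hidden, the filter can be narrowed. The sketch below is a minimal illustration of that idea: `noisy_step` is a hypothetical stand-in for the loading loop, and `record=True` is used only so the surviving warnings can be inspected.

```python
import warnings


def noisy_step():
    # Hypothetical stand-in for code that emits more than one kind of warning.
    warnings.warn("behaviour will change in a future release", FutureWarning)
    warnings.warn("something else worth seeing", UserWarning)


with warnings.catch_warnings(record=True) as caught:
    # Record every warning by default...
    warnings.simplefilter("always")
    # ...but drop FutureWarning specifically (this filter is checked first).
    warnings.simplefilter("ignore", category=FutureWarning)
    noisy_step()

# Only the categories that were not filtered out remain.
categories = [w.category for w in caught]
print(categories)
```

With this filter the `UserWarning` is still reported, so unrelated problems in the input files are not hidden along with the `FutureWarning`.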

In [5]:
nasdaq_csv.shape, nyse_csv.shape

((8752326, 8), (6994408, 8))

In [6]:
nasdaq_csv.head()

|   | Date | Low | Open | Volume | High | Close | Adjusted Close | ticker |
|---|---|---|---|---|---|---|---|---|
| 0 | 16-02-1990 | 0.073785 | 0.0 | 940636800.0 | 0.079861 | 0.077257 | 0.054863 | CSCO |
| 1 | 20-02-1990 | 0.074653 | 0.0 | 151862400.0 | 0.079861 | 0.079861 | 0.056712 | CSCO |
| 2 | 21-02-1990 | 0.075521 | 0.0 | 70531200.0 | 0.078993 | 0.078125 | 0.055479 | CSCO |
| 3 | 22-02-1990 | 0.078993 | 0.0 | 45216000.0 | 0.081597 | 0.078993 | 0.056095 | CSCO |
| 4 | 23-02-1990 | 0.078125 | 0.0 | 44697600.0 | 0.079861 | 0.078559 | 0.055787 | CSCO |


In [7]:
nyse_csv.head()

|   | Date | Low | Open | Volume | High | Close | Adjusted Close | ticker |
|---|---|---|---|---|---|---|---|---|
| 0 | 19-06-1992 | 15.0 | 15.0 | 86000.0 | 15.0 | 15.0 | 3.640341 | NXN |
| 1 | 22-06-1992 | 15.0 | 15.0 | 17000.0 | 15.0 | 15.0 | 3.640341 | NXN |
| 2 | 23-06-1992 | 15.0 | 15.0 | 3400.0 | 15.0 | 15.0 | 3.640341 | NXN |
| 3 | 24-06-1992 | 15.0 | 15.0 | 4500.0 | 15.0 | 15.0 | 3.640341 | NXN |
| 4 | 25-06-1992 | 15.0 | 15.0 | 800.0 | 15.0 | 15.0 | 3.640341 | NXN |
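
Both previews show `Date` stored as day-month-year strings. As a sketch of one formatting step that could follow, the snippet below parses such strings with an explicit format. The small `sample` frame is an illustrative stand-in built from the first NASDAQ rows shown above, not the full dataset.

```python
import pandas as pd

# Stand-in frame using values copied from the NASDAQ preview.
sample = pd.DataFrame({
    "Date": ["16-02-1990", "20-02-1990", "21-02-1990"],
    "Close": [0.077257, 0.079861, 0.078125],
    "ticker": ["CSCO", "CSCO", "CSCO"],
})

# An explicit format avoids day/month ambiguity for dates such as 05-03-1990.
sample["Date"] = pd.to_datetime(sample["Date"], format="%d-%m-%Y")

print(sample.dtypes)
```

Passing the format explicitly also means a malformed date raises an error instead of being silently interpreted as month-first.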


In [8]:
nasdaq_csv.to_csv("./data/nasdaq_csv.csv")
nyse_csv.to_csv("./data/nyse_csv.csv")
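
By default, `to_csv` also writes the row index, which pandas names `Unnamed: 0` when the file is read back. After concatenation that index restarts for every ticker, so it carries little information. The sketch below shows a round trip with `index=False` on a small stand-in frame; the temporary directory and the file name `nyse_sample.csv` are hypothetical and used only for illustration.

```python
import os
import tempfile

import pandas as pd

# Stand-in frame using values copied from the NYSE preview.
frame = pd.DataFrame({
    "Date": ["19-06-1992", "22-06-1992"],
    "Close": [15.0, 15.0],
    "ticker": ["NXN", "NXN"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "nyse_sample.csv")  # hypothetical file name
    frame.to_csv(path, index=False)              # skip the row index column
    restored = pd.read_csv(path)

print(list(restored.columns))
```

Whether to drop the index in the real export is a judgment call; keeping it is harmless as long as later notebooks expect the extra column.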

With the ingestion of the NASDAQ and NYSE CSV data complete, we can now move on to cleaning and exploratory data analysis.