# Clean and Explore Stock Information

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
%matplotlib inline

In [2]:
# the files are large: low_memory=False makes pandas infer each column's dtype from the whole file
# instead of chunk by chunk, which avoids mixed-type warnings
nasdaq = pd.read_csv("./data/nasdaq_csv.csv",index_col=0, low_memory=False)
nyse = pd.read_csv("./data/nyse_csv.csv",index_col=0, low_memory=False)

### Check the size of both of our datasets

In [3]:
nasdaq.shape, nyse.shape

((8752326, 8), (6994408, 8))

Both datasets have the same eight columns. The first noticeable difference is the row count: NASDAQ has about 1.76 million more rows than NYSE. We will confirm below, but one possibility is that the NASDAQ dataset goes further back in time, even though the NYSE is the older exchange.

In [4]:
nasdaq.head(2)

| | Date | Low | Open | Volume | High | Close | Adjusted Close | ticker |
|---|---|---|---|---|---|---|---|---|
| 0 | 16-02-1990 | 0.073785 | 0.0 | 940636800.0 | 0.0798610001802444 | 0.077257 | 0.054863 | CSCO |
| 1 | 20-02-1990 | 0.074653 | 0.0 | 151862400.0 | 0.0798610001802444 | 0.079861 | 0.056712 | CSCO |


In [5]:
nyse.head(2)

| | Date | Low | Open | Volume | High | Close | Adjusted Close | ticker |
|---|---|---|---|---|---|---|---|---|
| 0 | 19-06-1992 | 15.0 | 15.0 | 86000.0 | 15.0 | 15.0 | 3.640341 | NXN |
| 1 | 22-06-1992 | 15.0 | 15.0 | 17000.0 | 15.0 | 15.0 | 3.640341 | NXN |


Let's check the column types. We expect Date to be stored as 'object' in both dataframes, since pandas does not parse date strings on load unless asked to (for example via `parse_dates`).

In [6]:
# nasdaq.info()

In [7]:
# nyse.info()

Both datasets store Date as object, and both also contain a small number of nulls. We'll take care of the nulls first, then convert Date so we can explore the data further.

In [8]:
nasdaq.isnull().sum()

Date                   0
Low               130276
Open              130276
Volume            130276
High              130277
Close             130277
Adjusted Close    130277
ticker                 0
dtype: int64

In [9]:
print(f"Percentage of NASDAQ Null rows: {round(((130_277/8_752_326)*100),2)}%") 

Percentage of NASDAQ Null rows: 1.49%


In [10]:
nyse.isnull().sum()

Date                  0
Low               94982
Open              94982
Volume            94982
High              94982
Close             94982
Adjusted Close    94982
ticker                0
dtype: int64

In [11]:
print(f"Percentage of NYSE Null rows: {round(((94_982/6_994_408)*100),2)}%") 

Percentage of NYSE Null rows: 1.36%
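
The percentages above are computed from hard-coded counts. As a minimal sketch, the same figure could be derived directly from a dataframe so it stays correct if the data changes. The helper name `null_row_percentage` and the toy frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def null_row_percentage(df: pd.DataFrame) -> float:
    """Return the percentage of rows that contain at least one null value."""
    # any(axis=1) flags a row if any of its columns is null;
    # the mean of that boolean mask is the fraction of flagged rows
    return float(round(df.isnull().any(axis=1).mean() * 100, 2))


# Toy frame standing in for nasdaq / nyse (values are made up for illustration)
toy = pd.DataFrame({
    "Low": [1.0, np.nan, 3.0, 4.0],
    "High": [1.5, np.nan, np.nan, 4.5],
    "ticker": ["AAA", "AAA", "BBB", "BBB"],
})

print(f"Percentage of toy null rows: {null_row_percentage(toy)}%")
```

Under this sketch, `null_row_percentage(nasdaq)` and `null_row_percentage(nyse)` would replace the literal counts used in the two cells above.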


Rows with null values make up less than 1.5% of each dataset, well under a 5% threshold. We have a few options:
1. Drop the null rows, as the total percentage is within an acceptable range
2. Fill the null values using simple methods such as fillna with the mean or mode, or a backward or forward fill
3. Fill the null values using regression

For our first iteration we are going to simply drop the nulls to save time. After we build and test our data pipeline for both Linear Regression for stock price prediction and ARIMA for Time Series modeling, we can return to this if we think it can improve our models.
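
If we do return to option 2, a forward fill would probably need to run per ticker so that one company's prices never leak into another's. The following is a rough sketch on a made-up frame, assuming rows are already sorted by date within each ticker; it is not what the notebook does in this iteration.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only), assumed sorted by date within each ticker
toy = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "Close": [10.0, np.nan, 12.0, np.nan, 20.0],
})

filled = toy.copy()
# Forward fill inside each ticker group; a null at the start of a group
# has no earlier value to copy, so it stays null
filled["Close"] = filled.groupby("ticker")["Close"].ffill()

print(filled)
```

Note that the leading null for `BBB` survives the fill, so a drop or backward fill would still be needed for rows like that.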


In [12]:
nasdaq.dropna(axis=0, inplace=True)

In [13]:
nyse.dropna(axis=0, inplace=True)

In [14]:
nasdaq.shape, nyse.shape

((8622049, 8), (6899426, 8))

Next we convert Date to datetime. After that we are ready to look through descriptive statistics, check the distributions of the data, and look for any seasonality.

In [15]:
# remove one malformed Date row from nasdaq (likely human error or a merge error in the source data)
# .copy() avoids a SettingWithCopyWarning when Date is reassigned in the next cell
nasdaq = nasdaq[nasdaq["Date"]!="18-1218-12-1991"].copy()

In [16]:
# pass an explicit format so pandas does not have to infer it, which is faster on a large dataset
nasdaq["Date"] = pd.to_datetime(nasdaq['Date'],format="%d-%m-%Y")
nyse["Date"] = pd.to_datetime(nyse['Date'],format="%d-%m-%Y")
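
The malformed value removed in the previous cell was matched by its exact string. One possible way to locate such rows without knowing them in advance is to parse with `errors="coerce"`, which turns unparseable strings into `NaT`. This is a sketch on a small made-up frame, not the approach the notebook itself uses.

```python
import pandas as pd

# Toy frame: two valid dates plus the malformed value seen in the NASDAQ data
toy = pd.DataFrame({"Date": ["16-02-1990", "18-1218-12-1991", "20-02-1990"]})

# errors="coerce" returns NaT instead of raising on strings that do not match the format
parsed = pd.to_datetime(toy["Date"], format="%d-%m-%Y", errors="coerce")

bad_rows = toy[parsed.isna()]        # rows whose Date could not be parsed
clean = toy[parsed.notna()].copy()   # rows that parsed successfully
clean["Date"] = parsed[parsed.notna()]

print(bad_rows)
print(clean.dtypes)
```

Inspecting `bad_rows` before dropping anything keeps the cleanup auditable.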

In [17]:
nasdaq.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8622048 entries, 0 to 4300
Data columns (total 8 columns):
 #   Column          Dtype         
---  ------          -----         
 0   Date            datetime64[ns]
 1   Low             float64       
 2   Open            object        
 3   Volume          float64       
 4   High            object        
 5   Close           float64       
 6   Adjusted Close  float64       
 7   ticker          object        
dtypes: datetime64[ns](1), float64(4), object(3)
memory usage: 592.0+ MB
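
The output above lists Open and High as object rather than float64, which suggests those columns hold at least some non-numeric entries. A possible next step, sketched here on a made-up frame and not taken from the notebook, is to coerce them with `pd.to_numeric` and then inspect whatever becomes null.

```python
import pandas as pd

# Toy frame mimicking object-typed price columns (illustrative values only)
toy = pd.DataFrame({
    "Open": ["0.0", "15.0", "not_a_number"],
    "High": ["0.079861", 15.0, "15.5"],
})

for col in ["Open", "High"]:
    # errors="coerce" converts anything that is not a valid number to NaN
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
print(toy.isnull().sum())
```

Any new nulls produced by the coercion would then need the same drop-or-fill decision made earlier for the other nulls.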
