# Company1 - ETL Process

This notebook is organized in the following sections:
* [Step 0 - Preliminary: Viewing the data](#0)
* [Step 1 - Checking for duplicates](#1)
* [Step 2 - Checking for missing values](#2)
* [Step 3 - Imputing/dropping missing values](#3)
* [Step 4 - Ensuring correct datatypes](#4)
* [Step 5 - Export csv for ML preparation](#5)


<a id='0'></a>
## Step 0 - Preliminary: Viewing the data

In [2]:
import pandas as pd

In [3]:
# Load the share prices csv

prices = pd.read_csv("data/us-shareprices-daily.csv",delimiter=';')

In [4]:
prices.tail()

Unnamed: 0,Ticker,SimFinId,Date,Open,High,Low,Close,Adj. Close,Volume,Dividend,Shares Outstanding
5767354,ZYXI,171401,2024-02-26,13.04,13.04,12.67,12.82,12.82,335055,,36435000.0
5767355,ZYXI,171401,2024-02-27,12.83,13.77,12.83,13.74,13.74,395525,,36435000.0
5767356,ZYXI,171401,2024-02-28,13.63,13.7,13.38,13.49,13.49,290887,,32170182.0
5767357,ZYXI,171401,2024-02-29,13.51,13.57,13.28,13.56,13.56,232534,,32170182.0
5767358,ZYXI,171401,2024-03-01,12.05,13.43,12.0,12.3,12.3,1216112,,32170182.0


In [5]:
# Load the companies csv 
companies = pd.read_csv("data/us-companies.csv",delimiter=';')

Selecting a company:

In [6]:
# Look up the company row by its exact name (here: Microsoft)
company_msft = companies[companies["Company Name"] == "MICROSOFT CORP"].copy()
company_msft

Unnamed: 0,Ticker,SimFinId,Company Name,IndustryId,ISIN,End of financial year (month),Number Employees,Business Summary,Market,CIK,Main Currency
3650,MSFT,59265,MICROSOFT CORP,101003.0,US5949181045,6.0,166475.0,Microsoft Corp is a technology company. It dev...,us,789019.0,USD


In [8]:
# Keep only the five companies of interest, selected by SimFinId
prices = prices[prices["SimFinId"].isin([1253240, 111052, 63877, 56317, 59265])]
prices = prices.copy()
prices.Ticker.unique()

array(['AAPL', 'ABT', 'BRKR', 'MSFT', 'TSLA'], dtype=object)
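The filter above can be sketched on a toy table. The IDs, tickers and the `missing_ids` check below are invented for illustration and are not part of the original notebook; the check simply confirms that every requested ID was actually found.

```python
import pandas as pd

# Toy stand-in for the prices table; IDs and tickers are invented.
toy_prices = pd.DataFrame(
    {
        "Ticker": ["AAA", "AAA", "BBB", "CCC"],
        "SimFinId": [1, 1, 2, 3],
        "Close": [10.0, 10.5, 20.0, 30.0],
    }
)

selected_ids = [1, 3]  # hypothetical IDs to keep

# isin builds a boolean mask; .copy() gives an independent frame to modify later
subset = toy_prices[toy_prices["SimFinId"].isin(selected_ids)].copy()

# Requested IDs that do not occur in the data (empty set means all were found)
missing_ids = set(selected_ids) - set(subset["SimFinId"])

print(sorted(subset["Ticker"].unique()))
print(missing_ids)
```

An empty `missing_ids` set is a cheap guard against a mistyped ID silently dropping a company.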

<a id='1'></a>
## Step 1 - Checking for duplicates

In [9]:
# Check the share prices for duplicated rows

prices.duplicated().any()

False

In [10]:
# Check for companies 

companies.duplicated().any()

False

Conclusion: There are no fully duplicated rows in either dataframe.
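Note that `duplicated()` without arguments only flags rows that match in every column. As a hedged sketch on invented data, the check can also be restricted to the columns that should identify a row, here assumed to be `Ticker` and `Date`:

```python
import pandas as pd

# Toy data: the same ticker and date appear twice with different prices.
toy_prices = pd.DataFrame(
    {
        "Ticker": ["AAA", "AAA", "AAA", "BBB"],
        "Date": ["2024-01-02", "2024-01-03", "2024-01-03", "2024-01-02"],
        "Close": [10.0, 10.5, 10.6, 20.0],
    }
)

# Full-row check: misses the conflict because Close differs
full_row_dupes = toy_prices.duplicated().any()

# Key-based check: keep=False marks every row that shares a (Ticker, Date) pair
key_dupes = toy_prices.duplicated(subset=["Ticker", "Date"], keep=False)

print(full_row_dupes)
print(toy_prices[key_dupes])
```

Whether `(Ticker, Date)` is the right key for the real dataset is an assumption; it is the natural choice for one price row per company per trading day.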

<a id='2'></a>
## Step 2 - Checking for missing values

**Checking the selected companies:**

In [12]:
prices.isna().sum() / len(prices)

Ticker                0.000000
SimFinId              0.000000
Date                  0.000000
Open                  0.000000
High                  0.000000
Low                   0.000000
Close                 0.000000
Adj. Close            0.000000
Volume                0.000000
Dividend              0.987086
Shares Outstanding    0.000000
dtype: float64

Conclusion: The only missing values are in the Dividend column of the prices df (about 98.7% of rows). Dividends are only paid out on a few dates, so we assume a missing value means no dividend was paid and treat it as 0.
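The share of missing values per column can be reproduced on a small invented table. The sketch below also counts the rows that do carry a dividend, which makes the "mostly missing" pattern easy to read; `isna().mean()` is an equivalent shorthand for `isna().sum() / len(df)`.

```python
import numpy as np
import pandas as pd

# Toy data: one dividend payment in four trading days (values invented).
toy = pd.DataFrame(
    {
        "Close": [10.0, 10.5, 10.6, 11.0],
        "Dividend": [np.nan, 0.2, np.nan, np.nan],
    }
)

# Fraction of missing values per column
missing_share = toy.isna().mean()

# Number of rows on which a dividend was actually recorded
dividend_days = toy["Dividend"].notna().sum()

print(missing_share)
print(dividend_days)
```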

<a id='3'></a>
## Step 3 - Imputing/dropping missing values

In [14]:
prices["Dividend"] = prices["Dividend"].fillna(0)

In [15]:
prices.Dividend.unique()

array([0.  , 0.19, 0.2 , 0.22, 0.23, 0.24, 0.32, 0.36, 0.45, 0.47, 0.51,
       0.55, 0.04, 0.05, 0.46, 0.56, 0.62, 0.68, 0.75])

In [25]:
prices.isna().sum() / len(prices)

Ticker                0.0
SimFinId              0.0
Date                  0.0
Open                  0.0
High                  0.0
Low                   0.0
Close                 0.0
Adj. Close            0.0
Volume                0.0
Dividend              0.0
Shares Outstanding    0.0
dtype: float64

Conclusion: No missing values remain.
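Filling only the Dividend column, rather than the whole dataframe, keeps a gap in any other column visible instead of silently turning it into 0. A minimal sketch on invented data, using the dictionary form of `fillna` as one possible way to express this:

```python
import numpy as np
import pandas as pd

# Toy data with a gap in Close that should NOT be filled with 0.
toy = pd.DataFrame(
    {
        "Close": [10.0, np.nan, 10.6],
        "Dividend": [np.nan, 0.2, np.nan],
    }
)

# The dict restricts the fill to the named column
filled = toy.fillna({"Dividend": 0})

print(filled)
```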

<a id='4'></a>
## Step 4 - Ensuring correct datatypes

In [17]:
prices.info()


<class 'pandas.core.frame.DataFrame'>
Index: 6195 entries, 14253 to 5192437
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Ticker              6195 non-null   object 
 1   SimFinId            6195 non-null   int64  
 2   Date                6195 non-null   object 
 3   Open                6195 non-null   float64
 4   High                6195 non-null   float64
 5   Low                 6195 non-null   float64
 6   Close               6195 non-null   float64
 7   Adj. Close          6195 non-null   float64
 8   Volume              6195 non-null   int64  
 9   Dividend            6195 non-null   float64
 10  Shares Outstanding  6195 non-null   float64
dtypes: float64(7), int64(2), object(2)
memory usage: 580.8+ KB


In [18]:
prices

Unnamed: 0,Ticker,SimFinId,Date,Open,High,Low,Close,Adj. Close,Volume,Dividend,Shares Outstanding
14253,AAPL,111052,2019-04-01,47.91,47.92,47.09,47.81,45.81,111447856,0.0,1.842914e+10
14254,AAPL,111052,2019-04-02,47.77,48.62,47.76,48.51,46.47,91062928,0.0,1.842914e+10
14255,AAPL,111052,2019-04-03,48.31,49.12,48.29,48.84,46.79,93087320,0.0,1.842914e+10
14256,AAPL,111052,2019-04-04,48.70,49.09,48.28,48.92,46.87,76457100,0.0,1.842914e+10
14257,AAPL,111052,2019-04-05,49.11,49.27,48.98,49.25,47.19,74106576,0.0,1.842914e+10
...,...,...,...,...,...,...,...,...,...,...,...
5192433,TSLA,56317,2024-02-26,192.29,201.78,192.00,199.40,199.40,111747116,0.0,3.184790e+09
5192434,TSLA,56317,2024-02-27,204.04,205.60,198.26,199.73,199.73,108645412,0.0,3.184790e+09
5192435,TSLA,56317,2024-02-28,200.42,205.30,198.44,202.04,202.04,99806173,0.0,3.184790e+09
5192436,TSLA,56317,2024-02-29,204.18,205.28,198.45,201.88,201.88,85906974,0.0,3.184790e+09


We see that one column, Shares Outstanding, is stored as float64, although a share count should be a whole number and would be cleaner as int64. Let's double check for fractional values across all rows of prices.

In [19]:
prices[prices["Shares Outstanding"] % 1 != 0]["Shares Outstanding"].unique()

array([], dtype=float64)

The result is empty: every value in Shares Outstanding is a whole number, so the column can be cast to an integer type without losing information.

Conclusion:
- prices: the Date column has dtype object instead of a datetime type and will be converted; "Shares Outstanding" can be cast to int64

In [21]:
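# Convert the dtypes in prices

# Use an explicit "int64": a plain int can map to 32 bits on some platforms,
# which is too small for share counts above roughly 2.1 billion
prices["Shares Outstanding"] = prices["Shares Outstanding"].astype("int64")
prices["Date"] = pd.to_datetime(prices.Date, format="%Y-%m-%d")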
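The check and the cast can be combined so that the conversion refuses to run if it would lose information. The helper `to_int64_if_whole` below is hypothetical (it is not part of the notebook or of pandas) and the data is invented; treat it as a sketch of the idea rather than the notebook's method.

```python
import pandas as pd


def to_int64_if_whole(series: pd.Series) -> pd.Series:
    """Cast a float series to int64, raising if any value would change."""
    if series.isna().any():
        raise ValueError("series contains missing values")
    if (series % 1 != 0).any():
        raise ValueError("series contains fractional values")
    return series.astype("int64")


# Toy data mirroring the two columns converted above (values invented).
toy = pd.DataFrame(
    {
        "Date": ["2024-02-28", "2024-02-29"],
        "Shares Outstanding": [1.8429136e10, 3.18479e9],
    }
)

toy["Shares Outstanding"] = to_int64_if_whole(toy["Shares Outstanding"])
toy["Date"] = pd.to_datetime(toy["Date"], format="%Y-%m-%d")

print(toy.dtypes)
```

Passing an explicit `format` to `pd.to_datetime` makes a malformed date string raise an error instead of being guessed at.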

<a id='5'></a>
## Step 5 - Export csv for ML preparation

In [24]:
prices.head()

Unnamed: 0,Ticker,SimFinId,Date,Open,High,Low,Close,Adj. Close,Volume,Dividend,Shares Outstanding
14253,AAPL,111052,2019-04-01,47.91,47.92,47.09,47.81,45.81,111447856,0.0,18429136000
14254,AAPL,111052,2019-04-02,47.77,48.62,47.76,48.51,46.47,91062928,0.0,18429136000
14255,AAPL,111052,2019-04-03,48.31,49.12,48.29,48.84,46.79,93087320,0.0,18429136000
14256,AAPL,111052,2019-04-04,48.7,49.09,48.28,48.92,46.87,76457100,0.0,18429136000
14257,AAPL,111052,2019-04-05,49.11,49.27,48.98,49.25,47.19,74106576,0.0,18429136000


In [26]:
prices.to_csv("prices_output.csv")
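A csv file stores plain text, so the datetime dtype set in Step 4 has to be restored when the file is read back. In addition, `to_csv` writes the index by default, and that unlabelled column comes back as `Unnamed: 0` on reload. The sketch below shows one way to handle both on invented data, using an in-memory buffer instead of a real file; `index=False` and `parse_dates` are suggestions, not what the notebook does.

```python
import io

import pandas as pd

# Toy frame with the dtypes produced in Step 4 (values invented).
toy = pd.DataFrame(
    {
        "Ticker": ["AAA", "AAA"],
        "Date": pd.to_datetime(["2024-02-28", "2024-02-29"]),
        "Shares Outstanding": [18429136000, 18429136000],
    }
)

# index=False leaves the row index out of the file
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# parse_dates restores the datetime dtype when the file is read back
buffer.seek(0)
reloaded = pd.read_csv(buffer, parse_dates=["Date"])

print(reloaded.dtypes)
```

If the original row index is needed downstream, keeping the default and reading with `index_col=0` is the alternative.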