# Company1 - ETL Process

This notebook is organized into the following sections:
* [Step 0 - Preliminary: Viewing the data](#0)
* [Step 1 - Checking for duplicates](#1)
* [Step 2 - Checking for missing values](#2)
* [Step 3 - Imputing/dropping missing values](#3)
* [Step 4 - Ensuring correct datatypes](#4)
* [Step 5 - Preparation for merging](#5)


<a id='0'></a>
## Step 0 - Preliminary: Viewing the data

In [1]:
import pandas as pd

In [6]:
# Load the daily share prices CSV (semicolon-delimited)
prices = pd.read_csv("data/us-shareprices-daily.csv", delimiter=";")

In [16]:
prices.tail()

Unnamed: 0,Ticker,SimFinId,Date,Open,High,Low,Close,Adj. Close,Volume,Dividend,Shares Outstanding
5767354,ZYXI,171401,2024-02-26,13.04,13.04,12.67,12.82,12.82,335055,,36435000.0
5767355,ZYXI,171401,2024-02-27,12.83,13.77,12.83,13.74,13.74,395525,,36435000.0
5767356,ZYXI,171401,2024-02-28,13.63,13.7,13.38,13.49,13.49,290887,,32170182.0
5767357,ZYXI,171401,2024-02-29,13.51,13.57,13.28,13.56,13.56,232534,,32170182.0
5767358,ZYXI,171401,2024-03-01,12.05,13.43,12.0,12.3,12.3,1216112,,32170182.0


In [9]:
# Load the companies CSV (semicolon-delimited)
companies = pd.read_csv("data/us-companies.csv", delimiter=";")

In [13]:
companies.tail()

Unnamed: 0,Ticker,SimFinId,Company Name,IndustryId,ISIN,End of financial year (month),Number Employees,Business Summary,Market,CIK,Main Currency
6065,ZWS,17663788,Zurn Elkay Water Solutions Corporation,100001.0,US98983L1089,12.0,2700.0,Zurn Elkay Water Solutions Corporation designs...,us,1439288.0,USD
6066,ZY,1243193,Zymergen Inc.,106002.0,US98985X1000,12.0,758.0,Zymergen is a biofacturing company using biolo...,us,1645842.0,USD
6067,ZYME,17663790,Zymeworks Inc.,106002.0,CA98985W1023,12.0,291.0,"Zymeworks Inc., a clinical-stage biopharmaceut...",us,1403752.0,USD
6068,ZYNE,901704,"Zynerba Pharmaceuticals, Inc.",106002.0,US98986X1090,12.0,25.0,Zynerba Pharmaceuticals Inc together with its ...,us,1621443.0,USD
6069,ZYXI,171401,ZYNEX INC,106004.0,US98986M1036,12.0,768.0,"Zynex, Inc. engages in the design, manufacture...",us,846475.0,USD


Selecting a company (Bruker Corp, `SimFinId` 1253240):

In [None]:
# List every distinct company name to help pick a company
unique_companies = list(companies["Company Name"].unique())
unique_companies

In [35]:
# Company metadata for Bruker, selected by its SimFinId
company_bruker = companies[companies["SimFinId"] == 1253240]
company_bruker

Unnamed: 0,Ticker,SimFinId,Company Name,IndustryId,ISIN,End of financial year (month),Number Employees,Business Summary,Market,CIK,Main Currency
930,BRKR,1253240,BRUKER CORP,106001.0,US1167941087,12.0,7400.0,Bruker is enabling scientists to make breakthr...,us,1109354.0,USD


In [37]:
# Daily share prices for Bruker, selected by the same SimFinId
prices_bruker = prices[prices["SimFinId"] == 1253240]
prices_bruker

Unnamed: 0,Ticker,SimFinId,Date,Open,High,Low,Close,Adj. Close,Volume,Dividend,Shares Outstanding
812641,BRKR,1253240,2019-04-01,38.76,39.01,38.47,38.87,38.20,761845,,156814676.0
812642,BRKR,1253240,2019-04-02,38.94,39.01,38.39,38.53,37.86,666801,,156814676.0
812643,BRKR,1253240,2019-04-03,38.74,39.24,38.63,39.01,38.33,2647355,,156814676.0
812644,BRKR,1253240,2019-04-04,38.95,39.01,37.82,38.32,37.66,503730,,156814676.0
812645,BRKR,1253240,2019-04-05,38.41,39.23,38.30,38.41,37.74,509516,,156814676.0
...,...,...,...,...,...,...,...,...,...,...,...
813875,BRKR,1253240,2024-02-26,84.34,84.88,83.21,83.60,83.35,788986,,137671143.0
813876,BRKR,1253240,2024-02-27,83.60,84.29,83.05,83.70,83.45,559922,,137671143.0
813877,BRKR,1253240,2024-02-28,84.18,87.25,84.13,86.48,86.23,1433649,,137671143.0
813878,BRKR,1253240,2024-02-29,86.90,88.92,85.86,86.54,86.33,2687003,0.05,137671143.0
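
The two filters above follow the same pattern, so they could be wrapped in a small helper. The sketch below is only an illustration: `select_company` and the toy `prices_demo` frame are hypothetical names with made-up values, not part of the original notebook. It returns an explicit copy, which keeps later column assignments on the subset from raising pandas' `SettingWithCopyWarning`.

```python
import pandas as pd

# Toy stand-in for the `prices` frame loaded above (values are made up).
prices_demo = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "BBB"],
    "SimFinId": [1, 1, 2],
    "Date": ["2024-01-02", "2024-01-03", "2024-01-02"],
    "Close": [10.0, 10.5, 20.0],
})


def select_company(df: pd.DataFrame, simfin_id: int) -> pd.DataFrame:
    """Return an independent copy of the rows belonging to one SimFinId."""
    subset = df.loc[df["SimFinId"] == simfin_id].copy()
    if subset.empty:
        # Fail loudly instead of silently continuing with an empty frame.
        raise ValueError(f"No rows found for SimFinId {simfin_id}")
    return subset


company_rows = select_company(prices_demo, 1)
print(company_rows)
```

Because the helper returns a copy, changes made to `company_rows` do not propagate back to `prices_demo`.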


<a id='1'></a>
## Step 1 - Checking for duplicates

In [19]:
# Check for fully duplicated rows in the share prices
prices.duplicated().any()

False

In [17]:
# Check for fully duplicated rows in the companies
companies.duplicated().any()

False

Conclusion: Neither data set contains fully duplicated rows.
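
`duplicated()` without arguments only flags rows that match in every column. A stricter check, sketched below on made-up toy data, looks for repeated keys. It assumes that each (`Ticker`, `Date`) pair should appear at most once in the share prices, which the notebook does not state explicitly.

```python
import pandas as pd

# Toy data (made up): the first two rows share a key but differ in Close,
# so a full-row duplicate check does not flag them.
prices_demo = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "AAA"],
    "Date": ["2024-01-02", "2024-01-02", "2024-01-03"],
    "Close": [10.0, 10.1, 10.5],
})

# Same check as in the notebook: rows identical in every column.
has_full_duplicates = bool(prices_demo.duplicated().any())

# Key-based check: is any (Ticker, Date) pair repeated?
key_columns = ["Ticker", "Date"]
has_key_duplicates = bool(prices_demo.duplicated(subset=key_columns).any())

# keep=False marks every member of a duplicated group for inspection.
conflicting_rows = prices_demo[prices_demo.duplicated(subset=key_columns, keep=False)]

print(has_full_duplicates, has_key_duplicates)
print(conflicting_rows)
```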

<a id='2'></a>
## Step 2 - Checking for missing values

**Checking the full data sets first:**

In [21]:
# Fraction of missing values per column in the share prices
prices.isna().sum() / len(prices)

Ticker                0.000000
SimFinId              0.000000
Date                  0.000000
Open                  0.000000
High                  0.000000
Low                   0.000000
Close                 0.000000
Adj. Close            0.000000
Volume                0.000000
Dividend              0.993940
Shares Outstanding    0.089509
dtype: float64

In [22]:
# Fraction of missing values per column in the companies
companies.isna().sum() / len(companies)

Ticker                           0.012521
SimFinId                         0.000000
Company Name                     0.012191
IndustryId                       0.047117
ISIN                             0.178254
End of financial year (month)    0.012026
Number Employees                 0.124053
Business Summary                 0.051400
Market                           0.000000
CIK                              0.001977
Main Currency                    0.000000
dtype: float64

In [29]:
prices.head()

Unnamed: 0,Ticker,SimFinId,Date,Open,High,Low,Close,Adj. Close,Volume,Dividend,Shares Outstanding
0,A,45846,2019-04-01,80.96,81.77,80.96,81.56,78.33,1522681,0.16,317515869.0
1,A,45846,2019-04-02,81.71,81.76,81.03,81.14,77.93,1203037,,317515869.0
2,A,45846,2019-04-03,81.54,82.02,81.46,81.94,78.7,2141025,,317515869.0
3,A,45846,2019-04-04,81.84,82.05,80.44,80.83,77.63,2180112,,317515869.0
4,A,45846,2019-04-05,81.19,81.92,81.05,81.47,78.25,1502875,,317515869.0


**Checking a specific company (Bruker):**

In [39]:
prices_bruker.isna().sum() / len(prices_bruker)

Ticker                0.000000
SimFinId              0.000000
Date                  0.000000
Open                  0.000000
High                  0.000000
Low                   0.000000
Close                 0.000000
Adj. Close            0.000000
Volume                0.000000
Dividend              0.983858
Shares Outstanding    0.000000
dtype: float64

Conclusion: For Bruker, the only column with missing values is `Dividend` (about 98.4% of rows). Dividends are paid on only a few dates, so a missing value can be treated as a dividend of 0.
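
The missing-value fractions above can also be computed with `isna().mean()`, which gives the same result as dividing the count by the number of rows. The sketch below uses a made-up toy frame and additionally filters down to the columns that actually have gaps. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data (made up): three of four Dividend values are missing.
prices_demo = pd.DataFrame({
    "Close": [10.0, 10.5, 11.0, 11.2],
    "Dividend": [np.nan, np.nan, 0.05, np.nan],
})

# The mean of a boolean mask is the fraction of True values per column.
missing_share = prices_demo.isna().mean()

# Keep only columns with at least one missing value, largest share first.
columns_with_gaps = missing_share[missing_share > 0].sort_values(ascending=False)

print(columns_with_gaps)
```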

<a id='3'></a>
## Step 3 - Imputing/dropping missing values

In [44]:
# prices_bruker was created by filtering prices, so work on an explicit copy;
# assigning to the filtered slice directly raises SettingWithCopyWarning
prices_bruker = prices_bruker.copy()
prices_bruker["Dividend"] = prices_bruker["Dividend"].fillna(0)


In [48]:
prices_bruker.Dividend.unique()

array([0.  , 0.04, 0.05])

In [47]:
prices_bruker.isna().sum() / len(prices_bruker)

Ticker                0.0
SimFinId              0.0
Date                  0.0
Open                  0.0
High                  0.0
Low                   0.0
Close                 0.0
Adj. Close            0.0
Volume                0.0
Dividend              0.0
Shares Outstanding    0.0
dtype: float64

Conclusion: No missing values remain in the Bruker share prices.
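
As a sanity check on this kind of imputation, one could confirm that filling with 0 does not change the number of dividend payments and leaves other columns untouched. The sketch below uses made-up toy data and hypothetical variable names. Passing a dict to `fillna` restricts the fill to the named column.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one dividend payment, plus a gap in another column
# that should NOT be filled with 0.
company_prices = pd.DataFrame({
    "Close": [10.0, 10.5, 11.0],
    "Dividend": [np.nan, 0.05, np.nan],
    "Shares Outstanding": [100.0, np.nan, 100.0],
})

# Number of recorded dividend payments before imputation.
paid_before = int(company_prices["Dividend"].notna().sum())

# Fill only the Dividend column; fillna returns a new frame.
filled = company_prices.fillna({"Dividend": 0})

# Number of non-zero dividends after imputation.
# (Equal to paid_before as long as the raw data holds no explicit zeros.)
paid_after = int((filled["Dividend"] != 0).sum())

print(paid_before, paid_after)
print(filled.isna().sum())
```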

<a id='4'></a>
## Step 4 - Ensuring correct datatypes

In [None]:
# TODO: inspect and convert column datatypes
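
This step is still a placeholder. The sketch below shows one way a datatype check could look. It uses made-up toy frames and assumes that `Date` is read from the CSV as text and that ID-like columns such as `IndustryId` are floats only because they contain missing values. Both assumptions should be checked against `prices.dtypes` and `companies.dtypes` first.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (made up) for the share price and company frames.
prices_demo = pd.DataFrame({
    "Ticker": ["AAA", "AAA"],
    "Date": ["2024-01-02", "2024-01-03"],
    "Close": [10.0, 10.5],
})
companies_demo = pd.DataFrame({
    "Ticker": ["AAA", "BBB"],
    "IndustryId": [100001.0, np.nan],
})

# Inspect the inferred dtypes before changing anything.
print(prices_demo.dtypes)

# Parse ISO-formatted date strings into datetime64 values.
prices_typed = prices_demo.copy()
prices_typed["Date"] = pd.to_datetime(prices_typed["Date"], format="%Y-%m-%d")

# The nullable Int64 dtype keeps whole-number IDs as integers while
# still allowing missing values (shown as <NA>).
companies_typed = companies_demo.copy()
companies_typed["IndustryId"] = companies_typed["IndustryId"].astype("Int64")

print(prices_typed.dtypes)
print(companies_typed.dtypes)
```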

<a id='5'></a>
## Step 5 - Preparation for merging
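
This step has no content yet. As a starting point, the sketch below joins price rows to company metadata on `SimFinId`, the identifier both data sets share. The toy frames and their values are made up, and the choice of a left join with a many-to-one validation is an assumption about the intended merge, not something the notebook specifies.

```python
import pandas as pd

# Toy stand-ins (made up) for the cleaned price and company frames.
prices_demo = pd.DataFrame({
    "Ticker": ["AAA", "AAA"],
    "SimFinId": [1, 1],
    "Date": ["2024-01-02", "2024-01-03"],
    "Close": [10.0, 10.5],
})
companies_demo = pd.DataFrame({
    "Ticker": ["AAA", "BBB"],
    "SimFinId": [1, 2],
    "Company Name": ["Alpha Corp", "Beta Inc"],
})

# Drop the duplicate Ticker column from the company side so the merge does
# not produce Ticker_x / Ticker_y. validate="many_to_one" raises an error
# if a SimFinId appears more than once in the company frame.
merged = prices_demo.merge(
    companies_demo.drop(columns=["Ticker"]),
    on="SimFinId",
    how="left",
    validate="many_to_one",
)

print(merged)
```

A left join keeps every price row even when no company record matches, so unmatched rows show up as missing values in the company columns rather than disappearing.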