# Cleaning our data

Cleaning your data and dealing with dates: that's the life of anyone who works with data, especially in finance. Most data sets are messy. This is especially true if you or your firm collects the data rather than purchasing it.

DataCamp has an article on [cleaning data in `pandas`](https://www.datacamp.com/community/tutorials/data-preparation-with-pandas).

## Getting set up

We'll start by bringing in our Zillow data again.


In [4]:
import numpy as np
import pandas as pd

uw = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/zestimatesAndCutoffs_byGeo_uw_2017-10-10_forDataPage.csv')
uw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2610 entries, 0 to 2609
Data columns (total 24 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   RegionType                    2610 non-null   object 
 1   RegionName                    2610 non-null   object 
 2   StateName                     2609 non-null   object 
 3   MSAName                       1071 non-null   object 
 4   AllHomes_Tier1                2610 non-null   float64
 5   AllHomes_Tier2                2610 non-null   float64
 6   AllHomes_Tier3                2610 non-null   float64
 7   AllHomes_AllTiers             2610 non-null   float64
 8   UWHomes_Tier1                 2610 non-null   float64
 9   UWHomes_Tier2                 2610 non-null   int64  
 10  UWHomes_Tier3                 2610 non-null   float64
 11  UWHomes_AllTiers              2610 non-null   float64
 12  UWHomes_TotalValue_Tier1      2610 non-null   float64
 13  UWH

One of the most important steps in data cleaning is just looking at what we have. What are the variables? What are their types? How many unique values of each variable do we have? Are any values missing? Do we see anything unexpected?

In [4]:
uw['UWHomes_Tier2'] = uw["UWHomes_Tier2"].astype('float64')


In [5]:
uw['RegionType'].value_counts()


Zip       1247
City      1017
County     227
MSA         95
State       23
Nation       1
Name: RegionType, dtype: int64

In [6]:
uw.isna().sum()

RegionType                         0
RegionName                         0
StateName                          1
MSAName                         1539
AllHomes_Tier1                     0
AllHomes_Tier2                     0
AllHomes_Tier3                     0
AllHomes_AllTiers                  0
UWHomes_Tier1                      0
UWHomes_Tier2                      0
UWHomes_Tier3                      0
UWHomes_AllTiers                   0
UWHomes_TotalValue_Tier1           0
UWHomes_TotalValue_Tier2           0
UWHomes_TotalValue_Tier3           0
UWHomes_TotalValue_AllTiers        0
UWHomes_MedianValue_AllTiers       0
AllHomes_Tier1_ShareUW             0
AllHomes_Tier2_ShareUW             0
AllHomes_Tier3_ShareUW             0
AllHomes_AllTiers_ShareUW          0
UWHomes_ShareInTier1               0
UWHomes_ShareInTier2               0
UWHomes_ShareInTier3               0
dtype: int64

In [7]:
uw.RegionName.unique()

array(['United States', 'Alabama', 'California', ..., '98612', '32081',
       '33578'], dtype=object)

In [8]:
uw.RegionName.nunique()

2496

In [9]:
uw[uw['RegionType'] == 'MSA'].MSAName.nunique()

95

In [10]:
uw[uw['RegionType'] == 'MSA']['MSAName'].nunique()

95
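The individual checks above (types, unique values, missing values) can also be collected into a single table. Here is a minimal sketch, assuming a small made-up DataFrame in place of the Zillow data; `summarize` is a hypothetical helper name, not something from `pandas`.

```python
import pandas as pd

# Toy stand-in for the Zillow data; the values are made up for illustration.
toy = pd.DataFrame({
    "RegionType": ["State", "City", "City", "Zip"],
    "StateName": ["Alabama", "Alabama", None, "Florida"],
    "UWHomes_Tier2": [10, 3, 3, 7],
})


def summarize(df):
    """Return one row per column: its dtype, unique-value count, and missing count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),      # missing values are not counted as a value
        "n_missing": df.isna().sum(),
    })


summary = summarize(toy)
print(summary)
```

Calling `summarize(uw)` on the real data should give the same kind of overview in one place, which makes unexpected types or gaps easier to spot.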

We can bring back the stock data too, as that data has some missing values.

In [11]:
prices = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
                      index_col=0, parse_dates=True)

Why are there missing values? Holidays and weekends, when trading doesn't take place.

In [12]:
prices.isna().sum()

AAPL.O    78
MSFT.O    78
INTC.O    78
AMZN.O    78
GS.N      78
SPY       78
.SPX      78
.VIX      78
EUR=       0
XAU=       5
GDX       78
GLD       78
dtype: int64

We can drop these rows with `dropna`. We'll specify `axis=0`, which tells `pandas` to drop rows rather than columns.

In [13]:
prices = prices.dropna(axis=0)
prices.isna().sum()

AAPL.O    0
MSFT.O    0
INTC.O    0
AMZN.O    0
GS.N      0
SPY       0
.SPX      0
.VIX      0
EUR=      0
XAU=      0
GDX       0
GLD       0
dtype: int64
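To see what the `axis` argument changes, here is a small sketch on made-up prices. The column names and values are assumptions for illustration; we pretend the second date is a market holiday, so the stocks are missing but the currency still has a quote.

```python
import numpy as np
import pandas as pd

# Made-up data: stocks are missing on the second date, the currency is not.
toy = pd.DataFrame(
    {
        "AAPL": [30.57, np.nan, 30.14],
        "SPY": [113.33, np.nan, 113.71],
        "EUR": [1.4411, 1.4368, 1.4412],
    },
    index=pd.to_datetime(["2010-01-04", "2010-01-05", "2010-01-06"]),
)

# axis=0: drop every row that has at least one missing value
rows_dropped = toy.dropna(axis=0)

# axis=1: drop every column that has at least one missing value
cols_dropped = toy.dropna(axis=1)

# how="all": drop a row only if every value in it is missing
all_missing = toy.dropna(axis=0, how="all")

print(rows_dropped.shape, cols_dropped.shape, all_missing.shape)
```

Dropping by rows keeps all of our tickers but loses the holiday. Dropping by columns would keep every date but throw away any ticker with a gap, which is rarely what we want with price data.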

In [15]:
prices.head(15)

| Date | AAPL.O | MSFT.O | INTC.O | AMZN.O | GS.N | SPY | .SPX | .VIX | EUR= | XAU= | GDX | GLD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2010-01-04 | 30.572827 | 30.95 | 20.88 | 133.9 | 173.08 | 113.33 | 1132.99 | 20.04 | 1.4411 | 1120.0 | 47.71 | 109.8 |
| 2010-01-05 | 30.625684 | 30.96 | 20.87 | 134.69 | 176.14 | 113.63 | 1136.52 | 19.35 | 1.4368 | 1118.65 | 48.17 | 109.7 |
| 2010-01-06 | 30.138541 | 30.77 | 20.8 | 132.25 | 174.26 | 113.71 | 1137.14 | 19.16 | 1.4412 | 1138.5 | 49.34 | 111.51 |
| 2010-01-07 | 30.082827 | 30.452 | 20.6 | 130.0 | 177.67 | 114.19 | 1141.69 | 19.06 | 1.4318 | 1131.9 | 49.1 | 110.82 |
| 2010-01-08 | 30.282827 | 30.66 | 20.83 | 133.52 | 174.31 | 114.57 | 1144.98 | 18.13 | 1.4412 | 1136.1 | 49.84 | 111.37 |
| 2010-01-11 | 30.015684 | 30.27 | 20.95 | 130.308 | 171.56 | 114.73 | 1146.98 | 17.55 | 1.4513 | 1152.6 | 50.17 | 112.85 |
| 2010-01-12 | 29.674256 | 30.07 | 20.608 | 127.35 | 167.82 | 113.66 | 1136.22 | 18.25 | 1.4494 | 1127.3 | 48.35 | 110.49 |
| 2010-01-13 | 30.092827 | 30.35 | 20.96 | 129.11 | 169.07 | 114.62 | 1145.68 | 17.85 | 1.451 | 1138.4 | 48.86 | 111.54 |
| 2010-01-14 | 29.918542 | 30.96 | 21.48 | 127.35 | 168.53 | 114.93 | 1148.46 | 17.63 | 1.4502 | 1142.85 | 48.6 | 112.03 |
| 2010-01-15 | 29.418542 | 30.86 | 20.8 | 127.14 | 165.21 | 113.64 | 1136.03 | 17.91 | 1.4382 | 1129.9 | 47.42 | 110.86 |


## Pyjanitor

We are going to look at a fun package called [pyjanitor](https://pyjanitor-devs.github.io/pyjanitor/). It is based on the janitor package from the [R](https://www.r-project.org) statistical programming language.

To use this package, you'll need to type the following in the terminal (Mac) or command prompt (Windows):

```
conda install -c conda-forge pyjanitor
```

There are even [finance-specific tools](https://pyjanitor-devs.github.io/pyjanitor/api/finance/).

In [18]:
import janitor
from janitor import clean_names


In [19]:
prices = (
    pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
                      index_col=0, parse_dates=True)
    .remove_columns(['GLD'])
    .dropna()
    .rename_column('AAPL.O', 'AAPL')
    .rename_column('MSFT.O', 'MSFT')
)
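For comparison, the same kind of chain can be written with plain `pandas` methods. This is a sketch on a small made-up DataFrame, not the real price file; it assumes `remove_columns` and `rename_column` behave like `drop(columns=...)` and `rename(columns=...)` for this simple case.

```python
import numpy as np
import pandas as pd

# Made-up prices with the same column naming style as the real file.
toy = pd.DataFrame(
    {
        "AAPL.O": [30.57, np.nan, 30.14],
        "MSFT.O": [30.95, np.nan, 30.77],
        "GLD": [109.80, 109.70, 111.51],
    },
    index=pd.to_datetime(["2010-01-04", "2010-01-05", "2010-01-06"]),
)

cleaned = (
    toy
    .drop(columns=["GLD"])                                   # like remove_columns(['GLD'])
    .dropna()                                                # drop rows with missing values
    .rename(columns={"AAPL.O": "AAPL", "MSFT.O": "MSFT"})    # like two rename_column calls
)

print(cleaned)
```

Either style works. The appeal of the `pyjanitor` version is that each step reads like a verb describing what we are doing to the data.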

In [22]:
prices = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
                      index_col=0, parse_dates=True)

prices = prices.clean_names()

In [21]:
prices = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
                      index_col=0, parse_dates=True)

prices = clean_names(prices)

In [23]:
prices = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
                      index_col=0, parse_dates=True)

prices = prices.flag_nulls()

In [24]:
prices.get_dupes()

| Date | AAPL.O | MSFT.O | INTC.O | AMZN.O | GS.N | SPY | .SPX | .VIX | EUR= | XAU= | GDX | GLD | null_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
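The empty result above means that no two rows in the price data are identical. To make the last two steps less of a black box, here is a rough sketch of what they do, written in plain `pandas` on made-up data. It reflects our understanding of the default behavior of `flag_nulls` and `get_dupes`, not their actual source code.

```python
import numpy as np
import pandas as pd

# Made-up data: one row with a missing value, and two identical rows.
toy = pd.DataFrame({
    "AAPL": [30.57, np.nan, 30.14, 30.14],
    "MSFT": [30.95, 30.96, 30.77, 30.77],
})

# Roughly flag_nulls: 1 if the row has at least one missing value, else 0.
flagged = toy.assign(null_flag=toy.isna().any(axis=1).astype(int))

# Roughly get_dupes: every row that appears more than once, keeping all copies.
dupes = flagged[flagged.duplicated(keep=False)]

print(flagged)
print(dupes)
```

In this toy example, the last two rows come back as duplicates of each other, and only the second row gets a `null_flag` of 1.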
