# <span style='font-family: CMU Sans Serif, sans-serif;'> Addressing missing data  </span> 

This notebook examines how much data is missing and how to handle it appropriately. The data was sourced from Wharton Research Data Services (WRDS) and is produced by the Global Factor Data (GFD) team. Below we import the packages needed for cleaning the data.

In [2]:
import pandas as pd
import numpy as np
import janitor

Some notebook setup is also done below. Specifically, we adjust the pandas display options so that wide data frames are shown in full.

In [6]:
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', 0)  # Use full cell width
pd.set_option('display.expand_frame_repr', False)  # Prevent line breaks

# <span style='font-family: CMU Sans Serif, sans-serif;'> Loading data and supporting information  </span> 

The data was downloaded in `../data__collect.ipynb` and is imported here for cleaning.

In [9]:
# Define path to where the data is located
dataPath = "../data__collect/usa__gfd.parquet"

# Read the raw data into a DataFrame
dataRawGfdUs = pd.read_parquet(dataPath)

Besides the data itself, we also want the information about the features that the GFD team provides. We import the cluster labels and the feature details, which can also be used to distinguish basis features from primary features.

In [92]:
# Raw URL to label clusters and details from GFD repo
urlLabelClustersGfd = "https://raw.githubusercontent.com/bkelly-lab/ReplicationCrisis/master/GlobalFactors/Cluster%20Labels.csv"
urlLabelDetailsGfd = "https://raw.githubusercontent.com/bkelly-lab/ReplicationCrisis/master/GlobalFactors/Factor%20Details.xlsx"

# Read dataframes
dataLabelClustersGfd = pd.read_csv(urlLabelClustersGfd)
dataLabelDetailsGfd = pd.read_excel(urlLabelDetailsGfd)

# Create list of primary features (characteristics)
listPrimaryFeatures = dataLabelClustersGfd['characteristic'].tolist()

# <span style='font-family: CMU Sans Serif, sans-serif;'> EDA  </span> 

## <span style='font-family: CMU Sans Serif, sans-serif;'> Feature category check  </span> 

Before removal or imputation, transformation, feature engineering, etc., we need to understand our data: how it is structured, what each variable tells us, what type it has, and so on. To keep track of this we create a data dictionary. We focus on the primary features (153), which are used to describe each firm; the basis features (39) are generally less erroneous.

Every column that is not a primary feature is treated as a basis feature. First we confirm that all 153 primary features are present, then we count how many basis features we have.

In [93]:
# Create list of observed primary features and basis features
listObsPrimaryFeatures = [feature for feature in dataRawGfdUs.columns.tolist() if feature in listPrimaryFeatures]
listObsBasisFeatures = [feature for feature in dataRawGfdUs.columns.tolist() if feature not in listPrimaryFeatures]

# Count the observed features of each type
intCountBasisFeatures = len(listObsBasisFeatures)
intCountPrimaryFeatures = len(listObsPrimaryFeatures)

# Print count 
print(f"Count basis features: {intCountBasisFeatures}")
print(f"Count primary features: {intCountPrimaryFeatures}")
print(f"Count of total observed features: {dataRawGfdUs.shape[1]}")

Count basis features: 39
Count primary features: 153
Count of total observed features: 192
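
The counts above match, but a count alone would not say which feature is absent if the numbers ever disagreed. A minimal sketch of a name-level check using set differences is given below. The toy list and data frame are hypothetical stand-ins for `listPrimaryFeatures` and `dataRawGfdUs`, not the real data.

```python
import pandas as pd

# Toy stand-ins (hypothetical): the real inputs would be
# listPrimaryFeatures and dataRawGfdUs
listExpectedPrimary = ["be_me", "bev_mev", "ret_12_1"]
dataToy = pd.DataFrame(
    {"permno": [1, 2], "be_me": [0.5, None], "ret_12_1": [0.1, 0.2]}
)

# Set differences name the features instead of only counting them
setExpected = set(listExpectedPrimary)
setObserved = set(dataToy.columns)
listMissingPrimary = sorted(setExpected - setObserved)  # expected but absent
listObsBasis = sorted(setObserved - setExpected)        # everything else

print(f"Expected primary features not in data: {listMissingPrimary}")
print(f"Columns treated as basis features: {listObsBasis}")
```

With the real inputs, an empty first list would confirm that every primary feature is present by name rather than only by count.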


We have the correct number of features in each category. Next we create the data dictionary.

In [85]:
dictDataDescription = {}
dictPrimaryDataDescription = {}
dictBasisDataDescription = {}

In [94]:
def buildFeatureEntry(feat):
    """Build the data-dictionary entry for a single feature."""
    series = dataRawGfdUs[feat]
    # Look up the feature description in the GFD factor details
    description = dataLabelDetailsGfd[dataLabelDetailsGfd['abr_jkp'] == feat]['name_new']
    isNumeric = pd.api.types.is_numeric_dtype(series.dtype)
    return {
        "Type": series.dtype,
        "Description": description.values[0] if not description.empty else "N/A",
        "Pre-trans Range": f"[{series.min()}, {series.max()}]" if isNumeric else "N/A",
        "Pre-clean NaNs": series.isna().sum(),
        "Transformation": "None",
        "Post-trans Range": "N/A",
        "Post-clean NaNs": "N/A",
        "Outlier handled (T/F)": False
    }

# Fill the primary and basis dictionaries with one entry per observed feature
for feat in listObsPrimaryFeatures:
    dictPrimaryDataDescription[feat] = buildFeatureEntry(feat)

for feat in listObsBasisFeatures:
    dictBasisDataDescription[feat] = buildFeatureEntry(feat)

dictDataDescription = {
    "Basis": dictBasisDataDescription,
    "Primary": dictPrimaryDataDescription
}
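
The description lookup above filters the details table once per feature. As a possible alternative, the mapping from abbreviation to description could be built once and then queried. The sketch below assumes the `abr_jkp` and `name_new` columns seen above; the table rows are a hypothetical stand-in for `dataLabelDetailsGfd`.

```python
import pandas as pd

# Toy stand-in (hypothetical rows) for dataLabelDetailsGfd
dataToyDetails = pd.DataFrame({
    "abr_jkp": ["be_me", "bev_mev", None],
    "name_new": ["Book-to-market equity", "Book-to-market enterprise value", "Unused row"],
})

# Build the abbreviation -> description mapping once.
# Rows without an abbreviation are dropped; on duplicates the first row wins,
# which mirrors taking description.values[0].
seriesDescriptions = (
    dataToyDetails.dropna(subset=["abr_jkp"])
    .drop_duplicates(subset="abr_jkp")
    .set_index("abr_jkp")["name_new"]
)
dictDescriptions = seriesDescriptions.to_dict()

# dict.get falls back to "N/A" for features without an entry
print(dictDescriptions.get("bev_mev", "N/A"))
print(dictDescriptions.get("permno", "N/A"))
```

For 192 features the speed difference is small; the main gain is that the fallback to `"N/A"` is handled in one place.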

In [None]:
dictDataDescription['Primary']['bev_mev']

{'Type': Float64Dtype(),
 'Description': 'Book-to-market enterprise value',
 'Pre-trans Range': '[1.3169633700051e-17, 38265.70349130235]',
 'Pre-clean NaNs': np.int64(955783),
 'Transformation': 'None',
 'Post-trans Range': 'N/A',
 'Post-clean NaNs': 'N/A',
 'Outlier handled (T/F)': False}
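
Inspecting one entry at a time does not show which features are most affected by missing values. A minimal sketch of turning the nested dictionary into a table and ranking features by their share of missing values follows. The dictionary contents and the row count are hypothetical toy values shaped like `dictDataDescription['Primary']`, and the `Pre-clean NaN share` column is an illustrative addition, not part of the dictionary above.

```python
import pandas as pd

# Toy stand-in (hypothetical values) shaped like dictDataDescription['Primary']
dictToyPrimary = {
    "be_me": {"Type": "Float64", "Pre-clean NaNs": 40},
    "bev_mev": {"Type": "Float64", "Pre-clean NaNs": 75},
}
intRows = 100  # stands in for dataRawGfdUs.shape[0]

# One row per feature, one column per dictionary key
dataDictionary = pd.DataFrame.from_dict(dictToyPrimary, orient="index")

# Express missing counts as a share of all rows and sort by that share
dataDictionary["Pre-clean NaN share"] = dataDictionary["Pre-clean NaNs"] / intRows
dataDictionary = dataDictionary.sort_values("Pre-clean NaN share", ascending=False)

print(dataDictionary)
```

A share is easier to compare across features than a raw count, since the count only has meaning relative to the number of rows.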