# About Converting from SAS to Dataframes
Each dataset is examined and concatenated in a way that makes the most sense for that set. Further, column types
are set for more efficient storage.

Currently, this does not convert all dataset, but does provide tools and a template to convert any datasets not yet converted by this notebook.

## Merging visit datafiles

145 different data files is crazy to manage for folks who are looking for correlations across datasets collected from the OAI. It has also led to a massive number of variables being defined: 11,000+. This is over 3x the actually number of unique variables measured across all visits. 

Given that several files are same/similar data collected across visits, this notebook defines tools to collect all that data into single files with columns to mark which visit a variable corresponds to. This greatly reduces the variable namespace. Whereas you may originally have:

file 1:
```
ID V00FOO
1   30.5
2   24.7
```

file 2:
```
ID V01FOO
1   31.9
2   27.3
```
This merges it into a single dataframe:
```
ID Visit FOO
1   V00  30.5
2   V00  24.7
1   V01  31.9
2   V01  27.3
```
## Port data types / Set efficient storage

In storage, SAS only stores values as floats or strings. This is extremely inefficient. When the pyreadstat library reads the data it has some rudimentary detection of column types. Still, most OAI columns endup as dtype object. Thus column types still need to be examined and cast to the most efficient data type.

Also, SAS allows for multiple user defined data markers for any column of data. This takes two forms. The first is similar to Panda's categorical types. This can be stored efficiently in Pandas. The second allows for multiple types of missing data (where Pandas only has NA or NaN). This causes most columns to be of mixed types in Pandas (float and str).

## Preserving SAS missing values in Pandas
By allowing for user defined missing values, SAS allows you to either treat all missing values the same or leverage the fact that not all values are missing for the same reason. Neither Python Pandas (or even R's dataframes) allow this as directly. Rather than throw this information away, for any dataset that includes missing values, two dataframes will be created. The first contains all the data with NaN or NA in place of all missing values. The second will be a dataframe with only columns that had missing values. These columns will have NaN in place of all values, but contain the full missing value labels at the same indices they existed in the original data. This allows ignoring missing values as the default case, but when needed, an NA in the data dataframe can trigger a check for a missing value in the missing value dataframe.

e.g. data
```
ID Visit FOO   BAR  BAZ
1   V00  30.5   5   0.1
2   V00  NaN    7   0.5 
1   V01  31.9   1   0.6
2   V01  27.3   NA  1.2
```
e.g. missing values/"shadow" dataframe
```
ID Visit FOO      BAR
1   V00  NaN       NA
2   V00  .Refused  NA
1   V01  NaN       NA
2   V01  Nan      .Unknown
```

Note that columns with no missing values do not get copied to the shadow dataframe (with the exception of ID and Visit for indexing purposes).

## TODO

### Files with a different format
Currently 5 files aren't handled yet. In each, the column naming format doesn't use visit prefixes:
* Biospec_fnih_joco_demographics
* biospec_fnih_joco_assays
* kmri_poma_incoa_moaks_bicl
* kmri_poma_tkr_chondrometrics
* kmri_poma_tkr_moaks_bicl
    
A separate create_df() function needs to be made to handle these cases.


### Other tasks
For example:

* Improve suggest_conversion():
    * add the ability to detect boolean columns (e.g. SUSPECTMINUTE in acceldatabymin)
    * check for possible categoricals in numeric columns with few unique values
    * handle columns of numeric strings (double check they have a leading 0 or other reason to be a string)

* Sanity checks:
    * in SAS string missing values are a blank, did pyreadstat capture this? string cols seem to have empty strings
    
* Other improvements:
    * Add method to summarize and sanity check the missing_df shadow dataframes
    * Cleanup labels so they drop SV and IEI as prefixes, column collapsing
    * In data like the accelerometry data, there isn't a single line per patient, fix so ID isn't used as an index
        * acceldatabyday, acceldatabymin, accelerometry
        * flxr_kneealign_cooke, flxr_kneealign_duryea
        * kmri_qcart_eckstein, kmri_sq_moaks_bicl
        * kxr_fta_duryea, kxr_qjsw_duryea, kxr_qjsw_rel_duryea, kxr_sq_bu, kxr_sq_rel_bu
        * mif, mri, xray

# Conversion Tools
Script to create a global dictionary of metadata needed to convert to efficient datatypes, along with functions to examine and convert the SAS data into Pandas dataframss.

## Imports / Definitions

In [1]:
from copy import deepcopy
import datetime
import math
import os
import numpy as np
import pandas as pd
import pandas.api.types as pdtypes
import pickle
import pyreadstat
import re
from string import digits
from tqdm import tqdm

import OAI_Utilities as utils

# Setup 
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
display(HTML("<style>.output_result { max-width:95% !important; }</style>"))

In [2]:
# Constants
data_dir = '../data/structured_data/'
visits = {'P02':'IEI', 'P01':'SV', 'V00':'EV', 'V01':'12m', 'V02':'18m', 'V03':'24m', 'V04':'30m', 'V05':'36m',
          'V06':'48m', 'V07':'60m', 'V08':'72m', 'V09':'84m', 'V10':'96m', 'V11':'108m', 'V99':"Outcomes"}
visit_prefixes = set(visits.keys())

meta_vars = [ 'column_labels',
 'column_names',
 'column_names_to_labels',
 'file_encoding',
 'file_format',
 'file_label',
 'missing_ranges',
 'missing_user_values',
 'notes',
 'number_columns',
 'number_rows',
 'original_variable_types',
 'readstat_variable_types',
 'table_name',
 'value_labels',
 'variable_alignment',
 'variable_display_width',
 'variable_measure',
 'variable_storage_width',
 'variable_to_label',
 'variable_value_labels']

metadata_dict_names = ['column_names_to_labels', 'original_variable_types', 'readstat_variable_types',
                       'value_labels', 'variable_alignment', 'variable_display_width',
                       'variable_measure', 'variable_storage_width', 'variable_to_label', 'variable_value_labels']

default_missing_value_codes = {
    ' ': '.: Missing Form/Incomplete Workbook',
    'A': '.A: Not Expected',
    'B': '.B: Low/Below Range',
    'C': '.C: Cannot Do/Attempted: unable to complete',
    'D': '.D: Don’t Know/Unknown/Uncertain',
    'E': '.E: Non-Exposed Control',
    'F': '.F: Not done, phone contact',
    'G': '.G: Unreleased high value',
    'H': '.H: High/Above range',
    'I': '.I: Inadequate data',
    'K': '.K: Cannot do/not attempted, unable',
    'L': '.L: Permanently Lost',
    'M': '.M: Missing',
    'N': '.N: Not Required/Not edited',
    'O': '.O: Not done, other reason',
    'P': '.P: Prosthetic',
    'R': '.R: Refused',
    'S': '.S: Unreleased low value',
    'T': '.T: Technical problems',
    'U': '.U: Unable to examine',
    'V': '.V: Missed visit',
    'W': '.W: Impossible value'
}
default_missing_val_tokens = set(default_missing_value_codes.keys())

## Look at the filesets

In [3]:
# All SAS files
all_files = os.listdir(data_dir)
all_files = [x for x in all_files if '.sas7bdat' in x]
all_files.remove('sageancillarystudy_formats.sas7bdat') ## At a binary level this seems like another CPORT file. WTF?
all_files.sort()

# How many files are there?
print('File cnt: ' + str(len(all_files)))

# How many sets?
# Drop extensions and then drop visit suffixes
tmp = set([f.translate(f.maketrans('', '', digits)) for f in [f.removesuffix('.sas7bdat') for f in all_files]])
print('File set cnt: ' + str(len(tmp)))

File cnt: 145
File set cnt: 41


## Utility Functions

In [4]:
#   Given a common filename prefix, return the sorted list of sas7bdat files starting with that prefix
# e.g. 'foo' -> ['foo01.sas7bdat', 'foo02.sas7bdat']
def get_data_file_names(all_files, prefix):
    file_list = [x for x in all_files if x.startswith(prefix)]
    file_list.sort()
    return file_list


#   Clean visit prefixes
# e.g. ['V01FOO', 'v02BAR'] -> ['FOO', 'BAR']
def remove_visit_prefixes(str_list):
    return [s[3:] if re.match("^[vVpP]\d\d\D\S*", s) else s for s in str_list]


#   Return a list of all unique prefixes
# e.g. ['V01FOO', 'V02BAR'] -> ['V01', 'V02']
def collect_prefixes(str_list):
    return list({s[:3] for s in str_list if re.match("^[vVpP]\d\d\D\S*", s) })

#   Get value_labels from catalog file
# The dictionaries that define user-defined types are stored in '.sas7bcat' files
def get_value_labels(filepath):
    _, data_catalog = pyreadstat.read_sas7bcat(filepath)
    return data_catalog.value_labels

#    Debug funct to see which files a given var name exists in and the datatype in each file
# e.g. FOO -> bar01.sas7bcat V01FOO string 3 $
def show_sources_types(files_meta, var):
    for file, meta in files_meta.items():
        for v in meta.column_names:
            if v.endswith(var):
                print(file + ' ' + v + ' ' + meta.readstat_variable_types[v] + ' ' + str(meta.variable_storage_width[v]) + ' ' + meta.variable_to_label.get(v, ''))
                
idx = pd.IndexSlice # strange bit of syntactic sugar, more of a function than a variable

## Creating a Global Metadata Dictionary
The `.sas7bdat` files contain the raw data, and most metadata.  `.sas7bcat` includes definitions for the user defined types.

This section defines functions for handling this metadata. This primarily reads in all metadata across datasets and creates a single global dictionary of metadata. This dictionary
can then be used to correctly set the data types for any data read in from subsequent `.sas7bdat` files.

We don't care about the entire metadata set returned by pyreadstat, only `['column_names', 'readstat_variable_types', 'variable_storage_width', 'variable_to_label']`
- See the `Exploring Available SAS Metadata` notebook for details on user-defined types and the meanings of various metadata items

In [5]:
# Grab the metadata across all files
# - Roughly ~4.5 min runtime
files_meta = {}
for filename in all_files:
    _, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sas7bdat, data_dir + filename,
                                                         num_processes=6, metadataonly=True)
    files_meta[filename] = meta

In [6]:
# Normalize names
#  - SAS is case insensitive, to normalize names for Python, all variables from SAS are migrated to uppercase
for file, meta in files_meta.items():
    meta.column_names = [n.upper() for n in meta.column_names]
    for v_name in ['readstat_variable_types', 'variable_storage_width', 'variable_to_label']:
        var = getattr(meta, v_name)
        setattr(meta, v_name, {k.upper(): v for k,v in var.items()})

In [7]:
# Make master map of original variable names to collapsed names
# - This dict maps names like V01FOO -> FOO
var_name_map = {}
for meta in files_meta.values():
    new = {n: (n[3:] if n[:3] in visit_prefixes else n) for n in meta.column_names}
    var_name_map = {**var_name_map, **new}

In [8]:
# Building the global metadata dictionary
# global_var_name - {storage_type, storage_width, user_defined_type}
#
# - The collapse of names causes 4 conflicts:
#      name   type  width - [files]
# COHORT
#   V00COHORT double 8 - [Clinical_fnih.sas7bdat, enrollees.sas7bdat]
#   COHORT string 11 - [measinventory.sas7bdat]
# RACE
#   RACE string 1 - [Biospec_fnih_joco_demographics.sas7bdat ]
#   P02RACE double 8 - [Clinical_fnih.sas7bdat, enrollees.sas7bdat, measinventory.sas7bdat]
# PTH
#   Has value_label YNDK for allclinicalXX.sas7bdat, but no type in boneancillarystudy.sas7bdat
# READER
#   Has value_label '$' in most files, but no type in kmri_sq_biclXX.sas7bdat
#
# This code captures the common setting and leaves the outliers for custom treatment later

global_meta = {}
for meta in files_meta.values():
    for col in meta.column_names:
        if col not in ['RACE', 'COHORT']: # Exceptions, see above
            descript = meta.column_names_to_labels.get(col, None)
            storage_type = meta.readstat_variable_types[col]
            storage_width = meta.variable_storage_width[col]
            data_type = meta.variable_to_label.get(col, None)
            if data_type in ['$', 'BEST', 'MMDDYY']: 
                data_type = None
            if global_meta.get(var_name_map[col]) and col not in ['ID', 'VERSION']:
                # storage_type
                assert global_meta[var_name_map[col]]['storage_type'] == storage_type
                # Set to largest storage width seen so far
                # width
                if global_meta[var_name_map[col]]['storage_width'] != storage_width:
                    global_meta[var_name_map[col]]['storage_width'] = max(global_meta[var_name_map[col]]['storage_width'], storage_width)
                # data_type
                assert global_meta[var_name_map[col]]['data_type'] == data_type or not data_type
            else:       
                global_meta[var_name_map[col]] = {'storage_type': storage_type, 'storage_width': storage_width, 'data_type': data_type, 'descript': descript}
global_meta['Visit'] = {'storage_type': None, 'storage_width': None, 'data_type': None, 'descript': 'Which visit this data was collected during'}

### Clean up user defined types (SAS value labels) and add to global metadata

Finish the construction of the global metadata map by adding in value-labels (SAS mechanism for defining categoricals and missing value flags)

The following sanity checks are applied:
* Get rid of all NaNs and double NaNs in dictionaries
* Confirm that if V01FOO & V02FOO collapse into FOO, then the datatypes are truly the same

In [9]:
# Grab the system-wide map of all value_labels to their user-defined type dict
value_labels = get_value_labels(data_dir + 'formats.sas7bcat')

# Create a list of value_labels associated with a double (those for data stored as strings seem fine)
vl_doubles = []
for name in value_labels.keys():
    for var, meta in global_meta.items():
        if meta['data_type'] == name and meta['storage_type'] == 'double':
            vl_doubles.append(name)
            break

# Clean out NaNs and double NaNs            
for name in vl_doubles:
    value_labels[name] = {('.' if isinstance(k, float) and math.isnan(k) else k):v for k,v in value_labels[name].items()}
    
# Swap out data_type name for actual dictionary from value_labels
# add in CategoricalDtype objects to be reused
for meta in global_meta.values():
    if meta['data_type'] and value_labels.get(meta['data_type']):
        meta['data_type'] = value_labels[meta['data_type']]
        meta['CategoricalDtype'] = pd.CategoricalDtype(meta['data_type'].values())

### Look at the scope of the name collapse

In [10]:
# How large is the name collapse?
tot_col_set = set()
for meta in files_meta.values():
    tot_col_set.update([n.upper() for n in meta.column_names])
print('Original number of unique variables: ' + str(len(tot_col_set)))
print('Collapsed numer of variables: ' + str(len(set(var_name_map.values()))))

Original number of unique variables: 11006
Collapsed numer of variables: 3368


## Create a single dataframe out of multiple datasets
Create a single dataframe for all variables across a given fileset.

This only sets the datatype for some columns.
* ID - set to unsigned int
* Visit, Version - set to Categorical
* columns where all values are SAS user-defined - set Categorical

This means that numeric columns with a mix of missing values and numbers are not automatically converted. Later functions allow you to investigate the data more closely, suggest conversions, and convert the columns as you best see fit.

In [11]:
def create_df(prefix):
    # read in data from each file and append data to master dataframe
    df_list = []
    for filename in get_data_file_names(all_files, prefix):
        tmp_df, _ = pyreadstat.read_file_multiprocessing(pyreadstat.read_sas7bdat, data_dir + filename,
                                                            num_processes=6, user_missing=True)
       
        # Normalize column names to uppercase (SAS is case insensitive and variable names are inconsistent across data files)
        # DON'T drop prefixes here because a some files have data for more than one visit and common 
        # variables would collapse (and Pandas will complain about dup column names with an obscure msg)
        tmp_df = tmp_df.rename(columns={c: c.upper() for c in tmp_df.columns})
        print(filename + '\tVar Cnt: ' + str(len(tmp_df.columns)))
            
        # What visits does this dataframe cover?
        inc_visits = collect_prefixes(tmp_df.columns)
        inc_visits.sort()
        print('Visits: ' + str(inc_visits))
        
        # Process all variables collected in the same visit at the same time
        for visit in inc_visits:
            visit_vars = ['ID', 'VERSION']  # variables that don't change between visits
            if 'SIDE' in tmp_df:
                visit_vars.append('SIDE')
            if 'KNEESIDE' in tmp_df:  # part of bone ancillary 
                visit_vars.append('KNEESIDE')            
            if 'HIPSIDE' in tmp_df:  # part of bone ancillary 
                visit_vars.append('HIPSIDE')                            
            if 'READPRJ' in tmp_df:
                visit_vars.append('READPRJ')
            visit_vars.extend([v for v in tmp_df.columns if v.startswith(visit)])
            tmp2_df = tmp_df[visit_vars]
            new_cols = {c: c.removeprefix(visit) for c in visit_vars}  # drop visit prefixes from variable names
            tmp2_df = tmp2_df.rename(columns=new_cols)
            # Categorical values must be set to the master set of all values for each column,
            # otherwise categorical lists don't match when concatenating and revert to strings
            for col in tmp2_df.columns:
                dt = global_meta[col]['data_type']
                if dt:
                    if isinstance(dt, dict):
                        # Need to swap out . for NaN
                        if global_meta[col]['storage_type'] == 'double':
                            tmp2_df[col].fillna('.', inplace=True)
                        tmp2_df[col] = tmp2_df[col].apply(lambda x: dt.get(x, x))
                        tmp2_df[col] = tmp2_df[col].astype(global_meta[col]['CategoricalDtype'])
                    else:
                        print('Unhandled data type: ', col, dt)
            tmp2_df.insert(1, 'Visit', visit)  # Mark which visit these variables are associated with
            tmp2_df = tmp2_df.copy(deep=True)
            # print(visit, tmp2_df.columns)
            df_list.append(tmp2_df)

    master_df = pd.concat(df_list, axis=0)
    master_df['ID'] = pd.to_numeric(master_df['ID'], downcast='unsigned')
    master_df = master_df.astype({col: 'category' for col in ['Visit', 'VERSION']})

    return master_df

## Dataset Inspection and Type Optimization Functions
A set of functions to examine the data being converted and help detect the columns that may need manual inspection before conversion.

create_df() only sets a few column types. These function looks for columns that likely need manual conversion.

In [12]:
#     Highlights which columns change between files
#
# Across a series of files data00.sas7bdat-data11.sas7bdat columns may be added or
# dropped based on the evolution of questions asked in OAI. Further, capitalization
# may change for the column names. This function highlights those changes.
# 
# Run before create_df() to see what is in the 'sas7bdat' files
def column_uniformity_check(prefix):
    tot_cnt = 0
    col_set = {}
    for filename in get_data_file_names(all_files, prefix):
        tmp_df, _ = pyreadstat.read_file_multiprocessing(pyreadstat.read_sas7bdat, data_dir + filename,
                                                         catalog_file=data_dir + 'formats.sas7bcat',
                                                         num_processes=6, user_missing=True)
        tot_cnt += tmp_df.shape[0]
        print('\n' + filename + ': '+ str(tmp_df.shape))
        
        col_names = remove_visit_prefixes(list(tmp_df.columns))
        if not col_set:
            print(col_names)
            col_set = set([c.upper() for c in col_names])
        # display any new or missing elements 
        elif col_set ^ set(col_names):
            # Weed out difference solely from SAS case insensitivity
            upper_col_names = set([c.upper() for c in col_names])
            if not col_set ^ upper_col_names:
                print('Names only differ by case')
            else:
                print(list(col_set ^ set(upper_col_names)))
                col_set = set(upper_col_names)

    print('\nTotal rows: ' + str(tot_cnt))

In [13]:
#   After create_df() has been called, gather_column_data_stats() looks at the contents of each column and outputs
# its findings in a dataframe. Indexed by the column names in the original dataset, this dataframe allows for
# easy lookup of the column label, numeric/string/na/date value counts, min and max numeric values, the presence
# of decimal places, and lists of unique missing values and strings present.
#
# Using the type data from gather_column_data_stats(), suggest_conversions() can suggest likely column types for each
# column. It outputs a dictionary that can be pasted into a cell (modified as desired) and then serve as input to 
# convert_columns().



#   Examine dataframe and return information about the data present in each column in a column indexed dataframe
# Two dataframes are returned. The first contains data summaries for all colummns that likely still need
# conversion. The second one containing data summaries for all columns already in categorical or uint format
# (presumably from create_df).
def gather_column_data_stats(df):
    conv_list, done_list = [], []

    for col in df.columns:
        label = global_meta[col]['descript']
        col_dtype = df[col].dtype
        na_cnt = df[col].isna().sum()

        # Type matches already converted column types
        if col_dtype in ['category', np.uint32, np.uint8]:
            done_list.append({'col': col, 'label': label, 'type': col_dtype, 'na_cnt': na_cnt})

        # 100% numeric, check for type ((un)signed int, float)
        elif col_dtype == float:
            num_type, num_cnt, uniq_num, max_num, min_num = examine_numeric_vals(df[col])
            add_entry(conv_list, col, label, num_type=num_type,
                      uniq_num=uniq_num, max_num=max_num, min_num=min_num, num_cnt=num_cnt, na_cnt=na_cnt)

        # object columns can be mixed types, gather stats on all type present
        elif col_dtype == object:
            uniq_strs, str_list, mv_list, numeric_str = None, None, None, None
            num_type, uniq_num, max_num, min_num = None, None, None, None
            mv_cnt, str_cnt, num_cnt, date_cnt, na_cnt = 0, 0, 0, 0, 0

            
            # look at data of each type in column
            col_types = list(df[col].apply(type).unique())   
            for data_type in col_types:
                col_subset = df[col][df[col].apply(lambda x: isinstance(x, data_type))] 
                if data_type == float or data_type == int:
                    num_type, num_cnt, uniq_num, max_num, min_num = examine_numeric_vals(col_subset)
                elif data_type == str:
                    str_cnt, uniq_strs, str_list, mv_cnt, mv_list, numeric_str = examine_str_vals(col_subset)
                elif data_type == datetime.date:
                    date_cnt = col_subset.shape[0]
                else:  # Unexpected datatype
                    print('{} contained unexpected data type: {}'.format(col, data_type))
                    
            # Add stats to entry for this data column
            add_entry(conv_list, col, label, uniq_strs=uniq_strs, str_list=str_list, mv_cnt=mv_cnt, mv_list=mv_list,
                      numeric_str=numeric_str, num_type=num_type, uniq_num=uniq_num, max_num=max_num, min_num=min_num, 
                      str_cnt=str_cnt, num_cnt=num_cnt, date_cnt=date_cnt, na_cnt=na_cnt)
        else:
            print('{} contained unexpected data types: {}'.format(col, list(df[col].apply(type).unique())))
        
    return pd.DataFrame(conv_list).set_index('col'), pd.DataFrame(done_list).set_index('col')


#   Using the output of gather_column_data_stats() display a dictionary suggesting what Pandas types each
# column should be converted to.
def suggest_conversions(df):
    print('targets = {')
    # Dates
    count_cols = ['str_cnt', 'num_cnt', 'date_cnt', 'na_cnt']
    subset = df[count_cols].apply(lambda x: x > 0)
    dates = df[~subset.str_cnt & ~subset.num_cnt & subset.date_cnt].index.to_list()
    dates.extend(df[subset.str_cnt & ~subset.num_cnt & subset.date_cnt & (df.uniq_strs == 0)].index.to_list())
    if dates:
        print('# Columns with only dates, missing, and NA values')
        print("'date': {},\n".format(dates))
    
    # Numeric
    numeric = df[subset.num_cnt & ~subset.date_cnt & (df.uniq_strs == 0)]
    unsigned = numeric[numeric.num_type == 'unsigned'].index.to_list()
    if unsigned:
        print('# Columns with only unsigned ints, missing, and NA values')
        print("'unsigned': {},\n".format(unsigned))

    signed = numeric[numeric.num_type == 'signed'].index.to_list()
    if signed:
        print('# Columns with only signed ints, missing, and NA values')
        print("'signed': {},\n".format(signed))

    floats = numeric[numeric.num_type == 'float'].index.to_list()
    if floats:
        print('# Columns with only floats, missing, and NA values')
        print("'float': {},\n".format(floats))

    # Strings
    strings = df[subset.str_cnt & ~subset.num_cnt & ~subset.date_cnt].index.to_list()
    if strings:
        print('# Columns with only strings, missing, and NA values')
        print("'cat': {},\n".format(strings))
    print('}\n')
    
    print('\nHandled columns: {}'.format(len(dates) + len(unsigned) + len(signed) + len(floats) + len(strings)))
    
    missing_cols = set(df.index.to_list()) - set(dates+unsigned+signed+floats+strings)
    if missing_cols:
        print('Unhandled columns: {}'.format(missing_cols))
        

#   Given a dictionary mapping types to lists of columns to convert to that type, along with the type information
# information dataframe, and the data itself, return a dataframe with the corresponding columns converted as well
# as a matching "shadow" dataframe of missing values.
#
# If no missing values are present anywhere in the dataset, an empty dataframe is returned
def convert_columns(targets, data_stats_df, df):
    df = df.copy(deep=True)
    
    # create the shadow datframe for missing values
    all_cols = [col for col_list in targets.values() for col in col_list]
    cols_w_missing = [col for col in all_cols if data_stats_df.loc[col, 'missing_val_cnt'] > 0]
    missing_df = df[cols_w_missing].applymap(lambda x: x if isinstance(x, str) else np.NaN)
    missing_df = missing_df.applymap(lambda x: default_missing_value_codes.get(x, x))
    missing_df = missing_df.astype('category')
    if not missing_df.empty:
        missing_df = pd.concat([df['ID'], df['Visit'], missing_df], axis=1)
#        missing_df['ID'] = df['ID']
#        missing_df['Visit'] = df['Visit']
        missing_df = missing_df.set_index(['ID', 'Visit'])
        
    for col_type, cols_to_conv in targets.items():
        na = np.NaN
        if col_type in ['unsigned', 'signed']:
            na = pd.NA
        
        # Now that missing values are copied to a shadow dataframe, remove them from the dataframe
        cols_w_missing = [col for col in cols_to_conv if data_stats_df.loc[col, 'missing_val_cnt'] > 0]
        replace_dict = {col: {val: na for val in data_stats_df.loc[col, 'missing_val_list']} for col in cols_w_missing}
        df[cols_w_missing] = df[cols_w_missing].replace(replace_dict)   
        
        if col_type in ['signed', 'unsigned']:
            cols_w_nas = [col for col in cols_to_conv if data_stats_df.loc[col, 'na_cnt'] > 0]
            df[cols_w_nas] = df[cols_w_nas].fillna(pd.NA) # Manual conversions needed due to presence of NaNs
            cols_w_nas = list(set(cols_w_nas + cols_w_missing))
            df[cols_w_nas] = df[cols_w_nas].astype('Int32')  # Int32 allows for pd.NA values
            
            df[cols_to_conv] = df[cols_to_conv].apply(pd.to_numeric, downcast=col_type)

        elif col_type == 'float':
            df[cols_to_conv] = df[cols_to_conv].apply(pd.to_numeric, downcast='float')
        elif col_type == 'cat':
            df[cols_to_conv] = df[cols_to_conv].astype('category')
        elif col_type == 'date':
            df[cols_to_conv] = df[cols_to_conv].astype('datetime64[ns]')
   
    return df.set_index(['ID', 'Visit']), missing_df


# Helper functions

#   Check for numeric type ((un)signed int, float. NaN), min, max, unique cnt 
def examine_numeric_vals(col):
    num_type = 'float'
    min_val, max_val = col.min(), col.max()
    unique_cnt = len(col.unique())

    # Does it really need to be a float?
    if all(col.apply(type) == int) or col[~col.isna()].apply(float.is_integer).all():  # ignoring NaNs, are the rest integers?
        num_type = 'unsigned'
        if min_val < 0:
            num_type = 'signed'

    num_cnt = col.shape[0] - col.isna().sum()
    # Are they all NA?
    if num_cnt == 0:
        num_type = 'na'
            
    return num_type, num_cnt, unique_cnt, max_val, min_val


#   Get a list of unique strs, and whether strings are numbers
def examine_str_vals(col):
    missing_vals = set([k for k in default_missing_value_codes.keys()])
    uniques = set(col.unique()) - missing_vals
    missing_vals = missing_vals & set(col.unique())
    unique_cnt = len(uniques)
    
    # Are they all strings written as strings?
    numeric_str = False
    if not pd.to_numeric(col, errors='coerce').isna().any():
        numeric_str = True
    return col.shape[0], len(uniques), uniques, len(missing_vals), missing_vals, numeric_str


#    Add column data to list
# - convenience function to reduce argument count
def add_entry(conv_list, col_name, label, uniq_strs=0, str_list=None, mv_cnt=0, mv_list=None, numeric_str=None,
              num_type=None, uniq_num=None, max_num=None, min_num=None, str_cnt=0, num_cnt=0, date_cnt=0, na_cnt=0):
    conv_list.append({'col': col_name, 'label': label, 'uniq_strs': uniq_strs, 'str_list': str_list, 
                      'missing_val_cnt': mv_cnt, 'missing_val_list': mv_list, 'numeric_str': numeric_str,
                      'num_type': num_type, 'uniq_num': uniq_num, 'max_num': max_num, 'min_num': min_num, 
                      'str_cnt': str_cnt, 'num_cnt': num_cnt, 'date_cnt': date_cnt, 'na_cnt': na_cnt})

In [14]:
#         Column data stats summary functions - called after gather_column_data_stats()

#   Print a dataframe wide summary of the datatypes found in the analysis
# arguments - data dataframe, data summary dataframe
def data_stats_summary(df, data_stats_df):
    print('Already defined cols: {} \tCols to convert: {}\t Total col cnt: {}'.format(df.shape[1] - data_stats_df.shape[0], data_stats_df.shape[0], df.shape[1]))
    print('\nColumn types to convert:\n{}'.format(column_types_present(data_stats_df)))
    print('\nNumeric types of columns:\n{}'.format(data_stats_df['num_type'].value_counts()))
    print('\nLargest number of unique strings: {}'.format(data_stats_df['uniq_strs'].max()))
    #print('\nHistogram of different NA count sizes:\n{}'.format(data_stats_df['na_cnt'].value_counts()))

    
#    Print a table of the combinations of data types found across columns in the analysis
# arguments - data summary dataframe
def column_types_present(df):
    count_cols = ['str_cnt', 'num_cnt', 'date_cnt', 'na_cnt']
    subset = df[count_cols]
    subset = subset.apply(lambda x: x > 0)
    return subset.groupby(count_cols).size().reset_index().rename(columns={0:'count'})

In [15]:
# Inspect values and types
                
#   Show string columns data to look for patterns
def show_string_col_stats(stats_df):
    count_cols = ['str_cnt', 'num_cnt', 'date_cnt', 'na_cnt']
    subset = stats_df[count_cols].apply(lambda x: x > 0)
    return stats_df[subset.str_cnt & ~subset.num_cnt & ~subset.date_cnt & subset.na_cnt]                


#    Check the converted dataframe for wrong and suspcious things
#  List which columns aren't categoricals though create_df should have made them so
#  List which columns have NA/NaN values though they weren't expected to
def sanity_check(df):
    # Confirm all categorical columns expected to be categorical are
    for col in df:
        if col not in ['Visit', 'VERSION'] and  global_meta[col]['data_type']:
            if not pdtypes.is_categorical_dtype(df[col]):
                print('Failure to make column categorical: ' + col)
            if df[col].isna().all():
                print('All NaN in categorical col: ' + col)
    if df.select_dtypes(include='object').columns.to_list():
        print('Columns still object type: ', df.select_dtypes(include='object').columns.to_list())         

# Dataset Conversions
Processed in alphabetical order

## Biospec_fnih_joco_demographics
TODO: Not handled yet as column naming format doesn't use visit prefixes

In [16]:
prefix = 'Biospec_fnih_joco_demographics'
column_uniformity_check(prefix)


Biospec_fnih_joco_demographics.sas7bdat: (129, 8)
['LabCorp_Accession_ID', 'SpecID', 'Timepoint', 'Age', 'BMI', 'Race', 'Gender', 'VERSION']

Total rows: 129


## Clinical_fnih

In [17]:
prefix = 'Clinical_fnih'
column_uniformity_check(prefix)


Clinical_fnih.sas7bdat: (600, 48)
['ID', 'SIDE', 'VERSION', 'CASE', 'GROUPTYPE', 'JSPAINPRG', 'JSONLYPRG', 'PAINONLYPRG', 'NONPRG', 'XRJSM', 'XRKL', 'XRJSL', 'XRJSM', 'XRKL', 'XRJSL', 'XRJSM', 'XRKL', 'XRJSL', 'XRJSM', 'XRKL', 'XRJSL', 'XRJSM', 'XRKL', 'XRJSL', 'MCMJSW', 'MCMJSW', 'MCMJSW', 'MCMJSW', 'MCMJSW', 'KPMEDCV', 'BMI', 'AGE', 'SEX', 'HISP', 'COHORT', 'RACE', 'WOMKP', 'WOMKP', 'WOMKP', 'WOMKP', 'WOMKP', 'WOMADL', 'WOMADL', 'WOMADL', 'WOMADL', 'WOMADL', 'JSPRG', 'PAINPRG']

Total rows: 600


In [18]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

Clinical_fnih.sas7bdat	Var Cnt: 48
Visits: ['P01', 'P02', 'V00', 'V01', 'V03', 'V05', 'V06']
(4200, 17)

Starting dataframe size: 0.92MB


In [19]:
for col in tmp_df.select_dtypes(include='category'):
    if not all(isinstance(cat, str) for cat in tmp_df[col].cat.categories):
        print(col, tmp_df[col].cat.categories)
    if tmp_df[col].isna().sum():
        print(col)

KPMEDCV
SEX
HISP
RACE
XRKL
COHORT


In [20]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [21]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 10 	Cols to convert: 7	 Total col cnt: 17

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0    False     True     False    True      1
1     True     True     False   False      6

Numeric types of columns:
num_type
float       6
unsigned    1
Name: count, dtype: int64

Largest number of unique strings: 0


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
BMI,BMI,0,{},1,{O},False,float,192,46.7,18.6,1,599,0,0
XRJSM,BL(BU): joint space narrowing (OARSI grades 0-...,0,{},2,"{A, P}",False,float,14,3.0,0.0,90,2894,0,0
XRJSL,BL (BU): joint space narrowing (OARSI grades 0...,0,{},2,"{A, P}",False,float,5,2.0,0.0,90,2894,0,0
MCMJSW,BL: reading (JD): medial minimum JSW [mm],0,{},3,"{A, P, T}",False,float,1375,8.255,0.0,87,2897,0,0
AGE,,0,,0,,,unsigned,36,79.0,45.0,0,600,0,3600
WOMKP,BL: WOMAC Pain Score,0,{},2,"{M, I}",False,float,27,20.0,0.0,2,2985,0,0
WOMADL,BL: WOMAC Disability Score,0,{},2,"{M, I}",False,float,158,68.0,0.0,9,2978,0,0


In [22]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['AGE'],

# Columns with only floats, missing, and NA values
'float': ['BMI', 'XRJSM', 'XRJSL', 'MCMJSW', 'WOMKP', 'WOMADL'],

}


Handled columns: 7


In [23]:
targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['AGE'],

# Columns with only floats, missing, and NA values
'float': ['BMI', 'XRJSM', 'XRJSL', 'MCMJSW', 'WOMKP', 'WOMADL'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [24]:
# Clean up the side var and make an index
new_df['SIDE'] = new_df['SIDE'].cat.rename_categories({'1: Right': 'RIGHT', '2: Left': 'LEFT'})
new_df.set_index('SIDE', append=True, inplace=True)

In [25]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)


VERSION    category
KPMEDCV    category
BMI         float32
SEX        category
HISP       category
RACE       category
XRJSM       float32
XRKL       category
XRJSL       float32
MCMJSW      float32
AGE           UInt8
COHORT     category
WOMKP       float32
WOMADL      float32
dtype: object

Missing values present, shadow dataframe created.
               BMI XRJSM XRJSL MCMJSW WOMKP WOMADL
ID      Visit                                     
9001695 P01    NaN   NaN   NaN    NaN   NaN    NaN
9002116 P01    NaN   NaN   NaN    NaN   NaN    NaN
9002430 P01    NaN   NaN   NaN    NaN   NaN    NaN
9002817 P01    NaN   NaN   NaN    NaN   NaN    NaN
9003316 P01    NaN   NaN   NaN    NaN   NaN    NaN
...            ...   ...   ...    ...   ...    ...
9993833 V06    NaN   NaN   NaN    NaN   NaN    NaN
9994408 V06    NaN   NaN   NaN    NaN   NaN    NaN
9995338 V06    NaN   NaN   NaN    NaN   NaN    NaN
9996098 V06    NaN   NaN   NaN    NaN   NaN    NaN
9997381 V06    NaN   NaN   NaN    NaN   Na

In [26]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 0.19MB
Shadow dataframe size: 0.05MB


In [27]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## acceldatabyday

In [28]:
prefix = 'acceldatabyday'
column_uniformity_check(prefix)  # Becasue there is more than one dataset file


acceldatabyday06.sas7bdat: (13040, 26)
['ID', 'VERSION', 'PAStudyDay', 'VDaySequence', 'PAMonth', 'PAWeekDay', 'DAYModMinT', 'DAYModMinF', 'DAYModMinS', 'DAYVigMinT', 'DAYVigMinF', 'DAYVigMinS', 'DAYMVMinT', 'DAYMVMinF', 'DAYMVMinS', 'DAYCnt', 'DAYLtMinT', 'DAYLtMinF', 'DAYLtMinS', 'DAYMVBoutMinT', 'DAYMVBoutMinF', 'DAYMVBoutMinS', 'DAYVBoutMinT', 'DAYVBoutMinF', 'DAYVBoutMinS', 'WearHr']

acceldatabyday08.sas7bdat: (9399, 26)
Names only differ by case

Total rows: 22439


In [29]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

acceldatabyday06.sas7bdat	Var Cnt: 26
Visits: ['V06']
acceldatabyday08.sas7bdat	Var Cnt: 26
Visits: ['V08']
(22439, 27)

Starting dataframe size: 5.61MB


In [30]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [31]:
data_stats_summary(tmp_df, data_stats_df)

data_stats_df

Already defined cols: 3 	Cols to convert: 24	 Total col cnt: 27

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0    False     True     False   False     23
1     True    False     False   False      1

Numeric types of columns:
num_type
unsigned    22
float        1
Name: count, dtype: int64

Largest number of unique strings: 7


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
PASTUDYDAY,,0,,0,,,unsigned,1278.0,1427.0,28.0,0,22439,0,0
VDAYSEQUENCE,,0,,0,,,unsigned,7.0,7.0,1.0,0,22439,0,0
PAMONTH,,0,,0,,,unsigned,12.0,12.0,1.0,0,22439,0,0
PAWEEKDAY,,7,"{Saturday, Friday, Wednesday, Thursday, Sunday...",0,{},False,,,,,22439,0,0,0
DAYMODMINT,,0,,0,,,unsigned,189.0,253.0,0.0,0,22439,0,0
DAYMODMINF,,0,,0,,,unsigned,194.0,269.0,0.0,0,22439,0,0
DAYMODMINS,,0,,0,,,unsigned,446.0,570.0,0.0,0,22439,0,0
DAYVIGMINT,,0,,0,,,unsigned,83.0,122.0,0.0,0,22439,0,0
DAYVIGMINF,,0,,0,,,unsigned,82.0,124.0,0.0,0,22439,0,0
DAYVIGMINS,,0,,0,,,unsigned,97.0,129.0,0.0,0,22439,0,0


In [32]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['PASTUDYDAY', 'VDAYSEQUENCE', 'PAMONTH', 'DAYMODMINT', 'DAYMODMINF', 'DAYMODMINS', 'DAYVIGMINT', 'DAYVIGMINF', 'DAYVIGMINS', 'DAYMVMINT', 'DAYMVMINF', 'DAYMVMINS', 'DAYCNT', 'DAYLTMINT', 'DAYLTMINF', 'DAYLTMINS', 'DAYMVBOUTMINT', 'DAYMVBOUTMINF', 'DAYMVBOUTMINS', 'DAYVBOUTMINT', 'DAYVBOUTMINF', 'DAYVBOUTMINS'],

# Columns with only floats, missing, and NA values
'float': ['WEARHR'],

# Columns with only strings, missing, and NA values
'cat': ['PAWEEKDAY'],

}


Handled columns: 24


In [33]:
targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['PASTUDYDAY', 'VDAYSEQUENCE', 'PAMONTH', 'DAYMODMINT', 'DAYMODMINF', 'DAYMODMINS', 'DAYVIGMINT', 'DAYVIGMINF', 'DAYVIGMINS', 'DAYMVMINT', 'DAYMVMINF', 'DAYMVMINS', 'DAYCNT', 'DAYLTMINT', 'DAYLTMINF', 'DAYLTMINS', 'DAYMVBOUTMINT', 'DAYMVBOUTMINF', 'DAYMVBOUTMINS', 'DAYVBOUTMINT', 'DAYVBOUTMINF', 'DAYVBOUTMINS'],

# Columns with only floats, missing, and NA values
'float': ['WEARHR'],

# Columns with only strings, missing, and NA values
'cat': ['PAWEEKDAY'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [34]:
sanity_check(new_df)
print()
print(new_df.dtypes)


VERSION          category
PASTUDYDAY         uint16
VDAYSEQUENCE        uint8
PAMONTH             uint8
PAWEEKDAY        category
DAYMODMINT          uint8
DAYMODMINF         uint16
DAYMODMINS         uint16
DAYVIGMINT          uint8
DAYVIGMINF          uint8
DAYVIGMINS          uint8
DAYMVMINT           uint8
DAYMVMINF          uint16
DAYMVMINS          uint16
DAYCNT             uint32
DAYLTMINT          uint16
DAYLTMINF          uint16
DAYLTMINS          uint16
DAYMVBOUTMINT       uint8
DAYMVBOUTMINF       uint8
DAYMVBOUTMINS      uint16
DAYVBOUTMINT        uint8
DAYVBOUTMINF        uint8
DAYVBOUTMINS        uint8
WEARHR            float32
dtype: object


In [35]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 0.98MB
Shadow dataframe size: 0.17MB


In [36]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## acceldatabymin

In [37]:
prefix = 'acceldatabymin'
column_uniformity_check(prefix)


acceldatabymin06.sas7bdat: (20629910, 8)
['ID', 'VERSION', 'PAStudyDay', 'MinSequence', 'SuspectMinute', 'MINCnt', 'PAWeekDay', 'PAMonth']

acceldatabymin08.sas7bdat: (14878346, 8)
Names only differ by case

Total rows: 35508256


In [38]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

acceldatabymin06.sas7bdat	Var Cnt: 8
Visits: ['V06']
acceldatabymin08.sas7bdat	Var Cnt: 8
Visits: ['V08']
(35508256, 9)

Starting dataframe size: 4000.94MB


In [39]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [40]:
data_stats_summary(tmp_df, data_stats_df)

data_stats_df

Already defined cols: 3 	Cols to convert: 6	 Total col cnt: 9

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0    False     True     False   False      4
1    False     True     False    True      1
2     True    False     False   False      1

Numeric types of columns:
num_type
unsigned    5
Name: count, dtype: int64

Largest number of unique strings: 7


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
PASTUDYDAY,,0,,0,,,unsigned,1290.0,1427.0,28.0,0,35508256,0,0
MINSEQUENCE,,0,,0,,,unsigned,1440.0,1440.0,1.0,0,35508256,0,0
SUSPECTMINUTE,,0,,0,,,unsigned,2.0,1.0,0.0,0,35508256,0,0
MINCNT,,0,,0,,,unsigned,10328.0,20758.0,0.0,0,35508165,0,91
PAWEEKDAY,,7,"{Saturday, Friday, Wednesday, Thursday, Sunday...",0,{},False,,,,,35508256,0,0,0
PAMONTH,,0,,0,,,unsigned,12.0,12.0,1.0,0,35508256,0,0


In [41]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['PASTUDYDAY', 'MINSEQUENCE', 'SUSPECTMINUTE', 'MINCNT', 'PAMONTH'],

# Columns with only strings, missing, and NA values
'cat': ['PAWEEKDAY'],

}


Handled columns: 6


In [42]:
targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['PASTUDYDAY', 'MINSEQUENCE', 'SUSPECTMINUTE', 'MINCNT', 'PAMONTH'],

# Columns with only strings, missing, and NA values
'cat': ['PAWEEKDAY'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [43]:
new_df['SUSPECTMINUTE'] = new_df['SUSPECTMINUTE'].astype('bool')

TODO: SUSPECTMINUTE should be a bool

In [44]:
sanity_check(new_df)
print()
print(new_df.dtypes)

if not missing_df.empty:
    print('Missing values present, shadow dataframe created.')
    print(missing_df)


VERSION          category
PASTUDYDAY         uint16
MINSEQUENCE        uint16
SUSPECTMINUTE        bool
MINCNT             UInt16
PAWEEKDAY        category
PAMONTH             uint8
dtype: object


In [45]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('\nShadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 474.14MB

Shadow dataframe size: 270.91MB


In [46]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## accelerometry

In [47]:
prefix = 'accelerometry'
column_uniformity_check(prefix)


accelerometry06.sas7bdat: (2712, 27)
['ID', 'VERSION', 'AAVMNT', 'AAMVBMF', 'ADHHS8', 'ADHHSD8', 'AAMVBMT', 'AAMVMNS', 'AAVBMS', 'AAMVBMS', 'AAMVMNT', 'AACSM03', 'AAMVMNF', 'AMPA1', 'AAVMNF', 'AACNT', 'AALTMNF', 'AAMDMNF', 'AAVBMT', 'APASTAT', 'AALTMNT', 'AAMDMNT', 'AAVBMF', 'ANVDAYS', 'AALTMNS', 'AAMDMNS', 'AAVMNS']

accelerometry08.sas7bdat: (1797, 27)

Total rows: 4509


In [48]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

accelerometry06.sas7bdat	Var Cnt: 27
Visits: ['V06']
accelerometry08.sas7bdat	Var Cnt: 27
Visits: ['V08']
(4509, 28)

Starting dataframe size: 4.06MB


In [49]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [50]:
data_stats_summary(tmp_df, data_stats_df)

data_stats_df

Already defined cols: 5 	Cols to convert: 23	 Total col cnt: 28

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0     True    False     False   False      1
1     True     True     False   False     22

Numeric types of columns:
num_type
float       20
unsigned     2
Name: count, dtype: int64

Largest number of unique strings: 5


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
AAVMNT,Accelerometry: average daily minutes of vigoro...,0,{},1,{A},False,float,159.0,95.42857,0.0,1188,3321,0,0
AAMVBMF,Accelerometry: average daily bout minutes of m...,0,{},1,{A},False,float,524.0,134.4286,0.0,1188,3321,0,0
AAMVBMT,Accelerometry: average daily bout minutes of m...,0,{},1,{A},False,float,503.0,133.0,0.0,1188,3321,0,0
AAMVMNS,Accelerometry: average daily minutes of modera...,0,{},1,{A},False,float,1761.0,416.8333,0.0,1188,3321,0,0
AAVBMS,Accelerometry: average daily bout minutes of v...,0,{},1,{A},False,float,178.0,112.0,0.0,1188,3321,0,0
AAMVBMS,Accelerometry: average daily bout minutes of m...,0,{},1,{A},False,float,1213.0,399.0,0.0,1188,3321,0,0
AAMVMNT,Accelerometry: average daily minutes of modera...,0,{},1,{A},False,float,755.0,162.8333,0.0,1188,3321,0,0
AACSM03,Accelerometry: ACSM 2003 physical activity gui...,0,{},1,{A},False,float,44.0,1.0,0.0,1188,3321,0,0
AAMVMNF,Accelerometry: average daily minutes of modera...,0,{},1,{A},False,float,776.0,175.3333,0.0,1188,3321,0,0
AMPA1,Accelerometry: month of the 1st valid day of p...,0,{},1,{A},False,unsigned,12.0,12.0,1.0,1188,3321,0,0


In [51]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['AMPA1', 'ANVDAYS'],

# Columns with only floats, missing, and NA values
'float': ['AAVMNT', 'AAMVBMF', 'AAMVBMT', 'AAMVMNS', 'AAVBMS', 'AAMVBMS', 'AAMVMNT', 'AACSM03', 'AAMVMNF', 'AAVMNF', 'AACNT', 'AALTMNF', 'AAMDMNF', 'AAVBMT', 'AALTMNT', 'AAMDMNT', 'AAVBMF', 'AALTMNS', 'AAMDMNS', 'AAVMNS'],

# Columns with only strings, missing, and NA values
'cat': ['APASTAT'],

}


Handled columns: 23


In [52]:
targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['AMPA1', 'ANVDAYS'],

# Columns with only floats, missing, and NA values
'float': ['AAVMNT', 'AAMVBMF', 'AAMVBMT', 'AAMVMNS', 'AAVBMS', 'AAMVBMS', 'AAMVMNT', 'AACSM03', 'AAMVMNF', 'AAVMNF', 'AACNT', 'AALTMNF', 'AAMDMNF', 'AAVBMT', 'AALTMNT', 'AAMDMNT', 'AAVBMF', 'AALTMNS', 'AAMDMNS', 'AAVMNS'],

# Columns with only strings, missing, and NA values
'cat': ['APASTAT'],

}
new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [53]:
sanity_check(new_df)
print()
print(new_df.dtypes)

if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)


VERSION    category
AAVMNT      float32
AAMVBMF     float32
ADHHS8     category
ADHHSD8    category
AAMVBMT     float32
AAMVMNS     float32
AAVBMS      float32
AAMVBMS     float32
AAMVMNT     float32
AACSM03     float32
AAMVMNF     float32
AMPA1         UInt8
AAVMNF      float32
AACNT       float64
AALTMNF     float32
AAMDMNF     float32
AAVBMT      float32
APASTAT    category
AALTMNT     float32
AAMDMNT     float32
AAVBMF      float32
ANVDAYS       UInt8
AALTMNS     float32
AAMDMNS     float32
AAVMNS      float32
dtype: object

Missing values present, shadow dataframe created.
              AMPA1 ANVDAYS AAVMNT AAMVBMF AAMVBMT AAMVMNS AAVBMS AAMVBMS  \
ID      Visit                                                               
9000099 V06     NaN     NaN    NaN     NaN     NaN     NaN    NaN     NaN   
9001695 V06     NaN     NaN    NaN     NaN     NaN     NaN    NaN     NaN   
9001897 V06     NaN     NaN    NaN     NaN     NaN     NaN    NaN     NaN   
9002116 V06     NaN     NaN  

In [54]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 0.47MB
Shadow dataframe size: 0.17MB


In [55]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## allclinical

In [56]:
prefix = 'allclinical'
column_uniformity_check(prefix)


allclinical00.sas7bdat: (4796, 1187)
['ID', 'VERSION', 'BLDCOLL', 'BLDHRS1', 'BLDHRS2', 'BLDRAW1', 'BLDRAW2', 'BLSURD1', 'BLSURD2', 'CITRATE', 'EDTA', 'excess1', 'excess2', 'hemat1', 'hemat2', 'hoursp1', 'hoursp2', 'hrsuc1', 'hrsuc2', 'illpwk1', 'illpwk2', 'LEAKAG1', 'LEAKAG2', 'MRSEQNL', 'MRSEQNR', 'MULTST1', 'MULTST2', 'othvp1', 'othvp2', 'pdate1', 'pdate2', 'PLAQHR1', 'PLAQHR2', 'qovp1', 'qovp2', 'SEAQHR1', 'SEAQHR2', 'SERUM', 'ucdate1', 'ucdate2', 'URINHR1', 'URINHR2', 'URINOB1', 'URINOB2', 'URNCOLL', 'URSURD1', 'URSURD2', 'vcoll1', 'vcoll2', 'vein1', 'vein2', 'void1', 'void2', 'KPN', 'KPNREV', 'KPNREVY', 'KPNR12', 'KPNR12M', 'KPNLEV', 'KPNLEVY', 'KPNL12', 'KPNL12M', 'KPACT30', 'HPNR12', 'HPNRIL', 'HPNROL', 'HPNRFL', 'HPNRB', 'HPNRLB', 'HPNRDK', 'HPNL12', 'HPNLIL', 'HPNLOL', 'HPNLFL', 'HPNLB', 'HPNLLB', 'HPNLDK', 'BP30', 'BP30OFT', 'BPBAD', 'BPUB', 'BPMB', 'BPLB', 'BPB', 'BPDK', 'OJPNRS', 'OJPNLS', 'OJPNRE', 'OJPNLE', 'OJPNRW', 'OJPNLW', 'OJPNRH', 'OJPNLH', 'OJPNRA', 'OJPNLA', 'OJ

In [57]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

allclinical00.sas7bdat	Var Cnt: 1187
Visits: ['P01', 'P02', 'V00']
allclinical01.sas7bdat	Var Cnt: 575
Visits: ['V01']
allclinical02.sas7bdat	Var Cnt: 229
Visits: ['V02']
allclinical03.sas7bdat	Var Cnt: 779
Visits: ['V03']
allclinical04.sas7bdat	Var Cnt: 229
Visits: ['V04']
allclinical05.sas7bdat	Var Cnt: 605
Visits: ['V05']
allclinical06.sas7bdat	Var Cnt: 903
Visits: ['V06']
allclinical07.sas7bdat	Var Cnt: 257
Visits: ['V07']
allclinical08.sas7bdat	Var Cnt: 703
Visits: ['V08']
allclinical09.sas7bdat	Var Cnt: 259
Visits: ['V09']
allclinical10.sas7bdat	Var Cnt: 1095
Visits: ['V10']
allclinical11.sas7bdat	Var Cnt: 195
Visits: ['V11']
(58334, 1851)

Starting dataframe size: 753.43MB


Note the compression from 7016 variables to 1851.

In [58]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [59]:
data_stats_summary(tmp_df, data_stats_df)

data_stats_df

Already defined cols: 1503 	Cols to convert: 348	 Total col cnt: 1851

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0    False    False      True   False      6
1    False     True     False    True      2
2     True    False     False   False     22
3     True    False      True   False      1
4     True     True     False   False    317

Numeric types of columns:
num_type
float       190
unsigned    122
na           29
signed        7
Name: count, dtype: int64

Largest number of unique strings: 187.0


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
KPNR12M,"SV:Q15ai.Right knee pain, aching or stiffness:...",0.0,{},3,"{A, M, D}",False,unsigned,13,12.0,1.0,28211,12873,0,0
KPNL12M,"SV:Q18ai.Left knee pain, aching or stiffness: ...",0.0,{},3,"{A, M, D}",False,unsigned,14,12.0,0.0,28711,12373,0,0
TMJE30D,"SV:Q45bi.TMJ: jaw joint or in front of ear, ho...",0.0,{},2,"{A, D}",False,unsigned,25,30.0,1.0,15880,1163,0,0
TMJE30A,"SV:Q45biii.TMJ: jaw joint or in front of ear, ...",0.0,{},2,"{A, D}",False,unsigned,16,30.0,0.0,15876,1167,0,0
TMJF30D,"SV:Q46bi.TMJ: across face or cheek, how many d...",0.0,{},1,{A},False,unsigned,24,30.0,1.0,16544,499,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AC2AR3,"FU SAQ:Q44.Exercise: (2) Activity code, top 3 ...",0.0,{},1,{N},False,unsigned,38,37.0,1.0,1468,2145,0,0
AC3AR3,"FU SAQ:Q44.Exercise: (3) Activity code, top 3 ...",0.0,{},1,{N},False,unsigned,39,38.0,1.0,1714,1899,0,0
AC1AR4,"FU SAQ:Q47.Exercise: (1) Activity code, top 3 ...",0.0,{},1,{N},False,unsigned,38,37.0,1.0,1239,2374,0,0
AC2AR4,"FU SAQ:Q47.Exercise: (2) Activity code, top 3 ...",0.0,{},1,{A},False,unsigned,36,37.0,1.0,1460,2153,0,0


In [60]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only dates, missing, and NA values
'date': ['SVDATE', 'DATE', 'EVDATE', 'PSDATE', 'SSDATE', 'FVDATE', 'ISEXMDT'],

# Columns with only unsigned ints, missing, and NA values
'unsigned': ['KPNR12M', 'KPNL12M', 'TMJE30D', 'TMJE30A', 'TMJF30D', 'TMJF30A', 'BPTOT', 'BPDAYCV', 'BPBEDCV', 'KPACDCV', 'INJR1', 'INJR2', 'INJR3', 'KRSRA', 'ARTR1', 'ARTR2', 'ARTR3', 'MENR1', 'MENR2', 'MENR3', 'LRR1', 'LRR2', 'OTSR1', 'OTSR2', 'OTSR3', 'INJL1', 'INJL2', 'INJL3', 'KRSLA', 'ARTL1', 'ARTL2', 'ARTL3', 'MENL1', 'MENL2', 'MENL3', 'LRL1', 'OTSL1', 'OTSL2', 'OTSL3', 'OV1AGE', 'OV2AGE', 'HYSAGE', 'BLDHRS1', 'BLDHRS2', 'BLSURD1', 'BLSURD2', 'HOURSP1', 'HOURSP2', 'PDATE1', 'PDATE2', 'PLAQHR1', 'PLAQHR2', 'SEAQHR1', 'SEAQHR2', 'UCDATE1', 'UCDATE2', 'URINHR1', 'URINHR2', 'URSURD1', 'URSURD2', 'WOMSTFR', 'WOMSTFL', 'HIPFXAG', 'SPNFXAG', 'SMKAGE', 'SMKAVE', 'SMKAMT', 'SMKSTOP', 'PIPEAGE', 'PIPEAMT', 'PIPSTOP', 'BISPYRS', 'RX30NUM', 'COMORB', 'SMKPKYR', 'PSMKYR', 'CESD', 'NWARNS', 'NNOSE

In [61]:
targets = {
# Columns with only dates, missing, and NA values
'date': ['SVDATE', 'DATE', 'EVDATE', 'PSDATE', 'SSDATE', 'FVDATE', 'ISEXMDT'],

# Columns with only unsigned ints, missing, and NA values
'unsigned': ['KPNR12M', 'KPNL12M', 'TMJE30D', 'TMJE30A', 'TMJF30D', 'TMJF30A', 'BPTOT', 'BPDAYCV', 'BPBEDCV', 'KPACDCV', 'INJR1', 'INJR2', 'INJR3', 'KRSRA', 'ARTR1', 'ARTR2', 'ARTR3', 'MENR1', 'MENR2', 'MENR3', 'LRR1', 'LRR2', 'OTSR1', 'OTSR2', 'OTSR3', 'INJL1', 'INJL2', 'INJL3', 'KRSLA', 'ARTL1', 'ARTL2', 'ARTL3', 'MENL1', 'MENL2', 'MENL3', 'LRL1', 'OTSL1', 'OTSL2', 'OTSL3', 'OV1AGE', 'OV2AGE', 'HYSAGE', 'BLDHRS1', 'BLDHRS2', 'BLSURD1', 'BLSURD2', 'HOURSP1', 'HOURSP2', 'PDATE1', 'PDATE2', 'PLAQHR1', 'PLAQHR2', 'SEAQHR1', 'SEAQHR2', 'UCDATE1', 'UCDATE2', 'URINHR1', 'URINHR2', 'URSURD1', 'URSURD2', 'WOMSTFR', 'WOMSTFL', 'HIPFXAG', 'SPNFXAG', 'SMKAGE', 'SMKAVE', 'SMKAMT', 'SMKSTOP', 'PIPEAGE', 'PIPEAMT', 'PIPSTOP', 'BISPYRS', 'RX30NUM', 'COMORB', 'SMKPKYR', 'PSMKYR', 'CESD', 'NWARNS', 'NNOSERV', 'NSKIP', 'NERRORS', 'BPSYS', 'BPDIAS', 'RPAVG', 'STEPST1', 'STEPST2', 'HRB4WLK', 'NUMSTOP', 'HR400WK', 'LLWGT', 'RLWGT', '400MTR', 'AGE', 'HOURWK', 'MISSWK', 'PASE', 'WEEKWK', 'WKHR7CV', 'BLUPMN1', 'BLUPMN2', 'PRRDDYS', 'URUPMN2', 'VISDYS', 'AMPA1', 'ANVDAYS', 'SF12BP', 'SF12PF', 'SF12VT', 'SF12SF', 'WTLSYR', 'AC1AR1', 'AC2AR1', 'AC3AR1', 'AC1AR2', 'AC2AR2', 'AC3AR2', 'AC1AR3', 'AC2AR3', 'AC3AR3', 'AC1AR4', 'AC2AR4', 'AC3AR4'],

# Columns with only signed ints, missing, and NA values
'signed': ['RKFHDEG', 'LKFHDEG', 'RKALNMT', 'LKALNMT', 'DFBCOLL', 'DFUCOLL', 'URUPMN1'],

# Columns with only floats, missing, and NA values
'float': ['HEIGHT', 'WEIGHT', 'BMI', 'HRSUC1', 'HRSUC2', 'HSPSS', 'HSMSS', 'WOMKPR', 'KOOSKPR', 'KOOSYMR', 'WOMADLR', 'WOMKPL', 'KOOSKPL', 'KOOSYML', 'WOMADLL', 'KOOSFSR', 'KOOSQOL', 'WOMTSL', 'WOMTSR', 'HT25MM', 'WT25KG', 'WTMAXKG', 'WTMINKG', 'DTDFIB', 'SUPVITD', 'FIBVGFR', 'SUPB12', 'DTCAFFN', 'SRVFAT', 'DTAIU', 'DTCHOL', 'PCTCOL1', 'DTPHOS', 'DTVITC', 'DTB1', 'PCTXLS', 'SUPB2', 'PCTCOL9', 'DTVITK', 'DTRET', 'SUPVITE', 'SUPNIAC', 'DTANZN', 'DTLUT', 'BAPFAT', 'PCTCARB', 'PCTSWT', 'DTACAR', 'SUPCA', 'SRVGRN', 'SRVFRT', 'SUPFOL', 'DTBCAR', 'DTPROT', 'DTPOTA', 'DTSFAT', 'SUPVITC', 'DTOLEC', 'SUPBCAR', 'DTKCAL', 'BAPPROT', 'BAPCARB', 'SUPVITA', 'SUPB6', 'NFDSDAY', 'DTNIAC', 'FIBBEAN', 'DTNA', 'DTARE', 'DTLYC', 'DTFAT', 'PCTSMAL', 'SUPFE', 'SUPCU', 'DTB12', 'DTGEN', 'DTMETH', 'SUPZINC', 'SRVVEG', 'DTCALC', 'SUPMG', 'DTDAID', 'SRVMEAT', 'DTFE', 'FIBGRN', 'DTCYST', 'DTSF', 'PCTPROT', 'SRVDRY', 'DTB6', 'SUPB1', 'SUPSE', 'DTMG', 'PCTFAT', 'DTVITD', 'DTPROA', 'DTCARB', 'PCTALCH', 'DTFOL', 'DTLIN', 'PCTLARG', 'DTVITE', 'DTCRYP', 'PCTMEDS', 'DTRIBO', 'DTZINC', 'CSTIME1', 'CSTIME2', 'RLLGTH', 'RLBACK', 'RLARM', 'RLHORIZ', 'RLVERT', 'LLLGTH', 'LLBACK', 'LLARM', 'LLHORIZ', 'LLVERT', 'TIMET1', 'TIMET2', 'ABCIRC', 'CSPACE', '20MPACE', 'RFTLPL', 'RFTHPL', 'RFTLRL', 'RFTHRL', 'RETLPL', 'RETHPL', 'RETLRL', 'RETHRL', 'LFTLPL', 'LFTHPL', 'LFTLRL', 'LFTHRL', 'LETLPL', 'LETHPL', 'LETLRL', 'LETHRL', '400MTIM', 'RFSFR', 'LESFR', 'LFSFP', 'RESFP', 'RESFR', 'LFSFR', 'RFSFP', 'LESFP', 'ICPTSKL', 'ICPTSKR', 'IPSKL', 'CPSKR', 'IPSKR', 'CPSKL', 'LLDILSM', 'LLDIFST', 'LLDILST', 'LLDIFSS', 'LLDILSI', 'LLDIFSP', 'AAVMNT', 'AAMVBMF', 'AAMVBMT', 'AAMVMNS', 'AAVBMS', 'AAMVBMS', 'AAMVMNT', 'AACSM03', 'AAMVMNF', 'AAVMNF', 'AACNT', 'AALTMNF', 'AAMDMNF', 'AAVBMT', 'AALTMNT', 'AAMDMNT', 'AAVBMF', 'AALTMNS', 'AAMDMNS', 'AAVMNS', 'IPSHL', 'IPSHR', 'ICPTSHR', 'CPSHL', 'ICPTSHL', 'CPSHR', 'SF12RP', 'SF12RE', 'SF12GH', 'SF12MH'],

# Columns with only strings, missing, and NA values
'cat': ['LRR3', 'LRL2', 'LRL3', 'STFID2', 'STFID1', 'HESTFID', 'SVXRRID', 'BPSTFID', 'RPSTFID', 'ACSTFID', 'SCSTFID', 'RCSTFID', 'W2STFID', 'W4STFID', 'K1STFID', 'ISSTFID', 'WPSTFID', 'K5STFID', 'KPSTFID', 'APASTAT', 'HVSTFID', 'WLC1IL'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [62]:
sanity_check(new_df)
print()
print(new_df.dtypes)

if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)


VERSION    category
KPNREV     category
KPNREVY    category
KPNR12     category
KPNR12M       UInt8
             ...   
ACT37D     category
ACTNAA     category
ACTNAB     category
ACTNAC     category
ACTNAD     category
Length: 1849, dtype: object

Missing values present, shadow dataframe created.
              ISEXMDT           KPNR12M           KPNL12M           TMJE30D  \
ID      Visit                                                                 
9000099 P01       NaN               NaN               NaN  .A: Not Expected   
9000296 P01       NaN  .A: Not Expected  .A: Not Expected  .A: Not Expected   
9000622 P01       NaN               NaN  .A: Not Expected  .A: Not Expected   
9000798 P01       NaN  .A: Not Expected               NaN  .A: Not Expected   
9001104 P01       NaN  .A: Not Expected  .A: Not Expected  .A: Not Expected   
...               ...               ...               ...               ...   
9999365 V11       NaN               NaN               NaN           

In [63]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 154.42MB
Shadow dataframe size: 18.72MB


In [64]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

Column LRR3 is marked as categorical but is all NaN. Will be stored as float.
Column LRL2 is marked as categorical but is all NaN. Will be stored as float.
Column LRL3 is marked as categorical but is all NaN. Will be stored as float.


## biomarkers

In [65]:
prefix = 'biomarkers'
column_uniformity_check(prefix)


biomarkers00.sas7bdat: (4796, 53)
['ID', 'VERSION', 'BLDCOLL', 'BLDHRS1', 'BLDHRS2', 'BLDRAW1', 'BLDRAW2', 'BLSURD1', 'BLSURD2', 'CITRATE', 'EDTA', 'excess1', 'excess2', 'hemat1', 'hemat2', 'hoursp1', 'hoursp2', 'hrsuc1', 'hrsuc2', 'illpwk1', 'illpwk2', 'LEAKAG1', 'LEAKAG2', 'MRSEQNL', 'MRSEQNR', 'MULTST1', 'MULTST2', 'othvp1', 'othvp2', 'pdate1', 'pdate2', 'PLAQHR1', 'PLAQHR2', 'qovp1', 'qovp2', 'SEAQHR1', 'SEAQHR2', 'SERUM', 'ucdate1', 'ucdate2', 'URINHR1', 'URINHR2', 'URINOB1', 'URINOB2', 'URNCOLL', 'URSURD1', 'URSURD2', 'vcoll1', 'vcoll2', 'vein1', 'vein2', 'void1', 'void2']

biomarkers01.sas7bdat: (4796, 62)
['PRRDDYS', 'DFUCOLL', 'SRGSTAT', 'BLUPMN1', 'PAXRNA', 'URUPMN1', 'URUPMN2', 'BLUPMN2', 'DFBCOLL']

biomarkers02.sas7bdat: (288, 60)
['PRRDDYS', 'PAXRNA']

biomarkers03.sas7bdat: (4796, 61)
['MRKSIDE']

biomarkers04.sas7bdat: (494, 60)
['MRKSIDE']

biomarkers05.sas7bdat: (4796, 60)

biomarkers06.sas7bdat: (4796, 62)
['LMPHCT', 'PAXRNA']

biomarkers08.sas7bdat: (4796, 61)
['PA

In [66]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

biomarkers00.sas7bdat	Var Cnt: 53
Visits: ['V00']
biomarkers01.sas7bdat	Var Cnt: 62
Visits: ['V01']
biomarkers02.sas7bdat	Var Cnt: 60
Visits: ['V02']
biomarkers03.sas7bdat	Var Cnt: 61
Visits: ['V03']
biomarkers04.sas7bdat	Var Cnt: 60
Visits: ['V04']
biomarkers05.sas7bdat	Var Cnt: 60
Visits: ['V05']
biomarkers06.sas7bdat	Var Cnt: 62
Visits: ['V06']
biomarkers08.sas7bdat	Var Cnt: 61
Visits: ['V08']
biomarkers10.sas7bdat	Var Cnt: 5
Visits: ['V10']
(34354, 65)

Starting dataframe size: 39.68MB


In [67]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [68]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 38 	Cols to convert: 27	 Total col cnt: 65

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0     True     True     False   False     27

Numeric types of columns:
num_type
unsigned    22
signed       3
float        2
Name: count, dtype: int64

Largest number of unique strings: 0


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
BLDHRS1,EV:Phlebotomy: time venipuncture completed (fi...,0,{},2,"{A, M}",False,unsigned,427,63300.0,24000.0,2118,24629,0,0
BLDHRS2,EV:Phlebotomy: time venipuncture completed (re...,0,{},1,{A},False,unsigned,80,48000.0,24600.0,26634,113,0,0
BLSURD1,EV:Phlebotomy: days between most recent surger...,0,{},3,"{A, M, D}",False,unsigned,537,984.0,0.0,23341,3406,0,0
BLSURD2,EV:Phlebotomy: days between most recent surger...,0,{},1,{A},False,unsigned,15,264.0,36.0,26732,15,0,0
HOURSP1,,0,{},3,"{A, M, W}",False,unsigned,26,24.0,0.0,2146,24601,0,0
HOURSP2,,0,{},2,"{A, M}",False,unsigned,13,16.0,2.0,26637,110,0,0
HRSUC1,,0,{},3,"{A, M, W}",False,float,240,24.0,0.1,2224,24492,0,0
HRSUC2,,0,{},2,"{A, M}",False,float,45,16.5,2.4,26676,71,0,0
PDATE1,,0,{},1,{A},False,unsigned,2218,19313.0,16124.0,2113,24634,0,0
PDATE2,,0,{},1,{A},False,unsigned,109,19046.0,16154.0,26634,113,0,0


In [69]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['BLDHRS1', 'BLDHRS2', 'BLSURD1', 'BLSURD2', 'HOURSP1', 'HOURSP2', 'PDATE1', 'PDATE2', 'PLAQHR1', 'PLAQHR2', 'SEAQHR1', 'SEAQHR2', 'UCDATE1', 'UCDATE2', 'URINHR1', 'URINHR2', 'URSURD1', 'URSURD2', 'BLUPMN1', 'BLUPMN2', 'PRRDDYS', 'URUPMN2'],

# Columns with only signed ints, missing, and NA values
'signed': ['DFBCOLL', 'DFUCOLL', 'URUPMN1'],

# Columns with only floats, missing, and NA values
'float': ['HRSUC1', 'HRSUC2'],

}


Handled columns: 27


In [70]:
targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['BLDHRS1', 'BLDHRS2', 'BLSURD1', 'BLSURD2', 'HOURSP1', 'HOURSP2', 'PDATE1', 'PDATE2', 'PLAQHR1', 'PLAQHR2', 'SEAQHR1', 'SEAQHR2', 'UCDATE1', 'UCDATE2', 'URINHR1', 'URINHR2', 'URSURD1', 'URSURD2', 'BLUPMN1', 'BLUPMN2', 'PRRDDYS', 'URUPMN2'],

# Columns with only signed ints, missing, and NA values
'signed': ['DFBCOLL', 'DFUCOLL', 'URUPMN1'],

# Columns with only floats, missing, and NA values
'float': ['HRSUC1', 'HRSUC2'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [71]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)


VERSION    category
BLDCOLL    category
BLDHRS1      UInt16
BLDHRS2      UInt16
BLDRAW1    category
             ...   
SRGSTAT    category
URUPMN1       Int16
URUPMN2      UInt16
MRKSIDE    category
LMPHCT     category
Length: 63, dtype: object

Missing values present, shadow dataframe created.
              BLDHRS1           BLDHRS2           BLSURD1           BLSURD2  \
ID      Visit                                                                 
9000099 V00       NaN  .A: Not Expected  .A: Not Expected  .A: Not Expected   
9000296 V00       NaN  .A: Not Expected  .A: Not Expected  .A: Not Expected   
9000622 V00       NaN  .A: Not Expected  .A: Not Expected  .A: Not Expected   
9000798 V00       NaN  .A: Not Expected  .A: Not Expected  .A: Not Expected   
9001104 V00       NaN  .A: Not Expected  .A: Not Expected  .A: Not Expected   
...               ...               ...               ...               ...   
9999365 V10       NaN               NaN               NaN             

In [72]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 4.30MB
Shadow dataframe size: 1.10MB


In [73]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## biospec_fnih_joco_assays
TODO: Not handled yet as column naming format doesn't use visit prefixes

In [74]:
prefix = 'biospec_fnih_joco_assays'
column_uniformity_check(prefix)


biospec_fnih_joco_assays.sas7bdat: (129, 181)
['SpecID', 'VERSION', 'Serum_C1_2C_lc', 'Serum_C2C_lc', 'Serum_COLL2_1_NO2_lc', 'Serum_CPII_lc', 'Serum_CS846_lc', 'Serum_CTXI_lc', 'Serum_Comp_lc', 'Serum_HA_lc', 'Serum_MMP_3_lc', 'Serum_NTXI_lc', 'Serum_PIIANP_lc', 'Urine_CTXII_lc', 'Urine_C1_2C_lc', 'Urine_C2C_lc', 'Urine_Creatinine_lc', 'Urine_NTXI_lc', 'Urine_alpha_lc', 'Urine_beta_lc', 'Serum_C1_2C_NUM', 'Serum_C2C_NUM', 'Serum_CPII_NUM', 'Serum_PIIANP_NUM', 'Serum_COLL2_1_NO2_NUM', 'Serum_CS846_NUM', 'Serum_CTXI_NUM', 'Serum_Comp_NUM', 'Serum_HA_NUM', 'Serum_MMP_3_NUM', 'Serum_NTXI_NUM', 'Urine_CTXII_NUM', 'Urine_C1_2C_NUM', 'Urine_C2C_NUM', 'Urine_Creatinine_NUM', 'Urine_NTXI_NUM', 'Urine_alpha_NUM', 'Urine_beta_NUM', 'Urine_Col21N2_NUM', 'Urine_CTXII_NUMCA', 'Urine_C1_2C_NUMCA', 'Urine_C2C_NUMCA', 'Urine_NTXI_NUMCA', 'Urine_alpha_NUMCA', 'Urine_beta_NUMCA', 'Urine_Col21N2_NUMCA', 'Serum_C1_2C_ALTNUM', 'Serum_C2C_ALTNUM', 'Serum_COLL2_1_NO2_ALTNUM', 'Serum_CPII_ALTNUM', 'Serum_CS8

## biospec_fnih_labcorp

In [75]:
prefix = 'biospec_fnih_labcorp'
column_uniformity_check(prefix)


biospec_fnih_labcorp00.sas7bdat: (600, 187)
['ID', 'READPRJ', 'VERSION', 'Serum_C1_2C_lc', 'Serum_C2C_lc', 'Serum_COLL2_1_NO2_lc', 'Serum_CPII_lc', 'Serum_CS846_lc', 'Serum_CTXI_lc', 'Serum_Comp_lc', 'Serum_HA_lc', 'Serum_MMP_3_lc', 'Serum_NTXI_lc', 'Serum_PIIANP_lc', 'Urine_CTXII_lc', 'Urine_C1_2C_lc', 'Urine_C2C_lc', 'Urine_Creatinine_lc', 'Urine_NTXI_lc', 'Urine_alpha_lc', 'Urine_beta_lc', 'Serum_C1_2C_NUM', 'Serum_C2C_NUM', 'Serum_CPII_NUM', 'Serum_PIIANP_NUM', 'Serum_COLL2_1_NO2_NUM', 'Serum_CS846_NUM', 'Serum_CTXI_NUM', 'Serum_Comp_NUM', 'Serum_HA_NUM', 'Serum_MMP_3_NUM', 'Serum_NTXI_NUM', 'Urine_CTXII_NUM', 'Urine_C1_2C_NUM', 'Urine_C2C_NUM', 'Urine_Creatinine_NUM', 'Urine_NTXI_NUM', 'Urine_alpha_NUM', 'Urine_beta_NUM', 'Urine_Col21N2_NUM', 'Urine_CTXII_NUMCA', 'Urine_C1_2C_NUMCA', 'Urine_C2C_NUMCA', 'Urine_NTXI_NUMCA', 'Urine_alpha_NUMCA', 'Urine_beta_NUMCA', 'Urine_Col21N2_NUMCA', 'Serum_C1_2C_ALTNUM', 'Serum_C2C_ALTNUM', 'Serum_COLL2_1_NO2_ALTNUM', 'Serum_CPII_ALTNUM', 'Seru

In [76]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

biospec_fnih_labcorp00.sas7bdat	Var Cnt: 187
Visits: ['V00']
biospec_fnih_labcorp01.sas7bdat	Var Cnt: 187
Visits: ['V01']
biospec_fnih_labcorp03.sas7bdat	Var Cnt: 187
Visits: ['V03']
(1786, 188)

Starting dataframe size: 13.47MB


In [77]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [78]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 3 	Cols to convert: 185	 Total col cnt: 188

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0    False     True     False   False      1
1    False     True     False    True     70
2     True    False     False   False    114

Numeric types of columns:
num_type
unsigned    36
float       35
Name: count, dtype: int64

Largest number of unique strings: 1758


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
READPRJ,Project,1,{22},0,{},True,,,,,1786,0,0,0
SERUM_C1_2C_LC,,93,"{, 0.59, 1.39, 1.34, 0.07, 0.28, 0.45, 0.21, 0...",0,{},False,,,,,1786,0,0,0
SERUM_C2C_LC,,256,"{, 257, 246, 122, 262, 198, 296, 205, 344, 174...",0,{},False,,,,,1786,0,0,0
SERUM_COLL2_1_NO2_LC,,1758,"{, 5.7743, 11.4612, 13.6369, 10.0114, 7.6235, ...",0,{},False,,,,,1786,0,0,0
SERUM_CPII_LC,,967,"{, 663, 623, 737, 813, 869, 1308, 3051, 565, 1...",0,{},False,,,,,1786,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
URINE_C2C_PLATE_ID,,48,"{, uC2C-HUSA 19Aug2013-2, uC2C-HUSA 29Aug2013-...",0,{},False,,,,,1786,0,0,0
URINE_CREATININE_PLATE_ID,,49,"{uCreatinine 25Oct2013-2, , uCreatinine 24Oct2...",0,{},False,,,,,1786,0,0,0
URINE_NTXI_PLATE_ID,,47,"{, uNTX-I 26Aug2013-3, uNTX-I 22Aug2013-3, uNT...",0,{},False,,,,,1786,0,0,0
URINE_ALPHA_PLATE_ID,,48,"{, uCTX-Ia 14Aug2013-4, uCTX-Ia 15Aug2013-2, u...",0,{},False,,,,,1786,0,0,0


In [79]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['SERUM_C2C_NUM', 'SERUM_CPII_NUM', 'SERUM_PIIANP_NUM', 'SERUM_CS846_NUM', 'SERUM_COMP_NUM', 'SERUM_HA_NUM', 'SERUM_NTXI_NUM', 'URINE_C2C_NUM', 'URINE_NTXI_NUM', 'SERUM_C2C_ALTNUM', 'SERUM_CPII_ALTNUM', 'SERUM_CS846_ALTNUM', 'SERUM_COMP_ALTNUM', 'SERUM_HA_ALTNUM', 'SERUM_NTXI_ALTNUM', 'SERUM_PIIANP_ALTNUM', 'URINE_C2C_ALTNUM', 'URINE_NTXI_ALTNUM', 'SERUM_C1_2C_LOWLIM', 'SERUM_C2C_LOWLIM', 'SERUM_COLL2_1_NO2_LOWLIM', 'SERUM_CPII_LOWLIM', 'SERUM_CS846_LOWLIM', 'SERUM_CTXI_LOWLIM', 'SERUM_COMP_LOWLIM', 'SERUM_HA_LOWLIM', 'SERUM_MMP_3_LOWLIM', 'SERUM_NTXI_LOWLIM', 'SERUM_PIIANP_LOWLIM', 'URINE_CTXII_LOWLIM', 'URINE_C1_2C_LOWLIM', 'URINE_C2C_LOWLIM', 'URINE_NTXI_LOWLIM', 'URINE_ALPHA_LOWLIM', 'URINE_BETA_LOWLIM', 'URINE_COL21N2_LOWLIM'],

# Columns with only floats, missing, and NA values
'float': ['SERUM_C1_2C_NUM', 'SERUM_COLL2_1_NO2_NUM', 'SERUM_CTXI_NUM', 'SERUM_MMP_3_NUM', 'URINE_CTXII_NUM', 'URINE_C1_2C_

In [80]:
targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['SERUM_C2C_NUM', 'SERUM_CPII_NUM', 'SERUM_PIIANP_NUM', 'SERUM_CS846_NUM', 'SERUM_COMP_NUM', 'SERUM_HA_NUM', 'SERUM_NTXI_NUM', 'URINE_C2C_NUM', 'URINE_NTXI_NUM', 'SERUM_C2C_ALTNUM', 'SERUM_CPII_ALTNUM', 'SERUM_CS846_ALTNUM', 'SERUM_COMP_ALTNUM', 'SERUM_HA_ALTNUM', 'SERUM_NTXI_ALTNUM', 'SERUM_PIIANP_ALTNUM', 'URINE_C2C_ALTNUM', 'URINE_NTXI_ALTNUM', 'SERUM_C1_2C_LOWLIM', 'SERUM_C2C_LOWLIM', 'SERUM_COLL2_1_NO2_LOWLIM', 'SERUM_CPII_LOWLIM', 'SERUM_CS846_LOWLIM', 'SERUM_CTXI_LOWLIM', 'SERUM_COMP_LOWLIM', 'SERUM_HA_LOWLIM', 'SERUM_MMP_3_LOWLIM', 'SERUM_NTXI_LOWLIM', 'SERUM_PIIANP_LOWLIM', 'URINE_CTXII_LOWLIM', 'URINE_C1_2C_LOWLIM', 'URINE_C2C_LOWLIM', 'URINE_NTXI_LOWLIM', 'URINE_ALPHA_LOWLIM', 'URINE_BETA_LOWLIM', 'URINE_COL21N2_LOWLIM'],

# Columns with only floats, missing, and NA values
'float': ['SERUM_C1_2C_NUM', 'SERUM_COLL2_1_NO2_NUM', 'SERUM_CTXI_NUM', 'SERUM_MMP_3_NUM', 'URINE_CTXII_NUM', 'URINE_C1_2C_NUM', 'URINE_CREATININE_NUM', 'URINE_ALPHA_NUM', 'URINE_BETA_NUM', 'URINE_COL21N2_NUM', 'URINE_CTXII_NUMCA', 'URINE_C1_2C_NUMCA', 'URINE_C2C_NUMCA', 'URINE_NTXI_NUMCA', 'URINE_ALPHA_NUMCA', 'URINE_BETA_NUMCA', 'URINE_COL21N2_NUMCA', 'SERUM_C1_2C_ALTNUM', 'SERUM_COLL2_1_NO2_ALTNUM', 'SERUM_CTXI_ALTNUM', 'SERUM_MMP_3_ALTNUM', 'URINE_CTXII_ALTNUM', 'URINE_C1_2C_ALTNUM', 'URINE_ALPHA_ALTNUM', 'URINE_BETA_ALTNUM', 'URINE_COL21N2_ALTNUM', 'URINE_CTXII_ALTNUMCA', 'URINE_C1_2C_ALTNUMCA', 'URINE_C2C_ALTNUMCA', 'URINE_NTXI_ALTNUMCA', 'URINE_ALPHA_ALTNUMCA', 'URINE_BETA_ALTNUMCA', 'URINE_COL21N2_ALTNUMCA', 'URINE_COL21N2SD', 'URINE_COL21N2CV'],

# Columns with only strings, missing, and NA values
'cat': ['SERUM_C1_2C_LC', 'SERUM_C2C_LC', 'SERUM_COLL2_1_NO2_LC', 'SERUM_CPII_LC', 'SERUM_CS846_LC', 'SERUM_CTXI_LC', 'SERUM_COMP_LC', 'SERUM_HA_LC', 'SERUM_MMP_3_LC', 'SERUM_NTXI_LC', 'SERUM_PIIANP_LC', 'URINE_CTXII_LC', 'URINE_C1_2C_LC', 'URINE_C2C_LC', 'URINE_CREATININE_LC', 'URINE_NTXI_LC', 'URINE_ALPHA_LC', 'URINE_BETA_LC', 'SERUM_C1_2C_COMMENT', 'SERUM_C2C_COMMENT', 'SERUM_COLL2_1_NO2_COMMENT', 'SERUM_CPII_COMMENT', 'SERUM_CS846_COMMENT', 'SERUM_CTXI_COMMENT', 'SERUM_COMP_COMMENT', 'SERUM_HA_COMMENT', 'SERUM_MMP_3_COMMENT', 'SERUM_NTXI_COMMENT', 'SERUM_PIIANP_COMMENT', 'URINE_CTXII_COMMENT', 'URINE_C1_2C_COMMENT', 'URINE_C2C_COMMENT', 'URINE_CREATININE_COMMENT', 'URINE_NTXI_COMMENT', 'URINE_ALPHA_COMMENT', 'URINE_BETA_COMMENT', 'SERUM_C1_2C_HQC', 'SERUM_C2C_HQC', 'SERUM_COLL2_1_NO2_HQC', 'SERUM_CPII_HQC', 'SERUM_CS846_HQC', 'SERUM_CTXI_HQC', 'SERUM_COMP_HQC', 'SERUM_HA_HQC', 'SERUM_MMP_3_HQC', 'SERUM_NTXI_HQC', 'SERUM_PIIANP_HQC', 'URINE_CTXII_HQC', 'URINE_CREATININE_HQC', 'URINE_NTXI_HQC', 'URINE_ALPHA_HQC', 'URINE_BETA_HQC', 'SERUM_C1_2C_LQC', 'SERUM_C2C_LQC', 'SERUM_COLL2_1_NO2_LQC', 'SERUM_CPII_LQC', 'SERUM_CS846_LQC', 'SERUM_CTXI_LQC', 'SERUM_COMP_LQC', 'SERUM_HA_LQC', 'SERUM_MMP_3_LQC', 'SERUM_NTXI_LQC', 'SERUM_PIIANP_LQC', 'URINE_CTXII_LQC', 'URINE_C1__2C_LQC', 'URINE_C2C_LQC', 'URINE_CREATININE_LQC', 'URINE_NTXI_LQC', 'URINE_ALPHA_LQC', 'URINE_BETA_LQC', 'SERUM_C1_2C_MQC', 'SERUM_C2C_MQC', 'SERUM_CPII_MQC', 'SERUM_CS846_MQC', 'SERUM_MMP_3_MQC', 'URINE_C1__2C_MQC', 'URINE_C2C_MQC', 'SERUM_C1_2C_KIT_LOT_NUM', 'SERUM_C2C_KIT_LOT_NUM', 'SERUM_COLL2_1_NO2_KIT_LOT_NUM', 'SERUM_CPII_KIT_LOT_NUM', 'SERUM_CS846_KIT_LOT_NUM', 'SERUM_CTXI_KIT_LOT_NUM', 'SERUM_COMP_KIT_LOT_NUM', 'SERUM_HA_KIT_LOT_NUM', 'SERUM_MMP_3_KIT_LOT_NUM', 'SERUM_NTXI_KIT_LOT_NUM', 'SERUM_PIIANP_KIT_LOT_NUM', 'URINE_CTXII_KIT_LOT_NUM', 'URINE_C1__2C_KIT_LOT_NUM', 'URINE_C2C_KIT_LOT_NUM', 'URINE_CREATININE_KIT_LOT_NUM', 'URINE_NTXI_KIT_LOT_NUM', 'URINE_ALPHA_KIT_LOT_NUM', 'URINE_BETA_KIT_LOT_NUM', 'SERUM_C1_2C_PLATE_ID', 'SERUM_C2C_PLATE_ID', 'SERUM_COLL2_1_NO2_PLATE_ID', 'SERUM_CPII_PLATE_ID', 'SERUM_CS846_PLATE_ID', 'SERUM_CTXI_PLATE_ID', 'SERUM_COMP_PLATE_ID', 'SERUM_HA_PLATE_ID', 'SERUM_MMP_3_PLATE_ID', 'SERUM_NTXI_PLATE_ID', 'SERUM_PIIANP_PLATE_ID', 'URINE_CTXII_PLATE_ID', 'URINE_C1__2C_PLATE_ID', 'URINE_C2C_PLATE_ID', 'URINE_CREATININE_PLATE_ID', 'URINE_NTXI_PLATE_ID', 'URINE_ALPHA_PLATE_ID', 'URINE_BETA_PLATE_ID'],

}


new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [81]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)

Columns still object type:  ['READPRJ']

VERSION                      category
READPRJ                        object
SERUM_C1_2C_LC               category
SERUM_C2C_LC                 category
SERUM_COLL2_1_NO2_LC         category
                               ...   
URINE_C2C_PLATE_ID           category
URINE_CREATININE_PLATE_ID    category
URINE_NTXI_PLATE_ID          category
URINE_ALPHA_PLATE_ID         category
URINE_BETA_PLATE_ID          category
Length: 186, dtype: object


In [82]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 1.98MB
Shadow dataframe size: 0.01MB


In [83]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## boneancillarystudy

In [16]:
prefix = 'boneancillarystudy'
column_uniformity_check(prefix)


boneancillarystudy.sas7bdat: (1212, 50)
['ID', 'VERSION', 'READPRJ', 'KNEESIDE', 'KneeDXADt', 'MedialBMD', 'LateralBMD', 'BMDRatio', 'KneeDXASfware', 'TrabMRSeqDt', 'BVF', 'TrN', 'TrSp', 'TrTh', 'HIPSIDE', 'HipDXADt', 'NeckBMD', 'PDATE', 'PTH', 'VitD', 'KneeDXADt', 'MedialBMD', 'LateralBMD', 'BMDRatio', 'KneeDXASfware', 'TrabMRSeqDt', 'BVF', 'TrN', 'TrSp', 'TrTh', 'HipDXADt', 'NeckBMD', 'PDATE', 'PTH', 'VitD', 'KneeDXADt', 'MedialBMD', 'LateralBMD', 'BMDRatio', 'KneeDXASfware', 'TrabMRSeqDt', 'BVF', 'TrN', 'TrSp', 'TrTh', 'HipDXADt', 'NeckBMD', 'PDATE', 'PTH', 'VitD']

Total rows: 1212


In [18]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

boneancillarystudy.sas7bdat	Var Cnt: 50
Visits: ['V04', 'V05', 'V06']
(3636, 21)

Starting dataframe size: 1.06MB


In [19]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [20]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 6 	Cols to convert: 15	 Total col cnt: 21

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0    False    False      True   False      4
1    False     True     False    True     10
2     True    False     False   False      1

Numeric types of columns:
num_type
float       8
na          4
unsigned    2
Name: count, dtype: int64

Largest number of unique strings: 1.0


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
READPRJ,Project,1.0,{62},0,{},True,,,,,3636,0,0,0
KNEEDXADT,,,,0,,,na,1.0,,,0,0,2276,0
MEDIALBMD,,0.0,,0,,,float,2277.0,2.052863,0.506177,0,2276,0,1360
LATERALBMD,,0.0,,0,,,float,2277.0,1.596222,0.458337,0,2276,0,1360
BMDRATIO,,0.0,,0,,,float,2277.0,2.093117,0.541803,0,2276,0,1360
KNEEDXASFWARE,,0.0,,0,,,unsigned,3.0,1.0,0.0,0,2276,0,1360
TRABMRSEQDT,,,,0,,,na,1.0,,,0,0,1148,0
BVF,Bone Volume Fraction - avg of the 20 central s...,0.0,,0,,,float,256.0,0.537,0.005,0,1013,0,2623
TRN,,0.0,,0,,,float,720.0,1.908,0.048,0,1013,0,2623
TRSP,,0.0,,0,,,float,843.0,22.329,0.247,0,1013,0,2623


In [21]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only dates, missing, and NA values
'date': ['KNEEDXADT', 'TRABMRSEQDT', 'HIPDXADT', 'PDATE'],

# Columns with only unsigned ints, missing, and NA values
'unsigned': ['KNEEDXASFWARE', 'VITD'],

# Columns with only floats, missing, and NA values
'float': ['MEDIALBMD', 'LATERALBMD', 'BMDRATIO', 'BVF', 'TRN', 'TRSP', 'TRTH', 'NECKBMD'],

# Columns with only strings, missing, and NA values
'cat': ['READPRJ'],

}


Handled columns: 15


In [22]:
targets = {
# Columns with only dates, missing, and NA values
'date': ['KNEEDXADT', 'TRABMRSEQDT', 'HIPDXADT', 'PDATE'],

# Columns with only unsigned ints, missing, and NA values
'unsigned': ['KNEEDXASFWARE', 'VITD'],

# Columns with only floats, missing, and NA values
'float': ['MEDIALBMD', 'LATERALBMD', 'BMDRATIO', 'BVF', 'TRN', 'TRSP', 'TRTH', 'NECKBMD'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [23]:
# Clean up the side var and make an index
new_df['KNEESIDE'] = new_df['KNEESIDE'].cat.rename_categories({'1: Right': 'RIGHT', '2: Left': 'LEFT'})
new_df['HIPSIDE'] = new_df['HIPSIDE'].cat.rename_categories({'1: Right': 'RIGHT', '2: Left': 'LEFT'})
#new_df.set_index('SIDE', append=True, inplace=True)

In [24]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)

Columns still object type:  ['READPRJ']

VERSION                category
KNEESIDE               category
HIPSIDE                category
READPRJ                  object
KNEEDXADT        datetime64[ns]
MEDIALBMD               float32
LATERALBMD              float32
BMDRATIO                float32
KNEEDXASFWARE             UInt8
TRABMRSEQDT      datetime64[ns]
BVF                     float32
TRN                     float32
TRSP                    float32
TRTH                    float32
HIPDXADT         datetime64[ns]
NECKBMD                 float32
PDATE            datetime64[ns]
PTH                    category
VITD                      UInt8
dtype: object


In [25]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 0.49MB
Shadow dataframe size: 0.03MB


In [26]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## enrollees

In [93]:
prefix = 'enrollees'
tmp_df = create_df(prefix)
print(tmp_df.shape)
print(tmp_df.columns)

enrollees.sas7bdat	Var Cnt: 60
Visits: ['P02', 'V00', 'V01', 'V02', 'V03', 'V04', 'V05', 'V06', 'V08', 'V10']
(47960, 17)
Index(['ID', 'Visit', 'VERSION', 'HISP', 'RACE', 'SEX', 'CHRTHLF', 'COHORT',
       'IMAGESA', 'IMAGESB', 'IMAGESC', 'IMAGESD', 'IMAGESE', 'IMAGESF',
       'IMAGESG', 'SITE', 'HADINTV'],
      dtype='object')


In [94]:
# Are all variables associated with the same release VERSION?
tmp_df['VERSION'].value_counts()

VERSION
25    47960
Name: count, dtype: int64

From reading about the data, the enrollees data is best stored as two separate dataframes:
* One has basic data about the enrollees collected at either the IEI or EV. 
* The second tracks participation in different image groups by visit.

### Enrollee Data

In [95]:
enrollee_df = tmp_df[tmp_df['Visit'] == 'P02'].copy()
enrollee_df.dropna(how='all', axis='columns', inplace=True)  # Drop all columns for data that wasn't collected during visit P02
enrollee_df.drop('Visit', axis='columns', inplace=True)  # this information is independent of the visit

enrollee_df = enrollee_df.join(tmp_df[tmp_df['Visit'] == 'V00'][['ID', 'CHRTHLF', 'COHORT', 'SITE']].set_index('ID'), on='ID')
enrollee_df = enrollee_df.join(tmp_df[tmp_df['Visit'] == 'V01'][['ID', 'HADINTV']].set_index('ID'), on='ID')
enrollee_df = enrollee_df.astype({'SITE': 'category'})
enrollee_df.set_index('ID', inplace=True)

In [96]:
sanity_check(enrollee_df)
print()
print(enrollee_df.dtypes)


VERSION    category
HISP       category
RACE       category
SEX        category
CHRTHLF    category
COHORT     category
SITE       category
HADINTV    category
dtype: object


In [97]:
utils.write_parquet(enrollee_df, 'data/enrollees_values.parquet')
enrollees_df = None

### Image Group Participation

In [98]:
df_list = []
for v in ['V00', 'V01', 'V02', 'V03', 'V04', 'V05', 'V06', 'V08', 'V10']:
    image_groups_df = tmp_df[tmp_df['Visit'] == v].copy()
    image_groups_df.dropna(how='all', axis='columns', inplace=True)
    if v == 'V00':
        image_groups_df.drop(['CHRTHLF', 'COHORT', 'SITE'], axis='columns', inplace=True)
    if v == 'V01':
        image_groups_df.drop(['HADINTV'], axis='columns', inplace=True)
    df_list.append(image_groups_df)
image_groups_df = pd.concat(df_list, axis=0)
image_groups_df = image_groups_df.set_index(['ID', 'Visit'])

In [99]:
sanity_check(image_groups_df)
print()
print(image_groups_df.dtypes)


VERSION    category
IMAGESA    category
IMAGESB    category
IMAGESC    category
IMAGESD    category
IMAGESE    category
IMAGESF    category
IMAGESG    category
dtype: object


In [100]:
utils.write_parquet(image_groups_df, 'data/image_groups_values.parquet')
df_list = None
image_groups_df = None

## flxr_kneealign_cooke

In [101]:
prefix = 'flxr_kneealign_cooke'
column_uniformity_check(prefix)


flxr_kneealign_cooke01.sas7bdat: (2474, 6)
['ID', 'SIDE', 'READPRJ', 'VERSION', 'BARCDDC', 'HKANGLE']

flxr_kneealign_cooke03.sas7bdat: (2550, 6)

flxr_kneealign_cooke05.sas7bdat: (1822, 6)

flxr_kneealign_cooke06.sas7bdat: (264, 6)

Total rows: 7110


In [102]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

flxr_kneealign_cooke01.sas7bdat	Var Cnt: 6
Visits: ['V01']
flxr_kneealign_cooke03.sas7bdat	Var Cnt: 6
Visits: ['V03']
flxr_kneealign_cooke05.sas7bdat	Var Cnt: 6
Visits: ['V05']
flxr_kneealign_cooke06.sas7bdat	Var Cnt: 6
Visits: ['V06']
(7110, 7)

Starting dataframe size: 1.19MB


In [103]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [104]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 4 	Cols to convert: 3	 Total col cnt: 7

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0     True    False     False   False      2
1     True     True     False   False      1

Numeric types of columns:
num_type
float    1
Name: count, dtype: int64

Largest number of unique strings: 3556


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
READPRJ,Project,1,{60},0,{},True,,,,,7110,0,0,0
BARCDDC,FU kXR reading (DC): barcode of image analyzed,3556,"{016602059004, 016601803507, 016601981204, 016...",0,{},False,,,,,7110,0,0,0
HKANGLE,FU kXR reading (DC): limb alignment (mechanica...,0,{},3,"{I, P, T}",False,float,243.0,19.9,-16.1,138,6972,0,0


In [105]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only floats, missing, and NA values
'float': ['HKANGLE'],

# Columns with only strings, missing, and NA values
'cat': ['READPRJ', 'BARCDDC'],

}


Handled columns: 3


In [106]:
targets = {
# Columns with only floats, missing, and NA values
'float': ['HKANGLE'],

# Columns with only strings, missing, and NA values
'cat': ['BARCDDC'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [107]:
# Clean up the side var and make an index
new_df['SIDE'] = new_df['SIDE'].cat.rename_categories({'1: Right': 'RIGHT', '2: Left': 'LEFT'})
new_df.set_index('SIDE', append=True, inplace=True) #  Note that indices are unique yet, Prj 15 and Prj 37 contain repeats

In [108]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)

Columns still object type:  ['READPRJ']

VERSION    category
READPRJ      object
BARCDDC    category
HKANGLE     float32
dtype: object

Missing values present, shadow dataframe created.
              HKANGLE
ID      Visit        
9000099 V01       NaN
        V01       NaN
9000622 V01       NaN
        V01       NaN
9000798 V01       NaN
...               ...
9994408 V06       NaN
9995277 V06       NaN
        V06       NaN
9996284 V06       NaN
        V06       NaN

[7110 rows x 1 columns]


In [109]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 0.95MB
Shadow dataframe size: 0.14MB


In [110]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## flxr_kneealign_duryea

In [111]:
prefix = 'flxr_kneealign_duryea'
column_uniformity_check(prefix)


flxr_kneealign_duryea01.sas7bdat: (2864, 9)
['ID', 'SIDE', 'READPRJ', 'VERSION', 'HKANGJD', 'BRCDHJD', 'femlen', 'tiblen', 'appll']

flxr_kneealign_duryea03.sas7bdat: (2522, 9)
Names only differ by case

flxr_kneealign_duryea05.sas7bdat: (1828, 9)
Names only differ by case

flxr_kneealign_duryea06.sas7bdat: (352, 9)
Names only differ by case

Total rows: 7566


In [112]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

flxr_kneealign_duryea01.sas7bdat	Var Cnt: 9
Visits: ['V01']
flxr_kneealign_duryea03.sas7bdat	Var Cnt: 9
Visits: ['V03']
flxr_kneealign_duryea05.sas7bdat	Var Cnt: 9
Visits: ['V05']
flxr_kneealign_duryea06.sas7bdat	Var Cnt: 9
Visits: ['V06']
(7566, 10)

Starting dataframe size: 1.61MB


In [113]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [114]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 4 	Cols to convert: 6	 Total col cnt: 10

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0    False     True     False   False      1
1    False     True     False    True      1
2     True    False     False   False      2
3     True     True     False   False      2

Numeric types of columns:
num_type
float    4
Name: count, dtype: int64

Largest number of unique strings: 3783


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
READPRJ,Project,1,{32},0,{},True,,,,,7566,0,0,0
HKANGJD,FU flXR reading (JD): limb alignment (mechanic...,0,,0,,,float,5626.0,17.82,-17.205,0,7566,0,0
BRCDHJD,FU flXR reading (JD): barcode of image analyze...,3783,"{016602059004, 016601803507, 016601981204, 016...",0,{},True,,,,,7566,0,0,0
FEMLEN,,0,{},1,{T},False,float,6557.0,587.27041,362.583479,4,7562,0,0
TIBLEN,,0,{},1,{T},False,float,6535.0,507.906293,284.876981,4,7562,0,0
APPLL,,0,,0,,,float,7562.0,1074.030858,646.708919,0,7562,0,4


In [115]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only floats, missing, and NA values
'float': ['HKANGJD', 'FEMLEN', 'TIBLEN', 'APPLL'],

# Columns with only strings, missing, and NA values
'cat': ['READPRJ', 'BRCDHJD'],

}


Handled columns: 6


In [116]:
targets = {
# Columns with only floats, missing, and NA values
'float': ['HKANGJD', 'FEMLEN', 'TIBLEN', 'APPLL'],

# Columns with only strings, missing, and NA values
'cat': ['BRCDHJD'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [117]:
# Clean up the side var and make an index
new_df['SIDE'] = new_df['SIDE'].cat.rename_categories({'1: Right': 'RIGHT', '2: Left': 'LEFT'})
new_df.set_index('SIDE', append=True, inplace=True) #  Note that indices are unique yet, Prj 15 and Prj 37 contain repeats

In [118]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)

Columns still object type:  ['READPRJ']

VERSION    category
READPRJ      object
HKANGJD     float32
BRCDHJD    category
FEMLEN      float32
TIBLEN      float32
APPLL       float32
dtype: object

Missing values present, shadow dataframe created.
              FEMLEN TIBLEN
ID      Visit              
9000099 V01      NaN    NaN
        V01      NaN    NaN
9000622 V01      NaN    NaN
        V01      NaN    NaN
9000798 V01      NaN    NaN
...              ...    ...
9994408 V06      NaN    NaN
9995277 V06      NaN    NaN
        V06      NaN    NaN
9996284 V06      NaN    NaN
        V06      NaN    NaN

[7566 rows x 2 columns]


In [119]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 1.08MB
Shadow dataframe size: 0.15MB


In [120]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## kmri_fnih_boneshape_imorphics

In [121]:
prefix = 'kmri_fnih_boneshape_imorphics'
column_uniformity_check(prefix)


kmri_fnih_boneshape_imorphics00.sas7bdat: (600, 17)
['ID', 'SIDE', 'READPRJ', 'VERSION', 'MF_tAB', 'LF_tAB', 'MT_tAB', 'LT_tAB', 'MP_tAB', 'LP_tAB', 'notch', 'TrFLat', 'TrFMed', 'nFemurOAVector', 'nTibiaOAVector', 'nPatellaOAVector', 'BARCDIM']

kmri_fnih_boneshape_imorphics01.sas7bdat: (582, 17)
Names only differ by case

kmri_fnih_boneshape_imorphics03.sas7bdat: (600, 17)
Names only differ by case

Total rows: 1782


In [122]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

kmri_fnih_boneshape_imorphics00.sas7bdat	Var Cnt: 17
Visits: ['V00']
kmri_fnih_boneshape_imorphics01.sas7bdat	Var Cnt: 17
Visits: ['V01']
kmri_fnih_boneshape_imorphics03.sas7bdat	Var Cnt: 17
Visits: ['V03']
(1782, 18)

Starting dataframe size: 0.41MB


In [123]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [124]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 4 	Cols to convert: 14	 Total col cnt: 18

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0    False     True     False   False     12
1     True    False     False   False      2

Numeric types of columns:
num_type
float    12
Name: count, dtype: int64

Largest number of unique strings: 1782


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
READPRJ,Project,1,{22},0,{},True,,,,,1782,0,0,0
MF_TAB,,0,,0,,,float,1782.0,4004.77343,1620.142083,0,1782,0,0
LF_TAB,,0,,0,,,float,1782.0,2787.029706,1133.109803,0,1782,0,0
MT_TAB,,0,,0,,,float,1782.0,1875.392611,783.972223,0,1782,0,0
LT_TAB,,0,,0,,,float,1782.0,1541.036035,624.428716,0,1782,0,0
MP_TAB,,0,,0,,,float,1782.0,846.565926,150.706779,0,1782,0,0
LP_TAB,,0,,0,,,float,1782.0,1092.835159,198.621343,0,1782,0,0
NOTCH,,0,,0,,,float,1782.0,2311.196452,987.383041,0,1782,0,0
TRFLAT,,0,,0,,,float,1782.0,1879.94037,865.157296,0,1782,0,0
TRFMED,,0,,0,,,float,1782.0,1107.068212,474.336764,0,1782,0,0


In [125]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only floats, missing, and NA values
'float': ['MF_TAB', 'LF_TAB', 'MT_TAB', 'LT_TAB', 'MP_TAB', 'LP_TAB', 'NOTCH', 'TRFLAT', 'TRFMED', 'NFEMUROAVECTOR', 'NTIBIAOAVECTOR', 'NPATELLAOAVECTOR'],

# Columns with only strings, missing, and NA values
'cat': ['READPRJ', 'BARCDIM'],

}


Handled columns: 14


In [126]:
targets = {
# Columns with only floats, missing, and NA values
'float': ['MF_TAB', 'LF_TAB', 'MT_TAB', 'LT_TAB', 'MP_TAB', 'LP_TAB', 'NOTCH', 'TRFLAT', 'TRFMED', 'NFEMUROAVECTOR', 'NTIBIAOAVECTOR', 'NPATELLAOAVECTOR'],

# Columns with only strings, missing, and NA values
'cat': ['BARCDIM'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [127]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)

Columns still object type:  ['READPRJ']

VERSION             category
SIDE                category
READPRJ               object
MF_TAB               float32
LF_TAB               float32
MT_TAB               float32
LT_TAB               float32
MP_TAB               float32
LP_TAB               float32
NOTCH                float32
TRFLAT               float32
TRFMED               float32
NFEMUROAVECTOR       float32
NTIBIAOAVECTOR       float32
NPATELLAOAVECTOR     float32
BARCDIM             category
dtype: object


In [128]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 0.39MB
Shadow dataframe size: 0.01MB


In [129]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## kmri_fnih_qcart_Chondrometrics

In [130]:
prefix = 'kmri_fnih_qcart_Chondrometrics'
column_uniformity_check(prefix)


kmri_fnih_qcart_Chondrometrics00.sas7bdat: (600, 97)
['ID', 'VERSION', 'READPRJ', 'side', 'WMTVCL', 'WMTSBA', 'WMTVCN', 'WMTMTH', 'WMTACS', 'WMTPD', 'WMTCAAB', 'WMTMTC', 'WMTMAV', 'WMTCTS', 'WMTACV', 'CMTMAT', 'CMTMTH', 'EMTMTH', 'IMTMTH', 'AMTMTH', 'PMTMTH', 'CMTPD', 'EMTPD', 'IMTPD', 'AMTPD', 'PMTPD', 'BMFVCL', 'BMFSBA', 'BMFVCN', 'BMFMTH', 'BMFACS', 'BMFPD', 'BMFCAAB', 'BMFMTC', 'BMFMAV', 'BMFCTS', 'BMFACV', 'CBMFMAT', 'CBMFMTH', 'EBMFMTH', 'IBMFMTH', 'CBMFPD', 'EBMFPD', 'IBMFPD', 'WMTFVCL', 'WMTFVCN', 'WMTFMTH', 'WMTFMAV', 'BMTFMAT', 'BMTFMTH', 'WLTVCL', 'WLTSBA', 'WLTVCN', 'WLTMTH', 'WLTACS', 'WLTPD', 'WLTCAAB', 'WLTMTC', 'WLTMAV', 'WLTCTS', 'WLTACV', 'CLTMAT', 'CLTMTH', 'ELTMTH', 'ILTMTH', 'ALTMTH', 'PLTMTH', 'CLTPD', 'ELTPD', 'ILTPD', 'ALTPD', 'PLTPD', 'BLFVCL', 'BLFSBA', 'BLFVCN', 'BLFMTH', 'BLFACS', 'BLFPD', 'BLFCAAB', 'BLFMTC', 'BLFMAV', 'BLFCTS', 'BLFACV', 'CBLFMAT', 'CBLFMT', 'EBLFMT', 'IBLFMT', 'CBLFPD', 'EBLFPD', 'IBLFPD', 'WLTFVCL', 'WLTFVCN', 'WLTFMTH', 'WLTFMAV', 'BLT

In [131]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

kmri_fnih_qcart_Chondrometrics00.sas7bdat	Var Cnt: 97
Visits: ['V00']
kmri_fnih_qcart_Chondrometrics01.sas7bdat	Var Cnt: 97
Visits: ['V01']
kmri_fnih_qcart_Chondrometrics03.sas7bdat	Var Cnt: 97
Visits: ['V03']
(1781, 98)

Starting dataframe size: 1.50MB


In [132]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [133]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 4 	Cols to convert: 94	 Total col cnt: 98

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0    False     True     False   False     92
1     True    False     False   False      2

Numeric types of columns:
num_type
float    92
Name: count, dtype: int64

Largest number of unique strings: 1781


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
READPRJ,Project,1,{22},0,{},True,,,,,1781,0,0,0
WMTVCL,BL/FU kMRI reading (FE): volume of cartilage -...,0,,0,,,float,1663.0,4537.8329,714.4630,0,1781,0,0
WMTSBA,BL/FU kMRI reading (FE): total area of subchon...,0,,0,,,float,1744.0,16.8410,6.9163,0,1781,0,0
WMTVCN,BL/FU kMRI reading (FE): normalized cartilage ...,0,,0,,,float,1584.0,2.8168,0.8730,0,1781,0,0
WMTMTH,BL/FU kMRI reading (FE): mean cartilage thickn...,0,,0,,,float,1587.0,2.7499,0.7810,0,1781,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
WLTFMTH,BL/FU kMRI reading (FE): mean cartilage thickn...,0,,0,,,float,1533.0,6.1331,2.3679,0,1781,0,0
WLTFMAV,BL/FU kMRI reading (FE): maximum cartilage thi...,0,,0,,,float,1621.0,10.7896,3.5926,0,1781,0,0
BLTFMAT,BL/FU kMRI reading (FE): minimum cartilage thi...,0,,0,,,float,1608.0,7.1752,0.0000,0,1781,0,0
BLTFMTH,BL/FU kMRI reading (FE): mean cartilage thickn...,0,,0,,,float,1598.0,9.3280,3.0360,0,1781,0,0


In [134]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only floats, missing, and NA values
'float': ['WMTVCL', 'WMTSBA', 'WMTVCN', 'WMTMTH', 'WMTACS', 'WMTPD', 'WMTCAAB', 'WMTMTC', 'WMTMAV', 'WMTCTS', 'WMTACV', 'CMTMAT', 'CMTMTH', 'EMTMTH', 'IMTMTH', 'AMTMTH', 'PMTMTH', 'CMTPD', 'EMTPD', 'IMTPD', 'AMTPD', 'PMTPD', 'BMFVCL', 'BMFSBA', 'BMFVCN', 'BMFMTH', 'BMFACS', 'BMFPD', 'BMFCAAB', 'BMFMTC', 'BMFMAV', 'BMFCTS', 'BMFACV', 'CBMFMAT', 'CBMFMTH', 'EBMFMTH', 'IBMFMTH', 'CBMFPD', 'EBMFPD', 'IBMFPD', 'WMTFVCL', 'WMTFVCN', 'WMTFMTH', 'WMTFMAV', 'BMTFMAT', 'BMTFMTH', 'WLTVCL', 'WLTSBA', 'WLTVCN', 'WLTMTH', 'WLTACS', 'WLTPD', 'WLTCAAB', 'WLTMTC', 'WLTMAV', 'WLTCTS', 'WLTACV', 'CLTMAT', 'CLTMTH', 'ELTMTH', 'ILTMTH', 'ALTMTH', 'PLTMTH', 'CLTPD', 'ELTPD', 'ILTPD', 'ALTPD', 'PLTPD', 'BLFVCL', 'BLFSBA', 'BLFVCN', 'BLFMTH', 'BLFACS', 'BLFPD', 'BLFCAAB', 'BLFMTC', 'BLFMAV', 'BLFCTS', 'BLFACV', 'CBLFMAT', 'CBLFMT', 'EBLFMT', 'IBLFMT', 'CBLFPD', 'EBLFPD', 'IBLFPD', 'WLTFVCL', 'WLTFVCN', 'WLTFMTH', 'WLTFMAV', 'BLTFMAT', 'BLTFMTH'],

In [135]:
targets = {
# Columns with only floats, missing, and NA values
'float': ['WMTVCL', 'WMTSBA', 'WMTVCN', 'WMTMTH', 'WMTACS', 'WMTPD', 'WMTCAAB', 'WMTMTC', 'WMTMAV', 'WMTCTS', 'WMTACV', 'CMTMAT', 'CMTMTH', 'EMTMTH', 'IMTMTH', 'AMTMTH', 'PMTMTH', 'CMTPD', 'EMTPD', 'IMTPD', 'AMTPD', 'PMTPD', 'BMFVCL', 'BMFSBA', 'BMFVCN', 'BMFMTH', 'BMFACS', 'BMFPD', 'BMFCAAB', 'BMFMTC', 'BMFMAV', 'BMFCTS', 'BMFACV', 'CBMFMAT', 'CBMFMTH', 'EBMFMTH', 'IBMFMTH', 'CBMFPD', 'EBMFPD', 'IBMFPD', 'WMTFVCL', 'WMTFVCN', 'WMTFMTH', 'WMTFMAV', 'BMTFMAT', 'BMTFMTH', 'WLTVCL', 'WLTSBA', 'WLTVCN', 'WLTMTH', 'WLTACS', 'WLTPD', 'WLTCAAB', 'WLTMTC', 'WLTMAV', 'WLTCTS', 'WLTACV', 'CLTMAT', 'CLTMTH', 'ELTMTH', 'ILTMTH', 'ALTMTH', 'PLTMTH', 'CLTPD', 'ELTPD', 'ILTPD', 'ALTPD', 'PLTPD', 'BLFVCL', 'BLFSBA', 'BLFVCN', 'BLFMTH', 'BLFACS', 'BLFPD', 'BLFCAAB', 'BLFMTC', 'BLFMAV', 'BLFCTS', 'BLFACV', 'CBLFMAT', 'CBLFMT', 'EBLFMT', 'IBLFMT', 'CBLFPD', 'EBLFPD', 'IBLFPD', 'WLTFVCL', 'WLTFVCN', 'WLTFMTH', 'WLTFMAV', 'BLTFMAT', 'BLTFMTH'],

# Columns with only strings, missing, and NA values
'cat': ['BARCDFE'],

}


new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [136]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)

Columns still object type:  ['READPRJ']

VERSION    category
SIDE       category
READPRJ      object
WMTVCL      float32
WMTSBA      float32
             ...   
WLTFMTH     float32
WLTFMAV     float32
BLTFMAT     float32
BLTFMTH     float32
BARCDFE    category
Length: 96, dtype: object


In [137]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 0.94MB
Shadow dataframe size: 0.01MB


In [138]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## kmri_fnih_qcart_biomediq

In [139]:
prefix = 'kmri_fnih_qcart_biomediq'
column_uniformity_check(prefix)


kmri_fnih_qcart_biomediq00.sas7bdat: (600, 12)
['ID', 'SIDE', 'VERSION', 'READPRJ', 'MedialTibialCartilage', 'LateralTibialCartilage', 'MedialFemoralCartilage', 'LateralFemoralCartilage', 'PatellarCartilage', 'MedialMeniscus', 'LateralMeniscus', 'BARCDED']

kmri_fnih_qcart_biomediq01.sas7bdat: (582, 12)
Names only differ by case

kmri_fnih_qcart_biomediq03.sas7bdat: (600, 12)
Names only differ by case

Total rows: 1782


In [140]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

kmri_fnih_qcart_biomediq00.sas7bdat	Var Cnt: 12
Visits: ['V00']
kmri_fnih_qcart_biomediq01.sas7bdat	Var Cnt: 12
Visits: ['V01']
kmri_fnih_qcart_biomediq03.sas7bdat	Var Cnt: 12
Visits: ['V03']
(1782, 13)

Starting dataframe size: 0.34MB


In [141]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [142]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 4 	Cols to convert: 9	 Total col cnt: 13

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0    False     True     False    True      7
1     True    False     False   False      2

Numeric types of columns:
num_type
float    7
Name: count, dtype: int64

Largest number of unique strings: 1782


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
READPRJ,Project,1,{22},0,{},True,,,,,1782,0,0,0
MEDIALTIBIALCARTILAGE,,0,,0,,,float,1733.0,4619.8556,1012.8847,0,1781,0,1
LATERALTIBIALCARTILAGE,,0,,0,,,float,1736.0,5274.3321,1126.027,0,1781,0,1
MEDIALFEMORALCARTILAGE,,0,,0,,,float,1765.0,15784.4759,2479.2692,0,1781,0,1
LATERALFEMORALCARTILAGE,,0,,0,,,float,1764.0,12416.6298,3183.2454,0,1781,0,1
PATELLARCARTILAGE,,0,,0,,,float,1742.0,5676.0061,802.9758,0,1781,0,1
MEDIALMENISCUS,,0,,0,,,float,1743.0,5783.287,418.1429,0,1781,0,1
LATERALMENISCUS,,0,,0,,,float,1725.0,4595.9431,478.2498,0,1781,0,1
BARCDED,Barcode of image analyzed,1782,"{016610890706, 016610214503, 016611288912, 016...",0,{},True,,,,,1782,0,0,0


In [143]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only floats, missing, and NA values
'float': ['MEDIALTIBIALCARTILAGE', 'LATERALTIBIALCARTILAGE', 'MEDIALFEMORALCARTILAGE', 'LATERALFEMORALCARTILAGE', 'PATELLARCARTILAGE', 'MEDIALMENISCUS', 'LATERALMENISCUS'],

# Columns with only strings, missing, and NA values
'cat': ['READPRJ', 'BARCDED'],

}


Handled columns: 9


In [144]:
targets = {
# Columns with only floats, missing, and NA values
'float': ['MEDIALTIBIALCARTILAGE', 'LATERALTIBIALCARTILAGE', 'MEDIALFEMORALCARTILAGE', 'LATERALFEMORALCARTILAGE', 'PATELLARCARTILAGE', 'MEDIALMENISCUS', 'LATERALMENISCUS'],

# Columns with only strings, missing, and NA values
'cat': ['BARCDED'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [145]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)

Columns still object type:  ['READPRJ']

VERSION                    category
SIDE                       category
READPRJ                      object
MEDIALTIBIALCARTILAGE       float32
LATERALTIBIALCARTILAGE      float32
MEDIALFEMORALCARTILAGE      float32
LATERALFEMORALCARTILAGE     float32
PATELLARCARTILAGE           float32
MEDIALMENISCUS              float32
LATERALMENISCUS             float32
BARCDED                    category
dtype: object


In [146]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 0.36MB
Shadow dataframe size: 0.01MB


In [147]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

##  kmri_fnih_sbp_qmetrics

In [148]:
prefix = 'kmri_fnih_sbp_qmetrics'
column_uniformity_check(prefix)


kmri_fnih_sbp_qmetrics00.sas7bdat: (600, 40)
['ID', 'VERSION', 'READPRJ', 'SIDE', 'SubBArea_MedFem', 'CurAverage_SubB_MedFem', 'CurSD_SubB_MedFem', 'Cur_5pc_SubB_MedFem', 'Cur_95pc_SubB_MedFem', 'SCRAverage_SubB_MedFem', 'SCRSD_SubB_MedFem', 'SCR_5pc_SubB_MedFem', 'SCR_95pc_SubB_MedFem', 'SubBArea_LatFem', 'CurAverage_SubB_LatFem', 'CurSD_SubB_LatFem', 'Cur_5pc_SubB_LatFem', 'Cur_95pc_SubB_LatFem', 'SCRAverage_SubB_LatFem', 'SCRSD_SubB_LatFem', 'SCR_5pc_SubB_LatFem', 'SCR_95pc_SubB_LatFem', 'SubBArea_MedTib', 'CurAverage_SubB_MedTib', 'CurSD_SubB_MedTib', 'Cur_5pc_SubB_MedTib', 'Cur_95pc_SubB_MedTib', 'SCRAverage_SubB_MedTib', 'SCRSD_SubB_MedTib', 'SCR_5pc_SubB_MedTib', 'SCR_95pc_SubB_MedTib', 'SubBArea_LatTib', 'CurAverage_SubB_LatTib', 'CurSD_SubB_LatTib', 'Cur_5pc_SubB_LatTib', 'Cur_95pc_SubB_LatTib', 'SCRAverage_SubB_LatTib', 'SCRSD_SubB_LatTib', 'SCR_5pc_SubB_LatTib', 'SCR_95pc_SubB_LatTib']

kmri_fnih_sbp_qmetrics01.sas7bdat: (600, 40)
Names only differ by case

kmri_fnih_sbp_qm

In [149]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

kmri_fnih_sbp_qmetrics00.sas7bdat	Var Cnt: 40
Visits: ['V00']
kmri_fnih_sbp_qmetrics01.sas7bdat	Var Cnt: 40
Visits: ['V01']
kmri_fnih_sbp_qmetrics03.sas7bdat	Var Cnt: 40
Visits: ['V03']
(1800, 41)

Starting dataframe size: 1.98MB


In [150]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [151]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 4 	Cols to convert: 37	 Total col cnt: 41

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0    False     True     False    True      4
1     True    False     False   False      1
2     True     True     False   False     32

Numeric types of columns:
num_type
float    36
Name: count, dtype: int64

Largest number of unique strings: 1


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
READPRJ,Project,1,{22},0,{},True,,,,,1800,0,0,0
SUBBAREA_MEDFEM,,0,,0,,,float,1753.0,1470.5475,387.191531,0,1752,0,48
CURAVERAGE_SUBB_MEDFEM,,0,{},1,{T},False,float,1675.0,0.046131,0.007478,48,1752,0,0
CURSD_SUBB_MEDFEM,,0,{},1,{T},False,float,1678.0,0.070508,0.017858,48,1752,0,0
CUR_5PC_SUBB_MEDFEM,,0,{},1,{T},False,float,1733.0,-0.006122,-0.14542,48,1752,0,0
CUR_95PC_SUBB_MEDFEM,,0,{},1,{T},False,float,1714.0,0.177308,0.051765,48,1752,0,0
SCRAVERAGE_SUBB_MEDFEM,,0,{},1,{T},False,float,1742.0,7.332705,1.51978,56,1744,0,0
SCRSD_SUBB_MEDFEM,,0,{},1,{T},False,float,1741.0,1.742045,0.724587,56,1744,0,0
SCR_5PC_SUBB_MEDFEM,,0,{},1,{T},False,float,1742.0,3.765704,-0.753886,56,1744,0,0
SCR_95PC_SUBB_MEDFEM,,0,{},1,{T},False,float,1744.0,9.30661,3.395922,56,1744,0,0


In [152]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only floats, missing, and NA values
'float': ['SUBBAREA_MEDFEM', 'CURAVERAGE_SUBB_MEDFEM', 'CURSD_SUBB_MEDFEM', 'CUR_5PC_SUBB_MEDFEM', 'CUR_95PC_SUBB_MEDFEM', 'SCRAVERAGE_SUBB_MEDFEM', 'SCRSD_SUBB_MEDFEM', 'SCR_5PC_SUBB_MEDFEM', 'SCR_95PC_SUBB_MEDFEM', 'SUBBAREA_LATFEM', 'CURAVERAGE_SUBB_LATFEM', 'CURSD_SUBB_LATFEM', 'CUR_5PC_SUBB_LATFEM', 'CUR_95PC_SUBB_LATFEM', 'SCRAVERAGE_SUBB_LATFEM', 'SCRSD_SUBB_LATFEM', 'SCR_5PC_SUBB_LATFEM', 'SCR_95PC_SUBB_LATFEM', 'SUBBAREA_MEDTIB', 'CURAVERAGE_SUBB_MEDTIB', 'CURSD_SUBB_MEDTIB', 'CUR_5PC_SUBB_MEDTIB', 'CUR_95PC_SUBB_MEDTIB', 'SCRAVERAGE_SUBB_MEDTIB', 'SCRSD_SUBB_MEDTIB', 'SCR_5PC_SUBB_MEDTIB', 'SCR_95PC_SUBB_MEDTIB', 'SUBBAREA_LATTIB', 'CURAVERAGE_SUBB_LATTIB', 'CURSD_SUBB_LATTIB', 'CUR_5PC_SUBB_LATTIB', 'CUR_95PC_SUBB_LATTIB', 'SCRAVERAGE_SUBB_LATTIB', 'SCRSD_SUBB_LATTIB', 'SCR_5PC_SUBB_LATTIB', 'SCR_95PC_SUBB_LATTIB'],

# Columns with only strings, missing, and NA values
'cat': ['READPRJ'],

}


Handled columns: 37


In [153]:
targets = {
# Columns with only floats, missing, and NA values
'float': ['SUBBAREA_MEDFEM', 'CURAVERAGE_SUBB_MEDFEM', 'CURSD_SUBB_MEDFEM', 'CUR_5PC_SUBB_MEDFEM', 'CUR_95PC_SUBB_MEDFEM', 'SCRAVERAGE_SUBB_MEDFEM', 'SCRSD_SUBB_MEDFEM', 'SCR_5PC_SUBB_MEDFEM', 'SCR_95PC_SUBB_MEDFEM', 'SUBBAREA_LATFEM', 'CURAVERAGE_SUBB_LATFEM', 'CURSD_SUBB_LATFEM', 'CUR_5PC_SUBB_LATFEM', 'CUR_95PC_SUBB_LATFEM', 'SCRAVERAGE_SUBB_LATFEM', 'SCRSD_SUBB_LATFEM', 'SCR_5PC_SUBB_LATFEM', 'SCR_95PC_SUBB_LATFEM', 'SUBBAREA_MEDTIB', 'CURAVERAGE_SUBB_MEDTIB', 'CURSD_SUBB_MEDTIB', 'CUR_5PC_SUBB_MEDTIB', 'CUR_95PC_SUBB_MEDTIB', 'SCRAVERAGE_SUBB_MEDTIB', 'SCRSD_SUBB_MEDTIB', 'SCR_5PC_SUBB_MEDTIB', 'SCR_95PC_SUBB_MEDTIB', 'SUBBAREA_LATTIB', 'CURAVERAGE_SUBB_LATTIB', 'CURSD_SUBB_LATTIB', 'CUR_5PC_SUBB_LATTIB', 'CUR_95PC_SUBB_LATTIB', 'SCRAVERAGE_SUBB_LATTIB', 'SCRSD_SUBB_LATTIB', 'SCR_5PC_SUBB_LATTIB', 'SCR_95PC_SUBB_LATTIB'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [154]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)

Columns still object type:  ['READPRJ']

VERSION                   category
SIDE                      category
READPRJ                     object
SUBBAREA_MEDFEM            float32
CURAVERAGE_SUBB_MEDFEM     float32
CURSD_SUBB_MEDFEM          float32
CUR_5PC_SUBB_MEDFEM        float32
CUR_95PC_SUBB_MEDFEM       float32
SCRAVERAGE_SUBB_MEDFEM     float32
SCRSD_SUBB_MEDFEM          float32
SCR_5PC_SUBB_MEDFEM        float32
SCR_95PC_SUBB_MEDFEM       float32
SUBBAREA_LATFEM            float32
CURAVERAGE_SUBB_LATFEM     float32
CURSD_SUBB_LATFEM          float32
CUR_5PC_SUBB_LATFEM        float32
CUR_95PC_SUBB_LATFEM       float32
SCRAVERAGE_SUBB_LATFEM     float32
SCRSD_SUBB_LATFEM          float32
SCR_5PC_SUBB_LATFEM        float32
SCR_95PC_SUBB_LATFEM       float32
SUBBAREA_MEDTIB            float32
CURAVERAGE_SUBB_MEDTIB     float32
CURSD_SUBB_MEDTIB          float32
CUR_5PC_SUBB_MEDTIB        float32
CUR_95PC_SUBB_MEDTIB       float32
SCRAVERAGE_SUBB_MEDTIB     float32
SCRSD_SUBB_MED

In [155]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 0.38MB
Shadow dataframe size: 0.08MB


In [156]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## kmri_fnih_sq_moaks_bicl

In [157]:
prefix = 'kmri_fnih_sq_moaks_bicl'
column_uniformity_check(prefix)


kmri_fnih_sq_moaks_bicl00.sas7bdat: (600, 122)
['ID', 'SIDE', 'VERSION', 'READPRJ', 'READER', 'MCMPM', 'MCMPL', 'MCMFMA', 'MCMFLA', 'MCMFMP', 'MCMFLP', 'MCMFMC', 'MCMFLC', 'MCMTMA', 'MCMTLA', 'MCMTMC', 'MCMTLC', 'MCMTMP', 'MCMTLP', 'MBMSFMA', 'MBMPFMA', 'MBMNFMA', 'MBMSFLA', 'MBMPFLA', 'MBMNFLA', 'MBMSFMC', 'MBMPFMC', 'MBMNFMC', 'MBMSFLC', 'MBMPFLC', 'MBMNFLC', 'MBMSFMP', 'MBMPFMP', 'MBMNFMP', 'MBMSFLP', 'MBMPFLP', 'MBMNFLP', 'MBMSSS', 'MBMPSS', 'MBMNSS', 'MBMSTMA', 'MBMPTMA', 'MBMNTMA', 'MBMSTLA', 'MBMPTLA', 'MBMNTLA', 'MBMSTMC', 'MBMPTMC', 'MBMNTMC', 'MBMSTLC', 'MBMPTLC', 'MBMNTLC', 'MBMSTMP', 'MBMPTMP', 'MBMNTMP', 'MBMSTLP', 'MBMPTLP', 'MBMNTLP', 'MBMSPM', 'MBMPPM', 'MBMNPM', 'MBMSPL', 'MBMPPL', 'MBMNPL', 'MMTMA', 'MMTLA', 'MMTMB', 'MMTLB', 'MMTMP', 'MMTLP', 'MMHMA', 'MMHLA', 'MMHMB', 'MMHLB', 'MMHMP', 'MMHLP', 'MMSMA', 'MMSLA', 'MMSMB', 'MMSLB', 'MMSMP', 'MMSLP', 'MMXMM', 'MMXMA', 'MMXLA', 'MMXLL', 'MMRTM', 'MMRTL', 'MOSPS', 'MOSPI', 'MOSPM', 'MOSPL', 'MOSFMA', 'MOSFLA', 'MOSFMP',

In [158]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

kmri_fnih_sq_moaks_bicl00.sas7bdat	Var Cnt: 122
Visits: ['V00']
kmri_fnih_sq_moaks_bicl01.sas7bdat	Var Cnt: 122
Visits: ['V01']
kmri_fnih_sq_moaks_bicl03.sas7bdat	Var Cnt: 122
Visits: ['V03']
(1800, 123)

Starting dataframe size: 1.79MB


In [159]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [160]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 104 	Cols to convert: 19	 Total col cnt: 123

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0     True    False     False   False      4
1     True     True     False   False     15

Numeric types of columns:
num_type
unsigned    15
Name: count, dtype: int64

Largest number of unique strings: 50


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
READPRJ,Project,1,{22},0,{},True,,,,,1800,0,0,0
READER,BL/FU kMRI reading (BI): reader,2,"{R02, R01}",0,{},False,,,,,1800,0,0,0
MBMNFMA,BL/FU kMRI reading (BI): MOAKS: number of BML ...,0,{},1,{U},False,unsigned,5.0,3.0,0.0,1,1781,0,0
MBMNFLA,BL/FU kMRI reading (BI): MOAKS: number of BML ...,0,{},1,{U},False,unsigned,7.0,5.0,0.0,1,1781,0,0
MBMNFMC,BL/FU kMRI reading (BI): MOAKS: number of BML ...,0,{},1,{U},False,unsigned,5.0,3.0,0.0,1,1781,0,0
MBMNFLC,BL/FU kMRI reading (BI): MOAKS: number of BML ...,0,{},1,{U},False,unsigned,4.0,2.0,0.0,1,1781,0,0
MBMNFMP,BL/FU kMRI reading (BI): MOAKS: number of BML ...,0,{},1,{U},False,unsigned,5.0,3.0,0.0,1,1781,0,0
MBMNFLP,BL/FU kMRI reading (BI): MOAKS: number of BML ...,0,{},1,{U},False,unsigned,5.0,3.0,0.0,1,1781,0,0
MBMNSS,BL/FU kMRI reading (BI): MOAKS: number of BML ...,0,{},1,{U},False,unsigned,6.0,4.0,0.0,1,1781,0,0
MBMNTMA,BL/FU kMRI reading (BI): MOAKS: number of BML ...,0,{},1,{U},False,unsigned,4.0,2.0,0.0,1,1781,0,0


In [161]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['MBMNFMA', 'MBMNFLA', 'MBMNFMC', 'MBMNFLC', 'MBMNFMP', 'MBMNFLP', 'MBMNSS', 'MBMNTMA', 'MBMNTLA', 'MBMNTMC', 'MBMNTLC', 'MBMNTMP', 'MBMNTLP', 'MBMNPM', 'MBMNPL'],

# Columns with only strings, missing, and NA values
'cat': ['READPRJ', 'READER', 'MCMNTS', 'MTCMNTS'],

}


Handled columns: 19


In [162]:
targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['MBMNFMA', 'MBMNFLA', 'MBMNFMC', 'MBMNFLC', 'MBMNFMP', 'MBMNFLP', 'MBMNSS', 'MBMNTMA', 'MBMNTLA', 'MBMNTMC', 'MBMNTLC', 'MBMNTMP', 'MBMNTLP', 'MBMNPM', 'MBMNPL'],

# Columns with only strings, missing, and NA values
'cat': ['READER', 'MCMNTS', 'MTCMNTS'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [163]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)

Columns still object type:  ['READPRJ']

VERSION    category
SIDE       category
READPRJ      object
READER     category
MCMPM      category
             ...   
MEFFWK     category
MITBSIG    category
MPOPCYS    category
MCMNTS     category
MTCMNTS    category
Length: 121, dtype: object

Missing values present, shadow dataframe created.
              MBMNFMA MBMNFLA MBMNFMC MBMNFLC MBMNFMP MBMNFLP MBMNSS MBMNTMA  \
ID      Visit                                                                  
9001695 V00       NaN     NaN     NaN     NaN     NaN     NaN    NaN     NaN   
9002116 V00       NaN     NaN     NaN     NaN     NaN     NaN    NaN     NaN   
9002430 V00       NaN     NaN     NaN     NaN     NaN     NaN    NaN     NaN   
9002817 V00       NaN     NaN     NaN     NaN     NaN     NaN    NaN     NaN   
9003316 V00       NaN     NaN     NaN     NaN     NaN     NaN    NaN     NaN   
...               ...     ...     ...     ...     ...     ...    ...     ...   
9993833 V03       NaN

In [164]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 0.72MB
Shadow dataframe size: 0.05MB


In [165]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## kmri_poma_incoa_moaks_bicl
TODO: Not handled yet as column naming format doesn't use visit prefixes

In [166]:
prefix = 'kmri_poma_incoa_moaks_bicl'
column_uniformity_check(prefix)


kmri_poma_incoa_moaks_bicl.sas7bdat: (2481, 101)
['ID', 'side', 'version', 'visit', 'iwave', 'incrank', 'itype', 'iclass', 'Reader', 'CartPatellaMed', 'CartPatellaLat', 'CartFemurAntMed', 'CartFemurAntLat', 'CartTibioFemurPostMed', 'CartTibioFemurPostLat', 'CartTibioFemurCentMed', 'CartTibioFemurCentLat', 'CartTibioTibiaAntMed', 'CartTibioTibiaAntLat', 'CartTibioTibiaCentMed', 'CartTibioTibiaCentLat', 'CartTibioTibiaPostMed', 'CartTibioTibiaPostLat', 'BMLFemAntSizeMed', 'BMLFemAntBMLMed', 'BMLFemAntLesMed', 'BMLFemAntSizeLat', 'BMLFemAntBMLLat', 'BMLFemAntLesLat', 'BMLFemCentSizeMed', 'BMLFemCentBMLMed', 'BMLFemCentLesMed', 'BMLFemCentSizeLat', 'BMLFemCentBMLLat', 'BMLFemCentLesLat', 'BMLFemPostSizeMed', 'BMLFemPostBMLMed', 'BMLFemPostLesMed', 'BMLFemPostSizeLat', 'BMLFemPostBMLLat', 'BMLFemPostLesLat', 'BMLTibSubSpCentSize', 'BMLTibSubSpCentBML', 'BMLTibSubSpCentLes', 'BMLTibAntSizeMed', 'BMLTibAntBMLMed', 'BMLTibAntLesMed', 'BMLTibAntSizeLat', 'BMLTibAntBMLLat', 'BMLTibAntLesLat', '

## kmri_poma_tkr_chondrometrics
TODO: Not handled yet as column naming format doesn't use visit prefixes

In [167]:
prefix = 'kmri_poma_tkr_chondrometrics'
column_uniformity_check(prefix)


kmri_poma_tkr_chondrometrics.sas7bdat: (1330, 99)
['id', 'side', 'version', 'visit', 'class', 'tmpt', 'newstrata', 'MT_VC', 'MT_tAB', 'MT_VCtAB', 'MT_ThCtAB_aMe', 'MT_AC', 'MT_dABp', 'MT_cAB', 'MT_ThCcAB_aMe', 'MT_ThCtAB_aMav', 'MT_ThCtAB_aSD', 'MT_ThCtAB_aCV', 'cMT_ThCtAB_aMiv', 'cMT_ThCtAB_aMe', 'eMT_ThCtAB_aMe', 'iMT_ThCtAB_aMe', 'aMT_ThCtAB_aMe', 'pMT_ThCtAB_aMe', 'cMT_dABp', 'eMT_dABp', 'iMT_dABp', 'aMT_dABp', 'pMT_dABp', 'cMF_VC', 'cMF_tAB', 'cMF_VCtAB', 'cMF_ThCtAB_aMe', 'cMF_AC', 'cMF_dABp', 'cMF_cAB', 'cMF_ThCcAB_aMe', 'cMF_ThCtAB_aMav', 'cMF_ThCtAB_aSD', 'cMF_ThCtAB_aCV', 'ccMF_ThCtAB_aMiv', 'ccMF_ThCtAB_aMe', 'ecMF_ThCtAB_aMe', 'icMF_ThCtAB_aMe', 'ccMF_dABp', 'ecMF_dABp', 'icMF_dABp', 'MFTC_VC', 'MFTC_VCtAB', 'MFTC_ThCtAB_aMe', 'MFTC_ThCtAB_aMav', 'cMFTC_ThCtAB_aMiv', 'cMFTC_ThCtAB_aMe', 'LT_VC', 'LT_tAB', 'LT_VCtAB', 'LT_ThCtAB_aMe', 'LT_AC', 'LT_dABp', 'LT_cAB', 'LT_ThCcAB_aMe', 'LT_ThCtAB_aMav', 'LT_ThCtAB_aSD', 'LT_ThCtAB_aCV', 'cLT_ThCtAB_aMiv', 'cLT_ThCtAB_aMe', 'eLT_

## kmri_poma_tkr_moaks_bicl
TODO: Not handled yet as column naming format doesn't use visit prefixes

In [168]:
prefix = 'kmri_poma_tkr_moaks_bicl'
column_uniformity_check(prefix)


kmri_poma_tkr_moaks_bicl.sas7bdat: (1339, 100)
['id', 'side', 'version', 'visit', 'tmpt', 'class', 'newstrata', 'Reader', 'CartPatellaMed', 'CartPatellaLat', 'CartFemurAntMed', 'CartFemurAntLat', 'CartTibioFemurPostMed', 'CartTibioFemurPostLat', 'CartTibioFemurCentMed', 'CartTibioFemurCentLat', 'CartTibioTibiaAntMed', 'CartTibioTibiaAntLat', 'CartTibioTibiaCentMed', 'CartTibioTibiaCentLat', 'CartTibioTibiaPostMed', 'CartTibioTibiaPostLat', 'BMLFemAntSizeMed', 'BMLFemAntBMLMed', 'BMLFemAntLesMed', 'BMLFemAntSizeLat', 'BMLFemAntBMLLat', 'BMLFemAntLesLat', 'BMLFemCentSizeMed', 'BMLFemCentBMLMed', 'BMLFemCentLesMed', 'BMLFemCentSizeLat', 'BMLFemCentBMLLat', 'BMLFemCentLesLat', 'BMLFemPostSizeMed', 'BMLFemPostBMLMed', 'BMLFemPostLesMed', 'BMLFemPostSizeLat', 'BMLFemPostBMLLat', 'BMLFemPostLesLat', 'BMLTibSubSpCentSize', 'BMLTibSubSpCentBML', 'BMLTibSubSpCentLes', 'BMLTibAntSizeMed', 'BMLTibAntBMLMed', 'BMLTibAntLesMed', 'BMLTibAntSizeLat', 'BMLTibAntBMLLat', 'BMLTibAntLesLat', 'BMLTibCentS

## kmri_qcart_eckstein

In [169]:
prefix = 'kmri_qcart_eckstein'
column_uniformity_check(prefix)


kmri_qcart_eckstein00.sas7bdat: (3708, 97)
['ID', 'SIDE', 'readprj', 'VERSION', 'CBLFMAT', 'CBMFPD', 'BMFACV', 'BMTFMTH', 'WLTFMAV', 'IMTMTH', 'PLTMTH', 'BLFVCL', 'WMTCTS', 'BMFPD', 'BMFMTC', 'EMTMTH', 'WLTVCL', 'BLFACS', 'WMTVCN', 'WMTMAV', 'EBLFPD', 'AMTPD', 'WLTFMTH', 'WMTMTH', 'EMTPD', 'WLTFVCL', 'CMTMTH', 'CLTMAT', 'ILTMTH', 'WMTCAAB', 'BMFVCL', 'PMTPD', 'BLFACV', 'BMTFMAT', 'WMTFMTH', 'IBLFPD', 'EBMFPD', 'BMFACS', 'CMTPD', 'BLTFMAT', 'WLTCTS', 'WLTMAV', 'IBMFMTH', 'WMTFMAV', 'AMTMTH', 'WMTFVCN', 'ALTPD', 'WMTSBA', 'BMFVCN', 'BLFMTC', 'ELTMTH', 'PLTPD', 'BLFMAV', 'IMTPD', 'WLTFVCN', 'BLFVCN', 'EBMFMTH', 'WLTACV', 'IBLFMT', 'ELTPD', 'BLFCAAB', 'WMTFVCL', 'WLTMTH', 'CMTMAT', 'WLTCAAB', 'BLFSBA', 'IBMFPD', 'BLFCTS', 'CLTMTH', 'WLTMTC', 'CLTPD', 'CBMFMTH', 'WMTPD', 'PMTMTH', 'BLFMTH', 'WMTACS', 'ILTPD', 'CBMFMAT', 'EBLFMT', 'WLTSBA', 'WMTACV', 'ALTMTH', 'CBLFPD', 'BLTFMTH', 'BMFCAAB', 'CBLFMT', 'WLTACS', 'BMFMTH', 'BMFCTS', 'BMFSBA', 'BMFMAV', 'BLFPD', 'WMTMTC', 'WLTVCN', 'WMTVCL', '

In [170]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

kmri_qcart_eckstein00.sas7bdat	Var Cnt: 97
Visits: ['V00']
kmri_qcart_eckstein01.sas7bdat	Var Cnt: 97
Visits: ['V01']
kmri_qcart_eckstein03.sas7bdat	Var Cnt: 97
Visits: ['V03']
kmri_qcart_eckstein05.sas7bdat	Var Cnt: 97
Visits: ['V05']
kmri_qcart_eckstein06.sas7bdat	Var Cnt: 97
Visits: ['V06']
(9790, 98)

Starting dataframe size: 28.84MB


In [171]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [172]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 4 	Cols to convert: 94	 Total col cnt: 98

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0     True    False     False   False      2
1     True     True     False   False     92

Numeric types of columns:
num_type
float    92
Name: count, dtype: int64

Largest number of unique strings: 6641


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
READPRJ,Project,9,"{07, 18, 66, 08, 22b, 09A, 04, 09B, 22}",0,{},False,,,,,9790,0,0,0
CBLFMAT,BL/FU kMRI reading (FE): minimum cartilage thi...,0,{},2,"{I, T}",False,float,3570.0,3.8340,0.0000,6,9779,0,0
CBMFPD,BL/FU kMRI reading (FE): % area of subchondral...,0,{},2,"{I, T}",False,float,1959.0,100.0000,0.0000,6,9782,0,0
BMFACV,BL/FU kMRI reading (FE): CV of cartilage thick...,0,{},2,"{I, T}",False,float,8266.0,283.1440,8.2590,6,9782,0,0
BMTFMTH,BL/FU kMRI reading (FE): mean cartilage thickn...,0,{},2,"{I, T}",False,float,3092.0,7.9000,0.0000,6,9782,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
WMTMTC,BL/FU kMRI reading (FE): mean cartilage thickn...,0,{},2,"{I, T}",False,float,3218.0,3.0460,0.8673,6,9782,0,0
WLTVCN,BL/FU kMRI reading (FE): normalized cartilage ...,0,{},2,"{I, T}",False,float,3957.0,3.5356,0.0690,6,9779,0,0
WMTVCL,BL/FU kMRI reading (FE): volume of cartilage -...,0,{},2,"{I, T}",False,float,6211.0,5017.8010,435.4890,6,9782,0,0
WLTPD,BL/FU kMRI reading (FE): % area of subchondral...,0,{},2,"{I, T}",False,float,1956.0,94.1220,0.0000,6,9779,0,0


In [173]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only floats, missing, and NA values
'float': ['CBLFMAT', 'CBMFPD', 'BMFACV', 'BMTFMTH', 'WLTFMAV', 'IMTMTH', 'PLTMTH', 'BLFVCL', 'WMTCTS', 'BMFPD', 'BMFMTC', 'EMTMTH', 'WLTVCL', 'BLFACS', 'WMTVCN', 'WMTMAV', 'EBLFPD', 'AMTPD', 'WLTFMTH', 'WMTMTH', 'EMTPD', 'WLTFVCL', 'CMTMTH', 'CLTMAT', 'ILTMTH', 'WMTCAAB', 'BMFVCL', 'PMTPD', 'BLFACV', 'BMTFMAT', 'WMTFMTH', 'IBLFPD', 'EBMFPD', 'BMFACS', 'CMTPD', 'BLTFMAT', 'WLTCTS', 'WLTMAV', 'IBMFMTH', 'WMTFMAV', 'AMTMTH', 'WMTFVCN', 'ALTPD', 'WMTSBA', 'BMFVCN', 'BLFMTC', 'ELTMTH', 'PLTPD', 'BLFMAV', 'IMTPD', 'WLTFVCN', 'BLFVCN', 'EBMFMTH', 'WLTACV', 'IBLFMT', 'ELTPD', 'BLFCAAB', 'WMTFVCL', 'WLTMTH', 'CMTMAT', 'WLTCAAB', 'BLFSBA', 'IBMFPD', 'BLFCTS', 'CLTMTH', 'WLTMTC', 'CLTPD', 'CBMFMTH', 'WMTPD', 'PMTMTH', 'BLFMTH', 'WMTACS', 'ILTPD', 'CBMFMAT', 'EBLFMT', 'WLTSBA', 'WMTACV', 'ALTMTH', 'CBLFPD', 'BLTFMTH', 'BMFCAAB', 'CBLFMT', 'WLTACS', 'BMFMTH', 'BMFCTS', 'BMFSBA', 'BMFMAV', 'BLFPD', 'WMTMTC', 'WLTVCN', 'WMTVCL', 'WLTPD'],

In [174]:
targets = {
# Columns with only floats, missing, and NA values
'float': ['CBLFMAT', 'CBMFPD', 'BMFACV', 'BMTFMTH', 'WLTFMAV', 'IMTMTH', 'PLTMTH', 'BLFVCL', 'WMTCTS', 'BMFPD', 'BMFMTC', 'EMTMTH', 'WLTVCL', 'BLFACS', 'WMTVCN', 'WMTMAV', 'EBLFPD', 'AMTPD', 'WLTFMTH', 'WMTMTH', 'EMTPD', 'WLTFVCL', 'CMTMTH', 'CLTMAT', 'ILTMTH', 'WMTCAAB', 'BMFVCL', 'PMTPD', 'BLFACV', 'BMTFMAT', 'WMTFMTH', 'IBLFPD', 'EBMFPD', 'BMFACS', 'CMTPD', 'BLTFMAT', 'WLTCTS', 'WLTMAV', 'IBMFMTH', 'WMTFMAV', 'AMTMTH', 'WMTFVCN', 'ALTPD', 'WMTSBA', 'BMFVCN', 'BLFMTC', 'ELTMTH', 'PLTPD', 'BLFMAV', 'IMTPD', 'WLTFVCN', 'BLFVCN', 'EBMFMTH', 'WLTACV', 'IBLFMT', 'ELTPD', 'BLFCAAB', 'WMTFVCL', 'WLTMTH', 'CMTMAT', 'WLTCAAB', 'BLFSBA', 'IBMFPD', 'BLFCTS', 'CLTMTH', 'WLTMTC', 'CLTPD', 'CBMFMTH', 'WMTPD', 'PMTMTH', 'BLFMTH', 'WMTACS', 'ILTPD', 'CBMFMAT', 'EBLFMT', 'WLTSBA', 'WMTACV', 'ALTMTH', 'CBLFPD', 'BLTFMTH', 'BMFCAAB', 'CBLFMT', 'WLTACS', 'BMFMTH', 'BMFCTS', 'BMFSBA', 'BMFMAV', 'BLFPD', 'WMTMTC', 'WLTVCN', 'WMTVCL', 'WLTPD'],

# Columns with only strings, missing, and NA values
'cat': ['BARCDFE'],

}
new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [175]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)

Columns still object type:  ['READPRJ']

VERSION    category
SIDE       category
READPRJ      object
CBLFMAT     float32
CBMFPD      float32
             ...   
WMTMTC      float32
WLTVCN      float32
WMTVCL      float32
WLTPD       float32
BARCDFE    category
Length: 96, dtype: object

Missing values present, shadow dataframe created.
              CBLFMAT CBMFPD BMFACV BMTFMTH WLTFMAV IMTMTH PLTMTH BLFVCL  \
ID      Visit                                                              
9000099 V00       NaN    NaN    NaN     NaN     NaN    NaN    NaN    NaN   
        V00       NaN    NaN    NaN     NaN     NaN    NaN    NaN    NaN   
9000296 V00       NaN    NaN    NaN     NaN     NaN    NaN    NaN    NaN   
9000798 V00       NaN    NaN    NaN     NaN     NaN    NaN    NaN    NaN   
        V00       NaN    NaN    NaN     NaN     NaN    NaN    NaN    NaN   
...               ...    ...    ...     ...     ...    ...    ...    ...   
9995338 V06       NaN    NaN    NaN     NaN     NaN   

In [176]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 4.80MB
Shadow dataframe size: 0.96MB


In [177]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## kmri_qcart_link

In [178]:
prefix = 'kmri_qcart_link'
column_uniformity_check(prefix)


kmri_qcart_link00.sas7bdat: (300, 9)
['ID', 'VERSION', 'READPRJ', 'LFT2AV', 'LTT2AV', 'MFT2AV', 'MTT2AV', 'PATT2AV', 'AccessionNumber']

kmri_qcart_link03.sas7bdat: (287, 9)
Names only differ by case

kmri_qcart_link06.sas7bdat: (300, 9)
Names only differ by case

Total rows: 887


In [179]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

kmri_qcart_link00.sas7bdat	Var Cnt: 9
Visits: ['V00']
kmri_qcart_link03.sas7bdat	Var Cnt: 9
Visits: ['V03']
kmri_qcart_link06.sas7bdat	Var Cnt: 9
Visits: ['V06']
(887, 9)

Starting dataframe size: 0.20MB


In [180]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [181]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 3 	Cols to convert: 6	 Total col cnt: 9

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0     True    False     False   False      1
1     True     True     False   False      5

Numeric types of columns:
num_type
float    5
Name: count, dtype: int64

Largest number of unique strings: 1


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
READPRJ,Project,1,{40},0,{},True,,,,,887,0,0,0
LFT2AV,BL/FU kMRI reading (TL): average cartilage T2 ...,0,{},1,{T},False,float,837.0,42.817,28.477,14,873,0,0
LTT2AV,BL/FU kMRI reading (TL): average cartilage T2 ...,0,{},1,{T},False,float,824.0,44.593,23.638,19,867,0,0
MFT2AV,BL/FU kMRI reading (TL): average cartilage T2 ...,0,{},1,{T},False,float,825.0,55.22,30.015,17,870,0,0
MTT2AV,BL/FU kMRI reading (TL): average cartilage T2 ...,0,{},1,{T},False,float,817.0,60.378,24.722,17,870,0,0
PATT2AV,BL/FU kMRI reading (TL): average cartilage T2 ...,0,{},1,{T},False,float,826.0,65.14,22.994,28,859,0,0


In [182]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only floats, missing, and NA values
'float': ['LFT2AV', 'LTT2AV', 'MFT2AV', 'MTT2AV', 'PATT2AV'],

# Columns with only strings, missing, and NA values
'cat': ['READPRJ'],

}


Handled columns: 6


In [183]:
targets = {
# Columns with only floats, missing, and NA values
'float': ['LFT2AV', 'LTT2AV', 'MFT2AV', 'MTT2AV', 'PATT2AV'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [184]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)

Columns still object type:  ['READPRJ']

VERSION    category
READPRJ      object
LFT2AV      float32
LTT2AV      float32
MFT2AV      float32
MTT2AV      float32
PATT2AV     float32
dtype: object

Missing values present, shadow dataframe created.
              LFT2AV LTT2AV MFT2AV MTT2AV                 PATT2AV
ID      Visit                                                    
9003175 V00      NaN    NaN    NaN    NaN                     NaN
9011115 V00      NaN    NaN    NaN    NaN                     NaN
9011949 V00      NaN    NaN    NaN    NaN                     NaN
9013634 V00      NaN    NaN    NaN    NaN  .T: Technical problems
9015798 V00      NaN    NaN    NaN    NaN                     NaN
...              ...    ...    ...    ...                     ...
9982698 V06      NaN    NaN    NaN    NaN                     NaN
9983798 V06      NaN    NaN    NaN    NaN                     NaN
9989352 V06      NaN    NaN    NaN    NaN                     NaN
9995913 V06      NaN    NaN 

In [185]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 0.08MB
Shadow dataframe size: 0.02MB


In [186]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## kmri_qcart_vs

In [187]:
prefix = 'kmri_qcart_vs'
column_uniformity_check(prefix)


kmri_qcart_vs00.sas7bdat: (160, 127)
['ID', 'SIDE', 'READPRJ', 'VERSION', 'LTWVOL', 'PATTMAX', 'CLFTAVG', 'TRFNVOL', 'WFTAVG', 'CMFSABC', 'CMTSABC', 'PMFSADA', 'PATBMEV', 'CLFTVAR', 'CLTSAAS', 'LTNVOL', 'LTSABC', 'TRFTMAX', 'PLFSAAS', 'LTRBMEV', 'LTTMAX', 'CMTTAVG', 'CLTSABC', 'PLFSADA', 'MTSABC', 'BARCDVS', 'LPBMEV', 'MTNVOL', 'CLTSADA', 'CMTCAVG', 'MTSADA', 'CMFSAAS', 'PMFBMEV', 'CMTRENG', 'MTWVOL', 'PLFSABC', 'CLFWVOL', 'WFTVAR', 'CMFTVAR', 'CMTSADA', 'CMTSAAS', 'CLTRENG', 'CLFCAVG', 'CMFSADA', 'CLFRENG', 'CLTCAVG', 'CMTNVOL', 'WFTMAX', 'CMFTAVG', 'LTSAAS', 'CMTWVOL', 'CLFTMAX', 'MPBMEV', 'CMFCAVG', 'LTRENG', 'CLFSADA', 'CMFRENG', 'CLFSAAS', 'CMFWVOL', 'PLFTAVG', 'CLFSABC', 'MTRBMEV', 'CMTTVAR', 'LTTAVG', 'CLTTAVG', 'MTBMEV', 'PATCAVG', 'PLFRENG', 'CLFNVOL', 'CMTTMAX', 'MTTAVG', 'PMFTVAR', 'PLFWVOL', 'CLTNVOL', 'CLTWVOL', 'LTCAVG', 'PMFTAVG', 'PLFCAVG', 'LTBMEV', 'PMFCAVG', 'PLFTVAR', 'WFSADA', 'PATTAVG', 'MTCAVG', 'CLTTVAR', 'CMFNVOL', 'CLFBMEV', 'WFWVOL', 'CMFTMAX', 'TRFTAVG', 'W

In [188]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

kmri_qcart_vs00.sas7bdat	Var Cnt: 127
Visits: ['V00']
kmri_qcart_vs01.sas7bdat	Var Cnt: 127
Visits: ['V01']
(320, 128)

Starting dataframe size: 1.29MB


In [189]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [190]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 5 	Cols to convert: 123	 Total col cnt: 128

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0     True    False     False   False      2
1     True     True     False   False    121

Numeric types of columns:
num_type
float       120
unsigned      1
Name: count, dtype: int64

Largest number of unique strings: 301


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
READPRJ,Project,1,{03},0,{},True,,,,,320,0,0,0
LTWVOL,BL/FU kMRI reading (VS): volume of cartilage -...,0,{},1,{I},False,float,300.0,4493.533793,622.615115,20,300,0,0
PATTMAX,BL/FU kMRI reading (VS): maximum cartilage thi...,0,{},1,{I},False,float,87.0,7.350000,2.150000,20,300,0,0
CLFTAVG,BL/FU kMRI reading (VS): mean cartilage thickn...,0,{},1,{I},False,float,300.0,3.142385,1.280720,20,300,0,0
TRFNVOL,BL/FU kMRI reading (VS): normalized cartilage ...,0,{},1,{I},False,float,300.0,3.685081,0.927447,20,300,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
PMFNVOL,BL/FU kMRI reading (VS): normalized cartilage ...,0,{},1,{I},False,float,300.0,3.368355,1.453637,20,300,0,0
PMFTMAX,BL/FU kMRI reading (VS): maximum cartilage thi...,0,{},1,{I},False,float,53.0,6.200000,2.050000,20,300,0,0
LTTVAR,BL/FU kMRI reading (VS): variance of cartilage...,0,{},1,{I},False,float,300.0,2.135934,0.142891,20,300,0,0
CMFBMEV,BL/FU kMRI reading (VS): bone marrow edema vol...,0,{},1,{I},False,float,112.0,7743.094866,0.000000,20,300,0,0


In [191]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['PMFSADA'],

# Columns with only floats, missing, and NA values
'float': ['LTWVOL', 'PATTMAX', 'CLFTAVG', 'TRFNVOL', 'WFTAVG', 'CMFSABC', 'CMTSABC', 'PATBMEV', 'CLFTVAR', 'CLTSAAS', 'LTNVOL', 'LTSABC', 'TRFTMAX', 'PLFSAAS', 'LTRBMEV', 'LTTMAX', 'CMTTAVG', 'CLTSABC', 'PLFSADA', 'MTSABC', 'LPBMEV', 'MTNVOL', 'CLTSADA', 'CMTCAVG', 'MTSADA', 'CMFSAAS', 'PMFBMEV', 'CMTRENG', 'MTWVOL', 'PLFSABC', 'CLFWVOL', 'WFTVAR', 'CMFTVAR', 'CMTSADA', 'CMTSAAS', 'CLTRENG', 'CLFCAVG', 'CMFSADA', 'CLFRENG', 'CLTCAVG', 'CMTNVOL', 'WFTMAX', 'CMFTAVG', 'LTSAAS', 'CMTWVOL', 'CLFTMAX', 'MPBMEV', 'CMFCAVG', 'LTRENG', 'CLFSADA', 'CMFRENG', 'CLFSAAS', 'CMFWVOL', 'PLFTAVG', 'CLFSABC', 'MTRBMEV', 'CMTTVAR', 'LTTAVG', 'CLTTAVG', 'MTBMEV', 'PATCAVG', 'PLFRENG', 'CLFNVOL', 'CMTTMAX', 'MTTAVG', 'PMFTVAR', 'PLFWVOL', 'CLTNVOL', 'CLTWVOL', 'LTCAVG', 'PMFTAVG', 'PLFCAVG', 'LTBMEV', 'PMFCAVG', 'PLFTVAR', 'WFSADA', 'PATTAVG', 'MTCAVG', 'CLTTVA

In [192]:
targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['PMFSADA'],

# Columns with only floats, missing, and NA values
'float': ['LTWVOL', 'PATTMAX', 'CLFTAVG', 'TRFNVOL', 'WFTAVG', 'CMFSABC', 'CMTSABC', 'PATBMEV', 'CLFTVAR', 'CLTSAAS', 'LTNVOL', 'LTSABC', 'TRFTMAX', 'PLFSAAS', 'LTRBMEV', 'LTTMAX', 'CMTTAVG', 'CLTSABC', 'PLFSADA', 'MTSABC', 'LPBMEV', 'MTNVOL', 'CLTSADA', 'CMTCAVG', 'MTSADA', 'CMFSAAS', 'PMFBMEV', 'CMTRENG', 'MTWVOL', 'PLFSABC', 'CLFWVOL', 'WFTVAR', 'CMFTVAR', 'CMTSADA', 'CMTSAAS', 'CLTRENG', 'CLFCAVG', 'CMFSADA', 'CLFRENG', 'CLTCAVG', 'CMTNVOL', 'WFTMAX', 'CMFTAVG', 'LTSAAS', 'CMTWVOL', 'CLFTMAX', 'MPBMEV', 'CMFCAVG', 'LTRENG', 'CLFSADA', 'CMFRENG', 'CLFSAAS', 'CMFWVOL', 'PLFTAVG', 'CLFSABC', 'MTRBMEV', 'CMTTVAR', 'LTTAVG', 'CLTTAVG', 'MTBMEV', 'PATCAVG', 'PLFRENG', 'CLFNVOL', 'CMTTMAX', 'MTTAVG', 'PMFTVAR', 'PLFWVOL', 'CLTNVOL', 'CLTWVOL', 'LTCAVG', 'PMFTAVG', 'PLFCAVG', 'LTBMEV', 'PMFCAVG', 'PLFTVAR', 'WFSADA', 'PATTAVG', 'MTCAVG', 'CLTTVAR', 'CMFNVOL', 'CLFBMEV', 'WFWVOL', 'CMFTMAX', 'TRFTAVG', 'WFSABC', 'MTRENG', 'MTSAAS', 'PATRENG', 'PMFWVOL', 'PLFNVOL', 'LTSADA', 'PMFSABC', 'MTTMAX', 'WFCAVG', 'TRFRENG', 'PATSAAS', 'CLTTMAX', 'MTTVAR', 'TRFWVOL', 'PMFRENG', 'PATNVOL', 'PLFTMAX', 'TRFCAVG', 'TRFSAAS', 'WFSAAS', 'PATWVOL', 'WFRENG', 'TRFTVAR', 'PLFBMEV', 'WFNVOL', 'PATSABC', 'PATTVAR', 'TRFSABC', 'TRFSADA', 'PMFSAAS', 'PMFNVOL', 'PMFTMAX', 'LTTVAR', 'CMFBMEV', 'PATSADA'],

# Columns with only strings, missing, and NA values
'cat': ['BARCDVS'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [193]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)

Columns still object type:  ['READPRJ']

VERSION    category
SIDE       category
READPRJ      object
LTWVOL      float32
PATTMAX     float32
             ...   
PMFTMAX     float32
LTTVAR      float32
CMFBMEV     float32
PATSADA     float32
VSORDR     category
Length: 126, dtype: object

Missing values present, shadow dataframe created.
              PMFSADA LTWVOL PATTMAX CLFTAVG TRFNVOL WFTAVG CMFSABC CMTSABC  \
ID      Visit                                                                 
9003406 V00       NaN    NaN     NaN     NaN     NaN    NaN     NaN     NaN   
9007827 V00       NaN    NaN     NaN     NaN     NaN    NaN     NaN     NaN   
9021791 V00       NaN    NaN     NaN     NaN     NaN    NaN     NaN     NaN   
9040390 V00       NaN    NaN     NaN     NaN     NaN    NaN     NaN     NaN   
9040892 V00       NaN    NaN     NaN     NaN     NaN    NaN     NaN     NaN   
...               ...    ...     ...     ...     ...    ...     ...     ...   
9986207 V01       NaN    NaN 

In [194]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 0.21MB
Shadow dataframe size: 0.06MB


In [195]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## kmri_sq_bicl

In [196]:
prefix = 'kmri_sq_bicl'
column_uniformity_check(prefix)


kmri_sq_bicl00.sas7bdat: (115, 81)
['ID', 'SIDE', 'READPRJ', 'VERSION', 'BLATMTP', 'BMEDMTA', 'BCBMTN', 'BCBLTN', 'BLATMXL', 'BMEDMTP', 'BLATMTA', 'BCBPM', 'BCAMFWT', 'BCAFLWT', 'BMEDMXM', 'BCBPC', 'BCATRMT', 'BCBLFP', 'BCAMFWE', 'BCAPME', 'BCBPL', 'BCATRME', 'BCAFLWE', 'BCATRLE', 'BCBLFN', 'BCAPLT', 'BCBMFN', 'BCATRLT', 'BCALTT', 'BNBMLS', 'BCALTE', 'BCBLTP', 'BLATMTB', 'BCBMTP', 'BCAPLE', 'BLATMXA', 'BMEDMTB', 'BCAMTE', 'BCAMTT', 'BMEDMXA', 'BCBMFP', 'BCAPMT', 'WMEDMXM', 'WLATMXL', 'WBMTMC', 'WBMFLP', 'WMEDMTP', 'WCMPL', 'WBMFMP', 'WMEDMTA', 'WBMFLA', 'WCMFLC', 'BNBMLSK', 'WCMTMC', 'WCMFLP', 'WLATMTA', 'WBMPL', 'WLATMTP', 'WCMFLA', 'READER', 'WCMFMA', 'WCMFMP', 'WBMFLC', 'WBMTSS', 'WBMFMC', 'WBMTLP', 'WCMTMA', 'WBMFMA', 'WBMTMP', 'WCMTLC', 'WCMPM', 'WBMTLA', 'WLATMTB', 'WCMFMC', 'WCMTLP', 'WMEDMTB', 'WBMTMA', 'WBMTLC', 'WCMTMP', 'WBMPM', 'WCMTLA']

kmri_sq_bicl03.sas7bdat: (115, 81)

Total rows: 230


In [197]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

kmri_sq_bicl00.sas7bdat	Var Cnt: 81
Visits: ['V00']
kmri_sq_bicl03.sas7bdat	Var Cnt: 81
Visits: ['V03']
(230, 82)

Starting dataframe size: 0.32MB


In [198]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [199]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 78 	Cols to convert: 4	 Total col cnt: 82

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0    False     True     False   False      2
1     True    False     False   False      2

Numeric types of columns:
num_type
unsigned    2
Name: count, dtype: int64

Largest number of unique strings: 2


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
READPRJ,Project,1,{10},0,{},True,,,,,230,0,0,0
BNBMLS,BL/FU kMRI reading (BI): BLOKS: number of BMLs...,0,,0,,,unsigned,12.0,13.0,0.0,0,230,0,0
BNBMLSK,BL/FU kMRI reading (BI): BLOKS: number of BMLs...,0,,0,,,unsigned,12.0,13.0,0.0,0,230,0,0
READER,BL/FU kMRI reading (BI): reader,2,"{R02, R01}",0,{},False,,,,,230,0,0,0


In [200]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['BNBMLS', 'BNBMLSK'],

# Columns with only strings, missing, and NA values
'cat': ['READPRJ', 'READER'],

}


Handled columns: 4


In [201]:
targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['BNBMLS', 'BNBMLSK'],

# Columns with only strings, missing, and NA values
'cat': ['READER'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [202]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)

Columns still object type:  ['READPRJ']

VERSION    category
SIDE       category
READPRJ      object
BLATMTP    category
BMEDMTA    category
             ...   
WBMTMA     category
WBMTLC     category
WCMTMP     category
WBMPM      category
WCMTLA     category
Length: 80, dtype: object


In [203]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 0.30MB
Shadow dataframe size: 0.00MB


In [204]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## kmri_sq_blksbml_bicl

In [205]:
prefix = 'kmri_sq_blksbml_bicl'
column_uniformity_check(prefix)


kmri_sq_blksbml_bicl00.sas7bdat: (593, 9)
['ID', 'side', 'READPRJ', 'VERSION', 'BBMLP', 'BBMLA', 'BBMLST', 'BBMLS', 'BBMLNUM']

kmri_sq_blksbml_bicl03.sas7bdat: (593, 9)

Total rows: 1186


In [206]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

kmri_sq_blksbml_bicl00.sas7bdat	Var Cnt: 9
Visits: ['V00']
kmri_sq_blksbml_bicl03.sas7bdat	Var Cnt: 9
Visits: ['V03']
(1186, 10)

Starting dataframe size: 0.12MB


In [207]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [208]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 8 	Cols to convert: 2	 Total col cnt: 10

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0    False     True     False    True      1
1     True    False     False   False      1

Numeric types of columns:
num_type
unsigned    1
Name: count, dtype: int64

Largest number of unique strings: 1


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
READPRJ,Project,1,{10},0,{},True,,,,,1186,0,0,0
BBMLNUM,BL/FU kMRI reading (BI): BLOKS: BML ID number,0,,0,,,unsigned,14.0,13.0,1.0,0,1180,0,6


In [209]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['BBMLNUM'],

# Columns with only strings, missing, and NA values
'cat': ['READPRJ'],

}


Handled columns: 2


In [210]:
targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['BBMLNUM'],

}


new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [211]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)

Columns still object type:  ['READPRJ']

VERSION    category
SIDE       category
READPRJ      object
BBMLP      category
BBMLA      category
BBMLST     category
BBMLS      category
BBMLNUM       UInt8
dtype: object


In [212]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 0.10MB
Shadow dataframe size: 0.01MB


In [213]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## kmri_sq_moaks_bicl

In [214]:
prefix = 'kmri_sq_moaks_bicl'
column_uniformity_check(prefix)


kmri_sq_moaks_bicl00.sas7bdat: (3017, 121)
['ID', 'SIDE', 'READPRJ', 'VERSION', 'MCMPM', 'MCMPL', 'MCMFMA', 'MCMFLA', 'MCMFMP', 'MCMFLP', 'MCMFMC', 'MCMFLC', 'MCMTMA', 'MCMTLA', 'MCMTMC', 'MCMTLC', 'MCMTMP', 'MCMTLP', 'MBMSFMA', 'MBMSFLA', 'MBMSFMC', 'MBMSFLC', 'MBMSFMP', 'MBMSFLP', 'MBMSSS', 'MBMSTMA', 'MBMSTLA', 'MBMSTMC', 'MBMSTLC', 'MBMSTMP', 'MBMSTLP', 'MBMSPM', 'MBMSPL', 'MBMPFMA', 'MBMPFLA', 'MBMPFMC', 'MBMPFLC', 'MBMPFMP', 'MBMPFLP', 'MBMPSS', 'MBMPTMA', 'MBMPTLA', 'MBMPTMC', 'MBMPTLC', 'MBMPTMP', 'MBMPTLP', 'MBMPPM', 'MBMPPL', 'MBMNFMA', 'MBMNFLA', 'MBMNFMC', 'MBMNFLC', 'MBMNFMP', 'MBMNFLP', 'MBMNSS', 'MBMNTMA', 'MBMNTLA', 'MBMNTMC', 'MBMNTLC', 'MBMNTMP', 'MBMNTLP', 'MBMNPM', 'MBMNPL', 'MMTMA', 'MMTMB', 'MMTMP', 'MMTLA', 'MMTLB', 'MMTLP', 'MMHMA', 'MMHMB', 'MMHMP', 'MMHLA', 'MMHLB', 'MMHLP', 'MMSMA', 'MMSMB', 'MMSMP', 'MMSLA', 'MMSLB', 'MMSLP', 'MMXMM', 'MMXMA', 'MMXLA', 'MMXLL', 'MMRTM', 'MMRTL', 'MOSPS', 'MOSPI', 'MOSPM', 'MOSFMA', 'MOSFMP', 'MOSFMC', 'MOSTM', 'MOSPL', 'MOS

In [215]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

kmri_sq_moaks_bicl00.sas7bdat	Var Cnt: 121
Visits: ['V00']
kmri_sq_moaks_bicl01.sas7bdat	Var Cnt: 121
Visits: ['V01']
kmri_sq_moaks_bicl03.sas7bdat	Var Cnt: 121
Visits: ['V03']
kmri_sq_moaks_bicl05.sas7bdat	Var Cnt: 121
Visits: ['V05']
kmri_sq_moaks_bicl06.sas7bdat	Var Cnt: 121
Visits: ['V06']
(10429, 122)

Starting dataframe size: 8.05MB


In [216]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [217]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 104 	Cols to convert: 18	 Total col cnt: 122

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0     True    False     False   False      3
1     True     True     False   False     15

Numeric types of columns:
num_type
unsigned    15
Name: count, dtype: int64

Largest number of unique strings: 184


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
READPRJ,Project,10,"{63E, 63B, 63D, 65, 63F, 61, 30, 63C, 63A, 22}",0,{},False,,,,,10429,0,0,0
MBMNFMA,BL/FU kMRI reading (BI): MOAKS: number of BML ...,0,{},2,"{M, U}",False,unsigned,5.0,3.0,0.0,5,5596,0,0
MBMNFLA,BL/FU kMRI reading (BI): MOAKS: number of BML ...,0,{},2,"{M, U}",False,unsigned,7.0,5.0,0.0,6,5595,0,0
MBMNFMC,BL/FU kMRI reading (BI): MOAKS: number of BML ...,0,{},2,"{M, U}",False,unsigned,8.0,11.0,0.0,6,5595,0,0
MBMNFLC,BL/FU kMRI reading (BI): MOAKS: number of BML ...,0,{},2,"{M, U}",False,unsigned,6.0,9.0,0.0,5,5596,0,0
MBMNFMP,BL/FU kMRI reading (BI): MOAKS: number of BML ...,0,{},2,"{M, U}",False,unsigned,6.0,4.0,0.0,5,5596,0,0
MBMNFLP,BL/FU kMRI reading (BI): MOAKS: number of BML ...,0,{},2,"{M, U}",False,unsigned,8.0,9.0,0.0,5,5596,0,0
MBMNSS,BL/FU kMRI reading (BI): MOAKS: number of BML ...,0,{},2,"{M, U}",False,unsigned,7.0,9.0,0.0,4,5597,0,0
MBMNTMA,BL/FU kMRI reading (BI): MOAKS: number of BML ...,0,{},2,"{M, U}",False,unsigned,4.0,2.0,0.0,8,5593,0,0
MBMNTLA,BL/FU kMRI reading (BI): MOAKS: number of BML ...,0,{},2,"{M, U}",False,unsigned,3.0,1.0,0.0,6,5595,0,0


In [218]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['MBMNFMA', 'MBMNFLA', 'MBMNFMC', 'MBMNFLC', 'MBMNFMP', 'MBMNFLP', 'MBMNSS', 'MBMNTMA', 'MBMNTLA', 'MBMNTMC', 'MBMNTLC', 'MBMNTMP', 'MBMNTLP', 'MBMNPM', 'MBMNPL'],

# Columns with only strings, missing, and NA values
'cat': ['READPRJ', 'MCMNTS', 'MTCMNTS'],

}


Handled columns: 18


In [219]:
# Clean up comments:
for col in 'MTCMNTS', 'MCMNTS':
    for target_str in ['none', 'NONE', 'None.', 'nnoe']:
        tmp_df[col] = tmp_df[col].replace(target_str, '')

In [220]:
targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['MBMNFMA', 'MBMNFLA', 'MBMNFMC', 'MBMNFLC', 'MBMNFMP', 'MBMNFLP', 'MBMNSS', 'MBMNTMA', 'MBMNTLA', 'MBMNTMC', 'MBMNTLC', 'MBMNTMP', 'MBMNTLP', 'MBMNPM', 'MBMNPL'],

# Columns with only strings, missing, and NA values
'cat': ['READPRJ', 'MCMNTS', 'MTCMNTS'],

}
new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [221]:
# Clean up the side var and make an index
new_df['SIDE'] = new_df['SIDE'].cat.rename_categories({'1: Right': 'RIGHT', '2: Left': 'LEFT'})
new_df.set_index('SIDE', append=True, inplace=True) #  Note that indices are unique yet, Prj 15 and Prj 37 contain repeats

In [222]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)


VERSION    category
READPRJ    category
MCMPM      category
MCMPL      category
MCMFMA     category
             ...   
MEFFWK     category
MPOPCYS    category
MITBSIG    category
MCMNTS     category
MTCMNTS    category
Length: 119, dtype: object

Missing values present, shadow dataframe created.
              MBMNFMA MBMNFLA MBMNFMC MBMNFLC MBMNFMP MBMNFLP MBMNSS MBMNTMA  \
ID      Visit                                                                  
9000798 V00       NaN     NaN     NaN     NaN     NaN     NaN    NaN     NaN   
9001400 V00       NaN     NaN     NaN     NaN     NaN     NaN    NaN     NaN   
9001695 V00       NaN     NaN     NaN     NaN     NaN     NaN    NaN     NaN   
9001897 V00       NaN     NaN     NaN     NaN     NaN     NaN    NaN     NaN   
        V00       NaN     NaN     NaN     NaN     NaN     NaN    NaN     NaN   
...               ...     ...     ...     ...     ...     ...    ...     ...   
9995338 V06       NaN     NaN     NaN     NaN     NaN     NaN

In [223]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 1.82MB
Shadow dataframe size: 0.24MB


In [224]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## kmri_sq_worms_link

In [225]:
prefix = 'kmri_sq_worms_link'
column_uniformity_check(prefix)


kmri_sq_worms_link00.sas7bdat: (300, 33)
['ID', 'side', 'VERSION', 'READPRJ', 'MWMTMA', 'MWMTMB', 'MWMTMP', 'MWMTLA', 'MWMTLB', 'MWMTLP', 'MWMTMS', 'MWMTLS', 'MWACL', 'MWPCL', 'MWMCL', 'MWLCL', 'MWPATND', 'MWPOTND', 'MWCMP', 'MWCMFT', 'MWCMFM', 'MWCMFL', 'MWCMTM', 'MWCMTL', 'MWBMP', 'MWBMFT', 'MWBMFM', 'MWBMFL', 'MWBMTM', 'MWBMTL', 'MWEFFWK', 'MWLB', 'MWPOPCY']

kmri_sq_worms_link03.sas7bdat: (291, 33)
Names only differ by case

kmri_sq_worms_link06.sas7bdat: (300, 33)
Names only differ by case

Total rows: 891


In [226]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

kmri_sq_worms_link00.sas7bdat	Var Cnt: 33
Visits: ['V00']
Unhandled data type:  MWPOPCY DEGREE
kmri_sq_worms_link03.sas7bdat	Var Cnt: 33
Visits: ['V03']
Unhandled data type:  MWPOPCY DEGREE
kmri_sq_worms_link06.sas7bdat	Var Cnt: 33
Visits: ['V06']
Unhandled data type:  MWPOPCY DEGREE
(891, 34)

Starting dataframe size: 0.22MB


In [227]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [228]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 32 	Cols to convert: 2	 Total col cnt: 34

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0     True    False     False   False      1
1     True     True     False   False      1

Numeric types of columns:
num_type
unsigned    1
Name: count, dtype: int64

Largest number of unique strings: 1


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
READPRJ,Project,1,{39},0,{},True,,,,,891,0,0,0
MWPOPCY,BL/FU kMRI Reading (Link WORMS): Popliteal Cyst,0,{},2,"{M, T}",False,unsigned,4.0,3.0,0.0,202,689,0,0


In [229]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['MWPOPCY'],

# Columns with only strings, missing, and NA values
'cat': ['READPRJ'],

}


Handled columns: 2


In [230]:
targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['MWPOPCY'],
}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [231]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)

Failure to make column categorical: MWPOPCY
Columns still object type:  ['READPRJ']

VERSION    category
SIDE       category
READPRJ      object
MWMTMA     category
MWMTMB     category
MWMTMP     category
MWMTLA     category
MWMTLB     category
MWMTLP     category
MWMTMS     category
MWMTLS     category
MWACL      category
MWPCL      category
MWMCL      category
MWLCL      category
MWPATND    category
MWPOTND    category
MWCMP      category
MWCMFT     category
MWCMFM     category
MWCMFL     category
MWCMTM     category
MWCMTL     category
MWBMP      category
MWBMFT     category
MWBMFM     category
MWBMFL     category
MWBMTM     category
MWBMTL     category
MWEFFWK    category
MWLB       category
MWPOPCY       UInt8
dtype: object

Missing values present, shadow dataframe created.
                   MWPOPCY
ID      Visit             
9003175 V00            NaN
9011115 V00    .M: Missing
9011949 V00            NaN
9013634 V00            NaN
9015798 V00    .M: Missing
...                  

In [232]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 0.19MB
Shadow dataframe size: 0.01MB


In [233]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## kxr_fnih_bti_duke

In [234]:
prefix = 'kxr_fnih_bti_duke'
column_uniformity_check(prefix)


kxr_fnih_bti_duke00.sas7bdat: (600, 11)
['ID', 'SIDE', 'VERSION', 'READPRJ', 'BTI_H0', 'BTI_H1', 'BTI_H2', 'BTI_V0', 'BTI_V1', 'BTI_V2', 'BARCDBTI']

kxr_fnih_bti_duke01.sas7bdat: (600, 11)

kxr_fnih_bti_duke03.sas7bdat: (600, 11)

Total rows: 1800


In [235]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

kxr_fnih_bti_duke00.sas7bdat	Var Cnt: 11
Visits: ['V00']
kxr_fnih_bti_duke01.sas7bdat	Var Cnt: 11
Visits: ['V01']
kxr_fnih_bti_duke03.sas7bdat	Var Cnt: 11
Visits: ['V03']
(1800, 12)

Starting dataframe size: 0.33MB


In [236]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [237]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 4 	Cols to convert: 8	 Total col cnt: 12

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0    False     True     False    True      6
1     True    False     False   False      2

Numeric types of columns:
num_type
float    6
Name: count, dtype: int64

Largest number of unique strings: 1785


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
READPRJ,Project,1,{22},0,{},True,,,,,1800,0,0,0
BTI_H0,Fractal Bone Trabecular Integrity Horizontal -...,0,,0,,,float,1700.0,3.101711,2.140136,0,1699,0,101
BTI_H1,Fractal Bone Trabecular Integrity Horizontal -...,0,,0,,,float,1700.0,-0.068849,-0.385847,0,1699,0,101
BTI_H2,Fractal Bone Trabecular Integrity Horizontal -...,0,,0,,,float,1700.0,0.330728,-0.018622,0,1699,0,101
BTI_V0,Fractal Bone Trabecular Integrity Vertical - c...,0,,0,,,float,1700.0,2.82662,1.366253,0,1699,0,101
BTI_V1,Fractal Bone Trabecular Integrity Vertical - s...,0,,0,,,float,1700.0,0.213952,-0.439273,0,1699,0,101
BTI_V2,Fractal Bone Trabecular Integrity Vertical - q...,0,,0,,,float,1700.0,1.007951,0.101282,0,1699,0,101
BARCDBTI,Barcode of image analyzed,1785,"{016601653603, 016602269503, 016600104503, 016...",0,{},True,,,,,1800,0,0,0


In [238]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only floats, missing, and NA values
'float': ['BTI_H0', 'BTI_H1', 'BTI_H2', 'BTI_V0', 'BTI_V1', 'BTI_V2'],

# Columns with only strings, missing, and NA values
'cat': ['READPRJ', 'BARCDBTI'],

}


Handled columns: 8


In [239]:
targets = {
# Columns with only floats, missing, and NA values
'float': ['BTI_H0', 'BTI_H1', 'BTI_H2', 'BTI_V0', 'BTI_V1', 'BTI_V2'],

# Columns with only strings, missing, and NA values
'cat': ['BARCDBTI'],

}


new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [240]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)

Columns still object type:  ['READPRJ']

VERSION     category
SIDE        category
READPRJ       object
BTI_H0       float32
BTI_H1       float32
BTI_H2       float32
BTI_V0       float32
BTI_V1       float32
BTI_V2       float32
BARCDBTI    category
dtype: object


In [241]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 0.35MB
Shadow dataframe size: 0.01MB


In [242]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## kxr_fta_duryea

In [243]:
prefix = 'kxr_fta_duryea'
column_uniformity_check(prefix)


kxr_fta_duryea00.sas7bdat: (6178, 7)
['ID', 'SIDE', 'READPRJ', 'VERSION', 'FTAFLAG', 'FTANGLE', 'BRCDJD']

kxr_fta_duryea01.sas7bdat: (5807, 7)

kxr_fta_duryea03.sas7bdat: (5482, 7)

kxr_fta_duryea05.sas7bdat: (5313, 7)

kxr_fta_duryea06.sas7bdat: (5092, 7)

kxr_fta_duryea08.sas7bdat: (3557, 7)

kxr_fta_duryea10.sas7bdat: (3689, 7)

Total rows: 35118


In [244]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

kxr_fta_duryea00.sas7bdat	Var Cnt: 7
Visits: ['V00']
kxr_fta_duryea01.sas7bdat	Var Cnt: 7
Visits: ['V01']
kxr_fta_duryea03.sas7bdat	Var Cnt: 7
Visits: ['V03']
kxr_fta_duryea05.sas7bdat	Var Cnt: 7
Visits: ['V05']
kxr_fta_duryea06.sas7bdat	Var Cnt: 7
Visits: ['V06']
kxr_fta_duryea08.sas7bdat	Var Cnt: 7
Visits: ['V08']
kxr_fta_duryea10.sas7bdat	Var Cnt: 7
Visits: ['V10']
(35118, 8)

Starting dataframe size: 5.92MB


In [245]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [246]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 5 	Cols to convert: 3	 Total col cnt: 8

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0     True    False     False   False      2
1     True     True     False   False      1

Numeric types of columns:
num_type
float    1
Name: count, dtype: int64

Largest number of unique strings: 19944


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
READPRJ,Project,1,{17},0,{},True,,,,,35118,0,0,0
FTANGLE,BL kXR reading (JD): femoral tibial angle (FTA...,0,{},2,"{P, T}",False,float,1850.0,10.17,-18.02,645,34473,0,0
BRCDJD,BL kXR reading (JD): barcode of image analyzed...,19944,"{016600855104, 016602586401, 016603773601, 016...",0,{},False,,,,,35118,0,0,0


In [247]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only floats, missing, and NA values
'float': ['FTANGLE'],

# Columns with only strings, missing, and NA values
'cat': ['READPRJ', 'BRCDJD'],

}


Handled columns: 3


In [248]:
targets = {
# Columns with only floats, missing, and NA values
'float': ['FTANGLE'],

# Columns with only strings, missing, and NA values
'cat': ['BRCDJD'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [249]:
# Clean up the side var and make an index
new_df['SIDE'] = new_df['SIDE'].cat.rename_categories({'1: Right': 'RIGHT', '2: Left': 'LEFT'})
new_df.set_index('SIDE', append=True, inplace=True) #  Note that indices are unique yet, Prj 15 and Prj 37 contain repeats

In [250]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)

Columns still object type:  ['READPRJ']

VERSION    category
READPRJ      object
FTAFLAG    category
FTANGLE     float32
BRCDJD     category
dtype: object

Missing values present, shadow dataframe created.
              FTANGLE
ID      Visit        
9000099 V00       NaN
        V00       NaN
9000296 V00       NaN
        V00       NaN
9000622 V00       NaN
...               ...
9999510 V10       NaN
9999862 V10       NaN
        V10       NaN
9999865 V10       NaN
9999878 V10       NaN

[35118 rows x 1 columns]


In [251]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 4.31MB
Shadow dataframe size: 0.24MB


In [252]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## kxr_qjsw_duryea

In [253]:
prefix = 'kxr_qjsw_duryea'
column_uniformity_check(prefix)


kxr_qjsw_duryea00.sas7bdat: (6245, 36)
['ID', 'SIDE', 'readprj', 'VERSION', 'CFWDTH', 'MCMJSW', 'JSW175', 'JSW200', 'JSW250', 'BARCDJD', 'JSW300', 'JSW225', 'TPCFDS', 'BMANG', 'JSW150', 'JSW275', 'LJSW850', 'LJSW900', 'LJSW700', 'LJSW825', 'LJSW750', 'LJSW875', 'LJSW725', 'LJSW775', 'LJSW800', 'XMJSW', 'INCPLL', 'INCPLM', 'NOMMJSW', 'NOLJSWX', 'MJSWBB', 'INCSTPS', 'LTPMEBE', 'NOMJSWX', 'NOLMIN', 'IMPIXSZ']

kxr_qjsw_duryea01.sas7bdat: (5874, 36)
Names only differ by case

kxr_qjsw_duryea03.sas7bdat: (5544, 36)
Names only differ by case

kxr_qjsw_duryea05.sas7bdat: (5349, 36)
Names only differ by case

kxr_qjsw_duryea06.sas7bdat: (5134, 36)
Names only differ by case

kxr_qjsw_duryea08.sas7bdat: (3557, 36)
Names only differ by case

kxr_qjsw_duryea10.sas7bdat: (3689, 36)
Names only differ by case

Total rows: 35392


In [254]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

kxr_qjsw_duryea00.sas7bdat	Var Cnt: 36
Visits: ['V00']
kxr_qjsw_duryea01.sas7bdat	Var Cnt: 36
Visits: ['V01']
kxr_qjsw_duryea03.sas7bdat	Var Cnt: 36
Visits: ['V03']
kxr_qjsw_duryea05.sas7bdat	Var Cnt: 36
Visits: ['V05']
kxr_qjsw_duryea06.sas7bdat	Var Cnt: 36
Visits: ['V06']
kxr_qjsw_duryea08.sas7bdat	Var Cnt: 36
Visits: ['V08']
kxr_qjsw_duryea10.sas7bdat	Var Cnt: 36
Visits: ['V10']
(35392, 37)

Starting dataframe size: 28.76MB


In [255]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [256]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 13 	Cols to convert: 24	 Total col cnt: 37

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0    False     True     False    True      1
1     True    False     False   False      2
2     True     True     False   False     21

Numeric types of columns:
num_type
float    22
Name: count, dtype: int64

Largest number of unique strings: 20079


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
READPRJ,Project,1,{16},0,{},True,,,,,35392,0,0,0
CFWDTH,BL/FU kXR reading (JD): width of femoral condy...,0,{},1,{P},False,float,12550.0,116.733,57.552,399,34993,0,0
MCMJSW,BL: reading (JD): medial minimum JSW [mm],0,{},2,"{P, T}",False,float,4114.0,10.296,0.0,592,34800,0,0
JSW175,BL/FU kXR reading (JD): medial JSW at x=0.175 ...,0,{},2,"{P, T}",False,float,861.0,10.9,0.0,775,34617,0,0
JSW200,BL/FU kXR reading (JD): medial JSW at x=0.200 ...,0,{},2,"{P, T}",False,float,871.0,11.31,0.0,657,34735,0,0
JSW250,BL/FU kXR reading (JD): medial JSW at x=0.250 ...,0,{},2,"{P, T}",False,float,906.0,11.71,0.0,574,34818,0,0
BARCDJD,BL/FU kXR reading (JD): barcode of image analy...,20079,"{016600855104, 016602586401, 016603773601, 016...",0,{},True,,,,,35392,0,0,0
JSW300,BL/FU kXR reading (JD): medial JSW at x=0.300 ...,0,{},2,"{P, T}",False,float,993.0,14.62,0.0,570,34822,0,0
JSW225,BL/FU kXR reading (JD): medial JSW at x=0.225 ...,0,{},2,"{P, T}",False,float,881.0,11.31,0.0,599,34793,0,0
TPCFDS,BL/FU kXR reading (JD): distance from tibial p...,0,{},3,"{A, P, T}",False,float,4345.0,20.648,0.0,864,34528,0,0


In [257]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only floats, missing, and NA values
'float': ['CFWDTH', 'MCMJSW', 'JSW175', 'JSW200', 'JSW250', 'JSW300', 'JSW225', 'TPCFDS', 'BMANG', 'JSW150', 'JSW275', 'LJSW850', 'LJSW900', 'LJSW700', 'LJSW825', 'LJSW750', 'LJSW875', 'LJSW725', 'LJSW775', 'LJSW800', 'XMJSW', 'IMPIXSZ'],

# Columns with only strings, missing, and NA values
'cat': ['READPRJ', 'BARCDJD'],

}


Handled columns: 24


In [258]:
targets = {
# Columns with only floats, missing, and NA values
'float': ['CFWDTH', 'MCMJSW', 'JSW175', 'JSW200', 'JSW250', 'JSW300', 'JSW225', 'TPCFDS', 'BMANG', 'JSW150', 'JSW275', 'LJSW850', 'LJSW900', 'LJSW700', 'LJSW825', 'LJSW750', 'LJSW875', 'LJSW725', 'LJSW775', 'LJSW800', 'XMJSW', 'IMPIXSZ'],

# Columns with only strings, missing, and NA values
'cat': ['BARCDJD'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [259]:
# Clean up the side var and make an index
new_df['SIDE'] = new_df['SIDE'].cat.rename_categories({'1: Right': 'RIGHT', '2: Left': 'LEFT'})
new_df.set_index('SIDE', append=True, inplace=True) #  Note that indices are unique yet, Prj 15 and Prj 37 contain repeats

In [260]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)

Columns still object type:  ['READPRJ']

VERSION    category
READPRJ      object
CFWDTH      float32
MCMJSW      float32
JSW175      float32
JSW200      float32
JSW250      float32
BARCDJD    category
JSW300      float32
JSW225      float32
TPCFDS      float32
BMANG       float32
JSW150      float32
JSW275      float32
LJSW850     float32
LJSW900     float32
LJSW700     float32
LJSW825     float32
LJSW750     float32
LJSW875     float32
LJSW725     float32
LJSW775     float32
LJSW800     float32
XMJSW       float32
INCPLL     category
INCPLM     category
NOMMJSW    category
NOLJSWX    category
MJSWBB     category
INCSTPS    category
LTPMEBE    category
NOMJSWX    category
NOLMIN     category
IMPIXSZ     float32
dtype: object

Missing values present, shadow dataframe created.
              CFWDTH MCMJSW JSW175 JSW200 JSW250 JSW300 JSW225 TPCFDS BMANG  \
ID      Visit                                                                 
9000099 V00      NaN    NaN    NaN    NaN    NaN    NaN 

In [261]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 7.47MB
Shadow dataframe size: 0.92MB


In [262]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## kxr_qjsw_rel_duryea

In [263]:
prefix = 'kxr_qjsw_rel_duryea'
column_uniformity_check(prefix)


kxr_qjsw_rel_duryea00.sas7bdat: (556, 36)
['ID', 'SIDE', 'READPRJ', 'VERSION', 'CFWDTH', 'MCMJSW', 'JSW175', 'JSW200', 'JSW250', 'BARCDJD', 'JSW300', 'JSW225', 'TPCFDS', 'BMANG', 'JSW150', 'JSW275', 'LJSW850', 'LJSW900', 'LJSW700', 'LJSW825', 'LJSW750', 'LJSW875', 'LJSW725', 'LJSW775', 'LJSW800', 'XMJSW', 'INCPLL', 'INCPLM', 'NOMMJSW', 'NOLJSWX', 'MJSWBB', 'INCSTPS', 'LTPMEBE', 'NOMJSWX', 'NOLMIN', 'IMPIXSZ']

kxr_qjsw_rel_duryea01.sas7bdat: (524, 36)

kxr_qjsw_rel_duryea03.sas7bdat: (500, 36)

kxr_qjsw_rel_duryea05.sas7bdat: (144, 36)

Total rows: 1724


In [264]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

kxr_qjsw_rel_duryea00.sas7bdat	Var Cnt: 36
Visits: ['V00']
kxr_qjsw_rel_duryea01.sas7bdat	Var Cnt: 36
Visits: ['V01']
kxr_qjsw_rel_duryea03.sas7bdat	Var Cnt: 36
Visits: ['V03']
kxr_qjsw_rel_duryea05.sas7bdat	Var Cnt: 36
Visits: ['V05']
(1724, 37)

Starting dataframe size: 1.42MB


In [265]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [266]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 13 	Cols to convert: 24	 Total col cnt: 37

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0    False     True     False    True      1
1     True    False     False   False      2
2     True     True     False   False     21

Numeric types of columns:
num_type
float    22
Name: count, dtype: int64

Largest number of unique strings: 423


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
READPRJ,Project,4,"{20D, 20C, 20B, 20A}",0,{},False,,,,,1724,0,0,0
CFWDTH,BL/FU kXR reading (JD): width of femoral condy...,0,{},1,{P},False,float,875.0,109.83,71.205,15,969,0,0
MCMJSW,BL: reading (JD): medial minimum JSW [mm],0,{},2,"{P, T}",False,float,679.0,8.2,0.0,28,1512,0,0
JSW175,BL/FU kXR reading (JD): medial JSW at x=0.175 ...,0,{},2,"{P, T}",False,float,551.0,8.67,0.0,39,1501,0,0
JSW200,BL/FU kXR reading (JD): medial JSW at x=0.200 ...,0,{},2,"{P, T}",False,float,544.0,8.57,0.0,33,1507,0,0
JSW250,BL/FU kXR reading (JD): medial JSW at x=0.250 ...,0,{},2,"{P, T}",False,float,558.0,9.4,0.0,34,1506,0,0
BARCDJD,BL/FU kXR reading (JD): barcode of image analy...,423,"{, 016600962006, 016600065404, 016602164601, 0...",0,{},False,,,,,1724,0,0,0
JSW300,BL/FU kXR reading (JD): medial JSW at x=0.300 ...,0,{},2,"{P, T}",False,float,582.0,11.01,0.0,34,1506,0,0
JSW225,BL/FU kXR reading (JD): medial JSW at x=0.225 ...,0,{},2,"{P, T}",False,float,539.0,8.77,0.0,33,1507,0,0
TPCFDS,BL/FU kXR reading (JD): distance from tibial p...,0,{},2,"{P, T}",False,float,429.0,10.947,0.421,25,587,0,0


In [267]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only floats, missing, and NA values
'float': ['CFWDTH', 'MCMJSW', 'JSW175', 'JSW200', 'JSW250', 'JSW300', 'JSW225', 'TPCFDS', 'BMANG', 'JSW150', 'JSW275', 'LJSW850', 'LJSW900', 'LJSW700', 'LJSW825', 'LJSW750', 'LJSW875', 'LJSW725', 'LJSW775', 'LJSW800', 'XMJSW', 'IMPIXSZ'],

# Columns with only strings, missing, and NA values
'cat': ['READPRJ', 'BARCDJD'],

}


Handled columns: 24


In [268]:
targets = {
# Columns with only floats, missing, and NA values
'float': ['CFWDTH', 'MCMJSW', 'JSW175', 'JSW200', 'JSW250', 'JSW300', 'JSW225', 'TPCFDS', 'BMANG', 'JSW150', 'JSW275', 'LJSW850', 'LJSW900', 'LJSW700', 'LJSW825', 'LJSW750', 'LJSW875', 'LJSW725', 'LJSW775', 'LJSW800', 'XMJSW', 'IMPIXSZ'],

# Columns with only strings, missing, and NA values
'cat': ['BARCDJD'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [269]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)

Columns still object type:  ['READPRJ']

VERSION    category
SIDE       category
READPRJ      object
CFWDTH      float32
MCMJSW      float32
JSW175      float32
JSW200      float32
JSW250      float32
BARCDJD    category
JSW300      float32
JSW225      float32
TPCFDS      float32
BMANG       float32
JSW150      float32
JSW275      float32
LJSW850     float32
LJSW900     float32
LJSW700     float32
LJSW825     float32
LJSW750     float32
LJSW875     float32
LJSW725     float32
LJSW775     float32
LJSW800     float32
XMJSW       float32
INCPLL     category
INCPLM     category
NOMMJSW    category
NOLJSWX    category
MJSWBB     category
INCSTPS    category
LTPMEBE    category
NOMJSWX    category
NOLMIN     category
IMPIXSZ     float32
dtype: object

Missing values present, shadow dataframe created.
              CFWDTH MCMJSW JSW175 JSW200 JSW250 JSW300 JSW225  \
ID      Visit                                                    
9004184 V00      NaN    NaN    NaN    NaN    NaN    NaN    NaN

In [270]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 0.35MB
Shadow dataframe size: 0.05MB


In [271]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## kxr_sq_bu

In [394]:
prefix = 'kxr_sq_bu'
column_uniformity_check(prefix)


kxr_sq_bu00.sas7bdat: (12813, 24)
['ID', 'SIDE', 'READPRJ', 'VERSION', 'BARCDBU', 'XROSFM', 'XRSCFM', 'XRCYFM', 'XRJSM', 'XRCHM', 'XROSTM', 'XRSCTM', 'XRCYTM', 'XRATTM', 'XRKL', 'XROSFL', 'XRSCFL', 'XRCYFL', 'XRJSL', 'XRCHL', 'XROSTL', 'XRSCTL', 'XRCYTL', 'XRATTL']

kxr_sq_bu01.sas7bdat: (8483, 24)
Names only differ by case

kxr_sq_bu03.sas7bdat: (7988, 24)
Names only differ by case

kxr_sq_bu05.sas7bdat: (7843, 24)
Names only differ by case

kxr_sq_bu06.sas7bdat: (10813, 24)

kxr_sq_bu08.sas7bdat: (3580, 24)

kxr_sq_bu10.sas7bdat: (3675, 24)

Total rows: 55195


In [395]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

kxr_sq_bu00.sas7bdat	Var Cnt: 24
Visits: ['V00']
kxr_sq_bu01.sas7bdat	Var Cnt: 24
Visits: ['V01']
kxr_sq_bu03.sas7bdat	Var Cnt: 24
Visits: ['V03']
kxr_sq_bu05.sas7bdat	Var Cnt: 24
Visits: ['V05']
kxr_sq_bu06.sas7bdat	Var Cnt: 24
Visits: ['V06']
kxr_sq_bu08.sas7bdat	Var Cnt: 24
Visits: ['V08']
kxr_sq_bu10.sas7bdat	Var Cnt: 24
Visits: ['V10']
(55195, 25)

Starting dataframe size: 11.93MB


In [396]:
tmp_df.dtypes

ID           uint32
Visit      category
VERSION    category
SIDE       category
READPRJ      object
BARCDBU      object
XROSFM     category
XRSCFM     category
XRCYFM     category
XRJSM        object
XRCHM      category
XROSTM     category
XRSCTM     category
XRCYTM     category
XRATTM     category
XRKL       category
XROSFL     category
XRSCFL     category
XRCYFL     category
XRJSL        object
XRCHL      category
XROSTL     category
XRSCTL     category
XRCYTL     category
XRATTL     category
dtype: object

Documentation says that there is no longer a differentiation between project 37 and project 42. It advises all project 42 labels be converted to 37.  Done here.

In [401]:
# Data cleanup 
tmp_df.loc[idx[tmp_df['READPRJ'] == '42'], 'READPRJ'] = '37'
tmp_df.loc[tmp_df['READPRJ'] == '15', 'READPRJ'] = 15       
tmp_df.loc[tmp_df['READPRJ'] == '37', 'READPRJ'] = 37
tmp_df['READPRJ'] = pd.to_numeric(tmp_df['READPRJ'], downcast='unsigned')

# Convert the numeric strings to actual ints
colon_ints_to_int = {'0: 0': '0', '1: 1': '1', '2: 2': '2', '3: 3': '3', '4: 4': '4', '5: 5': '5', '6: 6': '6',
                     '7: 7': '7', '8: 8': '8', '9: 9': '9', '10: 10': '10', '11: 11': '11', '12: 12': '12'}

for col in tmp_df.select_dtypes(include='category'):
    if col not in ['VERSION', 'READPRJ', 'BARCDBU', 'XRJSM', 'XRJSL']:
        tmp_df[col] = tmp_df[col].cat.rename_categories(colon_ints_to_int)
        tmp_df[col] = tmp_df[col].astype('object')

In [402]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 3 	Cols to convert: 22	 Total col cnt: 25

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0     True    False     False   False     20
1     True     True     False   False      2

Numeric types of columns:
num_type
float    2
Name: count, dtype: int64

Largest number of unique strings: 24554


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Visit,Which visit this data was collected during,7,"{V08, V00, V03, V05, V10, V01, V06}",0,{},False,,,,,55195,0,0,0
SIDE,Side,2,"{1: Right, 2: Left}",0,{},False,,,,,55195,0,0,0
BARCDBU,BL/FU kXR reading (BU): barcode of image analyzed,24554,"{, 016600855104, 016601448204, 016602586401, 0...",0,{},False,,,,,55195,0,0,0
XROSFM,BL/FU kXR reading (BU): osteophytes (OARSI gra...,8,"{2, .M: Missing, .T: Technical Problems, .P: P...",0,{},False,,,,,55195,0,0,0
XRSCFM,BL/FU kXR reading (BU): sclerosis (OARSI grade...,8,"{2, .T: Technical Problems, .P: Prosthetic, .:...",0,{},False,,,,,55195,0,0,0
XRCYFM,BL/FU kXR reading (BU): cysts (Grades 0-1) fem...,6,"{.T: Technical Problems, .P: Prosthetic, .: Mi...",0,{},False,,,,,55195,0,0,0
XRJSM,BL(BU): joint space narrowing (OARSI grades 0-...,0,{},2,"{P, T}",False,float,14.0,3.0,0.0,555,54640,0,0
XRCHM,BL/FU kXR reading (BU): chondrocalcinosis (Gra...,7,"{.M: Missing, .T: Technical Problems, .P: Pros...",0,{},False,,,,,55195,0,0,0
XROSTM,BL/FU kXR reading (BU): osteophytes (OARSI gra...,8,"{2, .M: Missing, .T: Technical Problems, .P: P...",0,{},False,,,,,55195,0,0,0
XRSCTM,BL/FU kXR reading (BU): sclerosis (OARSI grade...,8,"{2, .T: Technical Problems, .P: Prosthetic, .:...",0,{},False,,,,,55195,0,0,0


In [403]:
tmp_df.dtypes

ID           uint32
Visit        object
VERSION    category
SIDE         object
READPRJ       uint8
BARCDBU      object
XROSFM       object
XRSCFM       object
XRCYFM       object
XRJSM        object
XRCHM        object
XROSTM       object
XRSCTM       object
XRCYTM       object
XRATTM       object
XRKL         object
XROSFL       object
XRSCFL       object
XRCYFL       object
XRJSL        object
XRCHL        object
XROSTL       object
XRSCTL       object
XRCYTL       object
XRATTL       object
dtype: object

In [404]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only floats, missing, and NA values
'float': ['XRJSM', 'XRJSL'],

# Columns with only strings, missing, and NA values
'cat': ['Visit', 'SIDE', 'BARCDBU', 'XROSFM', 'XRSCFM', 'XRCYFM', 'XRCHM', 'XROSTM', 'XRSCTM', 'XRCYTM', 'XRATTM', 'XRKL', 'XROSFL', 'XRSCFL', 'XRCYFL', 'XRCHL', 'XROSTL', 'XRSCTL', 'XRCYTL', 'XRATTL'],

}


Handled columns: 22


In [362]:
targets = {
# Columns with only floats, missing, and NA values
'float': ['XRJSM', 'XRJSL'],

# Columns with only strings, missing, and NA values
'cat': ['BARCDBU'],

}

targets = {
# Columns with only floats, missing, and NA values
'float': ['XRJSM', 'XRJSL'],

# Columns with only strings, missing, and NA values
'cat': ['BARCDBU', 'XROSFM', 'XRSCFM', 'XRCYFM', 'XRCHM', 'XROSTM', 'XRSCTM', 'XRCYTM', 'XRATTM', 'XRKL', 'XROSFL', 'XRSCFL', 'XRCYFL', 'XRCHL', 'XROSTL', 'XRSCTL', 'XRCYTL', 'XRATTL'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [363]:
# Data cleanup
new_df['BARCDBU'] = new_df['BARCDBU'].cat.rename_categories({'           T': 'T'})

# Clean up the side var and make an index
new_df['SIDE'] = new_df['SIDE'].cat.rename_categories({'1: Right': 'RIGHT', '2: Left': 'LEFT'})
new_df.set_index('SIDE', append=True, inplace=True) #  Note that indices are NOT unique yet, Prj 15 and Prj 37 contain repeats

AttributeError: Can only use .cat accessor with a 'category' dtype

In [None]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)

In [None]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))

In [None]:
used_vals = set()
for col in new_df.columns:
    if col not in ['VERSION', 'READPRJ', 'BARCDBU', 'XRJSM', 'XRJSL']:
        tmp = new_df[col].value_counts()
        tmp = tmp[tmp > 0]
        used_vals |= set(list(tmp.index))
        
used_vals

In [None]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## kxr_sq_rel_bu

In [None]:
prefix = 'kxr_sq_rel_bu'
column_uniformity_check(prefix)

In [None]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

In [None]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [None]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

In [None]:
suggest_conversions(data_stats_df)

In [None]:
targets = {
# Columns with only floats, missing, and NA values
'float': ['XRJSM', 'XRJSL'],

# Columns with only strings, missing, and NA values
'cat': ['BARCDBU'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [None]:
# Data cleanup 
new_df.loc[idx[new_df['READPRJ'] == '42'], 'READPRJ'] = '37'
new_df.loc[new_df['READPRJ'] == '15', 'READPRJ'] = 15       
new_df.loc[new_df['READPRJ'] == '37', 'READPRJ'] = 37
new_df['READPRJ'] = new_df['READPRJ'].astype('category')

In [None]:
# Clean up the side var and make an index
new_df['SIDE'] = new_df['SIDE'].cat.rename_categories({'1: Right': 'RIGHT', '2: Left': 'LEFT'})
new_df.set_index('SIDE', append=True, inplace=True) #  Note that indices are unique yet, Prj 15 and Prj 37 contain repeats

In [None]:
# Convert the numeric strings to actual ints
for col in new_df.columns:
    if col not in ['VERSION', 'READPRJ', 'BARCDBU', 'XRJSM', 'XRJSL']:
        new_df[col] = new_df[col].cat.rename_categories({'0: 0': 0, '1: 1': 1, '2: 2': 2, '3: 3': 3, '4: 4': 4})

In [None]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)

In [None]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))

In [None]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## measinventory

In [405]:
prefix = 'measinventory'
column_uniformity_check(prefix)


measinventory.sas7bdat: (4796, 381)
['id', 'VERSION', 'cohort', 'AGE', 'SEX', 'RACE', 'HISP', 'VisitType12', 'VisitType18', 'VisitType24', 'VisitType30', 'VisitType36', 'VisitType48', 'VisitType60', 'VisitType72', 'VisitType84', 'VisitType96', 'VisitType108', 'XRKLR', 'OAGRDR', 'RKSX', 'IndexKneeR', 'InC03_BLR', 'InC04_BLR', 'InC07_BLR', 'InC08_BLR', 'InC09_BLR', 'InC10_BLR', 'InC15_BLR', 'InC16_BLR', 'InC17_BLR', 'InC18_BLR', 'InC22_BLR', 'InC37_BLR', 'InC39_BLR', 'InC40_BLR', 'InC60_BLR', 'InC63_BLR', 'InC65_BLR', 'InC66_BLR', 'InC30_BLR', 'InC03_12mR', 'InC04_12mR', 'InC07_12mR', 'InC08_12mR', 'InC09_12mR', 'InC10_12mR', 'InC15_12mR', 'InC16_12mR', 'InC17_12mR', 'InC18_12mR', 'InC22_12mR', 'InC32_12mR', 'InC37_12mR', 'InC39_12mR', 'InC40_12mR', 'InC60_12mR', 'InC61_12mR', 'InC63_12mR', 'InC65_12mR', 'InC66_12mR', 'InC03_24mR', 'InC04_24mR', 'InC07_24mR', 'InC08_24mR', 'InC09_24mR', 'InC10_24mR', 'InC15_24mR', 'InC16_24mR', 'InC17_24mR', 'InC18_24mR', 'InC22_24mR', 'InC32_24mR', 'In

In [406]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

measinventory.sas7bdat	Var Cnt: 381
Visits: ['P01', 'P02', 'V00', 'V99']
(19184, 29)

Starting dataframe size: 0.95MB


In [407]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [408]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 28 	Cols to convert: 1	 Total col cnt: 29

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0    False     True     False    True      1

Numeric types of columns:
num_type
unsigned    1
Name: count, dtype: int64

Largest number of unique strings: 0


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
AGE,,0,,0,,,unsigned,36,79.0,45.0,0,4796,0,14388


In [409]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['AGE'],

}


Handled columns: 1


In [410]:
targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['AGE'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [411]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)


VERSION    category
OAGRDR     category
RKSX       category
OAGRDL     category
LKSX       category
SEX        category
RACE       category
HISP       category
AGE           UInt8
XRKLR      category
XRKLL      category
ERKBLRP    category
ELKBLRP    category
ERKRPCF    category
ELKRPCF    category
ERKRPSN    category
ELKRPSN    category
ERHBLRP    category
ELHBLRP    category
ERHRPCF    category
ELHRPCF    category
ERHRPSN    category
ELHRPSN    category
ERKVSRP    category
ELKVSRP    category
ERHVSRP    category
ELHVSRP    category
dtype: object


In [412]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 0.77MB
Shadow dataframe size: 0.15MB


In [413]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## MIF

In [414]:
prefix = 'mif'
column_uniformity_check(prefix)


mif00.sas7bdat: (15578, 9)
['ID', 'VERSION', 'MIFNAME', 'FRMCODE', 'MIFFREQ', 'MIFDUR', 'MIFUSE', 'INGCODE', 'INGNAME']

mif01.sas7bdat: (14839, 9)

mif02.sas7bdat: (1125, 9)

mif03.sas7bdat: (14876, 9)

mif04.sas7bdat: (1772, 9)

mif05.sas7bdat: (15233, 9)

mif06.sas7bdat: (15492, 9)

mif08.sas7bdat: (14527, 9)

mif10.sas7bdat: (14342, 9)

Total rows: 107784


In [415]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

mif00.sas7bdat	Var Cnt: 9
Visits: ['V00']
mif01.sas7bdat	Var Cnt: 9
Visits: ['V01']
mif02.sas7bdat	Var Cnt: 9
Visits: ['V02']
mif03.sas7bdat	Var Cnt: 9
Visits: ['V03']
mif04.sas7bdat	Var Cnt: 9
Visits: ['V04']
mif05.sas7bdat	Var Cnt: 9
Visits: ['V05']
mif06.sas7bdat	Var Cnt: 9
Visits: ['V06']
mif08.sas7bdat	Var Cnt: 9
Visits: ['V08']
mif10.sas7bdat	Var Cnt: 9
Visits: ['V10']
(107784, 10)

Starting dataframe size: 19.05MB


In [416]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

It isn't clear what to do with INGCODE, they seem to be 9 digits plus '.0' except for some values labelled 'M'

In [417]:
data_stats_summary(tmp_df, data_stats_df)

data_stats_df

Already defined cols: 7 	Cols to convert: 3	 Total col cnt: 10

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0     True    False     False   False      2
1     True     True     False   False      1

Numeric types of columns:
num_type
unsigned    1
Name: count, dtype: int64

Largest number of unique strings: 10749


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
MIFNAME,EV:MIF: medication name,10749,"{BENAZPRIL HCTZ, HYDROCO-APAPS-500, PROPOX-NIA...",0,{},False,,,,,107784,0,0,0
INGCODE,EV:MIF: medication ingredient code (calc),0,{},1,{M},False,unsigned,928.0,99999999.0,2000002.0,1,107783,0,0
INGNAME,EV:MIF: medication ingredient name (calc),929,"{, HALAZEPAM, METHYLTESTOSTERONE, ERGOTAMINE, ...",0,{},False,,,,,107784,0,0,0


In [418]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['INGCODE'],

# Columns with only strings, missing, and NA values
'cat': ['MIFNAME', 'INGNAME'],

}


Handled columns: 3


In [419]:
targets = {
# Columns with only unsigned ints, missing, and NA values
'unsigned': ['INGCODE'],

# Columns with only strings, missing, and NA values
'cat': ['MIFNAME', 'INGNAME'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [420]:
sanity_check(new_df)
print()
print(new_df.dtypes)

if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)


VERSION    category
MIFNAME    category
FRMCODE    category
MIFFREQ    category
MIFDUR     category
MIFUSE     category
INGCODE      UInt32
INGNAME    category
dtype: object

Missing values present, shadow dataframe created.
              INGCODE
ID      Visit        
9000296 V00       NaN
        V00       NaN
        V00       NaN
9000622 V00       NaN
        V00       NaN
...               ...
9999878 V10       NaN
        V10       NaN
        V10       NaN
        V10       NaN
        V10       NaN

[107784 rows x 1 columns]


In [421]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 2.94MB
Shadow dataframe size: 0.52MB


In [422]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## MRI

In [423]:
prefix = 'mri'
column_uniformity_check(prefix)


mri00.sas7bdat: (77224, 12)
['ID', 'VERSION', 'MEXAMTP', 'MNDREAS', 'MRBARCD', 'MRCOMP', 'MRDATE', 'MRSIDE', 'MRSURDY', 'MRTECID', 'QCRESLT', 'SCNUPGR']

mri01.sas7bdat: (68817, 12)

mri02.sas7bdat: (4045, 12)

mri03.sas7bdat: (76306, 13)
['MRMARK']

mri04.sas7bdat: (6924, 12)
['MRMARK']

mri05.sas7bdat: (69210, 12)

mri06.sas7bdat: (75789, 12)

mri08.sas7bdat: (57799, 15)
['MQCFLAG', 'CLUPGR', 'MQCCMNT']

mri10.sas7bdat: (65624, 16)
['MRMARK']

Total rows: 501738


In [424]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

mri00.sas7bdat	Var Cnt: 12
Visits: ['V00']
mri01.sas7bdat	Var Cnt: 12
Visits: ['V01']
mri02.sas7bdat	Var Cnt: 12
Visits: ['V02']
mri03.sas7bdat	Var Cnt: 13
Visits: ['V03']
mri04.sas7bdat	Var Cnt: 12
Visits: ['V04']
mri05.sas7bdat	Var Cnt: 12
Visits: ['V05']
mri06.sas7bdat	Var Cnt: 12
Visits: ['V06']
mri08.sas7bdat	Var Cnt: 15
Visits: ['V08']
mri10.sas7bdat	Var Cnt: 16
Visits: ['V10']
(501738, 17)

Starting dataframe size: 170.98MB


In [425]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [426]:
data_stats_summary(tmp_df, data_stats_df)

data_stats_df

Already defined cols: 11 	Cols to convert: 6	 Total col cnt: 17

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0     True    False     False   False      4
1     True    False      True   False      1
2     True     True     False   False      1

Numeric types of columns:
num_type
na          2
unsigned    1
Name: count, dtype: int64

Largest number of unique strings: 315633


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
MEXAMTP,EV MRI:MRI series type (calc),16,"{R COR MPR, R SAG 3D DESS WE, OAI Prescription...",0,{},False,,,,,501738,0,0,0
MRBARCD,EV MRI:MRI series barcode (calc),315633,"{, 016612098016, 016611159513, 016610778203, 0...",0,{},False,,,,,501738,0,0,0
MRDATE,EV MRI:Date MRI series completed (calc),0,{},1,{A},False,na,1.0,,,186092,0,315596,0
MRSURDY,EV MRI:Days between most recent surgery and MR...,0,{},2,"{A, M}",False,unsigned,623.0,1318.0,1.0,465951,35723,0,0
MRTECID,EV MRI:MRI tech ID (calc),79,"{, 5021, 1018, 3090, 3044, 3123, A005, 3053, E...",0,{},False,,,,,501738,0,0,0
MQCCMNT,FU MRI:MRI QC comment (calc),2,"{, MR Gradient Out of Spec}",0,{},False,na,1.0,,,123423,0,0,0


In [427]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only dates, missing, and NA values
'date': ['MRDATE'],

# Columns with only unsigned ints, missing, and NA values
'unsigned': ['MRSURDY'],

# Columns with only strings, missing, and NA values
'cat': ['MEXAMTP', 'MRBARCD', 'MRTECID', 'MQCCMNT'],

}


Handled columns: 6


In [428]:
targets = {
# Columns with only dates, missing, and NA values
'date': ['MRDATE'],

# Columns with only unsigned ints, missing, and NA values
'unsigned': ['MRSURDY'],

# Columns with only strings, missing, and NA values
'cat': ['MEXAMTP', 'MRBARCD', 'MRTECID', 'MQCCMNT'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [429]:
sanity_check(new_df)
print()
print(new_df.dtypes)

if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)


VERSION          category
MEXAMTP          category
MNDREAS          category
MRBARCD          category
MRCOMP           category
MRDATE     datetime64[ns]
MRSIDE           category
MRSURDY            UInt16
MRTECID          category
QCRESLT          category
SCNUPGR          category
MRMARK           category
CLUPGR           category
MQCCMNT          category
MQCFLAG          category
dtype: object

Missing values present, shadow dataframe created.
                         MRDATE           MRSURDY
ID      Visit                                    
9000099 V00    .A: Not Expected  .A: Not Expected
        V00    .A: Not Expected  .A: Not Expected
        V00                 NaN  .A: Not Expected
        V00                 NaN  .A: Not Expected
        V00                 NaN  .A: Not Expected
...                         ...               ...
9999878 V10                 NaN  .A: Not Expected
        V10                 NaN  .A: Not Expected
        V10                 NaN  .A: Not Exp

In [430]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 43.34MB
Shadow dataframe size: 2.51MB


In [431]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## outcomes

In [432]:
prefix = 'outcomes'
column_uniformity_check(prefix)


outcomes99.sas7bdat: (4796, 92)
['id', 'version', 'RNTCNT', 'ERKDATE', 'ERKFLDT', 'ERKRPCF', 'ERKTLPR', 'ERKTPPR', 'ERKBLRP', 'ERKVSRP', 'ERKPODX', 'ERKDAYS', 'ERKVSPR', 'ERKVSAF', 'ERKRPSN', 'ERKXRPR', 'ERKXRAF', 'ELKDATE', 'ELKFLDT', 'ELKRPCF', 'ELKTLPR', 'ELKTPPR', 'ELKBLRP', 'ELKVSRP', 'ELKPODX', 'ELKDAYS', 'ELKVSPR', 'ELKVSAF', 'ELKRPSN', 'ELKXRPR', 'ELKXRAF', 'ERHDATE', 'ERHFLDT', 'ERHRPCF', 'ERHPODX', 'ERHDAYS', 'ERHVSPR', 'ERHVSAF', 'ERHRPSN', 'ERHXRPR', 'ERHXRAF', 'ERHVSRP', 'ERHBLRP', 'ELHDATE', 'ELHFLDT', 'ELHRPCF', 'ELHPODX', 'ELHDAYS', 'ELHVSPR', 'ELHVSAF', 'ELHRPSN', 'ELHXRPR', 'ELHXRAF', 'ELHVSRP', 'ELHBLRP', 'EXLVSQD', 'ERXIOA', 'ERXIOAN', 'ERXNOA', 'ERXNOAN', 'ERKLOA', 'ERKLOAN', 'ELXIOA', 'ELXIOAN', 'ELXNOA', 'ELXNOAN', 'ELKLOA', 'ELKLOAN', 'ERXJSNM', 'ERXJSNL', 'ELXJSNM', 'ELXJSNL', 'ERJSFP', 'ERJSLP', 'ERNJSLP', 'ERJSTFP', 'ERJSFW', 'ERJSLW', 'ERNJSLW', 'ERJSTFW', 'ELJSFP', 'ELJSLP', 'ELNJSLP', 'ELJSTFP', 'ELJSFW', 'ELJSLW', 'ELNJSLW', 'ELJSTFW', 'EDDCF', 'EDDDATE'

In [433]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

outcomes99.sas7bdat	Var Cnt: 92
Visits: ['V99']
(4796, 93)

Starting dataframe size: 2.04MB


In [434]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [435]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 88 	Cols to convert: 5	 Total col cnt: 93

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0     True    False      True   False      5

Numeric types of columns:
Series([], Name: count, dtype: int64)

Largest number of unique strings: 0


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
ERKDATE,"Outcomes: right knee, date of follow-up knee r...",0,{},1,{A},False,,,,,4519,0,277,0
ELKDATE,"Outcomes: left knee, date of follow-up knee re...",0,{},1,{A},False,,,,,4525,0,271,0
ERHDATE,"Outcomes: right hip, date of follow-up hip rep...",0,{},1,{A},False,,,,,4678,0,118,0
ELHDATE,"Outcomes: left hip, date of follow-up hip repl...",0,{},1,{A},False,,,,,4684,0,112,0
EDDDATE,Outcomes: date of death (calc),0,{},1,{A},False,,,,,4481,0,315,0


In [436]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only dates, missing, and NA values
'date': ['ERKDATE', 'ELKDATE', 'ERHDATE', 'ELHDATE', 'EDDDATE'],

}


Handled columns: 5


In [437]:
targets = {
# Columns with only dates, missing, and NA values
'date': ['ERKDATE', 'ELKDATE', 'ERHDATE', 'ELHDATE', 'EDDDATE'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [438]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)


VERSION          category
RNTCNT           category
ERKDATE    datetime64[ns]
ERKFLDT          category
ERKRPCF          category
                ...      
ELJSTFW          category
EDDCF            category
EDDDATE    datetime64[ns]
EDDFLDT          category
EDDVSPR          category
Length: 91, dtype: object

Missing values present, shadow dataframe created.
                        ERKDATE           ELKDATE           ERHDATE  \
ID      Visit                                                         
9000099 V99    .A: Not Expected  .A: Not Expected  .A: Not Expected   
9000296 V99    .A: Not Expected  .A: Not Expected  .A: Not Expected   
9000622 V99    .A: Not Expected  .A: Not Expected  .A: Not Expected   
9000798 V99    .A: Not Expected  .A: Not Expected  .A: Not Expected   
9001104 V99    .A: Not Expected  .A: Not Expected  .A: Not Expected   
...                         ...               ...               ...   
9999365 V99    .A: Not Expected  .A: Not Expected  .A: Not Expected 

In [439]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 1.02MB
Shadow dataframe size: 0.15MB


In [440]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## sageancillarystudy

In [441]:
prefix = 'sageancillarystudy'
column_uniformity_check(prefix)


sageancillarystudy.sas7bdat: (746, 26)
['ID', 'sAGEdate', 'sAGEage', 'sAGEgender', 'sAGEdm', 'sAGEdmmeds', 'sAGEdmmouth', 'sAGEdminj', 'sAGEarm', 'sAGEnoarm', 'sAGEeqfail', 'sAGEcloth', 'sAGE', 'sAGEdate', 'sAGEage', 'sAGEgender', 'sAGEdm', 'sAGEdmmeds', 'sAGEdmmouth', 'sAGEdminj', 'sAGEarm', 'sAGEnoarm', 'sAGEeqfail', 'sAGEcloth', 'sAGE', 'Version']

Total rows: 746


In [442]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

sageancillarystudy.sas7bdat	Var Cnt: 26
Visits: ['V05', 'V06']
Unhandled data type:  SAGEGENDER GENDERF
Unhandled data type:  SAGEDM DIABF
Unhandled data type:  SAGEDMMEDS DIABMEDF
Unhandled data type:  SAGEDMMOUTH DIABMEDF
Unhandled data type:  SAGEDMINJ DIABMEDF
Unhandled data type:  SAGEARM ARMF
Unhandled data type:  SAGENOARM NOMEASF
Unhandled data type:  SAGEEQFAIL NOMEASF
Unhandled data type:  SAGECLOTH CLOTHF
Unhandled data type:  SAGEGENDER GENDERF
Unhandled data type:  SAGEDM DIABF
Unhandled data type:  SAGEDMMEDS DIABMEDF
Unhandled data type:  SAGEDMMOUTH DIABMEDF
Unhandled data type:  SAGEDMINJ DIABMEDF
Unhandled data type:  SAGEARM ARMF
Unhandled data type:  SAGENOARM NOMEASF
Unhandled data type:  SAGEEQFAIL NOMEASF
Unhandled data type:  SAGECLOTH CLOTHF
(1492, 15)

Starting dataframe size: 0.20MB


In [443]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [444]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 3 	Cols to convert: 12	 Total col cnt: 15

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0    False    False      True   False      1
1    False     True     False    True     11

Numeric types of columns:
num_type
unsigned    10
na           1
float        1
Name: count, dtype: int64

Largest number of unique strings: 0.0


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
SAGEDATE,,,,0,,,na,1,,,0,0,746,0
SAGEAGE,,0.0,,0,,,unsigned,37,83.0,48.0,0,746,0,746
SAGEGENDER,,0.0,,0,,,unsigned,3,1.0,0.0,0,746,0,746
SAGEDM,,0.0,,0,,,unsigned,3,1.0,0.0,0,746,0,746
SAGEDMMEDS,,0.0,,0,,,unsigned,4,2.0,0.0,0,746,0,746
SAGEDMMOUTH,,0.0,,0,,,unsigned,4,2.0,0.0,0,746,0,746
SAGEDMINJ,,0.0,,0,,,unsigned,4,2.0,0.0,0,746,0,746
SAGEARM,,0.0,,0,,,unsigned,3,2.0,0.0,0,746,0,746
SAGENOARM,,0.0,,0,,,unsigned,4,2.0,0.0,0,746,0,746
SAGEEQFAIL,,0.0,,0,,,unsigned,4,2.0,0.0,0,746,0,746


In [445]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only dates, missing, and NA values
'date': ['SAGEDATE'],

# Columns with only unsigned ints, missing, and NA values
'unsigned': ['SAGEAGE', 'SAGEGENDER', 'SAGEDM', 'SAGEDMMEDS', 'SAGEDMMOUTH', 'SAGEDMINJ', 'SAGEARM', 'SAGENOARM', 'SAGEEQFAIL', 'SAGECLOTH'],

# Columns with only floats, missing, and NA values
'float': ['SAGE'],

}


Handled columns: 12


In [446]:
targets = {
# Columns with only dates, missing, and NA values
'date': ['SAGEDATE'],

# Columns with only unsigned ints, missing, and NA values
'unsigned': ['SAGEAGE', 'SAGEGENDER', 'SAGEDM', 'SAGEDMMEDS', 'SAGEDMMOUTH', 'SAGEDMINJ', 'SAGEARM', 'SAGENOARM', 'SAGEEQFAIL', 'SAGECLOTH'],

# Columns with only floats, missing, and NA values
'float': ['SAGE'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [447]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)

Failure to make column categorical: SAGEGENDER
Failure to make column categorical: SAGEDM
Failure to make column categorical: SAGEDMMEDS
Failure to make column categorical: SAGEDMMOUTH
Failure to make column categorical: SAGEDMINJ
Failure to make column categorical: SAGEARM
Failure to make column categorical: SAGENOARM
Failure to make column categorical: SAGEEQFAIL
Failure to make column categorical: SAGECLOTH

VERSION              category
SAGEDATE       datetime64[ns]
SAGEAGE                 UInt8
SAGEGENDER              UInt8
SAGEDM                  UInt8
SAGEDMMEDS              UInt8
SAGEDMMOUTH             UInt8
SAGEDMINJ               UInt8
SAGEARM                 UInt8
SAGENOARM               UInt8
SAGEEQFAIL              UInt8
SAGECLOTH               UInt8
SAGE                  float32
dtype: object


In [448]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 0.07MB
Shadow dataframe size: 0.01MB


In [449]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)

## subjectchar

The data that was held in SubjectChar files has been added to allclinical. Why subjectchar00 is still shipped is a mystery.

In [450]:
# Grab the variable names from both subjectchar000 and allclinical00
tmp_df, _ = pyreadstat.read_file_multiprocessing(pyreadstat.read_sas7bdat, data_dir + 'subjectchar00.sas7bdat',
                                                            num_processes=6, user_missing=True)
sc_vars = set(tmp_df.columns)
tmp_df, _ = pyreadstat.read_file_multiprocessing(pyreadstat.read_sas7bdat, data_dir + 'allclinical00.sas7bdat',
                                                            num_processes=6, user_missing=True)
ac_vars = set(tmp_df.columns)

# Display unique variables that are in subjectchar00 that aren't included in allclinical00
print(sc_vars - ac_vars)

set()


Nothing. Seems like SubjectChar00 is included for no reason.

## xray

In [451]:
prefix = 'xray'
column_uniformity_check(prefix)


xray00.sas7bdat: (15164, 16)
['ID', 'VERSION', 'ACCEPT', 'ALIGN', 'CENTER', 'DEPICT', 'EXAMTP', 'EXPOSE', 'MOTION', 'POSITN', 'XNDREAS', 'XRBARCD', 'XRCOMP', 'XRDATE', 'XRSIDE', 'XRTECID']

xray01.sas7bdat: (9775, 16)

xray03.sas7bdat: (8069, 16)

xray05.sas7bdat: (7028, 16)

xray06.sas7bdat: (15886, 16)

xray08.sas7bdat: (4796, 16)

xray10.sas7bdat: (7788, 16)

Total rows: 68506


In [452]:
tmp_df = create_df(prefix)
print(tmp_df.shape)
print('\nStarting dataframe size: {:.2f}MB'.format(tmp_df.memory_usage(deep=True).sum() / (1024**2)))

xray00.sas7bdat	Var Cnt: 16
Visits: ['V00']
xray01.sas7bdat	Var Cnt: 16
Visits: ['V01']
xray03.sas7bdat	Var Cnt: 16
Visits: ['V03']
xray05.sas7bdat	Var Cnt: 16
Visits: ['V05']
xray06.sas7bdat	Var Cnt: 16
Visits: ['V06']
xray08.sas7bdat	Var Cnt: 16
Visits: ['V08']
xray10.sas7bdat	Var Cnt: 16
Visits: ['V10']
(68506, 17)

Starting dataframe size: 17.76MB


In [453]:
data_stats_df, done_df = gather_column_data_stats(tmp_df)

In [454]:
data_stats_summary(tmp_df, data_stats_df)
data_stats_df

Already defined cols: 13 	Cols to convert: 4	 Total col cnt: 17

Column types to convert:
   str_cnt  num_cnt  date_cnt  na_cnt  count
0     True    False     False   False      3
1     True    False      True   False      1

Numeric types of columns:
Series([], Name: count, dtype: int64)

Largest number of unique strings: 51621


Unnamed: 0_level_0,label,uniq_strs,str_list,missing_val_cnt,missing_val_list,numeric_str,num_type,uniq_num,max_num,min_num,str_cnt,num_cnt,date_cnt,na_cnt
col,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
EXAMTP,SV/EV XR:X-ray image type (calc),11,"{Full Limb, PA Bilateral Hand, Lateral Right K...",0,{},False,,,,,68506,0,0,0
XRBARCD,SV/EV XR:X-ray image barcode (calc),51621,"{, 016600855104, 016603165904, 016603691001, 0...",0,{},False,,,,,68506,0,0,0
XRDATE,SV/EV XR:Date x-ray completed (calc),0,{},1,{A},False,,,,,16886,0,51620,0
XRTECID,SV/EV XR:Clinical center radiology tech ID (calc),429,"{, 4061, 3049, C062, 4068, 4096, A002, A005, 3...",0,{},False,,,,,68506,0,0,0


In [455]:
suggest_conversions(data_stats_df)

targets = {
# Columns with only dates, missing, and NA values
'date': ['XRDATE'],

# Columns with only strings, missing, and NA values
'cat': ['EXAMTP', 'XRBARCD', 'XRTECID'],

}


Handled columns: 4


In [456]:
targets = {
# Columns with only dates, missing, and NA values
'date': ['XRDATE'],

# Columns with only strings, missing, and NA values
'cat': ['EXAMTP', 'XRBARCD', 'XRTECID'],

}

new_df, missing_df = convert_columns(targets, data_stats_df, tmp_df)

In [457]:
sanity_check(new_df)
print()
print(new_df.dtypes)
if not missing_df.empty:
    print('\nMissing values present, shadow dataframe created.')
    print(missing_df)


VERSION          category
ACCEPT           category
ALIGN            category
CENTER           category
DEPICT           category
EXAMTP           category
EXPOSE           category
MOTION           category
POSITN           category
XNDREAS          category
XRBARCD          category
XRCOMP           category
XRDATE     datetime64[ns]
XRSIDE           category
XRTECID          category
dtype: object

Missing values present, shadow dataframe created.
              XRDATE
ID      Visit       
9000099 V00      NaN
        V00      NaN
        V00      NaN
9000296 V00      NaN
        V00      NaN
...              ...
9999862 V10      NaN
9999865 V10      NaN
        V10      NaN
9999878 V10      NaN
        V10      NaN

[68506 rows x 1 columns]


In [458]:
print('\nFinal dataframe size: {:.2f}MB'.format(new_df.memory_usage(deep=True).sum() / (1024**2)))
print('Shadow dataframe size: {:.2f}MB'.format(missing_df.memory_usage(deep=True).sum() / (1024**2)))


Final dataframe size: 7.50MB
Shadow dataframe size: 0.38MB


In [459]:
utils.write_parquet(new_df, 'data/' + prefix + '_values.parquet')
if not missing_df.empty:
    utils.write_parquet(new_df, 'data/' + prefix + '_missing_values.parquet', verbose=False)