# Exploring SAS file metadata

Anyone working with the OAI structured data has a choice: import the data from ASCII or from SAS. This project relies on the SAS data. OAI records over 13,000 different variables. Writing heuristics to guess the optimal data types for all 13,000 is likely to be more error-prone than leveraging what we can from the SAS metadata. This notebook explores what SAS metadata can be pulled out by pyreadstat (there may be metadata it ignores; I haven't verified its code). It is mostly here as a record of discovery and is not typically needed by anyone looking to jump into the data.

The data seems to be stored in two ways:
* A collection of sas7bdat and sas7bcat files.
* In the proprietary SAS CPORT format (labeled .xpt instead of .cpt)

Thanks to the OAI employee who chose to support a closed source, proprietary format. While SAS was common enough in 2012, choosing proprietary formats for govt. owned data was already bad form by then. Further, the files are listed as .XPT files, which keeps users confused (you can find users trying to solve this mystery for this exact dataset in internet forums). Not having a SAS instance, we have to ignore the CPORT files and hope no information is lost in doing so. It also isn't clear why the data is saved in a compressed format (to save space) but then bundled with a non-compressed form (losing the benefits of compression). Maybe historical reasons.

**Main Explorations:**
* What is present in the OAI metadata?
* What is and isn't consistent?
  * Good data design doesn't repeat values, since that just opens the door to inconsistency. This data has lots of repeated data. Some of this is what OAI released, though repeats in the metadata may be due to how pyreadstat saves the data it reads.

All statistical platforms have their design pros and cons. An excellent benefit of SAS is that it allows for multiple different markers for missing values. This is probably the largest issue in translating from SAS to Pandas. Much of this notebook examines how pyreadstat handles those missing value markers.

Subtle details may change between OAI release versions. Thus, once facts are established, they are encoded as assertions that can be verified when a new version of the data is released.

## Setup / Imports / Constants

In [None]:
# Setup 
from IPython.display import display, HTML  # importing display from IPython.core.display is deprecated
display(HTML("<style>.container { width:95% !important; }</style>"))
display(HTML("<style>.output_result { max-width:95% !important; }</style>"))

In [None]:
import math
import os
import pandas as pd
import pickle
import pyreadstat
from tqdm import tqdm
import datetime
import re

In [None]:
# Constants
data_dir = '../data/structured_data/'
pdfs_dir = '../data/pdfs/General/Formats_SAS/'

visit_prefixes = {'P02':'IEI', 'P01':'SV', 'V00':'EV', 'V01':'12m', 'V02':'18m', 'V03':'24m', 'V04':'30m', 'V05':'36m',
          'V06':'48m', 'V07':'60m', 'V08':'72m', 'V09':'84m', 'V10':'96m', 'V11':'108m', 'V99':"Outcomes"}
visit_prefixes = set(visit_prefixes.keys())

# Metadata values pulled out by pyreadstat
meta_vars = [ 'column_labels',
 'column_names',
 'column_names_to_labels',
 'file_encoding',
 'file_format',
 'file_label',
 'missing_ranges',
 'missing_user_values',
 'notes',
 'number_columns',
 'number_rows',
 'original_variable_types',
 'readstat_variable_types',
 'table_name',
 'value_labels',
 'variable_alignment',
 'variable_display_width',
 'variable_measure',
 'variable_storage_width',
 'variable_to_label',
 'variable_value_labels']

## Read in all metadata

In [None]:
# All SAS files
all_files = os.listdir(data_dir)
all_files = [x for x in all_files if '.sas7bdat' in x]
all_files.remove('sageancillarystudy_formats.sas7bdat') ## At a binary level this seems like another CPORT file. WTF?
all_files.sort()

It seems that between the sas7bdat files and the sas7bcat files, almost all metadata is stored in the sas7bdat along with the actual data. The only metadata that seems to be provided by `formats.sas7bcat` is value_labels: a dictionary of value maps. We will read that in later, but for now let's look at the main metadata.

In [None]:
# Roughly ~1.5 min runtime
files_meta = {}
for filename in all_files:
    _, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sas7bdat, data_dir + filename,
                                                         num_processes=6, metadataonly=True)
    files_meta[filename] = meta

### See what metadata variable values are common across all files

In [None]:
# See what metadata is collected across all files
# if a single value, list it.
# if a collection store the collection size for this file

# create storage dict
meta_dict = {v: [] for v in meta_vars}
meta_dict['Filename'] = []

for filename, meta in files_meta.items():
    meta_dict['Filename'].append(filename)

    for mv in meta_vars:
        var = getattr(meta, mv)
        if var:
            if isinstance(var, int):
                meta_dict[mv].append(var)
            elif isinstance(var, str):
                meta_dict[mv].append(var)
            else:
                meta_dict[mv].append(len(var))
        else:
            meta_dict[mv].append(None)

tmp_df = pd.DataFrame(meta_dict).set_index('Filename')
meta_dict = None
tmp_df

**Vocab**

I don't know if columns can have different names than variables, but they don't seem to in this dataset. As far as this data is concerned, `column_names` == `variable_names`.

Note that two kinds of labels exist: variable and value. The former is prose describing a variable. The latter is a user defined type in SAS, similar to a CategoricalDtype in Pandas. A single value label is a named dict mapping the stored values in a column to more verbose categories. These can be categorical values that cover all values in a column, or they can just be a set of missing value labels (e.g. ).

* column_names: Seems to be the same as variable names
* column_labels: Brief prose descriptions of what a variable records
* column_names_to_labels: A dict of column names to column labels
* file_encoding: Character encoding of stored strings
* file_format: The storage format
* file_label:
* missing_ranges: I believe this only applies to SPSS files
* missing_user_values: A dict that maps variable names to a list of character values (A to Z and _ for SAS) representing user defined missing values in SAS
* notes: 
* number_columns:
* number_rows: Number of rows of data
* original_variable_types:  A dict that maps variable names to the name of a data type. This could be the name of a SAS native data type, like `$` (meaning string values) or a user defined type (which are stored in the `value_labels` dict (e.g. GENDER, or RACE)). In some cases, this is like a map from variable name to the name of CategoricalDtype objects. In other cases, the majority of the data is a native data type but the named data type is merely a map of what characters are used to denote a missing value in that column. This would be like having a column of numbers in Pandas with categorical values mixed in for missing values. If a variable was not assigned a named data type, the value NULL is given.
* readstat_variable_types: Within the data files, I believe columns are either strings or doubles (numeric in SAS). Note that even if stored primarily as a double, character strings may be mixed in to represent missing values
* table_name: Not sure if this is useful in any way
* value_labels: Not provided by sas7bdat, the master value_label dict is in `formats.sas7bcat`. It is a dict of dicts. The top level dict maps a `value_label` name to a dict that translates column values into categorical values (sometimes just a translation of the different missing values types).
* variable_alignment: A dict mapping variable names to a display alignment: left, center, right or unknown
* variable_display_width: A dict mapping variable names to the display width in SAS
* variable_measure: A dict mapping variable names to a description of the measure: nominal, ordinal, scale or unknown
* variable_storage_width: A dict mapping variable names to a storage width
* variable_to_label: Holds the same values as original_variable_types, except that variables without an assigned data type are left out of this dict rather than assigned NULL.
* variable_value_labels: A dict mapping variable names directly to the appropriate `value_label` data type map. Empty unless a sas7bcat was given. It is a combination of value_labels and variable_to_label.
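
As a rough illustration of how `variable_value_labels` relates to the other two dicts, the sketch below rebuilds it by hand. The variable names, format names, and labels are made up for the example, and `combine` is a hypothetical helper, not a pyreadstat function.

```python
# Hypothetical stand-ins for meta.value_labels and meta.variable_to_label.
value_labels = {
    "GENDER": {1.0: "1: Male", 2.0: "2: Female"},
    "YESNO": {0.0: "0: No", 1.0: "1: Yes"},
}
variable_to_label = {"VAR_SEX": "GENDER", "VAR_PAIN": "YESNO", "VAR_NOTE": "$"}


def combine(variable_to_label, value_labels):
    """Map each variable straight to its value map.

    Variables whose format has no entry in value_labels (e.g. the native
    '$' format) are skipped.
    """
    return {
        var: value_labels[fmt]
        for var, fmt in variable_to_label.items()
        if fmt in value_labels
    }


variable_value_labels = combine(variable_to_label, value_labels)
print(sorted(variable_value_labels))
```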

In [None]:
# Sanity checks on columns that should all have the same number of values and same keys in their dicts
# - failure means the data (or pyreadstat) has changed since this was written
for file, meta in files_meta.items():
    assert meta.number_columns == len(meta.column_labels)
    assert meta.number_columns == len(meta.column_names)
    # Do column_names + column_labels = column_names_to_labels?
    assert set(meta.column_labels) == set(meta.column_names_to_labels.values())
    
    # Do all variables match a column name?
    names = set(meta.column_names)
    assert len(names ^ set(meta.column_names_to_labels.keys())) == 0
    assert len(names ^ set(meta.original_variable_types.keys())) == 0
    assert len(names ^ set(meta.readstat_variable_types.keys())) == 0
    assert len(names ^ set(meta.variable_alignment.keys())) == 0
    assert len(names ^ set(meta.variable_display_width.keys())) == 0
    assert len(names ^ set(meta.variable_storage_width.keys())) == 0
    assert names >= set(meta.variable_to_label.keys())

#### Empty metadata variables

In [None]:
# List metadata variables which are empty for all files

print(str(list(tmp_df.columns[tmp_df.isna().all()].sort_values())))

In [None]:
# Sanity check
# - failure means the data (or pyreadstat) has changed since this was written
empty_vars = ['missing_ranges', 'missing_user_values', 'notes', 'value_labels', 'variable_value_labels']
for var in empty_vars:
    assert tmp_df[var].isna().all()

#### Partially populated

In [None]:
# Which columns have an occasional empty value?
incomplete = set(tmp_df.columns[tmp_df.isna().any()]) - set(tmp_df.columns[tmp_df.isna().all()])
for x in incomplete:
    print(x + ': ' + str(tmp_df[x].unique()))

#### Constant values

So `file_encoding`, `file_format` are the same across all files.

In [None]:
# Sanity check
# - failure means the data (or pyreadstat) has changed since this was written
assert len(list(tmp_df['file_encoding'].unique())) == 1
assert list(tmp_df['file_encoding'].unique())[0] == 'WINDOWS-1252'
assert len(list(tmp_df['file_format'].unique())) == 1
assert list(tmp_df['file_format'].unique())[0] == 'sas7bdat'

#### Unique values across all metadata

In [None]:
# See what unique values occur in the given variable dictionaries
col_list = ['original_variable_types', 'readstat_variable_types', 'variable_alignment', 'variable_display_width', 'variable_measure', 'variable_storage_width', 'variable_to_label']

col_sets = {c: set() for c in col_list}
for file, meta in files_meta.items():
    for col in col_list:
        var = getattr(meta, col)
        if var:
            if isinstance(var, dict):
                col_sets[col].update(var.values())
            else:
                col_sets[col].update(var)
                
for col, vals in col_sets.items():
    print(col + '(' + str(len(vals)) + '): ' + str(list(vals)) + '\n')

In [None]:
col_sets['original_variable_types'] - col_sets['variable_to_label']

So it seems that only `original_variable_types`, `variable_storage_width`, `variable_to_label` have anything unique to say about a variable (aside from variable/column names and labels). As shown several cells above, `original_variable_types` has a dict value for every variable (even if that value is `NULL`). `variable_to_label` seems to have the same data, but leaves out variables that don't use a value list. 'Value labels' seem to be SAS's form of user defined formats ( https://libguides.library.kent.edu/SAS/UserDefinedFormats ), somewhat like categories in Pandas.
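
The apparent relationship between the two dicts can be written down as a small check. This is only a sketch: `drop_null_types` is a hypothetical helper and the sample dicts are invented, so whether the equality holds for every OAI file still has to be confirmed against the real metadata.

```python
def drop_null_types(original_variable_types):
    """Return the dict with every variable whose type is 'NULL' left out."""
    return {
        name: fmt
        for name, fmt in original_variable_types.items()
        if fmt != "NULL"
    }


# Invented example values standing in for one file's metadata.
original_variable_types = {"ID": "$", "VAR_AGE": "NULL", "VAR_SEX": "GENDER"}
variable_to_label = {"ID": "$", "VAR_SEX": "GENDER"}

print(drop_null_types(original_variable_types) == variable_to_label)
```

On the real data, the same comparison could be run for each `meta` in `files_meta` using `meta.original_variable_types` and `meta.variable_to_label`.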

In [None]:
# Sanity check
# - failure means the data (or pyreadstat) has changed since this was written
assert col_sets['readstat_variable_types'] == {'double', 'string'}
assert col_sets['variable_alignment'] == {'unknown'}
assert col_sets['variable_display_width'] == {0}
assert col_sets['variable_measure'] == {'unknown'}

## Check for differences in format files

There seem to be two competing SAS catalog files*:
* `../data/pdfs/General/Formats_SAS/formats.sas7bcat`
* `../data/structured_data/formats.sas7bcat`

At a binary level, they are different, let's look at them as far as pyreadstat can determine. 

*(Ignoring the `kMRI_SQ_WORMS_Link_Formats.sas7bcat` file for now)

In [None]:
_, local_cat = pyreadstat.read_sas7bcat(data_dir + 'formats.sas7bcat')
_, pdf_cat = pyreadstat.read_sas7bcat(pdfs_dir + 'formats.sas7bcat')

for cat in [local_cat, pdf_cat]:
    for v in meta_vars:
        var = getattr(cat,v)
        if var:
            if isinstance(var, int):
                print(v + ': ' + str(var))
            else:
                print(v + ' - ' + str(len(var)))
        else:
            print(v + ' - empty')
    print()

They seem the same. Both contain only `value_labels`, a dict mapping data type names to dicts defining those data types. Confirm:

In [None]:
# Compare the dicts of the local catalog vs the one in the pdfs dir
# - each is a dict of value dicts
# - this is more code than normally needed because pyreadstat is creating NaN keys

# At the top level, both have the same keys
assert set(local_cat.value_labels.keys()) == set(pdf_cat.value_labels.keys())

for k1, v1 in local_cat.value_labels.items():
    # Get both sub dictionaries
    v1_list = [(k,v) for k,v in v1.items()]
    pdf_v1_list = [(k,v) for k,v in pdf_cat.value_labels[k1].items()]
        
    for (k2, v2) in v1_list:
        # Confirm match exists
        key_match = False
        for (pdf_k2, pdf_v2) in pdf_v1_list:
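            # math.isnan raises TypeError on strings, so check both keys are floats first
            if pdf_k2 == k2 or (isinstance(k2, float) and isinstance(pdf_k2, float) and math.isnan(k2) and math.isnan(pdf_k2)):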
                key_match = True
                if pdf_v2 == v2:
                    break
                else:
                    # Mis-match on sub-dict values
                    print('Dict[' + str(k1) + '][' + str(k2) + ']: [local v: ' + v2 + ']\t[pdf v: ' + pdf_v2 + ']')
                    break
        if not key_match:
            print('Dict[' + str(k1) + ']: [local k: ' + str(k2) + ' had no match in: ' + str([k for (k, v) in pdf_v1_list]) + ']')

In [None]:
# Flip the order to see if any keys exist in the PDF version that don't exist in the local catalog file
for k1, v1 in pdf_cat.value_labels.items():
    # Get both sub dictionaries
    v1_list = [(k,v) for k,v in v1.items()]
    local_v1_list = [(k,v) for k,v in local_cat.value_labels[k1].items()]
        
    for (k2, v2) in v1_list:
        # Confirm match exists
        key_match = False
        for (local_k2, local_v2) in local_v1_list:
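            # math.isnan raises TypeError on strings, so check both keys are floats first
            if local_k2 == k2 or (isinstance(k2, float) and isinstance(local_k2, float) and math.isnan(k2) and math.isnan(local_k2)):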
                key_match = True
                if local_v2 == v2:
                    break
        if not key_match:
            print('Dict[' + str(k1) + ']: [pdf k: ' + str(k2) + ' had no match in: ' + str([k for (k, v) in local_v1_list]) + ']')

**Great**. They aren't the same. So two keys shipped with the data aren't in the catalog file shipped with the PDFs. Seven values seem to have minor differences in their label text.

In [None]:
# Are any of the conflicting value dicts present in this data set?
conflicting_value_labels = {'$SYNACCP', 'WTCHG', 'CRTMRPH', 'MMMRPH'}
col_sets['variable_to_label']

print(conflicting_value_labels & col_sets['variable_to_label'])

It seems like they are all used, so the point isn't exactly moot.

### Undefined data formats

In [None]:
# Are there any variable types that aren't user defined?
undefined_set = col_sets['variable_to_label'] - set(local_cat.value_labels.keys())
print(undefined_set)

* $ in SAS means character data
* BEST is numeric data that lets the system choose the best display format
* MMDDYY is obviously a date

I haven't found documentation for the rest ( https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/leforinforref/p0z62k899n6a7wn1r5in6q5253v1.htm ). Formats whose definitions weren't exported?

In [None]:
# Which data sets contain these undefined formats
undefined_set.remove('$')
undefined_set.remove('BEST')
undefined_set.remove('MMDDYY')


for filename, meta in files_meta.items():
    vls = set(meta.variable_to_label.values())
    for value_label in undefined_set:
        if value_label in vls:
            print(filename + ' ' + str(value_label))

This makes sense. For some reason `kmri_sq_worms` has its own catalog file. Who knows what is going on with the `sageancillarystudy`? Ignoring for now.

### Look closer at NaN keys
Since NaN does not equal NaN, it should never be used as a key in a Python dictionary. Yet the `value_label` dictionaries returned from pyreadstat have them. Further, a Python dictionary cannot have two copies of the same key, and yet we have that here as well. This seems like a side-effect of pyreadstat, not SAS. This section looks at the extent of the problem.
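
A short, self-contained sketch of why duplicate NaN keys are possible at all: two separately created NaN floats are neither identical nor equal, so a dict treats them as different keys. The labels and the `count_nan_keys` helper are made up for illustration.

```python
import math

# Two separately created NaN objects: different identity, and nan != nan.
nan_a = float("nan")
nan_b = float("nan")

labels = {nan_a: "first label", nan_b: "second label"}

# Both entries survive, so the dict appears to hold a "duplicate" key.
print(len(labels))

# Lookup only succeeds with the very same object (identity is checked
# before equality); a freshly created NaN never matches.
print(nan_a in labels, float("nan") in labels)


def count_nan_keys(d):
    """Count keys that are float NaN."""
    return sum(1 for k in d if isinstance(k, float) and math.isnan(k))


print(count_nan_keys(labels))
```

The practical consequence is that `some_dict[float('nan')]` cannot be used to fetch these labels; the keys have to be found by iterating and testing with `math.isnan`.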

In [None]:
# Look for dictionaries with more than one NaN entry.
for vl_name, vl_dict in local_cat.value_labels.items():
    v1_klist = [k for k in vl_dict.keys()]
    cnt = 0    
    for key in v1_klist:
        if isinstance(key, float) and math.isnan(key):
            cnt += 1
    if cnt > 1:
        print("Duplicate NaN keys in value label: " + vl_name)

In [None]:
# See if the count differs for the second catalog
for vl_name, vl_dict in pdf_cat.value_labels.items():
    v1_klist = [k for k in vl_dict.keys()]
    cnt = 0    
    for key in v1_klist:
        if isinstance(key, float) and math.isnan(key):
            cnt += 1
    if cnt > 1:
        print("Duplicate NaN keys in value label: " + vl_name)

In [None]:
# When two NaNs occur, what are the values?
for k1, v1 in local_cat.value_labels.items():
    # Get both sub dictionaries
    v1_list = [(k,v) for k,v in v1.items()]
    
    cnt = 0
    last_val = ''
    for (k2, v2) in v1_list:
        if isinstance(k2, float) and math.isnan(k2):
            cnt += 1
            if cnt > 1:
                print(k1 + ': \t' + last_val + '\t' + v2)
            last_val = v2

The duplicates seem to be a parsing issue, not two distinct values that both parse to NaN.

In [None]:
# How many value labels contain one or more NaNs?
cnt = 0    
for v1_dict in local_cat.value_labels.values():
    v1_klist = [k for k in v1_dict.keys()]
    for key in v1_klist:
        if isinstance(key, float) and math.isnan(key):
            cnt += 1
            break
print("Dicts: " + str(len(local_cat.value_labels)) + '\tDicts w/ NaNs: ' + str(cnt))

It seems that NaNs are common.

### Double check missing value marker
In SAS, '.' is the default marker for a missing value in a numeric column and ' ' for string columns. Since both seem to have been converted to NaNs, let's make sure all the value label dictionaries reflect this.

## Explore optimal pyreadstat settings

Look at how various pyreadstat flags interact with this particular dataset.

In [None]:
# Grab a small sample dataset with mixed types (including user defined types)
# Parse with only catalog file
filename = 'kmri_sq_blksbml_bicl03.sas7bdat'
df1, meta1 = pyreadstat.read_file_multiprocessing(pyreadstat.read_sas7bdat, data_dir + filename,
                                                         catalog_file=data_dir + 'formats.sas7bcat',
                                                         num_processes=6)
df1

### Look at metadata

In [None]:
# What metadata is present when a catalog file is included?
for v in meta_vars:
    var = getattr(meta1,v)
    if var:
        if isinstance(var, int):
            print(v + ': ' + str(var))
        else:
            print(v + ' - ' + str(len(var)))
    else:
        print(v + ' - empty')

It seems that the user defined values (`value_labels`) have been inserted into the data, but not stored in the metadata dict. `variable_value_labels` is also empty.

In [None]:
# Look at what types pyreadstat chose with a provided catalog file
df1.dtypes

### Look at categories/value labels

In [None]:
# Confirm that two columns share a user defined type (they should according to documentation)
meta1.variable_to_label

In [None]:
# Let's look at an example
df1.V03BBMLP.value_counts()

In [None]:
df1.V03BBMLP.dtype

In [None]:
# What possible values?
local_cat.value_labels['BBMLSPE']

It looks like pyreadstat only casts the column to categorical and does not set the categories from the `value_labels` data (confirmed by looking at the code base). This means that the same variable captured across two different visits may end up with different categorical types (since some values may be present in one visit's data and not in another's).
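
One possible way around this, sketched under assumptions: build a single `CategoricalDtype` from a value label dict and apply it to every visit's column before concatenating. The value map and the two small series are invented, `dtype_from_value_map` is a hypothetical helper, and NaN keys are assumed to have been cleaned out of the map already.

```python
import pandas as pd

# Invented value map standing in for one entry of local_cat.value_labels.
value_map = {0.0: "0: Absent", 1.0: "1: Present", 2.0: "2: Unreadable"}


def dtype_from_value_map(value_map):
    """Build one CategoricalDtype covering every label in the map.

    dict.fromkeys drops repeated labels while keeping their order, since
    categories must be unique.
    """
    categories = list(dict.fromkeys(value_map.values()))
    return pd.CategoricalDtype(categories=categories, ordered=False)


shared_dtype = dtype_from_value_map(value_map)

# Two visits that each contain only a subset of the possible labels.
visit_a = pd.Series(["0: Absent", "1: Present"])
visit_b = pd.Series(["2: Unreadable", "0: Absent"])

# With the shared dtype, concatenation keeps the categorical type.
combined = pd.concat(
    [visit_a.astype(shared_dtype), visit_b.astype(shared_dtype)],
    ignore_index=True,
)
print(combined.dtype)

# With categories inferred per visit, the dtypes differ and the result
# falls back to a non-categorical dtype.
mixed = pd.concat(
    [visit_a.astype("category"), visit_b.astype("category")],
    ignore_index=True,
)
print(mixed.dtype)
```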

### Look at NaNs

In [None]:
# How many NaN values
df1.V03BBMLP.isna().sum()

### Try setting user_missing=True

In [None]:
# Parse with catalog files and user_missing=True
df2, meta2 = pyreadstat.read_file_multiprocessing(pyreadstat.read_sas7bdat, data_dir + filename,
                                                         catalog_file=data_dir + 'formats.sas7bcat',
                                                         num_processes=6, user_missing=True)

In [None]:
df2.V03BBMLP.isna().sum()

In [None]:
df2.V03BBMLP.value_counts()

This leaves only the NaNs to be converted to the user defined categorical. It seems pyreadstat parses `.` as NaN, but only connects NaN to a value label in the value label table. In the actual data it remains a NaN. This could be confusing if NaNs exist in the data for other reasons.

In [None]:
# What metadata is present when a catalog file is included?
for v in meta_vars:
    var = getattr(meta2,v)
    if var:
        if isinstance(var, int):
            print(v + ': ' + str(var))
        else:
            print(v + ' - ' + str(len(var)))
    else:
        print(v + ' - empty')

`value_labels` and `variable_value_labels` are still empty. So are `missing_ranges` and `missing_user_values`.

### Looking at dates

In [None]:
# Parse with catalog files and user_missing=True and dates_as_pandas_datetime=True
filename = 'outcomes99.sas7bdat'
df3, meta3 = pyreadstat.read_file_multiprocessing(pyreadstat.read_sas7bdat, data_dir + filename,
                                                         catalog_file=data_dir + 'formats.sas7bcat',
                                                         num_processes=6, dates_as_pandas_datetime=True, user_missing=True)
df3

* No flags: date columns are mixed objects: NaN (float) and Python `datetime.date` objects
* If dates_as_pandas_datetime=True, then columns become Pandas `datetime64[ns]`
* If dates_as_pandas_datetime=True and user_missing=True, we get mixed objects: str and `datetime.datetime` objects.

In this case, the string in a date column is the char `A`. In the pyreadstat code, only columns with a user defined data type have signal characters like `A` converted to their verbose form (e.g. `A: Not Expected`). This will likely happen for the other columns that use SAS native data types.

In a system like Pandas you can't combine multiple missing value types with other data types without getting a mixed type column (less efficient, and it prevents some column-wide actions). In this case, only one missing value type is used (.A), but that isn't guaranteed for all dates. If this is the only missing value flag, it can be noted in comments that NaT = .A, and all date columns can be converted to Pandas datetime columns. While this example only shows the issue for dates, it could also apply to numeric columns, with more missing types being used.
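
A minimal sketch of one way to handle such a column, assuming any string in a date column is a missing value code: split it into a proper datetime column (with NaT where a code was) and a parallel column that keeps the code. The sample values and the `split_dates_and_codes` helper are hypothetical.

```python
import datetime

import pandas as pd

# Invented mixed column like the one user_missing=True produces for dates.
raw = pd.Series(
    [datetime.datetime(2010, 5, 1), "A", datetime.datetime(2011, 7, 15)],
    dtype=object,
)


def split_dates_and_codes(column):
    """Split a mixed date column into (dates, codes).

    dates: datetime column with NaT wherever a string code was stored.
    codes: object column holding the string code, None everywhere else.
    """
    dates = pd.to_datetime(
        pd.Series(
            [None if isinstance(v, str) else v for v in column],
            index=column.index,
        )
    )
    codes = pd.Series(
        [v if isinstance(v, str) else None for v in column],
        index=column.index,
        dtype=object,
    )
    return dates, codes


dates, codes = split_dates_and_codes(raw)
print(dates.dtype)
print(codes.dropna().unique())
```

Keeping the codes in a second column avoids relying on a comment such as NaT = .A, at the cost of doubling the number of columns.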

### How many variables don't have set categories?
Both dates and numeric value columns will need closer inspection to store them well.  How many are there?

In [None]:
# How many variables don't have a user-defined data format?
var_cnt = 0
null_cnt = 0
date_cnt = 0
best_cnt = 0
str_cnt = 0
for filename, meta in files_meta.items():
    var_cnt += len(meta.original_variable_types.keys())
    null_cnt += len([n for n in meta.original_variable_types.values() if n == 'NULL'])
    date_cnt += len([n for n in meta.original_variable_types.values() if n == 'MMDDYY'])
    best_cnt += len([n for n in meta.original_variable_types.values() if n == 'BEST'])
    str_cnt += len([n for n in meta.original_variable_types.values() if n == '$'])
print('Total variables stored: ' + str(var_cnt))
print('NULL variables stored: ' + str(null_cnt))
print('Date (MMDDYY) variables stored: ' + str(date_cnt))
print('Numeric (BEST) variables stored: ' + str(best_cnt))
print('String variables stored: ' + str(str_cnt))

In [None]:
# How many of those undefined types (NULL) are string vs numeric
null_str_cnt = 0
null_dbl_cnt = 0
for filename, meta in files_meta.items():
    for var_name, data_type in meta.original_variable_types.items():
        if data_type == 'NULL':
            storage_type = meta.readstat_variable_types[var_name]
            if storage_type == 'string':
                null_str_cnt += 1
            else:  # double
                null_dbl_cnt += 1
print('NULL - string variables stored: ' + str(null_str_cnt))
print('NULL - numeric variables stored: ' + str(null_dbl_cnt))

## Missing data without a catalog file

In [None]:
filename = 'kmri_sq_blksbml_bicl03.sas7bdat'
df1, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sas7bdat, data_dir + filename,
                                                         catalog_file=data_dir + 'formats.sas7bcat',
                                                         num_processes=6, user_missing=True)
df1

In [None]:
df1.dtypes

In [None]:
df2, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sas7bdat, data_dir + filename, num_processes=6, user_missing=True)
df2

In [None]:
df2.dtypes

The absence of category columns is expected. Is the number of NaNs sensible?

In [None]:
df2.V03BBMLP.isna().sum()

In [None]:
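# Confirm every column has the same number of unique values with and without the catalog
# (a proxy for NaNs and missing codes being handled the same way in both reads)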
for col in df1.columns:
    cat_col_uni = len(df1[col].unique())
    col_uni = len(df2[col].unique())
    if cat_col_uni != col_uni:
        print(col)

In both cases, the number of NaNs is what we expect. What does the data look like? Presumably mixed type columns? 

In [None]:
print(df1['V03BBMLP'].unique())
print(df2['V03BBMLP'].unique())

Yes. This could be useful. user_missing=True doesn't depend on catalog data.
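
Since the codes survive without a catalog, a quick tally shows which missing value markers a column actually uses. This is a sketch only: `tally_missing_codes` is a hypothetical helper, the sample values are invented, and it assumes a numeric (double) column where any single-character string is a missing code.

```python
from collections import Counter


def tally_missing_codes(values):
    """Count single-character strings, assumed to be user-missing codes."""
    return Counter(v for v in values if isinstance(v, str) and len(v) == 1)


# Invented values shaped like a double column read with user_missing=True.
column = [0.0, 1.0, "A", 2.0, "A", "C", float("nan")]
print(tally_missing_codes(column))
```

For a real column, something like `tally_missing_codes(df2['V03BBMLP'])` would be the equivalent call.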

## Look at names with prefix

For many files looked at so far, only `ID` and `VERSION` variables lack a visit prefix. Is this true across all files?

In [None]:
# Grab all variable names
all_var_names = []
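for filename, meta in files_meta.items():
    # Pair each variable with the file it came from (not a stale filename from an earlier cell)
    tmp = [(str.upper(var), filename) for var in meta.column_names]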
    all_var_names.extend(tmp)

In [None]:
print('Total cnt of variables across files: ' + str(len(all_var_names)))

cnt = 0
no_prefix = set()
for (name, _) in all_var_names:
    if name[:3] not in visit_prefixes:
        cnt += 1
        no_prefix.add(name) 
print('Variable cnt w/out prefix: ' + str(cnt))
print('Unique variable names without a prefix: ' + str(len(no_prefix)))

## Which Variable Names Are Reused Across Files?

In [None]:
# How many variable names are reused across datafiles?
avn_ser = pd.DataFrame(all_var_names, columns =['Var', 'File'])
cnts = avn_ser.Var.value_counts(ascending=True)
cnts[cnts > 1]

VERSION seems to exist in every file, and ID in all but two.

### Which files lack an ID variable?

In [None]:
# Which files lack an ID field?
for file, meta in files_meta.items():
    if not {'ID', 'id', 'Id'} & set(meta.column_names):
        print(file)

In [None]:
print(files_meta['Biospec_fnih_joco_demographics.sas7bdat'].column_names)
print(files_meta['biospec_fnih_joco_assays.sas7bdat'].column_names)

### Looking at files that reuse variable names
The reuse of names like `SIDE` isn't concerning. However, the number of variables with a prefix that are being used is. This is repeating data. This requires sanity checks to ensure the values are the same in both locations.

In [None]:
# What files are various repeated variables used in?
subset = avn_ser[avn_ser.Var.isin(list(cnts[cnts > 1].index))]
var_dict = {}
for var, file in subset.to_records(index=False):
    if var_dict.get(var):
        var_dict[var].append(file)
    else:
        var_dict[var] = [file]

ignore = ['ID', 'VERSION']
total_file_set = set()
for k,v in var_dict.items():
    if k not in ignore:
        print(k + '(' + str(len(v))+'): ' + str(v))
        total_file_set.update(v)

In [None]:
total_file_set = list(total_file_set)
total_file_set.sort()
total_file_set

In [None]:
# What files use variables that don't have a prefix AND are part of a fileset
file_set = set()
for var in no_prefix:
    if var not in ignore:
        file_list = []
        for (name, fname) in all_var_names:
            if var == name:
                file_list.append(fname)
                file_set.add(fname)
        # fileset check
        fileset = False
        for f in file_list:
            if re.match(r"\S*\d\d\.sas7bdat", f):
                fileset = True
        if fileset:
            print(var + '(' + str(len(file_list)) + '): ' + str(file_list))

In [None]:
file_list = list(file_set)
file_list.sort()
file_list

In [None]:
len(file_list)

## Variable names across files
When concatenating files across visits, for now, the code is easiest if `ID` and `VERSION` are the only two variables that lack a visit prefix. Which files does this apply to?

In [None]:
# Which files have a visit prefix on all variables (other than ID and VERSION)?
cnt = 0
for file, meta in files_meta.items():
    cols = [n.upper() for n in meta.column_names]
    if 'ID' in cols:
        cols.remove('ID')
    cols.remove('VERSION')
    all_visit_prefix = True
    for col in cols:
        if col[:3] not in visit_prefixes:
            all_visit_prefix = False
    if all_visit_prefix:
        print(file)
        cnt += 1
print(cnt)


## Are value_labels Consistent Across Visits?
If we drop the visit prefixes, does the same variable have the same value label across visits? This will be important if we concatenate dataframes across visits.

In [None]:
# Go through all variable names, get the name of each one's value_label, and drop the visit prefix.
# Build a dict mapping each prefix-free variable name to the first value_label seen for it
# plus the files that agree with it; print any variable whose value_label differs.
val_label_dict = {}
for file, meta in files_meta.items():
    for var_name, val_lbl in meta.variable_to_label.items():
        var_name = str.upper(var_name)
        if var_name[:3] in visit_prefixes:
            var_name = var_name[3:]
        if val_label_dict.get(var_name):
            tmp_vl, f_list = val_label_dict[var_name]
            if val_lbl == tmp_vl:
                f_list.append(file)
            else:
                print('Mismatch: ' + var_name + ' ' + val_lbl + ' ' + tmp_vl)
        else:
            val_label_dict[var_name] = (val_lbl, [file])

In [None]:
# Find what variables map to RACE and what files they are part of
tmp = []
for file, meta in files_meta.items():
    tmp_df = pd.DataFrame([(k,v,file) for k, v in meta.variable_to_label.items()], columns=['Variable', 'Label', 'File'])
    tmp.append(tmp_df)
tmp_df = pd.concat(tmp, axis=0)
tmp_df[tmp_df.Variable.str.endswith('Race') | tmp_df.Variable.str.endswith('RACE')]

## Summary:
    
* Keep an eye out for datasets using: `CRTMRPH`, `WTCHG`, `$SYNACCP`, `MMMRPH` (inconsistent data definitions)
* `kmri_sq_worms` has its own catalog file. Who knows what is going on with the `sageancillarystudy`? Both have value_labels not defined in `formats.sas7bcat`
* `value_label` dictionaries need to be cleaned of duplicate NaNs, and the NaNs need to be replaced with a Python friendly character.
* pyreadstat isn't setting the categorical values from `value_labels`. This is probably typically fine, but for this dataset, we will want to concatenate data from different files, so standardized CategoricalDtypes are needed.
* Columns that use SAS native data types won't expand missing value descriptions, instead raw single characters will be left in place. Out of 13K variables, over 4K fall into this category.
* Though not shown here, early work found that variables that were all uppercase in some files were mixed case in others. SAS is case insensitive, but Python isn't. To be safe, all variable and value names from pyreadstat will be converted into upper case.
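
The NaN cleanup mentioned above could look something like the sketch below. Everything here is an assumption: the replacement key (`'.'`, borrowing SAS's default missing marker), the choice to keep the first NaN label and drop later ones, the `clean_value_map` helper, and the sample labels.

```python
import math

# Assumed replacement key for NaN; any hashable, self-equal value would do.
MISSING_KEY = "."


def clean_value_map(value_map, missing_key=MISSING_KEY):
    """Return a copy of value_map with NaN keys collapsed into one key.

    The first label found under a NaN key is kept; later NaN entries are
    treated as parsing duplicates and dropped.
    """
    cleaned = {}
    for key, label in value_map.items():
        if isinstance(key, float) and math.isnan(key):
            cleaned.setdefault(missing_key, label)
        else:
            cleaned[key] = label
    return cleaned


# Invented value map containing two distinct NaN keys.
raw_map = {
    float("nan"): ".: Missing",
    1.0: "1: Yes",
    float("nan"): ".: Missing (duplicate)",
}
cleaned = clean_value_map(raw_map)
print(cleaned)
```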

**TODO**
* Section 1.3.3 - Double check missing value marker
* Check that variables that are reused across files have the same values.