### Unique CPS Household ID 

April 30, 2019

Brian Dew, @bd_econ

-----

ABOUT:

This notebook uses `struct` to read a monthly CPS file and then, based on the date and month in sample (MIS), merges the monthly file with the bd CPS feather files that should contain the same households in earlier months. When a bd CPS file contains the same household, the CPSID is generated from the date and QSTNUM of that earlier bd CPS record. When there is no match in the first possible month (where MIS=1 should be), the program continues to look through the remaining candidate months. If it does not find a match, it generates a new CPSID based on the date and QSTNUM of the current month.
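As a rough illustration of the matching logic described above, the sketch below uses two tiny made-up tables. The column names follow the notebook (`HHID`, `HHID2`, `STATE`, `QSTNUM`, `DATE`), but the values and the helper name `assign_ids` are hypothetical, and the search over several earlier months is collapsed into a single earlier month.

```python
import pandas as pd

def assign_ids(curr, prev):
    """Map each current-month QSTNUM to a household ID.

    Matched households reuse the earlier month's DATE and QSTNUM;
    unmatched households get an ID from the current DATE and QSTNUM.
    """
    keys = ['HHID', 'HHID2', 'STATE']
    merged = curr.merge(prev, on=keys, how='left', suffixes=('', '_prev'))
    matched = merged['QSTNUM_prev'].notna()
    date = merged['DATE_prev'].where(matched, merged['DATE'])
    qstnum = merged['QSTNUM_prev'].where(matched, merged['QSTNUM'])
    ids = (date * 100000 + qstnum).astype('int64')
    return dict(zip(merged['QSTNUM'], ids))

# Hypothetical current month (yymm 2504) and one earlier month (yymm 2503)
curr = pd.DataFrame({'HHID': [1001, 1002], 'HHID2': [64011, 64011],
                     'STATE': ['MD', 'VA'], 'QSTNUM': [12, 13],
                     'DATE': [2504, 2504]})
prev = pd.DataFrame({'HHID': [1001], 'HHID2': [64011], 'STATE': ['MD'],
                     'QSTNUM': [17], 'DATE': [2503]})

ids = assign_ids(curr, prev)
print(ids)  # {12: 250300017, 13: 250400013}
```

The first household is found in the earlier month, so its ID carries that month's date and QSTNUM. The second has no match and falls back to the current month. The notebook itself loops over several candidate months with inner merges rather than using one left merge.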

---



Drawn primarily from the description of the IPUMS CPSID. Works currently for May 1995-onward.

**WORK IN PROGRESS**

Eventually, this will have three sections. One that handles the feather files for 1989-94, a second that handles the dates right around and affected by the mid-1995 break, and a third that handles raw CPS files after the mid-1995 break.

I could also try to create a format that handles feather files when they are available and raw files only when they are not already included in the feather file. 

The overall goal of this file should be: 
1) Check what months are covered by raw data files. 
2) Check whether CPSIDs are available for those months.
3) If CPSIDs are missing, generate them and store them in the dictionary.
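
The first two steps above can be sketched as a set difference between the months found in the raw file names and the months already stored as dictionary keys. This is a minimal sketch that assumes raw files are named like `apr25pub.dat` and that the ID dictionary is keyed by month-start timestamps, as in the code below; the helper name `months_missing_ids` and the sample inputs are hypothetical.

```python
import pandas as pd

def months_missing_ids(file_names, id_dict):
    """Return raw-file months (as Timestamps) with no entry in id_dict."""
    raw_months = {pd.to_datetime(name[:5], format='%b%y')
                  for name in file_names if name.endswith('pub.dat')}
    return sorted(raw_months - set(id_dict))

# Hypothetical directory listing and existing ID dictionary
files = ['apr25pub.dat', 'may25pub.dat', 'notes.txt']
existing = {pd.Timestamp('2025-04-01'): {}}

missing = months_missing_ids(files, existing)
print(missing)
```

Only the months returned here would need the third step, generating and storing new CPSIDs.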

In `bd_CPS_reader.ipynb` the CPSID should be generated only if the dictionary contains that month of data. To do this efficiently, I should read the dictionary once and generate a list of available months to cross-check. That way, I can generate a bd CPS feather file without the ID, then use that feather file to generate the CPSIDs more efficiently. Once the CPSIDs are generated locally, this won't apply. Separately, I'll need to be able to generate a CPSID from a raw data file, so that I can add new months of data efficiently as they are released. 

Notes:

- One issue when creating QSTNUM is that this notebook reads the data with a different process than the reader, so the QSTNUM generated here will not match the QSTNUM generated in the reader. Therefore, for 1995-1997, a dictionary maps the household identifiers (HHID and HHID2) to the reader's QSTNUM.
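
A minimal sketch of that mapping, using made-up values: the household identifiers are concatenated into a string key, and the key looks up the QSTNUM that the reader assigned. The column names follow the notebook, while the data frames `reader` and `raw` are hypothetical stand-ins for one month of reader output and one month of raw data.

```python
import pandas as pd

# Hypothetical reader output for one month
reader = pd.DataFrame({'HHID': [1001, 1002],
                       'HHID2': [64011, 64011],
                       'QSTNUM': [17, 18]})
reader['ID'] = reader['HHID'].astype('str') + reader['HHID2'].astype('str')
id_to_qstnum = reader.set_index('ID')['QSTNUM'].to_dict()

# Hypothetical raw-file rows, in a different order and with one extra household
raw = pd.DataFrame({'HRHHID': [1002, 1001, 1003],
                    'HRHHID2': [64011, 64011, 64011]})
raw['ID'] = raw['HRHHID'].astype('str') + raw['HRHHID2'].astype('str')

# Keep only households the reader knows about, then borrow its QSTNUM
raw = raw[raw['ID'].isin(list(id_to_qstnum))].copy()
raw['QSTNUM'] = raw['ID'].map(id_to_qstnum)
print(raw[['ID', 'QSTNUM']])
```

Households that the reader did not keep (the third row here) are dropped instead of being given a locally generated QSTNUM.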

In [1]:
# Import preliminaries
import os, re, struct, pickle, string, sys
import pandas as pd
print('pandas:', pd.__version__)
pd.options.mode.chained_assignment = None
import numpy as np
print('numpy:', np.__version__)
from bd_CPS_details import StatesMap, DataDict

os.chdir('/home/brian/Documents/CPS/data')

#sys.stdout = open('cps_id_log.txt', 'w')

dd_matcher = pickle.load(open('cps_basic_dd.pkl', 'rb'))['matcher']


# Storage of IDs in pickled dictionary
ids_file = 'CPS_unique_ids.pkl'
if os.path.isfile(ids_file):
    print('ID dictionary file exists')
    cps_ids_full = pickle.load(open(ids_file, 'rb'))
else:
    cps_ids_full = {}

pandas: 2.2.2
numpy: 1.26.4
ID dictionary file exists


In [2]:
#for date in ['2025-04-01', '2025-05-01', '2025-06-01', '2025-07-01']:
#    del cps_ids_full[pd.to_datetime(date)]

In [3]:
# Return regex pattern that will parse data dictionary dd_file
def return_dd_parser(dd_file):
    
    # Raw strings avoid invalid-escape warnings; the re module itself
    # interprets \n, \t, and \x0c inside the pattern
    tab_layout = r'\n(\w+)\s+(\d+)\s+.*?\t+.*?(\d\d*).*?(\d\d+)'
    space_layout = r'\n(?:\x0c)?(\w+)\s+(\d+)\s+.*? \s+.*?(\d\d*).*?(\d\d+)'
    asc_layout = r'D (\w+)\s+(\d{1,2})\s+(\d+)\s+'
    
    # Data dictionary files grouped by the pattern that parses them
    dd_patterns = {
        tab_layout: [
            '2025_Basic_CPS_Public_Use_Record_Layout_plus_IO_Code_list.txt',
            'May_2024_Basic_CPS_Public_Use_Record_Layout.txt',
            '2024_Basic_CPS_Public_Use_Record_Layout_plus_IO_Code_list.txt',
            '2023_Basic_CPS_Public_Use_Record_Layout_plus_IO_Code_list.txt',
            'January_2020_Record_Layout.txt',
            'January_2017_Record_Layout.txt',
            'January_2015_Record_Layout.txt',
            'January_2014_Record_Layout.txt',
            'January_2013_Record_Layout.txt',
            'may12dd.txt'],
        space_layout: [
            'jan10dd.txt', 'jan09dd.txt', 'jan07dd.txt', 'augnov05dd.txt',
            'may04dd.txt', 'jan03dd.txt', 'sep95_dec97_dd.txt',
            'jun95_aug95_dd.txt', 'apr94_may95_dd.txt', 'jan94_mar94_dd.txt'],
        asc_layout: ['jan98dd.asc', 'jan98dd2.asc']}
    
    # Invert to file name -> pattern; unknown files still raise KeyError
    patterns = {f: p for p, files in dd_patterns.items() for f in files}
    
    return patterns[dd_file]


# Create HHID2 for pre May 2004 data
def id2_gen(np_mo):
    hrsample = [x[1:3] for x in np_mo['HRSAMPLE']]
    hrsersuf = [x.strip() for x in np_mo['HRSERSUF']]
    sersuf_d = {a: str(ord(a.lower()) - 96).zfill(2) for a in set(hrsersuf)
            if a in list(string.ascii_letters)}
    sersuf_d.update({'-1': '00', '-1.0': '00', '0': '00'})
    sersuf = list(map(sersuf_d.get, hrsersuf))
    np_mo.loc[np_mo['HUHHNUM'] < 0, 'HUHHNUM'] = 0
    huhhnum = np_mo['HUHHNUM'].astype('U1')
    
    id2 = [''.join(i) for i in zip(hrsample, sersuf, huhhnum)]

    return(np.array(id2, dtype='uint32'))


In [8]:
# List of monthly raw CPS data files to process
raw_monthly_data_file_list = [file for file in os.listdir() 
                              if file.endswith('pub.dat') 
                              and (pd.to_datetime(file[:5], format='%b%y')
                                   >= pd.to_datetime('1995-05-01')) and
                              (pd.to_datetime(file[:5], format='%b%y')
                                   not in cps_ids_full.keys())]

#raw_monthly_data_file_list = ['nov04pub.dat']

In [4]:
#raw_monthly_data_file_list = ['apr25pub.dat', 'may25pub.dat', 'jun25pub.dat', 'jul25pub.dat']

In [5]:
# For 1995-1997 map the HHID and HHID2 to QSTNUM
pre98files = [f[3:5] for f in raw_monthly_data_file_list 
              if f[3:5] in ['95', '96', '97']]

if len(pre98files) > 0:
    qstnum_map_file = 'qstnum_map.pkl'
    if os.path.isfile(qstnum_map_file):
        print('QSTNUM mapping dictionary file exists')
        qstnum_map = pickle.load(open(qstnum_map_file, 'rb'))
    else:
        qstnum_map = {}

        columns = ['MONTH', 'HHID', 'HHID2', 'QSTNUM']

        for year in [1995, 1996, 1997]:
            df = pd.read_feather(f'clean/cps{year}.ft', columns=columns)
            df['ID'] = df['HHID'].astype('str') + df['HHID2'].astype('str')
            for month in df['MONTH'].unique():
                date = (year * 100 + month) % 10000
                dfm = df[df['MONTH'] == month].copy()
                qstnum_map[date] = dfm.set_index('ID')['QSTNUM'].to_dict()
        # Write to file
        with open(qstnum_map_file, 'wb') as f:
            pickle.dump(qstnum_map, f)

In [6]:
# Loop over files of interest and generate unique IDs
for file in raw_monthly_data_file_list:
    # Details for matching new file to previous data
    curr_mo = pd.to_datetime(file[:5], format='%b%y')
    curr_mo_short = int(curr_mo.strftime('%y%m'))
    
    # Handling dates before and at break
    if curr_mo < pd.to_datetime('1995-05-01'):
        continue

    print('Current month:', curr_mo)
    
    # Identify possible matching months
    mo_diffs = [1, 2, 3, 9, 10, 11, 12, 13, 14, 15]
    poss_mos = [poss_mo for poss_mo in [curr_mo - pd.DateOffset(months=mo_diff)
                for mo_diff in mo_diffs] 
                if poss_mo >= pd.to_datetime('1995-05-01')]
    
    # Put in format to match with bd CPS data
    yymms = [int(pm.strftime('%y%m')) for pm in poss_mos]

    # Which annual bd CPS files to pull
    years = list(set([pm.year for pm in poss_mos]))
    if curr_mo == pd.to_datetime('1995-05-01'):
        yymms = ['9505']
        years = [1995]
    bd_CPS_files = [f'cps{year}.ft' for year in years]
    
    # For each month in sample, which months can match?
    match_months = {
        2: [1],
        3: [2, 1],
        4: [3, 2, 1],
        5: [12, 11, 10, 9],
        6: [13, 12, 11, 10, 1],
        7: [14, 13, 12, 11, 2, 1],
        8: [15, 14, 13, 12, 3, 2, 1]
    }
    
    # Return list of yymms to search for each MIS based on curr_mo
    search_list = {mis: [int(search_mo.strftime('%y%m')) for search_mo in 
                         [curr_mo - pd.DateOffset(months=mo_diff) 
                          for mo_diff in match_months[mis]] 
                         if search_mo > pd.to_datetime('1995-08-01')]
                   for mis in [2, 3, 4, 5, 6, 7, 8]}

    # Background to read current monthly file
    # read data dictionary text file 
    dd_file = dd_matcher[file]
    data_dict = open(dd_file, 'r', encoding='iso-8859-1').read()
    if dd_file == 'may04dd.txt':
        data_dict = data_dict.replace('HRHHID (partII)', 'HRHHID2')

    # manually list out the IDs for series of interest 
    var_names = ['HRMONTH', 'HRYEAR4', 'HRMIS', 'QSTNUM', 'OCCURNUM', 
                 'HRHHID', 'HRHHID2', 'GESTFIPS', 'HWHHWGT']   

    if curr_mo < pd.to_datetime('2004-05-01'):
        var_names = ['HRMONTH', 'HRYEAR4', 'HRMIS', 'QSTNUM', 'OCCURNUM', 
                     'HRHHID', 'HRSAMPLE', 'HRSERSUF', 'HUHHNUM', 'GESTFIPS', 
                     'HWHHWGT', 'HRYEAR']      

    # regular expression matching series name and data dict pattern
    p = return_dd_parser(dd_file)

    # pick data type based on size of variable
    def id_dtype(size, name):
        size = int(size)
        dtype = ('U4' if name in ['HRSAMPLE']
                 else 'U2' if name in ['HRSERSUF']
                 else 'intp' if size > 9 
                 else 'int32' if size > 4 
                 else 'int16' if size > 2 
                 else 'int8')
        return dtype

    # dictionary of variable name: [start, end, and length + 's']
    if dd_file in ['jan98dd.asc', 'jan98dd2.asc']:
        d = {s[0]: [int(s[2])-1, int(s[2])+int(s[1])-1, 
                    f'{s[1]}s', id_dtype(s[1], s[0])] 
             for s in re.findall(p, data_dict) if s[0] in var_names}       
    else:
        d = {s[0]: [int(s[2])-1, int(s[3]), f'{s[1]}s', id_dtype(s[1], s[0])]
         for s in re.findall(p, data_dict) if s[0] in var_names}

    # data types
    dtypes = [(k, v[-1]) for k, v in d.items()]

    # weight variable start and end location
    ws, we = d['HWHHWGT'][:2]

    # lists of variable starts, ends, and lengths
    start, end, width, dtype = zip(*d.values())

    # create list of which characters to skip in each row
    skip = ([f'{s - e}x' for s, e in zip(start, [0] + list(end[:-1]))])

    # create format string by joining skip and variable segments
    unpack_fmt = ''.join([j for i in zip(skip, width) for j in i])

    # struct can interpret row bytes with the format string
    unpacker = struct.Struct(unpack_fmt).unpack_from

    # Assign new date variable
    date = lambda x: (((x.HRYEAR4.astype(np.int32) * 100) + 
                      x.HRMONTH.astype(np.int8)) % 10000)
    
    # 1998 and onward have OCCURNUM to keep first in HH
    if curr_mo >= pd.to_datetime('1998-01-01'):

    
        # Read new monthly file
        data = [unpacker(row) for row in open(file, 'rb') 
                if (row[ws:we].strip() > b'0')]

        # Convert to dataframe using specified dtypes
        df = (pd.DataFrame(np.array(data, dtype=dtypes))
                .assign(DATE = date))
        
        # Create HHID2 if necessary
        if curr_mo < pd.to_datetime('2004-05-01'):
            df['HRHHID2'] = id2_gen(df)
            
        # Keep only first observation in each HH
        df = df.drop_duplicates(subset=['HRHHID', 'HRHHID2'], keep='first')
        
    else:
        # Read new monthly file
        data = [unpacker(row) for row in open(file, 'rb') 
                if (row[ws:we].strip() > b'0')]

        # Convert to dataframe using specified dtypes
        df = pd.DataFrame(np.array(data, dtype=dtypes))
        
        # Create HHID2 if necessary
        if curr_mo < pd.to_datetime('2004-05-01'):
            df['HRHHID2'] = id2_gen(df)        

        # Keep only first observation in each HH
        df = df.drop_duplicates(subset=['HRHHID', 'HRHHID2'], keep='first')
        
        # Create HRYEAR4 from HRYEAR
        df['HRYEAR4'] = df['HRYEAR'] + 1900
        df = df.drop(['HRYEAR'], axis=1)
        
        # Assign date
        df = df.assign(DATE = date)
        
        # Create QSTNUM
        df['ID'] = df['HRHHID'].astype('str') + df['HRHHID2'].astype('str')
        df = df[df['ID'].isin(qstnum_map[curr_mo_short].keys())]
        df['QSTNUM'] = df['ID'].map(qstnum_map[curr_mo_short])

    # Rename HHIDs
    df = df.rename({'HRHHID': 'HHID', 'HRHHID2': 'HHID2'}, axis=1)

    # Need to map state to state id codes
    df['STATE'] = df['GESTFIPS'].map(StatesMap)

    # Drop GESTFIPS
    df = df.drop(['GESTFIPS'], axis=1)
        
    tot_hh = len(df)
    print('Total HHs in sample:', tot_hh)

    # Read potential match data
    keep_cols = ['YEAR', 'MONTH', 'MIS', 'HHID', 'HHID2', 'QSTNUM', 
                 'OCCURNUM', 'STATE']

    date = lambda x: (((x.YEAR.astype(np.int32) * 100) + 
                      x.MONTH.astype(np.int8)) % 10000)
    
    mdf = (pd.concat(
        [(pd.read_feather(f'clean/cps{year}.ft', columns=keep_cols)
            .assign(DATE = date))
         for year in years], sort=False))
    
    subset = ['YEAR', 'MONTH', 'HHID', 'HHID2']
    
    mdf = mdf.drop_duplicates(subset=subset, keep='first')

    mdf = (mdf[mdf['DATE'].isin(yymms)].drop(['MONTH', 'YEAR'], axis=1))

    # Merge data
    d = {}

    # MIS = 1 households get current id
    dfmis1 = df.loc[df['HRMIS'] == 1, ['QSTNUM', 'DATE']]
    dfmis1['ID'] = dfmis1['DATE'] * 100000 + dfmis1['QSTNUM']
    mis1id = dfmis1.set_index('QSTNUM')['ID'].to_dict()
    d.update(mis1id)
    print('New HHs (MIS1):', len(d))

    df = df.loc[df['HRMIS'] > 1]
    dft = df

    # Loop over MIS and potential matches to find matched id
    for mis in [2, 3, 4, 5, 6, 7, 8]:    
        for pm in search_list[mis]:
            results = (dft.loc[dft['HRMIS'] == mis]
                          .merge(mdf[mdf['DATE'] == pm], 
                                 on=['HHID', 'HHID2', 'STATE']))

            results['ID'] = results['DATE_y'] * 100000 + results['QSTNUM_y']

            matched_id = results.set_index('QSTNUM_x')['ID'].to_dict()
            print(f'Matched HHs (MIS{mis}): ', len(matched_id))
            d.update(matched_id)

            dft = dft.loc[~dft['QSTNUM'].isin(matched_id.keys())]

        if len(search_list[mis]) > 0:
            # Households with no match get current id, same as MIS=1
            new_hh = dft[dft['HRMIS'] == mis]
            new_hh['ID'] = new_hh['DATE'] * 100000 + new_hh['QSTNUM']
            new_hh_d = new_hh.set_index('QSTNUM')['ID'].to_dict()
            d.update(new_hh_d)
            print(f'Replacement HHs (MIS{mis}): ', len(new_hh_d))
            if len(new_hh_d) > 2000:
                print('\nWARNING too many replacements, CHECK!\n')

    print('Total IDs created:', len(d))
    print('Total IDs not created:', tot_hh - len(d), '\n\n')

    monthly_id_dict = {curr_mo: d}

    # Save results
    cps_ids_full.update(monthly_id_dict)


# Write to file
with open(ids_file, 'wb') as f:
    pickle.dump(cps_ids_full, f)
    
print('Total months of IDs:', len(cps_ids_full))

Current month: 2025-04-01 00:00:00
Total HHs in sample: 40882
New HHs (MIS1): 4987
Matched HHs (MIS2):  4274
Replacement HHs (MIS2):  729
Matched HHs (MIS3):  4242
Matched HHs (MIS3):  495
Replacement HHs (MIS3):  330
Matched HHs (MIS4):  4067
Matched HHs (MIS4):  579
Matched HHs (MIS4):  233
Replacement HHs (MIS4):  69
Matched HHs (MIS5):  3901
Matched HHs (MIS5):  432
Matched HHs (MIS5):  159
Matched HHs (MIS5):  56
Replacement HHs (MIS5):  411
Matched HHs (MIS6):  3774
Matched HHs (MIS6):  576
Matched HHs (MIS6):  241
Matched HHs (MIS6):  47
Matched HHs (MIS6):  443
Replacement HHs (MIS6):  311
Matched HHs (MIS7):  3836
Matched HHs (MIS7):  351
Matched HHs (MIS7):  190
Matched HHs (MIS7):  34
Matched HHs (MIS7):  449
Matched HHs (MIS7):  242
Replacement HHs (MIS7):  142
Matched HHs (MIS8):  3685
Matched HHs (MIS8):  486
Matched HHs (MIS8):  162
Matched HHs (MIS8):  29
Matched HHs (MIS8):  417
Matched HHs (MIS8):  228
Matched HHs (MIS8):  117
Replacement HHs (MIS8):  46
Total IDs cre

#### Pre-1994 data

In [6]:
# Dictionary of unique IDS for 1989-93
ids_file = 'CPSID_89-93.pkl'
if not os.path.isfile(ids_file):
    
    match_mos = {2: [1],
                 3: [2, 1],
                 4: [3, 2, 1],
                 5: [12, 11, 10, 9],
                 6: [13, 12, 11, 10, 1],
                 7: [14, 13, 12, 11, 2, 1],
                 8: [15, 14, 13, 12, 3, 2, 1]}
    
    date_range = [(dt.year, dt.month) for dt in 
                  pd.date_range(start='1989-01-01', end='1996-12-01', freq='MS')]

    columns = ['MONTH', 'YEAR', 'STATE', 'MIS', 'HHID', 'HHID2', 'QSTNUM', 
               'OCCURNUM', 'HHNUM', 'BASICWGT']

    id2 = lambda x: np.where(x['HHID2'] > 0, x['HHID2'] % 100, x['HHNUM'])
    
    mdf = (pd.concat([(pd.read_feather(f'clean/cps{year}.ft', columns=columns)
                        .query('OCCURNUM == 1')) for year in range(1989, 1997)])
             .assign(ID2 = id2))

    mdf['DATE'] = ((mdf['YEAR'] * 100) + mdf['MONTH']) % 10000 

    combined_ids = {}

    for year, month in date_range:

        ids = {}

        mo_id = (year % 100) * 100 + month
        curr_mo = pd.to_datetime(f'{year}-{month}-01')

        data = (pd.read_feather(f'clean/cps{year}.ft', columns=columns)
                  .query('MONTH == @month and OCCURNUM == 1'))
        
        if 'HHID2' in data.keys():
            data['ID2'] = data['HHID2'] % 100
        else:
            data['ID2'] = data['HHNUM']

        data['DATE'] = mo_id

        ndf = data.query('MIS == 1')

        ndf['ID'] = ndf['DATE'] *100000 + ndf['QSTNUM']

        new_id = ndf.set_index('QSTNUM')['ID'].to_dict()

        ids.update(new_id)

        # Search for old IDS
        search_list = {mis: [(date.year, date.month) for date in 
                             [curr_mo - pd.DateOffset(months=mos) for mos in offsets]]
                       for mis, offsets in match_mos.items()}


        for mis, slist in search_list.items():

            df = data.query('MIS == @mis')

            for i, (syear, smonth) in enumerate(slist):

                d = mdf.query('YEAR == @syear and MONTH == @smonth and MIS == (@i + 1)')

                if len(d[d['HHID2'] > 0]) > 0:
                    results = (df.merge(d, on=['HHID', 'HHID2', 'STATE']))
                else:
                    results = (df.merge(d, on=['HHID', 'ID2', 'STATE']))

                results['ID'] = results['DATE_y'] * 100000 + results['QSTNUM_y']

                matched_id = results.set_index('QSTNUM_x')['ID'].to_dict()

                ids.update(matched_id)

                df = df.query('QSTNUM not in @ids.keys()')

            df['ID'] = df['DATE'] * 100000 + df['QSTNUM']

            new_id = df.set_index('QSTNUM')['ID'].to_dict()

            ids.update(new_id)

        combined_ids[curr_mo] = ids

    # Write to file
    with open(ids_file, 'wb') as f:
        pickle.dump(combined_ids, f)