
## Loading explanations and comparing groups

The fifty observations in the validation data with the highest predicted GPAs, and the fifty with the lowest, were each given an "explanation" using the LIME algorithm. These explanations are contained in a dictionary which I load here, along with two files containing the indices and the predicted values. I also load another file that contains the name and description of each variable.

Inspection of these explanations should illustrate the features the model is using to determine whether respondents have high or low GPA scores.

In [24]:
import pickle
import pandas as pd

def load_pickle(path):
    # Open inside a context manager so the file handle is closed after loading
    with open(path, 'rb') as f:
        return pickle.load(f)

exp = load_pickle('new_lime_explanations_dict.p')
lowest_indices = load_pickle('lowest_indices.p')
highest_indices = load_pickle('highest_indices.p')
meta = pd.read_csv('ffc_variable_types.csv')

In [29]:
meta.index = meta['variable']
del meta['variable']

In [15]:
exp

{15: [('m3j4b_6_1.0 <= 0.00', 0.082601255055925388),
  ('f3d3a_5_1.0 <= 0.00', 0.062078020834230613),
  ('m4d4_4.0 <= 0.00', 0.014356585195862134),
  ('p5j10a_3.0 <= 0.00', 0.010538206245675492),
  ('m4h1l_2.0 <= 0.00', 0.0094738829952781805)],
 19: [('f3i2_9.0 <= 0.00', 0.029054045455762831),
  ('t5b3e_3.0 <= 0.00', 0.028583245177947875),
  ('hv3f5_6.0 <= 0.00', 0.014999446000797939),
  ('f3j2f > 0.33', -0.0094305725743658499),
  ('0.00 < f5k14f_1.0 <= 1.00', -0.0077164551024945655)],
 21: [('m2b18g_6.0 <= 0.00', 0.044521990055136927),
  ('p5i10a_10.0 <= 0.00', 0.033210535574034854),
  ('m1f15_1.0 <= 0.00', 0.031400485711111009),
  ('m5a8f03_1.0 > 0.00', -0.010279181353953523),
  ('f4k10 <= -0.33', 0.0086821913586804921)],
 28: [('f3d1a_4.0 <= 0.00', 0.070615491440118097),
  ('p5q2a_3.0 <= 0.00', 0.036323032580108459),
  ('hv3k1b_5.0 <= 0.00', 0.011213509721147406),
  ('m2d4a_4.0 <= 0.00', 0.0082278754288294276),
  ('m4c13_1.0 <= 0.00', 0.0076683240766350852)],
 29: [('hv3k1a_4.0 <= 0

Now that the data have been loaded I first do some basic analysis to see the range of variables that are present in the explanations.

LIME returns each explanation as a list of (condition, weight) tuples. First I convert each of these lists to a dictionary, which is an easier format to process. I then convert the result to a pandas dataframe.

In [16]:
explanations = {}
for k,v in exp.items():
    user_exp = {}
    for x in v:
        user_exp[x[0]] = x[1]
    explanations[k] = user_exp
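
Since each value in `exp` is a list of `(condition, weight)` pairs, the same conversion can be sketched more compactly with `dict()`. The `sample_exp` dictionary below is a hand-copied excerpt of the output above, and both `sample_exp` and `sample_explanations` are illustrative names rather than objects from the notebook.

```python
# Excerpt of the LIME output shown above, copied by hand for illustration.
sample_exp = {
    15: [('m3j4b_6_1.0 <= 0.00', 0.082601255055925388),
         ('f3d3a_5_1.0 <= 0.00', 0.062078020834230613)],
    19: [('f3i2_9.0 <= 0.00', 0.029054045455762831),
         ('f3j2f > 0.33', -0.0094305725743658499)],
}

# dict() turns a list of (key, value) pairs into a dictionary directly,
# so the inner loop collapses to a single call per observation.
sample_explanations = {obs: dict(pairs) for obs, pairs in sample_exp.items()}

print(sample_explanations[19]['f3j2f > 0.33'])
```

This behaves like the loop above as long as no condition string is repeated within one observation's list; a repeated key would be overwritten in either version.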

We can inspect a given element of the dictionary to see the explanation for a particular observation. For example, the sub-dictionary below contains the explanation for observation 15. The keys in this dictionary combine a variable, a response category, and a condition. For example, the first key `f3d3a_5_1.0 <= 0.00` denotes the variable `f3d3a_5`, corresponding to the question posed to the father of the child in year 3 of the survey: "Who could you trust: child's sibling?". The second part of the key denotes that the dummy variable for response category `1.0` was less than or equal to `0`. Looking this up in the [survey documentation](https://fragilefamilies.princeton.edu/sites/fragilefamilies/files/ff_dad_cb3.txt) shows that `1.0` corresponds to an answer of `Yes` to the question. While this syntax is somewhat confusing, it means that this particular dummy variable had a value of 0 for this respondent, so the child's father did not answer yes to this question.

The value of this element of the dictionary is a local coefficient generated by LIME that indicates the weight that this variable contributed to the local prediction. In this case the coefficient was positive.



In [23]:
explanations[15]

{'f3d3a_5_1.0 <= 0.00': 0.062078020834230613,
 'm3j4b_6_1.0 <= 0.00': 0.082601255055925388,
 'm4d4_4.0 <= 0.00': 0.014356585195862134,
 'm4h1l_2.0 <= 0.00': 0.0094738829952781805,
 'p5j10a_3.0 <= 0.00': 0.010538206245675492}

In [33]:
meta.loc['f3d3a_5']

label            Who could you trust: child's sibling?
unique_values                                        5
variable_type                              categorical
Name: f3d3a_5, dtype: object
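
To make the reading of these keys concrete, here is a minimal sketch that turns a key into the dummy column it refers to and whether that dummy was set for the respondent. The helper name `describe_dummy_key` is hypothetical, and the sketch assumes the column is a 0/1 dummy; for continuous variables such as `f3j2f > 0.33` the second value has no such interpretation.

```python
def describe_dummy_key(key):
    """Return (column, is_set) for a LIME key on a 0/1 dummy column.

    Hypothetical helper for illustration; assumes the key has one of the
    two shapes seen in the output above.
    """
    parts = key.split()
    if len(parts) == 5:
        # Two-sided form, e.g. '0.00 < f5k14f_1.0 <= 1.00': the dummy is 1.
        return parts[2], True
    # One-sided form, e.g. 'f3d3a_5_1.0 <= 0.00' or 'm5a8f03_1.0 > 0.00'.
    column, op, _ = parts
    return column, op.startswith('>')

for key in ['f3d3a_5_1.0 <= 0.00',
            'm5a8f03_1.0 > 0.00',
            '0.00 < f5k14f_1.0 <= 1.00']:
    print(key, '->', describe_dummy_key(key))
```

For the key discussed above, the helper reports that the `f3d3a_5_1.0` dummy was not set, which matches the reading that the father did not answer yes.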

I now convert this dictionary into a pandas dataframe.

In [17]:
df = pd.DataFrame.from_dict(explanations, orient='index')

In [20]:
df.shape

(100, 481)

In [18]:
df.head()

Unnamed: 0,hv4d1e_1.0 <= 0.00,f4b6c_1.0 <= 0.00,f1f5_1.0 > 0.00,m2citywt_rep6 > -0.11,p5l18_3.0 <= 0.00,m1d2g_3.0 <= 0.00,f3d1f_4.0 <= 0.00,f3d4a_1.0 <= 0.00,m3f2c2 <= -0.68,m3e1_1.0 <= 0.00,...,hv3b14_9.0 <= 0.00,p5q3r_3.0 <= 0.00,m4i0h_2.0 <= 0.00,m5g16d_3.0 <= 0.00,0.00 < m4k16a_4.0 <= 1.00,t5d5_2.0 <= 0.00,f1g9e_2.0 <= 0.00,f4b3_2.0 <= 0.00,hv3j23j_4.0 <= 0.00,f3b6d_4.0 <= 0.00
15,,,,,,,,,,,...,,,,,,,,,,
19,,,,,,,,,,,...,,,,,,,,,,
21,,,,,,,,,,,...,,,,,,,,,,
28,,,,,,,,,,,...,,,,,,,,,,
29,,,,,,,,,,,...,,,,,,,,,,


In [97]:
df_low = df.loc[[x[0] for x in lowest_indices]]
df_high = df.loc[[x[0] for x in highest_indices]]
df_low = df_low.dropna(axis=1, how='all') # dropping empty columns
df_high = df_high.dropna(axis=1, how='all')

Now, to get a sense of the important variables I can simplify the columns by extracting the variable names and creating a new dataframe. Apologies for the rather ugly code.

In [44]:
def extract_variable_name(s):
    """
    This function parses the column names in the explanations to extract the variable name
    from the FF survey.
    
    I have left in the print statements to illustrate how the algorithm is working."""
    components = s.split()
    print(s)
    try: 
        float(components[0]) # if first component can be cast to a float then var name is in 3rd
        print('First component is a float')
        var = components[2]
        print('Name is in ', var)
    except ValueError:
        var = components[0]
        print('Name is in ', var)
        
    if '_' in var:
        subcomponents = var.split('_')
        if var.count('_') == 1:
            # if substring after the _ can't be cast to float then it is part of the name
            try:
                float(subcomponents[1])
                varname = subcomponents[0]
            except ValueError:
                varname = var
        elif var.count('_') > 1:
            print("More than one underscore in ", var)
            varname = subcomponents[0]+'_'+subcomponents[1]
            print("Variable name is ", varname)
            
    else:
        varname = var
    print(varname)
    return varname 

explanations_2 = {}
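for k, user_explanation in explanations.items():
    user_exp = {}
    # Use distinct names so the inner loop does not rebind the outer loop variable
    for key, weight in user_explanation.items():
        var = extract_variable_name(key)
        user_exp[var] = weight
    explanations_2[k] = user_exp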
    
df_names = pd.DataFrame.from_dict(explanations_2, orient='index')

hv4d1e_1.0 <= 0.00
Name is in  hv4d1e_1.0
hv4d1e
f4b6c_1.0 <= 0.00
Name is in  f4b6c_1.0
f4b6c
f1f5_1.0 > 0.00
Name is in  f1f5_1.0
f1f5
m2citywt_rep6 > -0.11
Name is in  m2citywt_rep6
m2citywt_rep6
p5l18_3.0 <= 0.00
Name is in  p5l18_3.0
p5l18
m1d2g_3.0 <= 0.00
Name is in  m1d2g_3.0
m1d2g
f3d1f_4.0 <= 0.00
Name is in  f3d1f_4.0
f3d1f
f3d4a_1.0 <= 0.00
Name is in  f3d4a_1.0
f3d4a
m3f2c2 <= -0.68
Name is in  m3f2c2
m3f2c2
m3e1_1.0 <= 0.00
Name is in  m3e1_1.0
m3e1
hv4r10a_8_1.0 > 0.00
Name is in  hv4r10a_8_1.0
More than one underscore in  hv4r10a_8_1.0
Variable name is  hv4r10a_8
hv4r10a_8
m4a13_5.0 <= 0.00
Name is in  m4a13_5.0
m4a13
m2c3e_202.0 <= 0.00
Name is in  m2c3e_202.0
m2c3e
hv4c13_0.0 <= 0.00
Name is in  hv4c13_0.0
hv4c13
0.00 < t5c7b_1.0 <= 1.00
First component is a float
Name is in  t5c7b_1.0
t5c7b
f2b36g_203.0 <= 0.00
Name is in  f2b36g_203.0
f2b36g
t5e18c_4.0 <= 0.00
Name is in  t5e18c_4.0
t5e18c
p5i29_4.0 <= 0.00
Name is in  p5i29_4.0
p5i29
f4d1e_2.0 <= 0.00
Name is in  f
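
The same parsing can also be sketched with a regular expression. This is not the function used in this notebook but a quieter, hypothetical alternative (`variable_from_key`), and it rests on an assumption: that category suffixes are always rendered as decimals such as `_1.0`, so anything else after an underscore is part of the variable name.

```python
import re

# Assumption: category suffixes always look like '_1.0', '_203.0', etc.
CATEGORY_SUFFIX = re.compile(r'_-?\d+\.\d+$')

def variable_from_key(key):
    """Hypothetical alternative to extract_variable_name, without printing."""
    parts = key.split()
    # Two-sided conditions ('0.00 < name <= 1.00') keep the name in the middle.
    column = parts[2] if len(parts) == 5 else parts[0]
    # Strip a trailing category suffix if there is one.
    return CATEGORY_SUFFIX.sub('', column)

for key in ['hv4d1e_1.0 <= 0.00', 'm2citywt_rep6 > -0.11',
            'hv4r10a_8_1.0 > 0.00', '0.00 < t5c7b_1.0 <= 1.00']:
    print(key, '->', variable_from_key(key))
```

On the keys printed above this gives the same names as the original function. The two can differ on names that end in an underscore followed by an integer without a decimal point, which the original treats as a category and this sketch does not.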

In [45]:
df_names.shape

(100, 431)

This shows that 431 unique variables appear across the 100 explanations. Given that each observation has only 5 variables in its explanation, this indicates that many variables occur in only a single explanation. By counting the empty cells in each column of the new dataframe, and subtracting that count from the number of rows, we can determine how many times each variable appeared in the explanations.

Using the indices imported above I can sub-set the dataframe to get the 50 highest and 50 lowest observations.

In [62]:
low = df_names.loc[[x[0] for x in lowest_indices]]
high = df_names.loc[[x[0] for x in highest_indices]]

In [68]:
def get_variable_freqs(df):
    """Count, for each column, the rows in which the variable is present (non-null)."""
    return df.notnull().sum().to_dict()

In [71]:
low_freqs = get_variable_freqs(low)

In [72]:
high_freqs = get_variable_freqs(high)

In [100]:
for i, j in sorted(low_freqs.items(), key=lambda x: x[1], reverse=True):
    print(i, j, meta.loc[i]['label'], [x for x in list(df_low.columns) if i in x])

f4k10 3 How many hours do you usually work per week at that job? ['f4k10 <= -0.33', 'f4k10 > 0.48']
k5a1a 3 A1A. How often PCG knows what you do during your free time ['k5a1a_2.0 <= 0.00']
hv4ppvtage 3 cv (mpr): age of child in months at ppvt administration ['hv4ppvtage_70.0 <= 0.00', 'hv4ppvtage_64.0 <= 0.00', 'hv4ppvtage_60.0 > 0.00']
m4a13 2 In last 2 yrs, how many romantic relationships lasted one month plus? ['m4a13_5.0 <= 0.00', 'm4a13_11.0 <= 0.00']
f2b36g 2 # Days/week you hug or show physical affection to child? ['f2b36g_203.0 <= 0.00', 'f2b36g_202.0 <= 0.00']
m1f15 2 If BF doesn't want to marry the mother, could he be required to pay child supp.? ['0.00 < m1f15_5.0 <= 1.00', 'm1f15_1.0 <= 0.00']
o5a8 2 A8. Best describe the home or building ['o5a8_101.0 <= 0.00', 'o5a8_6.0 <= 0.00']
hv5_haz 2 Child's height-for-age Z-score ['hv5_haz > 0.44']
m4f2b4 2 Is fourth person male or female? ['m4f2b4_2.0 <= 0.00']
hv3m21 2 m21: he/she hits others ['hv3m21_0.0 <= 0.00', '0.00 < hv3m21_

KeyError: 'the label [cm2md_case] is not in the [index]'

In [101]:
for i, j in sorted(high_freqs.items(), key=lambda x: x[1], reverse=True):
    print(i, j, meta.loc[i]['label'],  [x for x in list(df_high.columns) if i in x])

f1e4a 2 Who was the person who was like a father to you? ['f1e4a_10.0 <= 0.00', '0.00 < f1e4a_10.0 <= 1.00']
hv4a14 2 a14: Last 12m: how many times has child been taken to the emergency room? ['0.00 < hv4a14_0.0 <= 1.00', 'hv4a14_6.0 <= 0.00']
m1f1b 2 How long have you lived in neighborhood - months? ['m1f1b_4.0 <= 0.00', 'm1f1b_1.0 <= 0.00']
cm2kids 2 Constructed - # of children in HH ['cm2kids_8.0 <= 0.00', 'cm2kids_6.0 <= 0.00']
hv4b4 2 b4: wkend day: how much time child spends playing computer games at home/elsewh? ['hv4b4_6.0 <= 0.00', 'hv4b4_4.0 <= 0.00']
m2c3c 2 How many days a week does father-Read stories to child? ['m2c3c_7.0 <= 0.00', 'm2c3c_204.0 <= 0.00']
hv3h2b 2 h2b: if child has tantrum in public & words not work, what do you do? ['hv3h2b_7.0 <= 0.00', 'hv3h2b_17.0 <= 0.00']
m2h8g 2 What other kinds of local/state/federal agencies have helped you since child's b ['m2h8g_107.0 <= 0.00', 'm2h8g_108.0 <= 0.00']
m4f2d2 2 What is second person's relationship to you? ['m4f2d2

KeyError: 'the label [cf4md_case] is not in the [index]'
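
Both loops stop with a `KeyError` because some variable names (`cm2md_case` and `cf4md_case` in the errors above) are not present in the index of `meta`. A minimal sketch of a more tolerant lookup follows. The toy objects and the helper `lookup_label` are hypothetical stand-ins, and the column `cm2md_case_1.0 <= 0.00` is invented purely to trigger the missing-label case.

```python
import pandas as pd

# Toy stand-ins for `meta`, a frequency dictionary and the explanation columns.
toy_meta = pd.DataFrame(
    {'label': ['How many hours do you usually work per week at that job?']},
    index=['f4k10'])
toy_freqs = {'f4k10': 3, 'cm2md_case': 1}
toy_columns = ['f4k10 <= -0.33', 'f4k10 > 0.48',
               'cm2md_case_1.0 <= 0.00']  # invented column for illustration

def lookup_label(meta_df, variable, default='(no label in metadata)'):
    # Series.get returns the default instead of raising when the key is absent.
    return meta_df['label'].get(variable, default)

for var, count in sorted(toy_freqs.items(), key=lambda kv: kv[1], reverse=True):
    matches = [c for c in toy_columns if var in c]
    print(var, count, lookup_label(toy_meta, var), matches)
```

With a fallback like this the loop would run to completion, and the variables lacking metadata could be looked up by hand afterwards.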

The questions and their answers can now be manually looked up in the FF codebook to determine whether there are differences between the high and low respondents.
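
To decide which variables to look up first, the two frequency dictionaries could be placed side by side. The sketch below uses small toy dictionaries with made-up counts in place of `low_freqs` and `high_freqs`; the names `toy_low`, `toy_high` and `comparison` are illustrative only.

```python
import pandas as pd

# Toy frequency dictionaries with made-up counts, standing in for the
# output of get_variable_freqs on the low and high groups.
toy_low = {'f4k10': 3, 'k5a1a': 3, 'm4a13': 2}
toy_high = {'f1e4a': 2, 'f4k10': 1}

# Align the two dictionaries on variable name; a variable missing from one
# group gets a count of 0 there.
comparison = (pd.DataFrame({'low': pd.Series(toy_low),
                            'high': pd.Series(toy_high)})
              .fillna(0)
              .astype(int))
comparison['difference'] = comparison['low'] - comparison['high']

print(comparison.sort_values('difference', ascending=False))
```

Variables with a large positive or negative difference appear mostly in one group, which may make them reasonable starting points for the codebook lookup.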