Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add_table does not completely remove data from previously registered table #44

Open
hanase opened this issue Oct 30, 2019 · 8 comments
Open
Assignees

Comments

@hanase
Copy link
Contributor

hanase commented Oct 30, 2019

When data is added to a table via add_column(), that column is not deleted when the table is overwritten with add_table(). Here is an example:

import orca
import pandas as pd

# Create a table
orca.add_table("my_table", pd.DataFrame({'a': [1, 2, 3]}))

# Add column
orca.add_column("my_table", "b", pd.Series([1000, 2000, 3000]))

# Check the column values
print(orca.get_table("my_table").b)

# Overwrite the table with new data
orca.add_table("my_table", pd.DataFrame({'c': [4, 5, 6]}))

# The new table still has the old column "b"! WRONG!
print(orca.get_table("my_table").b) 

# Throws an error - CORRECT (the original local column not found)
print(orca.get_table("my_table").a)

Function add_table() does clear the cache in this line, but I think it also needs to iterate over _COLUMNS and _LOCAL_COLUMNS (maybe also other places) to delete the old data.

@hanase
Copy link
Contributor Author

hanase commented Oct 30, 2019

Related to this, it would be useful to have a function delete_column(). Currently, it seems that the only way to delete columns is to re-register the whole table.

@smmaurer
Copy link
Member

Hi @hanase!

I wonder if the add_table() behavior is intentional, actually. The result here is unfortunate, but if tables always had to be registered before their associated columns, it might lead to columns failing to show up if code runs in the wrong order when a model is first initializing.

Was just discussing this with @janowicz, and we can't think of any reason not to have explicit deletion functions, though. It looks like they don't exist for anything right now. There could be delete_column(), delete_table(), delete_injectable(), etc.

If there were a delete_table() function that removed a table and all its associated columns, would that satisfy your use case here?

@hanase
Copy link
Contributor Author

hanase commented Oct 30, 2019

Thanks for looking into it @smmaurer and @janowicz!

My use case is as follows: There is a registered column defined as a variable. But sometimes I want to provide external numbers for this column instead of letting orca to computed it (which I do now with add_column). In other cases it needs to be computed (that's why I re-register the table). Thus, anything that would let me completely delete this column (in order to switch between the two cases), but wouldn't loose the registration of the variable, should work.

Thanks a lot!

@bridwell
Copy link
Contributor

bridwell commented Oct 31, 2019

I think the current behavior is what is expected/desired. For example, we re-add (essentially overwrite) the households table each time the household transition model is run. So we still need all the existing orca columns that we have defined for it.

Adding a delete_column function might be helpful, but in your use case is seems like you're just going to be re-adding the function/column anyways, so I'm not sure that really addresses this.

One approach might be to define a simple injectable that defines whether or not we want to use the calculation, and redefine the b orca column so that it uses this

orca.add_injectable('use_orca', True)

@orca.column('my_table')
def b(my_table, use_orca):
    if use_orca:
        return pd.Series([1000, 2000, 3000]))
    else:
        return my_table['c']
orca.get_table('my_table')['b']

0 1000
1 2000
2 3000

# now use the stored/provided value
orca.get_table('my_table').update_col('c', pd.Series([4, 5, 6]))
orca.add_injectable('use_orca', False)
orca.get_table('my_table')['b']

0 4
1 5
2 6

Another approach might be to check for the presence of the c column in the local/raw df and act on that?

@orca.column('my_table')
def b(my_table):
    if 'c' not in my_table.local_columns:
        return pd.Series([1000, 2000, 3000])
    else:
        return my_table['c']

# this call is assuming that c is in the table already
orca.get_table('my_table')['b']

0 4
1 5
2 6

# now remove the local column and use the computed value
del orca.get_table('my_table').local['c']
orca.get_table('my_table')['b']

0 1000
1 2000
2 3000

@hanase
Copy link
Contributor Author

hanase commented Oct 31, 2019

Thanks Scott @bridwell ! I'm not sure if this actually works for my case.

I need to make it more specific: In our application, I need the households.city_id to be recomputed in each step in some years (cache_scope = "step"), but then not be computed at all in other years (cache_scope = "iteration"). In those other years, the transition model runs first where the control totals are set on city_id level. Thus, households.city_id becomes a local (actually as well as computed) column, which I need to stay unchanged until the end of the iteration. (I first re-save the table without it and then use add_column(..., cache_scope='iteration')). But in the next iteration, I want the city_id to be computed again in every step. I tried so many things, but I can't get rid off the values assigned in the add_column call, so it does not go into the variable definition again.

Any idea how to do it?

Thanks.

@bridwell
Copy link
Contributor

bridwell commented Oct 31, 2019

This might be a short term fix, but how's something like this? Basically define a step that will change the scope of the column at the start of each year. Note that in the 1st year of the run the city is re-computed in the 1st step only, in the 2nd year it's re-computed in each step.

# let's define a simple data frame
orca.add_table(
    'persons',
    pd.DataFrame({
        'name': ['john', 'paul','george', 'ringo']
    })
)

# define a simple orca column
# intially define w/ a scope of 'forever' (the default)
@orca.column('persons', cache=True)
def city(persons):
    print('******************')
    print('...computing city')
    print('******************')
    return pd.Series(
        np.repeat('liverpool', len(persons)),
        index=persons.index
    )

def change_col_cache_scope(table, column, scope=None):
    """
    Change the cache scope for a given column. If scope is
    None, assume we're going to stop caching it.
    
    """
    # allowable scopes, values indicate the update granularity
    scopes = {
        None: 0,
        'step': 1,
        'iteration': 2, 
        'forever': 3
    }
    
    # fetch the column and it's caching properties
    c = orca.get_raw_column(table, column)
    curr_cache = c.cache
    curr_scope = c.cache_scope
    
    # change the cache properties
    if scope is None:
        c.cache = False
        # setting back to the default 
        # ?? is there something better to do with this?
        c.cache_scope = 'forever'
    else:
        if scope not in scopes.keys():
            raise ValueError('{} is not an allowed scope'.format(scope))
        c.cache = True
        c.cache_scope = scope
        
    # clear out any existing caches if the provided scope is
    # more granular than the existing
    old_granularity = scopes[curr_scope]
    new_granularity = scopes[scope]
    if new_granularity < old_granularity:
        c.clear_cached()

# let's define a step that changes the cache scope, this should be at the beginning of the run
# in odd years this will use iteration, in even years use step
@orca.step()
def set_scopes(iter_var):
    if iter_var % 2 == 0:
        scope = 'step'
    else:
        scope = 'iteration'
    
    print('================================')
    print('current scope: {}'.format(scope))
    print('================================')
    change_col_cache_scope('persons', 'city', scope)        

#  define some steps, each of thes collects the persons' city column
@orca.step()
def step1(persons):
    city = persons['city']
    
@orca.step()
def step2(persons):
    city = persons['city']
    
@orca.step()
def step3(persons):
    city = persons['city']

# define our run 
# in odd years, the message from the city column should only be printed in the 1st step
# in even years, the message should be printed after every step
orca.run(
    [
        'set_scopes',
        'step1',
        'step2',
        'step3'
    ],
    iter_vars=range(1, 11)
)
Running iteration 1 with iteration value 1
Running step 'set_scopes'
================================
current scope: iteration
================================
Time to execute step 'set_scopes': 0.00 s
Running step 'step1'
******************
...computing city
******************
Time to execute step 'step1': 0.00 s
Running step 'step2'
Time to execute step 'step2': 0.00 s
Running step 'step3'
Time to execute step 'step3': 0.00 s
Total time to execute iteration 1 with iteration value 1: 0.00 s
Running iteration 2 with iteration value 2
Running step 'set_scopes'
================================
current scope: step
================================
Time to execute step 'set_scopes': 0.00 s
Running step 'step1'
******************
...computing city
******************
Time to execute step 'step1': 0.00 s
Running step 'step2'
******************
...computing city
******************
Time to execute step 'step2': 0.00 s
Running step 'step3'
******************
...computing city
******************
Time to execute step 'step3': 0.00 s
Total time to execute iteration 2 with iteration value 2: 0.00 s
...

In the longer term it seems like what you're after (myself also) is a more dynamic cache that only clears things out when a value has changed in a function dependencies so we don't have to deal with this. This is definitely a high priority for me.

@hanase
Copy link
Contributor Author

hanase commented Oct 31, 2019

Wow - this looks like it could work. Obviously, it would be better if the change_col_cache_scope function would be an orca method of the column class, but this is a great workaround.

BTW, the dependencies issue is a high priority for us as well. I have done some work on it which I'm going to talk about in Denver. I hope I'll see you there.

@bridwell
Copy link
Contributor

Cool, I think we're in the same session in Denver. I'll be talking about some of my ideas for orca/caching issues as well. Should be an interesting discussion!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants