In [1]:
import pandas as pd
import numpy as np
from git import Repo

In [2]:
repo = Repo("/Users/crankshaw/tugboat_test")

In [3]:
cur_commit = repo.commit("HEAD")
last_update = repo.commit("HEAD~1")


In [4]:
for da in last_update.diff(cur_commit).iter_change_type('A'):
    print(da.a_path)

features/another_pred_1/predict_1_svm.pkl
features/another_pred_1/predict_1_svm.pkl_01.npy
features/another_pred_1/predict_1_svm.pkl_02.npy
features/another_pred_1/predict_1_svm.pkl_03.npy
features/another_pred_1/predict_1_svm.pkl_04.npy
features/another_pred_1/predict_1_svm.pkl_05.npy
features/another_pred_1/predict_1_svm.pkl_06.npy
features/another_pred_1/predict_1_svm.pkl_07.npy
features/another_pred_1/predict_1_svm.pkl_08.npy
features/another_pred_1/predict_1_svm.pkl_09.npy
features/another_pred_1/predict_1_svm.pkl_10.npy
features/another_pred_1/predict_1_svm.pkl_11.npy
features/another_pred_1/predict_1_svm.pkl_12.npy
features/another_pred_1/predict_1_svm.pkl_13.npy
models/third_model.model


##What happens after a commit?

When some change is made to the model repo, we need to detect what changed and decide what actions to take based on that change. These actions include reloading existing features or applications, adding new ones,
or deleting old ones.

__NOTE:__ The diff should be computed as `old_commit.diff(new_commit)`, not the other way around; otherwise added and deleted files are swapped.
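
As a rough sketch of the detection step, the paths in a diff can be grouped by change type before deciding on any action. The helper name `collect_paths` and the dictionary keys are made up for illustration; the only GitPython pieces it relies on are `iter_change_type`, `a_path`, and `b_path`, and it assumes the diff was built as `old_commit.diff(new_commit)`.

```python
def collect_paths(diff_index):
    """Group the paths of a GitPython-style DiffIndex by change type.

    Assumes diff_index came from old_commit.diff(new_commit), so the
    "a" side is the old commit and the "b" side is the new one.
    """
    return {
        # a_path is used for added, modified, and deleted entries,
        # matching the cell above.
        "added": [d.a_path for d in diff_index.iter_change_type("A")],
        "modified": [d.a_path for d in diff_index.iter_change_type("M")],
        "deleted": [d.a_path for d in diff_index.iter_change_type("D")],
        # Renames are the one case where the two sides differ:
        # a_path is the old name, b_path is the new name.
        "renamed": [(d.a_path, d.b_path)
                    for d in diff_index.iter_change_type("R")],
    }
```

With the commits from above this would be called as `collect_paths(last_update.diff(cur_commit))`.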

####Features:

* __Added:__
    * Detection: A `features/feature_name/feature_name.pkl` file appears in the added paths list.
    * Action: Added to list of available features, but not loaded unless feature is used by at least one model.
* __Modified:__
    * Detection: The pickle for a feature object can span more than one file, so a single feature is stored under `features/feature_name/`; that directory must contain a `feature_name.pkl` file but may contain other files as well. A feature named `feature_name` is assumed to have changed if any file under `features/feature_name/` appears in the modified paths list, OR if a `features/feature_name/*` file without a `.pkl` extension appears in the added paths list AND that feature has not already been detected as a new feature. Given how I think pickling works, I highly doubt the second situation can arise, but it's an important edge case to handle for git. It arises because a feature in our system is a single object but is made up of multiple paths, and paths are the unit of change that git looks at (not strictly true, but close enough).
    * Action: Added to the list of reload_features. Once we've compiled the list of features that need to be reloaded, we union it with the list of new features. This gives us the list of uninitialized features, which we intersect with the list of features in use to yield the unloaded features that need to be loaded. This process completely ignores versioning for now: each model is expected to use the latest version of each feature.
* __Deleted:__
    * Detection: 
* __Renamed:__ This is an edge case that I haven't fully thought through. Renaming should be discouraged, and possibly we just throw an error. The problem is that names are unique identifiers for objects. Unlike in source code, where name changes are common, a rename really messes things up in our scheme.
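
The added and modified rules for features can be sketched in plain Python over lists of paths. This is only a sketch of the rules as written above, not a finished implementation: the function names (`feature_name`, `is_feature_pickle`, `classify_features`, `features_to_load`) are hypothetical, and it assumes paths use `/` separators as git reports them.

```python
def feature_name(path):
    """Return the feature a path belongs to, or None if it is not
    of the form features/<name>/<something>."""
    parts = path.split("/")
    if len(parts) >= 3 and parts[0] == "features":
        return parts[1]
    return None


def is_feature_pickle(path):
    """True for paths of the form features/<name>/<name>.pkl."""
    parts = path.split("/")
    return (len(parts) == 3 and parts[0] == "features"
            and parts[2] == parts[1] + ".pkl")


def classify_features(added_paths, modified_paths):
    """Return (new_features, reload_features) as sets of feature names."""
    # Added: the feature's own .pkl file shows up in the added paths.
    new_features = {feature_name(p) for p in added_paths
                    if is_feature_pickle(p)}

    # Modified, case 1: any file under the feature's directory changed.
    reload_features = {feature_name(p) for p in modified_paths
                       if feature_name(p) is not None}

    # Modified, case 2: a non-.pkl file was added to a feature that
    # was not already detected as new.
    for p in added_paths:
        name = feature_name(p)
        if (name is not None and not p.endswith(".pkl")
                and name not in new_features):
            reload_features.add(name)

    return new_features, reload_features


def features_to_load(new_features, reload_features, in_use):
    """Union new and reloaded features, then keep only those in use."""
    uninitialized = new_features | reload_features
    return uninitialized & in_use
```

Sets are used so that the union and intersection in the action step map directly onto `|` and `&`.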

####Models:

* __Added:__
    * Detection: a `*.model` file is present in the added paths list
    * Action: create and load the new model
    * Questions:
        * Where should training data for a specific model be stored?
* __Modified:__
    * Detection: a `*.model` file is present in the modified paths list
    * Action: Reload and retrain model. Already loaded training data does not need to be reloaded.
* __Deleted:__
    * Detection: a `*.model` file is present in the deleted paths list
    * Action: The model is deleted. Remove that model's refs to each of the features it is using. If any features have ref-count 0, they can be removed as well.
* __Renamed:__ For simplicity, treat as old model deleted and new model added.
    * Detection: `*.model` file present in the renamed paths list
    * Action: Add `diff.a_path` (the old name) to the deleted files list, and add `diff.b_path` (the new name) to the added files list.
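
The model rules, including treating a rename as a delete plus an add, can be sketched the same way. The names here (`is_model`, `classify_models`, `release_model`) and the two bookkeeping dictionaries are assumptions for illustration; renames are passed in as `(old_path, new_path)` pairs.

```python
def is_model(path):
    return path.endswith(".model")


def classify_models(added, modified, deleted, renamed):
    """Return (added_models, modified_models, deleted_models).

    renamed is a list of (old_path, new_path) pairs; each rename is
    treated as the old model being deleted and the new one added.
    """
    added_models = [p for p in added if is_model(p)]
    modified_models = [p for p in modified if is_model(p)]
    deleted_models = [p for p in deleted if is_model(p)]
    for old_path, new_path in renamed:
        if is_model(old_path):
            deleted_models.append(old_path)
        if is_model(new_path):
            added_models.append(new_path)
    return added_models, modified_models, deleted_models


def release_model(model, model_features, ref_counts):
    """Drop a deleted model's feature refs.

    model_features maps model -> set of feature names it uses;
    ref_counts maps feature name -> number of models using it.
    Returns the features whose ref-count dropped to 0.
    """
    unused = []
    for feat in model_features.pop(model, ()):
        ref_counts[feat] -= 1
        if ref_counts[feat] == 0:
            del ref_counts[feat]
            unused.append(feat)
    return unused
```

Features returned by `release_model` are the ones that can be unloaded once the model is gone.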


In [None]:

'''

Commits can cause a few issues, some of which are expected
and some of which are errors.

Features:
    added: OK, but no action taken unless there are models using these
           features.
    removed: Only okay if 

'''
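def get_new_objects(last_commit, cur_commit):
    available_features = []
    unloaded_features = []
    required_features = []
    # Diff from the old commit to the new one, using the function's
    # own arguments rather than the notebook-level last_update.
    diff = last_commit.diff(cur_commit)
    new_models = []
    new_features = []
    new_paths = [da.a_path for da in diff.iter_change_type('A')]
    modified_paths = [dm.a_path for dm in diff.iter_change_type('M')]
    deleted_paths = [dd.b_path for dd in diff.iter_change_type('D')]
    # Both new and modified paths may require a (re)load.
    reload_paths = new_paths + modified_paths

    # TODO: sort these paths into features and models.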