Integrate FAISS index for KNN classifier #557
Conversation
Job PR-557-1 is done.
Excited to see this! Once my comments relating to extending KNNModel are addressed, I will try this out and run it on our benchmarks to see how it compares to the default KNN.
```python
    def fit(self, X_train, y_train):
        X_train = X_train.astype(np.float32)
```
If we check if X_train is a dataframe, we could do X_train.to_numpy(dtype=np.float32), which may eliminate an unnecessary data copy.
Great point - fixed. Also fixed a second unnecessary copy that might happen in the following line.
```python
        return self

    def predict(self, X):
        X = X.astype(np.float32)
```
ditto
```diff
@@ -32,7 +32,7 @@ def preprocess(self, X):
     def _set_default_params(self):
         default_params = {
             'weights': 'uniform',
-            'n_jobs': -1,
+            'index_factory_string' : 'Flat',
```
Are we not able to use multiple cores with this implementation? Why was n_jobs removed?
Added this param back in to allow for the use of fewer cores. It was removed because faiss uses OpenMP underneath, which defaults to using all cores, and there wasn't much in the docs on how to change it.
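For reference, one generic way to cap OpenMP worker counts is the standard `OMP_NUM_THREADS` environment variable (a sketch under the assumption that it is set before the OpenMP runtime starts; faiss's Python bindings also expose a runtime setter, `faiss.omp_set_num_threads`, not shown here since faiss may not be installed):

```python
import os

# Cap OpenMP-backed libraries (such as faiss) at 4 threads. This must be
# set before the library initializes its OpenMP runtime to take effect.
os.environ["OMP_NUM_THREADS"] = "4"
print(os.environ["OMP_NUM_THREADS"])
```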
```diff
 from ...constants import REGRESSION
 from ....utils.exceptions import NotEnoughMemoryError

 logger = logging.getLogger(__name__)


 # TODO: Normalize data!
-class KNNModel(SKLearnModel):
+class KNNModel(AbstractModel):
```
In the first version of this new FAISS implementation, could you instead extend the KNN model to a new model class so we aren't removing the existing KNNModel? Something like FAISSModel? This will help significantly with apples to apples comparisons.
In regards to changing KNNModel to implement AbstractModel instead of SKLearnModel, I believe this is fine as KNN no longer utilizes SKLearnModel to my knowledge.
Good suggestion - done.
Perhaps I wasn't quite clear in my first comment. I was referring to the KNNModel class, such as creating a new:
```python
class FAISSModel(KNNModel):
    def _get_model_type(self):
        if self.problem_type == REGRESSION:
            return FAISSNeighborsRegressor
        else:
            return FAISSNeighborsClassifier

    def _set_default_params(self):
        default_params = {
            'index_factory_string': 'Flat',
        }
        for param, val in default_params.items():
            self._set_default_param_value(param, val)
        super()._set_default_params()
```
and changing KNNModel slightly to look like:
```python
class KNNModel(AbstractModel):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._model_type = self._get_model_type()

    def _get_model_type(self):
        if self.problem_type == REGRESSION:
            return KNeighborsRegressor
        else:
            return KNeighborsClassifier
```
Ah, thanks - this makes a bit more sense than what I had in mind. Should be able to easily use both model types for comparisons now.
```python
# Rather than try to import non-public sklearn internals, we implement our own weighting functions here
# These support the same operations as the sklearn functions - at least as far as possible with FAISS
def _check_weights(weights):
```
Minor nit comments relating to code readability: if you haven't already, I'd recommend adding PEP8 violation highlighting to your IDE. While we don't follow the PEP8 line-length recommendation, we do follow most others, such as having 2 new lines between each def and class, as well as 2 new lines after imports before code is written.
This isn't critical, as I can clean up the code afterwards, but is something to keep in mind in future PRs to make the process smoother.
Thanks for the tip, I didn't know you could do this. I did it and formatted everything in a separate commit.
```python
        if weights is None:
            y_pred, _ = mode(outputs, axis = 1)
        else:
            y_pred,_ = weighted_mode(outputs, weights, axis = 1)
```
nit: add a space between `,` and `_`
```python
        return y_pred

    def predict_proba(self, X):
        X = X.astype(np.float32)
```
ditto
setup.py (Outdated)
```diff
@@ -57,7 +57,8 @@ def create_version_file():
     'pandas>=0.24.0,<1.0',
     'psutil>=5.0.0',
     'scikit-learn>=0.22.0,<0.23',
-    'networkx>=2.3,<3.0'
+    'networkx>=2.3,<3.0',
+    'faiss-cpu>=1.6.3'
```
May decide to remove this as a required dependency, but can keep for the moment. Will have to identify how heavy the package is.
```python
        self.index = faiss.deserialize_index(self.index)
```
nit: Have exactly 2 new lines between classes
```python
        self.__dict__.update(state)
        self.index = faiss.deserialize_index(self.index)
```
nit: have exactly 1 new line before the EOF
```diff
@@ -53,6 +53,8 @@ def __init__(self, n_neighbors = 5, weights='uniform', n_jobs = -1, index_factor

         The model itself is a clone of the sklearn one
         """
+        try_import_faiss()
+        import faiss
```
You can just set `self.faiss = faiss` and then you don't need to import it in every method.
Should there be a hyperparameter that quantifies the exactness (approximation error) of the nearest-neighbor retrieval when the index gets built? It would be good to document at least one straightforward way to control this, say in a comment for now.
Do you plan to extend this to categorical + numeric features?
This is what `index_factory_string` is for - different factory strings produce an index with different exactness tradeoffs.
For now, no. FAISS only supports float32 vectors, so it's not immediately clear how to do this. You could encode categorical features as vectors in R^d (either naively or with a mapping to a sphere). You could also build a classifier around a (non-FAISS) index that supports something like the Jaccard metric. I'm not sure which is best.
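As a minimal sketch of the naive R^d encoding mentioned above (the function name is hypothetical, not from this PR):

```python
import numpy as np


def one_hot_float32(codes, n_categories):
    """Encode integer category codes as float32 one-hot vectors, so they
    can be concatenated with numeric columns for a float32-only index."""
    codes = np.asarray(codes)
    out = np.zeros((codes.shape[0], n_categories), dtype=np.float32)
    out[np.arange(codes.shape[0]), codes] = 1.0
    return out


# e.g. one_hot_float32([0, 2, 1], 3) gives rows [1,0,0], [0,0,1], [0,1,0]
```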
I think CI is broken atm by some package update, will look into it Sunday.
@brc7 I believe the CI is now fixed on mainline. Can you rebase your PR onto mainline to get the latest fix?
Sounds good, thanks for clarifying. I'm unsure whether mixing Jaccard + Euclidean metrics for datasets with both categorical + continuous features will work well with any existing indices. Might you have bandwidth to add this in a followup PR? Other ideas for followup PRs:
@Innixma I rebased onto the psutil fix, but it seems that CI is still breaking. Can you confirm? (maybe I did the rebase wrong)
@brc7 No worries, I think the psutil fix we did did not resolve the issue, and the issue appears to be non-deterministic. We will continue looking into it, but it isn't a problem with your code.
Job PR-557-10 is done.
Great, now it has passed CI! Next is to benchmark the new implementation against the previous one. I have written some code to help you get started:
Running this on my Mac, I get the following results:
While I'm very glad to see that the disk usage is halved when using FAISS, it seems to be much slower to infer. Could you experiment to see why this is and remedy it? The difference seems to become larger with more training data: with `SAMPLE=None`, KNN is 9.7s and FAISS is 454s on CoverType. Also, the scores between KNN and FAISS are not always identical, and can at times differ drastically. Can you explain why this is? Another bit of optimization I stumbled upon which you might want to give a try: https://www.sicara.ai/blog/2017-07-05-fast-custom-knn-sklearn-cython
The default settings in the FAISSNeighbor classes just do exact brute-force search on CPU with no smart algorithms, while sklearn automatically chooses a good index type. More sensible defaults will probably fix it, but I'll profile / benchmark to make sure there's nothing too crazy going on.
I have no idea but it will be interesting to find out, since they're supposed to be the same.
Auto-selection of the best index type based on dataset characteristics & GPU/CPU hardware would be great to have, but for now I'd say the default index type should be set to maximize performance for training datasets with #rows in the 50-500k range and #columns in the 10-100 range (I'd guess this is the bulk of autogluon-KNN usage where performance actually starts to matter).
Regarding the validation score differences:

TL;DR: FAISS cheats when it computes distances.

Example: I issued a query for sklearn, and it returned:

Okay, what does FAISS say for this query?

This difference seems small, but if many points have close distances, you might get a different classification with the FAISSNeighbor classifier than with sklearn. This happens because FAISS uses BLAS on CPU to brute-force distances for batch sizes larger than 20, and BLAS computes the distance in such a way that the 32-bit rounding errors become a problem. This is controlled by the faiss.distance_compute_blas_threshold parameter. If you set it larger than the query batch size, I have verified that the distances become correct and the validation scores are the same, at a small speed cost.

Do we want to do anything about this? Note that the other (faster) index types, such as HNSW, are approximate, so the rounding errors are not so noticeable there.
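The effect is easy to reproduce with plain NumPy (a sketch, not FAISS itself): BLAS-based batch code computes squared distances via the expansion ||a||^2 + ||b||^2 - 2&lt;a,b&gt;, which loses precision in float32 when the norms are much larger than the distance itself.

```python
import numpy as np

rng = np.random.default_rng(0)
# A large offset makes the norms big relative to the distance,
# amplifying float32 cancellation in the expanded form
a = (rng.standard_normal(128) + 100.0).astype(np.float32)
b = (rng.standard_normal(128) + 100.0).astype(np.float32)

# Direct form, as a simple brute-force loop would compute it
d_direct = np.sum((a - b) ** 2)

# Expanded form, as BLAS-based batch code computes it
d_expanded = np.dot(a, a) + np.dot(b, b) - 2.0 * np.dot(a, b)

print(d_direct, d_expanded)  # close, but typically not bit-identical
```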
It's primarily down to what is practically most useful. I think the validation score differences are less of a concern at present than the inference times. If matching KNN on validation only costs a small speed decrease, it might be best to do so, so that we don't have to take that additional factor into consideration when determining which approach is better.
All right, I did some comparisons. FAISS has a lot of index types, so this is only a partial comparison (even though there are lots of rows in the table). PQ is product quantization (with brute-force search), IVFx is a type of sharding (where we only check some of the "x" shards), Flat is a brute-force search, and HNSWx is the approximate near-neighbor graph method with "x" connections per node. ef_search is a parameter that trades off HNSW accuracy for speed; by setting it to the minimum, you get the fastest possible HNSW with maybe less good results.

AdultIncomeBinaryClassification
AmesHousingPriceRegression
CoverTypeMulticlassClassification
A good step toward an auto-selection routine would be to encode these guidelines, but for now I am leaning toward an HNSW index, probably with IVF (dependent on n) and some form of vector compression, as the default. The trouble is that the best index is pretty problem dependent, but I think HNSW is a good all-purpose option.
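For reference, the index families discussed above map to faiss index_factory strings roughly as follows (a hedged sketch: the strings follow faiss's documented factory syntax, but the specific parameter values are illustrative, not taken from this PR):

```python
# Illustrative index_factory strings for the index families discussed.
# The numbers (16, 100, 32, ...) are placeholders showing the syntax,
# not tuned recommendations.
INDEX_FACTORY_EXAMPLES = {
    "brute force": "Flat",
    "product quantization (brute-force scan)": "PQ16",
    "inverted file, 100 shards": "IVF100,Flat",
    "HNSW graph, 32 links per node": "HNSW32",
    "IVF shards over PQ-compressed vectors": "IVF100,PQ16",
}
print(INDEX_FACTORY_EXAMPLES["brute force"])
```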
@brc7 Thanks for the experiments! Hopefully extending the KNN to categoricals will not greatly alter these conclusions (I guess if we use OHE, the KNN will just operate on higher-dimensional vectors where quantization would anyway have no accuracy effect for the OHE dimensions). Do you think we should view validation scores for the regression task as well? I suspect the approximation factor of nearest-neighbor retrieval may have an even greater effect on accuracy in regression.

Re the 2 datasets with validation accuracies: sklearn performance seems unmatched by any point along FAISS's Pareto frontier of accuracy/inference speed. If all sklearn were doing differently was automatically choosing the appropriate index, I'd expect its performance would lie at one point along FAISS's Pareto frontier. In particular, none of the FAISS models (regardless of index) are as fast as sklearn for CoverType; any insight on this?
Sklearn has different algorithm options than FAISS. I looked at their auto-selection code - it seems sklearn always uses a BallTree unless you have a really weird distance function. It may be that I am configuring the FAISS index wrong, or maybe BallTrees are actually better for typical AutoGluon use cases. I think it's a combination of both. FAISS (HNSW) can get great performance (look at ann-benchmarks), but those benchmarks do a lot of dataset-specific tuning and are run on large problems. I suspect that due to memory limits, FAISS will be the only option for large-scale AutoML (i.e. N > 10 million) or for really high-dimensional inputs (i.e. OHE for a big multi-class problem), and the best option when a GPU is available.
Regarding large-scale data, can you try to also get benchmark results for uncapped CoverType (`SAMPLE=None`)?

Regarding the state of the code, I think we are good to merge after the faiss dependency is removed from setup.py, so that it is purely an optional extra dependency until we nail down more aspects of its performance. Can you please update the PR to have setup.py be unchanged? Then I am good to give approval and merge.
Should have clarified - all results are for `SAMPLE=None`; the covtype experiments are with the full 464k data points. I've removed the setup.py dependency - should be good to go.
Job PR-557-11 is done.
Thanks for the contribution!
@brc7 The validation scores are negative for regression because we assume higher == better for all metrics, so those like MSE are flipped in sign. I think looking at some regression performance will be important in choosing the index (we can of course use a different default index for regression vs classification). Thanks for the explanation re sklearn's BallTree vs FAISS options. I agree FAISS should still be the best option for large datasets, but its advantage seems less clear for these 3 datasets (except that model size has improved), so it's important to establish where FAISS >> sklearn. Also, if we end up needing to keep sklearn KNN around as well as FAISS (e.g. for different data regimes), it may be worth standardizing their APIs a bit more so new KNN functionality can be developed for both the sklearn and FAISS models without duplicate code.
I'm not sure what you mean here. In how I envisioned it, either sklearn-KNN or FAISS could be used to produce held-out predictions on demand from one index (i.e. no bagging), so whichever is better could be used. My thinking was just to produce leave-one-out predictions for each held-out datapoint after first calling KNN

Technically this shared KNN model should be built with

One practical issue that may arise with this leave-one-out scheme is that the KNN validation scores may be biased higher than other models' because KNN is trained on more data (all except one datapoint) than 10-fold training offers. But this also shouldn't affect ensemble construction, just the leaderboard.
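The leave-one-out idea can be sketched with plain NumPy (brute force, hypothetical function name; a real version would reuse the fitted index rather than recompute all pairwise distances):

```python
import numpy as np


def loo_knn_predict(X, y, k=3):
    # Pairwise squared distances, with each point's self-distance masked
    # out so every prediction is genuinely leave-one-out
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)
    neighbors = np.argsort(d2, axis=1)[:, :k]
    # Majority vote among each point's k nearest *other* points
    return np.array([np.bincount(y[row]).argmax() for row in neighbors])
```

On two well-separated clusters, every point's leave-one-out prediction matches its own cluster label.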
@brc7 and @jwmueller Regarding small-to-medium datasets (<10M rows), perhaps we could use nmslib? According to ann-benchmarks, nmslib's HNSW implementation is SOTA and nearly an order of magnitude faster than FAISS: https://github.com/erikbern/ann-benchmarks. The downside is it doesn't scale as well to truly huge 1B+ row datasets like FAISS can, but even the FAISS authors recommend using nmslib for small-to-medium datasets when memory isn't a major concern: https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors and facebookresearch/faiss#23
There's a great video on HNSW, how it works, and how to choose the parameters to trade off disk space, accuracy, inference and model build time here: https://www.pinecone.io/learn/hnsw/
@willsmithorg Thanks for this! We may consider this more deeply when we look into optimizing for 10M+ row datasets. Currently KNN is ~250x faster in v0.3.1 than it was at the time of this PR due to numerous optimizations (usage of sklearnex + efficient LOO for bagging), so the concern over having a faster KNN model has become less critical. Still, the issue of practicality on truly large datasets remains an open problem, which we will likely turn to ANN to solve.
Description of changes:
This code adds
The new KNeighborsClassifier is a drop-in replacement for the sklearn model, but is built on top of a FAISS index. There is one additional optional parameter for the model: `index_factory_string`, which describes what type of FAISS index to use (this just gets passed to FAISS's index construction method).
There is one new dependency on faiss-cpu.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.