Features/distances #694

Open: wants to merge 13 commits into base: main

VascoSch92
Contributor

Just a first sketch.

Let me know what you think :-)

@solegalli solegalli linked an issue Sep 13, 2023 that may be closed by this pull request
Collaborator

@solegalli solegalli left a comment

Hey @VascoSch92

Thank you so much for the contribution! This is looking really good.

I think the logic for the distance calculation is good. We need to make this transformer look like the other transformers that we have in the library.

Why don't you have a look at the class RelativeFeatures in the creation module, and try to incorporate that logic in this class as well? Mostly related to which parameters we need in the init, and which checks we normally do. And there you will see as well how you can import many premade bits of text for the documentation.

Let me know how you get along! Thank you!

feature_engine/creation/distance_features.py
_check_param_drop_original(drop_original=drop_original)
self.drop_original = drop_original

self.variables = None
Collaborator

I don't think we need this parameter. I'd suggest using RelativeFeatures as template to model this class: https://github.com/VascoSch92/feature_engine/blob/e1e927625678ee73c5c3a9edcf79e955ff9c5e8e/feature_engine/creation/relative_features.py

Collaborator

variables is a parameter that we have in all transformers, so I would stick to this name instead of using coordinate_columns.

In short, let's replace coordinate_columns with variables.

@glevv
Contributor

glevv commented Sep 15, 2023

Is it possible to calculate Haversine distance using sklearn? It is quite fast and well optimized, reimplementing it seems like a not so good idea.

P.S. it could be quite interesting to add more measures of distance, for example Ruler distance

@VascoSch92
Contributor Author

VascoSch92 commented Sep 15, 2023

Is it possible to calculate Haversine distance using sklearn? It is quite fast and well optimized, reimplementing it seems like a not so good idea

Yes, it is possible to compute the Haversine distance with sklearn. I was also thinking of using apply together with sklearn's Haversine distance function.

The question is: is it faster than vectorisation?

But I'm happy to change it if that is faster, or if there is a faster method than mine ;-)

@glevv
Contributor

glevv commented Sep 15, 2023

Is it possible to calculate Haversine distance using sklearn? It is quite fast and well optimized, reimplementing it seems like a not so good idea

Yes of course I know that. The question is: Can you vectorise it? it is faster than vectorisation?

I'm not sure I understand the question.
The scikit-learn implementation is vectorized by default.

@kylegilde
Contributor

Is it possible to calculate Haversine distance using sklearn? It is quite fast and well optimized, reimplementing it seems like a not so good idea

Yes of course I know that. The question is: Can you vectorise it? it is faster than vectorisation?

I'm not sure I understand the question. Scikit-learn implementation is vectorized by default

I think the issue with the sklearn implementation is that it does a cartesian product between X and Y and yields a matrix.

We only need a pairwise calculation between X and Y that yields a vector.
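
For readers following the thread, the row-wise calculation described above can be sketched in plain NumPy. This is a minimal illustration, not the code from this PR: the function name `haversine_rowwise`, its argument order, and the Earth radius constant are all assumptions made for the example.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius for this sketch


def haversine_rowwise(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Distance between point i of the first set and point i of the second set.

    Inputs are equal-length array-likes in decimal degrees.
    Returns a 1-D array with one distance per row (a vector, not a matrix).
    """
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(a, dtype=float)) for a in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # Clip guards against floating-point overshoot above 1 before the arcsin.
    return 2 * radius * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# Two rows: a quarter of the equator, and a point compared with itself.
distances = haversine_rowwise([0.0, 10.0], [0.0, 20.0], [0.0, 10.0], [90.0, 20.0])
```

Every operation here is elementwise, so time and memory grow linearly with the number of rows.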

@glevv
Contributor

glevv commented Sep 16, 2023

np.diag(haversine_distances(X, Y) * R) would give you the vector you want

@kylegilde
Contributor

haversine_distances

I know that it is a simple way to code this, but from a time complexity perspective, it's not a great idea to use quadratic complexity when only linear complexity is needed.
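
To make the complexity point concrete, here is a small sketch comparing the two approaches. It assumes scikit-learn's `haversine_distances` from `sklearn.metrics.pairwise`, which takes `[latitude, longitude]` columns in radians; the random data, the sample size `n`, and the radius `R` are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import haversine_distances

R = 6371.0  # assumed mean Earth radius in km
n = 500
rng = np.random.default_rng(0)

# Columns are [latitude, longitude], converted to radians.
X = np.radians(np.column_stack([rng.uniform(-90, 90, n), rng.uniform(-180, 180, n)]))
Y = np.radians(np.column_stack([rng.uniform(-90, 90, n), rng.uniform(-180, 180, n)]))

# Full pairwise matrix: n * n distances are computed and stored.
full = haversine_distances(X, Y) * R
via_diag = np.diag(full)  # only n of those n * n values are kept

# Row-wise haversine: n distances computed, nothing discarded.
dlat = Y[:, 0] - X[:, 0]
dlon = Y[:, 1] - X[:, 1]
a = np.sin(dlat / 2) ** 2 + np.cos(X[:, 0]) * np.cos(Y[:, 0]) * np.sin(dlon / 2) ** 2
rowwise = 2 * R * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
```

Both routes give the same vector, but the first one allocates an `(n, n)` array along the way, which becomes impractical for large datasets.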

@glevv
Contributor

glevv commented Sep 17, 2023

haversine_distances

I know that it is a simple way to code this, but from a time complexity perspective, it's not a great idea to use quadratic complexity when only linear complexity is needed.

Yea, you are right, this way it will be better

@solegalli
Collaborator

Is it possible to calculate Haversine distance using sklearn? It is quite fast and well optimized, reimplementing it seems like a not so good idea.

P.S. it could be quite interesting to add more measures of distance, for example Ruler distance

Hey @glevv thanks for the suggestion.

If I understood this blog correctly, it has 3 computations: Euclidean, haversine (the one we are trying to implement here) and a more complicated one that has a smaller error (Vincenty's formula). Is this correct?

I'd suggest we stick to haversine in this PR, and see if we create an issue to expand the class later with Vincenty's. Is this formula commonly used? Do we really need an error as small as 0.5mm for geo coordinates?

@glevv
Contributor

glevv commented Sep 18, 2023

Is it possible to calculate Haversine distance using sklearn? It is quite fast and well optimized, reimplementing it seems like a not so good idea.
P.S. it could be quite interesting to add more measures of distance, for example Ruler distance

Hey @glevv thanks for the suggestion.

If I understood this blog correctly, it has 3 computations: Euclidean, haversine (the one we are trying to implement here) and a more complicated one that has a smaller error (Vincenty's formula). Is this correct?

They are all measures of the distance between two points on an ellipsoid. The post did not cover Vincenty's formula, which is quite heavy to compute. It discussed two simpler and faster formulas (Cheap Ruler and the FCC equation) that have a higher error.

I'd suggest we stick to haversine in this PR, and see if we create an issue to expand the class later with Vincenty's. Is this formula commonly used? Do we really need an error as small as 0.5mm for geo coordinates?

Ye, let's go with haversine only, not sure about Vincenty tho

@VascoSch92
Contributor Author

Hey @solegalli
Sorry for disappearing. I had a lot going on with work and life. I will try to take a look at this pull request next week ;-)

@solegalli
Collaborator

No problem at all, @VascoSch92. Same here.

I am making some big changes to the correlation transformers. I think we could release a new version once I have those finished, hopefully during February.

It would be great if we could squeeze this transformer into that release too. If you find the time, we look forward to your contribution :)

@VascoSch92
Contributor Author

No problem at all, @VascoSch92. Same here.

I am making some big changes to the correlation transformers. I think we could release a new version once I have those finished, hopefully during February.

It would be great if we could squeeze this transformer into that release too. If you find the time, we look forward to your contribution :)

Hey @solegalli :-)
Is it time to give this transformer another try? What do you think?

@solegalli
Collaborator

Sure! Contributions are welcome any time :)

@VascoSch92
Contributor Author

ok perfect. I will work on it.

@VascoSch92
Contributor Author

Hey @solegalli
I finally have something.

I still need some guidance on a few points:

  • Which classes should I extend? I'm extending BaseNumericalTransformer, FitFromDictMixin and GetFeatureNamesOutMixin, but I don't know if that is a good idea.
  • I have a fit method even though I'm not using it. Should I keep it anyway, or can I delete it?

Collaborator

@solegalli solegalli left a comment

Hey @VascoSch92

This is looking good. We need to tidy the code a bit. In the init, we should only check that the user enters allowed inputs, and then assign them. We don't need extra functions/methods for these checks.

In the fit, we need to unpack the variables and check the allowed values.

If you could re-arrange that, I can then take another look.

It would be great if you could rebase onto main after syncing it with the latest version of the repo, because we have made a lot of changes since this PR was opened.

Thank you!
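
The split described in the review above (input checks and assignment in the init, data checks in fit) could look roughly like the sketch below. Everything in it is hypothetical: the class name `GeoDistanceSketch`, the `output_name` parameter, the `[lat1, lon1, lat2, lon2]` ordering of `variables`, and the range checks are assumptions for illustration. It deliberately avoids the feature_engine base classes and helper functions, so it does not reflect the library's actual API.

```python
import numpy as np
import pandas as pd


class GeoDistanceSketch:
    """Hypothetical skeleton; not the class from this PR."""

    def __init__(self, variables, output_name="distance", drop_original=False):
        # init: only check that the inputs are allowed, then assign them unchanged.
        if (
            not isinstance(variables, list)
            or len(variables) != 4
            or not all(isinstance(v, (str, int)) for v in variables)
        ):
            raise ValueError("variables must be a list: [lat1, lon1, lat2, lon2].")
        if not isinstance(output_name, str):
            raise ValueError("output_name must be a string.")
        if not isinstance(drop_original, bool):
            raise ValueError("drop_original must be a boolean.")
        self.variables = variables
        self.output_name = output_name
        self.drop_original = drop_original

    def fit(self, X, y=None):
        # fit: unpack the variables and check the values found in the data.
        missing = [v for v in self.variables if v not in X.columns]
        if missing:
            raise ValueError(f"Columns not found in X: {missing}")
        lat1, lon1, lat2, lon2 = self.variables
        for lat in (lat1, lat2):
            if not X[lat].between(-90, 90).all():
                raise ValueError(f"{lat} has latitudes outside [-90, 90].")
        for lon in (lon1, lon2):
            if not X[lon].between(-180, 180).all():
                raise ValueError(f"{lon} has longitudes outside [-180, 180].")
        self.variables_ = list(self.variables)
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        X = X.copy()
        lat1, lon1, lat2, lon2 = (
            np.radians(X[v].to_numpy(dtype=float)) for v in self.variables_
        )
        a = (
            np.sin((lat2 - lat1) / 2) ** 2
            + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
        )
        # 6371.0 km is an assumed mean Earth radius.
        X[self.output_name] = 2 * 6371.0 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
        if self.drop_original:
            X = X.drop(columns=self.variables_)
        return X


df = pd.DataFrame({"a": [0.0], "b": [0.0], "c": [0.0], "d": [90.0]})
out = GeoDistanceSketch(variables=["a", "b", "c", "d"]).fit(df).transform(df)
```

Keeping the init free of any data-dependent logic follows the scikit-learn convention that constructor parameters are stored as given, which is what lets tools such as cloning and parameter search work.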

feature_engine/creation/distance_features.py
return parameter

def _check_coordinate_columns(
self, columns: List[List[Union[str, int]]]
Collaborator

Is this a list of lists? I think it is just a list.

Development

Successfully merging this pull request may close these issues.

Geo distance transformer
4 participants