Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add groupby to dataframe #278

Merged
merged 3 commits into from Oct 15, 2020
Merged

Add groupby to dataframe #278

merged 3 commits into from Oct 15, 2020

Conversation

V1NAY8
Copy link
Contributor

@V1NAY8 V1NAY8 commented Oct 4, 2020

PR on #172

@sethmlarson Can you please review these when free? Still lot of work has to be done. 馃槙
I implemented groupby to dataframe using composite aggreagation

  • Only implemented single aggs and aggregation

  • TODO Count, grouped (ed_flights.groupby("Cancelled")), dropna

  • One more suggestion needed
    For the following query eland is taking 7.93 seconds, whereas pandas takes 1.56 seconds 馃槰
    image

If you can review the composite aggregation logic, and give some inputs on dropna, grouped, count (how to query ES using filters?). I will proceed on the logic in eland. That's where I was getting stopped.

@elasticmachine
Copy link

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

Copy link
Contributor

@sethmlarson sethmlarson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some early comments, I need to pull this PR locally and give it a spin to see what the requests and responses look like :)

eland/dataframe.py Show resolved Hide resolved
eland/dataframe.py Outdated Show resolved Hide resolved
eland/operations.py Show resolved Hide resolved
eland/operations.py Outdated Show resolved Hide resolved
@V1NAY8
Copy link
Contributor Author

V1NAY8 commented Oct 8, 2020

@sethmlarson
Just a quick one,
If you comment out the following in operations.py at line 701

if b.is_timestamp:
    agg_value = elasticsearch_date_to_pandas_date(
        agg_value, b.es_date_format
    )

There will be an improvement of at least 2 seconds 馃槻
Then output for the above mentioned query would be:
image

I guess we have to do something in this:
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-date-format.html

Any suggestion?

Copy link
Contributor

@sethmlarson sethmlarson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments for you!

The timestamp field should be the same as what it'd be if it wasn't in the index so if it's a datetime it should still be a datetime. This will be necessary when we go to implement resample().

eland/dataframe.py Outdated Show resolved Hide resolved
eland/groupby.py Outdated Show resolved Hide resolved
eland/groupby.py Outdated Show resolved Hide resolved
eland/field_mappings.py Outdated Show resolved Hide resolved
eland/field_mappings.py Show resolved Hide resolved
eland/query.py Outdated Show resolved Hide resolved
@V1NAY8
Copy link
Contributor Author

V1NAY8 commented Oct 14, 2020

  • Added tests, Could you please check the performance of groupby with different examples?
  • I tried some examples, It looks good.
  • Ill optimize if required.
  • Please add hacktoberfest-accepted to this.

Copy link
Contributor

@sethmlarson sethmlarson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pulling this locally to take a closer look again! :) Left you a collection of comments. Also a general comment, would be good to start working on more test cases, especially to cover errors.

I need to add pytest-cov so we can see how much test coverage we have :)

eland/groupby.py Outdated Show resolved Hide resolved
eland/groupby.py Outdated Show resolved Hide resolved
eland/dataframe.py Outdated Show resolved Hide resolved
eland/dataframe.py Show resolved Hide resolved
eland/dataframe.py Outdated Show resolved Hide resolved
eland/field_mappings.py Show resolved Hide resolved
eland/dataframe.py Show resolved Hide resolved
@sethmlarson
Copy link
Contributor

jenkins test this please

@V1NAY8
Copy link
Contributor Author

V1NAY8 commented Oct 15, 2020

  • Code Coverage Bot will be good :)
  • I think we also need to add a Pull request template that has checks in the Contributing.md doc,
  • Issue template when raised for issues.

@V1NAY8
Copy link
Contributor Author

V1NAY8 commented Oct 15, 2020

  • Made groupby strictly typed
  • added more tests (I hope this should cover implemented functionality)

@sethmlarson
Copy link
Contributor

jenkins test this please

@V1NAY8
Copy link
Contributor Author

V1NAY8 commented Oct 15, 2020

Yay! Builds successful 馃槑

Copy link
Contributor

@sethmlarson sethmlarson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me! Will do some tweaking and documenting now that the core functionality is there. Thanks for working so hard on this :)

@sethmlarson sethmlarson merged commit abc5ca9 into elastic:master Oct 15, 2020
@V1NAY8 V1NAY8 deleted the issue/172 branch October 15, 2020 16:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants