Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add functionality to calculate the OVERVIEW section of the report using manuscripts2 #80

Merged
merged 2 commits into from
Jul 25, 2018

Conversation

aswanipranjal
Copy link
Contributor

@aswanipranjal aswanipranjal commented Jul 23, 2018

This PR adds the functionality to calculate the OVERVIEW section of the report.

  • Adds tests for git and github_issues data sources
  • Adds data that is to be used to perform the tests

@coveralls
Copy link

coveralls commented Jul 23, 2018

Pull Request Test Coverage Report for Build 198

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+2.08%) to 57.39%

Totals Coverage Status
Change from base Build 193: 2.08%
Covered Lines: 897
Relevant Lines: 1563

💛 - Coveralls

Copy link
Contributor

@jgbarah jgbarah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pr is too complex for a good review. Please, split it in three, one for git, another one for github issues, and a third one for github prs.

I suggest that you commit the first one, with only the changes for git, if you want as a new version of this pr. That way, I can start reviewing it straight away.

Then, while I review it, produce the other two, with the current code, in separate prs. Once the one for git is done, you can just rewrite the parts needed in the other, but I can start commenting them even before you modify them.

Do you find this reasonable?

Copy link
Contributor

@jgbarah jgbarah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments, on the testing framework. I see you're using the old testing utils, that basically build enriched and raw indexes before testing. I find that too complex. In fact, for testing we only need an enriched index. From it, loaded in ES, all tests could run. Let's move to that schema. In short, you would need:

  • Some description on how the test enriched indexes are generated, so that we can generate them easily when we want. In fact, having a script for that would be great. This script could be based on the current code in utils.py, or in using p2o with some options.

  • Some code for uploading the index, that will be in the data directory, to ES before the actual testing.

Copy link
Contributor

@jgbarah jgbarah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, have a look at my comments, and try to answer or address them. We can later start from there...

self.id = "commits"
self.name = "Commits"
self.desc = "Changes to the source code"
self.query = self.query.get_cardinality("hash").by_period()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using query.timeseries here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jgbarah You mean to say that I should calculate the timeseries data here?

Right now I am just adding all the necessary aggregations into the query object and then I get the timeseries in report.py by calling the timeseries() method on an instance of the Commits class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I overlooked the code then, sorry. Now I realize what you do. OK, let's go this way.

self.id = "authors"
self.name = "Authors"
self.desc = "People authoring commits (changes to source code)"
self.query = self.query.get_cardinality("author_uuid").by_period()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using query.timeseries here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is same as Commits.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same answer... ;-)

"""

results = {
"activity_metrics": [Commits(index, start, end)],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we return a list, instead of just Commits(index, start, end)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was using a list for the other data sources (github_issues, github_prs) as they have more than one metric in activity_metrics so I just wanted to be consistent.

I'll change it to what you suggest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do it this way: since we have one name for each metric, let's just return the data structure for that metric, as data in a dictionary where the name of the metric is the key. If we want to group them later, for presentation, we can do it.

Copy link
Contributor Author

@aswanipranjal aswanipranjal Jul 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, @jgbarah It'll be easier for each metric (key) in the dict to have the values in a list so that we can easily iterate and add them together as one section of the report, in the report.
It was being done previously in manuscripts also and will be more convenient.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand you. For each metric, we should have only one value. If the value is complex (eg, a time series) we should have a time series. But since maybe I'm missing something, let's go this way you propose. But please, think about it, and try to find out if when you say "a list" isnt't it really, for example, several values in a data series.

"git": git.GitMetrics,
"github_issues": github_issues.IssuesMetrics,
"github_prs": github_prs.PullRequestsMetrics,
"git": git,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why you're not using the contants you defined earlier here? (I mean, GIT_INDEX, etc.).

Or even better, since this seems to be just a transposing of the previous dictionary, why not compute it instead to write it again "by hand"=

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, thanks! I'll change it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

}

def __init__(self, es_url=None, start=None, end=None, data_dir=None, filters=None,
interval="month", offset=None, data_sources=None,
report_name=None, projects=False, indices=[], logo=None):

Query.interval_ = interval
self.interval = Query.interval_ = interval
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not belong exactly here, but any way: why Query.interval_ instead of Query.interval?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

??

Copy link
Contributor Author

@aswanipranjal aswanipranjal Jul 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interval_ variable of Query class can be set for all the child classes, so I wanted it to be different from the interval variable of Report class. Hence the underscore at the end.

}

def __init__(self, es_url=None, start=None, end=None, data_dir=None, filters=None,
interval="month", offset=None, data_sources=None,
report_name=None, projects=False, indices=[], logo=None):

Query.interval_ = interval
self.interval = Query.interval_ = interval
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I don't like very much setting properties from a class from the code in another class. I think it is much more explicit to instantiate the class when you have all the data to instantiate, and then just pass the property value as a parameter to the instantiation. Is there any problem in doing it that way?

(similar comment for Index.es, below.

Copy link
Contributor Author

@aswanipranjal aswanipranjal Jul 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here is to set the interval for all the child classes at once. All the metrics have a query object created using the Query class. By setting the Query.interval_ value here, we won't need to set the interval again and again in all the files for each of the metrics.

For Index.es too, I wanted to set the url for each of the Index objects that were instantiated later so that we won't have to set the url for each of them again and again.

self.start_date = datetime(2015, 1, 1)
self.end_date = datetime(2018, 7, 10)

# self.config = self.__get_config(data_sources=data_sources)

def get_metric_index(self, data_source):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you include text to define this function? I'm not sure what it is intended to do, exactly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is supposed to return the elasticsearch index for a corresponding data source. As the user can pass their own elasticsearch index names for each of the data sources, this function chooses in between the default and the user inputed es indices and returns the user inputed one if it is available

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Please, add that as the description of the function.


# AUTHOR METRICS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have some common code for doing this kind of stuff. Apparently, this is just creating tables, figs, from the metrics, and apparently (given the proper label, filename, etc as arguments) this could be implemented with a function, that could be called for whatever the metrics that has this structure. Am I right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't gotten into the other sections of the report. But yes, I think we can create a function out of this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to wait until I got to the other sections of the report so that the function can be generalised better. Is that okay with you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, but please, remember this... Now, the code is becoming too spaghetti...

self.git = DATA_SOURCES['git']
self.git_index = Index(self.git[1])

Query.interval_ = "month"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the kind of behavior I was commenting with respect to initializing properties in a class. Since you're not calling report here, you need to do this initialization. But a casual user, very unlikely will be aware of this large effect of calling report...

self.assertEquals(last, TREND_LAST)
self.assertEquals(trend_percentage, TREND_PRECENTAGE)

authors = overview['author_metrics'][0].timeseries()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're testing that timeseries is properly returning a data frame, that's ok. But if you just want to check that the timeseries is find, you can use the datastructure that does not imply returning a dataframe, and thus is more simple.

BTW, for checking that if you as a dataframe, you get a dataframe, I would write a separate check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update the tests and separate the dataframe and non dataframe tests. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to you.

Copy link
Contributor

@jgbarah jgbarah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, answer or address the items where you couldn't answer up to now. For the rest, I'm ok with your suggestions, so let's go for that.

@aswanipranjal
Copy link
Contributor Author

@jgbarah I've updated most of the code according to the comments that you made.

I still think that setting Query.interval_ in report.py for all the Query instances that are created is simpler than passing the interval for each Query separately.
If you think It's a really bad idea then I'll update that too.

Please review it once so that I can make PRs for github_issues and github_prs too.

Also, the tests might fail because more commits added into the perceval repository. The second commit in this PR adds a tests/utils.py file which allows us to insert frozen data into elasticsearch so I can use that to update the tests for manuscripts2/elasticsearch.py

@aswanipranjal
Copy link
Contributor Author

@jgbarah ping.

@jgbarah
Copy link
Contributor

jgbarah commented Jul 25, 2018

I still think that setting Query.interval_ in report.py for all the Query instances that are created is simpler than passing the interval for each Query separately.

The thing is not whether it is simpler, but whether it will come as a surprise for somebody using the objects. Initializing a class property when instantiating an object is a huge side effect... Let's go this way fro now, but we need to revisit this in the future.

Copy link
Contributor

@jgbarah jgbarah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm accepting the changes, but we're having some technical debt that we need to revisit in the future... Please, see comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants