-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add functionality to calculate the OVERVIEW section of the report using manuscripts2 #80
Conversation
2683be7
to
f8cde2e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pr is too complex for a good review. Please, split it in three, one for git, another one for github issues, and a third one for github prs.
I suggest that you commit the first one, with only the changes for git, if you want as a new version of this pr. That way, I can start reviewing it straight away.
Then, while I review it, produce the other two, with the current code, in separate prs. Once the one for git is done, you can just rewrite the parts needed in the other, but I can start commenting them even before you modify them.
Do you find this reasonable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more comments, on the testing framework. I see you're using the old testing utils, that basically build enriched and raw indexes before testing. I find that too complex. In fact, for testing we only need an enriched index. From it, loaded in ES, all tests could run. Let's move to that schema. In short, you would need:
-
Some description on how the test enriched indexes are generated, so that we can generate them easily when we want. In fact, having a script for that would be great. This script could be based on the current code in
utils.py
, or in using p2o with some options. -
Some code for uploading the index, that will be in the
data
directory, to ES before the actual testing.
f8cde2e
to
7cf52d9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, have a look at my comments, and try to answer or address them. We can later start from there...
self.id = "commits" | ||
self.name = "Commits" | ||
self.desc = "Changes to the source code" | ||
self.query = self.query.get_cardinality("hash").by_period() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not using query.timeseries
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jgbarah You mean to say that I should calculate the timeseries data here?
Right now I am just adding all the necessary aggregations into the query object and then I get the timeseries in report.py by calling the timeseries()
method on an instance of the Commits
class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I overlooked the code then, sorry. Now I realize what you do. OK, let's go this way.
self.id = "authors" | ||
self.name = "Authors" | ||
self.desc = "People authoring commits (changes to source code)" | ||
self.query = self.query.get_cardinality("author_uuid").by_period() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not using query.timeseries
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is same as Commits
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same answer... ;-)
""" | ||
|
||
results = { | ||
"activity_metrics": [Commits(index, start, end)], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we return a list, instead of just Commits(index, start, end)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was using a list for the other data sources (github_issues
, github_prs
) as they have more than one metric in activity_metrics
so I just wanted to be consistent.
I'll change it to what you suggest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do it this way: since we have one name for each metric, let's just return the data structure for that metric, as data in a dictionary where the name of the metric is the key. If we want to group them later, for presentation, we can do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, @jgbarah It'll be easier for each metric (key) in the dict to have the values in a list so that we can easily iterate and add them together as one section of the report, in the report.
It was being done previously in manuscripts
also and will be more convenient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand you. For each metric, we should have only one value. If the value is complex (eg, a time series) we should have a time series. But since maybe I'm missing something, let's go this way you propose. But please, think about it, and try to find out if when you say "a list" isnt't it really, for example, several values in a data series.
manuscripts2/report.py
Outdated
"git": git.GitMetrics, | ||
"github_issues": github_issues.IssuesMetrics, | ||
"github_prs": github_prs.PullRequestsMetrics, | ||
"git": git, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why you're not using the contants you defined earlier here? (I mean, GIT_INDEX
, etc.).
Or even better, since this seems to be just a transposing of the previous dictionary, why not compute it instead to write it again "by hand"=
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, thanks! I'll change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
manuscripts2/report.py
Outdated
} | ||
|
||
def __init__(self, es_url=None, start=None, end=None, data_dir=None, filters=None, | ||
interval="month", offset=None, data_sources=None, | ||
report_name=None, projects=False, indices=[], logo=None): | ||
|
||
Query.interval_ = interval | ||
self.interval = Query.interval_ = interval |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not belong exactly here, but any way: why Query.interval_
instead of Query.interval
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The interval_
variable of Query
class can be set for all the child classes, so I wanted it to be different from the interval
variable of Report class. Hence the underscore at the end.
manuscripts2/report.py
Outdated
} | ||
|
||
def __init__(self, es_url=None, start=None, end=None, data_dir=None, filters=None, | ||
interval="month", offset=None, data_sources=None, | ||
report_name=None, projects=False, indices=[], logo=None): | ||
|
||
Query.interval_ = interval | ||
self.interval = Query.interval_ = interval |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I don't like very much setting properties from a class from the code in another class. I think it is much more explicit to instantiate the class when you have all the data to instantiate, and then just pass the property value as a parameter to the instantiation. Is there any problem in doing it that way?
(similar comment for Index.es
, below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea here is to set the interval for all the child classes at once. All the metrics have a query
object created using the Query
class. By setting the Query.interval_
value here, we won't need to set the interval again and again in all the files for each of the metrics.
For Index.es
too, I wanted to set the url for each of the Index objects that were instantiated later so that we won't have to set the url for each of them again and again.
manuscripts2/report.py
Outdated
self.start_date = datetime(2015, 1, 1) | ||
self.end_date = datetime(2018, 7, 10) | ||
|
||
# self.config = self.__get_config(data_sources=data_sources) | ||
|
||
def get_metric_index(self, data_source): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you include text to define this function? I'm not sure what it is intended to do, exactly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is supposed to return the elasticsearch index for a corresponding data source. As the user can pass their own elasticsearch index names for each of the data sources, this function chooses in between the default and the user inputed es indices and returns the user inputed one if it is available
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Please, add that as the description of the function.
|
||
# AUTHOR METRICS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should have some common code for doing this kind of stuff. Apparently, this is just creating tables, figs, from the metrics, and apparently (given the proper label, filename, etc as arguments) this could be implemented with a function, that could be called for whatever the metrics that has this structure. Am I right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't gotten into the other sections of the report. But yes, I think we can create a function out of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to wait until I got to the other sections of the report so that the function can be generalised better. Is that okay with you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, but please, remember this... Now, the code is becoming too spaghetti...
tests/test_git.py
Outdated
self.git = DATA_SOURCES['git'] | ||
self.git_index = Index(self.git[1]) | ||
|
||
Query.interval_ = "month" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the kind of behavior I was commenting with respect to initializing properties in a class. Since you're not calling report here, you need to do this initialization. But a casual user, very unlikely will be aware of this large effect of calling report...
tests/test_git.py
Outdated
self.assertEquals(last, TREND_LAST) | ||
self.assertEquals(trend_percentage, TREND_PRECENTAGE) | ||
|
||
authors = overview['author_metrics'][0].timeseries() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're testing that timeseries is properly returning a data frame, that's ok. But if you just want to check that the timeseries is find, you can use the datastructure that does not imply returning a dataframe, and thus is more simple.
BTW, for checking that if you as a dataframe, you get a dataframe, I would write a separate check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update the tests and separate the dataframe and non dataframe tests. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, answer or address the items where you couldn't answer up to now. For the rest, I'm ok with your suggestions, so let's go for that.
bf8d467
to
2ad91df
Compare
@jgbarah I've updated most of the code according to the comments that you made. I still think that setting Please review it once so that I can make PRs for Also, the tests might fail because more commits added into the perceval repository. The second commit in this PR adds a |
2ad91df
to
528f553
Compare
@jgbarah ping. |
The thing is not whether it is simpler, but whether it will come as a surprise for somebody using the objects. Initializing a class property when instantiating an object is a huge side effect... Let's go this way fro now, but we need to revisit this in the future. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm accepting the changes, but we're having some technical debt that we need to revisit in the future... Please, see comments.
This PR adds the functionality to calculate the OVERVIEW section of the report.
git
andgithub_issues
data sources