Add functionality to calculate the OVERVIEW section of the report using manuscripts2 #80

aswanipranjal · 2018-07-23T08:07:16Z

This PR adds the functionality to calculate the OVERVIEW section of the report.

Adds tests for git and github_issues data sources
Adds data that is to be used to perform the tests

coveralls · 2018-07-23T08:33:34Z

Pull Request Test Coverage Report for Build 198

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+2.08%) to 57.39%

Totals
Change from base Build 193:	2.08%
Covered Lines:	897
Relevant Lines:	1563

💛 - Coveralls

jgbarah

This PR is too complex for a good review. Please split it into three: one for git, another for GitHub issues, and a third for GitHub PRs.

I suggest that you commit the first one, with only the changes for git, as a new version of this PR if you want. That way, I can start reviewing it straight away.

Then, while I review it, produce the other two, with the current code, in separate PRs. Once the one for git is done, you can just rewrite the parts needed in the others, but I can start commenting on them even before you modify them.

Do you find this reasonable?

jgbarah

Some more comments, on the testing framework. I see you're using the old testing utils, which basically build enriched and raw indexes before testing. I find that too complex. In fact, for testing we only need an enriched index. Once it is loaded in ES, all tests could run from it. Let's move to that scheme. In short, you would need:

Some description of how the test enriched indexes are generated, so that we can generate them easily when we want. In fact, having a script for that would be great. This script could be based on the current code in utils.py, or on using p2o with some options.
Some code for uploading the index, which will be in the data directory, to ES before the actual testing.
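
As an illustration of the upload step described above, here is a minimal sketch that turns a list of enriched documents into the newline-delimited body expected by the Elasticsearch `_bulk` endpoint. The function name `build_bulk_body`, the index name, and the sample documents are all hypothetical and not taken from this PR; depending on the Elasticsearch version, the action line may also need a `_type`, and actually sending the body to ES is left out.

```python
import json


def build_bulk_body(docs, index_name, id_field=None):
    """Return an NDJSON string for the Elasticsearch _bulk endpoint.

    Each document becomes two lines: an "index" action line and the
    document source itself. (Hypothetical helper, not part of the PR.)
    """
    lines = []
    for doc in docs:
        action = {"index": {"_index": index_name}}
        # Reuse a field of the document as its _id when requested.
        if id_field is not None and id_field in doc:
            action["index"]["_id"] = doc[id_field]
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk format requires a trailing newline.
    return "\n".join(lines) + "\n"


# Made-up sample data standing in for a frozen enriched index.
sample_docs = [
    {"hash": "a1", "author_uuid": "u1"},
    {"hash": "b2", "author_uuid": "u2"},
]
bulk_body = build_bulk_body(sample_docs, "git_enriched_test", id_field="hash")
```

Keeping the frozen documents in the data directory and generating the bulk body at test setup would mean the tests no longer depend on the live state of any repository.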

jgbarah

Please, have a look at my comments, and try to answer or address them. We can later start from there...

jgbarah · 2018-07-23T11:19:57Z

manuscripts2/metrics/git.py

+        self.id = "commits"
+        self.name = "Commits"
+        self.desc = "Changes to the source code"
+        self.query = self.query.get_cardinality("hash").by_period()


Why not use query.timeseries here?

@jgbarah Do you mean that I should calculate the timeseries data here?

Right now I am just adding all the necessary aggregations into the query object and then I get the timeseries in report.py by calling the timeseries() method on an instance of the Commits class.

I think I overlooked the code then, sorry. Now I realize what you do. OK, let's go this way.
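
The pattern discussed in this exchange (collect the aggregations on the query object first, compute the time series only when asked) can be sketched roughly as follows. `SketchQuery` is a hypothetical stand-in, not the manuscripts2 `Query` class: it borrows the method names `get_cardinality`, `by_period`, and `timeseries` from the thread, but here `timeseries` takes in-memory rows as an argument instead of querying Elasticsearch.

```python
class SketchQuery:
    """Hypothetical stand-in for a chainable query object."""

    def __init__(self):
        self.aggregations = []
        self.period = None

    def get_cardinality(self, field):
        # Only record the aggregation; nothing is computed yet.
        self.aggregations.append(("cardinality", field))
        return self

    def by_period(self, period="month"):
        self.period = period
        return self

    def timeseries(self, rows):
        """Compute distinct counts per period from (label, doc) pairs.

        `rows` replaces the Elasticsearch round trip in this sketch.
        """
        kind, field = self.aggregations[-1]
        if kind != "cardinality":
            raise NotImplementedError(kind)
        buckets = {}
        for label, doc in rows:
            buckets.setdefault(label, set()).add(doc[field])
        return {label: len(values) for label, values in sorted(buckets.items())}


query = SketchQuery().get_cardinality("hash").by_period()
rows = [
    ("2018-06", {"hash": "a"}),
    ("2018-06", {"hash": "a"}),  # duplicate hash, counted once
    ("2018-07", {"hash": "b"}),
    ("2018-07", {"hash": "c"}),
]
series = query.timeseries(rows)
```

Because each builder method returns `self`, the metric class can store the prepared query in its constructor and the report code can decide later which result shape to request.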

jgbarah · 2018-07-23T11:20:26Z

manuscripts2/metrics/git.py

+        self.id = "authors"
+        self.name = "Authors"
+        self.desc = "People authoring commits (changes to source code)"
+        self.query = self.query.get_cardinality("author_uuid").by_period()


Why not use query.timeseries here?

This is the same as Commits.

Same answer... ;-)

jgbarah · 2018-07-23T11:21:22Z

manuscripts2/metrics/git.py

+    """
+
+    results = {
+        "activity_metrics": [Commits(index, start, end)],


Why do we return a list, instead of just Commits(index, start, end)?

I was using a list for the other data sources (github_issues, github_prs), as they have more than one metric in activity_metrics, so I just wanted to be consistent.

I'll change it to what you suggest.

Let's do it this way: since we have one name for each metric, let's just return the data structure for that metric, as data in a dictionary where the name of the metric is the key. If we want to group them later, for presentation, we can do it.

Actually, @jgbarah, it'll be easier for each metric (key) in the dict to have its values in a list, so that we can easily iterate over them and add them together as one section of the report.
This was also done previously in manuscripts, and it will be more convenient.

I don't understand you. For each metric, we should have only one value. If the value is complex (e.g., a time series), we should have a time series. But since maybe I'm missing something, let's go the way you propose. But please, think about it, and try to find out whether what you call "a list" isn't really, for example, several values in a data series.

jgbarah · 2018-07-23T11:25:16Z

manuscripts2/report.py

-        "git": git.GitMetrics,
-        "github_issues": github_issues.IssuesMetrics,
-        "github_prs": github_prs.PullRequestsMetrics,
+        "git": git,


Why aren't you using the constants you defined earlier here? (I mean, GIT_INDEX, etc.)

Or even better, since this seems to be just a transposition of the previous dictionary, why not compute it instead of writing it again "by hand"?

Right, thanks! I'll change it.
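
A minimal sketch of the suggestion above: define the constants once and derive the transposed dictionary with a comprehension instead of retyping it. The constant values and the dictionary names `SOURCE_TO_INDEX` and `INDEX_TO_SOURCE` are assumptions made for illustration; the actual contents of the dictionaries in report.py are not shown in this thread.

```python
# Hypothetical constant values; the real ones are defined in the PR.
GIT_INDEX = "git_enriched"
GITHUB_ISSUES_INDEX = "github_issues_enriched"

# Forward mapping, written once using the constants.
SOURCE_TO_INDEX = {
    "git": GIT_INDEX,
    "github_issues": GITHUB_ISSUES_INDEX,
}

# Transposed mapping, computed rather than written "by hand".
# This assumes the values are unique and hashable.
INDEX_TO_SOURCE = {index: source for source, index in SOURCE_TO_INDEX.items()}
```

With this approach, adding a data source means touching only the forward mapping, so the two dictionaries cannot drift apart.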

jgbarah · 2018-07-23T11:30:18Z

manuscripts2/report.py

    }

    def __init__(self, es_url=None, start=None, end=None, data_dir=None, filters=None,
                 interval="month", offset=None, data_sources=None,
                 report_name=None, projects=False, indices=[], logo=None):

-        Query.interval_ = interval
+        self.interval = Query.interval_ = interval


This does not belong exactly here, but anyway: why Query.interval_ instead of Query.interval?

The interval_ variable of the Query class can be set for all the child classes, so I wanted it to be different from the interval variable of the Report class. Hence the underscore at the end.

jgbarah · 2018-07-23T11:35:53Z

manuscripts2/report.py

    }

    def __init__(self, es_url=None, start=None, end=None, data_dir=None, filters=None,
                 interval="month", offset=None, data_sources=None,
                 report_name=None, projects=False, indices=[], logo=None):

-        Query.interval_ = interval
+        self.interval = Query.interval_ = interval


In general, I don't much like setting properties of a class from code in another class. I think it is much more explicit to instantiate the class when you have all the data needed to instantiate it, and then just pass the property value as a parameter to the instantiation. Is there any problem in doing it that way?

(Similar comment for Index.es, below.)

The idea here is to set the interval for all the child classes at once. All the metrics have a query object created using the Query class. By setting the Query.interval_ value here, we won't need to set the interval again and again in all the files for each of the metrics.

For Index.es too, I wanted to set the url for each of the Index objects that were instantiated later so that we won't have to set the url for each of them again and again.
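
The trade-off debated here can be made concrete with a small sketch. All class names below (`SharedQuery`, `ExplicitQuery`, `SketchReport`) are hypothetical and only mimic the shape of the code under review: one variant sets a class attribute from another class's constructor, the other receives the interval as a constructor parameter.

```python
class SharedQuery:
    # Class attribute: one value shared by every instance.
    interval_ = "month"


class ExplicitQuery:
    def __init__(self, interval="month"):
        # Instance attribute: each object carries its own value.
        self.interval = interval


class SketchReport:
    def __init__(self, interval):
        # Side effect: changes the interval for ALL SharedQuery
        # objects, including ones created before this report.
        SharedQuery.interval_ = interval


existing = SharedQuery()
interval_before = existing.interval_      # "month"
SketchReport(interval="week")
interval_after = existing.interval_       # now "week"

# The explicit variant is unaffected by any report being created.
explicit = ExplicitQuery(interval="quarter")
```

The shared attribute saves repetition, but a caller who never creates a report (a test, for instance) has to know to set it by hand, which is the surprise the reviewer is pointing at.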

jgbarah · 2018-07-23T11:37:24Z

manuscripts2/report.py

+        self.start_date = datetime(2015, 1, 1)
+        self.end_date = datetime(2018, 7, 10)
+
+        # self.config = self.__get_config(data_sources=data_sources)

    def get_metric_index(self, data_source):


Can you include text to define this function? I'm not sure what it is intended to do, exactly.

This function is supposed to return the Elasticsearch index for a given data source. As the user can pass their own Elasticsearch index names for each of the data sources, this function chooses between the default and the user-supplied indices, and returns the user-supplied one if it is available.

I see. Please, add that as the description of the function.
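
As a rough illustration of the requested description, here is a standalone sketch with a docstring that states the behavior explained above. It is not the method from report.py: the real `get_metric_index` is a method on the report class, and the `DEFAULT_INDICES` mapping, the `user_indices` parameter, and the index names below are assumptions made so the example runs on its own.

```python
# Hypothetical defaults; the real default index names live in the PR.
DEFAULT_INDICES = {
    "git": "git_default",
    "github_issues": "github_issues_default",
}


def get_metric_index(data_source, user_indices=None):
    """Return the Elasticsearch index name for a data source.

    The user may supply their own index name for each data source.
    If one was supplied for `data_source`, it is returned; otherwise
    the default index name for that data source is returned.

    :param data_source: name of the data source, e.g. "git"
    :param user_indices: optional dict mapping data source names to
        user-supplied index names (assumed shape)
    :returns: the index name to query
    """
    user_indices = user_indices or {}
    return user_indices.get(data_source) or DEFAULT_INDICES[data_source]


chosen_default = get_metric_index("git")
chosen_custom = get_metric_index("git", {"git": "my_git_index"})
```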

jgbarah · 2018-07-23T11:43:16Z

manuscripts2/report.py


+        # AUTHOR METRICS


I think we should have some common code for doing this kind of stuff. Apparently, this is just creating tables and figures from the metrics, and (given the proper label, filename, etc. as arguments) this could be implemented as a function that could be called for any metrics that have this structure. Am I right?

I haven't gotten into the other sections of the report. But yes, I think we can create a function out of this.

Please, do.

I want to wait until I get to the other sections of the report so that the function can be generalised better. Is that okay with you?

OK, but please, remember this... Now, the code is becoming too spaghetti...

jgbarah · 2018-07-23T11:47:44Z

tests/test_git.py

+        self.git = DATA_SOURCES['git']
+        self.git_index = Index(self.git[1])
+
+        Query.interval_ = "month"


This is the kind of behavior I was commenting on with respect to initializing properties in a class. Since you're not calling report here, you need to do this initialization. But a casual user is very unlikely to be aware of this large side effect of calling report...

jgbarah · 2018-07-23T11:50:12Z

tests/test_git.py

+        self.assertEquals(last, TREND_LAST)
+        self.assertEquals(trend_percentage, TREND_PRECENTAGE)
+
+        authors = overview['author_metrics'][0].timeseries()


If you're testing that timeseries properly returns a data frame, that's OK. But if you just want to check that the timeseries is fine, you can use the data structure that does not imply returning a dataframe, and is thus simpler.

BTW, for checking that you get a dataframe when you ask for a dataframe, I would write a separate check.

I'll update the tests and separate the dataframe and non dataframe tests. Thanks!

Thanks to you.

jgbarah

Please, answer or address the items where you couldn't answer up to now. For the rest, I'm ok with your suggestions, so let's go for that.

aswanipranjal · 2018-07-24T07:49:15Z

@jgbarah I've updated most of the code according to the comments that you made.

I still think that setting Query.interval_ in report.py for all the Query instances that are created is simpler than passing the interval for each Query separately.
If you think it's a really bad idea, then I'll update that too.

Please review it once so that I can make PRs for github_issues and github_prs too.

Also, the tests might fail because more commits were added to the perceval repository. The second commit in this PR adds a tests/utils.py file, which allows us to insert frozen data into Elasticsearch, so I can use that to update the tests for manuscripts2/elasticsearch.py.

aswanipranjal · 2018-07-24T17:11:42Z

@jgbarah ping.

jgbarah · 2018-07-25T17:41:23Z

I still think that setting Query.interval_ in report.py for all the Query instances that are created is simpler than passing the interval for each Query separately.

The question is not whether it is simpler, but whether it will come as a surprise to somebody using the objects. Initializing a class property when instantiating an object is a huge side effect... Let's go this way for now, but we need to revisit this in the future.

jgbarah

I'm accepting the changes, but we're taking on some technical debt that we need to revisit in the future... Please, see comments.

aswanipranjal force-pushed the generate-new-reports branch from 2683be7 to f8cde2e July 23, 2018 08:20

aswanipranjal mentioned this pull request Jul 23, 2018

Update failing tests due to change in grimoirelab-preceval repository #79

Merged

jgbarah requested changes Jul 23, 2018

jgbarah reviewed Jul 23, 2018

aswanipranjal force-pushed the generate-new-reports branch from f8cde2e to 7cf52d9 July 23, 2018 09:50

jgbarah requested changes Jul 23, 2018

jgbarah reviewed Jul 23, 2018

[manuscripts2] Add code to calculate OVERVIEW section for 'git' data …

74aa68d

…source

aswanipranjal force-pushed the generate-new-reports branch from bf8d467 to 2ad91df July 24, 2018 07:38

[manuscripts2] Add tests for OVERVIEW section of 'git' data source

528f553

aswanipranjal force-pushed the generate-new-reports branch from 2ad91df to 528f553 July 24, 2018 07:55

jgbarah approved these changes Jul 25, 2018

jgbarah merged commit 32b34a2 into chaoss:master Jul 25, 2018

aswanipranjal mentioned this pull request Aug 1, 2018

[report] Add functionality to calculate PROJECT ACTIVITY section of the report #92

Merged
