New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GSoC][implementions] Provide basic structure for the implementations directory #160
Conversation
This patch adds the basic structure for reference implementations. The structure, at its most basic level, is as follows: Root Class <- Category class <- individual Metric class. Signed-off-by: Aniruddha Karajgi <akarajgi0@gmail.com>
Please have a look at @jgbarah! |
This patch fixes a bug in identifying which commits are merge commits or empty commits. It also adds default arguments to SourceCode init to accomodate for when no exclude_list is provided. Signed-off-by: Aniruddha Karajgi <akarajgi0@gmail.com>
195d9d3
to
30b5ac8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main comment is structural: instead of having code for issues, prs, commit in the same pull request, let's have all the code we need for the first metric, and nothing else. That way, we can focus on a clean implementation of the first metric, including all of its hierarchy of classes.
I find it appropriate to work in other metrics, but for now, if that's needed, it would be be better done in your repo.
So, for this one, I suggest to merge all three current prs (#160, #161 and #162) into a single pr (closing then the other prs), with a reimplementation of CodeChanges, including the root Metric class, Commit (if needed, I'm not sure about it for now, but do as you feel it's better), and CodeChanges. in the same commit, any other auxiliary file that may be needed (util, for example), and the README file describing the new structure.
In any case, I'm adding some specific comments to this pr, that you may adress in that one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, in addition to my other review, have these comments in mind.
""" | ||
Initilizes self.df, the dataframe with one commit per row. | ||
|
||
:param data_list: A list of dictionaries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specify that each item in the list is a Perceval dictionary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
Initilizes self.df, the dataframe with one commit per row. | ||
|
||
:param data_list: A list of dictionaries. | ||
Each element a line from the JSON file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not exact. Items could come from a JSON file, but also directly from Perceval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will make the required changes!
|
||
:param data_list: A list of dictionaries. | ||
Each element a line from the JSON file | ||
:param date_range: A tuple which represents the period of interest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explain that a tuple is (start, end), with any of those having None as a possible value, and the meaning of None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK!
:param data_list: A list of dictionaries. | ||
Each element a line from the JSON file | ||
:param date_range: A tuple which represents the period of interest | ||
:param src_code_obj: An object of SourceCode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand well, SourceCode is a class used to determine what is source code, right? If so, this will be an object of the SourceCode hierarchy. And to make the name more clear, for the class I would use IsSourceCode (maybe), for the parameter something like is_src_code. But we can discuss naming later on, I'm not completely sure about that for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, I'll make the changes you suggest. We'll make changes later on if required.
|
||
super().__init__(data_list) | ||
|
||
clean_data_list = list() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why "clean_data_list" and not just "items", for example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, it is a better name for the list of cleaned commit dictionaries. items
may refer to even the uncleaned list of commits, especially when we move on to the non-pandas version: which would have an unclean_list
as well as clean_list
Is there a particular reason you favor items
?
|
||
clean_data_list = list() | ||
self.since = date_range[0] | ||
self.until = date_range[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(since, until) = date_range
is more clear and more "Pythonic"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops! That should n't be there : will update that.
self.since = date_range[0] | ||
self.until = date_range[1] | ||
|
||
for line in self.raw_df.iterrows(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, you're doing some stuff with rows in the dataframe here, that was initialized in Metric, is that right? If so, I don't see the rationale.
Let me tell you how i would do it.
- In Metric,
__init__
would be like (assuming 'items' is the list of Perceval dictionaries, and just with simplified code):
for item in items:
flat_item = flatten_item(item)
if flat_item:
add(self.df, flat_item)
- In Commit (or maybe Code_Change) class, you would define
flatten_item
, as a function that given an item from Perceval, with a commit, produces a flat item (that, is, an flat dictionary) if conditions apply (eg, if date is as it should be, if it is source_code, etc.), and None if not.
This way, the code for this root class would be much cleaner, with most of the details for filtering items and selecting fields to flatten will be left to child classes.
Would you mind trying it that way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I have done
- Metric.py:
- Gets a list of raw dictionaries from Perceval/json file
- Flattens it using the flatten method in the same class
- Commit.py:
- Based on since, until / sourcecode, excludes commits
- Picks out certain commit attributes to create a final clean df
- CodeChanges.py
- Computes metric / timeseries metric over the df after Commit has worked with it.
In Commit (or maybe Code_Change) class, you would define flatten_item, as a function that given an item from Perceval, with a commit, produces a flat item (that, is, an flat dictionary) if conditions apply (eg, if date is as it should be, if it is source_code, etc.), and None if not.
But isn't the data flattened in the Metric class?
Are you suggesting to use a child class function (flatten_item
defined in Commit.py or CodeChanges.py) in the parent class (Metric.py) and do all the flattening as well as cleaning at once in Metric.py?
Can you please elaborate a little @jgbarah ? Thanks.
@aswanipranjal, if possible, could you give your views on this? Thanks :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Polaris000, sorry, I am a little late in the discussion.
But isn't the data flattened in the Metric class?
Correct. The implementations of the flatten
method will be in the child classes (Commit, Issue).
Are you suggesting to use a child class function (flatten_item defined in Commit.py or CodeChanges.py) in the parent class (Metric.py) and do all the flattening as well as cleaning at once in Metric.py?
Correct. Right now the current implementation which has been merged in the repo, in the Metric class is:
def __init__(self, items):
flat_items = self._flatten_data(items)
self.raw_df = pd.DataFrame(flat_items)
The _flatten_data
function should be implemented in the Commit
class, which is what you've done and is the correct way.
What I think @jgbarah and my concern is that the following code in the __init__
method of the Commit
class:
self.issrccode_obj = issrccode_obj
(self.since, self.until) = date_range
clean_items = list()
for line in items:
commit = self._flatten_data(line)
if commit is not None:
clean_items.append(commit)
self.df = pd.DataFrame(clean_items)
if self.since is None:
self.since = utils.get_date(self.df, 'since')
if self.until is None:
self.until = utils.get_date(self.df, 'until')
will be common to all the other data sources as well (Issue, PullRequest) and hence can be moved to the Metric
class's __init__
method.
This way, we have common code to initialize the classes and specific implementations to flatten
or clean
the data according to the data sources.
Is it more clear now @Polaris000?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @aswanipranjal! My latest idea (#175) involves moving the population of since
and until
to compute_timeseries
method. This way, the __init__method would have nothing out of place. The Metric(root) class is not implemented anyway.
else: | ||
self.until = utils.get_date(self.df, "until") | ||
|
||
def _clean_commit(self, line): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be similar to the flatten_commit
that I mention above.
@jgbarah Thanks for the prompt review! I had a look and left a few comments where I was a little unsure. I'll combine the pull requests and create a new one. I thought they would be easier to review if separated. |
This patch adds the basic structure for the latest reference implementations. Implementations will be added from the next pull request.
Closes #154.