Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GSoC][implementions] Provide basic structure for the implementations directory #160

Closed
wants to merge 2 commits into from

Conversation

Polaris000
Copy link
Contributor

@Polaris000 Polaris000 commented Jun 10, 2019

This patch adds the basic structure for the latest reference implementations. Implementations will be added from the next pull request.
Closes #154.

This patch adds the basic structure for reference implementations. The structure, at its most basic level, is as follows: Root Class <- Category class <- individual Metric class.

Signed-off-by: Aniruddha Karajgi <akarajgi0@gmail.com>
@Polaris000 Polaris000 changed the title Provide basic structure for the implementations directory [GSoC][implementions] Provide basic structure for the implementations directory Jun 10, 2019
@Polaris000
Copy link
Contributor Author

Please have a look at @jgbarah!

This patch fixes a bug in identifying which commits are merge commits or empty commits. It also adds default arguments to SourceCode init to accomodate for when no exclude_list is provided.

Signed-off-by: Aniruddha Karajgi <akarajgi0@gmail.com>
Copy link
Collaborator

@jgbarah jgbarah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main comment is structural: instead of having code for issues, prs, commit in the same pull request, let's have all the code we need for the first metric, and nothing else. That way, we can focus on a clean implementation of the first metric, including all of its hierarchy of classes.

I find it appropriate to work in other metrics, but for now, if that's needed, it would be be better done in your repo.

So, for this one, I suggest to merge all three current prs (#160, #161 and #162) into a single pr (closing then the other prs), with a reimplementation of CodeChanges, including the root Metric class, Commit (if needed, I'm not sure about it for now, but do as you feel it's better), and CodeChanges. in the same commit, any other auxiliary file that may be needed (util, for example), and the README file describing the new structure.

In any case, I'm adding some specific comments to this pr, that you may adress in that one.

Copy link
Collaborator

@jgbarah jgbarah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, in addition to my other review, have these comments in mind.

"""
Initilizes self.df, the dataframe with one commit per row.

:param data_list: A list of dictionaries.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specify that each item in the list is a Perceval dictionary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

Initilizes self.df, the dataframe with one commit per row.

:param data_list: A list of dictionaries.
Each element a line from the JSON file
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not exact. Items could come from a JSON file, but also directly from Perceval.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will make the required changes!


:param data_list: A list of dictionaries.
Each element a line from the JSON file
:param date_range: A tuple which represents the period of interest
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain that a tuple is (start, end), with any of those having None as a possible value, and the meaning of None.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK!

:param data_list: A list of dictionaries.
Each element a line from the JSON file
:param date_range: A tuple which represents the period of interest
:param src_code_obj: An object of SourceCode.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand well, SourceCode is a class used to determine what is source code, right? If so, this will be an object of the SourceCode hierarchy. And to make the name more clear, for the class I would use IsSourceCode (maybe), for the parameter something like is_src_code. But we can discuss naming later on, I'm not completely sure about that for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I'll make the changes you suggest. We'll make changes later on if required.


super().__init__(data_list)

clean_data_list = list()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why "clean_data_list" and not just "items", for example?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, it is a better name for the list of cleaned commit dictionaries. items may refer to even the uncleaned list of commits, especially when we move on to the non-pandas version: which would have an unclean_list as well as clean_list

Is there a particular reason you favor items?


clean_data_list = list()
self.since = date_range[0]
self.until = date_range[1]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(since, until) = date_range is more clear and more "Pythonic"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! That should n't be there : will update that.

self.since = date_range[0]
self.until = date_range[1]

for line in self.raw_df.iterrows():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, you're doing some stuff with rows in the dataframe here, that was initialized in Metric, is that right? If so, I don't see the rationale.

Let me tell you how i would do it.

  • In Metric, __init__ would be like (assuming 'items' is the list of Perceval dictionaries, and just with simplified code):
for item in items:
    flat_item = flatten_item(item)
    if flat_item:
        add(self.df, flat_item)
  • In Commit (or maybe Code_Change) class, you would define flatten_item, as a function that given an item from Perceval, with a commit, produces a flat item (that, is, an flat dictionary) if conditions apply (eg, if date is as it should be, if it is source_code, etc.), and None if not.

This way, the code for this root class would be much cleaner, with most of the details for filtering items and selecting fields to flatten will be left to child classes.

Would you mind trying it that way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I have done

  • Metric.py:
    • Gets a list of raw dictionaries from Perceval/json file
    • Flattens it using the flatten method in the same class
  • Commit.py:
    • Based on since, until / sourcecode, excludes commits
    • Picks out certain commit attributes to create a final clean df
  • CodeChanges.py
    • Computes metric / timeseries metric over the df after Commit has worked with it.

In Commit (or maybe Code_Change) class, you would define flatten_item, as a function that given an item from Perceval, with a commit, produces a flat item (that, is, an flat dictionary) if conditions apply (eg, if date is as it should be, if it is source_code, etc.), and None if not.

But isn't the data flattened in the Metric class?
Are you suggesting to use a child class function (flatten_item defined in Commit.py or CodeChanges.py) in the parent class (Metric.py) and do all the flattening as well as cleaning at once in Metric.py?

Can you please elaborate a little @jgbarah ? Thanks.
@aswanipranjal, if possible, could you give your views on this? Thanks :)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Polaris000, sorry, I am a little late in the discussion.

But isn't the data flattened in the Metric class?

Correct. The implementations of the flatten method will be in the child classes (Commit, Issue).

Are you suggesting to use a child class function (flatten_item defined in Commit.py or CodeChanges.py) in the parent class (Metric.py) and do all the flattening as well as cleaning at once in Metric.py?

Correct. Right now the current implementation which has been merged in the repo, in the Metric class is:

    def __init__(self, items):
        flat_items = self._flatten_data(items)
        self.raw_df = pd.DataFrame(flat_items)

The _flatten_data function should be implemented in the Commit class, which is what you've done and is the correct way.

What I think @jgbarah and my concern is that the following code in the __init__ method of the Commit class:

self.issrccode_obj = issrccode_obj
(self.since, self.until) = date_range
clean_items = list()
for line in items:
    commit = self._flatten_data(line)
    if commit is not None:
        clean_items.append(commit)
self.df = pd.DataFrame(clean_items)
if self.since is None:
    self.since = utils.get_date(self.df, 'since')
if self.until is None:
        self.until = utils.get_date(self.df, 'until')

will be common to all the other data sources as well (Issue, PullRequest) and hence can be moved to the Metric class's __init__ method.

This way, we have common code to initialize the classes and specific implementations to flatten or clean the data according to the data sources.

Is it more clear now @Polaris000?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aswanipranjal! My latest idea (#175) involves moving the population of since and until to compute_timeseries method. This way, the __init__method would have nothing out of place. The Metric(root) class is not implemented anyway.

else:
self.until = utils.get_date(self.df, "until")

def _clean_commit(self, line):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be similar to the flatten_commit that I mention above.

@Polaris000
Copy link
Contributor Author

@jgbarah Thanks for the prompt review! I had a look and left a few comments where I was a little unsure. I'll combine the pull requests and create a new one. I thought they would be easier to review if separated.

@Polaris000
Copy link
Contributor Author

I have added combined #160 and #161 into #162 and added commits there. Please have a look . Before that, please have a look at my comment here (just above this one). Thanks :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[GSoC] A root class from which all reference implementations inherit functionality
3 participants