Adding manuscripts2 folder containing:
- new_functions.py file which has the basic classes and functions
- derived_classes.py file which has the derived classes
- Readme.md describing how to use the above 2 files
- sample_metrics.ipynb describing how the classes can be used
- tests for the basic functions inside the Query class

Addresses issues #59 and #62
aswanipranjal committed Jun 26, 2018
1 parent d859cad commit 8821bca
Showing 5 changed files with 1,729 additions and 0 deletions.
230 changes: 230 additions & 0 deletions manuscripts2/Readme.md
@@ -0,0 +1,230 @@
# Readme

This folder is the starting point for the new functions and classes that are to be added to manuscripts. This file gives a brief introduction to them.


### TODO:
- Is popping from dict and then creating the nested aggregation the most efficient way to do this?
- How do you parse 3 level nested aggregation?


### Structure:
The main class being implemented is `Query`. `Query` provides a connection to elasticsearch, queries it and processes the response.

The important variables inside the `Query` objects are as follows:

- `search`: This is the elasticsearch_dsl Search object which is responsible for querying elasticsearch and getting the results. All the aggregations, queries and filters are applied to this Search object, which then queries elasticsearch and fetches the required results.
- `parent_agg_counter`: This is a counter counting the aggregations that are applied to the Search object. It starts with `0` and increments as aggregations are added.
- `queries`: This is a dictionary containing two lists: `must` and `must_not`. It holds the normal (`must`) and inverse (`must_not`) queries.
- `aggregations`: This is an OrderedDict which contains all the aggregations applied. An ordered dict allows us to create nested aggregations easily, as we'll see below.
- `child_agg_counter_dict`: This dict keeps track of the number of child aggregations that have been applied inside each aggregation.

The rest of the variables are self-explanatory.

`PullRequests` and `Issues` are subclasses derived from the `Query` class. They start with the initial queries `"pull_requests":"true"` and `"pull_requests":"false"` respectively. They will get class-specific functions in the future, as the definitions of the metrics become clear.


### Usage

##### EXAMPLE 1: Basic usage

The idea is that the user can use the chainability of functions to create nested aggs and add queries seamlessly.

```python
from new_functions import Query, Index
github_index = Index(index="<github_index_name>", url="<elasticsearch address>")
sample = Query(github_index)
sample.add_query({"name1":"value1"}) # appends query into the "queries" dict inside the "must" list
sample.add_inverse_query({"name2":"value2"}) # appends an inverse query to the "queries" dict inside the "must_not"
```
The `queries` dict looks something like this:

```python
{"must":[Match(name1=value1)], "must_not":[Match(name2=value2)]}
```
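
As a rough illustration of what these two lists amount to, the sketch below builds the equivalent raw Elasticsearch `bool` query body from plain dicts. The helper `build_bool_query` is hypothetical and is not part of `new_functions.py`; it only assumes that each entry is a `{field: value}` pair that becomes a `match` clause.

```python
import json


def build_bool_query(queries):
    """Turn {"must": [...], "must_not": [...]} into a raw bool query body.

    Hypothetical helper: each entry is assumed to be a {field: value} pair.
    """
    return {
        "query": {
            "bool": {
                clause: [{"match": pair} for pair in pairs]
                for clause, pairs in queries.items()
                if pairs  # skip empty clauses
            }
        }
    }


queries = {"must": [{"name1": "value1"}], "must_not": [{"name2": "value2"}]}
body = build_bool_query(queries)
print(json.dumps(body, indent=2))
```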

---

##### EXAMPLE 2: Basic aggregations - getting the number of authors who participated in the project

```python
from new_functions import Query, Index
github_index = Index(index="<github_index_name>", url="<elasticsearch address>")
github = Query(github_index)
github.get_cardinality("author_uuid")
github.get_aggs()
```

**Steps:**

- Create an `Index` object containing the elasticsearch connection information
- Create a `Query` object using the `Index` object created
- Add a cardinality aggregation on `author_uuid` to the aggregations dict inside the `github` object
- Get the single valued aggregation (cardinality or number of author_uuids)

**Points to Note:**

- Aggregations similar to `get_cardinality`:
  - Numeric fields:
    - `get_sum`: get the sum of all the values of the given field
    - `get_average`: get the average of all the values of the given field
    - `get_percentile`: get the percentile of all the values of the given field
    - `get_min`: get the minimum of all the values of the given field
    - `get_max`: get the maximum of all the values of the given field
    - `get_extended_stats`: get the extended statistics (variance, standard deviation, and so on) for the values of the given field
  - Non-numeric fields:
    - `get_terms`: get the terms aggregation for the given field

  **NOTE:** the `get_aggs()` function returns only numeric values, so in the case of a `get_terms` aggregation it returns the total count of the aggregation. It is better to use the `fetch_aggregation_results` function to get the individual terms instead.

- There is also an `add_custom_aggregation` method, which takes an `elasticsearch_dsl Aggregation` object as its input and adds it to the `aggregations` dict of the object (PullRequests, Issues, Query).
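
To make the single-valued case concrete, here is a minimal sketch of the raw request body for a cardinality aggregation and of how one numeric value can be read back from a response. It uses plain dicts rather than the `Query` class; the helper names are hypothetical, the aggregation name `"0"` and `precision_threshold=3000` mirror the values that appear elsewhere in this Readme, and `sample_response` is made up for the illustration.

```python
def cardinality_body(field, precision_threshold=3000, agg_id=0):
    """Raw search body asking only for a cardinality aggregation on `field`."""
    return {
        "size": 0,  # no hits needed, only the aggregation
        "aggs": {
            str(agg_id): {
                "cardinality": {
                    "field": field,
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


def single_value(response, agg_id=0):
    """Read the value of a single-valued metric aggregation from a response."""
    return response["aggregations"][str(agg_id)]["value"]


body = cardinality_body("author_uuid")

# Made-up response with the shape Elasticsearch uses for metric aggregations.
sample_response = {"aggregations": {"0": {"value": 42}}}
print(single_value(sample_response))
```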

---

##### EXAMPLE 3: Get all the closed issues by authors.

```python
from new_functions import Index
from derived_classes import Issues
github_index = Index(index="<github_index_name>", url="<elasticsearch address>")
issues = Issues(github_index)
issues.is_closed() # this adds the filter {"state":"closed"}, so only closed issues are considered
issues.get_cardinality("id_in_repo").by_authors("author_name")
issues.fetch_aggregation_results()
```

**Steps:**

- Create the index object
- Create an `Issues` object with `github_index` as one of its parameters
- Apply the `is_closed` filter to look at closed issues
- Apply the aggregation to get the cardinality (number of issues), then apply the `by_authors` aggregation, which becomes the parent aggregation of the `cardinality` aggregation. This step pops the last added aggregation from the aggregations dict (here the `cardinality` agg) and adds it as a child agg of a `terms` aggregation whose field is `author_name`.
- Call the `fetch_aggregation_results` function to get the number of closed issues by author.

**NOTE:** `fetch_aggregation_results` loops through all the aggregations that have been added to the object (here: `issues`) and adds them to the Search object in the order in which they were added. It then queries elasticsearch using the `Search().execute()` method and returns a dict of the values that it gets back.
The response from elasticsearch is a dictionary with `aggregations` as one of its keys. Its value (a dict itself) has `'0'` as a key, whose value contains the total number of unique authors in the repo who created an issue/PR.

**Points to Note:**

- Aggregations similar to `by_authors` are:
  - `by_organizations`: similar to `by_authors`; it is used to segregate results based on the organizations that the users belong to
  - `by_period`: creates a `date_histogram` aggregation

---

##### EXAMPLE 4: Moar chainability

```python
from new_functions import Index
from derived_classes import PullRequests
github_index = Index(index="<github_index_name>", url="<elasticsearch address>")
prs = PullRequests(github_index)
prs.is_closed() # this adds the filter {"state":"closed"}, so only closed pull requests are considered
prs.get_cardinality("id_in_repo")
prs.get_cardinality("id").by_authors("author_name").by_organizations()
response = prs.fetch_aggregation_results()
```

This returns a dictionary containing the response from elasticsearch.

Line 7 of the snippet takes the cardinality of `id` rather than `id_in_repo` for a reason: if we took the cardinality of `id_in_repo` again, the first cardinality aggregation would be overwritten, because aggregations are stored in the dict as `<aggregation_name>: <aggregation>` key-value pairs.
We could use a list instead of an ordered dict, but that would hinder the functionality described in EXAMPLE 5.
The dict can be changed to a list if it is decided that this functionality is not needed.

Alternatively, we can just use the `get_aggs()` function such as:

```python
number_of_closed_prs = prs.get_cardinality("id_in_repo").get_aggs()
```
This gives us the number of closed PRs and clears the aggregations dict for new aggregations.

What _is_ in the aggregations dict, though?
```python
prs.aggregations

OrderedDict([
    ('cardinality_id_in_repo', Cardinality(field='id_in_repo', precision_threshold=3000)),
    ('terms_author_org_name',
     Terms(aggs={0:
                 Terms(aggs={0:
                             Cardinality(field='id', precision_threshold=3000)},
                       field='author_name', missing='others', size=10000)},
           field='author_org_name', missing='others', size=10000)
    )
])
```
As we can see, it has 2 aggregations. The first is the plain `cardinality` agg on `id_in_repo`. The second is a `terms` agg with `field=author_org_name` whose child is a `terms` aggregation with `field=author_name`, which in turn has a cardinality agg with `field=id`. Each of `by_authors`, `by_organizations` and `by_period` pops the last aggregation from the dict and adds it as a child of the aggregation it creates.
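
One possible way to read such a multi-level result back, relevant to the TODO about parsing 3 level nested aggregations, is a small recursive walk over the buckets. The sketch below is not part of `new_functions.py`: `flatten_buckets` is a hypothetical helper, and `sample` is a made-up response fragment that only follows the general shape of Elasticsearch bucket aggregations, with child aggregations named `"0"`.

```python
def flatten_buckets(agg, path=()):
    """Yield (bucket_keys, value) for every leaf under a nested bucket aggregation."""
    if "buckets" not in agg:
        # Leaf metric aggregation such as cardinality or sum.
        yield path, agg.get("value")
        return
    for bucket in agg["buckets"]:
        # Child aggregations are the dict-valued entries of a bucket.
        children = [v for v in bucket.values() if isinstance(v, dict)]
        if not children:
            # A bucket without children: fall back to its document count.
            yield path + (bucket["key"],), bucket["doc_count"]
        for child in children:
            yield from flatten_buckets(child, path + (bucket["key"],))


# Made-up fragment: organizations -> authors -> cardinality.
sample = {"buckets": [
    {"key": "org-a", "doc_count": 3, "0": {"buckets": [
        {"key": "alice", "doc_count": 2, "0": {"value": 2}},
        {"key": "bob", "doc_count": 1, "0": {"value": 1}},
    ]}},
    {"key": "others", "doc_count": 1, "0": {"buckets": [
        {"key": "carol", "doc_count": 1, "0": {"value": 1}},
    ]}},
]}

rows = list(flatten_buckets(sample))
for keys, value in rows:
    print(" / ".join(keys), value)
```

Because the function recurses on whatever it finds inside each bucket, the same code handles two, three or more levels of nesting.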

---

##### EXAMPLE 5: Multiple nested aggregations for the same field:

```python
from new_functions import Query, Index
git_index = Index(index="<github_index_name>", url="<elasticsearch address>")
commits = Query(git_index)  # the aggregation methods live on Query, not on Index
commits.get_sum("lines_changed").by_authors()
commits.get_sum("lines_added").by_authors()
commits.get_sum("lines_removed").by_authors()
commits.get_sum("files_changed").by_authors()
response = commits.fetch_aggregation_results()
```
This returns a single aggregation containing the total number of lines changed, added and removed, and the total number of files changed, for each author. The `lines_changed`, `lines_added`, `lines_removed` and `files_changed` sums have the aggregation ids `0`, `1`, `2` and `3` respectively.

```python
commits.aggregations
OrderedDict([
    ('terms_author_uuid',
     Terms(aggs={0: Sum(field='lines_changed'),
                 1: Sum(field='lines_added'),
                 2: Sum(field='lines_removed'),
                 3: Sum(field='files_changed')},
           field='author_uuid',
           missing='others',
           size=10000)
    )
])
```

This allows us to get all the related aggregations in one go.
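
The bookkeeping behind this can be sketched with plain dicts: pop the most recently added aggregation, then attach it to the parent `terms` aggregation under the next free child id. This is only an illustration under assumptions, not the code in `new_functions.py`; `nest_under_terms` is a hypothetical helper and `child_counters` stands in for `child_agg_counter_dict`.

```python
from collections import OrderedDict


def nest_under_terms(aggregations, child_counters, field):
    """Pop the last aggregation and attach it as a child of a terms agg on `field`."""
    _, child = aggregations.popitem(last=True)  # most recently added aggregation
    key = "terms_" + field
    # Reuse the parent if it already exists, otherwise create it.
    parent = aggregations.setdefault(
        key,
        {"terms": {"field": field, "missing": "others", "size": 10000}, "aggs": {}},
    )
    child_id = child_counters.get(key, 0)
    parent["aggs"][str(child_id)] = child
    child_counters[key] = child_id + 1
    return aggregations


aggregations = OrderedDict()
child_counters = {}
for field in ("lines_changed", "lines_added", "lines_removed", "files_changed"):
    aggregations["sum_" + field] = {"sum": {"field": field}}
    nest_under_terms(aggregations, child_counters, "author_uuid")

print(list(aggregations))
print(sorted(aggregations["terms_author_uuid"]["aggs"]))
```

Keying the parent by its name is what lets repeated `by_authors()` calls accumulate children under one aggregation, which a plain list would not give for free.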

---

##### EXAMPLE 6: To get all the values from source:

```python
from new_functions import Index
from derived_classes import Issues
github_index = Index(index="<github_index_name>", url="<elasticsearch address>")
issues = Issues(github_index)
issues.is_closed()
closed_issue_age = issues.fetch_results_from_source('time_to_close_days', 'id_in_repo', dataframe=True)
print(closed_issue_age)

id_in_repo time_to_close_days
0 32 0.76
1 50 3.19
2 63 0.24
3 97 2.62
4 77 71.78
5 108 2.54
6 133 0.03
7 257 0.20
8 155 1.95
9 358 0.80
10 369 1.13
11 26 0.01
12 57 2.83
13 80 0.07
... ... ...
```
Apart from aggregations, we can get the actual values for analysis using the `fetch_results_from_source` function.
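
Conceptually, this amounts to picking the requested fields out of the `_source` of every hit. The sketch below shows that idea on a hand-written response fragment; `rows_from_hits` is a hypothetical helper, not the implementation of `fetch_results_from_source`, and the two hits simply reuse the first two rows of the output above.

```python
def rows_from_hits(response, *fields):
    """Collect the requested fields from the _source of every hit."""
    return [
        {field: hit["_source"].get(field) for field in fields}
        for hit in response["hits"]["hits"]
    ]


# Hand-written fragment in the shape of an Elasticsearch search response.
sample_response = {"hits": {"hits": [
    {"_source": {"id_in_repo": 32, "time_to_close_days": 0.76, "state": "closed"}},
    {"_source": {"id_in_repo": 50, "time_to_close_days": 3.19, "state": "closed"}},
]}}

rows = rows_from_hits(sample_response, "time_to_close_days", "id_in_repo")
for row in rows:
    print(row)
```

A list of dicts like this can be handed directly to a dataframe constructor when tabular output is wanted.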


### Tests:

Run tests with the command:
```bash
python3 -m unittest -v
```
