fix merge conflict in readme
Annie Didier committed Jul 21, 2020
2 parents 5af01cb + 21bfc9f commit 2aca90c
Showing 19 changed files with 548 additions and 239 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -7,4 +7,5 @@ venv
.ipynb_checkpoints
build
author_rank.egg-info
htmlcov
htmlcov
*.graffle
56 changes: 34 additions & 22 deletions README.md
@@ -2,7 +2,7 @@
A modification of PageRank to find the most prestigious authors in a scientific collaboration network.

[![Language](https://img.shields.io/badge/python-3.5%20%7C%203.6%20%7C%203.7%20%7C%203.8-blue)](#)
[![PyPi](https://img.shields.io/badge/pypi-0.0.3-blue.svg)](https://pypi.python.org/pypi/author_rank/0.0.3)
[![PyPi](https://img.shields.io/badge/pypi-0.1.0-blue.svg)](https://pypi.python.org/pypi/author_rank/0.1.0)
[![License](https://img.shields.io/github/license/adidier17/AuthorRank)](https://opensource.org/licenses/MIT)
[![Coverage Status](https://coveralls.io/repos/github/adidier17/AuthorRank/badge.svg?branch=master)](https://coveralls.io/github/adidier17/AuthorRank?branch=master)
[![Build Status](https://api.travis-ci.org/adidier17/AuthorRank.svg?branch=master)](https://travis-ci.org/adidier17/AuthorRank)
@@ -120,12 +120,19 @@ documents = [
```

One can retrieve a ranked list of authors and their scores
according to the AuthorRank algortithm:
according to the AuthorRank algorithm:

```python
from author_rank.score import top_authors
# create an AuthorRank object
ar_graph = ar.Graph()

top_authors(documents, normalize_scores=True, n=10)
# fit to the data
ar_graph.fit(
    documents=documents
)

# get the top authors for a set of documents
ar_graph.top_authors(normalize_scores=True, n=10)
```

Setting `normalize_scores` to `True` normalizes the AuthorRank scores
@@ -135,13 +142,18 @@ on a scale of 0 to 1 (inclusive), which may be helpful for interpretation.
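
As a rough illustration of what scaling scores onto the 0 to 1 range can look like, the sketch below applies min-max scaling to a list of raw scores. The function name `min_max_normalize` and the sample scores are hypothetical, and the diff shown here does not confirm that AuthorRank normalizes its scores in exactly this way.

```python
from typing import List


def min_max_normalize(scores: List[float]) -> List[float]:
    """Scale scores linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # all scores are identical, so there is no spread to scale;
        # returning zeros here is an arbitrary choice for this sketch
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


raw_scores = [0.5, 0.25, 0.125]  # made-up values, not real AuthorRank output
normalized_scores = min_max_normalize(raw_scores)
print(normalized_scores)
```

With this kind of scaling the top-ranked author always receives a score of 1 and the lowest-ranked author a score of 0, which is what "inclusive" refers to.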

By default, AuthorRank looks for a list of authors - with each author
represented as a dictionary of keys and values - from each document
in the list of documents passed into `top_authors` or `create` using
in the list of documents passed into `fit` using
the key `authors`, with the keys `first_name` and `last_name` as the
keys used to uniquely identify each author. However, if desired other keys
could be specified and utilized, as in the example below:

```python
top_authors(documents, normalize_scores=True, n=10, authorship_key="creators", keys=set(["given", "family"]))
ar_graph.fit(
    documents=documents,
    authorship_key="creators",
    keys=set(["given", "family"])
)
ar_graph.top_authors(normalize_scores=True, n=10)
```
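
To make the role of `authorship_key` and `keys` more concrete, here is a minimal sketch of how a per-author identifier could be derived from the selected keys. The helper `author_uid` and the sample document are hypothetical. Unlike the `tuple(d.values())` approach in `graph.py`, this version reads the keys in sorted order and does not modify the input dictionaries.

```python
from typing import Dict, Iterable, Tuple


def author_uid(author: Dict[str, str], keys: Iterable[str]) -> Tuple[str, ...]:
    """Build a hashable identifier from the chosen keys, ignoring all others."""
    # sorting the keys makes the identifier independent of dictionary order
    return tuple(author[k] for k in sorted(keys))


# hypothetical document that uses "creators" instead of "authors"
document = {
    "title": "An example paper",
    "creators": [
        {"given": "Jane", "family": "Doe", "affiliation": "Example University"},
        {"family": "Smith", "given": "John"},
    ],
}

uids = [author_uid(a, keys={"given", "family"}) for a in document["creators"]]
print(uids)
```

Any extra fields such as `affiliation` are ignored, so two records for the same person match as long as the values under the chosen keys agree.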

### Exporting the Co-Authorship Graph
@@ -151,25 +163,23 @@ with weights, into a JSON format for use in visualization or additional
analysis:

```python
from author_rank.graph import create, export_to_json

G = create(documents=documents)
export_to_json(G)
import json

export = ar_graph.as_json()
print(json.dumps(export, indent=4))
```

### Progress Bar
Whether using `graph.create` or `scores.top_authors`, the `progress_bar`
When creating the AuthorRank graph, the `progress_bar`
parameter can be used to indicate the progress of applying AuthorRank to
a set of documents. This can be helpful when processing larger corpora
of documents as it provides a rough indication of the remaining time
needed to complete execution.

```python
from author_rank.graph import create
from author_rank.score import top_authors

create(documents=documents)
top_authors(documents, normalize_scores=True, n=10, progress_bar=True)
# fit to the data
ar_graph.fit(
    documents=documents,
    progress_bar=True
)
```

## About
@@ -191,7 +201,8 @@ coauthor together, and status should be diminished as the number of authors in a
increases. Thus, edges are weighted according to frequency of co-authorship and total number
of co-authors on articles according to the diagram shown below.

![Co-AuthorshipGraph](images/co-authorship-graph.png)
![Co-AuthorshipGraph](images/coauthorship_graph_750.png)
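
The weighting described above can be sketched in plain Python. This is a simplified illustration modeled on the logic in `graph.py` (an exclusivity of `1 / (n - 1)` per co-author pair, summed over documents and divided by the number of documents the source author appears on). The function name `coauthorship_weights` and the sample data are hypothetical, and the sketch assumes each author is listed at most once per document.

```python
from collections import Counter, defaultdict
from itertools import permutations
from typing import Dict, List, Tuple


def coauthorship_weights(docs: List[List[str]]) -> Dict[Tuple[str, str], float]:
    """Return normalized directed edge weights between co-authors."""
    # number of documents each author appears on
    counts = Counter(author for authors in docs for author in authors)

    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for authors in docs:
        if len(authors) < 2:
            continue  # a single-author document adds no co-authorship edge
        # each ordered co-author pair on this document gets 1 / (n - 1)
        exclusivity = 1 / (len(authors) - 1)
        for a, b in permutations(authors, 2):
            totals[(a, b)] += exclusivity

    # divide by the number of documents the source author appears on
    return {edge: w / counts[edge[0]] for edge, w in totals.items()}


docs = [["A", "B"], ["A", "B", "C"]]  # made-up author identifiers
weights = coauthorship_weights(docs)
for edge, weight in sorted(weights.items()):
    print(edge, weight)
```

In this example A and B co-author both documents, so the edges between them (0.75) outweigh their edges to C (0.25), while each of C's two outgoing edges has weight 0.5.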


The applicability of this approach is not confined to research
collaborations and this module could be extended into other useful
@@ -208,10 +219,10 @@ any changes to a branch which corresponds to an open issue. Hot fixes
and bug fixes can be represented by branches with the prefix `fix/` versus
`feature/` for new capabilities or code improvements. Pull requests will
then be made from these branches into the repository's `dev` branch
prior to being pulled into `master`. Pull requests which are works in
prior to being pulled into `main`. Pull requests which are works in
progress or ready for merging should be indicated by their respective
prefixes ([WIP] and [MRG]). Pull requests with the [MRG] label will be
reviewed prior to being pulled into the `master` branch.
prefixes (`[WIP]` and `[MRG]`). Pull requests with the `[MRG]` label will be
reviewed prior to being pulled into the `main` branch.

### Running Tests

@@ -224,7 +235,7 @@ python3 -m pytest --cov=author_rank -vv

The tests included within the repository are automatically run on commit
to repository branches and any external pull requests
[using Travis CI](https://api.travis-ci.org/adidier17/AuthorRank.svg?branch=master)
[using Travis CI](https://api.travis-ci.org/adidier17/AuthorRank.svg?branch=master).

## Versioning
[Semantic versioning](http://semver.org/) is used for this project. If contributing, please conform to semantic
@@ -238,5 +249,6 @@ This project is licensed under the MIT license.
1. Xiaoming Liu, Johan Bollen, Michael L. Nelson, Herbert Van de Sompel,
Co-authorship networks in the digital library research community,
Information Processing & Management, Volume 41, Issue 6, 2005,
Pages 1462-1480, ISSN 0306-4573, http://dx.doi.org/10.1016/j.ipm.2005.03.012.
Pages 1462-1480, ISSN 0306-4573, http://dx.doi.org/10.1016/j.ipm.2005.03.012.
[Pre-print PDF](https://arxiv.org/pdf/cs/0502056.pdf).

29 changes: 29 additions & 0 deletions author_rank/__init__.py
@@ -0,0 +1,29 @@
"""
AuthorRank
========
AuthorRank is a Python package that implements a modification of PageRank to
find the most prestigious authors in a scientific collaboration network.
See https://github.com/adidier17/AuthorRank.
"""

import sys
if sys.version_info[:2] < (3, 5):
    m = "Python 3.5 or later is required for NetworkX (%d.%d detected)."
    raise ImportError(m % sys.version_info[:2])
del sys

__author__ = "Valentino Constantinou, Annie Didier"
__version__ = "0.1.0"

import author_rank.graph
from author_rank.graph import *

import author_rank.score
from author_rank.score import *

import author_rank.utils
from author_rank.utils import *

223 changes: 132 additions & 91 deletions author_rank/graph.py
@@ -1,97 +1,138 @@
# imports
from author_rank.score import top_authors as top
from author_rank.utils import emit_progress_bar, check_author_count
from collections import Counter
import itertools
import networkx as nx
from typing import List
from author_rank.utils import emit_progress_bar


def create(documents: List[dict], authorship_key: str = "authors", keys: set = None, progress_bar: bool = False) -> 'nx.classes.digraph.DiGraph':

    """
    Creates a directed graph object from the list of input documents which are represented as dictionaries.
    :param documents: a list of dictionaries which represent documents.
    :param authorship_key: the key in the document which contains a list of dictionaries representing authors.
    :param keys: a set that contains the keys to be used to create a UID for authors.
    :param progress_bar: a boolean that indicates whether or not a progress bar should be emitted, default False.
    :return: a networkx DiGraph object.
    """

    # if keys are not provided, set a default
    # see https://florimond.dev/blog/articles/2018/08/python-mutable-defaults-are-the-source-of-all-evil/
    if keys is None:
        keys = {"first_name", "last_name"}

    # get the authorship from each of the documents
    # gets a list of lists
    doc_authors = [i[authorship_key] for i in documents]

    # remove keys and values that are not used as part of an author UID
    for doc in doc_authors:
        for author in doc:
            unwanted_keys = set(author) - set(keys)
            for unwanted_key in unwanted_keys:
                del author[unwanted_key]

    # create a UID for each author based on the remaining keys
    # each unique combination of key values will serve as keys for each author
    flattened_list = list(itertools.chain.from_iterable(doc_authors))
    author_uid_tuples = [tuple(d.values()) for d in flattened_list]

    # get overall counts of each author
    counts = Counter(author_uid_tuples)

    # create lists for the edges
    edges_all = list()

    # process each document and create the edges with the appropriate weights
    progress = "="
    for doc in range(0, len(doc_authors)):
        if len(doc_authors[doc]) > 1:
            author_ids = [tuple(d.values()) for d in flattened_list]
            pairs = (list(itertools.permutations(author_ids, 2)))
            # calculate g_i_j_k
            exclusivity = 1 / (len(doc_authors[doc]) - 1)
            edges_all.extend([{"edge": (x[0], x[1]), "weight": exclusivity} for x in pairs])
from typing import List, Tuple
import warnings


class Graph:

    def __init__(self):
        self.graph = nx.DiGraph()
        self._is_fit = False

    def fit(self, documents: List[dict], authorship_key: str = "authors",
            keys: set = None, progress_bar: bool = False) -> 'nx.classes.digraph.DiGraph':

        """
        Creates a directed graph object from the list of input documents which
        are represented as dictionaries.
        :param documents: a list of dictionaries which represent documents.
        :param authorship_key: the key in the document which contains a list
        of dictionaries representing authors.
        :param keys: a set that contains the keys to be used to create a UID
        for authors.
        :param progress_bar: a boolean that indicates whether or not a progress
        bar should be emitted, default False.
        :return: a NetworkX DiGraph object.
        """

        # if keys are not provided, set a default
        # see https://florimond.dev/blog/articles/2018/08/python-mutable-defaults-are-the-source-of-all-evil/
        if keys is None:
            keys = {"first_name", "last_name"}

        # get the authorship from each of the documents
        # gets a list of lists
        doc_authors = [i[authorship_key] for i in documents]

        # remove keys and values that are not used as part of an author UID
        for doc in doc_authors:
            for author in doc:
                unwanted_keys = set(author) - set(keys)
                for unwanted_key in unwanted_keys:
                    del author[unwanted_key]

        # create a UID for each author based on the remaining keys
        # unique combination of key values will serve as keys for each author
        flattened_list = list(itertools.chain.from_iterable(doc_authors))
        author_uid_tuples = [tuple(d.values()) for d in flattened_list]
        # ajd_matrix = np.empty(shape=())

        # get overall counts of each author
        counts = Counter(author_uid_tuples)

        acceptable_author_count = check_author_count(counts)
        if acceptable_author_count is False:
            warnings.warn("Number of authors in document set must be greater than one. "
                          "AuthorRank not fit to the data, please try again.", UserWarning)
        else:
            edges_all.extend([{"edge": (doc_authors[doc][0], doc_authors[doc][0]), "weight": 1}])

        if progress_bar:
            progress = emit_progress_bar(progress, doc+1, len(doc_authors))

    # sort the edges for processing
    edges_all_sorted = sorted(edges_all, key=lambda x: str(x["edge"]))
    gb_object = itertools.groupby(edges_all_sorted, key=lambda x: x["edge"])

    # normalize the edge weights and create the directed graph
    normalized = {}
    for k, v in gb_object:
        try:
            v = list(v) # need to reassign
            numerator = sum(d["weight"] for d in list(v))
            denominator = counts[k[0]]
            normalized[k] = numerator / denominator
        except TypeError:
            # this occurs when an author is compared to one-self, which is not a valid scenario for the graph
            pass

    # create the directed graph
    edge_list = [(k[0], k[1], v) for k, v in normalized.items()]
    G = nx.DiGraph()
    G.add_weighted_edges_from(edge_list)

    return G


def export_to_json(graph: 'nx.classes.digraph.DiGraph'):

    """
    Returns the directed graph in JSON format, containing information
    about nodes and their relationships to one another in the form of edges.
    A wrapper around the NetworkX functionality.
    :param graph: a networkx.DiGraph object
    :return: a JSON format for the provided graph
    """

    return nx.readwrite.json_graph.node_link_data(graph)
            # create lists for the edges
            edges_all = list()

            # process each document, create the edges with the appropriate weights
            progress = "="
            for doc in range(0, len(doc_authors)):
                if len(doc_authors[doc]) > 1:
                    author_ids = [tuple(d.values()) for d in doc_authors[doc]]
                    pairs = (list(itertools.permutations(author_ids, 2)))
                    # calculate g_i_j_k
                    exclusivity = 1 / (len(doc_authors[doc]) - 1)
                    edges_all.extend([{"edge": (x[0], x[1]), "weight": exclusivity} for x in pairs])
                else:
                    edges_all.extend([{"edge": (doc_authors[doc][0], doc_authors[doc][0]), "weight": 1}])

                if progress_bar:
                    progress = emit_progress_bar(progress, doc+1, len(doc_authors))

            # sort the edges for processing
            edges_all_sorted = sorted(edges_all, key=lambda x: str(x["edge"]))
            gb_object = itertools.groupby(edges_all_sorted, key=lambda x: x["edge"])

            # normalize the edge weights and create the directed graph
            normalized = {}
            for k, v in gb_object:
                try:
                    v = list(v) # need to reassign
                    numerator = sum(d["weight"] for d in list(v))
                    denominator = counts[k[0]]
                    normalized[k] = numerator / denominator
                except TypeError:
                    # this occurs when an author is compared to one-self, which is
                    # not a valid scenario for the graph
                    pass

            # create the directed graph
            edge_list = [(k[0], k[1], v) for k, v in normalized.items()]
            self.graph.add_weighted_edges_from(edge_list)

            self._is_fit = True

        return self.graph

    def top_authors(self, n: int = 10, normalize_scores: bool = False) -> Tuple[List, List]:
        """
        Calculates the top N authors in an AuthorRank graph and returns them
        in sorted order.
        :param n: an integer to specify the maximum number of authors to be
        returned.
        :param normalize_scores: a boolean to indicate whether or not to normalize
        the scores between 0 and 1.
        :return: a tuple which contains two lists, one for authors and the other
        for their scores.
        """

        # check to see if AuthorRank has been fit
        if self._is_fit is False:
            warnings.warn("AuthorRank must first be fit on a set of documents "
                          "prior to calling top_authors.", UserWarning)
            return list(), list()

        else:
            top_authors, top_scores = top(self.graph, n=n, normalize_scores=normalize_scores)

            return top_authors, top_scores

    def as_json(self) -> dict:
        """
        Returns the directed graph in JSON format, containing information
        about nodes and their relationships to one another in the form of edges.
        A wrapper around the NetworkX functionality.
        :return: a JSON format for the provided graph
        """

        return nx.readwrite.json_graph.node_link_data(self.graph)
