fix merge conflict in readme

adidier17 · Jul 21, 2020 · 2aca90c
2 parents 5af01cb + 21bfc9f
commit 2aca90c
Showing 19 changed files with 548 additions and 239 deletions.
diff --git a/.gitignore b/.gitignore
@@ -7,4 +7,5 @@ venv
 .ipynb_checkpoints
 build
 author_rank.egg-info
-htmlcov
+htmlcov
+*.graffle
diff --git a/README.md b/README.md
@@ -2,7 +2,7 @@
 A modification of PageRank to find the most prestigious authors in a scientific collaboration network.
 
 [![Language](https://img.shields.io/badge/python-3.5%20%7C%203.6%20%7C%203.7%20%7C%203.8-blue)](#)
-[![PyPi](https://img.shields.io/badge/pypi-0.0.3-blue.svg)](https://pypi.python.org/pypi/author_rank/0.0.3)
+[![PyPi](https://img.shields.io/badge/pypi-0.1.0-blue.svg)](https://pypi.python.org/pypi/author_rank/0.1.0)
 [![License](https://img.shields.io/github/license/adidier17/AuthorRank)](https://opensource.org/licenses/MIT)
 [![Coverage Status](https://coveralls.io/repos/github/adidier17/AuthorRank/badge.svg?branch=master)](https://coveralls.io/github/adidier17/AuthorRank?branch=master)
 [![Build Status](https://api.travis-ci.org/adidier17/AuthorRank.svg?branch=master)](https://travis-ci.org/adidier17/AuthorRank)
@@ -120,12 +120,19 @@ documents = [
 ```
 
 One can retrieve a ranked list of authors and their scores 
-according to the AuthorRank algortithm: 
+according to the AuthorRank algorithm: 
 
 ```python
-from author_rank.score import top_authors
+# create an AuthorRank object
+ar_graph = ar.Graph()
 
-top_authors(documents, normalize_scores=True, n=10)
+# fit to the data
+ar_graph.fit(
+    documents=documents
+)
+
+# get the top authors for a set of documents
+ar_graph.top_authors(normalize_scores=True, n=10)
 ```
 
 Setting _normalize_scores_ to `True` normalizes the AuthorRank scores 
@@ -135,13 +142,18 @@ on a scale of 0 to 1 (inclusive), which may be helpful for interpretation.
 
 By default, AuthorRank looks for a list of authors - with each author 
 represented as a dictionary of keys and values - from each document 
-in the list of documents passed into `top_authors` or `create` using 
+in the list of documents passed into `fit` using 
 the key `authors`, with the keys `first_name` and `last_name` as the 
 keys used to uniquely identify each author. However, if desired other keys 
 could be specified and utilized, as in the example below: 
 
 ```python
-top_authors(documents, normalize_scores=True, n=10, authorship_key="creators", keys=set(["given", "family"]))
+ar_graph.fit(
+    documents=documents,
+    authorship_key="creators", 
+    keys=set(["given", "family"])
+)
+ar_graph.top_authors(normalize_scores=True, n=10)
 ```
 
 ### Exporting the Co-Authorship Graph
@@ -151,25 +163,23 @@ with weights, into a JSON format for use in visualization or additional
 analysis:
 
 ```python
-from author_rank.graph import create, export_to_json
-
-G = create(documents=documents)
-export_to_json(G)
+export = ar_graph.as_json()
+print(json.dumps(export, indent=4))
 ```
 
 ### Progress Bar 
-Whether using `graph.create` or `scores.top_authors`, the `progress_bar` 
+When creating the AuthorRank graph, the `progress_bar` 
 parameter can be used to indicate the progress of applying AuthorRank to 
 a set of documents. This can be helpful when processing larger corpora 
 of documents as it provides a rough indication of the remaining time 
 needed to complete execution. 
 
 ```python
-from author_rank.graph import create
-from author_rank.score import top_authors
-
-create(documents=documents)
-top_authors(documents, normalize_scores=True, n=10, progress_bar=True)
+# fit to the data
+ar_graph.fit(
+    documents=documents,
+    progress_bar=True
+)
 ```
 
 ## About
@@ -191,7 +201,8 @@ coauthor together, and status should be diminished as the number of authors in a
 increases. Thus, edges are weighted according to frequency of co-authorship and total number
 of co-authors on articles according to the diagram shown below.
 
-![Co-AuthorshipGraph](images/co-authorship-graph.png)
+![Co-AuthorshipGraph](images/coauthorship_graph_750.png)
+
 
 The applicability of this approach is not confined to research 
 collaborations and this module could be extended into other useful 
@@ -208,10 +219,10 @@ any changes to a branch which corresponds to an open issue. Hot fixes
 and bug fixes can be represented by branches with the prefix `fix/` versus 
 `feature/` for new capabilities or code improvements. Pull requests will 
 then be made from these branches into the repository's `dev` branch 
-prior to being pulled into `master`. Pull requests which are works in 
+prior to being pulled into `main`. Pull requests which are works in 
 progress or ready for merging should be indicated by their respective 
-prefixes ([WIP] and [MRG]). Pull requests with the [MRG] label will be 
-reviewed prior to being pulled into the `master` branch. 
+prefixes (`[WIP]` and `[MRG]`). Pull requests with the `[MRG]` label will be 
+reviewed prior to being pulled into the `main` branch. 
 
 ### Running Tests
 
@@ -224,7 +235,7 @@ python3 -m pytest --cov=author_rank -vv
 
 The tests included within the repository are automatically run on commit 
 to repository branches and any external pull requests 
-[using Travis CI](https://api.travis-ci.org/adidier17/AuthorRank.svg?branch=master)
+[using Travis CI](https://api.travis-ci.org/adidier17/AuthorRank.svg?branch=master). 
 
 ## Versioning
 [Semantic versioning](http://semver.org/) is used for this project. If contributing, please conform to semantic
@@ -238,5 +249,6 @@ This project is licensed under the MIT license.
 1. Xiaoming Liu, Johan Bollen, Michael L. Nelson, Herbert Van de Sompel, 
 Co-authorship networks in the digital library research community, 
 Information Processing & Management, Volume 41, Issue 6, 2005, 
-Pages 1462-1480, ISSN 0306-4573, http://dx.doi.org/10.1016/j.ipm.2005.03.012.
+Pages 1462-1480, ISSN 0306-4573, http://dx.doi.org/10.1016/j.ipm.2005.03.012. 
+[Pre-print PDF](https://arxiv.org/pdf/cs/0502056.pdf).
 
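Review note on `normalize_scores`: the README hunks above say that `normalize_scores=True` places scores on a scale of 0 to 1 (inclusive), but this part of the diff does not show how `author_rank.score` does it. The block below is a minimal sketch of one common way to get that behaviour, min-max scaling. The function name `min_max_normalize` and the handling of the all-equal case are assumptions for illustration, not the library's code.

```python
def min_max_normalize(scores):
    """Rescale a non-empty list of numeric scores to the closed interval [0, 1].

    Hypothetical helper: the actual AuthorRank normalization may differ.
    """
    low, high = min(scores), max(scores)
    if high == low:
        # All scores are equal, so there is no spread to rescale.
        # Returning 1.0 for each is an arbitrary choice made for this sketch.
        return [1.0 for _ in scores]
    return [(s - low) / (high - low) for s in scores]


raw_scores = [0.2, 0.3, 0.5]
normalized_scores = min_max_normalize(raw_scores)
print(normalized_scores)
```

With this scheme the lowest-scoring author maps to 0 and the highest to 1, which matches the "inclusive" wording in the README.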
diff --git a/author_rank/__init__.py b/author_rank/__init__.py
@@ -0,0 +1,29 @@
+"""
+
+AuthorRank
+==========
+
+AuthorRank is a Python package that implements a modification of PageRank to
+find the most prestigious authors in a scientific collaboration network.
+
+See https://github.com/adidier17/AuthorRank.
+"""
+
+import sys
+if sys.version_info[:2] < (3, 5):
+    m = "Python 3.5 or later is required for AuthorRank (%d.%d detected)."
+    raise ImportError(m % sys.version_info[:2])
+del sys
+
+__author__ = "Valentino Constantinou, Annie Didier"
+__version__ = "0.1.0"
+
+import author_rank.graph
+from author_rank.graph import *
+
+import author_rank.score
+from author_rank.score import *
+
+import author_rank.utils
+from author_rank.utils import *
+
diff --git a/author_rank/graph.py b/author_rank/graph.py
@@ -1,97 +1,138 @@
 # imports
+from author_rank.score import top_authors as top
+from author_rank.utils import emit_progress_bar, check_author_count
 from collections import Counter
 import itertools
 import networkx as nx
-from typing import List
-from author_rank.utils import emit_progress_bar
-
-
-def create(documents: List[dict], authorship_key: str = "authors", keys: set = None, progress_bar: bool = False) -> 'nx.classes.digraph.DiGraph':
-
-    """
-    Creates a directed graph object from the list of input documents which are represented as dictionaries.
-    :param documents: a list of dictionaries which represent documents.
-    :param authorship_key: the key in the document which contains a list of dictionaries representing authors.
-    :param keys: a set that contains the keys to be used to create a UID for authors.
-    :param progress_bar: a boolean that indicates whether or not a progress bar should be emitted, default False.
-    :return: a networkx DiGraph object.
-    """
-
-    # if keys are not provided, set a default
-    # see https://florimond.dev/blog/articles/2018/08/python-mutable-defaults-are-the-source-of-all-evil/
-    if keys is None:
-        keys = {"first_name", "last_name"}
-
-    # get the authorship from each of the documents
-    # gets a list of lists
-    doc_authors = [i[authorship_key] for i in documents]
-
-    # remove keys and values that are not used as part of an author UID
-    for doc in doc_authors:
-        for author in doc:
-            unwanted_keys = set(author) - set(keys)
-            for unwanted_key in unwanted_keys:
-                del author[unwanted_key]
-
-    # create a UID for each author based on the remaining keys
-    # each unique combination of key values will serve as keys for each author
-    flattened_list = list(itertools.chain.from_iterable(doc_authors))
-    author_uid_tuples = [tuple(d.values()) for d in flattened_list]
-
-    # get overall counts of each author
-    counts = Counter(author_uid_tuples)
-
-    # create lists for the edges
-    edges_all = list()
-
-    # process each document and create the edges with the appropriate weights
-    progress = "="
-    for doc in range(0, len(doc_authors)):
-        if len(doc_authors[doc]) > 1:
-            author_ids = [tuple(d.values()) for d in flattened_list]
-            pairs = (list(itertools.permutations(author_ids, 2)))
-            # calculate g_i_j_k
-            exclusivity = 1 / (len(doc_authors[doc]) - 1)
-            edges_all.extend([{"edge": (x[0], x[1]), "weight": exclusivity} for x in pairs])
+from typing import List, Tuple
+import warnings
+
+
+class Graph:
+
+    def __init__(self):
+        self.graph = nx.DiGraph()
+        self._is_fit = False
+
+    def fit(self, documents: List[dict], authorship_key: str = "authors",
+            keys: set = None, progress_bar: bool = False) -> 'nx.classes.digraph.DiGraph':
+
+        """
+        Creates a directed graph object from the list of input documents which
+        are represented as dictionaries.
+        :param documents: a list of dictionaries which represent documents.
+        :param authorship_key: the key in the document which contains a list
+        of dictionaries representing authors.
+        :param keys: a set that contains the keys to be used to create a UID
+        for authors.
+        :param progress_bar: a boolean that indicates whether or not a progress
+        bar should be emitted, default False.
+        :return: a NetworkX DiGraph object.
+        """
+
+        # if keys are not provided, set a default
+        # see https://florimond.dev/blog/articles/2018/08/python-mutable-defaults-are-the-source-of-all-evil/
+        if keys is None:
+            keys = {"first_name", "last_name"}
+
+        # get the authorship from each of the documents
+        # gets a list of lists
+        doc_authors = [i[authorship_key] for i in documents]
+
+        # remove keys and values that are not used as part of an author UID
+        for doc in doc_authors:
+            for author in doc:
+                unwanted_keys = set(author) - set(keys)
+                for unwanted_key in unwanted_keys:
+                    del author[unwanted_key]
+
+        # create a UID for each author based on the remaining keys
+        # unique combination of key values will serve as keys for each author
+        flattened_list = list(itertools.chain.from_iterable(doc_authors))
+        author_uid_tuples = [tuple(d.values()) for d in flattened_list]
+        # ajd_matrix = np.empty(shape=())
+
+        # get overall counts of each author
+        counts = Counter(author_uid_tuples)
+
+        acceptable_author_count = check_author_count(counts)
+        if acceptable_author_count is False:
+            warnings.warn("Number of authors in document set must be greater than one. "
+                          "AuthorRank not fit to the data, please try again.", UserWarning)
         else:
-            edges_all.extend([{"edge": (doc_authors[doc][0], doc_authors[doc][0]), "weight": 1}])
-
-        if progress_bar:
-            progress = emit_progress_bar(progress, doc+1, len(doc_authors))
-
-    # sort the edges for processing
-    edges_all_sorted = sorted(edges_all, key=lambda x: str(x["edge"]))
-    gb_object = itertools.groupby(edges_all_sorted, key=lambda x: x["edge"])
-
-    # normalize the edge weights and create the directed graph
-    normalized = {}
-    for k, v in gb_object:
-        try:
-            v = list(v) # need to reassign
-            numerator = sum(d["weight"] for d in list(v))
-            denominator = counts[k[0]]
-            normalized[k] = numerator / denominator
-        except TypeError:
-            # this occurs when an author is compared to one-self, which is not a valid scenario for the graph
-            pass
-
-    # create the directed graph
-    edge_list = [(k[0], k[1], v) for k, v in normalized.items()]
-    G = nx.DiGraph()
-    G.add_weighted_edges_from(edge_list)
-
-    return G
-
-
-def export_to_json(graph: 'nx.classes.digraph.DiGraph'):
-
-    """
-    Returns the directed graph in JSON format, containing information
-    about nodes and their relationships to one another in the form of edges.
-    A wrapper around the NetworkX functionality.
-    :param graph: a networkx.DiGraph object
-    :return: a JSON format for the provided graph
-    """
-
-    return nx.readwrite.json_graph.node_link_data(graph)
+            # create lists for the edges
+            edges_all = list()
+
+            # process each document, create the edges with the appropriate weights
+            progress = "="
+            for doc in range(0, len(doc_authors)):
+                if len(doc_authors[doc]) > 1:
+                    author_ids = [tuple(d.values()) for d in doc_authors[doc]]
+                    pairs = (list(itertools.permutations(author_ids, 2)))
+                    # calculate g_i_j_k
+                    exclusivity = 1 / (len(doc_authors[doc]) - 1)
+                    edges_all.extend([{"edge": (x[0], x[1]), "weight": exclusivity} for x in pairs])
+                else:
+                    edges_all.extend([{"edge": (doc_authors[doc][0], doc_authors[doc][0]), "weight": 1}])
+
+                if progress_bar:
+                    progress = emit_progress_bar(progress, doc+1, len(doc_authors))
+
+            # sort the edges for processing
+            edges_all_sorted = sorted(edges_all, key=lambda x: str(x["edge"]))
+            gb_object = itertools.groupby(edges_all_sorted, key=lambda x: x["edge"])
+
+            # normalize the edge weights and create the directed graph
+            normalized = {}
+            for k, v in gb_object:
+                try:
+                    v = list(v) # need to reassign
+                    numerator = sum(d["weight"] for d in list(v))
+                    denominator = counts[k[0]]
+                    normalized[k] = numerator / denominator
+                except TypeError:
+                    # this occurs when an author is compared to one-self, which is
+                    # not a valid scenario for the graph
+                    pass
+
+            # create the directed graph
+            edge_list = [(k[0], k[1], v) for k, v in normalized.items()]
+            self.graph.add_weighted_edges_from(edge_list)
+
+            self._is_fit = True
+
+        return self.graph
+
+    def top_authors(self, n: int = 10, normalize_scores: bool = False) -> Tuple[List, List]:
+        """
+        Calculates the top N authors in an AuthorRank graph and returns them
+        in sorted order.
+        :param n: an integer to specify the maximum number of authors to be
+        returned.
+        :param normalize_scores: a boolean to indicate whether or not to normalize
+        the scores between 0 and 1.
+        :return: a tuple which contains two lists, one for authors and the other
+        for their scores.
+        """
+
+        # check to see if AuthorRank has been fit
+        if self._is_fit is False:
+            warnings.warn("AuthorRank must first be fit on a set of documents "
+                          "prior to calling top_authors.", UserWarning)
+            return list(), list()
+
+        else:
+            top_authors, top_scores = top(self.graph, n=n, normalize_scores=normalize_scores)
+
+            return top_authors, top_scores
+
+    def as_json(self) -> dict:
+        """
+        Returns the directed graph in JSON format, containing information
+        about nodes and their relationships to one another in the form of edges.
+        A wrapper around the NetworkX functionality.
+        :return: a JSON format for the provided graph
+        """
+
+        return nx.readwrite.json_graph.node_link_data(self.graph)
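Review note on the edge weights in `Graph.fit`: each document with `n > 1` authors contributes an exclusivity of `1 / (n - 1)` to every ordered author pair, and the summed value is then divided by the source author's count. The block below is a standard-library sketch of that arithmetic, assuming each document is given as a tuple of already-hashable author IDs. The name `coauthorship_weights` is hypothetical, and single-author documents are skipped here, which appears to match the effect of the `except TypeError` branch in the diff.

```python
from collections import Counter, defaultdict
from itertools import permutations


def coauthorship_weights(documents):
    """Return {(source, target): weight} for directed co-authorship edges.

    Hypothetical helper: `documents` is an iterable of tuples of author IDs,
    with each author listed at most once per document.
    """
    # Count how many times each author appears across all documents.
    counts = Counter(author for authors in documents for author in authors)

    # Sum the exclusivity 1 / (n - 1) over every document a pair shares.
    summed = defaultdict(float)
    for authors in documents:
        if len(authors) < 2:
            continue  # single-author documents contribute no edges in this sketch
        exclusivity = 1 / (len(authors) - 1)
        for source, target in permutations(authors, 2):
            summed[(source, target)] += exclusivity

    # Normalize each edge by the source author's overall count.
    return {edge: total / counts[edge[0]] for edge, total in summed.items()}


docs = [
    ("A", "B"),
    ("A", "B", "C"),
    ("C",),
]
weights = coauthorship_weights(docs)
for edge, weight in sorted(weights.items()):
    print(edge, weight)
```

In this toy input, author A's outgoing weights sum to 1, while author C's sum to less than 1 because C's solo document raises C's count without adding an edge.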