# Challenge 2 - The Google Scholar Package

## Description

As we worked pretty hard on scraping Google Scholar, our new challenge is to prepare a package that anyone could reuse to extract and process data from Google Scholar.

## Reminder about Google Scholar API

Google Scholar provides `Author` webpages, where authors are identified with an **author identifier** (e.g. [yySZFKoAAAAJ](https://scholar.google.fr/citations?user=yySZFKoAAAAJ&hl=en)).

The details of each publication on the Author webpage can be retrieved using the **author publication identifier** (e.g. [yySZFKoAAAAJ:kNdYIx-mwKoC](https://scholar.google.fr/citations?view_op=view_citation&hl=en&citation_for_view=yySZFKoAAAAJ:kNdYIx-mwKoC)).

Citations can be retrieved using a **publication identifier** (e.g. [15885577448857307637](https://scholar.google.fr/scholar?oi=bibs&hl=en&cites=15885577448857307637&as_sdt=5)). This is also referred to as a **cluster identifier** in Google Scholar. Note that the `publication identifier` is different from the `author publication identifier`.
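
The three identifiers map onto three URL patterns. The sketch below rebuilds the example links from this section; the helper names (`author_url`, `publication_url`, `citations_url`) are hypothetical and are not part of the `gscholar` package.

```python
from urllib.parse import urlencode

BASE = "https://scholar.google.fr"  # host used in the example links


def author_url(author_id):
    """Author page, keyed by the author identifier."""
    return f"{BASE}/citations?" + urlencode({"user": author_id, "hl": "en"})


def publication_url(author_id, pub_id):
    """Publication page, keyed by the author publication identifier."""
    query = {"view_op": "view_citation", "hl": "en",
             "citation_for_view": f"{author_id}:{pub_id}"}
    # safe=":" keeps the separator between the two halves of the identifier
    return f"{BASE}/citations?" + urlencode(query, safe=":")


def citations_url(cluster_id):
    """Citing papers, keyed by the publication (cluster) identifier."""
    query = {"oi": "bibs", "hl": "en", "cites": cluster_id, "as_sdt": "5"}
    return f"{BASE}/scholar?" + urlencode(query)


print(author_url("yySZFKoAAAAJ"))
print(publication_url("yySZFKoAAAAJ", "kNdYIx-mwKoC"))
print(citations_url("15885577448857307637"))
```

Note how the author publication identifier is the author identifier and a per-author publication key joined by a colon, whereas the cluster identifier stands on its own.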


## Submission rules

Submit a zip file, similar to the one you collected, with your changes.
Your code should not introduce any side effects beyond the expected behaviour.
The package should also pass the `flake8 gscholar/ --max-line-length 140` style checker.

## Package design

This project is a bit more complicated. It is composed of:

- The `scraping.*` objects, used to collect information from Google Scholar. The most important object is `scraping.GoogleScholar`.
- The `GoogleScholarDB` object, used to store all significant information about authors and their publications.
- The `db.GoogleScholarDBBuilder` object that uses `scraping.GoogleScholar` to populate `GoogleScholarDB`.

```
gscholar
├── GoogleScholarDB.py
├── db
│   └── GoogleScholarDBBuilder.py
├── scraping
│   ├── GoogleScholar.py
│   ├── cache
│   │   ├── GoogleScholarCacheFile.py
│   │   ├── GoogleScholarCache.py
│   │   └── GoogleScholarCacheSQLite.py
│   ├── crawler
│   │   └── GoogleScholarCrawler.py
│   └── parser
│       └── GoogleScholarParser.py
└── utils
    ├── GSError.py
    └── logger.py
```

![design](docs/design.png)

## Usage

### scraping.GoogleScholarCache

The `GoogleScholarCache` is an object used to store web pages from Google Scholar. `GoogleScholarCache` is abstract and defines this interface:

```
class GoogleScholarCache:
    def __init__(self):
        pass
    def add_author_page(self, author_id, source):
        pass
    def get_author_page(self, author_id):
        pass
    def add_publication_page(self, author_id, pub_id, source):
        pass
    def get_publication_page(self, author_id, pub_id):
        pass
    def add_citations_page(self, cluster_id, start, source):
        pass
    def get_citations_page(self, cluster_id, start):
        pass
    def add_versions_page(self, cluster_id, start, source):
        pass
    def get_versions_page(self, cluster_id, start):
        pass
    def dump(self):
        pass
    def clear(self):
        pass
    def copy_into(self, cache):
        pass
```

There are two implementations of the cache:

 - GoogleScholarCacheSQLite
 - GoogleScholarCacheFile
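
To make the interface concrete, here is a minimal sketch of a third, purely in-memory implementation covering two of the page types. The class name `InMemoryCache` is hypothetical. Returning `None` on a cache miss and reading `start` as a pagination offset are both assumptions; the real implementations may behave differently.

```python
class InMemoryCache:
    """Hypothetical dict-backed cache following the add/get interface."""

    def __init__(self):
        self._author_pages = {}     # author_id -> page source
        self._citations_pages = {}  # (cluster_id, start) -> page source

    def add_author_page(self, author_id, source):
        self._author_pages[author_id] = source

    def get_author_page(self, author_id):
        # Assumption: a cache miss is reported as None.
        return self._author_pages.get(author_id)

    def add_citations_page(self, cluster_id, start, source):
        # Assumption: `start` is the offset of a paginated result page.
        self._citations_pages[(cluster_id, start)] = source

    def get_citations_page(self, cluster_id, start):
        return self._citations_pages.get((cluster_id, start))


cache = InMemoryCache()
cache.add_author_page("aid", "page_content")
cache.add_citations_page("cid", 0, "first page of citations")
cache.add_citations_page("cid", 10, "second page of citations")

print(cache.get_author_page("aid"))
print(cache.get_citations_page("cid", 10))
print(cache.get_author_page("unknown"))
```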

#### Example of GoogleScholarCacheSQLite

This object stores web pages in an SQLite database. First, we need to decide where to store this database.

In [None]:
import os
import tempfile

def get_tmp_filename():
    # tempfile gives us a unique temporary file name
    fd = tempfile.NamedTemporaryFile(delete=False)
    fd.close()
    db_file = fd.name

    # we remove the file itself so that only the (unused) name is kept
    if os.path.exists(db_file):
        os.remove(db_file)

    return db_file


def get_tmp_dirname():
    # mkdtemp creates a unique directory that persists until we delete it
    # (a TemporaryDirectory object would remove it once garbage collected)
    return tempfile.mkdtemp()

In [None]:
from gscholar.scraping.cache.GoogleScholarCacheSQLite import GoogleScholarCacheSQLite

# we print the temporary db filename
db_filename = get_tmp_filename()
print("The random filename we get is", db_filename)

# Example of store and retrieve for author page.
cache = GoogleScholarCacheSQLite(db_filename)

# The cache provides add/get functions for each type of web page.
# For example, author pages:
cache.add_author_page("aid", "page_content")
print(cache.get_author_page("aid"))
cache.dump()

### scraping.GoogleScholar

The `GoogleScholar` object is the main interface to collect data from Google Scholar. It uses a cache (i.e. `GoogleScholarCache`), a crawler, and a parser. The crawler and parser are mostly a structured version of the previous lab, so we do not need to cover them here.

In [None]:
from gscholar.scraping.GoogleScholar import GoogleScholar
from gscholar.scraping.cache.GoogleScholarCacheFile import GoogleScholarCacheFile
temp_dir = get_tmp_dirname()
print("The random directory we get is", temp_dir)
gs = GoogleScholar(cache=GoogleScholarCacheFile(temp_dir))

In [None]:
# Google Scholar provides `Author` webpages, where authors are identified with an author identifier
# (e.g. [yySZFKoAAAAJ](https://scholar.google.fr/citations?user=yySZFKoAAAAJ&hl=en)).

# /!\ Warning: when we are not using a pre-populated cache,
#              this function will only work if you have a working Selenium setup.
a_id = "yySZFKoAAAAJ"
author_details = gs.get_author_details(a_id)
print("Name:", author_details["name"])

The return value of `get_author_details` is a dictionary.

In [None]:
from pprint import pprint 
pprint(author_details)

Additionally, we can get the list of publications for this author:

In [None]:
a_id = "yySZFKoAAAAJ"
publication_ids = gs.get_author_publications(a_id)
print("Total publications:", len(publication_ids))
print("First publication_id:", publication_ids.iloc[0])

In [None]:
# The return value is a list-like collection of author publication identifiers
print(publication_ids)

In [None]:
# The details of each publication on the Author webpage can be retrieved using 
# the author publication identifier (e.g. [yySZFKoAAAAJ:kNdYIx-mwKoC]).

publication_details = gs.get_publication_details("yySZFKoAAAAJ", "kNdYIx-mwKoC")
print("Publication title:", publication_details["title"])
print("Publication date:", publication_details["date"])
print("Publication citation clusters:", publication_details["clusters"])

The return value of `get_publication_details` is a dictionary.

In [None]:
publication_details

We note that this publication has more than one `publication identifier` in its `clusters` section. However, the `cites` section indicates which `publication identifier` to use to retrieve the citations.

In [None]:
citations = gs.get_citations("15885577448857307637")
print("Title of the first one:", citations.loc[0]["title"])
print("Year of the first one:", citations.loc[0]["year"])
print("Cluster of the first one:", citations.loc[0]["clusters"])

The return value of `get_citations` is a pandas DataFrame. Note that the parser is imperfect and, at the current stage, cannot retrieve all the required information. For example, we do not have the publication identifier for every publication.

In [None]:
citations[["title", "year", "clusters", "cites"]]

Once we are done using a `GoogleScholar` object, we should terminate it to make sure the Selenium session is closed.

In [None]:
gs.terminate_crawler()

### GoogleScholarDB

This last object is used after scraping and parsing. It stores processed information instead of raw web pages. Parsing web pages is not necessarily fast, especially when we have to parse thousands of them. The `GoogleScholarDB` object stores the parsing results and lets us query the data efficiently.

In [None]:
from gscholar.GoogleScholarDB import GoogleScholarDB

# we print the temporary db filename
db_filename = get_tmp_filename()
print("The random filename we get is", db_filename)

db = GoogleScholarDB(db_filename)
db.clean()

In [None]:
# add_author takes two arguments: author identifier and author name.
db.add_author("yySZFKoAAAAJ", 'Bruno Bodin')

# add_publication takes four arguments: publication identifier, title, year of publication, and list of authors.
db.add_publication("9894327834646363633", "Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM", 2015, ["author1", "author2"])
db.add_publication("7383276259500311615", 'Visual SLAM algorithms: a survey from 2010 to 2016', 2017, ["author2", "author3"])

# add_authorship takes three arguments: author identifier, author publication identifier, and publication identifier.
db.add_authorship("yySZFKoAAAAJ", "IjCSPb-OGe4C", "9894327834646363633")

# add_citation takes two arguments: the publication identifier of the cited paper and the publication identifier of the citing paper.
db.add_citation("9894327834646363633", cited_by="7383276259500311615")

In [None]:
db.dump()

In [None]:
# If you try to add data that already exists in the database, nothing bad happens
db.add_publication("7383276259500311615", 'Visual SLAM algorithms: a survey from 2010 to 2016', 2017, ["author2", "author3"])

In [None]:
# However, if there is a difference, then an error is raised (find the difference...)
db.add_publication("7383276259500311615", 'visual SLAM algorithms: a survey from 2010 to 2016', 2017, ["author2", "author3"])
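
The "same data is fine, different data is an error" behaviour shown above can be sketched with plain `sqlite3`. The table layout, the standalone `add_publication` helper, and the use of `ValueError` are all assumptions made for illustration; the actual `GoogleScholarDB` schema and error type may differ.

```python
import sqlite3


def add_publication(con, pub_id, title, year):
    """Insert a publication; tolerate exact duplicates, reject conflicts."""
    row = con.execute(
        "SELECT title, year FROM publications WHERE pub_id = ?", (pub_id,)
    ).fetchone()
    if row is None:
        con.execute(
            "INSERT INTO publications (pub_id, title, year) VALUES (?, ?, ?)",
            (pub_id, title, year),
        )
    elif row != (title, year):
        # Stand-in for whatever error the real package raises.
        raise ValueError(f"conflicting data for publication {pub_id}")


con = sqlite3.connect(":memory:")
# Hypothetical schema, for illustration only.
con.execute("CREATE TABLE publications (pub_id TEXT PRIMARY KEY, title TEXT, year INTEGER)")

add_publication(con, "p1", "Visual SLAM algorithms", 2017)
add_publication(con, "p1", "Visual SLAM algorithms", 2017)  # same data: no-op

try:
    add_publication(con, "p1", "Visual SLAM algorithms", 2016)  # different year
    conflict_detected = False
except ValueError:
    conflict_detected = True

print("Conflict detected:", conflict_detected)
```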

## Task descriptions

## Task 1 - `GoogleScholarCache.copy_into()` 

Our first task is to add a new feature to `GoogleScholarCache` objects.
This feature enables copying web pages from one cache to another.
It is already declared in the `GoogleScholarCache` interface as `copy_into(self, cache)`.

However, it has not been implemented yet for `GoogleScholarCacheFile` and `GoogleScholarCacheSQLite`.

For example, this feature will let us merge caches of different types:

In [None]:
from gscholar.scraping.cache.GoogleScholarCacheFile import GoogleScholarCacheFile
from gscholar.scraping.cache.GoogleScholarCacheSQLite import GoogleScholarCacheSQLite

db1_filename = get_tmp_filename()
db2_dirname = get_tmp_dirname()
db3_filename = get_tmp_filename()

cache1 = GoogleScholarCacheSQLite(db1_filename)
cache1.add_author_page("aid1", "page_content1")
cache1.add_publication_page("aid1", "pid1", "page_content1")
cache1.dump()

cache2 = GoogleScholarCacheFile(db2_dirname)
cache2.add_author_page("aid2", "page_content2")
cache2.add_publication_page("aid2", "pid1", "pub_content1")
cache2.add_publication_page("aid2", "pid2", "pub_content2")
cache2.dump()

cache3 = GoogleScholarCacheSQLite(db3_filename)
cache1.copy_into(cache3)
cache2.copy_into(cache3)
cache3.dump()

**Your task is to implement the function `copy_into` in `GoogleScholarCacheFile` and `GoogleScholarCacheSQLite`.**
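
One possible way to think about `copy_into` is that the source enumerates its own pages and pushes each one through the target's public `add_*` methods, which keeps the copy independent of how the target stores its data. The sketch below illustrates that idea on a hypothetical `ToyCache` that only handles author and publication pages. It is not the expected solution, and the real caches also have citations and versions pages to cover.

```python
class ToyCache:
    """Hypothetical cache that only handles author and publication pages."""

    def __init__(self):
        self._authors = {}
        self._publications = {}

    def add_author_page(self, author_id, source):
        self._authors[author_id] = source

    def get_author_page(self, author_id):
        return self._authors.get(author_id)

    def add_publication_page(self, author_id, pub_id, source):
        self._publications[(author_id, pub_id)] = source

    def get_publication_page(self, author_id, pub_id):
        return self._publications.get((author_id, pub_id))

    def copy_into(self, cache):
        # Only the target's public add_* methods are used, so the target
        # may be any object exposing the same interface.
        for author_id, source in self._authors.items():
            cache.add_author_page(author_id, source)
        for (author_id, pub_id), source in self._publications.items():
            cache.add_publication_page(author_id, pub_id, source)


source_a, source_b, merged = ToyCache(), ToyCache(), ToyCache()
source_a.add_author_page("aid1", "page_content1")
source_b.add_author_page("aid2", "page_content2")
source_b.add_publication_page("aid2", "pid1", "pub_content1")

source_a.copy_into(merged)
source_b.copy_into(merged)
print(merged.get_author_page("aid1"), merged.get_publication_page("aid2", "pid1"))
```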

### Points 

 - 4 points for `GoogleScholarCacheSQLite`, 
 - 6 points for `GoogleScholarCacheFile`.

## Task 2 - The `GoogleScholarDBBuilder`

Our second task is to provide an object capable of populating a `GoogleScholarDB` using the `scraping.GoogleScholar` object.

At the moment, we only expect `GoogleScholarDBBuilder` to provide the `fetch_authors(list)` function. Given a list of authors, this function extracts their details from a `scraping.GoogleScholar` object and populates a `GoogleScholarDB` with:

- the list of authors, 
- their authorship information,
- their publications and the publications that cite their work,
- the citations for each of their papers.

We would expect to be able to use `fetch_authors(list)` the following way:

In [None]:
from gscholar.db.GoogleScholarDBBuilder import GoogleScholarDBBuilder
from gscholar.GoogleScholarDB import GoogleScholarDB
from gscholar.scraping.GoogleScholar import GoogleScholar
from gscholar.scraping.cache.GoogleScholarCacheSQLite import GoogleScholarCacheSQLite

db_filename = get_tmp_filename()
test_database = "./tests/test_cache.sqlite"
assert os.path.exists(test_database)
# prepare an empty db
db = GoogleScholarDB(db_filename)
db.clean()

# prepare a Google Scholar API with a cache that is already populated
gs = GoogleScholar(cache=GoogleScholarCacheSQLite(test_database))

# Set the builder with the correct parameters
builder = GoogleScholarDBBuilder(gs, db)

In [None]:
author_list = ["1TUANHcAAAAJ", # Kuba 
               "x2MfRUYAAAAJ", # Tom
               "ky6n3gwAAAAJ"] # Harry
builder.fetch_authors(author_list)

In [None]:
gs.terminate_crawler()
db.dump()

### Points 

 - 2 points if the Authors list is correct.
 - 2 points if the Publications list is correct.
 - 1 point if the Citations list is correct.
 - 1 point if the Authorship list is correct.
 

## Task 3 - The `GoogleScholarDB`

As we are able to populate the GoogleScholarDB automatically, we can now produce some analysis functions. 

 - `get_h_index(author_id)` function, which returns the H-Index of a particular author. Your function must rely solely on SQL.
 - `get_citation_graph()` function, which returns a NetworkX graph object of the citations stored inside the DB.

**Definition of H-Index**: *The H-Index of an author is the largest value H such that the author has published at least H publications with at least H citations each.*
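
As a worked example of the definition, the following plain-Python reference computes the H-Index from a list of per-publication citation counts. The task itself still requires an SQL-only solution; this helper (a hypothetical `h_index` function, not part of the package) is only meant as a way to check results by hand.

```python
def h_index(citation_counts):
    """Largest h such that at least h papers have at least h citations each."""
    h = 0
    # Rank papers from most to least cited; the H-Index is the last rank
    # at which the paper still has at least `rank` citations.
    for rank, count in enumerate(sorted(citation_counts, reverse=True), start=1):
        if count < rank:
            break
        h = rank
    return h


print(h_index([10, 8, 5, 4, 3]))  # 4: four papers have at least 4 citations
print(h_index([25, 8, 5, 3, 3]))  # 3: the fourth paper has only 3 citations
print(h_index([]))                # 0: no publications
```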


### `get_h_index(author_id)`

Your solution for `get_h_index(self, author_id)` should be similar to:

In [None]:
def get_h_index(self, author_id):
    request = f"""
        SELECT ... {author_id} ...
    ;
    """
    cursor = self.con.cursor()
    _h_index = cursor.execute(request).fetchone()[0]
    cursor.close()
    return _h_index
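
The template above builds the request with an f-string. As an alternative, `sqlite3` can bind values through `?` placeholders, which avoids quoting problems when an identifier contains special characters. The sketch below shows the pattern on a hypothetical `authorship` table and a hypothetical `count_publications` helper; it is not the H-Index query itself.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical table, only there to make the query runnable.
con.execute("CREATE TABLE authorship (author_id TEXT, pub_id TEXT)")
con.executemany(
    "INSERT INTO authorship VALUES (?, ?)",
    [("a1", "p1"), ("a1", "p2"), ("a2", "p3")],
)


def count_publications(con, author_id):
    # The "?" placeholder lets sqlite3 bind author_id as a value,
    # so it is never spliced into the SQL text.
    request = "SELECT COUNT(*) FROM authorship WHERE author_id = ?;"
    cursor = con.cursor()
    count = cursor.execute(request, (author_id,)).fetchone()[0]
    cursor.close()
    return count


print(count_publications(con, "a1"))
print(count_publications(con, "a1' OR '1'='1"))
```

The second call returns zero rows because the whole string is treated as a single author identifier.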

The expected usage of `get_h_index` would be:

In [None]:
from gscholar.GoogleScholarDB import GoogleScholarDB
db = GoogleScholarDB(db_filename)
h_index = [(x, db.get_h_index(x)) for x in author_list]
print(f"H-Indexes are {h_index}")

### `get_citation_graph()`

Your solution for `get_citation_graph(focus_authors=None)` should return an object of type `networkx.classes.digraph.DiGraph` that contains `size` and `color` attributes for each node and a `width` attribute for each edge. The graph should visualize citations between papers in the DB, with some way to highlight papers from specific authors.
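
How `size`, `color`, and `width` are chosen is left open. The sketch below shows one possible scheme using plain dictionaries: node size grows with the number of citations received, and papers by a focus author get a distinct color. The function name, the input shapes, and the specific values are all assumptions. The resulting attributes could then be passed to NetworkX with `add_node(pub, **attrs)` and `add_edge(citing, cited, **attrs)`.

```python
def graph_attributes(citations, authorship, focus_authors=None):
    """Return (nodes, edges) dictionaries of drawing attributes.

    citations: list of (cited_pub_id, citing_pub_id) pairs.
    authorship: dict mapping pub_id -> set of author identifiers.
    """
    focus = set(focus_authors or [])
    nodes, edges = {}, {}
    for cited, citing in citations:
        for pub in (cited, citing):
            highlighted = bool(authorship.get(pub, set()) & focus)
            color = "red" if highlighted else "grey"
            nodes.setdefault(pub, {"size": 10, "color": color})
        nodes[cited]["size"] += 10  # grow with each citation received
        edges[(citing, cited)] = {"width": 1}
    return nodes, edges


nodes, edges = graph_attributes(
    citations=[("p1", "p2"), ("p1", "p3")],
    authorship={"p1": {"a1"}, "p2": {"a2"}},
    focus_authors=["a1"],
)
print(nodes)
print(edges)
```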

In [None]:
from gscholar.GoogleScholarDB import GoogleScholarDB
db = GoogleScholarDB(db_filename)
graph = db.get_citation_graph(focus_authors=['1TUANHcAAAAJ', 'x2MfRUYAAAAJ'])

In [None]:
# Drawing with networkX
import networkx as nx
pos = nx.spring_layout(graph, k=0.2)
nx.draw(graph, 
        pos, 
        node_size=[x[1]["size"] for x in graph.nodes(data=True)],
        node_color=[x[1]["color"] for x in graph.nodes(data=True)]
       )

In [None]:
# Drawing with pyvis
from pyvis.network import Network
visgraph = Network(notebook=True)
visgraph.from_nx(graph)
visgraph.show("tmp.html")   

### Points

- 3 points if the `get_h_index()` function is correct.
- 1 point if the `get_citation_graph()` function is correct.