# Independent citation counter

In this notebook, you can calculate the number of independent citations for all of your papers.
An independent citation is one from a paper that shares no author with the cited paper.

### What will the code do?
For each entry in your Google Scholar profile, the code outputs the independent citation count, the total citation count, and a link to the Google Scholar page listing the independent citations.


**Sample output:**

> The impact of cosmic variance on simulating weak lensing surveys
>
> Citations: 9/15
>
> Link:  [http://scholar.google.com/scholar?cites=17631820148925503603&scipsc=1&q=-author:%27A%20Kannawadi%27+-author:%27R%20Mandelbaum%27+-author:%27C%20Lackner%27](http://scholar.google.com/scholar?cites=17631820148925503603&scipsc=1&q=-author:%27A%20Kannawadi%27+-author:%27R%20Mandelbaum%27+-author:%27C%20Lackner%27)

The first line is the title of the paper, which has 9 independent citations and 15 total citations.
The link takes you to the Google Scholar page with the independent citations.
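
The link in the sample output can be reproduced with a few lines of Python. The sketch below is a minimal illustration, not the notebook's own code path: the helper name `exclusion_query` is hypothetical, and the author names are taken in the abbreviated form in which they appear in the sample link.

```python
from urllib.parse import quote


def exclusion_query(author_names):
    """Build one `-author:` exclusion term per author, joined by `+`."""
    terms = []
    for name in author_names:
        parts = name.split()
        if len(parts) < 2:
            # Single-word names (e.g. collaborations) are used as they are.
            terms.append("-author:" + quote(name))
            continue
        # Keep the first initial and the last name, wrapped in single quotes.
        short_name = f"'{parts[0][0]} {parts[-1]}'"
        terms.append("-author:" + quote(short_name))
    return "+".join(terms)


cites_url = "http://scholar.google.com/scholar?cites=17631820148925503603"
authors = ["A Kannawadi", "R Mandelbaum", "C Lackner"]
independent_link = cites_url + "&scipsc=1&q=" + exclusion_query(authors)
print(independent_link)
```

Each `-author:` term asks Google Scholar to drop citing papers written by that author, so the remaining results are the independent citations.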

**Note:**
Even if the program is unable to fetch independent citation counts, it will still output your total citations and provide a link to access your independent citations.


### How to use
In the cell below, replace `qc6CJjYAAAAJ` with your Google Scholar profile ID.
You may also want to specify a proxy type (more details below).
Then, run all cells.

### Troubleshooting
If you see a `MaxTriesExceededException`, it means Google Scholar has detected the automated requests and is blocking them.
Try again later, or use a better proxy.
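
If you prefer to automate the "try again later" step, one option is to retry with increasing delays. The helper below is a hypothetical sketch and is not part of this notebook or of `scholarly`. It catches `Exception` broadly because the exact import path of `MaxTriesExceededException` is not assumed here, and the delay values are arbitrary.

```python
import random
import time


def retry_with_backoff(fetch, max_tries=4, base_delay=30.0, sleep=time.sleep):
    """Call `fetch()` until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_tries):
        try:
            return fetch()
        except Exception as err:
            if attempt == max_tries - 1:
                # Out of attempts: let the caller see the last error.
                raise
            # Exponential backoff with a little jitter to avoid a fixed rhythm.
            delay = base_delay * 2 ** attempt + random.uniform(0, 1)
            print(f"Attempt {attempt + 1} failed ({err!r}); retrying in {delay:.0f} s")
            sleep(delay)
```

For example, `retry_with_backoff(lambda: scholarly.search_author_id(scholar_id, filled=True))` would wrap the profile lookup performed later in this notebook. Long delays will not help if your IP address has already been banned, in which case a proxy is the better route.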

<br>

### Enter your Google Scholar profile ID
*unless you are Albert Einstein.*

For example, if your Google Scholar profile URL is [`https://scholar.google.com/citations?user=qc6CJjYAAAAJ`](https://scholar.google.com/citations?user=qc6CJjYAAAAJ), then your profile ID is `qc6CJjYAAAAJ`.
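
If you would rather paste the whole profile URL than pick out the ID by hand, the ID can be read from the `user` query parameter. This is a small sketch using only the standard library; the function name `profile_id_from_url` is hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def profile_id_from_url(profile_url):
    """Return the value of the `user` query parameter of a profile URL."""
    query = parse_qs(urlparse(profile_url).query)
    if "user" not in query:
        raise ValueError(f"No 'user' parameter found in {profile_url!r}")
    return query["user"][0]


print(profile_id_from_url("https://scholar.google.com/citations?user=qc6CJjYAAAAJ"))
```

The returned string is what you would assign to `scholar_id` in the cell below.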

In [None]:
# The only cell which you are expected to modify.
scholar_id = 'qc6CJjYAAAAJ'

# `proxy_type` must be one of ScraperAPI, Luminati, FreeProxy, SingleProxy or NoProxy.
# NoProxy will give only the links to the independent citations, not the counts.
proxy_type = 'NoProxy'  # Case insensitive

#### More on `proxy_type`

By default, the code provides only the links to the pages listing the independent citations, and does not open those pages to count them.
Google Scholar actively blocks automated requests to its citation database.
Continuous, repeated requests from a single IP address may lead to a ban.
However, if you need the counts, you may be able to circumvent this by using a proxy.
Below are a few options:

- **FreeProxy**: Use continuously changing proxies for free.

    This protects your IP address, but is not very effective at getting past Google Scholar's anti-bot measures. You might want to use other options if you are unable to reach Google Scholar.


- **ScraperAPI** (recommended): [Create a free account](https://www.scraperapi.com/) without providing personal and payment information. The free account supports 5000 requests per month, more than sufficient to run this notebook for most researchers.

- **Luminati** (untested): Similar to ScraperAPI, and is known to get past Google Scholar's anti-bot measures better. No free account is available.

- **SingleProxy**: Use a single proxy for all requests.

- **NoProxy** (default): Using `NoProxy` will not fetch the counts by default. You can still try to fetch the counts (at your own risk) by setting `links_only` below to `False`. Use this sparingly if `FreeProxy` does not work and you don't want to create any accounts. You may also use this safely if you are already connected to a VPN.




Read the [official scholarly documentation](https://scholarly.readthedocs.io/en/latest/quickstart.html#using-proxies) for more details.
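
Because `proxy_type` is case insensitive and must be one of five names, it can help to validate it before any request is made. The following is a hedged sketch: `VALID_PROXY_TYPES` and `normalize_proxy_type` are hypothetical names introduced here for illustration and are not used elsewhere in this notebook.

```python
# The five options listed above, in their canonical spelling.
VALID_PROXY_TYPES = ("ScraperAPI", "Luminati", "FreeProxy", "SingleProxy", "NoProxy")


def normalize_proxy_type(proxy_type):
    """Map any capitalization of a valid option to its canonical spelling."""
    lookup = {name.lower(): name for name in VALID_PROXY_TYPES}
    key = proxy_type.strip().lower()
    if key not in lookup:
        raise ValueError(
            f"Unknown proxy_type {proxy_type!r}; "
            f"expected one of {', '.join(VALID_PROXY_TYPES)}"
        )
    return lookup[key]


print(normalize_proxy_type("scraperapi"))
```

Failing early on a misspelled option avoids silently running with a proxy you did not intend to use.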

In [None]:
links_only = (proxy_type.lower() == 'noproxy')

### Install and import the required packages.

In [None]:
! pip install -q scholarly

In [None]:
try:
    from scholarly import scholarly, ProxyGenerator #, MaxTriesExceededException
except IndexError:
    # Ignore the harmless IndexError occurring from a dependency.
    pass
import random
import time
from getpass import getpass
# The f-strings used below already require Python 3, so no Python 2 fallback is needed.
from urllib.parse import quote

In [None]:
def set_proxy(proxy_type='NoProxy'):
    """Set a proxy to scrape Google Scholar.

    Only `NoProxy`, `FreeProxy` and `ScraperAPI` have been tested.

    Parameters
    ----------
    proxy_type : str, optional
        Type of proxy to use. Case insensitive. Options are:
        `ScraperAPI`, `Luminati`, `FreeProxy`, `SingleProxy` and
        `NoProxy` (default).
    """
    if proxy_type.lower() == 'noproxy':
        print("Using no proxies!")
        return

    pg = ProxyGenerator()
    if proxy_type.lower() == 'scraperapi':
        payload = {'api_key': getpass("Enter your ScraperAPI key:"), }
        pg.ScraperAPI(payload['api_key'])
        print("Using ScraperAPI!")
    elif proxy_type.lower() == 'luminati':
        pg.Luminati(getpass("Enter your Luminati username:"), getpass("Enter your Luminati password:"))
        print("Using Luminati!")
    elif proxy_type.lower() == 'singleproxy':
        proxy_address = getpass("Enter your proxy address:")
        pg.SingleProxy(proxy_address, proxy_address)
        print(f"Using SingleProxy: {proxy_address}")
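    elif proxy_type.lower() == 'freeproxy':
        pg.FreeProxies()
        print("Using FreeProxy!")
    else:
        # Fail loudly on a misspelled option instead of silently using FreeProxy.
        raise ValueError(f"Unknown proxy_type: '{proxy_type}'")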

    scholarly.use_proxy(pg)

def standardize_names(name):
    """Return a URL-quoted `'I Lastname'` form of `name` for an author query."""
    if " " not in name:
        # Single-word names have nothing to abbreviate.
        return quote(name)
    try:
        parts = name.split(' ')
        firstname, lastname = parts[0], parts[-1]
        initial = firstname[0]
        return quote(f"'{initial} {lastname}'")
    except IndexError:
        # This usually happens for collaboration papers
        print(f"Cannot split '{name}' into initial and last names!")
        return quote(name)


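def fill_independent_citations(publication, links_only=True):
    """Add `independent_url` and, unless `links_only`, `num_independent_citations`.

    The publication dictionary is modified in place. A negative count marks
    a failed fetch.
    """
    if publication["source"].name != "AUTHOR_PUBLICATION_ENTRY":
        raise TypeError("Input source must be from a Google Scholar profile page")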

    if not publication["filled"]:  # TODO: Don't fill once the patch comes through
        scholarly.fill(publication)

    citedby_url = publication.get("citedby_url", None)
    if citedby_url is None:
        # If there are no citations, then there is nothing to do
        publication["num_independent_citations"] = 0
        return None

    author_names = publication["bib"]["author"].split(" and ")
    independent_query = "+".join([f"-author:{standardize_names(name)}" for name in author_names])
    independent_url = citedby_url+"&hl=en&scipsc=1&q="+independent_query
    publication["independent_url"] = independent_url

    if links_only:
        return None

    try:
        search_results = scholarly.search_pubs_custom_url(independent_url)
        num_independent_citations = search_results.total_results or 0
    except Exception:
        # Sentinel: a negative count marks a failed fetch so that a rerun retries it.
        num_independent_citations = -99

    publication["num_independent_citations"] = num_independent_citations

In [None]:
set_proxy(proxy_type)
scholar = scholarly.search_author_id(scholar_id, filled=True)
scholar_name = scholar["name"]
print(f"Hello {scholar_name} !")

In [None]:
if links_only:
    print("Fetching the independent citation counts has been turned off for your own good"
          " because you are not using a proxy."
          " You can turn it back on at your own risk by explicitly setting `links_only` to `False`."
          )
else:
    print("You are fetching the counts in addition to the links. The code will run slowly on purpose.")

independent_citation_counts = []
for paper in scholar["publications"]:
    if not links_only:
        # Sleep for some random time to mimic human behavior
        time.sleep(random.uniform(2, 5))

    try:
        if paper.get("num_independent_citations", -1) < 0:
            fill_independent_citations(paper, links_only=links_only)
            independent_citation_counts.append(paper.get("num_independent_citations", 0))
    except Exception as err:
        print("Google Scholar is aggressively blocking us! Quitting for now.")
        print(err)
        # Stop here as announced; the `finally` block still prints this paper.
        break
    finally:
        print("\n ------\n")
        print(paper["bib"]["title"])
        print(f"Citations: {paper.get('num_independent_citations', 'NA')}/{paper.get('num_citations')}")
        independent_url = paper.get("independent_url", None)
        if independent_url:
            print("Link: ", "http://scholar.google.com"+independent_url)

print("\n --- End of list ---")

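if not links_only:
    # Skip the negative sentinel (-99) that marks papers whose count could not be fetched.
    fetched_counts = [count for count in independent_citation_counts if count >= 0]
    print("Total number of independent citations = ", sum(fetched_counts))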
