# Snowballing, for rounds 2 and above

This notebook takes the most recently coded spreadsheet of papers and queries Google Scholar for the 'related articles' of each newly accepted paper. Each run retrieves a new set of candidates, adds them to the list, and saves the result to a new spreadsheet.

Before starting, double code the previous spreadsheet, PapersToCode.xlsx (starting with N=2): enter a score in the column for each coder (C and P), and enter the score agreed after coder discussion in the AgreedScore column. A score of 7 or above means 'accept'. Save the result as CodedPapers.xlsx.
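
Before running the rest of the notebook, it may help to check that no paper in the latest round was left without an agreed score. The sketch below assumes only the `Round` and `AgreedScore` column names used later in this notebook; the helper `uncoded_rows` and the sample data are hypothetical.

```python
import pandas as pd

def uncoded_rows(coded_df: pd.DataFrame) -> pd.DataFrame:
    """Return rows from the latest round that still lack an AgreedScore."""
    latest_round = coded_df["Round"].max()
    in_latest_round = coded_df["Round"] == latest_round
    return coded_df[in_latest_round & coded_df["AgreedScore"].isna()]

# Hypothetical sample standing in for pd.read_excel("CodedPapers.xlsx", index_col=0)
sample = pd.DataFrame({
    "Key": ["a", "b", "c"],
    "Round": [1, 2, 2],
    "AgreedScore": [8.0, 7.0, None],
})
missing = uncoded_rows(sample)
print(missing["Key"].tolist())
```

An empty result means every paper in the latest round has an agreed score.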

Change the coder names in the FirstStep and Analysis notebooks if desired.

Copyright &copy; 2021.   
Shared under the Apache 2.0 License

In [1]:
# We upped this count after round 3 to 19 (see Analysis notebook)
RequiredAnnualCount=1
FinalYear=2021
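
The two settings above are passed to `WellCitedPapers`, whose implementation lives in ScholarUtils and is not shown here. As an illustration only, one plausible reading of a "required annual count" is a minimum number of citations per year since publication. The function name and the rule for papers published in the final year are assumptions, not the ScholarUtils code.

```python
def meets_annual_citation_rate(citations: int, year: int,
                               required_annual_count: float,
                               final_year: int) -> bool:
    """Hypothetical filter: average citations per year must reach the threshold."""
    # Assumption: a paper published in final_year counts as one year old,
    # which also avoids dividing by zero.
    years_in_print = max(final_year - year, 1)
    return citations / years_in_print >= required_annual_count

# The two papers accepted in round 2, under both threshold values mentioned above.
for citations, year in [(103, 2018), (6, 2020)]:
    for threshold in (1, 19):
        print(citations, year, threshold,
              meets_annual_citation_rate(citations, year, threshold, 2021))
```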

In [2]:
# !pip install google-search-results
from serpapi import GoogleScholarSearch
from ScholarUtils import GetPapers, GetPaper, WellCitedPapers, InitScholar, RelatedQuery

In [3]:
import pandas as pd
import numpy as np
import pickle
# Reload ScholarUtils every time before executing code 
%load_ext autoreload
%autoreload 2

In [4]:
InitScholar("APIKey.yaml")

In [5]:
scoringSpreadsheet="CodedPapers.xlsx"
previousPapersDf=(pd.read_excel(scoringSpreadsheet, index_col=0,
                                # Want Key as string, not number, to match WellCitedPapers() output:
                              dtype={'Key':np.str_})
                         )
CurrentRound=previousPapersDf.Round.max()
CurrentRound

2

In [6]:
# Get the accepted papers from the latest round of coding:
papersToFollowDf=previousPapersDf.query('Round=={} and AgreedScore > 6'.format(CurrentRound))
print(len(papersToFollowDf))
papersToFollowDf

2


| | Key | Round | Citations | Year | Title | Authors | Link | Related | Snippet | C | P | AgreedScore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 28 | 7884234468971956873 | 2 | 103 | 2018 | Future developments in cyber risk assessment f... | P Radanliev, DC De Roure, R Nicolescu, M Huth… | https://www.sciencedirect.com/science/article/... | icr2IHptam0J | This article is focused on the economic impact... | | | 7.0 |
| 41 | 15098848452248305004 | 2 | 6 | 2020 | Cyber risk measurement with ordinal data | S Facchinetti, P Giudici, SA Osmetti | https://link.springer.com/article/10.1007/s102... | bH23kaviidEJ | The paper proposes a new methodology to measur... | | | 7.0 |
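
The selection above keeps only rows from the latest round whose agreed score exceeds 6. A minimal sketch on made-up data shows the same filter written with pandas' `@variable` syntax, and that rows with no agreed score are excluded because a comparison with NaN is false. The sample frame and lower-case names are hypothetical.

```python
import pandas as pd

# Made-up rows: one earlier round, then an accept, a reject and an uncoded paper.
papers = pd.DataFrame({
    "Key": ["k1", "k2", "k3", "k4"],
    "Round": [1, 2, 2, 2],
    "AgreedScore": [9.0, 7.0, 5.0, float("nan")],
})
current_round = papers["Round"].max()

# '@current_round' refers to the local Python variable inside the query string.
accepted = papers.query("Round == @current_round and AgreedScore > 6")
print(accepted["Key"].tolist())
```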


In [8]:
# Ask Google for the 'related publications' for each newly coded item (takes a long time)
# Concatenate returned lists...
newPapers = [foundPaper for relatedPaper in papersToFollowDf.itertuples()
                 for foundPaper in GetPapers(RelatedQuery(relatedPaper.Related))
            ] 
len(newPapers)


Retrieving 101 papers for {'q': 'related:icr2IHptam0J:scholar.google.com/'}
Retrieving 101 papers for {'q': 'related:bH23kaviidEJ:scholar.google.com/'}


202
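
The nested comprehension above flattens one result list per accepted paper into a single list. The sketch below reproduces that shape without any network access. `related_query` is a hypothetical stand-in that builds the query dictionary seen in the log lines above, and `fake_get_papers` is a dummy replacement for `GetPapers`; neither is the ScholarUtils code.

```python
def related_query(related_id: str) -> dict:
    """Hypothetical stand-in for RelatedQuery: build the 'related:' search parameters."""
    return {"q": "related:{}:scholar.google.com/".format(related_id)}

def fake_get_papers(query: dict) -> list:
    """Dummy stand-in for GetPapers: returns two placeholder results per query."""
    return [{"result_id": "{}#{}".format(query["q"], n)} for n in range(2)]

related_ids = ["icr2IHptam0J", "bH23kaviidEJ"]

# Outer loop: one query per accepted paper; inner loop: every result of that query.
found = [paper
         for related_id in related_ids
         for paper in fake_get_papers(related_query(related_id))]
print(len(found))
```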

In [9]:
if CurrentRound == 1:
    allPapers=newPapers
else:
    with open( "papersFound.p", "rb" ) as inFile:
        allPapers=pickle.load( inFile ) + newPapers

In [10]:
# The Google queries take a long time and can cost money,
# so we save the results as 'pickle' dumps.
# You can also 'comment out' the slow cells by changing the cell type to 'Raw NBConvert';
# to redo the queries, convert them back to Code.

with open( "papersFound.p", "wb" ) as outFile:
    pickle.dump( allPapers, outFile )
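
The load-then-save pattern in the two cells above can be sketched as a pair of small helpers. The helper names are hypothetical, and the example writes to a temporary directory rather than the real `papersFound.p`. Treating a missing file as an empty list is an assumption that replaces the explicit round-1 check. Note that `pickle.load` should only be used on files you created yourself.

```python
import os
import pickle
import tempfile

def load_cached(path: str) -> list:
    """Return the pickled list at path, or an empty list if the file is absent."""
    if not os.path.exists(path):
        return []
    with open(path, "rb") as in_file:
        return pickle.load(in_file)

def save_cached(path: str, papers: list) -> None:
    """Overwrite the cache file with the full list of papers found so far."""
    with open(path, "wb") as out_file:
        pickle.dump(papers, out_file)

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "papersFound.p")
    # First run: nothing cached yet, so only the new results are stored.
    save_cached(cache_path, load_cached(cache_path) + [{"result_id": "a"}])
    # Later run: earlier results are loaded and the new ones appended.
    save_cached(cache_path, load_cached(cache_path) + [{"result_id": "b"}])
    reloaded = load_cached(cache_path)

print(len(reloaded))
```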

In [11]:
# How many unique papers so far?

pd.Series([paper['result_id'] for paper in allPapers]).unique().size

393
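
The count above treats two results as the same paper when their `result_id` values match. If the de-duplicated records themselves are wanted rather than just the count, a sketch along these lines keeps the first occurrence of each id. The helper name and sample records are hypothetical.

```python
def unique_by_result_id(papers: list) -> list:
    """Keep the first paper seen for each result_id, preserving order."""
    first_seen = {}
    for paper in papers:
        # setdefault only stores the paper if this result_id is new.
        first_seen.setdefault(paper["result_id"], paper)
    return list(first_seen.values())

sample_papers = [
    {"result_id": "x", "title": "First copy"},
    {"result_id": "y", "title": "Another paper"},
    {"result_id": "x", "title": "Second copy"},
]
unique_papers = unique_by_result_id(sample_papers)
print(len(unique_papers))
```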

In [12]:
newPapersDf=(WellCitedPapers(newPapers, requiredAnnualCount=RequiredAnnualCount, finalYear=FinalYear)
             .reindex(columns=previousPapersDf.columns, fill_value='') # Add in the extra classification columns.
             .assign(Round=CurrentRound+1)
            )
len(newPapersDf)

142
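
The `reindex` call above gives the new papers the same columns as the coded spreadsheet, filling the coding columns with empty strings, and `assign` then stamps the next round number. A small sketch on made-up data, with hypothetical column values, shows the effect:

```python
import pandas as pd

# Made-up search result with only some of the spreadsheet's columns.
found = pd.DataFrame({"Key": ["k3"], "Title": ["Example paper"]})
target_columns = ["Key", "Round", "Title", "C", "P", "AgreedScore"]

aligned = (found
           .reindex(columns=target_columns, fill_value="")  # missing columns become ''
           .assign(Round=3)                                 # overwrite Round for every row
          )
print(aligned.columns.tolist())
```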

In [13]:
allPapersSoFarDf=(pd.concat([previousPapersDf, newPapersDf]) # Keep already coded papers - don't want to recode.
                .drop_duplicates(subset=['Key'])  # Only want new ones - this keeps the earlier entries.
                .reset_index(drop=True)           # Don't need to keep the old numbering.
               )
allPapersSoFarDf.to_excel('PapersToCode.xlsx')
len(allPapersSoFarDf)
#allPapersSoFarDf

273
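
The merge above relies on `drop_duplicates` keeping the first occurrence by default: because the already coded papers come first in the `concat`, a paper found again in a later round keeps its original row and scores. A minimal sketch with made-up keys:

```python
import pandas as pd

previous = pd.DataFrame({"Key": ["k1", "k2"], "Round": [1, 2], "AgreedScore": [8.0, 7.0]})
# 'k2' turns up again in the new round; 'k3' is genuinely new.
new = pd.DataFrame({"Key": ["k2", "k3"], "Round": [3, 3], "AgreedScore": ["", ""]})

combined = (pd.concat([previous, new])
            .drop_duplicates(subset=["Key"])  # default keep='first' retains the coded row
            .reset_index(drop=True)
           )
print(combined["Key"].tolist())
```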

In [14]:
# Open the new spreadsheet for coding ('open' is the macOS launcher command).
!open PapersToCode.xlsx

In [15]:
# Deliberate failure, presumably so that 'Run All' stops before the cell below.
assert False

AssertionError: 

In [None]:
!open ScholarUtils.py