# Snowballing, for rounds 2 and above

This notebook takes the newly coded spreadsheet of papers and queries Google Scholar for the 'related articles' of each newly accepted paper. Each run retrieves a new set of candidates, adds them to the list, and saves the result to a new spreadsheet.

Before starting round N (N=2 for the first run), the previous spreadsheet, PapersToCode.xlsx, should be coded: enter the agreed coding value (7 or above means 'accept') in the AgreedScore column. If you're double coding, record the two coders' values in the Coder1 and Coder2 columns and put the result of the coder discussion in AgreedScore. Save the result as CodedPapers.xlsx.

Change the coder names in the CodedPapers spreadsheet and Analysis notebook if desired.
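
As a rough illustration of the coding convention above, the following sketch uses a made-up three-row table to flag rows where the two coders disagree and to select the accepted papers. The keys and scores are invented for the example; only the column names `Coder1`, `Coder2`, and `AgreedScore` come from the description above.

```python
import pandas as pd

# Invented example data; real values come from CodedPapers.xlsx.
coded = pd.DataFrame({
    "Key": ["a1", "b2", "c3"],
    "Coder1": [8, 5, 7],
    "Coder2": [7, 5, 4],
    "AgreedScore": [7, 5, 6],
})

# Rows where the two coders gave different scores and so needed discussion.
disagreements = coded[coded["Coder1"] != coded["Coder2"]]

# A score of 7 or above means 'accept'.
accepted = coded[coded["AgreedScore"] >= 7]

print(list(disagreements["Key"]))
print(list(accepted["Key"]))
```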

Copyright &copy; Charles Weir 2021.   
Shared under the Apache 2.0 License

In [1]:
# We upped this count after round 3 to 19 (see Analysis notebook)
RequiredAnnualCount=1
FinalYear=2021
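
The filter that uses these two settings lives in ScholarUtils, which is not shown here, so its exact rule is unknown. One plausible reading, sketched below purely as an assumption, is that a paper qualifies when its average citations per year up to `FinalYear` reach `RequiredAnnualCount`. The function name `meets_annual_count` and the choice to count the publication year as a full year are both hypothetical.

```python
def meets_annual_count(citations: int, year: int,
                       required_annual_count: float, final_year: int) -> bool:
    """Assumed rule: average citations per year must reach the threshold."""
    # Count the publication year itself, and never divide by less than one year.
    years_in_print = max(final_year - year + 1, 1)
    return citations / years_in_print >= required_annual_count

# A 2019 paper with 21 citations averages 7 per year by 2021 under this assumption.
print(meets_annual_count(21, 2019, required_annual_count=1, final_year=2021))
print(meets_annual_count(21, 2019, required_annual_count=19, final_year=2021))
```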

In [2]:
# !pip install google-search-results
from serpapi import GoogleScholarSearch
from ScholarUtils import GetPapers, GetPaper, WellCitedPapers, InitScholar, RelatedQuery

In [3]:
import pandas as pd
import numpy as np
import pickle
# Reload ScholarUtils every time before executing code 
%load_ext autoreload
%autoreload 2

In [4]:
InitScholar("APIKey.yaml")

In [5]:
scoringSpreadsheet="CodedPapers.xlsx"
previousPapersDf=pd.read_excel(
    scoringSpreadsheet,
    index_col=0,
    # Want Key as string, not number, to match WellCitedPapers() output:
    dtype={'Key':np.str_},
)
CurrentRound=previousPapersDf.Round.max()
CurrentRound

6

In [6]:
# Get the accepted papers from the latest round of coding:
papersToFollowDf=previousPapersDf.query('Round=={} and AgreedScore > 6'.format(CurrentRound))
print(len(papersToFollowDf))
papersToFollowDf

6


Unnamed: 0,Key,Round,Citations,Year,Title,Authors,Link,Related,Snippet,C,P,AgreedScore,Unnamed: 13
695,6724629180905251087,6,21,2019,Ontology-based security recommendation for the...,"F Alsubaei, A Abuhussein, S Shiva",https://ieeexplore.ieee.org/abstract/document/...,D2EVhpyuUl0J,Security and privacy are among the key barrier...,,,7.0,
698,10710897600124378357,6,37,2019,Security and privacy for the internet of medic...,"Y Sun, FPW Lo, B Lo",https://ieeexplore.ieee.org/abstract/document/...,9Vz_WJ7ApJQJ,With the increasing demands on quality healthc...,,,7.0,
729,5021234964518621637,6,54,2019,IoT smart health security threats,"SA Butt, JL Diaz-Martinez, T Jamal, A Ali…",https://ieeexplore.ieee.org/abstract/document/...,xVHnuZgCr0UJ,The Internet of things (IoT) is an active area...,,,7.0,Dummys guide to basic attacks
870,16125298375562168501,6,23,2012,Risk-driven security metrics in agile software...,"RM Savola, C Frühwirth, A Pietikäinen",http://jucs.org/jucs_18_12/risk_driven_securit...,tQiR5G-RyN8J,The need for effective and efficient informati...,,,7.0,Like the risk-driven bit.
970,16205524198307687761,6,2,2020,Operationalization of privacy and security req...,"O Tomashchuk, Y Li, D Van Landuyt, W Joosen",https://link.springer.com/chapter/10.1007/978-...,UW0g12WW5eAJ,Abstract The Fourth Industrial Revolution impo...,,,7.0,
1113,9619492135223642694,6,17,2015,A value blueprint approach to cybersecurity in...,"G Tanev, P Tzolov, R Apiafi",https://timreview.ca/sites/default/files/artic...,RkqrdyFNf4UJ,Cybersecurity for networked medical devices ha...,,,7.0,Emphasises stakeholder value


In [7]:
# Ask Google Scholar for the 'related articles' of each newly accepted paper (takes a long time)
# and concatenate the returned lists into one.
newPapers = [foundPaper for relatedPaper in papersToFollowDf.itertuples()
                 for foundPaper in GetPapers(RelatedQuery(relatedPaper.Related))
            ] 
len(newPapers)


Retrieving 101 papers for {'q': 'related:D2EVhpyuUl0J:scholar.google.com/'}
Retrieving 101 papers for {'q': 'related:9Vz_WJ7ApJQJ:scholar.google.com/'}
Retrieving 101 papers for {'q': 'related:xVHnuZgCr0UJ:scholar.google.com/'}
Retrieving 101 papers for {'q': 'related:tQiR5G-RyN8J:scholar.google.com/'}
Retrieving 101 papers for {'q': 'related:UW0g12WW5eAJ:scholar.google.com/'}
Retrieving 101 papers for {'q': 'related:RkqrdyFNf4UJ:scholar.google.com/'}


606
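
The log lines above show the query sent for each paper. A minimal sketch of how such a query dictionary could be built from a paper's `Related` identifier follows. `related_query_sketch` is a hypothetical stand-in, not the `RelatedQuery` function from ScholarUtils, whose implementation is not shown here.

```python
def related_query_sketch(related_id: str) -> dict:
    """Build a 'related articles' query dictionary for one paper ID."""
    return {"q": f"related:{related_id}:scholar.google.com/"}

# IDs taken from the Related column of the table above.
related_ids = ["D2EVhpyuUl0J", "9Vz_WJ7ApJQJ"]
queries = [related_query_sketch(rid) for rid in related_ids]
print(queries[0])
```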

In [8]:
# papersFound.p is created in the first step; add this round's results to it:
with open( "papersFound.p", "rb" ) as inFile:
    allPapers=pickle.load( inFile ) + newPapers

In [9]:
# The Google queries take a long time and can cost money,
# so we save the results as 'pickle' dumps.
# You can also 'comment out' the slow parts by changing the cell type of the appropriate cells to 'Raw NBConvert';
# to redo the queries, convert them back to Code.

with open( "papersFound.p", "wb" ) as outFile:
    pickle.dump( allPapers, outFile )
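
An alternative to switching cell types is a load-or-fetch pattern: read the pickle if it exists and only run the slow query otherwise. This is a sketch of that general technique, not what the notebook does. The helper `load_or_fetch` and the stand-in `fake_fetch` are hypothetical names introduced for the example.

```python
import os
import pickle
import tempfile

def load_or_fetch(path, fetch):
    """Return cached results from `path` if present; otherwise call `fetch` and cache them."""
    if os.path.exists(path):
        with open(path, "rb") as in_file:
            return pickle.load(in_file)
    results = fetch()
    with open(path, "wb") as out_file:
        pickle.dump(results, out_file)
    return results

# Demonstration with a stand-in for the slow, paid query.
calls = []
def fake_fetch():
    calls.append(1)
    return [{"result_id": "x1"}]

cache_path = os.path.join(tempfile.mkdtemp(), "demo.p")
first = load_or_fetch(cache_path, fake_fetch)   # runs the query and writes the cache
second = load_or_fetch(cache_path, fake_fetch)  # served from the cache file
print(first == second, len(calls))
```

Deleting the cache file forces the query to run again.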

In [10]:
# How many unique papers so far?

pd.Series([paper['result_id'] for paper in allPapers]).unique().size

2377

In [11]:
newPapersDf=(WellCitedPapers(newPapers, requiredAnnualCount=RequiredAnnualCount, finalYear=FinalYear)
           .reindex(columns=previousPapersDf.columns, fill_value='') # Add in the extra classification columns.
            .assign(Round=CurrentRound+1)
          )
len(newPapersDf)

325

In [12]:
allPapersSoFarDf=(pd.concat([previousPapersDf, newPapersDf])        # Already coded papers come first - don't want to recode.
                  .drop_duplicates(subset=['Key'], keep='first')  # Drop new papers already listed; the earlier (coded) entry is kept.
                  .reset_index(drop=True)                         # Don't need to keep the old numbering.
                 )
allPapersSoFarDf.to_excel('PapersToCode.xlsx')
len(allPapersSoFarDf)
#allPapersSoFarDf

1459
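
To see why the order of the concatenation matters, here is a small sketch with invented data in which one key appears both in the coded set and among the new candidates. Because `drop_duplicates` keeps the first occurrence by default, the coded row survives and the uncoded duplicate is dropped.

```python
import pandas as pd

# Invented data: 'k2' appears in both the coded set and the new candidates.
previous = pd.DataFrame({"Key": ["k1", "k2"], "Round": [5, 6], "AgreedScore": [7.0, 4.0]})
new = pd.DataFrame({"Key": ["k2", "k3"], "Round": [7, 7], "AgreedScore": ["", ""]})

combined = (pd.concat([previous, new])
            .drop_duplicates(subset=["Key"], keep="first")  # earlier rows win
            .reset_index(drop=True))
print(combined)
```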

In [13]:
!open PapersToCode.xlsx

In [14]:
# Deliberately fail here so that 'Run All' stops before the cell below.
assert False

AssertionError: 

In [None]:
!open ScholarUtils.py