# Snowballing, for rounds 2, 3 and 4

This notebook takes the most recently coded spreadsheet of papers and queries Google Scholar for the 'related articles' of each newly accepted paper. Each run retrieves a new set of candidate papers, adds them to the list, and saves the combined list to a new spreadsheet.

Before starting, double-code the previous spreadsheet, PapersToCode.xlsx (starting with N=2): enter a score (7 or above means 'accept') in the column for each coder (Charles and Pierre), and enter the score agreed after coder discussion in the AgreedScore column. Save the result as CodedPapers.xlsx.

Change the coder names in the FirstStep and Analysis notebooks if desired.

Copyright © 2021 Charles Weir.   
Shared under the Apache 2.0 License

In [1]:
# Note: we raised this count to 19 after round 3 (see the Analysis notebook)
RequiredAnnualCount=8

In [2]:
# !pip install google-search-results
from serpapi import GoogleScholarSearch
from ScholarUtils import GetPapers, GetPaper, WellCitedPapers, InitScholar, RelatedQuery

In [3]:
import pandas as pd
import numpy as np
import pickle
# Reload ScholarUtils every time before executing code 
%load_ext autoreload
%autoreload 2

In [4]:
InitScholar("APIKey.yaml")

In [5]:
scoringSpreadsheet = "CodedPapers.xlsx"
previousPapersDf = pd.read_excel(
    scoringSpreadsheet,
    index_col=0,
    # Read Key as a string, not a number, to match WellCitedPapers() output:
    dtype={'Key': np.str_},
)
CurrentRound = previousPapersDf.Round.max()

In [6]:
# Select the papers accepted in the latest round (AgreedScore of 7 or above):
papersToFollowDf=previousPapersDf.query('Round=={} and AgreedScore > 6'.format(CurrentRound))
print(len(papersToFollowDf))
papersToFollowDf

37


Unnamed: 0,Key,Round,Citations,Year,Title,Authors,Link,Related,Snippet,AgreedScore,Pierre,P Comment,Charles,C comment,Agreement,Unnamed: 16
51,782252406809913637,2,298,2004,The trustworthy computing security development...,S Lipner,https://ieeexplore.ieee.org/abstract/document/...,JWmZYGse2woJ,This paper discusses the trustworthy computing...,7,7.0,7 (3 2 2),5,,7.0,P: Not really about developer behaviour
53,16155454764124359910,2,150,2017,Comparing the usability of cryptographic apis,"Y Acar, M Backes, S Fahl, S Garfinkel…",https://ieeexplore.ieee.org/abstract/document/...,5kyGQIO0M-AJ,Potentially dangerous cryptography errors are ...,9,9.0,9,9,,,
54,3016824145034364740,2,153,2017,Stack overflow considered harmful? the impact ...,"F Fischer, K Böttinger, H Xiao…",https://ieeexplore.ieee.org/abstract/document/...,REeYepPp3SkJ,Online programming discussion platforms such a...,7,8.0,8 (3 2 3),7,,,
55,11048499173659227511,2,68,2017,A stitch in time: Supporting android developer...,"DC Nguyen, D Wermke, Y Acar, M Backes…",https://dl.acm.org/doi/abs/10.1145/3133956.313...,d7UnA3YnVJkJ,Despite security advice in the official docume...,9,7.0,7 (3 2 2),9,,,
56,13159929425877222719,2,147,2016,Jumping through hoops: Why do Java developers ...,"S Nadi, S Krüger, M Mezini, E Bodden",https://dl.acm.org/doi/abs/10.1145/2884781.288...,P0WfqhZ2obYJ,To protect sensitive data processed by current...,9,5.0,5 (3 1 1),9,,9.0,P: Bug
57,13944593684394717498,2,118,2016,Developers are not the enemy!: The need for us...,"M Green, M Smith",https://ieeexplore.ieee.org/abstract/document/...,OrGABxMmhcEJ,Rather than recognizing software engineers' li...,9,6.0,6,8,,9.0,P: could not access
59,4014642532850524710,2,65,2017,Security developer studies with github users: ...,"Y Acar, C Stransky, D Wermke, ML Mazurek…",https://www.usenix.org/conference/soups2017/te...,JuK69hzgtjcJ,The usable security community is increasingly ...,9,9.0,9,9,,,
60,18023081776129372272,2,66,2016,"You are not your developer, either: A research...","Y Acar, S Fahl, ML Mazurek",https://ieeexplore.ieee.org/abstract/document/...,cPABkGLZHvoJ,"While researchers have developed many tools, t...",7,7.0,7 (3 3 1),7,,,
61,4322461358304187283,2,65,2017,Why do developers get password storage wrong? ...,"A Naiakshina, A Danilova, C Tiefenau…",https://dl.acm.org/doi/abs/10.1145/3133956.313...,k9MSObR3_DsJ,Passwords are still a mainstay of various secu...,8,8.0,8 (3 3 2),8,,,
62,8265577877465095052,2,184,2013,Rethinking SSL development in an appified world,"S Fahl, M Harbach, H Perl, M Koetter…",https://dl.acm.org/doi/abs/10.1145/2508859.251...,jFuiYUk7tXIJ,ABSTRACT The Secure Sockets Layer (SSL) is wid...,9,7.0,7 (3 2 2),9,,,
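
The selection above can be tried on a small example. This is a minimal sketch using a made-up table (the keys and scores are illustrative, not taken from the spreadsheet); it uses `@name` in `DataFrame.query` as an alternative to string formatting.

```python
import pandas as pd

# Toy stand-in for the coded spreadsheet; the values are illustrative only.
coded_df = pd.DataFrame({
    "Key": ["a", "b", "c", "d"],
    "Round": [1, 2, 2, 2],
    "AgreedScore": [9, 7, 6, 8],
})

current_round = coded_df.Round.max()

# '@name' lets DataFrame.query read a local Python variable directly.
to_follow_df = coded_df.query("Round == @current_round and AgreedScore > 6")
print(to_follow_df.Key.tolist())
```

Only the latest-round rows with a score of 7 or above remain; the accepted paper from round 1 is left out because it was already followed up in an earlier run.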


In [7]:
# Ask Google for the 'related publications' for each newly coded item (takes a long time)
# Concatenate returned lists...
allPapers = [foundPaper for relatedPaper in papersToFollowDf.itertuples()
                 for foundPaper in GetPapers(RelatedQuery(relatedPaper.Related))
            ] 
len(allPapers)


Retrieving 101 papers for {'q': 'related:JWmZYGse2woJ:scholar.google.com/'}
Retrieving 101 papers for {'q': 'related:5kyGQIO0M-AJ:scholar.google.com/'}
Retrieving 101 papers for {'q': 'related:REeYepPp3SkJ:scholar.google.com/'}
Retrieving 101 papers for {'q': 'related:d7UnA3YnVJkJ:scholar.google.com/'}
Retrieving 101 papers for {'q': 'related:P0WfqhZ2obYJ:scholar.google.com/'}
Retrieving 101 papers for {'q': 'related:OrGABxMmhcEJ:scholar.google.com/'}
Retrieving 101 papers for {'q': 'related:JuK69hzgtjcJ:scholar.google.com/'}
Retrieving 101 papers for {'q': 'related:cPABkGLZHvoJ:scholar.google.com/'}
Retrieving 101 papers for {'q': 'related:k9MSObR3_DsJ:scholar.google.com/'}
Retrieving 101 papers for {'q': 'related:jFuiYUk7tXIJ:scholar.google.com/'}
Retrieving 101 papers for {'q': 'related:W3V6kSHi8kUJ:scholar.google.com/'}
Retrieving 101 papers for {'q': 'related:tliFTgXuBfgJ:scholar.google.com/'}
Retrieving 101 papers for {'q': 'related:w67L5Nb5wHIJ:scholar.google.com/'}
Retrieving 1

3737
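
ScholarUtils.py is not shown in this notebook, so the following is only a sketch of the pattern used above. The names `related_query`, `collect_related` and `fake_fetch` are hypothetical; the query shape is copied from the log output, and the stub fetcher stands in for the real search call.

```python
from typing import Callable, Iterable


def related_query(related_id: str) -> dict:
    """Build a query dict of the shape shown in the log output (hypothetical helper)."""
    return {"q": f"related:{related_id}:scholar.google.com/"}


def collect_related(related_ids: Iterable[str], fetch: Callable[[dict], list]) -> list:
    """Flatten the per-paper result lists into one list, as the nested comprehension does."""
    return [paper for rid in related_ids for paper in fetch(related_query(rid))]


def fake_fetch(query: dict) -> list:
    """Stub standing in for the real (slow, possibly paid) search; returns two fake papers."""
    return [{"result_id": query["q"] + str(i)} for i in range(2)]


papers = collect_related(["JWmZYGse2woJ", "5kyGQIO0M-AJ"], fake_fetch)
print(len(papers))
```

Passing the fetch function in as a parameter makes it possible to test the flattening logic without spending any search queries.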

In [None]:
# The Google queries take a long time and can cost money.
# While developing the code we save the results as 'pickle' dumps,
# and 'comment out' the slow cells by changing their cell type to 'Raw NBConvert';
# to redo the queries, convert the cells back to Code.

with open("papersFound.p", "wb") as outFile:
    pickle.dump(allPapers, outFile)
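
An alternative to switching cell types is to load the pickle when it exists and only query otherwise. This is a minimal sketch, not part of the notebook; `load_or_fetch` and `slow_fetch` are hypothetical names, and the demonstration writes to a temporary directory rather than the real papersFound.p.

```python
import os
import pickle
import tempfile


def load_or_fetch(cache_path, fetch):
    """Return cached results if the pickle file exists, else call fetch() and cache the result."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as in_file:
            return pickle.load(in_file)
    results = fetch()
    with open(cache_path, "wb") as out_file:
        pickle.dump(results, out_file)
    return results


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "papersFound.p")
    calls = []

    def slow_fetch():
        # Stand-in for the real queries; records each call so we can count them.
        calls.append(1)
        return [{"result_id": "abc"}]

    first = load_or_fetch(path, slow_fetch)   # runs the fetch and writes the cache
    second = load_or_fetch(path, slow_fetch)  # served from the cache file
    print(len(calls))
```

To force a fresh set of queries with this approach, delete the cache file before running the cell.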

In [8]:
# How many unique new papers?
pd.Series([paper['result_id'] for paper in allPapers]).unique().size
#allPapers[0]

1679

In [9]:
newPapersDf = (WellCitedPapers(allPapers, requiredAnnualCount=RequiredAnnualCount)
               # Add in the extra classification columns, initially empty.
               .reindex(columns=previousPapersDf.columns, fill_value='')
               .assign(Round=CurrentRound + 1)
              )
len(newPapersDf)

1513
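
The implementation of `WellCitedPapers` lives in ScholarUtils.py and is not shown here. The sketch below is only a guess at the kind of filter it might apply, assuming that 'annual count' means citations divided by the number of years since publication; `well_cited`, `current_year` and the toy data are all hypothetical.

```python
import pandas as pd


def well_cited(papers_df, required_annual_count, current_year):
    """Keep rows averaging at least required_annual_count citations per year (assumed rule)."""
    # Count the publication year itself, so a paper from current_year has an age of 1.
    age_years = (current_year - papers_df["Year"]).clip(lower=0) + 1
    return papers_df[papers_df["Citations"] / age_years >= required_annual_count]


toy_df = pd.DataFrame({
    "Key": ["a", "b", "c"],
    "Citations": [150, 20, 68],
    "Year": [2017, 2010, 2017],
})
kept = well_cited(toy_df, required_annual_count=8, current_year=2021)
print(kept.Key.tolist())
```

Under this assumed rule, raising the threshold shrinks the candidate list, which is the effect the `RequiredAnnualCount` setting is used for.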

In [10]:
allPapersSoFarDf = (pd.concat([previousPapersDf, newPapersDf])              # Already coded papers come first - don't want to recode.
                    .drop_duplicates(subset=['Key'], keep='first')      # Drop repeated keys, keeping the earlier (coded) entries.
                    .reset_index(drop=True)                             # Don't need to keep the old numbering.
                   )
allPapersSoFarDf.to_excel('PapersToCode.xlsx')
len(allPapersSoFarDf)
#allPapersSoFarDf

503
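
The merge step relies on `drop_duplicates` keeping the first occurrence of each key. A minimal sketch with made-up rows (the keys and scores are illustrative only) shows that a paper already coded in an earlier round keeps its score when it reappears as a new candidate:

```python
import pandas as pd

# 'b' was coded in round 2 and turns up again among the round 3 candidates.
previous_df = pd.DataFrame({"Key": ["a", "b"], "Round": [2, 2], "AgreedScore": [9, 5]})
new_df = pd.DataFrame({"Key": ["b", "c"], "Round": [3, 3], "AgreedScore": ["", ""]})

merged_df = (pd.concat([previous_df, new_df])
             .drop_duplicates(subset=["Key"], keep="first")  # earlier rows win
             .reset_index(drop=True))
print(merged_df)
```

Because the previously coded rows are listed first in `pd.concat`, the order of the two frames matters: swapping them would discard the existing scores.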

In [13]:
# Open the new spreadsheet for coding ('open' is the macOS command).
!open PapersToCode.xlsx

In [12]:
# Deliberate failure, presumably so that 'Run All' stops here.
assert False

AssertionError: 

In [None]:
!open ScholarUtils.py