## Goodreads to MV library query
@Author: DJ Rajdev  
@LastUpdates: Aug 28, 2019  
@Purpose: Query my to-read list on goodreads against available or on hold books at MV / palo alto library. Helps to figure out differnt login scenarios using `rvest` , try `selenium` if `rvest` doesn't work.  
@Citations:  
* [(R Vignette) SelectorGadget](https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html)  
* [(StackOverflow) Scrolling page using selenium](https://stackoverflow.com/questions/31901072/scrolling-page-in-rselenium)

<br/>

### Prereq
Using the chrome extension `selectorGadget` to figure out what to scrape. load the required libraries and record the versions in use.

In [1]:
library(tidyverse)
library(rvest)
library(magrittr)
library(getPass)
library(RSelenium) #from devtools install_github("hrbrmstr/decapitated")

── [1mAttaching packages[22m ─────────────────────────────────────── tidyverse 1.2.1 ──
[32m✔[39m [34mggplot2[39m 3.1.0     [32m✔[39m [34mpurrr  [39m 0.2.5
[32m✔[39m [34mtibble [39m 1.4.2     [32m✔[39m [34mdplyr  [39m 0.7.8
[32m✔[39m [34mtidyr  [39m 0.8.2     [32m✔[39m [34mstringr[39m 1.3.1
[32m✔[39m [34mreadr  [39m 1.2.1     [32m✔[39m [34mforcats[39m 0.3.0
── [1mConflicts[22m ────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[39m [34mdplyr[39m::[32mfilter()[39m masks [34mstats[39m::filter()
[31m✖[39m [34mdplyr[39m::[32mlag()[39m    masks [34mstats[39m::lag()
“package ‘rvest’ was built under R version 3.5.2”Loading required package: xml2
“package ‘xml2’ was built under R version 3.5.2”
Attaching package: ‘rvest’

The following object is masked from ‘package:purrr’:

    pluck

The following object is masked from ‘package:readr’:

    guess_encoding


Attaching package: ‘magrittr’

The following object is masked 

In [2]:
sapply(c('tidyverse', 'rvest', 'magrittr', 'getPass', 'RSelenium'), function(x) {
    toString(packageVersion(x))
})

### Let's try logging in
<br/>

#### Goodreads

In [3]:
grLoginUrl  <- 'https://www.goodreads.com/user/sign_in'
grsession  <- html_session(grLoginUrl)
# html_form(read_html(grLoginUrl))     # figure out how many forms on page and what they contain
getform  <- html_form(grsession)[[1]]
print(getform)
# you need the name of the element, not the tag
fillform  <- getform  %>% set_values('user[email]'=getPass("Goodreads email: "), 'user[password]'=getPass('Goodreads password:'))
submitform  <- submit_form(grsession, fillform)


<form> 'sign_in' (POST https://www.goodreads.com/user/sign_in)
  <input hidden> 'utf8': ✓
  <input hidden> 'authenticity_token': 9Q3cCXAcaKlEfgFP7WORFdVD99NnKP4Z+/7KOQgR5BT4exk6XXlRQVJNp0ZSVdOrask0IyjM8tXgqTMiqJe95g==
  <input email> 'user[email]': 
  <input password> 'user[password]': 
  <input checkbox> 'remember_me': 
  <input submit> 'next': Sign in
  <input hidden> 'n': 368231
Goodreads email: ········
Goodreads password:········


Submitting with 'next'


In [4]:
print(submitform)

<session> https://www.goodreads.com/
  Status: 200
  Type:   text/html; charset=utf-8
  Size:   197934


#### MV library login

In [5]:
mvLoginUrl  <- 'https://library.mountainview.gov/iii/cas/login?'
mvlsession  <- html_session(mvLoginUrl)
#html_form(read_html(mvLoginUrl))     # figure out how many forms on page and what they contain
getform  <- html_form(mvlsession)[[1]]
fillform  <- getform  %>% set_values('code'=getPass("MV library card number: "), 'pin'=getPass("MV library pin: "))
submitform  <- submit_form(mvlsession, fillform)

MV library card number: ········
MV library pin: ········


Submitting with 'NULL'


In [6]:
print(submitform)


<session> https://library.mountainview.gov/iii/cas/login;jsessionid=7616D432A2A481A4FAC78F3BAE4C5B7B
  Status: 200
  Type:   text/html;charset=ISO-8859-1
  Size:   1635


### Get to-read books from GR

my user id and books data is public so all I need is the right userID number, making login obselete. But well, already figured it out so continuing with this approach.

In [7]:
grToRead  <-  'https://www.goodreads.com/review/list/11651752-divyajyoti-rajdev?shelf=to-read'
# parse html, get all nodes with <table> tags, parse table numbers given, fill missing cols with NA, returns list so get first element which is our data frame
booksToRead  <- grToRead  %>% read_html  %>% html_nodes("table")  %>% .[2]  %>% html_table(header=T, fill=T)  %>% .[[1]]
booksToRead  %>% nrow

whoops! goodreads has javascript infinite scrolling so I only get 30 rows of the most recent books I've added (darn!). Let's try `selenium` or `splashr` to mimic the browser JS behavior

In [8]:
# rDr[["server"]]$stop()
rDr <- rsDriver()
remDr <- rDr[["client"]]
remDr$navigate(grToRead)

checking Selenium Server versions:
BEGIN: PREDOWNLOAD
BEGIN: DOWNLOAD
BEGIN: POSTDOWNLOAD
checking chromedriver versions:
BEGIN: PREDOWNLOAD
BEGIN: DOWNLOAD
BEGIN: POSTDOWNLOAD
checking geckodriver versions:
BEGIN: PREDOWNLOAD
BEGIN: DOWNLOAD
BEGIN: POSTDOWNLOAD
checking phantomjs versions:
BEGIN: PREDOWNLOAD
BEGIN: DOWNLOAD
BEGIN: POSTDOWNLOAD


[1] "Connecting to remote server"
$acceptInsecureCerts
[1] FALSE

$browserName
[1] "chrome"

$browserVersion
[1] "77.0.3865.42"

$chrome
$chrome$chromedriverVersion
[1] "77.0.3865.40 (f484704e052e0b556f8030b65b953dce96503217-refs/branch-heads/3865@{#442})"

$chrome$userDataDir
[1] "/var/folders/s3/9b3wq3_n74n_74sl59hv_9vw0000gp/T/.com.google.Chrome.NArDct"


$`goog:chromeOptions`
$`goog:chromeOptions`$debuggerAddress
[1] "localhost:56747"


$networkConnectionEnabled
[1] FALSE

$pageLoadStrategy
[1] "normal"

$platformName
[1] "mac os x"

$proxy
named list()

$setWindowRect
[1] TRUE

$strictFileInteractability
[1] FALSE

$timeouts
$timeouts$implicit
[1] 0

$timeouts$pageLoad
[1] 300000

$timeouts$script
[1] 30000


$unhandledPromptBehavior
[1] "dismiss and notify"

$webdriver.remote.sessionid
[1] "5d572f3c308b72513c3bde6ccdd893f0"

$id
[1] "5d572f3c308b72513c3bde6ccdd893f0"



Let's get how many times I need to scroll. For this on GR I extract how many books in my want-to-read shelf. I know it loads 30 at a time.

In [9]:
# go to url, parse html, get sher shelves, from list find want-to-read shelf, eliminate non numeric chars, div by 30
tmp  <- grToRead  %>% read_html  %>% html_nodes("div.userShelf")  %>% html_text(trim=T)
nscroll = ceiling( as.integer( gsub("([^0-9.])","", tmp[grep('want to read',tmp, ignore.case = T)])) / 30)

Okay so for following to work, the active window needs to be chrome. I might as well scroll down by hand *sigh*. (maybe I'm missing something?)

TODO: figure out how to do this in headless mode.

In [10]:
for(i in 1:nscroll) {
    webElem <- remDr$findElement("css", "body")
    webElem$sendKeysToElement(list(key = "end"))
    Sys.sleep(2)
}

In [11]:
booksToRead  <- remDr$getPageSource()[[1]]  %>% read_html  %>% html_nodes("table")  %>% .[2]  %>% html_table(header=T, fill=T)  %>% .[[1]]

In [12]:
#close session , stop server
rDr$server$stop()

In [13]:
booksToRead  %>% names

Geez, duplicated names. Can't move forward until I fix that

In [14]:
newnames  <-c('unknown1','#','cover','title', 'author', 'isbn', 'isbn13', 'asin', 'pages', 'rating', 'ratings', 'pub', '(ed.)', 'rating2', 'my rating', 
              'review', 'notes','recommender', 'comments', 'votes', 'count', 'started', 'read', 'added', 'purchased', 'owned',
              'location', 'condition', 'format', 'unknown2')
print(newnames)
colnames(booksToRead)  <- newnames

 [1] "unknown1"    "#"           "cover"       "title"       "author"     
 [6] "isbn"        "isbn13"      "asin"        "pages"       "rating"     
[11] "ratings"     "pub"         "(ed.)"       "rating2"     "my rating"  
[16] "review"      "notes"       "recommender" "comments"    "votes"      
[21] "count"       "started"     "read"        "added"       "purchased"  
[26] "owned"       "location"    "condition"   "format"      "unknown2"   


In [15]:
booksToRead  %<>% select(author, title, isbn, isbn13, avgrating=rating, numratings=ratings)  

In [16]:
booksToRead  %>% head()

author,title,isbn,isbn13,avgrating,numratings
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
"author Mitnick, Kevin D.",title Ghost in the Wires: My Adventures as the World's Most Wanted Hacker,isbn 0316037702,isbn13 9780316037709,avg rating 3.97,"num ratings 18,381"
"author Montell, Amanda",title Wordslut: A Feminist Guide to Taking Back the English Language,isbn 006286887X,isbn13 9780062868879,avg rating 4.44,num ratings 363
"author Law, Averill M.",title Simulation Modeling & Analysis,isbn 0070366985,isbn13 9780070366985,avg rating 3.81,num ratings 117
"author Altshuller, Genrich","title The Innovation Algorithm: Triz, Systematic Innovation and Technical Creativity",isbn 0964074044,isbn13 9780964074040,avg rating 4.01,num ratings 136
"author Bartle, Robert G.",title Introduction to Real Analysis,isbn 0471321486,isbn13 9780471321484,avg rating 3.95,num ratings 221
"author Casella, George",title Statistical Inference,isbn 0534243126,isbn13 9780534243128,avg rating 4.09,num ratings 260


#### Let's clean
remove any non numeric chars from isbn, isbn13, numratings  
remove non numeric except "." from avgrating  
remove "author " from author  
split "author" to firstname, last name  
remove "title " from title  
<br />
removing books reviewed by less than 50 people (I dont want to read esocteric books / fall prey to early biased reviewers like friends and family of author)  
arrange by descending average rating  

In [17]:
booksToRead  %<>% mutate(
    title = trimws(gsub(pattern = '^title', '', title)),
    author = trimws(gsub(pattern= '^author', '', author)),
    isbn = trimws(gsub(pattern='^isbn', '', isbn)),
    isbn13 = as.numeric(trimws(gsub(pattern= '^isbn13', '', isbn13))),
    numratings = as.integer(gsub("([^0-9.])","", numratings)),
    avgrating = as.numeric(gsub("([^0-9.^\\..])","", avgrating))
)    %>% separate(author, c("lastname", "firstname"), sep = ", ", remove=FALSE) %>% 
mutate(firstname = trimws(gsub(pattern="\\*", "", x=firstname)),
      author_clean = trimws(paste(firstname, lastname, sep=" "))) %>% 
filter(numratings > 50)  %>% arrange(desc(avgrating))  %>% select(-author)

“The `printer` argument is deprecated as of rlang 0.3.0.
“Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [294].”

In [18]:
booksToRead  %>% head

lastname,firstname,title,isbn,isbn13,avgrating,numratings,author_clean
<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<int>,<chr>
McElreath,Richard,Statistical Rethinking: A Bayesian Course with Examples in R and Stan,1482253445,9781482000000.0,4.71,148,Richard McElreath
Hansen,Brant,"Blessed Are the Misfits: Great News for Believers who are Introverts, Spiritual Strugglers, or Just Feel Like They're Missing Something",718096312,9780718000000.0,4.62,881,Brant Hansen
Wickham,Hadley,Advanced R,1466586966,9781467000000.0,4.62,186,Hadley Wickham
James,Gareth,An Introduction to Statistical Learning: With Applications in R,1461471370,9781461000000.0,4.61,1082,Gareth James
Alexander,Michelle,The New Jim Crow: Mass Incarceration in the Age of Colorblindness,1595581030,9781596000000.0,4.51,42742,Michelle Alexander
López-Alt,J. Kenji,The Food Lab: Better Home Cooking Through Science,393081087,9780393000000.0,4.5,6938,J. Kenji López-Alt


### Cross reference available or onhold books at MV library
<br />
For everybook in my to-read list see if there's a match at the library. Sometimes books can have multiple matches (ebook vs offline). For these matches ensure it's the correct book (title match, author match) and return the set of found / not found books.<br>
There's also an interesting behavior to exploit here, if the book is not found there's a "tryAgainMessage" `div` tag that gets added. So I'm going to simply look for that and if I find it, I'll move to the next book.<br>
Multiple versions of the book in different formats or editions might come up since I'm searching by title not 13digit ISBN. I could use ISBN but then I don't get audiobooks <br>
I tried to scrape the availability for each book but the html tag isn't clean so returns blanks. <br> <br>
Lastly, MV library session times out if I don't run the script long enough, so building in a check for that

#### Check if session active

In [19]:
catalogUrl  <- 'https://www.mountainview.gov/depts/library/default.asp'
catalogForm  <- mvlsession  %>% jump_to(catalogUrl)  %>% read_html()  %>% html_form  %>% .[[3]]
tget = 'wuthering heights'
searchResult  <- mvlsession  %>% submit_form(catalogForm  %>% set_values('target'= tget))
if(searchResult$response$status != 200){
    print("there's an error!!")
}else{
    parsed  <- searchResult  %>% read_html  
    notFound  <- grepl(pattern='tryAgainMessage', x=(parsed  %>% html_node("body")  %>% as.character))
    emptySet  <- length(parsed  %>% html_nodes(".title a")) == 0
    if(!notFound & emptySet) print("Refresh session")
}

Submitting with 'NULL'


#### Find Matches

In [20]:
booksFound  <- booksToRead[0,]
booksNotFound  <- booksToRead  %>% mutate(foundTitle = NA, foundAuthor = NA, mediaType=NA)  %>% .[0,]
#print(booksNotFound)

In [21]:
#never runs, just incase I need to check for blank fields
"
tget = NA
searchResult  <- mvlsession  %>% submit_form(catalogForm  %>% set_values('target'= tget))
parsed  <- searchResult  %>% read_html   %>% html_node('body')  %>% as.character 
grepl(pattern='You must enter a value for Search', x=parsed, ignore.case = T)
"

In [25]:
#for(ind in 1:5) { #test only
for(ind in 1:nrow(booksToRead)) {
    currBook  <- booksToRead[ind,]
    
    #some titles have captions separated by ":", ignoring those
    cleantitle  <- unlist(strsplit(x=currBook$title, split=":"))[1]
    
    #error treated as book not found
    suppressMessages(searchResult  <- mvlsession  %>% submit_form(catalogForm  %>% set_values('target'= cleantitle)))
    parsed  <- searchResult  %>% read_html  
    notFound  <- grepl(pattern='tryAgainMessage', x=(parsed  %>% html_node("body")  %>% as.character))
    if(searchResult$response$status != 200 | notFound){
        booksNotFound  %<>% bind_rows(currBook)
    }else{
        parsed_titles  <- parsed  %>% html_nodes(".title a") %>%  html_text %>% as.character  %>% trimws() 
        parsed_authors  <- parsed  %>% html_nodes(".customSecondaryText")  %>%  html_text %>% as.character  %>% trimws() 
        parsed_authors  <- gsub(pattern="\\/ ", "",x= parsed_authors)
        parsed_types  <- parsed  %>% html_nodes(".itemMediaDescription")  %>%  html_text %>% as.character  %>% trimws() 
        #parsed_availability  <- searchResult  %>% read_html  %>% html_nodes(".availabilityMessage span:nth-child(1)")  %>%  html_text %>% as.character  %>% trimws() 
        
        #so my matching criteria can prolly be better but I'll just check for title match and author last name
        authorMatchInd  <- grep(pattern=currBook$lastname[1],x=parsed_authors, ignore.case = T)
        titleMatchInd  <- grep(pattern=cleantitle,x=parsed_titles, ignore.case = T)
        matched  <- which(authorMatchInd %in% titleMatchInd)
        
        #bunch of results but none what we want
        if(length(matched)==0){
            booksNotFound  %<>% bind_rows( currBook)
        }else{
            foundMatches  <- data.frame(foundTitle = parsed_titles[matched], 
                                       foundAuthor = parsed_authors[matched],
                                       mediaType = parsed_types[matched],
                                        stringsAsFactors=F
                                       )
            tmpdf  <- currBook[0,]
            
            #not proud of the following, I'm sure purrr has an easier way for this
            for(ind2 in 1:length(matched)){
                tmpdf %<>% bind_rows(currBook  %>%  
                                     mutate(foundTitle=foundMatches[ind2,'foundTitle'],
                                            foundAuthor=foundMatches[ind2,'foundAuthor'],
                                            mediaType=foundMatches[ind2,'mediaType']
                                           )
                                    )
            }
            booksFound  %<>% bind_rows(tmpdf)
        }
    }
}

#I want to clean results a little
booksNotFound  %<>% select(c(title, isbn13, numratings, avgrating, author_clean))  %>% unique
booksFound  %<>% select(-c(firstname, lastname, isbn))  %>% unique

I'm interested in very specific mediaTypes (Audiobooks, Books). So filtering in my view, but the csv file still has this info.

In [26]:
booksFound  %>% pull(mediaType)  %>% unique
#booksFound  %>% filter(mediaType %in% c("Music CD","Adult Foreign Lang Book","Children\'s Board Book")) #okay I'm curious

In [28]:
booksFound  %>% filter(!(mediaType %in% c('DVD', 'Downloadable Music', 'Music CD', 'Downloadable Video', 'Adult Large Type', 'Ebook: Downloadable')) )  %>% 
head(10)

title,isbn13,avgrating,numratings,author_clean,foundTitle,foundAuthor,mediaType
<chr>,<dbl>,<dbl>,<int>,<chr>,<chr>,<chr>,<chr>
"Blessed Are the Misfits: Great News for Believers who are Introverts, Spiritual Strugglers, or Just Feel Like They're Missing Something",9780718000000.0,4.62,881,Brant Hansen,Blessed are the misfits,Brant Hansen,Downloadable Audiobook
The New Jim Crow: Mass Incarceration in the Age of Colorblindness,9781596000000.0,4.51,42742,Michelle Alexander,The new Jim Crow : mass incarceration in the age of colorblindness,Michelle Alexander ; [with a new foreword by Cornel West],Adult Non-Fiction Book
The Food Lab: Better Home Cooking Through Science,9780393000000.0,4.5,6938,J. Kenji López-Alt,The food lab : better home cooking through science,J. Kenji López-Alt ; photographs by the author,Adult Non-Fiction Book
The Moth Presents All These Wonders: True Stories about Facing the Unknown,9781102000000.0,4.48,3751,Catherine Burns,The Moth presents All these wonders : true stories about facing the unknown,edited by Catherine Burns,Adult Non-Fiction Book
The Meaning of Marriage: Facing the Complexities of Commitment with the Wisdom of God,9780526000000.0,4.47,18456,Timothy J. Keller,The meaning of marriage : facing the complexities of commitment with the wisdom of God,Timothy Keller with Kathy Keller,Adult Non-Fiction Book
Unoffendable: How Just One Change Can Make All of Life Better,9780529000000.0,4.47,2237,Brant Hansen,Unoffendable : how just one change can make all of life better,Brant Hansen,Downloadable Audiobook
Bad Blood: Secrets and Lies in a Silicon Valley Startup,9781525000000.0,4.46,95603,John Carreyrou,Bad blood,John Sandford,Adult Fiction Book
Bad Blood: Secrets and Lies in a Silicon Valley Startup,9781525000000.0,4.46,95603,John Carreyrou,Bad blood : secrets and lies in a Silicon Valley startup,John Carreyrou,Adult Non-Fiction Book
Bad Blood: Secrets and Lies in a Silicon Valley Startup,9781525000000.0,4.46,95603,John Carreyrou,Bad blood : secrets and lies in a Silicon Valley startup,John Carreyrou,Audiobook
Bad Blood: Secrets and Lies in a Silicon Valley Startup,9781525000000.0,4.46,95603,John Carreyrou,Bad blood : a Lucy Black thriller,Brian McGilloway,Adult Fiction Book


In [29]:
booksNotFound  %>% head(10)

title,isbn13,numratings,avgrating,author_clean
<chr>,<dbl>,<int>,<dbl>,<chr>
Statistical Rethinking: A Bayesian Course with Examples in R and Stan,9781482000000.0,148,4.71,Richard McElreath
Advanced R,9781467000000.0,186,4.62,Hadley Wickham
An Introduction to Statistical Learning: With Applications in R,9781461000000.0,1082,4.61,Gareth James
"Mrityunjaya, The Death Conqueror: The Story Of Karna",9788172000000.0,7641,4.5,Shivaji Sawant
The Hero's Journey: Joseph Campbell on His Life & Work,9781577000000.0,1365,4.42,Joseph Campbell
"Storyworthy: Engage, Teach, Persuade, and Change Your Life through the Power of Storytelling",9781609000000.0,244,4.4,Matthew Dicks
Harry Potter and the Methods of Rationality,,11838,4.39,Eliezer Yudkowsky
Machine Learning: A Probabilistic Perspective,9780262000000.0,375,4.38,Kevin P. Murphy
Information Graphics,9783837000000.0,276,4.38,Sandra Rendgen
"The Elements of Statistical Learning: Data Mining, Inference, and Prediction",9780388000000.0,1246,4.37,Trevor Hastie


### Writing results
<br />
I want to make sure I didn't miss a book, shouldn't happen but doesn't hurt to check.<br>
Also let's write results to a nice csv file.

In [30]:
#sanity check
numFound <- booksFound  %>% pull(title)  %>% unique  %>% length
numNotFound <- booksNotFound  %>% pull(title)  %>% unique  %>% length
sprintf("Found %d but didn't find %d books.", numFound, numNotFound)
if(numFound + numNotFound != nrow(booksToRead)){
    print("Whoops missed some books, don't know which ones sorry")
}

In [31]:
write.csv(booksFound,file.path(getwd(), 'booksFound.csv'), row.names=F)
write.csv(booksNotFound,file.path(getwd(), 'booksNotFound.csv'), row.names=F)

### How can I use or improve this?
<br />
<b> Improvements </b>
<br />
- the querying and result parsing could be made faster<br>
- not modularized yet, and soem redundant code, could use cleaning<br>
- extend to other libraries (Palo Alto, Cupertino pretty please?) <br>
- automate putting books on hold <br>

<br />
<b> Using this </b>
<br />
Just edit the URLs, use your own login, change the tags in `html_node` to reflect your library's settings and you're good to go.