
Memory Issues #27

Open
kferris10 opened this issue May 20, 2015 · 3 comments

@kferris10 commented May 20, 2015

I am running into some errors trying to scrape large amounts of PITCHf/x data on my Windows 7 computer. Here are some screenshots to illustrate:

  • When I start a new R session, I'm only using about 20 MB of memory.

[screenshot: 1-pre-scrape]

  • I run this code to scrape several months of PITCHf/x data

    library(pitchRx)
    library(dplyr)
    library(DBI)
    
    # setwd("~/pitchfx")
    db <- src_sqlite("pitchfx14.sqlite3", create = T)
    
    update_db(db$con, "2014-12-01")
    
    scrape(start = "2014-01-01", 
      end = "2014-04-01", 
      suffix = c("inning/inning_all.xml", 
                     "inning/inning_hit.xml", 
                     "miniscoreboard.xml", 
                     "players.xml"), 
      connect = db$con)
    
  • When this is finished running, the R session is now using almost 1 GB of memory.

[screenshot: 2-post-scrape]

  • Running gc() appears to have no effect

[screenshot: 3-post-gc]

  • The only solution I have found is to restart R completely

P.S. Sorry if those numbers are impossible to see. Let me know if it would help to improve the quality of any of the screenshots.
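
For reference, a couple of console calls that report the session's memory footprint directly, in case the screenshots are hard to read (memory.size() is Windows-only):

    # quick ways to check the R session's memory use from the console
    gc()                     # run garbage collection and report memory in use
    memory.size()            # MB currently used by R (Windows only)
    memory.size(max = TRUE)  # peak MB obtained from the OS this session (Windows only)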

@colemanconley commented May 24, 2016

I'm having what appears to be the exact same issue and have read through both #27 above and the referenced issue #22. I tried gc() as suggested in #22, but it doesn't work on my machine, just as it doesn't work above. What is the solution? I can restart R, but I usually have to restart my machine for everything to run in a reasonable amount of time.

Related: I was trying to scrape data for all games from 03/01/2010 through the present by grabbing only one month at a time. R crashed midway through the games on 5/16/2012, so I restarted, loaded my packages, defined my connection, and then ran:

update_db(mysqlconnection, end="2012-05-20")

This starts getting the games from 5/17/2012 through 5/20, which obviously misses the remaining 5/16 games I didn't get to. How can I get the rest of the 5/16 games now without duplicating what I already have for that day?
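
One possible way to back-fill just the missing 5/16 games is sketched below. This is only a sketch: it assumes the tables you want to patch carry a gameday_link column identifying each game (tables without one are skipped), and the file names are placeholders. The idea is to re-scrape the day into a scratch SQLite file, then append only the rows for games the main database doesn't already have.

    library(pitchRx)
    library(DBI)
    library(RSQLite)

    main <- dbConnect(SQLite(), "pitchfx.sqlite3")          # existing database (placeholder path)
    tmp  <- dbConnect(SQLite(), "pitchfx_scratch.sqlite3")  # throwaway database

    # re-scrape only the incomplete day into the scratch database
    scrape(start = "2012-05-16", end = "2012-05-16",
           suffix = "inning/inning_all.xml", connect = tmp)

    # append rows for games that aren't already in the main database
    for (tbl in dbListTables(tmp)) {
      rows <- dbReadTable(tmp, tbl)
      if (!"gameday_link" %in% names(rows)) next            # skip tables without a game id
      if (tbl %in% dbListTables(main)) {
        have <- dbGetQuery(main, paste("SELECT DISTINCT gameday_link FROM", tbl))$gameday_link
        rows <- rows[!rows$gameday_link %in% have, ]
      }
      if (nrow(rows) > 0) dbWriteTable(main, tbl, rows, append = TRUE)
    }

    dbDisconnect(tmp); dbDisconnect(main)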

@kferris10 (author)

@colemanconley I've never had an issue with duplicating games when using update_db. My strategy is to first scrape one year of data. Then I can just run update_db one year at a time. Is that not working for you?
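
A minimal sketch of that strategy, with placeholder file names and dates: one initial scrape of a full year, then update_db called with a later end date to extend the database, which per the comments above picks up where the existing data leaves off.

    library(pitchRx)
    library(dplyr)
    library(DBI)

    db <- src_sqlite("pitchfx.sqlite3", create = TRUE)

    # one initial year of data...
    scrape(start = "2010-01-01", end = "2011-01-01",
           suffix = "inning/inning_all.xml", connect = db$con)

    # ...then extend the database one year at a time
    for (end in c("2012-01-01", "2013-01-01", "2014-01-01")) {
      update_db(db$con, end)
    }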

@myellen commented Jan 19, 2017

I have these memory issues on Windows but not on Mac. The only way I've found to free up the memory is to restart the R session. What I do is make a new SNOW cluster with one node to run scrape() each time, which is effectively the same as starting a new R session each time.

Some code I use:

library(snow)  # provides makeCluster(type = "SOCK"), clusterEvalQ, clusterExport, ...

ll <- seq(as.Date(start_date), as.Date(end_date), "1 year")
ntasks <- length(ll) - 1
for (i in 1:ntasks) {
  print(ll[i])
  print(ll[i + 1])
  # fresh one-node cluster per chunk, so each scrape runs in a clean R session
  cl <- makeCluster(1, type = "SOCK", outfile = "")
  clusterEvalQ(cl, library(pitchRx))
  clusterEvalQ(cl, library(DBI))
  clusterEvalQ(cl, library(RSQLite))
  clusterEvalQ(cl, library(dplyr))

  clusterExport(cl, list = c("ll", "files", "dbpath"), envir = environment())
  clusterCall(cl, function(i) {
    db <- src_sqlite(dbpath, create = TRUE)
    scrape(start = ll[i], end = ll[i + 1], suffix = files, connect = db$con)
    dbDisconnect(db$con)
  }, i)
  # shutting the worker down releases whatever memory the scrape accumulated
  stopCluster(cl)
}
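
For completeness, the objects the loop assumes are already in the workspace might look something like this (the file name and dates are placeholders; the suffix values are the same ones used earlier in the thread):

dbpath     <- "pitchfx.sqlite3"
start_date <- "2008-01-01"
end_date   <- "2017-01-01"
files      <- c("inning/inning_all.xml", "inning/inning_hit.xml",
                "miniscoreboard.xml", "players.xml")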
