
Memory Issues #27

Open
kferris10 opened this issue May 20, 2015 · 3 comments

@kferris10 commented May 20, 2015

I am running into some errors trying to scrape large amounts of PITCHf/x data on my Windows 7 computer. Here are some screenshots to illustrate:

  • When I start a new R session, I'm only using about 20 MB of memory.

[screenshot: 1-pre-scrape]

  • I run this code to scrape several months of PITCHf/x data

    library(pitchRx)
    library(dplyr)
    library(DBI)
    
    # setwd("~/pitchfx")
    db <- src_sqlite("pitchfx14.sqlite3", create = T)
    
    update_db(db$con, "2014-12-01")
    
    scrape(start = "2014-01-01", 
      end = "2014-04-01", 
      suffix = c("inning/inning_all.xml", 
                     "inning/inning_hit.xml", 
                     "miniscoreboard.xml", 
                     "players.xml"), 
      connect = db$con)
    
  • When this is finished running, the R session is now using almost 1 GB of memory.

[screenshot: 2-post-scrape]

  • Running gc() appears to have no effect

[screenshot: 3-post-gc]

  • The only solution I have found is to restart R completely

P.S. Sorry if those numbers are impossible to see. Let me know if it would help to improve the quality of any of the screenshots.
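
For reference, a couple of console calls that report the session's memory footprint directly, in case the screenshots are hard to read (memory.size() is Windows-only):

    # quick ways to check the R session's memory use from the console
    gc()                     # run garbage collection and report memory in use
    memory.size()            # MB currently used by R (Windows only)
    memory.size(max = TRUE)  # peak MB obtained from the OS this session (Windows only)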

@colemanconley commented May 24, 2016

I'm having what appears to be the exact same issue and have read through both #27 above and the referenced issue #22. I tried gc() as suggested in #22, but it doesn't work on my machine, just as it doesn't work above. What is the solution? I can restart R, but I usually have to restart my machine for everything to run in a reasonable amount of time.

Related: I was trying to scrape data for all games from 03/01/2010 through the present by grabbing only one month at a time. R crashed midway through the games on 5/16/2012, so I restarted, loaded my packages, defined my connection, and then ran:

update_db(mysqlconnection, end="2012-05-20")

This starts getting the games from 5/17/2012 through 5/20, which obviously misses the remaining 5/16 games I didn't get to. How can I get the rest of the 5/16 games now without duplicating what I already have for that day?
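
One possible way to back-fill just the missing 5/16 games is sketched below. This is only a sketch: it assumes the tables you want to patch carry a gameday_link column identifying each game (tables without one are skipped), and the file names are placeholders. The idea is to re-scrape the day into a scratch SQLite file, then append only the rows for games the main database doesn't already have.

    library(pitchRx)
    library(DBI)
    library(RSQLite)

    main <- dbConnect(SQLite(), "pitchfx.sqlite3")          # existing database (placeholder path)
    tmp  <- dbConnect(SQLite(), "pitchfx_scratch.sqlite3")  # throwaway database

    # re-scrape only the incomplete day into the scratch database
    scrape(start = "2012-05-16", end = "2012-05-16",
           suffix = "inning/inning_all.xml", connect = tmp)

    # append rows for games that aren't already in the main database
    for (tbl in dbListTables(tmp)) {
      rows <- dbReadTable(tmp, tbl)
      if (!"gameday_link" %in% names(rows)) next            # skip tables without a game id
      if (tbl %in% dbListTables(main)) {
        have <- dbGetQuery(main, paste("SELECT DISTINCT gameday_link FROM", tbl))$gameday_link
        rows <- rows[!rows$gameday_link %in% have, ]
      }
      if (nrow(rows) > 0) dbWriteTable(main, tbl, rows, append = TRUE)
    }

    dbDisconnect(tmp); dbDisconnect(main)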

@kferris10 (author)

@colemanconley I've never had an issue with duplicating games when using update_db. My strategy is to first scrape one year of data. Then I can just run update_db one year at a time. Is that not working for you?
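
A minimal sketch of that strategy, with placeholder file names and dates: one initial scrape of a full year, then update_db called with a later end date to extend the database, which per the comments above picks up where the existing data leaves off.

    library(pitchRx)
    library(dplyr)
    library(DBI)

    db <- src_sqlite("pitchfx.sqlite3", create = TRUE)

    # one initial year of data...
    scrape(start = "2010-01-01", end = "2011-01-01",
           suffix = "inning/inning_all.xml", connect = db$con)

    # ...then extend the database one year at a time
    for (end in c("2012-01-01", "2013-01-01", "2014-01-01")) {
      update_db(db$con, end)
    }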

@myellen commented Jan 19, 2017

I have these memory issues on Windows but not on Mac. The only way I've found to free up the memory is to restart the R session. What I do is make a new SNOW cluster with one node to run scrape() each time, which is effectively the same as starting a new R session each time.

Some code I use:

library(snow)  # provides makeCluster(type = "SOCK"), clusterEvalQ, clusterExport, ...

ll <- seq(as.Date(start_date), as.Date(end_date), "1 year")
ntasks <- length(ll) - 1
for (i in 1:ntasks) {
  print(ll[i])
  print(ll[i + 1])
  # fresh one-node cluster per chunk, so each scrape runs in a clean R session
  cl <- makeCluster(1, type = "SOCK", outfile = "")
  clusterEvalQ(cl, library(pitchRx))
  clusterEvalQ(cl, library(DBI))
  clusterEvalQ(cl, library(RSQLite))
  clusterEvalQ(cl, library(dplyr))

  clusterExport(cl, list = c("ll", "files", "dbpath"), envir = environment())
  clusterCall(cl, function(i) {
    db <- src_sqlite(dbpath, create = TRUE)
    scrape(start = ll[i], end = ll[i + 1], suffix = files, connect = db$con)
    dbDisconnect(db$con)
  }, i)
  # shutting the worker down releases whatever memory the scrape accumulated
  stopCluster(cl)
}
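
For completeness, the objects the loop assumes are already in the workspace might look something like this (the file name and dates are placeholders; the suffix values are the same ones used earlier in the thread):

dbpath     <- "pitchfx.sqlite3"
start_date <- "2008-01-01"
end_date   <- "2017-01-01"
files      <- c("inning/inning_all.xml", "inning/inning_hit.xml",
                "miniscoreboard.xml", "players.xml")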
