newftp.epa.gov #279

Open
Serubin opened this Issue Jan 26, 2017 · 58 comments


Serubin commented Jan 26, 2017

  • Agency: EPA
  • Data Format: Unknown
  • FTP URL: newftp.epa.edu

Current ftp contents:
899M ./AIR_QUALITY_DATA
0 ./CAM_HRA
2.3G ./CERCLA108B
406G ./COMPTOX
406G ./Computational_Toxicology_Data (Looks like a duplicate of the above)
2.2G ./EJSCREEN
33G ./EPADataCommons
44G ./GKM_DOCUMENTS
1.0T ./RSEI
7.5G ./RTPGIS
62M ./STANDARD_MINE
1.0K ./TESTAREA
1.9T .

Currently pulled down on my machine:

899M ./AIR_QUALITY_DATA
31M ./GKM_DOCUMENTS
2.2G ./EJSCREEN
14G ./EPADataCommons
2.3G ./CERCLA108B
4.0K ./CAM_HRA
32G ./COMPTOX
52G .

I intend to make my mirror public, but that may have to wait until the weekend.

mxplusb added this to the January milestone Jan 26, 2017


Plazmaz commented Jan 26, 2017

Looks like http://newftp.epa.edu/ is down


mheistermann commented Jan 26, 2017

@Plazmaz it's ftp://newftp.epa.gov/


Serubin commented Jan 26, 2017

Updated current download status.
If anyone wants to start downloading other parts of this, feel free - it's rate limited at 500kb/s so this is a pretty slow process.


JeremiahCurtis commented Jan 26, 2017

tried wget but it stopped because of login issues


JeremiahCurtis commented Jan 26, 2017

--11:42:18-- ftp://newftp.epa.gov/EPADataCommons/
(try:20) => `C:/Users/user/Music/newftp.epa.gov/EPADataCommons/.listing'
Connecting to newftp.epa.gov|134.67.100.58|:21... connected.
Logging in as anonymous ...
The server refuses login.
Giving up.

unlink: No such file or directory

FINISHED --11:42:18--
Downloaded: 0 bytes in 0 files


Serubin commented Jan 26, 2017

@JeremiahCurtis Give it another try. That happens every so often.

These still need to be downloaded. The RSEI directory looks daunting - might split that up a bit.
1.0T ./RSEI
7.5G ./RTPGIS
62M ./STANDARD_MINE
1.0K ./TESTAREA


JeremiahCurtis commented Jan 26, 2017

ftp://newftp.epa.gov/RSEI/Version233_RY2012/Aggregated_Grid_Cell_Data/

working on the above csv files; since wget is having problems, i am doing direct downloads


JeremiahCurtis commented Jan 26, 2017

this may take awhile


JeremiahCurtis commented Jan 26, 2017

direct download not working either...not sure what's up


Serubin commented Jan 26, 2017

It appears the server is gone. ftp://ftp.epa.gov is still up


Serubin commented Jan 26, 2017

Final data count:
15G ./newftp.epa.gov
899M ./AIR_QUALITY_DATA
2.1G ./GKM_DOCUMENTS
2.2G ./EJSCREEN
2.3G ./CERCLA108B
4.0K ./CAM_HRA
36G ./COMPTOX
57G .


JeremiahCurtis commented Jan 26, 2017

now the direct download is working again...


JeremiahCurtis commented Jan 26, 2017

there are 3 massive csv files at ftp://newftp.epa.gov/RSEI/Version233_RY2012/Disaggregated_Microdata/

each is about 110 GB


Serubin commented Jan 26, 2017

@JeremiahCurtis Pull down whatever you can - I'm unable to access


JeremiahCurtis commented Jan 26, 2017

working on it


JeremiahCurtis commented Jan 26, 2017

direct download is kind of ineffective for a 110 GB file, though. If my browser crashes, I have to start all over....any ideas?


JeremiahCurtis commented Jan 26, 2017

I'm also running downthemall on thousands of files from a lot of the directories at http://cdiac.ornl.gov/ftp/
This doesn't help direct download speeds, but if someone can confirm that the above ftp has been completely mirrored, I will end the dta session and that should speed up direct download.....thanks


Serubin commented Jan 26, 2017

Using wget might be a good idea.

The download rates are limited to about 500kb/s


JeremiahCurtis commented Jan 26, 2017

ftp://newftp.epa.gov/RSEI/Version233_RY2012/Disaggregated_Microdata/
is anyone else able to access?


lgreenlee commented Jan 26, 2017


Serubin commented Jan 26, 2017

I think I've hit my connection limits - I've got to bow out. I've got some amount of data that I can pass off to anyone - or I am happy to grab data from someone who downloaded to try and host the data somewhere.


adinbied commented Jan 26, 2017

While it's not DOWN for me, it's requiring a username and password to connect.


lrehmann commented Jan 26, 2017

The server is responding with

421 Maximum login limit has been reached

Various clients give different messages when the server cannot be reached with the default anonymous credentials. Chrome asks for a username and password when in fact the anonymous credentials are still valid, the server is just overwhelmed.
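
As a rough illustration of the point above, a script can treat the 421 reply as "try again later" rather than as a credentials problem. This is only a sketch using Python's standard `ftplib`; the function name `login_with_retry`, the attempt count and the delay are assumptions, not anything used in this thread.

```python
import ftplib
import time


def login_with_retry(host, attempts=5, delay=30.0,
                     connect=ftplib.FTP, sleep=time.sleep):
    """Anonymous FTP login that waits and retries on temporary (4xx) replies.

    `connect` and `sleep` are injectable only to make the sketch testable.
    """
    for attempt in range(1, attempts + 1):
        ftp = None
        try:
            # Connecting can already fail if the server greets with a 421.
            ftp = connect(host, timeout=60)
            ftp.login()  # no arguments: anonymous login
            return ftp
        except ftplib.error_temp:
            # e.g. "421 Maximum login limit has been reached": the server is
            # busy, the anonymous credentials themselves are still fine.
            if ftp is not None:
                ftp.close()
            if attempt == attempts:
                raise
            sleep(delay)
```

Something like `login_with_retry("newftp.epa.gov")` would then keep knocking politely instead of giving up on the first refusal.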


adinbied commented Jan 26, 2017

OK, didn't know that. Thanks!


ecoquant commented Jan 26, 2017


JeremiahCurtis commented Jan 26, 2017

what is the cdiac ftp mirror address? i followed the link on the main cdiac issue page here, and could not actually find any data.....maybe i'm missing something....thanks


Serubin commented Jan 26, 2017

Given that this data source could be taken down at any time (and that the source is crazy slow), I think priority one should be downloading it - even if it's spread across multiple people. We can consolidate and duplicate later.


randomvariable commented Jan 26, 2017

Started a sync of ftp://newftp.epa.gov/RSEI/Version233_RY2012/Disaggregated_Microdata/ at about 500KB/s


gofrogs2013 commented Jan 30, 2017

I'm curious if it would be worthwhile to try to make a FOIA request for this information as I'm having the same issue with slow downloads and we could get it on a hard drive or similar, albeit with a fee. The entire dataset could be sent on a 2 TB external HD.


Collaborator

bkirkbri commented Jan 31, 2017

@gofrogs2013 Good idea


Collaborator

bkirkbri commented Jan 31, 2017

Can someone volunteer to coordinate this issue? It's great that so many people are dividing it up to get it done! If one of you could track who has what that would be really helpful. Thanks!


Serubin commented Jan 31, 2017

I've suffered an untimely hard drive failure, I gotta back out. Sorry.


gofrogs2013 commented Jan 31, 2017

I went ahead and made a FOIA request for all data in the newftp folder. You can check the progress here: https://foiaonline.regulations.gov/foia/action/public/view/request?objectId=090004d281137e25


gofrogs2013 commented Feb 1, 2017

@bkirkbri Per the previous comment, I've made the FOIA request and added a link. I won't be able to coordinate it beyond that if we still want to try downloading the rest of it (which is probably the case) as I'll be working on NASA ERS files #289 for a while, but I'll post here if they approve the request.


JeremiahCurtis commented Feb 3, 2017

@randomvariable How is the microdata folder moving? I am attempting a grab of the following RSEI subfolders: temp and shapefiles


donbright commented Feb 4, 2017

fyi for anyone trying to look at @empirical-bayesian issue links, they actually refer to https://bitbucket.org/azimuth-backup/azimuth-inventory/issues/89 not the automatically generated github issues (like this #89)


StephWo commented Feb 4, 2017

I'm trying to get those Microdata files. I started with the last one in alphabetical order (Micro2012_2012...) and will go backwards from that. ETA for the first file is in 9 days...


gofrogs2013 commented Feb 7, 2017

@BauerPiepenbrink Is your download still going, and if so do you have the same ETA? Hopefully it will be possible to download these large files, but if not I will try getting them from the agency via FOIA as I mention above.


donbright commented Feb 7, 2017

I just checked and ./AIR_QUALITY_DATA only has 58M of data in a single .zip file, which is far less than what @Serubin reported above.

does anyone have a public mirror up for cross-checking data?


StephWo commented Feb 7, 2017

@gofrogs2013 steady as a rock. ETA 6d 23h with an average of 130 K/s. It's not fast but reliable so far.

A friend and I used to try to calculate which has better bandwidth from Europe to China: a gigabit internet uplink or a sea container full of hard drives.
Getting a physical backup seems the way to go if possible. Anyway, I keep on nibbling. 27% already done :)
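
For anyone wanting to sanity-check an ETA like the one above, the arithmetic is just remaining bytes divided by rate. A small sketch follows; the helper name `eta_days` is made up, the file size is the byte count later posted in this thread for Micro2012_2012.csv, and reading "130 K/s" as 130 KiB/s is an assumption.

```python
def eta_days(total_bytes, fraction_done, rate_bytes_per_s):
    """Days left to download, given size, progress (0..1) and average rate."""
    remaining = total_bytes * (1.0 - fraction_done)
    return remaining / rate_bytes_per_s / 86400  # 86400 seconds per day


# Assumed inputs: 110831639138 bytes, 27% done, 130 KiB/s average.
days = eta_days(110831639138, 0.27, 130 * 1024)
print(f"roughly {days:.1f} days left")
```

Under those assumptions the result lands at about a week, which is in line with the client's own estimate.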


StephWo commented Feb 7, 2017

I shouldn't have jinxed it. Got interrupted by the server half an hour ago. Continuing now.
Make sure to use a download client with the ability to resume after disconnect.
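
If no such client is at hand, resuming can also be scripted. Below is a minimal sketch with Python's standard `ftplib`, assuming the server honors the FTP `REST` command; the function name `resume_download` and the block size are illustrative choices, not something anyone here reported using.

```python
import ftplib
import os


def resume_download(ftp, remote_path, local_path, blocksize=64 * 1024):
    """Fetch remote_path, continuing after the bytes already in local_path.

    `ftp` is a logged-in ftplib.FTP object. Returns the offset resumed from.
    """
    offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    with open(local_path, "ab") as out:  # append, never truncate
        # rest=offset makes ftplib send "REST <offset>" before RETR, so the
        # server skips the part that is already on disk.
        ftp.retrbinary(f"RETR {remote_path}", out.write,
                       blocksize=blocksize, rest=offset or None)
    return offset
```

Calling it again after each disconnect picks up where the partial file ends, so a dropped connection costs only the reconnect, not the whole file.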


JeremiahCurtis commented Feb 13, 2017

@Serubin hope your hard drive failure doesn't mean your download is irretrievable :)


StephWo commented Feb 15, 2017

So, first file from the Disaggregated_Microdata folder is finally downloaded. It's Micro2012_2012.csv

http://176.9.83.61/InProgress_279/Disaggregated_Microdata/
This link will change later on.

Hashdeep Checksum for that single file:

110831639138,1d94bea31fe0bd03d732e01b7e7d6ab8,9087314828d9736e275d395f749b354676f7f4164a003319c3501257053b8366,Micro2012_2012.csv
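
Anyone holding their own copy could compare against the line above without installing hashdeep. The sketch below assumes the fields are size in bytes, MD5, SHA-256 and file name, in that order (which matches the lengths of the values shown); the function name `hashdeep_style_line` is made up. It reads the file in chunks, so a 110 GB file never has to fit in memory.

```python
import hashlib
import os


def hashdeep_style_line(path, chunk_size=1024 * 1024):
    """Return 'size,md5,sha256,filename' for one file, read in chunks."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        # iter(callable, sentinel) keeps reading until f.read returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
            size += len(chunk)
    name = os.path.basename(path)
    return f"{size},{md5.hexdigest()},{sha256.hexdigest()},{name}"
```

If the printed line for a local Micro2012_2012.csv matches the one posted here character for character, the two copies are identical.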


gofrogs2013 commented Feb 15, 2017

Disregard the referenced issue above.
@BauerPiepenbrink Congrats on the download! Were you able to actually open the file considering its size?


StephWo commented Feb 16, 2017

@gofrogs2013 well, I won't try to open it in whole :)
If I run
tail -n 40 Micro2012_2012.csv
I get the last 40 lines of that file which look like this:
14,1275,2277,5231336,318,1204704,6,3.29918E-08,5.93853E-06,2.18259E-07,0.00000E+00,2.18259E-07,1.55910E+02
14,1275,2278,5231336,318,1204704,6,3.01574E-08,5.42833E-06,4.37626E-09,0.00000E+00,4.37626E-09,3.69841E+00
14,1275,2279,5231336,318,1204704,6,2.75008E-08,4.95015E-06,4.11220E-08,0.00000E+00,4.11220E-08,3.59820E+01

So, as the file extension promised, comma-separated values. If someone really wants to dig into that, there seems to be software for filtering the csv files called Microdata_Extractor.

I will try to download that too if I stumble upon it.
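
Until that tool turns up, a file this size can still be filtered by streaming it row by row. This is only a sketch with Python's standard `csv` module; the function name `filter_rows` is invented, and since the meaning of the columns isn't documented in this thread, the column index is a plain position, not a named field.

```python
import csv


def filter_rows(path, column_index, wanted):
    """Yield rows whose value at column_index equals `wanted` (as text).

    The file is read one row at a time, so its total size does not matter.
    """
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) > column_index and row[column_index] == wanted:
                yield row
```

For example, `filter_rows("Micro2012_2012.csv", 2, "2278")` would yield only the rows whose third field is 2278, such as the second of the three lines shown above.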


JeremiahCurtis commented Feb 16, 2017

just wondering what we're still missing on the RSEI folder;

I have finished:

Version233_RY2012/Public_Release_Data/CSV version/
Version233_RY2012/Aggregated_Grid_Cell_Data/
Census_XWalks/
Shapefiles/


Serubin commented Feb 20, 2017

@JeremiahCurtis Still working on retrieval.

Picked up another 4TB drive so I should be able to get back to data pulling soon.


gofrogs2013 commented Feb 20, 2017

@BauerPiepenbrink Are you trying to download the 2010/11 csv files as well? Maybe you and @Serubin could each do one.


StephWo commented Feb 20, 2017

@gofrogs2013 As I said, I'm going in reverse order, so I'm at Micro2012_2011.csv right now (56GB downloaded, 4 days to go).
I'm happy if I don't have to sit through the 2010-file :)


gofrogs2013 commented Feb 22, 2017

For the record, I withdrew the FOIA request as the downloads are working. I won't be able to do 2010 myself but perhaps another user here could grab it.


JeremiahCurtis commented Feb 22, 2017

there are also three huge files under ftp://newftp.epa.gov/RSEI/Version234_RY2014/Disaggregated_Microdata/
and three more under ftp://newftp.epa.gov/RSEI/Version235_RY2015/Disaggregated_Microdata/

I'm running IDM on the three files at ftp://newftp.epa.gov/RSEI/Version235_RY2015/Disaggregated_Microdata/ , with a transfer rate about 700-800 KB/s when running all three simultaneously.......ETA: 4-5 days for the set

Update: I had my ISP increase my speed to 10 Mbps, and so I'm running these at a combined 1.2-1.3 MB/s right now

ETA for the trio: less than 3 days


JeremiahCurtis commented Mar 2, 2017

Finished the three huge files at ftp://newftp.epa.gov/RSEI/Version235_RY2015/Disaggregated_Microdata/........will attempt ftp://newftp.epa.gov/RSEI/Version234_RY2014/Disaggregated_Microdata/ after I finish a few more NCEI folders

I'm up to 20 Mbps service, looking into gbps but not sure if I can afford it

local mirror of ftp://newftp.epa.gov/RSEI/Version235_RY2015/Disaggregated_Microdata
total size: 323 GB (346,856,829,422 bytes)
SHA256 hash: a07b062da9a8909ccb23b1738f3c8ba47d84aaef9c2413793734687b5565d31b (using http://www.xorbin.com/tools/sha256-hash-calculator)


StephWo commented Mar 6, 2017

I finished
ftp://newftp.epa.gov/RSEI/Version233_RY2012/Disaggregated_Microdata/

at http://176.9.83.61/279/


gofrogs2013 commented Mar 8, 2017

Awesome work @BauerPiepenbrink !


StephWo commented Apr 2, 2018

Be advised:
Because of changes in my hardware demands I won't be able to host this or the other datasets any longer after April 2018. Please create a copy if necessary before the end of April.
The full list of dataset issue numbers that are mirrored on my server and will not be hosted after April:

162
175
176
184
185
279
291
362

Find all these datasets at http://176.9.83.62 or http://climatemirror1.space
