RCRA access via selenium no longer functioning #146
Comments
Do you have a plain language summary of what you're trying to accomplish with |
Have you looked at Envirofacts Data Service API? It looks like it has BR_REPORTING. |
Here's an example for 2015 VA BR REPORTING: https://data.epa.gov/efservice/br_reporting/state/=/VA/report_cycle/=/2015/json |
Unfortunately HD_LU_WASTE_CODE is unavailable (see here) |
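The Envirofacts URL above follows a table / column / operator / value pattern. A minimal sketch of building such URLs is shown here; the helper name `build_efservice_url` and the `filters` dictionary are illustrative assumptions, and only the URL layout is taken from the example link.

```python
from urllib.parse import quote

# Base URL taken from the example link in this thread.
EFSERVICE_BASE = "https://data.epa.gov/efservice"


def build_efservice_url(table, filters, fmt="json"):
    """Join a table name, column/=/value filters, and an output format.

    Hypothetical helper: it only mirrors the layout of the example URL
    and does not validate table or column names.
    """
    parts = [EFSERVICE_BASE, table.lower()]
    for column, value in filters.items():
        parts += [column.lower(), "=", quote(str(value), safe="")]
    parts.append(fmt)
    return "/".join(parts)


url = build_efservice_url("br_reporting", {"state": "VA", "report_cycle": 2015})
print(url)
```

The resulting string can be passed to `requests.get` in the same way as the export URL used later in this thread.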
I see you were trying to use Google Chrome to load a JavaScript page so that the full page content is available. If you look behind the scenes at the URL calls happening (e.g., inspect the page, look at network traffic, see where CSV is called, and find the URL), you'll see this: https://rcrapublic.epa.gov/rcra-public-export/rest/api/v1/public/export/current/CSV
If you request this URL, you should get JSON text back:
>>> import requests
>>> import json
>>> my_url = "https://rcrapublic.epa.gov/rcra-public-export/rest/api/v1/public/export/current/CSV"
>>> r = requests.get(my_url)
>>> print(r.text)
{
"id" : 115556,
"runId" : "CSV-2023-09-18T03-00-00-0400",
"createdDate" : "2023-09-18T07:00:00.058+00:00",
"completedDate" : "2023-09-18T14:13:03.766+00:00",
"latest" : true,
"tables" : [ {
"id" : 115557,
"module" : "Biennial Report",
"moduleSortOrder" : 1,
"tableName" : "BR_GM_WASTE_CODE",
"totalRecords" : 10564777,
"totalFiles" : 11,
"recordsPerFile" : 1000000,
"createdDate" : "2023-09-18T07:01:05.780+00:00",
"files" : [ {
"id" : 115559,
"s3Key" : "Production/CSV-2023-09-18T03-00-00-0400/Biennial Report/BR_GM_WASTE_CODE/BR_GM_WASTE_CODE.zip",
"fileName" : "BR_GM_WASTE_CODE.zip",
"fileSize" : 35475962,
"startRecord" : 0,
"numberOfRecords" : 0
} ]
}, ...
{
"id" : 115911,
"module" : "Permitting",
"createdDate" : "2023-09-18T08:53:36.963+00:00",
"s3Key" : "Production/CSV-2023-09-18T03-00-00-0400/Permitting/PM.zip",
"fileName" : "PM.zip",
"fileSize" : 7432257,
"sortOrder" : 9
} ],
"outputType" : "CSV"
}
>>> d = json.loads(r.text) # Load JSON to dict
>>> def find_table(d):  # simple search for the file entry by name
...     for f_tab in d['tables']:
...         f_list = f_tab.get('files', [])
...         for f_item in f_list:
...             f_name = f_item.get("fileName", "")
...             if f_name == 'HD_LU_WASTE_CODE.zip':
...                 return f_item
...
>>> r_dict = find_table(d)
>>> r_dict
{'id': 115821,
's3Key': 'Production/CSV-2023-09-18T03-00-00-0400/Handler/HD_LU_WASTE_CODE/HD_LU_WASTE_CODE.zip',
'fileName': 'HD_LU_WASTE_CODE.zip',
'fileSize': 119988,
'startRecord': 0,
'numberOfRecords': 0}
>>> hd_waste_url = "/".join(["https://s3.amazonaws.com/rcrainfo-ftp", r_dict['s3Key']])
>>> hd_waste_url # parse together s3 URL bits
'https://s3.amazonaws.com/rcrainfo-ftp/Production/CSV-2023-09-18T03-00-00-0400/Handler/HD_LU_WASTE_CODE/HD_LU_WASTE_CODE.zip' |
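The lookup above could be generalized and followed by the download itself. The sketch below is one way to do that with the standard library only. The function names, the `sample` dictionary (a trimmed copy of the JSON excerpt), and the assumption that some entries carry `fileName`/`s3Key` at the table level (as the `PM.zip` entry appears to) are illustrative, not part of stewi.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import quote

# Bucket URL as used in the example above.
S3_BASE = "https://s3.amazonaws.com/rcrainfo-ftp"


def find_file(export, file_name):
    """Return the entry whose fileName matches, or None if absent."""
    for table in export.get("tables", []):
        # Check the table-level entry and any nested "files" entries.
        for entry in [table, *table.get("files", [])]:
            if entry.get("fileName") == file_name:
                return entry
    return None


def s3_url(entry):
    """Build the download URL, percent-encoding spaces in the key."""
    return f"{S3_BASE}/{quote(entry['s3Key'])}"


def download(entry, dest_dir="."):
    """Stream the zip to dest_dir and return the local path (needs network)."""
    target = Path(dest_dir) / entry["fileName"]
    with urllib.request.urlopen(s3_url(entry)) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Trimmed copy of the JSON excerpt, used only to exercise the lookup offline.
sample = {
    "tables": [
        {"tableName": "BR_GM_WASTE_CODE", "files": [{
            "s3Key": "Production/CSV-2023-09-18T03-00-00-0400/Biennial Report/BR_GM_WASTE_CODE/BR_GM_WASTE_CODE.zip",
            "fileName": "BR_GM_WASTE_CODE.zip"}]},
        {"module": "Permitting",
         "s3Key": "Production/CSV-2023-09-18T03-00-00-0400/Permitting/PM.zip",
         "fileName": "PM.zip"},
    ]
}
print(s3_url(find_file(sample, "BR_GM_WASTE_CODE.zip")))
```

Calling `download(find_file(d, "HD_LU_WASTE_CODE.zip"))` with the live response `d` would then save the archive locally. Keys containing spaces, such as those under "Biennial Report", are percent-encoded here because `urllib` does not do that automatically.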
Yes, we generally are trying to download |
See example code above. |
Ahh yes - that could simplify things quite a bit. thanks! |
This update to essentially replace |
Note that the updates in #147 do not shift to using the Envirofacts API, which we can evaluate in the future; they just streamline the identification of the zip file to download. |
So I tried accessing other years of RCRAInfo data (2013, 2015, 2017, and 2019). All worked except for one (2017), which produced the following errors. I wasn't able to track down the CSV file it keeps crashing on. Maybe there's a debug statement that points to it.
INFO RCRAInfo_2017 not found in ~/stewi/flowbyfacility
INFO requested inventory does not exist in local directory, it will be generated...
INFO file extraction complete
INFO organizing data for BR_REPORTING from 2017...
INFO extracting ~/stewi/RCRAInfo Data Files/BR_REPORTING_2017_0.csv
INFO extracting ~/stewi/RCRAInfo Data Files/BR_REPORTING_2017_1.csv
INFO extracting ~/stewi/RCRAInfo Data Files/BR_REPORTING_2017_2.csv
INFO saving to ~/stewi/RCRAInfo Data Files/RCRAInfo_by_year/br_reporting_2017.csv...
INFO generating inventory files for 2017
---------------------------------------------------------------------------
ParserError Traceback (most recent call last)
Cell In[9], line 1
----> 1 stewi.getInventory('RCRAInfo', 2017)
File ~/Envs/ebm/lib/python3.11/site-packages/stewi/__init__.py:82, in getInventory(inventory_acronym, year, stewiformat, filters, filter_for_LCI, US_States_Only, download_if_missing, keep_sec_cntx)
66 """Return or generate an inventory in a standard output format.
67
68 :param inventory_acronym: like 'TRI'
(...)
79 :return: dataframe with standard fields depending on output format
80 """
81 f = ensure_format(stewiformat)
---> 82 inventory = read_inventory(inventory_acronym, year, f,
83 download_if_missing)
85 if (not keep_sec_cntx) and ('Compartment' in inventory):
86 inventory['Compartment'] = (inventory['Compartment']
87 .str.partition('/')[0])
File ~/Envs/ebm/lib/python3.11/site-packages/stewi/globals.py:268, in read_inventory(inventory_acronym, year, f, download_if_missing)
265 else:
266 log.info('requested inventory does not exist in local directory, '
267 'it will be generated...')
--> 268 generate_inventory(inventory_acronym, year)
269 inventory = load_preprocessed_output(meta, paths)
270 if inventory is None:
File ~/Envs/ebm/lib/python3.11/site-packages/stewi/globals.py:313, in generate_inventory(inventory_acronym, year)
309 RCRAInfo.main(Option = 'A', Year = [year],
310 Tables = ['BR_REPORTING', 'HD_LU_WASTE_CODE'])
311 RCRAInfo.main(Option = 'B', Year = [year],
312 Tables = ['BR_REPORTING'])
--> 313 RCRAInfo.main(Option = 'C', Year = [year])
314 elif inventory_acronym == 'TRI':
315 import stewi.TRI as TRI
File ~/Envs/ebm/lib/python3.11/site-packages/stewi/RCRAInfo.py:444, in main(**kwargs)
441 organize_br_reporting_files_by_year(kwargs['Tables'], year)
443 elif kwargs['Option'] == 'C':
--> 444 Generate_RCRAInfo_files_csv(year)
446 elif kwargs['Option'] == 'D':
447 """State totals are compiled from the Trends Analysis website
448 and stored as csv. New years will be added as data becomes
449 available"""
File ~/Envs/ebm/lib/python3.11/site-packages/stewi/RCRAInfo.py:219, in Generate_RCRAInfo_files_csv(report_year)
216 fieldstokeep = pd.read_csv(RCRA_DATA_PATH.joinpath('RCRA_required_fields.txt'),
217 header=None)
218 # on_bad_lines requires pandas >= 1.3
--> 219 df = pd.read_csv(filepath, header=0, usecols=list(fieldstokeep[0]),
220 low_memory=False, on_bad_lines='skip',
221 encoding='ISO-8859-1')
223 log.info(f'completed reading {filepath}')
224 # Checking the Waste Generation Data Health
File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/readers.py:948, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
935 kwds_defaults = _refine_defaults_read(
936 dialect,
937 delimiter,
(...)
944 dtype_backend=dtype_backend,
945 )
946 kwds.update(kwds_defaults)
--> 948 return _read(filepath_or_buffer, kwds)
File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/readers.py:617, in _read(filepath_or_buffer, kwds)
614 return parser
616 with parser:
--> 617 return parser.read(nrows)
File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/readers.py:1748, in TextFileReader.read(self, nrows)
1741 nrows = validate_integer("nrows", nrows)
1742 try:
1743 # error: "ParserBase" has no attribute "read"
1744 (
1745 index,
1746 columns,
1747 col_dict,
-> 1748 ) = self._engine.read( # type: ignore[attr-defined]
1749 nrows
1750 )
1751 except Exception:
1752 self.close()
File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/c_parser_wrapper.py:239, in CParserWrapper.read(self, nrows)
236 data = _concatenate_chunks(chunks)
238 else:
--> 239 data = self._reader.read(nrows)
240 except StopIteration:
241 if self._first_chunk:
File parsers.pyx:825, in pandas._libs.parsers.TextReader.read()
File parsers.pyx:913, in pandas._libs.parsers.TextReader._read_rows()
File parsers.pyx:890, in pandas._libs.parsers.TextReader._check_tokenize_status()
File parsers.pyx:2058, in pandas._libs.parsers.raise_parser_error()
ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file. |
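One way to narrow down a malformed input file is to scan it with the standard `csv` module and report rows whose field count differs from the header. This is a diagnostic sketch only: the function name `find_irregular_rows` is hypothetical, the encoding is copied from the `read_csv` call in the traceback, and a field-count mismatch is an assumed symptom rather than a confirmed cause of this error.

```python
import csv


def find_irregular_rows(path, encoding="ISO-8859-1", limit=10):
    """Return (line_number, field_count) for rows unlike the header row.

    Stops after `limit` findings so very large files are not fully listed.
    """
    irregular = []
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:  # empty file, nothing to compare
            return irregular
        for row in reader:
            if len(row) != len(header):
                # line_num is the physical line just consumed by the reader
                irregular.append((reader.line_num, len(row)))
                if len(irregular) >= limit:
                    break
    return irregular
```

It could be pointed at whichever CSV `read_csv` is loading when it fails, for example the merged `br_reporting_2017.csv` named in the log if that is the file involved.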
@bl-young, are you able to generate the 2017 RCRAInfo with stewi? |
I can, yes, but I get the same error on the runner (see here). I don't think I have the latest version of pandas, so perhaps it's a pandas issue that needs updating. I will make a new issue. |
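To compare environments, the installed pandas version can be read without importing pandas. The sketch below checks it against the 1.3 minimum mentioned in the `on_bad_lines` comment in the traceback. The helper names are hypothetical, and meeting the minimum says nothing about whether a newer release changes the parser behavior seen here.

```python
import re
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(version, minimum=(1, 3)):
    """Compare the leading major.minor of a version string to a minimum."""
    match = re.match(r"(\d+)\.(\d+)", version or "")
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= minimum


found = installed_version("pandas")
print(f"pandas installed: {found}; supports on_bad_lines: {meets_minimum(found)}")
```

Running this both locally and on the runner would show whether the two environments differ in pandas version.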