
RCRA access via selenium no longer functioning #146

Closed
bl-young opened this issue Sep 18, 2023 · 13 comments · Fixed by #148

Comments

@bl-young
Collaborator

No description provided.

@dt-woods

Do you have a plain language summary of what you're trying to accomplish with download_and_extract_zip in RCRAInfo.py? It looks like a recreation of "downthemall" for the defined links on the page (e.g., BR REPORTING and HD LU WASTE CODE). Is that right?

@dt-woods

Have you looked at Envirofacts Data Service API? It looks like it has BR_REPORTING.

@dt-woods

Here's an example for 2015 VA BR REPORTING:

https://data.epa.gov/efservice/br_reporting/state/=/VA/report_cycle/=/2015/json
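The table/column/`=`/value query pattern generalizes; a small helper sketch (the `envirofacts_url` name and keyword-filter convention are mine, inferred from the example URL above):

```python
def envirofacts_url(table, output="json", **filters):
    """Assemble an Envirofacts Data Service query URL from
    column=value filters, following the path pattern
    table/column/=/value/.../output."""
    parts = ["https://data.epa.gov/efservice", table]
    for column, value in filters.items():
        parts += [column, "=", str(value)]
    parts.append(output)
    return "/".join(parts)

url = envirofacts_url("br_reporting", state="VA", report_cycle=2015)
```

This reproduces the 2015 VA BR_REPORTING URL shown above.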

@dt-woods

Unfortunately, HD_LU_WASTE_CODE is unavailable (see here).

@dt-woods

I see where you were trying to use Google Chrome to load a JavaScript page so that you could get the full page content. If you look behind the scenes at the URL calls being made (e.g., inspect the page, watch the network traffic, see where the CSV is requested, find the URL), you'll see this:

https://rcrapublic.epa.gov/rcra-public-export/rest/api/v1/public/export/current/CSV

If you request this URL, you should get a lovely JSON text back.

>>> import requests
>>> import json
>>> my_url = "https://rcrapublic.epa.gov/rcra-public-export/rest/api/v1/public/export/current/CSV"
>>> r = requests.get(my_url)
>>> print(r.text)
{
  "id" : 115556,
  "runId" : "CSV-2023-09-18T03-00-00-0400",
  "createdDate" : "2023-09-18T07:00:00.058+00:00",
  "completedDate" : "2023-09-18T14:13:03.766+00:00",
  "latest" : true,
  "tables" : [ {
    "id" : 115557,
    "module" : "Biennial Report",
    "moduleSortOrder" : 1,
    "tableName" : "BR_GM_WASTE_CODE",
    "totalRecords" : 10564777,
    "totalFiles" : 11,
    "recordsPerFile" : 1000000,
    "createdDate" : "2023-09-18T07:01:05.780+00:00",
    "files" : [ {
      "id" : 115559,
      "s3Key" : "Production/CSV-2023-09-18T03-00-00-0400/Biennial Report/BR_GM_WASTE_CODE/BR_GM_WASTE_CODE.zip",
      "fileName" : "BR_GM_WASTE_CODE.zip",
      "fileSize" : 35475962,
      "startRecord" : 0,
      "numberOfRecords" : 0
    } ]
  }, ...
{
    "id" : 115911,
    "module" : "Permitting",
    "createdDate" : "2023-09-18T08:53:36.963+00:00",
    "s3Key" : "Production/CSV-2023-09-18T03-00-00-0400/Permitting/PM.zip",
    "fileName" : "PM.zip",
    "fileSize" : 7432257,
    "sortOrder" : 9
  } ],
  "outputType" : "CSV"
}
>>> d = json.loads(r.text)  # load JSON into a dict
>>> def find_table(d):      # simple search
...     for f_tab in d['tables']:
...         f_list = f_tab.get('files', [])
...         for f_item in f_list:
...             f_name = f_item.get("fileName", "")
...             if f_name == 'HD_LU_WASTE_CODE.zip':
...                 return f_item
>>> r_dict = find_table(d)
>>> r_dict
{'id': 115821,
 's3Key': 'Production/CSV-2023-09-18T03-00-00-0400/Handler/HD_LU_WASTE_CODE/HD_LU_WASTE_CODE.zip',
 'fileName': 'HD_LU_WASTE_CODE.zip',
 'fileSize': 119988,
 'startRecord': 0,
 'numberOfRecords': 0}
>>> hd_waste_url = "/".join(["https://s3.amazonaws.com/rcrainfo-ftp", r_dict['s3Key']])
>>> hd_waste_url  # parse together s3 URL bits
'https://s3.amazonaws.com/rcrainfo-ftp/Production/CSV-2023-09-18T03-00-00-0400/Handler/HD_LU_WASTE_CODE/HD_LU_WASTE_CODE.zip'
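The URL assembly above can be wrapped together with the download itself; a minimal sketch (the function names are mine, and the bucket base URL is taken from the session above):

```python
import io
import zipfile

import requests

RCRA_S3_BASE = "https://s3.amazonaws.com/rcrainfo-ftp"  # bucket base from the session above

def build_export_url(s3_key, base=RCRA_S3_BASE):
    """Join the export manifest's s3Key onto the public bucket URL."""
    return "/".join([base, s3_key])

def download_and_extract(s3_key, dest_dir):
    """Fetch the zip named by s3_key and unpack it into dest_dir."""
    r = requests.get(build_export_url(s3_key), timeout=300)
    r.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(r.content)) as zf:
        zf.extractall(dest_dir)
```

Resolving the s3Key from the manifest on each run avoids hard-coding links that break whenever the export is regenerated.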

@bl-young
Collaborator Author

bl-young commented Sep 18, 2023

Yes, generally we are trying to download BR_REPORTING_{year}.zip, but the link is not stable because the files are regenerated regularly.
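One way to make that link resolution year-aware, sketched against the manifest structure shown earlier in the thread (`find_year_zip` and the sample entry below are illustrative, not the actual API response):

```python
def find_year_zip(manifest, year):
    """Return the file entry for BR_REPORTING_{year}.zip from the
    export manifest, or None if no such file is listed."""
    target = f"BR_REPORTING_{year}.zip"
    for table in manifest.get("tables", []):
        for item in table.get("files", []):
            if item.get("fileName") == target:
                return item
    return None

# Illustrative manifest fragment mirroring the JSON shape shown above
sample = {"tables": [
    {"tableName": "BR_REPORTING",
     "files": [{"fileName": "BR_REPORTING_2017.zip",
                "s3Key": "Production/CSV-2023-09-18T03-00-00-0400/"
                         "Biennial Report/BR_REPORTING/BR_REPORTING_2017.zip"}]}]}
entry = find_year_zip(sample, 2017)
```

Because the lookup is by fileName rather than by a saved URL, the regenerated s3Key is picked up fresh each run.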

@dt-woods

See example code above.

@bl-young
Collaborator Author

Ahh yes - that could simplify things quite a bit. thanks!

@bl-young
Collaborator Author

bl-young commented Sep 18, 2023

This update to essentially replace download_and_extract_zip() seems quite straightforward, but I expect I will not get to it for a week or two.

@bl-young
Collaborator Author

Note that the updates in #147 do not shift to using the Envirofacts API, which we can evaluate in the future; they just streamline the identification of the zip file to download.

@dt-woods

So I tried accessing other years of RCRAInfo data (2013, 2015, 2017, and 2019). All worked except for one (2017), which produced the following errors. I wasn't able to track down the CSV file it keeps crashing on. Maybe there's a debug statement that points to it.

INFO RCRAInfo_2017 not found in ~/stewi/flowbyfacility
INFO requested inventory does not exist in local directory, it will be generated...
INFO file extraction complete
INFO organizing data for BR_REPORTING from 2017...
INFO extracting ~/stewi/RCRAInfo Data Files/BR_REPORTING_2017_0.csv
INFO extracting ~/stewi/RCRAInfo Data Files/BR_REPORTING_2017_1.csv
INFO extracting ~/stewi/RCRAInfo Data Files/BR_REPORTING_2017_2.csv
INFO saving to ~/stewi/RCRAInfo Data Files/RCRAInfo_by_year/br_reporting_2017.csv...
INFO generating inventory files for 2017
---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
Cell In[9], line 1
----> 1 stewi.getInventory('RCRAInfo', 2017)

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/__init__.py:82, in getInventory(inventory_acronym, year, stewiformat, filters, filter_for_LCI, US_States_Only, download_if_missing, keep_sec_cntx)
     66 """Return or generate an inventory in a standard output format.
     67 
     68 :param inventory_acronym: like 'TRI'
   (...)
     79 :return: dataframe with standard fields depending on output format
     80 """
     81 f = ensure_format(stewiformat)
---> 82 inventory = read_inventory(inventory_acronym, year, f,
     83                            download_if_missing)
     85 if (not keep_sec_cntx) and ('Compartment' in inventory):
     86     inventory['Compartment'] = (inventory['Compartment']
     87                                 .str.partition('/')[0])

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/globals.py:268, in read_inventory(inventory_acronym, year, f, download_if_missing)
    265 else:
    266     log.info('requested inventory does not exist in local directory, '
    267              'it will be generated...')
--> 268     generate_inventory(inventory_acronym, year)
    269 inventory = load_preprocessed_output(meta, paths)
    270 if inventory is None:

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/globals.py:313, in generate_inventory(inventory_acronym, year)
    309     RCRAInfo.main(Option = 'A', Year = [year],
    310                   Tables = ['BR_REPORTING', 'HD_LU_WASTE_CODE'])
    311     RCRAInfo.main(Option = 'B', Year = [year],
    312                   Tables = ['BR_REPORTING'])
--> 313     RCRAInfo.main(Option = 'C', Year = [year])
    314 elif inventory_acronym == 'TRI':
    315     import stewi.TRI as TRI

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/RCRAInfo.py:444, in main(**kwargs)
    441     organize_br_reporting_files_by_year(kwargs['Tables'], year)
    443 elif kwargs['Option'] == 'C':
--> 444     Generate_RCRAInfo_files_csv(year)
    446 elif kwargs['Option'] == 'D':
    447     """State totals are compiled from the Trends Analysis website
    448     and stored as csv. New years will be added as data becomes
    449     available"""

File ~/Envs/ebm/lib/python3.11/site-packages/stewi/RCRAInfo.py:219, in Generate_RCRAInfo_files_csv(report_year)
    216 fieldstokeep = pd.read_csv(RCRA_DATA_PATH.joinpath('RCRA_required_fields.txt'),
    217                            header=None)
    218 # on_bad_lines requires pandas >= 1.3
--> 219 df = pd.read_csv(filepath, header=0, usecols=list(fieldstokeep[0]),
    220                  low_memory=False, on_bad_lines='skip',
    221                  encoding='ISO-8859-1')
    223 log.info(f'completed reading {filepath}')
    224 # Checking the Waste Generation Data Health

File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/readers.py:948, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
    935 kwds_defaults = _refine_defaults_read(
    936     dialect,
    937     delimiter,
   (...)
    944     dtype_backend=dtype_backend,
    945 )
    946 kwds.update(kwds_defaults)
--> 948 return _read(filepath_or_buffer, kwds)

File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/readers.py:617, in _read(filepath_or_buffer, kwds)
    614     return parser
    616 with parser:
--> 617     return parser.read(nrows)

File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/readers.py:1748, in TextFileReader.read(self, nrows)
   1741 nrows = validate_integer("nrows", nrows)
   1742 try:
   1743     # error: "ParserBase" has no attribute "read"
   1744     (
   1745         index,
   1746         columns,
   1747         col_dict,
-> 1748     ) = self._engine.read(  # type: ignore[attr-defined]
   1749         nrows
   1750     )
   1751 except Exception:
   1752     self.close()

File ~/Envs/ebm/lib/python3.11/site-packages/pandas/io/parsers/c_parser_wrapper.py:239, in CParserWrapper.read(self, nrows)
    236         data = _concatenate_chunks(chunks)
    238     else:
--> 239         data = self._reader.read(nrows)
    240 except StopIteration:
    241     if self._first_chunk:

File parsers.pyx:825, in pandas._libs.parsers.TextReader.read()

File parsers.pyx:913, in pandas._libs.parsers.TextReader._read_rows()

File parsers.pyx:890, in pandas._libs.parsers.TextReader._check_tokenize_status()

File parsers.pyx:2058, in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
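One way to locate the offending row(s): with pandas >= 1.4 and the python engine, `on_bad_lines` accepts a callable that receives each malformed row, so bad lines can be logged instead of silently skipped. A minimal sketch on toy CSV data (the in-memory text here stands in for the real BR_REPORTING file):

```python
import io

import pandas as pd

bad_rows = []

def log_bad_line(fields):
    """Record a malformed row; returning None drops it from the result."""
    bad_rows.append(fields)
    return None

csv_text = "a,b\n1,2\n3,4,5\n6,7\n"   # row "3,4,5" has one field too many
df = pd.read_csv(io.StringIO(csv_text),
                 on_bad_lines=log_bad_line, engine="python")
```

Since the C engine's "Buffer overflow caught" error is raised in the tokenizer, `on_bad_lines='skip'` may not prevent it; switching to `engine='python'` (note that `low_memory` must then be dropped, as it is C-engine only) may at least survive the file and report which rows are malformed.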

@dt-woods

@bl-young, are you able to generate the 2017 RCRAInfo with stewi?

@bl-young
Collaborator Author

I can, yes, but I get the same error on the runner (see here).

I don't think I have the latest version of pandas, so perhaps it's a pandas issue that needs updating. I will make a new issue.
