Allow Python scraper to keep empty spans with ids #2082

thewheat · 2023-10-16T13:22:41Z

Fixes #1081

The source page has 3 IDs

The scraped page has only 2

It turns out that empty spans with IDs are removed in core/clean_text:

devdocs/lib/docs/filters/core/clean_text.rb

Line 10 in 887c879

    while html.gsub!(EMPTY_NODES_RGX, ''); end
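To illustrate why that loop drops the anchors, here is a minimal sketch in plain Ruby. The pattern `ASSUMED_EMPTY_NODES_RGX` and the helper `strip_empty_nodes` are assumptions made for this example; the real `EMPTY_NODES_RGX` in core/clean_text may match a different set of tags.

```ruby
# Assumed pattern for illustration only; the real EMPTY_NODES_RGX in
# core/clean_text may differ.
ASSUMED_EMPTY_NODES_RGX = /<(div|p|span)\b[^<>]*>\s*<\/\1>/

# Repeatedly strip empty elements until nothing matches, so that parents
# which become empty after one pass are removed on the next pass.
def strip_empty_nodes(html)
  out = html.dup
  while out.gsub!(ASSUMED_EMPTY_NODES_RGX, ''); end
  out
end

anchors = '<span id="strftime-strptime-behavior"></span>' \
          '<span id="index-0"></span><h2>Behavior</h2>'
puts strip_empty_nodes(anchors)
```

Under that assumed pattern, both empty spans are removed and only the heading remains, so neither id survives as a link target.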

In addition, sphinx/clean_html does some ID manipulation and removes some empty spans:

devdocs/lib/docs/filters/sphinx/clean_html.rb

Lines 39 to 42 in 887c879

    
    css('span[id]:empty').each do |node|
      (node.next_element || node.previous_element)['id'] ||= node['id'] if node.next_element || node.previous_element
      node.remove
    end
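One reading of the snippet above is that an id is lost whenever two empty spans sit next to each other, because `||=` will not overwrite the id the neighbouring span already has. The sketch below models that with plain Ruby structs instead of Nokogiri nodes; `Node` and `merge_empty_span_ids` are hypothetical names used only for this illustration.

```ruby
# Plain-Ruby stand-in for sibling elements; the real filter works on
# Nokogiri nodes.
Node = Struct.new(:name, :id, :empty)

# Mirrors the loop above: each empty span with an id offers its id to the
# next sibling (or the previous one if there is no next), then is removed.
def merge_empty_span_ids(siblings)
  remaining = siblings.dup
  siblings.each do |node|
    next unless node.name == 'span' && node.id && node.empty

    index = remaining.index { |n| n.equal?(node) }
    target = remaining[index + 1] || (index.positive? ? remaining[index - 1] : nil)
    target.id ||= node.id if target # ||= keeps an id the target already has
    remaining.delete_at(index)
  end
  remaining
end

siblings = [
  Node.new('span', 'strftime-strptime-behavior', true),
  Node.new('span', 'index-0', true),
  Node.new('h2', nil, false)
]
result = merge_empty_span_ids(siblings)
p result.map(&:id)
```

In this model the heading ends up with `index-0`, while `strftime-strptime-behavior` is discarded along with its span, which matches the missing anchor described in this PR.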

Code changes done

Created a new sphinx_keep_empty_ids parameter to bypass this sphinx/clean_html processing, and set options[:sphinx_keep_empty_ids] = true in the Python scraper.
Set options[:clean_text] = false in the Python scraper to bypass core/clean_text.
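A rough sketch of why both options are needed, assuming each filter step is skipped when its option says so. Only the option names `:sphinx_keep_empty_ids` and `:clean_text` come from this PR; `run_filters` and both regexes are simplified, hypothetical stand-ins and not the actual filter code.

```ruby
# Hypothetical mini-pipeline; only the two option names come from the PR.
def run_filters(html, options)
  out = html.dup

  # Stand-in for the sphinx/clean_html span handling (simplified here to
  # just dropping empty spans that carry an id).
  out.gsub!(/<span id="[^"]*"><\/span>/, '') unless options[:sphinx_keep_empty_ids]

  # Stand-in for the empty-node removal in core/clean_text.
  unless options[:clean_text] == false
    while out.gsub!(/<(div|p|span)\b[^<>]*>\s*<\/\1>/, ''); end
  end

  out
end

page = '<span id="doctest-options"></span><h2>Option Flags</h2>'
puts run_filters(page, {})
puts run_filters(page, sphinx_keep_empty_ids: true, clean_text: false)
```

With only one of the two options set, the other step still removes the span in this sketch, which is why the Python scraper sets both.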

Other notes

In testing, this solves the issue mentioned and likely other cases as well (there could be up to 54 in total):

➜  python~3.10 git:(allow-python-empty-spans) ✗ pwd
/Users/thewheat/src/devdocs/docs/python~3.10
➜  python~3.10 git:(allow-python-empty-spans) ✗ grep -irn  "<span id=.*</span><span id="  | grep datetime
./library/datetime.html:2660:<span id="strftime-strptime-behavior"></span><span id="index-0"></span><h2><code class="xref py py-meth docutils literal notranslate"><span class="pre">strftime()</span></code> and <code class="xref py py-meth docutils literal notranslate"><span class="pre">strptime()</span></code> Behavior<a class="headerlink" href="#strftime-and-strptime-behavior" title="Permalink to this headline">¶</a></h2>
➜  python~3.10 git:(allow-python-empty-spans) ✗ grep -irn  "<span id=.*</span><span id="  | wc -l        
      54

Some other affected anchors that are fixed in my local setup:

Search option flags https://devdocs.io/python~3.7/library/doctest#doctest-options
Search doctest directives https://devdocs.io/python~3.7/library/doctest#doctest-directives

simon04

Great, thank you!

Allow Python scraper to keep empty spans with ids

b46cb95

thewheat requested a review from a team as a code owner October 16, 2023 13:22

thewheat mentioned this pull request Oct 16, 2023

Python datetime strftime/strptime behaviour link-to broken #1081

Closed

Update Python documentation (3.12.1)

4862e15

simon04 approved these changes Jan 5, 2024


simon04 merged commit 783e5dc into freeCodeCamp:main Jan 5, 2024
1 check passed

simon04 mentioned this pull request Jan 6, 2024

incorrect anchor in python3.9 #2099

Open
