Did I make a mistake? #137

yjqiang · 2019-03-30T01:50:19Z

import requests
from bs4 import BeautifulSoup


url = 'https://www.ptwxz.com/html/0/296/39948.html'

cookie = ""

user_agent = ('Mozilla/5.0 (iPhone; CPU iPhone OS 11_2_6 like '
              'Mac OS X) AppleWebKit/604.1.34 (KHTML, like Gecko) '
              'CriOS/65.0.3325.152 Mobile/15D100 Safari/604.1')

headers = {'User-Agent': user_agent, 'cookie': cookie}

'''
socks5 = 'socks5://127.0.0.1:10086'  # this website is blocked by the government of China, so I have to use a proxy
rsp = requests.get(
    url,
    headers=headers,
    proxies={'http': socks5, 'https': socks5})
'''
rsp = requests.get(url, headers=headers)
rsp.encoding = 'gbk'
text = rsp.text

soups = BeautifulSoup(text, 'html.parser')
print(soups.prettify())
print('_____________________')

tag_center = soups.select_one('table[align="center"]')
list_tag_center_next_siblings = list(tag_center.next_siblings)
for i in list_tag_center_next_siblings[-4:]:
    print(type(i), i)

tag_head = soups.select_one('head')
list_tag_head_children = list(tag_head.children)
for i in list_tag_head_children[-4:]:
    print(type(i), i, '|')

for x, y in zip(reversed(list_tag_center_next_siblings), reversed(list_tag_head_children)):
    assert x is y


tags_after_center = soups.select('table[align="center"] ~ *')
print(tags_after_center)

yjqiang · 2019-03-30T01:57:10Z

table[align="center"] ~ * should match the next siblings of the tag table[align="center"].
But it returns only 2 results, while list_tag_center_next_siblings = list(tag_center.next_siblings) contains 4.

facelessuser · 2019-03-30T02:30:44Z

I'm having a hard time understanding. How are you determining it is wrong? Are you sure you are not counting text nodes? I see you creating a list of children, but I don't see you filtering out just Tag nodes. If you can give me a simpler example, maybe I can better understand.

facelessuser · 2019-03-30T02:33:41Z

Yeah, in your output, I'm seeing <class 'bs4.element.NavigableString'>, those are not elements, those are text nodes. You should be filtering just Tag objects: isinstance(node, bs4.Tag).
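
A minimal sketch of that filter, assuming a small made-up document rather than the page from the report. It is only meant to show why .next_siblings and the ~ combinator give different counts: the first yields text nodes too, the second matches elements only.

```python
from bs4 import BeautifulSoup, Tag

# Made-up document for illustration; not the page from the original report.
HTML = '<div><table align="center"></table>text A<p>one</p>text B<p>two</p></div>'

soup = BeautifulSoup(HTML, 'html.parser')
table = soup.select_one('table[align="center"]')

# .next_siblings yields every following node, text nodes included.
all_siblings = list(table.next_siblings)

# Keep only element nodes (Tag objects).
tag_siblings = [node for node in all_siblings if isinstance(node, Tag)]

# The CSS sibling combinator only ever matches elements.
selected = soup.select('table[align="center"] ~ *')

print(len(all_siblings))  # 4
print(len(tag_siblings))  # 2
print(all(a is b for a, b in zip(tag_siblings, selected)))
```

After filtering, the two lists hold the same Tag objects in the same order.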

yjqiang · 2019-03-30T02:35:24Z

> Yeah, in your output, I'm seeing <class 'bs4.element.NavigableString'>, those are not elements, those are text nodes. You should be filtering just Tag objects: isinstance(node, bs4.Tag).

But, can I get .next_siblings with pure soupsieve?

facelessuser · 2019-03-30T02:37:25Z

You want text nodes and element nodes?

yjqiang · 2019-03-30T02:37:43Z

> You want text nodes and element nodes?

Yeah.

yjqiang · 2019-03-30T02:40:11Z

> Yeah, in your output, I'm seeing <class 'bs4.element.NavigableString'>, those are not elements, those are text nodes. You should be filtering just Tag objects: isinstance(node, bs4.Tag).

And sorry for my limited knowledge. Now I understand the difference between a node and an element. https://developer.mozilla.org/en-US/docs/Web/API/Node/nodeType

facelessuser · 2019-03-30T02:45:02Z

CSS selectors return element nodes not text nodes.

But you can still get what you want. I would select table[align="center"], get its index under its parent, then slice the parent's children.

Here I'm going to target the span with Text 3, get its index, then slice the contents of its parent to give me all nodes after my target:

from bs4 import BeautifulSoup

HTML = """
<div>
<p id="first">Text 1</p>
<span>Text 2</span>
<span>Text 3</span>
Text 4
<span>Text 5</span>
</div>
"""

soups = BeautifulSoup(HTML, 'html.parser')
element = soups.select_one('span:nth-child(3)')
index = element.parent.index(element)
print(element.parent.contents[index + 1:])

Output

['\nText 4\n', <span>Text 5</span>, '\n']

Makes sense?
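
As a side note, here is a short sketch suggesting that .next_siblings (which the original script already uses) should walk the same nodes as the slice in the example above. It reuses the same sample HTML; the variable names sliced and siblings are only for illustration.

```python
from bs4 import BeautifulSoup

HTML = """
<div>
<p id="first">Text 1</p>
<span>Text 2</span>
<span>Text 3</span>
Text 4
<span>Text 5</span>
</div>
"""

soup = BeautifulSoup(HTML, 'html.parser')
element = soup.select_one('span:nth-child(3)')

# Slice the parent's contents after the target, as in the example above.
index = element.parent.index(element)
sliced = element.parent.contents[index + 1:]

# .next_siblings yields the following nodes, text nodes included.
siblings = list(element.next_siblings)

same = len(sliced) == len(siblings) and all(
    a is b for a, b in zip(sliced, siblings))
print(same)
```

Either way, the selector only locates the element; the text nodes come from navigating the tree afterwards.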

yjqiang · 2019-03-30T02:50:27Z

But here is my problem. I want to read ebooks with Python, which means I need to get and filter all the text nodes of a web page. I also want to use a configuration file to set the rules for different websites. So your idea might not be good enough for my case, and for many other websites I only need a single select.

facelessuser · 2019-03-30T02:55:29Z

Unfortunately, selectors only return elements. You can leverage this with additional logic to get desired text nodes, but selectors won't return text nodes directly.

I don't know enough about your project to make suggestions on approach, but it appears selectors are working as expected.
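
One possible shape for that "additional logic", as a rough sketch only: a per-site rule supplies a selector for the container element, and the text nodes are then collected from that element. The RULES mapping, the site name, the selector, and the extract_text helper are all hypothetical; nothing here is part of soupsieve or of the project discussed in this thread.

```python
from bs4 import BeautifulSoup

# Hypothetical per-site rules; the site name and selector are made up.
RULES = {
    'example-site': {'content': 'div#content'},
}

HTML = """
<html><body>
<div id="nav"><a href="/">Home</a></div>
<div id="content">First line<br/>Second line<p>Third line</p></div>
</body></html>
"""


def extract_text(html, rule):
    """Select the container element, then collect its non-empty text nodes."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.select_one(rule['content'])
    if container is None:
        return []
    # stripped_strings yields each descendant text node with surrounding
    # whitespace removed, skipping nodes that are only whitespace.
    return list(container.stripped_strings)


lines = extract_text(HTML, RULES['example-site'])
print(lines)  # ['First line', 'Second line', 'Third line']
```

The selector still returns only an element; the text nodes come from walking that element's descendants.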

yjqiang · 2019-03-30T03:01:27Z

> Unfortunately, selectors only return elements. You can leverage this with additional logic to get desired text nodes, but selectors won't return text nodes directly.
>
> I don't know enough about your project to make suggestions on approach, but it appears selectors are working as expected.

Got it. Thanks. But would you consider supporting this in the future? (Although I checked the documentation for bs4 and CSS: in CSS, selectors are patterns used to select the element(s) you want to style.)

facelessuser · 2019-03-30T03:24:51Z

Unfortunately, selecting text nodes doesn't make sense with the selectors, so I don't have plans to implement it.

facelessuser added the "T: support" (support request) label Mar 30, 2019

facelessuser closed this as completed Mar 30, 2019
