Did I make a mistake? #137

Closed
yjqiang opened this issue Mar 30, 2019 · 12 comments
Labels
T: support Support request.

Comments

@yjqiang
Contributor

yjqiang commented Mar 30, 2019

import requests
from bs4 import BeautifulSoup


url = 'https://www.ptwxz.com/html/0/296/39948.html'

cookie = ""

# Adjacent string literals are concatenated as-is, so each piece needs its own trailing space.
user_agent = ('Mozilla/5.0 (iPhone; CPU iPhone OS 11_2_6 like '
              'Mac OS X) AppleWebKit/604.1.34 (KHTML, like Gecko) '
              'CriOS/65.0.3325.152 Mobile/15D100 Safari/604.1')

headers = {'User-Agent': user_agent, 'cookie': cookie}

'''
socks5 = 'socks5://127.0.0.1:10086'  # because this website is blocked by the government of China, I have to use proxy
rsp = requests.get(
    url,
    headers=headers,
    proxies={'http': socks5, 'https': socks5})
'''
rsp = requests.get(url, headers=headers)
rsp.encoding = 'gbk'
text = rsp.text

soups = BeautifulSoup(text, 'html.parser')
print(soups.prettify())
print('_____________________')

tag_center = soups.select_one('table[align="center"]')
list_tag_center_next_siblings = list(tag_center.next_siblings)
for i in list_tag_center_next_siblings[-4:]:
    print(type(i), i)

tag_head = soups.select_one('head')
list_tag_head_children = list(tag_head.children)
for i in list_tag_head_children[-4:]:
    print(type(i), i, '|')

for x, y in zip(reversed(list_tag_center_next_siblings), reversed(list_tag_head_children)):
    assert x is y


tags_after_center = soups.select('table[align="center"] ~ *')
print(tags_after_center)
@yjqiang
Contributor Author

yjqiang commented Mar 30, 2019

table[align="center"] ~ * should match all the following siblings of the tag table[align="center"].
But it returns only 2 results, while list_tag_center_next_siblings = list(tag_center.next_siblings) contains 4.

@facelessuser
Owner

I'm having a hard time understanding. How are you determining that it is wrong? Are you sure you are not counting text nodes? I see you creating a list of children, but I don't see you filtering for just Tag nodes. If you can give me a simpler example, maybe I can understand better.

@facelessuser
Owner

Yeah, in your output, I'm seeing <class 'bs4.element.NavigableString'>, those are not elements, those are text nodes. You should be filtering just Tag objects: isinstance(node, bs4.Tag).
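
To illustrate the filtering suggested above, here is a minimal sketch. The HTML snippet and variable names are made up for the example and are not taken from the site in the original report. It compares `.next_siblings` (which yields text nodes as well as elements) with the result of the `~ *` selector:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: a centered table followed by a mix of text and elements.
HTML = """
<div>
<table align="center"><tr><td>cell</td></tr></table>
text after table
<p>first</p>
more text
<span>second</span>
</div>
"""

soup = BeautifulSoup(HTML, 'html.parser')
table = soup.select_one('table[align="center"]')

# .next_siblings yields every following node, including NavigableString text nodes.
all_siblings = list(table.next_siblings)

# Keep only element nodes, which is what a CSS selector can match.
element_siblings = [node for node in all_siblings if isinstance(node, Tag)]

# The general sibling combinator returns element nodes only.
selected = soup.select('table[align="center"] ~ *')

print(len(all_siblings), len(element_siblings), len(selected))
```

Once the text nodes are filtered out, the two approaches should agree on the same Tag objects.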

@yjqiang
Contributor Author

yjqiang commented Mar 30, 2019

> Yeah, in your output, I'm seeing <class 'bs4.element.NavigableString'>, those are not elements, those are text nodes. You should be filtering just Tag objects: isinstance(node, bs4.Tag).

But, can I get .next_siblings with pure soupsieve?

@facelessuser
Owner

You want text nodes and element nodes?

@yjqiang
Contributor Author

yjqiang commented Mar 30, 2019

> You want text nodes and element nodes?

Yeah.

@yjqiang
Contributor Author

yjqiang commented Mar 30, 2019

> Yeah, in your output, I'm seeing <class 'bs4.element.NavigableString'>, those are not elements, those are text nodes. You should be filtering just Tag objects: isinstance(node, bs4.Tag).

Sorry for my limited knowledge. Now I understand the difference between a node and an element: https://developer.mozilla.org/en-US/docs/Web/API/Node/nodeType

@facelessuser
Owner

facelessuser commented Mar 30, 2019

CSS selectors return element nodes, not text nodes.

But you can still get what you want: I would select table[align="center"], get its index under its parent, then slice the parent's children.

Here I'm going to target the span containing Text 3, get its index, then slice the contents of its parent to get all the nodes after my target:

from bs4 import BeautifulSoup

HTML = """
<div>
<p id="first">Text 1</p>
<span>Text 2</span>
<span>Text 3</span>
Text 4
<span>Text 5</span>
</div>
"""

soups = BeautifulSoup(HTML, 'html.parser')
element = soups.select_one('span:nth-child(3)')
index = element.parent.index(element)
print(element.parent.contents[index + 1:])

Output

['\nText 4\n', <span>Text 5</span>, '\n']

Makes sense?
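
The same idea can be wrapped in a small helper. This is only a sketch: `nodes_after` is a hypothetical name, not part of bs4 or soupsieve, and it uses `.next_siblings` in place of the index-and-slice step, which should yield the same nodes:

```python
from bs4 import BeautifulSoup

HTML = """
<div>
<p id="first">Text 1</p>
<span>Text 2</span>
<span>Text 3</span>
Text 4
<span>Text 5</span>
</div>
"""


def nodes_after(soup, selector):
    """Return every node (text and element) following the first match of selector."""
    element = soup.select_one(selector)
    if element is None:
        # Nothing matched, so there is nothing to return.
        return []
    # Equivalent to slicing element.parent.contents after the element's index.
    return list(element.next_siblings)


soup = BeautifulSoup(HTML, 'html.parser')
following = nodes_after(soup, 'span:nth-child(3)')
print(following)
```

The selector only locates the anchor element; collecting the text nodes around it is left to the regular bs4 tree API.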

@facelessuser facelessuser added the T: support Support request. label Mar 30, 2019
@yjqiang
Contributor Author

yjqiang commented Mar 30, 2019

But here is my problem: I want to read ebooks with Python, which means I need to get and filter all the text nodes of a web page. I also want to use a configuration file to set the rules for different websites. So your idea might not be good enough for that, since for many other websites I only need a single select.

@facelessuser
Owner

Unfortunately, selectors only return elements. You can leverage this with additional logic to get desired text nodes, but selectors won't return text nodes directly.

I don't know enough about your project to make suggestions on approach, but it appears selectors are working as expected.
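
As one possible shape for that additional logic, here is a hedged sketch of a per-site rule that uses a selector to find an anchor element and then collects the text nodes after it. The `RULES` dictionary, the `text_after_anchor` function, and the sample HTML are all hypothetical and only stand in for a real configuration file and page:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical per-site configuration: one anchor selector per website.
RULES = {
    'example-site': {'anchor': 'table[align="center"]'},
}

# Made-up page layout: chapter text sits as bare text nodes after a table.
HTML = """
<div>
<table align="center"><tr><td>nav</td></tr></table>
First paragraph of the chapter.
<br/>
Second paragraph of the chapter.
<!-- ad slot -->
</div>
"""


def text_after_anchor(soup, rule):
    """Collect the non-empty text nodes that follow the rule's anchor element."""
    anchor = soup.select_one(rule['anchor'])
    if anchor is None:
        return []
    texts = []
    for node in anchor.next_siblings:
        # Comment is a NavigableString subclass, so exclude it explicitly.
        if isinstance(node, NavigableString) and not isinstance(node, Comment):
            stripped = node.strip()
            if stripped:
                texts.append(stripped)
    return texts


soup = BeautifulSoup(HTML, 'html.parser')
paragraphs = text_after_anchor(soup, RULES['example-site'])
print(paragraphs)
```

The selector stays in the configuration, while the text-node handling lives in ordinary Python, so each site still needs only one selector.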

@yjqiang
Contributor Author

yjqiang commented Mar 30, 2019

> Unfortunately, selectors only return elements. You can leverage this with additional logic to get desired text nodes, but selectors won't return text nodes directly.
>
> I don't know enough about your project to make suggestions on approach, but it appears selectors are working as expected.

Got it, thanks. But would you consider supporting this in the future? (I did check the bs4 and CSS documentation, though: in CSS, selectors are patterns used to select the element(s) you want to style.)

@facelessuser
Owner

Unfortunately, selecting text nodes doesn't make sense with the selectors, so I don't have plans to implement it.


2 participants