-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Accomodate special case : Title attribute missing from rss #17
Conversation
Hi, thanks for your contribution! The reason why there's no checks in that title, description and link fields are required fields by RSS specification |
Oh yes definitely! It would help parse non-standard rss files like this one. While it seems to be standard in the most part, the first item was making my program crash when it had to parse it because of the lack of title given. |
Nice! Would you want to create a separate PR or force-push this one to update the logic and add this feature? |
I think I will update this PR when I'll have time. 👍 |
@BBArikL Have you gotten around to this? Otherwise I could implement this the coming week. |
Hello! I have been quite busy with life since last time. I should be way more free after this thursday and I'll try update the branch somewhere in the next weekend. Thank you for reminding me of it, I'll add a reminder so I do not forget :) |
…e additional entries.
@dhvcc @Thewildweb Here is the commit that I promised! I added some whitespaces to help the code review and added a functionality to have additional entries to the RSS scraped. Let me know what you think! |
To add a field, call parse() with entries equal to a list of fields the scraper should look for.
Then you can retrieve the information (let's say for the first item) by callying:
And now the variable |
Hi @BBArikL @Thewildweb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, but there are couple points which need to be considered I think
rss_parser/_parser.py
Outdated
"publish_date": getattr(item.pubDate, "text", ""), | ||
"category": getattr(item.category, "text", ""), | ||
"description": description_soup.text, | ||
"title": getattr(getattr(item, "title", ""), "text", ""), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be we need to to something about those double getattr
s. Perhaps move them into a separate function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I'll do that!
rss_parser/_parser.py
Outdated
default: str, | ||
item_dict: Optional[str] = None, | ||
default_dict: Optional[str] = None, | ||
item: object, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think black uses 4 spaces instead of 8, let's try to not cause any extra diffs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I did not, see that I put more spaces, I will fix this in the next commit.
@BBArikL please fix linting issues |
That last commit should fix the linting issues 😅 . First time working with strict code checkers. |
When parsing a rss file, it might be possible for the item to not have any title property. I added a check for the title to be sure it does not crash any user's application. When there is no title found in the rss, the "title" property is changed to a empty string.