Accomodate special case : Title attribute missing from rss #17

Merged
merged 5 commits into from
Jan 13, 2022

BBArikL
Contributor

@BBArikL BBArikL commented Nov 18, 2021

When parsing an RSS file, an item may not have a title property. I added a check for the title so that its absence does not crash a user's application. When no title is found in the RSS, the "title" property is set to an empty string.

@dhvcc
Owner

dhvcc commented Nov 18, 2021

Hi, thanks for your contribution! The reason there are no checks is that the title, description and link fields are required by the RSS specification.
On the other hand, it may be useful to implement an "optional" mode where no field is mandatory.

@BBArikL
Contributor Author

BBArikL commented Nov 18, 2021

Oh yes, definitely! It would help parse non-standard RSS files like this one. While it seems to be standard for the most part, its first item has no title, which made my program crash when parsing it.

@dhvcc
Owner

dhvcc commented Nov 19, 2021

> Oh yes, definitely! It would help parse non-standard RSS files like this one. While it seems to be standard for the most part, its first item has no title, which made my program crash when parsing it.

Nice! Would you want to create a separate PR or force-push this one to update the logic and add this feature?

@BBArikL
Contributor Author

BBArikL commented Nov 19, 2021

I think I will update this PR when I have time. 👍

@Thewildweb

@BBArikL Have you gotten around to this? Otherwise, I could implement this in the coming week.

@BBArikL
Contributor Author

BBArikL commented Dec 12, 2021

Hello! I have been quite busy with life since last time. I should be much freer after this Thursday, and I'll try to update the branch sometime next weekend. Thank you for reminding me; I'll add a reminder so I do not forget :)

@BBArikL
Contributor Author

BBArikL commented Dec 18, 2021

@dhvcc @Thewildweb Here is the commit that I promised! I added some whitespace to help the code review and added the ability to scrape additional entries from the RSS. Let me know what you think!

@BBArikL
Contributor Author

BBArikL commented Dec 19, 2021

To add a field, call parse() with entries equal to a list of fields the scraper should look for.
Let's say for a field 'author':

parser = Parser(xml=someRSSSite)

feed = parser.parse(entries=["author"])

Then you can retrieve the information (let's say for the first item) by calling:

item = feed[0]

author = item.other["author"]

The variable author now contains the author value from the RSS, or an empty string if no value was set.
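
For readers who want to see the idea in isolation, here is a minimal, self-contained sketch of how extra entries could be collected into an `other` mapping with an empty-string fallback. It is not the code from this PR: it uses the standard library's `xml.etree.ElementTree` instead of the library's own parser, and the `parse_items` function and the sample feed are hypothetical names made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed: the second item has no <author> element.
RSS = """<rss version="2.0"><channel><title>Example</title>
<item><title>First</title><author>alice@example.com</author></item>
<item><title>Second</title></item>
</channel></rss>"""


def parse_items(xml_text, entries=None):
    """Illustrative stand-in for parse(entries=[...]); not the library's API."""
    entries = entries or []
    root = ET.fromstring(xml_text)
    items = []
    for node in root.iter("item"):
        items.append(
            {
                # findtext() returns the default when the child element is absent
                "title": node.findtext("title", default=""),
                # one key per requested extra field, "" when it is missing
                "other": {name: node.findtext(name, default="") for name in entries},
            }
        )
    return items


feed = parse_items(RSS, entries=["author"])
print(feed[0]["other"]["author"])
print(repr(feed[1]["other"]["author"]))
```

The point of the design is that a caller can always index `other["author"]` without checking whether the feed supplied that element.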

@dhvcc
Owner

dhvcc commented Dec 28, 2021

Hi @BBArikL @Thewildweb
Sorry for not updating on the issue, I've had a lot of work. I'll try to review and sort this out before New Year's. Happy holidays!

Owner

@dhvcc dhvcc left a comment

Looks good to me, but there are a couple of points that need to be considered, I think.

"publish_date": getattr(item.pubDate, "text", ""),
"category": getattr(item.category, "text", ""),
"description": description_soup.text,
"title": getattr(getattr(item, "title", ""), "text", ""),
Owner

Maybe we need to do something about those double getattrs. Perhaps move them into a separate function?

Contributor Author

Right, I'll do that!
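
As a rough illustration of that suggestion, the nested `getattr` calls could be folded into one small helper along these lines. This is only a sketch: the name `get_text` and the `SimpleNamespace` stand-ins for parsed items are assumptions for the example, not the function that was eventually merged.

```python
from types import SimpleNamespace


def get_text(item, attribute, default=""):
    """Return item.<attribute>.text, or default if either level is missing.

    Hypothetical helper replacing getattr(getattr(item, name, ""), "text", "").
    """
    value = getattr(item, attribute, None)
    # getattr on None (or any object without .text) falls back to the default
    return getattr(value, "text", default)


# Stand-in for a parsed item: a title with text, and a category that is None.
item = SimpleNamespace(title=SimpleNamespace(text="Hello"), category=None)

print(get_text(item, "title"))
print(repr(get_text(item, "category")))
print(repr(get_text(item, "pubDate")))
```

With a helper like this, each entry in the item dictionary becomes a single call, for example `"title": get_text(item, "title")`.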

rss_parser/_parser.py
default: str,
item_dict: Optional[str] = None,
default_dict: Optional[str] = None,
item: object,
Owner

I think black uses 4 spaces instead of 8, let's try to not cause any extra diffs

Contributor Author

Oh, I did not see that I put in more spaces. I will fix this in the next commit.

@dhvcc
Owner

dhvcc commented Jan 11, 2022

@BBArikL please fix linting issues

@BBArikL
Contributor Author

BBArikL commented Jan 11, 2022

That last commit should fix the linting issues 😅. First time working with strict code checkers.

@dhvcc dhvcc merged commit f48f72e into dhvcc:master Jan 13, 2022
3 participants