Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split sitemap #619

Merged
merged 11 commits into from
Mar 14, 2019
Merged

Split sitemap #619

merged 11 commits into from
Mar 14, 2019

Conversation

sebastien-mariaux
Copy link
Contributor

As discussed on issue #562, this addition opens the possibility to split the sitemap when the number of pages exceed Google limitation for SEO.

In that case a main sitemap is created with links to the sub-sitemaps.

The max number of urls in a given sitemap should be set in the .toml config file (relevant section is added in the documentation).

@sebastien-mariaux sebastien-mariaux changed the title Next Split sitemap Feb 20, 2019

/// Maximum number of urls in the sitemap.xml file.
/// If number of pages exceed that limit, several sitemaps site are created, referenced in a main sitemap.xml
pub sitemap_limit: usize,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that need to be user defined or can we just always break on ~30k?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it could be hard-coded if you are looking to keep it simple.
The all point is to comply with Google requirements (50000 urls and max 50Mb), so 30K seems safe enough (With unrealistically long urls I get to 35Mb).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems unlikely people would want to actually configure that so I would just hardcode it.

@@ -366,6 +366,8 @@ pub fn after_template_change(site: &mut Site, path: &Path) -> Result<()> {
match filename {
"sitemap.xml" => site.render_sitemap(),
"rss.xml" => site.render_rss_feed(site.library.read().unwrap().pages_values(), None),
"multi_sitemap.xml" => site.render_sitemap(),
"main_sitemap.xml" => site.render_sitemap(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe change the names to be more explicit, split_sitemap.xml and split_sitemap_index.xml.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes,I will do that.

let mut chunk_context = context.clone();
chunk_context.insert("chunk", &chunk);
let sitemap = &render_template("multi_sitemap.xml", &self.tera, chunk_context, &self.config.theme)?;
let file_name:String = format!("sitemap{}.xml", i+1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need those types, Rust will infer them

// Slice the vector in chunks and create sitemap file for each
let mut xml_files = Vec::new();
for (i, chunk) in all_sitemap_entries.chunks(self.config.sitemap_limit).enumerate() {
let mut chunk_context = context.clone();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is VERY expensive right there

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of doing that, can you delay inserting into the context?
In short, do an early return of the fn if the number of entries is lower than the sitemap_limit where you can insert into the context.

Then you can do a big vec of sitemap entries and chunk that, only creating a context when you need to render one.

I'm also thinking whether we can do a breaking change and have only one entry in context entries that would simplify things. since there's no need to differentiate between pages/sections/taxonomies there afaik. But that's something I would do afterwards, don't worry about it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave it a try, did I understand you right?
thanks for your comments.

let file_name:String = format!("sitemap{}.xml", i+1);
create_file(&self.output_path.join(&file_name), sitemap)?;
let mut xml_link:String = self.config.make_permalink(&file_name);
xml_link.pop(); // Remove trailing slash
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

Copy link
Contributor Author

@sebastien-mariaux sebastien-mariaux Feb 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xml_link is a file url, and the trailing slash makes it looks like a folder, therefore Google can't read (and index) it.
eg: https://sitemap1.xml/

There might be a more elegant and safe way to do it though..

create_file(&self.output_path.join("sitemap.xml"), sitemap)?;
// Create main sitemap
let mut main_context = context.clone();
main_context.insert("xml_files", &xml_files);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about xml_files, how about sitemaps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a "sitemap" variable just above so it might be confusing:
let sitemap = &render_template("split_sitemap.xml", &self.tera, context, &self.config.theme)?;
Maybe "sub_sitemap".

context.insert("sections", &sections);
context.insert("taxonomies", &taxonomies);
let sitemap = &render_template("sitemap.xml", &self.tera, context, &self.config.theme)?;
create_file(&self.output_path.join("sitemap.xml"), sitemap)?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can return that directly and skip the indentation in else

let mut context = Context::new();
context.insert("pages", &pages);
context.insert("sections", &sections);
context.insert("taxonomies", &taxonomies);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to have the same context variables here than when rendering a chunked sitemaps, otherwise we cannot use the same template.


create_file(&self.output_path.join("sitemap.xml"), sitemap)?;
// Count total number of urls to include in sitemap
let total_number = pages.len() + sections.len() + taxonomies.len();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just concatenate those variables basically

@Keats Keats merged commit 2a0d0b9 into getzola:next Mar 14, 2019
@Keats
Copy link
Collaborator

Keats commented Mar 14, 2019

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants