-
Notifications
You must be signed in to change notification settings - Fork 950
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split sitemap #619
Split sitemap #619
Conversation
components/config/src/config.rs
Outdated
|
||
/// Maximum number of urls in the sitemap.xml file. | ||
/// If number of pages exceed that limit, several sitemaps site are created, referenced in a main sitemap.xml | ||
pub sitemap_limit: usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that need to be user defined or can we just always break on ~30k?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it could be hard-coded if you are looking to keep it simple.
The all point is to comply with Google requirements (50000 urls and max 50Mb), so 30K seems safe enough (With unrealistically long urls I get to 35Mb).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems unlikely people would want to actually configure that so I would just hardcode it.
components/rebuild/src/lib.rs
Outdated
@@ -366,6 +366,8 @@ pub fn after_template_change(site: &mut Site, path: &Path) -> Result<()> { | |||
match filename { | |||
"sitemap.xml" => site.render_sitemap(), | |||
"rss.xml" => site.render_rss_feed(site.library.read().unwrap().pages_values(), None), | |||
"multi_sitemap.xml" => site.render_sitemap(), | |||
"main_sitemap.xml" => site.render_sitemap(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe change the names to be more explicit, split_sitemap.xml
and split_sitemap_index.xml
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes,I will do that.
components/site/src/lib.rs
Outdated
let mut chunk_context = context.clone(); | ||
chunk_context.insert("chunk", &chunk); | ||
let sitemap = &render_template("multi_sitemap.xml", &self.tera, chunk_context, &self.config.theme)?; | ||
let file_name:String = format!("sitemap{}.xml", i+1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need those types, Rust will infer them
components/site/src/lib.rs
Outdated
// Slice the vector in chunks and create sitemap file for each | ||
let mut xml_files = Vec::new(); | ||
for (i, chunk) in all_sitemap_entries.chunks(self.config.sitemap_limit).enumerate() { | ||
let mut chunk_context = context.clone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is VERY expensive right there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of doing that, can you delay inserting into the context?
In short, do an early return of the fn if the number of entries is lower than the sitemap_limit
where you can insert into the context.
Then you can do a big vec of sitemap entries and chunk that, only creating a context when you need to render one.
I'm also thinking whether we can do a breaking change and have only one entry in context entries
that would simplify things. since there's no need to differentiate between pages/sections/taxonomies there afaik. But that's something I would do afterwards, don't worry about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave it a try, did I understand you right?
thanks for your comments.
components/site/src/lib.rs
Outdated
let file_name:String = format!("sitemap{}.xml", i+1); | ||
create_file(&self.output_path.join(&file_name), sitemap)?; | ||
let mut xml_link:String = self.config.make_permalink(&file_name); | ||
xml_link.pop(); // Remove trailing slash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xml_link is a file url, and the trailing slash makes it looks like a folder, therefore Google can't read (and index) it.
eg: https://sitemap1.xml/
There might be a more elegant and safe way to do it though..
components/site/src/lib.rs
Outdated
create_file(&self.output_path.join("sitemap.xml"), sitemap)?; | ||
// Create main sitemap | ||
let mut main_context = context.clone(); | ||
main_context.insert("xml_files", &xml_files); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about xml_files
, how about sitemaps
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is a "sitemap" variable just above so it might be confusing:
let sitemap = &render_template("split_sitemap.xml", &self.tera, context, &self.config.theme)?;
Maybe "sub_sitemap".
context.insert("sections", §ions); | ||
context.insert("taxonomies", &taxonomies); | ||
let sitemap = &render_template("sitemap.xml", &self.tera, context, &self.config.theme)?; | ||
create_file(&self.output_path.join("sitemap.xml"), sitemap)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can return that directly and skip the indentation in else
components/site/src/lib.rs
Outdated
let mut context = Context::new(); | ||
context.insert("pages", &pages); | ||
context.insert("sections", §ions); | ||
context.insert("taxonomies", &taxonomies); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to have the same context variables here than when rendering a chunked sitemaps, otherwise we cannot use the same template.
components/site/src/lib.rs
Outdated
|
||
create_file(&self.output_path.join("sitemap.xml"), sitemap)?; | ||
// Count total number of urls to include in sitemap | ||
let total_number = pages.len() + sections.len() + taxonomies.len(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would just concatenate those variables basically
Thanks! |
As discussed on issue #562, this addition opens the possibility to split the sitemap when the number of pages exceed Google limitation for SEO.
In that case a main sitemap is created with links to the sub-sitemaps.
The max number of urls in a given sitemap should be set in the .toml config file (relevant section is added in the documentation).