feat: hashmap based caching #82

Merged
zkochan merged 80 commits into main from hashmap-based-caching on Aug 24, 2023

Conversation

Contributor

KSXGitHub commented Aug 15, 2023

Before this PR, pacquet install installed packages in "generations": all direct dependencies were installed first, then the dependencies of the direct dependencies, then the dependencies of those, and so on. This was not optimal, but it avoided a race condition in which a package could be fetched twice if it was requested while an earlier fetch of the same package had begun but had not yet been saved to local disk.

This PR changes that by using an in-memory hashmap cache to keep track of all fetched packages, which allows full parallelization to be re-enabled.

NOTE: This PR only affects pacquet install; pacquet add still uses the old algorithm.

NOTE: New tests with multiple packages for pacquet install are required, but I am too tired right now.
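
The idea in the description can be sketched roughly as follows. This is a minimal illustration, not pacquet's actual implementation: it uses standard-library threads and `OnceLock` in place of tokio tasks, and the names `FetchCache`, `get_or_fetch`, and `fetch_concurrently` are hypothetical. It shows how an in-memory map keyed by package can ensure that concurrent requests for the same package trigger only one fetch.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, OnceLock};
use std::thread;

/// Hypothetical in-memory cache: one shared cell per package key.
#[derive(Default)]
struct FetchCache {
    entries: Mutex<HashMap<String, Arc<OnceLock<String>>>>,
}

impl FetchCache {
    /// Returns the cached value for `key`, running `fetch` at most once per key.
    fn get_or_fetch(&self, key: &str, fetch: impl FnOnce() -> String) -> String {
        // Hold the map lock only long enough to find or create the cell.
        let mut entries = self.entries.lock().unwrap();
        let cell: Arc<OnceLock<String>> = entries.entry(key.to_string()).or_default().clone();
        drop(entries);
        // Callers racing on the same cell block until the single initializer finishes.
        cell.get_or_init(fetch).clone()
    }
}

/// Requests the same key from `workers` threads and reports how many fetches actually ran.
fn fetch_concurrently(key: &str, workers: usize) -> usize {
    let cache = Arc::new(FetchCache::default());
    let fetch_count = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let cache = Arc::clone(&cache);
            let fetch_count = Arc::clone(&fetch_count);
            let key = key.to_string();
            thread::spawn(move || {
                let _contents = cache.get_or_fetch(&key, || {
                    // Stand-in for downloading and extracting a tarball.
                    fetch_count.fetch_add(1, Ordering::SeqCst);
                    format!("contents of {key}")
                });
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    fetch_count.load(Ordering::SeqCst)
}
```

With this shape, every dependency can be requested as soon as it is discovered, regardless of its depth in the tree, because duplicate requests collapse onto the same cache entry.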

.join(&package_version.name);

if let Some(mut receiver) = package_cache.get_mut(&saved_path) {
// TODO: is this loop necessary?
Member

If we use a Mutex<Receiver<PackageState>> we don't have to have a loop here. Similar to: https://medium.com/@polyglot_factotum/rust-concurrency-patterns-condvars-and-locks-e278f18db74f

Contributor Author

This is not what my question was referring to. I am wondering whether I should remove the whole code block inside the loop (leaving only return Ok(())) or keep it. This will only become clear after further work, hence the TODO.

Member

If the package is InProgress we need to wait, since after this function, we call package import functions to link the store files with node_modules.
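
As a rough illustration of waiting on an in-progress package without a hand-written loop, here is a sketch using a standard-library `Mutex` and `Condvar`. It is an assumption-laden stand-in, not the PR's code: `PackageState` and `InProgress` appear in the discussion above, while the `Available` variant, the `Slot` type, and the function names are hypothetical, and threads are used in place of tokio tasks.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

#[derive(Clone, Copy, Debug, PartialEq)]
enum PackageState {
    InProgress,
    Available, // hypothetical name for "fetched and saved to the store"
}

/// Hypothetical per-package slot: the state plus a condition variable to wake waiters.
struct Slot {
    state: Mutex<PackageState>,
    ready: Condvar,
}

impl Slot {
    fn new() -> Self {
        Slot { state: Mutex::new(PackageState::InProgress), ready: Condvar::new() }
    }

    /// Called by the fetcher once the package is in the store.
    fn mark_available(&self) {
        *self.state.lock().unwrap() = PackageState::Available;
        self.ready.notify_all();
    }

    /// Blocks the caller until the state is no longer InProgress.
    fn wait_until_available(&self) -> PackageState {
        let guard = self.state.lock().unwrap();
        let guard = self
            .ready
            .wait_while(guard, |state| *state == PackageState::InProgress)
            .unwrap();
        *guard
    }
}

/// One thread "fetches" while the caller waits before moving on to the import step.
fn fetch_then_import() -> PackageState {
    let slot = Arc::new(Slot::new());
    let fetcher = {
        let slot = Arc::clone(&slot);
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(20)); // stand-in for the download
            slot.mark_available();
        })
    };
    let seen = slot.wait_until_available();
    fetcher.join().unwrap();
    seen
}
```

The waiter only proceeds to linking once the fetcher has signalled completion, which is the ordering the comment above asks for.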

Ok(cas_files)
})
.await?
verify_checksum(&response, package_integrity)?;
Member

Why did we remove the tokio spawn here? Tarball extraction is CPU-bound, so wouldn't doing it on a different thread help?

Contributor Author

KSXGitHub Aug 15, 2023

We were going to .await the handle right afterward. So even if we spawn another thread, the current thread will just wait for the spawned thread to complete. It would be just like not spawning any thread at all.
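
The point about awaiting a handle immediately can be illustrated with standard-library threads. This is only an analogy under stated assumptions: `thread::spawn` and `join` stand in for `tokio::spawn` and `.await`, and `spawn_then_join` is a hypothetical name. In tokio the awaiting task yields instead of blocking an OS thread, but the caller's own code still resumes only after the spawned work has finished.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Spawns work on another thread and joins it right away, recording the order of events.
fn spawn_then_join() -> Vec<&'static str> {
    let log = Arc::new(Mutex::new(Vec::new()));
    let worker_log = Arc::clone(&log);

    let handle = thread::spawn(move || {
        // Stand-in for the CPU-bound tarball extraction.
        worker_log.lock().unwrap().push("extract tarball");
    });

    // Joining immediately means the caller makes no progress of its own
    // until the worker is done, so nothing overlaps from the caller's view.
    handle.join().unwrap();
    log.lock().unwrap().push("caller resumes");

    let events = log.lock().unwrap().clone();
    events
}
```

Spawning pays off only when the caller does something else between creating the handle and waiting on it.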

codecov bot commented Aug 15, 2023

Codecov Report

Patch coverage: 78.98% and project coverage change: -1.91% ⚠️

Comparison is base (022ca6e) 85.82% compared to head (c386708) 83.92%.
Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #82      +/-   ##
==========================================
- Coverage   85.82%   83.92%   -1.91%     
==========================================
  Files          24       24              
  Lines        1270     1306      +36     
==========================================
+ Hits         1090     1096       +6     
- Misses        180      210      +30     
Files Changed                         Coverage Δ
crates/cli/src/lib.rs                 43.42% <0.00%> (-0.58%) ⬇️
crates/npmrc/src/lib.rs               94.82% <ø> (ø)
crates/registry/src/lib.rs            0.00% <0.00%> (ø)
crates/tarball/src/lib.rs             79.54% <48.48%> (-19.40%) ⬇️
crates/cli/src/package_import.rs      72.72% <60.00%> (-13.39%) ⬇️
crates/cli/src/package_manager.rs     77.77% <80.00%> (-7.94%) ⬇️
crates/registry/src/package.rs        89.61% <81.81%> (+0.13%) ⬆️
crates/cli/src/commands/install.rs    96.03% <95.00%> (-1.51%) ⬇️
crates/cafs/src/lib.rs                98.03% <100.00%> (ø)
crates/cli/src/commands/add.rs        94.77% <100.00%> (+0.32%) ⬆️
... and 3 more


github-actions bot commented Aug 15, 2023

Benchmark Results

Linux

group                          main                                   pr
-----                          ----                                   --
tarball/download_dependency    1.00      9.7±0.47ms   445.8 KB/sec    1.02      9.9±0.58ms   438.2 KB/sec

Member

zkochan commented Aug 21, 2023

Could you add a description of what this PR is about?

KSXGitHub
Contributor Author

Could you add a description of what this PR is about?

done

KSXGitHub marked this pull request as ready for review August 22, 2023 08:11
KSXGitHub requested a review from anonrig August 22, 2023 08:11
.join("node_modules")
.join(&package_version.name);

// TODO: skip when it already exists in store?
Member

Do you mean the virtual store at node_modules/.pnpm?

Member

This is the global store.


let saved_path = config
Member

I think import_destination or import_dest would be a better name for the variable.

However, the argument name in the function is save_path, so it is better to use the same name, not "saveD_path".

Member

I agree.

@@ -3,7 +3,7 @@ use serde::{Deserialize, Serialize};
#[derive(Serialize, Deserialize, Debug, Default, Clone, Eq)]
#[serde(rename_all = "camelCase")]
pub struct PackageDistribution {
- pub integrity: String,
+ pub integrity: String, // TODO: why fetching typescript cause error here?
Member

Will we merge the PR with TODO items? It is probably best to open issues instead of adding TODO comments.

Member

I personally prefer code todos as well as issues. It makes it easier to navigate through the codebase.

Contributor Author

#92

}
}
drop(cache_lock);
sleep(Duration::from_millis(100)).await; // TODO: millis can be any small number, even 0, further testing is required to find the ideal number
Member

Is it OK to use sleep for this? Looks like a workaround.

Member

Why do we even need sleep in here? We need more comments on the code in general.
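
One reading of the `drop(cache_lock)` followed by `sleep` in the snippet above, offered as an assumption rather than the author's stated reasoning: the waiter polls the shared state, and it must release the lock before sleeping so that whoever is fetching can take the lock and record the result. A minimal standard-library sketch of that polling shape, with hypothetical names (`poll_until_ready`, `demo_polling`) and a plain `bool` standing in for the package state:

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Polls `ready` until it becomes true; returns how many times it looked.
fn poll_until_ready(ready: &Mutex<bool>, interval: Duration) -> usize {
    let mut polls = 0;
    loop {
        polls += 1;
        let guard = ready.lock().unwrap();
        if *guard {
            return polls;
        }
        // Release the lock before sleeping; sleeping while holding it
        // would keep the writer from ever setting the flag.
        drop(guard);
        thread::sleep(interval);
    }
}

/// A writer flips the flag after a short delay while the caller polls.
fn demo_polling() -> usize {
    let ready = Arc::new(Mutex::new(false));
    let writer = {
        let ready = Arc::clone(&ready);
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(20)); // stand-in for the fetch
            *ready.lock().unwrap() = true;
        })
    };
    let polls = poll_until_ready(&ready, Duration::from_millis(5));
    writer.join().unwrap();
    polls
}
```

Polling like this trades latency against wasted wake-ups depending on the interval, which is why a notification-based wait is usually preferred when one is available.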

&self.http_client,
name,
version,
&self.config.modules_dir,
Member

We are already passing self.config to this function. We don't need this, do we?

@@ -30,13 +29,18 @@ impl PackageVersion {
http_client: &reqwest::Client,
registry: &str,
) -> Result<Self, RegistryError> {
let url = || format!("{0}{name}/{version}", &registry); // TODO: use reqwest url type
let network_error = |error| NetworkError { error, url: url() };

http_client
.get(format!("{0}{name}/{version}", &registry))
Member

This line needs to be updated to use local variable url.

@@ -30,13 +29,18 @@ impl PackageVersion {
http_client: &reqwest::Client,
registry: &str,
) -> Result<Self, RegistryError> {
let url = || format!("{0}{name}/{version}", &registry); // TODO: use reqwest url type
Member

Why do we need this todo?

Contributor Author

The current type of registry is &str, and it only works correctly when it ends with a /. The Url type is more efficient and less error-prone.
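
The trailing-slash pitfall mentioned above can be shown with plain string formatting. This is a sketch only: `naive_url` mirrors the concatenation in the diff, while `package_url` is a hypothetical helper that tolerates a missing slash. It is not a substitute for a real URL type, which would also handle parsing and validation.

```rust
/// Mirrors the concatenation in the diff: correct only if `registry` ends with '/'.
fn naive_url(registry: &str, name: &str, version: &str) -> String {
    format!("{registry}{name}/{version}")
}

/// Hypothetical helper that accepts the registry with or without a trailing slash.
fn package_url(registry: &str, name: &str, version: &str) -> String {
    let base = registry.trim_end_matches('/');
    format!("{base}/{name}/{version}")
}
```

For example, `naive_url("https://registry.npmjs.org", "lodash", "4.17.21")` fuses the host and the package name into `https://registry.npmjs.orglodash/4.17.21`, whereas `package_url` produces the same well-formed result for both spellings of the registry.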

serde = { workspace = true }
serde_json = { workspace = true }
tokio = { workspace = true }
async-recursion = { workspace = true }
Member

Are we using async-recursion? I couldn't find any reference.

Contributor Author

#[async_recursion]
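
For readers unfamiliar with the attribute: a recursive async function needs its future boxed, and the async-recursion crate's attribute performs that rewrite automatically. The sketch below does the boxing by hand using only the standard library, so it is an illustration of the underlying technique rather than the crate's generated code. `Package`, `count_packages`, the tiny `block_on` executor, and `count_sample_tree` are all hypothetical names.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical dependency tree node.
struct Package {
    dependencies: Vec<Package>,
}

/// Recursive async walk; the future is boxed so its type has a known size.
fn count_packages(package: &Package) -> Pin<Box<dyn Future<Output = usize> + '_>> {
    Box::pin(async move {
        let mut total = 1;
        for dependency in &package.dependencies {
            total += count_packages(dependency).await;
        }
        total
    })
}

/// Waker that does nothing; enough for futures that never actually suspend.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor for the demo (a real program would use a runtime).
fn block_on<F: Future>(future: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut context = Context::from_waker(&waker);
    let mut future = pin!(future);
    loop {
        if let Poll::Ready(value) = future.as_mut().poll(&mut context) {
            return value;
        }
    }
}

/// Builds a root with two dependencies, one of which has a dependency of its own.
fn count_sample_tree() -> usize {
    let leaf = || Package { dependencies: Vec::new() };
    let tree = Package {
        dependencies: vec![leaf(), Package { dependencies: vec![leaf()] }],
    };
    let total = block_on(count_packages(&tree));
    total
}
```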

zkochan merged commit 9c94551 into main Aug 24, 2023
15 of 17 checks passed
zkochan deleted the hashmap-based-caching branch August 24, 2023 00:05