Add support for checkpoint/restart
Rombur committed Feb 22, 2024
1 parent 7b3b3d5 commit 9316081
Showing 12 changed files with 562 additions and 116 deletions.
18 changes: 12 additions & 6 deletions README.md
@@ -184,13 +184,13 @@ The following options are available:
(default value: 100)
* newton\_tolerance: tolerance of the Newton solver (default value: 1e-6)
* jfnk: use Jacobian-Free Newton Krylov method (default value: false)
* experiment: (optional)
* experiment (optional):
* read\_in\_experimental\_data: whether to read in experimental data (default: false)
* if reading in experimental data:
* file: format of the file names. The format is pretty arbitrary, the keywords \#frame
and \#camera are replaced by the frame and the camera number. The format of
the file itself should be csv with a header line. (required)
* format: The format of the experimental data, either `point_cloud`, with (x,y,z,value) per line, or `ray`, with (pt0_x,pt0_y,pt0_z,pt1_x,pt1_y,pt1_z,value) per line, where the ray starts at pt0 and passes through pt1. (required)
* format: The format of the experimental data, either `point_cloud`, with (x,y,z,value) per line, or `ray`, with (pt0\_x,pt0\_y,pt0\_z,pt1\_x,pt1\_y,pt1\_z,value) per line, where the ray starts at pt0 and passes through pt1. (required)
* first\_frame: number associated to the first frame (default value: 0)
* last\_frame: number associated to the last frame (required)
* first\_camera\_id: number associated to the first camera (required)
@@ -203,27 +203,33 @@ The following options are available:
given by a standard deviation (under the simplifying assumption that the error is normally
distributed and independent for each data point) (default value: 0.0).
* output\_experiment\_on\_mesh: Whether to output the experimental data projected onto the simulation mesh at each experiment time stamp (default: true).
* ensemble: (optional)
* ensemble (optional):
* ensemble\_simulation: whether to perform an ensemble of simulations (default value: false)
* ensemble\_size: the number of ensemble members for the ensemble Kalman filter (EnKF) (default value: 5)
* initial\_temperature\_stddev: the standard deviation for the initial temperature of the material (default value: 0.0)
* new\_material\_temperature\_stddev: the standard deviation for the temperature of material added during the process (default value: 0.0)
* beam\_0\_max\_power\_stddev: the standard deviation for the max power for beam 0 (if it exists) (default value: 0.0)
* beam\_0\_absorption\_efficiency\_stddev: the standard deviation for the absorption efficiency for beam 0 (if it exists) (default value: 0.0)
* data\_assimilation: (optional)
* data\_assimilation (optional):
* assimilate\_data: whether to perform data assimilation (default value: false)
* localization\_cutoff\_function: the function used to decrease the sample covariance as the relevant points become farther away: gaspari\_cohn, step\_function, none (default: none)
* localization\_cutoff\_distance: the distance at which sample covariance entries are set to zero (default: infinity)
* augment\_with\_beam\_0\_absorption: whether to augment the state vector with the beam 0 absorption efficiency (default: false)
* augment\_with\_beam\_0\_max_power: whether to augment the state vector with the beam 0 max power (default: false)
* augment\_with\_beam\_0\_max\_power: whether to augment the state vector with the beam 0 max power (default: false)
* solver:
* max\_number\_of\_temp\_vectors: maximum number of temporary vectors for the GMRES solve (optional)
* max\_iterations: maximum number of iterations for the GMRES solve (optional)
* convergence\_tolerance: convergence tolerance for the GMRES solve (optional)
* profiling (optional):
* timer: output timing information (default value: false)
* caliper: configuration string for Caliper (optional)
* verbose_output: true or false (default value: false)
* checkpoint (optional):
* time\_steps\_between\_checkpoint: number of time steps after which
checkpointing is performed (required)
* filename\_prefix: prefix of the checkpoint files (required)
* restart (optional):
* filename\_prefix: prefix of the restart files (required)
* verbose\_output: true or false (default value: false)
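
The `checkpoint` and `restart` sections are optional, but every key listed inside them is marked required. A minimal sketch of that rule follows. It assumes a hypothetical `missing_required_keys` helper and a plain `std::map` standing in for a parsed section; adamantine itself reads these values through Boost.PropertyTree, so none of the names below are part of its API.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for one parsed input section (key -> raw value).
using Section = std::map<std::string, std::string>;

// Return the required keys that are absent from a section.
// The key names come from the README; the helper itself is illustrative.
std::vector<std::string>
missing_required_keys(std::string const &section_name, Section const &section)
{
  std::vector<std::string> required;
  if (section_name == "checkpoint")
    required = {"time_steps_between_checkpoint", "filename_prefix"};
  else if (section_name == "restart")
    required = {"filename_prefix"};

  std::vector<std::string> missing;
  for (auto const &key : required)
    if (section.find(key) == section.end())
      missing.push_back(key);
  return missing;
}
```

An empty result means the section can be used as is; a non-empty one names the keys the user still has to provide.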



204 changes: 180 additions & 24 deletions application/adamantine.hh
@@ -37,9 +37,14 @@
#include <deal.II/lac/vector_operation.h>
#include <deal.II/numerics/error_estimator.h>

#include <boost/archive/text_iarchive.hpp>
#include <boost/archive/text_oarchive.hpp>
#include <boost/property_tree/ptree.hpp>

#include <fstream>
#include <limits>
#include <memory>
#include <string>
#include <type_traits>
#include <unordered_map>

@@ -471,7 +476,7 @@ void refine_and_transfer(

// Update the AffineConstraints and resize the solution
thermal_physics->setup_dofs();
thermal_physics->initialize_dof_vector(solution);
thermal_physics->initialize_dof_vector(0., solution);

// Update MaterialProperty DoFHandler and resize the state vectors
material_properties.reinit_dofs();
@@ -943,20 +948,63 @@ run(MPI_Comm const &communicator, boost::property_tree::ptree const &database,
: mechanical_physics->get_dof_handler());
}

// Get the checkpoint and restart subtrees
boost::optional<boost::property_tree::ptree const &>
checkpoint_optional_database = database.get_child_optional("checkpoint");
boost::optional<boost::property_tree::ptree const &>
restart_optional_database = database.get_child_optional("restart");

unsigned int time_steps_checkpoint = std::numeric_limits<unsigned int>::max();
std::string checkpoint_filename;
if (checkpoint_optional_database)
{
auto checkpoint_database = checkpoint_optional_database.get();
// PropertyTreeInput checkpoint.time_steps_between_checkpoint
time_steps_checkpoint =
checkpoint_database.get<unsigned int>("time_steps_between_checkpoint");
// PropertyTreeInput checkpoint.filename_prefix
checkpoint_filename =
checkpoint_database.get<std::string>("filename_prefix");
}

bool restart = false;
std::string restart_filename;
if (restart_optional_database)
{
restart = true;
auto restart_database = restart_optional_database.get();
// PropertyTreeInput restart.filename_prefix
restart_filename = restart_database.get<std::string>("filename_prefix");
}

dealii::LA::distributed::Vector<double, MemorySpaceType> temperature;
dealii::LA::distributed::Vector<double, dealii::MemorySpace::Host>
displacement;
if (use_thermal_physics)
{
thermal_physics->setup_dofs();
thermal_physics->update_material_deposition_orientation();
thermal_physics->compute_inverse_mass_matrix();
thermal_physics->initialize_dof_vector(initial_temperature, temperature);
thermal_physics->get_state_from_material_properties();
if (restart == false)
{
thermal_physics->setup();
thermal_physics->initialize_dof_vector(initial_temperature, temperature);
}
else
{
#ifdef ADAMANTINE_WITH_CALIPER
CALI_MARK_BEGIN("restart from file");
#endif
thermal_physics->load_checkpoint(restart_filename, temperature);
#ifdef ADAMANTINE_WITH_CALIPER
CALI_MARK_END("restart from file");
#endif
}
}

if (use_mechanical_physics)
{
// We currently do not support restarting the mechanical simulation.
adamantine::ASSERT_THROW(
restart == false,
"Mechanical simulation cannot be restarted from a file");
if (use_thermal_physics)
{
// Thermo-mechanical simulation
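
The hunk above reads two optional subtrees and falls back to defaults when they are absent. The following is a sketch of that pattern with `std::optional` in place of `boost::optional` and small structs in place of the property tree. `CheckpointSection`, `RestartSection`, `RunSettings`, and `make_settings` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <limits>
#include <optional>
#include <string>

// Hypothetical stand-ins for the parsed "checkpoint" and "restart" subtrees.
struct CheckpointSection
{
  unsigned int time_steps_between_checkpoint;
  std::string filename_prefix;
};

struct RestartSection
{
  std::string filename_prefix;
};

struct RunSettings
{
  // max() is the value used when no checkpoint section exists, as in the diff.
  unsigned int time_steps_checkpoint =
      std::numeric_limits<unsigned int>::max();
  std::string checkpoint_filename;
  bool restart = false;
  std::string restart_filename;
};

// Fill the settings from whichever optional sections are present.
RunSettings make_settings(std::optional<CheckpointSection> const &checkpoint,
                          std::optional<RestartSection> const &restart)
{
  RunSettings settings;
  if (checkpoint)
  {
    settings.time_steps_checkpoint = checkpoint->time_steps_between_checkpoint;
    settings.checkpoint_filename = checkpoint->filename_prefix;
  }
  if (restart)
  {
    settings.restart = true;
    settings.restart_filename = restart->filename_prefix;
  }
  return settings;
}
```

Keeping the interval at the largest representable value when the section is missing means the later modulo test needs no separate "checkpointing enabled" flag.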
@@ -979,6 +1027,19 @@ run(MPI_Comm const &communicator, boost::property_tree::ptree const &database,
unsigned int n_time_step = 0;
double time = 0.;
double activation_time_end = -1.;
unsigned int const rank =
dealii::Utilities::MPI::this_mpi_process(communicator);
if (restart == true)
{
if (rank == 0)
{
std::cout << "Restarting from file" << std::endl;
}
std::ifstream file{restart_filename + "_time.txt"};
boost::archive::text_iarchive ia{file};
ia >> time;
ia >> n_time_step;
}
// PropertyTreeInput geometry.deposition_time
double const activation_time =
geometry_database.get<double>("deposition_time", 0.);
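
On restart, the simulation time and the step counter are read back from `<prefix>_time.txt`. The following round-trip sketch shows the same idea using plain stream formatting instead of `boost::archive::text_oarchive` and `text_iarchive`, so the file contents differ from what adamantine writes. `save_time` and `load_time` are hypothetical helpers.

```cpp
#include <cassert>
#include <fstream>
#include <iomanip>
#include <limits>
#include <string>

// Write the time and the step counter to "<prefix>_time.txt".
void save_time(std::string const &prefix, double time,
               unsigned int n_time_step)
{
  std::ofstream file{prefix + "_time.txt"};
  // max_digits10 significant digits let the double survive the text round trip.
  file << std::setprecision(std::numeric_limits<double>::max_digits10) << time
       << ' ' << n_time_step << '\n';
}

// Read both values back; returns false if the file is missing or malformed.
bool load_time(std::string const &prefix, double &time,
               unsigned int &n_time_step)
{
  std::ifstream file{prefix + "_time.txt"};
  return static_cast<bool>(file >> time >> n_time_step);
}
```

Returning a `bool` is a choice made for this sketch: it lets the caller report a missing restart file explicitly.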
@@ -1025,7 +1086,6 @@ run(MPI_Comm const &communicator, boost::property_tree::ptree const &database,
#endif
if ((time + time_step) > duration)
time_step = duration - time;
unsigned int rank = dealii::Utilities::MPI::this_mpi_process(communicator);

// Refine the mesh after time_steps_refinement time steps or when time
// is greater or equal than the next predicted time for refinement. This
@@ -1074,7 +1134,7 @@ run(MPI_Comm const &communicator, boost::property_tree::ptree const &database,
// activation_end.
timers[adamantine::add_material_search].start();
auto elements_to_activate = adamantine::get_elements_to_activate(
thermal_physics->get_dof_handler(), material_deposition_boxes);
thermal_physics->get_dof_handler(), material_deposition_boxes);
timers[adamantine::add_material_search].stop();

// For now assume that all deposited material has never been melted
@@ -1154,6 +1214,25 @@ run(MPI_Comm const &communicator, boost::property_tree::ptree const &database,
time_step += time_step;
}

if (n_time_step % time_steps_checkpoint == 0)
{
#ifdef ADAMANTINE_WITH_CALIPER
CALI_MARK_BEGIN("save checkpoint");
#endif
if (rank == 0)
{
std::cout << "Checkpoint reached" << std::endl;
}
thermal_physics->save_checkpoint(checkpoint_filename, temperature);
std::ofstream file{checkpoint_filename + "_time.txt"};
boost::archive::text_oarchive oa{file};
oa << time;
oa << n_time_step;
#ifdef ADAMANTINE_WITH_CALIPER
CALI_MARK_END("save checkpoint");
#endif
}

// Output progress on screen
if (rank == 0)
{
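
The checkpoint test in this hunk is a plain modulo on the step counter. The small sketch below shows which steps fire for a given interval. `is_checkpoint_step` and `checkpoint_steps` are hypothetical names, and the sketch assumes the counter has already been incremented for the current step, which the shown hunk does not confirm.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// True when a checkpoint should be written at this step.
bool is_checkpoint_step(unsigned int n_time_step,
                        unsigned int time_steps_checkpoint)
{
  // An interval of zero would make the modulo undefined.
  assert(time_steps_checkpoint > 0);
  return n_time_step % time_steps_checkpoint == 0;
}

// Collect the steps in [first_step, last_step] that trigger a checkpoint.
std::vector<unsigned int> checkpoint_steps(unsigned int first_step,
                                           unsigned int last_step,
                                           unsigned int interval)
{
  std::vector<unsigned int> steps;
  for (unsigned int n = first_step; n <= last_step; ++n)
    if (is_checkpoint_step(n, interval))
      steps.push_back(n);
  return steps;
}
```

With the default interval of `std::numeric_limits<unsigned int>::max()`, no step of a realistic run fires. A counter value of 0 satisfies the condition for any interval.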
@@ -1163,7 +1242,7 @@ run(MPI_Comm const &communicator, boost::property_tree::ptree const &database,
if (int_part > progress)
{
std::cout << int_part * 10 << '%' << " completed" << std::endl;
++progress;
progress = static_cast<unsigned int>(int_part);
}
}
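
The change from `++progress` to an assignment matters whenever `int_part` can exceed `progress` by more than one, for example when a run starts partway through after a restart. The sketch below counts the progress messages under both update rules. It assumes `int_part` is the integer part of ten times the completed fraction, which is not visible in the hunk, and `count_progress_messages` is a hypothetical helper.

```cpp
#include <cassert>
#include <cmath>

// Count the progress messages printed over steps start_step..n_steps.
// Assumption: int_part is the integer part of 10 * (fraction completed).
unsigned int count_progress_messages(unsigned int start_step,
                                     unsigned int n_steps,
                                     bool jump_to_int_part)
{
  assert(n_steps > 0 && start_step <= n_steps);
  unsigned int progress = 0;
  unsigned int messages = 0;
  for (unsigned int step = start_step; step <= n_steps; ++step)
  {
    double int_part = 0.;
    std::modf(10. * step / n_steps, &int_part);
    if (int_part > progress)
    {
      ++messages;
      if (jump_to_int_part)
        progress = static_cast<unsigned int>(int_part); // rule after the commit
      else
        ++progress; // rule before the commit
    }
  }
  return messages;
}
```

Starting from step 0 both rules print the same number of messages. Starting at 55% of the run, the increment rule prints a message on each of the first few steps while `progress` catches up, and the assignment rule prints one.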

@@ -1353,8 +1432,8 @@ run_ensemble(MPI_Comm const &global_communicator,
ensemble_database.get("ensemble_size", 5);
// Distribute the processors among the ensemble members
MPI_Comm local_communicator;
unsigned int local_ensemble_size = -1;
unsigned int first_local_member = -1;
unsigned int local_ensemble_size = std::numeric_limits<unsigned int>::max();
unsigned int first_local_member = std::numeric_limits<unsigned int>::max();
int my_color = -1;
split_global_communicator(global_communicator, global_ensemble_size,
local_communicator, local_ensemble_size,
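
Replacing `-1` with `std::numeric_limits<unsigned int>::max()` does not change the stored value. It only makes the sentinel explicit and avoids an implicit signed-to-unsigned conversion. A compile-time check of that equivalence follows; the variable names are illustrative.

```cpp
#include <cassert>
#include <limits>

// Converting -1 to an unsigned type wraps modulo 2^N, so both spellings
// yield the same sentinel; the second states the intent explicitly.
constexpr unsigned int implicit_sentinel = static_cast<unsigned int>(-1);
constexpr unsigned int explicit_sentinel =
    std::numeric_limits<unsigned int>::max();
static_assert(implicit_sentinel == explicit_sentinel,
              "same value, clearer spelling");
```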
@@ -1462,6 +1541,37 @@ run_ensemble(MPI_Comm const &global_communicator,
local_communicator, my_color,
data_assimilation_database);

// Get the checkpoint and restart subtrees
boost::optional<boost::property_tree::ptree const &>
checkpoint_optional_database = database.get_child_optional("checkpoint");
boost::optional<boost::property_tree::ptree const &>
restart_optional_database = database.get_child_optional("restart");

unsigned int const global_rank =
dealii::Utilities::MPI::this_mpi_process(global_communicator);
unsigned int time_steps_checkpoint = std::numeric_limits<unsigned int>::max();
std::string checkpoint_filename;
if (checkpoint_optional_database)
{
auto checkpoint_database = checkpoint_optional_database.get();
// PropertyTreeInput checkpoint.time_steps_between_checkpoint
time_steps_checkpoint =
checkpoint_database.get<unsigned int>("time_steps_between_checkpoint");
// PropertyTreeInput checkpoint.filename_prefix
checkpoint_filename =
checkpoint_database.get<std::string>("filename_prefix") + '_' +
std::to_string(global_rank);
}

bool restart = false;
std::string restart_filename;
if (restart_optional_database)
{
restart = true;
auto restart_database = restart_optional_database.get();
// PropertyTreeInput restart.filename_prefix
restart_filename = restart_database.get<std::string>("filename_prefix");
}
for (unsigned int member = 0; member < local_ensemble_size; ++member)
{
// Resize the augmented ensemble block vector to have two blocks
@@ -1525,18 +1635,27 @@ run_ensemble(MPI_Comm const &global_communicator,
heat_sources_ensemble[member] =
thermal_physics_ensemble[member]->get_heat_sources();

thermal_physics_ensemble[member]->setup_dofs();
thermal_physics_ensemble[member]->update_material_deposition_orientation();
thermal_physics_ensemble[member]->compute_inverse_mass_matrix();

thermal_physics_ensemble[member]->initialize_dof_vector(
initial_temperature[member],
solution_augmented_ensemble[member].block(base_state));

if (restart == false)
{
thermal_physics_ensemble[member]->setup();
thermal_physics_ensemble[member]->initialize_dof_vector(
initial_temperature[member],
solution_augmented_ensemble[member].block(base_state));
}
else
{
#ifdef ADAMANTINE_WITH_CALIPER
CALI_MARK_BEGIN("restart from file");
#endif
thermal_physics_ensemble[member]->load_checkpoint(
restart_filename + '_' + std::to_string(member),
solution_augmented_ensemble[member].block(base_state));
#ifdef ADAMANTINE_WITH_CALIPER
CALI_MARK_END("restart from file");
#endif
}
solution_augmented_ensemble[member].collect_sizes();

thermal_physics_ensemble[member]->get_state_from_material_properties();

// For now we only output temperature
post_processor_database.put("thermal_output", true);
post_processor_ensemble.push_back(
@@ -1558,11 +1677,10 @@ run_ensemble(MPI_Comm const &global_communicator,
thermal_physics_ensemble[0]->get_dof_handler());

// ----- Read the experimental data -----
unsigned int const global_rank =
dealii::Utilities::MPI::this_mpi_process(global_communicator);
std::vector<std::vector<double>> frame_time_stamps;
std::unique_ptr<adamantine::ExperimentalData<dim>> experimental_data;
unsigned int experimental_frame_index = -1;
unsigned int experimental_frame_index =
std::numeric_limits<unsigned int>::max();

if (experiment_optional_database)
{
@@ -1621,6 +1739,18 @@ run_ensemble(MPI_Comm const &global_communicator,
unsigned int n_time_step = 0;
double time = 0.;
double activation_time_end = -1.;
if (restart == true)
{
if (global_rank == 0)
{
std::cout << "Restarting from file" << std::endl;
}
std::ifstream file{restart_filename + "_time.txt"};
boost::archive::text_iarchive ia{file};
ia >> time;
ia >> n_time_step;
}

// PropertyTreeInput geometry.deposition_time
double const activation_time =
geometry_database.get<double>("deposition_time", 0.);
@@ -1997,6 +2127,32 @@ run_ensemble(MPI_Comm const &global_communicator,
}
}

// ----- Checkpoint the ensemble members -----
if (n_time_step % time_steps_checkpoint == 0)
{
#ifdef ADAMANTINE_WITH_CALIPER
CALI_MARK_BEGIN("save checkpoint");
#endif
if (global_rank == 0)
{
std::cout << "Checkpoint reached" << std::endl;
}
for (unsigned int member = 0; member < local_ensemble_size; ++member)
{
thermal_physics_ensemble[member]->save_checkpoint(
checkpoint_filename + '_' +
std::to_string(first_local_member + member),
solution_augmented_ensemble[member].block(base_state));
}
std::ofstream file{checkpoint_filename + "_time.txt"};
boost::archive::text_oarchive oa{file};
oa << time;
oa << n_time_step;
#ifdef ADAMANTINE_WITH_CALIPER
CALI_MARK_END("save checkpoint");
#endif
}

// ----- Output progress on screen -----
if (global_rank == 0)
{
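
The ensemble hunks build checkpoint and restart file prefixes from several pieces. The sketch below spells out the two patterns exactly as they appear in the diff. `save_prefix` and `load_prefix` are hypothetical helpers, and how `save_checkpoint` and `load_checkpoint` extend these prefixes is not shown in this commit view.

```cpp
#include <cassert>
#include <string>

// Prefix passed to save_checkpoint for one ensemble member: the user prefix,
// the global MPI rank, then the global member index.
std::string save_prefix(std::string const &prefix, unsigned int global_rank,
                        unsigned int first_local_member, unsigned int member)
{
  return prefix + '_' + std::to_string(global_rank) + '_' +
         std::to_string(first_local_member + member);
}

// Prefix passed to load_checkpoint on restart: the user prefix followed by
// the local member index.
std::string load_prefix(std::string const &prefix, unsigned int member)
{
  return prefix + '_' + std::to_string(member);
}
```

As the hunks are written, the save pattern includes the global rank and the load pattern does not, so the same `filename_prefix` in both sections yields different names. A restart prefix would presumably have to account for the rank suffix.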
10 changes: 9 additions & 1 deletion source/MaterialProperty.hh
@@ -1,4 +1,4 @@
/* Copyright (c) 2016 - 2023, the adamantine authors.
/* Copyright (c) 2016 - 2024, the adamantine authors.
*
* This file is subject to the Modified BSD License and may not be distributed
* without copyright and license information. Please refer to the file LICENSE
@@ -210,6 +210,12 @@ public:
std::vector<unsigned int>> const &_cell_it_to_mf_pos,
dealii::DoFHandler<dim> const &dof_handler);

/**
* Set the ratio of the material states at the cell level.
*/
void set_cell_state(
std::vector<std::array<double, g_n_material_states>> const &cell_state);

/**
* Return the underlying DoFHandler.
*/
@@ -289,6 +295,8 @@ private:
/**
* Ratio of each MaterialState in each cell.
*/
// FIXME Change the order of the indices. Currently, the first index is the
// state and the second is the cell.
Kokkos::View<double **, typename MemorySpaceType::kokkos_space> _state;
/**
* Thermal properties of the material that are dependent of the state of the
