
Report Repacked Files For Unclosed Runs #4293

Closed
blallen opened this issue May 26, 2016 · 1 comment

Comments

@blallen
Contributor

blallen commented May 26, 2016

The current system only sends Repack signals to the Storage Manager team once a run is closed. This appears to be intentional, as the condition is explicitly included in the query used to check whether files have been repacked [1].

Storage Manager relies on these signals to know when it can safely delete files, so waiting until a run is closed can prevent it from deleting any files from a run because of an issue with only a few of them. Depending on the size of the run, this can clog up the streamer space at P5 and lead to unnecessary interventions to keep it from running out of space.

Additionally, the Storage Manager monitoring shows the number of repacked files, so in the current setup an open run appears to have nothing repacked, potentially hiding real issues or causing alarm when there is none.

[1] https://github.com/dmwm/T0/blob/master/src/python/T0/WMBS/Oracle/SMNotification/GetFinishedStreamers.py#L26

@hufnagel
Member

hufnagel commented Jun 1, 2016

The query as written relies on the fileset being closed: without that requirement it would pick up files in the window between injection into the Tier0 and being picked up by the Tier0Feeder and forwarded into the correct fileset for processing.

I can rewrite the query though to do this differently...
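To illustrate the behavioural change being discussed, here is a minimal sketch against a hypothetical, much-simplified schema (the real T0 Oracle tables and the GetFinishedStreamers DAO differ): the old query only reports streamer files from closed filesets, while a per-file status check reports each file as soon as it is completed or failed.

```python
import sqlite3

# Hypothetical, much-simplified schema -- the real T0 Oracle tables differ.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fileset  (id INTEGER PRIMARY KEY, open INTEGER);
CREATE TABLE streamer (id INTEGER PRIMARY KEY, fileset INTEGER,
                       status TEXT);   -- 'new', 'completed' or 'failed'
INSERT INTO fileset  VALUES (1, 0), (2, 1);          -- fileset 2 still open
INSERT INTO streamer VALUES (10, 1, 'completed'),
                            (11, 2, 'completed'),
                            (12, 2, 'new');
""")

# Old behaviour: only streamers from closed filesets are reported, so
# file 11 is held back until its run/fileset closes.
old = con.execute("""SELECT streamer.id FROM streamer
                     INNER JOIN fileset ON fileset.id = streamer.fileset
                     WHERE fileset.open = 0
                       AND streamer.status IN ('completed', 'failed')
                     ORDER BY streamer.id""").fetchall()

# New behaviour: each streamer is reported as soon as it is completed or
# failed, regardless of whether its fileset is still open.
new = con.execute("""SELECT id FROM streamer
                     WHERE status IN ('completed', 'failed')
                     ORDER BY id""").fetchall()

print(old)   # [(10,)]
print(new)   # [(10,), (11,)]
```

Note that file 12 (status 'new', i.e. not yet forwarded into its processing fileset) is excluded by both variants, which is the concern the closed-fileset condition originally addressed.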

hufnagel added a commit to hufnagel/T0 that referenced this issue Jun 1, 2016
Immediately send notifications as soon as the streamer file is completed or failed.
hufnagel added a commit that referenced this issue Jun 2, 2016
send SM repack notifications quicker, fixes #4293
johnhcasallasl pushed a commit to johnhcasallasl/T0 that referenced this issue Aug 30, 2016
Immediately send notifications as soon as the streamer file is completed or failed. Also retry the notification if it fails.
johnhcasallasl pushed a commit to johnhcasallasl/T0 that referenced this issue Aug 30, 2016
Configuration updates
- CMSSW_7_4_8_patch1
- New Express GT 74X_dataRun2_Express_v1
- DQM sequences updated
- HLTMonitor dataset added
- Time and Size per event estimates
- Add T1 subscriptions
- Use T2_CH_CERN_AI resources for production configuration
- AcqEra to Run2015C

comment out dqm_sequences

Comment out dqm_sequences (as we don't run them in prod).

fix a typo in the production config

Changes default plugin to PyCondorPlugin

Remove WorkQueueManager from the Tier0 config

Add Skims to the Tier0 configuration

- Add skims to the Tier0 configuration
- remove comments from dqm sequences
- enable multicore for PromptReco and Express (multicore 4)
- New alca matrix updates
- Default write_dqm = False
- Enabling dqm for PDs with alca producers
- @common default when write_dqm = True
- Ignore DQMOffline stream

remove PromptCalibProdSiStripGains from ExpressCosmics

Change in TOTEM PDs on low PU HLT menu

add AlCa scenario

Add AlCa scenario to event_scenario table. Also remove the data_tier0 table.

remove TFC override

We currently include T2_CH_CERN/Tier0/override_catalog.xml to make sure we read streamer files from castor. Remove this; streamers will be read from EOS now.

enable copy+delete subscriptions for Tier0

allow setting event and size estimates per dataset or stream

add PhysicsSkims to PromptReco

Add support for PhysicsSkims in PromptReco and also fix a problem with dqm_sequences in PromptReco.

add ppRun2at50ns scenario

CMSSW_7_4_10_patch1 and new HLTPhysics PDs

config update 25.08.2015

release PromptReco run by run

If multiple runs are to be released for PromptReco, this currently happens in a single transaction. Split this into run-by-run transactions.

updating master

Fixing configuration

Adding a new TOTEM PD

fix streamer notify skipping tails

Setting ppScenario to ppRun2at50ns

do not subscribe express data anywhere by default

For replays we want to keep all data on the production buffer T0_CH_CERN_Disk and not subscribed anywhere.

1.9.98

Signed-off-by: Dirk Hufnagel <Dirk.Hufnagel@cern.ch>

change replay configs to use T2_CH_CERN_AI and to not subscribe Express data

complete express and reco config, also in t0wmadatasvc

Add physics_skim to the reco_config in t0ast. Add alca_skim and dqm_seq to the express_config in t0wmadatasvc. Add alca_skim, physics_skim and dqm_seq to the reco_config in t0wmadatasvc.

Also add subscriptions for RAW-RECO and USER data tiers for PromptReco output (same as for AOD, so tape and disk at T1). These are for the PhysicsSkims.

Switched to CMSSW_7_4_10_patch2

Switch to CMSSW_7_4_10_patch2

fix SM notification perl locale warnings

use multiple cores for repackmerge to process large files

switch cmssw version, procVersion and PromptReco GT. Added PDs for 0T

switch to CMSSW_7_4_11_patch1

Switch to the new Express GT 74X_dataRun2_Express_v2

Switch to 74X_dataRun2_Prompt_v3 PromptReco Global Tag

Updating the DQMUpload URLs

Change of GlobalTag

Switch to CMSSW_7_4_12 and processing version 3

change acquisition era to Run2015D

Adding datasets of the 3.8T HLT menu

adjust ParkingMonitor configuration

update to 7_4_12_patch1. also updated AcqEra to 2015D

Update to 7_4_12_patch4

Adding ScoutingMonitor datasets

Changing ScoutingMonitor to ParkingScoutingMonitor

fix typo in ParkingScoutingMonitor config

remove subscriptions from replay configs

Adding the LumiPixelsMinBias AlCa producer to Express, also updating Global Tags, Processing Version and CMSSW version

1.9.99

Signed-off-by: Dirk Hufnagel <Dirk.Hufnagel@cern.ch>

Change of CMSSW release to 7_4_15

Adding new PDs to the configuration for Low PU

add HeavyIonRuns2 scenario

clean up replay config

change the PCL upload code

python futurize fixes

all of stage1 plus ws_comma and idioms

Switch to the CMSSW_7_4_15_patch1

config python formatting changes

2.0.0

Signed-off-by: Dirk Hufnagel <Dirk.Hufnagel@cern.ch>

Configuration for pp Reference Run 2015 (HI)

Adding PDs for HI Run 2015

add HIExpress config

CMSSW_7_5_5_patch3, HI GTs, updated AlcaMatrix, Removing _0T for HI PDs, Updating processing site

some last minute config tweaks for ppref data

remove virginraw dataset from config

Removing AlCa producers from HIExpress config

Fixes a typo in the HIExpress GT config

reco event/size estimates for HI

Increase the allowed RAW size to 12GB (HI doesn't write RECO, so we can accept larger RAW).
Make some initial estimates for HI reco size/time; they are very conservative for now.
Still talking to the HI team to get better estimates, but what we have now should work.

Updates in the CMSSW version, PR GT, AlCa Matrix and Skims for HI

New GTs and typo correction

fix express alca producers

Switching to CMSSW_7_5_6_patch1

Last HI configuration

create HI replay config

Replay to test 757_p2 and lite dqm.

config changes

config changes

More HI data sets

Multicore for HI

Switching to CMSSW_7_5_7_patch3 and setting multicore = 6 for HI PDs

Multicore to 6 for HIExpress

More HIMinimumBias Datasets

Switch to CMSSW_7_5_8_patch1

make new PCL upload code work with prod Tier0

support tape subscriptions with no disk

change memory requirements for HeavyIonsRun2 scenario

enable error detection in PCL upload

adjust RSS and VSize runtime limits

2.0.1

Signed-off-by: Dirk Hufnagel <Dirk.Hufnagel@cern.ch>

adjust runtime memory kill limits

Got the units wrong...

configure different PhEDEx groups for various subscriptions

Cleanup subscription code and configure PhEDEx groups for Express subscriptions and for Repack and PromptReco Disk subscriptions.

forward run/stream processing completion into t0wmadatasvc

forward produced datasets to t0wmadatasvc

Publish all produced datasets to the Tier0 Data Service (to be used by DDM as locked datasets)

AcqEra set to Commissioning 2016. Using CMSSW_7_5_8_patch3. HI configs removed from here

Changes for MWGR2 2016

New GTs: 80X v1

CMSSW_800p2, default arch slc6_amd64_gcc493, parameter maxLatency to use repack patch

CMSSW_8_0_0_patch3

Configuration changes for 2016 MWGR3

implemented max latency rules for repack and repackmerge

Switch to CMSSW_8_0_2

implement stageout to merged for repack

Use the minimum acceptable merge file size and the event-count merge trigger to decide on direct stageout to merged from repack jobs. In addition, treat lumi holes from the detector (StorageManager declaring no data for that lumi) the same way as already-processed lumis, i.e. process around the holes. This is a bit inefficient in creating large RAW files, but protects against data later appearing for these lumis.
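A minimal sketch of that direct-stageout decision, with hypothetical threshold names and values (the real ones come from the Tier0 configuration):

```python
# Hypothetical thresholds -- the real values live in the Tier0 configuration.
MIN_MERGE_SIZE = 2 * 1024**3     # minimum acceptable merged file size (bytes)
MAX_MERGE_EVENTS = 100000        # event count that triggers a merge

def stageout_target(file_size, file_events):
    """Decide whether a repack output file can bypass the merge step.

    A file that is already at least as large as the smallest acceptable
    merged file, or that carries enough events to trigger a merge on its
    own, is staged out directly to the merged area; everything else goes
    to the unmerged area and is merged later.
    """
    if file_size >= MIN_MERGE_SIZE or file_events >= MAX_MERGE_EVENTS:
        return "merged"
    return "unmerged"

print(stageout_target(3 * 1024**3, 5000))    # merged (size trigger)
print(stageout_target(100 * 1024**2, 500))   # unmerged
```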

CMSSW_8_0_3 and new GTs

remove the injection site override option

Since WMCore hasn't supported injection site override in years, remove this option.

set different subscription for skim

some minor fixes for subscriptions code

2.0.2

Signed-off-by: Dirk Hufnagel <Dirk.Hufnagel@cern.ch>

fix bug in storage node insert

CMSSW_8_0_3_patch1

get run start and stop times from run summary

Also requires keeping track of run stop (from RunSummary) and run close (from StorageManager).

2.0.3

Signed-off-by: Dirk Hufnagel <Dirk.Hufnagel@cern.ch>

add low pileup scenario

Changes for the Quiet Beam 2016

Removing DQM sequences from Express0T and ZeroBias*_0T

Removing @common sequence in HLTPhysics PDs and PromptCalibProdSiStripGains producer in Express

adjust memory limits for Express and PromptReco

Changes for first collision 2016

Switch to CMSSW_8_0_5_patch1

Adding HLTPhysics0 and HLTPhysics0_0T.

CMSSW_8_0_6

New PDs, latest GTs (express v6, prompt v8), DQM seqs for ZeroBias

CMSSW_8_0_7 and acquisition era Run2016B

New alca producers and express GT v7

CMSSW_8_0_7_patch1, processingVersion 2 and cleaning default architecture and overrides

adjust the memory reservation

Reduce the memory requested per core for Express and PromptReco slightly (1000 was too much).

2.0.4

Signed-off-by: Dirk Hufnagel <Dirk.Hufnagel@cern.ch>

CMSSW_8_0_8, Reco Processing Version 2 and ZeroBias0

L1MinimumBias[0..9] PDs

Switch to CMSSW_8_0_8_patch1

change syntax for runsummary query

Seems cx_oracle doesn't like the old syntax anymore for some reason.

Increase of the maxInputEvents value in the Repack config

consistent behavior for repack event limits

Updating diagnoseActiveRuns script

send SM repack notifications quicker, fixes dmwm#4293

Immediately send notifications as soon as the streamer file is completed or failed. Also retry the notification if it fails.

Adjust base priority for replays

Newest CMSSW release and Express GT. Adding a new AlCaProducer for Express

Switching to CMSSW_8_0_10_patch1

Switching to CMSSW_8_0_11

mark run/stream as done quicker

Currently a run/stream is marked done when its workflow is archived and cleaned up. Due to the WMStats retention policy this takes a week, which is a bit long. Change this to mark a run/stream as done once all subscriptions for that run/stream are marked as finished.
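The proposed check could look roughly like this; the subscription records here are hypothetical dicts, not the real WMBS objects:

```python
# Minimal sketch of the proposed done check; subscription records are
# hypothetical dicts, not the real WMBS subscription objects.
def runstream_done(subscriptions):
    """A run/stream counts as done once every one of its subscriptions is
    finished, instead of waiting for the workflow to be archived."""
    return all(sub["finished"] for sub in subscriptions)

subs = [{"id": 1, "finished": True}, {"id": 2, "finished": False}]
print(runstream_done(subs))   # False -- one subscription still running
subs[1]["finished"] = True
print(runstream_done(subs))   # True -- all subscriptions finished
```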

Config Changes for Run2016C

Proposed Configuration for Run2016D. New subscriptions distribution and CMSSW_8_0_13_patch1

allow some Tier0Config parameters to be era or run dependent

To better manage era or processing version changes introduce a way to codify era or run dependency into the Tier0Config.
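One way to codify such a dependency, sketched with a hypothetical resolver (the actual Tier0Config API may differ): a parameter is either a plain value or a mapping from first-run number to value, and the entry with the largest threshold not exceeding the run applies.

```python
# Hypothetical helper -- sketches the idea of run-dependent Tier0Config
# parameters; the real Tier0Config API differs.
def resolve(param, run):
    """Resolve a config parameter that may be a plain value or a mapping
    from first-run number to value ({min_run: value, ...})."""
    if not isinstance(param, dict):
        return param                      # plain, run-independent value
    # pick the value whose threshold is the largest run number <= run
    applicable = [r for r in param if r <= run]
    if not applicable:
        raise ValueError("no config value applies to run %d" % run)
    return param[max(applicable)]

proc_version = {1: 1, 273000: 2}          # version bump at run 273000
print(resolve(proc_version, 250000))      # 1
print(resolve(proc_version, 280000))      # 2
print(resolve("Run2016D", 280000))        # Run2016D (plain value)
```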

special handling for SiPixelAli PCL

The SiPixelAli AlcaHarvest job needs a lot of local disk space. Assign 4 cores to it in the job splitter and then override at runtime to run single core.

make RAW subscription to Disk node optional

Proposed configuration for the new era: Run2016E

BHPskim for some PDs and Run2016E config for replays

Updating HLTPhysics PDs config for high rates

CMSSW_8_0_15, HcalCalIsoTrk for Commissioning PD and cleaning

New subscription schema, AcqEra Run2016F and Processing version 1

Switch to CMSSW_8_0_16 and use of the TkAlCosmicsInCollisions AlCa producer on the NoBPTX PD

Allowing the subscription of DoubleMuon to disk

Switch to 80X_dataRun2_Express_v12

disable robust (i.e. flaky) merge

migrating DB table and column names from sename to pnn

	modified:   src/python/T0/WMBS/Oracle/Subscriptions/GetAllFiles.py
	modified:   src/python/T0/WMBS/Oracle/Subscriptions/GetAvailableExpressFiles.py
	modified:   src/python/T0/WMBS/Oracle/Subscriptions/GetAvailableExpressMergeFiles.py
	modified:   src/python/T0/WMBS/Oracle/Subscriptions/GetAvailableRepackFiles.py
	modified:   src/python/T0/WMBS/Oracle/Subscriptions/GetAvailableRepackMergeFiles.py
	modified:   test/python/T0_t/WMBS_t/JobSplitting_t/Condition_t.py
	modified:   test/python/T0_t/WMBS_t/JobSplitting_t/ExpressMerge_t.py
	modified:   test/python/T0_t/WMBS_t/JobSplitting_t/Express_t.py
	modified:   test/python/T0_t/WMBS_t/JobSplitting_t/RepackMerge_t.py
	modified:   test/python/T0_t/WMBS_t/JobSplitting_t/Repack_t.py

2.0.5

Signed-off-by: Dirk Hufnagel <Dirk.Hufnagel@cern.ch>

Switching to CMSSW_8_0_17 and adding the HcalCalIsolatedBunchSelector and HcalCalIsolatedBunchFilter AlCa producers

expand era and run dependent config parameters to express and repack

fix forceCloseRuns script

Switch to era Run2016G

fix timezone problem in RunSummary query

RunSummary stores run start and stop times as TIMESTAMP, i.e. as local time. The actual time stored is UTC, though. This discrepancy causes problems for the query extracting seconds since epoch. Correct it.
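A minimal sketch of the correction, assuming the naive timestamps coming back from RunSummary really carry UTC wall-clock values:

```python
from datetime import datetime, timezone

def epoch_from_runsummary(ts):
    """Interpret a naive RunSummary timestamp as UTC when converting to
    seconds since epoch. Treating it as local time (the naive default)
    would shift the result by the local UTC offset."""
    return ts.replace(tzinfo=timezone.utc).timestamp()

ts = datetime(2016, 9, 1, 12, 0, 0)      # naive, but really UTC
print(int(epoch_from_runsummary(ts)))    # 1472731200
```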

Two new PDs: ZeroBiasFirstBunchAfterTrain and ZeroBiasFirstBunchInTrain

CMSSW_8_0_18_patch1 release

switch to PR GT v11