
Report Repacked Files For Unclosed Runs #4293

Closed
blallen opened this issue May 26, 2016 · 1 comment

Comments

@blallen
Contributor

blallen commented May 26, 2016

The current system only sends Repack signals to the Storage Manager team once a run is closed. This appears to be intentional, as the condition is explicitly included in the query used to check whether files have been repacked [1].

Storage Manager relies on these signals to know when it can safely delete files, so waiting until a run is closed can prevent it from deleting any files from a run because of an issue with only a few of them. Depending on the size of the run, this can clog up the streamer space at P5 and lead to unnecessary interventions to keep it from running out of space.

Additionally, the Storage Manager monitoring shows the number of repacked files, so in the current setup an open run appears to have nothing repacked, potentially hiding real issues or causing alarm when there is none.

[1] https://github.com/dmwm/T0/blob/master/src/python/T0/WMBS/Oracle/SMNotification/GetFinishedStreamers.py#L26

@hufnagel
Member

hufnagel commented Jun 1, 2016

The query as written relies on the fileset being closed: without that requirement it would pick up files in the window between injection into the Tier0 and being picked up by the Tier0Feeder and forwarded into the correct fileset for processing.

I can rewrite the query though to do this differently...
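To illustrate the behavioural change being discussed, here is a minimal sketch against a hypothetical, much-simplified schema (the real T0 Oracle tables and the GetFinishedStreamers DAO differ): the old query only reports streamer files from closed filesets, while a per-file status check reports each file as soon as it is completed or failed.

```python
import sqlite3

# Hypothetical, much-simplified schema -- the real T0 Oracle tables differ.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fileset  (id INTEGER PRIMARY KEY, open INTEGER);
CREATE TABLE streamer (id INTEGER PRIMARY KEY, fileset INTEGER,
                       status TEXT);   -- 'new', 'completed' or 'failed'
INSERT INTO fileset  VALUES (1, 0), (2, 1);          -- fileset 2 still open
INSERT INTO streamer VALUES (10, 1, 'completed'),
                            (11, 2, 'completed'),
                            (12, 2, 'new');
""")

# Old behaviour: only streamers from closed filesets are reported, so
# file 11 is held back until its run/fileset closes.
old = con.execute("""SELECT streamer.id FROM streamer
                     INNER JOIN fileset ON fileset.id = streamer.fileset
                     WHERE fileset.open = 0
                       AND streamer.status IN ('completed', 'failed')
                     ORDER BY streamer.id""").fetchall()

# New behaviour: each streamer is reported as soon as it is completed or
# failed, regardless of whether its fileset is still open.
new = con.execute("""SELECT id FROM streamer
                     WHERE status IN ('completed', 'failed')
                     ORDER BY id""").fetchall()

print(old)   # [(10,)]
print(new)   # [(10,), (11,)]
```

Note that file 12 (status 'new', i.e. not yet forwarded into its processing fileset) is excluded by both variants, which is the concern the closed-fileset condition originally addressed.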

hufnagel added a commit to hufnagel/T0 that referenced this issue Jun 1, 2016
Immediately send notifications as soon as the streamer file is completed or failed.
hufnagel added a commit that referenced this issue Jun 2, 2016
send SM repack notifications quicker, fixes #4293
johnhcasallasl pushed a commit to johnhcasallasl/T0 that referenced this issue Aug 30, 2016
Immediately send notifications as soon as the streamer file is completed or failed. Also retry the notification if it fails.
johnhcasallasl pushed a commit to johnhcasallasl/T0 that referenced this issue Aug 30, 2016
Configuration updates
- CMSSW_7_4_8_patch1
- New Express GT 74X_dataRun2_Express_v1
- DQM sequences updated
- HLTMonitor dataset added
- Time and Size per event estimates
- Add T1 subscriptions
- Use T2_CH_CERN_AI resources for production configuration
- AcqEra to Run2015C

comment out dqm_sequences

Comment out dqm_sequences (as we don't run them in prod).

fix a typo in the production config

Changes default plugin to PyCondorPlugin

Remove WorkQueueManager from the Tier0 config

Add Skims to the Tier0 configuration

- Add skims to the Tier0 configuration
- remove comments from dqm sequences
- enable multicore for PromptReco and Express (multicore 4)
- New alca matrix updates
- Default write_dqm = False
- Enabling dqm for PDs with alca producers
- @common default when write_dqm = True
- Ignore DQMOffline stream

remove PromptCalibProdSiStripGains from ExpressCosmics

Change in TOTEM PDs on low PU HLT menu

add AlCa scenario

Add AlCa scenario to event_scenario table. Also remove the data_tier0 table.

remove TFC override

We currently include T2_CH_CERN/Tier0/override_catalog.xml to make sure we read streamer files from castor. Remove this; streamers will be read from EOS now.

enable copy+delete subscriptions for Tier0

allow setting event and size estimates per dataset or stream

add PhysicsSkims to PromptReco

Add support for PhysicsSkims in PromptReco and also fix a problem with dqm_sequences in PromptReco.

add ppRun2at50ns scenario

CMSSW_7_4_10_patch1 and new HLTPhysics PDs

config update 25.08.2015

release PromptReco run by run

If multiple runs are to be released for PromptReco, this currently happens in a single transaction. Split this into run-by-run transactions.

updating master

Fixing configuration

Adding a new TOTEM PD

fix streamer notify skipping tails

Setting ppScenario to ppRun2at50ns

do not subscribe express data anywhere by default

For replays we want to keep all data on the production buffer T0_CH_CERN_Disk and not subscribed anywhere.

1.9.98

Signed-off-by: Dirk Hufnagel <Dirk.Hufnagel@cern.ch>

change replay configs to use T2_CH_CERN_AI and to not subscribe Express data

complete express and reco config, also in t0wmadatasvc

Add physics_skim to the reco_config in t0ast. Add alca_skim and dqm_seq to the express_config in t0wmadatasvc. Add alca_skim, physics_skim and dqm_seq to the reco_config in t0wmadatasvc.

Also add subscriptions for RAW-RECO and USER data tiers for PromptReco output (same as for AOD, so tape and disk at T1). These are for the PhysicsSkims.

Switched to CMSSW_7_4_10_patch2

Switch to CMSSW_7_4_10_patch2

fix SM notification perl locale warnings

use multiple cores for repackmerge to process large files

switch cmssw version, procVersion and PromptReco GT. Added PDs for 0T

switch to CMSSW_7_4_11_patch1

Switch to the new Express GT 74X_dataRun2_Express_v2

Switch to 74X_dataRun2_Prompt_v3 PromptReco Global Tag

Updating the DQMUpload URLs

Change of GlobalTag

Switch to CMSSW_7_4_12 and processing version 3

change acquisition era to Run2015D

Adding datasets of the 3.8T HLT menu

adjust ParkingMonitor configuration

update to 7_4_12_patch1. also updated AcqEra to 2015D

Update to 7_4_12_patch4

Adding ScoutingMonitor datasets

Changing ScoutingMonitor to ParkingScoutingMonitor

fix typo in ParkingScoutingMonitor config

remove subscriptions from replay configs

Adding the LumiPixelsMinBias AlCa producer to Express, also updating Global Tags, Processing Version and CMSSW version

1.9.99

Signed-off-by: Dirk Hufnagel <Dirk.Hufnagel@cern.ch>

Change of CMSSW release to 7_4_15

Adding new PDs to the configuration for Low PU

add HeavyIonRuns2 scenario

clean up replay config

change the PCL upload code

python futurize fixes

all of stage1 plus ws_comma and idioms

Switch to the CMSSW_7_4_15_patch1

config python formatting changes

2.0.0

Signed-off-by: Dirk Hufnagel <Dirk.Hufnagel@cern.ch>

Configuration for pp Reference Run 2015 (HI)

Adding PDs for HI Run 2015

add HIExpress config

CMSSW_7_5_5_patch3, HI GTs, updated AlcaMatrix, Removing _0T for HI PDs, Updating processing site

some last minute config tweaks for ppref data

remove virginraw dataset from config

Removing AlCa producers from HIExpress config

Fixes a typo in the HIExpress GT config

reco event/size estimates for HI

Increase the allowed RAW size to 12GB (HI doesn't write RECO, so we can accept larger RAW).
Make some initial estimates for HI reco size/time; they are very conservative for now.
Still talking to the HI team to get better estimates, but what we have now should work.

Updates in the CMSSW version, PR GT, AlCa Matrix and Skims for HI

New GTs and typo correction

fix express alca producers

Switching to CMSSW_7_5_6_patch1

Last HI configuration

create HI replay config

Replay to test 757_p2 and lite dqm.

config changes

config changes

More HI data sets

Multicore for HI

Switching to CMSSW_7_5_7_patch3 and setting multicore = 6 for HI PDs

Multicore to 6 for HIExpress

More HIMinimumBias Datasets

Switch to CMSSW_7_5_8_patch1

make new PCL upload code work with prod Tier0

support tape subscriptions with no disk

change memory requirements for HeavyIonsRun2 scenario

enable error detection in PCL upload

adjust RSS and VSize runtime limits

2.0.1

Signed-off-by: Dirk Hufnagel <Dirk.Hufnagel@cern.ch>

adjust runtime memory kill limits

Got the units wrong...

configure different PhEDEx groups for various subscriptions

Cleanup subscription code and configure PhEDEx groups for Express subscriptions and for Repack and PromptReco Disk subscriptions.

forward run/stream processing completion into t0wmadatasvc

forward produced datasets to t0wmadatasvc

Publish all produced datasets to the Tier0 Data Service (to be used by DDM as locked datasets)

AcqEra set to Commissioning 2016. Using CMSSW_7_5_8_patch3. HI configs removed from here

Changes for MWGR2 2016

New GTs: 80X v1

CMSSW_800p2, default arch slc6_amd64_gcc493, parameter maxLatency to use repack patch

CMSSW_8_0_0_patch3

Configuration changes for 2016 MWGR3

implemented max latency rules for repack and repackmerge

Switch to CMSSW_8_0_2

implement stageout to merged for repack

Use the minimum acceptable merge file size and the event-count merge trigger to decide on direct stageout to merged from repack jobs. In addition, treat lumi holes from the detector (StorageManager declaring no data for that lumi) the same way as already-processed lumis, i.e. process around the holes. This is a bit inefficient in creating large RAW files, but protects against data later appearing for these lumis.
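A minimal sketch of that direct-stageout decision, with hypothetical threshold names and values (the real ones come from the Tier0 configuration):

```python
# Hypothetical thresholds -- the real values live in the Tier0 configuration.
MIN_MERGE_SIZE = 2 * 1024**3     # minimum acceptable merged file size (bytes)
MAX_MERGE_EVENTS = 100000        # event count that triggers a merge

def stageout_target(file_size, file_events):
    """Decide whether a repack output file can bypass the merge step.

    A file that is already at least as large as the smallest acceptable
    merged file, or that carries enough events to trigger a merge on its
    own, is staged out directly to the merged area; everything else goes
    to the unmerged area and is merged later.
    """
    if file_size >= MIN_MERGE_SIZE or file_events >= MAX_MERGE_EVENTS:
        return "merged"
    return "unmerged"

print(stageout_target(3 * 1024**3, 5000))    # merged (size trigger)
print(stageout_target(100 * 1024**2, 500))   # unmerged
```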

CMSSW_8_0_3 and new GTs

remove the injection site override option

Since WMCore hasn't supported injection site override in years, remove this option.

set different subscription for skim

some minor fixes for subscriptions code

2.0.2

Signed-off-by: Dirk Hufnagel <Dirk.Hufnagel@cern.ch>

fix bug in storage node insert

CMSSW_8_0_3_patch1

get run start and stop times from run summary

Also requires keeping track of run stop (from RunSummary) and run close (from StorageManager).

2.0.3

Signed-off-by: Dirk Hufnagel <Dirk.Hufnagel@cern.ch>

add low pileup scenario

Changes for the Quiet Beam 2016

Removing DQM sequences from Express0T and ZeroBias*_0T

Removing @common sequence in HLTPhysics PDs and PromptCalibProdSiStripGains producer in Express

adjust memory limits for Express and PromptReco

Changes for first collision 2016

Switch to CMSSW_8_0_5_patch1

Adding HLTPhysics0 and HLTPhysics0_0T.

CMSSW_8_0_6

New PDs, latest GTs (express v6, prompt v8), DQM seqs for ZeroBias

CMSSW_8_0_7 and acquisition era Run2016B

New alca producers and express GT v7

CMSSW_8_0_7_patch1, processingVersion 2 and cleaning default architecture and overrides

adjust the memory reservation

Reduce the memory requested per core for Express and PromptReco slightly (1000 was too much).

2.0.4

Signed-off-by: Dirk Hufnagel <Dirk.Hufnagel@cern.ch>

CMSSW_8_0_8, Reco Processing Version 2 and ZeroBias0

L1MinimumBias[0..9] PDs

Switch to CMSSW_8_0_8_patch1

change syntax for runsummary query

Seems cx_oracle doesn't like the old syntax anymore for some reason.

Increase of the maxInputEvents value in the Repack config

consistent behavior for repack event limits

Updating diagnoseActiveRuns script

send SM repack notifications quicker, fixes dmwm#4293

Immediately send notifications as soon as the streamer file is completed or failed. Also retry the notification if it fails.

Adjust base priority for replays

Newest CMSSW release and Express GT. Adding a new AlCaProducer for Express

Switching to CMSSW_8_0_10_patch1

Switching to CMSSW_8_0_11

mark run/stream as done quicker

Currently a run/stream is marked done when its workflow is archived and cleaned up. Due to the WMStats retention policy this takes a week, which is a bit long. Change this to mark a run/stream as done once all subscriptions for that run/stream are marked as finished.
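The proposed check could look roughly like this; the subscription records here are hypothetical dicts, not the real WMBS objects:

```python
# Minimal sketch of the proposed done check; subscription records are
# hypothetical dicts, not the real WMBS subscription objects.
def runstream_done(subscriptions):
    """A run/stream counts as done once every one of its subscriptions is
    finished, instead of waiting for the workflow to be archived."""
    return all(sub["finished"] for sub in subscriptions)

subs = [{"id": 1, "finished": True}, {"id": 2, "finished": False}]
print(runstream_done(subs))   # False -- one subscription still running
subs[1]["finished"] = True
print(runstream_done(subs))   # True -- all subscriptions finished
```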

Config Changes for Run2016C

Proposed Configuration for Run2016D. New subscriptions distribution and CMSSW_8_0_13_patch1

allow some Tier0Config parameters to be era or run dependent

To better manage era or processing version changes introduce a way to codify era or run dependency into the Tier0Config.
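One way to codify such a dependency, sketched with a hypothetical resolver (the actual Tier0Config API may differ): a parameter is either a plain value or a mapping from first-run number to value, and the entry with the largest threshold not exceeding the run applies.

```python
# Hypothetical helper -- sketches the idea of run-dependent Tier0Config
# parameters; the real Tier0Config API differs.
def resolve(param, run):
    """Resolve a config parameter that may be a plain value or a mapping
    from first-run number to value ({min_run: value, ...})."""
    if not isinstance(param, dict):
        return param                      # plain, run-independent value
    # pick the value whose threshold is the largest run number <= run
    applicable = [r for r in param if r <= run]
    if not applicable:
        raise ValueError("no config value applies to run %d" % run)
    return param[max(applicable)]

proc_version = {1: 1, 273000: 2}          # version bump at run 273000
print(resolve(proc_version, 250000))      # 1
print(resolve(proc_version, 280000))      # 2
print(resolve("Run2016D", 280000))        # Run2016D (plain value)
```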

special handling for SiPixelAli PCL

The SiPixelAli AlcaHarvest job needs a lot of local disk space. Assign 4 cores to it in the job splitter and then override at runtime to run single core.

make RAW subscription to Disk node optional

Proposed configuration for the new era: Run2016E

BHPskim for some PDs and Run2016E config for replays

Updating HLTPhysics PDs config for high rates

CMSSW_8_0_15, HcalCalIsoTrk for Commissioning PD and cleaning

New subscription schema, AcqEra Run2016F and Processing version 1

Switch to CMSSW_8_0_16 and use of the TkAlCosmicsInCollisions AlCa producer on the NoBPTX PD

Allowing the subscription of DoubleMuon to disk

Switch to 80X_dataRun2_Express_v12

disable robust (i.e. flaky) merge

migrating DB table and column names from sename to pnn

	modified:   src/python/T0/WMBS/Oracle/Subscriptions/GetAllFiles.py
	modified:   src/python/T0/WMBS/Oracle/Subscriptions/GetAvailableExpressFiles.py
	modified:   src/python/T0/WMBS/Oracle/Subscriptions/GetAvailableExpressMergeFiles.py
	modified:   src/python/T0/WMBS/Oracle/Subscriptions/GetAvailableRepackFiles.py
	modified:   src/python/T0/WMBS/Oracle/Subscriptions/GetAvailableRepackMergeFiles.py
	modified:   test/python/T0_t/WMBS_t/JobSplitting_t/Condition_t.py
	modified:   test/python/T0_t/WMBS_t/JobSplitting_t/ExpressMerge_t.py
	modified:   test/python/T0_t/WMBS_t/JobSplitting_t/Express_t.py
	modified:   test/python/T0_t/WMBS_t/JobSplitting_t/RepackMerge_t.py
	modified:   test/python/T0_t/WMBS_t/JobSplitting_t/Repack_t.py

2.0.5

Signed-off-by: Dirk Hufnagel <Dirk.Hufnagel@cern.ch>

Switching to CMSSW_8_0_17 and adding the HcalCalIsolatedBunchSelector and HcalCalIsolatedBunchFilter AlCa producers

expand era and run dependent config parameters to express and repack

fix forceCloseRuns script

Switch to era Run2016G

fix timezone problem in RunSummary query

RunSummary stores run start and stop times as TIMESTAMP, i.e. as local time. The actual time stored is UTC, though. This discrepancy causes problems for the query extracting seconds since epoch. Correct it.
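A minimal sketch of the correction, assuming the naive timestamps coming back from RunSummary really carry UTC wall-clock values:

```python
from datetime import datetime, timezone

def epoch_from_runsummary(ts):
    """Interpret a naive RunSummary timestamp as UTC when converting to
    seconds since epoch. Treating it as local time (the naive default)
    would shift the result by the local UTC offset."""
    return ts.replace(tzinfo=timezone.utc).timestamp()

ts = datetime(2016, 9, 1, 12, 0, 0)      # naive, but really UTC
print(int(epoch_from_runsummary(ts)))    # 1472731200
```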

Two new PDs: ZeroBiasFirstBunchAfterTrain and ZeroBiasFirstBunchInTrain

CMSSW_8_0_18_patch1 release

switch to PR GT v11