add posts so they appear immediately
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed Jan 4, 2023
1 parent 99176f9 commit eac6af6
Showing 10 changed files with 991 additions and 0 deletions.
105 changes: 105 additions & 0 deletions _posts/dweitzel/2017-11-6-cleaning-up-gracc.md
@@ -0,0 +1,105 @@
---
author: Derek Weitzel's Blog
author_tag: dweitzel
blog_subtitle: Thoughts from Derek
blog_title: Dereks Web
blog_url: https://derekweitzel.com/
category: dweitzel
date: '2017-11-06 19:09:23'
layout: post
original_url: https://derekweitzel.com/2017/11/06/cleaning-up-gracc/
slug: cleaning-up-gracc
title: Cleaning Up GRACC
---

<p>The <a href="https://opensciencegrid.github.io/gracc/">GRid ACcounting Collector</a> (GRACC) is the OSG’s new version of accounting software, replacing Gratia. It has been running in production since March 2017. Last week, on Friday November 3rd, we held a GRACC Focus Day. Our goal was to clean up data that is presented in GRACC. My changes were:</p>


<ul>
<li>Update the GRACC-Collector to version <a href="https://github.com/opensciencegrid/gracc-collector/tree/v1.1.8">1.1.8</a>. The primary change in this release is setting the messages sent to RabbitMQ to be “persistent”. The persistent messages are then saved to disk in order to survive a RabbitMQ reboot.</li>
<li>Use case-insensitive comparisons to determine the <a href="https://oim.grid.iu.edu/oim/home">Open Science Grid Information Management system</a> (OIM) information. This was an issue with GPGrid (Fermilab), which was registered as <strong>GPGRID</strong>.</li>
<li>Set the <code class="language-plaintext highlighter-rouge">OIM_Site</code> equal to the <code class="language-plaintext highlighter-rouge">Host_description</code> attribute if the OIM logic is unable to determine the registered OIM site. This is especially useful for the LIGO collaboration, which uses sites in Europe that are not registered in OIM. Now, instead of many Unknown entries in the LIGO site listing, it shows the site name reported by the host where the job ran.</li>
</ul>

<figure class="">
<img alt="GRACC Projects Page" src="https://derekweitzel.com/images/posts/GRACC-Cleanup/GRACC_Projects_Ligo.png" /><figcaption>
GRACC Projects Page for LIGO

</figcaption></figure>

<h2 id="regular-expression-corrections"><a id="regex"></a>Regular Expression Corrections</h2>

<p>One of the common problems we have in GRACC is poor data coming from the various probes installed at hundreds of sites. We don’t control the data coming into GRACC, so occasionally we must make corrections to the data for clarity or correctness. One common problem is a probe misreporting the “site” where its jobs ran.</p>


<p>In many instances, the probe is unable to determine the site and simply lists the hostname of the worker node where the job ran. This can cause the cardinality of sites listed in GRACC to increase dramatically as we get new hostnames inserted into the sites listing. If the hostnames are predictable, a regular expression matching algorithm can match a worker node hostname to a proper site name.</p>


<p>The largest change for GRACC was the regular expression corrections. With this new feature, GRACC administrators can set corrections to match on attributes using regular expression patterns. For example, consider the following correction configuration.</p>


<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[[Corrections]]</span>
<span class="py">index</span> <span class="p">=</span> <span class="s">'gracc.corrections'</span>
<span class="py">doc_type</span> <span class="p">=</span> <span class="s">'host_description_regex'</span>
<span class="py">match_fields</span> <span class="p">=</span> <span class="nn">['Host_description']</span>
<span class="py">source_field</span> <span class="p">=</span> <span class="s">'Corrected_OIM_Site'</span>
<span class="py">dest_field</span> <span class="p">=</span> <span class="s">'OIM_Site'</span>
<span class="py">regex</span> <span class="p">=</span> <span class="kc">true</span>
</code></pre></div>
</div>


<p>This configuration means:</p>


<blockquote>
<p>Match the <code class="language-plaintext highlighter-rouge">Host_description</code> field in the incoming job record with the regular expression <code class="language-plaintext highlighter-rouge">Host_description</code> field in the corrections table. If they are a match, take the value in the <code class="language-plaintext highlighter-rouge">Corrected_OIM_Site</code> field in the corrections table and place it into the <code class="language-plaintext highlighter-rouge">OIM_Site</code> field in the job record.</p>

</blockquote>

<p>And the correction document would look like:</p>


<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"_index"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gracc.corrections-0"</span><span class="p">,</span><span class="w">
</span><span class="nl">"_type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"host_description_regex"</span><span class="p">,</span><span class="w">
</span><span class="nl">"_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"asldkfj;alksjdf"</span><span class="p">,</span><span class="w">
</span><span class="nl">"_score"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
</span><span class="nl">"_source"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"Host_description"</span><span class="p">:</span><span class="w"> </span><span class="s2">".*</span><span class="se">\.</span><span class="s2">bridges</span><span class="se">\.</span><span class="s2">psc</span><span class="se">\.</span><span class="s2">edu"</span><span class="p">,</span><span class="w">
</span><span class="nl">"Corrected_OIM_Site"</span><span class="p">:</span><span class="w"> </span><span class="s2">"PSC Bridges"</span><span class="p">,</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div>
</div>

<p>Note that the regular expression is stored in the <code class="language-plaintext highlighter-rouge">Host_description</code> field of the correction document.</p>


<p>So, if the incoming job record is similar to:</p>


<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="err">...</span><span class="w">
</span><span class="nl">"Host_description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"l006.pvt.bridges.psc.edu"</span><span class="w">
</span><span class="err">...</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div>
</div>


<p>Then the correction would modify or create values such that the final record would approximate:</p>


<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="err">...</span><span class="w">
</span><span class="nl">"Host_description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"l006.pvt.bridges.psc.edu"</span><span class="p">,</span><span class="w">
</span><span class="nl">"OIM_Site"</span><span class="p">:</span><span class="w"> </span><span class="s2">"PSC Bridges"</span><span class="p">,</span><span class="w">
</span><span class="nl">"RawOIM_Site"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="w">
</span><span class="err">...</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div>
</div>


<p>Note that the <code class="language-plaintext highlighter-rouge">Host_description</code> field stays the same. We must keep it the same because it is used in record duplicate detection. If we modified the field and resummarized previous records, then it would cause multiple records to represent the same job.</p>
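<p>The correction pass described above can be sketched in a few lines of Python. This is an illustrative sketch only, not GRACC’s actual implementation; the field names follow the example documents above, and <code class="language-plaintext highlighter-rouge">apply_corrections</code> is a hypothetical helper.</p>

```python
import re

# Hypothetical correction table, mirroring the example document above.
corrections = [
    {
        "Host_description": r".*\.bridges\.psc\.edu",
        "Corrected_OIM_Site": "PSC Bridges",
    },
]


def apply_corrections(record, corrections):
    """If the record's Host_description matches a correction's regular
    expression, copy Corrected_OIM_Site into the record's OIM_Site."""
    for corr in corrections:
        if re.fullmatch(corr["Host_description"], record.get("Host_description", "")):
            # Preserve the original value; Host_description itself is left
            # untouched because it participates in duplicate detection.
            record["RawOIM_Site"] = record.get("OIM_Site", "")
            record["OIM_Site"] = corr["Corrected_OIM_Site"]
    return record


record = apply_corrections({"Host_description": "l006.pvt.bridges.psc.edu"}, corrections)
```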
97 changes: 97 additions & 0 deletions _posts/dweitzel/2017-6-14-stashcache.md
@@ -0,0 +1,97 @@
---
author: Derek Weitzel's Blog
author_tag: dweitzel
blog_subtitle: Thoughts from Derek
blog_title: Dereks Web
blog_url: https://derekweitzel.com/
category: dweitzel
date: '2017-06-14 17:11:55'
layout: post
original_url: https://derekweitzel.com/2017/06/14/stashcache/
slug: stashcache
title: StashCache
---

<p><a href="https://opensciencegrid.github.io/StashCache/">StashCache</a> is a framework to distribute data across the Open Science Grid. It is designed to help opportunistic users to transfer data without the need for dedicated storage or frameworks of their own, like CMS and ATLAS have deployed. StashCache has several regional caches and a small set of origin servers. Caches have fast network connections, and sizable disk storage to quickly distribute data to the execution hosts in the OSG.</p>


<p>StashCache is named for the Stash filesystem located at the University of Chicago’s OSG-Connect service. It is primarily intended to cache data from the Stash filesystem, though data origins exist for other experiments as well.</p>


<figure>

<img alt="Regional Caches" src="https://derekweitzel.com/images/posts/StashCache/StashCacheMap.png" />

<figcaption>Regional Caches</figcaption>
</figure>

<h2 id="components">Components</h2>
<p>The worker nodes are where the user jobs will run. The transfer tools are used on the worker nodes to download data from StashCache caches. Worker nodes are geographically distributed across the US, and will select the nearest cache based upon a GeoIP database.</p>
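<p>The nearest-cache selection can be illustrated with a small sketch. The cache hostnames and coordinates below are invented for the example, and a real deployment consults a GeoIP database rather than a hard-coded table.</p>

```python
import math

# Made-up cache locations (latitude, longitude); illustrative only.
CACHES = {
    "cache-east.example.org": (40.7, -74.0),
    "cache-west.example.org": (37.8, -122.4),
}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))


def nearest_cache(worker_coords):
    # Pick the cache with the smallest great-circle distance to the worker.
    return min(CACHES, key=lambda host: haversine_km(worker_coords, CACHES[host]))
```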


<figure>
<img alt="StashCache Architecture" src="https://derekweitzel.com/images/posts/StashCache/StashCache-Arch-Big.png" />
<figcaption>StashCache Architecture</figcaption>
</figure>

<p>The caches are distributed to computing sites across the U.S. They run the <a href="http://xrootd.org/">XRootD</a> software. The worker nodes connect directly to the regional caches, which in turn download from the Origin servers. The caching proxies discover the data origin by querying the Redirectors. The caching algorithm used is Least Recently Used (LRU): the cache only deletes cached data when storage space is near capacity, and it deletes the least recently used data first.</p>
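<p>The LRU policy can be sketched as a toy model: evict the least recently used entries only when adding a file would exceed capacity. This is an illustration of the eviction behavior, not XRootD’s caching code.</p>

```python
from collections import OrderedDict


class LRUCache:
    """Toy byte-budgeted LRU cache; entries are kept oldest-first."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = OrderedDict()  # name -> size in bytes

    def access(self, name, size):
        if name in self.entries:
            self.entries.move_to_end(name)  # mark as most recently used
            return "hit"
        # Evict least-recently-used entries only when near capacity.
        while self.used + size > self.capacity and self.entries:
            _, evicted_size = self.entries.popitem(last=False)
            self.used -= evicted_size
        self.entries[name] = size
        self.used += size
        return "miss"
```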


<p>The origin servers are the primary source of data for the StashCache framework. StashCache was named after the Stash data store at the University of Chicago’s OSG-Connect service, but other origins also utilize the framework. The origin is the initial source of data, but once the data is stored on the Caches, the origin is no longer used. Updates to data on the origin are not reflected in the caches automatically. The caches treat the data from the origin as immutable, and therefore do not check for updates. If a user requires new data to be pulled into the cache, the name or location of the data on the origin must be changed.</p>


<p>Redirectors are used to discover the location of data. They run only at the Indiana Grid Operations Center (GOC) and help the caching proxies find the origin that holds a given dataset. Only the caching proxies communicate with the redirectors.</p>


<h2 id="tools-to-transfer">Tools to transfer</h2>
<p>Two tools exist to download data from StashCache: CVMFS and StashCP. With either tool, the first step for users is to copy the data to the Stash filesystem. Once the user has an OSG-Connect account, they may copy their data to the /stash//public directory. Once there, both tools can view and download the files.</p>


<p><a href="https://cernvm.cern.ch/portal/filesystem">CVMFS</a> (CERN Virtual Machine File System) is a mountable filesystem that appears to the user as a regular directory. CVMFS provides transparent access for users to data in the Stash filesystem. The namespace information, such as the names and sizes of files, is kept separate from the data in the Stash CVMFS. CVMFS distributes the namespace information for the Stash filesystem over a series of HTTP Forward Proxies that are separate from the StashCache federation. Data is retrieved through the Stash proxies.</p>


<p>In order to map the Stash filesystem into CVMFS, a process is constantly scanning the Stash filesystem checking for new files. When new files are discovered, they are checksummed and the meta-data is stored in the CVMFS namespace. Since this scanning can take a while for a filesystem the size of Stash, it may take several hours for a file placed in Stash to be available through CVMFS.</p>
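<p>The scan-and-publish idea can be sketched as follows. Here <code class="language-plaintext highlighter-rouge">scan</code> is a hypothetical helper for illustration; the real CVMFS publishing process involves much more (catalogs, compression, signing).</p>

```python
import hashlib
import os


def scan(root, catalog):
    """Walk the filesystem under root; checksum files not yet in the
    catalog and record their metadata. catalog maps path -> (size, sha1)."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if path in catalog:
                continue  # already published; contents are treated as immutable
            with open(path, "rb") as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            catalog[path] = (os.path.getsize(path), digest)
    return catalog
```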


<p>Using CVMFS, copying files is as easy as copying files with any other filesystem:</p>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cp /cvmfs/stash.osgstorage.org/user/&lt;username&gt;/public/… dest/
</code></pre></div>
</div>


<p>CVMFS access also has other features that are beneficial for Stash access. CVMFS caches files locally, so multiple accesses to the same file on the same node are very fast. Also, CVMFS can fall back to other nearby caches if the first fails.</p>


<p><a href="https://support.opensciencegrid.org/support/solutions/articles/12000002775-transferring-data-with-stashcache">StashCP</a> is the second tool that can download data from StashCache. StashCP first tries the CVMFS mount described above, then falls back to the caching proxies and eventually the origin. The order of operations StashCP performs is:</p>


<ol>
<li>Check for the file in CVMFS mount under /cvmfs/stash.osgstorage.org/…</li>
<li>If CVMFS copy fails, connect directly to the nearest proxy and attempt to download the file.</li>
<li>If the proxy fails, then connect directly to the origin server.</li>
</ol>
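<p>The three-step fallback above can be sketched like this. The fetch functions are stand-ins for the real transfer mechanisms, not StashCP’s actual interface.</p>

```python
def stashcp(path, sources):
    """Try each (name, fetch) source in order; return the first success."""
    for name, fetch in sources:
        try:
            return name, fetch(path)
        except IOError:
            continue  # fall back to the next source
    raise IOError("all sources failed for " + path)


def cvmfs(path):
    # Stand-in for step 1: pretend the file is not yet published in CVMFS.
    raise IOError("file not yet published in CVMFS")


def cache(path):
    # Stand-in for step 2: download from the nearest caching proxy.
    return b"data from nearest cache"


def origin(path):
    # Stand-in for step 3: last resort, fetch from the origin server.
    return b"data from origin"


source_chain = [("cvmfs", cvmfs), ("cache", cache), ("origin", origin)]
```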

<p>Since StashCP doesn’t rely solely on the CVMFS mount, files are available to transfer with StashCP immediately after they are copied to Stash.</p>


<p>StashCP is distributed with OSG-Connect’s module system. Using StashCP is nearly as simple as using the <code class="language-plaintext highlighter-rouge">cp</code> command:</p>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ module load stashcp
$ stashcp /user/&lt;username&gt;/public/… dest/
</code></pre></div>
</div>


<h2 id="conclusions">Conclusions</h2>
<p>The StashCache framework is very useful for downloading data to execution hosts across the OSG. It was designed to help opportunistic users to transfer data without the need for dedicated storage or frameworks of their own, like CMS and ATLAS have deployed.</p>


<p>StashCache has been used to transfer over 3 PB of data this year. Check out some of the papers written about using StashCache:</p>

<ul>
<li>Derek Weitzel, Brian Bockelman, Duncan A. Brown, Peter Couvares, Frank Würthwein, and Edgar Fajardo Hernandez. 2017. Data Access for LIGO on the OSG. In Proceedings of PEARC17, New Orleans, LA, USA, July 09-13, 2017, 6 pages. DOI: 10.1145/3093338.3093363 <a href="https://arxiv.org/abs/1705.06202">Online</a></li>
<li>Derek Weitzel, Brian Bockelman, Dave Dykstra, Jakob Blomer, and René Meusel, 2017. Accessing Data Federations with CVMFS. In Journal of Physics - Conference Series. <a href="https://drive.google.com/open?id=0B_RVv_OjWcURUi15cmtUaXotVkU">Online</a></li>
</ul>
58 changes: 58 additions & 0 deletions _posts/dweitzel/2017-9-7-installing-scitokens-on-a-mac.md
@@ -0,0 +1,58 @@
---
author: Derek Weitzel's Blog
author_tag: dweitzel
blog_subtitle: Thoughts from Derek
blog_title: Dereks Web
blog_url: https://derekweitzel.com/
category: dweitzel
date: '2017-09-07 19:20:04'
layout: post
original_url: https://derekweitzel.com/2017/09/07/installing-scitokens-on-a-mac/
slug: installing-scitokens-on-a-mac
title: Installing SciTokens on a Mac
---

<p>In case I ever have to install <a href="https://scitokens.org/">SciTokens</a> again, here are the steps I took to make it work on my Mac. The most difficult part of this is installing the openssl headers for the jwt python library. I followed the advice on this <a href="https://solitum.net/openssl-os-x-el-capitan-and-brew/">blog post</a>.</p>


<ol>
<li>Install <a href="https://brew.sh/">Homebrew</a></li>
<li>
<p>Install openssl:</p>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> brew install openssl
</code></pre></div>
</div>

</li>
<li>
<p>Download the SciTokens library:</p>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> git clone https://github.com/scitokens/scitokens.git
cd scitokens
</code></pre></div>
</div>

</li>
<li>
<p>Create the virtualenv to install the <a href="https://jwt.io/">jwt</a> library</p>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> virtualenv jwt
. jwt/bin/activate
</code></pre></div>
</div>

</li>
<li>
<p>Install jwt pointing to the Homebrew installed openssl headers:</p>


<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> env LDFLAGS="-L$(brew --prefix openssl)/lib" CFLAGS="-I$(brew --prefix openssl)/include" pip install cryptography PyJWT
</code></pre></div>
</div>

</li>
</ol>
