<?xml version="1.0" encoding="utf-8"?>
<search>
  <entry>
    <title>CS144-Lab0</title>
    <url>/2023/09/04/CS144-Lab0/</url>
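    <content><![CDATA[<p>This post follows the <a href="https://cs144.github.io/assignments/check0.pdf">lab handout</a>.</p>
<p>CS144 Lab0 has three parts:</p>
<ul>
<li>Part 1: installing and using the VM</li>
<li>Part 2: trying out networking programs such as telnet</li>
<li>Part 3: writing a network program on top of the OS socket library, and implementing a simple ByteStream</li>
</ul>
<p>Part 1 can be skipped.</p>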
<p>Part 2 mainly introduces <strong>telnet</strong> and <strong>netcat</strong>. <strong>telnet</strong> opens a connection and lets you talk over it in different protocols by hand, while <strong>netcat</strong> sets up one end of an end-to-end connection, as either client or server.</p>
<p>First, run</p>
<figure class="highlight sh"><table><tr><td class="code"><pre><span class="line">netcat -v -l -p 9091</span><br></pre></td></tr></table></figure>
<p>to create a listening socket on port 9091, then open another terminal and run</p>
<figure class="highlight sh"><table><tr><td class="code"><pre><span class="line">telnet localhost 9091</span><br></pre></td></tr></table></figure>
<p>to connect to that port. The netcat window then shows the connection information.</p>
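<p>The focus is Part 3. This part uses Linux sockets to build a TCP-based program that connects to a web server and fetches a page.</p>
<p>A few points to watch out for:</p>
<ul>
<li>In HTTP, every line must end with <code>\r\n</code></li>
<li>Don't leave out <code>Connection: close</code>; otherwise the server keeps the connection open and the process waits forever</li>
</ul>
<p>The code is as follows:</p>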
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="function"><span class="type">void</span> <span class="title">get_URL</span><span class="params">( <span class="type">const</span> string&amp; host, <span class="type">const</span> string&amp; path )</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">  TCPSocket sock1;</span><br><span class="line">  Address addr = <span class="built_in">Address</span>( host, <span class="string">&quot;http&quot;</span> );</span><br><span class="line">  sock1.<span class="built_in">connect</span>( addr );</span><br><span class="line">  sock1.<span class="built_in">write</span>( <span class="string">&quot;GET &quot;</span> + path + <span class="string">&quot; &quot;</span> + <span class="string">&quot;HTTP/1.1\r\nHost: &quot;</span> + host + <span class="string">&quot;\r\nConnection: close\r\n\r\n&quot;</span> );</span><br><span class="line">  <span class="keyword">while</span> ( <span class="number">1</span> ) &#123;</span><br><span class="line">    string recv;</span><br><span class="line">    sock1.<span class="built_in">read</span>( recv );</span><br><span class="line">    cout &lt;&lt; recv;</span><br><span class="line">    <span class="keyword">if</span> ( sock1.<span class="built_in">eof</span>() )</span><br><span class="line">      <span class="keyword">break</span>;</span><br><span class="line">  &#125;</span><br><span class="line">  sock1.<span class="built_in">close</span>();</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
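<p>The ByteStream part requires reading and writing data. I built the buffer on top of <code>std::queue</code>. The main difficulty was the peek function: after consulting code online, I found that the string_view has to be initialized as in the code below:</p>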
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="function">string_view <span class="title">Reader::peek</span><span class="params">()</span> <span class="type">const</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">  <span class="keyword">return</span> &#123; &amp;buffer.<span class="built_in">front</span>(), <span class="number">1</span> &#125;;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
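<h3 id="优化部分">Optimization</h3>
<p>Use string_view together with move so that pushed data is moved rather than copied:</p>
<p>Two queues hold the data and the views that refer to it:</p>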
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line">std::queue&lt;std::string_view&gt; buffer;</span><br><span class="line">std::queue&lt;std::string&gt; buffer_actual;</span><br></pre></td></tr></table></figure>
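<p>Reader's pop then has to handle two cases, depending on whether the front chunk is longer than the remaining length to pop:</p>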
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="function"><span class="type">void</span> <span class="title">Reader::pop</span><span class="params">( <span class="type">uint64_t</span> len )</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">  bytesPopped += len;</span><br><span class="line">  <span class="keyword">for</span> ( <span class="type">unsigned</span> i = <span class="number">0</span>; i &lt; len; ) &#123;</span><br><span class="line">    <span class="keyword">if</span> ( buffer.<span class="built_in">front</span>().<span class="built_in">size</span>() &gt; len - i ) &#123;</span><br><span class="line">      buffer.<span class="built_in">front</span>() = buffer.<span class="built_in">front</span>().<span class="built_in">substr</span>( len - i );</span><br><span class="line">      i = len;</span><br><span class="line">    &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">      i += buffer.<span class="built_in">front</span>().<span class="built_in">size</span>();</span><br><span class="line">      buffer.<span class="built_in">pop</span>();</span><br><span class="line">      buffer_actual.<span class="built_in">pop</span>();</span><br><span class="line">    &#125;</span><br><span class="line">  &#125;</span><br><span class="line"></span><br><span class="line">  <span class="keyword">while</span> ( !buffer.<span class="built_in">empty</span>() &amp;&amp; buffer.<span class="built_in">front</span>().<span class="built_in">empty</span>() )</span><br><span class="line">    buffer.<span class="built_in">pop</span>();</span><br><span class="line">  bytesBuffered -= len;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>The result after optimization: <img src="/images/CS144_Lab0.png" alt="img" /></p>
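<p>The two-queue idea can be sketched in isolation as follows. This is a minimal illustration, not the lab's actual ByteStream: the class name <code>ViewBuffer</code> and its member names are made up for this example. It relies on the fact that <code>std::queue</code> uses <code>std::deque</code> by default, whose elements stay in place on push and pop at the ends, so the stored views remain valid.</p>

```cpp
#include <cstdint>
#include <queue>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical helper, not part of the CS144 starter code.
class ViewBuffer
{
public:
  // Take ownership of the data without copying its bytes.
  void push( std::string data )
  {
    if ( data.empty() )
      return;
    storage_.push( std::move( data ) );
    views_.push( storage_.back() ); // view into the string now owned by storage_
    buffered_ += storage_.back().size();
  }

  // The unread part of the oldest chunk (empty if nothing is buffered).
  std::string_view peek() const { return views_.empty() ? std::string_view {} : views_.front(); }

  // Drop up to len bytes from the front.
  void pop( std::uint64_t len )
  {
    while ( len > 0 && !views_.empty() ) {
      std::string_view& front = views_.front();
      if ( front.size() > len ) {
        // Case 1: only part of the front chunk is consumed; shrink the view.
        front.remove_prefix( len );
        buffered_ -= len;
        return;
      }
      // Case 2: the whole front chunk is consumed; drop the view and its owner together.
      len -= front.size();
      buffered_ -= front.size();
      views_.pop();
      storage_.pop();
    }
  }

  std::uint64_t bytes_buffered() const { return buffered_; }

private:
  std::queue<std::string> storage_;     // owns the bytes
  std::queue<std::string_view> views_;  // unread part of each owned string
  std::uint64_t buffered_ = 0;
};
```

<p>Popping the view and the owning string in the same branch keeps the two queues in step, which is the invariant this layout depends on.</p>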
]]></content>
      <categories>
        <category>Network</category>
      </categories>
      <tags>
        <tag>Network</tag>
        <tag>Algorithm</tag>
        <tag>CS144</tag>
      </tags>
  </entry>
  <entry>
    <title>CS144-Lab2</title>
    <url>/2023/09/12/CS144-Lab2/</url>
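    <content><![CDATA[<p>The main task of CS144 Lab2 is to build a TCP Receiver. In TCP, every end system plays two roles, <strong>Sender</strong> and <strong>Receiver</strong>; this lab is about the latter.</p>
<p>The Receiver has several jobs:</p>
<ul>
<li>Receive data from the Sender</li>
<li>Reassemble that data (already done in Lab1)</li>
<li>Decide whether to send <strong>Acknowledgement</strong> and <strong>Flow-Control</strong> information back</li>
</ul>
<p>Note that the <strong>Acknowledgement</strong> is the index of the next byte the Receiver needs, while <strong>Flow-Control</strong> is how much data the Receiver is willing to accept.</p>
<h2 id="转换64位和32位的seqnos">Converting between 64-bit and 32-bit seqnos</h2>
<p>A 64-bit index is large enough that it can be treated as never overflowing, but a 32-bit index can only address 4 GiB, so it can run out. In the TCP header, however, the seqno is a 32-bit field: to save space, every byte of the sequence is addressed with a 32-bit number.</p>
<p>This leads to a few TCP mechanisms:</p>
<ul>
<li>Once the 32-bit sequence number reaches <span class="math inline">\(2^{32} - 1\)</span>, the index of the next byte wraps around to 0.</li>
<li>To improve robustness and avoid confusion with old segments from earlier connections between the same endpoints, TCP tries to make sure sequence numbers are hard to guess and unlikely to repeat. So the sequence numbers of a stream do not start at zero. The first sequence number in the stream is a random 32-bit number called the Initial Sequence Number (<span class="math inline">\(ISN\)</span>). This is the sequence number that represents the "zero point", or the <span class="math inline">\(SYN\)</span> (beginning of stream). The following sequence numbers behave as usual: the first byte of data has sequence number <span class="math inline">\((ISN + 1) \bmod 2^{32}\)</span>, the second byte has <span class="math inline">\((ISN + 2) \bmod 2^{32}\)</span>, and so on.</li>
<li>(Quoted from the handout.) The logical beginning and ending each occupy one sequence number: In addition to ensuring the receipt of all bytes of data, TCP makes sure that the beginning and ending of the stream are received reliably. Thus, in TCP the SYN (beginning-of-stream) and FIN (end-of-stream) control flags are assigned sequence numbers. Each of these occupies one sequence number. (The sequence number occupied by the SYN flag is the ISN.) Each byte of data in the stream also occupies one sequence number. Keep in mind that SYN and FIN aren’t part of the stream itself and aren’t “bytes”; they represent the beginning and ending of the byte stream itself.</li>
</ul>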
<p>In short, we need to implement a <code>Wrap32</code> class to do these conversions. The basic code is as follows:</p>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="function">Wrap32 <span class="title">Wrap32::wrap</span><span class="params">( <span class="type">uint64_t</span> n, Wrap32 zero_point )</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">  <span class="keyword">return</span> zero_point + n;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="type">uint64_t</span> <span class="title">Wrap32::unwrap</span><span class="params">( Wrap32 zero_point, <span class="type">uint64_t</span> checkpoint )</span> <span class="type">const</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">  <span class="type">uint64_t</span> cycle = <span class="number">1ll</span> &lt;&lt; <span class="number">32</span>;</span><br><span class="line">  <span class="type">uint64_t</span> n_cycle = checkpoint / cycle;</span><br><span class="line">  <span class="type">uint64_t</span> diff = raw_value_ - zero_point.raw_value_;</span><br><span class="line">  <span class="type">uint64_t</span> upper = ( n_cycle + <span class="number">1ll</span> ) * cycle + diff;</span><br><span class="line">  <span class="type">uint64_t</span> middle = n_cycle * cycle + diff;</span><br><span class="line">  <span class="type">uint64_t</span> lower = ( n_cycle - <span class="number">1ll</span> ) * cycle + diff;</span><br><span class="line">  <span class="comment">// lower is only meaningful when n_cycle != 0; otherwise ( n_cycle - 1 ) wraps around as an unsigned value.</span></span><br><span class="line">  <span class="keyword">if</span> ( n_cycle != <span class="number">0</span> &amp;&amp; checkpoint &lt;= ( lower + middle ) / <span class="number">2</span> )</span><br><span class="line">    <span class="keyword">return</span> lower;</span><br><span class="line">  <span class="keyword">if</span> ( checkpoint &lt;= ( middle + upper ) / <span class="number">2</span> )</span><br><span class="line">    <span class="keyword">return</span> middle;</span><br><span class="line">  <span class="keyword">else</span></span><br><span class="line">    <span class="keyword">return</span> upper;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
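<p>Honestly, I spent a long time debugging this, mainly because I had not considered the case <span class="math inline">\(lower &lt; 0\)</span>: when the checkpoint lies in the first <span class="math inline">\(2^{32}\)</span> block there is no lower candidate, and the unsigned arithmetic wraps around instead of going negative.</p>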
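<p>One way to sidestep that corner case is sketched below on plain integers. This is not the lab's <code>Wrap32</code> class: the free functions <code>wrap</code> and <code>unwrap</code> and their raw <code>uint32_t</code> parameters are assumptions made for illustration. The sketch starts from the candidate that lies in the same <span class="math inline">\(2^{32}\)</span> block as the checkpoint and moves down one block only when that cannot underflow.</p>

```cpp
#include <cstdint>

// Illustrative free functions; the lab itself wraps these in a Wrap32 class.
std::uint32_t wrap( std::uint64_t n, std::uint32_t zero_point )
{
  // Truncating to 32 bits is the same as reducing modulo 2^32.
  return static_cast<std::uint32_t>( zero_point + n );
}

std::uint64_t unwrap( std::uint32_t seqno, std::uint32_t zero_point, std::uint64_t checkpoint )
{
  constexpr std::uint64_t cycle = 1ULL << 32;
  // Distance from the zero point, reduced modulo 2^32.
  const std::uint64_t offset = static_cast<std::uint32_t>( seqno - zero_point );
  // Candidate in the same 2^32-sized block as the checkpoint.
  std::uint64_t candidate = ( checkpoint & ~( cycle - 1 ) ) + offset;

  if ( candidate > checkpoint && candidate - checkpoint > cycle / 2 && candidate >= cycle ) {
    // The block below is closer, and it exists (no underflow).
    candidate -= cycle;
  } else if ( candidate < checkpoint && checkpoint - candidate > cycle / 2 ) {
    // The block above is closer. Overflow near 2^64 is ignored in this sketch.
    candidate += cycle;
  }
  return candidate;
}
```

<p>For instance, <code>unwrap( 0xFFFFFFFF, 0, 0 )</code> stays at <code>0xFFFFFFFF</code> because no lower block exists, while <code>unwrap( 10, 0, ( 1ULL &lt;&lt; 32 ) - 5 )</code> moves up to <code>( 1ULL &lt;&lt; 32 ) + 10</code>.</p>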
<p>Next is the receiver code:</p>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line">TCPReceiver::<span class="built_in">TCPReceiver</span>() : <span class="built_in">ISN</span>( <span class="literal">nullopt</span> ), <span class="built_in">FIN</span>( <span class="literal">false</span> ) &#123;&#125;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="type">void</span> <span class="title">TCPReceiver::receive</span><span class="params">( TCPSenderMessage message, Reassembler&amp; reassembler, Writer&amp; inbound_stream )</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">  <span class="keyword">if</span> ( message.SYN )</span><br><span class="line">    ISN = message.seqno;</span><br><span class="line"></span><br><span class="line">  <span class="keyword">if</span> ( !ISN.<span class="built_in">has_value</span>() )</span><br><span class="line">    <span class="keyword">return</span>;</span><br><span class="line"></span><br><span class="line">  <span class="keyword">if</span> ( message.FIN )</span><br><span class="line">    FIN = <span class="literal">true</span>;</span><br><span class="line"></span><br><span class="line">  <span class="comment">// Checkpoint is the first unassembled index; + SYN - 1 turns the absolute seqno into a stream index.</span></span><br><span class="line">  reassembler.<span class="built_in">insert</span>( message.seqno.<span class="built_in">unwrap</span>( ISN.<span class="built_in">value</span>(), inbound_stream.<span class="built_in">bytes_pushed</span>() ) + message.SYN - <span class="number">1ll</span>,</span><br><span class="line">                      message.payload,</span><br><span class="line">                      message.FIN,</span><br><span class="line">                      inbound_stream );</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="function">TCPReceiverMessage <span class="title">TCPReceiver::send</span><span class="params">( <span class="type">const</span> Writer&amp; inbound_stream )</span> <span class="type">const</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">  TCPReceiverMessage ret;</span><br><span class="line">  <span class="keyword">if</span> ( !ISN.<span class="built_in">has_value</span>() )</span><br><span class="line">    ret.ackno = <span class="literal">nullopt</span>;</span><br><span class="line">  <span class="keyword">else</span></span><br><span class="line">    <span class="comment">// +1 for the SYN flag, and finish only when FIN flag reached and stream is closed.</span></span><br><span class="line">    ret.ackno</span><br><span class="line">      = Wrap32::<span class="built_in">wrap</span>( inbound_stream.<span class="built_in">bytes_pushed</span>() + <span class="number">1</span> + ( FIN &amp;&amp; inbound_stream.<span class="built_in">is_closed</span>() ), ISN.<span class="built_in">value</span>() );</span><br><span class="line"></span><br><span class="line">  ret.window_size = <span class="built_in">min</span>( inbound_stream.<span class="built_in">available_capacity</span>(), (<span class="type">uint64_t</span>)UINT16_MAX );</span><br><span class="line"></span><br><span class="line">  <span class="keyword">return</span> ret;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>The logic is simple; the only care needed is in handling <span class="math inline">\(SYN\)</span> and <span class="math inline">\(FIN\)</span>.</p>
]]></content>
      <categories>
        <category>Network</category>
      </categories>
      <tags>
        <tag>Network</tag>
        <tag>CS144</tag>
      </tags>
  </entry>
  <entry>
    <title>CS144-Lab3</title>
    <url>/2023/09/17/CS144-Lab3/</url>
    <content><![CDATA[<p>Lab3 continues from Lab2 and implements the <strong>Sender</strong> role. The TCP sender's main responsibilities are:</p>
<ul>
<li>Keep track of the receiver's window (acknos and window size)</li>
<li>Fill the window when possible, by reading from the <em>ByteStream</em>, creating new TCP segments (including <em>SYN</em> and <em>FIN</em> flags if needed), and sending them.</li>
<li>Keep track of which segments have been sent but not yet acknowledged by the receiver — we call these “<strong>outstanding</strong>” segments</li>
<li>Re-send <strong>outstanding</strong> segments if <strong>enough time passes</strong> since they were sent, and they haven’t been acknowledged yet</li>
</ul>
<blockquote>
<p>Why am I doing this? The basic principle is to send whatever the receiver will allow us to send (filling the window), and keep retransmitting until the receiver acknowledges each segment. This is called “automatic repeat request” (ARQ).</p>
</blockquote>
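<h2 id="那么tcpsender是怎么时候知道一段segment丢失了呢">How does the TCPSender know that a segment was lost?</h2>
<p>The Sender keeps track of every outstanding segment until the receiver's ackno acknowledges it. If a segment stays outstanding for too long, we need to send it again.</p>
<p>Of course, there are rules for what "outstanding for too long" means, but Lab3 does not ask us to handle tricky or hair-splitting cases (those are left for Lab4).</p>
<p>A few key points:</p>
<ul>
<li>The Sender's <strong>tick</strong> function is the only time-related function you may use. Any other call to the CPU/OS for the time is forbidden.</li>
<li>The Sender is configured with a <strong>retransmission timeout (RTO)</strong>: how long to wait before resending an outstanding segment.</li>
<li>We have to implement the retransmission timer ourselves, <strong>on top of tick</strong>.</li>
<li>Whenever a segment containing data is sent, start the timer if it is not already running.</li>
<li>When all outstanding data has been acknowledged, stop the timer.</li>
</ul>
<p>Now we can discuss the RTO and the retransmission timer. First, whenever a segment carrying data is sent, the timer must be running. If the timer has expired when tick is called, then:</p>
<ul>
<li>Resend the outstanding segment with the lowest seqno</li>
<li>If the window size is nonzero:
<ul>
<li>Increment the retransmission count (it is reset to zero when the timer stops)</li>
<li>Double the RTO (exponential backoff, which slows retransmissions down to match the load on the network)</li>
</ul></li>
<li>Reset the timer and start it</li>
</ul>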
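<p>The timer rules above can be sketched as a small standalone class. This is only an illustration under assumptions: the class name <code>RetransmissionTimer</code> and all of its method names are made up here, and restoring the RTO to its initial value in <code>stop()</code> is my assumption rather than something stated above. The caller is expected to resend the earliest outstanding segment whenever <code>tick</code> returns true.</p>

```cpp
#include <cstdint>

// Hypothetical helper class; names are illustrative, not from the lab's starter code.
class RetransmissionTimer
{
public:
  explicit RetransmissionTimer( std::uint64_t initial_rto_ms )
    : initial_rto_ms_( initial_rto_ms ), rto_ms_( initial_rto_ms )
  {}

  bool running() const { return running_; }
  std::uint64_t rto_ms() const { return rto_ms_; }
  std::uint64_t consecutive_retransmissions() const { return retransmissions_; }

  // Call when a segment carrying data is sent: start only if not already running.
  void start_if_stopped()
  {
    if ( !running_ ) {
      running_ = true;
      elapsed_ms_ = 0;
    }
  }

  // Call when all outstanding data has been acknowledged.
  void stop()
  {
    running_ = false;
    elapsed_ms_ = 0;
    retransmissions_ = 0;
    rto_ms_ = initial_rto_ms_; // assumption: the backoff is undone when the timer stops
  }

  // Advance time. Returns true if the timer expired, i.e. the caller should
  // resend the outstanding segment with the lowest seqno.
  bool tick( std::uint64_t ms_since_last_tick, std::uint16_t window_size )
  {
    if ( !running_ )
      return false;
    elapsed_ms_ += ms_since_last_tick;
    if ( elapsed_ms_ < rto_ms_ )
      return false;
    if ( window_size != 0 ) {
      ++retransmissions_;
      rto_ms_ *= 2; // exponential backoff
    }
    elapsed_ms_ = 0; // reset and keep running
    return true;
  }

private:
  std::uint64_t initial_rto_ms_;
  std::uint64_t rto_ms_;
  std::uint64_t elapsed_ms_ = 0;
  std::uint64_t retransmissions_ = 0;
  bool running_ = false;
};
```

<p>Keeping the timer separate from the segment queue means the sender only has to react to the boolean returned by <code>tick</code>.</p>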
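<p>Besides that, handling <span class="math inline">\(FIN\)</span> is also a bit messy, so take care when implementing it.</p>
<p>Lab4 and Lab5 are fairly simple, so I won't write them up: one is a NetworkInterface implementation covering IP/Ethernet and ARP, the other is a Router's forwarding table. Neither takes much thought.</p>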
]]></content>
      <categories>
        <category>Network</category>
      </categories>
      <tags>
        <tag>Network</tag>
        <tag>CS144</tag>
      </tags>
  </entry>
  <entry>
    <title>Cache Performance Analysis</title>
    <url>/2023/08/07/Cache-Performance-Analysis/</url>
    <content><![CDATA[<h2 id="some-concepts">Some Concepts</h2>
<p>AMAT: average memory access time. <span class="math inline">\(AMAT = t_{hit} + rate_{miss} \times penalty_{miss}\)</span></p>
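<p>As a quick worked example of the formula, here is a minimal sketch. The function name <code>amat</code> and the numbers used below are made up for illustration.</p>

```cpp
#include <cassert>

// Average memory access time for a single cache level.
// hit_time and miss_penalty use the same time unit (cycles here); miss_rate is a fraction in [0, 1].
double amat( double hit_time, double miss_rate, double miss_penalty )
{
  assert( miss_rate >= 0.0 && miss_rate <= 1.0 );
  return hit_time + miss_rate * miss_penalty;
}
```

<p>With a hypothetical 1-cycle hit time, 5% miss rate and 100-cycle miss penalty, <code>amat( 1.0, 0.05, 100.0 )</code> works out to 1 + 0.05 × 100 = 6 cycles.</p>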
<h2 id="cache-miss">Cache Miss</h2>
<p>Sources of Cache Misses:</p>
<ul>
<li>Compulsory (e.g. cold start, process migration, first reference to a block)</li>
<li>Capacity</li>
<li>Conflict (Collision)</li>
</ul>
<p>The Design Solutions:</p>
<ul>
<li>Compulsory:
<ul>
<li>Increase block size</li>
</ul></li>
<li>Capacity:
<ul>
<li>Increase cache size</li>
</ul></li>
<li>Conflict:
<ul>
<li>Increase associativity (may increase hit-time)</li>
</ul></li>
</ul>
<h2 id="miss-penalty">Miss Penalty</h2>
<p>Factors:</p>
<ul>
<li>The size of your memory architecture</li>
<li>The block size (larger blocks take longer to fetch)</li>
</ul>
<h2 id="multiple-cache-levels">Multiple Cache Levels</h2>
<p>To minimize AMAT, we need to adjust the type and parameters of the cache. But it's hard to reduce hit time, miss rate and miss penalty all at once.</p>
<p>Multiple cache levels resolve this.</p>
<p>In general, L1 focuses on low hit time, while L2 and L3 focus on low miss rate. However, such a design also carries a large write-back cost.</p>
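<p>To see why a second level helps, the AMAT formula can be applied recursively: the miss penalty of L1 is itself the access time of L2. The sketch below assumes a two-level hierarchy; <code>CacheLevel</code>, <code>two_level_amat</code> and the numbers used afterwards are hypothetical.</p>

```cpp
#include <cassert>

// One cache level: its hit time and its local miss rate (misses / accesses reaching this level).
struct CacheLevel
{
  double hit_time;
  double local_miss_rate;
};

// AMAT = t_L1 + m_L1 * ( t_L2 + m_L2 * memory_penalty )
double two_level_amat( CacheLevel l1, CacheLevel l2, double memory_penalty )
{
  assert( l1.local_miss_rate >= 0.0 && l1.local_miss_rate <= 1.0 );
  assert( l2.local_miss_rate >= 0.0 && l2.local_miss_rate <= 1.0 );
  // The L1 miss penalty is the average time to get the block from L2 (or from memory behind it).
  const double l1_miss_penalty = l2.hit_time + l2.local_miss_rate * memory_penalty;
  return l1.hit_time + l1.local_miss_rate * l1_miss_penalty;
}
```

<p>For an L1 with a 1-cycle hit time and 5% miss rate, an L2 with a 10-cycle hit time and 20% local miss rate, and a 100-cycle memory penalty, the L1 miss penalty is 10 + 0.2 × 100 = 30 cycles, so the AMAT is 1 + 0.05 × 30 = 2.5 cycles. The same L1 backed directly by memory would give 1 + 0.05 × 100 = 6 cycles.</p>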
<h2 id="the-cache-design-space">The Cache Design Space</h2>
<ul>
<li>Cache parameters</li>
<li>Policy choices (Write, Replacement)</li>
<li>Optimal choice is a compromise</li>
<li>Simplicity often wins</li>
</ul>
]]></content>
      <categories>
        <category>Architecture</category>
      </categories>
      <tags>
        <tag>Architecture</tag>
        <tag>CS61C</tag>
      </tags>
  </entry>
  <entry>
    <title>CS144-Lab1</title>
    <url>/2023/09/12/CS144-Lab1/</url>
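    <content><![CDATA[<p>The main task of CS144 Lab1 is to build the <strong>Reassembler</strong> for <strong>TCP</strong>. Its main interface is:</p>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="function"><span class="type">void</span> <span class="title">Reassembler::insert</span><span class="params">( <span class="type">uint64_t</span> first_index, string data, <span class="type">bool</span> is_last_substring, Writer&amp; output )</span></span>;</span><br></pre></td></tr></table></figure>
<p>Here <code>first_index</code> is the logical index of the data, i.e. the position of its first byte within the stream (not the order in which it arrived), and <code>is_last_substring</code> marks whether this piece is the last one. <code>output</code> is simply the output byte stream.</p>
<p>The TCP data stream is generally made up of the following parts:</p>
<p><code>| popped data | unpopped-and-pushed data | arrived-and-unpushed data |</code></p>
<p><strong>popped data</strong> has already been consumed by the Reader. <strong>unpopped-and-pushed data</strong> has been written into the buffer by the Writer but not yet read by the Reader. Finally, <strong>arrived-and-unpushed data</strong> has been received from the network but is non-contiguous, so it has not yet been assembled and handed to the Writer. It is what this lab mainly works on.</p>
<p>By the nature of computer networks, pieces of data arrive out of order, may overlap one another, and may contain bytes that have already been pushed. The Reassembler has to deal with all of this to provide a <strong>reliable stream service (Reliable Flow)</strong>.</p>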
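<h2 id="设计思路">Design</h2>
<p>My basic idea is to store the <strong>arrived-and-unpushed</strong> data in a buffer that behaves like a char array, and to keep the index intervals <span class="math inline">\([l,r]\)</span> of the data that has arrived in a <code>std::map&lt;uint64_t, uint64_t&gt;</code>, merging intervals whenever new data arrives. This is the same as the classic algorithm problem <strong>Insert Interval</strong>.</p>
<p>The rest of the logic is fairly simple:</p>
<ul>
<li>If <code>first_index</code> is the first index of the arrived-and-unpushed data, push the data directly</li>
<li>Skip empty strings, but if <code>is_last_substring</code> is set, close the writer</li>
<li>After part of the data has been pushed, the remaining data in buf has to be shifted forward (room for optimization?)</li>
</ul>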
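<p>The Reassembler's members are as follows:</p>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="keyword">private</span>:</span><br><span class="line">  std::map&lt;<span class="type">uint64_t</span>, <span class="type">uint64_t</span>&gt; buffer; <span class="comment">// pending intervals: first index -&gt; last index (inclusive)</span></span><br><span class="line">  std::string buf; <span class="comment">// bytes not yet pushed, indexed relative to bytes_pushed()</span></span><br><span class="line">  <span class="type">uint64_t</span> end_index; <span class="comment">// index of the last byte of the stream, once known</span></span><br><span class="line">  <span class="type">uint64_t</span> pending; <span class="comment">// number of bytes stored but not yet pushed</span></span><br></pre></td></tr></table></figure>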
<p>The implementation is as follows:</p>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="function"><span class="type">void</span> <span class="title">Reassembler::insert</span><span class="params">( <span class="type">uint64_t</span> first_index, string data, <span class="type">bool</span> is_last_substring, Writer&amp; output )</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">  <span class="keyword">if</span> ( buf.<span class="built_in">empty</span>() )</span><br><span class="line">    buf.<span class="built_in">resize</span>( output.<span class="built_in">capacity</span>() );</span><br><span class="line"></span><br><span class="line">  <span class="type">uint64_t</span> bias_push = output.<span class="built_in">bytes_pushed</span>();</span><br><span class="line">  <span class="type">uint64_t</span> insert_l = <span class="built_in">max</span>( output.<span class="built_in">bytes_pushed</span>(), first_index );</span><br><span class="line">  <span class="type">uint64_t</span> insert_r = <span class="built_in">min</span>( first_index + data.<span class="built_in">size</span>() - <span class="number">1</span>, output.<span class="built_in">available_capacity</span>() + bias_push - <span class="number">1</span> );</span><br><span class="line"></span><br><span class="line">  <span class="keyword">if</span> ( is_last_substring )</span><br><span class="line">    end_index = first_index + data.<span class="built_in">size</span>() - <span class="number">1</span>;</span><br><span class="line"></span><br><span class="line">  <span class="keyword">if</span> ( data.<span class="built_in">empty</span>() &amp;&amp; is_last_substring ) &#123;</span><br><span class="line">    output.<span class="built_in">close</span>();</span><br><span class="line">    <span class="keyword">return</span>;</span><br><span class="line">  &#125;</span><br><span class="line"></span><br><span class="line">  <span class="keyword">if</span> ( insert_l - first_index &gt;= data.<span class="built_in">size</span>() )</span><br><span class="line">    <span class="keyword">return</span>;</span><br><span class="line">  <span class="keyword">if</span> ( insert_l &gt; insert_r )</span><br><span class="line">    <span class="keyword">return</span>;</span><br><span class="line"></span><br><span class="line">  <span class="keyword">for</span> ( <span class="type">uint64_t</span> i = insert_l; i &lt;= insert_r; i++ ) &#123;</span><br><span class="line">    buf[i - bias_push] = data[i - first_index];</span><br><span class="line">  &#125;</span><br><span class="line"></span><br><span class="line">  <span class="type">bool</span> changed = <span class="literal">true</span>;</span><br><span class="line">  <span class="keyword">while</span> ( changed &amp;&amp; !buffer.<span class="built_in">empty</span>() ) &#123;</span><br><span class="line">    changed = <span class="literal">false</span>;</span><br><span class="line">    <span class="keyword">auto</span> upper = buffer.<span class="built_in">lower_bound</span>( insert_l );</span><br><span class="line"></span><br><span class="line">    <span class="comment">// upper.first &gt;= l, compare [l,r] with [uf, us]</span></span><br><span class="line">    <span class="keyword">if</span> ( upper != buffer.<span class="built_in">end</span>() &amp;&amp; insert_r + <span class="number">1</span> &gt;= upper-&gt;first ) &#123;</span><br><span class="line"></span><br><span class="line">      insert_r = <span class="built_in">max</span>( upper-&gt;second, insert_r );</span><br><span class="line">      pending -= upper-&gt;second - upper-&gt;first + <span class="number">1</span>;</span><br><span class="line"></span><br><span class="line">      buffer.<span class="built_in">erase</span>( upper );</span><br><span class="line">      changed = <span class="literal">true</span>;</span><br><span class="line">      <span class="keyword">continue</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> ( upper == buffer.<span class="built_in">begin</span>() || buffer.<span class="built_in">empty</span>() )</span><br><span class="line">      <span class="keyword">break</span>;</span><br><span class="line">    <span class="keyword">auto</span> lower = --upper;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// lower.first &lt; l, compare [lf, ls] with [l, r]</span></span><br><span class="line">    <span class="keyword">if</span> ( lower != buffer.<span class="built_in">end</span>() &amp;&amp; lower-&gt;second + <span class="number">1</span> &gt;= insert_l ) &#123;</span><br><span class="line"></span><br><span class="line">      insert_l =  lower-&gt;first;</span><br><span class="line">      insert_r = <span class="built_in">max</span>( lower-&gt;second, insert_r );</span><br><span class="line">      pending -= lower-&gt;second - lower-&gt;first + <span class="number">1</span>;</span><br><span class="line"></span><br><span class="line">      buffer.<span class="built_in">erase</span>( lower );</span><br><span class="line">      changed = <span class="literal">true</span>;</span><br><span class="line">      <span class="keyword">continue</span>;</span><br><span class="line">    &#125;</span><br><span class="line">  &#125;</span><br><span class="line"></span><br><span class="line">  <span class="keyword">if</span> ( insert_l == output.<span class="built_in">bytes_pushed</span>() ) &#123;</span><br><span class="line">    <span class="type">uint64_t</span> old_bias = output.<span class="built_in">bytes_pushed</span>();</span><br><span class="line">    output.<span class="built_in">push</span>( buf.<span class="built_in">substr</span>( insert_l - output.<span class="built_in">bytes_pushed</span>(), insert_r - insert_l + <span class="number">1</span> ) );</span><br><span class="line">    bias_push = output.<span class="built_in">bytes_pushed</span>();</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> ( insert_r == end_index ) &#123;</span><br><span class="line">      output.<span class="built_in">close</span>();</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">for</span> ( <span class="keyword">auto</span> it = buffer.<span class="built_in">begin</span>(); it != buffer.<span class="built_in">end</span>(); it++ )</span><br><span class="line">      <span class="keyword">for</span> ( <span class="type">uint64_t</span> i = it-&gt;first; i &lt;= it-&gt;second; ++i )</span><br><span class="line">        buf[i - bias_push] = buf[i - old_bias];</span><br><span class="line">  &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">    buffer[insert_l] = insert_r;</span><br><span class="line">    pending += insert_r - insert_l + <span class="number">1</span>;</span><br><span class="line">  &#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="type">uint64_t</span> <span class="title">Reassembler::bytes_pending</span><span class="params">()</span> <span class="type">const</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">  <span class="keyword">return</span> pending;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
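<p>Most of the code matches the Insert Interval problem. The final throughput is 1.78 Gbit/s, which leaves some room for optimization.</p>
<p>I think storing the pieces as strings first and merging them only at the end would improve performance considerably.</p>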
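<p>The interval-merging step on its own can be sketched as follows. This is a compact illustration rather than the code above: <code>IntervalMap</code> and <code>insert_interval</code> are hypothetical names, intervals are closed <span class="math inline">\([l,r]\)</span>, and adjacent intervals are merged as well as overlapping ones. Indices are assumed to stay well below <span class="math inline">\(2^{64}\)</span>, so <code>r + 1</code> does not overflow.</p>

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <map>

// Maps the first index of each stored interval to its last index (inclusive).
using IntervalMap = std::map<std::uint64_t, std::uint64_t>;

// Insert [l, r] and merge it with every stored interval it overlaps or touches.
void insert_interval( IntervalMap& intervals, std::uint64_t l, std::uint64_t r )
{
  // First stored interval that starts strictly after l.
  auto it = intervals.upper_bound( l );

  // The interval just before it starts at or before l; include it if it reaches l.
  if ( it != intervals.begin() ) {
    auto prev = std::prev( it );
    if ( prev->second + 1 >= l )
      it = prev;
  }

  // Absorb every interval that starts inside or right after [l, r].
  while ( it != intervals.end() && it->first <= r + 1 ) {
    l = std::min( l, it->first );
    r = std::max( r, it->second );
    it = intervals.erase( it );
  }

  intervals[l] = r;
}
```

<p>Since each absorbed interval is erased exactly once, the loop does work proportional to the number of intervals merged, plus one logarithmic lookup.</p>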
]]></content>
      <categories>
        <category>Network</category>
      </categories>
      <tags>
        <tag>Network</tag>
        <tag>Algorithm</tag>
        <tag>CS144</tag>
      </tags>
  </entry>
  <entry>
    <title>Different Caches</title>
    <url>/2023/08/07/Different-Caches/</url>
    <content><![CDATA[<h2 id="fully-associative-caches">Fully Associative Caches</h2>
<p>This is the basic cache organization, so the details are omitted here. Note: the number of offset bits is determined by the block size.</p>
<h2 id="direct-mapped-caches">Direct Mapped Caches</h2>
<p>For an ordinary fully associative cache, we break the address down into: <code>[Tag | 31 ~ X bits] [Offset | X-1 ~ 0 bits]</code>, which requires comparing the tag against every slot.</p>
<p>So we design a direct-mapped cache, inspired by the hash table. Now we break the address down into: <code>[Tag | 31 ~ X bits] [Index | X-1 ~ Y bits] [Offset | Y-1 ~ 0 bits]</code>. The <strong>Index</strong> serves as the hash code of the address, and the <strong>Tag</strong> as an identifier.</p>
<p>For a cache with a <em>write-back policy</em>, every slot has:</p>
<ul>
<li>Block of data</li>
<li>Index field (implied by the slot's position, so it does not need to be stored)</li>
<li>Tag field of the address as identifier</li>
<li>Valid bit</li>
<li>Dirty bit</li>
<li>No replacement management bits</li>
</ul>
<p>For example, for address 10010010, we can break it down into:</p>
<ul>
<li>Tag: 1001</li>
<li>Index: 001</li>
<li>Offset: 0</li>
</ul>
<p>Then look up <code>cache[Index]</code>, check that the slot is valid and its tag matches, and use <code>Offset</code> to pick the data within the block.</p>
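<p>The breakdown in the example can be reproduced with a few shifts and masks. The sketch below is illustrative only: <code>AddressFields</code> and <code>split_address</code> are hypothetical names, and the field widths are passed in as parameters.</p>

```cpp
#include <cassert>
#include <cstdint>

struct AddressFields
{
  std::uint32_t tag;
  std::uint32_t index;
  std::uint32_t offset;
};

// Split an address into [Tag | Index | Offset] given the widths of the two low fields.
AddressFields split_address( std::uint32_t addr, unsigned index_bits, unsigned offset_bits )
{
  assert( index_bits + offset_bits < 32 );
  const std::uint32_t offset = addr & ( ( 1u << offset_bits ) - 1u );
  const std::uint32_t index = ( addr >> offset_bits ) & ( ( 1u << index_bits ) - 1u );
  const std::uint32_t tag = addr >> ( offset_bits + index_bits );
  return { tag, index, offset };
}
```

<p>For the address 10010010 (0x92) with 3 index bits and 1 offset bit, <code>split_address( 0x92, 3, 1 )</code> yields tag 1001, index 001 and offset 0, matching the list above.</p>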
<p>This design also has a worst case. Because multiple addresses map to the same slot, consider the memory accesses 00000010, 00010010, 00000010, 00010010, ... Both addresses have index 001 but different tags, so each access evicts the other block and every access misses.</p>
<p>A fully associative cache, by contrast, misses only twice.</p>
<p>Where <strong>direct-mapped</strong> beats <strong>fully-associative</strong> is its fast lookup: the index selects a single slot, so only one tag has to be compared.</p>
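<p>The ping-pong pattern can be checked with a tiny simulation. This is a sketch under assumptions: both caches have 8 slots and a 1-bit offset to match the example split (4-bit tag, 3-bit index), the fully associative cache uses LRU replacement (the policy is my assumption), and the function names are made up.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Direct-mapped cache with 8 slots: 1 offset bit, 3 index bits, the rest is the tag.
std::size_t direct_mapped_misses( const std::vector<std::uint32_t>& accesses )
{
  std::vector<std::optional<std::uint32_t>> slot_tag( 8 ); // nullopt means the slot is invalid
  std::size_t misses = 0;
  for ( std::uint32_t addr : accesses ) {
    const std::uint32_t index = ( addr >> 1 ) & 0x7u;
    const std::uint32_t tag = addr >> 4;
    if ( slot_tag[index] != tag ) {
      ++misses;
      slot_tag[index] = tag; // evict whatever was in the slot
    }
  }
  return misses;
}

// Fully associative cache with the given number of slots and LRU replacement (assumed policy).
std::size_t fully_associative_misses( const std::vector<std::uint32_t>& accesses, std::size_t slots = 8 )
{
  std::vector<std::uint32_t> lru; // block numbers, most recently used at the back
  std::size_t misses = 0;
  for ( std::uint32_t addr : accesses ) {
    const std::uint32_t block = addr >> 1; // everything above the offset acts as the tag
    auto it = std::find( lru.begin(), lru.end(), block );
    if ( it != lru.end() ) {
      lru.erase( it ); // hit: will be re-added as most recent
    } else {
      ++misses;
      if ( lru.size() == slots )
        lru.erase( lru.begin() ); // evict the least recently used block
    }
    lru.push_back( block );
  }
  return misses;
}
```

<p>For the access sequence 00000010, 00010010 repeated three times (six accesses), the direct-mapped version misses on all six, while the fully associative version misses only on the first two.</p>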
<h2 id="set-associative-caches">Set Associative Caches</h2>
<p><strong>N-way set-associative</strong>: divide the cache ($) into sets, each of which consists of N slots.</p>
<ul>
<li>Memory block maps to a set determined by <strong>Index</strong> field and is placed in any of the N slots of that set.</li>
<li>Call <span class="math inline">\(N\)</span> the associativity.</li>
<li>The replacement policy applies within each set.</li>
</ul>
<p>From my perspective, a set-associative cache is simply a middle ground between the previous two.</p>
<p>Fully associative requires 0 index bits. Direct-mapped requires the maximum number of index bits. Set-associative falls somewhere in between.</p>
<p>Here is a screenshot from CS61C: <img src="/images/Set-Associative.png" alt="img" /></p>
<p>As you can see, it's just the combination of Direct-Mapped and Fully-Associative.</p>
]]></content>
      <categories>
        <category>Architecture</category>
      </categories>
      <tags>
        <tag>Architecture</tag>
        <tag>CS61C</tag>
      </tags>
  </entry>
  <entry>
    <title>Direct Memory Access Mechanism</title>
    <url>/2023/08/07/Direct-Memory-Access/</url>
    <content><![CDATA[<p>DMA serves as a real solution for I/O problems.</p>
<ul>
<li>Device controller transfers data directly to/from memory without involving the processor.</li>
<li>Interrupts the processor only once per (large) page, when the transfer is complete.</li>
</ul>
<p>The incoming procedure:</p>
<ul>
<li>Receive interrupt from device</li>
<li>CPU takes interrupt, begins transfer (instructs DMA to place data at certain address)</li>
<li>Device/DMA engine handle the transfer (CPU is free to execute other things)</li>
<li>Upon completion, Device/DMA engine interrupt the CPU again</li>
</ul>
<p>The outgoing procedure:</p>
<ul>
<li>CPU decides to initiate transfer, confirms that external device is ready.</li>
<li>CPU begins transfer (tells the DMA engine/device that the data is available at a certain address)</li>
<li>Device/DMA engine handle the transfer (CPU is free to execute other things)</li>
<li>Device/DMA engine interrupt the CPU again to signal completion</li>
</ul>
<h2 id="cache-coherency">Cache-coherency</h2>
<p>DMA writes to memory directly, which can leave the cache incoherent with memory. We can treat the DMA engine as just another processor core; coherency between cores is a problem that most modern multiprocessors have already solved.</p>
<h2 id="dma-and-cpu-sharing-memory">DMA and CPU Sharing Memory</h2>
<h3 id="cycle-stealing-mode">Cycle Stealing mode</h3>
<ul>
<li>DMA Engine transfers a byte, releases control, then repeats</li>
</ul>
<h3 id="transparent-mode-maybe-best">Transparent Mode (Maybe best)</h3>
<ul>
<li>DMA transfer only occurs when CPU is not using the system bus</li>
</ul>
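<p>The difference between the two modes can be illustrated with a toy bus model. This is only a sketch under simplifying assumptions (one byte per bus cycle, a repeating CPU access pattern, cycle stealing modeled as the DMA engine taking every other cycle); the function names are hypothetical and do not describe real hardware.</p>

```python
def cycle_stealing(cpu_wants_bus, nbytes):
    """DMA takes the bus for one byte, releases it for one cycle, and repeats.

    cpu_wants_bus[i] is True when the CPU needs the bus in cycle i; the
    (non-empty) pattern repeats if the transfer outlasts it.
    Returns (cycles_needed, cpu_stall_cycles).
    """
    cycle = stalls = moved = 0
    while moved < nbytes:
        if cycle % 2 == 0:               # DMA's turn: steal this cycle
            moved += 1
            if cpu_wants_bus[cycle % len(cpu_wants_bus)]:
                stalls += 1              # the CPU had to wait for the bus
        cycle += 1
    return cycle, stalls


def transparent(cpu_wants_bus, nbytes):
    """DMA transfers only in cycles where the CPU leaves the bus idle."""
    if all(cpu_wants_bus):
        raise ValueError("CPU never releases the bus; transfer cannot progress")
    cycle = moved = 0
    while moved < nbytes:
        if not cpu_wants_bus[cycle % len(cpu_wants_bus)]:
            moved += 1
        cycle += 1
    return cycle, 0                      # the CPU is never stalled


if __name__ == "__main__":
    pattern = [True, True, False, True]  # CPU uses the bus 3 cycles out of 4
    print("cycle stealing:", cycle_stealing(pattern, 4))
    print("transparent:   ", transparent(pattern, 4))
```

<p>In this model, transparent mode never stalls the CPU but takes longer to finish when the bus is busy, while cycle stealing finishes sooner at the cost of some CPU stalls.</p>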
]]></content>
      <categories>
        <category>Architecture</category>
      </categories>
      <tags>
        <tag>Architecture</tag>
        <tag>CS61C</tag>
      </tags>
  </entry>
  <entry>
    <title>Looking into ELF Symbol Table</title>
    <url>/2023/10/26/ELF-Symbol-Table/</url>
    <content><![CDATA[<h2 id="introduction">Introduction</h2>
<p>ELF (Executable and Linking Format) is the standard binary format on Linux. As its name suggests, executables and linkable files on Linux are all stored in this format. An ELF file consists of an ELF header, followed by a program header table or a section header table, or both. The two tables describe the rest of the file.</p>
<p>The header file &lt;elf.h&gt; defines the format of ELF files and related C structures.</p>
<span id="more"></span>
<h2 id="top-view">Top-View</h2>
<figure class="highlight c"><table><tr><td class="code"><pre><span class="line">| -------------- |</span><br><span class="line">|   ELF Header   |</span><br><span class="line">| -------------- |</span><br><span class="line">| Program Header |</span><br><span class="line">|     Table      |</span><br><span class="line">| -------------- |</span><br><span class="line">| Section Header |</span><br><span class="line">|     Table      |</span><br><span class="line">| -------------- |</span><br><span class="line">|   ..........   |</span><br><span class="line">|   ..........   |</span><br><span class="line">|   ..........   |</span><br><span class="line">| -------------- |</span><br><span class="line">| Symbol  Table  |</span><br><span class="line">|     Section    |</span><br><span class="line">| -------------- |</span><br><span class="line">| String  Table  |</span><br><span class="line">|     Section    |</span><br><span class="line">| -------------- |</span><br><span class="line"></span><br></pre></td></tr></table></figure>
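<p>The ELF header is shown below. This is the 64-bit variant, <code>Elf64_Ehdr</code>; <code>Elf32_Ehdr</code> has the same fields in the same order with <code>Elf32_*</code> types, and the other structures in this post are the 32-bit ones:</p>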
<figure class="highlight c"><table><tr><td class="code"><pre><span class="line"><span class="keyword">typedef</span> <span class="class"><span class="keyword">struct</span></span></span><br><span class="line"><span class="class">&#123;</span></span><br><span class="line">  <span class="type">unsigned</span> <span class="type">char</span>     e_ident[EI_NIDENT];      <span class="comment">/* Magic number and other info */</span></span><br><span class="line">  Elf64_Half        e_type;                  <span class="comment">/* Object file type */</span></span><br><span class="line">  Elf64_Half        e_machine;               <span class="comment">/* Architecture */</span></span><br><span class="line">  Elf64_Word        e_version;               <span class="comment">/* Object file version */</span></span><br><span class="line">  Elf64_Addr        e_entry;                 <span class="comment">/* Entry point virtual address */</span></span><br><span class="line">  Elf64_Off         e_phoff;                 <span class="comment">/* Program header table file offset */</span></span><br><span class="line">  Elf64_Off         e_shoff;                 <span class="comment">/* Section header table file offset */</span></span><br><span class="line">  Elf64_Word        e_flags;                 <span class="comment">/* Processor-specific flags */</span></span><br><span class="line">  Elf64_Half        e_ehsize;                <span class="comment">/* ELF header size in bytes */</span></span><br><span class="line">  Elf64_Half        e_phentsize;             <span class="comment">/* Program header table entry size */</span></span><br><span class="line">  Elf64_Half        e_phnum;                 <span class="comment">/* Program header table entry count */</span></span><br><span class="line">  Elf64_Half        e_shentsize;             <span class="comment">/* Section header table entry size */</span></span><br><span class="line">  Elf64_Half        e_shnum;                 
<span class="comment">/* Section header table entry count */</span></span><br><span class="line">  Elf64_Half        e_shstrndx;              <span class="comment">/* Section header string table index */</span></span><br><span class="line">&#125; Elf64_Ehdr;</span><br></pre></td></tr></table></figure>
<p><strong>e_shoff</strong> defines the offset of the <strong>section header table</strong> from the <strong>beginning of the file</strong>. The section header table consists of consecutive section header entries.<br />
<strong>e_phoff</strong> defines the offset of the <strong>program header table</strong> from the <strong>beginning of the file</strong>.</p>
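<p>To make the role of these offsets concrete, here is a minimal sketch that decodes the header fields and computes where each section header entry lives. It assumes a little-endian ELF64 file; the helper names <code>parse_elf64_header</code> and <code>section_header_offsets</code> are made up for this example, and the ELF extension for very large section counts is ignored.</p>

```python
import struct
from collections import namedtuple

# Field layout of Elf64_Ehdr for a little-endian file ("<" disables padding).
ELF64_EHDR = struct.Struct("<16sHHIQQQIHHHHHH")
Ehdr = namedtuple(
    "Ehdr",
    "e_ident e_type e_machine e_version e_entry e_phoff e_shoff e_flags "
    "e_ehsize e_phentsize e_phnum e_shentsize e_shnum e_shstrndx",
)


def parse_elf64_header(data):
    """Decode the first 64 bytes of a little-endian ELF64 file."""
    hdr = Ehdr._make(ELF64_EHDR.unpack_from(data, 0))
    if hdr.e_ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_ident[4] is EI_CLASS (2 = ELFCLASS64), e_ident[5] is EI_DATA (1 = LSB).
    if hdr.e_ident[4] != 2 or hdr.e_ident[5] != 1:
        raise ValueError("only ELFCLASS64 / ELFDATA2LSB is handled in this sketch")
    return hdr


def section_header_offsets(hdr):
    """File offset of every entry in the section header table."""
    return [hdr.e_shoff + i * hdr.e_shentsize for i in range(hdr.e_shnum)]
```

<p>A typical use would be to read a binary with <code>open(path, "rb").read()</code>, pass the bytes to <code>parse_elf64_header</code>, and then seek to each offset returned by <code>section_header_offsets</code> to decode the section headers.</p>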
<h3 id="section-header">Section Header</h3>
<p>A file's section header table lets one locate all the file's sections. From <strong>e_shoff</strong> we can reach the table of section headers. And <strong>e_shnum</strong> holds the number of entries the section header table contains.</p>
<p>A section header table index is a subscript into this array. Some section header table indices are reserved: the initial entry and the indices between <strong>SHN_LORESERVE</strong> and <strong>SHN_HIRESERVE</strong>. The initial entry is used in ELF extensions for <strong>e_phnum</strong>, <strong>e_shnum</strong>, and <strong>e_shstrndx</strong>; in other cases, each field in the initial entry is set to zero. An object file does not have sections for these special indices.</p>
<p>For details about these special indices, see also <code>man 5 elf</code>.</p>
<p>The section header has the following structure:</p>
<figure class="highlight c"><table><tr><td class="code"><pre><span class="line"><span class="keyword">typedef</span> <span class="class"><span class="keyword">struct</span></span></span><br><span class="line"><span class="class">&#123;</span></span><br><span class="line">  Elf32_Word    sh_name;        <span class="comment">/* Section name (string tbl index) */</span></span><br><span class="line">  Elf32_Word    sh_type;        <span class="comment">/* Section type */</span></span><br><span class="line">  Elf32_Word    sh_flags;       <span class="comment">/* Section flags */</span></span><br><span class="line">  Elf32_Addr    sh_addr;        <span class="comment">/* Section virtual addr at execution */</span></span><br><span class="line">  Elf32_Off     sh_offset;      <span class="comment">/* Section file offset */</span></span><br><span class="line">  Elf32_Word    sh_size;        <span class="comment">/* Section size in bytes */</span></span><br><span class="line">  Elf32_Word    sh_link;        <span class="comment">/* Link to another section */</span></span><br><span class="line">  Elf32_Word    sh_info;        <span class="comment">/* Additional section information */</span></span><br><span class="line">  Elf32_Word    sh_addralign;   <span class="comment">/* Section alignment */</span></span><br><span class="line">  Elf32_Word    sh_entsize;     <span class="comment">/* Entry size if section holds table */</span></span><br><span class="line">&#125; Elf32_Shdr;</span><br></pre></td></tr></table></figure>
<p><strong>sh_name</strong>: the <em>index</em> of the <em>section name</em> in the <em>Section Header String Table</em>.<br />
<strong>sh_type</strong>: the section type. The values I'm interested in are:</p>
<ul>
<li><strong>SHT_NULL</strong>: Marks the section header as inactive.</li>
<li><strong>SHT_SYMTAB</strong>: Symbol Table, for link editing and dynamic linking.</li>
<li><strong>SHT_DYNSYM</strong>: Dynamic Symbol Table, holds a minimal set of dynamic linking symbols.</li>
<li><strong>SHT_STRTAB</strong>: String Table. An object file may have multiple string sections.</li>
</ul>
<p><strong>sh_offset</strong>: works like the offsets above, giving the offset of the section from the beginning of the file.<br />
<strong>sh_link</strong>: This member holds a section header table index link, whose interpretation depends on the section type. For a symbol table, it's the section index of the String Table section (holding the <strong>names</strong> of the symbols).</p>
<h3 id="elf-symbol-table">ELF Symbol Table</h3>
<p>An ELF Symbol Table consists of <strong>consecutive</strong> entries.<br />
Each symbol table entry has the following structure:</p>
<figure class="highlight c"><table><tr><td class="code"><pre><span class="line"><span class="keyword">typedef</span> <span class="class"><span class="keyword">struct</span></span></span><br><span class="line"><span class="class">&#123;</span></span><br><span class="line">  Elf32_Word       st_name;        <span class="comment">/* Symbol name (string tbl index) */</span></span><br><span class="line">  Elf32_Addr       st_value;       <span class="comment">/* Symbol value */</span></span><br><span class="line">  Elf32_Word       st_size;        <span class="comment">/* Symbol size */</span></span><br><span class="line">  <span class="type">unsigned</span> <span class="type">char</span>    st_info;        <span class="comment">/* Symbol type and binding */</span></span><br><span class="line">  <span class="type">unsigned</span> <span class="type">char</span>    st_other;       <span class="comment">/* Symbol visibility */</span></span><br><span class="line">  Elf32_Section    st_shndx;       <span class="comment">/* Section index */</span></span><br><span class="line">&#125; Elf32_Sym;</span><br></pre></td></tr></table></figure>
<p>As the comment shows, <strong>st_name</strong> is an index into the <em>String Table</em>. The section index of that <em>String Table</em> is held in the <strong>sh_link</strong> field of the symbol table's section header. Combining both, we can easily get the function/variable name.</p>
<p><strong>st_info</strong>: consists of two fields, Bind and Type. The type can be extracted with <code>ELF32_ST_TYPE(info)</code>.</p>
<p>Symbol types mainly include:</p>
<ul>
<li><strong>STT_OBJECT</strong>: A data object. (Such as a <em>C</em> variable)</li>
<li><strong>STT_FUNC</strong>: A function or other executable code.</li>
<li><strong>STT_SECTION</strong>: A section, for relocation.</li>
<li><strong>STT_FILE</strong>: The name of the source file.</li>
</ul>
<p>Symbol bindings, which can be extracted with <code>ELF32_ST_BIND(info)</code>, mainly include:</p>
<ul>
<li><strong>STB_LOCAL</strong>: Local symbols are not visible outside the object file containing their definition. Local symbols of the same name may exist in multiple files without interfering with each other.</li>
<li><strong>STB_GLOBAL</strong>: Global symbols are visible to all object files being combined. One file's definition of a global symbol will satisfy another file's undefined reference to the same symbol.</li>
<li><strong>STB_WEAK</strong>: Weak symbols resemble global symbols, but their definitions have lower precedence.</li>
</ul>
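<p>As a small illustration of how <strong>st_info</strong> packs both fields into one byte, the sketch below mirrors the <code>ELF32_ST_BIND</code> and <code>ELF32_ST_TYPE</code> macros from elf.h (binding in the upper four bits, type in the lower four). The Python function names and the lookup tables are my own and only cover the values listed above.</p>

```python
# Numeric values as defined in elf.h; only the ones discussed above are listed.
STT_NAMES = {0: "STT_NOTYPE", 1: "STT_OBJECT", 2: "STT_FUNC",
             3: "STT_SECTION", 4: "STT_FILE"}
STB_NAMES = {0: "STB_LOCAL", 1: "STB_GLOBAL", 2: "STB_WEAK"}


def st_bind(info):
    """Equivalent of ELF32_ST_BIND: the upper four bits of st_info."""
    return (info & 0xFF) >> 4


def st_type(info):
    """Equivalent of ELF32_ST_TYPE: the lower four bits of st_info."""
    return info & 0x0F


def describe(info):
    """Human-readable (binding, type) pair; unknown values are shown as numbers."""
    b, t = st_bind(info), st_type(info)
    return STB_NAMES.get(b, str(b)), STT_NAMES.get(t, str(t))


if __name__ == "__main__":
    print(describe(0x12))   # a global function
    print(describe(0x01))   # a local data object
```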
<h3 id="how-to-find-string-in-string-table">How to Find String in String Table?</h3>
<p>A String Table is a sequence of null-terminated strings stored back to back. An index into the String Table is the byte offset at which the string starts.</p>
<p>For example, a String Table may look like this:</p>
<figure class="highlight c"><table><tr><td class="code"><pre><span class="line"><span class="string">&quot;\0hello\0world\0xxxxxxxxx&quot;</span></span><br></pre></td></tr></table></figure>
<p>The first byte (index 0) is always a null byte, so index 0 denotes the empty string. Index 1 gives "hello" and index 7 gives "world". Looking up a name therefore means reading from the given offset up to the next null byte, without traversing the strings before it. You can also load the whole string table into memory once to speed up the rest of the ELF analysis.</p>
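<p>The lookup can be sketched in a few lines. This sketch follows the ELF specification, where a string table index is a byte offset into the section; the function name <code>strtab_lookup</code> is hypothetical, and the table is the example string shown above with one more terminator appended.</p>

```python
def strtab_lookup(strtab, index):
    """Return the null-terminated string starting at byte offset `index`."""
    if not 0 <= index < len(strtab):
        raise IndexError("string table index out of range")
    end = strtab.find(b"\x00", index)
    if end == -1:                 # malformed table: last string not terminated
        end = len(strtab)
    return strtab[index:end].decode("utf-8", errors="replace")


strtab = b"\x00hello\x00world\x00"
print(repr(strtab_lookup(strtab, 0)))   # ''
print(repr(strtab_lookup(strtab, 1)))   # 'hello'
print(repr(strtab_lookup(strtab, 7)))   # 'world'
print(repr(strtab_lookup(strtab, 3)))   # 'llo' (an index may point mid-string)
```

<p>Because the index is an offset, a lookup costs one scan for the terminator, and several names may share the tail of one stored string.</p>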
<h3 id="some-tools">Some Tools</h3>
<p><em>readelf</em> displays the contents of an ELF file; its options select which part to show.</p>
<p>A short cheatsheet:</p>
<figure class="highlight bash"><table><tr><td class="code"><pre><span class="line">readelf -h <span class="comment"># show elf header</span></span><br><span class="line">readelf -l <span class="comment"># show program headers, or segments</span></span><br><span class="line">readelf -S <span class="comment"># show section headers</span></span><br><span class="line">readelf -g <span class="comment"># show section groups</span></span><br><span class="line">readelf -s <span class="comment"># show symbols</span></span><br></pre></td></tr></table></figure>
<h2 id="reference">Reference</h2>
<ul>
<li><strong><em>Linux ELF Manual</em></strong></li>
</ul>
]]></content>
      <categories>
        <category>OS</category>
      </categories>
      <tags>
        <tag>ELF</tag>
        <tag>Linux</tag>
      </tags>
  </entry>
  <entry>
    <title>How to debug LLVM ?</title>
    <url>/2023/10/17/How-to-debug-LLVM/</url>
    <content><![CDATA[<h2 id="abstract">Abstract</h2>
<p>Debugging remains a significant topic in software engineering. In system software such as a <em>compiler</em>, it is especially difficult to pinpoint the root cause of a problem and fix it.<br />
Based on my recent work on <em>LLVM</em>, I'd like to share some experience here.</p>
<span id="more"></span>
<h2 id="classification-of-bugs">Classification of Bugs</h2>
<p>Compiler bugs can be mainly classified into <em>Crash</em>, <em>Mis-compilation</em>, and <em>Missed Optimization</em>.</p>
<p>For example, suppose a C file triggers one of these bugs.</p>
<p>A crash is the easiest to pinpoint: clang dumps a stack trace, from which we can tell whether it is a frontend, middle-end, or backend problem.</p>
<p>If someone instead reports a mis-compiled assembly file, we have to reduce the input first (use <em>llvm-reduce</em> if it is LLVM IR) and then validate each stage of the compilation: the AST, the LLVM IR after each pass, and the assembly after every backend step. In this way, we can pinpoint which module caused the mis-compilation.</p>
<p>Missed optimizations are handled much like mis-compilations. However, it is harder to decide whether a report is a <em>real, helpful</em> missed optimization. Some kinds of missed optimizations bring no real-world improvement, and fuzzers often generate such cases:</p>
<ul>
<li><p>Too large IR. For this kind of IR, passes like CSE, GVN and DSE only fold it partially, because of cost and compile-time limits.</p></li>
<li><p>No real motivation. Optimizations in LLVM are designed mostly for real-world applications. For this reason, nonsensical missed cases are not considered at all unless they turn out to be a recurring pattern.</p></li>
<li><p>Hard to debug. Complex testcases must be reduced before the responsible module can be located precisely.</p></li>
<li><p>Won't fix. Optimal code generation is undecidable in general, so there are always cases that a compiler cannot handle, such as eliminating every common subexpression or simplifying every expression. Most optimizations in LLVM are <strong>heuristic</strong> or based on <strong>experience</strong>, which means LLVM cannot handle all cases.</p></li>
</ul>
<h2 id="some-tools">Some Tools</h2>
<ul>
<li><p><em>llc/opt --print-before=[crash pass] [ir]</em>: Dumps the IR right before the pass that crashes. The dump goes to stderr, so use, for example, <code>2&gt; dump.txt</code> to write it to a file.</p></li>
<li><p><em>opt -O2 -print-before-all / opt -O2 -print-after-all</em>: Dumps the IR before/after every pass.</p></li>
<li><p><em>llvm-reduce [ir] --test=test.sh</em>: <em>llvm-reduce</em> is an IR reduction tool based on <em>delta debugging</em>. It keeps a reduction as long as <em>test.sh</em> returns 0 (0 means the input is still interesting), which makes the IR easier to analyze.</p></li>
</ul>
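<p>To show what "reduce while the test stays interesting" means, here is a simplified sketch in the spirit of delta debugging. It works on a plain list of lines and is not how <em>llvm-reduce</em> is actually implemented; the function <code>reduce_lines</code> and the predicate are hypothetical, with the predicate standing in for <em>test.sh</em> returning 0.</p>

```python
def reduce_lines(lines, is_interesting):
    """Greedy chunk-removal reducer in the spirit of delta debugging.

    `is_interesting(candidate)` plays the role of test.sh returning 0.
    Returns a smaller list that is still interesting.
    """
    if not is_interesting(lines):
        raise ValueError("the starting input must be interesting")
    chunk = max(len(lines) // 2, 1)
    while chunk >= 1:
        i = 0
        while i < len(lines):
            candidate = lines[:i] + lines[i + chunk:]
            if is_interesting(candidate):
                lines = candidate        # keep the removal, retry at same index
            else:
                i += chunk               # this chunk is needed, move on
        chunk //= 2                      # try finer-grained removals next
    return lines


if __name__ == "__main__":
    program = ["a", "b", "crash", "c", "d", "trigger", "e"]
    # Pretend the bug needs both of these lines to reproduce.
    still_crashes = lambda ls: "crash" in ls and "trigger" in ls
    print(reduce_lines(program, still_crashes))
```

<p>The key design point carries over to the real tool: the reducer knows nothing about the bug itself, so the quality of the result depends entirely on how precisely the interestingness test describes the failure.</p>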
<hr />
<h2 id="my-workflow">My Workflow</h2>
<p>When I come across an IR file that crashes clang/opt, I first look at the stack trace. It usually indicates which function or class exposes the error.</p>
<!-- picture1 -->
<p>Here we assume the error is exposed by an optimization pass <em>op1</em>. Then we run:</p>
<figure class="highlight sh"><table><tr><td class="code"><pre><span class="line">opt --print-before=op1 -O2 -S [ir-file] 2&gt; [ir-before-op1.ll]</span><br></pre></td></tr></table></figure>
<p>Or if it's a C file, we enter:</p>
<figure class="highlight bash"><table><tr><td class="code"><pre><span class="line">clang -mllvm -print-before=instcombine xxx.c -O2 -g0 2&gt; xxx.ll</span><br></pre></td></tr></table></figure>
<p><code>O2</code> can be replaced by whatever option triggers the problem. The dumped IR serves as a reproducer, which we then feed to <em>op1</em> alone:</p>
<figure class="highlight sh"><table><tr><td class="code"><pre><span class="line">opt --passes=op1 -S [ir-file]</span><br></pre></td></tr></table></figure>
<p>If the problem reproduces, we move on to reducing it.</p>
<p>Write a <code>test.sh</code>:</p>
<figure class="highlight bash"><table><tr><td class="code"><pre><span class="line"><span class="meta">#!/bin/bash</span></span><br><span class="line">opt --passes=op1 -S <span class="variable">$1</span> | grep <span class="string">&quot;something related to error&quot;</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># For missed optimization, write a testcase and check it with FileCheck</span></span><br><span class="line"><span class="comment"># FileCheck $1 | grep &quot;something related&quot;</span></span><br></pre></td></tr></table></figure>
<p>And launch <em>llvm-reduce</em>:</p>
<figure class="highlight bash"><table><tr><td class="code"><pre><span class="line">llvm-reduce [ir-file] --<span class="built_in">test</span>=test.sh</span><br></pre></td></tr></table></figure>
<p>Finally we get a <em>reduced.ll</em>, which makes the problem much easier to analyze.</p>
<h2 id="how-to-get-command-line-arguments-and-ir-dump-file-when-bootstrapping">How to Get Command-Line Arguments and IR-dump File when bootstrapping</h2>
<p>Refer to this <a href="https://discourse.llvm.org/t/how-to-reproduce-the-bug-and-get-the-exact-ir-before-crash-during-bootstrapping/74032">Discourse thread</a>.</p>
<h2 id="reference">Reference</h2>
<ul>
<li><a href="https://clangbuiltlinux.github.io/llvm-dev-conf-2020/nick/debugging_llvm.html">Debugging llvm</a></li>
<li><a href="https://www.npopov.com/2023/10/22/How-to-reduce-LLVM-crashes.html">Nikic's Blog</a></li>
</ul>
]]></content>
      <categories>
        <category>LLVM</category>
      </categories>
      <tags>
        <tag>Compiler</tag>
        <tag>LLVM</tag>
        <tag>OpenSource</tag>
      </tags>
  </entry>
  <entry>
    <title>Introduction to GPU</title>
    <url>/2023/10/31/Introduction-To-GPU/</url>
    <content><![CDATA[<h2 id="introduction">Introduction</h2>
<p>The GPU (Graphics Processing Unit) was initially designed to accelerate image rendering, for example in video games. Thanks to its high performance in parallel computation, it has become a great processor for accelerating DL/ML training.</p>
<span id="more"></span>
<p>Unlike a CPU, a GPU consists of numerous computational units, a long pipeline, and its own video memory. This design gives it an advantage in parallel computation and a disadvantage in handling complex control logic.</p>
<p><img src="/images/CPU-GPU.webp" /></p>
<h2 id="computational-unitscores">Computational units(cores)</h2>
<p>Overall, the computational units of a CPU are fast but few, while those of a GPU are slow but numerous. The speed of a CPU comes from its high clock frequency and smart execution, such as out-of-order execution and branch prediction.</p>
<p>A GPU core, by contrast, mainly handles simple, regular work such as <code>fmuladd</code> instructions. Besides scalar floating-point values, modern GPUs can also operate on more complicated types such as tensors (<em>tensor cores</em>).</p>
<p>SIMD exists not only in CPUs but also in GPUs. Applying the same operation to different data makes the GPU fast at parallel work such as matrix multiplication.</p>
<h2 id="memory">Memory</h2>
<p>The memory of a GPU is much smaller than that of a CPU, and the caches in a GPU differ somewhat from the L1 and L2 caches in a CPU.</p>
<p>Operations such as <code>reduce</code> in parallel computation require multiple cores to share memory. But it is hard and expensive for thousands of cores to share one memory segment, so the cores are divided into multiple groups, called <em>Streaming Multiprocessors</em> (SMs).</p>
<p>Each SM contains INT32, FP32, and other types of cores. So how do they cooperate?</p>
<p>In TU102, the four processing blocks of each SM share one segment of L1 cache, and all cores share the L2 cache. As in a CPU, after missing in L1, a core tries L2 and then GMEM (global memory). Note that how L1 is shared is controlled by software (the programmer), not hardware, whereas L2 and GMEM are managed by hardware. Besides, cores can also share data through registers.</p>
<p>The basic idea is that every thread holds registers for its temporary results, and each register can only be accessed by its owning thread (or by threads in the same warp/group).</p>
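<p>The grouping idea behind <code>reduce</code> can be sketched as a two-level reduction: each group reduces its own slice in its shared memory, and only the per-group partial results are combined afterwards. This is a sequential Python model for illustration only; the names <code>block_reduce</code> and <code>grid_reduce</code> and the block size of 4 are assumptions, and a real GPU kernel would run the inner loop as parallel threads separated by barriers.</p>

```python
def block_reduce(shared):
    """Tree reduction inside one block; `shared` models its shared memory."""
    shared = list(shared)
    n = len(shared)
    stride = 1
    while stride < n:
        # On a GPU, every iteration of this inner loop would be one thread,
        # all running in the same step and followed by a barrier.
        for i in range(0, n - stride, 2 * stride):
            shared[i] += shared[i + stride]
        stride *= 2
    return shared[0] if shared else 0


def grid_reduce(data, block_size=4):
    """Split `data` into blocks, reduce each block, then reduce the partials."""
    partials = [block_reduce(data[i:i + block_size])
                for i in range(0, len(data), block_size)]
    return block_reduce(partials)


if __name__ == "__main__":
    print(grid_reduce(list(range(1, 11))))   # sum of 1..10
```

<p>Only the cores inside one block touch the same shared segment, which is why no single memory segment has to serve thousands of cores at once.</p>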
<h2 id="references">References</h2>
<p><a href="https://zhuanlan.zhihu.com/p/598173226">Clarence's Zhihu</a><br />
<a href="https://medium.com/codex/understanding-the-architecture-of-a-gpu-d5d2d2e8978b">Understanding the architecture of a GPU</a></p>
]]></content>
      <categories>
        <category>Architecture</category>
      </categories>
      <tags>
        <tag>Architecture</tag>
        <tag>GPU</tag>
        <tag>HPC</tag>
      </tags>
  </entry>
  <entry>
    <title>My First Contribution to LLVM</title>
    <url>/2023/06/30/LLVM-First-Contribution/</url>
    <content><![CDATA[<blockquote>
<p>Update (2023-09-21): LLVM patch review has fully migrated to GitHub PRs, so the Phabricator steps described in this post are <strong>outdated</strong>.</p>
</blockquote>
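<h2 id="为什么要参与llvm的开源">Why Contribute to LLVM?</h2>
<p>I have long been interested in compiler backends, and I once used <strong>LLVM</strong> as the backend for AOT compilation of my own language. That made me curious about LLVM's internals, so I decided to learn about <strong>LLVM</strong>, and about how compiler optimizations work, by contributing code to it.</p>
<p>I followed an article by an LLVM member: <a href="https://developers.redhat.com/articles/2022/12/20/how-contribute-llvm#implementing_the_transform">How to contribute to llvm?</a></p>
<p>Below is my whole process, from building LLVM to submitting a patch.</p>
<h3 id="编译">Building</h3>
<p>To contribute code to LLVM, you first need to be able to build it locally.</p>
<p>So we first clone the LLVM git repository, or fork <strong>llvm-project</strong> and clone the fork. The two differ little; following the usual GitHub open-source habit, I chose the latter.</p>
<p>After cloning, we start building. Note that, because of limited build speed, a <strong>Release</strong> build is generally recommended; otherwise a single compile-and-link cycle can take several hours. Here is a cmake template:</p>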
<figure class="highlight shell"><table><tr><td class="code"><pre><span class="line">cmake -GNinja -Bbuild -Hllvm \</span><br><span class="line">    -DLLVM_ENABLE_PROJECTS=&quot;clang&quot; \</span><br><span class="line">    -DLLVM_TARGETS_TO_BUILD=&quot;all&quot; \</span><br><span class="line">    -DCMAKE_BUILD_TYPE=Release \</span><br><span class="line">    -DLLVM_ENABLE_ASSERTIONS=true \</span><br><span class="line">    -DLLVM_CCACHE_BUILD=true \</span><br><span class="line">    -DLLVM_USE_LINKER=lld</span><br></pre></td></tr></table></figure>
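<p>Debug output can still be enabled with the <code>-debug</code> flag, and you can print from the relevant place in the code with <code>errs() &lt;&lt; something</code>.</p>
<p>Ninja builds relatively fast, so here are the shell commands for building and testing:</p>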
<figure class="highlight shell"><table><tr><td class="code"><pre><span class="line"><span class="meta prompt_"># </span><span class="language-bash">Build LLVM</span></span><br><span class="line">ninja -Cbuild</span><br><span class="line"><span class="meta prompt_"></span></span><br><span class="line"><span class="meta prompt_"># </span><span class="language-bash">Run all LLVM tests</span></span><br><span class="line">ninja -Cbuild check-llvm</span><br><span class="line"><span class="meta prompt_"></span></span><br><span class="line"><span class="meta prompt_"># </span><span class="language-bash">Run tests <span class="keyword">in</span> a specific directory.</span></span><br><span class="line"><span class="meta prompt_"># </span><span class="language-bash">-v will <span class="built_in">print</span> additional information <span class="keyword">for</span> failures.</span></span><br><span class="line">build/bin/llvm-lit -v llvm/test/Transforms/InstCombine</span><br></pre></td></tr></table></figure>
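<h3 id="选issue">Picking an Issue</h3>
<p>As a newcomer to LLVM, I could hardly take on something big right away, so I picked a simple task. <span id="more"></span> llvm-project includes many subprojects: LLVM itself, the Clang compiler, the LLD linker, the libc++ standard library, and many others. Even within LLVM itself there are different areas, mainly the middle-end optimizer working on the LLVM intermediate representation (IR), and the backend that converts IR into machine code.</p>
<p>I know the middle end better, and middle-end optimization code has many corner cases that can be fixed with just a few lines of code. So this post focuses on <strong>InstCombine</strong>, a middle-end IR optimization, and the issue I picked is an <a href="https://github.com/llvm/llvm-project/issues?q=is%3Aopen+is%3Aissue+label%3Allvm%3Ainstcombine">InstCombine Issue</a>. Of course, LLVM has many other approachable issues, such as <a href="https://github.com/llvm/llvm-project/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22">good first issues</a>, as well as issues in Clang, Flang, clang-tidy, clang-format, and other projects.</p>
<p>Here I will walk through one of my LLVM contributions: <a href="https://reviews.llvm.org/D154126/new/">D154126</a></p>
<p>The related <a href="https://github.com/llvm/llvm-project/issues/62586">Issue</a>.</p>
<h3 id="问题分析">Problem Analysis</h3>
<p>The problem described in the issue is that the optimization of <code>(a &gt; b) | (a &lt; b)</code> fails when <code>b == 0</code>.</p>
<p>In general, <code>(a &gt; b) | (a &lt; b)</code> is folded into <code>ZExt(a != b)</code>. The corresponding LLVM IR is:</p>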
<figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">define i32 @src(i32 %A, i32 %B) &#123;</span><br><span class="line">%1:</span><br><span class="line">  %2 = icmp sgt i32 %A, %B</span><br><span class="line">  %3 = zext i1 %2 to i32</span><br><span class="line">  %4 = icmp slt i32 %A, %B</span><br><span class="line">  %5 = zext i1 %4 to i32</span><br><span class="line">  %6 = or i32 %3, %5</span><br><span class="line">  ret i32 %6</span><br><span class="line">&#125;</span><br><span class="line">=&gt;</span><br><span class="line">define i32 @tgt(i32 %A, i32 %B) &#123;</span><br><span class="line">%1:</span><br><span class="line">  %2 = icmp ne i32 %A, %B</span><br><span class="line">  %3 = zext i1 %2 to i32</span><br><span class="line">  ret i32 %3</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>But for the <code>b == 0</code> case, what InstCombine produces is: <figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">define i32 @src(i32 %A) &#123;</span><br><span class="line">%1:</span><br><span class="line">  %2 = icmp sgt i32 %A, 0</span><br><span class="line">  %3 = zext i1 %2 to i32</span><br><span class="line">  %4 = lshr i32 %A, 31</span><br><span class="line">  %5 = or i32 %4, %3</span><br><span class="line">  ret i32 %5</span><br><span class="line">&#125;</span><br><span class="line">=&gt;</span><br><span class="line">define i32 @tgt(i32 %A) &#123;</span><br><span class="line">%1:</span><br><span class="line">  %2 = icmp sgt i32 %A, 0</span><br><span class="line">  %3 = zext i1 %2 to i32 </span><br><span class="line">  %4 = lshr i32 %A, 31</span><br><span class="line">  %5 = or i32 %3, %4</span><br><span class="line">  ret i32 %5</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure> In other words, in this case <code>ZExt(A &lt; 0)</code> has been turned into <code>A &gt;&gt; 31</code> (a logical shift right), which breaks the <strong>pattern matching</strong> for <code>A &lt; B | A &gt; B</code>.</p>
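<p>Before analyzing how to fix this, let's look at the special rules for submitting patches to LLVM's middle-end optimizations.</p>
<p>An LLVM patch consists of two parts. The first part contains the <strong>tests</strong> added before the <strong>implementation</strong>, which show the missed optimization. The second part contains the <strong>implementation</strong> together with the <strong>tests</strong> as updated after applying it. Splitting the patch this way has two benefits:</p>
<ol type="1">
<li>It is easy to see the effect of your optimization by comparing the tests before and after.</li>
<li>The tests can be submitted as a separate patch, which is a simple way to increase LLVM's test coverage.</li>
</ol>
<p>In addition, before submitting the patch, you have to prove that your optimization is correct.</p>
<h4 id="证明transform的正确性">Proving the Transform Correct</h4>
<p>Generally, we use <a href="https://github.com/AliveToolkit/alive2">alive2</a> to verify the correctness of a transform between <strong>LLVM IR</strong> functions; there is also an <a href="https://alive2.llvm.org/ce/">online</a> version. The alive2 result for this issue is:</p>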
<figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">define i32 @src(i32 %0) &#123;</span><br><span class="line">%1:</span><br><span class="line">  %2 = icmp sgt i32 %0, 0</span><br><span class="line">  %3 = zext i1 %2 to i32</span><br><span class="line">  %4 = lshr i32 %0, 31</span><br><span class="line">  %5 = or i32 %4, %3</span><br><span class="line">  ret i32 %5</span><br><span class="line">&#125;</span><br><span class="line">=&gt;</span><br><span class="line">define i32 @tgt(i32 %0) &#123;</span><br><span class="line">%1:</span><br><span class="line">  %2 = icmp ne i32 %0, 0</span><br><span class="line">  %3 = zext i1 %2 to i32</span><br><span class="line">  ret i32 %3</span><br><span class="line">&#125;</span><br><span class="line">Transformation seems to be correct!</span><br></pre></td></tr></table></figure>
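<p>Although <strong>alive2</strong> is a very important tool for ensuring the correctness of LLVM transforms, note that it may produce <strong>false negative</strong> results (that is, it sometimes claims an incorrect transform is correct). This usually happens in the context of loop optimizations and generally does not affect <strong>InstCombine</strong> optimizations.</p>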
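<p>For a transform this small, the equivalence can also be sanity-checked by brute force at a narrow bit width. The sketch below models the <code>@src</code> and <code>@tgt</code> functions shown above on 8-bit integers; it is only an illustration with hypothetical helper names, not a replacement for alive2, and it says nothing about poison or undef semantics.</p>

```python
def to_signed(x, bits):
    """Interpret the low `bits` bits of x as a two's-complement integer."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def src(a, bits, shift=None):
    """(a >>u shift) | zext(a >s 0); shift defaults to bits - 1 as in @src."""
    if shift is None:
        shift = bits - 1
    a &= (1 << bits) - 1
    return (a >> shift) | int(to_signed(a, bits) > 0)


def tgt(a, bits):
    """zext(a != 0), as in @tgt."""
    return int((a & ((1 << bits) - 1)) != 0)


def check(bits=8, shift=None):
    """Exhaustively compare src and tgt; return the first counterexample or None."""
    for a in range(1 << bits):
        if src(a, bits, shift) != tgt(a, bits):
            return a
    return None


if __name__ == "__main__":
    print(check(8))            # no counterexample for the real transform
    print(check(8, shift=6))   # a wrong shift amount is caught
```

<p>The second call shifts by one bit less than the sign-bit position, so the fold is no longer valid and a counterexample is reported.</p>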
<h4 id="测试">Tests</h4>
<p>Before writing the <strong>implementation</strong>, we need to build all the test cases.</p>
<p>First, the basic test case for a successful transform: <figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">define i32 @icmp_slt_0_or_icmp_sgt_0_i32(i32 %x) &#123;</span><br><span class="line">; CHECK-LABEL: @icmp_slt_0_or_icmp_sgt_0_i32(</span><br><span class="line">; CHECK-NEXT:    [[B:%.*]] = icmp sgt i32 [[X:%.*]], 0</span><br><span class="line">; CHECK-NEXT:    [[X_LOBIT:%.*]] = lshr i32 [[X]], 31</span><br><span class="line">; CHECK-NEXT:    [[D:%.*]] = zext i1 [[B]] to i32</span><br><span class="line">; CHECK-NEXT:    [[E:%.*]] = or i32 [[X_LOBIT]], [[D]]</span><br><span class="line">; CHECK-NEXT:    ret i32 [[E]]</span><br><span class="line">;</span><br><span class="line">  %A = icmp slt i32 %x, 0</span><br><span class="line">  %B = icmp sgt i32 %x, 0</span><br><span class="line">  %C = zext i1 %A to i32</span><br><span class="line">  %D = zext i1 %B to i32</span><br><span class="line">  %E = or i32 %C, %D</span><br><span class="line">  ret i32 %E</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure></p>
<p>Note that <strong>CHECK-LABEL</strong> is followed by the name of the test function, and the <strong>CHECK-NEXT</strong> lines give the IR expected after the transform; if the output does not match, the test is reported as failed. The test here was generated before the optimization was implemented, so the <strong>CHECK</strong> lines naturally show unoptimized IR. You do not have to type the <strong>CHECK</strong> lines by hand; an LLVM script generates them automatically:</p>
<figure class="highlight shell"><table><tr><td class="code"><pre><span class="line">llvm/utils/update_test_checks.py --opt-bin build/bin/opt \</span><br><span class="line">    llvm/test/Transforms/InstCombine/and-or-icmps.ll</span><br></pre></td></tr></table></figure>
<p>This script runs <strong>InstCombine</strong> on every test case in <code>and-or-icmps</code> and inserts the resulting IR into <code>and-or-icmps</code> as the <strong>CHECK</strong> lines.</p>
<p>The test case above only covers the basic i32 type, so we add an i64 test as well:</p>
<figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">define i64 @icmp_slt_0_or_icmp_sgt_0_i64(i64 %x) &#123;</span><br><span class="line">  %A = icmp slt i64 %x, 0</span><br><span class="line">  %B = icmp sgt i64 %x, 0</span><br><span class="line">  %C = zext i1 %A to i64</span><br><span class="line">  %D = zext i1 %B to i64</span><br><span class="line">  %E = or i64 %C, %D</span><br><span class="line">  ret i64 %E</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>Besides, we need some negative tests (for example, changing the shift amount, or turning greater-than into less-than) to make sure our transform does not fire incorrectly. One example: <figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">define i64 @icmp_slt_0_or_icmp_sgt_0_i64_fail2(i64 %x) &#123;</span><br><span class="line">; CHECK-LABEL: @icmp_slt_0_or_icmp_sgt_0_i64_fail2(</span><br><span class="line">; CHECK-NEXT:    [[B:%.*]] = icmp sgt i64 [[X:%.*]], 0</span><br><span class="line">; CHECK-NEXT:    [[C:%.*]] = lshr i64 [[X]], 62</span><br><span class="line">; CHECK-NEXT:    [[D:%.*]] = zext i1 [[B]] to i64</span><br><span class="line">; CHECK-NEXT:    [[E:%.*]] = or i64 [[C]], [[D]]</span><br><span class="line">; CHECK-NEXT:    ret i64 [[E]]</span><br><span class="line">;</span><br><span class="line">  %B = icmp sgt i64 %x, 0</span><br><span class="line">  %C = lshr i64 %x, 62</span><br><span class="line">  %D = zext i1 %B to i64</span><br><span class="line">  %E = or i64 %C, %D</span><br><span class="line">  ret i64 %E</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure></p>
<p>Finally, we may also want vector tests, such as:</p>
<figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">define &lt;2 x i64&gt; @icmp_slt_0_or_icmp_sgt_0_i64x2(&lt;2 x i64&gt; %x) &#123;</span><br><span class="line">  %A = icmp slt &lt;2 x i64&gt; %x, &lt;i64 0,i64 0&gt;</span><br><span class="line">  %B = icmp sgt &lt;2 x i64&gt; %x, &lt;i64 0,i64 0&gt;</span><br><span class="line">  %C = zext &lt;2 x i1&gt; %A to &lt;2 x i64&gt;</span><br><span class="line">  %D = zext &lt;2 x i1&gt; %B to &lt;2 x i64&gt;</span><br><span class="line">  %E = or &lt;2 x i64&gt; %C, %D</span><br><span class="line">  ret &lt;2 x i64&gt; %E</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
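<p>Once these test cases are in place, we make a commit.</p>
<h4 id="实现">Implementation</h4>
<p>Now we reach the implementation. Before writing any code, we need to do some analysis and debugging.</p>
<p>We debug with <code>build/bin/opt -passes=instcombine -S -debug src.ll</code>, inserting print statements into various functions and using the output to locate the code responsible for the optimization.</p>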
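<p>After some investigation, we find why the case <code>b == 0</code> is not optimized: the <strong>transformZExtICmp</strong> function in <strong>InstCombineCasts.cpp</strong> rewrites <code>ZExt(a &lt; 0)</code> into the logical right shift <code>a &gt;&gt; 31</code>.</p>
<p>The function that optimizes <code>a &lt; b | a &gt; b</code>, <strong>foldAndOrOfICmpsUsingRanges</strong>, does not recognize a shift such as <code>a &gt;&gt; 31</code>, so the fold never happens. Since I am not very familiar with the order in which InstCombine applies its folds, I chose to add a match for <code>Zext(a &gt; 0) | (a &gt;&gt; 31)</code> in <strong>foldCastedBitwiseLogic</strong> and perform the optimization there. The code is as follows:</p>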
<figure class="highlight cpp"><table><tr><td class="code"><pre><span class="line"><span class="comment">// ( A &lt;&lt; (X - 1) ) | ((A &gt; 0) zext to iX)</span></span><br><span class="line"><span class="comment">// &lt;=&gt; A &lt; 0 | A &gt; 0</span></span><br><span class="line"><span class="comment">// &lt;=&gt; (A != 0) zext to iX</span></span><br><span class="line">Value *A;</span><br><span class="line">ICmpInst::Predicate Pred;</span><br><span class="line"></span><br><span class="line"><span class="keyword">auto</span> MatchOrZExtICmp = [&amp;](Value *Op0, Value *Op1) -&gt; <span class="type">bool</span> &#123;</span><br><span class="line"><span class="keyword">return</span> <span class="built_in">match</span>(Op0, <span class="built_in">m_LShr</span>(<span class="built_in">m_Value</span>(A), <span class="built_in">m_SpecificInt</span>(Op0-&gt;<span class="built_in">getType</span>()-&gt;<span class="built_in">getScalarSizeInBits</span>() - <span class="number">1</span>))) &amp;&amp;</span><br><span class="line">       <span class="built_in">match</span>(Op1, <span class="built_in">m_ZExt</span>(<span class="built_in">m_ICmp</span>(Pred, <span class="built_in">m_Specific</span>(A), <span class="built_in">m_Zero</span>())));</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (LogicOpc == Instruction::Or &amp;&amp;</span><br><span class="line">  (<span class="built_in">MatchOrZExtICmp</span>(Op0, Op1) || <span class="built_in">MatchOrZExtICmp</span>(Op1, Op0)) &amp;&amp;</span><br><span class="line">  Pred == ICmpInst::ICMP_SGT) &#123;</span><br><span class="line">  Value *Cmp =</span><br><span class="line">      Builder.<span class="built_in">CreateICmpNE</span>(A, Constant::<span class="built_in">getNullValue</span>(A-&gt;<span class="built_in">getType</span>()));</span><br><span class="line">  <span class="keyword">return</span> <span class="keyword">new</span> <span 
class="built_in">ZExtInst</span>(Cmp, A-&gt;<span class="built_in">getType</span>());</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
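<p>Here we define a lambda, <code>MatchOrZExtICmp</code>, which matches the logical right shift (<code>m_LShr</code>) together with the zext; <code>Op0</code> and <code>Op1</code> are the two operands of the <code>or</code> instruction.</p>
<p><code>match</code>, <code>m_ZExt</code> and the related functions and classes come from LLVM's <strong>PatternMatching</strong> library. <strong>PatternMatching</strong> provides a set of functions and template classes for matching specific LLVM IR patterns; <code>m_SpecificInt</code>, for example, matches a specific integer, or a vector whose elements are all that integer (a <strong>splat vector</strong>).</p>
<p>Note that <code>getScalarSizeInBits</code> returns the bit width of the integer for integer types, and the bit width of the element type for vector types.</p>
<p>With the implementation done, we need to update our test cases again to confirm the effect of the optimization, so we rerun:</p>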
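<p>To make the arithmetic behind the fold concrete, here is a minimal standalone sketch that checks the identity <code>(A lshr 31) | zext(A &gt; 0) == zext(A != 0)</code> on 32-bit integers. It uses plain C++ rather than LLVM APIs, and the names <code>beforeFold</code>, <code>afterFold</code> and <code>checkFold</code> are made up for this illustration. It is a spot check on a few values, not a proof (the proof is what alive-tv is for).</p>

```cpp
#include <cassert>
#include <cstdint>

// Left-hand side of the fold, modelled on i32:
//   (x lshr 31) | zext(icmp sgt x, 0)
std::uint32_t beforeFold(std::int32_t x) {
  std::uint32_t signBit = static_cast<std::uint32_t>(x) >> 31; // 1 iff x < 0
  std::uint32_t positive = x > 0 ? 1u : 0u;                    // 1 iff x > 0
  return signBit | positive;
}

// Right-hand side of the fold: zext(icmp ne x, 0)
std::uint32_t afterFold(std::int32_t x) {
  return x != 0 ? 1u : 0u;
}

// Spot-check the identity on boundary values and a few ordinary ones.
void checkFold() {
  const std::int32_t samples[] = {INT32_MIN, -7, -1, 0, 1, 42, INT32_MAX};
  for (std::int32_t x : samples)
    assert(beforeFold(x) == afterFold(x));
}
```

<p>The shift amount has to be exactly the bit width minus one for the shifted value to equal the sign bit, which is why the matcher builds its constant from <code>getScalarSizeInBits() - 1</code> and why the negative test with a shift of 62 on i64 must stay unoptimized.</p>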
<figure class="highlight shell"><table><tr><td class="code"><pre><span class="line">llvm/utils/update_test_checks.py --opt-bin build/bin/opt \</span><br><span class="line">    llvm/test/Transforms/InstCombine/and-or-icmps.ll</span><br></pre></td></tr></table></figure>
<p>We can now see that the <strong>CHECK</strong> lines of our positive test have changed:</p>
<figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">define i32 @icmp_slt_0_or_icmp_sgt_0_i32(i32 %x) &#123;</span><br><span class="line">; CHECK-LABEL: @icmp_slt_0_or_icmp_sgt_0_i32(</span><br><span class="line">; CHECK-NEXT:    [[TMP1:%.*]] = icmp ne i32 [[X:%.*]], 0</span><br><span class="line">; CHECK-NEXT:    [[E:%.*]] = zext i1 [[TMP1]] to i32</span><br><span class="line">; CHECK-NEXT:    ret i32 [[E]]</span><br><span class="line">;</span><br><span class="line">  %A = icmp slt i32 %x, 0</span><br><span class="line">  %B = icmp sgt i32 %x, 0</span><br><span class="line">  %C = zext i1 %A to i32</span><br><span class="line">  %D = zext i1 %B to i32</span><br><span class="line">  %E = or i32 %C, %D</span><br><span class="line">  ret i32 %E</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
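<p>The changes to the other test cases also match our expectations, so we commit again.</p>
<p>We can now move on to submitting the patch.</p>
<h3 id="提交patch">Submitting the Patch</h3>
<p>We now have two <strong>commits</strong>, and the following commands generate the patch files for the tests and the implementation.</p>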
<figure class="highlight shell"><table><tr><td class="code"><pre><span class="line">git show -U99999 HEAD^ &gt; patch_test</span><br><span class="line">git show -U99999 &gt; patch_transform</span><br></pre></td></tr></table></figure>
<p>At the time of writing, LLVM does not accept GitHub PRs; patches can only be submitted on <a href="https://reviews.llvm.org/">Phabricator</a>. So I registered a Phabricator account and uploaded my two patches separately via <a href="https://reviews.llvm.org/differential/diff/create/">Create Diff</a>.</p>
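<p>For the format of the patch title and description, refer to the article referenced at the beginning of this post. A machine translation, adapted to our case, follows:</p>
<blockquote>
<p>Choose a meaningful title and summary for the patch. For our running example, the first patch might look like this:</p>
<p>title: [InstCombine] Add tests for (A &gt; 0) | (A &lt; 0) -&gt; zext (A != 0) fold (NFC)</p>
<p>summary: Tests for an upcoming (A &gt; 0) | (A &lt; 0) -&gt; zext (A != 0) fold.</p>
<p>reviewer: (see below)</p>
<p>The second patch might look like this:</p>
<p>title: [InstCombine] Transform (A &gt; 0) | (A &lt; 0) -&gt; zext (A != 0) fold</p>
<p>summary: [InstCombine] Transform (A &gt; 0) | (A &lt; 0) -&gt; zext (A != 0) fold</p>
<p>This extends foldCastedBitwiseLogic to handle the similar cases.</p>
<p>...... your analysis ......</p>
<p>It's proved by alive-tv: <strong>link</strong></p>
<p>Depends on DNNNNNN (put the ID of the first patch here).</p>
<p>reviewer: (see below)</p>
<p>A few points are worth emphasizing:</p>
<ul>
<li>The title should start with a [Category] tag. Usually you can just use the name of the file you are modifying. For example, changes to InstCombine are usually tagged [InstCombine].</li>
<li>Patches with non-functional changes (such as test additions) usually carry an NFC tag somewhere in the title.</li>
<li>If you have any alive2 proofs, include them in the patch summary.</li>
<li>You can use "Depends on DNNNNNN" to create stacked patches. You can also add a "child revision" afterwards to achieve the same thing.</li>
</ul>
</blockquote>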
<hr />
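<p>We still need <strong>Reviewers</strong>. In LLVM, the patch author is responsible for choosing appropriate reviewers. Someone may find the patch through its title (which is why the category tag matters so much), but it is best to specify suitable reviewers from the start.</p>
<p>LLVM does have a CODE_OWNERS.txt file that lists code owners for different areas, but unfortunately this file tends to be outdated and incomplete. A better way to find reviewers is to look at the Git history of the files you are modifying and add some of the people who recently committed to them or recently reviewed diff revisions for them.</p>
<p>For InstCombine the main reviewer is spatel, but you can find several other candidates from the history (for example nikic, goldstein.w.n).</p>
<p>Once the patch is submitted, it is time to wait for review. For a change this simple, someone will usually pick it up quickly. If you get no response within a week, post a "ping" comment, and repeat once a week. For InstCombine it is quite unusual to wait weeks for a review, but it can happen if your change touches an area that nobody has really worked on for a long time. Just keep pinging.</p>
<p>Finally, once the patch is approved, reviewers will usually assume that you already have commit access and let you land the change yourself. If that is not the case, follow up with a comment such as “I don't have commit access, can you please land this for me? Please use 'Your Name <a href="mailto:your@email" class="email">your@email</a>' for the commit”. The last part is important, because Phabricator loses the patch's author information and the committer has to add it back.</p>
<p>If you plan to contribute to LLVM on any kind of regular basis, it is recommended to request commit access. The bar for this is very low, so you can ask early. The test pre-commit workflow is much more convenient if you do not have to create stacked reviews.</p>
<p>Lastly, a few words on CI: patches on Phabricator go through a “pre-merge” test run. These results can be helpful, especially if you did not run the full test suite locally. Unfortunately, the test runs are somewhat flaky, so if you see failures that have no obvious relation to your patch, you can usually ignore them.</p>
<p>Once the patch is committed, it is run on a much wider range of “buildbots”, which run tests on many different architectures and in many different configurations. These are fairly flaky too, so the same applies: if you receive a buildbot failure email that looks unrelated to your patch, there is no need to worry. If it does turn out to be your responsibility, the buildbot owners will let you know.</p>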
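<h3 id="总结">Summary</h3>
<blockquote>
<p>Summary of the referenced article (translated)</p>
</blockquote>
<p>LLVM's contribution process has some unusual aspects that differ from other open source projects. Part of that is the use of Phabricator rather than GitHub for reviews, but most of the differences center on the emphasis on correctness: from correctness proofs, through the test pre-commit workflow, to what often ends up being a very large ratio of test changes to code changes.</p>
<p>I hope this article is helpful to people who want to get into LLVM development, but I want to reiterate that you do not need to get everything “right” the first time, and people will be happy to help if you run into problems. The beginners category on Discourse and the Discord chat are good places to ask questions.</p>
<blockquote>
<p>My own summary</p>
</blockquote>
<p>Contributing to a large open source project for the first time was a special experience. Through the ongoing discussion with the reviewers, I also gained a deeper understanding of how LLVM is organized. I hope this post encourages readers to take a more active part in open source as well.</p>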
]]></content>
      <categories>
        <category>LLVM</category>
      </categories>
      <tags>
        <tag>Compiler</tag>
        <tag>LLVM</tag>
        <tag>OpenSource</tag>
      </tags>
  </entry>
  <entry>
    <title>LLVM Documentation Links</title>
    <url>/2023/10/30/LLVM-Docs/</url>
    <content><![CDATA[<p><a href="https://llvm.org/docs/LangRef.html">LangRef</a><br />
<a href="https://llvm.org/docs/LoopTerminology.html">Loop Terminology</a><br />
<a href="https://llvm.org/docs/MemorySSA.html">MemorySSA</a><br />
<a href="https://llvm.org/docs/Reference.html">Reference Guide</a><br />
<a href="https://llvm.org/docs/Passes.html">Current Passes</a><br />
<span id="more"></span></p>
]]></content>
      <categories>
        <category>LLVM</category>
      </categories>
      <tags>
        <tag>Compiler</tag>
        <tag>LLVM</tag>
        <tag>OpenSource</tag>
      </tags>
  </entry>
  <entry>
    <title>LLVM Source Analysis: EarlyCSE</title>
    <url>/2023/10/08/LLVM-Source-Analysis-EarlyCSE/</url>
    <content><![CDATA[<h2 id="abstract">Abstract</h2>
<p><strong>Common sub-expression elimination (CSE)</strong> is an important compiler optimization, closely related to partial redundancy elimination.<br />
CSE eliminates expressions whose components are identical or semantically equivalent, taking into account operator properties such as commutativity and associativity.<br />
In LLVM, the <strong>EarlyCSE</strong> pass is one implementation of CSE. The "Early" in <strong>EarlyCSE</strong> means that the pass is simple and fast, so it can be run at whatever stage of the pipeline needs it.</p>
<span id="more"></span>
<h2 id="a-top-down-view">A Top-Down View</h2>
<p>EarlyCSE visits every BasicBlock exactly once, in DFS order over the dom-tree, so any expression recorded earlier on the current DFS path <strong>dominates</strong> the expressions in the current block.</p>
<p>Besides, EarlyCSE tracks a generation number for memory instructions, since memory operations in LLVM IR are not in SSA form and have to be handled separately. Every time we reach a join point (the current BB has more than one predecessor), the generation is incremented by one.</p>
<blockquote>
<p>If this block has a single predecessor, then the predecessor is the parent of the domtree node and all of the live out memory values are still current in this block. If this block has multiple predecessors, then they could have invalidated the live-out memory values of our parent value. For now, just be conservative and invalidate memory if this block has multiple predecessors.</p>
</blockquote>
<p>Then, in the <code>processNode</code> function, we handle the most important case: instructions that "SimpleValue" can represent. We maintain a hash table called "AvailableValues". When we encounter an instruction, we look up its hash value in this table. If there is no such entry, we insert it. Otherwise, we check whether the instructions with the same hash are equivalent at the instruction level. If they are, we replace the later one with the earlier one, which sits higher in the dom-tree.</p>
<p>This covers most SSA values. Memory operations are discussed later.</p>
<h2 id="how-is-the-available-values-maintained">How are the available values maintained?</h2>
<p>While walking the dom-tree in DFS order, EarlyCSE maintains a scoped map and an explicit stack (emulating the function call stack). On entering a new <em>BB</em>, it pushes a node onto the stack and inserts the hashes of the instructions in that <em>BB</em>. On leaving the <em>BB</em>, it pops the node and erases the hashes that were inserted for that <em>BB</em>.</p>
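<p>The scoping idea can be sketched with a toy table. The code below is an illustration only and does not use LLVM's actual data structures: <code>ScopedTable</code>, <code>DomNode</code> and <code>countRedundant</code> are hypothetical names, expressions are plain strings, and recursion stands in for the explicit stack.</p>

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Toy scoped table: each open scope remembers the keys it inserted so that
// they can be erased again when the scope is closed.
class ScopedTable {
  std::unordered_map<std::string, int> table_;   // key -> number of live insertions
  std::vector<std::vector<std::string>> scopes_; // keys inserted per open scope
public:
  void enterScope() { scopes_.emplace_back(); }

  void exitScope() {
    assert(!scopes_.empty() && "exitScope without matching enterScope");
    for (const std::string &key : scopes_.back())
      if (--table_[key] == 0)
        table_.erase(key);
    scopes_.pop_back();
  }

  void insert(const std::string &key) {
    assert(!scopes_.empty() && "insert outside of any scope");
    ++table_[key];
    scopes_.back().push_back(key);
  }

  bool contains(const std::string &key) const { return table_.count(key) != 0; }
};

// A dom-tree node: the expressions computed in the block, and the blocks it
// immediately dominates.
struct DomNode {
  std::vector<std::string> exprs;
  std::vector<DomNode> children;
};

// Counts expressions that are already available from a dominating block or
// from earlier in the same block.
int countRedundant(const DomNode &node, ScopedTable &avail) {
  avail.enterScope();
  int redundant = 0;
  for (const std::string &expr : node.exprs) {
    if (avail.contains(expr))
      ++redundant;
    else
      avail.insert(expr);
  }
  for (const DomNode &child : node.children)
    redundant += countRedundant(child, avail);
  avail.exitScope(); // entries of this block are no longer visible to siblings
  return redundant;
}
```

<p>With a root block computing <code>add a b</code> and two children, an expression repeated in a child is found, while an expression that appears only in the two sibling blocks is not, because the first sibling's scope has been closed by the time the second one is visited.</p>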
<h2 id="how-is-lookup-implemented">How is lookup implemented?</h2>
<p>Let's take a look at <code>getHashValueImpl</code> of SimpleValue:</p>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="function"><span class="type">static</span> <span class="type">unsigned</span> <span class="title">getHashValueImpl</span><span class="params">(SimpleValue Val)</span> </span>&#123;</span><br><span class="line">  Instruction *Inst = Val.Inst;</span><br><span class="line">  <span class="comment">// Hash in all of the operands as pointers.</span></span><br><span class="line">  <span class="keyword">if</span> (BinaryOperator *BinOp = <span class="built_in">dyn_cast</span>&lt;BinaryOperator&gt;(Inst)) &#123;</span><br><span class="line">    Value *LHS = BinOp-&gt;<span class="built_in">getOperand</span>(<span class="number">0</span>);</span><br><span class="line">    Value *RHS = BinOp-&gt;<span class="built_in">getOperand</span>(<span class="number">1</span>);</span><br><span class="line">    <span class="keyword">if</span> (BinOp-&gt;<span class="built_in">isCommutative</span>() &amp;&amp; BinOp-&gt;<span class="built_in">getOperand</span>(<span class="number">0</span>) &gt; BinOp-&gt;<span class="built_in">getOperand</span>(<span class="number">1</span>))</span><br><span class="line">      std::<span class="built_in">swap</span>(LHS, RHS);</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> <span class="built_in">hash_combine</span>(BinOp-&gt;<span class="built_in">getOpcode</span>(), LHS, RHS);</span><br><span class="line">  &#125;</span><br><span class="line"></span><br><span class="line">  <span class="keyword">if</span> (CmpInst *CI = <span class="built_in">dyn_cast</span>&lt;CmpInst&gt;(Inst)) &#123;</span><br><span class="line">    <span class="comment">// Compares can be commuted by swapping the comparands and</span></span><br><span class="line">    <span class="comment">// updating the predicate.  
Choose the form that has the</span></span><br><span class="line">    <span class="comment">// comparands in sorted order, or in the case of a tie, the</span></span><br><span class="line">    <span class="comment">// one with the lower predicate.</span></span><br><span class="line">    Value *LHS = CI-&gt;<span class="built_in">getOperand</span>(<span class="number">0</span>);</span><br><span class="line">    Value *RHS = CI-&gt;<span class="built_in">getOperand</span>(<span class="number">1</span>);</span><br><span class="line">    CmpInst::Predicate Pred = CI-&gt;<span class="built_in">getPredicate</span>();</span><br><span class="line">    CmpInst::Predicate SwappedPred = CI-&gt;<span class="built_in">getSwappedPredicate</span>();</span><br><span class="line">    <span class="keyword">if</span> (std::<span class="built_in">tie</span>(LHS, Pred) &gt; std::<span class="built_in">tie</span>(RHS, SwappedPred)) &#123;</span><br><span class="line">      std::<span class="built_in">swap</span>(LHS, RHS);</span><br><span class="line">      Pred = SwappedPred;</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">return</span> <span class="built_in">hash_combine</span>(Inst-&gt;<span class="built_in">getOpcode</span>(), Pred, LHS, RHS);</span><br><span class="line">  &#125;</span><br><span class="line">  ....</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>As we can see, <span class="math inline">\(hash(binop) = hash(opcode, lhs, rhs)\)</span>, where <span class="math inline">\(lhs\)</span> is the pointer to the left operand and <span class="math inline">\(rhs\)</span> is the pointer to the right operand. This means that the pass can only eliminate instructions whose operands are the very same <strong>references/pointers</strong>, that is, the same values.</p>
<p>Because of the DFS order over the dom-tree, two identical <span class="math inline">\(op(a,b)\)</span> in BB1 and BB2 can be merged only when BB1 dominates BB2 or BB2 dominates BB1. <em>GVN</em> can handle the remaining cases thanks to its <em>RPO</em> iteration order, at a higher cost (it is the more <strong>expensive</strong> pass).</p>
<p>Besides, IR flags such as <code>nsw, nuw</code> are left out of the hash, so instructions that differ only in these flags still land in the same table slot.</p>
<p>With such a simple implementation, EarlyCSE is <strong>cheap</strong>, running in <span class="math inline">\(O(n)\)</span> time, but <strong>less effective</strong> than <em>GVN</em>.</p>
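<p>The canonicalization trick in <code>getHashValueImpl</code> can be shown in isolation. The following is a simplified sketch, not LLVM code: <code>Opcode</code>, <code>Expr</code>, <code>canonicalize</code>, <code>hashExpr</code> and <code>isEqual</code> are hypothetical names, and operands are identified by address in the same spirit as <code>Value*</code>.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

enum class Opcode { Add, Sub, Mul };

struct Expr {
  Opcode op;
  const void *lhs; // operand identity, compared by address
  const void *rhs;
};

bool isCommutative(Opcode op) { return op == Opcode::Add || op == Opcode::Mul; }

// Put the operands of a commutative operation into a fixed (address) order.
Expr canonicalize(Expr e) {
  if (isCommutative(e.op) && std::less<const void *>()(e.rhs, e.lhs))
    std::swap(e.lhs, e.rhs);
  return e;
}

// Hash over (opcode, lhs, rhs) after canonicalization.
std::size_t hashExpr(Expr e) {
  e = canonicalize(e);
  std::size_t h = std::hash<int>()(static_cast<int>(e.op));
  h = h * 31 + std::hash<const void *>()(e.lhs);
  h = h * 31 + std::hash<const void *>()(e.rhs);
  return h;
}

bool isEqual(Expr a, Expr b) {
  a = canonicalize(a);
  b = canonicalize(b);
  return a.op == b.op && a.lhs == b.lhs && a.rhs == b.rhs;
}

// Usage: `x + y` and `y + x` are treated as the same expression,
// `x - y` and `y - x` are not.
void demo() {
  int x = 0, y = 0;
  assert(isEqual({Opcode::Add, &x, &y}, {Opcode::Add, &y, &x}));
  assert(!isEqual({Opcode::Sub, &x, &y}, {Opcode::Sub, &y, &x}));
}
```

<p>Sorting by address is enough here because the goal is only a consistent order within one run, not a stable order across runs.</p>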
<h2 id="ignorecombine-ir-flag">Ignore/Combine IR flag</h2>
<p>When hashing instructions, flags such as <code>nsw, nuw</code> are always ignored. For <strong>memory instructions</strong>, however, properties such as the matching id and atomicity are recorded together with the available load:</p>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line">AvailableLoads.<span class="built_in">insert</span>(MemInst.<span class="built_in">getPointerOperand</span>(),</span><br><span class="line">                              <span class="built_in">LoadValue</span>(&amp;Inst, CurrentGeneration,</span><br><span class="line">                                        MemInst.<span class="built_in">getMatchingId</span>(),</span><br><span class="line">                                        MemInst.<span class="built_in">isAtomic</span>(),</span><br><span class="line">                                        MemInst.<span class="built_in">isLoad</span>()));</span><br></pre></td></tr></table></figure>
<h2 id="memory-cse">Memory CSE</h2>
<p>EarlyCSE eliminates redundant memory operations with the help of <em>MemorySSA</em> analysis and a <strong>generation</strong> number that it tracks while visiting BasicBlocks. The generation is not simply the DFS number of the block: it is incremented when a block has multiple predecessors and, as the code below shows, when a volatile or ordered load is encountered.</p>
<p>If the generations of two memory operations differ, we cannot assume that they observe the same memory, since the live-out memory values of the parent could have been invalidated by one of multiple predecessors.</p>
<p>In the <code>processNode</code> function, EarlyCSE also handles some trivial dead store elimination.</p>
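<p>A minimal sketch of the generation check, under simplifying assumptions: <code>LoadTracker</code> and <code>AvailableLoad</code> are hypothetical names, loaded values are plain integers, and any memory write simply starts a new generation. The real pass keeps more state per load (matching id, atomicity) and can refine the answer with MemorySSA.</p>

```cpp
#include <cassert>
#include <unordered_map>

struct AvailableLoad {
  int value;           // stand-in for the previously loaded SSA value
  unsigned generation; // memory generation at the time of the load
};

class LoadTracker {
  std::unordered_map<const void *, AvailableLoad> loads_;
  unsigned generation_ = 0;

public:
  // Anything that may write memory starts a new generation.
  void onMayWriteMemory() { ++generation_; }

  // If an earlier load of `ptr` from the current generation exists, stores
  // its value in `reused` and returns true. Otherwise records `loaded` as
  // the available value for `ptr` and returns false.
  bool onLoad(const void *ptr, int loaded, int &reused) {
    auto it = loads_.find(ptr);
    if (it != loads_.end() && it->second.generation == generation_) {
      reused = it->second.value;
      return true;
    }
    loads_[ptr] = AvailableLoad{loaded, generation_};
    return false;
  }
};

// Usage: the second load is redundant, the load after a write is not.
void demo() {
  int cell = 0, reused = 0;
  LoadTracker tracker;
  assert(!tracker.onLoad(&cell, 10, reused));
  assert(tracker.onLoad(&cell, 10, reused) && reused == 10);
  tracker.onMayWriteMemory();
  assert(!tracker.onLoad(&cell, 20, reused));
}
```

<p>Comparing generations is deliberately conservative: a write to an unrelated address also invalidates the recorded load, which keeps the check cheap at the price of missed eliminations.</p>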
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="comment">/// LastStore - Keep track of the last non-volatile store that we saw... for</span></span><br><span class="line"><span class="comment">/// as long as there in no instruction that reads memory.  If we see a store</span></span><br><span class="line"><span class="comment">/// to the same location, we delete the dead store.  This zaps trivial dead</span></span><br><span class="line"><span class="comment">/// stores which can occur in bitfield code among other things.</span></span><br><span class="line">Instruction *LastStore = <span class="literal">nullptr</span>;</span><br></pre></td></tr></table></figure>
<p>For non-trivial memory operations, EarlyCSE applies specific methods. Let's take a look at its implementation after lookup:</p>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="function">ParseMemoryInst <span class="title">MemInst</span><span class="params">(&amp;Inst, TTI)</span></span>;</span><br><span class="line"><span class="comment">// If this is a non-volatile load, process it.</span></span><br><span class="line"><span class="keyword">if</span> (MemInst.<span class="built_in">isValid</span>() &amp;&amp; MemInst.<span class="built_in">isLoad</span>()) &#123;</span><br><span class="line">  <span class="keyword">if</span> (MemInst.<span class="built_in">isVolatile</span>() || !MemInst.<span class="built_in">isUnordered</span>()) &#123;</span><br><span class="line">    LastStore = <span class="literal">nullptr</span>;</span><br><span class="line">    ++CurrentGeneration;</span><br><span class="line">  &#125;</span><br></pre></td></tr></table></figure>
<p>Here we drop the last store and bump the generation, since a volatile or ordered memory operation makes the store impossible to eliminate.</p>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"></span><br><span class="line"><span class="keyword">if</span> (MemInst.<span class="built_in">isInvariantLoad</span>()) &#123;</span><br><span class="line">  <span class="comment">// If we pass an invariant load, we know that memory location is</span></span><br><span class="line">  <span class="comment">// indefinitely constant from the moment of first dereferenceability.</span></span><br><span class="line">  <span class="comment">// We conservatively treat the invariant_load as that moment.  If we</span></span><br><span class="line">  <span class="comment">// pass a invariant load after already establishing a scope, don&#x27;t</span></span><br><span class="line">  <span class="comment">// restart it since we want to preserve the earliest point seen.</span></span><br><span class="line">  <span class="keyword">auto</span> MemLoc = MemoryLocation::<span class="built_in">get</span>(&amp;Inst);</span><br><span class="line">  <span class="keyword">if</span> (!AvailableInvariants.<span class="built_in">count</span>(MemLoc))</span><br><span class="line">    AvailableInvariants.<span class="built_in">insert</span>(MemLoc, CurrentGeneration);</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>For an invariant load, the contents of its <em>memory location</em> stay <em>invariant</em> from that point on. So we keep the earliest such load, to maximize the range it covers.</p>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="comment">// If we have an available version of this load, and if it is the right</span></span><br><span class="line"><span class="comment">// generation or the load is known to be from an invariant location,</span></span><br><span class="line"><span class="comment">// replace this instruction.</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="comment">// If either the dominating load or the current load are invariant, then</span></span><br><span class="line"><span class="comment">// we can assume the current load loads the same value as the dominating</span></span><br><span class="line"><span class="comment">// load.</span></span><br><span class="line">LoadValue InVal = AvailableLoads.<span class="built_in">lookup</span>(MemInst.<span class="built_in">getPointerOperand</span>());</span><br><span class="line"><span class="keyword">if</span> (Value *Op = <span class="built_in">getMatchingValue</span>(InVal, MemInst, CurrentGeneration)) &#123;</span><br><span class="line">  <span class="comment">// Something related to debug information</span></span><br><span class="line">  <span class="keyword">if</span> (InVal.IsLoad)</span><br><span class="line">    <span class="keyword">if</span> (<span class="keyword">auto</span> *I = <span class="built_in">dyn_cast</span>&lt;Instruction&gt;(Op))</span><br><span class="line">      <span class="built_in">combineMetadataForCSE</span>(I, &amp;Inst, <span class="literal">false</span>);</span><br><span class="line">  <span class="keyword">if</span> (!Inst.<span class="built_in">use_empty</span>())</span><br><span class="line">    Inst.<span class="built_in">replaceAllUsesWith</span>(Op);</span><br><span class="line">  <span class="comment">// Something related to updating analysis and debug information</span></span><br><span class="line">  Inst.<span 
class="built_in">eraseFromParent</span>();</span><br><span class="line">  Changed = <span class="literal">true</span>;</span><br><span class="line">  ++NumCSELoad;</span><br><span class="line">  <span class="keyword">continue</span>;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>This is similar to the SimpleValue case, except that the matching value is obtained through <code>getMatchingValue</code>, which can consult <em>MemorySSA</em>.</p>
<h2 id="difference-between-gvn-and-earlycse">Difference between GVN and EarlyCSE</h2>
<p>To be continued</p>
]]></content>
      <categories>
        <category>LLVM</category>
      </categories>
      <tags>
        <tag>Compiler</tag>
        <tag>LLVM</tag>
        <tag>OpenSource</tag>
      </tags>
  </entry>
  <entry>
    <title>LLVM Source Analysis: Interval Analysis</title>
    <url>/2023/09/21/LLVM-Source-Analysis-Interval/</url>
    <content><![CDATA[<h2 id="abstract">Abstract</h2>
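<p>This is my first blog post dedicated to analyzing LLVM source code. I have recently been reading the Whale Book to study compiler optimization, so this series is a good chance to combine theory with practice.</p>
<p>Interval Analysis is a kind of control flow analysis, often used as the basis for other optimizations such as LoopUnroll.</p>
<p>Let's first look at the code of the Interval class. In compiler theory, an Interval generally denotes a set of nodes in which every <span class="math inline">\(Node \ne Head\)</span> satisfies <span class="math inline">\(Pred(Node) \subset Interval\)</span>:</p>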
<span id="more"></span>
<h2 id="interval-类">The Interval Class</h2>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">Interval</span> &#123;</span><br><span class="line">  <span class="comment">/// HeaderNode - The header BasicBlock, which dominates all BasicBlocks in this</span></span><br><span class="line">  <span class="comment">/// interval.  Also, any loops in this interval must go through the HeaderNode.</span></span><br><span class="line">  <span class="comment">///</span></span><br><span class="line">  BasicBlock *HeaderNode;</span><br></pre></td></tr></table></figure>
<p>The HeaderNode here dominates every BasicBlock (Node) in the Interval and serves as the representative of the Interval.</p>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"></span><br><span class="line"><span class="keyword">public</span>:</span><br><span class="line"><span class="function"><span class="keyword">inline</span> <span class="title">Interval</span><span class="params">(BasicBlock *Header)</span> : HeaderNode(Header) &#123;</span></span><br><span class="line">  Nodes.<span class="built_in">push_back</span>(Header);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">inline</span> BasicBlock *<span class="title">getHeaderNode</span><span class="params">()</span> <span class="type">const</span> </span>&#123; <span class="keyword">return</span> HeaderNode; &#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">/// Nodes - The basic blocks in this interval.</span></span><br><span class="line">std::vector&lt;BasicBlock*&gt; Nodes;</span><br></pre></td></tr></table></figure>
<p>The constructor and some basic definitions. Nodes stores every BasicBlock in the Interval.</p>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="comment">/// Successors - List of BasicBlocks that are reachable directly from nodes in</span></span><br><span class="line"><span class="comment">/// this interval, but are not in the interval themselves.</span></span><br><span class="line"><span class="comment">/// These nodes necessarily must be header nodes for other intervals.</span></span><br><span class="line">std::vector&lt;BasicBlock*&gt; Successors;</span><br></pre></td></tr></table></figure>
<p>Successors holds all nodes that can be reached <strong>directly</strong> <strong>from</strong> a node in the Interval but are not in the Interval themselves.</p>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="comment">/// Predecessors - List of BasicBlocks that have this Interval&#x27;s header block</span></span><br><span class="line"><span class="comment">/// as one of their successors.</span></span><br><span class="line">std::vector&lt;BasicBlock*&gt; Predecessors;</span><br></pre></td></tr></table></figure>
<p>Predecessors holds every Node that satisfies <span class="math inline">\(Head \in Succ(Node)\)</span>.</p>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="comment">/// contains - Find out if a basic block is in this interval</span></span><br><span class="line"><span class="function"><span class="keyword">inline</span> <span class="type">bool</span> <span class="title">contains</span><span class="params">(BasicBlock *BB)</span> <span class="type">const</span> </span>&#123;</span><br><span class="line">  <span class="keyword">for</span> (BasicBlock *Node : Nodes)</span><br><span class="line">    <span class="keyword">if</span> (Node == BB)</span><br><span class="line">      <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line">  <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">  <span class="comment">// I don&#x27;t want the dependency on &lt;algorithm&gt;</span></span><br><span class="line">  <span class="comment">//return find(Nodes.begin(), Nodes.end(), BB) != Nodes.end();</span></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">/// isSuccessor - find out if a basic block is a successor of this Interval</span></span><br><span class="line"><span class="function"><span class="keyword">inline</span> <span class="type">bool</span> <span class="title">isSuccessor</span><span class="params">(BasicBlock *BB)</span> <span class="type">const</span> </span>&#123;</span><br><span class="line">  <span class="keyword">for</span> (BasicBlock *Successor : Successors)</span><br><span class="line">    <span class="keyword">if</span> (Successor == BB)</span><br><span class="line">      <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line">  <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">  <span class="comment">// I don&#x27;t want the dependency on 
&lt;algorithm&gt;</span></span><br><span class="line">  <span class="comment">//return find(Successors.begin(), Successors.end(), BB) != Successors.end();</span></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">/// Equality operator.  It is only valid to compare two intervals from the</span></span><br><span class="line"><span class="comment">/// same partition, because of this, all we have to check is the header node</span></span><br><span class="line"><span class="comment">/// for equality.</span></span><br><span class="line"><span class="keyword">inline</span> <span class="type">bool</span> <span class="keyword">operator</span>==(<span class="type">const</span> Interval &amp;I) <span class="type">const</span> &#123;</span><br><span class="line">  <span class="keyword">return</span> HeaderNode == I.HeaderNode;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
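<p>These are simple enough that they need no further explanation.</p>
<h2 id="interval-partition-类">The IntervalPartition Class</h2>
<p>Next come the key pieces, IntervalPartition and IntervalIterator, which form the core of the algorithm:</p>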
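<p>The textbook definition of an interval given in the abstract can be turned into a small fixed-point computation. The sketch below is a toy version on an integer-indexed CFG and is not how LLVM's IntervalIterator is written; <code>Graph</code> and <code>buildInterval</code> are hypothetical names, and nodes without predecessors other than the header are simply skipped.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// CFG as adjacency lists: succs[n] holds the successors of node n.
using Graph = std::vector<std::vector<int>>;

// Builds the maximal interval headed by `header`: keep adding any node whose
// predecessors all lie inside the interval already.
std::vector<int> buildInterval(const Graph &succs, int header) {
  const std::size_t n = succs.size();
  assert(header >= 0 && static_cast<std::size_t>(header) < n);

  // Derive predecessor lists from the successor lists.
  std::vector<std::vector<int>> preds(n);
  for (std::size_t u = 0; u < n; ++u)
    for (int v : succs[u])
      preds[v].push_back(static_cast<int>(u));

  std::vector<bool> inside(n, false);
  std::vector<int> nodes{header};
  inside[header] = true;

  bool changed = true;
  while (changed) {
    changed = false;
    for (std::size_t v = 0; v < n; ++v) {
      if (inside[v] || preds[v].empty())
        continue;
      bool allIn = std::all_of(preds[v].begin(), preds[v].end(),
                               [&](int p) -> bool { return inside[p]; });
      if (allIn) {
        inside[v] = true;
        nodes.push_back(static_cast<int>(v));
        changed = true;
      }
    }
  }
  return nodes;
}
```

<p>For the CFG 0→1, 1→2, 2→1, 2→3, the interval headed by 0 is just {0}, because node 1 also has the predecessor 2 outside the interval; node 1 therefore becomes the header of the next interval, {1, 2, 3}, which contains the loop.</p>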
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">IntervalPartition</span> : <span class="keyword">public</span> FunctionPass &#123;</span><br><span class="line">  <span class="keyword">using</span> IntervalMapTy = std::map&lt;BasicBlock *, Interval *&gt;;</span><br><span class="line">  IntervalMapTy IntervalMap;</span><br><span class="line"></span><br><span class="line">  <span class="keyword">using</span> IntervalListTy = std::vector&lt;Interval *&gt;;</span><br><span class="line">  Interval *RootInterval = <span class="literal">nullptr</span>;</span><br><span class="line">  std::vector&lt;Interval *&gt; Intervals;</span><br></pre></td></tr></table></figure>
<p>The storage layout also matches the theory: a root interval, the set of all intervals, and a map from each BasicBlock to its Interval.</p>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="comment">// addIntervalToPartition - Add an interval to the internal list of intervals,</span></span><br><span class="line"><span class="comment">// and then add mappings from all of the basic blocks in the interval to the</span></span><br><span class="line"><span class="comment">// interval itself (in the IntervalMap).</span></span><br><span class="line"><span class="function"><span class="type">void</span> <span class="title">IntervalPartition::addIntervalToPartition</span><span class="params">(Interval *I)</span> </span>&#123;</span><br><span class="line">  Intervals.<span class="built_in">push_back</span>(I);</span><br><span class="line"></span><br><span class="line">  <span class="comment">// Add mappings for all of the basic blocks in I to the IntervalPartition</span></span><br><span class="line">  <span class="keyword">for</span> (Interval::node_iterator It = I-&gt;Nodes.<span class="built_in">begin</span>(), End = I-&gt;Nodes.<span class="built_in">end</span>();</span><br><span class="line">       It != End; ++It)</span><br><span class="line">    IntervalMap.<span class="built_in">insert</span>(std::<span class="built_in">make_pair</span>(*It, I));</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>This function appends the interval to Intervals and records, for every BasicBlock in it, a map entry from that block to the interval.</p>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="comment">// updatePredecessors - Interval generation only sets the successor fields of</span></span><br><span class="line"><span class="comment">// the interval data structures.  After interval generation is complete,</span></span><br><span class="line"><span class="comment">// run through all of the intervals and propagate successor info as</span></span><br><span class="line"><span class="comment">// predecessor info.</span></span><br><span class="line"><span class="function"><span class="type">void</span> <span class="title">IntervalPartition::updatePredecessors</span><span class="params">(Interval *Int)</span> </span>&#123;</span><br><span class="line">  BasicBlock *Header = Int-&gt;<span class="built_in">getHeaderNode</span>();</span><br><span class="line">  <span class="keyword">for</span> (BasicBlock *Successor : Int-&gt;Successors)</span><br><span class="line">    <span class="built_in">getBlockInterval</span>(Successor)-&gt;Predecessors.<span class="built_in">push_back</span>(Header);</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
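<p>Interval generation only fills in each Interval's Successors, so this function propagates that information: for every successor of an Interval, it records the Interval's header in the Predecessors of the successor's Interval.</p>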
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="comment">// IntervalPartition ctor - Build the first level interval partition for the</span></span><br><span class="line"><span class="comment">// specified function...</span></span><br><span class="line"><span class="function"><span class="type">bool</span> <span class="title">IntervalPartition::runOnFunction</span><span class="params">(Function &amp;F)</span> </span>&#123;</span><br><span class="line">  <span class="comment">// Pass false to intervals_begin because we take ownership of it&#x27;s memory</span></span><br><span class="line">  function_interval_iterator I = <span class="built_in">intervals_begin</span>(&amp;F, <span class="literal">false</span>);</span><br><span class="line">  <span class="built_in">assert</span>(I != <span class="built_in">intervals_end</span>(&amp;F) &amp;&amp; <span class="string">&quot;No intervals in function!?!?!&quot;</span>);</span><br><span class="line"></span><br><span class="line">  <span class="built_in">addIntervalToPartition</span>(RootInterval = *I);</span><br><span class="line"></span><br><span class="line">  ++I;  <span class="comment">// After the first one...</span></span><br><span class="line"></span><br><span class="line">  <span class="comment">// Add the rest of the intervals to the partition.</span></span><br><span class="line">  <span class="keyword">for</span> (function_interval_iterator E = <span class="built_in">intervals_end</span>(&amp;F); I != E; ++I)</span><br><span class="line">    <span class="built_in">addIntervalToPartition</span>(*I);</span><br><span class="line"></span><br><span class="line">  <span class="comment">// Now that we know all of the successor information, propagate this to the</span></span><br><span class="line">  <span class="comment">// predecessors for each block.</span></span><br><span class="line">  <span class="keyword">for</span> (Interval *I : Intervals)</span><br><span 
class="line">    <span class="built_in">updatePredecessors</span>(I);</span><br><span class="line">  <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
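<p>This builds the partition one Interval at a time. From the way the algorithm works, once one Interval is complete, its Successors are used to build the remaining Intervals. At the end the Preds are updated, and the whole function has been partitioned.</p>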
<h2 id="interval-iterator-类">The Interval Iterator class</h2>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="keyword">template</span>&lt;<span class="keyword">class</span> <span class="title class_">NodeTy</span>, <span class="keyword">class</span> <span class="title class_">OrigContainer_t</span>, <span class="keyword">class</span> <span class="title class_">GT</span> = GraphTraits&lt;NodeTy *&gt;,</span><br><span class="line">         <span class="keyword">class</span> IGT = GraphTraits&lt;Inverse&lt;NodeTy *&gt;&gt;&gt;</span><br><span class="line"><span class="keyword">class</span> IntervalIterator &#123;</span><br><span class="line">  std::vector&lt;std::pair&lt;Interval *, <span class="keyword">typename</span> Interval::succ_iterator&gt;&gt; IntStack;</span><br><span class="line">  std::set&lt;BasicBlock *&gt; Visited;</span><br><span class="line">  OrigContainer_t *OrigContainer;</span><br><span class="line">  <span class="type">bool</span> IOwnMem;     <span class="comment">// If True, delete intervals when done with them</span></span><br><span class="line">                    <span class="comment">// See file header for conditions of use</span></span><br></pre></td></tr></table></figure>
<p>These are the data members of the iterator. There is no need to analyze the templates for now: simply read NodeTy as BasicBlock and OrigContainer_t as Function.</p>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="comment">// ProcessInterval - This method is used during the construction of the</span></span><br><span class="line"><span class="comment">// interval graph.  It walks through the source graph, recursively creating</span></span><br><span class="line"><span class="comment">// an interval per invocation until the entire graph is covered.  This uses</span></span><br><span class="line"><span class="comment">// the ProcessNode method to add all of the nodes to the interval.</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="comment">// This method is templated because it may operate on two different source</span></span><br><span class="line"><span class="comment">// graphs: a basic block graph, or a preexisting interval graph.</span></span><br><span class="line"><span class="function"><span class="type">bool</span> <span class="title">ProcessInterval</span><span class="params">(NodeTy *Node)</span> </span>&#123;</span><br><span class="line">  BasicBlock *Header = <span class="built_in">getNodeHeader</span>(Node);</span><br><span class="line">  <span class="keyword">if</span> (!Visited.<span class="built_in">insert</span>(Header).second)</span><br><span class="line">    <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line"></span><br><span class="line">  Interval *Int = <span class="keyword">new</span> <span class="built_in">Interval</span>(Header);</span><br><span class="line"></span><br><span class="line">  <span class="comment">// Check all of our successors to see if they are in the interval...</span></span><br><span class="line">  <span class="keyword">for</span> (<span class="keyword">typename</span> GT::ChildIteratorType I = GT::<span class="built_in">child_begin</span>(Node),</span><br><span class="line">         E = GT::<span 
class="built_in">child_end</span>(Node); I != E; ++I)</span><br><span class="line">    <span class="built_in">ProcessNode</span>(Int, <span class="built_in">getSourceGraphNode</span>(OrigContainer, *I));</span><br><span class="line"></span><br><span class="line">  IntStack.<span class="built_in">push_back</span>(std::<span class="built_in">make_pair</span>(Int, <span class="built_in">succ_begin</span>(Int)));</span><br><span class="line">  <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// ProcessNode - This method is called by ProcessInterval to add nodes to the</span></span><br><span class="line"><span class="comment">// interval being constructed, and it is also called recursively as it walks</span></span><br><span class="line"><span class="comment">// the source graph.  A node is added to the current interval only if all of</span></span><br><span class="line"><span class="comment">// its predecessors are already in the graph.  
This also takes care of keeping</span></span><br><span class="line"><span class="comment">// the successor set of an interval up to date.</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="comment">// This method is templated because it may operate on two different source</span></span><br><span class="line"><span class="comment">// graphs: a basic block graph, or a preexisting interval graph.</span></span><br><span class="line"><span class="function"><span class="type">void</span> <span class="title">ProcessNode</span><span class="params">(Interval *Int, NodeTy *Node)</span> </span>&#123;</span><br><span class="line">  <span class="built_in">assert</span>(Int &amp;&amp; <span class="string">&quot;Null interval == bad!&quot;</span>);</span><br><span class="line">  <span class="built_in">assert</span>(Node &amp;&amp; <span class="string">&quot;Null Node == bad!&quot;</span>);</span><br><span class="line"></span><br><span class="line">  BasicBlock *NodeHeader = <span class="built_in">getNodeHeader</span>(Node);</span><br><span class="line"></span><br><span class="line">  <span class="keyword">if</span> (Visited.<span class="built_in">count</span>(NodeHeader)) &#123;     <span class="comment">// Node already been visited?</span></span><br><span class="line">    <span class="keyword">if</span> (Int-&gt;<span class="built_in">contains</span>(NodeHeader)) &#123;   <span class="comment">// Already in this interval...</span></span><br><span class="line">      <span class="keyword">return</span>;</span><br><span class="line">    &#125; <span class="keyword">else</span> &#123;                           <span class="comment">// In other interval, add as successor</span></span><br><span class="line">      <span class="keyword">if</span> (!Int-&gt;<span class="built_in">isSuccessor</span>(NodeHeader)) <span class="comment">// Add only if not already in set</span></span><br><span class="line">        
Int-&gt;Successors.<span class="built_in">push_back</span>(NodeHeader);</span><br><span class="line">    &#125;</span><br><span class="line">  &#125; <span class="keyword">else</span> &#123;                             <span class="comment">// Otherwise, not in interval yet</span></span><br><span class="line">    <span class="keyword">for</span> (<span class="keyword">typename</span> IGT::ChildIteratorType I = IGT::<span class="built_in">child_begin</span>(Node),</span><br><span class="line">           E = IGT::<span class="built_in">child_end</span>(Node); I != E; ++I) &#123;</span><br><span class="line">      <span class="keyword">if</span> (!Int-&gt;<span class="built_in">contains</span>(*I)) &#123;        <span class="comment">// If pred not in interval, we can&#x27;t be</span></span><br><span class="line">        <span class="keyword">if</span> (!Int-&gt;<span class="built_in">isSuccessor</span>(NodeHeader)) <span class="comment">// Add only if not already in set</span></span><br><span class="line">          Int-&gt;Successors.<span class="built_in">push_back</span>(NodeHeader);</span><br><span class="line">        <span class="keyword">return</span>;                        <span class="comment">// See you later</span></span><br><span class="line">      &#125;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// If we get here, then all of the predecessors of BB are in the interval</span></span><br><span class="line">    <span class="comment">// already.  
In this case, we must add BB to the interval!</span></span><br><span class="line">    <span class="built_in">addNodeToInterval</span>(Int, Node);</span><br><span class="line">    Visited.<span class="built_in">insert</span>(NodeHeader);     <span class="comment">// The node has now been visited!</span></span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (Int-&gt;<span class="built_in">isSuccessor</span>(NodeHeader)) &#123;</span><br><span class="line">      <span class="comment">// If we were in the successor list from before... remove from succ list</span></span><br><span class="line">      llvm::<span class="built_in">erase_value</span>(Int-&gt;Successors, NodeHeader);</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Now that we have discovered that Node is in the interval, perhaps some</span></span><br><span class="line">    <span class="comment">// of its successors are as well?</span></span><br><span class="line">    <span class="keyword">for</span> (<span class="keyword">typename</span> GT::ChildIteratorType It = GT::<span class="built_in">child_begin</span>(Node),</span><br><span class="line">           End = GT::<span class="built_in">child_end</span>(Node); It != End; ++It)</span><br><span class="line">      <span class="built_in">ProcessNode</span>(Int, <span class="built_in">getSourceGraphNode</span>(OrigContainer, *It));</span><br><span class="line">  &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
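<p>This is the core of the algorithm. The first function, <code>ProcessInterval</code>, takes Node as a header and builds the Interval headed by it (by calling the second function, <code>ProcessNode</code>). It then pushes the new Interval, together with an iterator over its Successors, onto IntStack. Each call to <code>operator++()</code> then tries the Successors of the Intervals on IntStack as headers of new Intervals, which amounts to a <strong>graph traversal at the Interval level</strong>. Note also that if the Node has already been visited, <code>ProcessInterval</code> returns false, meaning that no new Interval was found this time.</p>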
<p>Next, let us look at the graph algorithm inside a single Interval, that is, the logic of the second function, <code>ProcessNode</code>.</p>
<p>If the Node has already been visited:</p>
<ul>
<li>If the Interval already contains the Node, this search step ends.</li>
<li>Otherwise the Node belongs to another Interval, which makes it one of this Interval's Successors.</li>
</ul>
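<p>If the Node has not been visited:</p>
<ul>
<li>If the reverse search finds a Pred(Node) that is not in the Interval, then not all of the Node's Preds have been collected yet, so the Interval cannot yet dominate the Node. Return and let the Preds be processed first (a traversal at the BasicBlock level; the Node will be seen again later, hence "see you later").
<ul>
<li>If the Node is not yet a Successor, add it now; it is removed again later if the Node ends up joining the Interval. <span class="math inline">\((1)\)</span></li>
</ul></li>
</ul>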
<p>Otherwise, add the Node to the Interval. If the Node was previously recorded as a Successor, remove it from the Successors now; this corresponds to case <span class="math inline">\((1)\)</span>.</p>
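<p>Finally, the search continues into the Node's children until every Node that belongs to the Interval has been added. Note that only the Successors are maintained during this process; the Predecessors of the Intervals are not filled in, which is why the IntervalPartition class has to call <code>updatePredecessors(I)</code>.</p>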
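<p>To make the case analysis above concrete, here is a minimal, self-contained sketch of the same idea on a toy graph of integer node ids. This is not LLVM code: <code>Graph</code>, <code>ToyInterval</code>, <code>processNode</code> and <code>buildIntervals</code> are hypothetical names made up for illustration, and a plain worklist stands in for IntStack.</p>

```cpp
#include <algorithm>
#include <map>
#include <set>
#include <vector>

// Hypothetical toy graph (not an LLVM type): integer node ids with
// successor and predecessor lists.
struct Graph {
  std::map<int, std::vector<int>> succ;
  std::map<int, std::vector<int>> pred;
  void addEdge(int from, int to) {
    succ[from].push_back(to);
    pred[to].push_back(from);
  }
};

// Hypothetical stand-in for an interval: header, member nodes, successors.
struct ToyInterval {
  int header;
  std::vector<int> nodes;       // the header comes first
  std::vector<int> successors;  // headers of the intervals that follow
};

static bool contains(const std::vector<int>& v, int x) {
  return std::find(v.begin(), v.end(), x) != v.end();
}

// Follows the same case analysis as the ProcessNode walkthrough above.
void processNode(const Graph& g, std::set<int>& visited, ToyInterval& in,
                 int node) {
  if (visited.count(node)) {
    if (contains(in.nodes, node))
      return;  // already in this interval
    if (!contains(in.successors, node))
      in.successors.push_back(node);  // lives in another interval
    return;
  }
  auto predIt = g.pred.find(node);
  if (predIt != g.pred.end()) {
    for (int p : predIt->second) {
      if (!contains(in.nodes, p)) {  // a predecessor is still outside
        if (!contains(in.successors, node))
          in.successors.push_back(node);  // case (1): may be removed later
        return;
      }
    }
  }
  // All predecessors are inside the interval, so the node joins it.
  in.nodes.push_back(node);
  visited.insert(node);
  in.successors.erase(
      std::remove(in.successors.begin(), in.successors.end(), node),
      in.successors.end());
  auto succIt = g.succ.find(node);
  if (succIt != g.succ.end())
    for (int s : succIt->second)
      processNode(g, visited, in, s);
}

// Builds intervals starting from `entry`; successors of finished intervals
// become candidate headers for new intervals.
std::vector<ToyInterval> buildIntervals(const Graph& g, int entry) {
  std::vector<ToyInterval> result;
  std::set<int> visited;
  std::vector<int> worklist{entry};
  while (!worklist.empty()) {
    int header = worklist.back();
    worklist.pop_back();
    if (!visited.insert(header).second)
      continue;  // already absorbed somewhere, no new interval
    ToyInterval in{header, {header}, {}};
    auto it = g.succ.find(header);
    if (it != g.succ.end())
      for (int s : it->second)
        processNode(g, visited, in, s);
    for (int s : in.successors)
      worklist.push_back(s);
    result.push_back(in);
  }
  return result;
}
```

<p>For the graph 1→2, 2→3, 3→2, 2→4 with entry 1, node 2 cannot join the interval headed by 1 because its predecessor 3 is outside, so it becomes the header of a second interval that then absorbs 3 and 4.</p>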
]]></content>
      <categories>
        <category>LLVM</category>
      </categories>
      <tags>
        <tag>Compiler</tag>
        <tag>LLVM</tag>
        <tag>OpenSource</tag>
      </tags>
  </entry>
  <entry>
    <title>LeetCode 42 Trapping Rain Water: Solution</title>
    <url>/2023/08/23/LeetCode-42/</url>
    <content><![CDATA[<p>Problem statement: <img src="/images/leetcode42.png" alt="img" /></p>
<p>Basic idea:</p>
<p>For each cell index <span class="math inline">\(x\)</span>, its capacity <span class="math inline">\(c(x)\)</span> depends on the tallest cell to its left and the tallest cell to its right. That is, let:</p>
<p><span class="math display">\[t(x) = \min(\max_{y&lt;x}{\{h(y)\}} , \max_{y&gt;x}{\{h(y)\}})\]</span></p>
<p>Then</p>
<p><span class="math display">\[
c(x) =
\begin{cases}
t(x) - h(x), &amp; \text{if } t(x) &gt; h(x) \\
0, &amp; \text{otherwise}
\end{cases}
\]</span></p>
<p>This leads to the following code:</p>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">Solution</span> &#123;</span><br><span class="line"><span class="keyword">public</span>:</span><br><span class="line">  <span class="function"><span class="type">int</span> <span class="title">trap</span><span class="params">(vector&lt;<span class="type">int</span>&gt; &amp;height)</span> </span>&#123;</span><br><span class="line">    <span class="type">int</span> n = height.<span class="built_in">size</span>();</span><br><span class="line">    <span class="type">int</span> *maxLessThan = <span class="keyword">new</span> <span class="type">int</span>[n];</span><br><span class="line">    <span class="type">int</span> *maxGreaterThan = <span class="keyword">new</span> <span class="type">int</span>[n];</span><br><span class="line">    maxLessThan[<span class="number">0</span>] = <span class="number">0</span>;</span><br><span class="line">    maxGreaterThan[n - <span class="number">1</span>] = <span class="number">0</span>;</span><br><span class="line"></span><br><span class="line">    <span class="type">int</span> curMax = <span class="number">0</span>;</span><br><span class="line">    <span class="keyword">for</span> (<span class="type">int</span> i = <span class="number">1</span>; i &lt; n; ++i) &#123;</span><br><span class="line">      <span class="keyword">if</span> (height[i - <span class="number">1</span>] &gt; curMax)</span><br><span class="line">        curMax = height[i - <span class="number">1</span>];</span><br><span class="line">      maxLessThan[i] = curMax;</span><br><span class="line">    &#125;</span><br><span class="line">    curMax = <span class="number">0</span>;</span><br><span class="line">    <span class="keyword">for</span> (<span class="type">int</span> i = n - <span class="number">2</span>; i &gt;= <span class="number">0</span>; --i) &#123;</span><br><span class="line">      <span 
class="keyword">if</span> (height[i + <span class="number">1</span>] &gt; curMax)</span><br><span class="line">        curMax = height[i + <span class="number">1</span>];</span><br><span class="line">      maxGreaterThan[i] = curMax;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="type">int</span> ret = <span class="number">0</span>;</span><br><span class="line">    <span class="keyword">for</span> (<span class="type">int</span> i = <span class="number">0</span>; i &lt; n; ++i) &#123;</span><br><span class="line">      <span class="type">int</span> t = std::<span class="built_in">min</span>(maxLessThan[i], maxGreaterThan[i]);</span><br><span class="line">      <span class="type">int</span> capa = t &gt; height[i] ? t - height[i] : <span class="number">0</span>;</span><br><span class="line">      ret += capa;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> ret;</span><br><span class="line">  &#125;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure>
<p>There is also a two-pointer method built on the same idea, which is not covered in detail here.</p>
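<p>For reference, a minimal sketch of what such a two-pointer variant could look like. The function name <code>trapTwoPointers</code> is my own, and this is the commonly used formulation rather than code from the original solution: the side with the smaller running maximum is the one whose water level is already fixed, so that side can be settled and its pointer advanced.</p>

```cpp
#include <algorithm>
#include <vector>

// Two-pointer variant: O(n) time and O(1) extra space.
int trapTwoPointers(const std::vector<int>& height) {
  int left = 0;
  int right = static_cast<int>(height.size()) - 1;
  int leftMax = 0, rightMax = 0, water = 0;
  while (left < right) {
    leftMax = std::max(leftMax, height[left]);
    rightMax = std::max(rightMax, height[right]);
    if (leftMax < rightMax) {
      // The left wall is the limiting one for the cell at `left`.
      water += leftMax - height[left];
      ++left;
    } else {
      // The right wall is the limiting one for the cell at `right`.
      water += rightMax - height[right];
      --right;
    }
  }
  return water;
}
```

<p>Compared with the version above, this avoids the two auxiliary arrays entirely.</p>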
]]></content>
      <categories>
        <category>LeetCode</category>
      </categories>
      <tags>
        <tag>Algorithm</tag>
        <tag>LeetCode</tag>
      </tags>
  </entry>
  <entry>
    <title>LeetCode 44 Wildcard Matching: Solution</title>
    <url>/2023/08/23/LeetCode-44/</url>
    <content><![CDATA[<p>Problem statement: <img src="/images/leetcode44.png" alt="img" /></p>
<h2 id="动态规划">Dynamic programming</h2>
<p>A simple idea:</p>
<p>Use dynamic programming. Let <span class="math inline">\(dp[i][j]\)</span> denote <strong>whether <span class="math inline">\(s[0..i]\)</span> matches <span class="math inline">\(p[0..j]\)</span></strong>, that is, whether the first i characters of s match the first j characters of p.</p>
<p>The initial state is:</p>
<p><span class="math display">\[dp[0][0] = true\]</span></p>
<p>A string of length greater than 0 can never be matched by a pattern of length 0, so:</p>
<p><span class="math display">\[dp[i][0] = false, 0 &lt; i \le sn\]</span></p>
<p>Likewise, a string of length 0 can only be matched by a pattern that consists <strong>entirely of wildcards</strong>, such as "***", so:</p>
<p><span class="math display">\[dp[0][j] = dp[0][j-1] \quad \wedge \quad p[j-1] = &#39;*&#39;\]</span></p>
<p>The state transition is then:</p>
<p><span class="math display">\[
 \begin{equation*} % the * suppresses equation numbering
    \begin{split}
        dp[i][j] =
        &amp;   dp[i][j - 1] \wedge p[j - 1] = &#39;*&#39; \quad \vee \\
        &amp;   dp[i - 1][j] \wedge p[j - 1] = &#39;*&#39; \quad \vee \\
        &amp;   dp[i - 1][j - 1] \wedge (s[i - 1] = p[j - 1] \vee p[j - 1] = &#39;?&#39; \vee p[j - 1] = &#39;*&#39;)
    \end{split}
\end{equation*} 
\]</span></p>
<p>This leads to the following code:</p>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="type">bool</span> dp[<span class="number">2001</span>][<span class="number">2001</span>];</span><br><span class="line"></span><br><span class="line"><span class="keyword">class</span> <span class="title class_">Solution</span> &#123;</span><br><span class="line"><span class="keyword">public</span>:</span><br><span class="line">  <span class="function"><span class="type">bool</span> <span class="title">isMatch</span><span class="params">(string s, string p)</span> </span>&#123;</span><br><span class="line">    <span class="keyword">if</span> (s.<span class="built_in">length</span>() == <span class="number">0</span> &amp;&amp;</span><br><span class="line">        std::<span class="built_in">all_of</span>(p.<span class="built_in">begin</span>(), p.<span class="built_in">end</span>(), [](<span class="type">char</span> c) &#123; <span class="keyword">return</span> c == <span class="string">&#x27;*&#x27;</span>; &#125;))</span><br><span class="line">      <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line">    <span class="keyword">if</span> (p.<span class="built_in">length</span>() == <span class="number">0</span>)</span><br><span class="line">      <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line"></span><br><span class="line">    <span class="type">int</span> sn = s.<span class="built_in">length</span>();</span><br><span class="line">    <span class="type">int</span> pn = p.<span class="built_in">length</span>();</span><br><span class="line">    dp[<span class="number">0</span>][<span class="number">0</span>] = <span class="number">1</span>;</span><br><span class="line">    <span class="keyword">for</span> (<span class="type">int</span> i = <span class="number">1</span>; i &lt;= pn; ++i) &#123;</span><br><span class="line">      dp[<span class="number">0</span>][i] = std::<span 
class="built_in">all_of</span>(p.<span class="built_in">begin</span>(), p.<span class="built_in">begin</span>() + i,</span><br><span class="line">                             [](<span class="type">char</span> c) &#123; <span class="keyword">return</span> c == <span class="string">&#x27;*&#x27;</span>; &#125;);</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">for</span> (<span class="type">int</span> i = <span class="number">1</span>; i &lt;= sn; ++i) &#123;</span><br><span class="line">      dp[i][<span class="number">0</span>] = <span class="literal">false</span>;</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">for</span> (<span class="type">int</span> i = <span class="number">1</span>; i &lt;= sn; ++i) &#123;</span><br><span class="line">      <span class="keyword">for</span> (<span class="type">int</span> j = <span class="number">1</span>; j &lt;= pn; ++j) &#123;</span><br><span class="line"></span><br><span class="line">        <span class="type">bool</span> a = dp[i][j] =</span><br><span class="line">            (dp[i][j - <span class="number">1</span>] &amp;&amp; p[j - <span class="number">1</span>] == <span class="string">&#x27;*&#x27;</span>) ||</span><br><span class="line">            (dp[i - <span class="number">1</span>][j] &amp;&amp; p[j - <span class="number">1</span>] == <span class="string">&#x27;*&#x27;</span>) ||</span><br><span class="line">            (dp[i - <span class="number">1</span>][j - <span class="number">1</span>] &amp;&amp;</span><br><span class="line">             (s[i - <span class="number">1</span>] == p[j - <span class="number">1</span>] || p[j - <span class="number">1</span>] == <span class="string">&#x27;?&#x27;</span> || p[j - <span class="number">1</span>] == <span class="string">&#x27;*&#x27;</span>));</span><br><span class="line">      &#125;</span><br><span class="line">    &#125;</span><br><span class="line">    <span 
class="keyword">return</span> dp[sn][pn];</span><br><span class="line">  &#125;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure>
<p>With m and n the lengths of s and p, the time complexity is <span class="math inline">\(O(mn)\)</span> and the space complexity is <span class="math inline">\(O(mn)\)</span>.</p>
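<p>Since each row of the table only depends on the previous row and on the current row, the space can be reduced to two rows. The following is a sketch of that variant under the same recurrence; the name <code>isMatchRolling</code> is hypothetical and this is not part of the original solution. The third term of the recurrence is dropped for '*' because it is implied by the second one.</p>

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Same recurrence as the 2D table, but keeping only two rows.
bool isMatchRolling(const std::string& s, const std::string& p) {
  const std::size_t sn = s.size(), pn = p.size();
  std::vector<char> prev(pn + 1, 0), cur(pn + 1, 0);
  prev[0] = 1;  // empty string vs. empty pattern
  for (std::size_t j = 1; j <= pn; ++j)
    prev[j] = prev[j - 1] && p[j - 1] == '*';  // empty string vs. "***..."
  for (std::size_t i = 1; i <= sn; ++i) {
    cur[0] = 0;  // non-empty string vs. empty pattern
    for (std::size_t j = 1; j <= pn; ++j) {
      if (p[j - 1] == '*')
        // '*' matches nothing (cur[j - 1]) or one more character (prev[j]).
        cur[j] = cur[j - 1] || prev[j];
      else
        cur[j] = prev[j - 1] && (p[j - 1] == '?' || p[j - 1] == s[i - 1]);
    }
    std::swap(prev, cur);
  }
  return prev[pn] != 0;
}
```

<p>The time complexity stays the same, while the extra space drops to two rows of length n + 1.</p>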
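<h2 id="贪心leetcode-题解">Greedy (LeetCode solution)</h2>
<p>The bottleneck of the previous method is how it handles the asterisk <span class="math inline">\(*\)</span>: dynamic programming enumerates every case. Since the asterisk is a "universal" matching character, several consecutive asterisks are equivalent to a single one. What about several non-consecutive asterisks?</p>
<p>Take <span class="math inline">\(p = * abcd *\)</span> as an example. p matches every string that contains the substring abcd, so we only need to try every position of s as a starting position by brute force and check whether the substring there is abcd. This brute-force method takes O(mn) time, the same as dynamic programming, but needs no extra space.</p>
<p>What if p = *abcd*efgh*i*? Clearly p matches every string in which the substrings abcd, efgh and i appear in that order. For any string s, we first find the earliest occurrence of abcd by brute force, then, starting from the next position, the earliest occurrence of efgh, and finally i; this decides whether s matches p. Finding the earliest occurrence "greedily" is quite intuitive: if a substring occurs several times in s, picking the earliest occurrence leaves the most room for the later substrings to be found.</p>
<p>Therefore, if the pattern p has the form <span class="math display">\[* u_1 * u_2 * u_3 * \cdots * u_x *\]</span>, that is, strings (possibly empty) alternating with asterisks, with an asterisk at both ends, we can design the greedy brute-force matching algorithm below. In essence: if we can first find <span class="math inline">\(u_1\)</span> in s, and then <span class="math inline">\(u_2, u_3, \cdots, u_x\)</span> in order, then s matches the pattern p. The pseudocode is as follows:</p>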
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="comment">// sIndex and pIndex are the current positions in s and p</span></span><br><span class="line"><span class="comment">// At this point we are looking for some u_i in s</span></span><br><span class="line"><span class="comment">// Its starting positions in s and p are sRecord and pRecord</span></span><br><span class="line"></span><br><span class="line"><span class="comment">// sIndex and sRecord start at 0</span></span><br><span class="line"><span class="comment">// i.e. matching starts at the first character of s</span></span><br><span class="line">sIndex = sRecord = <span class="number">0</span></span><br><span class="line"></span><br><span class="line"><span class="comment">// pIndex and pRecord start at 1</span></span><br><span class="line"><span class="comment">// because the first character of p is an asterisk, so u_1 starts at 1</span></span><br><span class="line">pIndex = pRecord = <span class="number">1</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">while</span> sIndex &lt; s.length <span class="keyword">and</span> pIndex &lt; p.length <span class="keyword">do</span></span><br><span class="line">    <span class="keyword">if</span> p[pIndex] == <span class="string">&#x27;*&#x27;</span> then</span><br><span class="line">        <span class="comment">// An asterisk means u_i has been found; start looking for u_i+1</span></span><br><span class="line">        pIndex += <span class="number">1</span></span><br><span class="line">        <span class="comment">// Record the starting positions</span></span><br><span class="line">        sRecord = sIndex</span><br><span class="line">        pRecord = pIndex</span><br><span class="line">    <span class="keyword">else</span> <span class="keyword">if</span> <span class="built_in">match</span>(s[sIndex], p[pIndex]) then</span><br><span class="line">        <span class="comment">// The two characters match; move on to the next character of u_i</span></span><br><span class="line">        sIndex += <span class="number">1</span></span><br><span class="line">        pIndex += <span
class="number">1</span></span><br><span class="line">    <span class="keyword">else</span> <span class="keyword">if</span> sRecord + <span class="number">1</span> &lt; s.length then</span><br><span class="line">        <span class="comment">// The two characters do not match; look for u_i again</span></span><br><span class="line">        <span class="comment">// Try the next starting position in s</span></span><br><span class="line">        sRecord += <span class="number">1</span></span><br><span class="line">        sIndex = sRecord</span><br><span class="line">        pIndex = pRecord</span><br><span class="line">    <span class="keyword">else</span></span><br><span class="line">        <span class="comment">// No match and no further starting position: matching fails</span></span><br><span class="line">        <span class="keyword">return</span> False</span><br><span class="line">    end <span class="keyword">if</span></span><br><span class="line">end <span class="keyword">while</span></span><br><span class="line"></span><br><span class="line"><span class="comment">// The last character of p is an asterisk, so it is fine if s is not fully consumed</span></span><br><span class="line"><span class="comment">// But if p is not fully consumed, its remaining characters must all be asterisks</span></span><br><span class="line"><span class="keyword">return</span> <span class="built_in">all</span>(p[pIndex] ~ p[p.length - <span class="number">1</span>] == <span class="string">&#x27;*&#x27;</span>)</span><br></pre></td></tr></table></figure>
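<p>There are of course some special cases, for example when the asterisks do not appear at both ends of the pattern; they are omitted here.</p>
<p>Time complexity: <span class="math inline">\(O(mn)\)</span> asymptotically, and <span class="math inline">\(O(m\log{n})\)</span> on average. For a detailed analysis, see the paper <a href="https://arxiv.org/abs/1407.0950">On the Average-case Complexity of Pattern Matching with Wildcards</a>. Note that the analysis in the paper is for a single <span class="math inline">\(u_i\)</span>, that is, a pattern containing only lowercase letters and question marks; this problem corresponds to several such patterns in sequence. Since this goes beyond interview difficulty, it is not discussed further here.</p>
<p>Space complexity: O(1)</p>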
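<p>The pseudocode above assumes that p starts and ends with an asterisk. As a sketch of how the omitted special cases can be folded in, the following runnable version remembers the position of the most recent asterisk and falls back to it on a mismatch. The name <code>isMatchGreedy</code> is hypothetical, and this is one common way to write the greedy idea, not necessarily the official implementation.</p>

```cpp
#include <cstddef>
#include <string>

// Greedy wildcard matching that also handles patterns without a leading
// or trailing '*'.
bool isMatchGreedy(const std::string& s, const std::string& p) {
  const std::size_t npos = std::string::npos;
  std::size_t sIndex = 0, pIndex = 0;
  std::size_t sRecord = 0;     // where the current u_i attempt starts in s
  std::size_t pRecord = npos;  // position of the most recent '*' in p
  while (sIndex < s.size()) {
    if (pIndex < p.size() && p[pIndex] == '*') {
      // Found a '*': the previous u_i is settled, start on the next one.
      pRecord = pIndex;
      ++pIndex;
      sRecord = sIndex;
    } else if (pIndex < p.size() &&
               (p[pIndex] == '?' || p[pIndex] == s[sIndex])) {
      ++sIndex;
      ++pIndex;
    } else if (pRecord != npos) {
      // Mismatch: let the last '*' absorb one more character and retry.
      ++sRecord;
      sIndex = sRecord;
      pIndex = pRecord + 1;
    } else {
      return false;  // mismatch with no '*' to fall back on
    }
  }
  // s is consumed; whatever is left of p must be asterisks only.
  while (pIndex < p.size() && p[pIndex] == '*')
    ++pIndex;
  return pIndex == p.size();
}
```

<p>Only a handful of indices are kept, which matches the O(1) space bound stated above.</p>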
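<h2 id="此外leetcode-官方题解">In addition (LeetCode official solution)</h2>
<p>In the greedy method, for each sub-pattern <span class="math inline">\(u_i\)</span> that is delimited by asterisks and contains only lowercase letters and question marks, we match against the original string by brute force. This can be optimized further by using an Aho-Corasick (AC) automaton in place of the brute-force matching. The AC automaton is already a competition-level topic in itself, and this problem additionally requires storing some extra information in the automaton to complete the match, so this approach goes far beyond interview difficulty. Only the reference lecture notes are given here, <a href="http://www.cs.cmu.edu/~ab/CMU/Week%2010-%20Strings%20Search/print04.pdf">Set Matching and Aho-Corasick Algorithm</a>:</p>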
<ul>
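<li><p>The first 6 pages of the notes introduce the trie;</p></li>
<li><p>Pages 7-19 introduce the AC automaton, which is built on top of the trie;</p></li>
<li><p>Pages 20-23 introduce a wildcard matching algorithm based on the AC automaton, where the wildcard <span class="math inline">\(\phi\)</span> is the question mark in this problem.</p></li>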
</ul>
<p>Interested readers may want to work through it.</p>
]]></content>
      <categories>
        <category>LeetCode</category>
      </categories>
      <tags>
        <tag>Algorithm</tag>
        <tag>LeetCode</tag>
      </tags>
  </entry>
  <entry>
    <title>LeetCode 89 Gray Code</title>
    <url>/2023/09/15/LeetCode-89/</url>
    <content><![CDATA[<p>The problem is as follows:</p>
<p>An n-bit Gray code sequence is a sequence of <span class="math inline">\(2^n\)</span> integers where:</p>
<p>Every integer is in the range <span class="math inline">\([0, 2^n - 1]\)</span>, and:</p>
<ul>
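<li>The first integer is 0</li>
<li>Each integer appears at most once in the sequence</li>
<li>The binary representations of every pair of adjacent integers differ in exactly one bit, and</li>
<li>The binary representations of the first and last integers differ in exactly one bit</li>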
</ul>
<p>Given an integer n, return any valid n-bit Gray code sequence.</p>
<blockquote>
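<p>To be honest, I underestimated this at first and went straight for a brute-force search. That turned out not to work, so I had to refer to the official solution.</p>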
</blockquote>
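<h2 id="方法一">Method 1</h2>
<p>We can use induction, going from <span class="math inline">\(n-1\)</span> to <span class="math inline">\(n\)</span>. Let <span class="math inline">\(G_n\)</span> be an <span class="math inline">\(n\)</span>-bit Gray code sequence; we can derive <span class="math inline">\(G_n\)</span> from <span class="math inline">\(G_{n-1}\)</span>.</p>
<p>First reverse <span class="math inline">\(G_{n-1}\)</span> and set bit <span class="math inline">\(n-1\)</span> (counting from 0) of every element to 1, which gives <span class="math inline">\(G_{n-1}^T\)</span>. Concatenating <span class="math inline">\(G_{n-1}\)</span> and <span class="math inline">\(G_{n-1}^T\)</span> then gives the result we want.</p>
<p>Why? It is quite simple. Every number in <span class="math inline">\(G_{n-1}^T\)</span> differs from its counterpart in <span class="math inline">\(G_{n-1}\)</span> in <strong>exactly one</strong> bit, the new highest bit. <span class="math inline">\(G_{n-1}\)</span> is a permutation of <span class="math inline">\([0,2^{n-1}-1]\)</span>, and <span class="math inline">\(G_{n-1}^T\)</span> is a permutation of <span class="math inline">\([2^{n-1}, 2^{n}-1]\)</span>, so together they form a permutation of <span class="math inline">\([0,2^n-1]\)</span>. Adjacent elements still differ in exactly one bit: inside each half this is inherited from <span class="math inline">\(G_{n-1}\)</span>, and because of the reversal, the last element of <span class="math inline">\(G_{n-1}\)</span> and the first element of <span class="math inline">\(G_{n-1}^T\)</span> differ only in the highest bit. The same holds for the first and last elements of the whole sequence, 0 and <span class="math inline">\(2^{n-1}\)</span>.</p>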
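<p>The construction can be sketched in a few lines. This is my own minimal version (the name <code>grayCodeReflect</code> is hypothetical), starting from <span class="math inline">\(G_0 = [0]\)</span> and doubling the sequence once per bit:</p>

```cpp
#include <vector>

// Builds an n-bit Gray code by repeatedly appending the reversed sequence
// with a new highest bit set.
std::vector<int> grayCodeReflect(int n) {
  std::vector<int> seq{0};  // G_0
  for (int bit = 0; bit < n; ++bit) {
    // Walk the current sequence backwards (the reversal) and set `bit`.
    for (int i = static_cast<int>(seq.size()) - 1; i >= 0; --i)
      seq.push_back(seq[i] | (1 << bit));
  }
  return seq;
}
```

<p>For n = 2 this goes [0] → [0, 1] → [0, 1, 3, 2].</p>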
<h2 id="方法二">Method 2</h2>
<p>This method is purely a matter of spotting the pattern, as shown below: <img src="/images/leetcode89.png" alt="a" /></p>
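<p>The figure is not reproduced in code here. Assuming the pattern it shows is the standard binary-to-Gray conversion, where the i-th code is i XOR (i shifted right by one bit), a sketch would be the following; the function name <code>grayCodeFormula</code> is hypothetical, and the link to the figure is my assumption.</p>

```cpp
#include <vector>

// Assumption: the i-th Gray code is i ^ (i >> 1), the standard
// binary-to-Gray conversion.
std::vector<int> grayCodeFormula(int n) {
  std::vector<int> seq;
  const int count = 1 << n;
  seq.reserve(count);
  for (int i = 0; i < count; ++i)
    seq.push_back(i ^ (i >> 1));
  return seq;
}
```

<p>For small n this produces the same sequence as the reflection-based construction in Method 1.</p>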
]]></content>
      <categories>
        <category>LeetCode</category>
      </categories>
      <tags>
        <tag>Algorithm</tag>
        <tag>LeetCode</tag>
      </tags>
  </entry>
  <entry>
    <title>Linear Algebra 4.3 -- Least Squares Approximations</title>
    <url>/2023/08/28/Linear-Algebra-4-3-Least-Squares/</url>
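    <content><![CDATA[<p>The basic method of linear regression is <strong>least squares approximation</strong>; this post records the idea behind it.</p>
<p>In practice <span class="math inline">\(Ax=b\)</span> usually has no solution. One common situation: there are more equations than unknowns (<span class="math inline">\(m&gt;n\)</span>), and the n columns span only a small part of m-dimensional space. In other words, <span class="math inline">\(\boldsymbol{b}\)</span> typically lies outside <span class="math inline">\(C(A)\)</span>. We can then use what the previous section taught us about projections to deal with this.</p>
<p>The result first: just as with projection, the basic equation is still <span class="math display">\[A^TA\boldsymbol{\hat{x}}=A^T\boldsymbol{b}\]</span></p>
<p>Our basic goal is to make the error ( <span class="math inline">\(\boldsymbol{Ax-b}\)</span> ) as small as possible, and we can approach this problem from three different directions:</p>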
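<h4 id="几何方向">The geometric view</h4>
<p>For a vector <span class="math inline">\(\boldsymbol{b}\)</span>, to make its distance to a plane/subspace (all the <span class="math inline">\(A\boldsymbol{x}\)</span>) as small as possible, we have to take its projection <span class="math inline">\(\boldsymbol{p}\)</span>. The error <span class="math inline">\(\boldsymbol{e = b - p}\)</span> is then the smallest possible, and <span class="math inline">\(\boldsymbol{p}\)</span> is the best available approximation to <span class="math inline">\(\boldsymbol{b}\)</span> within the subspace.</p>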
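<h4 id="代数方向">Algebraic view</h4>
<p>Every vector <span class="math inline">\(\boldsymbol{b}\)</span> can be split into two parts: one is <span class="math inline">\(\boldsymbol{p}\)</span>, which lies in <span class="math inline">\(C(A)\)</span>, and the other is <span class="math inline">\(\boldsymbol{e}\)</span>, which is orthogonal to <span class="math inline">\(C(A)\)</span>.</p>
<p><span class="math inline">\(A\boldsymbol{x = b = p + e}\)</span> is unsolvable.</p>
<p><span class="math inline">\(A\boldsymbol{\hat{x} = p}\)</span> is solvable.</p>
<p>The solution of the latter leaves the smallest possible error <span class="math inline">\(\boldsymbol{e}\)</span>. Here is why it is the smallest:</p>
<p>Because <span class="math inline">\(Ax-p\)</span> lies in <span class="math inline">\(C(A)\)</span> and is therefore orthogonal to <span class="math inline">\(e\)</span>, we have the <strong>squared length for any <span class="math inline">\(x\)</span></strong>: <span class="math inline">\(||Ax - b||^2 = ||Ax-p||^2 + ||e||^2\)</span></p>
<p>Choosing <span class="math inline">\(x = \hat{x}\)</span> reduces <span class="math inline">\(||Ax-p||^2\)</span> to <span class="math inline">\(0\)</span>, so <span class="math inline">\(||Ax - b||^2\)</span> cannot be reduced any further.</p>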
<h4 id="微积分方向">Calculus view</h4>
<p>As an example, fitting the line <span class="math inline">\(C + Dt\)</span> to three sample points <span class="math inline">\((0,6), (1,0), (2,0)\)</span> gives:</p>
<p><span class="math display">\[
A=\left [ \begin{matrix}
1&amp; 0 \\
1&amp; 1 \\
1&amp; 2 \\
\end{matrix} \right ] ,
\boldsymbol{x} = \left [ \begin{matrix}
C \\
D \\
\end{matrix} \right ] ,
\boldsymbol{b} = \left [ \begin{matrix}
6 \\
0 \\
0 \\
\end{matrix} \right ]
\]</span></p>
<p>To minimize <span class="math inline">\(E = ||Ax-b||^2\)</span> we need: <span class="math display">\[\frac{\partial E}{\partial C} = 0, \quad \frac{\partial E}{\partial D} = 0\]</span></p>
<p>After simplification, these two conditions turn out to be exactly <span class="math inline">\(A^TA\hat{x}=A^Tb\)</span>.</p>
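<p>The three-point example can be checked by hand or with a few lines of code. The sketch below is only an illustration (the helper name <code>fit_line</code> is an assumption, not from the source): it forms the 2x2 normal equations <span class="math inline">\(A^TA\hat{x}=A^Tb\)</span> for a line fit and solves them with Cramer's rule in exact fractions.</p>

```python
from fractions import Fraction


def fit_line(ts, bs):
    """Least-squares fit of b ~ C + D*t via the normal equations A^T A x = A^T b."""
    m = len(ts)
    st = sum(ts)                                  # sum of t
    stt = sum(t * t for t in ts)                  # sum of t^2
    sb = sum(bs)                                  # sum of b
    stb = sum(t * b for t, b in zip(ts, bs))      # sum of t*b
    # A^T A = [[m, st], [st, stt]],  A^T b = [sb, stb]
    det = m * stt - st * st
    C = Fraction(sb * stt - st * stb, det)
    D = Fraction(m * stb - st * sb, det)
    return C, D


ts, bs = [0, 1, 2], [6, 0, 0]
C, D = fit_line(ts, bs)
print(C, D)  # 5 -3

# The error e = b - p is orthogonal to both columns of A.
e = [b - (C + D * t) for t, b in zip(ts, bs)]
print(e == [1, -2, 1], sum(e) == 0, sum(t * x for t, x in zip(ts, e)) == 0)
```

<p>The best line is <span class="math inline">\(5 - 3t\)</span>, with projection <span class="math inline">\(\boldsymbol{p} = (5, 2, -1)\)</span> and error <span class="math inline">\(\boldsymbol{e} = (1, -2, 1)\)</span>.</p>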
]]></content>
      <categories>
        <category>Linear Algebra</category>
      </categories>
      <tags>
        <tag>Math</tag>
        <tag>Linear Algebra</tag>
      </tags>
  </entry>
  <entry>
    <title>Linear Algebra 4.2 -- Projection</title>
    <url>/2023/08/28/Linear-Algebra-Projection/</url>
    <content><![CDATA[<p>The projection of <span class="math inline">\(\boldsymbol{b}\)</span> onto a subspace <span class="math inline">\(C(A)\)</span> is computed by:</p>
<p><span class="math display">\[
 \boldsymbol{p} =  P\boldsymbol{b}
\]</span></p>
<p>where <span class="math inline">\(P\)</span> is called the <strong>Projection Matrix</strong>. The reason a single matrix multiplication is enough follows from how the projection is computed.</p>
<p>Here are the reasoning steps:</p>
<p>Let's imagine projecting <span class="math inline">\(\boldsymbol{b}\)</span> onto a plane <span class="math inline">\(C(A)\)</span>, producing the projection <span class="math inline">\(\boldsymbol{p}\)</span>. Then <span class="math inline">\(\boldsymbol{p}\)</span> is in <span class="math inline">\(C(A)\)</span>, so it can be expressed as <span class="math inline">\(A\boldsymbol{\hat{x}}\)</span>. Our <strong>goal</strong> is to get <span class="math inline">\(\boldsymbol{\hat{x}}\)</span>.</p>
<p>Let <span class="math inline">\(\boldsymbol{e = b - A\hat{x}}\)</span> be the error vector. Only when <span class="math inline">\(\boldsymbol{e}\)</span> is <strong>perpendicular</strong> to the subspace can we say that <span class="math inline">\(\boldsymbol{p = b - e}\)</span> is the projection.</p>
<p>Since <span class="math inline">\(\boldsymbol{e}\)</span> is perpendicular to <span class="math inline">\(C(A)\)</span>, we can get: <span class="math display">\[A^T(\boldsymbol{b}-A\boldsymbol{\hat{x}}) = \boldsymbol{0}\]</span> or <span class="math display">\[A^TA\boldsymbol{\hat{x}} = A^T\boldsymbol{b}\]</span></p>
<p>The symmetric matrix <span class="math inline">\(A^TA\)</span> is invertible if and only if the columns <span class="math inline">\(\boldsymbol{a&#39;s}\)</span> of <span class="math inline">\(A\)</span> are <strong>independent</strong>. Then, <span class="math display">\[\boldsymbol{p} = A\boldsymbol{\hat{x}}=A(A^TA)^{-1}A^T\boldsymbol{b}\]</span></p>
<p>Here <span class="math inline">\(A(A^TA)^{-1}A^T\)</span> is a matrix, which we name the <strong>Projection Matrix</strong>. You might try to split <span class="math inline">\((A^TA)^{-1}\)</span> into <span class="math inline">\(A^{-1}(A^{T})^{-1}\)</span>; however, when <span class="math inline">\(A\)</span> is rectangular, it has no inverse.</p>
<p>When <span class="math inline">\(A\)</span> is square and invertible, <span class="math inline">\(N(A)\)</span> and <span class="math inline">\(N(A^T)\)</span> contain only the <strong>zero</strong> vector, so <span class="math inline">\(A^T\boldsymbol{e} = 0 \rightarrow \boldsymbol{e=0, b=p}\)</span>: <span class="math inline">\(\boldsymbol{b}\)</span> is its own projection, and <span class="math inline">\(P = \boldsymbol{I}\)</span>.</p>
<h4 id="why-the-symmetric-matrix-ata-is-invertible-if-and-only-if-boldsymbolas-in-a-are-independent">Why the symmetric matrix <span class="math inline">\(A^TA\)</span> is invertible if and only if <span class="math inline">\(\boldsymbol{a&#39;s}\)</span> in <span class="math inline">\(A\)</span> are <strong>independent</strong>?</h4>
<p><span class="math display">\[A^TAx = 0 \Longleftrightarrow Ax = 0\]</span></p>
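<p>The direction <span class="math inline">\(\Leftarrow\)</span> is immediate. For <span class="math inline">\(\Rightarrow\)</span>, multiply by <span class="math inline">\(x^T\)</span>: <span class="math inline">\(x^TA^TAx = ||Ax||^2 = 0\)</span>, so <span class="math inline">\(Ax = 0\)</span>. Thus <span class="math inline">\(A^TA\)</span> has the same nullspace as <span class="math inline">\(A\)</span>. The columns of <span class="math inline">\(A\)</span> are independent (its nullspace contains only the zero vector) <strong>if and only if</strong> <span class="math inline">\(A^TA\)</span> is invertible.</p>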
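<p>As a small sanity check, the formula <span class="math inline">\(P = A(A^TA)^{-1}A^T\)</span> can be evaluated directly. The sketch below is an illustration under stated assumptions: the matrix <span class="math inline">\(A\)</span> and vector <span class="math inline">\(\boldsymbol{b}\)</span> are example values chosen here, the helper names are made up, and <code>inverse_2x2</code> only handles the case where <span class="math inline">\(A\)</span> has two columns.</p>

```python
from fractions import Fraction


def transpose(M):
    return [list(row) for row in zip(*M)]


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def inverse_2x2(M):
    """Inverse of a 2x2 matrix in exact arithmetic (assumes a non-zero determinant)."""
    (a, b), (c, d) = M
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]


def projection_matrix(A):
    """P = A (A^T A)^{-1} A^T for a matrix A with two independent columns."""
    At = transpose(A)
    return matmul(matmul(A, inverse_2x2(matmul(At, A))), At)


A = [[1, 0], [1, 1], [1, 2]]      # example matrix with independent columns
b = [[6], [0], [0]]               # example vector, written as a column
P = projection_matrix(A)
p = matmul(P, b)

print(p == [[5], [2], [-1]])      # projection of b onto C(A)
print(matmul(P, P) == P)          # projecting twice changes nothing
print(transpose(P) == P)          # P is symmetric
```

<p>The checks <span class="math inline">\(P^2 = P\)</span> and <span class="math inline">\(P^T = P\)</span> are the two standard properties of a projection matrix.</p>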
]]></content>
      <categories>
        <category>Linear Algebra</category>
      </categories>
      <tags>
        <tag>Math</tag>
        <tag>Linear Algebra</tag>
      </tags>
  </entry>
  <entry>
    <title>Common Mathematica Commands for Calculus</title>
    <url>/2023/02/06/Mathematica%E5%BE%AE%E7%A7%AF%E5%88%86%E5%B8%B8%E7%94%A8%E5%91%BD%E4%BB%A4/</url>
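    <content><![CDATA[<p>As a freshman I was swamped by math homework every day. To deal with it, clever me thought of using the math tool Mathematica.</p>
<p>So I first obtained mma (Mathematica) through my university (南大) email account, then installed mma and its dependencies on Ubuntu.</p>
<p>Below are a few templates for computing limits, derivatives, and integrals.</p>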
<span id="more"></span>
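<h2 id="极限limit">Limit</h2>
<p>We want the limit of the following expression:</p>
<p><span class="math inline">\(\text{Assume} \quad f&#39;(a)=\sqrt{2}, \quad f&#39;&#39;(a)=2\)</span></p>
<p><span class="math inline">\(\lim_{x \to a} \left( \frac{1}{f(x)-f(a)} - \frac{1}{(x-a)f&#39;(x)} \right)\)</span></p>
<p>In mma we can enter the following code:</p>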
<figure class="highlight mathematica"><table><tr><td class="code"><pre><span class="line"><span class="built_in">Limit</span><span class="punctuation">[</span><span class="number">1</span><span class="operator">/</span><span class="punctuation">(</span><span class="variable">f</span><span class="punctuation">[</span><span class="variable">x</span><span class="punctuation">]</span> <span class="operator">-</span> <span class="variable">f</span><span class="punctuation">[</span><span class="variable">a</span><span class="punctuation">]</span><span class="punctuation">)</span> <span class="operator">-</span> <span class="number">1</span><span class="operator">/</span><span class="punctuation">(</span><span class="punctuation">(</span><span class="variable">x</span> <span class="operator">-</span> <span class="variable">a</span><span class="punctuation">)</span> <span class="variable">f</span><span class="operator">&#x27;</span><span class="punctuation">[</span><span class="variable">x</span><span class="punctuation">]</span><span class="punctuation">)</span><span class="operator">,</span> <span class="variable">x</span> <span class="operator">-&gt;</span> <span class="variable">a</span><span class="operator">,</span></span><br><span class="line"> <span class="built_in">Assumptions</span> <span class="operator">-&gt;</span> <span class="punctuation">&#123;</span><span class="built_in">D</span><span class="punctuation">[</span><span class="variable">f</span><span class="punctuation">[</span><span class="variable">a</span><span class="punctuation">]</span><span class="operator">,</span> <span class="variable">a</span><span class="punctuation">]</span> <span class="operator">==</span> <span class="built_in">Sqrt</span><span class="punctuation">[</span><span class="number">2</span><span class="punctuation">]</span><span class="operator">,</span> <span class="built_in">D</span><span class="punctuation">[</span><span class="built_in">D</span><span class="punctuation">[</span><span class="variable">f</span><span class="punctuation">[</span><span class="variable">a</span><span class="punctuation">]</span><span class="operator">,</span> <span class="variable">a</span><span class="punctuation">]</span><span class="operator">,</span> <span class="variable">a</span><span class="punctuation">]</span> <span class="operator">==</span> <span class="number">2</span><span class="punctuation">&#125;</span><span class="punctuation">]</span></span><br></pre></td></tr></table></figure>
<hr />
<h2 id="微分导数derivative">Differential / Derivative</h2>
<p>We want the derivative of the following function:</p>
<p><span class="math inline">\(f(x)=\sin{x}^{\sin{x}}+\ln{\int_0^x{\sqrt{\tan{x}}dx}}\)</span></p>
<p>In mma we can enter the following code:</p>
<figure class="highlight mathematica"><table><tr><td class="code"><pre><span class="line"><span class="variable">f</span><span class="punctuation">[</span><span class="type">x_</span><span class="punctuation">]</span><span class="operator">=...</span></span><br><span class="line"><span class="built_in">D</span><span class="punctuation">[</span><span class="variable">f</span><span class="punctuation">[</span><span class="variable">x</span><span class="punctuation">]</span><span class="operator">,</span><span class="variable">x</span><span class="punctuation">]</span></span><br></pre></td></tr></table></figure>
<hr />
<h2 id="积分定积分integration">Integral / Definite Integral (Integration)</h2>
<p>We want to evaluate the following integrals:</p>
<p><span class="math inline">\(\int{\frac{1}{\cos^2{x}}dx}\)</span></p>
<p><span class="math inline">\(\int_0^{\pi/2}{\frac{1}{\cos^2{x}}dx}\)</span></p>
<p>In mma we can enter the following code for each, respectively:</p>
<figure class="highlight mathematica"><table><tr><td class="code"><pre><span class="line"><span class="built_in">Integrate</span><span class="punctuation">[</span><span class="number">1</span><span class="operator">/</span><span class="punctuation">(</span><span class="built_in">Cos</span><span class="punctuation">[</span><span class="variable">x</span><span class="punctuation">]</span><span class="operator">^</span><span class="number">2</span><span class="punctuation">)</span><span class="operator">,</span><span class="variable">x</span><span class="punctuation">]</span></span><br><span class="line"><span class="built_in">Integrate</span><span class="punctuation">[</span><span class="number">1</span><span class="operator">/</span><span class="punctuation">(</span><span class="built_in">Cos</span><span class="punctuation">[</span><span class="variable">x</span><span class="punctuation">]</span><span class="operator">^</span><span class="number">2</span><span class="punctuation">)</span><span class="operator">,</span><span class="punctuation">&#123;</span><span class="variable">x</span><span class="operator">,</span><span class="number">0</span><span class="operator">,</span><span class="built_in">Pi</span><span class="operator">/</span><span class="number">2</span><span class="punctuation">&#125;</span><span class="punctuation">]</span></span><br></pre></td></tr></table></figure>
<hr />
]]></content>
      <categories>
        <category>Math</category>
      </categories>
      <tags>
        <tag>Math</tag>
        <tag>Mathematica</tag>
      </tags>
  </entry>
  <entry>
    <title>Common Neovim Configuration (1)</title>
    <url>/2023/02/06/Neovim%E5%B8%B8%E7%94%A8%E9%85%8D%E7%BD%AE-1/</url>
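    <content><![CDATA[<p>Chinese-language material on the Neovim API is very scarce online, so I have collected some of it here.</p>
<p>If you read English comfortably, you can run <code>:h lua-guide</code> directly to get the English documentation for Neovim's Lua API.</p>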
<span id="more"></span>
<h2 id="neovims-lua-api">Neovim's Lua API</h2>
<ul>
<li><p><code>vim.keymap.set(mode , from_keys, to_expr, opts)</code></p>
<p><em>Purpose: create a key mapping</em></p>
<p><strong>mode</strong>: type <strong>string</strong>; the mode the mapping applies to: "n" for normal, "i" for insert, "v" for visual</p>
<p><strong>from_keys</strong>: type <strong>string</strong>; the key sequence being mapped</p>
<p><strong>to_expr</strong>: type <strong>any</strong>; what the mapping produces: a key sequence, a Vim expression, or a Lua function</p>
<p><strong>opts</strong>: type <strong>table</strong>; options for the mapping</p></li>
</ul>
<hr />
<ul>
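<li><p><code>vim.api.nvim_create_user_command(commandName, expr, opts)</code></p>
<p><em>Purpose: create a user command</em></p>
<p><strong>commandName</strong>: type <strong>string</strong>; the command name (must start with an uppercase letter)</p>
<p><strong>expr</strong>: type <strong>any</strong>; what the command runs: a Vim command string or a Lua function</p>
<p><strong>opts</strong>: type <strong>table</strong>; command attributes such as <code>nargs</code> or <code>desc</code></p></li>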
</ul>
<hr />
<ul>
<li><p><code>vim.api.nvim_create_autocmd(event, opts)</code></p>
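<p><em>Purpose: create an autocommand</em></p>
<p><strong>event</strong>: type <strong>string</strong> (or a table of strings); the event(s) that trigger the autocommand, e.g. <code>"BufEnter"</code></p>
<p><strong>opts</strong>: type <strong>table</strong>; related options:</p>
<ul>
<li><p><strong>pattern</strong>: the file-name pattern(s) to match</p></li>
<li><p><strong>callback</strong>: the function run when the event fires: a Lua function or the name of a Vimscript function</p></li>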
</ul></li>
</ul>
<hr />
]]></content>
      <categories>
        <category>Tools</category>
      </categories>
      <tags>
        <tag>Vim</tag>
      </tags>
  </entry>
  <entry>
    <title>Common Neovim Configuration (2)</title>
    <url>/2023/02/08/Neovim%E5%B8%B8%E7%94%A8%E9%85%8D%E7%BD%AE-2/</url>
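    <content><![CDATA[<h3 id="使用-lua-配置-neovim并设置自己的-workflow">Configuring Neovim with Lua and setting up your own workflow</h3>
<h4 id="结合命令行工具">Integrating command-line tools</h4>
<p>I often need git while coding, but I don't want to keep typing commands in a shell.</p>
<p>So I used ToggleTerm to embed the command-line tool lazygit into Neovim:</p>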
<figure class="highlight lua"><table><tr><td class="code"><pre><span class="line"><span class="keyword">local</span> Terminal = <span class="built_in">require</span>(<span class="string">&#x27;toggleterm.terminal&#x27;</span>).Terminal</span><br><span class="line"></span><br><span class="line"><span class="keyword">local</span> lazygit = Terminal:new(&#123; cmd = <span class="string">&quot;lazygit&quot;</span>, direction = <span class="string">&#x27;float&#x27;</span>, hidden = <span class="literal">true</span> &#125;)</span><br><span class="line"><span class="keyword">local</span> top = Terminal:new(&#123; cmd = <span class="string">&quot;top&quot;</span>, direction = <span class="string">&#x27;float&#x27;</span>, hidden = <span class="literal">true</span> &#125;)</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="comment">-- lazygit</span></span><br><span class="line">vim.api.nvim_create_user_command(<span class="string">&quot;LazyGit&quot;</span>,</span><br><span class="line">    <span class="function"><span class="keyword">function</span><span class="params">()</span></span></span><br><span class="line">        lazygit:toggle()</span><br><span class="line">    <span class="keyword">end</span>,</span><br><span class="line">    &#123; nargs = <span class="number">0</span> &#125;)</span><br><span class="line"></span><br><span class="line"><span class="comment">-- top</span></span><br><span class="line">vim.api.nvim_create_user_command(<span class="string">&quot;Top&quot;</span>,</span><br><span class="line">    <span class="function"><span class="keyword">function</span><span class="params">()</span></span></span><br><span class="line">        top:toggle()</span><br><span class="line">    <span class="keyword">end</span>,</span><br><span class="line">    &#123; nargs = <span class="number">0</span> &#125;)</span><br></pre></td></tr></table></figure>
<span id="more"></span>
<p>In the same way, the command-line tool trans can be used for translation, with the result displayed through Neovim's API.</p>
<figure class="highlight lua"><table><tr><td class="code"><pre><span class="line"><span class="keyword">local</span> <span class="function"><span class="keyword">function</span> <span class="title">translate_terminal</span><span class="params">()</span></span></span><br><span class="line">    <span class="keyword">local</span> mode = vim.api.nvim_get_mode()[<span class="string">&#x27;mode&#x27;</span>]</span><br><span class="line">    <span class="keyword">local</span> to_translate</span><br><span class="line">    <span class="keyword">if</span> mode == <span class="string">&#x27;n&#x27;</span> <span class="keyword">then</span></span><br><span class="line">        to_translate = vim.fn.expand(<span class="string">&#x27;&lt;cword&gt;&#x27;</span>)</span><br><span class="line">    <span class="keyword">elseif</span> mode == <span class="string">&#x27;v&#x27;</span> <span class="keyword">then</span></span><br><span class="line">        to_translate = <span class="built_in">require</span>(<span class="string">&#x27;basic&#x27;</span>).get_visual_selection()</span><br><span class="line">    <span class="keyword">end</span></span><br><span class="line">    <span class="keyword">local</span> command = <span class="built_in">string</span>.<span class="built_in">format</span>(<span class="string">&#x27;trans &quot;%s&quot;&#x27;</span>, to_translate)</span><br><span class="line"></span><br><span class="line">    async.run(<span class="function"><span class="keyword">function</span><span class="params">()</span></span></span><br><span class="line">        <span class="keyword">local</span> translated_content = vim.fn.systemlist(command)</span><br><span class="line">        utils.show_term_content(translated_content)</span><br><span class="line">    <span class="keyword">end</span>)</span><br><span class="line"><span class="keyword">end</span></span><br><span class="line"></span><br></pre></td></tr></table></figure>
<h4 id="设置-layout">Layout settings</h4>
<figure class="highlight lua"><table><tr><td class="code"><pre><span class="line">vim.api.nvim_create_user_command(</span><br><span class="line">    <span class="string">&quot;BufferDelete&quot;</span>,</span><br><span class="line">    <span class="function"><span class="keyword">function</span><span class="params">()</span></span></span><br><span class="line">        <span class="comment">---@diagnostic disable-next-line: missing-parameter</span></span><br><span class="line">        <span class="keyword">local</span> file_exists = vim.fn.filereadable(vim.fn.expand(<span class="string">&quot;%:p&quot;</span>))</span><br><span class="line">        <span class="keyword">local</span> modified = vim.api.nvim_buf_get_option(<span class="number">0</span>, <span class="string">&quot;modified&quot;</span>)</span><br><span class="line"></span><br><span class="line">        <span class="keyword">if</span> file_exists == <span class="number">0</span> <span class="keyword">and</span> modified <span class="keyword">then</span></span><br><span class="line">            <span class="keyword">local</span> user_choice = vim.fn.<span class="built_in">input</span>(</span><br><span class="line">                    <span class="string">&quot;The file is not saved, whether to force delete? 
Press enter or input [y/n]:&quot;</span>)</span><br><span class="line">            <span class="keyword">if</span> user_choice == <span class="string">&quot;y&quot;</span> <span class="keyword">or</span> <span class="built_in">string</span>.<span class="built_in">len</span>(user_choice) == <span class="number">0</span> <span class="keyword">then</span></span><br><span class="line">                vim.cmd(<span class="string">&quot;bd!&quot;</span>)</span><br><span class="line">            <span class="keyword">end</span></span><br><span class="line">            <span class="keyword">return</span></span><br><span class="line">        <span class="keyword">end</span></span><br><span class="line"></span><br><span class="line">        <span class="keyword">local</span> force = <span class="keyword">not</span> vim.bo.buflisted <span class="keyword">or</span> vim.bo.buftype == <span class="string">&quot;nofile&quot;</span></span><br><span class="line"></span><br><span class="line">        vim.cmd(force <span class="keyword">and</span> <span class="string">&quot;bd!&quot;</span> <span class="keyword">or</span> <span class="built_in">string</span>.<span class="built_in">format</span>(<span class="string">&quot;bp | bd! %s&quot;</span>, vim.api.nvim_get_current_buf()))</span><br><span class="line">    <span class="keyword">end</span>,</span><br><span class="line">    &#123; desc = <span class="string">&quot;Delete the current Buffer while maintaining the window layout&quot;</span> &#125;)</span><br></pre></td></tr></table></figure>
<h4 id="在-neovim-中编辑-hexo-blog">Editing a Hexo blog in Neovim</h4>
<figure class="highlight lua"><table><tr><td class="code"><pre><span class="line"><span class="keyword">local</span> blog_path = <span class="string">&quot;~/Documents/Hexo-Blog&quot;</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">local</span> <span class="function"><span class="keyword">function</span> <span class="title">blogNew</span><span class="params">(input)</span></span></span><br><span class="line">    vim.api.nvim_set_current_dir(blog_path)</span><br><span class="line">    <span class="built_in">require</span>(<span class="string">&#x27;nvim-tree.api&#x27;</span>).tree.change_root(blog_path)</span><br><span class="line">    <span class="keyword">local</span> <span class="built_in">output</span> = vim.fn.system(<span class="string">&quot;hexo n &quot;</span> .. <span class="string">&#x27;\&quot;&#x27;</span> .. <span class="built_in">input</span>.args .. <span class="string">&#x27;\&quot;&#x27;</span>)</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (vim.v.shell_error == <span class="number">0</span>) <span class="keyword">then</span></span><br><span class="line">        <span class="keyword">local</span> <span class="built_in">path</span> = <span class="built_in">string</span>.<span class="built_in">sub</span>(<span class="built_in">output</span>, <span class="built_in">string</span>.<span class="built_in">find</span>(<span class="built_in">output</span>, <span class="string">&#x27;~&#x27;</span>, <span class="number">1</span>, <span class="literal">true</span>), <span class="number">-1</span>)</span><br><span class="line">        vim.cmd(<span class="string">&quot;:e &quot;</span> .. <span class="built_in">path</span>)</span><br><span class="line">    <span class="keyword">else</span></span><br><span class="line">        vim.notify(<span class="string">&quot;Failed creating new blog post&quot;</span> .. 
<span class="built_in">input</span>.args, <span class="string">&quot;error&quot;</span>)</span><br><span class="line">    <span class="keyword">end</span></span><br><span class="line"><span class="keyword">end</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">local</span> <span class="function"><span class="keyword">function</span> <span class="title">blogNewDraft</span><span class="params">(input)</span></span></span><br><span class="line">    vim.api.nvim_set_current_dir(blog_path)</span><br><span class="line">    <span class="built_in">require</span>(<span class="string">&#x27;nvim-tree.api&#x27;</span>).tree.change_root(blog_path)</span><br><span class="line">    <span class="keyword">local</span> <span class="built_in">output</span> = vim.fn.system(<span class="string">&quot;hexo new draft &quot;</span> .. <span class="string">&#x27;\&quot;&#x27;</span> .. <span class="built_in">input</span>.args .. <span class="string">&#x27;\&quot;&#x27;</span>)</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (vim.v.shell_error == <span class="number">0</span>) <span class="keyword">then</span></span><br><span class="line">        <span class="keyword">local</span> <span class="built_in">path</span> = <span class="built_in">string</span>.<span class="built_in">sub</span>(<span class="built_in">output</span>, <span class="built_in">string</span>.<span class="built_in">find</span>(<span class="built_in">output</span>, <span class="string">&#x27;~&#x27;</span>, <span class="number">1</span>, <span class="literal">true</span>), <span class="number">-1</span>)</span><br><span class="line">        vim.cmd(<span class="string">&quot;:e &quot;</span> .. <span class="built_in">path</span>)</span><br><span class="line">    <span class="keyword">else</span></span><br><span class="line">        vim.notify(<span class="string">&quot;Failed creating new blog post&quot;</span> .. 
<span class="built_in">input</span>.args, <span class="string">&quot;error&quot;</span>)</span><br><span class="line">    <span class="keyword">end</span></span><br><span class="line"><span class="keyword">end</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">local</span> <span class="function"><span class="keyword">function</span> <span class="title">blogGenerateAndDeploy</span><span class="params">()</span></span></span><br><span class="line">    vim.api.nvim_set_current_dir(blog_path)</span><br><span class="line">    <span class="keyword">if</span> (<span class="built_in">os</span>.<span class="built_in">execute</span>(<span class="string">&quot;hexo g &amp;&amp; hexo s&quot;</span>)) <span class="keyword">then</span></span><br><span class="line">        vim.notify(<span class="string">&quot;Deploy the blog successfully&quot;</span>, <span class="string">&quot;info&quot;</span>)</span><br><span class="line">    <span class="keyword">else</span></span><br><span class="line">        vim.notify(<span class="string">&quot;Deployment of blog failed&quot;</span>, <span class="string">&quot;error&quot;</span>)</span><br><span class="line">    <span class="keyword">end</span></span><br><span class="line"><span class="keyword">end</span></span><br><span class="line"></span><br></pre></td></tr></table></figure>
<h4 id="取消下一行注释">Disabling comment continuation on the next line</h4>
<figure class="highlight lua"><table><tr><td class="code"><pre><span class="line"><span class="comment">-- avoid comment when enter the new line</span></span><br><span class="line">vim.api.nvim_create_autocmd(&#123; <span class="string">&quot;BufEnter&quot;</span> &#125;, &#123;</span><br><span class="line">    pattern = <span class="string">&quot;*&quot;</span>,</span><br><span class="line">    callback = <span class="function"><span class="keyword">function</span><span class="params">()</span></span></span><br><span class="line">        vim.opt.formatoptions = vim.opt.formatoptions - &#123; <span class="string">&quot;c&quot;</span>, <span class="string">&quot;r&quot;</span>, <span class="string">&quot;o&quot;</span> &#125;</span><br><span class="line">    <span class="keyword">end</span>,</span><br><span class="line">&#125;)</span><br></pre></td></tr></table></figure>
]]></content>
      <categories>
        <category>Tools</category>
      </categories>
      <tags>
        <tag>Vim</tag>
      </tags>
  </entry>
  <entry>
    <title>Common Neovim Configuration (3) (clangd &amp; CMake)</title>
    <url>/2023/03/01/Neovim%E5%B8%B8%E7%94%A8%E9%85%8D%E7%BD%AE-3-Clangd---CMake/</url>
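    <content><![CDATA[<p>When developing C/C++ in Neovim, we often use <strong>clangd</strong> as the <strong>LSP</strong> server to provide language services such as syntax highlighting and refactoring.</p>
<p>clangd's ability to infer macros automatically is also very useful, and pairing it with <strong>CMake</strong> works even better (for example, it picks up macros defined by CMake and automatically finds the headers configured in CMake).</p>
<p>Below is a brief way to integrate clangd with cmake.</p>
<p>In general, <strong>clangd</strong> automatically recognizes the <strong>compile_commands.json</strong> generated by <strong>CMake</strong> and uses it to locate headers and analyze macros.</p>
<p>However, compile_commands.json is not generated by default, so we can use the following command to have it generated automatically:</p>
<figure class="highlight bash"><table><tr><td class="code"><pre><span class="line">cmake . -DCMAKE_EXPORT_COMPILE_COMMANDS=ON</span><br></pre></td></tr></table></figure>
<p>Here <em><code>-DCMAKE_EXPORT_COMPILE_COMMANDS=ON</code></em> is the flag that exports the compile commands. (A <code>-G</code> option is only needed to pick a generator, and it must then be followed by the generator name.)</p>
<p>So I usually create a build.sh in the project directory to build the project:</p>
<figure class="highlight bash"><table><tr><td class="code"><pre><span class="line">cmake . -DCMAKE_EXPORT_COMPILE_COMMANDS=ON</span><br><span class="line">make</span><br></pre></td></tr></table></figure>
<p>To build, just run build.sh.</p>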
]]></content>
      <categories>
        <category>Tools</category>
      </categories>
      <tags>
        <tag>Vim</tag>
      </tags>
  </entry>
  <entry>
    <title>OSPF (Open Shortest Path First) &amp; BGP (Border Gateway Protocol)</title>
    <url>/2023/08/19/OSPF_BGP/</url>
    <content><![CDATA[<h2 id="making-routing-scalable">Making routing scalable</h2>
<p>Here are some concepts to note:</p>
<p>scale: billions of destinations:</p>
<ul>
<li>can't store all destinations in routing tables.</li>
<li>routing table exchange would swamp links.</li>
</ul>
<p>administrative autonomy:</p>
<ul>
<li>Internet: a network of networks</li>
<li>each network admin may want to control routing in its own network</li>
</ul>
<h2 id="approach-to-scalable-routing">Approach to scalable routing</h2>
<p>We aggregate routers into regions known as "autonomous systems" (ASes, a.k.a. "domains").</p>
<p><strong>Intra-AS (intra-domain)</strong> routing is routing among routers within the same AS (network).</p>
<ul>
<li>all routers in AS must run same intra-domain protocol.</li>
<li>routers in different AS can run different intra-domain protocols.</li>
<li>gateway router: at edge of its own AS, has link(s) to router(s) in other AS'es</li>
</ul>
<p><strong>Inter-AS (inter-domain)</strong> routing is routing among ASes; it is performed by the gateway routers.</p>
<p>Both determine forwarding-table entries in routers: the former for destinations <em>within</em> the AS, the latter for <em>external</em> destinations. The most common intra-AS routing protocols are:</p>
<ul>
<li>RIP (Routing Information Protocol), which is no longer widely used.</li>
<li>OSPF (Open Shortest Path First), which includes classic <strong>link-state</strong> routing.</li>
<li>EIGRP (Enhanced Interior Gateway Routing Protocol), which is <strong>DV</strong> (distance-vector) based.</li>
</ul>
<h2 id="ospf">OSPF</h2>
<p>OSPF is an intra-domain routing protocol.</p>
<ul>
<li><p>open: publicly available</p></li>
<li><p>classic link-state:</p>
<ul>
<li>each router floods OSPF link-state advertisements (directly over IP) to all other routers in entire AS.</li>
<li>multiple link costs metrics possible: bandwidth, delay.</li>
<li>global (has full topology)</li>
</ul>
<p>There is a two-level hierarchy: local <em>area</em> and <em>backbone</em>.</p>
<ul>
<li><strong>Local routers</strong> only know/compute the detailed topology within their own area, and forward packets bound for destinations outside it via <strong>area border routers</strong>.</li>
<li>And <strong>area border routers</strong> are responsible for <em>summarizing</em> distances to destinations in own area, and advertising in backbone.</li>
</ul></li>
</ul>
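<p>Once link-state advertisements have been flooded, every router in an area holds the same topology and can run a shortest-path computation locally. The sketch below illustrates that step with Dijkstra's algorithm; the router names and link costs are made-up example values, and real OSPF implementations differ in many details.</p>

```python
import heapq


def shortest_paths(graph, source):
    """Dijkstra's algorithm: least total link cost from source to every reachable router."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry: a cheaper path to u was already found
        for v, cost in graph[u].items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Hypothetical area topology: router -> {neighbor: link cost}
topology = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 2},
    "C": {"A": 5, "B": 2},
}
print(shortest_paths(topology, "A"))  # {'A': 0, 'B': 1, 'C': 3}
```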
<h2 id="bgp">BGP</h2>
<p>BGP is an inter-domain routing protocol ("glue that holds the Internet together").</p>
<p>BGP provides each AS a means to:</p>
<ul>
<li>obtain destination network reachability information from neighboring ASes (<strong>eBGP</strong>).</li>
<li>determine routes to other networks based on reachability information and policy.</li>
<li>propagate reachability information to all AS-internal routers (<strong>iBGP</strong>).</li>
<li>advertise destination reachability information.</li>
</ul>
<h3 id="bgp-basics">BGP Basics</h3>
<p>BGP Session: two BGP routers exchange BGP messages over semi-permanent TCP connection:</p>
<ul>
<li>advertising paths to different destination network prefixes</li>
<li>BGP is a "path vector" protocol</li>
</ul>
<p>BGP protocol messages [RFC 4271]:</p>
<ul>
<li>Open: opens <strong>TCP</strong> connection to peer and authenticates sending BGP peer</li>
<li>Update: advertises new path (or withdraws old)</li>
<li>Keepalive: keeps connection alive in absence of UPDATES; also ACKS OPEN request</li>
<li>Notification: reports errors in previous msg; also used to close connection</li>
</ul>
<h3 id="bgp-path-advertisement">BGP: path advertisement</h3>
<p>BGP advertised path: prefix + attributes</p>
<ul>
<li>path prefix: destination being advertised</li>
<li>two important attributes:
<ul>
<li>AS-PATH: list of ASes through which prefix advertisement has passed</li>
<li>NEXT-HOP: indicates specific internal-AS router to next-hop AS</li>
</ul></li>
</ul>
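<p>The role of the two attributes can be sketched in a few lines. This is a simplified illustration, not a BGP implementation: the <code>Route</code> type, the function names, and the AS and router labels are all hypothetical. It shows an AS prepending itself to AS-PATH when it re-advertises a route, and a receiver rejecting any path that already contains its own AS (loop avoidance).</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str       # destination prefix being advertised
    as_path: tuple    # ASes the advertisement has passed through, nearest first
    next_hop: str     # router to send traffic to in order to enter the next AS


def accept(route: Route, own_as: str) -> bool:
    """A receiver discards routes whose AS-PATH already contains its own AS."""
    return own_as not in route.as_path


def readvertise(route: Route, own_as: str, own_gateway: str) -> Route:
    """Prepend our AS to AS-PATH and set NEXT-HOP to our advertising gateway."""
    return Route(route.prefix, (own_as,) + route.as_path, own_gateway)


# Hypothetical example: AS3 originates prefix X, AS2 passes it on to AS1.
from_as3 = Route("X", ("AS3",), "3a")
from_as2 = readvertise(from_as3, "AS2", "2c")
print(from_as2.as_path)            # ('AS2', 'AS3')
print(accept(from_as2, "AS1"))     # True: AS1 is not on the path
print(accept(from_as2, "AS3"))     # False: the route would loop back through AS3
```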
<h3 id="bgp-policy">BGP policy</h3>
<p>ISP only wants to route traffic to/from its customer networks (does not want to carry transit traffic between other ISPs – a typical “real world” policy)</p>
<h3 id="bgp-populating-forwading-tables">BGP: populating forwarding tables</h3>
<p>Routes simply propagate from the boundary routers to the internal routers, and each router chooses the local gateway with the least intra-domain cost. Details are omitted here.</p>
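<p>The "least intra-domain cost" choice is often called hot-potato routing. A minimal sketch, assuming each router already knows its intra-domain cost to every gateway (the gateway names and costs below are made-up values):</p>

```python
def pick_gateway(candidates, intra_cost):
    """Among gateways that learned a route to the prefix, pick the cheapest one to reach."""
    return min(candidates, key=lambda gateway: intra_cost[gateway])


# Hypothetical: both gateways of our AS learned a route to prefix X.
intra_cost = {"2a": 201, "2c": 263}   # intra-domain cost from this router to each gateway
print(pick_gateway(["2a", "2c"], intra_cost))  # 2a
```

<p>Note that this rule ignores how long the rest of the path is outside the AS; it only minimizes the cost of getting the packet out of the local network.</p>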
<h2 id="benefits">Benefits</h2>
<p>Intra/Inter-AS routing scale the network, creating hierarchical routing, reducing forwarding table size. And seperate them can make:</p>
<ul>
<li>intra-AS routing can focus on performance.</li>
<li>in inter-AS routing, policy dominates over performance.</li>
</ul>
]]></content>
      <categories>
        <category>Network</category>
      </categories>
      <tags>
        <tag>Network</tag>
        <tag>Algorithm</tag>
      </tags>
  </entry>
  <entry>
    <title>OS Boot</title>
    <url>/2023/08/07/OS-Boot/</url>
    <content><![CDATA[<p>The 1st period: the CPU executes instructions from some start address (stored in Flash ROM)</p>
<ol type="1">
<li>BIOS: Find a storage device and load its first sector.</li>
<li>Bootloader: Load the OS kernel from disk into a location in memory and jump into it.</li>
<li>OS Boot: Initialize services, drivers, etc.</li>
</ol>
]]></content>
      <categories>
        <category>Architecture</category>
      </categories>
      <tags>
        <tag>Architecture</tag>
        <tag>CS61C</tag>
      </tags>
  </entry>
  <entry>
    <title>(Paper Reading) SCC-Based Value Numbering</title>
    <url>/2023/10/18/Paper-Reading-SCC-VN/</url>
    <content><![CDATA[<h2 id="introduction">Introduction</h2>
<p><em>Value Numbering</em> is a widely useful optimization implemented by most compilers, such as <em>Clang</em> and <em>GCC</em>. Traditionally, <em>GVN</em> techniques are divided into <em>Hash-Based</em> and <em>Value-Partitioning</em> approaches. The former handles algebraic simplifications, but only locally. The latter is global, but handles only simple redundancies. <em>Keith Cooper</em> and <em>Taylor Simpson</em> came up with a new method that addresses this for <em>SSA</em> form.</p>
<span id="more"></span>
<h2 id="scc-based-value-numebring-easy-version">SCC-Based Value Numbering (Easy Version)</h2>
<p>The paper first proposes an <span class="math inline">\(O(N \cdot D(SSA))\)</span> algorithm, which is based on an RPO (reverse postorder) traversal of SSA-form IR.</p>
<p><strong>Note</strong>: RPO traversal guarantees that the predecessors of a <strong>BasicBlock</strong> through <strong>non-back</strong> edges are processed before the block itself.</p>
<p>Here is pseudocode for this easy version:</p>
<figure class="highlight python"><table><tr><td class="code"><pre><span class="line">VN.fill(Null)</span><br><span class="line">repeat</span><br><span class="line">  done = true</span><br><span class="line">  <span class="keyword">for</span> block b <span class="keyword">in</span> RPO(Function)</span><br><span class="line">    <span class="keyword">for</span> x <span class="keyword">in</span> b</span><br><span class="line">      temp = lookup(x.op, VN[x[<span class="number">1</span>]], VN[x[<span class="number">2</span>]], x)</span><br><span class="line">      <span class="keyword">if</span> VN[x] != temp</span><br><span class="line">        done = false</span><br><span class="line">        VN[x] = temp</span><br><span class="line">until done</span><br></pre></td></tr></table></figure>
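<p>The loop above can be sketched as runnable Python. This is a minimal sketch under stated assumptions, not the paper's implementation: the instruction encoding <code>(name, op, operands)</code>, the function name <code>rpo_value_number</code>, and the choice to rebuild the hash table on every pass are all assumptions made for illustration. The value number of an expression is simply the first name that computed it.</p>

```python
def rpo_value_number(insts):
    """insts: list of (name, op, operands), already listed in RPO order.

    Operands are either names of other instructions or literal constants.
    (This encoding is hypothetical; it is not taken from the paper.)
    """
    vn = {name: None for name, _, _ in insts}      # VN.fill(Null)
    done = False
    while not done:                                # repeat ... until done
        done = True
        table = {}                                 # hash table rebuilt on every pass
        for name, op, operands in insts:
            # Replace each operand by its current value number; literals map to themselves.
            key = (op,) + tuple(vn.get(o, o) for o in operands)
            temp = table.setdefault(key, name)     # lookup(): reuse or create a number
            if vn[name] != temp:
                done = False
                vn[name] = temp
    return vn


insts = [
    ("a", "param", (0,)),
    ("b", "param", (1,)),
    ("x", "add", ("a", "b")),
    ("y", "add", ("a", "b")),   # redundant with x
    ("z", "mul", ("x", "y")),
]
vn = rpo_value_number(insts)
```

<p>Here <code>x</code> and <code>y</code> end up with the same value number, while <code>z</code> gets its own. The sketch covers only straight-line code; phi nodes and back edges are left out.</p>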
<p>We define a congruence (equivalence) relation, <span class="math inline">\(\cong_i\)</span>. We say that <span class="math inline">\(x \cong_i y\)</span> if and only if <span class="math inline">\(VN[x] = VN[y]\)</span> after the <span class="math inline">\(i^{th}\)</span> RPO iteration.</p>
<p>We get:</p>
<p><img src="/images/cong_scc.png" /></p>
<p>Now we prove the correctness.</p>
<p>First, it is clear that <span class="math inline">\(x \cong_i y \rightarrow x \cong_{i-1} y\)</span>: once two names have been separated, they are never merged again. This is monotonicity.</p>
<p>Each step therefore produces a refinement of the partition, and refinement cannot continue indefinitely. Since the RPO algorithm begins with all SSA names congruent, it must converge to the same fixed point as value partitioning.</p>
<p>Then, we try to get its time complexity. <strong>Back edges</strong> play a key role here.</p>
<p><img src="/images/lemma1-scc.png" /></p>
<p>Induction for <span class="math inline">\(x \ncong_{i} y\)</span>:</p>
<p>Basis (<span class="math inline">\(i = 1\)</span>): Empty sequence</p>
<ul>
<li><p>Basis (<span class="math inline">\(j=1\)</span>) and Induction (<span class="math inline">\(i &gt; 1\)</span>):</p>
<ul>
<li>case1: <span class="math inline">\(x.op \ne y.op\)</span>. Obviously impossible due to <span class="math inline">\(x \cong_{i-1} y\)</span>.</li>
<li>case2: <span class="math inline">\(x[e_1] \ncong_{i} y[e_1]\)</span> for some <strong>non-back</strong> edge. <strong>Some</strong> here means that there exists a non-back edge <span class="math inline">\(e\)</span> that causes it. But that contradicts <span class="math inline">\(x \cong_{i - 1} y\)</span>. As the graph below shows, it is impossible: <img src="/images/flow-scc.png" /></li>
<li>case 3: <span class="math inline">\(x[e_1] \ncong_{i-1} y[e_1]\)</span> for some <strong>back</strong> edge. That is possible. The sequence now consists of <span class="math inline">\(e\)</span>.</li>
</ul></li>
<li><p>Induction (<span class="math inline">\(j &gt; 1\)</span>)</p>
<ul>
<li><p>case1: <span class="math inline">\(x.op \ne y.op\)</span>. Obviously impossible as before.</p></li>
<li><p>case2: <span class="math inline">\(x[e_j] \ncong_{i} y[e_j]\)</span> for some <strong>non-back</strong> edge. The sequence consists of <span class="math inline">\(e\)</span> followed by the sequence for the pair <span class="math inline">\(x[e_{j-1}],y[e_{j-1}]\)</span>, which we know exists by the induction hypothesis for <strong><span class="math inline">\(j\)</span></strong>: <span class="math inline">\(s\)</span></p></li>
<li><p>case 3: <span class="math inline">\(x[e_j] \ncong_{i-1} y[e_j]\)</span> for some <strong>back</strong> edge. The sequence consists of <span class="math inline">\(e_j\)</span> followed by the sequence for the pair <span class="math inline">\(x[e],y[e]\)</span>, which we know exists by the induction hypothesis for <strong><span class="math inline">\(i\)</span></strong>.</p></li>
</ul></li>
</ul>
<p>Final time analysis (<span class="math inline">\(D(SSA)\)</span> is the <em>loop connectedness</em>): <img src="/images/theorem2-scc.jpg" /></p>
<h2 id="the-scc-algorithm-more-efficient">The SCC Algorithm (More efficient)</h2>
<blockquote>
<p>To make the algorithm more efficient in practice, we operate on the <strong>SSA graph</strong> instead of the <strong>control-flow graph</strong>. We refer to the improved algorithm as the SCC algorithm because it concentrates on the strongly connected components of the SSA graph.</p>
</blockquote>
<p>The paper uses <em>Tarjan's depth-first algorithm for finding SCCs</em>. A component consisting of a single node is handled directly, while the nodes of a larger SCC are handled in RPO order.</p>
<h2 id="result">Result</h2>
<p>No worse than the <em>Hash-Based</em> and <em>Value-Partitioning</em> methods. See the paper for details.</p>
<h2 id="connection-with-modern-compilers">Connection with modern compilers</h2>
<p><em>NewGVN in LLVM</em> applies the algorithm from this paper; it is still in development.</p>
<h2 id="reference">Reference</h2>
<ul>
<li><em>SCC-Based Value Numbering, by Keith Cooper and Taylor Simpson</em></li>
</ul>
]]></content>
      <categories>
        <category>Paper-Reading</category>
      </categories>
      <tags>
        <tag>Algorithm</tag>
        <tag>Compiler</tag>
        <tag>Paper</tag>
      </tags>
  </entry>
  <entry>
    <title>Routing-Algorithms</title>
    <url>/2023/08/19/Routing-Algorithms/</url>
    <content><![CDATA[<p>Routing algorithm goal: determine <strong>good</strong> paths from the sending host to the receiving host. So what is a good path? Here "good" means least cost, fastest, or least congested. Cost is defined by the network operator: it may be related to bandwidth, to congestion, etc.</p>
<p>Given the characteristics of a network, we apply a <strong>graph abstraction</strong> to solve this problem. Let routers be vertices, links be edges, and the "cost" of each link be the weight of its edge.</p>
<p>Routing algorithms can be classified as:</p>
<ul>
<li>global: all routers have complete topology, link cost info</li>
<li>decentralized: iterative process of computation, exchange of info with neighbors</li>
<li>dynamic: routes change more quickly</li>
<li>static: routes change slowly over time</li>
</ul>
<h2 id="link-state">Link State</h2>
<p>The <strong>Link State Algorithm</strong> is iterative and centralized/global: the network topology and all link costs must be known to every node. It computes the least-cost paths from one node to all other nodes.</p>
<p>Notation: <span class="math inline">\(C_{a,b}\)</span> is the link cost from <span class="math inline">\(a\)</span> to <span class="math inline">\(b\)</span>. <span class="math inline">\(D(a)\)</span> is the cost of the least-cost path from the source to destination <span class="math inline">\(a\)</span>. <span class="math inline">\(p(a)\)</span> is the predecessor node along the path from the source to <span class="math inline">\(a\)</span>. And <span class="math inline">\(N&#39;\)</span> is the set of nodes whose least-cost path is definitively known.</p>
<p>It is essentially <strong>Dijkstra's Algorithm</strong> from graph theory (structurally similar to <strong>Prim's Algorithm</strong>). Details are omitted here.</p>
<p>Optimized algorithm complexity: <span class="math inline">\(O(n \log n)\)</span></p>
<p>Message complexity: each router must broadcast its link state information to the other <span class="math inline">\(n\)</span> routers, so the overall complexity is <span class="math inline">\(O(n^2)\)</span></p>
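<p>For readers who want the omitted details, here is a minimal Dijkstra sketch that uses the notation above (<code>D</code>, <code>p</code>, and a <code>done</code> set standing in for N'). The function name <code>link_state</code>, the dictionary-based graph encoding, and the three-node example graph are assumptions made for illustration.</p>

```python
import heapq


def link_state(graph, source):
    """graph: {node: {neighbor: link_cost}}. Returns (D, p) for the given source."""
    D = {source: 0}          # D(a): best known cost from source to a
    p = {}                   # p(a): predecessor of a on that path
    done = set()             # N': nodes whose least-cost path is final
    heap = [(0, source)]
    while heap:
        cost, a = heapq.heappop(heap)
        if a in done:
            continue         # stale heap entry, a was already finalized
        done.add(a)
        for b, c_ab in graph[a].items():
            new_cost = cost + c_ab
            if b not in D or new_cost < D[b]:
                D[b] = new_cost
                p[b] = a
                heapq.heappush(heap, (new_cost, b))
    return D, p


# Hypothetical topology: x-y costs 4, y-z costs 1, x-z costs 50.
graph = {
    "x": {"y": 4, "z": 50},
    "y": {"x": 4, "z": 1},
    "z": {"x": 50, "y": 1},
}
D, p = link_state(graph, "x")
```

<p>With this graph, the least-cost path from x to z goes through y, so <code>D["z"]</code> is 5 and <code>p["z"]</code> is <code>"y"</code>. A binary heap is what brings the running time down toward the optimized bound.</p>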
<h2 id="distance-vector">Distance Vector</h2>
<p><strong>Distance Vector</strong> is an application of the <strong>Bellman-Ford Algorithm</strong>; it is decentralized, iterative and asynchronous. In this algorithm, each node propagates its cost estimates (its distance vector) to its neighbors, so that they can update their own distance vectors.</p>
<p>The process for each node is:</p>
<ul>
<li>wait for a change in local link cost or a msg from a neighbor</li>
<li>recompute its own DV estimates using the DV received from the neighbor</li>
<li>if its own DV changed, send it to its neighbors</li>
</ul>
<p>This works like a diffusion of state information. <strong>As a compiler learner, I think it's the same as what MFP (Maximum Fixed Point) computes.</strong></p>
<p>However, things go wrong when one of the link costs increases. For example, with path x-4-y-1-z, <span class="math inline">\(z\)</span> holds <span class="math inline">\(D_z(x) = 5\)</span> (via <span class="math inline">\(y\)</span>). When <span class="math inline">\(C_{x,y}\)</span> increases to 60, <span class="math inline">\(y\)</span> will update <span class="math inline">\(D_y(x)\)</span> to 6 because <span class="math display">\[D_y(x) = \min(C_{x,y},\ C_{y,z} + D_z(x)) = \min(60,\ 1 + 5) = 6\]</span> while <span class="math inline">\(D_z(x)\)</span> is out of date, since it still assumes the old route through <span class="math inline">\(y\)</span>. This count-to-infinity problem is tricky to solve.</p>
<p>Message complexity: exchange between neighbors; convergence time varies.</p>
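<p>The stale update in the x-4-y-1-z example can be reproduced in a few lines. This is a minimal sketch of a single Bellman-Ford step at node y; the helper name <code>dv_update</code> and the dictionary layout are assumptions made for illustration.</p>

```python
def dv_update(link_costs, neighbor_vectors, dest):
    """One Bellman-Ford step: min over neighbors v of c(v) + D_v(dest)."""
    return min(
        cost + neighbor_vectors[v].get(dest, float("inf"))
        for v, cost in link_costs.items()
    )


# Node y's view right after the x-y link cost rises from 4 to 60.
link_costs_y = {"x": 60, "z": 1}
neighbor_vectors = {
    "x": {"x": 0},
    "z": {"x": 5},   # stale: z still believes it reaches x via y at cost 1 + 4
}
d_y_x = dv_update(link_costs_y, neighbor_vectors, "x")
```

<p>Node y picks the route through z at cost 6 even though z's estimate itself depends on y, which is exactly the loop that makes the estimates count upward.</p>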
<h2 id="comparsion-of-ls-and-dv-algorithms">Comparison of LS and DV algorithms</h2>
<p>Robustness:</p>
<ul>
<li>LS:
<ul>
<li>a router can advertise an incorrect link cost.</li>
<li>each router computes only its own table.</li>
</ul></li>
<li>DV:
<ul>
<li>a DV router can advertise an incorrect path cost: black-holing.</li>
<li>each router's DV is used by others: errors propagate through the network.</li>
</ul></li>
</ul>
]]></content>
      <categories>
        <category>Compiler Theory</category>
      </categories>
      <tags>
        <tag>Algorithm</tag>
        <tag>Compiler</tag>
      </tags>
  </entry>
  <entry>
    <title>Sequential-Logic-Circuit</title>
    <url>/2023/07/26/Sequential-Logic-Circuit/</url>
    <content><![CDATA[<h2 id="sequential-clock-circuit">Sequential Clock Circuit</h2>
<p>A circuit contains multiple combinational logic blocks. A sequential logic circuit connects them into a single circuit that is synchronized by a clock.</p>
<p>The <strong>critical path</strong> is the longest delay between any two <em>registers</em> in a circuit. The clock period must be longer than this critical path, or the signal will not propagate properly to the next register.</p>
<p>So the max frequency of the circuit is limited by how much time is needed to get the correct Next State to the register (the <span class="math inline">\(t_{setup}\)</span> constraint).</p>
<p>The structure of the circuit should look like this: <figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">input -&gt; Combinational -&gt; output </span><br><span class="line">    --&gt; Logic Circuit --</span><br><span class="line">    |                  | (Next State)</span><br><span class="line">    ---   Register &lt;----</span><br></pre></td></tr></table></figure></p>
<p>So when you overclock, you force your machine past its maximum clock frequency. That is unsafe and unstable.</p>
<p>To raise the maximum frequency, you can add extra registers to shorten the critical path. The computation may then take more clock periods, but that is fine for a pipeline: more latency, but more throughput too.</p>
<p>Pipelining <strong>tends</strong> to improve performance.</p>
<h2 id="finite-state-machine">Finite State Machine</h2>
<p>State transitions are controlled by the clock: on each clock cycle the machine checks its inputs and generates a new state and a new output.</p>
<p>A register holds a representation of the FSM's state:</p>
<ul>
<li>a unique bit pattern must be assigned to each state</li>
<li>its output is the present/current state (PS/CS)</li>
<li>its input is the next state (NS)</li>
</ul>
<p>Combinational Logic implements transition function here.</p>
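<p>As a small illustration, the register-plus-transition-function structure can be simulated in software. The machine below, which outputs 1 once three consecutive 1s have been seen, is a hypothetical example chosen for this sketch; the state names and the <code>run_fsm</code> helper are assumptions, not part of the original notes.</p>

```python
# Transition function (the combinational logic):
# (present state, input bit) -> (next state, output bit)
TRANSITIONS = {
    ("S0", 0): ("S0", 0), ("S0", 1): ("S1", 0),
    ("S1", 0): ("S0", 0), ("S1", 1): ("S2", 0),
    ("S2", 0): ("S0", 0), ("S2", 1): ("S2", 1),
}


def run_fsm(bits, state="S0"):
    """Feed one input bit per clock cycle and collect the outputs."""
    outputs = []
    for bit in bits:                              # one iteration = one clock cycle
        state, out = TRANSITIONS[(state, bit)]    # the register latches the next state
        outputs.append(out)
    return outputs


outputs = run_fsm([1, 1, 1, 0, 1, 1, 1, 1])
```

<p>The variable <code>state</code> plays the role of the register, and the dictionary lookup plays the role of the combinational logic.</p>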
]]></content>
      <categories>
        <category>Architecture</category>
      </categories>
      <tags>
        <tag>Architecture</tag>
        <tag>CS61C</tag>
      </tags>
  </entry>
  <entry>
    <title>TCP Congestion Control</title>
    <url>/2023/08/17/TCP-Congestion-Control/</url>
    <content><![CDATA[<h1 id="classic-tcp">Classic TCP</h1>
<h2 id="aimd">AIMD</h2>
<p><strong>A</strong>I<strong>M</strong>D - a distributed, asynchronous algorithm - has been shown to:</p>
<ul>
<li>optimize congested flow rates network wide.</li>
<li>have desirable stability properties.</li>
</ul>
<p>Approach: senders can <strong>increase</strong> sending rate until packet loss occurs, then <strong>decrease</strong> sending rate on loss event.</p>
<p>Additive Increase: Increase sending rate by 1 maximum segment size every RTT until loss detected.</p>
<p>Multiplicative Decrease: Cut sending rate in half when loss is detected by triple duplicate ACK (TCP Reno), or cut to 1 maximum segment size when loss is detected by timeout (TCP Tahoe).</p>
<h2 id="tcp-congestion-control-details">TCP Congestion Control Details</h2>
<p>sender sequence number space:</p>
<figure>
<img src="/images/SenderSequenceSpace.png" alt="" /><figcaption>Sender sequence number space</figcaption>
</figure>
<p>TCP rate ~= <span class="math inline">\(\frac{cwnd}{RTT}\)</span> bytes/sec</p>
<ul>
<li>TCP sender limits transmission : LastByteSent - LastByteAcked &lt;= cwnd</li>
<li><em>cwnd</em> is dynamically adjusted in response to observed network congestion</li>
</ul>
<h3 id="tcp-slow-start">TCP slow start</h3>
<p>Initially cwnd = 1 MSS. cwnd then doubles every RTT, which is done by incrementing cwnd by 1 MSS for every ACK received.</p>
<p>When cwnd reaches 1/2 of its value before the last loss event, growth switches from exponential to linear.</p>
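<p>The rules above (slow start, additive increase, multiplicative decrease) can be sketched as a per-RTT update. This is a simplified model under stated assumptions: cwnd is counted in whole MSS units, the threshold variable is called <code>ssthresh</code> here, slow start is capped at that threshold, and fast recovery is not modelled. The function name <code>next_cwnd</code> is hypothetical.</p>

```python
def next_cwnd(cwnd, ssthresh, event):
    """Return (cwnd, ssthresh) in MSS units after one RTT."""
    if event == "timeout":
        # Loss detected by timeout: restart from 1 MSS.
        return 1, max(cwnd // 2, 1)
    if event == "triple_dup_ack":
        # Loss detected by triple duplicate ACK: cut the window in half.
        half = max(cwnd // 2, 1)
        return half, half
    # event == "ack": a full window was acknowledged without loss.
    if cwnd < ssthresh:
        return min(cwnd * 2, ssthresh), ssthresh   # slow start: double per RTT
    return cwnd + 1, ssthresh                      # additive increase: +1 MSS per RTT


cwnd, ssthresh = 1, 8
trace = []
for event in ["ack"] * 5 + ["triple_dup_ack", "ack", "timeout"]:
    cwnd, ssthresh = next_cwnd(cwnd, ssthresh, event)
    trace.append(cwnd)
```

<p>The resulting trace shows the familiar sawtooth: exponential growth up to the threshold, linear growth afterwards, and a sharp drop on each loss event.</p>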
<h2 id="cubic">CUBIC</h2>
<figure>
<img src="/images/TCPCUBIC.png" alt="" /><figcaption>CUBIC</figcaption>
</figure>
<p>TCP CUBIC is the default in Linux and the most popular TCP variant for popular Web servers.</p>
<h1 id="enhanced-tcps">Enhanced TCPs</h1>
<h2 id="delay-based-tcp-congestion-control">Delay-based TCP congestion control</h2>
<p>"Just full enough, but not fuller": keep bottleneck link busy transmitting, but avoid high delays/buffering</p>
<h2 id="explicit-congestion-notification-ecn">Explicit congestion notification (ECN)</h2>
<p>TCP deployments often implement network-assisted congestion control.</p>
<ul>
<li>two bits in IP header (ToS field) marked by network router to indicate congestion
<ul>
<li>policy to determine marking chosen by network operator</li>
</ul></li>
<li>congestion indication carried to destination</li>
<li>destination sets ECE bit on ACK segment to notify sender of congestion</li>
<li>involves both IP (IP header ECN bit marking) and TCP (TCP header C,E bit marking)</li>
</ul>
<h2 id="tcp-fairness">TCP fairness</h2>
<p>Goal: multiple TCP sessions sharing the same bottleneck link should each get an equal share of its bandwidth.</p>
<p>However, there is no Internet police policing use of congestion control.</p>
]]></content>
      <categories>
        <category>Network</category>
      </categories>
      <tags>
        <tag>Network</tag>
        <tag>TCP</tag>
      </tags>
  </entry>
  <entry>
    <title>Virtual-Memory</title>
    <url>/2023/08/07/Virtual-Memory/</url>
    <content><![CDATA[<h2 id="address-translation">Address Translation</h2>
<p>Assuming the virtual memory has 1024B, a page has 256B, then the index of page should be: <span class="math display">\[\log_2{\frac{1024}{256}} = 2 bit\]</span> So for an 32-bit address, the first 2 bits is the index, while the remaining 30 bits serve as offset.</p>
<h2 id="page-table">Page Table</h2>
<p>Each entry consists of: [Valid] [Access Rights] [VPN] [PPN]</p>
<p>The valid bit determines whether this virtual page is mapped to a physical page. The VPN is mapped to a PPN by looking it up in the table. The offset is unchanged between the virtual and the physical address.</p>
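<p>A minimal sketch of this lookup, assuming 256 B pages and a 1024 B virtual address space (so 10-bit virtual addresses with a 2-bit VPN and an 8-bit offset). The <code>translate</code> function, the dictionary-based page table, and its contents are assumptions made for illustration; access rights are left out.</p>

```python
PAGE_SIZE = 256                               # bytes per page
OFFSET_BITS = PAGE_SIZE.bit_length() - 1      # log2(256) = 8


def translate(vaddr, page_table):
    """page_table: {vpn: (valid, ppn)}. Returns the physical address."""
    vpn = vaddr >> OFFSET_BITS                # upper bits select the page
    offset = vaddr & (PAGE_SIZE - 1)          # lower bits pass through unchanged
    valid, ppn = page_table.get(vpn, (False, None))
    if not valid:
        raise LookupError(f"page fault on VPN {vpn}")
    return (ppn << OFFSET_BITS) | offset


# Hypothetical mapping: VPN 0 -> PPN 3, VPN 1 unmapped, VPN 2 -> PPN 0.
page_table = {0: (True, 3), 1: (False, None), 2: (True, 0)}
paddr = translate(0x0A5, page_table)
```

<p>Virtual address <code>0x0A5</code> lies in page 0 at offset <code>0xA5</code>, so it maps to <code>0x3A5</code>; an access to page 1 raises a page fault because its valid bit is clear.</p>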
<p>Page tables are always stored in main memory. Since a single flat page table is too big, we usually build a hierarchical page table. <img src="/images/HierachicalPageTable.png" alt="pagetable" /></p>
<h2 id="problems">Problems</h2>
<p>Two or more physical memory accesses per data access is too slow. Since there is locality in the pages of data, there must also be locality in the translations of those pages, so we can build a separate cache for the page table.</p>
<p>For historical reasons, this cache is called a Translation Lookaside Buffer (TLB).</p>
<p>VPN -&gt; TLB -&gt; PPN -&gt; Data (access the page table in main memory on a TLB miss)</p>
<h2 id="performance-analysis">Performance Analysis</h2>
<h3 id="vm-performance">VM Performance</h3>
<p>Similar to a cache. But here, although the page miss rate is much smaller, a page miss costs far more. A page fault (loading a page from disk) takes about 20,000,000 cycles, which is devastating. The miss rate must be very small to compensate.</p>
]]></content>
      <categories>
        <category>Architecture</category>
      </categories>
      <tags>
        <tag>Architecture</tag>
        <tag>CS61C</tag>
      </tags>
  </entry>
  <entry>
    <title>Whale-Dependencies-Analysis-0</title>
    <url>/2023/09/21/Whale-Dependencies-Analysis-0/</url>
    <content><![CDATA[<h2 id="why-do-we-need-dependencies-analysis">Why do we need dependencies analysis?</h2>
<p>Dependencies analysis is the <strong>foundation</strong> for <em>instruction scheduling</em> and <em>data-cache optimization</em>. It detects and analyzes conflicts over resources, control, data, etc., so that other transforms can reorder instructions/BasicBlocks for better performance.</p>
<p>Here, we mainly focus on the <strong>instruction dependencies</strong>.</p>
<h2 id="classifications-of-instruction-dependencies">Classifications of instruction dependencies</h2>
]]></content>
      <categories>
        <category>LLVM</category>
      </categories>
      <tags>
        <tag>Compiler</tag>
        <tag>LLVM</tag>
      </tags>
  </entry>
  <entry>
    <title>XSharp Development Notes: Array Design</title>
    <url>/2023/04/17/XSharp-3-Array-Design-0/</url>
    <content><![CDATA[<p>Following the object model in Java, I decided to design the <strong>Array</strong> model in XSharp as follows:</p>
<ul>
<li>[ <strong>8</strong> bytes ] object header as <strong>length of array</strong></li>
<li>[ <strong>4 or 8</strong> bytes ] pointer <strong><em>p</em></strong> to a sequential memory (<strong>for elements</strong>)</li>
</ul>
<p>So for the following XSharp code</p>
<figure class="highlight plaintext"><table><tr><td class="code"><pre><span class="line">i64[] a = new i64[100]</span><br></pre></td></tr></table></figure>
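<p>on a 64-bit system we allocate 8 + 4 bytes of memory on the stack, add 4 bytes of padding to satisfy alignment for a total of 16 bytes, and allocate 100 * 8 bytes on the heap for the 100 i64 elements.</p>
<p>Every time an operation like <code>a[i]</code> is executed, we take the address of a and add 8 to obtain the pointer <strong><em>p</em></strong> to the contiguous memory, and then read or write the address <strong>p + i * sizeof(i64)</strong>.</p>
<p>In LLVM CodeGen, we need to define a type of the form <code>StructType&lt;i64,PointerTo&lt;xxx&gt;&gt;</code> and use the <strong>getelementptr inbounds</strong> and <strong>getelementptr</strong> instructions to obtain the address of an element.</p>
<p>The benefit of this design is that the length lives on the stack, so no extra computation is needed during heap allocation or element access, no alignment is needed to stay cache friendly, and optimization is easier.</p>
<p>Compared with C-style arrays, the body of our array always lives on the heap, so memory management is less fine-grained. But thanks to the encapsulation it is easier to use, and these restrictions also make optimization easier to implement.</p>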
]]></content>
      <categories>
        <category>XSharp</category>
      </categories>
      <tags>
        <tag>Compiler</tag>
        <tag>LLVM</tag>
        <tag>XSharp</tag>
      </tags>
  </entry>
  <entry>
    <title>XSharp-4-Class-Design</title>
    <url>/2023/05/18/XSharp-4-Class-Design/</url>
    <content><![CDATA[<p>Waiting to complete</p>
]]></content>
      <categories>
        <category>XSharp</category>
      </categories>
      <tags>
        <tag>Compiler</tag>
        <tag>LLVM</tag>
        <tag>XSharp</tag>
      </tags>
  </entry>
  <entry>
    <title>XSharp Development Notes: LLVM IR Generation for Mutable Variables</title>
    <url>/2023/03/03/XSharp%E5%BC%80%E5%8F%91%E6%80%9D%E8%B7%AF-Mutable-Variable%E7%9A%84LLVM-IR%E7%94%9F%E6%88%90/</url>
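    <content><![CDATA[<h3 id="为什么需要-mutable-variable">Why do we need Mutable Variables?</h3>
<p>Because of LLVM's internal optimizations and other reasons, registers in LLVM IR must follow the <strong>SSA</strong> principle: each register is assigned exactly once.</p>
<p>But since XSharp must let the same variable be referenced and assigned multiple times, we cannot use registers directly as the storage for variables.</p>
<p>Fortunately, LLVM does not require variables on the stack to stay in <strong>SSA</strong> form, so we can store all variables on the stack,</p>
<p>and then use the Mem2Reg utility or pass provided by LLVM to run data-flow analysis on the stack memory and promote as many stack variables to registers as possible.</p>
<p>The original documentation is here: <a href="https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/LangImpl07.html">LLVM Mutable Variable</a></p>
<p>For XSharp, we can write the following code</p>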
<span id="more"></span>
<p>First, LLVM allocates variables on the stack with <strong>AllocaInst</strong></p>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line">VariableDeclarationNode* varNode = <span class="built_in">get</span>();</span><br><span class="line">TypeNode* typenode = varNode-&gt;<span class="built_in">type</span>();</span><br><span class="line"></span><br><span class="line"><span class="keyword">auto</span> xsharpType = varNode-&gt;<span class="built_in">type</span>();</span><br><span class="line"><span class="keyword">auto</span> llvmValue =</span><br><span class="line">builder.<span class="built_in">CreateAlloca</span>(</span><br><span class="line">    <span class="built_in">llvmTypeFor</span>(xsharpType, context), <span class="literal">nullptr</span>,</span><br><span class="line">    varNode-&gt;<span class="built_in">name</span>().<span class="built_in">toStdString</span>());</span><br></pre></td></tr></table></figure>
<p>Function arguments can also be stored on the stack; the following is copied from the <a href="https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/LangImpl07.html">LLVM Tutorial</a></p>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="function">Function *<span class="title">FunctionAST::codegen</span><span class="params">()</span> </span>&#123;</span><br><span class="line">  ...</span><br><span class="line">  Builder-&gt;<span class="built_in">SetInsertPoint</span>(BB);</span><br><span class="line"></span><br><span class="line">  <span class="comment">// Record the function arguments in the NamedValues map.</span></span><br><span class="line">  NamedValues.<span class="built_in">clear</span>();</span><br><span class="line">  <span class="keyword">for</span> (<span class="keyword">auto</span> &amp;Arg : TheFunction-&gt;<span class="built_in">args</span>()) &#123;</span><br><span class="line">    <span class="comment">// Create an alloca for this variable.</span></span><br><span class="line">    AllocaInst *Alloca = <span class="built_in">CreateEntryBlockAlloca</span>(TheFunction, Arg.<span class="built_in">getName</span>());</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Store the initial value into the alloca.</span></span><br><span class="line">    Builder-&gt;<span class="built_in">CreateStore</span>(&amp;Arg, Alloca);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Add arguments to variable symbol table.</span></span><br><span class="line">    NamedValues[std::<span class="built_in">string</span>(Arg.<span class="built_in">getName</span>())] = Alloca;</span><br><span class="line">  &#125;</span><br><span class="line"></span><br><span class="line">  <span class="keyword">if</span> (Value *RetVal = Body-&gt;<span class="built_in">codegen</span>()) &#123;</span><br><span class="line">    ...</span><br></pre></td></tr></table></figure>
<p>and use <strong>PromoteMemoryToRegisterPass</strong> to perform the Mem2Reg optimization</p>
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="comment">// Promote allocas to registers.</span></span><br><span class="line">functionPassManager-&gt;<span class="built_in">add</span>(<span class="built_in">createPromoteMemoryToRegisterPass</span>());</span><br></pre></td></tr></table></figure>
<p>LLVM also explains performance and related issues <img src="/images/Mem2RegLLVM.png" alt="content" /></p>
]]></content>
      <categories>
        <category>XSharp</category>
      </categories>
      <tags>
        <tag>Compiler</tag>
        <tag>LLVM</tag>
        <tag>XSharp</tag>
      </tags>
  </entry>
  <entry>
    <title>XSharp Development Notes: Expression Parsing with Pratt Parsing</title>
    <url>/2023/03/15/XSharp%E5%BC%80%E5%8F%91%E6%80%9D%E8%B7%AF-%E8%A1%A8%E8%BE%BE%E5%BC%8F%E8%A7%A3%E6%9E%90-Pratt-Parsing/</url>
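    <content><![CDATA[<p>Hand-written parsers commonly use <strong>Recursive Descent</strong>, and XSharp's parser also uses <strong>recursive descent</strong> for its main structure.</p>
<p>Generally, recursive descent suits top-down structures and more easily parses constructs that begin with an identifying keyword, such as:</p>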
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="keyword">if</span> () &#123;&#125;</span><br><span class="line"><span class="keyword">while</span> () &#123;&#125;</span><br><span class="line"><span class="keyword">class</span> &#123;&#125;</span><br></pre></td></tr></table></figure>
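<p>But for the same reason, recursive descent struggles with expressions. When the parser reads the start of an expression, it cannot know which kind of expression it is in, because the operator usually sits in the middle of the expression (or even at the end), such as the + of addition or the () of a function call. To parse expressions top-down, you need to treat each operator <strong>priority</strong> as its own level, write a parsing function for it, and handle <strong>associativity</strong> manually, so you end up with many fairly complex parsing functions.</p>
<p>So when refactoring XSharp's parser, I chose <strong>Pratt Parsing</strong> as the algorithm for expressions.</p>
<p>I refactored the code with reference to <a href="https://zhuanlan.zhihu.com/p/471075848">Pratt Parsing (Zhihu)</a> and <a href="https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html">Pratt Parsing (Rust)</a>.</p>
<p>The core code is as follows:</p>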
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line">ASTNode* lhs = <span class="built_in">operand</span>();</span><br><span class="line"></span><br><span class="line"><span class="keyword">while</span> (<span class="literal">true</span>) &#123;</span><br><span class="line">    <span class="keyword">if</span> (<span class="built_in">isStopwords</span>(current, stopwords)) <span class="keyword">return</span> lhs;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (current-&gt;type != Operator)</span><br><span class="line">        <span class="keyword">throw</span> <span class="built_in">XSharpError</span>(<span class="string">&quot;No operator matched after operand&quot;</span>);</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (<span class="built_in">priority</span>(current-&gt;value) &lt;= ctxPriority) <span class="keyword">break</span>;</span><br><span class="line"></span><br><span class="line">    XString op = current-&gt;value;</span><br><span class="line"></span><br><span class="line">    forward();</span><br><span class="line">    <span class="keyword">auto</span> right_binding_power =</span><br><span class="line">        <span class="built_in">assoc</span>(op) == LeftToRight ? 
<span class="built_in">priority</span>(op) : <span class="built_in">priority</span>(op) - <span class="number">1</span>;</span><br><span class="line">    <span class="keyword">auto</span> rhs = <span class="built_in">expression</span>(stopwords, right_binding_power);</span><br><span class="line"></span><br><span class="line">    <span class="keyword">auto</span> new_lhs = <span class="keyword">new</span> BinaryOperatorNode;</span><br><span class="line">    new_lhs-&gt;<span class="built_in">setOperatorStr</span>(op);</span><br><span class="line">    new_lhs-&gt;<span class="built_in">setLeft</span>(lhs);</span><br><span class="line">    new_lhs-&gt;<span class="built_in">setRight</span>(rhs);</span><br><span class="line">    lhs = new_lhs;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> lhs;</span><br><span class="line"></span><br></pre></td></tr></table></figure>
<span id="more"></span>
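<p>How it works:</p>
<p>When parsing an expression, we always prefer to let the operator with the higher <strong>priority</strong> bind to an <strong>operand</strong>.</p>
<p>So, given the left-hand expression lhs, and following the usual human way of thinking,</p>
<p>when op has a higher priority we tend to split lhs apart, letting op keep binding with the rightmost operand of lhs until its priority is no longer sufficient,</p>
<p>and when op has a lower priority, we let op bind with lhs as a whole.</p>
<p>But this does not match the left-to-right order in which a machine parses, so we take a different approach.</p>
<p>We scan from left to right, set the initial priority to 0, start from the lowest priority level, and step by step find and bind operators of higher priority.</p>
<p>Take the expression <code>a / b = 2 + 5 * 6</code> as an example.</p>
<p>The initial level has priority 0; call this level <strong>initial</strong>.</p>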
<ul>
<li><p>Enter the initial level and first read token <strong>a</strong></p>
<p>The priority of <strong>/</strong> is greater than 0, so we combine a with / and enter the level belonging to <strong>'/'</strong>. This level has priority 3, so it looks for an rhs whose operators have priority greater than 3</p>
<ul>
<li><p>Then read token <strong>b</strong>, which belongs to the <strong>'/'</strong> level, and read operator <strong>=</strong>, whose priority is &lt;= the minimum priority of the current level</p>
<p>So the <strong>'/'</strong> level ends with b as its rhs, giving the unit <code>(a / b)</code></p></li>
</ul></li>
<li><p>Back in the initial level, lhs is now <code>a / b</code>. Continue by reading operator <strong>=</strong>, whose priority 1 &gt; 0, so we enter the <strong>'='</strong> level</p>
<ul>
<li><p>Now read token <strong>2</strong>, then operator <strong>+</strong>, whose priority 2 &gt; 1, so it can be part of the rhs; enter the <strong>'+'</strong> level</p>
<ul>
<li><p>Continue looking for the rhs of <strong>'+'</strong>: we find token <strong>5</strong> and operator <strong>*</strong>, whose priority 3 &gt; 2, so we enter the <strong>'*'</strong> level</p>
<ul>
<li>Read <code>6</code>; the expression ends, so 6 becomes the rhs of <strong>'*'</strong> and we start unwinding</li>
</ul>
<p>Take <code>5 * 6</code> as the rhs of <strong>'+'</strong> and leave the <strong>'+'</strong> level</p></li>
</ul>
<p>We get <code>2 + (5*6)</code>, take it as the rhs of <strong>'='</strong>, and leave the <strong>'='</strong> level</p></li>
</ul></li>
</ul>
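<p>Finally we return to the <strong>initial</strong> level and combine the existing lhs <code>a / b</code>, op <code>=</code>, and rhs <code>2 + ( 5 * 6 )</code>, returning <code>(a / b) = ( 2 + ( 5 * 6 ) )</code></p>
<p>That completes the basic algorithm. Right <strong>associativity</strong> can be implemented by lowering the operator's 'right priority' (as shown in the code); for other advanced features, see the articles cited above.</p>
<p>With this algorithm, we compressed a complex function of about 200 lines down to 20 lines and gained better performance.</p>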
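<p>To make the walkthrough easy to replay, here is a minimal Python sketch of the same loop. It is not XSharp's parser: the priority table, the <code>RIGHT_ASSOC</code> set, the flat token list, and the parenthesized-string output are all assumptions made for illustration.</p>

```python
PRIORITY = {"=": 1, "+": 2, "-": 2, "*": 3, "/": 3}
RIGHT_ASSOC = {"="}


def parse(tokens):
    """Parse a flat token list into a fully parenthesized string."""
    pos = 0

    def expression(ctx_priority):
        nonlocal pos
        lhs = tokens[pos]                 # operand
        pos += 1
        while pos < len(tokens):
            op = tokens[pos]
            if PRIORITY[op] <= ctx_priority:
                break                     # op belongs to an outer level
            pos += 1
            # Lowering the right priority by one makes the operator right-associative.
            rbp = PRIORITY[op] - 1 if op in RIGHT_ASSOC else PRIORITY[op]
            rhs = expression(rbp)
            lhs = f"({lhs} {op} {rhs})"
        return lhs

    return expression(0)


result = parse("a / b = 2 + 5 * 6".split())
```

<p>For the example expression, <code>result</code> is <code>((a / b) = (2 + (5 * 6)))</code>. Parsing <code>a = b = c</code> groups to the right, while <code>a - b - c</code> groups to the left.</p>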
]]></content>
      <categories>
        <category>XSharp</category>
      </categories>
      <tags>
        <tag>Compiler</tag>
        <tag>XSharp</tag>
      </tags>
  </entry>
  <entry>
    <title>Hello World</title>
    <url>/2023/02/06/hello-world/</url>
    <content><![CDATA[<p>Welcome to <a href="https://hexo.io/">Hexo</a>! This is your very first post. Check <a href="https://hexo.io/docs/">documentation</a> for more info. If you get any problems when using Hexo, you can find the answer in <a href="https://hexo.io/docs/troubleshooting.html">troubleshooting</a> or you can ask me on <a href="https://github.com/hexojs/hexo/issues">GitHub</a>.</p>
<h2 id="quick-start">Quick Start</h2>
<h3 id="create-a-new-post">Create a new post</h3>
<figure class="highlight bash"><table><tr><td class="code"><pre><span class="line">$ hexo new <span class="string">&quot;My New Post&quot;</span></span><br></pre></td></tr></table></figure>
<span id="more"></span>
<p>More info: <a href="https://hexo.io/docs/writing.html">Writing</a></p>
<h3 id="run-server">Run server</h3>
<figure class="highlight bash"><table><tr><td class="code"><pre><span class="line">$ hexo server</span><br></pre></td></tr></table></figure>
<p>More info: <a href="https://hexo.io/docs/server.html">Server</a></p>
<h3 id="generate-static-files">Generate static files</h3>
<figure class="highlight bash"><table><tr><td class="code"><pre><span class="line">$ hexo generate</span><br></pre></td></tr></table></figure>
<p>More info: <a href="https://hexo.io/docs/generating.html">Generating</a></p>
<h3 id="deploy-to-remote-sites">Deploy to remote sites</h3>
<figure class="highlight bash"><table><tr><td class="code"><pre><span class="line">$ hexo deploy</span><br></pre></td></tr></table></figure>
<p>More info: <a href="https://hexo.io/docs/one-command-deployment.html">Deployment</a></p>
]]></content>
      <categories>
        <category>Blog&#39;s configuration</category>
      </categories>
      <tags>
        <tag>Helloworld</tag>
        <tag>Blog&#39;s configuration</tag>
      </tags>
  </entry>
  <entry>
    <title>XSharp开发思路-Type</title>
    <url>/2023/02/18/XSharp%E5%BC%80%E5%8F%91%E6%80%9D%E8%B7%AF-Type/</url>
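    <content><![CDATA[<h3 id="一个好的编程语言需要有一个好的类型系统">A good programming language needs a good type system</h3>
<p>I plan to develop a static, extensible type system for XSharp that supports basic types (such as <em>i32</em> and <em>i64</em>), arrays, functions, closures, classes, and compositions of these.</p>
<p>Composition means a type has to be multi-level and built from several kinds of types, and a tree is exactly the data structure that fits.</p>
<p>Hence <strong>TypeNode</strong>.</p>
<p>We design <strong>type-specific</strong> information for each concrete kind of type and use it to build different type structures: for example, ArrayType has an elementType child type, and FunctionType has a paramTypes list of child nodes.</p>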
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"></span><br><span class="line"><span class="keyword">class</span> <span class="title class_">TypeNode</span>;</span><br><span class="line"></span><br><span class="line"><span class="comment">//arrayDimension指的是数组类型的维度</span></span><br><span class="line"><span class="comment">//而elementType则是元素类型的TypeNode指针</span></span><br><span class="line"><span class="keyword">struct</span> <span class="title class_">ArrayType</span> &#123;</span><br><span class="line">    uint arrayDimension;</span><br><span class="line">    TypeNode* elementType;</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="comment">//paramTypes指的是参数的类型</span></span><br><span class="line"><span class="comment">//returnValueType则是返回值的类型</span></span><br><span class="line"><span class="keyword">struct</span> <span class="title class_">FunctionType</span> &#123;</span><br><span class="line">    std::vector&lt;TypeNode*&gt; paramTypes;</span><br><span class="line">    TypeNode* returnValueType;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure>
<span id="more"></span>
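<p><strong>TypeNode</strong> uses the enum <strong>Categories</strong> (stored in the <strong>category</strong> member) to record the category of the type, that is, which kind of <strong>type-specific</strong> information (<strong>typeSpecifiedInfo</strong>) it holds.</p>
<p>This determines the structure of the TypeNode; <strong>std::variant</strong> makes it possible to store the different kinds of type-specific information in one member.</p>
<p>Given <strong>category</strong>, we can interpret the variant <strong>typeSpecifiedInfo</strong> and obtain the concrete type information.</p>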
<figure class="highlight c++"><table><tr><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">TypeNode</span></span><br><span class="line">&#123;</span><br><span class="line">   <span class="keyword">public</span>:</span><br><span class="line">    <span class="built_in">TypeNode</span>();</span><br><span class="line">    <span class="built_in">TypeNode</span>(<span class="type">const</span> TypeNode&amp; other);</span><br><span class="line">    ~<span class="built_in">TypeNode</span>();</span><br><span class="line">    <span class="function"><span class="type">bool</span> <span class="title">equals</span><span class="params">(<span class="type">const</span> TypeNode&amp; other)</span> <span class="type">const</span></span>;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Basic type</span></span><br><span class="line">    <span class="function">BasicType <span class="title">basicType</span><span class="params">()</span> <span class="type">const</span></span>;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Function type, TODO complete below</span></span><br><span class="line">    <span class="function">TypeNode* <span class="title">returnValueType</span><span class="params">()</span> <span class="type">const</span></span>;</span><br><span class="line">    <span class="function">std::vector&lt;TypeNode*&gt; <span class="title">paramsType</span><span class="params">()</span> <span class="type">const</span></span>;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Array type, TODO complete below</span></span><br><span class="line">    <span class="function">uint <span class="title">arrayDimension</span><span class="params">()</span> <span class="type">const</span></span>;</span><br><span class="line">    <span class="function">TypeNode* <span class="title">elementType</span><span class="params">()</span> 
<span class="type">const</span></span>;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Class type,  TODO complete below</span></span><br><span class="line"></span><br><span class="line">    <span class="comment">// generate a unique name for a type</span></span><br><span class="line">    <span class="function">XString <span class="title">typeName</span><span class="params">()</span> <span class="type">const</span></span>;</span><br><span class="line"></span><br><span class="line">    uint typeID;</span><br><span class="line">    XString baseName;</span><br><span class="line">    <span class="type">bool</span> isConst;</span><br><span class="line">    <span class="keyword">enum</span> <span class="title class_">Categories</span> &#123; Basic, Array, Function, Closure, Class &#125; category;</span><br><span class="line"></span><br><span class="line">    std::variant&lt;BasicType, ClassType, FunctionType, ArrayType, ClosureType&gt;</span><br><span class="line">        typeSpecifiedInfo;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure>
<p>Also note <strong>typeID</strong>: at compile time we will assign each distinct type a <strong>unique</strong> typeID and use it to implement runtime reflection.</p>
<p>This will be implemented in TypeSystem.</p>
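<p>To make the idea of one typeID per distinct type concrete, here is a minimal Python sketch. It is not XSharp's TypeSystem: the <code>TypeTable</code> class, the tuple encoding of types, and the name format are all assumptions made for illustration. Types are nested tuples, a canonical name is built by walking the tree, and structurally equal types receive the same ID.</p>

```python
class TypeTable:
    """Hypothetical interning table: one ID per structurally distinct type."""

    def __init__(self):
        self.ids = {}  # canonical type name mapped to its typeID

    def type_name(self, t):
        kind = t[0]
        if kind == "basic":       # ("basic", "i32")
            return t[1]
        if kind == "array":       # ("array", dimension, element_type)
            return self.type_name(t[2]) + "[]" * t[1]
        if kind == "function":    # ("function", (param_types...), return_type)
            params = ",".join(self.type_name(p) for p in t[1])
            return "fn(" + params + "):" + self.type_name(t[2])
        raise ValueError("unknown type category: " + str(kind))

    def type_id(self, t):
        name = self.type_name(t)
        if name not in self.ids:
            self.ids[name] = len(self.ids)  # next unused ID
        return self.ids[name]


table = TypeTable()
i32 = ("basic", "i32")
matrix = ("array", 2, i32)
callback = ("function", (i32, matrix), i32)
```

<p>Keying the table on the canonical name means two separately constructed but identical types map to the same ID, which is the property runtime reflection would rely on.</p>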
]]></content>
      <categories>
        <category>XSharp</category>
      </categories>
      <tags>
        <tag>Compiler</tag>
        <tag>XSharp</tag>
      </tags>
  </entry>
  <entry>
    <title>第一篇博客文章</title>
    <url>/2023/02/06/%E7%AC%AC%E4%B8%80%E7%AF%87%E5%8D%9A%E5%AE%A2%E6%96%87%E7%AB%A0/</url>
    <content><![CDATA[<p>奋战了 2 个小时后，终于成功用 Hexo 搭建了一个小博客。 本博客仅用于个人生活学习的记录，并无商业用途，若有友链或者交流需要，请通过我的邮箱<strong>xxs_chy@outlook.com</strong>联系我</p>
]]></content>
      <categories>
        <category>Blog&#39;s configuration</category>
      </categories>
      <tags>
        <tag>Helloworld</tag>
        <tag>Blog&#39;s configuration</tag>
      </tags>
  </entry>
  <entry>
    <title>编译原理-数据流分析-冗余消除</title>
    <url>/2023/07/08/%E7%BC%96%E8%AF%91%E5%8E%9F%E7%90%86-%E6%95%B0%E6%8D%AE%E6%B5%81%E5%88%86%E6%9E%90-%E5%86%97%E4%BD%99%E6%B6%88%E9%99%A4/</url>
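    <content><![CDATA[<p>This post introduces a compiler optimization that eliminates redundant computation in program code: <strong>lazy code motion</strong>.</p>
<h2 id="什么是冗余消除">What is redundancy elimination</h2>
<p><strong>Redundancy elimination</strong> reduces the number of times an expression is evaluated, so that an expression such as <span class="math inline">\(x+y\)</span> is not recomputed again and again later in the code at a cost to performance.</p>
<p>Redundancy comes from three main sources:</p>
<ol type="1">
<li>Common subexpressions (<em>Common Subexpression</em>)</li>
<li>Loop-invariant expressions (<em>Loop Invariant</em>)</li>
<li>Partially redundant expressions (<em>Partially Redundant Expression</em>)</li>
</ol>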
<span id="more"></span>
<h3 id="全局公共子表达式">Global common subexpressions</h3>
<p>Suppose basic block <span class="math inline">\(B\)</span> contains an expression such as <span class="math inline">\(a+b\)</span>. If every path to <span class="math inline">\(B\)</span> has already evaluated <span class="math inline">\(a + b\)</span>, the expression is redundant in <span class="math inline">\(B\)</span>: it is a common subexpression and does not need to be recomputed in <span class="math inline">\(B\)</span>.</p>
<p><strong>Note</strong>: after <span class="math inline">\(a+b\)</span> is computed, its operands <span class="math inline">\(a,b\)</span> must not be redefined before <span class="math inline">\(B\)</span>; otherwise it is not an available expression.</p>
<h4 id="深层公共表达式">Nested common expressions</h4>
<p>For deeper subexpressions such as <span class="math inline">\((a + b) ^ c + d\)</span>, we can apply <strong>common-expression elimination</strong> repeatedly until no new common expressions appear. Alternatively, a search modelled on the constant-propagation framework, or pattern matching as in <strong>LLVM</strong>, achieves the same effect.</p>
<h3 id="循环不变表达式">Loop-invariant expressions</h3>
<p>If <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span> are not redefined inside loop <span class="math inline">\(L\)</span>, then <span class="math inline">\(a+b\)</span> is invariant with respect to <span class="math inline">\(L\)</span>. Such a loop-invariant expression can be hoisted out of the loop to avoid unnecessary computation. Here is an example:</p>
<figure class="highlight c"><table><tr><td class="code"><pre><span class="line"><span class="keyword">while</span> (c)&#123;</span><br><span class="line">    print(a + b)</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>can be transformed into:</p>
<figure class="highlight c"><table><tr><td class="code"><pre><span class="line">t = a + b</span><br><span class="line"><span class="keyword">while</span> (c)&#123;</span><br><span class="line">    print(t)</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>To make sure loop-invariant expressions in a while loop can be optimized, compilers usually represent:</p>
<figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">while</span> c &#123;</span><br><span class="line">    S;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>as:</p>
<figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">if</span> c &#123;</span><br><span class="line">    repeat</span><br><span class="line">        S;</span><br><span class="line">    until <span class="keyword">not</span> c</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>This way the loop-invariant expression can be placed just before the <em>repeat-until</em>, inside the <code>if</code>, so it is evaluated only when the loop body runs at least once.</p>
<h3 id="部分冗余表达式">Partially redundant expressions</h3>
<p>Consider the following basic-block structure: <img src="/images/PartialReductancyEx.png" alt="content" /></p>
<p>Suppose <span class="math inline">\(B_2\)</span> computes <span class="math inline">\(a+b\)</span>, <span class="math inline">\(B_3\)</span> does not, and <span class="math inline">\(B_4\)</span> computes <span class="math inline">\(a+b\)</span> again.</p>
<p>Then <span class="math inline">\(a+b\)</span> is redundant along <span class="math inline">\(B_1 \rightarrow B_2 \rightarrow B_4\)</span> but not along <span class="math inline">\(B_1 \rightarrow B_3 \rightarrow B_4\)</span>, so the expression is partially redundant at <span class="math inline">\(B_4\)</span>.</p>
<p>For such a partially redundant expression, we insert a new basic block between <span class="math inline">\(B_3\)</span> and <span class="math inline">\(B_4\)</span> that computes <span class="math inline">\(a+b\)</span>, which makes the computation in <span class="math inline">\(B_4\)</span> fully redundant.</p>
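<h3 id="懒惰代码移动算法">Lazy code motion</h3>
<h4 id="性质">Properties</h4>
<p>To handle partial redundancy we use the lazy code motion algorithm, which has the following properties:</p>
<ol type="1">
<li>All redundant computations of expressions that can be eliminated without duplicating code are eliminated.</li>
<li>The optimized program is correct and does not perform any computation that the original program would not perform.</li>
<li>Expressions are computed as <strong>late</strong> as possible. Delaying a computation shortens the lifetime of its value and the time it occupies a register, which is why the algorithm is called <em>lazy</em> code motion.</li>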
</ol>
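<h4 id="主要步骤">Main steps</h4>
<ol type="1">
<li><p>Use a backward data-flow analysis to find the expressions that are <em>anticipated</em> at each program point.</p><blockquote><p>An expression <span class="math inline">\(a+b\)</span> is <em>anticipated</em> at program point <span class="math inline">\(p\)</span> if every path leaving <span class="math inline">\(p\)</span> computes <span class="math inline">\(a+b\)</span> while <span class="math inline">\(a,b\)</span> still hold the values they have at <span class="math inline">\(p\)</span>.</p><p>Anticipation decides how early an expression can be placed; the earlier it is placed, the more redundancy can be eliminated.</p></blockquote></li>
<li><p>Place the computation of the expression at program points that satisfy the following condition: there exists a path on which this point is the first point where the expression is <em>anticipated</em>. We call the expression <em>available</em> at a program point if it is anticipated on every original path reaching that point; this is computed by a forward data-flow analysis.</p></li>
<li><p><em>Postpone</em> the expression. An expression can be <em>postponed</em> to a program point if, on <strong>every</strong> path reaching that point, the expression has been <em>anticipated</em> before the point but has not yet been used. This is also a forward data-flow analysis.</p></li>
<li><p>Finally, a simple backward data-flow analysis removes assignments to temporaries that are used only once in the program.</p></li>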
</ol>
<h4 id="理论代码">Data-flow equations</h4>
<h5 id="预期执行anticipated">Anticipated</h5>
<p>Direction: backward</p>
<p>Transfer function: <span class="math inline">\(f_B(x)=use_B \cup (x-kill_B)\)</span></p>
<p>Meet operator: <span class="math inline">\(\cap\)</span></p>
<h5 id="可用性available">Available</h5>
<p>Direction: forward</p>
<p>Transfer function: <span class="math inline">\(f_B(x)=(anticipated[B].in \cup x) - kill_B\)</span></p>
<p>Meet operator: <span class="math inline">\(\cap\)</span></p>
<h5 id="可后延postponable">Postponable</h5>
<p>Direction: forward</p>
<p>Here we define <span class="math inline">\(earliest[B]=anticipated[B].in - available[B].in\)</span></p>
<p>Transfer function: <span class="math inline">\(f_B(x)=(earliest[B] \cup x) - use_B\)</span> (an expression stops being postponable once it is used, so the set removed is <span class="math inline">\(use_B\)</span>, not <span class="math inline">\(kill_B\)</span>)</p>
<p>Meet operator: <span class="math inline">\(\cap\)</span></p>
<h5 id="被使用used">Used</h5>
<p>Direction: backward</p>
<p>Here we define <span class="math display">\[ latest[B]=(earliest[B] \cup postponable[B].in)
\cap \left(use_B \cup \neg\bigcap_{S \in succ(B)}\left(earliest[S]\cup postponable[S].in\right)\right) \]</span></p>
<p>Transfer function: <span class="math inline">\(f_B(x)=(use_B \cup x) - latest[B]\)</span></p>
<p>Meet operator: <span class="math inline">\(\cup\)</span></p>
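<p>As a rough illustration of the first of these analyses, the sketch below runs the anticipated-expressions equations on the diamond from the partial-redundancy example. The function name <code>anticipated</code> and the dictionary encoding of the flow graph are assumptions made for this sketch, and only a single expression <code>a+b</code> is tracked.</p>

```python
def anticipated(succ, use, kill, universe):
    """Backward analysis: IN[B] = use[B] | (OUT[B] - kill[B]), meet is intersection."""
    in_sets = {b: set(universe) for b in succ}  # optimistic start for a must-analysis
    changed = True
    while changed:
        changed = False
        for b in succ:
            if succ[b]:
                out = set(universe)
                for s in succ[b]:
                    out = out.intersection(in_sets[s])
            else:
                out = set()  # boundary: nothing is anticipated at the exit
            new_in = use[b] | (out - kill[b])
            if new_in != in_sets[b]:
                in_sets[b] = new_in
                changed = True
    return in_sets


# B1 branches to B2 and B3, which both flow into B4; B2 and B4 compute a+b.
succ = {"B1": ["B2", "B3"], "B2": ["B4"], "B3": ["B4"], "B4": []}
use = {"B1": set(), "B2": {"a+b"}, "B3": set(), "B4": {"a+b"}}
kill = {b: set() for b in succ}
result = anticipated(succ, use, kill, {"a+b"})
```

<p>Here <code>a+b</code> is anticipated at the entry of <code>B3</code> even though <code>B3</code> never computes it, which is what allows the computation to be moved onto that path. If <code>B3</code> redefined <code>a</code> (so that <code>kill["B3"]</code> contained <code>a+b</code>), the expression would no longer be anticipated at <code>B3</code> or <code>B1</code>.</p>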
]]></content>
      <categories>
        <category>Compiler Theory</category>
      </categories>
  </entry>
  <entry>
    <title>线性代数的小trick-0</title>
    <url>/2023/11/05/Linear-Algebra-Cheetsheet0/</url>
    <content><![CDATA[<h2 id="矩阵多项式">Matrix polynomials</h2>
<p>For a matrix <span class="math inline">\(A\)</span>, consider a polynomial in <span class="math inline">\(A\)</span> of degree <span class="math inline">\(n\)</span>: <span class="math inline">\(f(A) = c_nA^n+c_{n-1}A^{n-1}+\cdots+c_1A+c_0E\)</span><br />
Suppose we know the eigenvalues of <span class="math inline">\(A\)</span>: <span class="math inline">\(\lambda_1,\lambda_2, \cdots, \lambda_n\)</span></p>
<p>What, then, are the eigenvalues of <span class="math inline">\(f(A)\)</span>?</p>
<span id="more"></span>
<p>If <span class="math inline">\(\lambda_i\)</span> is an eigenvalue of <span class="math inline">\(A\)</span> with corresponding eigenvector <span class="math inline">\(\alpha\)</span>, then <span class="math inline">\(A^k\alpha = \lambda_i^k\alpha\)</span> for every <span class="math inline">\(k\)</span>, so: <span class="math display">\[f(A)\alpha = c_n\lambda_i^n\alpha+c_{n-1}\lambda_i^{n-1}\alpha+\cdots+c_1\lambda_i\alpha+c_0\alpha = f(\lambda_i)\alpha\]</span> Hence <span class="math inline">\(f(\lambda_i)\)</span> is an eigenvalue of <span class="math inline">\(f(A)\)</span>, and <span class="math inline">\(\alpha\)</span> is a corresponding eigenvector.</p>
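<p>A quick numerical check of this fact, written in plain Python so it needs no libraries. The matrix, the polynomial <code>f(x) = x^2 + 2x + 5</code>, and the helper names are arbitrary choices for illustration.</p>

```python
def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_vec(a, v):
    return [sum(a[i][k] * v[k] for k in range(len(v))) for i in range(len(a))]


def poly_of_matrix(coeffs, a):
    """Evaluate c0*E + c1*A + c2*A^2 + ... for coeffs = [c0, c1, c2, ...]."""
    n = len(a)
    result = [[0] * n for _ in range(n)]
    power = [[1 if i == j else 0 for j in range(n)] for i in range(n)]  # A^0 = E
    for c in coeffs:
        for i in range(n):
            for j in range(n):
                result[i][j] += c * power[i][j]
        power = mat_mul(power, a)
    return result


A = [[2, 1], [0, 3]]   # eigenvalues 2 and 3
coeffs = [5, 2, 1]     # f(x) = x^2 + 2x + 5
fA = poly_of_matrix(coeffs, A)

alpha = [1, 1]         # eigenvector of A for the eigenvalue 3
f_lambda = 3 * 3 + 2 * 3 + 5
```

<p>Multiplying <code>fA</code> by <code>alpha</code> gives the same vector as scaling <code>alpha</code> by <code>f_lambda</code>, in line with the derivation above.</p>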
<h2 id="余子式-和-代数余子式的区别">Minors vs. cofactors</h2>
<p>The minor is written <span class="math inline">\(M_{ij}\)</span>, and the cofactor is <span class="math inline">\(A_{ij} = (-1)^{i+j}M_{ij}\)</span>; they differ only in <strong>sign</strong>. Note that the adjugate matrix <span class="math inline">\(A^*\)</span> is defined in terms of cofactors.</p>
<h2 id="与秩有关的不等式">Rank inequalities</h2>
<p><span class="math display">\[r(AB) \le \min{(r(A), r(B))}\]</span> <span class="math display">\[r(A+B) \le r(A) + r(B)\]</span> <span class="math display">\[r(A,B) \le r(A) + r(B)\]</span> <span class="math display">\[r(AA^T) = r(A^TA) = r(A)\]</span></p>
<p>If <span class="math inline">\(A,B\)</span> are <span class="math inline">\(n\)</span> x <span class="math inline">\(n\)</span> matrices and <span class="math inline">\(AB=O\)</span>, then <span class="math display">\[r(A)+r(B) \le n\]</span></p>
<p>Let <span class="math inline">\(A^*\)</span> be the adjugate of an <span class="math inline">\(n\)</span> x <span class="math inline">\(n\)</span> matrix <span class="math inline">\(A\)</span>. Then: <span class="math display">\[r(A^*) = \begin{cases} n, &amp; r(A) = n \\ 1, &amp; r(A) = n-1 \\ 0, &amp; r(A) &lt; n - 1 \end{cases}\]</span></p>
<p>For block matrices we have:</p>
<p><span class="math display">\[
r\left(
\begin{array}{cc}
A &amp; O \\
O &amp; B
\end{array}
\right) = r(A) + r(B)
\]</span></p>
<p><span class="math display">\[
r\left(
\begin{array}{cc}
A &amp; C \\
O &amp; B
\end{array}
\right) \ge r(A) + r(B)
\]</span></p>
<p>Sylvester's inequality, for <span class="math inline">\(A\)</span> of size <span class="math inline">\(s\)</span> x <span class="math inline">\(n\)</span> and <span class="math inline">\(B\)</span> of size <span class="math inline">\(n\)</span> x <span class="math inline">\(m\)</span>: <span class="math display">\[r(AB) \ge r(A)+r(B)-n\]</span></p>
<p>For an <span class="math inline">\(n\)</span> x <span class="math inline">\(n\)</span> matrix <span class="math inline">\(A\)</span>, <span class="math inline">\(A^2=A\)</span> if and only if: <span class="math display">\[r(A)+r(E-A) = n\]</span></p>
<h2 id="引用">References</h2>
<p><a href="https://zhuanlan.zhihu.com/p/261152093">Zhihu article</a></p>
<p><a href="https://zhuanlan.zhihu.com/p/341263037">Zhihu article 1</a></p>
]]></content>
      <categories>
        <category>Linear Algebra</category>
      </categories>
      <tags>
        <tag>Math</tag>
        <tag>Linear Algebra</tag>
      </tags>
  </entry>
  <entry>
    <title>SAT-布尔可满足性理论</title>
    <url>/2023/11/23/SAT-Introduction/</url>
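    <content><![CDATA[<h2 id="什么是约束求解">What is constraint solving?</h2>
<p>Many real-world problems can be abstracted as constraint-based problems, where a <strong>constraint</strong> is simply a condition that must hold.<br />
Constraint solving means: given a set of constraints, return a solution if they are satisfiable; if they are not, return a reasonably small set of conflicting constraints (an unsatisfiable core).</p>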
<span id="more"></span>
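<p>Constraint solving in general is undecidable, but many sub-problems are decidable. For example, the system of inequalities <span class="math inline">\(a &lt; 0 \&amp;\&amp; b &gt; 0\)</span> has the satisfying solution <span class="math inline">\(a = -1, b = 1\)</span>. For <span class="math inline">\(a &lt; 0 \&amp;\&amp; a &gt; 0\)</span>, however, no assignment can satisfy the formula, so we report the conflict set <span class="math inline">\(\{1, 2\}\)</span>. Here <span class="math inline">\(1\)</span> refers to the first constraint <span class="math inline">\(a &lt; 0\)</span>, and <span class="math inline">\(\{1, 2\}\)</span> means both constraints hold together, which is clearly contradictory.</p>
<p>SAT, the Boolean satisfiability problem, is one kind of constraint-solving problem, and it is NP-complete. Solving systems of linear equations or inequalities is constraint solving too.</p>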
<h2 id="notations">Notations</h2>
<ul>
<li>Literal: a Boolean variable <span class="math inline">\(x\)</span> or its negation <span class="math inline">\(\neg x\)</span></li>
<li>Clause: a disjunction of literals, e.g. <span class="math inline">\(x \vee \neg y\)</span></li>
</ul>
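<p>SAT is then: given a set of clauses, find an assignment to the Boolean variables that makes every clause true.</p>
<p>In other words, make the conjunction of the clauses true: <span class="math inline">\((\neg x \vee y) \wedge x\)</span></p>
<h2 id="穷举法">Exhaustive search</h2>
<p>The simplest brute-force approach is to try every assignment one by one. With <span class="math inline">\(n\)</span> variables this takes <span class="math inline">\(O(2^n)\)</span> time, which is far too slow.</p>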
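<p>A minimal sketch of the exhaustive approach. The clause encoding is an assumption borrowed from the DIMACS convention: variable <code>k</code> is the integer <code>k</code> and its negation is <code>-k</code>. The function names are hypothetical.</p>

```python
from itertools import product


def satisfies(clauses, assignment):
    """True if every clause contains at least one literal that the assignment makes true."""
    return all(
        any(assignment[abs(lit)] == (lit == abs(lit)) for lit in clause)
        for clause in clauses
    )


def brute_force_sat(clauses, num_vars):
    """Try all 2^n assignments; return the first satisfying one, or None."""
    for values in product([False, True], repeat=num_vars):
        assignment = dict(zip(range(1, num_vars + 1), values))
        if satisfies(clauses, assignment):
            return assignment
    return None


# (not x or y) and x, with x = 1 and y = 2
example = brute_force_sat([[-1, 2], [1]], 2)
```

<p>The loop body runs up to <code>2 ** num_vars</code> times, which is the exponential cost mentioned above.</p>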
<h2 id="dpll">DPLL</h2>
<h3 id="冲突conflict检测">Conflict detection</h3>
<p>Take the partial assignment <span class="math inline">\(\{x = 0\}\)</span>. Under it, the clauses <span class="math inline">\(x \vee y\)</span> and <span class="math inline">\(x \vee \neg y\)</span> are in conflict: they reduce to <span class="math inline">\(y\)</span> and <span class="math inline">\(\neg y\)</span>, which cannot both be true. So we can rule this branch out without completing the assignment. This gives the pseudocode:</p>
<figure class="highlight c"><table><tr><td class="code"><pre><span class="line">sat(assign) &#123;</span><br><span class="line">  <span class="keyword">if</span> (conflict(assign)) <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">  <span class="keyword">if</span> (complete(assign)) <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line">  choose an unassigned x;</span><br><span class="line">  <span class="keyword">return</span> sat(assign ∪ &#123;x=<span class="number">0</span>&#125;) || sat(assign ∪ &#123;x=<span class="number">1</span>&#125;);</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<h3 id="标准推导方法">Standard inference rules</h3>
<ul>
<li>Unit propagation: if every other literal in a clause is false, the one remaining literal must be true</li>
<li>Unate propagation: once a clause contains a true literal, it can be removed from the clause set</li>
<li>Pure literal elimination: if a variable appears only in positive form or only in negated form, every clause containing it can be removed</li>
</ul>
<p>Combining the two techniques above, conflict detection and inference, we get the algorithm:</p>
<figure class="highlight c"><table><tr><td class="code"><pre><span class="line">dpll(assign) &#123;</span><br><span class="line">  assign&#x27; = assign_prop(assign);</span><br><span class="line">  <span class="keyword">if</span> (conflict(assign&#x27;)) <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">  <span class="keyword">if</span> (complete(assign&#x27;)) <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line">  choose an unassigned x;</span><br><span class="line">  <span class="keyword">return</span> dpll(assign&#x27; ∪ &#123;x=<span class="number">0</span>&#125;) || dpll(assign&#x27; ∪ &#123;x=<span class="number">1</span>&#125;);</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>Here <em>assign_prop</em> applies the inference rules.</p>
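<p>There are also preprocessing and resolution techniques, such as:</p>
<ul>
<li>Probing: if both x=0 and x=1 imply y=0, conclude y=0</li>
<li>Equivalence classes:<br />
detect equivalent clauses in advance and delete one of them, for example<br />
{1, 2, -3}<br />
{2, 1, -3}</li>
</ul>
<p>Early on there were algorithms based purely on resolution that never enumerate assignments (the DP algorithm), but they are usually significantly slower than DPLL.</p>
<p>There are many heuristics for variable selection, some based on the clause set and some based on search history; I won't go into them here.</p>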
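<p>The DPLL pseudocode above can be turned into a small runnable sketch. This is one possible reading, not a reference implementation: clauses use the DIMACS-style integer encoding (an assumption), <code>unit_propagate</code> plays the role of <em>assign_prop</em> using unit propagation only, and the branching variable is simply the lowest-numbered free one.</p>

```python
def unit_propagate(clauses, assignment):
    """Apply unit propagation to a fixpoint; return None if a clause is falsified."""
    assignment = dict(assignment)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unassigned = []
            satisfied = False
            for lit in clause:
                var = abs(lit)
                if var in assignment:
                    if assignment[var] == (lit == var):
                        satisfied = True
                        break
                else:
                    unassigned.append(lit)
            if satisfied:
                continue
            if not unassigned:
                return None               # every literal is false: conflict
            if len(unassigned) == 1:      # unit clause: the last literal must be true
                lit = unassigned[0]
                assignment[abs(lit)] = (lit == abs(lit))
                changed = True
    return assignment


def dpll(clauses, num_vars, assignment=None):
    """Return a satisfying assignment as a dict, or None if there is none."""
    assignment = unit_propagate(clauses, assignment or {})
    if assignment is None:
        return None
    free = [v for v in range(1, num_vars + 1) if v not in assignment]
    if not free:
        return assignment                 # complete and conflict-free
    x = free[0]
    for value in (False, True):           # try x=0 first, then x=1
        trial = dict(assignment)
        trial[x] = value
        result = dpll(clauses, num_vars, trial)
        if result is not None:
            return result
    return None
```

<p>For example, <code>dpll([[-1, 2], [1]], 2)</code> never branches: the unit clause forces x, which in turn forces y.</p>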
<h2 id="cdcl----conflict-driven-clause-learning">CDCL -- Conflict-Driven Clause Learning</h2>
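<p>The core idea is simple: when we hit a conflict, negate the Boolean assignments <strong>related</strong> to the conflict and add the result to the clause set as a new clause. If <span class="math inline">\(x \wedge y \wedge z\)</span> caused the conflict, we add the clause <span class="math inline">\(\neg x \vee \neg y \vee \neg z\)</span>. This way we no longer need to record which assignments have been visited: each time we can pick any remaining variable and value, because inference from the newly added constraints guarantees that a conflicting assignment already explored will not occur again.</p>
<p>This raises a new question: which cut should we choose so that the decision nodes are disconnected from the conflict? See the PKU slides: <img src="/images/sat-cut.png" alt="img" /></p>
<p>There is also an interactive example that explains the algorithm: <a href="https://cse442-17f.github.io/Conflict-Driven-Clause-Learning/">Conflict Driven Clause Learning</a></p>
<p>Here is an explanation of non-chronological backtracking:</p>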
<ul>
<li>Non-Chronological Backtracking<br />
When CDCL learns a clause, it backtracks to the clause’s asserting level.<br />
You can just think of this meaning that it backtracks to the latest guess that affects a literal in the learned clause. Since this clause has x1 and x5, and x1 was the latest one to be guessed in that clause, we backtrack to when we set x1 to be True. When we backtrack to this level, the learned clause will immediately be available for BCP letting us put into action what we just learned!</li>
</ul>
<p>Finally, we have the pseudocode:</p>
<figure class="highlight c"><table><tr><td class="code"><pre><span class="line">cdcl() &#123;</span><br><span class="line">  assign = &#123;&#125;;</span><br><span class="line">  <span class="keyword">while</span> (<span class="literal">true</span>) &#123;</span><br><span class="line">    assign&#x27; = assign_prop(assign);</span><br><span class="line">    <span class="keyword">if</span> (conflict(assign&#x27;)) &#123;</span><br><span class="line">      <span class="keyword">if</span> (assign is empty) <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">      add new constraint;</span><br><span class="line">      backtrack;</span><br><span class="line">    &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">      <span class="keyword">if</span> (complete(assign&#x27;)) <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line">      choose an unassigned x;</span><br><span class="line">      assign = assign&#x27;;</span><br><span class="line">      add x into assign;</span><br><span class="line">    &#125;</span><br><span class="line">  &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<blockquote>
<p>This method is what made later SAT/SMT solvers so fast.</p>
</blockquote>
<h2 id="reference">Reference</h2>
<p><a href="https://xiongyingfei.github.io/SA/2022/12_SAT.pdf">PKU Software Analysis course</a> <a href="https://cse442-17f.github.io/Conflict-Driven-Clause-Learning/">Conflict Driven Clause Learning</a></p>
]]></content>
      <categories>
        <category>PL</category>
      </categories>
      <tags>
        <tag>PL</tag>
        <tag>SAT</tag>
        <tag>Constraint Solver</tag>
      </tags>
  </entry>
  <entry>
    <title>SMT-可满足性模理论</title>
    <url>/2023/11/23/SMT-Introduction/</url>
    <content><![CDATA[<h2 id="smt-solver">SMT Solver</h2>
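<p>Following the previous post on SAT, let's look at SMT, Satisfiability Modulo Theories. SAT answers whether a formula in Boolean logic is satisfiable, such as <span class="math inline">\(x \vee y\)</span>. In practice there are many other theories, such as the theory of real numbers with formulas like <span class="math inline">\(a &lt; c \vee a &gt; b\)</span>. How do we decide satisfiability for formulas like these?</p>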
<span id="more"></span>
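<p>This is where Satisfiability Modulo Theories comes in:</p>
<ul>
<li>Given a set of theories and a logic, decide whether a formula is satisfiable under the interpretation of those theories</li>
<li>Existing theories are usually first-order, that is, all their axioms are first-order</li>
</ul>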
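<p>One example is EUF (Equality with Uninterpreted Functions), whose axioms are congruence, <span class="math inline">\(a = b \rightarrow f(a) = f(b)\)</span>,<br />
together with the properties of an equivalence relation (reflexivity, symmetry, transitivity).<br />
The converse <span class="math inline">\(f(a) = f(b) \rightarrow a = b\)</span> does not hold in general, since nothing is assumed about <span class="math inline">\(f\)</span>.</p>
<p>Other examples are theories for systems of linear equations and for arithmetic.</p>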
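<p>To show what reasoning in EUF looks like, here is a naive congruence-closure sketch. It is only an illustration under stated assumptions: terms are encoded as strings (constants) or tuples such as <code>("f", "a")</code> for <code>f(a)</code>, the function <code>euf_satisfiable</code> is a hypothetical name, and the quadratic loop is far simpler than what real solvers use.</p>

```python
def collect_terms(term, acc):
    """Add a term and all of its subterms to the set acc."""
    acc.add(term)
    if isinstance(term, tuple):
        for arg in term[1:]:
            collect_terms(arg, acc)


def euf_satisfiable(equalities, disequalities):
    """Decide a conjunction of equalities and disequalities between EUF terms."""
    terms = set()
    for left, right in equalities + disequalities:
        collect_terms(left, terms)
        collect_terms(right, terms)
    parent = {t: t for t in terms}  # union-find forest over all terms

    def find(t):
        while parent[t] != t:
            t = parent[t]
        return t

    def congruent(s, t):
        # Same function symbol and pairwise-equal arguments.
        return (s[0] == t[0] and len(s) == len(t)
                and all(find(x) == find(y) for x, y in zip(s[1:], t[1:])))

    for left, right in equalities:
        parent[find(left)] = find(right)

    applications = [t for t in terms if isinstance(t, tuple)]
    changed = True
    while changed:                   # apply the congruence axiom to a fixpoint
        changed = False
        for s in applications:
            for t in applications:
                if find(s) != find(t) and congruent(s, t):
                    parent[find(s)] = find(t)
                    changed = True
    return all(find(left) != find(right) for left, right in disequalities)


fa, fb, fc = ("f", "a"), ("f", "b"), ("f", "c")
```

<p>With these terms, <code>a = b</code> together with <code>f(a) != f(b)</code> is unsatisfiable, because congruence merges <code>f(a)</code> and <code>f(b)</code> once <code>a</code> and <code>b</code> are in the same class.</p>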
<h2 id="eager-method">Eager Method</h2>
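<p>We can encode an SMT problem as a SAT problem, for example:</p>
<ul>
<li><p><span class="math inline">\(f(a) = f(c) \wedge f(b) \ne c \wedge a \ne b\)</span>: we replace <span class="math inline">\(f(a)\)</span> with <span class="math inline">\(A\)</span>, <span class="math inline">\(f(b)\)</span> with <span class="math inline">\(B\)</span>, and <span class="math inline">\(f(c)\)</span> with <span class="math inline">\(C\)</span>, so the formula becomes <span class="math inline">\(A = C \wedge B \ne c \wedge a \ne b\)</span>. Then, from properties of the EUF theory such as transitivity, we obtain a large number of propositions that SAT can handle:</p>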
<p><span class="math display">\[
P_{A=c} \wedge
P_{B=c} \rightarrow P_{A=B}
\]</span></p>
<p><span class="math display">\[
P_{A=B} \wedge
P_{B=c} \rightarrow P_{A=c}
\]</span></p>
<p><span class="math display">\[ \cdots \]</span></p></li>
</ul>
<p>As this example suggests, the eager method has many problems. Again from the PKU Software Analysis slides: <img src="/images/smt-eager.png" alt="img" /></p>
<h2 id="lazy-method">Lazy Method</h2>
<p>Treat the theory solver as a black box that outputs SAT/UNSAT. We first view the formula as a SAT formula and then proceed as follows. For a formula such as <span class="math inline">\(f(a) = b \wedge (g(b) \ne c \vee g(c) = d) \wedge k(d) = a\)</span>:</p>
<ul>
<li>The SAT solver views it as <span class="math inline">\(A \wedge (\neg B \vee C) \wedge D\)</span></li>
<li>The SAT solver returns SAT with the assignment <span class="math inline">\(\{A = 1, B = 0, C = 1, D = 1\}\)</span></li>
<li>The formulas corresponding to <span class="math inline">\(A,\neg B,C,D\)</span> are then passed to the theory solver</li>
<li>The theory solver returns SAT/UNSAT
<ul>
<li>SAT: let the SAT solver keep assigning until the assignment is complete</li>
<li>UNSAT: the SAT assignment has hit a conflict, so add a new constraint to the clause set; if the clause set becomes unsatisfiable, the whole formula is UNSAT</li>
</ul></li>
</ul>
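<p>Strengths and problems of the lazy method: <img src="/images/smt-lazy.png" alt="img1" /></p>
<p>How do we fix these problems? We only need to give the theory solver a richer interface; this is the <em>DPLL(T)</em> algorithm (shown on the PKU slides rather than rewritten here): <img src="/images/smt-dpllt.png" alt="img2" /> Note that finding the predecessors of the conflict here works the same way as finding a suitable cut in the SAT post.</p>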
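<h2 id="混合多个理论----nelson-oppen-方法">Combining theories -- the Nelson-Oppen method</h2>
<p>In essence this is the same as combining SAT with another theory: we rewrite a literal that mixes theories into several literals, let each theory handle its own literals, and then combine the results of the different theory solvers. That is, the theories exchange information through interface properties. Usually, however, we cannot enumerate all interface properties (there are typically infinitely many).</p>
<blockquote>
<p>Note: interface properties are the set of propositions that both theories contain</p>
</blockquote>
<p>The <span class="math inline">\(DPLL(T)\)</span> algorithm can handle a mix of several theories, but only if they share no variables. What do we do when variables are shared?</p>
<p>The Nelson-Oppen method solves this problem, under some restrictions:</p>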
<ul>
<li>The two theories share no function or predicate symbols other than equality</li>
<li>Each theory holds over some infinite domain (the theories are stably infinite)</li>
<li>The theories are convex:
<ul>
<li>A theory is convex if, for every quantifier-free conjunction of literals <span class="math inline">\(F\)</span>, whenever <span class="math display">\[F \rightarrow \bigvee_{i=1}^n u_i = v_i\]</span> then <span class="math display">\[F \rightarrow u_i = v_i \text{ for some } i \in \{1,\cdots, n\}\]</span> A counterexample over the integers: for <span class="math inline">\(F: 1 \le z \wedge z \le 2 \wedge u = 1 \wedge v = 2\)</span> we have <span class="math inline">\(F \rightarrow z = u \vee z = v\)</span>,<br />
but neither <span class="math inline">\(F \rightarrow z = u\)</span> nor <span class="math inline">\(F \rightarrow z = v\)</span> can be derived.<br />
For convex theories we only need to consider <strong>equalities between variables</strong>, and there are finitely many of those.</li>
</ul></li>
</ul>
<p>Given these restrictions, for a formula that mixes theories we only need to walk the AST bottom-up and replace subtrees belonging to the other theory with variables. For the following example, suppose we have two theory solvers, EUF and linear equations: <span class="math display">\[f(f(x) + f(x)) = 2a\]</span> <span class="math display">\[f(1) = 1\]</span> <span class="math display">\[f(2) = a\]</span> <span class="math display">\[2x = x + 1\]</span></p>
<p>Let <span class="math inline">\(e_0 = f(x), e_1 = e_0 + e_0, e_2 = 1, e_3 = 2, e_4 = 2x, e_5 = x + 1, e_6 = 2a\)</span>. Then for EUF we have:</p>
<p><span class="math display">\[e_0 = f(x)\]</span> <span class="math display">\[f(e_1) = e_6\]</span> <span class="math display">\[f(e_2) = e_2\]</span> <span class="math display">\[f(e_3) = a\]</span> <span class="math display">\[e_4 = e_5\]</span></p>
<p>In the theory of linear equations we have:</p>
<p><span class="math display">\[e_1 = e_0 + e_0\]</span> <span class="math display">\[e_2 = 1\]</span> <span class="math display">\[e_3 = 2\]</span> <span class="math display">\[e_4 = 2x\]</span> <span class="math display">\[e_5 = x + 1\]</span> <span class="math display">\[e_6 = 2a\]</span> <span class="math display">\[2x = x + 1\]</span></p>
<p>So the shared variables are <span class="math inline">\(V = \{x, e_0, e_1, e_2, e_3, e_4, e_5, e_6\}\)</span></p>
<p>The linear-equation solver yields: <span class="math display">\[x = 1, e_4 = 2, e_5 = 2\]</span></p>
<p>After EUF processes the interface properties, solving again yields:</p>
<p><span class="math display">\[x = e_2 = 1, e_0 = f(x) = f(e_2) = 1, e_4 = e_5 = e_3 = 2\]</span></p>
<p>The linear-equation solver then yields:</p>
<p><span class="math display">\[e_1 = e_0 + e_0 = 2\]</span></p>
<p>EUF then derives:</p>
<p><span class="math display">\[e_3 = e_1 = 2 \rightarrow f(e_1) = f(e_3) \rightarrow e_6 = a\]</span></p>
<p>Finally the linear-equation solver solves <span class="math inline">\(e_6 = 2a = a\)</span> and obtains <span class="math inline">\(a = 0\)</span>.</p>
<p>Keep in mind: walking the AST bottom-up, only <strong>subtrees of the other theory</strong> are replaced with variables, not every subtree.</p>
<p>Setting the example aside, what remains is a simple procedure: iterate over the pairs of variables <span class="math inline">\(x,y\)</span> in <span class="math inline">\(V\)</span> and solve <span class="math inline">\(F \wedge x \ne y\)</span>; if the result is UNSAT, then <span class="math inline">\(x=y\)</span> holds.</p>
<blockquote>
<p>Concrete theories usually have efficient implementations</p>
</blockquote>
<h3 id="对于非凸包">Non-convex theories</h3>
<p>Whenever we meet a disjunction of equalities, try each equality in turn: if any one gives SAT, the whole is SAT; if all give UNSAT, the whole is UNSAT.</p>
<h2 id="application">Application</h2>
<p><a href="https://github.com/Z3Prover/z3">Z3</a> is a solver built on SMT</p>
<p><a href="https://github.com/AliveToolkit/alive2">Alive2</a> uses <em>Z3</em> to verify LLVM IR programs and refinement relations</p>
<h2 id="reference">Reference</h2>
<p><a href="https://xiongyingfei.github.io/SA_new/2023/slides/slides14_SMT.pdf">PKU Software Analysis course</a></p>
]]></content>
      <categories>
        <category>PL</category>
      </categories>
      <tags>
        <tag>PL</tag>
        <tag>Constraint Solver</tag>
        <tag>SMT</tag>
      </tags>
  </entry>
  <entry>
    <title>CS149 Asst3 -- CUDA Renderer</title>
    <url>/2024/04/03/CS149-CUDA-Renderer/</url>
    <content><![CDATA[<h2 id="introduction">1. Introduction</h2>
<p>It's always hard to write code for parallel programs, and harder to write correct and fast code on GPU. Writing a simple <a href="https://github.com/stanford-cs149/asst3">CUDA Renderer</a> would be an opportunity to practice.</p>
<span id="more"></span>
<p>In this article, the CUDA version is 12.1, the GPU is RTX 3090.</p>
<h2 id="task">2. Task</h2>
<p>Given the positions, colors (RGB) and other attributes of a set of circles (which may be transparent), you need to implement a render function for these circles that is <strong>correct</strong> and as <strong>fast</strong> as you can make it.</p>
<p>If we didn't care about the <strong>order</strong> of the circles, this would be an easy programming assignment. However, the circles can be <strong>transparent</strong>, so each circle must be blended with the color already in the pixel. The renderer uses the following math: <span class="math display">\[C_{new} = \alpha C_{i} + (1 - \alpha) C_{old}\]</span> This composition is not commutative, so the circles must be drawn onto each pixel in the correct order:</p>
<figure>
<img src="https://raw.githubusercontent.com/stanford-cs149/asst3/master/handout/order.jpg" alt="" /><figcaption>order</figcaption>
</figure>
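<p>A tiny numeric check of why order matters, using the blending formula above. The colors, the alpha value, and the <code>blend</code> helper are made up for illustration.</p>

```python
def blend(old, new, alpha):
    """C_new = alpha * C_i + (1 - alpha) * C_old, applied per channel."""
    return tuple(alpha * n + (1 - alpha) * o for n, o in zip(new, old))


white = (1.0, 1.0, 1.0)   # background
red = (1.0, 0.0, 0.0)
blue = (0.0, 0.0, 1.0)

red_then_blue = blend(blend(white, red, 0.5), blue, 0.5)
blue_then_red = blend(blend(white, blue, 0.5), red, 0.5)
```

<p>The two results differ, so a renderer that lets threads blend circles onto a pixel in arbitrary order would produce the wrong image.</p>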
<h2 id="extract-parallelism-offical-prefix-sum">3. Extract Parallelism (Official) (Prefix-Sum)</h2>
<p>So how can we render in parallel? A basic insight is that every pixel can be drawn independently. Since pixels share no state, each thread can render one pixel, going through the circles in order. The kernel function looks like this:</p>
<figure class="highlight cpp"><table><tr><td class="code"><pre><span class="line"><span class="function"><span class="type">void</span> <span class="title">renderPixel</span><span class="params">()</span> </span>&#123;</span><br><span class="line">  <span class="type">int</span> index = threadIdx.....;</span><br><span class="line">  <span class="type">int</span> pixelX, pixelY;</span><br><span class="line">  <span class="keyword">if</span> (index &gt;= total_pixels) <span class="comment">// valid indices are 0 .. total_pixels - 1</span></span><br><span class="line">    <span class="keyword">return</span>;</span><br><span class="line"></span><br><span class="line">  <span class="keyword">for</span> (circle : circles)</span><br><span class="line">    <span class="keyword">if</span> (pixel in circle) <span class="comment">// expensive</span></span><br><span class="line">      <span class="built_in">blend</span>(circle, pixelX, pixelY);</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>Another insight is that we can test a batch of circles against the pixel at once, one circle per thread. But how do we synchronize the threads and combine the results of one batch?</p>
<p>The <em>parallel prefix-sum algorithm</em> fits here. Each thread in the batch computes whether the pixel intersects one circle and stores the result in a boolean array <code>incircle</code>. Computing the prefix sum of <code>incircle</code> in parallel gives a result array <code>indexes</code> and the count of intersected circles. <code>indexes[i]</code> is the number of <strong>intersected</strong> circles that come before the ith circle, which is exactly the slot where circle i belongs in the compacted list. Knowing the indexes of the circles that intersect, we can parallelize the renderer further (pseudocode):</p>
<figure class="highlight cpp"><table><tr><td class="code"><pre><span class="line"><span class="function"><span class="type">void</span> <span class="title">renderPixel</span><span class="params">()</span> </span>&#123;</span><br><span class="line">  <span class="type">int</span> index = threadIdx.....;</span><br><span class="line">  <span class="type">int</span> pixelX, pixelY;</span><br><span class="line">  <span class="type">int</span> scanIndex = ...;</span><br><span class="line">  <span class="type">int</span> countIntersected;</span><br><span class="line">  <span class="keyword">if</span> (index &gt;= total_pixels)</span><br><span class="line">    <span class="keyword">return</span>;</span><br><span class="line"></span><br><span class="line">  __shared__ <span class="type">int</span> incircle[SCANNUM];</span><br><span class="line">  __shared__ <span class="type">int</span> indexes[SCANNUM];</span><br><span class="line">  __shared__ <span class="type">int</span> circles[SCANNUM];</span><br><span class="line"></span><br><span class="line">  <span class="comment">// Process the circles in batches of SCANNUM</span></span><br><span class="line">  <span class="keyword">for</span> (<span class="type">int</span> i = <span class="number">0</span>; i &lt; numCircles; i += SCANNUM) &#123;</span><br><span class="line">    <span class="keyword">for</span> (one of circle in every SCANNUM circles) &#123;</span><br><span class="line">      incircle[scanIndex] = (pixel in circle);</span><br><span class="line"></span><br><span class="line">      __syncthreads();</span><br><span class="line">      <span class="built_in">prefixSumParallel</span>(&amp;countIntersected, indexes);</span><br><span class="line">      __syncthreads();</span><br><span class="line"></span><br><span class="line">      <span class="comment">// Put into corresponding place (only circles that intersect)</span></span><br><span class="line">      <span class="keyword">if</span> (incircle[scanIndex])</span><br><span class="line">        circles[indexes[scanIndex]] = scanIndex + i;</span><br><span class="line"></span><br><span class="line">      __syncthreads();</span><br><span class="line"></span><br><span class="line">      <span class="keyword">if</span> (scanIndex == <span class="number">0</span>)</span><br><span class="line">          <span class="keyword">for</span> (<span class="type">int</span> j = <span class="number">0</span>; j &lt; countIntersected; j++)</span><br><span class="line">            <span class="built_in">drawCircle</span>(circles[j]);</span><br><span class="line">    &#125;</span><br><span class="line">  &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
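<p>To make the roles of <code>incircle</code>, <code>indexes</code> and the compacted <code>circles</code> array concrete, here is a sequential reference of the scan-and-compact step. It is a minimal sketch rather than the renderer's actual code: the function names <code>exclusiveScan</code> and <code>compactCircles</code> are hypothetical, and the loops run on the CPU where the pseudocode above would use one thread per circle.</p>
<figure class="highlight cpp"><table><tr><td class="code"><pre><span class="line">#include &lt;cstddef&gt;</span><br><span class="line">#include &lt;vector&gt;</span><br><span class="line"></span><br><span class="line"><span class="comment">// Exclusive prefix sum: indexes[i] = number of set flags before position i.</span></span><br><span class="line"><span class="comment">// Returns the total number of set flags.</span></span><br><span class="line">int exclusiveScan(const std::vector&lt;int&gt;&amp; incircle, std::vector&lt;int&gt;&amp; indexes) &#123;</span><br><span class="line">    int running = 0;</span><br><span class="line">    indexes.resize(incircle.size());</span><br><span class="line">    for (std::size_t i = 0; i &lt; incircle.size(); i++) &#123;</span><br><span class="line">        indexes[i] = running;</span><br><span class="line">        running += incircle[i];</span><br><span class="line">    &#125;</span><br><span class="line">    return running;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// Gather the ids of the circles whose flag is set, keeping their order.</span></span><br><span class="line">std::vector&lt;int&gt; compactCircles(const std::vector&lt;int&gt;&amp; incircle) &#123;</span><br><span class="line">    std::vector&lt;int&gt; indexes;</span><br><span class="line">    int count = exclusiveScan(incircle, indexes);</span><br><span class="line">    std::vector&lt;int&gt; circles(count);</span><br><span class="line">    for (std::size_t i = 0; i &lt; incircle.size(); i++)</span><br><span class="line">        if (incircle[i])  <span class="comment">// only intersecting circles claim a slot</span></span><br><span class="line">            circles[indexes[i]] = static_cast&lt;int&gt;(i);</span><br><span class="line">    return circles;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>For the flags <code>&#123;0, 1, 1, 0, 0, 1&#125;</code>, the scan yields <code>indexes = &#123;0, 0, 1, 2, 2, 2&#125;</code> with a count of 3, and the compacted list is <code>&#123;1, 2, 5&#125;</code>. Only circles whose flag is set write into the output, so the draw order of the intersecting circles is preserved.</p>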
<p>This solution looks generally good, but it <strong>cannot</strong> reach full points on my machine. My guess is that spawning and synchronizing a large number of threads is expensive, especially when there is just a single circle in the scene.</p>
<p>To avoid the synchronization and the unnecessary threads, I propose another method.</p>
<hr />
<h2 id="extract-parallelism-split-space-into-boxes">4. Extract Parallelism (Split Space into Boxes)</h2>
<p>Confronted with this task, the first thing that occurred to me was a <strong>quad-tree</strong>, which splits space into multiple small boxes dynamically so that we only need to check whether the entities inside the same small box intersect. In this way, we avoid computing intersections for every entity in the whole space. This technique is often applied in game engines.</p>
<p>In the solution above, to render a pixel we need to check all circles in the scene and see whether they intersect the current pixel. That's <strong>UNNECESSARY</strong>. If we split the scene into non-overlapping boxes, compute the circles that intersect each box, and for every pixel only try to draw the circles of the box that contains it, many unnecessary <code>drawCircle</code> calls can be avoided. With this intuition, I hand-crafted a fast CUDA renderer that splits the 2D scene space. In this section, I will show how I optimized my renderer step by step.</p>
<h3 id="split-space-on-cpu">4.1 Split Space on CPU</h3>
<p>Following the KISS principle in software engineering, we'd better write correct code first and then consider improving it.</p>
<p>A trivial solution on the CPU is to split the space into equally sized boxes, compute the intersections between circles and boxes, and copy the results to the GPU.</p>
<p>Basic procedure is like: <figure class="highlight c"><table><tr><td class="code"><pre><span class="line">allocate <span class="string">&quot;circlesInBox&quot;</span> <span class="built_in">array</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">for</span> every circle</span><br><span class="line">    check which boxes overlap with the circle</span><br><span class="line">    <span class="keyword">for</span> overlapped boxes</span><br><span class="line">        put this circle index into circlesInBox[box]</span><br><span class="line"></span><br><span class="line">transfer data from CPU to GPU</span><br><span class="line"></span><br><span class="line"><span class="keyword">for</span> every box</span><br><span class="line">    kernel_render_on_box&lt;&lt;&lt;...&gt;&gt;&gt;(box)</span><br></pre></td></tr></table></figure></p>
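<p>The first half of this procedure, filling <code>circlesInBox</code> on the host, could be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the <code>Circle</code> struct, the function name <code>binCircles</code>, the normalized [0, 1] scene coordinates and the use of the circle's bounding square for the overlap check are all hypothetical choices.</p>
<figure class="highlight cpp"><table><tr><td class="code"><pre><span class="line">#include &lt;algorithm&gt;</span><br><span class="line">#include &lt;cmath&gt;</span><br><span class="line">#include &lt;cstddef&gt;</span><br><span class="line">#include &lt;vector&gt;</span><br><span class="line"></span><br><span class="line"><span class="comment">// Hypothetical circle type: center and radius in normalized [0, 1] coordinates.</span></span><br><span class="line">struct Circle &#123; float x, y, r; &#125;;</span><br><span class="line"></span><br><span class="line"><span class="comment">// Bin circles into an n x n grid of equal boxes covering the unit square.</span></span><br><span class="line"><span class="comment">// A circle is put into every box that its bounding square overlaps, which is</span></span><br><span class="line"><span class="comment">// conservative: a box may list a circle that only comes close to it.</span></span><br><span class="line">std::vector&lt;std::vector&lt;int&gt;&gt; binCircles(const std::vector&lt;Circle&gt;&amp; circles, int n) &#123;</span><br><span class="line">    std::vector&lt;std::vector&lt;int&gt;&gt; circlesInBox(static_cast&lt;std::size_t&gt;(n) * n);</span><br><span class="line">    for (int c = 0; c &lt; static_cast&lt;int&gt;(circles.size()); c++) &#123;</span><br><span class="line">        const Circle&amp; k = circles[c];</span><br><span class="line">        int minX = static_cast&lt;int&gt;(std::floor((k.x - k.r) * n));</span><br><span class="line">        int maxX = static_cast&lt;int&gt;(std::floor((k.x + k.r) * n));</span><br><span class="line">        int minY = static_cast&lt;int&gt;(std::floor((k.y - k.r) * n));</span><br><span class="line">        int maxY = static_cast&lt;int&gt;(std::floor((k.y + k.r) * n));</span><br><span class="line">        if (maxX &lt; 0 || maxY &lt; 0 || minX &gt;= n || minY &gt;= n)</span><br><span class="line">            continue;  <span class="comment">// entirely outside the scene</span></span><br><span class="line">        minX = std::max(minX, 0);</span><br><span class="line">        minY = std::max(minY, 0);</span><br><span class="line">        maxX = std::min(maxX, n - 1);</span><br><span class="line">        maxY = std::min(maxY, n - 1);</span><br><span class="line">        for (int by = minY; by &lt;= maxY; by++)</span><br><span class="line">            for (int bx = minX; bx &lt;= maxX; bx++)</span><br><span class="line">                circlesInBox[by * n + bx].push_back(c);  <span class="comment">// ascending c keeps draw order</span></span><br><span class="line">    &#125;</span><br><span class="line">    return circlesInBox;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>Because the circles are visited in ascending index order, every per-box list stays sorted by circle index, which matters when circles have to be blended in order. Before the transfer to the GPU, the nested vectors would still have to be flattened into contiguous arrays.</p>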
<p>This method scored <strong>20</strong> on my machine.</p>
<p>When I analyzed the performance bottleneck, my first thought was that <code>cudaMalloc</code> and the <code>cudaMemcpy</code> between host and device are slow and costly. Even worse, <code>cudaMemcpy</code> forces all threads on the device to synchronize, which is really, really <strong>slow</strong>. Meanwhile, launching a separate kernel for every box is also expensive.</p>
<p>Let's verify this guess. Run <code>nsys</code> on the renderer and look at the profiling information: <img src="/images/CUDA-Renderer/cpu-split-profile.png" /></p>
<p>Obviously, the latency of <code>cudaMalloc</code> and <code>cudaMemcpy</code> is a big overhead. Let's resolve this first!</p>
<h3 id="copy-the-memory-asynchronously">4.2 Copy the memory asynchronously</h3>
<p>The first thing I did was to replace <code>cudaMemcpy</code> with <code>cudaMemcpyAsync</code>. The <a href="https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior__memcpy-async">documentation</a> warns, though:</p>
<blockquote>
<ol start="3" type="1">
<li>If pageable memory must first be staged to pinned memory, the driver may synchronize with the stream and stage the copy into pinned memory.</li>
</ol>
</blockquote>
<p>In practice, however, <code>cudaMemcpyAsync</code> always runs asynchronously on my machine. This helps reduce and hide the latency:</p>
<p><img src="/images/CUDA-Renderer/cpu-performance.png" /></p>
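<p>The warning quoted above concerns pageable host memory. One common way to keep the copy from being staged is to allocate the host buffer as pinned (page-locked) memory with <code>cudaMallocHost</code> and enqueue the copy on a stream. The following is a minimal sketch, not the code used in this renderer: the buffer names and the size are hypothetical, and error checking is omitted for brevity.</p>
<figure class="highlight cpp"><table><tr><td class="code"><pre><span class="line">#include &lt;cstddef&gt;</span><br><span class="line">#include &lt;cuda_runtime.h&gt;</span><br><span class="line"></span><br><span class="line">int main() &#123;</span><br><span class="line">    const size_t bytes = (1 &lt;&lt; 20) * sizeof(int);  <span class="comment">// hypothetical buffer size</span></span><br><span class="line">    int* hostBuf = nullptr;</span><br><span class="line">    int* devBuf = nullptr;</span><br><span class="line">    cudaStream_t stream;</span><br><span class="line"></span><br><span class="line">    cudaStreamCreate(&amp;stream);</span><br><span class="line">    cudaMallocHost(&amp;hostBuf, bytes);  <span class="comment">// pinned (page-locked) host memory</span></span><br><span class="line">    cudaMalloc(&amp;devBuf, bytes);</span><br><span class="line"></span><br><span class="line">    for (size_t i = 0; i &lt; bytes / sizeof(int); i++)</span><br><span class="line">        hostBuf[i] = static_cast&lt;int&gt;(i);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Enqueue the copy on the stream; pinned memory needs no staging copy.</span></span><br><span class="line">    cudaMemcpyAsync(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice, stream);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// ... the host is free to do other work here, e.g. prepare the next buffer ...</span></span><br><span class="line"></span><br><span class="line">    cudaStreamSynchronize(stream);  <span class="comment">// wait until the copy has finished</span></span><br><span class="line"></span><br><span class="line">    cudaFree(devBuf);</span><br><span class="line">    cudaFreeHost(hostBuf);</span><br><span class="line">    cudaStreamDestroy(stream);</span><br><span class="line">    return 0;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>Kernels launched on the same stream after the copy are ordered behind it, so the explicit <code>cudaStreamSynchronize</code> is only needed when the host itself has to wait for the data to arrive.</p>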
<p>This method outperforms <code>cpu-ref</code> in speed on some tests, but it's not fast enough. On the <code>rand100k</code> and <code>biglittle</code> tests in particular, we are still slow. Let's profile the <code>rand100k</code> test again:</p>
<p><img src="/images/CUDA-Renderer/cpu-split-profile1.png" /></p>
<p><code>cudaMalloc</code> is still a significant overhead here. Further optimization requires doing the split in parallel on the <strong>GPU</strong>, which avoids allocating and moving data from host to device.</p>
<h3 id="split-space-on-gpu">4.3 Split Space on GPU</h3>
<p>The CPU split iterates over circles first rather than boxes. On the GPU, it's easier to process independent boxes in parallel, one box per thread. So the GPU splitting kernel looks like: <figure class="highlight c"><table><tr><td class="code"><pre><span class="line"><span class="type">void</span> <span class="title function_">findCirclesInBox</span><span class="params">()</span> &#123;</span><br><span class="line">    Box mybox = assignBox(thread index);</span><br><span class="line">    <span class="keyword">for</span> every circle</span><br><span class="line">        <span class="keyword">if</span> circle intersect mybox</span><br><span class="line">            <span class="title function_">addCircle</span><span class="params">(mybox, circle)</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure></p>
<p>This improvement raised the score to <strong>61</strong>. Now the CUDA renderer with split space can compete with the official solution!</p>
<h3 id="reduce-memory-access-to-global-memory">4.4 Reduce Memory Access to Global Memory</h3>
<p>Traditional compilers promote redundant memory accesses to registers, which improves performance especially when a memory access can be hoisted out of a loop (AFAIK, <em>LLVM</em> performs this kind of optimization in <em>LICM</em>). For <em>CUDA</em>, however, the compiler cannot do this unless it knows that the memory location is not accessed by other threads or through other pointers.</p>
<p>This is a snippet from kernel render function for every pixel: <figure class="highlight cpp"><table><tr><td class="code"><pre><span class="line">float4* img = ...; <span class="comment">// A pointer pointing to a part of global variable</span></span><br><span class="line"><span class="keyword">for</span> (<span class="type">int</span> i = <span class="number">0</span>; i &lt; numInBox; i++) &#123;</span><br><span class="line">    <span class="type">int</span> circleIndex3 = <span class="number">3</span> * circleIndexes[i];</span><br><span class="line">    float3 p = position[circleIndex3];</span><br><span class="line">    <span class="built_in">shade</span>(circleIndexes[i], ... , img); <span class="comment">// &quot;shade&quot; reads and writes &quot;img&quot;</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure></p>
<p>Global memory is a slow memory region on the GPU. My intuition is to keep the pixel value in a register inside the loop and write it back to global memory only once:</p>
<figure class="highlight cpp"><table><tr><td class="code"><pre><span class="line">float4* img = ...; <span class="comment">// A pointer pointing to a part of global variable</span></span><br><span class="line">float4 imgTmp = *img; <span class="comment">// In register</span></span><br><span class="line"><span class="keyword">for</span> (<span class="type">int</span> i = <span class="number">0</span>; i &lt; numInBox; i++) &#123;</span><br><span class="line">    <span class="type">int</span> circleIndex3 = <span class="number">3</span> * circleIndexes[i];</span><br><span class="line">    float3 p = position[circleIndex3];</span><br><span class="line">    <span class="built_in">shade</span>(circleIndexes[i], ... , &amp;imgTmp); <span class="comment">// &quot;shade&quot; now reads and writes a register</span></span><br><span class="line">&#125;</span><br><span class="line">*img = imgTmp;</span><br></pre></td></tr></table></figure>
<p>Surprisingly, this minor change makes the renderer beat <code>render_ref</code> completely and sometimes reach a score of <strong>72</strong>. It's hard to imagine that such a small change has this much impact.</p>
<p>Take a look at the original <em>NVPTX</em> assembly by <code>cuobjdump -ptx</code>: <figure class="highlight wasm"><table><tr><td class="code"><pre><span class="line"><span class="keyword">loop</span>:</span><br><span class="line">...</span><br><span class="line">ld.global.v4.<span class="type">f32</span> &#123;%f65, %f66, %f67, %f68&#125;, [%rd4];</span><br><span class="line">...</span><br><span class="line">st.global.v4.<span class="type">f32</span> [%rd4], &#123;%f78, %f77, %f76, %f79&#125;;</span><br><span class="line">...</span><br><span class="line">@%p4 bra <span class="variable">$loop</span></span><br><span class="line"></span><br><span class="line">loopexit:</span><br><span class="line">...</span><br></pre></td></tr></table></figure></p>
<p>And the assembly after change: <figure class="highlight wasm"><table><tr><td class="code"><pre><span class="line">ld.global.v4.<span class="type">f32</span> &#123;%f65, %f66, %f67, %f68&#125;, [%rd4];</span><br><span class="line"></span><br><span class="line"><span class="keyword">loop</span>:</span><br><span class="line">...</span><br><span class="line">@%p4 bra <span class="variable">$loop</span></span><br><span class="line"></span><br><span class="line">loopexit:</span><br><span class="line">st.global.v4.<span class="type">f32</span> [%rd4], &#123;%f78, %f77, %f76, %f79&#125;;</span><br><span class="line">...</span><br></pre></td></tr></table></figure> It confirmed that the change makes sense.</p>
<p>Similarly, the compiler cannot prove that two loads from the same memory address yield the same value. Imitating the <em>CSE</em> (common sub-expression elimination) technique from compiler optimization, I replaced the repeated <code>circleIndexes[i]</code> with a local variable, going from <figure class="highlight cpp"><table><tr><td class="code"><pre><span class="line">float4* img = ...; <span class="comment">// A pointer pointing to a part of global variable</span></span><br><span class="line">float4 imgTmp = *img; <span class="comment">// In register</span></span><br><span class="line"><span class="keyword">for</span> (<span class="type">int</span> i = <span class="number">0</span>; i &lt; numInBox; i++) &#123;</span><br><span class="line">    <span class="type">int</span> circleIndex3 = <span class="number">3</span> * circleIndexes[i];</span><br><span class="line">    float3 p = position[circleIndex3];</span><br><span class="line">    <span class="built_in">shade</span>(circleIndexes[i], ... , &amp;imgTmp); <span class="comment">// &quot;shade&quot; now reads and writes a register</span></span><br><span class="line">&#125;</span><br><span class="line">*img = imgTmp;</span><br></pre></td></tr></table></figure> to <figure class="highlight cpp"><table><tr><td class="code"><pre><span class="line">float4* img = ...; <span class="comment">// A pointer pointing to a part of global variable</span></span><br><span class="line">float4 imgTmp = *img; <span class="comment">// In register</span></span><br><span class="line"><span class="keyword">for</span> (<span class="type">int</span> i = <span class="number">0</span>; i &lt; numInBox; i++) &#123;</span><br><span class="line">    <span class="type">int</span> index = circleIndexes[i]; <span class="comment">// Load the index from memory only once</span></span><br><span class="line">    <span class="type">int</span> circleIndex3 = <span class="number">3</span> * index;</span><br><span class="line">    float3 p = position[circleIndex3];</span><br><span class="line">    <span class="built_in">shade</span>(index, ... , &amp;imgTmp); <span class="comment">// &quot;shade&quot; now reads and writes a register</span></span><br><span class="line">&#125;</span><br><span class="line">*img = imgTmp;</span><br></pre></td></tr></table></figure> Now the renderer gets even better and beats the reference implementation!</p>
<p>But wait... Can we also exploit the parallelism across the circles inside a box with shared memory and <strong>prefix-sum</strong>?</p>
<p>The result is <strong>BAD</strong>. By default, I split the scene into 128x128 boxes, and in most cases the number of circles inside a box is less than 1000. Synchronizing a large bunch of threads in a scan becomes a <strong>big</strong> overhead. Even the simple <code>rgb</code> scene takes over 10ms.</p>
<p>I also tried running the <em>prefix-sum</em> algorithm at warp level, guessing that warp-level synchronization primitives are cheaper. But the result is <strong>BAD</strong> too.</p>
<p>I think a better solution is to build a <em>quad-tree</em> in parallel and do the computation per tree node. Maybe I will implement it in my free time.</p>
<h2 id="final-result">5. Final Result</h2>
<p><img src="/images/CUDA-Renderer/final-result.png" /></p>
<h2 id="some-inspiration">6. Some Inspiration</h2>
<ul>
<li><p>Threads in a warp run in lockstep and reconverge after a branch. The GPU scheduler schedules warps, not individual threads. Since the GPU follows a <em>SIMT/SIMD</em> model, all active threads in a warp execute the same instruction at a time. If the threads in a warp diverge, the warp goes through every path in turn, so it has to wait for each path even though the other threads are inactive on it. As a consequence, if two threads in a warp contend for a spin lock, that can end in a deadlock.</p></li>
<li><p>Parallelism is <strong>not</strong> always better. <em>Amdahl's law</em> bounds the speedup when a fraction <em>a</em> of the work is parallelized across <em>n</em> processors: <span class="math display">\[S = \frac{1}{1 - a + \frac{a}{n}}\]</span> On top of that, the overhead of parallelism itself can make things slower.</p></li>
<li><p>Memory access should be taken seriously.<br />
Global memory, constant memory, registers (on chip) and shared memory (on chip) differ in speed and capacity. To avoid becoming memory bound, we should pick a suitable memory region for each piece of data.</p></li>
</ul>
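<p>To get a feeling for the Amdahl bound mentioned above, the formula can be evaluated directly. The helper below is only an illustration; the name <code>amdahlSpeedup</code> and the sample numbers are hypothetical and not measurements from this renderer.</p>
<figure class="highlight cpp"><table><tr><td class="code"><pre><span class="line"><span class="comment">// Amdahl's law: a is the fraction of the work that can be parallelized</span></span><br><span class="line"><span class="comment">// (0 &lt;= a &lt;= 1) and n is the number of processors (n &gt;= 1).</span></span><br><span class="line">double amdahlSpeedup(double a, int n) &#123;</span><br><span class="line">    return 1.0 / ((1.0 - a) + a / n);</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>With <em>a</em> = 0.9 and <em>n</em> = 8, the speedup is about 4.71 rather than 8, and no number of processors can push it past 1 / (1 - <em>a</em>) = 10. Any synchronization or launch overhead comes on top of this bound.</p>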
]]></content>
      <categories>
        <category>Parallel Computing</category>
      </categories>
      <tags>
        <tag>Parallel Computing</tag>
      </tags>
  </entry>
</search>