<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>Wille</title>
<subtitle>最後に辿り着いた場所</subtitle>
<link href="/atom.xml" rel="self"/>
<link href="http://dnc1994.com/"/>
<updated>2019-07-10T07:43:33.972Z</updated>
<id>http://dnc1994.com/</id>
<author>
<name>Linghao Zhang</name>
</author>
<generator uri="http://hexo.io/">Hexo</generator>
<entry>
<title>Announcing New Blog | Blog Migration Notice</title>
<link href="http://dnc1994.com/2019/07/new-blog-annoucement/"/>
<id>http://dnc1994.com/2019/07/new-blog-annoucement/</id>
<published>2019-07-10T07:32:29.000Z</published>
<updated>2019-07-10T07:43:33.972Z</updated>
<content type="html"><![CDATA[<p>This blog is deprecated and will no longer be updated. Please visit <a href="https://linghao.io/" target="_blank" rel="noopener">linghao.io</a> for my new blog.</p><p>While existing posts will stay, the domain is set to expire in Jan 2020. After that you can visit this site via <a href="https://dnc1994.github.io" target="_blank" rel="noopener">dnc1994.github.io</a>.</p><p>Thank you :)</p><hr><p>This blog has stopped updating. Please visit <a href="https://linghao.io/" target="_blank" rel="noopener">linghao.io</a> for my new blog.</p><p>All existing content here will remain, but the domain will expire in January 2020. After that, please visit via <a href="https://dnc1994.github.io" target="_blank" rel="noopener">dnc1994.github.io</a>.</p><p>Thank you, readers, for your support.</p>]]></content>
<summary type="html">
<p>This blog is deprecated and will no longer be updated. Please visit <a href="https://linghao.io/" target="_blank" rel="noopener">linghao.
</summary>
<category term="Personal" scheme="http://dnc1994.com/categories/Personal/"/>
</entry>
<entry>
<title>What I Learned in the Past Five Years</title>
<link href="http://dnc1994.com/2018/12/last-5-years-lessons/"/>
<id>http://dnc1994.com/2018/12/last-5-years-lessons/</id>
<published>2018-12-25T10:26:23.000Z</published>
<updated>2019-06-03T00:37:13.872Z</updated>
<content type="html"><![CDATA[<p><strong>This blog has migrated to the new domain <a href="https://linghao.io" target="_blank" rel="noopener">linghao.io</a>. Please read this post on my new blog: <a href="https://linghao.io/posts/five-year-learning-2013-2018/" target="_blank" rel="noopener">https://linghao.io/posts/five-year-learning-2013-2018/</a>.</strong></p><p>This post is a follow-up to my earlier <a href="https://dnc1994.com/2018/10/last-5-years/">five-year retrospective</a>. I hope it offers some inspiration on how to achieve your own goals.</p><a id="more"></a><h2 id="基本方法论"><a href="#基本方法论" class="headerlink" title="基本方法论"></a>Basic Methodology</h2><h3 id="了解自我"><a href="#了解自我" class="headerlink" title="了解自我"></a>Know Yourself</h3><p>Knowing yourself is the most important lifelong project anyone has. The early stages of life, especially the years of higher education, are the golden period for it.</p><p>Getting into a good university, studying a competitive major, earning excellent grades, landing the job you want: these are certainly important goals, but they should not be goals only. The process of reaching them should also be treated as a means of getting to know yourself.</p><p>Concretely, self-knowledge shows up everywhere: figuring out what time of day you study or work most efficiently and how long you can stay continuously focused; exploring which kinds of tasks you are good or bad at, or which kinds of rewards motivate you most strongly.</p><p>Taking my own experience as an example, my two most important discoveries of the past five years were: first, I simply cannot do well at things I do not genuinely believe in; second, I enjoy building user-facing products. Self-knowledge of this kind offers important guidance for career choices.</p><h3 id="一个人就是一家创业公司"><a href="#一个人就是一家创业公司" class="headerlink" title="一个人就是一家创业公司"></a>A Person Is a Startup</h3><p>In the early stages of life, the predicament we face closely resembles the one founders face: the world changes every second, our understanding of things is never entirely correct, and is often wildly wrong. Startups embrace this reality through rapid iteration and trial and error, maximizing the speed at which they come to understand the market and their users.</p><p>There is much to borrow here. In short, treat yourself as a startup: accept failure and trial-and-error as the norm, and when weighing multiple opportunities, favor the one that lets you learn more new things, faster.</p><p>A teacher can only open the door; you must walk through it yourself. One potential problem is how to cross the threshold from zero to one. Based on my own experience, my suggestions are:</p><ul><li><strong>Read widely</strong>: The methodology and techniques about executing plans in the second part of this post mostly come from MOOCs and books on productivity.</li><li><strong>Mix with people better than you</strong>: In my freshman year I started renting a dedicated server with a group of friends, and used basic Linux administration and web backend development as my entry point into the vast field of CS.</li><li><strong>Seek mentorship</strong>: In my sophomore year I learned how to read papers under the guidance of my advisor at a university lab.</li></ul><h3 id="建立高效的反馈回路"><a href="#建立高效的反馈回路" class="headerlink" title="建立高效的反馈回路"></a>Build Efficient Feedback Loops</h3><p>Efficient feedback loops are crucial for rapid iteration and trial and error. Seek outside help (advisors at school, senior colleagues at work) or set up your own metrics, so that when learning something new or experimenting with a new technique you can tell how you are doing and how to improve.</p><p>When I was learning how to read papers, I met with my advisor face to face every week to discuss my understanding of the 2~3 papers assigned the week before, which allowed my misunderstandings and bad habits to be corrected. After learning about the <a href="https://zh.wikipedia.org/wiki/%E7%95%AA%E8%8C%84%E5%B7%A5%E4%BD%9C%E6%B3%95" target="_blank" rel="noopener">Pomodoro Technique</a>, I used the quantity and quality of completed tasks (objective metrics) and a self-assessment of my mental state while working (subjective metric) to decide how long each work cycle should be.</p><p>Building such feedback loops is sometimes difficult, even impossible. In those cases, what matters is consciously thinking in terms of feedback, metrics, and correction.</p><h2 id="执行!执行!执行!"><a href="#执行!执行!执行!" class="headerlink" title="执行!执行!执行!"></a>Execute, Execute, Execute</h2><h3 id="设立目标"><a href="#设立目标" class="headerlink" title="设立目标"></a>Set Goals</h3><blockquote><p>Shoot for the moon. Even if you miss, you’ll land among the stars.</p></blockquote><p>As the joke goes: you should still have a dream. What if it comes true? There is real truth in that.</p><p>We should set multiple goals, at different levels, as early as possible. Doing so has at least the following benefits:</p><ul><li><strong>Amortize the cost of reaching a goal</strong>: Take my GRE preparation as an example. I made a study plan 6 months ahead, which brought the average time spent down to less than an hour per day. I did not need to put everything else on hold, and it was easier to stick to the schedule.</li><li><strong>Exploit the brain's rehearsal mechanism</strong>: Research suggests that during sleep the brain rehearses newly learned material and the difficult parts of upcoming plans. By laying out goals in our mind ahead of time, we effectively lower the difficulty of achieving them.</li><li><strong>Fight procrastination</strong>: This requires that a goal not stop at "I will do X" but be as detailed as "I will do X, and if I fall behind schedule by point Y, I will do Z instead."</li></ul><h3 id="Getting-Things-Done"><a href="#Getting-Things-Done" class="headerlink" title="Getting Things Done"></a>Getting Things Done</h3><p>The core of executing plans efficiently is completing as many of your preset goals as possible. And getting as much done as possible in limited time essentially comes down to developing a GTD (Getting Things Done) system that works for you.</p><p>There is plenty of material on GTD out there, but which methodology you adopt or which tools you use is not the most important thing. What matters is consciously experimenting with, settling on, and executing a system that suits you. As an example, my GTD system mainly consists of the following parts:</p><ul><li><strong>Planning tasks and estimating time</strong>: Planning chores and tasks with deadlines is easy; what deserves attention are the <a href="https://en.wikipedia.org/wiki/Time_management#The_Eisenhower_Method" target="_blank" rel="noopener">"Quadrant 2" (important but not urgent)</a> tasks. My habit is to first spend some time breaking a goal into concrete sub-goals with a timeline, then reserve fixed time slots each week to work on them. Estimating time relies mostly on experience and need not be very accurate; what matters more is using the estimation process to form an initial sense of "what to do if some stage takes longer than acceptable."</li><li><strong>Recording and tracking progress</strong>: As mentioned in <a href="https://dnc1994.com/2018/12/liqi-community-plan/">this post</a>, I keep all to-do items in a single list. In general, the list itself can be managed with any text-editing tool (including physical media like pen and paper), and some people prefer dedicated software. I chose Markdown, which is slightly more expressive than plain text, together with the editor <a href="https://inns.studio/mak/" target="_blank" rel="noopener">Mak</a>.</li><li><strong>Prioritization</strong>: There will always be more to do. We need to periodically reassess and adjust the priority of all to-do items to decide what to work on in any free time slot. When writing down a to-do, I also record the mid- or long-term goal it corresponds to, and keep those goals at the very top of the list to guide prioritization: always prefer the task that currently contributes most to the goals (tasks with deadlines excepted).</li><li><strong>Executing tasks</strong>: I generally use the Pomodoro Technique mentioned above, working in cycles of 50 minutes on, 10 minutes off.</li><li><strong>Review and reflection</strong>: I periodically use the data I record day to day (improvement cannot be discussed without data, but the granularity of recording needs balancing) to adjust my goals for the next stage. I may need to develop a more formal process for this in the future.</li></ul><h2 id="态度决定高度"><a href="#态度决定高度" class="headerlink" title="态度决定高度"></a>Attitude Determines Altitude</h2><h3 id="珍惜时间"><a href="#珍惜时间" class="headerlink" title="珍惜时间"></a>Value Your Time</h3><p>Time is the scarcest resource. Do not sacrifice time to save other resources (especially money).</p><p>But sometimes "wasting time" is unavoidable. Time that subjectively feels wasted may be an indispensable part of solving the problem. Put differently, some "wasted" time is more valuable than other time. There is no universal yardstick here, but generally speaking, spending a lot of time consolidating fundamentals is never a waste.</p><h3 id="对抗焦虑"><a href="#对抗焦虑" class="headerlink" title="对抗焦虑"></a>Fight Anxiety</h3><blockquote><p>We shape our tools, and thereafter our tools shape us.</p></blockquote><ul><li>Refuse <a href="https://en.wikipedia.org/wiki/Fear_of_missing_out" target="_blank" rel="noopener">FOMO</a>: In this age of information overload, we usually already have enough information to reach our goals; passively taking in still more not only wastes time but adds anxiety. It pays more to organize the information you already have, for example by writing it up into a summary like this post.</li><li>Manage expectations: Our anxiety often comes from misplaced expectations, and expectation management is a skill that improves with repeated deliberate practice.</li><li>Stay away from toxic tools, platforms, and relationships: Turn off unnecessary notifications, uninstall apps like Zhihu and Toutiao, make good use of Screen Time or similar features, and do more things that make you forget to pick up your phone.</li><li>Self-hacking: From managing emotions to navigating relationships, once we understand the mechanisms and hidden rules at work, we can to some extent push back against our animal nature and our mental and physical limits. For example, if splurging on your favorite food can sweep away a bad mood and get you producing efficiently again, go for it!</li></ul><h3 id="寻求帮助"><a href="#寻求帮助" class="headerlink" title="寻求帮助"></a>Ask for Help</h3><ul><li>Be brave enough to ask for help.</li><li>Ask for help the right way.</li></ul><h3 id="帮助和影响身边的人"><a href="#帮助和影响身边的人" class="headerlink" title="帮助和影响身边的人"></a>Help and Influence the People Around You</h3><ul><li>A person's success often depends on, and is reflected in, the success of the people around them.</li><li>In helping others we often discover where we ourselves need to improve, such as an insufficiently thorough understanding of some concept.</li><li>Helping others is itself immensely rewarding and satisfying.</li></ul><h3 id="连点成线"><a href="#连点成线" class="headerlink" title="连点成线"></a>Connecting the Dots</h3><p>Connecting the dots is a beautiful dream. Everyone wants to be able to look back, as Steve Jobs did after achieving greatness, and delight in tracing their experiences into a single thread.</p><p>My own shallow understanding is that it is mostly a <a href="https://zh.wikipedia.org/wiki/%E8%87%AA%E8%AF%81%E9%A2%84%E8%A8%80" target="_blank" rel="noopener">self-fulfilling prophecy</a>. It is precisely the positive feedback loop created by deliberate self-suggestion that makes it possible to carve out a life path that, in hindsight, looks coherent and free of wrong turns.</p><p>Beyond psychological self-suggestion, we can also, during periodic reviews, think more about how our different experiences connect to one another.</p><h2 id="改变,从今天开始"><a href="#改变,从今天开始" class="headerlink" title="改变,从今天开始"></a>Change Starts Today</h2><p>Some of the content above may be too abstract, so I will close with some concrete, actionable suggestions. If parts of this post resonate with you but you worry about taking the first step toward change, pick one or two items below and give them a try!</p><ul><li>Write down the questions that puzzle you at this stage, and discuss them with three suitable people (an older student, an advisor, a senior colleague, etc.).</li><li>Think of something new you are currently learning: have you established an effective feedback loop for it?</li><li>Try the Pomodoro Technique and decide whether it suits you.</li><li>Set three goals each for the coming year and the coming three years (including a Plan B for when a goal cannot be reached on time).</li><li>List the tasks you need to complete in the coming week, pick a management tool (pen and paper is a fine start!), and try planning and tracking your progress.</li><li>Find three recognized classics in your field, read them, and take detailed notes.</li><li>Pick one unimportant app on your phone that eats up a lot of your time, and uninstall it!</li></ul>]]></content>
<summary type="html">
<p><strong>This blog has migrated to the new domain <a href="https://linghao.io" target="_blank" rel="noopener">linghao.io</a>. Please read this post on my new blog: <a href="https://linghao.io/posts/five-year-learning-2013-2018/" target="_blank" rel="noopener">https://linghao.io/posts/five-year-learning-2013-2018/</a>.</strong></p>
<p>This post is a follow-up to my earlier <a href="https://dnc1994.com/2018/10/last-5-years/">five-year retrospective</a>. I hope it offers some inspiration on how to achieve your own goals.</p>
</summary>
<category term="Knowledge" scheme="http://dnc1994.com/categories/Knowledge/"/>
</entry>
<entry>
<title>Makers and Their Tools: The Liqi Community Program</title>
<link href="http://dnc1994.com/2018/12/liqi-community-plan/"/>
<id>http://dnc1994.com/2018/12/liqi-community-plan/</id>
<published>2018-12-10T06:12:45.000Z</published>
<updated>2019-06-03T00:38:22.126Z</updated>
<content type="html"><![CDATA[<p><strong>This blog has migrated to the new domain <a href="https://linghao.io" target="_blank" rel="noopener">linghao.io</a>. Please read this post on my new blog: <a href="https://linghao.io/posts/liqi-interview/" target="_blank" rel="noopener">https://linghao.io/posts/liqi-interview/</a>.</strong></p><blockquote><p>Tools and inspiration are both sharp tools (利器). When the tools and inspiration of creators from different fields and of different kinds collide, more possibilities burst forth. To encourage more tools and inspiration to emerge, we launched the "Liqi Community Program." The program encourages independent organizations and individuals to share their "sharp tools" on their own platforms. You can interview the colleagues around you, like 陶子 of eico, or share lists on your own blog, like 扶墙老师.</p></blockquote><a id="more"></a><h2 id="介绍一下你自己和所做的工作。"><a href="#介绍一下你自己和所做的工作。" class="headerlink" title="介绍一下你自己和所做的工作。"></a>Introduce yourself and the work you do.</h2><p>I am a software engineer at Google.</p><h2 id="你的职业生涯的转折点是什么?"><a href="#你的职业生涯的转折点是什么?" class="headerlink" title="你的职业生涯的转折点是什么?"></a>What was the turning point of your career?</h2><p>As an engineer who has only just started his first full-time job, it is hard to say I have been through anything that counts as a turning point. If I had to name one, my internship at <a href="https://www.strikingly.com/" target="_blank" rel="noopener">Strikingly</a> in the summer of 2016 convinced me that what I enjoy is building consumer-facing products, which put an end to any thought of going into academia. I suppose that counts as an event with a major influence on my career.</p><h2 id="你都在使用哪些硬件?"><a href="#你都在使用哪些硬件?" class="headerlink" title="你都在使用哪些硬件?"></a>What hardware do you use?</h2><h3 id="Lenovo-X1-Carbon"><a href="#Lenovo-X1-Carbon" class="headerlink" title="Lenovo X1 Carbon"></a>Lenovo X1 Carbon</h3><p>OK, that short answer is not quite accurate. The longer answer:</p><p>I love the X1C's excellent industrial design and portability, and it has been my main laptop since I bought one in 2015. As a moderately serious gamer, I also have a Lenovo gaming laptop (Y520). Since starting my job I have mainly used a Linux workstation at the office, along with a 13” rMBP. As I have no heavy development needs outside work, the gaming laptop hooked up to a monitor at home plus the MBP covers my daily needs, so the X1C is currently gathering dust.</p><p>But if I am ever out of a job someday, I think my main productivity laptop would still be the X1C. (I would probably consider an X1 Extreme to replace the Y520.)</p><h3 id="Filco-Majestouch-NINJA-TKL-Linear-Action(黑轴)"><a href="#Filco-Majestouch-NINJA-TKL-Linear-Action(黑轴)" class="headerlink" title="Filco Majestouch NINJA TKL Linear Action(黑轴)"></a>Filco Majestouch NINJA TKL Linear Action (black switches)</h3><p>An excellent mechanical keyboard. I chose black switches to balance writing, coding, and gaming.</p><p>Before this I used a Cherry G80-3000 (blue switches) that a friend gave me; it is now gathering dust at my parents' home, mainly because the full-size layout is inconvenient to carry and blue switches are not suited to long gaming sessions.</p><h3 id="Logitech-G502"><a href="#Logitech-G502" class="headerlink" title="Logitech G502"></a>Logitech G502</h3><p>A gaming mouse bought at a Black Friday discount. It looks and feels great, and the side buttons are very versatile.</p><h3 id="iPhone-XS"><a href="#iPhone-XS" class="headerlink" title="iPhone XS"></a>iPhone XS</h3><p>I have been an iPhone user since 2011, going through the iPhone 4, iPhone 4S, iPhone 6S, and iPhone 7. I recently switched to the XS because the 7's battery had degraded badly.</p><p>I rarely do productive work on my phone; beyond social and entertainment uses, it mainly serves as a photography and photo-editing tool. Sometimes I also read e-books in iBooks. The phone is also the second factor securing my personal accounts.</p><h4 id="AirPods"><a href="#AirPods" class="headerlink" title="AirPods"></a>AirPods</h4><p>AirPods deserve a separate mention because they are simply that good. With AirPower nowhere in sight, buying a pair is a clear win.</p><h3 id="iPad-4-Mini"><a href="#iPad-4-Mini" class="headerlink" title="iPad 4 Mini"></a>iPad Mini 4</h3><p>Apart from reading e-books and PDFs, my iPad is basically for entertainment (videos and live streams).</p><h2 id="软件呢?"><a href="#软件呢?" class="headerlink" title="软件呢?"></a>And software?</h2><h3 id="Mak"><a href="#Mak" class="headerlink" title="Mak"></a>Mak</h3><p><a href="https://inns.studio/mak/" target="_blank" rel="noopener">Mak</a>, built by my friend <a href="https://shud.in/" target="_blank" rel="noopener">Shu Ding</a>, is currently my most important productivity app. It is the best Markdown editor I know of (though it is more than just an editor).</p><p>I use Mak to maintain the todo list at the heart of my productivity workflow, and to write almost every piece that Markdown can handle. This post was written in Mak too.</p><p>Mak's design philosophy is a fine embodiment of <a href="https://en.wikipedia.org/wiki/Minimalism_%28computing%29" target="_blank" rel="noopener">Less is More</a>. Among the endless stream of productivity, note-taking, and editor apps on the market, many suffer from over-design. I believe that the more fundamental and important a part of your toolchain is, the more it needs to avoid the distraction that over-design brings. On this point Mak and I are in complete agreement.</p><h3 id="Pinboard"><a href="#Pinboard" class="headerlink" title="Pinboard"></a>Pinboard</h3><p><a href="https://pinboard.in" target="_blank" rel="noopener">Pinboard</a> is my main bookmark manager. It has a minimal interface and a complete tagging system. Like Mak, it is free of distractions and an important part of my toolchain.</p><h3 id="G-Suite"><a href="#G-Suite" class="headerlink" title="G Suite"></a>G Suite</h3><p>(As a Google employee) one cannot discuss productivity tools without mentioning G Suite. I use Gmail, Calendar, Drive, and Docs heavily both at work and in daily life.</p><p>Worth mentioning: some friends and I share a G Suite <a href="https://gsuite.google.com/pricing.html" target="_blank" rel="noopener">Business Plan</a> subscription ($10 / user / month), which comes with unlimited Drive storage. I use it to back up all my personal data.</p><h3 id="Spark"><a href="#Spark" class="headerlink" title="Spark"></a>Spark</h3><p><a href="https://sparkmailapp.com/" target="_blank" rel="noopener">Spark</a> is the mail client I use on my phone. The official Gmail client does not work for me because:</p><ul><li>To separate work from personal life, I only signed in to my work account in the Gmail client.</li><li>I also have two legacy non-Gmail addresses to manage.</li><li>Spark itself is excellent!</li></ul><h3 id="Google-Photos"><a href="#Google-Photos" class="headerlink" title="Google Photos"></a>Google Photos</h3><p><a href="https://photos.google.com/" target="_blank" rel="noopener">Google Photos</a> is my photo backup, sync, and organization tool. As an iOS user I have also tried iCloud Photos; my conclusion is that for backup and sync alone the experience is similar to Google Photos (strictly speaking, a bit better, given the native support). But for organizing photos, Google really does do much better than Apple: the algorithms for grouping by people, places, or things are very accurate, and conditional search is supported.</p><h3 id="1Password"><a href="#1Password" class="headerlink" title="1Password"></a>1Password</h3><p><a href="https://1password.com/" target="_blank" rel="noopener">1Password</a> is my password-management solution. Since a recent update, iOS password autofill can use 1Password as a source, and combined with Face ID it is remarkably convenient.</p><h3 id="Telegram"><a href="#Telegram" class="headerlink" title="Telegram"></a>Telegram</h3><p><a href="https://telegram.org/" target="_blank" rel="noopener">Telegram</a> is the only IM app I like. Its replies, link previews, and sticker management are far better than the alternatives'. Feel free to follow my <a href="https://t.me/instante_thoughts" target="_blank" rel="noopener">personal channel</a>.</p><h3 id="Sublime-Text"><a href="#Sublime-Text" class="headerlink" title="Sublime Text"></a>Sublime Text</h3><p>As an engineer who mainly works with data and models, I spend most of my time on bottom-up exploration or writing queries, often in tools like Jupyter Notebook and the query interfaces of various data backends.</p><p>Since I have never done truly heavy development, Sublime Text is enough for me as far as editing code goes. At work I mainly use an internal IDE, and I am also trying to improve my Vim proficiency.</p><h2 id="你最理想的工作环境是什么?"><a href="#你最理想的工作环境是什么?" class="headerlink" title="你最理想的工作环境是什么?"></a>What is your ideal working environment?</h2><p>A height-adjustable desk, a comfortable chair, dual monitors 27 inches or larger, no cubicles, fresh air and plenty of light. The desk should hold a few favorite knick-knacks, such as an <a href="https://www.evastore.jp/products/detail/9061" target="_blank" rel="noopener">Asuka clay figure</a> or a Jill plush. An endless supply of purified water and Diet Coke.</p><p><img src="jill-plush.jpg" alt="Jill Plush"></p><h2 id="你平时获得工作灵感的方式有哪些?"><a href="#你平时获得工作灵感的方式有哪些?" class="headerlink" title="你平时获得工作灵感的方式有哪些?"></a>How do you usually find inspiration for your work?</h2><p>UCSD's <a href="https://www.coursera.org/learn/learning-how-to-learn" target="_blank" rel="noopener">Learning How to Learn</a> describes two modes of thinking: Focused Mode and Diffuse Mode. When you concentrate on solving a math problem, your brain is in Focused Mode; Diffuse Mode, by contrast, is a relaxed mode of thinking. When you are stuck, letting the brain slip into Diffuse Mode helps you find the missing insight. So when I feel I cannot crack the problem at hand, I switch activities, say going for a walk or listening to music for a while.</p><h2 id="推荐一件生活中的利器给大家。"><a href="#推荐一件生活中的利器给大家。" class="headerlink" title="推荐一件生活中的利器给大家。"></a>Recommend one everyday "sharp tool" to everyone.</h2><p>I recently bought a water flosser. It works wonderfully.</p><hr><p>This post is part of the Liqi Community Program. Discover more makers and their tools: <a href="http://liqi.io/community/" target="_blank" rel="noopener">http://liqi.io/community/</a></p>]]></content>
<summary type="html">
<p><strong>This blog has migrated to the new domain <a href="https://linghao.io" target="_blank" rel="noopener">linghao.io</a>. Please read this post on my new blog: <a href="https://linghao.io/posts/liqi-interview/" target="_blank" rel="noopener">https://linghao.io/posts/liqi-interview/</a>.</strong></p>
<blockquote>
<p>Tools and inspiration are both sharp tools (利器). When the tools and inspiration of creators from different fields and of different kinds collide, more possibilities burst forth. To encourage more tools and inspiration to emerge, we launched the "Liqi Community Program." The program encourages independent organizations and individuals to share their "sharp tools" on their own platforms. You can interview the colleagues around you, like 陶子 of eico, or share lists on your own blog, like 扶墙老师.</p>
</blockquote>
</summary>
<category term="Productivity" scheme="http://dnc1994.com/categories/Productivity/"/>
</entry>
<entry>
<title>[Notes] The Effective Engineer</title>
<link href="http://dnc1994.com/2018/10/notes-the-effective-engineer/"/>
<id>http://dnc1994.com/2018/10/notes-the-effective-engineer/</id>
<published>2018-10-17T02:35:56.000Z</published>
<updated>2019-06-03T00:41:45.135Z</updated>
<content type="html"><![CDATA[<p><strong>This blog has been migrated to <a href="https://linghao.io" target="_blank" rel="noopener">linghao.io</a>. Read this post on my new blog: <a href="https://linghao.io/notes/the-effective-engineer/" target="_blank" rel="noopener">https://linghao.io/notes/the-effective-engineer/</a>.</strong></p><p>Starting from “time is our most limited resource”, <a href="https://www.effectiveengineer.com/book" target="_blank" rel="noopener"><em>The Effective Engineer</em></a> by Edmond Lau first establishes the methodology of using “leverage” to guide our actions. The book then, from multiple angles, discusses how to become a more effective engineer by focusing on high-leverage activities that produce a disproportionately high impact for a relatively small time investment. Topics range from adopting the right mindsets and actual execution to building long-term value, and are complemented with ample examples from the industry. Much of the content generalizes easily to areas beyond software engineering as well. A must-read for software engineers.</p><p>This post is a refined version of the notes I took while reading this book.</p><a id="more"></a><h2 id="Part-I-Adopt-the-Right-Mindsets"><a href="#Part-I-Adopt-the-Right-Mindsets" class="headerlink" title="Part I: Adopt the Right Mindsets"></a>Part I: Adopt the Right Mindsets</h2><h3 id="1-Focus-on-High-Leverage-Activities"><a href="#1-Focus-on-High-Leverage-Activities" class="headerlink" title="1. Focus on High-Leverage Activities"></a>1.
Focus on High-Leverage Activities</h3><h4 id="Use-leverage-to-measure-your-engineering-effectiveness"><a href="#Use-leverage-to-measure-your-engineering-effectiveness" class="headerlink" title="Use leverage to measure your engineering effectiveness"></a>Use leverage to measure your engineering effectiveness</h4><p>Leverage = Impact Produced / Time Invested</p><p>Leverage is critical because time is your most limited resource.</p><p>Pareto principle: 80% impact from 20% work – high-leverage activities that produce a disproportionately high impact for a relatively small time investment.</p><h4 id="Systematically-increase-the-leverage-of-your-time"><a href="#Systematically-increase-the-leverage-of-your-time" class="headerlink" title="Systematically increase the leverage of your time"></a>Systematically increase the leverage of your time</h4><p>Three ways to increase your leverage:</p><ul><li>Reducing time</li><li>Increasing output</li><li>Shifting to higher-leverage activities</li></ul><p>Examples:</p><ul><li>Attending meetings<ul><li>Defaulting to a half-hour meeting instead of a one-hour meeting</li><li>Prepare an agenda and a set of goals</li><li>Replace the in-person meeting with email discussion if possible</li></ul></li><li>Developing a customer-facing feature<ul><li>Automate parts of the development / testing process</li><li>Prioritize tasks</li><li>Use knowledge about customers to understand whether there’s another feature you could be working on</li></ul></li><li>Fixing bottlenecks in a web application<ul><li>Learn to effectively use a profiling tool</li><li>Measure both performance and visit frequency so you can address the bottlenecks that affect the most traffic first</li><li>Design performant software from the outset so that speed is prioritized as a feature instead of a bug to be fixed</li></ul></li></ul><h4 id="Focus-your-effort-on-leverage-points"><a href="#Focus-your-effort-on-leverage-points" class="headerlink" title="Focus your effort on leverage 
points"></a>Focus your effort on leverage points</h4><p>Don’t confuse high-leverage activities with easy wins. Many high-leverage activities require consistent applications of effort over long time periods to achieve high impact.</p><p>Find leverage points, establish high-leverage habits.</p><h3 id="2-Optimize-for-Learning"><a href="#2-Optimize-for-Learning" class="headerlink" title="2. Optimize for Learning"></a>2. Optimize for Learning</h3><h4 id="Adopt-a-growth-mindset"><a href="#Adopt-a-growth-mindset" class="headerlink" title="Adopt a growth mindset"></a>Adopt a growth mindset</h4><ul><li>Fixed mindset: humans are born with a predetermined amount of intelligence</li><li>Growth mindset: humans can cultivate and grow their intelligence and skills through effort</li></ul><p>Own your story. Instead of apologizing for where your resume doesn’t line up, take control of the parts that are within your sphere of influence.</p><h4 id="Invest-in-your-learning-rate"><a href="#Invest-in-your-learning-rate" class="headerlink" title="Invest in your learning rate"></a>Invest in your learning rate</h4><p>Lessons from compound interest:</p><ol><li>Compounding leads to an exponential growth curve</li><li>The earlier compounding starts, the sooner you hit the region of rapid growth and the faster you can reap its benefits</li><li>Even small deltas in the interest rate can make massive differences in the long run</li></ol><p>When companies pay you for cushy and unchallenging 9-to-5 jobs, what they are actually doing is paying you to accept a much lower intellectual growth rate.</p><p>Treat yourself like a startup. Startups initially prioritize learning over profitability to increase their chances of success.</p><p>You would rather invest your financial assets in accounts that pay high interest rates, not low ones. 
Why would you treat your time – your most limited asset – any differently?</p><h4 id="Seek-work-environments-conducive-to-learning"><a href="#Seek-work-environments-conducive-to-learning" class="headerlink" title="Seek work environments conducive to learning"></a>Seek work environments conducive to learning</h4><p>One of the most powerful leverage points for increasing our learning rate is our choice of work environment – because we spend so much time at work.</p><p>Factors to consider:</p><ul><li>Fast growth. When the number of problems to solve exceeds available resources, there are ample opportunities to make a big impact. A lack of growth, on the other hand, leads to stagnation and politics.</li><li>Training.</li><li>Openness.</li><li>Pace.</li><li>People.</li><li>Autonomy. The freedom to choose what to work on and how to do it drives our ability to learn.</li></ul><h4 id="Capitalize-on-opportunities-on-the-job-to-develop-new-skills"><a href="#Capitalize-on-opportunities-on-the-job-to-develop-new-skills" class="headerlink" title="Capitalize on opportunities on the job to develop new skills"></a>Capitalize on opportunities on the job to develop new skills</h4><p>Borrow the idea of 20% time from Google, but take it in one- or two-hour chunks each day, because you can then make a daily habit out of improving your skills.</p><p>Gain experience in adjacent disciplines:</p><ul><li>For product engineers: product management, user research, backend engineering</li><li>For infrastructure engineers: machine learning, database internals, web development</li><li>For growth engineers: data science, marketing, behavioral psychology.</li></ul><p>Tips:</p><ul><li><strong>Study code for core abstractions written by the best engineers at your company.</strong></li><li>Write more code.</li><li><strong>Go through any technical, educational material available internally.</strong></li><li>Master the programming language that you use.</li><li>Send your code reviews to the harshest 
critics.</li><li>Enroll in classes on areas where you want to improve.</li><li>Participate in design discussions of projects you’re interested in.</li><li><strong>Work on a diversity of projects.</strong> Interleaving different projects can teach you what problems are common across projects and what might just be artifacts of your current ones.</li><li>Make sure you’re on a team with at least a few senior engineers whom you can learn from.</li><li><strong>Jump fearlessly into code you don’t know.</strong> Highly correlated with engineering success.</li></ul><h4 id="Always-be-learning-even-when-outside-of-the-workplace"><a href="#Always-be-learning-even-when-outside-of-the-workplace" class="headerlink" title="Always be learning, even when outside of the workplace"></a>Always be learning, even when outside of the workplace</h4><p>Some skills we learn could be cross-functional and help our engineering work. For example, increasing your comfort level in conversing with strangers can help with meeting and interviewing. Other skills might not translate directly into engineering benefits, but the practice of adopting a growth mindset toward them makes us better learners and more willing to stretch beyond our comfort zone.</p><p>Continual learning is inextricably linked with increased happiness.</p><p>Tips:</p><ul><li>Learn new programming languages and frameworks.</li><li>Invest in skills that are in high demand.</li><li><strong>Read books.</strong></li><li>Join a discussion group.</li><li>Attend talks, conferences, and meetups.</li><li>Build and maintain a strong network of relationships.</li><li>Follow bloggers who teach.</li><li><strong>Write to teach</strong> – Feynman’s technique.</li><li><strong>Tinker on side projects.</strong> Creativity stems from combining existing and often disparate ideas in new ways.</li><li>Pursue what you love.</li></ul><h3 id="3-Prioritize-Regularly"><a href="#3-Prioritize-Regularly" class="headerlink" title="3. Prioritize Regularly"></a>3.
Prioritize Regularly</h3><h4 id="Track-and-review-to-dos-in-a-single-easily-accessible-list"><a href="#Track-and-review-to-dos-in-a-single-easily-accessible-list" class="headerlink" title="Track and review to-dos in a single, easily accessible list"></a>Track and review to-dos in a single, easily accessible list</h4><p>The human brain is optimized for processing and not for storage. The average brain can actively hold only 7 +/- 2 items. Expending effort on remembering things reduces our attention, impairs our decision-making abilities, and even hurts our physical performance.</p><p>To-do lists should be 1) a canonical representation of our work and 2) easily accessible.</p><p>Instead of accurately computing the leverage of each task (which is incredibly difficult), compile a small number of goals to complete. Pick initial tasks towards these goals, and then make a pairwise comparison between what you’re currently doing and what else is on your to-do list. Continuously shift your top priorities towards the ones with the highest leverage.</p><h4 id="Focus-on-what-directly-produces-value"><a href="#Focus-on-what-directly-produces-value" class="headerlink" title="Focus on what directly produces value"></a>Focus on what directly produces value</h4><p>Activity is not necessarily production.
Activities like writing status reports, organizing things, creating organizational systems, recording things multiple times, going to meetings, replying to low-priority communications only have a weak and indirect connection to creating value.</p><p>Once you’re producing results, few people will complain about declined meetings, slow email response times, or even non-urgent bugs not being fixed.</p><p>Defer and ignore tasks that don’t directly produce value.</p><h4 id="Focus-on-the-important-and-non-urgent"><a href="#Focus-on-the-important-and-non-urgent" class="headerlink" title="Focus on the important and non-urgent"></a>Focus on the important and non-urgent</h4><p><img src="quadrants.png" alt="Quadrants"></p><p>Urgency should not be confused with importance. Put first things first.</p><p>Label to-dos from 1 to 4 based on which quadrant the activity falls under.</p><p>Oftentimes, the root cause of a Quadrant 1 problem is an underinvestment in a Quadrant 2 activity.</p><p>The act of prioritization is itself a Quadrant 2 activity, whose importance often gets overlooked because it’s rarely urgent. Prioritize the act of prioritization.</p><h4 id="Protect-your-schedule"><a href="#Protect-your-schedule" class="headerlink" title="Protect your schedule"></a>Protect your schedule</h4><p>Engineers need longer and more contiguous blocks of time to be productive than many other professionals.</p><p>Managers traditionally organize their time into one-hour blocks.
Makers generally prefer to use time in units of half a day at least.</p><p>Tips:</p><ul><li>Schedule necessary meetings back-to-back at the beginning or end of your work day.</li><li>Defer helping others when in the middle of a focused activity.</li><li>Block off hours on your calendar or schedule days like “no meeting Wednesdays”.</li></ul><h4 id="Limit-the-amount-of-work-in-progress"><a href="#Limit-the-amount-of-work-in-progress" class="headerlink" title="Limit the amount of work in progress"></a>Limit the amount of work in progress</h4><p>Increasing work linearly increases the likelihood of failure exponentially.</p><p>Constant context switching hinders deep engagement in any one activity and reduces our overall chance of success.</p><h4 id="Fight-procrastination-with-if-then-plans"><a href="#Fight-procrastination-with-if-then-plans" class="headerlink" title="Fight procrastination with if-then plans"></a>Fight procrastination with if-then plans</h4><p>Many people do not have sufficient motivation to summon the activation energy required to start a difficult task.</p><p>Planning creates a link between the situation or cue and the behavior that you should follow; the behavior then follows automatically, without any conscious intent, when the cue triggers.</p><p>Subconscious follow-through is important because procrastination primarily stems from a reluctance to expend the initial activation energy on a task. This reluctance leads us to rationalize why it might be better to do something easier or more enjoyable, even if it has lower leverage. When we’re in the moment, the short-term value that we get from procrastinating can often dominate our decision-making process.
But when we make if-then plans and decide what to do ahead of time, we’re more likely to consider the long-term benefits associated with a task.</p><p>If-then planning can also help fill the small gaps in our schedule.</p><h4 id="Make-prioritization-a-habit"><a href="#Make-prioritization-a-habit" class="headerlink" title="Make prioritization a habit"></a>Make prioritization a habit</h4><p>Take general principles and iteratively adapt your own prioritization system.</p><p>The actual mechanics of how you review your priorities matter less than adopting the habit of doing it.</p><h2 id="Part-II-Execute-Execute-Execute"><a href="#Part-II-Execute-Execute-Execute" class="headerlink" title="Part II: Execute, Execute, Execute"></a>Part II: Execute, Execute, Execute</h2><h3 id="4-Invest-in-Iteration-Speed"><a href="#4-Invest-in-Iteration-Speed" class="headerlink" title="4. Invest in Iteration Speed"></a>4. Invest in Iteration Speed</h3><h4 id="Move-fast-to-learn-fast"><a href="#Move-fast-to-learn-fast" class="headerlink" title="Move fast to learn fast"></a>Move fast to learn fast</h4><p>Investing in iteration speed is a high-leverage decision. The faster you can iterate, the more you can learn about what works and what doesn’t work.</p><p>You can build more things and try out more ideas. Not every change will produce positive value and growth.
But with each iteration, you get a better sense of which changes will point you in the right direction, making your future efforts much more effective.</p><h4 id="Invest-in-time-saving-tools"><a href="#Invest-in-time-saving-tools" class="headerlink" title="Invest in time-saving tools"></a>Invest in time-saving tools</h4><p>Almost all successful people write a lot of tools.</p><ul><li>Faster tools get used more often, therefore saving even more time.</li><li>Faster tools can enable new development workflows that previously weren’t possible.</li></ul><p>Examples:</p><ul><li>Continuous integration / deployment</li><li>Incremental compilation</li><li>Interactive programming, REPL</li><li>Hot code reloads</li></ul><p>Sometimes, the time-saving tool that you built might be objectively superior to the existing one, but the switching costs discourage other engineers from actually changing their workflow and learning your tools. It’s worth investing the additional effort to lower the switching cost and to find a smoother way to integrate the tool into existing workflows.</p><p>One side benefit of proving to people that your tool saves time is that it also earns you leeway with your manager and your peers to explore more ideas in the future.</p><h4 id="Shorten-your-debugging-and-validation-loops"><a href="#Shorten-your-debugging-and-validation-loops" class="headerlink" title="Shorten your debugging and validation loops"></a>Shorten your debugging and validation loops</h4><p>As engineers, we can shortcut around normal system behaviors and user interactions when we’re testing our products, extending the concept of a minimal reproducible test case.</p><p>When you’re fully engaged with a bug you’re testing or a new feature you’re building, the last thing you want to do is to add more work. When you’re already using a workflow that works, albeit with a few extra steps, it’s easy to get complacent and not expend the mental cycles on devising a shorter one. 
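</p><p>To make the idea concrete, here is a minimal sketch (the function, names, and numbers are invented for illustration, not taken from the book): rather than reproducing a billing bug by clicking through a full checkout flow, extract the suspect logic and drive it directly with a unit test that runs in milliseconds.</p>

```python
# Hypothetical repro: exercise only the suspect pricing logic,
# skipping the UI, web server, and database entirely.

def apply_discount(total_cents: int, percent_off: int) -> int:
    """Code under test; integer cents avoid float-rounding surprises."""
    return total_cents * (100 - percent_off) // 100

def test_quarter_off() -> None:
    # One command, sub-second feedback -- the whole debugging loop.
    assert apply_discount(1000, 25) == 750

test_quarter_off()
print("repro test passed")
```

<p>The same pattern generalizes: any way of invoking the failing code path directly, with canned inputs, tightens the loop.</p><p>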
Don’t fall into this trap!</p><p>Effective engineers have an obsessive ability to create tight feedback loops for what they’re testing.</p><h4 id="Master-your-programming-environment"><a href="#Master-your-programming-environment" class="headerlink" title="Master your programming environment"></a>Master your programming environment</h4><p>Given how much time we spend in our programming environments, the more efficient we can become, the more effective we will be as engineers.</p><p>Mastery is a process, not an event. As you get more comfortable, the time savings will start to build.</p><p>Tips:</p><ul><li>Get proficient with your favorite text editor or IDE.</li><li>Learn at least one productive, high-level programming language. Each minute spent writing boilerplate code for a less productive language is a minute not spent tackling the meatier aspects of a problem.</li><li>Get familiar with UNIX (or Windows) shell commands.</li><li>Prefer the keyboard over the mouse.</li><li>Automate your manual workflows.</li><li>Test out ideas on an interactive interpreter.</li><li>Make it fast and easy to run just the unit tests associated with your current changes.</li></ul><h4 id="Don’t-ignore-non-engineering-bottlenecks"><a href="#Don’t-ignore-non-engineering-bottlenecks" class="headerlink" title="Don’t ignore non-engineering bottlenecks"></a>Don’t ignore non-engineering bottlenecks</h4><p>One common type of bottleneck is dependency on other people. Oftentimes the cause is misalignment of priorities rather than negative intentions. 
The sooner you acknowledge that you need to personally address this bottleneck, the more likely you’ll be able to either adapt your goals or establish consensus on the functionality’s priority.</p><p>Projects fail from under-communicating, not over-communicating.</p><ul><li>Ask for updates and commitments from team members at meetings or daily stand-ups.</li><li>Periodically check in with the product manager to make sure what you need hasn’t gotten dropped.</li><li>Follow up with written communication on key action items and dates that were decided in-person.</li></ul><p>Even if resource constraints preclude the dependency that you want from being delivered any sooner, clarifying priorities and expectations enables you to plan ahead and work through alternatives.</p><p>Another common type of bottleneck is obtaining approval from a key decision maker. This kind of bottleneck generally falls outside of an engineer’s control. Prioritize building prototypes, collecting early data, conducting user studies and so on to get preliminary project approval. Don’t defer approvals until the end.</p><p>A third type of bottleneck is the review processes that accompany any project launch. Expend slightly more effort in coordination and communication.</p><p>Premature optimization is the root of all evil. Find out the biggest bottlenecks and optimize them.</p><h3 id="5-Measure-What-You-Want-to-Improve"><a href="#5-Measure-What-You-Want-to-Improve" class="headerlink" title="5. Measure What You Want to Improve"></a>5. Measure What You Want to Improve</h3><h4 id="Use-metrics-to-drive-progress"><a href="#Use-metrics-to-drive-progress" class="headerlink" title="Use metrics to drive progress"></a>Use metrics to drive progress</h4><p>If you can’t measure it, you can’t improve it.</p><p>Good metrics accomplish a number of goals:</p><ul><li>They help you focus on the right things.</li><li>When visualized over time, they help guard against future regressions. 
Engineers know the value of writing a regression test while fixing bugs: it confirms that a patch actually fixes a bug and detects if the bug re-surfaces in the future. Good metrics play a similar role, but on a system-wide scale.</li><li>They can drive forward progress. Performance ratcheting: Any new change that would push latency or other key indicators past the ratchet can’t get deployed until it’s optimized, or until some other feature is improved by a counterbalancing amount.</li><li>They let you measure your effectiveness over time and compare the leverage of what you’re doing against other activities you could be doing instead.</li></ul><h4 id="Pick-the-right-metric-to-incentivize-the-behavior-you-want"><a href="#Pick-the-right-metric-to-incentivize-the-behavior-you-want" class="headerlink" title="Pick the right metric to incentivize the behavior you want"></a>Pick the right metric to incentivize the behavior you want</h4><ul><li>Hours worked per week vs. productivity per week. The marginal productivity of each additional work hour drops precipitously. Attempting to increase output by increasing hours worked per week is not sustainable.</li><li>Click-through rates vs. long click-through rates. Google measures “long clicks”.</li><li>Average response times vs. 95th or 99th percentile response times. The average is the right metric to use if your goal is to reduce server costs by cutting down aggregate computation time, while the slowest responses tend to reflect the experiences of your power users.</li><li>Bugs fixed vs. bugs outstanding. Tracking the number of outstanding bugs, rather than bugs fixed, discourages developers from being less rigorous about testing when building new features.</li><li>Registered users vs. weekly growth rate of registered users.</li><li>Weekly active users vs. weekly active rate by age of cohort. The number of weekly active users doesn’t provide a complete picture. 
That number might increase temporarily even if product changes are actually reducing engagement over time. Users could be signing up as a result of prior momentum.</li></ul><p>What you don’t measure is important as well.</p><p>Choose metrics that 1) maximize impact, 2) are actionable, and 3) are responsive yet robust.</p><ul><li>Maximize impact. Align employees along a single, core metric – the economic denominator. Having a single, unifying metric enables you to compare the output of disparate projects and helps your team decide how to handle externalities.</li><li>Actionable. Its movements can be causally explained by the team’s efforts. In contrast, vanity metrics like page views per month, total registered users, or total paying customers don’t necessarily reflect the actual quality of the team’s work.</li><li>Responsive. It updates quickly enough to give feedback about whether a given change was positive or negative, serving as a leading indicator of how your team is currently doing.</li><li>Robust. External factors outside of the team’s control don’t lead to significant noise. Responsiveness needs to be balanced with robustness.</li></ul><h4 id="Instrument-everything-in-your-system"><a href="#Instrument-everything-in-your-system" class="headerlink" title="Instrument everything in your system"></a>Instrument everything in your system</h4><p>When it comes to diagnosing problems, instrumentation is critical.</p><p>Adopting a mindset of instrumentation means ensuring we have a set of dashboards that surface key health metrics and that enable us to drill down to the relevant data. However, many of the questions we want to answer tend to be exploratory, since we often don’t know everything that we want to measure ahead of time. 
Therefore, we need to build flexible tools and abstractions that make it easy to track additional metrics.</p><h4 id="Internalizing-useful-numbers"><a href="#Internalizing-useful-numbers" class="headerlink" title="Internalizing useful numbers"></a>Internalizing useful numbers</h4><p>The knowledge of useful numbers provides a valuable shortcut for knowing where to invest effort to maximize gains.</p><p>Internalizing useful numbers can also help you spot anomalies in data measurements.</p><p>Knowledge of useful numbers can clarify both the areas and scope for improvement.</p><p>To obtain performance-related numbers, you can:</p><ul><li>Write small benchmarks.</li><li>Talk with teams (possibly at other companies) that have worked in similar focus areas.</li><li>Dig through your own historical data.</li><li>Measure parts of the data yourself.</li></ul><h4 id="Be-skeptical-about-data-integrity"><a href="#Be-skeptical-about-data-integrity" class="headerlink" title="Be skeptical about data integrity"></a>Be skeptical about data integrity</h4><p>The right metric can slice through office politics, philosophical biases, and product arguments, quickly resolving discussions. Unfortunately, the wrong metric can do the same thing – with disastrous results.</p><p>All data can be abused. People interpret data the way they want to interpret it.</p><p>Untrustworthy data that gets incorporated into decision-making processes provides negative leverage. It may lead teams to make the wrong decision or waste cognitive cycles second-guessing themselves.</p><p>Our best defense against data abuse is skepticism.</p><ul><li>Compare the numbers with your intuition to see if they align.</li><li>Try to arrive at the same data from a different direction and see if the metrics still make sense.</li><li>If a metric implies some other property, try to measure the other property to make sure the conclusions are consistent.</li></ul><p>Metrics-related code tends to be less robust. 
Errors can get introduced anywhere in the data collection or processing pipeline:</p><ul><li>Forget to measure a particular code path if there are multiple entry points.</li><li>Data can get dropped when sent over the network, leading to inaccurate ground truth data.</li><li>When data from multiple sources get merged, not paying attention to how different teams interpreted the definitions, units, or standards for what ought to have been logged can introduce inconsistencies.</li><li>Data visualization is hard to unit test.</li></ul><p>Tips:</p><ul><li><strong>Log data liberally, in case it turns out to be useful later on.</strong></li><li>Build tools to iterate on data accuracy sooner.</li><li><strong>Write end-to-end integration tests to validate your entire analytics pipeline.</strong></li><li>Examine collected data sooner.</li><li><strong>Cross-validate data accuracy by computing the same metric in multiple ways.</strong></li><li>When a number does look off, dig into it early.</li></ul><h3 id="6-Validate-Your-Ideas-Early-and-Often"><a href="#6-Validate-Your-Ideas-Early-and-Often" class="headerlink" title="6. Validate Your Ideas Early and Often"></a>6. 
Validate Your Ideas Early and Often</h3><h4 id="Find-low-effort-ways-to-validate-your-work"><a href="#Find-low-effort-ways-to-validate-your-work" class="headerlink" title="Find low-effort ways to validate your work"></a>Find low-effort ways to validate your work</h4><p>Invest a small amount of work to gather data to validate your project assumptions and goals.</p><ul><li>Demystifying the riskiest areas first lets you proactively update your plan and avoid nasty surprises that might invalidate your efforts later.</li><li>One way to validate your idea would be to spend 10% of your effort building a small, informative prototype.<ul><li>Measuring performance on a representative workload</li><li>Comparing the code footprint of the module you rewrote against the original module.</li><li>Assessing the ease of adding new features.</li></ul></li><li>Minimum viable product (MVP), the version of a new product which allows a team to collect the maximum amount of validated learning about customers with the least effort. Sometimes, building an MVP requires being creative. Dropbox’s MVP was a 4-minute video.</li><li>The strategy of faking the full implementation of an idea to validate whether it will work is extremely powerful. 
Asana used a fake “signup via Google” button.</li></ul><h4 id="Continuously-validate-product-changes-with-A-B-testing"><a href="#Continuously-validate-product-changes-with-A-B-testing" class="headerlink" title="Continuously validate product changes with A/B testing"></a>Continuously validate product changes with A/B testing</h4><p>Even if you were absolutely convinced that a certain change would improve metrics, an A/B test tells you how much better that variation actually is.</p><p>A/B tests also encourage an iterative approach to product development.</p><h4 id="Beware-the-one-person-team"><a href="#Beware-the-one-person-team" class="headerlink" title="Beware the one-person team"></a>Beware the one-person team</h4><p>Additional risks introduced by working on a one-person project:</p><ul><li>It adds friction to getting feedback. And it can be tempting to defer getting feedback until you think it’s nearly perfect.</li><li>The lows of a project are more demoralizing.</li><li>The highs are less motivating.</li></ul><p>Tips for setting up feedback channels to increase chances of success:</p><ul><li>Be open and receptive to feedback. Don’t adopt a defensive mindset. View feedback and criticism as opportunities for improvement.</li><li>Commit code early and often.</li><li>Request code reviews from thorough critics.</li><li><strong>Ask to bounce ideas off your teammates. Explaining an idea to others is one of the best ways of learning it yourself. 
And your explanation might reveal holes in your own understanding.</strong></li><li>Design the interface or API first.</li><li>Send out a design document before devoting your energy to your code.</li><li><strong>Structure ongoing projects so that there is some shared context with your teammates.</strong></li><li><strong>Solicit buy-ins for controversial features before investing too much time.</strong></li></ul><h4 id="Build-feedback-loops-for-your-decisions"><a href="#Build-feedback-loops-for-your-decisions" class="headerlink" title="Build feedback loops for your decisions"></a>Build feedback loops for your decisions</h4><p>Creating a feedback loop is necessary for all aspects of a job. Many of our work decisions are testable hypotheses. You may not be able to test an idea as rigorously as you could with an A/B test and ample amounts of traffic, but you can still transform what otherwise would be guesswork into informed decision-making.</p><h3 id="7-Improve-Your-Project-Estimation-Skills"><a href="#7-Improve-Your-Project-Estimation-Skills" class="headerlink" title="7. Improve Your Project Estimation Skills"></a>7. Improve Your Project Estimation Skills</h3><h4 id="Use-accurate-estimates-to-drive-project-planning"><a href="#Use-accurate-estimates-to-drive-project-planning" class="headerlink" title="Use accurate estimates to drive project planning"></a>Use accurate estimates to drive project planning</h4><p>Managers and business leaders specify targets. Engineers create estimates. A good estimate does not merely reflect our best guess about how long or how much work a project will take. Instead, it’s an estimate that provides a clear enough view of the project reality to allow the project leadership to make good decisions about how to control the project to hit its targets.</p><p>Project schedules often slip because we allow the target to alter the estimate. A more productive approach is to use the estimates to inform project planning. 
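</p><p>One way to make estimate-driven planning concrete (the task names and day counts below are invented for illustration) is to treat each task estimate as a distribution rather than a single number, then simulate the project to see the probability of hitting a target date:</p>

```python
import random

# Hypothetical plan: per-task (best, likely, worst) estimates in days.
tasks = [
    ("api", 2, 3, 6),
    ("ui", 1, 2, 4),
    ("migration", 3, 5, 10),
]
target_days = 12

random.seed(0)
trials = 10_000
hits = 0
for _ in range(trials):
    # random.triangular(low, high, mode) samples one plausible duration.
    total = sum(random.triangular(lo, hi, mode) for _, lo, mode, hi in tasks)
    if total <= target_days:
        hits += 1

print(f"chance of finishing by day {target_days}: {hits / trials:.0%}")
```

<p>When the simulated odds look bad, the planning conversation has data behind it.</p><p>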
If it’s not possible to deliver all features by the target date, we could hold the date constant and deliver what is possible, or hold the feature set constant and push back the date.</p><p>Tips for producing accurate estimates:</p><ul><li><strong>Decompose the project into granular tasks.</strong> A long estimate is a hiding place for nasty surprises. Treat it as a warning that you haven’t thought through the task thoroughly enough to understand what’s involved.</li><li>Estimate based on how long tasks will take, not on how long you or someone else wants them to take. Managers challenge estimates. If you’ve made your estimates granular, you can defend them more easily.</li><li>Think of estimates as probability distributions, not best-case scenarios.</li><li><strong>Let the person doing the actual task make the estimate.</strong></li><li><strong>Beware of anchoring bias.</strong> Avoid committing to an initial number before actually outlining the tasks involved.</li><li><strong>Use multiple approaches to estimate the same task.</strong><ul><li>Decompose the project into granular tasks, estimate each individual task, and create a bottom-up estimate</li><li>Gather historical data on how long it took to build something similar</li><li>Count the number of subsystems you have to build and estimate the average time required for each one</li></ul></li><li><strong>Beware the mythical man-month.</strong></li><li>Validate estimates against historical data. 
If you know that historically, you’ve tended to underestimate by 20%, then you’ll know that it’s worthwhile to scale up your overall estimate by 25%.</li><li><strong>Use timeboxing to constrain tasks that can grow in scope.</strong></li><li>Allow others to challenge estimates.</li></ul><h4 id="Allow-buffer-room-for-the-unknown-in-the-schedule"><a href="#Allow-buffer-room-for-the-unknown-in-the-schedule" class="headerlink" title="Allow buffer room for the unknown in the schedule"></a>Allow buffer room for the unknown in the schedule</h4><p>Acknowledge that the longer a project is, the more likely that an unexpected problem will arise.</p><ul><li>Leave buffer room for unknowns</li><li>Separate estimated work time from calendar time.<ul><li>An 8-hour workday doesn’t actually provide 8 hours of working time on a project.</li><li>The effect of interruptions is further compounded when schedules slip.</li></ul></li><li>Be clear that a certain schedule is contingent on some person spending a certain amount of time each week on the project.</li><li>Factor in competing time investments.</li></ul><h4 id="Define-specific-project-goals-and-measurable-milestones"><a href="#Define-specific-project-goals-and-measurable-milestones" class="headerlink" title="Define specific project goals and measurable milestones"></a>Define specific project goals and measurable milestones</h4><p>What frequently causes a project to slip is a fuzzy understanding of what constitutes success.</p><p>Setting a project goal produces two concrete benefits:</p><ul><li>A well-defined goal provides an important filter for separating the must-haves from the nice-to-haves in the task list.</li><li>It builds clarity and alignment across key stakeholders. 
It’s very important to understand what the goal is, what your constraints are, and to call out the assumptions that you’re making.</li></ul><p>Building alignment also helps team members be more accountable for local tradeoffs that might hurt global goals.</p><p>Define specific goals to reduce risk and efficiently allocate time, and outline milestones to track progress.</p><h4 id="Reduce-risk-early"><a href="#Reduce-risk-early" class="headerlink" title="Reduce risk early"></a>Reduce risk early</h4><p>As engineers, we like to build things. This tendency can bias us toward making visible progress on the easier parts of a project that we understand well. We then convince ourselves that we’re right on track, because the cost of riskier areas hasn’t yet materialized.</p><p>Effectively executing on a project means minimizing the risk that a deadline might slip and surfacing unexpected issues as early as possible.</p><p>Tackling the riskiest areas first helps us identify any estimation errors associated with them. The goal from the beginning should be to maximize learning and minimize risk, so that we can adjust our project plan if necessary.</p><p>Examples:</p><ul><li>When switching to a new technology, build a small-scale end-to-end prototype.</li><li>When adopting a new backend infrastructure, gain an early systematic understanding of its performance and failure characteristics.</li><li>When considering a new design to improve application performance, benchmark core pieces of code.</li></ul><p>One effective strategy to reduce integration risk is to build end-to-end scaffolding and do system testing earlier. 
Front-loading the integration work provides a number of benefits:</p><ul><li>It forces you to think more about the necessary glue between different pieces and how they interact, which can help refine the integration estimates and reduce project risk.</li><li>If something breaks the end-to-end system during development, you can identify and fix it along the way, while dealing with much less code complexity, rather than scrambling to tackle it at the end.</li><li>It amortizes the cost of integration throughout the development process, which helps build a stronger awareness of how much integration work is actually left.</li></ul><p>Our initial project estimates will exhibit high variance because we’re operating under uncertainty and imperfect information. As we gain more information and revise our estimates, the variance narrows. By shifting the work that can take highly variable amounts of time to earlier in the process, we reduce risk and give ourselves more time and information to make effective project plans.</p><h4 id="Approach-rewrite-projects-with-extreme-caution"><a href="#Approach-rewrite-projects-with-extreme-caution" class="headerlink" title="Approach rewrite projects with extreme caution"></a>Approach rewrite projects with extreme caution</h4><p>Trying to rewrite stuff from scratch – that’s the cardinal sin.</p><p>Rewrite projects are particularly troublesome because:</p><ul><li>They share the same project planning and estimation difficulties.</li><li>We tend to underestimate them more drastically due to a false sense of familiarity.</li><li>It’s easy and tempting to bundle additional improvements into a rewrite.</li><li>When a rewrite is ongoing, any new features or improvements must either be added to the rewritten version or they must be duplicated.</li></ul><p>The second system is the most dangerous system a man ever designs.</p><p>Engineers should use a series of incremental, behavior-preserving transformations to refactor code. 
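</p><p>One common incremental pattern (this shim and its names are illustrative, not the book’s prescription) is to run the new implementation alongside the old one behind a single entry point, serving the legacy answer while logging any divergence, and cutting over only once the mismatch log stays empty:</p>

```python
# Illustrative side-by-side shim for a behavior-preserving rewrite.

def legacy_total(prices):          # old code being replaced
    total = 0
    for p in prices:
        total += p
    return total

def new_total(prices):             # rewritten, hopefully equivalent
    return sum(prices)

mismatches = []

def total(prices):
    old, new = legacy_total(prices), new_total(prices)
    if old != new:
        mismatches.append(prices)  # investigate before cutting over
    return old                     # behavior-preserving: serve legacy result

print(total([1, 2, 3]), len(mismatches))  # 6 0
```

<p>Each function migrated this way is an independent, shippable step.</p><p>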
Rewriting a system incrementally is a high-leverage activity. It provides additional flexibility at each step to shift to other work that might be higher-leverage.</p><p>Sometimes, doing an incremental rewrite might not be possible. The next best approach is to break the rewrite down into separate, targeted phases.</p><h4 id="Know-the-limits-of-overtime"><a href="#Know-the-limits-of-overtime" class="headerlink" title="Know the limits of overtime"></a>Know the limits of overtime</h4><p>Don’t sprint in the middle of a marathon.</p><p>Reasons why working more hours doesn’t necessarily mean hitting the launch date:</p><ul><li>Hourly productivity decreases with additional hours worked.</li><li>You’re probably more behind schedule than you think.</li><li>Additional hours can burn out team members.</li><li>Working extra hours can hurt team dynamics.</li><li><strong>Communication overhead increases as the deadline looms.</strong></li><li><strong>The sprint toward the deadline incentivizes technical debts.</strong></li></ul><p>Tips for increasing the probability that overtime will actually accomplish your goals:</p><ul><li>Making sure everyone understands the primary causes for why the timeline has slipped thus far.</li><li>Developing a realistic and revised version of the project plan and timeline.</li><li>Being ready to abandon the sprint if you slip even further from the revised timeline.</li></ul><h2 id="Part-III-Build-Long-Term-Value"><a href="#Part-III-Build-Long-Term-Value" class="headerlink" title="Part III: Build Long-Term Value"></a>Part III: Build Long-Term Value</h2><h3 id="8-Balance-Quality-with-Pragmatism"><a href="#8-Balance-Quality-with-Pragmatism" class="headerlink" title="8. Balance Quality with Pragmatism"></a>8. 
Balance Quality with Pragmatism</h3><h4 id="Establish-a-culture-of-reviewing-code"><a href="#Establish-a-culture-of-reviewing-code" class="headerlink" title="Establish a culture of reviewing code"></a>Establish a culture of reviewing code</h4><p>The benefits of code reviews:</p><ul><li>Catching bugs or design shortcomings early.</li><li>Increasing accountability for code changes.</li><li>Positive modeling of how to write good code.</li><li>Sharing working knowledge of the codebase.</li><li>Increasing long-term agility.</li></ul><p>Fundamentally, there’s a tradeoff between the additional quality that code reviews can provide and the short-term productivity win from spending that time to add value in other ways.</p><p>Code reviews can be structured in different ways to reduce their overhead while still maintaining their benefits. Experiment to find the right balance of code reviews that work for you and your team.</p><h4 id="Manage-complexity-through-abstraction"><a href="#Manage-complexity-through-abstraction" class="headerlink" title="Manage complexity through abstraction"></a>Manage complexity through abstraction</h4><p>How the right abstraction increases engineering productivity:</p><ul><li><strong>It reduces the complexity of the original problem into easier-to-understand primitives.</strong></li><li>It reduces future application maintenance and makes it easier to apply future improvements.</li><li>It solves the hard problems once and enables the solution to be used multiple times.</li></ul><p>When we’re looking for the right tool for the job and we find it easier to build something from scratch rather than incorporate an existing abstraction intended for our use case, that’s a signal that the abstraction might be ill-designed.</p><p>Bad abstractions aren’t just wasted effort; they’re also liabilities that slow down future development.</p><p>Good abstractions should be:</p><ul><li>Easy to learn</li><li><strong>Easy to use even without 
documentation</strong></li><li><strong>Hard to misuse</strong></li><li>Sufficiently powerful to satisfy requirements</li><li>Easy to extend</li><li>Appropriate to the audience</li></ul><h4 id="Scale-code-quality-with-automated-testing"><a href="#Scale-code-quality-with-automated-testing" class="headerlink" title="Scale code quality with automated testing"></a>Scale code quality with automated testing</h4><p>Tests allow engineers to make changes, especially large refactorings, with significantly higher confidence. When code does break, automated tests help to efficiently identify who’s accountable.</p><p>Tests offer executable documentation of what cases the original author considered and how to invoke the code.</p><p>The extent to which you should automate testing again boils down to a matter of tradeoffs. The inflection point came when a simple unit test visibly started to save time.</p><h4 id="Manage-technical-debt"><a href="#Manage-technical-debt" class="headerlink" title="Manage technical debt"></a>Manage technical debt</h4><p>Since our initial understanding of problems always will be incomplete, incurring a little debt is unavoidable. The key to being a more effective engineer is to incur technical debt when it’s necessary to get things done for a deadline, but to pay off that debt periodically.</p><h3 id="9-Minimize-Operational-Burden"><a href="#9-Minimize-Operational-Burden" class="headerlink" title="9. Minimize Operational Burden"></a>9. 
Minimize Operational Burden</h3><h4 id="Do-the-simple-thing-first"><a href="#Do-the-simple-thing-first" class="headerlink" title="Do the simple thing first"></a>Do the simple thing first</h4><p>Simple solutions impose a lower operational burden because they’re easier to understand, maintain and modify.</p><p>Having too complex of an architecture imposes a maintenance cost in a few ways:</p><ul><li><strong>Engineering expertise gets splintered across multiple systems.</strong> Every system has its own unique sets of properties and failure modes that must be discovered, understood, and mastered.</li><li>Increased complexity introduces more potential single points of failure.</li><li>New engineers face a steeper learning curve when learning and understanding the new systems.</li><li><strong>Effort towards improving abstractions, libraries, and tools get diluted across the different systems.</strong></li></ul><p>People often say, “Use the right tool for the job” – but that can also increase the number of moving parts. 
Does the complexity of having more parts outweigh the benefits of simplicity through standardization?</p><h4 id="Fail-fast-to-pinpoint-the-source-of-errors"><a href="#Fail-fast-to-pinpoint-the-source-of-errors" class="headerlink" title="Fail fast to pinpoint the source of errors"></a>Fail fast to pinpoint the source of errors</h4><p>By failing fast, we can more quickly and effectively surface and address issues.</p><p>Examples:</p><ul><li>Crashing at startup time when encountering configuration errors.</li><li><strong>Validating software inputs, particularly if they won’t be consumed until much later.</strong></li><li>Bubbling up an error from an external service that you don’t know how to handle, rather than swallowing it.</li><li>Throwing an exception as soon as possible when certain modifications to a data structure, like a collection, would render dependent data structures, like an iterator, unusable.</li><li>Throwing an exception if key data structures have been corrupted rather than propagating that corruption further within the system.</li><li><strong>Asserting that key invariants hold before or after complex logic flows and attaching sufficiently descriptive failure messages.</strong></li><li>Alerting engineers about any invalid or inconsistent program state as early as possible.</li></ul><p>You can take a hybrid approach: use fail-fast techniques to surface issues immediately and as close to the actual source of error as possible; and complement them with a global exception handler that reports the error to engineers while failing gracefully to the end user.</p><h4 id="Relentlessly-automate-mechanical-tasks"><a href="#Relentlessly-automate-mechanical-tasks" class="headerlink" title="Relentlessly automate mechanical tasks"></a>Relentlessly automate mechanical tasks</h4><p>Engineers automate less frequently than they should, for a few reasons:</p><ul><li>They don’t have the time right now.</li><li><strong>They suffer from the tragedy of the commons</strong>, 
in which individuals act rationally according to their own self-interest but contrary to the group’s best long-term interests. When manual work is spread across multiple engineers and teams, it reduces the incentive of any individual engineer to spend the time to automate.</li><li>They lack familiarity with automation tools.</li><li><strong>They underestimate the future frequency of the task.</strong></li><li><strong>They don’t internalize the time savings over a long time horizon.</strong></li></ul><p>Automation can produce diminishing returns as you move from automating mechanics to automating decision-making.</p><h4 id="Aim-for-idempotence-and-reentrancy"><a href="#Aim-for-idempotence-and-reentrancy" class="headerlink" title="Aim for idempotence and reentrancy"></a>Aim for idempotence and reentrancy</h4><p>Idempotence offers another benefit that many effective engineers take advantage of: the ability to run infrequent processes at a more frequent rate than strictly necessary, to expose problems sooner.</p><p>Running batch processes more frequently also allows you to handle assorted glitches transparently. A system check that runs every 5 to 10 minutes might raise spurious alarms because a temporary network glitch causes it to fail, but running the check every 60 seconds and only raising an alarm on consecutive failures dramatically decreases the chances of false positives. 
Many temporary failures might resolve themselves within a minute, reducing the need for manual intervention.</p><h4 id="Hone-your-ability-to-respond-and-recover-quickly"><a href="#Hone-your-ability-to-respond-and-recover-quickly" class="headerlink" title="Hone your ability to respond and recover quickly"></a>Hone your ability to respond and recover quickly</h4><p>The best defense against major unexpected failures is to fail often.</p><p>It’s important to focus on uptime and quality, but as we go down the list of probable failure modes or known bugs, we will find that our time investments produce diminishing returns. No matter how careful we are, unexpected failures will always occur. At some point, it becomes higher leverage to focus our time and energy on our ability to recover quickly than on preventing failures in the first place.</p><p>We can script for success and shift our decision-making away from high-stakes and high-pressure situations and into more controlled environments.</p><h3 id="10-Invest-in-Your-Team’s-Growth"><a href="#10-Invest-in-Your-Team’s-Growth" class="headerlink" title="10. Invest in Your Team’s Growth"></a>10. Invest in Your Team’s Growth</h3><h4 id="Help-the-people-around-you-succeed"><a href="#Help-the-people-around-you-succeed" class="headerlink" title="Help the people around you succeed"></a>Help the people around you succeed</h4><p>The higher you climb up the engineering ladder, the more your effectiveness will be measured not by your individual contributions but by your impact on the people around you. Thinking early in your career about how to help your coworkers succeed instills the right habits that in turn will lead to your own success.</p><p>Your career success depends largely on your company and team’s success. 
You get more credit than you deserve for being part of a successful company, and less credit than you deserve for being part of an unsuccessful company.</p><h4 id="Make-hiring-a-priority"><a href="#Make-hiring-a-priority" class="headerlink" title="Make hiring a priority"></a>Make hiring a priority</h4><p>A good interview process achieves two goals:</p><ul><li>It screens for the type of people likely to do well on the team</li><li>It gets candidates excited about the team</li></ul><p>Tips for improving your interview process:</p><ul><li>Take time with your team to identify which qualities in a potential teammate you care about the most.</li><li>Periodically meet to discuss how effective the current recruiting and interview processes are at finding new hires who succeed on the team.</li><li><strong>Design interview problems with multiple layers of difficulty that you can tailor to the candidate’s ability.</strong></li><li>Control the interview pace to maintain a high signal-to-noise ratio.</li><li><strong>Scan for red flags by rapidly firing short-answer questions.</strong></li><li>Periodically shadow or pair with another team member during interviews.</li></ul><h4 id="Invest-in-onboarding-and-mentoring"><a href="#Invest-in-onboarding-and-mentoring" class="headerlink" title="Invest in onboarding and mentoring"></a>Invest in onboarding and mentoring</h4><p>A good initial experience influences an engineer’s perception of the engineering culture, shapes her ability to deliver future impact, and directs her learning and activities according to team priorities.</p><p>Quora’s onboarding program:</p><ul><li>Codelabs: why a core abstraction was designed and how it’s used.</li><li>Onboarding talks: codebase, site architecture, development tools, engineering expectations and values, key focus areas.</li><li>Mentorship.</li><li>Starter tasks.</li></ul><h4 id="Share-ownership-of-code"><a href="#Share-ownership-of-code" class="headerlink" title="Share ownership of code"></a>Share 
ownership of code</h4><p>There’s a common misconception that being the sole engineer responsible for a project increases your value. When you’re the bottleneck for a project, you lose your flexibility to work on other things.</p><p>Tips for increasing shared ownership:</p><ul><li>Avoid one-person teams.</li><li>Review each other’s code and software designs.</li><li>Rotate different types of tasks and responsibilities across the team.</li><li>Keep code readable and code quality high.</li><li>Present tech talks on software decisions and architecture.</li><li>Document your software, either through high-level design documents or in code-level comments.</li><li><strong>Document the complex workflows or non-obvious workarounds necessary for you to get things done.</strong></li><li>Invest time in teaching and mentoring other team members.</li></ul><h4 id="Build-collective-wisdom-through-post-mortems"><a href="#Build-collective-wisdom-through-post-mortems" class="headerlink" title="Build collective wisdom through post-mortems"></a>Build collective wisdom through post-mortems</h4><p>Meet and conduct a detailed post-mortem after a site outage, a high-priority bug, or some other infrastructure issue. Try applying the same healthy retrospection to projects and launches.</p><p>Ultimately, compiling team lessons is predicated upon honest conversation – and holding an honest conversation about a project can be uncomfortable. It requires aligning behind a common goal of improving the product or team, and not focusing on where to assign blame. 
It requires being open and receptive to feedback, with the goal of building collective wisdom around what went wrong and what could’ve been done better.</p><h4 id="Build-a-great-engineering-culture"><a href="#Build-a-great-engineering-culture" class="headerlink" title="Build a great engineering culture"></a>Build a great engineering culture</h4><p>Great engineering cultures:</p><ul><li>Optimize for iteration speed.</li><li>Push relentlessly towards automation.</li><li>Build the right software abstractions.</li><li>Focus on high code quality by using code reviews.</li><li>Maintain a respectful work environment.</li><li>Build shared ownership of code.</li><li>Invest in automated testing.</li><li>Allot experimentation time.</li><li>Foster a culture of learning and continuous improvement.</li><li>Hire the best.</li></ul>]]></content>
<summary type="html">
<p><strong>This blog has been migrated to <a href="https://linghao.io" target="_blank" rel="noopener">linghao.io</a>. Read this post on my new blog: <a href="https://linghao.io/notes/the-effective-engineer/" target="_blank" rel="noopener">https://linghao.io/notes/the-effective-engineer/</a>.</strong></p>
<p>Starting from “time is our most limited resource”, <a href="https://www.effectiveengineer.com/book" target="_blank" rel="noopener"><em>The Effective Engineer</em></a> by Edmond Lau first establishes the methodology of using “leverage” to guide our actions. The book then discusses, from multiple angles, how to become a more effective engineer by focusing on high-leverage activities that produce a disproportionately high impact for a relatively small time investment. Ranging from adopting the right mindsets and executing effectively to building long-term value, the topics are complemented with ample examples from the industry. Much of the content is easily generalizable to areas beyond software engineering as well. A must-read for software engineers.</p>
<p>This post is a refined version of the notes I took while reading this book.</p>
</summary>
<category term="Notes" scheme="http://dnc1994.com/categories/Notes/"/>
</entry>
<entry>
<title>过去这五年</title>
<link href="http://dnc1994.com/2018/10/last-5-years/"/>
<id>http://dnc1994.com/2018/10/last-5-years/</id>
<published>2018-10-11T00:51:42.000Z</published>
<updated>2019-06-03T00:38:32.817Z</updated>
<content type="html"><![CDATA[<p><strong>本博客已经迁移到新域名 <a href="https://linghao.io" target="_blank" rel="noopener">linghao.io</a>。请前往新博客阅读本文:<a href="https://linghao.io/posts/five-year-summary-2013-2018/" target="_blank" rel="noopener">https://linghao.io/posts/five-year-summary-2013-2018/</a>。</strong></p><p>这是一篇对过去五年的忠实记录。有成功也有失败,有欢笑也有泪水。作为一篇很大程度上为自己而写的回忆录,倘若对你有些许启发,便是无憾。</p><p>本文谢绝转载。</p><a id="more"></a><h2 id="2013-2014"><a href="#2013-2014" class="headerlink" title="2013 - 2014"></a>2013 - 2014</h2><p>在高中准备了两年信息学竞赛却没能拿到全国赛(NOI)资格之后,我通过计算机特长生夏令营保送来到了复旦。那时保送生有一个选择是提前半年过去读预科,上三门数学课并进行 ACM 训练。而我并不打算在大学继续参加算法竞赛,同时也觉得利用高三的时间提高英语能力更为重要,就没有选择去读预科。那一年我的自主学习成果还是不错的:英语水平突飞猛进;掌握了微积分和线性代数的基础知识;开始学习 Python 并尝试写出了一些自动化脚本;还在无意中浅尝了机器学习这一领域。</p><p>与那段时光相比,我在复旦的第一年可以说是浑浑噩噩。人生第一次离家独立生活;从三线城市来到一线城市被新鲜事物包围而感到迷失;对 CS 的学术/职业道路缺乏概念,看不清未来在哪;复旦特色的大类加通识教育导致课表上堆满了不感兴趣的课;…… 种种因素叠加起来,我的大一虽然过得还算开心,但在个人发展上几乎一事无成。不仅结结实实吃了三个 C 档成绩使得 GPA 彻底不能看,还经历了许多的半途而废:加入了一打感兴趣的社团却都没有坚持参加,一时兴起买的树莓派在新鲜感褪去之后也惨遭废弃等。单从结果来看,大一做的唯一正确的选择居然是选了一门网球课,从而跟之后成为室友兼良师益友的余神混熟,并且连带着认识了戍爷和 Farter 这两位未来室友。</p><p>第二个学期的时候,在余神和戍爷的推荐下,我加入了复旦学生网(STU)这个神奇的组织。我在那里得到了技术启蒙,也结交了好几位至今仍然志同道合的朋友。那时的我简单地满足于写出能工作的脚本,在 STU 与众不同的报名环节还提交了参考朋友代码开发的 Pixiv 图片批量下载工具作为自信作。然而刚加入 STU 没多久,余神就随手表演了如何闭着眼睛写出一个选课信息爬虫。当我还在依葫芦画瓢使用 <code>urllib2</code> 的时候,余神早已建立了使用 <code>requests</code> 和 <code>pyquery</code> 等工具编写爬虫的最佳实践,并且对 HTTP 等原理层面的知识也已经有了切实的掌握。我完全折服于能够将业务逻辑级别的代码写得如此熟练和优美的这种能力,也认识到了自己在技术上是何等的无知。在同一时期,戍爷和 QS 学长的前端入门教学也成为了我进入 Web 开发领域的起始点。</p><p>尽管还是看不清远方在哪,脚下的道路已经开始变得具体而坚实。我本能般地混入了一个人人都比我强的圈子,跟随余神脚步租了第一个服务器,购入了第一个域名 <a href="dnc1994.com">dnc1994.com</a>,搭建了第一个基于 WordPress 的博客。恰好也是在这个时候,为复旦 CS 不尽人意的教学质量所失望的我,接触到了 MOOC 这一事物的存在。</p><p>诚然,这一时期的我很容易沉浸于低效率的学习方式中。为了驱散懂得太少的焦虑,我曾经周期性地制订不切实际的计划,比如浪费假期的大把时间去阅读一些以我的基础完全跟不上的教科书等。类似的现象在过去的五年中时有发生。值得欣慰的是,随着自身的成长,我逐渐能够正视自身的劣势,保持清醒的头脑来修正错误的学习方式。</p><p>大一结束时,在夹杂着乱码小插曲的宿舍分配之后,我和余神、戍爷、Farter 四个人组成了带有传奇色彩的 709 寝室。709 的每一个人都很独特,而我应该是其中最为普通,也是技术实力最弱的一个。但也正因如此,我在接下来的三年中学到了终身受用的东西。</p><h2 id="2014-2015"><a href="#2014-2015" 
class="headerlink" title="2014 - 2015"></a>2014 - 2015</h2><p>大二第一个学期,我参加了第一次托福考试,拿到了自己都有些吃惊的 110 分。那是我横跨三年的留学计划的开端。想去留学有诸多动机,但我已经记不清究竟是先下了留学的决心才报的托福,还是先随便报着被成绩鼓励到了才下的决心。不论如何,在不晚于第二个学期到来之前,我坚定了去北美留学的目标。</p><p>差不多在同一时间,我在高三到大一两年的探索中所接受的信息和观念,终于内化并形成了具体的想法。我第一次对于自己想要成为一个怎样的人有了清晰的认识,并将那时的想法写在了<a href="https://www.zhihu.com/question/26095881/answer/34732051" target="_blank" rel="noopener">这个知乎回答</a>里。四年过后,我的想法依然没有改变。这大概就是我的确找到了能够定义自己的叙事的最有力证明了吧。</p><p>15 年 3 月,通过朋友介绍,我认识了一位同样也是从复旦到 CMU 的学长,向他请教了许多留学相关的问题。在对整个流程有了基本的系统性认识之后,我意识到除了 GPA 和英语成绩这些硬性条件,自己非常欠缺的就是经历上的不足。我开始发挥自己擅长制定和执行计划的优势,早早报名了 15 年 10 月的托福和 GRE 考试,并部署了周期长但均摊负担轻的备考策略。但也正是从这个时间点开始,我产生了一种「一步落后,步步落后」的焦虑感。</p><p>由于大一没有留学的意向,并且自负地蔑视刷 GPA 的行为,我对自己不喜欢的科目非常不重视,导致当时的 GPA 只有 3.4 出头。另一方面,学校官方提供的海外交流项目也基本都在大二第一个学期的时候完成了申请和分配,而对交流机会的稀缺性缺乏敏感的我在那时没有提出参加交流的意向。等到第二个学期,我才发现春学期几乎没有可去的项目。这种被人逐渐甩开的焦虑感,催生了之后一系列不那么明智的选择。</p><p>这个时期,我开始投入大量精力到 MOOC 上。从数据挖掘到机器学习再到 Web 开发,我都是在 Coursera 上入门的。正是由于在林轩田老师的机器学习基石 & 技法这两门课上系统性地接触了机器学习,我才得以在 CS 这个巨大的学科下找到了自己真正热爱的方向。</p><p>开始学习 MOOC 一半是机缘巧合,另一半则是源于高三时代使用 MIT OCW 学习线性代数而体会到的世界一流名校教学资源的优越性;而把 MOOC 作为爱好坚持下去则主要有两个原因:一是复旦校内优质教学资源的缺位,二是我本人对英语的热爱。我曾经说过,适合大学生、与实际运用联系紧密、还能作为长期输入输出来源的英语学习方式,非 MOOC 莫属了。我最终完成的 MOOC 总数不下 50 门,也因为对 Coursera 的热爱得到了跟其中国区负责人近距离交流的机会。出国以后我也发现刷 MOOC 的经历使得我不需要任何适应就能融入以英语为主导的学术交流环境。但是反思一下,Coursera 也好别的平台也好,在初期(12、13 年)过后就很少再有进阶级别的课程了。用 MOOC 入门一个新的领域或许还无可厚非,但当时的我却沉浸于刷证书的快感不能自拔。这也是一定程度上的逃避现实,因为周围没有比自己刷得更凶的人,所以能够带来一种虚假的优越感。我刷掉的后 30 门课完全可以用来做更有价值的事情。</p><p>也是在第二学期,为了积累实验室经历,我加入了王新老师的 SONIC Lab,在周扬帆老师组里跟进 Mobile Computing 相关的工作。这是一个缺乏考虑却歪打正着地给我带来了许多收获的选择。由于大一时跟余神一同参加的最后无疾而终的腾飞计划就是找的王新老师作为指导,所以在仅仅以「加入一个实验室」为目的搜寻时很容易就通过这层关系联系上了正在招小朋友的周老师。</p><p>在周老师实验室的那一个学期,我每周需要阅读布置的一些论文并去跟老师和学长做讨论。那时我没有什么正儿八经的项目经验,也没有做过科研,对 Mobile Computing 这个领域更是毫无了解。我经常不能正确理解一项工作的难点在哪或是意义几何。而周老师的确是一位非常好的导师,他总是能耐心地讲解我的想法在哪里出了偏差,还传授了我许多阅读论文的技巧。可以说是周老师教会了我如何阅读论文,而这一技能令我在其他领域也受益无穷。</p><p>学期快结束时,周老师开始向我布置实际的开发工作,需要在 Android 
上实现对一些性能指标的监控。在尝试动手之后我发现,尽管阅读论文本身能给我带来理性愉悦,但这个领域终究不是我所感兴趣的。过去的经历让我很清楚自己不可能做好不喜欢的事情,并且那时我也已经找到了喜欢的方向(机器学习)。在一阵纠结之后,我离开了实验室。</p><p>这一年的暑假,我先后去了英国和美国。前者是书院组织去 Hertfordshire 的项目,乏善可陈;后者则是经由一个留学社团联系的去 Stanford 的项目。事后看来,那四周的时间和金钱成本花得不算太值。被信息不对称限制了想象力的我,并不知道可以自己联系教授去做暑研。(当然那个时候不比留学门槛水涨船高的现在,知道并会去实践这个套路的也是少数人。)英国的项目对我自然是毫无帮助,而美国的项目也经历了一波三折。最初项目承诺的是由 Stanford CS 的一位教授主持,后来由于主办方的失误导致了时间冲突,最终被迫改成由 Stanford d.school 的 Michael Barry 教授主持的 Design Thinking 项目。说得直白一点,这个项目基本上就是教授赚外快的副业,其内容也是名副其实的「游学」,由洛杉矶/旧金山观光、入门级别的 design/business 短课程和参观硅谷公司三部分组成。</p><p>即便如此,这段经历也并非一无是处。在硅谷的所见所闻,一定程度上支撑了我出国留学的目标:第一,我亲身体验了自己设想中的未来工作环境(参观了 Google、Facebook、LinkedIn、Twitter 等公司);第二,我认识了 Polarr 的创始人 Borui Wang,近距离体会了创业者的特有气质。我至今仍记得的一个细节,是在闲聊时我向他描述了学校里一些不尽人意的现状之后,他淡淡地说了一句,那么为什么不退学呢。看着他的眼神我很明白,这么说并不是在嘲讽,而是的确认为这是一个可行的选项。当时的我感叹道,有朝一日我也要成为在这种前提下有底气选择退学这一选项的人。</p><p>这个时期还有一件值得提的事情,是我对 MOOC 的发展有了接近盖棺定论的看法。MOOC 的初衷是作为改革传统教育的实验,探索通往个性化教育的道路。但它诞生之后,显然更多的人看中的是原本非公开的大量优质教育资源。最初那一批学校开课给公众留下的深刻印象使得 MOOC 不可避免地带上了公益性,而在后续发展中商业性也开始被挖掘。MOOC 平台需要足够数量的「水课」来赚取利润并维持一定的公益形象,而内容提供方的学校也不可能把所有的资源都抖出来。这里的本质矛盾在于个性化教育必然要求倾注在单个学生身上的资源变多,但商业化 MOOC 平台却都是规模制胜论。MOOC 不得不跟上原本就存在偏差的定位,现在大家对它的理解也都不一样。有人觉得是把好的课程免费带给每个人,也有人觉得只是换一种形式接受传统教育,还有人只是把它作为一种收割智商税的手段。最早那批想要探索新教学方式的人似乎已经绝迹了。想明白这些问题以后,我不再将这个阶段的 MOOC 看作是能够改变教育未来的事物,它也从此开始逐渐淡出我的生活。三四年后来看,我的想法跟现实并没有太大的出入。</p><h2 id="2015-2016"><a href="#2015-2016" class="headerlink" title="2015 - 2016"></a>2015 - 2016</h2><p>大三开始的时候,我对自己接下来要做什么就有比较清晰的把握了:搞定英语考试,积累更多机器学习相关的经历,找一份暑假实习。</p><p>大概是九月底十月初那会,在同学的推荐下,我加入了肖仰华老师的 GDM Lab / 知识工场。有着之前周老师对我的训练,我在讨论班时做论文汇报还是很得心应手的,也使肖老师对我产生了比较高的期待。之后我实际承担了两个 NLP 项目。讽刺的是,正是在 GDM Lab 的经历使我开始意识到自己不适合走纯学术路线。</p><p>在那两个项目中,我用脚本语言来处理数据和拼凑不同模块的能力得到了充分的锻炼,但也仅此而已了。我发现自己既不适合解决开放性过强的问题(也有部分原因在于反馈的缺失),也不能从以发表论文为最终目标的工作中收获满足感。我依然享受跟进学术进展的理性愉悦,在阅读和理解论文上做得还不错,也有自信做技术上的实现,但我还是很难成为一个好的研究者。并且实验室普遍低质量的胶水代码也跟我的技术审美不太契合。或许在这之前我还不能斩钉截铁地说自己是否打算申请 PhD,这段经历则让我彻底打消了这个念头。</p><p>第一学期时我还做了一个决定,那就是不论去哪,第二学期我都要出去交流。这么想主要是当时在学习和生活上都有许多烦心的事情,因为学分修得比较足就想换个环境调整一下身心。于是我去了国立交通大学(NCTU)。因为只有三门课,我在一番套瓷之后加入了 
NCTU Machine Learning Lab。我的初衷是想利用这个机会参与一些工作来积攒经历,但很快便了解到台湾这边非一线实验室的运作方式跟我们所熟悉的节奏相去甚远。在那边硕士和博士生们的唯一关注点似乎就是自己的毕业论文(其内容也比较接近文献综述),而并不会为了往一线会议/期刊投稿去做项目,与工业界的合作更是完全没有。而愿意接纳我的简仁宗教授虽然非常热情,但他对我所说的「交流」的理解似乎更接近于「见学」。他的期望就是我能在有空的时候来实验室待着,参与讨论班等集体活动,同时找些感兴趣的课题做一些独立研究,也不给我任何指导。</p><p>这时的我已经跨过那个焦虑自己一步落后步步落后的阶段了,心态就变得比较度假。在台湾的四个月,除了体会风土人情以外,我将主要精力花在了 Kaggle 上面。对于只学过机器学习理论的我来说,完整实践建模的整个过程无疑是很有价值的经历。我投入了大量的时间去阅读攻略经验并加以实践,最后误打误撞地在第一次比赛中进入了前 5%。这段经历对我所坚信的「做自己喜欢的事情才能做得好」的准则也是一次很好的检验。</p><p>在废寝忘食训练模型的过程中,我试图在 Windows 上安装 XGBoost。当时网上能找到的安装攻略几乎没有一篇是在我的环境下可行的。一通折腾之后,我终于通过拼凑两篇博文中的细节得到了一个适用范围较广的安装流程。那时我内心最直观的感受是,这种抓耳挠腮花费一整晚后终于在互联网的某个角落找到解决问题的关键的感觉,实在是过于愉悦了。我想将这份愉悦传递下去,而不是让自己曾经踩过的坑继续困扰更多的人。于是我将安装流程写成了博文,至今已经有几十个人向我表示感谢。这种成就感坚定了我继续产出类似干货内容的决心。</p><p>于是在那次 Kaggle 比赛结束以后,我整理了一部分参考文献,结合大量自己摸索出来的细节(尤其是现有攻略中较少谈及的部分,比如 Stacking 的实现细节等)写成了这篇 <a href="https://dnc1994.com/2016/04/rank-10-percent-in-first-kaggle-competition/">Kaggle 入门指南</a>,并在随后补上了对应的英文版。这篇文章的影响力完全超出了我的想象。先是英文版被 Kaggle 官方和 KDNuggets 等在数据科学圈子里比较有名的社区转发,接着就是国内外各路网站纷纷来请求授权转载(当然也有非授权转载甚至商业盗用)。后来甚至还有人民邮电出版社和机械工业出版社的编辑联系我希望能出版一本介绍 Kaggle 的书。也因为这篇文章,我的博客在已经很久没有更新过英文干货内容的前提下每个月还能有上千的海外访问量。</p><p>很快到了找暑假实习的时候。最初我因为迷信 MSRA 对出国的帮助,通过同学联系了一位做短文本理解的研究员,想要跟着他做一些深度学习项目,但最终因为无法承诺全职实习六个月而失去了这次机会。在开始找普通工业界实习时,同样出于对外企的迷信,我没有找 BAT 之类的公司,而是投了 Intel、Microsoft 等,但最后大部分公司都没有给我回复。后来我通过了一直特别感兴趣的 Splunk 的面试,却因为 HR 的工作失误没有能够去成实习。这时的我难免有些着急,正好当时也在找工作的戍爷向我推荐了做建站工具的创业公司 Strikingly。</p><p>Strikingly 的招聘流程令人惊艳。我印象尤其深刻的,是之后成为同事的数据工程师 Young 在面试时直接把实际工作中见到的钓鱼站点发给我,让我思考可以用什么策略去过滤它们。当时还未能从学术派思维中转变过来的我,不假思索地开始设想一整个完整的建模流程,而 Young 却告诉我其实靠简单的启发式规则就可以得到很不错的结果。回头来看,这是我第一次在面试中学到新的知识。纵观之后的面试经历,所有能让人学到东西的团队都不会太差。走完招聘流程,我成功被这家公司圈粉了,于是拿到 offer 后立马就接了下来。</p><p>当时的 Strikingly 主营业务是针对移动端优化的建站工具。用户在建站之后可以在每个网站各自的 Dashboard 中查看一些流量统计数据。我的上手项目就是将这个三年没更新的模块进行升级,为用户展示更多的数据。简单来说,我需要先在前端埋点将访问事件发送到我们使用的第三方服务,再在后端补充对应的查询 API,最后修改 Dashboard 的前端代码来展示新增的统计数据。当时主要有这样一些挑战:</p><ul><li>Dashboard 模块前端用的是 CoffeeScript(当时公司已经停止在新项目中使用),后端 API 是集成在主服务(Ruby on Rails)里的。两端的技术我都是第一次接触。</li><li>我抱有一种来到 
Strikingly 就是做数据建模的幻想,从心理上抗拒所有不直接对这一目标有贡献的工作。起初由于前后端代码都可以依葫芦画瓢所以我还能勉强接受,但做到一半发现后端 API 由于改动较大需要重新设计,并且跟我结对的后端工程师不恰当地试图将比较炫技的写法传授给对 Ruby 毫无经验的我,这使得我对需要达到的代码质量产生了混乱和错位的预期。</li><li>在一个高度模块化的庞大系统中 debug 时往往需要对比各个服务的 log,之前没有类似经验的我感到十分吃力。</li><li>公司内部对项目不是特别重视。由于原本对前端实现没有要求,经验有限的我就写得比较粗糙(尤其是样式)。而在项目快收尾时的一次会议上,CTO 经过评估又决定前端需要重构,并将任务交给了当时在一同实习的戍爷。从代码质量的层面来说这么做更为合理,也减轻了我的负担,但对我的积极性的确是一个打击。</li><li>我不适应创业公司的开发节奏,对 Ownership 的理解不够到位。我一度在自己的工作被别人 block 之后不去积极推动进度却转而开始忙活已经构想好的 Modeling 项目。这一点被我们的产品主管 Teng 狠狠地批评了。</li></ul><p>好在随着时间的流逝,我开始抛弃掉那些不切实际的幻想,并逐渐建立起一种对自己的项目负责的成熟态度。到开发基本完成进入 QA 和部署环节时,我已经能够驾轻就熟,跟 QA Lead 争论什么是 bug 什么是 feature,自己动手把新版本部署到生产环境,还在上线的当天做 hot fix。在发给全体用户的邮件里看到自己开发的 feature 上线时,那种创造了能够传递给千百万用户的价值的成就感让我明白,我想要追求的就是这种做产品的感觉。</p><h2 id="2016-2017"><a href="#2016-2017" class="headerlink" title="2016 - 2017"></a>2016 - 2017</h2><p>大四的第一个学期过得非常忙碌。在重中之重的留学申请之外,我还得同时处理好学业和实习。</p><p>鉴于身边有太多被中介坑害的例子,我的性格又不允许他人经手对自己未来能够产生重要影响的事情,我很早就决定要 DIY 申请。(这个时间点的我已经非常习惯和享受这种拓荒的感觉了。)由于可以预见到开学之后自己的时间将会非常有限,我早在八月初就开始了选校和文书的准备。借助几位学长学姐作为反馈的来源,我的个人陈述一共改了八稿。最初的两三稿写得异常痛苦,一边整理自己三年的经历,一边是无尽的后悔和丧气。在跨越什么都写不出来的阶段以后,接着就进入了截然相反的什么都想写上去的阶段。选经历,改写法,挑语病,再到最后调整可读性。不可否认的是,在这个过程中或多或少会陷入一种情绪从而过度投入时间成本,但回想起来那种直面自我的痛苦的确是有益的经历。</p><p>由于一切都规划得很早,我的申请季过得非常平稳。在九月底开始填网申之前,我手上已经有了所需要的全部组件,剩下的仅仅是将它们拼凑成型。值得一提的是,我将之前用来开 TOEFL/GRE 备考小讲座的微信群发展成了一个留学信息交流群。当时也有一些人数多得多的交流群,但信噪比都实在太低,而且我个人十分看重的那种将自己踩过的坑拿出来分享的例子更是少之又少。而我的群由于大部分成员原本就互相熟悉,又有我带头做毫无保留的分享,从而得以创造独特得多的价值。在申请季后期,我开始帮朋友审阅和修改 PS/CV,最后还写了一篇 <a href="https://dnc1994.com/2017/01/gradschool-application-diy-demystified/">DIY 申请总结</a>。</p><p>Strikingly 的工作则在这个时期陷入僵局。一方面固然是由于我被留学申请占用了大量时间,没有投入足够的热情和精力。另一方面也是因为,在最初的甜蜜期过去以后,理想和现实之间的差距开始令我感到无从下手。</p><p>Strikingly 将我招入公司时,是期望我能够建立一些增长相关的预测模型。我们的终极目标是预测一个用户的 Life Time Value(LTV),也即他/她在整个周期内能为公司创造多少价值。作为第一步,我尝试建立一个预测用户是否会取消服务(churn)的模型。第一次挑战现实世界中的数据科学问题,我自然是兴致勃勃。得益于 Strikingly 相对完整的技术基建和 Young 的帮助,我很快开始使用在参加 Kaggle 比赛时就已经非常熟悉的一套流程来进行建模。然而,种种意想不到的挑战接踵而至:</p><ul><li>由于没有现成的 OLAP 服务,我需要编写繁琐的查询语句来读入数据。由于之前对 SQL/ORM 
的掌握仅仅是能够写基本业务逻辑的水平,所以尽管在 Young 的帮助下将一些常用的查询类型封装成了库,但直到实习结束我在这方面的熟练度仍然没有什么提高。</li><li>由于一些数据是用第三方服务来统计的,所以这部分的查询模式跟存在自家数据库上的那些数据有一定差别,创造了额外的工作量。而且其中一个第三方服务由于技术水平不足,不能提供令人满意的查询性能,一时半会又看不到迁移的希望,这就使得问题更加严重。</li><li>Strikingly 的用户基数本就不大,而 Churn Prediction 主要针对付费用户,这就使得可用的数据量少之又少。尽管我使用了各种 Sampling 和 Validation 技巧,最终训练得到的模型性能仍然不太令人满意,而且我对其结果是否具有足够的统计显著性也没有太大的信心。</li><li>整个数据管道和建模流程从软件工程的角度来看问题多多。比如没有对 ETL 的正确性做验证(曾经导致我在错误的数据上浪费数天时间),没有统一的将模型部署成服务的约定和流程,也没有任何方式去监控模型上线以后的实际性能(我离职后模型很快就下线了)。</li></ul><p>归根结底,在那个时间点公司对于要搭建一个怎样的数据团队的理解是很肤浅的,正好又碰上我这个初出茅庐的数据挖掘工程师,最终只能产出这样的成果也比较遗憾。但在这里得到的经验教训却是我本科阶段最为宝贵的财富之一。同时,Strikingly 在如何做出好的产品、如何跟他人合作写出好的代码、如何构建好的工程师文化等等方面给了我无可替代的启迪,也直接影响了我对今后的工作体验的预期。更完整的实习感想可见<a href="https://www.zhihu.com/question/30292916/answer/141437259" target="_blank" rel="noopener">这个知乎回答</a>。</p><p>16 年底,逐渐丧失在 Strikingly 继续工作的热情的我,借着学期结束提出了离职。差不多同一时间,一家做医保反欺诈的创业公司(暂且称为 L 司)联系了我,并催生了一段非常不愉快的经历。</p><p>L 司的行径可以用坑蒙拐骗来概括(一定程度上也是当下创业浪潮的一个缩影)。先是 CTO 在最初跟我交谈时极度夸大公司的团队资源和已经积累的建模经验。他们号称坐拥的来自北美多家著名公司的资深机器学习专家,无一例外都没有脱离原公司,只是挂个名号并象征性地起一点顾问的作用,也看不出来能提供多有价值的领域知识。并且,我在加入公司后才发现,他们并没有积累简单的匹配规则以外的建模经验。这本身虽然不构成问题,但与最初的沟通是完全矛盾的。</p><p>更为夸张的是,公司在跟某甲方合作时,用其一贯的套路虚假宣传我是「海归的高材生」。并且这一点并没有事先知会我,所以我在被甲方问到时也是一脸懵逼。说到实际工作内容,甲方内部 IT 基建水平之低令人叹为观止,我还有幸目睹了两个技术负责人疯狂撕逼甩锅,仿佛是上了一堂社会实践课。总而言之,天真的我被听上去很 fancy 的工作内容所欺骗,在不到两周大开眼界的体验后愤然离职。</p><p>17 年新年过后,申请结果开始逐一揭晓。客观地说,我申请到的项目基本符合我的实力中容易量化和验证的那部分。尽管我很确定自己比很多拿到 MCDS 录取的人都要强,但这种预设了错位期望的选拔过程所导致的结果已经很难令我失望了。出于一直以来对 CMU 的向往,我打算在 MITS 和 SESV 中挑选一个。由于对自己的实力比较自信,也觉得选择 MITS 之后未来一年的生活方式会更接近自己的设想,我选择了去主校区度过最后一年的学生时光。</p><p>接完 offer 没多久,我开始了在 NVIDIA 的实习。说来惭愧,这段经历从初衷到执行到结果都非常的功利。我原本没预料到在 L 司的实习会结束得如此突然。离职后过了一段时间,我感觉是时候找点事情做了。与此同时,由于之后要在美国求职,就想着往简历上放一个 big name 公司会比较有帮助。于是在同学的介绍下,我去了 NVIDIA 上海的 Computing Architecture 组。</p><p>NVIDIA 的面试也是非常值得一提的。第一轮中,面试官先是让我写了一个裸的矩阵乘法,然后开始循循善诱让我思考如何通过对体系结构知识的理解来从代码细节的层面去优化算法。因为是接触得不多的领域,我在面试中紧张无比。但结束之后回想一下,由于面试官引导得足够好,我不仅答上来了大部分内容,还学到了许多新知识。而第二轮面试几乎是相同的套路,只是算法变成了求卷积。</p><p>在 NVIDIA 的工作比预期的要更为无聊和缺少反馈,实习体验在很大程度上取决于 mentor。我当时的任务是要魔改 Caffe 
写一个原型出来验证一个想法的可行性,但还没来得及做完 mentor 那边从另一个角度出发做的 benchmarking 就给这个想法判了死刑。这段经历中,我收获的仅仅只是 C++ 的熟练度以及对一个过气深度学习框架的了解。</p><p>在复旦的最后一个学期,我还做了两个比较值得一提的项目。第一个是给复旦艺术团写的票务管理系统。这是我第一次独立写出一个投入了实际生产的 Web 应用(一年多以后依然在使用)。这套用 Express 快糙猛地写出一个 MVP 的路子使我具备了基本的全栈能力。另一个项目则是在 GDM Lab 跟着肖老师做的毕业论文:在 Bing 搜索日志数据上做的基于 Seq2Seq 模型的搜索关键词生成。这是我作为一个声称做机器学习的人第一次从头到尾做完一个深度学习的项目,也是一次对 DL + NLP 的前沿进行跟进的经历。</p><h2 id="2017-2018"><a href="#2017-2018" class="headerlink" title="2017 - 2018"></a>2017 - 2018</h2><p>还没从复旦毕业的时候,我就开始远程上 CMU 的招牌课程 15213(CSAPP)了。尽管对课程内容已经不能再熟悉,但实际上起正版的 513 来,还是会为 CMU 的 IT 基建所折服。</p><p>CMU 是一所很独特的学校,周围的人大多有着一种我喜欢的质朴。在这里的时光被学期的开始和结束而清晰地分隔开来,因为每个学期上了什么课在很大程度上就定义了你的日常。MITS 这个奇葩项目需要在一些毫无用处的地方花费一些时间,在此略过不提。</p><p>第一个学期我主要上了三门 10 开头(Machine Learning Department)的课。其中 10-601 是入门级别的机器学习课程,乏善可陈。另外两门课都值得稍微一提。</p><p>Russ Salakhutdinov 教授开的 10-707 是入门级别的深度学习课程。这门课的作业需要从零实现一个深度学习框架并用其做一些探索性的实验,这类作业只要认真去写必然收获丰富。在 Project 中,我跟两个国人队友设计了一种全新的 Dropout 算法,并用非常不严谨的实验证明了它比原始的 Dropout 泛化能力更好,收敛速度也更快(强行无视了它的计算方式要复杂得多)。令人意外的是这篇报告居然获得了满分,这也让我对 TA 的水平感到怀疑。事实上,这门课是 CMU 很多课程都有的一个令人不悦的现象的代表:教授和 TA 团队有许多做得不好的地方,但他们似乎不愿意接受任何批评。学生里面也总是会跳出来一些和事佬,发表一些类似于「你们知道教授和 TA 们有多努力吗」之类的观点。诸如此类的现象让我对 MLD 的教学质量逐渐失望。</p><p>William Cohen 教授开的 10-805 叫做 ML with Large Datasets,是我在 CMU 上过最有收获的课。这门课的中心思想,也是第一节课的主旨,就是扩大数据集给模型性能带来的提升往往超越不同模型之间的性能差异,继而介绍了种种让模型更 scalable 的套路和技巧。Cohen 有着丰富的工业界经验,虽然他的 Slides 制作水平和授课水平都堪忧,但作业的设计和几个讲义的质量在 10 的课中算是上乘。从理性愉悦的层面来讲,我是非常喜欢像 lazy updates 这类工程上的实现技巧的。在 Project 中,我跟两个印度队友做了一个 VQA 相关的数据集论文,扩充了前一年 EMNLP 一篇文章构造的数据集,并非常扣题地展现了更大的数据集加上 scalable 的模型可以达到更好的性能。最终文章的质量其实相当不错,Cohen 给了 A+,还建议我们去投 NAACL,但由于两个队友没有意向所以作罢。</p><p>这个学期我还尝试找了一下工作,很尴尬并且意想不到的是,我只拿到了 Google 一家的面试并且还通过了。虽然就此失去了 compete offer 的机会,我还是选择了真香。面试之前我做了 50 题左右的 LeetCode 来刷熟练度,感受是这种愚蠢的应试模式实在令人厌恶。</p><p>第二个学期我主要上了 Distributed Systems(DS)和 Advanced Cloud Computing(ACC)。这个学期的工作量大了很多,平均每周有 3 个 due。DS 需要 debug 底层细节,并跟 Autolab 斗智斗勇;而 ACC 则是每次写作业都要开一个集群并小心翼翼地控制运行时间,的确比较催人脱发。通过这两门课,我头一回对分布式系统产生了兴趣,也或深或浅地掌握了其中比较关键的一些概念和技术。ACC 最后还做了一个优化集群资源调度器的 
Project,在写报告的时候我甚至产生了一种做科研的快感。</p><p>最后的暑假学期,我唯一的课业要求就是完成 Capstone Project。这可能是整个 MITS 项目体验最差的部分,其本质原因在于 ISR 的教职员工所相信的教育哲学和他们强加在项目上的一些观念实在过于奇葩。在受到种种限制的前提下去做一个自己并不认为有价值的项目已经足够令人沮丧了,还要带着四个水平比自己差一大截的队友(其中还有人品存疑的),我只能说当年没能申到更好的项目的恶果终究是要彰显的。</p><p>我不否认这三个月作为 Tech Lead + PM 的经历对我是有价值的。我在并不理想的环境下切实体会到了软件工程中老生常谈的几个难题,对技术管理有了初步理解,也感受到了同理心的重要性,掌握了一些沟通技巧。项目本身在技术层面也是我的一次成功的实验:我上手了新的前端框架 Vue 并独立开发了功能逻辑颇为复杂的前端,设计了一个模块化的系统架构并把 AWS 上常用的服务都踩了一遍坑,还尝试了没有用过的部署和运维工具。同一年前的抢票系统相比,这次的全栈体验更为完整,也更为现代。我当然也不会否认带着团队接连拿到 A 的满足感是确凿存在的。只是这段经历的意义更大程度上在于让我理解了为什么 Linus 需要通过语言侮辱来表达对代码质量的不满,为什么要远离有毒的人和社交关系。有些事情无药可救,也就不要勉强。</p><p>回过头来看在 CMU 的这一年,我最大的收获可能是发现了在机器学习以外还有分布式系统、软件工程等能给我带来理性愉悦的方向。我逐渐开始能够不带偏见地去欣赏特定技术方案背后的美,而这大概是一种在技术上成熟的体现。但更多地,我需要好好反思自己对时间的利用。这一年里,我的时间先后被学业、求职和发牢骚所主宰,所剩不多的空闲也被我拿来沉迷 DOTA2。从表面上看,我从 CMU 毕业,也拿到了 Google 的 offer,似乎正在向人生巅峰迈进。然而我内心却很清楚,这一年浪费的时间太多太多。我体会比较深的一点是,CMU 日常有丰富的学术讲座可供自由参加,其中有许多跨学科的课题对激发创意非常有帮助。而我却没有好好把握从而错过了很多开拓视野的机会。</p><p>好在这一年的最后,我终于摆脱了这种不断减速的进步状态。伴随着与几个志同道合的朋友的交流,我开始回顾初心,梳理已有的成果和信息,连点成线,开始展望和规划未来。</p><p>就用这篇略显流水的回忆录,纪念我人生中最后五年也是最重要的求学时光。希望未来的五年可以走更少的弯路,做更好的人。</p>]]></content>
<summary type="html">
<p><strong>本博客已经迁移到新域名 <a href="https://linghao.io" target="_blank" rel="noopener">linghao.io</a>。请前往新博客阅读本文:<a href="https://linghao.io/posts/five-year-summary-2013-2018/" target="_blank" rel="noopener">https://linghao.io/posts/five-year-summary-2013-2018/</a>。</strong></p>
<p>这是一篇对过去五年的忠实记录。有成功也有失败,有欢笑也有泪水。作为一篇很大程度上为自己而写的回忆录,倘若对你有些许启发,便是无憾。</p>
<p>本文谢绝转载。</p>
</summary>
<category term="Personal" scheme="http://dnc1994.com/categories/Personal/"/>
</entry>
<entry>
<title>[Notes] Steven Pinker - Linguistics, Style and Writing in the 21st Century</title>
<link href="http://dnc1994.com/2018/07/notes-steven-pinker-linguistics-style-writing/"/>
<id>http://dnc1994.com/2018/07/notes-steven-pinker-linguistics-style-writing/</id>
<published>2018-07-21T06:29:41.000Z</published>
<updated>2019-06-03T00:41:51.467Z</updated>
<content type="html"><![CDATA[<p><strong>This blog has been migrated to <a href="https://linghao.io" target="_blank" rel="noopener">linghao.io</a>. Read this post on my new blog: <a href="https://linghao.io/notes/steven-pinker-linguistics-style-writing/" target="_blank" rel="noopener">https://linghao.io/notes/steven-pinker-linguistics-style-writing/</a>.</strong></p><p>Check out this captivating and humorous lecture on <a href="https://www.youtube.com/watch?v=OV5J6BfToSw" target="_blank" rel="noopener">YouTube</a>.</p><a id="more"></a><h2 id="Part-I"><a href="#Part-I" class="headerlink" title="Part I"></a>Part I</h2><h3 id="Why-do-we-put-up-with-Legalese-Academese"><a href="#Why-do-we-put-up-with-Legalese-Academese" class="headerlink" title="Why do we put up with Legalese/Academese?"></a>Why do we put up with Legalese/Academese?</h3><h4 id="Theory-1-Bad-writing-is-a-deliberate-choice"><a href="#Theory-1-Bad-writing-is-a-deliberate-choice" class="headerlink" title="Theory 1: Bad writing is a deliberate choice"></a>Theory 1: Bad writing is a deliberate choice</h4><ul><li>Bureaucrats insist on gibberish to evade responsibility</li><li>Revenge of the nerds</li><li>Pseudo-intellectuals bamboozle their readers to hide the fact that they have nothing to say</li></ul><p>Not true in general: good people can write bad prose. E.g. scientists who have nothing to hide and no need to impress.</p><h4 id="Theory-2-Digital-Media-are-Ruining-the-Language"><a href="#Theory-2-Digital-Media-are-Ruining-the-Language" class="headerlink" title="Theory 2: Digital Media are Ruining the Language"></a>Theory 2: Digital Media are Ruining the Language</h4><p>Implication: it was better before the digital age. 
Which is not true.</p><h4 id="A-better-Theory"><a href="#A-better-Theory" class="headerlink" title="A better Theory"></a>A better Theory</h4><ul><li>Speech is instinctive, writing is and always has been hard – Charles Darwin</li><li>Readers are unknown, invisible, inscrutable<ul><li>Exist only in imagination</li><li>Can’t react, break in, ask for clarification</li></ul></li><li>Writing is an act of pretense</li><li>Writing is an act of craftsmanship</li></ul><h3 id="What-we-can-do-to-improve-the-craft-of-writing"><a href="#What-we-can-do-to-improve-the-craft-of-writing" class="headerlink" title="What we can do to improve the craft of writing?"></a>What we can do to improve the craft of writing?</h3><h4 id="The-Elements-of-Style"><a href="#The-Elements-of-Style" class="headerlink" title="The Elements of Style"></a>The Elements of Style</h4><ul><li>Good sense<ul><li>Use definite, specific, concrete language</li><li>Write with nouns and verbs</li><li>Put the emphatic words at the end</li><li>Omit needless words</li></ul></li><li>Bad sense<ul><li>Obsolete advice<ul><li>To finalize</li><li>To contact</li></ul></li><li>Baffling advice<ul><li>6 people - 5 people = 1 people</li><li>Clever person/horse</li></ul></li></ul></li></ul><h4 id="The-Problem-with-Traditional-Style-Advice"><a href="#The-Problem-with-Traditional-Style-Advice" class="headerlink" title="The Problem with Traditional Style Advice"></a>The Problem with Traditional Style Advice</h4><ul><li>Arbitrary list of dos and don’ts based on the tastes of the authors</li><li>No principled understanding of how language works</li><li>Users have no way of understanding and assimilating the advice</li><li>Much of the advice is just wrong</li></ul><h4 id="Why-we-can-do-better"><a href="#Why-we-can-do-better" class="headerlink" title="Why we can do better"></a>Why we can do better</h4><ul><li>Base advice on the science & scholarship<ul><li>Modern grammatical theory</li><li>Evidence-based dictionaries</li><li>Cognitive 
science</li><li>Historical & critical study of usage</li></ul></li></ul><h4 id="A-Model-of-Prose"><a href="#A-Model-of-Prose" class="headerlink" title="A Model of Prose"></a>A Model of Prose</h4><ul><li>Writing is an unnatural act</li><li>Good style requires a coherent mental model of the communication scenario<ul><li>How the writer imagines the reader</li><li>What the writer is trying to accomplish</li></ul></li><li>Classic style - Francis-Noël Thomas & Mark Turner</li></ul><h3 id="Classic-Style"><a href="#Classic-Style" class="headerlink" title="Classic Style"></a>Classic Style</h3><ul><li>Prose as a window on to the world<ul><li>The writer has seen something in the world</li><li>He positions the reader so she can see it with her own eyes</li></ul></li><li>The reader and writer are equals</li><li>The goal is to help the reader see objective reality</li><li>The style is conversation</li></ul><h4 id="Non-Classic-Styles"><a href="#Non-Classic-Styles" class="headerlink" title="Non-Classic Styles"></a>Non-Classic Styles</h4><ul><li>Contemplative Style; Oracular Style; Practical Style</li><li>Academics typically write in Postmodern/Self-conscious style<ul><li>The writer’s chief, if unstated, concern is to escape being convicted of philosophical naiveté about his own enterprise</li></ul></li></ul><h4 id="Classic-Style-Example-from-Brian-Greene"><a href="#Classic-Style-Example-from-Brian-Greene" class="headerlink" title="Classic Style Example from Brian Greene"></a>Classic Style Example from <a href="https://www.newsweek.com/brian-greene-welcome-multiverse-64887" target="_blank" rel="noopener">Brian Greene</a></h4><ul><li>Universe expanding as a mental movie that can be run backwards</li><li>Abstruse mathematical notion of equations breaking down explained by “similar to the error message returned by a calculator when you try to divide 1 by 0”</li></ul><h4 id="Classic-Prose-cont"><a href="#Classic-Prose-cont" class="headerlink" title="Classic Prose, cont."></a>Classic 
Prose, cont.</h4><ol><li>The focus is on <em>the thing being shown</em>, not on <em>the activity of studying it</em><ul><li>Not classic: <em>In recent years, an increasing number of researchers have turned their attention to the problem of child language acquisition. In this article, recent theories of this process will be reviewed.</em></li><li>Classic: <em>All children acquire the ability to speak and understand a language without explicit lessons. How do they accomplish this feat?</em></li><li>Corollary 1: minimize apologizing. E.g. <em>The problem of language acquisition is extremely complex. It is difficult to give precise definitions of the concept of ‘language’ and the concept of ‘acquisition’ and the concept of ‘children’. There is much uncertainty about the interpretation of experimental data and a great deal of controversy surrounding the theories. More research needs to be done.</em><ul><li>Classic prose gives the reader credit for knowing that many concepts are hard to define, many controversies hard to resolve</li><li>The reader is there to see what the writer will do about it</li></ul></li><li>Corollary 2: minimize hedging.<ul><li><strong>Somewhat, fairly, rather, nearly, relatively…</strong></li><li>Shudder quotes: <em>She is a <strong>“quick study”</strong> and has been able to educate herself in <strong>virtually</strong> any area that interests her.</em></li><li>Classic prose: better to be clear & possibly wrong than muddy and “not even wrong”</li><li>Also count on the cooperative nature of ordinary conversation. E.g. <strong>Americans have been getting fatter.</strong></li></ul></li><li><strong>Professional Narcissism</strong></li></ul></li></ol><ol start="2"><li><p>Keep up the illusion that the reader is seeing a world rather than listening to verbiage</p><ul><li>Avoid clichés like the plague. E.g. 
<em>We needed to think outside the box in our search for the holy grail, but found that it was neither a magic bullet nor a slam dunk, so we rolled with the punches and let the chips fall where they may while seeing the glass as half-full – it’s a no-brainer!</em></li><li>Mixed metaphors<ul><li><em>Jeff is a renaissance man, drilling down to the core issues and pushing the envelope.</em></li><li><em>No one has yet invented a condom that will knock people’s socks off.</em></li></ul></li><li>A.W.F.U.L. (Americans Who Figuratively Use <em>Literally</em>)<ul><li>√ <em>She literally blushed.</em></li><li>× <em>She literally exploded.</em></li></ul></li></ul></li><li><p>Classic prose is about the <em>world</em>, not about the <em>conceptual tools</em> with which we understand the world</p><ul><li>Avoids metaconcepts: <em>approach, assumption, concept, condition, context, framework, issue, level, model, paradigm, perspective, process, role, strategy, tendency, variable</em><ul><li><em>I have serious doubts that trying to amend the Constitution would work on an actual <strong>level</strong>. 
On the aspirational <strong>level</strong>, however, a constitutional amendment <strong>strategy</strong> may be more valuable.</em> == I doubt that trying to amend the Constitution would actually succeed, but it may be valuable to aspire to it.</li><li><em>It is important to approach this <strong>subject</strong> from a variety of <strong>strategies</strong>, including mental health <strong>assistance</strong> but also from a law enforcement <strong>perspective</strong>.</em> == We should consult a psychiatrist about this man, but we may also have to inform the police.</li></ul></li></ul></li></ol><ol start="4"><li>Classic prose <em>narrates ongoing events</em><ul><li>We see agents performing actions that affect objects</li><li>Non-classic prose <em>thingifies</em> events and then <em>refers</em> to them<ul><li>Nominalization (a dangerous tool of English grammar)<ul><li>Appear -> make an appearance</li><li>Organize -> bring about the organization of</li></ul></li><li>“Zombie nouns” (Helen Sword)</li><li><em>Participants read <strong>assertions</strong> whose <strong>veracity</strong> was either affirmed or denied by the subsequent <strong>presentation</strong> of an <strong>assessment</strong> word.</em> == The people saw sentences, each followed by the word TRUE or FALSE.</li><li><em>Subjects were tested under <strong>conditions</strong> of good to excellent acoustic <strong>isolation</strong>.</em> == We tested the students in a quiet room.</li><li><em>Right now there is not any <strong>anticipation</strong> there will be a <strong>cancellation</strong>.</em> == Right now we don’t anticipate that we will have to cancel it.</li><li><em>The President is <strong>desirous</strong> of trying to see how we can make our best <strong>efforts</strong> in order to find a <strong>way</strong> to facilitate.</em> == The President wants to help.</li><li><em>I’m a digital and social-media strategist. 
I deliver <strong>programs</strong>, <strong>products</strong> and <strong>strategies</strong> to our corporate clients across the <strong>spectrum</strong> of <strong>communications functions</strong>.</em> == I teach big companies how to use Facebook.</li><li><em>Mild <strong>exposure</strong> to CO can result in accumulated <strong>damage</strong> over time. Extreme <strong>exposure</strong> to CO may rapidly be fatal without producing significant warning symptoms.</em> == Using a generator indoors CAN KILL YOU IN MINUTES.</li></ul></li></ul></li></ol><h2 id="Part-II-How-Understanding-the-Design-of-Language-Can-Lead-to-Better-Writing-Advice"><a href="#Part-II-How-Understanding-the-Design-of-Language-Can-Lead-to-Better-Writing-Advice" class="headerlink" title="Part II: How Understanding the Design of Language Can Lead to Better Writing Advice"></a>Part II: How Understanding the Design of Language Can Lead to Better Writing Advice</h2><h3 id="Another-Contributor-to-Zombie-Prose-The-Passive-Voice"><a href="#Another-Contributor-to-Zombie-Prose-The-Passive-Voice" class="headerlink" title="Another Contributor to Zombie Prose: The Passive Voice"></a>Another Contributor to Zombie Prose: The Passive Voice</h3><ul><li>Overused<ul><li><em>On the basis of the analysis which was <strong>made</strong> of the data which were <strong>collected</strong>, it is <strong>suggested</strong> that the null hypothesis can be <strong>rejected</strong>.</em></li><li><em>If the outstanding balance is <strong>prepaid</strong> in full, the <strong>unearned</strong> finance charge will be <strong>refunded</strong>.</em></li><li><em>Mistakes were <strong>made</strong>.</em></li></ul></li><li>The design of language<ul><li>Language is an app for converting a <em>web of thoughts</em> into a <em>string of words</em></li><li>The order of words in a sentence has to do two things at once<ul><li>Serve as code for meaning (who did what to whom)</li><li>Present some bits of information to the reader 
before others (affects how the information is absorbed)<ul><li>Early material in the sentence == Topic</li><li>Later material == Focal point</li><li>Prose that violates these principles feels choppy, disjointed, incoherent </li></ul></li><li>The passive is a workaround for this inherent design limitation of language<ul><li>Allows writers to convey same ideas in different order</li></ul></li></ul></li></ul></li><li>“Avoid the passive” is <em>bad advice</em></li><li>The passive is the better construction when the done-to is currently the target of the reader’s mental gaze<ul><li>Better: <em>A messenger arrives from Corinth. It emerges that he was formerly a shepherd on Mt. Kithaeron, and during that time <strong>he was given</strong> a baby. The baby, he says, <strong>was given</strong> to him by another shepherd from the Laius household, who <strong>had been told</strong> to get rid of the child.</em></li><li>Worse: <em>A messenger arrives from Corinth. It emerges that he was formerly a shepherd on Mt. Kithaeron, and during that time someone <strong>gave</strong> him a baby. Another shepherd from the Laius household, he says, whom someone <strong>had told</strong> to get rid of a child, <strong>gave</strong> the baby to him.</em></li></ul></li><li>English syntax provides writers with constructions that vary order in the string while preserving meaning. 
Writers must choose the construction that introduces ideas to the reader in the order in which she can absorb them</li></ul><h3 id="Why-is-the-Passive-So-Common-in-Bad-Writing"><a href="#Why-is-the-Passive-So-Common-in-Bad-Writing" class="headerlink" title="Why is the Passive So Common in Bad Writing?"></a>Why is the Passive So Common in Bad Writing?</h3><ul><li>Good writers narrate a story, advanced by protagonists who make things happen</li><li>Bad writers work backwards from their own knowledge, writing down ideas in the order in which they occur to them</li><li>They begin with the <em>outcome</em> of an event, and then throw in the <em>cause</em> as an afterthought.</li><li>The passive makes that easy</li></ul><h2 id="Part-III-Why-Is-It-So-Hard-for-Writers-to-Use-Language-to-Convey-Ideas-Effectively"><a href="#Part-III-Why-Is-It-So-Hard-for-Writers-to-Use-Language-to-Convey-Ideas-Effectively" class="headerlink" title="Part III: Why Is It So Hard for Writers to Use Language to Convey Ideas Effectively?"></a>Part III: Why Is It So <em>Hard</em> for Writers to Use Language to Convey Ideas Effectively?</h2><h3 id="The-Curse-of-Knowledge"><a href="#The-Curse-of-Knowledge" class="headerlink" title="The Curse of Knowledge"></a>The Curse of Knowledge</h3><ul><li>When you know something, it’s hard to imagine what it is like for someone else <em>not</em> to know it</li><li>AKA mindblindness, egocentrism, hindsight bias</li><li>The M&Ms study: the child cannot recover the innocent state in which he once did not know it</li><li>Studies have shown a similar effect in adults</li><li>The chief contributor to opaque writing<ul><li>Doesn’t occur to the writer that readers<ul><li>Haven’t learned their jargon</li><li>Don’t know the intermediate steps that seem too obvious to mention</li><li>Can’t visualize a sentence currently in the writer’s mind’s eye</li></ul></li><li>So the writer doesn’t bother to<ul><li>Explain the jargon</li><li>Spell out the logic</li><li>Supply the concrete 
details</li></ul></li><li><em>Even</em> when writing for professional peers</li></ul></li></ul><h3 id="How-to-Exorcise-the-curse-of-knowledge"><a href="#How-to-Exorcise-the-curse-of-knowledge" class="headerlink" title="How to Exorcise the curse of knowledge"></a>How to Exorcise the curse of knowledge</h3><ul><li>Keep in mind “the reader over your shoulder”</li><li>The problem: we’re not very good at guessing other people’s knowledge even when we try</li><li>Better solutions<ul><li>Show a draft to a representative reader</li><li>Show a draft to <em>yourself</em> after some time has passed</li><li>Rewrite with a single goal: making the prose understandable to the reader</li></ul></li></ul><h2 id="Part-IV-How-Should-We-Think-About-Correct-Usage"><a href="#Part-IV-How-Should-We-Think-About-Correct-Usage" class="headerlink" title="Part IV: How Should We Think About Correct Usage"></a>Part IV: How Should We Think About Correct Usage</h2><ul><li>Some Usages are Clearly Wrong</li><li>Others are Not So Clear<ul><li>Between you and I error</li><li>Singular “they” error</li><li>Split infinitive: <em>To boldly go where no man has gone before</em></li><li>Preposition at the end of sentence</li><li>Dangling participle: <em>Checking into the hotel, it was nice to see a few of my old classmates in the lobby.</em></li></ul></li></ul><h3 id="The-Language-War"><a href="#The-Language-War" class="headerlink" title="The Language War"></a>The Language War</h3><ul><li>Prescriptivists: Prescribe how people <em>ought to</em> speak & write</li><li>Descriptivists: Describe how people <em>do</em> speak & write</li><li>Conclusion: We need a more sophisticated way of thinking about usage</li></ul><h3 id="What-Are-Rules-of-Usage"><a href="#What-Are-Rules-of-Usage" class="headerlink" title="What Are Rules of Usage"></a>What <em>Are</em> Rules of Usage</h3><ul><li>Not logical truths</li><li><em>Not</em> officially regulated by dictionaries</li><li><strong>Tacit, evolving 
conventions</strong><ul><li>Tacit: Emerges as a rough consensus within a community of careful writers, without explicit deliberation, agreement, or legislation</li><li>Evolving: The consensus may change over time</li></ul></li></ul><h3 id="Should-Writer-Follow-the-Rules"><a href="#Should-Writer-Follow-the-Rules" class="headerlink" title="Should Writer Follow the Rules?"></a>Should Writers Follow the Rules?</h3><ul><li>It depends</li><li>Some rules just extend the logic of everyday grammar to more complicated cases</li><li>Some make important semantic distinctions<ul><li><em>full</em> vs. <em>fulsome</em></li><li><em>simple</em> vs. <em>simplistic</em></li><li><em>meritorious</em> vs. <em>meretricious</em></li></ul></li><li>Not every lesson is a legitimate rule of usage</li><li>Many supposed rules of usage<ul><li>Violate the grammatical logic of English</li><li>Are routinely flouted by the best writers</li><li>Have <em>always</em> been flouted by the best writers<ul><li>Singular <em>they</em>: Jane Austen</li><li>Sentence-final preposition: Shakespeare</li></ul></li></ul></li></ul><h3 id="How-Should-A-Careful-Writer-Distinguish-Legitimate-Rules-of-Usage-from-Bogus-Ones"><a href="#How-Should-A-Careful-Writer-Distinguish-Legitimate-Rules-of-Usage-from-Bogus-Ones" class="headerlink" title="How Should A Careful Writer Distinguish Legitimate Rules of Usage from Bogus Ones"></a>How Should A Careful Writer Distinguish Legitimate Rules of Usage from Bogus Ones</h3><blockquote><p>It’s all right to split an infinitive in the interest of clarity. Since clarity is the usual reason for splitting, this advice means merely that you can split them whenever you need to. 
– Merriam-Webster Unabridged</p></blockquote><blockquote><p>There is no grammatical basis for rejecting split infinitives – Encarta World English Dictionary</p></blockquote><blockquote><p>…</p></blockquote><ul><li>Modern dictionaries & style manuals do <em>not</em> ratify pet peeves, grammatical folklore, or bogus rules</li><li>Usage advice is based on <em>evidence</em><ul><li>Practices of contemporary good writers</li><li>Practices of best writers in the past</li><li>Polling data from a panel of writers (for contested cases)</li><li>Effects on clarity</li><li>Consistency with the grammatical logic of English</li></ul></li><li>Correct usage should be kept in perspective<ul><li>The <em>least important</em> part of good writing</li><li>Far less important than<ul><li>Classic style</li><li>Coherent ordering of ideas</li><li>Overcoming the curse of knowledge</li><li>Factual diligence</li><li>Sound argumentation</li></ul></li><li>Even the most irksome errors are not signs of the decline of the language</li></ul></li></ul><h2 id="Summary"><a href="#Summary" class="headerlink" title="Summary"></a>Summary</h2><p>Modern linguistics & cognitive science provide better ways of enhancing our writing:</p><ul><li>A model of prose communication<ul><li>Classic style: Language as a window on the world</li></ul></li><li>An understanding of the way language works<ul><li>The Web of Thoughts -> A String of Words </li></ul></li><li>A diagnosis of why good prose is so hard to write<ul><li>The Curse of Knowledge</li></ul></li><li>A way to make sense of rules of correct usage<ul><li>Tacit, evolving conventions</li></ul></li></ul>]]></content>
<summary type="html">
<p><strong>This blog has been migrated to <a href="https://linghao.io" target="_blank" rel="noopener">linghao.io</a>. Read this post on my new blog: <a href="https://linghao.io/notes/steven-pinker-linguistics-style-writing/" target="_blank" rel="noopener">https://linghao.io/notes/steven-pinker-linguistics-style-writing/</a>.</strong></p>
<p>Check out this captivating and humorous lecture on <a href="https://www.youtube.com/watch?v=OV5J6BfToSw" target="_blank" rel="noopener">YouTube</a>.</p>
</summary>
<category term="Notes" scheme="http://dnc1994.com/categories/Notes/"/>
</entry>
<entry>
<title>[Notes] Programming Beyond Practices</title>
<link href="http://dnc1994.com/2018/07/notes-programming-beyond-practices/"/>
<id>http://dnc1994.com/2018/07/notes-programming-beyond-practices/</id>
<published>2018-07-05T09:18:16.000Z</published>
<updated>2019-06-03T00:41:56.865Z</updated>
<content type="html"><![CDATA[<p><strong>This blog has been migrated to <a href="https://linghao.io" target="_blank" rel="noopener">linghao.io</a>. Read this post on my new blog: <a href="https://linghao.io/notes/programming-beyond-practices/" target="_blank" rel="noopener">https://linghao.io/notes/programming-beyond-practices/</a>.</strong></p><p><a href="http://shop.oreilly.com/product/0636920047391.do" target="_blank" rel="noopener"><em>Programming Beyond Practices</em></a> by Gregory Brown is one of the most insightful books I’ve read about software engineering. The book consists of eight chapter-length stories that present a series of interesting topics that go beyond the scope of programming alone.</p><p>This post is a refined version of the notes I took while reading this book.</p><a id="more"></a><h2 id="Chapter-1-Using-Prototypes-to-Explore-Project-Ideas"><a href="#Chapter-1-Using-Prototypes-to-Explore-Project-Ideas" class="headerlink" title="Chapter 1: Using Prototypes to Explore Project Ideas"></a>Chapter 1: Using Prototypes to Explore Project Ideas</h2><p>This chapter discusses how exploratory programming techniques can be used to build and ship a meaningful proof of concept for a product idea within hours after development begins.</p><ul><li><p>Getting ideas out of a client’s head as quickly as you can</p><ul><li><p>Conversations and wireframes (rough sketches) can be useful, but exploratory programming soon follows.</p></li><li><p>By getting working software into the mix early in the process, product design becomes an interactive collaboration and thus creates faster feedback loops.</p></li></ul></li><li><p>Start by understanding the needs behind the project</p><ul><li><p>Ask questions that reveal the goals of the people involved to both validate your assumptions and get more context on how other people see the problem.</p></li><li><p>Use wireframes to clearly communicate the basic structure of an application without getting bogged down in style 
details. The goal is to set expectations about functionality.</p></li></ul></li><li><p>Set up a live test system as soon as you start coding</p><ul><li><p>The initial setup doesn’t need to be production-ready. It just needs to be suitable for collecting useful feedback.</p></li><li><p>To speed things up, massive underinvestment is the name of the game here and it requires skill to pull it off.</p></li><li><p>It’s important to keep yourself from overthinking, because not even the simplest things survive first contact with the customer.</p></li></ul></li><li><p>Discuss all defects, but be pragmatic about repairs</p><ul><li>It’s impossible to entirely guard yourself from making mistakes, but how you respond to them is critical.</li></ul></li></ul><ul><li><p>Prototyping is about exploring a problem space, not building a finished product.</p><ul><li><p>Focus on the risky or unknown parts of your work.</p></li><li><p>Check your assumptions early and often.</p></li><li><p>Limit the scope of your work as much as possible.</p></li><li><p>Remember that prototypes are not production systems. Make notes about any simplifying assumptions you made for maintainability.</p></li><li><p>Make WIP features a bit rough on purpose to prevent others from thinking they’re ready to ship.</p></li></ul></li><li><p>Design features that make collecting feedback easy</p><ul><li><p>Figure out the right balance of when to play fast and loose and when to tighten things up.</p></li><li><p>E.g. In a simple video recommendation system based on tags (artist, genre, etc), it pays to display a sidebar that lists tags and their scores. Make the tags clickable so that the user can interact with them to figure out how watching videos with a certain tag influences its score and changes the recommendation behavior.</p></li><li><p>E.g. 
Build an “import from CSV” feature so that the user can explore more with their custom data.</p></li></ul></li><li><p>Encountering issues during prototyping is not a sign of a flawed process. It’s exactly what you should expect as a side effect of accelerated feedback loops. While they help you figure out how to build things faster, they also help you fail faster.</p></li></ul><h2 id="Chapter-2-Spotting-Hidden-Dependencies-in-Incremental-Changes"><a href="#Chapter-2-Spotting-Hidden-Dependencies-in-Incremental-Changes" class="headerlink" title="Chapter 2: Spotting Hidden Dependencies in Incremental Changes"></a>Chapter 2: Spotting Hidden Dependencies in Incremental Changes</h2><p>This chapter discusses issues that can crop up whenever a production codebase is gradually extended to fit a new purpose.</p><ul><li><p>There’s no such thing as a standalone feature</p><ul><li><p>Don’t assume that a change is backward-compatible or safe just because it doesn’t explicitly modify existing features.</p></li><li><p>Pay attention to shared resources that live outside the codebase: storage mechanisms, processing capacity, databases, external services, libraries, user interfaces, etc. These tools form a “hidden dependency web” that can propagate side effects and failures between seemingly unrelated application features.</p></li><li><p>E.g. Suppose two seemingly independent Wiki sites share the same storage infrastructure. The first Wiki is only used by a few trusted admins, while the second Wiki is newly developed and made open to the public. An attack on the public Wiki could easily bring down the admin Wiki. This is an example of infrastructure-level dependency. </p></li><li><p>Use constraints and validations to help prevent local failures from causing global side effects. Also make sure to have good monitoring in place so that unexpected system failures are quickly noticed and dealt with.</p></li><li><p>E.g. 
Limit the number of pages allowed to be created on the public wiki.</p></li><li><p>E.g. Add monitoring to track Wiki page creation / deletion / editing and set up alerts for those events.</p></li></ul></li><li><p>If two features share a screen, they depend on each other</p><ul><li>E.g. Adding a sidebar with flexible width to a Wiki page could leave the actual content stuffed into a tiny column when the title of the sidebar becomes too long.</li></ul></li></ul><ul><li><p>Avoid non-essential real-time data synchronization</p><ul><li>E.g. Suppose that you need to display the 5 most visited Wiki pages. Querying those in real-time is wasteful. And external service integrations are often full of headaches. Instead, you can write a script that periodically calls the analytics API and updates the 5 pages in a caching layer. By doing so, you sidestep the need to add further configuration information or libraries to the main web app.</li></ul></li></ul><ul><li><p>Look for problems when code is reused in a new context</p><ul><li><p>Watch out for context switches when reusing existing tools and resources. Any changes in scale, performance expectations, or privacy levels can lead to dangerous problems.</p></li><li><p>Focusing on superficial similarities of different use cases rather than their fundamental differences can cloud your judgement.</p></li><li><p>E.g. The admin Wiki uses a Markdown processor to render the page content, where editors are all trusted. 
But it’s certainly not safe for use by random individuals on the Internet.</p></li></ul></li></ul><h2 id="Chapter-3-Identifying-the-Pain-Points-of-Service-Integrations"><a href="#Chapter-3-Identifying-the-Pain-Points-of-Service-Integrations" class="headerlink" title="Chapter 3: Identifying the Pain Points of Service Integrations"></a>Chapter 3: Identifying the Pain Points of Service Integrations</h2><p>This chapter discusses some of the various ways that third-party systems can cause failures, as well as how flawed thinking about service integrations can lead to bad decision making.</p><ul><li><p>Plan for trouble when your needs are off the beaten path</p><ul><li><p>In theory, we should be approaching every third-party system with distrust until it is proven to be reliable. In practice, time and money constraints often cause us to drive faster than our headlights can see.</p></li><li><p>Be cautious when depending on an external service for something other than what it is well known for. If you can’t find many examples of others successfully using a service to solve similar problems to the ones you have, it is a sign that it may be at best unproven and at worst unsuitable for your needs.</p></li><li><p>Conducting smoke tests in such scenarios can help.</p></li></ul></li><li><p>Remember that external services might change or die</p><ul><li>The key difference between libraries and services: a library can only cause breaking changes if your codebase or supporting infrastructure is modified, but an external service can break or change behavior at any point in time as it involves a remote system that isn’t under your control.</li></ul></li></ul><ul><li><p>Look for outdated mocks in tests when services change</p><ul><li><p>The presence of decent test coverage can create an illusion of safety.</p></li><li><p>It’s possible that changes to a service dependency will render a mock object in tests outdated. 
When that happens, the test results can be misleading.</p></li><li><p>Make sure that at least some of your tests run against live APIs of the services you depend upon.</p></li></ul></li><li><p>Expect maintenance headaches from poorly coded robots</p><ul><li>You don’t just need to worry about your own integrations; it’s also essential to pay attention to the uninvited guests who integrate with you.</li></ul></li></ul><ul><li><p>Remember that there are no purely internal concerns</p><ul><li><p>A process can be well-intentioned but misguided or ill-executed.</p></li><li><p>Use every code review as an opportunity for a mini-audit of service dependencies — to evaluate testing strategy, to think through how failures will be handled, or to guard against misuse of resources.</p></li></ul></li></ul><h2 id="Chapter-4-Developing-a-Rigorous-Apporach-Toward-Problem-Solving"><a href="#Chapter-4-Developing-a-Rigorous-Apporach-Toward-Problem-Solving" class="headerlink" title="Chapter 4: Developing a Rigorous Apporach Toward Problem Solving"></a>Chapter 4: Developing a Rigorous Approach Toward Problem Solving</h2><p>This chapter discusses several straightforward tactics for breaking down and solving challenging problems in a methodical fashion. You should go read the book if you’re interested in the specific puzzle used for illustration.</p><ul><li><p>Puzzles are awkward to use for developing practical coding skills but perfect for exploring general problem solving techniques.</p></li><li><p>Begin by gathering the facts and stating them plainly</p><ul><li>The raw materials of a problem description are often a scattered array of prose, examples, and reference materials. 
Make sense of it all by writing your own notes, then strip away noise until you are left with just the essential details.</li></ul></li></ul><ul><li><p>Work part of the problem by hand before writing code</p><ul><li><p>Behind each new problem that you encounter, there is a collection of simple sub-problems that you already know how to solve. Keep breaking things down into chunks until you start to recognize what the pieces are made of.</p></li><li><p>Challenging problems are made up of many moving parts. To see how they fit together without getting bogged down in implementation details, work through partial solutions on paper before you begin writing code.</p></li></ul></li><li><p>Validate your input data before attempting to process it</p><ul><li>A valid set of rules operating on an invalid dataset can produce confusing results that are difficult to debug. Avoid the “garbage in, garbage out” effect by validating any source data before processing it.</li></ul></li></ul><ul><li><p>Make use of deductive reasoning to check your work</p></li><li><p>Solve simple problems to understand more difficult ones</p></li></ul><h2 id="Chapter-5-Design-Software-from-the-Bottom-Up"><a href="#Chapter-5-Design-Software-from-the-Bottom-Up" class="headerlink" title="Chapter 5: Design Software from the Bottom Up"></a>Chapter 5: Design Software from the Bottom Up</h2><p>This chapter discusses a step-by-step approach to bottom-up software design, and examines the tradeoffs of this way of working. The example here is a <a href="https://en.wikipedia.org/wiki/Just-in-time_manufacturing" target="_blank" rel="noopener">just-in-time production workflow</a>. You should go read the book if you’re interested in the details of the whole approach.</p><ul><li><p>Identify the nouns and verbs of your problem space</p><ul><li>List a handful of important nouns and verbs in the problem space. Then look for the shortest meaningful sentence that you can construct from the words on that list. 
Use that as the guiding theme for the first feature you implement.</li></ul></li></ul><ul><li><p>Begin by implementing a minimal slice of functionality</p></li><li><p>Avoid unnecessary temporal coupling between objects</p><ul><li>As you continue to add new functionality into your project, pay attention to the connections between objects. Favor designs that are flexible when it comes to both quantities and timing so that individual objects don’t impose artificial constraints on their collaborators.</li></ul></li></ul><ul><li><p>Gradually extract reusable parts and protocols</p><ul><li>When extracting reusable objects and functions, look for fundamental building blocks that are unlikely to change much over time, rather than looking for superficial ways to reduce duplication of boilerplate code.</li></ul></li></ul><ul><li><p>Experiment freely to discover hidden abstractions</p><ul><li>Deferred decision making is an important part of bottom-up design. You will come to discover missing abstractions in the system as you proceed.</li></ul></li></ul><ul><li><p>Know where the bottom-up approach breaks down</p><ul><li><p>Take advantage of the emergent features that can arise when you reuse your basic building blocks to solve new problems. But watch out for excess complexity in the glue code between objects: messy integration points are a telltale sign that a bottom-up design style is being stretched beyond its comfort zone.</p></li><li><p>Top-down design and bottom-up design work like a spiral. Bottom-up design is good for exploring new areas and keeps things simple as you get some software up and running. 
When you hit dead ends or rough patches, top-down mode is useful for considering the bigger picture and how to unify the connections between things.</p></li></ul></li></ul><h2 id="Chapter-6-Data-Modeling-in-an-Imperfect-World"><a href="#Chapter-6-Data-Modeling-in-an-Imperfect-World" class="headerlink" title="Chapter 6: Data Modeling in an Imperfect World"></a>Chapter 6: Data Modeling in an Imperfect World</h2><p>This chapter discusses how small adjustments to the basic building blocks of a data model can fundamentally change how people interact with a system for the better.</p><ul><li><p>Decouple conceptual modeling from physical modeling</p><ul><li><p>In a system with messy data sources, it’s often better to preserve data in its raw form rather than attempting to transform it immediately into structures that closely map to domain-specific concepts. This way you get to keep some degree of flexibility by not imposing too much structure at the physical data modeling level. You can always process raw data into whatever form you’d like, but extracting that same information from complex models can be needlessly complicated.</p></li><li><p>E.g. In a time-tracking application, don’t model a work session as a pair of punches. Instead, only record the raw punch data and use them to infer work sessions on the fly as needed.</p></li></ul></li><li><p>Design an explicit model for tracking data changes</p><ul><li><p>Reduce incidental complexity as much as possible by minimizing mutable state.</p></li><li><p>As you develop a data model, think through the ways data will be presented, queried, and modified over time.</p></li><li><p>Make it easy to preview, annotate, approve, audit, and revert transactional data changes in a human-friendly way. 
Implementing this type of workflow involves writing custom code rather than relying on pre-built libraries.</p></li><li><p>Applying the event sourcing pattern in your data models can help simplify things.</p></li></ul></li><li><p>Understand how Conway’s Law influences data management practices</p><ul><li><p>Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations – Melvin Conway.</p></li><li><p>Design data management workflows that respect and support the organizational culture of the people using your software. Otherwise the system could be crushed under the weight of a thousand workarounds.</p></li></ul></li><li><p>Remember that workflow design and data modeling go hand in hand</p></li></ul><h2 id="Chapter-7-Gradual-Process-Improvement-as-an-Antidote-for-Overcommitment"><a href="#Chapter-7-Gradual-Process-Improvement-as-an-Antidote-for-Overcommitment" class="headerlink" title="Chapter 7: Gradual Process Improvement as an Antidote for Overcommitment"></a>Chapter 7: Gradual Process Improvement as an Antidote for Overcommitment</h2><p>This chapter discusses some common anti-patterns that lead to struggles in software project management, and how incremental process improvements at all levels can alleviate some of those pains.</p><ul><li><p>Respond to unexpected failures with swiftness and safety</p><ul><li>When dealing with system-wide outages, disable or degrade features as needed to get your software back to a usable state as quickly as possible. Proper repairs to the broken parts can come later, once the immediate pressure has been relieved.</li></ul></li></ul><ul><li><p>Identify and analyze operational bottlenecks</p></li><li><p>Pay attention to the economic tradeoffs of your work</p><ul><li><p>Look for areas where you are overcommitted and constrain them with reasonable budgets, so that you can free up time to spend on other work. 
Don’t rely solely on intuition for these decisions; use “back of the napkin” calculations to consider the economics of things as well.</p></li><li><p>E.g. Integrations with a long list of advertising services take up a large share of development time. Yet the data shows that the top 8 integrations already cover over 80% of growth.</p></li><li><p>Set a fixed limit on capacity (say 20%) that can be allocated to work on integrations each month. Simple time budgeting is a powerful tool for limiting the impact of high-risk areas of a project. It also encourages more careful prioritization and cost-benefit analysis.</p></li><li><p>Audit less popular integrations and decide what level of support to offer.</p></li></ul></li><li><p>Reduce waste by limiting work in progress</p><ul><li><p>Removing one bottleneck in a process will naturally cause another one to become visible.</p></li><li><p>Don’t get obsessed with catching up with a roadmap that was created before you ran into growing pains, as doing so could throw you back into a tough spot.</p></li><li><p>Remember that unshipped code is not an asset; it’s perishable inventory with a cost of carry.</p></li><li><p>Help everyone in your projects understand this by focusing on what valuable work gets shipped in a given time period, rather than trying to make sure each person on the team stays busy.</p></li></ul></li><li><p>Make the whole greater than the sum of its parts</p><ul><li>When collaborating with someone who works in a different role than your own, try to communicate in ways they can relate to.</li></ul></li></ul><h2 id="Chapter-8-The-Future-of-Software-Development"><a href="#Chapter-8-The-Future-of-Software-Development" class="headerlink" title="Chapter 8: The Future of Software Development"></a>Chapter 8: The Future of Software Development</h2><p>This chapter imagines what programming might look like in a future where technology has advanced to such an extent that we could focus purely on
problem solving and communication rather than writing code.</p><p>There aren’t any takeaways in this chapter. You should go read the book if you’re interested in the author’s depiction of the future. Instead, I’d like to quote some of the author’s closing remarks to end this post.</p><blockquote><p>Although the field of software development has a long way to go, I do expect things will get better in the years to come. It’s true that some folks among us are here solely for the tools, the code, the intellectual challenge of it all. But for the rest of us, that’s a matter of necessity and the environment we work in, not a defining characteristic of who we are.</p><p>My fundamental belief is that programmers are no less concerned for human interests than anyone else in the world; it’s just hard to make that your main focus in life when you spend a good portion of your day chasing down a missing semicolon, reading source code for an undocumented library, or staring at a binary dump of some text that you suspect has been corrupted by a botched Unicode conversion.
</p><p>And my great hope is that if we fight against the influence of our rough, low-level, tedious tools and gradually replace them with things that make us feel closer to the outcome of our work, then our tech-centric industry focus will shift sharply and permanently to a humancentric outlook.</p></blockquote><h2 id="References"><a href="#References" class="headerlink" title="References"></a>References</h2><ul><li><a href="https://www.sans.org/top25-software-errors" target="_blank" rel="noopener">CWE/SANS TOP 25 Most Dangerous Software Errors</a></li><li><a href="https://practicingruby.com/articles/information-anatomy" target="_blank" rel="noopener">The anatomy of information in software systems</a></li><li><a href="http://connascence.io/" target="_blank" rel="noopener">Connascence</a></li><li><a href="https://docs.microsoft.com/en-us/azure/architecture/patterns/event-sourcing" target="_blank" rel="noopener">Event Sourcing</a></li><li><a href="https://en.wikipedia.org/wiki/Theory_of_constraints" target="_blank" rel="noopener">Theory of constraints</a></li><li><a href="https://en.wikipedia.org/wiki/5_Whys" target="_blank" rel="noopener">Five Whys</a></li></ul>]]></content>
<summary type="html">
<p><strong>This blog has been migrated to <a href="https://linghao.io" target="_blank" rel="noopener">linghao.io</a>. Read this post on my new blog: <a href="https://linghao.io/notes/programming-beyond-practices/" target="_blank" rel="noopener">https://linghao.io/notes/programming-beyond-practices/</a>.</strong></p>
<p><a href="http://shop.oreilly.com/product/0636920047391.do" target="_blank" rel="noopener"><em>Programming Beyond Practices</em></a> by Gregory Brown is one of the most insightful books I’ve read about software engineering. The book consists of 8 chapter-length stories that present a series of interesting topics reaching beyond the act of programming itself.</p>
<p>This post is a refined version of the notes I took while reading this book.</p>
</summary>
<category term="Notes" scheme="http://dnc1994.com/categories/Notes/"/>
</entry>
<entry>
<title>如何提高英语姿势水平</title>
<link href="http://dnc1994.com/2017/01/how-to-improve-your-english/"/>
<id>http://dnc1994.com/2017/01/how-to-improve-your-english/</id>
<published>2017-01-23T09:40:17.000Z</published>
<updated>2019-06-03T00:38:55.675Z</updated>
<content type="html"><![CDATA[<p><strong>本博客已经迁移到新域名 <a href="https://linghao.io" target="_blank" rel="noopener">linghao.io</a>。请前往新博客阅读本文:<a href="https://linghao.io/posts/improve-english/" target="_blank" rel="noopener">https://linghao.io/posts/improve-english/</a>。</strong></p><h2 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h2><p>既然标题口气这么大,那就先放点考试成绩让大家对我的英语水平有个直观的了解(当然,成绩只是考察语言能力的一个角度):</p><ul><li>TOEFL:R 30 + L 30 + S 23 + W 29 = 112</li><li>GRE:Verbal 161 + Quant 170 + AW 4.0</li><li>FET(复旦英语测试):100(满分,意味着在那一年参加考试的人群中接近或者处于第一名的位置)</li></ul><p>我不敢说自己是个英语特别好的人,但可以说在目前的需求下对英语还是运用自如的。在此之前,我在很多场合介绍过自己准备托福和 GRE 的经验,却很少谈及 generally 提高英语水平的心得。这是有原因的:一方面是因为我自己最近几年英语水平的提高已经完全是依靠长时间沉浸在有足够输入输出的环境中的积累,而这个状态跟绝大多数向我寻求帮助的人所处的学习阶段相去甚远;另一方面是因为,即使从我自身的经验中提取出能帮助到别人的部分,也是一件比较难的事情。不过我还是尝试了一下,在 15 年 5 月的时候做了一次班内分享,不过当时对一些问题思考的还不够全面。最近又有学妹找我咨询这方面的问题,我思考了一下觉得现在应该能把自己的心得概括得比较到位了,于是就有了这篇文章。</p><p>不同于我的其他博文,在本文中我将从一个比较高的层次谈谈自己对英语学习的理解,而不会把重点放在给出全面和具体的建议上面。这也意味着本文可能需要依赖于读者的反馈来进行后续的修改,以充实内容。欢迎任何方面的评论和提问:)</p><a id="more"></a><h2 id="环境:Input-amp-Output"><a href="#环境:Input-amp-Output" class="headerlink" title="环境:Input & Output"></a>环境:Input & Output</h2><p>在大学的前两年经常听到的一句自嘲的话是,“高三是我英语水平的巅峰”。然而,如果有谁不只停留在自嘲,而是发自内心这么想的话,那就是错得离谱了。</p><p>不再被强迫上英语课,水平就会逐渐下降,这纯粹是无稽之谈。如果一个人每天都有一定量的英语输入和输出,水平不仅不会下降,说不定还能在不花费额外时间的情况下得到提高。</p><p>这里的逻辑在于,如果一个人想要提高自己的英语水平,那么一般来说他肯定正处在一个需要他这么做的环境中,也就是我所说的<strong>能够保证一定程度的输入和输出量的环境</strong>;如果他并没有处在这种环境中,而只是单纯地想提高英语水平,那么他就必须努力去创造这样的环境,否则任何提高都是不可能的。</p><p>举例来说,我的专业领域的学术环境基本上是由英语主导的,同时我也有留学美国的打算,之前又长时间沉浸在学习 MOOC 中,还有好几段必须使用英语的经历,再加上我由于从小喜欢英语而在文化消费上一直在接受大量的英语材料,所以我可以说是半主动半被动地在一个每天都会输入和输入大量英语信息的环境中生活和学习。在这种前提下,英语水平想不提高反倒是比较难的。</p><p>从发挥主观能动性的角度来讲,参加海外交流项目或者是去国际化的公司实习,都是非常理想的创造英语输入输出环境的方式。当然这样的机会并不是总能争取到的,所以这里再介绍几个我比较推崇而又足够 accessible 的方式。</p><h3 id="MOOC"><a href="#MOOC" class="headerlink" title="MOOC"></a>MOOC</h3><p>在这个话题下,最值得推荐的就是 MOOC 了。试想,<strong>适合大学生,又与实际运用联系紧密,还能能作为长期输入输出来源的</strong>,除了上课还有什么呢?</p><p>学习 MOOC 
对英语水平的提高是多方面的:看授课视频能够提高真实学术场景下的听力;阅读课本和补充材料以及完成作业和考试能够提高英语学术能力;跟教授、TA 和其他学习者在论坛中交流则能提高书面表达能力。更不要说很多想要提高英语水平的人的目的都是出国学习,提前学习 MOOC 非常有助于到时<strong>适应以英语为主导的学习环境</strong>。</p><h3 id="阅读"><a href="#阅读" class="headerlink" title="阅读"></a>阅读</h3><p>很多人会问,为什么读了那么多 XXX 和 YYY,理解能力还是没有提高,阅读速度还是那么慢。其实答案很简单,就像其他任何学科一样,<strong>英语学习也是要靠不断把自己推出舒适区,才能有提高的</strong>。</p><p>具体说来,就是在平时接触阅读材料的时候,<strong>有意识地去思考和锻炼</strong>。举个例子,当你遇到一个有意思的表达时,可以想想是不是能用在自己最近写过的东西上面,或者思考一下如果你是作者这里会怎么写,再作一番比较;又比如说,在阅读材料时可以给自己记个时,不断逼自己在保持对内容的把握的前提下用更短的时间完成阅读。总而言之,只有让自己稍微不那么好受一点,接触到的材料才能多多少少积累进脑子里。</p><p>下面推荐一些我平时的阅读材料来源。</p><ul><li><a href="https://en.wikipedia.org/wiki/Main_Page" target="_blank" rel="noopener">Wikipedia</a>:不用介绍了。如果你兴趣足够广泛,光维基就足够你看了。</li><li><a href="https://www.theguardian.com" target="_blank" rel="noopener">The Guardian</a>:同不用介绍。</li><li><a href="https://www.nytimes.com" target="_blank" rel="noopener">The New York Times</a>:继续不用介绍。</li><li><a href="https://aeon.co/" target="_blank" rel="noopener">Aeon</a>:我最喜欢的在线杂志,多是深度长文,话题涵盖丰富,遣词造句值得学习积累。</li></ul><p>PS:一般来说新闻类的文章阅读价值不大,评论类的比较值得用心读。</p><h3 id="刷剧"><a href="#刷剧" class="headerlink" title="刷剧"></a>刷剧</h3><p>虽然刷起来最后看到忘我是标准结局,但也并不是完全学不到英语的。或者说,既然横竖都要刷剧,那么不妨稍微花点心思积累一些语言点,不也挺好的吗?</p><p>首先要选对剧。比如说新闻编辑室和反恐 24 小时,哪个能学到英语,哪个能看得爽就不用多说了吧(当然,如果能对上口味,新闻编辑室是可以鱼和熊掌兼得的)。</p><p>其次就是心态要放轻松,不要隔三差五暂停做笔记什么的。要明确刷剧学英语更多地是一种潜移默化的影响,只是一个辅助方式,而不是知识点密度高到值得提取出笔记的一种媒介。</p><p>还有就是字幕的问题。在达到不借助字幕能看懂生肉的水平之前,我比较推崇只开英文字幕。这样练多了还有个额外的好处是很多时候不用等中字出来就能刷剧,比如像纸牌屋这种 Netflix 的剧都内嵌了英字,我每季都是出来的当天刷完的。当然,如果有质量高的双语字幕,开着中翻也是可以学到东西的,像人人字幕组的很多作品都翻的非常信雅达,值得学习。而且开中字还可以尝试挑错,也是很有趣的。</p><h2 id="情怀:Work-Hard"><a href="#情怀:Work-Hard" class="headerlink" title="情怀:Work Hard"></a>情怀:Work Hard</h2><p>说完了如何在有输入输出的环境中通过不断积累来提高英语水平,那么如果想专注提高自己某一方面的能力该怎么办呢?</p><p>其实答案不外乎“努力”二字。虽然我一直推崇积累的方式,但是不可否认很多时候我们会有短期提高的需求;与此同时积累的学习方式也需要我们先达到一个足够的水平,也就是说会有一个不可跳过的 bootstrap 
的过程。绝大部分人在准备相关考试的时候,其实就已经或多或少地涉及到了非积累性的水平提高方式。只不过由于应试的压力,取得的成果往往是容易反弹的。这里就介绍一些我自己用过比较有效的方法吧:</p><ul><li><strong>听抄:</strong>顾名思义,放听力材料然后试图把原文一字不落地写下来。<strong>听抄是提高听力最有效的方式</strong>,它能够锻炼不同层次的听力技巧:既要在短时间内把握大致意象,又强迫人去听出每个细节,从而对快速提高能够应对的语速以及更准确地理解细节有很大的帮助。</li><li><strong>口译:</strong>这是上英语口译课时被丁小龙老师推荐的方法。除了在正式的场合做口译以外,只要有心随时随地都可以练习。比如在听不那么重要的讲课时,可以在心中默默把英文翻译念出来。</li><li><strong>背诵:</strong>把喜欢的文章或是影视作品的片段背下来,对口语和写作都有帮助。像 <em>The Minister</em> 的很多片段我就能倒背如流,这也在一定程度上影响了我的写作风格。</li><li><strong>翻译:只有在两种语言之间不断切换,才能深刻地理解思维和文化上的差异。</strong>可以找些自己感兴趣的材料当搬运工,又或者接份兼职翻译的工作,一遍赚钱一遍锻炼能力。这大概是能够比较全面地提高英语水平又有足够激励使人能坚持下去的比较好的选择之一吧。</li></ul><p>这些方法虽然可以在短时间内拔高英语水平,但要消耗的时间和精力实在是过于巨大。要长期坚持下来,可能需要大量情怀支撑吧……</p><h2 id="误区:How-Not-To-Fail"><a href="#误区:How-Not-To-Fail" class="headerlink" title="误区:How Not To Fail"></a>误区:How Not To Fail</h2><p>在很多领域,成功的方式有无数种,没有人可以打包票教会你如何才能成功;但往往失败的方式却是可以被总结和避免的。创业是如此(事实上,我第一次接触 How Not To Fail 这个表述就是来自于 Alistair Croll 关于 startup 和 growth hacking 的 <a href="https://www.youtube.com/watch?v=0cEfe9mSatM" target="_blank" rel="noopener">talk</a>),英语学习也是如此。所以在这一小节里,我会讨论几个常见的误区。</p><h3 id="单词"><a href="#单词" class="headerlink" title="单词"></a>单词</h3><p>我个人是不推崇用软件背单词的。确切地说,部分背单词软件的形式很容易让人做无用功。</p><p>比如说,给一个单词和四个选项,让你选出正确的义项,这种形式就会让你高估自己对单词的掌握程度;又比如说,只有单词没有例句的,就很容易让人学到一堆孤立的单词却没法用起来,在句子里听到也反应不过来;更为离谱的是,如果单词只有中文释义,很多抽象近义词之间的差别是没法讲清楚的。不要小看这一点,很多时候固定搭配之所以成为固定搭配,并不是完全没有道理的,找一本靠谱的英英词典仔细看就会理解了。</p><p>如果让我推荐,我会去找一本带靠谱<strong>英文释义</strong>、整理了词根近义词反义词、有<strong>高质量例句</strong>并且带<strong>例句录音</strong>的<strong>乱序</strong>单词书来背。我当年花了 4 个月的时间完全(这个“完全”的程度可不是开玩笑的)掌握托福词汇就是这么做的,至今受用无穷。</p><p>当然,我上一次不为应试目的背单词已经是 13 年的事情了。现在大家在用的背单词软件应该也都科学了许多。这里只是提醒一下,不要陷入上述这类误区。</p><h3 id="写作"><a href="#写作" class="headerlink" 
title="写作"></a>写作</h3><p>中国学生最容易犯的问题,就是在下笔写文章之前,先在脑子里想好中文,然后逐字逐句翻成英文。这样做的问题在于,<strong>翻译是一件比写作难得多的事情</strong>。在遇到稍复杂一点的意思时,用中译英的思维方式几乎不可能写出地道的表达,因为这么做势必会在一定程度上丢掉对逻辑结构、颗粒度和固定搭配的考虑。</p><p>在有了好的想法以后,要写出好文章,首先一定要<strong>切换到英语思维直接下笔</strong>,其次就是一定要从读者的角度出发考虑<strong>易读性</strong>。永远不要认为内容可以掩盖语言能力的不足。因为根据我的观察来看,<strong>以大部分中国学生的英语书面表达能力之弱,根本轮不上拼文章内容的程度</strong>。试想如果读者看不懂你想表达的意思,内容再精彩又有什么用?</p><p>在一般情况下,<strong>不要写结构复杂的长句子,不要用 GRE 里面的高级词汇,不要逐字句查字典翻译来表达你不会表达的意思(英语不是文言文)</strong>。至于应该怎么写,在学好语法的基础上,只要输入够多,平时又有意识地去积累,写出来的文章虽然不一定辞藻华丽,但肯定能够清楚地表意。在此基础上再慢慢改进,直到最后达到一种<strong>德式严谨和法式浪漫之间的完美平衡</strong>。</p><p>PS:根据我帮别人改文章的经验来看,大部分时候下笔不达意的主要原因是动词和介词用的不恰当(表达不够地道);读起来枯燥的主要原因是<strong>句式单一以及不注意换词(积累不足或是意识缺乏)</strong>;文字冗余的主要原因则是<strong>逻辑混乱</strong>,下笔之前就没想清楚,让他用中文写也会出一样的问题。所以说<strong>母语能力决定二语表达能力上限</strong>这句话是绝对没错的。</p><h2 id="结语"><a href="#结语" class="headerlink" title="结语"></a>结语</h2><p>除非是将语言作为学术研究对象,否则它对我们来说始终只是一门工具、一项技能。<strong>如果是为了应试,准备起来总不会难到哪里去;如果是身处实际运用的环境,就要在平时刻意去锻炼和积累才能提高;如果想要短期拔高,那么必然要付出巨量的时间和精力作为代价。</strong></p><p>如果抛开功利不谈,是什么使我坚持学习外语到今天呢?我想是因为<strong>多学一门语言就能多打开一扇新世界的大门</strong>。我心目中最理想的状态,是在面对其他语言时能像面对母语一样处变不惊,只把它当做生活中一个平凡的组成部分。祝愿大家都能发现更宽广的世界。</p><!-- 最后,如果觉得这篇文章对你有帮助,可以资助我喝一瓶啤酒 :)![Alipay QRcode](http://oj4csnnsi.bkt.clouddn.com/blog/gradapply_demystified/alipay_qrcode.jpg) --><h2 id="附录"><a href="#附录" class="headerlink" title="附录"></a>附录</h2><p>以前写的几篇关于英语学习的文章:</p><ul><li><a href="https://dnc1994.com/2016/03/how-to-prepare-for-toefl-gre/">如何备考 TOEFL/GRE</a></li><li><a href="https://www.zhihu.com/question/27774623/answer/38327507" target="_blank" rel="noopener">知乎回答:托福上 110 分需要英语达到什么水平?</a></li></ul>]]></content>
<summary type="html">
<p><strong>本博客已经迁移到新域名 <a href="https://linghao.io" target="_blank" rel="noopener">linghao.io</a>。请前往新博客阅读本文:<a href="https://linghao.io/posts/improve-english/" target="_blank" rel="noopener">https://linghao.io/posts/improve-english/</a>。</strong></p>
<h2 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h2><p>既然标题口气这么大,那就先放点考试成绩让大家对我的英语水平有个直观的了解(当然,成绩只是考察语言能力的一个角度):</p>
<ul>
<li>TOEFL:R 30 + L 30 + S 23 + W 29 = 112</li>
<li>GRE:Verbal 161 + Quant 170 + AW 4.0</li>
<li>FET(复旦英语测试):100(满分,意味着在那一年参加考试的人群中接近或者处于第一名的位置)</li>
</ul>
<p>我不敢说自己是个英语特别好的人,但可以说在目前的需求下对英语还是运用自如的。在此之前,我在很多场合介绍过自己准备托福和 GRE 的经验,却很少谈及 generally 提高英语水平的心得。这是有原因的:一方面是因为我自己最近几年英语水平的提高已经完全是依靠长时间沉浸在有足够输入输出的环境中的积累,而这个状态跟绝大多数向我寻求帮助的人所处的学习阶段相去甚远;另一方面是因为,即使从我自身的经验中提取出能帮助到别人的部分,也是一件比较难的事情。不过我还是尝试了一下,在 15 年 5 月的时候做了一次班内分享,不过当时对一些问题思考的还不够全面。最近又有学妹找我咨询这方面的问题,我思考了一下觉得现在应该能把自己的心得概括得比较到位了,于是就有了这篇文章。</p>
<p>不同于我的其他博文,在本文中我将从一个比较高的层次谈谈自己对英语学习的理解,而不会把重点放在给出全面和具体的建议上面。这也意味着本文可能需要依赖于读者的反馈来进行后续的修改,以充实内容。欢迎任何方面的评论和提问:)</p>
</summary>
<category term="Knowledge" scheme="http://dnc1994.com/categories/Knowledge/"/>
</entry>
<entry>
<title>DIY 留学申请全攻略</title>
<link href="http://dnc1994.com/2017/01/gradschool-application-diy-demystified/"/>
<id>http://dnc1994.com/2017/01/gradschool-application-diy-demystified/</id>
<published>2017-01-05T16:11:47.000Z</published>
<updated>2019-06-03T00:39:11.127Z</updated>
<content type="html"><![CDATA[<p><strong>本博客已经迁移到新域名 <a href="https://linghao.io" target="_blank" rel="noopener">linghao.io</a>。请前往新博客阅读本文:<a href="https://linghao.io/posts/gradapply-diy/" target="_blank" rel="noopener">https://linghao.io/posts/gradapply-diy/</a>。</strong></p><!-- # Do It Yourself: Graduate School Application Demystified --><h2 id="Preface"><a href="#Preface" class="headerlink" title="Preface"></a>Preface</h2><p><strong>本文采用<a href="https://creativecommons.org/licenses/by-nc-nd/3.0/cn/" target="_blank" rel="noopener">署名 - 非商业性使用 - 禁止演绎 3.0 中国大陆许可协议</a>进行许可。著作权由章凌豪所有。</strong></p><p><em>(笔者章凌豪是一名来自复旦大学的 17 Fall CS 方向的 DIY 申请者。在刚刚过去的申请季中,笔者总共申请了美国 14 个大学的 19 个 Master 项目。)</em></p><p>关于申请,总是有很多 myth。最经典的一个就是在申请季开始的时候总会有一大堆人打印十几二十份成绩单跑到教务处去盖章。而事实上,现在绝大多数的项目都采用网申系统,在申请时往往只要求上传成绩单的扫描件。以笔者的 19 个项目为例,其中只有 2 个项目要求寄送纸质成绩单。诸如此类的 myth 代代相传,真正有用的信息却总是凌乱琐碎,再加上唯恐天下不乱的各路中介散布各种不负责任的观点,这种信息不对称使我们 DIY 申请者走了许多弯路。</p><p>申请本身是一件充满了不确定性的事情。在申请季到来时,我们自身的经历和成就往往不会再有太大的变化。我们要做的就是确保不要引入什么减分项,努力表现出自己的加分项,同时控制和减小各个环节的不确定性。本文的主要目的,就是将笔者在 DIY 申请的过程中感受和了解到的一些重要的信息,经过整理和组织后提取出一些通用的内容,以极高的干货密度分享出来。接下来笔者将从申请前需要做的准备,申请时各种材料应当如何处理,以及如何规划和管理申请的进度几个方面来展开。</p><p>提示 1:本文中关于材料具体应该如何准备的部分(比如书写 SOP/PS 的思路)可能只适用于理工科申请,请人文、艺术、商科等方向的读者自行鉴别,兼听则明;而其余部分的内容则基本上对于所有方向的申请都适用。</p><p>提示 2:本文将不会涉及以下内容:</p><ul><li>如何选择学校和项目。</li><li>PhD 申请相关的问题:如何套瓷、如何写好 research proposal、如何准备一对一面试等。</li><li>网申阶段结束以后的问题:如何选择 offer、如何 argue 奖学金等。</li></ul><p>提示 3:本文的干货密度很高,建议仔细阅读,大概需要 15 分钟。</p><a id="more"></a><h2 id="Preparations"><a href="#Preparations" class="headerlink" title="Preparations"></a>Preparations</h2><p>理想情况下,在申请之前,你需要:</p><ul><li>一个能够(相对)稳定翻墙的网络环境。换句话说,一个(甚至几个)代理或者 VPN。如何搭建或是购买它们则不属于本文的讨论范围,只能说至少在目前看来愿意花钱总是能办到的。</li><li>一个(相对)可靠的邮箱。如果能确保上一条,建议选择 Gmail。此外建议使用包含姓名/生日等标示性信息的用户名,比如<a href="mailto:`xiaoming.wang.95@gmail.com" target="_blank" rel="noopener">`xiaoming.wang.95@gmail.com</a>`。</li><li>一张用于在线支付的信用卡,推荐 Visa。</li><li>几个近两年的、跟你同一申请方向的学长学姐。</li><li>几个跟你同一申请方向的同级生。</li></ul><h2 
id="Program-Requirements-amp-Progress-Tracking"><a href="#Program-Requirements-amp-Progress-Tracking" class="headerlink" title="Program Requirements & Progress Tracking"></a>Program Requirements & Progress Tracking</h2><p>对于 DIY 申请来说,最重要的就是要有条理。所以在最开始,笔者想要强调很多人都会忽视的两点:看清项目需求,管好申请进度。</p><p>绝大多数项目都会要求提供如下材料:</p><ul><li>个人基本信息</li><li>教育经历,包括上传或者寄送成绩单(Transcript)</li><li>英语考试成绩</li><li>动机函(SOP = Statement of Purpose)或是个人陈述(PS = Personal Statement),以及简历(CV/Resume)</li><li>由推荐人填写推荐并提交推荐信</li></ul><p>这些也是本文将要重点讨论的内容。但与此同时,不同项目对这些材料的要求可能会很不一样,并且大部分项目都会要求一些额外的材料。所以,在申请之前必须先逐个考察和记录好每个项目要求提供的材料清单。</p><p>对于进度管理,笔者建议使用电子表格做一个简单的 tracker 来帮助自己掌握每个项目的进展。在申请结束后,笔者的 tracker 是这样的:</p><p><img src="gradapp-diy-application-tracker.jpg" alt="Application Tracker"></p><p>这里给出笔者所使用的<a href="https://docs.google.com/spreadsheets/d/1BLF2G2XnSoySMSMknsUpbb4RPWOhCb91Dat642YbMeQ/edit?usp=sharing" target="_blank" rel="noopener">模板</a>供参考。如果不方便翻墙的,<a href="gradapp-diy-application-tracker-template.xlsx">这里</a><br> 还有一个 Excel 版。</p><h2 id="Transcripts"><a href="#Transcripts" class="headerlink" title="Transcripts"></a>Transcripts</h2><p>在填写教育经历时,大部分项目都只要求上传成绩单扫描件,在获得 offer/AD 之后才要求寄送原件。但也有个别项目在申请时就会要求寄送原件,或者是要求进行成绩认证。</p><p>首先要仔细阅读项目对成绩单的要求,比如是否要有签字、是否要附上评分标准、是否要提供中文原件等。在办好满足要求的成绩单后,就可以拿去扫描上传。在上传时,可能会遇到一些比较奇葩的网申系统限制文件的大小。这时需要用一些工具来压缩 PDF 的大小,那么在重新上传之前务必要确认压缩后的文件是否依然足够清晰到能够看清上面的文字。</p><p>对于寄送成就单原件,一般都是使用学校的官方信封,然后到教务处盖上骑缝章。寄送的时候一般选择 DHL,三天基本就能到了,学生件的价格也很低(150 RMB/件)。</p><p>做成绩认证则比较麻烦。笔者遇到的有学信网认证(与 ApplyWeb 合作)和 WES 认证。流程分别如下:</p><p>学信网:</p><ol><li>登录学信网上传成绩单和个人证件的扫描件,缴费并提交申请。</li><li>等待 20 天。</li><li>认证报告出来后,在美国学校的网申系统中缴费并提交认证请求。</li><li>这时在学信网上能够看到已经缴费的项目,选择发送电子报告即可(即达)。</li></ol><p>WES:</p><ol><li>在 WES 上创建申请,选好收件方,缴费并获得一个 reference number。</li><li>对于来自中国大陆学校的成绩单,必须先使用学信网或者学位网完成认证并将报告寄到 WES。这两个认证网站都有目的为 WES 的选项,按照指示操作即可。</li><li>等待学信网/学位网完成认证并发件到 WES。</li><li>WES 需要大概 7 天来将收到的报告跟你的档案匹配起来,再花 7 天左右来处理报告。</li><li>WES 将最终的报告寄给收件方。</li></ol><p>可以看到,如果要做成绩认证,需要花的时间还是挺长的,尤其是 
WES。如果遇到感恩节、圣诞节之类的假期,就要等待更久。所以如果想申请的项目中有要求成绩认证的,一定要尽早开始。这也是为什么申请任何一个项目要做的第一件事永远是看清需要提交的材料清单的原因。</p><p>这里可能遇到问题的是交流的成绩单。如果你选择将交流经历写入教育经历,那么就要搞清楚是否需要提交交流时的成绩单。一般来说,如果学分被转换回来并且记载到了本科学校的成绩单上,往往就不需要再提交交流成绩单,但并不总是这样的。而且在实际操作中,如果在交流的时候没有做好准备,事后申请往往比较麻烦,所以这里要提前做好打算。</p><h2 id="Test-Scores"><a href="#Test-Scores" class="headerlink" title="Test Scores"></a>Test Scores</h2><p>对北美申请而言,大多数情况下我们所需要的英语成绩就是托福和 GRE。在申请季到来时,分两种情况:</p><ol><li>你已经有满意的成绩:那么只需送分即可。不过这里要注意下有效期的问题。一般学校对成绩有效期的要求是能够在你提交申请时有效即可,但也不乏要求有效期能够覆盖到申请 deadline 甚至是实际入学日期的。更有甚者,某些学校会强行只承认托福的有效期为 18 个月而不是 2 年。所以这里要留个心眼。</li><li>你还有要参加的考试:注意出成绩和送成绩的时间间隔,尽量不要卡 deadline;还有一个问题是,在申请季如果寄送多份成绩给学校,要留心学校的规定。某些学校会明确指出最多接受 2 封成绩,或者说取最后一次为准(而不是我们所希望的取最高为准)等等。</li></ol><p>至于怎样的成绩足够好,往往所申项目的网站上会给出最低要求,或者是每年的申请者和被录取者的平均成绩。一般而言,对于理工科来说,托福 100+,GRE 320+(Verbal 152+,Quant 168+,AW 不重要)就可以满足绝大多数项目的要求了。不过这里要注意的是,项目介绍中提及的数字,有 hard limit(如果不满足要求就直接淘汰),也有 soft limit(如果低于要求只要其他方面足够强也可以被录取)。像是申请做助教要求托福口语单项 24 这种往往是 hard limit,其他的情况就很难区分了。可能需要去调查一下往年被录取者的背景。不过如果分数差的不多,又是的确想去的项目,建议还是申请一下撞撞运气,毕竟多申一个项目的费用相对于以后留学的开销来讲都是小意思。</p><p>有些项目的网申系统中可以查看成绩是否已经寄到。由于成绩从 ETS 寄到学校以后,还有一个将成绩跟你的申请档案匹配起来的过程(通过填写的考试注册号、姓名、生日等),所以系统中的这个成绩状态可能会更新得比较慢。笔者的经验是最慢的项目大概 15 天左右也都显示寄到了。这里还是建议一旦确定要申请某个项目就马上送分。</p><p>有些项目是无法在系统中查询成绩状态的,这时如果实在不放心,可以尝试发邮件询问,但不一定会得到回复。不过由于大部分系统都会让你上传电子版成绩单,所以即使出了意外没寄到问题应该也不算太大,真的缺文件之后学校往往会发邮件给你的。</p><p>关于电子版成绩单的问题,托福可以用纸质成绩单扫描件,或是 NEEA 上的网页截图。GRE 则是在 ETS 官网有 PDF 下载。</p><p>这里顺带提一下,关于托福和 GRE 的备考,笔者曾经写过<a href="https://dnc1994.com/2016/03/toefl-gre-preparation/">一篇文章</a>,可以参考。</p><h2 id="SOP-PS"><a href="#SOP-PS" class="headerlink" title="SOP/PS"></a>SOP/PS</h2><p>对于申请最重要的一篇 essay 就是 SOP/PS 了。这里其实也有一个 myth,很多人都会强调它们之间的区别,说 SOP 侧重于 Purpose 所以要重点描述自己申请这个项目的目标是什么希望收获什么,PS 侧重于 Personal 所以要重点描述过去的经历是如何促使自己想要申请这个项目的。这类观点并不是全无道理,但事实是大部分项目对于 SOP/PS 中应该包括什么内容都是有要求的。如果你把上面的定义跟每个项目的具体要求去对比,就会发现把这篇 essay 叫成 SOP 还是 PS 很多时候都是随意的。所以还是根据要求来就好。</p><p>以 CMU MIIS 项目的要求为例:</p><blockquote><p>A good essay conveys three types of information about you.<br>First, we look 
for strong evidence that you can do well in the MIIS degree program. For example, a description of your academic experience is good evidence. A description of a software project, your involvement in the project, and the impact of the project is good evidence. A description of an internship or professional experience is good evidence. These descriptions are stronger if they provide details about what you did, what you liked, and what you learned from the experience.<br>Second, your essay is stronger if it explains why you want to be in the MIIS program. We understand that you may be applying to more than one degree program. Tell us why are you applying to this one, and what you hope to get out of your experience here.<br>Third, a brief discussion of your career goals - what you enjoy, what you hope to do after you complete the MIIS degree - helps us to understand how the MIIS degree may contribute to your long-term professional goals.</p></blockquote><p>一般来说,SOP/PS 可能会包含这些内容:</p><ul><li>开头:主要描写自己的 motivation / dream / goal。这里如果没法写得特别引人入胜,建议开门见山。以及切忌无脑引用名人名言。</li><li>两到三段经历:挑选跟申请方向最相关的事例,串成几段逻辑连贯的经历。有个简单的验证方法就是每段经历的中心句拿出来可以连成一个故事;在写每段经历的时候,一般都是问题描述+解决方案(what you did)+感想收获(what you liked and learned)的结构,要注意把握细节性和故事性之间的平衡,同时避免写泛泛而谈的内容(e.g. I fixed many bugs and felt very glad about it.),因为这谁都能写的出来;段与段之间注意过渡。</li><li>Why School & Why Program:为什么选择这个学校是很难写的,大部分人如果试图回答这个问题往往只能编出特别庸俗的故事(我有一个学长/男朋友在贵校……)。而为什么选择这个项目则相对而言比较好写:可以花点时间去了解下项目的课程安排,然后提一两门课的名字,说觉得这些课能补充你不足的地方;又或者提一下项目的亮点,比如跟工业界接触多、团队协作多,等等。总之要让招生委觉得你仔细地研究过了项目。</li><li>Why graduate school / Why further education:这跟上一点不大一样,比较好的方法是写一段自己对专业领域的比较独到的见解,又或者是畅想一下这个领域的未来,从而自然地引出你觉得在这样的情况下自己离实现目标还差一些技能和经验,所以想要继续学习。</li><li>Career goal:如果前面的铺垫足够,只要从职业的角度重申一下自己的目标就行了。同样可以结合自己的见解或是对未来的畅想使这部分言之有物。</li></ul><p>下面提供笔者本人申请 CMU LTI 的 SOP 的开头段和末尾两段:</p><blockquote><p>“Give a computer a fish, you feed it for a day; teach it how to fish, you feed it for a lifetime.” I still remember this quote from Professor Hsuan-Tien Lin’s slides. 
After spending my freshman year without much motivation, it was his online course Machine Learning Foundations and Techniques that sparked my interests in Machine Learning and Data Mining. Today, I’m determined to pursue a Master’s degree from the Language Technologies Institute (LTI) at Carnegie Mellon University (CMU) because of the same passion that has been driving me to boldly advance in this exciting field for the last two years.</p></blockquote><blockquote><p>…</p></blockquote><blockquote><p>This technical leadership experience tremendously inspired me. It made me reflect on where the industry is heading and what I still lack to accomplish my dream. As computing becomes ubiquitous in our life, the future of software industry will be dominated by machine intelligence. However, shipping intelligent products is always faced with extra complexity stemming from its very nature. For instance, a Machine Learning system can never be as decoupled as ordinary software, because the reason to develop it in the first place is that the desired behavior cannot be explicitly programmed without dependency on external data. As in the case of EVA, data dependencies often lead to unexpected performance drop and high maintenance cost, which tend to compound and become what we know as technical debt. To create genuine productivity, we have to continuously pay off technical debt by making sound decisions and refactoring legacy codes. Therefore, apart from deepening my understanding of Machine Learning and Data Mining, I need to polish system design and software engineering skills as well.</p></blockquote><blockquote><p>My career goal is to become a leading intelligent software developer. And I believe that a professional degree would help me better prepare myself for it. After thorough research, I’m convinced that LTI is the best choice to realize my dream. 
As a trailblazing leader, LTI fascinates me with the way it combines research and engineering, especially the projects using large-scale information extraction and content analysis to combat crimes from human trafficking to cyberterrorism. I’m applying for Master in Intelligent Information Systems (MIIS) and Master of Computational Data Science (MCDS) because both program suits my experiences and interests perfectly well. Courses like Machine Learning for Text Mining and Large-Scale Multimedia Analysis will provide highly specialized insights and I cannot wait to challenge myself with them. Given my solid background in Computer Science, practical experience with research projects and sufficient exposure to industrial applications, I’m confident that I will succeed in this ambitious endeavor and teach machines how to “fish” forever.</p></blockquote><p>可以看到开头很恰当地引用了林轩田老师的一句话,其余部分基本就是开门见山。而倒数第二段写了自己对 Machine Learning 作为一个软工系统的所具有的独特的复杂性的见解,最后一段通过描述对 LTI 的直观感受和项目里的两门课程来表达自己对项目的兴趣。</p><p>中间主体的经历部分由于包含一些隐私信息,不便公开。不过可以给出每段的中心句:</p><blockquote><p>Motivated by my desire to gain a more systematic grasp, I joined Shanghai Key Laboratory of Data Science. </p></blockquote><blockquote><p>While gradually shifting my focus to real-world applications, I realized that many scenarios require customized tweaks due to various constraints. 
</p></blockquote><blockquote><p>Encouraged by my achievements and impelled by the curiosity to taste the difference between academia and industry, I joined Strikingly, a startup company specializing in website building tools.</p></blockquote><p>总结一下笔者的思路,大概就是:</p><ul><li>通过 MOOC 对 Machine Learning 和 Data Mining 产生了兴趣。</li><li>在实验室系统地学习了相关的知识,并解决了一些实际问题。</li><li>对实际问题越来越感兴趣,发现它们往往涉及到理解现有算法和模型的原理并进行相应的修改,所以在交流时加入的实验室那边进行了尝试。</li><li>被之前的成就所鼓励,同时也对工业界的实际情况感兴趣,加入了一家创业公司实习。</li><li>在实习期间逐渐体会到 Machine Learning 系统所特有的复杂性,感受到自己还有许多短板。</li><li>希望能够加入这个非常符合我的兴趣和经历的项目。</li></ul><p>对于 DIY 申请者来说,我们需要依赖学长学姐或是文书咨询顾问的反馈来修改我们的 SOP/PS。这里比较推荐的方式是,找至少两位值得信任又能够及时回复的审稿人,用快速迭代的方式,每写一稿就发给他们,再根据反馈来修改。一开始不要急着对语言进行润色,而是把内容先定下来。最后如果有条件也可以找 Native Speaker 帮你润色语言,但要把握好修改的度,不要适得其反。</p><p>笔者本人的 SOP/PS 基本保持着一周一稿的速度,一共改了八稿,在这个过程中也得到了不少感悟。除了上面已经提到过的几点以外,还有:</p><ul><li>写第一稿是最痛苦和漫长的,笔者花了两星期。建议留一段比较空闲的时间来好好回忆和梳理大学三年的经历,再仔细写出来。一开始不要害怕字数会超出太多,之后可以慢慢删减。</li><li>切忌写成 CV 的展开版。尤其是第一稿一定要小心,否则由于锚定效应,很容易限制之后修改的思路。</li><li>一开始总想一口气表现出自己各方面的优点,但后来往往会发现把所有内容都写的很出色是不可能的,势必要根据重要性来做一些详略上的取舍。</li><li>过渡是很重要的,尤其是行文如果不符合读者心理预期,会让人对文章的印象分大减。建议反复跟审稿人确认文章是否存在这样的问题。</li><li>改最后几稿时,要开始考虑可读性的问题。招生委读你的文章的时间是很有限的,不仅要避免用太高级的词汇,语法和表达如果用的不好,也会出问题。有时候审稿人跟你是一个专业但是方向并不相同,就可能不会看的特别仔细,也不会提出这类问题。但是就笔者的经验,几乎没有人的写作水平能够地道到不犯这种错误。最好能让对你的专业有了解的人来读读看会不会遇到坑。笔者的 SOP/PS 一开始就有几处容易读出歧义的地方,而笔者本人却完全不自知。并且,笔者在帮别人修改 SOP/PS 的时候,也经常遇到由于作者语言能力不过关的原因而不能理解所想要表达的意思。</li><li>多用主动语态,多用显得自己充满信心和笃定的词。</li></ul><p>还有一个问题是,对于不同的学校和项目,通常要提供不同版本的文书。但如果真的为了 N 个项目去写 N 份 SOP/PS,恐怕没人会有那么多的时间和精力。一个比较折中的解决方案是,先写出针对 dream school 的一份底稿,对其他的项目就在此基础上只修改开头和结尾的两段。一般而言,如果你申请的所有项目都是同一个方向的,那么主体部分是不用大改的。甚至在很多情况下,你只需要修改学校和项目的名字。当然,这里千万注意不要把名字搞错了。</p><h2 id="CV"><a href="#CV" class="headerlink" title="CV"></a>CV</h2><p>CV 就比较简单,把常见的内容都罗列上去就行了,网上能找到很多模板,我这里也提供<a href="https://www.rpi.edu/dept/arc/training/latex/resumes/" target="_blank" rel="noopener">一个 LaTeX 模板系列</a>作为参考。强调几点:</p><ul><li>每个想要详述的 Project 采用类似于 SOP/PS 经历段的三点展开。</li><li>描述自己做了什么的时候注意使用恰当的动词。</li><li>CV 要不 1 页要不 2 
页,再长就不合适了,1 页半也很尴尬。除非项目有明确要求,否则一般都可以通过行距和页边距来调整紧凑程度。</li></ul><p>举例来说,笔者本人的 CV 包含如下内容:</p><ul><li>Education<ul><li>B.S. of Computer Science, Fudan University</li><li>Exchange Program, National Chiao Tung University</li><li>Selected MOOCs</li></ul></li><li>Professional Experience<ul><li>Data Mining Engineer Intern, Strikingly</li><li>Technical Director, Student Information Management Center</li></ul></li><li>Academic Experience<ul><li>Research Assistant, Shanghai Key Laboratory of Data Science</li><li>Visiting Student, Machine Learning Laboratory of NCTU</li></ul></li><li>Honors and Awards</li><li>Selected Side Projects</li><li>Extracurricular Activities</li><li>Programming Skills</li></ul><p>这里注意,笔者选择将 Professional Experience 放在前面,并在最后加入了 Programming Skills,是由于笔者主要申请的都是就业导向的项目。对于学术导向的项目,需要做相应的调整。</p><p>笔者本人的一个 Project 是这样写的:</p><h4 id="Content-Filtering"><a href="#Content-Filtering" class="headerlink" title="Content Filtering"></a>Content Filtering</h4><ul><li>Phishing & spamming detection service used by the main site.</li><li>Redesigned the prediction pipeline to support customized i18n handlers.</li><li>More malicious behaviors can now be identified with greater accuracy.</li></ul><p>很好地遵循了是什么+做了什么+有什么用的结构,简明扼要又符合读者的预期。</p><h2 id="Recommendations"><a href="#Recommendations" class="headerlink" title="Recommendations"></a>Recommendations</h2><p>在这一部分,你需要找到一定数量(一般是 3 位)的推荐人,将他们的信息填写进网申系统。系统将会向他们的邮箱发送邮件,其中会包含一个链接或一对用户名/密码用于访问推荐填写系统。推荐的形式一般是提交一封推荐信,加上回答若干个问题(跟被推荐人的关系,对被推荐人的各方面进行基于百分比的评价等)。</p><p>一般而言推荐人可以是:</p><ul><li>上过的课程的老师</li><li>所待实验室的教授</li><li>交流期间的教授</li><li>实习期间的上司</li></ul><p>对于北美申请来说,通常来自美国/加拿大的教授的推荐信比较有效。而如果是领域内知名的大牛,或者是要申请的学校那边有分量的教授(比如院长)这类就算是牛推了。当然,大部分人是拿不到牛推的。这时候在推荐人的构成上如果能够全面一点会比较好(至少对于 Master 申请是如此)。比如笔者的三位推荐人就分别是在复旦实验室的教授、交流期间实验室的教授和实习公司的 CTO。</p><p>在跟推荐人沟通时,要注意以下几点:</p><ul><li>提前问好推荐人愿意提交几份推荐。有些比较严肃的推荐人可能只愿意提交 5 份推荐,或者只愿意提交给他所认可的学校。如果首选的 3 
个推荐人中存在这种情况,那你就要考虑将自己最想去的项目交给这个推荐人,然后为其他项目找新的推荐人了。所以仅仅确认推荐人愿意为你作推荐的意愿是不够的,这些细节都必须提前商量好。</li><li>在发送推荐请求时,尽量一口气把所有项目的邮件一同发送,这样不仅方便了推荐人,也减小了因为推荐人迟迟不填申请而导致你提心吊胆的概率,对大家都有好处。</li><li>很多时候推荐人比较忙,这时候要不失礼貌但又不懈地提醒对方。这个问题在推荐人是海外的教授时比较常见,因为我们通常没有邮件以外的方式去联系他们。所以最好能够保存对方的电话,或是跟对方实验室里的学生保持联系。</li></ul><p>照理来说,推荐信应该是由推荐人自己起草和提交的。但在实际操作中,尤其是对于中国的教授而言,大家都知道很多时候推荐信底稿是由申请者自己写的。在这种情况下,一封好的推荐信应该满足如下几个要素:</p><ul><li>推荐人跟你的关系以及如何认识你的。</li><li>推荐人跟你共事的经历。</li><li>对你各方面能力的评价(通过具体事例的细节来论证)。</li><li>对你为人和性格的评价。</li><li>一些比较私人的、在你的 SOP/PS 和 CV 中没有的内容(使推荐信更可信)。</li></ul><p>跟 SOP/PS 相反,写推荐信切忌开上帝视角,很多细节性的东西你的推荐人是不应该会知道的。虽然说在没有牛推的情况下,推荐信处理得再好可能最多也只是起到一个不减分的作用。但从尽人事的角度出发,努力让推荐信看起来真实并且很好地支撑你的整个申请,绝对不会是一件有害的事情。</p><p>这里给出一个<a href="gradapp-diy-recommendation-sample.pdf">推荐信样本</a>作为参考。</p><p>还有一点要注意的是,在填写推荐人联系邮箱时,一定要用职业邮箱。比如对教授来说就是 edu 邮箱。</p><h2 id="Extra-Materials"><a href="#Extra-Materials" class="headerlink" title="Extra Materials"></a>Extra Materials</h2><p>在上述材料以外,项目有可能会要求以下这些材料:</p><ul><li>Personal History Statement:公立学校(比如 UC 各校)常见,通常是要求描写个人成长过程中遇到的困难(侧重于因为自己是 minority group 或者家庭经济情况所导致的困难),你是如何克服这些困难的,以及它们是如何 motivate 你来申请学校的。对于这篇 essay,笔者的几位学长都表示,如果不是 LGBT/残疾人/黑人之类公认的 minority group,写了可能作用也不是很大,建议如果是 optional 的就不写,如果是 required 的话也不要花太多时间在上面。听从他们的建议,笔者最后就按照实际情况写了自己是如何从一个落后的农村来到复旦读书的。</li><li>Diversity Essay:这个就更侧重在 minority group 上面了,要求描写自己能够如何促进学校的 diversity。如何处理同上。</li><li>Video Essay:要求提交一个长 1 ~ 2 分钟的视频,其中由申请者本人出镜,对内容的要求一般是简单的自我介绍,说一些 SOP/PS 上没有涉及的内容。写个稿子直接录制,注意下着装、背景、光线和音质即可,不需要花太多时间,也没有必要给视频加什么特效。</li><li>Video Interview:网申阶段的面试往往是播放几个录制好的问题,然后实时采集你的回答,跟托福口语考试的形式比较像。这里要注意的是找一个好的网络环境。面试可能会要求描述自己的某段经历(e.g. Describe the most challenging project you’ve ever worked on.),或者问一些比较 general 的问题(e.g. 
When working on a group project, how do you acknowledge the achievements of your teammates?),可以找同学互相出题准备一下。</li></ul><h2 id="Recommended-Schedule-amp-Approach"><a href="#Recommended-Schedule-amp-Approach" class="headerlink" title="Recommended Schedule & Approach"></a>Recommended Schedule & Approach</h2><p>申请时建议按照如下的步骤进行。</p><ol><li>确认要申请的项目列表。</li><li>查阅要申请的每个项目的 FAQ 等相关页面,确定需要准备的材料清单。</li><li>进入每个项目的网申系统创建申请。</li><li>送英语成绩,申请成绩认证,寄送纸质成绩单。</li><li>准备其他文书,并抽空填写个人信息。</li><li>在确定推荐人和推荐信内容以后发送推荐信请求。</li><li>在文书准备得差不多或是 deadline 将至时,上传文书。</li><li>最终检查并付款提交。</li></ol><p>比较关键的就是第 4 点一定要尽早做,因为这部分的材料都是需要等待第三方进行处理的,不像其他材料可以拖到最后一天再提交(当然,最好不要这么做)。</p><p>实际操作时,笔者推荐每次完成所有项目的网申系统中的一个部分。比如今天的任务是填写基本信息,那么就一口气把所有项目的基本信息都填写完毕,不多不少。这样做的好处是效率高而且不容易遗漏。当然,如果你喜欢卡 deadline,那么显然就要从最想去的项目开始填一个提交一个。(说真的,不要卡 deadline。)</p><h2 id="Tips"><a href="#Tips" class="headerlink" title="Tips"></a>Tips</h2><ul><li>再次强调,申请任何一个项目之前一定要先看清和记录好要求。</li><li>对 DIY 申请来说,由于没有中介帮忙分析定位和管理进度,就更容易出错。所以做到有条有理和万事提前是很重要的。(当然,DIY 申请的好处也在这里:不会被中介坑。)</li><li>有任何不确定的问题第一时间发邮件问。因为不一定会有回复,所以有一同申请的人可以互相交流和分享资讯是非常必要的。</li><li>要认清自己的优势和短板。交流经历(尤其是英语授课的项目)和实习经历(尤其是 big name 和 startup)是很强的加分项,尤其对于 Master 申请来说;当助教的经历也是很强的加分项,甚至比这门课拿 A/A+ 更强,尤其对于 PhD 申请来说。建立在清晰的自我认知的前提下,很多时候我们都需要对实际经历进行适当的艺术加工来使自己的 case 更有力,这是完全 OK 的。</li><li>学长学姐的帮助是你最宝贵的财富。绝大多数情况下他们都是很乐于提供自己的文书和帮你审阅文书的。</li></ul><h2 id="Epilog"><a href="#Epilog" class="headerlink" title="Epilog"></a>Epilog</h2><p>作为一个刚刚度过有着五门课和一周三天兼职实习的申请季的人来说,笔者深知这几个月对每个申请者来说都是一场艰苦的持久战。希望本文能为各位带来一些启发,节省一些时间。希望大家都能申请到自己的 dream school。</p>]]></content>
<summary type="html">
<p><strong>This blog has moved to the new domain <a href="https://linghao.io" target="_blank" rel="noopener">linghao.io</a>. Please read this post on the new blog: <a href="https://linghao.io/posts/gradapply-diy/" target="_blank" rel="noopener">https://linghao.io/posts/gradapply-diy/</a>.</strong></p>
<!-- # Do It Yourself: Graduate School Application Demystified -->
<h2 id="Preface"><a href="#Preface" class="headerlink" title="Preface"></a>Preface</h2><p><strong>This article is licensed under the <a href="https://creativecommons.org/licenses/by-nc-nd/3.0/cn/" target="_blank" rel="noopener">Attribution-NonCommercial-NoDerivatives 3.0 China Mainland License</a>. Copyright is held by Linghao Zhang (章凌豪).</strong></p>
<p><em>(The author, Linghao Zhang, is a 17 Fall CS DIY applicant from Fudan University. In the application season that just ended, he applied to a total of 19 Master’s programs at 14 universities in the US.)</em></p>
<p>There are many myths about graduate school applications. The classic one: at the start of every application season, crowds of students print ten or twenty copies of their transcripts and rush to the registrar’s office to get them stamped. In reality, the vast majority of programs now use online application systems and only ask you to upload a scanned transcript. Of my 19 programs, only 2 required mailing a paper transcript. Myths like this get passed down from class to class, while genuinely useful information stays scattered and fragmented; add the irresponsible claims spread by agents happy to stir the pot, and this information asymmetry has led us DIY applicants down many detours.</p>
<p>Applying is inherently full of uncertainty. By the time application season arrives, our experience and achievements are largely fixed. What we need to do is make sure we introduce no negatives, present our strengths well, and control and reduce the uncertainty at every step. The main purpose of this article is to organize the important things I felt and learned during my DIY application, distill the generally applicable parts, and share them at a very high signal density. I will cover what to prepare before applying, how to handle each kind of application material, and how to plan and manage the application process.</p>
<p>Note 1: the parts of this article about how to prepare specific materials (e.g. how to structure an SOP/PS) may only apply to STEM applications; readers in the humanities, arts, business, etc. should use their own judgment and consult multiple sources. The rest applies to applications in essentially all fields.</p>
<p>Note 2: this article does not cover the following topics:</p>
<ul>
<li>How to choose schools and programs.</li>
<li>PhD-specific questions: how to contact professors, how to write a good research proposal, how to prepare for one-on-one interviews, etc.</li>
<li>Questions that arise after the online application stage: how to choose among offers, how to argue for funding, etc.</li>
</ul>
<p>Note 3: this article is dense with information; read it carefully. It takes about 15 minutes.</p>
</summary>
<category term="Knowledge" scheme="http://dnc1994.com/categories/Knowledge/"/>
</entry>
<entry>
<title>How to Rank 10% in Your First Kaggle Competition</title>
<link href="http://dnc1994.com/2016/05/rank-10-percent-in-first-kaggle-competition-en/"/>
<id>http://dnc1994.com/2016/05/rank-10-percent-in-first-kaggle-competition-en/</id>
<published>2016-05-11T06:06:51.000Z</published>
<updated>2019-01-19T04:05:07.860Z</updated>
<content type="html"><![CDATA[<h2 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h2><p><a href="https://www.kaggle.com/" target="_blank" rel="noopener">Kaggle</a> is the best place to learn from other data scientists. Many companies provide data and prize money to set up data science competitions on Kaggle. Recently I had my first shot on Kaggle and <strong>ranked 98th (~ 5%) among 2125 teams</strong>. As it was my Kaggle debut, I felt quite satisfied with the result. Since many Kaggle beginners set 10% as their first goal, I want to share my two cents on how to achieve that.</p><p><em>This post is also available in <a href="https://dnc1994.com/2016/04/rank-10-percent-in-first-kaggle-competition/">Chinese</a>.</em></p><p><strong>Updated on Oct 28th, 2016: </strong> I made many wording changes and added several updates to this post. Note that Kaggle has gone through some major changes since I published this post, especially with its ranking system. Therefore some descriptions here might not apply anymore.</p><a id="more"></a><p><img src="kaggle-guide-profile.png" alt="Kaggle Profile"></p><p>Most Kagglers use Python or R. I prefer Python, but R users should have no difficulty in understanding the ideas behind tools and languages.</p><p>First let’s go through some facts about Kaggle competitions in case you are not familiar with them.</p><ul><li><p>Different competitions have different tasks: classifications, regressions, recommendations… The training set and testing set will be open for download after the competition launches.</p></li><li><p>A competition typically lasts for 2 ~ 3 months. Each team can submit a limited number of times per day. Usually it’s 5 times a day.</p></li><li><p>There will be a 1st submission deadline one week before the end of the competition, after which you cannot merge teams or enter the competition. 
Therefore <strong>be sure to have at least one valid submission before that.</strong></p></li><li><p>You will get your score immediately after the submission. Different competitions use different scoring metrics, which are explained by the question mark on the leaderboard.</p></li><li><p>The score you get is calculated on a subset of the testing set, which is commonly referred to as a <strong>Public LB</strong> score. The final result will use the remaining data in the testing set, which is referred to as a <strong>Private LB</strong> score.</p></li><li><p>The score you get by local cross validation is commonly referred to as a <strong>CV</strong> score. Generally speaking, CV scores are more reliable than LB scores.</p></li><li><p>Beginners can learn a lot from the <strong>Forum</strong> and <strong>Scripts</strong>. Do not hesitate to ask about anything. Kagglers are in general very kind and helpful.</p></li></ul><p>I assume that readers are familiar with basic concepts and models of machine learning. 
Enjoy reading!</p><h2 id="General-Approach"><a href="#General-Approach" class="headerlink" title="General Approach"></a>General Approach</h2><p>In this section, I will walk you through the process of a Kaggle competition.</p><h3 id="Data-Exploration"><a href="#Data-Exploration" class="headerlink" title="Data Exploration"></a>Data Exploration</h3><p>What we do at this stage is called <strong>EDA (Exploratory Data Analysis)</strong>, which means analytically exploring data in order to provide some insights for subsequent processing and modeling.</p><p>Usually we would load the data using <strong><a href="http://pandas.pydata.org/" target="_blank" rel="noopener">Pandas</a></strong> and make some visualizations to understand the data.</p><h4 id="Visualization"><a href="#Visualization" class="headerlink" title="Visualization"></a>Visualization</h4><p>For plotting, <strong><a href="http://matplotlib.org/" target="_blank" rel="noopener">Matplotlib</a></strong> and <strong><a href="https://stanford.edu/~mwaskom/software/seaborn/" target="_blank" rel="noopener">Seaborn</a></strong> should suffice.</p><p>Some common practices:</p><ul><li>Inspect the distribution of target variable. Depending on what scoring metric is used, <strong>an <a href="http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5128907" target="_blank" rel="noopener">imbalanced</a> distribution of target variable might harm the model’s performance</strong>.</li><li>For <strong>numerical variables</strong>, use <strong>box plot</strong> and <strong>scatter plot</strong> to inspect their distributions and check for outliers.</li><li>For classification tasks, plot the data with points colored according to their labels. 
This can help with feature engineering.</li><li>Make pairwise distribution plots and examine their correlations.</li></ul><p>Be sure to read <a href="https://www.kaggle.com/benhamner/d/uciml/iris/python-data-visualizations" target="_blank" rel="noopener">this inspiring tutorial of exploratory visualization</a> before you go on.</p><h4 id="Statistical-Tests"><a href="#Statistical-Tests" class="headerlink" title="Statistical Tests"></a>Statistical Tests</h4><p>We can perform some statistical tests to confirm our hypotheses. Sometimes we can get enough intuition from visualization, but quantitative results are always good to have. Note that we will always encounter non-i.i.d. data in the real world. So we have to be careful about which test to use and how we interpret the findings.</p><p>In many competitions public LB scores are not very consistent with local CV scores due to noise or non-i.i.d. distribution. You can use test results to <strong>roughly set a threshold for determining whether an increase of score is due to genuine improvement or randomness</strong>.</p><h3 id="Data-Preprocessing"><a href="#Data-Preprocessing" class="headerlink" title="Data Preprocessing"></a>Data Preprocessing</h3><p>In most cases, we need to preprocess the dataset before constructing features. Some common steps are:</p><ul><li>Sometimes several files are provided and we need to join them.</li><li>Deal with <strong><a href="https://en.wikipedia.org/wiki/Missing_data" target="_blank" rel="noopener">missing data</a></strong>.</li><li>Deal with <strong><a href="https://en.wikipedia.org/wiki/Outlier" target="_blank" rel="noopener">outliers</a></strong>.</li><li>Encode <strong><a href="https://en.wikipedia.org/wiki/Categorical_variable" target="_blank" rel="noopener">categorical variables</a></strong> if necessary.</li><li>Deal with noise. For example, you may have some floats derived from raw figures. 
The loss of precision during floating-point arithmetic can bring much noise into the data: two seemingly different values might be the same before conversion. Sometimes noise harms the model and we would want to avoid that.</li></ul><p>How we choose to perform preprocessing largely depends on what we learn about the data in the previous stage. In practice, I recommend using <strong><a href="http://ipython.org/notebook.html" target="_blank" rel="noopener">Jupyter Notebook</a></strong> for data manipulation and mastering usage of frequently used Pandas operations. The advantage is that you get to see the results immediately and are able to modify or rerun code blocks. This also makes it very convenient to share your approach with others. After all, <a href="https://en.wikipedia.org/wiki/Reproducibility" target="_blank" rel="noopener">reproducible results</a> are very important in data science.</p><p>Let’s see some examples.</p><h4 id="Outlier"><a href="#Outlier" class="headerlink" title="Outlier"></a>Outlier</h4><p><img src="kaggle-guide-outlier-example.png" alt="Outlier Example"></p><p>The plot shows some scaled coordinates data. We can see that there are some outliers in the top-right corner. Exclude them and the distribution looks good.</p><h4 id="Dummy-Variables"><a href="#Dummy-Variables" class="headerlink" title="Dummy Variables"></a>Dummy Variables</h4><p>For categorical variables, a common practice is <strong><a href="https://en.wikipedia.org/wiki/One-hot" target="_blank" rel="noopener">One-hot Encoding</a></strong>. For a categorical variable with <code>n</code> possible values, we create a group of <code>n</code> dummy variables. 
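In pandas this transformation is a single call; a minimal sketch on toy data (the frame and values below are illustrative, not taken from any competition dataset):

```python
import pandas as pd

# Toy frame standing in for real data; "DayOfWeek" mirrors the example below.
df = pd.DataFrame({"DayOfWeek": ["Mon", "Tue", "Mon", "Sun"]})

# One dummy column per distinct value; each row gets exactly one 1 in the group.
dummies = pd.get_dummies(df["DayOfWeek"], prefix="DayOfWeek")
df = pd.concat([df.drop("DayOfWeek", axis=1), dummies], axis=1)

print(list(df.columns))  # ['DayOfWeek_Mon', 'DayOfWeek_Sun', 'DayOfWeek_Tue']
```

Dropping the original column after concatenation keeps the frame tidy, since the dummy group fully encodes it.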
Suppose a record in the data takes one value for this variable, then the corresponding dummy variable is set to <code>1</code> while other dummies in the same group are all set to <code>0</code>.</p><p><img src="kaggle-guide-dummies-example.png" alt="Dummies Example"></p><p>In this example, we transform <code>DayOfWeek</code> into 7 dummy variables.</p><p>Note that when the categorical variable can take many values (hundreds or more), this might not work well. It’s difficult to find a general solution to that, but I’ll discuss one scenario in the next section.</p><h3 id="Feature-Engineering"><a href="#Feature-Engineering" class="headerlink" title="Feature Engineering"></a>Feature Engineering</h3><p>Some describe the essence of Kaggle competitions as <strong>feature engineering supplemented by model tuning and ensemble learning</strong>. Yes, that makes a lot of sense. <strong>Feature engineering gets you very far.</strong> Yet it is how well you know the domain of the given data that decides how far you can go. For example, in a competition where the data mainly consists of text, Natural Language Processing techniques are a must. The approach of constructing useful features is something we all have to continuously learn in order to do better.</p><p>Basically, <strong>when you feel that a variable is intuitively useful for the task, you can include it as a feature</strong>. But how do you know it actually works? The simplest way is to plot it against the target variable like this:</p><p><img src="kaggle-visualize-feature-correlation.png" alt="Checking Feature Validity"></p><h4 id="Feature-Selection"><a href="#Feature-Selection" class="headerlink" title="Feature Selection"></a>Feature Selection</h4><p>Generally speaking, <strong>we should try to craft as many features as we can and have faith in the model’s ability to pick up the most significant features</strong>. 
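Besides plotting, a quick quantitative check is to fit a random forest and read off its feature importances. A minimal sketch on synthetic data (the data-generating process and model settings are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(300, 3)
# Target depends strongly on column 0, weakly on column 1, not at all on column 2.
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + 0.01 * rng.rand(300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
# Importances sum to 1; column 0 should dominate, column 2 should be near zero.
print(model.feature_importances_)
```

The same trick works for classification with `RandomForestClassifier`; it is cheap enough to rerun after every batch of new features.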
Yet there’s still something to gain from feature selection beforehand:</p><ul><li>Fewer features mean faster training</li><li>Some features are linearly related to others. This might put a strain on the model.</li><li>By picking up the most important features, we can use interactions between them as new features. Sometimes this gives surprising improvement.</li></ul><p>The simplest way to inspect feature importance is by fitting a random forest model. There are more robust feature selection algorithms (e.g. <a href="http://jmlr.org/papers/volume10/tuv09a/tuv09a.pdf" target="_blank" rel="noopener">this</a>) which are theoretically superior but not practicable due to the absence of an efficient implementation. You can combat noisy data (to an extent) simply by increasing the number of trees used in a random forest.</p><p>This is important for competitions in which data is <strong><a href="https://en.wikipedia.org/wiki/Data_anonymization" target="_blank" rel="noopener">anonymized</a></strong> because you won’t waste time trying to figure out the meaning of a variable that’s of no significance.</p><h4 id="Feature-Encoding"><a href="#Feature-Encoding" class="headerlink" title="Feature Encoding"></a>Feature Encoding</h4><p>Sometimes raw features have to be converted to some other formats for them to work properly.</p><p>For example, suppose we have a categorical variable which can take more than 10K different values. Then naively creating dummy variables is not a feasible option. An acceptable solution is to create dummy variables for only a subset of the values (e.g. values that constitute 95% of the feature importance) and assign everything else to an ‘others’ class.</p><p><strong>Updated on Oct 28th, 2016: </strong> For the scenario described above, another possible solution is to use <strong>Factorization Machines</strong>. 
Please refer to <a href="https://www.kaggle.com/c/expedia-hotel-recommendations/forums/t/21607/1st-place-solution-summary" target="_blank" rel="noopener">this post</a> by Kaggle user “idle_speculation” for details.</p><h3 id="Model-Selection"><a href="#Model-Selection" class="headerlink" title="Model Selection"></a>Model Selection</h3><p>When the features are set, we can start training models. Kaggle competitions usually favor <strong>tree-based models</strong>:</p><ul><li><strong>Gradient Boosted Trees</strong></li><li>Random Forest</li><li>Extra Randomized Trees</li></ul><p>The following models are slightly worse in terms of general performance, but are suitable as base models in ensemble learning (will be discussed later):</p><ul><li>SVM</li><li>Linear Regression</li><li>Logistic Regression</li><li>Neural Networks</li></ul><p>Note that this does not apply to computer vision competitions which are pretty much dominated by neural network models.</p><p>All these models are implemented in <strong><a href="http://scikit-learn.org/" target="_blank" rel="noopener">Sklearn</a></strong>.</p><p>Here I want to emphasize the greatness of <strong><a href="https://github.com/dmlc/xgboost" target="_blank" rel="noopener">Xgboost</a></strong>. The outstanding performance of gradient boosted trees and Xgboost’s efficient implementation makes it very popular in Kaggle competitions. Nowadays almost every winner uses Xgboost in one way or another.</p><p><strong>Updated on Oct 28th, 2016: </strong> Recently Microsoft open sourced <strong><a href="https://github.com/Microsoft/LightGBM" target="_blank" rel="noopener">LightGBM</a></strong>, a potentially better library than Xgboost for gradient boosting.</p><p>By the way, for Windows users installing Xgboost could be a painstaking process. 
You can refer to <a href="https://dnc1994.com/2016/03/installing-xgboost-on-windows/">this post</a> by me if you run into problems.</p><h4 id="Model-Training"><a href="#Model-Training" class="headerlink" title="Model Training"></a>Model Training</h4><p>We can improve a model’s performance by tuning its parameters. A model usually has many parameters, but only a few of them are significant to its performance. For example, the most important parameters for a random forest are the number of trees in the forest and the maximum number of features used in developing each tree. <strong>We need to understand how models work and what impact each parameter has on the model’s performance, be it accuracy, robustness or speed.</strong></p><p>Normally we would find the best set of parameters by a process called <strong><a href="http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html" target="_blank" rel="noopener">grid search</a></strong>. Actually what it does is simply iterating through all the possible combinations and finding the best one.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">param_grid = {<span class="string">'n_estimators'</span>: [<span class="number">300</span>, <span class="number">500</span>], <span class="string">'max_features'</span>: [<span class="number">10</span>, <span class="number">12</span>, <span class="number">14</span>]}</span><br><span class="line">model = grid_search.GridSearchCV(</span><br><span class="line">    estimator=rfr, param_grid=param_grid, n_jobs=<span class="number">1</span>, cv=<span class="number">10</span>, verbose=<span class="number">20</span>, scoring=RMSE</span><br><span class="line">)</span><br><span class="line">model.fit(X_train, 
y_train)</span><br></pre></td></tr></table></figure><p>By the way, a random forest usually reaches its optimum when <code>max_features</code> is set to the square root of the total number of features.</p><p>Here I’d like to stress some points about tuning XGB. These parameters are generally considered to have real impacts on its performance:</p><ul><li><code>eta</code>: Step size used in updating weights. Lower <code>eta</code> means slower training but better convergence.</li><li><code>num_round</code>: Total number of iterations.</li><li><code>subsample</code>: The ratio of training data used in each iteration. This is to combat overfitting.</li><li><code>colsample_bytree</code>: The ratio of features used in each iteration. This is like <code>max_features</code> in <code>RandomForestClassifier</code>.</li><li><code>max_depth</code>: The maximum depth of each tree. Unlike random forest, <strong>gradient boosting would eventually overfit if we do not limit its depth</strong>.</li><li><code>early_stopping_rounds</code>: If we don’t see an increase in validation score for a given number of iterations, the algorithm will stop early. This is to combat overfitting, too.</li></ul><p>Usual tuning steps:</p><ol><li>Reserve a portion of the training set as the validation set.</li><li>Set <code>eta</code> to a relatively high value (e.g. 0.05 ~ 0.1), <code>num_round</code> to 300 ~ 500.</li><li>Use grid search to find the best combination of other parameters.</li><li>Gradually lower <code>eta</code> until we reach the optimum.</li><li><strong>Use the validation set as <code>watch_list</code> to re-train the model with the best parameters. Observe how the score changes on the validation set in each iteration. 
Find the optimal value for <code>early_stopping_rounds</code>.</strong></li></ol><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line">X_dtrain, X_deval, y_dtrain, y_deval = \</span><br><span class="line"> cross_validation.train_test_split(X_train, y_train, random_state=<span class="number">1026</span>, test_size=<span class="number">0.3</span>)</span><br><span class="line">dtrain = xgb.DMatrix(X_dtrain, y_dtrain)</span><br><span class="line">deval = xgb.DMatrix(X_deval, y_deval)</span><br><span class="line">watchlist = [(deval, <span class="string">'eval'</span>)]</span><br><span class="line">params = {</span><br><span class="line"> <span class="string">'booster'</span>: <span class="string">'gbtree'</span>,</span><br><span class="line"> <span class="string">'objective'</span>: <span class="string">'reg:linear'</span>,</span><br><span class="line"> <span class="string">'subsample'</span>: <span class="number">0.8</span>,</span><br><span class="line"> <span class="string">'colsample_bytree'</span>: <span class="number">0.85</span>,</span><br><span class="line"> <span class="string">'eta'</span>: <span class="number">0.05</span>,</span><br><span class="line"> <span class="string">'max_depth'</span>: <span class="number">7</span>,</span><br><span class="line"> <span class="string">'seed'</span>: <span class="number">2016</span>,</span><br><span 
class="line">    <span class="string">'silent'</span>: <span class="number">0</span>,</span><br><span class="line">    <span class="string">'eval_metric'</span>: <span class="string">'rmse'</span></span><br><span class="line">}</span><br><span class="line">clf = xgb.train(params, dtrain, <span class="number">500</span>, watchlist, early_stopping_rounds=<span class="number">50</span>)</span><br><span class="line">pred = clf.predict(xgb.DMatrix(df_test))</span><br></pre></td></tr></table></figure><p>Finally, note that models with randomness all have a parameter like <code>seed</code> or <code>random_state</code> to control the random seed. <strong>You must record this</strong> with all other parameters when you get a good model. Otherwise you wouldn’t be able to reproduce it.</p><h4 id="Cross-Validation"><a href="#Cross-Validation" class="headerlink" title="Cross Validation"></a>Cross Validation</h4><p><strong><a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)" target="_blank" rel="noopener">Cross validation</a></strong> is an essential step in model training. It tells us whether our model is at high risk of overfitting. In many competitions, public LB scores are not very reliable. Often when we improve the model and get a better local CV score, the LB score becomes worse. <strong>It is widely believed that we should trust our CV scores in such situations.</strong> Ideally we would want <strong>CV scores obtained by different approaches to improve in sync with each other and with the LB score</strong>, but this is not always possible.</p><p>Usually <strong>5-fold CV</strong> is good enough. If we use more folds, the CV score would become more reliable, but the training takes longer to finish as well. However, we shouldn’t use too many folds if our training data is limited. Otherwise we would have too few samples in each fold to guarantee statistical significance.</p><p>How to do CV properly is not a trivial problem. 
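A minimal 5-fold CV sketch with scikit-learn (synthetic data; this uses the current `model_selection` module rather than the older `sklearn.cross_validation` API seen in the snippets above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for competition data.
X, y = make_classification(n_samples=300, n_features=10, random_state=2016)
model = RandomForestClassifier(n_estimators=50, random_state=2016)

# cv=5 gives the usual 5-fold split; fixing random_state makes the score reproducible.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```

Reporting the standard deviation alongside the mean gives a rough sense of how noisy the CV estimate is.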
It requires constant experimentation and case-by-case discussion. Many Kagglers share their CV approaches (like <a href="https://www.kaggle.com/c/telstra-recruiting-network/forums/t/19277/what-is-your-cross-validation-method" target="_blank" rel="noopener">this one</a>) after competitions when they feel that reliable CV is not easy.</p><h3 id="Ensemble-Generation"><a href="#Ensemble-Generation" class="headerlink" title="Ensemble Generation"></a>Ensemble Generation</h3><p><a href="https://en.wikipedia.org/wiki/Ensemble_learning" target="_blank" rel="noopener">Ensemble Learning</a> refers to the technique of combining different models. It <strong>reduces both bias and variance of the final model</strong> (you can find a proof <a href="http://link.springer.com/chapter/10.1007%2F3-540-33019-4_19" target="_blank" rel="noopener">here</a>), thus <strong>increasing the score and reducing the risk of overfitting</strong>. Recently it has become virtually impossible to win a prize without using ensembles in Kaggle competitions.</p><p>Common approaches of ensemble learning are:</p><ul><li><p><strong>Bagging</strong>: Use different random subsets of training data to train each base model. Then all the base models vote to generate the final predictions. This is how random forest works.</p></li><li><p><strong>Boosting</strong>: Train base models iteratively, modifying the weights of training samples according to the last iteration. This is how gradient boosted trees work. (Actually it’s not the whole story. Apart from boosting, GBTs try to learn the residuals of earlier iterations.) It performs better than bagging but is more prone to overfitting.</p></li><li><p><strong>Blending</strong>: Use non-overlapping data to train different base models and take a weighted average of them to obtain the final predictions. 
This is easy to implement but uses less data.</p></li><li><p><strong>Stacking</strong>: To be discussed next.</p></li></ul><p>In theory, for the ensemble to perform well, two factors matter:</p><ul><li><strong>Base models should be as unrelated as possible</strong>. This is why we tend to include non-tree-based models in the ensemble even though they don’t perform as well. The math says that the greater the diversity, the less bias in the final ensemble.</li><li><strong>Performance of base models shouldn’t differ too much.</strong></li></ul><p>Actually we have a <strong>trade-off</strong> here. In practice we may end up with highly related models of comparable performances. Yet we ensemble them anyway because it usually increases the overall performance.</p><h4 id="Stacking"><a href="#Stacking" class="headerlink" title="Stacking"></a>Stacking</h4><p>Compared with blending, stacking makes better use of training data. Here’s a diagram of how it works:</p><p><img src="kaggle-guide-stacking-diagram.jpg" alt="Stacking"></p><p><em>(Taken from <a href="https://www.kaggle.com/mmueller" target="_blank" rel="noopener">Faron</a>. Many thanks!)</em></p><p>It’s much like cross validation. Take 5-fold stacking as an example. First we split the training data into 5 folds. Next we will do 5 iterations. In each iteration, train every base model on 4 folds and predict on the hold-out fold. <strong>You have to keep the predictions on the testing data as well.</strong> This way, in each iteration every base model will make predictions on 1 fold of the training data and all of the testing data. After 5 iterations we will obtain a matrix of shape <code>#(samples in training data) X #(base models)</code>. This matrix is then fed to the stacker (it’s just another model) in the second level. 
After the stacker is fitted, use the predictions on testing data by base models (<strong>each base model is trained 5 times, therefore we have to take an average to obtain a matrix of the same shape</strong>) as the input for the stacker and obtain our final predictions.</p><p>Maybe it’s better to just show the codes:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">Ensemble</span><span class="params">(object)</span>:</span></span><br><span class="line"> <span class="function"><span class="keyword">def</span> <span class="title">__init__</span><span class="params">(self, n_folds, stacker, base_models)</span>:</span></span><br><span class="line"> self.n_folds = n_folds</span><br><span class="line"> self.stacker = stacker</span><br><span class="line"> 
self.base_models = base_models</span><br><span class="line"></span><br><span class="line"> <span class="function"><span class="keyword">def</span> <span class="title">fit_predict</span><span class="params">(self, X, y, T)</span>:</span></span><br><span class="line"> X = np.array(X)</span><br><span class="line"> y = np.array(y)</span><br><span class="line"> T = np.array(T)</span><br><span class="line"></span><br><span class="line"> folds = list(KFold(len(y), n_folds=self.n_folds, shuffle=<span class="keyword">True</span>, random_state=<span class="number">2016</span>))</span><br><span class="line"></span><br><span class="line"> S_train = np.zeros((X.shape[<span class="number">0</span>], len(self.base_models)))</span><br><span class="line"> S_test = np.zeros((T.shape[<span class="number">0</span>], len(self.base_models)))</span><br><span class="line"></span><br><span class="line"> <span class="keyword">for</span> i, clf <span class="keyword">in</span> enumerate(self.base_models):</span><br><span class="line"> S_test_i = np.zeros((T.shape[<span class="number">0</span>], len(folds)))</span><br><span class="line"></span><br><span class="line"> <span class="keyword">for</span> j, (train_idx, test_idx) <span class="keyword">in</span> enumerate(folds):</span><br><span class="line"> X_train = X[train_idx]</span><br><span class="line"> y_train = y[train_idx]</span><br><span class="line"> X_holdout = X[test_idx]</span><br><span class="line"> <span class="comment"># y_holdout = y[test_idx]</span></span><br><span class="line"> clf.fit(X_train, y_train)</span><br><span class="line"> y_pred = clf.predict(X_holdout)[:]</span><br><span class="line"> S_train[test_idx, i] = y_pred</span><br><span class="line"> S_test_i[:, j] = clf.predict(T)[:]</span><br><span class="line"></span><br><span class="line"> S_test[:, i] = S_test_i.mean(<span class="number">1</span>)</span><br><span class="line"></span><br><span class="line"> self.stacker.fit(S_train, y)</span><br><span class="line"> 
y_pred = self.stacker.predict(S_test)[:]</span><br><span class="line">        <span class="keyword">return</span> y_pred</span><br></pre></td></tr></table></figure><p>Prize winners usually have larger and much more complicated ensembles. For beginners, implementing a correct 5-fold stacking is good enough.</p><h3 id="Pipeline"><a href="#Pipeline" class="headerlink" title="*Pipeline"></a>*Pipeline</h3><p>We can see that the workflow for a Kaggle competition is quite complex, especially for model selection and ensemble. Ideally, we need a highly automated pipeline capable of:</p><ul><li><strong>Modularized feature transformations</strong>. We only need to write a few lines of code (or better, rules / DSLs) and the new feature is added to the training set.</li><li><strong>Automated grid search</strong>. We only need to set up models and the parameter grid, and the search will be run and the best parameters recorded.</li><li><strong>Automated ensemble selection</strong>. Use the K best models for training the ensemble as soon as we put another base model into the pool.</li></ul><p>For beginners, the first one is not very important because the number of features is quite manageable; the third one is not important either because typically we only do several ensembles at the end of the competition. But the second one is good to have because <strong>manually recording the performance and parameters of each model is time-consuming and error-prone</strong>.</p><p><a href="https://www.kaggle.com/chenglongchen" target="_blank" rel="noopener">Chenglong Chen</a>, the winner of <a href="https://www.kaggle.com/c/crowdflower-search-relevance" target="_blank" rel="noopener">Crowdflower Search Results Relevance</a>, once released his pipeline on <a href="https://github.com/ChenglongChen/Kaggle_CrowdFlower" target="_blank" rel="noopener">GitHub</a>. It’s very complete and efficient. Yet it’s very hard to understand and extract all his logic to build a general framework. 
This is something you might want to do when you have plenty of time.</p><h2 id="Home-Depot-Search-Relevance"><a href="#Home-Depot-Search-Relevance" class="headerlink" title="Home Depot Search Relevance"></a>Home Depot Search Relevance</h2><p>In this section I will share my solution to the <a href="https://www.kaggle.com/c/home-depot-product-search-relevance" target="_blank" rel="noopener">Home Depot Search Relevance Competition</a> and what I learned from top teams after the competition.</p><p>The task in this competition is to predict how relevant a result is for a search term on the Home Depot website. The relevance is an average score from three human evaluators and ranges from 1 to 3. Therefore it’s a regression task. The dataset contains search terms, product titles / descriptions and some attributes like brand, size and color. The metric is <a href="https://en.wikipedia.org/wiki/Root-mean-square_deviation" target="_blank" rel="noopener">RMSE</a>.</p><p>This is much like <a href="https://www.kaggle.com/c/crowdflower-search-relevance" target="_blank" rel="noopener">Crowdflower Search Results Relevance</a>. The difference is that <a href="https://en.wikipedia.org/wiki/Cohen%27s_kappa#Weighted_kappa" target="_blank" rel="noopener">Quadratic Weighted Kappa</a> was used in the Crowdflower competition, which complicated the final cutoff of regression scores. Also, no attributes were provided in Crowdflower.</p><h3 id="EDA"><a href="#EDA" class="headerlink" title="EDA"></a>EDA</h3><p>There were several quite good EDAs by the time I joined the competition, especially <a href="https://www.kaggle.com/briantc/home-depot-product-search-relevance/homedepot-first-dataexploreation-k" target="_blank" rel="noopener">this one</a>. I learned that:</p><ul><li>Many search terms / products appeared several times.</li><li>Text similarities are great features.</li><li>Many products don’t have attribute features. 
Would this be a problem?</li><li>Product ID seems to have strong predictive power. However, the overlap of product IDs between the training set and the testing set is not very high. Would this contribute to overfitting?</li></ul><h3 id="Preprocessing"><a href="#Preprocessing" class="headerlink" title="Preprocessing"></a>Preprocessing</h3><p>You can find how I did preprocessing and feature engineering <a href="https://github.com/dnc1994/Kaggle-Playground/blob/master/home-depot/Preprocess.ipynb" target="_blank" rel="noopener">on GitHub</a>. I’ll only give a brief summary here:</p><ol><li>Use the <a href="https://www.kaggle.com/steubk/home-depot-product-search-relevance/fixing-typos" target="_blank" rel="noopener">typo dictionary</a> posted in the forum to correct typos in search terms.</li><li>Count attributes. Find the frequent and easily exploited ones.</li><li>Join the training set with the testing set. This is important because otherwise you’ll have to do feature transformation twice.</li><li>Do <strong><a href="https://en.wikipedia.org/wiki/Stemming" target="_blank" rel="noopener">stemming</a></strong> and <strong><a href="https://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation" target="_blank" rel="noopener">tokenizing</a></strong> for all the text fields. 
Some <strong>normalization</strong> (with digits and units) and <strong>synonym substitutions</strong> are performed manually.</li></ol><h3 id="Feature"><a href="#Feature" class="headerlink" title="Feature"></a>Feature</h3><ul><li>*Attribute Features<ul><li>Whether the product contains a certain attribute (brand, size, color, weight, indoor/outdoor, energy star certified …)</li><li>Whether a certain attribute matches the search term</li></ul></li></ul><ul><li><p>Meta Features</p><ul><li>Length of each text field</li><li>Whether the product contains attribute fields</li><li>Brand (encoded as integers)</li><li>Product ID</li></ul></li><li><p>Matching</p><ul><li>Whether the search term appears in product title / description / attributes</li><li>Count and ratio of the search term’s appearances in product title / description / attributes</li><li>*Whether the i-th word of the search term appears in product title / description / attributes</li></ul></li><li><p>Text similarities between the search term and product title / description / attributes</p><ul><li><a href="https://en.wikipedia.org/wiki/Bag-of-words_model" target="_blank" rel="noopener">BOW</a> <a href="https://en.wikipedia.org/wiki/Cosine_similarity" target="_blank" rel="noopener">Cosine Similarity</a></li><li><a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf" target="_blank" rel="noopener">TF-IDF</a> Cosine Similarity</li><li><a href="https://en.wikipedia.org/wiki/Jaccard_index" target="_blank" rel="noopener">Jaccard Similarity</a></li><li>*<a href="https://en.wikipedia.org/wiki/Edit_distance" target="_blank" rel="noopener">Edit Distance</a></li><li><a href="https://en.wikipedia.org/wiki/Word2vec" target="_blank" rel="noopener">Word2Vec</a> Distance (I didn’t include this because of its poor performance and slow calculation. 
Yet it seems that I was using it wrong.)</li></ul></li><li><p><strong><a href="https://en.wikipedia.org/wiki/Latent_semantic_indexing" target="_blank" rel="noopener">Latent Semantic Indexing</a>: By performing <a href="https://en.wikipedia.org/wiki/Singular_value_decomposition" target="_blank" rel="noopener">SVD decomposition</a> on the matrix obtained from BOW/TF-IDF Vectorization, we get a latent representation of different search term / product groups. This enables our model to distinguish between groups and assign different weights to features, therefore solving the issue of dependent data and products lacking some features (to an extent).</strong></p></li></ul><p>Note that the features listed above with <code>*</code> are the last batch of features I added. The problem is that the model trained on data that included these features performed worse than the previous ones. At first I thought that the increase in the number of features would require re-tuning of model parameters. However, after wasting much CPU time on grid search, I still could not beat the old model. I think it might be the issue of <strong>feature correlation</strong> mentioned above. I actually knew a solution that might work, which is to <strong>combine models trained on different versions of features by stacking</strong>. Unfortunately I didn’t have enough time to try it. <strong>As a matter of fact, most top teams regard the ensemble of models trained with different preprocessing and feature engineering pipelines as a key to success</strong>.</p><h3 id="Model"><a href="#Model" class="headerlink" title="Model"></a>Model</h3><p>At first I was using <code>RandomForestRegressor</code> to build my model. Then I tried <strong>Xgboost</strong> and it turned out to be more than twice as fast as Sklearn. From then on, what I did every day was basically running grid search on my workstation while working on features on my laptop.</p><p>The dataset in this competition is not trivial to validate. It’s not i.i.d. 
and many records are dependent. Many times I used better features / parameters only to end up with worse LB scores. As repeatedly stated by many accomplished Kagglers, you have to trust your own CV score in such situations. Therefore I decided to use 10-fold instead of 5-fold cross validation and ignore the LB score in the following attempts.</p><h3 id="Ensemble"><a href="#Ensemble" class="headerlink" title="Ensemble"></a>Ensemble</h3><p>My final model is an ensemble consisting of 4 base models:</p><ul><li><code>RandomForestRegressor</code></li><li><code>ExtraTreesRegressor</code></li><li><code>GradientBoostingRegressor</code></li><li><code>XGBRegressor</code></li></ul><p>The stacker is also an <code>XGBRegressor</code>.</p><p>The problem is that all my base models are highly correlated (with a lowest correlation of 0.9). I thought of including linear regression, SVM regression and <code>XGBRegressor</code> with a linear booster in the ensemble, but these models had RMSE scores that were 0.02 higher (this accounts for a gap of hundreds of places on the leaderboard) than the 4 models I finally used. Therefore I decided not to use more models, although they would have brought much more diversity.</p><p>The good news is that, despite the base models being highly correlated, stacking still bumped up my score a lot. <strong>What’s more, my CV score and LB score were in complete sync after I started stacking.</strong></p><p>During the last two days of the competition, I did one more thing: <strong>use 20 or so different random seeds to generate the ensemble and take a weighted average of them as the final submission</strong>. This is actually a kind of <strong>bagging</strong>. It makes sense in theory because in stacking I used 80% of the data to train base models in each iteration, whereas 100% of the data is used to train the stacker. Therefore it’s less clean. 
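</p><p>The seed-averaging trick can be sketched as follows. Here a tiny <code>run_stacking</code> stub stands in for one full stacking run — in the real pipeline each call would re-split the folds with its seed, retrain all base models and the stacker, and return test-set predictions (a minimal illustration, not the original code):</p>

```python
import random

def run_stacking(seed):
    """Stand-in for one full stacking run re-seeded with `seed`."""
    rng = random.Random(seed)
    true_signal = [1.0, 2.0, 3.0, 2.0, 1.0]  # pretend test-set targets
    # Each run's predictions differ slightly because the fold splits differ.
    return [v + rng.gauss(0, 0.1) for v in true_signal]

# Each seed reshuffles the KFold splits, so a different 80% of the data
# trains the base models every time; averaging the runs is bagging.
seeds = [2016, 2017, 2018, 2019]
runs = [run_stacking(seed) for seed in seeds]

# Column-wise (per test row) unweighted average over all seeded runs.
final_pred = [sum(col) / len(col) for col in zip(*runs)]
```

<p>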
Making multiple runs with different seeds makes sure that <strong>a different 80% of the data is used each time</strong>, thus reducing the risk of information leak. Yet by doing this I only achieved an increase of <code>0.0004</code>, which might just be due to randomness.</p><p>After the competition, I found out that my best single model scored <code>0.46378</code> on the private leaderboard, whereas my best stacking ensemble scored <code>0.45849</code>. That was the difference between the 174th place and the 98th place. In other words, feature engineering and model tuning got me into the top 10%, whereas stacking got me into the top 5%.</p><h3 id="Lessons-Learned"><a href="#Lessons-Learned" class="headerlink" title="Lessons Learned"></a>Lessons Learned</h3><p>There’s much to learn from the solutions shared by top teams:</p><ul><li><p>There’s a pattern in the product title. For example, whether a product is accompanied by a certain accessory will be indicated by <code>With/Without XXX</code> at the end of the title.</p></li><li><p>Use external data. For example, use <a href="https://wordnet.princeton.edu/" target="_blank" rel="noopener">WordNet</a> or the <a href="https://www.kaggle.com/reddit/reddit-comments-may-2015" target="_blank" rel="noopener">Reddit Comments Dataset</a> to train synonyms and <a href="https://en.wikipedia.org/wiki/Hyponymy_and_hypernymy" target="_blank" rel="noopener">hypernyms</a>.</p></li><li><p>Some features are based on <strong>letters</strong> instead of <strong>words</strong>. At first I was rather confused by this, but it makes perfect sense if you think about it. For example, the team that won 3rd place took the number of letters matched into consideration when computing text similarity. They argued that <strong>longer words are more specific and thus more likely to be assigned high relevance scores by humans</strong>. 
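</p><p>The letter-counting idea can be sketched as a word-match ratio weighted by word length, so that longer (more specific) matched words dominate the score (the <code>letter_match_ratio</code> helper is a hypothetical illustration, not the 3rd-place team’s actual code):</p>

```python
def letter_match_ratio(query: str, title: str) -> float:
    """Fraction of the query's letters covered by words that also appear
    in the title; a matched word contributes its full length, so longer
    matched words weigh more than short ones."""
    title_words = set(title.lower().split())
    query_words = query.lower().split()
    total = sum(len(w) for w in query_words)
    matched = sum(len(w) for w in query_words if w in title_words)
    return matched / total if total else 0.0

letter_match_ratio("cordless drill", "dewalt 20v cordless drill kit")  # → 1.0
letter_match_ratio("drill bit", "cordless drill")                      # → 0.625
```

<p>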
They also used char-by-char sequence comparison (<code>difflib.SequenceMatcher</code>) to measure <strong>visual similarity</strong>, which they claimed to be important for humans.</p></li><li><p>POS-tag words, find the <strong><a href="https://en.wikipedia.org/wiki/Head_(linguistics)" target="_blank" rel="noopener">head</a></strong> of each phrase and use it when computing various distance metrics.</p></li><li><p>Extract top-ranking trigrams from the TF-IDF of the product title / description fields and compute the ratio of words from the search terms that appear in these trigrams. Vice versa. This is like computing latent indexes from another point of view.</p></li><li><p>Some novel distance metrics like <a href="http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf" target="_blank" rel="noopener">Word Mover’s Distance</a>.</p></li><li><p>Apart from SVD, some used <a href="https://en.wikipedia.org/wiki/Non-negative_matrix_factorization" target="_blank" rel="noopener">NMF</a>.</p></li><li><p>Generate <strong>pairwise polynomial interactions</strong> between top-ranking features.</p></li><li><p><strong>For CV, construct splits in which product IDs do not overlap between the training set and the testing set, and splits in which they do. Then we can use these with the corresponding ratios to approximate the impact of the public/private LB split in our local CV.</strong></p></li></ul><h2 id="Summary"><a href="#Summary" class="headerlink" title="Summary"></a>Summary</h2><h3 id="Takeaways"><a href="#Takeaways" class="headerlink" title="Takeaways"></a>Takeaways</h3><ol><li>It was a good call to <strong>start doing ensembles early in the competition</strong>. 
As it turned out, I was still playing with features during the very last days.</li><li>It’s of high priority that I build a pipeline capable of automatic model training and recording the best parameters.</li><li><strong>Features matter the most!</strong> I didn’t spend enough time on features in this competition.</li><li>If possible, spend some time manually inspecting the raw data for patterns.</li></ol><h3 id="Issues-Raised"><a href="#Issues-Raised" class="headerlink" title="Issues Raised"></a>Issues Raised</h3><p>Several issues I encountered in this competition are of high research value.</p><ol><li>How to do reliable CV with dependent data.</li><li>How to quantify <strong>the trade-off between diversity and accuracy</strong> in ensemble learning.</li><li>How to deal with feature interaction which harms the model’s performance. And <strong>how to determine whether new features are effective in such situations</strong>.</li></ol><h3 id="Beginner-Tips"><a href="#Beginner-Tips" class="headerlink" title="Beginner Tips"></a>Beginner Tips</h3><ol><li>Choose a competition you’re interested in. <strong>It would be better if you already have some insights about the problem domain.</strong></li><li>Following my approach or somebody else’s, start exploring, understanding and modeling the data.</li><li>Learn from the forum and scripts. See how others interpret data and construct features.</li><li><strong>Find winner interviews / blog posts from previous competitions. They’re extremely helpful, especially those from competitions that share similarities with the one you’re working on.</strong></li><li>Start doing ensembles after you have reached a pretty good score (e.g. top 10% ~ 20%) or you feel that there isn’t much room for new features (which, sadly, always turns out to be false).</li><li>If you think you may have a chance to win a prize, try teaming up!</li><li><strong>Don’t give up until the end of the competition. 
At least try something new every day.</strong></li><li>Learn from the sharings of top teams after the competition. Reflect on your approaches. <strong>If possible, spend some time verifying what you learn.</strong></li><li>Get some rest!</li></ol><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><ol><li><a href="https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0ahUKEwiPxZHewLbMAhVKv5QKHb3PCGwQFggcMAA&url=http%3A%2F%2Fwww.ke.tu-darmstadt.de%2Flehre%2Farbeiten%2Fstudien%2F2015%2FDong_Ying.pdf&usg=AFQjCNE9o2BcEkqdnu_-lQ3EFD3eRAFWiw&sig2=oiU8TCEH57EYF9v9l6Scrw&bvm=bv.121070826,d.dGo" target="_blank" rel="noopener">Beating Kaggle the Easy Way - Dong Ying</a></li><li><a href="https://github.com/ChenglongChen/Kaggle_CrowdFlower/blob/master/BlogPost/BlogPost.md" target="_blank" rel="noopener">Search Results Relevance Winner’s Interview: 1st place, Chenglong Chen</a></li><li><a href="http://rstudio-pubs-static.s3.amazonaws.com/158725_5d2f977f4004490e9b095c0ef9357c6b.html" target="_blank" rel="noopener">(Chinese) Solution for Prudential Life Insurance Assessment - Nutastray</a></li></ol>]]></content>
<summary type="html">
<h2 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h2><p><a href="https://www.kaggle.com/" target="_blank" rel="noopener">Kaggle</a> is the best place to learn from other data scientists. Many companies provide data and prize money to set up data science competitions on Kaggle. Recently I had my first shot on Kaggle and <strong>ranked 98th (~ 5%) among 2125 teams</strong>. Being my Kaggle debut, I feel quite satisfied with the result. Since many Kaggle beginners set 10% as their first goal, I want to share my two cents on how to achieve that.</p>
<p><em>This post is also available in <a href="https://dnc1994.com/2016/04/rank-10-percent-in-first-kaggle-competition/">Chinese</a>.</em></p>
<p><strong>Updated on Oct 28th, 2016: </strong> I made many wording changes and added several updates to this post. Note that Kaggle has gone through some major changes since I published this post, especially with its ranking system. Therefore some descriptions here might not apply anymore.</p>
</summary>
<category term="Data Science" scheme="http://dnc1994.com/categories/Data-Science/"/>
</entry>
<entry>
<title>如何在 Kaggle 首战中进入前 10%</title>
<link href="http://dnc1994.com/2016/04/rank-10-percent-in-first-kaggle-competition/"/>
<id>http://dnc1994.com/2016/04/rank-10-percent-in-first-kaggle-competition/</id>
<published>2016-04-30T06:06:51.000Z</published>
<updated>2019-01-19T04:05:10.934Z</updated>
<content type="html"><![CDATA[<h2 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h2><p><strong>本文采用<a href="https://creativecommons.org/licenses/by-nc-nd/3.0/cn/" target="_blank" rel="noopener">署名 - 非商业性使用 - 禁止演绎 3.0 中国大陆许可协议</a>进行许可。著作权由章凌豪所有。</strong></p><p><a href="https://www.kaggle.com/" target="_blank" rel="noopener">Kaggle</a> 是目前最大的 Data Scientist 聚集地。很多公司会拿出自家的数据并提供奖金,在 Kaggle 上组织数据竞赛。我最近完成了第一次比赛,<strong>在 2125 个参赛队伍中排名第 98 位(~ 5%)</strong>。因为是第一次参赛,所以对这个成绩我已经很满意了。在 Kaggle 上一次比赛的结果除了排名以外,还会显示的就是 Prize Winner,10% 或是 25% 这三档。所以刚刚接触 Kaggle 的人很多都会以 25% 或是 10% 为目标。在本文中,我试图根据自己第一次比赛的经验和从其他 Kaggler 那里学到的知识,为刚刚听说 Kaggle 想要参赛的新手提供一些切实可行的冲刺 10% 的指导。</p><p><em>本文的英文版见<a href="https://dnc1994.com/2016/05/rank-10-percent-in-first-kaggle-competition-en/">这里</a>。</em></p><a id="more"></a><p><img src="kaggle-guide-profile.png" alt="Kaggle Profile"></p><p>Kaggler 绝大多数都是用 Python 和 R 这两门语言的。因为我主要使用 Python,所以本文提到的例子都会根据 Python 来。不过 R 的用户应该也能不费力地了解到工具背后的思想。</p><p>首先简单介绍一些关于 Kaggle 比赛的知识:</p><ul><li>不同比赛有不同的任务,分类、回归、推荐、排序等。比赛开始后训练集和测试集就会开放下载。</li><li>比赛通常持续 2 ~ 3 个月,每个队伍每天可以提交的次数有限,通常为 5 次。</li><li>比赛结束前一周是一个 Deadline,在这之后不能再组队,也不能再新加入比赛。所以<strong>想要参加比赛请务必在这一 Deadline 之前有过至少一次有效的提交</strong>。</li><li>一般情况下在提交后会立刻得到得分的反馈。不同比赛会采取不同的评分基准,可以在分数栏最上方看到使用的评分方法。</li><li>反馈的分数是基于测试集的一部分计算的,剩下的另一部分会被用于计算最终的结果。所以最后排名会变动。</li><li><strong>LB</strong> 指的就是在 Leaderboard 得到的分数,由上,有 <strong>Public LB</strong> 和 <strong>Private LB</strong> 之分。</li><li>自己做的 Cross Validation 得到的分数一般称为 <strong>CV</strong> 或是 <strong>Local CV</strong>。一般来说 <strong>CV</strong> 的结果比 <strong>LB</strong> 要可靠。</li><li>新手可以从比赛的 <strong>Forum</strong> 和 <strong>Scripts</strong> 中找到许多有用的经验和洞见。不要吝啬提问,Kaggler 都很热情。</li></ul><p>那么就开始吧!</p><p>P.S. 
本文假设读者对 Machine Learning 的基本概念和常见模型已经有一定了解。 Enjoy Reading!</p><h2 id="General-Approach"><a href="#General-Approach" class="headerlink" title="General Approach"></a>General Approach</h2><p>在这一节中我会讲述一次 Kaggle 比赛的大致流程。</p><h3 id="Data-Exploration"><a href="#Data-Exploration" class="headerlink" title="Data Exploration"></a>Data Exploration</h3><p>在这一步要做的基本就是 <strong>EDA (Exploratory Data Analysis)</strong>,也就是对数据进行探索性的分析,从而为之后的处理和建模提供必要的结论。</p><p>通常我们会用 <strong><a href="http://pandas.pydata.org/" target="_blank" rel="noopener">pandas</a></strong> 来载入数据,并做一些简单的可视化来理解数据。</p><h4 id="Visualization"><a href="#Visualization" class="headerlink" title="Visualization"></a>Visualization</h4><p>通常来说 <strong><a href="http://matplotlib.org/" target="_blank" rel="noopener">matplotlib</a></strong> 和 <strong><a href="https://stanford.edu/~mwaskom/software/seaborn/" target="_blank" rel="noopener">seaborn</a></strong> 提供的绘图功能就可以满足需求了。</p><p>比较常用的图表有:</p><ul><li>查看目标变量的分布。当分布<a href="http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5128907" target="_blank" rel="noopener">不平衡</a>时,根据评分标准和具体模型的使用不同,可能会严重影响性能。</li><li>对 <strong>Numerical Variable</strong>,可以用 <strong>Box Plot</strong> 来直观地查看它的分布。</li><li>对于坐标类数据,可以用 <strong>Scatter Plot</strong> 来查看它们的分布趋势和是否有离群点的存在。</li><li>对于分类问题,将数据根据 Label 的不同着不同的颜色绘制出来,这对 Feature 的构造很有帮助。</li><li>绘制变量之间两两的分布和相关度图表。</li></ul><p><strong><a href="https://www.kaggle.com/benhamner/d/uciml/iris/python-data-visualizations" target="_blank" rel="noopener">这里</a>有一个在著名的 Iris 数据集上做了一系列可视化的例子,非常有启发性。</strong></p><h4 id="Statistical-Tests"><a href="#Statistical-Tests" class="headerlink" title="Statistical Tests"></a>Statistical Tests</h4><p>我们可以对数据进行一些统计上的测试来验证一些假设的显著性。虽然大部分情况下靠可视化就能得到比较明确的结论,但有一些定量结果总是更理想的。不过,在实际数据中经常会遇到非 i.i.d. 
的分布。所以要注意测试类型的选择和对显著性的解释。</p><p>在某些比赛中,由于数据分布比较奇葩或是噪声过强,<strong>Public LB</strong> 的分数可能会跟 <strong>Local CV</strong> 的结果相去甚远。可以根据一些统计测试的结果来粗略地建立一个阈值,用来衡量一次分数的提高究竟是实质的提高还是由于数据的随机性导致的。</p><h3 id="Data-Preprocessing"><a href="#Data-Preprocessing" class="headerlink" title="Data Preprocessing"></a>Data Preprocessing</h3><p>大部分情况下,在构造 Feature 之前,我们需要对比赛提供的数据集进行一些处理。通常的步骤有:</p><ul><li>有时数据会分散在几个不同的文件中,需要 Join 起来。</li><li>处理 <strong><a href="https://en.wikipedia.org/wiki/Missing_data" target="_blank" rel="noopener">Missing Data</a></strong>。</li><li>处理 <strong><a href="https://en.wikipedia.org/wiki/Outlier" target="_blank" rel="noopener">Outlier</a></strong>。</li><li>必要时转换某些 <strong><a href="https://en.wikipedia.org/wiki/Categorical_variable" target="_blank" rel="noopener">Categorical Variable</a></strong> 的表示方式。</li><li>有些 Float 变量可能是从未知的 Int 变量转换得到的,这个过程中发生精度损失会在数据中产生不必要的 <strong>Noise</strong>,即两个数值原本是相同的却在小数点后某一位开始有不同。这对 Model 可能会产生很负面的影响,需要设法去除或者减弱 Noise。</li></ul><p>这一部分的处理策略多半依赖于在前一步中探索数据集所得到的结论以及创建的可视化图表。在实践中,我建议使用 <strong><a href="http://ipython.org/notebook.html" target="_blank" rel="noopener">iPython Notebook</a></strong> 对数据进行操作,并熟练掌握常用的 pandas 函数。这样做的好处是可以随时得到结果的反馈和进行修改,也方便跟其他人进行交流(在 Data Science 中 <a href="https://en.wikipedia.org/wiki/Reproducibility" target="_blank" rel="noopener">Reproducible Results</a> 是很重要的)。</p><p>下面给两个例子。</p><h4 id="Outlier"><a href="#Outlier" class="headerlink" title="Outlier"></a>Outlier</h4><p><img src="kaggle-guide-outlier-example.png" alt="Outlier Example"></p><p>这是经过 Scaling 的坐标数据。可以发现右上角存在一些离群点,去除以后分布比较正常。</p><h4 id="Dummy-Variables"><a href="#Dummy-Variables" class="headerlink" title="Dummy Variables"></a>Dummy Variables</h4><p>对于 Categorical Variable,常用的做法就是 <a href="https://en.wikipedia.org/wiki/One-hot" target="_blank" rel="noopener">One-hot encoding</a>。即对这一变量创建一组新的伪变量,对应其所有可能的取值。这些变量中只有这条数据对应的取值为 1,其他都为 0。</p><p>如下,将原本有 7 种可能取值的 <code>Weekdays</code> 变量转换成 7 个 Dummy Variables。</p><p><img 
src="kaggle-guide-dummies-example.png" alt="Dummies Example"></p><p>要注意,当变量可能取值的范围很大(比如一共有成百上千类)时,这种简单的方法就不太适用了。这时没有一个普适的方法,但我会在下一小节描述其中一种。</p><h3 id="Feature-Engineering"><a href="#Feature-Engineering" class="headerlink" title="Feature Engineering"></a>Feature Engineering</h3><p>有人总结 Kaggle 比赛是 <strong>“Feature 为主,调参和 Ensemble 为辅”</strong>,我觉得很有道理。Feature Engineering 能做到什么程度,取决于对数据领域的了解程度。比如在数据包含大量文本的比赛中,常用的 NLP 特征就是必须的。怎么构造有用的 Feature,是一个不断学习和提高的过程。</p><p>一般来说,<strong>当一个变量从直觉上来说对所要完成的目标有帮助,就可以将其作为 Feature</strong>。至于它是否有效,最简单的方式就是通过图表来直观感受。比如:</p><p><img src="kaggle-visualize-feature-correlation.png" alt="Checking Feature Validity"></p><h4 id="Feature-Selection"><a href="#Feature-Selection" class="headerlink" title="Feature Selection"></a>Feature Selection</h4><p>总的来说,我们应该<strong>生成尽量多的 Feature,相信 Model 能够挑出最有用的 Feature</strong>。但有时先做一遍 Feature Selection 也能带来一些好处:</p><ul><li>Feature 越少,训练越快。</li><li>有些 Feature 之间可能存在线性关系,影响 Model 的性能。</li><li><strong>通过挑选出最重要的 Feature,可以将它们之间进行各种运算和操作的结果作为新的 Feature,可能带来意外的提高。</strong></li></ul><p>Feature Selection 最实用的方法也就是看 Random Forest 训练完以后得到的 <strong>Feature Importance</strong> 了。其他有一些更复杂的算法在理论上更加 Robust,但是缺乏实用高效的实现,比如<a href="http://jmlr.org/papers/volume10/tuv09a/tuv09a.pdf" target="_blank" rel="noopener">这个</a>。从原理上来讲,增加 Random Forest 中树的数量可以在一定程度上加强其对于 Noisy Data 的 Robustness。</p><p>看 Feature Importance 对于某些数据经过<strong><a href="https://en.wikipedia.org/wiki/Data_anonymization" target="_blank" rel="noopener">脱敏</a></strong>处理的比赛尤其重要。这可以免得你浪费大把时间在琢磨一个不重要的变量的意义上。</p><h4 id="Feature-Encoding"><a href="#Feature-Encoding" class="headerlink" title="Feature Encoding"></a>Feature Encoding</h4><p>这里用一个例子来说明在一些情况下 Raw Feature 可能需要经过一些转换才能起到比较好的效果。</p><p>假设有一个 Categorical Variable 一共有几万个可能取值,那么创建 Dummy Variables 的方法就不可行了。这时一个比较好的方法是根据 Feature Importance 或是这些取值本身在数据中的出现频率,为最重要(比如说前 95% 的 Importance)那些取值(有很大可能只有几个或是十几个)创建 Dummy Variables,而所有其他取值都归到一个“其他”类里面。</p><h3 id="Model-Selection"><a href="#Model-Selection" class="headerlink" 
title="Model Selection"></a>Model Selection</h3><p>准备好 Feature 以后,就可以开始选用一些常见的模型进行训练了。Kaggle 上最常用的模型基本都是基于树的模型:</p><ul><li><strong>Gradient Boosting</strong></li><li>Random Forest</li><li>Extra Randomized Trees</li></ul><p>以下模型往往在性能上稍逊一筹,但是很适合作为 Ensemble 的 Base Model。这一点之后再详细解释。(当然,在跟图像有关的比赛中神经网络的重要性还是不能小觑的。)</p><ul><li>SVM</li><li>Linear Regression</li><li>Logistic Regression</li><li>Neural Networks</li></ul><p>以上这些模型基本都可以通过 <strong><a href="http://scikit-learn.org/" target="_blank" rel="noopener">sklearn</a></strong> 来使用。</p><p>当然,这里不能不提一下 <strong><a href="https://github.com/dmlc/xgboost" target="_blank" rel="noopener">Xgboost</a></strong>。<strong>Gradient Boosting</strong> 本身优秀的性能加上 <strong>Xgboost</strong> 高效的实现,使得它在 Kaggle 上广为使用。几乎每场比赛的获奖者都会用 <strong>Xgboost</strong> 作为最终 Model 的重要组成部分。在实战中,我们往往会以 Xgboost 为主来建立我们的模型并且验证 Feature 的有效性。顺带一提,<strong>在 Windows 上安装 </strong>Xgboost<strong> 很容易遇到问题,目前已知最简单、成功率最高的方案可以参考我在<a href="https://dnc1994.com/2016/03/installing-xgboost-on-windows/">这篇帖子</a>中的描述</strong>。</p><h4 id="Model-Training"><a href="#Model-Training" class="headerlink" title="Model Training"></a>Model Training</h4><p>在训练时,我们主要希望通过调整参数来得到一个性能不错的模型。一个模型往往有很多参数,但其中比较重要的一般不会太多。比如对 <strong>sklearn</strong> 的 <code>RandomForestClassifier</code> 来说,比较重要的就是随机森林中树的数量 <code>n_estimators</code> 以及在训练每棵树时最多选择的特征数量 <code>max_features</code>。所以<strong>我们需要对自己使用的模型有足够的了解,知道每个参数对性能的影响是怎样的</strong>。</p><p>通常我们会通过一个叫做 <a href="http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html)" target="_blank" rel="noopener">Grid Search</a> 的过程来确定一组最佳的参数。其实这个过程说白了就是根据给定的参数候选对所有的组合进行暴力搜索。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">param_grid = {<span class="string">'n_estimators'</span>: [<span class="number">300</span>, <span class="number">500</span>], <span 
class="string">'max_features'</span>: [<span class="number">10</span>, <span class="number">12</span>, <span class="number">14</span>]}</span><br><span class="line">model = grid_search.GridSearchCV(estimator=rfr, param_grid=param_grid, n_jobs=<span class="number">1</span>, cv=<span class="number">10</span>, verbose=<span class="number">20</span>, scoring=RMSE)</span><br><span class="line">model.fit(X_train, y_train)</span><br></pre></td></tr></table></figure><p>顺带一提,Random Forest 一般在 <code>max_features</code> 设为 Feature 数量的平方根附近得到最佳结果。</p><p>这里要重点讲一下 Xgboost 的调参。通常认为对它性能影响较大的参数有:</p><ul><li><code>eta</code>:每次迭代完成后更新权重时的步长。越小训练越慢。</li><li><code>num_round</code>:总共迭代的次数。</li><li><code>subsample</code>:训练每棵树时用来训练的数据占全部的比例。用于防止 Overfitting。</li><li><code>colsample_bytree</code>:训练每棵树时用来训练的特征的比例,类似 <code>RandomForestClassifier</code> 的 <code>max_features</code>。</li><li><code>max_depth</code>:每棵树的最大深度限制。与 Random Forest 不同,<strong>Gradient Boosting 如果不对深度加以限制,最终是会 Overfit 的</strong>。</li><li><code>early_stopping_rounds</code>:用于控制在 Out Of Sample 的验证集上连续多少个迭代的分数都没有提高后就提前终止训练。用于防止 Overfitting。</li></ul><p>一般的调参步骤是:</p><ol><li>将训练数据的一部分划出来作为验证集。</li><li>先将 <code>eta</code> 设得比较高(比如 0.1),<code>num_round</code> 设为 300 ~ 500。</li><li>用 Grid Search 对其他参数进行搜索</li><li>逐步将 <code>eta</code> 降低,找到最佳值。</li><li>以验证集为 watchlist,用找到的最佳参数组合重新在训练集上训练。注意观察算法的输出,看每次迭代后在验证集上分数的变化情况,从而得到最佳的 <code>early_stopping_rounds</code>。</li></ol><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span 
class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line">X_dtrain, X_deval, y_dtrain, y_deval = cross_validation.train_test_split(X_train, y_train, random_state=<span class="number">1026</span>, test_size=<span class="number">0.3</span>)</span><br><span class="line">dtrain = xgb.DMatrix(X_dtrain, y_dtrain)</span><br><span class="line">deval = xgb.DMatrix(X_deval, y_deval)</span><br><span class="line">watchlist = [(deval, <span class="string">'eval'</span>)]</span><br><span class="line">params = {</span><br><span class="line"> <span class="string">'booster'</span>: <span class="string">'gbtree'</span>,</span><br><span class="line"> <span class="string">'objective'</span>: <span class="string">'reg:linear'</span>,</span><br><span class="line"> <span class="string">'subsample'</span>: <span class="number">0.8</span>,</span><br><span class="line"> <span class="string">'colsample_bytree'</span>: <span class="number">0.85</span>,</span><br><span class="line"> <span class="string">'eta'</span>: <span class="number">0.05</span>,</span><br><span class="line"> <span class="string">'max_depth'</span>: <span class="number">7</span>,</span><br><span class="line"> <span class="string">'seed'</span>: <span class="number">2016</span>,</span><br><span class="line"> <span class="string">'silent'</span>: <span class="number">0</span>,</span><br><span class="line"> <span class="string">'eval_metric'</span>: <span class="string">'rmse'</span></span><br><span class="line">}</span><br><span class="line">clf = xgb.train(params, dtrain, <span class="number">500</span>, watchlist, early_stopping_rounds=<span class="number">50</span>)</span><br><span class="line">pred = clf.predict(xgb.DMatrix(df_test))</span><br></pre></td></tr></table></figure><p>最后要提一点,所有具有随机性的 Model 一般都会有一个 <code>seed</code> 或是 <code>random_state</code> 参数用于控制随机种子。得到一个好的 Model 后,在记录参数时务必也记录下这个值,从而能够在之后重现 Model。</p><h4 id="Cross-Validation"><a href="#Cross-Validation" 
class="headerlink" title="Cross Validation"></a>Cross Validation</h4><p><a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)" target="_blank" rel="noopener">Cross Validation</a> 是非常重要的一个环节。它让你知道你的 Model 有没有 Overfit,是不是真的能够 Generalize 到测试集上。在很多比赛中 <strong>Public LB</strong> 都会因为这样那样的原因而不可靠。当你改进了 Feature 或是 Model 得到了一个更高的 <strong>CV</strong> 结果,提交之后得到的 <strong>LB</strong> 结果却变差了,<strong>一般认为这时应该相信 CV 的结果</strong>。当然,最理想的情况是多种不同的 <strong>CV</strong> 方法得到的结果和 <strong>LB</strong> 同时提高,但这样的比赛并不是太多。</p><p>在数据的分布比较随机均衡的情况下,<strong>5-Fold CV</strong> 一般就足够了。如果不放心,可以提到 <strong>10-Fold</strong>。<strong>但是 Fold 越多训练也就会越慢,需要根据实际情况进行取舍。</strong></p><p>很多时候简单的 <strong>CV</strong> 得到的分数会不大靠谱,Kaggle 上也有很多关于如何做 <strong>CV</strong> 的讨论。比如<a href="https://www.kaggle.com/c/telstra-recruiting-network/forums/t/19277/what-is-your-cross-validation-method" target="_blank" rel="noopener">这个</a>。但总的来说,靠谱的 <strong>CV</strong> 方法是 Case By Case 的,需要在实际比赛中进行尝试和学习,这里就不再(也不能)叙述了。</p><h3 id="Ensemble-Generation"><a href="#Ensemble-Generation" class="headerlink" title="Ensemble Generation"></a>Ensemble Generation</h3><p><a href="https://en.wikipedia.org/wiki/Ensemble_learning" target="_blank" rel="noopener">Ensemble Learning</a> 是指将多个不同的 Base Model 组合成一个 Ensemble Model 的方法。它可以<strong>同时降低最终模型的 Bias 和 Variance</strong>(证明可以参考<a href="http://link.springer.com/chapter/10.1007%2F3-540-33019-4_19" target="_blank" rel="noopener">这篇论文</a>,我最近在研究类似的理论,可能之后会写新文章详述),<strong>从而在提高分数的同时又降低 Overfitting 的风险</strong>。在现在的 Kaggle 比赛中要不用 Ensemble 就拿到奖金几乎是不可能的。</p><p>常见的 Ensemble 方法有这么几种:</p><ul><li>Bagging:使用训练数据的不同随机子集来训练每个 Base Model,最后进行每个 Base Model 权重相同的 Vote。也即 Random Forest 的原理。</li><li>Boosting:迭代地训练 Base Model,每次根据上一个迭代中预测错误的情况修改训练样本的权重。也即 Gradient Boosting 的原理。比 Bagging 效果好,但更容易 Overfit。</li><li>Blending:用不相交的数据训练不同的 Base Model,将它们的输出取(加权)平均。实现简单,但对训练数据利用少了。</li><li>Stacking:接下来会详细介绍。</li></ul><p>从理论上讲,Ensemble 要成功,有两个要素:</p><ul><li><strong>Base Model 之间的相关性要尽可能的小。</strong>这就是为什么非 Tree-based 
Model 往往表现不是最好但还是要将它们包括在 Ensemble 里面的原因。Ensemble 的 Diversity 越大,最终 Model 的 Bias 就越低。</li><li><strong>Base Model 之间的性能表现不能差距太大。</strong>这其实是一个 <strong>Trade-off</strong>,在实际中很有可能表现相近的 Model 只有寥寥几个而且它们之间相关性还不低。但是实践告诉我们即使在这种情况下 Ensemble 还是能大幅提高成绩。</li></ul><h4 id="Stacking"><a href="#Stacking" class="headerlink" title="Stacking"></a>Stacking</h4><p>相比 Blending,Stacking 能更好地利用训练数据。以 5-Fold Stacking 为例,它的基本原理如图所示:</p><p><img src="kaggle-guide-stacking-diagram.jpg" alt="Stacking"></p><p>整个过程很像 Cross Validation。首先将训练数据分为 5 份,接下来一共 5 个迭代,每次迭代时,将 4 份数据作为 Training Set 对每个 Base Model 进行训练,然后在剩下一份 Hold-out Set 上进行预测。<strong>同时也要将其在测试数据上的预测保存下来。</strong>这样,每个 Base Model 在每次迭代时会对训练数据的其中 1 份做出预测,对测试数据的全部做出预测。5 个迭代都完成以后我们就获得了一个 <code>#训练数据行数 x #Base Model 数量</code> 的矩阵,这个矩阵接下来就作为第二层的 Model 的训练数据。当第二层的 Model 训练完以后,将之前保存的 Base Model 对测试数据的预测(<strong>因为每个 Base Model 被训练了 5 次,对测试数据的全体做了 5 次预测,所以对这 5 次求一个平均值,从而得到一个形状与第二层训练数据相同的矩阵</strong>)拿出来让它进行预测,就得到最后的输出。</p><p>这里给出我的实现代码:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span 
class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">Ensemble</span><span class="params">(object)</span>:</span></span><br><span class="line"> <span class="function"><span class="keyword">def</span> <span class="title">__init__</span><span class="params">(self, n_folds, stacker, base_models)</span>:</span></span><br><span class="line"> self.n_folds = n_folds</span><br><span class="line"> self.stacker = stacker</span><br><span class="line"> self.base_models = base_models</span><br><span class="line"></span><br><span class="line"> <span class="function"><span class="keyword">def</span> <span class="title">fit_predict</span><span class="params">(self, X, y, T)</span>:</span></span><br><span class="line"> X = np.array(X)</span><br><span class="line"> y = np.array(y)</span><br><span class="line"> T = np.array(T)</span><br><span class="line"></span><br><span class="line"> folds = list(KFold(len(y), n_folds=self.n_folds, shuffle=<span class="keyword">True</span>, random_state=<span class="number">2016</span>))</span><br><span class="line"></span><br><span class="line"> S_train = np.zeros((X.shape[<span class="number">0</span>], len(self.base_models)))</span><br><span class="line"> S_test = np.zeros((T.shape[<span class="number">0</span>], len(self.base_models)))</span><br><span class="line"></span><br><span class="line"> <span class="keyword">for</span> i, clf <span class="keyword">in</span> enumerate(self.base_models):</span><br><span class="line"> S_test_i = np.zeros((T.shape[<span class="number">0</span>], len(folds)))</span><br><span class="line"></span><br><span class="line"> <span class="keyword">for</span> j, (train_idx, test_idx) <span class="keyword">in</span> enumerate(folds):</span><br><span class="line"> X_train = X[train_idx]</span><br><span class="line"> y_train = 
y[train_idx]</span><br><span class="line"> X_holdout = X[test_idx]</span><br><span class="line"> <span class="comment"># y_holdout = y[test_idx]</span></span><br><span class="line"> clf.fit(X_train, y_train)</span><br><span class="line"> y_pred = clf.predict(X_holdout)[:]</span><br><span class="line"> S_train[test_idx, i] = y_pred</span><br><span class="line"> S_test_i[:, j] = clf.predict(T)[:]</span><br><span class="line"></span><br><span class="line"> S_test[:, i] = S_test_i.mean(<span class="number">1</span>)</span><br><span class="line"></span><br><span class="line"> self.stacker.fit(S_train, y)</span><br><span class="line"> y_pred = self.stacker.predict(S_test)[:]</span><br><span class="line"> <span class="keyword">return</span> y_pred</span><br></pre></td></tr></table></figure><p>获奖选手往往会使用比这复杂得多的 Ensemble,会出现三层、四层甚至五层,不同的层数之间有各种交互,还有将经过不同的 Preprocessing 和不同的 Feature Engineering 的数据用 Ensemble 组合起来的做法。但对于新手来说,稳稳当当地实现一个正确的 5-Fold Stacking 已经足够了。</p><h3 id="Pipeline"><a href="#Pipeline" class="headerlink" title="*Pipeline"></a>*Pipeline</h3><p>可以看出 Kaggle 比赛的 Workflow 还是比较复杂的。尤其是 Model Selection 和 Ensemble。理想情况下,我们需要搭建一个高自动化的 Pipeline,它可以做到:</p><ul><li><strong>模块化 Feature Transform</strong>,只需写很少的代码就能将新的 Feature 更新到训练集中。</li><li><strong>自动化 Grid Search</strong>,只要预先设定好使用的 Model 和参数的候选,就能自动搜索并记录最佳的 Model。</li><li><strong>自动化 Ensemble Generation</strong>,每隔一段时间将现有最好的 K 个 Model 拿来做 Ensemble。</li></ul><p>对新手来说,第一点可能意义还不是太大,因为 Feature 的数量总是人脑管理得过来的;第三点问题也不大,因为往往就是在最后做几次 Ensemble。但是第二点还是很有意义的,手工记录每个 Model 的表现不仅浪费时间而且容易产生混乱。</p><p><a href="https://www.kaggle.com/c/crowdflower-search-relevance" target="_blank" rel="noopener">Crowdflower Search Results Relevance</a> 的第一名获得者 <a href="https://www.kaggle.com/chenglongchen" target="_blank" rel="noopener">Chenglong Chen</a> 将他在比赛中使用的 Pipeline <a href="https://github.com/ChenglongChen/Kaggle_CrowdFlower" target="_blank" rel="noopener">公开了</a>,非常具有参考和借鉴意义。只不过看懂他的代码并将其中的逻辑抽离出来搭建这样一个框架,还是比较困难的一件事。可能在参加过几次比赛以后专门抽时间出来做会比较好。</p><h2
id="Home-Depot-Search-Relevance"><a href="#Home-Depot-Search-Relevance" class="headerlink" title="Home Depot Search Relevance"></a>Home Depot Search Relevance</h2><p>在这一节中我会具体分享我在 <a href="https://www.kaggle.com/c/home-depot-product-search-relevance" target="_blank" rel="noopener">Home Depot Search Relevance</a> 比赛中是怎么做的,以及比赛结束后从排名靠前的队伍那边学到的做法。</p><p>首先简单介绍这个比赛。Task 是<strong>判断用户搜索的关键词和网站返回的结果之间的相关度有多高</strong>。相关度是由 3 个人类打分取平均得到的,每个人可能打 1 ~ 3 分,所以这是一个回归问题。数据中包含用户的搜索词,返回的产品的标题和介绍,以及产品相关的一些属性比如品牌、尺寸、颜色等。使用的评分基准是 <a href="https://en.wikipedia.org/wiki/Root-mean-square_deviation" target="_blank" rel="noopener">RMSE</a>。</p><p>这个比赛非常像 <a href="https://www.kaggle.com/c/crowdflower-search-relevance" target="_blank" rel="noopener">Crowdflower Search Results Relevance</a> 那场比赛。不过那边用的评分基准是 <a href="https://en.wikipedia.org/wiki/Cohen%27s_kappa#Weighted_kappa" target="_blank" rel="noopener">Quadratic Weighted Kappa</a>,把 1 误判成 4 的惩罚会比把 1 判成 2 的惩罚大得多,所以在最后 Decode Prediction 的时候会更麻烦一点。除此以外那次比赛没有提供产品的属性。</p><h3 id="EDA"><a href="#EDA" class="headerlink" title="EDA"></a>EDA</h3><p>由于加入比赛比较晚,当时已经有相当不错的 EDA 了。尤其是<a href="https://www.kaggle.com/briantc/home-depot-product-search-relevance/homedepot-first-dataexploreation-k" target="_blank" rel="noopener">这个</a>。从中我得到的启发有:</p><ul><li>同一个搜索词/产品都出现了多次,<strong>数据分布显然不 i.i.d.</strong>。</li><li>文本之间的相似度很有用。</li><li>产品中有相当大一部分缺失属性,要考虑这会不会使得从属性中得到的 Feature 反而难以利用。</li><li>产品的 ID 对预测相关度很有帮助,但是考虑到训练集和测试集之间的重叠度并不太高,利用它会不会导致 Overfitting?</li></ul><h3 id="Preprocessing"><a href="#Preprocessing" class="headerlink" title="Preprocessing"></a>Preprocessing</h3><p>这次比赛中我的 Preprocessing 和 Feature Engineering 的具体做法都可以在<a href="https://github.com/dnc1994/Kaggle-Playground/blob/master/home-depot/Preprocess.ipynb" target="_blank" rel="noopener">这里</a>看到。我只简单总结一下和指出重要的点。</p><ol><li>利用 Forum 上的 <a href="https://www.kaggle.com/steubk/home-depot-product-search-relevance/fixing-typos" target="_blank" rel="noopener">Typo Dictionary</a> 
修正搜索词中的错误。</li><li>统计属性的出现次数,将其中出现次数多又容易利用的记录下来。</li><li>将训练集和测试集合并,并与产品描述和属性 Join 起来。这是考虑到后面有一系列操作,如果不合并的话就要重复写两次了。</li><li>对所有文本都做 <strong><a href="https://en.wikipedia.org/wiki/Stemming" target="_blank" rel="noopener">Stemming</a></strong> 和 <strong><a href="https://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation" target="_blank" rel="noopener">Tokenizing</a></strong>,同时手工做了一部分<strong>格式统一化(比如涉及到数字和单位的)</strong>和<strong>同义词替换</strong>。</li></ol><h3 id="Feature"><a href="#Feature" class="headerlink" title="Feature"></a>Feature</h3><ul><li><p>*Attribute Features</p><ul><li>是否包含某个特定的属性(品牌、尺寸、颜色、重量、内用/外用、是否有能源之星认证等)</li><li>这个特定的属性是否匹配</li></ul></li><li><p>Meta Features</p><ul><li>各个文本域的长度</li><li>是否包含属性域</li><li>品牌(将所有的品牌做数值离散化)</li><li>产品 ID</li></ul></li><li><p>简单匹配</p><ul><li>搜索词是否在产品标题、产品介绍或是产品属性中出现</li><li>搜索词在产品标题、产品介绍或是产品属性中出现的数量和比例</li><li>*搜索词中的第 i 个词是否在产品标题、产品介绍或是产品属性中出现</li></ul></li><li><p>搜索词和产品标题、产品介绍以及产品属性之间的文本相似度</p><ul><li><a href="https://en.wikipedia.org/wiki/Bag-of-words_model" target="_blank" rel="noopener">BOW</a> <a href="https://en.wikipedia.org/wiki/Cosine_similarity" target="_blank" rel="noopener">Cosine Similarity</a></li><li><a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf" target="_blank" rel="noopener">TF-IDF</a> Cosine Similarity</li><li><a href="https://en.wikipedia.org/wiki/Jaccard_index" target="_blank" rel="noopener">Jaccard Similarity</a></li><li>*<a href="https://en.wikipedia.org/wiki/Edit_distance" target="_blank" rel="noopener">Edit Distance</a></li><li><a href="https://en.wikipedia.org/wiki/Word2vec" target="_blank" rel="noopener">Word2Vec</a> Distance(由于效果不好,最后没有使用,但似乎是因为用得不对)</li></ul></li><li><p><strong><a href="https://en.wikipedia.org/wiki/Latent_semantic_indexing" target="_blank" rel="noopener">Latent Semantic Indexing</a>:通过将 BOW/TF-IDF Vectorization 得到的矩阵进行 <a href="https://en.wikipedia.org/wiki/Singular_value_decomposition" target="_blank" rel="noopener">SVD 分解</a>,我们可以得到不同搜索词/产品组合的 Latent 标识。这个
Feature 使得 Model 能够在一定程度上对不同的组合做出区别,从而解决某些产品缺失某些 Feature 的问题。</strong></p></li></ul><p>值得一提的是,上面打了 <code>*</code> 的 Feature 都是我在最后一批加上去的。问题是,使用这批 Feature 训练得到的 Model 反而比之前的要差,而且还差不少。我一开始是以为因为 Feature 的数量变多了所以一些参数需要重新调优,但在浪费了很多时间做 Grid Search 以后却发现还是没法超过之前的分数。这可能就是之前提到的 Feature 之间的相互作用导致的问题。当时我设想过一个看到过好几次的解决方案,就是<strong>将使用不同版本 Feature 的 Model 通过 Ensemble 组合起来</strong>。但最终因为时间关系没有实现。事实上排名靠前的队伍分享的解法里面基本都提到了将不同的 Preprocessing 和 Feature Engineering 做 Ensemble 是获胜的关键。</p><h3 id="Model"><a href="#Model" class="headerlink" title="Model"></a>Model</h3><p>我一开始用的是 <code>RandomForestRegressor</code>,后来在 Windows 上折腾 <strong>Xgboost</strong> 成功了就开始用 <code>XGBRegressor</code>。<strong>XGB</strong> 的优势非常明显,同样的数据它只需要不到一半的时间就能跑完,节约了很多时间。</p><p>比赛中后期我基本上就是一边在台式机上跑 <strong>Grid Search</strong>,一边在笔记本上继续研究 Feature。</p><p>这次比赛数据分布很不独立,所以期间多次遇到改进的 Feature 或是 <strong>Grid Search</strong> 新得到的参数训练出来的模型反而 <strong>LB</strong> 分数下降了。由于被很多前辈教导过要相信自己的 <strong>CV</strong>,我的决定是将 5-Fold 提到 10-Fold,然后以 <strong>CV</strong> 为标准继续前进。</p><h3 id="Ensemble"><a href="#Ensemble" class="headerlink" title="Ensemble"></a>Ensemble</h3><p>最终我的 Ensemble 的 Base Model 有以下四个:</p><ul><li><code>RandomForestRegressor</code></li><li><code>ExtraTreesRegressor</code></li><li><code>GradientBoostingRegressor</code></li><li><code>XGBRegressor</code></li></ul><p>第二层的 Model 还是用的 <strong>XGB</strong>。</p><p>因为 <strong>Base Model</strong> 之间的相关度都太高了(最低的一对也有 0.9),我原本还想引入使用 <code>gblinear</code> 的 <code>XGBRegressor</code> 以及 <code>SVR</code>,但前者的 RMSE 比其他几个 Model 高了 0.02(这在 <strong>LB</strong> 上有几百名的差距),而后者的训练实在太慢了。最后还是只用了这四个。</p><p><strong>值得一提的是,在开始做 </strong>Stacking<strong> 以后,我的 CV 和 LB 成绩的提高就是完全同步的了。</strong></p><p>在比赛最后两天,因为身心疲惫加上想不到还能有什么显著的改进,我做了一件事情:用 20 个不同的随机种子来生成 Ensemble,最后取 <strong>Weighted Average</strong>。这个其实算是一种变相的 Bagging。<strong>其意义在于按我实现 </strong>Stacking<strong> 的方式,我在训练 Base Model 时只用了 80% 的训练数据,而训练第二层的 Model 时用了 100% 的数据,这在一定程度上增大了 Overfitting 的风险。而每次更改随机种子可以确保每次用的是不同的 80%,这样在多次训练取平均以后就相当于逼近了使用 100%
数据的效果。</strong>这给我带来了大约 <code>0.0004</code> 的提高,也很难说是真的有效还是随机性了。</p><p>比赛结束后我发现我最好的单个 Model 在 <strong>Private LB</strong> 上的得分是 <code>0.46378</code>,而最终 Stacking 的得分是 <code>0.45849</code>。这是 174 名和 98 名的差距。也就是说,我单靠 Feature 和调参进到了前 10%,而 <strong>Stacking</strong> 使我进入了前 5%。</p><h3 id="Lessons-Learned"><a href="#Lessons-Learned" class="headerlink" title="Lessons Learned"></a>Lessons Learned</h3><p>比赛结束后一些队伍分享了他们的解法,从中我学到了一些我没有做或是做得不够好的地方:</p><ul><li>产品标题的组织方式是有 Pattern 的,比如一个产品是否带有某附件一定会用 <code>With/Without XXX</code> 的格式放在标题最后。</li><li>使用<strong>外部数据</strong>,比如 <a href="https://wordnet.princeton.edu/" target="_blank" rel="noopener">WordNet</a>,<a href="https://www.kaggle.com/reddit/reddit-comments-may-2015" target="_blank" rel="noopener">Reddit 评论数据集</a>等来训练同义词和上位词(在一定程度上替代 Word2Vec)词典。</li><li>基于<strong>字母</strong>而不是单词的 NLP Feature。这一点让我十分费解,但请教以后发现非常有道理。举例说,排名第三的队伍在计算匹配度时,将搜索词和内容中相匹配的单词的长度也考虑进去了。这是因为他们发现<strong>越长的单词越具体,所以越容易被用户认为相关度高</strong>。此外他们还使用了逐字符的序列比较(<code>difflib.SequenceMatcher</code>),因为<strong>这个相似度能够衡量视觉上的相似度</strong>。像这样的 Feature 的确不是每个人都能想到的。</li><li>标注单词的词性,找出<strong>中心词</strong>,计算基于中心词的各种匹配度和距离。这一点我想到了,但没有时间尝试。</li><li>将产品标题/介绍中 TF-IDF 最高的一些 Trigram 拿出来,计算搜索词中出现在这些 Trigram 中的比例;反过来以搜索词为基底也做一遍。这相当于是从另一个角度抽取了一些 Latent 标识。</li><li>一些新颖的距离尺度,比如 <a href="http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf" target="_blank" rel="noopener">Word Mover's Distance</a>。</li><li>除了 SVD 以外还可以用上 <a href="https://en.wikipedia.org/wiki/Non-negative_matrix_factorization" target="_blank" rel="noopener">NMF</a>。</li><li><strong>最重要的 Feature 之间的 Pairwise Polynomial Interaction</strong>。</li><li><strong>针对数据不 i.i.d.
的问题,在 CV 时手动构造测试集与验证集之间产品 ID 不重叠和重叠的两种不同分割,并以与实际训练集/测试集的分割相同的比例来做 CV 以逼近 LB 的得分分布</strong>。</li></ul><p>至于 Ensemble 的方法,我暂时还没有办法学到什么,因为自己只有最简单的 Stacking 经验。</p><h2 id="Summary"><a href="#Summary" class="headerlink" title="Summary"></a>Summary</h2><h3 id="Takeaways"><a href="#Takeaways" class="headerlink" title="Takeaways"></a>Takeaways</h3><ol><li>比较早的时候就开始做 Ensemble 是对的,这次比赛到倒数第三天我还在纠结 Feature。</li><li>很有必要搭建一个 Pipeline,至少要能够自动训练并记录最佳参数。</li><li>Feature 为王。我花在 Feature 上的时间还是太少。</li><li>可能的话,多花点时间去手动查看原始数据中的 Pattern。</li></ol><h3 id="Issues-Raised"><a href="#Issues-Raised" class="headerlink" title="Issues Raised"></a>Issues Raised</h3><p>我认为在这次比赛中遇到的一些问题是很有研究价值的:</p><ol><li>在数据分布并不 i.i.d. 甚至有 Dependency 时如何做靠谱的 <strong>CV</strong>。</li><li>如何量化 Ensemble 中 <strong>Diversity vs. Accuracy</strong> 的 Trade-off。</li><li>如何处理 Feature 之间互相影响导致性能反而下降。</li></ol><h3 id="Beginner-Tips"><a href="#Beginner-Tips" class="headerlink" title="Beginner Tips"></a>Beginner Tips</h3><p>给新手的一些建议:</p><ol><li>选择一个感兴趣的比赛。<strong>如果你对相关领域原本就有一些洞见那就更理想了。</strong></li><li>根据我描述的方法开始探索、理解数据并进行建模。</li><li>通过 Forum 和 Scripts 学习其他人对数据的理解和构建 Feature 的方式。</li><li><strong>如果之前有过类似的比赛,可以去找当时获奖者的 Interview 和 Blog Post 作为参考,往往很有用。</strong></li><li>在得到一个比较不错的 <strong>LB</strong> 分数(比如已经接近前 10%)以后可以开始尝试做 Ensemble。</li><li>如果觉得自己有希望拿到奖金,开始找人组队吧!</li><li><strong>到比赛结束为止要绷紧一口气不能断,尽量每天做一些新尝试。</strong></li><li>比赛结束后学习排名靠前的队伍的方法,思考自己这次比赛中的不足和发现的问题,<strong>可能的话再花点时间将学到的新东西用实验进行确认,为下一次比赛做准备</strong>。</li><li>好好休息!</li></ol><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><ol><li><a href="https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0ahUKEwiPxZHewLbMAhVKv5QKHb3PCGwQFggcMAA&url=http%3A%2F%2Fwww.ke.tu-darmstadt.de%2Flehre%2Farbeiten%2Fstudien%2F2015%2FDong_Ying.pdf&usg=AFQjCNE9o2BcEkqdnu_-lQ3EFD3eRAFWiw&sig2=oiU8TCEH57EYF9v9l6Scrw&bvm=bv.121070826,d.dGo" target="_blank" rel="noopener">Beating Kaggle the Easy Way - Dong 
Ying</a></li><li><a href="http://rstudio-pubs-static.s3.amazonaws.com/158725_5d2f977f4004490e9b095c0ef9357c6b.html" target="_blank" rel="noopener">Solution for Prudential Life Insurance Assessment - Nutastray</a></li><li><a href="https://github.com/ChenglongChen/Kaggle_CrowdFlower/blob/master/BlogPost/BlogPost.md" target="_blank" rel="noopener">Search Results Relevance Winner’s Interview: 1st place, Chenglong Chen</a></li></ol>]]></content>
<summary type="html">
<h2 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h2><p><strong>本文采用<a href="https://creativecommons.org/licenses/by-nc-nd/3.0/cn/" target="_blank" rel="noopener">署名 - 非商业性使用 - 禁止演绎 3.0 中国大陆许可协议</a>进行许可。著作权由章凌豪所有。</strong></p>
<p><a href="https://www.kaggle.com/" target="_blank" rel="noopener">Kaggle</a> 是目前最大的 Data Scientist 聚集地。很多公司会拿出自家的数据并提供奖金,在 Kaggle 上组织数据竞赛。我最近完成了第一次比赛,<strong>在 2125 个参赛队伍中排名第 98 位(~ 5%)</strong>。因为是第一次参赛,所以对这个成绩我已经很满意了。在 Kaggle 上一次比赛的结果除了排名以外,还会显示的就是 Prize Winner,10% 或是 25% 这三档。所以刚刚接触 Kaggle 的人很多都会以 25% 或是 10% 为目标。在本文中,我试图根据自己第一次比赛的经验和从其他 Kaggler 那里学到的知识,为刚刚听说 Kaggle 想要参赛的新手提供一些切实可行的冲刺 10% 的指导。</p>
<p><em>本文的英文版见<a href="https://dnc1994.com/2016/05/rank-10-percent-in-first-kaggle-competition-en/">这里</a>。</em></p>
</summary>
<category term="Data Science" scheme="http://dnc1994.com/categories/Data-Science/"/>
</entry>
<entry>
<title>如何备考 TOEFL/GRE</title>
<link href="http://dnc1994.com/2016/03/how-to-prepare-for-toefl-gre/"/>
<id>http://dnc1994.com/2016/03/how-to-prepare-for-toefl-gre/</id>
<published>2016-03-22T00:51:16.000Z</published>
<updated>2019-01-19T04:04:39.518Z</updated>
<content type="html"><![CDATA[<h2 id="前言的前言"><a href="#前言的前言" class="headerlink" title="前言的前言"></a>前言的前言</h2><p>去年 11 月通过微信群做了 T/G 备考经验的分享,之后一直没有时间把当时的内容整理成文字。后来<strong><a href="https://endle.github.io/" target="_blank" rel="noopener">李臻博</a></strong>同学主动帮我进行了整理(编辑:链接已失效)。不过我还是想用自己的文字做一遍整理和修改。最近终于找到了时间,所以现在把这篇经验分享给大家,希望大家各项考试顺利,都能申请到自己的 Dream School > <。</p><p><strong>本文采用<a href="http://creativecommons.org/licenses/by-nc-nd/3.0/cn/" target="_blank" rel="noopener">署名 - 非商业性使用 - 禁止演绎 3.0 中国大陆许可协议</a>进行许可。著作权由<a href="https://cn.linkedin.com/in/linghaozh" target="_blank" rel="noopener">章凌豪</a>与<a href="https://endle.github.io/" target="_blank" rel="noopener">李臻博</a>共同所有。</strong></p><a id="more"></a><h2 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h2><p>先晒一下成绩以表诚意:<br><strong>TOEFL:14 年 11 月 110;15 年 10 月 112</strong><br><strong>GRE:15 年 10 月 Verbal 161 + Quant 170 + AW 4.0</strong></p><p>本文的目的是分享一些<strong>对于有一定英语基础的人适用的 DIY 备考经验</strong>,而不会讨论诸如“我听力很烂应该怎么提高”之类的问题的。因为这些问题基本上都可以用<strong>“努力多练”</strong>四个字来回答。当然,本文提到的一些备考资源和工具是完全可以拿来提高自己的弱项的。</p><h2 id="TOEFL"><a href="#TOEFL" class="headerlink" title="TOEFL"></a>TOEFL</h2><p>T 是一个<strong>衡量日常和学术英语能力</strong>的考试,而不是一个需要死记硬背考前突击的考试。所以,英语基础比较好的同学并不用花很大的力气去准备。</p><p>我第一次考 T 的时候就是做了 15 套 TPO 里面的听力和阅读,除此以外只在考前熟悉了一下口语和写作的题型。</p><p>曾经有好几位基础还不错的同学来找我询问 T 的备考策略。他们都提到了要不要花钱去英语培训机构上课。我始终认为那些培训机构除了<strong>满足伸手党的惰性和督策一部分极度不自律的人</strong>以外没有任何帮助。我也接触过一些培训机构,拿到过一些所谓的内部资料,并不觉得是什么自己动手所比不上的资源。<strong>事实上那几位同学听从我的建议 DIY 备考以后几乎都拿到了 100 分以上的成绩。</strong></p><h3 id="听力"><a href="#听力" class="headerlink" title="听力"></a>听力</h3><p><strong>听力在 T 中有着举足轻重的地位。</strong>除了阅读以外的部分都要用到听力。</p><p>听力除了因为词汇量和练习量的问题听不懂以外,另一个问题就是<strong>注意力不集中,使得原本能听懂的内容也没能听懂</strong>。我两次考 T 听力都是满分,即便如此我在考试中照样也遇到了因为一个分心就没听到一个信息点的情况。我的经验是,<strong>在考场上不要花太多精力去记笔记,而是专注于听懂材料的内容。</strong>一个 Section 里面三篇材料,第一篇是不需要记任何东西的,后面两篇也只需要记一些很明显是分点叙述的关键句。比如材料里面教授说某某艺术流派有三个重要的特点,那么把这三点的第一句记下来就差不多了。虽然题目里一定会考到一些细节,但<strong>仔细去听去理解往往比手忙脚乱用笔去记要更容易回忆起细节来</strong>。</p><p>第二次考 T
的时候,因为无聊加上前后座位距离近,我观察了一下前排三个人做听力的方式。我发现一个限时 10 分钟、只有 17 道题的 Section,他们都要耗到最后一两分钟才能做完。因为我是那种平时练习吊儿郎当容易分心的人,所以我非常理解他们的心理。这就是因为注意力不集中没有听到某个点,遇到题目就只能排除掉前两个选项,在剩下两个选项中不断纠结,只能靠猜了。</p><p>所以说,真正到考前时,多想想怎么提高自己的专注力,可能会更有帮助。</p><p>做微信群分享时,有同学推荐了 <strong><a href="http://www.aboboo.com/" target="_blank" rel="noopener">aboboo</a></strong> 这个用于提高听力的 APP,可以一试。</p><h3 id="阅读"><a href="#阅读" class="headerlink" title="阅读"></a>阅读</h3><p>阅读没有太多可说的,当时有同学提问如下:</p><ol><li><strong>时间不够。</strong>这就是单词量太小或者阅读量不够。练吧。</li><li><strong>六选三总是做错。</strong>这个题型有一定技巧,一般用排除法会比较好入手。错误的选项往往是<strong>有细节跟文章不符</strong>(常见的比如把两个观点混杂在一起,把观点的一部分讲反,或是夸大观点的结论等),或是虽然跟文章相符但<strong>不属于文章主线</strong>。这时候如果阅读速度比较快会很有优势,因为可以在做题之前先把文章通读一遍,最后做到六选三甚至还有时间重新去翻文章。</li></ol><h3 id="口语"><a href="#口语" class="headerlink" title="口语"></a>口语</h3><p><strong>准备 T 的口语跟提高日常口语能力几乎是完全等价的。</strong>口语好的人完全不用担心 T 的口语,因为一定不会让你没有东西讲,你要做的只是熟悉题型和不要紧张。</p><p>具体到备考,可以去找一些机经上的口语问题,但<strong>不要背它给的答案</strong>(一是背模板容易出事,二是很多机经的答案质量很低),而是<strong>自己掐着表对着麦克风讲</strong>。如果考前能这样每天练个半小时,考试的时候应该就不会惊慌失措了。</p><p>要注意的是,口语 6 个题里面有一半都是要用到听力的。所以你看,听力在 T 里面有多重要。</p><h3 id="写作"><a href="#写作" class="headerlink" title="写作"></a>写作</h3><p>写作的第一部分基本就是考听力,会给一篇阅读材料,里面会分三段论述一个观点。而听力材料则会逐点反驳或是支持阅读材料(一般反驳的比较多)。最后你要结合阅读和听力材料写一篇文章,不过重点是放在听力上的。基本上把听力材料里面每一段的关键句记下来就可以了。写的时候就先提一下阅读材料里面的观点,然后讲一下听力材料里面是怎么反驳或是支持它的。</p><p>第二部分就是<strong>标准的三段八股文</strong>,没有太多可说的。题目一般都比较生活化,不会想不到东西写。有一个说法是,<strong>在语言大体流畅的前提下,写的越多分数越高</strong>。这个说法可能有一定道理,因为我第一次考 T 时,写了 400 词后就把时间都花在润色上了,最后拿了 27 分;而第二次我决定听从这个建议,写了将近 500 词,就得了 29 分。当然样本太小不能说明任何结论,但是这个说法还是值得提一下。</p><p>我曾经帮不计其数的同学改过他们写的英文,感觉很多人之所以写不顺畅就是因为他们下笔之前先在脑子里生成中文,然后强行翻译成英文再写下来。<strong>这是一个非常错误的做法。</strong>因为翻译的要求比写作高多了。正确的做法是在平时<strong>刻意锻炼自己用英语思维进行表达的能力</strong>,尽量跳过中文直接把思绪用英文写下来。同时也要<strong>有意识地积累常见的习惯表达和搭配</strong>。这么做可能刚开始会比较辛苦,但一旦长期坚持下来一定是有好处的。</p><h3 id="备考策略"><a href="#备考策略" class="headerlink" title="备考策略"></a>备考策略</h3><ul><li>提前规划,<strong>留出足够的时间提高弱项</strong>。</li><li><strong>做 10 套以上 TPO</strong> 
熟悉题型,同时针对弱项进行练习。</li><li>抽时间熟悉作文题型</li><li>考前一个月左右开始<strong>熟悉和练习口语题型,每天坚持半小时</strong>。</li><li>考前一周放松心态,可以做 1~2 套 TPO 保持手感。</li></ul><h2 id="GRE"><a href="#GRE" class="headerlink" title="GRE"></a>GRE</h2><p>G 和 T 是完全不同的两种考试。对中国学生来说(至少对理工科来讲),Quant 基本不成问题,大部分学校对 AW 分数的要求也不会太高,所以<strong>考 G 基本就是在考 Verbal</strong>。所以如果要考 G 的,我建议<strong>一定要早早规划好好准备,争取一次通关</strong>,因为准备 Verbal 实在是太恶心了。</p><h3 id="单词"><a href="#单词" class="headerlink" title="单词"></a>单词</h3><p>犹豫了很久,还是打算在这里谈一点对背单词这件事情的看法。</p><p>我认为<strong>现在市面上几乎所有背单词 APP 的实际学习效果都几乎为 0</strong>。这个结论来自于我的个人经验和与朋友的交流,以及来自研究的间接支持(绝大部分背单词 APP 的学习方式都很容易给学习者带来<a href="http://dnc1994.com/2016/02/learning-how-to-learn/">这篇文章</a>里所说的<strong>Illusion of Competence</strong>)。我不打算在这里展开论述,只是如果你在用 APP 背单词,我希望你能抽空思考一下这样背单词究竟有没有带来什么实际的成果。</p><p>尤其是像 G 这种难度和容量的词汇要求,用 APP 背单词是非常不可取的。最好的方式还是<strong>对着词汇书背</strong>。要注意不要光看中文释义,以及要<strong>多跟同义词和反义词进行联想</strong>,这一点在 Verbal 中相当重要。</p><h3 id="Verbal"><a href="#Verbal" class="headerlink" title="Verbal"></a>Verbal</h3><p>首先是官方的 <strong>OG</strong>,现在一共有一本总册和 Verbal、Quant 两本分册。因为 G 的官方题目是稀缺资源,所以这三本基本上都是要买来刷完的。</p><p>然后就是神器 <strong><a href="http://magoosh.com/" target="_blank" rel="noopener">Magoosh</a></strong> 了。Magoosh 上有 1000 多道题,Verbal 和 Quant 各占一半。<strong>它的 Verbal 应该是目前为止最接近实考难度和质量的题库了。</strong>订阅 6 个月 Magoosh 的费用是 99 美元,目前我还没有遇到过买了后悔的。虽然 Magoosh 并没有给我广告费,但出于良心还是要安利一下的,尤其是在这个高质量 Verbal 题库如此稀缺的现状下。</p><p>Verbal 三种题型,刚开始做的时候会觉得,阅读生词多逻辑又复杂,完型和选择总是一半单词不认识,特别打击信心。这时候要调整好心态,<strong>从背单词开始,做一题是一题</strong>。到最后会发现,单词背得熟的话,阅读只要头脑清醒就很容易做对,完型和选择也能排除掉绝大多数的错误选项。当然,最后考试成绩跟运气和发挥是有很大关系的。因为考试就那么几个题,一个单词忘记了可能就少一分。但是反过来想,<strong>多背熟一个单词可能就多一分</strong>,所以也算是天道酬勤吧。</p><p>11 届一位去了 CMU 的学姐说过,<strong>只要把要你命(新东方的 GRE 词汇书)背个十遍, Verbal 就能上 160 了</strong>。虽然我只背了两遍,但这句话还是很有教育意义的。</p><p>如果买了 Magoosh 的话,要<strong>善用它的错题标记和笔记功能</strong>。到最后阶段可以把笔记导出来,作为高频错词整理复习,非常有帮助。</p><h3 id="Quant"><a href="#Quant" class="headerlink" title="Quant"></a>Quant</h3><p>OG 里的题应付 Quant 
基本足够了。数学基础比较薄弱的同学可能要着重针对下<strong>概率、统计和数列</strong>这些知识点。</p><p>如果你买了 Magoosh,上面五六百道 Quant 题也够你做了。</p><p>一般来说胆大心细就能拿到 170 或者 168 分。<strong>但是也不能掉以轻心,陷阱题还是挺多的。</strong></p><h3 id="Analytical-Writing"><a href="#Analytical-Writing" class="headerlink" title="Analytical Writing"></a>Analytical Writing</h3><p>G 的写作分为 Issue 和 Argument。Issue 是给你一个论题让你发表自己的看法,而 Argument 则是给一篇论证让你反驳。</p><p>一般来说 Argument 比较好写,因为<strong>给你的材料里面往往有很多的逻辑漏洞</strong>。比如材料会说某市进行了问卷调查,根据结果市政府应该如何如何。那么你上来就可以质疑该问卷的调查对象、问卷设计以及有效问卷数等,基本就是耍嘴皮子。</p><p>Issue 稍微难写一点,这也是 G 的作文跟 T 的作文很不一样的一个地方。一是 Issue 的<strong>论题往往比较复杂,容易想不到写什么</strong>;二是 Issue 要求<strong>不能一边倒</strong>,不能像 T 的作文一样无脑三段八股文碾压过去。比如要反驳一个论点,如果是 T 的作文可能就是直接分三点反驳,而 G 的要求是你必须详细叙述这个论点在多大程度上是错误的,指出它所依赖的假设和前提条件,并分析当这些条件发生变化时论点的强度会发生怎样的变化。</p><p>Issue 我参考的是李建林老师的<strong><a href="http://www.amazon.cn/gp/product/B0062NIIJ4/ref=as_li_ss_tl?ie=UTF8&camp=536&creative=3132&creativeASIN=B0062NIIJ4&linkCode=as2&tag=blo-23" target="_blank" rel="noopener">新GRE写作5.5分</a></strong>。这本书主要把常见的 Issue 题目分成了几个大类,<strong>对每类题型都总结了一套能让你言之有物的分析思路</strong>,并配以实际例子讲解。比如针对因果性的建议类题目,可以采用的角度就有:一、原因是否成立;二、从这个原因是否能推出与目的一致的结果;三、是否有其他做法也可以达到相同的目的。花点时间看完讲解并试着动手写一写,到考试时应该就不会什么都写不出来,或是写不到规定字数了。</p><p>另外就是 <strong>OG 上的范文</strong>也是很好的资源,都是真实考场上出现的高分作文,要好好看。</p><p>要注意的一点是,G 的作文<strong>注重的是逻辑</strong>,而不是文采。当然两者都有最好,但千万不要试图写一堆 G 的词汇来彰显自己的文采,因为阅卷的考官不一定认识这些词汇。</p><h3 id="备考策略-1"><a href="#备考策略-1" class="headerlink" title="备考策略"></a>备考策略</h3><ul><li>每天背单词,做 Magoosh。提前计划好,<strong>在考前半个月时至少过完两遍单词且做完 Magoosh</strong>。</li><li>抽几个周末出来把 OG 刷完。</li><li>考前一个月左右<strong>抽时间研究 AW 的题型和思路,试着写两三篇找感觉</strong>。</li><li>考前半个月开始可以做做模考(我做过 PP2、Kaplan 和 Barron,但是感觉都不怎么样,有更好的资源欢迎留言告知)。</li><li>最后阶段视短板而定,可以反复复习<strong>高频错词</strong>,巩固下 Quant 难点,多看看 OG 上的范文,又或者是重做 Magoosh 的错题。</li></ul><h2 id="后记"><a href="#后记" class="headerlink" title="后记"></a>后记</h2><p>备考 T 和 G
是我大学生活里非常独特的两段时光。其实平均下来每天花的时间也就大概刚刚到一个小时。所以说只要想做总不会没时间的。</p><p>我的一家之言难免有失偏颇。我能保证这篇文章所讲的是对有一定基础的人来说比较有效的备考策略,但不能保证它是适合所有人的最佳策略。希望大家明白兼听则明偏信则暗的道理。</p>]]></content>
<summary type="html">
<h2 id="前言的前言"><a href="#前言的前言" class="headerlink" title="前言的前言"></a>前言的前言</h2><p>去年 11 月通过微信群做了 T/G 备考经验的分享,之后一直没有时间把当时的内容整理成文字。后来<strong><a href="https://endle.github.io/" target="_blank" rel="noopener">李臻博</a></strong>同学主动帮我进行了整理(编辑:链接已失效)。不过我还是想用自己的文字做一遍整理和修改。最近终于找到了时间,所以现在把这篇经验分享给大家,希望大家各项考试顺利,都能申请到自己的 Dream School &gt; &lt;。</p>
<p><strong>本文采用<a href="http://creativecommons.org/licenses/by-nc-nd/3.0/cn/" target="_blank" rel="noopener">署名 - 非商业性使用 - 禁止演绎 3.0 中国大陆许可协议</a>进行许可。著作权由<a href="https://cn.linkedin.com/in/linghaozh" target="_blank" rel="noopener">章凌豪</a>与<a href="https://endle.github.io/" target="_blank" rel="noopener">李臻博</a>共同所有。</strong></p>
</summary>
<category term="Knowledge" scheme="http://dnc1994.com/categories/Knowledge/"/>
</entry>
<entry>
<title>Installing XGBoost on Windows</title>
<link href="http://dnc1994.com/2016/03/installing-xgboost-on-windows/"/>
<id>http://dnc1994.com/2016/03/installing-xgboost-on-windows/</id>
<published>2016-03-11T00:14:15.000Z</published>
<updated>2019-06-06T07:00:55.042Z</updated>
<content type="html"><![CDATA[<p><a href="https://github.com/dmlc/xgboost" target="_blank" rel="noopener">XGBoost</a> is widely used in Kaggle competitions. For those who prefer to use Windows, installing XGBoost can be a painstaking process. Therefore I wrote this note to save your time.</p><a id="more"></a><h2 id="Building-XGBoost"><a href="#Building-XGBoost" class="headerlink" title="Building XGBoost"></a>Building XGBoost</h2><p>To be fair, there is nothing wrong with the <a href="http://xgboost.readthedocs.org/en/latest/build.html" target="_blank" rel="noopener">official guide</a> for installing XGBoost on Windows. But still, I’d love to stress several points here.</p><pre><code>git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
wget https://cdn.linghao.now.sh/assets/install-xgboost/Makefile.win
cp Makefile.win Makefile
cp make/mingw64.mk config.mk
mingw32-make
</code></pre><p><strong>Note that:</strong></p><ol><li><code>Makefile.win</code> is a modified version (thanks to <a href="mailto:xiyou.zhou@gmail" target="_blank" rel="noopener">Zhou Xiyou</a>) of the original <code>Makefile</code> to suit the building process on Windows. You can wget it or download it <a href="https://cdn.linghao.now.sh/assets/install-xgboost/Makefile.win" target="_blank" rel="noopener">here</a>.</li><li>Be sure to use a UNIX shell because Windows CMD has issues with the <code>mkdir -p</code> command. Git Bash is recommended.</li><li>Be sure to use the <code>--recursive</code> option with <code>git clone</code>.</li><li>Be sure to use a proper MinGW. <a href="http://tdm-gcc.tdragon.net/download" target="_blank" rel="noopener">TDM-GCC</a> is recommended. Note that by default it wouldn’t install OpenMP for you.
You need to specify it, otherwise the build would fail.</li></ol><p><img src="install-xgboost-tdmgcc-openmp.png" alt="TDM-GCC OpenMP"></p><h2 id="Installing-Python-Bindings"><a href="#Installing-Python-Bindings" class="headerlink" title="Installing Python Bindings"></a>Installing Python Bindings</h2><p>This should be straightforward enough.</p><pre><code>cd python-package
python setup.py install
</code></pre><h2 id="Done-Enjoy"><a href="#Done-Enjoy" class="headerlink" title="Done! Enjoy!"></a>Done! Enjoy!</h2>]]></content>
<summary type="html">
<p><a href="https://github.com/dmlc/xgboost" target="_blank" rel="noopener">XGBoost</a> is widely used in Kaggle competitions. For those who prefer to use Windows, installing xgboost could be a painstaking process. Therefore I wrote this note to save your time.</p>
</summary>
<category term="Data Science" scheme="http://dnc1994.com/categories/Data-Science/"/>
</entry>
<entry>
<title>【笔记】Learning How to Learn</title>
<link href="http://dnc1994.com/2016/02/notes-learning-how-to-learn/"/>
<id>http://dnc1994.com/2016/02/notes-learning-how-to-learn/</id>
<published>2016-02-28T04:34:19.000Z</published>
<updated>2019-06-03T00:39:21.519Z</updated>
<content type="html"><![CDATA[<p><strong>本博客已经迁移到新域名 <a href="https://linghao.io" target="_blank" rel="noopener">linghao.io</a>。请前往新博客阅读本文:<a href="https://linghao.io/posts/notes-learning-how-to-learn/" target="_blank" rel="noopener">https://linghao.io/posts/notes-learning-how-to-learn/</a>。</strong></p><h2 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h2><p>这是 <em>UCSD</em> 开设在 <em>Coursera</em> 上的课程 <a href="https://www.coursera.org/learn/learning-how-to-learn" target="_blank" rel="noopener"><em>Learning How to Learn</em></a> 的课程笔记。这门课程主要基于<strong>神经科学</strong>和<strong>认知心理学</strong>的一些研究成果讲述高效学习的理论和技巧,涉及了<strong>大脑的记忆机制、拖延的成因和应对方式</strong>,以及许多关于<strong>学习抽象复杂知识的小技巧</strong>。</p><p>由于时间有限,我只看了视频和通过了所有的 Quiz,Optional Assignment 和参考文献里的内容需要花费数倍的时间去仔细研究。尽管如此我依然感觉获益匪浅,故决定将笔记公开造福大家。</p><p>文中几乎所有的观点都是来自于授课材料,我尽量少做二度演绎。这些观点全部有详实的研究作为支撑,相信大家读了以后也能感受到,其中不少内容我们在日常学习中已经深有体会了。</p><p><strong>本文采用<a href="http://creativecommons.org/licenses/by-nc-nd/3.0/cn/" target="_blank" rel="noopener">署名 - 非商业性使用 - 禁止演绎 3.0 中国大陆许可协议</a>进行许可。</strong></p><a id="more"></a><h2 id="Thinking-Modes"><a href="#Thinking-Modes" class="headerlink" title="Thinking Modes"></a>Thinking Modes</h2><p><strong>Focused Mode</strong> 和 <strong>Diffuse Mode</strong> 是两种不同的思考状态。对于 Focused Mode 你一定不陌生,当集中精力解决一道数学题时,大脑就是处于 Focused Mode。而 Diffuse Mode 指的是一种放松的思考模式。你可以借助下面这幅 Pinball 的示意图来更好地理解这两个概念。</p><p><img src="ucsd-learning-focused-diffuse-modes.png" alt="Focused and Diffuse Modes"></p><p>在左边对应的 Focused Mode 中,思绪很快集中于几个临近的神经元组成的神经回路。而在右边对应的 Diffuse Mode 中,可以看到思绪在随性地跳跃。</p><p>要随时牢记,Focused Mode 适合用于<strong>解决已经熟练掌握的内容</strong>,比如计算个位数乘法;而 Diffuse Mode 则对于<strong>新事物的学习</strong>至关重要,比如学习一门新的语言。这两种模式也分别对应了两种不同的思维模式:<strong>顺序思考(Sequential Thinking)</strong>和<strong>整体思考(Holistic Thinking)</strong>。</p><p><img src="ucsd-learning-sequential-holistic-thinking.png" alt="Sequential and Holistic 
Thinking"></p><p>要注意在解决问题时,整体思考所获得的灵感必须经由顺序思考来确认其正确性。所以说在理想的学习模式中,你要能够<strong>在这两种模式中自如地切换</strong>,从而更好地掌握新知识。</p><h2 id="Memories"><a href="#Memories" class="headerlink" title="Memories"></a>Memories</h2><h3 id="Working-Memory-and-Long-term-Memory"><a href="#Working-Memory-and-Long-term-Memory" class="headerlink" title="Working Memory and Long-term Memory"></a>Working Memory and Long-term Memory</h3><p><strong>Working Memory</strong> 就好比计算机的内存,是指大脑用于处理当下的任务的那部分记忆。Working Memory 的大小因人而异,内存比较小的人在学习抽象和复杂的概念时可能就会遇到困难。也就是我们常说的学到后面忘了前面。这种情况的应对办法,一是 Chunking(之后会介绍),二是<strong>通过比喻和类推等手段来改变知识的表示</strong>。这是因为不同的人思维方式不同,所以他们对知识的不同表示的接受度也就不同,一个表示可能对 A 来说正好能 fit 进他的 Working Memory 中,而对 B 来说却是 overload 了。</p><p>Working Memory,或者说是<strong>短期记忆(Short-term Memory)</strong>,最终要成为<strong>长期记忆(Long-term Memory)</strong>才能在日后为你所用。有一些技巧可以帮助你形成更牢固的长期记忆。</p><h3 id="Practice-and-Repetition"><a href="#Practice-and-Repetition" class="headerlink" title="Practice and Repetition"></a>Practice and Repetition</h3><p>我们要正确地看待练习和重复在学习中的作用。只有通过反复的练习才能搭建足够牢固的神经回路。尤其是<strong>当学习的概念比较抽象时,练习不足就会导致形成的神经回路十分薄弱</strong>。但盲目的反复练习可能效率不高,甚至还对学习有负面影响。这一点在下面会多次提到。</p><p>合理的练习应该是<strong>不断增大两次反复之间的间隙</strong>,也即<strong>间歇性重复(Spaced Repetition)</strong>。研究表明同样次数的练习,分散在好几天中做的结果要比集中在一个晚上做更好。其实各种所谓的记忆曲线也就是这个道理。</p><h3 id="Memory-Tricks"><a href="#Memory-Tricks" class="headerlink" title="Memory Tricks"></a>Memory Tricks</h3><ul><li>研究表明,<strong>用笔写下来过的东西的确更容易记住</strong>。</li><li>研究表明,图像可以直接唤醒右脑的 <strong>Visual Spatial Centers</strong>。也就是说图像可以帮助你<strong>更好地封装概念和知识</strong>,从而形成更多和更加牢固的神经回路,日后就更容易回想起来。事实上其他的感觉也可以起到相似的作用,但对于人类来说视觉的地位是最重要的。</li><li>睡眠对于学习是非常重要。首先,<strong>入睡时脑细胞会缩小使得代谢毒物得以被清除</strong>;更重要的是,睡眠是学习和记忆机制的重要组成部分。在睡眠时大脑会<strong>自动清理不重要的记忆</strong>、<strong>巩固你正在学习的内容</strong>,并<strong>在潜意识中排演其中困难的部分</strong>。要利用大脑的这些机制,<strong>在睡前花上 5 分钟回顾今天学过的内容</strong>是一个不错的选择。</li></ul><p><img src="ucsd-learning-importance-sleeping.png" alt="Importance of Sleeping"></p><h2 id="Pomodoro"><a href="#Pomodoro" 
class="headerlink" title="Pomodoro"></a>Pomodoro</h2><p>Pomodoro 是一个定时提醒器,它将每个 30 分钟的区间划分成 25 分钟的专注学习时间和 5 分钟的休息时间。使用 Pomodoro 的核心在于,<strong>尽最大可能在每个 25 分钟的区间内都保持专注和高效,而不去过多考虑什么时候能完成既定的目标</strong>。而在 5 分钟的休息时间里,<strong>你的大脑可以进入 Diffuse Mode 来帮助你理解和概念化所学的内容</strong>。</p><p><img src="ucsd-learning-pomodoro.png" alt="Pomodoro"></p><p>想尝试这类工具的话可以戳<a href="http://tomato-timer.com/" target="_blank" rel="noopener">这里</a>。</p><h2 id="Chunking"><a href="#Chunking" class="headerlink" title="Chunking"></a>Chunking</h2><p>在刚开始学习一个新概念时,大量的信息涌入,<strong>认知负载(Cognitive Load)</strong>很重,使你无法很好的把握。大脑需要一个过程来理解和消化这些内容并将它们组合到一起,这个过程就是 <strong>Chunking</strong>。这就好比将拼图的碎片拼接起来的过程。如果你只是被动地接收知识而没有把它们 Chunk 起来,这些知识就好比下图中间那块没有锯齿边的拼图块。它无法与其他知识相关联,也就无法为你所用。</p><p><img src="ucsd-learning-chunking-puzzle-illustrated.png" alt="Chunking Illustrated by Puzzle"></p><p>神经科学认为 Chunk 的本质是<strong>由于意义联系或反复使用而形成的神经回路</strong>,而<strong>通路中的神经元往往被同时激活</strong>,使得你在回忆一个概念或是执行一项操作时能够顺利高效。良好的 Chunking 能够使你<strong>更容易地回想起所学的内容</strong>,<strong>更有助于将已经学习的部分嵌入到更大的框架之中</strong>。(从这个意义上讲,Chunking 带来的好处就好比是 Modularity 在软件工程中带来的好处。)</p><p>从 Working Memory 的角度来讲,一个良好的 Chunk 只占用一份空间,<strong>只需要很少一部分注意力就可以激活整个 Chunk 的神经回路</strong>。而没有经过 Chunking 的等量信息则会占用多得多的空间。</p><p>在母语习得的过程中,当母亲教孩子说“mama”,在孩子的大脑中连接“mama”这个词的声音和母亲的相貌的神经回路就会不断牢固。</p><p><img src="ucsd-learning-chunking-example-mama.png" alt="Chunking Example of Mama"></p><p>练习和重复对于 Chunking 来说非常重要。但除此以外还有一些其他的技巧可以有助于这个过程。</p><ol><li><strong>Divide and Conquer</strong>:如果一个 Chunk 对你来说太大了,就把它们拆分然后各个击破。比如在学习演奏乐器的时候,我们往往会一小段一小段地进行练习提高熟练度。</li><li><strong>Workout Roadmap</strong>:在学习理科时书本上往往会有带解答的例题。它们的意义就在于让你在接受新概念时试着理清解答的思路来作为 Chunking 的第一步。但要注意,不要过分纠结于单个步骤,而是去关注步骤之间的联系。</li></ol><h3 id="Formation-of-Chunks"><a href="#Formation-of-Chunks" class="headerlink" title="Formation of Chunks"></a>Formation of Chunks</h3><ol><li>把注意力集中在你所面对的信息上。尽量不要让无关的事情占用你的 Working Memory。</li><li>试图理解所学内容的主旨。不追求深度的话这一点往往不是特别困难。只有在理解的前提下,大脑才能将新形成的神经回路与其他的神经回路进行联系,否则形成的 Chunk 
就是无用的。但要注意,<strong>即使你理解了,也不一定意味着 Chunking 就成功了</strong>。举例来说,我们经常会遇到在上课时听老师讲解听懂了之后却又遗忘的情况。这是因为根据内容的困难程度和理解主动性的不同,形成的神经回路牢固度也会不同。自己琢磨明白的知识往往会来的更牢固一点,而对于从他人那里接受到的知识还是要及时复习,否则等完全遗忘之后重新形成 Chunk 就又非常费劲。</li><li>稍稍扩大思考的范围,来了解 Chunk 对应的<strong>Context</strong>。如图所示,练习和反复可以帮助自下往上的学习,而了解所学领域的 Big Picture 则是一种自顶向下的学习。这两者的重合点就是 Context。有了 Context 你才能够知道<strong>在什么时候该使用哪个 Chunk</strong>。</li></ol><p><img src="ucsd-learning-chunking-context.png" alt="Chunking and Context"></p><h3 id="Recall"><a href="#Recall" class="headerlink" title="Recall"></a>Recall</h3><p>在复习时我们通常会<strong>重新翻看所学的内容(Rereading)</strong>,或是试图<strong>整理概念之间的关系(Concept Maps)</strong>。然而研究表明,这些方法的效率都远远不如<strong>回想(Recall)</strong>。神经科学认为<strong>检索知识的过程本身就有助于加固对其的掌握</strong>。</p><p>不过要注意,Concept Maps 其实不是一种特别糟的方法,但问题是在熟练掌握多个概念之前就试图整理它们之间的关系是很低效的。而 Rereading 如果以间歇性重复的方式进行,也不失为一种好的学习方式。</p><p>把这个技巧更进一步的是<a href="http://www.scotthyoung.com/learnonsteroids/grab/TranscriptFeynman.pdf" target="_blank" rel="noopener"><strong>费曼技巧(The Feynman Technique)</strong></a>。简单来说在费曼技巧中不仅要回忆自己所学的内容,还要<strong>设法用一句话把每个概念解释清楚</strong>。这对我们提出了更高的要求,也有利于达到更深的理解层次。</p><p>另一个有助于学习(尤其是考试)的方式是<strong>尝试在与平时学习不同的环境下做回想练习</strong>。因为大脑总是会注意到环境中的<strong>潜意识线索(Subliminal Cues)</strong>。所以在不同的环境下考试时大脑的运转会受到一些限制。如果平时经常在其他环境下回想,大脑就能“免疫”这些因素的影响了。</p><h3 id="Illusion-of-Competence"><a href="#Illusion-of-Competence" class="headerlink" title="Illusion of Competence"></a>Illusion of Competence</h3><p>正如前面所提到的,上课听老师讲解或是翻看习题的答案从而觉得自己理解了,是一种很常见的<strong>能力错觉(Illusion of Competence)</strong>。另一个常见的错误是<strong>用划线和高亮的方式在课本上标注重点</strong>。研究表明,如果你要采用这种做法,那就需要特别谨慎。因为<strong>这很容易让你产生自己已经掌握了重点部分的错觉</strong>。不过<strong>在纸边写上自己对重点内容的理解和补充</strong>却是一种好方法。</p><p>另一种形式的能力错觉是当你对着书本或是 Google 学习时你会觉得这些内容已经在你的脑子里了。为了应对能力错觉,你需要不断给自己小测试。其实回想就是一种自我测试。</p><p>在自我测试的时候难免会产生错误,而这些错误是非常有价值的。你能通过错误知道自己薄弱的地方,并在下一次有意地不再重复同样的错误。</p><h3 id="Transfer"><a href="#Transfer" class="headerlink" title="Transfer"></a>Transfer</h3><p>Chunking 不仅有助于掌握特定的知识,还有助于将来学习其他领域的知识。这是因为 Chunk
之间可以互相联系。这种现象叫做<strong>迁移(Transfer)</strong>。</p><p>一个人如果掌握许多 Chunk,就好比他的大脑里面<strong>储存了许多有用的神经回路</strong>。这样在遇到新问题时他就有很大概率能够直接调出正确的解决方案。在这里 Diffuse Mode 起了很大的作用。</p><p><img src="ucsd-learning-chunking-skip-to-solution.png" alt="Skip to Solution Chunk"></p><h3 id="Overlearning"><a href="#Overlearning" class="headerlink" title="Overlearning"></a>Overlearning</h3><p> <strong>过度学习(Overlearning)</strong>指的是在已经掌握所学的内容后继续重复练习的行为。在某些情况下适当的过度学习是有好处的。想象练习网球的发球或是在公众面前演讲,过度学习使你不会在比赛时发球失误或是在台上说不出话来。</p><p> 但在其他情况下,研究已经表明,<strong>过度学习是一种时间的浪费</strong>。更严重的是,因为重复练习自己已经掌握的内容是相对容易的,所以这有可能<strong>导致能力错觉的产生</strong>。正确的做法是在困难和重要的内容上适当地过度学习。</p><h3 id="Einstellung"><a href="#Einstellung" class="headerlink" title="Einstellung"></a>Einstellung</h3><p><strong>定势(Einstellung)</strong>指的是事先存在的思维模式阻碍了更新更好的想法的出现。</p><p><img src="ucsd-learning-einstellung-illustrated.png" alt="Einstellung Illustrated"></p><p>很多时候我们对于事物的第一感觉是错的。Diffuse Mode 是克服这种问题的最好帮手。</p><h3 id="Interleaving"><a href="#Interleaving" class="headerlink" title="Interleaving"></a>Interleaving</h3><p>如前文所说,学习一门新学科要求我们能够知道在什么时候用什么 Chunk。锻炼这种能力的最好方式就是不断地在需要不同策略和技巧来解决的问题之间来回反复。这就叫<strong>交错(Interleaving)</strong>。在实践中,有意地挑选不同类型的习题来做就是一种很好的方式。</p><p><strong>交错对于创造力和灵活性的培养是至关重要的。</strong>当然,涉猎多门与精通一门往往是对立的,所以这里就取决于个人的取舍。曾经有学者调查发现,科学界的重大突破往往是由年轻人或是来自其他领域的人做出的。这是因为他们不容易被定势所禁锢。</p><p><strong>交错与前文提到的间歇性重复这两个技巧可以完美地搭配在一起。</strong></p><h2 id="Procrastination"><a href="#Procrastination" class="headerlink" title="Procrastination"></a>Procrastination</h2><p>每个人都有不同程度的拖延症。拖延会严重影响学习,因为临时抱佛脚是无法形成稳固的神经模式的。为了克服拖延症,你需要了解一些认知心理学的知识。</p><h3 id="Zombie-Mode"><a href="#Zombie-Mode" class="headerlink" title="Zombie Mode"></a>Zombie Mode</h3><p>所谓<strong>僵尸模式(Zombie Mode)</strong>,是指大脑在受到特定的外界刺激(Cue)时会做出<strong>习惯性的反应</strong>。在这种解释下,当你感受到不快时,大脑就会做出反应将注意力转移到另一件更愉快的事情上,从而使你在短期内更好过些。这种吸毒上瘾一般的模式如果成为习惯,那就是拖延症的源头。</p><h3 id="4-Elements-of-Habit"><a href="#4-Elements-of-Habit" class="headerlink" title="4 Elements of Habit"></a>4 Elements of
Habit</h3><p>神经科学认为 Chunking 与习惯是相关的。良好的 Chunking 使你在完成一项任务时只需要关注个别关键的因素,将剩余的交给僵尸模式,从而节省精力。显然习惯并不完全是有害的,但为了克服拖延这个坏习惯,你需要了解习惯的四个要素。</p><ol><li><strong>Cue</strong>:触发僵尸模式的外界刺激,比如 to-do list 上的第一个项目,或是来自朋友的微信消息。这类刺激大致可以分为四类:时间、地点、感受和反应。</li><li><strong>Routine</strong>:受到刺激后习惯性地做出的反应。</li><li><strong>Reward</strong>:习惯给予的奖励使它得以存在下去,最简单的例子就是你在拖延时感受到的短时间的愉悦。</li><li><strong>Belief</strong>:习惯的力量之所以强大,是因为你在内心深处往往认为它们是无法被改变的。</li></ol><h3 id="Process-and-Product"><a href="#Process-and-Product" class="headerlink" title="Process and Product"></a>Process and Product</h3><p>每个人都会对学习感到厌恶和不快,即使是自己擅长甚至喜欢的学科。要明确,会有这些情绪是完全正常的,重要的是如何应对它们。<strong>研究表明,当人开始做令自己感到厌恶的事情一段时间以后,这种厌恶感就会消失。</strong>所以你只需要一些小技巧来熬过最开始的这段时间。</p><p>一个有用的窍门是专注于<strong>过程(Process)</strong>而不是<strong>产物(Product)</strong>。过程是指学习的期间内时间的流逝,而产物是指通过学习得到的结果,比如一本完成的练习册。</p><p>产物往往是拖延的导火索,而过程恰恰是大脑最喜欢的。当你专注于过程,大脑就可以进入僵尸模式无脑前进了。通过专注于享受学习的过程,你可以避免陷入拖延的恶性循环,还可以<strong>反过来利用僵尸模式</strong>来帮助你轻松地完成学习目标。</p><p>当然在这个过程中总会有事物令你分心,你需要训练<strong>在被分心之后任之而去</strong>的能力。当然事先将自己安置在一个干扰尽可能少的环境也是很好的选择。</p><h3 id="Harnessing-Zombies"><a href="#Harnessing-Zombies" class="headerlink" title="Harnessing Zombies"></a>Harnessing Zombies</h3><p>拖延很容易,而对抗拖延需要消耗大量的<strong>意志力(Will Power)</strong>。所以尽量使自己处于不需要对抗拖延的境地,或是反过来用尽量少的意志力来利用僵尸模式帮助自己。</p><p>根据上面对习惯的分析,要打破拖延的习惯链,只需<strong>消耗意志力去改变四个要素中的一个</strong>。</p><ol><li><strong>Cue</strong>:找到令自己进入拖延的刺激并避免它们。最简单的例子,在学习时远离网络和电视。</li><li><strong>Routine</strong>:有意识地改变日常的一部分,比如定新的计划,或是养成在学习前把手机关闭的习惯。</li><li><strong>Reward</strong>:尝试引导自己的正面进取情绪来替代拖延带来的快感,比如自豪感、满足感等。又或者允许自己在不拖延完成任务后尽情放松。一个小技巧是<strong>将奖励设定为跟 deadline 有关</strong>,比如“五点前做完作业就约上同学去吃大餐”。</li><li><strong>Belief</strong>:相信自己的新策略可以成功打败拖延。可以通过跟志同道合的小伙伴互相监督来促进自信。</li></ol><h3 id="Tasklists"><a href="#Tasklists" class="headerlink" title="Tasklists"></a>Tasklists</h3><ul><li><p>坚持写周计划和日计划,并且最好在睡前做,因为<strong>研究显示入睡时潜意识会进入类似于 Diffuse Mode 的状态来“消化”和“排练”要完成的项目</strong>,从而使你在白天能更好地去完成他们。写计划的另一个好处在于,<strong>如果你不这么做,这些项目就会停留在你的 Working Memory
中</strong>,占据宝贵的空间。通过将它们转移到纸上,你能够更好地专注于做事情本身。</p></li><li><p><em>Eat your frogs first in the morning.</em> 也即把最困难最厌恶的事情放在早上第一件事做。也是老生常谈了。</p></li><li><p><strong>在日计划中定好结束学习的时间</strong>。我们往往只关注在什么时间做什么事情而忽视了从什么时间开始停止做事情,而这其实是很重要的。这么做不仅有助于你的日程规律,还给你更多时间去发展身心健康从而在学业上更加成功。</p></li></ul><h2 id="Metaphor-and-Analogy"><a href="#Metaphor-and-Analogy" class="headerlink" title="Metaphor and Analogy"></a>Metaphor and Analogy</h2><p>为所学的内容创造<strong>比喻(Metaphor)</strong>和<strong>类推(Analogy)</strong>能够帮助你更好地理解内容本身。如前文所述,知识的表示方式对于学习、记忆和推理来说都是至关重要的。比如当 18 世纪的化学家开始想象和可视化分子级别的运动时,他们取得了巨大的突破。比喻和类推的另一个作用在于<strong>它能帮助你突破定势</strong>。</p><p>比喻和类推之所以有这样的作用,是因为它们<strong>将新事物与旧的神经回路联系了起来</strong>。这样的链接就好比快捷方式一般,使得大脑能思考得更快、更发散。</p><h2 id="The-Value-of-Teamwork"><a href="#The-Value-of-Teamwork" class="headerlink" title="The Value of Teamwork"></a>The Value of Teamwork</h2><p>我们经常会遇到,在做计算题时很前面就犯了错误,但却反复检查不出来,使得最终结果也发生错误的情况。这是因为<strong>在 Focused Mode 下,大脑会倾向于坚持已经建立的推理步骤</strong>。与其他人一起合作的价值就在于这样的经历可以填补你思维上的空缺,建立起更强的自我纠正能力。从这个意义上讲,与你一同合作的人们就好像是对你而言的<strong>外部 Diffuse Mode</strong>。</p><p>另外,对身边的人解释所学的内容也有助于自身的学习。</p><h2 id="Test-Checklist"><a href="#Test-Checklist" class="headerlink" title="Test Checklist"></a>Test Checklist</h2><p>考试本身是一种非常有成效的学习方式。这里介绍 Dr.
Richard Felder 提出的考前 Checklist。</p><ul><li>Did you make a serious effort to understand the text?</li><li>Did you work with classmates on homework problems?</li><li>Did you attempt to outline every homework problem solution?</li><li>Did you participate actively in homework group discussions?</li><li>Did you consult with instructors?</li><li>Did you understand all of your homework problem solutions?</li><li>Did you ask in class for explanations of homework problem solutions that weren’t clear to you?</li><li>Did you attempt to outline lots of problem solutions quickly?</li><li>Did you go over the study guide and problems with classmates and quiz one another?</li><li>Did you get a reasonable night’s sleep before the test?</li></ul><p>在理想的状态下,在考试之前能用 Yes 回答以上尽量多的问题。</p><h2 id="Test-Tips"><a href="#Test-Tips" class="headerlink" title="Test Tips"></a>Test Tips</h2><h3 id="Hard-Start-Jump-to-Easy"><a href="#Hard-Start-Jump-to-Easy" class="headerlink" title="Hard Start - Jump to Easy"></a>Hard Start - Jump to Easy</h3><p>一种考试策略是先解决所有简单的问题然后攻克困难的问题。但这并不对所有人都适用。回想一下关于 Focused Mode 和 Diffuse Mode,如果我们先大概看一眼题目,从困难的问题出发,将它们“加载”到大脑中,然后跳回去做简单的问题,从而<strong>使大脑进入 Diffuse Mode</strong>,这样就很有可能在较短的时间内找到难题的思路(当然,在你的能力范围内)。这不失为一种在条件允许的情况下可以尝试的策略。</p><p>如果你仔细回忆一下过去考试的经历,肯定会有几次是在走出考场之后才察觉到自己的错误或是想到解决难题的思路。这就是因为如果你不刻意去做,考试结束后大脑才能进入 Diffuse Mode。</p><h3 id="Get-Excited"><a href="#Get-Excited" class="headerlink" title="Get Excited"></a>Get Excited</h3><p>当你处于紧张状态时,大脑会分泌化学物质引发一系列生理反应。但你可以用不同的方式来解读这些反应。<strong>恐惧和兴奋其实是两种很相似的反应。</strong>当你坐在考场里,心跳加速,满头是汗,如果你不去想“这场考试让我恐惧”而是“这场考试让我兴奋”,这会对你的考试非常有帮助。</p><h3 id="Deep-Breathing"><a href="#Deep-Breathing" class="headerlink" title="Deep Breathing"></a>Deep Breathing</h3><p>考试时感到心慌是自然反应。你可以通过深呼吸来部分或者全部抵消这一反应。当然,不要等到考试的时候才去做。<strong>考前两周就可以开始每天做几分钟的深呼吸</strong>,效果更好。</p><p>另外,在考试开始前的最后时间里做深呼吸,有奇效。</p><h3 id="Don’t-Let-the-Brain-Fool-Yourself"><a href="#Don’t-Let-the-Brain-Fool-Yourself" class="headerlink" title="Don’t Let the Brain Fool
Yourself"></a>Don’t Let the Brain Fool Yourself</h3><p>如上文所述,大脑经常会欺骗你,让你认为自己的解答是正确的。在考试过程中要时刻小心,可以多眨眨眼或是晃晃头,用这种方式来<strong>提醒自己稍微往 Diffuse Mode 倾斜一点</strong>看看有没有出错,然后再用 Focused Mode 进行 Double Check。</p><p>有些类型的题目可以用多种方式解答,检查时换一种方式可以有效防止被大脑欺骗。但有些题目只能检查每一步的逻辑,那么,Do Your Best。</p><h2 id="Misc"><a href="#Misc" class="headerlink" title="Misc"></a>Misc</h2><h3 id="Neuromodulators"><a href="#Neuromodulators" class="headerlink" title="Neuromodulators"></a>Neuromodulators</h3><ul><li>Acetylcholine for focused learning</li><li>Dopamine for motivation and reward learning</li><li>Serotonin for social life</li></ul><h3 id="Importance-of-Exercise-to-Neurons"><a href="#Importance-of-Exercise-to-Neurons" class="headerlink" title="Importance of Exercise to Neurons"></a>Importance of Exercise to Neurons</h3><p>学者们曾经认为神经细胞的数量在出生以后就不会再增加了,但是后来发现<strong>在大脑的一些特定区域,每天都有新的神经细胞产生</strong>,比如被认为是情感和记忆中心的海马体。研究表明<strong>脑力锻炼和体力锻炼</strong>都能很好地促进这些新神经细胞的产生和运作。</p><h2 id="Summary"><a href="#Summary" class="headerlink" title="Summary"></a>Summary</h2><h3 id="Elements-of-Good-Chunking"><a href="#Elements-of-Good-Chunking" class="headerlink" title="Elements of Good Chunking"></a>Elements of Good Chunking</h3><ul><li>Focused attention</li><li>Understanding</li><li>Practice</li></ul><h3 id="Overcoming-Illusion-of-Competence"><a href="#Overcoming-Illusion-of-Competence" class="headerlink" title="Overcoming Illusion of Competence"></a>Overcoming Illusion of Competence</h3><ul><li>Test yourself</li><li>Minimize highlighting</li><li>Mistakes are good</li><li>Use deliberate practice</li></ul><h3 id="Overcoming-Procrastination"><a href="#Overcoming-Procrastination" class="headerlink" title="Overcoming Procrastination"></a>Overcoming Procrastination</h3><ul><li>Keep a planner journal</li><li>Commit yourself to certain routines and tasks each day</li><li>Delay rewards until you finish the task</li><li>Watch for procrastination cues</li><li>Gain trust in your new system</li><li>Have backup plans for when you
still procrastinate</li></ul><h2 id="Todo"><a href="#Todo" class="headerlink" title="Todo"></a>Todo</h2><ul><li>Add reference</li><li>Add more illustrative images</li><li>Read optional materials and add notes</li></ul>]]></content>
<summary type="html">
<p><strong>本博客已经迁移到新域名 <a href="https://linghao.io" target="_blank" rel="noopener">linghao.io</a>。请前往新博客阅读本文:<a href="https://linghao.io/posts/notes-learning-how-to-learn/" target="_blank" rel="noopener">https://linghao.io/posts/notes-learning-how-to-learn/</a>。</strong></p>
<h2 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h2><p>这是 <em>UCSD</em> 开设在 <em>Coursera</em> 上的课程 <a href="https://www.coursera.org/learn/learning-how-to-learn" target="_blank" rel="noopener"><em>Learning How to Learn</em></a> 的课程笔记。这门课程主要基于<strong>神经科学</strong>和<strong>认知心理学</strong>的一些研究成果讲述高效学习的理论和技巧,涉及了<strong>大脑的记忆机制、拖延的成因和应对方式</strong>,以及许多关于<strong>学习抽象复杂知识的小技巧</strong>。</p>
<p>由于时间有限,我只看了视频和通过了所有的 Quiz,Optional Assignment 和参考文献里的内容需要花费数倍的时间去仔细研究。尽管如此我依然感觉获益匪浅,故决定将笔记公开造福大家。</p>
<p>文中几乎所有的观点都是来自于授课材料,我尽量少做二度演绎。这些观点全部有详实的研究作为支撑,相信大家读了以后也能感受到,其中不少内容我们在日常学习中已经深有体会了。</p>
<p><strong>本文采用<a href="http://creativecommons.org/licenses/by-nc-nd/3.0/cn/" target="_blank" rel="noopener">署名 - 非商业性使用 - 禁止演绎 3.0 中国大陆许可协议</a>进行许可。</strong></p>
</summary>
<category term="Notes" scheme="http://dnc1994.com/categories/Notes/"/>
</entry>
<entry>
<title>【翻译】Evangelion角色分析:明日香</title>
<link href="http://dnc1994.com/2015/01/translation-asuka-analysis-by-ritsumaya/"/>
<id>http://dnc1994.com/2015/01/translation-asuka-analysis-by-ritsumaya/</id>
<published>2015-01-27T07:40:38.000Z</published>
<updated>2019-12-21T06:30:31.434Z</updated>
<content type="html"><![CDATA[<p>【高能预警:巨量图片】</p><p>准备好了伙计们,是时候来一篇关于惣流·明日香·兰格雷的重量级分析了。</p><a id="more"></a><p><a href="https://web.archive.org/web/20160105212124/http://ritsumaya.tumblr.com:80/post/88279631797/eva-analysis-asuka-langley-sohryuu" target="_blank" rel="noopener">原文地址</a></p><hr><p>前言 (可以跳过这一部分) (For GC: English version below)</p><p>这是一篇在去年就应该完成了的翻译稿。当时我从EvaGeeks论坛上看到这篇转帖,当即就着了迷,马上到Tumblr上关注了原作者GC。从去年夏天GC授权我进行翻译开始,我一直在犯拖延症。明日香是我最爱的动画角色,现在我终于可以将这篇对她的深度分析展现给大家。</p><p>在翻译GC的文章时,我遵循的是自己的一套方法。为了最大限度地保留原意,我在不影响理解和不显得别扭的情况下都采用了直译。同时为了追求语言的流畅度,我从忙碌的日常中抽出尽可能多的时间来斟酌、精炼词句,尽量使用地道的中文表达。但还是不可避免地会出现词不达意的情况。还请原谅。</p><p>这篇翻译离完美还很远。现在的版本要我说才刚刚达到了能看懂并理解原文主要内容的水平。文中有很多问题,我之后会提到一些。如果有人通过读我的翻译而有所收获,我强烈建议你们去读GC的原文,这样能获得更多有趣的细节。虽然GC的母语并不是英语,但也比我强出许多。</p><p>括号里的内容大多是GC原文中的注释。但必要时我也会将自己的译注放在括号里。这只是为了帮助读者更好地理解GC的想法,所以我也不费心去区分这两种注释了。</p><p>在不少地方因为看不懂GC的用语,我只能直译过来。比如我不知道「refused love interest archetype」是什么意思。虽然GC说我可以随时提有关翻译的问题,但主要是我自己没有这个时间去做这件事。所以很遗憾那些直译暂时只能就这样放在那里。我打算之后抽空把不明白该怎么翻的地方整理出来发给GC,再根据回复来修改我的翻译。</p><p>在翻译过程中,我重温了几集Evangelion的旧TV版,发现有些地方GC所提到的点是根据英语字幕而来的,但我所看到的中文字幕以及我听到的日语似乎与之不大一样。我对自己的日语听力并不是100%确信,但我查了其中一些的原文从而知道英文字幕并不准确。当然,字幕翻译的准确与否大多没有太大影响,因为很多时候不同的翻译基本是同一个意思。但作为一个完美主义者(虽然我没有戏剧性人格障碍),我决定找时间重看整个旧TV版,并将所有的差异在注释中注明。</p><p>现在文中所有的截图都是从GC的原文中复制过来的,所以里面的字幕都是英文的。这主要是因为我没时间去把图重新截一遍。我本来想在每个图下面放上图中字幕的中文翻译,但后来一想我可以在重看有中文字幕的旧TV版时进行截图,这样可以事半功倍。所以这件事情也要之后有空再做了。</p><p>最后我想谈一点过度解读的问题。很多人会说Evangelion只是一部商业动画,所以我们的分析都是在过度挖掘一些原本就不存在的东西。我不同意这个观点。在我看来,当作品完成以后,对它的解读就跟作品的制作者的本意无关了。而且,谁说制作人员并没打算让Evangelion被解读得那么深?他们明显把许多值得深入分析的元素放在了一起,这些元素可能并不是原创的,但它们在一起表现出了一个晦涩而又集中的主题,这是极具独创性的。GC用的类似于精神分析(如果不是请纠正)的分析手法是完全合理的,因为Evangelion中出现的大量心理剧场景是不可否认的。(有人会说出现这样的场景是因为GAINAX没有预算了。好吧,但许多元素在TV中出现得很早,而且庵野秀明及其他一些主要工作人员的采访记录也可以证明他们的确是打算把Evangelion做成一部有深度的动画。)</p><p>最后我要再次感谢GC同意我翻译这篇文章。我期待着看到GC的更多文章,也打算在空闲时能翻译更多GC的文章。希望大家读得开心!</p><hr><p>Preface (You can skip this part if you like)</p><p>This is a piece of translation that was supposed to be finished last year. 
I saw the repost from the EvaGeeks Forum and was fascinated by its cleverness. Then I followed the author, GC, on Tumblr. I’ve been procrastinating again and again since GC authorized me to translate the post into Chinese last summer. Now finally I can present to you this translated in-depth analysis of my favorite anime character Asuka.</p><p>I practiced my own methodology while translating GC’s post. To preserve as much information as possible, I followed the principle of word-to-word translation as long as it doesn’t prevent my translation from being understood by Chinese readers or doesn’t sound too weird. Also I have tried my best(in terms of time I spent to refine my work in my already busy routine) to use native Chinese phrases and slangs for language fluency. Still I couldn’t avoid the occurrence of not delivering the accurate meaning. Please forgive me.</p><p>My translation is far from perfect. The current version is only, as I would phrase it, legible and understandable in terms of the main idea. It’s flawed in many ways, some of which I’ll mention below. In fact I strongly encourage anybody who enjoys reading this to read GC’s original post in English. You’ll definitely get more details. Although neither GC nor I use English as a first language, GC’s English is surely way better than mine.</p><p>Those texts between parentheses are mostly GC’s original comments. But as the translator, I put my words in parentheses too when necessary. I don’t bother distinguishing them since my intent is to help readers better understand GC’s brilliant ideas.</p><p>There are many places where I didn’t understand GC’s words at all, so I could only metaphrase them. For instance I don’t know what refused love interest archetype is. Although GC said I could ask him for the language he used anytime, it was I who didn’t have time to do that. So I’m sorry that for now I have to publish my work with all that.
Hopefully I’ll sort out a list of not properly translated languages and send that to GC. And I’ll refine my work again based on GC’s explanations.</p><p>As I translated, I rewatched several episodes of NGE and found that there are several places where GC took a point from the English subtitle that differs from both the Chinese subtitle and my understanding of the Japanese dubbing. I’m not 100% confident about my Japanese listening but some of them are confirmed by looking up the original script. Of course though, most of them don’t matter much as different translations imply almost the same meaning. But as a perfectionist(though I don’t have HPD), I decided to rewatch the whole NGE and make sure that each difference is noted in comments.</p><p>For now all the screenshots are copied from GC’s original post, which explains why they are English-subtitled. I did that because I don’t have enough time to recapture them all from my copy of NGE. At first I thought I’d just put the translation of subtitles under each screenshot. But then I realized that I could just capture screenshots as I rewatch NGE with Chinese subtitles. This reduces total work to be done, doesn’t it? So, another to-do item on my leisure time.</p><p>Finally I would like to talk a little bit about the problem of over-interpretation. Many would say that Evangelion is only a commercial anime and our analyses are over-digging something not intended to be there. I couldn’t agree with that. IMHO as soon as the work is done and published, no interpretation should depend on makers’ intent. And who’s to say that the staff of Evangelion didn’t mean it to be analyzed that much when they apparently put elements worth in-depth analyses with ingenuity in ways of combining non-original pieces together to present an obscurely strong theme?
GC’s psychoanalysis-like approach(correct me if I’m wrong) is totally legit because no one could ignore the large portion of psychodrama-like scenes in Evangelion.(Some would argue that the reason for those scenes was GAINAX running out of budget. Well, many pieces emerged very early in the series. Besides interviews with Anno and other major staff can be found, which demonstrate the point that Evangelion was intended to be an in-depth anime.)</p><p>I’d like to thank GC again for authorizing me to do this translation. I look forward to more analyses from GC and doing more translations in my leisure time. Enjoy reading!</p><hr><p>准备好了伙计们,是时候来一篇关于惣流·明日香·兰格雷的重量级分析了。或者说,让我们来分析绝大多数随流的爱好者文化是如何没能正确地理解她的。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-001.png" alt=""></p><p>主要内容:</p><ul><li>驾驶EVA之于明日香的重要性</li><li>明日香与加持、真嗣和丽的关系</li><li>《End of Evangelion》中出现的「地狱厨房」(Hell Kitchen)场景的含义</li><li>戏剧化人格障碍(histrionic personality disorder)的一种解释</li><li>许多其他内容!</li></ul><p>高能预警:本文将提及自杀、死亡恐惧(thanatophobia)、精神疾病、强奸、月经、裸体。基本上也就是任何Evangelion相关的东西所能触发的预警了。这篇文章将会很长,大约两万四千字。如果有任何疏漏,请联系我。</p><p>请注意在本文中我将讨论局限于来自TV动画的惣流·明日香·兰格雷(惣流·アスカ·ラングレー),而不涉及到来自新剧场版系列的式波·明日香·兰格雷(式波·アスカ·ラングレー)。这是因为式波是一个与惣流完完全全不同的角色,她们之间的相像仅仅是外表和浅层人格上的。</p><p>作为接触实验(Contact Experiments)的一部分,明日香的母亲恭子被剥离了她部分的灵魂,尤其是她母性的那一部分。与唯被EVA完全吸收的惨剧不同,NERV认为他们可以在保持恭子存活的同时启动EVA。毕竟最初他们也并不打算成为彻头彻尾的怪物。</p><p>与从一开始就摆明着被挑选出来的丽和真嗣不同,明日香是第一个非碇源堂系的,从众多候选中脱颖而出的驾驶员。注意,正如剑介在第四话中指出,真嗣所在的学校里每个人都失去了母亲。这是因为NERV收割了他们所有人的母亲的灵魂,以便应对可能需要在他们中间产生驾驶员的情况。明日香显然天资异禀,而且她也打算要好好利用自己的天赋大干一场。</p><p>(顺便,零号机和二号机是基于亚当(Adam)而制造的,它们拥有的灵魂并不完整。这一事实解释了为何真嗣的同步率得分能够快速地上升。实际上的最强驾驶员的确是明日香,在《End of Evangelion》中最后达到完全同步之前与量产机的激烈战斗靠的都是她自己的力量。)</p><p>不幸的是,由于灵魂被部分剥离,恭子的精神开始失常。她开始对着一个人偶说话,并把它当成明日香去照顾。同时,明日香的父亲与恭子的医生开始了婚外情,最后他也娶了那位医生。</p><p>当明日香成为第二适格者时,她处于一种极度渴望母爱的状态。只要母亲能再看着她,她愿意打破一切规则,甚至毁灭世界。她想着,一次就好,只要能再看她一眼就好。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-002.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-003.png"
alt=""></p><p>母亲的精神失常对年幼的明日香造成了极大的影响。这首先使得她对人偶嗤之以鼻;更重要的是,这使得她极度渴望他人的关注,极度需求自我隔绝,极度想要证明自己的独立。她变得再也不想依靠任何人,再也不会再为了任何人而试图杀死自己。</p><p>小结一下:因为她的母亲将注意力都放在人偶而不是她身上,明日香变得既渴望他人的关注却又对再度依赖他人感到恐惧。结果就是她拼尽一切努力想要变得像一个大人。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-004.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-005.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-006.png" alt=""></p><p>在那个混乱的社会里,明日香认为成长对于女性来说就是被人看作是一个大人。这一观念由于她将全部的人生价值都寄托在驾驶EVA上而得以内化。事实上,戴在头上的A10神经连接器是她的经典形象的一部分。这么说是因为,在明日香版本的「裸体出浴」场景中,明显可以看到她在洗澡时还戴着连接器。而真嗣和丽,或是在动画中短暂出场的其他两名驾驶员,都没有这样的举动。只有明日香这么做了。虽然真嗣和丽也将他们的人生价值寄托在驾驶EVA上,这种寄托的强度远没有明日香那样强。</p><p>现在我要破除一个常见的误解。不少动画爱好者将明日香「诊断」为自恋,但我认为她并不自恋。恰恰相反,明日香遭受的是戏剧化人格障碍的困扰。而从根源上来说,这与依赖性人格障碍更为相像。加持是具有戏剧化人格障碍的另一个例子,不过他的心智很正常。虽然他的性格中有着许多的戏剧化元素,但他并没有像明日香那样表现出戏剧化人格障碍。</p><p>让我们来粗略地看一看与戏剧化人格障碍相关的症状和行为。因为我们只需要简单的总览(毕竟我们并不是真的要去诊断一个虚拟角色的人格),这里就直接引用维基百科了:</p><ul><li>煽动性(挑逗性)的行为</li><li>将与他人的关系看得过度亲密</li><li>寻求关注</li><li>易受影响</li><li>语言表达中充斥着强烈的个人表现欲、缺少细节</li><li>情绪不稳定、波动剧烈</li><li>化妆、把外表用于吸引关注</li><li>情绪夸张、戏剧化</li></ul><p>我们可以用「PRAISE ME」(「夸夸我」,是上述各项英文首字母连写)来记忆这些与戏剧化人格障碍有关的常见元素。总结一下也就是说,有自恋型人格障碍的人是发自内心地相信自己比别人要好,而有戏剧化人格障碍的人却只是将自己表现得比别人要好,同时在内心深处对这么做的自己感到非常厌恶。</p><p>明日香毫不掩饰地称自己是最强的驾驶员,并对自己的杰出引以为傲。同样地,她对于自己是学校里最受欢迎的女孩也感到非常自豪,特别是从她跟真嗣讲话时提到自己最受欢迎就可以看出。然而,正如我们所知道的,明日香其实非常憎恨自己,这样的憎恨反复地出现,尤其是在故事的后半部分。但在我们开始对她的戏剧化人格障碍的具体分析之前,我们先回到她的背景故事。</p><p>在恭子开始接触实验之前,她就没有对自己的孩子有过多少的关注。跟律子和美里的母亲一样,她抚养出了一个渴望关注的孩子。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-007.png" alt=""></p><p>母亲恭子因为失去部分的灵魂而精神失常,父亲也出了轨,最后还娶了恭子的医生。纵使明日香在看着母亲抱紧怀中的人偶时流露出的目光是如此凶恶和愤怒,她无疑还是很高兴能见到自己的母亲的,并且不断地试图通过表现来获取母亲的关注。</p><p>之后,明日香成为了驾驶员,她把这件事告诉了母亲。然而此时她的母亲正打算自杀。在令人不寒而栗的最后时刻,我们看到恭子请求明日香(人偶)跟她一起去死。所有目睹过这一幕的人都不可能忘记明日香是怎么回答的:</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-008.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-009.png"
alt=""></p><p>只要母亲不抛弃她,她甚至愿意跟母亲一起去死。最后明日香并没有自杀,但这段经历震撼到了她内心的深处:她发现自己完全愿意为他人而死。在葬礼上,明日香告诉自己,她会成为一个大人,从而可以不用再依靠任何人。在童年时期被长期忽略的确是戏剧化人格障碍的理论成因之一。为了补偿童年时没能得到的关注,明日香说她要成为一个独立的大人,然而她却选择了通过外部认同(external validation)来成为大人。(EVA的一个主题就是,外部认同对于获得属于自己的幸福是无济于事的;认同和爱只能在自己的内心逐渐建筑起来,伴随着「一个人要是都不爱自己,怎么可能学会去爱别人呢?」的想法,等等。不过这些就是下一次的另一篇博客的主题了。)</p><p>她的这种想法存在两方面的问题:首先,她认为成人就仅仅是性成熟。明日香迫不及待地想要长大,而这就是问题的症结,尤其是从她与加持和真嗣的关系上来说。她没有耐心地想通过她身边的男性的认同来证明自己的成长。因为这是社会所教给她的,也是父亲抛弃母亲再婚的经历所教给她的。(我个人认为明日香是个无浪漫者。)</p><p>在来到日本之前,明日香和加持在船上有一次很能说明问题的对话。从时间上来说,这正好发生在第八话之前。明日香先是谈到美里,说自己不是特别喜欢她。(这在之后很重要,关系到她和美里的关系,以及她与自己的关系。)加持告诉明日香,她可能会在日本交上很多新男友,并且第三适格者(真嗣)是个男生。这就暗示了一些很有趣的事情:明日香在德国曾经有过男朋友。并且根据加持所说的,她有过许多男朋友,从一个换到另一个,希望从他们身上获得认同。但当她发现这并不能使她更快乐时,她就变得越来越失望和挫败。</p><p>注意:加持和我在这里所说的,指的并不是真正意义上的「男朋友」,而是指当明日香来到第三新东京市时那些对她有好感的男孩。即使在践踏他们并表现得高高在上的同时,她还是能通过对自己有好感的绝对人数而获得认同,但显然他们中没有一个能让她真正地感到被认同。不要忘记明日香是从德国唯一的驾驶员变成了甚至在日本都排不上第一的驾驶员,考虑一下这个变化会如何影响她所受到的关注等。之后我再提到「男朋友」时,请记住:我指的是那些她毫不费力就能交上的男性朋友。考虑实际的话,要是有任何人企图跟她约会,最后就会变得跟班长光拜托她跟一位朋友约会时那样——明日香中途感到无趣逃走,留下被抛弃的男孩一个人困在过山车上。</p><p>事实上这是很重要的一点:正如美里通过性爱满足自己,明日香也通过某种方式来填充自己的空虚。然而,美里和明日香这种「用获得的欢愉来代替真正的幸福」的做法背后的逻辑却是大相径庭。美里狂野地做爱,是因为正如她自己所解释的,被人所需要的感觉很棒,即使只是肉体被需要。而明日香则是因为她自身需要被认同。她并不想要被人所「需要」。她所想要的,本质上就是别人能关注她,而她不需要去在意他们。她想要被人所「需要」,并不是在与美里一样(真正的需要)的那种意义上,而是在一种使别人关心她而她不需要反过来也「需要」别人的意义上。美里想要的是双向的需要,而当她发现自己无法拥有这种双向需要时,她就让步于在肉体上被需要。明日香想要的是单向的需要,而这种需要没法使她高兴,因为事实上这种需要是不存在的。她没有办法应付这种情况。</p><p>注意:当提及美里的性生活时,我指的是从明日香的视角出发所看到的。从美里与加持的谈话中可以知道,她其实并没有狂野地做爱。但明日香却认为她做了(并且直到人类补完都这么认为)。</p><p>其他的一些EVA分析者已经指出了在EVA中反复出现的三位一体的象征手法,比如MAGI超级计算机以及加持-源堂-冬月三人。在这里我不会过度深究,但总的来说,加持代表着年轻而充满阳刚之气的男性;源堂代表着残酷而好战的男性;冬月代表着年迈而充满智慧的男性。这类似于莎士比亚写的《人生的七个阶段(Seven Stages of Man)》那首诗,同样这也是下次博文的话题了。这其中最关键的部分是加持代表着年轻而阳刚的男性,他显然是三人中最性感的。与简单随意地就能圈牢的,很有可能会对任何人投怀送抱的那些男朋友不同,加持是一个成年人。他喝酒,抽烟,有着丰富的性爱,并且如果他真的对她有意,她无疑就能成为真正的大人。</p><p>所以,很不当地,明日香试图引诱他。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-010.png"
alt=""></p><p>当然,他拒绝了她的探求,因为他不是她的法定监护人(他从来没能培养出一段积极的关系),而且他不是恋童癖(pedophile)也没有少年爱(hebephilia)。不幸的是,这对明日香来说并不单纯只是拒绝而已。这是对明日香作为成年人的地位的否定。加持对她说,「你还是个孩子。」</p><p>对明日香来说,没有什么能比被叫作是一个孩子更能伤到她的了。因为正是她还是一个孩子这一事实,差点杀死她自己,使她精神失常到答应跟母亲一起去死。她无法忍受当一个需要依靠他人的孩子这样的想法。她必须马上长大,不是么?</p><p>重申一下:明日香并不是爱着加持。或许她对加持是有类似于一见倾心的迷恋,就像年轻的女孩子对有吸引力且比她们小的男孩子所产生的那种迷恋一样。(可以认为,与真嗣相比,她对加持更有那种感觉)</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-011.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-012.png" alt=""></p><p>我想指出她说这话时伴随着的彻彻底底的孩子气。「接吻,甚至那之后的事情。」我能拿到的所有字幕上基本都是这句话的不同翻法。(为了能好好地分析EVA,我配着几套不同的字幕看了很多遍,包括官方英文字幕、蹩脚的直翻、爱好者们制作的字幕等,从而在某个字幕可能出错的地方我能知道总体上应该是什么感觉。)她甚至都不敢谈及性爱。她其实根本就没有准备好跟人做爱。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-013.png" alt=""></p><p>这一幕的构图很有意思,因为我们能从正面看到明日香的乳沟。明日香在EVA的三位女主角中胸围是最小的,但在一些其他场景中,比如不可摧毁的耶利哥之墙(Wall of Jericho,指第九话中明日香关上的隔门)那一幕,她的胸看起来大了一些。但从另一方面讲,在这一幕中她看起来几乎就是平胸。素白胸罩的使用也进一步地印证了她不是个大人这一点。她还是个孩子,还几乎没有从童年中成长分毫,她试图成为一个大人,但还为时太早。她想要成为一个大人从而能够不再依赖任何人,但当她不被认同而是被拒绝时,她又回到了最初的基本需求:</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-014.png" alt=""></p><p>是的。看着她。看着她,因为她的母亲从没看过她,她的父亲也从没看过她,因为她的母亲用一个人偶替代了她,因为到最后她感到自己对别人来说什么都不是。她因此而恨着每个人。她恨其他人类。当然,她最恨的还是自己。</p><p>所以设定就是这样:明日香寻找着对自己成为大人的认同,从而能够变得独立而不用再依靠任何人。这是她所反复告诉自己的。而在现实中,她渴望关注。这就是她憎恨自己的原因。她固有的对人类的憎恨和对孤立的渴望,与她由于童年从未获得过关注而产生的能够有人看着她的强烈需要,构成了一对矛盾。</p><p>加持拒绝了她。不过这时新选手出场了:第三适格者,真嗣。真嗣是她身边唯一一个同为驾驶员的男孩,而且他作为驾驶员而言很优秀。当真正的大人加持让她失望时,她转而试图引诱真嗣,因为他是一个很好的替代品(从他是人类的救世主等意义上来说)。</p><p>说到真嗣,该谈谈明日香在日本的经历了。</p><p>在第九话的开头,明日香最终抵达日本时,我们看到了学生们以一种很有意思的,预言性的方式谈论着她:</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-015.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-016.png" alt=""></p><p>对我来说这一幕很明显是在暗示明日香无疑会表现得让传言成真:她想让别人以为她很早熟。要知道这可是在我们产生对她的第一印象之前,那么故事的作者早早地将这一预设抛给我们就是一件很有意思的事情了。当然,其他的学生很有力地回应说,她可能只是因为被伤透了心才来到这里的。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-017.png" 
alt=""></p><p>起初,真嗣是不大乐意见到明日香的。毕竟她有点喜欢欺负他。而另一方面,明日香显然认为自己美丽动人(大部分的学生都会同意),看到真嗣忐忑不安,立马就开始挖苦伤害他。好戏开场了。</p><p>当然,这里的重点在于,明日香会与真嗣交流,是因为真嗣是第三适格者;同样地,明日香与丽交流,仅仅是因为丽是第一适格者罢了。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-018.png" alt=""></p><p>之后在公寓时,明日香显然深信真嗣将会被当做垃圾而抛弃掉。她都没有怎么再去考量真嗣,而是不断说着他将会如何被取代。这当然是因为她完全没有把他看在眼里,不想为他浪费时间。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-019.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-020.png" alt=""></p><p>借此机会我要提醒读者,加持才是明日香用她那颗小小的少女心去真正在意的男人。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-021.png" alt=""></p><p>在这里我们还能看到明日香很有意思的一面。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-022.png" alt=""></p><p>她吐槽了日本人不在门上装锁,并且也没有办法能隔开她和真嗣。基本上随便是谁只要愿意都可以走过这扇门。</p><p>现在让我们花点时间来回想一下与鸟天使(Arael)的战斗。明日香哀求使徒不要再窥探和挖掘她精神的更深处。她只想让别人看到她的外壳,看到那个「美丽动人的明日香」,那个完美的明日香。在这一话之后的部分,明日香抱怨她在日本的第一次战斗让她显得「不酷」了。她沉溺于只将自己最好的一面展示给别人看。所以,仅仅是允许别人走进房间的想法就使她恐惧而愤怒。她感到没有隐私。在德国生活时,明日香可以把内心封闭起来,但到了日本,她没法再这么做了。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-023.png" alt=""></p><p>嘿,记得我说过明日香的胸有点小吗?仔细看看上面这张截图,再跟在船上的那一幕对比一下,你就会知道我说的是什么意思。</p><p>这一段里明日香在向真嗣提出挑战。因为像在德国时那样,她打算利用当地的男孩子来认同自己。现在既然美里已经明确了真嗣不会被换掉,明日香突然就对他有了兴趣,因为他也一样是EVA驾驶员,还是人类的救世主。他的存在是有一定分量的。</p><p>所以她就提出了「挑战」。顺便,不可摧毁的耶利哥之墙倒塌了。她试图让真嗣想要她,试图让真嗣渴求她关注她,好让她可以嘲笑他无视他,从而获得认同。跟明日香最初的行动是羞辱真嗣一样,现在她试图创造出一段靠情感负担而使自己处于上位的关系。</p><p>当真嗣真的无所动作时,她决定让事情进一步发展。她在真嗣身边躺了下来,让自己的胸部能被整个看到。这下真嗣必须有所反应了。更不用提她当时还在做着噩梦。</p><p>不管怎样,她睡着了。真嗣惊慌着,眼睛盯在她的胸上(真嗣被明日香吸引从本质上来讲完全或者说几乎是出于性本能,讲到《End of Evangelion》等其他内容时我们就会明白)。正如明日香期待的那样,他决定上钩了。他准备亲吻她,然而在静谧的夜里,明日香却流露出了她从不想表现在外的真实自我:</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-024.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-025.png" 
alt=""></p><p>尽管成为一个大人是明日香唯一想要的,但她无疑还是个孩子。她等不及想要长大(在上面的截图中你也可以清楚地看到她的胸),但却无法逃避对他人的依赖程度之高以及对他人关注的需求,因为她的母亲没有给予她足够的关注。这在她的心上留下了深深的伤痕。</p><p>她需要关注,需要成为绝对的第一名,这在接下来的几话里驱使着她。在第九话中,她做到了与真嗣「同步」,说明她的确有这个能力,但两人还是没能走出之前发生的事(因为一个人可以在心智无法正常运作的同时从外部看上去一切正常,正如这部作品里绝大多数成年人一样)。在第十话中,她不能忍受真嗣看着丽的泳装而没有注意到她,所以她不仅借热膨胀调侃自己的胸部来挑逗真嗣,还回头喊他看自己的背翻式入水。之后,她宁愿死在火山里也不愿意放弃任务。第十一话中,她甚至为真嗣挡下一击,因为她不能忍受自己没能成为第一,没能成为他人依靠和关注的对象(与此类似地,她在基地停电时也当起了小组的领队),等等。</p><p>最后,那个臭名昭著的真香之吻来了。这一吻为许多真香粉所欢呼,他们将其看作是真香恋的有力证明。那我们就来看看吧。</p><p>在那一话的开头,明日香又给加持打了电话,这次她在电话里装作自己正在受到性侵犯,以图让加持在意。如果这都算不上是极度的绝望,我也不知道怎样才是了。之后她跟一个男孩出去约会,但她显然中途就跑出来了,因为她说那个男孩很无聊(意思就是在他心里比不上真嗣或是「认同大师」加持)。</p><p>美里跟加持出去了。严格来说,美里是跟加持和律子一起出去的,但明日香自然就把这当做是美里(一个她并不完全喜欢的女人)跟加持的一场约会。显然她对事情的发展感到很不舒服,因为美里作为一个成年人正在获得认同,而加持却将她弃于尘土中。</p><p>这时,明日香转而寻求她生命中的另一个男孩,一个她最容易得到身体和认同的男孩:真嗣。她问真嗣他觉得美里和加持在干什么。当真嗣想要避开这个话题时,她又问他有没有接过吻。真嗣回答说没有,她就逼迫他跟自己接吻。但是请注意她脸上的表情:当她要求真嗣跟他接吻时,她看起来既不高兴也不满意,而不如说是一副郁郁不乐、一厢情愿的表情。她满脑子想的都是加持的事情以及如何获得认同。</p><p>注意:明日香也是一个处于青春期的少女,而且从我的分析看来,她在这里可能并不是在算计什么。她这么做部分只是出于实验心理,而且她被真嗣所吸引或许只是那种典型的「他们那个年龄的非无性欲(allosexual)的男孩和女孩共处在一个环境下」时所会发生的事。不论如何,让我们继续在明日香在作品中的整体发展的框架内来看这个吻。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-026.png" alt=""></p><p>对于将要到来的这个吻,真嗣似乎并不是特别高兴。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-027.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-028.png" alt=""></p><p>当他拒绝时,明日香刺激他。正如我所提到的,明日香对于把她对人类的憎恨当作武器来使用已经是驾轻就熟。她提到今天是真嗣母亲的忌日并问他是不是害怕了,这使真嗣木桩穿心。母亲之名遭到侮辱,真嗣回答说他当然不会害怕一个小小的吻。明日香气势逼人地站了起来。她比真嗣要高,而且她向真嗣靠近的镜头被处理得与其说是浪漫不如说是胁迫而高高在上的。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-029.png" alt=""></p><p>当真嗣和明日香靠近对方时,真嗣的脸稍红了一下(因为他觉得明日香很性感,所以还是有点吃惊的),而明日香却看上去要命地严肃。她犹豫了一会,告诉真嗣不要呼气。作为人类,真嗣显然还是会让她分心的。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-030.png" alt=""></p><p>接着,她用手捏住真嗣的鼻子,并突然吻了上去。他们一动不动地保持了这个相当尴尬的吻好一会。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-031.png" alt=""></p><p>可怜的真嗣没法呼吸,他的脸先是变红,然后又变青。明日香几乎让他窒息了。</p><p><img
src="http://cdn.linghao.now.sh/asuka-analysis/asuka-032.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-033.png" alt=""></p><p>第一次看到这一幕时,我以为这不过是个勃起的玩笑,因为真嗣两手握拳像是要掩盖裤子里的帐篷一样。或许真嗣一开始的确勃起了(毕竟他才十四岁,正在跟一个异常有魅力的少女接吻。他显然对她有性冲动,还看到过她的胸部,之后我们还会发现他对着她自慰)。但当他们继续吻下去时:</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-034.png" alt=""></p><p>可怜的真嗣这下真的要窒息了。由于缺氧,他的皮肤都变青了。这并不令他开心。最后真嗣实在受不了缺氧带来的致死感,一把推开了明日香,开始大口地喘气,庆幸自己没有真的窒息而死。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-035.png" alt=""></p><p>然而对于明日香来说,这不仅仅意味着拒绝。这里发生了许多事情:首先,亲吻第三适格者并没有让明日香好受些;其次,真嗣显然不知道自己在干什么,也没法回应明日香,而是傻站在那儿直到实在受不了而跌跌撞撞地跑开。所以他并不是像加持那样有男人味的「选手」,他在明日香的「认同名单」上自然就排在加持之后;再次,他没有继续而是离开了她,这意味着对她的拒绝;最后,他不是加持。从明日香的角度来想象一下:你梦想中的男人正在跟一个你不是很喜欢的女人约会,而你吻了一个男孩,一边吻一边在心里想,为什么我还是不高兴?为什么?还要多久我才会高兴起来?接着那个男孩突然推开了你,开始大口喘气,很庆幸终于不用再跟你接吻了。</p><p>这跟你刚才被加持拒绝可不一样。你是被第三适格者拒绝了,是被一个你在德国随手就能找出一堆的普通得不能再普通的十四岁男孩拒绝了。而他还是你见过的最懦弱的男孩。</p><p>明日香转头跑进卫生间开始漱口。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-036.png" alt=""></p><p>明日香立马把过错归到真嗣头上来试图挽回面子。因为她的恼火跟她刚刚经历了最糟糕的被拒绝情形和最没法获得认同的事情毫无关系,而仅仅是因为真嗣接吻很差劲。(提示:事实正好是反过来的。)明日香还没有成为大人,她还是个孩子。遭到打击时,她只会责怪真嗣。当真嗣直白地问她怎么了,她狠狠地回击说她不高兴就是因为自己吻了他。这说的倒也没错,但她的本意却并不是真嗣所理解的那样。</p><p>顺便,这一切都是在加持和美里在分别八年之后重归于好的背景下发生的。一对真爱之间的吻与两个没法交流的孩子之间毫无头绪的吻形成了鲜明的对比。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-037.png" alt=""></p><p>之后,加持拖着美里回来了。他把美里安顿到床上,确保她没问题之后走了出来。因为见到了她亲爱的加持君,明日香看上去精神好多了,立马抓住了这个机会。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-038.png" alt=""></p><p>加持委婉地拒绝了住下来的邀请,说自己必须得回家,而明日香坚持要他留下来。真嗣或许是个无用的废物,但加持的话一定能拯救她一天的心情。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-039.png" alt=""></p><p>不幸的是,当明日香试图对加持表现出她最「娇」的一面时,加持拒绝并离开了,留下她一个人站在那里。注意看下一幕中,在观察到一件重要的事的同时,她有多么心碎。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-040.png" 
alt=""></p><p>这当然是对这一话之前的部分的回应。就在明日香跟光的朋友去约会之前,她问美里能不能把香水借给她。顺便,这时背景中的电视节目里大概是一对前夫妻在吵架。男的说他还爱着对方,女的说这不可能,因为她已经不是原来的那个她了,而且她花了三年才忘记他。这让人马上联想到美里和加持的关系。但这是不是有可能同时也在影射明日香不顾一切地要成为一个她没能成为的人,然而时移境迁物是人非,一切都回不去了呢?</p><p>无论如何,明日香请求借用美里的香水,但美里没有答应。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-041.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-042.png" alt=""></p><p>「这不是给小孩用的。」美里说,毫不掩饰地指出明日香还是个孩子,而自己是成年人,所以可以用香水。自然地加持就会认同美里。没错,美里是个成年人,她值得拥有加持的爱和关注。</p><p>而明日香不是。</p><p>明日香是个孩子,用不上香水,似乎也并不值得加持的爱和关注。在之后的几话中,明日香继续试图吸引加持的注意,比如她闯进加持的办公室却只发现冬二刚刚被选为第四适格者。又比如她再次被加持拒绝。</p><p>跟很多真香粉所认为的浪漫之吻相去甚远,这一幕反而是提醒我们真嗣和明日香不理解对方的场景之一。真嗣想不通明日香为什么不高兴。明日香则根本不在乎真嗣,只是利用他来认同自己。这里根本就没有在他们那个年纪的非无性欲者所面对的性压抑之外的真正的浪漫恋情发生。</p><p>(现在我要转变一下话题,来谈谈丽和美里以及她们与明日香的关系。)</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-043.png" alt=""></p><p>在第九话的开头,正如我所提到过的那样,我们看到明日香认为自己是精英中的精英。她问了真嗣第一适格者在哪,并走过去打算跟她交谈。</p><p>当她走近丽时,她的影子挡在了丽正在读的书上。第一适格者对此的反应是将书移开了一点,像是明日香的存在根本没打扰到她而且她完全不关心明日香一样。换句话说,她的行为正好是明日香所寻求的关注的对立面。</p><p>对于这一幕的构图我要说几句。他们在室外,在公共场合,被学生围绕着。明日香是极受欢迎的新学生。当她大声宣告自己是二号机的驾驶员时,几乎每个人都在看着她和丽。注意到她特意将丽称呼为原型机的驾驶员。在丽还没说一句话之前,她就把第一适格者放在了低人一等的位置上。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-044.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-045.png" alt=""></p><p>到处都是学生,明日香这么做相当于是把丽扔在了聚光灯下。但丽显然根本不在乎。她很恰当地问,为什么她要跟明日香做朋友。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-046.png" alt=""></p><p>明日香不说是因为自己想跟丽做朋友,也不说是因为她对丽有兴趣,或是任何普通人会说的话。相反,她说:</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-047.png" alt=""></p><p>因为这样会比较方便,在各种意义上。当然不难理解明日香这么回答只是因为这是她的风格。她不知道怎么跟别人打交道。她只会让自己处于主导地位,很明确地创立一段单向关系,使得别人付出对她的关注,而她表现得一点也不在意他们。不过对于她去找丽这件事我们还是要理解的。她不是想让丽讨厌她。很多人都会忘记这一点:她并不是喜欢到处晃悠给自己找麻烦。她是真心想跟丽做朋友。</p><p>然而,丽跟学校里其他人都不一样,她不关心明日香的状态。事实上丽根本不关心任何事情。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-048.png" 
alt=""></p><p>这让明日香哑然无声。如果是命令的话,我会照做的。在她看来,跟极度想要成为大人的自己不同,丽就是个只会接受命令的机器人。</p><p>从此明日香就称丽为人偶。我们都知道她讨厌人偶,因为她的母亲对着人偶说话,以及她曾经答应只要母亲不离开她,她愿意取代人偶的位置并跟母亲一起去死。由此她将丽与自己所讨厌的一切都联系了起来:一个只知道听从主人命令的人偶,在任何事情上都要依靠主人,完全不会自主思考。在之后至关重要的电梯情节中我们会看到,愿意去死这件事将再次伤害明日香。</p><p>不论如何,最初的这件事使他们之间的关系产生了裂缝,此后明日香几次想要接近丽。举例来说,在空天使(Sahaquiel)的那话中,是明日香把丽拉去跟真嗣和美里一起吃饭;她甚至为了去一家适合丽这个素食者的饭店而放弃了她原来的计划。同样地,在同雨天使(Matarael)的战斗后,明日香充满善意地用自己的方式回应了丽的抒发的哲学思绪,而不是侮辱或嘲笑她。等等。</p><p>不过明日香的确有着讨厌丽的理由。比如,正如她在停电时所说的那样,她认为丽是被偏爱的,称她是「优等生」。丽当然否定说自己并没有受什么偏爱;她自己很清楚这一点。但在明日香看来,丽就是被偏爱的驾驶员,这是她不喜欢丽的原因之一,因为丽差不多算是挂上了在她看来的「最强驾驶员」头衔。事实上这一点对真嗣也是适用的,但我们之后再讲。</p><p>在某种程度上,明日香讨厌丽还因为她觉得真嗣一边像逃避瘟疫一样躲着她,一边却想要接近第一适格者。虽然这种三角关系并没有怎么影响她跟丽的关系,但这里还是要提到重要的一点:加持不认同她,真嗣只在意她的身体(他也把她当做一个朋友来在意,但她几乎看不到这一点,而这里我们是从她的立场来看的)但却似乎有点在意丽。这是一个认同的问题,也是她不喜欢丽的另一个原因。</p><p>小结一下:丽是个人偶,她吸引到了一个明日香没法用自己喜欢的方式套牢的男孩的注意,她愿意为了别人而死,她作为驾驶员比自己更受偏爱。</p><p>正如我们所知道的,随着故事的发展,明日香开始失去她的同步率优势,同时真嗣开始迎头赶上。直到第十六话开头,美里对他说,「你是最棒的!」明日香将自己的全部价值都寄托在驾驶EVA上,因为除此之外她一无所有。她的母亲在她刚被选上驾驶员时就去世了。她成为了第一架实战用EVA的驾驶员并取得了最高的同步率得分。她是如此在意驾驶EVA以及在驾驶时表现得好看这件事,以至于正如之前提到过的,她甚至在洗澡时都把神经连接器戴在头上。</p><p>现在真嗣赶超了她。</p><p>为了说明一些EVA的科学设定,我现在要提一下,同步率下降并不全是她自身的原因。从一方面来讲,零号机和二号机是基于亚当(Adam)制造的,所以只含有部分的灵魂。而初号机则是基于莉莉丝(Lilith)制造的,所以含有全部的灵魂。那么真嗣自然就能超过另外两人。另一方面就是明日香自己的「过错」(因为这是可以避免的),那就是她的认同情结。</p><p>明日香没能很好地消化真嗣成为第一的事实。而丽似乎一点也不关心。矛盾最后在臭名昭著的电梯情节中爆发。两人在电梯中下行,丽试着跟明日香搭话。</p><p>丽解释说必须要向EVA敞开心扉,对此明日香反应激烈,大声反问丽,像是想要说明自己的确做到了这一点。这当然不是真实的情况:明日香把最深处的想法和感情都紧锁起来,甚至都不让母亲/EVA看到(事实上她在《End of Evangelion》中的觉醒部分就是因为她终于在自杀/随后的求生欲中表露出了内心最深处的自己)。丽真诚地想要接近,却被明日香拒绝,这很能说明两人的角色发展是怎样的:丽已经走上正轨,而明日香却重重地坠落。</p><p>明日香把真嗣单独拎出来当作是问题的主因,因为她再也不是第一了。她总觉得丽是受偏爱的,现在真嗣也因为同步率的数值而受到了偏爱。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-049.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-050.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-051.png" 
alt=""></p><p>她称丽是个机械傀儡(有些字幕中翻译为人偶,日文原文是「機械人形」),这突出说明了她是多么讨厌人偶。她甚至比较说自己已经「失去优势」,因为她已经沉到如此低谷以至于能拿去跟人偶比较了。</p><p>她其实并不坚信自己比真嗣和丽要强,而只是出于某种防御机制假装相信。Tumblr上很多人可能会把这跟「是的,我很完美/是的,我是废物」联系起来。她装作自己是最强的,因为这样比接受真实的自己更容易。她还从别人身上寻求对自己是最强的这一事实的认同。不幸的是,这个世界除她以外的人似乎都不这么想。而作为一个有戏剧性人格障碍的人,明日香应付不了这种情况。</p><p>丽真诚地继续试着帮助她,明日香却绝望地喊了出来:</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-052.png" alt=""></p><p>她指望着丽能否认,从而让她确信第一适格者当然能独立做出决定,当然不会简单地因为被这样命令了就真的去死。为什么?因为她不愿意相信自己曾经差点因为母亲让她死就真的去死了。因为碇司令差不多算得上是丽的父亲/监护人,这两者之间就形成了一个很直接的对比。令她无比震惊而恐惧的是,丽承认自己如果被命令就会去死。</p><p>她扇了丽一耳光。丽的话冒犯了她,因为第一适格者的这一决定激发了她最深的恐惧。她能够杀死自己,所以自己真的会因为随便什么原因而自杀的想法让她感到恐惧。</p><p>而丽证实了她的恐惧。丽承认说,一个被全世界所爱的,站在巅峰的,受到偏爱的(从明日香的立场来看),甚至被真嗣认同了的优秀驾驶员,还是愿意为命令而死。傀儡一样。人偶一样。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-053.png" alt=""></p><p>之后,在她因鸟天使的攻击而精神崩溃时,丽拯救了她。丽,一个人偶,她所恐惧的一切事物达到极点的化身,拯救了她。明日香曾经认为自己是个大人,能够独自解决任何事情。但在她最低谷的时候,在她所有的恐惧被鸟天使悉数掘出时,那个她一直想要逃离的机械傀儡拯救了她。</p><p>到最后明日香也没能独立起来。她依赖着别人,没法拯救自己。而且最后她居然是被最依赖别人(从她的立场来看)的人偶所救。</p><p>这就像是扇在自己脸上的一记响亮的耳光(对不起,明日香)。这完全摧毁了她所坚信的一切。鸟天使彻底摧毁了她,而拯救她的却是她最憎恨的东西。</p><p>关于鸟天使还有一些别的内容。不过我们先来讨论明日香跟美里的关系。</p><p>在之前的部分我已经涉及到了两人关系的主要内容。所以现在来小结一下:总的来说,美里代表着明日香想成为而又不想成为的人。美里是一个被认同的大人,她有性生活并且赢得了加持。明日香想像美里一样做一个独立的成年人(她吻真嗣部分也是因为她认为这也是加持和美里在约会时做的事情),从这点上来说她想要成为美里。然而她又觉得美里很恶心;她在问到自己长大之后会不会也变得像美里一样时感到恶心,因为性而觉得美里不知羞耻。她实际上并不想成为美里;她内心的深处其实并不渴望长大。</p><p>(回到认同的话题。)</p><p>谈到与鸟天使的战斗,现在正好可以总结一下之前发生的事情然后继续讨论。在鸟天使之前,我们看到明日香再次试图给加持打电话但未能遂愿,之后她又注意到真嗣在跟丽说话。注意加持和真嗣这对共轭在这里又出现了:当无法获得首要目标(加持)时,她就会转向次要替代(真嗣)寻求认同。这次她把真嗣跟第一适格者讲话看作是自己输了。对此她反应木然。同时从EVA首席驾驶员和拥有认同(她以前总是跟加持和她在德国的男朋友们在一起,但现在他们都离开了她)的地位上被踢下来,她崩溃了。</p><p>在第二十二话,也就是被爱好者们称为「明日香的精神强暴(mindfuck)」的那一话开头,初号机因为试图吞噬力天使(Zeruel)的核来觉醒成神而被封印。SEELE没有时间也没有耐心去对付源堂的小把戏,让初号机和真嗣坐上了冷板凳。明日香被派出去调查、对抗并打败鸟天使。她很清楚这场战斗与自己利害攸关。她知道自己作为EVA驾驶员的名声和地位都在此一举。而且,要是失去了认同地位,她大不了就放下认同,继续靠着最强驾驶员的地位乘风破浪。她还是能重振旗鼓的。</p><p>作为一个完美主义者(戏剧性人格障碍的常见症状),明日香不允许自己出丑。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-054.png" 
alt=""></p><p>然而,鸟天使开始窥探她的精神。跟其他使徒相比,鸟天使并不打算伤害任何人(的确,除了明日香以外它没有伤害任何人。即使是对明日香,与其说是充满恶意的攻击,不如说是像一个孩子想用一把铲子去探查一只小动物,却没意识到自己拿错了工具)。</p><p>美里命令明日香撤退。但她把这次作战看成是最后一次站在聚光灯下的机会。撤退就相当于彻底的失败。要是现在走开了,逃离了,她就再也无法得到关注了。</p><p>她不能让这样的事情发生。在很有真嗣风格的那一刻,她下定决心不逃跑。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-055.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-056.png" alt=""></p><p>当鸟天使继续窥探时,我们看到了明日香经常说的:她无法忍受别人看到自己的精神,内心和灵魂。她很小心地培养出一个外部形象,这个形象却是如此易碎,暴露出里面那个抑郁而有自杀倾向的孩子。对明日香来说,能将她最糟糕的回忆和被压抑的自我挖掘出来的鸟天使无疑是最可怕的噩梦。这个使徒在无意中将她的内心统统挖掘了出来。</p><p>其实我在这里想说几句题外话。就Evangelion这部作品而言,用手捂住脸的镜头是我的最爱之一,因为我喜欢那种头发晃动的样子。我本想做些动图,但写这篇分析已经很累人了。还是回到分析吧。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-057.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-058.png" alt=""></p><p>整个过程中最糟糕的是其他人都听得到她的声音。虽然我们不知道她在脑子里对鸟天使说的事情有没有被中央教条(Central Dogma)里的人听到,但这并不是没有可能的。而且从她一直在惊声尖叫来看,他们很可能听到了。</p><p>在我们讨论她所说的内容之前,要记住这一点。想象她刚刚结束这场噩梦,很清楚自己不仅刚刚经历了一场精神崩溃,而且所有人都清楚地听见她哭泣着崩溃。因为对抗使徒拯救世界是公开的作战,她崩溃的事情肯定会或多或少地被新闻报道(当然SEELE和NERV肯定会篡改一些信息,但要紧的是明日香会觉得到处都是关于自己的报道)。</p><p>鸟天使先是让明日香想起了自己的童年:她的母亲是如何成天在GEHIRN工作,从来不会专门抽出时间来陪她或是她的丈夫(这是明日香与美里的一个对比,也是与律子的对比。不过我完全可以再写一篇文章来谈谈明日香-律子对比以及在Evangelion中反复出现的被拒绝的爱慕对象原型(rejected love interest archetype));母亲是如何在被抽离部分灵魂而精神失常后终于开始展露母性;她的父亲是如何跟母亲的医生出轨而抛弃母亲;母亲又是如何为了一个人偶而抛弃她;她是如何答应母亲,只要不抛弃她,就愿意去死;母亲是如何自杀的;她是如何决定以后再也不会哭泣的(这很重要);如她自己所说,这些全都是她再也不愿想起的回忆。</p><p>她当然不想记起来。她理应是完美的,不应该有这样的背景故事,不应该像这样遭受痛苦,像这样哭泣。她在内心深处不应该是个伤痕累累、寻求关注的孩子。</p><p>最令人恐惧的是,那个作为她母亲化身的人偶没有认出她来(这部分画面在第二十二话的导演剪辑版中才有),问她是谁,这对强烈地追求独立并渴望关注的明日香来说是致命的一击。</p><p>明日香是谁?是她披上的外层形象(也是很多爱好者文化所消费的那个形象),还是里面那个崩坏了的明日香?</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-059.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-060.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-061.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-062.png" alt=""></p><p><img 
src="http://cdn.linghao.now.sh/asuka-analysis/asuka-063.png" alt=""></p><p>我们看到一个充满自信地做着自我介绍的明日香,飘动的长发突出了她的美丽,大张旗鼓地吸引了众人的关注;接下来是一个一见面就用她的口头禅「你是笨蛋吗?」来打击别人的明日香,因为任何能看穿她的人必须马上被置于低人一等的位置上,以便她能站在顶端接受来自众人的注目;然后是一个自鸣得意地说有机会来炫耀她驾驶技巧和获取关注了的明日香;最后是她与加持在船上的那一幕。看着我。那个情色化的镜头反而暴露了她是多么的幼稚和缺乏心理准备。</p><p>明日香说这些都不是真正的她。(没错,这些的确不是真正的她。)</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-064.png" alt=""></p><p>接着,她被丢进一个恐怖视界(horrorscape),在那里她看到了一个像是加持的人影。她乞求加持(在这里他是所有可能认同她的人物的代表)把她从无意识的人群中救出来。她想站出来。她穿着驾驶服来强调自己的驾驶技能足以使她鹤立鸡群。但仅仅是这一点还不足以把她从单调得像是要吞噬她的人群中拯救出来:她乞求能被认同所拯救,但就像现实中一样,这并没有发生。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-065.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-066.png" alt=""></p><p>很多爱好者(我在说你们呢,EvaGeeks Forum)把明日香问真嗣为什么会出现在自己的噩梦中的那句台词归为男女之情(比如说有人认为她想起真嗣是因为她喜欢他)。但并不是这样。下面这个镜头从加持转向真嗣,因为这不仅仅是跟真嗣而是跟加持与真嗣这对共轭有关。她无法拥有加持。她甚至无法拥有真嗣,而真嗣这个既不主动还很娘气的男孩形象已经广为传播。</p><p>更糟糕的是,在这里真嗣的脸上是一副丽那样的冷漠表情。他对这一切毫不关心,而他的不关心就是使她崩溃的最后一根稻草。她需要关注,需要别人来关注她,因为她依赖别人来获得自己的幸福。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-067.png" alt=""></p><p>在这场噩梦过后,我们看到了下面这些关于真正的明日香的镜头。这些镜头里她的台词与之前因为想要成为大人而让别人看着她的镜头里大致一样,但这回画面上出现的形象是一个哭泣的孩子(她还告诉过自己不会再哭泣了):</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-068.png" alt=""></p><p>下一幕是明日香光着身子一个人坐着,鸟天使问她是不是很孤独。然而她拒绝了鸟天使(鸟天使在她眼里是她小时候的自己的形象)。如果她能接受现在的自己并试着在自己的幸福之上重新构建信心,而不是依赖于从别人那里获得的虚假的幸福(这跟Evangelion想要表达的另一个主题有关,那就是不要依赖虚假的自信。不要像真嗣在与夜天使(Leliel)战斗时那样一激动就拼命过头最后败得一塌糊涂),她就能幸福许多,然而她却拒绝这样做。她不仅不承认自己的弱小,反而重复对自己说着那些要独立自强的准则,即使这些虚假的准则并不能使她更快乐(就像真嗣总是对自己说「我不能逃避」一样),因为除此以外她已经一无所有了。最后鸟天使问她有没有被(她的母亲)爱过。</p><p>因为明日香不爱她自己,所以她显然也没办法去爱别人,正如她在《End of Evangelion》中告诉真嗣的那样(我们会看到在那时她已经知道要如何去爱自己了)。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-069.png" alt=""></p><p>顺便,请注意幼年的明日香的发夹是故意被设计成跟二号机的四只眼睛相似的。因为在这里明日香不完全只是在拒绝过去的自己,也是在拒绝她的母亲/二号机的灵魂。这也是她的同步率下降得如此之快的原因。相反地,在《End of Evangelion》中她接受了自己的失败,也接受了母亲之后,她得以与二号机同步到足够暴走的程度,不过这一点我们之后会讲的。</p><p><img 
src="http://cdn.linghao.now.sh/asuka-analysis/asuka-070.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-071.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-072.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-073.png" alt=""></p><p>当然,就像之前的夜天使一样,鸟天使完全不听这一套鬼话。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-074.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-075.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-076.png" alt=""></p><p>鸟天使彻底摧毁了明日香的精神,而后丽用朗基努斯之枪将其消灭。明日香惊恐而又麻木地躺在插入栓里。她像胎儿那样蜷缩起来,即使在她试图否定所有甩给自己的过错时也一动不动。</p><p>爱好者们将鸟天使的行为称为「精神强暴」(我个人不喜欢这个词)是有原因的:明日香说自己被玷污了。使徒使她想起了自己所犯下的无法逃避的过错,这让她觉得被玷污了。而且她作为一名驾驶员的最后一战也失败了。她很清楚自己已经完蛋了。她什么都没有了。</p><p>当然,最糟糕的是她被救的方式。就像在电梯里那样,最后拯救了她的不是她自己,也不是她的独立,更不是她的驾驶技巧,而是人偶(指丽)。那个会因为来自等同于父亲的存在的命令而去死的人偶。明日香完蛋了,更糟的是她将自己的世界建筑于其上的那些价值观也都完蛋了。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-077.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-078.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-079.png" alt=""></p><p>这一事件之后明日香差不多是完蛋了。真嗣站在那,什么都做不了。在接下来的一话里,当丽需要支援时,明日香完全帮不到她,只是坐在那儿,根本无法跟EVA同步。她已经毫无用处。没有人再需要她了。(事实上不是这样的;真嗣和丽是把她当做朋友来关心的,而且在第二十四话以及《End of Evangelion》中也可以看出其他人也都很关心她的状况。)</p><p>紧接着,她显然是在没让任何人阻挡的情况下逃走了。事情会这样,部分是因为美里认为应该让真嗣和明日香自己作出决定。这其实是一种很好的策略,只是她放得太松了。就像真嗣在第四话中逃走,最后好不容易被剑介和NERV的特工们找到一样,明日香带着明显的自杀倾向出走了。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-080.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-081.png" 
alt=""></p><p>有些人讨论过她到底是打算怎么自杀的。有人说她是割腕(所以她才会坐在浴缸里),也有人指出她显然已经绝食很久了,因为可以看到她两颊下的阴影,突出的肋骨,两腿间的距离也远比那种动画中典型的苗条女孩要宽。这个画面算不上是福利,反而令人毛骨悚然。</p><p>我们还能看到她把衣服很整齐地叠好放在边上,这在自杀案例中也是很常见的。她坐在浴缸里,因为首先浴缸是一个经常跟自杀联系起来的地方,其次她「我讨厌所有人」的场景(见后文)也是发生在浴缸里的。在整部作品中,水作为一个通用的美学象征而反复出现,它的含义包含了生命、幸福、逃避、思绪、内心等等;而在这里,明日香坐在一个完全干涸的浴缸里,这象征着她身上已经没有丝毫的生命力。事实上,包括浴缸在内,整个屋子都是破烂的。毫无用处。没有人会到这个破烂的小屋来的。这就是这个场景的用意所在。</p><p>然而明日香并没有死,因为NERV的特工们找到了她,就像他们找到真嗣一样。正如我们所知道的,明日香最后被送到了医院,之后又被安置到二号机中。她很吃惊地发现自己还活着。我们会在讲到《End of Evangelion》时讲到这一点的。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-082.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-083.png" alt=""></p><p>明日香的补完差不多就是把之前我所讲到的内容过了一遍,所以我就只把截图放在这,让你们再回忆一下我一直在论证的观点。</p><p>其实,明日香和真嗣在本质(core,呵呵,又是一个双关)上并没有那么不同。他们都依赖于从他人那里得到的虚假的幸福,最后都因此而自食苦果。直到他们学会依赖自己的幸福时,两人才能开始互相理解。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-084.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-085.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-086.png" alt=""></p><p>(哈哈,人偶丽又在教训她了。因为即使明日香不愿意,丽还是能理解她的。顺便可以来看看丽的角色发展。从第一话那个连深潜在灵魂深处的感情都无法理解的丽,到《End of Evangelion》中那个温柔地引导真嗣和其他的人类完成并且最终拒绝补完的丽,这样的变化是惊人的。)</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-087.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-088.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-089.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-090.png" alt=""></p><p>对第二适格者来说,补完只是在简单地重复我们所知道的关于她的一切,不过是以更清晰直白的细节所表现出来的。要是有人没能领悟到其中的一些细节,这时正好可以仔细看看。</p><p>与她所有寻求认同的行为混在一起的,是明日香其实想要独自一人的事实。在戏剧性人格障碍之下,在那个寻求关注的轻浮活泼的人格之下,她充满敌意,总是与他人保持着距离。当她一个人或是很失落时,她就变得残忍而憎恨人类。这里所说的残酷并不是那种用「你是笨蛋吗?」来挖苦别人的那种残酷,而是说她能够将对自己的憎恨转化成伤害甚至摧毁别人的动力。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-091.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-092.png" alt=""></p><p><img 
src="http://cdn.linghao.now.sh/asuka-analysis/asuka-093.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-094.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-095.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-096.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-097.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-098.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-099.png" alt=""></p><p>她不想跟美里和真嗣共用厕所,也不想跟他们一起洗内衣。在这里美里和真嗣代表了所有她身边的人。有意思的是,上厕所和洗内衣都是很私人的事。明日香对于「分享」那个她创造出来的轻浮的外在形象没有问题,但她内心的一切都是只给自己一个人看的。最关键的是,她厌恶她自己。</p><p>在这一连串画面以及这一整话中,明日香都抱怨了痛经。她说到自己不想成为一个母亲(她不想当母亲却来月经跟充满母性却没有月经的丽之间的对比足以再写一篇分析),还再次提到了对自己的厌恶。明日香不想成为母亲有很多原因。比如很明显地,她母亲的自杀肯定给她留下了不小的阴影(在面对鸟天使的噩梦中出现)。但为人之母还意味着另一方面:与另一个人共享自己的身体以及整个人生。我想这是成为一名母亲最令明日香恐惧的地方。</p><p>在《End of Evangelion》中,明日香在湖的深处醒来,吃惊地发现自己还活着,低喃道,「我……还活着?」正如我之前所说的,她觉得自己不再对任何人有价值,所以自杀了。她说现在不会再有人看着她了。失去了他人的认同,她什么也没有了,一切都仿佛失去了意义。然而,在EVA的体内,明日香终于与二号机中部分的灵魂同步了起来。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-100.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-101.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-102.png" alt=""></p><p>在这里她发现了两件事:第一,她的母亲一直都在看着她,保护着她,随后还会一直提供她所极度需求的关注。第二,A.T.Field其实就是将人类的心灵隔离开来的壁垒。但与渚薰在二十四话中或是真嗣在一开始所采取的消极态度不同,明日香在领悟到这一点之后欣喜若狂。</p><p>她拥有A.T.Field,所以她不需要任何人,因为内心的那个自我会永远被那不可打破的壁垒所保护。对代表着人类隔绝(human isolation)的极端的明日香(与代表着人类融合(human union)的极端的丽形成对比)来说,A.T.Field更像是上帝的赐福。与此相对地,人类补完将是她可能经历的最恶毒的诅咒。借助A.T.Field的力量,她能将SEELE的部队和量产机系列全部打败,直到S2机关使得量产机重生为止。这场战斗到这里转向了最糟糕的结局。</p><p>在与鸟天使的战斗中我们看到,明日香把自己的成功和失败都与性联系起来,这具体是因为她总是联系到潜意识中与性有关的部分。就像她说鸟天使窥探她的内心使得自己被「玷污」了一样,她在《End of Evangelion》中的最终失败也有着很强的性象征。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-103.png" 
alt=""></p><p>她先是被朗基努斯之枪刺中,而在那个场景下朗基努斯之枪已经变成了一把象征着阴茎的武器。那些外貌也有几分象征着生殖器的量产机们开始产生反A.T.Field,促进补完的发动。它们共同发起攻击,而不是一个一个来,所以它们代表了明日香所鄙视的一切。它们紧紧包围住她进行了一场仪式般的(高能预警)轮奸。它们落到她身上,侵犯她,将她大卸八块,然后离开了。值得注意的是,如果我们抛开那堆血块不谈,可以发现二号机受到的损伤使得它看起来很像是怀孕了,它的腹部隆起,像是正在生产或是刚生产完。它高高抬起的头像是对怀孕的恐惧。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-104.png" alt=""></p><p>这种解释的可能性被明日香尖叫着捂住腹部的画面强化了。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-105.png" alt=""></p><p>这当然仅仅是受了伤,但这种受伤的方式看上去非常像怀孕。代表着明日香所鄙视的一切的量产机系列攻击了她,剥夺了她那么渴望的隔绝状态。当她试图重新启动EVA并第一次接近暴走时,量产机们用好多把象征着阴茎的武器刺穿了她,这与其说是钉死了她,不如说是反复地刺穿了她,终结了她,征服了她。这不是相对来讲还比较干净爽快的十字架上的钉刑,而是能多混乱能多恐怖就怎么来。</p><p>那么,为什么量产机们要羞辱她到如此极点呢?因为即使她已经通过人类隔绝和A.T.Field找到自己的幸福,她仍旧是躲在二号机中找到这一切的。Evangelion要传达的主题有一部分就是关于拒绝母亲,拒绝子宫般的插入栓,靠自己的双脚站立起来(因为躲在子宫里我们就可以用一种会引起豪猪的困境(Hedgehog’s Dilemma)的方式来选择性地过滤现实)。所以故事要在这里惩罚明日香,因为她用了错误的方式去追求人类隔绝。当她在《End of Evangelion》的最后时刻再次形体化时,她是在EVA已经被摧毁的情况下选择了人类隔绝。真嗣掐住了她的脖子,因为她无法逃避人际接触。作为解决豪猪的困境的一部分,明日香必须理解她最终必须跟其他人类进行交流。不过到这个时候她也已经能够理解、接受和爱自己了。</p><p>在这时,一个姑且认为是量子态的丽出现来收集明日香的灵魂。直到临补完(pre-Instrumentality)之前我们都没有再见到她。顺便我想在这里指出亚当-莉莉丝融合体(Adam-Lilith complex)在每个灵魂待收割的人面前都是以他/她想要融为一体的人的形象出现的。而在真嗣面前,它是以渚薰而不是以明日香的形象出现的。不论如何:接下来我们来看明日香和真嗣在补完中是怎么试图交流的。</p><p>当补完刚发动时,我们看到了两人之间这一系列精彩的镜头。真嗣被配上了传统的象征着男子的阳刚气概以及少年特质的蓝色,而明日香则被配上了传统的象征着女性的阴柔气质以及少女特质的粉色。这两种颜色都不是特别适合他们中的任一个。当明日香因为真嗣不理解她而冲他吼(以及很正当地责骂了他)时,我们看到了真嗣眼前一闪而过的一系列画面,悲伤而又无力,与显然被物化了的明日香的画面交织在一起。</p><p>到了这个时候,真嗣关注的只是明日香的身体(注意这些画面里都没有出现她的脸;真嗣当然是把她当作朋友来关心的,但从严格的两个自我之间的交流(ego-to-ego communication)的角度来讲,我们看到的就是下面这些):</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-106.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-107.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-108.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-109.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-110.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-111.png" 
alt=""></p><p>真嗣无力地为自己辩护,而明日香直截了当地指出他完全就是利用了她的身体(来自慰)。他应付不了活生生的,呼吸着的明日香,所以他只能把注意力放在自己脑中想象出来的她的投影,一个只会平躺在那一动不动的明日香。(庵野像美剧《The Office》里那样直直地看着镜头。)</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-112.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-113.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-114.png" alt=""></p><p>在地狱列车(Hell Train)里,两人之间的冲突随着明日香把胸部凑到真嗣的眼前(注意列车场景的第一张图中没有出现她的脸,这淡去了她的人格而强调单纯的献出身体,说明对真嗣来说任何人的身体都可以)而达到了极点。</p><p>她意识到,像美里一样,寻求认同把自己物化成了简单的一具身体,而她真正想要的是作为一个具体的个体被认同。现在她已经知道如何去爱自己(联系之前「接受母亲和接受与人隔绝」的部分),已经能够认同自己,已经不需要别人了(所以她选择了人类隔绝),但她还是能理解真嗣的感受。</p><p>我们刚刚看到真嗣经历了地狱列车,接下来明日香出现在了地狱厨房的场景里。地狱厨房就是明日香版本的地狱列车。</p><p>在整部作品中,地狱列车场景多次出现,真嗣在其中挣扎于内心的想法和感受。之所以选用列车作为背景,是因为在列车上一个人会被许多人包围,却又总是孤独的(只要想想在第一话中真嗣在空荡荡的车厢里听着音乐那一幕)。同时,虽然列车说起来总有一个目的地,但一个人完全可以永远留在车上,不依靠自己的力量,而只是被从一个地方移到另一个地方,到达一些目的地。从这个意义上来讲,地狱列车代表的是真嗣内心的问题(他的逃避型人格障碍,豪猪的困境,以及随波逐流的人生态度)以及当下的状态。</p><p>地狱厨房是明日香的地狱。首先,这是美里的厨房。明日香想要的无非是长大成为一个独立的大人,然而她却坐在自己所依赖的监护人的厨房里。其次,这是个厨房。厨房是一个家的核心,是依赖的高度象征,而且我敢说它也是母亲身份的高度象征。是的,这的的确确是明日香的地狱。而且还是个她没法彻底逃离的地狱。</p><p>这个场景里有两处值得注意的地方。第一点很明显,许多人都注意到了:在这里明日香和真嗣穿着的是跟之前的接吻场景里一样的衣服,而那个场景正是印证明日香没有得到认同的一个重要部分。第二点不是那么明显:一个咖啡杯打碎在地上,洒了一地。这看上去是随意设置的,但它其实是从真嗣告诉明日香她所深爱的加持已经死了的那个场景中提取出来的一个符号。而那一场景是明日香的认同缺失的另一个重要部分。小结一下:地狱厨房里的这些符号不仅让明日香想起了她一直没能变得真正地独立的事实,还让她想起了自己没能被认同为一个有性魅力的成年人。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-115.png" alt=""></p><p>(上图是补完过程中看到的咖啡杯,下面是真嗣告知明日香加持的死讯时出现的咖啡杯,如果你不信可以重看一遍。)</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-116.png" alt=""></p><p>在临补完时,明日香以完全隔绝的状态存在着。在这里她很快乐。她一个人待着,想不出还有什么更好的存在方式。然后真嗣突然跳了出来(注意这发生在完全补完之前。在第二十五话和第二十六话中,我们能看到补完缓慢地逐渐完成。有人提到说补完不是花了数年也至少花了好几个月,在这个过程中不同人的自我发生交互,并逐渐回归到最本质的存在)。他跑过来主动提出要一直陪着她,要帮助她。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-117.png" alt=""></p><p>他能做的太少,来的也太晚了。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-118.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-119.png" 
alt=""></p><p>明日香曾经寻求过认同,但在她意识到自己的A.T.Field之后,她不再需要任何来自真嗣的认同了。所以她让真嗣不要管她。之后真嗣就只会说「助けて(救救我)」这一句话了,因为光是最初他提出要帮助明日香就只是单纯在伤害她而已。而且事实上他只是自私地试图让明日香来帮助他。当她需要他时,他并不在她身边。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-120.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-121.png" alt=""></p><p>正如明日香所指出的,真嗣并不需要她。真嗣只是需要一个人而已。而她并不想仅仅成为「那个人」;她只想成为她自己。她想要因为自己是怎样的人而受到重视,而不是因为情况所迫使得她不得不做那个善良的好人,尤其不想为了真嗣这么做。因为真嗣在补完中首先选择的是渚薰。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-122.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-123.png" alt=""></p><p>明日香狠狠地驳斥了他的鬼话。她指出他根本不爱他自己,因为她曾经也有过那样的经历。她憎恨自己,但通过理解母亲和自己的A.T.Field,她已经没事了。但她还是拒绝他的存在,拒绝他进入自己的内心,因为她只想一个人待着。她想要与人隔绝。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-124.png" alt=""></p><p>真嗣继续寻求救赎,明日香继续拒绝。我在这里插一句,真嗣断断续续地反复说着「助けて」,先是低语而后开始咆哮起来,而明日香不断地训斥着他(这一幕的镜头看起来像是真嗣是主角而明日香是反面角色,但注意台词和表情就会发现事实上是倒过来的)。人类补完是明日香最糟的噩梦:作为不折不扣的人类隔绝的象征,她不能忍受这样的侵犯。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-125.png" alt=""></p><p>于是,作为最终发动人类补完(与渚薰融为一体)的扳机,真嗣掐住了明日香的脖子。真嗣在补完的过程中杀死了明日香,因为她不能忍受自己的存在被抹杀,成为集体中的一员。然而明日香只是冷漠地盯着真嗣,像是根本就没打算去关心之后会发生什么事一样。真嗣正要杀死她,而她仍旧不在意他半分,因为她现在已经不需要他来认同自己了。(补完仪式发动,《来吧,甜蜜的死亡(Komm, susser Tod)》的前奏响起,画面将真嗣掐死明日香与赤木直子(律子的母亲)掐死一代目的丽(幼年的丽)联系在了一起。)</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-126.png" alt=""></p><p>在继续之前,我要指出下面这个画面:</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-127.png" alt=""></p><p>在这个画面中我们可以看到两件重要的事:真嗣躺在地上,咖啡杯也掉在地上。真嗣和加持,明日香曾经试图从这两个男人身上获得认同。她说,「真遗憾」,并不是对于他们两个感到遗憾,而是为自己需要从他们身上寻求认同而感到遗憾。她现在已经超越那个境界了。</p><p>补完继续进行。我不打算讨论最初播出的半成品(指《死与新生(Death and Rebirth)》等《End of Evangelion》以外的剧场版本),因为它们在最终的成品中被剔除了,而且我自己也觉得没什么可分析的。通常只有在作品遭到审查而删除了部分片段的情况下我才会去分析额外的片段,但在这里完全不是这种情况。所以我直接跳到关于《End of Evangelion》的一个可能是最深的误解。</p><p>《End of Evangelion》的结局是100%乐观和幸福的结局,甚至比TV动画的结局更好。</p><p>我说真的,没开玩笑。</p><p>丽和渚薰向真嗣和观众解释说,每个个体只要有回来的意志,就都能回来。</p><p><img 
src="http://cdn.linghao.now.sh/asuka-analysis/asuka-128.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-129.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-130.png" alt=""></p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-131.png" alt=""></p><p>(天哪,好好看看这些画面吧。擦汗。)</p><p>与倾向于真香党的「新的亚当和夏娃」的解释不同,明日香和真嗣不会一直搁浅在那片沙滩上的。反之,任何生命形式只要希望从LCL之海回来,就都能恢复原来的样子。我们再一次深刻地听到了唯常挂在嘴边的那句话:</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-132.png" alt=""></p><p>绅士们,这就是Neon Genesis Evangelion的主题:即使处在自杀和逃避的边缘,只要还有生存下去的意志,就能够回头。那这对明日香来说意味着什么呢?</p><p>真嗣是第一个回来的,这只是因为是他控制着补完。这多亏了丽和渚薰(以及唯)决定把世界的未来交到一个抑郁的、逃避的、有自杀倾向的,但在内心深处又是能够选择生存下去的十四岁男孩的手上。第二个回来的是明日香,因为,再说一遍,她代表着终极的人类隔绝。她代表了最强烈的想要与众不同地活着的意愿。当她寻求不到外部认同时,她曾经被逼得自杀。但现在她能够自己认同自己了。她已经变成了那个尖叫着「我不想死」的女孩,那个告诉敌人自己会将他们全部杀死的女孩,那个伸出手去想要触摸太阳却被她无法控制的力量打回原形的女孩。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-133.png" alt=""></p><p>但LCL她是能轻松地控制的。所以她第一个出来了,因重塑自己的人形而精疲力尽,就跟真嗣在第二十话中那样(迎战力天使时,真嗣与初号机同步达到400%并被其吞噬,之后身体复原)。真嗣先是看到了第三次冲击(Third Impact)时出现的量子丽,又看到明日香的突然出现,他一时恐惧不已,以为自己还在补完当中。所以他再次掐住了明日香的脖子。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-134.png" alt=""></p><p>(声明:我必须在这里说明白,我不会为这次或是之前那次掐死明日香的行为做辩护。真嗣是个暴力的女性厌恶者(misogynistic),他的所作所为是完全错误的。这也是我每次看到有人觉得真嗣对明日香或多或少地有男女感情意义上的喜欢时感到愤怒的原因,因为这两个人只会在各个方面互相伤害。)</p><p>如果补完已经结束,明日香应该能感受到痛苦。她是感受到了,在冷漠地盯了真嗣几秒钟后,她抬起手轻抚过真嗣的脸庞,告诉他自己能理解他的痛苦,让他知道她能感受到自己的存在,确认了补完发生过但已经结束的事实。真嗣停手了,但他的手还停留在她的喉部颤抖着。如果你真的看过《End of Evangelion》,你会注意到有一个加长的镜头,真嗣的手指颤抖着,犹豫着要不要继续掐死明日香。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-135.png" alt=""></p><p>有趣的是,在最后这个画面中,我们看到真嗣跨坐在明日香身上,像是一个女人跨坐在一个平躺的男人身上一样。事实上这一幕是在映射不久之前的丽跨坐在真嗣身上类似于性爱的那一幕。在这里性别和角色发生了反转和变化,变得更接近于他们真正的自己。</p><p><img src="http://cdn.linghao.now.sh/asuka-analysis/asuka-136.png" 
alt=""></p><p>真嗣哭泣着崩溃了,部分是因为庆幸自己拒绝了补完重新活了下来,部分是因为对自己感到恶心。同样地,明日香说出了她那句著名的台词「気持ち悪い(真是恶心/不爽)」。这里有趣的是,根据理解和翻译的不同,明日香可能是在说自己,也可能是在说真嗣,或者是在说他们两个人。我倾向于更笼统的一种解释。明日香感到恶心是因为她刚爬到岸上,通过意识从LCL里重塑自己的人形,却再次发现身边只有真嗣一个人。但同时她也是在说人类的悲哀:她,受了伤,失败了;真嗣,在她身上哭着。等等。这种恶心的感觉,笼统来说,是针对生命得以再次开始繁衍这件事而发的。</p><p>这么说是因为Evangelion还传达了一些别的。如唯所说,任何地方都可以是天堂,但并不是任何地方都一定会成为天堂。不过我还是认为这是最振奋人心的一个结局,因为所有的生命都会回归,而这一次真嗣他们能够远比之前更好地追寻他们的幸福了。</p><p>如果你有任何问题、评论和不同意见,请给我发消息!这篇文章写得很匆忙,所以我知道它肯定不完美,而且英语也不是我的母语。不过如果你坚持读到了最后,我还是要祝贺你!</p><p>【完】</p><p>【转载请保留出处,未经许可不得将本文用于商业用途。】</p>]]></content>
<summary type="html">
<p>【高能预警:巨量图片】</p>
<p>准备好了伙计们,是时候来一篇关于惣流·明日香·兰格雷的重量级分析了。</p>
</summary>
<category term="ACG" scheme="http://dnc1994.com/categories/ACG/"/>
</entry>
</feed>