# Looping Over Data Sets


<strong>Questions</strong>
  <ul>
    <li><p>How can I process many data sets with a single command?</p>
    </li>
  </ul>
<strong>Objectives</strong>
  <ul>
	<li><p>Be able to read and write globbing expressions that match sets of files.</p>
    </li>
	<li><p>Use glob to create lists of files.</p>
    </li>
	<li><p>Write for loops to perform operations on files given their names in a list.</p>
    </li>
  </ul>

<h2 id="use-a-for-loop-to-process-files-given-a-list-of-their-names">Use a <code class="language-plaintext highlighter-rouge">for</code> loop to process files given a list of their names.</h2>

<ul>
  <li>A filename is a character string.</li>
  <li>Lists can contain character strings, so a list can hold the names of the files to process.</li>
</ul>


In [None]:
import pandas as pd
for filename in ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']:
    data = pd.read_csv(filename, index_col='country')
    print(filename, data.min())

<h2 id="use-globglob-to-find-sets-of-files-whose-names-match-a-pattern">Use <a href="https://docs.python.org/3/library/glob.html#glob.glob"><code class="language-plaintext highlighter-rouge">glob.glob</code></a> to find sets of files whose names match a pattern.</h2>

<ul>
  <li>In Unix, the term “globbing” means “matching a set of files with a pattern”.</li>
  <li>The most common patterns are:
    <ul>
      <li><code class="language-plaintext highlighter-rouge">*</code> meaning “match zero or more characters”</li>
      <li><code class="language-plaintext highlighter-rouge">?</code> meaning “match exactly one character”</li>
    </ul>
  </li>
  <li>Python’s standard library contains the <a href="https://docs.python.org/3/library/glob.html"><code class="language-plaintext highlighter-rouge">glob</code></a> module, which provides pattern-matching functionality.</li>
  <li>The <a href="https://docs.python.org/3/library/glob.html"><code class="language-plaintext highlighter-rouge">glob</code></a> module contains a function, also called <code class="language-plaintext highlighter-rouge">glob</code>, that matches file patterns.</li>
  <li>E.g., <code class="language-plaintext highlighter-rouge">glob.glob('*.txt')</code> matches all files in the current directory 
whose names end with <code class="language-plaintext highlighter-rouge">.txt</code>.</li>
  <li>The result is a (possibly empty) list of character strings, returned in arbitrary order.</li>
</ul>
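The difference between `*` and `?` can be tried out without any of the lesson's data. The sketch below is only an illustration: it creates a temporary directory with a few empty files whose names (`run1.txt`, `run10.txt`, and so on) are made up for this demo, then globs them. The results are passed through `sorted` because `glob.glob` does not promise any particular order.

```python
import glob
import os
import tempfile

# Build a throwaway directory holding a few empty files.
# The file names are invented for this demo; they are not part of the lesson data.
with tempfile.TemporaryDirectory() as tmp:
    for name in ['run1.txt', 'run2.txt', 'run10.txt', 'notes.md']:
        open(os.path.join(tmp, name), 'w').close()

    # '*' matches zero or more characters, so every .txt file is found.
    star = sorted(os.path.basename(p) for p in glob.glob(os.path.join(tmp, '*.txt')))

    # '?' matches exactly one character, so 'run10.txt' is left out.
    question = sorted(os.path.basename(p) for p in glob.glob(os.path.join(tmp, 'run?.txt')))

    # A pattern that matches nothing gives an empty list, not an error.
    nothing = glob.glob(os.path.join(tmp, '*.pdb'))

print(star)      # ['run1.txt', 'run10.txt', 'run2.txt']
print(question)  # ['run1.txt', 'run2.txt']
print(nothing)   # []
```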


In [None]:
import glob
print('all csv files in data directory:', glob.glob('data/*.csv'))

In [None]:
print('all PDB files:', glob.glob('*.pdb'))

<h2 id="use-glob-and-for-to-process-batches-of-files">Use <code class="language-plaintext highlighter-rouge">glob</code> and <code class="language-plaintext highlighter-rouge">for</code> to process batches of files.</h2>

<ul>
  <li>It helps a lot if the files are named and stored systematically and consistently,
so that simple patterns will find the right data.</li>
</ul>

In [None]:
for filename in glob.glob('data/gapminder_*.csv'):
    data = pd.read_csv(filename)
    print(filename, data['gdpPercap_1952'].min())

<ul>
  <li>This pattern matches the file holding the whole data set as well as the per-region files.</li>
  <li>Use a more specific pattern in the exercises to exclude the whole data set.</li>
  <li>But note that the minimum of the entire data set is also the minimum of one of the regional data sets,
which is a nice check on correctness.</li>
</ul>
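The correctness check mentioned above can be sketched on a toy example. Everything here is an assumption made for illustration: the numbers are invented (they are not Gapminder values), and the file names `demo_gdp_north.csv`, `demo_gdp_south.csv`, and `demo_all.csv` are hypothetical stand-ins written to a temporary directory.

```python
import glob
import os
import tempfile

import pandas as pd

# Made-up numbers for illustration only; these are not Gapminder values.
regions = {
    'north': pd.DataFrame({'country': ['A', 'B'], 'gdpPercap_1952': [300.0, 950.0]}),
    'south': pd.DataFrame({'country': ['C', 'D'], 'gdpPercap_1952': [120.0, 480.0]}),
}

with tempfile.TemporaryDirectory() as tmp:
    # Write one file per region plus a combined file, all sharing a prefix.
    for name, frame in regions.items():
        frame.to_csv(os.path.join(tmp, 'demo_gdp_' + name + '.csv'), index=False)
    combined = pd.concat(list(regions.values()))
    combined.to_csv(os.path.join(tmp, 'demo_all.csv'), index=False)

    # Loop over every matching file and record its minimum.
    minima = {}
    for filename in sorted(glob.glob(os.path.join(tmp, 'demo_*.csv'))):
        data = pd.read_csv(filename)
        minima[os.path.basename(filename)] = data['gdpPercap_1952'].min()

# The minimum of the combined file should equal the smallest regional minimum.
regional = [value for name, value in minima.items() if name != 'demo_all.csv']
print(minima)
print(minima['demo_all.csv'] == min(regional))
```

If the two values ever disagreed, that would suggest a file was missed by the pattern or read incorrectly.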

---
# Exercises

  ## 1) Determining Matches
  
  <p>Which of these files is <em>not</em> matched by the expression <code class="language-plaintext highlighter-rouge">glob.glob('data/*as*.csv')</code>?</p>

  <ol>
    <li><code class="language-plaintext highlighter-rouge">data/gapminder_gdp_africa.csv</code></li>
    <li><code class="language-plaintext highlighter-rouge">data/gapminder_gdp_americas.csv</code></li>
    <li><code class="language-plaintext highlighter-rouge">data/gapminder_gdp_asia.csv</code></li>
    <li>1 and 2 are not matched.</li>
  </ol>

  <blockquote class="solution">
    <h2 id="solution">Solution</h2>
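    <p>1 is not matched by the glob: the name <code class="language-plaintext highlighter-rouge">gapminder_gdp_africa.csv</code> does not contain the letters <code class="language-plaintext highlighter-rouge">as</code>, while the other two names do.</p>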
  </blockquote>
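One way to check this answer without having the files on disk is to test the names as plain strings with the standard library's `fnmatch` module, which `glob` relies on for its pattern matching. This is only a sketch: it does not look at the file system at all, and `fnmatch` lets `*` match a `/` as well, which `glob` does not, although that makes no difference for these three names.

```python
import fnmatch

pattern = 'data/*as*.csv'
candidates = [
    'data/gapminder_gdp_africa.csv',
    'data/gapminder_gdp_americas.csv',
    'data/gapminder_gdp_asia.csv',
]

# fnmatchcase compares the strings exactly as written (case-sensitive on every platform).
for name in candidates:
    print(name, fnmatch.fnmatchcase(name, pattern))

matched = [name for name in candidates if fnmatch.fnmatchcase(name, pattern)]
```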

  ## 2) Minimum file size
  
  <p>Modify this program so that it prints the number of records in
the file that has the fewest records.</p>

  <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">glob</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="n">fewest</span> <span class="o">=</span> <span class="n">____</span>
<span class="k">for</span> <span class="n">filename</span> <span class="ow">in</span> <span class="n">glob</span><span class="p">.</span><span class="n">glob</span><span class="p">(</span><span class="s">'data/*.csv'</span><span class="p">):</span>
    <span class="n">dataframe</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">____</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span>
    <span class="n">fewest</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">____</span><span class="p">,</span> <span class="n">dataframe</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="s">'smallest file has'</span><span class="p">,</span> <span class="n">fewest</span><span class="p">,</span> <span class="s">'records'</span><span class="p">)</span>
</code></pre></div>  </div>
  <p>Note that the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html"><code class="language-plaintext highlighter-rouge">shape</code> attribute</a>
holds a tuple with the number of rows and columns of the data frame.</p>
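As a small illustration of what `shape` holds, the sketch below builds a tiny data frame in memory (the values are made up) and unpacks the tuple. Note that `shape` is read without parentheses.

```python
import pandas as pd

# A tiny made-up table: three rows, two columns.
frame = pd.DataFrame({'country': ['A', 'B', 'C'], 'gdp': [1.0, 2.0, 3.0]})

# shape is (number of rows, number of columns).
n_rows, n_cols = frame.shape
print(frame.shape)    # (3, 2)
print(frame.shape[0]) # 3
```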

  <blockquote class="solution">
    <h2 id="solution-1">Solution</h2>
    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">glob</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="n">fewest</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="s">'Inf'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">filename</span> <span class="ow">in</span> <span class="n">glob</span><span class="p">.</span><span class="n">glob</span><span class="p">(</span><span class="s">'data/*.csv'</span><span class="p">):</span>
    <span class="n">dataframe</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span>
    <span class="n">fewest</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">fewest</span><span class="p">,</span> <span class="n">dataframe</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="s">'smallest file has'</span><span class="p">,</span> <span class="n">fewest</span><span class="p">,</span> <span class="s">'records'</span><span class="p">)</span>
</code></pre></div>    </div>
  </blockquote>

## 3) Comparing data

<p>Write a program that reads in the regional data sets
and plots the average GDP per capita for each region over time
in a single chart.</p>
  <blockquote class="solution">
    <h2 id="solution-2">Solution</h2>
    <p>This solution builds a useful legend by using the string <a href="https://docs.python.org/3/library/stdtypes.html#str.split"><code class="language-plaintext highlighter-rouge">split</code></a> method to
extract the <code class="language-plaintext highlighter-rouge">region</code> from the path <code class="language-plaintext highlighter-rouge">'data/gapminder_gdp_a_specific_region.csv'</code>. The <a href="https://docs.python.org/3/library/pathlib.html"><code class="language-plaintext highlighter-rouge">pathlib</code></a> module
also provides useful abstractions for file and path manipulation, such as returning the name of a file
without the file extension.</p>
    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">glob</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
<span class="k">for</span> <span class="n">filename</span> <span class="ow">in</span> <span class="n">glob</span><span class="p">.</span><span class="n">glob</span><span class="p">(</span><span class="s">'data/gapminder_gdp*.csv'</span><span class="p">):</span>
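    <span class="n">dataframe</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="s">'country'</span><span class="p">)</span>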
    <span class="c1"># extract &lt;region&gt; from the filename, expected to be in the format 'data/gapminder_gdp_&lt;region&gt;.csv'.
</span>    <span class="c1"># we will split the string using the split method and `_` as our separator,
</span>    <span class="c1"># retrieve the last string in the list that split returns (`&lt;region&gt;.csv`), 
</span>    <span class="c1"># and then remove the `.csv` extension from that string.
</span>    <span class="n">region</span> <span class="o">=</span> <span class="n">filename</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">'_'</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">][:</span><span class="o">-</span><span class="mi">4</span><span class="p">]</span> 
    <span class="n">dataframe</span><span class="p">.</span><span class="n">mean</span><span class="p">().</span><span class="n">plot</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="n">region</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div>    </div>
  </blockquote>
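As a possible alternative to slicing off the last four characters, the region could be extracted with `pathlib`. This is a minimal sketch, assuming the file name follows the `gapminder_gdp_<region>.csv` format described in the solution; the path below is just one example of that format.

```python
from pathlib import Path

# An example path in the format the solution expects.
filename = 'data/gapminder_gdp_africa.csv'
path = Path(filename)

# .name is the final path component; .stem is that name without its extension.
print(path.name)  # gapminder_gdp_africa.csv
print(path.stem)  # gapminder_gdp_africa

# The region is whatever follows the last underscore in the stem.
region = path.stem.split('_')[-1]
print(region)     # africa
```

Using `stem` avoids hard-coding the length of the `.csv` extension, so the same line would still work if the extension were a different length.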

In [None]:
import glob
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1)
for filename in glob.glob('data/gapminder_gdp*.csv'):
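    # use the country names as the index so that mean() only sees the numeric GDP columns
    dataframe = pd.read_csv(filename, index_col='country')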
    # extract <region> from the filename, expected to be in the format 'data/gapminder_gdp_<region>.csv'.
    # we will split the string using the split method and `_` as our separator,
    # retrieve the last string in the list that split returns (`<region>.csv`), 
    # and then remove the `.csv` extension from that string.
    region = filename.split('_')[-1][:-4] 
    dataframe.mean().plot(ax=ax, label=region)
plt.legend()
plt.show()