# Exploring data using NumPy

Our first task in this week's lesson is to learn how to read and explore data files using [NumPy](http://www.numpy.org/).
Reading data files using NumPy will make life a bit easier compared to the traditional Python way of reading data files.
If you're curious about that, you can check out some of the lesson materials from past years about [reading data in the Pythonic way](https://geo-python.github.io/2018/2017/lessons/L5/reading-data-from-file.html).

## Preparation (the key to success)

You should already have this Jupyter notebook open (if not, open it now using one of the links above).
Our first task is to change the working directory to the one containing the files for this week's lesson.
You can do that by...XXX.

## Reading a data file with NumPy

### Importing NumPy

Now we're ready to read in our temperature data file.
First, we need to import the NumPy module.

In [2]:
import numpy as np

That's it!
NumPy is now ready to use.
Notice that we have imported the NumPy module with the name `np`.

### Reading a data file

Now we'll read the file data into a variable called `data`.
We can start by defining the location (filepath) of the data file in the variable `fp`.

In [59]:
fp = 'Kumpula-June-2016-w-metadata.txt'

Now we can read the file using the NumPy `genfromtxt()` function.

In [57]:
data = np.genfromtxt(fp)

`np.genfromtxt()` is a general function for reading data files in which the values are separated by commas, spaces, or other common separators.
For a full list of parameters for this function, please refer to the [NumPy documentation for numpy.genfromtxt()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html).
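
To make the role of the separator concrete, here is a minimal sketch that reads the same values from two small in-memory "files", one comma-separated and one space-separated. The sample text and variable names are made up for illustration and are not part of the lesson data; `io.StringIO` simply stands in for a file on disk.

```python
import io

import numpy as np

# Hypothetical sample contents (not the lesson's data file)
comma_text = "1.0,2.0,3.0\n4.0,5.0,6.0"
space_text = "1.0 2.0 3.0\n4.0 5.0 6.0"

# Comma-separated values need an explicit delimiter
comma_values = np.genfromtxt(io.StringIO(comma_text), delimiter=",")

# By default, genfromtxt splits each line on whitespace
space_values = np.genfromtxt(io.StringIO(space_text))

print(comma_values.shape)
print(np.array_equal(comma_values, space_values))
```

Both calls should produce the same two-row, three-column array; the only difference is how the function is told to split each line.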

Here we use the function simply by giving the filename as an input parameter.
If all goes as planned, you should now have a new variable defined as `data` in memory that contains the contents of the data file.
You can check the contents of this variable by typing the following:

In [58]:
print(data)

[ nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan
  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan
  nan]


### Inspecting our data file

Hmm...something doesn't look right here.
You were perhaps expecting some temperature data, right?
Instead we have only a list of `nan` values.

`nan` stands for "not a number", and might indicate some problem with reading in the contents of the file.
Looks like we need to investigate this further.
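
As a small illustration of where `nan` values can come from, the sketch below reads a single made-up line that mixes numbers with text. It assumes the default settings of `np.genfromtxt()`, under which fields that cannot be converted to a number are stored as `nan` rather than raising an error. The sample line is hypothetical.

```python
import io

import numpy as np

# Hypothetical line mixing numbers with a text field
mixed = np.genfromtxt(io.StringIO("12.5,hello,7"), delimiter=",")

print(mixed)            # the text field cannot be converted to a float
print(np.isnan(mixed))  # True marks the positions that hold nan
```

`np.isnan()` is a convenient way to locate such values, since `nan` never compares equal to anything, including itself.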

We can begin our investigation by opening the data file in JupyterLab by right-clicking on the `Kumpula-June-2016-w-metadata.txt` data file and selecting **Open**.

![Opening a file in JupyterLab](img/open-text-file.png)

You should see something like the following:

We can observe a few important things:

- There are some metadata at the top of the file (a *header*) that provide basic information about its contents and source.
  This isn't data we want to process, so we need to skip over that part of the file when we load it.
    - We can skip the top header lines in the file using the `skip_header` parameter.
- The values in the data file are separated by commas.
    - We can specify the value separator using the `delimiter` parameter.
- The top row of values below the header contains names of the column variables.
    - We can tell NumPy to use those names using the `names` parameter.

### Reading our data file, round 2

Let's try reading again with this information in mind.

In [74]:
data = np.genfromtxt(fp, skip_header=8, delimiter=',', names=True)

Note that we now skip the header lines (first 8 lines) using `skip_header=8`, tell NumPy the file is comma-separated using `delimiter=','`, and tell NumPy to use the first line it reads in the data file for the names using `names=True`.
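
To see the three parameters at work on something small, here is a sketch using an in-memory stand-in for a file. The metadata lines, the column names `DATE` and `TEMP`, and the variable `sample` are all hypothetical and chosen only for this illustration; only the `genfromtxt()` parameters match the call above.

```python
import io

import numpy as np

# Hypothetical file contents: two metadata lines, a row of column names, then values
text = (
    "Illustrative metadata line 1\n"
    "Illustrative metadata line 2\n"
    "DATE,TEMP\n"
    "20160601,65.5\n"
    "20160602,65.8\n"
)

sample = np.genfromtxt(
    io.StringIO(text),
    skip_header=2,   # ignore the two metadata lines
    delimiter=",",   # values are separated by commas
    names=True,      # first line read after the header holds the column names
)

print(sample.dtype.names)
print(sample["TEMP"])
```

With `names=True` the result is a structured array, which is why each row prints as a tuple and why columns can be selected by name.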

In [72]:
print(data)

[( 20160601.,  65.5,  73.6,  54.7) ( 20160602.,  65.8,  80.8,  55. )
 ( 20160603.,  68.4,  77.9,  55.6) ( 20160604.,  57.5,  70.9,  47.3)
 ( 20160605.,  51.4,  58.3,  43.2) ( 20160606.,  52.2,  59.7,  42.8)
 ( 20160607.,  56.9,  65.1,  45.9) ( 20160608.,  54.2,  60.4,  47.5)
 ( 20160609.,  49.4,  54.1,  45.7) ( 20160610.,  49.5,  55.9,  43. )
 ( 20160611.,  54. ,  62.1,  41.7) ( 20160612.,  55.4,  64.2,  46. )
 ( 20160613.,  58.3,  68.2,  47.3) ( 20160614.,  59.7,  67.8,  47.8)
 ( 20160615.,  63.4,  70.3,  49.3) ( 20160616.,  57.8,  67.5,  55.6)
 ( 20160617.,  60.4,  70.7,  55.9) ( 20160618.,  57.3,  62.8,  54. )
 ( 20160619.,  56.3,  59.2,  54.1) ( 20160620.,  59.3,  69.1,  52.2)
 ( 20160621.,  62.6,  71.4,  50.4) ( 20160622.,  61.7,  70.2,  55.4)
 ( 20160623.,  60.9,  67.1,  54.9) ( 20160624.,  61.1,  68.9,  56.7)
 ( 20160625.,  65.7,  75.4,  57.9) ( 20160626.,  69.6,  77.7,  60.3)
 ( 20160627.,  60.7,  70. ,  57.6) ( 20160628.,  65.4,  73. ,  55.8)
 ( 20160629.,  65.8,  73.2,  59.7)

In [70]:
data

array([( 20160601.,  65.5,  73.6,  54.7),
       ( 20160602.,  65.8,  80.8,  55. ),
       ( 20160603.,  68.4,  77.9,  55.6),
       ( 20160604.,  57.5,  70.9,  47.3),
       ( 20160605.,  51.4,  58.3,  43.2),
       ( 20160606.,  52.2,  59.7,  42.8),
       ( 20160607.,  56.9,  65.1,  45.9),
       ( 20160608.,  54.2,  60.4,  47.5),
       ( 20160609.,  49.4,  54.1,  45.7),
       ( 20160610.,  49.5,  55.9,  43. ),
       ( 20160611.,  54. ,  62.1,  41.7),
       ( 20160612.,  55.4,  64.2,  46. ),
       ( 20160613.,  58.3,  68.2,  47.3),
       ( 20160614.,  59.7,  67.8,  47.8),
       ( 20160615.,  63.4,  70.3,  49.3),
       ( 20160616.,  57.8,  67.5,  55.6),
       ( 20160617.,  60.4,  70.7,  55.9),
       ( 20160618.,  57.3,  62.8,  54. ),
       ( 20160619.,  56.3,  59.2,  54.1),
       ( 20160620.,  59.3,  69.1,  52.2),
       ( 20160621.,  62.6,  71.4,  50.4),
       ( 20160622.,  61.7,  70.2,  55.4),
       ( 20160623.,  60.9,  67.1,  54.9),
       ( 20160624.,  61.1,  68.9, 

In [42]:
data['YEARMODA'][:5]


array([ 20160601.,  20160602.,  20160603.,  20160604.,  20160605.])
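
Selecting a column by name works for any of the names NumPy read from the file. The sketch below, again on a small in-memory sample, lists the available names through `dtype.names` and computes a simple statistic for one column. Only `YEARMODA` is taken from the lesson data; the `TEMP` column name and the three sample rows are assumptions made for this example.

```python
import io

import numpy as np

# Hypothetical sample; TEMP is an assumed column name
text = "YEARMODA,TEMP\n20160601,65.5\n20160602,65.8\n20160603,68.4\n"
sample = np.genfromtxt(io.StringIO(text), delimiter=",", names=True)

print(sample.dtype.names)       # names available for column selection
print(sample["YEARMODA"][:2])   # first two values of one column
print(sample["TEMP"].mean())    # each column behaves like a regular 1-D array
```

Because each named column is an ordinary one-dimensional array, slicing and methods such as `mean()` work on it directly.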