# Lab 1

After Lab 0, your computer should be all set up for this course. Today we dive into the first concepts that will underpin everything we do for the rest of the term.

This lab's goals are:
* Be able to import data in both `numpy` and `pandas`
* Articulate the shape of data, specifying 'variables' and 'observations'
* Understand the role of unit testing in our course and in programming development more broadly

### Before starting...

Make sure that you have recently `pull`ed `course-materials`. We will need a data file from the `Data` folder for this lab.  

Also, to avoid conflicts between the course materials in the master directory and the work you do in your labs, make a copy of this file and put it in a **sub-directory** of `course-materials` called `student-labs`. The `.gitignore` file has been told to ignore everything in that folder, so your work there will not create conflicts with the master directory.

## Importing the necessary packages

The first coding component of any script or Jupyter Notebook is the list of imported packages. There are a few reasons for this:
1. **Programming reason** - Python executes lines in the order they are written. This means that you need to import a package before you use an element from it.
2. **Human reason** - Before running a notebook or a script, a new user wants to know immediately whether they can run it. Putting the import statements at the top of the file lets that user check at a glance whether they have the packages your file needs.
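As a minimal sketch of the programming reason, the cell below uses the standard-library `math` module purely as a stand-in (it is not one of this lab's packages). Calling it before the import fails, and the same call works afterwards.

```python
# Python runs top to bottom, so a package cannot be used before its import.
try:
    math.sqrt(16)          # `math` has not been imported yet
    failed_early = False
except NameError as err:
    failed_early = True
    print("Before the import:", err)

import math                # from this line on, the name `math` exists

root = math.sqrt(16)
print("After the import:", root)
```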

**Note**: Failure to put your import statements as the first lines of non-commented code will result in an _automatic loss of half of the assignment's total points_.

In [1]:
import numpy as np
import pandas as pd

Did you just throw an error? You might not be in the correct `conda` environment. Remember that in Lab 0, we only installed our required packages into one conda environment. So you may need to shut down the `jupyter` kernel, activate the correct environment, and then relaunch your `jupyter notebook`.
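If you are not sure which environment your notebook is running in, a quick check like the one below may help. This is only a sketch using the Python standard library. The printed interpreter path should point inside the conda environment from Lab 0, whatever you named it.

```python
import importlib.util
import sys

# Path of the Python interpreter behind this kernel.
print("Interpreter:", sys.executable)

# find_spec looks a package up without importing it, so this cell runs
# even when a package is missing from the current environment.
status = {}
for name in ("numpy", "pandas"):
    status[name] = importlib.util.find_spec(name) is not None
    print(name, "found" if status[name] else "MISSING")
```

If either package shows up as `MISSING`, the kernel is most likely running outside the environment you set up.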

#### A few notes on imports

You might be wondering why I imported these two packages this way. Partially this is due to what I've seen others do, but a better reason is the one articulated in 'Python for Data Analysis' on page 90:
> ... throughout the book, I use the standard NumPy convention of always using `import numpy as np`. You are, of course, welcome to put `from numpy import *` in your code to avoid having to write `np.`, but I advise against making a habit of this. The `numpy` namespace is large and contains a number of functions whose names conflict with built-in Python functions (like `min` and `max`).

You will even notice that in the help files for [numpy](https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html) and [pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#basics) these conventions are used. 
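To see why a name clash would matter, here is a small sketch comparing Python's built-in `max` with `np.max` on the same made-up nested list. The `np.` prefix keeps the two functions apart. With `from numpy import *`, it would be much harder to tell which one a bare `max` refers to.

```python
import numpy as np

rows = [[1, 5], [3, 0]]  # made-up values, for illustration only

# The built-in compares the inner lists element by element and returns
# the "largest" inner list.
builtin_result = max(rows)

# NumPy treats the input as a 2-D array and returns its largest entry.
numpy_result = np.max(rows)

print(builtin_result)
print(numpy_result)
```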

## Importing Data

When working with data, we first need to import that data. One can do this using either `pandas` or `numpy`. In this lab, we will work with both methods for the sake of completeness.

### Importing with `numpy`

Importing with `numpy` will bring in your data as a `numpy` array. The most straightforward way to do this is using `genfromtxt()`. To learn more about this function, read its [help file](https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html).

In [7]:
ffire_np = np.genfromtxt("../Data/forestfires.csv", delimiter=',')

Taking this apart piece by piece, let's look at what we just did:
1. `ffire_np` is a variable
2. `np.` tells us to reach into the `numpy` library
3. `genfromtxt` is the specific function that we want to call
4. The first argument (i.e. "../Data/forestfires.csv") is the name of the file.   
   Notice that I used the `..` to tell my machine to look one level above my current folder (i.e. the 'parent' directory), then step into the `Data` sub-directory and finally select the `forestfires` file. 
5. The second argument `delimiter=` tells `genfromtxt` which character separates one value from the next on each line. It is safe to use a comma as our separator since `csv` stands for `c`omma `s`eparated `v`alues.
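The same call can be tried on a tiny example before pointing it at a real file. The sketch below is illustrative only. The numbers are made up, and the text is wrapped in `io.StringIO` so that `genfromtxt` can read it as if it were a file on disk.

```python
import io

import numpy as np

# Made-up CSV text: two lines (observations), three values per line (variables).
csv_text = "1.5,2.0,3.0\n4.0,5.5,6.0\n"

# io.StringIO lets genfromtxt read the string as though it were a file.
toy = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

print(toy)
print(toy.shape)  # (number of rows, number of columns)
```

The `shape` attribute reports rows first and columns second, which here corresponds to 2 observations of 3 variables.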

Let's take a peek at what our variable looks like:

In [8]:
print(ffire_np)

[[  nan   nan   nan ...   nan   nan   nan]
 [ 7.    5.     nan ...  6.7   0.    0.  ]
 [ 7.    4.     nan ...  0.9   0.    0.  ]
 ...
 [ 7.    4.     nan ...  6.7   0.   11.16]
 [ 1.    4.     nan ...  4.    0.    0.  ]
 [ 6.    3.     nan ...  4.5   0.    0.  ]]


Is this what we expect to see? Open the datafile using your favorite spreadsheet viewer. 

What do you see? Does it match the above? 

There is a third argument that we can use with `genfromtxt` to ignore the column headers: `skip_header=1`. In the code block below, use `genfromtxt` with three arguments to re-import the `forestfires` data and print the result.

In [None]:
# Import forestfire data here
ffire_np = 

# Take a look at the result here




Again, is this what you expect to see? Compare this with the open datafile in your favorite spreadsheet viewer. 

#### Limitations in `numpy`

`numpy` can have a hard time with non-numerical data. When `numpy` doesn't know what to do with cells that are not numbers, it replaces those cells with `nan` or `n`ot `a` `n`umber. 

Take a look at both of your outputs, and compare them again to the open datafile in your favorite spreadsheet viewer. Which rows and columns have `nan`s?
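For a larger file, scanning a printout by eye gets tedious. As a rough sketch, the same question can be asked in code on a small made-up CSV. The column names and values below are invented for illustration and held in memory, not read from `forestfires.csv`. `np.isnan` marks every `nan` cell, and `.all(axis=...)` reports which rows or columns are `nan` all the way through.

```python
import io

import numpy as np

# Made-up CSV text with a header line and one text column ("month").
csv_text = "x,month,temp\n7,mar,8.2\n7,oct,18.0\n"
toy = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

nan_mask = np.isnan(toy)            # True wherever a cell is nan
all_nan_rows = nan_mask.all(axis=1)  # rows made up entirely of nan
all_nan_cols = nan_mask.all(axis=0)  # columns made up entirely of nan

print(toy)
print("Rows that are all nan:   ", all_nan_rows)
print("Columns that are all nan:", all_nan_cols)
```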

#### What? Why?

This doesn't feel particularly useful, given that so much data contains non-numerical information. However, a closer look at what `numpy` stands for makes it a bit clearer why `numpy` does this. With each reference to `numpy`, I've used the `code` formatting, but the name of this package is **NumPy**, or **Num**erical **Py**thon. This package is built for the fast processing of numerical data, leveraging tricks from linear algebra.
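As a small illustration of that numerical focus, the sketch below stores a few made-up temperatures in an array and converts them all in one expression, with no loop. The values and variable names are hypothetical.

```python
import numpy as np

# Made-up temperatures in degrees Celsius.
temps_c = np.array([8.2, 18.0, 14.6])

# One arithmetic expression is applied to every element of the array.
temps_f = temps_c * 9 / 5 + 32

print(temps_c.dtype)  # every element of the array shares one numeric type
print(temps_f)
```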

`pandas` can offer us a bit more flexibility. Let's try importing data with it. 

### Importing with `pandas`

#### Resources consulted to build this lab:

1. _Python for Data Analysis_, Chapters X, Y, and Z. 
2. [NumPy Tutorial: Data analysis with Python](https://www.dataquest.io/blog/numpy-tutorial-python/) on Dataquest
3. [A Quick Introduction to the “Pandas” Python Library](https://towardsdatascience.com/a-quick-introduction-to-the-pandas-python-library-f1b678f34673)