# Lab 1

After Lab 0, your computer should be all set up for this course. Today we dive into the first concepts that will underpin everything we do for the rest of the term.

This lab's goals are:
* Be able to import data in both `numpy` and `pandas`
* Articulate the shape of data, specifying 'variables' and 'observations'
* Understand the role of unit testing in our course and in programming development more broadly

### Before starting...

Make sure that you have recently `pull`ed `course-materials`. We will need a data file from the `Data` folder for this lab.  

Also, to avoid conflicts between the course materials on the master directory and the work you do in your labs, make a copy of this file and put it in a **sub-directory** under `course-materials` called `student-labs`. The `.gitignore` file is set up to ignore any work in that folder, so nothing you save there will create conflicts with the master directory.

## Importing the necessary packages

The first coding component of any script or Jupyter Notebook is the list of imported packages. There are a few reasons for this:
1. **Programming reason** - Python executes lines in the order they are given. This means that you need to import a package before you use an element from it.
2. **Human reason** - Before running a notebook or a script, a new user wants to know immediately whether they can run it. Putting the import statements at the top of the file lets your user check whether they have everything needed to run your file.

**Note**: Failure to put your import statements as the first lines of non-commented code will result in an _automatic loss of half of the assignment's total points_.

In [1]:
import numpy as np
import pandas as pd

Did you just get an error? You might not be in the correct `conda` environment. Remember that in Lab 0, we installed our required packages into only one conda environment. So you may need to shut down the `jupyter` kernel, activate the correct environment, and then relaunch your `jupyter notebook`.
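
If the import cell fails, it can help to check which Python interpreter the kernel is using and whether it can find the two packages. The following is only a diagnostic sketch using the standard library; the list `required` is an assumption based on the two imports above, not an official course checklist.

```python
import importlib.util
import sys

# Which Python interpreter is this kernel running? Its path usually
# contains the name of the active conda environment.
print(sys.executable)

# Check whether each required package can be found, without importing it.
required = ["numpy", "pandas"]
available = {name: importlib.util.find_spec(name) is not None for name in required}
for name, found in available.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

If a package is reported as missing, the kernel is most likely running in a different environment from the one set up in Lab 0.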

#### A few notes on imports

You might be wondering why I imported these two packages this way. Partially this is due to what I've seen others do, but a better reason is the one articulated in 'Python for Data Analysis' on page 90:
> ... throughout the book, I use the standard NumPy convention of always using `import numpy as np`. You are, of course, welcome to put `from numpy import *` in your code to avoid having to write `np.`, but I advise against making a habit of this. The `numpy` namespace is large and contains a number of functions whose names conflict with built-in Python functions (like `min` and `max`).

You will even notice that in the help files for [numpy](https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html) and [pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#basics) these conventions are used. 
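
To make the name conflict from the quotation concrete, here is a small illustration (the list `nested` is made up for this example). The built-in `min` and `np.min` share a name but treat the same input differently, which is why keeping the `np.` prefix is safer than `from numpy import *`.

```python
import numpy as np

nested = [[3, 1], [2, 5]]

# Built-in min compares the two inner lists and returns the smaller list.
builtin_result = min(nested)

# np.min treats the input as a 2-D array and returns the smallest number.
numpy_result = np.min(nested)

print(builtin_result)  # [2, 5]
print(numpy_result)    # 1
```

With the `np.` prefix, it is always clear which of the two functions is being called.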

## Importing Data

When working with data, we first need to import that data. One can do this using either `pandas` or `numpy`. In this lab, we will work with both methods for the sake of completeness.

### Importing with `numpy`

Importing with `numpy` will bring in your data as a `numpy array`. The most straightforward way to do this is using `genfromtxt()`. To learn more about this function, read its [help file](https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html). 

In [2]:
ffire_np = np.genfromtxt("../Data/forestfires.csv", delimiter=',')

Taking this apart piece by piece, let's look at what we just did:
1. `ffire_np` is a variable
2. `np.` tells us to reach into the `numpy` library
3. `genfromtxt` is the specific function that we want to call
4. The first argument (i.e. "../Data/forestfires.csv") is the name of the file.   
   Notice that I used the `..` to tell my machine to look one level above my current folder (i.e. the 'parent' directory), then step into the `Data` sub-directory and finally select the `forestfires` file. 
5. The second argument `delimiter=` tells `genfromtxt` which character separates one value from the next. A comma is a safe choice here since `csv` stands for `c`omma-`s`eparated `v`alues.

Let's take a peek at what our variable looks like:

In [3]:
print(ffire_np)

[[  nan   nan   nan ...   nan   nan   nan]
 [ 7.    5.     nan ...  6.7   0.    0.  ]
 [ 7.    4.     nan ...  0.9   0.    0.  ]
 ...
 [ 7.    4.     nan ...  6.7   0.   11.16]
 [ 1.    4.     nan ...  4.    0.    0.  ]
 [ 6.    3.     nan ...  4.5   0.    0.  ]]


Is this what we expect to see? Open the datafile using your favorite spreadsheet viewer. 

What do you see? Does it match the above? 

There is a third argument that we can use with `genfromtxt` to ignore the column headers: `skip_header=1`. In the below code block use `genfromtxt` with three arguments to re-import the `forestfires` data and print the result. 

In [None]:
# Import forestfire data here
ffire_np = 

# Take a look at the result here




Again, is this what you expect to see? Compare this with the open datafile in your favorite spreadsheet viewer. 

#### Limitations in `numpy`

`numpy` can have a hard time with non-numerical data. When `numpy` doesn't know what to do with cells that are not numbers, it replaces those cells with `nan` or `n`ot `a` `n`umber. 

Take a look at both of your outputs, and compare them again to the open datafile in your favorite spreadsheet viewer. Which rows and columns have `nan`s?
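
One way to check your answer is to ask `numpy` directly where the `nan`s are. The sketch below uses a tiny made-up sample held in memory (not the real `forestfires` file) so that it runs on its own; the names `sample`, `toy`, `nan_rows`, and `nan_cols` are only illustrative.

```python
from io import StringIO

import numpy as np

# A tiny made-up sample in the same style as the lab data (not the real file).
sample = StringIO("X,Y,month\n7,5,mar\n7,4,oct\n")
toy = np.genfromtxt(sample, delimiter=',')

print(toy)

# True for every row / column that contains only nan values.
nan_rows = np.isnan(toy).all(axis=1)  # the header row cannot be read as numbers
nan_cols = np.isnan(toy).all(axis=0)  # the text column cannot be read as numbers
print(nan_rows)
print(nan_cols)
```

The same two `np.isnan` lines should work on `ffire_np` as well.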

#### What? Why?

This doesn't feel particularly useful given that so much data contains information that is non-numerical. However, a closer examination of what `numpy` stands for makes it a bit clearer why `numpy` does this. With each reference to `numpy`, I've used the `code` formatting, but the name of this package is **NumPy** or **Num**erical **Py**thon. This package is for the fast processing of numerical data leveraging tricks from linear algebra. 

`pandas` can offer us a bit more flexibility. Let's try importing data with it. 

### Importing with `pandas`

`pandas` has more functionality than `numpy` when it comes to data with non-numerical values. While our book doesn't delve into the root of the name `pandas`, according to [this site](https://towardsdatascience.com/a-quick-introduction-to-the-pandas-python-library-f1b678f34673), `pandas` is short for `P`ython `Da`ta `An`aly`s`is. (Clearly I'm not sure how they got 'pandas' from those three words, so if you would like to instead just imagine a group of cuddly pandas, that's fine with me.)

Returning to how we can import using `pandas`, the most common method that we will use is `read_csv()`.  To learn more about this function, read its [help file](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). 

In [4]:
ffire_pd = pd.read_csv("../Data/forestfires.csv", sep=',')

Taking this apart piece by piece, let's look at what we just did:
1. `ffire_pd` is again just a variable
2. `pd.` tells us to reach into the `pandas` library
3. `read_csv` is the specific function that we want to call
4. The first argument is the name of the file (just like in `genfromtxt()` from above).   
5. The second argument `sep=` is the pandas version of `delimiter=`, telling `read_csv` which character separates one value from the next. (Why are we using a comma as the separator here?)
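
The same call pattern can be tried on a tiny in-memory sample, which makes it easy to see what `read_csv` does with the first line and with text values. This is only a sketch: the sample data and the names `sample` and `toy_pd` are made up for illustration and are not the real `forestfires` file.

```python
from io import StringIO

import pandas as pd

# A tiny made-up sample (not the real forestfires file).
sample = StringIO("X,Y,month\n7,5,mar\n7,4,oct\n")
toy_pd = pd.read_csv(sample, sep=',')

# The first line became the column names, and the text column survived.
print(toy_pd.columns.tolist())   # ['X', 'Y', 'month']
print(toy_pd["month"].tolist())  # ['mar', 'oct']
```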

Let's take a peek at what our variable looks like. Note that we're going to do this both with and without the `print()` function.

In [6]:
ffire_pd

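|   | X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area |
|---|---|---|-------|-----|------|-----|----|-----|------|----|------|------|------|
| 0 | 7 | 5 | mar | fri | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0.0 | 0.00 |
| 1 | 7 | 4 | oct | tue | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0.0 | 0.00 |
| 2 | 7 | 4 | oct | sat | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0.0 | 0.00 |
| 3 | 8 | 6 | mar | fri | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0.2 | 0.00 |
| 4 | 8 | 6 | mar | sun | 89.3 | 51.3 | 102.2 | 9.6 | 11.4 | 99 | 1.8 | 0.0 | 0.00 |
| 5 | 8 | 6 | aug | sun | 92.3 | 85.3 | 488.0 | 14.7 | 22.2 | 29 | 5.4 | 0.0 | 0.00 |
| 6 | 8 | 6 | aug | mon | 92.3 | 88.9 | 495.6 | 8.5 | 24.1 | 27 | 3.1 | 0.0 | 0.00 |
| 7 | 8 | 6 | aug | mon | 91.5 | 145.4 | 608.2 | 10.7 | 8.0 | 86 | 2.2 | 0.0 | 0.00 |
| 8 | 8 | 6 | sep | tue | 91.0 | 129.5 | 692.6 | 7.0 | 13.1 | 63 | 5.4 | 0.0 | 0.00 |
| 9 | 7 | 5 | sep | sat | 92.5 | 88.0 | 698.6 | 7.1 | 22.8 | 40 | 4.0 | 0.0 | 0.00 |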


In [7]:
print(ffire_pd)

     X  Y month  day  FFMC    DMC     DC   ISI  temp  RH  wind  rain   area
0    7  5   mar  fri  86.2   26.2   94.3   5.1   8.2  51   6.7   0.0   0.00
1    7  4   oct  tue  90.6   35.4  669.1   6.7  18.0  33   0.9   0.0   0.00
2    7  4   oct  sat  90.6   43.7  686.9   6.7  14.6  33   1.3   0.0   0.00
3    8  6   mar  fri  91.7   33.3   77.5   9.0   8.3  97   4.0   0.2   0.00
4    8  6   mar  sun  89.3   51.3  102.2   9.6  11.4  99   1.8   0.0   0.00
..  .. ..   ...  ...   ...    ...    ...   ...   ...  ..   ...   ...    ...
512  4  3   aug  sun  81.6   56.7  665.6   1.9  27.8  32   2.7   0.0   6.44
513  2  4   aug  sun  81.6   56.7  665.6   1.9  21.9  71   5.8   0.0  54.29
514  7  4   aug  sun  81.6   56.7  665.6   1.9  21.2  70   6.7   0.0  11.16
515  1  4   aug  sat  94.4  146.0  614.7  11.3  25.6  42   4.0   0.0   0.00
516  6  3   nov  tue  79.5    3.0  106.7   1.1  11.8  31   4.5   0.0   0.00

[517 rows x 13 columns]


#### Before we go on:
1. What do you see in both of these views of the output `ffire_pd`? 
2. What are the differences between these views?
3. How does `ffire_pd` compare to the datafile in your favorite spreadsheet viewer?
4. How do `ffire_pd` and `ffire_np` differ? 

### Data Frames and Series

Data types are something that we pay a lot of attention to in computer science. So before moving much further, we should ask ourselves: what kinds of objects did we just create? (How do we check the **type** of an object in Python?)

In [14]:
# Print the data types for ffire_np and ffire_pd here 




We'll turn our attention to the `pandas` version: `ffire_pd`. This variable is a `pandas` **_DataFrame_**, which is made up of **_Series_**. DataFrames look like the spreadsheets that we are used to seeing, and they can contain both numerical and non-numerical values. Each column of the DataFrame is a Series, and selecting a single row also gives you back a Series.
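
The relationship between a DataFrame and its Series can be seen on a small made-up DataFrame (the name `toy` and its values are only for illustration). Pulling out one column, or one row, gives back a Series.

```python
import pandas as pd

# A small made-up DataFrame with one numerical and one text column.
toy = pd.DataFrame({"X": [7, 7], "month": ["mar", "oct"]})

column = toy["X"]   # select one column by name
row = toy.iloc[0]   # select the first row by position

print(type(toy).__name__)     # DataFrame
print(type(column).__name__)  # Series
print(type(row).__name__)     # Series
```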

## Shape of Data

Data, like spreadsheets, have a shape. Most data that we will work with in this class will have a shape that can be described by the number of rows and the number of columns. When referring to data, we think of **_Observations_** and _**Variables**_. 
* Each observation is a row. Think of an observation as an object, person, or item that we have a set of information on. 
* The variables are stored as columns. The variables detail the kind of information that we have stored on each observation.

Looking at your dataframe and referring to the data's [code book](http://archive.ics.uci.edu/ml/datasets/Forest+Fires), what are the observations in this dataset? What are the variables? _(Note: A code book is a document that provides details on a dataset.)_

#### Finding the shape of your data

Once you have the data loaded, you can quickly generate a number of facts about your data:
1. The dimensions of your data (i.e. the number of observations and variables) using `.shape`
2. The names of the variables using `.columns`
3. A quick glance at the first few rows of your data using `.head()`

In [23]:
# 1. Using `.shape` on ffire_pd
print(ffire_pd.shape)

# 2. Using `.columns` on ffire_pd
var_names = ffire_pd.columns
print(var_names)

# 3. Using `.head` on ffire_pd
glance = ffire_pd.head()
print(glance)

(517, 13)
Index(['X', 'Y', 'month', 'day', 'FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH',
       'wind', 'rain', 'area'],
      dtype='object')
   X  Y month  day  FFMC   DMC     DC  ISI  temp  RH  wind  rain  area
0  7  5   mar  fri  86.2  26.2   94.3  5.1   8.2  51   6.7   0.0   0.0
1  7  4   oct  tue  90.6  35.4  669.1  6.7  18.0  33   0.9   0.0   0.0
2  7  4   oct  sat  90.6  43.7  686.9  6.7  14.6  33   1.3   0.0   0.0
3  8  6   mar  fri  91.7  33.3   77.5  9.0   8.3  97   4.0   0.2   0.0
4  8  6   mar  sun  89.3  51.3  102.2  9.6  11.4  99   1.8   0.0   0.0
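
Since `.shape` is a `(rows, columns)` tuple, it can be unpacked straight into the number of observations and the number of variables. The following is a small sketch on a made-up DataFrame; `toy`, `n_observations`, and `n_variables` are illustrative names only.

```python
import pandas as pd

# A small made-up DataFrame: 3 rows (observations), 2 columns (variables).
toy = pd.DataFrame({"X": [7, 7, 8], "month": ["mar", "oct", "mar"]})

# .shape is a (rows, columns) tuple, so it unpacks into two counts.
n_observations, n_variables = toy.shape
print(f"{n_observations} observations, {n_variables} variables")  # 3 observations, 2 variables
```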


## Powerhouse packages

As `python` and data science have grown and evolved, several packages (including `numpy`, `pandas`, `scipy`, and `matplotlib`) have emerged as critical to the practice of machine learning. In this course, we will use these packages, but with an eye towards a deep understanding of each method that we employ. 

This course has twin themes: **carpentry** and **creativity**. Most of the assignments in the course focus on the former in service of the latter. Think of this course as a kind of cooking course where we will spend a considerable amount of time on each ingredient and the most basic of recipes, with the ultimate goal of creating a glorious meal by mixing and blending the ingredients in unexpected ways with each other and with new ingredients. 

**Carpentry**: We are going to focus on the interior elements of classic machine learning algorithms, building each one from scratch. We will compare our results to already optimized versions in packages like `scikit-learn`. The goal of carpentry is _deep_ understanding of each algorithm.

**Creativity**: Building understanding of 

## Unit Testing: When are we right? 



#### Resources consulted to build this lab:

1. _Python for Data Analysis_, Chapters X, Y, and Z. 
2. [NumPy Tutorial: Data analysis with Python](https://www.dataquest.io/blog/numpy-tutorial-python/) on Dataquest
3. [A Quick Introduction to the “Pandas” Python Library](https://towardsdatascience.com/a-quick-introduction-to-the-pandas-python-library-f1b678f34673)