# `datascience` Library Demo Notebook

_Notebook created by Chris Pyles_

This notebook is intended to give you some basic information on manipulating rectangular data using the `datascience` library. The `datascience` library is a Python module developed at UC Berkeley and used in the course Data 8: Foundations of Data Science. This notebook covers basic table operations using this library.

<!--

**Table of Contents**
1. [Dependencies](#Dependencies)
2. [Loading Data](#Loading-Data)
3. [Moving Between `pandas` and `datascience`](#Moving-Between-pandas-and-datascience)
4. [Rows and Columns](#Rows-and-Columns)
5. [Accessing Values](#Accessing-Values)
6. [Missing Values](#Missing-Values)
7. [Descriptive Statistics](#Descriptive-Statistics)
8. [Grouping](#Grouping)
9. [Manipulating Values](#Manipulating-Values)
10. [Exporting Figures](#Exporting-Figures)
11. [Exporting Data](#Exporting-Data)
12. [Conclusion](#Conclusion)

-->

### Dependencies

In the cell below we load the dependencies for this notebook.

In [None]:
from datascience import *
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
from IPython.display import display

### Loading Data

The method that `datascience` provides for reading in data defaults to reading CSV files. The function, `Table.read_table()`, takes as its argument a relative path to the data file. In the cell below, we load the datasets we will be using for this demo.

In [None]:
trips = Table.read_table('data/trips.csv')
stations = Table.read_table('data/stations.csv')
trips.show(5)
stations.show(5)

We use the `Table.show()` method above to display the first 5 rows of each table. With no argument, this method displays every row, so calling `trips.show()` would have displayed all 354,152 rows of that table.

If you have files that use other delimiters, you can pass the `sep` argument of `pd.read_csv()` to `Table.read_table()` and the file will be read in correctly.

In [None]:
Table.read_table("data/trips.tsv", sep="\t").show(5)

If you have data formatted in ways other than delimited files, these need to be loaded into `pandas` first before being transferred to `datascience`. An example call is given below.

```python
# load data into pandas
trips_df = pd.read_json("data/trips.json")

# transfer to datascience
trips_tbl = Table.from_df(trips_df)
```

### Moving Between `pandas` and `datascience`

As noted above, it is possible to transfer your data between `pandas` and `datascience`. The functions to do this are provided in the `datascience` library; `Table.from_df()` takes a DataFrame and returns a Table and `Table.to_df()` turns the Table into a DataFrame.

```python
# pandas to datascience
tbl = Table.from_df(df)

# datascience to pandas
df = tbl.to_df()
```

### Rows and Columns 

To get row and column counts, the `datascience` library provides the `num_rows` and `num_columns` attributes, which are self-explanatory.

In [None]:
trips.num_rows, trips.num_columns

To access the labels of the columns, `datascience` has `labels`, which is a tuple containing the column labels in numerical index order.

In [None]:
trips.labels

To add columns to a table, you pass a single label and set of values to `.with_column()` or an alternating sequence of labels and values to `.with_columns()` (both shown below). **These methods do not edit the original table, so these modifications can only be saved by assigning the result to the name of the table or to a new variable name.**

In [None]:
# adding a single column
some_random_numbers = np.random.uniform(0, 10, trips.num_rows)
trips.with_column("Random Numbers", some_random_numbers)

In [None]:
# adding multiple columns
some_more_random_numbers = np.random.normal(0, 10, trips.num_rows)
trips.with_columns(
    "Random Numbers", some_random_numbers,
    "More Random Numbers", some_more_random_numbers
)

Note that in the `.with_columns()` call, the column labels and values alternate; that is, the call should have the form

```python
tbl.with_columns(
    "Label 1", values_1,
    "Label 2", values_2,
    "Label 3", values_3,
    ...
)
```

It is also important that the values argument(s) have the same number of rows as the table they are being added to. A single value entered as this argument will be broadcast to the entire table, but any length besides 1 or the number of rows in the table will throw an error.

It is also possible to change the labels of columns using the `.relabeled()` method.

In [None]:
trips.relabeled("Duration", "Time")

### Accessing Values

For all non-continuous variables, it is usually important to understand the possible values of the variable; that is, to know the variable's _unique_ values. While `datascience` does not have a built-in method, it is a simple thing to export a column as an array and pass it to `np.unique`.

In [None]:
np.unique(trips.column('Start Date'))

The `datascience` library provides the `.where()` method to filter rows, which uses a column name and a predicate function.

In [None]:
trips.where("Duration", lambda x: x < 100)

The library also provides the `are` class to create predicate functions. Each method of this class returns a boolean function that can be called on a value. For example, if we wanted a function that checked whether or not a value is greater than or equal to 1000, we could use the call below:

In [None]:
are.above_or_equal_to(1000)

You can pass these `are` objects to the `.where()` method to use as predicate functions. This is how students in Data 8 are taught to filter rows.

In [None]:
trips.where("Duration", are.below(100))

For a full list of predicate functions, see the [`datascience.predicates` documentation](http://data8.org/datascience/predicates.html).
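To make the idea of a predicate factory concrete, here is a minimal plain-Python sketch. The functions `below` and `above_or_equal_to` are hypothetical stand-ins written for illustration, and the duration values are made up; this is not the library's implementation of `are`, only an example of a function that returns a boolean function.

```python
def below(threshold):
    """Return a predicate that is True for values strictly less than threshold."""
    return lambda x: x < threshold

def above_or_equal_to(threshold):
    """Return a predicate that is True for values at or above threshold."""
    return lambda x: x >= threshold

durations = [45, 99, 100, 2500]  # made-up values for illustration

# A predicate can be stored under a name and reused...
is_short = below(100)
short_trips = [d for d in durations if is_short(d)]

# ...or created and called in a single expression.
long_trips = [d for d in durations if above_or_equal_to(1000)(d)]

print(short_trips)  # [45, 99]
print(long_trips)   # [2500]
```

Filtering a column with a predicate applies the same kind of function to each value and keeps the rows for which it returns `True`.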

To sort the rows of a table, use the `.sort()` method. It defaults to ascending order, so to get values in descending order the `descending` argument must be set to `True`.

In [None]:
trips.sort("Duration", descending=True)

### Missing Values

The `datascience` library does not currently have the functionality to support working with missing values, although it is possible to transfer your data to `pandas` and use that library's tools.

However, it is possible to combine row filtering with NumPy functions (or `pandas` ones) to do some simple filtering. As an example, if we wanted to filter out rows with missing values in a specific column, we could define our own predicate function as below and then use the `.where()` method to filter rows.

In [None]:
not_nan = lambda x: not pd.isna(x)

trips.where('End Terminal', not_nan)

If we wanted to filter rows with missing values in _any_ column, we could iterate through the labels in `Table.labels`, using the `.where()` method to filter on each pass:

In [None]:
for label in trips.labels:
    trips = trips.where(label, not_nan)
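For comparison, the `pandas` route mentioned at the start of this section might look like the sketch below. It uses a small made-up DataFrame rather than the real `trips` data, so the column values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up data with missing values in two different columns.
df = pd.DataFrame({
    "Duration": [120, 95, 300],
    "End Terminal": [70.0, np.nan, 50.0],
    "Bike #": [np.nan, 288.0, 31.0],
})

# Drop rows that are missing a value in one specific column.
one_col = df.dropna(subset=["End Terminal"])

# Drop rows that are missing a value in any column.
any_col = df.dropna()

print(len(one_col))  # 2
print(len(any_col))  # 1
```

With a Table, you would first convert with `.to_df()`, drop the rows, and convert back with `Table.from_df()`.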

### Descriptive Statistics

In order to understand the distribution of your numerical data, it can be very useful to look at descriptive statistics of the values. The `datascience` library allows you to compute statistics on each column of your table, but it requires you to specify which operations you want to run and it does not filter out non-numerical columns.

To use the `datascience` library to get descriptive statistics, use the `.stats()` method; this requires you to specify which statistics you want to use to aggregate each column, which you do by passing a list of functions as the `ops` argument.

In [None]:
# datascience
first_quartile = lambda x: np.quantile(x, 0.25)
third_quartile = lambda x: np.quantile(x, 0.75)
trips.stats(ops = [min, max, np.mean, np.std, first_quartile, third_quartile])

The default behavior of the `.stats()` method is to show the minimum, maximum, median, and sum.

In [None]:
trips.stats()

### Grouping

In the `datascience` library, you can group by a column with the `.group()` method; this defaults to counts, but you can pass an optional second argument with an aggregator function.

In [None]:
trips.group('Start Station')

When you pass an aggregator function, each column is aggregated by that function in the specified groups. This means that the new table will have the same number of columns as the original, unlike the call _without_ an aggregator function. As an example of an aggregator function, we could pass `np.median`.

In [None]:
trips.group("Start Station", np.median)

To create a pivot table, use the `.pivot()` method. The first argument indicates the column labels, the second the rows, and the third the values that go into each entry. If there is more than one value to go into a cell, it is also possible to pass an aggregator function. The cell below shows a table where each column is a starting station, each row is an ending station, and each value is the mean of the durations for that starting and ending station pair.

In [None]:
trips.pivot("Start Station", "End Station", "Duration", np.mean)

### Joining Tables

The `datascience` library allows you to join tables using its `.join()` method. This method performs an _inner_ join, which means that the rows are only those whose values in the join column(s) appear in _both_ tables. 

The call below joins the `trips` table with the second through fourth columns of the `stations` table, left on `"Start Station"` and right on `"name"`. This means that the result table will have two new columns, `"lat"` and `"long"`, indicating the latitude and longitude of the starting station.

In [None]:
trips.join("Start Station", stations.select(1, 2, 3), "name")

To perform other types of joins, the tables would need to be passed to `pandas`.
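As a sketch of what that could look like, the example below performs a left join with `pandas` on two small made-up DataFrames. The station names and coordinates are invented for illustration; with real Tables you would convert each one with `.to_df()` first.

```python
import pandas as pd

# Made-up stand-ins for the trips and stations data.
trips_df = pd.DataFrame({
    "Start Station": ["A", "B", "C"],
    "Duration": [120, 95, 300],
})
stations_df = pd.DataFrame({
    "name": ["A", "B"],
    "lat": [37.79, 37.33],
    "long": [-122.40, -121.89],
})

# A left join keeps every trip, even when its station has no match.
left_joined = trips_df.merge(
    stations_df, how="left", left_on="Start Station", right_on="name"
)

# An inner join keeps only the trips whose station appears in both tables.
inner_joined = trips_df.merge(
    stations_df, how="inner", left_on="Start Station", right_on="name"
)

print(len(left_joined), len(inner_joined))  # 3 2
```

In the left join, the trip starting at station `"C"` is kept with missing values for `lat` and `long`, whereas the inner join drops it.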

### Manipulating Values

The most common way to manipulate a data set is to apply a predefined function on each element of a column. To accomplish this in `datascience`, we utilize the `.apply()` method, which takes as its arguments first a function to apply and then the column index or label.

In [None]:
square = lambda x: x**2

squared_durations = trips.apply(square, "Duration")
trips = trips.with_column("Duration^2", squared_durations)
trips.show(5)

### Autograding with OkPy

UC Berkeley's Python courses use an autograder called [okpy](https://okpy.org). The package has an easy Jupyter Notebook integration, which is why it is the autograder infrastructure for so many Berkeley courses. Using this autograder requires writing tests, similar to doctests, that will be run in the local environment when you tell the autograder to check the notebook. These can be divided up into multiple sections, and are recorded in Python files.

#### Writing OkPy Tests

Okpy tests are written in "ok format"; this means that they are stored in your Python file as the variable `test`, which is a dictionary of information about the specific test. Each Python file is its own test. In the table below, the keys of the dictionary that are needed are described.

| Key | Type | Description |
|-----|-----|-----|
| `"name"` | `str` | the name of the question |
| `"points"` | `int`, `float` | the point value of the question |
| `"suites"` | `list` | list of dictionaries with the code for the test, with some other attributes |

The `"suites"` key should have a value that is a list of dictionaries, each of which has the following attributes:

| Key | Type | Description |
|-----|-----|-----|
| `"cases"` | `list` | list of dictionaries with the code for each test |
| `"scored"` | `bool` | whether or not the test is scored |
| `"setup"` | `str` | setup code to run before the cases |
| `"teardown"` | `str` | code to run after the cases |
| `"type"` | `str` | type of test, usually set to `"doctest"` |

Each test is divided into suites, which are in turn divided into cases. This is useful in CS courses, but it is a feature that is often not used in Data 8. For most of the tests you write, it is likely that `test["suites"]` and `test["suites"][0]["cases"]` will have length 1.

As an example of an ok test file, consider the one below.

```python
test = {
    "name": "Question 1",
    "points": 1,
    "suites": [
        {
            "cases": [
                {
                    "code": r"""
                    >>> the_answer
                    42
                    >>> np.isclose(42, the_answer)
                    True
                    """,
                    "hidden": False,
                    "locked": False
                }
            ],
            "scored": True,
            "setup": "",
            "teardown": "",
            "type": "doctest"
        }
    ]
}
```

As you can see, we check in this test that `the_answer` is an integer with value 42. If we hadn't had numpy imported in the notebook environment, then we would've needed to change the `"setup"` value to include that:

```python
"setup": r"""
>>> import numpy as np
"""
```

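Also note that the strings with code in them are all raw (`r`) strings. This matters because a raw string passes backslashes in the test code through literally instead of interpreting them as escape sequences.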
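If you want to try out a case string before handing it to the autograder, one option is Python's built-in `doctest` module. The sketch below is an assumption about a convenient local check, not a description of how okpy itself runs tests; `run_case` is a hypothetical helper and `case_code` is a made-up case in the same style as the `"code"` entry above.

```python
import doctest

# A made-up case string in the same style as the "code" entry of an ok test.
case_code = r"""
>>> the_answer
42
>>> the_answer == 42
True
"""

def run_case(code, env):
    """Run a doctest-style string against the names in env.

    Returns a (failed, attempted) pair of counts.
    """
    parser = doctest.DocTestParser()
    test = parser.get_doctest(code, dict(env), "case", None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure reports so that only the counts are returned.
    results = runner.run(test, out=lambda text: None)
    return results.failed, results.attempted

print(run_case(case_code, {"the_answer": 42}))  # (0, 2)
print(run_case(case_code, {"the_answer": 41}))  # (2, 2)
```

Running the case against both a correct and an incorrect value is a quick way to confirm that the test fails when it should.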

**Some things to keep in mind when writing tests:**
* Rows and elements often get shuffled around due to student explorations. For this reason, it is often good to avoid indexing by numbers unless you are _100% certain_ that this won't happen.
* Rounding errors occur, so use functions like `np.isclose` instead of tests for direct equality.
* Write exhaustive tests but don't try to verify that each cell matches. In most cases, either every answer will be off or none of them will; figure out how to exploit this in your tests and writing them will be a lot easier.
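The first two points can be illustrated with a short sketch. The arrays below are made up: `student` holds the same values as `expected`, but in a different order and with a small floating-point rounding error.

```python
import numpy as np

expected = np.array([0.3, 1.0, 2.5])
# Same values as expected, but shuffled and with float rounding noise.
student = np.array([2.5, 0.1 + 0.2, 1.0])

# A strict comparison fails because of both the order and the rounding.
exact_match = np.array_equal(student, expected)

# Sorting first removes the dependence on order, and np.allclose
# tolerates the rounding error.
robust_match = np.allclose(np.sort(student), np.sort(expected))

print(exact_match)   # False
print(robust_match)  # True
```

Inside a test case, the robust comparison would appear as a doctest line whose expected output is `True`.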

#### Usage in Notebooks

To initialize the autograder, first import the `Notebook` class from the `client` package, which is set up when you install okpy on your JupyterHub. Then create an instance of `Notebook`; by convention, we save this instance as `ok`. The `Notebook` initializer requires one argument, the relative path to your [ok configuration file](https://okpy.github.io/documentation/client.html).

In [None]:
from client.api.notebook import Notebook
ok = Notebook("demo.ok")

If you are using the okpy website to collect submissions, then you would also put the following line in the cell above:

```python
_ = ok.auth(inline=True)
```

This would direct your students to the okpy site to log in and give them an authentication key for the notebook.

To run autograder tests, we use the `ok.grade()` function; it takes a single argument, the identifier of the tests you're trying to run. As an example, assign `the_answer` below to the value `42`.

In [None]:
the_answer = ...

In [None]:
_ = ok.grade("q1")

Because I have stored the test cases in `tests/q1.py`, the autograder will go there and run the tests to make sure that you pass. (The location of the tests is set up in the ok config file.)

To make life easier for the students, we often include the cell below which will allow them to run all of the ok tests at once to verify that all of their code is working.

```python
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]
```
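A slightly more defensive way to collect the test identifiers is sketched below. The helper `list_test_names` is hypothetical, and the file names are made up for the demonstration; the sketch only assumes that tests live in files named like `q1.py`.

```python
import tempfile
from pathlib import Path

def list_test_names(test_dir):
    """Return the sorted identifiers (file stems) of the q*.py files in test_dir."""
    return sorted(p.stem for p in Path(test_dir).glob("q*.py"))

# Demonstrate on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["q2.py", "q1.py", "__init__.py", "q1.py.bak"]:
        (Path(tmp) / name).touch()
    names = list_test_names(tmp)

print(names)  # ['q1', 'q2']
```

Matching on `q*.py` skips stray files such as backups, and sorting makes the tests run in a predictable order (alphabetical, so `q10` comes before `q2`). The identifiers could then be passed to the autograder with `[ok.grade(name) for name in list_test_names("tests")]`.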

To submit work to the okpy site, have students run the following:

```python
_ = ok.submit()
```

#### An Example

If we wanted to use okpy in this notebook, we could test that the squared durations were stored correctly. This test is stored as `tests/q2.py`, so we could check this using the code below.

In [None]:
_ = ok.grade("q2")

### Exporting Data

If you make some modifications to the data set or do some data cleaning, you may want to export your data from Python to make it easier to pick up later or to reproduce. For this reason, there is a `datascience` method that allows you to export a Table object to a text file, which you can then load back into Python later. To export as a CSV file, you pass the file name (or file location, if it's going to another folder) to the `.to_csv()` method.

In [None]:
trips.to_csv('export/trips-export.csv')

If you want to save as another file format (e.g. TSV, JSON), you will have to export through `pandas`, either by setting the `sep` argument of the DataFrame's `.to_csv()` method or by using a different export method (e.g. `DataFrame.to_json()`). This is easily accomplished if you have a Table by transferring that table to `pandas` first.

In [None]:
# transfer to pandas, from above
trips_df = trips.to_df()

# export as tsv
trips_df.to_csv('data/trips.tsv', index=False, sep='\t')
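A JSON export follows the same pattern. The sketch below uses a small made-up DataFrame in place of `trips.to_df()` and returns the JSON as a string instead of writing a file, so that the result can be inspected directly.

```python
import json
import pandas as pd

# Made-up stand-in for the DataFrame returned by trips.to_df().
trips_df = pd.DataFrame({
    "Start Station": ["A", "B"],
    "Duration": [120, 95],
})

# With no path given, to_json returns the JSON text as a string;
# orient="records" produces one JSON object per row.
json_text = trips_df.to_json(orient="records")

# Parse the text back to confirm what was written.
records = json.loads(json_text)
print(records[0])  # {'Start Station': 'A', 'Duration': 120}
```

Passing a file path as the first argument of `.to_json()` writes the same text to that file instead of returning it.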

### Conclusion

This notebook should have given you a good introduction to the `Table` class of `datascience`. This demo is not an exhaustive one, and there are _many_ other functionalities of the class that were not covered. To see these and the other functions in the library (including plotting and mapping functionality), see the [`datascience` documentation](http://data8.org/datascience).