# Tables: Part 1


### Table of Contents

1. <a href='#section 2'>Tables</a>

    a. <a href='#subsection 2a'>Attributes</a>

    b. <a href='#subsection 2b'>Transformations</a><br><br>

2. <a href='#section 3'>Coming up...</a>

In [4]:
# dependencies: THIS CELL MUST BE RUN
from datascience import *
import numpy as np
import math
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import ipywidgets as widgets
%matplotlib inline

## 1. Tables <a id='section 2'></a>

The last section covered four basic concepts of Python: data, expressions, names, and functions. In this section, we'll see how much we can do to examine and manipulate our data with only these minimal Python skills.

**Tables** are a fundamental way of organizing and displaying data. Run the next cell to load the data.

In [5]:
ratings = Table.read_table("data/imdb_ratings.csv")
ratings


This table is organized into **columns**, one for each *category* of information collected.

You can also think about the table in terms of its **rows**. Each row represents all the information collected about a particular instance, which can be a person, location, action, or other unit. 

What do the rows in this table represent?

By default only the first ten rows are shown. Can you see how many rows there are in total?

### 1a. Table Attributes <a id='subsection 2a'></a>

Every table has **attributes** that give information about the table, like the number of rows and the number of columns. Table attributes are accessed using dot notation. Because an attribute doesn't perform an operation on the table, it takes no parentheses (unlike a call expression).

Attributes you'll use frequently include `num_rows` and `num_columns`, which give the number of rows and columns in the table, respectively.

In [None]:
# get the number of columns
ratings.num_columns

<div class="alert alert-info">
<b>PRACTICE:</b> Use <code>num_rows</code> to get the number of rows in our table.
</div>

In [None]:
# get the number of rows in the table


### 1b. Table Transformation <a id='subsection 2b'></a>

Not all of our columns are relevant to every question we want to ask. We can save computational resources and avoid confusion by *transforming* our table before we start work.

#### Subsetting columns with `select` and `drop`
The `select` function is used to get a table containing only particular columns. `select` is called on a table using dot notation and takes one or more arguments: the names of the columns you want to keep.

In [None]:
# make a new table with only selected columns
ratings.select("Votes", "Title")

If instead you need all columns except a few, the `drop` function removes the columns you specify. `drop` works much like `select`: call it on the table using dot notation, then pass the names of the columns you want to drop.

In [None]:
# drop a column
ratings.drop("Decade")

<div class="alert alert-warning">
<b>EXERCISE:</b> Pick two columns from our table. Create a new table containing only those two columns in two different ways: once using <code>select</code> and once using <code>drop</code>.
</div>

In [None]:
# use select
...

In [None]:
# use drop
...

#### Filtering rows with `where`
Some analysis questions only deal with a subset of rows.

The **`where`** function allows us to choose certain rows based on two arguments:
- A column label
- A condition that each row should match, called the _predicate_ 

In other words, we call the `where` function like so: `table_name.where(column_name, predicate)`.


In [None]:
# get a subset of rows
ratings.where("Decade", are.equal_to(1950))

There are many types of predicates, but some of the more common ones are:

|Predicate|Example|Result|
|-|-|-|
|`are.equal_to`|`are.equal_to(50)`|Find rows with values equal to 50|
|`are.not_equal_to`|`are.not_equal_to(50)`|Find rows with values not equal to 50|
|`are.above`|`are.above(50)`|Find rows with values above (and not equal to) 50|
|`are.above_or_equal_to`|`are.above_or_equal_to(50)`|Find rows with values above 50 or equal to 50|
|`are.below`|`are.below(50)`|Find rows with values below (and not equal to) 50|
|`are.between`|`are.between(2, 10)`|Find rows with values above or equal to 2 and below 10|


In [None]:
# example 2: get a subset of rows
ratings.where("Rank", are.above(8.7))

<div class="alert alert-warning">
<b>EXERCISE:</b> Describe what happened in each of the two examples above. Which rows were filtered out? Give an example where we would want to use those filters for analysis.
</div>

**YOUR RESPONSE HERE**

## 2. Coming up... <a id='section 3'></a>

Knowing these few basic concepts about Python and Tables will help you interact with data in upcoming parts of the module. Here's a preview of the kinds of visualizations and operations coming up:

In [None]:
# make a bar plot of the max rank per decade
ratings.select("Rank", "Decade").group("Decade", max).barh("Decade")

In [None]:
# make many bar plots showing feature averages per decade
def avg_by_decade(feature):
    return ratings.select(feature, "Decade").group("Decade", np.average).barh("Decade")

# create the toggle buttons for the widget
buttons = widgets.ToggleButtons(options=["Rank", "Votes"])

# create the widget to view plots for different parameter values
display(widgets.interactive(avg_by_decade, feature=buttons))

In [None]:
# load the Capital bike-sharing data set
bikes = Table.read_table("data/day_renamed.csv")
bikes

In [None]:
# look at the correlation between temperature and casual rider numbers
bikes.select("casual", "temp").scatter("temp", fit_line=True)

In [None]:
# compare scatter plots for casual and registered riders for different predictor variables
def scatter_bikes(predictor, response, fit_line):
    if response == "both":
        b = bikes.select("registered", "casual", predictor)
    else:
        b = bikes.select(response, predictor)
    return b.scatter(predictor, fit_line=fit_line)

# create the dropdown menus for the widget
predict_widget = widgets.Dropdown(options=["humidity", "windspeed", "temp"],
                                  value="humidity")
response_widget = widgets.Dropdown(options=["casual", "registered", "both"],
                                   value="casual")
fitline_widget = widgets.Dropdown(options=[True, False],
                                  value=False)

# create the widget to view plots for different parameter values
display(widgets.interactive(scatter_bikes, predictor=predict_widget, response=response_widget, fit_line=fitline_widget))

#### References

- Sections of "Intro to Jupyter" and "Table Transformation" adapted from materials by Kelly Chen and Ashley Chien in [UC Berkeley Data Science Modules core resources](http://github.com/ds-modules/core-resources).
- "A Note on Errors" subsection and "error" image adapted from materials by Chris Hench and Mariah Rogers for the Medieval Studies 250: Text Analysis for Graduate Medievalists [data science module](https://github.com/ds-modules/MEDST-250).
- Rocket Fuel data and discussion questions adapted from materials by Zsolt Katona and Brian Bell, BerkeleyHaas Case Series

Author: Keeley Takimoto