# BUDS Report 03: Tables, Part 1


### Table of Contents

1. <a href='#section 1'>Tables</a>

    a. <a href='#subsection 1a'>Table Attributes</a>

    b. <a href='#subsection 1b'>Table Transformations</a><br><br>

2. <a href='#section 2'>Sorting Tables</a>

In [None]:
# dependencies: THIS CELL MUST BE RUN
from datascience import *
import numpy as np
import math
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

## 1. Tables <a id='section 1'></a>

The last section covered four basic concepts of Python: data, expressions, names, and functions. In this next section, we'll see just how much we can do to examine and manipulate our data with only these minimal Python skills.

**Tables** are fundamental ways of organizing and displaying data. Run the next cell to load the data.

In [None]:
ratings = Table.read_table("data/imdb_ratings.csv")
ratings

This table is organized into **columns** — one for each *category* of information collected.

You can also think about the table in terms of its **rows**. Each row represents all the information collected about a particular instance, which can be a person, location, action, or other unit. 


<div class="alert alert-warning">
<b>PRACTICE:</b> What do the rows in this table represent? How many rows are in this table? How many rows are shown by default?
    </div>


_Written Answer:_

### Table Attributes <a id='subsection 1a'></a>

Every table has **attributes** that give information about the table, like the number of rows and the number of columns. Table attributes are accessed using dot notation. But since an attribute doesn't perform an operation on the table, there are no parentheses (as there would be in a call expression).
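
To see the difference between an attribute and a call expression without giving away the practice below, here is a minimal sketch that uses a built-in Python complex number instead of a table. The names `z`, `real_part`, and `conjugate` are made up for this illustration.

```python
# Attribute vs. method, illustrated on a built-in complex number
# (not a Table, so the practice below is still yours to solve).
z = complex(3, 4)

real_part = z.real          # attribute: no parentheses, just looks up a stored value
conjugate = z.conjugate()   # method: parentheses, performs an operation and returns a result

print(real_part)
print(conjugate)
```

Table attributes follow the same pattern as `z.real`: a name after the dot, with no parentheses.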

<div class="alert alert-warning">
<b>PRACTICE:</b> Attributes you'll frequently use include the number of rows and number of columns in a table. Use the Python Reference Sheet to find the attributes that give the number of rows and the number of columns in the <code>ratings</code> table.
    </div>

In [None]:
# get the number of rows in the table
...

In [None]:
# get the number of columns in the table
...

Another attribute you can look at is the column names of a table. Although you can clearly list off the column names of the `ratings` table, it may be helpful to get an array of column names for tables with far more columns. Real-world datasets sometimes have hundreds or thousands of columns, so you may want to look at this before scrolling through the whole table. 

<div class="alert alert-warning">
<b>PRACTICE:</b> Look at the Python Reference Sheet again and find an attribute that outputs the column names of a table. Then, use it on the <code>ratings</code> table.
    </div>

In [None]:
# find the column names of the table
...

Now that you are looking at the column names as strings, do you notice anything odd about any of them? What issue(s) might come up if you hadn't used this attribute to get the column labels?

_Written Answer:_

The "Decade" column was likely added to the table after all of the other columns. After all, you can look at the year and know what decade the movie was made in.
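
As a hint for the arithmetic in the next practice, here is a small sketch of a related calculation: rounding a year down to the start of its century. It assumes you have seen the floor division operator `//`; the name `century_start` is made up for this example.

```python
year = 1975

# Floor division drops the remainder: 1975 // 100 is 19, and 19 * 100 is 1900.
century_start = year // 100 * 100

# Regular division always produces a float, even when both inputs are ints.
not_rounded = year / 100

print(century_start, type(century_start))
print(not_rounded, type(not_rounded))
```

Notice that `//` between two integers keeps the result an integer, while `/` does not. That matters when the result has to match the data type of an existing column.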

<div class="alert alert-warning">
<b>PRACTICE:</b> Consider what data type the values in the "Decade" column might be. Then, write arithmetic that converts a given year into a decade. Be sure that this conversion becomes the same data type as the values in the "Decade" column. Try testing out different years in <code>year</code> to make sure your calculations are correct.
    </div>

In [None]:
# convert year to a decade
year = 1975
decade = ...
decade

### Table Transformations <a id='subsection 1b'></a>

Not all of our columns are relevant to every question we want to ask. We can save computational resources and avoid confusion by *transforming* our table before we start work.


### `select`
The `select` method is used to get a table containing only particular columns. `select` is called on a table using dot notation and takes one or more arguments: the name or names of the column or columns you want. Note that this *does not* change the original table. To keep the result, you must assign it to a name.

<div class="alert alert-warning">
    <b>PRACTICE:</b> To confirm this, select two columns from the <code>ratings</code> table but do not assign the result to a name. Then, look at the <code>ratings</code> table again to see whether any changes were made.
    </div>

In [None]:
# make a new table with only selected columns
...

In [None]:
# confirm that there are no changes made to the original table
...

### `drop`

If instead you need all columns except a few, the `drop` method can get rid of specified columns. `drop` works very similarly to `select`: call it on the table using dot notation, then give it the name or names of the columns you want to drop.

<div class="alert alert-warning">
<b>PRACTICE:</b> Drop the columns that you did not select in the previous section. Similarly, make sure that you do not assign this table to a name.
    </div>

In [None]:
# drop the columns you didn't select earlier
...

### `relabeled`

Other times, you may want to rename the columns of your table to make more sense. Again, you can use the dot notation with `relabeled` to accomplish this.

<div class="alert alert-warning">
<b>PRACTICE:</b> In the following code cell, rename the "Rank" column to a name that is more descriptive of the values in the column. Take a look at the arguments it takes in the Python Reference Sheet. Unlike the previous sections, be sure to save this new table to a name like <code>ratings2</code>. This will allow us to use this exact table later.
    </div>

In [None]:
# relabel the "Rank" column to a name that makes more sense
...

## 2. Sorting Tables <a id='section 2'></a>

The last section covered `select`, `drop`, and `relabeled`. In this next section, we'll learn how we can sort and manipulate data that are placed in tables. However, it's important to know a few more pieces of information before we actually sort the data.

### `show`

We can display a specific number of rows from a table using the `show` method. `show` takes the number of rows you want displayed as its argument. This can be helpful when you want to look at more (or fewer) entries than the default display gives you.

<div class="alert alert-warning">
    <b>PRACTICE:</b> Display the first 20 rows of the <code>ratings2</code> table.
    </div>

In [None]:
# use show to display 20 rows
...

Although it's nice that you can show the first 20 rows of a table, you can see that there is no logical ordering as of right now. `sort` will alleviate that, so let's first look at the most basic usage of `sort`.

In [None]:
# run this cell
ratings2.sort("Year")

Based on this one evaluation, make a few guesses about the `sort` method.

1. What does the argument in `sort` denote?
2. How is the table sorted? Is it from lowest to highest or highest to lowest?
3. How do you think the table will be sorted if we had instead chosen the column "Title"?

_Written Answer:_

We can answer the last question right now. Try sorting the table by the column "Title". Does it confirm what you suspected?

In [None]:
...

### Another Data Type

To answer these questions, it may be helpful to look at another data type. Unlike integers, floats, and strings, this data type does not fall under the numerical or text categories. **Booleans** are values that are either `True` or `False`.
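
Booleans most often show up as the result of a comparison. The sketch below uses made-up names (`is_bigger`, `is_equal`) to show a few comparisons and the values they produce.

```python
# Comparisons evaluate to boolean values
is_bigger = 3 > 2           # True, because 3 is greater than 2
is_equal = "cat" == "dog"   # False, because the two strings differ

print(is_bigger)
print(is_equal)

# Booleans can be combined: `and` is True only when both sides are True
print(is_bigger and is_equal)
```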

<div class="alert alert-warning">
    <b>PRACTICE:</b> Confirm this by checking the type of the variable <code>tr</code>, which is assigned the value <code>True</code>. Recall that we checked data types in Report 02.
    </div>

In [None]:
# check the data type of tr
tr = True
...

### `sort`

Booleans can be used to change how we sort our values when using the `sort` method. It would be helpful to switch between lowest to highest and highest to lowest, so `sort` takes in booleans as *optional arguments*. These are not necessary, but can be helpful in your analyses. Take a look at the format of `sort`.

`table.sort(column_or_label, descending=False, distinct=False)`

By default, `sort` organizes values in *ascending* order and allows for duplicate values. This is denoted by `descending=False` and `distinct=False`. To change either of these, you simply set `descending=True` or `distinct=True`.

Let's put this into practice.

<div class="alert alert-warning">
<b>PRACTICE:</b> Sort the table from the most recent release to the oldest movie in the table.
    </div>

In [None]:
...

<div class="alert alert-warning">
    <b>PRACTICE:</b> The movies seem to span from 1921 to 2015. How many distinct years are there in this table?
    </div>

In [None]:
...

_Written Answer:_

The `distinct` argument keeps only one row for each distinct value in the column we sort by. How do you think the computer decides which row to keep?

_Written Answer:_

Let's now look at another column that might be of interest.

<div class="alert alert-warning">
<b>PRACTICE:</b> Sort the table from highest ranked movie to lowest ranked movie.
</div>

In [None]:
...

Based on this table, what rankings would you consider high ratings and why? Where do the ratings seem to cap out?

_Written Answer:_

It's important to look at both ends of our values and not just the range that we are interested in. This will help us understand the data that we're working with.

<div class="alert alert-warning">
    <b>PRACTICE:</b> Sort the table from lowest ranked to highest ranked.
    </div>

In [None]:
...

In [None]:
ratings2.sort("Rating")

This table's results might surprise you. Does your answer from above change? Why or why not?

_Written Answer:_

<div class="alert alert-warning">
    <b>PRACTICE:</b> Let's try something a bit more complicated. Find the highest ranking movie of each year. This might require assignment statements and multiple calls to <code>sort</code>. Try messing around with simple calls to <code>sort</code> or drawing out each step of this process on paper.
</div>

In [None]:
...

### Downloading as PDF

Download this notebook as a PDF by clicking <b><code>File > Download as > PDF via LaTeX</code></b>. Submit the PDF to bCourses under the corresponding assignment.

#### References

- Sections of "Intro to Jupyter" and "Table Transformation" adapted from materials by Kelly Chen and Ashley Chien in [UC Berkeley Data Science Modules core resources](http://github.com/ds-modules/core-resources)

Authored by Keeley Takimoto, Adapted by the BUDS Team