# BUDS Report 03: Tables, Part 1


### Table of Contents

1. <a href='#section 1'>Tables</a>

    a. <a href='#subsection 1a'>Table Attributes</a>

    b. <a href='#subsection 1b'>Table Transformations</a><br><br>

2. <a href='#section 2'>Sorting Tables</a>

In [1]:
# dependencies: THIS CELL MUST BE RUN
from datascience import *
import numpy as np
import math
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

## 1. Tables <a id='section 1'></a>

The last section covered four basic concepts of Python: data, expressions, names, and functions. In this next section, we'll see just how much we can do to examine and manipulate our data with only these minimal Python skills.

**Tables** are a fundamental way of organizing and displaying data. Run the next cell to load the data.

In [3]:
ratings = Table.read_table("data/imdb_ratings.csv")
ratings

Votes,Rank,Title,Year,Decade
88355,8.4,M (1931),1931,1930
132823,8.3,Singin' in the Rain (1952),1952,1950
74178,8.3,All About Eve (1950),1950,1950
635139,8.6,Léon (1994),1994,1990
145514,8.2,The Elephant Man (1980),1980,1980
425461,8.3,Full Metal Jacket (1987),1987,1980
441174,8.1,Gone Girl (2014),2014,2010
850601,8.3,Batman Begins (2005),2005,2000
37664,8.2,Judgment at Nuremberg (1961),1961,1960
46987,8.0,Relatos salvajes (2014),2014,2010


This table is organized into **columns** — one for each *category* of information collected.

You can also think about the table in terms of its **rows**. Each row represents all the information collected about a particular instance, which can be a person, location, action, or other unit. 


<div class="alert alert-warning">
<b>PRACTICE:</b> What do the rows in this table represent? How many rows are in this table? And how many rows are shown by default?
    </div>


_Written Answer:_

__SOLUTION__:
- The rows in this table represent movies.
- There are 250 rows in this table.
- 10 rows are shown by default.

### Table Attributes <a id='subsection 1a'></a>

Every table has **attributes** that give information about the table, like the number of rows and the number of columns. Attributes are accessed using dot notation. Because an attribute doesn't perform an operation on the table, it is written without parentheses (unlike a call expression).
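The difference between reading an attribute and calling a method shows up across Python, not only in tables. As a small illustration that uses a built-in complex number rather than a table, `real` is an attribute (no parentheses) and `conjugate` is a method (parentheses required). The variable names are chosen just for this sketch.

```python
# A built-in complex number stands in for a table in this sketch.
z = 3 + 4j

# An attribute is read with dot notation and no parentheses.
real_part = z.real

# A method performs an operation, so it is called with parentheses.
flipped = z.conjugate()

print(real_part)
print(flipped)
```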

<div class="alert alert-warning">
<b>PRACTICE:</b> Attributes you'll frequently use include the number of rows and number of columns in a table. Use the Python Reference Sheet to find the attributes that give the number of rows and the number of columns in the <code>ratings</code> table.
    </div>

In [None]:
# get the number of rows in the table
...

In [None]:
# get the number of columns in the table
...

In [4]:
# SOLUTION
print(ratings.num_rows)
print(ratings.num_columns)

250
5


Another attribute you can look at is the column names of a table. Although you can clearly list off the column names of the `ratings` table, it may be helpful to get an array of column names for tables with far more columns. Real-world datasets sometimes have hundreds or thousands of columns, so you may want to look at this before scrolling through the whole table. 

<div class="alert alert-warning">
<b>PRACTICE:</b> Look at the Python Reference Sheet again and find the attribute that gives the column names of a table. Then, use it on the <code>ratings</code> table.
    </div>

In [None]:
# find the column names of the table
...

In [5]:
# SOLUTION
ratings.labels

('Votes', 'Rank', 'Title ', 'Year', 'Decade')

Now that you are looking at the column names as a sequence of strings, do you notice anything odd about any of the column names? What issue(s) might come up if you hadn't used this attribute to get the column labels?

_Written Answer:_

__SOLUTION__:
- It's odd that there is a space at the end of the 'Title' column name.
- If we hadn't used this attribute to get the column labels, we might have overlooked this space and been very confused about what could have been causing errors later in the notebook.
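A trailing space is easy to miss on screen but matters to Python, because string comparison is exact. The short sketch below uses plain strings rather than the table itself to show why a label typed without the space would not match. The variable names are made up for this example.

```python
# The label as it is stored in the table, with a trailing space.
actual_label = "Title "
# What someone would naturally type after reading the displayed table.
typed_label = "Title"

# String comparison is exact, so the two labels are different.
labels_match = actual_label == typed_label

# len and repr make the hidden space visible.
print(labels_match)
print(len(actual_label), len(typed_label))
print(repr(actual_label))

# strip() returns a copy with leading and trailing whitespace removed.
cleaned_label = actual_label.strip()
print(cleaned_label == typed_label)
```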

The "Decade" column was likely added to the table after all of the other columns. After all, you can look at the year and know what decade the movie was made in.

<div class="alert alert-warning">
<b>PRACTICE:</b> Consider what data type the values in the "Decade" column might be. Then, write arithmetic that converts a given year into a decade. Be sure that this conversion becomes the same data type as the values in the "Decade" column. Try testing out different years in <code>year</code> to make sure your calculations are correct.
    </div>

In [None]:
# convert year to a decade
year = 1975
decade = ...
decade

In [6]:
# SOLUTION
year = 1975
decade = int(year / 10) * 10
decade

1970
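Another way to write the same conversion, shown here as a sketch, is floor division with `//`, which divides and rounds down in a single step and keeps the result an integer. The helper name `to_decade` is made up for this example.

```python
def to_decade(year):
    """Return the decade a year falls in, as an integer."""
    # Floor division drops the last digit; multiplying by 10 puts a 0 back.
    return year // 10 * 10

# Try a year in the middle of a decade and years at both edges.
for year in [1970, 1975, 1979, 2014]:
    print(year, to_decade(year))
```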

### Table Transformations <a id='subsection 1b'></a>

Not all of our columns are relevant to every question we want to ask. We can save computational resources and avoid confusion by *transforming* our table before we start work.


### `select`
The `select` method is used to get a table containing only particular columns. `select` is called on a table using dot notation and takes one or more arguments: the name or names of the column or columns you want. Note that this *does not* change the original table. To keep the result, you must assign it to a name.

<div class="alert alert-warning">
    <b>PRACTICE:</b> To confirm this, select two columns from the <code>ratings</code> table but do <i>not</i> assign it to a name. Then, look at the <code>ratings</code> table again to see if any changes had been made.
    </div>

In [None]:
# make a new table with only selected columns
...

In [None]:
# confirm that there are no changes made to the original table
...

In [7]:
# SOLUTION
ratings.select("Votes", "Title ")
ratings

Votes,Rank,Title,Year,Decade
88355,8.4,M (1931),1931,1930
132823,8.3,Singin' in the Rain (1952),1952,1950
74178,8.3,All About Eve (1950),1950,1950
635139,8.6,Léon (1994),1994,1990
145514,8.2,The Elephant Man (1980),1980,1980
425461,8.3,Full Metal Jacket (1987),1987,1980
441174,8.1,Gone Girl (2014),2014,2010
850601,8.3,Batman Begins (2005),2005,2000
37664,8.2,Judgment at Nuremberg (1961),1961,1960
46987,8.0,Relatos salvajes (2014),2014,2010


### `drop`

If instead you need all columns except a few, the `drop` method returns a table without the specified columns. `drop` works much like `select`: call it on the table using dot notation, then give it the name or names of the columns you want to drop.

<div class="alert alert-warning">
    <b>PRACTICE:</b> Drop the columns that you did <i>not</i> select in the previous section. Similarly, make sure that you do <i>not</i> assign this table to a name.
    </div>

In [None]:
# drop the columns you didn't select earlier
...

In [8]:
# SOLUTION
ratings.drop("Rank", "Year", "Decade")

Votes,Title
88355,M (1931)
132823,Singin' in the Rain (1952)
74178,All About Eve (1950)
635139,Léon (1994)
145514,The Elephant Man (1980)
425461,Full Metal Jacket (1987)
441174,Gone Girl (2014)
850601,Batman Begins (2005)
37664,Judgment at Nuremberg (1961)
46987,Relatos salvajes (2014)
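One way to picture what `select` and `drop` do is to model a table as a dictionary that maps each column label to a list of values. The helpers below are a plain-Python sketch written for this illustration, not how the `datascience` library is implemented; `select_columns` and `drop_columns` are made-up names. Both build a new dictionary, which mirrors how the original table is left unchanged.

```python
# A tiny stand-in for the ratings table: column label -> list of values.
mini = {
    "Votes": [88355, 132823],
    "Rank": [8.4, 8.3],
    "Title ": ["M (1931)", "Singin' in the Rain (1952)"],
    "Year": [1931, 1952],
    "Decade": [1930, 1950],
}

def select_columns(table, *labels):
    """Return a new table containing only the named columns."""
    return {label: table[label] for label in labels}

def drop_columns(table, *labels):
    """Return a new table containing every column except the named ones."""
    return {label: values for label, values in table.items() if label not in labels}

selected = select_columns(mini, "Votes", "Title ")
dropped = drop_columns(mini, "Rank", "Year", "Decade")

print(list(selected))  # labels kept by selecting
print(list(dropped))   # labels left after dropping
print(list(mini))      # labels of the original stand-in table
```

Selecting two columns and dropping the other three give the same result here, and the stand-in table still has all five of its columns afterward.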


### `relabeled`

Other times, you may want to rename the columns of your table so that they make more sense. Again, you can use dot notation with `relabeled` to accomplish this.

<div class="alert alert-warning">
<b>PRACTICE:</b> In the following code cell, rename the "Rank" column to a name that is more descriptive of the values in the column. Take a look at the arguments it takes in the Python Reference Sheet. <i>Unlike the previous sections</i>, be sure to save this new table to a name like <code>ratings2</code>. This will allow us to use this exact table later.
    </div>

In [None]:
# relabel the "Rank" column to a name that makes more sense
...

In [9]:
# SOLUTION
ratings2 = ratings.relabeled("Rank", "Rating")
ratings2

Votes,Rating,Title,Year,Decade
88355,8.4,M (1931),1931,1930
132823,8.3,Singin' in the Rain (1952),1952,1950
74178,8.3,All About Eve (1950),1950,1950
635139,8.6,Léon (1994),1994,1990
145514,8.2,The Elephant Man (1980),1980,1980
425461,8.3,Full Metal Jacket (1987),1987,1980
441174,8.1,Gone Girl (2014),2014,2010
850601,8.3,Batman Begins (2005),2005,2000
37664,8.2,Judgment at Nuremberg (1961),1961,1960
46987,8.0,Relatos salvajes (2014),2014,2010


## 2. Sorting Tables <a id='section 2'></a>

The last section covered `select`, `drop`, and `relabeled`. In this next section, we'll learn how we can sort and manipulate data that are placed in tables. However, it's important to know a few more pieces of information before we actually sort the data.

### `show`

We can display a specific number of rows from a table using the `show` method. `show` takes the number of rows you want displayed as its argument. This can be helpful when you want to look at a certain number of entries in your table.

<div class="alert alert-warning">
    <b>PRACTICE:</b> Display the first 20 rows of the <code>ratings2</code> table.
    </div>

In [None]:
# use show to display 20 rows
...

In [10]:
# SOLUTION
ratings2.show(20)

Votes,Rating,Title,Year,Decade
88355,8.4,M (1931),1931,1930
132823,8.3,Singin' in the Rain (1952),1952,1950
74178,8.3,All About Eve (1950),1950,1950
635139,8.6,Léon (1994),1994,1990
145514,8.2,The Elephant Man (1980),1980,1980
425461,8.3,Full Metal Jacket (1987),1987,1980
441174,8.1,Gone Girl (2014),2014,2010
850601,8.3,Batman Begins (2005),2005,2000
37664,8.2,Judgment at Nuremberg (1961),1961,1960
46987,8.0,Relatos salvajes (2014),2014,2010


Although it's nice that you can show the first 20 rows of a table, you can see that there is no logical ordering as of right now. `sort` will alleviate that, so let's first look at the most basic usage of `sort`.

In [11]:
# run this cell
ratings2.sort("Year")

Votes,Rating,Title,Year,Decade
55784,8.3,The Kid (1921),1921,1920
58506,8.2,The Gold Rush (1925),1925,1920
46332,8.2,The General (1926),1926,1920
98794,8.3,Metropolis (1927),1927,1920
88355,8.4,M (1931),1931,1930
92375,8.5,City Lights (1931),1931,1930
56842,8.1,It Happened One Night (1934),1934,1930
121668,8.5,Modern Times (1936),1936,1930
69510,8.2,Mr. Smith Goes to Washington (1939),1939,1930
259235,8.1,The Wizard of Oz (1939),1939,1930


Based on this one evaluation, make a few guesses about the `sort` method.

1. What does the argument in `sort` denote?
2. How is the table sorted? Is it from lowest to highest or highest to lowest?
3. How do you think the table will be sorted if we had instead chosen the column "Title"?

_Written Answer:_

__SOLUTION__:
1. The argument in `sort` denotes the column(s) that you want to sort by.
2. The table is sorted from lowest to highest.
3. If we had instead chosen the column "Title", the table would instead be sorted in alphabetical order.

We can answer the last question right now. Try sorting the table by the column "Title". Does it confirm what you suspected?

In [None]:
...

In [12]:
# SOLUTION
ratings2.sort("Title ")

Votes,Rating,Title,Year,Decade
384187,8.9,12 Angry Men (1957),1957,1950
359121,8.1,12 Years a Slave (2013),2013,2010
373482,8.3,2001: A Space Odyssey (1968),1968,1960
167076,8.2,3 Idiots (2009),2009,2000
69988,8.1,8½ (1963),1963,1960
525515,8.1,A Beautiful Mind (2001),2001,2000
489807,8.3,A Clockwork Orange (1971),1971,1970
91652,8.0,Akira (1988),1988,1980
496833,8.5,Alien (1979),1979,1970
436218,8.4,Aliens (1986),1986,1980


### Another Data Type

To control how a table is sorted, it helps to look at another data type. Unlike integers, floats, and strings, this data type is neither numerical nor text. **Booleans** are values that are either `True` or `False`.

<div class="alert alert-warning">
    <b>PRACTICE:</b> Confirm this by checking the type of the variable <code>tr</code>, which is assigned to <code>True</code>. Recall that we checked data types in Report 02.
    </div>

In [None]:
# check the data type of tr
tr = True
...

In [13]:
# SOLUTION
tr = True
type(tr)

bool
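Booleans also show up as the result of comparisons, which is one common way they are produced. A brief sketch, with variable names chosen for illustration:

```python
year = 1975

# A comparison evaluates to a boolean value.
is_recent = year > 2000
is_seventies = 1970 <= year < 1980

print(is_recent, type(is_recent))
print(is_seventies, type(is_seventies))
```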

### `sort`

Booleans can be used to change how we sort our values when using the `sort` method. It would be helpful to switch between lowest to highest and highest to lowest, so `sort` takes in booleans as *optional arguments*. These are not necessary, but can be helpful in your analyses. Take a look at the format of `sort`.

`table.sort(column_or_label, descending=False, distinct=False)`

By default, `sort` organizes values in *ascending* order and allows for duplicate values. This is denoted by `descending=False` and `distinct=False`. To change either of these, you simply set `descending=True` or `distinct=True`.
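Optional arguments with default values work the same way in functions you write yourself. The function below is a made-up example, not part of the `datascience` library. It sorts a plain list and accepts a `descending` argument that defaults to `False`, mirroring the format shown above.

```python
def sort_values(values, descending=False):
    """Return a new sorted list; ascending unless descending=True."""
    # sorted has its own optional argument, reverse, that flips the order.
    return sorted(values, reverse=descending)

years = [1994, 1931, 2014, 1952]

ascending_years = sort_values(years)                    # default is used
descending_years = sort_values(years, descending=True)  # default is overridden

print(ascending_years)
print(descending_years)
```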

Let's put this into practice.

<div class="alert alert-warning">
<b>PRACTICE:</b> Sort the table from the most recent release to the oldest movie in the table.
    </div>

In [None]:
...

In [14]:
# SOLUTION
ratings2.sort("Year", descending=True)

Votes,Rating,Title,Year,Decade
79615,8.5,Inside Out (2015/I),2015,2010
262425,8.3,Mad Max: Fury Road (2015),2015,2010
441174,8.1,Gone Girl (2014),2014,2010
46987,8.0,Relatos salvajes (2014),2014,2010
427099,8.0,X-Men: Days of Future Past (2014),2014,2010
321834,8.0,The Imitation Game (2014),2014,2010
689541,8.6,Interstellar (2014),2014,2010
527349,8.0,Guardians of the Galaxy (2014),2014,2010
369141,8.1,The Grand Budapest Hotel (2014),2014,2010
264333,8.5,Whiplash (2014),2014,2010


<div class="alert alert-warning">
    <b>PRACTICE:</b> The movies seem to span from 1921 to 2015. How many distinct years are there in this table?
    </div>

In [None]:
...

In [15]:
# SOLUTION
ratings2.sort("Year", distinct=True)

Votes,Rating,Title,Year,Decade
55784,8.3,The Kid (1921),1921,1920
58506,8.2,The Gold Rush (1925),1925,1920
46332,8.2,The General (1926),1926,1920
98794,8.3,Metropolis (1927),1927,1920
88355,8.4,M (1931),1931,1930
56842,8.1,It Happened One Night (1934),1934,1930
121668,8.5,Modern Times (1936),1936,1930
69510,8.2,Mr. Smith Goes to Washington (1939),1939,1930
55793,8.1,The Grapes of Wrath (1940),1940,1940
101754,8.1,The Maltese Falcon (1941),1941,1940


_Written Answer:_

__SOLUTION__:

There are 80 distinct years in this table.

The `distinct` argument keeps only one row for each distinct value in the column we sort by. How do you think the computer decides which row to keep?

_Written Answer:_

__SOLUTION__:

The computer takes the first row in each set of distinct values.
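To make the "first row" behavior concrete, here is a plain-Python sketch that keeps the first row seen for each year. It only models the behavior described above and is not the library's implementation. `keep_first_per_year` is a made-up helper, and the rows are copied from the output of sorting by "Year" earlier in this section. Counting the rows that remain is one way to count distinct years.

```python
# (title, year) pairs in the order they appear when sorted by year.
rows = [
    ("M (1931)", 1931),
    ("City Lights (1931)", 1931),
    ("It Happened One Night (1934)", 1934),
    ("Mr. Smith Goes to Washington (1939)", 1939),
    ("The Wizard of Oz (1939)", 1939),
]

def keep_first_per_year(rows):
    """Keep only the first row encountered for each year."""
    seen_years = set()
    kept = []
    for title, year in rows:
        if year not in seen_years:
            seen_years.add(year)
            kept.append((title, year))
    return kept

distinct_rows = keep_first_per_year(rows)
for title, year in distinct_rows:
    print(year, title)

# The number of rows kept equals the number of distinct years.
print(len(distinct_rows))
```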

Let's now look at another column that might be of interest.

<div class="alert alert-warning">
<b>PRACTICE:</b> Sort the table from highest rated movie to lowest rated movie.
</div>

In [None]:
...

In [16]:
# SOLUTION
ratings2.sort("Rating", descending=True)

Votes,Rating,Title,Year,Decade
1027398,9.2,The Godfather (1972),1972,1970
1498733,9.2,The Shawshank Redemption (1994),1994,1990
692753,9.0,The Godfather: Part II (1974),1974,1970
447875,8.9,"Il buono, il brutto, il cattivo (1966)",1966,1960
1473049,8.9,The Dark Knight (2008),2008,2000
384187,8.9,12 Angry Men (1957),1957,1950
1074146,8.9,The Lord of the Rings: The Return of the King (2003),2003,2000
761224,8.9,Schindler's List (1993),1993,1990
1166532,8.9,Pulp Fiction (1994),1994,1990
1177098,8.8,Fight Club (1999),1999,1990


Based on this table, what ratings would you consider high and why? Where do the ratings seem to cap out?

_Written Answer:_

__SOLUTION__:

The ratings seem to cap out at 10. Based on personal preference, I would consider ratings above 8 to be high.

It's important to look at both ends of our values and not just the range that we are interested in. This will help us understand the data that we're working with.

<div class="alert alert-warning">
    <b>PRACTICE:</b> Sort the table from lowest rated to highest rated.
    </div>

In [None]:
...

In [17]:
# SOLUTION
ratings2.sort("Rating")

Votes,Rating,Title,Year,Decade
46987,8,Relatos salvajes (2014),2014,2010
55382,8,Bom yeoreum gaeul gyeoul geurigo bom (2003),2003,2000
32385,8,La battaglia di Algeri (1966),1966,1960
364225,8,Jaws (1975),1975,1970
158867,8,Before Sunrise (1995),1995,1990
56671,8,The Killing (1956),1956,1950
87591,8,Papillon (1973),1973,1970
43090,8,"Paris, Texas (1984)",1984,1980
427099,8,X-Men: Days of Future Past (2014),2014,2010
87437,8,Roman Holiday (1953),1953,1950


This table's results might surprise you. Does your answer from above change? Why or why not?

_Written Answer:_

__SOLUTION__:

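Yes, our answer from above changes. Sorting in ascending order shows that the lowest rating in this table is 8.0, so 8 is actually the low end of this dataset rather than a cutoff that separates high ratings from the rest.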
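A quick way to look at both ends of a column at once is to compute its smallest and largest values. The sketch below uses a handful of ratings copied from the outputs above in a plain list, rather than the full column.

```python
# A few ratings taken from the sorted outputs above.
sample_ratings = [9.2, 9.0, 8.9, 8.4, 8.1, 8.0]

lowest = min(sample_ratings)
highest = max(sample_ratings)

print(lowest, highest)
# Rounding avoids floating-point noise in the difference.
print(round(highest - lowest, 1))
```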

<div class="alert alert-warning">
    <b>PRACTICE:</b> Let's try something a bit more complicated. Find the highest rated movie of each year. This might require assignment statements and multiple calls to <code>sort</code>. Try experimenting with simple calls to <code>sort</code> or drawing out each step of this process on paper.
</div>

In [None]:
...

In [18]:
# SOLUTION
highest = ratings2.sort("Rating", descending=True)
highest.sort("Year", distinct=True)

Votes,Rating,Title,Year,Decade
55784,8.3,The Kid (1921),1921,1920
58506,8.2,The Gold Rush (1925),1925,1920
46332,8.2,The General (1926),1926,1920
98794,8.3,Metropolis (1927),1927,1920
92375,8.5,City Lights (1931),1931,1930
56842,8.1,It Happened One Night (1934),1934,1930
121668,8.5,Modern Times (1936),1936,1930
69510,8.2,Mr. Smith Goes to Washington (1939),1939,1930
117590,8.4,The Great Dictator (1940),1940,1940
268905,8.4,Citizen Kane (1941),1941,1940
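The two-step solution works because of the order of the steps. After sorting by rating from highest to lowest, the first row encountered for each year is that year's highest rated movie, provided that the second sort keeps tied rows in their existing order. The plain-Python sketch below assumes that behavior (Python's `sorted` is stable in this sense) and uses a few rows from the outputs above. It illustrates the idea and is not the library's code.

```python
# (title, year, rating) rows taken from the outputs above.
rows = [
    ("M (1931)", 1931, 8.4),
    ("City Lights (1931)", 1931, 8.5),
    ("The Grapes of Wrath (1940)", 1940, 8.1),
    ("The Great Dictator (1940)", 1940, 8.4),
]

# Step 1: sort by rating, highest first.
by_rating = sorted(rows, key=lambda row: row[2], reverse=True)

# Step 2: sort by year. sorted is stable, so within each year
# the rows stay in the rating order produced by step 1.
by_year = sorted(by_rating, key=lambda row: row[1])

# Step 3: keep the first row for each year.
best_per_year = {}
for title, year, rating in by_year:
    if year not in best_per_year:
        best_per_year[year] = title

print(best_per_year)
```

Reversing the two sorts would not give the same guarantee, since the first row for each year would then depend on the order the rows started in.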


### Downloading as PDF

Download this notebook as a PDF by clicking <b><code>File > Download as > PDF via LaTeX (.pdf)</code></b>. Submit the PDF to bCourses under the corresponding assignment.

#### References

- Sections of "Intro to Jupyter" and "Table Transformation" adapted from materials by Kelly Chen and Ashley Chien in [UC Berkeley Data Science Modules core resources](http://github.com/ds-modules/core-resources)

Authored by Keeley Takimoto, Adapted by the BUDS Team