In [1]:
import numpy as np
import math
from datascience import *

# <div align="center">Conceptual Review Tables</div>

**Created By: Edwin Vargas Navarro (jedwin321@berkeley.edu)**

## Creating Tables

The `Table()` function creates an empty table. It takes no arguments and returns a table with no rows or columns.

In [2]:
sample_table = Table()
sample_table

## Adding Columns to Tables

`tbl.with_column("Column Name", array)` returns a new table made of `tbl` plus one added column; `tbl` itself is not changed. The array must have the same number of elements as the table has rows, unless the table is still empty (as with `Table()`).

`tbl.with_columns("Column Name 1", array_1, "Column Name 2", array_2, ...)` adds multiple columns at once. Notice that this method name ends in an **s**, which `with_column` does not have.

In [3]:
numbers_text = make_array("One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten")
numbers = np.arange(1, 11, 1)
numbers_v2 = make_array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
numbers

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [4]:
num_tbl = Table().with_columns("Numbers Text", numbers_text, "Numbers", numbers)
num_tbl

Numbers Text,Numbers
One,1
Two,2
Three,3
Four,4
Five,5
Six,6
Seven,7
Eight,8
Nine,9
Ten,10


## Loading a table from a file

In most cases, we aren't going to go through the trouble of typing in all the data manually. Instead, we load the data from an external source, like a data file. There are many formats for data files, but CSV ("comma-separated values") is one of the most common.

`Table.read_table(...)` takes one argument (the path to a data file, given as a **string**) and returns a table.

`imdb.csv` contains a table of information about the 250 highest-rated movies on IMDb. It is loaded as a table called `imdb`.

In [5]:
imdb = Table.read_table("imdb.csv")
imdb

Votes,Rating,Title,Year,Decade
88355,8.4,M,1931,1930
132823,8.3,Singin' in the Rain,1952,1950
74178,8.3,All About Eve,1950,1950
635139,8.6,Léon,1994,1990
145514,8.2,The Elephant Man,1980,1980
425461,8.3,Full Metal Jacket,1987,1980
441174,8.1,Gone Girl,2014,2010
850601,8.3,Batman Begins,2005,2000
37664,8.2,Judgment at Nuremberg,1961,1960
46987,8.0,Relatos salvajes,2014,2010


## Table to Arrays to Items

### `column`

`column` takes the column name of a table (in string format) as its argument and returns the values in that column as an **array**. 

### `item`

`array.item(n)` returns the element at index `n` of an array. Indices start at 0, so `item(0)` is the first element.
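
The indexing rule can be checked on a small array. This is a sketch that uses a plain `np.array` as a stand-in for `make_array` (which also produces a NumPy array); the values are arbitrary.

```python
import numpy as np

# A plain NumPy array stands in for make_array(10, 20, 30, 40, 50).
sample_array = np.array([10, 20, 30, 40, 50])

first = sample_array.item(0)                     # index 0 is the first element
last = sample_array.item(len(sample_array) - 1)  # the last valid index is length - 1

# An index equal to the length is one past the end of the array.
try:
    sample_array.item(len(sample_array))
    out_of_range = False
except IndexError:
    out_of_range = True

print(first, last, out_of_range)
```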

In [6]:
imdb_titles_one = imdb.column("Title").item(0)
imdb_titles_one

'M'

In [7]:
sample_array = make_array(1,2,3,4,5)
sample_array.item(4)

5

## More Table Operations
### `take`
The table method `take` takes as its argument an array of numbers (or a single number). Each number should be the index of a row in the table, counting from 0. It returns a **new table** with only those rows.

You'll usually want to use `take` in conjunction with `np.arange` to take the first few rows of a table.
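
When building the index array, remember that `np.arange` stops just before its end value. The sketch below shows the indices it produces and uses ordinary array indexing on a made-up `titles` array as a stand-in for picking rows with `take`.

```python
import numpy as np

# np.arange(stop) counts from 0 up to, but not including, stop.
first_five = np.arange(5)
print(first_five)

# np.arange(start, stop) works the same way with an explicit start.
rows_3_to_6 = np.arange(3, 7)
print(rows_3_to_6)

# Hypothetical titles standing in for a table column. Indexing with an
# index array picks those positions, much as take picks rows by index.
titles = np.array(["A", "B", "C", "D", "E", "F", "G", "H"])
print(titles[first_five])
```

So `tbl.take(np.arange(5))` would ask for rows 0 through 4, which is five rows.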

In [8]:
imdb.take(0) # gets row 0

Votes,Rating,Title,Year,Decade
88355,8.4,M,1931,1930


In [9]:
imdb.take(make_array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)) # gets rows 0-10

Votes,Rating,Title,Year,Decade
88355,8.4,M,1931,1930
132823,8.3,Singin' in the Rain,1952,1950
74178,8.3,All About Eve,1950,1950
635139,8.6,Léon,1994,1990
145514,8.2,The Elephant Man,1980,1980
425461,8.3,Full Metal Jacket,1987,1980
441174,8.1,Gone Girl,2014,2010
850601,8.3,Batman Begins,2005,2000
37664,8.2,Judgment at Nuremberg,1961,1960
46987,8.0,Relatos salvajes,2014,2010


In [10]:
imdb.take(np.arange(0, 11)) # gets rows 0-10

Votes,Rating,Title,Year,Decade
88355,8.4,M,1931,1930
132823,8.3,Singin' in the Rain,1952,1950
74178,8.3,All About Eve,1950,1950
635139,8.6,Léon,1994,1990
145514,8.2,The Elephant Man,1980,1980
425461,8.3,Full Metal Jacket,1987,1980
441174,8.1,Gone Girl,2014,2010
850601,8.3,Batman Begins,2005,2000
37664,8.2,Judgment at Nuremberg,1961,1960
46987,8.0,Relatos salvajes,2014,2010


## Challenge Question

How can I get the number of movies in `imdb` that were released in an even-numbered year?

*Hint:* The modulo operator `%` computes the remainder when dividing by a number. So `5 % 2` is 1 and `6 % 2` is 0. A number is even if the remainder is 0 when you divide by 2.

*Hint 2:* `%` can be used on arrays, operating elementwise like `+` or `*`.  So `make_array(5, 6, 7) % 2` is `array([1, 0, 1])`.

In [11]:
# Challenge code here
imdb.column("Year") % 2 

array([1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0,
       0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0,
       0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1,
       0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1,
       1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0,
       1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,
       0, 1, 1, 0, 1, 1, 0, 0])

In [12]:
len(imdb.column("Year") % 2) - np.sum(imdb.column("Year") % 2) # total rows minus odd-year rows

127

In [13]:
np.sum(imdb.column("Year") % 2 == 0)

127

In [14]:
imdb.with_column("Year Remainder", imdb.column("Year") % 2).where('Year Remainder', are.equal_to(0)).num_rows

127

## Sorting

### `sort` 

`tbl.sort("Column Name")` or `tbl.sort(column_index)` creates a **copy** of a table sorted by the values in a column. It sorts in **ascending** order unless `descending=True` is included as an additional argument. Adding `distinct=True` keeps only one row for each distinct value in the sort column.

In [15]:
imdb.sort("Year")

Votes,Rating,Title,Year,Decade
55784,8.3,The Kid,1921,1920
58506,8.2,The Gold Rush,1925,1920
46332,8.2,The General,1926,1920
98794,8.3,Metropolis,1927,1920
88355,8.4,M,1931,1930
92375,8.5,City Lights,1931,1930
56842,8.1,It Happened One Night,1934,1930
121668,8.5,Modern Times,1936,1930
69510,8.2,Mr. Smith Goes to Washington,1939,1930
259235,8.1,The Wizard of Oz,1939,1930


In [16]:
imdb.sort("Year", descending = True, distinct = True) # Notice how this has 80 rows and not the full 250

Votes,Rating,Title,Year,Decade
79615,8.5,Inside Out (2015/I),2015,2010
441174,8.1,Gone Girl,2014,2010
359121,8.1,12 Years a Slave,2013,2010
137310,8.2,Jagten,2012,2010
287727,8.2,Warrior,2011,2010
670328,8.1,Shutter Island,2010,2010
755013,8.3,Inglourious Basterds,2009,2000
502773,8.2,Gran Torino,2008,2000
433487,8.1,The Bourne Ultimatum,2007,2000
229533,8.4,Das Leben der Anderen,2006,2000


In [17]:
imdb.sort("Title")

Votes,Rating,Title,Year,Decade
384187,8.9,12 Angry Men,1957,1950
359121,8.1,12 Years a Slave,2013,2010
373482,8.3,2001: A Space Odyssey,1968,1960
167076,8.2,3 Idiots,2009,2000
69988,8.1,8½,1963,1960
525515,8.1,A Beautiful Mind,2001,2000
489807,8.3,A Clockwork Orange,1971,1970
91652,8.0,Akira,1988,1980
496833,8.5,Alien,1979,1970
436218,8.4,Aliens,1986,1980


In [18]:
imdb.sort("Title", descending = True)

Votes,Rating,Title,Year,Decade
65370,8.3,Yôjinbô,1961,1960
138240,8.0,Yip Man,2008,2000
427099,8.0,X-Men: Days of Future Past,2014,2010
53186,8.3,Witness for the Prosecution,1957,1950
50208,8.0,Who's Afraid of Virginia Woolf?,1966,1960
264333,8.5,Whiplash,2014,2010
287727,8.2,Warrior,2011,2010
618914,8.4,WALL·E,2008,2000
218430,8.4,Vertigo,1958,1950
700999,8.2,V for Vendetta,2005,2000


## Filtering

### `where` 

`tbl.where(...)` takes 2 arguments:

1. The name of a column.  `where` finds rows where that column's values meet some criterion.
2. A predicate that describes the criterion that the column needs to meet.

`where` returns a table that's a copy of the original table, but **with only the rows that meet the given predicate**.

|Predicate|Example|Result|
|-|-|-|
|`are.equal_to`|`are.equal_to(50)`|Find rows with values equal to 50|
|`are.not_equal_to`|`are.not_equal_to(50)`|Find rows with values not equal to 50|
|`are.above`|`are.above(50)`|Find rows with values above (and not equal to) 50|
|`are.above_or_equal_to`|`are.above_or_equal_to(50)`|Find rows with values above 50 or equal to 50|
|`are.below`|`are.below(50)`|Find rows with values below (and not equal to) 50|
|`are.between`|`are.between(2, 10)`|Find rows with values above or equal to 2 and below 10|
|`are.between_or_equal_to`|`are.between_or_equal_to(2, 10)`|Find rows with values above or equal to 2 and below or equal to 10|
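
The only difference between the last two predicates is whether the upper endpoint is included. The sketch below mirrors that rule with plain NumPy comparisons on made-up values; it illustrates the semantics in the table and is not how `where` is implemented.

```python
import numpy as np

values = np.array([1, 2, 5, 10, 11])  # hypothetical column values

# Mirrors are.between(2, 10): lower bound included, upper bound excluded.
between = (values >= 2) & (values < 10)

# Mirrors are.between_or_equal_to(2, 10): both bounds included.
between_or_equal = (values >= 2) & (values <= 10)

print(values[between])
print(values[between_or_equal])
```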


In [19]:
imdb.where("Decade", 1930) # defaults to are.equal_to

Votes,Rating,Title,Year,Decade
88355,8.4,M,1931,1930
56842,8.1,It Happened One Night,1934,1930
69510,8.2,Mr. Smith Goes to Washington,1939,1930
259235,8.1,The Wizard of Oz,1939,1930
192791,8.1,Gone with the Wind,1939,1930
121668,8.5,Modern Times,1936,1930
92375,8.5,City Lights,1931,1930


In [20]:
imdb.where("Decade", are.equal_to(1930)) # same as above

Votes,Rating,Title,Year,Decade
88355,8.4,M,1931,1930
56842,8.1,It Happened One Night,1934,1930
69510,8.2,Mr. Smith Goes to Washington,1939,1930
259235,8.1,The Wizard of Oz,1939,1930
192791,8.1,Gone with the Wind,1939,1930
121668,8.5,Modern Times,1936,1930
92375,8.5,City Lights,1931,1930


In [21]:
imdb.where("Rating", are.above_or_equal_to(9))

Votes,Rating,Title,Year,Decade
1027398,9.2,The Godfather,1972,1970
1498733,9.2,The Shawshank Redemption,1994,1990
692753,9.0,The Godfather: Part II,1974,1970


## Grouping

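### `group`
`tbl.group(column_or_columns, func)` groups rows by the unique values in a column, or by the unique combinations of values across several columns. Multiple columns must be passed as an array or list. With no function, the result has a `count` column giving the number of rows in each group; with a function such as `np.average`, each remaining column is aggregated by that function. Columns the function cannot handle, such as text, come back empty.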

In [22]:
imdb.group("Rating") # tells us the count of how many of each rating

Rating,count
8.0,50
8.1,53
8.2,34
8.3,38
8.4,22
8.5,24
8.6,9
8.7,9
8.8,2
8.9,6


In [23]:
imdb.group("Decade") # tells us how many movies per decade in our dataset

Decade,count
1920,4
1930,7
1940,14
1950,30
1960,22
1970,21
1980,31
1990,42
2000,50
2010,29


In [24]:
imdb.group("Year", np.average)

Year,Votes average,Rating average,Title average,Decade average
1921,55784.0,8.3,,1920
1925,58506.0,8.2,,1920
1926,46332.0,8.2,,1920
1927,98794.0,8.3,,1920
1931,90365.0,8.45,,1930
1934,56842.0,8.1,,1930
1936,121668.0,8.5,,1930
1939,173845.0,8.13333,,1930
1940,83866.3,8.23333,,1940
1941,185330.0,8.25,,1940


## Combining Everything

Find the average rating for movies in the 21st century (taken here as the year 2000 onward).

In [25]:
twenty_one = imdb.where("Year", are.above_or_equal_to(2000))
twenty_one

Votes,Rating,Title,Year,Decade
441174,8.1,Gone Girl,2014,2010
850601,8.3,Batman Begins,2005,2000
46987,8.0,Relatos salvajes,2014,2010
502773,8.2,Gran Torino,2008,2000
755013,8.3,Inglourious Basterds,2009,2000
229533,8.4,Das Leben der Anderen,2006,2000
55382,8.0,Bom yeoreum gaeul gyeoul geurigo bom,2003,2000
700999,8.2,V for Vendetta,2005,2000
102735,8.1,Mary and Max,2009,2000
287727,8.2,Warrior,2011,2010


In [26]:
twenty_one_average = np.average(twenty_one.column("Rating"))
twenty_one_average

8.237974683544303

Does the 21st century have a higher average rating than the 20th century?

In [27]:
twenty_average = np.average(imdb.where("Year", are.between(1900, 2000)).column("Rating"))
twenty_average

8.278362573099415

In [28]:
twenty_one_average > twenty_average 

False

## Which decade has the highest average rating?

In [29]:
imdb.group("Decade", np.average)

Decade,Votes average,Rating average,Title average,Year average
1920,64854,8.25,,1924.75
1930,125825,8.27143,,1935.57
1940,122767,8.25714,,1944.07
1950,109829,8.23333,,1955.0
1960,147391,8.23182,,1964.14
1970,353377,8.30952,,1975.62
1980,295290,8.24194,,1984.29
1990,559449,8.35714,,1995.19
2000,481929,8.26,,2004.26
2010,438006,8.2,,2012.31


In [30]:
imdb.group("Decade", np.average).sort("Rating average", descending = True)

Decade,Votes average,Rating average,Title average,Year average
1990,559449,8.35714,,1995.19
1970,353377,8.30952,,1975.62
1930,125825,8.27143,,1935.57
2000,481929,8.26,,2004.26
1940,122767,8.25714,,1944.07
1920,64854,8.25,,1924.75
1980,295290,8.24194,,1984.29
1950,109829,8.23333,,1955.0
1960,147391,8.23182,,1964.14
2010,438006,8.2,,2012.31


In [31]:
imdb.group("Decade", np.average).sort("Rating average", descending = True).column("Decade")

array([1990, 1970, 1930, 2000, 1940, 1920, 1980, 1950, 1960, 2010])

In [32]:
imdb.group("Decade", np.average).sort("Rating average", descending = True).column("Decade").item(0)

1990

In [33]:
make_array(0).item()

0