# Lab 3: Tables

Welcome to lab 3!  This week, we'll learn about *tables*, which let us work with multiple arrays of data about the same things.  Tables are described in [Chapter 5](http://www.inferentialthinking.com/chapters/05/tables.html) of the text.

First, set up the tests and imports by running the cell below.

In [2]:
import numpy as np
from datascience import *
import pandas as pd

# These lines load the tests.

#from client.api.notebook import Notebook
#ok = Notebook('lab03.ok')
#_ = ok.auth(inline=True)

## 1. Introduction

For a collection of things in the world, an array is useful for describing a single attribute of each thing. For example, among the collection of US States, an array could describe the land area of each. Tables extend this idea by describing multiple attributes for each element of a collection.

In most data science applications, we have data about many entities, but we also have several kinds of data about each entity.

For example, in the cell below we have two arrays. The first one contains the world population in each year (as [estimated](http://www.census.gov/population/international/data/worldpop/table_population.php) by the US Census Bureau), and the second contains the years themselves (in order, so the first elements in the population and the years arrays correspond).

In [3]:
population_amounts = Table.read_table("world_population.csv").column("Population")
years = np.arange(1950, 2015+1)
print("Population column:", population_amounts)
print("Years column:", years)

Population column: [2557628654 2594939877 2636772306 2682053389 2730228104 2782098943
 2835299673 2891349717 2948137248 3000716593 3043001508 3083966929
 3140093217 3209827882 3281201306 3350425793 3420677923 3490333715
 3562313822 3637159050 3712697742 3790326948 3866568653 3942096442
 4016608813 4089083233 4160185010 4232084578 4304105753 4379013942
 4451362735 4534410125 4614566561 4695736743 4774569391 4856462699
 4940571232 5027200492 5114557167 5201440110 5288955934 5371585922
 5456136278 5538268316 5618682132 5699202985 5779440593 5857972543
 5935213248 6012074922 6088571383 6165219247 6242016348 6318590956
 6395699509 6473044732 6551263534 6629913759 6709049780 6788214394
 6866332358 6944055583 7022349283 7101027895 7178722893 7256490011]
Years column: [1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964
 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994
 

Suppose we want to answer this question:

> When did world population cross 6 billion?

You could technically answer this question just from staring at the arrays, but it's a bit convoluted, since you would have to count the position where the population first crossed 6 billion, then find the corresponding element in the years array. In cases like these, it might be easier to put the data into a *`Table`*, a 2-dimensional type of dataset. 

The expression below:

- creates an empty table using the expression `Table()`,
- adds two columns by calling `with_columns` with four arguments,
- assigns the result to the name `population`, and finally
- evaluates `population` so that we can see the table.

The strings `"Year"` and `"Population"` are column labels that we have chosen. The names `population_amounts` and `years` were assigned above to two arrays of the same length. The function `with_columns` (you can find the documentation [here](http://data8.org/datascience/tables.html)) takes in alternating strings (to represent column labels) and arrays (representing the data in those columns), which are all separated by commas.

In [4]:
# population = Table().with_columns(
#     "Population", population_amounts,
#     "Year", years
# )
population = pd.DataFrame()
population['Population'] = population_amounts
population['Year'] = years
population

Unnamed: 0,Population,Year
0,2557628654,1950
1,2594939877,1951
2,2636772306,1952
3,2682053389,1953
4,2730228104,1954
5,2782098943,1955
6,2835299673,1956
7,2891349717,1957
8,2948137248,1958
9,3000716593,1959


Now the data are all together in a single table! It's much easier to parse this data--if you need to know what the population was in 1959, for example, you can tell from a single glance. We'll revisit this table later.
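To make the "single glance" point concrete, here is a minimal sketch that rebuilds the first ten rows shown above as a pandas `DataFrame` and looks up the 1959 population by year instead of by counting positions. The names `amounts`, `small_years`, `small_population`, and `pop_1959` are made up for this illustration; the numbers are copied from the output above.

```python
import numpy as np
import pandas as pd

# First ten population values and their years, copied from the output above.
amounts = np.array([2557628654, 2594939877, 2636772306, 2682053389, 2730228104,
                    2782098943, 2835299673, 2891349717, 2948137248, 3000716593])
small_years = np.arange(1950, 1959 + 1)

# One column per array; the dictionary keys become the column labels.
small_population = pd.DataFrame({"Population": amounts, "Year": small_years})

# Keep the row whose Year is 1959, then read its Population value.
pop_1959 = small_population.loc[small_population["Year"] == 1959, "Population"].iloc[0]
print(pop_1959)
```

Because each row keeps a year next to its population, the lookup never depends on the position of the value in the array.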

## 2. Creating Tables

**Question 2.1.** In the cell below, we've created 2 arrays. Using the steps above, assign `top_10_movies` to a table that has two columns called "Rating" and "Name", which hold `top_10_movie_ratings` and `top_10_movie_names` respectively.

In [5]:
top_10_movie_ratings = make_array(9.2, 9.2, 9., 8.9, 8.9, 8.9, 8.9, 8.9, 8.9, 8.8)
top_10_movie_names = make_array(
        'The Shawshank Redemption (1994)',
        'The Godfather (1972)',
        'The Godfather: Part II (1974)',
        'Pulp Fiction (1994)',
        "Schindler's List (1993)",
        'The Lord of the Rings: The Return of the King (2003)',
        '12 Angry Men (1957)',
        'The Dark Knight (2008)',
        'Il buono, il brutto, il cattivo (1966)',
        'The Lord of the Rings: The Fellowship of the Ring (2001)')

top_10_movies = pd.DataFrame()
top_10_movies['Rating'] = top_10_movie_ratings
top_10_movies['Name'] = top_10_movie_names
# We've put this next line here so your table will get printed out when you
# run this cell.
top_10_movies

Unnamed: 0,Rating,Name
0,9.2,The Shawshank Redemption (1994)
1,9.2,The Godfather (1972)
2,9.0,The Godfather: Part II (1974)
3,8.9,Pulp Fiction (1994)
4,8.9,Schindler's List (1993)
5,8.9,The Lord of the Rings: The Return of the King ...
6,8.9,12 Angry Men (1957)
7,8.9,The Dark Knight (2008)
8,8.9,"Il buono, il brutto, il cattivo (1966)"
9,8.8,The Lord of the Rings: The Fellowship of the R...


#### Loading a table from a file
In most cases, we aren't going to go through the trouble of typing in all the data manually. Instead, we can load the data from a file.

`Table.read_table` takes one argument, a path to a data file (a string), and returns a table.  There are many formats for data files, but CSV ("comma-separated values") is the most common.  (The cells in this notebook use the pandas counterpart, `pd.read_csv`.)

**Question 2.2.** The file `imdb.csv` contains a table of information about the 250 highest-rated movies on IMDb.  Load it as a table called `imdb`.

In [6]:
imdb = pd.read_csv('imdb.csv')
imdb

Unnamed: 0,Votes,Rating,Title,Year,Decade
0,88355,8.4,M,1931,1930
1,132823,8.3,Singin' in the Rain,1952,1950
2,74178,8.3,All About Eve,1950,1950
3,635139,8.6,Léon,1994,1990
4,145514,8.2,The Elephant Man,1980,1980
5,425461,8.3,Full Metal Jacket,1987,1980
6,441174,8.1,Gone Girl,2014,2010
7,850601,8.3,Batman Begins,2005,2000
8,37664,8.2,Judgment at Nuremberg,1961,1960
9,46987,8.0,Relatos salvajes,2014,2010


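A `datascience` table displays only its first 10 rows, followed by a note such as "... (240 rows omitted)"; the output above likewise shows only 10 rows.  This table is big enough that only a few of its rows are displayed, but the others are still there.  10 are shown and 240 are omitted, so there are 250 movies total.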

Where did `imdb.csv` come from? Take a look at [this lab's folder](./). You should see a file called `imdb.csv`.

Open up the `imdb.csv` file in that folder and look at the format. What do you notice? The `.csv` filename ending says that this file is in the [CSV (comma-separated value) format](http://edoceo.com/utilitas/csv-file-format).
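If you can't open the file right now, the sketch below shows roughly what CSV text looks like and how it becomes a table. The exact layout of `imdb.csv` is an assumption here: the header and the three rows are reconstructed from the table displayed above, and `csv_text` and `sample` are names invented for the example.

```python
import io
import pandas as pd

# Hypothetical CSV text: one header line, then one line per movie.
# Commas separate the values that belong to different columns.
csv_text = """Votes,Rating,Title,Year,Decade
88355,8.4,M,1931,1930
132823,8.3,Singin' in the Rain,1952,1950
74178,8.3,All About Eve,1950,1950
"""

# pd.read_csv accepts a file path or any file-like object, so an
# in-memory string works the same way as a file on disk.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
print(list(sample.columns))
```

The first line supplies the column labels, and every later line supplies one row.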

## 4. Analyzing datasets
With just a few table methods, we can answer some interesting questions about the IMDb dataset.

If we want just the ratings of the movies, we can get an array that contains the data in that column:

In [24]:
imdb["Rating"]

0      8.4
1      8.3
2      8.3
3      8.6
4      8.2
5      8.3
6      8.1
7      8.3
8      8.2
9      8.0
10     8.1
11     8.2
12     8.3
13     8.3
14     8.1
15     8.4
16     8.5
17     8.2
18     8.1
19     8.4
20     8.1
21     8.1
22     9.2
23     8.0
24     8.2
25     8.1
26     8.2
27     8.5
28     8.0
29     8.3
      ... 
220    8.4
221    8.0
222    8.1
223    8.7
224    8.9
225    8.3
226    8.1
227    8.1
228    8.0
229    8.2
230    8.4
231    8.4
232    8.1
233    8.3
234    8.4
235    8.2
236    8.5
237    8.0
238    8.2
239    8.1
240    8.4
241    8.1
242    8.6
243    8.4
244    8.1
245    8.7
246    8.1
247    8.2
248    8.1
249    8.3
Name: Rating, dtype: float64

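With the `datascience` library, `imdb.column("Rating")` evaluates to an array, exactly the same kind of thing you'd get if you typed in `make_array(8.4, 8.3, 8.3, [etc])`.  The pandas expression above evaluates to a `Series`, which adds row labels (the left-hand numbers in the output) and can be passed to the array functions used in this lab, such as `np.average`.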

**Question 4.1.** Find the rating of the highest-rated movie in the dataset.

*Hint:* Think back to the functions you've learned about for working with arrays of numbers.  Ask for help if you can't remember one that's useful for this.

In [25]:
highest_rating = max(imdb['Rating'])
highest_rating

9.1999999999999993

In [26]:
_ = ok.grade('q4_1')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



That's not very useful, though.  You'd probably want to know the *name* of the movie whose rating you found!  To do that, we can sort the entire table by rating, which ensures that the ratings and titles will stay together. Note that calling sort creates a copy of the table and leaves the original table unsorted.

In [29]:
imdb.sort_values("Rating")

Unnamed: 0,Votes,Rating,Title,Year,Decade
124,91652,8.0,Akira,1988,1980
176,124671,8.0,Per un pugno di dollari,1964,1960
93,527349,8.0,Guardians of the Galaxy,2014,2010
92,49135,8.0,The Man Who Shot Liberty Valance,1962,1960
180,39447,8.0,Underground,1995,1990
182,28012,8.0,Le samouraï,1967,1960
85,268480,8.0,Beauty and the Beast,1991,1990
77,42446,8.0,La strada,1954,1950
174,862016,8.0,The Avengers,2012,2010
191,434906,8.0,The King's Speech,2010,2010


Well, that actually doesn't help much, either -- we sorted the movies from lowest -> highest ratings.  To look at the highest-rated movies, sort in reverse order:

In [30]:
imdb.sort_values("Rating", ascending = False)

Unnamed: 0,Votes,Rating,Title,Year,Decade
22,1027398,9.2,The Godfather,1972,1970
53,1498733,9.2,The Shawshank Redemption,1994,1990
91,692753,9.0,The Godfather: Part II,1974,1970
105,384187,8.9,12 Angry Men,1957,1950
57,447875,8.9,"Il buono, il brutto, il cattivo (1966)",1966,1960
76,1473049,8.9,The Dark Knight,2008,2000
147,761224,8.9,Schindler's List,1993,1990
141,1074146,8.9,The Lord of the Rings: The Return of the King,2003,2000
224,1166532,8.9,Pulp Fiction,1994,1990
178,1099087,8.8,The Lord of the Rings: The Fellowship of the Ring,2001,2000


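(In the `datascience` library this call is written `imdb.sort("Rating", descending=True)`. The `descending=True` bit is called an *optional argument*. It has a default value of `False`, so the function sorts in ascending order unless you explicitly tell it `descending=True`. The pandas call above expresses the same thing with `ascending=False`, whose default value is `True`.)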

So there are actually 2 highest-rated movies in the dataset: *The Shawshank Redemption* and *The Godfather*.

Some details about sort:

1. The first argument to `sort` is the name of a column to sort by.
2. If the column has strings in it, `sort` will sort alphabetically; if the column has numbers, it will sort numerically.
3. The value of `imdb.sort("Rating")` is a *copy of `imdb`*; the `imdb` table doesn't get modified. For example, if we called `imdb.sort("Rating")`, then running `imdb` by itself would still return the unsorted table.
4. Rows always stick together when a table is sorted.  It wouldn't make sense to sort just one column and leave the other columns alone.  For example, in this case, if we sorted just the "Rating" column, the movies would all end up with the wrong ratings.
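The points in the list above can be checked on a tiny table. This is a sketch using pandas `sort_values`, the method the cells in this notebook use in place of `sort`; the three rows are copied from the `imdb` output earlier, and the names `movies`, `by_rating`, and `by_title` are made up for the example.

```python
import pandas as pd

# Three rows copied from the imdb output earlier in this lab.
movies = pd.DataFrame({
    "Rating": [8.4, 8.3, 8.6],
    "Title": ["M", "All About Eve", "Léon"],
})

by_rating = movies.sort_values("Rating", ascending=False)  # numeric column
by_title = movies.sort_values("Title")                     # string column

print(list(by_rating["Title"]))  # each title moved together with its rating
print(list(by_title["Title"]))   # alphabetical order
print(list(movies["Title"]))     # the original table keeps its order
```

Sorting returned new tables both times, which is why `movies` still prints in its original order.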

**Question 4.2.** Create a version of `imdb` that's sorted chronologically, with the earliest movies first.  Call it `imdb_by_year`.

In [44]:
# drop=True discards the old row labels instead of keeping them as an 'index' column.
imdb_by_year = imdb.sort_values('Year').reset_index(drop=True)
imdb_by_year

Unnamed: 0,Votes,Rating,Title,Year,Decade
0,55784,8.3,The Kid,1921,1920
1,58506,8.2,The Gold Rush,1925,1920
2,46332,8.2,The General,1926,1920
3,98794,8.3,Metropolis,1927,1920
4,88355,8.4,M,1931,1930
5,92375,8.5,City Lights,1931,1930
6,56842,8.1,It Happened One Night,1934,1930
7,121668,8.5,Modern Times,1936,1930
8,259235,8.1,The Wizard of Oz,1939,1930
9,192791,8.1,Gone with the Wind,1939,1930


**Question 4.3.** What's the title of the earliest movie in the dataset?  You could just look this up from the output of the previous cell.  Instead, write Python code to find out.

*Hint:* Starting with `imdb_by_year`, extract the Title column to get an array, then use `item` to get its first item.

In [45]:
earliest_movie_title = imdb_by_year['Title'][0]
earliest_movie_title

'The Kid'

In [46]:
_ = ok.grade('q4_3')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



## 5. Finding pieces of a dataset
Suppose you're interested in movies from the 1940s.  Sorting the table by year doesn't help you, because the 1940s are in the middle of the dataset.

Instead, we use the table method `where`.

In [47]:
forties = imdb[imdb['Decade'] == 1940]
forties

Unnamed: 0,Votes,Rating,Title,Year,Decade
21,55793,8.1,The Grapes of Wrath,1940,1940
50,86715,8.3,Double Indemnity,1944,1940
72,101754,8.1,The Maltese Falcon,1941,1940
75,71003,8.3,The Treasure of the Sierra Madre,1948,1940
102,35983,8.1,The Best Years of Our Lives,1946,1940
118,81887,8.3,Ladri di biciclette,1948,1940
120,66622,8.0,Notorious,1946,1940
158,350551,8.5,Casablanca,1942,1940
166,59578,8.0,The Big Sleep,1946,1940
167,78216,8.2,Rebecca,1940,1940


Ignore the syntax for the moment.  Instead, try to read that line like this:

> Assign the name **`forties`** to a table whose rows are the rows in the **`imdb`** table **`where`** the **`'Decade'`**s **`are` `equal` `to` `1940`**.

**Question 5.1.** Compute the average rating of movies from the 1940s.

*Hint:* The function `np.average` computes the average of an array of numbers.

In [48]:
average_rating_in_forties = np.average(forties['Rating'])
average_rating_in_forties

8.2571428571428562

In [49]:
_ = ok.grade('q5_1')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



Now let's dive into the details a bit more.  `where` takes 2 arguments:

1. The name of a column.  `where` finds rows where that column's values meet some criterion.
2. Something that describes the criterion that the column needs to meet, called a predicate.

To create our predicate, we called the function `are.equal_to` with the value we wanted, 1940.  We'll see other predicates soon.

`where` returns a table that's a copy of the original table, but with only the rows that meet the given predicate.
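To see how the two arguments fit together, here is a rough pandas sketch of the idea. The helpers `where` and `equal_to` below are hypothetical stand-ins written for this illustration; they are not the `datascience` implementation, and the three sample rows are taken from outputs earlier in this lab.

```python
import pandas as pd

def where(table, column, predicate):
    """Return a new DataFrame with the rows whose `column` value satisfies `predicate`."""
    mask = table[column].apply(predicate)  # one True/False value per row
    return table[mask].copy()

def equal_to(value):
    """Build a predicate: a function that tests one value for equality with `value`."""
    return lambda x: x == value

movies = pd.DataFrame({
    "Title": ["The Grapes of Wrath", "Casablanca", "Fight Club"],
    "Decade": [1940, 1940, 1990],
})

forties_sample = where(movies, "Decade", equal_to(1940))
print(forties_sample)
```

The predicate is built once from the value 1940 and then applied to every entry of the chosen column; rows that pass are copied into the result, and `movies` itself is left alone.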

**Question 5.2.** Create a table called `ninety_nine` containing the movies that came out in the year 1999.  Use `where`.

In [51]:
ninety_nine = imdb[imdb['Year'] == 1999]
ninety_nine

Unnamed: 0,Votes,Rating,Title,Year,Decade
87,1177098,8.8,Fight Club,1999,1990
104,735056,8.4,American Beauty,1999,1990
115,630994,8.1,The Sixth Sense,1999,1990
129,1073043,8.7,The Matrix,1999,1990
149,672878,8.5,The Green Mile,1999,1990


So far we've only been finding where a column is *exactly* equal to a certain value. However, there are many other predicates.  Here are a few:

|Predicate|Example|Result|
|-|-|-|
|`are.equal_to`|`are.equal_to(50)`|Find rows with values equal to 50|
|`are.not_equal_to`|`are.not_equal_to(50)`|Find rows with values not equal to 50|
|`are.above`|`are.above(50)`|Find rows with values above (and not equal to) 50|
|`are.above_or_equal_to`|`are.above_or_equal_to(50)`|Find rows with values above 50 or equal to 50|
|`are.below`|`are.below(50)`|Find rows with values below 50|
|`are.between`|`are.between(2, 10)`|Find rows with values above or equal to 2 and below 10|

The textbook section on selecting rows has more examples.
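For readers following the pandas cells in this notebook, each predicate in the table corresponds to an ordinary comparison. The sketch below is one way to write them, with the matching predicate noted in a comment; the `values` series is made-up sample data.

```python
import pandas as pd

values = pd.Series([1, 2, 9, 10, 50, 51])

equal_to_50 = values[values == 50]            # are.equal_to(50)
not_equal_to_50 = values[values != 50]        # are.not_equal_to(50)
above_50 = values[values > 50]                # are.above(50)
above_or_equal_to_50 = values[values >= 50]   # are.above_or_equal_to(50)
below_50 = values[values < 50]                # are.below(50)
# are.between(2, 10): includes the lower bound 2, excludes the upper bound 10.
between_2_and_10 = values[(values >= 2) & (values < 10)]

print(list(between_2_and_10))
```

Note the parentheses around each comparison in the last line; `&` binds more tightly than `>=` and `<`, so they are required.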


**Question 5.3.** Using `where` and one of the predicates from the table above, find all the movies with a rating higher than 8.5.  Put their data in a table called `really_highly_rated`.

In [52]:
really_highly_rated = imdb[imdb['Rating'] > 8.5]
really_highly_rated

Unnamed: 0,Votes,Rating,Title,Year,Decade
3,635139,8.6,Léon,1994,1990
22,1027398,9.2,The Godfather,1972,1970
47,767224,8.6,The Silence of the Lambs,1991,1990
53,1498733,9.2,The Shawshank Redemption,1994,1990
57,447875,8.9,"Il buono, il brutto, il cattivo (1966)",1966,1960
68,967389,8.7,The Lord of the Rings: The Two Towers,2002,2000
70,689541,8.6,Interstellar,2014,2010
76,1473049,8.9,The Dark Knight,2008,2000
78,192206,8.6,C'era una volta il West,1968,1960
81,1271949,8.7,Inception,2010,2010


**Question 5.4.** Find the average rating for movies released in the 20th century and the average rating for movies released in the 21st century for the movies in `imdb`.

*Hint*: Think of the steps you need to do (take the average, find the ratings, find movies released in 20th/21st centuries), and try to put them in an order that makes sense.

In [62]:
average_20th_century_rating = np.average(imdb[imdb["Year"] < 2000].Rating)
average_21st_century_rating = np.average(imdb[imdb["Year"] >= 2000].Rating)
print("Average 20th century rating:", average_20th_century_rating)
print("Average 21st century rating:", average_21st_century_rating)

Average 20th century rating: 8.2783625731
Average 21st century rating: 8.23797468354


In [63]:
_ = ok.grade('q5_4')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



The property `num_rows` tells you how many rows are in a table.  (A "property" is just a method that doesn't need to be called by adding parentheses.)

In [67]:
num_movies_in_dataset = len(imdb)
num_movies_in_dataset

250

**Question 5.5.** Use `num_rows` (and arithmetic) to find the *proportion* of movies in the dataset that were released in the 20th century, and the proportion from the 21st century.

*Hint:* The *proportion* of movies released in the 20th century is the *number* of movies released in the 20th century, divided by the *total number* of movies.
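As a small worked example of that arithmetic, the sketch below uses only the release years of the ten `imdb` rows displayed earlier, not the full dataset. It follows this lab's convention that years before 2000 count as the 20th century; the variable names are made up for the illustration.

```python
import pandas as pd

# Release years of the ten imdb rows displayed earlier in this lab.
sample_years = pd.Series([1931, 1952, 1950, 1994, 1980, 1987, 2014, 2005, 1961, 2014])

count_20th = len(sample_years[sample_years < 2000])
count_21st = len(sample_years[sample_years >= 2000])

# proportion = number in the group / total number
proportion_20th = count_20th / len(sample_years)
proportion_21st = count_21st / len(sample_years)

print(count_20th, count_21st)
print(proportion_20th, proportion_21st)
```

Every year falls into exactly one of the two groups, so the two proportions add up to 1.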

In [68]:
proportion_in_20th_century = len(imdb[imdb['Year'] < 2000])/num_movies_in_dataset
proportion_in_21st_century = len(imdb[imdb['Year'] >= 2000])/num_movies_in_dataset
print("Proportion in 20th century:", proportion_in_20th_century)
print("Proportion in 21st century:", proportion_in_21st_century)

Proportion in 20th century: 0.684
Proportion in 21st century: 0.316


In [69]:
_ = ok.grade('q5_5')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



**Question 5.6.** Here's a challenge: Find the number of movies that came out in *even* years.

*Hint:* The operator `%` computes the remainder when dividing by a number.  So `5 % 2` is 1 and `6 % 2` is 0.  A number is even if the remainder is 0 when you divide by 2.

*Hint 2:* `%` can be used on arrays, operating elementwise like `+` or `*`.  So `make_array(5, 6, 7) % 2` is `array([1, 0, 1])`.

*Hint 3:* Create a column called "Year Remainder" that's the remainder when each movie's release year is divided by 2.  Make a copy of `imdb` that includes that column.  Then use `where` to find rows where that new column is equal to 0.  Then use `num_rows` to count the number of such rows.
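The elementwise behaviour of `%` from the hints can be tried on a few years before applying it to the whole table. This sketch uses five release years taken from the `imdb` output earlier; `sample_years`, `remainders`, and `num_even` are names invented for the example.

```python
import numpy as np

# Five release years taken from the imdb output earlier in this lab.
sample_years = np.array([1931, 1952, 1950, 1994, 1980])

# % works on every element at once, producing an array of remainders.
remainders = sample_years % 2

# A year is even exactly when its remainder is 0.
num_even = int(np.count_nonzero(remainders == 0))

print(remainders)
print(num_even)
```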

In [72]:
num_even_year_movies = len(imdb[imdb['Year'] % 2 == 0])
num_even_year_movies

127

In [73]:
_ = ok.grade('q5_6')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



**Question 5.7.** Check out the `population` table from the introduction to this lab.  Compute the year when the world population first went above 6 billion.

In [78]:
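# Rows are in chronological order, so the first row above 6 billion is the crossing year.
above_6_billion = population[population['Population'] > 6_000_000_000]
year_population_crossed_6_billion = above_6_billion['Year'].iloc[0]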
year_population_crossed_6_billion

1999

In [79]:
_ = ok.grade('q5_7')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



## 6. Miscellanea
There are a few more table methods you'll need to fill out your toolbox.  The first 3 have to do with manipulating the columns in a table.

The table `farmers_markets.csv` contains data on farmers' markets in the United States (data collected [by the USDA](https://apps.ams.usda.gov/FarmersMarketsExport/ExcelExport.aspx)).  Each row represents one such market.

**Question 6.1.** Load the dataset into a table.  Call it `farmers_markets`.

In [80]:
farmers_markets = pd.read_csv('farmers_markets.csv')
farmers_markets

Unnamed: 0,FMID,MarketName,Website,Facebook,Twitter,Youtube,OtherMedia,street,city,County,...,Coffee,Beans,Fruits,Grains,Juices,Mushrooms,PetFood,Tofu,WildHarvested,updateTime
0,1012063,Caledonia Farmers Market Association - Danville,https://sites.google.com/site/caledoniafarmers...,https://www.facebook.com/Danville.VT.Farmers.M...,,,,,Danville,Caledonia,...,Y,Y,Y,N,Y,N,Y,N,N,6/28/2016 12:10:09 PM
1,1011871,Stearns Homestead Farmers' Market,http://Stearnshomestead.com,,,,,6975 Ridge Road,Parma,Cuyahoga,...,N,N,Y,N,N,N,Y,N,N,4/9/2016 8:05:17 PM
2,1011878,100 Mile Market,http://www.pfcmarkets.com,https://www.facebook.com/100MileMarket/?fref=ts,,,https://www.instagram.com/100milemarket/,507 Harrison St,Kalamazoo,Kalamazoo,...,N,N,Y,Y,N,N,N,N,N,4/16/2016 12:37:56 PM
3,1009364,106 S. Main Street Farmers Market,http://thetownofsixmile.wordpress.com/,,,,,106 S. Main Street,Six Mile,,...,N,N,N,N,N,N,N,N,N,2013
4,1010691,10th Steet Community Farmers Market,,,,,http://agrimissouri.com/mo-grown/grodetail.php...,10th Street and Poplar,Lamar,Barton,...,N,N,Y,N,N,N,N,N,N,10/28/2014 9:49:46 AM
5,1002454,112st Madison Avenue,,,,,,112th Madison Avenue,New York,New York,...,N,N,N,N,N,N,N,N,N,3/1/2012 10:38:22 AM
6,1011100,12 South Farmers Market,http://www.12southfarmersmarket.com,12_South_Farmers_Market,@12southfrmsmkt,,@12southfrmsmkt,3000 Granny White Pike,Nashville,Davidson,...,Y,N,Y,N,Y,Y,Y,N,N,5/1/2015 10:40:56 AM
7,1009845,125th Street Fresh Connect Farmers' Market,http://www.125thStreetFarmersMarket.com,https://www.facebook.com/125thStreetFarmersMarket,https://twitter.com/FarmMarket125th,,Instagram--> 125thStreetFarmersMarket,"163 West 125th Street and Adam Clayton Powell,...",New York,New York,...,Y,N,Y,N,Y,N,N,N,N,4/7/2014 4:32:01 PM
8,1005586,12th & Brandywine Urban Farm Market,,https://www.facebook.com/pages/12th-Brandywine...,,,https://www.facebook.com/delawareurbanfarmcoal...,12th & Brandywine Streets,Wilmington,New Castle,...,N,N,Y,N,N,N,N,N,N,4/3/2014 3:43:31 PM
9,1008071,14&U Farmers' Market,,https://www.facebook.com/14UFarmersMarket,https://twitter.com/14UFarmersMkt,,,1400 U Street NW,Washington,District of Columbia,...,N,Y,Y,Y,Y,N,N,N,N,4/5/2014 1:49:04 PM


You'll notice that it has a large number of columns in it!

### `num_columns`

**Question 6.2.** The table property `num_columns` (example call: `tbl.num_columns`) produces the number of columns in a table.  Use it to find the number of columns in our farmers' markets dataset.

In [82]:
num_farmers_markets_columns = len(farmers_markets.columns)
print("The table has", num_farmers_markets_columns, "columns in it!")

The table has 59 columns in it!


In [83]:
_ = ok.grade('q6_2')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



Most of the columns are about particular products -- whether the market sells tofu, pet food, etc.  If we're not interested in that stuff, it just makes the table difficult to read.  This comes up more than you might think.

### `select`

In such situations, we can use the table method `select` to pare down the columns of a table.  It takes any number of arguments.  Each should be the name or index of a column in the table.  It returns a new table with only those columns in it.

For example, the value of `imdb.select("Year", "Decade")` is a table with only the years and decades of each movie in `imdb`.

**Question 6.3.** Use `select` to create a table with only the name, city, state, latitude ('y'), and longitude ('x') of each market.  Call that new table `farmers_markets_locations`.

In [92]:
farmers_markets_locations = farmers_markets[['MarketName','city','State','y','x']]
farmers_markets_locations

Unnamed: 0,MarketName,city,State,y,x
0,Caledonia Farmers Market Association - Danville,Danville,Vermont,44.411013,-72.140305
1,Stearns Homestead Farmers' Market,Parma,Ohio,41.375118,-81.728597
2,100 Mile Market,Kalamazoo,Michigan,42.296024,-85.574887
3,106 S. Main Street Farmers Market,Six Mile,South Carolina,34.804200,-82.818700
4,10th Steet Community Farmers Market,Lamar,Missouri,37.495628,-94.274619
5,112st Madison Avenue,New York,New York,40.793900,-73.949300
6,12 South Farmers Market,Nashville,Tennessee,36.118370,-86.790709
7,125th Street Fresh Connect Farmers' Market,New York,New York,40.808953,-73.948248
8,12th & Brandywine Urban Farm Market,Wilmington,Delaware,39.742117,-75.534460
9,14&U Farmers' Market,Washington,District of Columbia,38.916998,-77.032050


### `select` is not `column`!

The method `select` is **definitely not** the same as the method `column`.

`farmers_markets.column('y')` is an *array* of the latitudes of all the markets.  `farmers_markets.select('y')` is a table that happens to contain only 1 column, the latitudes of all the markets.
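The pandas cells in this notebook have an analogous distinction, sketched below under the assumption that a list of labels plays the role of `select` and a single label plays the role of `column`. The two rows are copied from the `farmers_markets_locations` output above; `markets`, `one_column_table`, and `latitudes` are names made up for the example.

```python
import numpy as np
import pandas as pd

# Two rows copied from the farmers_markets_locations output above.
markets = pd.DataFrame({
    "MarketName": ["Stearns Homestead Farmers' Market", "100 Mile Market"],
    "y": [41.375118, 42.296024],
    "x": [-81.728597, -85.574887],
})

one_column_table = markets[["y"]]  # list of labels: still a table (DataFrame)
latitudes = markets["y"]           # single label: just the values (Series)

print(type(one_column_table).__name__)
print(type(latitudes).__name__)
print(np.average(latitudes))
```

The table version keeps its column label and two-dimensional shape, while the single-column version is the one-dimensional sequence of numbers that array functions expect.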

**Question 6.4.** Below, we tried using the function `np.average` to find the average latitude ('y') and average longitude ('x') of the farmers' markets in the table, but we screwed something up.  Run the cell to see the (somewhat inscrutable) error message that results from calling `np.average` on a table.  Then, fix our code.

In [93]:
average_latitude = np.average(farmers_markets.y)
average_longitude = np.average(farmers_markets.x)
print("The average of US farmers' markets' coordinates is located at (", average_latitude, ",", average_longitude, ")")

The average of US farmers' markets' coordinates is located at ( 39.1864645235 , -90.9925808129 )


In [94]:
_ = ok.grade('q6_4')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



### `drop`

`drop` is the complement of `select`: it removes the columns you list and keeps all the others.
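A short sketch of that relationship, using the pandas calls that appear in this notebook's cells: dropping one column gives the same table as selecting all the remaining ones. The two rows are copied from the `farmers_markets` output above, and the names `markets`, `dropped`, and `selected` are invented for the illustration.

```python
import pandas as pd

# Two rows (three columns only) copied from the farmers_markets output above.
markets = pd.DataFrame({
    "FMID": [1012063, 1011871],
    "MarketName": ["Caledonia Farmers Market Association - Danville",
                   "Stearns Homestead Farmers' Market"],
    "city": ["Danville", "Parma"],
})

dropped = markets.drop(columns=["FMID"])    # name the column to remove
selected = markets[["MarketName", "city"]]  # name the columns to keep

print(list(dropped.columns))
print(dropped.equals(selected))
```

Which one to use is a matter of convenience: list whichever set of columns is shorter.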

**Question 6.5.** Suppose you just didn't want the "FMID" or "updateTime" columns in `farmers_markets`.  Create a table that's a copy of `farmers_markets` but doesn't include those columns.  Call that table `farmers_markets_without_fmid`.

In [95]:
farmers_markets_without_fmid = farmers_markets.drop(["FMID","updateTime"], axis = 1)
farmers_markets_without_fmid

Unnamed: 0,MarketName,Website,Facebook,Twitter,Youtube,OtherMedia,street,city,County,State,...,Wine,Coffee,Beans,Fruits,Grains,Juices,Mushrooms,PetFood,Tofu,WildHarvested
0,Caledonia Farmers Market Association - Danville,https://sites.google.com/site/caledoniafarmers...,https://www.facebook.com/Danville.VT.Farmers.M...,,,,,Danville,Caledonia,Vermont,...,N,Y,Y,Y,N,Y,N,Y,N,N
1,Stearns Homestead Farmers' Market,http://Stearnshomestead.com,,,,,6975 Ridge Road,Parma,Cuyahoga,Ohio,...,N,N,N,Y,N,N,N,Y,N,N
2,100 Mile Market,http://www.pfcmarkets.com,https://www.facebook.com/100MileMarket/?fref=ts,,,https://www.instagram.com/100milemarket/,507 Harrison St,Kalamazoo,Kalamazoo,Michigan,...,Y,N,N,Y,Y,N,N,N,N,N
3,106 S. Main Street Farmers Market,http://thetownofsixmile.wordpress.com/,,,,,106 S. Main Street,Six Mile,,South Carolina,...,N,N,N,N,N,N,N,N,N,N
4,10th Steet Community Farmers Market,,,,,http://agrimissouri.com/mo-grown/grodetail.php...,10th Street and Poplar,Lamar,Barton,Missouri,...,N,N,N,Y,N,N,N,N,N,N
5,112st Madison Avenue,,,,,,112th Madison Avenue,New York,New York,New York,...,N,N,N,N,N,N,N,N,N,N
6,12 South Farmers Market,http://www.12southfarmersmarket.com,12_South_Farmers_Market,@12southfrmsmkt,,@12southfrmsmkt,3000 Granny White Pike,Nashville,Davidson,Tennessee,...,N,Y,N,Y,N,Y,Y,Y,N,N
7,125th Street Fresh Connect Farmers' Market,http://www.125thStreetFarmersMarket.com,https://www.facebook.com/125thStreetFarmersMarket,https://twitter.com/FarmMarket125th,,Instagram--> 125thStreetFarmersMarket,"163 West 125th Street and Adam Clayton Powell,...",New York,New York,New York,...,Y,Y,N,Y,N,Y,N,N,N,N
8,12th & Brandywine Urban Farm Market,,https://www.facebook.com/pages/12th-Brandywine...,,,https://www.facebook.com/delawareurbanfarmcoal...,12th & Brandywine Streets,Wilmington,New Castle,Delaware,...,N,N,N,Y,N,N,N,N,N,N
9,14&U Farmers' Market,,https://www.facebook.com/14UFarmersMarket,https://twitter.com/14UFarmersMkt,,,1400 U Street NW,Washington,District of Columbia,District of Columbia,...,N,N,Y,Y,Y,Y,N,N,N,N


### `take`
Let's find the 5 northernmost farmers' markets in the US.  You already know how to sort by latitude ('y'), but we haven't seen how to get the first 5 rows of a table.  That's what `take` is for.

The table method `take` takes as its argument an array of numbers.  Each number should be the index of a row in the table.  It returns a new table with only those rows.

Most often you'll want to use `take` in conjunction with `np.arange` to take the first few rows of a table.
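In the pandas cells of this notebook, the closest counterpart to `take` is position-based indexing with `iloc`, which also accepts an array of row positions. The sketch below combines it with `np.arange` on five rows copied from the `farmers_markets_locations` output; treating `iloc` as the equivalent of `take` is an assumption of this example, and the variable names are made up.

```python
import numpy as np
import pandas as pd

# Five rows copied from the farmers_markets_locations output above.
markets = pd.DataFrame({
    "city": ["Danville", "Parma", "Kalamazoo", "Six Mile", "Lamar"],
    "y": [44.411013, 41.375118, 42.296024, 34.804200, 37.495628],
})

# Sort so the largest latitudes come first, then take row positions 0 and 1.
by_latitude = markets.sort_values("y", ascending=False)
northernmost_two = by_latitude.iloc[np.arange(2)]

print(northernmost_two)
```

The positions refer to the sorted table, so position 0 is the northernmost market regardless of where that row sat in the original.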

**Question 6.6.** Make a table of the 5 northernmost farmers' markets in `farmers_markets_locations`.  Call it `northern_markets`.  (It should include the same columns as `farmers_markets_locations`.)

In [100]:
northern_markets = farmers_markets_locations.sort_values('y', ascending = False).head()
northern_markets

Unnamed: 0,MarketName,city,State,y,x
7274,Tanana Valley Farmers Market,Fairbanks,Alaska,64.86275,-147.7811
2392,Ester Community Market,Ester,Alaska,64.8459,-148.01
2430,Fairbanks Downtown Market,Fairbanks,Alaska,64.844414,-147.719593
5182,Nenana Open Air Market,Nenana,Alaska,64.5566,-149.096
3496,Highway's End Farmers' Market,Delta Junction,Alaska,64.038462,-145.733115


**Question 6.7.** Make a table of the farmers' markets in Berkeley, California.  (It should include the same columns as `farmers_markets_locations`.)

In [101]:
berkeley_markets = farmers_markets_locations[farmers_markets_locations.city == 'Berkeley']
berkeley_markets

Unnamed: 0,MarketName,city,State,y,x
1938,Downtown Berkeley Farmers' Market,Berkeley,California,37.869692,-122.272792
5308,North Berkeley Farmers' Market,Berkeley,California,37.880235,-122.269242
6881,South Berkeley Farmers' Market,Berkeley,California,37.847761,-122.271922


Recognize any of them?

## 7. Summary

For your reference, here's a table of all the functions and methods we saw in this lab.

|Name|Example|Purpose|
|-|-|-|
|`Table`|`Table()`|Create an empty table, usually to extend with data|
|`Table.read_table`|`Table.read_table("my_data.csv")`|Create a table from a data file|
|`with_columns`|`tbl = Table().with_columns("N", np.arange(5), "2*N", np.arange(0, 10, 2))`|Create a copy of a table with more columns|
|`column`|`tbl.column("N")`|Create an array containing the elements of a column|
|`sort`|`tbl.sort("N")`|Create a copy of a table sorted by the values in a column|
|`where`|`tbl.where("N", are.above(2))`|Create a copy of a table with only the rows that match some *predicate*|
|`num_rows`|`tbl.num_rows`|Compute the number of rows in a table|
|`num_columns`|`tbl.num_columns`|Compute the number of columns in a table|
|`select`|`tbl.select("N")`|Create a copy of a table with only some of the columns|
|`drop`|`tbl.drop("2*N")`|Create a copy of a table without some of the columns|
|`take`|`tbl.take(np.arange(0, 6, 2))`|Create a copy of the table with only the rows whose indices are in the given array|

<br/>

Alright! You're finished with lab 3!  Be sure to...
- **run all the tests** (the next cell has a shortcut for that), 
- **Save and Checkpoint** from the `File` menu,
- **run the last cell to submit your work**,
- and ask one of the staff members to check you off.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
##import os
##_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]

In [None]:
##_ = ok.submit()