# Lab 4: Data Frames

Welcome to lab 4!  This week, we'll learn about *data frames*, which let us work with multiple arrays of data about the same things.

First, set up the tests and imports by running the cell below.

## Your Name: Caroline Petersen

In [1]:
import pandas as pd
import numpy as np

## 1. Introduction

For a collection of things in the world, an array is useful for describing a single attribute of each thing. For example, among the collection of US States, an array could describe the land area of each. Tables extend this idea by describing multiple attributes for each element of a collection.

In most data science applications, we have data about many entities, but we also have several kinds of data about each entity.

For example, in the cell below we have two arrays. The first one contains the world population in each year (as [estimated](http://www.census.gov/population/international/data/worldpop/table_population.php) by the US Census Bureau), and the second contains the years themselves (in order, so the first elements in the population and the years arrays correspond).

In [2]:
population_amounts = pd.read_csv("world_population.csv")["Population"].to_numpy()
years = np.arange(1950, 2020 + 1)
print("Population column:\n", population_amounts)
print("\n\nYears column:\n", years)

Population column:
 [2536431018 2584034227 2630861690 2677609061 2724846754 2773019915
 2822443254 2873306058 2925686680 2979576147 3034949715 3091843513
 3150420761 3211000946 3273978272 3339583510 3407922631 3478770104
 3551599436 3625680965 3700437042 3775760030 3851650588 3927780519
 4003794178 4079480474 4154666827 4229505919 4304533599 4380506185
 4458003466 4536996619 4617386526 4699569187 4784011517 4870921666
 4960568000 5052521998 5145425994 5237441434 5327231041 5414289383
 5498919893 5581597598 5663150428 5744212930 5824891931 5905045647
 5984794075 6064239033 6143493806 6222626531 6301773172 6381185141
 6461159391 6541906956 6623517917 6705946643 6789088672 6872766988
 6956823588 7041194168 7125827957 7210582041 7295290759 7379796967
 7464021934 7547858900 7631091113 7713468205 7794798729]


Years column:
 [1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963
 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977
 1978 1979 1980 1981 1982 1

Suppose we want to answer this question:

> When did world population cross 6 billion?

You could technically answer this question just from staring at the arrays, but it's a bit convoluted, since you would have to count the position where the population first crossed 6 billion, then find the corresponding element in the years array. In cases like these, it might be easier to put the data into a *data frame*, a 2-dimensional type of dataset. 

The expression below creates a data frame with two columns:
- *Population*, with the content of `population_amounts`,
- *Year*, with the content of `years`.

The strings `"Year"` and `"Population"` are column labels that we have chosen. The names `population_amounts` and `years` were assigned above to two arrays of the same length.

In [3]:
population = pd.DataFrame({"Year": years, "Population": population_amounts})
population

Unnamed: 0,Year,Population
0,1950,2536431018
1,1951,2584034227
2,1952,2630861690
3,1953,2677609061
4,1954,2724846754
...,...,...
66,2016,7464021934
67,2017,7547858900
68,2018,7631091113
69,2019,7713468205


Now the data are all together in a single table! It's much easier to parse this data: if you need to know what the population was in 1959, for example, you can tell at a single glance. We'll revisit this table later.
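
As a small sketch of that kind of lookup, the cell below rebuilds only the first ten rows of the table (values copied from the arrays printed above) and pulls out the 1959 population.  The names `first_decade` and `pop_1959` are made up for this illustration, and the cell uses `loc` and `iloc`, which are covered in later sections.

```python
import numpy as np
import pandas as pd

# First ten values of the population array printed above (1950 to 1959).
first_decade = pd.DataFrame({
    "Year": np.arange(1950, 1960),
    "Population": [2536431018, 2584034227, 2630861690, 2677609061, 2724846754,
                   2773019915, 2822443254, 2873306058, 2925686680, 2979576147],
})

# Keep the rows where Year is 1959, take the Population column,
# and read off its single remaining value.
pop_1959 = first_decade.loc[first_decade["Year"] == 1959]["Population"].iloc[0]
print(pop_1959)
```

With arrays alone, the same lookup would require finding the position of 1959 in one array and then indexing the other array at that position.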

## 2. Creating Data Frames (Tables)

**Question 2.1:** In the cell below, we've created 2 arrays. Using the steps above, assign `top_10_movies` to a data frame that has two columns called "Name" and "Rating", which hold `top_10_movie_names` and `top_10_movie_ratings` respectively.

In [4]:
top_10_movie_ratings = np.array([9.2, 9.2, 9., 8.9, 8.9, 8.9, 8.9, 8.9, 8.9, 8.8])
top_10_movie_names = np.array([
        'The Shawshank Redemption (1994)',
        'The Godfather (1972)',
        'The Godfather: Part II (1974)',
        'Pulp Fiction (1994)',
        "Schindler's List (1993)",
        'The Lord of the Rings: The Return of the King (2003)',
        '12 Angry Men (1957)',
        'The Dark Knight (2008)',
        'Il buono, il brutto, il cattivo (1966)',
        'The Lord of the Rings: The Fellowship of the Ring (2001)'])

top_10_movies = pd.DataFrame({"Name": top_10_movie_names, "Rating": top_10_movie_ratings})
top_10_movies

Unnamed: 0,Name,Rating
0,The Shawshank Redemption (1994),9.2
1,The Godfather (1972),9.2
2,The Godfather: Part II (1974),9.0
3,Pulp Fiction (1994),8.9
4,Schindler's List (1993),8.9
5,The Lord of the Rings: The Return of the King ...,8.9
6,12 Angry Men (1957),8.9
7,The Dark Knight (2008),8.9
8,"Il buono, il brutto, il cattivo (1966)",8.9
9,The Lord of the Rings: The Fellowship of the R...,8.8


### Loading a data frame from a file

In most cases, we aren't going to go through the trouble of typing in all the data manually. Instead, we can use the `read_csv` function from pandas.

In its simplest use, `pd.read_csv` takes one argument, a path (a string) to a CSV ("comma-separated values") data file, and returns a data frame.  There are many formats for data files, but CSV is one of the most common.
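
To see what `read_csv` does without needing a file on disk, here is a minimal sketch that reads CSV text held in memory.  The names `csv_text` and `toy_movies` are hypothetical, the two rows are borrowed from the table above, and `io.StringIO` is used only as a stand-in for a real file.

```python
import io
import pandas as pd

# CSV text held in memory; the first line holds the column labels.
csv_text = "Name,Rating\nThe Godfather (1972),9.2\nPulp Fiction (1994),8.9\n"

# read_csv accepts a path or a file-like object, so StringIO stands in for a file here.
toy_movies = pd.read_csv(io.StringIO(csv_text))
print(toy_movies)
print(toy_movies.shape)
```

Each line of the text becomes one row, and the commas mark where one column ends and the next begins.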

---

**Question 2.2:** The file `imdb.csv` contains a table of information about the 250 highest-rated movies on IMDb.  Load it as a data frame called `imdb`.

In [5]:
imdb = pd.read_csv("imdb.csv")
imdb

Unnamed: 0,Votes,Rating,Title,Year,Decade
0,88355,8.4,M,1931,1930
1,132823,8.3,Singin' in the Rain,1952,1950
2,74178,8.3,All About Eve,1950,1950
3,635139,8.6,Léon,1994,1990
4,145514,8.2,The Elephant Man,1980,1980
...,...,...,...,...,...
245,1078416,8.7,Forrest Gump,1994,1990
246,31003,8.1,Le salaire de la peur,1953,1950
247,167076,8.2,3 Idiots,2009,2000
248,91689,8.1,Network,1976,1970


---

Notice that the result shows `250 rows x 5 columns`, but pandas prints only the top and bottom five rows.


Where did `imdb.csv` come from? That is a file that should be in the same folder as this notebook.

If you open up the `imdb.csv` file in a *text editor*, what do you notice? The `.csv` filename ending says that this file is in the [CSV (comma-separated value) format](http://edoceo.com/utilitas/csv-file-format).

## 3. Analyzing datasets

With just a few data frame methods, we can answer some interesting questions about the IMDb dataset.

If we want just the ratings of the movies, we can get an array that contains the data in that column:

In [6]:
imdb["Rating"].to_numpy()

array([8.4, 8.3, 8.3, 8.6, 8.2, 8.3, 8.1, 8.3, 8.2, 8. , 8.1, 8.2, 8.3,
       8.3, 8.1, 8.4, 8.5, 8.2, 8.1, 8.4, 8.1, 8.1, 9.2, 8. , 8.2, 8.1,
       8.2, 8.5, 8. , 8.3, 8.1, 8. , 8. , 8.3, 8.1, 8. , 8. , 8.3, 8.4,
       8.1, 8.1, 8.5, 8.5, 8. , 8.3, 8.1, 8. , 8.6, 8.5, 8.3, 8.3, 8. ,
       8.2, 9.2, 8.2, 8.5, 8. , 8.9, 8.4, 8.2, 8.1, 8.3, 8.1, 8.1, 8.1,
       8.3, 8.2, 8.3, 8.7, 8.3, 8.6, 8. , 8.1, 8.2, 8.5, 8.3, 8.9, 8. ,
       8.6, 8.3, 8.1, 8.7, 8.4, 8.1, 8.4, 8. , 8.5, 8.8, 8.2, 8.2, 8.5,
       9. , 8. , 8. , 8.3, 8.4, 8.6, 8.5, 8.7, 8.4, 8.1, 8.1, 8.1, 8.7,
       8.4, 8.9, 8.1, 8.2, 8. , 8.5, 8.5, 8. , 8. , 8.4, 8.1, 8.1, 8. ,
       8. , 8.3, 8.1, 8. , 8.3, 8. , 8. , 8. , 8. , 8. , 8. , 8. , 8.7,
       8.3, 8. , 8. , 8.5, 8. , 8.1, 8.1, 8.1, 8.3, 8.2, 8.3, 8.9, 8.2,
       8.2, 8. , 8.3, 8.2, 8.9, 8.5, 8.5, 8.1, 8.1, 8.5, 8.3, 8. , 8.2,
       8.7, 8.3, 8.5, 8.1, 8.3, 8.2, 8.4, 8.1, 8.1, 8.1, 8. , 8.2, 8. ,
       8.6, 8.3, 8.2, 8. , 8.3, 8. , 8.2, 8. , 8.2, 8.8, 8.1, 8.

The value of that expression is an array, exactly the same kind of thing you'd get if you typed in `np.array([8.4, 8.3, 8.3, [etc]])`.

If we do not call the method `to_numpy`, we get a *series*, which is, roughly speaking, an array with some extra data: the *index* (same as the data frame) and a name (the column name).

In [7]:
imdb["Rating"]

0      8.4
1      8.3
2      8.3
3      8.6
4      8.2
      ... 
245    8.7
246    8.1
247    8.2
248    8.1
249    8.3
Name: Rating, Length: 250, dtype: float64
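
The difference between a series and an array can be checked directly.  The sketch below uses a three-row toy data frame (the name `toy` and the other variable names are made up for this example; the rows are taken from the `imdb` table).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Title": ["M", "Network", "Akira"], "Rating": [8.4, 8.1, 8.0]})

ratings_series = toy["Rating"]            # a Series: values plus index and name
ratings_array = toy["Rating"].to_numpy()  # a plain NumPy array: values only

print(type(ratings_series).__name__, type(ratings_array).__name__)
print(ratings_series.name)           # the column label
print(list(ratings_series.index))    # the same index as the data frame
```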

Series come with many methods themselves.  For instance, if I wanted to know the lowest rating for a movie in the data set, I could use the `min()` method:

In [8]:
imdb["Rating"].min()

8.0

So, the lowest rating in the data set was 8.0 (out of 10).

Alternatively, you could use NumPy's `np.min`, although the former is usually preferred.

In [9]:
np.min(imdb["Rating"])

8.0

---

**Question 3.1:** Find the rating of the highest-rated movie in the dataset.

*Hint:* You can probably guess the method you need here.

In [12]:
highest_rating = imdb["Rating"].max()
highest_rating

9.2

---

That's not very useful, though.  You'd probably want to know the *name* of the movie whose rating you found!  In the previous lab, you learned how to do that with arrays.  You can use a similar idea here:

In [13]:
imdb.loc[imdb["Rating"].argmax()]

Votes           1027398
Rating              9.2
Title     The Godfather
Year               1972
Decade             1970
Name: 22, dtype: object

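(Note the `.loc`.  We've discussed it in class, and you will have some practice with it below.  Also note that `argmax` returns a *position*, while `.loc` looks rows up by index *label*; the two agree here only because `imdb` still has its default index 0, 1, 2, and so on.)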

But remember that this would give you only *one* answer.  If there were more than one movie with the same highest rating, you would not see the others.
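
One way to see every tied row, sketched here on three rows copied from the dataset, is to compare the column with its own maximum and keep the rows where the comparison is true.  The names `top`, `is_top`, and `tied_for_first` are made up for this illustration, and filtering with `loc` is covered in Section 4.

```python
import pandas as pd

top = pd.DataFrame({
    "Title": ["The Godfather", "The Shawshank Redemption", "The Godfather: Part II"],
    "Rating": [9.2, 9.2, 9.0],
})

# Compare every rating with the maximum; the result is True for each tied row.
is_top = top["Rating"] == top["Rating"].max()
tied_for_first = top.loc[is_top]
print(tied_for_first)
```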

An alternative would be to sort the entire table by rating, which ensures that the ratings and titles will stay together.

In [14]:
imdb.sort_values("Rating")

Unnamed: 0,Votes,Rating,Title,Year,Decade
124,91652,8.0,Akira,1988,1980
176,124671,8.0,Per un pugno di dollari,1964,1960
93,527349,8.0,Guardians of the Galaxy,2014,2010
92,49135,8.0,The Man Who Shot Liberty Valance,1962,1960
180,39447,8.0,Underground,1995,1990
...,...,...,...,...,...
147,761224,8.9,Schindler's List,1993,1990
105,384187,8.9,12 Angry Men,1957,1950
91,692753,9.0,The Godfather: Part II,1974,1970
53,1498733,9.2,The Shawshank Redemption,1994,1990


This does help, but it feels strange to have the highest rating at the bottom.  Alternatively, we can sort it in reverse order:

In [15]:
imdb.sort_values("Rating", ascending=False)

Unnamed: 0,Votes,Rating,Title,Year,Decade
22,1027398,9.2,The Godfather,1972,1970
53,1498733,9.2,The Shawshank Redemption,1994,1990
91,692753,9.0,The Godfather: Part II,1974,1970
105,384187,8.9,12 Angry Men,1957,1950
57,447875,8.9,"Il buono, il brutto, il cattivo (1966)",1966,1960
...,...,...,...,...,...
168,500576,8.0,"Monsters, Inc. (2001)",2001,2000
166,59578,8.0,The Big Sleep,1946,1940
46,427099,8.0,X-Men: Days of Future Past,2014,2010
51,87437,8.0,Roman Holiday,1953,1950


(The `ascending=False` bit is called an *optional argument*. It has a default value of `True`, so when you explicitly tell the function `ascending=False`, then the function will sort in descending order.)

So indeed there are actually 2 highest-rated movies in the dataset: *The Shawshank Redemption* and *The Godfather*.

Some details about `sort_values`:

1. The first argument to `sort_values` is the name of a column to sort by.
2. If the column has strings in it, `sort_values` will sort alphabetically; if the column has numbers, it will sort numerically.
3. The value of `imdb.sort_values("Rating")` is a *copy of `imdb`*; the `imdb` data frame doesn't get modified. For example, if we called `imdb.sort_values("Rating")`, then running `imdb` by itself would still return the unsorted data frame.
4. Rows always stick together when a data frame is sorted.  It wouldn't make sense to sort just one column and leave the other columns alone.  For example, in this case, if we sorted just the "Rating" column, the movies would all end up with the wrong ratings.
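
The points above can be checked on a small example.  This is only a sketch on a three-row toy data frame (the names `toy`, `by_rating`, and `by_title` are hypothetical; the rows come from `imdb`).

```python
import pandas as pd

toy = pd.DataFrame({"Title": ["Network", "M", "Akira"], "Rating": [8.1, 8.4, 8.0]})

by_rating = toy.sort_values("Rating")   # numeric column: sorted numerically
by_title = toy.sort_values("Title")     # string column: sorted alphabetically

print(list(by_rating["Title"]))   # titles moved together with their ratings
print(list(by_title["Title"]))
print(list(toy["Title"]))         # the original data frame is unchanged
```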

---

**Question 3.2:** Create a version of `imdb` that's sorted chronologically, with the earliest movies first.  Call it `imdb_by_year`.

In [16]:
imdb_by_year = imdb.sort_values("Year", ascending=True)
imdb_by_year

Unnamed: 0,Votes,Rating,Title,Year,Decade
173,55784,8.3,The Kid,1921,1920
205,58506,8.2,The Gold Rush,1925,1920
146,46332,8.2,The General,1926,1920
49,98794,8.3,Metropolis,1927,1920
0,88355,8.4,M,1931,1930
...,...,...,...,...,...
100,369141,8.1,The Grand Budapest Hotel,2014,2010
9,46987,8.0,Relatos salvajes,2014,2010
70,689541,8.6,Interstellar,2014,2010
233,262425,8.3,Mad Max: Fury Road,2015,2010


**Question 3.3:** What's the **title** of the earliest movie in the dataset?  You could just look this up from the output of the previous cell.  Instead, write Python code to find out.  (Note that I do not want a series, i.e., a row of the data frame.  I want just the string with the title of the movie.)

*Hint:* There are a few different ways of doing it.  You could extract the column `Title` from `imdb_by_year` and look up its first element.  For this last part, you can convert it to an array and then look at the first element (as we already learned), or you could use `.iloc`.  (Note that `.loc` would not work here, as you can see that the index is not in order anymore after sorting.  We will talk more about the `.loc` and `.iloc` methods below, so if you still feel unfamiliar with them, converting to an array might be easier.)

In [19]:
earliest_movie_title = imdb_by_year['Title'].iloc[0]
earliest_movie_title

'The Kid'

## 4. Finding pieces of a dataset

Suppose you're interested in movies from the 1940s.  Sorting the table by year doesn't help you, because the 1940s are in the middle of the dataset.

Instead, we use the data frame method `loc`.

In [20]:
forties = imdb.loc[imdb["Decade"] == 1940]
forties

Unnamed: 0,Votes,Rating,Title,Year,Decade
21,55793,8.1,The Grapes of Wrath,1940,1940
50,86715,8.3,Double Indemnity,1944,1940
72,101754,8.1,The Maltese Falcon,1941,1940
75,71003,8.3,The Treasure of the Sierra Madre,1948,1940
102,35983,8.1,The Best Years of Our Lives,1946,1940
118,81887,8.3,Ladri di biciclette,1948,1940
120,66622,8.0,Notorious,1946,1940
158,350551,8.5,Casablanca,1942,1940
166,59578,8.0,The Big Sleep,1946,1940
167,78216,8.2,Rebecca,1940,1940


Alternatively, you can also use the `query` method.

In [21]:
forties = imdb.query("Decade == 1940")
forties

Unnamed: 0,Votes,Rating,Title,Year,Decade
21,55793,8.1,The Grapes of Wrath,1940,1940
50,86715,8.3,Double Indemnity,1944,1940
72,101754,8.1,The Maltese Falcon,1941,1940
75,71003,8.3,The Treasure of the Sierra Madre,1948,1940
102,35983,8.1,The Best Years of Our Lives,1946,1940
118,81887,8.3,Ladri di biciclette,1948,1940
120,66622,8.0,Notorious,1946,1940
158,350551,8.5,Casablanca,1942,1940
166,59578,8.0,The Big Sleep,1946,1940
167,78216,8.2,Rebecca,1940,1940


(The result should be exactly the same.)

Ignore the syntax for the moment.  Instead, try to read that line like this:

> Assign the name **`forties`** to a data frame whose rows are the rows in **`imdb`** for which the column **`imdb["Decade"]`** has values equal to **`1940`**.

The `loc` method is more powerful, so it is worth learning, while the `query` method is shorter to type.  In `query`, the argument is a string, and instead of entering columns (such as `imdb["Decade"]` in our example) as we must do with `loc`, we can simply use the column label, *without quotes*, to represent the column (as in `Decade` in our example).

A limitation of `query` is that column labels containing spaces cannot be written directly; inside the query string, such a label must be wrapped in backticks.

Note that while `loc` uses square brackets `[ ]`, `query` uses parentheses `( )`.
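
To confirm that the two approaches select the same rows, here is a minimal sketch on a toy data frame (the names `toy`, `with_loc`, and `with_query` are made up; the rows are taken from `imdb`).  It uses the data frame method `equals`, which has not been introduced in this lab, to compare the two results.

```python
import pandas as pd

toy = pd.DataFrame({
    "Title": ["Rebecca", "Casablanca", "Network"],
    "Year": [1940, 1942, 1976],
    "Decade": [1940, 1940, 1970],
})

with_loc = toy.loc[toy["Decade"] == 1940]   # square brackets, column written as toy["Decade"]
with_query = toy.query("Decade == 1940")    # parentheses, column label inside a string

# equals checks that both results have the same rows, columns, and values.
print(with_loc.equals(with_query))
```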

---

**Question 4.1:** Compute the average rating of movies from the 1940s.

*Hint:* Series have a `mean` method that gives the mean/average.

In [22]:
average_rating_in_forties = forties["Rating"].mean()
average_rating_in_forties

8.257142857142856

**Question 4.2:** Create a data frame called `ninety_nine` containing the movies that came out in the year 1999.  You can use either `loc` or `query`.

In [24]:
ninety_nine = imdb.loc[imdb["Year"]==1999]
ninety_nine

Unnamed: 0,Votes,Rating,Title,Year,Decade
87,1177098,8.8,Fight Club,1999,1990
104,735056,8.4,American Beauty,1999,1990
115,630994,8.1,The Sixth Sense,1999,1990
129,1073043,8.7,The Matrix,1999,1990
149,672878,8.5,The Green Mile,1999,1990


---

So far we've only been finding where a column is *exactly* equal to a certain value. However, there are many other comparisons.  Here are a few:

|Comparison|`loc` Example|`query` Example|Result|
|-|-|-|-|
|`==`|`data_frame["Column"] == 50`|`"Column == 50"`|Find rows with values equal to 50|
|`!=`|`data_frame["Column"] != 50`|`"Column != 50"`|Find rows with values not equal to 50|
|`>`|`data_frame["Column"] > 50`|`"Column > 50"`|Find rows with values above (and not equal to) 50|
|`>=`|`data_frame["Column"] >= 50`|`"Column >= 50"`|Find rows with values above 50 or equal to 50|
|`<`|`data_frame["Column"] < 50`|`"Column < 50"`|Find rows with values below 50|
|`<=`|`data_frame["Column"] <= 50`|`"Column <= 50"`|Find rows with values below 50 or equal to 50|
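
It may help to look at what a comparison produces on its own.  In the sketch below (toy data taken from `imdb`; the names `toy`, `mask`, and `high` are hypothetical), the comparison gives a series of `True`/`False` values, one per row, and `loc` keeps the rows marked `True`.

```python
import pandas as pd

toy = pd.DataFrame({"Title": ["M", "Network", "Akira"], "Rating": [8.4, 8.1, 8.0]})

mask = toy["Rating"] > 8.0   # a Series of True/False, one entry per row
print(list(mask))

high = toy.loc[mask]         # loc keeps only the rows where the mask is True
print(list(high["Title"]))
```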


With `loc`, we can combine the comparisons above using `&` for *and* and `|` for *or*, **wrapping each condition in parentheses**.  On the other hand, `query` uses the more usual Python syntax, with the words `and` and `or`.

For instance, if we want movies with ratings between 8.1 and 8.6 (including neither 8.1 nor 8.6), we can do:
```python
imdb.loc[(imdb["Rating"] > 8.1) & (imdb["Rating"] < 8.6)]
```
or
```python
imdb.query("Rating > 8.1 and Rating < 8.6")
```

Although in this case we could also use the method `between`:
```python
imdb.loc[imdb["Rating"].between(8.1, 8.6, inclusive="neither")]
```
or
```python
imdb.query("8.1 < Rating < 8.6")
```

If we want movies that *either* have a rating above 9 or came out before 1950:
```python
imdb.loc[(imdb["Rating"] > 9) | (imdb["Year"] < 1950)]
```
or
```python
imdb.query("Rating > 9 or Year < 1950")
```

---

**Question 4.3:** Using `loc` and one of the comparisons from the table above, find all the movies with a rating higher than 8.5.  Put their data in a data frame called `really_highly_rated`.

In [25]:
really_highly_rated = imdb.loc[imdb["Rating"]>8.5]
really_highly_rated

Unnamed: 0,Votes,Rating,Title,Year,Decade
3,635139,8.6,Léon,1994,1990
22,1027398,9.2,The Godfather,1972,1970
47,767224,8.6,The Silence of the Lambs,1991,1990
53,1498733,9.2,The Shawshank Redemption,1994,1990
57,447875,8.9,"Il buono, il brutto, il cattivo (1966)",1966,1960
68,967389,8.7,The Lord of the Rings: The Two Towers,2002,2000
70,689541,8.6,Interstellar,2014,2010
76,1473049,8.9,The Dark Knight,2008,2000
78,192206,8.6,C'era una volta il West,1968,1960
81,1271949,8.7,Inception,2010,2010


**Question 4.4:** Find the average rating for movies released in the 20th century and the average rating for movies released in the 21st century for the movies in `imdb`.

*Hint*: Think of the steps you need to do (take the average, find the ratings, find movies released in 20th/21st centuries), and try to put them in an order that makes sense.  (Note that the 21st century started in the year 2001, not 2000!)

In [26]:
average_20th_century_rating = imdb.loc[imdb["Year"]<=2000]["Rating"].mean()
average_21st_century_rating = imdb.loc[imdb["Year"]>2000]["Rating"].mean()

print(f"Average 20th century rating: {average_20th_century_rating}.")
print(f"Average 21st century rating: {average_21st_century_rating}.")

Average 20th century rating: 8.280113636363636.
Average 21st century rating: 8.23108108108108.


The built-in function `len` tells you how many rows are in a data frame.

In [27]:
num_movies_in_dataset = len(imdb)
num_movies_in_dataset

250

**Question 4.5:** Use `len` (and arithmetic) to find the *proportion* of movies in the dataset that were released in the 20th century, and the proportion from the 21st century.

*Hint:* The *proportion* of movies released in the 20th century is the *number* of movies released in the 20th century, divided by the *total number* of movies (which we've saved in the `num_movies_in_dataset` variable above).

In [28]:
proportion_in_20th_century = len(imdb.loc[imdb["Year"] <= 2000]) / num_movies_in_dataset
proportion_in_21st_century = len(imdb.loc[imdb["Year"] > 2000]) / num_movies_in_dataset

print(f"Proportion in 20th century: {proportion_in_20th_century}.")
print(f"Proportion in 21st century: {proportion_in_21st_century}.")

Proportion in 20th century: 0.704.
Proportion in 21st century: 0.296.


**Question 4.6:** Find the number of movies that came out in *even* years.

*Hint:* The operator `%` computes the remainder when dividing by a number.  So `5 % 2` is 1 and `6 % 2` is 0.  A number is even if the remainder is 0 when you divide by 2.

*Hint 2:* `%` can be used on arrays (or series, i.e., columns of data frames), operating elementwise like `+` or `*`.  So `np.array([5, 6, 7]) % 2` is `array([1, 0, 1])`.

*Hint 3:* The method `loc` (or `query`) can be helpful again!
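
As a small illustration of these hints on a series rather than an array, consider the sketch below.  The names `toy_years` and `is_even` are made up for this example, and the four years are arbitrary.

```python
import pandas as pd

toy_years = pd.Series([1999, 2000, 2014, 2015])

is_even = (toy_years % 2) == 0   # elementwise remainder, then elementwise comparison
print(list(is_even))

# Filtering with the mask and counting the remaining entries gives the number of even years.
print(len(toy_years.loc[is_even]))
```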

In [31]:
num_even_year_movies = len(imdb.loc[(imdb["Year"]% 2)==0])
num_even_year_movies

127

**Question 4.7:** Check out the `population` table from the introduction to this lab.  Compute the year when the world population first went above 6 billion.

*Hint:* This requires a string of commands, like filtering, extracting columns, getting elements of arrays, etc.

In [32]:
year_population_crossed_6_billion = population.loc[population["Population"]>6e9]["Year"].to_numpy()[0]
year_population_crossed_6_billion

1999

## 5. Miscellanea

There are a few more data frame methods and procedures you'll need to fill out your toolbox.

The table `farmers_markets.csv` contains data on farmers' markets in the United States (data collected [by the USDA](https://apps.ams.usda.gov/FarmersMarketsExport/ExcelExport.aspx)).  Each row represents one such market.

**Question 5.1:** Load the dataset into a data frame.  Call it `farmers_markets`.

In [33]:
farmers_markets = pd.read_csv("farmers_markets.csv")
farmers_markets

Unnamed: 0,FMID,MarketName,Website,Facebook,Twitter,Youtube,OtherMedia,street,city,County,...,Coffee,Beans,Fruits,Grains,Juices,Mushrooms,PetFood,Tofu,WildHarvested,updateTime
0,1012063,Caledonia Farmers Market Association - Danville,https://sites.google.com/site/caledoniafarmers...,https://www.facebook.com/Danville.VT.Farmers.M...,,,,,Danville,Caledonia,...,Y,Y,Y,N,Y,N,Y,N,N,6/28/2016 12:10:09 PM
1,1011871,Stearns Homestead Farmers' Market,http://Stearnshomestead.com,,,,,6975 Ridge Road,Parma,Cuyahoga,...,N,N,Y,N,N,N,Y,N,N,4/9/2016 8:05:17 PM
2,1011878,100 Mile Market,http://www.pfcmarkets.com,https://www.facebook.com/100MileMarket/?fref=ts,,,https://www.instagram.com/100milemarket/,507 Harrison St,Kalamazoo,Kalamazoo,...,N,N,Y,Y,N,N,N,N,N,4/16/2016 12:37:56 PM
3,1009364,106 S. Main Street Farmers Market,http://thetownofsixmile.wordpress.com/,,,,,106 S. Main Street,Six Mile,,...,N,N,N,N,N,N,N,N,N,2013
4,1010691,10th Steet Community Farmers Market,,,,,http://agrimissouri.com/mo-grown/grodetail.php...,10th Street and Poplar,Lamar,Barton,...,N,N,Y,N,N,N,N,N,N,10/28/2014 9:49:46 AM
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8541,1004767,Zia Bernalillo Farmers' Market,http://www.eatfreshnm.org/,https://www.facebook.com/farmersmarketsnm?ref=hl,,,,335 S. Camino del Pueblo,Bernalillo,Sandoval,...,N,Y,Y,N,N,N,N,N,N,6/5/2014 2:40:25 PM
8542,1000778,Zimmerman Farmers Market,http://www.pzfarmersmarket.org,princeton-zimerman farmers market,,,,"Lions Park, Main Street",Zimmerman,Sherburne,...,N,N,Y,N,Y,N,N,N,N,6/27/2016 3:44:36 PM
8543,1002868,Zion Canyon Farmers Market,http://www.zionharvest.org,https://www.facebook.com/ZionCanyonFarmersMark...,,,,1212 Zion Park Blvd.,Springdale,Washington,...,Y,N,Y,Y,N,N,N,N,N,3/19/2015 1:43:50 PM
8544,1004686,Zionsville Farmers Market,http://www.zionsvillefarmersmarket.org,,,,,Hawthorne & Main Street,Zionsville,Boone,...,N,N,N,N,N,N,N,N,N,2009


You'll notice that it has a large number of columns in it!

### `shape`

**Question 5.2:** The data frame property `shape` (example call: `data_frame.shape`  -- note it has no parentheses at the end!) produces a pair containing the number of rows (first) and the number of columns (second) in a data frame.  Use it to find the number of columns in our farmers' markets dataset.

In [34]:
num_farmers_markets_columns = farmers_markets.shape[1]
print(f"The table has {num_farmers_markets_columns} columns in it!")

The table has 59 columns in it!
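
Since `shape` is a pair, both numbers can be unpacked at once.  Here is a minimal sketch on a toy data frame (the names `toy`, `num_rows`, and `num_columns` are hypothetical; the values are taken from the farmers' markets table).

```python
import pandas as pd

toy = pd.DataFrame({
    "city": ["Parma", "Lamar", "Danville"],
    "State": ["Ohio", "Missouri", "Vermont"],
})

num_rows, num_columns = toy.shape   # shape is a (rows, columns) pair
print(num_rows, num_columns)
print(len(toy), len(toy.columns))   # the same two numbers, found another way
```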


Most of the columns are about particular products -- whether the market sells tofu, pet food, etc.  If we're not interested in that stuff, it just makes the table difficult to read.  This comes up more than you might think.

### Select Columns

In such situations, we can use a list of columns of a data frame to create a new one.

For example, the value of `imdb[["Year", "Decade"]]` is a data frame with only the years and decades of each movie in `imdb`.
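
Note the double square brackets in that expression: the inner pair makes a list of labels.  The sketch below (toy data from `imdb`; all variable names are made up) shows that a single label in single brackets gives a series, while a list of labels gives a data frame, even when the list has only one label.

```python
import pandas as pd

toy = pd.DataFrame({"Title": ["M", "Network"], "Year": [1931, 1976], "Decade": [1930, 1970]})

one_column_series = toy["Year"]        # a string gives a Series
one_column_frame = toy[["Year"]]       # a list gives a data frame, even with one label
two_columns = toy[["Year", "Decade"]]

print(type(one_column_series).__name__)
print(type(one_column_frame).__name__)
print(list(two_columns.columns))
```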

**Question 5.3:** Create a data frame with only the name, city, state, latitude (`y`), and longitude (`x`) of each market.  Call that new data frame `farmers_markets_locations`.

*Hint:* You can use `farmers_markets.columns` (again, no parentheses) to find the actual labels of the columns.

In [35]:
farmers_markets_locations = farmers_markets[["MarketName", "city", "State", "y", "x"]]
farmers_markets_locations

Unnamed: 0,MarketName,city,State,y,x
0,Caledonia Farmers Market Association - Danville,Danville,Vermont,44.411013,-72.140305
1,Stearns Homestead Farmers' Market,Parma,Ohio,41.375118,-81.728597
2,100 Mile Market,Kalamazoo,Michigan,42.296024,-85.574887
3,106 S. Main Street Farmers Market,Six Mile,South Carolina,34.804200,-82.818700
4,10th Steet Community Farmers Market,Lamar,Missouri,37.495628,-94.274619
...,...,...,...,...,...
8541,Zia Bernalillo Farmers' Market,Bernalillo,New Mexico,35.313704,-106.546840
8542,Zimmerman Farmers Market,Zimmerman,Minnesota,45.437700,-93.585014
8543,Zion Canyon Farmers Market,Springdale,Utah,37.184508,-113.003311
8544,Zionsville Farmers Market,Zionsville,Indiana,39.949100,-86.261200


### `drop`

`drop` does the opposite of selecting columns: it removes the columns you list and keeps all the others.

For instance, to drop the "Decade" column from our `imdb` data frame, we could do
```python
imdb.drop(columns="Decade")
```
Note that the named argument is `columns` (plural!), even when dropping a single column!  If we want to drop more than one column, we put them in a list.  So, to drop the columns "Decade" and "Votes" from `imdb` we use
```python
imdb.drop(columns=["Decade", "Votes"])
```
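
Like `sort_values`, `drop` returns a new data frame and leaves the original alone.  The following is only a sketch on a toy data frame (the names `toy` and `without_decade` are hypothetical).

```python
import pandas as pd

toy = pd.DataFrame({"Title": ["M", "Network"], "Year": [1931, 1976], "Decade": [1930, 1970]})

without_decade = toy.drop(columns="Decade")
print(list(without_decade.columns))
print(list(toy.columns))   # toy still has all three columns
```

To keep the result, assign it to a name, as done with `without_decade` here.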

**Question 5.4:** Suppose you just didn't want the "FMID" or "updateTime" columns in `farmers_markets`.  Create a data frame that's a copy of `farmers_markets` but doesn't include those columns.  Call that data frame `farmers_markets_without_fmid`.

In [37]:
farmers_markets_without_fmid = farmers_markets.drop(columns=["FMID", "updateTime"])
farmers_markets_without_fmid

Unnamed: 0,MarketName,Website,Facebook,Twitter,Youtube,OtherMedia,street,city,County,State,...,Wine,Coffee,Beans,Fruits,Grains,Juices,Mushrooms,PetFood,Tofu,WildHarvested
0,Caledonia Farmers Market Association - Danville,https://sites.google.com/site/caledoniafarmers...,https://www.facebook.com/Danville.VT.Farmers.M...,,,,,Danville,Caledonia,Vermont,...,N,Y,Y,Y,N,Y,N,Y,N,N
1,Stearns Homestead Farmers' Market,http://Stearnshomestead.com,,,,,6975 Ridge Road,Parma,Cuyahoga,Ohio,...,N,N,N,Y,N,N,N,Y,N,N
2,100 Mile Market,http://www.pfcmarkets.com,https://www.facebook.com/100MileMarket/?fref=ts,,,https://www.instagram.com/100milemarket/,507 Harrison St,Kalamazoo,Kalamazoo,Michigan,...,Y,N,N,Y,Y,N,N,N,N,N
3,106 S. Main Street Farmers Market,http://thetownofsixmile.wordpress.com/,,,,,106 S. Main Street,Six Mile,,South Carolina,...,N,N,N,N,N,N,N,N,N,N
4,10th Steet Community Farmers Market,,,,,http://agrimissouri.com/mo-grown/grodetail.php...,10th Street and Poplar,Lamar,Barton,Missouri,...,N,N,N,Y,N,N,N,N,N,N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8541,Zia Bernalillo Farmers' Market,http://www.eatfreshnm.org/,https://www.facebook.com/farmersmarketsnm?ref=hl,,,,335 S. Camino del Pueblo,Bernalillo,Sandoval,New Mexico,...,N,N,Y,Y,N,N,N,N,N,N
8542,Zimmerman Farmers Market,http://www.pzfarmersmarket.org,princeton-zimerman farmers market,,,,"Lions Park, Main Street",Zimmerman,Sherburne,Minnesota,...,N,N,N,Y,N,Y,N,N,N,N
8543,Zion Canyon Farmers Market,http://www.zionharvest.org,https://www.facebook.com/ZionCanyonFarmersMark...,,,,1212 Zion Park Blvd.,Springdale,Washington,Utah,...,N,Y,N,Y,Y,N,N,N,N,N
8544,Zionsville Farmers Market,http://www.zionsvillefarmersmarket.org,,,,,Hawthorne & Main Street,Zionsville,Boone,Indiana,...,N,N,N,N,N,N,N,N,N,N


### `iloc`

Let's find the 5 northernmost farmers' markets in the US.  You already know how to sort by latitude (`y`), but we haven't seen how to get the first 5 rows of a data frame.  We can use `iloc` for that.

`iloc` locates rows (and columns) by *position* index, and not the data frame index, which might not even be numeric. **Important:** Remember that in Python indexing starts with 0, and not 1!

`iloc` takes a list of position indices and returns a data frame with only the rows at those positions.  For example,
```python
imdb.iloc[[0, 2, 7]]
```
returns a data frame with the first (position index 0), third (position index 2), and eighth (position index 7) rows of `imdb`.  (Note the double square brackets: the inner pair makes the list.)

We can also use *slicing*.  For instance,
```python
imdb.iloc[4:10]
```
returns a data frame with rows from (and including) the fifth (index 4) to (and including) the tenth (index **9**) of `imdb`.  *Note that the index 10 is not included in the slice!*
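
The difference between position and index label shows up as soon as a data frame is sorted.  Here is a minimal sketch on a four-row toy data frame (the names `toy` and `by_year` are made up; the rows come from `imdb`).

```python
import pandas as pd

toy = pd.DataFrame({"Title": ["Network", "M", "Akira", "Rebecca"],
                    "Year": [1976, 1931, 1988, 1940]})
by_year = toy.sort_values("Year")   # the index labels are now in the order 1, 3, 0, 2

print(by_year.iloc[0]["Title"])              # first row by position
print(by_year.loc[0]["Title"])               # row whose index label is 0
print(list(by_year.iloc[[0, 2]]["Title"]))   # rows at positions 0 and 2
print(list(by_year.iloc[1:3]["Title"]))      # rows at positions 1 and 2; position 3 is excluded
```

After sorting, `iloc[0]` gives the earliest movie, while `loc[0]` still gives whichever row carried the label 0 before the sort.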

**Question 5.5:** Make a data frame of the 5 northernmost farmers' markets in `farmers_markets_locations`.  Call it `northern_markets`.  (It should include the same columns as `farmers_markets_locations`.)

In [38]:
northern_markets = farmers_markets_locations.sort_values("y", ascending=False).iloc[:5]
northern_markets

Unnamed: 0,MarketName,city,State,y,x
7274,Tanana Valley Farmers Market,Fairbanks,Alaska,64.86275,-147.7811
2392,Ester Community Market,Ester,Alaska,64.8459,-148.01
2430,Fairbanks Downtown Market,Fairbanks,Alaska,64.844414,-147.719593
5182,Nenana Open Air Market,Nenana,Alaska,64.5566,-149.096
3496,Highway's End Farmers' Market,Delta Junction,Alaska,64.038462,-145.733115


**Question 5.6:** Find all farmers' markets in Knoxville, TN, and store them in a data frame called `knoxville_markets`.  (It should include the same columns as `farmers_markets_locations`.)

*Hint:* Careful.  Iowa also has a city named Knoxville.

In [39]:
knoxville_markets = farmers_markets_locations.loc[
    (farmers_markets_locations["city"]=="Knoxville")
    & (farmers_markets_locations["State"]=="Tennessee")
]
knoxville_markets

Unnamed: 0,MarketName,city,State,y,x
1897,Dixie Lee Farmers Market,Knoxville,Tennessee,35.867136,-84.209422
2180,East TN Farmers Association for Retail Marketi...,Knoxville,Tennessee,35.892849,-84.07023
2181,East TN Farmers Association for Retail Marketi...,Knoxville,Tennessee,35.947813,-83.961598
4593,Marble Springs Farmers Market,Knoxville,Tennessee,35.896958,-83.876818
5223,New Harvest Park Farmers Market,Knoxville,Tennessee,36.040025,-83.884558
7447,The Market Square Farmers' Market,Knoxville,Tennessee,35.964911,-83.919967
7794,UT Farmers Market,Knoxville,Tennessee,35.944173,-83.938079


## 6. Submission 

Great job!  **To submit this lab**, please download your notebook as a .ipynb file and submit in Canvas under Lab 4 (under Assignments). To export, go to the toolbar at the top of this page, click File > Download. Then, go to our class's Canvas page and upload your file under "Lab 4".

For easy identification, **please add your surname to the file**, as in: `lab_04_DS201_Name.ipynb`.