# Lab 1: Arrays and DataFrames

## Due Thursday, October 12th at 11:59PM

Welcome to Lab 1!  This week, we'll learn about arrays, which allow us to store sequences of data, and DataFrames, which let us work with multiple arrays of data about the same things. These topics are covered in [BPD 7-10](https://notes.dsc10.com/02-data_sets/arrays.html) in the `babypandas` notes. You should complete this entire lab so that all tests pass and submit it to Gradescope by 11:59PM on the due date.


**Please do not use for-loops for any questions in this lab.** If you don't know what a for-loop is, don't worry – we haven't covered them yet. But if you do know what they are and are wondering why it's not OK to use them, it is because loops in Python are slow, and looping over arrays and DataFrames should usually be avoided.

First, set up the imports we'll need by running the cell below.

In [None]:
import numpy as np
import babypandas as bpd

import otter
grader = otter.Notebook()

# 1. Arrays

Computers are most useful when you can use a small amount of code to *do the same action* to *many different things*.

For example, in the time it takes you to calculate the 18% tip on a restaurant bill, a laptop can calculate 18% tips for every restaurant bill paid by every human on Earth that day.  (That is, if you're pretty fast at doing arithmetic in your head!)

**Arrays** are how we put many values in one place so that we can operate on them as a group. For example, if `billions_of_numbers` is an array of numbers, the expression

```py
0.18 * billions_of_numbers
```

evaluates to a new array of numbers that's the result of multiplying each number in `billions_of_numbers` by 0.18 (18%).  Arrays are not limited to numbers; we can also put all the words in a book into an array of strings.
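
As a small sketch of the same idea, the cell below applies an 18% tip to a made-up array of three bills. The name `small_bills` and its values are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical bills, chosen only for illustration.
small_bills = np.array([10.0, 20.0, 50.0])

# Multiplying an array by a number multiplies every element by that number.
small_tips = 0.18 * small_bills
print(small_tips)
```

The result has the same number of elements as `small_bills`, one tip per bill.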

Concretely, an array is a **collection of values of the same type**, like a column in a spreadsheet (think Google Sheets or Microsoft Excel). 

<img src="data/sheet_array.png" width=600>

## 1.1. Making arrays
You can type in the data that goes in an array yourself, but that's not typically how we'll create arrays. Normally, we create arrays by loading them from an external source, like a data file.

First, though, let's learn how to do it the hard way. To begin, we can make a **list** of numbers by putting them within square brackets and separating them by commas:

In [None]:
my_list = [14, -2.26, 0.15]
my_list

Just like `int`, `float`, and `str`, the `list` is a data type provided by Python. Lists are very flexible and easy to work with, but they are *slowwww* 🐢.

As data scientists, we'll often be working with millions or even billions of numbers. For this, we need something faster than a `list`. Instead of lists, we will use *arrays*. 

Arrays are provided by a package called [NumPy](http://www.numpy.org/) (pronounced "NUM-pie" or, if you prefer to pronounce things incorrectly, "NUM-pee").  The package is called `numpy`, but it's standard to rename it `np` for brevity.  You can do that with:

    import numpy as np

Data scientists, as well as engineers and scientists of all kinds, use `numpy` frequently, and you'll see quite a bit of it if you're a data science major.

In [None]:
import numpy as np

Now, to create an array, call the function `np.array` with a list of numbers.  Run this cell to see an example:

In [None]:
np.array([14, -2.26, 0.15])

Note that you need the square brackets here. If you were to try running the following code, Python would yell at you because you forgot them:

```py
np.array(14, -2.26, 0.15)
```

<img src='data/brackets.png' width=400>

Arrays themselves are also values, just like numbers and strings.  That means you can assign them names or use them as arguments to functions.

**Question 1.1.1.** Make an array containing the numbers 2, 4, and 6, in that order.  Name it `even_numbers`.

In [None]:
even_numbers = ...
even_numbers

In [None]:
grader.check("q1_1_1")

**Question 1.1.2.** Make an array containing the numbers 0, -1, 1, $\pi$, and $e$, in that order.  Name it `odd_numbers`.

**_Hint:_**  $\pi$ and $e$ are available from the `np` module, which has already been imported. Just as you used `math.pi` to get $\pi$ in the last lab, you can use `np.pi` to get $\pi$ as well. **Do not** import the `math` module.

In [None]:
odd_numbers = ...
odd_numbers

In [None]:
grader.check("q1_1_2")

**Question 1.1.3.** Make an array containing the five strings `"Hello"`, `","`, `" "`, `"world"`, and `"!"`.  (The third one is a single space inside quotes.)  Name it `hello_world_components`.

*Note:* If you print `hello_world_components`, you'll notice some extra information in addition to its contents: `dtype='<U5'`.  That's just NumPy's extremely odd way of saying that the things in the array are strings. In case you're interested, the `<` describes the byte order used to store the characters, the `U` means that the strings are encoded in [Unicode](https://en.wikipedia.org/wiki/Unicode), and the `5` means all strings in the array are 5 characters long or less.

In [None]:
hello_world_components = ...
hello_world_components

In [None]:
grader.check("q1_1_3")

Very often in data science, we want to work with many numbers that are evenly spaced within some range.  NumPy provides a special function for this called `arange`.  The expression `np.arange(start, stop, space)` produces an array with all the numbers starting at `start` and counting up by `space`, stopping **before** `stop` is reached.

For example, the value of `np.arange(1, 8, 2)` is an array with elements 1, 3, 5, and 7 – it starts at 1 and counts up by 2, then stops before 8.  In other words, it makes the same array as `np.array([1, 3, 5, 7])`.

`np.arange(4, 9, 1)` is an array with elements 4, 5, 6, 7, and 8.  (It doesn't contain 9 because `np.arange` stops *before* the stop value is reached.)
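
Here is one more quick sketch of `np.arange`, using made-up start, stop, and spacing values. The name `tens` is just an illustrative choice.

```python
import numpy as np

# Start at 0, count up by 10, and stop before 50 is reached.
tens = np.arange(0, 50, 10)
print(tens)

# The stop value itself is never included, so 50 does not appear
# and the array has 5 elements: 0, 10, 20, 30, and 40.
print(len(tens))
```

If you want the stop value to show up in the result, the `stop` argument has to be a little bigger than the last number you want.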

**Question 1.1.4.** Use `np.arange` to create an array with all the multiples of 99 from 0 up to (**and including**) 9999.  (Its elements should be 0, 99, 198, 297, and so on.)

In [None]:
multiples_of_99 = ...
multiples_of_99

In [None]:
grader.check("q1_1_4")

##### Temperature readings 🌡️
NOAA (the US National Oceanic and Atmospheric Administration) operates weather stations that measure surface temperatures at different sites around the United States.  The hourly readings are [publicly available](http://www.ncdc.noaa.gov/qclcd/QCLCD?prior=N).

Suppose we download all the hourly data from the San Diego, California site for the month of September 2023. To analyze the data, we want to know when each reading was taken, but we find that the data doesn't include the timestamps of the readings (the time at which each one was taken).

However, we know the first reading was taken at the first instant of September 2023 (midnight on September 1st) and each subsequent reading was taken exactly 1 hour after the last.

**Question 1.1.5.** Create an array of the *time, in seconds, since the start of the month* at which each hourly reading was taken.  Name it `collection_times`.

**_Hints:_** 
- There are 30 days in September, which is equivalent to ($30 \times 24$) hours or ($30 \times 24 \times 60 \times 60$) seconds.  
- The `len` function works on arrays, too.  If your `collection_times` isn't passing the tests, check its length and make sure it has $30 \times 24$ elements, since readings are taken hourly for 30 days.

In [None]:
collection_times = ...
collection_times

In [None]:
grader.check("q1_1_5")

## 1.2. Working with single elements of arrays ("indexing")
Let's work with a more interesting dataset.  The next cell creates an array called `population` that includes estimated world populations in every year from **1950** to **2023**.  (The estimates come from the [International Database](https://www.census.gov/data-tools/demo/idb/#/table?menu=tableViz&quickReports=CUSTOM&CUSTOM_COLS=POP,TFR,CBR,E0,IMR,CDR,NMR&CCODE=**&show_countries=n&CCODE_SINGLE=**&TABLE_RANGE=1950,2023&TABLE_YEARS=1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023&TABLE_USE_RANGE=Y&TABLE_USE_YEARS=N&TABLE_STEP=1), maintained by the US Census Bureau.)

Rather than type in the data manually, we've loaded them from a file called `world_population_2023.csv`.  You'll learn how to read in data from files very soon.

In [None]:
# Don't worry too much about what goes on in this cell.
population = bpd.read_csv("data/world_population_2023.csv").get("Population").values
population

Here's how we get the first element of `population`, which is the world population in the first year in the dataset, 1950.

In [None]:
population[0]

Notice that we use square brackets here. The square brackets signal that we are *accessing* an element of the array. Square brackets in Python are kind of like subscripts in math.

The value of that expression is the number 2558023014 (around 2.5 billion), because that's the first thing in the array `population`.

Notice that we wrote `population[0]`, not `population[1]`, to get the first element.  This is a weird convention in programming. 0 is called the *index* of the first item.  It's the number of elements that appear *before* that item.  So 3 is the index of the 4th item.
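
To see the "number of elements before it" rule on something smaller, here is a sketch with a made-up array of four strings. The name `seasons` and its contents are assumptions for illustration only.

```python
import numpy as np

# A hypothetical array of four strings.
seasons = np.array(["winter", "spring", "summer", "fall"])

first_season = seasons[0]   # index 0: no elements come before it
fourth_season = seasons[3]  # index 3: three elements come before it
print(first_season, fourth_season)

# The largest valid index is always one less than the length.
print(len(seasons) - 1)
```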

Here are some more examples.  In the examples, we've given names to the things we get out of `population`.  Read and run each cell.

In [None]:
# The third element in the array is the population in 1952.
population_1952 = population[2]
population_1952

In [None]:
# The thirteenth element in the array is the population
# in 1962 (which is 1950 + 12).
population_1962 = population[12]
population_1962

In [None]:
# The 74th element in the array is the population in 2023.
population_2023 = population[73]
population_2023

In [None]:
# The array has only 74 elements, so this doesn't work.
# (There's no element with 74 other elements before it.)

population_2024 = population[74]
population_2024

# 🚨 After running this cell, please place a # before each line above to make sure that it doesn't run again.

**Question 1.2.1.** Set `population_1998` to the world population in 1998 by getting the appropriate element from `population`.

In [None]:
population_1998 = ...
population_1998

In [None]:
grader.check("q1_2_1")

## 1.3. Performing an operation on every element of an array
Arrays are primarily useful for doing the same operation many times, so we don't often have to access and work with single elements.

##### Logarithms
Here is one simple question we might ask about world population:

> How big was the population in *orders of magnitude* in each year?

The logarithm function is one way of measuring how big a number is. The logarithm (base 10) of a number increases by 1 every time we multiply the number by 10. It's like a measure of how many decimal digits the number has, or how big it is in orders of magnitude.

We could try to answer our question like this, using the `log10` function from NumPy on each element of the `population` array:

In [None]:
population_1950_magnitude = np.log10(population[0])
population_1951_magnitude = np.log10(population[1])
population_1952_magnitude = np.log10(population[2])
population_1953_magnitude = np.log10(population[3])

# ... and so on!

But this is tedious and repetitive. There must be a better way!

It turns out that NumPy's `log10` is pretty powerful. Not only can it take in a single number (like `population[0]`) as input and return the logarithm of a single number, but it can **also** take in an entire array of numbers and return the logarithm of each element in that array!

If you give NumPy's `log10` an array as input, it will return an array of the same length, where the first element of the result is the logarithm of the first element of the input, the second element of the result is the logarithm of the second element of the input, and so on.

<img src="data/array_logarithm.jpg">

This is called *elementwise* application of the function, since it operates separately on each element of the array it's called on.  
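
As a minimal sketch of elementwise application, the cell below calls `np.log10` on a small made-up array. The name `amounts` and its values are assumptions, picked so the logarithms come out to whole numbers.

```python
import numpy as np

# Hypothetical values, each ten times the one before it.
amounts = np.array([10, 100, 1000, 10000])

# log10 is applied to each element separately, so the result
# has the same length as the input.
magnitudes = np.log10(amounts)
print(magnitudes)
```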

**Question 1.3.1.** Use NumPy's `log10` function to compute the logarithms of the world population in every year.  Give the result (an array of 74 numbers) the name `population_magnitudes`.  Your code should be very short.

In [None]:
population_magnitudes = ...
population_magnitudes

In [None]:
grader.check("q1_3_1")

##### Arithmetic
Arithmetic also works elementwise on arrays.  For example, you can divide all the population numbers by 1 billion to get numbers in billions:

In [None]:
population_in_billions = population / 1000000000
population_in_billions

You can do the same with addition, subtraction, multiplication, and exponentiation (`**`). For example, you can calculate a twenty percent tip on several restaurant bills at once:

In [None]:
restaurant_bills = np.array([20.12, 39.90, 31.01])
print("Restaurant bills:\t", restaurant_bills)
tips = 0.2 * restaurant_bills 
print("Tips:\t\t\t", tips)

<img src="data/array_multiplication.jpg">
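
The other arithmetic operators behave the same way. The sketch below tries addition, subtraction, and exponentiation on a made-up array; `side_lengths` is a hypothetical name used only here.

```python
import numpy as np

side_lengths = np.array([1, 2, 3, 4])  # hypothetical values

print(side_lengths + 10)   # adds 10 to every element
print(side_lengths - 1)    # subtracts 1 from every element

squares = side_lengths ** 2
print(squares)             # squares every element
```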

**Question 1.3.2.** Suppose the total charge at a restaurant is the original bill plus the tip (20%).  That means we can multiply the original bill by 1.2 to get the total charge.  Compute the total charge for each bill in `restaurant_bills` and give the resulting array the name `total_charges`.

In [None]:
total_charges = ...
total_charges

In [None]:
grader.check("q1_3_2")

Let's read in some data to use in the next question.

In [None]:
more_restaurant_bills = bpd.read_csv("data/more_restaurant_bills.csv").get("Bill").values

**Question 1.3.3.** The array `more_restaurant_bills` contains 100,000 bills!  Compute the total charge for each one, assuming again a twenty percent tip, and give the resulting array the name `more_total_charges`.

In [None]:
more_total_charges = ...
more_total_charges

In [None]:
grader.check("q1_3_3")

The function `sum` takes a single array of numbers as its argument.  It returns the sum of all the numbers in that array (so it returns a single number, not an array).
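
For instance, here is a small sketch of `sum` on a made-up array of three prices (the name `example_prices` is an assumption for illustration):

```python
import numpy as np

example_prices = np.array([1.5, 2.5, 3.0])  # hypothetical values

# sum adds up all the elements and gives back one number, not an array.
total = sum(example_prices)
print(total)
```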

**Question 1.3.4.** What was the sum of all the bills in `more_restaurant_bills`, **including tips**?

In [None]:
sum_of_bills = ...
sum_of_bills

In [None]:
grader.check("q1_3_4")

##### Powers of Two
The powers of 2 ($2^0 = 1$, $2^1 = 2$, $2^2 = 4$, etc.) arise frequently in computer science.  (For example, you may have noticed that storage on smartphones or computers comes in powers of 2, like 64 GB, 128 GB, or 256 GB.)

**Question 1.3.5.** Use `np.arange` and the exponentiation operator `**` to create an array containing the first 40 powers of 2, starting from $2^0=1$.

**_Hints:_**
- Did your kernel "die" when you ran your solution? There is a common incorrect response to this problem that tries to create an array with so many entries that Python gives up and crashes. If this happens to you, double-check your answer! 
- Maybe just start with the first 5 powers of two. Once you get that working, then try all 40. At no point should you have to manually write `0, 1, 2, 3, 4, ...`; if you find yourself trying that, scroll up to earlier in the lab notebook.

In [None]:
powers_of_2 = ...
powers_of_2

In [None]:
grader.check("q1_3_5")

# 2. DataFrames 

## 2.1. Introduction

For a collection of things in the world, an array is useful for describing a single attribute of each thing. For example, among the collection of US States, an array could describe the land area of each. Tables extend this idea by describing multiple attributes for each element of a collection. In a table of states, for example, we might keep track of land area, population, state capital, and the name of the governor. In other words, tables keep track of many entities (individuals, stored as rows), and for each entity, many attributes (features, stored as columns).

In the cell below we have two arrays. The first one contains the world population in each year (as estimated by the US Census Bureau), and the second contains the years themselves (in order, so the first elements in the population and the years arrays correspond).

In [None]:
population_amounts = bpd.read_csv("data/world_population_2023.csv").get("Population").values
population_years = np.arange(1950, 2023 + 1)
print("Population column:", population_amounts)
print("Years column:", population_years)

Suppose we want to answer this question:

> When did the world's population surpass 7 billion?

You could technically answer this question just by staring at the arrays, but it's a bit complicated, since you would have to count the position where the population first crossed 7 billion, then find the corresponding element in the years array. In cases like these, it might be easier to put the data into a table.

Just as `numpy` provides arrays, a popular package called `pandas` provides **DataFrames**, which is `pandas`' name for **tables**. `pandas` is *the* tool for doing data science in Python. Unfortunately, `pandas` isn't as cute as its name might suggest: it's very complicated and can be somewhat hard to learn.

Instead of using `pandas`, we'll use a package that we've created specifically for DSC 10. It is a *subset* of `pandas`, including only the parts that we think are necessary and throwing out all of the rest. Because it is smaller (and cuter), we've called it `babypandas`. 

<img src='data/pandas-babypandas.jpg' width=400>

You can import `babypandas` using the following code:

In [None]:
import babypandas as bpd


The nice thing about `babypandas` is that it is easier to learn *but* every bit of code you write using `babypandas` will work with `pandas`, too. If you're a data science major, or just going to be doing a lot of data analysis in Python, you'll see quite a lot of `pandas` in your future.

The cell below:

- creates an empty DataFrame using the expression `bpd.DataFrame()`,
- assigns two columns to the DataFrame by calling `assign`,
- assigns the resulting DataFrame to the name `population_df`, and finally
- displays `population_df` so that we can see the DataFrame we've made.

`"Population"` and `"Year"` are column labels that we have chosen. We could have chosen anything, but it's a good idea to choose names that are descriptive and not too long.

In [None]:
population_df = bpd.DataFrame().assign(
    Population=population_amounts,
    Year=population_years
)
population_df

Now the data are all together in a single DataFrame! It's much easier to parse this data. If you need to know what the population was in 2011, for example, you can tell from a single glance. We'll revisit this DataFrame later, but first we'll build some skills practicing with a new DataFrame of top movies.
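
Before the practice question, here is the same pattern sketched on a smaller, made-up table. It imports `pandas` directly, relying on the point made earlier in this lab that `babypandas` code also works with `pandas`; in the lab itself you would write `bpd` instead of `pd`. The city names and areas are invented for illustration.

```python
import numpy as np
import pandas as pd  # assumption: pandas stands in for babypandas here

# Hypothetical data: two arrays describing the same three cities.
city_names = np.array(["Avalon", "Brookfield", "Cedarville"])
city_areas = np.array([12.5, 40.0, 7.25])

# Start from an empty DataFrame and assign one column per array.
cities = pd.DataFrame().assign(
    City=city_names,
    Area=city_areas
)
print(cities)
```

Each keyword passed to `assign` becomes a column label, and the array given for it becomes that column's values.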

**Question 2.1.1.** In the cell below, we've created 2 arrays. Using the steps above, assign `top_10_movies` to a DataFrame that has two columns called `"Rating"` and `"Name"`, which hold `top_10_movie_ratings` and `top_10_movie_names` respectively.

In [None]:
top_10_movie_ratings = np.array([9.2, 9.2, 9., 8.9, 8.9, 8.9, 8.9, 8.9, 8.9, 8.8])
top_10_movie_names = np.array([
        'The Shawshank Redemption (1994)',
        'The Godfather (1972)',
        'The Godfather: Part II (1974)',
        'Pulp Fiction (1994)',
        "Schindler's List (1993)",
        'The Lord of the Rings: The Return of the King (2003)',
        '12 Angry Men (1957)',
        'The Dark Knight (2008)',
        'Il buono, il brutto, il cattivo (1966)',
        'The Lord of the Rings: The Fellowship of the Ring (2001)'
])

In [None]:
top_10_movies = ...
top_10_movies

In [None]:
grader.check("q2_1_1")

Suppose you want to add your own rankings to this DataFrame. The cell below contains your ranking of each movie:

In [None]:
my_ranking = [8, 2, 1, 9, 7, 10, 6, 4, 3, 5]

**Question 2.1.2.** You can use the `assign` method to add a column to an already-existing DataFrame, too. Create a new DataFrame called `with_ranking` by adding a column named `"Ranking"` to the DataFrame in `top_10_movies`.

In [None]:
with_ranking = ...
with_ranking

In [None]:
grader.check("q2_1_2")

## 2.2. Indexes

You may have noticed that the DataFrame of populations contains what looks like an extra, unlabeled column on the left with the numbers 0 through 73. **This is not a column, it's what we call an *index***. The index contains the row labels. Whereas the columns of this DataFrame are labeled `"Population"` and `"Year"`, the rows are labeled 0, 1, ..., 73.

By default, `babypandas` doesn't know how to label the rows, and so it just numbers them (starting with 0). In this case, it makes more sense to use the year as a row's label. We can do this by telling `babypandas` to set the `"Year"` column as the index:

In [None]:
population_by_year = population_df.set_index('Year')
population_by_year

As we'll see, this does more than make the DataFrame look nicer – it is very useful, too. Let's perform this same process on the `top_10_movies` DataFrame we were just working with.

**Question 2.2.1.** Create a new DataFrame named `top_10_movies_by_name` by taking the DataFrame you made above, `top_10_movies`, and setting the index to be the `'Name'` column.

In [None]:
top_10_movies_by_name = ...
top_10_movies_by_name

In [None]:
grader.check("q2_2_1")

You can get an array of row names using `.index`. For instance, the array of row names of the `population_by_year` DataFrame is:

In [None]:
population_by_year.index
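
As a rough sketch of how `set_index` and `.index` fit together, the cell below builds a tiny made-up table of planets. It uses `pandas` directly on the assumption (stated earlier in this lab) that `babypandas` code runs the same way under `pandas`; the table and its names are hypothetical.

```python
import pandas as pd  # assumption: pandas stands in for babypandas here

# Hypothetical table of three planets.
planets = pd.DataFrame().assign(
    Planet=["Mercury", "Venus", "Earth"],
    Moons=[0, 0, 1]
)

# set_index returns a new DataFrame whose row labels come from "Planet".
planets_by_name = planets.set_index("Planet")

# .index holds the row labels; square brackets pick one out by position.
print(planets_by_name.index)
second_label = planets_by_name.index[1]
print(second_label)
```

Note that `"Planet"` is no longer a column of `planets_by_name`; it has become the index, while `planets` itself still has both columns.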

**Question 2.2.2.** Using code, assign to `tenth_movie` the name of the tenth movie in `top_10_movies_by_name`.

**_Hint:_**  Remember that the index is an array, and we use square brackets to access elements of an array.

In [None]:
tenth_movie = ...
tenth_movie

In [None]:
grader.check("q2_2_2")

## 2.3. Reading a DataFrame from a file
In most cases, we aren't going to go through the trouble of typing in all the data manually. Instead, we can use functions provided by `babypandas` to read in data from external files.

The function `bpd.read_csv` takes one argument, a path to a data file (a string), and returns a DataFrame.  There are many formats for data files, but CSV ("comma-separated values") is the most common. 

**Question 2.3.1.** The file `data/imdb.csv` contains information about the 250 highest-rated movies on IMDb.  Load it as a DataFrame called `imdb`.

In [None]:
imdb = ...
imdb

In [None]:
grader.check("q2_3_1")

Notice the dots in the middle of the DataFrame. This means that a lot of the rows have been omitted. This DataFrame is big enough that only a few of its rows are displayed, but the others are still there.  There are 250 movies total.

Where did `imdb.csv` come from? Take a look at [this lab's folder](./). If you go into the `data/` directory, you should see a file called `imdb.csv`.

Open up the `imdb.csv` file in that folder and look at the format. What do you notice? The `.csv` filename ending says that this file is in the [CSV (comma-separated value) format](http://edoceo.com/utilitas/csv-file-format).
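
To get a feel for the CSV format without opening a file, here is a sketch that reads a few lines of made-up CSV text. It is written against `pandas` rather than `babypandas`, and it passes an `io.StringIO` object instead of a file path, which `pandas`' `read_csv` accepts; whether `bpd.read_csv` accepts the same thing is not something this lab states, so treat that part as an assumption. The film titles are invented.

```python
import io
import pandas as pd  # assumption: pandas stands in for babypandas here

# Hypothetical contents of a small CSV file: the first line holds the
# column labels, and each later line holds one row, with commas between values.
csv_text = "Title,Year\nFirst Film,1990\nSecond Film,2005\n"

# io.StringIO lets this sketch run without a file on disk.
films = pd.read_csv(io.StringIO(csv_text))
print(films)
```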

**Question 2.3.2.** This is a data set of movies, so it makes sense to use the movie title as the row label. Create a new DataFrame called `imdb_by_name` which uses the movie title as the index.

In [None]:
imdb_by_name = ...
imdb_by_name

In [None]:
grader.check("q2_3_2")

## 2.4. Series



Suppose we're interested primarily in movie ratings. To extract just this column from the DataFrame, we use the `.get` method:

In [None]:
ratings = imdb_by_name.get('Rating')
ratings

Notice how not only the movie ratings have been returned, but also the name of the movie! This is precisely because we have set the movie title to be the index! For example, if we had asked for the `"Rating"` column of the original DataFrame, `imdb`, we would see:

In [None]:
imdb.get('Rating')

This is one way in which indices are very useful - they provide meaningful labels for the data.

At first glance, it might look like asking for a column using `.get` returns a DataFrame with one column, but that's not quite right. Instead, it returns a special type of thing called a *Series*:

In [None]:
type(imdb_by_name.get('Rating'))

You can think of a `Series` as an array with an index. Whereas arrays are simple sequences of numbers without labels, `Series` can have labels. This is often very useful.

`ratings` is now a `Series` which contains the column of movie ratings. Suppose we're interested in the rating of a particular movie: _Alien_. To find it, we will use the `.loc` *accessor*, which pulls a value from the Series at a particular *loc*ation:

In [None]:
ratings.loc['Alien']

There are a couple of things to note here. First, those are square brackets around `"Alien"`. This is because `.loc` is not a method, but an *accessor*. The square brackets signal that we're going to be extracting an element from the `Series`. Second, we passed in the label as a string.
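
Here is the same lookup sketched on a tiny made-up table, so you can see every label at once. It uses `pandas` in place of `babypandas` (an assumption based on the compatibility described earlier in this lab), and the snack data is invented.

```python
import pandas as pd  # assumption: pandas stands in for babypandas here

# Hypothetical table, indexed by snack name.
snacks = pd.DataFrame().assign(
    Snack=["apple", "granola bar", "yogurt"],
    Calories=[95, 190, 150]
).set_index("Snack")

calories = snacks.get("Calories")   # a Series labeled by snack name
print(type(calories))

# Square brackets after .loc, with the label passed as a string.
yogurt_calories = calories.loc["yogurt"]
print(yogurt_calories)
```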

**Question 2.4.1.** Find the rating of _3 Idiots_.

In [None]:
three_idiots_rating = ...
three_idiots_rating

In [None]:
grader.check("q2_4_1")

Now suppose we wanted to know the year in which _Alien_ was released. We could do this by first getting the column of years:

In [None]:
years = imdb_by_name.get('Year')
years

And then using `.loc` to get the right entry:

In [None]:
years.loc['Alien']

We could also do this in one step by *chaining* the operations together:

In [None]:
imdb_by_name.get('Year').loc['Alien']

This works because Python first evaluates `imdb_by_name.get('Year')` to a Series. It then evaluates the `.loc['Alien']` to return the year.

Chaining is used pretty frequently and can be handy. Just be sure not to chain so many things together that your code gets hard to read. You can always save an intermediate result to a variable.
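
The sketch below shows the two styles side by side on a made-up table of pets; both give the same answer. It uses `pandas` in place of `babypandas`, and all of the names and values are hypothetical.

```python
import pandas as pd  # assumption: pandas stands in for babypandas here

# Hypothetical table, indexed by pet name.
pets = pd.DataFrame().assign(
    Name=["Biscuit", "Mochi", "Pepper"],
    Age=[3, 7, 5]
).set_index("Name")

# Two steps, saving the intermediate Series to a variable.
ages = pets.get("Age")
mochi_age_two_steps = ages.loc["Mochi"]

# One step, chaining .get and .loc together.
mochi_age_chained = pets.get("Age").loc["Mochi"]

print(mochi_age_two_steps, mochi_age_chained)
```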

**Question 2.4.2.** Find the decade in which _Gone Girl_ was released using chaining.

**_Hint:_**  `imdb_by_name` has a column named `'Decade'`.

In [None]:
decade = ...
decade

In [None]:
grader.check("q2_4_2")

# 3. Analyzing datasets

With just a few DataFrame methods, we can answer some interesting questions about the IMDb dataset.

If we want just the ratings of the movies, we can use `.get`:

In [None]:
ratings = imdb_by_name.get('Rating')
ratings

Remember that `ratings` is a Series. Series objects have some useful methods.

**Question 3.1.** Find the rating of the highest-rated movie in the dataset.

**_Hint:_**  Type `ratings.` and hit Tab to see a list of the available methods. Is there one that looks useful?

In [None]:
highest_rating = ...
highest_rating

In [None]:
grader.check("q3_1")

You probably want to know the *name* of the movie whose rating you found!  To do that, we can sort the whole Series using the `.sort_values` method:

In [None]:
ratings.sort_values()

So there are actually 2 highest-rated movies in the dataset: *The Shawshank Redemption* and *The Godfather*.

Notice that we are sorting by the ratings, not the labels! The label moves with the rating as it is sorted. This is exactly what we want.

When we use the `sort_values` method, the resulting Series has the data sorted in ascending order, from small to large. This is the default behavior of `sort_values`, but we can change that. Had we wanted the highest rated movies on top, we would need to specify that the sorting should not be in ascending order with an optional *keyword argument*:


In [None]:
ratings.sort_values(ascending=False)

If we set the keyword argument `ascending` to `True`, we get the same result as if we did not set it at all. This is what we mean when we say that the default behavior of `sort_values` is to sort in ascending order. Confirm that the next two cells give the same output.

In [None]:
ratings.sort_values(ascending=True)

In [None]:
ratings.sort_values()

Not only can we sort Series, but we can sort entire DataFrames, too. When we do that, we have to specify the column to sort by:

In [None]:
imdb_by_name.sort_values('Rating')

Similarly, we can specify that the sort should be in descending order:

In [None]:
imdb_by_name.sort_values('Rating', ascending=False)


Some details about sorting a DataFrame:

1. The first argument to `sort_values` is the name of a column to sort by.
2. If the column has strings in it, `sort_values` will sort alphabetically; if the column has numbers, it will sort numerically.
3. `imdb_by_name.sort_values('Rating')` returns a new DataFrame; the `imdb_by_name` DataFrame doesn't get modified. For example, if we called `imdb_by_name.sort_values('Rating')`, then running `imdb_by_name` by itself would still return the unsorted DataFrame. To save the result, you should assign it to a new variable.
4. Rows always stick together when a DataFrame is sorted.  It wouldn't make sense to sort just one column and leave the other columns alone.  For example, in this case, if we sorted just the `'Rating'` column, the movies would all end up with the wrong ratings.
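
These details can be checked on a small made-up table. The sketch below uses `pandas` in place of `babypandas` (an assumption, based on the compatibility described earlier in this lab), and the book titles and page counts are invented.

```python
import pandas as pd  # assumption: pandas stands in for babypandas here

# Hypothetical table, indexed by book title.
books = pd.DataFrame().assign(
    Title=["Gamma", "Alpha", "Beta"],
    Pages=[310, 120, 250]
).set_index("Title")

# Each call returns a new, sorted DataFrame.
by_pages = books.sort_values("Pages")                         # fewest pages first
by_pages_desc = books.sort_values("Pages", ascending=False)   # most pages first

print(by_pages)
print(by_pages_desc)

# The original DataFrame keeps its original row order.
print(books)
```

In both sorted results, each title stays attached to its own page count, because whole rows move together.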

**Question 3.2.** Create a version of `imdb_by_name` that's sorted chronologically, with the earliest movies first.  Call it `imdb_sorted`.

In [None]:
imdb_sorted = ...
imdb_sorted

In [None]:
grader.check("q3_2")

**Question 3.3.** What's the title of the earliest movie in the dataset?  You could just look this up from the output of the previous cell.  Instead, write Python code to find out.

**_Hint:_**  Remember that the index is an array.

In [None]:
earliest_movie_title = ...
earliest_movie_title

In [None]:
grader.check("q3_3")

Suppose we want to get the rating of the oldest movie in the DataFrame. One way to do this is to first find the index label of the oldest movie (which we've already done). We then extract the `"Rating"` column and use `.loc` to find the rating of the oldest movie.

In [None]:
imdb_sorted.get('Rating').loc[earliest_movie_title]

There's a faster way, though. A Series not only has a `.loc` accessor, but also an `.iloc` accessor. While `.loc` looks up things by *label*, `.iloc` looks up elements by *integer position*.

Let's remember what is in the `'Rating'` column:

In [None]:
imdb_sorted.get('Rating')

If we want the rating of the first row, we can use `.iloc[0]`:

In [None]:
imdb_sorted.get('Rating').iloc[0]

This returns the exact same thing as `imdb_sorted.get('Rating').loc['The Kid']`; these are two ways of doing the same thing. Usually it is more convenient to access an element by its label rather than by its integer position, but both `.loc` and `.iloc` are good to know.
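
As a quick sketch of the difference, the cell below looks up the same element of a made-up Series both ways. It uses `pandas` in place of `babypandas`, and the tree data is hypothetical.

```python
import pandas as pd  # assumption: pandas stands in for babypandas here

# Hypothetical Series of heights, labeled by tree name.
heights = pd.DataFrame().assign(
    Tree=["oak", "pine", "maple"],
    Height=[20, 35, 15]
).set_index("Tree").get("Height")

by_label = heights.loc["oak"]   # look up by label
by_position = heights.iloc[0]   # look up by integer position (first row)
print(by_label, by_position)
```

Both lookups land on the same row here only because `"oak"` happens to be the first label; after sorting, the positions change but the labels do not.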

**Question 3.4.** What is the rating of the fifth oldest movie in the dataset? You could just look this up from the output of the previous cell. Instead, write Python code to find out.

In [None]:
fifth_oldest_rating = ...
fifth_oldest_rating

In [None]:
grader.check("q3_4")

# 4. Finding pieces of a dataset

Suppose you're interested in movies from the 1950s.  Sorting the DataFrame by year doesn't help you, because the 1950s are in the middle of the dataset. Instead, we'll use a feature of Series that allows us to easily compare each element in a column to a particular value.

First, remember that we can use `.get` to extract a single column. The result is not a DataFrame, but rather a Series:

In [None]:
imdb_by_name.get('Decade')

We want to check whether each movie was released in the 1950s, which this dataset records as the decade 1950. Python gives us a way of checking whether two things are equal with `==` (remember that `=` is already being used for another purpose: it assigns values to variable names):

In [None]:
3 == 4

In [None]:
3 == 3

`True` and `False` are instances of a type that we haven't seen before:

In [None]:
type(True)

`bool` stands for "Boolean", named after the English logician [George Boole](https://en.wikipedia.org/wiki/George_Boole). We say that "True" and "False" are *Boolean* values.

It turns out that we can easily check if *each* of the elements in a `Series` is equal to something:

In [None]:
imdb_by_name.get('Decade') == 1950

We see that the result is a new series which has `True` only where the decade was 1950, and `False` everywhere else. We say that the resulting series is a series of *Booleans*, or a *Boolean Series*.

Let's call this result `is_from_1950s`. Its name can be read like it is a question: "is this movie from the 1950s"?

In [None]:
is_from_1950s = imdb_by_name.get('Decade') == 1950
is_from_1950s

Each row is an answer to this question. Is _The Elephant Man_ from the 1950s? `False`. Is _All About Eve_ from the 1950s? `True`.

We can use `is_from_1950s` to select only the rows from `imdb_by_name` for which the answer is `True`. The syntax for this is:

In [None]:
imdb_by_name[is_from_1950s]

What `imdb_by_name[is_from_1950s]` does, precisely, is to go through `imdb_by_name` row by row. If the row named _Singin' in the Rain_ has the value `True` in `is_from_1950s`, that row is kept. If the value is `False`, the row is discarded. And so on, for every row.

Note that we could have accomplished this without ever creating the variable `is_from_1950s`, by placing the code that creates the Boolean Series directly inside the `[...]`. This is a pattern you'll be using a lot!

In [None]:
imdb_by_name[imdb_by_name.get('Decade') == 1950]

It helps to read the square brackets as "where." So the command in the cell above says to keep all rows from `imdb_by_name` *where* the decade is the 1950s.

Creating a new DataFrame by selecting only certain rows from an existing DataFrame which satisfy some condition is called *querying*. The line of code `imdb_by_name[imdb_by_name.get('Decade') == 1950]` is a *query*.
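
The whole querying pattern can be sketched on a tiny made-up DataFrame. This assumes plain `pandas` is installed and behaves like `babypandas` for these operations; `books` and every value in it are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the IMDb dataset.
books = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Decade': [1950, 1990, 1950, 2000],
}).set_index('Title')

# Step 1: a Boolean Series with one True/False answer per row.
is_from_1950s_toy = books.get('Decade') == 1950

# Step 2: keep only the rows where the answer is True.
fifties = books[is_from_1950s_toy]

# The same query written in one line, without the intermediate variable.
same_result = books[books.get('Decade') == 1950]

print(fifties)
```

Only the rows labeled `'A'` and `'C'` survive, and both ways of writing the query give the same DataFrame.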

**Question 4.1.** Create a DataFrame called `ninety_eight` containing the movies that came out in 1998.

In [None]:
ninety_eight = ...
ninety_eight

In [None]:
grader.check("q4_1")

So far we've only been finding where a column is *exactly* equal to a certain value. However, there are many other comparison operators we could use.  Here are a few:

|Operator|Tests|
|-|-|
|`==`|thing on left is equal to thing on right|
|`!=`|thing on left is *not* equal to thing on right|
|`>`|thing on left is greater than (and not equal to) thing on right|
|`>=`|thing on left is greater than or equal to thing on right|
|`<`|thing on left is less than (and not equal to) thing on right|
|`<=`|thing on left is less than or equal to thing on right|

[BPD 10](https://notes.dsc10.com/02-data_sets/querying.html#examples) in the babypandas notes has more examples.
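
Like `==`, the other comparison operators work element by element on a Series. The sketch below tries several of them on a small made-up Series; it assumes plain `pandas` behaves like `babypandas` here, and the `scores` values are hypothetical.

```python
import pandas as pd

# Hypothetical scores, not taken from the IMDb dataset.
scores = pd.Series([7.5, 8.0, 9.1], index=['x', 'y', 'z'])

not_eight = scores != 8.0        # True where the score differs from 8.0
above_eight = scores > 8.0       # strictly greater, so 8.0 itself is False
at_least_eight = scores >= 8.0   # 8.0 itself is True here
below_eight = scores < 8.0       # strictly less

print(above_eight)
print(at_least_eight)
```

The only difference between `>` and `>=` shows up at the boundary value, which is worth checking whenever a question says "higher than" versus "at least."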

**Question 4.2.** Using operators from the table above, find all the movies with a rating higher than 8.6.  Put their data in a DataFrame called `really_highly_rated`.

In [None]:
really_highly_rated = ...
really_highly_rated

In [None]:
grader.check("q4_2")

What is the highest rating of any movie from the 1990s? We now have the tools to answer questions like these. Breaking it into pieces, we first find all of the movies from the 1990s:

In [None]:
is_from_1990s = imdb_by_name.get('Decade') == 1990
is_from_1990s

We then select only these movies from our DataFrame:

In [None]:
from_1990s = imdb_by_name[is_from_1990s]
from_1990s

We then find the highest rating out of just these movies:

In [None]:
from_1990s.get('Rating').max()

Or, if we wanted to do all of this more concisely using chaining:

In [None]:
imdb_by_name[imdb_by_name.get('Decade') == 1990].get('Rating').max()

**Question 4.3.** Find the average rating for movies released in the 20th century and the average rating for movies released in the 21st century for the movies in `imdb`.

**_Hint:_**  Series have a `.mean()` method. Note that the year 2000 is in the 20th century, and that the earliest movie in the dataset is from 1921!

In [None]:
average_20th_century_rating = ...
average_20th_century_rating

In [None]:
grader.check("q4_3_1")

In [None]:
average_21st_century_rating = ...
average_21st_century_rating

In [None]:
grader.check("q4_3_2")

The attribute `shape` tells you how many rows and columns are in a DataFrame. (An attribute, or a property, is not a method and is not called using parentheses.)

In [None]:
imdb_by_name.shape

As with an array, you can get the first element of the shape using `[0]` and the second element using `[1]`. For instance, the number of rows in `imdb_by_name` is:

In [None]:
imdb_by_name.shape[0]

We can use this to answer "How many movies are from the 20th century?":

In [None]:
imdb_by_name[imdb_by_name.get('Year') <= 2000].shape[0]
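
Counting rows after a query follows the same recipe on any DataFrame. Here is a minimal sketch on made-up data, assuming plain `pandas` behaves like `babypandas` for `.shape` and querying; the `toy` DataFrame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the IMDb dataset.
toy = pd.DataFrame({
    'Year': [1999, 2000, 2001, 2010],
    'Score': [1, 2, 3, 4],
})

n_rows = toy.shape[0]    # number of rows in the full DataFrame
n_cols = toy.shape[1]    # number of columns

# Query first, then count the rows that remain.
n_early = toy[toy.get('Year') <= 2000].shape[0]

print(n_rows, n_cols, n_early)
```

A query never changes the number of columns, so only the first element of `shape` differs before and after.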

**Question 4.4.** Use `shape` (and arithmetic) to find the *proportion* of movies in the dataset from the 20th century, and the proportion from the 21st century.

In [None]:
proportion_in_20th_century = ...
proportion_in_20th_century

In [None]:
grader.check("q4_4_1")

In [None]:
proportion_in_21st_century = ...
proportion_in_21st_century

In [None]:
grader.check("q4_4_2")

**Question 4.5.** Finally, let's revisit the `population_by_year` DataFrame from earlier in the lab.  Compute the year when the world population first went above 7 billion.

In [None]:
year_population_crossed_7_billion = ...
year_population_crossed_7_billion

In [None]:
grader.check("q4_5")

# Finish Line 🏁

Congratulations! You are done with Lab 1.

**Citations:** Did you use any generative artificial intelligence tools to assist you on this assignment? If so, please state, for each tool you used, the name of the tool (e.g., ChatGPT) and the problem(s) in this assignment where you used the tool for help.

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

Please cite tools here.

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

To submit your assignment:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope.
5. Stick around while the Gradescope autograder grades your work. Make sure you see that all tests have passed on Gradescope.
6. Check that you have a confirmation email from Gradescope and save it as proof of your submission.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
grader.check_all()