# Objectives
1. Use the `iloc` and `loc` operators to access slices of the rows and columns of a DataFrame
2. Use Boolean expressions to select rows of a DataFrame based on one or more conditions
3. Combine multiple data manipulation steps to answer a question using data

In this lab, you will use the Baby Names dataset that *Data.gov* releases each year to practice working with data using Pandas. The dataset contains the total number of babies born with a given first name and gender each year from 1880 through 2018. The file has nearly two million lines, which exceeds Excel's limit of 1,048,576 rows per worksheet, so you will not be able to open the full dataset in Excel. The first five lines are:
```
Id,Name,Year,Gender,Count
1,Mary,1880,F,7065
2,Anna,1880,F,2604
3,Emma,1880,F,2003
4,Elizabeth,1880,F,1939
```

Each row represents a unique name-gender-year combination, and the Count column gives the number of babies born with that combination.

*Note: to protect individuals' privacy, Data.gov does not publish names with fewer than five births in a given year.*

### Load the Libraries
Run the following code cell to import each of the libraries that we will use in the lab.

In [39]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 1. Explore the Data
In the next several exercises, you will use Pandas attributes and methods to become familiar with the data.

**Note: Jupyter Notebook formats the output of Pandas objects nicely. As such, you should NOT use the `print` function unless specifically instructed. Furthermore, we will explicitly state which variable name to assign a result to if we will continue to use it; otherwise, you may assume that the output is not required for future computations and is meant only for inspection and exploration.**

**Q1.1.** In the below cell, read the `babynames.csv` file into a `DataFrame` named `names`. Specify the `Id` column as the index. Inspect the first five lines.

In [40]:
# read file and set index


# inspect


**Q1.2.** Output information regarding the DataFrame attributes using the `info` method.

In [41]:
# inspect data types


**Q1.3.** Determine the number of observations for each gender. You can determine this by calling the `value_counts` method on the `Gender` column.

In [42]:
# find number of males and females


Why do you think there are so many more female observations than male? Does that mean that there are that many more girls born than boys?

**Q1.4.** Use the below Markdown cell to summarize what the difference in these numbers means:

You can use the `describe` method for Series or DataFrame objects and its behavior will change depending on the type of data.

**Q1.5.** In the below cell, call the `describe` method on the entire DataFrame.

**Q1.6.** In the below cell, call the `describe` method on the `Name` column.

The `describe` method returns summary statistics for a given DataFrame or Series. The statistics that it contains depend on the type of data in the DataFrame or Series.
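As a small illustration of how `describe` adapts to the data type, the sketch below builds a toy DataFrame from the four sample rows shown at the top of this lab (the name `toy` is only for illustration and is not the full dataset) and compares the statistics returned for a numeric column and a text column.

```python
import pandas as pd

# Toy data copied from the sample rows shown earlier in this lab,
# not the full Baby Names dataset.
toy = pd.DataFrame(
    {
        "Name": ["Mary", "Anna", "Emma", "Elizabeth"],
        "Year": [1880, 1880, 1880, 1880],
        "Gender": ["F", "F", "F", "F"],
        "Count": [7065, 2604, 2003, 1939],
    },
    index=pd.Index([1, 2, 3, 4], name="Id"),
)

# Numeric Series: count, mean, std, min, quartiles, and max.
numeric_summary = toy["Count"].describe()

# Text Series: count, unique, top (most frequent value), and freq.
text_summary = toy["Name"].describe()

# On a DataFrame with mixed types, describe() summarizes only the
# numeric columns by default.
frame_summary = toy.describe()

print(list(numeric_summary.index))
print(list(text_summary.index))
print(list(frame_summary.columns))
```

The three printed lists show which statistics (or columns) each call reports, which is a quick way to check what `describe` is actually measuring before interpreting it.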

Looking at the output, does this mean the top name is the most popular name? We will explore this question later in the lab.

## 2. Indexing in Pandas
#### `loc` and `iloc`
The Kaggle lesson reviewed three different ways to access a subset of data within a DataFrame. The naive method is to use the `[]` indexing operator similar to lists or dictionaries. This can be convenient, but the more appropriate method is to use either the `loc` or `iloc` operators.
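To make the difference between the three access styles concrete, here is a minimal sketch on a toy DataFrame built from the sample rows at the top of this lab (the name `toy` is illustrative only). Because the `Id` index starts at 1, the label `1` and the position `0` refer to the same row.

```python
import pandas as pd

# Toy data copied from the sample rows shown earlier in this lab.
toy = pd.DataFrame(
    {
        "Name": ["Mary", "Anna", "Emma", "Elizabeth"],
        "Year": [1880, 1880, 1880, 1880],
        "Gender": ["F", "F", "F", "F"],
        "Count": [7065, 2604, 2003, 1939],
    },
    index=pd.Index([1, 2, 3, 4], name="Id"),
)

# [] with a column label selects that column.
counts = toy["Count"]

# loc selects by label: the row whose Id is 1 and the column named "Name".
by_label = toy.loc[1, "Name"]

# iloc selects by integer position: the first row and the first column.
by_position = toy.iloc[0, 0]

# Label slices include the end point; position slices exclude it.
label_slice = toy.loc[1:3]      # rows with Id 1, 2, and 3
position_slice = toy.iloc[1:3]  # second and third rows (Id 2 and 3)
```

Keeping labels and positions separate in this way avoids off-by-one surprises when the index does not start at zero.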

**Q2.1.** In the below cell, use the `iloc` operator to access the first row of data:

Take note that when there is only a single dimension (one row or one column) of data, Pandas will return a Series. It is important to know what type of object a statement will return so you know what attributes and methods are available to that object. It is also important to know whether a method modifies the object in-place or returns a new object. In the former case, an assignment is not necessary, while the latter requires an assignment if you are going to use the resulting object in the future.

**Q2.2.** Again, use the `iloc` operator to output the value in the first row and column:

Using a list for both the rows and columns will force Pandas to return a DataFrame, even if the resulting object only has a single dimension or is a single value. The below table summarizes the resulting object type, but you should explore each statement for yourself:

| Statement | Type |
| :----: | :----: |
| `names.iloc[0]` |  Series |
| `names.iloc[:, 0]` | Series |
| `names.iloc[[0]]` | DataFrame |
| `names.iloc[0, 0]` | `str` (a Python string) |
| `names.iloc[[0], 0]` | Series |
| `names.iloc[[0], [0]]` | DataFrame |
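One way to explore the table for yourself is to ask Python for the type of each statement. The sketch below does this on a toy DataFrame built from the sample rows at the top of this lab (the names `toy` and `result_types` are illustrative only), assuming `Id` is the index so that `Name` is the first column.

```python
import pandas as pd

# Toy data copied from the sample rows shown earlier in this lab.
toy = pd.DataFrame(
    {
        "Name": ["Mary", "Anna", "Emma", "Elizabeth"],
        "Year": [1880, 1880, 1880, 1880],
        "Gender": ["F", "F", "F", "F"],
        "Count": [7065, 2604, 2003, 1939],
    },
    index=pd.Index([1, 2, 3, 4], name="Id"),
)

# Map each statement (as text) to the name of the type it returns.
result_types = {
    "toy.iloc[0]": type(toy.iloc[0]).__name__,
    "toy.iloc[:, 0]": type(toy.iloc[:, 0]).__name__,
    "toy.iloc[[0]]": type(toy.iloc[[0]]).__name__,
    "toy.iloc[0, 0]": type(toy.iloc[0, 0]).__name__,
    "toy.iloc[[0], 0]": type(toy.iloc[[0], 0]).__name__,
    "toy.iloc[[0], [0]]": type(toy.iloc[[0], [0]]).__name__,
}

for statement, type_name in result_types.items():
    print(f"{statement:<20} {type_name}")
```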

**Q2.3.** Use the `.loc` operator to access and output the `Count` column.

#### Conditional Selection
The power of Pandas comes from being able to subset our data using Boolean expressions. This will allow us to answer interesting questions and perform operations on a subset of our data.

There are two important differences between Boolean expressions in Pandas and those in standard Python. First, we cannot use the `and` and `or` Boolean operators because they expect a single `True` or `False` value, while a comparison on a column produces a whole Series of them. Instead, we use the element-wise `&` and `|` operators, respectively. Second, we must enclose each Boolean expression in parentheses when combining them, because `&` and `|` bind more tightly than comparison operators such as `==` and `>`.
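A minimal sketch of combining conditions, using a toy DataFrame built from the sample rows at the top of this lab (the variable names here are illustrative only). It also shows the error that Python's `and` produces when given two Series.

```python
import pandas as pd

# Toy data copied from the sample rows shown earlier in this lab.
toy = pd.DataFrame(
    {
        "Name": ["Mary", "Anna", "Emma", "Elizabeth"],
        "Year": [1880, 1880, 1880, 1880],
        "Gender": ["F", "F", "F", "F"],
        "Count": [7065, 2604, 2003, 1939],
    },
    index=pd.Index([1, 2, 3, 4], name="Id"),
)

# Element-wise AND: each condition is wrapped in parentheses.
both = (toy["Count"] > 2000) & (toy["Name"] != "Mary")

# Element-wise OR.
either = (toy["Name"] == "Mary") | (toy["Count"] < 2000)

# Python's `and` needs a single True/False value, so it raises a
# ValueError when given a whole Series.
try:
    (toy["Count"] > 2000) and (toy["Name"] != "Mary")
except ValueError as error:
    and_error = str(error)

print(list(both))    # [False, True, True, False]
print(list(either))  # [True, False, False, True]
print(and_error)
```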

Run the below code cell and note three things:
* the values of the resulting Series
* the length is the number of rows in the DataFrame
* we often use attribute access (`names.Name`) to refer to an entire column in a Boolean expression

In [None]:
names.Name == 'Kevin'

**Q2.4.** If we use this Boolean expression inside of the `.loc` operator, it will return only the rows where the Boolean expression is `True`. In the below code cell, output only those rows where the Name value is 'Kevin'.

## 3. What were the five most popular baby names in 2001?
We will walk through the logic needed to find the most popular baby names from around the time you were born. To answer this question, we can break it down into three steps:
1. Subset the data to only the year 2001
2. Sort the data on the Count column in descending order
3. Access the first five rows
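
The same three-step pattern can be sketched on a tiny, entirely hypothetical table. The placeholder names, years, and counts below are made up for illustration and are NOT real Baby Names data; only the column names match the lab's dataset.

```python
import pandas as pd

# Hypothetical rows: placeholder names and made-up counts,
# NOT real Baby Names data.
toy = pd.DataFrame(
    {
        "Name": ["Alpha", "Beta", "Gamma", "Alpha", "Beta", "Gamma"],
        "Year": [1990, 1990, 1990, 1991, 1991, 1991],
        "Gender": ["F", "M", "F", "F", "M", "F"],
        "Count": [120, 340, 75, 410, 95, 230],
    },
    index=pd.Index(range(1, 7), name="Id"),
)

# Step 1: keep only the rows for one year.
one_year = toy.loc[toy["Year"] == 1991]

# Step 2: sort by Count, largest first. sort_values returns a new
# DataFrame, so the result must be assigned to be used later.
one_year_sorted = one_year.sort_values("Count", ascending=False)

# Step 3: slice the first two rows by position.
top_two = one_year_sorted.iloc[:2]

print(list(top_two["Name"]))  # ['Alpha', 'Gamma']
```

Assigning each intermediate result to its own variable keeps every step inspectable, which is the same reason the questions below ask for named intermediate DataFrames.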

**Q3.1.** Use a Boolean expression to subset the data to only the year 2001 and assign the resulting DataFrame to the variable `yob_2001`.

**Q3.2.** Sort the data in descending order by the Count column and assign the result to a DataFrame named `yob_2001_sorted`.

**Q3.3.** Use the `.iloc` operator to output a slice of the first five rows. In other words, do not use the `head` method even though it would be equivalent.

Was your name among the most popular? Sometimes it is more fun to look at the rarest names. What were the five rarest names? Continue to explore using indexing in Pandas to become more familiar with it.