In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("dsc495_0019_r2.ipynb")

# Week 2 Review

## Due: Friday, January 28, 2022 @ 11:59pm

In this assignment we will review the concepts and topics that we covered in Week 2.

I would like you to attempt each level of the assignment. They get progressively harder. To get a **comprehensive review** of last week's content, you must complete all levels.

**Note:** Try not to delete the instructions of the assignment.

In the markdown cell below enter your name, section, and the date.

**Name:** 

**Section:** 

**Date:**

## Pandas Overview

[Pandas](https://pandas.pydata.org/) is one of the most widely used Python libraries in data science. In this lab, you will learn commonly used data wrangling operations/tools in Pandas. We aim to give you familiarity with:

* Creating dataframes
* Slicing dataframes (i.e., selecting rows and columns)
* Filtering data (using boolean arrays)

In this review you are going to use several pandas methods, such as `.drop` and `.loc`. You may press `shift+tab` on the method parameters to see the documentation for that method.

Run the cell below.

In [None]:
import pandas as pd
import numpy as np

## Creating DataFrames & Basic Manipulations

A [dataframe](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) is a table in which each column has a type; there is an index over the columns (typically string labels) and an index over the rows (typically ordinal numbers).

The [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) for the pandas `DataFrame` class provides at least two syntaxes to create a dataframe.

**Syntax 1:** You can create a data frame by specifying the columns and values using a [dictionary](https://www.geeksforgeeks.org/python-dictionary/) as shown below. 

The keys of the [dictionary](https://www.geeksforgeeks.org/python-dictionary/) are the column names, and the values of the [dictionary](https://www.geeksforgeeks.org/python-dictionary/) are lists containing the row entries.

In [None]:
fruit_info = pd.DataFrame(
    data = {'fruit': ['apple', 'orange', 'banana', 'raspberry'],
            'color': ['red', 'orange', 'yellow', 'pink']})
fruit_info

**Syntax 2:** You can also define a dataframe by specifying the rows like below. 

Each row corresponds to a distinct [tuple](https://www.w3schools.com/python/python_tuples.asp), and the columns are specified separately.
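As a minimal sketch of how the two syntaxes relate, the cell below builds the same small table both ways and compares the results. The names `by_column` and `by_row` are made up for this illustration.

```python
import pandas as pd

# Syntax 1: a dictionary mapping column names to lists of values.
by_column = pd.DataFrame(
    data={'fruit': ['apple', 'orange'],
          'color': ['red', 'orange']})

# Syntax 2: a list of row tuples, with the column names given separately.
by_row = pd.DataFrame(
    [('apple', 'red'), ('orange', 'orange')],
    columns=['fruit', 'color'])

# .equals() checks that both frames have the same shape and elements.
same_table = by_column.equals(by_row)
print(same_table)  # True
```

Note that with Syntax 2 the order of the values inside each tuple must match the order of the names in `columns`.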

In [None]:
fruit_info2 = pd.DataFrame(
    [("red", "apple"), ("orange", "orange"), ("yellow", "banana"),
     ("pink", "raspberry")], 
    columns = ["color", "fruit"])
fruit_info2

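You can obtain the dimensions of a `DataFrame` with the `.shape` attribute, which returns a `(rows, columns)` tuple. Because `.shape` is an attribute rather than a method, it is written without parentheses.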

In [None]:
fruit_info.shape

You can also convert the entire dataframe into a two-dimensional `NumPy` array.
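Here is a small sketch of both ideas on a made-up table (`pets` is a hypothetical name, not part of the assignment):

```python
import pandas as pd

# Hypothetical table, used only for illustration.
pets = pd.DataFrame({'name': ['Rex', 'Tom', 'Tweety'],
                     'legs': [4, 4, 2]})

n_rows, n_cols = pets.shape   # .shape is a (rows, columns) tuple
as_array = pets.values        # two-dimensional NumPy array of the entries

print(n_rows, n_cols)         # 3 2
print(as_array.shape)         # (3, 2)
```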

In [None]:
fruit_info.values

## Level I

**Question 1.**  For a `DataFrame` `d`, you can add a column with 

    d['new column name'] = ... 

and assign a list or array of values to the column. Add a column of integers containing 1, 2, 3, and 4 called `rank1` to the `fruit_info` table which expresses your personal preference about the taste ordering for each fruit (1 is tastiest; 4 is least tasty).
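For reference, here is the same pattern on a made-up table. The `pets` table and its `legs` column are hypothetical and unrelated to the graded answer.

```python
import pandas as pd

# Hypothetical table, used only for illustration.
pets = pd.DataFrame({'name': ['Rex', 'Tom', 'Tweety']})

# Assigning a list creates the column; the list needs one value per row.
pets['legs'] = [4, 4, 2]

print(pets.columns.tolist())  # ['name', 'legs']
```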

In [None]:
...
fruit_info

In [None]:
grader.check("q1")

## Level II

**Question 2.** You can also add a column to `d` with 

    d.loc[:, 'new column name'] = ... 
    
As discussed in the lesson, the first argument selects rows and the second selects columns. The `:` means "all rows," and `'new column name'` indicates the column you are modifying (or, in this case, adding).

Make a copy of the `fruit_info` dataframe named `fruit_info_copy` using the `.copy()` method. Then add a column called `rank2` to the `fruit_info_copy` table which contains the same values in the same order as the `rank1` column.

**Hint:** Click [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.copy.html) to read the documentation on `.copy()`.
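The sketch below shows, on a made-up table, why working on a copy matters: a column added to the copy with `.loc` does not appear in the original. The names `pets`, `pets_copy`, and `wings` are hypothetical.

```python
import pandas as pd

# Hypothetical table, used only for illustration.
pets = pd.DataFrame({'name': ['Rex', 'Tom', 'Tweety'],
                     'legs': [4, 4, 2]})

# .copy() returns an independent table.
pets_copy = pets.copy()

# ':' selects all rows; 'wings' names the column being added.
pets_copy.loc[:, 'wings'] = [0, 0, 2]

print('wings' in pets_copy.columns)  # True
print('wings' in pets.columns)       # False
```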

In [None]:
...
fruit_info_copy

In [None]:
grader.check("q2")

**Question 3.**  Use the `.drop()` method to [drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) both the `rank1` and `rank2` columns you created in `fruit_info_copy` (make sure to use the `axis` parameter correctly).

**Note:** By default, `.drop()` does not change the table it is called on; it returns a new table with fewer columns or rows. It modifies the original only if you set the optional `inplace` parameter to `True`.

**Hint:** Look through the documentation to see how you can drop multiple columns of a `pandas` `DataFrame` at once using a list of column names.
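To illustrate the note above, this sketch drops a single column from a made-up table and confirms the original is untouched. The `pets` and `trimmed` names are hypothetical; dropping several columns at once is left for you to find in the documentation.

```python
import pandas as pd

# Hypothetical table, used only for illustration.
pets = pd.DataFrame({'name': ['Rex', 'Tom', 'Tweety'],
                     'legs': [4, 4, 2]})

# axis=1 tells .drop() that 'legs' is a column label, not a row label.
trimmed = pets.drop('legs', axis=1)

print(trimmed.columns.tolist())  # ['name']
print(pets.columns.tolist())     # ['name', 'legs']
```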

In [None]:
fruit_info_original = ...
fruit_info_original

In [None]:
grader.check("q3")

## Level III

**Question 4.** Use the `.rename()` method to [rename](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html) the columns of `fruit_info` so they begin with capital letters. Set this new `DataFrame` to `fruit_info_caps`.
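As a rough illustration of `.rename()` on a made-up table, the `columns` parameter accepts a dictionary that maps old labels to new ones. The `pets` table and the new label `pet_name` are hypothetical.

```python
import pandas as pd

# Hypothetical table, used only for illustration.
pets = pd.DataFrame({'name': ['Rex', 'Tom', 'Tweety'],
                     'legs': [4, 4, 2]})

# Labels not listed in the dictionary are left unchanged.
renamed = pets.rename(columns={'name': 'pet_name'})

print(renamed.columns.tolist())  # ['pet_name', 'legs']
print(pets.columns.tolist())     # ['name', 'legs']
```

Like `.drop()`, `.rename()` returns a new table by default and leaves the original as it was.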

**Hint:** 

In [None]:
fruit_info_caps = ...
fruit_info_caps

In [None]:
grader.check("q4")

## Level IV

### Babyname Dataset

Now that we have reviewed the basics, let's move on to the babynames dataset. The babynames dataset contains a record of the given names of babies born in the United States each year.

First let's run the following cells to build the dataframe `baby_names`. The `baby_names.csv` file contains baby names from North Carolina, the bordering states (GA, SC, TN, and VA), and Kentucky.

In [None]:
baby_names = pd.read_csv('data/baby_names.csv', index_col=0)
len(baby_names)

We can use the `.unique()` method to view the unique values in a column of a `DataFrame`.
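A short sketch on made-up data (the `records` table and its values are hypothetical, not taken from `baby_names.csv`):

```python
import pandas as pd

# Hypothetical records, used only for illustration.
records = pd.DataFrame({'State': ['NC', 'VA', 'NC', 'SC', 'VA'],
                        'Count': [5, 7, 9, 6, 8]})

# .unique() keeps each value once, in order of first appearance.
states = records['State'].unique()

print(list(states))  # ['NC', 'VA', 'SC']
```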

In [None]:
baby_names['State'].unique()

### Slicing `DataFrame`s (Selecting rows and columns)

### Selection Using Label/Index (using `.loc`)

#### Column Selection 

To select a column of a `DataFrame` by column label, the safest way is to use the `.loc` [indexer](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html). General usage of `.loc` looks like 

    df.loc[rowname, colname]
    
Recall that the colon `:` means "everything." For example, if we want the `color` column of a dataframe named `ex`, we would use: `ex.loc[:, 'color']`

- You can also slice across columns. For example, `baby_names.loc[:, 'Name':]` would select the column `Name` and all columns after `Name`.

- **Alternative:** While `.loc` is invaluable when writing production code, it may be a little too verbose for interactive use. One recommended alternative is the `[ ]` operator, which takes the form `df['colname']`.
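The three forms of column selection described above can be sketched on a small made-up table. Here `ex` is a hypothetical dataframe, and its `weight` column is invented for the example.

```python
import pandas as pd

# Hypothetical table, used only for illustration.
ex = pd.DataFrame({'fruit': ['apple', 'banana'],
                   'color': ['red', 'yellow'],
                   'weight': [150, 120]})

color_loc = ex.loc[:, 'color']       # one column, all rows
from_color = ex.loc[:, 'color':]     # 'color' and every column after it
color_brackets = ex['color']         # shorthand for the first selection

print(from_color.columns.tolist())       # ['color', 'weight']
print(color_loc.equals(color_brackets))  # True
```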

#### Row Selection

Similarly, if we want to select a row by its label, we can use `.loc` in the same way. In this case, the "label" of each row refers to its value in the index of the `DataFrame` (which plays a role similar to a primary key).

In [None]:
baby_names.loc[2:5, 'Name']

Notice the difference between the selection in the previous cell and the one in the next cell.

Passing just `'Name'` returns a `Series`, while passing the list `['Name']` returns a `DataFrame`.
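The difference can be checked directly on a made-up table (the `names` table and its values are hypothetical):

```python
import pandas as pd

# Hypothetical table, used only for illustration.
names = pd.DataFrame({'Name': ['Ava', 'Liam', 'Mia'],
                      'Count': [12, 9, 7]})

as_series = names.loc[0:1, 'Name']     # a single label gives a Series
as_frame = names.loc[0:1, ['Name']]    # a list of labels gives a DataFrame

print(type(as_series).__name__)        # Series
print(type(as_frame).__name__)         # DataFrame
print(as_series.shape, as_frame.shape) # (2,) (2, 1)
```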

In [None]:
baby_names.loc[2:5, ['Name']]

**Note:** `.loc` actually uses the `pandas` row index rather than row id/position of rows in the `DataFrame` to perform the selection. Also, notice that if you write `2:5` with `loc[]`, contrary to normal Python slicing functionality, the end index is included, so you get the row with index 5.
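The sketch below contrasts ordinary Python slicing with `.loc` slicing, and then relabels the rows to show that `.loc` follows index labels rather than positions. The `letters`, `default_idx`, and `relabeled` names are made up for this illustration.

```python
import pandas as pd

letters = list('abcdefgh')

# Plain Python slicing excludes the end position: positions 2, 3, 4.
print(letters[2:5])                              # ['c', 'd', 'e']

# With the default index, labels happen to equal positions, but the
# end label is included: labels 2, 3, 4, 5.
default_idx = pd.DataFrame({'letter': letters})
print(default_idx.loc[2:5, 'letter'].tolist())   # ['c', 'd', 'e', 'f']

# Hypothetical relabeled table: row labels run from 10 to 17.
relabeled = pd.DataFrame({'letter': letters}, index=range(10, 18))
print(relabeled.loc[12:15, 'letter'].tolist())   # ['c', 'd', 'e', 'f']
print(len(relabeled.loc[2:5]))                   # 0, no labels between 2 and 5
```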

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

When done exporting, download the .zip file by finding it in the file browser on the left side of the screen, then right-click and select **Download**. You'll submit this .zip file for the assignment in Canvas to Gradescope for grading.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)