In [None]:
import babypandas as bpd
import numpy as np

# What is the difference between `babypandas` and Python?

In short, `babypandas` is a library written in Python. It gives us a small, beginner-friendly set of tools for manipulating DataFrames, and everything we write with it is still Python code.

**Note**: `list` means it is a list variable and `df` means it is a DataFrame.

Some common similarities can be found in the table below:

| **Similarities**       | **Python**                         | **babypandas**                                   |
|---------------------------|------------------------------------|----------------------------------------------|
| Creating Data             | Lists                | `bpd.DataFrame()`, `bpd.Series()`              |
| Indexing                  | `list[index]`         | `.loc[]`, `.iloc[]`                          |
| Adding Data               | `list.append()`| `df.assign(new_column = data)`                      |
| Removing Data             | `list.remove()` | `df.drop(columns=['col'])`      |
| Applying Function         | `func(list)`              | `df.get('col').apply(func)`               |
| Aggregation               | `sum(list)`, `max(list)`, etc.     | `df.get('col').sum()`, `df.get('col').mean()`          |


We will go through the elements in the table above with examples to help explain the differences between `babypandas` and Python.

---

<a id="table"></a>
## Table of Contents:

Click on the links below to quickly navigate to your desired topic.
- [Defined Variables](#variables)
- [Creating Data](#creating)
- [Indexing](#indexing)
- [Adding Data](#adding)
- [Removing Data](#removing)
- [Applying Functions](#apply-funcs)
- [Aggregation](#aggregation)

---

<a id="variables"></a>
## Our Data:

Before we get into any examples, I will establish the variables we will use.

- `pop_estimates` is a DataFrame.
- `states` is a DataFrame indexed by state name.
- `tutors` is a list of strings (the tutors).
- `fav_nums` is a list of my (Zoe's) favorite numbers.

You might recall this dataset from Lab 1. The estimates in the column `"Population"` come from the [International Database](https://www.census.gov/data-tools/demo/idb/#/table?menu=tableViz&quickReports=CUSTOM&CUSTOM_COLS=POP,TFR,CBR,E0,IMR,CDR,NMR&CCODE=**&show_countries=n&CCODE_SINGLE=**&TABLE_RANGE=1950,2023&TABLE_YEARS=1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023&TABLE_USE_RANGE=Y&TABLE_USE_YEARS=N&TABLE_STEP=1), maintained by the US Census Bureau.

In [None]:
pop_estimates = bpd.read_csv("data/world_population_2023.csv")
pop_estimates.iloc[0:5] #Here are the first five rows displayed for your convenience

You might recall this dataset from Homework 3. In the `states` DataFrame, each state's `'Favorite Cereal'` is defined as the cereal, among the top 20 varieties, that has been Google searched a disproportionately high amount in that state.

In [None]:
states = bpd.read_csv('data/states.csv')
states = states.set_index("State")
states.iloc[0:5]

The rest of the variables are made up by me (Zoe)!

In [None]:
tutors = ["Jack", "Ashley", "Jason", "Zoe", "Nick", "Guoxuan"]
tutors

In [None]:
fav_nums = [2, np.e, 14] # e is such a nice number!
fav_nums

---

<a id="creating"></a>
## Creating Data

- To create a DataFrame you can do `bpd.DataFrame()`
    - Refer to documentation for using [babypandas](https://babypandas.readthedocs.io/en/latest/_autosummary/bpd.DataFrame.__init__.html)
- To create a Series you can do `bpd.Series()`
- To create a list you use brackets (`[]`)

In DSC 10 we read DataFrames in, so this is not super relevant to you all, but it is good to know!

In [None]:
# This makes an empty DataFrame... exciting!
bpd.DataFrame()

---

<a id="indexing"></a>
## Indexing

### Lists

Recall that we can index lists using brackets `[]`. Inside the brackets we follow the pattern `[start : stop : step]`, which is also known as slicing!

**Note:** We use the brackets to extract either a single element or several elements at once.
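
As a quick sketch of the `[start : stop : step]` pattern, here is a hypothetical stand-in list called `letters` (it is not one of our variables). The negative positions and the negative step are extras that are not needed for this guide, but they follow the same rules.

```python
# Hypothetical stand-in list, used only for illustration
letters = ["a", "b", "c", "d", "e", "f"]

first = letters[0]             # a single element at position 0
first_three = letters[0:3]     # positions 0, 1, and 2 (stop is exclusive)
every_other = letters[::2]     # start and stop default to the whole list
last_two = letters[-2:]        # negative positions count from the end
reversed_copy = letters[::-1]  # a step of -1 walks through the list backwards

print(first, first_three, every_other, last_two, reversed_copy)
```

Slicing always gives back a new list, so `letters` itself is unchanged after all of these lines.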

In [None]:
# First element in tutors
tutors[0]

In [None]:
# First three elements in tutors
tutors[0:3] # notice that stop is exclusive!

In [None]:
# Every other tutor
print(tutors) # What did tutors look like originally?
tutors[0:len(tutors):2]

In [None]:
# You do not need to specify 0!
tutors[::2]

### Babypandas - `iloc`

Recall we can use `df.iloc[]` to isolate rows **at an integer location**. `iloc` uses `[start : stop]`, and just like with lists, `stop` is exclusive.

In [None]:
# The first element
pop_estimates.iloc[0]

In [None]:
# The first three elements
pop_estimates.iloc[0:3]

In [None]:
# Remember we do not need to put 0 if we don't want to!
pop_estimates.iloc[:3]

`iloc` is **at an integer location**, which means we can still use it when the index is not an integer.

In [None]:
# The first element
states.iloc[0]

We can also give `.iloc` a list of elements if we want specific integer locations.

In [None]:
pop_estimates.iloc[[2, 14, 44]]

### Babypandas - `loc`

Recall we can use `df.loc[]` to isolate rows with **a specific label**. `loc` also uses `[start : stop]`, but in pandas (which `babypandas` is built on) a label slice includes the `stop` label.

In [None]:
# The first element 
pop_estimates.loc[0] #the index's label is 0!

In [None]:
# A refresher on the states DataFrame
states.iloc[0:5]

In [None]:
# We want California
states.loc["California"]

In [None]:
# We want Illinois and Arkansas
states.loc[["Illinois", "Arkansas"]]

---

<a id="adding"></a>
## Adding Data

### Lists - `.append`

This will add an item to the end of the list. It happens in place.

**Note**: When something happens in place it means the object is modified directly. You do not need to re-assign the variable.
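
To make the idea of "in place" concrete, here is a minimal sketch using a hypothetical list called `nums` (not one of our variables). It contrasts `.append`, which changes the list directly, with the built-in `sorted`, which returns a new list.

```python
nums = [3, 1, 2]  # hypothetical list, used only for illustration

# In place: the list itself changes, and the method returns None
result = nums.append(4)
print(nums)     # the 4 is now at the end of nums
print(result)   # None

# Not in place: sorted() builds a new list and leaves nums alone
ordered = sorted(nums)
print(ordered)  # a new, sorted list
print(nums)     # still in its original order

# A common slip: saving the result of an in-place method
oops = nums.append(5)  # nums is updated, but oops is None
```

This is why we write `tutors.append(...)` on its own line instead of `tutors = tutors.append(...)`, which would replace the list with `None`.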

In [None]:
tutors

In [None]:
tutors.append("Baby Panda")

In [None]:
tutors

### Numpy Array - `np.append`

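This will join arrays together and return a new array. Note this is different from a list's version: it does not happen in place, and unless you specify an `axis`, the result is flattened into one dimension!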

You can read more about it [here](https://numpy.org/doc/stable/reference/generated/numpy.append.html).
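
Here is a small sketch of that behavior. The names `original` and `combined` are made up for illustration.

```python
import numpy as np

original = np.array([1, 2, 3])

# With no axis given, np.append flattens both inputs and joins them
combined = np.append(original, [[2, 4, 6], [1, 3, 5]])

print(combined)        # one flat array with 9 values
print(combined.shape)  # (9,)
print(original)        # unchanged, so save the result if you need it
```

Since `np.append` returns a new array, you need to store the result in a variable, just like with `.assign` on a DataFrame.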

In [None]:
# You first give it an array, then you give it the values you want to add to the original array!
np.append([1, 2, 3], [[2, 4, 6],[1, 3, 5]])

### DataFrames - `.assign`

This will add a new column to the DataFrame. It does not happen in place.

In [None]:
pop_estimates.assign(Population_Dupe = pop_estimates.get("Population")).iloc[0:5]

In [None]:
pop_estimates.iloc[0:5] #Notice it was not changed!

In [None]:
# This means we need a variable or need to re-assign the variable to contain the updated information
temp = pop_estimates.assign(Population_Dupe = pop_estimates.get("Population"))
temp.iloc[0:5]

---

<a id="removing"></a>
## Removing Data

### List - `.remove`

This is not necessary for our class. I am pointing it out so you do not try to `.drop` from a list or dictionary!

`.remove` happens in place.

In [None]:
tutors

In [None]:
tutors.remove("Baby Panda")

In [None]:
tutors

### DataFrame - `.drop`

This will remove the column we specify to drop. It does not happen in place.

In [None]:
temp.drop(columns = "Population_Dupe").iloc[0:5]

In [None]:
# A list of column names also works, which lets you drop multiple columns at once
temp.drop(columns = ["Population_Dupe"]).iloc[0:5]

In [None]:
# Notice once again that we did not update temp, so the column was not dropped
temp.iloc[0:5]

In [None]:
temp = temp.drop(columns = "Population_Dupe")
temp.iloc[0:5]

---

<a id="apply-funcs"></a>
## Applying Functions

For this part I have made some functions below.

In [None]:
# This function determines if a year was from the 20th or 21st century

def determine_century(year):
    if 1900 <= year <= 1999:
        return "20th Century"
    elif year >= 2000:
        return "21st Century"

In [None]:
# This function takes a list of numbers and returns a list of Booleans, one per number: True if the number is greater than 5, False otherwise

def bigger_five(nums):
    output = []
    for num in nums:
        if num > 5:
            output.append(True)
        else:
            output.append(False)
    return output

### Lists - The function goes around your value!

In [None]:
# Recall the variable fav_nums
fav_nums

In [None]:
bigger_five(fav_nums)

### Babypandas - `.apply`

This is a method that will apply your function to each element of a **Series**, one element at a time. This will not work on a DataFrame!
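
One way to picture what `.apply` does is with a plain Python list. The sketch below is only an analogy, not how `babypandas` implements `.apply`, and the list `years` is a hypothetical sample. The function is called once per element and the results are collected in order.

```python
def determine_century(year):
    if 1900 <= year <= 1999:
        return "20th Century"
    elif year >= 2000:
        return "21st Century"

years = [1950, 1999, 2000, 2023]  # hypothetical sample of years

# Call the function once for each year and collect the results
centuries = [determine_century(year) for year in years]
print(centuries)
```

Notice that `determine_century` takes a single year, not a whole list. That is the kind of function `.apply` expects, while `bigger_five` takes an entire list at once.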

In [None]:
pop_estimates.get("Year").apply(determine_century) # You get a series back

In [None]:
pop_estimates = pop_estimates.assign(Century = pop_estimates.get("Year").apply(determine_century))
pop_estimates.iloc[[0, -1]]

---

<a id="aggregation"></a>
## Aggregation

A **function** is a block of reusable code that performs a specific task. It can take in inputs, perform operations, and return an output. Functions are defined with `def` and are not tied to any specific object. I (Zoe) like to think of a function as something that hugs its inputs (arguments): it surrounds the thing we want to transform.

A **method** also performs a specific task, but it is associated with an object. Methods are defined within a class and are called on objects. I (Zoe) like to think of a method as something that follows an object: it always comes after a variable and a dot (`.`).

This might get a bit technical, but here is a table of the differences:

| Function                                | Method                                           |
|-----------------------------------------|--------------------------------------------------|
| Independent and not associated with any object | Associated with an object (instance methods) or a class (class methods) |
| Called by its name directly             | Called on an object or class                     |
| Parameters are user-defined             | First parameter is `self` (for instance methods) or `cls` (for class methods) |
| Defined using the `def` keyword outside of a class | Defined inside a class                           |
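
As a rough sketch of the table above, here is the same task written once as a function and once as a method. The names `double` and `Counter` are hypothetical and only for illustration.

```python
# A function: defined on its own and called by its name
def double(x):
    return 2 * x

# A method: defined inside a class, with self as its first parameter
class Counter:
    def __init__(self, start):
        self.count = start

    def double(self):
        return 2 * self.count

c = Counter(7)

function_result = double(7)  # the function "hugs" its input
method_result = c.double()   # the method "follows" the object after a dot

print(function_result, method_result)
```

When we write `c.double()`, Python passes `c` in as `self` for us, which is why the call has no arguments inside the parentheses.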


### Lists - Functions (`min` and `max`)

In [None]:
# If I want the minimum of a list I use a function
min(fav_nums)

In [None]:
# If I want the maximum of a list I use a function
max(fav_nums)

### Babypandas - Methods (`.min` and `.max`)

In [None]:
# If I want the minimum of a Series I use a method
pop_estimates.get("Population").min()

In [None]:
# If I want the maximum of a Series I use a method
pop_estimates.get("Population").max()

### Lists - Functions (`np.mean`)

In [None]:
np.mean(fav_nums)

### Babypandas - Methods (`.mean`)

In [None]:
pop_estimates.get("Population").mean()

### Lists - Functions (`sum`)

In [None]:
sum(fav_nums)

### Babypandas - Methods (`.sum`)

In [None]:
pop_estimates.get("Population").sum()

---

# The End!

As you can see there are differences between normal Python (the coding language) and `babypandas`. I hope you can refer to this as a guide to help you avoid making silly mistakes. If you have questions please post them on Ed. Thank you!

[Back to Table of Contents](#table)