# **BEACO$_{2}$N Notebook 2a: Introduction to Pandas**


### Learning Outcomes
Working through this notebook, you will learn about:
  1. The `DataFrame` and `Series` data structures of the pandas library
  1. Importing CSV data into a pandas `DataFrame`
  1. Accessing and manipulating data within a `DataFrame` and `Series`



## Table of Contents

1. Welcome to Pandas  

2. Pandas Structure  
> 2.1 Series  <br>
> 2.2 DataFrames  <br>

3. Importing to DataFrames  

4. Viewing DataFrames

*Note: In this notebook, there are some more advanced topics that are "optional". This means you can just read over these sections; don't worry about fully understanding these parts unless you are really interested. They may be useful later in the course, but for now they are not necessary, so feel free to just skim the parts labelled "Optional"!*






<hr style="border: 2px solid #003262">
<hr style="border: 2px solid #C9B676">

## 1. Welcome to Pandas
[Pandas](http://pandas.pydata.org/)  is a Python tool that helps you work with and analyze data organized in columns, kind of like a spreadsheet. It’s widely used in data science and is supported by many machine learning tools. While pandas has a lot of features, its basic ideas are easy to learn, and we’ll cover them here. For a more complete reference, the [pandas docs site](http://pandas.pydata.org/pandas-docs/stable/index.html) contains extensive documentation and many tutorials.



**Reminder:** every time we see a code cell, we should run it to see what it outputs. It's good practice to also run text cells; that way you're in the habit of running everything as you work down the notebook. To run a cell:


*   Click the **Play icon** in the left gutter of the cell;
*   Type **Shift+Enter** or **Shift+Return** to run the cell and move focus to the next cell (a new cell is added if none exists)

The line `import pandas as pd` imports the pandas library and gives it the alias pd, which is a common convention in Python. The line `pd.__version__` prints the version number of the pandas library.<br>**Run the cell below!** It will print out the version number below the cell.


In [None]:
from __future__ import print_function

import pandas as pd
pd.__version__

***
## 2. Pandas Structure

The primary data structures in pandas are implemented as two classes:

  * **`DataFrame`**, which you can imagine as a data table, with rows and named columns.
  * **`Series`**, which is a single column. A `DataFrame` contains one or more `Series` and a name for each `Series`. Series have similar properties and look similar to *lists* (covered in *Intro to Colab* notebook).


### 2.1 Series
One way to create a `Series` is to construct a `Series` object directly. This is done by calling `pd.Series(...)`, the *Series* constructor from the pandas package (which we imported as `pd`).

For example, run the code cell below.

In [None]:
print(pd.Series(['San Francisco', 'San Jose', 'Sacramento']))

San Francisco, San Jose, and Sacramento appear as the values of a column! The above is a *Series*, and *Series* are what make up each column of a *DataFrame*.
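
To make the comparison with lists concrete, here is a minimal sketch of a few ways to inspect a `Series`. The variable name `cities` is just an illustrative choice, not something used elsewhere in this notebook.

```python
import pandas as pd

# Build a Series from a plain Python list
cities = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])

print(len(cities))          # number of values, just like len() on a list
print(cities.iloc[0])       # the first value, selected by position
print(list(cities.index))   # the automatic row labels: [0, 1, 2]
print(cities.tolist())      # convert the Series back to a plain list
```

The numbers printed on the left when you display a `Series` are these row labels (the index), not part of the data itself.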

Let's move on to a slightly more complex use case. In the above example, we input `['San Francisco', 'San Jose', 'Sacramento']` in the parentheses of the `pd.Series(___)` function to create a series of city names.
1. Let's do that again below, but this time, let's save this series by giving it a name! We need to assign it to a *variable* (review *Intro to Colab* notebook if you don't remember the term variables). Let's call the variable `city_names`. (See cell below)

2. Next, let's make a series of population sizes of each of those cities. We can make this series with the same format (putting the values in a comma-separated list between brackets). We've assigned this series to the variable `population`. Run the cell below to see what the series, `population`, looks like.


In [None]:
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])
print(population)

### 2.2 DataFrames
**Now we have two series! Let's put them into a DataFrame!**

For a `DataFrame`, we need the name of the series (so your code knows what values the column should be made up of) *and* a name for the column. Similarly to how we made a series with `pd.Series`, we make DataFrames with the `pd.DataFrame` function. `DataFrame` objects can be created by putting in something called a [dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) in between the parentheses. To understand what a Python dictionary is, think of a real life dictionary: a book full of words and their corresponding definitions. <br>The idea here is the same: a comma-separated list where each component of the list has a name (also called "key") and a corresponding value (which in our case will be the name of the series).
>Format: pd.DataFrame({<mark style="background-color: red;">**_"column_name1": column1_values_**</mark>, <mark style="background-color: yellow;">**"column_name2": column2_values**</mark>, ...})



In the cell below, we have the column name "City name" with values of the series `city_names` and the column name "Population" with the values of the series `population`. (Remember we assigned both series to these variable names in the cells above).<br>
**Run this cell to see the DataFrame we made!**

In [None]:
pd.DataFrame({'City name': city_names, 'Population': population})

*Reminder:* We put the names of the columns `"City name"` and `"Population"` in quotation marks because they are *strings*. If we tried the code statement above without quotation marks, we would get an error: `Population` alone would raise a `NameError` because Python would look for a variable we haven't made, and `City name` (two words) would not even be valid Python. Note: There's no difference between single quotes ' ' and double quotes " ". They just need to match, so a string can be `"hello"` or `'hello'`, but not `"hello'`.
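
As a quick check that a `DataFrame` really is built from `Series`, the sketch below rebuilds the same table and looks at one of its columns. The variable name `cities` is an illustrative assumption; the notebook cell above does not assign the DataFrame to a variable.

```python
import pandas as pd

city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])

# Save the DataFrame in a variable so we can inspect it
cities = pd.DataFrame({'City name': city_names, 'Population': population})

print(cities.shape)                # (number of rows, number of columns)
print(list(cities.columns))        # the column names we chose
print(type(cities['Population']))  # a single column is a pandas Series
```

Selecting a column with square brackets and its quoted name, as in `cities['Population']`, gives back the `Series` that the column was made from.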

*The code cell below is optional/advanced. Skip to "Your turn!" if you're not interested in this section.*

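You can also create `DataFrame` objects by specifying the rows. Each row is written as a group of values in parentheses, and the `columns` argument gives the column names in the same order.

Note: we have to wrap our rows in brackets `[]` to be able to pass them into the `data` argument of the `pd.DataFrame` function! For example: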

In [None]:
pd.DataFrame(data = [('San Francisco', 852469), ('San Jose', 1015785), ('Sacramento', 485199)],
             columns = ['City name', 'Population'])
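
The two ways of building a DataFrame (by columns and by rows) should give the same table. The sketch below compares them; the variable names `by_columns` and `by_rows` are illustrative assumptions.

```python
import pandas as pd

# Column-by-column: a dictionary of column name -> values
by_columns = pd.DataFrame({
    'City name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

# Row-by-row: a list of rows, plus the column names
by_rows = pd.DataFrame(
    data=[('San Francisco', 852469), ('San Jose', 1015785), ('Sacramento', 485199)],
    columns=['City name', 'Population'],
)

# .equals checks that both tables hold the same values in the same layout
print(by_columns.equals(by_rows))
```

Which style to use mostly depends on how your data is already organized: by column or by record.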

***

## 3. Importing to DataFrames

More often than creating our own DataFrames from scratch, we will load an entire file into a `DataFrame`.

Let's import a file and save it as a DataFrame in the cell below. **The data below is about Bay Area fine particulate matter concentrations from August 2021** (a wildfire period). The link we are pulling the data from is saved in the `pm_data` variable. We use the `pm_data` variable inside of the function **`pd.read_csv`** to be able to load the data and create a `DataFrame`. The arguments `on_bad_lines = 'skip'`, `index_col = [0]`, and `parse_dates = [0]` help us format the DataFrame: they skip any malformed rows, use the first column as the row labels (the index), and read that first column as dates. We assign the DataFrame to the variable `wildfire_pm`, which stands for **"wildfire particulate matter concentrations"**. Run the cell below and notice what we're doing, as you'll do it solo later on!

In [None]:
# Holds the link we are pulling data from
pm_data = 'https://github.com/wintera71/BEACO2N-Modules/raw/refs/heads/main/Lesson%202:%20Introduction%20to%20Pandas/CSVs/pm_data.csv'

# Using the pm_data variable, we load the data into a DataFrame
wildfire_pm = pd.read_csv(pm_data, on_bad_lines='skip', index_col = [0], parse_dates=[0])
wildfire_pm
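
To see what the three extra arguments do without downloading anything, here is a small sketch that reads CSV text from memory instead of a link. The CSV contents, the column name `timestamp`, and the variable names `csv_text` and `sample_pm` are all made up for illustration and are not the real dataset. It assumes pandas 1.3 or newer, where `on_bad_lines` is available.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a real file (values are invented).
# The third data row has an extra field, so it counts as a "bad line".
csv_text = """timestamp,pm
2021-08-01 00:00:00,12.5
2021-08-01 01:00:00,14.1
2021-08-01 02:00:00,13.0,extra_field
2021-08-01 03:00:00,20.3
"""

sample_pm = pd.read_csv(
    io.StringIO(csv_text),   # behaves like a file that contains csv_text
    on_bad_lines='skip',     # drop rows that have too many fields
    index_col=[0],           # use the first column as the row labels
    parse_dates=[0],         # read the first column as dates, not text
)

print(sample_pm)
print(type(sample_pm.index))
```

The malformed row is left out of the result, and the row labels are dates rather than the default 0, 1, 2, ...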

Let's now make sure we understand our data. You can use the `.columns` attribute (note: no parentheses) to list the column names of the `wildfire_pm` DataFrame we just created.

In [None]:
wildfire_pm.columns

*It is always important to know what our data represents!* This DataFrame only has one column:
* pm: Fine particulate matter (PM$_{2.5}$) concentration in $\mu$g/m$^3$
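
One detail worth seeing in action: `.columns` is written without parentheses, because it is a stored property of the DataFrame rather than something you call. The sketch below uses a tiny made-up DataFrame named `sample` (an illustrative assumption, not the real data).

```python
import pandas as pd

# A tiny made-up DataFrame with a single column
sample = pd.DataFrame({'pm': [12.5, 14.1]})

print(sample.columns)         # a pandas Index holding the column names
print(list(sample.columns))   # the same names as a plain list: ['pm']

# Adding parentheses is a common mistake and raises a TypeError
try:
    sample.columns()
except TypeError as err:
    print('TypeError:', err)
```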

***
## 4. Viewing DataFrames

A useful method is **`DataFrame.head()`**, which by default displays the first 5 rows of a DataFrame. This is often one of the first commands you run after loading a dataset, since it provides a quick look at the structure of your data: the column names, the index, and the first few rows. You can also pass an integer argument to `.head()` (e.g., `df.head(10)`) to view a different number of rows. This makes it especially handy for inspecting large datasets without printing the entire table to the screen.


In [None]:
wildfire_pm.head()

For the `.head(n)` function, you can input any number `n` to display the first `n` rows. Try it out for yourself!

In [None]:
wildfire_pm.head(...) # Replace ... with any number

The opposite function of `.head` is `DataFrame.tail()`, which displays the **last** 5 records of a `DataFrame` by default:

In [None]:
wildfire_pm.tail()

Like `.head(n)`, `.tail(n)` accepts any number `n` and displays the last `n` rows. The cell below shows the last 3 rows; try changing the `3` to another number. For example, `wildfire_pm.tail(1)` would show the very last row of the `wildfire_pm` DataFrame.

In [None]:
wildfire_pm.tail(3)
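
The behavior of `.head()` and `.tail()` can be checked on a small table where you can count the rows yourself. The DataFrame `readings` and its values below are made up for illustration.

```python
import pandas as pd

# Seven made-up readings, so the default of 5 rows is easy to notice
readings = pd.DataFrame({'pm': [10.0, 12.5, 11.0, 30.2, 45.8, 40.1, 22.3]})

print(len(readings.head()))      # 5 rows by default
print(len(readings.tail(3)))     # 3 rows when we ask for 3
print(readings.tail(1))          # only the very last row
print(len(readings.head(100)))   # asking for more rows than exist returns all 7
```

Neither method changes the original DataFrame; each one returns a smaller copy for viewing.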

***
#### You've finished the **Introduction to Pandas *In Class* notebook** and are ready to begin the **Introduction to Pandas *Student Exploration* notebook**! Good job!

***