# EDS 217 Day 4 Notes 
## 2023-09-11

### Day 4 (kinda day 5 though) Notes 


used curl to get new version of 4-1_pandas.ipynb:

`curl https://raw.githubusercontent.com/environmental-data-science/eds217_2023/main/interactive_sessions/4-1_pandas.ipynb > 4-1_pandas.ipynb`

changed working directory to lectures, used curl to get new version of 03-debugging.ipynb:

`curl https://raw.githubusercontent.com/environmental-data-science/eds217_2023/main/lectures/03-debugging.ipynb > 03-debugging.ipynb`

used:
`ls -l 03-debugging.ipynb`
to check how recently the file had been updated etc.

In [1]:
# the Zen of python
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


## Exceptions

In [2]:
try:
    print( 0 / 0)
except Exception as e:
    print(f"It didn't work because {e}")

It didn't work because division by zero


When Python raises anything other than a `SyntaxError`, you should read this as `You are asking me to do something I can't do`

### Types of Exceptions

Python has a lot of built-in exceptions that are defined as part of the Python language itself.


A few common Exceptions you will see include `TypeError`, `IndexError`, and `KeyError`.

### `TypeError`


A `TypeError` is raised when you try to perform a valid method on an inappropriate data type. 
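
As a minimal sketch (the values are made up for illustration), adding a string to an integer is a valid operation applied to an inappropriate mix of types:

```python
# TypeError example: "+" works for two numbers or two strings, but not a mix
try:
    total = "5" + 5
except TypeError as e:
    message = f"It didn't work because {e}"
    print(message)

# Converting the string to a number first makes the operation valid
total = int("5") + 5
print(total)
```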

### `IndexError`


An `IndexError` is raised when you try to access an undefined element of a sequence. Sequences are structured data types whose elements are stored in a specific order. A **list** is an example of a sequence.
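
A small sketch of this, using a made-up list called `temperatures`:

```python
# IndexError example: a list with three elements has valid indices 0, 1, 2
temperatures = [25.8, 16.2, 17.9]

try:
    temperatures[3]
except IndexError as e:
    message = f"It didn't work because {e}"
    print(message)

# The last element lives at index len(temperatures) - 1, or simply -1
last = temperatures[-1]
print(last)
```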

### `KeyError`

A `KeyError` is raised when you try to access a key that does not exist in a dictionary (or, in pandas, a label that does not exist in an index or set of columns). 

In [None]:
# KeyError Examples:

my_dict = {'column_1': 'definition 1', 'another_word': 'a second definition'}
my_dict['column12']

### Deciphering Tracebacks 
When an exception is raised in Python, the interpreter generates a "Traceback" that shows **where** and **why** the error occurred. Generally, the REPL has the most detailed Traceback information, although Jupyter Notebooks and IPython interactive shells also provide the information needed to debug any exception. 

In [None]:
# defining a function
def multiply(num1, num2):
    result = num1 * num2
    print(results)
 
# calling the function
multiply(10, 2)

NameError: name 'results' is not defined
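
Tracebacks are usually easiest to read from the bottom up: the last line names the exception, and the lines above it show where it happened. As a rough sketch, the standard-library `traceback` module can capture that same text as a string so it can be inspected or logged (the `try`/`except` wrapper is an addition here, not part of the lecture example):

```python
import traceback

def multiply(num1, num2):
    result = num1 * num2
    print(results)  # deliberate typo: 'results' was never defined

try:
    multiply(10, 2)
except NameError:
    # format_exc() returns the traceback text the interpreter would print
    tb_text = traceback.format_exc()
    print(tb_text)
```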

## Pandas

In [3]:
import pandas as pd

2 fundamental objects in this library:
- Series
- DataFrame

also work w Datetime objects

### Creating `Series` and `DataFrame` objects

In [5]:
series = pd.Series([25.8, 16.2, 17.9, 18.8, 23.6, 29.9, 23.6, 22.1])

series

0    25.8
1    16.2
2    17.9
3    18.8
4    23.6
5    29.9
6    23.6
7    22.1
dtype: float64

if you wrote '2.4' (a string) for one of the values, the whole Series would get the generic "object" dtype -- pandas won't coerce the string into a number on its own, so the column is no longer treated as numeric
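
A quick sketch of what a stray string does to the dtype, and one way to recover numbers with `pd.to_numeric` (the Series values here are illustrative):

```python
import pandas as pd

# One string among the floats makes pandas fall back to the object dtype
mixed = pd.Series([25.8, '2.4', 17.9])
print(mixed.dtype)

# pd.to_numeric converts number-like strings; errors='coerce' turns anything
# that cannot be parsed into NaN instead of raising an error
cleaned = pd.to_numeric(mixed, errors='coerce')
print(cleaned.dtype)
```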

list of lists:

the `columns` keyword argument of the constructor lets us specify column names (passed at the end of the call)

In [10]:
# Create a df from a list of lists
df = pd.DataFrame([[25.8, 28.1, 16.2, 11.0],[17.9, 14.2, 18.8, 28.0],
                   [23.6, 18.4, 29.9, 27.8],[23.6, 36.2, 22.1, 14.5]],
                 columns=['A','B','C','D'])
df

Unnamed: 0,A,B,C,D
0,25.8,28.1,16.2,11.0
1,17.9,14.2,18.8,28.0
2,23.6,18.4,29.9,27.8
3,23.6,36.2,22.1,14.5


creating a DataFrame from a `dict` object:
- manually entering each column, note the rotation: each list is now a column, not a row like in the list of lists

In [14]:
# Create a df from a dictionary
dict_df = pd.DataFrame({
    'A': [25.8, 17.9, 23.6, 23.6],
    'B': [28.1, 14.2, 18.4, 36.2],
    'C': [16.2, 18.8, 29.9, 22.1],
    'D': [11.0, 28.0, 27.8, 14.5]
    })

dict_df

Unnamed: 0,A,B,C,D
0,25.8,28.1,16.2,11.0
1,17.9,14.2,18.8,28.0
2,23.6,18.4,29.9,27.8
3,23.6,36.2,22.1,14.5



Using this method, each `key` corresponds to a column name, and each `value` is a column.
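
To check that the two constructors really describe the same table, here is a small sketch with a cut-down 2x2 version of the data (the names `rows`, `cols`, `from_rows` and `from_cols` are just for illustration):

```python
import pandas as pd

rows = [[25.8, 28.1], [17.9, 14.2]]            # each inner list is a row
cols = {'A': [25.8, 17.9], 'B': [28.1, 14.2]}  # each list is a column

from_rows = pd.DataFrame(rows, columns=['A', 'B'])
from_cols = pd.DataFrame(cols)

# Both constructors produce the same table
same = from_rows.equals(from_cols)
print(same)
```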

### Importing Data

`pd.read_csv()` function requires AT LEAST a path or url to a csv file
- by default, treats the first row as the column names and the data as starting on the second row (change with the `header` parameter)
- data separated by commas unless you specify otherwise
    - using the `sep` parameter, i.e. `sep='\t'` = tab delimiter
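
A minimal sketch of the `sep` parameter, using an in-memory string as a stand-in for a tab-delimited file (the `site` / `temp_C` data is hypothetical):

```python
import io
import pandas as pd

# Stand-in for a tab-delimited file on disk
tsv_text = "site\ttemp_C\nA\t25.8\nB\t16.2\n"

# sep='\t' tells read_csv to split on tabs instead of commas;
# the first row is used as the column names by default
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep='\t')
print(tsv_df)
```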


In this session, we'll be importing a CSV file containing radiation data for October 2019 from a Baseline Surface Radiation Network (BSRN) station in Southern Africa. [BSRN](https://bsrn.awi.de) is a Global Energy and Water Cycle Experiment project aimed at monitoring changes in the Earth's surface radiation field. The network is comprised of 64 stations across various climate zones across the globe, whose data are used as the global baseline for surface radiation by the Global Climate Observing System.

The CSV file is located in the [data](../data) folder on the course GitHub repository. The files should already be in your private repo. 

While the file may not display properly in VSCode, the first 10 lines of the file should look like:

```
DATE,H_m,SWD_Wm2,STD_SWD,DIR_Wm2,STD_DIR,DIF_Wm2,STD_DIF,LWD_Wm2,STD_LWD,SWU_Wm2,LWU_Wm2,T_degC,RH,P_hPa
2019-10-01 00:00:00,2,-3,0,0,0,-3,0,300,0.1,0,383,16.2,30.7,966
2019-10-01 00:01:00,2,-3,0,0,0,-3,0,300,0.3,0,383,16.4,30.7,966
2019-10-01 00:02:00,2,-3,0,0,0,-3,0,300,0.2,0,383,16.5,30.5,966
2019-10-01 00:03:00,2,-3,0,0,0,-3,0,300,0.1,0,383,16.5,30.4,966
2019-10-01 00:04:00,2,-3,0,0,0,-3,0,300,0.1,0,383,16.8,30.5,966
2019-10-01 00:05:00,2,-2,0,0,0,-2,0,300,0.2,0,383,16.9,30.5,966
2019-10-01 00:06:00,2,-2,0,0,0,-2,0,300,0.2,0,383,16.8,30.4,966
2019-10-01 00:07:00,2,-2,0,0,0,-2,0,300,0.1,0,384,17,31,966
2019-10-01 00:08:00,2,-2,0,0,0,-2,0,300,0.2,0,384,16.7,30.6,966
```

The first line of the file contains the names of the columns, which are described in the table below.

| Column name | Description |
| :---------- | :---------- |
| **DATE**    | Date/Time |
| **H_m**     | Height of measurement $(\text{m})$ |
| **SWD_Wm2** | Incoming shortwave radiation $(\text{W m}^{-2})$|
| **STD_SWD** | Standard deviation of incoming shortwave radiation $(\text{W m}^{-2})$ |
| **DIR_Wm2** | Direct radiation $(\text{W m}^{-2})$ |
| **STD_DIR** | Standard deviation of direct radiation $(\text{W m}^{-2})$ |
| **DIF_Wm2** | Diffuse radiation $(\text{W m}^{-2})$ |
| **STD_DIF** | Standard deviation of diffuse radiation $(\text{W m}^{-2})$ |
| **LWD_Wm2** | Incoming longwave radiation $(\text{W m}^{-2})$ |
| **STD_LWD** | Standard deviation of incoming longwave radiation $(\text{W m}^{-2})$ |
| **SWU_Wm2** | Outgoing shortwave radiation $(\text{W m}^{-2})$ |
| **LWU_Wm2** | Outgoing longwave radiation $(\text{W m}^{-2})$ |
| **T_degC**  | Air temperature $(^{\circ}\text{C})$ |
| **RH**      | Relative humidity $(\%)$ |
| **P_hPa**   | Air pressure $(\text{hPa})$ |


We can import the data into pandas using the following syntax:

```python
bsrn = pd.read_csv('../data/BSRN_GOB_2019-10.csv')
```

<div class="example">
    ✏️ <b> Try it. </b> 
    Copy and paste the code above to import the data in the CSV file into a pandas <code>DataFrame</code> named <code>bsrn</code>.
</div>

In [15]:
bsrn = pd.read_csv('../data/BSRN_GOB_2019-10.csv')

In [16]:
bsrn

Unnamed: 0,DATE,H_m,SWD_Wm2,STD_SWD,DIR_Wm2,STD_DIR,DIF_Wm2,STD_DIF,LWD_Wm2,STD_LWD,SWU_Wm2,LWU_Wm2,T_degC,RH,P_hPa
0,2019-10-01 00:00:00,2,-3.0,0.0,0.0,0.0,-3.0,0.0,300.0,0.1,0,383,16.2,30.7,966
1,2019-10-01 00:01:00,2,-3.0,0.0,0.0,0.0,-3.0,0.0,300.0,0.3,0,383,16.4,30.7,966
2,2019-10-01 00:02:00,2,-3.0,0.0,0.0,0.0,-3.0,0.0,300.0,0.2,0,383,16.5,30.5,966
3,2019-10-01 00:03:00,2,-3.0,0.0,0.0,0.0,-3.0,0.0,300.0,0.1,0,383,16.5,30.4,966
4,2019-10-01 00:04:00,2,-3.0,0.0,0.0,0.0,-3.0,0.0,300.0,0.1,0,383,16.8,30.5,966
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44635,2019-10-31 23:55:00,2,-2.0,0.0,0.0,0.0,-2.0,0.0,380.0,0.1,0,423,23.0,35.6,964
44636,2019-10-31 23:56:00,2,-2.0,0.0,0.0,0.0,-2.0,0.0,380.0,0.1,0,423,23.1,35.5,964
44637,2019-10-31 23:57:00,2,-2.0,0.0,0.0,0.0,-2.0,0.0,380.0,0.1,0,423,23.0,35.3,964
44638,2019-10-31 23:58:00,2,-2.0,0.0,0.0,0.0,-2.0,0.0,381.0,0.2,0,423,23.0,35.2,964


- can use any csv file locally w tab completion

- this one is radiation data
- U and D are for upward and downward radiation directions

In [19]:
bsrn.describe()

Unnamed: 0,H_m,SWD_Wm2,STD_SWD,DIR_Wm2,STD_DIR,DIF_Wm2,STD_DIF,LWD_Wm2,STD_LWD,SWU_Wm2,LWU_Wm2,T_degC,RH,P_hPa
count,44640.0,44630.0,44637.0,44623.0,44623.0,44632.0,44632.0,44589.0,44637.0,44640.0,44640.0,44640.0,44640.0,44640.0
mean,2.0,318.046516,3.269498,348.581987,3.916621,65.294542,0.349209,342.350692,0.224545,110.445004,455.054032,22.525101,39.586891,965.043302
std,0.0,401.239735,19.032068,412.247947,21.608338,92.513191,1.29099,36.968507,0.354692,134.875619,79.024957,7.34031,24.702667,1.637144
min,2.0,-8.0,0.0,-1.0,0.0,-9.0,0.0,266.0,0.0,-2.0,338.0,9.2,5.2,960.0
25%,2.0,-2.0,0.0,0.0,0.0,-2.0,0.0,313.0,0.1,0.0,388.0,16.2,18.0,964.0
50%,2.0,27.0,0.3,0.0,0.0,19.0,0.1,340.0,0.1,11.0,432.0,22.4,33.1,965.0
75%,2.0,694.0,1.0,813.0,0.9,113.0,0.2,368.0,0.3,245.0,522.0,28.4,58.9,966.0
max,2.0,1383.0,337.8,1066.0,383.1,659.0,51.5,456.0,18.6,454.0,657.0,43.5,94.8,969.0


In [18]:
type(bsrn)

pandas.core.frame.DataFrame

Both `df.head()` and `df.tail()` can also accept an integer argument, e.g. `df.head(n)`, where the first `n` rows will be printed.

<div class="example">
    ✏️ <b> Try it. </b> 
    Print the first and last 10 rows of <code>bsrn</code> using <code>df.head()</code> and <code>df.tail()</code>.
</div>

In [20]:
bsrn.head(10)
bsrn.tail(5)

Unnamed: 0,DATE,H_m,SWD_Wm2,STD_SWD,DIR_Wm2,STD_DIR,DIF_Wm2,STD_DIF,LWD_Wm2,STD_LWD,SWU_Wm2,LWU_Wm2,T_degC,RH,P_hPa
44635,2019-10-31 23:55:00,2,-2.0,0.0,0.0,0.0,-2.0,0.0,380.0,0.1,0,423,23.0,35.6,964
44636,2019-10-31 23:56:00,2,-2.0,0.0,0.0,0.0,-2.0,0.0,380.0,0.1,0,423,23.1,35.5,964
44637,2019-10-31 23:57:00,2,-2.0,0.0,0.0,0.0,-2.0,0.0,380.0,0.1,0,423,23.0,35.3,964
44638,2019-10-31 23:58:00,2,-2.0,0.0,0.0,0.0,-2.0,0.0,381.0,0.2,0,423,23.0,35.2,964
44639,2019-10-31 23:59:00,2,-2.0,0.0,0.0,0.0,-2.0,0.0,381.0,0.1,0,423,23.1,35.0,964


In [21]:
bsrn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44640 entries, 0 to 44639
Data columns (total 15 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   DATE     44640 non-null  object 
 1   H_m      44640 non-null  int64  
 2   SWD_Wm2  44630 non-null  float64
 3   STD_SWD  44637 non-null  float64
 4   DIR_Wm2  44623 non-null  float64
 5   STD_DIR  44623 non-null  float64
 6   DIF_Wm2  44632 non-null  float64
 7   STD_DIF  44632 non-null  float64
 8   LWD_Wm2  44589 non-null  float64
 9   STD_LWD  44637 non-null  float64
 10  SWU_Wm2  44640 non-null  int64  
 11  LWU_Wm2  44640 non-null  int64  
 12  T_degC   44640 non-null  float64
 13  RH       44640 non-null  float64
 14  P_hPa    44640 non-null  int64  
dtypes: float64(10), int64(4), object(1)
memory usage: 5.1+ MB


`df.info()` provides several different pieces of info about the DataFrame that are sometimes useful to retrieve separately

ex.: `df.index` and `df.columns` (these are attributes, so no parentheses)

- `df.index` returns the index as an iterable object for use in plotting
- `df.columns` returns column names as an index object which can be used in a for loop or to reset the column names

The `df.info()` method provides several different pieces of information about the DataFrame that are sometimes useful to retrieve separately. For example, the `df.index` attribute returns the index as an iterable object for use in plotting and the `df.columns` attribute returns the column names as an index object which can be used in a `for` loop or to reset the column names. These and other descriptive DataFrame methods and attributes are summarized in the table below.


| Method | Description |
| :----- | :---------- |
| <span style="font-family: Lucida Console, Courier, monospace; font-weight: bold"> df.info() </span> | Prints a concise summary of the DataFrame |
| <span style="font-family: Lucida Console, Courier, monospace; font-weight: bold"> df.head(<i>n</i>) </span> | Returns the first *n* rows of the DataFrame |
| <span style="font-family: Lucida Console, Courier, monospace; font-weight: bold"> df.tail(<i>n</i>) </span> | Returns the last *n* rows of the DataFrame |
| <span style="font-family: Lucida Console, Courier, monospace; font-weight: bold"> df.index </span> | Returns the index range (number of rows) |
| <span style="font-family: Lucida Console, Courier, monospace; font-weight: bold"> df.columns </span> | Returns the column names |
| <span style="font-family: Lucida Console, Courier, monospace; font-weight: bold"> df.dtypes </span> | Returns a Series with the data types of each column indexed by column name |
| <span style="font-family: Lucida Console, Courier, monospace; font-weight: bold"> df.size </span> | Returns the total number of values in the DataFrame as an `int` |
| <span style="font-family: Lucida Console, Courier, monospace; font-weight: bold"> df.shape </span> | Returns the shape of the DataFrame as a tuple (*rows*,*columns*) |
| <span style="font-family: Lucida Console, Courier, monospace; font-weight: bold"> df.values </span> | Returns the DataFrame values as a NumPy array (not recommended) |
| <span style="font-family: Lucida Console, Courier, monospace; font-weight: bold"> df.describe() </span> | Returns a DataFrame with summary statistics of each column |
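
A short sketch of the difference between methods (called with parentheses) and attributes (no parentheses), on a small made-up DataFrame named `small`:

```python
import pandas as pd

small = pd.DataFrame({'A': [25.8, 17.9, 23.6], 'B': [28.1, 14.2, 18.4]})

# Methods are called with parentheses
first_two = small.head(2)
last_one = small.tail(1)

# Attributes are accessed without parentheses
n_rows, n_cols = small.shape
n_values = small.size            # rows * columns
col_names = list(small.columns)

print(n_rows, n_cols, n_values, col_names)
```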

In [25]:
bsrn.head(1)


Unnamed: 0,DATE,H_m,SWD_Wm2,STD_SWD,DIR_Wm2,STD_DIR,DIF_Wm2,STD_DIF,LWD_Wm2,STD_LWD,SWU_Wm2,LWU_Wm2,T_degC,RH,P_hPa
0,2019-10-01 00:00:00,2,-3.0,0.0,0.0,0.0,-3.0,0.0,300.0,0.1,0,383,16.2,30.7,966


the first column here (`DATE`) is currently read in as plain strings, not interpreted as datetime info

In [29]:
bsrn.size

669600

### DataFrame indexing + data selection
<hr style="border-top: 0.2px solid gray; margin-top: 12px; margin-bottom: 1px"></hr>

Because DataFrames can contain *labels* as well as *indices*, indexing in pandas DataFrames is a bit more complicated than we've seen with strings, lists, and arrays. Generally speaking, pandas allows indexing by either the integer index or the label, but the syntax is a bit different for each. 

The index operator, which refers to the square brackets following an object `[]`, does not work quite like we might expect it to.


we need to fix the time stamps... 

In [30]:
bsrn.index

RangeIndex(start=0, stop=44640, step=1)

In [31]:
bsrn.columns

Index(['DATE', 'H_m', 'SWD_Wm2', 'STD_SWD', 'DIR_Wm2', 'STD_DIR', 'DIF_Wm2',
       'STD_DIF', 'LWD_Wm2', 'STD_LWD', 'SWU_Wm2', 'LWU_Wm2', 'T_degC', 'RH',
       'P_hPa'],
      dtype='object')

In [34]:
# 2nd row, data in the 4th column
# bsrn[1][3]
# doesn't like this -- not an array of arrays

a Series can be indexed with `[]` notation, but on a DataFrame `[]` selects columns by label, not rows and columns by position
- the columns aren't numbered 0:... 
- not an array of arrays, it's a DataFrame -- the columns have names

introducing... `iloc` !!
(index location)

In [35]:
bsrn.iloc[1,3]

0.0

In [36]:
bsrn.iloc[1434,12]

19.6


`df.iloc` works just like the index operator does with arrays. In addition to indexing a single value, `df.iloc` can be used to select multiple rows and columns via slicing: `df.iloc[row_start:row_end:row_step, col_start:col_end:col_step]`.
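
The slicing pattern is easier to see on a tiny table. This is only a sketch with a made-up 4x4 DataFrame called `small`:

```python
import pandas as pd

small = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]],
    columns=['A', 'B', 'C', 'D'],
)

single = small.iloc[1, 3]          # row 1, column 3 -> 8
block = small.iloc[0:2, 1:3]       # rows 0-1, columns 1-2 (stop is exclusive)
every_other = small.iloc[::2, :]   # every second row, all columns

print(block)
```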

In [38]:
bsrn.iloc[1434:1440,12:]

Unnamed: 0,T_degC,RH,P_hPa
1434,19.6,17.6,965
1435,19.5,17.5,965
1436,19.4,17.4,965
1437,19.1,17.5,965
1438,19.4,17.6,965
1439,19.3,17.5,965


In [42]:
bsrn.iloc[-6:,-3:] 
# go 6 rows from the bottom to the end,
# 3 columns from the end to the end

Unnamed: 0,T_degC,RH,P_hPa
44634,22.9,35.7,964
44635,23.0,35.6,964
44636,23.1,35.5,964
44637,23.0,35.3,964
44638,23.0,35.2,964
44639,23.1,35.0,964


In [37]:
bsrn.iloc[::40,:5]

Unnamed: 0,DATE,H_m,SWD_Wm2,STD_SWD,DIR_Wm2
0,2019-10-01 00:00:00,2,-3.0,0.0,0.0
40,2019-10-01 00:40:00,2,-3.0,0.0,0.0
80,2019-10-01 01:20:00,2,-3.0,0.0,0.0
120,2019-10-01 02:00:00,2,-3.0,0.0,0.0
160,2019-10-01 02:40:00,2,-2.0,0.0,0.0
...,...,...,...,...,...
44440,2019-10-31 20:40:00,2,-2.0,0.0,0.0
44480,2019-10-31 21:20:00,2,-2.0,0.0,0.0
44520,2019-10-31 22:00:00,2,-2.0,0.0,0.0
44560,2019-10-31 22:40:00,2,-2.0,0.0,0.0


#### Row indexing

`df.loc` locates rows based on their labels

In [43]:
bsrn.loc[1434]

DATE       2019-10-01 23:54:00
H_m                          2
SWD_Wm2                   -2.0
STD_SWD                    0.0
DIR_Wm2                    0.0
STD_DIR                    0.0
DIF_Wm2                   -2.0
STD_DIF                    0.0
LWD_Wm2                  307.0
STD_LWD                    0.1
SWU_Wm2                      0
LWU_Wm2                    385
T_degC                    19.6
RH                        17.6
P_hPa                      965
Name: 1434, dtype: object

In [44]:
bsrn.loc[0]

DATE       2019-10-01 00:00:00
H_m                          2
SWD_Wm2                   -3.0
STD_SWD                    0.0
DIR_Wm2                    0.0
STD_DIR                    0.0
DIF_Wm2                   -3.0
STD_DIF                    0.0
LWD_Wm2                  300.0
STD_LWD                    0.1
SWU_Wm2                      0
LWU_Wm2                    383
T_degC                    16.2
RH                        30.7
P_hPa                      966
Name: 0, dtype: object

In [45]:
type(bsrn.loc[0])

pandas.core.series.Series

- in `df.loc`, stop value is *inclusive*

In [47]:
bsrn.loc[1434:1440]

Unnamed: 0,DATE,H_m,SWD_Wm2,STD_SWD,DIR_Wm2,STD_DIR,DIF_Wm2,STD_DIF,LWD_Wm2,STD_LWD,SWU_Wm2,LWU_Wm2,T_degC,RH,P_hPa
1434,2019-10-01 23:54:00,2,-2.0,0.0,0.0,0.0,-2.0,0.0,307.0,0.1,0,385,19.6,17.6,965
1435,2019-10-01 23:55:00,2,-2.0,0.0,0.0,0.0,-2.0,0.0,307.0,0.1,0,385,19.5,17.5,965
1436,2019-10-01 23:56:00,2,-2.0,0.0,0.0,0.0,-2.0,0.0,307.0,0.1,0,386,19.4,17.4,965
1437,2019-10-01 23:57:00,2,-2.0,0.0,0.0,0.0,-2.0,0.1,306.0,0.1,0,386,19.1,17.5,965
1438,2019-10-01 23:58:00,2,-2.0,0.0,0.0,0.0,-2.0,0.0,306.0,0.2,0,386,19.4,17.6,965
1439,2019-10-01 23:59:00,2,-2.0,0.0,0.0,0.0,-2.0,0.0,306.0,0.1,0,386,19.3,17.5,965
1440,2019-10-02 00:00:00,2,-2.0,0.0,0.0,0.0,-2.0,0.0,306.0,0.1,0,386,19.1,17.5,965
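
The inclusive stop of `df.loc` is the opposite of `df.iloc`, where the stop is excluded. A minimal side-by-side sketch on a made-up DataFrame:

```python
import pandas as pd

small = pd.DataFrame({'A': [10, 20, 30, 40, 50]})

by_label = small.loc[1:3]      # labels 1, 2, 3 (stop included)
by_position = small.iloc[1:3]  # positions 1, 2 (stop excluded)

print(len(by_label), len(by_position))
```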


In [51]:
one_row = bsrn.loc[0]
# you can index into this row -- type series 
# using strings or numbers
one_row['RH']

30.7

In [53]:
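one_row.iloc[13]  # by position; in recent pandas versions plain one_row[13] is deprecated when the Series has string labels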

30.7

#### Column indexing

In addition to integer indexing with `df.iloc`, columns can be accessed in two ways: dot notation `.` or square brackets `[]`. The former takes advantage of the fact that the columns are effectively "attributes" of the DataFrame and returns a Series:

In [57]:
bsrn.columns

Index(['DATE', 'H_m', 'SWD_Wm2', 'STD_SWD', 'DIR_Wm2', 'STD_DIR', 'DIF_Wm2',
       'STD_DIF', 'LWD_Wm2', 'STD_LWD', 'SWU_Wm2', 'LWU_Wm2', 'T_degC', 'RH',
       'P_hPa'],
      dtype='object')

In [58]:
bsrn['H_m']

0        2
1        2
2        2
3        2
4        2
        ..
44635    2
44636    2
44637    2
44638    2
44639    2
Name: H_m, Length: 44640, dtype: int64

In [55]:
bsrn.SWD_Wm2

0       -3.0
1       -3.0
2       -3.0
3       -3.0
4       -3.0
        ... 
44635   -2.0
44636   -2.0
44637   -2.0
44638   -2.0
44639   -2.0
Name: SWD_Wm2, Length: 44640, dtype: float64

- note: `bsrn[0]` results in a `KeyError` because you don't have a column named 0 and the columns aren't numerically indexed

Using single brackets, the result is a Series. However, using double brackets, it is possible to return the column as a DataFrame:

In [59]:
bsrn[['SWD_Wm2']]

Unnamed: 0,SWD_Wm2
0,-3.0
1,-3.0
2,-3.0
3,-3.0
4,-3.0
...,...
44635,-2.0
44636,-2.0
44637,-2.0
44638,-2.0


- no such thing as "these things" in python
- only "this list of things"
- you *have* to wrap it in a list

In [63]:
bsrn[ # this bracket says we want stuff from bsrn
        ["H_m", 'RH'] # this list is the stuff we want (in brackets!!)
] # this bracket ends our comms with pandas

Unnamed: 0,H_m,RH
0,2,30.7
1,2,30.7
2,2,30.5
3,2,30.4
4,2,30.5
...,...,...
44635,2,35.6
44636,2,35.5
44637,2,35.3
44638,2,35.2


if you have a list of columns, it's a DataFrame. If you specify only 1 column in this expected list of columns, it'll still be type: DataFrame
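
A quick type check makes the single vs. double bracket difference concrete (the two-row DataFrame `small` is made up for this sketch):

```python
import pandas as pd

small = pd.DataFrame({'H_m': [2, 2], 'RH': [30.7, 30.5]})

as_series = small['RH']      # a single column label -> Series
as_frame = small[['RH']]     # a list with one label -> DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
```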

In [61]:
bsrn[['SWD_Wm2', 'LWD_Wm2']]

Unnamed: 0,SWD_Wm2,LWD_Wm2
0,-3.0,300.0
1,-3.0,300.0
2,-3.0,300.0
3,-3.0,300.0
4,-3.0,300.0
...,...,...
44635,-2.0,380.0
44636,-2.0,380.0
44637,-2.0,380.0
44638,-2.0,381.0


difference between series and array

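dot notation can only be used if your columns are named "nicely" :) (valid Python identifiers like lower snake case: no spaces, not starting with a digit, and not clashing with an existing DataFrame attribute or method)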
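
A sketch of where dot notation stops working. The column names `air temp` and `size` are invented here to show the two failure cases:

```python
import pandas as pd

small = pd.DataFrame({'T_degC': [16.2, 16.4],
                      'air temp': [16.2, 16.4],
                      'size': [1, 2]})

ok = small.T_degC                    # valid identifier, so dot notation works
needs_brackets = small['air temp']   # a space can't appear in an attribute name

# 'size' is also a DataFrame attribute, so dot notation returns that instead
attr = small.size                    # total number of values, not the column
column = small['size']

print(attr, list(column))
```

Square brackets always work, so they are the safer default when column names come from someone else's file.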

want to get height of measurement and temperature in degrees c

In [66]:
new_df = bsrn[['H_m', 'T_degC']]
new_df.head()

Unnamed: 0,H_m,T_degC
0,2,16.2
1,2,16.4
2,2,16.5
3,2,16.5
4,2,16.8


this is truly a new DataFrame: selecting with a list of columns copies that subset of the old one, it's not just pointing at (a view of) part of it.
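
One way to convince yourself that the subset is independent is to change it and look at the original. This is a sketch on made-up data; the explicit `.copy()` is an addition that states the intent and avoids the `SettingWithCopyWarning` some pandas versions emit:

```python
import pandas as pd

original = pd.DataFrame({'H_m': [2, 2],
                         'T_degC': [16.2, 16.4],
                         'RH': [30.7, 30.7]})

# Select two columns and make the independence explicit
subset = original[['H_m', 'T_degC']].copy()

# Changing the subset leaves the original untouched
subset.loc[0, 'T_degC'] = -999.0

print(original.loc[0, 'T_degC'], subset.loc[0, 'T_degC'])
```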

the datetime strings in `DATE` are still stored as strings (dtype `object`), not as datetimes

In [69]:
bsrn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44640 entries, 0 to 44639
Data columns (total 15 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   DATE     44640 non-null  object 
 1   H_m      44640 non-null  int64  
 2   SWD_Wm2  44630 non-null  float64
 3   STD_SWD  44637 non-null  float64
 4   DIR_Wm2  44623 non-null  float64
 5   STD_DIR  44623 non-null  float64
 6   DIF_Wm2  44632 non-null  float64
 7   STD_DIF  44632 non-null  float64
 8   LWD_Wm2  44589 non-null  float64
 9   STD_LWD  44637 non-null  float64
 10  SWU_Wm2  44640 non-null  int64  
 11  LWU_Wm2  44640 non-null  int64  
 12  T_degC   44640 non-null  float64
 13  RH       44640 non-null  float64
 14  P_hPa    44640 non-null  int64  
dtypes: float64(10), int64(4), object(1)
memory usage: 5.1+ MB



### `Datetime` objects

the timestamp is the unique identifier for our data -- makes sense for our data to be indexed by these unique IDs 
- making our datetimes the index

In [78]:
bsrn.head()

Unnamed: 0,DATE,H_m,SWD_Wm2,STD_SWD,DIR_Wm2,STD_DIR,DIF_Wm2,STD_DIF,LWD_Wm2,STD_LWD,SWU_Wm2,LWU_Wm2,T_degC,RH,P_hPa
0,2019-10-01 00:00:00,2,-3.0,0.0,0.0,0.0,-3.0,0.0,300.0,0.1,0,383,16.2,30.7,966
1,2019-10-01 00:01:00,2,-3.0,0.0,0.0,0.0,-3.0,0.0,300.0,0.3,0,383,16.4,30.7,966
2,2019-10-01 00:02:00,2,-3.0,0.0,0.0,0.0,-3.0,0.0,300.0,0.2,0,383,16.5,30.5,966
3,2019-10-01 00:03:00,2,-3.0,0.0,0.0,0.0,-3.0,0.0,300.0,0.1,0,383,16.5,30.4,966
4,2019-10-01 00:04:00,2,-3.0,0.0,0.0,0.0,-3.0,0.0,300.0,0.1,0,383,16.8,30.5,966


In [79]:
# Convert bsrn.DATE column to datetime objects
bsrn = pd.read_csv('../data/BSRN_GOB_2019-10.csv')
bsrn['DATE'] = pd.to_datetime(bsrn.DATE)  # Note: overwriting a column like this is NOT recommended.
# Set bsrn.DATE as the DataFrame index
bsrn.set_index('DATE', inplace=True) # make our unique datetime identifiers the index
bsrn.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 44640 entries, 2019-10-01 00:00:00 to 2019-10-31 23:59:00
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   H_m      44640 non-null  int64  
 1   SWD_Wm2  44630 non-null  float64
 2   STD_SWD  44637 non-null  float64
 3   DIR_Wm2  44623 non-null  float64
 4   STD_DIR  44623 non-null  float64
 5   DIF_Wm2  44632 non-null  float64
 6   STD_DIF  44632 non-null  float64
 7   LWD_Wm2  44589 non-null  float64
 8   STD_LWD  44637 non-null  float64
 9   SWU_Wm2  44640 non-null  int64  
 10  LWU_Wm2  44640 non-null  int64  
 11  T_degC   44640 non-null  float64
 12  RH       44640 non-null  float64
 13  P_hPa    44640 non-null  int64  
dtypes: float64(10), int64(4)
memory usage: 5.1 MB


In [75]:
bsrn.head()

DATE,H_m,SWD_Wm2,STD_SWD,DIR_Wm2,STD_DIR,DIF_Wm2,STD_DIF,LWD_Wm2,STD_LWD,SWU_Wm2,LWU_Wm2,T_degC,RH,P_hPa
2019-10-01 00:00:00,2,-3.0,0.0,0.0,0.0,-3.0,0.0,300.0,0.1,0,383,16.2,30.7,966
2019-10-01 00:01:00,2,-3.0,0.0,0.0,0.0,-3.0,0.0,300.0,0.3,0,383,16.4,30.7,966
2019-10-01 00:02:00,2,-3.0,0.0,0.0,0.0,-3.0,0.0,300.0,0.2,0,383,16.5,30.5,966
2019-10-01 00:03:00,2,-3.0,0.0,0.0,0.0,-3.0,0.0,300.0,0.1,0,383,16.5,30.4,966
2019-10-01 00:04:00,2,-3.0,0.0,0.0,0.0,-3.0,0.0,300.0,0.1,0,383,16.8,30.5,966


converted the concept of lists of lists to a very formal structure, where every entry can be indexed and has an address made of its date (row) and variable (column), structure = good :)

In [80]:
bsrn.index

DatetimeIndex(['2019-10-01 00:00:00', '2019-10-01 00:01:00',
               '2019-10-01 00:02:00', '2019-10-01 00:03:00',
               '2019-10-01 00:04:00', '2019-10-01 00:05:00',
               '2019-10-01 00:06:00', '2019-10-01 00:07:00',
               '2019-10-01 00:08:00', '2019-10-01 00:09:00',
               ...
               '2019-10-31 23:50:00', '2019-10-31 23:51:00',
               '2019-10-31 23:52:00', '2019-10-31 23:53:00',
               '2019-10-31 23:54:00', '2019-10-31 23:55:00',
               '2019-10-31 23:56:00', '2019-10-31 23:57:00',
               '2019-10-31 23:58:00', '2019-10-31 23:59:00'],
              dtype='datetime64[ns]', name='DATE', length=44640, freq=None)

In [81]:
bsrn.index.hour

Index([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
       ...
       23, 23, 23, 23, 23, 23, 23, 23, 23, 23],
      dtype='int32', name='DATE', length=44640)

In [82]:
bsrn.index.day

Index([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
       ...
       31, 31, 31, 31, 31, 31, 31, 31, 31, 31],
      dtype='int32', name='DATE', length=44640)

by having an index that's a datetime, you can start selecting and indexing into your data in powerful ways 
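
Two examples of what a datetime index allows, sketched on a few made-up minute-by-minute readings that span midnight:

```python
import pandas as pd

# Hypothetical readings: 23:58, 23:59, 00:00, 00:01
times = pd.date_range('2019-10-01 23:58', periods=4, freq='min')
small = pd.DataFrame({'T_degC': [19.4, 19.3, 19.1, 19.0]}, index=times)

# A partial date string selects every row that falls on that day
oct_2 = small.loc['2019-10-02']

# Components such as .hour can be used to build boolean filters
late = small[small.index.hour == 23]

print(len(oct_2), len(late))
```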

In [84]:
result = bsrn.index.hour
result.unique()

Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23],
      dtype='int32', name='DATE')

In [85]:
bsrn.index.hour.unique()

Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23],
      dtype='int32', name='DATE')


In [92]:
bsrn['H_m'].unique()
# bsrn.index.unique() would return every single row, because by definition these are all unique IDs

array([2], dtype=int64)


<div class="python">
    🐍 <b>Method chaining.</b>  This process of stringing multiple methods together in a single line of code is called <b>method chaining</b>, a hallmark of object-oriented programming. Method chaining is a means of concatenating functions in order to quickly complete a series of data transformations. In pandas, we often use method chaining in aggregation processes to perform calculations on groups or selections of data. Methods are appended using dot notation to the end of a command. Any code that is expressed using method chaining could also be written using a series of commands (and vice versa). Method chaining is common in JavaScript, and while it is not widely used in Python, it is commonly applied in pandas.
</div>
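
As a small sketch of the "chain vs. series of commands" equivalence, here are both versions on a made-up two-hour DataFrame:

```python
import pandas as pd

times = pd.date_range('2019-10-01 00:00', periods=120, freq='min')
small = pd.DataFrame({'T_degC': range(120)}, index=times)

# Step by step, storing the intermediate result
hours = small.index.hour
step_result = hours.unique()

# The same thing written as a single chain
chain_result = small.index.hour.unique()

print(list(chain_result))
```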



Dealing with `datetime` objects can be tricky and often requires a bit of trial and error before the timestamps are in the desired format. If you know the format of your dataset and its timestamp records, you can parse the datetimes and set the index when reading in the data. For example, we could have imported our data as follows:

```python
bsrn = pd.read_csv('../data/BSRN_GOB_2019-10.csv',index_col=0,parse_dates=True)
```
This would have accomplished what we ultimately did in three lines in a single line of code. But remember, working with most raw datasets is rarely this straightforward – even the file we are using in this session was preprocessed to streamline the import process!
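
To try the one-line import without the course file, here is a sketch that feeds a few rows through an in-memory string (the three-column layout is cut down for illustration):

```python
import io
import pandas as pd

csv_text = (
    "DATE,T_degC,RH\n"
    "2019-10-01 00:00:00,16.2,30.7\n"
    "2019-10-01 00:01:00,16.4,30.7\n"
)

# index_col=0 makes the first column the index; parse_dates=True parses it
parsed = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)
print(type(parsed.index).__name__)
```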

### A few useful operations
<hr style="border-top: 0.2px solid gray; margin-top: 12px; margin-bottom: 1px"></hr>

Now that our DataFrame is a bit cleaner – each of the columns contains a single, numeric data type – we are ready to start working with our data. Next, we'll explore `DataFrame` reduction operations, how to add and delete data, and concatenation in pandas.

#### `DataFrame` reduction

Much like NumPy, pandas has several useful methods for *reducing* data to a single statistic. These are intuitively named and include: `df.mean()`, `df.median()`, `df.sum()`, `df.max()`, `df.min()`, and `df.std()`. Unlike array reduction, however, these basic statistical methods in pandas operate *column-wise*, returning a Series containing the statistic for each column indexed by column name. For example:


In [87]:
bsrn.median()

H_m          2.0
SWD_Wm2     27.0
STD_SWD      0.3
DIR_Wm2      0.0
STD_DIR      0.0
DIF_Wm2     19.0
STD_DIF      0.1
LWD_Wm2    340.0
STD_LWD      0.1
SWU_Wm2     11.0
LWU_Wm2    432.0
T_degC      22.4
RH          33.1
P_hPa      965.0
dtype: float64

In [93]:
bsrn[['LWD_Wm2', 'SWD_Wm2']].mean()

LWD_Wm2    342.350692
SWD_Wm2    318.046516
dtype: float64


To retrieve the value for just a single column, you can use indexing to call the column as a Series:


In [88]:
bsrn.SWD_Wm2.median()

27.0


Furthermore, while it is not apparent in this example, pandas' default behaviour is to **ignore NaN values** when performing computations. This can be changed by passing `skipna=False` to the reduction method (e.g. `df.median(skipna=False)`), though skipping NaNs is often quite useful!
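
A minimal sketch of the NaN behaviour on a made-up Series with one missing value:

```python
import numpy as np
import pandas as pd

readings = pd.Series([10.0, np.nan, 30.0])

mean_skip = readings.mean()                 # NaN ignored: mean of 10 and 30
mean_strict = readings.mean(skipna=False)   # NaN propagates to the result
count = readings.count()                    # number of non-NaN values

print(mean_skip, mean_strict, count)
```

The `count` row in `df.describe()` reports the same non-NaN count, which is why some columns show fewer than the total number of rows.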



#### Adding data

In [94]:
df = pd.DataFrame([[25.8, 28.1, 16.2, 11.0],
                   [17.9, 14.2, 18.8, 28.0],
                   [23.6, 18.4, 29.9, 27.8],
                   [23.6, 36.2, 22.1, 14.5]],
                 columns=['A','B','C','D'])

# Add a column from a list
df['E'] = [13.0, 40.1, 39.8, 28.2]

# Add a column from a Series
df['F'] = pd.Series([18, 22, 30, 24])

# Propagate a single value through all rows
df['G'] = 'blue'

df

Unnamed: 0,A,B,C,D,E,F,G
0,25.8,28.1,16.2,11.0,13.0,18,blue
1,17.9,14.2,18.8,28.0,40.1,22,blue
2,23.6,18.4,29.9,27.8,39.8,30,blue
3,23.6,36.2,22.1,14.5,28.2,24,blue


In [95]:
df['H'] = df['E'] * 4
df

Unnamed: 0,A,B,C,D,E,F,G,H
0,25.8,28.1,16.2,11.0,13.0,18,blue,52.0
1,17.9,14.2,18.8,28.0,40.1,22,blue,160.4
2,23.6,18.4,29.9,27.8,39.8,30,blue,159.2
3,23.6,36.2,22.1,14.5,28.2,24,blue,112.8


In [96]:
df['AB_diff'] = df.A - df.B

df

Unnamed: 0,A,B,C,D,E,F,G,H,AB_diff
0,25.8,28.1,16.2,11.0,13.0,18,blue,52.0,-2.3
1,17.9,14.2,18.8,28.0,40.1,22,blue,160.4,3.7
2,23.6,18.4,29.9,27.8,39.8,30,blue,159.2,5.2
3,23.6,36.2,22.1,14.5,28.2,24,blue,112.8,-12.6


In [139]:
df = pd.DataFrame([[25.8, 28.1, 16.2, 11.0],
                   [17.9, 14.2, 18.8, 28.0],
                   [23.6, 18.4, 29.9, 27.8],
                   [23.6, 36.2, 22.1, 14.5]],
                 columns=['A','B','C','D'])

# Add a column from a list
df['E'] = [13.0, 40.1, 39.8, 28.2]

# Add a column from a Series
df['F'] = pd.Series([18, 22, 30, 24])


In [140]:
df['A_degF'] = (df['A'] * (9/5) + 32)
df

Unnamed: 0,A,B,C,D,E,F,A_degF
0,25.8,28.1,16.2,11.0,13.0,18,78.44
1,17.9,14.2,18.8,28.0,40.1,22,64.22
2,23.6,18.4,29.9,27.8,39.8,30,74.48
3,23.6,36.2,22.1,14.5,28.2,24,74.48


In [141]:
df['A'] >= 20

0     True
1    False
2     True
3     True
Name: A, dtype: bool

In [142]:
# filtering
# do a boolean test against the A column, 
# return rows where this criteria is true
df[df['A'] >= 20]

Unnamed: 0,A,B,C,D,E,F,A_degF
0,25.8,28.1,16.2,11.0,13.0,18,78.44
2,23.6,18.4,29.9,27.8,39.8,30,74.48
3,23.6,36.2,22.1,14.5,28.2,24,74.48


In [143]:
df[ #dataframe rows where
    df['A'] >= df['B'] # this statement is true
    ]

Unnamed: 0,A,B,C,D,E,F,A_degF
1,17.9,14.2,18.8,28.0,40.1,22,64.22
2,23.6,18.4,29.9,27.8,39.8,30,74.48


In [144]:
df["is_hot"] = df['A'] >= 20
df

Unnamed: 0,A,B,C,D,E,F,A_degF,is_hot
0,25.8,28.1,16.2,11.0,13.0,18,78.44,True
1,17.9,14.2,18.8,28.0,40.1,22,64.22,False
2,23.6,18.4,29.9,27.8,39.8,30,74.48,True
3,23.6,36.2,22.1,14.5,28.2,24,74.48,True


In [145]:
# select the rows where D <= C

df[
    df['D'] <= df['C']
    ]

Unnamed: 0,A,B,C,D,E,F,A_degF,is_hot
0,25.8,28.1,16.2,11.0,13.0,18,78.44,True
2,23.6,18.4,29.9,27.8,39.8,30,74.48,True
3,23.6,36.2,22.1,14.5,28.2,24,74.48,True
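
The filters so far each use a single test. As a sketch that goes slightly beyond these notes, boolean Series can be combined with `&` (and), `|` (or) and `~` (not). Each test needs its own parentheses because these operators bind more tightly than comparisons. The frame is rebuilt here so the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame([[25.8, 28.1, 16.2, 11.0],
                   [17.9, 14.2, 18.8, 28.0],
                   [23.6, 18.4, 29.9, 27.8],
                   [23.6, 36.2, 22.1, 14.5]],
                  columns=['A', 'B', 'C', 'D'])

# rows where A is at least 20 AND A is at least B
both = df[(df['A'] >= 20) & (df['A'] >= df['B'])]

# rows where A is NOT at least 20
not_hot = df[~(df['A'] >= 20)]

print(both)
print(not_hot)
```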


In [146]:
# avg values of all rows in the df where A <= D
df[
    df['A'] <= df['D']
    ].mean()


A         20.75
B         16.30
C         24.35
D         27.90
E         39.95
F         26.00
A_degF    69.35
is_hot     0.50
dtype: float64

In [147]:
# avg of columns C and E for the rows where A <= D
df[ # take a dataframe
    df['A'] <= df['D'] # find the rows where this is true
    ][
        ['C', 'E'] # give me this part of it
        ].mean()  # take the mean


C    24.35
E    39.95
dtype: float64

In [148]:
# avg of the last three columns for the rows where A <= D
df[ # take a dataframe
    df['A'] <= df['D'] # find the rows where this is true
    ].iloc[
        :,-3: # select only these columns
        ].mean()  # take the mean


# note in the output: pandas treats booleans as 1s and 0s when averaging, 
# so the mean of is_hot is the fraction of rows that are True

F         26.00
A_degF    69.35
is_hot     0.50
dtype: float64
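
A minimal sketch of why `is_hot` averages to a fraction: `True` counts as 1 and `False` as 0, so `.mean()` gives the share of `True` values and `.sum()` gives how many there are. The Series below copies the `is_hot` values for all four rows rather than the filtered two.

```python
import pandas as pd

is_hot = pd.Series([True, False, True, True])

frac_hot = is_hot.mean()  # share of True values
n_hot = is_hot.sum()      # number of True values

print(frac_hot, n_hot)
```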

In [127]:
import numpy as np
df[ # take a dataframe
    df['A'] <= df['D'] # find the rows where this is true
    ][
        ['C', 'E'] # give me this part of it
        ].apply(np.mean)  # take the mean


C    24.35
E    39.95
dtype: float64

In [150]:
df['BC_diff'] = df.B - df.C
df['D_less20'] = df.D[df.D >= 20.0] - 20.0

df

Unnamed: 0,A,B,C,D,E,F,A_degF,is_hot,D_less20,BC_diff
0,25.8,28.1,16.2,11.0,13.0,18,78.44,True,,11.9
1,17.9,14.2,18.8,28.0,40.1,22,64.22,False,8.0,-4.6
2,23.6,18.4,29.9,27.8,39.8,30,74.48,True,7.8,-11.5
3,23.6,36.2,22.1,14.5,28.2,24,74.48,True,,14.1
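
The NaNs in `D_less20` come from index alignment: the filtered Series only has index labels 1 and 2, and when it is assigned back as a column pandas lines it up by label and fills the missing rows with NaN. The sketch below reproduces that on just the `D` column, and shows `.where()` as one alternative way (an assumption on my part, not something from the session) to get the same result.

```python
import pandas as pd

df = pd.DataFrame({'D': [11.0, 28.0, 27.8, 14.5]})

subset = df.D[df.D >= 20.0]     # keeps only index labels 1 and 2
df['D_less20'] = subset - 20.0  # aligned on index; rows 0 and 3 become NaN

# Alternative: compute for every row, then blank out rows failing the test
df['D_less20_where'] = (df.D - 20.0).where(df.D >= 20.0)

print(df)
```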


In [155]:
df['H'] = df.D[df.D >= 20]
print(df)
# pandas assumes you want the average of the values that exist
df['H'].mean()

      A     B     C     D     E   F  A_degF  is_hot  D_less20  BC_diff     H
0  25.8  28.1  16.2  11.0  13.0  18   78.44    True       NaN     11.9   NaN
1  17.9  14.2  18.8  28.0  40.1  22   64.22   False       8.0     -4.6  28.0
2  23.6  18.4  29.9  27.8  39.8  30   74.48    True       7.8    -11.5  27.8
3  23.6  36.2  22.1  14.5  28.2  24   74.48    True       NaN     14.1   NaN


27.9

In [156]:
df['H'].mean(skipna=False)

nan

In [157]:
df.describe()

Unnamed: 0,A,B,C,D,E,F,A_degF,D_less20,BC_diff,H
count,4.0,4.0,4.0,4.0,4.0,4.0,4.0,2.0,4.0,2.0
mean,22.725,24.225,21.75,20.325,30.275,23.5,72.905,7.9,2.475,27.9
std,3.379719,9.880073,5.945587,8.863173,12.78003,5.0,6.083494,0.141421,12.507698,0.141421
min,17.9,14.2,16.2,11.0,13.0,18.0,64.22,7.8,-11.5,27.8
25%,22.175,17.35,18.15,13.625,24.4,21.0,71.915,7.85,-6.325,27.85
50%,23.6,23.25,20.45,21.15,34.0,23.0,74.48,7.9,3.65,27.9
75%,24.15,30.125,24.05,27.85,39.875,25.5,75.47,7.95,12.45,27.95
max,25.8,36.2,29.9,28.0,40.1,30.0,78.44,8.0,14.1,28.0


In [158]:
df

Unnamed: 0,A,B,C,D,E,F,A_degF,is_hot,D_less20,BC_diff,H
0,25.8,28.1,16.2,11.0,13.0,18,78.44,True,,11.9,
1,17.9,14.2,18.8,28.0,40.1,22,64.22,False,8.0,-4.6,28.0
2,23.6,18.4,29.9,27.8,39.8,30,74.48,True,7.8,-11.5,27.8
3,23.6,36.2,22.1,14.5,28.2,24,74.48,True,,14.1,


In [159]:
# Create list of seasons
seasons = ['winter', 'spring', 'summer', 'fall']

# Insert season as first column
df.insert(0, 'SEASON', seasons)

df

Unnamed: 0,SEASON,A,B,C,D,E,F,A_degF,is_hot,D_less20,BC_diff,H
0,winter,25.8,28.1,16.2,11.0,13.0,18,78.44,True,,11.9,
1,spring,17.9,14.2,18.8,28.0,40.1,22,64.22,False,8.0,-4.6,28.0
2,summer,23.6,18.4,29.9,27.8,39.8,30,74.48,True,7.8,-11.5,27.8
3,fall,23.6,36.2,22.1,14.5,28.2,24,74.48,True,,14.1,


In [160]:
df[ # take a dataframe
    df['A'] <= df['D'] # find the rows where this is true
    ][
        ['C', 'E'] # give me this part of it
        ].apply(np.mean)  # take the mean

C    24.35
E    39.95
dtype: float64

#### Removing Data

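`.insert()` adds a column at a chosen position (as with `SEASON` above).

`del` deletes a column from the df in place, e.g. `del df['col']`.

`.pop()` extracts a column from the df as a new Series and removes it from the df.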
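
A short sketch of the removal options on a throwaway frame (the values are reused from earlier cells; the frame itself is just for illustration). `.drop()` was not covered in these notes; it is included as an assumption about what is useful, because it returns a new DataFrame instead of changing the original.

```python
import pandas as pd

df = pd.DataFrame({'A': [25.8, 17.9],
                   'B': [28.1, 14.2],
                   'F': [18, 22],
                   'G': ['blue', 'blue']})

del df['G']                      # removes G in place, returns nothing
df_F = df.pop('F')               # removes F in place, returns it as a Series
df_noB = df.drop(columns=['B'])  # returns a copy without B; df still has B

print(df.columns.tolist())
print(df_noB.columns.tolist())
```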




In [163]:
# df_F = df.pop('F')  # already run once (F is gone from df below); a second pop('F') would raise KeyError
df

Unnamed: 0,SEASON,A,B,C,D,E,A_degF,is_hot,D_less20,BC_diff,H
0,winter,25.8,28.1,16.2,11.0,13.0,78.44,True,,11.9,
1,spring,17.9,14.2,18.8,28.0,40.1,64.22,False,8.0,-4.6,28.0
2,summer,23.6,18.4,29.9,27.8,39.8,74.48,True,7.8,-11.5,27.8
3,fall,23.6,36.2,22.1,14.5,28.2,74.48,True,,14.1,


#### Applying functions

In [169]:
def add_one(d):
    return d+1

df[['A', 'B']].apply(add_one)

Unnamed: 0,A,B
0,26.8,29.1
1,18.9,15.2
2,24.6,19.4
3,24.6,37.2


In [171]:
df[ # take my dataframe
    df['A'] > df['B'] # select only the rows where this is true
    ][
        'A' # select only these columns from that result
    ].apply(add_one) # and apply this function

1    18.9
2    24.6
Name: A, dtype: float64

In [172]:
df[df['A'] > df['B']]['A'].apply(add_one)

1    18.9
2    24.6
Name: A, dtype: float64

In [173]:
def convert_CtoF(degC):
    """ Converts a temperature from Celsius to Fahrenheit
    
    Parameters
    ----------
        degC : float
            Temperature value in °C
       
    Returns
    -------
        degF : float
            Temperature value in °F
    """
    
    degF = (degC *(9./5)) + 32
    
    return degF

In [174]:
df.A.apply(convert_CtoF)

0    78.44
1    64.22
2    74.48
3    74.48
Name: A, dtype: float64
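
Because `convert_CtoF` only uses arithmetic, it can also be called on the whole column at once instead of going through `.apply()`. This is a sketch of that comparison, with the function body copied from above (docstring omitted) and column `A` rebuilt as a standalone Series so the block runs on its own.

```python
import pandas as pd

def convert_CtoF(degC):
    # same arithmetic as the function defined above
    return (degC * (9. / 5)) + 32

A = pd.Series([25.8, 17.9, 23.6, 23.6], name='A')

via_apply = A.apply(convert_CtoF)  # calls the function once per value
via_vector = convert_CtoF(A)       # one call on the whole Series

print(via_vector)
```

Both give the same numbers; the vectorized call is usually the faster choice on large columns, while `.apply()` is the fallback for functions that cannot operate on a whole Series.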