In [2]:
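import pandas as pd
# Read the first HTML table on the Yahoo Finance price-history page for TSLA.
df = pd.read_html("https://finance.yahoo.com/quote/TSLA/history?period1=1546300800&period2=1550275200&interval=1d&filter=history&frequency=1d")[0]
# Keep the first 11 rows and sort them by the Date column.
df = df.head(11).sort_values(by='Date')
# Make sure the price and volume columns are numeric.
df = df.astype({"Open":'float',
                "High":'float',
                "Low":'float',
                "Close*":'float',
                "Adj Close**":'float',
                "Volume":'float'})
# Daily gain: closing price minus opening price.
df['Gain'] = df['Close*'] - df['Open']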

Getting all rows that match a simple conditional statement

First, let’s grab all rows in our DataFrame that match one condition. In this example, I’d like to get all the rows that occur after a certain date, so we’ll run the following code:

In [4]:
df1 = df.loc[df['Date'] > 'Feb 06, 2019']
df1.head()

Unnamed: 0,Date,Open,High,Low,Close*,Adj Close**,Volume,Gain
6,"Feb 07, 2019",62.66,62.94,60.6,61.5,61.5,32603000.0,-1.16
5,"Feb 08, 2019",61.37,61.49,59.7,61.16,61.16,29221000.0,-0.21
4,"Feb 11, 2019",62.32,63.72,62.1,62.57,62.57,35648500.0,0.25
3,"Feb 12, 2019",63.24,63.64,61.92,62.36,62.36,27588000.0,-0.88
2,"Feb 13, 2019",62.47,62.55,61.11,61.63,61.63,25708000.0,-0.84
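
One caveat about the filter above: the Date column holds strings, so the comparison is alphabetical rather than chronological. It gives the right answer here only because every row shares the same month, year, and zero-padded format. The following is a minimal sketch, using a small hand-built frame with hypothetical values (not the Yahoo data), of how parsing the column with pd.to_datetime makes the comparison chronological:

```python
import pandas as pd

# Hypothetical sample spanning two months, in the same "Mon DD, YYYY" format.
sample = pd.DataFrame({
    "Date": ["Jan 31, 2019", "Feb 06, 2019", "Feb 07, 2019"],
    "Open": [60.0, 63.92, 62.66],
})

# String comparison is alphabetical, so "Jan 31, 2019" sorts after "Feb 06, 2019".
as_strings = sample.loc[sample["Date"] > "Feb 06, 2019"]

# Parse the strings into timestamps so the comparison is chronological.
sample["Date"] = pd.to_datetime(sample["Date"], format="%b %d, %Y")
as_dates = sample.loc[sample["Date"] > pd.Timestamp("2019-02-06")]

print(as_strings["Open"].tolist())  # [60.0, 62.66] (January row wrongly included)
print(as_dates["Open"].tolist())    # [62.66]
```

The format string assumes English month abbreviations, which matches the dates shown in this tutorial.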


Getting specific columns that match a conditional statement

Now, we’ll introduce the syntax that allows you to specify which columns you want .loc to return. In this case, we’ll use the same conditional statement as before to filter out specific dates. However, our goal this time is to only select two columns (Date and Open) from the original DataFrame. To do so, we run the following code:


In [5]:
df2 = df.loc[df['Date'] > 'Feb 06, 2019', ['Date','Open']]
df2.head()

Unnamed: 0,Date,Open
6,"Feb 07, 2019",62.66
5,"Feb 08, 2019",61.37
4,"Feb 11, 2019",62.32
3,"Feb 12, 2019",63.24
2,"Feb 13, 2019",62.47
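
A detail worth knowing about the column argument: passing a list of names returns a DataFrame, while passing a single name as a plain string returns a Series. The sketch below uses a small hand-built frame (hypothetical, with values borrowed from the tables in this tutorial) to show the difference:

```python
import pandas as pd

# Hypothetical three-row sample.
sample = pd.DataFrame({
    "Date": ["Feb 06, 2019", "Feb 07, 2019", "Feb 08, 2019"],
    "Open": [63.92, 62.66, 61.37],
    "Volume": [25192500.0, 32603000.0, 29221000.0],
})
mask = sample["Open"] > 62

one_column = sample.loc[mask, "Open"]             # single label -> Series
same_column = sample.loc[mask, ["Open"]]          # list of labels -> DataFrame
two_columns = sample.loc[mask, ["Date", "Open"]]  # DataFrame with two columns

print(type(one_column).__name__, type(same_column).__name__)
```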


Using multiple conditional statements to filter a DataFrame

If you have two or more conditions you would like to use to get a very specific subset of your data, .loc allows you to do that very easily. In our case, let’s take the rows that not only occur after a specific date but also have an Open value greater than a specific value. To do so, we run the following:


In [6]:
df3 = df.loc[(df['Date'] > 'Feb 06, 2019') & (df['Open'] > 62), ['Date', 'Open']]
df3.head()

Unnamed: 0,Date,Open
6,"Feb 07, 2019",62.66
4,"Feb 11, 2019",62.32
3,"Feb 12, 2019",63.24
2,"Feb 13, 2019",62.47
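
The parentheses around each condition matter, because & binds more tightly than comparison operators such as >. One way to keep this readable is to give each condition its own name first. Here is a minimal sketch on a hand-built frame (hypothetical; the names after_feb_6 and open_above_62 are just for illustration):

```python
import pandas as pd

# Hypothetical sample with dates already parsed into timestamps.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2019-02-06", "2019-02-07", "2019-02-08", "2019-02-11"]),
    "Open": [63.92, 62.66, 61.37, 62.32],
})

# Each condition is a boolean Series with one entry per row.
after_feb_6 = sample["Date"] > pd.Timestamp("2019-02-06")
open_above_62 = sample["Open"] > 62

# & keeps only the rows where both conditions are True.
subset = sample.loc[after_feb_6 & open_above_62, ["Date", "Open"]]
print(subset["Open"].tolist())  # [62.66, 62.32]
```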


Editing a DataFrame based on multiple conditional statements

As mentioned before, there may be other ways to do this, but you might end up with a “SettingWithCopyWarning” if you’re not careful. Assigning values through a single .loc call avoids the chained indexing that usually triggers this warning.
We’ll put together all the previous steps and edit our DataFrame so that rows that meet a condition we set are assigned a specific value. In this case, we’ll create a new “Remarkable” column, which flags rows that have either a very high Volume or a positive Gain. To do so, we run the following code:


In [8]:
remarkable_filter = (df['Volume'] > 30000000) | (df['Gain'] > 0)
df4 = df.copy()
df4['Remarkable'] = ''
df4.loc[remarkable_filter, ['Remarkable']] = True
df4.loc[~remarkable_filter, ['Remarkable']] = False
df4.head()

Unnamed: 0,Date,Open,High,Low,Close*,Adj Close**,Volume,Gain,Remarkable
10,"Feb 01, 2019",61.08,63.22,60.7,62.44,62.44,36417000.0,1.36,True
9,"Feb 04, 2019",62.6,63.06,60.38,62.58,62.58,36760500.0,-0.02,True
8,"Feb 05, 2019",62.5,64.49,62.45,64.27,64.27,33714000.0,1.77,True
7,"Feb 06, 2019",63.92,64.85,63.12,63.44,63.44,25192500.0,-0.48,False
6,"Feb 07, 2019",62.66,62.94,60.6,61.5,61.5,32603000.0,-1.16,True


For clarity, we put our conditional statements in a separate variable, which is used later in .loc. Then, we assign True to the Remarkable column for all the rows that meet our conditional statements. We use the ~ symbol to find all the rows that don’t meet our conditional statements and then assign False to the Remarkable column for those rows.
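
Since the filter is itself a column of True and False values, a shorter alternative is to assign it to the new column directly. This is a sketch on a hand-built frame (hypothetical values), not a replacement for the .loc pattern, which is still what you want when the assigned values aren’t simply True and False:

```python
import pandas as pd

# Hypothetical three-row sample.
sample = pd.DataFrame({
    "Volume": [36417000.0, 25192500.0, 29221000.0],
    "Gain": [1.36, -0.48, 0.25],
})

# Same condition as in the tutorial: high volume OR positive gain.
remarkable_filter = (sample["Volume"] > 30_000_000) | (sample["Gain"] > 0)

# The boolean mask already holds the values we want in the new column.
flagged = sample.copy()
flagged["Remarkable"] = remarkable_filter

print(flagged["Remarkable"].tolist())  # [True, False, True]
```

The resulting column has a boolean dtype, which makes later filtering such as flagged.loc[flagged["Remarkable"]] straightforward.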


Rolling Functions in a Pandas DataFrame

So what is a rolling window calculation?

You’ll typically use rolling calculations when you work with time-series data. Again, a window is a subset of rows that you perform a window calculation on. After you’ve defined a window, you can perform operations like calculating running totals, moving averages, ranks, and much more!
Let’s clear this up with some examples.

1. Window Rolling Mean (Moving Average)

The moving average calculation creates an updated average value for each row based on the window we specify. The calculation is also called a “rolling mean” because it’s calculating an average of values within a specified range for each row as you go along the DataFrame.
That sounds a bit abstract, so let’s calculate the rolling mean of the “Close*” column over time.

In [11]:
df['Rolling Close Average'] = df['Close*'].rolling(2).mean()
df.head(5)

Unnamed: 0,Date,Open,High,Low,Close*,Adj Close**,Volume,Gain,Rolling Close Average
10,"Feb 01, 2019",61.08,63.22,60.7,62.44,62.44,36417000.0,1.36,
9,"Feb 04, 2019",62.6,63.06,60.38,62.58,62.58,36760500.0,-0.02,62.51
8,"Feb 05, 2019",62.5,64.49,62.45,64.27,64.27,33714000.0,1.77,63.425
7,"Feb 06, 2019",63.92,64.85,63.12,63.44,63.44,25192500.0,-0.48,63.855
6,"Feb 07, 2019",62.66,62.94,60.6,61.5,61.5,32603000.0,-1.16,62.47


We’re creating a new column, “Rolling Close Average”, which takes the moving average of the close price within a window.
To do this, we simply write .rolling(2).mean(), where we specify a window of “2” and calculate the mean for every window along the DataFrame. Each row gets a “Rolling Close Average” equal to the sum of its own “Close*” value and the previous row’s “Close*” value, divided by 2 (the window size). In essence, it’s Moving Avg = ([t] + [t-1]) / 2.

In practice, this means the first calculated value is (62.44 + 62.58) / 2 = 62.51, which is the “Rolling Close Average” value for February 4. There is no rolling mean for the first row in the DataFrame, because there is no available [t-1] or prior period “Close*” value to use in the calculation, which is why Pandas fills it with a NaN value.
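
To check the formula, the sketch below rebuilds the same numbers by hand with shift(1), using the five “Close*” values from the table above as a standalone Series. It also shows the optional min_periods argument, which lets the first row produce a value from a partial window instead of NaN:

```python
import pandas as pd

# The five "Close*" values from the table above, oldest first.
close = pd.Series([62.44, 62.58, 64.27, 63.44, 61.50])

# Built-in rolling mean with a window of 2.
rolling_mean = close.rolling(2).mean()

# The same calculation by hand: ([t] + [t-1]) / 2.
manual = (close + close.shift(1)) / 2

# min_periods=1 allows a result as soon as one value is available.
partial = close.rolling(2, min_periods=1).mean()

print(rolling_mean.round(3).tolist())  # [nan, 62.51, 63.425, 63.855, 62.47]
print(partial.iloc[0])                 # 62.44
```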


2. Window Rolling Standard Deviation
To further see the difference between a regular calculation and a rolling calculation, let’s check out the rolling standard deviation of the “Open” price. To do so, we’ll run the following code:

In [13]:
df['Open Standard Deviation'] = df['Open'].std()
df['Rolling Open Standard Deviation'] = df['Open'].rolling(2).std()
df.head(5)

Unnamed: 0,Date,Open,High,Low,Close*,Adj Close**,Volume,Gain,Rolling Close Average,Open Standard Deviation,Rolling Open Standard Deviation
10,"Feb 01, 2019",61.08,63.22,60.7,62.44,62.44,36417000.0,1.36,,1.025347,
9,"Feb 04, 2019",62.6,63.06,60.38,62.58,62.58,36760500.0,-0.02,62.51,1.025347,1.074802
8,"Feb 05, 2019",62.5,64.49,62.45,64.27,64.27,33714000.0,1.77,63.425,1.025347,0.070711
7,"Feb 06, 2019",63.92,64.85,63.12,63.44,63.44,25192500.0,-0.48,63.855,1.025347,1.004092
6,"Feb 07, 2019",62.66,62.94,60.6,61.5,61.5,32603000.0,-1.16,62.47,1.025347,0.890955


I also included a new column, “Open Standard Deviation”, which simply holds the standard deviation of the whole “Open” column. That is why it shows the same value in every row.

Beside it, you’ll see the “Rolling Open Standard Deviation” column, in which I’ve defined a window of 2 and calculated the standard deviation for each row.

Just as with the previous example, the first non-null value is at the second row of the DataFrame, because that’s the first row that has both [t] and [t-1]. 

You can see how the moving standard deviation varies as you move down the table, which can be useful to track volatility over time.

Pandas uses N-1 degrees of freedom when calculating the standard deviation (the sample standard deviation). You can change this with the optional ddof argument of the std function, which is set to “1” by default.
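
As a quick illustration of what ddof changes, here is a sketch using the first three “Open” values from the table as a standalone Series. With a window of 2, the sample standard deviation (ddof=1) divides by 1, while the population standard deviation (ddof=0) divides by 2:

```python
import pandas as pd

# The first three "Open" values from the table above.
open_price = pd.Series([61.08, 62.60, 62.50])

# Default: sample standard deviation (ddof=1), as in the tutorial output.
sample_std = open_price.rolling(2).std()

# Population standard deviation: divide by N instead of N-1.
population_std = open_price.rolling(2).std(ddof=0)

print(round(sample_std.iloc[1], 6))      # 1.074802
print(round(population_std.iloc[1], 6))  # 0.76
```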


3. Window Rolling Sum
As a final example, let’s calculate the rolling sum for the “Volume” column. To do so, we run the following code:

In [14]:
df['Rolling Volume Sum'] = df['Volume'].rolling(3).sum()
df.head(5)

Unnamed: 0,Date,Open,High,Low,Close*,Adj Close**,Volume,Gain,Rolling Close Average,Open Standard Deviation,Rolling Open Standard Deviation,Rolling Volume Sum
10,"Feb 01, 2019",61.08,63.22,60.7,62.44,62.44,36417000.0,1.36,,1.025347,,
9,"Feb 04, 2019",62.6,63.06,60.38,62.58,62.58,36760500.0,-0.02,62.51,1.025347,1.074802,
8,"Feb 05, 2019",62.5,64.49,62.45,64.27,64.27,33714000.0,1.77,63.425,1.025347,0.070711,106891500.0
7,"Feb 06, 2019",63.92,64.85,63.12,63.44,63.44,25192500.0,-0.48,63.855,1.025347,1.004092,95667000.0
6,"Feb 07, 2019",62.66,62.94,60.6,61.5,61.5,32603000.0,-1.16,62.47,1.025347,0.890955,91509500.0


We’ve defined a window of “3”, so the first calculated value appears on the third row. The sum calculation then “rolls” over every row, so that you can track the sum of the current row’s value and the two prior rows’ values over time.

It’s important to emphasize here that these rolling (moving) calculations should not be confused with running calculations.

Rolling calculations, as you can see in the diagram above, have a moving window. So with our moving sum, the calculated value for February 6 (the fourth row) does not include the value for February 1 (the first row), because the specified window (3) does not go that far back.

In contrast, a running calculation would continually add each row’s value to a running total across the whole DataFrame. You can check out the cumsum function for that.
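
To make the contrast concrete, this sketch puts both calculations side by side, using the five “Volume” values from the table above as a standalone Series:

```python
import pandas as pd

# The five "Volume" values from the table above, oldest first.
volume = pd.Series([36417000.0, 36760500.0, 33714000.0, 25192500.0, 32603000.0])

# Rolling sum: only the current row and the two rows before it.
rolling_sum = volume.rolling(3).sum()

# Running total: every row from the start up to the current one.
running_sum = volume.cumsum()

print(rolling_sum.tolist())  # [nan, nan, 106891500.0, 95667000.0, 91509500.0]
print(running_sum.tolist())  # [36417000.0, 73177500.0, 106891500.0, 132084000.0, 164687000.0]
```

The two agree on the third row, where the window happens to cover every row so far, and diverge from the fourth row onward once February 1 drops out of the window.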
