# Pandas resample() tricks you should know for manipulating time-series data
Time-series data is common in data science projects. Often, you will want to resample it to a different frequency, either to analyze it at that granularity or to draw additional insights from it [1].

In this article, we’ll go through some examples of resampling time-series data using the Pandas resample() function. We will cover the following common problems, which should help you get started with time-series data manipulation.

* Downsampling and performing aggregation
* Downsampling with a custom base
* Upsampling and filling values
* A practical example

## Downsampling and performing aggregation

Downsampling resamples a time-series dataset to a wider time frame, for example from minutes to hours or from days to years. The result has fewer rows, and the values in each bin can be aggregated with mean(), min(), max(), sum(), etc.

Let’s see how it works with the help of an example.

Suppose we have a dataset about sales.

In [35]:
import pandas as pd 
import numpy as np
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


df = pd.read_csv(
    './data/sales_data.csv')

df.head()

Unnamed: 0,date,num_sold
0,2017-01-02 09:02:03,5
1,2017-01-02 09:14:13,7
2,2017-01-02 09:21:00,5
3,2017-01-02 09:28:57,9
4,2017-01-02 09:42:14,1


In [2]:
df_sales = pd.read_csv(
    './data/sales_data.csv', 
    parse_dates=['date'], 
    index_col=['date']
)

df_sales.head()

Unnamed: 0_level_0,num_sold
date,Unnamed: 1_level_1
2017-01-02 09:02:03,5
2017-01-02 09:14:13,7
2017-01-02 09:21:00,5
2017-01-02 09:28:57,9
2017-01-02 09:42:14,1


To get the total number of sales for every 2 hours, we can simply use resample() to downsample the DataFrame into 2-hour bins and sum the values whose timestamps fall into each bin.

In [3]:
df_sales.resample('2H').sum()

Unnamed: 0_level_0,num_sold
date,Unnamed: 1_level_1
2017-01-02 08:00:00,37
2017-01-02 10:00:00,66
2017-01-02 12:00:00,81
2017-01-02 14:00:00,50
2017-01-02 16:00:00,64
2017-01-02 18:00:00,66
2017-01-02 20:00:00,44
2017-01-02 22:00:00,45
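
Since sales_data.csv is not included here, the binning can be reproduced with a small made-up dataset. The following is only a sketch: the name `df_demo` and its six rows are assumptions for illustration, and the lowercase alias `'2h'` is used because recent Pandas versions deprecate the uppercase `'2H'`.

```python
import pandas as pd

# Hypothetical stand-in for sales_data.csv: a few timestamped sales on one day.
df_demo = pd.DataFrame(
    {"num_sold": [5, 7, 5, 9, 1, 4]},
    index=pd.to_datetime([
        "2017-01-02 09:02:03",
        "2017-01-02 09:14:13",
        "2017-01-02 09:21:00",
        "2017-01-02 10:28:57",
        "2017-01-02 11:42:14",
        "2017-01-02 12:05:00",
    ]),
)

# Each row falls into the 2-hour bin that contains its timestamp.
# Bins are labelled by their left edge: 08:00, 10:00 and 12:00 here.
two_hourly = df_demo.resample("2h").sum()
print(two_hourly)
```

The three 09:xx rows land in the 08:00 bin, which is why the first label is earlier than the first timestamp.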


To perform multiple aggregations, we can pass a list of aggregation functions to the agg() method.

In [4]:
df_sales.resample('2H').agg(['min','max', 'sum'])

Unnamed: 0_level_0,num_sold,num_sold,num_sold
Unnamed: 0_level_1,min,max,sum
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
2017-01-02 08:00:00,1,9,37
2017-01-02 10:00:00,1,9,66
2017-01-02 12:00:00,1,9,81
2017-01-02 14:00:00,1,9,50
2017-01-02 16:00:00,1,8,64
2017-01-02 18:00:00,1,9,66
2017-01-02 20:00:00,1,9,44
2017-01-02 22:00:00,2,6,45


## Downsampling with a custom base

By default, for frequencies that evenly subdivide 1 day/month/year, the “origin” of the aggregated intervals is midnight (`00:00:00`) of the first day. So, for the 2H frequency, the bin edges are `00:00:00`, `02:00:00`, `04:00:00`, …, `22:00:00`.

For the sales data we are using, the first record has a date value of 2017-01-02 09:02:03, so it makes more sense for the output range to start at 09:00:00 rather than 08:00:00. To do that, we can shift the bin edges. Older versions of Pandas used the argument base for this (base=1 for a one-hour shift), but base was deprecated in Pandas 1.1 and removed in 2.0. The replacement arguments are offset and origin: setting offset='1H' makes the result range start at 09:00:00.


![](./i/1oEJpC3wudNTyZSE-rimJKw.png)

In [5]:
# base=1 in older Pandas versions; offset='1H' is the current equivalent
df_sales.resample('2H', offset='1H').sum()



Unnamed: 0_level_0,num_sold
date,Unnamed: 1_level_1
2017-01-02 09:00:00,62
2017-01-02 11:00:00,77
2017-01-02 13:00:00,64
2017-01-02 15:00:00,55
2017-01-02 17:00:00,72
2017-01-02 19:00:00,48
2017-01-02 21:00:00,70
2017-01-02 23:00:00,5
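
To compare the default bins with shifted ones, here is a minimal sketch on made-up data. The series `ts` and its values are assumptions for illustration; `offset` and `origin` are the resample() arguments available since Pandas 1.1.

```python
import pandas as pd

# Hypothetical sales counts, starting shortly after 09:00.
ts = pd.Series(
    [5, 7, 9, 1],
    index=pd.to_datetime([
        "2017-01-02 09:02:03",
        "2017-01-02 10:30:00",
        "2017-01-02 11:15:00",
        "2017-01-02 12:45:00",
    ]),
)

# Default: bin edges are aligned to midnight, so the first bin starts at 08:00.
default_bins = ts.resample("2h").sum()

# offset shifts every bin edge by one hour, so the first bin starts at 09:00.
shifted_bins = ts.resample("2h", offset="1h").sum()

# origin="start" anchors the bins at the first timestamp itself (09:02:03).
from_first = ts.resample("2h", origin="start").sum()

print(default_bins, shifted_bins, from_first, sep="\n\n")
```

Whether offset or origin reads better depends on whether you want round bin labels (offset) or bins that start exactly at the first observation (origin).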


## Upsampling and filling values

Upsampling is the opposite of downsampling. It resamples a time-series dataset to a narrower time frame, for example from hours to minutes or from years to days. The result has more rows, and the values of the additional rows default to NaN. The built-in methods ffill() and bfill() are commonly used to forward fill or backward fill these NaN values.

Let’s make up a DataFrame for demonstration.

In [6]:
df = pd.DataFrame(
    {'value': [1, 2, 3]},
    index=pd.period_range(
        '2012-01-01',
        freq='A',
        periods=3
    )
)
df

Unnamed: 0,value
2012,1
2013,2
2014,3


To resample the yearly data by quarter and forward fill the values, use ffill(). The forward fill method uses the last known value to replace NaN.
![](./i/1_PHEUoOLiCtJe5KtCKnO5hg.png)

In [7]:
df.resample('Q').ffill()

Unnamed: 0,value
2012Q1,1
2012Q2,1
2012Q3,1
2012Q4,1
2013Q1,2
2013Q2,2
2013Q3,2
2013Q4,2
2014Q1,3
2014Q2,3


To resample the yearly data by quarter and backward fill the values, use bfill(). The backward fill method uses the next known value to replace NaN.

![](./i/1_LXPqyzf-QAYye4YCJ6fZcw.png)

In [8]:
df.resample('Q').bfill()

Unnamed: 0,value
2012Q1,1.0
2012Q2,2.0
2012Q3,2.0
2012Q4,2.0
2013Q1,2.0
2013Q2,3.0
2013Q3,3.0
2013Q4,3.0
2014Q1,3.0
2014Q2,
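
To see where the NaN values come from before any filling, it can help to look at the raw upsampled result. The sketch below uses a made-up daily series (the name `daily` and its values are assumptions) and upsamples it to 8-hour steps. asfreq() returns the new rows unfilled, and the `limit` argument of ffill() caps how many consecutive rows are filled.

```python
import pandas as pd

# Hypothetical daily values on a DatetimeIndex.
daily = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.date_range("2020-01-01", periods=3, freq="D"),
)

# Upsample to 8-hour steps without filling: the new rows are NaN.
upsampled = daily.resample("8h").asfreq()

# Forward fill every gap with the last known value.
filled = daily.resample("8h").ffill()

# Forward fill at most one row per gap; the rest stays NaN.
filled_once = daily.resample("8h").ffill(limit=1)

print(pd.DataFrame({"asfreq": upsampled, "ffill": filled, "ffill(limit=1)": filled_once}))
```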


## A practical example

Let’s take a look at how to use Pandas resample() to deal with a real-world problem.

Suppose we have two datasets: df_sales holds monthly sales and df_price holds prices. df_price only has a record for each month in which the price changed.

### Step 1: Resample price dataset by month and forward fill the values

In [9]:
# load sales
df_sales = pd.read_csv('data/sales.csv', parse_dates=['date'], index_col=['date'])
df_sales.head()

Unnamed: 0_level_0,num_sold
date,Unnamed: 1_level_1
2018-01-31,5
2018-02-28,17
2018-03-31,5
2018-04-30,16
2018-05-31,12


In [10]:
# load price
df_price = pd.read_csv('data/price.csv', parse_dates=['date'], index_col=['date'])
df_price.head()

Unnamed: 0_level_0,price
date,Unnamed: 1_level_1
2018-01-31,16.0
2018-05-31,15.5
2018-12-31,10.0


In [11]:
df_price = df_price.resample('M').ffill()
df_price

Unnamed: 0_level_0,price
date,Unnamed: 1_level_1
2018-01-31,16.0
2018-02-28,16.0
2018-03-31,16.0
2018-04-30,16.0
2018-05-31,15.5
2018-06-30,15.5
2018-07-31,15.5
2018-08-31,15.5
2018-09-30,15.5
2018-10-31,15.5


![](./i/1_3yuXIWOGIhUlrsl8NwhgMw.png)


### Step 2: Combine results and calculate total sales

In [12]:
df = pd.concat([df_sales, df_price], axis = 1)
df

Unnamed: 0_level_0,num_sold,price
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2018-01-31,5,16.0
2018-02-28,17,16.0
2018-03-31,5,16.0
2018-04-30,16,16.0
2018-05-31,12,15.5
2018-06-30,12,15.5
2018-07-31,2,15.5
2018-08-31,9,15.5
2018-09-30,5,15.5
2018-10-31,15,15.5


The Pandas concat() function with the argument axis=1 combines df_sales and df_price horizontally.

After that, the total sales can be calculated using the element-wise multiplication df['num_sold'] * df['price'].

Executing this statement should give an output like the one below:

![](./i/AhPi27cPdze67s13m4s2bw.png)

In [13]:
df['total_sales'] = df['num_sold'] * df['price']
df

Unnamed: 0_level_0,num_sold,price,total_sales
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2018-01-31,5,16.0,80.0
2018-02-28,17,16.0,272.0
2018-03-31,5,16.0,80.0
2018-04-30,16,16.0,256.0
2018-05-31,12,15.5,186.0
2018-06-30,12,15.5,186.0
2018-07-31,2,15.5,31.0
2018-08-31,9,15.5,139.5
2018-09-30,5,15.5,77.5
2018-10-31,15,15.5,232.5
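
The two steps can also be tried without the CSV files. The sketch below is an alternative, not the approach used above: instead of resample() it aligns the price table to the sales dates with reindex(method='ffill'), which avoids depending on a month-end frequency alias. The names `sales`, `price` and `combined` and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical monthly sales, indexed by month-end dates.
month_ends = pd.to_datetime([
    "2018-01-31", "2018-02-28", "2018-03-31",
    "2018-04-30", "2018-05-31", "2018-06-30",
])
sales = pd.DataFrame({"num_sold": [5, 17, 5, 16, 12, 12]}, index=month_ends)

# The price table only records the months in which the price changed.
price = pd.DataFrame(
    {"price": [16.0, 15.5]},
    index=pd.to_datetime(["2018-01-31", "2018-05-31"]),
)

# Align prices to the sales dates, carrying the last known price forward.
# reindex(method="ffill") requires the price index to be sorted.
price_aligned = price.reindex(sales.index, method="ffill")

# Combine horizontally and compute total sales element-wise.
combined = pd.concat([sales, price_aligned], axis=1)
combined["total_sales"] = combined["num_sold"] * combined["price"]
print(combined)
```

One design difference: reindex() produces exactly the rows present in the sales table, whereas resample() produces every period between the first and last price record.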


# All the Pandas shift() you should know for data analysis

Suppose you need to push all rows in a DataFrame up or down, or you need to use the previous row in a calculation, for example to compute the difference between consecutive rows. Pandas shift() is an ideal way to achieve these objectives.

In this article, we’ll go through some examples of manipulating data using the Pandas shift() function. We will focus on practical problems, which should help you get started with data analysis.

1. Shifting values with periods
2. Shifting time-series data with freq
3. A practical example: calculating the difference in consecutive rows
4. A practical example: calculating the 7 days difference for time-series data

## Shifting values with periods

Pandas shift() shifts the index by the desired number of periods. The simplest call takes the argument periods (it defaults to 1), which is the number of positions to shift along the desired axis. By default, values are shifted vertically along axis 0. Missing values introduced by the shift are filled with NaN.

Let’s see how this works with the help of an example.

![](./i/lGPAbHRtK1TArAzB_6hzVQ.png)

In [28]:
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [10, 20, 30, 40, 50]
})

df

Unnamed: 0,A,B
0,1,10
1,2,20
2,3,30
3,4,40
4,5,50


In [37]:
df.shift(periods=1)
# The equivalent positional form
df.shift(1)

Unnamed: 0,A,B
0,,
1,1.0,10.0
2,2.0,20.0
3,3.0,30.0
4,4.0,40.0


Unnamed: 0,A,B
0,,
1,1.0,10.0
2,2.0,20.0
3,3.0,30.0
4,4.0,40.0


To replace NaN, you can use the argument fill_value. For example, to replace NaN with 0:

In [25]:
df.shift(periods=1, fill_value=0)

Unnamed: 0,A,B
0,0,0
1,1,10
2,2,20
3,3,30
4,4,40


In [32]:
# In addition, you can pass a negative number to periods and it will shift values in the opposite direction.
df.shift(periods=-1)

Unnamed: 0,A,B
0,2.0,20.0
1,3.0,30.0
2,4.0,40.0
3,5.0,50.0
4,,


In [33]:
# To shift values horizontally, you can set axis=1
df.shift(periods=1, axis=1)

Unnamed: 0,A,B
0,,1
1,,2
2,,3
3,,4
4,,5


## Shifting time-series data with freq

Pandas shift() has an argument called freq that enables frequency-based shifting, which is useful when dealing with time-series data.

To use the argument freq, the index of the DataFrame must be a date or datetime index; otherwise, shift() raises a NotImplementedError.

Let’s see how this works with the help of an example.

In [41]:
df = pd.DataFrame({
        "A": [1, 2, 3, 4, 5],
        "B": [10, 20, 30, 40, 50]
    },  
    index=pd.date_range("2020-01-01", freq='D', periods=5)
)
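df
# Shift the index forward by 10 days; values stay attached to their rows
df.shift(freq='10D')
# The equivalent
df.shift(periods=10, freq='D')
# Without freq, the values move 10 rows down, so all 5 rows become NaN
df.shift(periods=10)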

Unnamed: 0,A,B
2020-01-01,1,10
2020-01-02,2,20
2020-01-03,3,30
2020-01-04,4,40
2020-01-05,5,50


Unnamed: 0,A,B
2020-01-11,1,10
2020-01-12,2,20
2020-01-13,3,30
2020-01-14,4,40
2020-01-15,5,50


Unnamed: 0,A,B
2020-01-11,1,10
2020-01-12,2,20
2020-01-13,3,30
2020-01-14,4,40
2020-01-15,5,50


Unnamed: 0,A,B
2020-01-01,,
2020-01-02,,
2020-01-03,,
2020-01-04,,
2020-01-05,,
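
The index requirement mentioned at the start of this section can be checked directly. This is a small sketch with made-up frames (`df_plain` and `df_dated` are names assumed for illustration): the first has the default integer index, the second a DatetimeIndex.

```python
import pandas as pd

# Default RangeIndex: there are no dates to shift by a frequency.
df_plain = pd.DataFrame({"A": [1, 2, 3]})

error_name = None
try:
    df_plain.shift(freq="D")
except NotImplementedError as exc:
    error_name = type(exc).__name__
    print(f"shift(freq=...) failed: {exc}")

# With a DatetimeIndex the same call moves the index labels, not the values.
df_dated = pd.DataFrame(
    {"A": [1, 2, 3]},
    index=pd.date_range("2020-01-01", periods=3, freq="D"),
)
shifted = df_dated.shift(freq="D")
print(shifted)
```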


## A practical example: calculating the difference in consecutive rows

Suppose you need to use the previous row value to calculate the sales change. Pandas shift() is one way to achieve this task.

In [46]:
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", freq='D', periods=5),
    "sales": [22, 30, 32, 25, 42]
})
df
# Previous row's sales value
df['shift_sales'] = df['sales'].shift(1)
# To calculate the sales change in consecutive rows.
df['diff'] = df['sales'] - df['sales'].shift(1)
df

Unnamed: 0,date,sales
0,2020-01-01,22
1,2020-01-02,30
2,2020-01-03,32
3,2020-01-04,25
4,2020-01-05,42


Unnamed: 0,date,sales,shift_sales,diff
0,2020-01-01,22,,
1,2020-01-02,30,22.0,8.0
2,2020-01-03,32,30.0,2.0
3,2020-01-04,25,32.0,-7.0
4,2020-01-05,42,25.0,17.0
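
For this particular calculation Pandas also has a built-in shortcut, diff(), which is not used in the example above. The sketch below rebuilds the same five rows under the assumed name `df_demo` and checks that both approaches agree.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "date": pd.date_range("2020-01-01", freq="D", periods=5),
    "sales": [22, 30, 32, 25, 42],
})

# Manual version: subtract the previous row's value.
manual = df_demo["sales"] - df_demo["sales"].shift(1)

# Built-in version: diff() computes the same consecutive difference.
builtin = df_demo["sales"].diff()

print(pd.DataFrame({"manual": manual, "diff()": builtin}))
```

shift() remains the more general tool, since the shifted column can feed any calculation, not only a subtraction.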


## A practical example: calculating the 7 days difference for time-series data

Now, suppose you have been asked to calculate the 7-day sales change as follows:

value_1 = Day_8 - Day_1
value_2 = Day_9 - Day_2
value_3 = Day_10 - Day_3
...
value_n = Day_(n+7) - Day_n

Pandas shift() with the argument freq would be an ideal way to achieve this task. Let’s use read_csv() with the argument parse_dates and index_col to load data into a DataFrame.

![](./i/1BJdB-9vrh5rGeO75-d_pwQ.png)

In [52]:
df = pd.read_csv('./data/time_series.csv', parse_dates=['date'], index_col=['date'])
df

the_7_days_diff = df['sales'] - df.shift(freq='7D')['sales']
the_7_days_diff

Unnamed: 0_level_0,sales
date,Unnamed: 1_level_1
2020-01-01,22
2020-01-02,30
2020-01-03,32
2020-01-04,25
2020-01-05,42
2020-01-06,20
2020-01-07,45
2020-01-09,43
2020-01-10,27


date
2020-01-01     NaN
2020-01-02     NaN
2020-01-03     NaN
2020-01-04     NaN
2020-01-05     NaN
2020-01-06     NaN
2020-01-07     NaN
2020-01-08     NaN
2020-01-09    13.0
2020-01-10    -5.0
2020-01-11     NaN
2020-01-12     NaN
2020-01-13     NaN
2020-01-14     NaN
2020-01-16     NaN
2020-01-17     NaN
Name: sales, dtype: float64

### Notice that

* There is a record for “2020-01-08” in the result, even though df has no row for that date. It comes from the index of df.shift(freq='7D'), where “2020-01-01” is moved to “2020-01-08”.
* The value of “2020-01-08” is NaN because df doesn’t have this value.
* Values for the dates from “2020-01-01” to “2020-01-07” are NaN. This is because df.shift(freq='7D') doesn’t have these values.
* The last 6 records are NaN because df doesn’t have these values.
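
The points above follow from how the subtraction aligns the two indexes. The sketch below rebuilds the same sales values inline (the variable names are assumptions for illustration) and shows two ways to trim the result: reindexing back to the original dates, or dropping the NaN rows.

```python
import pandas as pd

# Same dates as the example above: 2020-01-08 is missing from the data.
dates = pd.to_datetime([
    "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05",
    "2020-01-06", "2020-01-07", "2020-01-09", "2020-01-10",
])
sales = pd.Series([22, 30, 32, 25, 42, 20, 45, 43, 27], index=dates, name="sales")

# Subtraction aligns on the union of the original and the shifted index.
diff_7d = sales - sales.shift(freq="7D")

# Keep only the dates that exist in the original data.
diff_on_original = diff_7d.reindex(sales.index)

# Or keep only the dates where both operands exist.
valid = diff_7d.dropna()
print(valid)
```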

# Pandas convert JSON into a DataFrame
DataFrame and Series are two core data structures in Pandas. A DataFrame is 2-dimensional labeled data with rows and columns. It is like a spreadsheet or SQL table. A Series is a 1-dimensional labeled array. It is sort of like a more powerful version of the Python list. Understanding Series is very important, not only because it is one of the core data structures, but also because it is the building block of a DataFrame.

In this article, you’ll learn how to convert JSON data into a Pandas DataFrame, which should help you get started with Pandas. The article is structured as follows:

* Reading a simple JSON file
* Reading JSON from a URL
* Flattening nested list from JSON object
* Flattening nested list and dict from JSON object
* Extracting a single value from deeply nested JSON

## Reading a simple JSON file

In [61]:
df = pd.read_json('./data/simple.json')
df
df.info()
df.describe()

Unnamed: 0,id,name,math,physics,chemistry
0,A001,Tom,60,66,61
1,A002,James,89,76,51
2,A003,Jenny,79,90,78


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         3 non-null      object
 1   name       3 non-null      object
 2   math       3 non-null      int64 
 3   physics    3 non-null      int64 
 4   chemistry  3 non-null      int64 
dtypes: int64(3), object(2)
memory usage: 248.0+ bytes


Unnamed: 0,math,physics,chemistry
count,3.0,3.0,3.0
mean,76.0,77.333333,63.333333
std,14.73092,12.055428,13.650397
min,60.0,66.0,51.0
25%,69.5,71.0,56.0
50%,79.0,76.0,61.0
75%,84.0,83.0,69.5
max,89.0,90.0,78.0


## Reading JSON from a URL

In [64]:
URL = 'http://raw.githubusercontent.com/BindiChen/machine-learning/master/data-analysis/027-pandas-convert-json/data/simple.json'
df = pd.read_json(URL)
df

Unnamed: 0,id,name,math,physics,chemistry
0,A001,Tom,60,66,61
1,A002,James,89,76,51
2,A003,Jenny,79,90,78
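
read_json() also accepts a file-like object, which makes it possible to try the call without a file or network access. In the sketch below, the JSON text is an assumed reconstruction of simple.json based on the table above; the real file may be laid out differently.

```python
import io

import pandas as pd

# Assumed content of simple.json: a flat list of records.
json_text = """
[
  {"id": "A001", "name": "Tom", "math": 60, "physics": 66, "chemistry": 61},
  {"id": "A002", "name": "James", "math": 89, "physics": 76, "chemistry": 51},
  {"id": "A003", "name": "Jenny", "math": 79, "physics": 90, "chemistry": 78}
]
"""

# Wrap the string in StringIO so read_json() treats it as a file.
df_demo = pd.read_json(io.StringIO(json_text))
print(df_demo)
```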


##  Flattening nested list from JSON object

Pandas read_json() works great for flattened JSON like we have in the previous example. What about JSON with a nested list? Let’s see how to convert the following JSON into a DataFrame:

In [71]:
import json

# read_json() keeps each student as a dict in a single 'students' column
df = pd.read_json('data/nested_list.json')
df

# load data using Python JSON module
with open('data/nested_list.json', 'r') as f:
    data = json.load(f)

# Flatten the 'students' list into one row per student
df_nested_list = pd.json_normalize(data, record_path=['students'])
df_nested_list

# To include school_name and class
df_nested_list = pd.json_normalize(
    data,
    record_path=['students'],
    meta=['school_name', 'class']
)

df_nested_list

Unnamed: 0,school_name,class,students
0,ABC primary school,Year 1,"{'id': 'A001', 'name': 'Tom', 'math': 60, 'phy..."
1,ABC primary school,Year 1,"{'id': 'A002', 'name': 'James', 'math': 89, 'p..."
2,ABC primary school,Year 1,"{'id': 'A003', 'name': 'Jenny', 'math': 79, 'p..."


Unnamed: 0,id,name,math,physics,chemistry
0,A001,Tom,60,66,61
1,A002,James,89,76,51
2,A003,Jenny,79,90,78


Unnamed: 0,id,name,math,physics,chemistry,school_name,class
0,A001,Tom,60,66,61,ABC primary school,Year 1
1,A002,James,89,76,51,ABC primary school,Year 1
2,A003,Jenny,79,90,78,ABC primary school,Year 1
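
The same flattening can be reproduced with an in-memory dictionary. The structure below is an assumption about what nested_list.json looks like, inferred from the outputs above.

```python
import pandas as pd

# Assumed shape of nested_list.json: top-level fields plus a list of students.
data = {
    "school_name": "ABC primary school",
    "class": "Year 1",
    "students": [
        {"id": "A001", "name": "Tom", "math": 60, "physics": 66, "chemistry": 61},
        {"id": "A002", "name": "James", "math": 89, "physics": 76, "chemistry": 51},
        {"id": "A003", "name": "Jenny", "math": 79, "physics": 90, "chemistry": 78},
    ],
}

# record_path picks the list to expand into rows;
# meta repeats the listed top-level fields on every row.
flat = pd.json_normalize(
    data,
    record_path=["students"],
    meta=["school_name", "class"],
)
print(flat)
```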


## Flattening nested list and dict from JSON object

Next, let’s try to read more complex JSON data, with a nested list and a nested dictionary.

In [77]:

# load data using Python JSON module
with open('data/nested_mix.json', 'r') as f:
    data = json.load(f)

# Normalizing data
df = pd.json_normalize(data, record_path=['students'])

df

# To include class, president and tel, pass each nested key as a path
df = pd.json_normalize(
    data,
    record_path=['students'],
    meta=[
        'class',
        ['info', 'president'],
        ['info', 'contacts', 'tel']
    ]
)

df

Unnamed: 0,id,name,math,physics,chemistry
0,A001,Tom,60,66,61
1,A002,James,89,76,51
2,A003,Jenny,79,90,78


Unnamed: 0,id,name,math,physics,chemistry,class,info.president,info.contacts.tel
0,A001,Tom,60,66,61,Year 1,John Kasich,123456789
1,A002,James,89,76,51,Year 1,John Kasich,123456789
2,A003,Jenny,79,90,78,Year 1,John Kasich,123456789
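
Here is a self-contained sketch of the nested meta paths. The dictionary is an assumed layout for nested_mix.json, limited to the fields visible in the output above (the telephone number is stored as a string here, which is a guess). It also shows the `sep` argument, which changes the separator used in the flattened column names.

```python
import pandas as pd

# Assumed shape of nested_mix.json: a nested dict plus a list of students.
data = {
    "class": "Year 1",
    "info": {
        "president": "John Kasich",
        "contacts": {"tel": "123456789"},
    },
    "students": [
        {"id": "A001", "name": "Tom", "math": 60, "physics": 66, "chemistry": 61},
        {"id": "A002", "name": "James", "math": 89, "physics": 76, "chemistry": 51},
        {"id": "A003", "name": "Jenny", "math": 79, "physics": 90, "chemistry": 78},
    ],
}

# Each nested meta field is given as a list of keys leading to the value.
meta_paths = ["class", ["info", "president"], ["info", "contacts", "tel"]]

flat = pd.json_normalize(data, record_path=["students"], meta=meta_paths)

# sep controls how the key path is joined into a column name.
flat_underscore = pd.json_normalize(
    data, record_path=["students"], meta=meta_paths, sep="_"
)
print(flat.columns.tolist())
print(flat_underscore.columns.tolist())
```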


## Extracting a single value from deeply nested JSON

Pandas json_normalize() can do most of the work when working with nested data from a JSON file. However, it flattens the entire nested data when your goal might actually be to extract one value. For example, suppose we only want the property math, which is nested under grade for each student. The glom library supports this kind of path-based access.

In [82]:
pip install glom

Note: you may need to restart the kernel to use updated packages.


In [85]:
from glom import glom
df = pd.read_json('data/nested_deep.json')
df['students'].apply(lambda row: glom(row, 'grade.math'))

0    60
1    89
2    79
Name: students, dtype: int64
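
If installing glom is not an option, plain dictionary indexing gives the same result for a fixed path. This is a swapped-in alternative, not the glom approach shown above, and the data below is an assumed layout for nested_deep.json inferred from the 'grade.math' path.

```python
import pandas as pd

# Assumed shape of nested_deep.json: each student has a nested "grade" dict.
data = {
    "school_name": "ABC primary school",
    "students": [
        {"id": "A001", "name": "Tom", "grade": {"math": 60, "physics": 66, "chemistry": 61}},
        {"id": "A002", "name": "James", "grade": {"math": 89, "physics": 76, "chemistry": 51}},
        {"id": "A003", "name": "Jenny", "grade": {"math": 79, "physics": 90, "chemistry": 78}},
    ],
}

# Mimics the frame read_json() returns: one dict per row in 'students'.
df_demo = pd.DataFrame(data)

# Walk the nested keys by hand instead of using a glom path.
math_scores = df_demo["students"].apply(lambda student: student["grade"]["math"])
print(math_scores)
```

glom becomes more convenient when paths are deep, optional, or built dynamically; for a single known path, indexing keeps the dependency list shorter.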

https://towardsdatascience.com/all-pandas-json-normalize-you-should-know-for-flattening-json-13eae1dfb7dd
    
https://github.com/BindiChen/machine-learning/blob/master/data-analysis/028-pandas-json_normalize/pandas-json_normalize.ipynb