# Preprocessing Data with Dates

- **`Pandas`** provides extensive functionality for working with date and time data across domains.
- Building on the NumPy **datetime64** and **timedelta64** dtypes, pandas has consolidated many features from other Python libraries, such as scikits.timeseries, and added a large amount of new functionality for manipulating time series data.

## Reading and Loading Data

In [2]:
# import the pandas library
import pandas as pd
import datetime

import warnings
warnings.filterwarnings('ignore')

print(pd.__version__)

2.1.1


In [3]:
# read the data set
data = pd.read_csv('datasets/time_series.csv')

# view the top rows
data.head(5)

Unnamed: 0,ID,Datetime,Count
0,0,25-08-2012 00:00,8
1,1,25-08-2012 01:00,2
2,2,25-08-2012 02:00,6
3,3,25-08-2012 03:00,2
4,4,25-08-2012 04:00,2


In [4]:
# Check the data types
data.dtypes

ID           int64
Datetime    object
Count        int64
dtype: object

By default, `read_csv` loads date columns as strings (dtype `object`). Convert the 'Datetime' column to the **`datetime`** data type.

In [5]:
# Change type to datetime; dayfirst=True because the values are day-month-year
data['Datetime'] = pd.to_datetime(data['Datetime'], dayfirst=True)

# Check the data types
data.dtypes

ID                   int64
Datetime    datetime64[ns]
Count                int64
dtype: object
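The conversion can also happen while the file is being read. The sketch below is a minimal illustration under assumptions: the CSV text is a hypothetical in-memory stand-in for the first rows shown above, not the actual `datasets/time_series.csv` file.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV mirroring the first rows of the data set
csv_text = """ID,Datetime,Count
0,25-08-2012 00:00,8
1,25-08-2012 01:00,2
2,25-08-2012 02:00,6
"""

# parse_dates converts the named column while reading;
# dayfirst=True tells the parser that 25-08-2012 is day-month-year
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['Datetime'], dayfirst=True)

print(sample.dtypes)
```

This avoids a separate `pd.to_datetime` call when the date layout is known in advance.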

In [6]:
# Check the day name
data['Datetime'].dt.day_name().head()

0    Saturday
1    Saturday
2    Saturday
3    Saturday
4    Saturday
Name: Datetime, dtype: object

In [7]:
# Check the month name
data['Datetime'].dt.month_name().head()

0    August
1    August
2    August
3    August
4    August
Name: Datetime, dtype: object

### Loading another data set

In [8]:
# Load the data
data_2 = pd.read_csv('datasets/time_series_2.csv')
data_2.head()

Unnamed: 0,ID,Datetime,Count
0,0,25 Aug 2012,8
1,1,25 Aug 2012,2
2,2,25 Aug 2012,6
3,3,25 Aug 2012,2
4,4,25 Aug 2012,2


We can parse dates written in other layouts by specifying the format. Here are some commonly used directives.

| **Directive** | **Meaning**                                            |
| ---           | ---                                                    |
|  **%a**       | Weekday as locale’s abbreviated name.                  |
|  **%A**       | Weekday as locale’s full name.                         |
|  **%d**       | Day of the month as a zero-padded decimal number.      |
|  **%b**       | Month as locale’s abbreviated name.                    |
|  **%B**       | Month as locale’s full name.                           |
|  **%m**       | Month as a zero-padded decimal number.                 |
|  **%y**       | Year without century as a zero-padded decimal number.  |
|  **%Y**       | Year with century as a decimal number.                 |
|  **%H**       | Hour (24-hour clock) as a zero-padded decimal number.  |
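
As a small illustration of how the directives combine, the sketch below parses a string and then writes it back out in a different layout using Python's standard `datetime` module. The sample string is hypothetical, and parsing `%b` assumes an English (or default C) locale.

```python
from datetime import datetime

# Hypothetical sample string: day, abbreviated month, 4-digit year, hour
raw = '25 Aug 2012 04'
parsed = datetime.strptime(raw, '%d %b %Y %H')

# Format the same moment with a different mix of directives
short_form = parsed.strftime('%d-%m-%y %H')

print(parsed)
print(short_form)
```

The same directive strings can be passed to the `format` argument of `pd.to_datetime`.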

In [10]:
# Convert to datetime by specifying the datetime format
data_2['Datetime'] = pd.to_datetime(data_2['Datetime'], format = '%d %b %Y ')
data_2.head()

Unnamed: 0,ID,Datetime,Count
0,0,2012-08-25,8
1,1,2012-08-25,2
2,2,2012-08-25,6
3,3,2012-08-25,2
4,4,2012-08-25,2
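
Real files often contain entries that do not match the expected format. A minimal sketch, assuming a hypothetical series with one malformed value, shows how `errors='coerce'` turns such entries into `NaT` instead of raising an error:

```python
import pandas as pd

# Hypothetical values: the last entry is deliberately malformed
raw_dates = pd.Series(['25 Aug 2012', '26 Aug 2012', 'not a date'])

# errors='coerce' replaces unparseable entries with NaT instead of raising
parsed = pd.to_datetime(raw_dates, format='%d %b %Y', errors='coerce')

print(parsed)
print(parsed.isna().sum())  # number of entries that could not be parsed
```

Counting the `NaT` values afterwards is a quick way to check how many rows failed to parse.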


## Time-based features

### Creating new time-related features

In [11]:
# Check the data
data.head()

Unnamed: 0,ID,Datetime,Count
0,0,2012-08-25 00:00:00,8
1,1,2012-08-25 01:00:00,2
2,2,2012-08-25 02:00:00,6
3,3,2012-08-25 03:00:00,2
4,4,2012-08-25 04:00:00,2


In [12]:
# Create month and month_name columns
data['month'] = data['Datetime'].dt.month
data['month_name'] = data['Datetime'].dt.month_name()

# Check the data again
data.head()

Unnamed: 0,ID,Datetime,Count,month,month_name
0,0,2012-08-25 00:00:00,8,8,August
1,1,2012-08-25 01:00:00,2,8,August
2,2,2012-08-25 02:00:00,6,8,August
3,3,2012-08-25 03:00:00,2,8,August
4,4,2012-08-25 04:00:00,2,8,August


In [13]:
# Create features
data['day_name'] = data['Datetime'].dt.day_name()
data['day_of_week'] = data['Datetime'].dt.dayofweek
data['day_of_year'] = data['Datetime'].dt.dayofyear

data.head()

Unnamed: 0,ID,Datetime,Count,month,month_name,day_name,day_of_week,day_of_year
0,0,2012-08-25 00:00:00,8,8,August,Saturday,5,238
1,1,2012-08-25 01:00:00,2,8,August,Saturday,5,238
2,2,2012-08-25 02:00:00,6,8,August,Saturday,5,238
3,3,2012-08-25 03:00:00,2,8,August,Saturday,5,238
4,4,2012-08-25 04:00:00,2,8,August,Saturday,5,238
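
Other features can be derived with the same `.dt` accessor. The sketch below is illustrative only: the sample timestamps and the column names `hour`, `quarter`, and `is_weekend` are assumptions, not part of the data set above.

```python
import pandas as pd

# Hypothetical timestamps spanning Friday to Monday
sample = pd.DataFrame({'Datetime': pd.to_datetime([
    '2012-08-24 23:00', '2012-08-25 00:00',
    '2012-08-26 12:00', '2012-08-27 08:00',
])})

# Hour of the day (0-23) and calendar quarter (1-4)
sample['hour'] = sample['Datetime'].dt.hour
sample['quarter'] = sample['Datetime'].dt.quarter

# dayofweek is 0 for Monday, so 5 and 6 are Saturday and Sunday
sample['is_weekend'] = sample['Datetime'].dt.dayofweek >= 5

print(sample)
```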


### Difference between two dates

Add the current date as a new column.

In [14]:
# Creating a new column with today's date
data['today'] = pd.to_datetime(datetime.date.today())
data.head()

Unnamed: 0,ID,Datetime,Count,month,month_name,day_name,day_of_week,day_of_year,today
0,0,2012-08-25 00:00:00,8,8,August,Saturday,5,238,2024-06-25
1,1,2012-08-25 01:00:00,2,8,August,Saturday,5,238,2024-06-25
2,2,2012-08-25 02:00:00,6,8,August,Saturday,5,238,2024-06-25
3,3,2012-08-25 03:00:00,2,8,August,Saturday,5,238,2024-06-25
4,4,2012-08-25 04:00:00,2,8,August,Saturday,5,238,2024-06-25


In [15]:
# Finding the difference of the dates
date_diff = data['today'] - data['Datetime']
date_diff.head()

0   4322 days 00:00:00
1   4321 days 23:00:00
2   4321 days 22:00:00
3   4321 days 21:00:00
4   4321 days 20:00:00
dtype: timedelta64[ns]

In [19]:
# Extract only the number of whole days from the difference
date_diff.dt.days.head()

0    4322
1    4321
2    4321
3    4321
4    4321
dtype: int64

In [20]:
# Creating a new column with the difference of dates
data['day_difference'] = date_diff.dt.days

# Check the data
data.head()

Unnamed: 0,ID,Datetime,Count,month,month_name,day_name,day_of_week,day_of_year,today,day_difference
0,0,2012-08-25 00:00:00,8,8,August,Saturday,5,238,2024-06-25,4322
1,1,2012-08-25 01:00:00,2,8,August,Saturday,5,238,2024-06-25,4321
2,2,2012-08-25 02:00:00,6,8,August,Saturday,5,238,2024-06-25,4321
3,3,2012-08-25 03:00:00,2,8,August,Saturday,5,238,2024-06-25,4321
4,4,2012-08-25 04:00:00,2,8,August,Saturday,5,238,2024-06-25,4321
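
The `days` component drops the hours, which is why rows one hour apart can share the same value. When finer resolution is needed, `dt.total_seconds()` keeps it. The sketch below uses a fixed reference date (an assumption, chosen to match the output above) so the result is reproducible:

```python
import pandas as pd

# Two hypothetical rows, one hour apart
start = pd.to_datetime(pd.Series(['2012-08-25 00:00', '2012-08-25 01:00']))

# Fixed reference date instead of today's date, for reproducibility
reference = pd.Timestamp('2024-06-25')
diff = reference - start

# Whole days only (hours are discarded)
whole_days = diff.dt.days

# Full difference expressed in hours
hours = diff.dt.total_seconds() / 3600

print(whole_days.tolist())
print(hours.tolist())
```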


## Challenges with Time data

### Dealing with Time Zones

If the timestamps in a dataset were recorded in a specific time zone, use the function **`dt.tz_localize`** to attach that time zone to them.

In [21]:
# Mark the timestamps as local time in the Asia/Calcutta time zone
data['asia_timezone'] = data['Datetime'].dt.tz_localize('Asia/Calcutta')
data.head()

Unnamed: 0,ID,Datetime,Count,month,month_name,day_name,day_of_week,day_of_year,today,day_difference,asia_timezone
0,0,2012-08-25 00:00:00,8,8,August,Saturday,5,238,2024-06-25,4322,2012-08-25 00:00:00+05:30
1,1,2012-08-25 01:00:00,2,8,August,Saturday,5,238,2024-06-25,4321,2012-08-25 01:00:00+05:30
2,2,2012-08-25 02:00:00,6,8,August,Saturday,5,238,2024-06-25,4321,2012-08-25 02:00:00+05:30
3,3,2012-08-25 03:00:00,2,8,August,Saturday,5,238,2024-06-25,4321,2012-08-25 03:00:00+05:30
4,4,2012-08-25 04:00:00,2,8,August,Saturday,5,238,2024-06-25,4321,2012-08-25 04:00:00+05:30


We can also convert timestamps from their current time zone to another one with the function **`dt.tz_convert`**.

In [22]:
# Convert the Asia time zone to UTC
data['utc_timezone'] = data['asia_timezone'].dt.tz_convert('UTC')
data.head()

Unnamed: 0,ID,Datetime,Count,month,month_name,day_name,day_of_week,day_of_year,today,day_difference,asia_timezone,utc_timezone
0,0,2012-08-25 00:00:00,8,8,August,Saturday,5,238,2024-06-25,4322,2012-08-25 00:00:00+05:30,2012-08-24 18:30:00+00:00
1,1,2012-08-25 01:00:00,2,8,August,Saturday,5,238,2024-06-25,4321,2012-08-25 01:00:00+05:30,2012-08-24 19:30:00+00:00
2,2,2012-08-25 02:00:00,6,8,August,Saturday,5,238,2024-06-25,4321,2012-08-25 02:00:00+05:30,2012-08-24 20:30:00+00:00
3,3,2012-08-25 03:00:00,2,8,August,Saturday,5,238,2024-06-25,4321,2012-08-25 03:00:00+05:30,2012-08-24 21:30:00+00:00
4,4,2012-08-25 04:00:00,2,8,August,Saturday,5,238,2024-06-25,4321,2012-08-25 04:00:00+05:30,2012-08-24 22:30:00+00:00


In [23]:
# Check the timezone data
data[['asia_timezone', 'utc_timezone']].head()

Unnamed: 0,asia_timezone,utc_timezone
0,2012-08-25 00:00:00+05:30,2012-08-24 18:30:00+00:00
1,2012-08-25 01:00:00+05:30,2012-08-24 19:30:00+00:00
2,2012-08-25 02:00:00+05:30,2012-08-24 20:30:00+00:00
3,2012-08-25 03:00:00+05:30,2012-08-24 21:30:00+00:00
4,2012-08-25 04:00:00+05:30,2012-08-24 22:30:00+00:00


In [30]:
# Checking the times
time1 = data['asia_timezone'][1000]
time2 = data['utc_timezone'][1000]

print(time1, ',', time2)

2012-10-05 16:00:00+05:30 , 2012-10-05 10:30:00+00:00


The two values represent the same instant; the local clock times differ by 5 hours 30 minutes, which is the UTC offset of the Asia/Calcutta time zone.
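
A minimal sketch can confirm that localizing and converting do not change the instant itself. It assumes a single hypothetical timestamp and uses the name `Asia/Kolkata`, which refers to the same zone as `Asia/Calcutta`.

```python
import pandas as pd

# One hypothetical naive timestamp (no time zone attached)
naive = pd.Series(pd.to_datetime(['2012-10-05 16:00']))

# Attach the local time zone, then express the same instant in UTC
india = naive.dt.tz_localize('Asia/Kolkata')
utc = india.dt.tz_convert('UTC')

# Time-zone-aware timestamps compare by instant, not by clock time
same_instant = india[0] == utc[0]
offset = india[0].utcoffset()

print(india[0], ',', utc[0])
print(same_instant, offset)
```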


## Reading data with UNIX timestamp

- A UNIX timestamp stores a specific date and time as a single number.
- It counts the seconds that have elapsed since midnight UTC on 1 January 1970; present-day values are ten digits long.
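
The definition can be checked by hand with the standard library. This is a small sketch under assumptions: the value `1345852800` is used as a sample timestamp, and the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# The UNIX epoch: midnight UTC on 1 January 1970
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Sample timestamp value in seconds
seconds = 1345852800

# Adding the elapsed seconds to the epoch gives the date and time
moment = epoch + timedelta(seconds=seconds)

print(moment)
```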

In [31]:
# Read data
data = pd.read_csv('datasets/data_with_timestamp.csv')
data.head()

Unnamed: 0,ID,timestamp,Count
0,0,1345852800,8
1,1,1345856400,2
2,2,1345860000,6
3,3,1345863600,2
4,4,1345867200,2


In [138]:
# Convert the unix timestamp to datetime.
data['timestamp'] = pd.to_datetime(data['timestamp'], unit='s')

In [139]:
# view the top rows
data.head()

Unnamed: 0,ID,timestamp,Count
0,0,2012-08-25 00:00:00,8
1,1,2012-08-25 01:00:00,2
2,2,2012-08-25 02:00:00,6
3,3,2012-08-25 03:00:00,2
4,4,2012-08-25 04:00:00,2
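
The conversion also works in the other direction, and timestamps stored in milliseconds only need a different `unit`. The sketch below is illustrative and uses hypothetical in-memory values rather than the CSV file:

```python
import pandas as pd

# Seconds since the epoch -> datetime
stamps = pd.to_datetime(pd.Series([1345852800, 1345856400]), unit='s')

# datetime -> seconds since the epoch: subtract the epoch and
# floor-divide by a one-second Timedelta
seconds = (stamps - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')

# Millisecond timestamps are handled with unit='ms'
from_ms = pd.to_datetime(pd.Series([1345852800000]), unit='ms')

print(seconds.tolist())
print(from_ms[0])
```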
