# Date Extractor Tutorial

The `date_extractor_mds` package offers helper functions for extracting individual components from datetime strings formatted according to the ISO 8601 standard. This page documents the package's functions through worked examples that follow Renee, a data engineer working on a machine learning project.

## Renee's Journey

### Extracting Time Features for a Time Series Model

Renee, a data engineer at a telecommunications company, is working with a large dataset of timestamps for customer usage patterns. To build a time series model that can predict customer behavior, she needs to extract specific time-related features from strings that contain both dates and times (a.k.a. "datetime strings"). These features include the year, month, day, and time (composed of hour, minute, and second). Each of these components can help identify trends, seasonality, or daily usage patterns.

Renee starts by preparing her dataset, where each entry contains a datetime string in the ISO 8601 format (e.g., `2023-07-16T12:34:56`). She needs to break these datetime strings down into more manageable features to feed into her machine learning model.
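For orientation, Python's standard library can already parse this format. The sketch below is independent of `date_extractor_mds` and only illustrates which components such a string contains; the variable names are illustrative.

```python
from datetime import datetime

# Parse an ISO 8601 datetime string with the standard library
example = "2023-07-16T12:34:56"
parsed = datetime.fromisoformat(example)

# The individual components that can serve as features
components = {
    "year": parsed.year,
    "month": parsed.month,
    "day": parsed.day,
    "time": parsed.time(),
}
print(components)
```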

Here's how Renee can use Python and the helper functions from the `date_extractor_mds` package in her time series analysis project.

### Extracting a Year

The function `extract_year()` allows users to extract the year from datetime strings formatted according to ISO 8601. It takes one argument, which can be either a single string or a `pandas.Series` of strings. The function returns the year as either an integer or a `pandas.Series` of integers, depending on the input data type.
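This page does not show the package's source, so the following is only a sketch of the behaviour described above, not the actual implementation. The name `extract_year_sketch` is hypothetical; it relies on `datetime.fromisoformat` for single strings and `pandas.to_datetime` for a Series.

```python
from datetime import datetime

import pandas as pd


def extract_year_sketch(value):
    """Hypothetical stand-in that mimics the described input/output types."""
    if isinstance(value, str):
        # Single string in, plain Python int out
        return datetime.fromisoformat(value).year
    if isinstance(value, pd.Series):
        # Series of strings in, Series of 64-bit integers out
        return pd.to_datetime(value).dt.year.astype("int64")
    raise TypeError("expected a str or a pandas.Series of str")


single = extract_year_sketch("2023-07-16T12:34:56")
series = extract_year_sketch(
    pd.Series(["2023-07-16T12:34:56", "2024-03-25T08:15:30"])
)
```

Dispatching on the input type keeps a single entry point for both the scalar and the column use case, which matches how the functions are called in the rest of this tutorial.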

#### Getting Years From a String
Renee verifies that she gets the output she expects by extracting the year from a single example string that she defines manually.

In [65]:
from date_extractor_mds.date_extractor_mds import extract_day, extract_month, extract_time, extract_year

my_datetime = "2023-07-16T12:34:56"
extracted_year = extract_year(my_datetime)

extracted_year


2023

Renee gets the correct year out as expected. She next verifies that the output data type is also as expected:

In [66]:
type(extracted_year)

int

As expected, the type is `int`.

#### Getting Years From a `pandas.Series`

Next, Renee tests `extract_year()` on a `pandas.Series` of datetime strings. In a data analytics context, the typical use case would be to pass in a column from a `pandas.DataFrame`, since each column is itself stored as a `pandas.Series`.

This means Renee can subscript her DataFrame by column name, which returns a `pandas.Series`, and pass the result to `extract_year()`. She can then use the output to either modify an existing column in place or create a new column.
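The two options mentioned above, adding a column versus overwriting one, can be sketched without the package. In this illustration, `pandas.to_datetime(...).dt.year` stands in for `extract_year()`, and the DataFrames `df` and `overwritten` are hypothetical examples.

```python
import pandas as pd

df = pd.DataFrame({"date": ["2023-07-16T12:34:56", "2024-03-25T08:15:30"]})

# Subscripting by column name returns a pandas.Series
date_column = df["date"]
is_series = isinstance(date_column, pd.Series)

# Stand-in for extract_year(): parse the strings and take the year
years = pd.to_datetime(date_column).dt.year

# Option 1: create a new column and keep the original strings
df["year"] = years

# Option 2: overwrite the existing column (done on a copy here)
overwritten = df[["date"]].copy()
overwritten["date"] = years
```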

First, she sets up a test DataFrame containing a `date` column with two datetime strings.

In [67]:
import pandas as pd

# Set up the DataFrame
data = {'date': ["2023-07-16T12:34:56", "2024-03-25T08:15:30"]}
my_dataframe = pd.DataFrame(data)

my_dataframe

|   | date |
|---|------|
| 0 | 2023-07-16T12:34:56 |
| 1 | 2024-03-25T08:15:30 |


Above, she can see the test DataFrame.

Renee decides to create a new column in the DataFrame called `year` and populate it with the extracted years as integers.

In [68]:
my_dataframe['year'] = extract_year(my_dataframe['date'])

my_dataframe

|   | date | year |
|---|------|------|
| 0 | 2023-07-16T12:34:56 | 2023 |
| 1 | 2024-03-25T08:15:30 | 2024 |


Now Renee can see the DataFrame has an additional column, `year`, which contains the correct year for each row. Finally, she verifies that the column contains the expected type (integers):

In [69]:
my_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    2 non-null      object
 1   year    2 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 160.0+ bytes


As expected, the column `year` has the data type `int64`. Looks good!
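`info()` prints its report rather than returning a value, so for an automated check the dtype can be inspected directly. A small sketch using `pandas.api.types.is_integer_dtype`, with a hand-built `year` column standing in for the output of `extract_year()` (the name `check_df` is illustrative):

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Hand-built stand-in for the DataFrame shown above
check_df = pd.DataFrame(
    {
        "date": ["2023-07-16T12:34:56", "2024-03-25T08:15:30"],
        "year": [2023, 2024],
    }
)

# Inspect the dtype of a single column programmatically
year_dtype = check_df["year"].dtype
year_is_integer = is_integer_dtype(check_df["year"])
print(year_dtype, year_is_integer)
```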

### Extracting a Month

Like `extract_year()`, `extract_month()` accepts a single argument: an ISO 8601 datetime string or a `pandas.Series` of such strings. It returns the month as an integer or as a `pandas.Series` of integers.

#### Getting Months From a String

Renee again starts by extracting from a single string:

In [70]:
my_datetime = "2023-07-16T12:34:56"
extracted_month = extract_month(my_datetime)

extracted_month


7

Looks good. The data type is also correct:

In [71]:
type(extracted_month)

int

#### Getting Months From a `pandas.Series`

Renee performs another test on her DataFrame, adding a `month` column this time:

In [72]:
my_dataframe['month'] = extract_month(my_dataframe['date'])

my_dataframe

|   | date | year | month |
|---|------|------|-------|
| 0 | 2023-07-16T12:34:56 | 2023 | 7 |
| 1 | 2024-03-25T08:15:30 | 2024 | 3 |


Finally, she again confirms the new column is of the correct type:

In [73]:
my_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    2 non-null      object
 1   year    2 non-null      int64 
 2   month   2 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 176.0+ bytes


### Extracting a Day

Like the last two functions Renee has tested, `extract_day()` returns the day as an integer.

#### Getting Days From a String

Once again, she first tests the `extract_day()` function on a single string:

In [74]:
my_datetime = "2023-07-16T12:34:56"
extracted_day = extract_day(my_datetime)

extracted_day

16

Looks good, and the data type is still as expected:

In [75]:
type(extracted_day)

int

#### Getting Days From a `pandas.Series`

She then confirms that `extract_day()` works on a `pandas.Series`, as the previous functions did:

In [76]:
my_dataframe['day'] = extract_day(my_dataframe['date'])

my_dataframe

|   | date | year | month | day |
|---|------|------|-------|-----|
| 0 | 2023-07-16T12:34:56 | 2023 | 7 | 16 |
| 1 | 2024-03-25T08:15:30 | 2024 | 3 | 25 |


Great! And she notes the new column is also `int64`, as expected:

In [77]:
my_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    2 non-null      object
 1   year    2 non-null      int64 
 2   month   2 non-null      int64 
 3   day     2 non-null      int64 
dtypes: int64(3), object(1)
memory usage: 192.0+ bytes


### Extracting a Time

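The function `extract_time()` extracts the time component (hour, minute, and second) from an ISO 8601 datetime string. Like the other functions, it accepts a single string or a `pandas.Series` of strings. Unlike them, it returns a `datetime.time` object rather than an integer.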

#### Getting a Time From a String

Renee first tests `extract_time()` on a single string:

In [78]:
my_datetime = "2023-07-16T12:34:56"
extracted_time = extract_time(my_datetime)

extracted_time

datetime.time(12, 34, 56)

The result contains the expected hour, minute, and second. She then checks its type:

In [79]:
type(extracted_time)

datetime.time
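A `datetime.time` object exposes the hour, minute, and second as attributes, so the time can be split further if needed. The sketch below builds an equivalent object directly with the standard library rather than calling `extract_time()`:

```python
from datetime import time

# Same value as the output shown above
example_time = time(12, 34, 56)

hour = example_time.hour
minute = example_time.minute
second = example_time.second

# isoformat() gives back the HH:MM:SS text form
as_text = example_time.isoformat()
print(hour, minute, second, as_text)
```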

#### Getting Times From a `pandas.Series`

Next, she applies `extract_time()` to her DataFrame, adding a `time` column:

In [80]:
my_dataframe['time'] = extract_time(my_dataframe['date'])

my_dataframe

|   | date | year | month | day | time |
|---|------|------|-------|-----|------|
| 0 | 2023-07-16T12:34:56 | 2023 | 7 | 16 | 12:34:56 |
| 1 | 2024-03-25T08:15:30 | 2024 | 3 | 25 | 08:15:30 |


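Each row shows the expected time. Finally, she checks the type of the new column, which is reported as `object` rather than `int64`: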

In [81]:
my_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    2 non-null      object
 1   year    2 non-null      int64 
 2   month   2 non-null      int64 
 3   day     2 non-null      int64 
 4   time    2 non-null      object
dtypes: int64(3), object(2)
memory usage: 208.0+ bytes
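Because an `object` column is not numeric, a model may need the time in another form. One possible approach, offered as an assumption rather than something the package provides, is to convert each `datetime.time` to seconds since midnight. The helper `seconds_since_midnight` and the hand-built column below are hypothetical:

```python
from datetime import time

import pandas as pd


def seconds_since_midnight(t):
    """Hypothetical helper: convert a datetime.time to an integer second count."""
    return t.hour * 3600 + t.minute * 60 + t.second


# Hand-built stand-in for the `time` column shown above
times = pd.Series([time(12, 34, 56), time(8, 15, 30)])

# Apply the helper to every element of the Series
seconds = times.map(seconds_since_midnight)
print(seconds.tolist())
```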
