#### Pandas Tutorial - Part 51

This notebook covers various Series methods including:
- Normalizing datetime values with `dt.normalize()`
- Formatting datetime values with `dt.strftime()`
- String extraction methods: `str.extractall()`, `str.find()`, and `str.findall()`

In [1]:
import re

import pandas as pd

##### Normalizing Datetime Values with `dt.normalize()`

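The `dt.normalize()` method resets the time component of each value to midnight (00:00:00) while keeping the date, the datetime dtype, and any timezone. This is useful when only the calendar date matters, for example when grouping or comparing by day.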

In [2]:
# Create a datetime Series (lowercase 'h' is the hourly alias; uppercase 'H' is deprecated in recent pandas)
idx = pd.date_range(start='2023-01-01 10:00', freq='h', periods=5)
s = pd.Series(idx)
print("Original datetime Series:")
print(s)

Original datetime Series:
0   2023-01-01 10:00:00
1   2023-01-01 11:00:00
2   2023-01-01 12:00:00
3   2023-01-01 13:00:00
4   2023-01-01 14:00:00
dtype: datetime64[ns]


In [3]:
# Normalize the datetime values
s_normalized = s.dt.normalize()
print("Normalized datetime Series:")
print(s_normalized)

Normalized datetime Series:
0   2023-01-01
1   2023-01-01
2   2023-01-01
3   2023-01-01
4   2023-01-01
dtype: datetime64[ns]


In [4]:
# Create a timezone-aware datetime Series (lowercase 'h' is the hourly alias; 'H' is deprecated in recent pandas)
idx_tz = pd.date_range(start='2023-01-01 10:00', freq='h', periods=5, tz='Asia/Calcutta')
s_tz = pd.Series(idx_tz)
print("Original datetime Series with timezone:")
print(s_tz)

Original datetime Series with timezone:
0   2023-01-01 10:00:00+05:30
1   2023-01-01 11:00:00+05:30
2   2023-01-01 12:00:00+05:30
3   2023-01-01 13:00:00+05:30
4   2023-01-01 14:00:00+05:30
dtype: datetime64[ns, Asia/Calcutta]


In [5]:
# Normalize the datetime values with timezone
s_tz_normalized = s_tz.dt.normalize()
print("Normalized datetime Series with timezone:")
print(s_tz_normalized)

Normalized datetime Series with timezone:
0   2023-01-01 00:00:00+05:30
1   2023-01-01 00:00:00+05:30
2   2023-01-01 00:00:00+05:30
3   2023-01-01 00:00:00+05:30
4   2023-01-01 00:00:00+05:30
dtype: datetime64[ns, Asia/Calcutta]
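
The output above suggests that, for timezone-aware data, midnight means local midnight in the Series' own timezone rather than midnight UTC. The following is a minimal sketch of that behaviour; the two timestamps are made up for illustration.

```python
import pandas as pd

# Hypothetical timestamps, localized to the same timezone used above.
naive = pd.Series(pd.to_datetime(["2023-01-01 10:00", "2023-01-01 23:30"]))
local = naive.dt.tz_localize("Asia/Calcutta")

# normalize() snaps each value to midnight in the local timezone.
local_midnight = local.dt.normalize()

# Converting to UTC shows that local midnight is 18:30 UTC on the previous day.
as_utc = local_midnight.dt.tz_convert("UTC")
print(as_utc)
```

If UTC midnight is what you need, one option is to convert with `dt.tz_convert("UTC")` first and normalize afterwards.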


In [6]:
# Create a datetime Series with different dates
dates = ['2023-01-01 10:30:45', '2023-01-02 12:15:30', '2023-01-03 18:45:00']
s_mixed = pd.Series(pd.to_datetime(dates))
print("Original datetime Series with different dates:")
print(s_mixed)

Original datetime Series with different dates:
0   2023-01-01 10:30:45
1   2023-01-02 12:15:30
2   2023-01-03 18:45:00
dtype: datetime64[ns]


In [7]:
# Normalize the datetime values
s_mixed_normalized = s_mixed.dt.normalize()
print("Normalized datetime Series with different dates:")
print(s_mixed_normalized)

Normalized datetime Series with different dates:
0   2023-01-01
1   2023-01-02
2   2023-01-03
dtype: datetime64[ns]
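
Because the normalized values are still datetimes, they can serve directly as grouping keys or be subtracted from the original values. The sketch below uses a small made-up event log (the names `events`, `days`, and `events_per_day` are illustrative, not part of the notebook above).

```python
import pandas as pd

# Hypothetical event log: three timestamps spread over two calendar days.
events = pd.Series(pd.to_datetime([
    "2023-01-01 10:30:45",
    "2023-01-01 23:59:59",
    "2023-01-02 00:00:01",
]))

# normalize() keeps a datetime dtype, so the result still supports
# datetime comparisons and arithmetic.
days = events.dt.normalize()

# Count events per calendar day by grouping on the normalized values.
events_per_day = events.groupby(days).size()
print(events_per_day)

# Subtracting the normalized value gives the time elapsed since midnight.
time_since_midnight = events - days
print(time_since_midnight)
```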


##### Formatting Datetime Values with `dt.strftime()`

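The `dt.strftime()` method converts each datetime value to a string using the given format codes, which are the same directives used by Python's `datetime.strftime`. The result is a Series of strings, not datetimes.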

In [8]:
# Create a datetime Series (lowercase 'h' is the hourly alias; uppercase 'H' is deprecated in recent pandas)
rng = pd.date_range(pd.Timestamp("2023-01-01 09:00"), periods=5, freq='h')
s = pd.Series(rng)
print("Original datetime Series:")
print(s)

Original datetime Series:
0   2023-01-01 09:00:00
1   2023-01-01 10:00:00
2   2023-01-01 11:00:00
3   2023-01-01 12:00:00
4   2023-01-01 13:00:00
dtype: datetime64[ns]


In [9]:
# Format datetime values with strftime
s_formatted = s.dt.strftime("%Y-%m-%d %H:%M:%S")
print("Formatted datetime Series:")
print(s_formatted)

Formatted datetime Series:
0    2023-01-01 09:00:00
1    2023-01-01 10:00:00
2    2023-01-01 11:00:00
3    2023-01-01 12:00:00
4    2023-01-01 13:00:00
dtype: object


In [10]:
# Format datetime values with different format
s_formatted_short = s.dt.strftime("%d/%m/%Y")
print("Formatted datetime Series (short format):")
print(s_formatted_short)

Formatted datetime Series (short format):
0    01/01/2023
1    01/01/2023
2    01/01/2023
3    01/01/2023
4    01/01/2023
dtype: object


In [11]:
# Format datetime values with day name
s_formatted_day = s.dt.strftime("%A, %B %d, %Y")
print("Formatted datetime Series with day name:")
print(s_formatted_day)

Formatted datetime Series with day name:
0    Sunday, January 01, 2023
1    Sunday, January 01, 2023
2    Sunday, January 01, 2023
3    Sunday, January 01, 2023
4    Sunday, January 01, 2023
dtype: object


In [12]:
# Format datetime values with time only
s_formatted_time = s.dt.strftime("%I:%M %p")
print("Formatted datetime Series with time only:")
print(s_formatted_time)

Formatted datetime Series with time only:
0    09:00 AM
1    10:00 AM
2    11:00 AM
3    12:00 PM
4    01:00 PM
dtype: object


In [13]:
# Create a timezone-aware datetime Series (lowercase 'h' is the hourly alias; 'H' is deprecated in recent pandas)
rng_tz = pd.date_range(pd.Timestamp("2023-01-01 09:00"), periods=5, freq='h', tz='Europe/Berlin')
s_tz = pd.Series(rng_tz)
print("Original datetime Series with timezone:")
print(s_tz)

Original datetime Series with timezone:
0   2023-01-01 09:00:00+01:00
1   2023-01-01 10:00:00+01:00
2   2023-01-01 11:00:00+01:00
3   2023-01-01 12:00:00+01:00
4   2023-01-01 13:00:00+01:00
dtype: datetime64[ns, Europe/Berlin]


In [14]:
# Format datetime values with timezone
s_tz_formatted = s_tz.dt.strftime("%Y-%m-%d %H:%M:%S %Z")
print("Formatted datetime Series with timezone:")
print(s_tz_formatted)

Formatted datetime Series with timezone:
0    2023-01-01 09:00:00 CET
1    2023-01-01 10:00:00 CET
2    2023-01-01 11:00:00 CET
3    2023-01-01 12:00:00 CET
4    2023-01-01 13:00:00 CET
dtype: object
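
Since `dt.strftime()` returns strings, the formatted values no longer behave like dates. The sketch below, using made-up dates, shows one consequence: day-first strings sort alphabetically rather than chronologically, and parsing them back with `pd.to_datetime` and the same format string restores date semantics.

```python
import pandas as pd

# Hypothetical dates, deliberately out of chronological order.
stamps = pd.Series(pd.to_datetime(["2023-02-01", "2023-01-15", "2022-12-31"]))

# Day-first strings sort character by character, not by date.
as_text = stamps.dt.strftime("%d/%m/%Y")
print(as_text.sort_values().tolist())
# ['01/02/2023', '15/01/2023', '31/12/2022']

# Parsing with the same format string recovers the datetime values,
# which then sort chronologically.
parsed = pd.to_datetime(as_text, format="%d/%m/%Y")
print(parsed.sort_values().tolist())
```

A common approach is to keep the datetime column for computation and format it only at the point of display or export.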


##### String Extraction Methods

Pandas provides several methods for extracting and finding patterns in strings.

###### Extracting All Matches with `str.extractall()`

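The `str.extractall()` method extracts the capture groups from every match of a regular expression. It returns a DataFrame with one column per group and one row per match; the rows carry a MultiIndex whose first level is the original index label and whose second level, `match`, numbers the matches within each string. Strings with no match produce no rows.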

In [15]:
# Create a Series with strings
s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
print("Original Series:")
print(s)

Original Series:
A    a1a2
B      b1
C      c1
dtype: object


In [16]:
# Extract all matches with one group
result = s.str.extractall(r"[ab](\d)")
print("Result of extractall with one group:")
print(result)

Result of extractall with one group:
         0
  match   
A 0      1
  1      2
B 0      1


In [17]:
# Extract all matches with named group
result_named = s.str.extractall(r"[ab](?P<digit>\d)")
print("Result of extractall with named group:")
print(result_named)

Result of extractall with named group:
        digit
  match      
A 0         1
  1         2
B 0         1


In [18]:
# Extract all matches with two groups
result_two_groups = s.str.extractall(r"(?P<letter>[ab])(?P<digit>\d)")
print("Result of extractall with two groups:")
print(result_two_groups)

Result of extractall with two groups:
        letter digit
  match             
A 0          a     1
  1          a     2
B 0          b     1


In [19]:
# Extract all matches with optional group
result_optional = s.str.extractall(r"(?P<letter>[ab])?(?P<digit>\d)")
print("Result of extractall with optional group:")
print(result_optional)

Result of extractall with optional group:
        letter digit
  match             
A 0          a     1
  1          a     2
B 0          b     1
C 0        NaN     1


In [20]:
# Create a Series with more complex strings
s_complex = pd.Series(['foo 123 bar 456', 'bar 789 foo', 'foo 123 456'])
print("Original Series with complex strings:")
print(s_complex)

Original Series with complex strings:
0    foo 123 bar 456
1        bar 789 foo
2        foo 123 456
dtype: object


In [21]:
# Extract all numbers
result_complex = s_complex.str.extractall(r'(\d+)')
print("Result of extractall for numbers:")
print(result_complex)

Result of extractall for numbers:
           0
  match     
0 0      123
  1      456
1 0      789
2 0      123
  1      456


In [22]:
# Extract words and numbers
result_complex_words = s_complex.str.extractall(r'(?P<word>foo|bar) (?P<number>\d+)')
print("Result of extractall for words and numbers:")
print(result_complex_words)

Result of extractall for words and numbers:
        word number
  match            
0 0      foo    123
  1      bar    456
1 0      bar    789
2 0      foo    123
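
The extracted values are strings, and the first level of the result's MultiIndex points back to the original row. The following sketch, which reuses the `s_complex` data from above, converts the matches to integers and sums them per input string; the names `matches`, `numbers`, and `totals` are illustrative.

```python
import pandas as pd

s_complex = pd.Series(['foo 123 bar 456', 'bar 789 foo', 'foo 123 456'])

# Name the group so the resulting column has a readable label.
matches = s_complex.str.extractall(r'(?P<number>\d+)')

# Extracted values are strings; convert them before doing arithmetic.
numbers = matches['number'].astype(int)

# Level 0 of the MultiIndex is the original row label, so grouping on it
# aggregates the matches back to one value per input string.
totals = numbers.groupby(level=0).sum()
print(totals.tolist())
# [579, 789, 579]
```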


###### Finding Substrings with `str.find()`

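The `str.find()` method returns, for each string, the lowest index at which the substring occurs, or -1 if it does not occur. Optional `start` and `end` arguments restrict the search to a slice of each string.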

In [23]:
# Create a Series with strings
s = pd.Series(['apple', 'banana', 'cherry'])
print("Original Series:")
print(s)

Original Series:
0     apple
1    banana
2    cherry
dtype: object


In [24]:
# Find substring 'a'
result = s.str.find('a')
print("Result of find('a'):")
print(result)

Result of find('a'):
0    0
1    1
2   -1
dtype: int64


In [25]:
# Find substring 'an'
result_an = s.str.find('an')
print("Result of find('an'):")
print(result_an)

Result of find('an'):
0   -1
1    1
2   -1
dtype: int64


In [26]:
# Find substring 'z'
result_z = s.str.find('z')
print("Result of find('z'):")
print(result_z)

Result of find('z'):
0   -1
1   -1
2   -1
dtype: int64


In [27]:
# Find substring 'a' with start index
result_start = s.str.find('a', 1)
print("Result of find('a', 1):")
print(result_start)

Result of find('a', 1):
0   -1
1    1
2   -1
dtype: int64


In [28]:
# Find substring 'a' with start and end indices
result_start_end = s.str.find('a', 1, 3)
print("Result of find('a', 1, 3):")
print(result_start_end)

Result of find('a', 1, 3):
0   -1
1    1
2   -1
dtype: int64
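
Two details are easy to miss when using `str.find()`: the -1 result is a sentinel rather than a usable position, and the argument is matched as a literal substring, not as a regular expression. The sketch below illustrates both and adds `str.rfind()`, which reports the highest index instead of the lowest; the `dotted` Series is made up for the demonstration.

```python
import pandas as pd

s = pd.Series(['apple', 'banana', 'cherry'])

# -1 means "not found", so test against it explicitly before using
# the result as a position.
positions = s.str.find('a')
has_a = positions >= 0
print(s[has_a].tolist())
# ['apple', 'banana']

# rfind() searches from the right and returns the highest matching index.
print(s.str.rfind('a').tolist())
# [0, 5, -1]

# The argument is a literal substring: '.' matches only a real dot.
dotted = pd.Series(['a.b', 'axb'])
print(dotted.str.find('.').tolist())
# [1, -1]
```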


###### Finding All Occurrences with `str.findall()`

The `str.findall()` method returns, for each string, a list of all non-overlapping matches of a regular expression. A string with no match yields an empty list.

In [29]:
# Create a Series with strings
s = pd.Series(['Lion', 'Monkey', 'Rabbit'])
print("Original Series:")
print(s)

Original Series:
0      Lion
1    Monkey
2    Rabbit
dtype: object


In [30]:
# Find all occurrences of 'Monkey'
result = s.str.findall('Monkey')
print("Result of findall('Monkey'):")
print(result)

Result of findall('Monkey'):
0          []
1    [Monkey]
2          []
dtype: object


In [31]:
# Find all occurrences of 'MONKEY'
result_upper = s.str.findall('MONKEY')
print("Result of findall('MONKEY'):")
print(result_upper)

Result of findall('MONKEY'):
0    []
1    []
2    []
dtype: object


In [32]:
# Find all occurrences of 'MONKEY' with case-insensitive flag
result_ignore_case = s.str.findall('MONKEY', flags=re.IGNORECASE)
print("Result of findall('MONKEY', flags=re.IGNORECASE):")
print(result_ignore_case)

Result of findall('MONKEY', flags=re.IGNORECASE):
0          []
1    [Monkey]
2          []
dtype: object


In [33]:
# Create a Series with more complex strings
s_complex = pd.Series(['apple and banana', 'orange, apple and pear', 'apple, orange'])
print("Original Series with complex strings:")
print(s_complex)

Original Series with complex strings:
0          apple and banana
1    orange, apple and pear
2             apple, orange
dtype: object


In [34]:
# Find all occurrences of fruits
result_complex = s_complex.str.findall(r'apple|banana|orange|pear')
print("Result of findall for fruits:")
print(result_complex)

Result of findall for fruits:
0          [apple, banana]
1    [orange, apple, pear]
2          [apple, orange]
dtype: object


In [35]:
# Find all words
result_words = s_complex.str.findall(r'\w+')
print("Result of findall for words:")
print(result_words)

Result of findall for words:
0          [apple, and, banana]
1    [orange, apple, and, pear]
2               [apple, orange]
dtype: object
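
Each element of the `findall()` result is a Python list, so follow-up analysis usually needs one more step. The sketch below, reusing the fruit data from above, shows one possible way to count matches per row and to tally matches across the whole Series with `explode()`; the variable names are illustrative.

```python
import pandas as pd

s_complex = pd.Series(['apple and banana', 'orange, apple and pear', 'apple, orange'])
found = s_complex.str.findall(r'apple|banana|orange|pear')

# Number of matches in each row.
n_matches = found.map(len)
print(n_matches.tolist())
# [2, 3, 2]

# explode() produces one row per match and repeats the original index label.
long_form = found.explode()
print(long_form.index.tolist())
# [0, 0, 1, 1, 1, 2, 2]

# Tally how often each fruit appears across all strings.
counts = long_form.value_counts()
print(counts['apple'], counts['orange'])
# 3 2
```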


##### Conclusion

In this notebook, we've explored various Series methods in pandas:

1. `dt.normalize()`: Converts times to midnight (00:00:00), which is useful when the time component doesn't matter.
2. `dt.strftime()`: Formats datetime values to strings using a specified format, allowing for customized date and time representations.
3. String extraction methods:
   - `str.extractall()`: Extracts groups from all matches of a regular expression pattern, returning a DataFrame with one row for each match and one column for each group.
   - `str.find()`: Returns the lowest index where the substring is found, or -1 if not found.
   - `str.findall()`: Finds all occurrences of a pattern or regular expression, returning a list of matches for each string.

Together, these methods cover common datetime cleanup and string pattern-matching tasks in pandas.