In [1]:
import time

In [2]:
import numpy as np
import pandas as pd

In [3]:
np.random.seed(0)

In [4]:
pd.options.mode.copy_on_write = True

# String Manipulation

Python is well-known for its ease of handling strings and text. The string object’s built-in methods simplify most text operations, while complex pattern matching tasks can be handled with regular expressions.

To let you apply string methods and regular expressions concisely to entire arrays of data, pandas provides the `.str` accessor on Series and Index. Its vectorized string operations propagate missing (NA) values instead of raising errors, and many of them take parameters that control regular-expression behavior, case sensitivity, and how missing data is treated.
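As a minimal sketch of those parameters, the cell below uses `str.contains` on a small made-up Series (the values are illustrative only) to show NA propagation, `case`, `na`, and `regex`:

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value.
s = pd.Series(['Apple', 'banana', np.nan, 'Cherry'], dtype='string')

# By default the missing entry propagates: the third result is <NA>.
has_a = s.str.contains('a')

# case=False ignores case; na=False reports missing entries as "no match".
has_a_any_case = s.str.contains('a', case=False, na=False)

# regex=False treats the pattern as a literal substring,
# so '.' is a plain dot here, not a wildcard.
has_dot = s.str.contains('.', regex=False, na=False)

print(has_a)
print(has_a_any_case)
print(has_dot)
```

Other `.str` methods expose a similar, though not identical, set of options, so it is worth checking each method's signature.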

The `.str` attribute is accessible only for text data types, and pandas offers two ways to store text data:

- **object** dtype NumPy array
- **StringDtype** extension type

It's recommended to use `StringDtype` for storing text data since the `object` dtype is a generic dtype inherited from NumPy arrays, used to store any Python object type.

In contrast, `StringDtype` comes from pandas' new extension type system, which allows for data types that NumPy does not support natively. These data types are treated as first-class citizens and contain specific optimizations that make them computationally efficient.
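A short sketch of the two storage options, assuming small illustrative Series (the variable names are not part of the notebook). The `object` dtype is requested explicitly so the example behaves the same regardless of how a given pandas version infers string data:

```python
import pandas as pd

# Generic object dtype: each element is an arbitrary Python object.
s_obj = pd.Series(['apple', None, 'cherry'], dtype=object)

# Dedicated text dtype, requested with the 'string' alias.
s_str = pd.Series(['apple', None, 'cherry'], dtype='string')

# An existing object Series can also be converted with astype.
s_converted = s_obj.astype('string')

print(s_obj.dtype)
print(s_str.dtype)
print(s_converted.dtype)

# The .str accessor is available on both.
print(s_obj.str.upper())
print(s_str.str.upper())
```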

## Vectorized String Methods 

The `str` accessor implements most of the built-in string handling methods available in Python, making it as straightforward to work with as traditional Python strings.

The complete list of available methods can be found in the [official pandas docs](https://pandas.pydata.org/docs/user_guide/text.html#method-summary).

In [5]:
# Concatenate strings in the Series/Index with a given separator.

s1 = pd.Series(['arthur', 'bruna', 'camila'])
s2 = pd.Series(['costa', 'silva', 'oliveira'])

s1.str.cat(s2, sep='.') + '@email.com'

0       arthur.costa@email.com
1        bruna.silva@email.com
2    camila.oliveira@email.com
dtype: object

In [6]:
# Counts occurrences of a pattern in each string.

s1 = pd.Series(['apple', 'banana', 'cherry'])
s1.str.count('a')

0    1
1    3
2    0
dtype: int64

In [7]:
# Computes length of each string in the Series/Index.

s1 = pd.Series(['apple', 'banana', 'cherry'])
s1.str.len()

0    5
1    6
2    6
dtype: int64

In [8]:
# Checks if each string in the Series/Index contains a pattern.

s1 = pd.Series(['apple', 'banana', 'cherry'])
s1.str.contains('a')

0     True
1     True
2    False
dtype: bool

In [9]:
# Check if each string starts with a match of a pattern.

s1 = pd.Series(['apple', 'banana', 'cherry'])
s1.str.match('a')

0     True
1    False
2    False
dtype: bool

In [10]:
# Finds all occurrences of a pattern in each string.

s1 = pd.Series(['apple', 'banana', 'cherry'])
s1.str.findall('a')

0          [a]
1    [a, a, a]
2           []
dtype: object

In [11]:
# Replaces occurrences of pattern/regex with another string.

s1 = pd.Series(['apple', 'banana', 'cherry'])
s1.str.replace('a', 'o')

0     opple
1    bonono
2    cherry
dtype: object

In [12]:
# Joins lists contained as elements in the Series/Index with passed delimiter.

s1 = pd.Series([['a', 'b', 'c'], ['d', 'e', 'f']])
s1.str.join('-')

0    a-b-c
1    d-e-f
dtype: object

In [13]:
# Splits each string in the Series/Index by given delimiter.
#
# Tip: elements of the resulting lists can be accessed with .str.get() or
# .str[] notation, and expand=True spreads the split elements
# into separate columns.

s1 = pd.Series(['a-b-c', 'd-e-f'])
s1.str.split('-')

0    [a, b, c]
1    [d, e, f]
dtype: object
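The tip in the cell above can be sketched as follows, reusing the same two strings (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series(['a-b-c', 'd-e-f'])

# Access one element of each split list with [] or .get().
first = s.str.split('-').str[0]
last = s.str.split('-').str.get(-1)

# expand=True returns a DataFrame with one column per split element.
wide = s.str.split('-', expand=True)

print(first)
print(last)
print(wide)
```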

In [14]:
# Removes leading and trailing whitespace (the default) from each string.

s1 = pd.Series(['  apple  ', '  banana', 'cherry   '])
s1.str.strip()

0     apple
1    banana
2    cherry
dtype: object

In [15]:
# Slice the first 3 characters of each string

s1 = pd.Series(['apple', 'banana', 'cherry'])
s1.str.slice(start=0, stop=3)

0    app
1    ban
2    che
dtype: object

In [16]:
# Extract dummy variables from string columns.
# For example if they are separated by a '|':

s1 = pd.Series(["a", "a|b", np.nan, "a|c"])
s1

0      a
1    a|b
2    NaN
3    a|c
dtype: object

In [17]:
s1.str.get_dummies(sep="|")

,a,b,c
0,1,0,0
1,1,1,0
2,0,0,0
3,1,0,1


## Regular Expressions

Most methods of the `str` accessor support regular expressions (regex). By using regex patterns, we can perform powerful and flexible string manipulations.

In [18]:
# Regex pattern to find 'a' followed by any character and then 'n'.
# None of these strings match: in 'banana', every 'a' is followed directly by 'n'.

s1 = pd.Series(['apple', 'banana', 'cherry'])
s1.str.contains(r'a.n')

0    False
1    False
2    False
dtype: bool

In [19]:
# Counts occurrences of a pattern in each string.

s1 = pd.Series(['apple', 'banana', 'cherry'])
s1.str.count(r'na')

0    0
1    2
2    0
dtype: int64

In [20]:
# Check if strings match pattern of a letter followed by a digit

s1 = pd.Series(['a1', 'bx', 'c3'])
s1.str.match(r'[a-z]\d')  

0     True
1    False
2     True
dtype: bool
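Beyond testing for matches, regex patterns can also pull pieces out of each string or rewrite them. The sketch below uses `str.extract` with named capture groups and `str.replace` with `regex=True`; the group names `letter` and `digit` are chosen here purely for illustration:

```python
import pandas as pd

s = pd.Series(['a1', 'bx', 'c3'])

# Each named group becomes a column; rows that do not match are filled with NaN.
parts = s.str.extract(r'(?P<letter>[a-z])(?P<digit>\d)')

# regex=True makes the pattern a regular expression rather than a literal string.
masked = s.str.replace(r'\d', '#', regex=True)

print(parts)
print(masked)
```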

## Python Operators: Arithmetic and Boolean

As seen in the *Essential Basic Functionality* notebook, pandas supports using traditional arithmetic and boolean operators to handle text data on Series and DataFrames, enabling element-wise operations similar to those in traditional Python string manipulation.

The drawback is that the overloaded Python operators take no options, whereas the `str` accessor methods offer optional parameters that control their behavior and support more complex string manipulations, especially when handling missing data.
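To make the difference concrete, here is a small sketch with made-up data containing missing values. The `+` operator can only propagate the missing value, while `str.cat` accepts `na_rep` to substitute a placeholder (the placeholder text `'unknown'` is an arbitrary choice):

```python
import numpy as np
import pandas as pd

s1 = pd.Series(['arthur', np.nan, 'camila'], dtype='string')
s2 = pd.Series(['costa', 'silva', np.nan], dtype='string')

# Operator: any row with a missing operand becomes missing.
plus = s1 + '.' + s2

# Accessor method: na_rep replaces missing values before concatenating.
cat = s1.str.cat(s2, sep='.', na_rep='unknown')

print(plus)
print(cat)
```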

In [21]:
# Concatenate two Series element-wise or
# Series with a scalar string.

s1 = pd.Series(['arthur', 'bruna', 'camila'])
s2 = pd.Series(['costa', 'silva', 'oliveira'])

s1 + '.' + s2 + '@email.com'

0       arthur.costa@email.com
1        bruna.silva@email.com
2    camila.oliveira@email.com
dtype: object

In [22]:
# Repeats each string in the Series
# a specified number of times.

s1 = pd.Series(['a', 'b', 'c'])
s1 * 3

0    aaa
1    bbb
2    ccc
dtype: object

In [23]:
# Equality comparison
s1 = pd.Series(['apple', 'banana', 'cherry'])
s1 == 'apple'

0     True
1    False
2    False
dtype: bool

In [24]:
# Slicing

s1 = pd.Series(['apple', 'banana', 'cherry'])
s1.str[:3]

0    app
1    ban
2    che
dtype: object

# String Manipulation on Index and Columns

String methods on Index objects are especially useful for cleaning up or transforming DataFrame columns. For instance, if you have columns with leading or trailing whitespace, you can use the `.str` accessor on `df.columns` since it is an Index object.

In [25]:
df = pd.DataFrame({
    'FruiTs': ['apple', 'banana', 'cherry'],
    '   Color_  ': ['red', 'yellow', 'red'],
    }, index=['a', 'b', 'c']
)
df

,FruiTs,Color_
a,apple,red
b,banana,yellow
c,cherry,red


In [26]:
# Clean up DataFrame column names

df.columns = df.columns.str.lower().str.strip().str.replace('_', '')
df

,fruits,color
a,apple,red
b,banana,yellow
c,cherry,red


In [27]:
df[df.index.str.contains('a')]

,fruits,color
a,apple,red


# Enhancing Performance

As mentioned earlier, it's recommended to use `StringDtype` for storing text data because the `object` dtype is a generic type inherited from NumPy arrays, used to store any Python object type.

In contrast, `StringDtype` comes from pandas' new extension type system and includes specific optimizations to enhance computational efficiency. `StringDtype` arrays generally use much less memory and are often faster for operations on large datasets, especially when backed by the Apache Arrow memory layout.

To illustrate the difference between the default `object` dtype and `StringDtype` in string manipulation, this section includes a very simple and naive benchmark.

The benchmark creates two DataFrames, each with a single column named `'fruits'` and 1 million rows whose values are drawn from five fruit names. It times three operations (length calculation, replacement, and containment) and also compares the memory footprint of the column.

The benchmark uses the PyArrow-backed variant, `pd.StringDtype('pyarrow')`, which requires the `pyarrow` package to be installed.

In [28]:
def show_time_results(object_results: dict, string_results: dict):
    speedup = {op: object_results[op] / string_results[op] for op in object_results if op != 'memory_usage'}

    table = f"{'Operation':<15}{'Object Time (s)':<20}{'StringDtype Time (s)':<25}{'Speedup':<10}\n"
    table += "-" * 70 + "\n"
    for op in speedup:
        table += f"{op:<15}{object_results[op]:<20.5f}{string_results[op]:<25.5f}{speedup[op]:<10.2f}\n"

    print(table)


def show_saving_results(object_results: dict, string_results: dict):
    savings = object_results['memory_usage'] / string_results['memory_usage']

    object_mem_use_mb = object_results['memory_usage'] / (1024 ** 2)
    string_mem_use_mb = string_results['memory_usage'] / (1024 ** 2)

    table = f"{'Memory Usage (MB)':<20}{'Object Size (MB)':<20}{'StringDtype Size (MB)':<25}{'Savings (x)':<15}\n"
    table += "-" * 80 + "\n"
    table += f"{'Memory Usage':<20}{object_mem_use_mb:<20.2f}{string_mem_use_mb:<25.2f}{savings:<15.2f}\n"

    print(table)

In [29]:
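def run_benchmark(df: pd.DataFrame) -> dict:
    """Time three string operations on df['fruits'] and record its memory usage."""
    results = {}

    # time.perf_counter() is a monotonic, high-resolution clock meant for timing.
    start_time = time.perf_counter()
    df['fruits'].str.len()
    results['length'] = time.perf_counter() - start_time

    start_time = time.perf_counter()
    df['fruits'].str.replace('a', 'o')
    results['replace'] = time.perf_counter() - start_time

    start_time = time.perf_counter()
    df['fruits'].str.contains('a')
    results['contains'] = time.perf_counter() - start_time

    # deep=True also counts the memory held by the Python string objects.
    results['memory_usage'] = df['fruits'].memory_usage(deep=True)

    return results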

In [30]:
n_rows = 1_000_000

data = np.random.choice(['apple', 'banana', 'cherry', 'dragon fruit', 'elderberry'], n_rows)

df_object = pd.DataFrame({'fruits': data})
df_string = pd.DataFrame({'fruits': data}, dtype=pd.StringDtype('pyarrow'))

In [31]:
print(f'df_object\n{df_object.dtypes}', end='\n\n')
print(f'df_string\n{df_string.dtypes}')

df_object
fruits    object
dtype: object

df_string
fruits    string[pyarrow]
dtype: object


In [32]:
object_results = run_benchmark(df_object)
string_results = run_benchmark(df_string)

In [33]:
show_time_results(object_results, string_results)

Operation      Object Time (s)     StringDtype Time (s)     Speedup   
----------------------------------------------------------------------
length         0.19111             0.01234                  15.49     
replace        0.13211             0.02197                  6.01      
contains       0.14554             0.03839                  3.79      



In [34]:
show_saving_results(object_results, string_results)

Memory Usage (MB)   Object Size (MB)    StringDtype Size (MB)    Savings (x)    
--------------------------------------------------------------------------------
Memory Usage        61.80               15.07                    4.10           
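Since each operation above is timed only once, the numbers can vary noticeably between runs. A slightly more robust variant, sketched below under a few assumptions, repeats each measurement with the standard library's `timeit.repeat` and keeps the fastest run. The helper name `best_time`, the smaller row count, and the use of the `'string'` alias (which does not require a specific storage backend) are choices made for this sketch, not part of the benchmark above:

```python
import timeit

import numpy as np
import pandas as pd


def best_time(func, repeat=5):
    """Run func several times and return the fastest wall-clock time in seconds."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


rng = np.random.default_rng(0)
data = rng.choice(['apple', 'banana', 'cherry'], 100_000)

s_obj = pd.Series(data, dtype=object)
s_str = pd.Series(data, dtype='string')

timings = {
    'object': best_time(lambda: s_obj.str.len()),
    'string': best_time(lambda: s_str.str.len()),
}
print(timings)
```

The absolute timings depend on the machine, the pandas version, and the storage backend, so they are best read as rough relative indicators.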



# References

- [Python for Data Analysis by Wes McKinney (3e)](https://wesmckinney.com/book/)
- [Pandas Official Documentation](https://pandas.pydata.org/docs/user_guide/10min.html)
- [Frequently Asked Questions (FAQ) on Pandas](https://pandas.pydata.org/docs/user_guide/gotchas.html)

# Exercises

To help you understand the concepts covered in this notebook, here are some practice problems.

These questions refer to a dataset containing information on the type, cast, director, description and name of Netflix titles over the years. The dataset is available on [Kaggle by Shivam Bansal](https://www.kaggle.com/datasets/shivamb/netflix-shows).

In [35]:
df = pd.read_csv('https://raw.githubusercontent.com/ahayasic/workshop-pandas-zero-to-hero/main/datasets/netflix-movies-and-tv-shows/netflix_titles.csv')
df.head()

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


1. Get all rows where country is "Brazil" and listed_in includes the "Dramas" category.

2. Get all rows where the cast includes the actor "Leonardo DiCaprio"

3. Get all rows where the title is composed of more than 5 words

4. Get all rows where the description contains the words "power" and "love".

5. Show the top 10 TV shows with the biggest cast size. Display only the title, director, cast, cast size, country, and release year columns.

6. Show the top 10 TV shows with the most listed_in categories.

7. Show the top 5 TV shows with the most seasons.

8. Count how many titles "Emma Stone" has done.

9. Count how many movies "Steven Spielberg" has done with the actor "Harrison Ford".

10. Replace the country 'United States' with 'no geography knowledge' (just for fun).

11. Create a new column date_added_datetime holding the date as a string in the pattern 'YYYY-MM-DD', built from the values in the date_added column.

12. (Plus) Find the top 5 most frequent genres