# New string data type + upcoming Arrow support

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv")

In [None]:
df.head()

## Explaining dtypes

In [None]:
df.dtypes

<div style="font-size:120%">

> "You can assume that "object" dtype means you have string data ..."
    
</div>
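The quoted assumption is only a convention: object dtype can hold any Python object. A minimal sketch with a made-up `mixed` Series (not part of the Titanic data) illustrates this:

```python
import pandas as pd

# Hypothetical example data: an object-dtype Series is not restricted to strings.
mixed = pd.Series(["a", 1, 2.5])

# pandas falls back to object dtype for mixed Python values.
dtype_is_object = mixed.dtype == object

# Inspect the Python type of every stored element.
kinds = [type(value).__name__ for value in mixed]
print(mixed.dtype, kinds)
```

So "object" tells you only that the column holds Python objects; nothing guarantees they are all strings.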

## Dedicated "string" data type

Introduced in pandas 1.0 (as an experimental feature): https://pandas.pydata.org/docs/dev/whatsnew/v1.0.0.html#dedicated-string-data-type

In [None]:
df2 = df.convert_dtypes(convert_string=True, convert_integer=False, convert_floating=False)

In [None]:
df2.dtypes

We have strings now!

Creating a Series with the dtype manually:

In [None]:
s = pd.Series(["a", "b", "c"], dtype="string")
s

In [None]:
s[0] = "B"

In [None]:
s.to_numpy()

<div style="font-size:120%">

-> The implementation is almost exactly the same (Python strings are still stored in an object-dtype NumPy array), but the intent is much clearer!
    
</div>
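A small sketch of what "clearer intent" means in practice, assuming a pandas version that supports `dtype="string"` (1.0 or later). The Series `s_demo` is made up for illustration; the exact exception type raised for non-string values differs between pandas versions, so both are caught here:

```python
import pandas as pd

# Hypothetical example data with a missing value.
s_demo = pd.Series(["a", None, "c"], dtype="string")

# Missing values are represented by pd.NA rather than None or NaN.
missing_is_na = s_demo[1] is pd.NA

# Converting to NumPy still yields an object-dtype array.
values = s_demo.to_numpy()

# Unlike object dtype, the string dtype refuses non-string values.
try:
    s_demo[0] = 1
    rejected = False
except (TypeError, ValueError):
    rejected = True

print(missing_is_na, values.dtype, rejected)
```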

## Native string dtype using Apache Arrow

This is a work in progress (an initial version is expected to land in pandas 1.2 or 1.3); see https://github.com/pandas-dev/pandas/issues/35169

In [None]:
df = pd.read_csv("string_data.csv")
df.head()

In [None]:
s = df["code"]

In [None]:
s_python = s.astype("string")

In [None]:
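# Development-version import path at the time of writing;
# released pandas (1.3+) exposes this dtype as "string[pyarrow]".
from pandas.core.arrays.string_arrow import ArrowStringDtype
s_arrow = s.astype(ArrowStringDtype())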

In [None]:
s_arrow.head()

**Better memory usage**

In [None]:
"{:.2f} MiB".format(s_python.memory_usage(deep=True) / 1024**2)

In [None]:
"{:.2f} MiB".format(s_arrow.memory_usage(deep=True) / 1024**2)

FIGURE?

**Faster string operations**

Converting to lower case:

In [None]:
%time _ = s_python.str.lower()

In [None]:
%timeit _ = s_arrow.str.lower()

Equality check:

In [None]:
%time _ = s_python == "A1"

In [None]:
%timeit _ = s_arrow == "A1"

Contains check:

In [None]:
%time _ = s_python.str.contains("A1", regex=False)

In [None]:
%timeit _ = s_arrow.str.contains("A1", regex=False)

In [None]:
(s_python.str.contains("A1", regex=False) == s_arrow.str.contains("A1", regex=False)).all()

**How does this work?**

- Apache Arrow has an efficient memory representation for variable-length strings, plus a growing library of computational kernels
- In pandas, we can optionally store a `pyarrow` array of strings instead of an object-dtype NumPy array
- BUT! setitem operations are less efficient, because Arrow arrays are immutable and setting a value means building a new array
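To make the memory layout concrete, here is a pure-Python sketch of the idea behind Arrow's variable-length string representation: one contiguous UTF-8 data buffer plus an array of 32-bit offsets. It is an illustration only, not pyarrow's implementation; the helpers `encode` and `get` are hypothetical names, and the validity bitmap that Arrow uses for missing values is omitted.

```python
import sys
from array import array


def encode(strings):
    """Pack strings into one UTF-8 buffer plus an array of offsets."""
    data = bytearray()
    offsets = array("i", [0])  # string i lives in data[offsets[i]:offsets[i + 1]]
    for text in strings:
        data += text.encode("utf-8")
        offsets.append(len(data))
    return bytes(data), offsets


def get(data, offsets, i):
    """Decode the i-th string from the packed buffers."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


# Made-up codes resembling the "code" column used above.
codes = ["A1", "B22,C3", "D4"] * 1000
data, offsets = encode(codes)

# Packed layout: the raw bytes plus one offset per string (and one extra).
packed_bytes = len(data) + offsets.itemsize * len(offsets)

# Rough cost of one Python str object per element plus an 8-byte pointer each.
object_bytes = sum(sys.getsizeof(code) + 8 for code in codes)

print(get(data, offsets, 1), packed_bytes, object_bytes)
```

The sketch also hints at why setitem is costly: replacing a string with one of a different length shifts every later offset and requires rewriting the data buffer.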

**Thanks to**
 
* CZI for funding this work
* Maarten Breddels and the Arrow team for implementing string kernels in Arrow
* Uwe Korn, Tom Augspurger and Simon Hawkins for the work integrating this in pandas

In [None]:
%%html
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Quite happy with my first major contribution to <a href="https://twitter.com/ApacheArrow?ref_src=twsrc%5Etfw">@ApacheArrow</a> which is a redo/upstreaming of the <a href="https://twitter.com/vaex_io?ref_src=twsrc%5Etfw">@vaex_io</a> string algorithms. From 2min12 → 8 seconds on half a billion strings (single-threaded). <a href="https://t.co/BSjjBgMSpt">pic.twitter.com/BSjjBgMSpt</a></p>&mdash; Maarten A. Breddels (@maartenbreddels) <a href="https://twitter.com/maartenbreddels/status/1278047178808799233?ref_src=twsrc%5Etfw">June 30, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> 

In [None]:
%%html
<style>
.jp-Cell.jp-mod-selected ~ .jp-Cell {
    display: none;
}
</style>

Generation of string data:

In [None]:
# copied from https://github.com/hmelberg/health-analytics-using-python/blob/master/4_Organizing_your_data_The_answer_is_half_long.ipynb

import numpy as np
import pandas as pd

def make_data(n, letters=26, numbers=100, seed=False):
    """
    Generate a dataframe with a column of random codes

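    Args:
        n (int): The number of rows (events) to generate
        letters (int): The number of different letters to use
        numbers (int): Exclusive upper bound for the numbers (drawn from 1 to numbers - 1)
        seed (bool): If True, seed the random generator for reproducible output

    Returns:
        A dataframe with a column with one or more codes in the rows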

    """
    # each code is assumed to consist of a letter and a number
    alphabet = list('abcdefghijklmnopqrstuvwxyz')
    letters = alphabet[:letters]

    # make random numbers same if seed is specified
    if seed:
        np.random.seed(0)

    # determine the number of codes to be drawn for each event
    n_codes=np.random.negative_binomial(1, p=0.3, size=n)
    # avoid zero (all events have to have at least one code)
    n_codes=n_codes+1

    # for each event, randomly generate the number of codes specified by n_codes
    codes=[]
    for i in n_codes:
        diag = [np.random.choice(letters).upper()+
              str(int(np.random.uniform(low=1, high=numbers))) 
              for num in range(i)]

        code_string=','.join(diag)
        codes.append(code_string)

    # create a dataframe based on the list   
    df=pd.DataFrame(codes)    
    df.columns=['code']

    return df

In [None]:
df = make_data(10_000_000)

In [None]:
df.to_csv("string_data.csv", index=False)