# Pandas tip #2: Split text column into multiple new columns
In my projects I always do my first data analysis in Pandas. Often, one of the columns contains text data that requires some processing. For example, a single column may hold both the `first` and `last` name. What I previously did was write a lambda function and use `.apply()` to process each row. There is, however, a better way: `.str.split()`, which is very similar to Python's `.split()` method. With the `expand=True` parameter, the split result is put in new columns.
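
To make the effect of `expand=True` concrete, here is a minimal sketch on a throwaway Series (the variable names `s`, `as_lists`, and `as_columns` are only illustrative). Without `expand=True` each row holds a list of parts; with it, every part gets its own column.

```python
import pandas as pd

s = pd.Series(['train/data_shard_1.csv', 'test/data_shard_2.csv'])

# Default behaviour: a Series in which every value is a list of parts
as_lists = s.str.split('/')

# expand=True: a DataFrame with one column per part, labelled 0, 1, ...
as_columns = s.str.split('/', expand=True)

print(as_lists)
print(as_columns)
```

The integer column labels are why a `.rename(columns=...)` step usually follows the split.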

In [1]:
import pandas as pd

df = pd.DataFrame([
    {'path': 'train/data_shard_1.csv'},
    {'path': 'train/data_shard_2.csv'},
    {'path': 'train/data_shard_3.csv'},
    {'path': 'test/data_shard_1.csv'},
    {'path': 'test/data_shard_2.csv'},
])

In [2]:
# https://linkedin.com/in/dennisbakhuis
df = (df
    .join(df
        .loc[:, 'path']
        .str.split('/', expand=True)
        .rename(columns={0: 'folder', 1: 'filename'})
    )
)

df

,path,folder,filename
0,train/data_shard_1.csv,train,data_shard_1.csv
1,train/data_shard_2.csv,train,data_shard_2.csv
2,train/data_shard_3.csv,train,data_shard_3.csv
3,test/data_shard_1.csv,test,data_shard_1.csv
4,test/data_shard_2.csv,test,data_shard_2.csv
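
For comparison, the lambda-based approach mentioned in the introduction could look roughly like the sketch below. The frame `paths` and the names `via_apply` and `via_split` are made up for this illustration; both variants are expected to give the same two columns, but the `.str.split()` version needs only a single call.

```python
import pandas as pd

paths = pd.DataFrame({'path': ['train/data_shard_1.csv', 'test/data_shard_2.csv']})
names = {0: 'folder', 1: 'filename'}

# Row-by-row: the lambda runs once per value and returns a Series per row
via_apply = (paths['path']
    .apply(lambda p: pd.Series(p.split('/')))
    .rename(columns=names)
)

# Vectorised: one call, expand=True spreads the parts over columns
via_split = (paths['path']
    .str.split('/', expand=True)
    .rename(columns=names)
)

print(via_apply)
print(via_split)
```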


### A more meaningful example
Most of you have probably seen the Titanic dataset. This dataset has a `Name` column which contains some hidden information. It always starts with the last name (or family name), followed by the person's title. We can easily extract that information using `.str.split(expand=True)`. Let's have a look:

In [3]:
# Use a list of column names to ensure we return a DataFrame
df = pd.read_csv('Assets/Titanic_train_data.csv')[['Name']]
df

,Name
0,"Braund, Mr. Owen Harris"
1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,"Heikkinen, Miss. Laina"
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,"Allen, Mr. William Henry"
...,...
886,"Montvila, Rev. Juozas"
887,"Graham, Miss. Margaret Edith"
888,"Johnston, Miss. Catherine Helen ""Carrie"""
889,"Behr, Mr. Karl Howell"


The family name comes before the `,`, so we split on the comma and keep column `0`. The syntax is very similar to the regular `.split()`:

In [4]:
df['family_name'] = df['Name'].str.split(',', expand=True)[0]
df

,Name,family_name
0,"Braund, Mr. Owen Harris",Braund
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",Cumings
2,"Heikkinen, Miss. Laina",Heikkinen
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",Futrelle
4,"Allen, Mr. William Henry",Allen
...,...,...
886,"Montvila, Rev. Juozas",Montvila
887,"Graham, Miss. Margaret Edith",Graham
888,"Johnston, Miss. Catherine Helen ""Carrie""",Johnston
889,"Behr, Mr. Karl Howell",Behr
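
A small variation on the split above, as a sketch rather than a recommendation: passing `n=1` limits the split to the first comma, and `.str.strip()` removes the space that follows it. The two-row frame `names` and the `rest` column are hypothetical and only reuse names shown in the output above.

```python
import pandas as pd

names = pd.DataFrame({'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina']})

# n=1 splits only at the first comma, so there are at most two parts per row
parts = names['Name'].str.split(',', n=1, expand=True)

names['family_name'] = parts[0].str.strip()
names['rest'] = parts[1].str.strip()  # drop the space that followed the comma

print(names)
```

Limiting the number of splits keeps the result at two columns even if a name happens to contain a second comma.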


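To get the title, we have to chain a couple of splits after each other: first take the part after the `,`, then the first whitespace-separated word, and finally drop the trailing `.`.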

In [5]:
df['title'] = (df
    .loc[:, 'Name']  # is the same as df['Name'] but looks better
    .str.split(',', expand=True)[1]
    .str.split(expand=True)[0]
    .str.split('.', expand=True)[0]  # remove the `.`
)

In [6]:
df

,Name,family_name,title
0,"Braund, Mr. Owen Harris",Braund,Mr
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",Cumings,Mrs
2,"Heikkinen, Miss. Laina",Heikkinen,Miss
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",Futrelle,Mrs
4,"Allen, Mr. William Henry",Allen,Mr
...,...,...,...
886,"Montvila, Rev. Juozas",Montvila,Rev
887,"Graham, Miss. Margaret Edith",Graham,Miss
888,"Johnston, Miss. Catherine Helen ""Carrie""",Johnston,Miss
889,"Behr, Mr. Karl Howell",Behr,Mr
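
As an alternative to the chained splits, the same title can be pulled out with a single regular expression via `.str.extract()`. This is a sketch under the assumption that the title always sits between the comma and the first period; the three-row frame `names` is a hypothetical sample built from rows shown above.

```python
import pandas as pd

names = pd.DataFrame({'Name': [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Montvila, Rev. Juozas',
]})

# Capture everything between the comma and the first period.
# expand=False returns a Series because the pattern has a single group.
names['title'] = names['Name'].str.extract(r',\s*([^.]+)\.', expand=False)

print(names)
```

One difference in behaviour: the chained version keeps only the first whitespace-separated word, whereas this pattern keeps everything up to the period, so a multi-word title would come out whole.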


If you have any questions, comments, or requests, feel free to [contact me on LinkedIn](https://linkedin.com/in/dennisbakhuis).