Link to Medium blog post: https://towardsdatascience.com/check-for-a-substring-in-a-pandas-dataframe-column-4b949f64852

# Check For a Substring in a Pandas DataFrame Column

The Pandas library is a comprehensive tool not only for crunching numbers but also for working with text data.

For many data analysis applications and machine learning exploration/pre-processing, you’ll want to either filter out or extract information from text data. To do so, Pandas offers a wide range of in-built methods that you can use to add, remove, and edit text columns in your DataFrames.

In this piece, let’s take a look specifically at searching for substrings in a DataFrame column. This may come in handy when you need to create a new category based on existing data (for example during feature engineering before training a machine learning model).

In [2]:
import pandas as pd
df = pd.read_csv('vgsales.csv')

df

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.00
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.00,31.37
...,...,...,...,...,...,...,...,...,...,...,...
16593,16596,Woody Woodpecker in Crazy Castle 5,GBA,2002.0,Platform,Kemco,0.01,0.00,0.00,0.00,0.01
16594,16597,Men in Black II: Alien Escape,GC,2003.0,Shooter,Infogrames,0.01,0.00,0.00,0.00,0.01
16595,16598,SCORE International Baja 1000: The Official Game,PS2,2008.0,Racing,Activision,0.00,0.00,0.00,0.00,0.01
16596,16599,Know How 2,DS,2010.0,Puzzle,7G//AMES,0.00,0.01,0.00,0.00,0.01


## Using “contains” to Find a Substring in a Pandas DataFrame

The contains method in Pandas, available through the .str accessor, lets you search a column for a specific substring. It returns a Series of boolean values: True where the original value contains the substring and False where it does not. A basic call looks like Series.str.contains("substring"). However, we can immediately take this to the next level with two additions:

1. Using the case argument to specify whether to match on string case;
2. Using the returned Series of boolean values as a mask to get a subset of the DataFrame.

Applying these two should look like this:

In [3]:
pokemon_games = df.loc[df['Name'].str.contains("pokemon", case=False)]

pokemon_games

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37
12,13,Pokemon Gold/Pokemon Silver,GB,1999.0,Role-Playing,Nintendo,9.0,6.18,7.2,0.71,23.1
20,21,Pokemon Diamond/Pokemon Pearl,DS,2006.0,Role-Playing,Nintendo,6.42,4.52,6.04,1.37,18.36
25,26,Pokemon Ruby/Pokemon Sapphire,GBA,2002.0,Role-Playing,Nintendo,6.06,3.9,5.38,0.5,15.85
26,27,Pokemon Black/Pokemon White,DS,2010.0,Role-Playing,Nintendo,5.57,3.28,5.65,0.82,15.32
32,33,Pokemon X/Pokemon Y,3DS,2013.0,Role-Playing,Nintendo,5.17,4.05,4.34,0.79,14.35
45,46,Pokemon HeartGold/Pokemon SoulSilver,DS,2009.0,Action,Nintendo,4.4,2.77,3.96,0.77,11.9
49,50,Pokemon Omega Ruby/Pokemon Alpha Sapphire,3DS,2014.0,Role-Playing,Nintendo,4.23,3.37,3.08,0.65,11.33
58,59,Pokemon FireRed/Pokemon LeafGreen,GBA,2004.0,Role-Playing,Nintendo,4.34,2.65,3.15,0.35,10.49
81,82,Pokemon Black 2/Pokemon White 2,DS,2012.0,Role-Playing,Nintendo,2.91,1.86,3.14,0.43,8.33


Using loc with this mask returns only the rows of the DataFrame whose name contains the string “pokemon”, in any letter case. The contains method produces True and False values based on whether the “Name” column includes our substring, and loc keeps only the rows where the value is True.
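One caveat is worth a small sketch: if the column has missing values, contains returns a missing value for those rows, and such a mask cannot be used directly for filtering. The example below uses a tiny made-up DataFrame (not the vgsales data) and passes na=False so that missing names count as non-matches.

```python
import pandas as pd

# Hypothetical toy data, only for illustration (not taken from vgsales.csv)
games = pd.DataFrame({
    "Name": ["Pokemon Red/Pokemon Blue", "Wii Sports", None, "pokemon pinball"],
    "Platform": ["GB", "Wii", "DS", "GB"],
})

# case=False ignores letter case; na=False treats missing names as "no match"
mask = games["Name"].str.contains("pokemon", case=False, na=False)

# The boolean mask selects only the rows where the value is True
pokemon_rows = games.loc[mask]
print(pokemon_rows)
```

Without na=False, the missing name would produce a missing value in the mask instead of False, which is easy to overlook on real data.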

## Using regex with the “contains” method in Pandas

In addition to just matching on a regular substring, we can also use contains to match on regular expressions (by default, contains already treats the pattern you pass as a regex). We’ll use the exact same format as before, except this time let’s use a bit of regex to only find the story-based Pokemon games (i.e. excluding Pokemon Pinball and the like).

In [5]:
# Raw string (r"...") so that \w reaches the regex engine without an invalid-escape warning
pokemon_og_games = df.loc[df['Name'].str.contains(r"pokemon \w{1,}/", case=False)]

pokemon_og_games

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37
12,13,Pokemon Gold/Pokemon Silver,GB,1999.0,Role-Playing,Nintendo,9.0,6.18,7.2,0.71,23.1
20,21,Pokemon Diamond/Pokemon Pearl,DS,2006.0,Role-Playing,Nintendo,6.42,4.52,6.04,1.37,18.36
25,26,Pokemon Ruby/Pokemon Sapphire,GBA,2002.0,Role-Playing,Nintendo,6.06,3.9,5.38,0.5,15.85
26,27,Pokemon Black/Pokemon White,DS,2010.0,Role-Playing,Nintendo,5.57,3.28,5.65,0.82,15.32
32,33,Pokemon X/Pokemon Y,3DS,2013.0,Role-Playing,Nintendo,5.17,4.05,4.34,0.79,14.35
45,46,Pokemon HeartGold/Pokemon SoulSilver,DS,2009.0,Action,Nintendo,4.4,2.77,3.96,0.77,11.9
58,59,Pokemon FireRed/Pokemon LeafGreen,GBA,2004.0,Role-Playing,Nintendo,4.34,2.65,3.15,0.35,10.49


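Above, I used some simple regex to find strings that match the pattern “pokemon” + a space + one or more word characters + “/”. The new mask returns rows including “Pokemon Red/Pokemon Blue”, “Pokemon Gold/Pokemon Silver”, and more. Note that titles with more than one word before the slash, such as “Pokemon Omega Ruby/Pokemon Alpha Sapphire”, no longer match, because \w does not match spaces.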
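To see how the pattern behaves on individual titles, here is a minimal sketch on a small hand-written Series. The second pattern, which also allows spaces before the slash, is an illustrative variation and not part of the original analysis.

```python
import pandas as pd

# Hand-picked titles for illustration
names = pd.Series([
    "Pokemon Red/Pokemon Blue",
    "Pokemon Pinball",
    "Pokemon Omega Ruby/Pokemon Alpha Sapphire",
])

# "pokemon", a space, one or more word characters, then "/"
single_word = names.str.contains(r"pokemon \w+/", case=False)

# Variation (an assumption for illustration): allow spaces before the slash too
multi_word = names.str.contains(r"pokemon [\w ]+/", case=False)

print(single_word.tolist())
print(multi_word.tolist())
```

The first pattern skips the title with two words before the slash, while the looser character class [\w ] picks it up. Neither matches “Pokemon Pinball”, which has no slash at all.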

Next, let’s do another quick example of using regex to find all Sports games with “football” or “soccer” in their names. First, we’ll use a simple conditional statement to select all rows with a genre of “Sports”:

In [6]:
sports_games = df.loc[df['Genre'] == 'Sports']

sports_games

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.00
13,14,Wii Fit,Wii,2007.0,Sports,Nintendo,8.94,8.03,3.60,2.15,22.72
14,15,Wii Fit Plus,Wii,2009.0,Sports,Nintendo,9.09,8.59,2.53,1.79,22.00
77,78,FIFA 16,PS4,2015.0,Sports,Electronic Arts,1.11,6.06,0.06,1.26,8.49
...,...,...,...,...,...,...,...,...,...,...,...
16576,16579,Rugby Challenge 3,XOne,2016.0,Sports,Alternative Software,0.00,0.01,0.00,0.00,0.01
16578,16581,Outdoors Unleashed: Africa 3D,3DS,2011.0,Sports,Mastiff,0.01,0.00,0.00,0.00,0.01
16579,16582,PGA European Tour,N64,2000.0,Sports,Infogrames,0.01,0.00,0.00,0.00,0.01
16581,16584,Fit & Fun,Wii,2011.0,Sports,Unknown,0.00,0.01,0.00,0.00,0.01


You’ll notice that above there was no real need to match on a substring or use regex, because we were simply selecting rows based on a category. However, when matching on the game name, we need to search many differently worded strings for a substring, which is where regex comes in handy. To do so, we’ll do the following:

In [7]:
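# Build the mask from sports_games itself so its index matches the DataFrame being filtered
football_soccer_games = sports_games.loc[sports_games['Name'].str.contains("soccer|football", case=False)]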

football_soccer_games

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
199,200,FIFA Soccer 11,PS3,2010.0,Sports,Electronic Arts,0.60,3.29,0.06,1.13,5.08
249,250,Winning Eleven: Pro Evolution Soccer 2007,PS2,2006.0,Sports,Konami Digital Entertainment,0.10,2.39,1.05,0.86,4.39
269,270,FIFA Soccer 06,PS2,2005.0,Sports,Electronic Arts,0.78,2.55,0.04,0.84,4.21
283,284,FIFA Soccer 07,PS2,2006.0,Sports,Electronic Arts,0.71,2.48,0.03,0.89,4.11
292,293,World Soccer Winning Eleven 9,PS2,2005.0,Sports,Konami Digital Entertainment,0.12,2.26,0.90,0.77,4.06
...,...,...,...,...,...,...,...,...,...,...,...
16076,16079,Fab 5 Soccer,DS,2008.0,Sports,Destineer,0.01,0.00,0.00,0.00,0.01
16118,16121,Football Manager 2005,PC,2004.0,Sports,Sega,0.00,0.01,0.00,0.00,0.01
16400,16403,Pro Evolution Soccer 2010,PC,2009.0,Sports,Konami Digital Entertainment,0.00,0.01,0.00,0.00,0.01
16420,16423,Winning Eleven: Pro Evolution Soccer 2007,PC,2006.0,Sports,Konami Digital Entertainment,0.00,0.01,0.00,0.00,0.01


Now we’ve gotten a DataFrame with just the games that have a name including “soccer” or “football”. We simply made use of the “|” regex “or” operator that allows you to match on a string that contains one or another substring.
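When the list of keywords grows, writing the pattern by hand gets tedious. As a rough sketch, the pattern can be assembled from a Python list; the variable names keywords and pattern below are made up for this example, and re.escape is used so that any special regex characters in a keyword are matched literally.

```python
import re

import pandas as pd

# Hand-written sample titles for illustration
names = pd.Series([
    "FIFA Soccer 11",
    "Madden NFL 2004",
    "Football Manager 2005",
    "Wii Fit",
])

# Hypothetical keyword list; join the escaped keywords with the regex "or" operator
keywords = ["soccer", "football"]
pattern = "|".join(re.escape(word) for word in keywords)

matches = names.str.contains(pattern, case=False)
print(pattern)
print(names[matches].tolist())
```

For these two keywords the assembled pattern is the same "soccer|football" string used above, so the result is equivalent to writing it out by hand.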

We now have only names that contain either “football” or “soccer”, but we don’t know which of the two each name contains. To find out, we can use the findall method on the “Name” column and assign the returned values to a new column in the DataFrame.

The findall method returns all matches of the regular expression you specify, as a list, for each string in the Series you call it on. The format is largely the same as the contains method, except that findall has no case argument, so you’ll need to import re and pass flags=re.IGNORECASE to ignore string case.



In [8]:
import re

# Work on an explicit copy so that adding a column does not trigger SettingWithCopyWarning
football_soccer_games = football_soccer_games.copy()
football_soccer_games['Football/Soccer'] = football_soccer_games['Name'].str.findall('football|soccer', flags=re.IGNORECASE)

football_soccer_games


Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Football/Soccer
199,200,FIFA Soccer 11,PS3,2010.0,Sports,Electronic Arts,0.60,3.29,0.06,1.13,5.08,[Soccer]
249,250,Winning Eleven: Pro Evolution Soccer 2007,PS2,2006.0,Sports,Konami Digital Entertainment,0.10,2.39,1.05,0.86,4.39,[Soccer]
269,270,FIFA Soccer 06,PS2,2005.0,Sports,Electronic Arts,0.78,2.55,0.04,0.84,4.21,[Soccer]
283,284,FIFA Soccer 07,PS2,2006.0,Sports,Electronic Arts,0.71,2.48,0.03,0.89,4.11,[Soccer]
292,293,World Soccer Winning Eleven 9,PS2,2005.0,Sports,Konami Digital Entertainment,0.12,2.26,0.90,0.77,4.06,[Soccer]
...,...,...,...,...,...,...,...,...,...,...,...,...
16076,16079,Fab 5 Soccer,DS,2008.0,Sports,Destineer,0.01,0.00,0.00,0.00,0.01,[Soccer]
16118,16121,Football Manager 2005,PC,2004.0,Sports,Sega,0.00,0.01,0.00,0.00,0.01,[Football]
16400,16403,Pro Evolution Soccer 2010,PC,2009.0,Sports,Konami Digital Entertainment,0.00,0.01,0.00,0.00,0.01,[Soccer]
16420,16423,Winning Eleven: Pro Evolution Soccer 2007,PC,2006.0,Sports,Konami Digital Entertainment,0.00,0.01,0.00,0.00,0.01,[Soccer]


You’ll see at the end of the returned DataFrame a new column that contains a list of the matches found, here “Soccer” or “Football”, depending on which of the two the video game name contains. This can be helpful if you need to create new columns derived from the values in existing columns.
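Because findall returns a list for every row, the new column holds lists rather than plain strings. If a single label per row is more convenient, one possible alternative (not used in the original analysis) is str.extract, which returns the first match of a capture group. The sketch below compares the two on a few hand-written titles.

```python
import re

import pandas as pd

# Hand-written sample titles for illustration
names = pd.Series(["FIFA Soccer 11", "Football Manager 2005", "Wii Fit"])

# findall: a list of all matches per row (an empty list when nothing matches)
all_matches = names.str.findall("football|soccer", flags=re.IGNORECASE)

# extract: the first match of the capture group as a plain string (missing when nothing matches)
first_match = names.str.extract(r"(football|soccer)", flags=re.IGNORECASE, expand=False)
labels = first_match.str.lower()

print(all_matches.tolist())
print(labels.tolist())
```

Lower-casing the extracted value gives consistent labels regardless of how the title is capitalized, which tends to be handier for a categorical feature.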

Finally, for a quick trick to exclude strings with just one additional operator on top of the basic contains method, let’s try to get all the football and soccer games that don’t include “FIFA” in the name.

In [9]:
not_fifa = football_soccer_games.loc[~football_soccer_games['Name'].str.contains('FIFA')]

not_fifa

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Football/Soccer
249,250,Winning Eleven: Pro Evolution Soccer 2007,PS2,2006.0,Sports,Konami Digital Entertainment,0.10,2.39,1.05,0.86,4.39,[Soccer]
292,293,World Soccer Winning Eleven 9,PS2,2005.0,Sports,Konami Digital Entertainment,0.12,2.26,0.90,0.77,4.06,[Soccer]
315,316,World Soccer Winning Eleven 8 International,PS2,2004.0,Sports,Konami Digital Entertainment,0.16,1.89,1.12,0.68,3.85,[Soccer]
348,349,Pro Evolution Soccer 2008,PS2,2007.0,Sports,Konami Digital Entertainment,0.05,0.00,0.64,2.93,3.63,[Soccer]
474,475,World Soccer Winning Eleven 6 International,PS2,2002.0,Sports,Konami Digital Entertainment,0.12,1.26,1.16,0.45,2.99,[Soccer]
...,...,...,...,...,...,...,...,...,...,...,...,...
16076,16079,Fab 5 Soccer,DS,2008.0,Sports,Destineer,0.01,0.00,0.00,0.00,0.01,[Soccer]
16118,16121,Football Manager 2005,PC,2004.0,Sports,Sega,0.00,0.01,0.00,0.00,0.01,[Football]
16400,16403,Pro Evolution Soccer 2010,PC,2009.0,Sports,Konami Digital Entertainment,0.00,0.01,0.00,0.00,0.01,[Soccer]
16420,16423,Winning Eleven: Pro Evolution Soccer 2007,PC,2006.0,Sports,Konami Digital Entertainment,0.00,0.01,0.00,0.00,0.01,[Soccer]


As you can see, we’ve simply made use of the ~ operator, which inverts the boolean mask so that loc keeps only the rows where contains returned False.
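The inverted mask can also be combined with other masks using & (and) and | (or). The following is a small sketch on hand-written titles; the mask names has_soccer and is_fifa are made up for the example.

```python
import pandas as pd

# Hand-written sample titles for illustration
names = pd.Series([
    "FIFA Soccer 11",
    "Pro Evolution Soccer 2010",
    "Football Manager 2005",
])

has_soccer = names.str.contains("soccer", case=False)
is_fifa = names.str.contains("FIFA")

# ~ flips every True to False and vice versa
not_fifa_mask = ~is_fifa

# Combine masks: soccer games that are not FIFA titles
soccer_not_fifa = names[has_soccer & not_fifa_mask]
print(soccer_not_fifa.tolist())
```

Note that ~ expects a clean boolean mask, so on a column with missing values it is safer to pass na=False to contains before inverting.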