# String (Text) Operations<a name="_string (text) operations"></a>

# Table of Contents
- [ String (Text) Operations](#_string (text) operations) 
  - [ Fixing Case, Spacing, Etc. on String Columns](#_fixing case, spacing, etc. on string columns) 
  - [ Search for String Values](#_search for string values) 
  - [ Split a String Column Into Multiple Columns](#_split a string column into multiple columns) 
    - [  What if we want to add that to the crimes dataframe?](#_ what if we want to add that to the crimes dataframe?) 
  - [ Splitting but not making new columns- Just a list.](#_splitting but not making new columns- just a list.) 


You can find documentation here: http://pandas.pydata.org/pandas-docs/stable/text.html

In [None]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

We can do a lot of cleaning operations using the str functions in pandas.

In [33]:
crimes = pd.read_csv("data/chicago_crimes.csv")

In [35]:
crimes.head(3)

Unnamed: 0,identification,Case Number,Date-Time,Date,Time,Block,Street,IUCR,Primary Type,Description,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Location,Latitude,Longitude
0,7446859,HS247325,1/1/08 0:01,1/1/08,0:01,004XX E 133RD ST,E 133RD ST,841,THEFT,FINANCIAL ID THEFT:$300 &UNDER,...,9,54,6,,,2008,4/30/10 1:15,,,
1,6236266,HP323693,1/1/08 0:01,1/1/08,0:01,013XX E 49TH ST,E 49TH ST,840,THEFT,FINANCIAL ID THEFT: OVER $300,...,4,39,6,1186146.0,1872823.0,2008,5/24/08 1:05,"(41.80615476732223, -87.59280284925518)",41.806155,-87.592803
2,7514546,HS317259,1/1/08 0:01,1/1/08,0:01,014XX E 59TH ST,E 59TH ST,840,THEFT,FINANCIAL ID THEFT: OVER $300,...,5,41,6,,,2008,5/24/10 1:12,,,


In [36]:
crimes['Primary Type'].value_counts().head(3)

THEFT              605
BATTERY            358
CRIMINAL DAMAGE    187
Name: Primary Type, dtype: int64

## Fixing Case, Spacing, Etc. on String Columns<a name="_fixing case, spacing, etc. on string columns"></a>

We can convert the strings in a column to lower case (`str.lower()`), upper case (`str.upper()`), or title case (`str.title()`):

In [37]:
crimes['Primary Type'] = crimes['Primary Type'].str.title()

In [38]:
crimes.head(3)

Unnamed: 0,identification,Case Number,Date-Time,Date,Time,Block,Street,IUCR,Primary Type,Description,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Location,Latitude,Longitude
0,7446859,HS247325,1/1/08 0:01,1/1/08,0:01,004XX E 133RD ST,E 133RD ST,841,Theft,FINANCIAL ID THEFT:$300 &UNDER,...,9,54,6,,,2008,4/30/10 1:15,,,
1,6236266,HP323693,1/1/08 0:01,1/1/08,0:01,013XX E 49TH ST,E 49TH ST,840,Theft,FINANCIAL ID THEFT: OVER $300,...,4,39,6,1186146.0,1872823.0,2008,5/24/08 1:05,"(41.80615476732223, -87.59280284925518)",41.806155,-87.592803
2,7514546,HS317259,1/1/08 0:01,1/1/08,0:01,014XX E 59TH ST,E 59TH ST,840,Theft,FINANCIAL ID THEFT: OVER $300,...,5,41,6,,,2008,5/24/10 1:12,,,


In [39]:
crimes['Description'] = crimes['Description'].str.lower()

In [40]:
crimes['Description'].head()

0    financial id theft:$300 &under
1     financial id theft: over $300
2     financial id theft: over $300
3     financial id theft: over $300
4           harassment by telephone
Name: Description, dtype: object
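
The same `str` accessor also handles spacing problems. The sketch below uses a small made-up Series (not the crimes data) to show one way to trim and collapse whitespace before fixing case; the variable names are illustrative only.

```python
import pandas as pd

# Toy data (hypothetical, not from the crimes file) with messy case and spacing.
messy = pd.Series(["  THEFT ", "battery", " Criminal   Damage"])

# Trim leading/trailing whitespace, collapse inner runs of whitespace
# to a single space, then title-case the result.
cleaned = (
    messy.str.strip()
         .str.replace(r"\s+", " ", regex=True)
         .str.title()
)
print(cleaned.tolist())
```

Chaining the calls keeps each cleaning step visible, and the result can be assigned back to the column just like `str.title()` above.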

## Search for String Values<a name="_search for string values"></a>

We can search for rows where a string is contained in part of the Description, for instance:

In [41]:
crimes['Description'].str.contains("theft").head()

0     True
1     True
2     True
3     True
4    False
Name: Description, dtype: bool

Now let's use this match to subset the crimes dataframe to those matching rows, and look at two columns:

In [42]:
crimes_with_theft = crimes[crimes['Description'].str.contains("theft")]   
#crimes_with_theft
crimes_with_theft[['Primary Type','Description']]

Unnamed: 0,Primary Type,Description
0,Theft,financial id theft:$300 &under
1,Theft,financial id theft: over $300
2,Theft,financial id theft: over $300
3,Theft,financial id theft: over $300
5,Theft,financial id theft:$300 &under
7,Theft,financial id theft: over $300
9,Theft,financial id theft: over $300
10,Theft,financial id theft: over $300
12,Theft,financial id theft:$300 &under
14,Theft,financial id theft: over $300
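
One thing to watch for when subsetting this way: if the column has missing values, `str.contains` returns NaN for those rows, and such a mask generally cannot be used for boolean indexing. A minimal sketch on made-up data, assuming we want missing rows treated as non-matches and the search to ignore case:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with a missing value and mixed case.
desc = pd.Series(["Retail Theft", np.nan, "harassment by telephone"])

# na=False turns missing rows into False; case=False ignores upper/lower case.
mask = desc.str.contains("theft", case=False, na=False)
print(mask.tolist())
print(desc[mask].tolist())
```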


Now let's count how many of each theft description type there are, using value_counts():

In [43]:
crimes[crimes['Description'].str.contains("theft")]['Description'].value_counts()

financial id theft: over $300       247
financial id theft:$300 &under       68
retail theft                         24
agg: financial id theft              13
attempt financial identity theft     10
theft of labor/services               9
theft/recovery: automobile            8
theft of lost/mislaid prop            2
theft by lessee,non-veh               1
Name: Description, dtype: int64

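You can use str functions like `str.contains`, `str.startswith`, and `str.endswith`. There is also `str.match`, which tests whether a regular expression matches at the start of each string. It is not the same as `==`: for a whole-string comparison, use `==` or, in recent pandas versions, `str.fullmatch`.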
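
To see how these methods differ, here is a small sketch on a made-up Series. Note that `str.match` only anchors at the start of the string, so it is looser than `==`; `str.fullmatch` is assumed to be available (it was added in pandas 1.1).

```python
import pandas as pd

# Toy data (hypothetical), loosely modeled on the Description values above.
s = pd.Series(["theft", "theft of labor/services", "retail theft"])

matched = s.str.match("theft").tolist()          # pattern must match at the start
full = s.str.fullmatch("theft").tolist()         # pattern must match the whole string
contained = s.str.contains("theft").tolist()     # pattern may match anywhere
starts = s.str.startswith("retail").tolist()     # literal prefix test
ends = s.str.endswith("theft").tolist()          # literal suffix test
equal = (s == "theft").tolist()                  # exact equality

for name, result in [("match", matched), ("fullmatch", full),
                     ("contains", contained), ("startswith", starts),
                     ("endswith", ends), ("==", equal)]:
    print(name, result)
```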

You can also use regular expressions in searches, which allow more flexible matching; `str.contains` treats its pattern as a regular expression by default. This means that some characters are special and need to be "escaped" with a backslash to be matched literally. These include ".", "+", "$", and "*".

In [44]:
# Use a raw string (r'...') so Python passes the backslash through to the regex engine unchanged.
crimes[crimes['Description'].str.contains(r'\$')]['Description'].value_counts()

financial id theft: over $300     247
$300 and under                     82
financial id theft:$300 &under     68
over $300                          60
over $500                           1
Name: Description, dtype: int64
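
There are a few ways to match a special character literally. The sketch below, on made-up data, compares a backslash escape, the `regex=False` option of `str.contains`, and Python's `re.escape`; it also shows what happens if "$" is left unescaped.

```python
import re
import pandas as pd

# Toy data (hypothetical), loosely modeled on the Description values above.
desc = pd.Series(["over $300", "$300 and under", "retail theft"])

by_backslash = desc.str.contains(r"\$").tolist()            # escaped by hand
by_literal = desc.str.contains("$", regex=False).tolist()   # plain substring search
by_re_escape = desc.str.contains(re.escape("$300")).tolist()  # escape a whole literal

# Unescaped, "$" means "end of string" in a regex, so every row matches.
unescaped = desc.str.contains("$").tolist()

print(by_backslash, by_literal, by_re_escape, unescaped)
```

`regex=False` is probably the simplest choice when you never need pattern features; `re.escape` is handy when the search text comes from a variable.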

You can read more about Regular Expressions here: https://docs.python.org/3/howto/regex.html

And some examples in pandas strings here: http://pandas.pydata.org/pandas-docs/stable/text.html#extracting-substrings (and also for searches)


## Split a String Column Into Multiple Columns<a name="_split a string column into multiple columns"></a>

How can we take a text column and split it on a delimiter, as we can in Excel?  Take the Location column, for instance.  We want to split on the comma and then remove the parentheses.  We can use `str.split` with the keyword `expand=True` to get new columns:

In [45]:
crimes['Location'].head()

0                                        NaN
1    (41.80615476732223, -87.59280284925518)
2                                        NaN
3     (41.78228592790556, -87.5908189949899)
4    (41.76318136897587, -87.58973074990949)
Name: Location, dtype: object

In [46]:
crimes['Location'].str.split(",", expand=True).head()

Unnamed: 0,0,1
0,,
1,(41.80615476732223,-87.59280284925518)
2,,
3,(41.78228592790556,-87.5908189949899)
4,(41.76318136897587,-87.58973074990949)


In [47]:
lat_lon = crimes['Location'].str.split(",", expand=True)

In [48]:
# Take care: the columns are actually named with integers, not strings.
# Note that index=str also converts the row labels to strings.
lat_lon = lat_lon.rename(index=str, columns={0:"Lat", 1:"Lon"})

In [None]:
lat_lon.head()

In [49]:
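# regex=False treats "(" as a literal character rather than regex syntax.
lat_lon['Lat'] = lat_lon['Lat'].str.replace("(", "", regex=False)

In [50]:
lat_lon['Lon'] = lat_lon['Lon'].str.replace(")", "", regex=False)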

In [52]:
lat_lon.head()

Unnamed: 0,Lat,Lon
0,,
1,41.80615476732223,-87.59280284925518
2,,
3,41.78228592790556,-87.5908189949899
4,41.76318136897587,-87.58973074990949
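
As an alternative to split-then-replace, `str.extract` can pull both numbers out with one regular expression. This is a sketch on made-up rows rather than the notebook's method; the pattern and the group names `Lat` and `Lon` are assumptions chosen for illustration. It also avoids the leading space that splitting on "," leaves in the second piece.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) in the same shape as the Location column.
loc = pd.Series([np.nan, "(41.80615476732223, -87.59280284925518)"])

# Named groups become the column names of the resulting dataframe.
pattern = r"\(\s*(?P<Lat>-?\d+\.?\d*)\s*,\s*(?P<Lon>-?\d+\.?\d*)\s*\)"

# Rows that are missing or do not match stay NaN; astype(float) makes the values numeric.
lat_lon_numeric = loc.str.extract(pattern).astype(float)
print(lat_lon_numeric)
```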


###  What if we want to add that to the crimes dataframe?<a name="_ what if we want to add that to the crimes dataframe?"></a>

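To add these to crimes, we assign new columns in the crimes dataframe from the columns of this small new dataframe (you can give them different names in crimes if you want). Column assignment lines rows up by index, so let's first check the type and the index on both sides: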

In [53]:
type(lat_lon['Lat'])

pandas.core.series.Series

In [54]:
lat_lon.index

Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
       ...
       '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998',
       '1999'],
      dtype='object', length=2000)

In [55]:
# Notice that this index holds ints, not strings, so the two don't match.
crimes.index

RangeIndex(start=0, stop=2000, step=1)

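Because the indices differ (the `index=str` argument in the earlier `rename` call converted lat_lon's row labels to strings), we can't assign one to the other directly as columns: pandas aligns on index labels, and none of them match. Instead, we extract the underlying values from lat_lon (a NumPy array in the same row order) and assign those.  Other label-based methods like pd.concat will also fail to line the rows up.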

In [56]:
crimes['Lat'] = lat_lon['Lat'].values

In [57]:
crimes.head()

Unnamed: 0,identification,Case Number,Date-Time,Date,Time,Block,Street,IUCR,Primary Type,Description,...,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Location,Latitude,Longitude,Lat
0,7446859,HS247325,1/1/08 0:01,1/1/08,0:01,004XX E 133RD ST,E 133RD ST,841,Theft,financial id theft:$300 &under,...,54,6,,,2008,4/30/10 1:15,,,,
1,6236266,HP323693,1/1/08 0:01,1/1/08,0:01,013XX E 49TH ST,E 49TH ST,840,Theft,financial id theft: over $300,...,39,6,1186146.0,1872823.0,2008,5/24/08 1:05,"(41.80615476732223, -87.59280284925518)",41.806155,-87.592803,41.80615476732223
2,7514546,HS317259,1/1/08 0:01,1/1/08,0:01,014XX E 59TH ST,E 59TH ST,840,Theft,financial id theft: over $300,...,41,6,,,2008,5/24/10 1:12,,,,
3,6422569,HP499242,1/1/08 0:01,1/1/08,0:01,014XX E 62ND ST,E 62ND ST,840,Theft,financial id theft: over $300,...,42,6,1186762.0,1864130.0,2008,8/17/08 1:04,"(41.78228592790556, -87.5908189949899)",41.782286,-87.590819,41.78228592790556
4,6013347,HP118097,1/1/08 0:01,1/1/08,0:01,014XX E 72ND PL,E 72ND PL,2825,Other Offense,harassment by telephone,...,43,26,1187119.0,1857171.0,2008,1/16/08 1:05,"(41.76318136897587, -87.58973074990949)",41.763181,-87.589731,41.76318136897587
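
To make the alignment issue concrete, here is a minimal sketch with two tiny made-up dataframes (`left` and `right` are hypothetical names). It shows what label-based assignment does when the indices don't match, the `.values` workaround used above, and an alternative that converts the string labels back to integers.

```python
import pandas as pd

# left has the default integer index; right has string labels, like lat_lon.
left = pd.DataFrame({"a": [10, 20, 30]})
right = pd.DataFrame({"Lat": ["41.8", "41.7", "41.6"]}, index=["0", "1", "2"])

# Label-based assignment: no labels match, so every value comes out missing.
left["aligned"] = right["Lat"]

# Position-based assignment: .values drops the labels and keeps the row order.
left["by_position"] = right["Lat"].values

# Alternative: restore integer labels, after which normal assignment aligns.
right.index = right.index.astype(int)
left["after_fix"] = right["Lat"]

print(left)
```

Using `.values` relies on both dataframes having the same number of rows in the same order, which holds here because lat_lon was derived directly from crimes.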


## Splitting but not making new columns- Just a list.<a name="_splitting but not making new columns- just a list."></a>

Sometimes it's useful to split a string into a list without making new columns.

Without `expand=True`, each value in the result is a list of strings.

In [58]:
crimes['Location'].str.split(",").head()

0                                           NaN
1    [(41.80615476732223,  -87.59280284925518)]
2                                           NaN
3     [(41.78228592790556,  -87.5908189949899)]
4    [(41.76318136897587,  -87.58973074990949)]
Name: Location, dtype: object

In [59]:
crimes['Location'].str.split(",")[1]

['(41.80615476732223', ' -87.59280284925518)']
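
Once the values are lists, the `str` accessor can also index into them element by element. A small sketch on made-up rows, assuming we want the first and second pieces of every row at once rather than one row at a time:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) in the same shape as the Location column.
loc = pd.Series([np.nan, "(41.80615476732223, -87.59280284925518)"])
parts = loc.str.split(",")

# .str[0] takes element 0 of each list; missing rows stay NaN.
first = parts.str[0]

# .str.get(1) does the same for element 1; strip the space and the ")".
second = parts.str.get(1).str.strip(" )")

print(first.tolist())
print(second.tolist())
```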