# 8. String Series Methods

# Methods for Series with String Data Types
In this notebook, we focus on methods for Series that contain string data. Historically, pandas had no dedicated string data type, so strings are stored in columns with the **object** data type. Technically, object can refer to any Python object, but in practice the vast majority of object columns are composed entirely of strings. Newer pandas versions also offer a dedicated string data type, but this notebook works with object columns, as its outputs show.

The methods in the previous two notebooks focused on numeric and boolean Series. Many of those methods also work for string Series, but some do not.

For instance, the **`mean`** method will not work for string columns. Let's start by selecting the department column, which contains strings, from the City of Houston dataset.

In [1]:
import pandas as pd

In [2]:
emp = pd.read_csv('data/employee.csv')
emp.head()

| Unnamed: 0 | title | dept | salary | race | gender | experience |
|---|---|---|---|---|---|---|
| 0 | POLICE OFFICER | Houston Police Department-HPD | 45279.0 | White | Male | 1 |
| 1 | ENGINEER/OPERATOR | Houston Fire Department (HFD) | 63166.0 | White | Male | 34 |
| 2 | SENIOR POLICE OFFICER | Houston Police Department-HPD | 66614.0 | Black | Male | 32 |
| 3 | ENGINEER | Public Works & Engineering-PWE | 71680.0 | Asian | Male | 4 |
| 4 | CARPENTER | Houston Airport System (HAS) | 42390.0 | White | Male | 3 |


In [3]:
dept = emp['dept']
dept.head()

0     Houston Police Department-HPD
1     Houston Fire Department (HFD)
2     Houston Police Department-HPD
3    Public Works & Engineering-PWE
4      Houston Airport System (HAS)
Name: dept, dtype: object
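To make the earlier claim about **`mean`** concrete, here is a minimal sketch. It uses a small hand-made Series (`dept_sample`, a hypothetical stand-in for the `dept` column) and assumes that pandas raises a `TypeError` or `ValueError` when asked to average strings; the exact error message may differ between pandas versions.

```python
import pandas as pd

# Hypothetical stand-in for the dept column (object dtype, all strings)
dept_sample = pd.Series(
    ['Houston Police Department-HPD', 'Houston Fire Department (HFD)'],
    dtype=object,
)

try:
    dept_sample.mean()
    mean_failed = False
except (TypeError, ValueError) as err:
    # Strings cannot be averaged, so pandas raises an error
    mean_failed = True
    print(type(err).__name__, err)
```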

## Most valuable method for string columns: `value_counts`
The **`value_counts`** method is one of the most valuable methods for string columns. It returns the frequency of each value in the Series and sorts it from most to least common.

In [4]:
dept.value_counts().head()

Houston Police Department-HPD     570
Houston Fire Department (HFD)     365
Public Works & Engineering-PWE    341
Health & Human Services           103
Houston Airport System (HAS)      103
Name: dept, dtype: int64

## Notice what object is returned
The **`value_counts`** method itself returns a Series: the unique values of the original Series become the index, and their counts become the values.
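The structure of the returned object can be inspected directly. The sketch below uses a tiny made-up Series (`colors`, not part of the employee data) so that the counts are easy to verify by eye.

```python
import pandas as pd

# Made-up data: 3 reds, 2 blues, 1 green
colors = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue'])
counts = colors.value_counts()

print(type(counts))            # the result is itself a Series
print(counts.index.tolist())   # the original values, most common first
print(counts.tolist())         # the count for each value
print(counts['red'])           # look up a single count by its label
```

Because the result is a Series, every Series method covered so far (such as `head` or `sum`) can be chained onto it.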

### Use `normalize=True` for proportion
We can use **`value_counts`** to return the proportion of each occurrence instead of the raw count by setting the **`normalize`** parameter to **`True`**. For instance, this tells us that about 37% of the employees are members of the police department.

In [5]:
dept.value_counts(normalize=True).head()

Houston Police Department-HPD     0.371336
Houston Fire Department (HFD)     0.237785
Public Works & Engineering-PWE    0.222150
Health & Human Services           0.067101
Houston Airport System (HAS)      0.067101
Name: dept, dtype: float64
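As a quick sanity check on what `normalize=True` computes, the sketch below compares it with dividing the raw counts by the number of rows. The `colors` Series is made-up data with no missing values; with missing values the two calculations would not match, because `value_counts` skips them by default.

```python
import pandas as pd

# Made-up data with no missing values
colors = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue'])

proportions = colors.value_counts(normalize=True)
manual = colors.value_counts() / len(colors)

print(proportions)
print(proportions.sum())  # the proportions add up to 1
```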

# Special methods just for object columns
Pandas provides a collection of string methods through the **str accessor**. In this notebook we use it on Series with the **object** data type that contain strings; it is not available on numeric Series. It provides a few dozen methods for string manipulation.

### Access with dot notation
To access these special string methods, append `.str` to the Series and then call the specific string method. Again, these are available only to Series that hold string data, which in this notebook means object columns.
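A short sketch of what happens when the accessor is used on the wrong kind of data: the made-up numeric Series below has no string values, so pandas raises an `AttributeError` as soon as `.str` is accessed.

```python
import pandas as pd

# Made-up numeric Series: the str accessor does not apply here
numbers = pd.Series([1, 2, 3])

try:
    numbers.str.upper()
    accessor_failed = False
except AttributeError as err:
    accessor_failed = True
    print(err)
```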

### Make each value uppercase
Let's call a simple string method to make each value in the **`dept`** Series uppercase. We will use the **`upper`** method of the str accessor.

In [6]:
dept.str.upper().head()

0     HOUSTON POLICE DEPARTMENT-HPD
1     HOUSTON FIRE DEPARTMENT (HFD)
2     HOUSTON POLICE DEPARTMENT-HPD
3    PUBLIC WORKS & ENGINEERING-PWE
4      HOUSTON AIRPORT SYSTEM (HAS)
Name: dept, dtype: object

### `str` accessor API
Take a look at the [str accessor API][1] in the official documentation. Let's output all the public methods in the notebook below.
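One possible way to list the public methods is to call Python's built-in `dir` on the accessor and drop the names that start with an underscore. This is a sketch rather than an official listing; the `sample` Series is a throwaway object created only to reach the accessor, and the exact set of names depends on the installed pandas version.

```python
import pandas as pd

# Throwaway Series used only to get hold of the str accessor
sample = pd.Series(['a'], dtype=object)

# Keep the names that do not start with an underscore
public_methods = [name for name in dir(sample.str) if not name.startswith('_')]

print(len(public_methods))
print(public_methods)
```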

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling

### Lots of methods but mostly easy to use
There is quite a lot of functionality to manipulate and probe strings in almost any way you can imagine. Let's work through some examples of the following string methods:

* **`count`**
* **`contains`**
* **`find`**
* **`len`**

### `count` str method
Returns the number of times the passed string, which is interpreted as a regular expression pattern, occurs in each value:

In [7]:
dept.str.count('a').head()

0    1
1    1
2    1
3    0
4    0
Name: dept, dtype: int64

In [8]:
dept.str.count('Department').head()

0    1
1    1
2    1
3    0
4    0
Name: dept, dtype: int64
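Because `count` treats its argument as a regular expression, characters with a special meaning, such as an opening parenthesis, need to be escaped. The sketch below uses a two-row stand-in for the `dept` column (`dept_sample` is a name introduced here for illustration) and shows two equivalent ways of escaping.

```python
import re

import pandas as pd

# Hypothetical two-row stand-in for the dept column
dept_sample = pd.Series([
    'Houston Police Department-HPD',
    'Houston Fire Department (HFD)',
])

# Plain letters need no escaping
e_counts = dept_sample.str.count('e')

# '(' is a regex metacharacter: escape it by hand or with re.escape
paren_counts = dept_sample.str.count(r'\(')
paren_counts_escaped = dept_sample.str.count(re.escape('('))

print(e_counts.tolist())      # [3, 3]
print(paren_counts.tolist())  # [0, 1]
```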

### `contains` str method
Returns a boolean Series indicating whether the passed string is contained anywhere within each value. Let's determine whether any departments contain the letter **z**.

In [9]:
dept.str.contains('z').head()

0    False
1    False
2    False
3    False
4    False
Name: dept, dtype: bool

In [10]:
dept.str.contains('z').sum()

0
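The `contains` method has a few parameters that are often useful: `case` for case-insensitive matching, `regex` to treat the pattern as a literal string, and `na` to choose the result for missing values. The sketch below demonstrates them on a small made-up Series (`dept_sample`) that includes one missing value.

```python
import pandas as pd

# Hypothetical stand-in for the dept column, with one missing value
dept_sample = pd.Series([
    'Houston Police Department-HPD',
    'Houston Fire Department (HFD)',
    None,
], dtype=object)

# na=False makes missing values count as "does not contain"
has_police = dept_sample.str.contains('Police', na=False)

# case=False ignores upper/lower case
has_police_any_case = dept_sample.str.contains('police', case=False, na=False)

# regex=False searches for the literal character '('
has_paren = dept_sample.str.contains('(', regex=False, na=False)

print(has_police.tolist())           # [True, False, False]
print(has_police_any_case.tolist())  # [True, False, False]
print(has_paren.tolist())            # [False, True, False]
```

Passing `na=False` keeps the result purely boolean, so it can be used directly to filter the Series.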

### `find` str method
Returns the lowest index (the integer location) at which the passed string occurs in each value. If the string is not found, it returns -1.

In [11]:
dept.str.find('a').head(10)

0    18
1    16
2    18
3    -1
4    -1
5    -1
6    16
7     2
8    -1
9     2
Name: dept, dtype: int64
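Since -1 marks "not found", the result of `find` can be turned into a boolean Series by comparing it with -1. The sketch below does this on a two-row stand-in for the `dept` column (`dept_sample` is a hypothetical name).

```python
import pandas as pd

# Hypothetical two-row stand-in for the dept column
dept_sample = pd.Series([
    'Houston Police Department-HPD',
    'Public Works & Engineering-PWE',
])

positions = dept_sample.str.find('a')
found = positions != -1  # True where a lowercase 'a' exists

print(positions.tolist())  # [18, -1]
print(found.tolist())      # [True, False]
```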

### `len` str method
Returns the length of each string.

In [12]:
dept.str.len().head()

0    29
1    29
2    29
3    30
4    28
Name: dept, dtype: int64
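The lengths are ordinary integers, so they combine with the numeric Series methods from earlier notebooks. As a small sketch, the code below finds the longest value among the four distinct departments shown above (`dept_sample` is a hypothetical stand-in for the full column).

```python
import pandas as pd

# The four distinct departments from the first five rows above
dept_sample = pd.Series([
    'Houston Police Department-HPD',
    'Houston Fire Department (HFD)',
    'Public Works & Engineering-PWE',
    'Houston Airport System (HAS)',
])

lengths = dept_sample.str.len()

# idxmax returns the index label of the largest length
longest = dept_sample[lengths.idxmax()]

print(lengths.tolist())  # [29, 29, 30, 28]
print(longest)           # Public Works & Engineering-PWE
```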

### Selecting substrings with brackets

Select the character at integer location 5 of each value in the Series:

In [13]:
dept.str[5].head()

0    o
1    o
2    o
3    c
4    o
Name: dept, dtype: object

Select the last 5 characters of each value in the Series:

In [14]:
dept.str[-5:].head()

0    t-HPD
1    (HFD)
2    t-HPD
3    g-PWE
4    (HAS)
Name: dept, dtype: object

Select the characters at integer locations 5 through 14 (the stop value, 15, is excluded):

In [15]:
dept.str[5:15].head()

0    on Police 
1    on Fire De
2    on Police 
3    c Works & 
4    on Airport
Name: dept, dtype: object
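Bracket slicing on the str accessor follows the same rules as slicing a plain Python string, and the `slice` method of the accessor is an equivalent spelling. The sketch below checks both on a two-row stand-in for the `dept` column (`dept_sample` is a hypothetical name).

```python
import pandas as pd

# Hypothetical two-row stand-in for the dept column
dept_sample = pd.Series([
    'Houston Police Department-HPD',
    'Public Works & Engineering-PWE',
])

bracket = dept_sample.str[5:15]        # bracket syntax
method = dept_sample.str.slice(5, 15)  # equivalent method call

print(bracket.tolist())            # ['on Police ', 'c Works & ']
print(bracket.str.len().tolist())  # [10, 10]: 15 - 5 characters each
```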

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Read in the movie dataset and set the title as the index. Assign the actor 1 column to its own Series variable. Make sure to drop missing values from this Series before assigning it.

Which actor 1 has appeared in the most movies? Can you write an expression that returns this actor's name as a string?</span>

In [36]:
var_m = pd.read_csv('data/movie.csv', index_col='title')
var_a = var_m['actor1'].dropna()
var_a.value_counts().head()

Robert De Niro       48
Johnny Depp          36
Nicolas Cage         32
Denzel Washington    29
J.K. Simmons         29
Name: actor1, dtype: int64

In [35]:
var_a.value_counts().index[0]

'Robert De Niro'

### Problem 2
<span  style="color:green; font-size:16px">How many actor 1's have appeared in exactly one movie?</span>

In [28]:
# Count the actors whose number of appearances is exactly 1
(var_a.value_counts() == 1).sum()

### Problem 3
<span  style="color:green; font-size:16px">How many actor 1's have more than 3 e's in their name? Output a unique array of just these actor names so we can manually verify them.</span>

In [31]:
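# Boolean Series: True where the name has more than 3 lowercase e's
func_3es = var_a.str.count('e') > 3
# Number of matching rows (not displayed: only a cell's last expression is shown)
func_3es.sum()
# Unique names that satisfy the condition
var_a[func_3es].unique()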

array(['Jennifer Lawrence', 'Keanu Reeves', 'Seychelle Gabriel',
       'Jeremy Renner', 'Amber Stevens West', 'Peter Greene',
       'Steven Anthony Lawrence', 'Cedric the Entertainer', 'Sean Pertwee',
       'Xander Berkeley', 'Kathleen Freeman', 'Pierre Perrier',
       'Catherine Deneuve', 'George Kennedy', 'Leighton Meester',
       'Steve Guttenberg', 'Emmanuelle Seigner', 'Jurnee Smollett-Bell',
       'Steve Oedekerk', 'Johannes Silberschneider', 'Bernadette Peters',
       'Jacqueline McKenzie', 'Dee Bradley Baker', 'Jennifer Freeman',
       'Gene Tierney', 'Roscoe Lee Browne', 'Phoebe Legere',
       'Eric Sheffer Stevens', 'Michael Greyeyes', 'Steven Weber',
       'George Newbern', 'Florence Henderson', 'Michelle Simone Miller',
       'Chemeeka Walker', 'Fereshteh Sadre Orafaiy'], dtype=object)

### Problem 4
<span  style="color:green; font-size:16px">Get a unique list of all actors that have the name 'Johnson' as part of their name. </span>

In [32]:
var_a[var_a.str.contains('Johnson').values].unique()

array(['Don Johnson', 'Dwayne Johnson', 'Richard Johnson', 'Eric Johnson',
       'Bill Johnson', 'Nicole Randall Johnson', 'R. Brandon Johnson'], dtype=object)

### Problem 5
<span  style="color:green; font-size:16px">How many actor 1 names end in 'x'?</span>

In [33]:
var_a.str.endswith('x').sum()

28