# 8. String Series Methods

# Methods for Series with String Data Types
In this notebook, we focus on methods for Series that contain string data. Newer versions of pandas offer a dedicated `string` data type, but string columns have traditionally been stored as **object**, which is what we see in this dataset. The object data type can technically hold any Python object, but in practice object columns are almost always composed entirely of strings.

The methods in the previous two notebooks focused on numeric and boolean Series. Many of those methods also work for string Series, but some do not.

For instance, the **`mean`** method will not work for string columns. Let's see this in action by selecting the department column from the City of Houston dataset.

In [1]:
import pandas as pd

In [2]:
emp = pd.read_csv('data/employee.csv')
emp.head()

| Unnamed: 0 | title | dept | salary | race | gender | experience |
|---|---|---|---|---|---|---|
| 0 | POLICE OFFICER | Houston Police Department-HPD | 45279.0 | White | Male | 1 |
| 1 | ENGINEER/OPERATOR | Houston Fire Department (HFD) | 63166.0 | White | Male | 34 |
| 2 | SENIOR POLICE OFFICER | Houston Police Department-HPD | 66614.0 | Black | Male | 32 |
| 3 | ENGINEER | Public Works & Engineering-PWE | 71680.0 | Asian | Male | 4 |
| 4 | CARPENTER | Houston Airport System (HAS) | 42390.0 | White | Male | 3 |


In [3]:
dept = emp['dept']
dept.head()

0     Houston Police Department-HPD
1     Houston Fire Department (HFD)
2     Houston Police Department-HPD
3    Public Works & Engineering-PWE
4      Houston Airport System (HAS)
Name: dept, dtype: object
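
As a quick check of the claim that **`mean`** does not work on strings, here is a minimal sketch. It uses a small made-up Series (`toy` is a hypothetical stand-in for the `dept` column, not part of the dataset) so that it runs without the CSV file.

```python
import pandas as pd

# Hypothetical toy Series standing in for the dept column
toy = pd.Series(['Police', 'Fire', 'Police'])

try:
    toy.mean()
    mean_failed = False
except (TypeError, ValueError) as err:
    # pandas cannot average strings, so an exception is raised
    mean_failed = True
    print(type(err).__name__)
```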

## Most valuable method for string columns: `value_counts`
The **`value_counts`** method is one of the most valuable methods for string columns. It returns the frequency of each value in the Series and sorts it from most to least common.

In [4]:
dept.value_counts().head()

Houston Police Department-HPD     570
Houston Fire Department (HFD)     365
Public Works & Engineering-PWE    341
Health & Human Services           103
Houston Airport System (HAS)      103
Name: dept, dtype: int64

## Notice what object is returned
The **`value_counts`** method itself returns a Series. The unique values of the original Series become the index, and their counts become the values.
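
The sketch below illustrates this on a small made-up Series (the name `toy` and its values are assumptions for illustration only). Because the result is a Series, the usual index and values attributes are available.

```python
import pandas as pd

# Hypothetical toy data, not the real dept column
toy = pd.Series(['Police', 'Fire', 'Police', 'Library', 'Police', 'Fire'])
counts = toy.value_counts()

print(type(counts))            # the result is itself a Series
print(counts.index.tolist())   # unique values, most common first
print(counts.tolist())         # the count of each value
most_common = counts.index[0]  # label of the most frequent value
print(most_common)
```

Selecting the first index label, as in the last lines, is one way to retrieve the most common value as a plain string.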

### Use `normalize=True` for proportion
We can use **`value_counts`** to return the proportion of each value instead of the raw count by setting the **`normalize`** parameter to **`True`**. For instance, this tells us that 32% of the employees are members of the police department.

In [None]:
dept.value_counts(normalize=True).head()
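
To make the relationship between counts and proportions concrete, here is a small sketch on made-up data (`toy` is a hypothetical Series, chosen so the arithmetic is easy to follow). Each proportion is the count divided by the number of non-missing values, so the proportions sum to 1.

```python
import pandas as pd

# Hypothetical toy data: three 'Police' values and one 'Fire' value
toy = pd.Series(['Police', 'Police', 'Police', 'Fire'])

props = toy.value_counts(normalize=True)
print(props)        # Police 0.75, Fire 0.25
print(props.sum())  # the proportions add up to 1
```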

# Special methods just for object columns
Pandas provides a collection of string methods through the **str accessor**. The accessor is available only on Series that hold string data, which in this notebook means Series with the **object** data type. It provides a few dozen methods for string manipulation.

### Access with dot notation
To access these special string methods, append `.str` to the Series and then call the specific string method. Again, these are only available to Series with object data types.
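
As a hedged illustration of that restriction, the sketch below tries to use `.str` on a numeric Series. The Series `numbers` is made up for this example; pandas raises an `AttributeError` when the accessor is used on non-string data.

```python
import pandas as pd

# Hypothetical numeric Series, used only to show the restriction
numbers = pd.Series([1, 2, 3])

try:
    numbers.str.upper()
    has_str = True
except AttributeError as err:
    # the str accessor refuses to work on numeric data
    has_str = False
    print(err)
```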

### Make each value uppercase
Let's call a simple string method to make each value in the **`dept`** Series uppercase. We will use the **`upper`** method of the str accessor.

In [None]:
dept.str.upper().head()

### `str` accessor API
Take a look at the [str accessor API][1] in the official documentation. Let's output all the public methods in the notebook below.

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling
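
One possible way to list the public methods is to inspect the accessor with Python's built-in `dir` and drop names that begin with an underscore. This is a sketch on a made-up Series (`toy` is hypothetical); the exact list depends on the installed pandas version.

```python
import pandas as pd

# Hypothetical toy Series; any Series of strings exposes the same accessor
toy = pd.Series(['Police', 'Fire'])

# Keep only names that do not start with an underscore
str_methods = [name for name in dir(toy.str) if not name.startswith('_')]
print(len(str_methods))
print(str_methods)
```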

### Lots of methods, but mostly easy to use
There is quite a lot of functionality to manipulate and probe strings in almost any way you can imagine. Let's work through some examples of the following string methods:

* **`count`**
* **`contains`**
* **`find`**
* **`len`**

### `count` str method
Returns the number of non-overlapping occurrences of the passed pattern (treated as a regular expression) in each string:

In [None]:
dept.str.count('a').head()

In [None]:
dept.str.count('Department').head()
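
The sketch below runs **`count`** on three department names taken from the output above, held in a made-up Series named `toy`. Because the pattern is interpreted as a regular expression, special characters such as an opening parenthesis need to be escaped.

```python
import pandas as pd

# Three department names from the dataset, in a hypothetical toy Series
toy = pd.Series([
    'Houston Police Department-HPD',
    'Houston Fire Department (HFD)',
    'Health & Human Services',
])

a_counts = toy.str.count('a')              # lowercase 'a' only
dept_counts = toy.str.count('Department')  # whole-word pattern
paren_counts = toy.str.count(r'\(')        # escaped, since '(' is special in a regex

print(a_counts.tolist())
print(dept_counts.tolist())
print(paren_counts.tolist())
```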

### `contains` str method
Returns a boolean for each value indicating whether the passed string is contained anywhere within it. Let's determine whether any departments contain the letter **z**.

In [None]:
dept.str.contains('z').head()

In [None]:
dept.str.contains('z').sum()
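
Summing works because each `True` counts as 1. The sketch below shows this on a made-up Series (`toy` is hypothetical and includes one missing value). It also uses two optional parameters of `contains`: `na=False` treats missing values as not matching, and `regex=False` searches for the literal text instead of a regular expression.

```python
import pandas as pd

# Hypothetical toy Series with one missing value
toy = pd.Series([
    'Houston Police Department-HPD',
    'Houston Fire Department (HFD)',
    None,
])

# na=False makes the missing value count as "does not contain"
has_police = toy.str.contains('Police', na=False)
n_police = has_police.sum()

# regex=False searches for a literal '(' rather than a regex pattern
has_paren = toy.str.contains('(', regex=False, na=False)

print(has_police.tolist())
print(n_police)
print(has_paren.tolist())
```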

### `find` str method
Returns the lowest index (the integer location) at which the passed string is found in each value. Returns -1 if it is not found.

In [None]:
dept.str.find('a').head(10)
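
Here is a small sketch of **`find`** on made-up data (`toy` is a hypothetical Series). The last value contains no lowercase 'a', so it shows the -1 case.

```python
import pandas as pd

# Hypothetical toy Series; 'HPD' has no lowercase 'a'
toy = pd.Series([
    'Houston Police Department-HPD',
    'Health & Human Services',
    'HPD',
])

positions = toy.str.find('a')
print(positions.tolist())  # integer location of the first 'a', or -1
```

The results match what Python's built-in `str.find` returns for each individual string.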

### `len` str method
Returns the length of each string.

In [None]:
dept.str.len().head()
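
Because **`len`** returns a numeric Series, it can be combined with a comparison to filter values. The sketch below assumes a made-up Series named `toy` and an arbitrary cutoff of 5 characters.

```python
import pandas as pd

# Hypothetical toy Series of department-like names
toy = pd.Series(['HPD', 'Library', 'Health & Human Services'])

lengths = toy.str.len()
print(lengths.tolist())

# Keep only the values longer than 5 characters
long_names = toy[lengths > 5]
print(long_names.tolist())
```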

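### Selecting substrings with brackets
The str accessor also supports bracket indexing and slicing, much like a regular Python string. Select the character at integer location 5 of each value in the Series:
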
In [None]:
dept.str[5].head()

Select the last 5 characters of each value in the Series:

In [None]:
dept.str[-5:].head()

Select the characters at integer locations 5 through 14 (the stop value, 15, is excluded):

In [None]:
dept.str[5:15].head()
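
The following sketch repeats these selections on a made-up Series (`toy` is hypothetical) so the results are easy to check by eye. It also shows that bracket slicing gives the same result as the accessor's `slice` method.

```python
import pandas as pd

# Hypothetical toy Series
toy = pd.Series(['Houston Police Department-HPD', 'Library'])

sixth_char = toy.str[5]     # single character at integer location 5
last_three = toy.str[-3:]   # last three characters
middle = toy.str[5:15]      # locations 5 through 14; shorter strings are cut off early

print(sixth_char.tolist())
print(last_three.tolist())
print(middle.tolist())

# The slice method takes the same start and stop values
same_as_slice = middle.equals(toy.str.slice(5, 15))
print(same_as_slice)
```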

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Read in the movie dataset and set the title as the index. Assign the actor 1 column to its own Series variable. Make sure to drop missing values from this Series before assigning it.

Which actor 1 has appeared in the most movies? Can you write an expression that returns this actor's name as a string?</span>

In [None]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">How many actor 1's have appeared in exactly one movie?</span>

In [None]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">How many actor 1's have more than 3 e's in their name? Output a unique array of just these actor names so we can manually verify them.</span>

In [None]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">Get a unique list of all actors who have 'Johnson' as part of their name.</span>

In [None]:
# your code here

### Problem 5
<span  style="color:green; font-size:16px">How many actor 1 names end in 'x'?</span>

In [None]:
# your code here