# 3. String Series Methods

## Methods for Series with String Columns
In this notebook, we focus on methods that work for Series containing string data. Remember that pandas has historically stored strings not in a dedicated string data type but in the **object** dtype, which may contain any Python object (newer versions also offer a `string` dtype). The vast majority of the time, object columns are composed entirely of strings.
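
To make the "any Python object" point concrete, here is a minimal sketch with made-up values (not taken from the employee data). The explicit `dtype=object` is an assumption added so the result does not depend on how a given pandas version stores strings by default.

```python
import pandas as pd

# Hypothetical values; dtype=object is set explicitly for illustration
mixed = pd.Series(['Police', 42, None], dtype=object)

# Each element keeps its original Python type inside an object Series
element_types = [type(v).__name__ for v in mixed]

print(mixed.dtype)
print(element_types)
```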

The methods in the previous two notebooks focused on numeric and boolean Series. Many of those methods also work for string Series, but some do not.

For instance, the **`mean`** method will not work for string columns. Let's see this in action by selecting the department column from the City of Houston dataset.

In [None]:
import pandas as pd

emp = pd.read_csv('../data/employee.csv', parse_dates=['hire_date'])
emp.head()

In [None]:
dept = emp['dept']
dept.head()

### Attempt to take the mean

In [None]:
dept.mean()

### Other methods do work
Many of the other methods covered in the previous two notebooks work just fine with string columns. For example, we can find the maximum department, which is the value that comes last when the strings are sorted alphabetically.

In [None]:
dept.max()
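
One detail worth checking on a small example: string comparison in Python is character by character, and uppercase letters sort before lowercase ones. The sketch below uses hypothetical names, not the employee data, to show the effect on `max` and `min`.

```python
import pandas as pd

# Hypothetical department-like values; note the lowercase 'library'
names = pd.Series(['Police', 'Fire', 'library'], dtype=object)

largest = names.max()    # lowercase letters compare greater than uppercase
smallest = names.min()

print(largest, smallest)
```

If the capitalization of a column is inconsistent, the alphabetical maximum may therefore not be the value you expect.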

Calculate the number of missing values:

In [None]:
dept.isna().sum()

## Valuable method for string columns: `value_counts`
The **`value_counts`** method is one of the most valuable methods for string columns. It returns the frequency of each value in the Series and sorts it from most to least common.

In [None]:
dept.value_counts()

## Notice what object is returned
The **`value_counts`** method itself returns a Series: the unique values of the original Series become the index, and their counts become the values.
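
Because the result is a Series, the usual index and selection tools apply to it. The following sketch uses a small hypothetical Series (not the employee data) to pull out the most common value and look up a count by label.

```python
import pandas as pd

# Hypothetical values for illustration
depts = pd.Series(
    ['Police', 'Fire', 'Police', 'Library', 'Police', 'Fire'], dtype=object
)
counts = depts.value_counts()

most_common = counts.index[0]     # index label: a value from the original Series
its_count = counts.iloc[0]        # Series value: how often that label occurred
fire_count = counts.loc['Fire']   # look up a count by its label

print(most_common, its_count, fire_count)
```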

### Use `normalize=True` for proportion
We can use **`value_counts`** to return the proportion of each occurrence instead of the raw count by setting parameter **`normalize`** to **`True`**. For instance, this tells us that 32% of the employees are members of the police department.

In [None]:
dept.value_counts(normalize=True)
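
As a quick sanity check, the proportions returned with `normalize=True` should add up to 1. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical values for illustration
depts = pd.Series(['Police', 'Fire', 'Police', 'Library'], dtype=object)
shares = depts.value_counts(normalize=True)

police_share = shares.loc['Police']   # 2 of the 4 values
total = shares.sum()                  # proportions sum to 1

print(police_share, total)
```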

### `value_counts` also works for columns of all types
The **`value_counts`** method works for all columns of all types and not just strings. It's just usually more informative for string columns. Let's use it on the salary column to see if we have common salaries.

In [None]:
emp['salary'].value_counts().head(10)

# Special methods just for object columns
Pandas provides a collection of string methods through the **str accessor**. The accessor is available only to Series that hold string data, which for our datasets means Series with the **object** dtype. It provides a few dozen methods for string manipulation.
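
To see the restriction in practice, the sketch below tries the accessor on a hypothetical numeric Series. pandas raises an `AttributeError` in this case, which the code catches so the cell runs to completion.

```python
import pandas as pd

# A numeric Series has no string data, so .str is not available
numbers = pd.Series([1, 2, 3])

try:
    numbers.str.upper()
    accessor_worked = True
except AttributeError as err:
    accessor_worked = False
    print(err)
```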

### Access with dot notation
To access these special string methods, append `.str` to the Series, followed by another dot and the specific string method.

### Make each value uppercase
Let's call a simple string method to make each value in the **`dept`** Series uppercase. We will use the **`upper`** method of the str accessor.

In [None]:
dept.str.upper().head()

### `str` accessor API
Take a look at the [str accessor API][1] in the official documentation.

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling

### Lots of methods but mostly easy to use
The str accessor offers a lot of functionality for manipulating and probing strings in almost any way you can imagine. Let's work through some examples of the following string methods:

* **`count`**
* **`contains`**
* **`find`**
* **`len`**
* **`split`**
* **`replace`**

### `count` str method
Returns the number of times the passed pattern occurs in each string:

In [None]:
dept.str.count('a').head()

In [None]:
dept.str.count('Department').head()
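
One caveat: `str.count` interprets its argument as a regular expression, so characters such as `.` have special meaning. The sketch below, on hypothetical values, contrasts an unescaped and an escaped period.

```python
import pandas as pd

# Hypothetical abbreviations for illustration
codes = pd.Series(['H.P.D.', 'HFD'], dtype=object)

any_char = codes.str.count('.')        # regex '.' matches every character
literal_dot = codes.str.count(r'\.')   # escaped: counts only literal periods

print(any_char.tolist(), literal_dot.tolist())
```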

### `contains` str method
Returns a boolean for each value indicating whether the passed string is contained anywhere within it. Let's determine whether any departments contain the letter **z**.

In [None]:
dept.str.contains('z').head()

In [None]:
dept.str.contains('z').sum()
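
By default the match is case-sensitive, and missing values produce a missing result rather than `False`. The `case` and `na` parameters of `str.contains` adjust both behaviors; the sketch below shows them on hypothetical data.

```python
import pandas as pd

# Hypothetical values, including one missing entry
depts = pd.Series(['Houston Police', 'Fire', None, 'Parks'], dtype=object)

# Case-sensitive by default; the missing value stays missing
default_result = depts.str.contains('p')

# Ignore case and treat missing values as False
clean_result = depts.str.contains('p', case=False, na=False)

print(clean_result.tolist(), clean_result.sum())
```

Setting `na=False` is handy when the boolean result will be summed or used as a filter.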

### `find` str method
Returns the lowest index (the integer location) at which the passed string begins within each value. If the string is not found, it returns -1.

In [None]:
dept.str.find('a').head(10)
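
A small sketch on hypothetical values makes the return values easy to verify by counting characters from zero:

```python
import pandas as pd

# Hypothetical values for illustration
depts = pd.Series(['Police', 'Health', 'Fire'], dtype=object)

# Integer location of the first 'l' in each value, or -1 when absent
first_l = depts.str.find('l')

print(first_l.tolist())
```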

### `len` str method
Returns the length of each string.

In [None]:
dept.str.len().head()
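
When a Series contains missing values, `str.len` leaves them missing in the result instead of raising an error. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical values, including one missing entry
names = pd.Series(['Police', 'Fire', None], dtype=object)
lengths = names.str.len()

longest = lengths.max()   # missing values are skipped by max

print(lengths.tolist(), longest)
```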

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Read in the movie dataset and set the title as the index. Assign the actor 1 column to its own Series variable. Make sure to drop missing values from this Series before assigning it. Which actor 1 has appeared in the most movies? Can you write an expression that returns this actor's name as a string?</span>

### Problem 2
<span  style="color:green; font-size:16px">What percent of movies have the top 100 most frequent actor 1's appeared in?</span>

### Problem 3
<span  style="color:green; font-size:16px">How many actor 1's have appeared in exactly one movie?</span>

### Problem 4
<span  style="color:green; font-size:16px">How many actor 1's have more than 3 e's in their name? Output a unique array of just these actor names so we can manually verify them.</span>

### Problem 5
<span  style="color:green; font-size:16px">Get a unique list of all actors that have the name 'Johnson' as part of their name.</span>

### Problem 6
<span  style="color:green; font-size:16px">How many actor 1 names end in 'x'?</span>

### Problem 7
<span  style="color:green; font-size:16px">The Pandas string methods overlap with the built-in Python string methods. Find all the public method names that are common to both. Then find the public methods that are unique to each.</span>

# Explore More `str` Methods and their parameters
In the section below, you can learn and practice with other methods and their parameters. There are far too many to cover during a lecture, so the rest are left for you to explore on your own.

### `split` str method
Splits each string into multiple separate strings based on a given separator. By default, the separator is any run of whitespace. The following splits on whitespace and returns a Series of lists.

In [None]:
dept.str.split().head()

Set the **`expand`** parameter to **`True`** to return a DataFrame:

In [None]:
dept.str.split(expand=True).head()
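
The `n` parameter limits the number of splits, which is useful for separating the first word from the rest. The sketch below uses hypothetical values, one of which contains a double space, to show both the default whitespace handling and a limited split.

```python
import pandas as pd

# Hypothetical values; the second contains two consecutive spaces
depts = pd.Series(
    ['Houston Police Department', 'Parks  Recreation'], dtype=object
)

# With no separator given, a run of whitespace counts as one separator
words = depts.str.split()

# At most one split per value, expanded into a two-column DataFrame
first_rest = depts.str.split(n=1, expand=True)

print(words.tolist())
print(first_rest)
```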

### `replace` str method
You must pass two arguments to `replace`: the string you want to replace and its replacement value.

In [None]:
dept.str.replace('Houston', 'H-Town').head()
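
`str.replace` also has a `regex` parameter that controls whether the first argument is treated as a literal string or as a regular expression. Its default has differed between pandas versions, so passing it explicitly is the safer habit. A sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical values for illustration
codes = pd.Series(['H.P.D.', 'Houston  Fire'], dtype=object)

# Literal replacement: remove every period
no_dots = codes.str.replace('.', '', regex=False)

# Regular expression: collapse runs of whitespace into one space
one_space = codes.str.replace(r'\s+', ' ', regex=True)

print(no_dots.tolist())
print(one_space.tolist())
```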

### Selecting substrings with the brackets
Selecting a single character of a Python string is simple and accomplished by placing the integer location of the desired character in brackets. Selecting substrings is also quite simple and accomplished by using slice notation in the brackets.

Pandas allows us to perform the exact same operation with its **`str`** accessor to select one or more characters of each string. We simply append the brackets to **`str`** and use the same selection process as we do with Python strings. Let's see some examples.

Select the character with integer location 5 for each value in the Series:

In [None]:
dept.str[5].head()

Select the last 5 characters of each value in the Series:

In [None]:
dept.str[-5:].head()

Select the characters at integer locations 5 through 14 (as with Python slices, the stop value 15 is excluded):

In [None]:
dept.str[5:15].head()
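
To confirm that the accessor mirrors plain Python indexing, the sketch below applies the same slice to a single string and to a hypothetical Series. It also shows a difference: selecting a single location that does not exist yields a missing value rather than an `IndexError`.

```python
import pandas as pd

text = 'Houston Police'
# Hypothetical Series containing the same string plus a shorter one
depts = pd.Series([text, 'Fire'], dtype=object)

plain_slice = text[8:14]         # ordinary Python slicing
series_slice = depts.str[8:14]   # the same slice applied to every value
too_far = depts.str[10]          # 'Fire' has no character at location 10

print(plain_slice)
print(series_slice.tolist())
print(too_far.tolist())
```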

# There are dozens of other string methods. Keep practicing below
Use the documentation to read about every parameter in each method.