# 8. String Series Methods

# Methods for Series with String Data Types
In this notebook we will focus on methods that work for Series that contain string data. Remember that, by default in the version of pandas used here, strings are not stored in a dedicated string data type. Instead they use **object**, which technically refers to any Python object, although in practice object columns are almost always composed entirely of strings.

The methods in the previous two notebooks focused on numeric and boolean Series. Many of those methods also work for string Series, but some do not.

For instance, the **`mean`** method will not work for string columns. Let's see this in action by selecting the department column from the City of Houston dataset.

In [4]:
import pandas as pd

In [5]:
emp = pd.read_csv('../data/employee.csv', parse_dates=['HIRE_DATE', 'JOB_DATE'])
emp.head()

| | POSITION_TITLE | DEPARTMENT | BASE_SALARY | RACE | EMPLOYMENT_TYPE | GENDER | EMPLOYMENT_STATUS | HIRE_DATE | JOB_DATE |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ASSISTANT DIRECTOR (EX LVL) | Municipal Courts Department | 121862.0 | Hispanic/Latino | Full Time | Female | Active | 2006-06-12 | 2012-10-13 |
| 1 | LIBRARY ASSISTANT | Library | 26125.0 | Hispanic/Latino | Full Time | Female | Active | 2000-07-19 | 2010-09-18 |
| 2 | POLICE OFFICER | Houston Police Department-HPD | 45279.0 | White | Full Time | Male | Active | 2015-02-03 | 2015-02-03 |
| 3 | ENGINEER/OPERATOR | Houston Fire Department (HFD) | 63166.0 | White | Full Time | Male | Active | 1982-02-08 | 1991-05-25 |
| 4 | ELECTRICIAN | General Services Department | 56347.0 | White | Full Time | Male | Active | 1989-06-19 | 1994-10-22 |


In [7]:
dept = emp['DEPARTMENT']
dept.head()

0      Municipal Courts Department
1                          Library
2    Houston Police Department-HPD
3    Houston Fire Department (HFD)
4      General Services Department
Name: DEPARTMENT, dtype: object

### Attempt to take the mean
The traceback for this error is very long, so we will catch the exception and simply print its type:

In [13]:
try:
    dept.mean()
except Exception as e:
    print(type(e))

<class 'TypeError'>


### Other methods do work
Many of the other methods covered in the previous two notebooks work just fine with string columns. For example, **`max`** returns the maximum department, meaning the department that sorts last alphabetically.

In [16]:
dept.max()

'Solid Waste Management'
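"Alphabetical" here is case-sensitive: in Python's string ordering, every uppercase ASCII letter sorts before every lowercase one. The following is a minimal sketch of that behavior using a made-up Series (the values are hypothetical and not taken from the employee data):

```python
import pandas as pd

# Hypothetical values chosen to show that ordering is case-sensitive
fruits = pd.Series(['apple', 'Zebra', 'banana'])

largest = fruits.max()    # 'banana': lowercase letters sort after uppercase
smallest = fruits.min()   # 'Zebra': 'Z' sorts before every lowercase letter

# Lowercasing first gives a case-insensitive comparison
largest_ignoring_case = fruits.str.lower().max()
print(largest, smallest, largest_ignoring_case)
```

If mixed case matters in your data, normalizing the case before comparing is one simple option.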

Calculate the number of missing values:

In [15]:
dept.isna().sum()

0

## Most valuable method for string columns: `value_counts`
The **`value_counts`** method is one of the most valuable methods for string columns. It returns the frequency of each value in the Series and sorts it from most to least common.

In [17]:
dept.value_counts()

Houston Police Department-HPD     638
Houston Fire Department (HFD)     384
Public Works & Engineering-PWE    343
Health & Human Services           110
Houston Airport System (HAS)      106
Parks & Recreation                 74
Solid Waste Management             43
Library                            36
Fleet Management Department        36
Admn. & Regulatory Affairs         29
Municipal Courts Department        28
Human Resources Dept.              24
Houston Emergency Center (HEC)     23
General Services Department        22
Housing and Community Devp.        22
Dept of Neighborhoods (DON)        17
Legal Department                   17
City Council                       11
Finance                            10
Houston Information Tech Svcs       9
Planning & Development              7
Mayor's Office                      5
City Controller's Office            5
Convention and Entertainment        1
Name: DEPARTMENT, dtype: int64

### Notice what object is returned
The **`value_counts`** method itself returns a Series: the unique values of the original Series become the index, and their counts become the values.
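Because the result is an ordinary Series, the usual index and value tools apply to it. Here is a small sketch with hypothetical labels (the names `depts` and `counts` are made up for illustration):

```python
import pandas as pd

# Hypothetical department labels, not the real employee data
depts = pd.Series(['Police', 'Fire', 'Police', 'Library', 'Police', 'Fire'])
counts = depts.value_counts()

print(type(counts))          # the result is itself a pandas Series
print(list(counts.index))    # the original values now form the index
print(counts.tolist())       # the frequencies are the values

# Look up the count for one label through the index
fire_count = counts.loc['Fire']
print(fire_count)
```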

### Use `normalize=True` for proportion
We can use **`value_counts`** to return the proportion of each occurrence instead of the raw count by setting the **`normalize`** parameter to **`True`**. For instance, the output below tells us that about 32% of the employees are members of the police department.

In [18]:
dept.value_counts(normalize=True)

Houston Police Department-HPD     0.3190
Houston Fire Department (HFD)     0.1920
Public Works & Engineering-PWE    0.1715
Health & Human Services           0.0550
Houston Airport System (HAS)      0.0530
Parks & Recreation                0.0370
Solid Waste Management            0.0215
Library                           0.0180
Fleet Management Department       0.0180
Admn. & Regulatory Affairs        0.0145
Municipal Courts Department       0.0140
Human Resources Dept.             0.0120
Houston Emergency Center (HEC)    0.0115
General Services Department       0.0110
Housing and Community Devp.       0.0110
Dept of Neighborhoods (DON)       0.0085
Legal Department                  0.0085
City Council                      0.0055
Finance                           0.0050
Houston Information Tech Svcs     0.0045
Planning & Development            0.0035
Mayor's Office                    0.0025
City Controller's Office          0.0025
Convention and Entertainment      0.0005
Name: DEPARTMENT, dtype: float64
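To see what **`normalize=True`** computes, the sketch below reproduces the proportions by hand on a tiny, hypothetical Series with no missing values and checks that they add up to 1:

```python
import pandas as pd

# Hypothetical data with no missing values
genders = pd.Series(['Male', 'Female', 'Male', 'Male'])
props = genders.value_counts(normalize=True)

# Same numbers computed by hand: raw counts divided by the Series length
by_hand = genders.value_counts() / len(genders)

total = props.sum()      # the proportions add up to 1
percent = props * 100    # scale to percentages
print(props, total, percent, sep='\n')
```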

### `value_counts` also works for columns of all types
The **`value_counts`** method works for columns of any type, not just strings. It is just usually more informative for string columns. Let's use it on the salary column to see if we have common salaries.

In [20]:
emp['BASE_SALARY'].value_counts().head(10)

66614.0    157
55461.0     68
81239.0     59
26125.0     39
62540.0     38
47650.0     37
70181.0     31
60347.0     30
66523.0     29
63166.0     29
Name: BASE_SALARY, dtype: int64

### `isin` method
The **`isin`** method accepts a list of values and returns True or False for each Series element, depending on whether that element is contained in the given list. We saw this method during boolean indexing when we wanted to use multiple *or* conditions.

In [28]:
has_depts = dept.isin(['Parks & Recreation', 'Finance', 'Legal Department'])
has_depts.head()

0    False
1    False
2    False
3    False
4    False
Name: DEPARTMENT, dtype: bool

In [29]:
emp[has_depts].head()

| | POSITION_TITLE | DEPARTMENT | BASE_SALARY | RACE | EMPLOYMENT_TYPE | GENDER | EMPLOYMENT_STATUS | HIRE_DATE | JOB_DATE |
|---|---|---|---|---|---|---|---|---|---|
| 32 | SENIOR ACCOUNTANT | Finance | 46963.0 | Black or African American | Full Time | Male | Active | 1991-02-11 | 2016-02-13 |
| 92 | CUSTODIAN | Parks & Recreation | 26125.0 | Black or African American | Full Time | Female | Active | 1993-10-02 | 1993-10-02 |
| 117 | SENIOR ASSISTANT CITY ATTORNEY I | Legal Department | 90957.0 | Black or African American | Full Time | Female | Active | 1998-03-20 | 2012-07-21 |
| 131 | RECREATION SPECIALIST | Parks & Recreation | 33592.0 | Black or African American | Full Time | Male | Active | 2013-06-22 | 2013-06-22 |
| 157 | RECREATION SPECIALIST | Parks & Recreation | 30368.0 | Black or African American | Full Time | Male | Active | 2007-10-22 | 2007-10-22 |
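As a quick check of the claim that **`isin`** stands in for several *or* conditions, the sketch below builds the same mask both ways on a small, hypothetical Series (`d` is a made-up stand-in for the department column):

```python
import pandas as pd

# Hypothetical Series standing in for the department column
d = pd.Series(['Finance', 'Library', 'Legal Department', 'Finance'])

with_isin = d.isin(['Finance', 'Legal Department'])
with_or = (d == 'Finance') | (d == 'Legal Department')

# Both approaches flag exactly the same rows
same = bool((with_isin == with_or).all())

# The ~ operator inverts the mask to select everything else
others = d[~with_isin]
print(same, others.tolist())
```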


# Special methods just for object columns
Pandas provides a collection of string methods through the **str accessor**. In this notebook we use it on Series with the **object** data type that hold strings. It is not available on numeric or boolean Series. It provides a few dozen methods for string manipulation.

### Access with dot notation
To access these special string methods, append `.str` to the Series object and then call the specific string method. Again, these methods are not available on numeric or boolean Series.
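The sketch below shows what happens when the accessor is used on a numeric Series, and one way around it. The Series `numbers` is hypothetical:

```python
import pandas as pd

numbers = pd.Series([1, 2, 3])  # hypothetical numeric Series

# The str accessor is rejected for non-string data
try:
    numbers.str.upper()
except AttributeError as e:
    error_name = type(e).__name__
    print(error_name)

# Converting the values to strings first makes the accessor available
as_text = numbers.astype(str)
lengths = as_text.str.len()
print(lengths.tolist())
```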

### Make each value uppercase
Let's call a simple string method to make each value in the **`dept`** Series uppercase. We will use the **`upper`** method of the str accessor.

In [36]:
dept.str.upper().head()

0      MUNICIPAL COURTS DEPARTMENT
1                          LIBRARY
2    HOUSTON POLICE DEPARTMENT-HPD
3    HOUSTON FIRE DEPARTMENT (HFD)
4      GENERAL SERVICES DEPARTMENT
Name: DEPARTMENT, dtype: object

### `str` accessor API
Take a look at the [str accessor API][1] in the official documentation. Let's output all the public methods in the notebook below.

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling

In [38]:
str_methods = [method for method in dir(dept.str) if not method.startswith('_')]
str_methods

['capitalize',
 'cat',
 'center',
 'contains',
 'count',
 'decode',
 'encode',
 'endswith',
 'extract',
 'extractall',
 'find',
 'findall',
 'get',
 'get_dummies',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'islower',
 'isnumeric',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'len',
 'ljust',
 'lower',
 'lstrip',
 'match',
 'normalize',
 'pad',
 'partition',
 'repeat',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'slice',
 'slice_replace',
 'split',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'wrap',
 'zfill']

In [39]:
len(str_methods)

52

### Lots of methods, but mostly easy to use
There is quite a lot of functionality to manipulate and probe strings in almost any way you can imagine. Let's work through some examples of the following string methods:

* **`count`**
* **`contains`**
* **`find`**
* **`len`**
* **`split`**
* **`replace`**

### `count` str method
Returns the number of times the passed pattern occurs in each string:

In [53]:
dept.str.count('a').head()

0    2
1    1
2    1
3    1
4    2
Name: DEPARTMENT, dtype: int64

In [54]:
dept.str.count('Department').head()

0    1
1    0
2    1
3    1
4    1
Name: DEPARTMENT, dtype: int64
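The argument to **`count`** is interpreted as a regular expression, which is handy but means special characters need escaping. A minimal sketch on hypothetical values:

```python
import re
import pandas as pd

# Hypothetical values, not the real employee data
names = pd.Series(['Fire Department (HFD)', 'Library'])

# A character class counts either letter: 'a' or 'e'
a_or_e = names.str.count('[ae]')

# '(' is special in a regular expression, so escape it to count it literally
parens = names.str.count(re.escape('('))
print(a_or_e.tolist(), parens.tolist())
```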

### `contains` str method
Returns a boolean for each value indicating whether the passed string (treated as a regular expression by default) is contained somewhere within it. Let's determine whether any departments contain the letter **z**.

In [58]:
dept.str.contains('z').head()

0    False
1    False
2    False
3    False
4    False
Name: DEPARTMENT, dtype: bool

In [59]:
dept.str.contains('z').sum()

0
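A few optional parameters of **`contains`** are often useful. The sketch below, on hypothetical data, shows case-insensitive matching, handling of missing values, and literal (non-regex) matching:

```python
import pandas as pd

# Hypothetical values, including one missing entry
names = pd.Series(['Solid Waste Management', 'Library', None])

# case=False ignores letter case; na=False treats missing values as no match
has_w = names.str.contains('w', case=False, na=False)

# regex=False matches the text literally; as a regex, '.' would match any character
has_dot = pd.Series(['Dept.', 'Dept']).str.contains('.', regex=False)

# The boolean result can filter the original Series
matches = names[has_w]
print(has_w.tolist(), has_dot.tolist(), matches.tolist())
```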

### `find` str method
Returns the lowest index (the integer location) at which the passed string occurs in each value. Returns -1 if it is not found.

In [62]:
dept.str.find('a').head(10)

0     7
1     4
2    18
3    16
4     5
5    18
6    -1
7    -1
8    -1
9    -1
Name: DEPARTMENT, dtype: int64

### `len` str method
Returns the length of each string.

In [63]:
dept.str.len().head()

0    27
1     7
2    29
3    29
4    27
Name: DEPARTMENT, dtype: int64

### `split` str method
Splits each string into multiple separate strings based on a given separator. By default, it splits on whitespace. The following call uses that default and returns a Series of lists.

In [67]:
dept.str.split().head()

0       [Municipal, Courts, Department]
1                             [Library]
2     [Houston, Police, Department-HPD]
3    [Houston, Fire, Department, (HFD)]
4       [General, Services, Department]
Name: DEPARTMENT, dtype: object

Set the **`expand`** parameter to **`True`** to return a DataFrame:

In [68]:
dept.str.split(expand=True).head()

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | Municipal | Courts | Department | |
| 1 | Library | | | |
| 2 | Houston | Police | Department-HPD | |
| 3 | Houston | Fire | Department | (HFD) |
| 4 | General | Services | Department | |
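Beyond the defaults, **`split`** accepts an explicit separator and a limit on the number of splits. The following sketch uses hypothetical values to show both, plus one way to pull out the first piece of each list:

```python
import pandas as pd

# Hypothetical values, not the real employee data
d = pd.Series(['Houston Police Department-HPD', 'Library'])

# Split on a specific separator instead of whitespace
on_dash = d.str.split('-')

# n=1 performs at most one split, giving at most two pieces per row
first_rest = d.str.split(' ', n=1, expand=True)

# .str[0] picks the first element from each list
first_word = d.str.split().str[0]
print(on_dash.tolist(), first_word.tolist())
print(first_rest)
```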


### `replace` str method
You must pass two string arguments to **`replace`**: the string you want to replace and its replacement value.

In [130]:
dept.str.replace('Houston', 'H-Town').head()

0     Municipal Courts Department
1                         Library
2    H-Town Police Department-HPD
3    H-Town Fire Department (HFD)
4     General Services Department
Name: DEPARTMENT, dtype: object
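Whether the first argument is treated as a regular expression depends on the **`regex`** parameter, and its default has differed between pandas versions, so passing it explicitly is the safer habit. A minimal sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical values, not the real employee data
d = pd.Series(['Houston Fire Department (HFD)', 'Library'])

# Literal replacement: with regex=False, '(' needs no escaping
no_open = d.str.replace('(', '', regex=False)

# Regular-expression replacement: drop a trailing parenthesized abbreviation
no_abbrev = d.str.replace(r'\s*\(.*\)$', '', regex=True)
print(no_open.tolist(), no_abbrev.tolist())
```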

### Selecting substrings with brackets
Selecting a single character of a Python string is simple and accomplished by placing the integer location of the desired character in brackets. Selecting substrings is also quite simple and accomplished by using slice notation in the brackets.

Pandas allows us to perform the exact same operation with its **`str`** accessor to select one or more characters of each string. We simply append the brackets to **`str`** and use the same selection process as we do with Python strings. Let's see some examples.

Select the character with integer location 5 for each value in the Series:

In [139]:
dept.str[5].head()

0    i
1    r
2    o
3    o
4    a
Name: DEPARTMENT, dtype: object

Select the last 5 characters of each value in the Series:

In [140]:
dept.str[-5:].head()

0    tment
1    brary
2    t-HPD
3    (HFD)
4    tment
Name: DEPARTMENT, dtype: object

Select the characters at integer locations 5 through 14 (the stop value, 15, is excluded):

In [141]:
dept.str[5:15].head()

0    ipal Court
1            ry
2    on Police 
3    on Fire De
4    al Service
Name: DEPARTMENT, dtype: object
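The bracket syntax has a method equivalent, **`slice`**, and mirrors what plain Python does to each string. The sketch below compares the three on hypothetical values and also shows what a single out-of-range position returns:

```python
import pandas as pd

# Hypothetical values, not the real employee data
d = pd.Series(['Municipal Courts Department', 'Library'])

bracket = d.str[5:15]            # bracket syntax on the accessor
method = d.str.slice(5, 15)      # same result via the slice method
plain = [s[5:15] for s in d]     # same result with plain Python strings

# A single position past the end of a string gives a missing value, not an error
eleventh = d.str[10]
print(bracket.tolist(), method.tolist(), plain)
print(eleventh)
```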

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Read in the movie dataset and set the title as the index. Assign the actor 1 column to its own Series variable. Make sure to drop missing values from this Series before assigning it.

Which actor 1 has appeared in the most movies? Can you write an expression that returns this actor's name as a string?</span>

In [131]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">What percent of movies have the top 100 most frequent actor 1's appeared in?</span>

In [132]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">How many actor 1's have appeared in exactly one movie?</span>

In [133]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">How many actor 1's have more than 3 e's in their name? Output a unique array of just these actor names so we can manually verify them.</span>

In [134]:
# your code here

### Problem 5
<span  style="color:green; font-size:16px">Get a unique list of all actors that have the name 'Johnson' as part of their name. Note: When using the </span>

In [135]:
# your code here

### Problem 6
<span  style="color:green; font-size:16px">How many actor 1 names end in 'x'?</span>

In [136]:
# your code here

### Problem 7
<span  style="color:green; font-size:16px">The Pandas string methods overlap with the built-in Python string methods. Find all the public method names that are common to both. Then find the public methods that are unique to each.</span>

In [None]:
# your code here