# Working with Text Data

In [24]:
import pandas as pd

## This Module's Dataset
- This module's dataset (`chicago.csv`) is a collection of public sector employees in the city of Chicago.
- Each row includes the employee's name, position title, department, and annual salary.

In [25]:
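# Load the dataset; all four columns are read as text (object dtype).
chicago: pd.DataFrame = pd.read_csv("chicago.csv")

# Drop rows in which every value is missing.
chicago = chicago.dropna(how="all")

chicago.info()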

<class 'pandas.core.frame.DataFrame'>
Index: 32062 entries, 0 to 32061
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Name                    32062 non-null  object
 1   Position Title          32062 non-null  object
 2   Department              32062 non-null  object
 3   Employee Annual Salary  32062 non-null  object
dtypes: object(4)
memory usage: 1.2+ MB


In [26]:

chicago["Department"] = chicago["Department"].astype("category")

In [27]:

chicago.head(10)

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00
5,"ABARCA, ANABEL",ASST TO THE ALDERMAN,CITY COUNCIL,$70764.00
6,"ABARCA, EMMANUEL",GENERAL LABORER - DSS,STREETS & SAN,$41849.60
7,"ABASCAL, REECE E",TRAFFIC CONTROL AIDE-HOURLY,OEMC,$20051.20
8,"ABBASI, CHRISTOPHER",STAFF ASST TO THE ALDERMAN,CITY COUNCIL,$49452.00
9,"ABBATACOLA, ROBERT J",ELECTRICAL MECHANIC,AVIATION,$93600.00


In [28]:
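def parse_str(_str: str) -> str:
    """Strip the dollar sign and the cents from a salary string such as "$90744.00"."""
    _str = _str.replace("$", "")
    # partition returns the whole string when there is no ".", whereas
    # slicing with find() would drop the last character in that case.
    return _str.partition(".")[0]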

In [32]:
# Strip the "$" and the cents with parse_str; the values remain strings (object dtype).
chicago["Employee Annual Salary"] = chicago["Employee Annual Salary"].map(parse_str)
chicago["Employee Annual Salary"]


0         90744
1         84450
2         84450
3         89880
4        106836
          ...  
32057     99528
32058     87384
32059     84450
32060     87384
32061    113664
Name: Employee Annual Salary, Length: 32062, dtype: object

## Common String Methods
- A **Series** has a special `str` attribute that exposes an object with string methods.
- Access the `str` attribute, then invoke the string method on the nested object.
- Most method names match their built-in Python `str` equivalents (`upper`, `lower`, `title`, etc.).
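The bullets above can be illustrated with a minimal sketch. The `titles` Series is a made-up sample in the style of the `Position Title` column, not the real dataset.

```python
import pandas as pd

# Hypothetical sample in the style of the "Position Title" column.
titles = pd.Series(["WATER RATE TAKER", "police officer", "  Civil Engineer IV "])

# Each call goes through the .str accessor and returns a new Series.
upper = titles.str.upper()
lower = titles.str.lower()
lengths = titles.str.len()

# Methods can be chained: strip surrounding whitespace, then title-case.
title_case = titles.str.strip().str.title()

print(title_case)
```

None of these calls modify `titles` in place; assign the result back to the column to keep it.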

## Filtering with String Methods
- The `str.contains` method checks whether a substring exists anywhere in the string.
- The `str.startswith` method checks whether a substring exists at the start of the string.
- The `str.endswith` method checks whether a substring exists at the end of the string.
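A small sketch of how these three methods can drive row filtering. The `staff` DataFrame is a hypothetical four-row sample modeled on the rows shown earlier; each method returns a Boolean Series that can be used as a mask.

```python
import pandas as pd

# Hypothetical sample modeled on the first rows of the dataset.
staff = pd.DataFrame({
    "Name": ["AARON, ELVIA J", "AARON, JEFFERY M", "ABAD JR, VICENTE M", "ABARCA, ANABEL"],
    "Position Title": ["WATER RATE TAKER", "POLICE OFFICER", "CIVIL ENGINEER IV", "ASST TO THE ALDERMAN"],
})

# str.contains treats the pattern as a regular expression by default;
# regex=False requests a plain substring match.
has_officer = staff["Position Title"].str.contains("OFFICER", regex=False)
starts_civil = staff["Position Title"].str.startswith("CIVIL")
ends_iv = staff["Position Title"].str.endswith("IV")

# A Boolean Series inside square brackets keeps only the True rows.
officers = staff[has_officer]
print(officers)
```

All three checks are case-sensitive, so it may help to normalize the column with `str.lower` or `str.upper` before matching.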

## String Methods on Index and Columns
- Use the `index` and `columns` attributes to access the **DataFrame** index/column labels.
- These objects support string methods via their own `str` attribute.
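As a rough sketch, the same accessor can clean up row and column labels. The `staff` DataFrame, its lowercase index, and the snake_case naming are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical frame whose index holds lowercase names.
staff = pd.DataFrame(
    {
        "Position Title": ["WATER RATE TAKER", "POLICE OFFICER"],
        "Employee Annual Salary": ["90744", "84450"],
    },
    index=["aaron, elvia j", "aaron, jeffery m"],
)

# Index objects expose .str too; the result is a new Index,
# so it has to be assigned back to take effect.
staff.index = staff.index.str.upper()
staff.columns = staff.columns.str.lower().str.replace(" ", "_")

print(staff)
```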

## The split Method
- The `str.split` method splits a string by the occurrence of a delimiter. Pandas returns a **Series** of lists.
- Use the `str.get` method to access a nested list element by its index position.
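A brief sketch of `str.split` followed by `str.get`, using a made-up three-row Series. When the requested position does not exist in a list, `str.get` returns a missing value, whereas plain indexing such as `x[1]` would raise an `IndexError`.

```python
import pandas as pd

# Hypothetical sample of position titles.
titles = pd.Series(["POLICE OFFICER", "CIVIL ENGINEER IV", "MAYOR"])

# Each element becomes a list of words.
words = titles.str.split(" ")

first = words.str.get(0)    # first word of every title
second = words.str.get(1)   # missing (NaN) for the one-word title "MAYOR"
last = words.str.get(-1)    # negative positions count from the end

print(second)
```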

In [46]:
split_positions: pd.Series = chicago["Position Title"].str.split(" ")

# str.get(0) pulls the first word from each list and tolerates missing values.
split_positions.str.get(0).value_counts()

Position Title
POLICE             10856
FIREFIGHTER-EMT     1509
SERGEANT            1186
POOL                 918
FIREFIGHTER          810
                   ...  
DENTIST                1
ASSOC                  1
TELEPHONE              1
MAYOR                  1
PREPRESS               1
Name: count, Length: 320, dtype: int64

## More Practice with Splits

In [56]:
# Remove the comma, then split on any run of whitespace.
chicago["Name"].str.replace(",", "").str.split()

0             [AARON, ELVIA, J]
1           [AARON, JEFFERY, M]
2               [AARON, KARINA]
3         [AARON, KIMBERLEI, R]
4        [ABAD, JR, VICENTE, M]
                  ...          
32057     [ZYGADLO, MICHAEL, J]
32058      [ZYGOWICZ, PETER, J]
32059       [ZYMANTAS, MARK, E]
32060     [ZYRKOWSKI, CARLO, E]
32061      [ZYSKOWSKI, DARIUSZ]
Name: Name, Length: 32062, dtype: object

## The expand and n Parameters of the split Method
- The `expand` parameter returns a **DataFrame** instead of a **Series** of lists.
- The `n` parameter limits the number of splits.
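The two parameters can be combined, as in the sketch below. The sample names and the column labels `Last Name` and `First Name` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample in the style of the "Name" column.
names = pd.Series(["AARON, ELVIA J", "ABAD JR, VICENTE M", "ABARCA, ANABEL"])

# n=1 stops after the first ", ", so each name yields at most two pieces;
# expand=True spreads the pieces across DataFrame columns.
parts = names.str.split(", ", n=1, expand=True)
parts.columns = ["Last Name", "First Name"]

# With n=1, everything after the first space stays together.
title_parts = pd.Series(["CHIEF CONTRACT EXPEDITER"]).str.split(" ", n=1, expand=True)

print(parts)
```

Without `n`, rows that split into different numbers of pieces would still expand, with the shorter rows padded by missing values.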