In [6]:
from dsc80_utils import *

# Lecture 11 – Regular Expressions

## DSC 80, Fall 2023

## 📣 Announcements 📣

- Mid-quarter survey out, due **tonight, Nov 9 at 11:59pm**.
    - https://forms.gle/khHDPRuhgTqZTW1e9
    - If 90% of the class fills it out, everyone gets +1 point on the midterm.
- No discussion or OH tomorrow (Veterans Day).
- Lab 6 due Monday.
- Project 3 due Friday, Nov 17.

### Agenda

Lots and lots of regular expressions! Good resources:
- [regex101.com](https://regex101.com), a helpful site to have open while writing regular expressions.
- Python [`re` library documentation](https://docs.python.org/3/library/re.html) and [how-to](https://docs.python.org/3/howto/regex.html).
    - The "how-to" is great, read it!
- [regex "cheat sheet"](https://dsc80.com/resources/other/berkeley-regex-reference.pdf) (taken from [here](https://ds100.org/sp22/resources/)).

See [dsc80.com/resources/#regular-expressions](https://dsc80.com/resources/#regular-expressions).

## Motivation

In [7]:
contact = '''
Thank you for buying our expensive product!

If you have a complaint, please send it to complaints@compuserve.com or call (800) 867-5309.

If you are happy with your purchase, please call us at (800) 123-4567; we'd love to hear from you!

Due to high demand, please allow one-hundred (100) business days for a response.
'''

### Who called? 📞

- **Goal**: Extract all phone numbers from a piece of text, assuming they are of the form `'(###) ###-####'`.

In [8]:
print(contact)


Thank you for buying our expensive product!

If you have a complaint, please send it to complaints@compuserve.com or call (800) 867-5309.

If you are happy with your purchase, please call us at (800) 123-4567; we'd love to hear from you!

Due to high demand, please allow one-hundred (100) business days for a response.



- We can do this using the same string methods we've come to know and love.

- Strategy:
    - Split by spaces.
    - Check if there are any consecutive "words" where:
        - the first "word" looks like an area code, like `'(678)'`.
        - the second "word" looks like the last 7 digits of a phone number, like `'999-8212'`. 

Let's first write a function that takes in a string and returns whether it looks like an area code.

In [9]:
def is_possibly_area_code(s):
    '''Does `s` look like (678)?'''
    return (len(s) == 5 and
            s.startswith('(') and
            s.endswith(')') and
            s[1:4].isnumeric()
           )

In [10]:
is_possibly_area_code('(123)')

True

In [11]:
is_possibly_area_code('(12)')

False

Let's also write a function that takes in a string and returns whether it looks like the last 7 digits of a phone number.

In [12]:
def is_last_7_phone_number(s):
    '''Does `s` look like 999-8212?'''
    return (len(s) == 8 and
            s[0:3].isnumeric() and
            s[3] == '-' and
            s[4:].isnumeric()
           )

In [13]:
is_last_7_phone_number('999-1234')

True

In [14]:
is_last_7_phone_number('999-123')

False

Finally, let's split the entire text by spaces, and check whether there are any instances where `pieces[i]` looks like an area code and `pieces[i+1]` looks like the last 7 digits of a phone number.

In [None]:
# Removes punctuation from the end of each string.
pieces = [s.rstrip('.,?;"\'') for s in contact.split()]

...
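One way the remaining loop might look, as a minimal sketch. It repeats the two helper functions from above so that it runs on its own; `find_phone_numbers` and the shortened `sample` text are hypothetical names introduced only for illustration.

```python
def is_possibly_area_code(s):
    '''Does `s` look like (678)?'''
    return len(s) == 5 and s.startswith('(') and s.endswith(')') and s[1:4].isnumeric()

def is_last_7_phone_number(s):
    '''Does `s` look like 999-8212?'''
    return len(s) == 8 and s[0:3].isnumeric() and s[3] == '-' and s[4:].isnumeric()

def find_phone_numbers(text):
    '''Hypothetical helper: return all '(###) ###-####' strings found in `text`.'''
    # Removes punctuation from the end of each string.
    pieces = [s.rstrip('.,?;"\'') for s in text.split()]
    found = []
    # Look at each pair of consecutive "words".
    for i in range(len(pieces) - 1):
        if is_possibly_area_code(pieces[i]) and is_last_7_phone_number(pieces[i + 1]):
            found.append(pieces[i] + ' ' + pieces[i + 1])
    return found

# Shortened stand-in for `contact`.
sample = 'call (800) 867-5309. Or call us at (800) 123-4567; allow one-hundred (100) business days.'
print(find_phone_numbers(sample))
# ['(800) 867-5309', '(800) 123-4567']
```

Note that `'(100)'` is skipped because the "word" after it, `'business'`, does not look like the last 7 digits of a phone number.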

### Is there a better way?

- This was an example of **pattern matching**.
- It can be done with string methods, but there is often a better approach: **regular expressions**.

In [15]:
print(contact)


Thank you for buying our expensive product!

If you have a complaint, please send it to complaints@compuserve.com or call (800) 867-5309.

If you are happy with your purchase, please call us at (800) 123-4567; we'd love to hear from you!

Due to high demand, please allow one-hundred (100) business days for a response.



In [16]:
import re
re.findall(r'\(\d{3}\) \d{3}-\d{4}', contact)

['(800) 867-5309', '(800) 123-4567']

<center><h3>🤯</h3></center>

## Basic regular expressions

### Regular expressions

- A regular expression, or **regex** for short, is a sequence of characters used to **match patterns in strings**.
    - For example, `\(\d{3}\) \d{3}-\d{4}` describes a **pattern** that matches US phone numbers of the form `'(XXX) XXX-XXXX'`.
    - Think of regex as a "mini-language" (formally: they are a grammar for describing a language).

- **Pros**: They are very powerful and are widely used (virtually every programming language has a module for working with them).

- **Cons**: They can be hard to read and have many different "dialects."

### Writing regular expressions

- You will ultimately write most of your regular expressions in Python, using the `re` module. We will see how to do so shortly.

- However, a useful tool for designing regular expressions is [regex101.com](https://regex101.com).

- We will use it heavily during lecture; you should have it open as we work through examples. **If you're trying to revisit this lecture in the future, you'll likely want to watch the podcast.**

### Literals

- A literal is a character that has no special meaning.

- Letters, numbers, and some symbols are all literals.

- Some symbols, like `.`, `*`, `(`, and `)`, are special characters.

- ***Example:*** The regex `hey` matches the string `'hey'`. The regex `he.` also matches the string `'hey'`.

### Regex building blocks 🧱

The four main building blocks for all regexes are shown below ([table source](https://www.cs.princeton.edu/courses/archive/spring17/cos226/lectures/54RegularExpressions.pdf), [inspiration](https://docs.google.com/presentation/d/1xQsqa7e3xDZ9nBiekbSBOecwvQm8pSVGa-FBoV6aJ7E/edit#slide=id.g11197671c7e_0_919)).

| operation | order of op. | example | matches ✅ | does not match ❌ |
|:--- |:---|:---|:---|:---|
| <span style='color:purple'><b>concatenation</b></span> | 3 | `AABAAB` | `'AABAAB'` | every other string |
| <span style='color:purple'><b>or</b></span> | 4 | `AA\|BAAB` | `'AA'`, `'BAAB'` | every other string |
| <span style='color:purple'><b>closure</b><br>(zero or more)</span> | 2 | `AB*A` | `'AA'`, `'ABBBBBBA'` | `'AB'`, `'ABABA'` |
| <span style='color:purple'><b>parentheses</b></span> | 1 | `A(A\|B)AAB` <hr style="height:1px"> `(AB)*A` | `'AAAAB'`, `'ABAAB'`<hr style="height:1px">`'A'`, `'ABABABABA'` | every other string<hr style="height:1px">`'AA'`, `'ABBA'` |

Note that `|`, `(`, `)`, and `*` are **special characters**, not literals. They manipulate the characters around them.

***Example (or, parentheses):*** 
- What does `DSC 30|80` match?
- What does `DSC (30|80)` match?

***Example (closure, parentheses):*** 
- What does `blah*` match?
- What does `(blah)*` match?
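
To check answers to questions like these in Python, `re.fullmatch` is convenient, since it only succeeds when the **entire** string matches the pattern. A small sketch; the helper name `fully_matches` is made up for this example.

```python
import re

def fully_matches(pattern, s):
    '''Hypothetical helper: does `pattern` match all of `s`?'''
    return re.fullmatch(pattern, s) is not None

# "or" comes last in the order of operations, so `DSC 30|80` means (DSC 30) or (80).
print(fully_matches(r'DSC 30|80', 'DSC 30'))    # True
print(fully_matches(r'DSC 30|80', '80'))        # True
print(fully_matches(r'DSC 30|80', 'DSC 80'))    # False

# Parentheses limit the "or" to just the course number.
print(fully_matches(r'DSC (30|80)', 'DSC 80'))  # True

# `blah*` repeats only the 'h'; `(blah)*` repeats the whole word.
print(fully_matches(r'blah*', 'blahhh'))        # True
print(fully_matches(r'blah*', 'blahblah'))      # False
print(fully_matches(r'(blah)*', 'blahblah'))    # True
print(fully_matches(r'(blah)*', ''))            # True
```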

### Exercise

Write a regular expression that matches `'billy'`, `'billlly'`, `'billlllly'`, etc.
- First, think about how to match strings with any even number of `'l'`s, including zero `'l'`s (i.e. `'biy'`).
- Then, think about how to match only strings with a **positive even** number of `'l'`s.

<br><br>

<details>
<summary>
    ✅ Click here to see the answer <b>after</b> you've tried it yourself at <a href='https://regex101.com'>regex101.com</a>.
</summary>
<code>bi(ll)*y</code> will match any even number of <code>'l'</code>s, including 0.
    
To match only a positive even number of <code>'l'</code>s, we'd need to first "fix into place" two <code>'l'</code>s, and then follow that up with zero or more pairs of <code>'l'</code>s. This specifies the regular expression <code>bill(ll)*y</code>.
    </details>

### Exercise

Write a regular expression that matches `'billy'`, `'billlly'`, `'biggy'`, `'biggggy'`, etc.

Specifically, it should match any string with a **positive even** number of `'l'`s in the middle, or a **positive even** number of `'g'`s in the middle.

<br>

<details>
<summary>
    ✅ Click here to see the answer <b>after</b> you've tried it yourself at <a href='https://regex101.com'>regex101.com</a>.
</summary>

Possible answers: <code>bi(ll(ll)\*|gg(gg)\*)y</code> or <code>bill(ll)\*y|bigg(gg)\*y</code>.
 
<br>

Note, <code>bill(ll)\*|gg(gg)\*y</code> is <b>not</b> a valid answer! This is because "concatenation" comes before "or" in the order of operations. This regular expression would match strings that match <code>bill(ll)\*</code>, like <code>'billll'</code>, OR strings that match <code>gg(gg)\*y</code>, like <code>'ggy'</code>.

    
</details>

## Intermediate regex

### More regex syntax

| operation | example | matches ✅ | does not match ❌ |
|:--- |:---|:---|:---|
| <span style='color:purple'><b>wildcard</b></span> | `.U.U.U.` | `'CUMULUS'`<br>`'JUGULUM'` | `'SUCCUBUS'`<br>`'TUMULTUOUS'` |
| <span style='color:purple'><b>character class</b></span>  | `[A-Za-z][a-z]*` | `'word'`<br>`'Capitalized'` | `'camelCase'`<br>`'4illegal'` |
| <span style='color:purple'><b>at least one</b></span> | `bi(ll)+y` | `'billy'`<br>`'billlllly'` | `'biy'`<br>`'bily'` |
| <span style='color:purple'><b>between $i$ and $j$ occurrences</b></span> | `m[aeiou]{1,2}m` | `'mem'`<br>`'maam'`<br>`'miem'` | `'mm'`<br>`'mooom'`<br>`'meme'` |

`.`, `[`, `]`, `+`, `{`, and `}` are also special characters, in addition to `|`, `(`, `)`, and `*`.

***Example (character classes, at least one):*** `[A-E]+` is just shorthand for `(A|B|C|D|E)(A|B|C|D|E)*`.

***Example (wildcard):*** 
- What does `.` match? 
- What does `he.` match? 
- What does `...` match?

***Example (at least one, closure):*** 
- What does `123+` match?
- What does `123*` match?

***Example (number of occurrences):*** What does `tri{3, 5}` match? Does it match `'triiiii'`?

***Example (character classes, number of occurrences):*** What does `[1-6a-f]{3}-[7-9E-S]{2}` match?
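
The examples above can be tried out in Python as well as on regex101. A rough sketch, again using a made-up `fully_matches` helper built on `re.fullmatch`; the test strings are invented for illustration.

```python
import re

def fully_matches(pattern, s):
    '''Hypothetical helper: does `pattern` match all of `s`?'''
    return re.fullmatch(pattern, s) is not None

# `+` and `*` apply only to the character right before them (here, the 3).
print(fully_matches(r'123+', '12333'))  # True
print(fully_matches(r'123+', '12'))     # False: needs at least one 3
print(fully_matches(r'123*', '12'))     # True: zero 3s is fine

# {i,j} must be written without a space.
print(fully_matches(r'tri{3,5}', 'triiiii'))   # True: five i's
# With the space, Python's `re` reads the braces as literal characters.
print(fully_matches(r'tri{3, 5}', 'triiiii'))  # False

# Three characters from 1-6 or a-f, a hyphen, then two from 7-9 or E-S.
print(fully_matches(r'[1-6a-f]{3}-[7-9E-S]{2}', 'a1f-9S'))  # True
```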

### Exercise

Write a regular expression that matches any lowercase string that has a repeated vowel, such as `'noon'`, `'peel'`, `'festoon'`, or `'zeebraa'`.

<br>

<details>
<summary>
    ✅ Click here to see the answer <b>after</b> you've tried it yourself at <a href='https://regex101.com'>regex101.com</a>.
</summary>

One answer: <code>[a-z]\*(aa|ee|ii|oo|uu)[a-z]\*</code>
 
<br>

This regular expression matches strings of lowercase characters that have <code>'aa'</code>, <code>'ee'</code>, <code>'ii'</code>, <code>'oo'</code>, or <code>'uu'</code> in them anywhere. <code>[a-z]\*</code> means "zero or more of any lowercase characters"; essentially we are saying it doesn't matter what letters come before or after the double vowels, as long as the double vowels exist somewhere.

    
</details>

### Exercise

Write a regular expression that matches any string that contains **both** a lowercase letter and a number, in any order. Examples include `'billy80'`, `'80!!billy'`, and `'bil8ly0'`.

<br>

<details>
<summary>
    ✅ Click here to see the answer <b>after</b> you've tried it yourself at <a href='https://regex101.com'>regex101.com</a>.
</summary>

One answer: <code>(.\*[a-z].\*[0-9].\*)|(.\*[0-9].\*[a-z].\*)</code>
 
<br>

We can break the above regex into two parts – everything before the `|`, and everything after the `|`.

The first part, <code>.\*[a-z].\*[0-9].\*</code>, matches strings in which there is at least one lowercase character and at least one digit, with the lowercase character coming first.

The second part, <code>.\*[0-9].\*[a-z].\*</code>, matches strings in which there is at least one lowercase character and at least one digit, with the digit coming first.
    
Note, the <code>.\*</code> between the digit and letter classes is needed in the event the string has non-digit and non-letter characters.
    
<b>This is the kind of task that would be easier to accomplish with regular Python string methods.</b>

    
</details>
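
For comparison, here is roughly what the string-method version of this exercise could look like. This is only a sketch: `has_lower_and_digit` is a hypothetical helper, and it assumes "lowercase letter" means `a`-`z` and "number" means a digit `0`-`9`, as in the regex answer.

```python
import re
import string

def has_lower_and_digit(s):
    '''Hypothetical helper: does `s` contain both a lowercase letter (a-z) and a digit (0-9)?'''
    has_lower = any(c in string.ascii_lowercase for c in s)
    has_digit = any(c in string.digits for c in s)
    return has_lower and has_digit

# The regex answer from the exercise, for comparison.
regex = r'(.*[a-z].*[0-9].*)|(.*[0-9].*[a-z].*)'

for s in ['billy80', '80!!billy', 'bil8ly0', 'billy', '80']:
    # The two approaches should agree on each of these strings.
    print(s, has_lower_and_digit(s), re.fullmatch(regex, s) is not None)
```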

### Even more regex syntax

| operation | example | matches ✅ | does not match ❌ |
|:--- |:---|:---|:---|
| <span style='color:purple'><b>escape character</b></span> | `ucsd\.edu` | `'ucsd.edu'` | `'ucsd!edu'` |
| <span style='color:purple'><b>beginning of line</b></span> | `^ark` | `'ark two'`<br>`'ark o ark'` | `'dark'` |
| <span style='color:purple'><b>end of line</b></span>  | `ark$` | `'dark'`<br>`'ark o ark'` | `'ark two'` |
| <span style='color:purple'><b>zero or one</b></span> | `cat?` | `'ca'`<br>`'cat'` | `'cart'` (matches `'ca'` only) |
| <span style='color:purple'><b>built-in character classes*</b></span> | `\w+` <br> `\d+` | `'billy'`<br>`'231231'` | `'this person'`<br>`'858 people'` |
| <span style='color:purple'><b>character class negation</b></span> | `[^a-z]+` | `'KINGTRITON551'`<br>`'1721$$'` | `'porch'`<br>`'billy.edu'` |

****Note: in Python's implementation of regex,*** 
- `\d` refers to digits.
- `\w` refers to alphanumeric characters and underscores (`[A-Za-z0-9_]`).
- `\s` refers to whitespace.
- `\b` is a word boundary.

***Example (escaping):*** 
- What does `he.` match? 
- What does `he\.` match? 
- What does `(858)` match? 
- What does `\(858\)` match?

***Example (anchors):*** 
- What does `858-534` match?
- What does `^858-534` match?
- What does `858-534$` match?

### Example (built-in character classes)

****Note: in Python's implementation of regex,*** 
- `\d` refers to digits.
- `\w` refers to alphanumeric characters and underscores (`[A-Za-z0-9_]`).
- `\s` refers to whitespace.
- `\b` is a word boundary.


- What does `\d{3} \d{3}-\d{4}` match?
- What does `\bcat\b` match? Does it find a match in `'my cat is hungry'`? What about `'concatenate'`?
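
A quick way to explore these two questions in Python, as a sketch. The phone-number-like strings are invented for this example.

```python
import re

# \b only matches at the edge of a word, so the 'cat' inside 'concatenate' is skipped.
print(re.search(r'\bcat\b', 'my cat is hungry') is not None)  # True
print(re.search(r'\bcat\b', 'concatenate') is not None)       # False

# \d{3} \d{3}-\d{4} needs 3 digits, a space, 3 digits, a hyphen, and 4 digits.
found = re.findall(r'\d{3} \d{3}-\d{4}', 'call 858 534-2230 or 858-534-2230')
print(found)  # ['858 534-2230']
```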

### Exercise

Write a regular expression that matches any string that:
- is between 5 and 10 characters long, and
- is made up of only vowels (either uppercase or lowercase, including `'Y'` and `'y'`), periods, and spaces.

Examples include `'yoo.ee.IOU'` and `'AI.I oey'`.

<br>

<details>
<summary>
    ✅ Click here to see the answer <b>after</b> you've tried it yourself at <a href='https://regex101.com'>regex101.com</a>.
</summary>

One answer: <code>^[aeiouyAEIOUY. ]{5,10}$</code>
 
<br>

<b>Key idea:</b> Within a character class (i.e. <code>[...]</code>), special characters do not generally need to be escaped.


    
</details>

## Regex in Python

### `re` in Python

The `re` module is built into Python. It allows us to use regular expressions to find, extract, and replace strings.

In [17]:
import re

`re.search` takes in a string `regex` and a string `text` and returns a `Match` object describing the location and substring of the **first** match of `regex` in `text` (or `None` if there is no match).

In [18]:
re.search('AB*A', 
          'here is a string for you: ABBBA. here is another: ABBBBBBBA')

<re.Match object; span=(26, 31), match='ABBBA'>
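
The pieces shown in that output can also be pulled out of the `Match` object directly. A short sketch using the standard `group`, `span`, `start`, and `end` methods; the pattern `'XYZ'` is just an example of something that does not appear in the text.

```python
import re

text = 'here is a string for you: ABBBA. here is another: ABBBBBBBA'
m = re.search('AB*A', text)

# A Match object knows what was matched and where.
print(m.group())                # 'ABBBA'
print(m.span())                 # (26, 31)
print(text[m.start():m.end()])  # 'ABBBA'

# When nothing matches, re.search returns None, so check before using the result.
no_match = re.search('XYZ', text)
print(no_match is None)         # True
```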

`re.findall` takes in a string `regex` and a string `text` and returns a list of all matches of `regex` in `text`. You'll use this most often.

In [19]:
re.findall('AB*A', 
           'here is a string for you: ABBBA. here is another: ABBBBBBBA')

['ABBBA', 'ABBBBBBBA']

`re.sub` takes in a string `regex`, a string `repl`, and a string `text`, and replaces all matches of `regex` in `text` with `repl`.

In [20]:
re.sub('AB*A', 
       'billy', 
       'here is a string for you: ABBBA. here is another: ABBBBBBBA')

'here is a string for you: billy. here is another: billy'

### Raw strings

When using regular expressions in Python, it's a good idea to use **raw strings**, denoted by an `r` before the quotes, e.g. `r'exp'`.

In [23]:
re.findall('\bcat\b', 'my cat is hungry')

[]

In [22]:
re.findall(r'\bcat\b', 'my cat is hungry')

['cat']

In [24]:
# Huh?
print('\bcat\b')

cat
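
The reason, sketched in code: in a regular string, Python itself interprets `'\b'` as a single backspace character before `re` ever sees it, while a raw string passes the backslash through untouched.

```python
import re

# In a regular string, '\b' is one character: a backspace ('\x08').
print(len('\b'), '\b' == '\x08')  # 1 True

# In a raw string, the backslash is kept, so `re` receives the two characters \ and b.
print(len(r'\b'))                 # 2

# Only the raw-string version is interpreted by `re` as a word boundary.
print(re.findall('\bcat\b', 'my cat is hungry'))   # []
print(re.findall(r'\bcat\b', 'my cat is hungry'))  # ['cat']
```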


### Capture groups
- Surround a regex with `(` and `)` to define a **capture group** within a pattern.
- Capture groups are useful for extracting relevant parts of a string.

In [26]:
re.findall(r'\w+@(\w+)\.edu', 
           'my old email was billy@notucsd.edu, my new email is notbilly@ucsd.edu')

['notucsd', 'ucsd']

- Notice what happens if we remove the `(` and `)`!

In [None]:
re.findall(r'...', 
           'my old email was billy@notucsd.edu, my new email is notbilly@ucsd.edu')
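
One possible way to fill in the cell above, as a sketch: the same pattern with and without the parentheses, with the `.` before `edu` escaped so that it matches a literal period.

```python
import re

text = 'my old email was billy@notucsd.edu, my new email is notbilly@ucsd.edu'

# No capture group: findall returns each full match.
full = re.findall(r'\w+@\w+\.edu', text)
print(full)    # ['billy@notucsd.edu', 'notbilly@ucsd.edu']

# One capture group: findall returns only the captured part.
domain = re.findall(r'\w+@(\w+)\.edu', text)
print(domain)  # ['notucsd', 'ucsd']
```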

- Earlier, we also saw that parentheses can be used to group parts of a regex together. When using `re.findall`, all parenthesized groups are treated as capturing groups, even ones we only added for grouping.

In [27]:
# A regex that matches strings with two of the same vowel followed by 3 digits
# We only want to capture the digits, but...
re.findall(r'(aa|ee|ii|oo|uu)(\d{3})', 'eeoo124')

[('oo', '124')]
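
If we only want the digits, Python's `re` also supports **non-capturing** groups, written `(?:...)`. They group parts of a pattern together without adding anything to the output of `re.findall`. A minimal sketch on the same string:

```python
import re

# (?:...) groups the vowel pairs without capturing them,
# so findall returns only the digits.
digits_only = re.findall(r'(?:aa|ee|ii|oo|uu)(\d{3})', 'eeoo124')
print(digits_only)  # ['124']
```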

## Example: Log parsing

Web servers typically record every request they receive in log files ("logs").

In [33]:
s = '''132.249.20.188 - - [ab/Feb/2023:12:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'''

Let's use our new regex syntax (including capturing groups) to extract the day, month, year, and time from the log string `s`.

In [34]:
exp = r"(.+)/(.+)/(.+):(.+:.+:.+)"
re.findall(exp, s)

[('132.249.20.188 - - [ab',
  'Feb',
  '2023',
  '12:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585')]

While the above regex works, it is not very **specific**. It also _works_ on incorrectly formatted log strings.

In [None]:
other_s = '[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'
re.findall(exp, other_s)

### The more specific, the better!    

- Be as specific in your pattern matching as possible – you don't want to match and extract strings that don't fit the pattern you care about.
    - `.*` matches every possible string, but we don't use it very often.


- A better date extraction regex:
```
\[(\d{2})\/([A-Z]{1}[a-z]{2})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]
```
    - `\d{2}` matches any 2-digit number.
    - `[A-Z]{1}` matches any single occurrence of any uppercase letter.
    - `[a-z]{2}` matches any 2 consecutive occurrences of lowercase letters.
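    - Remember, special characters like `[` and `]` need to be escaped with `\`. (`/` is not a special character in Python's `re`, so escaping it is optional there, though some tools such as regex101 may require it.)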

In [None]:
s

In [None]:
new_exp = ...
re.findall(new_exp, s)

A benefit of `new_exp` over `exp` is that it doesn't capture anything when the string doesn't follow the format we specified.

In [None]:
other_s

In [None]:
re.findall(new_exp, other_s)
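
One possible way to fill in `new_exp`, as a sketch, is to use the stricter pattern from above as a raw string. The log line `good_log` below is a hypothetical, correctly formatted example (with a two-digit day), not the string `s` from earlier.

```python
import re

# The stricter pattern from above, written as a raw string.
new_exp = r'\[(\d{2})\/([A-Z]{1}[a-z]{2})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]'

# Hypothetical, correctly formatted log line.
good_log = '132.249.20.188 - - [05/Feb/2023:12:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'
other_s = '[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'

print(re.findall(new_exp, good_log))  # [('05', 'Feb', '2023', '12', '26', '15')]
print(re.findall(new_exp, other_s))   # []
```

Since this pattern has six capture groups, the time comes back as separate hour, minute, and second strings.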

In [38]:
s = '<p>Hello <p>world</p></p></p>'
exp = r"<p>[^<]+</p>"
re.findall(exp, s)

['<p>world</p>']

## Limitations

### Limitations of regexes

Writing a regular expression is like writing a program.
* You need to know the syntax well.
* They can be easier to write than to read.
* They can be difficult to debug.

Regular expressions are terrible at certain types of problems. Examples:
* Anything involving counting (same number of instances of a and b).
* Anything involving complex structure (palindromes).
* Parsing highly complex text structure ([HTML](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags), for instance).

## Text features

<center><img src='imgs/ds-lifecycle.svg' width=60%></center>

### Review: Regression and features

- In DSC 40A, our running example was to use **regression** to predict a data scientist's salary, given their GPA, years of experience, and years of education.

- After minimizing empirical risk to determine optimal parameters, $w_0^*, \dots, w_3^*$, we made predictions using:

$$\text{predicted salary} = w_0^* + w_1^* \cdot \text{GPA} + w_2^* \cdot \text{experience} + w_3^* \cdot \text{education}$$

- GPA, years of experience, and years of education are **features** – they represent a data scientist as a vector of _numbers_.
    - e.g. Your feature vector may be [3.5, 1, 7].

- **This approach requires features to be numerical.**

### Moving forward

Suppose we'd like to predict the **sentiment** of a piece of text from 1 to 10.
- 10: Very positive (happy).
- 1: Very negative (sad, angry).

Example:
- Input: "DSC 80 is a pretty good class."
- Output: 7.

- We can frame this as a regression problem, but we can't directly use what we learned in 40A, because here our inputs are **text**, not **numbers**.

### Text features

- **Big question: How do we represent a text document as a feature vector of numbers?**

- If we can do this, we can:
    - use a text document as input in a regression or classification model (in a few lectures).
    - **quantify** the similarity of two text documents (today).

### Example: San Diego employee salaries

- [Transparent California](https://transparentcalifornia.com/salaries/san-diego/) publishes the salaries of all City of San Diego employees.
- Let's look at the 2021 data.

In [39]:
salaries = pd.read_csv('https://transcal.s3.amazonaws.com/public/export/san-diego-2021.csv')
salaries['Employee Name'] = salaries['Employee Name'].str.split().str[0] + ' Xxxx'

In [40]:
salaries.head()

Unnamed: 0,Employee Name,Job Title,Base Pay,Overtime Pay,...,Year,Notes,Agency,Status
0,Mara Xxxx,City Attorney,218759.0,0.0,...,2021,,San Diego,FT
1,Todd Xxxx,Mayor,218759.0,0.0,...,2021,,San Diego,FT
2,Elizabeth Xxxx,Investment Officer,259732.0,0.0,...,2021,,San Diego,FT
3,Terence Xxxx,Police Officer,212837.0,0.0,...,2021,,San Diego,FT
4,Andrea Xxxx,Independent Budget Analyst,224312.0,0.0,...,2021,,San Diego,FT


### Aside on privacy and ethics

- Even though the data we downloaded is publicly available, employee names still correspond to real people.

- Be careful when dealing with PII (personally identifiable information).
    - Only work with the data that is needed for your analysis.
    - Even when data is public, people have a reasonable right to privacy.

- Remember to think about the impacts of your work **outside** of your Jupyter Notebook.

### Goal: Quantifying similarity

- Our goal is to describe, numerically, how **similar** two job titles are.

- For instance, our similarity metric should tell us that `'Deputy Fire Chief'` and `'Fire Battalion Chief'` are more similar than `'Deputy Fire Chief'` and `'City Attorney'`.

- **Idea:** Two job titles are similar if they contain shared words, regardless of order. So, to measure the similarity between two job titles, let's **count the number of words they share in common**.

- Before we do this, we need to be confident that the job titles are clean and consistent – let's explore.

### Exploring job titles

In [41]:
jobtitles = salaries['Job Title']
jobtitles.head()

0                 City Attorney
1                         Mayor
2            Investment Officer
3                Police Officer
4    Independent Budget Analyst
Name: Job Title, dtype: object

In [44]:
jobtitles.nunique()

588

In [45]:
jobtitles.value_counts()

Police Officer                2123
Fire Fighter Ii                331
Assistant Engineer - Civil     284
                              ... 
Senior Marine Biologist          1
Roofing Supervisor               1
Test Monitor I                   1
Name: Job Title, Length: 588, dtype: int64

In [48]:
(jobtitles
 .value_counts()
 .iloc[:25]
 .sort_values()
 .plot(kind='barh')
)

In [51]:
jobtitles = jobtitles[jobtitles.notna()]

### Canonicalization

Remember, our goal is ultimately to count the number of shared words between job titles. But before we start counting the number of shared words, we need to consider the following:

- Some job titles may have **punctuation**, like `'-'` and `'&'`, which may count as words when they shouldn't.
    - `'Assistant - Manager'` and `'Assistant Manager'` should count as the same job title.

- Some job titles may have **"glue" words**, like `'to'` and `'the'`, which (we can argue) also shouldn't count as words.
    - `'Assistant To The Manager'` and `'Assistant Manager'` should count as the same job title.

Let's address the above issues. The process of converting job titles so that they are always represented the same way is called **canonicalization**.

### Punctuation

Are there job titles with unnecessary punctuation that we can remove? 

- To find out, we can write a regular expression that looks for characters other than letters, numbers, and spaces.

- We can use regular expressions with the `.str` methods we learned earlier in the quarter just by using `regex=True`.

In [56]:
jobtitles.str.contains(r'[^A-Za-z0-9 ]', regex=True)

0        False
1        False
2        False
         ...  
12302    False
12303    False
12304    False
Name: Job Title, Length: 12303, dtype: bool

In [57]:
jobtitles[jobtitles.str.contains(r'[^A-Za-z0-9 ]', regex=True)]

281                          Park & Recreation Director
539                     Associate Engineer - Mechanical
1023                         Associate Engineer - Civil
                              ...                      
12243                 Workers' Compensation Claims Aide
12263    Workers' Compensation Claims Representative Ii
12279                   Management Intern-Mayor/Council
Name: Job Title, Length: 845, dtype: object

It seems like we should replace these pieces of punctuation with a single space.

### "Glue" words

Are there job titles with "glue" words in the middle, such as `'Assistant to the Manager'`?

To figure out if any titles contain the word `'to'`, we **can't** just do the following, because it will evaluate to `True` for job titles that have `'to'` anywhere in them, even if not as a standalone word.

In [59]:
jobtitles.str.lower().str.contains('to')

0         True
1        False
2        False
         ...  
12302    False
12303    False
12304    False
Name: Job Title, Length: 12303, dtype: bool

In [60]:
jobtitles

0                  City Attorney
1                          Mayor
2             Investment Officer
                  ...           
12302               Fire Captain
12303    Fleet Repair Supervisor
12304              Fire Engineer
Name: Job Title, Length: 12303, dtype: object

Instead, we need to look for `'to'` separated by word boundaries.

In [61]:
jobtitles.str.lower().str.contains(r'\bto\b', regex=True)

0        False
1        False
2        False
         ...  
12302    False
12303    False
12304    False
Name: Job Title, Length: 12303, dtype: bool

In [62]:
jobtitles[jobtitles.str.lower().str.contains(r'\bto\b', regex=True)]

664               Assistant To The Fire Chief
1403     Principal Assistant To City Attorney
2358                Assistant To The Director
                         ...                 
7544          Confidential Secretary To Mayor
9627     Principal Assistant To City Attorney
12061               Assistant To The Director
Name: Job Title, Length: 11, dtype: object

We can look for other filler words too, like `'the'` and `'for'`.

In [63]:
# (?:...) groups the words without capturing them, which avoids pandas' "match groups" warning.
jobtitles[jobtitles.str.lower().str.contains(r'\b(?:to|the|for)\b', regex=True)]


664               Assistant To The Fire Chief
1403     Principal Assistant To City Attorney
2358                Assistant To The Director
                         ...                 
9627     Principal Assistant To City Attorney
11010        Assistant For Community Outreach
12061               Assistant To The Director
Name: Job Title, Length: 14, dtype: object

We should probably remove these "glue" words.

### Fixing punctuation and removing "glue" words

Let's put these steps together, and canonicalize job titles by:
- converting to lowercase,
- removing each occurrence of `'to'`, `'the'`, and `'for'`,
- replacing each non-letter/digit/space character with a space, and
- replacing each sequence of multiple spaces with a single space.

In [71]:
jobtitles = (jobtitles
 .str.lower()
 .str.replace(r'\b(to|the|for)\b', '', regex=True)
 .str.replace(r'[^A-Za-z0-9 ]', ' ', regex=True)
 .str.replace(r'\s+', ' ', regex=True)
 .str.strip()
)

In [75]:
jobtitles.sample(10)

3261                   police officer
7978                 fleet technician
5825                assistant chemist
                    ...              
4369                   police officer
9956    grounds maintenance worker ii
5378                   police officer
Name: Job Title, Length: 10, dtype: object
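
To see what each step does to a single title, here is a plain-Python sketch of the same pipeline using `re.sub`. The function name `canonicalize` is hypothetical; the three example titles are taken from the outputs shown earlier.

```python
import re

def canonicalize(title):
    '''Hypothetical helper mirroring the chained .str calls above, for one string.'''
    title = title.lower()
    title = re.sub(r'\b(to|the|for)\b', '', title)  # remove "glue" words
    title = re.sub(r'[^A-Za-z0-9 ]', ' ', title)     # punctuation -> space
    title = re.sub(r'\s+', ' ', title)               # collapse runs of whitespace
    return title.strip()                             # drop leading/trailing spaces

for t in ['Assistant To The Fire Chief',
          'Park & Recreation Director',
          'Management Intern-Mayor/Council']:
    print(canonicalize(t))
# assistant fire chief
# park recreation director
# management intern mayor council
```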

### Possible issue: inconsistent representations

Another possible issue is that some job titles may have inconsistent representations of the same word (e.g. `'Asst.'` vs `'Assistant'`).

The 2020 salaries dataset had several of these issues, but fortunately they appear to be fixed for us in the 2021 dataset (thanks, Transparent California).

## Bag of words 💰

### Text similarity

Recall, our idea is to measure the similarity of two job titles by counting the number of shared words between the job titles. How do we actually do that, for all of the job titles we have?

### A counts matrix

Let's create a "counts" matrix, such that:
- there is 1 row per job title,
- there is 1 column per **unique** word that is used in job titles, and
- the value in row `title` and column `word` is the number of occurrences of `word` in `title`.

Such a matrix might look like:

| | senior | lecturer | teaching | professor | assistant | associate |
| --- | --- | --- | --- | --- | --- | --- |
| **senior lecturer** | 1 | 1 | 0 | 0 | 0 | 0 |
| **assistant teaching professor** | 0 | 0 | 1 | 1 | 1 | 0 | 
| **associate professor** | 0 | 0 | 0 | 1 | 0 | 1 |
| **senior assistant to the assistant professor** | 1 | 0 | 0 | 1 | 2 | 0 |

- Then, we can make statements like:
    - "assistant teaching professor" is more similar to "associate professor" than to "senior lecturer".
- Next time!
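
As a preview, the example matrix above could be built in plain Python roughly as follows. This is only a sketch: the fixed `vocab` list copies the columns of the example table, and `shared_words` is one assumed way to count the words two titles have in common, not necessarily the approach used next lecture.

```python
from collections import Counter

titles = [
    'senior lecturer',
    'assistant teaching professor',
    'associate professor',
    'senior assistant to the assistant professor',
]
# Columns of the example table above ('to' and 'the' are left out, as in the table).
vocab = ['senior', 'lecturer', 'teaching', 'professor', 'assistant', 'associate']

def count_row(title):
    '''Number of times each vocabulary word appears in `title`.'''
    counts = Counter(title.split())
    return [counts[word] for word in vocab]

counts_matrix = {title: count_row(title) for title in titles}
for title, row in counts_matrix.items():
    print(row, title)

def shared_words(a, b):
    '''Assumed similarity: how many vocabulary words appear in both titles.'''
    return sum(1 for x, y in zip(counts_matrix[a], counts_matrix[b]) if x > 0 and y > 0)

print(shared_words('assistant teaching professor', 'associate professor'))  # 1
print(shared_words('assistant teaching professor', 'senior lecturer'))      # 0
```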