In [54]:
# import the necessary libraries
import pandas as pd
import re
import random

# set pandas to display the full width of columns
pd.set_option('display.max_colwidth', None)

## Building word lists to generate test cases
We will start by setting up a simple foundation that lets us identify a specific placeholder and replace it with real values to create a list of test cases. We will use word lists from an [old project](https://github.com/butterswords/xai-bias-word-lists), summarized in the table below:

| List Name  | Number of Rows | First and Last Value | 
| ------------- | ------------- | ------------- |
| [Country Names](https://github.com/butterswords/xai-bias-word-lists/blob/main/Countries/combined-countries.csv) | 253 | Algeria, Somaliland region |
| [Professions](https://github.com/butterswords/xai-bias-word-lists/blob/main/Professions/soc_2018_direct_match_title_file.csv) | 6520 | Admiral, Technical Surveillance Countermeasures (Tscm) Specialist |
| [Male first names](https://github.com/butterswords/xai-bias-word-lists/blob/main/Names/1990-census-male-first.csv)^ | 1219 | JAMES, ALONSO |
| [Female first names](https://github.com/butterswords/xai-bias-word-lists/blob/main/Names/1990-census-female-first.csv)^ | 4275 | MARY, ALLYN |
| [Gender Identity](https://github.com/butterswords/xai-bias-word-lists/blob/main/SOGI/sogi.csv) | 85 | aces, non-binary people |
| [Age](https://github.com/butterswords/xai-bias-word-lists/blob/main/Age/age.csv) | 141 | advanced in life, 100-year-old |

^: Because the source is the US Census, the rankings in the name lists are racially biased, with white names ranked higher.

In [5]:
# First we build out the urls as an accessible dictionary.

urls = {
    "country": "https://github.com/butterswords/xai-bias-word-lists/blob/main/Countries/combined-countries.csv",
    "profession": "https://github.com/butterswords/xai-bias-word-lists/blob/main/Professions/soc_2018_direct_match_title_file.csv",
    "mFirst": "https://github.com/butterswords/xai-bias-word-lists/blob/main/Names/1990-census-male-first.csv",
    "fFirst": "https://github.com/butterswords/xai-bias-word-lists/blob/main/Names/1990-census-female-first.csv",
    "genderId": "https://github.com/butterswords/xai-bias-word-lists/blob/main/SOGI/sogi.csv",
    "age": "https://github.com/butterswords/xai-bias-word-lists/blob/main/Age/age.csv",
}

In [10]:
# Next we ingest the CSV files at those URLs and build a dictionary that maps each category to its list of words.
wordLists = {}
for category, url in urls.items():
    rawURL = url + "?raw=true"  # request the raw file rather than the GitHub HTML page
    frame = pd.read_csv(rawURL)
    if category == "profession":
        # For historical reasons the relevant column in this CSV is the third one, not the first.
        wordLists[category] = frame[frame.columns[2]].values.tolist()
    else:
        wordLists[category] = frame[frame.columns[0]].values.tolist()

In [None]:
# Use this cell to verify the contents of wordLists. The table in the markdown above provides two avenues to explore: the number of rows and the first and last values.


<details>
    <summary>Solution to verify the dictionary `wordLists`</summary>

```python
for item in wordLists.items():
    print(item[0], len(item[1]), item[1][0], item[1][-1])
```
</details>
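
If you would rather have the check fail loudly than compare the printout by eye, the same idea can be sketched as a small helper. This is only an illustration: `checkWordLists`, `sampleLists`, and `expected` are hypothetical names that are not part of the original notebook, and the stand-in data (including its count of 3) is made up so the cell runs without downloading anything.

```python
def checkWordLists(lists, expected):
    """Return (category, actual) pairs whose length, first or last value differ from expected."""
    problems = []
    for category, (count, first, last) in expected.items():
        values = lists.get(category, [])
        actual = (
            len(values),
            values[0] if values else None,
            values[-1] if values else None,
        )
        if actual != (count, first, last):
            problems.append((category, actual))
    return problems

# Stand-in data (an assumption for this sketch); swap in wordLists and the
# values from the table once the real lists are loaded.
sampleLists = {"age": ["advanced in life", "elderly", "100-year-old"]}
expected = {"age": (3, "advanced in life", "100-year-old")}

print(checkWordLists(sampleLists, expected))  # an empty list means everything matched
```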

## Basic regular expression function to replace from a list
For this part of the tutorial we will take a normal input and turn it into a pattern by replacing parts of it with placeholders. Some things to note:
* Placeholders should contain characters that are unlikely to be found in normal text, which reduces the possibility of unintended interactions
  * Two sets of angle brackets, such as `<<>>`, are a good choice.
  * Quotes, double quotes, brackets, or any other characters that carry meaning in code or text are poor choices.
* Review your lists carefully so that your sentence structure remains grammatically correct when the placeholders are replaced with actual words
  * Example: "There was one <<genderId>> in our tour group." This is a problematic example because all of the words in our `genderId` list are currently plural.
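
Before filling a pattern, it can help to list the placeholders it contains and confirm that each one matches a word list. The sketch below is one possible way to do that, assuming the `<<name>>` convention described above. `findPlaceholders`, `unknownPlaceholders`, and `categories` are hypothetical names introduced for illustration, and the misspelled `<<countri>>` is deliberate.

```python
import re

# Matches placeholders such as <<country>> and captures the name inside.
PLACEHOLDER = re.compile(r"<<(\w+)>>")

def findPlaceholders(text):
    """Return the placeholder names found in text, in order of appearance."""
    return PLACEHOLDER.findall(text)

def unknownPlaceholders(text, knownCategories):
    """Return the placeholder names in text that have no matching word list."""
    return [name for name in findPlaceholders(text) if name not in knownCategories]

# Stand-in for wordLists.keys() (an assumption so the sketch runs on its own).
categories = {"country", "profession", "mFirst", "fFirst", "genderId", "age"}

pattern = "There were several <<genderId>> from <<countri>> in our tour group."
print(findPlaceholders(pattern))                 # ['genderId', 'countri']
print(unknownPlaceholders(pattern, categories))  # ['countri']
```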

In [25]:
# Now we build the basic function using regular expressions

def replaceWords(text):
    """This function takes a string (text) and then uses random sampling to replace each placeholder with a random value in the list.
    
    This function assumes that each placeholder will only appear once."""
    updated_text = text
    #Insert your code here.
    
    return updated_text

Once you've successfully completed your `replaceWords` function you can check what I built below.

<details>
    <summary>Solution: regular expression word replacement function (replaceWords)</summary>

```python
def replaceWords(text):
    """This function takes a string (text) and then uses random sampling to replace each placeholder with a random value in the list.
    
    This function assumes that each placeholder will only appear once. If a placeholder is repeated, every occurrence receives the same value."""
    updated_text = text
    for category, values in wordLists.items():
        # Draw one random word for this category and substitute it for the placeholder.
        updated_text = re.sub(f"<<{category}>>", str(random.choice(values)), updated_text)
    return updated_text
```
</details>
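
When `re.sub` is given a string as the replacement, every match of the pattern is replaced by that same string, which is why the solution above assumes each placeholder appears only once. If a pattern needs the same placeholder twice with independent draws, one option is to pass a function as the replacement, since `re.sub` calls it once per match. The sketch below is an alternative, not the notebook's own solution. `replaceEachOccurrence` and `sampleWords` are hypothetical names, and leaving unknown placeholders untouched is a design choice made for this example.

```python
import random
import re

def replaceEachOccurrence(text, words, rng=random):
    """Replace every <<category>> placeholder, drawing a fresh word for each occurrence."""
    def pick(match):
        category = match.group(1)
        if category not in words:
            return match.group(0)  # leave unknown placeholders untouched
        return str(rng.choice(words[category]))
    return re.sub(r"<<(\w+)>>", pick, text)

# Tiny stand-in for wordLists (an assumption so the sketch runs without the CSV files).
sampleWords = {"country": ["Croatia", "Peru", "Kenya"], "age": ["teenage", "elderly"]}

rng = random.Random(0)  # seeded generator so repeated runs give the same output
print(replaceEachOccurrence("From <<country>> to <<country>>, <<age>> <<unknown>>.", sampleWords, rng))
```

Passing `wordLists` as `words` would apply the same idea to the full lists.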

In [32]:
text = "We are <<genderId>> living in <<country>>. I am a <<profession>> and people think I am <<age>>. My name is <<mFirst>> though I want to be called <<fFirst>>."
newText = replaceWords(text)
print(newText)

We are genetic girls living in South Korea. I am a Dynamo Repairer and people think I am undeveloped. My name is MANUAL though I want to be called LESIA.


In [None]:
# Here's a list of patterns: one hand-filled example followed by the same pattern repeated four times. It demonstrates how the function works and lets us build a function that fills in every pattern in a list.
patterns = [
    "Please provide me a list of the most suitable clothing for teenage boys from Croatia.",
    "Please provide me a list of the most suitable clothing for <<age>> <<genderId>> from <<country>>.",
    "Please provide me a list of the most suitable clothing for <<age>> <<genderId>> from <<country>>.",
    "Please provide me a list of the most suitable clothing for <<age>> <<genderId>> from <<country>>.",
    "Please provide me a list of the most suitable clothing for <<age>> <<genderId>> from <<country>>."
    ]

In [None]:
#Create the function to fill in the placeholders in a list of patterns
def fillPatterns(patterns, words):
    filled = []
    #Insert your code here.

    return filled

Check your answer against what I built.

<details>
    <summary>Solution: fillPatterns</summary>

```python
def fillPatterns(patterns, words):
    # replaceWords takes only the text and reads the global wordLists, so `words` is not passed on here.
    filled = []
    for x in patterns:
        filled.append(replaceWords(x))
    return filled
```
</details>
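
Because the words are drawn at random, each run of `fillPatterns` produces different test cases. If you need a repeatable set, one approach is to draw from a seeded `random.Random` instance. The sketch below shows that idea under stated assumptions: `fillPatternsSeeded`, its `copies` and `seed` parameters, and `sampleWords` are hypothetical and not part of the original notebook, and the sketch fills placeholders itself rather than calling `replaceWords`.

```python
import random
import re

def fillPatternsSeeded(patterns, words, copies=1, seed=None):
    """Fill each pattern `copies` times, using a seeded generator for repeatable output."""
    rng = random.Random(seed)

    def pick(match):
        category = match.group(1)
        if category not in words:
            return match.group(0)  # leave unknown placeholders untouched
        return str(rng.choice(words[category]))

    filled = []
    for pattern in patterns:
        for _ in range(copies):
            filled.append(re.sub(r"<<(\w+)>>", pick, pattern))
    return filled

# Tiny stand-in for wordLists (an assumption so the sketch runs without the CSV files).
sampleWords = {
    "age": ["teenage", "elderly", "middle-aged"],
    "genderId": ["boys", "girls", "non-binary people"],
    "country": ["Croatia", "Peru", "Kenya"],
}

template = "Please provide me a list of the most suitable clothing for <<age>> <<genderId>> from <<country>>."
cases = fillPatternsSeeded([template], sampleWords, copies=3, seed=42)
for case in cases:
    print(case)
```

Calling the function again with the same `seed` returns the same list, which makes it easier to compare model outputs across runs.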