# Baby Names - Girl Names DB Creation

This notebook aims to create the girls/female names dataset for the project TheName 👶🏽✨, using Gemini.

## Setup

### Installation

The necessary dependencies are listed in the `requirements.txt` file (in the parent folder of this notebook). Install them before running the notebook; using a virtual environment to manage dependencies is recommended.

### Imports

In [None]:
import sys, os
import pandas as pd
import json

# Add the root directory to the Python path
sys.path.append(os.path.abspath(os.path.join('..')))

from utils import * # Helper Functions



### Environment variables

**⚠️ Important Note ⚠️:** To run this notebook, you will need to place environment variables such as the project ID and region in the `utils.py` file. This file is located in the parent folder of this notebook.

### Helper functions

Helper functions can be found in the `utils.py` file. A summary of all functions used in this notebook can be found here:

- `generate_names_info_async` - Async function that lets us process Gemini calls in parallel. This is needed for the names' synthetic data generation: we'll use input lists of 16 names each to avoid hitting the output context window limit. Each of these calls produces an initial JSON with exhaustive info on each name. We will be using Gemini-Flash-002 (in our experiments it gave better results than 001, especially in returning valid JSON data).

- `process_all_lists` - Goes through each of the sublists in the full list of girl names and processes each one using `generate_names_info_async`.

- `process_remaining_girl_names` - As I have a quota of 100 requests to Gemini per minute and girl names need 400+ lists, I needed to run `generate_names_info_async` several (4-5) times. To avoid processing the same list twice, I defined this function, which outputs the lists that have NOT been generated yet. It uses the `get_created_names` and `filter_remaining_girl_names` helper functions.

- `clean_json_files` - Takes each JSON in a folder (with this folder containing the output JSONs of `process_all_lists`) and cleans them.

- `merge_json_files_by_letter` - Takes all JSONs in a directory starting with the same letter (e.g. "G", from girl) and merges them into one master JSON (`G_master_initial.json`; the "_initial" suffix signals that the file will be further enhanced and cleaned).

- `add_attributes_field` - Enriches each name from a master JSON with attributes based on certain keywords defined (e.g. if a name's meaning, family_meaning and/or other_info contain the word "God", the name will be given the attribute "religious").

- `remove_duplicates_from_json` - Removes duplicates in the `likely liked`, `famous` and other elements in the master JSON file. Uses the `remove_duplicates` function to recursively remove duplicates while preserving order.

- `load_json_and_find_names` - Goes through the master JSON and detects names that are not of the desired gender, based on keywords. E.g. in the girls case, it will look for the words `boy`, `man` and `masculine`, and if the name's meaning or other info contains any of these, it will put the name into a list. Later we'll check, with another call to Gemini, which of these names are girl or gender-neutral names, so that those can be kept.

- `upload_to_gcs` - Uploads a local file to Google Cloud Storage, allowing us to rename the file in the process.
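
The parallel pattern described for `generate_names_info_async` and `process_all_lists` can be sketched roughly as follows. This is not the code from `utils.py`: `fake_generate_names_info`, `process_lists_concurrently` and the `max_concurrent` parameter are hypothetical names, and the Gemini call is replaced by a short sleep so the example runs offline.

```python
import asyncio


async def fake_generate_names_info(names, list_index):
    """Hypothetical stand-in for a Gemini call: returns a placeholder record per name."""
    await asyncio.sleep(0.01)  # simulates network latency
    return {"list": list_index, "names": {name: {"meaning": None} for name in names}}


async def process_lists_concurrently(lists_of_names, max_concurrent=5):
    """Run one task per sublist, with at most `max_concurrent` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(index, names):
        async with semaphore:
            return await fake_generate_names_info(names, index)

    tasks = [worker(i, names) for i, names in enumerate(lists_of_names, start=1)]
    # gather returns results in input order, even if tasks finish out of order
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    demo_lists = [["Aadhya", "Aadya"], ["Abbie", "Abby"], ["Abygail"]]
    results = asyncio.run(process_lists_concurrently(demo_lists))
    print([result["list"] for result in results])
```

Inside a notebook, where an event loop is already running, the coroutine would be awaited directly (`await process_lists_concurrently(...)`) instead of going through `asyncio.run`. Capping concurrency with a semaphore is one possible way to stay under a per-minute request quota; it is an assumption here, not something the notebook states.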

## Produce source data

We'll start from [this Kaggle dataset](https://kaggle.com/datasets/ryanburnsworth/popular-names-by-birth-year-1880-2022/data). I downloaded it in CSV format. We'll only be using names with > 1,000 occurrences.

In [25]:
# Load the CSV, filter for girl names with a count higher than 1000
df = pd.read_csv("../data/raw/gender_name.csv", usecols=["Name", "Gender", "Count"])
filtered_df = df.query("Gender == 'F' and Count > 1000")

# Select only the "Name" column and convert it to a list
all_girl_names_list = filtered_df["Name"].dropna().tolist()
all_girl_names_list.sort()

print(all_girl_names_list)

['Aadhya', 'Aadya', 'Aaleyah', 'Aaliya', 'Aaliyah', 'Aaniyah', 'Aanya', 'Aaralyn', 'Aarna', 'Aaron', 'Aarya', 'Aaryn', 'Abagail', 'Abbe', 'Abbey', 'Abbi', 'Abbie', 'Abbigail', 'Abbigale', 'Abbigayle', 'Abby', 'Abbygail', 'Abigael', 'Abigail', 'Abigale', 'Abigayle', 'Abilene', 'Abriana', 'Abrianna', 'Abriella', 'Abrielle', 'Abril', 'Abygail', 'Acacia', 'Ada', 'Adah', 'Adair', 'Adalee', 'Adaleigh', 'Adalia', 'Adalie', 'Adalina', 'Adaline', 'Adalyn', 'Adalynn', 'Adam', 'Adamari', 'Adamaris', 'Adara', 'Addalyn', 'Addalynn', 'Addelyn', 'Addie', 'Addilyn', 'Addilynn', 'Addisen', 'Addison', 'Addisyn', 'Addy', 'Addysen', 'Addyson', 'Adel', 'Adela', 'Adelaida', 'Adelaide', 'Adele', 'Adelene', 'Adelia', 'Adelina', 'Adeline', 'Adelita', 'Adell', 'Adella', 'Adelle', 'Adelyn', 'Adelynn', 'Adena', 'Adia', 'Adilene', 'Adilyn', 'Adilynn', 'Adina', 'Adison', 'Adisyn', 'Aditi', 'Adley', 'Adline', 'Adora', 'Adreanna', 'Adria', 'Adrian', 'Adriana', 'Adriane', 'Adrianna', 'Adrianne', 'Adrien', 'Adriene', '

Next, we split the names into lists of 16 names each, so that the output of call 1, which is later used as input, fits into Gemini's input context window.

In [26]:
# Split the list into chunks of 16 names each
chunk_size = 16
all_girl_names_list = [all_girl_names_list[i:i+chunk_size] for i in range(0, len(all_girl_names_list), chunk_size)]

print(all_girl_names_list)
# Count the names directly, since the last chunk may hold fewer than 16 names
total_names = sum(len(chunk) for chunk in all_girl_names_list)
print("Number of lists (up to 16 names each):", len(all_girl_names_list), ", so total girl names =", total_names)

[['Aadhya', 'Aadya', 'Aaleyah', 'Aaliya', 'Aaliyah', 'Aaniyah', 'Aanya', 'Aaralyn', 'Aarna', 'Aaron', 'Aarya', 'Aaryn', 'Abagail', 'Abbe', 'Abbey', 'Abbi'], ['Abbie', 'Abbigail', 'Abbigale', 'Abbigayle', 'Abby', 'Abbygail', 'Abigael', 'Abigail', 'Abigale', 'Abigayle', 'Abilene', 'Abriana', 'Abrianna', 'Abriella', 'Abrielle', 'Abril'], ['Abygail', 'Acacia', 'Ada', 'Adah', 'Adair', 'Adalee', 'Adaleigh', 'Adalia', 'Adalie', 'Adalina', 'Adaline', 'Adalyn', 'Adalynn', 'Adam', 'Adamari', 'Adamaris'], ['Adara', 'Addalyn', 'Addalynn', 'Addelyn', 'Addie', 'Addilyn', 'Addilynn', 'Addisen', 'Addison', 'Addisyn', 'Addy', 'Addysen', 'Addyson', 'Adel', 'Adela', 'Adelaida'], ['Adelaide', 'Adele', 'Adelene', 'Adelia', 'Adelina', 'Adeline', 'Adelita', 'Adell', 'Adella', 'Adelle', 'Adelyn', 'Adelynn', 'Adena', 'Adia', 'Adilene', 'Adilyn'], ['Adilynn', 'Adina', 'Adison', 'Adisyn', 'Aditi', 'Adley', 'Adline', 'Adora', 'Adreanna', 'Adria', 'Adrian', 'Adriana', 'Adriane', 'Adrianna', 'Adrianne', 'Adrien'], 

Now, let's process each of these lists with our `generate_names_info_async` function, which `process_all_lists` calls once per list. This function takes the input list of names and generates relevant info in JSON for each name. The prompt to Gemini and the model configuration (e.g. temp=0.2) can be found in `utils.py`.

Let's start by generating 3 JSONs only. The following cell should take around 30s to run, and create a `generated` folder inside the `data` folder, with 3 initial versions of JSONs, which are cleaned further down in this notebook.

In [4]:
# Test with only 3 lists
async def produce_names_json_async_3():
    lists_of_names = all_girl_names_list[:3] # Only process the first 3 lists for testing

    results = await process_all_lists(lists_of_names)
    print(f"All lists processed successfully. Results: {results}")
    
await produce_names_json_async_3()

JSON data for list 3 has been written to ../data/generated/G_A3.json
JSON data for list 1 has been written to ../data/generated/G_A1.json
JSON data for list 2 has been written to ../data/generated/G_A2.json
All lists processed successfully. Results: ['{"response": "{\\"Aadhya\\": {\\"meaning\\": \\"first born, beginning, primordial, the first\\", \\"origin\\": [\\"Indian\\", \\"Sanskrit\\"], \\"sound_details\\": {\\"phonemes\\": [\\"aː\\", \\"d\\", \\"j\\", \\"ə\\", \\"jɑː\\"], \\"syllables\\": 2}, \\"variants\\": [\\"Aadya\\", \\"Aditi\\"], \\"famous\\": null, \\"other_info\\": \\"Aadhya is a popular Indian name, primarily used in Hindu culture.  It signifies the beginning or the first of something, often associated with creation and new beginnings. The name has gained popularity in recent years, both within India and among Indian diaspora communities worldwide.\\", \\"family_meaning\\": \\"A family choosing Aadhya likely values tradition, spirituality, and the significance of beginni

⚠️⚠️⚠️ **WARNING:** Uncommenting and running the next cell will make many calls to Gemini in parallel. 
- You may hit the limit of requests to Gemini per minute. If you don't want to hit this limit, you should [request a quota increase](https://cloud.google.com/docs/quotas/help/request_increase) to 500 calls per minute. This can take up to 24h, but usually takes less than 1h. Otherwise, no worries: you can use the `process_remaining_girl_names` helper function to make several rounds of calls to Gemini without repeating lists that were already processed, although depending on your limit this may add some overhead.
- Generation of the whole dataset typically costs around 1-2 USD (depending on the region).

In [5]:
# # Run for all lists
# async def produce_names_json_async():
#     # List of lists of baby names to be computed (16 names each in order not to hit Gemini's output context limits)
#     lists_of_names = all_girl_names_list

#     results = await process_all_lists(lists_of_names)
#     print(f"All lists processed successfully. Results: {results}")

# await produce_names_json_async()

## Clean JSONs

This code will clean the JSONs and save them, pretty-printed, to a `cleaned` folder. If any of the generated JSONs is not valid, it will report an error "Error processing file (...)" indicating where the error is, and skip that file. You'll need to either fix that file manually or try your luck by running the generation again.

In [6]:
clean_json_files('../data/generated')

Creating processed directory at: ../data/generated/cleaned
All files in directory: ['cleaned', 'G_A1.json', 'G_A2.json', 'G_A3.json']
Checking file: cleaned
Checking file: G_A1.json
Processing file: ../data/generated/G_A1.json
Read content from ../data/generated/G_A1.json: "{\"response\": \"{\\\"Aadhya\\\": {\\\"meaning\\\": \\\"first born, beginning, primordial, the first\\\", \\\"origin\\\": [\\\"Indian\\\", \\\"Sanskrit\\\"], \\\"sound_details\\\": {\\\"phonemes\\\": [\\\"aː\\\", \\\"d\\\", \\\"j\\\", \\\"ə\\\", \\\"jɑː\\\"], \\\"syllables\\\": 2}, \\\"variants\\\": [\\\"Aadya\\\", \\\"Aditi\\\"], \\\"famous\\\": null, \\\"other_info\\\": \\\"Aadhya is a popular Indian name, primarily used in Hindu culture.  It signifies the beginning or the first of something, often associated with creation and new beginnings. The name has gained popularity in recent years, both within India and among Indian diaspora communities worldwide.\\\", \\\"family_meaning\\\": \\\"A family choosing Aadhya l
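
Judging from the output above, the raw files appear to hold JSON nested inside JSON strings (a quoted string containing a `response` object whose value is again a JSON string). A minimal sketch of how such content could be unwrapped is shown below. `unwrap_nested_json` and `max_depth` are hypothetical names, and this is not the actual `clean_json_files` implementation.

```python
import json


def unwrap_nested_json(text, max_depth=6):
    """Peel off layers of string-encoded JSON and `response` wrappers (illustrative only)."""
    value = text
    for _ in range(max_depth):
        if isinstance(value, str):
            # A string layer: decode it (raises json.JSONDecodeError if invalid)
            value = json.loads(value)
        elif isinstance(value, dict) and set(value) == {"response"}:
            # A wrapper object: keep only its payload
            value = value["response"]
        else:
            break
    return value


if __name__ == "__main__":
    names = {"Aadhya": {"meaning": "first born", "origin": ["Indian", "Sanskrit"]}}
    # Rebuild the layering seen in the output: string -> {"response": string} -> names
    raw = json.dumps(json.dumps({"response": json.dumps(names)}))
    print(json.dumps(unwrap_nested_json(raw), indent=4, ensure_ascii=False))
```

An invalid layer raises `json.JSONDecodeError`, which a caller could catch to report the file and skip it.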

In case there were invalid JSONs or you hit any quotas, you can run the following cell to check which lists, and how many, are still pending data generation. This function will return a list with all sublists that still need a valid generation.

In [7]:
process_remaining_girl_names('../data/generated/cleaned', all_girl_names_list)

Created lists start with names: {'Aadhya', 'Abygail', 'Abbie'}
Remaining lists start with names: [['Adara', 'Addalyn', 'Addalynn', 'Addelyn', 'Addie', 'Addilyn', 'Addilynn', 'Addisen', 'Addison', 'Addisyn', 'Addy', 'Addysen', 'Addyson', 'Adel', 'Adela', 'Adelaida'], ['Adelaide', 'Adele', 'Adelene', 'Adelia', 'Adelina', 'Adeline', 'Adelita', 'Adell', 'Adella', 'Adelle', 'Adelyn', 'Adelynn', 'Adena', 'Adia', 'Adilene', 'Adilyn'], ['Adilynn', 'Adina', 'Adison', 'Adisyn', 'Aditi', 'Adley', 'Adline', 'Adora', 'Adreanna', 'Adria', 'Adrian', 'Adriana', 'Adriane', 'Adrianna', 'Adrianne', 'Adrien'], ['Adriene', 'Adrienne', 'Adrina', 'Adyson', 'Aerial', 'Aeris', 'Aeryn', 'Africa', 'Afton', 'Agatha', 'Aggie', 'Agnes', 'Agustina', 'Ahna', 'Ahsley', 'Ahtziri'], ['Ahuva', 'Aida', 'Aidan', 'Aide', 'Aiden', 'Aiesha', 'Aiko', 'Aila', 'Ailani', 'Aileen', 'Ailene', 'Aili', 'Ailyn', 'Aime', 'Aimee', 'Aine'], ['Ainslee', 'Ainsley', 'Aisha', 'Aishah', 'Aislin', 'Aisling', 'Aislinn', 'Aislyn', 'Aislynn', 'Ai

[['Adara',
  'Addalyn',
  'Addalynn',
  'Addelyn',
  'Addie',
  'Addilyn',
  'Addilynn',
  'Addisen',
  'Addison',
  'Addisyn',
  'Addy',
  'Addysen',
  'Addyson',
  'Adel',
  'Adela',
  'Adelaida'],
 ['Adelaide',
  'Adele',
  'Adelene',
  'Adelia',
  'Adelina',
  'Adeline',
  'Adelita',
  'Adell',
  'Adella',
  'Adelle',
  'Adelyn',
  'Adelynn',
  'Adena',
  'Adia',
  'Adilene',
  'Adilyn'],
 ['Adilynn',
  'Adina',
  'Adison',
  'Adisyn',
  'Aditi',
  'Adley',
  'Adline',
  'Adora',
  'Adreanna',
  'Adria',
  'Adrian',
  'Adriana',
  'Adriane',
  'Adrianna',
  'Adrianne',
  'Adrien'],
 ['Adriene',
  'Adrienne',
  'Adrina',
  'Adyson',
  'Aerial',
  'Aeris',
  'Aeryn',
  'Africa',
  'Afton',
  'Agatha',
  'Aggie',
  'Agnes',
  'Agustina',
  'Ahna',
  'Ahsley',
  'Ahtziri'],
 ['Ahuva',
  'Aida',
  'Aidan',
  'Aide',
  'Aiden',
  'Aiesha',
  'Aiko',
  'Aila',
  'Ailani',
  'Aileen',
  'Ailene',
  'Aili',
  'Ailyn',
  'Aime',
  'Aimee',
  'Aine'],
 ['Ainslee',
  'Ainsley',
  'Aisha',
  'A
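
The output above suggests that each list is identified by the name it starts with. Assuming that is the case, the filtering step could look roughly like this; `filter_remaining_lists` is a hypothetical name, not the helper from `utils.py`.

```python
def filter_remaining_lists(all_lists, created_first_names):
    """Return the sublists whose first name is not among the already generated ones."""
    created = set(created_first_names)
    return [names for names in all_lists if names and names[0] not in created]


if __name__ == "__main__":
    # Shortened sublists, using names from the outputs above
    all_lists = [
        ["Aadhya", "Aadya"],
        ["Abbie", "Abbigail"],
        ["Abygail", "Acacia"],
        ["Adara", "Addalyn"],
    ]
    created = {"Aadhya", "Abygail", "Abbie"}
    print(filter_remaining_lists(all_lists, created))
```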

If you still have lists pending to be generated, you can uncomment and run the following cell, which will check which are the pending lists and run the generation only for them.

In [8]:
# # Run generation for remaining lists
# async def produce_names_json_async():
#     # List of lists of baby names to be computed (16 names each in order not to hit Gemini's output context limits)
#     remaining_list = process_remaining_girl_names('../data/generated/reviewed/cleaned', all_girl_names_list)

#     results = await process_all_lists(remaining_list)
#     print(f"All lists processed successfully. Results: {results}")

# await produce_names_json_async()

## Create Master JSON

The following cell will create a master JSON file from all JSONs inside a folder, merging files by starting letter (e.g. all files starting with a "G" will be merged into the `G_master_initial.json` girls master file).

In [9]:
merge_json_files_by_letter('../data/generated/cleaned')
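
As an illustration of the merge step, here is a minimal sketch that groups files by the first letter of their filename. It assumes each cleaned file holds a JSON object keyed by name; `merge_by_first_letter` and its `suffix` parameter are hypothetical and may differ from what `merge_json_files_by_letter` does in `utils.py`.

```python
import json
from collections import defaultdict
from pathlib import Path


def merge_by_first_letter(directory, suffix="_master_initial.json"):
    """Merge all JSON objects in `directory` that share a filename initial (illustrative)."""
    directory = Path(directory)
    merged = defaultdict(dict)
    for path in sorted(directory.glob("*.json")):
        if path.name.endswith(suffix):
            continue  # skip master files left over from a previous run
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        # Later files overwrite earlier entries that share the same name
        merged[path.name[0]].update(data)

    outputs = []
    for letter, names in merged.items():
        out_path = directory / f"{letter}{suffix}"
        with out_path.open("w", encoding="utf-8") as f:
            json.dump(names, f, indent=4, ensure_ascii=False)
        outputs.append(out_path)
    return outputs
```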

## Enhance master JSON

To let users search for names at a finer granularity (and therefore make Gemini's search more effective), we will enhance the data by adding attributes to each name based on keywords (e.g. if a name's meaning or additional info contains the words `God`, `faith`, `religion` or `angel`, it'll be given the attribute `Religious`). All attributes are defined in the `add_attributes_field` function inside `utils.py`.

In [10]:
# Enrich the master JSON with attributes

# Example usage
master_json_path = '../data/generated/cleaned/G_master_initial.json'
output_json_name = 'G_master_attr.json'
add_attributes_field(master_json_path, output_json_name)

# Print example of the modified JSON (specific fields)
output_json_path = os.path.join(os.path.dirname(master_json_path), output_json_name)
with open(output_json_path, 'r') as file:
    modified_data = json.load(file)
    
    # Get the first name and its details
    first_name, first_details = next(iter(modified_data.items()))
    
    print(f"Name: {first_name}")
    print(f"Meaning: {first_details.get('meaning', 'N/A')}")
    print(f"Other Info: {first_details.get('other_info', 'N/A')}")
    print(f"Attributes: {first_details.get('attributes', 'N/A')}")

Name: Abbie
Meaning: God is my father
Other Info: Abbie is a diminutive of Abigail. It has been in use since the Middle Ages and has a sweet and simple sound. It is a good choice for parents who are looking for a name that is both classic and modern.
Attributes: ['Religious', 'Classic', 'Modern']
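
The keyword-based tagging can be sketched as below. Only the `Religious` keywords come from the description above; the `Classic` and `Modern` keywords, the whole-word matching, and the names `ATTRIBUTE_KEYWORDS` and `infer_attributes` are assumptions made for illustration, so the real `add_attributes_field` may behave differently.

```python
import re

ATTRIBUTE_KEYWORDS = {
    "Religious": ["God", "faith", "religion", "angel"],  # keywords named in the text
    "Classic": ["classic"],  # assumed keyword, for illustration only
    "Modern": ["modern"],  # assumed keyword, for illustration only
}


def infer_attributes(details, fields=("meaning", "family_meaning", "other_info")):
    """Return the attributes whose keywords appear as whole words in the given fields."""
    text = " ".join(str(details.get(field) or "") for field in fields)
    attributes = []
    for attribute, keywords in ATTRIBUTE_KEYWORDS.items():
        pattern = r"\b(" + "|".join(re.escape(keyword) for keyword in keywords) + r")\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            attributes.append(attribute)
    return attributes


if __name__ == "__main__":
    abbie = {
        "meaning": "God is my father",
        "other_info": "It is a good choice for parents who are looking for a name "
                      "that is both classic and modern.",
    }
    print(infer_attributes(abbie))
```

Matching on word boundaries avoids tagging a name because of a longer word that merely contains a keyword (for example "angelic"); whether that is desirable depends on how broad the attributes should be.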


## Clean Master JSON file

### Remove duplicated info

Some of the likely liked, famous, and variations elements in the master JSON are duplicated. Let's remove them.

In [17]:
input_json_path = '../data/generated/cleaned/G_master_attr.json'
output_json_path = '../data/generated/cleaned/G_master_attr_no_dupl.json'
remove_duplicates_from_json(input_json_path, output_json_path)
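
Recursively removing duplicates while preserving order can be done along these lines. `dedupe_preserving_order` is a hypothetical name, and serializing each item to build a comparison key is just one possible approach, not necessarily the one used by `remove_duplicates` in `utils.py`.

```python
import json


def dedupe_preserving_order(value):
    """Recursively drop repeated list items, keeping the first occurrence of each."""
    if isinstance(value, dict):
        return {key: dedupe_preserving_order(item) for key, item in value.items()}
    if isinstance(value, list):
        seen = set()
        result = []
        for item in value:
            item = dedupe_preserving_order(item)
            # Serialize so that unhashable items (dicts, lists) can be compared too
            key = json.dumps(item, sort_keys=True, ensure_ascii=False)
            if key not in seen:
                seen.add(key)
                result.append(item)
        return result
    return value


if __name__ == "__main__":
    entry = {"famous": ["Ada Lovelace", "Ada Yonath", "Ada Lovelace"], "variants": ["Adah", "Adah"]}
    print(dedupe_preserving_order(entry))
```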

Let's also remove whole names that may be duplicated in the master JSON (in case we accidentally ran the generation more than once for the same list).

In [18]:
# Remove duplicates and get the deleted names
unique_data, deleted_names = remove_duplicate_names('../data/generated/cleaned/G_master_attr_no_dupl.json')

# Print the deleted names and their counts
for name, count in deleted_names.items():
    print(f"Deleted {count} instances of the name '{name}'")

# Save the unique data back to the same JSON file
with open('../data/generated/cleaned/G_master_attr_no_dupl.json', 'w') as file:
    json.dump(unique_data, file, indent=4, ensure_ascii=False)
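
If the master file is a JSON object keyed by name, repeated names show up as repeated keys, which a plain `json.load` would silently collapse. One way to detect and count them is the `object_pairs_hook` parameter of the standard `json` module, as sketched below. `load_without_duplicate_keys` is a hypothetical helper and not necessarily how `remove_duplicate_names` works.

```python
import json
from collections import Counter


def load_without_duplicate_keys(text):
    """Parse JSON text, keeping the first value for each repeated key and counting the rest."""
    deleted = Counter()

    def keep_first(pairs):
        # Called for every JSON object with its (key, value) pairs in document order
        unique = {}
        for key, value in pairs:
            if key in unique:
                deleted[key] += 1
            else:
                unique[key] = value
        return unique

    data = json.loads(text, object_pairs_hook=keep_first)
    return data, dict(deleted)


if __name__ == "__main__":
    raw = '{"Ada": {"meaning": "noble"}, "Abby": {"meaning": "joy"}, "Ada": {"meaning": "repeat"}}'
    unique_data, deleted_names = load_without_duplicate_keys(raw)
    print(unique_data)
    print(deleted_names)
```

Note that the hook runs for nested objects as well, so repeated keys inside a name's details would also be counted.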

## Spot boy/man names and erase them

Now let's compare these names with a list of masculine names we've got from Kaggle, to see which of the names in our list are typically masculine, which are feminine, and which are gender-neutral. We will only keep feminine and gender-neutral names.

In [6]:
# Load the CSV, filter for girl names with a count higher than 1000
df = pd.read_csv("../data/raw/gender_name.csv", usecols=["Name", "Gender", "Count"])

filtered_df_girl = df.query("Gender == 'F' and Count > 1000")
all_girl_names_list = filtered_df_girl["Name"].dropna().tolist()
all_girl_names_list.sort()

#  Filter for boy names with a count higher than 1000
filtered_df_boy = df.query("Gender == 'M' and Count > 1000")
all_boy_names_list = filtered_df_boy["Name"].dropna().tolist()
all_boy_names_list.sort()

print(all_boy_names_list)

['Aaden', 'Aakash', 'Aamir', 'Aarav', 'Aaron', 'Aarron', 'Aarush', 'Aaryan', 'Aayan', 'Aayden', 'Aayush', 'Abbas', 'Abbott', 'Abdallah', 'Abdiel', 'Abdirahman', 'Abdul', 'Abdulaziz', 'Abdullah', 'Abdullahi', 'Abdulrahman', 'Abe', 'Abel', 'Abelardo', 'Abhinav', 'Abiel', 'Abimael', 'Abner', 'Abraham', 'Abram', 'Abran', 'Ace', 'Achilles', 'Acie', 'Adair', 'Adalberto', 'Adam', 'Adams', 'Adan', 'Adarius', 'Addison', 'Adel', 'Adelbert', 'Adell', 'Adem', 'Aden', 'Adian', 'Adiel', 'Adil', 'Adin', 'Aditya', 'Adler', 'Adnan', 'Adolf', 'Adolfo', 'Adolph', 'Adolphus', 'Adonis', 'Adrain', 'Adrian', 'Adriano', 'Adriel', 'Adrien', 'Adron', 'Adryan', 'Adyn', 'Aedan', 'Aeden', 'Agapito', 'Agustin', 'Aharon', 'Ahmad', 'Ahmed', 'Ahmir', 'Aidan', 'Aiden', 'Aidyn', 'Aj', 'Ajani', 'Ajay', 'Akash', 'Akeem', 'Akhil', 'Akil', 'Akira', 'Akiva', 'Akram', 'Aksel', 'Akshay', 'Al', 'Alain', 'Alan', 'Alaric', 'Alastair', 'Albert', 'Alberto', 'Albin', 'Albino', 'Aldair', 'Alden', 'Aldo', 'Aldon', 'Alec', 'Aleck', 'Al

In [7]:
girl_names_set = set(all_girl_names_list)
boy_names_set = set(all_boy_names_list)
gender_neutral_names_set = girl_names_set.intersection(boy_names_set)

gender_neutral_names_list = list(gender_neutral_names_set)
print(gender_neutral_names_list)

['Deborah', 'Linden', 'Ellie', 'Amani', 'Darrell', 'Jaydin', 'Jamey', 'Laine', 'Marvin', 'Timothy', 'Yuri', 'Erie', 'Johnny', 'Cadence', 'Michel', 'Juan', 'Arin', 'Jaylin', 'Shanon', 'Lois', 'Donnie', 'Mark', 'Brittany', 'Kori', 'Kamari', 'Karol', 'Shirley', 'Heather', 'Tristin', 'Holly', 'Ramsey', 'Val', 'Zion', 'Jennifer', 'Royal', 'Tory', 'Mary', 'Carmen', 'Marshall', 'Artie', 'Jacqueline', 'Floyd', 'Sky', 'Quinn', 'Howard', 'Devyn', 'Emily', 'Micaiah', 'Amari', 'Frances', 'Betty', 'Korey', 'Jewell', 'Kaiden', 'Melvin', 'Sarah', 'Jean', 'Dee', 'Braylin', 'Christopher', 'Raleigh', 'Richard', 'Lane', 'Lloyd', 'Ozell', 'Norman', 'Gail', 'Kathryn', 'Jensen', 'Blair', 'Channing', 'Cris', 'Juanita', 'Connor', 'Joe', 'Ora', 'Jordon', 'Jackie', 'Kelby', 'Karter', 'Winifred', 'Hudson', 'Noel', 'Albert', 'Diane', 'Gary', 'Dominique', 'Sheridan', 'Meredith', 'Emory', 'Rylan', 'Stephanie', 'Carmel', 'Rian', 'Archie', 'Lashawn', 'Micha', 'Carolyn', 'Audie', 'Brighton', 'Callan', 'Deven', 'Joey',

After human verification, we saw some theoretically gender-neutral names that don't seem right. Let's manually create a list of those and remove them from the gender-neutral names list:

In [8]:
to_remove = ['Peter', 'Tommie', 'Jennifer', 'Delma', 'Nikita', 'Fred', 'Robert', 'Victor', 'David', 'Maria', 'Jose', 'Norma', 'Sasha', 'Amy', 'Teresa', 'William', 'Scott', 'Evelyn', 'Sarah', 'Lindsay', 'Freddie', 'Allison', 'Sandy', 'Carmen', 'Lindsey', 'Thelma', 'Amanda', 'Claire', 'Katherine', 'Laura', 'Vincent', 'Anna', 'Ashley', 'Adrien', 'Carl', 'Donald', 'Erica', 'Eric', 'Andrea', 'Martin', 'Edward', 'Mitchell', 'Isabel', 'Noa', 'Leo', 'Michael', 'Roy', 'Page', 'Samuel', 'Marie', 'Charles', 'Gerald', 'Phillip', 'Dana', 'Cristian', 'Leon', 'Angel', 'Timothy', 'Mickey', 'Edwin', 'Joe', 'Hillary', 'Theo', 'Larry', 'Stacey', 'Steven', 'Kendal', 'Albert', 'Clara', 'Sara', 'Helen', 'Bobby', 'Gabriel', 'Harry', 'Karen', 'Elisabeth', 'Michele', 'Jesus', 'Jessie','Jonnie', 'Johnie', 'Brittany', 'Juanita', 'Rickie', 'Don', 'Gregory', 'Emily', 'Frankie', 'Augusta', 'Carol', 'Lisa', 'Cynthia', 'Gloria', 'Troy', 'Oscar', 'Thomas', 'Donna', 'Rachel', 'Rebecca', 'Judith', 'Betty', 'Guadalupe', 'Brenda', 'Pamela', 'Julie', 'Virginia', 'Patricia', 'Ruth', 'Daniel', 'Leonard', 'Frank', 'Justin', 'Mario', 'Jonathan', 'Johnnie', 'Paige', 'Seneca', 'Deborah', 'Irene', 'Mikel', 'Rosario', 'Paul', 'Andrew', 'Kendall', 'Bobbie', 'Robbie', 'Martha', 'Luca', 'Israel', 'Monica', 'Eva', 'Antonia', 'Diana', 'Richard', 'Isa', 'Melissa', 'Dolores', 'Miguel', 'Raymond', 'Johnny', 'Frederick', 'Raymond', 'Linda', 'Angela', 'Joseph', 'Rosa', 'Tiffany', 'Diane', 'Carlos', 'Connor', 'James', 'Emma', 'Julian' ,'Alfred', 'Annie', 'Philip', 'Manuel', 'Rose', 'Ernest', 'Santos', 'Stacy', 'Samantha', 'Susan', 'Dennis', 'Lorenza', 'Sandra', 'Adrian', 'Julia', 'Kevin', 'Michelle', 'Jack', 'Jessica', 'George', 'Nicholas', 'Aubrey', 'Juan', 'Lauren', 'Carolyn', 'Jerome', 'Victoria', 'Theresa', 'Darrell', 'Vanessa']

# Remove visually inspected non-gender-neutral names:
for name in to_remove:
    if name in gender_neutral_names_list:
        gender_neutral_names_list.remove(name)

gender_neutral_names_set = set(gender_neutral_names_list)

print(gender_neutral_names_list)

['Linden', 'Ellie', 'Amani', 'Jaydin', 'Jamey', 'Laine', 'Marvin', 'Yuri', 'Erie', 'Cadence', 'Michel', 'Arin', 'Jaylin', 'Shanon', 'Lois', 'Donnie', 'Mark', 'Kori', 'Kamari', 'Karol', 'Shirley', 'Heather', 'Tristin', 'Holly', 'Ramsey', 'Val', 'Zion', 'Royal', 'Tory', 'Mary', 'Marshall', 'Artie', 'Jacqueline', 'Floyd', 'Sky', 'Quinn', 'Howard', 'Devyn', 'Micaiah', 'Amari', 'Frances', 'Korey', 'Jewell', 'Kaiden', 'Melvin', 'Jean', 'Dee', 'Braylin', 'Christopher', 'Raleigh', 'Lane', 'Lloyd', 'Ozell', 'Norman', 'Gail', 'Kathryn', 'Jensen', 'Blair', 'Channing', 'Cris', 'Ora', 'Jordon', 'Jackie', 'Kelby', 'Karter', 'Winifred', 'Hudson', 'Noel', 'Gary', 'Dominique', 'Sheridan', 'Meredith', 'Emory', 'Rylan', 'Stephanie', 'Carmel', 'Rian', 'Archie', 'Lashawn', 'Micha', 'Audie', 'Brighton', 'Callan', 'Deven', 'Joey', 'Jalen', 'Jon', 'Marion', 'Azariah', 'Jadyn', 'Louise', 'Yael', 'Elliott', 'Carroll', 'Ashtin', 'Corey', 'Kim', 'Ellery', 'Jaidyn', 'Carrie', 'Laken', 'Reilly', 'Rory', 'Elmer', 'J

And let's now keep only girls and gender-neutral names:

In [9]:
# Remove all names in all_boy_names_list from the girl names, then add back the gender-neutral names
final_girls_names_set = (girl_names_set - boy_names_set).union(gender_neutral_names_set)

# Convert the set back to a list
final_girls_names_list = list(final_girls_names_set)
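
To make the set expression above concrete, here is a small worked example on toy data. The sets are tiny hand-picked subsets chosen for illustration (`rejected_after_review` plays the role of `to_remove`), not the real lists.

```python
# Toy subsets for illustration only
girl_names = {"Abby", "Adair", "Adam", "Jennifer", "Quinn"}
boy_names = {"Abel", "Adair", "Adam", "Jennifer", "Quinn"}

# Names present in both lists are the gender-neutral candidates
shared = girl_names & boy_names

# Candidates rejected during manual review
rejected_after_review = {"Adam", "Jennifer"}
gender_neutral = shared - rejected_after_review

# Girl-only names, plus the candidates that survived the review
final_girl_names = (girl_names - boy_names) | gender_neutral
print(sorted(final_girl_names))  # ['Abby', 'Adair', 'Quinn']
```

With this expression, any name that appears in both source lists and is then rejected during review (here `Adam` and `Jennifer`) is excluded from the final list.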

In [10]:
# Remove all boy-only names from Girls Master JSON
input_path = '../data/generated/cleaned/G_master_attr_no_dupl.json'
output_path = input_path

keep_names_in_json(input_path, final_girls_names_list, output_path)
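
Since the master JSON is an object keyed by name, the filtering step boils down to keeping only the allowed keys. A minimal in-memory sketch follows; `keep_names` is a hypothetical name, and the real `keep_names_in_json` also reads from and writes to disk.

```python
def keep_names(data, names_to_keep):
    """Return a copy of `data` containing only the entries whose key is in `names_to_keep`."""
    allowed = set(names_to_keep)  # set membership checks are fast for long lists
    return {name: details for name, details in data.items() if name in allowed}


if __name__ == "__main__":
    master = {"Abby": {"meaning": "joy"}, "Adam": {"meaning": "earth"}}
    print(keep_names(master, ["Abby", "Quinn"]))
```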

*Et voilà!* We have our Girl names synthetic dataset ready 👶🏽

Let's now upload it to GCS, where our main process (a Cloud Function) will pick it up:

In [None]:
source_file_name = '../data/generated/cleaned/G_master_attr_no_dupl.json'
destination_blob_name = 'G_master_test.json'

upload_to_gcs(source_file_name, destination_blob_name)

File ../data/generated/cleaned/G_master_attr_no_dupl.json uploaded to G_master_test.json.
