# Preface to the Data Collection
I searched for an expansive word list covering Sexual Orientation and Gender Identity (SOGI) and did not find one in a usable format. I am a cisgender, straight, white man in the United States. It is not my place to impose a taxonomy on these communities, so I will draw, as far as I can, on the way community members and reputable organizations speak about them.

As such, I'm drawing heavily on two sources to start creating word lists:
* https://nonbinary.wiki/wiki/Glossary_of_English_gender_and_sex_terminology
* https://www.hrc.org/resources/glossary-of-terms

```bibtex
@misc{wiki:xxx,
  author = "Nonbinary Wiki",
  title = "Glossary of English gender and sex terminology --- Nonbinary Wiki",
  year = "2022",
  url = "https://nonbinary.wiki/w/index.php?title=Glossary_of_English_gender_and_sex_terminology&oldid=32474",
  note = "[Online; accessed 22-July-2022]"
}
```

In [1]:
import requests, pandas as pd, numpy as np, re, csv
from bs4 import BeautifulSoup

In [2]:
url = "https://nonbinary.wiki/w/index.php?title=Glossary_of_English_gender_and_sex_terminology&oldid=32474"
page = requests.get(url, timeout=30)
page.raise_for_status()  #Fail early if the request did not succeed.

In [3]:
#Removing all of the navigation at the start of the document to focus on the data we care about.
pageParts = page.text[page.text.find('<h2>'):]
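
One caveat with the slice above: `str.find` returns `-1` when the marker is absent, and slicing from `-1` would silently keep only the last character of the page. A minimal sketch of a guarded version follows. The helper name `trim_to_first_heading` and the sample HTML are hypothetical and not part of the original notebook.

```python
def trim_to_first_heading(html: str, marker: str = "<h2>") -> str:
    """Return html starting at the first marker, or unchanged if it is absent."""
    start = html.find(marker)
    if start == -1:
        # Marker not found: keep everything rather than slicing from -1.
        return html
    return html[start:]


# Invented sample input, only to show the behavior.
sample = "<nav>menu</nav><h2>A</h2><ul><li><b>ace</b>. Short for asexual.</li></ul>"
trimmed = trim_to_first_heading(sample)
```

In the notebook this would be called as `trim_to_first_heading(page.text)` in place of the inline slice.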

In [4]:
#Parse the trimmed HTML into a soup object.
soup = BeautifulSoup(pageParts, "html.parser")
#Plain text of the page as a single string (kept for inspection only).
text = soup.get_text()
#Lazy generator of whitespace-stripped strings (also kept for inspection only).
strings = soup.stripped_strings

In [5]:
lists = soup.find_all("ul")

In [None]:
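#Unfinished draft that was never executed; the next cell does the extraction.
words = {}
for ul in lists:
    items = ul.find_all("li")
    for item in items:
        bolds = item.find_all("b")
        for bold in bolds:
            pass  #Draft stopped here.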

In [6]:
words = []
for item in lists:
    bold = item.find_all("b")
    for text in bold:
        string = text.string
        if string is not None:
            #Strip periods and commas from the bolded term.
            results = re.sub(r'[.,]', '', string)
            #Note: the candidate is lowercased for this check but stored as-is.
            if results.lower() not in words and results != '':
                words.append(results)
                #print(results) #for QA
        else:
            #.string is None when the <b> tag has more than one child, so check each child.
            for t in text.children:
                if t.string is not None:
                    results = re.sub(r'[.,]', '', t.string)
                    if results not in words and results != '':
                        words.append(results)

len(words)

243
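
The two branches above check for duplicates differently: one lowercases the candidate before the membership test, the other does not. If a case-insensitive check is what is wanted, one option is to track lowercased keys in a separate set while keeping the first-seen casing. This is only a sketch with made-up sample input. `dedupe_terms` is a hypothetical helper, not the notebook's method, and swapping it in could change the count reported above.

```python
import re


def dedupe_terms(candidates):
    """Strip periods/commas and drop case-insensitive duplicates, keeping order."""
    seen = set()   # lowercased keys already accepted
    terms = []     # accepted terms in first-seen casing
    for raw in candidates:
        term = re.sub(r"[.,]", "", raw).strip()
        key = term.lower()
        if term and key not in seen:
            seen.add(key)
            terms.append(term)
    return terms


# Invented sample input, only to show the behavior.
sample_terms = dedupe_terms(["ace.", "AGAB", "Ace", "agender,", "AGAB", ""])
```

A case-insensitive key would also merge two genuinely distinct terms that differ only by capitalization, so whether this is appropriate depends on the glossary.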

In [7]:
"CAGAB" in words

True

In [8]:
context = []
for item in lists:
    for l in item.find_all("li"):
        current = []
        for desc in l.children:
            if desc.string is None:
                for n in desc.children:
                    if n.string is not None:
                        #c = re.sub("(\.|\,)","",n.string)
                        current.append(n.string)
            else:
                #d = re.sub("(\.|\,)","",desc.string)
                current.append(desc.string)
        #print(current)
        if len(current) > 1:
            full = " ".join(current)
            mid = re.sub(r" \. ", ". ", full)
            clean = " ".join(mid.split())
            #print(mid)
            if any(word in clean for word in words) and ("." in clean):
                context.append(clean)
            elif any(word in clean.split() for word in words):
                context.append(clean)
                #print("---> " + clean)

In [9]:
print(f"words: {len(words)}")
print(f"context: {len(context)}")

words: 243
context: 214
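
The whitespace clean-up inside the loop above can be pulled into a small function so it can be checked in isolation. The sketch below mirrors the three steps from that cell (join the fragments, repair ` . `, collapse whitespace). `normalize_entry` is a hypothetical name and the sample fragments are invented.

```python
import re


def normalize_entry(fragments):
    """Join text fragments of one list item and tidy the spacing."""
    full = " ".join(fragments)
    # Joining can leave a stray space before a period: "AFAB . See" -> "AFAB. See".
    mid = re.sub(r" \. ", ". ", full)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(mid.split())


# Invented sample input, only to show the behavior.
example = normalize_entry(["AFAB", ". See  AGAB.\n"])
```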


In [26]:
terms = []
texts = []
#outliers = ["polyromantic", "polysexual", "NBY", "em eir eirs eirself"]
for word in words:
    for entry in context:
        if (word == entry[:len(word)]) and (word not in terms):
            terms.append(word)
            texts.append(entry)
        elif "." in entry:
            if word in entry[:entry.index(".")].split():
                if (word not in terms):
                    terms.append(word)
                    texts.append(entry)
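            else:
                #Fallback: pair the word with the first entry (containing a period) that is not already used.
                #This can attach a term to an unrelated definition, so spot-check the result.
                if (word not in terms) and (entry not in texts):
                    terms.append(word)
                    texts.append(entry)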
                
print(len(terms), len(texts))

243 243
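
Because the fallback branch above can pair a term with whatever entry is still unused, it may be worth cross-checking against a stricter match. The sketch below pairs a term only with an entry whose text before the first period contains it, and reports anything left over instead of guessing. `pair_by_headword` and the sample data are hypothetical, and this is not the approach the notebook uses.

```python
def pair_by_headword(words, context):
    """Pair each word with the first entry whose headword contains it."""
    pairs = {}
    unmatched = []
    for word in words:
        for entry in context:
            # Treat everything before the first period as the headword.
            head = entry.split(".", 1)[0].strip()
            tokens = head.replace(",", " ").split()
            if word == head or word in tokens:
                pairs[word] = entry
                break
        else:
            # No entry matched: record the word rather than forcing a pairing.
            unmatched.append(word)
    return pairs, unmatched


# Invented sample input, only to show the behavior.
sample_context = [
    "ace. Short for asexual.",
    "AFAB. See AGAB.",
    "AGAB. Assigned gender at birth.",
]
pairs, unmatched = pair_by_headword(["ace", "AGAB", "AFAB", "CAGAB"], sample_context)
```

Comparing the `unmatched` list with the pairs produced by the cell above would show which terms depended on the fallback.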


In [325]:
gender = pd.DataFrame(list(zip(terms, texts)), columns=["Words", "Context"])

In [326]:
missing = []
for word in words:
    if word in gender["Words"].tolist():
        continue
    else:
        missing.append(word)

In [327]:
len(missing)

0

In [328]:
gender.head()

Unnamed: 0,Words,Context
0,ace,"ace. Short for asexual, which see. [1]"
1,AGAB,AGAB. Assigned gender at birth. Most people ar...
2,AFAB,AFAB. See AGAB.
3,agender,agender. A nonbinary identity. 1. Some who cal...
4,altersex,altersex. Describes people or fictional charac...


In [329]:
#gender.drop(index = 0, inplace = True)
#gender.reset_index(inplace=True)
#gender.drop(columns=["index"], inplace=True)

In [330]:
gender.head()

Unnamed: 0,Words,Context
0,ace,"ace. Short for asexual, which see. [1]"
1,AGAB,AGAB. Assigned gender at birth. Most people ar...
2,AFAB,AFAB. See AGAB.
3,agender,agender. A nonbinary identity. 1. Some who cal...
4,altersex,altersex. Describes people or fictional charac...


In [331]:
gender.to_csv("nonbinary-wiki-gender-terms.csv", index=False)
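
Several definitions contain commas, so the Context column relies on CSV quoting to survive a round trip. A quick way to check this is to write a couple of rows and read them back. This sketch uses the standard-library `csv` module and an in-memory buffer rather than pandas and the real file, and the two rows are copied from the preview above.

```python
import csv
import io

rows = [
    ("ace", "ace. Short for asexual, which see. [1]"),
    ("AFAB", "AFAB. See AGAB."),
]

# Write a header plus the rows; csv.writer quotes fields that contain commas.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Words", "Context"])
writer.writerows(rows)

# Read the same text back and confirm the fields are intact.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
```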