# NLP Mini Project // Term 1 // Cindy Lin

Project Overview:

Before CCI, I was working on a photo project that incorporated research on modern fertility practices. In that research I found traces of eugenics and unethical marketing practices, and I became interested in exploring how human genetic material is objectified as a sellable product. Fertility clinics constantly told me that "Buying sperms online is as easy as shopping on Amazon.com."

Now, with NLP methods, I want to rework this photo project to see if I can further the research I was doing in 2019/2020.

Originally, I wanted to create a dataset from the Curated Donor Profiles on https://www.cryobank.com/. California Cryobank is one of the largest sperm banks in the United States, and I wanted to see whether specific language is used to market this "curated collection," which the company promotes as some of its most popular donors. However, there were only 8 profiles in that collection, and I felt the dataset was too small for any real analysis.

So I broadened the scrape to cover donors by ethnic origin (Caucasian, Asian, Black or African American, Hispanic or Latino) using the site's Donor Search function (https://www.cryobank.com/search/).

On each Donor Profile, I scraped the key marketing descriptions from the site. Here is an example:

- Headline: A Dreamer and A Doer
- Description: One look tells you there’s something special about Donor 16910 with his shoulder-length jet-black hair pulled back into a ponytail, neatly trimmed beard and thick mustache, rosy cheeks, mocha brown eyes, and athletic build. An actor/singer/director with theater and film credits in his native Nepal, this multilingual (English, Hindi, and Nepali) entertainment industry pro is now exploring a career as a writer. He’s a gifted soccer player and a passionate fan of cricket and international cinema.

For the project, I modified the Wikipedia code from the web crawler/ web scraping notebook to scrape the data. 

Originally, with the curated collection, I wanted to see whether there is a trend in the marketing language by comparing how often the following appear in the donor descriptions:

- Physical attributes (brown eyes, tall, full head of hair)
- Interests (long walks on the beach, likes chicken pot pies)
- Accomplishments (great at playing piano, math champion, can type 70 words per minute)
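
One way to sketch this comparison is to count how many words from each category appear in a description. The keyword lists below are hypothetical placeholders chosen for illustration, not lists used in this project, and `count_categories` is an assumed helper name. The sample text is a shortened paraphrase of the example description above.

```python
import re
from collections import Counter

# Hypothetical keyword lists, for illustration only
CATEGORIES = {
    "physical": {"hair", "eyes", "tall", "beard", "build", "smile"},
    "interests": {"soccer", "cricket", "cinema", "cooking", "music"},
    "accomplishments": {"gpa", "degree", "champion", "credits", "multilingual"},
}

def count_categories(description):
    """Count how many words in a description fall into each category."""
    words = re.findall(r"[a-z]+", description.lower())
    counts = Counter()
    for word in words:
        for category, keywords in CATEGORIES.items():
            if word in keywords:
                counts[category] += 1
    return counts

sample = ("An actor/singer/director with theater and film credits, this "
          "multilingual pro has jet-black hair, mocha brown eyes, and an "
          "athletic build. He's a gifted soccer player and a fan of cricket.")
result = count_categories(sample)
print(dict(result))
```

Real keyword lists would need to be much longer, and a word can be ambiguous between categories, so counts like these are only a rough first pass.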

After enlarging the dataset, I am now grouping donors by ethnic origin to see whether the marketing language differs between groups. I am also interested in seeing whether certain keywords are used over and over.

In [1]:
import requests
from bs4 import BeautifulSoup

def extract_donor_info(donor_number):
    # The Cryobank donor urls and pages follow a consistent structure
    donor_url = f'https://www.cryobank.com/donor/{donor_number}'

    # Make a request to the Cryobank server and check that we get a successful response
    response = requests.get(donor_url)
    if response.status_code != 200:
        # Return two values so callers can still unpack (h2_info, p_info)
        return "Failed to retrieve the page.", ""

    # Use Beautiful Soup to parse the HTML
    soup = BeautifulSoup(response.text, 'html.parser')

    # Using the Developer Tools in my web browser, I found the headline in the 3rd <h2> tag
    # Find all the <h2> tags on the page
    h2_info = ""
    h2_tags = soup.find_all('h2')

    # If the page has at least three <h2> tags, take the third one (index 2);
    # otherwise record that it was missing
    if len(h2_tags) >= 3:
        h2_info = h2_tags[2].get_text()
    else:
        h2_info = "No third <h2> tag found"

    # Using the Developer Tool in my browser, I found the content I want in the 13th <p> tag
    p_info = ""
    p_tags = soup.find_all('p')

    # Using the same process for <h2>, we are scraping the 13th <p>
    if len(p_tags) >= 13:
        p_info = p_tags[12].get_text()
    else:
        p_info = "No 13th <p> tag found"
    
    return h2_info, p_info




## Scraping Asian Donors:
- 127 donors available. 
- The donor numbers appear to be assigned arbitrarily, so I searched the database with Ethnic Origin set to Asian as the search criterion.
- Then I typed up all the donor numbers.
- When a donor number started with zero, I wrapped it in quotes ('') to make it a string; otherwise Python raises a syntax error, because integer literals cannot have leading zeros.
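
A small sketch of an alternative to mixing integers and quoted strings: keep every donor number as a string and zero-pad it to five digits. The five-digit width is an assumption based on the numbers listed in this notebook, and `normalize_donor_number` is a hypothetical helper.

```python
def normalize_donor_number(donor_number, width=5):
    """Return the donor number as a zero-padded string, e.g. 756 -> '00756'."""
    return str(donor_number).zfill(width)

# A bare literal such as 05686 is a SyntaxError in Python 3,
# so numbers with leading zeros have to be written as strings.
mixed = [18230, 11011, '05686', '00756', 890]
normalized = [normalize_donor_number(n) for n in mixed]
print(normalized)
# → ['18230', '11011', '05686', '00756', '00890']
```

Because the donor URL is built with an f-string, integers and strings are formatted the same way, so either form can be passed to `extract_donor_info`.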

In [3]:

donor_numbers = [18230, 18579, 18784, 17858, 18256, 17801, 18435, 18027, 17711, 18100, 16948, 17951, 17566, 17651, 16973, 16534, 16482, 17962, 17019, 17530, 17866, 17589, 17608, 17564, 17409, 17055, 17067, 17121, 17096, 16805, 16928, 16816, 16924, 16910, 16814, 16643, 16691, 16622, 16768, 16417, 16845, 16630, 16371, 16456, 16451, 16344, 16265, 16247, 16354, 16187, 15928, 15961, 15855, 15867, 15862, 15830, 15784, 15857, 15789, 15782, 15721, 15647, 15557, 15512, 15516, 15472, 15253, 15262, 15063, 15096, 14935, 14724, 14747, 14759, 14765, 14593, 14455, 14578, 14487, 14548, 14185, 14232, 14143, 14109, 14045, 13987, 13909, 13704, 13910, 13820, 13608, 13229, 13240, 13210, 13009, 12337, 12180, 11576, 11560, 11432, 11361, 11346, 11312, 11278, 11245, 11060, 11097, 11048, 11011, '05686', '05626', '02435', '03689', '03637', '05502', '05446', '05272', '00756', '02110', '00644', '03046', '03056', '00890', '00763', '00894', '01136', '00701']
for donor_number in donor_numbers:
    h2_info, p_info = extract_donor_info(donor_number)
    # print("Donor Number " + str(donor_number))
    print(h2_info)
    print(p_info)

Staff Favorite

ABOUT THIS DONOR
Our 90-day Donor Information Subscriptions provide access to all available information on every donor for one low fee. Level 2 and Level 3 items are view only and NOT DOWNLOADABLE. Subscriptions are non-refundable.
6'4'' Soulful Musician
At 6’4, with wavy black hair and hazel eyes, this Donor 18784 is a beautiful combination of his Korean mom and Irish dad. He ran varsity track in high school and now stays fit by rock climbing. Currently a college student, he graduated high school with a staggering 4.16 GPA. In his free time, this passionate donor plays drums in a band and writes music. He finds inspiration from human connection and loves cooking for friends. Deep and thoughtful, he believes in banishing regret and releasing what he can't control.
Poli-Sci Marathoner
Donor 17858 is the one all of his friends go to for advice. He has a calm, caring personality, a warm smile, and thick shiny hair straight out of a shampoo commercial. A former varsity wres

### Writing the scraped Asian donor info to a .txt file

In [5]:
with open("../NLP-Mini-Project/my-data/Asian-Donors.txt", "a") as myfile:
    for donor_number in donor_numbers:
        h2_info, p_info = extract_donor_info(donor_number)
        # myfile.write(f"Donor Number {donor_number}\n")
        myfile.write(h2_info + "\n")
        myfile.write(p_info + "\n")

## Scraping Black or African American Donors:

- Only 8 donor profiles are available. 
- The same process was followed to create the dataset for this demographic.

In [6]:
donor_numbers = [18507,18552, 17841, 16323, 17576, 17064, 16733, '05346']
for donor_number in donor_numbers:
    h2_info, p_info = extract_donor_info(donor_number)
    # print("Donor Number " + str(donor_number))
    print(h2_info)
    print(p_info)

Rising Opera Star
Born with a singing voice that would be the envy of Pavarotti and Bocelli, Donor 18507’s talent is as impressive as his dimples are big, his smile bright, and his almond-shaped eyes beautiful. Having discovered his passion for music as a 14-year-old in his church choir, this aspiring opera singer is now working on a performing arts degree (3.7 GPA). Ambitious, kind-hearted, and supremely self-confident, his many creative outlets range from acting and dancing to cooking and repairing motorcycles.
Caribbean Charmer
With dark, long lashes, and dimples, Donor 18552 gets his broad shoulders from competitive weightlifting when he was younger. He graduated from high school two years early (with an associate degree), then got a bachelor’s in psychology and studied post grad — with a 4.0. He loves the arts (he’s pursuing acting) but is also very scientific. His favorite holiday is Thanksgiving. In his family it’s a big party with a “turkey-off” using the recipes of his Trinida

### Writing the scraped Black or African American donor info to a .txt file

In [7]:
# Print to txt

with open("../NLP-Mini-Project/my-data/Black-Donors.txt", "a") as myfile:
    for donor_number in donor_numbers:
        h2_info, p_info = extract_donor_info(donor_number)
        # myfile.write(f"Donor Number {donor_number}\n")
        myfile.write(h2_info + "\n")
        myfile.write(p_info + "\n")

## Scraping Hispanic or Latino Donors:

- Only 25 donor profiles are available. 
- The same process was followed to create the dataset for this demographic.

In [8]:

donor_numbers = [18230, 18689, 18510, 17858, 18383, 18552, 18107, 17205, 16828, 16432, 17297, 16889, 16928, 16783, 16138, 16478, 16388, 16158, 16104, 15079, 14484, 14377, 12785, 11206, 16359]
for donor_number in donor_numbers:
    h2_info, p_info = extract_donor_info(donor_number)
    # print("Donor Number " + str(donor_number))
    print(h2_info)
    print(p_info)

Staff Favorite

Academic Marathoner
After slaying high school with a 4.2 GPA and honors, Donor 18689 is now studying human biology and society in college. He describes himself as an empathetic and sensitive, but he also has a competitive side and loves debating movies or playing board games with friends — and winning! He stays fit by running and even completed his first marathon, recently. He’s a pasta and sandwiches kind of guy with jet black hair and a cute chin dimple. One day, he hopes to travel to Seattle to see the Space Needle.
ABOUT THIS DONOR
Our 90-day Donor Information Subscriptions provide access to all available information on every donor for one low fee. Level 2 and Level 3 items are view only and NOT DOWNLOADABLE. Subscriptions are non-refundable.
Poli-Sci Marathoner
Donor 17858 is the one all of his friends go to for advice. He has a calm, caring personality, a warm smile, and thick shiny hair straight out of a shampoo commercial. A former varsity wrestler, he now runs 

### Writing the scraped Hispanic or Latino donor info to a .txt file

In [9]:
# Print to txt

with open("../NLP-Mini-Project/my-data/Hipanic-Donors.txt", "a") as myfile:
    for donor_number in donor_numbers:
        h2_info, p_info = extract_donor_info(donor_number)
        # myfile.write(f"Donor Number {donor_number}\n")
        myfile.write(h2_info + "\n")
        myfile.write(p_info + "\n")

## Scraping Caucasian Donors:

- 107 donor profiles are available. 
- The same process was followed to create the dataset for this demographic.

In [16]:

donor_numbers = [18230, 18579, 17958, 18784, 17858, 18436, 17965, 18306, 17917, 18599, 18552, 18107, 17981, 17841, 18163, 16495, 16403, 16701, 16620, 16948, 16386, 18127, 16828, 17651, 16908, 17894, 17052, 18088, 16677, 17576, 17954, 17800, 17859, 17593, 17690, 17728, 17416, 17457, 17564, 17212, 17277, 16889, 17279, 17187, 16903, 16847, 16928, 17089, 16783, 16555, 16824, 16477, 16630, 16138, 16478, 16317, 16337, 16388, 16027, 16049, 16088, 16141, 15882, 15896, 15935, 15808, 15713, 15472, 15402, 15412, 15230, 14455, 14484, 14420, 14259, 14176, 13595, 13608, 13317, 13009, 13025, 12873, 12785, 12824, 12594, 12685, 12632, 11740, 11422, 11283, 11294, 11291, 11061, 11038, 11030, '05766', '05694', '05679', '05642', '05497', '03559', '05441', '03461', '05269', '05268', '03291', '00790'
]
for donor_number in donor_numbers:
    h2_info, p_info = extract_donor_info(donor_number)
    # print("Donor Number " + str(donor_number))
    print(h2_info)
    print(p_info)

Staff Favorite

ABOUT THIS DONOR
Our 90-day Donor Information Subscriptions provide access to all available information on every donor for one low fee. Level 2 and Level 3 items are view only and NOT DOWNLOADABLE. Subscriptions are non-refundable.
Blue-Eyed BioMed Grad

6'4'' Soulful Musician
At 6’4, with wavy black hair and hazel eyes, this Donor 18784 is a beautiful combination of his Korean mom and Irish dad. He ran varsity track in high school and now stays fit by rock climbing. Currently a college student, he graduated high school with a staggering 4.16 GPA. In his free time, this passionate donor plays drums in a band and writes music. He finds inspiration from human connection and loves cooking for friends. Deep and thoughtful, he believes in banishing regret and releasing what he can't control.
Poli-Sci Marathoner
Donor 17858 is the one all of his friends go to for advice. He has a calm, caring personality, a warm smile, and thick shiny hair straight out of a shampoo commercial

### Writing the scraped Caucasian donor info to a .txt file

In [None]:
# Print to txt

with open("../NLP-Mini-Project/my-data/Caucasian-Donors.txt", "a") as myfile:
    for donor_number in donor_numbers:
        h2_info, p_info = extract_donor_info(donor_number)
        # myfile.write(f"Donor Number {donor_number}\n")
        myfile.write(h2_info + "\n")
        myfile.write(p_info + "\n")

# Part 2: Cleaning up the scraped text & generating unique keywords

I first went through the four generated .txt files and removed any irrelevant information by hand.

For example:
- The company uses AI-generated stock photos to make the donor profiles more enticing, so each description has a disclaimer: "The pictures provided are meant to represent the interests and/or activities of the donor. These are not photographs of the donor engaging in the activities."
- On older profiles, there were notices like "Our 90-day Donor Information Subscriptions provide access to all available information on every donor for one low fee. Level 2 and Level 3 items are view only and NOT DOWNLOADABLE. Subscriptions are non-refundable." These were removed as well.
- Every filtered search has a "Staff Favorite" profile pinned at the top. That profile didn't have a description, so it was removed as well.
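
The manual removal described above could also be sketched in code. This is a minimal sketch, not the process used here: it assumes the scraped results are held as (headline, description) pairs, the marker strings are copied from the examples in this list, and `clean_pairs` is a hypothetical helper name.

```python
# Marker text copied from the examples above
PHOTO_DISCLAIMER = (
    "The pictures provided are meant to represent the interests and/or "
    "activities of the donor. These are not photographs of the donor "
    "engaging in the activities."
)
SUBSCRIPTION_NOTICE_START = "Our 90-day Donor Information Subscriptions"
PINNED_HEADLINE = "Staff Favorite"

def clean_pairs(pairs):
    """Drop pinned/notice entries and strip the photo disclaimer from the rest."""
    cleaned = []
    for headline, description in pairs:
        # Remove the disclaimer but keep the rest of the description
        description = description.replace(PHOTO_DISCLAIMER, "").strip()
        if headline.strip() == PINNED_HEADLINE:
            continue
        if not description or description.startswith(SUBSCRIPTION_NOTICE_START):
            continue
        cleaned.append((headline.strip(), description))
    return cleaned

pairs = [
    ("Staff Favorite", ""),
    ("ABOUT THIS DONOR", "Our 90-day Donor Information Subscriptions provide access ..."),
    ("Poli-Sci Marathoner",
     "Donor 17858 is the one all of his friends go to for advice. " + PHOTO_DISCLAIMER),
]
result = clean_pairs(pairs)
print(result)
```

A hand check is still worthwhile afterwards, since any notice worded differently from these markers would slip through.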

For all four datasets, I did the following to clean up the text:

1. Lowercasing all the text

2. Tokenization

3. Removing special characters and punctuation

4. Removing stop words: In addition to removing common words like "the," "is," and "in," I also removed the word "donor," since it appears very frequently in each dataset. I also ended up removing words like "he" and "his," which don't contribute to the results, since I want to see the most frequently used keywords in the descriptions.

5. Lemmatization

6. Numerical data: Many of the donor descriptions include the donor number, so numbers are excluded during this process.
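
The steps above can be sketched with the standard library alone. This is an illustrative sketch, not the code run in this notebook (which uses clean-text, NLTK, and scikit-learn's stop-word list): the small `STOP_WORDS` set is a hypothetical stand-in for a full stop-word list, `top_keywords` is a hypothetical helper, and lemmatization is left out because it needs an external library.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list for illustration
STOP_WORDS = {"the", "is", "in", "a", "an", "and", "of", "to", "with",
              "he", "his", "donor", "this", "at", "now", "by"}

def top_keywords(text, n=5):
    """Lowercase, tokenize, drop punctuation/numbers/stop words, then count."""
    text = text.lower().replace("\u2019", "'")   # normalize curly apostrophes
    text = re.sub(r"'s\b", "", text)             # drop possessive/contracted 's
    tokens = re.findall(r"[a-z]+", text)         # letters only: no digits or punctuation
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(tokens).most_common(n)

sample = ("Donor 18784 plays drums in a band. He’s a runner, and he "
          "plays piano. Donor 17858 runs marathons and plays chess.")
print(top_keywords(sample, n=1))
# → [('plays', 3)]
```

Wrapping the steps in one function means the same cleaning can be applied to each of the four files without copying the cell.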

Here, I used the code from class. I also had ChatGPT help me modify the code and troubleshoot errors that came up while consolidating it.

I was having issues with "he's" showing up in the results, and Aarav suggested installing clean-text for more powerful filtering. It worked great, and it also let me remove the .lower function from the original code.

In [25]:
%pip install clean-text

Collecting clean-text
  Downloading clean_text-0.6.0-py3-none-any.whl (11 kB)
Collecting emoji<2.0.0,>=1.0.0 (from clean-text)
  Downloading emoji-1.7.0.tar.gz (175 kB)
  Preparing metadata (setup.py) ... done
Collecting ftfy<7.0,>=6.0 (from clean-text)
  Obtaining dependency information for ftfy<7.0,>=6.0 from https://files.pythonhosted.org/packages/91/f8/dfa32d06cfcbdb76bc46e0f5d69c537de33f4cedb1a15cd4746ab45a6a26/ftfy-6.1.3-py3-none-any.whl.metadata
  Downloading ftfy-6.1.3-py3-none-any.whl.metadata (6.2 kB)
Collecting wcwidth<0.3.0,>=0.2.12 (from ftfy<7.0,>=6.0->clean-text)
  Obtaining dependency information for wcwidth<0.3.0,>=0.2.12 from https://files.pythonhosted.org/packages/31/b1/a59de0ad3aabb17523a39804f4c6df3ae87ead053a4e25362ae03d73d03a/wcwidth-0.2.12-py2.py3-none-any.whl.metadata
  Downloading wcwidth-0.2.12-py2.py3-no

In [27]:
import numpy as np
from collections import Counter
from sklearn.feature_extraction import _stop_words
import nltk
from nltk.tokenize import word_tokenize
from cleantext import clean

# File path
file_path = "../NLP-Mini-Project/my-data/Asian-Donors.txt"

# Read the file content
with open(file_path, 'r') as file:
    content = file.read()

# Clean the text using clean-text
cleaned_text = clean(content, 
                     fix_unicode=True, 
                     to_ascii=True, 
                     lower=True, 
                     no_line_breaks=True, 
                     no_urls=True, 
                     no_emails=True, 
                     no_phone_numbers=True, 
                     no_numbers=True, 
                     no_digits=True, 
                     no_currency_symbols=True, 
                     no_punct=True, 
                     replace_with_url="", 
                     replace_with_email="", 
                     replace_with_phone_number="", 
                     replace_with_number="", 
                     replace_with_digit="")

# Tokenize the cleaned text using NLTK word_tokenize
tokens = word_tokenize(cleaned_text)

# Define the words to remove
words_to_remove = ["he", "his", "donor", "man", "hes"]

# Remove the words and any empty strings
tokens_without_removed_words = [word for word in tokens if word not in words_to_remove and word]

# Remove stop words
tokens_cleaned = [t for t in tokens_without_removed_words if t not in _stop_words.ENGLISH_STOP_WORDS]

# Get unique words and their count
vocab = np.unique(tokens_cleaned)
print("Tokenized content for Asian Donors:")
print("Unique words:", vocab.shape)
print(Counter(tokens_cleaned).most_common(50))


Tokenized content for Asian Donors:
Unique words: (2132,)
[('enjoys', 45), ('loves', 36), ('time', 35), ('school', 33), ('friends', 31), ('playing', 30), ('basketball', 30), ('eyes', 27), ('world', 27), ('life', 26), ('family', 25), ('high', 23), ('hair', 22), ('brown', 22), ('college', 21), ('smile', 21), ('career', 21), ('great', 21), ('math', 21), ('tall', 20), ('work', 19), ('love', 18), ('new', 18), ('like', 18), ('working', 18), ('earned', 17), ('good', 17), ('black', 16), ('personality', 16), ('major', 16), ('outgoing', 16), ('engineering', 16), ('chinese', 16), ('english', 16), ('proud', 16), ('student', 15), ('business', 15), ('friendly', 15), ('future', 15), ('tennis', 15), ('swimming', 15), ('creative', 15), ('people', 15), ('gpa', 14), ('computer', 14), ('china', 14), ('piano', 14), ('goal', 14), ('fit', 13), ('plays', 13)]


In [29]:
import numpy as np
from collections import Counter
from sklearn.feature_extraction import _stop_words
import nltk
from nltk.tokenize import word_tokenize
from cleantext import clean

# File path
file_path = "../NLP-Mini-Project/my-data/Black-Donors.txt"

# Read the file content
with open(file_path, 'r') as file:
    content = file.read()

# Clean the text using clean-text
cleaned_text = clean(content, 
                     fix_unicode=True, 
                     to_ascii=True, 
                     lower=True, 
                     no_line_breaks=True, 
                     no_urls=True, 
                     no_emails=True, 
                     no_phone_numbers=True, 
                     no_numbers=True, 
                     no_digits=True, 
                     no_currency_symbols=True, 
                     no_punct=True, 
                     replace_with_url="", 
                     replace_with_email="", 
                     replace_with_phone_number="", 
                     replace_with_number="", 
                     replace_with_digit="")

# Tokenize the cleaned text using NLTK word_tokenize
tokens = word_tokenize(cleaned_text)

# Define the words to remove
words_to_remove = ["he", "his", "donor", "man", "hes"]

# Remove the words and any empty strings
tokens_without_removed_words = [word for word in tokens if word not in words_to_remove and word]

# Remove stop words
tokens_cleaned = [t for t in tokens_without_removed_words if t not in _stop_words.ENGLISH_STOP_WORDS]

# Get unique words and their count
vocab = np.unique(tokens_cleaned)
print("Tokenized content for Black or African American Donors:")
print("Unique words:", vocab.shape)
print(Counter(tokens_cleaned).most_common(50))


Tokenized content for Black or African American Donors:
Unique words: (245,)
[('family', 4), ('music', 3), ('working', 3), ('school', 3), ('grad', 3), ('mvp', 3), ('video', 3), ('opera', 2), ('born', 2), ('singing', 2), ('voice', 2), ('dimples', 2), ('big', 2), ('eyes', 2), ('beautiful', 2), ('passion', 2), ('singer', 2), ('arts', 2), ('degree', 2), ('gpa', 2), ('acting', 2), ('caribbean', 2), ('dark', 2), ('gets', 2), ('high', 2), ('psychology', 2), ('standing', 2), ('kind', 2), ('basketball', 2), ('talented', 2), ('allstar', 2), ('game', 2), ('designer', 2), ('community', 2), ('amazing', 2), ('boom', 2), ('playing', 2), ('haitian', 2), ('rising', 1), ('star', 1), ('envy', 1), ('pavarotti', 1), ('bocelli', 1), ('s', 1), ('talent', 1), ('impressive', 1), ('smile', 1), ('bright', 1), ('almondshaped', 1), ('having', 1)]


In [30]:
import numpy as np
from collections import Counter
from sklearn.feature_extraction import _stop_words
import nltk
from nltk.tokenize import word_tokenize
from cleantext import clean

# File path
file_path = "../NLP-Mini-Project/my-data/Hipanic-Donors.txt"

# Read the file content
with open(file_path, 'r') as file:
    content = file.read()

# Clean the text using clean-text
cleaned_text = clean(content, 
                     fix_unicode=True, 
                     to_ascii=True, 
                     lower=True, 
                     no_line_breaks=True, 
                     no_urls=True, 
                     no_emails=True, 
                     no_phone_numbers=True, 
                     no_numbers=True, 
                     no_digits=True, 
                     no_currency_symbols=True, 
                     no_punct=True, 
                     replace_with_url="", 
                     replace_with_email="", 
                     replace_with_phone_number="", 
                     replace_with_number="", 
                     replace_with_digit="")

# Tokenize the cleaned text using NLTK word_tokenize
tokens = word_tokenize(cleaned_text)

# Define the words to remove
words_to_remove = ["he", "his", "donor", "man", "hes"]

# Remove the words and any empty strings
tokens_without_removed_words = [word for word in tokens if word not in words_to_remove and word]

# Remove stop words
tokens_cleaned = [t for t in tokens_without_removed_words if t not in _stop_words.ENGLISH_STOP_WORDS]

# Get unique words and their count
vocab = np.unique(tokens_cleaned)
print("Tokenized content for Hispanic or Latino Donors:")
print("Unique words:", vocab.shape)
print(Counter(tokens_cleaned).most_common(50))


Tokenized content for Hispanic or Latino Donors:
Unique words: (671,)
[('loves', 12), ('enjoys', 11), ('friends', 8), ('family', 7), ('school', 6), ('hair', 6), ('love', 6), ('cooking', 6), ('working', 6), ('like', 6), ('just', 6), ('playing', 5), ('running', 5), ('political', 5), ('basketball', 5), ('science', 5), ('high', 4), ('skills', 4), ('new', 4), ('helping', 4), ('life', 4), ('dad', 4), ('college', 3), ('competitive', 3), ('games', 3), ('guy', 3), ('hopes', 3), ('smile', 3), ('young', 3), ('went', 3), ('works', 3), ('dancing', 3), ('football', 3), ('favorite', 3), ('gets', 3), ('got', 3), ('big', 3), ('future', 3), ('talents', 3), ('outgoing', 3), ('major', 3), ('amazing', 3), ('s', 3), ('great', 3), ('passionate', 3), ('lover', 3), ('work', 3), ('academic', 2), ('marathoner', 2), ('biology', 2)]


In [31]:
import numpy as np
from collections import Counter
from sklearn.feature_extraction import _stop_words
import nltk
from nltk.tokenize import word_tokenize
from cleantext import clean

# File path
file_path = "../NLP-Mini-Project/my-data/Caucasian-Donors.txt"

# Read the file content
with open(file_path, 'r') as file:
    content = file.read()

# Clean the text using clean-text
cleaned_text = clean(content, 
                     fix_unicode=True, 
                     to_ascii=True, 
                     lower=True, 
                     no_line_breaks=True, 
                     no_urls=True, 
                     no_emails=True, 
                     no_phone_numbers=True, 
                     no_numbers=True, 
                     no_digits=True, 
                     no_currency_symbols=True, 
                     no_punct=True, 
                     replace_with_url="", 
                     replace_with_email="", 
                     replace_with_phone_number="", 
                     replace_with_number="", 
                     replace_with_digit="")

# Tokenize the cleaned text using NLTK word_tokenize
tokens = word_tokenize(cleaned_text)

# Define the words to remove
words_to_remove = ["he", "his", "donor", "man", "hes"]

# Remove the words and any empty strings
tokens_without_removed_words = [word for word in tokens if word not in words_to_remove and word]

# Remove stop words
tokens_cleaned = [t for t in tokens_without_removed_words if t not in _stop_words.ENGLISH_STOP_WORDS]

# Get unique words and their count
vocab = np.unique(tokens_cleaned)
print("Tokenized content for Caucasian Donors:")
print("Unique words:", vocab.shape)
print(Counter(tokens_cleaned).most_common(50))


Tokenized content for Caucasian Donors:
Unique words: (2133,)
[('enjoys', 45), ('loves', 36), ('time', 35), ('school', 33), ('friends', 31), ('playing', 30), ('basketball', 30), ('eyes', 27), ('world', 27), ('life', 26), ('family', 25), ('high', 23), ('hair', 22), ('brown', 22), ('college', 21), ('smile', 21), ('career', 21), ('great', 21), ('math', 21), ('tall', 20), ('work', 19), ('love', 18), ('new', 18), ('like', 18), ('working', 18), ('earned', 17), ('good', 17), ('black', 16), ('personality', 16), ('major', 16), ('outgoing', 16), ('engineering', 16), ('chinese', 16), ('english', 16), ('proud', 16), ('student', 15), ('business', 15), ('friendly', 15), ('future', 15), ('tennis', 15), ('swimming', 15), ('creative', 15), ('people', 15), ('gpa', 14), ('computer', 14), ('china', 14), ('piano', 14), ('goal', 14), ('fit', 13), ('plays', 13)]
