## 4_NER_manual

In addition to the spaCy NER (see the 3_ner notebook), this notebook tags job titles (JOB), industries (IND), and technology names (TECH), which the default spaCy pipeline recognizes poorly. The keyword lists were compiled manually from websites and GPT-4, and entities are detected by matching those keywords with regular expressions.

In [1]:
import pandas as pd
import re
from tqdm import tqdm
from collections import Counter

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 500)

import warnings
warnings.simplefilter('ignore')



In [2]:
%%time

df = pd.read_parquet('sentence.parquet', engine='pyarrow')
df.shape

CPU times: user 2.62 s, sys: 2.25 s, total: 4.87 s
Wall time: 5.54 s


(7068447, 3)

In [3]:
df.head()

Unnamed: 0,doc_id,sentence_id,sentence
0,1,1,LegalTech Artificial Intelligence Market 2019
1,1,2,Technology Advancement and Future Scope Casetext Inc.
2,1,3,Catalyst Repository Systems eBREVIA Galus Australis Galus Australis BusinessGeneral NewsHealthcareIndustryInternationalLifestyleSci-Tech Wednesday February 26 2020 Trending Needle Counters Market Comprehensive Study by Companies Medline Industries Boen Healthcare Skin Scrub Trays Market Comprehensive Study by Companies Medline Industries BD Deroyal Global Portable Handheld Electronic Game Machine Market Outlook and Business Insights 2020-2026 Apollo Games Sony Aristocrat Leisure IGT Infectio...
3,1,4,Catalyst Repository Systems eBREVIA
4,1,5,General NewsLegalTech Artificial Intelligence Market 2019 Technology Advancement and Future Scope Casetext Inc.


## Industry

In [5]:
# industry list
industry = pd.read_csv('Industry.csv')

In [6]:
industry

Unnamed: 0,Industry
0,Aerospace
1,Agriculture
2,Airline
3,Apparel
4,Automotive
5,Biotech
6,Biotechnology
7,Chemical
8,Communication Service
9,Construction


## Job Title

In [7]:
# job title list
job = pd.read_csv('JobTitle.csv')

In [8]:
job

Unnamed: 0,job
0,Engineer
1,Account Executive
2,Copywriter
3,Graphic Designer
4,Market Research Analyst
5,Product Manager
6,Agricultural Specialist
7,Biochemist
8,Biomedical Engineer
9,Research Coordinator


## Technology

In [9]:
# technology list
tech = pd.read_csv('Technology.csv')

In [10]:
tech

Unnamed: 0,technology
0,GenAI
1,Gen AI
2,Generative AI
3,GPT
4,GPT3.5
5,GPT-3.5
6,GPT4
7,GPT-4
8,ChatGPT
9,LLM


## Add Entities

In [11]:
%%time

# Add manual entities in the same format as NER

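# Return a [keyword, entity_type] pair for every keyword that occurs in the sentence as a whole word.
# Matching is case-insensitive: both the sentence and the keyword are lowercased before the search.
# The entity type is passed as an argument so the same function serves every category.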
def find_keywords(sentence, keywords, entity_type):
    sentence_lower = sentence.lower()
    return [[keyword, entity_type] for keyword in keywords if re.search(r'\b{}\b'.format(re.escape(keyword.lower())), sentence_lower)]

# Create lists of keywords for industries, jobs, and technologies.
industry_keywords = industry['Industry'].tolist()
job_keywords = job['job'].tolist()
tech_keywords = tech['technology'].tolist()

# Search for matching keywords in all sentences for each category.
entities_ind = [find_keywords(sentence, industry_keywords, 'IND') for sentence in df['sentence']]
entities_job = [find_keywords(sentence, job_keywords, 'JOB') for sentence in df['sentence']]
entities_tech = [find_keywords(sentence, tech_keywords, 'TECH') for sentence in df['sentence']]

# Add new columns to the DataFrame for each entity type.
df['entities_ind'] = entities_ind
df['entities_job'] = entities_job
df['entities_tech'] = entities_tech

CPU times: user 1h 50min 58s, sys: 21.4 s, total: 1h 51min 19s
Wall time: 1h 51min 33s
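
The cell above took close to two hours of wall time, and `find_keywords` rebuilds the pattern string for every keyword on every sentence. The following is a minimal sketch of one possible alternative, not the method used in this notebook: each keyword is compiled once and the compiled patterns are reused. The names `compile_keywords` and `find_keywords_compiled` are hypothetical, and `re.IGNORECASE` is assumed to behave like the lowercasing above for typical English text. No speedup was measured here.

```python
import re

def compile_keywords(keywords):
    """Compile one case-insensitive, whole-word pattern per keyword (done once)."""
    return [
        (keyword, re.compile(r'\b{}\b'.format(re.escape(keyword)), re.IGNORECASE))
        for keyword in keywords
    ]

def find_keywords_compiled(sentence, compiled, entity_type):
    """Return [keyword, entity_type] pairs, in the same format as find_keywords."""
    return [[keyword, entity_type] for keyword, pattern in compiled if pattern.search(sentence)]

# Toy keyword list for illustration only; the notebook reads its lists from CSV files.
job_patterns = compile_keywords(['Engineer', 'Software Engineer', 'Writer'])
result = find_keywords_compiled(
    'Seattle-based software engineer Peter Whidden', job_patterns, 'JOB'
)
print(result)
```

Because every keyword is still tested separately, overlapping keywords such as "Engineer" and "Software Engineer" are both reported, as in the original function.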


## Result

In [12]:
# Show the first 5 rows that contain at least one entity of every type (IND, JOB, and TECH)
df[(df['entities_ind'].apply(lambda x: len(x) > 0)) & (df['entities_job'].apply(lambda x: len(x) > 0)) & (df['entities_tech'].apply(lambda x: len(x) > 0))].head()

Unnamed: 0,doc_id,sentence_id,sentence,entities_ind,entities_job,entities_tech
4527,94,1,Microsoft's new AI key is first big change to keyboards in decades - KESQ circle-arrow Play Button Stop Button chevron-right chevron-left chevron-up search warning chevron-left-skinny chevron-right-skinny x clock calendar play-button cancel-circle user twitter facebook youtube instagram email linkedin Home News California Crime Colorado River Crisis Coachella Valley Questions Answered Education Fentanyl Crisis Palm Springs International Film Festival I-Team Investigations Neighborhood Heroes...,"[[Education, IND], [Sports, IND], [School, IND]]","[[Athlete, JOB]]","[[PaLM, TECH]]"
4698,98,2,Viaero Wireless Network Cameras Weather Video Ski Report Weather Photo Galleries Sports Friday Night Blitz Videos and Livestream Live Newscasts Livestream Special Coverage Videos Radio KRDO NewsRadio Traffic Weather Maps and Forecasts Listen Live Radio Program Guide Podcasts Radio Contests Pet of the Week Traffic Gas Prices Health Healthy Colorado Healthy Kids Healthy Seniors Healthy Women Healthy Men Centura Health Telemundo Telemundo Programacion Colorado Living Victory For Veterans Events...,"[[Entertainment, IND], [Sports, IND]]","[[Writer, JOB]]","[[ChatGPT, TECH]]"
5636,116,6,Artisan 9 Automotive 198 Best Gear 6 Bicycles 20 Books 3 BrandingIdentity 254 Camping 32 Climate Change 807 Clothing 70 Colors 918 Craft 305 Culture 57 Design 2214 Documentary Film 809 Dogs 50 Drink 99 Eco-Friendly 298 Europe 18 EV 60 Family 30 Fashion 154 Flowers 46 Food 310 Footwear 45 Furniture 231 Future 995 Get Smarter 891 Gifts 2 Gluten-Free 9 Graphic Design 63 History 271,"[[Automotive, IND], [Fashion, IND], [Graphic Design, IND]]","[[Artisan, JOB]]","[[EV, TECH]]"
5644,116,14,Artisan 9 Automotive 198 Best Gear 6 Bicycles 20 Books 3 BrandingIdentity 254 Camping 32 Climate Change 807 Clothing 70 Colors 918 Craft 305 Culture 57 Design 2214 Documentary Film 809 Dogs 50 Drink 99 Eco-Friendly 298 Europe 18 EV 60 Family 30 Fashion 154 Flowers 46 Food 310 Footwear 45 Furniture 231 Future 995 Get Smarter 891 Gifts 2 Gluten-Free 9 Graphic Design 63 History 271,"[[Automotive, IND], [Fashion, IND], [Graphic Design, IND]]","[[Artisan, JOB]]","[[EV, TECH]]"
6108,125,22,For the last few years Seattle-based software engineer Peter Whidden has been training a reinforcement learning algorithm to navigate the classic first game of the Pokmon series -- in that time the AI has played more than 50000 hours of the game.11h agoTechCrunchTrans healthcare startup Plume lays off dozens of workersPlume a startup founded to offer essential online healthcare services to trans people across the U.S. laid off more than two dozen workers in October several sources close to t...,"[[Game, IND], [Healthcare, IND], [Software, IND]]","[[Engineer, JOB], [Software Engineer, JOB]]","[[Reinforcement Learning, TECH]]"
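
Some matches in the sample above look like side effects of case-insensitive matching: the first row is tagged with the TECH entity "PaLM", and the visible part of its sentence contains "Palm Springs". The sketch below shows one possible way to reduce such false positives by matching selected keywords case-sensitively. It is an illustration under assumptions, not part of the original pipeline: the function `find_keywords_cased` and its `case_sensitive` parameter are hypothetical.

```python
import re

def find_keywords_cased(sentence, keywords, entity_type, case_sensitive=frozenset()):
    """Whole-word keyword matching; keywords in `case_sensitive` must match exactly."""
    found = []
    for keyword in keywords:
        # Case-sensitive keywords use no flags; all others ignore case.
        flags = 0 if keyword in case_sensitive else re.IGNORECASE
        if re.search(r'\b{}\b'.format(re.escape(keyword)), sentence, flags):
            found.append([keyword, entity_type])
    return found

sentence = 'Palm Springs International Film Festival'
tech_keywords = ['PaLM', 'ChatGPT']

# Case-insensitive for every keyword: "Palm" is reported as the PaLM model.
loose = find_keywords_cased(sentence, tech_keywords, 'TECH')

# "PaLM" must match exactly, so "Palm Springs" is no longer tagged.
strict = find_keywords_cased(sentence, tech_keywords, 'TECH', case_sensitive={'PaLM'})

print(loose)
print(strict)
```

Which keywords to treat as case-sensitive is a judgment call; short acronyms such as "EV" are likely candidates.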


In [13]:
# Merge the three entity columns into a single 'entities_manual' column
df['entities_manual'] = df.apply(lambda row: row['entities_ind'] + row['entities_job'] + row['entities_tech'], axis=1)
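
`df.apply(..., axis=1)` calls the lambda once per row, which can be slow on roughly 7 million rows. As a hedged alternative, the same concatenation can be written as a list comprehension over the three columns. The sketch below uses plain lists with toy values so that it runs on its own; in the notebook the inputs would be `df['entities_ind']`, `df['entities_job']`, and `df['entities_tech']`.

```python
# Toy stand-ins for the three entity columns (one inner list per sentence).
entities_ind = [[['Software', 'IND']], [], [['Healthcare', 'IND']]]
entities_job = [[['Engineer', 'JOB']], [], []]
entities_tech = [[], [['ChatGPT', 'TECH']], []]

# Concatenate the per-sentence lists position by position.
entities_manual = [
    ind + job + tech
    for ind, job, tech in zip(entities_ind, entities_job, entities_tech)
]
print(entities_manual)

# In the notebook this would be assigned back to the DataFrame, e.g.:
# df['entities_manual'] = [
#     ind + job + tech
#     for ind, job, tech in zip(df['entities_ind'], df['entities_job'], df['entities_tech'])
# ]
```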

In [14]:
# Count entities

# Count entities by type and return a DataFrame showing the top N entities for each specified type
def count_top_entities(entities_series, entity_types, top_n):
    counters = {entity_type: Counter() for entity_type in entity_types}
    
    # Iterate over the series to count entities by type
    for entities in entities_series:
        for entity_text, entity_type in entities:
            if entity_type in entity_types:
                counters[entity_type][entity_text] += 1
    
    # Prepare the DataFrame to display the top N entities for each type, filling in missing values
    top_entities_df = pd.DataFrame()
    for entity_type in entity_types:
        top_entities = counters[entity_type].most_common(top_n)
        # Pad the list to top_n entries with ("", 0) placeholders so every column has the same length
        top_entities += [("", 0)] * (top_n - len(top_entities))
        top_entities_df[entity_type] = [f"{entity[0]} ({entity[1]})" if entity[0] else "" for entity in top_entities]
    
    return top_entities_df
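
The counting logic inside `count_top_entities` can be checked on a small input before running it on the full column. The sketch below reproduces only the counting and label-formatting steps with the standard library; the toy data is invented for illustration and does not come from the dataset.

```python
from collections import Counter

# Toy series of per-sentence entity lists in the [text, type] format used above.
toy_entities = [
    [['Software', 'IND'], ['Engineer', 'JOB']],
    [['Software', 'IND'], ['ChatGPT', 'TECH']],
    [],
    [['Healthcare', 'IND'], ['Engineer', 'JOB'], ['ChatGPT', 'TECH']],
]
entity_types = ['IND', 'JOB', 'TECH']

# One Counter per entity type, as in count_top_entities.
counters = {entity_type: Counter() for entity_type in entity_types}
for entities in toy_entities:
    for entity_text, entity_type in entities:
        if entity_type in counters:
            counters[entity_type][entity_text] += 1

# Top 2 industries, formatted like the cells of the result table: "text (count)".
top_ind = counters['IND'].most_common(2)
labels = [f"{text} ({count})" for text, count in top_ind]
print(labels)
```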

In [15]:
# Count entities by type
entity_types = ['IND', 'JOB', 'TECH']
top_n = 50

top_entities_df = count_top_entities(df['entities_manual'], entity_types, top_n)
top_entities_df

Unnamed: 0,IND,JOB,TECH
0,Software (141933),Editor (19206),Generative AI (136466)
1,Financial (100247),Analyst (18475),Cloud (131637)
2,Sports (97024),Professor (18091),ChatGPT (104579)
3,Healthcare (82139),Artist (10537),Machine Learning (71834)
4,Entertainment (81627),Writer (10364),OpenAI (59404)
5,Education (72731),Scientist (10298),Chatbot (44835)
6,Government (71007),Athlete (7505),Cybersecurity (28731)
7,Energy (70648),Engineer (7177),GPT (27349)
8,School (62584),Teacher (5819),Blockchain (23070)
9,Consumer (54250),Designer (4082),Bard (22362)


In [16]:
# Save

df_drop = df[['doc_id','sentence_id', 'entities_manual']]

# Specify the file path where the Parquet file will be saved
file_path = 'entities_manual.parquet'
# Save the DataFrame as a Parquet file
df_drop.to_parquet(file_path)

Note: The final analysis excludes topics labeled 'other', so the entity rankings there differ from the ones shown here. See the 6_summary file for those rankings and for a detailed analysis of each entity.