# Description

Our approach uses a Hugging Face pipeline for zero-shot classification with the 'valhalla/distilbart-mnli-12-3' model. Because zero-shot classification needs no training step, we do not split the dataset into training and testing sets; instead, all descriptions from the dataset (column '**busdesc**') are used as input. We also rename the sectors and labels so that the label names do not confuse the model.
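As a rough illustration of how an NLI-based zero-shot classifier scores candidate labels: each label is turned into a hypothesis sentence, the model produces an entailment logit for each hypothesis, and in single-label mode those logits are normalised with a softmax across labels. The sketch below is pure Python; the helper names and the logit values are made up for illustration, and only the hypothesis template is the pipeline's documented default.

```python
import math

# Default hypothesis template of the transformers zero-shot pipeline.
HYPOTHESIS_TEMPLATE = "This example is {}."

def build_hypotheses(candidate_labels):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [HYPOTHESIS_TEMPLATE.format(label) for label in candidate_labels]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(candidate_labels, entailment_logits):
    """Return (label, score) pairs sorted by descending score."""
    scores = softmax(entailment_logits)
    pairs = zip(candidate_labels, scores)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Hypothetical entailment logits; real values come from the model.
example_labels = ["Insurtech", "Fintech", "Pet tech"]
example_logits = [2.0, 1.0, -1.0]
ranking = rank_labels(example_labels, example_logits)
```

Because the scores are normalised across all candidate labels, a run with 42 labels has a uniform baseline of roughly 1/42 ≈ 0.024 per label, so even the top score can be numerically small.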

From our previous work, we obtained 42 different verticals. To evaluate the model, we manually labelled 241 companies for these verticals, i.e. at least 5 companies for each vertical, and then applied zero-shot classification to this labelled dataset. However, no labelled companies were found for the label '**Ephemeral content**', so it is kept as a candidate label without supporting evaluation data.
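One way to check the per-vertical coverage described above is to count the labelled companies for every candidate label and list those that fall short. This is only a sketch on toy data; the function names `support_per_label` and `labels_below` are hypothetical and not part of the notebook.

```python
from collections import Counter

def support_per_label(true_labels, all_labels):
    """Count labelled examples for every candidate label (0 if none)."""
    counts = Counter(true_labels)
    return {label: counts.get(label, 0) for label in all_labels}

def labels_below(support, minimum):
    """Return the labels that have fewer than `minimum` labelled examples."""
    return [label for label, n in support.items() if n < minimum]

# Toy data for illustration only.
toy_labels = ["Fintech", "Edtech", "Ephemeral content"]
toy_true = ["Fintech"] * 5 + ["Edtech"] * 6
toy_support = support_per_label(toy_true, toy_labels)
missing = labels_below(toy_support, minimum=1)
print(toy_support)
print(missing)
```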

As a result, with weighted averaging, the model reaches a precision of 0.7, and a recall and F1 score of 0.6 each.
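For reference, a weighted average takes the per-label precision, recall and F1 and weights each by the number of true examples of that label (its support). The following is a minimal pure-Python sketch on toy data; the helper names are hypothetical, and undefined ratios are set to 0, which is an assumption chosen here for simplicity.

```python
def per_label_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each label."""
    stats = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        stats[label] = (precision, recall, f1, tp + fn)
    return stats

def weighted_average(stats):
    """Average precision, recall and F1, weighting each label by its support."""
    total = sum(s[3] for s in stats.values())
    return tuple(
        sum(s[i] * s[3] for s in stats.values()) / total for i in range(3)
    )

# Toy example: three true "A" items and one true "B" item.
toy_true = ["A", "A", "A", "B"]
toy_pred = ["A", "A", "B", "B"]
w_precision, w_recall, w_f1 = weighted_average(per_label_prf(toy_true, toy_pred))
```

In this toy case the weighted recall equals the plain accuracy (3 of 4 correct), which always holds for single-label classification.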

# Preprocessing the dataset

In [None]:
import pandas as pd
import numpy as np

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
data = pd.read_excel("/content/drive/MyDrive/related_company_after_trans.xlsx")
data

Unnamed: 0,CompanyID,Company Name,Description,Industries,Tagline,Specialties,busdesc,gind
0,3,90,90* is a social consultancy + research initia...,Research,,"consultancy, research, Turkey, Data, Knowledge...",90 90 is a social consultancy research initi...,
1,8,1871,The story of the Great Chicago Fire of 1871 is...,Internet,World #1 Private Business Incubator. \n\nSuppo...,,1871 the story of the great chicago fire of 18...,
2,12,Nova,Nova is the global top-talent community that c...,Internet,The Global Top Talent Network | Connecting the...,"Employer Branding, Recruitment, Headhunting, T...",nova nova is the global toptalent community th...,
3,19,.org/advisors,Planning. Systems. Culture. Only when all thre...,Management Consulting,,"nonprofit, management consulting, strategic pl...",orgadvisors planning systems culture only when...,
4,36,(Re)pensar Direitos Humanos [(Re)think Human R...,(Re)pensar Direitos Humanos é um projeto que c...,Non-profit Organization Management,Um convite para perceber e (re)pensar os Direi...,,rethinking human rights rethinking human right...,
...,...,...,...,...,...,...,...,...
21650,167369,VitaMania.su,VitaMania has attracted more than 500k $ in ge...,Retail,"Online selection of vitamin recipe and shop, a...",,vitamaniasu vitamania has attracted more than ...,
21651,167374,Instrat Foundation,Warsaw based think tank\nEnergy & climate | Ju...,Think Tanks,Energy & climate | Just transition | Sustainab...,"public policy, economics, energy, inequalities...",instrat foundation warsaw based think tankener...,
21652,167375,DelphCap,DelphCap is a London-based ESG and Impact advi...,International Trade and Development,"Your partners in client-focused, integrated ES...","Investment Sourcing, Africa Focused, ESG and I...",delphcap delphcap is a londonbased esg and imp...,
21653,167386,Ikigai Growth,En Ikigai Growth somos expertos en ayudar a ne...,Management Consulting,Fórmulas de crecimiento para negocios en Inter...,,ikigai growth at ikigai growth we are experts ...,


In [None]:
data.dropna(subset=['gind'], how='any', inplace=True)

# Installing valhalla/distilbart-mnli-12-3

In [None]:
!pip install transformers



In [None]:
from transformers import pipeline
classifier = pipeline("zero-shot-classification", model='valhalla/distilbart-mnli-12-3', device=0, batch_size=8)


# Changing the names of the sectors and labels

The following labels and sectors come from the column 'class_name' of the table 'merged_df', which was produced by the clustering and grouping in our previous study.

In [None]:
labels = [
    "Beauty, Femtech",
    "Autonomous cars",
    "Cybersecurity",
    "Gaming, eSports",
    "Supply chain technology",
    "Digital health, Lifestyles of Health and Sustainability (LOHAS) and wellness, Wearables and quantified self",
    "Mortgage tech, Real estate tech",
    "Legal tech",
    "Mobile",
    "Robotics and drones",
    "Cleantech",
    "B2B payments, Mobile commerce",
    "Oncology",
    "Pet tech",
    "3D printing, Advanced manufacturing, Construction technology, Industrials, Infrastructure, Manufacturing",
    "Oil and gas",
    "Cannabis",
    "Adtech, Marketing tech",
    "Ecommerce",
    "HRtech",
    "Audiotech",
    "Life sciences",
    "Cloudtech and DevOps, Software as a service (SaaS)",
    "Ephemeral content",
    "Carsharing, Micro-mobility, Mobility tech, Ridesharing",
    "Impact investing",
    "Space tech",
    "Agtech",
    "Insurtech",
    "Big Data",
    "Internet of Things (IoT)",
    "Restaurant tech",
    "Climate tech",
    "Foodtech",
    "Healthtech",
    "Fintech",
    "Nanotechnology",
    "Media and telecommunications (TMT), Technology",
    "Edtech",
    "Artificial intelligence and machine learning (AI/ML)",
    "Augmented reality (AR), Virtual reality (VR)",
    "Cryptocurrency and blockchain"
]


In [None]:
sector = {
    1: "Beauty, Femtech",
    2: "Autonomous cars",
    3: "Cybersecurity",
    4: "Gaming, eSports",
    5: "Supply chain technology",
    6: "Digital health, Lifestyles of Health and Sustainability (LOHAS) and wellness, Wearables and quantified self",
    7: "Mortgage tech, Real estate tech",
    8: "Legal tech",
    9: "Mobile",
    10: "Robotics and drones",
    11: "Cleantech",
    12: "B2B payments, Mobile commerce",
    13: "Oncology",
    14: "Pet tech",
    15: "3D printing, Advanced manufacturing, Construction technology, Industrials, Infrastructure, Manufacturing",
    16: "Oil and gas",
    17: "Cannabis",
    18: "Adtech, Marketing tech",
    19: "Ecommerce",
    20: "HRtech",
    21: "Audiotech",
    22: "Life sciences",
    23: "Cloudtech and DevOps, Software as a service (SaaS)",
    25: "Carsharing, Micro-mobility, Mobility tech, Ridesharing",
    26: "Impact investing",
    27: "Space tech",
    28: "Agtech",
    29: "Insurtech",
    30: "Big Data",
    31: "Internet of Things (IoT)",
    32: "Restaurant tech",
    33: "Climate tech",
    34: "Foodtech",
    35: "Healthtech",
    36: "Fintech",
    37: "Nanotechnology",
    38: "Media and telecommunications (TMT), Technology",
    39: "Edtech",
    40: "Artificial intelligence and machine learning (AI/ML)",
    41: "Augmented reality (AR), Virtual reality (VR)",
    42: "Cryptocurrency and blockchain"
}
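The ID-to-name mapping can also be derived from the ordered list of label names, which keeps the two structures from drifting apart. The sketch below assumes that IDs are simply 1-based positions in that list and that names without labelled companies are left out; `build_sector_map` is a hypothetical helper, shown on a short toy list.

```python
def build_sector_map(label_names, unlabelled=()):
    """Map 1-based positions to label names, skipping names in `unlabelled`."""
    return {
        position: name
        for position, name in enumerate(label_names, start=1)
        if name not in unlabelled
    }

# Toy list for illustration; the skipped name keeps its position number,
# so the later IDs are not shifted.
toy_names = ["Cloudtech", "Ephemeral content", "Carsharing"]
toy_sector = build_sector_map(toy_names, unlabelled={"Ephemeral content"})
print(toy_sector)
```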


In [None]:
data

Unnamed: 0,CompanyID,Company Name,Description,Industries,Tagline,Specialties,busdesc,gind
31,211,22 Eleven,22 Eleven is an eCommerce retailer building a ...,Retail,Building global eCommerce brands,,22 eleven 22 eleven is an ecommerce retailer b...,19.0
75,418,99bros | Simply Insured,99bros is a digital insurance brokerage platfo...,Insurance,Who says Insurance must be complicated?,"Insurtech, Insurance, Pension, Motor, Life, No...",99bros simply insured 99bros is a digital ins...,29.0
109,667,Abloh,A web and mobile-based platform for students a...,"Technology, Information and Internet",The Next Wave in Higher Education,,abloh a web and mobilebased platform for stude...,39.0
115,711,abtira | garden,"The idea behind our line is natural, as much a...",Cosmetics,"We deliver natural, minimalist products for yo...","skincare, bodycare, organic , and natural",abtira garden the idea behind our line is nat...,1.0
155,994,Acumen,Acumen is changing the way the world tackles p...,Venture Capital & Private Equity,Changing the way the world tackles poverty by ...,"non-profit, poverty, social enterprise, impact...",acumen acumen is changing the way the world ta...,26.0
...,...,...,...,...,...,...,...,...
21259,164493,SkyGrids,SkyGrids is revolutionising the drone industry...,Defense & Space,Limitless drones. Boundless flight.,,skygrids skygrids is revolutionising the drone...,10.0
21428,165520,Panel iQ Technologies Pvt. Ltd.,PanelIQ Technologies is staffing agency and an...,Staffing and Recruiting,A Tech Talent Company that is Building an Eco-...,,panel iq technologies pvt ltd paneliq technolo...,20.0
21528,166317,The Climate Consultancy,The Climate Consultancy is an impact strategy ...,Information Technology and Services,Collective action for the climate.,"Content marketing, Social Media, Thought Leade...",the climate consultancy the climate consultanc...,33.0
21617,167103,HYRD,HYRD is a revolutionary AI-driven hiring tool ...,Human Resources,HYRD is an AI-driven platform that matches can...,,hyrd hyrd is a revolutionary aidriven hiring t...,20.0


# Predictions

In [None]:

preds = classifier(list(data["busdesc"]), labels)

In [None]:
# Keep only the top-ranked label for each description
result = [item["labels"][0] for item in preds]

In [None]:
# Map each numeric sector ID in 'gind' (stored as a float, e.g. 19.0) to its label name
true = [sector[int(item)] for item in data["gind"]]

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score

print(confusion_matrix(true, result))
print(classification_report(true, result))
print(f"F1 score (micro average) is: {f1_score(true, result, average='micro')}")

[[1 0 0 ... 0 0 1]
 [0 5 0 ... 0 0 0]
 [0 0 4 ... 0 0 0]
 ...
 [0 0 1 ... 6 0 0]
 [0 0 0 ... 0 3 0]
 [0 0 0 ... 0 0 4]]
                                                                                                             precision    recall  f1-score   support

   3D printing, Advanced manufacturing, Construction technology, Industrials, Infrastructure, Manufacturing       1.00      0.14      0.25         7
                                                                                     Adtech, Marketing tech       0.83      1.00      0.91         5
                                                                                                     Agtech       0.67      0.67      0.67         6
                                                       Artificial intelligence and machine learning (AI/ML)       0.40      0.67      0.50         6
                                                                                                  Audiotech       0.00      0.00      

  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
# Calculating the total data size from the confusion matrix
conf_matrix = confusion_matrix(true, result)
data_size = conf_matrix.sum()
data_size


241
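The sum of all confusion-matrix cells equals the number of evaluated samples because every (true, predicted) pair lands in exactly one cell. A small pure-Python sketch of that bookkeeping follows; the `confusion` helper is hypothetical and uses rows for true labels and columns for predicted labels. It also shows that, for single-label classification, the micro-averaged F1 is the diagonal sum divided by the total.

```python
def confusion(y_true, y_pred):
    """Build a confusion matrix with rows = true labels, columns = predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return labels, matrix

# Toy data for illustration only.
toy_true = ["A", "A", "A", "B"]
toy_pred = ["A", "A", "B", "B"]
toy_names, toy_matrix = confusion(toy_true, toy_pred)

total = sum(sum(row) for row in toy_matrix)                        # number of samples
correct = sum(toy_matrix[i][i] for i in range(len(toy_matrix)))    # diagonal entries
micro_f1 = correct / total  # equals accuracy in the single-label case
```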

# Add Unknown (In Further Research)

In [None]:
data1 = pd.read_excel("/content/drive/MyDrive/498 capstone/merged_df.xlsx")
data1.head()

Unnamed: 0.1,Unnamed: 0,New_class_byhand,class_name,keywords,top_keywords,generated ketwords,combine_keywords
0,0,1,"Beauty,Femtech",companies in the beauty vertical ...,"bespoke, clean, digital, female, femtech, fer...","OBGYN tech, acne treatment, anti-aging, beaut...","OBGYN tech,acne treatment,anti-aging,beauty,be..."
1,1,2,Autonomous cars,autonomous car technologies are ha...,"autonomous, cars, driver, lidar, mobility, te...","applications of driverless technology, autono...","applications of driverless technology,autonomo..."
2,2,3,Cybersecurity,cybersecurity companies provide infor...,"cloud based, cybersecurity, data, detect, mal...","artificial intelligence in security, blockcha...","artificial intelligence in security,blockchain..."
3,3,4,"Gaming,eSports",esports refers to competitive onli...,"broadcast, consoles, electronic, gambling, ga...","AR gaming experiences, adaptive gaming techno...","AR gaming experiences,Next-gen game consoles,P..."
4,4,5,Supply chain technology,supply chain technology companies ...,"chains, delivery, demand, freight, last mile ...","AI for supplier selection, IoT in logistics, ...","AI for supplier selection,IoT in logistics,Ven..."


In [None]:
name = labels

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization", model="Falconsai/text_summarization")


In [None]:
summaries = []
for keywords in data1['combine_keywords']:
    summary = summarizer(keywords, max_length=30, min_length=1, do_sample=False)[0]['summary_text']
    summaries.append(summary)

data1['label'] = summaries

label_dic = dict(zip(summaries, name))

Token indices sequence length is longer than the specified maximum sequence length for this model (719 > 512). Running this sequence through the model will result in indexing errors


In [None]:
label_dic

{'OBGYN tech,acne treatment,anti-aging,beauty blog,beeauty tips,bespoke': 'Beauty, Femtech',
 'applications of driverless technology,autonomous driving platforms . autonomous vehicles industry insights . autonomous vehicle technology .': 'Autonomous cars',
 'artificial intelligence in security,blockchain security,cloud based,cloud security,compliance frameworks,cyber forensics,': 'Cybersecurity',
 'AR gaming experiences,Next-gen game consoles,Player performance analytics,adaptive gaming technology,anti-cheating technology,blockchain': 'Gaming, eSports',
 'AI for supplier selection,IoT in logistics,Vendor managed inventory,blockchain verified origin,carbon-neutral shipping options,': 'Supply chain technology',
 'AI medical imaging,Advanced heart rate monitors,Genomic data analysis,Sustainable travel technologies,UV exposure track': 'Digital health, Lifestyles of Health and Sustainability (LOHAS) and wellness, Wearables and quantified self',
 'AI-driven property recommendations,Automated

In [None]:
# reset_index returns a new frame; combining the assignment with inplace=True would set data to None
data = data.reset_index(drop=True)
results = data.copy()
ranks = ['1st', '2nd', '3rd', '4th', '5th']
for rank in ranks:
  results[f'{rank}_pred'] = ""
  results[f'{rank}_prob'] = ""

# The candidate labels must be the keys of label_dic (the summarised keywords),
# otherwise the lookup below raises a KeyError
candidate_labels = list(label_dic)

for i in range(len(data)):
  pred = classifier(data["busdesc"].iloc[i], candidate_labels)
  # Store the five highest-scoring verticals and their scores
  for k, rank in enumerate(ranks):
    results.loc[i, f'{rank}_pred'] = label_dic[pred['labels'][k]]
    results.loc[i, f'{rank}_prob'] = "{:.3f}".format(pred['scores'][k])



In [None]:
results

Unnamed: 0,CompanyID,Company Name,busdesc,gind,1st_pred,1st_prob,2nd_pred,2nd_prob,3rd_pred,3rd_prob,4th_pred,4th_prob,5th_pred,5th_prob
0,211,22 Eleven,22 eleven 22 eleven is an ecommerce retailer b...,19,Ecommerce,0.078,"Carsharing, Micro-mobility, Mobility tech, Rid...",0.057,Pet tech,0.057,Ephemeral content,0.055,Agtech,0.051
1,418,99bros | Simply Insured,99bros simply insured 99bros is a digital ins...,29,Artificial intelligence and machine learning (...,0.191,Insurtech,0.065,"Augmented reality (AR), Virtual reality (VR)",0.056,Agtech,0.045,"Mortgage tech, Real estate tech",0.038
2,667,Abloh,abloh a web and mobilebased platform for stude...,39,Robotics and drones,0.063,"Media and telecommunications (TMT), Technology",0.055,Artificial intelligence and machine learning (...,0.054,Pet tech,0.052,"Augmented reality (AR), Virtual reality (VR)",0.050
3,711,abtira | garden,abtira garden the idea behind our line is nat...,1,Mobile,0.116,Robotics and drones,0.049,Audiotech,0.049,Legal tech,0.043,Autonomous cars,0.043
4,994,Acumen,acumen acumen is changing the way the world ta...,26,Impact investing,0.079,Mobile,0.068,Pet tech,0.048,Supply chain technology,0.047,"Augmented reality (AR), Virtual reality (VR)",0.044
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
236,164493,SkyGrids,skygrids skygrids is revolutionising the drone...,10,Autonomous cars,0.085,Artificial intelligence and machine learning (...,0.081,"3D printing, Advanced manufacturing, Construct...",0.050,Pet tech,0.048,Robotics and drones,0.044
237,165520,Panel iQ Technologies Pvt. Ltd.,panel iq technologies pvt ltd paneliq technolo...,20,Mobile,0.065,Cryptocurrency and blockchain,0.043,"Augmented reality (AR), Virtual reality (VR)",0.042,Pet tech,0.041,"Cloudtech and DevOps, Software as a service (S...",0.036
238,166317,The Climate Consultancy,the climate consultancy the climate consultanc...,33,Ephemeral content,0.092,"Augmented reality (AR), Virtual reality (VR)",0.060,Pet tech,0.058,Agtech,0.057,Climate tech,0.054
239,167103,HYRD,hyrd hyrd is a revolutionary aidriven hiring t...,20,Climate tech,0.094,Cryptocurrency and blockchain,0.079,Mobile,0.060,Impact investing,0.051,"Augmented reality (AR), Virtual reality (VR)",0.041
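The section title mentions adding an "Unknown" class, but the notebook does not show how it would be assigned. One possible approach, sketched here purely as an assumption, is to fall back to "Unknown" whenever the top score is below a cut-off. The threshold value `0.07` and the helper `with_unknown` are hypothetical; the three example rows reuse top predictions and scores from the table above.

```python
UNKNOWN = "Unknown"

def with_unknown(top_label, top_score, threshold):
    """Keep the top label only if its score reaches the threshold."""
    return top_label if float(top_score) >= threshold else UNKNOWN

# (1st_pred, 1st_prob) pairs; scores are stored as strings in the results table.
example_rows = [
    ("Ecommerce", "0.078"),
    ("Artificial intelligence and machine learning (AI/ML)", "0.191"),
    ("Mobile", "0.065"),
]

# Hypothetical cut-off; a real value would need to be tuned on labelled data.
THRESHOLD = 0.07
final_labels = [with_unknown(label, prob, THRESHOLD) for label, prob in example_rows]
print(final_labels)
```

A threshold like this trades coverage for precision, so it would need to be chosen against the manually labelled companies rather than fixed in advance.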


In [None]:
results.to_csv('/content/drive/MyDrive/498 capstone/ZeroShot_labelled_Result_Verticals_all.csv', index=False)
