# Zero-Shot Text Classification

This notebook applies zero-shot classification to Wikipedia summary texts to assign each economist to a standard subfield (e.g., macroeconomics, financial economics, public policy) without a pre-labeled training dataset.

**Table of Contents**

1. [Set-up & Load Data](#sec1)
2. [Load Zero Shot Classifier](#sec2)
3. [Classify Summaries](#sec3)
4. [Save Results](#sec4)
5. [Accuracy Check](#sec5)
6. [Exporting Results](#sec6)

<a id="sec1"></a>
### Set-up & Load Data

Classification only needs each economist's summary. The `qid` and `name` columns are kept as identifiers, and rows without a summary are dropped.

In [1]:
import pandas as pd
from transformers import pipeline
from tqdm import tqdm

df = pd.read_csv("../Data/economists_final_dataset.csv")

df = df[["qid", "name", "summary"]].dropna(subset=["summary"])
df.head()



Unnamed: 0,qid,name,summary
0,Q272731,Edith Abbott,"Edith Abbott (September 26, 1876 – July 28, 19..."
1,Q272731,Edith Abbott,"Edith Abbott (September 26, 1876 – July 28, 19..."
2,Q272731,Edith Abbott,"Edith Abbott (September 26, 1876 – July 28, 19..."
3,Q272731,Edith Abbott,"Edith Abbott (September 26, 1876 – July 28, 19..."
4,Q272731,Edith Abbott,"Edith Abbott (September 26, 1876 – July 28, 19..."


Each economist appears many times in the loaded data (the first five rows above are all Edith Abbott). The cells below count these duplicates and then keep one row per economist.

In [2]:
# Total rows
print("Total rows:", len(df))
# Unique (qid, name) pairs
print(
    "Unique qid-name pairs:",
    df[["qid", "name"]].drop_duplicates().shape[0]
)
# Rows that repeat an earlier (qid, name) pair
dupes = df.duplicated(subset=["qid", "name"]).sum()
print("Duplicate qid-name rows:", dupes)

Total rows: 114798
Unique qid-name pairs: 1101
Duplicate qid-name rows: 113697


In [8]:
df = df.drop_duplicates(subset=["qid", "name"])
df.head()

Unnamed: 0,qid,name,summary
0,Q272731,Edith Abbott,"Edith Abbott (September 26, 1876 – July 28, 19..."
107,Q718581,Daron Acemoglu,"Kamer Daron Acemoğlu (born September 3, 1967) ..."
214,Q7001311,Nicola Acocella,Nicola Acocella (born 3 July 1939) is an Itali...
321,Q8073604,Zoltan Acs,Zoltan J. Acs (born 1947) is an American econo...
428,Q518021,Henry Carter Adams,"Henry Carter Adams (December 31, 1851 – August..."
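Dropping duplicates on `qid` and `name` keeps the first row for each economist and discards the rest, which is safe only if the repeated rows carry the same summary. A quick check might look like the sketch below. The toy frame and its values are made up for illustration and stand in for the notebook's `df`.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; all values here are made up.
toy = pd.DataFrame({
    "qid": ["Q1", "Q1", "Q2", "Q2"],
    "name": ["A", "A", "B", "B"],
    "summary": ["text a", "text a", "text b", "text b (edited)"],
})

# Count distinct summaries for each (qid, name) key.
summaries_per_key = toy.groupby(["qid", "name"])["summary"].nunique()

# Keys with more than one distinct summary would lose information
# if only the first row were kept.
conflicting = summaries_per_key[summaries_per_key > 1]
print(conflicting)

# Same call as in the notebook: keep the first row per (qid, name).
deduped = toy.drop_duplicates(subset=["qid", "name"])
print(len(deduped))
```

If `conflicting` is empty on the real data, keeping the first row per economist loses nothing.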


<a id="sec2"></a>
### Load Zero Shot Classifier

In [9]:
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli"
)

econ_categories = [
    "macroeconomics",
    "microeconomics",
    "development economics",
    "financial economics",
    "labor economics",
    "political economy",
    "economic history",
    "econometrics",
    "public policy",
    "other"
]

Device set to use mps:0
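For intuition, an NLI-based zero-shot classifier such as `facebook/bart-large-mnli` scores each candidate label by turning it into a hypothesis sentence and asking whether the text entails it. The sketch below only builds those premise/hypothesis pairs. It assumes the pipeline's default template, `"This example is {}."`, and uses a short made-up premise. `build_nli_pairs` is a hypothetical helper, not part of `transformers`.

```python
# Hypothetical helper for illustration; not part of the transformers API.
def build_nli_pairs(text, labels, template="This example is {}."):
    """Return one (premise, hypothesis) pair per candidate label."""
    return [(text, template.format(label)) for label in labels]


labels = ["macroeconomics", "labor economics", "other"]
premise = "An economist known for research on wages and employment."  # made-up text

pairs = build_nli_pairs(premise, labels)
for _, hypothesis in pairs:
    print(hypothesis)
```

A more task-specific template can be passed to the pipeline through its `hypothesis_template` argument. Whether that helps here would need to be checked against labeled examples.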


<a id="sec3"></a>
### Classify Summaries

In [10]:
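results = []

# Classify one summary at a time; tqdm shows progress over the rows.
for _, row in tqdm(df.iterrows(), total=len(df)):
    out = classifier(
        row["summary"][:512],  # first 512 characters (not tokens)
        candidate_labels=econ_categories
    )

    # Labels come back sorted by descending score; keep the top one.
    results.append({
        "qid": row["qid"],
        "econ_field_pred": out["labels"][0],
        "econ_field_score": out["scores"][0]
    })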


100%|██████████| 1101/1101 [06:35<00:00,  2.78it/s]
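The loop above calls the classifier once per row. The pipeline also accepts a list of texts, so a batched variant is possible. The sketch below is one way to write it, not a measured improvement. `classify_in_batches` is a hypothetical helper, and `fake_classifier` is a stand-in that mimics the pipeline's output shape so the example runs without downloading a model.

```python
def classify_in_batches(classifier, texts, labels, batch_size=16, max_chars=512):
    """Classify texts in chunks and return (top_label, top_score) for each."""
    top = []
    for start in range(0, len(texts), batch_size):
        # Same character-level truncation as the row-by-row loop.
        chunk = [t[:max_chars] for t in texts[start:start + batch_size]]
        outputs = classifier(chunk, candidate_labels=labels)
        # A single input may come back as a bare dict; normalise to a list.
        if isinstance(outputs, dict):
            outputs = [outputs]
        for out in outputs:
            top.append((out["labels"][0], out["scores"][0]))
    return top


def fake_classifier(sequences, candidate_labels):
    """Stand-in for the real pipeline: labels in the given order, uniform scores."""
    n = len(candidate_labels)
    return [
        {"sequence": s, "labels": list(candidate_labels), "scores": [1.0 / n] * n}
        for s in sequences
    ]


texts = ["summary one", "summary two", "summary three"]
preds = classify_in_batches(
    fake_classifier, texts, ["macroeconomics", "other"], batch_size=2
)
print(preds)
```

With the real pipeline, the call would be along the lines of `classify_in_batches(classifier, df["summary"].tolist(), econ_categories)`.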


<a id="sec4"></a>
### Save Results

In [11]:
field_df = pd.DataFrame(results)
field_df.to_csv("../Data/economist_text_classification.csv", index=False)

<a id="sec5"></a>
### Accuracy Check

In [12]:
ground_truth = {
    "Q718581": "macroeconomics",  # Acemoglu
    "Q272731": "public policy",  # Edith Abbott
    "Q160270": "trade",  # Ricardo
    "Q9061": "sociology",  # Marx
    "Q13529": "neoclassical",
    "Q9317": "macroeconomics",
    "Q9381": "classical",
    "Q1325": "liberalism",
    "Q102454": "theory",
    "Q192592": "equilibrium",
    "Q191020": "microeconomics",
    "Q132489": "welfare",
    "Q153761": "governance",
    "Q233950": "behavioral",
    "Q131112": "geography",
    "Q434509": "development",
    "Q1097475": "labor",
    "Q562481": "environment",
    "Q263725": "policy"
}

eval_df = field_df[field_df["qid"].isin(ground_truth.keys())].copy()
eval_df["true_label"] = eval_df["qid"].map(ground_truth)
eval_df["correct"] = eval_df["econ_field_pred"] == eval_df["true_label"]

eval_df["correct"].mean()

np.float64(0.05263157894736842)

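Exact-match accuracy on this hand-labeled sample is about 5% (1 of 19), but the figure says little about the classifier. Only 4 of the 19 hand-assigned labels (`macroeconomics` twice, `public policy`, and `microeconomics`) appear in `econ_categories`, so the score cannot exceed 4/19 (about 21%) however well the model performs. Economists also often span several subfields, which makes strict single-label accuracy hard to evaluate.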
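One check that needs no model output is how many of the hand-assigned labels are even possible predictions. The sketch below copies `econ_categories` and `ground_truth` from the cells above and counts the overlap; `covered` and `uncovered` are names introduced here for illustration.

```python
econ_categories = [
    "macroeconomics", "microeconomics", "development economics",
    "financial economics", "labor economics", "political economy",
    "economic history", "econometrics", "public policy", "other",
]

ground_truth = {
    "Q718581": "macroeconomics", "Q272731": "public policy",
    "Q160270": "trade", "Q9061": "sociology", "Q13529": "neoclassical",
    "Q9317": "macroeconomics", "Q9381": "classical", "Q1325": "liberalism",
    "Q102454": "theory", "Q192592": "equilibrium", "Q191020": "microeconomics",
    "Q132489": "welfare", "Q153761": "governance", "Q233950": "behavioral",
    "Q131112": "geography", "Q434509": "development", "Q1097475": "labor",
    "Q562481": "environment", "Q263725": "policy",
}

# Entries whose true label is one of the candidate labels the classifier can output.
covered = {qid: label for qid, label in ground_truth.items() if label in econ_categories}

# Distinct true labels the classifier can never predict.
uncovered = sorted(set(ground_truth.values()) - set(econ_categories))

print(f"{len(covered)} of {len(ground_truth)} labels are in econ_categories")
# prints: 4 of 19 labels are in econ_categories
print(uncovered)
```

Mapping near-matches such as `labor` to `labor economics` or `development` to `development economics` would be one way to make the comparison fairer, although each mapping is a judgment call.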

<a id="sec6"></a>
### Exporting Results

In [21]:
views = pd.read_csv("../Data/nobel_combined.csv")
views = views[["qid", "name", "date", "views", "is_nobel", "nobel_year"]]
views = views.drop_duplicates(subset=["qid", "date"])

print(len(views))

114905


In [22]:
meta = pd.read_csv("../Data/economists_final_dataset.csv")
meta = meta[["qid", "gender", "citizenship", "birth_year"]].drop_duplicates(subset=["qid"])

In [23]:
fields = pd.read_csv("../Data/economist_text_classification.csv")
fields = fields[["qid", "econ_field_pred", "econ_field_score"]].drop_duplicates(subset=["qid"])

In [24]:
df = (views.merge(meta, on="qid", how="left", validate="many_to_one")
      .merge(fields, on="qid", how="left", validate="many_to_one"))

print(df.shape)

(114905, 11)
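Because rows without a summary were dropped before classification, the left merge can leave some economists without a predicted field. One way to surface them is the `indicator` option of `DataFrame.merge`. The sketch below uses small made-up frames (`views_toy`, `fields_toy`) in place of the real files.

```python
import pandas as pd

# Made-up stand-ins for `views` and `fields`.
views_toy = pd.DataFrame({
    "qid": ["Q1", "Q1", "Q2"],
    "date": [201701, 201702, 201701],
    "views": [10, 12, 7],
})
fields_toy = pd.DataFrame({"qid": ["Q1"], "econ_field_pred": ["other"]})

# indicator=True adds a `_merge` column recording where each row was found.
merged = views_toy.merge(
    fields_toy, on="qid", how="left", validate="many_to_one", indicator=True
)

# qids present in the views data but absent from the classification results.
unmatched = merged.loc[merged["_merge"] == "left_only", "qid"].unique()
print(list(unmatched))
```

A left merge keeps every row of the left frame, so the row count stays the same and any gaps show up as missing values in the prediction columns.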


In [25]:
df.head()

Unnamed: 0,qid,name,date,views,is_nobel,nobel_year,gender,citizenship,birth_year,econ_field_pred,econ_field_score
0,Q272731,Edith_Abbott,201701,909,False,,female,['United States'],1876.0,other,0.641633
1,Q272731,Edith_Abbott,201702,1005,False,,female,['United States'],1876.0,other,0.641633
2,Q272731,Edith_Abbott,201703,1461,False,,female,['United States'],1876.0,other,0.641633
3,Q272731,Edith_Abbott,201704,901,False,,female,['United States'],1876.0,other,0.641633
4,Q272731,Edith_Abbott,201705,801,False,,female,['United States'],1876.0,other,0.641633


In [26]:
df.to_csv("../Data/final_streamlit_dataset.csv", index=False)