## Text Classification Workspace

I want to analyze the following categories: person, place, film/TV, and event.

First, I want to build a dataframe for each country and then add a column (using zero-shot learning) that says which category each row is an instance of.

I am going to try this first with just the US dataframe.

In [7]:
import pandas as pd

In [8]:
text_classification_df = pd.read_csv('top5000_each.csv')

In [9]:
US_text_classification_df = text_classification_df[text_classification_df['country_code'].str.contains("US", case=False, na=False)]
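
The filter above uses a substring match. With two-letter codes that gives the same rows as an exact match, but an exact comparison states the intent more directly and cannot pick up longer strings that happen to contain "us". A minimal sketch on made-up rows (the toy frame and the name `us_df` are illustrative, not part of the notebook):

```python
import pandas as pd

# Toy rows standing in for top5000_each.csv; the values are made up.
df = pd.DataFrame({
    "article": ["Main_Page", "YouTube", "Berlin", "Tokyo"],
    "country_code": ["US", "us", "DE", None],
})

# Exact, case-insensitive match on the country code.
# Missing codes become NaN after .str.upper() and compare as False.
is_us = df["country_code"].str.upper().eq("US")
us_df = df[is_us].copy()

print(us_df["article"].tolist())
```

The trailing `.copy()` makes the subset an independent frame, which matters later when new columns are added to it.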

In [10]:
US_text_classification_df.head()

Unnamed: 0,article,qid,country_code,total_pageviews
0,Main_Page,Q5296,US,89005625
1,Cookie_(informatique),Q178995,US,49289112
2,Jimmy_Carter,Q23685,US,4964868
3,ã¡ã¤ã³ãã¼ã¸,Q5296,US,4061575
4,YouTube,Q866,US,3624806


In [11]:
len(US_text_classification_df)

5000

In [12]:
US_qid_df = US_text_classification_df.drop_duplicates(subset=['qid'], keep='first')

In [13]:
US_qid_df.head()

Unnamed: 0,article,qid,country_code,total_pageviews
0,Main_Page,Q5296,US,89005625
1,Cookie_(informatique),Q178995,US,49289112
2,Jimmy_Carter,Q23685,US,4964868
4,YouTube,Q866,US,3624806
5,URL,Q42253,US,3366191


In [14]:
len(US_qid_df)

4917
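
The drop from 5000 to 4917 rows comes from titles that share a QID, such as the two rows for Q5296 in the first output. Before discarding them, it can help to see which titles collapse together. A small sketch using a few of the rows shown above (the names `rows`, `shared`, and `titles_per_qid` are illustrative):

```python
import pandas as pd

rows = pd.DataFrame({
    "article": ["Main_Page", "Jimmy_Carter", "メインページ", "YouTube"],
    "qid": ["Q5296", "Q23685", "Q5296", "Q866"],
    "total_pageviews": [89005625, 4964868, 4061575, 3624806],
})

# keep=False flags every member of a duplicated group, not only the later ones.
shared = rows[rows["qid"].duplicated(keep=False)]
titles_per_qid = shared.groupby("qid")["article"].agg(list)
print(titles_per_qid)

# Same call as in the notebook: the first row per QID survives.
unique_rows = rows.drop_duplicates(subset=["qid"], keep="first")
print(len(unique_rows))
```

Note that `keep='first'` keeps the row with the most pageviews only if the frame is already sorted by `total_pageviews`, which appears to be the case here.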

Now that I have a dataframe of all the unique articles, I can add a column with their text classification.

I will use the model from the tutorial notebook to classify the text because it works with different languages.

I am having a lot of issues with the text classifier, so I am going to move on to some other things I can take care of first.

In [15]:
!pip install transformers pandas tqdm



In [16]:
pip install torch

Note: you may need to restart the kernel to use updated packages.


In [17]:
!pip install protobuf



In [18]:
import torch
import pandas as pd
from transformers import pipeline
from tqdm.notebook import tqdm

In [24]:
MODEL_NAME = "facebook/bart-large-mnli"
DEVICE = 0 if torch.cuda.is_available() else -1

print(f"Loading model: {MODEL_NAME} on device: {'GPU' if DEVICE == 0 else 'CPU'}")

# pipeline is a function from HuggingFace's transformers library


Loading model: facebook/bart-large-mnli on device: CPU


In [25]:
classifier = pipeline(
    "zero-shot-classification",
    model=MODEL_NAME,
    device=DEVICE,
    batch_size=32
)

classifier

Device set to use cpu


<transformers.pipelines.zero_shot_classification.ZeroShotClassificationPipeline at 0x305f82210>

Now I can determine my labels

In [26]:
US_labels = ["Person", "Place", "Event", "TV"]

### Classification code (from Gemini)

In [27]:
resultsUS = classifier(
    US_text_classification_df['article'].to_list(),
    candidate_labels=US_labels,
    hypothesis_template="This text is about {}.",
    multi_label=False
)

print("Classification complete.")

Classification complete.
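
For context on what the call above does: the zero-shot pipeline fills each candidate label into the hypothesis template and asks the NLI model whether the input text entails that sentence. With `multi_label=False` the scores are normalized across the labels, so they sum to 1. The sketch below only reproduces the template-filling step in plain Python; it does not load the model.

```python
labels = ["Person", "Place", "Event", "TV"]
template = "This text is about {}."

# One hypothesis per candidate label; each is scored against the article title.
hypotheses = [template.format(label) for label in labels]

for hypothesis in hypotheses:
    print(hypothesis)
```

Because the input here is only an article title, the model has very little text to judge each hypothesis against, which may explain some of the low scores.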


In [28]:
predicted_labels = [result['labels'][0] for result in resultsUS]
predicted_scores = [result['scores'][0] for result in resultsUS]
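
Each result is a dict with `sequence`, `labels`, and `scores`, where the labels are sorted from highest to lowest score, so index 0 is the top prediction. The sketch below uses hand-written results with invented scores and adds a hypothetical cut-off (`THRESHOLD`, not part of the notebook) to mark weak predictions instead of accepting them.

```python
# Hand-written stand-in for the pipeline output; the scores are invented.
mock_results = [
    {"sequence": "Jimmy_Carter",
     "labels": ["Person", "Place", "Event", "TV"],
     "scores": [0.89, 0.05, 0.04, 0.02]},
    {"sequence": "Main_Page",
     "labels": ["Place", "Person", "TV", "Event"],
     "scores": [0.41, 0.27, 0.19, 0.13]},
]

THRESHOLD = 0.5  # hypothetical cut-off, chosen only for illustration


def top_prediction(result, threshold):
    """Return (label, score) for the best label, or 'Uncertain' below the cut-off."""
    label, score = result["labels"][0], result["scores"][0]
    return (label if score >= threshold else "Uncertain"), score


predictions = [top_prediction(r, THRESHOLD) for r in mock_results]
print(predictions)
```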

In [29]:
US_text_classification_df['predicted_category'] = predicted_labels
US_text_classification_df['category_score'] = predicted_scores

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  US_text_classification_df['predicted_category'] = predicted_labels
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  US_text_classification_df['category_score'] = predicted_scores
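
The warning appears because `US_text_classification_df` was created by filtering `text_classification_df`, so pandas cannot tell whether the new columns are meant for the subset or the original. One common way to avoid it is to take an explicit copy when filtering. A minimal sketch on toy data (the frame names and predictions are placeholders):

```python
import pandas as pd

full = pd.DataFrame({
    "article": ["Main_Page", "Berlin", "YouTube"],
    "country_code": ["US", "DE", "US"],
})

# .copy() makes the subset an independent frame, so adding columns is unambiguous.
us = full[full["country_code"] == "US"].copy()
us["predicted_category"] = ["Place", "TV"]  # placeholder predictions

print(us)
print("predicted_category" in full.columns)  # the original frame is untouched
```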


In [30]:
US_text_classification_df.head()

Unnamed: 0,article,qid,country_code,total_pageviews,predicted_category,category_score
0,Main_Page,Q5296,US,89005625,Place,0.414593
1,Cookie_(informatique),Q178995,US,49289112,Person,0.379777
2,Jimmy_Carter,Q23685,US,4964868,Person,0.891086
3,ã¡ã¤ã³ãã¼ã¸,Q5296,US,4061575,Person,0.51493
4,YouTube,Q866,US,3624806,TV,0.865205


### After text classification

I can set up the same dataframes for my other countries.

In [31]:
JP_text_classification_df = text_classification_df[text_classification_df['country_code'].str.contains("JP", case=False, na=False)]
IN_text_classification_df = text_classification_df[text_classification_df['country_code'].str.contains("IN", case=False, na=False)]
DE_text_classification_df = text_classification_df[text_classification_df['country_code'].str.contains("DE", case=False, na=False)]

In [32]:
JP_qid_df = JP_text_classification_df.drop_duplicates(subset=['qid'], keep='first')
IN_qid_df = IN_text_classification_df.drop_duplicates(subset=['qid'], keep='first')
DE_qid_df = DE_text_classification_df.drop_duplicates(subset=['qid'], keep='first')
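
The same filter-then-deduplicate step repeats for every country, so it could also be written once and stored in a dict keyed by country code. This is only a sketch on toy rows; `qid_dfs` and `countries` are hypothetical names, and it uses an exact match on the code rather than `str.contains`.

```python
import pandas as pd

# Toy rows; the real frame comes from top5000_each.csv.
df = pd.DataFrame({
    "article": ["Main_Page", "メインページ", "Berlin", "Taj_Mahal", "Tokyo"],
    "qid": ["Q5296", "Q5296", "Q64", "Q9141", "Q1490"],
    "country_code": ["US", "US", "DE", "IN", "JP"],
})

countries = ["US", "JP", "IN", "DE"]

# One deduplicated frame per country.
qid_dfs = {
    code: df[df["country_code"] == code]
    .drop_duplicates(subset=["qid"], keep="first")
    .copy()
    for code in countries
}

sizes = {code: len(frame) for code, frame in qid_dfs.items()}
print(sizes)
```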

In [33]:
len(JP_qid_df)


4966

## This model (bart-large-mnli) was trained on English, so it isn't good with other languages

In [34]:
JP_labels = ["Hito", "Basho", "Ibento", "Terebi"]

In [None]:
resultsJP = classifier(
    JP_text_classification_df['article'].to_list(),
    candidate_labels=JP_labels,
    hypothesis_template="Kono tekisuto no naiyō wa {}.",
    multi_label=False
)

print("Classification complete.")

In [None]:
predicted_labels = [result['labels'][0] for result in resultsJP]
predicted_scores = [result['scores'][0] for result in resultsJP]

In [None]:
JP_text_classification_df['predicted_category'] = predicted_labels
JP_text_classification_df['category_score'] = predicted_scores

In [40]:
len(IN_qid_df)

4744

In [None]:
# English labels for IN (an assumption; swap in another language if the titles call for it)
IN_labels = ["Person", "Place", "Event", "TV"]

In [None]:
resultsIN = classifier(
    IN_text_classification_df['article'].to_list(),
    candidate_labels=IN_labels,
    hypothesis_template="This text is about {}.",
    multi_label=False
)

print("Classification complete.")

In [None]:
predicted_labels = [result['labels'][0] for result in resultsIN]
predicted_scores = [result['scores'][0] for result in resultsIN]

In [None]:
IN_text_classification_df['predicted_category'] = predicted_labels
IN_text_classification_df['category_score'] = predicted_scores

In [41]:
len(DE_qid_df)

4907

In [None]:
# German labels and template for DE
DE_labels = ["Person", "Ort", "Ereignis", "Fernsehen"]

In [None]:
resultsDE = classifier(
    DE_text_classification_df['article'].to_list(),
    candidate_labels=DE_labels,
    hypothesis_template="In diesem Text geht es um {}.",
    multi_label=False
)

print("Classification complete.")

In [None]:
predicted_labels = [result['labels'][0] for result in resultsDE]
predicted_scores = [result['scores'][0] for result in resultsDE]

In [None]:
DE_text_classification_df['predicted_category'] = predicted_labels
DE_text_classification_df['category_score'] = predicted_scores
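
The classify, extract, and assign steps are identical for each country, so they could be wrapped in one helper. The sketch below assumes a hypothetical function `add_predictions` and uses a fake classifier so it runs without downloading a model; in the notebook the real `classifier` pipeline would be passed in instead.

```python
import pandas as pd


def add_predictions(df, classifier, labels, template):
    """Return a copy of df with the top label and its score for each article title."""
    results = classifier(
        df["article"].to_list(),
        candidate_labels=labels,
        hypothesis_template=template,
        multi_label=False,
    )
    out = df.copy()  # avoids writing into a slice of another frame
    out["predicted_category"] = [r["labels"][0] for r in results]
    out["category_score"] = [r["scores"][0] for r in results]
    return out


def fake_classifier(texts, candidate_labels, hypothesis_template, multi_label):
    # Stand-in for the real pipeline: always ranks the first label on top.
    scores = [1.0] + [0.0] * (len(candidate_labels) - 1)
    return [
        {"sequence": t, "labels": list(candidate_labels), "scores": scores}
        for t in texts
    ]


demo = pd.DataFrame({"article": ["Jimmy_Carter", "YouTube"]})
labelled = add_predictions(
    demo, fake_classifier, ["Person", "Place", "Event", "TV"], "This text is about {}."
)
print(labelled)
```

Returning a new frame rather than modifying the input also sidesteps the copy-of-a-slice warning seen in the US run.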

Now I have everything I need for the text classification, so I can move on to checking the predictions.

Almost every QID has an "instance of" attribute, so I want to add a column to my dataframe with that value.

## Wikidata verification

In [44]:
json_df = pd.read_json("entity_results2.jsonl", lines=True)

In [45]:
json_df.head()

Unnamed: 0,QID,status,label,description,attributes,error_message
0,Q5296,success,Wikimedia main page,main page of a Wikimedia project,"{'instance of': 'Wikimedia internal item', 'su...",
1,Q178995,success,HTTP cookie,small piece of data sent from a website and st...,"{'named after': 'cookie', 'Commons category': ...",
2,Q23685,success,Jimmy Carter,president of the United States from 1977 to 19...,"{'Perlentaucher ID': 'jimmy-carter', 'given na...",
3,Q866,success,YouTube,American video-sharing platform owned by Alpha...,"{'instance of': 'video streaming service', 'Co...",
4,Q42253,success,URL,web address to a particular file or page,"{'subclass of': 'Uniform Resource Identifier',...",


## Extracting "instance of" (from Gemini)

In [47]:
import numpy as np

In [48]:
json_df["instance_of"] = json_df["attributes"].apply(
    lambda x: x.get("instance of") if isinstance(x, dict) else np.nan
)

In [49]:
json_df.head()

Unnamed: 0,QID,status,label,description,attributes,error_message,instance_of
0,Q5296,success,Wikimedia main page,main page of a Wikimedia project,"{'instance of': 'Wikimedia internal item', 'su...",,Wikimedia internal item
1,Q178995,success,HTTP cookie,small piece of data sent from a website and st...,"{'named after': 'cookie', 'Commons category': ...",,
2,Q23685,success,Jimmy Carter,president of the United States from 1977 to 19...,"{'Perlentaucher ID': 'jimmy-carter', 'given na...",,human
3,Q866,success,YouTube,American video-sharing platform owned by Alpha...,"{'instance of': 'video streaming service', 'Co...",,video streaming service
4,Q42253,success,URL,web address to a particular file or page,"{'subclass of': 'Uniform Resource Identifier',...",,technical standard
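
Row 1 shows that not every entity has an "instance of" value. To put a number on "almost every", the share of non-missing values can be computed directly. A sketch on three toy entities that mirror the rows above (`entities`, `coverage`, and `missing` are illustrative names):

```python
import numpy as np
import pandas as pd

entities = pd.DataFrame({
    "QID": ["Q5296", "Q178995", "Q23685"],
    "attributes": [
        {"instance of": "Wikimedia internal item"},
        {"named after": "cookie"},
        {"instance of": "human"},
    ],
})

# Same extraction as in the notebook cell above.
entities["instance_of"] = entities["attributes"].apply(
    lambda x: x.get("instance of") if isinstance(x, dict) else np.nan
)

coverage = entities["instance_of"].notna().mean()  # fraction of rows with a value
missing = entities.loc[entities["instance_of"].isna(), "QID"].tolist()

print(f"coverage: {coverage:.2%}")
print("missing:", missing)
```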


I just want to keep the qid and instance_of columns.

In [57]:
json_df = json_df.rename(columns={'QID': 'qid'})

In [58]:
to_merge = json_df[["qid", "instance_of"]]

In [61]:
merged = US_qid_df.merge(to_merge, on="qid", how="left")

In [62]:
merged.head()

Unnamed: 0,article,qid,country_code,total_pageviews,instance_of
0,Main_Page,Q5296,US,89005625,Wikimedia internal item
1,Cookie_(informatique),Q178995,US,49289112,
2,Jimmy_Carter,Q23685,US,4964868,human
3,YouTube,Q866,US,3624806,video streaming service
4,URL,Q42253,US,3366191,technical standard
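
A left merge silently keeps articles whose QID is absent from the lookup table and can multiply rows if the lookup has repeated QIDs. The `validate` and `indicator` arguments of `DataFrame.merge` make both cases visible. A sketch on toy frames (`Q0` and `Example_Page` are placeholders, not real entries):

```python
import pandas as pd

articles = pd.DataFrame({
    "article": ["Main_Page", "Jimmy_Carter", "Example_Page"],
    "qid": ["Q5296", "Q23685", "Q0"],  # Q0 is a placeholder with no lookup row
})
lookup = pd.DataFrame({
    "qid": ["Q5296", "Q23685"],
    "instance_of": ["Wikimedia internal item", "human"],
})

merged_checked = articles.merge(
    lookup,
    on="qid",
    how="left",
    validate="many_to_one",  # raises if a qid appears more than once in lookup
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

unmatched = merged_checked.loc[merged_checked["_merge"] == "left_only", "qid"].tolist()
print(unmatched)
```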


I will do the same for the rest of my countries.

### This is the dataframe I will use to check my text classification predictions
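
One possible way to run that check, sketched under several assumptions: the predictions have been merged into the same frame, and there is a mapping from Wikidata "instance of" values to the four labels. The mapping `INSTANCE_TO_CATEGORY`, the rows, and the predictions below are all invented for illustration.

```python
import pandas as pd

# Hypothetical mapping; the real one would need to cover many more values.
INSTANCE_TO_CATEGORY = {
    "human": "Person",
    "city": "Place",
    "television series": "TV",
}

check = pd.DataFrame({
    "article": ["Jimmy_Carter", "Chicago", "Friends", "URL"],
    "instance_of": ["human", "city", "television series", "technical standard"],
    "predicted_category": ["Person", "Person", "TV", "Place"],  # invented predictions
})

# Values without a mapping become NaN and are left out of the comparison.
check["expected_category"] = check["instance_of"].map(INSTANCE_TO_CATEGORY)
comparable = check.dropna(subset=["expected_category"])

agreement = (comparable["predicted_category"] == comparable["expected_category"]).mean()
print(f"agreement on {len(comparable)} comparable rows: {agreement:.2%}")
```

Rows whose "instance of" value falls outside the four categories are excluded here rather than counted as errors, which is a design choice worth revisiting once the real mapping exists.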