## Text Classification Workspace

I want to analyze the following categories: person, place, film/tv, event

First, I want to get my dataframes for each country and then add a column (using zero shot learning) that says what the row is an instance of

<br> I am going to try this first just for the US dataframe

In [1]:
import pandas as pd

In [2]:
text_classification_df = pd.read_csv('top5000_each.csv')

In [3]:
US_text_classification_df = text_classification_df[text_classification_df['country_code'].str.contains("US", case=False, na=False)]

In [4]:
US_text_classification_df.head()

Unnamed: 0,article,qid,country_code,total_pageviews
0,Main_Page,Q5296,US,89005625
1,Cookie_(informatique),Q178995,US,49289112
2,Jimmy_Carter,Q23685,US,4964868
3,メインページ,Q5296,US,4061575
4,YouTube,Q866,US,3624806


In [5]:
len(US_text_classification_df)

5000

In [6]:
US_qid_df = US_text_classification_df.drop_duplicates(subset=['qid'], keep='first')

In [7]:
US_qid_df.head()

Unnamed: 0,article,qid,country_code,total_pageviews
0,Main_Page,Q5296,US,89005625
1,Cookie_(informatique),Q178995,US,49289112
2,Jimmy_Carter,Q23685,US,4964868
4,YouTube,Q866,US,3624806
5,URL,Q42253,US,3366191


In [8]:
len(US_qid_df)

4917

Now I have a dataframe of all the unique articles, I can add my column with their text classification

I will use the model from the tutorial notebook to classify the text because it works with different languages

I am having a lot of issues with the text classifier, so I am going to move on to some other things I can take care of before that

In [9]:
!pip install transformers pandas tqdm



In [10]:
pip install torch

Note: you may need to restart the kernel to use updated packages.


In [11]:
!pip install protobuf



In [12]:
import torch
import pandas as pd
from transformers import pipeline
from tqdm.notebook import tqdm

In [None]:
MODEL_NAME = "facebook/bart-large-mnli"
DEVICE = 0 if torch.cuda.is_available() else -1

print(f"Loading model: {MODEL_NAME} on device: {'GPU' if DEVICE == 0 else 'CPU'}")

# pipeline is a function from HuggingFace's transformers library


Loading model: facebook/bart-large-mnli on device: CPU


In [14]:
classifier = pipeline(
    "zero-shot-classification",
    model=MODEL_NAME,
    device=DEVICE,
    batch_size=32
)

classifier

Device set to use cpu


<transformers.pipelines.zero_shot_classification.ZeroShotClassificationPipeline at 0x168559400>

Now I can determine my labels

In [15]:
US_labels = ["Person", "Place", "Event", "TV"]

### GEMINI

In [16]:
resultsUS = classifier(
        US_text_classification_df['article'].to_list(),
        candidate_labels=US_labels,
        hypothesis_template= "This text is about {}.",
        multi_label=False
    )

print("Classification complete.")

Classification complete.


In [18]:
predicted_labels = [result['labels'][0] for result in resultsUS]
predicted_scores = [result['scores'][0] for result in resultsUS]

In [19]:
US_text_classification_df['predicted_category'] = predicted_labels
US_text_classification_df['category_score'] = predicted_scores

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  US_text_classification_df['predicted_category'] = predicted_labels
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  US_text_classification_df['category_score'] = predicted_scores


In [20]:
US_text_classification_df.head()

Unnamed: 0,article,qid,country_code,total_pageviews,predicted_category,category_score
0,Main_Page,Q5296,US,89005625,Place,0.414593
1,Cookie_(informatique),Q178995,US,49289112,Person,0.379777
2,Jimmy_Carter,Q23685,US,4964868,Person,0.891086
3,メインページ,Q5296,US,4061575,Person,0.451591
4,YouTube,Q866,US,3624806,TV,0.865205


### Post text classification stuff

I can set up all my stuff for my other countries

In [21]:
JP_text_classification_df = text_classification_df[text_classification_df['country_code'].str.contains("JP", case=False, na=False)]
IN_text_classification_df = text_classification_df[text_classification_df['country_code'].str.contains("IN", case=False, na=False)]
DE_text_classification_df = text_classification_df[text_classification_df['country_code'].str.contains("DE", case=False, na=False)]
GB_text_classification_df = text_classification_df[text_classification_df['country_code'].str.contains("GB", case=False, na=False)]

In [22]:
JP_qid_df = JP_text_classification_df.drop_duplicates(subset=['qid'], keep='first')
IN_qid_df = IN_text_classification_df.drop_duplicates(subset=['qid'], keep='first')
DE_qid_df = DE_text_classification_df.drop_duplicates(subset=['qid'], keep='first')
GB_qid_df = DE_text_classification_df.drop_duplicates(subset=['qid'], keep='first')

In [23]:
len(JP_qid_df)


4966

## This model isn't good with other languages tho

# I NEED TO USE THE JAPANESE CLASSIFIER HERE

In [56]:
MODEL_NAME_JA = "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli"
DEVICE = 0 if torch.cuda.is_available() else -1

print(f"Loading model: {MODEL_NAME_JA} on device: {'GPU' if DEVICE == 0 else 'CPU'}")

# pipeline is a function from HuggingFace's transformers library


Loading model: MoritzLaurer/mDeBERTa-v3-base-mnli-xnli on device: CPU


In [57]:
classifier_JA = pipeline(
    "zero-shot-classification",
    model=MODEL_NAME_JA,
    device=DEVICE,
    batch_size=32
)

classifier_JA

config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/558M [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

spm.model:   0%|          | 0.00/4.31M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/16.3M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/23.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/286 [00:00<?, ?B/s]

Device set to use cpu


<transformers.pipelines.zero_shot_classification.ZeroShotClassificationPipeline at 0x13a7ab890>

In [58]:
JP_labels = ["人", "場所", "イベント", "テレビ"]

In [59]:
resultsJP = classifier_JA(
        JP_text_classification_df['article'].to_list(),
        candidate_labels=JP_labels,
        hypothesis_template= "この文章は～についてです {}.",
        multi_label=False
    )

print("Classification complete.")

Classification complete.


In [60]:
JP_predicted_labels = [result['labels'][0] for result in resultsJP]
JP_predicted_scores = [result['scores'][0] for result in resultsJP]

In [61]:
JP_text_classification_df['predicted_category'] = JP_predicted_labels
JP_text_classification_df['category_score'] = JP_predicted_scores

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  JP_text_classification_df['predicted_category'] = JP_predicted_labels
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  JP_text_classification_df['category_score'] = JP_predicted_scores


In [62]:
JP_text_classification_df.head()

Unnamed: 0,article,qid,country_code,total_pageviews,predicted_category,category_score
5000,メインページ,Q5296,JP,11080041,場所,0.754544
5001,大谷翔平,Q4391858,JP,2274673,人,0.874603
5002,ヌートバー,Q107315831,JP,1945207,人,0.662389
5003,吉田正尚,Q22120815,JP,1636163,人,0.868529
5004,栗山英樹,Q10855516,JP,1618438,人,0.822546


In [29]:
len(IN_qid_df)

4744

In [30]:
IN_text_classification_df.head()

Unnamed: 0,article,qid,country_code,total_pageviews
15000,Main_Page,Q5296,IN,11404116
15001,XXX_(film_series),Q25136249,IN,3271545
15002,XXX:_Return_of_Xander_Cage,Q22075020,IN,2296142
15003,Women's_Premier_League_(cricket),Q115877036,IN,2233534
15004,YouTube,Q866,IN,2232855


Figure out how many articles from India are written in what languages

In [31]:
IN_labels = ["Person", "Place", "Event", "TV"]

In [32]:
resultsIN = classifier(
        IN_text_classification_df['article'].to_list(),
        candidate_labels=IN_labels,
        hypothesis_template= "This article is about: {}.",
        multi_label=False
    )

print("Classification complete.")

Classification complete.


In [33]:
IN_predicted_labels = [result['labels'][0] for result in resultsIN]
IN_predicted_scores = [result['scores'][0] for result in resultsIN]

In [34]:
IN_text_classification_df['predicted_category'] = IN_predicted_labels
IN_text_classification_df['category_score'] = IN_predicted_scores

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  IN_text_classification_df['predicted_category'] = IN_predicted_labels
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  IN_text_classification_df['category_score'] = IN_predicted_scores


In [35]:
IN_text_classification_df.head()

Unnamed: 0,article,qid,country_code,total_pageviews,predicted_category,category_score
15000,Main_Page,Q5296,IN,11404116,Place,0.436284
15001,XXX_(film_series),Q25136249,IN,3271545,Event,0.398008
15002,XXX:_Return_of_Xander_Cage,Q22075020,IN,2296142,Event,0.755607
15003,Women's_Premier_League_(cricket),Q115877036,IN,2233534,Event,0.754898
15004,YouTube,Q866,IN,2232855,TV,0.72173


In [47]:
GB_text_classification_df.head()

Unnamed: 0,article,qid,country_code,total_pageviews
10000,Cookie_(informatique),Q178995,GB,62366362
10001,Main_Page,Q5296,GB,14118850
10002,Lily_Savage,Q1416917,GB,1021566
10003,YouTube,Q866,GB,849475
10004,Charles_Bronson_(prisoner),Q967157,GB,590418


In [48]:
GB_labels = ["Person", "Place", "Event", "TV"]

In [49]:
resultsGB = classifier(
        GB_text_classification_df['article'].to_list(),
        candidate_labels=IN_labels,
        hypothesis_template= "This article is about: {}.",
        multi_label=False
    )

print("Classification complete.")

Classification complete.


In [50]:
GB_predicted_labels = [result['labels'][0] for result in resultsIN]
GB_predicted_scores = [result['scores'][0] for result in resultsIN]

In [51]:
GB_text_classification_df['predicted_category'] = GB_predicted_labels
GB_text_classification_df['category_score'] = GB_predicted_scores

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  GB_text_classification_df['predicted_category'] = GB_predicted_labels
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  GB_text_classification_df['category_score'] = GB_predicted_scores


In [53]:
GB_text_classification_df.tail()

Unnamed: 0,article,qid,country_code,total_pageviews,predicted_category,category_score
14995,Cockapoo,Q3241878,GB,13589,Place,0.40132
14996,RV_Petrel,Q47012687,GB,13588,Person,0.87998
14997,Tommy_Fleetwood,Q1865564,GB,13579,Person,0.42239
14998,Chet_Hanks,Q20993895,GB,13577,Person,0.810153
14999,Bristol_City_F.C.,Q19456,GB,13577,Person,0.861716


In [41]:
len(DE_qid_df)

4907

In [38]:
MODEL_NAME_DE = "joeddav/xlm-roberta-large-xnli"
DEVICE = 0 if torch.cuda.is_available() else -1

print(f"Loading model: {MODEL_NAME_DE} on device: {'GPU' if DEVICE == 0 else 'CPU'}")

# pipeline is a function from HuggingFace's transformers library


Loading model: joeddav/xlm-roberta-large-xnli on device: CPU


In [39]:
classifier_DE = pipeline(
    "zero-shot-classification",
    model=MODEL_NAME_DE,
    device=DEVICE,
    batch_size=32
)

classifier_DE

Some weights of the model checkpoint at joeddav/xlm-roberta-large-xnli were not used when initializing XLMRobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


<transformers.pipelines.zero_shot_classification.ZeroShotClassificationPipeline at 0x1270d4190>

In [40]:
DE_labels = ["Person", "Ort", "Ereignis", "Fernseher"]

In [41]:
resultsDE = classifier_DE(
        DE_text_classification_df['article'].to_list(),
        candidate_labels=DE_labels,
        hypothesis_template= "Dieser Artikel handelt von {}.",
        multi_label=False
    )

print("Classification complete.")

Classification complete.


In [42]:
predicted_labels = [result['labels'][0] for result in resultsDE]
predicted_scores = [result['scores'][0] for result in resultsDE]

In [43]:
DE_text_classification_df['predicted_category'] = predicted_labels
DE_text_classification_df['category_score'] = predicted_scores

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  DE_text_classification_df['predicted_category'] = predicted_labels
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  DE_text_classification_df['category_score'] = predicted_scores


In [46]:
DE_text_classification_df.head()

Unnamed: 0,article,qid,country_code,total_pageviews,predicted_category,category_score
20000,Cookie_(informatique),Q178995,DE,53862246,Ereignis,0.398149
20001,Main_Page,Q5296,DE,4811021,Ereignis,0.406531
20002,Website,Q35127,DE,1196442,Ort,0.6314
20003,Internationaler_Frauentag,Q38964,DE,417220,Ereignis,0.811293
20004,Der_Schwarm,Q1196780,DE,389195,Ereignis,0.562383


Now I have everything I need for all the text classification. I can move onto what I need to check 

Almost every qid has an instance of attribute, so I want to add a column to my dataframe that includes that value

## Wikidata verification

In [66]:
json_df = pd.read_json("entity_results2.jsonl", lines=True)

In [67]:
json_df.head()

Unnamed: 0,QID,status,label,description,attributes,error_message
0,Q5296,success,Wikimedia main page,main page of a Wikimedia project,"{'instance of': 'Wikimedia internal item', 'su...",
1,Q178995,success,HTTP cookie,small piece of data sent from a website and st...,"{'named after': 'cookie', 'Commons category': ...",
2,Q23685,success,Jimmy Carter,president of the United States from 1977 to 19...,"{'Perlentaucher ID': 'jimmy-carter', 'given na...",
3,Q866,success,YouTube,American video-sharing platform owned by Alpha...,"{'instance of': 'video streaming service', 'Co...",
4,Q42253,success,URL,web address to a particular file or page,"{'subclass of': 'Uniform Resource Identifier',...",


## From Gemini

In [64]:
import numpy as np

In [68]:
json_df["instance_of"] = json_df["attributes"].apply(
    lambda x: x.get("instance of") if isinstance(x, dict) else np.nan
)

In [69]:
json_df.head()

Unnamed: 0,QID,status,label,description,attributes,error_message,instance_of
0,Q5296,success,Wikimedia main page,main page of a Wikimedia project,"{'instance of': 'Wikimedia internal item', 'su...",,Wikimedia internal item
1,Q178995,success,HTTP cookie,small piece of data sent from a website and st...,"{'named after': 'cookie', 'Commons category': ...",,
2,Q23685,success,Jimmy Carter,president of the United States from 1977 to 19...,"{'Perlentaucher ID': 'jimmy-carter', 'given na...",,human
3,Q866,success,YouTube,American video-sharing platform owned by Alpha...,"{'instance of': 'video streaming service', 'Co...",,video streaming service
4,Q42253,success,URL,web address to a particular file or page,"{'subclass of': 'Uniform Resource Identifier',...",,technical standard


I just want to keep the qid and instance of columns

In [70]:
json_df = json_df.rename(columns={'QID': 'qid'})

In [71]:
to_merge = json_df[["qid", "instance_of"]]

In [61]:
merged = US_qid_df.merge(to_merge, on="qid", how="left")

In [62]:
merged.head()

Unnamed: 0,article,qid,country_code,total_pageviews,instance_of
0,Main_Page,Q5296,US,89005625,Wikimedia internal item
1,Cookie_(informatique),Q178995,US,49289112,
2,Jimmy_Carter,Q23685,US,4964868,human
3,YouTube,Q866,US,3624806,video streaming service
4,URL,Q42253,US,3366191,technical standard


I will do this with the rest of my countries also 

### This is the dataframe I will use to check my text classification predictions

These are all my text classification dataframes:
<li>US_text_classification_df
<li>JP_text_classification_df
<li>GB_text_classification_df
<li>IN_text_classification_df
<li>DE_text_classification_df

Now I need to get my ground truth from the wikidata for each country

In [63]:
US_text_classification_df.head()

Unnamed: 0,article,qid,country_code,total_pageviews,predicted_category,category_score
0,Main_Page,Q5296,US,89005625,Place,0.414593
1,Cookie_(informatique),Q178995,US,49289112,Person,0.379777
2,Jimmy_Carter,Q23685,US,4964868,Person,0.891086
3,メインページ,Q5296,US,4061575,Person,0.451591
4,YouTube,Q866,US,3624806,TV,0.865205


In [73]:
merged_US = US_text_classification_df.merge(to_merge, on="qid", how="left")

In [74]:
merged_US.head()

Unnamed: 0,article,qid,country_code,total_pageviews,predicted_category,category_score,instance_of
0,Main_Page,Q5296,US,89005625,Place,0.414593,Wikimedia internal item
1,Cookie_(informatique),Q178995,US,49289112,Person,0.379777,
2,Jimmy_Carter,Q23685,US,4964868,Person,0.891086,human
3,メインページ,Q5296,US,4061575,Person,0.451591,Wikimedia internal item
4,YouTube,Q866,US,3624806,TV,0.865205,video streaming service


In [75]:
merged_JP = JP_text_classification_df.merge(to_merge, on="qid", how="left")
merged_GB = GB_text_classification_df.merge(to_merge, on="qid", how="left")
merged_IN = IN_text_classification_df.merge(to_merge, on="qid", how="left")
merged_DE = DE_text_classification_df.merge(to_merge, on="qid", how="left")

In [76]:
merged_DE.head()

Unnamed: 0,article,qid,country_code,total_pageviews,predicted_category,category_score,instance_of
0,Cookie_(informatique),Q178995,DE,53862246,Ereignis,0.398149,
1,Main_Page,Q5296,DE,4811021,Ereignis,0.406531,Wikimedia internal item
2,Website,Q35127,DE,1196442,Ort,0.6314,type of website
3,Internationaler_Frauentag,Q38964,DE,417220,Ereignis,0.811293,world day
4,Der_Schwarm,Q1196780,DE,389195,Ereignis,0.562383,literary work


In [98]:
merged_US.to_csv("US_text_classification.csv", index=False)
merged_JP.to_csv("JP_text_classification.csv", index=False)
merged_GB.to_csv("GB_text_classification.csv", index=False)
merged_IN.to_csv("IN_text_classification.csv", index=False)
merged_DE.to_csv("DE_text_classification.csv", index=False)


In [99]:
US_humans = pd.read_csv('US_text_classification.csv')
JP_humans = pd.read_csv('JP_text_classification.csv')
GB_humans = pd.read_csv('GB_text_classification.csv')
IN_humans = pd.read_csv('IN_text_classification.csv')
DE_humans = pd.read_csv('DE_text_classification.csv')

In [129]:
human_qids = pd.concat([US_humans, JP_humans, GB_humans, IN_humans, DE_humans], ignore_index=True)

In [130]:
len(human_qids)

25000

In [133]:
human_qids.to_csv("human_qids.csv", index=False)

In [102]:
US_humans.head()

Unnamed: 0,article,qid,country_code,total_pageviews,predicted_category,category_score,instance_of
0,Main_Page,Q5296,US,89005625,Place,0.414593,Wikimedia internal item
1,Cookie_(informatique),Q178995,US,49289112,Person,0.379777,
2,Jimmy_Carter,Q23685,US,4964868,Person,0.891086,human
3,メインページ,Q5296,US,4061575,Person,0.451591,Wikimedia internal item
4,YouTube,Q866,US,3624806,TV,0.865205,video streaming service


In [81]:
all_data = pd.read_csv('final-project-data2.csv')

In [86]:
all_data.head()

Unnamed: 0,date,country_code,article,qid,pageviews,lang_code
0,2023-03-15,US,Main_Page,Q5296,7132908,en
1,2023-03-16,US,Main_Page,Q5296,4532076,en
2,2023-03-01,US,Cookie_(informatique),Q178995,4251750,fr
3,2023-03-17,US,Main_Page,Q5296,4233371,en
4,2023-03-10,US,Cookie_(informatique),Q178995,4158637,fr


In [134]:
filtered_all_data = all_data[
    all_data["qid"].isin(human_qids["qid"])
]

In [135]:
filtered_all_data.head()

Unnamed: 0,date,country_code,article,qid,pageviews,lang_code
0,2023-03-15,US,Main_Page,Q5296,7132908,en
1,2023-03-16,US,Main_Page,Q5296,4532076,en
2,2023-03-01,US,Cookie_(informatique),Q178995,4251750,fr
3,2023-03-17,US,Main_Page,Q5296,4233371,en
4,2023-03-10,US,Cookie_(informatique),Q178995,4158637,fr


In [136]:
len(filtered_all_data)

14316

In [106]:
filtered_all_data.to_csv("human-data.csv", index=False)

Now I am going to use my dataframes to get the accuracy of the text classifier

In [107]:
US_humans.head()

Unnamed: 0,article,qid,country_code,total_pageviews,predicted_category,category_score,instance_of
0,Main_Page,Q5296,US,89005625,Place,0.414593,Wikimedia internal item
1,Cookie_(informatique),Q178995,US,49289112,Person,0.379777,
2,Jimmy_Carter,Q23685,US,4964868,Person,0.891086,human
3,メインページ,Q5296,US,4061575,Person,0.451591,Wikimedia internal item
4,YouTube,Q866,US,3624806,TV,0.865205,video streaming service


In [110]:
#true positives
US_TP = (
    (US_humans['predicted_category'] == 'Person') &
    (US_humans['instance_of'] == 'human')
).sum()

#false positives
US_FP = (
    (US_humans['predicted_category'] == 'Person') &
    (US_humans['instance_of'] != 'human')
).sum()

#false negatives
US_FN = (
    (US_humans['predicted_category'] != 'Person') &
    (US_humans['instance_of'] == 'human')
).sum()

#true negatives
US_TN = (
    (US_humans['predicted_category'] != 'Person') &
    (US_humans['instance_of'] != 'human')
).sum()

US_TP, US_FP, US_FN, US_TN 


(np.int64(2748), np.int64(585), np.int64(34), np.int64(1633))

In [115]:
JP_humans.head()

Unnamed: 0,article,qid,country_code,total_pageviews,predicted_category,category_score,instance_of
0,メインページ,Q5296,JP,11080041,場所,0.754544,Wikimedia internal item
1,大谷翔平,Q4391858,JP,2274673,人,0.874603,human
2,ヌートバー,Q107315831,JP,1945207,人,0.662389,human
3,吉田正尚,Q22120815,JP,1636163,人,0.868529,human
4,栗山英樹,Q10855516,JP,1618438,人,0.822546,human


In [116]:
#true positives
JP_TP = (
    (JP_humans['predicted_category'] == '人') &
    (JP_humans['instance_of'] == 'human')
).sum()

#false positives
JP_FP = (
    (JP_humans['predicted_category'] == '人') &
    (JP_humans['instance_of'] != 'human')
).sum()

#false negatives
JP_FN = (
    (JP_humans['predicted_category'] != '人') &
    (JP_humans['instance_of'] == 'human')
).sum()

#true negatives
JP_TN = (
    (JP_humans['predicted_category'] != '人') &
    (JP_humans['instance_of'] != 'human')
).sum()

JP_TP, JP_FP, JP_FN, JP_TN 


(np.int64(2551), np.int64(1330), np.int64(179), np.int64(940))

In [113]:
#true positives
GB_TP = (
    (GB_humans['predicted_category'] == 'Person') &
    (GB_humans['instance_of'] == 'human')
).sum()

#false positives
GB_FP = (
    (GB_humans['predicted_category'] == 'Person') &
    (GB_humans['instance_of'] != 'human')
).sum()

#false negatives
GB_FN = (
    (GB_humans['predicted_category'] != 'Person') &
    (GB_humans['instance_of'] == 'human')
).sum()

#true negatives
GB_TN = (
    (GB_humans['predicted_category'] != 'Person') &
    (GB_humans['instance_of'] != 'human')
).sum()

GB_TP, GB_FP, GB_FN, GB_TN 


(np.int64(741), np.int64(1647), np.int64(813), np.int64(1799))

In [114]:
#true positives
IN_TP = (
    (IN_humans['predicted_category'] == 'Person') &
    (IN_humans['instance_of'] == 'human')
).sum()

#false positives
IN_FP = (
    (IN_humans['predicted_category'] == 'Person') &
    (IN_humans['instance_of'] != 'human')
).sum()

#false negatives
IN_FN = (
    (IN_humans['predicted_category'] != 'Person') &
    (IN_humans['instance_of'] == 'human')
).sum()

#true negatives
IN_TN = (
    (IN_humans['predicted_category'] != 'Person') &
    (IN_humans['instance_of'] != 'human')
).sum()

IN_TP, IN_FP, IN_FN, IN_TN 


(np.int64(1627), np.int64(761), np.int64(80), np.int64(2532))

In [117]:
DE_humans.head()

Unnamed: 0,article,qid,country_code,total_pageviews,predicted_category,category_score,instance_of
0,Cookie_(informatique),Q178995,DE,53862246,Ereignis,0.398149,
1,Main_Page,Q5296,DE,4811021,Ereignis,0.406531,Wikimedia internal item
2,Website,Q35127,DE,1196442,Ort,0.6314,type of website
3,Internationaler_Frauentag,Q38964,DE,417220,Ereignis,0.811293,world day
4,Der_Schwarm,Q1196780,DE,389195,Ereignis,0.562383,literary work


In [118]:
#true positives
DE_TP = (
    (DE_humans['predicted_category'] == 'Person') &
    (DE_humans['instance_of'] == 'human')
).sum()

#false positives
DE_FP = (
    (DE_humans['predicted_category'] == 'Person') &
    (DE_humans['instance_of'] != 'human')
).sum()

#false negatives
DE_FN = (
    (DE_humans['predicted_category'] != 'Person') &
    (DE_humans['instance_of'] == 'human')
).sum()

#true negatives
DE_TN = (
    (DE_humans['predicted_category'] != 'Person') &
    (DE_humans['instance_of'] != 'human')
).sum()

DE_TP, DE_FP, DE_FN, DE_TN

(np.int64(2132), np.int64(965), np.int64(36), np.int64(1867))

Okay, I have the stats for each country for labelling person and I can do my confusion matricies and whatnot

In [121]:
country_stats = {
    "US": {"TP": US_TP, "FP": US_FP, "FN": US_FN, "TN": US_TN},
    "JP": {"TP": JP_TP, "FP": JP_FP, "FN": JP_FN, "TN": JP_TN},
    "GB": {"TP": GB_TP, "FP": GB_FP, "FN": GB_FN, "TN": GB_TN},
    "IN": {"TP": IN_TP, "FP": IN_FP, "FN": IN_FN, "TN": IN_TN},
    "DE": {"TP": DE_TP, "FP": DE_FP, "FN": DE_FN, "TN": DE_TN}
}

In [122]:
for country, stats in country_stats.items():
    TP, FP, FN, TN = stats["TP"], stats["FP"], stats["FN"], stats["TN"]

    stats["precision"] = TP / (TP + FP) if (TP + FP) else 0
    stats["recall"]    = TP / (TP + FN) if (TP + FN) else 0
    stats["accuracy"]  = (TP + TN) / (TP + FP + FN + TN) if (TP + FP + FN + TN) else 0


In [123]:
country_stats

{'US': {'TP': np.int64(2748),
  'FP': np.int64(585),
  'FN': np.int64(34),
  'TN': np.int64(1633),
  'precision': np.float64(0.8244824482448245),
  'recall': np.float64(0.9877785765636233),
  'accuracy': np.float64(0.8762)},
 'JP': {'TP': np.int64(2551),
  'FP': np.int64(1330),
  'FN': np.int64(179),
  'TN': np.int64(940),
  'precision': np.float64(0.6573048183457871),
  'recall': np.float64(0.9344322344322344),
  'accuracy': np.float64(0.6982)},
 'GB': {'TP': np.int64(741),
  'FP': np.int64(1647),
  'FN': np.int64(813),
  'TN': np.int64(1799),
  'precision': np.float64(0.3103015075376884),
  'recall': np.float64(0.4768339768339768),
  'accuracy': np.float64(0.508)},
 'IN': {'TP': np.int64(1627),
  'FP': np.int64(761),
  'FN': np.int64(80),
  'TN': np.int64(2532),
  'precision': np.float64(0.681323283082077),
  'recall': np.float64(0.9531341534856473),
  'accuracy': np.float64(0.8318)},
 'DE': {'TP': np.int64(2132),
  'FP': np.int64(965),
  'FN': np.int64(36),
  'TN': np.int64(1867),
 

In [127]:
country_stats_df = (
    pd.DataFrame.from_dict(country_stats, orient="index")
      .reset_index()
      .rename(columns={"index": "country"})
)

country_stats_df

Unnamed: 0,country,TP,FP,FN,TN,precision,recall,accuracy
0,US,2748,585,34,1633,0.824482,0.987779,0.8762
1,JP,2551,1330,179,940,0.657305,0.934432,0.6982
2,GB,741,1647,813,1799,0.310302,0.476834,0.508
3,IN,1627,761,80,2532,0.681323,0.953134,0.8318
4,DE,2132,965,36,1867,0.688408,0.983395,0.7998


In [128]:
country_stats_df.to_csv("country_stats.csv", index=False)