Cluster bias: category

Flat clustering (partitioning) -> clustering should be k-means

The cluster elements are document vectors

Extrinsic evaluation: we compare the clusters against the ground truth (the categories)

Metrics:

Similarity between elements in a cluster

Homogeneity, Completeness and V-measure, Rand index, Adjusted Rand index, Fowlkes-Mallows score, Mutual-information-based scores ...
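
As a rough illustration of what the pair-counting scores in this list measure, here is a minimal sketch of the (unadjusted) Rand index written by hand. The function name `rand_index` and the toy labels are made up for this example; in the notebook itself the scikit-learn implementations would be the natural choice.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of item pairs on which both labelings agree (same group vs. different group)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agreements = 0
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agreements += 1
    return agreements / len(pairs)

# Toy data (hypothetical). In `swapped` the cluster ids are exchanged relative
# to the truth, but the grouping itself is identical.
truth = ["Chemistry", "Chemistry", "Physics", "Physics"]
swapped = [1, 1, 0, 0]
crossed = [0, 1, 0, 1]

print(rand_index(truth, swapped))
print(rand_index(truth, crossed))
```

The score only looks at which items are grouped together, so it does not depend on which integer id a cluster happens to receive.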

Even though the silhouette coefficient is meant for unsupervised evaluation (it needs no ground truth), we can check it to get an intuition of whether the models struggle because the texts are too different. We could also compare the silhouette scores across the different text versions to see which one is more appropriate.
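
A possible way to run that comparison is sketched below, assuming each text version is vectorized and clustered separately. The helper `silhouette_for` and the two tiny corpora are hypothetical stand-ins, not the notebook's data.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

def silhouette_for(texts, n_clusters=2, seed=42):
    """Vectorize with TF-IDF, cluster with k-means, return the silhouette score."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
    labels = KMeans(n_clusters=n_clusters, n_init=5, random_state=seed).fit_predict(vectors)
    return silhouette_score(vectors, labels)

# Hypothetical version A: short, topic-focused texts
version_a = [
    "physicist studied quantum theory radiation atoms",
    "quantum physicist worked radiation theory electrons",
    "physicist studied electrons atoms quantum radiation",
    "chemist studied organic molecules synthesis reactions",
    "organic chemist worked synthesis reactions polymers",
    "chemist studied polymers molecules organic synthesis",
]
# Hypothetical version B: same texts padded with words shared by both categories
version_b = [text + " born university professor nobel prize" for text in version_a]

scores = {"version_a": silhouette_for(version_a), "version_b": silhouette_for(version_b)}
for name, score in scores.items():
    print(name, round(score, 3))
```

The silhouette score lies between -1 and 1, with higher values indicating tighter, better separated clusters. It says nothing about agreement with the categories, so it complements the extrinsic metrics without replacing them.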

In [108]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
from nltk import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import re

In [109]:
df = pd.read_csv("physics_and_chemistry_nobel_laureate.csv")

In [110]:
df.head()

Unnamed: 0,year,name,country,category,biography
0,1901,Wilhelm Röntgen,German Empire,Physics,Wilhelm Conrad Röntgen (; German pronunciation...
1,1902,Hendrik Lorentz,Netherlands,Physics,Hendrik Antoon Lorentz (; 18 July 1853 – 4 Feb...
2,1902,Pieter Zeeman,Netherlands,Physics,Pieter Zeeman (Dutch: [ˈzeːmɑn]; 25 May 1865 –...
3,1903,Henri Becquerel,France,Physics,Antoine Henri Becquerel (; French pronunciatio...
4,1903,Pierre Curie,France,Physics,"Pierre Curie ( KURE-ee, French: [pjɛʁ kyʁi]; 1..."


In [111]:
corpus = df["biography"].to_list()
print(len(corpus))

200


Preprocess to remove all the Wiki characters

In [112]:
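def contains_alphanum(word):
    # re.match anchors at the start of the token: it passes if the token starts
    # with a word character followed by at least one more character, or with a digit
    regex = r'\w.+?|[0-9]|\d.+?'
    return bool(re.match(regex, word))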
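
To make the behaviour of this filter concrete, here is a small self-contained check of the same pattern on a few sample tokens (the tokens are made up, modelled on what shows up in the biographies). One thing it surfaces: single-letter tokens such as "a" are rejected because `\w.+?` needs at least two characters, while a lone digit still passes through the `[0-9]` alternative.

```python
import re

def contains_alphanum(word):
    # Same pattern as in the notebook cell
    regex = r'\w.+?|[0-9]|\d.+?'
    return bool(re.match(regex, word))

# Hypothetical sample tokens
samples = ["Curie", "1853", "7", "a", "(", ";", "–"]
for token in samples:
    print(repr(token), contains_alphanum(token))
```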

In [113]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [114]:
def filter_tokens(text):
    doc = nlp(text)
    # Keep only the tokens accepted by contains_alphanum (drops punctuation and other Wiki symbols)
    tokens = [token.text for token in doc if contains_alphanum(token.text)]
    # Debug helper: uncomment to inspect which tokens get dropped
    # print({token.text for token in doc if not contains_alphanum(token.text)})
    clean_text = " ".join(tokens)
    return clean_text

In [115]:
df["clean"] = df["biography"].apply(filter_tokens)

Random idea: t-test for math characters -> justify whether we can use this regex for both categories

## Model

### Data splitting

In [116]:
from sklearn.model_selection import train_test_split

X = df["clean"].to_numpy()
y = df["category"].to_numpy()

In [117]:
# train_proportion = 0.8
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y, shuffle=True)

### Data vectorization

In [118]:
vectorizer = TfidfVectorizer(max_features=20,
                             use_idf=True,
                             stop_words='english',
                             tokenizer=word_tokenize)

In [119]:
doc_vectors_train = vectorizer.fit_transform(X_train)
doc_vectors_test = vectorizer.transform(X_test)



In [120]:
vectorizer.get_feature_names_out()

array(['american', 'born', 'chemistry', 'einstein', 'institute',
       'laboratory', 'life', 'new', 'nobel', 'physics', 'prize',
       'professor', 'research', 'science', 'society', 'theory', 'time',
       'university', 'war', 'work'], dtype=object)

### Clustering

In [121]:
km = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=5, verbose=0, random_state=42)

In [122]:
km.fit(doc_vectors_train)

In [123]:
km.labels_

array([0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0,
       1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1,
       0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1,
       1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0,
       1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1,
       0, 0, 1, 1, 1, 0])

In [150]:
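# km.labels_ holds the cluster assigned to each training document during fit;
# it should match km.predict(doc_vectors_train), so it can serve as the train "predictions"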

y_clusters_train = km.labels_.tolist()
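
One way to settle the question of whether `labels_` can stand in for predictions on the training set is to compare it with `predict` on the same data. The sketch below does this on synthetic 2D blobs (a hypothetical stand-in for `doc_vectors_train`); the names `X_demo` and `km_demo` are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs (hypothetical stand-in for the document vectors)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 2)),
])

km_demo = KMeans(n_clusters=2, init="k-means++", n_init=5, random_state=42).fit(X_demo)

# labels_ stores the assignments found during fit;
# predict assigns each point to its nearest learned centroid
same = bool(np.array_equal(km_demo.labels_, km_demo.predict(X_demo)))
print(same)
```

The same check could be run in the notebook with `km` and `doc_vectors_train` before relying on `km.labels_` for the training rows.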

In [128]:
y_pred = km.predict(doc_vectors_test)

In [129]:
y_pred

array([0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0])

## Visualizations (TO DO)

TO DO

## Results

In [139]:
cluster_df_test = pd.DataFrame({"text": X_test, "target_category": y_test, "predicted_category": y_pred, "data_split":"test"})

In [140]:
cluster_df_train = pd.DataFrame({"text": X_train, "target_category": y_train, "predicted_category": y_clusters_train, "data_split":"train"})

In [141]:
cluster_df = pd.concat([cluster_df_test, cluster_df_train])

In [143]:
cluster_df[cluster_df["data_split"] == "train"]

Unnamed: 0,text,target_category,predicted_category,data_split
0,Hermann Staudinger German pronunciation ˈhɛʁma...,Chemistry,0,train
1,Rudolf Ludwig Mössbauer German spelling Mößbau...,Physics,1,train
2,Gerhard Heinrich Friedrich Otto Julius Herzber...,Chemistry,1,train
3,Wilhelm Conrad Röntgen German pronunciation ˈv...,Physics,1,train
4,Frits Zernike Dutch pronunciation ˈfrɪtˈsɛrnik...,Physics,1,train
...,...,...,...,...
155,Frederick Soddy FRS 2 September 1877 22 Septem...,Chemistry,0,train
156,Yang Chen Ning or Chen Ning Yang simplified Ch...,Physics,1,train
157,Alexander Mikhailovich Prokhorov born Alexande...,Physics,1,train
158,Patrick Maynard Stuart Blackett Baron Blackett...,Physics,1,train


In [144]:
cluster_df[cluster_df["data_split"] == "test"]

Unnamed: 0,text,target_category,predicted_category,data_split
0,Adolf Friedrich Johann Butenandt German pronun...,Chemistry,0,test
1,Victor Franz Hess German ˈvɪktoːɐ̯ fʁants ˈhɛs...,Physics,1,test
2,Reona Esaki 江崎 玲於奈 Esaki Reona born March 12 1...,Physics,1,test
3,Ivar Giaever Norwegian Giæver IPA ˈìːvɑr ˈjèːv...,Physics,1,test
4,Adolf Otto Reinhold Windaus German pronunciati...,Chemistry,0,test
5,Pierre Curie KURE ee French pjɛʁ kyʁi 15 May 1...,Physics,1,test
6,Irving Langmuir January 31 1881 August 16 1957...,Chemistry,0,test
7,Nikolay Gennadiyevich Basov Russian Никола́й Г...,Physics,1,test
8,Willis Eugene Lamb Jr. July 12 1913 May 15 200...,Physics,1,test
9,Vladimir Prelog 23 July 1906 7 January 1998 wa...,Chemistry,0,test


In [145]:
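# Cluster ids are arbitrary: this mapping assumes cluster 0 gathered mostly Chemistry
# and cluster 1 mostly Physics, as the tables above suggest
label_dictionary = {0:"Chemistry", 1:"Physics"}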
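
Since k-means assigns cluster ids arbitrarily, a hard-coded mapping can silently flip between runs. A minimal sketch of deriving it from the data instead, by majority vote within each cluster, is shown below; `majority_label_map` and the demo lists are hypothetical names introduced for illustration.

```python
from collections import Counter

def majority_label_map(cluster_ids, categories):
    """Map each cluster id to the most frequent true category among its members."""
    counts = {}
    for cluster_id, category in zip(cluster_ids, categories):
        counts.setdefault(cluster_id, Counter())[category] += 1
    return {cluster_id: counter.most_common(1)[0][0]
            for cluster_id, counter in counts.items()}

# Toy data (hypothetical): cluster 1 contains one Chemistry document
demo_clusters = [0, 0, 1, 1, 1]
demo_categories = ["Chemistry", "Chemistry", "Physics", "Physics", "Chemistry"]

print(majority_label_map(demo_clusters, demo_categories))
# → {0: 'Chemistry', 1: 'Physics'}
```

In the notebook this could be called with `y_clusters_train` and `y_train`. Note that nothing prevents two clusters from mapping to the same category, which would itself be a sign that the clustering does not follow the categories.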

In [146]:
cluster_df = cluster_df.replace(label_dictionary)

In [149]:
cluster_df[cluster_df["target_category"] != cluster_df["predicted_category"]]

Unnamed: 0,text,target_category,predicted_category,data_split
2,Gerhard Heinrich Friedrich Otto Julius Herzber...,Chemistry,Physics,train
15,Ernest Rutherford 1st Baron Rutherford of Nels...,Chemistry,Physics,train
58,Sir Aaron Klug 11 August 1926 20 November 2018...,Chemistry,Physics,train
63,Edwin Mattison McMillan September 18 1907 Sept...,Chemistry,Physics,train
70,Viscount Ilya Romanovich Prigogine Russian Иль...,Chemistry,Physics,train
81,Peter Joseph William Debye Dutch dəˈbɛiə March...,Chemistry,Physics,train
88,Walter Gilbert born March 21 1932 is an Americ...,Chemistry,Physics,train
159,Maria Salomea Skłodowska Curie Polish ˈmarja s...,Physics,Chemistry,train


### Metrics

In [151]:
y_test

array(['Chemistry', 'Physics', 'Physics', 'Physics', 'Chemistry',
       'Physics', 'Chemistry', 'Physics', 'Physics', 'Chemistry',
       'Physics', 'Physics', 'Chemistry', 'Physics', 'Chemistry',
       'Chemistry', 'Physics', 'Chemistry', 'Chemistry', 'Physics',
       'Chemistry', 'Physics', 'Physics', 'Physics', 'Physics', 'Physics',
       'Physics', 'Chemistry', 'Physics', 'Chemistry', 'Physics',
       'Chemistry', 'Chemistry', 'Chemistry', 'Chemistry', 'Chemistry',
       'Chemistry', 'Chemistry', 'Physics', 'Chemistry'], dtype=object)

In [152]:
y_pred

array([0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0])

In [160]:
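# Encode the ground-truth categories with the same integer ids as the clusters (y_pred stays untouched)
y_test = list(map(lambda x: x.replace("Chemistry","0").replace("Physics","1"), y_test))
y_test = np.array(y_test).astype(int)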

In [161]:
y_test

array([0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0])

In [162]:
from sklearn import metrics

# Signature is (labels_true, labels_pred): compare the ground truth with the cluster assignments
print("Homogeneity: %0.3f" % metrics.homogeneity_score(y_test, y_pred))
print("Completeness: %0.3f" % metrics.completeness_score(y_test, y_pred))
print("V-measure: %0.3f" % metrics.v_measure_score(y_test, y_pred))
print("Adjusted Rand-Index: %.3f"% metrics.adjusted_rand_score(y_test, y_pred))


Homogeneity: 1.000
Completeness: 1.000
V-measure: 1.000
Adjusted Rand-Index: 1.000
