<a href="https://colab.research.google.com/github/blue-create/langlens/blob/main/models/stepwise_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Stepwise Classification of Labels
1. DV vs. nonDV
2. problematic vs. non-problematic
3. category

### Imports

In [2]:
%%capture
!pip install transformers==4.20.0

In [3]:
# PACKAGES
import pandas as pd
import os
import numpy as np
import json
import matplotlib.pyplot as plt
import plotly.express as px
from transformers import AutoTokenizer
from sklearn.manifold import TSNE
# MODELLING
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, ConfusionMatrixDisplay
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier

from scipy.stats import randint

In [4]:
# connect with google drive
from google.colab import drive
drive.mount('/content/drive')
# change cwd
%cd drive/MyDrive/Work/Frontline/data/


Mounted at /content/drive
/content/drive/.shortcut-targets-by-id/1WfnZsqpG1r110J63sMbfS5TpsDOkveiV/data


In [5]:
# CUSTOM PACKAGES
from scripts import annotations

### Loading the annotations exported from Elinor

In [6]:
# dict of DataFrames, one per annotated JSON file
dfs={}
for doc in os.listdir("annotated/new_ontology"):
  if doc.endswith(".json"):
    # read JSON data (the context manager closes the file handle)
    with open(os.path.join("annotated/new_ontology", doc), encoding="utf-8") as f:
      json_data=json.load(f)
    # convert to dataframe and remember the source file
    data=pd.DataFrame(json_data["documents"])
    data.loc[:,"file"]=doc
    dfs[doc]=data

# stack all files into a single DataFrame
data=pd.concat(dfs,ignore_index=True)

Extract: artikel_id, titel, annotations


In [7]:
data.loc[:,"artikel_id"]=data.attributes_flat.apply(lambda x: x["artikel_id"])
data.loc[:,"titel"]=data.attributes_flat.apply(lambda x: x["titel"])
data.loc[:,"annotations"]=data.loc[:,"annotations"].apply(annotations.extract_annotations)


### 1. "Domestic Violence" vs. not "Domestic Violence"

Binary label: Other / Domestic Violence

In [8]:
data.loc[:,"label"]=data.annotations.apply(lambda x: "Other" if x=={} else "Domestic Violence")

In [9]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")

In [10]:
tokens=data.text.apply(lambda x: tokenizer(x,padding="max_length")["input_ids"])

In [11]:
cat = {'Other': 1,'Domestic Violence': 0}  # class 0 = Domestic Violence, class 1 = Other in the reports below

In [12]:
y= [cat[item] for item in data.label]

In [13]:
X_train, X_test, y_train, y_test = train_test_split(tokens, y, test_size=0.20)
X_train = X_train.apply(pd.Series).to_numpy()
X_test = X_test.apply(pd.Series).to_numpy()
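The split above passes neither `stratify` nor `random_state`, so the class ratio in the test set and the reported numbers can change from run to run. A minimal sketch of a stratified, reproducible split follows. It uses synthetic stand-in data (`X_demo` and `y_demo` are hypothetical, not the notebook's articles), and the seed value is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 samples, 27% of them in class 0
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 30000, size=(1000, 8))
y_demo = np.array([0] * 270 + [1] * 730)

# stratify keeps the class ratio (approximately) equal in both parts;
# random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=42
)

print("train class counts:", np.bincount(y_tr))
print("test class counts: ", np.bincount(y_te))
```

With a fixed seed, the models in this section would be compared on the same test articles.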

KNN

In [14]:
# Use the KNN classifier to fit data:
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

# Predict y data with classifier:
y_predict = classifier.predict(X_test)

# Print results:
print(confusion_matrix(y_test, y_predict))
print(classification_report(y_test, y_predict))

[[ 22 138]
 [ 66 374]]
              precision    recall  f1-score   support

           0       0.25      0.14      0.18       160
           1       0.73      0.85      0.79       440

    accuracy                           0.66       600
   macro avg       0.49      0.49      0.48       600
weighted avg       0.60      0.66      0.62       600
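The report can be checked by hand from the confusion matrix. The sketch below recomputes the per-class figures from the four printed counts. Rows are true classes and columns are predicted classes, and under the `cat` mapping class 0 is Domestic Violence.

```python
import numpy as np

# Confusion matrix printed by the KNN run above
# (rows: true class, columns: predicted class)
cm = np.array([[22, 138],
               [66, 374]])

precision = np.diag(cm) / cm.sum(axis=0)  # correct / everything predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # correct / everything truly in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.trace(cm) / cm.sum()

print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))
print("f1:       ", np.round(f1, 2))
print("accuracy: ", round(float(accuracy), 2))
```

A recall of 22/160 for class 0 means that only 22 of the 160 Domestic Violence articles in the test set were found, even though overall accuracy is 0.66.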



Random Forest

In [15]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

In [16]:
confusion_matrix(y_test, y_pred)

array([[  4, 156],
       [  4, 436]])
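For context, it helps to compare this matrix with a trivial baseline that always predicts the majority class. The sketch below rebuilds the test labels from the printed class counts (160 and 440) and fits scikit-learn's `DummyClassifier` on them. Fitting on the test labels is only for illustration and assumes the training split has the same majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Test-set class counts taken from the confusion matrices above
y_true = np.array([0] * 160 + [1] * 440)
X_dummy = np.zeros((len(y_true), 1))  # DummyClassifier ignores the features

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
baseline_acc = accuracy_score(y_true, baseline.predict(X_dummy))

# Accuracy of the random forest, read off the diagonal of its confusion matrix
rf_acc = (4 + 436) / 600

print(round(baseline_acc, 3), round(rf_acc, 3))
```

Both values are 440/600, so on this split the random forest is no more accurate than always predicting class 1.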

GradientBoostingTree

In [17]:
clf = GradientBoostingClassifier(n_estimators=1000, learning_rate=0.01, max_depth=10, random_state=0).fit(X_train, y_train)
y_pred=clf.predict(X_test)

In [18]:
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[  4 156]
 [ 10 430]]
              precision    recall  f1-score   support

           0       0.29      0.03      0.05       160
           1       0.73      0.98      0.84       440

    accuracy                           0.72       600
   macro avg       0.51      0.50      0.44       600
weighted avg       0.61      0.72      0.63       600



### 2. Problematic vs. Non-Problematic Articles

In [19]:
data_dv=data[data.annotations.apply(len)!=0].copy()  # copy so that adding a column does not trigger SettingWithCopyWarning

In [20]:
data_dv.loc[:,"label"]=data_dv.annotations.apply(lambda x: "DV" if x["K"]=={"Domestic Violence"} else "Problematic" )
data_dv.label.value_counts()

DV             751
Problematic    124
Name: label, dtype: int64

In [21]:
tokens=data_dv.text.apply(lambda x: tokenizer(x,padding="max_length")["input_ids"])

In [22]:
cat = {'DV': 1,'Problematic': 0}  # class 0 = Problematic, class 1 = DV in the reports below

In [23]:
y= [cat[item] for item in data_dv.label]

In [24]:
X_train, X_test, y_train, y_test = train_test_split(tokens, y, test_size=0.20)
X_train = X_train.apply(pd.Series).to_numpy()
X_test = X_test.apply(pd.Series).to_numpy()

KNN

In [25]:
# Use the KNN classifier to fit data:
classifier = KNeighborsClassifier(n_neighbors=4)
classifier.fit(X_train, y_train)

# Predict y data with classifier:
y_predict = classifier.predict(X_test)

# Print results:
print(confusion_matrix(y_test, y_predict))
print(classification_report(y_test, y_predict))

[[  1  19]
 [ 10 145]]
              precision    recall  f1-score   support

           0       0.09      0.05      0.06        20
           1       0.88      0.94      0.91       155

    accuracy                           0.83       175
   macro avg       0.49      0.49      0.49       175
weighted avg       0.79      0.83      0.81       175



Random Forest

In [26]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

In [27]:
confusion_matrix(y_test, y_pred)

array([[  0,  20],
       [  0, 155]])
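The random forest assigns every test article to class 1, which is a common symptom of class imbalance (751 DV vs. 124 Problematic articles). One frequently used option is to weight classes inversely to their frequency. The sketch below only computes those weights from the counts reported above and shows how they could be passed to the classifier. Whether this improves the results here has not been tested, and `rf_balanced` is a hypothetical name.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Label vector rebuilt from the value counts above: 0 = Problematic, 1 = DV
y_counts = np.array([0] * 124 + [1] * 751)

# "balanced" weights are n_samples / (n_classes * class_count)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_counts
)
class_weight = dict(zip([0, 1], weights))
print(class_weight)

# Equivalent shortcut: let the forest derive the same weights from y_train
rf_balanced = RandomForestClassifier(class_weight="balanced", random_state=0)
```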

GradientBoostingTree

In [31]:
clf = GradientBoostingClassifier(n_estimators=500, learning_rate=0.01, max_depth=10, random_state=0).fit(X_train, y_train)
y_pred=clf.predict(X_test)

In [32]:
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[  0  20]
 [  6 149]]
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        20
           1       0.88      0.96      0.92       155

    accuracy                           0.85       175
   macro avg       0.44      0.48      0.46       175
weighted avg       0.78      0.85      0.81       175



### 3. Focus Categories

In [33]:

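def hamming_score(y_true, y_pred):
    # Per-article overlap of true and predicted label sets:
    # |true AND predicted| / |true OR predicted|, averaged over all articles.
    return (
        (y_true & y_pred).sum(axis=1) / (y_true | y_pred).sum(axis=1)
    ).mean()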
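A small worked example makes the metric concrete. The sketch below repeats the function so that it runs on its own and applies it to two made-up articles with three hypothetical categories. It also compares the result with scikit-learn's `jaccard_score(average="samples")`, which computes a per-sample intersection over union as well.

```python
import numpy as np
from sklearn.metrics import jaccard_score

def hamming_score(y_true, y_pred):
    return (
        (y_true & y_pred).sum(axis=1) / (y_true | y_pred).sum(axis=1)
    ).mean()

# Two made-up articles, three hypothetical categories
y_true_demo = np.array([[1, 1, 0],
                        [0, 1, 0]])
y_pred_demo = np.array([[1, 0, 0],
                        [0, 1, 0]])

# Article 1: 1 shared label out of 2 distinct labels -> 0.5
# Article 2: 1 shared label out of 1 distinct label  -> 1.0
score = hamming_score(y_true_demo, y_pred_demo)
reference = jaccard_score(y_true_demo, y_pred_demo, average="samples")
print(score, reference)
```

If an article has neither true nor predicted labels, the union is empty and the division yields NaN for that row, so such rows need separate handling.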


In [34]:
data_cat=data_dv[data_dv.label=="Problematic"].copy()  # copy to avoid SettingWithCopyWarning

In [35]:
# list of annotated categories per article
data_cat.loc[:,"label"]=data_cat.annotations.apply(lambda x: list(x["K"]))

In [36]:
# join the categories with "*" so that str.get_dummies can split them again
data_cat["label"]=["*".join(i) for i in data_cat.label]


In [37]:
y=data_cat.label.str.get_dummies(sep="*")
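The call above turns the `*`-joined label strings into one 0/1 column per category, which is the multi-label target used in this step. A minimal sketch with made-up category names (`A`, `B`, `C` are hypothetical):

```python
import pandas as pd

# Hypothetical joined labels for three articles
labels_demo = pd.Series(["A*B", "B", "A*C"])

# One indicator column per distinct category, columns sorted alphabetically
y_demo = labels_demo.str.get_dummies(sep="*")
print(y_demo)
```

Each row can contain several ones, so `y` is a matrix rather than a single class vector.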

In [38]:
tokens=data_cat.text.apply(lambda x: tokenizer(x,padding="max_length")["input_ids"])

In [39]:
X_train, X_test, y_train, y_test = train_test_split(tokens, y, test_size=0.20)
X_train = X_train.apply(pd.Series).to_numpy()
X_test = X_test.apply(pd.Series).to_numpy()

KNN

In [40]:
# Use the KNN classifier to fit data:
classifier = KNeighborsClassifier(n_neighbors=4)
classifier.fit(X_train, y_train)

# Predict y data with classifier:
y_predict = classifier.predict(X_test)

# Print results:

hamming_score(y_test, y_predict)

0.16

Random Forest

In [41]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

In [44]:
hamming_score(y_test, y_pred)

0.72

GradientBoostingTree (to do: try another model)
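As far as I know, `GradientBoostingClassifier` expects a one-dimensional target and does not accept a multi-label indicator matrix such as `y` directly. One possible route is to wrap it in scikit-learn's `MultiOutputClassifier`, which fits one independent model per category. The sketch below uses random stand-in data (`X_demo` and `Y_demo` are hypothetical) and a small `n_estimators` to keep it fast. It is not a tuned setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.multioutput import MultiOutputClassifier

# Hypothetical stand-in data: 60 samples, 16 token-id features, 3 categories
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 30000, size=(60, 16))
Y_demo = rng.integers(0, 2, size=(60, 3))

# One gradient boosting model is trained per label column
multi_gb = MultiOutputClassifier(
    GradientBoostingClassifier(n_estimators=20, random_state=0)
)
multi_gb.fit(X_demo, Y_demo)

# Predictions come back as a 0/1 matrix with one column per category
Y_hat = multi_gb.predict(X_demo[:5])
print(Y_hat)
```

The predicted matrix has the same layout as `y_test`, so it could be passed to `hamming_score` in the same way as the KNN and random forest outputs.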