# Iterative Null Space Projection (INLP)

Given a set of word embeddings $X = \{x_1, x_2, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$, such as professions (doctor, nurse, teacher, etc.), and corresponding attributes $Z = \{z_1, z_2, \ldots, z_n\}$, such as instances of gender (male, female, man, woman, etc.), we aim to find a transformation $g$ such that $z_i$ cannot be predicted from $g(x_i)$. In other words, if we have a classifier $c$ used to predict a person's profession from $x_i$, we build an auxiliary model $c'$ that tries to predict their gender from the same representation.

If $c'$ is a classifier with parameters $W$ (e.g. the weights of a linear layer in a deep learning model), we want a projection matrix $P$ such that $W(Px)=0$ for all $x$, rendering the parameters $W$ useless on the projected data.

Additional classifiers $W'$ are then trained on the projected data, and the process is repeated until no linear information regarding $Z$ remains in $X$. Each $P$ is constructed using nullspace projection. In other words, we keep retraining our auxiliary model $c'$ until it predicts the protected attribute no better than random guessing (about $50\%$ accuracy for a balanced binary attribute). The projection matrix is then applied to the weights of the original model in the hope that the dependence of the output on the sensitive attribute is removed.

The relationship between $W$ and an input $x$ is as follows: $x$ is projected onto the subspace spanned by the rows of $W$ and is classified using the dot products between $x$ and those rows. Therefore, if we zero the components of $x$ in the direction of $W$'s row space, we remove the information that $W$ uses to make a prediction from $x$, i.e. the ability to predict the sensitive attribute in $Z$ from the inputs in $X$.

Algebraically, this is equivalent to projecting $x$ onto the nullspace of $W$, $N(W)$:
$$
N(W) = \{ x \mid Wx = 0 \}
$$
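
As a minimal sketch of this construction (assuming NumPy and a small hypothetical weight matrix `W`; this is not the code used by the `old_debias` module later in the notebook), the projection onto $N(W)$ can be built from an orthonormal basis of the row space of $W$:

```python
import numpy as np

def nullspace_projection(W, tol=1e-10):
    """Return a matrix P that projects vectors onto the nullspace of W."""
    W = np.atleast_2d(W)
    # Right singular vectors with non-zero singular values span the row space of W.
    _, s, vt = np.linalg.svd(W, full_matrices=False)
    basis = vt[s > tol]                      # shape (rank, d), orthonormal rows
    # Remove the row-space component from the identity.
    return np.eye(W.shape[1]) - basis.T @ basis

# Hypothetical 1 x 3 classifier weight matrix, for illustration only.
W = np.array([[1.0, 2.0, 0.0]])
P = nullspace_projection(W)
x = np.array([3.0, -1.0, 5.0])
print(W @ (P @ x))   # numerically zero
```

Because `P` is an orthogonal projection, it is symmetric and applying it twice has no further effect.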


The dataset we are using is an augmented version of the German Credit Scoring dataset in which the ethnicity and marital status of each candidate have been randomly generated. The dataset has also been formatted as prompts which could be used with an LLM. Here we train an SGD classifier to determine whether a candidate is a good or bad risk. We then attempt to use nullspace projection to remove the dependence of the prediction on gender.

**The data for this particular use case is somewhat problematic. INLP requires training an auxiliary model with a flipped target, i.e. in our task we end up training a model to predict the gender from the prompt; however, the gender appears in the prompt. Removing the gender from the prompt may make it impossible to predict the gender from the prompt (as there are no other features in the prompts on which to base a prediction) and, ironically, also removes any dependence of the output on gender, removing the need for INLP. As more use cases appear, we will update this notebook.**

In [1]:
# Imports

import sys
import old_debias
import classifier

import numpy as np
import pandas as pd
import random
import sklearn
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn import cluster
from sklearn import metrics
from sklearn.manifold import TSNE
from sklearn.svm import LinearSVC, SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier, SGDRegressor, Perceptron, LogisticRegression

import nltk
import tqdm

In [2]:
# Load dataset

url='https://drive.google.com/file/d/1N7z7Nn4IuQjZSv1eqLh32HKrKKaLrsoQ/view?usp=sharing'
file_id=url.split('/')[-2]
dwn_url='https://drive.google.com/uc?id=' + file_id
df=pd.read_csv(dwn_url)

In [3]:
# Drop the 'Sex' column; it is kept separately later as the sensitive attribute

data = df.drop('Sex', axis = 1)
data

| Unnamed: 0 | Prompt | Risk |
|---|---|---|
| 0 | Using the following characteristics, assess wh... | good |
| 1 | Using the following characteristics, assess wh... | bad |
| 2 | Using the following characteristics, assess wh... | good |
| 3 | Using the following characteristics, assess wh... | good |
| 4 | Using the following characteristics, assess wh... | bad |
| ... | ... | ... |
| 995 | Using the following characteristics, assess wh... | good |
| 996 | Using the following characteristics, assess wh... | good |
| 997 | Using the following characteristics, assess wh... | good |
| 998 | Using the following characteristics, assess wh... | bad |


## Fitting a standard ML model to our data


We separate our dataset into the inputs (the prompts) and the outputs (the risk). Later we want to check whether the gender of the individual has a bearing on the output, so we isolate this column too.

In [4]:
# Train/Test Splits

X = data['Prompt']
y = data['Risk']
y_gender = df['Sex']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, shuffle = False, stratify = None)


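Tokenising the text. Two tokenisers are defined below: one using NLTK and a simpler one that splits on whitespace. The pipeline in the next cell uses the whitespace version.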

In [5]:
# Tokenisation

def nltk_tokenization(text):
    tokens = nltk.word_tokenize(text)
    return tokens

def built_in_tokenization(text):
    tokens = text.split()
    return tokens
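
To make the behaviour of the whitespace tokeniser concrete, the short sketch below applies it to a made-up prompt fragment (the string is hypothetical and not taken from the dataset). Whitespace splitting keeps punctuation attached to the neighbouring word, whereas `nltk.word_tokenize` would normally split punctuation into separate tokens.

```python
def built_in_tokenization(text):
    # Split on any run of whitespace; punctuation stays attached to words.
    return text.split()

# Hypothetical prompt fragment, for illustration only.
example = "Sex: male, Age: 35"
tokens = built_in_tokenization(example)
print(tokens)  # ['Sex:', 'male,', 'Age:', '35']
```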

In [6]:
# Fit a standard ML model to the data

clf = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=built_in_tokenization)),
    ('selection', SelectKBest(chi2, k=894)),
#     ('classifier', LogisticRegression())
    ('classifier', SGDClassifier(warm_start=True, loss='log_loss', n_jobs=64, max_iter=75, random_state=0))
])

clf.fit(X_train, y_train)
   



Our SGD classifier achieves roughly 70% accuracy on the test set.

In [7]:
clf.score(X_test, y_test)

0.705

## Debiasing the dataset using Null Space Projection

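We now train a model to predict the gender based on the prompts. First, the prompts are converted to the same count-based features that the classifier above was trained on.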

In [8]:
# Vectorise the prompts using the fitted vectoriser and feature-selection steps

X_train_one_hot = clf.named_steps['selection'].transform(clf.named_steps['vectorizer'].transform(X_train))
X_test_one_hot = clf.named_steps['selection'].transform(clf.named_steps['vectorizer'].transform(X_test))


In [9]:
# Arrays for sensitive features

X_train_gender, X_test_gender, y_train_gender, y_test_gender = train_test_split(X, y_gender, test_size = 0.2, shuffle = False, stratify = None)
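
The split above reuses the same `test_size` and `shuffle=False` as the main split, so every array is cut at the same row and the gender labels stay aligned with the prompts. As an alternative sketch (toy arrays, not the notebook's data), `train_test_split` also accepts several arrays in one call, which makes that alignment explicit:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the prompts, risk labels and gender labels.
X_toy = np.array(["p0", "p1", "p2", "p3", "p4"])
y_toy = np.array(["good", "bad", "good", "good", "bad"])
g_toy = np.array(["male", "female", "male", "male", "female"])

# One call splits all three arrays at the same row boundary.
X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
    X_toy, y_toy, g_toy, test_size=0.2, shuffle=False
)
print(X_te, y_te, g_te)
```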

The projection matrix is produced by training an auxiliary classifier with the target variable flipped, i.e. our initial task was to predict the credit risk based on the prompts, whereas now we are predicting the gender based on the prompts.

In [19]:
def get_projection_matrix(num_clfs, X_train, y_train, X_test, y_test, y_train_main, y_test_main, dim=300):

    is_autoregressive = True
    reg = "l2"
    min_acc = 0.
    noise = False
    random_subset = False
    regression = False
    
    clf = SGDClassifier
    params = {'warm_start': True, 'loss': 'log_loss', 'n_jobs': 64, 'max_iter': 100, 'random_state': 0}

    P = old_debias.get_debiasing_projection(clf, params, num_clfs, dim, is_autoregressive,
                                           min_acc, X_train, y_train, X_test, y_test,
                                           by_class=True, y_train_main=y_train_main, y_test_main=y_test_main)
    return P



num_clfs = 40
y_test_gender = np.array(y_test_gender)
y_train_gender = np.array(y_train_gender)
y_test = np.array(y_test)
y_train = np.array(y_train)

n_examples = 1000

In [20]:
P = get_projection_matrix(num_clfs, X_train_one_hot[:n_examples],
                          y_train_gender[:n_examples], X_test_one_hot[:n_examples], y_test_gender[:n_examples],
                          y_train[:n_examples], y_test[:n_examples], dim=894)

iteration: 39, accuracy: 0.275: 100%|██████████| 40/40 [00:04<00:00,  8.07it/s]
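
The internals of `old_debias.get_debiasing_projection` are not shown in this notebook. As a rough sketch of the iterative idea described at the start (not that module's actual code), the loop below uses a least-squares linear probe in place of the SGD classifier, and synthetic data in which a binary attribute leaks through one feature. All names (`fit_linear_probe`, `probe_accuracy`, `inlp_sketch`) and the data are assumptions made for illustration.

```python
import numpy as np

def fit_linear_probe(X, z):
    """Least-squares linear probe (a stand-in for the SGD classifier)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # add a bias column
    coef, *_ = np.linalg.lstsq(Xb, 2.0 * z - 1.0, rcond=None)
    return coef[:-1], coef[-1]

def probe_accuracy(X, z, w, b):
    """Fraction of rows whose sign of X @ w + b matches the label."""
    return float(np.mean((X @ w + b > 0) == (z == 1)))

def inlp_sketch(X, z, n_iters=5):
    """Iteratively remove the directions a linear probe uses to predict z."""
    d = X.shape[1]
    P = np.eye(d)
    directions = []
    for _ in range(n_iters):
        w, _ = fit_linear_probe(X @ P, z)   # probe the current projection
        norm = np.linalg.norm(w)
        if norm < 1e-12:                    # nothing left to remove
            break
        directions.append(w / norm)
        # Orthonormalise all probe directions, then project onto their complement.
        Q, _ = np.linalg.qr(np.array(directions).T)
        P = np.eye(d) - Q @ Q.T
    return P

# Synthetic data (assumption): the attribute z leaks through feature 0.
rng = np.random.default_rng(0)
n, d = 400, 10
z = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
X[:, 0] += 3.0 * z

w0, b0 = fit_linear_probe(X, z)
acc_before = probe_accuracy(X, z, w0, b0)

P = inlp_sketch(X, z, n_iters=5)
w1, b1 = fit_linear_probe(X @ P, z)
acc_after = probe_accuracy(X @ P, z, w1, b1)

print("probe accuracy before projection:", acc_before)
print("probe accuracy after projection: ", acc_after)
```

The sketch accumulates the probe directions and projects onto the orthogonal complement of their span, which keeps `P` a valid orthogonal projection after every iteration.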


Testing the debiased dataset. The projection matrix $P$ is applied to both the training and testing feature matrices derived from the prompts.

In [21]:
# Apply the projection matrix to the training and testing feature matrices

debiased_train = X_train_one_hot.dot(P)
debiased_test = X_test_one_hot.dot(P)

This shows the accuracy of the auxiliary model, where the input is the prompt and the expected response is the gender. As the gender is actually stated within the prompt, we expect 100% accuracy.

In [22]:
params = {'warm_start': True, 'loss': 'log_loss', 'n_jobs': 64, 'max_iter': 75, 'random_state': 0}
temp = SGDClassifier(**params)

temp.fit(X_train_one_hot[:n_examples], y_train_gender[:n_examples])
temp.score(X_test_one_hot, y_test_gender)

1.0

We apply the projection matrix to the prompt features and train a fresh classifier on them to predict the gender. The accuracy drops from 1.0 to 0.725, indicating that the debiasing has worked to some extent: the model is no longer able to predict the gender with complete certainty.

In [23]:
params = {'warm_start': True, 'loss': 'log_loss', 'n_jobs': 64, 'max_iter': 75, 'random_state': 0}
temp = SGDClassifier(**params)

temp.fit(debiased_train[:n_examples], y_train_gender[:n_examples])
temp.score(debiased_test, y_test_gender)



0.725
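
Whether 0.725 is close to chance depends on the class balance of `y_test_gender`, which is not shown here. A simple reference point is the majority-class baseline, i.e. the accuracy obtained by always predicting the most frequent label. The sketch below computes it for a made-up label list (the helper name `majority_baseline` and the labels are hypothetical):

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical labels, for illustration only.
toy_labels = ["male", "male", "female", "male"]
print(majority_baseline(toy_labels))  # 0.75
```

In this notebook it could be called as `majority_baseline(list(y_test_gender))` and compared with the score above.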

The following is the accuracy of the model where the input is the prompt and the output is the credit risk.

In [24]:
params = {'warm_start': True, 'loss': 'log_loss', 'n_jobs': 64, 'max_iter': 75, 'random_state': 0}
temp = SGDClassifier(**params)

temp.fit(X_train_one_hot[:n_examples], y_train[:n_examples])
temp.score(X_test_one_hot, y_test)



0.705

The following is the accuracy when the input is the debiased dataset (the prompt features with the projection matrix applied) and the output is the risk. There is a slight increase in accuracy here (0.715 versus 0.705).

In [25]:
params = {'warm_start': True, 'loss': 'log_loss', 'n_jobs': 64, 'max_iter': 75, 'random_state': 0}
temp = SGDClassifier(**params)

temp.fit(debiased_train[:n_examples], y_train[:n_examples])
temp.score(debiased_test, y_test)

0.715

## Apply the projection to the trained classifier

In [26]:
svc = clf.named_steps['classifier']  # the fitted SGDClassifier, despite the variable name


In [28]:
from copy import deepcopy
debiased_svc = deepcopy(svc)


In [29]:
debiased_svc.coef_ = svc.coef_.dot(P.T)

In [30]:
debias_clf = Pipeline([
    ('vectorizer', clf.named_steps['vectorizer']),
    ('selection', clf.named_steps['selection']),
    ('classifier', debiased_svc),
])
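
Replacing `coef_` with `coef_.dot(P.T)` relies on a simple identity: scoring the raw features with projected weights gives the same decision values as scoring the projected features with the original weights. A small NumPy check with random stand-in matrices (hypothetical shapes and values, not the notebook's data):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
X_toy = rng.normal(size=(4, d))      # stand-in feature rows
coef = rng.normal(size=(1, d))       # stand-in for the classifier's coef_
intercept = 0.3                      # stand-in for intercept_

# Stand-in projection: remove one arbitrary unit direction u.
u = rng.normal(size=d)
u /= np.linalg.norm(u)
P_toy = np.eye(d) - np.outer(u, u)

# Projected weights applied to raw features ...
scores_projected_weights = X_toy @ (coef @ P_toy.T).T + intercept
# ... versus original weights applied to projected features.
scores_projected_inputs = (X_toy @ P_toy) @ coef.T + intercept
print(np.allclose(scores_projected_weights, scores_projected_inputs))  # True
```

Note that this operation leaves the intercept unchanged and does not refit the classifier.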

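This is the accuracy of the original classifier after its weights have been projected with $P$; the model is not retrained. The accuracy falls to 0.42, compared with 0.715 for the classifier retrained on the debiased features above.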

In [33]:
debias_clf.score(X_test, y_test)

0.42