# Classify charities into ICNP/TSO categories

A further test of the machine-learning model created in `icnptso-machine-learning-test.ipynb`, to see whether its accuracy improves when the tags applied to the sample are used to limit the choice of ICNPTSO codes.

## Import packages

- `pandas` is used to manipulate the data
- `sklearn.model_selection.train_test_split` is used to split the sample data
- `sklearn.metrics.accuracy_score` is used to score the predictions
- `nltk` provides functions for preparing the data, plus a list of common stopwords

In [109]:
import re
import pickle

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\drkan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\drkan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Create settings

These settings hold the location of files used in the process.

In [2]:
MODEL_PICKLE_FILE = '../data/icnptso_ml_model.pkl'
UKCAT_FILE = "../data/ukcat.csv"
SAMPLE_FILE = "../data/sample.csv"
TOP2000_FILE = "../data/top2000.csv"

## Fetch the sample data

Remove any records which don't have an ICNPTSO category included.

In [3]:
df = pd.concat([
    pd.read_csv(SAMPLE_FILE),
    pd.read_csv(TOP2000_FILE),
]).reset_index()
df = df[df["ICNPTSO"].notnull()]

## Prepare the training data

Create the text corpus by combining the name and activities data. `y` is the ICNPTSO code attached to the charity.

In [4]:
corpus = pd.DataFrame([df["name"], df["activities"]]).T.apply(lambda x: " ".join(x), axis=1)
y = df["ICNPTSO"].values
len(y)

6203

Prepare functions used to clean the text data before it's included in the machine learning models. 

[Lemmatization](https://en.wikipedia.org/wiki/Lemmatisation) is the process of turning words into their base form: for example, "walking" becomes "walk" and "better" becomes "good".

Stopwords (common words like "and", "for", "of") are skipped.

In [5]:
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')  # raw string avoids invalid-escape warnings
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english') + [
    "trust",
    "fund",
    "charitable",
    "charity",
])

stemmer = LancasterStemmer()
lemma = WordNetLemmatizer()

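# NOTE: `analyzer` is not defined in this notebook, so these two helpers
# cannot be called as-is; they are not used below.
def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

def lemma_words(doc):
    return (lemma.lemmatize(w) for w in analyzer(doc))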

def clean_text(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = ' '.join(lemma.lemmatize(word) for word in text.split() if word not in STOPWORDS) # drop stopwords and lemmatize the remaining words
    return text
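
The regex steps in `clean_text` can be tried in isolation. The following is a minimal sketch that leaves out lemmatization (so it needs no NLTK data) and uses a small made-up stopword set, `DEMO_STOPWORDS`, in place of the NLTK list; the function name `clean_text_demo` and the sample string are also illustrative only.

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r"[/(){}\[\]\|@,;]")
BAD_SYMBOLS_RE = re.compile(r"[^0-9a-z #+_]")
# Hypothetical mini stopword list, standing in for the NLTK list used above
DEMO_STOPWORDS = {"the", "of", "and", "for", "trust"}


def clean_text_demo(text):
    text = text.lower()                        # lowercase
    text = REPLACE_BY_SPACE_RE.sub(" ", text)  # punctuation that separates words -> space
    text = BAD_SYMBOLS_RE.sub("", text)        # drop every other unexpected symbol
    # keep only the words that are not stopwords (no lemmatization in this sketch)
    return " ".join(w for w in text.split() if w not in DEMO_STOPWORDS)


example = clean_text_demo("The Trust for Arts/Music (Leeds), est. 1990!")
print(example)  # arts music leeds est 1990
```

The slash is replaced by a space so that "Arts/Music" stays as two words, while the full stop and exclamation mark are simply deleted.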

`X` is the list of cleaned values.

In [6]:
X = corpus.apply(clean_text).values
np.random.choice(X, 1)

array(['association commonwealth university working member others promote contribute provision excellent higher education benefit people throughout commonwealth beyond administration funding student scholarship'],
      dtype=object)

Produce test and train datasets from `X` and `y`.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42)
len(X_test)

1241
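
As a rough check on the split sizes, the arithmetic can be reproduced by hand. This sketch assumes that `train_test_split` rounds the test share up to a whole number of rows and gives the remainder to the training set; the variable names are illustrative.

```python
import math

n_samples = 6203   # len(y) reported above
test_size = 0.2

# Assumption: the test share is rounded up, the rest goes to training
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print(n_test, n_train)  # 1241 4962
```

These figures match the 1,241 test rows above and the 4,962 training rows that appear later in the notebook.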

## Fetch classification models

Fetch the model produced in `icnptso-machine-learning-test.ipynb`.

In [8]:
with open(MODEL_PICKLE_FILE, 'rb') as model_file:
    nb = pickle.load(model_file)

## Work out possible ICNPTSO tags

Function to fetch potential ICNPTSO tags based on the UK-CAT classification.

In [70]:
# fetch the classification file
ukcat = pd.read_csv(UKCAT_FILE, index_col="Code")

# for each classification category go through and apply the regular expression
def get_icnptso_categories(x_text):
    x_text = pd.Series(x_text)
    results = []
    for index, row in ukcat.iterrows():
        if (
            not isinstance(row["Regular expression"], str)
            or row["Regular expression"] == r"\b()\b"
        ):
            continue
        criteria = x_text.str.contains(
            row["Regular expression"], case=False, regex=True
        )
        if (
            isinstance(row["Exclude regular expression"], str)
            and row["Exclude regular expression"] != r"\b()\b"
        ):
            criteria = criteria & ~x_text.str.contains(
                row["Exclude regular expression"], case=False, regex=True
            )

        results.append(
            pd.Series(
                data=criteria,
                index=x_text.index,
                name=index,
            )
        )

    results = pd.DataFrame(results).T
    results = results.apply(lambda row: [ukcat["Related ICNPTSO code"].fillna("")[i].split(";") for i, v in row.items() if v], axis=1)  # `.items()` replaces `.iteritems()`, which was removed in pandas 2.0
    results = results.apply(lambda x: [item for sublist in x for item in sublist])
    results = results.apply(lambda x: list(set([item for item in x if item])))
    return results
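
The idea behind `get_icnptso_categories` can be sketched for a single string without pandas. The rules below are invented for illustration (the tag names, regular expressions and code mappings are not taken from `ukcat.csv`), and `candidate_codes` is a hypothetical helper rather than part of the notebook.

```python
import re

# Hypothetical rules: tag -> (include regex, exclude regex or None, related codes)
DEMO_RULES = {
    "ED101": (r"\b(school|college)\b", r"\b(sunday school)\b", "B11;B12"),
    "AR100": (r"\b(art|music)\b", None, "A11"),
}


def candidate_codes(text):
    """Return the sorted, de-duplicated codes of every rule that matches."""
    codes = set()
    for include, exclude, related in DEMO_RULES.values():
        if not re.search(include, text, flags=re.IGNORECASE):
            continue  # the tag does not apply
        if exclude and re.search(exclude, text, flags=re.IGNORECASE):
            continue  # the tag is vetoed by its exclude pattern
        codes.update(c for c in related.split(";") if c)
    return sorted(codes)


print(candidate_codes("village school music club"))  # ['A11', 'B11', 'B12']
print(candidate_codes("sunday school"))              # []
```

An empty list corresponds to the case where no tag matches, which the notebook later handles by falling back to every ICNPTSO category.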

Apply the function to the test and training corpuses.

In [71]:
X_train_cats = get_icnptso_categories(X_train)
X_test_cats = get_icnptso_categories(X_test)
X_train_cats

0            [D11, A21, B31, G13, G14]
1                           [D19, G11]
2       [B13, D11, I10, B12, G14, B90]
3                 [B13, B12, B32, B90]
4                           [A10, H10]
                     ...              
4957         [B13, D14, B12, G13, H90]
4958         [D12, G16, G15, D19, D13]
4959                        [I90, I10]
4960         [D19, G11, K10, G22, B90]
4961                             [A11]
Length: 4962, dtype: object

`predict_proba` gives us the predicted probability of every ICNP/TSO category for each organisation in the sample. This enables us to assess how confident the model is in its estimate.

In [72]:
y_pred_proba = nb.predict_proba(X_test)
y_pred_proba = pd.DataFrame([
    dict(zip(nb.classes_, row))
    for row in y_pred_proba
])
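
The reshaping step pairs each probability with its class label, so the most likely class and its probability can be read off each row. A small sketch with made-up labels and probabilities (not model output):

```python
# Illustrative class labels and probabilities (made up for this sketch)
classes = ["A11", "C11", "D19"]
proba_rows = [
    [0.70, 0.20, 0.10],
    [0.05, 0.90, 0.05],
]

# One dict per organisation, keyed by class label, as in the cell above
records = [dict(zip(classes, row)) for row in proba_rows]

# The highest-probability (label, probability) pair for each organisation
confidence = [max(r.items(), key=lambda kv: kv[1]) for r in records]
print(confidence)  # [('A11', 0.7), ('C11', 0.9)]
```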

Function to get the category for a given row of the `y_pred_proba` dataframe. The row contains the probabilities against each ICNPTSO category. The candidates for each row are narrowed using the `X_test_cats` series created earlier.

In [115]:
def get_cats(row):
    potential_cats = [c for c in X_test_cats[row.name] if c in row.index]
    source = "ranked_tags"
    if len(potential_cats)==1:
        source = "only_one_tag"
    if len(potential_cats)==0:
        source = "all_icnptso"
        potential_cats = row.index.tolist()
    result = row[potential_cats].sort_values(ascending=False).head(1)
    result = list(zip(result,result.index))[0]
    return {
        "icnptso_code": result[1],
        "score": result[0],
        "source": source,
        "source_count": len(potential_cats),
        "original_max_score": row.max(),
        "original_icnptso_code": row.idxmax(),
        "actual_icnptso_code": y_test[row.name],
    }

result = y_pred_proba.apply(get_cats, axis=1, result_type='expand')
result

| | icnptso_code | score | source | source_count | original_max_score | original_icnptso_code | actual_icnptso_code |
|---|---|---|---|---|---|---|---|
| 0 | D11 | 0.899893 | ranked_tags | 7 | 0.899893 | D11 | D13 |
| 1 | D13 | 0.989706 | ranked_tags | 3 | 0.989706 | D13 | C12 |
| 2 | G11 | 1.000000 | ranked_tags | 8 | 1.000000 | G11 | G11 |
| 3 | D19 | 0.000940 | ranked_tags | 5 | 0.999059 | C11 | C11 |
| 4 | A11 | 0.999885 | ranked_tags | 2 | 0.999885 | A11 | A11 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1236 | D11 | 0.459962 | ranked_tags | 5 | 0.459962 | D11 | D19 |
| 1237 | A21 | 1.000000 | ranked_tags | 3 | 1.000000 | A21 | A21 |
| 1238 | K10 | 0.004695 | ranked_tags | 2 | 0.995273 | C12 | C12 |
| 1239 | I10 | 0.999993 | ranked_tags | 2 | 0.999993 | I10 | I10 |
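
The selection rule in `get_cats` can be sketched on a single row using a plain dict. The two probabilities for `C11` and `D19` are taken from row 3 of the output; the remaining value and the candidate lists are made up for illustration, and `pick_code` is a hypothetical stand-in rather than the notebook's function.

```python
def pick_code(probabilities, candidates):
    """Pick the most probable code among the candidates, falling back to all codes."""
    allowed = [c for c in candidates if c in probabilities]
    if len(allowed) == 0:
        source = "all_icnptso"          # no usable tag: consider every category
        allowed = list(probabilities)
    elif len(allowed) == 1:
        source = "only_one_tag"
    else:
        source = "ranked_tags"
    best = max(allowed, key=probabilities.get)
    return best, probabilities[best], source


# C11 and D19 values come from row 3 above; A11 is illustrative
probs = {"C11": 0.999059, "D19": 0.000940, "A11": 0.000001}

restricted = pick_code(probs, ["D19", "A11"])
unrestricted = pick_code(probs, [])
print(restricted)    # ('D19', 0.00094, 'ranked_tags')
print(unrestricted)  # ('C11', 0.999059, 'all_icnptso')
```

This shows how restricting the choice can override a confident model prediction: if the tags do not include the code the model favours, a very low-probability code is chosen instead.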


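Compare the results with those from the original model. The three cells below score the tag-restricted prediction against the actual code, the original model's prediction against the actual code, and the two predictions against each other.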

In [116]:
accuracy_score(result["icnptso_code"], result["actual_icnptso_code"])

0.5213537469782433

In [117]:
accuracy_score(result["original_icnptso_code"], result["actual_icnptso_code"])

0.5576148267526189

In [118]:
accuracy_score(result["icnptso_code"], result["original_icnptso_code"])

0.7268331990330379
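
To make the three scores concrete, the same comparisons can be run by hand on the first five rows of the results table. This is only a sketch on five rows, so the figures differ from the full-sample scores above; `accuracy` is a hypothetical helper that computes the share of exact matches.

```python
def accuracy(predicted, actual):
    """Share of positions where the two label lists agree."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Rows 0-4 of the results table
tags_method = ["D11", "D13", "G11", "D19", "A11"]   # icnptso_code
model_only = ["D11", "D13", "G11", "C11", "A11"]    # original_icnptso_code
actual = ["D13", "C12", "G11", "C11", "A11"]        # actual_icnptso_code

print(accuracy(tags_method, actual))      # 0.4
print(accuracy(model_only, actual))       # 0.6
print(accuracy(tags_method, model_only))  # 0.8
```

Row 3 is the one case where the two methods disagree, and there the unrestricted model is correct.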

Which method was used to get the ICNPTSO code?

In [122]:
result["source"].value_counts()

ranked_tags     1007
only_one_tag     171
all_icnptso       63
Name: source, dtype: int64

How many source tags were available?

In [120]:
result["source_count"].value_counts().sort_index()

1     171
2     231
3     183
4     170
5     140
6     108
7      63
8      39
9      31
10     15
11      8
12      6
13      7
14      2
15      2
16      1
18      1
86     63
Name: source_count, dtype: int64

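Test the accuracy for different numbers of potential categories from the tags. Here `1` means the tags gave exactly one possible ICNPTSO category, and `all` means no tag matched, so all 86 categories were considered. Note that `between` is inclusive at both ends, so organisations with exactly five candidate categories are counted in both the `3-5` and `5+` bands.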

In [136]:
tests = {
    "1": result["source_count"]==1,
    "2": result["source_count"]==2,
    "3-5": result["source_count"].between(3, 5),
    "5+": result["source_count"].between(5, 85),
    "all": result["source_count"]==86,
}
pd.DataFrame([
    pd.Series(
        data=[accuracy_score(
            result.loc[criteria,"icnptso_code"],
            result.loc[criteria,"actual_icnptso_code"],
        ) for criteria in tests.values()],
        index=tests.keys()
    ).rename("tags_method"),
    pd.Series(
        data=[accuracy_score(
            result.loc[criteria,"original_icnptso_code"],
            result.loc[criteria,"actual_icnptso_code"],
        ) for criteria in tests.values()],
        index=tests.keys()
    ).rename("just_model_method"),
]).T

| | tags_method | just_model_method |
|---|---|---|
| 1 | 0.625731 | 0.672515 |
| 2 | 0.497835 | 0.562771 |
| 3-5 | 0.515213 | 0.537525 |
| 5+ | 0.463357 | 0.50591 |
| all | 0.539683 | 0.539683 |
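
If non-overlapping bands are wanted, one option is to assign each `source_count` value to exactly one band. The sketch below is an alternative to the cell above, not what the notebook ran: the `band` helper and the `6+` label are hypothetical, and the counts are copied from the `source_count` output earlier.

```python
# source_count -> number of organisations, copied from the value_counts output
counts = {
    1: 171, 2: 231, 3: 183, 4: 170, 5: 140, 6: 108, 7: 63, 8: 39, 9: 31,
    10: 15, 11: 8, 12: 6, 13: 7, 14: 2, 15: 2, 16: 1, 18: 1, 86: 63,
}


def band(source_count, n_all=86):
    """Assign a source_count to exactly one band (hypothetical labels)."""
    if source_count == n_all:
        return "all"
    if source_count <= 2:
        return str(source_count)
    if source_count <= 5:
        return "3-5"
    return "6+"


band_sizes = {}
for source_count, n in counts.items():
    label = band(source_count)
    band_sizes[label] = band_sizes.get(label, 0) + n

print(band_sizes)                # {'1': 171, '2': 231, '3-5': 493, '6+': 283, 'all': 63}
print(sum(band_sizes.values()))  # 1241
```

Because every organisation falls into one band only, the band sizes add up to the 1,241 rows in the test set.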
