# Classify charities into ICNP/TSO categories

A further test of the machine-learning model created in `icnptso-machine-learning-test.ipynb`, to see whether it improves when turned into a two-step model: first classify charities into the top-level ICNPTSO groups, then classify within each group.

## Import packages

- `pandas` is used to manipulate the data
- `sklearn.model_selection.train_test_split` is used to split the sample data into training and test sets
- `nltk` provides functions for preparing the data, plus a list of common stopwords

In [48]:
import re
import pickle

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, plot_confusion_matrix
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\drkan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\drkan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Create settings

These settings hold the location of files used in the process.

In [2]:
MODEL_PICKLE_FILE = '../data/icnptso_ml_model.pkl'
UKCAT_FILE = "../data/ukcat.csv"
SAMPLE_FILE = "../data/sample.csv"
TOP2000_FILE = "../data/top2000.csv"

## Fetch the sample data

Remove any records that don't have an ICNPTSO category.

In [3]:
df = pd.concat([
    pd.read_csv(SAMPLE_FILE),
    pd.read_csv(TOP2000_FILE),
]).reset_index()
df = df[df["ICNPTSO"].notnull()]

## Prepare the training data

Create the text corpus by combining the name and activities data. `y` is the ICNPTSO code attached to the charity.

In [11]:
corpus = pd.DataFrame([df["name"], df["activities"]]).T.apply(lambda x: " ".join(x), axis=1)
y = df["ICNPTSO"].values
len(y)

6203

Prepare functions used to clean the text data before it's included in the machine learning models. 

[Lemmatization](https://en.wikipedia.org/wiki/Lemmatisation) is the process of turning words into their base form. For example, "walking" becomes "walk" and "better" becomes "good".

Stopwords (common words like "and", "for", "of") are skipped.

In [12]:
# raw strings avoid invalid-escape warnings for sequences such as \[ and \|
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english') + [
    "trust",
    "fund",
    "charitable",
    "charity",
])

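stemmer = LancasterStemmer()
lemma = WordNetLemmatizer()
# default scikit-learn tokeniser, needed by the two helper functions below
analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

def lemma_words(doc):
    return (lemma.lemmatize(w) for w in analyzer(doc))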

def clean_text(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = ' '.join(lemma.lemmatize(word) for word in text.split() if word not in STOPWORDS) # drop stopwords and lemmatize the remaining words
    return text
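
To see what the two regular expressions do on their own, here is a minimal sketch that uses only the standard library. It skips the lemmatisation step, and `DEMO_STOPWORDS`, `clean_without_lemmatizing` and the sample string are made up for illustration rather than taken from the dataset or from NLTK.

```python
import re

# Same patterns as in the cell above
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')

# Hypothetical short stopword list, standing in for the NLTK list
DEMO_STOPWORDS = {"the", "trust"}

def clean_without_lemmatizing(text):
    text = text.lower()                        # lowercase first, the patterns only allow a-z
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # brackets, slashes etc. become spaces
    text = BAD_SYMBOLS_RE.sub('', text)        # remaining punctuation is deleted outright
    return ' '.join(w for w in text.split() if w not in DEMO_STOPWORDS)

cleaned = clean_without_lemmatizing("The St. Mary's Trust (Education/Welfare) & Co.")
print(cleaned)  # st marys education welfare co
```

Note that the slash is replaced by a space, so "Education/Welfare" stays as two words, while the apostrophe is deleted, so "Mary's" collapses to "marys".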

`X` is the list of cleaned values.

In [13]:
X = corpus.apply(clean_text).values
np.random.choice(X, 1)

array(['college corpus christi blessed virgin mary university cambridge education undergraduate graduate student research work associated provision accommodation welfare catering service community scholar university cambridge'],
      dtype=object)

Produce test and train datasets from `X` and `y`.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42)
len(X_test)

1241

Produce a version of `y` containing only the first letter of each code, which identifies the top-level group the category belongs to.

In [22]:
y_train_group = np.array([cat[0] for cat in y_train])
y_test_group = np.array([cat[0] for cat in y_test])
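
This relies on every code starting with a single letter that identifies its group (for example `A10` belongs to group `A`). A small sketch of the same extraction, with a count per group, using a hypothetical handful of codes rather than the real `y_train`:

```python
from collections import Counter

# Hypothetical sample of codes, for illustration only
codes = ["A10", "A11", "B20", "D31", "B10"]

# The group is the first character of the code
groups = [code[0] for code in codes]

# Counting the groups gives a quick view of how balanced they are
group_counts = Counter(groups)
print(groups)
print(group_counts.most_common())
```

Running the same count on `y_train_group` would show how many training records each group model has to learn from.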

## Create classification model

In [17]:
nb = Pipeline(
    [
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", LogisticRegression(n_jobs=5, C=1e5, max_iter=1000)),
    ]
)

Fit the training data

In [23]:
nb.fit(X_train, y_train_group)

Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf',
                 LogisticRegression(C=100000.0, max_iter=1000, n_jobs=5))])

Predict the test data

In [24]:
y_pred_group = nb.predict(X_test)

Get the accuracy of the group model

In [25]:
accuracy_score(y_test_group, y_pred_group)

0.6913779210314263
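
In a two-step design, a record placed in the wrong group at this stage cannot receive the correct full code later, so the group accuracy acts as a ceiling on the accuracy of the combined model. The sketch below illustrates that relationship with made-up labels (they are not results from this dataset), and `accuracy` is a plain-Python stand-in for `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Share of positions where the two label lists agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical true and predicted full codes
y_true = ["A10", "A11", "B10", "B20", "C10"]
y_pred = ["A10", "A12", "B10", "A10", "C10"]

# Accuracy on the group letter only
group_accuracy = accuracy([c[0] for c in y_true], [c[0] for c in y_pred])
# Accuracy on the full code
full_accuracy = accuracy(y_true, y_pred)

print(group_accuracy, full_accuracy)  # 0.8 0.6
```

A correct full code always implies a correct group letter, so the second figure can never exceed the first.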

Create individual models for the groups

In [46]:
group_models = {
    cat: Pipeline(
        [
            ("vect", CountVectorizer()),
            ("tfidf", TfidfTransformer()),
            ("clf", LogisticRegression(n_jobs=5, C=1e5, max_iter=1000)),
        ]
    )
    for cat in sorted(set([cat[0] for cat in y]))
}
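
The idea is that each record is routed to the model for its predicted group. As a rough sketch of that routing for new records, assuming all models have already been fitted: `predict_two_step`, `KeywordModel`, the `single_category` argument and the keyword-to-code rules are all hypothetical names and data introduced for illustration. The function only needs objects with a `predict()` method, so fitted pipelines could be passed in place of the stand-ins.

```python
def predict_two_step(texts, group_model, group_models, single_category=None):
    """Predict a group for each text, then a full code within that group.

    single_category maps a group letter to its only code, for groups
    whose sub-model was never fitted (hypothetical helper argument).
    """
    single_category = single_category or {}
    groups = list(group_model.predict(texts))
    predictions = [None] * len(texts)
    for group in sorted(set(groups)):
        # positions of the texts routed to this group
        positions = [i for i, g in enumerate(groups) if g == group]
        subset = [texts[i] for i in positions]
        if group in single_category:
            sub_predictions = [single_category[group]] * len(subset)
        else:
            sub_predictions = group_models[group].predict(subset)
        # write the results back in the original order
        for i, code in zip(positions, sub_predictions):
            predictions[i] = code
    return predictions


class KeywordModel:
    """Stand-in for a fitted pipeline: label of the first matching keyword."""

    def __init__(self, rules, default):
        self.rules = rules
        self.default = default

    def predict(self, texts):
        return [
            next((label for word, label in self.rules if word in text), self.default)
            for text in texts
        ]


# Made-up rules, only to exercise the routing
group_model = KeywordModel([("school", "B"), ("museum", "A")], default="A")
sub_models = {"A": KeywordModel([("museum", "A11")], default="A10")}

result = predict_two_step(
    ["village school", "town museum", "choir"],
    group_model,
    sub_models,
    single_category={"B": "B10"},
)
print(result)  # ['B10', 'A11', 'A10']
```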

Go through each group. For each group:

 - train the model on the training set
 - get the number of categories
 - test the model against the test-set records that the group model assigned to this group
 - compute the accuracy
 
If there's only one sub-category then assign all the values to that category.

In [54]:
multi_y_pred = []
multi_y_test = []
for cat, pipeline in group_models.items():
    print(cat)
    train_index = [c[0]==cat for c in y_train]
    cat_X_train = X_train[train_index]
    cat_y_train = y_train[train_index]
    print("Train: {:,.0f}".format(len(cat_X_train)))
    
    training_categories = sorted(set(cat_y_train))
    print("Training categories: {}".format(
        ", ".join(training_categories)
    ))
    
    test_index = [c==cat for c in y_pred_group]
    cat_X_test = X_test[test_index]
    cat_y_test = y_test[test_index]
    print("Test : {:,.0f}".format(len(cat_X_test)))
    
    # if there's only one category then just use that
    if len(training_categories)==1:
        cat_y_pred = np.array([training_categories[0] for item in cat_y_test])
    else:
        pipeline.fit(cat_X_train, cat_y_train)
        cat_y_pred = pipeline.predict(cat_X_test)

    print("Accuracy: {:.3f}".format(accuracy_score(cat_y_test, cat_y_pred)))
    
    multi_y_pred.extend(cat_y_pred)
    multi_y_test.extend(cat_y_test)
    print()
    
print("Overall accuracy: {:.3f}".format(
    accuracy_score(multi_y_test, multi_y_pred)
))

A
Train: 702
Training categories: A10, A11, A12, A19, A20, A21, A22, A29, A30, A90
Test : 191
Accuracy: 0.649

B
Train: 794
Training categories: B10, B11, B12, B13, B19, B20, B21, B29, B30, B31, B32, B90
Test : 195
Accuracy: 0.590

C
Train: 379
Training categories: C10, C11, C12, C13, C14, C19, C21, C22, C29, C31, C32, C39
Test : 84
Accuracy: 0.476

D
Train: 706
Training categories: D10, D11, D12, D13, D14, D19, D20, D30, D31, D32, D33, D34, D41, D49, D90
Test : 186
Accuracy: 0.333

E
Train: 81
Training categories: E10, E11, E12, E13, E14, E19, E20, E21, E22, E29, E90
Test : 11
Accuracy: 0.273

F
Train: 439
Training categories: F12, F20, F30, F40
Test : 111
Accuracy: 0.613

G
Train: 465
Training categories: G10, G11, G12, G13, G14, G15, G16, G19, G20, G22, G30
Test : 114
Accuracy: 0.360

H
Train: 534
Training categories: H10, H90
Test : 138
Accuracy: 0.543

I
Train: 672
Training categories: I10, I90
Test : 176
Accuracy: 0.648

J
Train: 62
Training categories: J10, J20
Test : 8
Accuracy