# Insights Primary Topic Predictor 

![datascience-hero](datascience-hero.jpeg)

Insights Primary Topic Predictor predicts the “primary topic” category that an anecdote falls into, using natural language processing and sentence vectors. The sentence vectors are built from a list of 2 million words and their coordinates in a 300-dimensional space, where the coordinates are determined by how the words relate to one another. It is a Python script written in IDLE, version 3.10.4.

This predictor takes CBI Insights Platform exports in Excel format as input. Its output falls into two categories: model properties (the cross-validation results used to detect overfitting, and the parameter training results used to fine-tune the model’s predictions) and predictions (an Excel document with additional columns showing the data points after natural language processing, their vector coordinates, the actual primary topics assigned to the anecdotes by CBI staff and the predictor’s assigned topic for each anecdote).




## Disclaimer

The data needed for this script to work is not available for reasons of organisational confidentiality. The reader must therefore review the code with hypothetical data that matches the stated specifications in mind.

## Methodology and Script

### Functions as they are used in the script

This script uses a total of 10 functions, which fall into 3 categories: building sentence vectors, training and evaluating the model, and displaying the output in Excel format.

#### Introduction and Preparation

##### Packages and Libraries

- numpy
- pandas
- time
- spacy
- nltk
- “stopwords” from “nltk.corpus”
- random
- “RandomForestClassifier” from “sklearn.ensemble”
- “train_test_split”, “KFold” and “StratifiedKFold” from “sklearn.model_selection”

In [None]:
import numpy as np
import pandas as pd
import time
import spacy
from nltk.corpus import stopwords
import random
import nltk

#machine learning classifier, classifies the sentence vector to numbers 0-4 (topics)
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split, KFold, StratifiedKFold

# requirements.txt
# numpy
# pandas
# openpyxl
# scikit-learn
# spacy
# nltk

#console (to be executed in the console)
# python -m spacy download en_core_web_sm
#in python (to be executed in the text editor)
# nltk.download('stopwords')

nlp = spacy.load('en_core_web_sm')

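# Don't use the stop list as-is, e.g. the word "people" shouldn't be ignored
# stopwords.words() with no argument returns the stop words of every language in the corpus
# Also, some rows have no data and should be excluded from analysis
stop_words = set(w.lower() for w in stopwords.words())
stop_words.discard('people')  # discard() doesn't raise if 'people' is absent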


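# df and wordVecs are module-level variables shared by the functions below
# (a global statement at module level has no effect, so none is needed here)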

#trying to read df throws error
#df = pd.read_excel(r'C:\\Users\\username\\Downloads\\training02.xlsx')

# Could possibly store efficiently with numpy
# e.g. 1 file - list of 2M words, other file 2M x 300 array
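
The storage idea noted in the comments above could look roughly like the sketch below. It is not part of the original script: the helper names `save_word_vectors` and `load_word_vectors` and the file names are hypothetical, and casting to `float32` is an assumption made here to halve the file size. The vocabulary goes into a plain text file and the vectors into a single `.npy` array, which should load considerably faster than re-parsing a text file of floats.

```python
import os
import tempfile
import numpy as np

def save_word_vectors(word_vecs, vocab_path, matrix_path):
    """Write a {word: vector} dict as a vocabulary file plus one 2-D array."""
    words = list(word_vecs)
    # float32 is an assumption made here to save space
    matrix = np.vstack([word_vecs[w] for w in words]).astype(np.float32)
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(words))
    np.save(matrix_path, matrix)

def load_word_vectors(vocab_path, matrix_path):
    """Rebuild the {word: vector} dict; row i of the matrix belongs to word i."""
    with open(vocab_path, encoding="utf-8") as f:
        words = f.read().split("\n")
    matrix = np.load(matrix_path)
    return {w: matrix[i] for i, w in enumerate(words)}

# Toy round trip with 3-dimensional vectors (the real file has 300 dimensions).
toy = {"supply": np.array([0.1, 0.2, 0.3]), "demand": np.array([0.4, 0.5, 0.6])}
tmp_dir = tempfile.mkdtemp()
vocab_file = os.path.join(tmp_dir, "vocab.txt")
matrix_file = os.path.join(tmp_dir, "vectors.npy")
save_word_vectors(toy, vocab_file, matrix_file)
restored = load_word_vectors(vocab_file, matrix_file)
```

With this layout, `loadWordVectors()` would only need to parse the large text file once.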




#### Building Sentence Vectors

- loadWordVectors()
Loads the 2 million x 300 array named “lexvec.commoncrawl.300d.W.pos.neg3.vectors”, which must be downloaded separately.

- buildSentenceVectors()
This is where the CBI Insights export in Excel format is read. It loads the word vectors with the previous function, then splits each data point into useful words by removing stop words. Finally, it averages the 300-dimensional vectors of those words into one sentence vector per data point and saves the resulting data frame, including each data point’s coordinates, as a binary pickle file.

This function should be rerun whenever the user changes the Excel document (e.g. adds new data points or a new export) or the stop words.

At any other time, the call should stay commented out, because it takes a while to run and its output (the binary data frame) is already saved in a pickle file.

- splitSentence(s)
Splits a data point into useful words and removes stop words (the stop words don’t include “people”, and other relevant words can be excluded from the stop words depending on the context). This function is called inside “buildSentenceVectors()”.

- sentenceVector(words)
This function takes the average (of coordinates) of the words from a data point that appear in the word vector database. If none of the words are found, for example because the data point has no words left after stop word cleaning, it prints “No words recognised from sentence, check stop list” and returns a vector of zeros.

“tot” is initially a vector of 300 zeros. The function adds the word vector of each recognised word in the data point, then divides by the count. tot/count is the average word vector of the words in the data point, so a single vector is assigned to the whole data point.

- loadSentenceVecs()
This function reads the precomputed sentence vectors from the pickle file, so they don’t need to be rebuilt on every run.

- rowsWithoutAnyWords()
This function detects empty rows.
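
As a toy illustration of the averaging step, the sketch below uses made-up 3-dimensional vectors instead of the real 300-dimensional ones. The dictionary `toy_vecs` and the helper `average_vector` are hypothetical and only mirror what `sentenceVector(words)` does.

```python
import numpy as np

# Made-up 3-dimensional vectors; the real ones have 300 dimensions.
toy_vecs = {
    "staff": np.array([1.0, 0.0, 2.0]),
    "shortage": np.array([3.0, 2.0, 0.0]),
}

def average_vector(words, vecs, dim=3):
    """Average the vectors of the words found in vecs; zeros if none are found."""
    tot = np.zeros(dim)
    count = 0
    for word in words:
        if word in vecs:
            tot += vecs[word]
            count += 1
    return tot / count if count else tot

# "reported" is not in toy_vecs, so it is skipped and the sum is divided by 2.
sentence_vec = average_vector(["staff", "shortage", "reported"], toy_vecs)
print(sentence_vec)  # [2. 1. 1.]
```

Words that are missing from the vector database do not count towards the average, so a data point made up only of unknown words ends up with the same all-zero vector as an empty one.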



In [None]:
def loadWordVectors():
    global wordVecs
    wordVecs = {}
    with open('C:\\Users\\username\\Downloads\\lexvec.commoncrawl.300d.W.pos.neg3.vectors', encoding="UTF-8") as f:
        print(f.readline().strip())
        i = 1
        t0 = time.time()
        line = f.readline()
        while len(line) > 0:
            line = line.strip()
            split = line.split(" ")
            wordVecs[split[0]] = np.array([float(x) for x in split[1:]])
            if i % 100000 == 0:
                t1 = time.time()
                print(i, split[0], t1-t0)
                t0 = t1
            line = f.readline()
            i += 1

    print('word vecs loaded from file')

    return wordVecs

def buildSentenceVectors():
    global df, wordVecs
    # no r prefix here: in a raw string the doubled backslashes would stay doubled in the path
    df = pd.read_excel('C:\\Users\\username\\Downloads\\Latest2108.xlsx')
    print("Data frame loaded")
  

    # f = open('C:\\Users\\username\\Downloads\\lexvec.commoncrawl.300d.W.pos.neg3.vectors')
    # 2000000 300
    # the 0.008314 0.026552 ...

    wordVecs = loadWordVectors()
    print("Word vectors loaded")

    # TODO: No punctuation e.g. need to map "they're" to "they are"
    # hyphens being replaced with \x ?
    # need to remove punctuation e.g. full stop at end of sentences

    df['words'] = df['Data Point'].map(splitSentence)
    print("Words separated")
    df['wordVec'] = df['words'].map(sentenceVector)
    print("Complete!")

    # save/load binary dataframe
    df.to_pickle('C:\\Users\\username\\Downloads\\trainingWithVecs2')
    df = pd.read_pickle('C:\\Users\\username\\Downloads\\trainingWithVecs2')
    return df


# Extracts the useful words from a sentence
# nlp is the spaCy natural language processing pipeline loaded above
def splitSentence(s):
    # blank cells are read as NaN (a float), so anything that is not a string yields no words
    if not isinstance(s, str):
        return []
    return [token.text for token in nlp(s.lower())
            if token.text.isalpha() and token.text not in stop_words]

# average of those words that appear in the word vector database
def sentenceVector(words):
    tot = np.zeros(300) # dimension of vectors
    count = 0
    for word in words:
        if word in wordVecs:
            tot += wordVecs[word]
            count += 1
    if count == 0:
        print("No words recognised from sentence, check stop list")
        return tot
    return tot/count

def loadSentenceVecs():
    print('Loading word vectors from pre-computed file')
    return pd.read_pickle('C:\\Users\\username\\Downloads\\trainingWithVecs2')

# buildSentenceVectors for rebuilding vectors, loadSentenceVecs to just reload from a file
#df = buildSentenceVectors()
df = loadSentenceVecs()

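# analysis function to investigate better approaches:
# returns the rows whose data point has no words left after stop word removal
def rowsWithoutAnyWords():
    # lists are unhashable, so compare lengths instead of using isin([[]])
    return df[df['words'].map(len) == 0]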



#### Training and evaluating the model 

- cross_validate(df)
This function uses “StratifiedKFold” to split the data points into n folds and then trains and tests the model on different fractions of the data. The default number of folds is 5, but it can be changed.

This function uses a Random Forest Classifier as the model. In the default code, the parameters are listed explicitly so that they can be modified to try to increase the model’s success.

The function reports the performance of the model across the different folds to help the user evaluate its success. It prints the average score on the training data, which helps detect overfitting, and returns the average score on the held-out data (how accurately the model assigned the categories to the data points compared with the actual categorisation).

- do_cross_val(kf, df)
This function is nearly identical to cross_validate(df), but it takes the splitter and the Random Forest parameters as arguments and returns both the test score and the training score. It is the function that trainParameters(df) calls. The user can choose whichever of the two is more convenient for inspecting the model’s success.

- trainParameters(df)
This function tries different parameter values in order to optimise the model. The default code varies the max_depth and min_samples_leaf parameters of the Random Forest Classifier for smoothing. The user can add other parameters available to the Random Forest Classifier.
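
To illustrate what the stratified split contributes, here is a minimal sketch on made-up, imbalanced labels (not the real topics). It shows that every test fold keeps the same class proportions as the full data set; the variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 60 of class 0, 30 of class 1, 10 of class 2.
y = np.array([0] * 60 + [1] * 30 + [2] * 10)
X = np.zeros((len(y), 1))  # the features play no part in how the split is made

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1234)
foldCounts = [np.bincount(y[testIdx], minlength=3) for _, testIdx in kf.split(X, y)]
for counts in foldCounts:
    print(counts)  # each test fold holds 12, 6 and 2 rows of classes 0, 1 and 2
```

A plain `KFold` gives no such guarantee, so a rare topic could be under-represented in some folds.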



In [None]:
# Encoding the Primary Topic column
topicMap = {
    'Demand impact' : 0,
    'People' : 1,
    'Policy recommendations' : 2,
    'Supply impact' : 3,
    'Other operational impact': 4
}
df['label'] = df['Primary Topic'].map(topicMap)

print("Data loaded")


# Gives a more reliable score than just using one split (like below with train_test_split)
def cross_validate(df):
    kf = StratifiedKFold(n_splits=5, shuffle=True) # stratified folds; no random_state, so the splits differ between runs
    score = 0
    overfitScore = 0
    count = 0

    for trainIndices, testIndices in kf.split(df['wordVec'], df['label']):
        train, test = df.iloc[trainIndices], df.iloc[testIndices]
        X_train = np.array(train['wordVec'].tolist())
        X_test = np.array(test['wordVec'].tolist())
        y_train = train['label']
        y_test = test['label']

        print('Training model ' + str(count))


        # Modify parameters to improve model/avoid overfitting
        # https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
        # min_samples_leaf : int -- recommended for smoothing
        rf = RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=2, min_samples_leaf=1, random_state=1234)


        
        rf_model = rf.fit(X_train, y_train.values)
        print('Model ' + str(count) + ' fit')
        score += rf.score(X_test, y_test)
        overfitScore += rf.score(X_train, y_train)
        print('Model ' + str(count) + ' scored')
        count += 1

    print("Score on training data (detect overfitting): " + str(overfitScore/count))
    return score/count




# kf = cross-validation splitter (e.g. StratifiedKFold), df = data frame
def do_cross_val(kf, df, n_estimators=100, max_depth=None, min_samples_split=2, min_samples_leaf=1):
    score = 0
    overfitScore = 0
    count = 0
            
    for trainIndices, testIndices in kf.split(df['wordVec'], df['label']):
        train, test = df.iloc[trainIndices], df.iloc[testIndices]
        X_train = np.array(train['wordVec'].tolist())
        X_test = np.array(test['wordVec'].tolist())
        y_train = train['label']
        y_test = test['label']


        # Modify parameters to improve model/avoid overfitting
        # https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
        # min_samples_leaf : int -- recommended for smoothing
         # rf = RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=2, min_samples_leaf=1, random_state=1234)
        rf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf, random_state=1234)

        rf_model = rf.fit(X_train, y_train.values)
        score += rf.score(X_test, y_test)
        overfitScore += rf.score(X_train, y_train)
        count += 1
    
    return score/count, overfitScore/count

def trainParameters(df):
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1234) # fixed random_state, so every parameter setting is scored on the same folds

    min_samples_leaf_values = [1, 2, 4, 8, 16, 32]
    max_depth = [None, 15, 12, 9, 6]
    
    results = []
    overfitResults = []
    
    for leaf_samples in min_samples_leaf_values:
        leafResults = []
        overfitLeafResults = []
        
        for depth in max_depth:
            score, overfitScore = do_cross_val(kf, df, max_depth=depth, min_samples_leaf=leaf_samples)
            

            print('min_samples_leaf: ' + str(leaf_samples) + ', max_depth: ' + str(depth) + ', score: ' + str(score) + ', overfit score: ' + str(overfitScore))
            leafResults.append(score)
            overfitLeafResults.append(overfitScore)

        results.append(leafResults)
        overfitResults.append(overfitLeafResults)

    return min_samples_leaf_values, max_depth, results, overfitResults
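
As an alternative to the hand-written loops in `trainParameters(df)`, scikit-learn's `GridSearchCV` can run the same kind of search. The sketch below is not part of the original script and runs on synthetic data, since the real data is confidential. The grid values, the number of trees and the array sizes are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the confidential data: 200 "sentence vectors", 5 topics.
rng = np.random.default_rng(1234)
X = rng.normal(size=(200, 20))
y = np.arange(200) % 5
X[np.arange(200), y] += 2.0  # shift one coordinate per class so there is some signal

param_grid = {
    "min_samples_leaf": [1, 4, 16],
    "max_depth": [None, 6],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=1234),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1234),
    return_train_score=True,  # keeps the training score, as overfitScore does
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
# Mean training accuracy of the best setting, to compare against best_score_.
best_train_score = search.cv_results_["mean_train_score"][search.best_index_]
```

A large gap between `best_train_score` and `search.best_score_` plays the same role as the overfit score printed by `trainParameters(df)`.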



#### Displaying the output in Excel format

- saveDf(name)
Saves the data frame to an Excel file, so that the predictions of this model can be inspected manually alongside the real classifications to find errors.


In [None]:
def saveDf(name):
    df.to_excel('C:\\Users\\username\\Downloads\\' + name)
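
The code above does not show how the predicted topic gets into the data frame before it is saved. One possible way, sketched below on synthetic data, is to use scikit-learn's `cross_val_predict` so that every row is predicted by a model that never saw it during training. The helper `addPredictions` and the column name `Predicted Topic` are hypothetical and not taken from the original script.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

topicMap = {
    'Demand impact': 0,
    'People': 1,
    'Policy recommendations': 2,
    'Supply impact': 3,
    'Other operational impact': 4,
}

def addPredictions(df):
    """Return a copy of df with an out-of-fold 'Predicted Topic' column (hypothetical name)."""
    X = np.array(df['wordVec'].tolist())
    y = df['Primary Topic'].map(topicMap)
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1234)
    rf = RandomForestClassifier(n_estimators=100, random_state=1234)
    # Each row is predicted by a model that did not see it during training.
    predicted = cross_val_predict(rf, X, y, cv=kf)
    labelToTopic = {label: topic for topic, label in topicMap.items()}
    out = df.copy()
    out['Predicted Topic'] = pd.Series(predicted, index=df.index).map(labelToTopic)
    return out

# Synthetic stand-in for the confidential export: 10 rows per topic, 8-dimensional vectors.
rng = np.random.default_rng(0)
topics = list(topicMap) * 10
toy = pd.DataFrame({
    'Primary Topic': topics,
    'wordVec': [rng.normal(size=8) for _ in topics],
})
result = addPredictions(toy)
# result.to_excel('predictions.xlsx')  # needs openpyxl installed
```

Comparing the `Primary Topic` and `Predicted Topic` columns row by row then shows where the model disagrees with the manual categorisation.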


## Results and Observations

This model and all its variations have an average success rate of 65.3% in matching data points to their primary topic. The average error rate (34.7%) is above the acceptable level, and the model is not yet fit to be used for classifying data points.



## Further Suggestions

The main problems in the data that seemingly reduce the success of the model are as follows:

- Grammar: A considerable number of data points contain bad grammar, which makes the word vector coordinates of a data point less accurate and therefore decreases the quality of the classification.
- Data points that are too short: When inspected, some of the data points turned out to be blank after stop word removal, which hindered the classification considerably.
- Blank data points: Some data points have no words in them, which inadvertently trained the model to assign the “supply impact” category to empty rows.
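
The blank-row problem could be reduced by filtering such rows out before the data is split. The sketch below is one possible filter, not something the original script does. The helper name `dropRowsWithoutWords` is hypothetical, and it assumes the `words` and `wordVec` columns produced by `buildSentenceVectors()`.

```python
import numpy as np
import pandas as pd

def dropRowsWithoutWords(df):
    """Keep only rows with at least one word and a non-zero sentence vector."""
    hasWords = df['words'].map(len) > 0
    hasVector = df['wordVec'].map(lambda v: bool(np.any(v)))
    return df[hasWords & hasVector]

# Toy frame: one usable row, one blank row, one row whose only word is unknown.
toy = pd.DataFrame({
    'Data Point': ['Staff shortages reported', '', 'xqzv'],
    'words': [['staff', 'shortages', 'reported'], [], ['xqzv']],
    'wordVec': [np.array([0.2, 0.1]), np.zeros(2), np.zeros(2)],
})
filtered = dropRowsWithoutWords(toy)
print(len(toy), '->', len(filtered))  # 3 -> 1
```

The second condition also removes rows whose words are all missing from the word vector database, since they receive the same all-zero vector as blank rows.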


## Remarks, Notes and Further Improvement in Script

In [None]:
# a fixed random_state makes the split reproducible
##train, test = train_test_split(df, test_size=0.1, random_state=1)
##
##X_train = np.array(train['wordVec'].tolist())
##X_test = np.array(test['wordVec'].tolist())
##y_train = train['label']
##y_test = test['label']
##
##print("Train/test split")
##
### random_state=1234 makes result reproducible
##rf = RandomForestClassifier(random_state=1234)
##rf_model = rf.fit(X_train, y_train.values)
##
##print("Random Forest model fit")
##
##print(rf.score(X_test, y_test))
# 0.5982142857142857

#print(cross_validate(df))

#print(df.groupby('Primary Topic')['Primary Topic'].count() / len(df))

#print(create_df())


# Select single class vs probability split between classes
# rf.predict(X_test)
# rf.predict_proba(X_test)

### Model training
# varying min_samples_leaf or max_depth still seemingly gives a score of at most 65%
# TODO: Try filtering out rows with no words found before doing split
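
The note above about selecting a single class versus a probability split between classes can be illustrated with the sketch below. It runs on synthetic data, and the 0.5 threshold and the idea of flagging uncertain rows for manual review are assumptions for illustration, not part of the original script.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 100 vectors, 5 classes.
rng = np.random.default_rng(1234)
X = rng.normal(size=(100, 10))
y = np.arange(100) % 5
rf = RandomForestClassifier(n_estimators=50, random_state=1234).fit(X[:80], y[:80])

X_test = X[80:]
labels = rf.predict(X_test)        # one class per row
probs = rf.predict_proba(X_test)   # one probability per class per row, shape (20, 5)

# Rows whose top probability is below an arbitrary threshold could be sent for manual review.
threshold = 0.5
needsReview = probs.max(axis=1) < threshold
```

`predict` returns the class with the highest averaged probability, so `predict_proba` adds information only about how confident that choice is.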
