# Auto Labeller - A step-by-step guide
Here we provide a step-by-step guide for experimenting with the semi-automatic labelling tool.

## Step 1 - Standard Library Imports
First, let's perform the standard library imports.

In [1]:
# Standard Library Imports
import pandas as pd
import numpy as np

from src.toolkit.autolabel import Preprocessor, AutoLabeller, check_labels
from src.toolkit.autolabel import recommend_words

from sklearn.naive_bayes import MultinomialNB

## Step 2 - Adjust the Input
Input the file path to your text data.

The tabular data should have the following format:

<img style="float: left;" src="data/images/input_sample.png">

In [2]:
# file path to text data
text_path = "data/movies/movies500.csv"  ##  PUT YOUR FILEPATH

# name of column that contains text data
text_column_name = "overview"  ##  PUT YOUR TEXT COLUMN

data = pd.read_csv(text_path)[[text_column_name]]
data.head(5)  ## CHECK THE FORMAT OF YOUR DATA

| | overview |
|---|---|
| 0 | A family wedding reignites the ancient feud be... |
| 1 | Cheated on, mistreated and stepped on, the wom... |
| 2 | Obsessive master thief, Neil McCauley leads a ... |
| 3 | An ugly duckling having undergone a remarkable... |
| 4 | A mischievous young boy, Tom Sawyer, witnesses... |
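
Before moving on, it can help to confirm that the file really contains the text column you named and that none of its rows are empty. The sketch below is a dependency-free illustration using only Python's `csv` module. The helper name `check_text_column` and the inline sample data are hypothetical and not part of the toolkit.

```python
import csv
import io


def check_text_column(csv_file, text_column_name):
    """Return (row_count, empty_count) for the given text column.

    Raises KeyError if the column is missing from the header.
    Hypothetical helper for illustration; not part of src.toolkit.
    """
    reader = csv.DictReader(csv_file)
    if text_column_name not in (reader.fieldnames or []):
        raise KeyError(f"Column {text_column_name!r} not found in {reader.fieldnames}")
    row_count = 0
    empty_count = 0
    for row in reader:
        row_count += 1
        # Treat missing or whitespace-only cells as empty
        if not (row[text_column_name] or "").strip():
            empty_count += 1
    return row_count, empty_count


# Inline sample standing in for a real file such as the one at text_path
sample = io.StringIO(
    "title,overview\n"
    "Heat,Obsessive master thief leads a crew\n"
    "Unknown,\n"
)
rows, empties = check_text_column(sample, "overview")
print(rows, empties)
```

To run the same check on your own data, pass an open file handle instead of the in-memory sample.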


## Step 3 - Data Preprocessing
Run the following code to prepare your text data for model input.

In [3]:
corpus = data[text_column_name]

preprocessor = Preprocessor()

# Text Preprocessing
stopwords_path = "data/stopwords.csv"  # STOPWORDS PATH
preprocessed_corpus = preprocessor.corpus_preprocess(corpus=corpus, stopwords_path=stopwords_path)

# Replace bigrams
data[text_column_name] = preprocessor.corpus_replace_bigrams(corpus=preprocessed_corpus, min_df=50, max_df=500)
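
The toolkit's source is not shown in this guide, so the following is only a rough sketch of what replacing bigrams under document-frequency bounds could look like. It assumes that `min_df` and `max_df` are the minimum and maximum number of documents a bigram may appear in, and that a kept bigram is joined with an underscore. `replace_bigrams` is a hypothetical stand-in, not the actual `corpus_replace_bigrams` implementation.

```python
from collections import Counter


def replace_bigrams(corpus, min_df, max_df):
    """Join adjacent word pairs whose document frequency is within bounds."""
    tokenised = [doc.split() for doc in corpus]

    # Document frequency: count each bigram at most once per document
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(zip(tokens, tokens[1:])))

    keep = {bg for bg, df in doc_freq.items() if min_df <= df <= max_df}

    replaced = []
    for tokens in tokenised:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in keep:
                out.append(tokens[i] + "_" + tokens[i + 1])
                i += 2  # skip the second word of the merged pair
            else:
                out.append(tokens[i])
                i += 1
        replaced.append(" ".join(out))
    return replaced


docs = [
    "serial killer stalks small town",
    "detective hunts serial killer",
    "small town detective retires",
]
result = replace_bigrams(docs, min_df=2, max_df=3)
print(result)
```

Under these assumptions, a lower bound filters out rare pairs that are probably accidental, while an upper bound filters out pairs so common that they carry little topical signal.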

## Step 4 - Generate recommended labels
Run the following code to generate recommended labels.

In [4]:
n_words = 20  # CHANGE THE NUMBER OF RECOMMENDED WORDS (IF YOU WANT TO)

# Fit the topic model, then display a matrix of recommended words
topic_model, dtm, best_n = recommend_words(corpus)
topic_model.show_topics(dtm=dtm, best_n=best_n, n_words=n_words)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | man | demon | convince | story | discovers | political | guest | friend | true | like | look | know | fear | pretend | nature | confront | descend | bright | personality | hunt |
| 1 | love | fall | french | student | king | two | play | romantic | story | dr | counterpart | exchange | marry | brother | tale | famous | mysterious | think | hard | mother |
| 2 | young | husband | try | officer | beautiful | story | dangerous | help | boy | home | escape | relationship | war | girl | men | name | doctor | fantasy | begin | decide |
| 3 | killer | serial | detective | law | female | victim | violence | person | gay | one | lead | two | duo | clue | personality | journalist | past | intelligent | effort | capture |
| 4 | daughter | wife | face | leave | suffer | die | car | accident | second | poor | castle | rule | need | become | begin | two | look | experiment | professor | mute |
| 5 | team | stop | evil | try | former | cop | computer | agent | kgb | protect | steal | world | rescue | russian | plan | use | chip | new | nuclear | game |
| 6 | woman | town | come | pursue | small | notorious | beautiful | mysterious | work | comedy | along | successful | show | move | many | compete | dream | fortune | unemployed | within |
| 7 | film | story | work | tell | base | red | comedy | set | boy | strange | world | bring | novel | movie | studio | village | adventure | romantic | mafia | company |
| 8 | get | thing | friend | go | best | marry | tell | old | help | marriage | boyfriend | hire | cop | mother | business | try | brother | wife | kid | give |
| 9 | find | new | way | time | hit | job | secret | die | friend | make | mob | bring | bos | work | advertising | ancient | human | virus | miss | discover |


Explanation of the above matrix:
* Each row relates to a "theme" identified by the model
* Read row by row to identify the themes you want to use in the following step

## Step 5 - Create and feed your label dictionary
Open your favourite spreadsheet editor (Excel, Numbers, LibreOffice Calc) and create a labels CSV file based on the suggested vocabulary shown above in Step 4.

The label dictionary should be in a tabular format, with one column per label and that label's keywords listed beneath it.

**Note: these labels need to exist within the dataset**

<img style="float: left;" src="data/images/labels_sample.png">
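
If you would rather build the labels file in code than in a spreadsheet, the sketch below writes one column per label with its keywords underneath, padding shorter columns with empty cells. This layout is inferred from the `labels.head(5)` output in this step. The `write_label_csv` helper and the `my_labels.csv` file name are hypothetical.

```python
import csv
import io
from itertools import zip_longest


def write_label_csv(label_keywords, csv_file):
    """Write a dict of {label: [keywords]} as one column per label."""
    writer = csv.writer(csv_file, lineterminator="\n")
    writer.writerow(label_keywords.keys())
    # zip_longest pads shorter keyword lists with empty cells
    for row in zip_longest(*label_keywords.values(), fillvalue=""):
        writer.writerow(row)


label_keywords = {
    "Romance": ["love", "french", "romantic"],
    "Science Fiction": ["science"],
    "War": ["war", "world"],
}

buffer = io.StringIO()  # swap for open("my_labels.csv", "w", newline="")
write_label_csv(label_keywords, buffer)
print(buffer.getvalue())
```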

In [5]:
labels_path = "data/movies/movies500_labels.csv"  # INPUT PATH TO LABELS DICTIONARY
labels = pd.read_csv(labels_path)
labels = check_labels(data, labels)

In [6]:
# show label data
labels.head(5)  

| | Action | Romance | Science Fiction | Thriller | War | Western |
|---|---|---|---|---|---|---|
| 0 | criminal | love | science | killer | war | criminal |
| 1 | kill | french | | detective | world | |
| 2 | kidnap | romantic | | law | ii | |
| 3 | gang | marry | | victim | struggle | |
| 4 | revenge | marriage | | cop | | |
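
The note above says the labels need to exist within the dataset. How `check_labels` enforces this is not shown in this guide; one plausible reading is that keywords absent from the corpus vocabulary are dropped. The sketch below illustrates that idea with a hypothetical `filter_labels` helper that works on plain Python structures rather than DataFrames.

```python
def filter_labels(label_keywords, corpus):
    """Split keywords into those found in the corpus and those not found."""
    vocabulary = set()
    for doc in corpus:
        vocabulary.update(doc.lower().split())

    kept, dropped = {}, {}
    for label, keywords in label_keywords.items():
        kept[label] = [w for w in keywords if w.lower() in vocabulary]
        dropped[label] = [w for w in keywords if w.lower() not in vocabulary]
    return kept, dropped


corpus = ["detective hunts serial killer", "soldiers struggle through the war"]
label_keywords = {
    "Thriller": ["killer", "detective", "heist"],
    "War": ["war", "struggle"],
}

kept, dropped = filter_labels(label_keywords, corpus)
print(kept)
print(dropped)
```

Reporting the dropped keywords alongside the kept ones makes it easier to spot typos or words that the preprocessing step has changed.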


## Step 6 - Enriching Labels
In this step, the model enriches your current labels with other words that are close to each label's topic. This enriched dictionary is used for topic labelling.

In [7]:
autoLabeller = AutoLabeller(labels, corpus, data)
enriched_labels = autoLabeller.train()

enriched_labels  ## Enriched labels

| | Action | Romance | Science Fiction | Thriller | War | Western |
|---|---|---|---|---|---|---|
| 0 | married | notorious | case | world | world | case |
| 1 | marry | world | attend | law | singer | kill |
| 2 | street | attend | havoc | nuclear | expatriate | law |
| 3 | gang | marry | marry | kill | french | assign |
| 4 | yearold | french | science | clean | slave | retired |
| 5 | french | science | comedy | civil | ii | killer |
| 6 | violent | comedy | nuclear | renegade | disappearance | victim |
| 7 | kill | view | view | killer | holiday | next |
| 8 | clean | center | assign | russian | paris | detective |
| 9 | throw | bunch | side | terrorist | nazi | serial |
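
As a rough illustration of how a keyword dictionary can be used for topic labelling, the sketch below assigns each document the label whose keywords it mentions most often, and leaves it unlabelled when there is no match or a tie. This is an assumed, simplified scheme with a hypothetical `keyword_label` helper; it is not a description of what `AutoLabeller.train` or `AutoLabeller.apply` do internally.

```python
from collections import Counter


def keyword_label(doc, label_keywords):
    """Return the label with the most keyword hits, or None if unclear."""
    token_counts = Counter(doc.lower().split())
    scores = {
        label: sum(token_counts[w] for w in keywords)
        for label, keywords in label_keywords.items()
    }
    best = max(scores.values())
    winners = [label for label, score in scores.items() if score == best]
    # No hits, or several labels tied for first place: leave unlabelled
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]


label_keywords = {
    "Thriller": ["killer", "detective", "victim"],
    "Romance": ["love", "marry", "romantic"],
}
docs = [
    "a detective tracks a serial killer",
    "two strangers fall in love and marry",
    "a family moves to a small town",
    "a killer falls in love",
]
predicted = [keyword_label(d, label_keywords) for d in docs]
print(predicted)
```

Documents labelled this way could then serve as training examples for a classifier, which can generalise to documents that mention none of the keywords.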


## Step 7 - Apply the auto labeller
Input the file path where your labelled dataset should be saved.

In [8]:
labelled_path = "data/movies/movies500_labelled.csv"  # INPUT YOUR PREFERRED OUTPUT PATH

In [9]:
# Classifier passed to the auto labeller
mnb = MultinomialNB()
ypred = autoLabeller.apply(mnb, text_column_name)
# Save the labelled dataset to the chosen output path
ypred.to_csv(labelled_path)