The Tweet Normalizer implements the unconstrained mode of the paper NCSU-SAS-Ning: Candidate Generation and Feature Engineering for Supervised Lexical Normalization. Tweets are retrieved through the Twitter API `/statuses/filter` endpoint on the account @TestNormalizer, registered specifically for this application. The model is trained on the dataset provided by the competition, and the static mapping is expanded with the lexical normalisation dictionary found in the Resource section of the competition. The application also supplies a revision feature, which expands the dataset to enable better normalization.
- CS410 Final Project - Tweet Normalizer
*The application only runs on macOS or Linux.
Make sure Python 3 is installed (`which python3`), and install the required libraries:

`pip install -r requirements.txt`
Install Node.js and npm, then install the packages:

`cd twimalizer`
`npm install -g electron-forge`
`npm install --save`

Stay in the `twimalizer` folder and run:

`electron-forge start`
The paper proposes a two-step procedure consisting of candidate generation and candidate evaluation.
The feature set for a token is generated by computing its n-grams and k-skip-n-grams (this application uses 2-grams and 1-skip-2-grams). '$' is prepended to the first n-gram and appended to the last n-gram, and '|' is inserted in place of each skipped character. For example:
love -> { $lo, ov, ve$, l|v, o|e }
The paper proposes the Jaccard index as the similarity measure: the ratio of the cardinality of the intersection of the two feature sets to the cardinality of their union, i.e. J(A, B) = |A ∩ B| / |A ∪ B|.
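As a concrete illustration, here is a minimal sketch of the feature-set construction and the Jaccard index for the 2-gram / 1-skip-2-gram configuration. The function names loosely mirror those documented for `similarity_index.py` below, but this is a simplified reimplementation: in particular, the real `JaccardIndex` also takes a `tailWeight` parameter that weights the boundary grams, which is not reproduced here.

```python
def ngram(word, n=2):
    """Character n-grams, with '$' attached to the first and last gram
    to mark the word boundaries (e.g. love -> {$lo, ov, ve$})."""
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    if grams:
        grams[0] = '$' + grams[0]
        grams[-1] = grams[-1] + '$'
    return set(grams)

def skipgram(word, k=1):
    """k-skip-2-grams, with '|' marking each skipped character
    (e.g. love -> {l|v, o|e} for 1-skip-2-grams)."""
    return {word[i] + '|' * k + word[i + k + 1] for i in range(len(word) - k - 1)}

def sim_feature(word, n=2, k=1):
    """Union of the n-gram and k-skip-n-gram feature sets."""
    return ngram(word, n) | skipgram(word, k)

def jaccard_index(s1, s2, n=2, k=1):
    """|A intersect B| / |A union B| over the two feature sets."""
    a, b = sim_feature(s1, n, k), sim_feature(s2, n, k)
    return len(a & b) / len(a | b) if a | b else 0.0

print(sim_feature('love'))           # {'$lo', 'ov', 've$', 'l|v', 'o|e'}
print(jaccard_index('luv', 'love'))  # 1 shared gram out of 7 total
```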
The following are considered possible candidates:
- The token itself
- Words in the mapping
- Canonical form: a transformed word, such as `ur -> your`
- Split words: a token split into multiple words, such as `lol -> laugh out loud`
*During testing, the token itself is not included.
In constrained mode, the following candidates are generated for each token:
- Token itself
- Top-scoring canonical form (for repetitive tokens only, i.e. tokens in which the same letter appears three times consecutively)
- Split words (if any exist)
In unconstrained mode, the following candidates are generated for each token:
- Token itself
- Top 3 scoring words in the mapping (no differentiation between canonical form and splitting); see the sketch below
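A rough sketch of the unconstrained selection step, using the `jaccard_index` helper sketched above and assuming a `static_map` from tokens to their known normalized forms (built by `generate_mapping.py`, described below). The real `generateCandidates` additionally attaches the feature vector and bookkeeping fields listed later in this document.

```python
def top_candidates(token, static_map, top_n=3):
    """Rank the normalized forms recorded for this token by Jaccard similarity
    to the token and keep the top_n scorers, plus the token itself."""
    forms = static_map.get(token, set())
    ranked = sorted(forms, key=lambda form: jaccard_index(token, form), reverse=True)
    return [token] + ranked[:top_n]
```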
For a token ti in a tweet T composed of "t1 t2 t3 ... ti-1 ti ... tn", each of its candidates is associated with the following feature vector, which the classifier uses to determine whether the candidate is the correct form to normalize to.
Feature | Association | Definition | Assumption |
---|---|---|---|
Support | Token | The number of times that the token appears during training | 0 if the token never appears during training |
Confidence | Candidate | The probability that the token is normalized to this candidate form during training | 1 if the corresponding token never appears during training |
Similarity | Candidate | Jaccard index calculated between the token and the candidate | |
Token Length | Token | The length of the token string | |
Candidate Length | Candidate | The length of the candidate string | |
Length Difference | Candidate | Difference of length between the token and the candidate | |
Mean POS Confidence Diff | Candidate | The change in the mean POS confidence for the whole tweet before and after normalization | |
POS Confidence Diff | Candidate | The change in the POS confidence for the current token before and after normalization | If the candidate consists of multiple words, the average POS confidence is used to calculate the change |
POS of ti-1 | Candidate | The part-of-speech tag of the previous token | Empty for the first token |
POS of ti | Candidate | The part-of-speech tag of the candidate | If the candidate consists of multiple words, the POS tag of the first word is used |
A random forest classifier is used for training. For fitting, the features and labels are simply passed to the classifier. For prediction, a list of per-candidate predictions is first obtained from the classifier; then, for each token, among the candidates predicted to be a correct canonical form, the one with the highest confidence is selected.
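A minimal sketch of this fit/predict flow with scikit-learn's `RandomForestClassifier`, using toy data; `group_ix` assigns each candidate row to the token it normalizes, mirroring the `group_ix` parameter documented for `predictor.py` below.

```python
from collections import defaultdict

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: one row per candidate, binary label = "is the correct form".
train_features = np.random.rand(40, 10)
train_labels = np.array([0, 1] * 20)
test_features = np.random.rand(6, 10)
group_ix = [0, 0, 0, 1, 1, 1]        # candidate rows grouped by token

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train_features, train_labels)

# Probability that each candidate row is a correct normalized form.
proba = clf.predict_proba(test_features)[:, 1]

# Within each group (token), keep the candidate with the highest confidence.
groups = defaultdict(list)
for row, g in enumerate(group_ix):
    groups[g].append(row)
best_row_per_token = {g: max(rows, key=lambda r: proba[r]) for g, rows in groups.items()}
print(best_row_per_token)
```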
Training data is provided in a JSON file, and the basic format for a tweet is the following:
{
    "input": ["token1", "token2", ...],
    "output": ["token1", "token2", ...]
}
`input` is the tokenized original tweet; `output` contains the normalized tokens corresponding to `input`.
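For example, the pairs can be read like this (the file name is a placeholder):

```python
import json

with open('train_data.json') as f:   # hypothetical file name
    tweets = json.load(f)

for tweet in tweets:
    for original, normalized in zip(tweet['input'], tweet['output']):
        if original != normalized:
            print(f'{original} -> {normalized}')
```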
generate_mapping.py
Function | Parameters | Return | Description |
---|---|---|---|
generateMap | tweets:List | (static_map, support_map, confidence_map, index_map):(defaultdict, defaultdict, defaultdict, defaultdict) | Create the mappings from the training data. The static map holds all mappings from a token to its normalized forms. The support map counts how many times a token appears. The confidence map is the frequency count of each normalized form for a token. The index map is the Jaccard index between the token and the normalized form. |
augmentMapUsingEMNLP | (static_map, support_map, confidence_map, index_map):(defaultdict, defaultdict, defaultdict, defaultdict) | (static_map, support_map, confidence_map, index_map):(defaultdict, defaultdict, defaultdict, defaultdict) | Augment the mappings with the EMNLP dataset. |
augmentMapUsingFeiLiu | (static_map, support_map, confidence_map, index_map):(defaultdict, defaultdict, defaultdict, defaultdict) | (static_map, support_map, confidence_map, index_map):(defaultdict, defaultdict, defaultdict, defaultdict) | Augment the mappings with Fei Liu's dataset. |
consolidateMap | (static_map, support_map, confidence_map, index_map):(defaultdict, defaultdict, defaultdict, defaultdict) | (static_map, support_map, confidence_map, index_map):(dict, dict, dict, dict) | Convert the maps to plain Python dictionaries so they can be saved later. |
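A simplified sketch of what `generateMap` is described as doing; the real function also builds the Jaccard `index_map` and returns defaultdicts for all four maps.

```python
from collections import defaultdict

def generate_map(tweets):
    """Build static, support, and confidence maps from (input, output) token pairs."""
    static_map = defaultdict(set)                            # token -> normalized forms
    support_map = defaultdict(int)                           # token -> occurrence count
    confidence_map = defaultdict(lambda: defaultdict(int))   # token -> form -> count
    for tweet in tweets:
        for token, form in zip(tweet['input'], tweet['output']):
            support_map[token] += 1
            static_map[token].add(form)
            confidence_map[token][form] += 1
    return static_map, support_map, confidence_map
```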
generate_pos_info.py
Function | Parameters | Return | Description |
---|---|---|---|
initWithPOS | tweets:List | mappedTweets:List | Invoke the ark-tweet POS tagger and extend each Tweet object with the fields `mean` (mean POS tagging confidence), `prob` (array of per-token POS tagging confidences), and `tag` (array of POS tags). |
generatePOSConfidence | tweets:List | (originalTweets, mappedTweets):(List, List) | Invoke the ark-tweet POS tagger on both the original and the normalized tweets, extending each Tweet object with the fields `mean`, `prob`, and `tag` as above. Works only for tweets whose normalized form has the same number of tokens or more; if the normalized tweet has fewer tokens, it is dropped. originalTweets contains all legal tweets and mappedTweets their normalized versions. |
similarity_index.py
Function | Parameters | Return | Description |
---|---|---|---|
ngram | word:string, n:integer | k0gram:set | Generate the n-gram set, with '$' prepended to the first gram and appended to the last gram. |
skipgram | word:string, n:integer, k:integer | kngram:set | Generate the k-skip-n-gram set, with '\|' separating the characters around a skip. |
sim_feature | word:string, n:integer, k:integer(default=1) | features:set | Generate the proposed feature set, which combines the n-gram and k-skip-n-gram sets. |
JaccardIndex | s1:string, s2:string, n:integer(default=2), k:integer(default=1), tailWeight:integer(default=3) | score:float | Calculate the Jaccard index between two words. |
generate_candidate.py
Function | Parameters | Return | Description |
---|---|---|---|
generateCandidates | mappedTweets:List, maps:(4 dictionaries), includeSelf:bool(default=True), constrained:bool(default=True) | candidates:List | Return a list of mapped candidates, each containing the original input `input`, whether this is the correct form `label`, the feature vector `feature`, the normalizing token `token`, the candidate type `category` (either `self`, `canonical`, or `split`), the tweet index `tweet_idx`, and the word index `idx`. If constrained is True, the similarity measure is used only when the word is repetitive; otherwise, the top 3 tokens are used. |
generateTrainingCandidates | mappedTweets:List, maps:(4 dictionaries), includeSelf:bool(default=False) | candidates:List | Same as generateCandidates, except that all possible normalizing tokens are used. |
isRepetitive | token:string | repetitive:bool | Check whether the same letter appears three times consecutively. |
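The repetitiveness check fits in a couple of lines; a sketch consistent with the description above:

```python
import re

def is_repetitive(token):
    """True if the same character appears three (or more) times in a row, e.g. 'soooo'."""
    return re.search(r'(.)\1\1', token) is not None
```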
generate_feature.py
Function | Parameters | Return | Description |
---|---|---|---|
generateFeatureVectors | (candidateTweets, TaggedTweets):(List, List) | (tweet_id, indices, category, token, training, label):(List, List, List, List, List, List) | Generate feature vectors along with the corresponding properties and labels. See generateCandidates for an explanation of the properties. |
create_dataset.py
The script that generates the training and testing datasets and saves all the mappings for future use.
predictor.py
Function | Parameters | Return | Description |
---|---|---|---|
Predictor | (classifier):(sklearn.ensemble.RandomForestClassifier) | Predictor | Initialize the class. |
Predictor.fit | (features, labels):(numpy.ndarray, numpy.ndarray) | Predictor | Fit the features and labels to the predictor. |
Predictor.predict | (group_ix, features):(list, numpy.ndarray) | (result):(numpy.ndarray) | Predict the results by selecting exactly one canonical form for each group_ix. When predicted labels tie, the first column of the training data is used by default to break the tie. |
Predictor.score | (group_ix, features, labels):(list, numpy.ndarray, numpy.ndarray) | (precision, recall, f1_score):(float, float, float) | Return the precision, recall, and f1-score on the testing data. |
training.py
training.py passes a random forest classifier to the Predictor class we created, fits the training data to it, and then saves the model. It then evaluates the precision, recall, and f1-score on both the constrained and unconstrained datasets and prints out sample predictions so the results can be sanity-checked.
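Put together, the flow in training.py could look roughly like this; the dataset file names and the `categories` filter are placeholders, and the exact arguments may differ from the actual script.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

from predictor import Predictor
from load_store_data import load_dataset_from_file, save_model

# Candidate categories to keep (see generateCandidates); hypothetical value.
categories = np.array(['self', 'canonical', 'split'])

group_ix, tokens, features, labels = load_dataset_from_file('training_data.json', categories)

predictor = Predictor(RandomForestClassifier(n_estimators=100))
predictor.fit(features, labels)
save_model(predictor, 'model.pkl')

test_ix, test_tokens, test_features, test_labels = load_dataset_from_file('testing_data.json', categories)
precision, recall, f1_score = predictor.score(test_ix, test_features, test_labels)
print(precision, recall, f1_score)
```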
load_store_data.py
Function | Parameters | Return | Description |
---|---|---|---|
load_dataset_from_file | (filename, categories):(str, numpy.ndarray) | (group_ix, tokens, features, labels):(list, numpy.ndarray, numpy.ndarray, numpy.ndarray) | Load a dataset from a file into the arrays needed for prediction. |
load_dataset | (tweet_ix, ix, tokens, features, labels, categories):(numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray) | (group_ix, tokens, features, labels):(list, numpy.ndarray, numpy.ndarray, numpy.ndarray) | Load a dataset from its constituent parts into the arrays needed for prediction. |
save_model | (model, file_name):(Predictor, str) | None | Save the model to the specified file. |
load_model | (file_name):(str) | (model):(Predictor) | Load the model from the specified file. |
normalize_tweets.py
The script that spawns the unconstrained-mode normalizer and reads a tweet to normalize from `stdin`. The result is written to `stdout`.
Function | Parameters | Return | Description |
---|---|---|---|
mapATweet | tweet:string | (inputTokens, normalizedTokens):(List, List) | Normalize a single tweet and return its tokenized input and normalized tokens. |
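Inside `normalize_tweets.py`, the stdin/stdout loop around `mapATweet` could look roughly like this; the JSON output format shown is an assumption, not necessarily what the script actually emits.

```python
import json
import sys

for line in sys.stdin:
    tweet = line.strip()
    if not tweet:
        continue
    input_tokens, normalized_tokens = mapATweet(tweet)
    # Emit one JSON line per tweet for the Electron frontend to parse.
    print(json.dumps({'input': input_tokens, 'output': normalized_tokens}))
    sys.stdout.flush()
```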
The GUI is implemented with modern web-development techniques: Electron is the wrapper and Vue.js is the framework. The Electron configuration is in `src/index.html` and `src/index.js`. The Vue component `src/normalizer.vue` uses Semantic UI to build the feed list. A Twitter client is connected when the component is created, and the streaming API is hooked to a function that continuously pushes new tweets into the array.
Normalization is performed by using a subprocess to spawn a Python instance that executes `normalize_tweets.py`, with the tweet as its input; its output is parsed to substitute the chosen tweet.
To train a new model, or to rebuild the dataset after new data have been added, run:

`python3 create_dataset.py`
`python3 training.py`

Once the model has been saved to a file, you can start the application in unconstrained mode by following the instructions above.
For model evaluation, we obtained the following results (see training.py for more details):
dataset | precision | recall | f1-score |
---|---|---|---|
Constrained | 0.9258239891267415 | 0.9989734188817598 | 0.9610087293889428 |
Unconstrained | 0.972782874617737 | 0.9924084858569051 | 0.9824976835169361 |
- Dachun Sun:
- Online text normalization dataset collection
- Application structure design
- Implement and optimize word splitting process
- Implement enhanced frontend using Electron
- Debugging and performance optimization
- Software documentation
- Tiancheng Wu:
- Backend machine learning data preparation
- Learning feature extraction and selection
- Learning model selection and optimization
- Training the model
- Software documentation
- Xuanyi Zhu:
- Frontend implementation using the PyQt5 framework
- Frontend debugging and optimization
- User-friendly GUI optimization with the empirical user study
- Group progress coordination
- Software documentation
- Project presentation video