# Applying LexNorm to Text

Load the supporting libraries, plus the Normalizer class from the LexNorm.py script.

In [1]:
import pandas as pd
import numpy as np
from timeit import default_timer as timer
import random
import itertools

from LexNorm import Normalizer

Example of how to apply the script. The input to correct_spelling_mistakes must already be tokenized; running normalize first handles the tokenization.

In [3]:
#example of usage

test = ['Seroquel makes me sleepy', '@TwitterHandle Have you heard of Lyrica? #diabetes']

# Normalize with our final version of the normalizer, which also maps drug brand names to generic names
test2 = Normalizer().drug_normalize(test)
print(test2)

#correct spelling mistakes - input must be tokenized
#test3 = Normalizer().correct_spelling_mistakes(test2)[0]
#print(test3)

[['quetiapine', 'makes', 'me', 'sleepy'], ['-TH-', 'have', 'you', 'heard', 'of', 'pregabalin', '?', 'diabetes']]
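
For intuition, the behaviour visible in this output (handles replaced by -TH-, the '#' stripped from hashtags, lowercasing, brand names mapped to generic names) can be imitated with a toy function. This is a minimal sketch and not the LexNorm implementation: `toy_drug_normalize`, `TOKEN_RE`, and the two-entry `BRAND_TO_GENERIC` table are hypothetical names introduced only for illustration.

```python
import re

# Hypothetical two-entry lookup; the real normalizer's drug table is not shown in this notebook.
BRAND_TO_GENERIC = {"seroquel": "quetiapine", "lyrica": "pregabalin"}
HANDLE_PLACEHOLDER = "-TH-"

# Match @handles, #hashtags, words, or single punctuation marks.
TOKEN_RE = re.compile(r"@\w+|#\w+|\w+|[^\w\s]")


def toy_drug_normalize(texts):
    """Tokenize each text, mask Twitter handles, and map brand names to generic names."""
    normalized = []
    for text in texts:
        tokens = []
        for tok in TOKEN_RE.findall(text):
            if tok.startswith("@") and len(tok) > 1:
                tokens.append(HANDLE_PLACEHOLDER)
                continue
            if tok.startswith("#") and len(tok) > 1:
                tok = tok[1:]  # keep the hashtag text, drop the '#'
            tok = tok.lower()
            tokens.append(BRAND_TO_GENERIC.get(tok, tok))
        normalized.append(tokens)
    return normalized


example = ["Seroquel makes me sleepy",
           "@TwitterHandle Have you heard of Lyrica? #diabetes"]
print(toy_drug_normalize(example))
```

On these two inputs the sketch gives the same token lists as the output above; it makes no claim about how Normalizer handles other cases.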


## Applying to tweets

Load the train/dev/test tweets.

In [2]:
train_df = pd.read_csv('./data/train.tsv', sep="\t",
                       names=['tweet_id', 'label', 'tweet'])
dev_df = pd.read_csv('./data/dev.tsv', sep="\t",
                     names=['tweet_id', 'label', 'tweet'])
test_df = pd.read_csv('./data/test.tsv', sep="\t",
                      names=['tweet_id', 'user_id', 'tweet'])

Select just the tweet column.

In [3]:
train_tweets = train_df.tweet
dev_tweets = dev_df.tweet
test_tweets = test_df.tweet

To run the full normalizer + spell corrector pipeline, use the cell below. For our models, we ran only drug_normalize.

In [4]:
# full normalizer (takes care of tokenization) and correct spelling mistakes
start = timer()
normalized_tweets = Normalizer().normalize(dev_tweets)
corrected_tweets = Normalizer().correct_spelling_mistakes(normalized_tweets)
end = timer()
print("Time to run: ", int(end-start), "sec")
print("Total tweets processed: ", len(corrected_tweets[1]))
print("Total tweets corrected: ", np.count_nonzero(corrected_tweets[1]))
print("Total tokens corrected: ", sum(corrected_tweets[1]))
print("Unique tokens replaced: ", len(corrected_tweets[4]))

Time to run:  101 sec
Total tweets processed:  5136
Total tweets corrected:  680
Total tokens corrected:  789
Unique tokens replaced:  646
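
The print statements above index into the return value of correct_spelling_mistakes by position. Judging only from how they are used here, element 1 appears to hold one correction count per tweet and element 4 a mapping from each replaced token to its replacement. The sketch below reproduces the same four statistics from those two pieces; `summarize_corrections` and the toy inputs are hypothetical and are not part of LexNorm.

```python
import numpy as np


def summarize_corrections(per_tweet_counts, replacements):
    """Compute the summary statistics printed in the cell above.

    per_tweet_counts: number of corrected tokens for each tweet (assumed layout of element 1).
    replacements: dict of original token -> replacement (assumed layout of element 4).
    """
    return {
        "tweets_processed": len(per_tweet_counts),
        "tweets_corrected": int(np.count_nonzero(per_tweet_counts)),
        "tokens_corrected": int(sum(per_tweet_counts)),
        "unique_tokens_replaced": len(replacements),
    }


# Toy data: four tweets, two of which had corrections.
toy_counts = [0, 2, 0, 1]
toy_replacements = {"aderol": "adderall", "sai": "said"}
print(summarize_corrections(toy_counts, toy_replacements))
```

Unpacking the result into named fields like this keeps later analysis code from depending on positional indices.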


This runs only the simple normalizer + drug normalizer, which is our final preprocessor.

In [4]:
# drug normalizer only (takes care of tokenization); no spelling correction
start = timer()
normalized_tweets = Normalizer().drug_normalize(dev_tweets)
end = timer()
print("Time to run: ", int(end-start), "sec")

Time to run:  1 sec


In [5]:
# put the normalized tweets back into the dataframe
normalized_dev_tweets = pd.Series([" ".join(x) for x in normalized_tweets[0]])
dev_df['normalized_tweet'] = normalized_dev_tweets
dev_df.head()

| | tweet_id | label | tweet | normalized_tweet |
|---|---|---|---|---|
| 0 | 0 | @chrisbrown im on depokaote, zoloft, paxil and... | -TH- i am on depokaote , zoloft , paxil and pr... | -th- i am on depokaote , sertraline , paroxeti... |
| 1 | 0 | rt @ovariancancers: clinical oncology news - d... | rt -TH- : clinical oncology news - duloxetine ... | rt -th- : clinical oncology news - duloxetine ... |
| 2 | 0 | whoever created fairytales needs to take respo... | whoever created fairytales needs to take respo... | whoever created fairytales needs to take respo... |
| 3 | 0 | @anorexic0 @ewdustin -generic drugs as effecti... | -TH- -TH- -generic drugs as effective as amgen... | -th- -th- -generic drugs as effective as amgen... |
| 4 | 0 | @gussypalore weekly enbrel injections. paracet... | -TH- weekly enbrel injections . paracetamol . ... | -th- weekly etanercept injections . paracetamo... |
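
Joining token lists back into strings and attaching them as a column can be sketched on toy data as follows. The helper `attach_normalized` is hypothetical; it adds a length check and builds the Series on the dataframe's own index, so rows cannot be silently misaligned if the dataframe does not use a default 0..n-1 index.

```python
import pandas as pd


def attach_normalized(df, token_lists, column="normalized_tweet"):
    """Return a copy of df with one space-joined string per token list in a new column."""
    if len(token_lists) != len(df):
        raise ValueError("expected one token list per dataframe row")
    out = df.copy()
    # Reuse df's index so values are assigned by position, not by label alignment.
    out[column] = pd.Series([" ".join(tokens) for tokens in token_lists], index=df.index)
    return out


toy_df = pd.DataFrame({"tweet": ["Seroquel makes me sleepy",
                                 "Have you heard of Lyrica?"]})
toy_tokens = [["quetiapine", "makes", "me", "sleepy"],
              ["have", "you", "heard", "of", "pregabalin", "?"]]
print(attach_normalized(toy_df, toy_tokens))
```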


In [29]:
train_df.to_csv('train.tsv',sep="\t",index=False,index_label=False, header=False)
dev_df.to_csv('dev.tsv',sep="\t",index=False,index_label=False, header=False)
test_df.to_csv('test.tsv',sep="\t",index=False,index_label=False, header=False)

In [30]:
full_train_df = pd.concat([train_df, dev_df], axis=0)
full_train_df.reset_index(drop=True, inplace=True)

In [31]:
full_train_df.to_csv('full_training.tsv',sep="\t",index=True,index_label=False, header=False,
                     columns=['label','tweet','normalized_tweet'])
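
As a small illustration of the concatenate-and-save step, the sketch below builds two toy dataframes, combines them with ignore_index=True (equivalent to concatenating and then calling reset_index(drop=True)), and writes the result to an in-memory buffer instead of a file. The toy rows are made up for the example.

```python
import io

import pandas as pd

toy_train = pd.DataFrame({"label": [0, 1], "tweet": ["a", "b"],
                          "normalized_tweet": ["a", "b"]})
toy_dev = pd.DataFrame({"label": [1], "tweet": ["c"],
                        "normalized_tweet": ["c"]})

# ignore_index=True renumbers the rows 0..n-1 in a single step.
toy_full = pd.concat([toy_train, toy_dev], axis=0, ignore_index=True)

# index=True writes the fresh row number as the first column of each line.
buffer = io.StringIO()
toy_full.to_csv(buffer, sep="\t", index=True, header=False,
                columns=["label", "tweet", "normalized_tweet"])
tsv_text = buffer.getvalue()
print(tsv_text)
```

Because header=False is used, any code that reads the file back has to supply the column names itself, including one for the leading row-number column.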

## Spelling Correction Analysis

Analyze the output of the spell corrector, if it was used:

In [5]:
print("Total words processed: ", sum([len(x) for x in normalized_tweets]))
print("Total tokens corrected: ", sum(corrected_tweets[1]))

Total words processed:  206916
Total tokens corrected:  5379


Take a look at some examples of corrections that it made:

In [54]:
# random.sample needs a sequence, so convert the dict items to a list first
random.sample(list(corrected_tweets[4].items()), 20)

[('arthrotec', 'arthritis'),
 ('followfriday', ['follow', 'friday']),
 ('tt', 'ass'),
 ('xanaxdreams', ['xanax', 'dreams']),
 ('toastie', 'chocolate'),
 ('spondyloarthritis', 'arthritis'),
 ('surgeryblue', ['surgery', 'blue']),
 ('dysthymia', 'schizophrenia'),
 ('aderol', 'adderall'),
 ('injectable', 'injection'),
 ('bussing', 'missing'),
 ('undersleep', 'understand'),
 ('doxycycline', 'ciprofloxacin'),
 ('sai', 'said'),
 ('macrobid', 'ciprofloxacin'),
 ('crazytown', ['crazy', 'town']),
 ('chd', 'ischemic'),
 ('hbp', 'metoprolol'),
 ('sleepdisorders', ['sleep', 'disorders']),
 ('infront', 'instead')]
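
The sample mixes three kinds of changes: compound tokens split into words (list values), single-token rewrites, and cases where a valid drug name was replaced by a different drug (for example doxycycline to ciprofloxacin). One way to triage such a mapping is sketched below. The `triage` helper and the `PROTECTED_TERMS` set are hypothetical, and the entries are copied from the sample above.

```python
# A few entries copied from the sample above.
sample_replacements = {
    "followfriday": ["follow", "friday"],
    "aderol": "adderall",
    "doxycycline": "ciprofloxacin",
    "macrobid": "ciprofloxacin",
    "sai": "said",
}

# Hypothetical list of terms that are already valid and should not be "corrected".
PROTECTED_TERMS = {"doxycycline", "macrobid"}


def triage(replacements, protected):
    """Split a replacement mapping into word splits, rewrites, and suspicious changes."""
    splits, rewrites, suspicious = {}, {}, {}
    for original, fixed in replacements.items():
        if original in protected:
            suspicious[original] = fixed   # a valid term was altered
        elif isinstance(fixed, list):
            splits[original] = fixed       # compound token split into words
        else:
            rewrites[original] = fixed     # single-token rewrite
    return splits, rewrites, suspicious


splits, rewrites, suspicious = triage(sample_replacements, PROTECTED_TERMS)
print("splits:", splits)
print("rewrites:", rewrites)
print("suspicious:", suspicious)
```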