## Analyzing the NUS SMS Corpus

This is a corpus of SMS (Short Message Service) messages collected for research at the Department of Computer Science at the National University of Singapore. This dataset consists of 67,093 SMS messages taken from the corpus on Mar 9, 2015. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available. The data collectors opportunistically collected as much metadata about the messages and their senders as possible, so as to enable different types of analyses.

Data is available from: https://www.kaggle.com/rtatman/the-national-university-of-singapore-sms-corpus 

Tao Chen and Min-Yen Kan (2013). Creating a Live, Public Short Message Service Corpus: The NUS SMS Corpus. Language Resources and Evaluation, 47(2), pages 299-355.

In [1]:
import json
import pandas as pd
from statistics import mean, median, mode

In [2]:
pd.set_option('display.max_colwidth', 100)

In [3]:
with open('data/smsCorpus_en_2015.03.09_all.json') as f:
    data = json.load(f)

In [4]:
data['smsCorpus']['message'][100]

{'@id': 10220,
 'text': {'$': 'm going to be late leh.'},
 'source': {'srcNumber': {'$': 51},
  'phoneModel': {'@manufactuer': 'unknown', '@smartphone': 'unknown'},
  'userProfile': {'userID': {'$': 51},
   'age': {'$': 'unknown'},
   'gender': {'$': 'unknown'},
   'nativeSpeaker': {'$': 'unknown'},
   'country': {'$': 'SG'},
   'city': {'$': 'unknown'},
   'experience': {'$': 'unknown'},
   'frequency': {'$': 'unknown'},
   'inputMethod': {'$': 'unknown'}}},
 'destination': {'@country': 'unknown', 'destNumber': {'$': 'unknown'}},
 'messageProfile': {'@language': 'en', '@time': 'unknown', '@type': 'unknown'},
 'collectionMethod': {'@collector': 'howyijue',
  '@method': 'unknown',
  '@time': '2003/4'}}
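
Each record nests element values under a `'$'` key and attributes under `'@'`-prefixed keys, so reaching a single field takes several lookups. As a minimal sketch, a recursive helper can collapse a record into dotted keys. `flatten_record` is a hypothetical name introduced here for illustration, and the sample is a shortened copy of the record above.

```python
def flatten_record(record, prefix=""):
    """Hypothetical helper: flatten a nested corpus record into dotted keys."""
    flat = {}
    for key, value in record.items():
        # '$' holds an element's own value, so it takes the parent's name.
        name = prefix if key == "$" else f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_record(value, name))
        else:
            flat[name] = value
    return flat


# Shortened copy of the record printed above.
sample = {
    "@id": 10220,
    "text": {"$": "m going to be late leh."},
    "source": {"userProfile": {"country": {"$": "SG"}}},
}

flat = flatten_record(sample)
print(flat)
```

With keys flattened this way, the message body sits under `'text'` and the sender's country under `'source.userProfile.country'`.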

In [5]:
messages = data['smsCorpus']['message']

In [6]:
print(f'Total SMS count: {len(messages)}')

Total SMS count: 55835


In [7]:
full_df = pd.DataFrame(messages)

In [8]:
sms_df = full_df[['text']]
sms_df.head()

Unnamed: 0,text
0,{'$': 'Bugis oso near wat...'}
1,"{'$': 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine ..."
2,{'$': 'I dunno until when... Lets go learn pilates...'}
3,{'$': 'Den only weekdays got special price... Haiz... Cant eat liao... Cut nails oso muz wait un...
4,{'$': 'Meet after lunch la...'}


In [9]:
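def clean_sms(sms_text):
    # Each 'text' value is a dict such as {'$': 'message body'}.
    # Read the body directly instead of slicing the dict's string form,
    # which is fragile when a message contains quotes or newlines.
    body = sms_text.get('$', '') if isinstance(sms_text, dict) else sms_text
    return str(body).lower()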

In [10]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
sms_df = sms_df.assign(clean_text=sms_df['text'].apply(clean_sms))


In [11]:
sms_df.head(10)

Unnamed: 0,text,clean_text
0,{'$': 'Bugis oso near wat...'},bugis oso near wat...
1,"{'$': 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine ...","go until jurong point, crazy.. available only in bugis n great world la e buffet... cine there g..."
2,{'$': 'I dunno until when... Lets go learn pilates...'},i dunno until when... lets go learn pilates...
3,{'$': 'Den only weekdays got special price... Haiz... Cant eat liao... Cut nails oso muz wait un...,den only weekdays got special price... haiz... cant eat liao... cut nails oso muz wait until i f...
4,{'$': 'Meet after lunch la...'},meet after lunch la...
5,{'$': 'm walking in citylink now ü faster come down... Me very hungry...'},m walking in citylink now ü faster come down... me very hungry...
6,{'$': '5 nights...We nt staying at port step liao...Too ex'},5 nights...we nt staying at port step liao...too ex
7,{'$': 'Hey pple...$700 or $900 for 5 nights...Excellent location wif breakfast hamper!!!'},hey pple...$700 or $900 for 5 nights...excellent location wif breakfast hamper!!!
8,{'$': 'Yun ah.the ubi one say if ü wan call by tomorrow.call 67441233 look for irene.ere only go...,"yun ah.the ubi one say if ü wan call by tomorrow.call 67441233 look for irene.ere only got bus8,..."
9,{'$': 'Hey tmr maybe can meet you at yck'},hey tmr maybe can meet you at yck


In [12]:
sample_df = sms_df['clean_text'][0:1000]

In [13]:
sample_df.head()

0                                                                                  bugis oso near wat...
1    go until jurong point, crazy.. available only in bugis n great world la e buffet... cine there g...
2                                                         i dunno until when... lets go learn pilates...
3    den only weekdays got special price... haiz... cant eat liao... cut nails oso muz wait until i f...
4                                                                                 meet after lunch la...
Name: clean_text, dtype: object

In [14]:
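# Open in write mode so that re-running the cell does not append duplicate messages.
# As the printed text further down shows, to_string() pads lines and truncates long messages.
with open('1000sms.txt', 'w') as f:
    f.write(sample_df.to_string(header=False, index=False))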
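
An alternative, sketched here under the assumption that one unpadded message per line is what the later steps need, is to join the strings with newlines rather than rely on the DataFrame's display formatting. The file name `sample_sms.txt` and the two-item list are placeholders for illustration.

```python
import os
import tempfile

# Placeholder messages standing in for the cleaned sample.
messages = ["bugis oso near wat...", "meet after lunch la..."]

# Write to a temporary directory so the sketch leaves no files behind in the project.
out_path = os.path.join(tempfile.mkdtemp(), "sample_sms.txt")
with open(out_path, "w", encoding="utf8") as f:
    f.write("\n".join(messages) + "\n")

# Reading the file back returns exactly the messages that were written.
with open(out_path, encoding="utf8") as f:
    lines = f.read().splitlines()
print(lines)
```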

In [15]:
from nltk.util import pad_sequence
from nltk.util import bigrams
from nltk.util import ngrams
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends
from nltk.lm.preprocessing import flatten

In [16]:
try:  # Use the default NLTK tokenizer.
    from nltk import word_tokenize, sent_tokenize
    # Test whether it works; it can fail on some machines because of setup issues.
    word_tokenize(sent_tokenize("This is a foobar sentence. Yes it is.")[0])
except Exception:  # Fall back to a naive sentence tokenizer and toktok.
    import re
    from nltk.tokenize import ToktokTokenizer
    # See https://stackoverflow.com/a/25736515/610569
    sent_tokenize = lambda x: re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', x)
    # Use the toktok tokenizer, which requires no extra dependencies.
    toktok = ToktokTokenizer()
    word_tokenize = toktok.tokenize

In [17]:
import os
import requests
import io

# Read the 1,000-message sample written earlier. If that file is missing, fall back to
# a text version of https://kilgarriff.co.uk/Publications/2005-K-lineer.pdf
if os.path.isfile('1000sms.txt'):
    with io.open('1000sms.txt') as fin:
        text = fin.read()
else:
    url = "https://gist.githubusercontent.com/alvations/53b01e4076573fea47c6057120bb017a/raw/b01ff96a5f76848450e648f35da6497ca9454e4a/language-never-random.txt"
    text = requests.get(url).content.decode('utf8')
    with io.open('language-never-random.txt', 'w', encoding='utf8') as fout:
        fout.write(text)

In [18]:
# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent))) 
                  for sent in sent_tokenize(text)]

In [19]:
tokenized_text[0]

['bugis',
 'oso',
 'near',
 'wat',
 '...',
 'go',
 'until',
 'jurong',
 'point',
 ',',
 'crazy..',
 'available',
 'only',
 'in',
 'bugis',
 'n',
 'great',
 'world',
 'la',
 'e',
 'buffet',
 '...',
 'cine',
 'there',
 'g',
 '...',
 'i',
 'dunno',
 'until',
 'when',
 '...',
 'lets',
 'go',
 'learn',
 'pilates',
 '...',
 'den',
 'only',
 'weekdays',
 'got',
 'special',
 'price',
 '...',
 'haiz',
 '...',
 'cant',
 'eat',
 'liao',
 '...',
 'cut',
 'nails',
 'oso',
 'muz',
 'wait',
 'until',
 'i',
 'f',
 '...',
 'meet',
 'after',
 'lunch',
 'la',
 '...',
 'm',
 'walking',
 'in',
 'citylink',
 'now',
 'ü',
 'faster',
 'come',
 'down',
 '...',
 'me',
 'very',
 'hungry',
 '...',
 '5',
 'nights',
 '...',
 'we',
 'nt',
 'staying',
 'at',
 'port',
 'step',
 'liao',
 '...',
 'too',
 'ex',
 'hey',
 'pple',
 '...',
 '$',
 '700',
 'or',
 '$',
 '900',
 'for',
 '5',
 'nights',
 '...',
 'excellent',
 'location',
 'wif',
 'breakfast',
 'hamper',
 '!',
 '!',
 '!']
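
The first tokenized "sentence" above runs across several separate messages. One possible alternative, sketched here under the assumption that each line of the text file holds one message, is to treat every line as its own sentence. `simple_tokenize` is a hypothetical regex stand-in for `word_tokenize`, and `raw_text` is a small placeholder for the file contents.

```python
import re

# Placeholder for the file contents: one padded message per line.
raw_text = (
    "                 bugis oso near wat...\n"
    " i dunno until when... lets go learn pilates...\n"
    "                meet after lunch la...\n"
)


def simple_tokenize(message):
    """Hypothetical stand-in for word_tokenize: words, or runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", message.lower())


# One token list per non-empty line, so message boundaries are preserved.
per_message_tokens = [
    simple_tokenize(line) for line in raw_text.splitlines() if line.strip()
]
print(per_message_tokens[0])
```

Keeping one token list per message means the start and end padding added later would mark real message boundaries.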

In [20]:
print(text[:500])

                                                                               bugis oso near wat...
 go until jurong point, crazy.. available only in bugis n great world la e buffet... cine there g...
                                                      i dunno until when... lets go learn pilates...
 den only weekdays got special price... haiz... cant eat liao... cut nails oso muz wait until i f...
                                                                              meet after lunch l


In [21]:
from nltk.lm.preprocessing import padded_everygram_pipeline

# Preprocess the tokenized text for 3-gram language modelling
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)
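
To make the preprocessing step concrete, the sketch below imitates in plain Python what padding and everygram extraction produce for one short sentence. The two `*_sketch` functions are hypothetical illustrations, not NLTK's implementation, and the order of n-grams in NLTK's own output may differ.

```python
def pad_both_ends_sketch(tokens, n):
    """Add n-1 start and end symbols, as padding for order-n n-grams."""
    return ["<s>"] * (n - 1) + list(tokens) + ["</s>"] * (n - 1)


def everygrams_sketch(tokens, max_len):
    """Return all n-grams of length 1..max_len as tuples."""
    return [
        tuple(tokens[i:i + size])
        for size in range(1, max_len + 1)
        for i in range(len(tokens) - size + 1)
    ]


padded = pad_both_ends_sketch(["meet", "after", "lunch"], n=3)
grams = everygrams_sketch(padded, max_len=3)
print(padded)
print(len(grams))
```

For the seven padded tokens this gives 7 unigrams, 6 bigrams, and 5 trigrams, which is why the n-gram count grows much faster than the vocabulary.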

In [22]:
from nltk.lm import MLE
model = MLE(n) # Let's train a 3-gram model; previously we set n=3

In [23]:
len(model.vocab)

0

In [24]:
model.fit(train_data, padded_sents)
print(model.vocab)

<Vocabulary with cutoff=1 unk_label='<UNK>' and 1571 items>


In [25]:
len(model.vocab)

1571

In [26]:
print(model.vocab.lookup(tokenized_text[0]))

('bugis', 'oso', 'near', 'wat', '...', 'go', 'until', 'jurong', 'point', ',', 'crazy..', 'available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', '...', 'cine', 'there', 'g', '...', 'i', 'dunno', 'until', 'when', '...', 'lets', 'go', 'learn', 'pilates', '...', 'den', 'only', 'weekdays', 'got', 'special', 'price', '...', 'haiz', '...', 'cant', 'eat', 'liao', '...', 'cut', 'nails', 'oso', 'muz', 'wait', 'until', 'i', 'f', '...', 'meet', 'after', 'lunch', 'la', '...', 'm', 'walking', 'in', 'citylink', 'now', 'ü', 'faster', 'come', 'down', '...', 'me', 'very', 'hungry', '...', '5', 'nights', '...', 'we', 'nt', 'staying', 'at', 'port', 'step', 'liao', '...', 'too', 'ex', 'hey', 'pple', '...', '$', '700', 'or', '$', '900', 'for', '5', 'nights', '...', 'excellent', 'location', 'wif', 'breakfast', 'hamper', '!', '!', '!')


In [27]:
print(model.counts)

<NgramCounter with 3 ngram orders and 210849 ngrams>


In [28]:
model.counts['meh'] # i.e. Count('meh')

55

In [29]:
model.score('meh') # P('meh')

0.0007421000080956365
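
For a single word, the maximum likelihood score is the word's count divided by the total number of tokens counted, so the two outputs above imply a total of roughly 74,000 tokens (55 / 0.000742). The sketch below reproduces that arithmetic with `collections.Counter` on a made-up token list. The helper names are hypothetical and this is not NLTK's implementation.

```python
from collections import Counter

# Made-up tokens for illustration only; not taken from the corpus.
tokens = ["ok", "lor", "...", "ok", "meh", "..."]

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))


def unigram_score(word):
    """Relative frequency of a word among all tokens."""
    return unigram_counts[word] / sum(unigram_counts.values())


def bigram_score(word, context):
    """Count of (context, word) over the count of bigrams starting with context."""
    context_total = sum(
        count for (first, _), count in bigram_counts.items() if first == context
    )
    return bigram_counts[(context, word)] / context_total if context_total else 0.0


print(unigram_score("meh"))
print(bigram_score("lor", "ok"))
```

An unsmoothed estimate like this gives zero probability to any word or context it has not seen, which is worth keeping in mind when reading the generated text.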

In [30]:
print(model.generate(20, random_seed=321))

['<s>', '<s>', 'ok', 'i', 'wun', 'disturb', 'ü.', 'ok', 'lor', '...', 'm', 'free', '...', 'ok.', 'me', 'watching', 'tv', ',', 'hee..', 'lucky']


In [31]:
from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenize = TreebankWordDetokenizer().detokenize

def generate_sent(model, num_words, random_seed=42):
    """
    :param model: An ngram language model from `nltk.lm.model`.
    :param num_words: Max no. of words to generate.
    :param random_seed: Seed value for random.
    """
    content = []
    for token in model.generate(num_words, random_seed=random_seed):
        if token == '<s>':
            continue
        if token == '</s>':
            break
        content.append(token)
    return detokenize(content)

In [32]:
for i in range(321, 330):
    print(generate_sent(model, 20, random_seed=i))

ok i wun disturb ü. ok lor...m free...ok. me watching tv, hee.. lucky
tmr nite...did u receive my msg?
nvm take ur time.
i scared u dun miss me...aiyo...u free on sat right?
here also a bit lor.
2 eat in sch today...u jus ate honey ar?
things 2 do meh...huh u still come n fetch me already.
ll noe later n ask me so i considering...ü wan to go earlier...later got to go
so fast...cos i sms ü then i ask both of us lor.
