## Natural Language Processing (NLP)

### Import Data
First, we need some data to work with. For this purpose, we can use the public [SMS Spam Collection Data Set](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) from the UCI Machine Learning Repository.

After downloading and unzipping the data, you will find two files:
- a readme file with information about the data set
- the SMSSpamCollection file, which contains the data

In summary, the data set contains 5,574 messages collected for SMS spam research: 4,827 legitimate (ham) messages (86.6%) and 747 spam messages (13.4%).
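
The class split quoted above can be reproduced with a quick calculation, which also makes the imbalance between the two classes explicit. This is only an arithmetic check on the two counts given in the text; the variable names are chosen here for illustration.

```python
# Counts as stated in the data set summary above
ham_count = 4827   # legitimate messages
spam_count = 747   # spam messages

total = ham_count + spam_count

# Share of each class in percent, rounded to one decimal place
ham_share = round(100 * ham_count / total, 1)
spam_share = round(100 * spam_count / total, 1)

print(total, ham_share, spam_share)
```

With roughly six ham messages for every spam message, the classes are clearly imbalanced, which is worth keeping in mind when a model is evaluated later.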

The data file contains one message per line. Each line consists of two columns: the label (ham or spam) and the raw text.

In [6]:
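# Import messages from file
# (read as UTF-8 so that characters such as £ are decoded correctly,
# and use a context manager so the file is closed afterwards)
with open('smsspamcollection/SMSSpamCollection', encoding='utf-8') as f:
    messages = [line.rstrip() for line in f]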

# Show the first 10 messages
for num, message in enumerate(messages[:10]):
    print(num, message)

0 ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
1 ham	Ok lar... Joking wif u oni...
2 spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
3 ham	U dun say so early hor... U c already then say...
4 ham	Nah I don't think he goes to usf, he lives around here though
5 spam	FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
6 ham	Even my brother is not like to speak with me. They treat me like aids patent.
7 ham	As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
8 spam	WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 1

As the output above shows, this is a TSV (_tab-separated values_) file, and each message is labeled _ham_ or _spam_, corresponding to _normal_ and _spam_ messages, respectively.
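
Because every line has the layout label, tab, text, the raw lines can be split into their two parts with plain Python. The following is a minimal sketch, assuming the tab appears exactly once as the separator between label and text; `sample_lines` is a hand-made list (abbreviated from the output above) and `parse_line` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter

# Hypothetical sample lines in the label<TAB>text layout
sample_lines = [
    "ham\tOk lar... Joking wif u oni...",
    "spam\tWINNER!! As a valued network customer you have been selected",
    "ham\tNah I don't think he goes to usf, he lives around here though",
]

def parse_line(line):
    """Split one raw line into (label, text) at the first tab only."""
    label, text = line.split("\t", 1)
    return label, text

records = [parse_line(line) for line in sample_lines]

# Count how often each label occurs
label_counts = Counter(label for label, _ in records)

print(label_counts)
```

Splitting with a maximum of one split keeps any further tabs inside the message text intact.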

The main goal of the rest of this article is to set up a machine learning model that identifies _ham_ and _spam_ messages on its own. It will be a supervised method, so we will use part of the messages for training.

But before that, we will do some analysis on the messages. For convenience, we load the data into a pandas DataFrame.

In [7]:
# import pandas library
import pandas as pd 

df_messages = pd.read_csv('smsspamcollection/SMSSpamCollection',
                        sep='\t', names=['label','message'])

In [8]:
# Check the dataframe
df_messages

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


As you can see, there are 5572 messages in our DataFrame. Next, it would be useful to check some statistics by plotting them.
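
The 5572 rows are two fewer than the 4,827 + 747 = 5,574 messages quoted in the data set summary. One possible cause, which this article does not verify, is quote handling: by default `pd.read_csv` treats `"` as a quoting character, so a message that opens a quote without closing it can absorb the following line. The sketch below reproduces the effect on a made-up three-line sample using Python's standard `csv` module, which follows the same quoting convention. The sample text and the `count_rows` helper are hypothetical.

```python
import csv
import io

# Hypothetical three-line sample in the label<TAB>text layout.
# The first message opens a double quote that is only closed on line two.
sample = 'ham\t"Hello there\nspam\tWin a prize now"\nham\tBye\n'

def count_rows(text, quoting):
    """Count the records csv.reader yields for tab-separated text."""
    reader = csv.reader(io.StringIO(text), delimiter='\t', quoting=quoting)
    return sum(1 for _ in reader)

# Default rules: the quoted field swallows the line break and the next line
rows_default = count_rows(sample, csv.QUOTE_MINIMAL)

# Quotes treated as ordinary characters: one record per physical line
rows_raw = count_rows(sample, csv.QUOTE_NONE)

print(rows_default, rows_raw)
```

If the same thing is happening with the real file, passing `quoting=csv.QUOTE_NONE` to `pd.read_csv` should bring the row count back in line with the readme. That is an assumption worth checking against your own download.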