**Context**

The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 English SMS messages, each tagged as ham (legitimate) or spam.

**Content**

The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.


### spam.csv
In the given spam.csv file:

ham: a legitimate message

spam: a non-legitimate message

Here we build a machine learning model that works on a text dataset.

v1: the label, ham or spam

v2: the raw text of the message

### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import nltk
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report



from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re

In [2]:
# Raw string so the backslash in the path is not read as an escape sequence
data = pd.read_csv(r"E\spam.csv", encoding = 'latin-1')

In [3]:
data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


### Remove the unnecessary columns

The columns Unnamed: 2, Unnamed: 3 and Unnamed: 4 are not needed. Passing axis=1 to drop tells pandas to remove entire columns rather than rows.

In [4]:
data = data.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'],axis=1)
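As an alternative to dropping columns after loading, the two useful columns can be selected at read time. The sketch below assumes a small in-memory stand-in for spam.csv (the `sample_csv` text is hypothetical, not the real file) and uses the `usecols` parameter of `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for spam.csv, including the empty trailing columns.
sample_csv = io.StringIO(
    "v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4\n"
    "ham,Ok lar... Joking wif u oni...,,,\n"
    "spam,Free entry in 2 a wkly comp,,,\n"
    "ham,U dun say so early hor...,,,\n"
)

# usecols keeps only the named columns, so no drop() call is needed afterwards.
sample = pd.read_csv(sample_csv, usecols=["v1", "v2"])
print(sample.columns.tolist())
```

Either approach should leave the same two columns; `drop` is the more explicit choice when the full file has already been inspected.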

### Rename the columns v1 and v2

In [5]:
data.rename(columns = {"v1": "label", "v2":"Message"}, inplace = True) 

In [6]:
data.head()

Unnamed: 0,label,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


### Handling Categorical Data

The model cannot work with the categorical string values ham and spam directly, so we convert them to 0 and 1 using `pd.get_dummies`.

If we assigned integer codes by hand instead, the model could read the codes as a ranking, with labels set to 1 treated as high priority and labels set to 0 as low priority. We do not want that. We want our model to have an unbiased understanding of our labels.

In [7]:
data = pd.get_dummies(data, columns=['label'])

In [8]:
data.head()

Unnamed: 0,Message,label_ham,label_spam
0,"Go until jurong point, crazy.. Available only ...",1,0
1,Ok lar... Joking wif u oni...,1,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,0,1
3,U dun say so early hor... U c already then say...,1,0
4,"Nah I don't think he goes to usf, he lives aro...",1,0


label_ham is 1 when the message is ham and 0 otherwise.

Likewise, label_spam is 1 when the message is spam and 0 otherwise.
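Because every message is either ham or spam, the two indicator columns are complements of each other. A minimal sketch, using a hypothetical four-row `labels` frame rather than the real data, shows this and derives a single 0/1 target column. Passing `dtype=int` is an assumption made here to reproduce the 0/1 output shown above, since recent pandas versions return booleans from `get_dummies` by default.

```python
import pandas as pd

# Hypothetical labels standing in for the label column.
labels = pd.DataFrame({"label": ["ham", "ham", "spam", "ham"]})

# dtype=int forces 0/1 output; recent pandas versions return booleans by default.
dummies = pd.get_dummies(labels, columns=["label"], dtype=int)

# The two indicator columns always sum to 1, so one of them is redundant.
complementary = (dummies["label_ham"] + dummies["label_spam"] == 1).all()

# A single target column: 1 for spam, 0 for ham.
is_spam = (labels["label"] == "spam").astype(int)
print(complementary, is_spam.tolist())
```

For a binary label, one of the two columns is usually enough as the training target.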

In [9]:
# Total ham(1) and spam(0) messages
data['label_ham'].value_counts()

1    4825
0     747
Name: label_ham, dtype: int64
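The counts above show that ham messages heavily outnumber spam. To see the class balance as proportions rather than raw counts, `value_counts` accepts `normalize=True`. The sketch below uses a hypothetical four-value series in place of the real `label_ham` column.

```python
import pandas as pd

# Hypothetical indicator values standing in for data['label_ham'].
label_ham = pd.Series([1, 1, 0, 1], name="label_ham")

# normalize=True returns the share of each value instead of its count.
shares = label_ham.value_counts(normalize=True)
print(shares)
```

On the real data this would correspond to 4825 of 5572 messages being ham, which matters later when interpreting accuracy on an imbalanced dataset.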

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Message     5572 non-null   object
 1   label_ham   5572 non-null   uint8 
 2   label_spam  5572 non-null   uint8 
dtypes: object(1), uint8(2)
memory usage: 32.7+ KB


### Check the length of each message in characters

In [12]:
# Store the length of each message, in characters, in a new Count column.
# Series.str.len() computes every length at once, so no explicit loop over
# the rows is needed.

data['Count'] = data['Message'].str.len()

In [13]:
data.head()

Unnamed: 0,Message,label_ham,label_spam,Count
0,"Go until jurong point, crazy.. Available only ...",1,0,111
1,Ok lar... Joking wif u oni...,1,0,29
2,Free entry in 2 a wkly comp to win FA Cup fina...,0,1,155
3,U dun say so early hor... U c already then say...,1,0,49
4,"Nah I don't think he goes to usf, he lives aro...",1,0,61
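The Count column holds the number of characters in each message. If a word count is wanted as well, one possible sketch splits each message on whitespace and counts the pieces. The `messages` frame and the `Words` column name are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical two-row stand-in for the Message column.
messages = pd.DataFrame(
    {"Message": ["Ok lar... Joking wif u oni...", "Call me later"]}
)

# Character length, as in the Count column.
messages["Count"] = messages["Message"].str.len()

# Word count: split on whitespace, then take the length of each resulting list.
messages["Words"] = messages["Message"].str.split().str.len()
print(messages)
```

Splitting on whitespace is a rough notion of a word; punctuation stays attached to its token.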


describe() gives no details about the Message column because it is not numeric.

In [14]:
data.describe()

Unnamed: 0,label_ham,label_spam,Count
count,5572.0,5572.0,5572.0
mean,0.865937,0.134063,80.118808
std,0.340751,0.340751,59.690841
min,0.0,0.0,2.0
25%,1.0,0.0,36.0
50%,1.0,0.0,61.0
75%,1.0,0.0,121.0
max,1.0,1.0,910.0
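To summarize the text column as well, `describe()` can be called on a frame that contains only that column, in which case pandas reports count, unique, top and freq instead of numeric statistics. The sketch below uses a hypothetical three-row `frame` rather than the real data.

```python
import pandas as pd

# Hypothetical stand-in with one text column and one numeric column.
frame = pd.DataFrame(
    {
        "Message": ["Ok lar...", "Call me later", "Ok lar..."],
        "Count": [9, 13, 9],
    }
)

# Selecting only the text column makes describe() report
# count, unique, top (most frequent value) and freq (its frequency).
text_summary = frame[["Message"]].describe()
print(text_summary)
```

On the real dataset, a `unique` value below `count` would indicate duplicated messages.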
