# 1. Introduction to the dataset
## Context
* The **SMS Spam Collection** is a set of tagged SMS messages collected for SMS spam research.
* It contains **5,574** SMS messages in English, each tagged as **ham** (legitimate) or **spam**.

## Content
* The files contain one message per line.
* Each line is composed of two columns:
 * v1 contains the label (ham or spam) and
 * v2 contains the raw text.
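
As a rough illustration of this layout, the sketch below parses a few lines by hand. It assumes the two columns are separated by a tab, which is the separator the loading code in this notebook passes to pandas; the sample lines are shortened and purely illustrative.

```python
from collections import Counter

# Illustrative lines in the assumed "label<TAB>message" layout.
lines = [
    "ham\tOk lar... Joking wif u oni...",
    "spam\tFree entry in 2 a wkly comp",
    "ham\tU dun say so early hor...",
]

# Split each line once on the first tab: label on the left, raw text on the right.
rows = [tuple(line.split("\t", 1)) for line in lines]
label_counts = Counter(label for label, _ in rows)

print(rows[0])
print(label_counts)
```

Splitting only on the first tab keeps any later tab characters inside the message text.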




**Thanks to:**
* [SMS Spam Collection](https://archive.ics.uci.edu/dataset/228/sms+spam+collection)

# 2. Import libraries
- To build a **spam classifier**, we first need to perform **text preprocessing**.
- The libraries and tools imported below cover:
  - data manipulation,
  - text cleaning,
  - stemming,
  - lemmatization, and
  - removal of common stopwords.

These steps help to prepare raw text data for machine learning models, improving their accuracy and performance.

In [1]:
# provides tools for text preprocessing, tokenization, stemming, lemmatization, stopwords removal, and more.
import nltk

# Helps in loading, analyzing, and manipulating the dataset efficiently. It is typically used to read the data, handle missing values, and structure the data in a tabular format, particularly when working with structured data like CSV files.
import pandas as pd

# for cleaning text data, such as removing special characters, punctuations, and other unwanted patterns from the text.
import re

# simplify words in text data to their base form, which helps in reducing the dimensionality of text data and improving the performance of the model.
from nltk.stem import PorterStemmer

# Converts words to their dictionary form (lemma), e.g. "better" to "good" when the word is tagged as an adjective. It relies on a vocabulary lookup rather than suffix stripping, so it usually returns real words, unlike stemming.
from nltk.stem import WordNetLemmatizer

# removing the frequently occurring words (e.g., "the", "and", "is") that do not contribute significantly to the classification task, thereby improving model efficiency and accuracy.
from nltk.corpus import stopwords

# 3. Load the Dataset

In [2]:
# URL of the dataset file hosted on GitHub (raw path to the file inside the repository).
url_sms ='https://raw.githubusercontent.com/akdubey2k/NLP/main/Spam_Classifier/SMSSpamCollection.csv'

# 4. Dataset EDA (Exploratory Data Analysis)
Read the data from the **GitHub** repository using **pandas**.

In [3]:
# url_new="https://raw.githubusercontent.com/akdubey2k/NLP/main/Spam_Classifier/SpamCollection.csv"

# Load the file from the URL into a pandas DataFrame. No header option is passed here, so pandas uses the first row as the column names.
df = pd.read_csv(url_sms, sep='\t')
df.head()

| Unnamed: 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... |
|---|---|---|
| 0 | ham | Ok lar... Joking wif u oni... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | U dun say so early hor... U c already then say... |
| 3 | ham | Nah I don't think he goes to usf, he lives aro... |
| 4 | spam | FreeMsg Hey there darling it's been 3 week's n... |


#### Create a **pandas** DataFrame with the columns named **label** and **message** (a default integer **index** is added automatically)

The first column contains the label (ham for legitimate messages, spam for spam).
The second column contains the message text.

In [4]:
# Reuse the URL defined earlier (url_sms); naming the columns also keeps the first row as data
df=pd.read_csv(url_sms, sep='\t', names=['label', 'message'])
df.head()
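
To make the effect of `names=` concrete, here is a small sketch on an in-memory sample instead of the real file. The two rows are shortened, illustrative messages; the point is only that `pd.read_csv` treats the first row as a header by default, while passing `names` keeps every row as data.

```python
import io
import pandas as pd

# Two illustrative rows in the tab-separated layout (label <TAB> message).
sample = "ham\tOk lar... Joking wif u oni...\nspam\tFree entry in 2 a wkly comp\n"

# Default behaviour: the first row is consumed as the header, so one message is lost.
with_default = pd.read_csv(io.StringIO(sample), sep="\t")

# With names=..., every row stays in the data and the columns get readable names.
with_names = pd.read_csv(io.StringIO(sample), sep="\t", names=["label", "message"])

print(len(with_default), list(with_default.columns))
print(len(with_names), list(with_names.columns))
```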

#### Create the **stemming** and **lemmatization** objects

In [None]:
ps=PorterStemmer()
wnl=WordNetLemmatizer()

#### Data **Cleaning** and **preprocessing**

In [None]:
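nltk.download('stopwords')  # resources needed below; skipped if already present
nltk.download('wordnet')
stop_words=set(stopwords.words('english'))  # build the set once instead of once per word

corpus=[]
for i in range(0, len(df)):
  # keep letters only, then lowercase and split into words
  data=re.sub('[^a-zA-Z]', ' ', df['message'][i])
  data=data.lower()
  data=data.split()
  #data=[ps.stem(word) for word in data if word not in stop_words]
  # lemmatize every word that is not a stopword
  data=[wnl.lemmatize(word) for word in data if word not in stop_words]
  data=' '.join(data)
  corpus.append(data)

# print(corpus)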
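
To see what the cleaning step does to a single message, the following sketch applies the same regular expression, lowercasing, and stopword filter without NLTK. The stopword set here is a deliberately tiny, hypothetical stand-in for NLTK's English list, and lemmatization is left out, so the result is only an approximation of what the loop produces.

```python
import re

# Hypothetical, deliberately tiny stopword set; the notebook uses NLTK's English list.
STOP_WORDS = {"a", "in", "to", "the", "is", "and"}

def clean_message(text, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, split on whitespace, drop stopwords."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in stop_words)

cleaned = clean_message("Free entry in 2 a wkly comp to win FA Cup!")
print(cleaned)  # free entry wkly comp win fa cup
```

Digits and punctuation become spaces, which is why "2" and "!" disappear before the stopword filter runs.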

#### **Bag of words** creation for spam classification

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(max_features=2500)
X=cv.fit_transform(corpus).toarray()
X
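
Conceptually, a bag-of-words matrix has one row per message and one column per vocabulary word, and each cell holds a word count. The sketch below builds such a matrix by hand on a toy corpus. It is a simplified illustration, not a reimplementation of `CountVectorizer`, which applies its own tokenization and tie-breaking rules; the helper name `bag_of_words` is made up for this example.

```python
from collections import Counter

def bag_of_words(corpus, max_features):
    """Return (vocabulary, count matrix) for a list of cleaned strings."""
    docs = [doc.split() for doc in corpus]
    totals = Counter(tok for doc in docs for tok in doc)
    # Keep the most frequent tokens, then sort them so the column order is stable.
    vocab = sorted(tok for tok, _ in totals.most_common(max_features))
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        matrix.append([counts[tok] for tok in vocab])
    return vocab, matrix

vocab, matrix = bag_of_words(["free entry win", "ok win", "free free cup"], max_features=2)
print(vocab)   # ['free', 'win']
print(matrix)  # [[1, 1], [0, 1], [2, 0]]
```

Limiting the vocabulary, as `max_features=2500` does in the cell above, drops rare words and keeps the matrix narrower.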

#### Create **dummy** variables with **pandas** for the **output/dependent** variable (**label** column)

In [None]:
y=pd.get_dummies(df['label'])
y

#### As there are only two dummy variables, we do not need both columns. Keeping a single column avoids the **dummy variable trap**.

In [None]:
y=y.iloc[:,1].values # all rows, second column only (the spam indicator), as a NumPy array
y
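
The reason one column is enough can be checked on a handful of labels. In this small sketch (plain Python lists with made-up labels, no pandas), the two dummy columns always sum to 1, so the ham column carries no information beyond the spam column.

```python
labels = ["ham", "spam", "ham", "ham", "spam"]

# One-hot ("dummy") columns: exactly one of the two is 1 for every row.
ham_col = [int(label == "ham") for label in labels]
spam_col = [int(label == "spam") for label in labels]

# Because the columns are complementary, the spam column alone is enough.
y_single = spam_col
print(y_single)  # [0, 1, 0, 0, 1]
```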

#### Split the data into **training** and **test** set, using **sklearn**
X (fit_transform(corpus)) => independent variable and

y (single dummy variable) => dependent variable

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.20, random_state=0)
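
For intuition about `test_size` and `random_state`, the sketch below shows one simple way to do a reproducible shuffled split in plain Python. The helper `simple_split` is hypothetical and is not how scikit-learn implements `train_test_split`; it only illustrates that a fixed seed gives the same split every time and that features and labels stay aligned.

```python
import math
import random

def simple_split(X, y, test_size=0.20, seed=0):
    """Shuffle row indices with a fixed seed, then cut off the test share."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(rows, idx):
        return [rows[i] for i in idx]

    return pick(X, train_idx), pick(X, test_idx), pick(y, train_idx), pick(y, test_idx)

# Ten toy rows; each label is derived from its row so alignment can be checked.
X_toy = [[i] for i in range(10)]
y_toy = [i % 2 for i in range(10)]
X_tr, X_te, y_tr, y_te = simple_split(X_toy, y_toy)
print(len(X_tr), len(X_te))
```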

#### Classification (Training) model using **Naive Bayes Classifier**


In [None]:
from sklearn.naive_bayes import MultinomialNB
mtnb=MultinomialNB() # created here; fitted on the training split only, so the test set stays unseen

In [None]:
mtnb=mtnb.fit(X_train, y_train)
y_pred=mtnb.predict(X_test)
y_pred
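
As a sketch of what a multinomial Naive Bayes classifier computes, the code below estimates class priors and smoothed word probabilities from count vectors and scores a new vector in log space. It is a minimal illustration with add-one (Laplace) smoothing on toy data, under the assumption of a made-up three-word vocabulary; the function names are hypothetical and this is not scikit-learn's implementation.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each word in class c, plus alpha for smoothing.
        word_totals = [sum(row[j] for row in rows) + alpha for j in range(n_features)]
        denom = sum(word_totals)
        log_likelihood[c] = [math.log(t / denom) for t in word_totals]
    return classes, log_prior, log_likelihood

def predict_multinomial_nb(model, x):
    """Pick the class with the highest log posterior score for one count vector."""
    classes, log_prior, log_likelihood = model

    def score(c):
        return log_prior[c] + sum(n * lw for n, lw in zip(x, log_likelihood[c]))

    return max(classes, key=score)

# Toy count vectors over the hypothetical vocabulary ["free", "ok", "win"].
X_toy = [[2, 0, 1], [0, 2, 0], [1, 0, 2], [0, 1, 0]]
y_toy = [1, 0, 1, 0]  # 1 = spam, 0 = ham
model = fit_multinomial_nb(X_toy, y_toy)
pred_spam = predict_multinomial_nb(model, [1, 0, 1])
pred_ham = predict_multinomial_nb(model, [0, 1, 0])
print(pred_spam, pred_ham)  # 1 0
```

Smoothing keeps a word that never appeared in one class from forcing that class's probability to zero.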

#### Compare the **y_test** and **y_pred** values using a **Confusion Matrix**

In [None]:
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test, y_pred) # 2 x 2 matrix: rows are actual classes, columns are predicted classes
cm
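
To read the matrix, it helps to count the four cells by hand once. The sketch below does this for 0/1 labels using the same layout as scikit-learn's `confusion_matrix`, that is `[[TN, FP], [FN, TP]]` with 1 standing for spam; the helper name and the toy labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for 0/1 labels (rows: actual, columns: predicted)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

y_true_toy = [0, 0, 1, 1, 0, 1]
y_pred_toy = [0, 1, 1, 0, 0, 1]
cm_toy = confusion_counts(y_true_toy, y_pred_toy)
print(cm_toy)  # [[2, 1], [1, 2]]
```

Here one ham message was flagged as spam (top right) and one spam message slipped through as ham (bottom left).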

#### Check the accuracy score of the model

In [None]:
from sklearn.metrics import accuracy_score
ac=accuracy_score(y_test, y_pred)
ac
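
Accuracy is the share of predictions that land on the diagonal of the confusion matrix. The short sketch below computes it from hypothetical counts, not from this notebook's results; the helper name is an assumption made for the example.

```python
def accuracy_from_matrix(cm):
    """Share of predictions on the diagonal of a square confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

# Hypothetical counts in the layout [[TN, FP], [FN, TP]].
acc_toy = accuracy_from_matrix([[2, 1], [1, 2]])
print(round(acc_toy, 3))  # 0.667
```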