<h1>To build an <b>AI model</b> that can classify <b>SMS messages</b> as <b>spam</b> or <b>legitimate</b>, we'll follow these <b>steps</b>:</h1>

<h2><u><b>Note</b></u>: Use the <b>code</b> below only for <b>importing</b> an <b>external</b> file into the <b>local</b> environment.</h2>

In [288]:
# from google.colab import files
# uploaded = files.upload()  # returns a dict of filename -> file bytes, not a DataFrame

<h2>Please do not <b>misuse</b> the <b>code</b> above 👆 😉</h2>

<hr><hr>

<h2>Initializing the Data</h2>



In [289]:
import os

files=os.listdir('/content/')
print(files)

['.config', 'spam.csv', 'sample_data']


In [290]:
import pandas as pd
df=pd.read_csv('/content/spam.csv', encoding='ISO-8859-1')
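
The `encoding='ISO-8859-1'` argument is presumably there because the file contains bytes that are not valid UTF-8, which is the default encoding for `pd.read_csv`. A minimal sketch of the difference, using a made-up byte string rather than the real file (the sample text is an assumption for illustration):

```python
# Hypothetical sample bytes: 0xA3 is the pound sign in ISO-8859-1,
# but a lone 0xA3 byte is not valid UTF-8.
raw = b"Win \xa3100 now"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ISO-8859-1 maps every possible byte to a character, so decoding cannot fail.
latin1_text = raw.decode("ISO-8859-1")

print(utf8_ok)
print(latin1_text)
```

ISO-8859-1 never raises a decoding error, but it can silently produce wrong characters when the file was written in a different encoding.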

In [291]:
df

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,


<hr><hr>

# 1.   Data Preparation:



In [292]:
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


<hr>

<h3><b>Clean</b> the <b>data</b> by removing <b>unnecessary</b> columns.</h3>


In [293]:
df.drop(columns=['Unnamed: 2','Unnamed: 3','Unnamed: 4'], inplace=True)
df

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...
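
A variant of the same cleanup can be sketched on a toy frame: select every column whose name starts with `Unnamed` instead of listing them by hand, and return a new frame rather than modifying in place. The toy values and the new column names `label` and `text` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy frame mimicking the layout of spam.csv (values are made up).
toy = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["see you at 5", "WIN a free prize now"],
    "Unnamed: 2": [None, None],
    "Unnamed: 3": [None, None],
})

# Collect the placeholder columns by name prefix, drop them,
# and give the remaining columns descriptive (hypothetical) names.
unnamed = [c for c in toy.columns if c.startswith("Unnamed")]
clean = toy.drop(columns=unnamed).rename(columns={"v1": "label", "v2": "text"})

print(list(clean.columns))
```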


In [294]:
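# Keep a copy of the original string labels ('ham'/'spam'); it is used later as the target in train_test_split.
df_copy = df['v1'].copy()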

<hr>

<h3><b>Convert</b> the <b>labels</b> ("<u><b>ham</b></u>" and "<u><b>spam</b></u>") into <b>numerical</b> format.</h3>



In [295]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   v1      5572 non-null   object
 1   v2      5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


<hr>

# To do this, we first have to <b>convert</b> the <b>categorical data</b> to <b>numerical data</b>.

<h2>Converting <b>categorical data</b> to <b>numerical data</b> using the <b>pandas</b> <b>get_dummies()</b> function.</h2>



In [296]:
encoded = pd.get_dummies(df['v1'])
encoded

Unnamed: 0,ham,spam
0,True,False
1,True,False
2,False,True
3,True,False
4,True,False
...,...,...
5567,False,True
5568,True,False
5569,True,False
5570,True,False
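
Since the `ham` and `spam` columns are exact complements of each other, one of them carries all the information. A minimal sketch, on a made-up three-row series, of reducing the one-hot output to a single 0/1 column (the names `one_hot` and `is_spam` are assumptions for illustration):

```python
import pandas as pd

labels = pd.Series(["ham", "spam", "ham"], name="v1")

# One column per category; each row has exactly one truthy entry.
one_hot = pd.get_dummies(labels)

# Keep only the "spam" indicator and cast it to integers (1 = spam, 0 = ham).
is_spam = one_hot["spam"].astype(int)

print(list(one_hot.columns))
print(is_spam.tolist())
```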


<hr>

<h2><b>Replacing</b> the <b>original</b>, unencoded label column with its <b>numerical</b> encoding.</h2>

In [297]:
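# Assign the result back rather than using inplace=True on a column selection; recent pandas versions warn about that pattern.
df['v1'] = df['v1'].map({'ham': 0, 'spam': 1})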
df

Unnamed: 0,v1,v2
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,1,This is the 2nd time we have tried 2 contact u...
5568,0,Will Ì_ b going to esplanade fr home?
5569,0,"Pity, * was in mood for that. So...any other s..."
5570,0,The guy did some bitching but I acted like i'd...


<hr><hr>

# <b>Split</b> the <b>data</b> into <b>training</b> and <b>testing</b> sets, then vectorize the text with <b>TF-IDF</b>.

In [298]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_test, y_train, y_test = train_test_split(df['v2'], df_copy, test_size=0.2,random_state=42)
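
`train_test_split` shuffles the rows before splitting, so the share of spam in the test set can drift from its share in the full data. A minimal sketch, on made-up messages rather than the real dataset, of the `stratify` argument, which keeps the ham/spam ratio the same in both parts. The toy data and variable names are assumptions; the notebook itself does not stratify.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up data: 20 messages, 20% of them spam.
texts = [f"message {i}" for i in range(20)]
labels = ["spam"] * 4 + ["ham"] * 16

# stratify=labels preserves the class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

print(Counter(y_tr))
print(Counter(y_te))
```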

In [299]:
tfidf = TfidfVectorizer(max_features=5000)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
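
The vectorizer is fitted on the training text only and then reused on the test text, so both matrices share the same columns and nothing is learned from the test set. A minimal sketch of that behaviour on made-up sentences (the documents and variable names are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["free prize now", "see you at home"]
test_docs = ["free lottery ticket"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them; learns nothing new

vocab = sorted(vec.vocabulary_)
print(vocab)
print(train_matrix.shape, test_matrix.shape)
# Words that never appeared in the training text ("lottery", "ticket")
# have no column, so only "free" contributes to the test row.
print(test_matrix.nnz)
```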

<hr><hr>

# <b>Train</b> the <b>model</b> on the <b>preprocessed</b> data: the <b>training</b> and <b>testing</b> sets are <b>consistent</b> because the same fitted TF-IDF vectorizer transforms both.

In [300]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

model= LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_tfidf, y_train)
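
The vectorizer and the classifier can also be bundled into one object, so raw text goes in and labels come out without a separate transform step. A minimal sketch using scikit-learn's `make_pipeline` on four made-up messages; the toy data and the name `spam_clf` are assumptions, and this is an alternative arrangement rather than what the notebook does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training messages with string labels.
texts = [
    "win a free prize now",
    "free cash offer, call now",
    "are we still meeting for lunch",
    "see you at home tonight",
]
labels = ["spam", "spam", "ham", "ham"]

# fit() runs TF-IDF fit_transform and then trains the classifier;
# predict() runs TF-IDF transform and then the classifier.
spam_clf = make_pipeline(
    TfidfVectorizer(max_features=5000),
    LogisticRegression(max_iter=1000, random_state=42),
)
spam_clf.fit(texts, labels)

pred_demo = spam_clf.predict(["free prize waiting for you"])
print(pred_demo)
```

Keeping both steps in one pipeline makes it harder to fit the vectorizer on test data by accident.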

<hr><hr>

# <b>Test</b> the <b>Model</b>

In [301]:
pred = model.predict(X_test_tfidf)

print(f'Total ({len(pred)}) Messages Classified.')

Total (1115) Messages Classified.
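
Note that `len(pred)` counts every message in the test set, whatever label it received. Because the split above uses `df_copy`, which holds the original string labels, the predictions should be `'ham'`/`'spam'` strings, and the number flagged as spam comes from comparing against the label. A minimal sketch on a made-up prediction array (the values and names are assumptions for illustration):

```python
import numpy as np

# Made-up predictions standing in for the output of model.predict(...).
pred_demo = np.array(["ham", "spam", "ham", "ham", "spam"])

n_total = len(pred_demo)                   # all classified messages
n_spam = int((pred_demo == "spam").sum())  # only those labelled spam

print(f"{n_spam} of {n_total} messages flagged as spam")
```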


<hr><hr>

# Check the <b>Accuracy</b> of the <b>Model</b>

In [302]:
print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")

Accuracy Score: 96.77%
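
Accuracy alone can look good even when spam is missed, if ham messages dominate the data as the preview above suggests. A minimal sketch on made-up labels, where a classifier that always answers "ham" still reaches 80% accuracy while catching no spam at all (all values here are assumptions for illustration, not results from the notebook):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up ground truth: 8 ham, 2 spam.
y_true = ["ham"] * 8 + ["spam"] * 2
# A degenerate classifier that labels everything as ham.
y_pred = ["ham"] * 10

acc = accuracy_score(y_true, y_pred)
# Recall for the spam class: the share of real spam that was caught.
spam_recall = recall_score(y_true, y_pred, pos_label="spam")
# Rows are true labels, columns are predicted labels, in the order given.
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])

print(acc, spam_recall)
print(cm)
```

The `classification_report` function imported earlier prints precision, recall and F1 per class in one table, which is a convenient way to check the same thing on the real predictions.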


<hr><hr>

  <a>
    <img src="https://github.com/user-attachments/assets/08ed2cbc-1be6-4690-a6c1-49a98a9787d6" width="600" height="300" />
  </a><br>