# Logistic Regression for Spam Detection

A logistic regression analysis for spam detection proceeds in the following steps.

## 1. Data Collection
- Gather a dataset of emails or text messages labeled as spam or not spam (ham).
- We use a dataset from Kaggle: https://www.kaggle.com/code/mfaisalqureshi/email-spam-detection-98-accuracy/input

## 2. Data Preprocessing (Optional)
- Preprocess the text data, including tasks such as punctuation removal, lowercase conversion, tokenization, stop word removal, and stemming/lemmatization.
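A minimal sketch of these preprocessing steps, using only the Python standard library. The function name `preprocess` and the tiny `STOP_WORDS` set are illustrative assumptions, not part of the notebook; stemming and lemmatization are left out because they usually need an extra library.

```python
import string

# Illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "to", "is", "in", "of", "and", "you", "your"}

def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    text = text.lower()
    # Delete every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = preprocess("Click on this link to re-activate your account.")
print(tokens)
```

Note that deleting punctuation merges "re-activate" into a single token; whether that is desirable depends on the data.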

## 3. Feature Extraction
- Convert preprocessed text data into numerical features suitable for analysis. Common techniques include Bag of Words (BoW) or TF-IDF (Term Frequency-Inverse Document Frequency) to represent text as vectors of word frequencies or scores.
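To make the two representations concrete, here is a hand-rolled sketch on a made-up three-message corpus. It uses the textbook weighting (raw count times log(N / document frequency)); scikit-learn's `TfidfVectorizer` applies a smoothed variant by default, so its numbers would differ. The names `docs`, `doc_freq`, and `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Made-up corpus for illustration only.
docs = [
    "free entry win cash",
    "are you home now",
    "win free cash now",
]
tokenized = [d.split() for d in docs]

# Bag of Words: one count vector per document over a shared, sorted vocabulary.
vocab = sorted({tok for doc in tokenized for tok in doc})
bow = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Document frequency: number of documents that contain each term.
n_docs = len(tokenized)
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocab}

def tfidf(doc):
    """Textbook TF-IDF: raw term count times log(N / document frequency)."""
    counts = Counter(doc)
    return {term: counts[term] * math.log(n_docs / doc_freq[term]) for term in counts}

weights = tfidf(tokenized[0])
print(vocab)
print(bow[0])
print(weights)
```

In the first message, "entry" receives a higher weight than "free" because it occurs in fewer messages overall.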

## 4. Splitting Data
- Divide the dataset into training and testing sets. The training set will be used to train the model, while the testing set will evaluate its performance.
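Spam datasets are usually imbalanced, so it can help to keep the spam ratio the same in both sets. The sketch below shows one way to do that with scikit-learn's `train_test_split` and its `stratify` argument; the toy `messages` and `labels` are made up, and the 25% test size and `random_state=0` are arbitrary choices.

```python
from sklearn.model_selection import train_test_split

# Toy data with 80% ham (0) and 20% spam (1); made up for illustration.
messages = [f"message {i}" for i in range(100)]
labels = [1 if i % 5 == 0 else 0 for i in range(100)]

X_train, X_test, y_train, y_test = train_test_split(
    messages,
    labels,
    test_size=0.25,     # hold out a quarter of the data for testing
    stratify=labels,    # preserve the spam/ham ratio in both splits
    random_state=0,     # fixed seed so the split is reproducible
)

print(len(X_train), len(X_test))
print(sum(y_train) / len(y_train), sum(y_test) / len(y_test))
```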

## 5. Model Training
- Train the logistic regression model on the training data. Fitting adjusts the model's weights to minimize the log loss between the predicted probabilities and the actual labels.
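As a rough illustration of what fitting means, the sketch below trains a logistic regression by plain batch gradient descent on the mean log loss. It is a teaching aid under stated assumptions, not how scikit-learn fits its model (`LogisticRegression` uses a different solver and applies L2 regularization by default). The function names, learning rate, epoch count, and toy features are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of the positive class (spam) for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean log loss, without regularization."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # derivative of log loss w.r.t. the logit
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Toy features: [count of "free", count of "meeting"]; label 1 = spam.
X = [[2, 0], [1, 0], [0, 1], [0, 2], [3, 0], [0, 3]]
y = [1, 1, 0, 0, 1, 0]

w, b = train(X, y)
preds = [int(predict_proba(w, b, xi) >= 0.5) for xi in X]
print(w, b, preds)
```

The learned weight for "free" comes out positive and the weight for "meeting" negative, which is what pushes predictions toward spam or ham.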

## 6. Model Evaluation
- Evaluate the performance of the trained model using the testing data. Common metrics include accuracy, precision, recall, F1 score, and ROC-AUC score.
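The threshold-based metrics all derive from the four confusion-matrix counts. The following sketch computes them by hand on made-up labels; `binary_metrics` is a hypothetical helper, and ROC-AUC is omitted because it needs predicted scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the positive class (1 = spam)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # spam correctly flagged
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # ham wrongly flagged
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # spam that slipped through
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # ham correctly passed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 6 ham and 4 spam messages.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For spam filtering, precision tells how many flagged messages were really spam, while recall tells how much of the spam was caught.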

## 7. Hyperparameter Tuning (Optional)
- Fine-tune the model's hyperparameters, such as regularization strength, using techniques like cross-validation to optimize performance.
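One possible way to tune the regularization strength is a cross-validated grid search over `C` (in scikit-learn, smaller `C` means stronger regularization). The sketch below assumes a toy corpus, a 3-fold split, F1 as the selection metric, and an arbitrary grid of candidate values; none of these choices come from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up corpus for illustration; 1 = spam, 0 = ham.
texts = [
    "win free cash now", "free entry to win a prize",
    "claim your free prize today", "urgent cash prize waiting",
    "free ringtone text now", "win a free holiday today",
    "are you home now", "see you at lunch",
    "call me when you arrive", "meeting moved to monday",
    "thanks for the notes", "can we talk tomorrow",
]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# "logreg__C" addresses the C parameter of the pipeline step named "logreg".
search = GridSearchCV(
    pipe,
    param_grid={"logreg__C": [0.01, 0.1, 1, 10]},
    cv=3,
    scoring="f1",
)
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```

Putting the vectorizer inside the pipeline means the vocabulary is rebuilt on each training fold, so the validation folds do not leak into feature extraction.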

## 8. Prediction
- Use the trained model to classify new, unseen emails or text messages as spam or not spam.



In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split as TTS
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report


We read the data and add a binary column "Spam" that is 1 for spam and 0 otherwise.

In [2]:
df = pd.read_csv("data/spam.csv")

# Encode the label: 1 for spam, 0 for ham
df['Spam'] = df['Category'].apply(lambda x: 1 if x == 'spam' else 0)

df.head()

|   | Category | Message | Spam |
|---|----------|---------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |


Check for missing values:

In [16]:
df.isna().sum()

Category    0
Message     0
Spam        0
dtype: int64

## Classifier Model Using Multinomial Naive Bayes

We first use multinomial Naive Bayes, because the features produced by CountVectorizer are discrete word counts.

In [10]:
# No random_state is set, so the split and the score vary between runs
X_train, X_test, y_train, y_test = TTS(df.Message, df.Spam, test_size=0.25)

clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

clf.fit(X_train, y_train)

clf.score(X_test, y_test)

0.9842067480258435

Now let's test our model on some manually written emails:

In [7]:
emails = [
    'Sounds great! Are you home now?',
    'Will u meet ur dream partner soon? Is ur career off 2 a flyng start? 2 find out free, txt HORO followed by ur star sign, e. g. HORO ARIES',
    'Click on this link to re-activate your account.'
    ]

clf.predict(emails)

array([0, 1, 1], dtype=int64)

Here we have tested a few sample emails. The first one does not look like spam, but the second and third clearly do. The resulting array [0, 1, 1] shows that the classifier (clf) labels the first email as ham and flags the other two as spam, as expected.

## LogisticRegression() Class

Using the LogisticRegression() class from scikit-learn's sklearn.linear_model module, we can train the model as follows:

In [15]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['Message'])
y = df['Spam']

X_train, X_test, y_train, y_test = TTS(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))



Accuracy: 0.9856502242152466
Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       966
           1       0.99      0.90      0.94       149

    accuracy                           0.99      1115
   macro avg       0.99      0.95      0.97      1115
weighted avg       0.99      0.99      0.99      1115
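Because the model above is fitted on the vectorized matrix, new messages have to pass through the same fitted vectorizer before prediction. One way to handle that, sketched here under assumptions, is to wrap both steps in a Pipeline as in the Naive Bayes section, so raw text can be classified directly. The training messages and the names `spam_clf`, `new_messages`, and `spam_probs` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up training data; 1 = spam, 0 = ham.
train_texts = [
    "win free cash prize now", "free entry win a prize",
    "claim your free cash today", "urgent win cash now",
    "see you at lunch tomorrow", "are you home now",
    "lunch at noon tomorrow", "call me when you are home",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

spam_clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("logreg", LogisticRegression()),
])
spam_clf.fit(train_texts, train_labels)

new_messages = ["win free cash now", "see you at lunch tomorrow"]

# Hard labels (0 or 1) and the estimated probability of the spam class.
pred_labels = spam_clf.predict(new_messages)
spam_probs = spam_clf.predict_proba(new_messages)[:, 1]

for msg, label, prob in zip(new_messages, pred_labels, spam_probs):
    print(f"{msg!r}: label={label}, P(spam)={prob:.2f}")
```

Using `predict_proba` instead of `predict` also makes it possible to pick a decision threshold other than 0.5, for example to trade recall on spam against false alarms on ham.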

