# **Handling Categorical Values**

In [3]:
import pandas as pd

# Create a small dataframe with 'Age' and 'Gender' columns
data = {
    'Age': [25, 30, 35, 40, 28],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female']  # Categorical column with 'Female'/'Male' values
}

df = pd.DataFrame(data)
df

Unnamed: 0,Age,Gender
0,25,Female
1,30,Male
2,35,Male
3,40,Male
4,28,Female


In [4]:
def convert(x):
    # Encode 'Male' as 0 and any other value (here, 'Female') as 1
    if x == 'Male':
        return 0
    return 1

df['Gender'] = df['Gender'].apply(convert)
df


Unnamed: 0,Age,Gender
0,25,1
1,30,0
2,35,0
3,40,0
4,28,1
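
The same encoding can also be written with an explicit mapping, which keeps the codes visible in one place. A minimal sketch, assuming the same two categories as above; the `gender_codes` name is illustrative and not part of the original notebook. Unlike `convert`, `Series.map` leaves any value missing from the dictionary as NaN instead of turning it into 1.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 28],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female'],
})

# Illustrative mapping (assumed name): same codes as convert() above
gender_codes = {'Male': 0, 'Female': 1}

# Values not present in the dict would become NaN rather than 1
df['Gender'] = df['Gender'].map(gender_codes)
print(df['Gender'].tolist())
```

Because unmapped values surface as NaN, a typo or an unexpected category in the column is easier to spot than with the `else` branch of `convert`.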


# **Handling Textual Data**

In [5]:
data = {
    'text': [
        'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)',
        'Nah I don’t think he goes to usf, he lives around here though',
        'WINNER!! As a valued network customer you have been selected to receivea £900 prize reward!',
        'Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free!',
        'I’m gonna be home soon and i don’t want to talk about this stuff anymore tonight, k? I’ve cried enough today.',
    ],
    'label': ['spam', 'not spam', 'spam', 'spam', 'not spam']
}

# Convert to DataFrame
df = pd.DataFrame(data)
df


Unnamed: 0,text,label
0,Free entry in 2 a wkly comp to win FA Cup fina...,spam
1,"Nah I don’t think he goes to usf, he lives aro...",not spam
2,WINNER!! As a valued network customer you have...,spam
3,Had your mobile 11 months or more? U R entitle...,spam
4,I’m gonna be home soon and i don’t want to tal...,not spam


# **Bag of Words Model**



*   The Bag of Words model is a popular text representation technique.
*   The model takes text data as input and builds a vocabulary from all the unique words in that data.
*   For each word in the vocabulary, it counts how many times the word occurs in a sentence and stores that count at the word's position in a vector.
*   That vector is the vector representation of the sentence.

**Example**

*  Text data for vocabulary creation ==> 'Bag of Words model is one of the popular Text representation technique'

*  Vocabulary {
                 Bag, of, Words, model, is, one, the, popular, Text, representation, technique
}

**Creating a vector with that vocabulary**  
* Text       : My model is popular than any other model.
* Word_Count : {
                 Bag : 0,
                 of:0,
                 Words:0,
                 model:2,
                 is:1,
                 one:0,
                 the:0,
                 popular:1,
                 Text:0,
                 representation:0,
                 technique:0

  }
* Vector     : [
  0, 0, 0, 2, 1, 0, 0, 1, 0, 0, 0
  ]

  **Vector size will be equal to Vocabulary Size**
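
The worked example above can be reproduced in a few lines of plain Python. This is only a sketch of the idea, not how any particular library implements it; the `tokenize` helper and the variable names are illustrative assumptions. Matching is case-sensitive here, as in the hand-built vocabulary.

```python
import re

vocab_text = 'Bag of Words model is one of the popular Text representation technique'
sentence = 'My model is popular than any other model.'

def tokenize(text):
    # Extract runs of word characters; case is preserved to match the example
    return re.findall(r'\w+', text)

# dict.fromkeys drops duplicates while keeping first-seen order
vocabulary = list(dict.fromkeys(tokenize(vocab_text)))

# Count how often each vocabulary word occurs in the sentence
tokens = tokenize(sentence)
vector = [tokens.count(word) for word in vocabulary]

print(vocabulary)
print(vector)
```

Words in the sentence that are not in the vocabulary ('My', 'than', 'any', 'other') are simply ignored, which is why the vector length always equals the vocabulary size.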








# **Spam Classification Project**

In [6]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


df = pd.read_csv("spam.csv")
df.sample(5)

Unnamed: 0,text,label
16,Your package has been delivered. Check your fr...,not spam
10,You have 1 new message. Please call 0844875225...,spam
13,Your account has been compromised. Please call...,spam
35,I HAVE A DATE ON SUNDAY WITH WILL!!,not spam
5,"Hey, I left my jacket at your place. Can I com...",not spam


In [7]:

# Encode labels as integers
df['label'] = df['label'].map({'spam': 1, 'not spam': 0})

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)




In [8]:

# Vectorize the text data
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)


In [9]:

# Train the Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train_vec, y_train)


In [10]:

# Predict on the test set
y_pred = clf.predict(X_test_vec)

# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(class_report)


Accuracy: 0.90
Confusion Matrix:
[[3 1]
 [0 6]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.75      0.86         4
           1       0.86      1.00      0.92         6

    accuracy                           0.90        10
   macro avg       0.93      0.88      0.89        10
weighted avg       0.91      0.90      0.90        10
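
The figures in the report for class 1 (spam) can be checked by hand from the confusion matrix. The sketch below hard-codes the matrix printed above and assumes scikit-learn's layout, where rows are true labels and columns are predicted labels; the variable names are illustrative.

```python
# Confusion matrix from the output above: rows = true label, columns = predicted label
conf_matrix = [[3, 1],
               [0, 6]]

tn, fp = conf_matrix[0]  # true 'not spam': 3 correct, 1 flagged as spam
fn, tp = conf_matrix[1]  # true 'spam': 0 missed, 6 caught

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_spam = tp / (tp + fp)
recall_spam = tp / (tp + fn)
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

print(f'accuracy={accuracy:.2f} precision={precision_spam:.2f} '
      f'recall={recall_spam:.2f} f1={f1_spam:.2f}')
```

Read this way, the single error on the test set is one legitimate message classified as spam, while no spam message was missed.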

