# Project MLE - Spam filter

## Data preparation
Collect a dataset of emails labeled as spam or not spam.
Preprocess the emails by removing stop words, stemming, and converting text into numerical features.

In [None]:
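import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

# Download the stop-word list and tokenizer data, then initialize the stemmer
nltk.download('stopwords')
nltk.download('punkt')  # needed by nltk.word_tokenize; newer NLTK releases may need 'punkt_tab'
stop_words = set(stopwords.words('english'))  # a set makes membership tests fast
stemmer = PorterStemmer()

# Define function to preprocess text
def preprocess(text):
    # Convert to lowercase and remove punctuation
    # (str.replace treats its argument literally, so a regex needs re.sub)
    text = re.sub(r'[^\w\s]', '', text.lower())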

    # Tokenize and remove stop words
    words = nltk.word_tokenize(text)
    words = [word for word in words if word not in stop_words]

    # Stem words
    words = [stemmer.stem(word) for word in words]

    # Join words back into a single string
    text = ' '.join(words)
    
    return text

# Example usage
text = "This is an example sentence. It contains multiple sentences!"
preprocessed_text = preprocess(text)
print(preprocessed_text)
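
The cell above stops at cleaned text, so the last step named in this section, converting text into numerical features, is still open. The following is a minimal plain-Python sketch of a bag-of-words encoding, roughly the kind of count matrix that `CountVectorizer` produces in the later cells. The helper names `build_vocabulary` and `to_count_vector` and the toy documents are assumptions made for illustration, not part of the original notebook.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct token to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def to_count_vector(doc, vocabulary):
    """Return the bag-of-words count vector for one document."""
    counts = Counter(doc.split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # tokens not seen when building the vocabulary are ignored
            vector[vocabulary[tok]] = n
    return vector


# Toy, already-preprocessed documents (illustrative, not real data)
docs = ["win cash prize win", "meet lunch tomorrow"]
vocabulary = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
print(vectors)
```

Each row of `vectors` has one column per vocabulary word, and word order within a document is discarded. That loss of order is the trade-off a bag-of-words model makes for simplicity.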

## Model training
Create a statistical model, such as Naive Bayes or logistic regression, to predict whether an email is spam or not spam.
Use MLE to estimate the model parameters that maximize the likelihood of the training data.

In [None]:
from sklearn.naive_bayes import MultinomialNB

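# Load preprocessed data and labels
# (load_data is a placeholder for your own loader; it is assumed to return a
# list of preprocessed email strings and a matching list of labels)
X_train, y_train = load_data()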

# Create bag-of-words vectorizer
vectorizer = CountVectorizer()

# Convert text to bag-of-words features
X_train_bow = vectorizer.fit_transform(X_train)

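# Train Naive Bayes model; the default alpha=1.0 adds Laplace smoothing to the
# maximum-likelihood word counts so unseen words do not get zero probability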
clf = MultinomialNB()
clf.fit(X_train_bow, y_train)
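
To make the MLE step concrete: for a multinomial Naive Bayes model, the maximum-likelihood estimates are simple count ratios. The class prior is the fraction of training emails in that class, and the probability of a word given a class is that word's count in the class divided by the class's total word count. The sketch below computes these by hand, assuming whitespace-tokenized text. The function `fit_naive_bayes` and the toy data are hypothetical and written only for illustration.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(docs, labels, alpha=0.0):
    """Estimate class priors and per-class word probabilities from counts.

    alpha=0.0 gives the plain maximum-likelihood estimates;
    alpha > 0 applies additive (Laplace) smoothing.
    """
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocabulary = sorted({w for counts in word_counts.values() for w in counts})

    # Prior: fraction of documents that belong to each class
    priors = {c: n / len(labels) for c, n in class_counts.items()}

    # Likelihood: (smoothed) share of each word among all words in the class
    likelihoods = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocabulary)
        likelihoods[c] = {
            w: (word_counts[c][w] + alpha) / total for w in vocabulary
        }
    return priors, likelihoods


# Toy, already-preprocessed emails (illustrative, not real data)
docs = ["win cash now", "win prize", "meet for lunch", "lunch now"]
labels = ["spam", "spam", "ham", "ham"]

priors, mle = fit_naive_bayes(docs, labels)                 # pure MLE
_, smoothed = fit_naive_bayes(docs, labels, alpha=1.0)      # Laplace-smoothed

print(priors)
print(mle["spam"]["win"], mle["ham"]["win"])
print(smoothed["ham"]["win"])
```

With pure MLE, a word never seen in a class gets probability zero, which zeroes out the whole product for any email containing it. This is why smoothed estimates are usually preferred in practice.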


## Model evaluation
Evaluate the trained model on a held-out test set to measure its performance.
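Tune the model's hyperparameters, such as the regularization strength (for Naive Bayes, the smoothing parameter `alpha`) or the feature-selection settings, to improve its accuracy. Do the tuning with cross-validation on the training data so that the test set stays an unbiased measure of performance.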
Sample code for model evaluation:

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load preprocessed data and labels
X, y = load_data()

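# Split data into training and test sets (fixed seed for reproducibility,
# stratified so both sets keep the same spam ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Convert text to bag-of-words features; the vocabulary is fit on the
# training set only so that nothing leaks from the test set
X_train_bow = vectorizer.fit_transform(X_train)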
X_test_bow = vectorizer.transform(X_test)

# Train Naive Bayes model (smoothed MLE with the default alpha=1.0)
clf = MultinomialNB()
clf.fit(X_train_bow, y_train)

# Evaluate model on test set
y_pred = clf.predict(X_test_bow)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
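
The evaluation cell measures accuracy but does not show the tuning step mentioned above. One possible sketch uses scikit-learn's `GridSearchCV` to search over the Naive Bayes smoothing parameter `alpha`. The toy emails, the candidate `alpha` values, and the choice of 2-fold cross-validation are assumptions made so the example runs on its own; a real dataset would call for more folds and a wider grid.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy, already-preprocessed emails (illustrative only; 1 = spam, 0 = not spam)
emails = [
    "win cash prize now", "free prize claim now",
    "earn cash fast free", "win free money today",
    "meet lunch tomorrow", "project report attach",
    "schedule meet monday", "lunch report tomorrow",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Putting the vectorizer inside the pipeline means it is re-fit on each
# training fold, so validation folds never influence the vocabulary
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# make_pipeline names each step after its lowercased class name
param_grid = {"multinomialnb__alpha": [0.01, 0.1, 1.0]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(emails, labels)

print("Best alpha:", search.best_params_["multinomialnb__alpha"])
print("Cross-validated accuracy:", search.best_score_)
```

After the search, `search.best_estimator_` is a pipeline refit on all of the data passed to `fit`, and it can then be scored once on the held-out test set.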


## Inference
Use the trained model to predict whether new, unseen emails are spam or not spam.
Preprocess the new emails in the same way as the training data before making predictions.

In [None]:
# Preprocess new email text
new_email = "Get rich quick! Earn thousands of dollars a day!"
preprocessed_email = preprocess(new_email)

# Convert preprocessed text to bag-of-words features
new_email_b
