# Email Similarity with Naive Bayes

In this project, we'll use scikit-learn's Multinomial Naive Bayes implementation to classify emails (posts from the 20 Newsgroups dataset) by category. We'll explore how much harder it is to distinguish between similar topics (like baseball vs hockey) than between different topics (like computer hardware vs hockey).

In [None]:
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

## Exploring the Data

### Task 1: View available categories

In [None]:
# Fetch the dataset to see available categories
emails = fetch_20newsgroups()

# Task 1: Print the target names to see different categories
print(emails.target_names)

### Task 2: Select baseball and hockey categories

In [None]:
# Task 2: Fetch emails from baseball and hockey categories
emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'])

### Task 3: Look at a sample email

In [None]:
# Task 3: Print the email at index 5
print(emails.data[5])

### Task 4: Check the label of the email

In [None]:
# Task 4: Print the label of the email at index 5
print(f"Label number: {emails.target[5]}")
print(f"Label name: {emails.target_names[emails.target[5]]}")
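Before training, it can help to check how many emails carry each label. The sketch below uses small hypothetical `target` and `target_names` values as stand-ins for `emails.target` and `emails.target_names`, so it runs without downloading the dataset:

```python
import numpy as np

# Hypothetical stand-ins for emails.target and emails.target_names
target = np.array([0, 1, 1, 0, 1, 1])
target_names = ['rec.sport.baseball', 'rec.sport.hockey']

# np.bincount counts how often each integer label occurs
counts = np.bincount(target, minlength=len(target_names))
label_counts = {name: int(n) for name, n in zip(target_names, counts)}
print(label_counts)
```

With the real data, replacing the two stand-ins with `emails.target` and `emails.target_names` should give the per-category email counts.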

## Making the Training and Test Sets

### Task 5: Create training set

In [None]:
# Task 5: Create training emails dataset
train_emails = fetch_20newsgroups(
    categories=['rec.sport.baseball', 'rec.sport.hockey'],
    subset='train',
    shuffle=True,
    random_state=108
)

### Task 6: Create test set

In [None]:
# Task 6: Create test emails dataset
test_emails = fetch_20newsgroups(
    categories=['rec.sport.baseball', 'rec.sport.hockey'],
    subset='test',
    shuffle=True,
    random_state=108
)
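Both calls pass `random_state=108` so that the shuffled order is the same on every run. The sketch below illustrates the idea with `sklearn.utils.shuffle` on hypothetical toy lists; it is not a claim about how `fetch_20newsgroups` shuffles internally.

```python
from sklearn.utils import shuffle

# Hypothetical toy documents and labels
docs = ['doc a', 'doc b', 'doc c', 'doc d']
labels = [0, 1, 0, 1]

# The same seed gives the same order, and documents stay paired with their labels
docs_1, labels_1 = shuffle(docs, labels, random_state=108)
docs_2, labels_2 = shuffle(docs, labels, random_state=108)

print(list(docs_1) == list(docs_2))
print(list(zip(docs_1, labels_1)))
```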

## Counting Words

### Task 7: Create CountVectorizer

In [None]:
# Task 7: Create a CountVectorizer object
counter = CountVectorizer()
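To see what a `CountVectorizer` produces, here is a minimal sketch on two made-up sentences (the `toy_` names are hypothetical and not part of the project). Each column corresponds to one vocabulary word and each row to one document:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up "emails" for illustration
toy_emails = ["The puck hit the net", "The pitcher threw the ball"]

toy_counter = CountVectorizer()
toy_counts = toy_counter.fit_transform(toy_emails)

# vocabulary_ maps each lowercased word to its column index
toy_vocabulary = sorted(toy_counter.vocabulary_, key=toy_counter.vocabulary_.get)
print(toy_vocabulary)

# Dense view of the sparse count matrix: one row per email
print(toy_counts.toarray())
```

With the default settings, text is lowercased and single-character tokens are dropped, so "The" and "the" share one column.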

### Task 8: Fit the CountVectorizer

In [None]:
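# Task 8: Fit the counter on all emails (test + train) so the vocabulary
# covers every word. Note that this lets test-set words into the vocabulary.
counter.fit(test_emails.data + train_emails.data)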
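The cell above builds the vocabulary from the training and test emails together. A common alternative is to fit on the training emails only, so that nothing from the test set influences the features. The sketch below, on hypothetical toy sentences, shows that words unseen during `fit` are simply ignored by `transform`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training and test sentences
toy_train = ["goalie saves the puck", "pitcher throws the ball"]
toy_test = ["goalie drops the stick"]

train_only_counter = CountVectorizer()
train_only_counter.fit(toy_train)

# "drops" and "stick" never appeared in toy_train, so they get no column
test_vector = train_only_counter.transform(toy_test)
known_words = [w for w in ["goalie", "drops", "the", "stick"]
               if w in train_only_counter.vocabulary_]

print(known_words)
print(test_vector.toarray())
```

Whether to fit on both sets or on the training set alone is a design choice; the second keeps the test set fully unseen, which gives a more honest accuracy estimate.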

### Task 9: Transform training data

In [None]:
# Task 9: Transform training emails into word counts
train_counts = counter.transform(train_emails.data)

### Task 10: Transform test data

In [None]:
# Task 10: Transform test emails into word counts
test_counts = counter.transform(test_emails.data)

## Making a Naive Bayes Classifier

### Task 11: Create classifier

In [None]:
# Task 11: Create MultinomialNB classifier
classifier = MultinomialNB()

### Task 12: Train the classifier

In [None]:
# Task 12: Fit the classifier with training data
classifier.fit(train_counts, train_emails.target)
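During `fit`, `MultinomialNB` estimates how likely each word is within each class, using additive (Laplace) smoothing controlled by `alpha`. The sketch below reproduces those estimates by hand on a hypothetical 3-word count matrix and compares them with the fitted model's `feature_log_prob_`:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts: rows are emails, columns are 3 vocabulary words
X = np.array([[2, 1, 0],
              [3, 0, 0],
              [0, 1, 2],
              [0, 0, 3]])
y = np.array([0, 0, 1, 1])

alpha = 1.0
nb = MultinomialNB(alpha=alpha).fit(X, y)

# Manual estimate: (word count in class + alpha) / (all words in class + alpha * vocab size)
manual = []
for label in nb.classes_:
    word_totals = X[y == label].sum(axis=0)
    manual.append((word_totals + alpha) / (word_totals.sum() + alpha * X.shape[1]))
manual = np.array(manual)

print(np.exp(nb.feature_log_prob_))
print(manual)
```

Smoothing keeps a word that never appeared in a class from forcing that class's probability to zero.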

### Task 13: Test the classifier

In [None]:
# Task 13: Test classifier accuracy
accuracy = classifier.score(test_counts, test_emails.target)
print(f"Accuracy for baseball vs hockey: {accuracy:.4f}")
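The value returned by `score` is the fraction of test emails whose predicted label matches the true label. A minimal sketch on hypothetical toy counts shows the equivalence, and adds a confusion matrix to see which class the mistakes fall in:

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrices (3 vocabulary words) and labels
X_train = np.array([[2, 1, 0], [3, 0, 0], [0, 1, 2], [0, 0, 3]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[1, 0, 0], [0, 0, 2], [0, 2, 1], [1, 1, 0]])
y_test = np.array([0, 1, 0, 0])

toy_classifier = MultinomialNB().fit(X_train, y_train)
predictions = toy_classifier.predict(X_test)

# Accuracy is the mean of the "prediction was correct" indicator
manual_accuracy = np.mean(predictions == y_test)
print(manual_accuracy, toy_classifier.score(X_test, y_test))

# Rows are true labels, columns are predicted labels
matrix = confusion_matrix(y_test, predictions)
print(matrix)
```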

## Testing Other Datasets

### Task 14: Test hardware vs hockey emails

In [None]:
# Task 14: Test with different categories (hardware vs hockey)
train_emails = fetch_20newsgroups(
    categories=['comp.sys.ibm.pc.hardware', 'rec.sport.hockey'],
    subset='train',
    shuffle=True,
    random_state=108
)

test_emails = fetch_20newsgroups(
    categories=['comp.sys.ibm.pc.hardware', 'rec.sport.hockey'],
    subset='test',
    shuffle=True,
    random_state=108
)

# Re-fit and transform with new data
counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)

# Train and test classifier
classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)
accuracy = classifier.score(test_counts, test_emails.target)
print(f"Accuracy for hardware vs hockey: {accuracy:.4f}")
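The cell above repeats the vectorize, fit, and score steps from the earlier tasks. One way to bundle them is scikit-learn's `make_pipeline`, sketched here on made-up sentences with hypothetical labels. Note one difference from the cell above: a pipeline fits the vectorizer on the training text only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training sentences and labels for illustration
toy_train_emails = [
    "the goalie blocked the puck on the ice",
    "the team scored a goal in overtime",
    "my motherboard and hard drive failed",
    "install the new drive controller card",
]
toy_train_labels = ["hockey", "hockey", "hardware", "hardware"]

# The pipeline vectorizes raw text, then feeds the counts to the classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(toy_train_emails, toy_train_labels)

toy_predictions = model.predict([
    "the goalie stopped the puck",
    "my motherboard failed again",
])
print(toy_predictions)
```

With the real data, `model.fit(train_emails.data, train_emails.target)` followed by `model.score(test_emails.data, test_emails.target)` would follow the same pattern.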

### Task 15: Experiment with different category combinations

Try different pairs of categories to find:
- Very similar topics, where accuracy should be lower
- Very different topics, where accuracy should be higher

In [None]:
# Task 15: Test different category combinations

# Function to test accuracy between two categories
def test_categories(category1, category2):
    """Train on two categories and return the accuracy on their test emails."""
    # Fetch data
    train_emails = fetch_20newsgroups(
        categories=[category1, category2],
        subset='train',
        shuffle=True,
        random_state=108
    )
    
    test_emails = fetch_20newsgroups(
        categories=[category1, category2],
        subset='test',
        shuffle=True,
        random_state=108
    )
    
    # Transform data
    counter = CountVectorizer()
    counter.fit(test_emails.data + train_emails.data)
    train_counts = counter.transform(train_emails.data)
    test_counts = counter.transform(test_emails.data)
    
    # Train and test
    classifier = MultinomialNB()
    classifier.fit(train_counts, train_emails.target)
    accuracy = classifier.score(test_counts, test_emails.target)
    
    return accuracy

# Test similar topics (should have lower accuracy)
print("Similar topics:")
print(f"Baseball vs Hockey: {test_categories('rec.sport.baseball', 'rec.sport.hockey'):.4f}")
print(f"PC Hardware vs Mac Hardware: {test_categories('comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware'):.4f}")
print(f"Politics Guns vs Politics Mideast: {test_categories('talk.politics.guns', 'talk.politics.mideast'):.4f}")

print("\nDifferent topics:")
print(f"Atheism vs Christian: {test_categories('alt.atheism', 'soc.religion.christian'):.4f}")
print(f"Space vs Hockey: {test_categories('sci.space', 'rec.sport.hockey'):.4f}")
print(f"Medicine vs Motorcycles: {test_categories('sci.med', 'rec.motorcycles'):.4f}")
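Rather than typing each pair by hand, the comparison can be run over every pair in a list and sorted so the hardest pairs come first. In the sketch below, `rank_category_pairs` and `placeholder_score` are hypothetical helpers; the placeholder returns made-up values (not measured accuracies) so the example runs without downloading data.

```python
from itertools import combinations

def rank_category_pairs(categories, score_fn):
    """Score every unordered pair and sort from lowest to highest score."""
    results = [(score_fn(a, b), a, b) for a, b in combinations(categories, 2)]
    return sorted(results)

def placeholder_score(category1, category2):
    # Stand-in scorer with made-up values: NOT real accuracies
    same_group = category1.split('.')[0] == category2.split('.')[0]
    return 0.5 if same_group else 1.0

ranked = rank_category_pairs(
    ['rec.sport.baseball', 'rec.sport.hockey', 'sci.space'],
    placeholder_score,
)
for score, a, b in ranked:
    print(f"{a} vs {b}: {score:.4f}")
```

Passing `test_categories` in place of `placeholder_score` would rank the pairs by their actual accuracy, at the cost of one fetch-and-train cycle per pair.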

## Available Categories

Here are all the available categories you can experiment with:

- 'alt.atheism'
- 'comp.graphics'
- 'comp.os.ms-windows.misc'
- 'comp.sys.ibm.pc.hardware'
- 'comp.sys.mac.hardware'
- 'comp.windows.x'
- 'misc.forsale'
- 'rec.autos'
- 'rec.motorcycles'
- 'rec.sport.baseball'
- 'rec.sport.hockey'
- 'sci.crypt'
- 'sci.electronics'
- 'sci.med'
- 'sci.space'
- 'soc.religion.christian'
- 'talk.politics.guns'
- 'talk.politics.mideast'
- 'talk.politics.misc'
- 'talk.religion.misc'
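The prefix before the first dot gives a rough topic grouping, which can guide the choice of "similar" and "different" pairs. A minimal sketch follows; treating a shared prefix as a proxy for similarity is an assumption, not something the dataset guarantees.

```python
from collections import defaultdict

categories = [
    'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
    'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
    'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
    'sci.space', 'soc.religion.christian', 'talk.politics.guns',
    'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc',
]

# Group category names by their top-level prefix (e.g. "rec", "comp")
groups = defaultdict(list)
for name in categories:
    groups[name.split('.')[0]].append(name)

for prefix, names in groups.items():
    print(f"{prefix}: {names}")
```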