# Mystery Friend

You’ve received an anonymous postcard from a friend whom you haven’t seen in years. Your friend did not leave a name, but the card is definitely addressed to you. So far, you’ve narrowed your search down to three friends, based on handwriting:

-    Emma Goldman
-    Matthew Henson
-    TingFang Wu

But which one sent you the card?

Just as a spam filter classifies a message as spam or not spam, a friend-writing classifier can assign a piece of writing to one friend or another. You have past writing from all three friends stored in the variable `friends_docs`, which means you can use scikit-learn’s bag-of-words vectorizer and Naive Bayes classifier to determine who the mystery friend is!

Ready?

## Feature vectors are in the bag with scikit-learn

1. Import `CountVectorizer` from `sklearn.feature_extraction.text`. Below it, import `MultinomialNB` from `sklearn.naive_bayes`.

In [2]:
from goldman_emma_raw import goldman_docs
from henson_matthew_raw import henson_docs
from wu_tingfang_raw import wu_docs
# import sklearn modules here:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [5]:
# Setting up the combined list of friends' writing samples
friends_docs = goldman_docs + henson_docs + wu_docs
# Setting up labels for your three friends
friends_labels = [1] * 154 + [2] * 141 + [3] * 166
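
The label counts above are hard-coded, so they only line up if each list has exactly that many documents. A minimal sketch of deriving the labels from the list lengths instead, using short made-up lists (`docs_a`, `docs_b`, `docs_c` are hypothetical stand-ins for `goldman_docs`, `henson_docs`, and `wu_docs`):

```python
# Hypothetical stand-ins for goldman_docs, henson_docs, and wu_docs
docs_a = ["a1", "a2", "a3"]
docs_b = ["b1", "b2"]
docs_c = ["c1", "c2", "c3", "c4"]

# Concatenate in a fixed order, then repeat each label once per document
all_docs = docs_a + docs_b + docs_c
all_labels = [1] * len(docs_a) + [2] * len(docs_b) + [3] * len(docs_c)

# Every document needs exactly one label
assert len(all_docs) == len(all_labels)
print(all_labels)
```

With the real data, the same pattern would use `len(goldman_docs)` and so on, which keeps documents and labels aligned if the writing samples ever change.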


In [42]:
mystery_postcard = """
My friend,
From the 10th of July to the 13th, a fierce storm raged, clouds of
freezing spray broke over the ship, incasing her in a coat of icy mail,
and the tempest forced all of the ice out of the lower end of the
channel and beyond as far as the eye could see, but the _Roosevelt_
still remained surrounded by ice.
Hope to see you soon.
"""
# mystery_postcard = """
# Marriage and love have nothing in common; they are as far apart as the
# poles; are, in fact, antagonistic to each other. No doubt some marriages
# have been the result of love. Not, however, because love could assert
# itself only in marriage; much rather is it because few people can
# completely outgrow a convention. There are today large numbers of men
# and women to whom marriage is naught but a farce, but who submit to it
# for the sake of public opinion. At any rate, while it is true that some
# marriages are based on love, and while it is equally true that in some
# cases love continues in married life, I maintain that it does so
# regardless of marriage, and not because of it.
# """


2. Define `bow_vectorizer` as an instance of `CountVectorizer`.

In [26]:
bow_vectorizer = CountVectorizer()
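
To get a feel for how a default `CountVectorizer` splits text into words, the sketch below runs its analyzer on a made-up sentence (the sentence is an illustration, not part of the project data). With default settings, text is lowercased, punctuation is dropped, and single-character tokens are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function the vectorizer uses to turn text into tokens
analyze = CountVectorizer().build_analyzer()

# Made-up sentence for illustration
tokens = analyze("My friend, I saw a SHIP in the ice.")
print(tokens)
```

Note that "I" and "a" do not appear in the result, because the default token pattern only keeps tokens of two or more characters.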

3. Use your newly minted `bow_vectorizer` to both `fit` (train) and `transform` (vectorize) all your friends’ writing (stored in the variable `friends_docs`). Save the resulting vector object as `friends_vectors`.

In [27]:
# Define friends_vectors:
friends_vectors = bow_vectorizer.fit_transform(friends_docs)
# friends_vectors
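
The object returned by `.fit_transform()` is a sparse matrix of word counts. As a rough illustration, here is the same call on a tiny made-up corpus (`toy_docs` is a hypothetical stand-in for `friends_docs`):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for friends_docs
toy_docs = [
    "the ship sailed north",
    "the ice surrounded the ship",
    "love and marriage",
]

toy_vectorizer = CountVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_docs)

# One row per document, one column per distinct word in the vocabulary
print(toy_vectors.shape)

# vocabulary_ maps each word to its column index
print(sorted(toy_vectorizer.vocabulary_))

# Dense view of the counts (fine for a toy corpus, costly for a large one)
print(toy_vectors.toarray())
```

Each row is the bag-of-words vector for one document: word order is discarded and only the counts remain.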

4. Create a new variable `mystery_vector`. Assign to it the vectorized form of `[mystery_postcard]` using the vectorizer’s `.transform()` method.

    (`mystery_postcard` is a string, while the vectorizer expects a list as an argument.)

In [43]:
# Define mystery_vector:
mystery_vector = bow_vectorizer.transform([mystery_postcard])
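
Unlike `.fit_transform()`, `.transform()` reuses the vocabulary learned earlier, so words the vectorizer has never seen are simply dropped. A small sketch with made-up sentences (all names and data here are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(["the ship sailed north", "the ice surrounded the ship"])

# "zeppelin" and "passed" were not seen during fit, so they add nothing to the vector
new_vector = toy_vectorizer.transform(["the zeppelin passed the ship"])
print(new_vector.shape)  # one row, one column per word learned during fit
print(new_vector.sum())  # total count of known words in the new text

# Passing a bare string instead of a list is rejected
try:
    toy_vectorizer.transform("the ship")
    bare_string_rejected = False
except ValueError:
    bare_string_rejected = True
print(bare_string_rejected)
```

This is why the postcard is wrapped in a list: the vectorizer treats its argument as a collection of documents.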

## This mystery friend gets classified

5. You’ve vectorized and prepared all the documents. Let’s take a look at your friends’ writing samples to get a sense of how they write.

    Print out one document of each friend’s writing. Try any index between `0` and `140`. (Your friends’ documents are stored in `goldman_docs`, `henson_docs`, and `wu_docs`.)

In [29]:
goldman_docs[20]

' All the\nearly sagas rest on that idea, which continues to be the LEIT-MOTIF\nof the biblical tales dealing with the relation of man to God, to the\nState, to society'

In [30]:
henson_docs[20]

'Peary accompanied her husband, and among the members\nof the expedition were Dr'

In [31]:
wu_docs[20]

' I have asked many other\nAmericans similar questions and they all have given me replies in the\nsame way'

6. Have an inkling about which friend wrote the mystery card? We can use a classifier to confirm those suspicions…

    Create a Naive Bayes classifier by instantiating `MultinomialNB`. Save the result to `friends_classifier`.

In [32]:
# Define friends_classifier:
friends_classifier = MultinomialNB()

7. Train `friends_classifier` on `friends_vectors` and `friends_labels` using the classifier’s `.fit()` method.

In [44]:
# Train the classifier:
friends_classifier.fit(friends_vectors, friends_labels)

MultinomialNB()
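
After `.fit()`, the classifier stores what it learned about the labels. The sketch below trains on a few made-up sentences (hypothetical data, not the friends’ actual writing) and inspects two fitted attributes:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy documents and labels
toy_docs = [
    "love and marriage",
    "marriage is a convention",
    "ice and storm at sea",
    "the ship in the ice",
]
toy_labels = [1, 1, 2, 2]

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_vectorizer.fit_transform(toy_docs), toy_labels)

# The distinct labels seen during training, in sorted order
print(toy_classifier.classes_)

# How many training documents carried each label
print(toy_classifier.class_count_)
```

On the real data, `friends_classifier.class_count_` should match the number of documents per friend used to build `friends_labels`.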

8. Change the value of `predictions` from `["None Yet"]` to the classifier’s prediction about which friend wrote the postcard. You can do this by calling the classifier’s `.predict()` method on `mystery_vector`.

In [45]:
# Change predictions:
# predictions = ["None Yet"]
predictions = friends_classifier.predict(mystery_vector)

# predictions = friends_classifier.predict_proba(mystery_vector)

In [46]:
mystery_friend = predictions[0] if predictions[0] else "someone else"

## Mystery revealed!

9. Uncomment the final print statement and save your code to see who your mystery friend was all along!

In [47]:
# Uncomment the print statement:
print("The postcard was from {}!".format(mystery_friend))

The postcard was from 2!


*Label `2` was assigned to Matthew Henson’s documents, so the postcard is from him.*
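
The printed result is a numeric label, not a name. One way to make the output readable is a small lookup table. This is a sketch: `friend_names` is a helper introduced here, and its mapping follows the order in which `friends_labels` was built (`1` for Goldman, `2` for Henson, `3` for Wu).

```python
# Assumed mapping, following the label order used when building friends_labels
friend_names = {1: "Emma Goldman", 2: "Matthew Henson", 3: "TingFang Wu"}

# Stand-in for predictions[0]; int() converts a NumPy integer to a plain int
predicted_label = int(2)

message = "The postcard was from {}!".format(friend_names[predicted_label])
print(message)
```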

10. But does it really work? Find some lines by Emma Goldman, Matthew Henson, and TingFang Wu on [gutenberg.org](https://gutenberg.org) and save them to `mystery_postcard` to see how the classifier holds up!

    Try using the `.predict_proba()` method instead of `.predict()` and print out `predictions` to see the estimated probabilities that the `mystery_postcard` was written by each person.

    What happens when you add in a recent email or text instead?
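
As a rough illustration of both experiments, the sketch below uses a tiny made-up training set (hypothetical data with only two labels) to show how `.predict_proba()` output lines up with `classes_`, and what happens when a message shares no words with the training text.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: two documents labeled 1, one labeled 2
toy_docs = [
    "marriage and love have nothing in common",
    "love continues in married life",
    "the ship remained surrounded by ice",
]
toy_labels = [1, 1, 2]

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_vectorizer.fit_transform(toy_docs), toy_labels)

# Columns of predict_proba follow the order of classes_
in_vocab = toy_vectorizer.transform(["ice surrounded the ship"])
in_vocab_probs = toy_classifier.predict_proba(in_vocab)[0]
for label, prob in zip(toy_classifier.classes_, in_vocab_probs):
    print(label, round(float(prob), 3))

# A message with no known words becomes an all-zero vector, so the estimate
# falls back to the share of training documents carrying each label
out_of_vocab = toy_vectorizer.transform(["lol brb omw"])
out_of_vocab_probs = toy_classifier.predict_proba(out_of_vocab)[0]
print(np.round(out_of_vocab_probs, 3))
```

A recent email or text will usually overlap with the training vocabulary at least a little, but the classifier must still pick one of the three friends, so a confident-looking probability does not mean the writer is actually among them.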