
In this off-platform project, you will use a Naive Bayes Classifier to find patterns in real tweets. We've given you three files: `new_york.json`, `london.json`, and `paris.json`. These three files contain tweets that we gathered from those locations.

The goal is to create a classification algorithm that can classify any tweet (or sentence) and predict whether that sentence came from New York, London, or Paris.

# Investigate the Data

To begin, load the New York tweets and print the following:

* The number of tweets.
* The columns, or features, of a tweet.
* The text of the tweet at index 12 in the New York dataset.


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [2]:
new_york_tweets = pd.read_json("new_york.json", lines=True)
# print(len(new_york_tweets))
# print(new_york_tweets.columns)
# print(new_york_tweets.loc[12]["text"])

In [3]:
london_tweets = pd.read_json("london.json", lines=True)
paris_tweets = pd.read_json("paris.json", lines=True)

# Classifying using language: Naive Bayes Classifier

Combine the tweets from all three cities into a list named `all_tweets`.

Then build a matching list named `labels`, in which `0` represents a New York tweet, `1` represents a London tweet, and `2` represents a Paris tweet.

In [4]:
new_york_text = new_york_tweets["text"].tolist()
# new_york_text
london_text = london_tweets['text'].tolist()
paris_text = paris_tweets['text'].tolist()
all_tweets = new_york_text + london_text + paris_text
labels = [0] * len(new_york_text) + [1] * len(london_text) + [2] * len(paris_text)
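
As a small illustration of how the labels line up with the concatenated texts, here is a sketch on made-up lists. The sample strings and the `sample_*` names are hypothetical and are not taken from the datasets.

```python
# Hypothetical stand-ins for the three lists of tweet texts.
ny_sample = ["bagel on broadway", "subway delays again"]
london_sample = ["rain on the tube"]
paris_sample = ["croissant by the seine", "metro strike", "louvre queue"]

# Concatenate the texts in a fixed order: New York, London, Paris.
sample_tweets = ny_sample + london_sample + paris_sample

# Build the labels in the same order so position i in both lists matches.
sample_labels = (
    [0] * len(ny_sample) + [1] * len(london_sample) + [2] * len(paris_sample)
)

# Every text must have exactly one label, in the same position.
assert len(sample_tweets) == len(sample_labels)

for text, label in zip(sample_tweets, sample_labels):
    print(label, text)
```

Because both lists are built in the same city order, shuffling them together later keeps each tweet paired with its city.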

# Making a Training and Test Set

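Split the data into a training set and a test set using scikit-learn's `train_test_split` function. Here `test_size=0.2` holds out 20% of the tweets for testing, and `random_state=1` makes the split reproducible.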

In [5]:
train_data, test_data, train_labels, test_labels = train_test_split(all_tweets, labels, test_size=0.2, random_state=1)
# print(len(train_data))
# print(len(test_data))
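
To show roughly what a shuffled 80/20 split does, here is a minimal pure-Python sketch. The function `simple_split` is hypothetical and is not scikit-learn's implementation; it will not reproduce the same ordering as `train_test_split`, and rounding the test size up is an assumption of this sketch.

```python
import math
import random

def simple_split(data, labels, test_size=0.2, seed=1):
    """Shuffle indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    # Assumption of this sketch: round the number of test items up.
    n_test = math.ceil(len(data) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_x = [data[i] for i in train_idx]
    test_x = [data[i] for i in test_idx]
    train_y = [labels[i] for i in train_idx]
    test_y = [labels[i] for i in test_idx]
    return train_x, test_x, train_y, test_y

# Toy data: ten placeholder texts with labels cycling through 0, 1, 2.
toy_data = [f"tweet {i}" for i in range(10)]
toy_labels = [i % 3 for i in range(10)]

tr_x, te_x, tr_y, te_y = simple_split(toy_data, toy_labels)
print(len(tr_x), len(te_x))
```

The key property is that texts and labels are indexed by the same shuffled positions, so each tweet keeps its label on both sides of the split.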

# Making the Count Vectors

Create a `CountVectorizer` named `counter`.

Call its `.fit()` method with `train_data` as a parameter so that the counter learns the vocabulary.

Then transform `train_data` and `test_data` into count vectors by calling `counter`'s `.transform()` method.



In [6]:
counter = CountVectorizer()
counter.fit(train_data)
train_counts = counter.transform(train_data)
test_counts = counter.transform(test_data)
# print(train_data[3])
# print(train_counts[3])
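
To make the idea of a count vector concrete, here is a simplified pure-Python sketch. The helpers `fit_vocabulary` and `to_count_vector` are hypothetical and only approximate the behaviour; the real `CountVectorizer` tokenizes differently (for example, its default token pattern drops one-character tokens) and returns a sparse matrix rather than a list.

```python
from collections import Counter

def fit_vocabulary(texts):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({word for text in texts for word in text.lower().split()})
    return {word: index for index, word in enumerate(tokens)}

def to_count_vector(text, vocabulary):
    """Count vocabulary words in text; words outside the vocabulary are ignored."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for word, index in vocabulary.items():
        vector[index] = counts[word]
    return vector

# The vocabulary is learned from the training texts only.
toy_train = ["I love New York", "I love love London"]
vocab = fit_vocabulary(toy_train)
print(vocab)

# "paris" was never seen during fitting, so it contributes nothing.
print(to_count_vector("love love Paris", vocab))
```

This is also why the vocabulary is fitted on `train_data` alone: words that appear only in the test set have no column and are ignored when the test tweets are transformed.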

# Train and Test the Naive Bayes Classifier

Create a `MultinomialNB` classifier, fit it on `train_counts` and `train_labels`, then predict the labels of `test_counts`.


In [7]:
classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)
predictions = classifier.predict(test_counts)
# print(predictions)
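
For intuition about what a multinomial Naive Bayes classifier computes, here is a minimal pure-Python sketch with add-one (Laplace) smoothing. The functions `train_nb` and `predict_nb` and the toy texts are hypothetical; this is an illustration of the technique under simple whitespace tokenization, not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = {word for text in texts for word in text.lower().split()}
    class_sizes = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())

    # Prior: fraction of training texts that belong to each class.
    log_prior = {c: math.log(n / len(texts)) for c, n in class_sizes.items()}

    # Likelihood: smoothed relative frequency of each word within a class.
    log_likelihood = {}
    for c in class_sizes:
        denominator = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denominator) for w in vocab
        }
    return vocab, log_prior, log_likelihood

def predict_nb(model, text):
    """Return the class with the highest log prior plus summed log likelihoods."""
    vocab, log_prior, log_likelihood = model
    words = [w for w in text.lower().split() if w in vocab]
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in words)
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Made-up training texts: 0 = New York, 1 = London, 2 = Paris.
toy_texts = [
    "broadway subway pizza", "yankees subway",
    "tube pub tea", "pub thames",
    "louvre baguette", "seine louvre metro",
]
toy_labels = [0, 0, 1, 1, 2, 2]

model = train_nb(toy_texts, toy_labels)
print(predict_nb(model, "pizza on the subway"))
print(predict_nb(model, "a day at the louvre"))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that never appeared in a class from forcing that class's probability to zero.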

# Evaluating Your Model

Evaluate the classifier by computing its accuracy with scikit-learn's `accuracy_score` function and by looking at the confusion matrix.


In [8]:
from sklearn.metrics import accuracy_score
print(accuracy_score(test_labels, predictions))

0.6779324055666004


In [9]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(test_labels, predictions))

[[541 404  28]
 [203 824  34]
 [ 38 103 340]]
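
In scikit-learn's confusion matrix, rows are the true labels and columns are the predicted labels, so the diagonal holds the correct predictions. The sketch below copies the matrix printed above and recomputes the accuracy from it, along with the fraction of each city's tweets that were classified correctly. The variable names are illustrative.

```python
# Confusion matrix copied from the output above.
# Rows: true city. Columns: predicted city. Order: New York, London, Paris.
matrix = [
    [541, 404, 28],
    [203, 824, 34],
    [38, 103, 340],
]
cities = ["New York", "London", "Paris"]

# Accuracy is the sum of the diagonal divided by the number of test tweets.
total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total
print(f"accuracy: {accuracy:.4f}")

# Per-city recall: correct predictions divided by the row total.
for i, city in enumerate(cities):
    recall = matrix[i][i] / sum(matrix[i])
    print(f"{city}: {recall:.3f} of true tweets classified correctly")
```

The off-diagonal entries show where the errors concentrate: 404 New York tweets were predicted as London and 203 London tweets as New York, while comparatively few tweets from either city were predicted as Paris.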


# Testing my Own Tweet


In [10]:
tweet = 'I love paris man'
tweet_counts = counter.transform([tweet])
print(classifier.predict(tweet_counts))

[2]

The predicted label `2` corresponds to Paris.
