# Classifying Tweets Project

In this project, I use a Naive Bayes classifier to find patterns in tweets. I work with three files: `new_york.json`, `london.json`, and `paris.json`. Each file contains tweets gathered from that location.

My goal is to create a classification algorithm that can classify any tweet (or sentence) and predict whether that sentence came from New York, London, or Paris.

# Investigate the Data

To begin, let's take a look at the data. I've imported `new_york.json` and printed the following information:
* The number of tweets.
* The columns, or features, of a tweet.
* The text of the tweet at index 12 (the 13th tweet, since the index starts at 0) in the New York dataset.

In [1]:
import pandas as pd

new_york_tweets = pd.read_json("new_york.json", lines=True)
print(len(new_york_tweets))
print(new_york_tweets.columns)
print(new_york_tweets.loc[12]["text"])

4723
Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'quote_count', 'reply_count',
       'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted',
       'filter_level', 'lang', 'timestamp_ms', 'extended_tweet',
       'possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status', 'quoted_status_permalink', 'extended_entities',
       'withheld_in_countries'],
      dtype='object')
Be best #ThursdayThoughts
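
As a small illustration of what `lines=True` and `.loc[...]` do in the cell above, here is a sketch on a made-up three-record input. The names `raw` and `toy_tweets` and the sample records are hypothetical and are not part of the dataset.

```python
import io

import pandas as pd

# Hypothetical JSON Lines input: one JSON object per line, like the tweet files.
raw = (
    '{"id": 1, "text": "first tweet"}\n'
    '{"id": 2, "text": "second tweet"}\n'
    '{"id": 3, "text": "third tweet"}\n'
)

# lines=True tells pandas to parse each line as a separate record.
toy_tweets = pd.read_json(io.StringIO(raw), lines=True)

n_rows = len(toy_tweets)                 # number of records
columns = list(toy_tweets.columns)       # keys of the JSON objects
second = toy_tweets.loc[1]["text"]       # label 1 is the second row (index starts at 0)

print(n_rows, columns, second)
```

Because the default index starts at 0, `.loc[12]` in the New York DataFrame selects the 13th row rather than the 12th.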


In the code block below, I'm loading the London and Paris tweets into DataFrames named `london_tweets` and `paris_tweets`.

In [2]:
london_tweets = pd.read_json("london.json", lines=True)
paris_tweets = pd.read_json("paris.json", lines=True)

In [3]:
print(len(london_tweets))
print(len(paris_tweets))

5341
2510


# Classifying using language: Naive Bayes Classifier

I'm going to create a Naive Bayes classifier. First, I grab the text of every tweet and combine the three lists into one big list, `all_tweets`. Then I build a matching `labels` list that encodes where each tweet came from: `0` for a New York tweet, `1` for a London tweet, and `2` for a Paris tweet.

In [4]:
new_york_text = new_york_tweets["text"].tolist()
london_text = london_tweets["text"].tolist()
paris_text = paris_tweets["text"].tolist()

all_tweets = new_york_text + london_text + paris_text
labels = [0] * len(new_york_text) + [1] * len(london_text) + [2] * len(paris_text)

print(labels)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... (output truncated)

# Making a Training and Test Set

I'll split the data into a training set and a test set, holding out 20% of the tweets for testing. Fixing `random_state` makes the split reproducible.

In [5]:
from sklearn.model_selection import train_test_split

train_data, test_data, train_labels, test_labels = train_test_split(
    all_tweets, labels, test_size=0.2, random_state=1
)
print(len(train_data))
print(len(test_data))


10059
2515
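
The three cities are not equally represented (4723, 5341, and 2510 tweets), so it can be worth checking how the classes are distributed after a split. The sketch below is not part of the original notebook: it rebuilds a label list from the dataset sizes printed earlier and uses the `stratify` argument of `train_test_split`, an alternative to the purely random split above, to keep the class shares the same in both parts. The names `sizes`, `toy_labels`, `strat_train`, and `strat_test` are hypothetical.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Dataset sizes as printed earlier: 0 = New York, 1 = London, 2 = Paris.
sizes = {0: 4723, 1: 5341, 2: 2510}
toy_labels = [label for label, n in sizes.items() for _ in range(n)]

total = len(toy_labels)
overall_share = {label: n / total for label, n in sizes.items()}

# stratify= asks scikit-learn to preserve the class proportions in each part.
strat_train, strat_test = train_test_split(
    toy_labels, test_size=0.2, random_state=1, stratify=toy_labels
)
test_counts_by_label = Counter(strat_test)
test_share = {label: n / len(strat_test) for label, n in test_counts_by_label.items()}

for label in sorted(sizes):
    print(label, round(overall_share[label], 3), round(test_share[label], 3))
```

London accounts for roughly 42% of all tweets and Paris for roughly 20%, so a classifier that always answered "London" would already be right on about 42% of the data. That is a useful baseline to keep in mind when reading the accuracy later on.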


# Making the Count Vectors

To use a Naive Bayes classifier, I need to transform the tweets into count vectors. First, I fit a `CountVectorizer` on the training data so that it learns the vocabulary. Then I call `.transform()` on both the training and test data and store the resulting count vectors in `train_counts` and `test_counts`.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

counter = CountVectorizer()
counter.fit(train_data)
train_counts = counter.transform(train_data)
test_counts = counter.transform(test_data)

print(test_data[3])

Trade &amp; investment agreements often have a negative effect on the weakest parts of society, in particular women. Ho… https://t.co/Ve3JlBzNlZ
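
To make the idea of a count vector concrete, here is a minimal sketch on made-up sentences. The names `toy_train`, `toy_test`, and `toy_counter` are hypothetical, and the sentences are not taken from the tweet files.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the training and test tweets.
toy_train = ["london calling london", "paris is calling"]
toy_test = ["london loves paris and rome"]

toy_counter = CountVectorizer()
toy_counter.fit(toy_train)  # learns the vocabulary from the training text only

# vocabulary_ maps each learned word to its column index.
vocab_in_column_order = sorted(toy_counter.vocabulary_, key=toy_counter.vocabulary_.get)

# Each row counts how often every vocabulary word occurs in one sentence.
train_matrix = toy_counter.transform(toy_train).toarray()
test_matrix = toy_counter.transform(toy_test).toarray()

print(vocab_in_column_order)
print(train_matrix)
print(test_matrix)
```

Words that appear only in the test sentence ("loves", "and", "rome") are not in the learned vocabulary, so they are dropped when the test sentence is transformed. This is why the vectorizer is fit on the training data alone.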


# Train and Test the Naive Bayes Classifier

I now have the inputs to the classifier. I'll use the count vectors to train and test the Naive Bayes classifier.

In [7]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(train_counts,train_labels)
predictions = classifier.predict(test_counts)
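
The same two steps, vectorizing and classifying, can also be applied to a brand-new sentence. The sketch below is an illustration only: it uses a tiny made-up corpus instead of the real tweets and chains the steps with `make_pipeline`, which the original cells do not use. The names `toy_texts`, `toy_labels`, and `toy_model` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training sentences: 0 = New York, 1 = London, 2 = Paris.
toy_texts = [
    "the subway in brooklyn",
    "big apple yankees game",
    "tube to camden",
    "tea by the thames",
    "bonjour je suis paris",
    "le metro est en retard",
]
toy_labels = [0, 0, 1, 1, 2, 2]

# The pipeline vectorizes the text and then fits the classifier on the counts.
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(toy_texts, toy_labels)

# New sentences go through the same vectorizer before being classified.
new_sentences = ["yankees subway", "tea thames", "bonjour paris"]
toy_predictions = toy_model.predict(new_sentences)

for sentence, label in zip(new_sentences, toy_predictions):
    print(sentence, "->", label)
```

With the real data, the equivalent call would be `classifier.predict(counter.transform([some_sentence]))`, using the `counter` and `classifier` objects fitted above.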


# Evaluating The Model

Now that the classifier has made its predictions, I'll check how accurate they are by comparing them with the true test labels.

In [8]:
from sklearn.metrics import accuracy_score

accuracy_score(test_labels,predictions)

0.6779324055666004

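I'll also look at the **confusion matrix** to see where the errors come from. Each row corresponds to a true label and each column to a predicted label, in the order New York, London, Paris, so the diagonal holds the correctly classified tweets.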

In [9]:
from sklearn.metrics import confusion_matrix

confusion_matrix(test_labels,predictions)

array([[541, 404,  28],
       [203, 824,  34],
       [ 38, 103, 340]], dtype=int64)
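
The confusion matrix contains enough information to recover the accuracy and to break the result down per city. The sketch below copies the matrix printed above into a NumPy array and computes per-class recall and precision by hand. The names `cm`, `cities`, `recall`, and `precision` are hypothetical helpers, not part of the original notebook.

```python
import numpy as np

# Confusion matrix from the cell above: rows = true label, columns = predicted label.
cm = np.array([
    [541, 404, 28],
    [203, 824, 34],
    [38, 103, 340],
])
cities = ["New York", "London", "Paris"]

# Accuracy: correctly classified tweets (the diagonal) over all test tweets.
accuracy = np.trace(cm) / cm.sum()

# Recall: share of each city's tweets that were labeled correctly (row-wise).
recall = np.diag(cm) / cm.sum(axis=1)

# Precision: share of each predicted city that was actually right (column-wise).
precision = np.diag(cm) / cm.sum(axis=0)

print("accuracy:", round(accuracy, 4))
for city, r, p in zip(cities, recall, precision):
    print(f"{city}: recall={r:.3f}, precision={p:.3f}")
```

The diagonal sums to 1705 of 2515 test tweets, which matches the accuracy of about 0.678 reported earlier. New York has the lowest recall: 404 of its 973 test tweets were predicted as London, which may reflect the two cities sharing a language. Paris has the highest precision, since few New York or London tweets are predicted as Paris.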