# Naive Bayes Project: Classifying Tweets

The Twitter data is provided as JSON files by Codecademy: https://www.codecademy.com/paths/data-science/tracks/supervised-machine-learning-cumulative-project-skill-path/modules/supervised-learning-cumulative-project-skill-path/informationals/twitter-classification-cumulative-project-skill-path

The goal of this project is to build a classifier that takes any tweet (or sentence) and predicts whether it came from New York, London, or Paris.

# Investigating the Data

In [1]:
import pandas as pd

new_york_tweets = pd.read_json("new_york.json", lines=True)
print(len(new_york_tweets))
print(new_york_tweets.columns)
print(new_york_tweets.loc[12]["text"])

4723
Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'quote_count', 'reply_count',
       'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted',
       'filter_level', 'lang', 'timestamp_ms', 'extended_tweet',
       'possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status', 'quoted_status_permalink', 'extended_entities',
       'withheld_in_countries'],
      dtype='object')
Be best #ThursdayThoughts


In [2]:
london_tweets = pd.read_json("london.json", lines=True)
print(len(london_tweets))

5341


In [3]:
paris_tweets = pd.read_json("paris.json", lines=True)
print(len(paris_tweets))

2510


# Classifying using language: Naive Bayes Classifier

A Naive Bayes classifier is built next. First, all of the tweets are combined into a single list with matching labels: 0 for New York, 1 for London, and 2 for Paris.

In [4]:
# Converting to lists
new_york_text = new_york_tweets["text"].tolist()
london_text = london_tweets["text"].tolist()
paris_text = paris_tweets["text"].tolist()

# Creating one combined list with corresponding labels
all_tweets = new_york_text + london_text + paris_text
labels = [0] * len(new_york_text) + [1] * len(london_text) + [2] * len(paris_text)

In [5]:
# Splitting the data into training and test sets (80% / 20%)
from sklearn.model_selection import train_test_split

train_data, test_data, train_labels, test_labels = train_test_split(
    all_tweets, labels, test_size=0.2, random_state=1
)
print(len(train_data))
print(len(test_data))

10059
2515
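
The two sizes printed above can be checked by hand from the tweet counts of the three files. The sketch below assumes that the test share is rounded up to a whole number of tweets and that the rest goes to training; that rounding rule is an assumption about `train_test_split`, not something the notebook states.

```python
import math

# Tweet counts printed by the first three cells
n_new_york, n_london, n_paris = 4723, 5341, 2510
n_total = n_new_york + n_london + n_paris

test_size = 0.2
# Assumption: the test share is rounded up, the remainder is used for training
n_test = math.ceil(test_size * n_total)
n_train = n_total - n_test

print(n_total, n_train, n_test)
```

Under that assumption the arithmetic reproduces the 10059 training tweets and 2515 test tweets shown above.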


In [6]:
# Preparing data for the NB classifier by transforming text into count vectors
from sklearn.feature_extraction.text import CountVectorizer

counter = CountVectorizer()
counter.fit(train_data)

train_counts = counter.transform(train_data)
test_counts = counter.transform(test_data)

print(train_data[3])
print(train_counts[3])

saying bye is hard. Especially when youre saying bye to comfort.
  (0, 5022)	2
  (0, 6371)	1
  (0, 9552)	1
  (0, 12314)	1
  (0, 13903)	1
  (0, 23994)	2
  (0, 27146)	1
  (0, 29397)	1
  (0, 30274)	1
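
Each `(0, index)  count` pair above is one distinct word in the tweet together with how often it occurs. As a rough illustration, the same counts can be rebuilt with the standard library. The regular expression below is an assumption meant to approximate `CountVectorizer`'s default behaviour (lowercasing, tokens of at least two word characters); `count_tokens` is a hypothetical helper, not part of scikit-learn.

```python
import re
from collections import Counter

# Assumption: approximates CountVectorizer's default tokenization
# (lowercase the text, keep tokens of two or more word characters).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def count_tokens(text):
    """Return a token -> count mapping for one piece of text."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))

tweet = "saying bye is hard. Especially when youre saying bye to comfort."
counts = count_tokens(tweet)

# Print tokens alphabetically so they line up with increasing column indices
for token in sorted(counts):
    print(token, counts[token])
```

The tweet has nine distinct tokens, two of which ("bye" and "saying") appear twice, which agrees with the nine entries and the two counts of 2 in the sparse output above.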


In [7]:
# Training and testing the model
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)
predictions = classifier.predict(test_counts)
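
To give a feel for what a multinomial Naive Bayes model does with word counts, here is a minimal pure-Python sketch on a made-up corpus. It is a toy illustration with add-one smoothing, not scikit-learn's implementation; the function names, the `alpha` parameter, and the tiny corpus are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit a toy multinomial Naive Bayes model on tokenized documents."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> word -> count
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocabulary = {word for counts in word_counts.values() for word in counts}

    # Log prior: share of training documents in each class
    log_priors = {
        label: math.log(n / len(labels)) for label, n in class_doc_counts.items()
    }
    # Log likelihood of each word per class, smoothed by alpha
    log_likelihoods = {}
    for label, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocabulary)
        log_likelihoods[label] = {
            word: math.log((counts[word] + alpha) / total) for word in vocabulary
        }
    return log_priors, log_likelihoods

def predict(tokens, log_priors, log_likelihoods):
    """Return the label with the highest log posterior score."""
    scores = {}
    for label, log_prior in log_priors.items():
        # Words never seen in training are skipped
        scores[label] = log_prior + sum(
            log_likelihoods[label][word]
            for word in tokens
            if word in log_likelihoods[label]
        )
    return max(scores, key=scores.get)

# Hypothetical toy corpus: 0 = New York, 1 = London
docs = [["subway", "vacation"], ["giants", "subway"],
        ["tube", "holiday"], ["spurs", "tube"]]
labels = [0, 0, 1, 1]
model = train_multinomial_nb(docs, labels)
print(predict(["on", "holiday"], *model))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that is missing from one class from driving that class's probability to zero.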

# Evaluation

In [8]:
# Evaluating by accuracy score
from sklearn.metrics import accuracy_score

print(accuracy_score(test_labels, predictions))

0.6779324055666004


In [9]:
# Evaluating by confusion matrix
from sklearn.metrics import confusion_matrix

print(confusion_matrix(test_labels, predictions))

[[541 404  28]
 [203 824  34]
 [ 38 103 340]]
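
The confusion matrix can be read further with a little arithmetic. The sketch below copies the printed matrix by hand and assumes the usual scikit-learn layout (rows are true labels, columns are predicted labels) to derive the accuracy, a majority-class baseline, and per-city precision and recall.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions
matrix = [
    [541, 404, 28],   # true New York
    [203, 824, 34],   # true London
    [38, 103, 340],   # true Paris
]
cities = ["New York", "London", "Paris"]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total
print(f"accuracy: {accuracy:.4f}")

# Baseline: always guess the most common true class in the test set
baseline = max(sum(row) for row in matrix) / total
print(f"majority-class baseline: {baseline:.4f}")

for i, city in enumerate(cities):
    row_total = sum(matrix[i])                 # tweets truly from this city
    col_total = sum(row[i] for row in matrix)  # tweets predicted as this city
    recall = matrix[i][i] / row_total
    precision = matrix[i][i] / col_total
    print(f"{city}: precision={precision:.3f}, recall={recall:.3f}")
```

The diagonal sums to 1705 of 2515 test tweets, which matches the accuracy score above. The largest off-diagonal entry shows where the model struggles most: 404 of the 973 New York test tweets are predicted as London.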


# Further Evaluation with Sample Tweets

In [10]:
tweet1 = "Let's go Giants!"
tweet1_counts = counter.transform([tweet1])
classifier.predict(tweet1_counts)

array([0])

In [11]:
tweet2 = "Come on you Spurs!"
tweet2_counts = counter.transform([tweet2])
classifier.predict(tweet2_counts)

array([1])

In [12]:
tweet3 = "Allez les bleus!"
tweet3_counts = counter.transform([tweet3])
classifier.predict(tweet3_counts)

array([2])

In [13]:
tweet4 = "They went on vacation."
tweet4_counts = counter.transform([tweet4])
classifier.predict(tweet4_counts)

array([0])

In [14]:
tweet5 = "They went on holiday."
tweet5_counts = counter.transform([tweet5])
classifier.predict(tweet5_counts)

array([1])
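
The sample predictions above come back as numeric labels. A small wrapper can translate them into city names using the encoding chosen earlier (0 for New York, 1 for London, 2 for Paris). `predict_city` and `CITY_NAMES` are hypothetical names introduced here for illustration; the function expects the fitted `counter` and `classifier` from the earlier cells.

```python
# Label encoding used when the training labels were built
CITY_NAMES = {0: "New York", 1: "London", 2: "Paris"}

def predict_city(tweet, vectorizer, model):
    """Return the predicted city name for a single tweet.

    Hypothetical helper: `vectorizer` and `model` are expected to be the
    fitted `counter` and `classifier` objects from the cells above.
    """
    counts = vectorizer.transform([tweet])
    label = int(model.predict(counts)[0])
    return CITY_NAMES[label]

# Usage in the notebook (requires the fitted objects from earlier cells):
# predict_city("They went on holiday.", counter, classifier)
```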