# Classifying Tweets

This project uses a Naive Bayes Classifier to find patterns in real tweets. We have three files: `new_york.json`, `london.json`, and `paris.json`. These files contain tweets that were gathered from those locations.

The goal is to create a classification algorithm that can classify any tweet (or sentence) and predict whether that sentence came from New York, London, or Paris.

# Investigate the Data

To begin, let's take a look at the data. First, we import `new_york.json` and print the following information:
* The number of tweets.
* The columns, or features, of a tweet.
* The text of the tweet at index 20 in the New York dataset.

In [1]:
import pandas as pd
import string
import re

In [2]:
new_york_tweets = pd.read_json("new_york.json", lines=True)
print('There are {} records in New York.'.format(new_york_tweets.shape[0]))
print("Columns of the New York tweets:",new_york_tweets.columns)
print(new_york_tweets.loc[20]["text"])

There are 4723 records in New York.
Columns of the New York tweets: Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'quote_count', 'reply_count',
       'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted',
       'filter_level', 'lang', 'timestamp_ms', 'extended_tweet',
       'possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status', 'quoted_status_permalink', 'extended_entities',
       'withheld_in_countries'],
      dtype='object')
@BesawKyle @barstoolsports Ed Cooley has an autoimmune disease that many people suffer from called Alopecia. You th… https://t.co/ldEMLrkguP


Let's load the London and Paris tweets into DataFrames named `london_tweets` and `paris_tweets` as well.

In [3]:
london_tweets = pd.read_json("london.json", lines = True)
print('There are {} records in London.'.format(london_tweets.shape[0]))
print("Columns of the London tweets:", london_tweets.columns)


paris_tweets = pd.read_json("paris.json", lines = True)
print('There are {} records in Paris.'.format(paris_tweets.shape[0]))
print("Columns of the Paris tweets:", paris_tweets.columns)

There are 5341 records in London.
Columns of the London tweets: Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'extended_tweet', 'quote_count',
       'reply_count', 'retweet_count', 'favorite_count', 'entities',
       'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms',
       'possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status', 'quoted_status_permalink', 'extended_entities'],
      dtype='object')
There are 2510 records in Paris.
Columns of the Paris tweets: Index(['created_at', 'id', 'id_str', 'text', 'source', 'truncated',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_sc

# Classifying using language: Naive Bayes Classifier

We're going to create a Naive Bayes Classifier! Let's begin by looking at the way language is used differently in these three locations. Let's grab the text of all of the tweets and make it one big list. In the code block below, we create three lists of tweet text: `new_york_text`, `london_text`, and `paris_text`.

Then combine all three into a variable named `all_tweets` by using the `+` operator.

Let's also make the labels associated with those tweets. `0` represents a New York tweet, `1` represents a London tweet, and `2` represents a Paris tweet. Finish the definition of `labels`.

In [4]:
new_york_text = new_york_tweets["text"].tolist()
london_text = london_tweets["text"].tolist()
paris_text = paris_tweets["text"].tolist()

all_tweets = new_york_text + london_text + paris_text
labels = [0] * len(new_york_text) + [1] * len(london_text) + [2] * len(paris_text)

In [5]:
all_tweets

['@DelgadoforNY19 Calendar marked.',
 'petition to ban more than one spritz of cologne',
 'People really be making up beef with you in they head lol',
 '30 years old.. wow what a journey... I moved to NYC at 22 young and dumb, without even $100 in my bank account and… https://t.co/awjzsvoGS7',
 'At first glance it looked like asparagus with chicken and gravy smothered over it or potatoes. She gotta be extra w… https://t.co/InBNnsKuWu',
 'texting me bullshit i just swipe and delete it',
 'Nailed it. https://t.co/dYYvyYVnxZ',
 '🗽Cammy Set for tomboyfeels \nCop @ https://t.co/eaNB5dNIdG (custom pieces 2)\nShot by lexi_vv_photography \nCreative D… https://t.co/25N9vMi97j',
 '@notepinuch Thank you ka 😂',
 "I'm at Crunch - Bushwick - @crunchgym in Brooklyn, NY https://t.co/WRGDRsEkPD",
 'Good day please make you tune in Thank you🙏🏿 https://t.co/5zVHN0LQ27',
 '10 Clear Quintuple 5 Disc CD Jewel Case $19.20 #FreeShip https://t.co/7JyD4NpD5s #CD #Jewel #Cases #Generic https://t.co/lHe81SqFTC',


# Making a Training and Test Set

We can now break our data into a training set and a test set. We'll use scikit-learn's `train_test_split` function to do this split. This function takes two required parameters: the data, followed by the labels. Set the optional parameter `test_size` to `0.2`. Finally, set the optional parameter `random_state` to `1` so your data is split the same way as mine.

Remember, this function returns 4 items in this order:
1. The training data
2. The testing data
3. The training labels
4. The testing labels

Store the results in variables named `train_data`, `test_data`, `train_labels`, and `test_labels`.

In [6]:
from sklearn.model_selection import train_test_split

train_data, test_data, train_labels, test_labels = train_test_split(all_tweets, labels, 
                                                                    test_size = 0.2,
                                                                    random_state = 1)


# Making the Count Vectors

To use a Naive Bayes Classifier, we need to transform our lists of words into count vectors. Recall that this changes the sentence `"I love New York, New York"` into a list that contains:

* Two `1`s because the words `"I"` and `"love"` each appear once.
* Two `2`s because the words `"New"` and `"York"` each appear twice.
* Many `0`s because every other word in the training set didn't appear at all.
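As a rough illustration of the counting described above, here is a standard-library sketch. It is not scikit-learn's implementation, and the helper name `count_tokens` is hypothetical. It assumes `CountVectorizer`'s default settings, which lowercase the text and keep only tokens of two or more word characters, so in practice the one-letter word `"I"` is dropped and `"New"` is counted as `"new"`.

```python
import re
from collections import Counter

def count_tokens(sentence):
    """Count words roughly the way a default CountVectorizer tokenizes text."""
    # Lowercase, then keep runs of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", sentence.lower())
    return Counter(tokens)

sentence_counts = count_tokens("I love New York, New York")
print(dict(sentence_counts))  # {'love': 1, 'new': 2, 'york': 2}
```

A real count vector also has one slot for every other word in the training vocabulary, which is where the many `0`s come from.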

To start, create a `CountVectorizer` named `counter`.

Next, call the `.fit()` method using `train_data` as a parameter. This teaches the counter our vocabulary.

Finally, let's transform `train_data` and `test_data` into Count Vectors. Call `counter`'s `.transform()` method using `train_data` as a parameter and store the result in `train_counts`. Do the same for `test_data` and store the result in `test_counts`.

Print `train_data[3]` and `train_counts[3]` to see what a tweet looks like as a Count Vector.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

counter = CountVectorizer()
counter.fit(train_data)

train_counts = counter.transform(train_data)
test_counts = counter.transform(test_data)

print(train_data[3], train_counts[3])


saying bye is hard. Especially when youre saying bye to comfort.   (0, 5022)	2
  (0, 6371)	1
  (0, 9552)	1
  (0, 12314)	1
  (0, 13903)	1
  (0, 23994)	2
  (0, 27146)	1
  (0, 29397)	1
  (0, 30274)	1


# Train and Test the Naive Bayes Classifier

We now have the inputs to our classifier. Let's use the CountVectors to train and test the Naive Bayes Classifier!

First, make a `MultinomialNB` named `classifier`.

Next, call `classifier`'s `.fit()` method. This method takes two parameters: the training data and the training labels. `train_counts` contains the training data and `train_labels` contains the labels for that data.

Calling `.fit()` calculates all of the probabilities used in Bayes' Theorem. The model is now ready to quickly predict the location of a new tweet.
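To make "the probabilities used in Bayes' Theorem" more concrete, here is a minimal standard-library sketch of a multinomial Naive Bayes with add-one (Laplace) smoothing. It is an illustration under stated assumptions, not scikit-learn's code: the function names `train_nb` and `predict_nb` and the three toy documents are invented, and tokens are split on whitespace only.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {word for doc in docs for word in doc.split()}
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(docs)) for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c in class_doc_counts:
        # Smoothed denominator: words seen in class c plus alpha per vocabulary word.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict_nb(model, doc):
    """Return the class with the highest log posterior (unknown words are ignored)."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in doc.split() if w in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Invented toy data: 0 = New York, 1 = London, 2 = Paris.
toy_docs = ["subway bagel subway", "tube queue tube", "metro baguette metro"]
model = train_nb(toy_docs, [0, 1, 2])
print(predict_nb(model, "tube delay"))  # 1
```

Working in log space avoids multiplying many tiny probabilities together, which is why the scores are sums rather than products.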

Finally, let's test our model. Call `classifier`'s `.predict()` method using `test_counts` as a parameter. Store the results in a variable named `predictions`.

In [8]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)
predictions = classifier.predict(test_counts)


# Evaluating Your Model

Now that the classifier has made its predictions, let's see how well it did. Let's look at two different ways to do this. First, call scikit-learn's `accuracy_score` function. This function should take two parameters: the `test_labels` and your `predictions`. Print the results. This prints the fraction of tweets in the test set that the classifier classified correctly.



In [9]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(test_labels, predictions)
print(accuracy)


0.6779324055666004
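To put this number in context, it helps to compare it with a trivial baseline that always guesses the most common city. The sketch below uses only the record counts printed when the files were loaded; it computes the baseline over the whole dataset, so the figure for the test set alone may differ slightly.

```python
# Record counts reported above when each file was loaded.
class_sizes = {"New York": 4723, "London": 5341, "Paris": 2510}

total_tweets = sum(class_sizes.values())
largest_city = max(class_sizes, key=class_sizes.get)

# Accuracy of a classifier that always predicts the largest class.
baseline_accuracy = class_sizes[largest_city] / total_tweets

print(largest_city, round(baseline_accuracy, 3))  # London 0.425
```

An accuracy of about 0.68 is therefore well above what always guessing London would achieve.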


The other way you can evaluate your model is by looking at the **confusion matrix**. A confusion matrix is a table that describes how your classifier made its predictions. For example, if there were two labels, A and B, a confusion matrix might look like this:

```
9 1
3 5
```

In this example, the first row shows how the classifier classified the true A's. It guessed that 9 of them were A's and 1 of them was a B. The second row shows how the classifier did on the true B's. It guessed that 3 of them were A's and 5 of them were B's.

For our project using tweets, there were three classes: `0` for New York, `1` for London, and `2` for Paris. You can see the confusion matrix by printing the result of the `confusion_matrix` function using `test_labels` and `predictions` as parameters.

In [10]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(test_labels, predictions))


[[541 404  28]
 [203 824  34]
 [ 38 103 340]]
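The matrix can be turned into per-city numbers with a few lines of plain Python. This sketch copies the values printed above, treats rows as true labels and columns as predictions, and reports recall (the share of a city's tweets that were found) and precision (the share of predictions for a city that were right). The variable names are illustrative.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
matrix = [
    [541, 404, 28],
    [203, 824, 34],
    [38, 103, 340],
]
cities = ["New York", "London", "Paris"]

correct = sum(matrix[i][i] for i in range(len(cities)))
total = sum(sum(row) for row in matrix)
accuracy_from_matrix = correct / total

recalls, precisions = {}, {}
for i, city in enumerate(cities):
    recalls[city] = matrix[i][i] / sum(matrix[i])                   # row total
    precisions[city] = matrix[i][i] / sum(row[i] for row in matrix)  # column total
    print(f"{city}: recall={recalls[city]:.3f}, precision={precisions[city]:.3f}")

print(f"accuracy={accuracy_from_matrix:.4f}")  # accuracy=0.6779
```

The diagonal sum divided by the total reproduces the accuracy score reported earlier.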


# Test Your Own Tweet

The classifier labels tweets that were actually from New York as either New York or London tweets, but almost never as Paris tweets (28 of 973). Similarly, most tweets that were actually from Paris are classified correctly (340 of 481). Tweets coming from two English-speaking cities are harder to distinguish than tweets written in different languages.

Now it's your chance to write a tweet and see how the classifier works! Create a string and store it in a variable named `tweet`. 

Call `counter`'s `.transform()` method using `[tweet]` as a parameter. Save the result as `tweet_counts`. Notice that your variable has to be wrapped in a list: `.transform()` expects a list of strings, not a single string.

Finally, pass `tweet_counts` as a parameter to `classifier`'s `.predict()` method. Print the result. This should give you the prediction for the tweet. Remember that `0` represents New York, `1` represents London, and `2` represents Paris. Can you write different tweets that the classifier predicts as being from New York, London, and Paris?

In [11]:
tweet = "London is a great city. London calling!"
tweet_counts = counter.transform([tweet])
print(classifier.predict(tweet_counts))

[1]


My tweet is classified as a tweet from London. Great!
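When experimenting with your own tweets, it can also help to see how confident the classifier is rather than only the winning label. `MultinomialNB` offers a `.predict_proba()` method for this. The sketch below is self-contained and uses a handful of invented toy tweets instead of the real data, so the numbers it prints say nothing about the actual model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy tweets: 0 = New York, 1 = London, 2 = Paris.
toy_tweets = [
    "bagel on the subway", "yankees game tonight",
    "tube delay again", "rainy day on the tube",
    "baguette et croissant", "le metro est en retard",
]
toy_labels = [0, 0, 1, 1, 2, 2]

toy_counter = CountVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_counter.fit_transform(toy_tweets), toy_labels)

# One row per input tweet, one column per class (in sorted label order).
toy_tweet_counts = toy_counter.transform(["stuck on the tube again"])
toy_probabilities = toy_classifier.predict_proba(toy_tweet_counts)[0]

for city, probability in zip(["New York", "London", "Paris"], toy_probabilities):
    print(f"{city}: {probability:.3f}")
```

With the real `counter` and `classifier`, the same call would be `classifier.predict_proba(tweet_counts)`.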

# Use Cross-Validation to Check the Robustness of the Naive Bayes Classifier

In [12]:
from sklearn.model_selection import cross_val_score

# Count vectors for every tweet (no train/test split); `counter` was fit on train_data above
all_counts = counter.transform(all_tweets)
scores1 = cross_val_score(classifier, all_counts, labels, cv=5, scoring="accuracy")

print("All accuracy scores from folds:", scores1)
print("Mean accuracy score from the cross validation:", scores1.mean())

All accuracy scores from folds: [0.64864865 0.66202783 0.66918489 0.67064439 0.67501989]
Mean accuracy score from the cross validation: 0.6651051304677045


The mean cross-validated accuracy is about 66.5%, close to the 67.8% we measured on the single train/test split, which suggests that the earlier result was not an artifact of one particular split.
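One caveat about the cell above: `counter` was fit on `train_data` only and then applied to every tweet, so the vocabulary is not re-learned inside each fold. A common alternative is to wrap the vectorizer and the classifier in a scikit-learn pipeline and cross-validate the raw text. The following is a minimal sketch of that idea on invented toy tweets, not a rerun of the experiment above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy stand-ins for all_tweets and labels (0 = New York, 1 = London, 2 = Paris).
toy_tweets = [
    "bagel on the subway", "yankees game tonight",
    "subway delays again", "bagel and coffee",
    "tube delay again", "rainy day on the tube",
    "tea and the tube", "rainy london morning",
    "baguette et croissant", "le metro est en retard",
    "un croissant au metro", "paris est belle",
]
toy_labels = [0] * 4 + [1] * 4 + [2] * 4

# The vectorizer is refit on the training part of every fold.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
toy_scores = cross_val_score(pipeline, toy_tweets, toy_labels, cv=2, scoring="accuracy")

print(toy_scores, toy_scores.mean())
```

On the real data, the equivalent call would pass `all_tweets` and `labels` with `cv=5`.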

So far we have not applied any text normalization. Normalization steps such as removing punctuation or stopwords may give more accurate results. Let's try removing punctuation, digits, and very short words, then retrain the classifier and see what we get.

In [13]:
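def remove_punct(text):
    # Drop every ASCII punctuation character (this also strips "@", "#", and URL separators)
    text = "".join([char for char in text if char not in string.punctuation])
    # Remove runs of digits
    text = re.sub('[0-9]+', '', text)
    # Remove words of one to three characters, a rough stand-in for stopword removal
    text = re.sub(r'\b\w{1,3}\b', '', text)
    # Replace newlines with spaces
    text = text.replace('\n', ' ')

    return text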
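
To see what this function actually does, the self-contained sketch below repeats it under the hypothetical name `clean_demo` and applies it to three New York tweets shown earlier. Beyond punctuation, it also strips digits and words of three characters or fewer, and a URL collapses into a single token because its separators are punctuation.

```python
import re
import string

def clean_demo(text):
    """Same steps as remove_punct, repeated here so the example runs on its own."""
    text = "".join(char for char in text if char not in string.punctuation)
    text = re.sub(r"[0-9]+", "", text)        # digits
    text = re.sub(r"\b\w{1,3}\b", "", text)   # words of 1 to 3 characters
    return text.replace("\n", " ")

samples = [
    "@DelgadoforNY19 Calendar marked.",
    "petition to ban more than one spritz of cologne",
    "Nailed it. https://t.co/dYYvyYVnxZ",
]
cleaned = [clean_demo(sample) for sample in samples]
for sample, result in zip(samples, cleaned):
    print(repr(sample), "->", repr(result))
```

The removed words leave their surrounding spaces behind, which is why the cleaned text contains runs of spaces. `CountVectorizer` ignores this extra whitespace when it tokenizes.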

In [14]:
new_york_tweets['Tweet'] = new_york_tweets['text'].apply(remove_punct)
new_york_tweets.head(10)

Unnamed: 0,created_at,id,id_str,text,display_text_range,source,truncated,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,...,timestamp_ms,extended_tweet,possibly_sensitive,quoted_status_id,quoted_status_id_str,quoted_status,quoted_status_permalink,extended_entities,withheld_in_countries,Tweet
0,2018-07-26 13:32:33+00:00,1022474755625164800,1022474755625164800,@DelgadoforNY19 Calendar marked.,"[16, 32]","<a href=""http://twitter.com/download/android"" ...",False,1.022208e+18,1.022208e+18,8.290618e+17,...,2018-07-26 13:32:33.060,,,,,,,,,DelgadoforNY Calendar marked
1,2018-07-26 13:32:34+00:00,1022474762491183104,1022474762491183104,petition to ban more than one spritz of cologne,,"<a href=""http://twitter.com/download/iphone"" r...",False,,,,...,2018-07-26 13:32:34.697,,,,,,,,,petition more than spritz cologne
2,2018-07-26 13:32:35+00:00,1022474765750226945,1022474765750226944,People really be making up beef with you in th...,,"<a href=""http://twitter.com/download/iphone"" r...",False,,,,...,2018-07-26 13:32:35.474,,,,,,,,,People really making beef with they head
3,2018-07-26 13:32:36+00:00,1022474768736546816,1022474768736546816,30 years old.. wow what a journey... I moved t...,,"<a href=""http://instagram.com"" rel=""nofollow"">...",True,,,,...,2018-07-26 13:32:36.186,{'full_text': '30 years old.. wow what a journ...,0.0,,,,,,,years what journey moved young dumb ...
4,2018-07-26 13:32:36+00:00,1022474769260838913,1022474769260838912,At first glance it looked like asparagus with ...,,"<a href=""http://twitter.com/download/iphone"" r...",True,,,,...,2018-07-26 13:32:36.311,{'full_text': 'At first glance it looked like ...,,,,,,,,first glance looked like asparagus with chic...
5,2018-07-26 13:32:36+00:00,1022474771093708800,1022474771093708800,texting me bullshit i just swipe and delete it,,"<a href=""http://twitter.com/download/iphone"" r...",False,,,,...,2018-07-26 13:32:36.748,,,,,,,,,texting bullshit just swipe delete
6,2018-07-26 13:32:38+00:00,1022474776521175040,1022474776521175040,Nailed it. https://t.co/dYYvyYVnxZ,"[0, 10]","<a href=""http://twitter.com/download/iphone"" r...",False,,,,...,2018-07-26 13:32:38.042,,0.0,1.022439e+18,1.022439e+18,{'created_at': 'Thu Jul 26 11:09:46 +0000 2018...,"{'url': 'https://t.co/dYYvyYVnxZ', 'expanded':...",,,Nailed httpstcodYYvyYVnxZ
7,2018-07-26 13:32:39+00:00,1022474781373988864,1022474781373988864,🗽Cammy Set for tomboyfeels \nCop @ https://t.c...,,"<a href=""http://instagram.com"" rel=""nofollow"">...",True,,,,...,2018-07-26 13:32:39.199,{'full_text': '🗽Cammy Set for tomboyfeels Cop...,0.0,,,,,,,🗽Cammy tomboyfeels httpstcoeaNBdNIdG cust...
8,2018-07-26 13:32:39+00:00,1022474783064313856,1022474783064313856,@notepinuch Thank you ka 😂,"[12, 26]","<a href=""http://twitter.com/download/iphone"" r...",False,1.021947e+18,1.021947e+18,605984200.0,...,2018-07-26 13:32:39.602,,,,,,,,,notepinuch Thank 😂
9,2018-07-26 13:32:40+00:00,1022474788730793984,1022474788730793984,I'm at Crunch - Bushwick - @crunchgym in Brook...,,"<a href=""http://foursquare.com"" rel=""nofollow""...",False,,,,...,2018-07-26 13:32:40.953,,0.0,,,,,,,Crunch Bushwick crunchgym Brooklyn https...


In [15]:
london_tweets['Tweet'] = london_tweets['text'].apply(remove_punct)
london_tweets.head(10)

Unnamed: 0,created_at,id,id_str,text,display_text_range,source,truncated,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,...,filter_level,lang,timestamp_ms,possibly_sensitive,quoted_status_id,quoted_status_id_str,quoted_status,quoted_status_permalink,extended_entities,Tweet
0,2018-07-26 13:39:30+00:00,1022476504855400449,1022476504855400448,@bbclaurak i agree Laura but the Party you see...,"[11, 140]","<a href=""http://twitter.com/download/iphone"" r...",True,1.022447e+18,1.022447e+18,61183570.0,...,low,en,2018-07-26 13:39:30.109,,,,,,,bbclaurak agree Laura Party seem support ...
1,2018-07-26 13:39:30+00:00,1022476506075942912,1022476506075942912,@masturbacaolove Why?,"[17, 21]","<a href=""http://twitter.com/download/iphone"" r...",False,1.021997e+18,1.021997e+18,9.003777e+17,...,low,und,2018-07-26 13:39:30.400,,,,,,,masturbacaolove
2,2018-07-26 13:39:31+00:00,1022476510089949190,1022476510089949184,@JackRobinson80 @pgroresearch Yeah not great b...,"[30, 65]","<a href=""http://twitter.com/download/iphone"" r...",False,1.022444e+18,1.022444e+18,735563300.0,...,low,en,2018-07-26 13:39:31.357,,,,,,,JackRobinson pgroresearch Yeah great quality...
3,2018-07-26 13:39:33+00:00,1022476519845883905,1022476519845883904,Penalty shit out Arsenal,,"<a href=""http://twitter.com/download/iphone"" r...",False,,,,...,low,en,2018-07-26 13:39:33.683,,,,,,,Penalty shit Arsenal
4,2018-07-26 13:39:36+00:00,1022476532684648448,1022476532684648448,Obviously need some pen practice 🙈,,"<a href=""http://twitter.com/download/iphone"" r...",False,,,,...,low,en,2018-07-26 13:39:36.744,,,,,,,Obviously need some practice 🙈
5,2018-07-26 13:39:37+00:00,1022476535058583552,1022476535058583552,What’s cooler than being cool? \n - Ice Cold m...,,"<a href=""http://instagram.com"" rel=""nofollow"">...",True,,,,...,low,en,2018-07-26 13:39:37.310,0.0,,,,,,What’ cooler than being cool Cold matcha l...
6,2018-07-26 13:39:38+00:00,1022476540540583936,1022476540540583936,@daosanchez26 tell Mina to come to #thfc,,"<a href=""http://twitter.com/download/iphone"" r...",False,,,7.030316e+17,...,low,en,2018-07-26 13:39:38.617,,,,,,,daosanchez tell Mina come thfc
7,2018-07-26 13:39:39+00:00,1022476543073894400,1022476543073894400,@OneSteveCoppell @_alexgstone @NickMLong 😂😂😂 I...,"[41, 66]","<a href=""http://twitter.com/download/android"" ...",False,1.022476e+18,1.022476e+18,1094977000.0,...,low,en,2018-07-26 13:39:39.221,,,,,,,OneSteveCoppell alexgstone NickMLong 😂😂😂 gett...
8,2018-07-26 13:39:39+00:00,1022476545187885063,1022476545187885056,"London is the hottest at the moment, travellin...","[0, 140]","<a href=""http://twitter.com/download/iphone"" r...",True,,,,...,low,en,2018-07-26 13:39:39.725,1.0,1.022474e+18,1.022474e+18,{'created_at': 'Thu Jul 26 13:27:48 +0000 2018...,"{'url': 'https://t.co/sZx7GAze1a', 'expanded':...",,London hottest moment travelling tubes ...
9,2018-07-26 13:39:39+00:00,1022476545972158464,1022476545972158464,@gemmyred @BadassWomensHr Thank you 😍,"[26, 37]","<a href=""http://twitter.com/download/iphone"" r...",False,1.022457e+18,1.022457e+18,248853600.0,...,low,en,2018-07-26 13:39:39.912,,,,,,,gemmyred BadassWomensHr Thank 😍


In [16]:
paris_tweets['Tweet'] = paris_tweets['text'].apply(remove_punct)
paris_tweets.head(10)

Unnamed: 0,created_at,id,id_str,text,source,truncated,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,...,timestamp_ms,display_text_range,extended_entities,possibly_sensitive,quoted_status_id,quoted_status_id_str,quoted_status,quoted_status_permalink,extended_tweet,Tweet
0,2018-07-27 17:40:45+00:00,1022899608396156928,1022899608396156928,Bulletin météo parisien : des grêlons énormes ...,"<a href=""http://twitter.com/download/android"" ...",False,,,,,...,2018-07-27 17:40:45.854,,,,,,,,,Bulletin météo parisien grêlons énormes saba...
1,2018-07-27 17:40:47+00:00,1022899613550956544,1022899613550956544,Prêt pour le match #USORCL https://t.co/V5jw0S...,"<a href=""http://twitter.com/download/iphone"" r...",False,,,,,...,2018-07-27 17:40:47.083,"[0, 26]","{'media': [{'id': 1022899599336525825, 'id_str...",0.0,,,,,,Prêt pour match USORCL httpstcoVjwSdNFN
2,2018-07-27 17:40:50+00:00,1022899626041651200,1022899626041651200,MAIS QOIDBDNND'SLS'SLSLLSLS''D DBDODNDNODJDBKD...,"<a href=""http://twitter.com/download/android"" ...",False,,,,,...,2018-07-27 17:40:50.061,"[0, 111]","{'media': [{'id': 1022899571884744706, 'id_str...",0.0,,,,,,MAIS QOIDBDNNDSLSSLSLLSLSD DBDODNDNODJDBKDLDLD...
3,2018-07-27 17:40:57+00:00,1022899655347249152,1022899655347249152,@ToursFC Où peut on le championnat de National...,"<a href=""http://twitter.com/download/android"" ...",False,1.022888e+18,1.022888e+18,978599200.0,978599200.0,...,2018-07-27 17:40:57.048,"[9, 50]",,,,,,,,ToursFC peut championnat National
4,2018-07-27 17:40:57+00:00,1022899656685223936,1022899656685223936,Les tismey ils sont bas qu’a tromper leur go e...,"<a href=""http://twitter.com/download/iphone"" r...",False,,,,,...,2018-07-27 17:40:57.367,,,,,,,,,tismey sont ’ tromper leur faire putes
5,2018-07-27 17:41:02+00:00,1022899678541762560,1022899678541762560,"Finally, rain in Paris. #aurevoirlahaut @ Pari...","<a href=""http://instagram.com"" rel=""nofollow"">...",False,,,,,...,2018-07-27 17:41:02.578,,,0.0,,,,,,Finally rain Paris aurevoirlahaut Paris Fran...
6,2018-07-27 17:41:03+00:00,1022899683411349510,1022899683411349504,Ya des balles de golf qui tombent dans ma cham...,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",False,,,,,...,2018-07-27 17:41:03.739,,,,,,,,,balles golf tombent dans chambre merde
7,2018-07-27 17:41:20+00:00,1022899755750486016,1022899755750486016,"En l'espace de jeudi dernier à ce soir, ça va ...","<a href=""http://twitter.com/download/android"" ...",False,,,,,...,2018-07-27 17:41:20.986,,,,,,,,,lespace jeudi dernier soir déjà faire s...
8,2018-07-27 17:41:20+00:00,1022899754748059648,1022899754748059648,"""تصدقين؟ انك في عمري لي عمر \nوانك حكايات المط...","<a href=""http://twitter.com/download/iphone"" r...",False,1.022894e+18,1.022894e+18,8.681985e+17,8.681985e+17,...,2018-07-27 17:41:20.747,"[0, 48]","{'media': [{'id': 1022899688964612097, 'id_str...",0.0,,,,,,تصدقين؟ عمري وانك حكايات المطر httpstcozRmBu
9,2018-07-27 17:41:22+00:00,1022899762855636993,1022899762855636992,@FBB_PORTEPAROLE @YanThoinet Content de voir N...,"<a href=""http://twitter.com/download/iphone"" r...",False,1.021781e+18,1.021781e+18,613617600.0,613617600.0,...,2018-07-27 17:41:22.680,"[29, 64]",,,,,,,,FBBPORTEPAROLE YanThoinet Content voir Nemo ...


In [19]:
new_york_text_punct = new_york_tweets["Tweet"].tolist()
london_text_punct = london_tweets["Tweet"].tolist()
paris_text_punct = paris_tweets["Tweet"].tolist()

all_tweets = new_york_text_punct + london_text_punct + paris_text_punct
labels = [0] * len(new_york_text_punct) + [1] * len(london_text_punct) + [2] * len(paris_text_punct)

In [20]:
all_tweets

['DelgadoforNY Calendar marked',
 'petition   more than  spritz  cologne',
 'People really  making  beef with   they head ',
 ' years   what  journey  moved     young  dumb without even    bank account … httpstcoawjzsvoGS',
 ' first glance  looked like asparagus with chicken  gravy smothered over   potatoes  gotta  extra … httpstcoInBNnsKuWu',
 'texting  bullshit  just swipe  delete ',
 'Nailed  httpstcodYYvyYVnxZ',
 '🗽Cammy   tomboyfeels    httpstcoeaNBdNIdG custom pieces  Shot  lexivvphotography  Creative … httpstcoNvMij',
 'notepinuch Thank   😂',
 '  Crunch  Bushwick  crunchgym  Brooklyn  httpstcoWRGDRsEkPD',
 'Good  please make  tune  Thank 🙏🏿 httpstcozVHNLQ',
 ' Clear Quintuple  Disc  Jewel Case  FreeShip httpstcoJyDNpDs  Jewel Cases Generic httpstcolHeSqFTC',
 ' best ThursdayThoughts',
 '  👋🏾 ️⃣️⃣ —   were  write  book entitled “ year that changed  life” What  would    … httpstcoopcWD',
 'davidfrum That’   this works Subjects have     they  covered That’  ’ known   ‘Free Press’',

In [21]:
train_data, test_data, train_labels, test_labels = train_test_split(all_tweets, labels, 
                                                                    test_size = 0.2,
                                                                    random_state = 1)

In [22]:
train_data

['lickedspoon nicmillerstale Unless they  slathered  Vaseline',
 ' paris     nuages ',
 '  course  rhetoric fools  usual useful idiots   codleft  httpstcoTYYUtCL',
 'saying   hard Especially when youre saying   comfort',
 'httpstcokmuaJXGDSB',
 'TonyCBaH simonjpaine PaulaWBaH wwardrobebl    Tony  need  help',
 'JessicaPage    ’ amazing either 😭😭',
 'SprtsTalkJo   banned once because MichaelRapaport tweeted that Kobe  better than Lebron   replied “… httpstcoldNgzUEH',
 'RPGorman ’  great group  kids   course they have  great teacher',
 'This   kind  weather that explains    like they  ‘smuggling cats’   tube httpstcoPdwzEHWmI',
 '   docked along Lake Huron  PureMichigan httpstcomZoRSXmI',
 'httpstcouEkQQS',
 '“From  experience  best  tech CMOs  often sitting  other roles  ’ know they  CMOs Finding … httpstcoMdOPuBB',
 'Cuánto extraño   míos ♥️♥️♥️♥️♥️♥️🙏🙏🙏🙏   York  York httpstcooEgePLsGN',
 'HamidMirPAK  MansoorAli TalatHussain',
 '  Ledger Building West India Quay This  nice there’  co

In [23]:
counter = CountVectorizer()
counter.fit(train_data)

train_counts = counter.transform(train_data)
test_counts = counter.transform(test_data)

print(train_data[3], train_counts[3])


saying   hard Especially when youre saying   comfort   (0, 4136)	1
  (0, 6796)	1
  (0, 9007)	1
  (0, 23714)	2
  (0, 28098)	1
  (0, 28569)	1


In [25]:
classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)
predictions = classifier.predict(test_counts)

In [26]:
accuracy = accuracy_score(test_labels, predictions)
print(accuracy)


0.6819085487077535


In [27]:
print(confusion_matrix(test_labels, predictions))

[[538 425  10]
 [209 833  19]
 [ 33 104 344]]


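The accuracy rose from 0.678 to 0.682, a gain of about 0.4 percentage points (10 more test tweets classified correctly out of 2,515), just from this simple cleanup of punctuation, digits, and short words.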