# Clustering tweets about Machine Learning using self-organizing maps

*Arnaud Le Doeuff*
*Ignacio Dorado*
*11/20*

## Useful links

- https://github.com/RodolfoFerro/pandas_twitter/blob/master/01-extracting-data.md

## Importing modules

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tweepy
import csv
import re

from tensorflow.keras.preprocessing import text

## 1. Capture the tweets

Talk about the Twitter API, credentials and stuff...

In [2]:
from credentials import *    # This will allow us to use the keys as variables

# API's setup:
def twitter_setup():
    """
    Utility function to setup the Twitter's API
    with our access keys provided.
    """
    # Authentication and access using keys:
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

    # Return API with authentication:
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
    return api

# We create an extractor object:
extractor = twitter_setup()

Twitter rate-limits the search API, so we can only download a few thousand tweets every 15 minutes, which is not a lot considering that most of them are retweets. The search also only reaches back 7 days (or so).

In [3]:
# Words we want to search for
searchQuery = "machine learning"

# Maximum number of tweets we want to collect 
maxTweets = 10000

# Show our current rate limits for the search endpoint
extractor.rate_limit_status()['resources']['search']

{'/search/tweets': {'limit': 180, 'remaining': 180, 'reset': 1605742854}}
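
As a rough sanity check on the limit reported above, the throughput per 15-minute window can be estimated from the request limit. This is only a sketch: it assumes each search request returns 15 tweets (the default page size of the standard search API when no `count` is passed), and `estimate_windows` and its parameter names are hypothetical helpers, not part of tweepy.

```python
import math

def estimate_windows(max_tweets, requests_per_window=180, tweets_per_request=15):
    """Estimate the best-case tweets per window and the windows needed."""
    tweets_per_window = requests_per_window * tweets_per_request
    windows = math.ceil(max_tweets / tweets_per_window)
    return tweets_per_window, windows

tweets_per_window, windows = estimate_windows(10000)
print(tweets_per_window)  # 2700
print(windows)            # 4
```

Under these assumptions, collecting 10000 tweets spans at least 4 windows, so at least 3 pauses should be expected. Requests that return fewer tweets than a full page lower the real throughput.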

In [5]:
tweetCount = 0
global_count = 0
# We create a tweet list as follows:
tweets=[]
# Tell the Cursor method that we want to use the Search API (api.search),
# and pass it our query and the maximum number of tweets to return
for tweet in tweepy.Cursor(extractor.search, q=searchQuery, tweet_mode='extended').items(maxTweets):

    # Keep the tweet only if it is not a retweet (retweets start with "RT ")
    if not tweet.full_text.startswith("RT "):
        tweets.append(tweet)
        tweetCount += 1
    global_count += 1
    
    if (global_count%1000 == 0):
        print("Downloaded {0} tweets".format(global_count))
        print("Kept {0} non RT tweets".format(tweetCount))
        print("---------------------------")
    

#Display how many tweets we have collected
print("Downloaded {0} tweets".format(global_count))
print("Kept {0} non RT tweets".format(tweetCount))

Downloaded 1000 tweets
Kept 238 non RT tweets
---------------------------
Downloaded 2000 tweets
Kept 450 non RT tweets
---------------------------


Rate limit reached. Sleeping for: 794


Downloaded 3000 tweets
Kept 638 non RT tweets
---------------------------
Downloaded 4000 tweets
Kept 882 non RT tweets
---------------------------
Downloaded 5000 tweets
Kept 1102 non RT tweets
---------------------------


Rate limit reached. Sleeping for: 804


Downloaded 6000 tweets
Kept 1307 non RT tweets
---------------------------
Downloaded 7000 tweets
Kept 1578 non RT tweets
---------------------------


Rate limit reached. Sleeping for: 789


Downloaded 8000 tweets
Kept 1827 non RT tweets
---------------------------
Downloaded 9000 tweets
Kept 2081 non RT tweets
---------------------------
Downloaded 10000 tweets
Kept 2302 non RT tweets
---------------------------
Downloaded 10000 tweets
Kept 2302 non RT tweets
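
The `startswith("RT ")` test in the loop above relies on the text convention for retweets. A slightly more defensive check, sketched below, also looks for a `retweeted_status` attribute, which the v1.1 API attaches to retweets. `is_retweet` is a hypothetical helper, and the `SimpleNamespace` objects only stand in for tweepy `Status` objects with made-up text.

```python
from types import SimpleNamespace

def is_retweet(status):
    """Return True if the status looks like a retweet."""
    return hasattr(status, "retweeted_status") or status.full_text.startswith("RT ")

# Stand-ins for tweepy Status objects (illustrative data only)
statuses = [
    SimpleNamespace(full_text="RT @someone: Machine learning is fun"),
    SimpleNamespace(full_text="Machine learning is fun"),
    SimpleNamespace(full_text="Shared post", retweeted_status=object()),
]

kept = [s for s in statuses if not is_retweet(s)]
print(len(kept))  # 1
```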


In [6]:
print ("Type of tweets: " + str(type(tweets)))
print ("Type of each tweet: " + str(type(tweets[0])))

Type of tweets: <class 'list'>
Type of each tweet: <class 'tweepy.models.Status'>


In [7]:
print(dir(tweets[0]))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_api', '_json', 'author', 'contributors', 'coordinates', 'created_at', 'destroy', 'display_text_range', 'entities', 'extended_entities', 'favorite', 'favorite_count', 'favorited', 'full_text', 'geo', 'id', 'id_str', 'in_reply_to_screen_name', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'metadata', 'parse', 'parse_list', 'place', 'possibly_sensitive', 'retweet', 'retweet_count', 'retweeted', 'retweets', 'source', 'source_url', 'truncated', 'user']


### Creating a pandas DataFrame

In [8]:
# We create a pandas dataframe as follows:
df = pd.DataFrame(data=[tweet.full_text for tweet in tweets], columns=['Tweets'])

#We add all the information we want to keep about the tweets
df['len']  = np.array([len(tweet.full_text) for tweet in tweets])
df['ID']   = np.array([tweet.id for tweet in tweets])
df['Date'] = np.array([tweet.created_at for tweet in tweets])
df['Source'] = np.array([tweet.source for tweet in tweets])
df['Likes']  = np.array([tweet.favorite_count for tweet in tweets])
df['RTs']    = np.array([tweet.retweet_count for tweet in tweets])

# We display the first 100 elements of the dataframe:
pd.set_option('display.max_colwidth', 60)
display(df.head(100))

| | Tweets | len | ID | Date | Source | Likes | RTs |
|---|---|---|---|---|---|---|---|
| 0 | SICK’s Deep Learning brings simplicity to complex AI ins... | 303 | 1329334610946260992 | 2020-11-19 08:04:03 | dlvr.it | 0 | 0 |
| 1 | Google Cloud Debuts Professional Machine Learning Engine... | 103 | 1329334598455734277 | 2020-11-19 08:04:00 | Paper.li | 0 | 0 |
| 2 | Bringing your own custom container image to Amazon SageM... | 142 | 1329334594085269506 | 2020-11-19 08:03:59 | HubSpot | 0 | 0 |
| 3 | Here’s what machines need to understand in order to trul... | 238 | 1329334592449507331 | 2020-11-19 08:03:59 | HubSpot | 0 | 0 |
| 4 | Level up your data science vocabulary: Geometric Distrib... | 121 | 1329334535532670977 | 2020-11-19 08:03:45 | DeepAI | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 5 Most Useful Machine Learning Tools every lazy full-sta... | 124 | 1329322590670893059 | 2020-11-19 07:16:17 | Paper.li | 2 | 1 |
| 96 | The way we train AI is fundamentally flawed https://t.co... | 67 | 1329322554541154306 | 2020-11-19 07:16:09 | Twitter for Android | 3 | 1 |
| 97 | Machine Learning: MLflow 1.12 verbessert die PyTorch-Int... | 100 | 1329322407811801088 | 2020-11-19 07:15:34 | Paper.li | 0 | 0 |
| 98 | Your fleet is in safe hands with #JupiCar, an innovative... | 296 | 1329322279893950464 | 2020-11-19 07:15:03 | Twitter Web App | 0 | 2 |


Now we save the DataFrame to a CSV file so that we can reload it later without querying the API again

In [9]:
df.to_csv('tweets.csv', index=False)

In [2]:
df = pd.read_csv("tweets(2302).csv")

## 2. Preprocessing

- The first step is to get rid of every comma, period and any other strange symbol
    - Discuss whether @ mentions should be removed or kept
    - Discuss whether hashtags should be removed
    - Remove symbols that stand on their own
    - Remove URLs
- The second step is to create the dictionary
- Then reduce the dictionary
- Finally, do some PCA (probably?)

### 2.1 Cleaning the strings

In [3]:
def clean_str(string):
    """
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    string = re.sub(r"@[^\s]+", " ", string)             #remove account tags
    string = re.sub(r"http[^\s]+", " ", string)          #remove urls
    string = re.sub(r"#[^\s]+", " ", string)             #remove hashtags
    string = re.sub(r"[^A-Za-z\']", " ", string)         #remove everything but letters and apostrophes
    string = re.sub(r"\'s", " is", string)               #split 's contractions
    string = re.sub(r"\'ve", " have", string)            #split 've contractions
    string = re.sub(r"n\'t", " not", string)             #split n't contractions
    string = re.sub(r"\'re", " are", string)             #split 're contractions
    string = re.sub(r"\'d", " would", string)            #split 'd contractions
    string = re.sub(r"\'ll", " will", string)            #split 'll contractions
    string = re.sub(r"\'", " ", string)                  #remove remaining '
    string = re.sub(r"!", " ! ", string)                 #split ! (no-op here: symbols were already removed above)
    string = re.sub(r"\?", " ? ", string)                #split ? (no-op here: symbols were already removed above)
    string = re.sub(r"\s{2,}", " ", string)              #collapse repeated white space
    return string.strip().lower()                        #remove start and end white spaces, lowercase
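
To see what the cleaning steps do, here is a condensed sketch of some of the substitutions applied to made-up text. `clean_basic` is a shortened, illustrative variant and not the full `clean_str` above; the sample tweet is invented.

```python
import re

def clean_basic(string):
    """Condensed variant of the cleaning steps: tags, URLs, hashtags, symbols."""
    string = re.sub(r"@[^\s]+", " ", string)       # remove account tags
    string = re.sub(r"http[^\s]+", " ", string)    # remove urls
    string = re.sub(r"#[^\s]+", " ", string)       # remove hashtags
    string = re.sub(r"[^A-Za-z']", " ", string)    # keep only letters and straight apostrophes
    string = re.sub(r"n't", " not", string)        # split n't contractions
    string = re.sub(r"'", " ", string)             # remove remaining apostrophes
    string = re.sub(r"\s{2,}", " ", string)        # collapse repeated white space
    return string.strip().lower()

sample = "@user Deep Learning isn't magic! Read https://t.co/abc #AI 2020"
print(clean_basic(sample))                  # deep learning is not magic read

# A typographic apostrophe (’) is not a straight one, so it is replaced by a space
print(clean_basic("SICK’s Deep Learning"))  # sick s deep learning
```

The second call shows why tokens such as a lone `s` can end up in the cleaned text: the contraction rules only match the straight apostrophe.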

In [4]:
df['Tweets'] = df['Tweets'].apply(clean_str)

pd.set_option('display.max_colwidth', None)
df.head(3)

| | Tweets | len | ID | Date | Source | Likes | RTs |
|---|---|---|---|---|---|---|---|
| 0 | sick s deep learning brings simplicity to complex ai inspection sick has launched a suite of deep learning apps and services to simplify machine vision quality inspection for challenging food products and agricultural produce especially those that have | 303 | 1329334610946260992 | 2020-11-19 08:04:03 | dlvr.it | 0 | 0 |
| 1 | google cloud debuts professional machine learning engineer certification | 103 | 1329334598455734277 | 2020-11-19 08:04:00 | Paper.li | 0 | 0 |
| 2 | bringing your own custom container image to amazon sagemaker studio notebooks | 142 | 1329334594085269506 | 2020-11-19 08:03:59 | HubSpot | 0 | 0 |


### 2.2 Creating the dictionary

Justify why we did it in count mode and not in binary mode
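
One way to frame the choice: in binary mode a word that appears twice in a tweet looks the same as a word that appears once, whereas count mode keeps that information. The sketch below illustrates the difference in plain Python on a toy vocabulary. It mimics what the two modes encode but does not call the Keras tokenizer, and `vectorize` is a hypothetical helper.

```python
from collections import Counter

def vectorize(tokens, vocabulary, mode="count"):
    """Encode a token list as a bag-of-words vector over `vocabulary`."""
    counts = Counter(tokens)
    if mode == "count":
        return [counts[word] for word in vocabulary]
    if mode == "binary":
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    raise ValueError("mode must be 'count' or 'binary'")

vocabulary = ["deep", "learning", "sick", "vision"]
tokens = "sick s deep learning brings simplicity sick has launched deep learning apps".split()

print(vectorize(tokens, vocabulary, mode="count"))   # [2, 2, 2, 0]
print(vectorize(tokens, vocabulary, mode="binary"))  # [1, 1, 1, 0]
```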

In [11]:
train_tweets = df['Tweets'].values.tolist()

# Create a tokenizer that keeps every word (no num_words limit is set)
tokenizer = text.Tokenizer()

# Build the word index (dictionary)
tokenizer.fit_on_texts(train_tweets)

# Vectorize texts into bag-of-words count vectors (one column per word)
train_vectors = tokenizer.texts_to_matrix(train_tweets, mode='count')

print('First tweet: ' + train_tweets[0])
print('\nVector of the first tweet: ' + str(train_vectors[0]))
print('\nShape of the training set (nb_examples, vector_size): ' + str(train_vectors.shape))

First tweet: sick s deep learning brings simplicity to complex ai inspection sick has launched a suite of deep learning apps and services to simplify machine vision quality inspection for challenging food products and agricultural produce especially those that have

Vector of the first tweet: [0. 2. 1. ... 0. 0. 0.]

Shape of the training set (nb_examples, vector_size): (2302, 7397)


In [12]:
word_index = tokenizer.word_index
word_count = tokenizer.word_counts
print("There are " + str(len(word_index)) + " unique tokens.\n")

print("Show the most frequent word index:")
for i, word in enumerate(sorted(word_count, key=word_count.get, reverse=True)):
    print('   {} ({}) --> {}'.format(word, word_count[word], word_index[word]))
    if i == 10: 
        break

print("\nShow the least frequent word index:")
for i, word in enumerate(sorted(word_count, key=word_count.get, reverse=False)):
    print('   {} ({}) --> {}'.format(word, word_count[word], word_index[word]))
    if i == 10: 
        break

There are 7396 unique tokens.

Show the most frequent word index:
   learning (1992) --> 1
   machine (1891) --> 2
   the (1273) --> 3
   to (1242) --> 4
   and (1181) --> 5
   a (857) --> 6
   of (756) --> 7
   is (737) --> 8
   in (728) --> 9
   for (595) --> 10
   with (402) --> 11

Show the least frequent word index:
   simplicity (1) --> 3240
   simplify (1) --> 3241
   container (1) --> 3242
   geometric (1) --> 3243
   flexes (1) --> 3244
   muscles (1) --> 3245
   wasn (1) --> 3246
   xataka (1) --> 3247
   xico (1) --> 3248
   academics (1) --> 3249
   ethics (1) --> 3250


### 2.3 Reducing the dictionary

We have 7396 unique words in our dictionary, so we now have to decide where to prune it:
- We probably need to remove the **most frequent** words, which are mostly articles and prepositions (plus the search terms themselves)
    - For now we remove the 30 most frequent words (provisional)
- We also remove the **least frequent** words by keeping only the 1000 most frequent of the remaining words
    - We could increase this number to get better performance
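
The pruning rule described above can be sketched on a toy corpus: rank the words by frequency, drop the top few, keep the next block, and select the matching columns. The numbers (`n_top=1`, `n_keep=2`), the toy tweets and the helper `prune_vocabulary` are illustrative stand-ins for the 30 and 1000 used in this notebook.

```python
from collections import Counter

def prune_vocabulary(word_count, n_top, n_keep):
    """Drop the n_top most frequent words, then keep the next n_keep."""
    ranked = sorted(word_count, key=word_count.get, reverse=True)
    removed = ranked[:n_top]
    kept = ranked[n_top:n_top + n_keep]
    return removed, kept

toy_tweets = [
    "the machine learning model",
    "the machine learning course",
    "the machine vision model",
    "the learning course online",
]
word_count = Counter(word for tweet in toy_tweets for word in tweet.split())

removed, kept = prune_vocabulary(word_count, n_top=1, n_keep=2)
print(removed)       # ['the']
print(sorted(kept))  # ['learning', 'machine']

# Build a toy count matrix (one column per word) and keep only the selected columns
vocabulary = sorted(word_count)
matrix = [[tweet.split().count(word) for word in vocabulary] for tweet in toy_tweets]
keep_cols = [vocabulary.index(word) for word in kept]
reduced = [[row[col] for col in keep_cols] for row in matrix]
print(len(reduced), len(reduced[0]))  # 4 2
```

The columns are selected from the original matrix by their original positions, which avoids any index shift that would occur if columns were deleted first.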

In [16]:
nb_upper_cut = 30
nb_words = 1000

# Split the word indexes by frequency rank: the nb_upper_cut most frequent
# words are removed, the next nb_words are kept, and the rest are dropped
words_to_remove_indexes = []
words_to_keep_indexes = []
for i, word in enumerate(sorted(word_count, key=word_count.get, reverse=True)):
    if i < nb_upper_cut:
        words_to_remove_indexes.append(word_index[word])
    elif i < nb_upper_cut + nb_words:
        words_to_keep_indexes.append(word_index[word])
    else:
        break

# Select the kept columns directly from the original matrix: deleting columns
# first would shift the positions and the kept indexes would point to wrong words
train_vectors_reduced = train_vectors[:, words_to_keep_indexes]
print('Removed the ' + str(nb_upper_cut) + ' most frequent words, with indexes: ' + str(words_to_remove_indexes))
print('Kept the next ' + str(nb_words) + ' most frequent words')
print('New shape of the training set (nb_examples, vector_size): ' + str(train_vectors_reduced.shape))

## 3. Visualize the clusters

## 4. Discuss the results