# Clustering tweets about Machine Learning using self-organizing maps

*Arnaud Le Doeuff*
*Ignacio Dorado*
*11/20*

## Useful links

- https://github.com/RodolfoFerro/pandas_twitter/blob/master/01-extracting-data.md

## Importing modules

In [5]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import tensorflow
import tweepy
import csv
import re

## 1. Capture the tweets

Talk about the Twitter API, credentials and so on...

In [20]:
from credentials import *    # This will allow us to use the keys as variables

# API's setup:
def twitter_setup():
    """
    Utility function to setup the Twitter's API
    with our access keys provided.
    """
    # Authentication and access using keys:
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

    # Return API with authentication:
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
    return api

# We create an extractor object:
extractor = twitter_setup()
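
The `credentials` module imported above is not shown in this notebook; it only needs to define the four names used in `twitter_setup`. A minimal sketch of one way to provide them without hard-coding secrets, assuming they are stored in environment variables with the same names (the helper `load_credentials` is hypothetical, not part of tweepy):

```python
import os

# The four names that twitter_setup() expects to find
REQUIRED_KEYS = ("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

def load_credentials(env=os.environ):
    """Return the four Twitter keys from `env`, failing loudly if one is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}

# Example with placeholder values; real keys come from a Twitter developer account
creds = load_credentials({key: "placeholder" for key in REQUIRED_KEYS})
print(sorted(creds))
```

Failing early with the list of missing names tends to be easier to debug than an authentication error raised later by the API.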

Twitter will only allow us to download 3200 tweets every 15 minutes, which is not a lot considering that most of them are retweets. Only the last 7 days (or so) of tweets can be retrieved.

In [21]:
# Words we want to search for
searchQuery = "machine learning"

# Maximum number of tweets we want to collect 
maxTweets = 100

# Show our current limitations
extractor.rate_limit_status()['resources']['search']

{'/search/tweets': {'limit': 180, 'remaining': 180, 'reset': 1605733568}}
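
The `reset` field in this output is a Unix timestamp (seconds since the epoch). A minimal sketch of turning one entry of the status dictionary into something readable, using the values printed above as sample data (the helper name `describe_limit` is ours, not part of tweepy):

```python
from datetime import datetime, timezone

# Sample data copied from the rate-limit output above
status = {'/search/tweets': {'limit': 180, 'remaining': 180, 'reset': 1605733568}}

def describe_limit(endpoint_status):
    """Return (remaining, limit, reset time in UTC) for one endpoint entry."""
    reset_at = datetime.fromtimestamp(endpoint_status['reset'], tz=timezone.utc)
    return endpoint_status['remaining'], endpoint_status['limit'], reset_at

remaining, limit, reset_at = describe_limit(status['/search/tweets'])
print(f"{remaining}/{limit} search requests left, "
      f"window resets at {reset_at:%Y-%m-%d %H:%M:%S} UTC")
```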

In [52]:
tweetCount = 0
global_count = 0
# We create a tweet list as follows:
tweets=[]
# Tell the Cursor method that we want to use the Search API (api.search),
# and pass it our query and the maximum number of tweets to return
for tweet in tweepy.Cursor(extractor.search, q=searchQuery, tweet_mode='extended').items(maxTweets):

    # Keep only original tweets: retweets start with "RT "
    if not tweet.full_text.startswith("RT "):
        tweets.append(tweet)
        tweetCount += 1
    global_count += 1
    
    if (global_count%1000 == 0):
        print("Downloaded {0} tweets".format(global_count))
        print("Kept {0} non RT tweets".format(tweetCount))
    

#Display how many tweets we have collected
print("Downloaded {0} tweets".format(global_count))
print("Kept {0} non RT tweets".format(tweetCount))

Downloaded 100 tweets
Kept 14 non RT tweets
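
Only 14 of the 100 downloaded tweets survived the retweet filter. A rough sketch of how many tweets would have to be downloaded to end up with a given number of original tweets, assuming the keep rate stays near 14% (it will vary between runs) and that each search request returns at most 100 tweets. `target_kept` is a hypothetical goal, not a number from this notebook:

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)

downloaded, kept = 100, 14      # counts from the run above
target_kept = 1000              # hypothetical goal
tweets_per_request = 100        # assumption: page size of the search endpoint

# Scale the download count by the observed keep rate (kept / downloaded)
needed_downloads = ceil_div(target_kept * downloaded, kept)
needed_requests = ceil_div(needed_downloads, tweets_per_request)

print("Tweets to download:", needed_downloads)
print("Search requests needed:", needed_requests)
```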


In [39]:
print ("Type of tweets: " + str(type(tweets)))
print ("Type of each tweet: " + str(type(tweets[0])))

Type of tweets: <class 'list'>
Type of each tweet: <class 'tweepy.models.Status'>


In [40]:
print(dir(tweets[0]))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_api', '_json', 'author', 'contributors', 'coordinates', 'created_at', 'destroy', 'display_text_range', 'entities', 'extended_entities', 'favorite', 'favorite_count', 'favorited', 'full_text', 'geo', 'id', 'id_str', 'in_reply_to_screen_name', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'metadata', 'parse', 'parse_list', 'place', 'possibly_sensitive', 'retweet', 'retweet_count', 'retweeted', 'retweets', 'source', 'source_url', 'truncated', 'user']


### Creating a pandas DataFrame

In [107]:
# We create a pandas dataframe as follows:
df = pd.DataFrame(data=[tweet.full_text for tweet in tweets], columns=['Tweets'])

#We add all the information we want to keep about the tweets
df['len']  = np.array([len(tweet.full_text) for tweet in tweets])
df['ID']   = np.array([tweet.id for tweet in tweets])
df['Date'] = np.array([tweet.created_at for tweet in tweets])
df['Source'] = np.array([tweet.source for tweet in tweets])
df['Likes']  = np.array([tweet.favorite_count for tweet in tweets])
df['RTs']    = np.array([tweet.retweet_count for tweet in tweets])

# We display the first 10 elements of the dataframe:
pd.set_option('display.max_colwidth', 60)
display(df.head(10))

| | Tweets | len | ID | Date | Source | Likes | RTs |
|---|---|---|---|---|---|---|---|
| 0 | @jeffcannata *eyes "machine learning limerick generator"... | 65 | 1329169216805740548 | 2020-11-18 21:06:50 | Twitter for Android | 0 | 0 |
| 1 | The new economy in Arizona requires workers with artific... | 302 | 1329168791075516419 | 2020-11-18 21:05:09 | Sprout Social | 0 | 1 |
| 2 | @Kirby0Louise Please enlighten me, SAM is not Supersampl... | 90 | 1329168553208131585 | 2020-11-18 21:04:12 | Twitter Web App | 0 | 0 |
| 3 | Machine Learning Pocket Reference: Working with Structur... | 114 | 1329168493837684736 | 2020-11-18 21:03:58 | Postmatico | 0 | 1 |
| 4 | 5 Ways to Improve User Experience with Machine Learning ... | 258 | 1329168260533727236 | 2020-11-18 21:03:02 | Buffer | 0 | 1 |
| 5 | The organizations, which have the impact of exponentiall... | 250 | 1329168252484919303 | 2020-11-18 21:03:00 | Buffer | 0 | 1 |
| 6 | Pretty interesting. They achieved 3x-7x faster training ... | 279 | 1329168190388269059 | 2020-11-18 21:02:45 | Twitter for iPhone | 0 | 1 |
| 7 | The way we train AI is fundamentally flawed https://t.co... | 276 | 1329168147493031937 | 2020-11-18 21:02:35 | Twitter for iPhone | 0 | 2 |
| 8 | Combination of Imaging and Machine Learning Can Predict ... | 295 | 1329167908501483520 | 2020-11-18 21:01:38 | Hootsuite Inc. | 0 | 1 |
| 9 | https://t.co/mHkYKZoC0X, machine learning startup backed... | 151 | 1329167811768414211 | 2020-11-18 21:01:15 | Twitter Web App | 0 | 1 |


Now we save the DataFrame to a CSV file so that we can reload it later without querying the API again.

In [152]:
df.to_csv('tweets.csv', index=False)

In [15]:
df = pd.read_csv("tweets(1171).csv")

## 2. Preprocessing

- The first step is to get rid of every comma, period and any other stray symbol
    - Discuss whether @ mentions should be removed or kept
    - Discuss whether hashtags should be removed
    - Remove symbols that stand on their own
    - Remove URLs
- The second step is to create the dictionary
- Then reduce the dictionary
- Finally, apply PCA (probably?)
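
The dictionary steps in the list above can be sketched as follows. This is only one possible reading: "reducing the dictionary" is assumed here to mean dropping words that appear fewer than `min_count` times, and the function names and toy sentences are illustrative, not taken from the notebook.

```python
from collections import Counter
import numpy as np

def build_vocabulary(texts, min_count=2):
    """Map each word that appears at least `min_count` times to a column index."""
    counts = Counter(word for text in texts for word in text.split())
    kept = sorted(word for word, n in counts.items() if n >= min_count)
    return {word: i for i, word in enumerate(kept)}

def bag_of_words(texts, vocabulary):
    """Return a (n_texts, n_words) matrix of word counts."""
    matrix = np.zeros((len(texts), len(vocabulary)), dtype=int)
    for row, text in enumerate(texts):
        for word in text.split():
            col = vocabulary.get(word)
            if col is not None:      # words outside the reduced dictionary are skipped
                matrix[row, col] += 1
    return matrix

# Toy input standing in for already cleaned tweets
texts = [
    "machine learning is fun",
    "deep learning is machine learning",
    "cats are fun",
]
vocab = build_vocabulary(texts, min_count=2)
X = bag_of_words(texts, vocab)
print(vocab)
print(X)
```

Each row of `X` is then a fixed-length vector for one tweet, which is the kind of input that PCA or a self-organizing map expects.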

In [105]:
def clean_str(string):
    """
    Clean a tweet: strip mentions, URLs, hashtags and non-letter characters,
    expand common contractions, and lowercase the result.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    string = re.sub(r"@[^\s]+", " ", string)             # remove account tags
    string = re.sub(r"http[^\s]+", " ", string)          # remove URLs
    string = re.sub(r"#[^\s]+", " ", string)             # remove hashtags
    string = re.sub(r"[^A-Za-z\']", " ", string)         # keep only letters and apostrophes (digits, "!" and "?" go too)
    string = re.sub(r"\'s", " is", string)               # split 's contractions (also rewrites possessives)
    string = re.sub(r"\'ve", " have", string)            # split 've contractions
    string = re.sub(r"n\'t", " not", string)             # split n't contractions
    string = re.sub(r"\'re", " are", string)             # split 're contractions
    string = re.sub(r"\'d", " would", string)            # split 'd contractions
    string = re.sub(r"\'ll", " will", string)            # split 'll contractions
    string = re.sub(r"\'", " ", string)                  # remove remaining apostrophes
    # "!" and "?" were already removed above, so no extra splitting is needed
    string = re.sub(r"\s{2,}", " ", string)              # collapse runs of whitespace
    return string.strip().lower()                        # trim both ends and lowercase

In [109]:
df['Tweets'] = df['Tweets'].apply(clean_str)
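
Tweet text returned by the API can contain HTML entities such as `&amp;`. Once the non-letter characters are stripped, these leave stray tokens such as `amp`, which shows up in the second cleaned tweet in the output that follows. A small sketch of decoding entities before cleaning, using the standard library's `html.unescape` (the sample string and the helper name `unescape_entities` are made up for illustration):

```python
import html

def unescape_entities(text):
    """Turn HTML entities such as &amp; back into the characters they stand for."""
    return html.unescape(text)

raw = "our workforce &amp; economic development officer"
decoded = unescape_entities(raw)
print(decoded)
```

In this notebook it could be applied as a first pass, for example `df['Tweets'].apply(html.unescape)`, before `clean_str` runs.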

In [113]:
pd.set_option('display.max_colwidth', None)
df.head(3)

| | Tweets | len | ID | Date | Source | Likes | RTs |
|---|---|---|---|---|---|---|---|
| 0 | eyes machine learning limerick generator project | 65 | 1329169216805740548 | 2020-11-18 21:06:50 | Twitter for Android | 0 | 0 |
| 1 | the new economy in arizona requires workers with artificial intelligence skills our workforce amp economic development officer darcy renfro emphasizes the need for diversity behind the people translating the human mind into machine learning | 302 | 1329168791075516419 | 2020-11-18 21:05:09 | Sprout Social | 0 | 1 |
| 2 | please enlighten me sam is not supersampling by machine learning like dlss | 90 | 1329168553208131585 | 2020-11-18 21:04:12 | Twitter Web App | 0 | 0 |
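
The preprocessing plan lists PCA as a possible last step. A minimal sketch of what that could look like with NumPy alone, assuming each tweet has already been turned into a row of word counts. The small matrix `X` is invented for illustration, and `pca_project` is a hypothetical helper, not something this notebook defines.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto their first principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return centered @ vt[:n_components].T, explained[:n_components]

# Invented word-count matrix: 4 tweets, 4 dictionary words
X = np.array([
    [1, 1, 1, 1],
    [0, 1, 2, 1],
    [1, 0, 0, 0],
    [0, 1, 1, 1],
])
projected, explained = pca_project(X, n_components=2)
print(projected.shape)
print(explained)
```

`explained` holds the fraction of the total variance captured by each kept component, which helps when deciding how many components are worth keeping.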


## 3. Visualize the clusters
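
This section is still empty. Before anything can be visualized, each tweet vector has to be assigned to a unit of a self-organizing map. The following is a minimal NumPy sketch of that idea under stated assumptions, not the implementation used in this project: the grid size, the linear decay schedules, and the toy two-dimensional data are all made up for illustration.

```python
import numpy as np

def train_som(data, grid_shape=(3, 3), n_iter=200, lr0=0.5, sigma0=1.0, seed=0):
    """Train a small self-organizing map and return its weights (rows, cols, features)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every unit, shape (rows, cols, 2)
    coords = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1 - frac)                   # learning rate decays linearly
        sigma = sigma0 * (1 - frac) + 1e-3      # neighbourhood radius shrinks too
        x = data[rng.integers(len(data))]       # pick one random sample
        bmu = best_matching_unit(weights, x)
        grid_dist2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
        h = np.exp(-grid_dist2 / (2 * sigma ** 2))     # Gaussian neighbourhood
        weights += lr * h[..., None] * (x - weights)   # pull units towards the sample
    return weights

def best_matching_unit(weights, x):
    """Return the grid index (row, col) of the unit closest to x."""
    dists = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(np.argmin(dists), dists.shape)

# Toy data: two small groups of points standing in for tweet vectors
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                 [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])
weights = train_som(data)
for x in data:
    print(x, "->", best_matching_unit(weights, x))
```

Counting how many tweets land on each grid cell would then give a simple picture of the clusters, for example as a heat map drawn with matplotlib.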

## 4. Discuss the results