# Clustering tweets about Machine Learning using self-organizing maps

*Arnaud Le Doeuff*
*Ignacio Dorado*
*11/20*

## Useful Links

- https://github.com/RodolfoFerro/pandas_twitter/blob/master/01-extracting-data.md

## Importing modules

In [21]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import tensorflow
import tweepy
import csv

from sklearn.model_selection import train_test_split

## 1. Capture the tweets

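We access Twitter through its API using the tweepy library. The API keys are kept in a separate credentials.py file (not shown here), which defines CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN and ACCESS_SECRET.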

In [87]:
from credentials import *    # This will allow us to use the keys as variables

# API's setup:
def twitter_setup():
    """
    Utility function to setup the Twitter's API
    with our access keys provided.
    """
    # Authentication and access using keys:
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

    # Return API with authentication:
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
    return api

# We create an extractor object:
extractor = twitter_setup()

Twitter only lets us download about 3200 tweets every 15 minutes, which is not a lot considering that most of them are retweets. The search API also only returns tweets from the last 7 days or so.
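As a rough back-of-the-envelope check, these figures give a lower bound on how long a download takes. The sketch below assumes the approximate figure of 3200 tweets per 15-minute window quoted above (not an official limit) and a target of 10000 tweets; the constant and variable names are ours.

```python
import math

# Approximate figures quoted in the text; not official API limits.
TWEETS_PER_WINDOW = 3200
WINDOW_MINUTES = 15

max_tweets = 10000  # target number of tweets to collect

# Number of rate-limit windows needed to fetch max_tweets
windows_needed = math.ceil(max_tweets / TWEETS_PER_WINDOW)

# The first window starts immediately, so we wait between windows only
waits = windows_needed - 1
max_sleep_minutes = waits * WINDOW_MINUTES

print("Windows needed:", windows_needed)
print("Rate-limit pauses:", waits)
print("Sleeping time (upper bound, minutes):", max_sleep_minutes)
```

Each pause lasts at most one full window, since the time already spent downloading counts towards it.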

In [147]:
# Words we want to search for
searchQuery = "machine learning"

# Maximum number of tweets we want to collect 
maxTweets = 10000

# Show our current limitations
extractor.rate_limit_status()['resources']['search']

In [148]:
tweetCount = 0
global_count = 0
# List that will hold the tweets we keep:
tweets = []
# Tell Cursor to use the Search API (api.search) with our query,
# and cap the number of tweets returned at maxTweets
for tweet in tweepy.Cursor(extractor.search, q=searchQuery).items(maxTweets):

    # Keep only original tweets: the text of a retweet starts with "RT "
    if not tweet.text.startswith("RT "):
        tweets.append(tweet)
        tweetCount += 1
    global_count += 1

    # Report progress every 1000 tweets
    if global_count % 1000 == 0:
        print("Downloaded {0} tweets".format(global_count))
        print("Kept {0} non RT tweets".format(tweetCount))

# Display how many tweets we have collected
print("Downloaded {0} tweets".format(global_count))
print("Kept {0} non RT tweets".format(tweetCount))

Downloaded 0 tweets
Saw 0 tweets
Downloaded 188 tweets
Saw 1000 tweets
Downloaded 371 tweets
Saw 2000 tweets


Rate limit reached. Sleeping for: 786


Downloaded 484 tweets
Saw 3000 tweets
Downloaded 568 tweets
Saw 4000 tweets
Downloaded 682 tweets
Saw 5000 tweets


Rate limit reached. Sleeping for: 788


Downloaded 771 tweets
Saw 6000 tweets
Downloaded 865 tweets
Saw 7000 tweets
Downloaded 972 tweets
Saw 8000 tweets


Rate limit reached. Sleeping for: 782


Downloaded 1069 tweets
Saw 9000 tweets
Downloaded 1170 tweets
Saw 10000 tweets
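Checking the text prefix is a simple heuristic. A slightly more robust variant, sketched below, also looks for the retweeted_status attribute that tweepy sets on Status objects that are retweets. The is_retweet helper and the sample objects are hypothetical stand-ins for illustration, not real API results.

```python
from types import SimpleNamespace


def is_retweet(tweet):
    """Return True if the status looks like a retweet (hypothetical helper)."""
    # Retweets carry the original status in a `retweeted_status` attribute.
    if hasattr(tweet, "retweeted_status"):
        return True
    # Fallback: the textual convention "RT @user: ..."
    return tweet.text.startswith("RT @")


# Stand-in objects that mimic the two attributes we rely on.
sample = [
    SimpleNamespace(text="RT @someone: machine learning is fun"),
    SimpleNamespace(text="RT if you like machine learning"),
    SimpleNamespace(text="An original tweet about machine learning"),
    SimpleNamespace(text="Some truncated text", retweeted_status=object()),
]

kept = [t for t in sample if not is_retweet(t)]
for t in kept:
    print(t.text)
```

In the download loop, the startswith test could be replaced by a call to is_retweet(tweet). Alternatively, appending -filter:retweets to the search query should keep retweets out of the results altogether, so that they do not count against the rate limit; this is untested here.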


In [149]:
print ("Type of tweets: " + str(type(tweets)))
print ("Type of each tweet: " + str(type(tweets[0])))

Type of tweets: <class 'list'>
Type of each tweet: <class 'tweepy.models.Status'>


In [150]:
print(dir(tweets[0]))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_api', '_json', 'author', 'contributors', 'coordinates', 'created_at', 'destroy', 'entities', 'favorite', 'favorite_count', 'favorited', 'geo', 'id', 'id_str', 'in_reply_to_screen_name', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'metadata', 'parse', 'parse_list', 'place', 'possibly_sensitive', 'retweet', 'retweet_count', 'retweeted', 'retweets', 'source', 'source_url', 'text', 'truncated', 'user']


### Creating a pandas DataFrame

In [158]:
# We create a pandas dataframe as follows:
df = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=['Tweets'])

#We add all the information we want to keep about the tweets
df['len']  = np.array([len(tweet.text) for tweet in tweets])
df['ID']   = np.array([tweet.id for tweet in tweets])
df['Date'] = np.array([tweet.created_at for tweet in tweets])
df['Source'] = np.array([tweet.source for tweet in tweets])
df['Likes']  = np.array([tweet.favorite_count for tweet in tweets])
df['RTs']    = np.array([tweet.retweet_count for tweet in tweets])

# We display the first 100 rows of the dataframe:
display(df.head(100))

| | Tweets | len | ID | Date | Source | Likes | RTs |
|---|---|---|---|---|---|---|---|
| 0 | 🤝 NEW PARTNERSHIP 🤝 \n\nAt @Fetch_ai we are su... | 137 | 1329002221615407104 | 2020-11-18 10:03:15 | Twitter Web App | 0 | 0 |
| 1 | @elonmusk Do you believe machine learning is s... | 109 | 1329002220118024193 | 2020-11-18 10:03:15 | Twitter for iPhone | 0 | 0 |
| 2 | Eyes on $FET\n\nhttps://t.co/Vu1StthG4d partne... | 140 | 1329002190942429184 | 2020-11-18 10:03:08 | Twitter Web App | 0 | 0 |
| 3 | 10 Ways Machine Learning Will Reshape Manufact... | 94 | 1329002168918167553 | 2020-11-18 10:03:03 | Upflow | 0 | 0 |
| 4 | #Strands is a dynamic company that adopts a di... | 140 | 1329001924230844421 | 2020-11-18 10:02:05 | HubSpot | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | https://t.co/3RAep6sWsQ \n If you have any q... | 137 | 1328994433585401859 | 2020-11-18 09:32:19 | boutlineprod | 0 | 0 |
| 96 | https://t.co/3RAep6sWsQ \n Our digital produ... | 106 | 1328994433145081858 | 2020-11-18 09:32:19 | boutlineprod | 0 | 0 |
| 97 | https://t.co/3RAep6sWsQ \n What's Included W... | 140 | 1328994432624914432 | 2020-11-18 09:32:18 | boutlineprod | 0 | 1 |
| 98 | Solid thread on sniff-testing machine learning... | 80 | 1328994413813538817 | 2020-11-18 09:32:14 | Twitter for iPhone | 0 | 1 |


Now we save the DataFrame to a CSV file so that we can reload it later without querying the API again.

In [152]:
df.to_csv('tweets.csv', index=False)

In [157]:
df = pd.read_csv("tweets.csv")

## 2. Preprocessing

- First, get rid of commas, periods and any other punctuation or unusual symbols
- Second, create the dictionary of words found in the tweets
- Then, reduce the dictionary
- Finally, possibly apply PCA
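The first three steps could be sketched as follows, using only the standard library. This is one possible approach under our own assumptions, not a settled design: the function names, the regular expressions and the min_count threshold are all illustrative choices, the sample texts are made up, and the PCA step is left out.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")        # links such as https://t.co/...
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")    # punctuation, emojis, symbols


def clean(text):
    """Lowercase a tweet, drop URLs and symbols, and split it into words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    return text.split()


def build_vocabulary(token_lists, min_count=2):
    """Map each word seen at least min_count times to a column index."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    kept = sorted(tok for tok, c in counts.items() if c >= min_count)
    return {tok: i for i, tok in enumerate(kept)}


def vectorize(tokens, vocabulary):
    """Turn a list of words into a bag-of-words count vector."""
    vec = [0] * len(vocabulary)
    for tok in tokens:
        idx = vocabulary.get(tok)
        if idx is not None:
            vec[idx] += 1
    return vec


# Made-up sample texts, standing in for the Tweets column
texts = [
    "Machine learning, explained! https://t.co/abc",
    "Is machine learning overhyped?",
    "Learning to cook.",
]
tokens = [clean(t) for t in texts]
vocab = build_vocabulary(tokens, min_count=2)
vectors = [vectorize(t, vocab) for t in tokens]

print(vocab)
print(vectors)
```

Dropping rare words with min_count is only one way of reducing the dictionary; removing very common words (stop words) is another. The resulting count vectors are what a later PCA or self-organizing map step would take as input.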

## 3. Visualize the clusters

## 4. Discuss the results