# Network Based Social Media Analytics
### Q Smart 2306288s
#### Github: https://github.com/axqs/web_sci_ae

Code in the next cell is taken from John H. Williamson's lecture notes for Data Fundamentals 2019

### Contents
1. [Introduction](#intro)
1. [Data Crawl](#data_crawl)
  1. [Streaming](#stream)
  1. [REST Probing](#rest)
1. [Group Tweets](#grouping)
1. [Capture and Organize User and Hashtag Information](#organize)
1. [Network Analysis](#analysis)

In [None]:
import IPython.display
IPython.display.HTML("""
    <script>
      function code_toggle() {
        if (code_shown){
          $('div.input').hide('500');
          $('#toggleButton').val('Show Code')
        } else {
          $('div.input').show('500');
          $('#toggleButton').val('Hide Code')
        }
        code_shown = !code_shown
      }

      $( document ).ready(function(){
        code_shown=false;
        $('div.input').hide()
      });
    </script>
    <form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>
""")

In [1]:
import pymongo
import tweepy
import os
import time

#mongodb access
password = os.environ["WEBSCI_MONGODB_PASS"]
client = pymongo.MongoClient("mongodb://qsmart:"+password+"@qs-web-science-ae-shard-00-00-4pdo2.mongodb.net:27017,qs-web-science-ae-shard-00-01-4pdo2.mongodb.net:27017,qs-web-science-ae-shard-00-02-4pdo2.mongodb.net:27017/test?ssl=true&replicaSet=qs-web-science-ae-shard-0&authSource=admin&retryWrites=true&w=majority")

#tweepy access
pk = os.environ["WEBSCI_PKEY"]
sk = os.environ["WEBSCI_SKEY"]
pt = os.environ["WEBSCI_PTOKEN"]
st = os.environ["WEBSCI_STOKEN"]

auth = tweepy.OAuthHandler(pk, sk)
auth.set_access_token(pt, st)

api = tweepy.API(auth)

print("Everything imported OK.")

Everything imported OK.


<a id="intro"></a>
## Introduction
1. Describe the software developed with appropriate details; if you have used code from elsewhere please specify it
2. Specify the time and duration of data collected

# Tweepy
### Streaming is push of data by Twitter, Search is pull of data initiated by the end user.
* one percent is from 1 hour stream, get additional data from rest probing
* put sample data in csv (100 rows)
* bad clusters expected, around 10, isolate noisy data -- smaller clusters can be ignored (2 ish docs)
* stats: duration (ie for 10 clusters run for 1 hour, 6 min duration for groups), how many docs, num likes, num retweets, # hashtags, # users
* tells tats of all clusters, only save valid ones
* report on how much data from stream, and how much from rest
* NO GRAPHS
* tweets: only care about tweet id, date created, username, text
* task 3: num users, top users and # connections
#### How to cluster
* scikit to cluster
* remove retweets to cluster - explain why
#### Part 3 Report
* can draw graph of cluster sizes
* summarize across all clusters, then elaborate with certain clusters
#### Part 4&5 report
* two groups, average all in each group, (stream and rest)

In [2]:
#neatly prints a dictionary object
def pretty_print(d, indent=0):
    for key,value in d.items():
        # print the indent, and then the key
        print(" "*indent,key,":", end=' ')

        # check if it's a dictionary: if so, recursive call
        if type(value)==type({}):
            print()
            pretty_print(value, indent+4)

        # if it's a list, print a comma separated sequence
        elif type(value)==type([]):
            print(", ".join(str(x) for x in value))
        else:
            # just a plain value
            print(value)

In [3]:
def filter_user_info(dictionary):
    #if user already exists in dictionary
    try:
        user_ids[dictionary["screen_name"]]["count"] += 1
    #if user does not exist
    except KeyError:
        tweet_user = {}
        # list to filter relevant user information
        user_info = ["name","screen_name","verified","location",
                     "followers_count","friends_count","favourites_count"]
        for j in user_info:
            tweet_user[j] = dictionary[j]
        tweet_user["count"] = 1
        #add new user to dictionary
        user_ids[dictionary["screen_name"]] = tweet_user
    # returns screen name of user
    return dictionary["screen_name"]

In [4]:
def filter_tweet_info(dictionary):
    filtered_info = {}
    # list to filter relevant user information
    tweet_info = ["created_at","user",
                  "extended_tweet","hashtags",
                  "in_reply_to_screen_name","retweeted_status",
                  "is_quote_status",
                  "retweet_count","favorite_count","quote_count","reply_count",]
    for j in tweet_info:
        if(j == "user"):
            filtered_info[j] = filter_user_info(dictionary["user"])
        else:
            try:
                # if "extended_tweet in dictionary, get full text. otherwise, get regular text"
                if j == "extended_tweet":
                    try:
                        filtered_info["text"] = dictionary[j]["full_text"]
                    except KeyError:
                        filtered_info["text"] = dictionary["text"]
                        
                # if tweet is a retweet, get the id of the retweeted tweet
                elif j == "retweeted_status":
                    try:
                        filtered_info["retweeted_status"] = dictionary["retweeted_status"]["screen_name"]
                        try:
                            filtered_info["text"] = dictionary["retweeted_status"]["extended_tweet"]["full_text"]
                        except:
                            filtered_info["text"] = dictionary["retweeted_status"]["text"]
                    except KeyError:
                        filtered_info["retweeted_status"] = False

                # get all hashtags in tweet 
                elif j == "hashtags":
                    filtered_info["hashtags"] = []
                    for h in dictionary["entities"]["hashtags"]:
                        filtered_info["hashtags"].append(h["text"])
                else:
                    filtered_info[j] = dictionary[j]
            except KeyError:
                filtered_info[j] = 0
    # returns dictionary of filtered tweet information
    return filtered_info

<a id="data_crawl"></a>
## Data Crawl

1. Use Twitter Streaming API for collecting 1% data
  * Specify the APIs used
    1. Please do not include entire code here; just main description of the function
    2. Along with a short description/justification
2. Enhance the crawling using the hybrid architecture of Twitter Streaming & REST
APIs
  * Specify the APIs used
    1. Please do not include entire code here; just main description of the function. Please describe how you developed a hybrid crawler.
    2. Along with a short description/justification

<a id="stream"></a>
### Twitter Streaming

In [36]:
from datetime import datetime
from IPython.display import clear_output

#run stream for an hour
def streamTimer():
    currentTime = time.time()
    #check to see if hour has passed, 3600 seconds = 1 hr
    if(currentTime - startTime >= 2400):
        clear_output()
        print("Tracking: "+str(tracking))
        print("Stream started at: ",startStream)
        print("Collected:",len(streamed_tweets))
        print("Stream ended at: ",datetime.now().strftime("%m/%d/%Y, %H:%M:%S"))
        print("Disconnecting stream . . .")
        stream.disconnect()
        print("Stream disconnected.")
        
# for each tweet in the stream, add it to the streamed_tweets dictionary
class TwitterStreamer(tweepy.StreamListener):
    def on_status(self, status):
        print(status)
        streamed_tweets[status._json["id"]] = filter_tweet_info(status._json)
        print("-"*100,"Collected:",len(streamed_tweets))
        streamTimer()
        
    def on_error(self,status_code):
        print(status_code)
        return False

In [37]:
streamed_tweets = {}
user_ids = {}
# create listener
listener = TwitterStreamer(api=tweepy.API(auth,
                           wait_on_rate_limit=True, 
                           wait_on_rate_limit_notify=True))
# create streamer
stream = tweepy.Stream(auth=auth, listener=listener)

# words to look for and track in tweets
tracking = [
    "baseball","mlb","Major","League","Baseball",
    "world","series","spring","training","opening","day",
    "yankees","astros","dodgers","la","angels","chicago","cubs","tb","rays","red","sox",
    "cheated","scandal","houston","asterisks","sign","stealing","cheating",
    "astroscheated","astrosscandal","houstonasterisks","astroscheatingscandal",
]

print("Tracking: "+str(tracking))
# start timer for stream
startTime = time.time()
startStream = datetime.now().strftime("%m/%d/%Y, %H:%M:%S")
print("Stream started at: ",startStream)
# filter stream
stream.filter(track=tracking, languages=['en'], is_async=True)

Tracking: ['baseball', 'mlb', 'Major', 'League', 'Baseball', 'world', 'series', 'spring', 'training', 'opening', 'day', 'yankees', 'astros', 'dodgers', 'la', 'angels', 'chicago', 'cubs', 'tb', 'rays', 'red', 'sox', 'cheated', 'scandal', 'houston', 'asterisks', 'sign', 'stealing', 'cheating', 'astroscheated', 'astrosscandal', 'houstonasterisks', 'astroscheatingscandal']
Stream started at:  03/07/2020, 14:46:35
Status(_api=<tweepy.api.API object at 0x1288afd50>, _json={'created_at': 'Sat Mar 07 14:46:31 +0000 2020', 'id': 1236302211287625735, 'id_str': '1236302211287625735', 'text': 'Imagine! 🖤', 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 626453096, 'id_str': '626453096', 'name': 'Michael Francis Williams', 'screen_name': 'parleyment', 'locati

Status(_api=<tweepy.api.API object at 0x1288afd50>, _json={'created_at': 'Sat Mar 07 14:46:31 +0000 2020', 'id': 1236302211610746888, 'id_str': '1236302211610746888', 'text': 'RT @lifeofrickey: HAVING A BAD DAY?? THEN THIS IS FOR YOU!!! https://t.co/jFPcRAbalr', 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 2272993522, 'id_str': '2272993522', 'name': 'Toni Marinaccio', 'screen_name': 'toni_marinaccio', 'location': None, 'url': None, 'description': None, 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 358, 'friends_count': 362, 'listed_count': 4, 'favourites_count': 3870, 'statuses_count': 1844, 'created_at': 'Fri Jan 10 14:05:12 +0000 2014', 'utc_offset': None, 'time_zone': None, 'geo_enabled': True, 'lang':

Status(_api=<tweepy.api.API object at 0x1288afd50>, _json={'created_at': 'Sat Mar 07 14:46:31 +0000 2020', 'id': 1236302211686125568, 'id_str': '1236302211686125568', 'text': "RT @narendramodi: G'day @ScottMorrisonMP! \n\nIt doesn't get bigger than the India vs Australia Final in Women's @T20WorldCup tomorrow. \n\nBes…", 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 714682246788169728, 'id_str': '714682246788169728', 'name': 'Pradeep Chaudhary', 'screen_name': 'pradeep85300199', 'location': 'India', 'url': None, 'description': 'India First! A common citizen !  Jai Hind🇮🇳', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 86, 'friends_count': 151, 'listed_count': 0, 'favourites_count': 5346, 'statuses_count'

Status(_api=<tweepy.api.API object at 0x1288afd50>, _json={'created_at': 'Sat Mar 07 14:46:31 +0000 2020', 'id': 1236302211879251968, 'id_str': '1236302211879251968', 'text': 'RT @you_clowns: Epic clown of the day 🤡 https://t.co/qvl9SwOpkE', 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 579837226, 'id_str': '579837226', 'name': 'Ghag', 'screen_name': 'OfOpinion', 'location': 'Toronto, Ontario', 'url': None, 'description': 'Everyone has an opinion this is mine... brighter days are coming to all my friends struggling with anxiety', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 18, 'friends_count': 99, 'listed_count': 2, 'favourites_count': 440, 'statuses_count': 805, 'created_at': 'Mon May 14 12:30:42 +000

Status(_api=<tweepy.api.API object at 0x1288afd50>, _json={'created_at': 'Sat Mar 07 14:46:31 +0000 2020', 'id': 1236302211862462464, 'id_str': '1236302211862462464', 'text': 'RT @driven2drink: Listen.\n\nTheir cover of, "Summer Breeze," may be thee greatest interpretation of an already lovely song in the history of…', 'source': '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 155700400, 'id_str': '155700400', 'name': 'Eric M. Black', 'screen_name': 'eb4prez', 'location': 'BROOKLYN, NEW YORK', 'url': 'http://www.facebook.com/eb4prez', 'description': 'Soul music cognoscente. Chicken wing connoisseur. Sci-fi enthusiast. Soul Brother Number Two. Mr. #TurnThisUp!!! Orange Moon from the Otherside of the Game', 'translator_type': 'none', 'protected': False, 'ver

Status(_api=<tweepy.api.API object at 0x1288afd50>, _json={'created_at': 'Sat Mar 07 14:46:31 +0000 2020', 'id': 1236302212101484546, 'id_str': '1236302212101484546', 'text': 'RT @damocrat: People on the left: stop slagging off free movement.\n\nAs a non-wealthy Brit, it allowed me to:\n\n• Open a record shop in Ibiza…', 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 872158753008877569, 'id_str': '872158753008877569', 'name': 'JK', 'screen_name': 'JKLDNMAD', 'location': None, 'url': None, 'description': 'Does not like Brexit.\nSemi Lancastrian. Lancaster, Walthamstow & Madrid', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 4741, 'friends_count': 5215, 'listed_count': 10, 'favourites_count': 163643, 'statu

Status(_api=<tweepy.api.API object at 0x1288afd50>, _json={'created_at': 'Sat Mar 07 14:46:31 +0000 2020', 'id': 1236302212675932161, 'id_str': '1236302212675932161', 'text': 'RT @_alicejay: did you guys camp out in front of a starbucks until asian people walked past the sign for this picture', 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 310114685, 'id_str': '310114685', 'name': 'mimi kanassis', 'screen_name': 'daffycand', 'location': None, 'url': 'http://Instagram.com/daffycan', 'description': 'monopoly deal enthusiast', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 613, 'friends_count': 378, 'listed_count': 6, 'favourites_count': 86133, 'statuses_count': 27283, 'created_at': 'Fri Jun 03 06:41:49 +0000 

Status(_api=<tweepy.api.API object at 0x1288afd50>, _json={'created_at': 'Sat Mar 07 14:46:31 +0000 2020', 'id': 1236302212889858048, 'id_str': '1236302212889858048', 'text': "RT @andraydomise: There are less than 10 Black billionaires in America and they'll be fine.", 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 893142444312862721, 'id_str': '893142444312862721', 'name': 'Mirain Haf Griffiths 🌹🏳️\u200d🌈🏴\U000e0067\U000e0062\U000e0077\U000e006c\U000e0073\U000e007f', 'screen_name': 'Mirainhaf100', 'location': 'Wales, United Kingdom', 'url': None, 'description': 'She/Her, Bernie Bro, Russian Bot, In general a disaster of a human being ‘PrOnOuNs In BiO’', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 101, 'f

Status(_api=<tweepy.api.API object at 0x1288afd50>, _json={'created_at': 'Sat Mar 07 14:46:31 +0000 2020', 'id': 1236302213250719745, 'id_str': '1236302213250719745', 'text': 'RT @jaimelsnnister: these la jophie days truly have the best vibes ☀️ https://t.co/xkcrdrxnkS', 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 2191476004, 'id_str': '2191476004', 'name': 'Blake', 'screen_name': 'winterstvorm', 'location': 'Italia', 'url': None, 'description': '19 // cinephile // tv-series and books addicted // always dreaming // INFJ-T // 🇮🇹🇬🇧🇪🇸🇩🇪 // indecisa cronica', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 1325, 'friends_count': 1315, 'listed_count': 5, 'favourites_count': 17866, 'statuses_count': 27882, 'cr

Status(_api=<tweepy.api.API object at 0x1288afd50>, _json={'created_at': 'Sat Mar 07 14:46:31 +0000 2020', 'id': 1236302213477281794, 'id_str': '1236302213477281794', 'text': 'RT @YasMohammedxx: For #InternationalWomenDay2020 I want to honor Loujain Al Hathloul, a fierce feminist from Saudi Arabia who is imprisone…', 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 4084992761, 'id_str': '4084992761', 'name': 'Hadassah Goldberg', 'screen_name': 'hadassahg55', 'location': None, 'url': None, 'description': None, 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 18, 'friends_count': 118, 'listed_count': 0, 'favourites_count': 5508, 'statuses_count': 386, 'created_at': 'Sat Oct 31 20:33:18 +0000 2015', 'utc_offset': N

Status(_api=<tweepy.api.API object at 0x1288afd50>, _json={'created_at': 'Sat Mar 07 14:46:31 +0000 2020', 'id': 1236302213703704577, 'id_str': '1236302213703704577', 'text': 'RT @Coolwednesdays_: Dis nigga baby just addressed da rumors, dissed his babymomma &amp; made a hook in one freestyle dis nigga might just be d…', 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 1105913649783341058, 'id_str': '1105913649783341058', 'name': 'ً', 'screen_name': 'RashaanTM', 'location': 'ask your girl', 'url': 'https://www.snapchat.com/add/rashaan.tm', 'description': '‘RASHAANTM’ for 10% off @Duraggy', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 3767, 'friends_count': 2270, 'listed_count': 10, 'favourites_count': 26257,

Status(_api=<tweepy.api.API object at 0x1288afd50>, _json={'created_at': 'Sat Mar 07 14:46:31 +0000 2020', 'id': 1236302213745696769, 'id_str': '1236302213745696769', 'text': 'RT @Anti: Damn this is actually a big pickup, shoutout to teams getting involved in Smash out in Europe! https://t.co/cRfJswt563', 'source': '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 2973597321, 'id_str': '2973597321', 'name': 'Garrote 🐸', 'screen_name': 'GarroteSSB', 'location': None, 'url': 'https://www.twitch.tv/garrotessb', 'description': '●Competitive Smash Ultimate Player● ●Greninja Main🐸● ●Streamer● ●Ranked #20 in Spain● ●#300 Spain Osu!● ●Jungler Diamond● Business: Sergio98_gl@hotmail.com C🌻', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 520

Status(_api=<tweepy.api.API object at 0x1288afd50>, _json={'created_at': 'Sat Mar 07 14:46:31 +0000 2020', 'id': 1236302214257352709, 'id_str': '1236302214257352709', 'text': 'RT @senior_debra: When will the World insist that China put in place preventative measures to reassure thes diseases dont keep reoccurring.…', 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 1212942963950510081, 'id_str': '1212942963950510081', 'name': 'Debra Senior', 'screen_name': 'senior_debra', 'location': 'England, United Kingdom', 'url': None, 'description': 'Devoted dog mum. Passionate about animal welfare and the natural world. Other than that just an ordinary person. No fancy titles. 🐕 woof.', 'translator_type': 'none', 'protected': False, 'verified': False, 'fol

Status(_api=<tweepy.api.API object at 0x1288afd50>, _json={'created_at': 'Sat Mar 07 14:46:32 +0000 2020', 'id': 1236302214697816071, 'id_str': '1236302214697816071', 'text': "RT @univrsle: @HarmerDan I think the trouble is that the teacher's unions need the time to explain WHY this press conference was bad and th…", 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 800915954, 'id_str': '800915954', 'name': 'Mrs K', 'screen_name': 'ItsMe_MrsK', 'location': 'Canada', 'url': None, 'description': '❤️ Art, Music, Food, Travel and FUN!', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 220, 'friends_count': 242, 'listed_count': 1, 'favourites_count': 3206, 'statuses_count': 1622, 'created_at': 'Mon Sep 03 18:10:41 +00

In [39]:
stream.disconnect()

In [11]:
print("Tweets acquired by streaming:",len(streamed_tweets))

Tweets acquired by streaming: 119975


<a id="rest"></a>
### REST Probing

In [20]:
# get 1000 most recent tweets by user, given username/screen_name
def getTweetsByUsername(name):
    print(name,end=", ")
    timeline = tweepy.Cursor(api.user_timeline, screen_name=name, count=200, lang="en", include_rts=False, wait_on_rate_limit=True).items(1000)
    tweepy.Cursor(api.search, lang="en", count=100, until="").items(1000)
    for item in timeline:
        filter_tweet = filter_tweet_info(item._json)
        rest_tweets[item._json["id"]] = filter_tweet

In [21]:
# get all tweets in the past year that contain certain keywords
def getTweetsByKeywords(words):
    print("Getting tweets with:",end=" ")
    for word in words:
        print(word,end=", ")
        timeline = tweepy.Cursor(api.search, q=word, count=200, since="2019-03-06", until="2020-03-06", wait_on_rate_limit=True).items()
        for item in timeline:
            filter_tweet = filter_tweet_info(item._json)
            # add tweet to rest_tweets dictionary
            rest_tweets[item._json["id"]] = filter_tweet

In [22]:
keywords = ["#astroscheated","#astrosscandal","#houstonasterisks","#StripTheTitle","#TaintedTitle","AsteriskTour"]
mlb_official_accounts = ['BlueJaysPR', 'DetroitTigersPR', 'LAAngelsPR', 'MarinersPR', 'Rockies', 'RangerBlake', 
                         'whitesox', 'Phillies', 'Marlins', 'Indians', 'Cardinals', 'Brewers', 'astros', 'MLBPAA', 
                         'SFGiants', 'Mariners', 'BlueJays', 'Cubs', 'Rangers', 'Yankees', 'RedSox', 'MLB_PLAYERS', 
                         'RaysBaseball', 'Nationals', 'Twins', 'Angels', 'RockiesClubInfo', 'Orioles', 'Mets', 'MLB_PR', 
                         'Pirates', 'Padres', 'YankeesPR', 'Reds', 'Dbacks', 'tigers', 'MarlinsComms', 'Royals', 
                         'Dodgers', 'Braves', 'Athletics','MLB']
rest_tweets = {}
getTweetsByKeywords(keywords)
print("\nGetting tweets by:",end=" ")
for i in mlb_official_accounts:
    getTweetsByUsername(i)
print("\nDone.")

Getting tweets with: #astroscheated, #astrosscandal, #houstonasterisks, #StripTheTitle, #TaintedTitle, AsteriskTour, 
Getting tweets by: BlueJaysPR, DetroitTigersPR, LAAngelsPR, MarinersPR, Rockies, RangerBlake, whitesox, Phillies, Marlins, Indians, Cardinals, Brewers, astros, MLBPAA, SFGiants, Mariners, BlueJays, Cubs, Rangers, Yankees, RedSox, MLB_PLAYERS, RaysBaseball, Nationals, Twins, Angels, RockiesClubInfo, Orioles, Mets, MLB_PR, Pirates, Padres, YankeesPR, Reds, Dbacks, tigers, MarlinsComms, Royals, Dodgers, Braves, Athletics, MLB, 
Done.


In [23]:
print("Tweets found by REST probing:",len(rest_tweets))

Tweets found by REST probing: 117091


<a id="grouping"></a>
## Group Tweets
1. Describe your method for grouping & provide statistics on groups.
2. Describe the method for Username and Hashtag identification. 
  * Provide in a tabular fashion data statistics.
  * Total data; groups; average size of a group; min size; max size etc.
  * Provide data on a tabular fashion and contrast for entire data and just for grouped data.

In [30]:
def strip_documents(documents):
    filtered_documents = []
    # filter tweet text to remove any links and mentions before clustering
    for i in documents:
        split_words = i.strip().split()
        filtered_line = []
        for s in split_words:
            if "http" not in s and "@" not in s and "#" not in s:
                filtered_line.append(s)
        filtered_line = " ".join(filtered_line)
        filtered_documents.append(filtered_line)
    return filtered_documents

def filter_documents(docs):
    filtered_docs = []
    phrases = [
        "baseball","mlb","Major League Baseball",
        "world series","spring training","opening day",
        "yankees","astros","dodgers","los angeles angels","chicago cubs","tampa bay rays","red sox","la angels","tb rays",
        "astros cheated","astros scandal","houston asterisks","sign stealing","astros cheating scandal",
        "astroscheated","astrosscandal","houstonasterisks","astroscheatingscandal"
    ]
    for doc in docs:
        count = 0
        for p in phrases:
            if p.lower() in doc.lower():
                count += 1
        if count >= 1:
            filtered_docs.append(doc)
    return filtered_docs

In [31]:
import pandas as pd

users_df =  pd.DataFrame.from_dict(user_ids).T
streamed_df = pd.DataFrame.from_dict(streamed_tweets).T
rest_df = pd.DataFrame.from_dict(rest_tweets).T

documents = streamed_df["text"].tolist() + rest_df["text"].tolist()
documents = strip_documents(documents)
filtered = filter_documents(documents)
print(len(documents),"documents")
print(len(filtered),"filtered documents")

237066 documents
28926 filtered documents


In [34]:
def cluster(text):
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(text)
    print(len(text))
    true_k = int(str(len(text))[:2])
    model = KMeans(n_clusters=true_k, init='k-means++', max_iter=1000, n_init=1)
    model.fit(X)

    print("Top terms per cluster:")
    order_centroids = model.cluster_centers_.argsort()[:, ::-1]
    terms = vectorizer.get_feature_names()
    for i in range(true_k):
        print ("Cluster %d:" % i)
        for ind in order_centroids[i, :10]:
            print(terms[ind],end=", ")
        print("\n")

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

cluster(filtered)

28926
Top terms per cluster:
Cluster 0:
combined, swing, miss, single, curveballs, sliders, kershaw, 51, 2017, world, 

Cluster 1:
gonna, long, year, astros, rt, season, espera, esposito, espn, espinosa, 

Cluster 2:
february, 26th, times, pitches, games, players, hit, astros, rt, championpicks, 

Cluster 3:
right, getting, fans, training, spring, rt, 𝟳𝟮, essential, essay, esposito, 

Cluster 4:
location, bac, middle, steal, signs, fucking, need, won, pitch, year, 

Cluster 5:
row, starter, hits, batters, opening, day, astros, rt, essay, esposito, 

Cluster 6:
going, long, year, astros, rt, season, espera, esposito, espn, espinosa, 

Cluster 7:
training, spring, league, hilarious, hbps, graph, bar, end, shows, rt, 

Cluster 8:
draft, mlb, round, 2019, university, selected, pick, selection, tigers, 2018, 

Cluster 9:
baseball, best, today, rt, play, gods, time, gave, just, game, 

Cluster 10:
team, draws, ch, kid, class, wins, starts, amp, season, world, 

Cluster 11:
beanball, continue

<a id="organize"></a>
## Capture and Organize User and Hashtag Information
1. Develop a method to capture user mention information. Users occurring together in general data (that is data you collected in part 1) as well as on the groups (result of part 2). 
  * You contrast between these two parts in terms of user interaction graph. 
  * Differentiate between different kinds of networks like retweet network; quote tweets etc. 
  * Provide information on data structure used.
2. Develop and describe a mechanism to capture hashtag information occurring together in general data as well as on the groups. 
  * Provide information on the data structure used.
3. Provide tabular data and contrast the data between various cases.

<a id="analysis"></a>
## Network Analysis
1. Analyse links ( path of 2) and Triads (closed loops) in all data (Part 1) and contrast with groups formed in part 2 above.
2. Explain the method to compute these motifs.
3. Provide statistics on your analysis (how many ties and triads)