# Experimenting with the Tweepy library

## Authentication

In [2]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())
CONSUMER_KEY = os.getenv("TWITTER_CONSUMER_KEY")
CONSUMER_SECRET = os.getenv("TWITTER_CONSUMER_SECRET")
ACCESS_TOKEN_KEY = os.getenv("TWITTER_ACCESS_TOKEN_KEY")
ACCESS_TOKEN_SECRET = os.getenv("TWITTER_ACCESS_TOKEN_SECRET")

In [3]:
import tweepy
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN_KEY, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

## Batch API

In [6]:
from termcolor import colored
public_tweets = api.home_timeline()
for tweet in public_tweets:
    print(colored(tweet.user.screen_name, 'blue'), tweet.text)

superglaze RT @emilyrauhala: Could also be the state-backed, police-led killing spree. https://t.co/YIW4C004eR
t3n Dreckschleudern raus! Warum auch LKW-Hersteller sich jetzt neu erfinden müssen. https://t.co/e3AIGFaQkM
HilzFuld RT @TechCrunch: The current state of crypto is eerily similar to the dot com era circa 1999 https://t.co/vF5OGd8r2R
superglaze So yeah, that's where our smartphone batteries come from. Have a nice day.
superglaze Here's the now-ex-CEO of Chemaf, the Swiss owner of a Congolese mines "Of course, people die...This is really shitt… https://t.co/yiHsAK9Y3x
ThePracticalDev RT @ThePracticalDev: - Why Blog?
- How to Get Started?
- Beating Writers Block
- Gaining Readership
{ author: @ASpittel }
https://t.co/Z2pi…
superglaze The attitudes of companies involved in the cobalt industry, as quoted in this WSJ piece, are disgusting.… https://t.co/jM3Q05gIeP
t3n Themen zur #dmexco18: Es gab doch noch mehr 

## Tweet attributes

For our feature generation process, we should extract the following attributes:
* Account creation date
* User UTC offset
* Tweet creation date
* User follower count
* User friend count
* User listing count
* User status count
* User favorite count
* Number of URLs
* Number of hashtags
* Number of mentions
* Tweet text
* Tweet length
* Quoted status
* Sensitivity status
* Quoted tweet retweet count
* Quoted tweet creation date
* Retweet count
* Replied status
* User verification status

In [7]:
retweet = public_tweets[0]
tweet = public_tweets[1]

In [9]:
type(tweet.user.created_at)

datetime.datetime

In [17]:
type(tweet.user.utc_offset)

NoneType

In [75]:
user = api.get_user(user_id=tweet.user.id)
type(user.utc_offset)

NoneType

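The UTC offset is not available through Tweepy: `utc_offset` is `None` both on the user object embedded in the tweet and on a freshly fetched user. Tweepy only passes through what the API returns, and Twitter's documentation describes `utc_offset` as a deprecated field that is returned as null, so another client such as the *python-twitter* library may not provide it either. We need the UTC offset to calculate the local creation time of tweets, so this attribute has to come from another source.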
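Once an offset is available from some source, the local creation time could be computed along these lines. This is a minimal sketch, assuming the offset is given in seconds and that a `created_at` value without time zone information is in UTC; `local_created_at` is a hypothetical helper, not part of Tweepy.

```python
from datetime import datetime, timedelta, timezone


def local_created_at(created_at, utc_offset):
    """Shift a creation time to the user's local time zone.

    Returns None when no offset is known (assumed to be in seconds).
    """
    if utc_offset is None:
        return None
    # Assumption: naive datetimes are UTC.
    if created_at.tzinfo is None:
        created_at = created_at.replace(tzinfo=timezone.utc)
    return created_at.astimezone(timezone(timedelta(seconds=utc_offset)))


created = datetime(2018, 9, 14, 12, 0)
print(local_created_at(created, 7200))   # offset of +2 hours
print(local_created_at(created, None))   # no offset known
```

Returning `None` instead of raising keeps the missing-offset case visible downstream, where it can be handled like any other missing feature.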

In [16]:
type(tweet.created_at)

datetime.datetime

In [19]:
tweet.user.followers_count

148291

In [20]:
tweet.user.friends_count

181

In [21]:
tweet.user.listed_count

4141

In [22]:
tweet.user.statuses_count

63806

In [23]:
tweet.user.favourites_count

4200

In [27]:
tweet.entities['urls']

[{'display_url': 't3n.de/news/?post_typ…',
  'expanded_url': 'https://t3n.de/news/?post_type=post&p=1109830',
  'indices': [80, 103],
  'url': 'https://t.co/e3AIGFaQkM'}]

In [28]:
tweet.entities['hashtags']

[]

In [30]:
tweet.entities['user_mentions']

[]

In [32]:
type(tweet.text)

str

In [33]:
len(tweet.text)

103

In [34]:
tweet.is_quote_status

False

In [35]:
tweet.possibly_sensitive

False

In [38]:
tweet.retweet_count

0

In [42]:
retweet.is_quote_status

True

In [73]:
for t in public_tweets:
    print(t.is_quote_status, hasattr(t, 'quoted_status'), hasattr(t, 'retweeted_status'))

True False True
False False False
False False True
False False False
False False False
False False True
False False False
False False False
True True False
False False False
False False False
False False False
False False False
True False True
True True False
False False False
False False True


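The `is_quote_status` flag alone does not separate quotes from retweets. It is `True` for quotes, which carry a `quoted_status` attribute, but also for some retweets, which carry a `retweeted_status` attribute and no top-level `quoted_status`. We therefore have to combine several checks to determine the exact type of an incoming tweet.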
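The combinations printed above suggest one possible order of checks, sketched here. `tweet_kind` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mimic only the attributes involved, not real Tweepy statuses. The `"unknown"` branch covers a combination that does not occur in the output above and is included as an assumption.

```python
from types import SimpleNamespace


def tweet_kind(status):
    """Classify a status as 'retweet', 'quote', 'original', or 'unknown'."""
    # Check for a retweet first: some retweets also have
    # is_quote_status == True but no top-level quoted_status.
    if hasattr(status, "retweeted_status"):
        return "retweet"
    if getattr(status, "is_quote_status", False):
        if hasattr(status, "quoted_status"):
            return "quote"
        return "unknown"  # flag set, but no quoted tweet attached
    return "original"


# Stand-ins for the three combinations observed in the timeline.
retweet_of_quote = SimpleNamespace(is_quote_status=True, retweeted_status=object())
quote = SimpleNamespace(is_quote_status=True, quoted_status=object())
plain = SimpleNamespace(is_quote_status=False)

for s in (retweet_of_quote, quote, plain):
    print(tweet_kind(s))
```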

In [60]:
public_tweets[14].quoted_status.retweet_count

19

In [61]:
public_tweets[14].quoted_status.user.followers_count

44074

In [43]:
retweet.retweet_count

2

In [44]:
retweet.retweeted_status.retweet_count

2

In [70]:
tweet.in_reply_to_status_id is None

True

In [71]:
tweet.user.verified

False

In [76]:
tweet.entities

{'hashtags': [],
 'media': [{'display_url': 'pic.twitter.com/TgTiYBpMMy',
   'expanded_url': 'https://twitter.com/drewconway/status/1010137695664951297/photo/1',
   'id': 1010137691562770432,
   'id_str': '1010137691562770432',
   'indices': [44, 67],
   'media_url': 'http://pbs.twimg.com/media/DgS64OfVAAAVgOe.jpg',
   'media_url_https': 'https://pbs.twimg.com/media/DgS64OfVAAAVgOe.jpg',
   'sizes': {'large': {'h': 699, 'resize': 'fit', 'w': 700},
    'medium': {'h': 699, 'resize': 'fit', 'w': 700},
    'small': {'h': 679, 'resize': 'fit', 'w': 680},
    'thumb': {'h': 150, 'resize': 'crop', 'w': 150}},
   'source_status_id': 1010137695664951297,
   'source_status_id_str': '1010137695664951297',
   'source_user_id': 18463930,
   'source_user_id_str': '18463930',
   'type': 'photo',
   'url': 'https://t.co/TgTiYBpMMy'}],
 'symbols': [],
 'urls': [],
 'user_mentions': [{'id': 18463930,
   'id_str': '18463930',
   'indices': [3, 14],
   'name': 'Drew Conway',
   'screen_name': 'drewconway

In [77]:
my_tweets = api.user_timeline()
for tweet in my_tweets:
    print(colored(tweet.user.screen_name, 'blue'), tweet.text)

_fpeters Interesting read about deploying multimodal neural networks
https://t.co/1gk7LIm1oT
_fpeters Testing if retweets appear in the stream https://t.co/ab5mN5xcyi
_fpeters Testing a new Twitter library called tweepy for my retweet prediction project https://t.co/dkGtS8CLkA
_fpeters RT @thecheckdown: Catch didn't count but that footwork 😳  @juliojones_11 https://t.co/zpKBZCFH0h
_fpeters RT @1FSVMainz05: UNSER TRAUM LEBT ... und wir kämpfen weiter! 💪 https://t.co/gMlz0B2pDi
_fpeters RT @1FSVMainz05: ... denn Toleranz und Weltoffenheit zählen zu unseren Werten – ohne Alternative! https://t.co/a295x9CN2B
_fpeters RT @StrictlyVC: Very Good Security makes data ‘unhackable’ with $8.5M from Andreessen https://t.co/emN7itjh2t
_fpeters Under the Hood of Uber’s Experimentation Platform https://t.co/xUVhWEkzQ2 via @ubereng
_fpeters https://t.co/GRVJ2SKgge
_fpeters RT @ThePracticalDev: Best Open Source Too

In [78]:
tweet_with_video = my_tweets[3]

In [80]:
len(tweet_with_video.retweeted_status.entities['media'])

1

In [84]:
len(my_tweets[2].entities['media'])

KeyError: 'media'

The `media` key is only present in `entities` when a tweet actually has media attached. We need to check that the key exists before counting the media entities.

In [86]:
print('media' in my_tweets[2].entities)

False
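One way to fold the existence check into the count is to fall back to an empty list when the key is missing. This is a minimal sketch with plain dictionaries standing in for `tweet.entities`; `count_entities` is a hypothetical helper name.

```python
def count_entities(entities, key):
    """Count entries of one entity type, treating a missing key as zero."""
    return len(entities.get(key) or [])


# Plain dicts standing in for tweet.entities.
without_media = {"hashtags": [], "urls": [{"url": "https://t.co/dkGtS8CLkA"}]}
with_media = {"hashtags": [], "media": [{"type": "photo"}]}

print(count_entities(without_media, "media"))  # 0
print(count_entities(with_media, "media"))     # 1
```

The same helper works for `urls`, `hashtags`, and `user_mentions`, so all entity counts can go through one code path.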


## Retweet attributes

Since we get full status information for all retweets, we can extract the same attributes for each retweet.

Main challenges:
* Sort out retweets from our user group
* Distinguish between tweets from the user group and retweets of their tweets
* Identify special cases, e.g., intra-user group retweeting and quotes
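The first two challenges and the intra-group case could be handled by comparing author IDs against the set of tracked users, as sketched below. The function name, the category labels, and the example IDs are all hypothetical, the stand-in objects mimic only `user.id` and `retweeted_status`, and quotes are left out for brevity.

```python
from types import SimpleNamespace


def relation_to_group(status, group_ids):
    """Describe how a status relates to the tracked user group."""
    author_in_group = status.user.id in group_ids
    original = getattr(status, "retweeted_status", None)
    if original is None:
        return "group_tweet" if author_in_group else "unrelated"
    original_in_group = original.user.id in group_ids
    if author_in_group and original_in_group:
        return "intra_group_retweet"
    if original_in_group:
        return "retweet_of_group_tweet"
    if author_in_group:
        return "group_retweet_of_outsider"
    return "unrelated"


def make_status(user_id, retweeted=None):
    """Build a minimal stand-in for a Tweepy status (illustration only)."""
    status = SimpleNamespace(user=SimpleNamespace(id=user_id))
    if retweeted is not None:
        status.retweeted_status = retweeted
    return status


group = {1, 2}  # hypothetical user IDs
print(relation_to_group(make_status(1), group))
print(relation_to_group(make_status(9, make_status(1)), group))
print(relation_to_group(make_status(2, make_status(1)), group))
```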

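We have seen that we cannot rely on every attribute being present or populated on a Tweepy status: `utc_offset` is `None`, `quoted_status` can be absent even when `is_quote_status` is `True`, and `entities` has no `media` key for tweets without media. Thus, we have to check each attribute before reading it and, in the worst case, discard tweets that lack required information.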
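A defensive extraction step could look like the following minimal sketch. Which attributes count as required is an assumption made for illustration, as are the feature names and the `False` default for `possibly_sensitive`; the stand-in objects are not real Tweepy statuses.

```python
from types import SimpleNamespace

# Assumption: a tweet without these attributes is discarded.
REQUIRED = ("created_at", "text", "retweet_count")


def extract_features(status):
    """Return a feature dict, or None if a required attribute is missing."""
    if any(getattr(status, name, None) is None for name in REQUIRED):
        return None
    entities = getattr(status, "entities", None) or {}
    return {
        "created_at": status.created_at,
        "text": status.text,
        "length": len(status.text),
        "retweet_count": status.retweet_count,
        "n_urls": len(entities.get("urls", [])),
        "n_hashtags": len(entities.get("hashtags", [])),
        "n_mentions": len(entities.get("user_mentions", [])),
        "is_reply": getattr(status, "in_reply_to_status_id", None) is not None,
        "possibly_sensitive": getattr(status, "possibly_sensitive", False),
    }


complete = SimpleNamespace(
    created_at="2018-09-14",
    text="hello #world",
    retweet_count=0,
    entities={"hashtags": [{"text": "world"}]},
)
incomplete = SimpleNamespace(text="no creation date")

print(extract_features(complete))
print(extract_features(incomplete))  # None, so the tweet is dropped
```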