# Writing a Twitter scraper


This notebook is part of the _d2d3 (data to d3) project_ to help data engineers and information visualizers to gather, clean, format, explore, munge, understand and feature engineer data before visualization.

_garbage in, garbage out (GIGO)_


GIGO (garbage in, garbage out) is a concept common to computer science and mathematics: the quality of output is determined by the quality of the input. The purpose of the _d2d3 (data to d3) project_ is to create tutorials and tools that  gather, clean, format, explore, munge, understand and feature engineer data before creating a beautiful d3 visualization.

Email _d2d3_ at _d2d3.net_ for more info.  



# What is python-twitter

*python-twitter* is one of the many libraries that make it easy to query Twitter data.  

Twitter has provided `REST API's ` [https://dev.twitter.com/rest/public](https://dev.twitter.com/rest/public) which can be used by
developers to access and read Twitter data. They have also provided a
`Streaming API ` [https://dev.twitter.com/streaming/overview](https://dev.twitter.com/streaming/overview) which can
be used to access Twitter Data in real-time.

To use the Streaming API one needs to get an access key by applying for a Twitter developer account [https://developer.twitter.com/en/apply-for-access](https://developer.twitter.com/en/apply-for-access)  

All new developers must apply for a developer account to access the Twitter developer platform. Once approved, you can begin to use the new Twitter API v2, or the v1.1 standard and premium APIs.   

Most of the software written to access Twitter data provide a library
which functions as a wrapper around Twitter's Search and Streaming API's
and are therefore constrained by the limitations of the API's.

With Twitter's Search API you can only send 180 Requests every 15
minutes. With a maximum number of 100 tweets per Request, you
can mine 72 tweets per hour (4 x 180 x 100 =72) . By using
TwitterScraper you are not limited by this number but by your internet
speed/bandwith and the number of instances of TwitterScraper you are
willing to start.

One of the bigger disadvantages of the Search API is that you can only
access Tweets written in the **past 7 days**. This is a major bottleneck
for anyone looking for older data. With TwitterScraper there is no such 
limitation.

Per Tweet it scrapes the following information:
 + Tweet-id
 + Tweet-url
 + Tweet text
 + Tweet html
 + Links inside Tweet
 + Hashtags inside Tweet
 + Image URLS inside Tweet
 + Video URL inside Tweet
 + Tweet timestamp
 + Tweet Epoch timestamp
 + Tweet No. of likes
 + Tweet No. of replies
 + Tweet No. of retweets
 + Username
 + User Full Name / Screen Name
 + User ID
 + Tweet is an reply to
 + Tweet is replied to
 + List of users Tweet is an reply to
 + Tweet ID of parent tweet

 
In addition it can scrape for the following user information:
 + Date user joined
 + User location (if filled in)
 + User blog (if filled in)
 + User No. of tweets
 + User No. of following
 + User No. of followers
 + User No. of likes
 + User No. of lists
 + User is verified


# Install python-twitter

To install **python-twitter**:

```python

    (sudo) pip install python-twitter  

```

#  CLI (Command Line Interace)

Robots are typcally run through the command line via scripts and automated to run 24/7 on servers.  A separate notebook will discuss how to automate the scripts and create a _bot_. This notebook focuses on using the _python-twitter_ library.




# Creating a Twitter app

1. Go to your Twitter developer account and sign-in [https://developer.twitter.com/](https://developer.twitter.com/)  

2. Create an app (Note you must be approved by Twitter before you can do this.)

You should see something like the image below.

![Twitter developer account](img/python-twitter_1.png)

Note this tells me that Twiiter caps me at 10,000,000 tweets per month.

3. Hit the create app button to create an app and give it a useful name.

![Twitter app](img/python-twitter_2.png)

![Twitter keys](img/python-twitter_3.png)

Note it is a good idea to regenerate your keys occasionally for security. (I immediately regenerated the keys after taking a screen shot for this tutorial.)


![Twitter app](img/python-twitter_4.png)

If you need to do things like write to your Twitter account you will need to change your permissions.

![Twitter keys](img/python-twitter_5.png)




# python-twitter API

Let's get started with the python-twitter API

------
Models
------

The library utilizes models to represent various data structures returned by Twitter. Those models are:   
    * twitter.Category    
    * twitter.DirectMessage    
    * twitter.Hashtag   
    * twitter.List   
    * twitter.Media  
    * twitter.Status  
    * twitter.Trend   
    * twitter.Url  
    * twitter.User  
    * twitter.UserStatus  

To read the documentation for any of these models, run:  

```bash

    $ pydoc twitter.[model]  
```

---
API
---

The API is exposed via the ``twitter.Api`` class.

The python-twitter requires the use of OAuth keys for nearly all operations. As of Twitter's API v1.1, authentication is required for most, if not all, endpoints. Therefore, you will need to register an app with Twitter in order to use this library. Please see the "Getting Started" guide on https://python-twitter.readthedocs.io for more information.

To generate an Access Token you have to pick what type of access your application requires and then do one of the following:

- `Generate a token to access your own account <https://dev.twitter.com/oauth/overview/application-owner-access-tokens>`_
- `Generate a pin-based token <https://dev.twitter.com/oauth/pin-based>`_
- use the helper script `get_access_token.py <https://github.com/bear/python-twitter/blob/master/get_access_token.py>`_

For full details see the `Twitter OAuth Overview <https://dev.twitter.com/oauth/overview>`_

To create an instance of the ``twitter.Api`` with login credentials (Twitter now requires an OAuth Access Token for all API calls)::

```python
    >>> import twitter
    >>> api = twitter.Api(consumer_key='consumer_key',
                          consumer_secret='consumer_secret',
                          access_token_key='access_token',```python
                          access_token_secret='access_token_secret')
```

To see if your credentials are successful::

```python
    >>> print(api.VerifyCredentials())
    {"id": 16133, "location": "Philadelphia", "name": "bear"}
    
```    

**NOTE**: much more than the small sample given here will print

To fetch a single user's public status messages, where ``user`` is a Twitter user's screen name::


```python

    >>> statuses = api.GetUserTimeline(screen_name=user)
    >>> print([s.text for s in statuses])
```         

To fetch a list of a user's friends:  

```python
    >>> users = api.GetFriends()
    >>> print([u.name for u in users])
```

To post a Twitter status message: 

```python
    >>> status = api.PostUpdate('I love python-twitter!')
    >>> print(status.text)
    I love python-twitter!
```    

There are many more API methods, to read the full API documentation either
check out the documentation on `readthedocs
<https://python-twitter.readthedocs.io>`_, build the documentation locally
with::

```bash
    $ make docs
```

or check out the inline documentation with::

```bash
    $ pydoc twitter.Api
```

In [3]:
# Check that python-twitter installed 
import twitter



# pin-based token

Note that this library also requires a pin-based token as as allows one to query for quite a bit of info.   

```python
api = twitter.Api(consumer_key='consumer_key',
                      consumer_secret='consumer_secret',
                      access_token_key='access_token',
                      access_token_secret='access_token_secret')
                      
```

I currently have an API key & secret but not a Access token & secret 

![Access token & secret](img/python-twitter_6.png)

Generate a pin-based token [https://dev.twitter.com/oauth/pin-based](https://dev.twitter.com/oauth/pin-based)  



In [4]:
twitter_keys =  {
'api_key' : 'M1GeQr964TdZ82PipexeoCSMh',
'api_key_secret' : 'uHCnBoIJ3Qg6m1yEOgkhmchbKwc7hxLsYoiilq1EELBzoAj1Lc',
'access_token' : '34749634-bEwCfMxjZyXLOFgPkS6MecRlKYj0C4YwqHdmzTa50',
'access_token_secret' : 'HG3RlELl2HAAFblcK7p0Ct2RVczdQ1I09rktmm9xzLCNN'
}
twitter_keys

{'api_key': 'M1GeQr964TdZ82PipexeoCSMh',
 'api_key_secret': 'uHCnBoIJ3Qg6m1yEOgkhmchbKwc7hxLsYoiilq1EELBzoAj1Lc',
 'access_token': '34749634-bEwCfMxjZyXLOFgPkS6MecRlKYj0C4YwqHdmzTa50',
 'access_token_secret': 'HG3RlELl2HAAFblcK7p0Ct2RVczdQ1I09rktmm9xzLCNN'}

In [5]:
api = twitter.Api(consumer_key=twitter_keys['api_key'],
                      consumer_secret=twitter_keys['api_key_secret'],
                      access_token_key=twitter_keys['access_token'],
                      access_token_secret=twitter_keys['access_token_secret'])

In [6]:
# To see if your credentials are successful:
print(api.VerifyCredentials())

{"created_at": "Thu Apr 23 22:15:37 +0000 2009", "description": "Northeastern University Engineering Professor in AI, ML, RL, DL, Visualization, and Games\naka Bear\naka Sin Cara Ojos Azules\nPursue it. Relentlessly. #Nerdlife", "favourites_count": 4218, "followers_count": 598, "friends_count": 722, "geo_enabled": true, "id": 34749634, "id_str": "34749634", "listed_count": 54, "location": "Boston", "name": "Nik Bear Brown", "profile_background_color": "C0DEED", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_tile": true, "profile_banner_url": "https://pbs.twimg.com/profile_banners/34749634/1553389496", "profile_image_url": "http://pbs.twimg.com/profile_images/1325560081404997632/-DOzKiga_normal.jpg", "profile_image_url_https": "https://pbs.twimg.com/profile_images/1325560081404997632/-DOzKiga_normal.jpg", "profile_link_color": "4099FF", "pro

To fetch a single user's public status messages, where user is a Twitter user's screen name like @nikbearbrown
    

In [7]:
user='nikbearbrown'
statuses = api.GetUserTimeline(screen_name=user)
statuses[0]

Status(ID=1370587369175347203, ScreenName=NikBearBrown, Created=Sat Mar 13 04:07:47 +0000 2021, Text='AI Skunkworks Speaker Session on "Graph-Based Data Science" by Paco Nathan Graphs #Knowledge… https://t.co/Xsdo09e8Ok')

# The python-twitter API returns each tweet as a Status object

Note each Tweet is a status object. 

<class 'twitter.models.Status'>  

Status(ID=1370587369175347203, ScreenName=NikBearBrown, Created=Sat Mar 13 04:07:47 +0000 2021, Text='AI Skunkworks Speaker Session on "Graph-Based Data Science" by Paco Nathan Graphs #Knowledge… https://t.co/Xsdo09e8Ok')  


Python-twitter provides the following models of the objects returned by the Twitter API:  

twitter.models.Category  
twitter.models.DirectMessage  
twitter.models.Hashtag  
twitter.models.List  
twitter.models.Media  
twitter.models.Status  
twitter.models.Trend  
twitter.models.Url  
twitter.models.User  
twitter.models.UserStatus  

In [8]:
print (type(statuses[0]))

<class 'twitter.models.Status'>


In [9]:
# print the Tweet Text
for s in statuses:
    print (f"{s.text}  {s.full_text}")

AI Skunkworks Speaker Session on "Graph-Based Data Science" by Paco Nathan Graphs #Knowledge… https://t.co/Xsdo09e8Ok  None
AI Skunkworks Speaker Session with Jacopo Tagliabue, Lead AI Scientist at Coveo #Coveo #GreatTalk   https://t.co/GaTQATl3ym  None
An Etsy shop is selling my artwork. Four of my robots are part of a clipart pack. I don't whether to be pissed or f… https://t.co/wsWnjtLJea  None
RT @Coach_Yac: 🗣 Pay Jason Verrett @49ers https://t.co/1YngpcjmWk  None
Everything to know about the Unity PARTICLE SYSTEM https://t.co/D4GLO4kvy5  None
Texas Won't Reduce $16 Billion In Electricity Charges From Winter Storm https://t.co/UsN7MUwlKx  None
@elkandelk @DatBoyInc I'll cancel Netflix if I ever get that message.  None
@THR I'll cancel Netflix if I ever get that message.  None
Man coaxes nest of 6 cute baby bunnies out from his garden https://t.co/dPa9o9osSt  None
First Ever Space Hurricane Spotted in Earth's Upper Atmosphere https://t.co/wI0b391RCV  None
Gen Z’s high-speed rail mem

# What info can we get from a Tweet status object?

In python we can ask an object to tell us its methods and member variables. 

Those with underscores are functions.


'__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 
 and those without are member variables.
 
  'source',
 'text',
 'truncated',
 'tweet_mode',
 'urls',
 'user'

In [10]:
dir(statuses[0])

['AsDict',
 'AsJsonString',
 'NewFromJsonDict',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_json',
 'contributors',
 'coordinates',
 'created_at',
 'created_at_in_seconds',
 'current_user_retweet',
 'favorite_count',
 'favorited',
 'full_text',
 'geo',
 'hashtags',
 'id',
 'id_str',
 'in_reply_to_screen_name',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'lang',
 'location',
 'media',
 'param_defaults',
 'place',
 'possibly_sensitive',
 'quoted_status',
 'quoted_status_id',
 'quoted_status_id_str',
 'retweet_count',
 'retweeted',
 'retweeted_status',
 'scopes',
 'source',
 'text',
 'truncated',
 'tweet_mode',
 'urls',
 'user',
 'user_mentions',
 'withheld_c

# Reading the docs

To read the documentation for any of these models, run:  

```bash

    $ pydoc twitter.[model]  
```
 such as
 
 
 ```bash

    $ pydoc twitter.models.Status  
    
```
 
or read them online [https://python-twitter.readthedocs.io](https://python-twitter.readthedocs.io)  such as [https://python-twitter.readthedocs.io/en/latest/_modules/twitter/models.html](https://python-twitter.readthedocs.io/en/latest/_modules/twitter/models.html)   


_Note that different member variables can have different datatypes._ 



In [11]:
for s in statuses:
    if s.full_text is not None:
        print(f"{s.full_text} Language: {s.lang}")
    else:
        print(f"{s.text} Language: {s.lang}")

AI Skunkworks Speaker Session on "Graph-Based Data Science" by Paco Nathan Graphs #Knowledge… https://t.co/Xsdo09e8Ok Language: en
AI Skunkworks Speaker Session with Jacopo Tagliabue, Lead AI Scientist at Coveo #Coveo #GreatTalk   https://t.co/GaTQATl3ym Language: en
An Etsy shop is selling my artwork. Four of my robots are part of a clipart pack. I don't whether to be pissed or f… https://t.co/wsWnjtLJea Language: en
RT @Coach_Yac: 🗣 Pay Jason Verrett @49ers https://t.co/1YngpcjmWk Language: en
Everything to know about the Unity PARTICLE SYSTEM https://t.co/D4GLO4kvy5 Language: en
Texas Won't Reduce $16 Billion In Electricity Charges From Winter Storm https://t.co/UsN7MUwlKx Language: en
@elkandelk @DatBoyInc I'll cancel Netflix if I ever get that message. Language: en
@THR I'll cancel Netflix if I ever get that message. Language: en
Man coaxes nest of 6 cute baby bunnies out from his garden https://t.co/dPa9o9osSt Language: en
First Ever Space Hurricane Spotted in Earth's Upper Atmos

In [12]:
statuses[0].user

User(ID=34749634, ScreenName=NikBearBrown)

In [13]:
for s in statuses:
    print(s.user)

{"created_at": "Thu Apr 23 22:15:37 +0000 2009", "description": "Northeastern University Engineering Professor in AI, ML, RL, DL, Visualization, and Games\naka Bear\naka Sin Cara Ojos Azules\nPursue it. Relentlessly. #Nerdlife", "favourites_count": 4218, "followers_count": 598, "friends_count": 722, "geo_enabled": true, "id": 34749634, "id_str": "34749634", "listed_count": 54, "location": "Boston", "name": "Nik Bear Brown", "profile_background_color": "C0DEED", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_tile": true, "profile_banner_url": "https://pbs.twimg.com/profile_banners/34749634/1553389496", "profile_image_url": "http://pbs.twimg.com/profile_images/1325560081404997632/-DOzKiga_normal.jpg", "profile_image_url_https": "https://pbs.twimg.com/profile_images/1325560081404997632/-DOzKiga_normal.jpg", "profile_link_color": "4099FF", "pro

# Excercise One 

What info do the Status attributes 'geo', 'hashtags', 'media', 'place', 'urls', and 'location' contain?


In [14]:
_=statuses[0].text
print(_)

AI Skunkworks Speaker Session on "Graph-Based Data Science" by Paco Nathan Graphs #Knowledge… https://t.co/Xsdo09e8Ok


In [15]:
_=statuses[0].geo
print(_)

None


In [16]:
_=statuses[0].hashtags
print(_)

[Hashtag(Text='Knowledge')]


In [17]:
_=statuses[0].media
print(_)

None


In [18]:
_=statuses[0].urls
print(_)

[URL(URL=https://t.co/Xsdo09e8Ok, ExpandedURL=https://twitter.com/i/web/status/1370587369175347203)]


In [19]:
_=statuses[0].AsJsonString
print(_)

<bound method TwitterModel.AsJsonString of Status(ID=1370587369175347203, ScreenName=NikBearBrown, Created=Sat Mar 13 04:07:47 +0000 2021, Text='AI Skunkworks Speaker Session on "Graph-Based Data Science" by Paco Nathan Graphs #Knowledge… https://t.co/Xsdo09e8Ok')>


# Note that you may  not be getting everything

The data may be truncated. If we look carefully there was a parameter called _"truncated": true_.

Maybe we want to set that to false?

```python
print(api.VerifyCredentials())  

{"created_at": "Thu Apr 23 22:15:37 +0000 2009", "description": "Northeastern University Engineering Professor in AI, ML, RL, DL, Visualization, and Games\naka Bear\naka Sin Cara Ojos Azules\nPursue it. Relentlessly. #Nerdlife", "favourites_count": 4219, "followers_count": 598, "friends_count": 722, "geo_enabled": true, "id": 34749634, "id_str": "34749634", "listed_count": 54, "location": "Boston", "name": "Nik Bear Brown", "profile_background_color": "C0DEED", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_tile": true, "profile_banner_url": "https://pbs.twimg.com/profile_banners/34749634/1553389496", "profile_image_url": "http://pbs.twimg.com/profile_images/1325560081404997632/-DOzKiga_normal.jpg", "profile_image_url_https": "https://pbs.twimg.com/profile_images/1325560081404997632/-DOzKiga_normal.jpg", "profile_link_color": "4099FF", "profile_sidebar_border_color": "FFFFFF", "profile_sidebar_fill_color": "DDEEF6", "profile_text_color": "333333", "profile_use_background_image": true, "screen_name": "NikBearBrown", "status": {"created_at": "Sat Mar 13 04:07:47 +0000 2021", "favorite_count": 2, "id": 1370587369175347203, "id_str": "1370587369175347203", "lang": "en", "retweet_count": 6, "source": "<a href=\"https://mobile.twitter.com\" rel=\"nofollow\">Twitter Web App</a>", "text": "AI Skunkworks Speaker Session on \"Graph-Based Data Science\" by Paco Nathan Graphs #Knowledge\u2026 https://t.co/Xsdo09e8Ok", "truncated": true}, "statuses_count": 37518, "url": "https://t.co/ZQ9sazeznI"}
```

We got the data with this line of code.

```python
statuses = api.GetUserTimeline(screen_name=user)
statuses[0]
```

Lets look at the docs to see if can do better?

Hopefully _python-twitter_ would allow one to simply set _"truncated": to false_. But the class doesn't seem to accept truncated as a parameter.


```python
GetUserTimeline(user_id=None, screen_name=None, since_id=None, max_id=None, count=None, include_rts=True, trim_user=False, exclude_replies=False)
```

See [https://python-twitter.readthedocs.io/en/latest/twitter.html](https://python-twitter.readthedocs.io/en/latest/twitter.html)  

It turns out this is controlled when we make the connection, you need to make sure you have tweet_mode='extended' when you call twitter.Api(). 

```python
twitter.Api(
            consumer_key= os.environ['CONSUMER_KEY'],
            consumer_secret= os.environ['CONSUMER_SECRET'],
            access_token_key= os.environ['ACCESS_TOKEN'],
            access_token_secret= os.environ['ACCESS_TOKEN_SECRET'],
            cache=None,
            tweet_mode= 'extended')
```            

In [20]:
api = twitter.Api(consumer_key=twitter_keys['api_key'],
                      consumer_secret=twitter_keys['api_key_secret'],
                      access_token_key=twitter_keys['access_token'],
                      access_token_secret=twitter_keys['access_token_secret'],
                      cache=None,
                      tweet_mode= 'extended')                

In [21]:
print(api.VerifyCredentials())

{"created_at": "Thu Apr 23 22:15:37 +0000 2009", "description": "Northeastern University Engineering Professor in AI, ML, RL, DL, Visualization, and Games\naka Bear\naka Sin Cara Ojos Azules\nPursue it. Relentlessly. #Nerdlife", "favourites_count": 4218, "followers_count": 598, "friends_count": 722, "geo_enabled": true, "id": 34749634, "id_str": "34749634", "listed_count": 54, "location": "Boston", "name": "Nik Bear Brown", "profile_background_color": "C0DEED", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_tile": true, "profile_banner_url": "https://pbs.twimg.com/profile_banners/34749634/1553389496", "profile_image_url": "http://pbs.twimg.com/profile_images/1325560081404997632/-DOzKiga_normal.jpg", "profile_image_url_https": "https://pbs.twimg.com/profile_images/1325560081404997632/-DOzKiga_normal.jpg", "profile_link_color": "4099FF", "pro

In [22]:
statuses = api.GetUserTimeline(screen_name=user)
statuses[0]

Status(ID=1370587369175347203, ScreenName=NikBearBrown, Created=Sat Mar 13 04:07:47 +0000 2021, Text='AI Skunkworks Speaker Session on "Graph-Based Data Science" by Paco Nathan Graphs #Knowledge #artificialintelligence #machinelearning #Graphs  https://t.co/AZ1BsVsGzT')

In [23]:
_=statuses[0].text
print(_)

None


In [24]:
_=statuses[0].full_text
print(_)

AI Skunkworks Speaker Session on "Graph-Based Data Science" by Paco Nathan Graphs #Knowledge #artificialintelligence #machinelearning #Graphs  https://t.co/AZ1BsVsGzT


In [25]:
_=statuses[0].urls
print(_)

[URL(URL=https://t.co/AZ1BsVsGzT, ExpandedURL=https://youtu.be/1y7yAenJmgM)]


In [26]:
_=statuses[0].hashtags
print(_)

[Hashtag(Text='Knowledge'), Hashtag(Text='artificialintelligence'), Hashtag(Text='machinelearning'), Hashtag(Text='Graphs')]


# Lesson Always Sanity Check

Always sanity check your data.  Look at a few examples. Is it getting what is should?

*python-twitter* is a well written and sophisticated library but if we had used it in a niave way we would have been getting truncated and incomplete data.


In [27]:
_=statuses[0].hashtags
print(_)

[Hashtag(Text='Knowledge'), Hashtag(Text='artificialintelligence'), Hashtag(Text='machinelearning'), Hashtag(Text='Graphs')]


# Getting the actual urls

_Side Note_

Note that the Twitter urls are links to the tweet not what the tweet links to
[URL(URL=https://t.co/Xsdo09e8Ok, ExpandedURL=https://twitter.com/i/web/status/1370587369175347203)] 

How does one get the actual url form a "tiny" URL like https://t.co/Xsdo09e8Ok?


In [28]:
import urllib3
url='https://t.co/AZ1BsVsGzT'
print(urllib3.__version__)
http = urllib3.PoolManager()
resp = http.request('GET', url)
print(resp.status) # The 200 status code means that the request has succeeded.
print(resp.geturl())
print(resp.info())
resp.release_conn()

1.25.11
200
https://www.youtube.com/watch?v=1y7yAenJmgM&feature=youtu.be
HTTPHeaderDict({'Content-Type': 'text/html; charset=utf-8', 'X-Content-Type-Options': 'nosniff', 'Cache-Control': 'no-cache, no-store, max-age=0, must-revalidate', 'Pragma': 'no-cache', 'Expires': 'Mon, 01 Jan 1990 00:00:00 GMT', 'Date': 'Sun, 14 Mar 2021 05:27:51 GMT', 'X-Frame-Options': 'SAMEORIGIN', 'Strict-Transport-Security': 'max-age=31536000', 'P3P': 'CP="This is not a P3P policy! See http://support.google.com/accounts/answer/151657?hl=en for more info."', 'Server': 'ESF', 'X-XSS-Protection': '0', 'Set-Cookie': 'GPS=1; Domain=.youtube.com; Expires=Sun, 14-Mar-2021 05:57:51 GMT; Path=/; Secure; HttpOnly, YSC=gcxXM8iB4Lc; Domain=.youtube.com; Path=/; Secure; HttpOnly; SameSite=none, VISITOR_INFO1_LIVE=DnGqUgOB0js; Domain=.youtube.com; Expires=Fri, 10-Sep-2021 05:27:51 GMT; Path=/; Secure; HttpOnly; SameSite=none', 'Alt-Svc': 'h3-29=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q0

# More Searching

_Raw Queries_  

To the Api.GetSearch() method, you can pass the parameter raw_query, which should be the query string you wish to use for the search omitting the leading “?”. This will override every other parameter. Twitter’s search parameters are quite complex, so if you have a need for a very particular search, you can find Twitter’s documentation at [https://dev.twitter.com/rest/public/search](https://dev.twitter.com/rest/public/search).  


Note that the strings need to be url formated.

so a query like "data visualiztion" cannot have spaces

You can check a query by pasting it in a browser.

https://twitter.com/search?q=data%20visualiztion%20&result_type=recent&since=2021-01-19&count=100

results = api.GetSearch(
    raw_query="q=twitter%20&result_type=recent&since=2021-01-19&count=100")

In [29]:
# Make sure raw query is url encoded
from urllib.parse import urlencode, quote_plus
base='https://twitter.com/search?'
params = {'q': 'data visualization', 'result_type': 'recent', 'since': '2021-01-19', 'count': 100}
query = urlencode(params)
print(query)
query=base+query
print(query)

q=data+visualization&result_type=recent&since=2021-01-19&count=100
https://twitter.com/search?q=data+visualization&result_type=recent&since=2021-01-19&count=100


In [30]:
results = api.GetSearch(
    raw_query="q=data+visualization&lang=en")

In [31]:
results

[Status(ID=1370969597688578050, ScreenName=SSXman2, Created=Sun Mar 14 05:26:38 +0000 2021, Text='RT @Ronald_vanLoon: 60 Types Of #Data Visualization And Their Usage\nby @SR_Visual_Info @dataviz_catalog\n\n#DataScience #DataAnalytics #Analy…'),
 Status(ID=1370969077376684033, ScreenName=HninThetHmuKhin, Created=Sun Mar 14 05:24:34 +0000 2021, Text="RT @CyrixLu: You can check data visualization about what's happening in Myanmar since February 1 coup. \nhttps://t.co/nMWMbcfEkR\n\n#WhatsHapp…"),
 Status(ID=1370967532438315013, ScreenName=elcaosmx, Created=Sun Mar 14 05:18:25 +0000 2021, Text="RT @freeCodeCamp: Businesses are increasingly data-driven - and they need a way to visualize that data so it's easier to analyze.\n\nIf you'r…"),
 Status(ID=1370965170697031688, ScreenName=Aatisharap, Created=Sun Mar 14 05:09:02 +0000 2021, Text='RT @abhymurarka: Data visualization.. https://t.co/50s1FQ8HUa'),
 Status(ID=1370964573788856321, ScreenName=wawaggmu, Created=Sun Mar 14 05:06:40 +0000 2021

# twitter-friends
  
lists all of a given user's friends (ie, followees)
  
twitter API docs: https://dev.twitter.com/rest/reference/get/friends/ids

In [32]:
print(user)
friends = api.GetFriendIDs(screen_name = user)

nikbearbrown


In [33]:
friends[0:5]

[1204955487650775040, 479437493, 1367576617879535624, 416503462, 1392379572]

In [34]:
friends = api.GetFriends(screen_name = user)

In [35]:
friends[0:5]

[User(ID=1204955487650775040, ScreenName=Aishwar73057379),
 User(ID=479437493, ScreenName=GrabFedor),
 User(ID=1367576617879535624, ScreenName=eric49907274),
 User(ID=416503462, ScreenName=cgrenier324),
 User(ID=1392379572, ScreenName=nowwhatstherush)]

# geo search

Let's get tweets within 1 mile of Northeastern Univeristy.

The GPS Coordinates of Northeastern Univeristy are:  

DD Coordinates 
42.33666532 -71.08749965  

DD Coordinates 
42°20'12.00" N -71°05'15.00" W 

Geohash Coordinates  
drt2y75svnqmu13

UTM Coordinates  
19T 328028.91214765 4689267.261328  


In [36]:
geo=api.GetSearch(geocode="42.33666532,-71.08749965,1mi")

In [37]:
geo[0:5]

[Status(ID=1370936876366979078, ScreenName=Guaroaandujar, Created=Sun Mar 14 03:16:36 +0000 2021, Text='# Reflexión\n\nGUAROA ANDÚJAR\n💻El Pionero de las Redes Sociales en Bani📲🤳🏻📸\nhttps://t.co/T8du7B2Ey1 en Roxbury, Boston https://t.co/6qR5ak1xSs')]

# twitter streams

The twitter streaming API gives a ultra-real-time stream of twitter's public timeline.

See [https://developer.twitter.com/en/docs/tutorials/consuming-streaming-data](https://developer.twitter.com/en/docs/tutorials/consuming-streaming-data)  

The Twitter stream sends new Tweets through the open connection as they happen, and your client app should read the data off the line as it is received by saving the data.


_python-twitter_

python-twitter has a GetStreamFilter class for consuming streaming data.  

```python
GetStreamFilter(follow=None, track=None, locations=None, languages=None, delimited=None, stall_warnings=None, filter_level=None)  
```

Parameters:	
    
follow – A list of user IDs to track. _Optional_     
track – A list of expressions to track. _Optional_     
locations – A list of Longitude,Latitude pairs (as strings) specifying bounding boxes for the tweets’ origin. _Optional_     
delimited – Specifies a message length. _Optional_     
stall_warnings – Set to True to have Twitter deliver stall warnings. _Optional_     
languages – A list of Languages. Will only return Tweets that have been detected as being written in the specified languages. _Optional_     
filter_level – Specifies level of filtering applied to stream. Set to None, ‘low’ or ‘medium’. _Optional_     

Returns:	

A twitter stream


*Note that the Twitter stream is continuous so this should be written as a script that continiously runs and saves the data to disk.*


This is best done on a cloud server.  Northeastern provides server support for students at the NU Discover cluster.

https://rc.northeastern.edu/


```python


import os
import json


from twitter import Api

# Either specify a set of keys here or use os.getenv('CONSUMER_KEY') style
# assignment:

CONSUMER_KEY = 'WWWWWWWW'
# CONSUMER_KEY = os.getenv("CONSUMER_KEY", None)
CONSUMER_SECRET = 'XXXXXXXX'
# CONSUMER_SECRET = os.getenv("CONSUMER_SECRET", None)
ACCESS_TOKEN = 'YYYYYYYY'
# ACCESS_TOKEN = os.getenv("ACCESS_TOKEN", None)
ACCESS_TOKEN_SECRET = 'ZZZZZZZZ'
# ACCESS_TOKEN_SECRET = os.getenv("ACCESS_TOKEN_SECRET", None)

# Users to watch for should be a list. This will be joined by Twitter and the
# data returned will be for any tweet mentioning:
# @twitter *OR* @twitterapi *OR* @support.
USERS = ['@twitter',
         '@twitterapi',
         '@support']

# Languages to filter tweets by is a list. This will be joined by Twitter
# to return data mentioning tweets only in the english language.
LANGUAGES = ['en']

# Since we're going to be using a streaming endpoint, there is no need to worry
# about rate limits.
api = Api(CONSUMER_KEY,
          CONSUMER_SECRET,
          ACCESS_TOKEN,
          ACCESS_TOKEN_SECRET)


def main():
    with open('output.txt', 'a') as f:
        # api.GetStreamFilter will return a generator that yields one status
        # message (i.e., Tweet) at a time as a JSON dictionary.
        for line in api.GetStreamFilter(track=USERS, languages=LANGUAGES):
            f.write(json.dumps(line))
            f.write('\n')


if __name__ == '__main__':
    main()



```

# Excercise Two

How would one modify the code below to get streaming tweets at specific geolocation(s)?

```python
api.GetStreamFilter(track=USERS, languages=LANGUAGES)
```


*A:*

```python
LOCATIONS = ['42.33666532,-71.08749965',
         '34.06999972,-118.439789907'] # Northeastern & UCLA


api.GetStreamFilter(track=USERS, languages=LANGUAGES, locations=LOCATIONS)

```

*Nota bene* 

A seperate tutorial on setting up a continious Twitter Streaming Listener and saving and parsing the data will be created.  


# Next Steps

There is a lot more to explore when dealing with Twitter data. Many more searches can be done and often the data needs to be cleaned and reformated.

Feature engineering is when one adds additional information to a tweet or twitter user such as identifying "fake news" tweets or flagging twitter users as bots or identifying which twitter users love d3. _d2d3_ will be adding additional notebooks, tutorials and videos that show how to do such things as:   

The d2d3 (data to d3) project creates tutorials and tools that  gather, clean, format, explore, munge, understand and feature engineer data before creating beautiful d3 visualizations.


* Creating bots 
* Create network data and identify important tweeters and tweets
* Create hierarchical data 
* Identifying which twitter users are renowned information visualizers.
* Creating lists of information design companies and their Twitter handles
* Geolocating tweets whose geo attribute is None.
* Extracting and tagging media in tweets
* Analyzing the sentiment of tweets
* Add missing hashtags to tweets
* Analyzing Twitter trends 
* Formatting and preparing Twitter data so that it can be used for every graph in the d3 gallery [https://observablehq.com/@d3/gallery](https://observablehq.com/@d3/gallery)
* Merging Twitter data with other data sources such as Google trends, Wikipedia, etc.  


Email _d2d3_ at _d2d3.net_ if you would like to see a particular tutorial.  
  


MIT License

Copyright (c) 2021 Nik Bear Brown & _d2d3.net_  

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.