# <center>Web Scraping by API </center>

## 1. Scrape data through APIs 
- Online content providers usually offer APIs for accessing their data. Two common types:
   * Python packages: e.g. the tweepy package for the Twitter API
   * REST APIs: e.g. OMDB APIs (http://www.omdbapi.com), or TMDB (https://developers.themoviedb.org/3/getting-started)
- You need to read the API documentation to figure out how to access the data

## 2. Scrape data by REST APIs (e.g. OMDB API)
- A REST API is a web service that uses `HTTP` requests to `GET`, `PUT`, `POST` and `DELETE` data
- Example:
    - https://groceries.asda.com/api/items/search<font color="blue"><b>?</b></font><font color='green'><b>keyword</b></font>=<font color='red'><b>yogurt</b></font><font color='purple'><b>&</b></font><font color='green'><b>r</b></font>=<font color='red'><b>json</b></font>, where
        - `?`: separates the API endpoint `https://groceries.asda.com/api/items/search` from the parameters
        - `keyword=yogurt`: sets the parameter `keyword` to `yogurt`, i.e. searches for yogurt
        - `&`: separates multiple parameters
        - `r=json`: requests the result in JSON format 
    - You can paste the above URL directly into your browser
    - Or issue API calls using the `requests` package
- You need to read the API documentation to understand how to specify parameters
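
The URL anatomy above can be reproduced with the standard library alone. This is a minimal sketch, not part of the original exercises: the search term `greek yogurt` is an illustrative value chosen to show how special characters such as spaces are escaped.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

endpoint = "https://groceries.asda.com/api/items/search"
params = {"keyword": "greek yogurt", "r": "json"}  # illustrative search term

# urlencode escapes special characters (a space becomes "+")
# and joins the name=value pairs with "&"
url = endpoint + "?" + urlencode(params)
print(url)

# Split the URL back into its endpoint path and its parameters
parts = urlsplit(url)
print(parts.path)
print(parse_qs(parts.query))
```

Passing a dictionary through `params=` in `requests.get` performs the same encoding for you, which is why it is usually safer than concatenating strings by hand.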

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import requests
import json
import pandas as pd

In [2]:
import requests
import json

keyword = 'yogurt'


url="https://groceries.asda.com/api/items/search?keyword=" + keyword + "&r=json"

print(url)

# invoke the API 
r = requests.get(url)

# if the API call returns a successful response
if r.status_code==200:
    
    # This API call returns a JSON object
    # r.json() parses the response body into a Python object
    # json.dumps() converts a Python object into a JSON string
    result = r.json()
    print(json.dumps(result, indent=4))



https://groceries.asda.com/api/items/search?keyword=yogurt&r=json
{
    "statusMessage": "The API Item Search was executed successfully",
    "errors": [],
    "keyword": "yogurt",
    "storeId": "4565",
    "autoCorrectedTerm": "",
    "didYouMeanTerm": "",
    "queryRelaxed": false,
    "isHookLogicInsert": "false",
    "totalResult": "447",
    "currentPage": "1",
    "resultsStartIndex": "1",
    "resultsEndIndex": "60",
    "maxPages": "8",
    "qusApplied": false,
    "productBoostingDetails": "0^rule_5f8046bf0931946b86fb4387^^^Default",
    "monetizedItems": [],
    "items": [
        {
            "shelfId": "1215286383583",
            "shelfName": "Corners",
            "deptId": "1215341888021",
            "deptName": "Yogurts & Desserts",
            "isBundle": "false",
            "meatStickerDetails": "8::for::\u00a34::false",
            "extraLargeImageURL": "",
            "bundledItemCount": "0",
            "scene7Host": "https://ui.assets-asda.com:443/dm/",
      

In [3]:
# Exercise 2.2.  Another way to pass parameters

parameters = {'keyword': 'yogurt', 
              'r': 'json'}

r=requests.get('https://groceries.asda.com/api/items/search', params=parameters)

# in case authentication is needed, use
# r = requests.get('https://api.github.com/user', \
# auth=('user', 'pass'))

# if the API call returns a successful response
if r.status_code==200:
    
    # This API call returns a json object
    # r.json() gives the json object
    print (json.dumps(r.json(), indent=4))



{
    "statusMessage": "The API Item Search was executed successfully",
    "errors": [],
    "keyword": "yogurt",
    "storeId": "4565",
    "autoCorrectedTerm": "",
    "didYouMeanTerm": "",
    "queryRelaxed": false,
    "isHookLogicInsert": "false",
    "totalResult": "447",
    "currentPage": "1",
    "resultsStartIndex": "1",
    "resultsEndIndex": "60",
    "maxPages": "8",
    "qusApplied": false,
    "productBoostingDetails": "0^rule_5f8046bf0931946b86fb4387^^^Default",
    "monetizedItems": [],
    "items": [
        {
            "shelfId": "1215286383583",
            "shelfName": "Corners",
            "deptId": "1215341888021",
            "deptName": "Yogurts & Desserts",
            "isBundle": "false",
            "meatStickerDetails": "8::for::\u00a34::false",
            "extraLargeImageURL": "",
            "bundledItemCount": "0",
            "scene7Host": "https://ui.assets-asda.com:443/dm/",
            "cin": "6362225",
            "promoDetailFull": "8 for \u00

## 3. JSON (JavaScript Object Notation)

### What is JSON
- A lightweight data-interchange format
- "Self-describing" and easy to understand
- JSON format is text only 
- Language independent: can be read and used as a data format by any programming language

###  JSON Syntax Rules
JSON syntax is derived from JavaScript object notation syntax:
- Data is in **name/value** pairs separated by commas
- Curly braces hold objects
- Square brackets hold arrays

### In Python, parsed JSON is typically:
- **a dictionary** or 
- a **list of dictionaries**

### Useful JSON functions
- dumps: serialize a Python object to a JSON string
- dump: serialize a Python object to a JSON file
- loads: parse a JSON string into a Python object
- load: parse a JSON file into a Python object
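
A minimal sketch of the four functions, assuming a small hand-made record (the `tags` field is made up for illustration) and an in-memory buffer standing in for a file on disk:

```python
import io
import json

# A small record; "tags" is an illustrative field, not part of any API response
record = {"itemName": "Vanilla Yogurt", "price": "£0.55", "tags": ["dairy", "snack"]}

# dumps / loads: round trip through a string
s = json.dumps(record, indent=4)
restored = json.loads(s)
print(s)

# dump / load: round trip through a file-like object
buffer = io.StringIO()
json.dump(record, buffer)
buffer.seek(0)  # rewind before reading back
from_file = json.load(buffer)

print(restored == record, from_file == record)
```

By default `json.dumps` escapes non-ASCII characters, which is why `£` shows up as `\u00a3` in printed JSON; `json.loads` turns it back into `£`.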

In [4]:
# Exercise 3.1 API returns a JSON object 

parameters = {'keyword': 'yogurt', 
              'r': 'json'}

r=requests.get('https://groceries.asda.com/api/items/search', params=parameters)

# if the API call returns a successful response
if r.status_code==200:
    result = r.json()
    
    df = pd.DataFrame(result["items"])
    df.head()
    

Unnamed: 0,shelfId,shelfName,deptId,deptName,isBundle,meatStickerDetails,extraLargeImageURL,bundledItemCount,scene7Host,cin,...,avgWeight,iconDetails,maxQty,pricePerWt,productURL,pricePerUOM,searchTuningScore,onSale,salePrice,positionChngByMargin
0,1215286383583,Corners,1215341888021,Yogurts & Desserts,False,8::for::£4::false,,0,https://ui.assets-asda.com:443/dm/,6362225,...,,"{'promotionalIcons': ['59600049'], 'informatio...",10.0,Each,https://groceries.asda.com:443/api/items/view?...,,9486254.0,False,,0
1,1215286423096,Activia Yogurts,1215341888021,Yogurts & Desserts,False,4::for::£3::false,,0,https://ui.assets-asda.com:443/dm/,6202233,...,,{},10.0,Each,https://groceries.asda.com:443/api/items/view?...,,7188797.5,False,,0
2,1215286383583,Corners,1215341888021,Yogurts & Desserts,False,8::for::£4::false,,0,https://ui.assets-asda.com:443/dm/,6362239,...,,"{'promotionalIcons': ['59600049'], 'informatio...",10.0,Each,https://groceries.asda.com:443/api/items/view?...,,6455880.0,False,,0
3,910000976085,Kids Yogurts,1215341888021,Yogurts & Desserts,False,,,0,https://ui.assets-asda.com:443/dm/,6203595,...,,"{'promotionalIcons': ['45100001', '59600050'],...",10.0,Each,https://groceries.asda.com:443/api/items/view?...,,5867831.0,False,,0
4,1215286383583,Corners,1215341888021,Yogurts & Desserts,False,8::for::£4::false,,0,https://ui.assets-asda.com:443/dm/,6362227,...,,"{'promotionalIcons': ['59600049'], 'informatio...",10.0,Each,https://groceries.asda.com:443/api/items/view?...,,5107219.0,False,,0


In [5]:
# Exercise 3.2. Parse JSON object (a dictionary)

# convert the first 2 items to string
s = json.dumps(result["items"][0:2], indent=4)
print(s)

# load back from a string
items = json.loads(s)
items

# save to file (the with-block closes the file when done)
with open("items.json", "w") as f:
    json.dump(result["items"], f)

# load back from file
with open("items.json", "r") as f:
    items = json.load(f)
print("test loaded data\n")
len(items)
items[0]

[
    {
        "shelfId": "1215286383583",
        "shelfName": "Corners",
        "deptId": "1215341888021",
        "deptName": "Yogurts & Desserts",
        "isBundle": "false",
        "meatStickerDetails": "8::for::\u00a34::false",
        "extraLargeImageURL": "",
        "bundledItemCount": "0",
        "scene7Host": "https://ui.assets-asda.com:443/dm/",
        "cin": "6362225",
        "promoDetailFull": "8 for \u00a34",
        "availability": "A",
        "totalReviewCount": "15",
        "asdaSuggest": "",
        "itemName": "Vanilla Yogurt with Chocolate\u00a0Balls",
        "price": "\u00a30.55",
        "imageURL": "",
        "aisleName": "Yogurts & Fromage Frais",
        "id": "1000120228362",
        "promoId": "ls91605",
        "isFavourite": "false",
        "hasAlternates": "false",
        "wasPrice": "",
        "brandName": "Muller Corner",
        "promoType": "No Promo",
        "weight": "130g",
        "promoOfferTypeCode": "15",
        "promoQty": "8",

[{'shelfId': '1215286383583',
  'shelfName': 'Corners',
  'deptId': '1215341888021',
  'deptName': 'Yogurts & Desserts',
  'isBundle': 'false',
  'meatStickerDetails': '8::for::£4::false',
  'extraLargeImageURL': '',
  'bundledItemCount': '0',
  'scene7Host': 'https://ui.assets-asda.com:443/dm/',
  'cin': '6362225',
  'promoDetailFull': '8 for £4',
  'availability': 'A',
  'totalReviewCount': '15',
  'asdaSuggest': '',
  'itemName': 'Vanilla Yogurt with Chocolate\xa0Balls',
  'price': '£0.55',
  'imageURL': '',
  'aisleName': 'Yogurts & Fromage Frais',
  'id': '1000120228362',
  'promoId': 'ls91605',
  'isFavourite': 'false',
  'hasAlternates': 'false',
  'wasPrice': '',
  'brandName': 'Muller Corner',
  'promoType': 'No Promo',
  'weight': '130g',
  'promoOfferTypeCode': '15',
  'promoQty': '8',
  'promoValue': '£4',
  'productAttribute': '',
  'scene7AssetId': '4025500245221',
  'promoDetail': '8 for £4',
  'bundleDiscount': '0.00',
  'avgStarRating': '4.8',
  'name': 'Muller Corner 

test loaded data



60

{'shelfId': '1215286383583',
 'shelfName': 'Corners',
 'deptId': '1215341888021',
 'deptName': 'Yogurts & Desserts',
 'isBundle': 'false',
 'meatStickerDetails': '8::for::£4::false',
 'extraLargeImageURL': '',
 'bundledItemCount': '0',
 'scene7Host': 'https://ui.assets-asda.com:443/dm/',
 'cin': '6362225',
 'promoDetailFull': '8 for £4',
 'availability': 'A',
 'totalReviewCount': '15',
 'asdaSuggest': '',
 'itemName': 'Vanilla Yogurt with Chocolate\xa0Balls',
 'price': '£0.55',
 'imageURL': '',
 'aisleName': 'Yogurts & Fromage Frais',
 'id': '1000120228362',
 'promoId': 'ls91605',
 'isFavourite': 'false',
 'hasAlternates': 'false',
 'wasPrice': '',
 'brandName': 'Muller Corner',
 'promoType': 'No Promo',
 'weight': '130g',
 'promoOfferTypeCode': '15',
 'promoQty': '8',
 'promoValue': '£4',
 'productAttribute': '',
 'scene7AssetId': '4025500245221',
 'promoDetail': '8 for £4',
 'bundleDiscount': '0.00',
 'avgStarRating': '4.8',
 'name': 'Muller Corner Vanilla Yogurt with Chocolate\xa0Ba

## 4. Get Tweets

Reference: 
- https://github.com/scalto/snscrape-by-location/blob/main/snscrape_by_location_tutorial.ipynb
- https://medium.com/swlh/how-to-scrape-tweets-by-location-in-python-using-snscrape-8c870fa6ec25

Note: User object is not exposed by TwitterSearchScraper any more.

In [6]:
import pandas as pd
import snscrape.modules.twitter as sntwitter
import itertools


In [9]:
# search by keywords + time
# TwitterSearchScraper returns an iterator; islice takes its first 500 items

df = pd.DataFrame(itertools.islice(sntwitter.TwitterSearchScraper(
    '"blockchain + since:2021-12-1 until:2022-1-31"').get_items(), 500))

print(len(df))
df.head()

500


Unnamed: 0,url,date,content,renderedContent,id,user,replyCount,retweetCount,likeCount,quoteCount,...,media,retweetedTweet,quotedTweet,inReplyToTweetId,inReplyToUser,mentionedUsers,coordinates,place,hashtags,cashtags
0,https://twitter.com/CryptoAnomalous/status/148...,2022-01-30 23:59:59+00:00,@BradSherman do not get in the way of US techn...,@BradSherman do not get in the way of US techn...,1487938677535907845,"{'username': 'CryptoAnomalous', 'id': 53283913...",0,0,0,0,...,,,,,,"[{'username': 'BradSherman', 'id': 30216513, '...",,,,
1,https://twitter.com/gerard_dache/status/148793...,2022-01-30 23:59:55+00:00,"Congrats to Bill Rockwood Jr, Esq., MBA, J.D.,...","Congrats to Bill Rockwood Jr, Esq., MBA, J.D.,...",1487938658023919619,"{'username': 'gerard_dache', 'id': 256189693, ...",0,0,7,3,...,,,,,,,,,,
2,https://twitter.com/Rekttrading8/status/148793...,2022-01-30 23:59:54+00:00,@kararesurrect You are still announcing it on ...,@kararesurrect You are still announcing it on ...,1487938655159140353,"{'username': 'Rekttrading8', 'id': 14337294741...",0,0,0,0,...,,,,1.48782e+18,"{'username': 'TraderZetas', 'id': 147446459766...",,,,,
3,https://twitter.com/KeanuBelieves/status/14879...,2022-01-30 23:59:50+00:00,@WatcherGuru @AffinityBSC. Being listing on a ...,@WatcherGuru @AffinityBSC. Being listing on a ...,1487938639820693508,"{'username': 'KeanuBelieves', 'id': 1425210713...",0,0,1,0,...,,,,1.487907e+18,"{'username': 'WatcherGuru', 'id': 138749787175...","[{'username': 'WatcherGuru', 'id': 13874978717...",,,[ADAPT],
4,https://twitter.com/duynguy40664441/status/148...,2022-01-30 23:59:35+00:00,@mine_blockchain scam mnet,@mine_blockchain scam mnet,1487938575937126400,"{'username': 'duynguy40664441', 'id': 14595808...",1,0,0,0,...,,,,1.486773e+18,"{'username': 'mine_blockchain', 'id': 13995421...","[{'username': 'mine_blockchain', 'id': 1399542...",,,,


In [8]:
df.content[80:100]


80    How #bitcoin adoption could bring major prospe...
81    Cooper Rogers is here to save the #Metaverse w...
82    @sansiniestro Es que depende del grado de inte...
83    (1/19) Sirin Labs raised 205,000 $ETH to devel...
84    Why do games on the blockchain always have to ...
85    @Colten____ Yeah man! Get in for the art and s...
86    Abuse and harassment on the blockchain https:/...
87    @ZJhonni eu tbm n entendo mto bem nao mano, ma...
88    @HerissonJeune @sw_remi @neoman42 @Scipionista...
89    @spiderzero06 @veve_official @ecomi_ Feel you ...
90    This is awesome. I love Bitmart ideology towar...
91    @Conta_genrica sabotage the NFT economy by re-...
92    Like a newbie I incorrectly listed our lil' Sh...
93    Crypto Job: (Software Engineer (Backend) - Blo...
94    @Ciriyan1 @VinsmokeSanjiMD @TobbyKitty Hepsi t...
95    .@DeFiChain Mobile Wallet v1.1.0 is now availa...
96    Crypto Job: (Senior Software Engineer (Backend...
97    Chainlink(LINK) Price is Attempting a Huge

In [None]:
# search by user

df = pd.DataFrame(itertools.islice(sntwitter.TwitterUserScraper(
    '""').get_items(), 500))

print(len(df))
df.tail()


## 5. Tweepy
- Tweepy is a Python library for accessing the Twitter API. 
- `pip install tweepy`
- The Tweepy documentation has detailed explanations: https://docs.tweepy.org/en/stable/
- You need to apply for a developer account from here: https://developer.twitter.com/en/apply-for-access

In [12]:
import os
import tweepy
import csv
import datetime

# https://docs.tweepy.org/en/stable/auth_tutorial.html

# Never hard-code credentials in a notebook. Here they are read from
# environment variables (the variable names are this notebook's own choice).
CONSUMER_KEY = os.environ["TWITTER_CONSUMER_KEY"]
CONSUMER_SECRET = os.environ["TWITTER_CONSUMER_SECRET"]
ACCESS_KEY = os.environ["TWITTER_ACCESS_KEY"]
ACCESS_SECRET = os.environ["TWITTER_ACCESS_SECRET"]

auth=tweepy.OAuthHandler(CONSUMER_KEY,CONSUMER_SECRET)
auth.set_access_token(ACCESS_KEY,ACCESS_SECRET)
api=tweepy.API(auth)

In [13]:
# Take a look at the public tweets from your account's home timeline 

public_tweets = api.home_timeline()
print(len(public_tweets))
for tweet in public_tweets[:2]:
    print(tweet.text)

20
#Beijing2022 concluded on Sunday with the closing ceremony. Here’s the overall medal count. 🥇 Look back at this yea… https://t.co/hgNINT0RuA
Airport improvements, new hotels, solar energy and water projects: Many Caribbean islands are investing for the pos… https://t.co/cQv3KnjpjL


In [14]:
# is this useful information?
# Let's take a close look at ONE tweet json

public_tweets[1]
# the raw Status object is hard to read in this form


Status(_api=<tweepy.api.API object at 0x7fbaea24d730>, _json={'created_at': 'Mon Feb 21 00:00:17 +0000 2022', 'id': 1495548895245570057, 'id_str': '1495548895245570057', 'text': 'Airport improvements, new hotels, solar energy and water projects: Many Caribbean islands are investing for the pos… https://t.co/cQv3KnjpjL', 'truncated': True, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/cQv3KnjpjL', 'expanded_url': 'https://twitter.com/i/web/status/1495548895245570057', 'display_url': 'twitter.com/i/web/status/1…', 'indices': [117, 140]}]}, 'source': '<a href="http://www.socialflow.com" rel="nofollow">SocialFlow</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 807095, 'id_str': '807095', 'name': 'The New York Times', 'screen_name': 'nytimes', 'location': 'New York City', 'description': 'News tips? Share them 

In [15]:
# make it look better
# convert to string
json_str = json.dumps(public_tweets[0]._json)

# deserialise string into python object
parsed = json.loads(json_str)

print(json.dumps(parsed, indent=4, sort_keys=True))
# Now we can have a better idea of the nested structure of the json object

{
    "contributors": null,
    "coordinates": null,
    "created_at": "Mon Feb 21 00:15:07 +0000 2022",
    "entities": {
        "hashtags": [
            {
                "indices": [
                    0,
                    12
                ],
                "text": "Beijing2022"
            }
        ],
        "symbols": [],
        "urls": [
            {
                "display_url": "twitter.com/i/web/status/1\u2026",
                "expanded_url": "https://twitter.com/i/web/status/1495552630197784579",
                "indices": [
                    117,
                    140
                ],
                "url": "https://t.co/hgNINT0RuA"
            }
        ],
        "user_mentions": []
    },
    "favorite_count": 62,
    "favorited": false,
    "geo": null,
    "id": 1495552630197784579,
    "id_str": "1495552630197784579",
    "in_reply_to_screen_name": null,
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_

### 5.1. Get tweets from users' timeline
- Make a Timeline call to retrieve up to the most recent 3200 tweets by a user (a limit set by Twitter).
    - Note: the time range you get depends on how often the user posts tweets. 
- Parameters for the timeline call
    - `count`: the number of results to try to retrieve per page. Maximum is 200. 
    - `max_id`: return only tweets with an ID less than or equal to this value. Because each call returns at most 200 tweets, make multiple calls, lowering `max_id` each time, to retrieve all 3200 tweets. 
    - `tweet_mode`: when set to `"extended"`, replaces the `text` attribute with `full_text` and prevents a tweet longer than 140 characters from being truncated.
- Variables of tweet objects
    - https://docs.tweepy.org/en/stable/api.html#tweepy-api-twitter-api-wrapper
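
The "multiple calls" logic can be sketched without touching the API. In the sketch below, `fetch_page` is a hypothetical stand-in for a timeline call: it serves fake tweet IDs, newest first, and treats `max_id` as inclusive, which is why the next page starts at the oldest ID minus one.

```python
def fetch_page(all_ids, count, max_id=None):
    """Hypothetical stand-in for a timeline call: up to `count` IDs <= max_id, newest first."""
    eligible = [i for i in all_ids if max_id is None or i <= max_id]
    return eligible[:count]


def collect_all(all_ids, count):
    """Keep requesting older pages until a page comes back empty."""
    collected = []
    page = fetch_page(all_ids, count)
    while page:
        collected.extend(page)
        oldest = page[-1] - 1  # max_id is inclusive, so step past the oldest ID seen
        page = fetch_page(all_ids, count, max_id=oldest)
    return collected


timeline_ids = list(range(1000, 990, -1))  # 10 fake tweet IDs, newest first
result = collect_all(timeline_ids, count=4)
print(len(result), "tweets collected in pages of 4")
```

Subtracting one from the oldest ID is what prevents the last tweet of each page from being downloaded twice.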

In [16]:
# Get the five most recent tweets of a user.
timeline = api.user_timeline(screen_name="KelloggCompany",count=5,tweet_mode="extended")

for status in timeline:
    print (status.id)
    print (status.full_text)

1494058763275354112
From lighter, resealable packaging to refillable cereal stations, we’re making meaningful strides toward our #BetterDays commitment of  100% reusable, recyclable or compostable packaging by the end of 2025. Learn more: https://t.co/2pxNDcxmtT https://t.co/UsupSea6FL
1494043410335944708
Since 2015, we’ve reduced greenhouse gas emissions by the equivalent of removing nearly 3.2 billion cars off the road!  Learn more about our commitment to nurture the planet by reducing GHG emissions: https://t.co/H1DvVjslfp 
#BetterDays https://t.co/L8ArCTAS6X
1493979924725653507
Helping kids and families lead filled and fulfilled lives is why we’re committed to creating #BetterDays for our people and planet. We’re honored to support @Act4HlthyKids in their mission to support kids in underserved communities. https://t.co/pr6W3Z33R6
1493694554222075918
What if we could pay farmers for the greenhouse gas emissions they reduce by implementing climate-positive practices? 
We can, thanks 

In [17]:
# Step 1: get a list of tweets 
# Step 2: extract the variables you want

def get_all_tweets(user_name):
    auth=tweepy.OAuthHandler(CONSUMER_KEY,CONSUMER_SECRET)
    auth.set_access_token(ACCESS_KEY,ACCESS_SECRET)
    api=tweepy.API(auth)
    
    # initialize the first call
    alltweets=[]
    new_tweets=api.user_timeline(screen_name=user_name, count=200)
    alltweets.extend(new_tweets)
    oldest=alltweets[-1].id-1  #next time start from the oldest one minus one 
    
    # continue to get tweets
    while len(new_tweets)>0:  
        print ("getting tweets before", oldest)
        new_tweets = api.user_timeline(screen_name=user_name,count=200, max_id=oldest)
        alltweets.extend(new_tweets)
        oldest=alltweets[-1].id-1
        print("...{} tweets downloaded so far".format(len(alltweets)))
    
    # extract the variables you want
    outtweets = [[tweet.id_str, tweet.user.name, tweet.created_at, tweet.user.followers_count,
                  tweet.text.encode("utf-8")] for tweet in alltweets]
            
    # write out your variables
    # newline='' prevents blank rows between records on Windows
    with open('%s_tweet.csv' % user_name,'w', newline='') as outputfile: 
        writer=csv.writer(outputfile)
        writer.writerow(["id","user_name","created_at","followers","text"])
        writer.writerows(outtweets)

# use your function
if __name__ == '__main__':
    #pass in the username of the account you want to download
    get_all_tweets("KelloggCompany")
    

getting tweets before 1399719994724864010
...400 tweets downloaded so far
getting tweets before 1337041811412488197
...600 tweets downloaded so far
getting tweets before 1257656292014993407
...800 tweets downloaded so far
getting tweets before 1195344248163491839
...1000 tweets downloaded so far
getting tweets before 1133722853013114879
...1200 tweets downloaded so far
getting tweets before 1040045464765325311
...1400 tweets downloaded so far
getting tweets before 961706244900818943
...1600 tweets downloaded so far
getting tweets before 902881450592231423
...1800 tweets downloaded so far
getting tweets before 842721640182140927
...2000 tweets downloaded so far
getting tweets before 776789658802028543
...2197 tweets downloaded so far
getting tweets before 725360392512364548
...2395 tweets downloaded so far
getting tweets before 672728657132175360
...2595 tweets downloaded so far
getting tweets before 634431301207068671
...2794 tweets downloaded so far
getting tweets before 5652286916898

In [31]:
# Take a look at the table you got
df= pd.read_csv('KelloggCompany_tweet.csv', header=0)
df.head()

# how many tweets did we get?
len(df)

# The following tweet is a retweet. Take a look at the text.
# The index of this retweet is based on the table generated previously.
# If you run this at a different time, you will see a different tweet.
# Try to find a retweet and compare the text with the actual tweet.
# You can find each tweet on Twitter by its ID.
df.text[34]


Unnamed: 0,id,user_name,created_at,followers,text
0,1491416753037279234,Kellogg Company,2022-02-09 14:20:37,76708,b'Is it worth getting your hand stuck? Of cou...
1,1491151362389905409,Kellogg Company,2022-02-08 20:46:03,76708,"b""Check out this in-depth and wide-ranging di..."
2,1489299648720388102,Kellogg Company,2022-02-03 18:08:00,76708,b'We\xe2\x80\x99re honored to be on @FortuneMa...
3,1489248636819021832,Kellogg Company,2022-02-03 14:45:18,76708,"b'""We\xe2\x80\x99ve got this wonderful resourc..."
4,1488887938863800321,Kellogg Company,2022-02-02 14:52:01,76708,"b""For many kids, a snow day is all about fun -..."


3220

"b'RT @foodbanking: Food loss &amp; waste is costly for producers, takes up space in landfills, and emits harmful greenhouse gases.\\n\\nIn our blog o\\xe2\\x80\\xa6'"

### 5.2. Deal with truncated text
- For text mining on Twitter, it is important to get the full text. 
    - Full text would be essential for topic modeling and sentiment analysis.
    - Full text is also important for extracting mention networks (note the previous example). 
- Use `tweet_mode="extended"` when calling a user's timeline.
    - In extended mode, the `text` attribute of the returned Status objects is replaced by a `full_text` attribute, which contains the entire untruncated text of the Tweet. 
- Full text for tweets that are retweets.
    - If the tweet is a retweet, its `full_text` may still be truncated. 
    - We need to access the full text through the `retweeted_status` attribute, which is itself a Status object. 
- For reference: https://docs.tweepy.org/en/stable/extended_tweets.html
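
The retweet rule can be captured in a small helper. This is a sketch under assumptions: `get_full_text` is a hypothetical name, and `SimpleNamespace` objects stand in for Tweepy Status objects so that the example runs without credentials.

```python
from types import SimpleNamespace


def get_full_text(status):
    """Return the untruncated text, looking inside retweeted_status for retweets."""
    original = getattr(status, "retweeted_status", None)
    if original is not None:
        return original.full_text  # a retweet: use the original tweet's text
    return status.full_text        # a regular tweet


# Stand-ins for Status objects fetched with tweet_mode="extended"
plain = SimpleNamespace(full_text="A regular tweet.")
retweet = SimpleNamespace(
    full_text="RT @someone: A long tweet that gets cut...",
    retweeted_status=SimpleNamespace(
        full_text="A long tweet that gets cut off in the retweet but not here."
    ),
)

print(get_full_text(plain))
print(get_full_text(retweet))
```

Note that the text taken from `retweeted_status` no longer carries the `RT @user:` prefix, so the retweeted account has to be recorded separately if you need it.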

In [19]:
# Let's deal with retweets

def get_all_tweets(user_name):
    auth=tweepy.OAuthHandler(CONSUMER_KEY,CONSUMER_SECRET)
    auth.set_access_token(ACCESS_KEY,ACCESS_SECRET)
    api=tweepy.API(auth,wait_on_rate_limit=True)

    alltweets=[]
    new_tweets=api.user_timeline(screen_name = user_name, count=200,tweet_mode="extended")
    alltweets.extend(new_tweets)
    oldest=alltweets[-1].id-1  
    
    # set date condition
    startDate = datetime.datetime(2021, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)
    # stop when a page comes back empty (avoids an IndexError) or reaches startDate
    while new_tweets and new_tweets[-1].created_at > startDate:
        print ("getting tweets before", oldest)
        new_tweets = api.user_timeline(screen_name = user_name,count=200, max_id=oldest,tweet_mode="extended")
        alltweets.extend(new_tweets)
        oldest=alltweets[-1].id-1
        print("...{} tweets downloaded so far".format(len(alltweets)))
        
    # check if it's a retweet
    # When using extended mode with a Retweet, the full_text attribute of the Status object may be truncated    
    # However, since the retweeted_status attribute (of a Status object that is a Retweet) is itself a Status object
    # the full_text attribute of the Retweeted Status object can be used instead.
    
    outtweets_all=[]
    for tweet in alltweets:
        status = api.get_status(tweet.id, tweet_mode="extended")
        
        if hasattr(status, "retweeted_status"):  # is a retweet
            full_text=status.retweeted_status.full_text.encode("utf-8")
            
            outtweets=[
            # tweet content
            tweet.id_str, tweet.created_at,full_text,
            # user features
            tweet.user.name, tweet.user.screen_name, tweet.user.followers_count, 
            # retweet features
            tweet.retweeted_status.user.name,tweet.retweeted_status.user.screen_name,tweet.retweeted_status.user.description]
            outtweets_all.append(outtweets)
  
        else: # not a retweet
            full_text=status.full_text.encode("utf-8")
                    
            outtweets=[
            # tweet content
            tweet.id_str, tweet.created_at,full_text,
            # user features
            tweet.user.name, tweet.user.screen_name, tweet.user.followers_count, 
            # retweet features
            "no value","no value","no value"]
            outtweets_all.append(outtweets)

    with open('%s_full_tweet.csv' % user_name,'w', newline='') as outputfile: 
        writer=csv.writer(outputfile)
        writer.writerow(["id","created_at","full_text",
                        "user.name","user.screen_name","user.followers_count",
                        "retweeted_status.user.name","retweeted_status.user.screen_name","retweeted_status.user.description"])
        writer.writerows(outtweets_all)

        
if __name__ == '__main__':
    #pass in the username of the account you want to download
    get_all_tweets("KelloggCompany")


getting tweets before 1399719994724864010
...400 tweets downloaded so far


In [27]:
df= pd.read_csv('KelloggCompany_full_tweet.csv', header=0)
df.head()
len(df)
# please compare this full text with the truncated text above: what differences can you find?
df.full_text[34]

Unnamed: 0,id,created_at,full_text,user.name,user.screen_name,user.followers_count,retweeted_status.user.name,retweeted_status.user.screen_name,retweeted_status.user.description
0,1491416753037279234,2022-02-09 14:20:37,b'Is it worth getting your hand stuck? Of cou...,Kellogg Company,KelloggCompany,76708,no value,no value,no value
1,1491151362389905409,2022-02-08 20:46:03,b'Check out this in-depth and wide-ranging di...,Kellogg Company,KelloggCompany,76708,no value,no value,no value
2,1489299648720388102,2022-02-03 18:08:00,b'We\xe2\x80\x99re honored to be on @FortuneMa...,Kellogg Company,KelloggCompany,76708,no value,no value,no value
3,1489248636819021832,2022-02-03 14:45:18,"b'""We\xe2\x80\x99ve got this wonderful resourc...",Kellogg Company,KelloggCompany,76708,no value,no value,no value
4,1488887938863800321,2022-02-02 14:52:01,"b""For many kids, a snow day is all about fun -...",Kellogg Company,KelloggCompany,76708,no value,no value,no value


400

"b'Food loss &amp; waste is costly for producers, takes up space in landfills, and emits harmful greenhouse gases.\\n\\nIn our blog on @CGF_The_Forum, see how partnerships between #FoodProducers and #FoodBanks can alleviate hunger and mitigate climate change. #IDAFLW\\nhttps://t.co/xIOT0Hjlac'"

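Because the text was encoded to bytes before being written, the CSV stores strings that look like Python byte literals (`b'...'`), with HTML entities such as `&amp;` still in place. A minimal sketch of turning such a cell back into readable text follows; `clean_stored_text` is a hypothetical helper name and the sample string is shortened from the output above.

```python
import ast
import html


def clean_stored_text(cell):
    """Turn a stored "b'...'" literal back into text and unescape HTML entities."""
    if cell.startswith(("b'", 'b"')):
        # literal_eval safely parses the bytes literal; then decode the UTF-8 bytes
        cell = ast.literal_eval(cell).decode("utf-8")
    return html.unescape(cell)


# Shortened sample in the same shape as the CSV cells shown above
stored = "b'We\\xe2\\x80\\x99re honored: Food loss &amp; waste'"
print(clean_stored_text(stored))
```

Writing `full_text` to the CSV as a plain string (without `.encode("utf-8")`) would avoid the problem at the source.
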
### 5.3. Build Twitter networks
- **Follower-followee network**
    - If you have a list of user accounts, you may retrieve the pairwise following relations as boolean values (e.g. with `api.get_friendship`). 
    - Parameters
        * `source_id` – The user_id of the subject user.
        * `source_screen_name` – The screen_name of the subject user.
        * `target_id` – The user_id of the target user.
        * `target_screen_name` – The screen_name of the target user.
- **Retweet network**
    - Retweeted accounts can be extracted while scraping the API. 
    - Or retweeted accounts can be extracted from the text. 
- **Mention network**
    - Can be extracted from the full text. 
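
Extracting retweet and mention edges from text can be sketched with regular expressions. The patterns below are assumptions rather than an official grammar: they treat a handle as `@` followed by up to 15 letters, digits or underscores, and they will also match things like e-mail addresses. The sample text is abbreviated from the retweet shown earlier.

```python
import re

MENTION = re.compile(r"@(\w{1,15})")
RETWEET = re.compile(r"^RT @(\w{1,15}):")


def mention_edges(author, text):
    """One (author, mentioned_account) edge per @handle found in the text."""
    return [(author, handle) for handle in MENTION.findall(text)]


def retweet_edge(author, text):
    """(author, retweeted_account) if the text starts with 'RT @handle:', else None."""
    match = RETWEET.match(text)
    return (author, match.group(1)) if match else None


text = "RT @foodbanking: In our blog on @CGF_The_Forum, see how partnerships help."
print(retweet_edge("KelloggCompany", text))
print(mention_edges("KelloggCompany", text))
```

The resulting edge lists can be loaded into a network library or a DataFrame for further analysis.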

In [32]:
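# How to scrape the follower-followee network?
# we can directly retrieve a boolean value 

dog="Microsoft"
cat="Oracle"

# get_friendship returns a (source, target) pair of Friendship objects
source, target = api.get_friendship(source_screen_name=cat,target_screen_name=dog)
# target.following: does the target (dog) follow the source (cat)?
# source.following answers the opposite direction
print(target.following)

# Question: how to get the adjacency matrix of a follower-followee network?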

False
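
One possible answer to the adjacency matrix question, sketched offline: loop over all ordered pairs of accounts and record whether the first follows the second. The account names and the `known_edges` set are made up so the example runs without the API; with Tweepy, the `follows` function would wrap a friendship lookup instead.

```python
def adjacency_matrix(accounts, follows):
    """matrix[i][j] is 1 if accounts[i] follows accounts[j], else 0."""
    return [
        [1 if a != b and follows(a, b) else 0 for b in accounts]
        for a in accounts
    ]


# Made-up following relations standing in for API lookups
known_edges = {("bob", "alice"), ("carol", "alice"), ("alice", "carol")}


def follows(a, b):
    """Hypothetical lookup: True if account a follows account b."""
    return (a, b) in known_edges


accounts = ["alice", "bob", "carol"]
matrix = adjacency_matrix(accounts, follows)
for name, row in zip(accounts, matrix):
    print(name, row)
```

With n accounts this needs on the order of n × (n − 1) lookups, so rate limits matter for large lists; since a friendship lookup reports both directions, one call per unordered pair should be enough.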


### 5.4 Keywords search


In [21]:
# Define the search term and the date_since date as variables
search_words = "blockchain"
date_since = "2021-11-01"

# Collect tweets
# search_tweets has no `since` parameter; put the `since:` operator in the query instead.
# Note: the standard search API only covers roughly the last 7 days.
tweets = tweepy.Cursor(api.search_tweets,
              q=f"{search_words} since:{date_since}",
              lang="en").items(5)


# Iterate and print tweets
for tweet in tweets:
    print(tweet.text)


@DeFiDiscussion @ludo Many games that involve blockchain are often not entertaining enough and end up being played… https://t.co/6eM0YbOUyo
RT @CardanoFeed: Cardano Enters Top 3 of Most Developed Coins on Crypto Market 

#Cardano #cardanofeed #ADA #crypto #cardanocommunity #bitc…
RT @POODLETOKEN: $POODL is available for everyone, any network. From virtually any #crypto 

$ETH $BTC $SOL $ONE $ZIL $BNB

It doesn't matt…
RT @AgeOfZalmoxis: Our Whitepaper is out!
https://t.co/v0PnvjQua8

Be part of the community:
https://t.co/2WNZTfQy4y

@RomanianAcademy @ben…
RT @HiRezTheRapper: I AM GOING TO BE LAUNCHING LIMITED LIFETIME MEMBERSHIPS TO ALL MY IRL/META EVENTS FOR @HogHomies. HOLDERS WILL GET FREE…


##### Twitter data resources
https://github.com/echen102/us-pres-elections-2020 <br>
https://github.com/echen102/COVID-19-TweetIDs