
# Mining the social web 
## Workout 2. Understanding JSON

- CityU COM5507 201819A - Unit 2: Web data collection
- 24 Oct 2018, Week 8: Mining the social web - data formats 


- Course Instructor: [Dr. Xinzhi Zhang](http://www.drxinzhizhang.com) (JOUR, Hong Kong Baptist University)
  - xzzhang2@gmail.com


- The code in this notebook is adapted from various sources. All code is for educational purposes only and is released under the CC1.0.

## JSON: a data format, which looks like...

In [None]:
# JSON represents data as nested “lists” and “dictionaries”
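
To make the "lists and dictionaries" idea concrete, here is a minimal sketch of how `json.loads` maps each JSON type onto a Python type. The `sample` string and its field names are made up for illustration and are not part of the course data.

```python
import json

# A made-up JSON document that exercises each basic JSON type.
sample = '''
{
  "name": "Chuck",
  "age": 42,
  "score": 3.5,
  "active": true,
  "nickname": null,
  "tags": ["a", "b"],
  "phone": {"type": "intl"}
}
'''

parsed = json.loads(sample)

# Record the Python type that each JSON value was converted to.
type_names = {key: type(value).__name__ for key, value in parsed.items()}

for key, name in type_names.items():
    print(key, '->', name)
```

JSON objects become `dict`, arrays become `list`, strings become `str`, numbers become `int` or `float`, `true`/`false` become `True`/`False`, and `null` becomes `None`.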

In [1]:
import json

In [2]:
# JSON text whose top level is an object (json.loads parses it into a dict)

data = '''
{
  "name" : "Chuck",
  "phone" : {
    "type" : "intl",
    "number" : "+1 734 303 4456"
   },
   "email" : {
     "hide" : "yes"
   }
}
'''


In [14]:
info = json.loads(data)
print('Name:', info["name"])
print('Hide:', info["email"]["hide"])

Name: Chuck
Hide: yes

In [5]:
# JSON text whose top level is an array (json.loads parses it into a list of dicts)

data = '''
[
  { "id" : "001",
    "x" : "2",
    "name" : "Chuck"
  } ,
  { "id" : "009",
    "x" : "7",
    "name" : "Chuck"
  }
]
'''

In [6]:
info = json.loads(data)
print('User count:', len(info))

for item in info:
    print('Name', item['name'])
    print('Id', item['id'])
    print('Attribute', item['x'])

User count: 2
Name Chuck
Id 001
Attribute 2
Name Chuck
Id 009
Attribute 7
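
Note that every value in this example is a JSON string, including `"x"` and `"id"`. The sketch below, which reuses the same data under the assumed variable name `users`, shows one way to convert `"x"` to a number before doing arithmetic while leaving `"id"` as text so that leading zeros survive.

```python
import json

data = '''
[
  {"id": "001", "x": "2", "name": "Chuck"},
  {"id": "009", "x": "7", "name": "Chuck"}
]
'''

users = json.loads(data)

# "x" arrives as a string, so convert it before summing.
total_x = sum(int(user["x"]) for user in users)

# Keep "id" as a string: int("001") would drop the leading zeros.
ids = [user["id"] for user in users]

print('Total x:', total_x)
print('Ids:', ids)
```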


## JSON: a single file, from a real example

In [None]:
# for single file: 20180101a

In [7]:
with open('20180101a.json', encoding='utf-8') as json_data:
    tweet = json.load(json_data)
    print(tweet)

{'created_at': 'Mon Jan 01 00:28:53 +0000 2018', 'id': 947625642291376128, 'id_str': '947625642291376128', 'text': 'Why does data journalist @maloym support @latguild? “I want the Los Angeles Times to be a newspaper where journalis… https://t.co/yDUAJtAqxr', 'display_text_range': [0, 140], 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'truncated': True, 'in_reply_to_status_id': 947554925927587841, 'in_reply_to_status_id_str': '947554925927587841', 'in_reply_to_user_id': 859504635773464576, 'in_reply_to_user_id_str': '859504635773464576', 'in_reply_to_screen_name': 'latguild', 'user': {'id': 859504635773464576, 'id_str': '859504635773464576', 'name': 'L.A. Times Guild 🦅', 'screen_name': 'latguild', 'location': 'Los Angeles, CA', 'url': 'https://latguild.com', 'description': 'Our mission is to safeguard the future of the Los Angeles Times and its journalists.', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_c

In [8]:
print(len(tweet))

30


In [9]:
print(tweet["created_at"])

Mon Jan 01 00:28:53 +0000 2018


In [10]:
print(tweet["text"]) 

Why does data journalist @maloym support @latguild? “I want the Los Angeles Times to be a newspaper where journalis… https://t.co/yDUAJtAqxr


In [11]:
print(tweet["url"]) 
# what happened?  

KeyError: 'url'

![Twitter JSON cheat sheet](JSON_cheat_sheet_raffi-krikorian-map-of-a-tweet.png)

In [15]:
print(tweet["user"]["url"]) 

https://latguild.com


In [16]:
print(tweet["user"]["screen_name"])

latguild


In [17]:
print(tweet["user"]["description"])

Our mission is to safeguard the future of the Los Angeles Times and its journalists.
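
The `KeyError` earlier in this section arises because `url` is not a top-level key of the tweet; it sits inside the nested `user` dictionary. Besides reading the cheat sheet, one way to locate a field is to search the parsed structure for it. The sketch below uses a hypothetical helper, `find_key_paths`, and an abridged stand-in for the tweet (`sample_tweet`, with values copied from the printed output above) so that it runs without the data file.

```python
# Abridged stand-in for the loaded tweet; the real object has many more fields.
sample_tweet = {
    "created_at": "Mon Jan 01 00:28:53 +0000 2018",
    "id_str": "947625642291376128",
    "user": {
        "screen_name": "latguild",
        "url": "https://latguild.com",
    },
}


def find_key_paths(obj, target, path=()):
    """Return every path (a tuple of keys/indexes) where `target` appears as a dict key."""
    found = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == target:
                found.append(path + (key,))
            # Keep searching inside the value, whatever it is.
            found.extend(find_key_paths(value, target, path + (key,)))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            found.extend(find_key_paths(value, target, path + (index,)))
    return found


# Top-level keys only: "url" is not among them.
print(sorted(sample_tweet.keys()))

# Every place where a "url" key occurs in the nested structure.
print(find_key_paths(sample_tweet, "url"))
```

On the full tweet, the same search may report several paths, since more than one nested object can carry a key with the same name.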


## Digging out more fields of a Tweet

In [18]:
# a more defensive method: check that the key exists and that its value is not None

if 'created_at' in tweet:
    if tweet['created_at'] is not None:
        timestamp = tweet['created_at']
    else:
        timestamp = "Nonetimestamp"
else:
    timestamp = "Notimestamp"

print(timestamp)

Mon Jan 01 00:28:53 +0000 2018


In [21]:
# the number of followers
if 'followers_count' in tweet['user']:
    if tweet['user']['followers_count'] is not None:
        nfollowers = tweet['user']['followers_count']
    else:
        nfollowers = "Nonefollowers"
else:
    nfollowers = "Nofollowers"

print(nfollowers)

2865
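
The nested `if`/`else` pattern above repeats for every field, so it can be wrapped in a small function. The following is a minimal sketch, not the notebook's own method: `get_nested` is a hypothetical helper, `sample_tweet` is an abridged stand-in for the loaded tweet, and `example_field` is an invented key used only to show the `None` case. Unlike the cells above, this helper returns the same fallback for a missing key and for a `None` value.

```python
def get_nested(record, keys, default=None):
    """Follow `keys` through nested dicts; return `default` if a key is missing or the value is None."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return default if current is None else current


# Abridged stand-in for the tweet; "example_field" is invented for illustration.
sample_tweet = {
    "created_at": "Mon Jan 01 00:28:53 +0000 2018",
    "user": {
        "screen_name": "latguild",
        "followers_count": 2865,
        "example_field": None,
    },
}

timestamp = get_nested(sample_tweet, ["created_at"], "Notimestamp")
nfollowers = get_nested(sample_tweet, ["user", "followers_count"], "Nofollowers")
example = get_nested(sample_tweet, ["user", "example_field"], "Noexample")
missing = get_nested(sample_tweet, ["place", "country"], "Noplace")

print(timestamp)
print(nfollowers)
print(example)
print(missing)
```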


## Challenges:
- Try to dig out all the possible fields of this tweet, and store the data in a structured format using pandas.
- Try to combine all the tweets stored in the folder [com5507_json_ddj20180131], and store the data in a structured format using pandas. (Hint: first, you may need a loop to "append" the files into one.)
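
As a starting point for the second challenge, the loop-and-append step can be sketched as follows. This is one possible approach under stated assumptions: each `.json` file holds a single tweet (as `20180101a.json` does), the helper names `flatten_tweet` and `load_rows` are hypothetical, and the demo writes two made-up files to a temporary folder instead of reading the course folder.

```python
import glob
import json
import os
import tempfile


def flatten_tweet(tweet):
    """Pick a few fields from one tweet dict into a flat row; missing fields become None."""
    user = tweet.get("user") or {}
    return {
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text"),
        "screen_name": user.get("screen_name"),
        "followers_count": user.get("followers_count"),
    }


def load_rows(folder):
    """Read every .json file in `folder` and append one flat row per file."""
    rows = []
    for path in sorted(glob.glob(os.path.join(folder, "*.json"))):
        with open(path, encoding="utf-8") as json_file:
            rows.append(flatten_tweet(json.load(json_file)))
    return rows


# Demo on a throwaway folder with two made-up tweets.
with tempfile.TemporaryDirectory() as demo_folder:
    demo_tweets = {
        "a.json": {
            "created_at": "Mon Jan 01 00:28:53 +0000 2018",
            "text": "first",
            "user": {"screen_name": "latguild", "followers_count": 2865},
        },
        "b.json": {"text": "second"},
    }
    for name, content in demo_tweets.items():
        with open(os.path.join(demo_folder, name), "w", encoding="utf-8") as out:
            json.dump(content, out)
    rows = load_rows(demo_folder)

print(len(rows))
print(rows[0])
```

With pandas installed, the resulting list of dictionaries can be turned into a table with `pd.DataFrame(rows)`. To use the course data, pass the path of the tweet folder to `load_rows` instead of the temporary folder.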