** Working with JSON data **

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others.

JSON is built on two structures:
    
1. A collection of name/value pairs - you are already familiar with this concept through working with Python dictionaries
2. An ordered list of values - you are already familiar with this concept through working with Python lists
    
More information on JSON is available at https://www.json.org/
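
As a small illustration of the two structures described above, the sketch below parses a JSON string that contains both an object and an array. The JSON text and its field names are made up for this example and are not taken from the tweet data.

```python
import json

# A made-up JSON document: an object (name/value pairs) that contains
# an array (ordered list of values). The field names are illustrative only.
json_text = '{"screen_name": "example_user", "hashtags": ["json", "python"]}'

# json.loads parses a JSON string into Python objects
parsed = json.loads(json_text)

print(type(parsed))              # a JSON object becomes a Python dict
print(type(parsed["hashtags"]))  # a JSON array becomes a Python list
print(parsed["hashtags"][0])

# json.dumps converts Python dictionaries and lists back into a JSON string
round_trip = json.dumps(parsed)
print(round_trip)
```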

The data for this tutorial came from Tweets of Congress daily archives: https://freegovinfo.info/node/tag/twitter-data and https://alexlitel.github.io/congresstweets/

In [2]:
import csv

# The os module provides a portable way of using operating system dependent functionality. 
# For example, Windows and Mac operating systems use different path notations for files. 
# On a Mac, you may see a path to a file or a folder listed as /var/www/html/folder
# On a Windows computer, the same path would be expressed as C:\var\www\html\folder
# The os module hides these differences from the programmer and makes programs written in Python
# more operating system-independent
import os


# The json library can parse JSON from strings or files. 
# The library parses JSON into a Python dictionary or list. 
# It can also convert Python dictionaries or lists into JSON strings.
# http://docs.python-guide.org/en/latest/scenarios/json/
import json
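
The json library works with files as well as strings. The sketch below writes a small list of dictionaries to a temporary file and reads it back. The records and the file name `example.json` are assumptions made up for this illustration; they are not part of the tutorial data.

```python
import json
import os
import tempfile

# Made-up records that mimic the shape of the tweet data (a list of dictionaries)
records = [
    {"id": "1", "text": "first example"},
    {"id": "2", "text": "second example"},
]

with tempfile.TemporaryDirectory() as temp_folder:
    file_path = os.path.join(temp_folder, "example.json")

    # json.dump writes Python lists and dictionaries to an open file as JSON
    with open(file_path, "w", encoding="utf-8") as file:
        json.dump(records, file)

    # json.load reads JSON from an open file (json.loads does the same for a string)
    with open(file_path, "r", encoding="utf-8") as file:
        loaded = json.load(file)

print(len(loaded))
print(loaded[0]["text"])
```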

In [4]:
# As a first step, we need to get the current working directory.
# To access files that reside somewhere on a hard drive, we need a starting point.
# That starting point is the directory Python was started from; for a notebook, this is
# usually the folder that contains the notebook file. From that directory, we can build a
# relative path to the subfolder that we need and to the files located in that subfolder.
# For more information on relative vs. absolute paths review the following resources:
# https://en.wikipedia.org/wiki/Path_(computing)
# http://resources.esri.com/help/9.3/ArcGISengine/java/Gp_ToolRef/sharing_tools_and_toolboxes/pathnames_explained_colon_absolute_relative_unc_and_url.htm


# Get current working directory
cwd = os.getcwd()
print(cwd)

# Determine the path (location on the hard drive) of the subfolder that contains our data files
data_subfolder = "congress_tweets"

# Combine the working directory name with the subfolder name. Windows separates folders
# with a backslash (\), while Mac and Linux use a forward slash (/). os.path.join picks
# the correct separator for the current operating system, so the same code runs on all of them.
folder_path = os.path.join(cwd, data_subfolder)
print(folder_path)


/Users/dmitriyb/Box Sync/TEACHING/Code Examples (DMB72@pitt.edu)/computationalthinking/Reading and processing data from the web
/Users/dmitriyb/Box Sync/TEACHING/Code Examples (DMB72@pitt.edu)/computationalthinking/Reading and processing data from the web/congress_tweets
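
To make the difference between relative and absolute paths concrete, here is a minimal sketch that uses only functions from `os.path`. The file name `2018-01-24.json` is used as a sample; nothing is opened, so the file does not need to exist for the sketch to run.

```python
import os

# Build a relative path from a folder name and a sample file name.
# Nothing is opened here, so the path does not have to exist.
relative_path = os.path.join("congress_tweets", "2018-01-24.json")
print(relative_path)

# A relative path is interpreted starting from the current working directory
print(os.path.isabs(relative_path))   # False

# os.path.abspath prepends the current working directory to a relative path
absolute_path = os.path.abspath(relative_path)
print(os.path.isabs(absolute_path))   # True

# os.sep is the separator used by the current operating system
print(os.sep)
```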


In [7]:
for root_folder, subfolders, files in os.walk(folder_path):
    #print("Root folder: " + str(root_folder))
    #print("Subfolders: " + str(subfolders))
    print("Files: " + str(files))
    

Files: ['2018-01-24.json', '2018-01-28.json', '2018-02-02.json', '2018-01-29.json', '2018-01-25.json', '2018-02-04.json', '2018-02-05.json', '2018-02-06.json', '2018-02-07.json', '2018-01-26.json', '2018-01-30.json', '2018-02-01.json', '2018-01-31.json', '2018-01-27.json']
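
Note that the file names above are not in date order: os.walk makes no promise about the order in which files are listed. If the order matters, one option is to sort the names and keep only the JSON files. The sketch below shows this on a temporary folder with made-up file names, so it does not depend on the tutorial data.

```python
import os
import tempfile

# Build a throwaway folder with a few empty files so the sketch is self-contained.
# The file names below are made up for illustration.
with tempfile.TemporaryDirectory() as temp_folder:
    for name in ["2018-01-25.json", "2018-01-24.json", "notes.txt"]:
        with open(os.path.join(temp_folder, name), "w", encoding="utf-8"):
            pass

    json_files = []
    for root_folder, subfolders, files in os.walk(temp_folder):
        # sorted() returns the names in alphabetical order, which for
        # YYYY-MM-DD file names is also chronological order
        for file_name in sorted(files):
            # Skip anything that is not a JSON file
            if file_name.endswith(".json"):
                json_files.append(file_name)

print(json_files)
```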


In [10]:
# Now we need to iterate through the list of files in our data subfolder
for root_folder, subfolders, files in os.walk(folder_path):
    for file_name in files:
        file_path = os.path.join(root_folder, file_name)
        # The with statement closes the file automatically once it has been read
        with open(file_path, 'r', encoding="utf-8") as file:
            tweet_data = json.load(file)
        print(file_path)
        for tweet in tweet_data[0:10]:
            print("______________________________")
            print("ID: " + tweet["id"])
            print("Screen Name: " + tweet["screen_name"])
            print("User ID: " + tweet["user_id"])
            print("Time: " + tweet["time"])
            print("Text: " + tweet["text"])
            print("Source: " + tweet["source"])

/Users/dmitriyb/Box Sync/TEACHING/Code Examples (DMB72@pitt.edu)/computationalthinking/Reading and processing data from the web/congress_tweets/2018-01-24.json
______________________________
ID: 956039366593470464
Screen Name: ericswalwell
User ID: 377609596
Time: 2018-01-24T00:42:02-05:00
Text: Agreed, in full. Now tell us what you’re going to do about it. https://twitter.com/senatemajldr/status/955824536640999425 QT @SenateMajLdr Closely tracking reports of the tragedy in Benton, #Kentucky at Marshall County High School and my thoughts are with the students, teachers, faculty, and the entire community. Thank you to the first responders who continue to put themselves in harm's way to protect others.
Source: Twitter for iPhone
______________________________
ID: 956037956392890368
Screen Name: DeanHeller
User ID: 41363507
Time: 2018-01-24T00:36:25-05:00
Text: RT @ChloeNews3LV GO KNIGHTS GO!!! 🏒✨ @GoldenKnights http://pbs.twimg.com/media/DUSCE1rW0AAzhlX.jpg
Source: Twitter for iPhone
___

/Users/dmitriyb/Box Sync/TEACHING/Code Examples (DMB72@pitt.edu)/computationalthinking/Reading and processing data from the web/congress_tweets/2018-02-06.json
______________________________
ID: 960744868699164672
Screen Name: MurrayCampaign
User ID: 158470209
Time: 2018-02-06T00:20:01-05:00
Text: As income inequality in our country continues to grow, it’s clearer than ever that we need to keep fighting to build an economy that works for everyone -- not just the wealthiest and those at the very top.
Source: Sprout Social
______________________________
ID: 960742094607405056
Screen Name: RepJayapal
User ID: 815733290955112448
Time: 2018-02-06T00:08:59-05:00
Text: RT @BlackGirlMagix #BlackHistoryMonth
Today, teach your children about Ruby Bridges. She was the first black child to attend at all-white public elementary school in the South.

She's only 63. Only, 63. http://pbs.twimg.com/media/DVI1spnWAAEK5iA.jpg
Source: Twitter for iPad
______________________________
ID: 960741911492472832


/Users/dmitriyb/Box Sync/TEACHING/Code Examples (DMB72@pitt.edu)/computationalthinking/Reading and processing data from the web/congress_tweets/2018-01-30.json
______________________________
ID: 958215981666619393
Screen Name: LacyClayMO1
User ID: 584912320
Time: 2018-01-30T00:51:07-05:00
Text: RT @nowthisnews Nancy Pelosi pulled no punches on President Trump’s immigration plan to ‘make America white again’ http://pbs.twimg.com/media/DUvrU6SWsAAYY90.jpg https://video.twimg.com/amplify_video/958116062528229377/vid/240x240/j-f2Vg10GkK0XEMu.mp4
Source: Twitter for iPad
______________________________
ID: 958212683815374848
Screen Name: MurrayCampaign
User ID: 158470209
Time: 2018-01-30T00:38:01-05:00
Text: Leah turned her own painful experience into a personal mission to ensure others who experience sexual assault have the resources they need when they seek help. I’m amazed by Leah’s dedication &amp; I couldn’t be prouder to help lift up her voice in Congress.  http://komonews.com/news/loc
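
The same block of print statements is needed every time tweets are displayed. One possible way to avoid repeating it is a small helper function, sketched below. The name `format_tweet`, the `"(missing)"` fallback, and the sample tweet are assumptions introduced for this illustration; only the six dictionary keys come from the data shown above.

```python
def format_tweet(tweet):
    """Return a printable, multi-line summary of one tweet dictionary."""
    # Field labels paired with the dictionary keys used in this tutorial
    fields = [
        ("ID", "id"),
        ("Screen Name", "screen_name"),
        ("User ID", "user_id"),
        ("Time", "time"),
        ("Text", "text"),
        ("Source", "source"),
    ]
    lines = ["______________________________"]
    for label, key in fields:
        # tweet.get returns the fallback instead of raising KeyError when a
        # key is missing; str() guards against non-string values
        lines.append(label + ": " + str(tweet.get(key, "(missing)")))
    return "\n".join(lines)


# A made-up tweet for illustration; it deliberately has no "source" key
sample_tweet = {
    "id": "1",
    "screen_name": "example_user",
    "user_id": "42",
    "time": "2018-01-24T00:00:00-05:00",
    "text": "Hello, JSON!",
}
print(format_tweet(sample_tweet))
```

With a helper like this, a display loop reduces to `print(format_tweet(tweet))` for each tweet.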

In [12]:
# Create a master list to hold all tweets
all_tweets = []

for root_folder, subfolders, files in os.walk(folder_path):
    for file_name in files:
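        file_path = os.path.join(root_folder, file_name)
        # The with statement closes the file automatically once it has been read
        with open(file_path, 'r', encoding="utf-8") as file:
            tweet_data = json.load(file)

        # Add every tweet from this file to the master list
        all_tweets.extend(tweet_data)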


In [13]:
print(len(all_tweets))

35118


In [14]:
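def search_tweets(search_term):
    # Lowercase the search term once so that the comparison below ignores case
    search_term = search_term.lower()
    results = []
    for tweet in all_tweets:
        text = tweet["text"].lower()
        if search_term in text:
            results.append(tweet)

    return results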
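
The search idea can also be tried without loading the full data set. The sketch below is a self-contained variant that takes the list of tweets as a parameter instead of reading the global all_tweets list, and lowercases both the tweet text and the search term. The function name `search_tweet_list` and the sample tweets are assumptions made up for this illustration.

```python
def search_tweet_list(tweets, search_term):
    """Return the tweets whose text contains search_term, ignoring case."""
    term = search_term.lower()
    return [tweet for tweet in tweets if term in tweet["text"].lower()]


# Made-up tweets that contain only the "text" key needed by the search
sample_tweets = [
    {"text": "Protect children's #Healthcare"},
    {"text": "GO KNIGHTS GO!!!"},
    {"text": "healthcare is on the agenda today"},
]

# The mixed-case search term still matches because both sides are lowercased
matches = search_tweet_list(sample_tweets, "HealthCare")
print(len(matches))
```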

In [17]:
results = search_tweets('healthcare')
for tweet in results:
    print("______________________________")
    print("ID: " + tweet["id"])
    print("Screen Name: " + tweet["screen_name"])
    print("User ID: " + tweet["user_id"])
    print("Time: " + tweet["time"])
    print("Text: " + tweet["text"])
    print("Source: " + tweet["source"])

______________________________
ID: 956060207297294336
Screen Name: auctnr1
User ID: 21572351
Time: 2018-01-24T02:04:50-05:00
Text: Ya think? @realDonaldTrump 90% of Americans bigger paychecks. 
2.6 million+ Americans got bonuses /pay raises. Utilities cutting rates for millions. DOW over 26K Contrast to #SchumerShutdow  accommodate illegal immigrants / don’t pay troops / don’t provide children’s #healthcare https://twitter.com/thehill/status/956056745633296384 QT @thehill Poll: More Americans blame Dems than Trump for shutdown http://thehill.com/blogs/blog-briefing-room/news/370294-congressional-democrats-edge-out-trump-for-shutdown-blame-poll http://pbs.twimg.com/media/DUSYirsW0AA294I.jpg
Source: Twitter for iPhone
______________________________
ID: 956168045386493952
Screen Name: auctnr1
User ID: 21572351
Time: 2018-01-24T09:13:21-05:00
Text: .@facebook executive to retire to help make #Democrats great! Might want to start by not repeating the #SchumerShutdown to protect illegal alie