In [7]:
! pip install jsonlines
! pip install https://github.com/elyase/geotext/archive/master.zip
! pip install tqdm

Collecting jsonlines
  Downloading https://files.pythonhosted.org/packages/4f/9a/ab96291470e305504aa4b7a2e0ec132e930da89eb3ca7a82fbe03167c131/jsonlines-1.2.0-py2.py3-none-any.whl
Installing collected packages: jsonlines
Successfully installed jsonlines-1.2.0
Collecting https://github.com/elyase/geotext/archive/master.zip
  Downloading https://github.com/elyase/geotext/archive/master.zip
Building wheels for collected packages: geotext
  Running setup.py bdist_wheel for geotext ... done
  Stored in directory: /tmp/pip-ephem-wheel-cache-xrga8q69/wheels/f5/e3/84/31638877059a434d8601a764fc7565f2a9f7b6fb327085191e
Successfully built geotext
Installing collected packages: geotext
Successfully installed geotext-0.3.0


In [6]:
import collections
import json
import jsonlines
import ast
from tqdm import tqdm
from geotext import GeoText

ModuleNotFoundError: No module named 'jsonlines'

The data has a lot of issues. Spark cannot read it directly (you get a 'corrupted record'), and pandas raises an error as well. The records are heavily nested, and Python's native json library and the jsonlines module cannot read them because they use single quotes rather than double quotes (json requires double quotes). The solution is to read the data line by line, parse each line as a Python literal with the ast library (which accepts single quotes and trailing commas), and dump the resulting dictionary as valid json.
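
As a minimal sketch of that idea (the raw line below is made up for illustration and is not copied from the dataset), `ast.literal_eval` accepts a Python-style line that `json.loads` rejects, and `json.dumps` then produces valid JSON:

```python
import ast
import json

# Hypothetical raw line in Python-literal style: single quotes, a u'' prefix,
# None instead of null, and a trailing comma.
raw_line = "{'name': u'Blue Ribbon Cleaners', 'price': None, 'gps': [38.979759, -76.547538],}"

try:
    json.loads(raw_line)
    parsed_by_json = True
except json.JSONDecodeError:
    # Single quotes are not valid JSON, so the json module gives up here
    parsed_by_json = False

# literal_eval safely evaluates the line as a Python literal and returns a dict
record = ast.literal_eval(raw_line)

# Serialising the dict yields double-quoted, valid JSON
valid_json = json.dumps(record)
print(parsed_by_json)
print(valid_json)
```

Because `literal_eval` only evaluates literals (strings, numbers, lists, dicts, None and so on), it is a safer choice than `eval` for untrusted input.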

The original file was called places.clean.json. I renamed it to places.original.json because it really isn't 'clean' yet as far as my requirements are concerned.

Number of records in places.original.json (places.clean.json): 3114353

Sample records:

{"name": "Diamond Valley Lake Marina", "price": null, "address": ["2615 Angler Ave", "Hemet, CA 92545"], "hours": [["Monday", [["6:30 am--4:15 pm"]]], ["Tuesday", [["6:30 am--4:15 pm"]]], ["Wednesday", [["6:30 am--4:15 pm"]], 1], ["Thursday", [["6:30 am--4:15 pm"]]], ["Friday", [["6:30 am--4:15 pm"]]], ["Saturday", [["6:30 am--4:15 pm"]]], ["Sunday", [["6:30 am--4:15 pm"]]]], "phone": "(951) 926-7201", "closed": false, "gPlusPlaceId": "104699454385822125632", "gps": [33.703804, -117.003209]}

{"name": "Blue Ribbon Cleaners", "price": null, "address": ["Parole", "Annapolis, MD"], "hours": null, "phone": "(410) 266-6123", "closed": false, "gPlusPlaceId": "103054478949000078829", "gps": [38.979759, -76.547538]}

{"name": "Portofino", "price": null, "address": ["\u0443\u043b. \u0422\u0443\u0442\u0430\u0435\u0432\u0430, 1", "Nazran, Ingushetia, Russia", "366720"], "hours": [["Monday", [["9:30 am--9:00 pm"]]], ["Tuesday", [["9:30 am--9:00 pm"]]], ["Wednesday", [["9:30 am--9:00 pm"]], 1], ["Thursday", [["9:30 am--9:00 pm"]]], ["Friday", [["9:30 am--9:00 pm"]]], ["Saturday", [["9:30 am--9:00 pm"]]], ["Sunday", [["9:30 am--9:00 pm"]]]], "phone": "8 (963) 173-38-38", "closed": false, "gPlusPlaceId": "109810290098030327104", "gps": [43.22776, 44.762726]}

#### Steps
1. Read the original json file which contains corrupted json. 
2. We don't need all the data, so a check is made to see if the record belongs to one of 6 countries (the countries that have hotels in the other file, 515k_reviews). Of course, we don't have a field for the country, so this is just a substring search on the address after converting it to lower case. The next steps take place only if the business is located in one of the 6 countries.
3. Use the literal_eval function in the built-in module ast (which is generally used to work with Python's abstract syntax trees) to parse the line as a Python literal. ast.literal_eval returns a dictionary (it interprets the line as a dict), so the single quotes, the trailing commas and the unicode indicator u'' all disappear, and the dictionary is written out with double quotes when it is serialised to JSON.
4. The address field is a json array. This needs to be converted into one field.
5. As an additional step, I get the city and country from the address using a simple NER library. A more complex method will work better, but this is what I'm using for now.
6. The gps, another json array, is split into two new fields: latitude and longitude.
7. Finally, the most complex field, hours, needs to be handled. This contains either None, or a list of lists, each of which has a day and another list of lists. This needs to be flattened -- I do this by using a generator. In the final resulting output, I add 7 new fields for the opening hours of each day: MondayHours, TuesdayHours and so on.
8. Write the results back into a new jsonlines file which contains valid JSON.
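
To make steps 4, 6 and 7 concrete, here is a small sketch on a shortened version of the first sample record. The names `flatten_sketch` and `record` are only for illustration; the full program further down is the actual implementation.

```python
from collections.abc import Iterable

def flatten_sketch(nested):
    """Yield the leaf values of an arbitrarily nested list, depth first."""
    for item in nested:
        if isinstance(item, Iterable) and not isinstance(item, (str, bytes)):
            yield from flatten_sketch(item)
        else:
            yield item

# Shortened sample record (only two of the seven days are kept)
record = {
    "address": ["2615 Angler Ave", "Hemet, CA 92545"],
    "gps": [33.703804, -117.003209],
    "hours": [["Monday", [["6:30 am--4:15 pm"]]],
              ["Wednesday", [["6:30 am--4:15 pm"]], 1]],
}

# Step 4: the address array becomes a single string
joined_address = ", ".join(record["address"])

# Step 6: the gps array is split into two fields
latitude, longitude = record["gps"]

# Step 7: one '<Day>Hours' field per day, holding the first time range found
hours_by_day = {
    day_entry[0] + "Hours": next(flatten_sketch(day_entry[1]), None)
    for day_entry in record["hours"]
}

print(joined_address)
print(latitude, longitude)
print(hours_by_day)
```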

Note: this file does not contain any of the intermediate steps. The explore_13lines_google_places.ipynb file shows the intermediate steps I used while constructing the program. The program itself is also present in clean_google_places.py.

In [None]:
from collections.abc import Iterable

def flatten(day_hours_list):
    """ Generator which flattens a list of lists recursively by yielding strings and ints/floats
    directly and recursively calling the generator func if it's an iterable """
    for el in day_hours_list:
        # collections.Iterable was removed in Python 3.10; collections.abc.Iterable is the supported name
        if isinstance(el, Iterable) and not isinstance(el, (str, bytes)):
            yield from flatten(el)
        else:
            yield el

def process_hours(opening_hours):
    """ Takes a Json array of opening_hours in the following form (None if not present), and
    returns a dictionary containing days as keys and opening hours as string values
    [['Monday', [['6:30 am--4:15 pm']]],
    ['Tuesday', [['6:30 am--4:15 pm']]],
    ['Wednesday', [['6:30 am--4:15 pm']], 1],
    ['Thursday', [['6:30 am--4:15 pm']]],
    ['Friday', [['6:30 am--4:15 pm']]],
    ['Saturday', [['6:30 am--4:15 pm']]],
    ['Sunday', [['6:30 am--4:15 pm']]]] """
    day_keys = ['MondayHours', 'TuesdayHours', 'WednesdayHours', 'ThursdayHours',
            'FridayHours', 'SaturdayHours', 'SundayHours']
    # Start with None for every day so each output record has the same 7 keys
    return_dict = {k: None for k in day_keys}
    if opening_hours is None:
        return return_dict
    for day_hours in opening_hours:
        # day_hours[0] is the day, [1] is the list of lists.
        # Only the first time range is kept; None if the list is empty.
        return_dict[day_hours[0] + 'Hours'] = next(flatten(day_hours[1]), None)

    return return_dict

countries = ['france', 'italy', 'spain', 'united kingdom', 'austria', 'netherlands']
with jsonlines.open('../Data/Cleaned/google_places_cleaned.jsonl', mode='w') as writer:
    with open('../Data/Original/places.original.json', 'r') as testfile:
        # fields: 'name', 'price', 'address', 'hours', 'phone', 'closed', 'gPlusPlaceId', 'gps'
        for line in tqdm(testfile):
            # Parse the Python-literal line into a dict
            normalised_dict = ast.literal_eval(line)
            # Treat a missing address as an empty list so the join cannot fail
            joined_address = ', '.join(normalised_dict.get('address') or [])
            if any(country in joined_address.lower() for country in countries):
                # Country found in the list, find which country it is.
                # GeoText module only detects the city/country if it is capitalised
                geo_address = GeoText(joined_address.title())
                try:
                    matching_country = geo_address.countries[0]
                except IndexError:
                    matching_country = [country for country in countries if country in joined_address.lower()][0].title()
                try:
                    #If it matches 2 cities for some reason, take the one closer to the end of the string.
                    matching_city = geo_address.cities[-1]
                except IndexError:
                    matching_city = None
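                # Step 4: store the address as a single string instead of an array
                normalised_dict['address'] = joined_address
                normalised_dict['country'] = matching_country
                normalised_dict['city'] = matching_city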
                normalised_dict['latitude'] = normalised_dict.get('gps')[0]
                normalised_dict['longitude'] = normalised_dict.get('gps')[1]
                opening_hours_dict = process_hours(normalised_dict.get('hours'))
                # Add the new keys (MondayHours, TuesdayHours, ...)
                normalised_dict.update(opening_hours_dict)
                del normalised_dict['closed']
                del normalised_dict['gps']
                del normalised_dict['hours']
                
                # Write to output jsonl file
                writer.write(normalised_dict)


The resulting file (google_places_cleaned.jsonl) has 464,906 records in jsonlines format.
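
One way to double-check a count like this, and to confirm that every line really is valid JSON, is a small helper along the following lines. This is only a sketch: `count_valid_json_lines` is a hypothetical helper that is not part of the original program, and it is run here on a tiny made-up file rather than on google_places_cleaned.jsonl.

```python
import json
import os
import tempfile

def count_valid_json_lines(path):
    """Count non-empty lines in a JSON Lines file, raising if a line is not valid JSON."""
    count = 0
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)  # raises json.JSONDecodeError on an invalid record
                count += 1
    return count

# Tiny made-up file standing in for the cleaned output
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    tmp.write('{"name": "A", "latitude": 1.0}\n{"name": "B", "latitude": 2.0}\n')
    sample_path = tmp.name

sample_count = count_valid_json_lines(sample_path)
os.remove(sample_path)
print(sample_count)
```

Pointing the helper at the real output path should reproduce the record count if the file was written completely.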