Planning a city trip based on the popularity of each sight (place), the number of days in the trip, and the distance between attractions.

The HTML source code is copied from https://www.google.com/travel/things-to-do/see-all after entering Portland and displaying all top sights.

Let's import the required libraries and open the saved Portland file. We will use regular expressions to extract the places, ratings, and number of reviews from the HTML.

In [1]:
# Import libraries
import pandas as pd
import re  # regular expression

from geopy.geocoders import Nominatim  # module to convert an address into latitude and longitude values

pd.set_option('display.max_rows', 100)

In [2]:
# Open and read the saved Portland page (HTML pasted into an RTF file)
with open('portland_attractions.rtf', 'r', encoding='utf-8') as f:
    tosee_html = f.read()

tosee_html

'{\\rtf1\\ansi\\ansicpg1252\\cocoartf1671\\cocoasubrtf600\n{\\fonttbl\\f0\\fswiss\\fcharset0 Helvetica;\\f1\\fnil\\fcharset134 HiraginoSansGB-W3;}\n{\\colortbl;\\red255\\green255\\blue255;\\red0\\green0\\blue0;\\red50\\green223\\blue19;\\red28\\green33\\blue255;\n\\red203\\green218\\blue32;\\red20\\green0\\blue255;\\red209\\green200\\blue24;\\red34\\green5\\blue208;\\red207\\green161\\blue39;\n\\red43\\green117\\blue255;\\red187\\green219\\blue27;}\n{\\*\\expandedcolortbl;;\\cssrgb\\c0\\c0\\c0;\\cssrgb\\c20000\\c87843\\c8627;\\cssrgb\\c15294\\c26275\\c100000;\n\\cssrgb\\c83137\\c87059\\c15686;\\cssrgb\\c11373\\c3529\\c100000;\\cssrgb\\c85490\\c81176\\c11373;\\cssrgb\\c18431\\c16863\\c85098;\\cssrgb\\c85098\\c68627\\c19608;\n\\cssrgb\\c21176\\c55294\\c100000;\\cssrgb\\c77647\\c87059\\c13333;}\n\\margl1440\\margr1440\\vieww26820\\viewh15060\\viewkind0\n\\deftab720\n\\pard\\pardeftab720\\partightenfactor0\n\n\\f0\\fs24 \\cf0 \\expnd0\\expndtw0\\kerning0\n<div class="kQb6Eb" role="list"><d

In [3]:
# Extract strings containing name of places and rating

patn_raw = re.compile(r'YmWhbc.+reviews')

raw_matches = patn_raw.finditer(tosee_html)
raw_list_1 = []
for m in raw_matches:
    raw_list_1.append(m.group(0))

print('There are total {} places.'.format(len(raw_list_1)))
raw_list_1 = [raw[len('YmWhbc'):] for raw in raw_list_1]  # drop the leading class marker

# Some places have no rating, so their match runs on into the following place.
# Where a second 'YmWhbc' marker is present, keep only the text from that marker onward.
raw_list = []
for raw in raw_list_1:
    m = re.search(r'YmWhbc.+reviews', raw)
    if m:
        raw_list.append(m.group())
    else:
        raw_list.append(raw)
        
print('There are total {} places with rating.'.format(len(raw_list)))
raw_list[:5]

There are total 93 places.
There are total 93 places with rating.


['"\\cf0 >\\cf6 Pittock Mansion\\cf0 </div></div><div class="tP34jb "><\\cf7 span class="ta47le"><span aria-label="4.6 stars from 4965 reviews',
 '\\cf0 ">\\cf10 Portland Japanese Garden\\cf0 </div></div><div class="tP34jb "><\\cf11 span class="ta47le"><span aria-label="4.5 stars from 4967 reviews',
 '">Lan Su Chinese Garden</div></div><div class="tP34jb "><span class="ta47le"><span aria-label="4.6 stars from 2488 reviews',
 '">OMSI</div></div><div class="tP34jb "><span class="ta47le"><span aria-label="4.5 stars from 6932 reviews',
 '">Oregon Zoo</div></div><div class="tP34jb "><span class="ta47le"><span aria-label="4.5 stars from 14743 reviews']
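As an alternative to the separate passes in the following cells, a single pattern with named groups can pull the name, rating, and review count out of each raw string. This is only a sketch: the two sample strings are copied from the output above, and the names `RTF_COLOUR`, `RECORD`, and `parse_record` are made up for illustration. It assumes every rated entry follows the "X stars from N reviews" wording seen in these samples.

```python
import re

# Two raw strings copied from the output above; the first still carries RTF colour codes.
samples = [
    r'"\cf0 >\cf6 Pittock Mansion\cf0 </div></div><div class="tP34jb "><\cf7 span class="ta47le"><span aria-label="4.6 stars from 4965 reviews',
    r'">OMSI</div></div><div class="tP34jb "><span class="ta47le"><span aria-label="4.5 stars from 6932 reviews',
]

# Hypothetical helper patterns (not part of the original notebook).
RTF_COLOUR = re.compile(r'\\cf\d+ ?')  # RTF colour switches such as '\cf0 '
RECORD = re.compile(
    r'>(?P<name>[^<>]+)</div>.*?'
    r'aria-label="(?P<rating>\d(?:\.\d)?) stars from (?P<reviews>\d+) reviews'
)

def parse_record(raw):
    """Return (name, rating, reviews), or None if the string has no rating."""
    cleaned = RTF_COLOUR.sub('', raw)
    m = RECORD.search(cleaned)
    if m is None:
        return None
    return m.group('name').strip(), float(m.group('rating')), int(m.group('reviews'))

parsed = [parse_record(s) for s in samples]
print(parsed)
```

Removing the RTF colour codes first means the name no longer needs trimming afterwards, and entries without a rating come back as None instead of a leftover raw string.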

In [23]:
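# Extract the names of the places and put them into a list
places = []
for raw in raw_list:
    # A name starts at the first capital letter and ends at the last
    # letter that is directly followed by '\' or '<'
    m = re.search(r'[A-Z].+[a-zI](\\|<)', raw)
    if m:
        places.append(m.group())
    else:
        places.append(str(raw)[0:100])  # fallback: keep the start of the raw string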
        
places[:9]

['Pittock Mansion\\',
 'Portland Japanese Garden\\',
 'Lan Su Chinese Garden<',
 'OMSI<',
 'Oregon Zoo<',
 'International Rose Test Garden<',
 'Portland Art Museum<',
 'Washington Park<',
 'Pioneer Courthouse Square<']

In [24]:
# In the place list, the item starting with 'Waterfront Renaissance' is malformed. Let's fix it.
tmp = [places.index(i) for i in places if 'Waterfront Renaissance' in i]
places[tmp[0]] = 'Waterfront Renaissance Trail Vancouver WA'

places = [i.strip('\\<') for i in places]
places = [i.strip('YmWhbc">') for i in places]
places[0:9]

['Pittock Mansion',
 'Portland Japanese Garden',
 'Lan Su Chinese Garden',
 'OMSI',
 'Oregon Zoo',
 'International Rose Test Garden',
 'Portland Art Museu',
 'ashington Park',
 'Pioneer Courthouse Square']

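After examining the place names, I found that many entries ending in 'Museum' are missing the final 'm' (for example, 'Portland Art Museu'), and names starting with 'W' are missing that letter (for example, 'ashington Park'). Both are side effects of strip('YmWhbc">'), which removes any of the listed characters from both ends of a string rather than the literal prefix. Let's fix these problems.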

In [25]:
# Fill in the missing 'm' from 'Museum'
tmp = [places.index(i) for i in places if re.search(r'Museu$', i)]
for i in tmp:
    places[i] = places[i] + 'm'
    

In [26]:
# Fill in the missing 'W' in the beginning
tmp = [places.index(i) for i in places if re.search(r'^[a-z]', i)]
for i in tmp:
    places[i] = 'W' + places[i]
places[:9]

['Pittock Mansion',
 'Portland Japanese Garden',
 'Lan Su Chinese Garden',
 'OMSI',
 'Oregon Zoo',
 'International Rose Test Garden',
 'Portland Art Museum',
 'Washington Park',
 'Pioneer Courthouse Square']
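One way to avoid needing the two patches above is to remove the marker as a literal prefix instead of passing it to strip. The sketch below is hypothetical (`MARKER` and `clean_name` are not in the original notebook) and assumes the marker, when present, always appears at the very start of the string.

```python
MARKER = 'YmWhbc">'  # literal prefix left over from the HTML class attribute

def clean_name(raw):
    """Remove the literal marker from the front and a trailing backslash or '<'."""
    if raw.startswith(MARKER):
        raw = raw[len(MARKER):]
    # rstrip with a character set is fine here: only these two
    # characters should ever be removed from the end of a name.
    return raw.rstrip('\\<')

# str.strip treats its argument as a set of characters, so it eats real letters:
print('Portland Art Museum'.strip(MARKER))
print('Washington Park'.strip(MARKER))

# The prefix-based version leaves the names intact:
print(clean_name('Portland Art Museum<'))
print(clean_name('YmWhbc">Washington Park<'))
print(clean_name('Pittock Mansion\\'))
```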

In [28]:
# A few remaining irregular names are fixed by hand here.
places[9] = 'Governor Tom McCall Waterfront'
places[10] = 'Hoyt Arboretum'
places[42] = 'The Freakybuttrue Peculiarium'
places[54] = 'Rice Northwest Museum of Rocks and Minerals'
places[74] = 'The Old Church'

for i, p in enumerate(places):
    print(i, p)

0 Pittock Mansion
1 Portland Japanese Garden
2 Lan Su Chinese Garden
3 OMSI
4 Oregon Zoo
5 International Rose Test Garden
6 Portland Art Museum
7 Washington Park
8 Pioneer Courthouse Square
9 Governor Tom McCall Waterfront
10 Hoyt Arboretum
11 The Grotto
12 Oaks Amusement Park
13 Portland Saturday Market
14 Forest Park
15 Shanghai Tunnels/Portland Underground Tour
16 St. Johns Bridge
17 Portland Children's Museum
18 Crystal Springs Rhododendron Garden
19 World Forestry Center
20 Fort Vancouver National Historic Site | Visitor Center
21 South Waterfront Lower Tram Terminal
22 Mill Ends Park
23 Oregon Historical Society
24 Fort Vancouver National Historic Site
25 Tilikum Crossing Bridge
26 Tryon Creek State Natural Area
27 Powell's City of Books
28 Portland Audubon
29 Mt Tabor Park
30 Powell Butte
31 Peninsula Park
32 Laurelhurst Park
33 Witch's Castle
34 Wildwood Trail
35 Tryon Creek
36 Eastbank Esplanade
37 Keller Fountain Park
38 Leach Botanical Garden
39 Council Crest Park
40 Oaks Bo

In [9]:
# Extract the rating from raw_list
rating = []
for rate in raw_list:
    # Match e.g. '4.6 stars' or '5 stars' and capture only the number
    m = re.search(r'(\d(?:\.\d)?)\s?stars', rate)
    rating.append(float(m.group(1)))
rating[:5]

[4.6, 4.5, 4.6, 4.5, 4.5]

In [10]:
# Extract the number of reviews from raw_list
reviews = []

for raw in raw_list:
    # Capture only the digits in front of 'reviews'
    m = re.search(r'(\d+)\sreviews', raw)
    reviews.append(int(m.group(1)))
reviews[:10]

[4965, 4967, 2488, 6932, 14743, 4753, 3964, 9618, 6062, 6210]

In [11]:
# Create dataframe for Portland attraction with rating and the number of reviews
tosee_df = pd.DataFrame(list(zip(places, rating, reviews)), columns = ['place', 'rating', 'reviews'])
tosee_df.describe()

         rating      reviews
count      93.0         93.0
mean   4.532258  1721.946237
std    0.292352  3336.139947
min         2.8          1.0
25%         4.4        173.0
50%         4.6        598.0
75%         4.7       1650.0
max         5.0      24920.0


In [12]:
tosee_df

   place                           rating  reviews
0  Pittock Mansion                    4.6     4965
1  Portland Japanese Garden           4.5     4967
2  Lan Su Chinese Garden              4.6     2488
3  OMSI                               4.5     6932
4  Oregon Zoo                         4.5    14743
5  International Rose Test Garden     4.7     4753
6  Portland Art Museum                4.7     3964
7  Washington Park                    4.7     9618
8  Pioneer Courthouse Square          4.4     6062
9  Governor Tom McCall Waterfront     4.5     6210


Let's examine the descriptive statistics. The number of reviews ranges widely, from 1 to 24920. The mean is about 1722 and the standard deviation is about 3336. The first quartile is 173, the median is 598, and the third quartile is 1650. A place with 173 or fewer reviews has far fewer than the median (598) or the mean (1722), so places in the first quartile are not very popular. As a result, we will remove them from the to-see list.
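Rather than typing the first-quartile value by hand, the cutoff can be read from the data. The following is a minimal sketch on a toy frame; the five review counts are illustrative stand-ins (chosen so the quartile lands on 173), not the real Portland data.

```python
import pandas as pd

# Toy review counts standing in for tosee_df['reviews'] (illustrative values only).
demo = pd.DataFrame({
    'place': ['A', 'B', 'C', 'D', 'E'],
    'reviews': [1, 173, 598, 1650, 24920],
})

cutoff = demo['reviews'].quantile(0.25)  # first quartile of the review counts
kept = demo[demo['reviews'] > cutoff].reset_index(drop=True)
print(cutoff, list(kept['place']))
```

With this approach the filter stays correct if the source page is scraped again and the review counts change.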

In [13]:
# Remove the places with 173 or fewer reviews (the first quartile).
tosee_df = tosee_df[tosee_df['reviews'] > 173]
tosee_df.reset_index(drop=True, inplace=True)
tosee_df

   place                           rating  reviews
0  Pittock Mansion                    4.6     4965
1  Portland Japanese Garden           4.5     4967
2  Lan Su Chinese Garden              4.6     2488
3  OMSI                               4.5     6932
4  Oregon Zoo                         4.5    14743
5  International Rose Test Garden     4.7     4753
6  Portland Art Museum                4.7     3964
7  Washington Park                    4.7     9618
8  Pioneer Courthouse Square          4.4     6062
9  Governor Tom McCall Waterfront     4.5     6210


In [14]:
tosee_df.shape

(69, 3)

In [16]:
# use geopy to find latitude & longitude
geolocator = Nominatim(user_agent='portland_agent')

for i, place in enumerate(tosee_df['place']):
    if 'Vancouver' in place:
        to_find = place + ', Vancouver, WA'
    else:
        to_find = place + ', Portland, Oregon'
    
    location = geolocator.geocode(to_find)
    if location:
        tosee_df.loc[i, 'latitude'] = location.latitude
        tosee_df.loc[i, 'longitude'] = location.longitude
    else:
        # Not found: store the query in both columns as a marker for the retry below
        tosee_df.loc[i, 'latitude'] = to_find
        tosee_df.loc[i, 'longitude'] = to_find

# A few places are close to Portland but are not located in Portland.
# For those, retry with only ', Oregon' appended to find the geographic coordinates.
for i, place in enumerate(tosee_df['place']):
    if tosee_df.loc[i, 'latitude'] == tosee_df.loc[i, 'longitude']:
        to_find = place + ', Oregon'
        location = geolocator.geocode(to_find)
        if location:
            tosee_df.loc[i, 'latitude'] = location.latitude
            tosee_df.loc[i, 'longitude'] = location.longitude

tosee_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value)


   place                           rating  reviews  latitude  longitude
0  Pittock Mansion                    4.6     4965   45.5252   -122.716
1  Portland Japanese Garden           4.5     4967   45.5187   -122.708
2  Lan Su Chinese Garden              4.6     2488   45.5257   -122.673
3  OMSI                               4.5     6932   45.5083   -122.666
4  Oregon Zoo                         4.5    14743   45.5098   -122.713
5  International Rose Test Garden     4.7     4753   45.5191   -122.705
6  Portland Art Museum                4.7     3964   45.5162   -122.684
7  Washington Park                    4.7     9618   45.5155   -122.706
8  Pioneer Courthouse Square          4.4     6062   45.5189   -122.679
9  Governor Tom McCall Waterfront     4.5     6210   45.5204    -122.67
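The warning printed above the table is pandas' SettingWithCopyWarning. It most likely appears because tosee_df was produced by a boolean filter and then had columns added through .loc. A common remedy, shown here as a sketch on a toy frame (the names `full` and `subset` are illustrative), is to take an explicit copy when filtering.

```python
import pandas as pd

# Toy frame standing in for the unfiltered tosee_df (illustrative values only).
full = pd.DataFrame({'place': ['A', 'B', 'C'], 'reviews': [10, 500, 900]})

# .copy() makes the filtered frame independent of `full`, so adding
# columns to it later is unambiguous.
subset = full[full['reviews'] > 173].copy().reset_index(drop=True)
subset.loc[0, 'latitude'] = 45.5  # new column; rows not yet filled hold NaN
print(subset)
```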


In [17]:
tosee_df.to_csv('portland_tosee_geocor', index=False)

In [18]:
# Review the places that failed to find latitude and longitude
tosee_df[tosee_df['latitude'] == tosee_df['longitude']]

    place                                          rating  reviews  latitude                                           longitude
15  Shanghai Tunnels/Portland Underground Tour        3.7      324  Shanghai Tunnels/Portland Underground Tour, Po...  Shanghai Tunnels/Portland Underground Tour, Po...
34  The Freakybuttrue Peculiarium                     4.6      452  The Freakybuttrue Peculiarium, Portland, Oregon    The Freakybuttrue Peculiarium, Portland, Oregon
37  Pearson Air Museum                                4.6      248  Pearson Air Museum, Portland, Oregon               Pearson Air Museum, Portland, Oregon
38  The Freakybuttrue Peculiarium and Museum          4.2     1008  The Freakybuttrue Peculiarium and Museum, Port...  The Freakybuttrue Peculiarium and Museum, Port...
45  Rice Northwest Museum of Rocks &amp; Minerals     4.8      832  Rice Northwest Museum of Rocks &amp; Minerals,...  Rice Northwest Museum of Rocks &amp; Minerals,...
46  Roloff Farms                                      4.6      682  Roloff Farms, Portland, Oregon                     Roloff Farms, Portland, Oregon
56  Bella Organic Far                                 4.0      261  Bella Organic Far, Portland, Oregon                Bella Organic Far, Portland, Oregon
58  The Old Church Concert Hall                       4.8      456  The Old Church Concert Hall, Portland, Oregon      The Old Church Concert Hall, Portland, Oregon
60  Elk Rock Garden                                   4.8      223  Elk Rock Garden, Portland, Oregon                  Elk Rock Garden, Portland, Oregon
68  Frenchman's Bar Regional Park                     4.6      312  Frenchman's Bar Regional Park, Portland, Oregon    Frenchman's Bar Regional Park, Portland, Oregon
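The two geocoding passes could be folded into one helper that tries several query suffixes and returns NaN on failure instead of storing the query string. The sketch below is hypothetical: `find_coordinates` is not from the notebook, and the stub geocoder with its placeholder coordinates exists only so the example runs without network access. It also decodes HTML entities such as '&amp;', which may be one reason the Rice Northwest Museum lookup failed.

```python
import html
import math

def find_coordinates(name, geocode, suffixes=(', Portland, Oregon', ', Oregon')):
    """Try each suffix in turn; return (lat, lon), or (nan, nan) if every query fails.

    `geocode` is any callable that takes a query string and returns either None
    or an object with .latitude and .longitude attributes.
    """
    query_name = html.unescape(name)  # '&amp;' -> '&'
    for suffix in suffixes:
        location = geocode(query_name + suffix)
        if location is not None:
            return location.latitude, location.longitude
    return math.nan, math.nan


# Stub geocoder for demonstration only; the coordinates are placeholders, not real data.
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude

def fake_geocode(query):
    known = {'Rice Northwest Museum of Rocks & Minerals, Oregon': FakeLocation(45.57, -122.95)}
    return known.get(query)

found = find_coordinates('Rice Northwest Museum of Rocks &amp; Minerals', fake_geocode)
missing = find_coordinates('Nowhere', fake_geocode)
print(found, missing)
```

With the real service, geolocator.geocode could be passed as `geocode`, ideally wrapped in geopy's RateLimiter (from geopy.extra.rate_limiter, with min_delay_seconds=1) to keep to one request per second. Failed lookups can then be selected with isna() rather than by comparing the two coordinate columns.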


More than half of the places have more than 500 reviews (the median of the full list is 598), so I'll drop any place that has fewer than 500 reviews and could not be geocoded.

In [19]:
# Remove the places whose latitude and longitude were not found and that have fewer than 500 reviews.
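The cell above contains only a comment in the source. A minimal sketch of the filter it describes might look like the following, assuming the convention from In [16] that a failed lookup stores the query string in both coordinate columns. The three rows are taken from the tables above, and the frame name `demo` is illustrative.

```python
import pandas as pd

# Three rows shaped like tosee_df after geocoding; a failed lookup holds
# the query string in both coordinate columns.
demo = pd.DataFrame({
    'place': ['Pittock Mansion',
              'Shanghai Tunnels/Portland Underground Tour',
              'The Freakybuttrue Peculiarium and Museum'],
    'reviews': [4965, 324, 1008],
    'latitude': [45.5252,
                 'Shanghai Tunnels/Portland Underground Tour, Portland, Oregon',
                 'The Freakybuttrue Peculiarium and Museum, Portland, Oregon'],
    'longitude': [-122.716,
                  'Shanghai Tunnels/Portland Underground Tour, Portland, Oregon',
                  'The Freakybuttrue Peculiarium and Museum, Portland, Oregon'],
})

not_found = demo['latitude'] == demo['longitude']  # True where geocoding failed
to_drop = not_found & (demo['reviews'] < 500)      # failed and fewer than 500 reviews
demo = demo[~to_drop].reset_index(drop=True)
print(list(demo['place']))
```

Places that failed geocoding but have at least 500 reviews stay in the frame, so their coordinates would still need to be filled in by hand.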



In [20]:
geolocator.geocode('Governor Tom McCall Waterfront, portland')

Location(Governor Tom McCall Waterfront Park, Chinatown, Old Town, Portland, Metro, Multnomah County, Oregon, United States of America, (45.52040275, -122.67038422190106, 0.0))

In [22]:
places

['Pittock Mansion',
 'Portland Japanese Garden',
 'Lan Su Chinese Garden',
 'OMSI',
 'Oregon Zoo',
 'International Rose Test Garden',
 'Portland Art Museum',
 'Washington Park',
 'Pioneer Courthouse Square',
 'Governor Tom McCall Waterfront',
 'Hoyt Arboretum',
 'The Grotto',
 'Oaks Amusement Park',
 'Portland Saturday Market',
 'Forest Park',
 'Shanghai Tunnels/Portland Underground Tour',
 'St. Johns Bridge',
 "Portland Children's Museum",
 'Crystal Springs Rhododendron Garden',
 'World Forestry Center',
 'Fort Vancouver National Historic Site | Visitor Center',
 'South Waterfront Lower Tram Terminal',
 'Mill Ends Park',
 'Oregon Historical Society',
 'Fort Vancouver National Historic Site',
 'Tilikum Crossing Bridge',
 'Tryon Creek State Natural Area',
 "Powell's City of Books",
 'Portland Audubon',
 'Mt Tabor Park',
 'Powell Butte',
 'Peninsula Park',
 'Laurelhurst Park',
 "Witch's Castle",
 'Wildwood Trail',
 'Tryon Creek',
 'Eastbank Esplanade',
 'Keller Fountain Park',
 'The Frea