# Data Collection Using Web Scraping 

## To solve this problem we will need the following data:

● List of neighborhoods in Pune.

● Latitude and longitude coordinates of those neighborhoods.

● Venue data for each neighborhood.

## Sources
● For the list of neighborhoods: Wikipedia category page
(https://en.wikipedia.org/wiki/Category:Neighbourhoods_in_Pune)

● For latitude and longitude coordinates: Python Geocoder package
(https://geocoder.readthedocs.io/)

● For Venue data: Foursquare API (https://foursquare.com/)


## Methods to extract data from Sources

To extract the data, we will use the Python packages Requests, BeautifulSoup, and Geocoder.

We will use Requests and BeautifulSoup to scrape the Wikipedia category page
(https://en.wikipedia.org/wiki/Category:Neighbourhoods_in_Pune) for the list of
neighborhoods in Pune, and Geocoder to get the latitude and longitude coordinates of
each neighborhood.

Then we will use Folium to plot these neighborhoods on the map.  

After that, we will use the Foursquare API to get venue data for those neighborhoods. The Foursquare API returns venues in many categories, but we are particularly interested in the Supermarket category, since that is the one relevant to the business problem.
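The Folium plotting step is not shown in this section. As a minimal sketch, assuming Folium is installed and each neighborhood is available as a `(name, latitude, longitude)` row, it could look like the following. The helper names `map_center` and `build_neighborhood_map`, the zoom level, and the marker styling are illustrative choices, not taken from the original notebook.

```python
def map_center(rows):
    """Return the mean [latitude, longitude] of (name, lat, lng) rows."""
    lats = [lat for _, lat, _ in rows]
    lngs = [lng for _, _, lng in rows]
    return [sum(lats) / len(lats), sum(lngs) / len(lngs)]


def build_neighborhood_map(rows, zoom_start=11):
    """Draw one circle marker per neighborhood on a Folium map."""
    import folium  # imported lazily so map_center works without Folium

    pune_map = folium.Map(location=map_center(rows), zoom_start=zoom_start)
    for name, lat, lng in rows:
        folium.CircleMarker(
            location=[lat, lng],
            radius=5,
            popup=name,
            color="blue",
            fill=True,
            fill_opacity=0.7,
        ).add_to(pune_map)
    return pune_map


# Two sample rows, using coordinates that appear in this notebook's output.
sample_rows = [
    ("Balewadi", 18.57602, 73.77983),
    ("Baner", 18.5482, 73.77316),
]
center = map_center(sample_rows)
```

In the notebook, the rows could be built with `zip(Pune_df["Neighborhood"], Pune_df["Latitude"], Pune_df["Longitude"])`, and the returned map object displays inline when it is the last expression in a cell.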

## Imports 

In [None]:

!pip install geocoder
!pip install requests

import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import geocoder # to get coordinates
import requests # library to handle requests
from bs4 import BeautifulSoup # library to parse HTML and XML documents

# json_normalize is a top-level pandas function since pandas 1.0
from pandas import json_normalize # transform JSON data into a pandas DataFrame

print("Libraries imported.")

## Collecting the neighborhood data using the Requests, BeautifulSoup, and Geocoder libraries

In [2]:

data = requests.get("https://en.wikipedia.org/wiki/Category:Neighbourhoods_in_Pune").text
# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(data, 'html.parser')
# create a list to store neighborhood data
neighborhood_List = []
# append the data into the list
for row in soup.find_all("div", class_="mw-category")[0].find_all("li"):
    neighborhood_List.append(row.text)

# create a new DataFrame from the list
Pune_df = pd.DataFrame({"Neighborhood": neighborhood_List})

Pune_df.tail()

|    | Neighborhood |
|----|--------------|
| 52 | Vimannagar |
| 53 | Vishrantwadi |
| 54 | Wakad |
| 55 | Warje |
| 56 | Yerawada |


In [4]:

# define a function to get coordinates
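def get_cord(neighborhood):
    """Return [latitude, longitude] for a neighborhood in Pune."""
    coords = None
    # retry until the geocoder returns coordinates
    # (this loops indefinitely if the address never resolves)
    while coords is None:
        g = geocoder.arcgis('{}, Pune, Maharashtra'.format(neighborhood))
        coords = g.latlng
    return coords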
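The loop above keeps calling the geocoder until it answers, so a name that never resolves would block the notebook. A bounded variant is sketched below as one possible alternative. The function name `get_coords_with_retry`, the `lookup` parameter, and the default of three attempts are assumptions made for this illustration; `lookup` stands in for the geocoder call so the retry logic can be exercised without network access.

```python
import time


def get_coords_with_retry(neighborhood, lookup, max_attempts=3, pause_seconds=0.0):
    """Return [lat, lng] for a neighborhood, or [None, None] if every attempt fails.

    `lookup` is any callable that takes a query string and returns
    [lat, lng] or None (hypothetical parameter for this sketch).
    """
    query = "{}, Pune, Maharashtra".format(neighborhood)
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(pause_seconds)  # wait before the next attempt
    # two None values keep the Latitude/Longitude column layout intact
    return [None, None]


# Stand-in lookup that fails once, then succeeds, to show the retry path.
attempts = []

def flaky_lookup(query):
    attempts.append(query)
    return [18.5482, 73.77316] if len(attempts) > 1 else None

result = get_coords_with_retry("Baner", flaky_lookup)
```

In the notebook, `lookup` could be `lambda q: geocoder.arcgis(q).latlng`, which is the same call used in `get_cord`.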

In [5]:

# create a list and store the coordinates
coords = [get_cord(neighborhood) for neighborhood in Pune_df["Neighborhood"]]

In [6]:
coords[:10]

[[18.516483671884753, 73.85387026191101],
 [18.563450000000046, 73.81227000000007],
 [18.576020000000028, 73.77983000000006],
 [18.548200000000065, 73.77316000000008],
 [18.50747000000007, 73.78236000000004],
 [18.509030000000052, 73.87317000000007],
 [18.579220000000078, 73.74352000000005],
 [18.516890000000046, 73.85617000000008],
 [18.51244931570263, 73.85657158825195],
 [18.515850000000057, 73.84061000000008]]

In [7]:
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

In [8]:

# merge the coordinates into the original dataframe
Pune_df['Latitude'] = df_coords['Latitude']
Pune_df['Longitude'] = df_coords['Longitude']

# check the neighborhoods and the coordinates
print(Pune_df.shape)
Pune_df.head(10)

(55, 3)


|   | Neighborhood | Latitude | Longitude |
|---|--------------|----------|-----------|
| 0 | Appa Balwant Chowk | 18.516484 | 73.85387 |
| 1 | Aundh, Pune | 18.56345 | 73.81227 |
| 2 | Balewadi | 18.57602 | 73.77983 |
| 3 | Baner | 18.5482 | 73.77316 |
| 4 | Bavdhan | 18.50747 | 73.78236 |
| 5 | Bhavani Peth, Pune | 18.50903 | 73.87317 |
| 6 | Blue Ridge Town Pune | 18.57922 | 73.74352 |
| 7 | Budhwar Peth, Pune | 18.51689 | 73.85617 |
| 8 | Chakan, Pune | 18.512449 | 73.856572 |
| 9 | Deccan Gymkhana | 18.51585 | 73.84061 |
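Several scraped names, such as "Aundh, Pune", already end in the city name, so the query built in `get_cord` becomes "Aundh, Pune, Pune, Maharashtra". Whether this changes the geocoder's answer has not been checked here. As a hedged sketch, the duplicate suffix could be dropped before the query is built; the helper name `geocoding_query` is hypothetical.

```python
def geocoding_query(neighborhood, city="Pune", state="Maharashtra"):
    """Build a geocoder query string without repeating the city name."""
    suffix = ", " + city
    if neighborhood.endswith(suffix):
        # drop the trailing ", Pune" that some scraped names carry
        neighborhood = neighborhood[: -len(suffix)]
    return "{}, {}, {}".format(neighborhood.strip(), city, state)


queries = [geocoding_query(name) for name in ["Aundh, Pune", "Baner"]]
```

Names that contain the city without a comma, such as "Blue Ridge Town Pune", are left unchanged by this sketch.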


In [9]:
# save the DataFrame as CSV file
Pune_df.to_csv("Pune_df.csv", index=False)

## Collecting the neighborhood venue data using the Foursquare API

In [10]:

# define Foursquare Credentials and Version
CLIENT_ID = '5HUDVH14DMECWUAFI2MICONBTTDPW1CCL1C4TFGE3FEHEUHJ' # your Foursquare ID
CLIENT_SECRET = 'R0WIH5UIW2SADKBUW4B4WMY2QWBBT0Q02IURAXQXVJZMTDIV' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 5HUDVH14DMECWUAFI2MICONBTTDPW1CCL1C4TFGE3FEHEUHJ
CLIENT_SECRET: R0WIH5UIW2SADKBUW4B4WMY2QWBBT0Q02IURAXQXVJZMTDIV
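Hard-coding and printing the credentials makes them part of any shared copy of the notebook. One common alternative, sketched here under the assumption that the keys are exported as environment variables, is to read them at run time. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are made up for this example.

```python
import os


def load_foursquare_credentials(env=None):
    """Read the Foursquare ID and secret from environment variables.

    The variable names used here are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Foursquare credentials are not set in the environment")
    return client_id, client_secret


# Demonstration with a plain dict standing in for os.environ.
demo_env = {
    "FOURSQUARE_CLIENT_ID": "my-id",
    "FOURSQUARE_CLIENT_SECRET": "my-secret",
}
CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials(demo_env)
```

Calling `load_foursquare_credentials()` with no argument reads from the real environment and raises a `RuntimeError` if either value is missing.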


In [11]:

radius = 3000
LIMIT = 150

venues = []

for lat, long, neighborhood in zip(Pune_df['Latitude'], Pune_df['Longitude'], Pune_df['Neighborhood']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
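The loop above assumes that every response contains `response.groups[0].items` and that every venue has at least one category. A more defensive version of the same two steps (building the URL, then flattening the items) is sketched below using only the standard library. The helper names `build_explore_url` and `extract_venue_rows` are illustrative, and the second venue in the sample payload is invented to show the empty-category case.

```python
from urllib.parse import urlencode

# Same endpoint as in the request loop above.
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the explore URL with properly encoded query parameters."""
    query = urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    })
    return EXPLORE_ENDPOINT + "?" + query


def extract_venue_rows(payload, neighborhood, lat, lng):
    """Flatten one explore response into tuples; tolerate missing parts."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # None marks a venue that has no category
        category = categories[0]["name"] if categories else None
        rows.append((neighborhood, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


url = build_explore_url("id", "secret", "20180605", 18.516484, 73.85387, 3000, 150)

# Sample payload: the first venue mirrors a row from this notebook's output,
# the second is a made-up venue without categories.
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Sujata Mastani",
               "location": {"lat": 18.511793, "lng": 73.852145},
               "categories": [{"name": "Ice Cream Shop"}]}},
    {"venue": {"name": "Example Venue",
               "location": {"lat": 18.51, "lng": 73.85},
               "categories": []}},
]}]}}
rows = extract_venue_rows(sample_payload, "Appa Balwant Chowk", 18.516484, 73.85387)
```

In the notebook, `requests.get(url).json()` would supply the payload, and the returned rows could be added to `venues` with `venues.extend(rows)`.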


In [12]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)

(4313, 7)


In [13]:
venues_df.head()

|   | Neighborhood | Latitude | Longitude | VenueName | VenueLatitude | VenueLongitude | VenueCategory |
|---|--------------|----------|-----------|-----------|---------------|----------------|---------------|
| 0 | Appa Balwant Chowk | 18.516484 | 73.85387 | Sujata Mastani | 18.511793 | 73.852145 | Ice Cream Shop |
| 1 | Appa Balwant Chowk | 18.516484 | 73.85387 | Lal Mahal | 18.51872 | 73.856556 | Historic Site |
| 2 | Appa Balwant Chowk | 18.516484 | 73.85387 | Bhagat Tarachand | 18.514332 | 73.851317 | Indian Restaurant |
| 3 | Appa Balwant Chowk | 18.516484 | 73.85387 | Raja Dinkar Kelkar museum | 18.510744 | 73.854389 | History Museum |
| 4 | Appa Balwant Chowk | 18.516484 | 73.85387 | Hotel Madhuban | 18.519248 | 73.848688 | Tea Room |


In [14]:
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())))


There are 142 unique categories.


In [15]:
# print out the list of categories
venues_df['VenueCategory'].unique()

array(['Ice Cream Shop', 'Historic Site', 'Indian Restaurant',
       'History Museum', 'Tea Room', 'Donut Shop', 'Café', 'Juice Bar',
       'BBQ Joint', 'South Indian Restaurant', 'Snack Place', 'Bistro',
       'Vegetarian / Vegan Restaurant', 'Fast Food Restaurant',
       'Sandwich Place', 'Theater', 'Maharashtrian Restaurant', 'Stadium',
       'Trail', 'Dessert Shop', 'Coffee Shop', 'Seafood Restaurant',
       'Italian Restaurant', 'Food Truck', 'Bakery', 'Frozen Yogurt Shop',
       'Gym', 'Bar', 'Supermarket', 'Gym / Fitness Center', 'Hotel',
       'Steakhouse', 'Burger Joint', 'Plaza', 'Deli / Bodega',
       'Restaurant', 'Sports Bar', 'Lounge', 'General Entertainment',
       'Sporting Goods Shop', 'Hotel Bar', 'Theme Park',
       'Chinese Restaurant', 'Hookah Bar', 'Gastropub', 'Smoke Shop',
       'Pizza Place', 'Shopping Mall', 'Bookstore', 'English Restaurant',
       'Mexican Restaurant', 'Chocolate Shop', 'Multiplex',
       'Korean Restaurant', 'Jewelry Store', 'C

In [16]:
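# save the venues DataFrame as CSV file, without the index column (as for Pune_df above)
venues_df.to_csv("venues_df.csv", index=False)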
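Since the Supermarket category is the one of interest for the business problem, a natural next step is to count supermarkets per neighborhood. The sketch below does this on plain tuples in the same column order as `venues_df`, so it runs without pandas. The helper name `count_category_by_neighborhood` is hypothetical, and the two supermarket rows are made-up values for illustration only.

```python
from collections import Counter


def count_category_by_neighborhood(venue_rows, category="Supermarket"):
    """Count venues of one category per neighborhood.

    Each row is (neighborhood, lat, lng, venue_name, venue_lat, venue_lng, category).
    """
    return Counter(row[0] for row in venue_rows if row[6] == category)


sample_venues = [
    # row taken from the venues_df output in this notebook
    ("Appa Balwant Chowk", 18.516484, 73.85387,
     "Sujata Mastani", 18.511793, 73.852145, "Ice Cream Shop"),
    # made-up supermarket rows for illustration
    ("Baner", 18.5482, 73.77316, "Example Mart A", 18.55, 73.77, "Supermarket"),
    ("Baner", 18.5482, 73.77316, "Example Mart B", 18.54, 73.78, "Supermarket"),
]
supermarket_counts = count_category_by_neighborhood(sample_venues)
```

With the DataFrame itself, the equivalent would be `venues_df[venues_df["VenueCategory"] == "Supermarket"].groupby("Neighborhood").size()`.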