# City Analysis: Madrid

## Salons vs Typical Neighborhood Venues

### Table of Contents
1. [Background](#Background)
2. [Description](#Description)
3. [Data Capture](#Data)

***

## Background
> A description of the problem and a discussion of the background. (15 marks):

> A stylist who is also an entrepreneur is planning to open a salon in the busy city of Madrid. She has interested investors and needs to come up with a viable business plan. She is in the early stages of planning and is trying to identify the best neighborhood. She needs information about each neighborhood in Madrid in order to decide where to open the salon. She thinks that a new salon would do best in neighborhoods with certain types of venues and characteristics, but she is not sure what those types of venues should be. She wants a comprehensive list of all of the neighborhoods in Madrid, beginning with a list of where established salons currently exist. She then wants to see what the characteristics of those neighborhoods are.

> To gain the knowledge needed to support her decision-making process, the stylist enlists the aid of a data scientist. She wants the data scientist to provide her with information that will help her make the best, most informed decision. This includes maps, charts, and other systematically generated information that she can visually compare and contrast. The data scientist knows that this will require exploratory data analysis of venues within each neighborhood in Madrid.


## Description
> A description of the data and how it will be used to solve the problem. (15 marks)

> Several types of data are needed to produce the exploratory data analysis that the stylist needs. These include a list of all of the neighborhoods in Madrid and geographical data to cross-reference against that list. Together they give us a basis for extracting information about venues and places. We will then generate an exploratory data product that can be used to make an informed business decision.

> For the list of neighborhoods in Madrid, I will use publicly available information gathered from the internet. I plan to rely primarily on Wikipedia for a general list of neighborhoods in Madrid. I will also use any other listings that I can find using search engines such as Google, Bing, and DuckDuckGo. I will scrape the neighborhood information using Python's BeautifulSoup library and store the data in a pandas DataFrame.
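
As a rough illustration of the scraping step, the sketch below parses an HTML table with BeautifulSoup and loads it into a pandas DataFrame. The HTML snippet, the column names, and the `parse_neighborhood_table` helper are hypothetical stand-ins; the layout of the real page would need to be inspected before adapting it.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a scraped listing page.
SAMPLE_HTML = """
<table>
  <tr><th>District</th><th>Neighborhood</th></tr>
  <tr><td>Centro</td><td>Sol</td></tr>
  <tr><td>Centro</td><td>Palacio</td></tr>
  <tr><td>Retiro</td><td>Ibiza</td></tr>
</table>
"""

def parse_neighborhood_table(html):
    """Turn the first HTML table in `html` into a DataFrame (header row = column names)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table").find_all("tr")
    header = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]
    records = [
        [cell.get_text(strip=True) for cell in row.find_all("td")]
        for row in rows[1:]
    ]
    return pd.DataFrame(records, columns=header)

neighborhoods = parse_neighborhood_table(SAMPLE_HTML)
print(neighborhoods)
```

Keeping the parsing in a function that takes raw HTML means the same code can run on a saved copy of the page, which avoids repeated requests while the parsing logic is being tuned.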

> For the geographical data, I will use Google's Geocoding API to gather geographical coordinates for each neighborhood. I will do this by extracting the neighborhood names from the previously created pandas DataFrame and passing those names to Google's API. I will capture the geolocation data that Google returns and add it to the existing DataFrame.
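
The sketch below shows one way the coordinates could be pulled out of a geocoding response and attached to the DataFrame. It assumes the response follows the Geocoding API JSON layout (a `status` field and `results[0].geometry.location` with `lat` and `lng`). The sample payloads, their placeholder coordinates, and the `extract_lat_lng` helper are made up for illustration, and no request is sent.

```python
import json
import pandas as pd

def extract_lat_lng(json_text):
    """Return (lat, lng) from a geocoding response, or (None, None) if it has no result."""
    data = json.loads(json_text)
    if data.get("status") != "OK" or not data.get("results"):
        return (None, None)
    location = data["results"][0]["geometry"]["location"]
    return (location["lat"], location["lng"])

# Made-up responses keyed by neighborhood name (placeholder coordinates).
sample_responses = {
    "Sol": json.dumps({
        "status": "OK",
        "results": [{"geometry": {"location": {"lat": 40.0, "lng": -3.0}}}],
    }),
    "Nowhere": json.dumps({"status": "ZERO_RESULTS", "results": []}),
}

df = pd.DataFrame({"Neighborhood": ["Sol", "Nowhere"]})
coords = [extract_lat_lng(sample_responses[name]) for name in df["Neighborhood"]]
df["Latitude"] = [lat for lat, _ in coords]
df["Longitude"] = [lng for _, lng in coords]
print(df)
```

Returning `(None, None)` for a failed lookup keeps the row in the DataFrame, so neighborhoods that could not be geocoded are easy to spot and retry.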

> I will use the combined neighborhood and geographical DataFrame to query publicly available places and venues using Python. I will do this primarily by querying Foursquare's API for its freely available social location data. The Foursquare data includes venue categories, which will be used to analyze each neighborhood. If possible, I will also add information from Google's Places API, as well as any publicly available venue information I can find using free search engines. The data will be stored in a pandas DataFrame for analysis.
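
The following sketch flattens a list of venue records into rows suitable for a DataFrame. It assumes each record wraps a `venue` object with a name, a category list, and a location, which is how the course lab handles Foursquare responses; the payload here is hand-written, and `flatten_venues` is a hypothetical helper. The live response layout should be checked against Foursquare's documentation.

```python
import pandas as pd

def flatten_venues(neighborhood, items):
    """Build one row per venue: neighborhood, venue name, first category, coordinates."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        rows.append({
            "Neighborhood": neighborhood,
            "Venue": venue["name"],
            "Venue Category": categories[0]["name"] if categories else None,
            "Venue Latitude": venue["location"]["lat"],
            "Venue Longitude": venue["location"]["lng"],
        })
    return rows

# Hand-written items (assumed shape, placeholder values).
sample_items = [
    {"venue": {"name": "Venue A",
               "categories": [{"name": "Salon / Barbershop"}],
               "location": {"lat": 40.0, "lng": -3.0}}},
    {"venue": {"name": "Venue B",
               "categories": [],
               "location": {"lat": 40.1, "lng": -3.1}}},
]

venues = pd.DataFrame(flatten_venues("Sol", sample_items))
print(venues)
```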

> Once all of the venue information is gathered, I will generate the exploratory data analysis. I will use clustering techniques to group the venue data by neighborhood. This grouping will be used to rank neighborhoods by the venue categories found in each one, which will provide a general character profile for each neighborhood. For instance, if the most common venue is a type of restaurant, then the neighborhood is characterized as a restaurant neighborhood. Once each neighborhood is categorized by its prevalent venue type, I will rank neighborhoods based on the number of salons using the same clustering techniques. I will also create a choropleth map based on the total number of salons found in each neighborhood (the darker the area, the more salons exist).
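
A minimal sketch of this analysis on toy data is given below. It one-hot encodes venue categories, averages them per neighborhood to get a frequency profile, picks the most frequent category, clusters the profiles with k-means, and counts salons. The neighborhood names, the venue rows, and the choice of two clusters are all made up for illustration and are not results.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy venue table (made-up rows), one venue per row.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "Venue Category": ["Restaurant", "Restaurant", "Salon",
                       "Restaurant", "Restaurant", "Salon",
                       "Park", "Park", "Park"],
})

# One-hot encode the categories, then average per neighborhood to get frequencies.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
profile = onehot.groupby("Neighborhood").mean()

# The most frequent category gives each neighborhood its general character.
top_category = profile.idxmax(axis=1)

# Cluster neighborhoods by their category-frequency profiles.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profile.values)
profile_labels = pd.Series(kmeans.labels_, index=profile.index, name="Cluster")

# Count salons per neighborhood; these counts could also feed a choropleth map.
salon_counts = (
    venues[venues["Venue Category"] == "Salon"]
    .groupby("Neighborhood").size()
    .reindex(profile.index, fill_value=0)
)

print(pd.concat([top_category.rename("Top Category"), profile_labels,
                 salon_counts.rename("Salons")], axis=1))
```

In this toy table, neighborhoods A and B have identical profiles and so land in the same cluster, while C stands apart. With real data the number of clusters would need to be chosen by inspection.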

***

## Data


In [None]:
##import necessary modules
import pandas as pd
from bs4 import BeautifulSoup
import requests
import json
import numpy as np
import os #import for file handling
import folium

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

In [None]:
#get keys for APIs
with open('gk','r') as gk:
    gkey=gk.readline().strip() #strip the trailing newline so the key is safe to place in a URL
    
with open('fk','r') as fk:
    lines=fk.readlines()
    CLIENT_ID=lines[0].strip()
    CLIENT_SECRET=lines[1].strip()
    VERSION = '20200517' # Foursquare API version

In [None]:
#define handy functions
def getgeodata(postal_code,key):
    #generate url (the lab version hard-coded Toronto, Canada; this analysis targets Madrid)
    apiurl='https://maps.googleapis.com/maps/api/geocode/json?address={},+Madrid,+Spain&key={}'.format(postal_code,key)
 
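    #reuse the cached response if this lookup has already been made
    fname='./geodata/{}.json'.format(postal_code)
    if os.path.exists(fname):
        with open(fname,'r') as f:
            return f.read()

    #otherwise query the API first, then cache the response text
    response=requests.get(apiurl)
    jsondata=response.text
    os.makedirs('./geodata',exist_ok=True) #make sure the cache folder exists
    with open(fname,'w') as f:
        f.write(jsondata)
    return jsondata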

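#borrowed from the DP0701EN-3-3-2 lab.  function to return the most common venues
def return_most_common_venues(row, num_top_venues):
    #skip the first column (assumed to hold the neighborhood name) and keep the category frequencies
    row_categories = row.iloc[1:]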
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]