# Capstone Project

## Introduction

I lived in Austin, Texas for four years and saw firsthand how the number and variety of businesses in the city grew. However, Austin is a small city, which can make it difficult to place new businesses in already crowded locations like downtown. Crowding is not the only concern: a single area can also be saturated with similar types of businesses.

In this report I will cluster similar venues around the different zip codes in Austin. New business owners should then be able to use this report to strategically place their new business in Austin. With the help of <b>Foursquare</b>, we should be able to see the number of businesses in an area, the types of businesses, and the most common venues in an area.

<i>This report should not be used as the sole basis for deciding where to place a new business in Austin.</i>

## Data

The data will be similar to the data used in the IBM course, which consists of: Zip Code, Longitude, Latitude, Venue, and Venue Category. 

I will collect the Austin zip codes from https://www.zip-codes.com/m/city/tx-austin.asp and use a helper function to collect the latitude and longitude of each zip code from the same website.

<b>Foursquare</b> will then provide me with the venue data around the zip codes. 
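
As a rough sketch of what that request could look like, the snippet below builds a query URL for the legacy Foursquare v2 `venues/explore` endpoint and flattens a decoded response into rows. The credentials, version date, radius, limit, and the helper names `build_explore_url` and `parse_venues` are placeholders and assumptions. The response layout is the one I understand the v2 endpoint to return; the current Foursquare API may differ.

```python
from urllib.parse import urlencode

# Hypothetical placeholders: replace with your own Foursquare credentials.
CLIENT_ID = "YOUR_CLIENT_ID"
CLIENT_SECRET = "YOUR_CLIENT_SECRET"
VERSION = "20180605"  # API version date; assumed value


def build_explore_url(lat, lng, radius=500, limit=100):
    """Build a v2 venues/explore URL for one coordinate pair."""
    params = {
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "v": VERSION,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


def parse_venues(zip_code, results):
    """Flatten an explore response (already decoded from JSON) into rows."""
    # Assumed layout: response -> groups[0] -> items -> venue
    items = results["response"]["groups"][0]["items"]
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        rows.append({
            "ZipCode": zip_code,
            "Venue": venue["name"],
            # Some venues may carry no category at all
            "Venue Category": categories[0]["name"] if categories else None,
        })
    return rows
```

With the `requests` library already used in this notebook, the decoded response could be obtained with `requests.get(build_explore_url(lat, lng)).json()` and passed to `parse_venues`.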

The latitude and longitude data will be used by the <b>Folium</b> library to create a map and its markers.
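
A minimal sketch of that mapping step, assuming the coordinates arrive as (zip code, latitude, longitude) tuples. The helper names `map_center` and `build_map`, the zoom level, and the sample coordinates are illustrative assumptions; `folium.Map` and `folium.Marker` are the only Folium calls used.

```python
def map_center(points):
    """Average the latitudes and longitudes to get a rough map center."""
    lats = [lat for _, lat, _ in points]
    lngs = [lng for _, _, lng in points]
    return sum(lats) / len(lats), sum(lngs) / len(lngs)


def build_map(points, zoom_start=11):
    """Return a folium.Map with one marker per (zip_code, lat, lng) tuple."""
    import folium  # imported here so map_center works without Folium installed

    center_lat, center_lng = map_center(points)
    austin_map = folium.Map(location=[center_lat, center_lng], zoom_start=zoom_start)
    for zip_code, lat, lng in points:
        # The popup shows the zip code when the marker is clicked
        folium.Marker(location=[lat, lng], popup=str(zip_code)).add_to(austin_map)
    return austin_map


# Illustrative coordinates only, not values scraped from the website.
sample_points = [("78701", 30.27, -97.74), ("78702", 30.26, -97.72)]
```

In a notebook, evaluating `build_map(sample_points)` as the last expression of a cell should render the map inline; calling `.save("austin_map.html")` on the result writes it to a file instead.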

The goal is a dataframe like the one below:

|Index|ZipCode|Latitude|Longitude|1st Common Venue| 2nd Common Venue| ... | 10th Common Venue|
|-----|-------|--------|---------|----------------|-----------------|-----|------------------|
|0    |NA     |NA      |NA       |NA              |NA               | ... |NA                |
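
One way to fill such a table is to count venue categories per zip code and keep the most frequent ones. The sketch below uses only the standard library and a handful of made-up rows; the function names `ordinal` and `top_venues_by_zip` and the sample data are assumptions, and the report itself may well build the table with pandas instead.

```python
from collections import Counter, defaultdict


def ordinal(n):
    """Return '1st' for 1, '2nd' for 2, '3rd' for 3, '4th' for 4, and so on."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def top_venues_by_zip(rows, top_n=10):
    """Rank venue categories per zip code.

    rows is an iterable of (zip_code, venue_category) pairs. The result is
    one dict per zip code holding its top_n most frequent categories, with
    missing ranks filled with None.
    """
    counts = defaultdict(Counter)
    for zip_code, category in rows:
        counts[zip_code][category] += 1

    table = []
    for zip_code, counter in counts.items():
        ranked = [category for category, _ in counter.most_common(top_n)]
        ranked += [None] * (top_n - len(ranked))
        record = {"ZipCode": zip_code}
        for rank, category in enumerate(ranked, start=1):
            record[f"{ordinal(rank)} Common Venue"] = category
        table.append(record)
    return table


# Made-up rows for illustration only.
sample_rows = [
    ("78701", "Bar"), ("78701", "Bar"), ("78701", "Coffee Shop"),
    ("78702", "Taco Place"),
]
table = top_venues_by_zip(sample_rows, top_n=3)
```

The resulting list of dicts can be passed to `pd.DataFrame(table)` and then combined with the latitude and longitude columns.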




[Done] Introduction where you discuss the business problem and who would be interested in this project.

[Done] Data where you describe the data that will be used to solve the problem and the source of the data.

Methodology section which represents the main component of the report where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, and which machine learning techniques were used and why.

Results section where you discuss the results.

Discussion section where you discuss any observations you noted and any recommendations you can make based on the results.

Conclusion section where you conclude the report.

## Code



In [22]:
# Importing all necessary libraries from the "Segmenting and Clustering Neighborhoods in New York City" lab
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Importing the Web Scraping libraries
# Install these into your system/environment if not downloaded
import requests
import lxml.html as lh

def get_content(url):
    #Create a handle, page, to handle the contents of the website
    page = requests.get(url)
    #Store the contents of the website under doc
    doc = lh.fromstring(page.content)
    return doc

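def get_coord(zip_code):
    # Build the URL of the detail page for a single zip code
    zip_code = str(zip_code)
    url = 'https://www.zip-codes.com/m/zip-code/' + zip_code + '/zip-code-' + zip_code + '.asp'
    zipDoc = get_content(url)
    tr_elements2 = zipDoc.xpath('//tr')
    # Rows 11 and 12 are expected to hold the latitude and longitude;
    # the slices skip the 9- and 10-character labels (presumably 'Latitude:' and 'Longitude:')
    lat = tr_elements2[11].text_content()[9:]
    long = tr_elements2[12].text_content()[10:]
    # Return both values so the caller can use them
    return lat, long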

# The URL of the page that lists the Austin zip codes
url = 'https://www.zip-codes.com/m/city/tx-austin.asp'
doc = get_content(url)

#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

#Create empty list of (column name, values) pairs
col=[]

# Getting column names from the header row
for t in tr_elements[0]:
    name=t.text_content()
    col.append((name,[]))

# Getting the values for each columns
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If row is not of size 4, the //tr data is not from our table 
    if len(T)!=4:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Only columns after the first are converted; the first column stays a string
        if i>0:
            #Convert any numerical value to integers
            try:
                data=int(data)
            except ValueError:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1

Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
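
The fixed-offset slices in `get_coord` depend on the exact length of the label text in each row. As a hedged alternative, the number can be pulled out with a regular expression. `parse_coord` is a hypothetical helper, and the cell text it expects (a label followed by a decimal number) is an assumption about the page, not something copied from it.

```python
import re

# Matches an optionally signed decimal number such as 30.2672 or -97.7431
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_coord(cell_text):
    """Return the first number found in a table cell's text as a float."""
    match = _NUMBER.search(cell_text)
    if match is None:
        raise ValueError(f"no coordinate found in {cell_text!r}")
    return float(match.group())
```

Inside `get_coord`, the two slices could then be swapped for `parse_coord(tr_elements2[11].text_content())` and `parse_coord(tr_elements2[12].text_content())`, which would also return floats rather than strings.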