# Coursera Capstone Project

### Anand Konji

Now that you have been equipped with the skills and the tools to use location data to explore a geographical location, over the course of two weeks, you will have the opportunity to be as creative as you want and come up with an idea to leverage the Foursquare location data to explore or compare neighborhoods or cities of your choice or to come up with a problem that you can use the Foursquare location data to solve. If you cannot think of an idea or a problem, here are some ideas to get you started:

1. In Module 3, we explored New York City and the city of Toronto and segmented and clustered their neighborhoods. Both cities are very diverse and are the financial capitals of their respective countries. One interesting idea would be to compare the neighborhoods of the two cities and determine how similar or dissimilar they are. Is New York City more like Toronto or Paris or some other multicultural city? I will leave it to you to refine this idea.
2. In a city of your choice, if someone is looking to open a restaurant, where would you recommend that they open it? Similarly, if a contractor is trying to start their own business, where would you recommend that they set up their office?

These are just a couple of the many ideas and problems that can be solved using location data in addition to other datasets. No matter what you decide to do, make sure to provide sufficient justification of why you think what you want to do or solve is important and why a client or a group of people would be interested in your project.

### Review criteria

This capstone project will be graded by your peers. It is worth 70% of your total grade and will be completed over the course of 2 weeks. Week 1 submissions will be worth 30%, whereas week 2 submissions will be worth 40% of your total grade.

For this week, you will be required to submit the following:

1. A description of the problem and a discussion of the background. (15 marks)
2. A description of the data and how it will be used to solve the problem. (15 marks)

For the second week, the final deliverables of the project will be:

1. A link to your Notebook on your Github repository, showing your code. (15 marks)
2. A full report consisting of all of the following components (15 marks):
    * Introduction, where you discuss the business problem and who would be interested in this project.
    * Data, where you describe the data that will be used to solve the problem and the source of the data.
    * Methodology, which is the main component of the report, where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, and which machine learning techniques were used and why.
    * Results, where you discuss the results.
    * Discussion, where you discuss any observations you noted and any recommendations you can make based on the results.
    * Conclusion, where you conclude the report.
3. Your choice of a presentation or blogpost. (10 marks)

------------------------

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

--------  
## The Battle of Neighborhoods (Week 1)


## 1.	Introduction: Business Problem <a name="introduction"></a>

In this project we will try to find an optimal location for a restaurant. Specifically, this report is targeted at stakeholders interested in opening an **Italian restaurant** in **Bangalore**, India.

Since there are lots of restaurants in Bangalore, we will try to detect **locations that are not already crowded with restaurants**. We are also particularly interested in **areas with no Italian restaurants in the vicinity**. We would also prefer locations **as close to the city center as possible**, assuming that the first two conditions are met.

We will use our data science powers to generate a few of the most promising neighborhoods based on these criteria. The advantages of each area will then be clearly expressed so that the best possible final location can be chosen by the stakeholders.

## 2.	Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* number of existing restaurants in the neighborhood (any type of restaurant)
* number of and distance to Italian restaurants in the neighborhood, if any
* distance of neighborhood from city center
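
The distance factors listed above need some way of measuring how far apart two latitude/longitude points are. A minimal sketch using the haversine formula is shown below; the function name `haversine_km`, the Earth radius constant and the two example points are assumptions made for illustration, not part of the original notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant, adequate at city scale


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Illustrative only: an assumed city center and one candidate location
center = (12.931355, 77.633979)
candidate = (12.979185, 77.606623)
print(round(haversine_km(*center, *candidate), 2), "km")
```

The same function could be reused both for the distance to the city center and for the distance to the nearest Italian restaurant.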

We decided to use a regularly spaced grid of locations, centered on the city center, to define our neighborhoods.

The following data sources will be needed to extract or generate the required information:
* centers of candidate areas will be generated algorithmically, and approximate addresses of those centers will be obtained using **reverse geocoding**
* the number of restaurants, and their type and location, in every neighborhood will be obtained using the **Foursquare API**
* the coordinates of the Bangalore center will be obtained by **geocoding** a well-known Bangalore location (Koramangala)

### Neighborhood Candidates

Let's create latitude and longitude coordinates for the centroids of our candidate neighborhoods. We will create a grid of cells covering our area of interest, which is approximately 10x10 kilometers centered on the Bangalore city center.

Let's first find the latitude & longitude of Bangalore city center.
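
Once the center is known, the grid of candidate centroids could be generated roughly as follows. This is only a sketch under stated assumptions: the helper `make_grid`, the 1 km spacing, the flat-earth conversion factor and the center coordinates are all illustrative choices rather than values fixed by this report.

```python
import math

KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude (assumption)


def make_grid(center_lat, center_lon, half_width_km=5.0, step_km=1.0):
    """Return (lat, lon) centroids of a square grid centered on the given point."""
    # One degree of longitude shrinks with the cosine of the latitude
    km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(center_lat))
    n = int(half_width_km // step_km)  # number of steps on each side of the center
    points = []
    for iy in range(-n, n + 1):
        for ix in range(-n, n + 1):
            lat = center_lat + (iy * step_km) / KM_PER_DEG_LAT
            lon = center_lon + (ix * step_km) / km_per_deg_lon
            points.append((lat, lon))
    return points


# Placeholder center coordinates, assumed here for illustration
grid = make_grid(12.931355, 77.633979)
print(len(grid), "candidate centroids")
```

A flat-earth approximation like this is usually adequate over a 10x10 km area; a projected coordinate system would be the more careful choice for larger regions.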

#### install the packages required, if not done already

In [1]:
# install the packages required, if not done already
!pip install beautifulsoup4
!pip install lxml
!pip install geocoder
!pip install geopy
!pip install folium
print("Installed required packages !!!")


#### import the libraries

In [19]:
# import the libraries
import pandas as pd
import requests as rqt
from bs4 import BeautifulSoup as bs
import numpy as np
import geocoder as gc
print("Imported!")
print("Imported required stuffs!")

Imported!
Imported required stuffs!


#### Scraping the postal code page and transforming its table into a pandas dataframe

Use BeautifulSoup to scrape the list of Bangalore postal codes from the source page (pincode.india-server.com).

In [3]:
url = "http://pincode.india-server.com/cities/bengaluru/"
extractData = rqt.get(url).text
wikiData = bs(extractData, 'lxml')

In [5]:
#wikiData

Convert the content of the postal code HTML table into a dataframe

In [6]:
columnNames = ['Postalcode','Town','Neighborhood']
bangalore = pd.DataFrame(columns = columnNames)

table = wikiData.find('table', class_='pincode-tbl')

In [32]:
#table

In [7]:
# Parse every table row; the cells are, in order: serial no., post office, office type, pincode
town = 'Bangalore'
neighborhood = 0  # placeholder values, kept for rows that contain no <td> cells
pin_code = 0
rows = []

for tr in table.find_all('tr'):
    for i, td in enumerate(tr.find_all('td')):
        text = td.text.strip('\n')
        if i == 1:
            neighborhood = text
        elif i == 3:
            pin_code = text.replace(']', '')
    # Header rows produce placeholder entries (0 / 'Post office'); they are dropped during data cleansing
    rows.append({'Postalcode': pin_code, 'Town': town, 'Neighborhood': neighborhood})

# DataFrame.append was removed in pandas 2.0, so build the dataframe from the collected rows
bangalore = pd.DataFrame(rows, columns=columnNames)
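
The loop above keeps header and empty rows and relies on a later cleaning step to remove them. As an alternative, the rows could be filtered while parsing. The sketch below illustrates that idea on made-up sample markup; it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages, and the names `TableRowParser` and `extract_records` as well as the four-column layout are assumptions.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of the <td> cells of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cells = None  # cells of the row being read, or None outside a row
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._cells = []
        elif tag == 'td' and self._cells is not None:
            self._in_td = True
            self._cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_td = False
        elif tag == 'tr' and self._cells is not None:
            self.rows.append(self._cells)
            self._cells = None

    def handle_data(self, data):
        if self._in_td:
            self._cells[-1] += data


def extract_records(html, town='Bangalore'):
    """Keep only rows that have four cells and a numeric pincode in the last one."""
    parser = TableRowParser()
    parser.feed(html)
    records = []
    for cells in parser.rows:
        cells = [c.strip() for c in cells]
        if len(cells) == 4 and cells[3].isdigit():
            records.append({'Postalcode': cells[3], 'Town': town, 'Neighborhood': cells[1]})
    return records


# Made-up sample markup that mimics the assumed four-column layout
sample = """
<table class="pincode-tbl">
  <tr><th>S.No.</th><th>Post office</th><th>Office type</th><th>Pincode</th></tr>
  <tr><td>S.No.</td><td>Post office</td><td>Office type</td><td>Pincode</td></tr>
  <tr><td>1</td><td>A F Station Yelahanka</td><td>S.O</td><td>560063</td></tr>
  <tr><td>2</td><td>Achitnagar</td><td>B.O</td><td>560107</td></tr>
</table>
"""
records = extract_records(sample)
print(records)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(records)`, which avoids the placeholder rows altogether.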

In [8]:
bangalore

| | Postalcode | Town | Neighborhood |
|---|---|---|---|
| 0 | 0 | Bangalore | 0 |
| 1 | Pincode | Bangalore | Post office |
| 2 | 560063 | Bangalore | A F Station Yelahanka |
| 3 | 560107 | Bangalore | Achitnagar |
| 4 | 561101 | Bangalore | Adarangi |
| ... | ... | ... | ... |
| 449 | 562110 | Bangalore | Yeliyur |
| 450 | 562127 | Bangalore | Yennegere |
| 451 | 562123 | Bangalore | Yentiganahalli |
| 452 | 560022 | Bangalore | Yeshwanthpur Bazar |


Data cleansing

In [9]:
# Clean the dataframe: drop unassigned entries, the placeholder row (0) and the scraped header row
bangalore = bangalore[bangalore.Neighborhood != 'Not assigned']
bangalore = bangalore[bangalore.Neighborhood != 0]
bangalore = bangalore[bangalore.Neighborhood != 'Post office']
bangalore.reset_index(drop=True, inplace=True)
# No 'Not assigned' neighborhoods remain after the filters above, so no fallback to the town name is needed

In [10]:
df = bangalore.groupby(['Postalcode','Town'])['Neighborhood'].apply(', '.join).reset_index()
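
To make explicit what the grouping step does, here is a dependency-free sketch of the same idea: all neighborhoods that share a postal code and town are merged into one comma-separated string. The helper `group_neighborhoods` is hypothetical, and unlike the pandas call above it also strips stray whitespace and sorts the names.

```python
from collections import defaultdict


def group_neighborhoods(records):
    """Merge the neighborhoods that share a (postal code, town) pair into one string."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[(rec['Postalcode'], rec['Town'])].append(rec['Neighborhood'].strip())
    return [
        {'Postalcode': code, 'Town': town, 'Neighborhood': ', '.join(sorted(names))}
        for (code, town), names in sorted(grouped.items())
    ]


# A few illustrative rows in the same shape as the scraped data
rows = [
    {'Postalcode': '560002', 'Town': 'Bangalore', 'Neighborhood': 'Bangalore City '},
    {'Postalcode': '560005', 'Town': 'Bangalore', 'Neighborhood': 'Fraser Town'},
    {'Postalcode': '560002', 'Town': 'Bangalore', 'Neighborhood': 'Bangalore Corporation Building'},
]
for row in group_neighborhoods(rows):
    print(row)
```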

In [11]:
df

| | Postalcode | Town | Neighborhood |
|---|---|---|---|
| 0 | 560001 | Bangalore | Bangalore Bazaar , Bangalore G.P.O., CMM Court... |
| 1 | 560002 | Bangalore | Bangalore City , Bangalore Corporation Building |
| 2 | 560003 | Bangalore | Malleswaram , Palace Guttahalli , Swimming Poo... |
| 3 | 560004 | Bangalore | Basavanagudi , Mavalli , Pampamahakavi Road , ... |
| 4 | 560005 | Bangalore | Fraser Town |
| ... | ... | ... | ... |
| 126 | 562149 | Bangalore | Bagalur , Bandikodigehalli , Kannur |
| 127 | 562157 | Bangalore | Bettahalsur , Chikkajala , Doddajala , Hunasam... |
| 128 | 562162 | Bangalore | Aluru , Dasanapura , Hullegowdanahalli , Husku... |
| 129 | 562163 | Bangalore | Arakere , Basettihalli , Doddatumkur , Konnaga... |


In [12]:
df.describe()

| | Postalcode | Town | Neighborhood |
|---|---|---|---|
| count | 131 | 131 | 131 |
| unique | 131 | 1 | 131 |
| top | 560049 | Bangalore | Gayathrinagar , Srirampuram |
| freq | 1 | 131 | 1 |


Data cleaning: drop rows with missing values and any row that contains 'Not assigned' in one of its columns.

In [13]:
df = df.dropna()
empty = 'Not assigned'
df = df[(df.Postalcode != empty ) & (df.Town != empty) & (df.Neighborhood != empty)]

In [14]:
df.head()

| | Postalcode | Town | Neighborhood |
|---|---|---|---|
| 0 | 560001 | Bangalore | Bangalore Bazaar , Bangalore G.P.O., CMM Court... |
| 1 | 560002 | Bangalore | Bangalore City , Bangalore Corporation Building |
| 2 | 560003 | Bangalore | Malleswaram , Palace Guttahalli , Swimming Poo... |
| 3 | 560004 | Bangalore | Basavanagudi , Mavalli , Pampamahakavi Road , ... |
| 4 | 560005 | Bangalore | Fraser Town |


In [15]:
def neighborhood_list(grouped):    
    return ', '.join(sorted(grouped['Neighborhood'].tolist()))
                    
grp = df.groupby(['Postalcode', 'Town'])
df_2 = grp.apply(neighborhood_list).reset_index(name='Neighborhood')

In [16]:
df_2.describe()

| | Postalcode | Town | Neighborhood |
|---|---|---|---|
| count | 131 | 131 | 131 |
| unique | 131 | 1 | 131 |
| top | 560049 | Bangalore | Gayathrinagar , Srirampuram |
| freq | 1 | 131 | 1 |


In [17]:
print(df_2.shape)
df_2.head()

(131, 3)


| | Postalcode | Town | Neighborhood |
|---|---|---|---|
| 0 | 560001 | Bangalore | Bangalore Bazaar , Bangalore G.P.O., CMM Court... |
| 1 | 560002 | Bangalore | Bangalore City , Bangalore Corporation Building |
| 2 | 560003 | Bangalore | Malleswaram , Palace Guttahalli , Swimming Poo... |
| 3 | 560004 | Bangalore | Basavanagudi , Mavalli , Pampamahakavi Road , ... |
| 4 | 560005 | Bangalore | Fraser Town |


In [18]:
df_2.to_csv('bangalore.csv', index=False) 

#### Adding Latitude and Longitude Co-ordinates to the DataFrame

Read the bangalore.csv file that was created in the previous step.

In [20]:
df = pd.read_csv('bangalore.csv')
df.head()

| | Postalcode | Town | Neighborhood |
|---|---|---|---|
| 0 | 560001 | Bangalore | Bangalore Bazaar , Bangalore G.P.O., CMM Court... |
| 1 | 560002 | Bangalore | Bangalore City , Bangalore Corporation Building |
| 2 | 560003 | Bangalore | Malleswaram , Palace Guttahalli , Swimming Poo... |
| 3 | 560004 | Bangalore | Basavanagudi , Mavalli , Pampamahakavi Road , ... |
| 4 | 560005 | Bangalore | Fraser Town |


In [21]:
print(df.shape)
df.describe()

(131, 3)


| | Postalcode |
|---|---|
| count | 131.0 |
| mean | 560433.099237 |
| std | 775.877517 |
| min | 560001.0 |
| 25% | 560035.5 |
| 50% | 560071.0 |
| 75% | 560104.5 |
| max | 562164.0 |


In [22]:
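def get_latilong(postal_code, max_tries=5):
    # Geocode a postal code with ArcGIS; retry a bounded number of times
    # so that a persistent failure cannot loop forever
    for _ in range(max_tries):
        g = gc.arcgis('{}, Bangalore, Karnataka'.format(postal_code))
        if g.latlng:
            return g.latlng
    # Lookup failed: return empty coordinates so later cells can still build the dataframe
    return [None, None]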
    
get_latilong(560034)

[12.931355000000053, 77.63397875900006]
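
Because all 131 postal codes are geocoded one by one, it may be worth pausing between failed attempts and caching successful lookups so that re-running a cell does not repeat every request. The sketch below shows one way this could look. The helper `geocode_with_retry`, its parameters and the `flaky_lookup` stand-in are hypothetical; no real geocoding service is called here.

```python
import time


def geocode_with_retry(lookup, query, max_tries=3, delay_s=1.0, cache=None):
    """Call lookup(query) until it returns coordinates, at most max_tries times.

    lookup is any callable returning [lat, lon], or a falsy value on failure.
    """
    if cache is not None and query in cache:
        return cache[query]
    for attempt in range(max_tries):
        coords = lookup(query)
        if coords:
            if cache is not None:
                cache[query] = coords
            return coords
        if attempt < max_tries - 1:
            time.sleep(delay_s * (2 ** attempt))  # exponential backoff between attempts
    return None


# Stand-in for a real geocoder: fails on the first call, then succeeds
calls = {'n': 0}


def flaky_lookup(query):
    calls['n'] += 1
    return None if calls['n'] == 1 else [12.931355, 77.633979]


cache = {}
query = '560034, Bangalore, Karnataka'
print(geocode_with_retry(flaky_lookup, query, delay_s=0.0, cache=cache))
print(geocode_with_retry(flaky_lookup, query, delay_s=0.0, cache=cache))  # served from the cache
print("lookups performed:", calls['n'])
```

With the geocoder used in this notebook, `lookup` could be something like `lambda q: gc.arcgis(q).latlng`.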

In [23]:
# Retrieving Postal Code Co-ordinates
postal_codes = df['Postalcode']    
coords = [ get_latilong(postal_code) for postal_code in postal_codes.tolist() ]

In [24]:
# Adding Columns Latitude & Longitude
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])
df['Latitude'] = df_coords['Latitude']
df['Longitude'] = df_coords['Longitude']

In [25]:
df[df.Postalcode == 560034]

| | Postalcode | Town | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 31 | 560034 | Bangalore | Agara , Koramangala I Block , Koramangala , St... | 12.931355 | 77.633979 |


In [26]:
df.head(15)

| | Postalcode | Town | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | 560001 | Bangalore | Bangalore Bazaar , Bangalore G.P.O., CMM Court... | 12.979185 | 77.606623 |
| 1 | 560002 | Bangalore | Bangalore City , Bangalore Corporation Building | 12.96407 | 77.577647 |
| 2 | 560003 | Bangalore | Malleswaram , Palace Guttahalli , Swimming Poo... | 13.003656 | 77.569745 |
| 3 | 560004 | Bangalore | Basavanagudi , Mavalli , Pampamahakavi Road , ... | 12.945664 | 77.575075 |
| 4 | 560005 | Bangalore | Fraser Town | 12.998115 | 77.620842 |
| 5 | 560006 | Bangalore | J.C.Nagar , Training Command IAF | 13.010375 | 77.591292 |
| 6 | 560007 | Bangalore | Agram , Air Force Hospital | 12.956583 | 77.62849 |
| 7 | 560008 | Bangalore | H.A.L II Stage , Hulsur Bazaar | 12.97303 | 77.627446 |
| 8 | 560009 | Bangalore | Bangalore Dist Offices Bldg , K. G. Road | 12.978412 | 77.578175 |
| 9 | 560010 | Bangalore | Industrial Estate , Rajajinagar , Rajajinagar... | 12.998219 | 77.55218 |


In [27]:
df.to_csv('bangalore_lalo.csv',index=False)

In [28]:
df

| | Postalcode | Town | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | 560001 | Bangalore | Bangalore Bazaar , Bangalore G.P.O., CMM Court... | 12.979185 | 77.606623 |
| 1 | 560002 | Bangalore | Bangalore City , Bangalore Corporation Building | 12.964070 | 77.577647 |
| 2 | 560003 | Bangalore | Malleswaram , Palace Guttahalli , Swimming Poo... | 13.003656 | 77.569745 |
| 3 | 560004 | Bangalore | Basavanagudi , Mavalli , Pampamahakavi Road , ... | 12.945664 | 77.575075 |
| 4 | 560005 | Bangalore | Fraser Town | 12.998115 | 77.620842 |
| ... | ... | ... | ... | ... | ... |
| 126 | 562149 | Bangalore | Bagalur , Bandikodigehalli , Kannur | 13.109270 | 77.678845 |
| 127 | 562157 | Bangalore | Bettahalsur , Chikkajala , Doddajala , Hunasam... | 13.168690 | 77.635941 |
| 128 | 562162 | Bangalore | Aluru , Dasanapura , Hullegowdanahalli , Husku... | 13.063310 | 77.439025 |
| 129 | 562163 | Bangalore | Arakere , Basettihalli , Doddatumkur , Konnaga... | 13.260500 | 77.530259 |


## 3.	Methodology <a name="methodology"></a>


## 4.	Results <a name="results"></a>


## 5.	Discussion


## 6.	Conclusion <a name="conclusion"></a>
