# Introduction: business problem

In this project I want to analyze a dataset covering the city of London to find where a person could open a place for their business.

In particular, using the Foursquare location data, I will try to answer the main question: which is the best place to open a business?

# Dataset

The data that I'm going to use for the Capstone project is available on this [site](https://www.doogal.co.uk/london_postcodes.php).

The dataset is quite large (about 135 MB, 320426 rows and 44 attributes), so I will use only basic information such as:
1. District 
2. District Code
3. Ward (Neighborhood)

Before starting the analysis, I decided to filter the rows using the "In Use?" attribute, which indicates whether a postcode is in use or not.

After this filtering I'm going to use an unsupervised algorithm: K-Means.
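
As a rough illustration of how K-Means groups points, the sketch below clusters synthetic two-dimensional data generated with `make_blobs`. The data, the choice of three clusters, and the `random_state` values are assumptions made for the example; they do not come from the London dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points standing in for coordinate pairs (assumption:
# these are NOT taken from the London dataset).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# n_clusters=3 is an arbitrary choice for this illustration.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# One centre per cluster, two coordinates each.
print(kmeans.cluster_centers_.shape)
# Number of points assigned to each cluster.
print(np.bincount(labels))
```

In a real analysis the number of clusters would have to be chosen from the data, for example by comparing the inertia for several values of `n_clusters`.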

Unfortunately, the latitude and longitude in the dataframe are given for every postcode, so I'm going to use some external services to find the latitude and longitude of each district code. Finally, for each district I will use the Foursquare data to try to find the best place to open a business.

The cells below show some information about the dataset: the number of rows, the first lines of the dataframe, and so on.
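
For comparison, one possible alternative to an external service (not the approach described above) is to approximate each district's position by averaging the coordinates of its postcodes. The sketch below does this on a tiny hand-made sample; the district codes `CODE_A` and `CODE_B` are placeholders, and the two Lewisham rows are hypothetical values invented for the illustration.

```python
import pandas as pd

# Small illustrative sample. "CODE_A"/"CODE_B" are placeholder district
# codes, and the Lewisham coordinates are hypothetical.
sample = pd.DataFrame({
    "District Code": ["CODE_A", "CODE_A", "CODE_B", "CODE_B"],
    "District": ["Bromley", "Bromley", "Lewisham", "Lewisham"],
    "Latitude": [51.401546, 51.406333, 51.45, 51.47],
    "Longitude": [0.015415, 0.015208, -0.02, -0.04],
})

# Average the postcode coordinates within each district to get one
# approximate (centroid) position per district code.
centroids = (
    sample.groupby(["District Code", "District"], as_index=False)[
        ["Latitude", "Longitude"]
    ].mean()
)

print(centroids)
```

A centroid computed this way is weighted by postcode density, so it may differ from the coordinates a geocoding service would return for the same district.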

### Import libraries and create the dataframe

In [1]:
import random 
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
%matplotlib inline 

from sklearn.cluster import KMeans 
from sklearn.datasets import make_blobs

print('Libraries imported!')

Libraries imported!


In [2]:
# Create the dataframe from the CSV link
link = "https://www.doogal.co.uk/UKPostcodesCSV.ashx?area=London"
df_london = pd.read_csv(link)

In [3]:
df_london.head()

Unnamed: 0,Postcode,In Use?,Latitude,Longitude,Easting,Northing,Grid Ref,County,District,Ward,...,Constituency Code,Index of Multiple Deprivation,Quality,User Type,Last updated,Nearest station,Distance to station,Postcode area,Postcode district,Police force
0,BR1 1AA,Yes,51.401546,0.015415,540291,168873,TQ402688,Greater London,Bromley,Bromley Town,...,E14000604,20532,1,0,2019-05-29,Bromley South,0.218257,BR,BR1,Metropolitan Police
1,BR1 1AB,Yes,51.406333,0.015208,540262,169405,TQ402694,Greater London,Bromley,Bromley Town,...,E14000604,10169,1,0,2019-05-29,Bromley North,0.253666,BR,BR1,Metropolitan Police
2,BR1 1AD,No,51.400057,0.016715,540386,168710,TQ403687,Greater London,Bromley,Bromley Town,...,E14000604,20532,1,1,2019-05-29,Bromley South,0.044559,BR,BR1,Metropolitan Police
3,BR1 1AE,Yes,51.404543,0.014195,540197,169204,TQ401692,Greater London,Bromley,Bromley Town,...,E14000604,19350,1,0,2019-05-29,Bromley North,0.462939,BR,BR1,Metropolitan Police
4,BR1 1AF,Yes,51.401392,0.014948,540259,168855,TQ402688,Greater London,Bromley,Bromley Town,...,E14000604,20532,1,0,2019-05-29,Bromley South,0.227664,BR,BR1,Metropolitan Police


In [4]:
df_london.columns

Index(['Postcode', 'In Use?', 'Latitude', 'Longitude', 'Easting', 'Northing',
       'Grid Ref', 'County', 'District', 'Ward', 'District Code', 'Ward Code',
       'Country', 'County Code', 'Constituency', 'Introduced', 'Terminated',
       'Parish', 'National Park', 'Population', 'Households', 'Built up area',
       'Built up sub-division', 'Lower layer super output area', 'Rural/urban',
       'Region', 'Altitude', 'London zone', 'LSOA Code', 'Local authority',
       'MSOA Code', 'Middle layer super output area', 'Parish Code',
       'Census output area', 'Constituency Code',
       'Index of Multiple Deprivation', 'Quality', 'User Type', 'Last updated',
       'Nearest station', 'Distance to station', 'Postcode area',
       'Postcode district', 'Police force'],
      dtype='object')

Filtering out the postcodes that are not in use

In [5]:
print("The number of rows of the df is: ", df_london.shape[0])

The number of rows of the df is:  320426


In [6]:
print("After the filtering the number of rows to analyze are: ", df_london[df_london['In Use?'] == 'Yes'].shape[0])

After the filtering the number of rows to analyze are:  177967
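
The cell above only counts the rows that pass the filter. A minimal sketch of how the filtered rows and the three attributes of interest could be kept for later steps is shown below, using a tiny stand-in for `df_london`. The names `toy`, `keep` and `in_use` are hypothetical, and `CODE_A` is a placeholder district code.

```python
import pandas as pd

# Tiny stand-in for df_london (assumption: only a few columns are
# reproduced here, and "CODE_A" is a placeholder district code).
toy = pd.DataFrame({
    "Postcode": ["BR1 1AA", "BR1 1AB", "BR1 1AD"],
    "In Use?": ["Yes", "Yes", "No"],
    "District": ["Bromley", "Bromley", "Bromley"],
    "District Code": ["CODE_A", "CODE_A", "CODE_A"],
    "Ward": ["Bromley Town", "Bromley Town", "Bromley Town"],
})

# Columns named in the Dataset section.
keep = ["District", "District Code", "Ward"]

# Keep only postcodes that are in use, restricted to the chosen columns.
in_use = toy.loc[toy["In Use?"] == "Yes", keep].reset_index(drop=True)

print(in_use.shape)  # (2, 3)
```

Selecting rows and columns in a single `.loc` call returns a new dataframe, so the original frame is left unchanged.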


List of districts

In [7]:
print(df_london['District'].unique())

['Bromley' 'Lewisham' 'Lambeth' 'Croydon' 'Greenwich' 'Havering' 'Camden'
 'Sutton' 'Merton' 'Bexley' 'Tower Hamlets' 'City of London' 'Hackney'
 'Waltham Forest' 'Redbridge' 'Newham' 'Enfield' 'Islington' 'Westminster'
 'Barnet' 'Brent' 'Ealing' 'Harrow' 'Hillingdon' 'Barking and Dagenham'
 'Kingston upon Thames' 'Richmond upon Thames' 'Haringey'
 'Hammersmith and Fulham' 'Southwark' 'Kensington and Chelsea'
 'Wandsworth' 'Hounslow']
