<h1>Data Science Capstone Project - The Battle of Neighborhoods (Week 1-2)</h1>

<h2>Business Problem</h2>

<b>Background</b>

When you live in a big city like London or New York, it's relatively easy to figure out where to go for some amazing restaurants or banging street food in the cuisine of your choice, or what to do when you have a few hours to kill.  But when you live in a much smaller city (as most of us do!), there is less choice and the decisions are often harder.  There aren't many guides or websites for smaller cities.  When you are visiting a place for the first time, wouldn't it be great to figure out the happening neighborhoods or boroughs to hang out in, or where to go for the best restaurants?

This is actually an easier problem when you have lots of data, and bigger cities also have much more wrangled data available.  With smaller cities, data acquisition and wrangling are much harder, and there is less data from which to draw statistically meaningful conclusions.


<b>Problem: *Where to go in smaller cities when there are no online guides?*</b>


In the 1960s, the UK Government decided that a further generation of new towns in the South East of England was needed to relieve housing congestion in London. This new town (in planning documents, "new city"), <b>Milton Keynes</b>, was to be the biggest yet, with a target population of 250,000 and a "designated area" of about 22,000 acres (9,000 ha). At designation, its area incorporated the existing towns of Bletchley, Wolverton, and Stony Stratford, along with another fifteen villages and farmland in between.

I live in this amazing city and wondered **whether data science could teach me a thing or two about the place I have lived in** for the last 25 years.

But more specifically, if you were a visitor to my home town:
* Which areas within the city would you go to for a range of different activities?  
* Where would you go if you wanted a range of fast food outlets?  
* Where would you go for the best supermarkets and food shops?  
* Which parts of the city have the best museums?  

**So could data science shed light on where to go in a city of 250,000 people?**

...and for those of you who have never heard of Milton Keynes, it's here:

<img src="http://www.worldeasyguides.com/wp-content/uploads/2014/08/Where-is-Milton-Keynes-on-map-England.jpg" width="400" height="400" align="left"/>


<h2>Data Section</h2>

**Data Understanding**

In a city of 250,000 people covering roughly 100 km², we need to divide the city into areas small enough to be explored yet big enough to be of interest.  In the UK we use postcodes (similar to ZIP codes) to segment each area.  A full postcode is typically 6 or 7 characters (e.g. MK7 8AA) and goes down to a level of 10-15 houses.  That is too granular for our analysis, so we will use a higher level of the postcode, keeping just the first 3 or 4 characters (e.g. MK7 or MK7 8), to give us postal regions covering a few square kilometers each.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/65/MK_postcode_area_map.svg/1100px-MK_postcode_area_map.svg.png" width="400" height="400" align="left"/>
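Trimming a full postcode down to its higher-level region can be sketched in a few lines of Python.  This is a minimal illustration, not the notebook's actual code: the helper name `split_postcode` and the loose regular expression are assumptions made for this example, and the pattern does not enforce every rule of the UK postcode format.

```python
import re

# Loose pattern (an assumption for this sketch): outward code = area letters
# plus district, then an optional space, then the inward code = one sector
# digit followed by two unit letters.
_POSTCODE_RE = re.compile(r"^([A-Z]{1,2}[0-9][A-Z0-9]?)\s*([0-9])([A-Z]{2})$")


def split_postcode(postcode):
    """Return (district, sector) for a full postcode string."""
    match = _POSTCODE_RE.match(postcode.strip().upper())
    if match is None:
        raise ValueError(f"not a recognisable postcode: {postcode!r}")
    outward, sector_digit, _unit = match.groups()
    return outward, f"{outward} {sector_digit}"


district, sector = split_postcode("MK7 8AA")
print(district, sector)  # prints: MK7 MK7 8
```

The district (`MK7`) is the coarser of the two levels; the sector (`MK7 8`) is one step finer if the districts turn out to be too large.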

**Data Collection**

There are THREE main sources of data that we will need for our analysis:

* Neighbourhood or Borough data, of all of the regions within Milton Keynes
* Location coordinate data within the Boroughs so that we can search for venues around a locale
* Venue information, its location and classification

For the **Neighbourhood data**, the internet is our friend, and we should be able to get this from Wikipedia.  Our good friends at Wikipedia have posted a list of all of the major post towns, their postcodes, and the neighborhoods within them.  Since it's on Wikipedia, we can screen scrape it and transform it into a dataframe!  https://en.wikipedia.org/wiki/MK_postcode_area
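As a rough idea of what the screen scrape involves, the sketch below pulls the rows out of any HTML table using only the Python standard library.  The class and function names are made up for this example, the sample HTML is a hand-written stand-in rather than the real page, and the parser ignores `rowspan`/`colspan`, so treat it as a starting point only.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_rows(html):
    parser = TableRowParser()
    parser.feed(html)
    return parser.rows


def fetch_rows(url="https://en.wikipedia.org/wiki/MK_postcode_area"):
    """Download the page and return its table rows (needs network access)."""
    request = Request(url, headers={"User-Agent": "capstone-notebook-example"})
    with urlopen(request) as response:
        return parse_rows(response.read().decode("utf-8"))


# Hand-written sample; the column headings here are illustrative only.
sample_html = (
    "<table><tr><th>Postcode district</th><th>Post town</th></tr>"
    "<tr><td>MK7</td><td>MILTON <b>KEYNES</b></td></tr></table>"
)
rows = parse_rows(sample_html)
print(rows)
```

With pandas and an HTML parser such as lxml installed, `pandas.read_html(url)` returns the tables on a page as dataframes directly, which is likely the shorter route inside a notebook.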

For the **Location Coordinates**, the UK Government's Office for National Statistics (https://www.ons.gov.uk/) publishes a range of data for each postcode, such as income levels, OS grid references, the parliamentary constituency and, for our purposes, the longitude and latitude of each postcode.  A full 6- or 7-character postcode is too granular for our purposes, so some data wrangling and aggregation will be in order.  ONS publishes these files regularly in CSV format.  We will get our data from here:  https://geoportal.statistics.gov.uk/datasets/national-statistics-postcode-lookup-may-2020  (example below)

<img src="https://raw.githubusercontent.com/brian-naylor/Coursera_Capstone/master/ONS%20Postcode%20Data%20-%20May%202020.jpg" width="600" height="600" align="left"/>
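One way to aggregate the postcode-level coordinates is to average them per postcode district, giving a single search point for each region.  The sketch below assumes the CSV has columns named `pcds`, `lat` and `long`; those names should be checked against the header of the downloaded file.  The sample rows are invented for illustration and the function name is hypothetical.

```python
import csv
import io
import re
from collections import defaultdict


def district_centroids(csv_file, area="MK", postcode_col="pcds",
                       lat_col="lat", lon_col="long"):
    """Average the coordinates of every postcode in each district of one area."""
    district_re = re.compile(re.escape(area) + r"[0-9][A-Z0-9]?")
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # district -> [lat sum, lon sum, n]
    for row in csv.DictReader(csv_file):
        # The inward code is always the last three characters of a postcode.
        compact = row[postcode_col].upper().replace(" ", "")
        district = compact[:-3]
        if district_re.fullmatch(district) is None:
            continue  # a different postcode area, or a malformed value
        lat, lon = float(row[lat_col]), float(row[lon_col])
        if not -90.0 <= lat <= 90.0:
            continue  # skip rows whose latitude is not a real coordinate
        entry = totals[district]
        entry[0] += lat
        entry[1] += lon
        entry[2] += 1
    return {
        district: (lat_sum / n, lon_sum / n)
        for district, (lat_sum, lon_sum, n) in sorted(totals.items())
    }


# Invented sample rows, standing in for the real lookup file.
sample = io.StringIO(
    "pcds,lat,long\n"
    "MK7 8AA,52.0,-0.70\n"
    "MK7 6AB,52.02,-0.72\n"
    "MK10 9AA,52.04,-0.69\n"
    "M1 1AA,53.48,-2.24\n"
    "MK7 9ZZ,99.999999,0.0\n"
)
centroids = district_centroids(sample)
for district, (lat, lon) in centroids.items():
    print(district, round(lat, 4), round(lon, 4))
```

A simple mean is a crude centroid, since it is pulled towards wherever postcodes are densest, but that bias is arguably useful here because venues tend to cluster where people live and work.  The same aggregation could be written as a pandas `groupby` once the district column has been derived.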

Finally, for the venue and point of interest information, our good friend **Foursquare** can come to the rescue: we will leverage their **API to perform venue exploration** around a given longitude and latitude to see what we find.  (e.g. https://foursquare.com/explore?mode=url&ne=52.093746%2C-0.52434&q=Food&sw=51.965952%2C-0.846891)

<img src="https://raw.githubusercontent.com/brian-naylor/Coursera_Capstone/master/FoursquareinMK.jpg" width="600" height="600" align="left"/>
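For reference, a venue exploration request can be assembled as below.  The endpoint and parameter names reflect my understanding of Foursquare's v2 "explore" API at the time of writing and should be treated as assumptions to verify against the current Foursquare documentation; the credentials are placeholders, the coordinates are only an approximate point in Milton Keynes, and the function name is made up for this sketch.  Nothing is sent over the network here.

```python
from urllib.parse import urlencode

# Assumed endpoint: check the current Foursquare documentation before use.
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lon, client_id, client_secret,
                      radius_m=1000, limit=100, version="20200501"):
    """Build the request URL for venues within radius_m metres of a point."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,              # API version date in YYYYMMDD form
        "ll": f"{lat},{lon}",      # latitude,longitude of the search centre
        "radius": radius_m,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


example_url = build_explore_url(52.0406, -0.7594,
                                "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(example_url)
```

Calling this once per region centre, with a radius matched to the size of the region, would give the venue list for each region.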

**Data Wrangling**

Considerable cleaning, filtering and aggregation of data will be needed before we can model the outcomes.  We will be combining approximately 20 different regions in Milton Keynes with data from about 20,000 postcodes and somewhere between 400 and 800 points of interest and venues in Milton Keynes.

***

<h2>Methodology</h2>

We're going to follow a standard CRISP-DM approach, as shown in the diagram below.

<img src="https://miro.medium.com/max/1400/1*2NajmK58hJf8lJQm25iXWw.png" width="400" height="400" align="left"/>

With the **Business Problem** and **Data Understanding / Collection** described above, let's focus a little on the **Modelling** part.  Since we are trying to understand which parts of the city are best to visit for specific categories, we will need to apply clustering techniques to the venues to determine their common characteristics and how it would make sense to group them into a set of clusters.  Also, the number of clusters needs to be far smaller than the number of boroughs/regions, otherwise we might as well just say "go to postcode region XX for the best Chinese food restaurants."  

So we will need to identify all the relevant venues within the regions of interest, determine their relative importance or density within each region, and then apply clustering techniques to group them together.  We will obviously need to plot them to visualise the results, and then evaluate to see whether they make sense.
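The steps above can be sketched end to end on toy data.  The text does not commit to a particular algorithm, so k-means is used here purely as an assumption; the venue list is invented, and the function names are hypothetical.  Each region is described by the share of its venues in each category, and regions with similar shares end up in the same cluster.

```python
import random
from collections import Counter


def category_profiles(venues):
    """Turn (region, category) pairs into one row of category shares per region."""
    counts = {}
    for region, category in venues:
        counts.setdefault(region, Counter())[category] += 1
    regions = sorted(counts)
    categories = sorted({c for counter in counts.values() for c in counter})
    profiles = []
    for region in regions:
        total = sum(counts[region].values())
        profiles.append([counts[region][c] / total for c in categories])
    return regions, categories, profiles


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means: returns one cluster label (0..k-1) per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented toy data: two food-heavy regions and two museum-heavy regions.
sample_venues = (
    [("MK1", "Fast Food Restaurant")] * 3 + [("MK1", "Supermarket")]
    + [("MK2", "Fast Food Restaurant")] * 2 + [("MK2", "Supermarket")] * 2
    + [("MK9", "Museum")] * 3 + [("MK9", "Fast Food Restaurant")]
    + [("MK12", "Museum")] * 4
)
regions, categories, profiles = category_profiles(sample_venues)
cluster_of = dict(zip(regions, kmeans(profiles, k=2)))
print(cluster_of)
```

Using shares rather than raw counts stops a region from standing out simply because it has more venues overall.  In the notebook itself a library implementation such as scikit-learn's `KMeans` would be the more usual choice than hand-written code.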

**Tooling**

It's also important to consider the tooling, and we will be using the following:
* Watson Studio, for cloud-based Notebooks
* IBM Cloud Object Storage for any files we load (and save)
* Github for sharing the published results (in my personal repo https://github.com/brian-naylor/Coursera_Capstone)
* Within Notebooks, we're likely to make extensive use of Pandas for dataframe manipulation, Folium for visualisation, and Geocode for location plotting

***

<h1>Week 2 - The Data Science bit!</h1>

<h3>Data Collection</h3>

<h3>Data Wrangling</h3>

<h3>Modelling</h3>

<h3>Visualisation</h3>

<h3>Results</h3>

<h3>Conclusion</h3>