# Applied Data Science : Capstone Final Project

## Battle Of The Neighborhoods

### Author : Bill Gourley
### Date     : 23rd December 2018


#### Introduction/Problem Description

As the image below shows, tourism plays an important part in the overall Irish economy. It is therefore vital for any business or individual involved in the industry, or wishing to create a new business to serve it, to have timely and interactive access to demographic data.

The stakeholders of this project, the Irish Tourism Industry Confederation, have commissioned a report and software application that utilizes geospatial and demographic social data to illustrate the major categories of existing facilities and venues for counties and major cities in Ireland. This report will then be used by the Confederation as a basis for informing government and other agencies of the facilities and services which could be added or expanded in various counties and cities to attract tourists and thereby enhance industry profitability.

![title](tourism_impact.jpg)

This report uses a Python Notebook application to :

    * obtain geospatial data on the counties and major Irish cities,
    * obtain demographic venue data from the Foursquare API using the geospatial data,
    * produce a summary report of the major venue categories, by county and major city.

#### Data

    * Geospatial Data : https://www.citypopulation.de/php/ireland.php

The geospatial data contains the county/city name together with a status column and multiple census data columns. For this exercise only the name column will be used. The names are subsequently used to obtain the latitude and longitude values of the location. An example of the data is shown below.
    
   ![geo_data](geo_data.jpg)
    
    * Demographic Venue Data : Foursquare API
    
An example of the Demographic Venue Data retrieved from Foursquare is shown below. This data is subsequently analyzed to summarize the venue categories into 9 major categories using a combination of value mapping and clustering.

   ![venue_data](foursquare_data.jpg)
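
As a rough illustration of how venue data can be requested for one location, the sketch below builds a request URL from a latitude and longitude. It assumes the legacy Foursquare v2 `venues/explore` endpoint; the report does not state which endpoint was used. The function name `build_explore_url`, the radius, the limit, the version date and the credentials are illustrative placeholders.

```python
from urllib.parse import urlencode

# Legacy Foursquare v2 "explore" endpoint (an assumption; the report does not name it).
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=5000, limit=100):
    """Return a venues/explore request URL for one location (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date (YYYYMMDD)
        "ll": f"{lat},{lng}",    # latitude,longitude of the county or city
        "radius": radius,        # search radius in metres (illustrative value)
        "limit": limit,          # maximum number of venues to return
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


# Approximate coordinates for Kilkenny; the credentials are placeholders.
url = build_explore_url(52.6541, -7.2448, "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
```

The resulting URL can then be fetched with any HTTP client, and the JSON response flattened into a table of venue names and categories per neighborhood.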

#### Methodology

    * Exploratory Data Analysis : 
    
For this project, the data obtained from the 2 sources detailed above required very little in the way of Exploratory Data Analysis. There were no missing values to deal with and the data was obtained cleanly in table format. The data table obtained for the county information included population census data columns which were removed as they were not required for the current project.
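
The column clean-up described above might look like the following minimal sketch. The column names and rows are hypothetical stand-ins for the source table, and the census values are placeholders rather than real figures.

```python
import pandas as pd

# Hypothetical rows shaped like the source table: name, status and a census column.
counties = pd.DataFrame({
    "Name": ["Carlow", "Kilkenny", "Waterford"],
    "Status": ["County", "County", "County"],
    "Population Census 2016": [1, 2, 3],  # placeholder values, not real census data
})

# Keep only the name column; the status and census columns are not needed here.
neighborhoods = counties[["Name"]].rename(columns={"Name": "Neighborhood"})

# Check that no names are missing before they are used to look up coordinates.
has_missing = bool(neighborhoods["Neighborhood"].isna().any())
```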

    * Inferential Statistics :
    
The end goal of this project was not to provide predictive analytics capabilities, but rather to use the data obtained to gain insight and produce a report with recommendations to the stakeholders. As such, there is no target variable. Therefore there was no need to perform inferential statistical analysis.

    * Machine Learning Algorithms
    
To reduce the number of individual Foursquare venue categories (138) to a manageable number of Major Venue Categories (9), a combination of manual value mapping and machine learning clustering was used. The clustering algorithm used was hdbscan, which was preferred over k-means for two main reasons: (1) the number of clusters does not have to be specified in advance, and (2) hdbscan clusters by density rather than by partitioning the data, thus producing a more consistent and stable set of clusters.

The final set of Major Venue Categories produced was :

    0 : Services
    1 : Entertainment
    2 : Transport
    3 : Accommodation
    4 : Shopping
    5 : Bar
    6 : Fast Food
    7 : Cultural
    8 : Restaurant
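
The manual value-mapping part of this step can be sketched as below. Only the mapping is shown, not the hdbscan clustering, and the dictionary entries are hypothetical examples rather than the mapping actually used in the project. Categories left unmapped are the ones that would then need to be assigned by clustering or by hand.

```python
import pandas as pd

# Hypothetical hand-written mapping from Foursquare categories to major categories.
CATEGORY_TO_MAJOR = {
    "Pub": "Bar",
    "Hotel": "Accommodation",
    "Museum": "Cultural",
    "Pizza Place": "Fast Food",
    "Supermarket": "Shopping",
    "Train Station": "Transport",
}

# Illustrative venue rows; "Castle" is deliberately missing from the mapping.
venues = pd.DataFrame({
    "Venue Category": ["Pub", "Hotel", "Museum", "Pizza Place", "Castle"],
})

# Series.map leaves NaN wherever a category has no entry in the dictionary.
venues["Major Category"] = venues["Venue Category"].map(CATEGORY_TO_MAJOR)

# Categories still unassigned after the manual mapping.
unmapped = venues.loc[venues["Major Category"].isna(), "Venue Category"].tolist()
```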

    * Results Methods :
    
(1) The Major Venue Categories listed above were grouped by category, and a total sum and global mean were calculated for each category.

(2) The categories were summed for each neighborhood and a neighborhood mean was calculated for each category.

(3) The global means were then subtracted from the neighborhood means to produce, for each neighborhood and category, the difference between the neighborhood mean and the global mean for that category.

(4) Negative differences indicate categories in which a neighborhood falls below the global mean. Adding venues in these categories could make the neighborhood more attractive to tourists and locals alike, increasing footfall and profitability.

(5) Positive differences indicate categories in which a neighborhood exceeds the global mean. These could be used as focus points for advertising the strengths of particular neighborhoods.
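
The steps above can be sketched with pandas as follows. This is a minimal sketch that assumes the means are per-category proportions of venues (one-hot encoded rows averaged overall and per neighborhood), which the report does not state explicitly. The column names and sample rows are hypothetical. The difference is taken as neighborhood mean minus global mean, so that negative values mark categories below the global mean, as in step (4).

```python
import pandas as pd

# Hypothetical venue rows, one per venue, already assigned a major category.
venues = pd.DataFrame({
    "Neighborhood": ["Carlow"] * 4 + ["Waterford"] * 4,
    "Major Category": ["Shopping", "Shopping", "Bar", "Restaurant",
                       "Cultural", "Cultural", "Bar", "Shopping"],
})


def mean_differences(venues):
    """Return neighborhood mean minus global mean for each major category."""
    # One column per category, with 1.0 marking the category of each venue.
    onehot = pd.get_dummies(venues["Major Category"], dtype=float)
    # Steps (1) and (2): share of venues per category, overall and per neighborhood.
    global_means = onehot.mean()
    neighborhood_means = onehot.groupby(venues["Neighborhood"]).mean()
    # Step (3): the Series of global means is aligned on the category columns.
    return neighborhood_means - global_means


differences = mean_differences(venues)
```

In this toy data Carlow has a positive difference for Shopping and a negative one for Cultural, which would be read as a strength and a gap respectively.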


#### Results

After mapping and clustering, the quantity and global means of the 9 Major Venue Categories are as shown below: 

![title](majorCategory_globalMeans.jpg)

The venues dataframe after clustering is as shown in the example below. Note that 1 has been added to the cluster numbering to facilitate easier interpretation.

![title](dataframe_clusters.jpg)

The neighborhood means for each Major Venue Category are as shown below:

![title](neighborhood_means.jpg)

The differences between the neighborhood means and the global means for each Major Venue Category, grouped by neighborhood, are as shown below:

![title](mean_differences.jpg)

Note : Although data has been processed for all 31 neighborhoods in the counties dataframe, to keep the report short, final results for 4 neighborhoods only are shown and discussed below. The methods used, however, apply equally to all neighborhoods.

The neighborhoods selected for illustration are Carlow, Waterford, Kilkenny and Limerick. 

Bar graphs illustrating the percentage differences for each major category are shown below. Blue bars indicate that the neighborhood has more of this category than the global mean; red bars indicate that the neighborhood has less of this category than the global mean.

![title](neighborhood_graphs.jpg)
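
A chart of this kind could be produced along the following lines with matplotlib. This is only a sketch: the helper names are hypothetical, the values are illustrative rather than results from the report, and a difference of exactly zero is arbitrarily drawn in blue.

```python
import matplotlib.pyplot as plt


def bar_colors(differences):
    """Blue where the neighborhood is at or above the global mean, red where below."""
    return ["blue" if d >= 0 else "red" for d in differences]


def plot_differences(name, categories, differences):
    """Draw one bar per major category, with differences shown as percentages."""
    percentages = [100 * d for d in differences]
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.bar(categories, percentages, color=bar_colors(differences))
    ax.axhline(0, color="black", linewidth=0.8)  # baseline at the global mean
    ax.set_title(f"{name}: difference from global mean")
    ax.set_ylabel("Difference (%)")
    ax.tick_params(axis="x", labelrotation=45)
    return fig, ax


# Illustrative values only, not results from the report.
categories = ["Shopping", "Bar", "Restaurant"]
differences = [0.12, -0.02, -0.01]
fig, ax = plot_differences("Example", categories, differences)
```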

#### Results Analysis

The results and map displays obtained for the 4 neighborhoods illustrated can be summarized as follows:

##### Carlow
    * With approximately 12% more venues than the global mean, Shopping would be a good potential selling point for Carlow.
    * The other categories are more or less on a par with the global mean, although with approximately 1% to 2% fewer venues than the global mean, the number of Restaurants, Fast Food Outlets and Bars could be increased by a small amount.

![title](map_carlow.jpg)

##### Waterford
    * With approximately 8% and 5% more venues respectively than the global mean, Cultural Venues and Accommodation would be good selling points for Waterford.
    * With 2% and 10% fewer venues respectively than the global mean, Shopping venues and Bars are areas which could be improved upon.

![title](map_waterford.jpg)

##### Kilkenny
    * Overall Kilkenny fares well when compared to the global mean in most categories.
    * Areas which could be used as selling points are Cultural Venues and Bars, with approximately 2% to 3% more venues than the global mean.
    * Areas which could be improved slightly are Restaurants and Fast Food venues, with 2% and 5% fewer venues respectively than the global mean.

![title](map_kilkenny.jpg)

##### Limerick
    * Overall Limerick fares well when compared to the global mean in most categories.
    * Areas which could be used as selling points are Restaurants and Bars, with approximately 2% to 3% more venues than the global mean.
    * Areas which could be improved upon are Shopping, Cultural and Entertainment venues.
    
![title](map_limerick.jpg)

#### Discussion

##### Limitations of Available Data

Applications such as this one depend almost exclusively on data: the more relevant the data available, the better the results. This project used only one source of social demographic data, Foursquare, so the results obtained depend entirely on the data present within the Foursquare databases. This is particularly obvious in the lack of results obtained for more rural areas; it seems that the majority of Foursquare users concentrate on venues within urban areas. To get a more conclusive and accurate result, I would suggest using as many similar data sources as possible, from applications such as TripSavvy, Booking.com and TripAdvisor. This would provide data from as wide a geographical and demographic area as possible.

##### Further Steps

The application could be extended in a number of areas. A few examples are:

    (1) Retain and report on the sub-categories contained within the Major Categories to provide a more refined result.
    
    (2) Add an input interface to allow users to specify locations interactively.
    
    (3) Allow users to build custom queries, e.g. to explore by category and location.
    
    (4) Add profitability functionality by incorporating footfall and profit data.
    
    (5) Predict profitability by using Machine Learning to predict footfall by location and category, then extrapolating profitability.

#### Conclusion

Although the application only uses one source of data, the results show that the methodology used is sound and valid and thus the application could form a major part of any decision making process by the Irish Tourism Industry Confederation involving geographic and demographic data.

#### References 

Impact of Tourism on the Irish Economy : https://www.itic.ie/the-impact-of-tourism-on-the-irish-economy/

Adding Legend to Folium Maps : https://tilemill-project.github.io/tilemill/docs/guides/advanced-legends/