## Data science project for ANLY-501: Relationship between the existence of Starbucks stores and the economic and urban development indices of countries

>This notebook provides a project report for the data science project done as part of ANLY-501. The data science project intends to show all stages of the data science pipeline. This notebook facilitates that by organizing different phases into different sections and having the code and visualizations all included in one place. Install Jupyter notebook software available from http://jupyter.org/ to run this notebook.

### Data Science Problem

Starbucks Corporation is an American coffee company and coffeehouse chain with more than 24,000 stores across the world. This project explores the following data science problems:
1. Exploratory Data Analysis (EDA) of Starbucks store locations, for example the geographical distribution of stores by country, region, ownership model, brand name, etc.
2. Find relationships between the Starbucks data and various economic and human development indices such as GDP, ease of doing business, rural to urban population ratio, literacy rate, revenue from tourist inflow, and so on.
3. Predict which countries without a Starbucks store today are most suitable for one (in other words, in which country where Starbucks has no presence should it open its next stores, and how many).

### Potential Analysis that Can Be Conducted Using Collected Data 

The data to be used as part of this project is obtained from two sources.
1. Starbucks store location data is available from https://opendata.socrata.com via an API. The Socrata Open Data API (SODA) endpoint for this data is https://opendata.socrata.com/resource/xy4y-c4mk.json

2. The economic and urban development data for various countries is available from the World Bank (WB) website. The WB APIs are described at https://datahelpdesk.worldbank.org/knowledgebase/topics/125589-developer-information. The API endpoint for all the World Development Indicators (WDI) is http://api.worldbank.org/indicators?format=json. Examples of the indicators being collected include GDP, urban to rural population ratio, % of employed youth (15-24 years), international tourism receipts (as % of total exports), ease of doing business, and so on.
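
As a rough illustration of how the store data could be pulled from the SODA endpoint listed above, here is a minimal sketch using only the Python standard library. It assumes the endpoint accepts SODA's `$limit` and `$offset` paging parameters; the function names (`build_page_url`, `fetch_all_rows`) and the page size of 1000 are illustrative choices, not part of the project code.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the data source description above.
SODA_ENDPOINT = "https://opendata.socrata.com/resource/xy4y-c4mk.json"


def build_page_url(endpoint, limit=1000, offset=0):
    """Return the URL for one page of results using $limit/$offset paging."""
    # safe="$" keeps the literal "$" in the parameter names for readability.
    query = urlencode({"$limit": limit, "$offset": offset}, safe="$")
    return f"{endpoint}?{query}"


def fetch_all_rows(endpoint=SODA_ENDPOINT, page_size=1000):
    """Download every row by requesting pages until an empty page comes back.

    Assumption: the endpoint returns a JSON array of row objects per page.
    """
    rows, offset = [], 0
    while True:
        with urlopen(build_page_url(endpoint, page_size, offset)) as response:
            page = json.load(response)
        if not page:
            return rows
        rows.extend(page)
        offset += len(page)
```

Calling `fetch_all_rows()` would return a list of dictionaries, one per store, which can then be loaded into whatever tabular structure the later analysis uses.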

The possible directions / hypotheses based on the collected data (this is not a complete list and will be expanded as the project goes on):
1. EDA about Starbucks store locations (for example):
 - What percentage of stores exist in high income, high literacy, high urban to rural population ratio European countries versus, say, high population, rising GDP, low urban to rural population ratio Asian countries.
 - Distribution of stores across geographies based on type of ownership (franchisee, joint venture, etc.), brand name, etc.
 - Which country and which city have the most Starbucks stores per 1000 people.
 - Is there always a Starbucks open at any UTC time during a 24-hour period, i.e. can you find some Starbucks store open at any given time in some timezone around the world.

2. Data visualization of the Starbucks store data:
 - World map showing Starbucks locations around the world.
 - Heat map of the world based on the number of Starbucks stores in a country.
 - Frequency distribution of Starbucks stores by city in a given country.
 - Parallel coordinates based visualization for number of stores combined with economic and urban development indicators.
 
3. Machine learning model for predicting the number of Starbucks stores in a country based on various economic and urban development indicators. This could then be used to predict which countries where Starbucks does not have a presence today would be best suited as new markets.
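
The "always open" question from the EDA list above can be framed as an interval coverage problem: do the opening intervals of all stores, converted to UTC, cover the whole day? The sketch below is one possible way to check this. It assumes opening hours have already been converted to minutes after midnight UTC, that an interval whose closing time is earlier than its opening time wraps past midnight, and that equal opening and closing times mean the store is open around the clock. The function name `covers_full_day` is hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def covers_full_day(intervals):
    """Return True if the union of (open, close) intervals covers 24 hours.

    Each interval is a pair of minutes after midnight UTC in [0, 1440).
    Assumptions: close < open means the interval wraps past midnight,
    and open == close means the store never closes.
    """
    segments = []
    for open_min, close_min in intervals:
        if open_min == close_min:
            segments.append((0, MINUTES_PER_DAY))
        elif open_min < close_min:
            segments.append((open_min, close_min))
        else:
            # Split a wrapping interval into its two same-day pieces.
            segments.append((open_min, MINUTES_PER_DAY))
            segments.append((0, close_min))

    covered_until = 0
    for start, end in sorted(segments):
        if start > covered_until:
            return False  # gap between covered_until and start
        covered_until = max(covered_until, end)
    return covered_until >= MINUTES_PER_DAY
```

For example, `covers_full_day([(0, 720), (600, 1440)])` is `True`, while `covers_full_day([(0, 600), (700, 1440)])` is `False` because no store is open between minute 600 and minute 700.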
 

### Data Issues

The data used for this project is obtained via APIs from the Socrata website and the World Bank website and is therefore expected to be relatively error free (for example, compared to the same data obtained by scraping these websites). Even so, the data is checked for quality, and appropriate error handling or alternate mechanisms are put in place to handle errors or issues with the data.

| Issue         | Handling Strategy| 
| ------------- |-------------| 
| Some of the city names (for example, cities in China) include non-ASCII (Unicode) characters and would therefore not display correctly in documents and charts. | Replace the city name with the country code followed by _1, _2, and so on, for example CN_1, CN_2, etc. |
| Missing data in any of the fields in the Starbucks dataset. | Ignore the data for the location with any missing value. Keep a count of the locations ignored due to missing data to get a sense of the overall quality of data. |
| Incorrect format of the value in various fields in the Starbucks dataset. For example, latitude/longitude values being out of range, country codes being invalid, etc. | Ignore the data for the location with any invalid value. Keep a count of the locations ignored due to invalid data to get a sense of the overall quality of data. |
| Missing data for any of the indicators in the WB dataset. | The most recent year for which the data is available is 2015. If the 2015 data is not available for a particular indicator, use the data for the previous year, i.e. 2014, and so on. If no data is available for that indicator even for the previous 5 years, flag it as such and have the user define a custom value for it. |
| Incorrect format of the value in various fields in the WB dataset. For example, alphanumeric or non-numeric data for fields such as GDP for which numeric values are expected. | Provide sufficient information to the user (in this case the programmer) about the incorrect data and have the user define correct values. |
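
The handling strategies in the table above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual implementation: the field names in `REQUIRED_FIELDS`, the two-uppercase-letter shape check for country codes, and all function names are hypothetical, and the World Bank values are assumed to be available as a mapping from year to value.

```python
import re
from collections import Counter

# Hypothetical field names; the real dataset may use different keys.
REQUIRED_FIELDS = ("country", "city", "latitude", "longitude")


def check_store(record):
    """Return None if the record is usable, else "missing" or "invalid"."""
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            return "missing"
    try:
        lat = float(record["latitude"])
        lon = float(record["longitude"])
    except (TypeError, ValueError):
        return "invalid"
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return "invalid"
    # Shape check only (assumed two uppercase letters), not a lookup
    # against a list of real country codes.
    if not re.fullmatch(r"[A-Z]{2}", str(record["country"])):
        return "invalid"
    return None


def filter_stores(records):
    """Keep usable records and count the ignored ones by reason."""
    kept, skipped = [], Counter()
    for record in records:
        reason = check_store(record)
        if reason is None:
            kept.append(record)
        else:
            skipped[reason] += 1
    return kept, skipped


def latest_value(values_by_year, latest_year=2015, max_lookback=5):
    """Return (year, value) for the most recent year with data.

    Looks at latest_year and up to max_lookback earlier years; returns
    None when nothing is found so the caller can flag the indicator.
    """
    for year in range(latest_year, latest_year - max_lookback - 1, -1):
        value = values_by_year.get(year)
        if value is not None:
            return year, value
    return None
```

The `skipped` counter gives the per-reason totals that the table suggests tracking, and a `None` result from `latest_value` marks an indicator for which the user would need to supply a custom value.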

 