## Data science project for ANLY-501|Relationship between existence of Starbucks stores and economic and urban development indices of countries.

>This notebook provides a project report for the data science project done as part of ANLY-501. The data science project intends to show all stages of the data science pipeline. This notebook facilitates that by organizing different phases into different sections and having the code and visualizations all included in one place. Install Jupyter notebook software available from http://jupyter.org/ to run this notebook.
>All code for this project is available on Github as https://github.com/aarora79/sb_study and there is also a simple website associated with it https://aarora79.github.io/sb_study/. The entire code (including the generated output) can be downloaded as a zip file from Github via this URL https://github.com/aarora79/sb_study/archive/master.zip.

### Data Science Problem

Starbucks Corporation is an American coffee company and coffeehouse chain with more than 24,000 stores across the world. This project intends to explore the following data science problems:
1. Exploratory Data Analysis (EDA) of Starbucks store locations, for example the geographical distribution of stores by country, region, ownership model, brand name etc.
2. Find relationships between the Starbucks data and various economic and human development indices such as GDP, ease of doing business, rural to urban population ratio, literacy rate, revenue from tourist inflow and so on.
3. Predict which countries without a Starbucks store today are most suitable for one (in other words, in which country where Starbucks has no presence should it open its next stores, and how many).

This problem is important because it attempts to provide a model (using Starbucks as an example) which can be applied to any similar business (say in the food and hospitality industry) to predict where it could expand globally. It also provides insight into where the business is located currently, for example that the total number of stores in the entire continent of Africa (48) is far smaller than the number of stores found in U.S. airports alone (184).

### Potential Analysis that Can Be Conducted Using Collected Data 

The data used as part of this project is obtained from two sources.
1. Starbucks store location data is available from https://opendata.socrata.com via an API. The Socrata Open Data API (SODA) end point for this data is https://opendata.socrata.com/resource/xy4y-c4mk.json.

2. The economic and urban development data for various countries is available from the World Bank(WB) website. WB APIs are described here https://datahelpdesk.worldbank.org/knowledgebase/topics/125589-developer-information. The API end point for all the World Development Indicators (WDI) is http://api.worldbank.org/indicators?format=json. Some examples of indicators being collected include GDP, urban to rural population ratio, % of employed youth (15-24 years), international tourism receipts (as % of total exports), ease of doing business and so on.
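
Downloading from the SODA end point listed above can be sketched as follows. This is a minimal illustration and not necessarily how the project's own code fetches the data: it assumes the end point is still reachable and pages through results with SODA's `$limit` and `$offset` query parameters. The function names and the page size of 1000 are choices made for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SODA_ENDPOINT = "https://opendata.socrata.com/resource/xy4y-c4mk.json"

def page_url(offset, limit=1000):
    """Build the URL for one page of results using $limit/$offset paging."""
    query = urlencode({"$limit": limit, "$offset": offset}, safe="$")
    return SODA_ENDPOINT + "?" + query

def fetch_all(limit=1000):
    """Download every record page by page (needs network access)."""
    records, offset = [], 0
    while True:
        with urlopen(page_url(offset, limit)) as resp:
            page = json.load(resp)
        if not page:  # an empty page means there is nothing left to read
            return records
        records.extend(page)
        offset += limit
```

Calling `fetch_all()` would return a list of dictionaries, one per store, which can be passed to `pandas.DataFrame` for analysis.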

The possible directions/hypotheses based on the collected data (this is not the complete list; it will be expanded as the project goes on):
1. EDA about Starbucks store locations (for example):
 - What percentage of stores exist in high income, high literacy, high urban to rural population ratio European countries vs. say high population, rising GDP, low urban to rural population ratio Asian countries?
 - Distribution of stores across geographies based on type of ownership (franchisee, joint venture etc.), brand name etc.
 - Which country and which city have the most Starbucks stores per 1000 people?
 - Is there always a Starbucks open at any UTC time during a 24-hour period, i.e. can you always find some Starbucks store open at any given time in some timezone around the world?

2. Data visualization of the Starbucks store data:
 - World map showing Starbucks locations around the world.
 - Heat map of the world based on the number of Starbucks stores in a country.
 - Frequency distribution of Starbucks stores by city in a given country. Does this distribution resemble any well-known statistical distribution?
 - Parallel coordinates based visualization for number of stores combined with economic and urban development indicators.
 
3. Machine learning model for predicting the number of Starbucks stores based on various economic and urban development indicators. This could then be used to predict which countries where Starbucks does not have a presence today would be best suited as new markets. For example, model the number of Starbucks locations in a country based on a) ease of doing business, b) GDP, c) urban to rural population ratio, d) employment rate among people in the 15-24 year age group, e) type of government, f) access to 24-hour electricity and so on.
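
As an illustration of one of the EDA questions above, stores per 1000 people can be computed by counting stores per country and dividing by population. The sketch below assumes store records with a `country` field (as in the Starbucks dataset) and a hypothetical `population` mapping. The country codes and figures are invented purely for the example.

```python
from collections import Counter

def stores_per_thousand(stores, population):
    """Return {country: stores per 1000 people} where population is known."""
    counts = Counter(s["country"] for s in stores if s.get("country"))
    return {
        country: 1000.0 * n / population[country]
        for country, n in counts.items()
        if population.get(country)  # skip unknown or zero population
    }

# Toy records; real input would be the rows of the Starbucks dataset.
stores = [{"country": "AA"}, {"country": "AA"}, {"country": "BB"}]
population = {"AA": 4000, "BB": 1000}  # hypothetical figures
rates = stores_per_thousand(stores, population)
```

The same pattern works at city level by keying on a `(country, city)` pair, provided city populations are available.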
 

### Data Issues

The data used for this project is being obtained via APIs from the Socrata web site and the World Bank website and is therefore expected to be relatively error free (for example as compared to the same data being obtained by scraping these websites). Even so, the data is checked for quality and appropriate error handling or even alternate mechanisms are put in place to handle errors/issues with the data.

| Issue         | Handling Strategy| 
| ------------- |-------------| 
| Some of the city names (for example, cities in China) include UTF-16 characters and would therefore not display correctly in documents and charts.      |  Replace the city name with the country code followed by _1, _2 and so on, for example CN_1, CN_2 etc.|
| Missing data in any of the fields in the Starbucks dataset. | Ignore the data for the location with any mandatory field missing (say country name is missing). Keep a count of the locations ignored due to missing data to get a sense of the overall quality of data. |
| Incorrect format of the value in various fields in the Starbucks dataset. For example Latitude/Longitude values being out of range, country codes being invalid etc.|  Ignore the data for the location with any invalid value. Keep a count of the locations ignored due to invalid data to get a sense of the overall quality of data. |
| Misc. data type related errors such as date time field (first_seen field in Starbucks dataset) not being correct, fields considered as primary key not being unique (for example store id for Starbucks dataset, country code for WB dataset) | Flag all invalid fields as errors. |
| Missing data for any of the indicators in the WB dataset. | The most recent year for which the data is available is 2015. If the 2015 data is not available for a particular indicator, use data for the previous year, i.e. 2014, and so on. If no data is available for that indicator even for the previous 5 years then flag it as such and have the user define some custom value for it.|
| Incorrect format of the value in various fields in the WB dataset. For example alphanumeric or non-numeric data  for fields such as GDP for which numeric values are expected.|  Provide sufficient information to the user (in this case the programmer) about the incorrect data and have the user define correct values.|
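
The fallback strategy for missing WB indicator values could look like the sketch below. It is a minimal illustration that assumes one indicator's values are held in a dictionary keyed by year, with `None` marking a missing value. The function name and data layout are hypothetical, not taken from the project code.

```python
def latest_available(values_by_year, latest_year=2015, lookback=5):
    """Return (year, value) for the most recent non-missing value, or None.

    Checks latest_year first, then up to `lookback` earlier years.
    """
    for year in range(latest_year, latest_year - lookback - 1, -1):
        value = values_by_year.get(year)
        if value is not None:
            return year, value
    return None  # caller flags the indicator and asks the user for a value

# Made-up series: nothing for 2015, so the 2014 value is used.
series = {2015: None, 2014: 3.2, 2013: 3.0}
result = latest_available(series)
```

Returning the year along with the value keeps a record of how stale each substituted data point is.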

The subsequent sections of this notebook provide a view of the data for the two datasets used in this project and also provide the data quality scores (DQS) for both the datasets.

# Datasets (Starbucks and WorldBank WDI)

The Python code for the SB Study (SBS) is run offline and the results (along with the code) are periodically pushed to Github. The code here simply downloads the files from Github to show the results and what the data looks like.

In [6]:
import pandas as pd
#get the Starbucks dataset from Github
SB_DATASET_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/SB_data_as_downloaded.csv'
df = pd.read_csv(SB_DATASET_URL)

print('Starbucks dataset has %d rows and %d columns ' %(df.shape[0], df.shape[1]))
df.head()

Starbucks dataset has 24823 rows and 21 columns 


Unnamed: 0,brand,city,coordinates,country,country_subdivision,current_timezone_offset,first_seen,latitude,longitude,name,...,ownership_type,phone_number,postal_code,store_id,store_number,street_1,street_2,street_3,street_combined,timezone
0,Starbucks,Hong Kong,"{u'latitude': u'22.3407001495361', u'needs_rec...",CN,91,480,2013-12-08T22:41:59,22.3407,114.201691,Plaza Hollywood,...,LS,{u'phone_number': u'29554570'},,1,34638-85784,"Level 2, Plaza Hollywood, Diamond Hill,",Kowloon,,"Level 2, Plaza Hollywood, Diamond Hill,, Kowloon",China Standard Time
1,Starbucks,Hong Kong,"{u'latitude': u'22.2839393615723', u'needs_rec...",CN,91,480,2013-12-08T22:41:59,22.283939,114.158188,Exchange Square,...,LS,{u'phone_number': u'21473739'},,6,34601-20281,"Shops 308-310, 3/F.,","Exchange Square Podium, Central, HK.",,"Shops 308-310, 3/F.,, Exchange Square Podium, ...",China Standard Time
2,Starbucks,Kowloon,"{u'latitude': u'22.3228702545166', u'needs_rec...",CN,91,480,2013-12-08T22:41:59,22.32287,114.21344,Telford Plaza,...,LS,{u'phone_number': u'27541323'},,8,34610-28207,"Shop Unit G1A, Atrium A, Telford Plaza I",", Kowloon Bay, Kowloon",,"Shop Unit G1A, Atrium A, Telford Plaza I, , Ko...",China Standard Time
3,Starbucks,Hong Kong,"{u'latitude': u'22.2844505310059', u'needs_rec...",CN,91,480,2013-12-08T22:41:59,22.284451,114.158463,Hong Kong Station,...,LS,{u'phone_number': u'25375216'},,13,34622-64463,Concession HOK 3a & b,LAR Hong Kong Station,,"Concession HOK 3a & b, LAR Hong Kong Station",China Standard Time
4,Starbucks,Hong Kong,"{u'latitude': u'22.2777309417725', u'needs_rec...",CN,91,480,2013-12-08T22:41:59,22.277731,114.164917,"Pacific Place, Central",...,LS,{u'phone_number': u'29184762'},,17,34609-22927,"Shop 131, Level 1, Pacific Place","88 Queensway, HK",,"Shop 131, Level 1, Pacific Place, 88 Queensway...",China Standard Time


In [7]:
import pandas as pd
#get the WorldBank WDI dataset from Github
WB_DATASET_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/WDI_data_as_downloaded.csv'
df = pd.read_csv(WB_DATASET_URL)

print('Worldbank dataset has %d rows and %d columns ' %(df.shape[0], df.shape[1]))
df.head()

Worldbank dataset has 264 rows and 84 columns 


Unnamed: 0,country_code,IC.TAX.LABR.CP.ZS,WP_time_01.1,SP.POP.1564.TO.ZS,IC.BUS.NDNS.ZS,IC.LGL.CRED.XQ,IC.GOV.DURS.ZS,DT.DOD.PVLX.GN.ZS,SE.ADT.LITR.ZS,IC.EXP.COST.CD,...,SL.SRV.EMPL.ZS,FI.RES.TOTL.DT.ZS,IC.FRM.BRIB.ZS,IC.TAX.OTHR.CP.ZS,IC.REG.COST.PC.ZS,IC.ELC.OUTG,SL.EMP.WORK.ZS,TX.VAL.OTHR.ZS.WT,EN.URB.LCTY.UR.ZS,SI.POV.2DAY
0,BD,,,65.579469,,6.0,,,,,...,,,,,13.9,,,62.45118,31.889816,
1,BE,49.4,,64.830742,,4.0,,,,,...,,,,0.6,4.8,,,59.897278,18.51681,
2,BF,21.4,,52.034628,,6.0,,,,,...,,,,3.7,43.5,,,,50.703959,
3,BG,20.2,,65.828506,,9.0,,,,,...,,,,1.8,0.7,,,,23.100215,
4,VE,18.0,,65.627593,,1.0,,,,,...,,,,37.1,88.7,,,14.755198,10.53417,


# Data Quality Scores (DQS)

To check for data quality, two types of checks were done on both the datasets.
1. **Missing data**: any cell in the dataset which was empty was counted as missing data. This is checked for all features 
   in both the datasets.
2. **Invalid data**: validation checks were done on many fields (9 fields in the Starbucks dataset, 
   and all 83 fields in the Worldbank dataset). These checks were both generic (validation for numeric data, 
   absence of special characters, validation of timestamp data, latitude and longitude validation, timezone offset and 
   so on) as well as context specific (store-id has to be unique).

Two metrics are derived for each of missing data and invalid data: a raw score and an adjusted score.

1. **Raw Score for Missing Data**: this is the percentage of data that is not missing. Simply speaking, this is the count of non-empty cells vs. the total number of cells, expressed as a percentage. The higher this score, the cleaner the dataset is from the perspective of missing data. This score is available for both Starbucks and WorldBank datasets.

2. **Adjusted Score for Missing Data**: this is the percentage of data that is not missing for **Mandatory Features** (i.e. features without which the record would have to be discarded). Simply speaking, this is the count of non-empty cells in mandatory columns vs. the total number of cells in those columns, expressed as a percentage. In case all features are mandatory, the adjusted score and the raw score are the same. The higher this score, the cleaner the dataset is from the perspective of missing data. This score is available for both Starbucks and WorldBank datasets.

3. **Raw Score for Invalid Data**: this is the percentage of data that is not invalid. Simply speaking, this is the count of cells containing valid data vs. the total number of cells, expressed as a percentage. The higher this score, the cleaner the dataset is from the perspective of invalid data. Validity checks are both generic and context specific as defined above. This score is available for both Starbucks and WorldBank datasets.

4. **Adjusted Score for Invalid Data**: this is the percentage of data that is not invalid for **Mandatory Features** (i.e. features without which the record would have to be discarded). Simply speaking, this is the count of cells containing valid data in mandatory columns vs. the total number of cells in those columns, expressed as a percentage. The higher this score, the cleaner the dataset is from the perspective of invalid data. Validity checks are both generic and context specific as defined above. This score is available for both Starbucks and WorldBank datasets.
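
The missing-data scores described above can be sketched with pandas as follows. This is an illustrative computation, not the project's actual implementation: it assumes the adjusted score is taken over the cells of the mandatory columns only, and the choice of mandatory columns in the toy example is hypothetical.

```python
import pandas as pd

def missing_data_scores(df, mandatory):
    """Return (raw, adjusted) percentages of non-missing cells."""
    raw = 100.0 * df.notna().sum().sum() / df.size
    subset = df[mandatory]  # only the mandatory columns count here
    adjusted = 100.0 * subset.notna().sum().sum() / subset.size
    return raw, adjusted

# Toy data: the optional street_2 column is mostly empty.
toy = pd.DataFrame({
    "store_id": [1, 2, 3, 4],
    "country": ["AA", "BB", "AA", "CC"],
    "street_2": [None, "x", None, None],
})
raw, adjusted = missing_data_scores(toy, mandatory=["store_id", "country"])
```

Here 9 of the 12 cells are filled, so the raw score is 75.0, while every mandatory cell is present, so the adjusted score is 100.0. The invalid-data scores follow the same pattern with a boolean validity mask in place of `notna()`.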

### Validity checks implemented for Starbucks dataset

| Field         | Validity Check| 
| ------------- |-------------| 
| Store Id      |  Has to be unique for each row|
| Latitude/Longitude|Longitude measurements range from 0° to (+/–)180°, Latitude from -90° to +90°|
| Timezone offset| Has to be divisible by 15|
| Country code | Has to be present in a country code list downloaded offline|
| Brand | Has to be within one of 6 brands listed on Starbucks website http://www.starbucks.com/careers/brands|
| Store Number | Has to follow a format XXXX-YYYY|
| Timezone name| Has to have either "Standard Time" or  "UTC" as part of name|
| Ownership Type | Should not contain special characters | 
| First seen | Has to follow a date format |
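
A few of the checks in this table can be sketched in plain Python. The example below is a hedged illustration rather than the project's code: field names follow the column names of the Starbucks dataset, the regular expression is one assumed reading of the "XXXX-YYYY" store number format, and the date format is inferred from the sample `first_seen` values shown earlier.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed reading of "XXXX-YYYY": two digit groups joined by a hyphen.
STORE_NUMBER_RE = re.compile(r"^\d+-\d+$")

def invalid_fields(row):
    """Return the names of fields in one record that fail a check."""
    bad = []
    if not -90 <= row["latitude"] <= 90:
        bad.append("latitude")
    if not -180 <= row["longitude"] <= 180:
        bad.append("longitude")
    if row["current_timezone_offset"] % 15 != 0:
        bad.append("current_timezone_offset")
    if not STORE_NUMBER_RE.match(row["store_number"]):
        bad.append("store_number")
    try:
        datetime.strptime(row["first_seen"], "%Y-%m-%dT%H:%M:%S")
    except ValueError:
        bad.append("first_seen")
    return bad

def duplicate_store_ids(rows):
    """Uniqueness is a dataset-level check: list store ids seen more than once."""
    counts = Counter(r["store_id"] for r in rows)
    return sorted(sid for sid, n in counts.items() if n > 1)

# First record from the dataset preview shown earlier.
row = {
    "store_id": 1, "latitude": 22.3407, "longitude": 114.201691,
    "current_timezone_offset": 480, "store_number": "34638-85784",
    "first_seen": "2013-12-08T22:41:59",
}
```

Returning field names instead of a single pass/fail flag makes it straightforward to count invalid cells per column when computing the invalid-data scores.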

### Validity checks implemented for Worldbank dataset
| Field         | Validity Check| 
| ------------- |-------------| 
| Country Code      |  Should not contain special characters|
| All 83 features (WDIs)| Should be numeric as all of them represent a numerical measure of the feature|
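
The numeric check on the WDI features can be sketched as below. This is a minimal example under stated assumptions: each record is a dictionary of raw string cells, an empty cell counts as missing rather than invalid, and the helper names are hypothetical.

```python
def is_numeric(cell):
    """True if the cell is empty (missing, not invalid) or parses as a float."""
    if cell is None or str(cell).strip() == "":
        return True
    try:
        float(cell)
        return True
    except (TypeError, ValueError):
        return False

def invalid_cells(row, skip=("country_code",)):
    """Return the indicator names in one record whose value is not numeric."""
    return [k for k, v in row.items() if k not in skip and not is_numeric(v)]

# Toy record: the last value is deliberately malformed.
record = {
    "country_code": "BE",
    "IC.TAX.LABR.CP.ZS": "49.4",
    "SE.ADT.LITR.ZS": "",
    "IC.REG.COST.PC.ZS": "4.8a",
}
flagged = invalid_cells(record)
```

Keeping missing and invalid cells separate matches the way the two kinds of score are reported separately.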



### DQS for Starbucks and WorldBank datasets
The following code fragment downloads a CSV file containing these scores from the Github repo of this project. These scores have been calculated offline by running the program and uploading the results as part of the Github repo.

In [5]:
import pandas as pd
DQS_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/dqs.csv'

df = pd.read_csv(DQS_URL)
df.head()

Unnamed: 0,Datasource,Invalid_Data_Raw_Score,Invalid_Data_Adjusted_Score,Missing_Data_Raw_Score,Missing_Data_Adjusted_Score
0,WorldBank,100.0,100.0,33.64899,33.64899
1,Starbucks,99.999233,99.999233,91.523031,100.0
