# Predicting the Success of Starbucks Locations (part 2)

**Author**: <a href = "https://www.linkedin.com/in/alexis-raymond-telfer/">Alexis Raymond</a>  
**Date Modified**: 2019-04-18

## Table of Contents

1. [Exploratory Data Analysis](#eda)

## 1. Exploratory Data Analysis <a name="eda"></a>

Now that we have a clean dataframe with all the selected features for a predictive model, we can explore the data in order to gather some insights.

### Import analysis and visualization libraries

In [1]:
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner

import matplotlib.pyplot as plt # library for simple data visualizations
import seaborn as sns # library for more advanced data visualizations


### Import locations dataframe

In [3]:
starbucks_locations = pd.read_csv("starbucks_ratings.csv") # Read locations dataset in dataframe
starbucks_locations.drop(columns=['Unnamed: 0', 'Store Number', 'Foursquare ID'], inplace=True) # Drop columns that are not needed for the analysis
starbucks_locations.head() # Show first 5 rows of dataframe


## Analyze target feature

The target variable for this research project is the average rating of each Starbucks location, retrieved with the Foursquare Places API. First, let's calculate basic statistics for this feature to better understand it.

In [4]:
starbucks_locations['Rating'].describe() # Calculate basic statistics on the ratings


To better understand this rating, we need to know how it is distributed. The first step is to visualize it with a histogram.

In [5]:
fig, ax = plt.subplots(1,1, dpi = 100) # Create plot

sns.histplot(starbucks_locations['Rating'], bins=15, kde=True, stat='density', ax=ax) # Create histogram with KDE for the ratings distribution (histplot replaces the deprecated distplot)

ax.get_yaxis().set_ticklabels([]) # Remove y tick labels
sns.despine(left = True) # Remove all spines except for the bottom one

plt.savefig('ratings_dist.png') # Save histogram to PNG


The distribution appears to be approximately normal and centered around 7.75. This is consistent with its mean of 7.64 and with 50% of the ratings falling between 7.30 and 8.10. Next, we can represent it with a box plot to see whether there are many outliers.

In [6]:
fig, ax = plt.subplots(1,1, figsize = (3, 6), dpi = 100) # Create plot

sns.boxplot(y = 'Rating', data = starbucks_locations) # Create boxplot for the ratings distribution

sns.despine() # Remove the top and right spines

plt.savefig('ratings_box.png') # Save boxplot to PNG


This boxplot tells us that there are only 5 outliers among the 384 observations and that all of them fall between 5.5 and 6.
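To make explicit what counts as an outlier here: seaborn's boxplot, with its default settings, draws as individual points any observation lying more than 1.5 times the interquartile range (IQR) beyond the first or third quartile. The sketch below applies that rule in plain Python. The function name `iqr_outliers` and the `sample` values are hypothetical and only illustrate the rule; they are not the notebook's data.

```python
from statistics import quantiles

def iqr_outliers(values, whis=1.5):
    """Return the values lying outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    # method="inclusive" interpolates linearly between data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whis * iqr, q3 + whis * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical ratings, for illustration only
sample = [5.6, 7.1, 7.3, 7.5, 7.6, 7.8, 8.0, 8.1, 8.4]
print(iqr_outliers(sample))  # [5.6]
```

Applied to the `Rating` column, the same rule should flag the points that the boxplot draws beyond its whiskers.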

## Relationship between distance to headquarters and rating

Before starting the exploratory data analysis, one hypothesis was that the farther a Starbucks location is from the Seattle headquarters, the less successful it is. It is now time to evaluate this hypothesis by visualizing the relationship between the two variables.

In [7]:
sns.lmplot(x = 'Distance to HQ', y = 'Rating', data = starbucks_locations) # Create regression plot showing the relationship between the distance to HQ and rating variables

plt.savefig('ratings_distance_regression.png') # Save regression plot to PNG


In [8]:
starbucks_locations[['Distance to HQ', 'Rating']].corr() # Calculate correlation 


Interestingly, the linear regression plot above suggests that our hypothesis was wrong. The correlation coefficient between the two variables is 0.21, which is low. On top of that, the regression slope runs in the opposite direction from what was expected: being positive, it implies that the farther a store is from the HQ, the more successful it is. However, since the correlation is so low, we conclude that there is no meaningful linear relationship between the distance to HQ and the success of the location.
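To put the coefficient in perspective, the sketch below computes a Pearson correlation by hand (the default method of pandas' `.corr()`) and squares the reported value of 0.21. In a simple linear regression, the squared correlation is the share of variance explained, so 0.21 corresponds to roughly 4% of the variance in rating. The function name `pearson_r` and the `distances` and `ratings` lists are hypothetical illustrations, not the notebook's data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical distances and ratings, for illustration only
distances = [100, 800, 1500, 2500, 4000]
ratings = [7.4, 7.9, 7.3, 8.0, 7.7]
r = pearson_r(distances, ratings)

# Share of variance explained for the coefficient reported in the text
r_squared_reported = 0.21 ** 2
print(round(r_squared_reported, 3))  # 0.044
```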

## City demographics

A predictive model works best when its features are not strongly correlated with one another. Therefore, we need to verify that the area and population density variables are not tied to each other.

In [9]:
starbucks_locations[['Area', 'Density']].corr() # Calculate correlation


As we can tell by the extremely low correlation coefficient of -0.06, there is no correlation between the area and population density variables. We can keep both variables in the model. Now, let's see if they are correlated with the location's rating.

In [10]:
sns.lmplot(x = 'Density', y = 'Rating', data = starbucks_locations) # Create regression plot showing the relationship between the population density and rating variables

plt.savefig('ratings_density_regression.png') # Save regression plot to PNG


In [11]:
sns.lmplot(x = 'Area', y = 'Rating', data = starbucks_locations) # Create regression plot showing the relationship between the area and rating variables

plt.savefig('ratings_area_regression.png') # Save regression plot to PNG


In [12]:
starbucks_locations[['Density', 'Area', 'Rating']].corr() # Calculate correlation


Once again, contrary to what was expected, there doesn't seem to be a correlation between the demographics of a city and the success of the Starbucks locations in it. The correlation coefficient between area and rating is -0.03, which is extremely low, and the one between density and rating is -0.17, which is low as well.

## Most popular venues

Now, let's find the 10 most common venues found around Starbucks locations.

In [13]:
# Print 10 most common venues and their total frequency
starbucks_locations.drop(['City', 'Country', 'Longitude', 'Latitude', 'Area', 'Density', 'Distance to HQ', 'Rating'], axis = 1).sum().sort_values(ascending = False).head(10)
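The one-liner above drops the non-venue columns, sums each remaining column, and keeps the largest totals. The sketch below spells out the same aggregation in plain Python. It assumes that every remaining column holds a count for one venue category; the `rows`, the column names, and the `non_venue` set are hypothetical and stand in for the real dataframe.

```python
from collections import Counter

# Hypothetical rows: identifying columns plus one count column per venue category
rows = [
    {"City": "A", "Rating": 7.9, "Coffee Shop": 3, "Hotel": 1, "Park": 0},
    {"City": "B", "Rating": 7.2, "Coffee Shop": 2, "Hotel": 0, "Park": 2},
    {"City": "C", "Rating": 8.1, "Coffee Shop": 1, "Hotel": 4, "Park": 0},
]
non_venue = {"City", "Rating"}  # columns to ignore, like the drop() call above

# Sum every venue column across all rows
totals = Counter()
for row in rows:
    for column, count in row.items():
        if column not in non_venue:
            totals[column] += count

# Keep the most frequent categories (top 2 here, top 10 in the notebook)
top_venues = totals.most_common(2)
print(top_venues)  # [('Coffee Shop', 6), ('Hotel', 5)]
```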
