During the Summer of 2015, my wife and I were delivering pizzas at the Domino's on State street. I made her collect data with me and then I manually input 1301 (deliveries) rows, 11 columns per row, of data into excel. The columns include; order time, distance from the store, tip amount, order amount, credit/cash order, credit/cash tip, person who delivered, gender of tipper, date delivered, and neighborhood. I did not keep any personally identifiable information.

I've been wanting to write up a magnum opus on this data and all of the insights but I've decided it's too much to try to include in one post. I've decided to answer one question at a time.

Today's question: Is there any correlation between delivery distance from the store and tip amount/tip percentage of order?

Hypothesis: People don't take into account or/nor care about the distance a delivery driver traveled in their tipping decisions.

In [110]:
#My first step is to run an Ordinary Least Squares Regression 
#model with tips ~ distances
import pandas as pd
import numpy as np
import matplotlib.pyplot as pl
import statsmodels.formula.api as sm
from statsmodels.api import add_constant

path = r'C:\Users\angelddaz\OneDrive\Documents\Data Training\data\RawDelData.csv'
data = pd.read_csv(path)

#Querying our data!
stats = data.loc[data['OrderAmount']>-100.00][['Tip','Distance']]

#This is the model magic
lm = sm.ols(formula='Tip ~ Distance', data=stats).fit()
#Intercept:3.0295	#Distance: 0.1588
lm.params

lm.params
#Therefore, our model is: Tip = $3.03 + $0.16(Miles)

Intercept    3.029544
Distance     0.158800
dtype: float64

In [111]:
X_new = pd.DataFrame({'Distance': [stats.Distance.min(), stats.Distance.max()]})
X_new.head()
#making the line with our observed distances
preds = lm.predict(add_constant(X_new), transform=False)
#plotting our scatter plot
ax1 = stats.plot(kind='scatter', x='Distance', y='Tip')

#Making it pretty
ax1.text(7, 20, r'Rsquared: 0.011', fontsize=15)
pl.xlabel('Tip Amount in Dollars')
pl.xlabel('Distance From Store in Miles')
pl.xlim(-0.25, 8.7)
pl.ylim(-0.30, 25.2)
pl.suptitle('Distances At Every Tip Amount', fontsize=18)
pl.title('Linear Regression Analysis',fontsize=15)

#plotting our line
ax2 = pl.plot(X_new,preds, c='red', linewidth=2)
pl.legend(ax2, ('Y = 3.0295 + 0.1588X',), loc='best')
lm.summary()
pl.show()

![Image of First Model](https://media.licdn.com/mpr/mpr/AAEAAQAAAAAAAAe9AAAAJDVlYjJmMjU2LTM2YjctNDViMC05MWU1LWM5YmFjYmZiZTM3OA.jpg)

This gets me a small slope, with passing t-score (3.737), P-Value (0.000), and passing confidence interval at a 95% confidence level. However, I have a terrible R-squared value with almost none of the variance in the data being accounted for by our model.

I knew that one of the neighborhoods we delivered to was Hidden Springs. A well to do neighborhood tucked away in the foothills, making it one of the farthest places we delivered to. I wondered if the high tip values were skewing the model because Hidden Springs was a wealthier neighborhood than average, not because they were farther from the store. 

In [112]:
#First we need to query the data that's past 5, 6, and 7 miles to see what's going on.

far = data.loc[data['Distance']>=5.00][['Tip','Area(text)', 'Distance']]
far
#This gives us 38 deliveries. 33 of which are Hidden Springs deliveries.
#Wealthier neighborhoods tip more, not because of distance. To support this claim
#let's look at Garden City data, which is closer to the store, poorer, and tips worse.

Unnamed: 0,Tip,Area(text),Distance
10,4.0,Hidden Springs,7.8
29,6.0,Hidden Springs,8.1
36,7.0,Hidden Springs,8.4
57,6.56,Hidden Springs,7.7
58,6.0,Hidden Springs,8.2
84,4.0,Hidden Springs,8.1
88,8.0,Hidden Springs,7.4
92,4.6,Bogus Basin - Brumback Hills,5.0
103,4.17,Hidden Springs,6.4
107,4.62,Hidden Springs,8.1


In [113]:
#To build my case that Hidden Springs tips better because they're of a higher socioeconomic status, not because they're farther,
#I'm going to use Garden City, an objectively poorer part of town as a contrast.

gcDels = data.loc[data['Area(text)']=='Garden City'][['Tip', 'Housing']]
hsDels = data.loc[data['Area(text)']=='Hidden Springs'][['Tip', 'Housing']]

mu_gc = gcDels.mean() #GC Delivery mean = 2.96
mu_hs = hsDels.mean() #HS Delivery mean = 5.24

sigma_gc = gcDels.std() #GC stdev = 2.0829
sigma_hs = hsDels.std() #HS stdev = 1.9950

variance_gc = gcDels.std() * gcDels.std() #GC variance = 4.3386
variance_hs = hsDels.std() * hsDels.std() #HS variance = 3.9799
variance_gc

n_gc = gcDels.count() #GC sample size = 221
n_hs = hsDels.count() #HS sample size = 34

degfreedom = n_gc + n_hs - 2 -1 #n-k-1: degrees of freedom = 252

On a surface level analysis, Hidden Springs does tip higher. However, this can be skewed by the low sample count in Hidden springs. There's a way to solve this problem, but before we do that... I want to gather more evidence.

In [114]:
#Let's do a breakdown of socioeconomic differences between Hidden Springs and Garden City
#Apartments, due to a lack of foresight, includes trailers.

#gcDels = data.loc[data['Area(text)']=='Garden City'][['Tip', 'Housing']]
apt = 0
house = 0
business = 0
hotel = 0
for h in gcDels['Housing']:
    if h == 'Apartment':
        apt = apt + 1
    elif h == 'House':
        house = house + 1
    elif h == 'Hotel':
        hotel = house + 1
    elif h == 'Business':
        business = business + 1

print "GC Houses: ",house
print "GC Apartments: ",apt
print "GC Businesses: ",business
print "GC Hotels: ",hotel

GC Houses:  75
GC Apartments:  97
GC Businesses:  32
GC Hotels:  71


In [115]:
#Now for a breakdown of housing types of the Hidden Springs neighborhood.
hsDels = data.loc[data['Area(text)']=='Hidden Springs'][['Tip', 'Housing']]

apt = 0
house = 0
business = 0
hotel = 0
for h in hsDels['Housing']:
    if h == 'Apartment':
        apt = apt + 1
    if h == 'House':
        house = house + 1
    if h == 'Hotel':
        hotel = house + 1
    if h == 'Business':
        business = business + 1

print "HS Houses: ",house
print "HS Apartments: ",apt
print "HS Businesses: ",business
print "HS Hotels: ",hotel

HS Houses:  33
HS Apartments:  1
HS Businesses:  0
HS Hotels:  0


There are more apartments/trailers in Garden City, making it a richer neighborhood in this case,making it more likely to get higher tips. A more precise and project for later on is to scrape zillow for sale prices to show this disparity even further.

For now, let's do a Difference of Two Sample means t-Test to account for our difference in sample sizes.

import nbconvert

In [None]:
jupyter nbconvert --to pdf Untitled.pdf