# Analyzing Thanksgiving Dinner Data

FiveThirtyEight published a post a while back looking at the most common Thanksgiving main dishes and side dishes, among other things, in different regions of the United States.

As part of a guided project on dataquest.io, I retrieved the data from the FiveThirtyEight GitHub repository (https://github.com/fivethirtyeight/data/tree/master/thanksgiving-2015) and am going to run a series of analyses on them.

In [1]:
import pandas as pd

dinner = pd.read_csv("thanksgiving-2015-poll-data.csv", encoding="Latin-1")
print(dinner.head(5)) # Just making sure the data is read in properly



   RespondentID Do you celebrate Thanksgiving?  \
0    4337954960                            Yes   
1    4337951949                            Yes   
2    4337935621                            Yes   
3    4337933040                            Yes   
4    4337931983                            Yes   

  What is typically the main dish at your Thanksgiving dinner?  \
0                                             Turkey             
1                                             Turkey             
2                                             Turkey             
3                                             Turkey             
4                                           Tofurkey             

  What is typically the main dish at your Thanksgiving dinner? - Other (please specify)  \
0                                                NaN                                      
1                                                NaN                                      
2                            

In [2]:
# Printing out the names of all the columns
dinner.columns

Index([u'RespondentID', u'Do you celebrate Thanksgiving?',
       u'What is typically the main dish at your Thanksgiving dinner?',
       u'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)',
       u'How is the main dish typically cooked?',
       u'How is the main dish typically cooked? - Other (please specify)',
       u'What kind of stuffing/dressing do you typically have?',
       u'What kind of stuffing/dressing do you typically have? - Other (please specify)',
       u'What type of cranberry saucedo you typically have?',
       u'What type of cranberry saucedo you typically have? - Other (please specify)',
       u'Do you typically have gravy?',
       u'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Brussel sprouts',
       u'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Carrots',
       u'Which of these side dishes arety

Not everyone celebrates Thanksgiving. So, just to clean up the data a little bit, I am going to remove everyone who doesn't celebrate the holiday.

In [3]:
#This will just tell us how many people don't celebrate Thanksgiving
dinner['Do you celebrate Thanksgiving?'].value_counts()

Yes    980
No      78
Name: Do you celebrate Thanksgiving?, dtype: int64

In [4]:
# Let's remove everyone who said no from the data
# This selects everyone who said yes, and stores the result back in 'dinner'
dinner = dinner[dinner['Do you celebrate Thanksgiving?'] == 'Yes']

## What do people eat for Thanksgiving dinner?

Now that we've cleaned up the data a little bit and removed the people who do not celebrate Thanksgiving, we'll look at what people eat for their main courses.

In [5]:
dinner['What is typically the main dish at your ' \
       'Thanksgiving dinner?'].value_counts()

Turkey                    859
Other (please specify)     35
Ham/Pork                   29
Tofurkey                   20
Chicken                    12
Roast beef                 11
I don't know                5
Turducken                   3
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64

In [6]:
# Looking just at the tofurkey people - do they use gravy?
tofurkey_df = dinner[dinner['What is typically the main dish at your ' \
                            'Thanksgiving dinner?'] == 'Tofurkey']
tofurkey_df['Do you typically have gravy?'].value_counts()

Yes    12
No      8
Name: Do you typically have gravy?, dtype: int64

So, as we can see from the two tables above, the vast majority of people have turkey as their main course. Then, looking just at those 20 weird families that do tofurkey as the main dish, I explored whether they use gravy. It turns out to be a 60/40 split: 12 of the 20 tofurkey families still use gravy.

## What do people eat for dessert?

Specifically, I am going to explore what proportion of people eat apple, pecan, or pumpkin pie for dessert. These data are not one-or-the-other. People can report having more than one type of pie for dessert.



In [7]:
# Creating a Boolean Series for apple pie
# isnull() returns True if they did NOT select apple pie
apple_isnull = dinner['Which type of pie is typically served at your ' \
                      'Thanksgiving dinner? Please select all that apply. ' \
                      '- Apple'].isnull()
# Do the same for pumpkin and pecan
pumpkin_isnull = dinner['Which type of pie is typically served at your ' \
                        'Thanksgiving dinner? Please select all that apply. ' \
                        '- Pumpkin'].isnull()
pecan_isnull = dinner['Which type of pie is typically served at your ' \
                      'Thanksgiving dinner? Please select all that apply. ' \
                      '- Pecan'].isnull()

# Now let's combine the three series so we can learn how many families
# selected none of these three pies (True) versus at least one (False)
no_pies = (apple_isnull) & (pumpkin_isnull) & (pecan_isnull)

# Display the counts: False = at least one pie, True = none of the three
no_pies.value_counts()

False    876
True     104
dtype: int64

So, from the above analysis we see that 876 households had at least one of these three pies for dessert on Thanksgiving. Here, a True means that the values for all three pies are null, so False means that at least one of them is not null (i.e., they typically eat at least one of apple, pumpkin, or pecan pie).
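The same question can be asked the other way around, which may be easier to read. Below is a minimal sketch on toy data, assuming (as in the survey file) that a selected pie holds its name and an unselected pie is NaN. The short column names are hypothetical stand-ins for the long survey questions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the three pie columns (hypothetical short names)
pies = pd.DataFrame({
    "apple":   ["Apple", np.nan, np.nan, "Apple"],
    "pumpkin": ["Pumpkin", np.nan, "Pumpkin", np.nan],
    "pecan":   [np.nan, np.nan, np.nan, "Pecan"],
})

# True when a respondent selected at least one of the three pies
has_pie = pies.notnull().any(axis=1)

# The all-null test from the cell above, which is simply the inverse
no_pie = pies.isnull().all(axis=1)

print(has_pie.value_counts())
```

With this version, True counts the households that eat pie, so the labels in the output match the question being asked.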

## Age

Next, a bit of age analysis. First, we need to do a bit more data cleaning. Let's quickly take a look and see how the ages are distributed across categories before we clean them up.

In [8]:
age_range = dinner['Age'].value_counts()
age_range.sort_index()

18 - 29    185
30 - 44    235
45 - 59    269
60+        258
Name: Age, dtype: int64

While this is somewhat informative to look at, these are strings. We need to turn them into values we can actually use. Unfortunately, given the ages are ranges, we are going to lose some information in the process.

So, I will assign the lowest age from each category to be that person's value. For example, the 269 people that answered 45 - 59 will be assigned the age 45.

First, I will write a small function to convert the strings to integers in the way I described above, and then display some descriptives.

In [9]:
def extract_age(age_str):
    if pd.isnull(age_str):
        return None
    age_str = age_str.split(' ')[0]
    age_str = age_str.replace('+', '')
    return int(age_str)

# apply the function to our age column
dinner['int_age'] = dinner['Age'].apply(extract_age)
# display some descriptives
dinner['int_age'].describe()

count    947.000000
mean      40.089757
std       15.352014
min       18.000000
25%       30.000000
50%       45.000000
75%       60.000000
max       60.000000
Name: int_age, dtype: float64

### Age Summary

So, with this analysis the average age of the respondents was 40, ranging between 18 and 60. This is, however, not an accurate representation of the data because we collapsed all age categories into their lower bound (i.e., 45 - 59 became 45).

As such, all of the statistics on age will be biased downward, with a restricted range.
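To get a feel for the size of that bias, here is a rough sketch that recomputes the mean from the bracket counts shown earlier, once with lower bounds and once with bracket midpoints. The midpoint mapping is not part of the analysis above; it is only an illustration, and because "60+" has no upper bound, its value is left at 60 as an assumption.

```python
import pandas as pd

# Respondent counts per age bracket, copied from the value counts above
counts = pd.Series({"18 - 29": 185, "30 - 44": 235, "45 - 59": 269, "60+": 258})

# Lower bound of each bracket (what extract_age returns)
lower = pd.Series({"18 - 29": 18, "30 - 44": 30, "45 - 59": 45, "60+": 60})

# Bracket midpoints; "60+" is open-ended, so it stays at 60 (assumption)
midpoint = pd.Series({"18 - 29": 23.5, "30 - 44": 37.0, "45 - 59": 52.0, "60+": 60.0})

def weighted_mean(values, weights):
    """Mean of the bracket values, weighted by respondents per bracket."""
    return (values * weights).sum() / weights.sum()

lower_mean = weighted_mean(lower, counts)
midpoint_mean = weighted_mean(midpoint, counts)
print(round(lower_mean, 2), round(midpoint_mean, 2))
```

The lower-bound mean reproduces the 40.09 reported by describe(), while the midpoint version comes out several years higher, which gives a sense of how much the rounding-down pulls the average.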

## Household Income

I am now going to do the same thing with income that I just did with age. We'll look at how the incomes are distributed, then turn the strings into integers, assigning the lower bound of the range to be the income value.

In [10]:
dinner['How much total combined money did all members of your ' \
       'HOUSEHOLD earn last year?'].value_counts()

$25,000 to $49,999      166
$50,000 to $74,999      127
$75,000 to $99,999      127
Prefer not to answer    118
$100,000 to $124,999    109
$200,000 and up          76
$10,000 to $24,999       60
$0 to $9,999             52
$125,000 to $149,999     48
$150,000 to $174,999     38
$175,000 to $199,999     26
Name: How much total combined money did all members of your HOUSEHOLD earn last year?, dtype: int64

In [11]:
# Convert the strings to integers (the lower bound of each range)
def extract_inc(inc_str):
    if pd.isnull(inc_str):
        return None
    inc_str = inc_str.split(" ")[0]
    if inc_str == "Prefer":
        return None
    inc_str = inc_str.replace("$", "")
    inc_str = inc_str.replace(",", "")
    return int(inc_str)

# Apply this function to the income column
dinner['int_inc'] = dinner['How much total combined money did all members of ' \
                           'your HOUSEHOLD earn last year?'].apply(extract_inc)
# Get some descriptives
dinner['int_inc'].describe()

count       829.000000
mean      75965.018094
std       59068.636748
min           0.000000
25%       25000.000000
50%       75000.000000
75%      100000.000000
max      200000.000000
Name: int_inc, dtype: float64

### Income Summary

Just like we did with age, we basically rounded all the income categories down. With this, the average household income was $75,965. So, again, it's not the most accurate depiction of the average income because we lose a lot of the range/variability, but it will do for now.
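As an aside, the same lower-bound extraction could probably be written without a Python-level function by using pandas string methods. This is a sketch of an alternative, not the approach used above, and the sample strings are a small hand-picked subset of the income categories.

```python
import pandas as pd

# A few income answers, including a refusal and a missing value
income = pd.Series([
    "$25,000 to $49,999",
    "$200,000 and up",
    "Prefer not to answer",
    None,
    "$0 to $9,999",
])

# Capture the first dollar amount in each string; rows without one become NaN
first_amount = income.str.extract(r"\$([\d,]+)", expand=False)

# Strip the thousands separators and convert to numbers
int_inc = pd.to_numeric(first_amount.str.replace(",", "", regex=False))
print(int_inc)
```

Both "Prefer not to answer" and missing answers end up as NaN, which matches what extract_inc does by returning None.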

## Do lower income people travel farther for Thanksgiving?

The hypothesis here is that lower income people are generally younger, and thus travel more often to their parents for Thanksgiving. Higher income people are more likely to be older and thus more stable, holding Thanksgiving at their own houses.

To explore this, we filter the data by our newly created integer income values, and then look at the answers to the question 'How far will you travel for Thanksgiving?'

In [12]:
# Selecting people who earn under $150,000 to see how far they travel
low_inc = dinner[dinner["int_inc"] < 150000]

travel_LI = low_inc["How far will you travel for Thanksgiving?"]

# Let's get the value counts for this distance travelled column
travel_LI.value_counts()

Thanksgiving is happening at my home--I won't travel at all                         281
Thanksgiving is local--it will take place in the town I live in                     203
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    150
Thanksgiving is out of town and far away--I have to drive several hours or fly       55
Name: How far will you travel for Thanksgiving?, dtype: int64

Alright, so it seems like most of the "lower wage" people are staying in town for Thanksgiving. Only 205 (30%) are traveling out of town, relative to the 484 (70%) who are staying in town. What about the people who make over $150,000?

In [13]:
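# int_inc holds each bracket's lower bound, so > 150000 keeps only the
# $175,000+ brackets; the $150,000 to $174,999 bracket is in neither group
high_inc = dinner[dinner["int_inc"] > 150000]
travel_HI = high_inc["How far will you travel for Thanksgiving?"]
travel_HI.value_counts()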

Thanksgiving is happening at my home--I won't travel at all                         49
Thanksgiving is local--it will take place in the town I live in                     25
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    16
Thanksgiving is out of town and far away--I have to drive several hours or fly      12
Name: How far will you travel for Thanksgiving?, dtype: int64

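So, there are obviously fewer people in this category. Note that because each income was set to the lower bound of its bracket, the filter int_inc > 150000 keeps only the brackets starting at $175,000, and the $150,000 to $174,999 bracket falls into neither group. Regardless, 72.5% of these higher earners are staying in town, compared with 70% of the lower income group. So, it seems this hypothesis is unsupported. "Lower" income folks don't travel any more than higher income folks for Thanksgiving.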
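The percentages quoted in this section can be checked with a few lines of arithmetic. The sketch below rebuilds the two tables from the counts printed above, using shortened labels of my own for the four answers.

```python
import pandas as pd

# Travel counts copied from the two outputs above (labels shortened)
low = pd.Series({"home": 281, "local": 203, "few hours": 150, "far": 55})
high = pd.Series({"home": 49, "local": 25, "few hours": 16, "far": 12})

def share_in_town(counts):
    """Fraction of respondents whose Thanksgiving is at home or local."""
    return counts[["home", "local"]].sum() / counts.sum()

print(round(share_in_town(low) * 100, 1))   # lower income group
print(round(share_in_town(high) * 100, 1))  # higher income group
```

On the real data, calling value_counts(normalize=True) on travel_LI or travel_HI should give the share of each answer directly, without the manual division.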

## Friendsgiving! Do the Young'uns do it more?

Let's compare younger and older respondents on whether or not they have attended a Friendsgiving.

In [14]:
# Let's generate a pivot table

friendsgiving = pd.pivot_table(dinner, values = "int_age", 
                               index = "Have you ever tried to meet up "
                                       "with hometown friends on "
                                       "Thanksgiving night?",
                               columns = 'Have you ever attended a '
                                         '"Friendsgiving?"')
print(friendsgiving)

Have you ever attended a "Friendsgiving?"                  No        Yes
Have you ever tried to meet up with hometown fr...                      
No                                                  42.283702  37.010526
Yes                                                 41.475410  33.976744


From this we get a nice little 2x2 table showing yes and no for the two questions above, with the average age in each cell. So, the people who said no to both (i.e., have never done anything related to Friendsgiving) are on average 42.3 years old, whereas the people who answered yes to both were, on average, 34. So there's about an eight-year age gap, and it seems like younger folks do Friendsgiving more often.

However, if you simply ask whether they tried to meet up with hometown friends, that age difference mostly disappears: among people who have never attended a Friendsgiving, the average age is about 42 either way. So there are two possibilities: (1) it's a phrasing difference such that older people do not use the term "Friendsgiving," or (2) the terms imply different things (e.g., Friendsgiving implies meeting for dinner, whereas simply meeting with friends may be after dinner).

Is it also a poor man's game?

In [15]:
friendsgiving_inc = pd.pivot_table(dinner, values = "int_inc", 
                                   index = "Have you ever tried to meet up "
                                           "with hometown friends on "
                                           "Thanksgiving night?",
                                   columns = 'Have you ever attended a '
                                             '"Friendsgiving?"')
print(friendsgiving_inc)

Have you ever attended a "Friendsgiving?"                     No           Yes
Have you ever tried to meet up with hometown fr...                            
No                                                  78914.549654  72894.736842
Yes                                                 78750.000000  66019.736842


Seems like the answer is again yes. The average income of people who said yes to both was about $13,000 lower than those who said no to both. But the same caveat applies as in the age analysis above. The income difference disappears when only the meet-up question is considered, possibly for either of the reasons listed above.
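For readers less familiar with pivot_table, here is a small sketch of what the two calls above compute. By default, pivot_table averages the values column within each combination of the index and columns answers. The toy frame and its short column names are made up for illustration.

```python
import pandas as pd

# Toy data: two yes/no questions plus an age column (hypothetical names)
toy = pd.DataFrame({
    "meetup":        ["No", "No", "Yes", "Yes", "Yes"],
    "friendsgiving": ["No", "Yes", "No", "Yes", "Yes"],
    "int_age":       [45, 30, 60, 18, 30],
})

# Default aggregation is the mean of int_age within each cell
table = pd.pivot_table(toy, values="int_age",
                       index="meetup", columns="friendsgiving")

# The same numbers via an explicit groupby, for comparison
check = toy.groupby(["meetup", "friendsgiving"])["int_age"].mean().unstack()

print(table)
```

The yes/yes cell averages the two respondents aged 18 and 30, so each cell is a plain group mean and says nothing about how many people are in the group.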

## What is the most common dessert?

This should be relatively easy: let's just look at the value counts for each of the dessert-related columns and see which is the most common! There are a lot of different dessert columns, so I'm just going to compare them all and print them from most to least frequently occurring. Additionally, let's ignore everyone who said "None" or "Other" for the various dessert options.

To do this I'm just going to make a dictionary with a key for each of the desserts. Then I'll store the value count for each dessert as the value for that dessert's key, and sort them in descending order. Then we can see the most popular desserts!

In [16]:
desserts = {'apple pie': 0,
            'buttermilk pie': 0,
            'cherry pie': 0,
            'chocolate pie': 0,
            'coconut cream pie': 0,
            'key lime pie': 0,
            'peach pie': 0,
            'pecan pie': 0,
            'pumpkin pie': 0,
            'sweet potato pie': 0,
            'apple cobbler': 0,
            'blondies': 0,
            'brownies': 0,
            'carrot cake': 0,
            'cheesecake': 0,
            'cookies': 0,
            'fudge': 0,
            'ice cream': 0,
            'peach cobbler': 0
        }

desserts['apple pie'] = int(dinner['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple'].value_counts())
desserts['buttermilk pie'] = int(dinner['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Buttermilk'].value_counts())
desserts['cherry pie'] = int(dinner['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Cherry'].value_counts())
desserts['chocolate pie'] = int(dinner['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Chocolate'].value_counts())
desserts['coconut cream pie'] = int(dinner['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Coconut cream'].value_counts())
desserts['key lime pie'] = int(dinner['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Key lime'].value_counts())
desserts['peach pie'] = int(dinner['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Peach'].value_counts())
desserts['pecan pie'] = int(dinner['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan'].value_counts())
desserts['pumpkin pie'] = int(dinner['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin'].value_counts())
desserts['sweet potato pie'] = int(dinner['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Sweet Potato'].value_counts())

desserts['apple cobbler'] = int(dinner['Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Apple cobbler'].value_counts())
desserts['blondies'] = int(dinner['Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Blondies'].value_counts())
desserts['brownies'] = int(dinner['Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Brownies'].value_counts())
desserts['carrot cake'] = int(dinner['Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Carrot cake'].value_counts())
desserts['cheesecake'] = int(dinner['Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Cheesecake'].value_counts())
desserts['cookies'] = int(dinner['Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Cookies'].value_counts())
desserts['fudge'] = int(dinner['Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Fudge'].value_counts())
desserts['ice cream'] = int(dinner['Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Ice cream'].value_counts())
desserts['peach cobbler'] = int(dinner['Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Peach cobbler'].value_counts())

# Sort the dictionary then print the keys and values
desserts_sorted = sorted(desserts, key=desserts.get, reverse=True)

for dessert in desserts_sorted:
    print(dessert, desserts[dessert])

('pumpkin pie', 729)
('apple pie', 514)
('pecan pie', 342)
('ice cream', 266)
('cookies', 204)
('cheesecake', 191)
('sweet potato pie', 152)
('chocolate pie', 133)
('brownies', 128)
('cherry pie', 113)
('apple cobbler', 110)
('peach cobbler', 103)
('carrot cake', 72)
('fudge', 43)
('key lime pie', 39)
('coconut cream pie', 36)
('buttermilk pie', 35)
('peach pie', 34)
('blondies', 16)


This is not at all surprising. The two most common Thanksgiving desserts are pumpkin and apple pie. Apparently people don't like blondies, peach pie, or buttermilk pie. 

This was a fun little analysis to figure out. I think I can do it better with a for loop but I'm not entirely sure how. Not the most elegant code, but that's okay.
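One way the dessert counts might be collected with a for loop is sketched below. It runs on a toy frame with shortened, hypothetical column names, and it assumes each dessert column ends in " - " followed by the dessert name, as the survey columns above do. The names in the skip set are also assumptions about how the "None" and "Other" columns are labeled.

```python
import numpy as np
import pandas as pd

# Toy frame; the real survey columns have much longer question prefixes
toy = pd.DataFrame({
    "pie - Apple":     ["Apple", np.nan, "Apple"],
    "pie - Pumpkin":   ["Pumpkin", "Pumpkin", "Pumpkin"],
    "dessert - Fudge": [np.nan, np.nan, "Fudge"],
    "dessert - None":  [np.nan, np.nan, np.nan],
})

# Answers to ignore (assumed labels)
skip = {"None", "Other (please specify)"}

desserts = {}
for col in toy.columns:
    # The dessert name is whatever follows the last " - " in the column name
    name = col.rpartition(" - ")[2]
    if name in skip:
        continue
    # count() ignores NaN, so it is the number of respondents who selected it
    desserts[name.lower()] = int(toy[col].count())

# Sort by count, most popular first
desserts_sorted = sorted(desserts, key=desserts.get, reverse=True)
for dessert in desserts_sorted:
    print(dessert, desserts[dessert])
```

On the real data, the loop would run over only the pie and dessert columns of dinner rather than over every column, for example by keeping the columns whose names start with the two dessert questions.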