<h1>Thanksgiving Dinner Survey</h1>
This project is part of the Dataquest course on Python and the pandas library. Working in a Jupyter notebook, I'll analyze survey data on Thanksgiving dinner in the US.

The dataset is stored in the thanksgiving.csv file. It contains 1058 responses to an online survey about what Americans eat for Thanksgiving dinner. Each survey respondent was asked questions about what they typically eat for Thanksgiving, along with some demographic questions, like their gender, income, and location. This dataset will allow us to discover regional and income-based patterns in what Americans eat for Thanksgiving dinner.

The dataset has 65 columns and 1058 rows. Most of the column names are questions, and most of the column values are string responses to those questions. Most of the columns are categorical, because each survey respondent had to select one of a few options.
There are also quite a few NaN values in the columns, which occurred when a survey respondent didn't fill out a question because they didn't want to, or it didn't apply to them.
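
As a rough sketch of how the row count, column count, and NaN counts described above could be checked, the example below uses a tiny made-up DataFrame (the `sample` name and its rows are assumptions for illustration, not real survey responses). The same `shape` and `isnull().sum()` calls should apply to the full dataset once it is loaded.

```python
import numpy as np
import pandas as pd

# Made-up miniature stand-in for the survey data (not real responses).
sample = pd.DataFrame({
    "Do you celebrate Thanksgiving?": ["Yes", "No", "Yes"],
    "Do you typically have gravy?": ["Yes", np.nan, "No"],
    "Age": ["18 - 29", np.nan, "60+"],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = sample.shape

# isnull() marks missing cells; sum() counts them per column
missing_per_column = sample.isnull().sum()

print(n_rows, n_cols)
print(missing_per_column)
```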

In [2]:
#Import dependencies
import pandas as pd
import numpy as np

In [3]:
#Import our data
data = pd.read_csv('thanksgiving.csv', encoding='Latin-1')

In [4]:
data.head(5)

Unnamed: 0,RespondentID,Do you celebrate Thanksgiving?,What is typically the main dish at your Thanksgiving dinner?,What is typically the main dish at your Thanksgiving dinner? - Other (please specify),How is the main dish typically cooked?,How is the main dish typically cooked? - Other (please specify),What kind of stuffing/dressing do you typically have?,What kind of stuffing/dressing do you typically have? - Other (please specify),What type of cranberry saucedo you typically have?,What type of cranberry saucedo you typically have? - Other (please specify),...,Have you ever tried to meet up with hometown friends on Thanksgiving night?,"Have you ever attended a ""Friendsgiving?""",Will you shop any Black Friday sales on Thanksgiving Day?,Do you work in retail?,Will you employer make you work on Black Friday?,How would you describe where you live?,Age,What is your gender?,How much total combined money did all members of your HOUSEHOLD earn last year?,US Region
0,4337954960,Yes,Turkey,,Baked,,Bread-based,,,,...,Yes,No,No,No,,Suburban,18 - 29,Male,"$75,000 to $99,999",Middle Atlantic
1,4337951949,Yes,Turkey,,Baked,,Bread-based,,Other (please specify),Homemade cranberry gelatin ring,...,No,No,Yes,No,,Rural,18 - 29,Female,"$50,000 to $74,999",East South Central
2,4337935621,Yes,Turkey,,Roasted,,Rice-based,,Homemade,,...,Yes,Yes,Yes,No,,Suburban,18 - 29,Male,"$0 to $9,999",Mountain
3,4337933040,Yes,Turkey,,Baked,,Bread-based,,Homemade,,...,Yes,No,No,No,,Urban,30 - 44,Male,"$200,000 and up",Pacific
4,4337931983,Yes,Tofurkey,,Baked,,Bread-based,,Canned,,...,Yes,No,No,No,,Urban,30 - 44,Male,"$100,000 to $124,999",Pacific


In [5]:
print(data.columns)

Index(['RespondentID', 'Do you celebrate Thanksgiving?',
       'What is typically the main dish at your Thanksgiving dinner?',
       'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)',
       'How is the main dish typically cooked?',
       'How is the main dish typically cooked? - Other (please specify)',
       'What kind of stuffing/dressing do you typically have?',
       'What kind of stuffing/dressing do you typically have? - Other (please specify)',
       'What type of cranberry saucedo you typically have?',
       'What type of cranberry saucedo you typically have? - Other (please specify)',
       'Do you typically have gravy?',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Brussel sprouts',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Carrots',
       'Which of these side dishes aretypically served

Because we want to understand what people ate for Thanksgiving, we'll remove any responses from people who don't celebrate it. The column Do you celebrate Thanksgiving? contains this information. We only want to keep data for people who answered Yes to this question.

In [6]:
celebrated = data['Do you celebrate Thanksgiving?'].value_counts()
print(celebrated)

Yes    980
No      78
Name: Do you celebrate Thanksgiving?, dtype: int64


In [7]:
data = data[data['Do you celebrate Thanksgiving?'] == 'Yes']

In [8]:
celebrated_corrected = data['Do you celebrate Thanksgiving?'].value_counts()
print(celebrated_corrected)

Yes    980
Name: Do you celebrate Thanksgiving?, dtype: int64


We have now removed all responses from people who do not celebrate Thanksgiving.
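
One detail worth sketching: new columns are added to the filtered data later on, and in pandas versions that emit SettingWithCopyWarning this can trigger that warning. A possible way around it, shown here on made-up rows (`sample`, `celebrators`, and `is_senior` are hypothetical names), is to take an explicit copy after filtering.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Do you celebrate Thanksgiving?": ["Yes", "No", "Yes", "Yes"],
    "Age": ["18 - 29", "30 - 44", "60+", None],
})

# Boolean mask: True for respondents who answered "Yes"
mask = sample["Do you celebrate Thanksgiving?"] == "Yes"

# .copy() returns an independent DataFrame, so columns added to it
# never touch the original frame.
celebrators = sample[mask].copy()
celebrators["is_senior"] = celebrators["Age"] == "60+"

print(len(celebrators))
```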

Let's explore what main dishes people tend to eat during Thanksgiving dinner. We can again use the value_counts method to help us with this.

In [9]:
main_dish = data['What is typically the main dish at your Thanksgiving dinner?'].value_counts()
print(main_dish)

Turkey                    859
Other (please specify)     35
Ham/Pork                   29
Tofurkey                   20
Chicken                    12
Roast beef                 11
I don't know                5
Turducken                   3
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64
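
Raw counts can be hard to compare at a glance. As a small sketch on made-up answers (the `dishes` values are illustrative, not survey data), `value_counts` also accepts `normalize=True`, which returns each category's share of the non-null answers instead of a count.

```python
import pandas as pd

# Made-up answers for illustration only.
dishes = pd.Series(["Turkey", "Turkey", "Turkey", "Ham/Pork", "Tofurkey"])

# normalize=True returns fractions that sum to 1 instead of raw counts
shares = dishes.value_counts(normalize=True)

print(shares)
```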


Let's find out whether people who chose Tofurkey had gravy with their dish.

In [11]:
tofurkey = data[data['What is typically the main dish at your Thanksgiving dinner?'] == 'Tofurkey']
gravy_tofurkey = tofurkey['Do you typically have gravy?'].value_counts()
print(gravy_tofurkey)

Yes    12
No      8
Name: Do you typically have gravy?, dtype: int64


Now that we've looked into the main dishes, let's explore the dessert dishes. Specifically, we'll look at how many people eat Apple, Pecan, or Pumpkin pie during Thanksgiving dinner. This data is encoded in the following three columns:

    Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple
    Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin
    Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan

In all three columns, the value is either the name of the pie if the person eats it for Thanksgiving dinner, or null otherwise.

We can find out how many people eat at least one of these three pies for Thanksgiving dinner by counting the respondents for whom all three columns are null, then taking everyone else.

In [14]:
apple_isnull = data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple'].isnull()
pumpkin_isnull = data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin'].isnull()
pecan_isnull = data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan'].isnull()
# True when a respondent selected none of the three pies
no_pies = (apple_isnull & pumpkin_isnull & pecan_isnull)
no_pies.value_counts()

False    876
True     104
dtype: int64
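
In the output above, True marks respondents for whom all three pie columns are null (104 people who eat none of these pies), so False corresponds to the 876 people who eat at least one. The sketch below shows the same logic on made-up rows (the `pies` frame and its short column names are assumptions for illustration), using `isnull().all(axis=1)` and its negation.

```python
import numpy as np
import pandas as pd

# Made-up responses: the pie name if it is served, NaN otherwise.
pies = pd.DataFrame({
    "Apple":   ["Apple", np.nan, np.nan, "Apple"],
    "Pumpkin": ["Pumpkin", np.nan, "Pumpkin", np.nan],
    "Pecan":   [np.nan, np.nan, np.nan, np.nan],
})

# True when all three columns are null for a respondent
no_pie = pies.isnull().all(axis=1)

# Negating the mask gives respondents with at least one pie
any_pie = ~no_pie

print(no_pie.sum(), any_pie.sum())
```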

Let's analyze the Age column in more depth. In order to analyze the Age column, we'll first need to convert it to numeric values. This will make it simple to figure out things like the average age of survey respondents. The Age column contains values that fall into one of a few categories:

    18 - 29
    30 - 44
    45 - 59
    60+
    null

Because we're missing the exact age value, we can't extract each respondent's exact age. Instead, we'll extract the first number in each string, which is the lower bound of the age range.

We can do this by splitting each value on the space character (' '), then taking the first item in the resulting list. We'll also have to remove the + character to account for 60+, which follows a different format than the rest.

In [19]:
def age_to_int(age_col):
    # Missing answers stay missing
    if pd.isnull(age_col):
        return None
    # Keep the lower bound of the range: '18 - 29' -> '18', '60+' -> '60'
    age = age_col.split(' ')[0]
    age = age.replace('+', '')
    return int(age)

data['int_age'] = data['Age'].apply(age_to_int)
print(data['int_age'].describe())

count    947.000000
mean      40.089757
std       15.352014
min       18.000000
25%       30.000000
50%       45.000000
75%       60.000000
max       60.000000
Name: int_age, dtype: float64


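Of the 947 records where age was provided, the mean of int_age is about 40, with a standard deviation of about 15. Because int_age holds the lower bound of each age range, these figures understate respondents' true ages: the minimum of 18 stands for the 18 - 29 group, and the maximum of 60 stands for everyone aged 60 or older.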
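
As an alternative sketch of the same conversion, the lower bound could also be pulled out with pandas string methods instead of a row-by-row function. This is not the approach used above, just one possible vectorised variant, shown on a standalone Series (`ages` and `lower_bound` are names introduced here for illustration).

```python
import numpy as np
import pandas as pd

# The categories listed above, plus a missing value.
ages = pd.Series(["18 - 29", "30 - 44", "45 - 59", "60+", np.nan])

# Capture the first run of digits in each string; missing values stay missing.
first_number = ages.str.extract(r"(\d+)", expand=False)

# Convert the captured text to numbers
lower_bound = pd.to_numeric(first_number)

print(lower_bound.tolist())
```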

The How much total combined money did all members of your HOUSEHOLD earn last year? column is very similar to the Age column. It contains categories, but can be converted to numerical values. Here are the unique values in the column:

    Prefer not to answer
    $0 to $9,999
    $10,000 to $24,999
    $25,000 to $49,999
    $50,000 to $74,999
    $75,000 to $99,999
    $100,000 to $124,999
    $125,000 to $149,999
    $150,000 to $174,999
    $175,000 to $199,999
    $200,000 and up
    null

We can convert these values to numeric by again splitting on the space character (' ') and keeping the first item. We'll then have to account for the string Prefer, which has no numeric value. Finally, we'll remove the dollar sign ($) and comma (,) characters and return the result as an integer.

In [20]:
def income_to_int(income_col):
    # Missing answers stay missing
    if pd.isnull(income_col):
        return None
    # Keep the lower bound of the range: '$75,000 to $99,999' -> '$75,000'
    income = income_col.split(' ')[0]
    # 'Prefer not to answer' carries no numeric value
    if income == 'Prefer':
        return None
    income = income.replace('$', '').replace(',', '')
    return int(income)

data['int_income'] = data['How much total combined money did all members of your HOUSEHOLD earn last year?'].apply(income_to_int)
print(data['int_income'].describe())

count       829.000000
mean      75965.018094
std       59068.636748
min           0.000000
25%       25000.000000
50%       75000.000000
75%      100000.000000
max      200000.000000
Name: int_income, dtype: float64


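Of the 829 records where a total household income was supplied, the mean of int_income is $75,965, with a wide standard deviation of $59,069. As with age, each value is the lower bound of an income range, so these figures understate actual incomes: the minimum of 0 stands for the $0 to $9,999 bracket, and the maximum of 200,000 stands for $200,000 and up.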
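
The income conversion could also be sketched with vectorised string methods. The example below is an alternative under stated assumptions, not the method used above: it runs on a handful of the listed categories, and `incomes`, `cleaned`, and `income_floor` are names introduced for illustration.

```python
import numpy as np
import pandas as pd

# A few of the categories listed above, plus a missing value.
incomes = pd.Series([
    "$0 to $9,999",
    "$75,000 to $99,999",
    "$200,000 and up",
    "Prefer not to answer",
    np.nan,
])

# Keep the text before the first space, e.g. "$75,000" or "Prefer"
first_token = incomes.str.split(" ").str[0]

# Strip dollar signs and commas with a regular expression
cleaned = first_token.str.replace(r"[$,]", "", regex=True)

# errors="coerce" turns non-numeric text such as "Prefer" into NaN
income_floor = pd.to_numeric(cleaned, errors="coerce")

print(income_floor.tolist())
```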

We can now see how the distance someone travels for Thanksgiving dinner relates to their income level. It's plausible that people earning less money are younger and travel to their parents' houses for Thanksgiving, while people earning more are more likely to host Thanksgiving at their own home.

We can test this by filtering data based on int_income, and seeing what the values in the How far will you travel for Thanksgiving? column are.

In [21]:
# First, let's select data where income is less than 150000
data_less150k = data[data['int_income'] < 150000]
print(data_less150k['How far will you travel for Thanksgiving?'].value_counts())

Thanksgiving is happening at my home--I won't travel at all                         281
Thanksgiving is local--it will take place in the town I live in                     203
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    150
Thanksgiving is out of town and far away--I have to drive several hours or fly       55
Name: How far will you travel for Thanksgiving?, dtype: int64


In [22]:
# Next, let's select data where income is more than 150000
# Note: the strict > here and the strict < above both leave out
# int_income == 150000, i.e. the $150,000 to $174,999 bracket
data_more150k = data[data['int_income'] > 150000]
print(data_more150k['How far will you travel for Thanksgiving?'].value_counts())

Thanksgiving is happening at my home--I won't travel at all                         49
Thanksgiving is local--it will take place in the town I live in                     25
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    16
Thanksgiving is out of town and far away--I have to drive several hours or fly      12
Name: How far will you travel for Thanksgiving?, dtype: int64


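The values above show that, proportionally, there is not much difference between households that earn below $150k a year and those that earn above it. In both groups the majority stay at home or in their local area: about 70% of the lower-income group (484 of 689) and about 73% of the higher-income group (74 of 102). Higher earners are somewhat more likely to host at home (48% versus 41%), but the higher-income group is small, so this difference should be read with caution.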
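
To compare the two groups on the same scale, the counts can be turned into percentages. The sketch below copies the counts from the two outputs above into Series by hand (the short `labels` and the variable names are introduced here for readability and are not part of the dataset).

```python
import pandas as pd

labels = ["at home", "local", "few hours away", "far away"]

# Counts copied from the two value_counts() outputs above.
under_150k = pd.Series([281, 203, 150, 55], index=labels)
over_150k = pd.Series([49, 25, 16, 12], index=labels)

# Divide each count by its group total to get a percentage
shares = pd.DataFrame({
    "under_150k_pct": (under_150k / under_150k.sum() * 100).round(1),
    "over_150k_pct": (over_150k / over_150k.sum() * 100).round(1),
})

print(shares)
```

On the real data, `value_counts(normalize=True)` on each filtered frame should give the same shares without retyping the counts.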

Two columns pertain directly to friendship: Have you ever tried to meet up with hometown friends on Thanksgiving night? and Have you ever attended a "Friendsgiving?". In the US, a "Friendsgiving" is when, instead of traveling home for the holiday, you celebrate it with friends who live in your area. Both activities seem likely to appeal more to younger people. Let's see if this hypothesis holds up.

To see the average age for each combination of answers to these two questions, we can use a pivot table. By default, pivot_table aggregates the values with the mean.

In [25]:
hometown_friends = 'Have you ever tried to meet up with hometown friends on Thanksgiving night?'
friendsgiving = 'Have you ever attended a "Friendsgiving?"'
age_bias = data.pivot_table(index=hometown_friends, columns=friendsgiving, values='int_age')
age_bias

"Have you ever attended a ""Friendsgiving?""",No,Yes
Have you ever tried to meet up with hometown friends on Thanksgiving night?,,
No,42.283702,37.010526
Yes,41.47541,33.976744
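
To make the table's layout concrete, here is a minimal sketch on made-up data (the short column names `hometown` and `friendsgiving` and all values are assumptions for illustration). Each cell of the pivot table is the mean of `int_age` for one combination of answers, which is the same number a `groupby` on both columns would produce.

```python
import pandas as pd

# Made-up answers and ages for illustration only.
sample = pd.DataFrame({
    "hometown": ["No", "No", "Yes", "Yes", "Yes"],
    "friendsgiving": ["No", "No", "No", "Yes", "Yes"],
    "int_age": [45, 60, 30, 18, 30],
})

# aggfunc defaults to the mean, so each cell is an average age.
table = sample.pivot_table(index="hometown", columns="friendsgiving", values="int_age")

# Equivalent check: group by both answers and average the ages.
check = sample.groupby(["hometown", "friendsgiving"])["int_age"].mean()

print(table)
```

Rows follow the `index` question and columns follow the `columns` question; a combination with no respondents shows up as NaN.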


In [26]:
income_bias = data.pivot_table(index=hometown_friends, columns=friendsgiving, values='int_income')
income_bias

"Have you ever attended a ""Friendsgiving?""",No,Yes
Have you ever tried to meet up with hometown friends on Thanksgiving night?,,
No,78914.549654,72894.736842
Yes,78750.0,66019.736842


Surprisingly, the questions we investigated were not heavily weighted towards a younger audience. The mean age was about 41 for those who had only met up with hometown friends and about 37 for those who had only attended a Friendsgiving, compared with about 42 for those who had done neither. For those who had done both, the mean age was lower, at about 34.
Income followed a similar pattern: the mean was roughly $79,000 for those who had done neither or had only met up with hometown friends, about $73,000 for those who had only attended a Friendsgiving, and about $66,000 for those who had done both. Overall, I would conclude that these questions do not hold a strong bias towards younger people.