In [17]:
import pandas as pd
import numpy as np

data = pd.read_csv("./thanksgiving-2015/thanksgiving-2015-poll-data.csv", encoding="Latin-1")
print(data.head(3))
print(data.shape)

   RespondentID Do you celebrate Thanksgiving?  \
0    4337954960                            Yes   
1    4337951949                            Yes   
2    4337935621                            Yes   

  What is typically the main dish at your Thanksgiving dinner?  \
0                                             Turkey             
1                                             Turkey             
2                                             Turkey             

  What is typically the main dish at your Thanksgiving dinner? - Other (please specify)  \
0                                                NaN                                      
1                                                NaN                                      
2                                                NaN                                      

  How is the main dish typically cooked?  \
0                                  Baked   
1                                  Baked   
2                                Roas

In [18]:
data_yes = data[data["Do you celebrate Thanksgiving?"]=="Yes"]
print(data_yes["Do you celebrate Thanksgiving?"].value_counts())


Yes    980
Name: Do you celebrate Thanksgiving?, dtype: int64


In [19]:
print(data_yes["What is typically the main dish at your Thanksgiving dinner?"].value_counts())
data_Tofurkey = data_yes[data_yes["What is typically the main dish at your Thanksgiving dinner?"] == "Tofurkey"]
print(data_Tofurkey["Do you typically have gravy?"].value_counts())

Turkey                    859
Other (please specify)     35
Ham/Pork                   29
Tofurkey                   20
Chicken                    12
Roast beef                 11
I don't know                5
Turducken                   3
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64
Yes    12
No      8
Name: Do you typically have gravy?, dtype: int64


In [20]:
apple_isnull = pd.isnull(data_yes["Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple"])
pumpkin_isnull = pd.isnull(data_yes["Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin"])
pecan_isnull = pd.isnull(data_yes["Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan"])

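# True means all three pie columns are null, i.e. the respondent selected
# none of apple, pumpkin, or pecan pie.
no_pies = apple_isnull & pumpkin_isnull & pecan_isnull
print(no_pies.value_counts())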

False    876
True     104
dtype: int64


In [21]:
data_ages = data_yes["Age"]
def string_to_int(age):
    # Missing answers stay missing.
    if pd.isnull(age):
        return None
    # Keep only the first token of the bracket (its lower bound) and
    # drop the "+" that marks the open-ended top bracket.
    lower_bound = age.split(" ")[0].replace("+", "")
    return int(lower_bound)

data_yes["int_age"] = data_ages.apply(string_to_int)
print(data_yes["int_age"].describe())

count    947.000000
mean      40.089757
std       15.352014
min       18.000000
25%       30.000000
50%       45.000000
75%       60.000000
max       60.000000
Name: int_age, dtype: float64


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # This is added back by InteractiveShellApp.init_path()
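
The warning above is pandas' SettingWithCopyWarning: `data_yes` was created by filtering `data`, so pandas cannot tell whether the new column is being written to a view or to a copy. A minimal sketch of one common way to avoid it, assuming an explicit `.copy()` is acceptable for a dataset this small. The `survey` frame and its `celebrates` column are made-up stand-ins, not the poll data:

```python
import pandas as pd

# Hypothetical stand-in for the poll data (values are illustrative only).
survey = pd.DataFrame({
    "celebrates": ["Yes", "No", "Yes"],
    "Age": ["18 - 29", "30 - 44", "60+"],
})

# .copy() makes the filtered frame independent of `survey`,
# so adding a column to it is unambiguous.
celebrators = survey[survey["celebrates"] == "Yes"].copy()

# Pull the first run of digits (the lower bound) out of each bracket.
celebrators["int_age"] = (
    celebrators["Age"].str.extract(r"(\d+)", expand=False).astype(int)
)
print(celebrators)
```

The original frame is left untouched; only the copy gains the new column.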


# Age_Analysis: Describing Turkey Go-ers
### Looking at age using the pandas method describe()
This was an interesting problem because ages are recorded as range strings, so we had to convert each one to an int before computing basic statistics such as the mean and standard deviation. First, we wrote a function that converts a single age string to an int. We then ran that function over the age column (using the apply() method) and stored the result in a new column, "int_age".

Caveats: One key thing to note is that because we took the lower bound of each age range, the data skews younger than the respondents really are; most people in a bracket are older than its lower bound. A more accurate approach would be to take the first and last pieces of the string, convert them to ints, and use their mean as the age for that group.
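
The midpoint idea from the caveat could be sketched as follows. This is only an illustration: the function name `age_bracket_midpoint` is hypothetical, the sample strings assume brackets shaped like "18 - 29" and "60+", and the open-ended top bracket still falls back to its lower bound because it has no upper end to average with.

```python
import pandas as pd

def age_bracket_midpoint(age):
    """Return the midpoint of a bracket such as "18 - 29"; None if missing."""
    if pd.isnull(age):
        return None
    # An open-ended bracket ("60+") has no upper bound,
    # so fall back to its lower bound.
    if age.endswith("+"):
        return float(age.rstrip("+"))
    parts = age.split(" ")
    low, high = int(parts[0]), int(parts[-1])
    return (low + high) / 2

# Illustrative values only, not rows from the poll.
sample_ages = pd.Series(["18 - 29", "30 - 44", "45 - 59", "60+", None])
print(sample_ages.apply(age_bracket_midpoint))
```

Swapping this in for the lower-bound conversion would raise every closed bracket by half its width, which is the size of the skew described above.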

In [22]:
data_incomes = data_yes["How much total combined money did all members of your HOUSEHOLD earn last year?"]
def string_to_int_income(income):
    # Missing answers stay missing.
    if pd.isnull(income):
        return None
    split_income = income.split(" ")
    split_low = split_income[0]
    split_high = split_income[-1]
    # Answers starting with "Prefer" carry no income information.
    if split_low == "Prefer":
        return None
    split_low = split_low.replace(",", "").replace("$", "")
    split_high = split_high.replace(",", "").replace("$", "")
    # Open-ended top bracket (ends in "up"): use its lower bound.
    if split_high == "up":
        return int(split_low)
    # Otherwise use the midpoint of the bracket.
    return (int(split_low) + int(split_high)) / 2

data_yes["int_income"] = data_incomes.apply(string_to_int_income)
print(data_yes["int_income"].describe())

count       829.000000
mean      86486.276840
std       57789.467567
min        4999.500000
25%       37499.500000
50%       87499.500000
75%      112499.500000
max      200000.000000
Name: int_income, dtype: float64


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


# Income_Analysis: Describing Turkey Go-ers
### Looking at average household income
Compared to the age problem above, I tried something slightly different here. Since income has the same bracket problem we saw with age, this time I used both ends of each income bracket and took their midpoint as the value for everyone in that bracket. This should be more accurate, although there are two inherent issues:

1. The 200k+ bracket has no upper bound, so it's set at a flat 200k. This is a poor assumption, given how large incomes in that bracket could be, even though few people are above 200k overall.
2. The first bracket starts at 0, so it includes homes where nobody is working. This is much less of an issue than the one above, since most people who aren't working still have some source of income within their household, even students (assuming they count themselves as living with their parents). It's still worth noting that a large number of respondents with an income of 0 would pull the average down.

What I'm saying above, concisely, is: *We don't know how respondents are distributed within each bracket, so the midpoint is only an approximation and there's room for error (though that's the trade-off of using brackets in the first place)*
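
One way to see how much the open-ended top bracket matters is to make its assumed value a parameter and recompute the mean under a few different assumptions. The sketch below is illustrative only: `bracket_to_income` and `top_bracket_value` are hypothetical names, the sample answers are made up rather than taken from the poll, and the candidate values (200000, 250000, 300000) are arbitrary.

```python
import pandas as pd

def bracket_to_income(income, top_bracket_value=200000):
    """Midpoint of a bracket like "$25,000 to $49,999"; None if unanswered."""
    if pd.isnull(income) or income.startswith("Prefer"):
        return None
    parts = income.replace("$", "").replace(",", "").split(" ")
    # Open-ended top bracket, e.g. "$200,000 and up": use the assumed value.
    if parts[-1] == "up":
        return float(top_bracket_value)
    return (int(parts[0]) + int(parts[-1])) / 2

# Illustrative answers only, not rows from the poll.
sample = pd.Series([
    "$0 to $9,999",
    "$25,000 to $49,999",
    "$200,000 and up",
    "Prefer not to answer",
])

# Recompute the mean under each assumption about the top bracket.
for assumed_top in (200000, 250000, 300000):
    incomes = sample.apply(bracket_to_income, top_bracket_value=assumed_top)
    print(assumed_top, round(incomes.mean(), 2))
```

If the mean barely moves across assumptions, the flat 200k choice is harmless; if it moves a lot, the median is probably the safer summary.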

In [23]:
data_sub150k_travel = data_yes[data_yes["int_income"]<150000]
print(data_sub150k_travel["How far will you travel for Thanksgiving?"].value_counts())

Thanksgiving is happening at my home--I won't travel at all                         281
Thanksgiving is local--it will take place in the town I live in                     203
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    150
Thanksgiving is out of town and far away--I have to drive several hours or fly       55
Name: How far will you travel for Thanksgiving?, dtype: int64


In [24]:
data_over150k_travel = data_yes[data_yes["int_income"]>=150000]
print(data_over150k_travel["How far will you travel for Thanksgiving?"].value_counts())

Thanksgiving is happening at my home--I won't travel at all                         66
Thanksgiving is local--it will take place in the town I live in                     34
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    25
Thanksgiving is out of town and far away--I have to drive several hours or fly      15
Name: How far will you travel for Thanksgiving?, dtype: int64


In [25]:
data_yes.pivot_table(
    index="Have you ever tried to meet up with hometown friends on Thanksgiving night?", 
    columns = 'Have you ever attended a "Friendsgiving?"', 
    values = "int_age")

"Have you ever attended a ""Friendsgiving?""",No,Yes
Have you ever tried to meet up with hometown friends on Thanksgiving night?,Unnamed: 1_level_1,Unnamed: 2_level_1
No,42.283702,37.010526
Yes,41.47541,33.976744


In [26]:
data_yes.pivot_table(
    index="Have you ever tried to meet up with hometown friends on Thanksgiving night?", 
    columns = 'Have you ever attended a "Friendsgiving?"', 
    values = "int_income")

"Have you ever attended a ""Friendsgiving?""",No,Yes
Have you ever tried to meet up with hometown friends on Thanksgiving night?,Unnamed: 1_level_1,Unnamed: 2_level_1
No,89456.82448,83124.552632
Yes,89553.119048,76315.319079


# Friendsgiving or Thanksgiving Meetup
On average, it seems that:
1. People who have attended a Friendsgiving are younger (and people who meet up with hometown friends on Thanksgiving night are slightly younger as well)
2. People who have done both a Friendsgiving AND a Thanksgiving night meetup with friends have the lowest income, followed by people who have only attended a Friendsgiving (which makes sense, since that could suggest you couldn't go home to your family, possibly because of financial reasons)
    a. As a sub-point here, among people who have never attended a Friendsgiving, those who meet up with friends on Thanksgiving night have a slightly higher income than those who don't do anything with friends. However, the difference is so small (about $100) that it's likely negligible.

In short, if you spend time with friends, you're likely to be younger and to have a lower income.
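
These comparisons rest on group means alone, and a mean says nothing about how many respondents sit behind it. A minimal sketch of reporting the count alongside the mean, using a made-up `toy` frame with shortened, hypothetical column names (`meetup`, `friendsgiving`) rather than the poll itself:

```python
import pandas as pd

# Hypothetical stand-in data; column names are shortened for readability.
toy = pd.DataFrame({
    "meetup":        ["No", "No", "Yes", "Yes", "Yes", "No"],
    "friendsgiving": ["No", "Yes", "No", "Yes", "Yes", "No"],
    "int_age":       [45, 30, 45, 18, 30, 60],
})

# Report each group's mean together with the number of respondents in it.
summary = toy.pivot_table(
    index="meetup",
    columns="friendsgiving",
    values="int_age",
    aggfunc=["mean", "count"],
)
print(summary)
```

A cell backed by only a handful of respondents deserves less weight than one backed by hundreds, so the counts help judge whether a gap between two means is worth reading into.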