# Working With Survey Data
## Featuring: Your Survey Responses!

## Surveys

Social science data is often obtained through **surveys**: asking people questions about themselves!

- **Economics:** What is your current income?

- **Political Science:** Which political party do you currently plan to vote for?

- **Linguistics:** What languages do you speak?

- **Psychology:** What are your preferences?

- **Sociology:** What is your religious affiliation?

- **Geography:** Where do you live, where do you work, and how do you travel between the two?

## Surveys

Surveys are usually drawn from a **sample** of the population. 

- Why? Because it is very difficult to survey every single person!

- An exception is the Census, but it takes a **really** large effort to get everybody to respond.

- We will talk later about how to handle the challenge of using a sample instead of an entire population. For now, let's focus on the data from our in-class survey!

### Class survey!

We sent out a survey last week, asking you to provide answers to some basic questions. Let's see what you all said!

These survey responses are stored in a file called `ggr274_survey.csv`. This file contains a table of survey responses, consisting of:
- A row for each person who responded
- A column for each question that we asked

### Reading our data: `open` and `readlines`

Now that we've seen the file, let's read the file's contents into Python.

Once again, we can do this in two steps:

1. **Open** the file.
2. **Read** the file data into Python, line by line.

In [4]:
survey_file = open("ggr274_survey.csv", encoding="utf-8-sig")

In [5]:
survey_data = survey_file.readlines()
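
The cells above never close the file. As a minimal sketch of one alternative, a `with` block closes the file automatically once the reading is done. The file `sample_survey.csv`, its contents, and the names `demo_file` and `sample_lines` are hypothetical stand-ins, so the example runs without the class survey file.

```python
import os
import tempfile

# Hypothetical stand-in for the class survey file, so this sketch runs on its own.
sample_path = os.path.join(tempfile.mkdtemp(), "sample_survey.csv")
with open(sample_path, "w", encoding="utf-8") as out_file:
    out_file.write("name,commute\nA,25\nB,40\n")

# Open and read inside one block; the file is closed when the block ends.
with open(sample_path, encoding="utf-8-sig") as demo_file:
    sample_lines = demo_file.readlines()

print(sample_lines)
print(demo_file.closed)
```

The same pattern should work with `ggr274_survey.csv` in place of `sample_path`.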

### Exploring Our Data

Let's look at just the first line from the file:

In [None]:
survey_data[0]

We need to split this line into its individual values. We can do this using `.split(",")`.

In [None]:
print(survey_data[0].split(","))

This first row is the **header**: it contains the name of each of our columns! For now, let's just check out the first *actual* row of our data:

In [None]:
print(survey_data[1].split(","))
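
Splitting on every comma works as long as no answer contains a comma of its own. The sketch below uses made-up data (`sample_text` is hypothetical, not taken from the class survey) to show how Python's built-in `csv` module handles a quoted answer that plain `.split(",")` would cut in two.

```python
import csv
import io

# Hypothetical two-line survey: the second answer contains a comma inside quotes.
sample_text = 'name,languages\nA,"English, French"\n'

# Plain splitting cuts the quoted answer into two pieces.
naive_row = sample_text.splitlines()[1].split(",")

# csv.reader keeps the quoted answer together as one value.
csv_rows = list(csv.reader(io.StringIO(sample_text)))

print(naive_row)
print(csv_rows[1])
```

Whether this matters for `ggr274_survey.csv` depends on whether any response in it contains a comma.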

### Summarizing numeric data

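Let's look at commute times! Each value in the file is read as text, so we use `.strip()` to remove any surrounding whitespace and `int()` to convert the text to a whole number.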

In [None]:
commutes = []

for line in survey_data[1:]:
    entries = line.split(",")

    commute_entry = entries[9].strip()
    
    commute_int = int(commute_entry)
    
    commutes.append(commute_int)
    
print(commutes)

Questions: 
1. Why do we need to put a `[1:]` after `survey_data`?
2. Why do we need to put a `[9]` after `entries`?
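
Counting columns by hand is easy to get wrong. As a sketch of one alternative, the position of a column can be looked up from the header row by name. The header text and the column name `commute_minutes` here are hypothetical; the real names in `ggr274_survey.csv` may differ.

```python
# Hypothetical header line; the real column names in the survey file may differ.
sample_header = "name,favourite_season,commute_minutes\n"

# Remove the trailing newline, then split the header into column names.
column_names = sample_header.strip().split(",")

# list.index returns the position of the first matching name.
commute_index = column_names.index("commute_minutes")

print(commute_index)
```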

We can count the total number of responses using `len()`:

In [None]:
num_responses = len(commutes)

print(f"Number of responses: {num_responses}.")

We can then calculate the response rate out of 120 students:

In [None]:
print(f"Response rate: {num_responses/120}")
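
The division above prints a long decimal fraction. One possible way to show it as a percentage is an f-string format specifier, sketched here with a made-up response count (`sample_responses` is hypothetical, not the real class result).

```python
# Hypothetical count; the notebook uses num_responses and a class size of 120.
sample_responses = 90
class_size = 120

sample_rate = sample_responses / class_size

# The .1% format multiplies by 100 and keeps one decimal place.
rate_text = f"Response rate: {sample_rate:.1%}"
print(rate_text)
```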

We can also compute summary statistics for this list of commute times!

In [None]:
sum_commute = sum(commutes)
min_commute = min(commutes)
max_commute = max(commutes)
avg_commute = sum_commute / num_responses

In [None]:
print(f"Sum of all commute times: {sum_commute}.")
print(f"Minimum commute: {min_commute}.")
print(f"Maximum commute: {max_commute}.")
print(f"Average commute: {avg_commute}.")
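
Python's built-in `statistics` module offers the same average plus other summaries, such as the median, which is less affected by one very long commute. This is a sketch with made-up commute times (`sample_commutes` is hypothetical), not the class results.

```python
import statistics

# Hypothetical commute times in minutes, not the class results.
sample_commutes = [10, 20, 30, 40, 100]

# The mean adds up all values and divides by how many there are.
mean_commute = statistics.mean(sample_commutes)

# The median is the middle value once the list is sorted.
median_commute = statistics.median(sample_commutes)

print(mean_commute, median_commute)
```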

### Summarizing categorical data

Now let's take a look at a **categorical** variable. How many people said that Winter was their favourite season?

In [None]:
seasons = []

for line in survey_data[1:]:
    entries = line.split(",")
    
    season_entry = entries[5].strip()
        
    seasons.append(season_entry)
    
print(seasons)

Let's start by setting a counter for the number of people who prefer winter (`winter_count`) to 0.

Then, we'll add 1 to `winter_count` each time that "Winter" appears in `seasons`:

In [None]:
winter_count = 0

for season in seasons:
    if season == "Winter":
        winter_count = winter_count + 1

How many people prefer winter? What share of responses is that?

In [None]:
print(f"Number of responses for Winter: {winter_count}.")
print(f"Share of responses for Winter: {winter_count/len(seasons)}.")
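
The counting loop above can also be written with tools that Python already provides. The sketch below uses a made-up list (`sample_seasons` is hypothetical) to show `list.count` for a single category and `collections.Counter` for all categories at once.

```python
from collections import Counter

# Hypothetical answers, not the class results.
sample_seasons = ["Winter", "Summer", "Fall", "Summer", "Winter", "Summer"]

# list.count tallies one category at a time.
sample_winter_count = sample_seasons.count("Winter")

# Counter tallies every category in one pass.
season_counts = Counter(sample_seasons)

print(sample_winter_count)
print(season_counts.most_common(1))
```

Writing the loop by hand is still worthwhile, because the same pattern is used for the dictionaries later on.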

### Using dictionaries to summarize groups

What if we want to compare the responses of different groups? For example, we asked everyone what their favourite hot drink was (coffee vs. tea), and how many cups they had per week. 

**Who had more cups per week: coffee drinkers or tea drinkers?**

To answer this question, we're going to need a **dictionary**. This dictionary will contain a separate list for each type of drink (coffee, tea, etc.).

In [7]:
# Create an empty dictionary
drink_dict = {}

We can use a for loop to add a key for each drink category to `drink_dict`, with each key starting out with an empty list:

In [9]:
# Create an empty dictionary
drink_dict = {}

for line in survey_data[1:]:

    entries = line.split(",")
    
    # Column of drink types
    drink_type = entries[3].strip()

    # Check if this drink_type is already in our dictionary
    # If not, add a new key in the dictionary for this drink_type!
    if drink_type not in drink_dict:
        drink_dict[drink_type] = []

print(drink_dict)

{'Tea': [], 'Coffee': [], 'Juice': [], 'hot chocolate': [], 'Hot chocolate': [], 'Hot Chocolate': [], 'Matcha': []}
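
In the output above, "hot chocolate" appears three times because the answers differ in capitalization. A minimal sketch of one way to merge them is to lower-case each label before using it as a key. This assumes that answers differing only in case or outer spaces mean the same drink; `raw_labels` and `normalized_dict` are hypothetical names used for illustration.

```python
# Labels copied from the output above, which differ only in capitalization.
raw_labels = ["hot chocolate", "Hot chocolate", "Hot Chocolate", "Tea"]

normalized_dict = {}
for label in raw_labels:
    # Assumption: answers that differ only in case or outer spaces are the same drink.
    clean_label = label.strip().lower()
    if clean_label not in normalized_dict:
        normalized_dict[clean_label] = []

print(normalized_dict)
```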


Let's expand this for loop, so that it creates each drink category and **appends** the value from the `cups_per_week` column to that category's list!

In [None]:
# Create an empty dictionary
drink_dict = {}

# Read each line of survey_data (skipping line 0)
for line in survey_data[1:]:

    entries = line.split(",")
    
    # Column of drink types
    drink_type = entries[3].strip()

    # Check if this drink_type is already in our dictionary
    # If not, add a new key in the dictionary for this drink_type!
    if drink_type not in drink_dict:
        drink_dict[drink_type] = []

    # Column of cups per week (convert to integer)
    cups_per_week_entry = entries[4].strip()
    cups_per_week_int = int(cups_per_week_entry)

    # Add number of cups to the correct drink type for this row
    drink_dict[drink_type].append(cups_per_week_int) 
        
print(drink_dict)
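
The "check, then create an empty list" step is common enough that Python has a shortcut for it. As a sketch, `collections.defaultdict` creates the empty list automatically the first time a key is used. The rows below are made up (`sample_rows` is hypothetical), not the class results.

```python
from collections import defaultdict

# Hypothetical (drink, cups per week) pairs, not the class results.
sample_rows = [("Tea", 7), ("Coffee", 14), ("Tea", 3)]

# defaultdict(list) makes an empty list the first time a key is used,
# so no "if drink_type not in ..." check is needed.
sample_drink_dict = defaultdict(list)
for drink_type, cups in sample_rows:
    sample_drink_dict[drink_type].append(cups)

print(dict(sample_drink_dict))
```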


We can now pull out an individual group containing only the responses from tea drinkers or coffee drinkers!

In [None]:
print(drink_dict["Tea"])

In [None]:
print(drink_dict["Coffee"])

Now, let's see whether tea or coffee drinkers have more cups per week!

In [None]:
total_tea_cups = sum(drink_dict["Tea"])
total_tea_drinkers = len(drink_dict["Tea"])
average_tea_cups = total_tea_cups/total_tea_drinkers


print(f"Total tea cups per week: {total_tea_cups}")
print(f"Total tea drinkers: {total_tea_drinkers}")
print(f"Average cups for tea drinkers: {average_tea_cups}")

In [None]:
total_coffee_cups = sum(drink_dict["Coffee"])
total_coffee_drinkers = len(drink_dict["Coffee"])
average_coffee_cups = total_coffee_cups/total_coffee_drinkers

print(f"Total coffee cups per week: {total_coffee_cups}")
print(f"Total coffee drinkers: {total_coffee_drinkers}")
print(f"Average cups for coffee drinkers: {average_coffee_cups}")
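
The tea and coffee cells repeat the same three steps. One possible way to avoid the repetition is to loop over every group in the dictionary. The sketch below uses made-up data in the same shape as `drink_dict` (`sample_groups` is hypothetical) and skips empty groups so that it never divides by zero.

```python
# Hypothetical grouped data in the same shape as drink_dict.
sample_groups = {"Tea": [7, 3], "Coffee": [14, 10, 6], "Juice": []}

sample_averages = {}
for drink_type, cups_list in sample_groups.items():
    # Skip empty groups to avoid dividing by zero.
    if len(cups_list) > 0:
        sample_averages[drink_type] = sum(cups_list) / len(cups_list)

print(sample_averages)
```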