# Code-along | 2024-01-23 | Analyzing Survey Data with SQL & Python | Richie Cotton

This code-along analyses data from a survey about the growth of Finnish companies. The data reports the perceptions of top managers on growth, innovativeness, and the ability for renewal.

### Where is the data from?

- [Suominen & Pihlajamaa, 2022](https://www.sciencedirect.com/science/article/pii/S2352340922005261)
- [The dataset](https://zenodo.org/records/5820394#.Y5OKl-zMK3I)

### What will I learn today?

- How to summarize and visualize questions with a numeric response using a histogram.
- How to determine whether there is a difference between two groups of numeric responses using a Mann-Whitney U test.
- How to summarize and visualize questions with a categorical response using a bar plot.

## Task 0: Setup

For this analysis we need the `plotly.express` package for drawing histograms and bar plots.

We'll also need the `mannwhitneyu` function from the `scipy.stats` package to perform the Mann-Whitney U test.

### Instructions

Import the following packages.

- Import `plotly.express` using the alias `px`.
- From `scipy.stats` import the `mannwhitneyu` function.

In [1]:
# Import plotly.express using the alias px
import plotly.express as px

# From scipy.stats import the mannwhitneyu function
from scipy.stats import mannwhitneyu

## Task 1: Import the Survey Dataset

The survey data is contained in a CSV file named `survey_data.csv`.

### Data dictionary

The dataset contains the following columns.

- `Growth_Firm`: Is the company (firm) _currently_ classified as a growth company under OECD definitions?
- `question_2_row_1_transformed`: The responses to question 2, part 1 (with some pre-applied transformation).
- `question_2_row_2_transformed`: The responses to question 2, part 2 (with some pre-applied transformation).
- `question_3_row_1`: The responses to question 3, part 1.
- ...
- `question_7_row_1`: The responses to question 7, part 1.

The details of each question are fully described in `survey_questions.csv`, and we'll cover the details of the specific questions that we look at as we come to them in the tasks here.

### Instructions

Use SQL to import the survey data.

- Select everything from `survey_data.csv`. 
    - This uses European style CSV settings, so you can't use the default CSV reading settings.
    - Set the column delimiter to a semi-colon.
    - Set the decimal separator to a comma.
    - Set the null string to a space.
- Assign to a DataFrame named `survey`.

<details>
    <summary>Code hints</summary>
    <p>
        
- Workspace lets you import from a CSV file into a SQL query by calling DuckDB's [`read_csv_auto()`](https://duckdb.org/docs/data/csv/overview.html) in the `FROM` clause.
- `delim` sets the column delimiter.
- `decimal_separator` sets the decimal separator.
- `nullstr` sets the value used for NULLs (missing values).

    </p>
</details>    

In [2]:
-- Select everything from survey_data.csv
SELECT * 
	FROM read_csv_auto("survey_data.csv", delim=";", decimal_separator=",", nullstr=" ")

Unnamed: 0,Growth_Firm,question_2_row_1_transformed,question_2_row_2_transformed,question_3_row_1,question_3_row_2,question_3_row_3,question_3_row_4,question_3_row_5,question_3_row_6,question_3_row_7,question_3_row_8,question_3_row_9,question_3_row_10,question_3_row_11,question_3_row_12,question_3_row_13,question_3_row_14,question_3_row_15,question_3_row_16,question_4_row_1,question_4_row_2,question_4_row_3,question_4_row_4,question_5_row_1,question_5_row_2,question_5_row_3,question_5_row_4,question_5_row_5,question_5_row_6,question_5_row_7,question_5_row_8,question_5_row_9,question_5_row_10,question_6_row_1,question_6_row_2,question_7_row_1
0,0,35.135135,50.750939,4,5,5,4,3,3,4,4,4,2,2,2,2,4,4,3,4,4,4,4,1,1,2,4,2,4,2,3,2.0,5.0,4,5,1
1,0,23.018043,51.182200,5,4,4,4,4,4,4,5,5,4,2,4,2,4,4,3,4,3,3,4,4,4,2,3,4,3,3,3,4.0,3.0,5,4,1
2,0,86.640472,62.932639,3,4,4,4,4,3,4,5,3,3,3,5,3,4,4,4,4,4,4,4,4,4,4,5,4,4,4,4,,,5,3,1
3,0,17.647059,39.130435,3,4,5,4,4,4,5,5,3,3,4,5,4,4,5,3,4,3,3,3,3,2,3,3,3,4,4,4,3.0,3.0,3,3,1
4,0,60.000000,32.802125,4,4,4,4,3,4,4,4,5,5,2,3,1,2,4,2,4,2,2,2,2,2,2,4,2,4,2,3,3.0,4.0,5,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,1,227.868852,1417.450683,3,4,4,3,2,4,3,3,3,3,4,4,4,3,3,3,3,2,2,2,2,2,2,2,3,3,3,4,3.0,3.0,3,4,2
116,1,316.666667,446.149645,5,5,5,4,4,5,5,4,5,5,5,4,3,4,1,4,3,2,2,1,2,3,2,4,4,2,2,2,3.0,3.0,2,2,2
117,1,566.666667,4996.839959,4,5,5,4,4,4,5,5,3,3,5,4,5,5,5,4,4,5,4,5,5,5,5,5,4,4,4,5,4.0,5.0,4,4,2
118,1,471.428571,465.770863,4,5,4,5,5,4,4,4,4,4,5,5,2,5,5,5,2,2,2,2,5,5,5,5,5,4,2,5,5.0,5.0,1,1,2
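
To see what the three CSV settings in the query do, here is a minimal sketch in plain Python. It does not reproduce how DuckDB parses the file; the sample text and the `parse_number` helper are made up for illustration, in the same style as the survey file (semicolon delimiter, comma decimals, a single space for a missing value).

```python
import csv
import io

# Made-up sample in the same style as the survey file (an assumption,
# not the real file contents): semicolon-delimited, comma decimals,
# and a single space standing in for a missing value.
raw = "Growth_Firm;score\n0;35,135135\n1; \n1;580,272109\n"


def parse_number(text):
    """Convert a European-style number to float; a lone space means missing."""
    if text == " ":
        return None
    return float(text.replace(",", "."))


# delimiter=";" plays the role of delim in the SQL query
reader = csv.DictReader(io.StringIO(raw), delimiter=";")
rows = [{name: parse_number(value) for name, value in row.items()} for row in reader]

print(rows)
```

If you read the file with pandas instead, the `sep`, `decimal`, and `na_values` arguments to `read_csv()` play similar roles.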


The dataset doesn't contain the actual questions that were asked. To find out what the questions are, we can look up the column titles in the data dictionary contained in `survey_questions.csv`.

### Instructions

Use SQL to import the data dictionary for the survey questions.

- Select everything from `survey_questions.csv`. 
    - This uses the default read CSV settings.

<details>
    <summary>Code hints</summary>
    <p>
        
- If you are importing a file from CSV with the default `read_csv_auto()` settings, then Workspace lets you simply type the file name in the `FROM` clause.

    </p>
</details> 

In [3]:
-- Select everything from survey_questions.csv
SELECT * 
	FROM 'survey_questions.csv'

Unnamed: 0,column,question,row,section,title,response_type
0,question_2_row_1_transformed,2,1,estimated growth,Expected employee count in five years (as a pe...,numeric
1,question_2_row_2_transformed,2,2,estimated growth,Expected revenue in five years (as a percent f...,numeric
2,question_3_row_1,3,1,company culture,Employees are encouraged to be creative,agree_disagree
3,question_3_row_2,3,2,company culture,Managers are expected to be creative problem s...,agree_disagree
4,question_3_row_3,3,3,company culture,Employees' ability to function creatively is r...,agree_disagree
5,question_3_row_4,3,4,company culture,We are constantly looking for ways to develop ...,agree_disagree
6,question_3_row_5,3,5,company culture,Assistance in developing new ideas is readily ...,agree_disagree
7,question_3_row_6,3,6,company culture,Our organization is open and responsive to cha...,agree_disagree
8,question_3_row_7,3,7,company culture,"Managers here are always searching for fresh, ...",agree_disagree
9,question_3_row_8,3,8,company culture,Our organization has a clear and inspiring set...,agree_disagree


## Task 2: Visualizing Numeric Responses

Question 2 asks 

> If the firm develops the way you would like it to, how much revenue would the firm receive, and how many employees would it have five years ahead? Disregard possible inflation.

In this task we'll consider the first part, about employee count.

The responses are numeric, and so it's natural to visualize the distribution as a histogram.

### Instructions

Draw a histogram of expected employee count in five years.

- Draw a histogram of the `survey` data.
- On the x-axis, plot `question_2_row_1_transformed`.
- Set the x-axis label to `"Expected employee count in five years (as a percent from last available year)"`.

<details>
    <summary>Code hints</summary>
    <p>
        
- Axis labels use the `labels` argument with a dictionary. Use this code pattern: `labels={"variable_name": "Text for label"}`.

    </p>
</details> 

In [4]:
# Draw a histogram of the survey data
# On the x-axis, plot question_2_row_1_transformed
# Set the x-axis label
px.histogram(
    survey, 
    x="question_2_row_1_transformed",
    labels={
        "question_2_row_1_transformed": "Expected employee count in five years (as a percent from last available year)"
    }
)


An interesting question is whether companies that are currently classified as _growth_ have different expectations of how many more employees they will add over the next five years compared to _non-growth_ companies. We can draw a histogram for each.

### Instructions

Update the histogram of expected employee count in five years.

- Copy and paste your previous histogram code.
- Facet the plot in rows by growth status.

In [5]:
# Copy and paste your previous histogram code.
# On the x-axis, plot question_2_row_1_transformed
# Facet the plot in rows by growth status.
px.histogram(
    survey, 
    x="question_2_row_1_transformed",
    facet_row="Growth_Firm",
    labels={
        "question_2_row_1_transformed": "Expected employee count in five years (as a percent from last available year)"
    }
)

## YOUR TURN: Visualize Another Question With Numeric Responses

### Instructions

Draw the last histogram again, this time with the results of question 2, part 2.

- Copy and paste your previous code.
- Change the column to `question_2_row_2_transformed`.
- Change the x-axis title to `"Expected revenue in five years (as a percent from last available year)"`.

In [6]:
# Visualize question 2, part 2
px.histogram(
    survey, 
    x="question_2_row_2_transformed",
    facet_row="Growth_Firm",
    labels={
        "question_2_row_2_transformed": "Expected revenue in five years (as a percent from last available year)"
    }
)

## Task 3: Calculating Statistical Significance Between Groups of Numeric Responses

The histograms for growth and non-growth firms look pretty similar. However, there may still be a statistically significant difference between the two groups.

The data don't follow a bell-shaped normal distribution, so we use a Mann-Whitney U test (a.k.a. the Wilcoxon rank-sum test) to compare them.

### Instructions

Get the non-growth rows for question 2, part 1.

- Select the `question_2_row_1_transformed` column from the survey CSV.
- Get rows where growth firm status is `0`.
- Assign to a DataFrame named `q2_1_non_growth`.

In [7]:
-- Select the question_2_row_1_transformed column from the survey CSV
-- Get rows where growth firm status is 0
SELECT question_2_row_1_transformed
	FROM read_csv_auto("survey_data.csv", delim=";", decimal_separator=",", nullstr=" ")
	WHERE Growth_Firm = 0

Unnamed: 0,question_2_row_1_transformed
0,35.135135
1,23.018043
2,86.640472
3,17.647059
4,60.0
5,-1.295497
6,12.275449
7,66.666667
8,9.375
9,506.060606


### Instructions

Get the growth rows for question 2, part 1.

- Do the same again, this time getting rows where growth firm status is `1`.
- Assign to `q2_1_growth`.

In [8]:
-- Select the question_2_row_1_transformed column from the survey CSV
-- Get rows where growth firm status is 1
SELECT question_2_row_1_transformed
	FROM read_csv_auto("survey_data.csv", delim=";", decimal_separator=",", nullstr=" ")
	WHERE Growth_Firm = 1

Unnamed: 0,question_2_row_1_transformed
0,580.272109
1,166.666667
2,400.000000
3,7.296137
4,25.000000
...,...
57,227.868852
58,316.666667
59,566.666667
60,471.428571


### Instructions

- Perform a Mann-Whitney U test on `q2_1_non_growth` and `q2_1_growth`.
- Look at the p-value. Is it more or less than `0.05`?

In [9]:
# Perform a Mann-Whitney U test on q2_1_non_growth and q2_1_growth
mannwhitneyu(q2_1_non_growth, q2_1_growth)

MannwhitneyuResult(statistic=array([1299.]), pvalue=array([0.00884359]))
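
To make the `statistic` in the result less of a black box, here is a minimal sketch of one common definition of the U statistic: the number of pairs (one value from each group) in which the first group's value is larger, with ties counting as half. It uses only the rows displayed in the two query outputs above, so the numbers will not match the full-data result. `u_statistic` is a hypothetical helper written for illustration, not SciPy's implementation.

```python
# Only the rows displayed in the query outputs of this code-along,
# not the full survey columns.
non_growth = [35.135135, 23.018043, 86.640472, 17.647059, 60.0,
              -1.295497, 12.275449, 66.666667, 9.375, 506.060606]
growth = [580.272109, 166.666667, 400.0, 7.296137, 25.0,
          227.868852, 316.666667, 566.666667, 471.428571]


def u_statistic(x, y):
    """Count the (x, y) pairs where x beats y; ties count as half."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


u_non_growth = u_statistic(non_growth, growth)
u_growth = u_statistic(growth, non_growth)

# The two statistics always add up to the total number of pairs
print(u_non_growth, u_growth, len(non_growth) * len(growth))
```

In this excerpt the non-growth value is the larger one in only 19 of the 90 pairs, which suggests that growth firms tend to give higher answers. The test turns a lopsided split like this into a p-value.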

## Task 4: Visualizing Categorical Responses

Many of the questions in the survey dataset have categorical responses with 5 options from "Strongly disagree" to "Strongly agree".

The values are encoded as `1` for `Strongly disagree` through to `5` for `Strongly agree`. For visualizing the responses, it is better to have explicit labels rather than numbers.

We'll gradually build up the SQL query to get the counts for each response type then draw a bar plot.

### Useful jargon

These sorts of survey responses, where the answer is a level of agreement with a statement, are called **Likert scales** (or rating scales).

### Instructions

- Import everything from `agree_disagree.csv` as `lookup`.

<details>
    <summary>Code hints</summary>
    <p>
        
- If you can get away with using default arguments to `read_csv_auto()`, then Workspace lets you simply pass the CSV file name in the `FROM` clause.

    </p>
</details> 

In [10]:
-- Import everything from agree_disagree.csv as lookup
SELECT *
	FROM 'agree_disagree.csv' AS lookup

Unnamed: 0,code,response
0,1,Strongly disagree
1,2,Disagree
2,3,Neither agree or disagree
3,4,Agree
4,5,Strongly agree


We're working towards getting the counts for each of the five responses, even if they aren't all present in the dataset. That means that we want zero counts to be allowed. To achieve this, we need a left join.
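
The effect of the join type on zero counts can be sketched with Python's built-in `sqlite3` module. This swaps in SQLite for DuckDB purely so the example is self-contained, and the `answers` table is a hypothetical stand-in holding just the first five displayed `question_3_row_1` responses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lookup (code INTEGER, response TEXT);
    INSERT INTO lookup VALUES
        (1, 'Strongly disagree'),
        (2, 'Disagree'),
        (3, 'Neither agree or disagree'),
        (4, 'Agree'),
        (5, 'Strongly agree');
    -- Hypothetical stand-in for the survey table
    CREATE TABLE answers (answer INTEGER);
    INSERT INTO answers VALUES (4), (5), (3), (3), (4);
""")

query = """
    SELECT lookup.response, COUNT(answers.answer) AS n
    FROM lookup
    {join_type} JOIN answers ON lookup.code = answers.answer
    GROUP BY lookup.code, lookup.response
    ORDER BY lookup.code
"""

# LEFT JOIN keeps every lookup row; COUNT of the unmatched column gives 0
left_counts = conn.execute(query.format(join_type="LEFT")).fetchall()
# INNER JOIN silently drops responses that nobody chose
inner_counts = conn.execute(query.format(join_type="INNER")).fetchall()
conn.close()

print(left_counts)
print(inner_counts)
```

Note that the count is taken over the survey-side column rather than `*`. An unmatched lookup row still produces one joined row, so `COUNT(*)` would report 1 where the answer should be 0.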

### Instructions

Extend the previous code to join the lookup to the survey data.

- Copy and paste the previous code.
- Left join lookup to the survey data on `lookup` `code` equal to `survey` `question_3_row_1`.
- Select the `lookup` `response` and the `survey` `question_3_row_1` columns.

<details>
    <summary>Code hints</summary>
    <p>
        
- You can call `read_csv_auto()` in the `JOIN` clause.

    </p>
</details> 

In [11]:
-- Copy and paste the previous code
-- Left join lookup to the survey data on lookup code equal to survey question_3_row_1
-- Select the lookup response and the survey question_3_row_1 columns
SELECT 
	lookup.response, 
	survey.question_3_row_1
	FROM 'agree_disagree.csv' AS lookup 
	LEFT JOIN read_csv_auto("survey_data.csv", delim=";", decimal_separator=",", nullstr=" ") AS survey
		ON lookup.code = survey.question_3_row_1

Unnamed: 0,response,question_3_row_1
0,Agree,4.0
1,Strongly agree,5.0
2,Neither agree or disagree,3.0
3,Neither agree or disagree,3.0
4,Agree,4.0
...,...,...
116,Strongly agree,5.0
117,Agree,4.0
118,Agree,4.0
119,Neither agree or disagree,3.0


### Instructions

Extend the previous code to get counts.

- Copy and paste the previous code.
- Change the selection from `survey.question_3_row_1` to the count of that column, naming the result as `n`.
- Group by the `lookup` `response`.

In [12]:
-- Copy and paste the previous code
-- Change the selection from survey.question_3_row_1 to the count of that column, naming the result as n
-- Group by the lookup response
SELECT 
	lookup.response, 
	COUNT(survey.question_3_row_1) AS n
	FROM 'agree_disagree.csv' AS lookup 
	LEFT JOIN read_csv_auto("survey_data.csv", delim=";", decimal_separator=",", nullstr=" ") AS survey
		ON lookup.code = survey.question_3_row_1
	GROUP BY lookup.response

Unnamed: 0,response,n
0,Strongly disagree,0
1,Neither agree or disagree,18
2,Agree,67
3,Strongly agree,29
4,Disagree,6


To draw a plot that is easy to interpret, we want to include a color scheme based on the level of agreement with the statement.

Using `lookup.code - 3` gives us a range from `-2` (Strongly disagree) to `2` (Strongly agree).
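
As a quick check of that arithmetic, the sketch below applies the same shift in plain Python. The list of labels is copied from `agree_disagree.csv` as displayed above; the variable names are only illustrative.

```python
# Response labels in code order 1..5, as shown in agree_disagree.csv
responses = [
    "Strongly disagree",
    "Disagree",
    "Neither agree or disagree",
    "Agree",
    "Strongly agree",
]

# Subtracting 3 centres the scale on the neutral response
agreement = {label: code - 3 for code, label in enumerate(responses, start=1)}

print(agreement)
```

Because the neutral response lands on zero and the two extremes are equally far from it, the midpoint of the value range coincides with the neutral answer, which is what a diverging color scale needs.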

### Instructions

Extend the previous code to include the level of agreement, and order the results.

- Copy and paste the previous code.
- Calculate the `lookup` `code` minus 3, naming the result as `agreement`.
- Order the result by `lookup` `code`.
- Assign the result to a DataFrame named `q3_1_counts`.

In [13]:
-- Copy and paste the previous code
-- Calculate the lookup code minus 3, naming the result as agreement
-- Order the result by lookup code
SELECT 
	lookup.response, 
	COUNT(survey.question_3_row_1) AS n,
	lookup.code - 3 AS agreement
	FROM 'agree_disagree.csv' AS lookup 
	LEFT JOIN read_csv_auto("survey_data.csv", delim=";", decimal_separator=",", nullstr=" ") AS survey
		ON lookup.code = survey.question_3_row_1
	GROUP BY lookup.code, lookup.response
	ORDER BY lookup.code

Unnamed: 0,response,n,agreement
0,Strongly disagree,0,-2
1,Disagree,6,-1
2,Neither agree or disagree,18,0
3,Agree,67,1
4,Strongly agree,29,2


Now we are (finally) ready to plot the responses to question 3, part 1.

These types of categorical variables where you have a neutral response and two sets of responses going in opposite directions (agreeing and disagreeing) are best visualized using a diverging color scale.

### Instructions

Draw a bar plot of the response counts.

- Draw a bar plot of `q3_1_counts`.
- On the x axis, plot `response`.
- On the y axis, plot `n`.
- Color the bars by `agreement`.
- Use the diverging continuous color scale `px.colors.diverging.Armyrose_r`.

<details>
    <summary>Code hints</summary>
    <p>
        
- Set a continuous color scale with the `color_continuous_scale` argument to `px.bar()`.
- The diverging scales can be found in `px.colors.diverging`.

    </p>
</details> 

In [14]:
# Draw a bar plot of q3_1_counts
# On the x axis, plot response
# On the y axis, plot n
# Color the bars by agreement
# Use the diverging continuous color scale px.colors.diverging.Armyrose_r
px.bar(
    q3_1_counts, 
    x="response", 
    y="n", 
    color="agreement", 
    color_continuous_scale=px.colors.diverging.Armyrose_r
)

## YOUR TURN: Visualize Another Question with Categorical Responses

### Instructions

Choose another agree-disagree question (any part of q3 to q6), then get the counts of the responses.

- Copy and paste your previous SQL query.
- Change the column to the one for your new question. (The code needs changing in 2 places.)
- Assign the results to a DataFrame with a meaningful name.

In [15]:
-- Get the counts for your new categorical question
SELECT 
	lookup.response, 
	COUNT(survey.question_3_row_13) AS n,
	lookup.code - 3 AS agreement
	FROM 'agree_disagree.csv' AS lookup 
	LEFT JOIN read_csv_auto("survey_data.csv", delim=";", decimal_separator=",", nullstr=" ") AS survey
		ON lookup.code = survey.question_3_row_13
	GROUP BY lookup.code, lookup.response
	ORDER BY lookup.code

Unnamed: 0,response,n,agreement
0,Strongly disagree,4,-2
1,Disagree,25,-1
2,Neither agree or disagree,41,0
3,Agree,36,1
4,Strongly agree,14,2


### Instructions

Draw a bar plot of the response counts for your new question.

- Copy and paste your previous plotting code.
- Change the dataset to your new DataFrame of counts. 

In [16]:
# Visualize the responses from your new categorical question
px.bar(
    q3_13_counts, 
    x="response", 
    y="n", 
    color="agreement", 
    color_continuous_scale=px.colors.diverging.Armyrose_r
)

## Homework Tasks

- Visualize the relationship between responses for both the numeric questions using a scatter plot, with points colored by growth status.
- Visualize the relationship between responses for two categorical questions using a heatmap, with cells colored by count. How might you extend this to display the growth statuses?
- Find out which questions had the strongest agreement with the statement. That is, calculate which questions had the highest average numeric score for the responses.
- Find out which questions had the strongest level of feeling in the responses. That is, calculate which questions had more "Strongly agree" and "Strongly disagree" responses. Think of a way to weight the different responses and calculate an average for each question.

## Keep Learning!

Learn more about 

- analyzing survey data in [Analyzing Survey Data in Python](https://www.datacamp.com/courses/analyzing-survey-data-in-python)
- histograms and bar plots in [Understanding Data Visualization](https://www.datacamp.com/courses/understanding-data-visualization)
- data visualization with Plotly in [Introduction to Data Visualization with Plotly in Python](https://www.datacamp.com/courses/introduction-to-data-visualization-with-plotly-in-python)
- the Mann-Whitney U test in [Hypothesis Testing in Python](https://www.datacamp.com/courses/hypothesis-testing-in-python)
- left joins in [Joining Data in SQL](https://www.datacamp.com/courses/joining-data-in-sql)