# Survey data App

This notebook can be used as a starting point for creating a Datapane app that presents an analysis of survey data.

This sample Datapane app demonstrates:
- Interactively building a complex app from a Notebook
- Serving a data app from a Notebook
- Basic Datapane Forms

<img width="400" alt="preview" src="https://user-images.githubusercontent.com/15690380/188167767-fb6aa495-050e-4bc8-a046-c0d8eb6ab53a.png">

Looking through the [2021 Kaggle Machine Learning & Data Science Survey](https://www.kaggle.com/c/kaggle-survey-2021), let's build an app that's focussed on individuals who use Python.

We'll take respondents who use Python (`Q7_Part_1`) and segment them by:

- Current role (`Q5`)
- Industry (`Q20`)
- Size of data science team at work (`Q22`)
- Primary tool (`Q41`)

For each segment, we'll look at the following questions:

- Which Python IDEs they use (`Q9`)
- Which hosted Python notebook products they use (`Q10`)
- Which visualization libraries they use (`Q14`)
- Which BI tools they use (`Q34-A`)
- Which BI tools they want to look at (`Q34-B`)
- Where they share data analyses (`Q39`)
- Which part of the pipeline (`Q24`)

In [None]:
import datapane as dp
import altair as alt
import pandas as pd
import random

## Load the data

Our data is in a gzipped CSV file and `pandas` provides the wrangling operations we need, so let's [load the data](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#csv-text-files) into a `DataFrame`.

In addition to the column headers, the first row of the file contains a description of each column. Let's filter that row out and show the result.

In [None]:
src_data = pd.read_csv("./kaggle_survey_2021_responses.csv.gz")
data = src_data.iloc[1:]
dp.DataTable(data.head())

Now let's move the descriptions to their own DataFrame and show them.

In [None]:
descriptions = src_data.iloc[0].to_frame()
dp.DataTable(descriptions)
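
As a quick check of the header-row handling, the sketch below repeats the same `iloc` split on a tiny in-memory CSV. The column names match the survey, but the rows and the `toy_` names are made up for illustration.

```python
import io

import pandas as pd

# Toy stand-in for the survey file: the first row after the header
# holds the question text, not a response (illustrative data only).
raw_csv = io.StringIO(
    "Q5,Q7_Part_1\n"
    "Select the title most similar to your current role,What programming languages do you use? - Python\n"
    "Data Scientist,Python\n"
    "Student,\n"
)
toy_src = pd.read_csv(raw_csv)

# Row 0 is the description row: a Series mapping column name -> question text.
toy_descriptions = toy_src.iloc[0]

# Everything after row 0 is response data; reset the index so it starts at 0.
toy_data = toy_src.iloc[1:].reset_index(drop=True)

print(toy_descriptions["Q5"])
print(toy_data)
```

The notebook keeps the original index (starting at 1) rather than resetting it; either choice works for the steps that follow.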

In [None]:
segment_mapping = {
    "Roles": "Q5",
    "Industry": "Q20",
    "DS Team Size": "Q22",
    "Primary Tool": "Q41",
}

## Word cloud function

In [None]:
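def word_cloud(df: pd.DataFrame) -> dp.Plot:
    """Plot every non-null answer in `df` as a word, sized and shaded by frequency."""
    # Stack all columns into one, then count how often each answer occurs
    words_and_counts = df.melt()["value"].dropna().value_counts()
    words_and_counts = pd.DataFrame(words_and_counts).rename_axis().reset_index()
    words_and_counts.columns = ["word", "count"]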

    def shuffled_range(n):
        return random.sample(range(n), k=n)

    n = len(words_and_counts)
    x = shuffled_range(n)
    y = shuffled_range(n)

    word_cloud_data = words_and_counts.assign(x=x, y=y)

    base = (
        alt.Chart(word_cloud_data)
        .encode(x=alt.X("x:O", axis=None), y=alt.Y("y:O", axis=None))
        .configure(background="#eef2ff")
    )

    word_cloud = (
        base.mark_text(baseline="middle")
        .encode(
            text="word:N",
            color=alt.Color("count:Q", scale=alt.Scale(scheme="purpleblue")),
            size=alt.Size("count:Q", legend=None, scale=alt.Scale(range=[20, 50])),
        )
        .configure_view(strokeWidth=0)
    )

    return dp.Plot(word_cloud)

Test our word cloud on all the question 9 responses:

In [None]:
word_cloud(data.filter(like="Q9_Part"))
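
To see what `word_cloud` does before any plotting, here is a minimal sketch of its two data steps on a made-up three-row frame: counting answers across multi-part columns, and assigning each word a grid position. The `toy` data and the variable names are assumptions for illustration.

```python
import random

import pandas as pd

# Multi-part questions store one option per column: the option name if
# the respondent ticked it, otherwise a missing value (toy data).
toy = pd.DataFrame(
    {
        "Q9_Part_1": ["JupyterLab", None, "JupyterLab"],
        "Q9_Part_2": [None, "VSCode", "VSCode"],
        "Q9_Part_3": ["PyCharm", None, None],
    }
)

# melt() stacks every column into a single "value" column, so one
# value_counts() call counts answers across all parts of the question.
word_counts = toy.melt()["value"].dropna().value_counts().to_dict()

# random.sample(range(n), k=n) returns a permutation of 0..n-1, so every
# word gets its own x slot and its own y slot on the ordinal grid.
n = len(word_counts)
xs = random.sample(range(n), k=n)
ys = random.sample(range(n), k=n)

print(word_counts)
print(list(zip(word_counts, xs, ys)))
```

Because `xs` and `ys` are independent permutations, no two words share a row or a column, which spreads the words out without any overlap-avoidance logic. The layout changes on every run unless `random.seed` is set first.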

## Plotting segments

In [None]:
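def plot_segment_distribution(df: pd.DataFrame, segment_name: str) -> dp.Plot:
    """Bar chart of how many respondents in `df` fall into each value of a segment."""
    segments = df[segment_mapping[segment_name]]
    counts = pd.DataFrame(segments.value_counts()).rename_axis().reset_index()
    # After reset_index: first column is the segment value, second is its count
    counts.columns = [segment_name, "counts"]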

    fig = (
        alt.Chart(counts)
        .mark_bar()
        .encode(
            x=alt.X(counts.columns[0], sort="-y", axis=alt.Axis(labelAngle=-45)),
            y="counts",
            color=alt.Color(segment_name, scale=alt.Scale(scheme="rainbow"), legend=None),
        )
    )

    return dp.Plot(fig)

Test plotting the distribution of the Roles segment:

In [None]:
plot_segment_distribution(data, "Roles")
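
The counting step inside `plot_segment_distribution` can be tried on its own. This sketch uses a made-up `Q5` series and, as an alternative to assigning `counts.columns` positionally, names both columns while resetting the index; it is one possible variant, not the notebook's own code.

```python
import pandas as pd

# Toy stand-in for data["Q5"]; None marks a skipped question.
toy_roles = pd.Series(
    ["Student", "Data Scientist", "Student", None, "Student"], name="Q5"
)

counts = (
    toy_roles.value_counts()          # missing values are excluded by default
    .rename_axis("Roles")             # name the index after the segment label
    .reset_index(name="counts")       # index -> "Roles" column, values -> "counts"
)

print(counts)
```

The result is sorted by descending count, which is the same order the chart requests with `sort="-y"`.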

## Filtering with a form

Let's allow our user to filter the survey data and generate a word cloud themselves.

First, let's filter by programming language. The responses are in the columns with a `Q7_Part_` prefix, one column per language.

In [None]:
programming_languages = data.filter(like="Q7_Part_").melt().dropna().drop_duplicates().reset_index(drop=True)

# Add an "All" option so the form can switch the language filter off
programming_languages = pd.concat(
    [
        programming_languages,
        pd.DataFrame([["*", "All"]], columns=programming_languages.columns),
    ],
    ignore_index=True,
)

dp.Table(programming_languages)
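
The lookup table built above pairs each `Q7_Part_` column with the language it records. The sketch below reproduces the same chain on a made-up two-column frame so the intermediate shape is easy to inspect; `toy` and `lookup` are illustrative names.

```python
import pandas as pd

# Toy multi-part answers: each column holds one language name or a missing value.
toy = pd.DataFrame(
    {
        "Q7_Part_1": ["Python", None, "Python"],
        "Q7_Part_2": [None, "R", "R"],
    }
)

# melt() yields (variable, value) pairs; dropping missing values and
# duplicates leaves exactly one row per column/language pair.
lookup = toy.melt().dropna().drop_duplicates().reset_index(drop=True)

# Append the catch-all option; "*" is a placeholder, not a real column.
lookup = pd.concat(
    [lookup, pd.DataFrame([["*", "All"]], columns=lookup.columns)],
    ignore_index=True,
)

print(lookup)
```

The `value` column feeds the form's option list, and the `variable` column tells the callback which survey column to filter on.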

## Overall summary

In [None]:
stats_group = dp.Group(
    dp.BigNumber(heading="Participants", value=len(data)),
    dp.BigNumber(heading="Segments", value=len(segment_mapping)),
)

stats_group

In [None]:
def process(programming_language: str, choice: str) -> dp.Group:
    """Form handler: filter by language, then rebuild the word cloud and segment plots."""
    filtered_data = data

    if programming_language != "All":
        # Find the Q7_Part_ column that records this language
        programming_language_column = programming_languages[
            programming_languages.value == programming_language
        ].variable.item()
        filtered_data = filtered_data[filtered_data[programming_language_column] == programming_language]

    segment_plots = dp.Group(
        blocks=[plot_segment_distribution(filtered_data, segment) for segment in segment_mapping.keys()]
    )

    # Map the form's question label to the survey column prefix
    choice_to_question = {
        "What developer environment do you use?": "Q9",
        "What programming language do you use?": "Q7",
    }

    word_cloud_plot = word_cloud(filtered_data.filter(like=choice_to_question[choice]))

    return dp.Group(
        f"## {choice}",
        word_cloud_plot,
        "## Breakdown by segment",
        segment_plots,
    )


form = dp.Form(
    on_submit=process,
    target="output",
    label="Filters",
    controls=[
        dp.Choice(
            "programming_language",
            label="Programming language",
            initial="All",
            options=list(programming_languages.value),
        ),
        dp.Choice(
            "choice",
            label="Word cloud: survey question",
            initial="What programming language do you use?",
            options=[
                "What developer environment do you use?",
                "What programming language do you use?",
            ],
        ),
    ],
)


v = dp.View(
    "# Kaggle Survey 2021",
    dp.Group(form, stats_group, columns=2),
    dp.Empty(name="output"),
)

In [None]:
dp.serve_app(v)
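
To check the filtering logic in `process` without starting the server, the row selection can be pulled into a plain function and run against toy data. `filter_by_language` is a hypothetical helper written for this sketch, and the frames below are made up.

```python
import pandas as pd


def filter_by_language(df: pd.DataFrame, lookup: pd.DataFrame, language: str) -> pd.DataFrame:
    """Keep the rows of `df` whose respondent ticked `language` (hypothetical helper)."""
    if language == "All":
        return df
    # .item() raises ValueError unless exactly one lookup row matches,
    # which guards against an unknown or ambiguous language name.
    column = lookup.loc[lookup["value"] == language, "variable"].item()
    return df[df[column] == language]


# Toy responses and the matching column/language lookup table.
toy = pd.DataFrame(
    {
        "Q7_Part_1": ["Python", None, "Python"],
        "Q7_Part_2": [None, "R", "R"],
    }
)
lookup = pd.DataFrame(
    {"variable": ["Q7_Part_1", "Q7_Part_2", "*"], "value": ["Python", "R", "All"]}
)

python_rows = filter_by_language(toy, lookup, "Python")
r_rows = filter_by_language(toy, lookup, "R")
all_rows = filter_by_language(toy, lookup, "All")

print(len(python_rows), len(r_rows), len(all_rows))
```

Keeping the filter separate from the Datapane blocks makes it testable in isolation; the form handler then only has to turn the filtered frame into plots.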