# The influence of AI on StackOverflow users

In the following analysis, I want to take data from the last years of StackOverflow user surveys in order to study how the *growing influence of AI affected the StackOverflow users*. To be more specific, the influence of AI will be assessed by studying
1. the development of the **AI-related survey questions**,
2. how the users **perceived AI** and if this perception has changed over the years,
3. for **which purpose** AI is used.

## 1. Loading libraries and data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.axes import Axes
from matplotlib.patches import Patch
import seaborn as sns
import os
import glob
import re
import math
from collections import defaultdict, Counter
from typing import Optional, Dict, List, Tuple, Literal
%matplotlib inline

pd.set_option('display.max_colwidth', None)

The survey data is loaded into a dictionary with each dictionary key-value pair indicating a specific survey and the corresponding survey data.

In [None]:
# -------------------------------------------------------------------
# Load the data of several StackOverflow user surveys into data frames
# -------------------------------------------------------------------
def load_stackoverflow_data(folder_path, load_schema=False):
    """
    Load Stack Overflow survey or schema CSV files from a folder into a dictionary of DataFrames.
    
    Parameters:
        folder_path (str): Path to the folder containing CSV files.
        load_schema (bool): If True, load schema files (filenames containing 'schema').
                            If False, load survey response files.
    
    Returns:
        dict: { 'survey_<year>': DataFrame }
    """
    # Determine filter based on schema flag
    if load_schema:
        csv_files = [f for f in glob.glob(os.path.join(folder_path, "*.csv")) if "schema" in os.path.basename(f).lower()]
    else:
        csv_files = [f for f in glob.glob(os.path.join(folder_path, "*.csv")) if "schema" not in os.path.basename(f).lower()]

    data_dict = {}

    for file in csv_files:
        filename = os.path.basename(file)
        
        # Extract year from filename
        match = re.search(r"(20)\d{2}", filename)
        year = int(match.group()) if match else None
        
        # Read CSV with encoding fallback
        try:
            df = pd.read_csv(file, encoding="utf-8")
        except UnicodeDecodeError:
            df = pd.read_csv(file, encoding="latin1")
        
        # Add Year column
        df["Year"] = year
        
        # Create dictionary key
        key_name = f"survey_{year}" if year else "survey_unknown"
        data_dict[key_name] = df

    return data_dict
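
The dictionary key of each survey depends entirely on the year found in the filename. The following minimal sketch isolates that step; `extract_year` is a hypothetical helper and the filenames are made up for illustration, they are not taken from the data folder.

```python
import re

def extract_year(filename):
    """Return the first 20xx year found in a filename, or None if there is none."""
    match = re.search(r"20\d{2}", filename)
    return int(match.group()) if match else None

# Made-up filenames, for illustration only
for name in ["survey_results_public_2023.csv", "survey_results_schema_2024.csv", "readme.csv"]:
    year = extract_year(name)
    key = f"survey_{year}" if year else "survey_unknown"
    print(f"{name} -> {key}")
```

Note that two files resolving to the same key (for example, two files without a year) would overwrite each other in the dictionary, so it is worth checking the keys after loading.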


In [None]:
schema_dfs = load_stackoverflow_data(folder_path="data", load_schema=True)
survey_dfs = load_stackoverflow_data(folder_path="data", load_schema=False)

Make sure that the survey data and the accompanying explanatory schema data cover the same surveys.

In [None]:
survey_dfs.keys()

In [None]:
schema_dfs.keys()
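
Instead of comparing the two key listings by eye, the check can also be done programmatically. This is only a sketch: `compare_survey_keys` is a hypothetical helper, and the toy dictionaries stand in for `survey_dfs` and `schema_dfs`.

```python
def compare_survey_keys(survey_dfs, schema_dfs):
    """Report which survey keys are missing on either side."""
    survey_keys, schema_keys = set(survey_dfs), set(schema_dfs)
    return {
        "only_in_survey": sorted(survey_keys - schema_keys),
        "only_in_schema": sorted(schema_keys - survey_keys),
    }

# Toy stand-ins; only the keys matter for this check
toy_survey = {"survey_2023": None, "survey_2024": None}
toy_schema = {"survey_2023": None, "survey_2025": None}

result = compare_survey_keys(toy_survey, toy_schema)
print(result)
```

Two empty lists would mean that every survey has a matching schema and vice versa.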

## 2. Data assessment and wrangling

Let's take a closer look at the survey data and find the questions that are helpful to investigate the influence of AI on the StackOverflow users.

### 2.1 Definition of helper functions for data assessment

In [None]:
# -------------------------------------------------------------------
# Find common columns across a set of dataframes
# -------------------------------------------------------------------
def find_common_columns(df_set):
    """
    Find columns that appear in at least two DataFrames and show which surveys they belong to.
    Sorted by frequency (descending).

    Args:
        df_set (dict): Dictionary of DataFrames {survey_name: DataFrame}

    Returns:
        list of tuples: [(column_name, count, [surveys]), ...] sorted by count desc
    """
    # Map each column to the surveys it appears in
    column_map = defaultdict(list)
    for survey_name, df in df_set.items():
        for col in df.columns:
            column_map[col].append(survey_name)

    # Keep only columns that appear in at least two surveys
    common_columns = {col: surveys for col, surveys in column_map.items() if len(surveys) >= 2}

    # Sort by frequency
    common_sorted = sorted(
        [(col, len(surveys), surveys) for col, surveys in common_columns.items()],
        key=lambda x: x[1],
        reverse=True
    )

    # Print results
    print("Columns common to at least two surveys (sorted by frequency):")
    for col, count, surveys in common_sorted:
        print(f"{col} ({count} surveys): {surveys}")

    return common_sorted

In [None]:
# -------------------------------------------------------------------
# Find columns that are unique to a single survey
# -------------------------------------------------------------------
def find_unique_columns(df_set):
    """
    Find columns that are unique to a single survey, list them by survey, and provide a summary count.

    Args:
        df_set (dict): Dictionary of DataFrames {survey_name: DataFrame}

    Returns:
        dict: {survey: [unique_columns]}
    """
    
    # Map each column to the surveys it appears in
    column_map = defaultdict(list)
    for survey_name, df in df_set.items():
        for col in df.columns:
            column_map[col].append(survey_name)

    # Filter unique columns
    unique_columns = {col: surveys for col, surveys in column_map.items() if len(surveys) == 1}

    # Organize by survey
    survey_unique_map = defaultdict(list)
    for col, surveys in unique_columns.items():
        survey_unique_map[surveys[0]].append(col)

    # Sort surveys alphabetically and columns inside each survey
    survey_unique_map = {survey: sorted(cols) for survey, cols in sorted(survey_unique_map.items())}

    # Print results
    print("Unique columns per survey:")
    for survey, cols in survey_unique_map.items():
        print(f"{survey} ({len(cols)} unique columns): {cols}")

    return survey_unique_map
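
To illustrate the difference between common and unique columns, here is a small sketch of the shared column-mapping idea on toy data. The column names `OnlyIn2023` and `OnlyIn2024` are invented for the example.

```python
from collections import defaultdict

import pandas as pd

# Toy surveys with empty DataFrames; only the column names matter here
toy = {
    "survey_2023": pd.DataFrame(columns=["Age", "AISelect", "OnlyIn2023"]),
    "survey_2024": pd.DataFrame(columns=["Age", "AISelect", "OnlyIn2024"]),
}

# Map each column to the surveys it appears in
column_map = defaultdict(list)
for survey_name, df in toy.items():
    for col in df.columns:
        column_map[col].append(survey_name)

common = sorted(col for col, surveys in column_map.items() if len(surveys) >= 2)
unique = {col: surveys[0] for col, surveys in column_map.items() if len(surveys) == 1}

print("common:", common)
print("unique:", unique)
```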

In [None]:
# -------------------------------------------------------------------
# Find questions across all surveys by keyword
# -------------------------------------------------------------------
def find_questions_by_keyword(schema_dfs, keyword,
                               col_name_cols=("Column", "qname"),
                               question_cols=("question", "QuestionText", "Question"),
                               case=True, verbose=True):
    """
    Find all questions containing a given keyword AND a question mark across multiple schema DataFrames.

    Parameters:
        schema_dfs (dict): {survey_name: DataFrame}
        keyword (str): keyword to search for
        col_name_cols (tuple): possible column name fields
        question_cols (tuple): possible question text fields
        case (bool): case-sensitive search (default True)
        verbose (bool): if True, prints grouped results

    Returns:
        DataFrame: ['survey', 'column_name', 'question']
    """
    
    rows = []
    pattern = f"(?=.*{re.escape(keyword)})(?=.*[?？])"  # keyword + question mark

    for survey_name, schema in schema_dfs.items():
        col_name_col = next((c for c in col_name_cols if c in schema.columns), None)
        q_col = next((c for c in question_cols if c in schema.columns), None)
        if not col_name_col or not q_col:
            continue

        mask = schema[q_col].astype(str).str.contains(pattern, case=case, na=False, regex=True)
        matches = schema.loc[mask, [col_name_col, q_col]].copy()
        if not matches.empty:
            matches["survey"] = survey_name
            matches.rename(columns={col_name_col: "column_name", q_col: "question"}, inplace=True)
            rows.append(matches)

    if not rows:
        if verbose:
            print(f"No matches found for keyword '{keyword}' with a question mark.")
        return pd.DataFrame(columns=["survey", "column_name", "question"])

    df = pd.concat(rows, ignore_index=True)

    if verbose:
        print(f"Found {len(df)} matches for keyword '{keyword}':")
        for survey in df["survey"].unique():
            print(f"\n--- {survey} ---")
            subset = df[df["survey"] == survey]
            for _, row in subset.iterrows():
                print(f"{row['column_name']}: {row['question']}")

    return df
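
The search pattern combines two lookaheads, one for the keyword and one for a question mark. The sketch below shows how it behaves on a few invented question texts. It also shows that a case-sensitive substring search can still match words such as "TRAINING". The stricter word-boundary variant is only a possible refinement and is not part of the function above.

```python
import re

keyword = "AI"
# Same construction as in find_questions_by_keyword: keyword AND a question mark
pattern = re.compile(f"(?=.*{re.escape(keyword)})(?=.*[?？])")

# Invented sample texts, for illustration only
samples = [
    "Do you currently use AI tools in your development process?",  # keyword + "?"
    "AI tools you use",                                            # no question mark
    "Which TRAINING resources do you use?",                        # "AI" inside "TRAINING"
    "What is your main programming language?",                     # lowercase "ai" only
]
results = [bool(pattern.search(s)) for s in samples]
print(results)

# Possible refinement (an assumption, not in the original): require a whole word
strict = re.compile(r"(?=.*\bAI\b)(?=.*[?？])")
strict_results = [bool(strict.search(s)) for s in samples]
print(strict_results)
```

Because `.` does not match newlines by default, a keyword and a question mark that sit on different lines of a multi-line question text would not be matched by either pattern.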


In [None]:
# -------------------------------------------------------------------
# Find either duplicated questions or column names across all surveys
# -------------------------------------------------------------------
def find_duplicates_across_surveys(df, check_on="question", duplicate_min=2, only_duplicates=True, verbose=True):
    """
    Find duplicates across surveys based on question text OR column_name.

    Parameters:
        df (DataFrame): Output from find_questions_by_keyword or similar
        check_on (str): 'question' or 'column_name' (field to check duplicates on)
        duplicate_min (int): Minimum number of surveys for a duplicate (default 2)
        only_duplicates (bool): If True, return only rows that meet the duplicate_min threshold
        verbose (bool): if True, prints grouped results

    Returns:
        DataFrame: ['survey', 'column_name', 'question', 'survey_count', 'duplicate_flag']
    """
    import pandas as pd
    import re
    import unicodedata

    if df.empty:
        if verbose:
            print("No data to check for duplicates.")
        return pd.DataFrame(columns=["survey", "column_name", "question", "survey_count", "duplicate_flag"])

    # Normalization for grouping
    def normalize(text):
        text = unicodedata.normalize("NFKC", str(text))
        text = text.replace("\u00A0", " ").replace("\u202F", " ")
        text = re.sub(r"[\u200B-\u200D\u2060\uFEFF]", "", text)
        text = text.replace("*", "").replace("＊", "")
        text = re.sub(r"\s+", " ", text.lower()).strip()
        return text

    if check_on not in ["question", "column_name"]:
        raise ValueError("check_on must be 'question' or 'column_name'")

    df = df.copy()
    df["normalized"] = df[check_on].apply(normalize)
    df = df[df["normalized"].ne("")]

    # Count surveys per normalized value
    survey_counts = df.groupby("normalized")["survey"].nunique()
    df["survey_count"] = df["normalized"].map(survey_counts)
    df["duplicate_flag"] = df["survey_count"] >= duplicate_min

    # Filter if only duplicates requested
    if only_duplicates:
        df = df[df["duplicate_flag"]]

    df = df.drop(columns="normalized").sort_values(
        by=["duplicate_flag", "survey_count", "survey"], ascending=[False, False, True]
    ).reset_index(drop=True)

    if verbose and not df.empty:
        print(f"Duplicates based on {check_on} (min surveys = {duplicate_min}):")
        for survey in df["survey"].unique():
            print(f"\n--- {survey} ---")
            subset = df[df["survey"] == survey]
            for _, row in subset.iterrows():
                marker = f" (DUPLICATE, {row['survey_count']})" if row['survey_count'] >= duplicate_min else ""
                print(f"{row['column_name']}: {row['question']}{marker}")
    elif verbose:
        print("No duplicates found.")

    return df
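
The normalization step decides which question texts count as identical. The following sketch repeats the inner `normalize` function on two invented variants of the same question that differ only in case, whitespace, invisible characters, and a trailing asterisk.

```python
import re
import unicodedata

def normalize(text):
    """Same normalization as inside find_duplicates_across_surveys."""
    text = unicodedata.normalize("NFKC", str(text))
    text = text.replace("\u00A0", " ").replace("\u202F", " ")
    text = re.sub(r"[\u200B-\u200D\u2060\uFEFF]", "", text)
    text = text.replace("*", "").replace("＊", "")
    text = re.sub(r"\s+", " ", text.lower()).strip()
    return text

# Invented variants of one question, for illustration only
variant_a = "How favorable is your stance on using AI tools?*"
variant_b = "how  favorable is your stance on\u00A0using AI\u200B tools?"

print(normalize(variant_a))
print(normalize(variant_a) == normalize(variant_b))
```

Questions whose wording actually changed between years are still treated as different, which is why the check on `column_name` is a useful complement.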


In [None]:
# -------------------------------------------------------------------
# Retrieve the question for a given column name across all surveys
# -------------------------------------------------------------------
def get_question_description(column_name, schema_dfs,
                    col_name_cols=("Column", "qname"),
                    question_cols=("question", "QuestionText", "Question"),
                    return_all=True):
    """
    Retrieve the description of a column from multiple schema DataFrames.

    Parameters:
        column_name (str): The name of the column you want to look up.
        schema_dfs (dict): {survey_name: DataFrame} of schema tables.
        col_name_cols (tuple): Possible column names that identify the column name in the schema.
        question_cols (tuple): Possible column names that contain the question text.
        return_all (bool): If True, return all matches as a dict {survey: description}. 
                           If False, return the first match.

    Returns:
        str or dict: The description of the column (first match) or a dict of all matches.
    """
    
    results = {}

    for survey_name, schema in schema_dfs.items():
        # Find which column in schema holds the column names
        col_name_col = next((c for c in col_name_cols if c in schema.columns), None)
        if not col_name_col:
            continue

        # Find which column in schema holds the question text
        q_col = next((c for c in question_cols if c in schema.columns), None)
        if not q_col:
            continue

        # Filter for the requested column_name
        match = schema.loc[schema[col_name_col] == column_name, q_col]

        if not match.empty:
            results[survey_name] = match.iloc[0]
            if not return_all:
                return match.iloc[0]

    return results if results else None


In [None]:
# -------------------------------------------------------------------
# Retrieve the column name for a given question across all surveys
# -------------------------------------------------------------------
def get_column_name(question_text, schema_dfs,
                    col_name_cols=("Column", "qname"),
                    question_cols=("question", "QuestionText", "Question"),
                    case=False, return_all=True):
    """
    Find column name(s) for a given question text from multiple schema DataFrames.
    Handles case-insensitive and partial matches, and escapes regex characters.

    Parameters:
        question_text (str): The question text (or part of it) to search for.
        schema_dfs (dict): {survey_name: DataFrame} of schema tables.
        col_name_cols (tuple): Possible column name fields in schema.
        question_cols (tuple): Possible question text fields in schema.
        case (bool): Case-sensitive search (default False).
        return_all (bool): If True, return all matches as {survey: [columns]}.
                           If False, return the first match found.

    Returns:
        dict or str or None
    """
    results = {}

    # Normalize query: collapse whitespace and lowercase if case=False
    query = " ".join(question_text.split())
    if not case:
        query = query.lower()

    for survey_name, schema in schema_dfs.items():
        # Detect relevant columns
        col_name_col = next((c for c in col_name_cols if c in schema.columns), None)
        q_col = next((c for c in question_cols if c in schema.columns), None)
        if not col_name_col or not q_col:
            continue

        # Normalize schema question text
        q_series = schema[q_col].astype(str).str.replace(r"\s+", " ", regex=True).str.strip()
        if not case:
            q_series = q_series.str.lower()

        # Escape regex special chars in query for literal match
        pattern = re.escape(query)

        mask = q_series.str.contains(pattern, na=False)
        matches = schema.loc[mask, col_name_col]

        if not matches.empty:
            results[survey_name] = matches.tolist()
            if not return_all:
                return matches.iloc[0]

    return results if results else None


In [None]:
# -------------------------------------------------------------------
# Calculate the value count of a given column across all surveys
# -------------------------------------------------------------------
def value_counts_all_surveys(survey_dfs, column_name, normalize=False, dropna=True, verbose=False):
    """
    Compute value counts for a given column across all surveys, with optional verbose output.

    Parameters:
        survey_dfs (dict): {survey_name: DataFrame}
        column_name (str): Column to compute value counts for
        normalize (bool): If True, return relative frequencies (like pandas normalize)
        dropna (bool): If True, exclude NaN values
        verbose (bool): If True, print detailed info for each survey

    Returns:
        dict: {survey_name: Series of value counts}
    """
    results = {}

    for survey_name, df in survey_dfs.items():
        if column_name in df.columns:
            counts = df[column_name].value_counts(normalize=normalize, dropna=dropna)
            results[survey_name] = counts

            if verbose:
                print(f"\n--- Survey: {survey_name} ---")
                print(f"Column analyzed: {column_name}")
                print(f"Total rows: {len(df)}")
                print(f"Unique values: {df[column_name].nunique(dropna=dropna)}")
                print(f"NaN values: {df[column_name].isna().sum()}")
                print("Value counts:")
                print(counts.to_string())  # Full output without truncation
        else:
            results[survey_name] = None
            if verbose:
                print(f"\n--- Survey: {survey_name} ---")
                print(f"Column '{column_name}' not found in this survey.")

    return results

In [None]:
# -------------------------------------------------------------------
# Replace missing values in the survey data with "no answer"
# -------------------------------------------------------------------
def replace_missing_all_columns(
    survey_dfs: Dict[str, pd.DataFrame],
    na_label: str = "No answer",
    treat_blank_as_na: bool = True,
    as_category: bool = True,
    inplace: bool = False
) -> Dict[str, pd.DataFrame]:
    """
    Replace NaN (and optionally blank strings) in ALL columns of each survey DataFrame.

    Parameters
    ----------
    survey_dfs : dict[str, pd.DataFrame]
        Mapping of survey name -> DataFrame.
    na_label : str, default "No answer"
        Label to use for missing values.
    treat_blank_as_na : bool, default True
        If True, empty/whitespace-only strings are treated as NaN before replacement.
    as_category : bool, default True
        If True, convert all object/string columns to 'category' after replacement.
    inplace : bool, default False
        If True, modify DataFrames in `survey_dfs` directly and return the same dict.
        If False, return a new dict with copies.

    Returns
    -------
    dict[str, pd.DataFrame]
        Dict of processed DataFrames (same object if inplace=True).
    """
    out = survey_dfs if inplace else {}

    for name, df in survey_dfs.items():
        target = df if inplace else df.copy()

        # Optionally treat blanks as NaN
        if treat_blank_as_na:
            target = target.replace(r"^\s*$", pd.NA, regex=True)

        # Replace all NaNs with na_label
        target = target.fillna(na_label)

        # Convert object columns to category if requested
        if as_category:
            for col in target.columns:
                if target[col].dtype == object or pd.api.types.is_string_dtype(target[col]):
                    target[col] = target[col].astype("category")

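        # Store the result in both modes: replace() and fillna() return new objects,
        # so with inplace=True the dictionary entry has to be reassigned as well
        out[name] = target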

    return out
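
The core of the replacement is a three-step chain: blanks become missing values, missing values get a label, and the column is converted to a categorical type. Here is a minimal sketch on a toy column; the answer values are invented.

```python
import pandas as pd

# Toy column with one real NaN and one whitespace-only answer
toy = pd.DataFrame({"AISent": ["Favorable", None, "   ", "Unsure"]})

# 1) blank strings -> NA, 2) NA -> "No answer", 3) convert to category
cleaned = toy.replace(r"^\s*$", pd.NA, regex=True).fillna("No answer")
cleaned["AISent"] = cleaned["AISent"].astype("category")

counts = cleaned["AISent"].value_counts()
print(counts)
```

After this step, non-responses show up as their own category in the value counts instead of being silently dropped.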


### 2.2 General data assessment

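First, let's make a copy of the original schema DataFrames in case we need to make changes for the analysis.

In [None]:
# Copy each DataFrame; dict.copy() alone would still share the underlying DataFrames
schema_dfs_original = {name: df.copy() for name, df in schema_dfs.items()}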

In [None]:
survey_common_cols = find_common_columns(survey_dfs)

Quite a few columns appear in several surveys. However, the most common ones are not obviously linked to AI, so we have to investigate further and maybe restrict ourselves to the surveys of the last couple of years.

In [None]:
survey_unique_cols = find_unique_columns(survey_dfs)

Here we can already see more AI-related questions. As they are all unique to one survey, they are difficult to compare over time. However, this shows that the questions related to AI have changed a lot, which is in line with the dynamic development of AI itself.

In [None]:
schema_common_cols = find_common_columns(schema_dfs)

Even the schema data has changed over the years, which makes a long-term analysis more difficult. However, if we restrict ourselves to the last few years, the schema data should be roughly comparable.

In [None]:
schema_unique_cols = find_unique_columns(schema_dfs)

### 2.3. Question 1: Assessment of AI-related questions.

In [None]:
df_matches = find_questions_by_keyword(schema_dfs, "AI")

On the one hand, the number of AI-related questions seems to have increased a lot in recent years. On the other hand, we still need to find questions that we can use to compare the development over several years.

In [None]:
df_matches.head()

In [None]:
df_matches["survey"].value_counts()

In [None]:
survey_counts_AI = df_matches["survey"].value_counts()

In [None]:
# Extract years from the survey column (assuming format like 'survey_2018')
df_matches['year'] = df_matches['survey'].str.extract(r'(\d{4})')

plt.figure(figsize=(10, 6))
ax = sns.countplot(
    data=df_matches,
    x='year',
    order=df_matches['survey'].str.extract(r'(\d{4})')[0].drop_duplicates(),
    color='royalblue'
)

# Annotate each bar with integer count
for p in ax.patches:
    height = int(p.get_height())
    ax.text(
        x=p.get_x() + p.get_width() / 2,
        y=height + 0.5,
        s=f"N={height}",
        ha="center"
    )

plt.title("Number of survey questions related to AI")
plt.ylabel("Count")
plt.xlabel("Survey")
plt.tight_layout()
plt.show()

Indeed, the number of AI-related questions has increased significantly, reaching 31 questions in 2025. How does this compare to the total number of questions in each survey, though?

In [None]:
# Count total number of columns in each survey
survey_column_counts = {key: df.shape[1] for key, df in survey_dfs.items()}

# Count AI-related questions in df_matches for each survey
ai_question_counts = df_matches["survey"].value_counts().to_dict()

# Prepare data
years = sorted([key.split('_')[1] for key in survey_column_counts.keys()])
total_counts = [survey_column_counts[f"survey_{year}"] for year in years]
ai_counts = [ai_question_counts.get(f"survey_{year}", 0) for year in years]
non_ai_counts = [total - ai for total, ai in zip(total_counts, ai_counts)]

# Plot
plt.figure(figsize=(10, 6))
plt.bar(years, non_ai_counts, label="Non-AI Questions", color="lightgray")
plt.bar(years, ai_counts, bottom=non_ai_counts, label="AI Questions", color="royalblue")

# Annotate each bar with its AI question count, placed just above the total
for i, (total, ai) in enumerate(zip(total_counts, ai_counts)):
    plt.text(i, total + 1, f"AI={ai}", ha='center', fontsize=9)
    
plt.xlabel("Year")
plt.ylabel("Number of Questions")
plt.title("Total and AI-Related Questions per Survey Year")
plt.legend()
plt.tight_layout()
plt.show()


The plot shows a drastic decrease in the overall number of questions around 2020/2021, which might be related to the pandemic. Since 2023, both the total number of questions and the proportion of AI-related questions have risen again; in 2025, AI-related questions make up almost 20% of the whole survey. This is already a very good indication of the **growing influence of AI among the StackOverflow users** and concludes our first field of interest, the development of the AI-related questions over the years.
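
The proportion of AI-related questions is simply the AI count divided by the total count per survey. The sketch below shows the arithmetic with made-up numbers; `ai_share` is a hypothetical helper and the counts are not the actual survey figures.

```python
def ai_share(ai_count, total_count):
    """Fraction of AI-related questions; 0.0 if the survey has no questions."""
    return ai_count / total_count if total_count else 0.0

# Made-up (AI questions, total questions) pairs, for illustration only
toy_counts = {"2023": (10, 80), "2024": (15, 100)}

shares = {year: round(100 * ai_share(ai, total), 1) for year, (ai, total) in toy_counts.items()}
print(shares)
```

Keep in mind that the totals above come from the number of columns in the survey data, while the AI counts come from rows in the schema data. Since one schema question can be spread over several survey columns, the resulting percentage is an approximation.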

### 2.4 Data assessment for question 2: How do StackOverflow users perceive AI?

Next, we want to see which questions we can use to increase our understanding of **how AI is perceived** by the StackOverflow users and how this perception might have changed over the years.

In [None]:
df_dups_question = find_duplicates_across_surveys(df_matches, check_on="question")

In [None]:
df_dups_question = find_duplicates_across_surveys(df_matches, check_on="question", duplicate_min=3)

There seem to be only three questions that are identical across at least three surveys, which can be considered as the minimum for a meaningful time analysis. But let's look also at the question identifiers.

In [None]:
df_dups_question = find_duplicates_across_surveys(df_matches, check_on="column_name", duplicate_min=3)

So when also taking into account the column identifier names, there is one more question "AITool" we can use for our analysis. This was not found before, because the actual wording has changed from 2024 to 2025. To be more precise, this question entails more subquestions in 2025 than before. This will be a very useful question in order to tackle question 3 later.

Let's start with the three questions "AISelect", "AISent", and "AIAcc" in order to study the perception of AI among the StackOverflow users. First, let's assess the data of these questions so we can see if we need to perform any data wrangling for the purpose of this analysis.

In [None]:
get_question_description("AIBen", schema_dfs)

In [None]:
get_question_description("AIAcc", schema_dfs)

It seems like there is an error in the survey of 2023 mixing up the questions "AIBen" related to benefits and "AIAcc" for accuracy. In order to properly compare the answers over the surveys of 2023, 2024, and 2025 we need to adjust the naming of the columns.

In [None]:
schema_dfs["survey_2023"][schema_dfs["survey_2023"]["qname"]=="AIBen"]

In [None]:
survey_dfs["survey_2023"]["AIBen"].head()

In [None]:
get_column_name("For the AI tools you use as part of your development workflow, what are the MOST important benefits you are hoping to achieve? Please check all that apply.", schema_dfs)

In [None]:
get_column_name("How much do you trust the accuracy of the output from AI tools as part of your development workflow?", schema_dfs)

In order to assign the correct column name to the questions "AIBen" and "AIAcc" in the survey 2023, we need to replace it both in the survey and in the schema data.

In [None]:
schema_dfs["survey_2023"]["qname"] = schema_dfs["survey_2023"]["qname"].replace({"AIBen": "AIAcc", "AIAcc": "AIBen"})
survey_dfs["survey_2023"].rename(columns={"AIBen": "AIAcc", "AIAcc": "AIBen"}, inplace=True)
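
Passing both directions in a single dictionary swaps the two names in one step, rather than renaming one to the other and then back. A small sketch on toy data (the answer values are invented) shows this behavior for both `rename` and `replace`.

```python
import pandas as pd

# Toy data with the two columns mixed up, as assumed for the 2023 survey
toy = pd.DataFrame({"AIBen": ["Highly trust"], "AIAcc": ["Increase productivity"]})

# Both mappings are applied simultaneously, so the names are swapped
swapped = toy.rename(columns={"AIBen": "AIAcc", "AIAcc": "AIBen"})
print(list(swapped.columns))

# The same holds for replacing values in the schema's qname column
qname = pd.Series(["AIBen", "AIAcc", "AISent"]).replace({"AIBen": "AIAcc", "AIAcc": "AIBen"})
print(qname.tolist())
```

Because a swap is its own inverse, running the replacement cell a second time would undo the correction, so it should be executed exactly once per session.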

In [None]:
get_question_description("AIBen", schema_dfs)

In [None]:
get_question_description("AIAcc", schema_dfs)

In [None]:
df_matches = find_questions_by_keyword(schema_dfs, "AI", verbose=False)

In [None]:
df_dups_question = find_duplicates_across_surveys(df_matches, check_on="question", duplicate_min=3)

In [None]:
df_dups_question = find_duplicates_across_surveys(df_matches, check_on="column_name", duplicate_min=3)

Looks like the replacement worked. Next, let's check the variety of answers for the questions "AISelect", "AISent", and "AIAcc".

In [None]:
counts_AISelect = value_counts_all_surveys(survey_dfs, 'AISelect', normalize=False, verbose=True)

In [None]:
counts_AIAcc = value_counts_all_surveys(survey_dfs, 'AIAcc', normalize=False, verbose=True)

In [None]:
counts_AISent = value_counts_all_surveys(survey_dfs, 'AISent', normalize=False, verbose=True)

Let's address the NaNs in the surveys by replacing them with "no answer" in order to also see how many survey participants did not answer a specific question.

In [None]:
survey_dfs = replace_missing_all_columns(
    survey_dfs,
    na_label="No answer",
    treat_blank_as_na=True,
    as_category=True,
    inplace=False
)

In [None]:
counts_AISent = value_counts_all_surveys(survey_dfs, 'AISent', normalize=False, verbose=True)

Looks good, now this data is ready to be analyzed in order to answer question 2 about how AI is perceived among the StackOverflow users.

### 2.5 Data assessment for question 3: for which purpose is AI used?

Next, we want to investigate the column "AITool" in order to be able to study for what kind of tools or work the survey participants are using AI.

In [None]:
counts_AITool = value_counts_all_surveys(survey_dfs, 'AITool', normalize=False, verbose=True)

Although "AITool" is a valid column name and is found in the schema of 2023, 2024, and 2025, it cannot be found in the actual survey data.

In [None]:
get_question_description("AITool", schema_dfs)

In [None]:
[item for item in schema_dfs["survey_2023"]["qname"] if "AITool" in item]

In [None]:
schema_dfs["survey_2023"][schema_dfs["survey_2023"]["qname"]=="AITool"]

In [None]:
[item for item in schema_dfs["survey_2024"]["qname"] if "AITool" in item]

In [None]:
schema_dfs["survey_2024"][schema_dfs["survey_2024"]["qname"]=="AITool"]

In [None]:
[item for item in schema_dfs["survey_2025"]["qname"] if "AITool" in item]

In [None]:
schema_dfs["survey_2025"][schema_dfs["survey_2025"]["qname"]=="AITool"]

Even though the "qname" parameter is identical over the surveys and the questions only differ slightly, a column called "AITool" cannot be found in the survey data. So let's see if we can find a close, if not identical, match among the survey column names.

In [None]:
tool_col_names = [col for col in survey_dfs["survey_2023"].columns if "AITool" in col]
survey_dfs["survey_2023"][tool_col_names].head(10)

It seems the subquestions for "AITool" are already in separate columns in the survey data of 2023. We need to check if this is also the case for the surveys of 2024 and 2025.

In [None]:
tool_col_names = [col for col in survey_dfs["survey_2024"].columns if "AITool" in col]
survey_dfs["survey_2024"][tool_col_names].head(10)

The survey of 2024 uses the exact same column names for the subquestions.

In [None]:
tool_col_names = [col for col in survey_dfs["survey_2025"].columns if "AITool" in col]
survey_dfs["survey_2025"][tool_col_names].head(10)

As stated before, the content of the question changed for the survey of 2025, which covers several more subquestions and aspects. This is now visible in the data, as there are more columns for "AITool" than in the surveys of 2023 and 2024. In order to compare the answers at least on the level provided by the questions of 2023 and 2024, we need to re-wrangle the columns in the 2025 survey so that the columns for "AITool" match.

In [None]:
# Missing values were replaced with "No answer" in section 2.4, so that placeholder is
# skipped while joining and only kept when a respondent answered neither column.
def join_answers(row):
    parts = [str(val) for val in row if pd.notna(val) and str(val) not in ('', 'No answer')]
    return ';'.join(parts) if parts else 'No answer'

survey_dfs["survey_2025"]["AIToolCurrently Using"] = survey_dfs["survey_2025"][['AIToolCurrently partially AI', 'AIToolCurrently mostly AI']].apply(join_answers, axis=1)
survey_dfs["survey_2025"]["AIToolInterested in Using"] = survey_dfs["survey_2025"][['AIToolPlan to partially use AI', 'AIToolPlan to mostly use AI']].apply(join_answers, axis=1)
survey_dfs["survey_2025"].rename(columns={"AIToolDon't plan to use AI for this task": "AIToolNot interested in Using"}, inplace=True)


In [None]:

survey_dfs["survey_2025"].drop(columns=[
    'AIToolCurrently partially AI',
    'AIToolCurrently mostly AI',
    'AIToolPlan to partially use AI',
    'AIToolPlan to mostly use AI'
], inplace=True)


In [None]:
survey_dfs["survey_2025"]["AIToolInterested in Using"].head()

In [None]:
tool_col_names_2 = [col for col in survey_dfs["survey_2025"].columns if "AITool" in col]
survey_dfs["survey_2025"][tool_col_names_2].head(10)

Now also the data for "AITool" looks ready to be analyzed in order to answer the question 3 related to the purpose of the AI usage.
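
The column merge used for the 2025 survey can be illustrated on toy data. This is a sketch under the assumption that missing answers carry the placeholder "No answer"; `merge_multiselect` is a hypothetical helper and the answer values are invented.

```python
import pandas as pd

def merge_multiselect(row, skip=("", "No answer")):
    """Join the answers of several columns with ';', ignoring placeholders."""
    parts = [str(v) for v in row if pd.notna(v) and str(v) not in skip]
    return ";".join(parts) if parts else "No answer"

# Toy version of the two 2025 columns that are merged into one
toy = pd.DataFrame({
    "AIToolCurrently partially AI": ["Writing code", "No answer", "No answer"],
    "AIToolCurrently mostly AI": ["Testing code", "Debugging", "No answer"],
})

merged = toy.apply(merge_multiselect, axis=1)
print(merged.tolist())
```

Keeping "No answer" only for rows where both source columns are empty keeps the merged column consistent with the 2023 and 2024 data.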

## 3. Analysis of the StackOverflow user answers

### 3.1 Definition of functions to analyze and plot the relevant data

In order to analyze and plot the answers of the StackOverflow users for our four chosen questions "AISelect", "AISent", "AIAcc", and "AITool", it will be helpful to define some functions that we can then call for each question.

In [None]:
# -------------------------------------------------------------------
# Combine surveys into one DataFrame with a Year column
# -------------------------------------------------------------------
def prepare_long_format(survey_dict: Dict[str, pd.DataFrame], columns: List[str]) -> pd.DataFrame:
    """
    Combine multiple yearly survey DataFrames into a single long-format DataFrame
    that includes a 'Year' column. Only the specified columns + 'Year' are kept.

    Parameters
    ----------
    survey_dict : dict[str, pd.DataFrame]
        Mapping of year label to DataFrame, e.g. {"2023": df2023, "2024": df2024, ...}
    columns : list[str]
        Columns to carry forward. These are expected to contain semicolon-separated strings.

    Returns
    -------
    pd.DataFrame
        Concatenated DataFrame with given columns and a numeric 'Year' column.
    """
    frames = []
    for year, df in survey_dict.items():
        temp = df.copy()
        temp["Year"] = int(year)
        keep_cols = [c for c in columns if c in temp.columns]
        frames.append(temp[keep_cols + ["Year"]])
    return pd.concat(frames, ignore_index=True)

In [None]:
# -------------------------------------------------------------------
# Build Answer × Year × Category counts (generic categories)
# -------------------------------------------------------------------
def compute_trend_generic(
    df: pd.DataFrame,
    columns: List[str],
    category_map: Optional[Dict[str, str]] = None,
    answers: Optional[List[str]] = None,
    deduplicate_within_row: bool = True
) -> Tuple[pd.DataFrame, pd.DataFrame, List[str]]:
    """
    Parse semicolon-separated multi-select columns to produce a long table of
    Answer × Year × Category counts. Categories are derived from column names
    or from an optional mapping.

    Parameters
    ----------
    df : pd.DataFrame
        Input frame from `prepare_long_format`, must include a 'Year' column and the target columns.
    columns : list[str]
        Columns to analyze (each is a semicolon-separated multi-select).
    category_map : dict[str, str], optional
        Mapping from column name -> display category. If None, uses column names as-is.
    answers : list[str], optional
        If provided, restrict counts to this set of answers.
    deduplicate_within_row : bool
        If True, within a single respondent row, count each (answer, category) at most once.

    Returns
    -------
    trend_df : pd.DataFrame
        Long table with columns: ['Answer', 'Year', 'Category', 'Count'].
    base_counts : pd.DataFrame
        Table with columns: ['Year', 'N_respondents'] (total rows per year), used for relative='year'.
    categories : list[str]
        List of category names used (order preserved from mapping or columns).
    """
    if category_map is None:
        category_map = {col: col for col in columns}

    records = []
    for _, row in df.iterrows():
        year = row["Year"]
        seen = set()
        for col in columns:
            if col not in df.columns:
                continue
            if pd.isna(row[col]):
                continue
            category = category_map.get(col, col)
            # Split semicolon-separated values
            for ans in (a.strip() for a in str(row[col]).split(';')):
                if not ans:
                    continue
                if answers is not None and ans not in answers:
                    continue
                if deduplicate_within_row:
                    key = (ans, category)
                    if key in seen:
                        continue
                    seen.add(key)
                records.append((ans, year, category))

    trend_df = (
        pd.DataFrame(records, columns=["Answer", "Year", "Category"])
          .groupby(["Answer", "Year", "Category"], as_index=False)
          .size()
          .rename(columns={"size": "Count"})
    )

    base_counts = (
        df.groupby("Year", as_index=False)
          .size()
          .rename(columns={"size": "N_respondents"})
    )

    categories = list(dict.fromkeys(category_map.values()))  # preserve insertion order
    return trend_df, base_counts, categories
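
The row-wise loop in `compute_trend_generic` is easy to follow but can become slow on large surveys. As a minimal sketch (not the implementation used in this notebook), the same kind of Answer × Year counts for a single column can be obtained with `str.split` and `explode`. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): one semicolon-separated multi-select column plus a Year column
toy = pd.DataFrame({
    "Year": [2023, 2023, 2024, 2024],
    "AITool": ["Writing code;Debugging", "Writing code", None, "Debugging; Debugging"],
})

# One row per (respondent, answer): split on ';', explode the lists, trim whitespace
exploded = (
    toy.dropna(subset=["AITool"])
       .assign(Answer=lambda d: d["AITool"].str.split(";"))
       .explode("Answer")
       .assign(Answer=lambda d: d["Answer"].str.strip())
)
exploded = exploded[exploded["Answer"] != ""]

# Count each answer at most once per respondent
# (after reset_index, the "index" column identifies the original row)
exploded = exploded.reset_index().drop_duplicates(subset=["index", "Answer"])

counts = (
    exploded.groupby(["Answer", "Year"], as_index=False)
            .size()
            .rename(columns={"size": "Count"})
)
print(counts)
```

The deduplication step plays the role of `deduplicate_within_row=True` for a single column; handling several columns with a category mapping would need an additional `melt` step, which is left out here.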

In [None]:
# -------------------------------------------------------------------
# Add relative modes for plotting
# -------------------------------------------------------------------

Mode = Literal["year", "answer"]

def add_relative(
    trend_df: pd.DataFrame,
    base_counts: pd.DataFrame,
    mode: Optional[Mode] = None,
) -> Tuple[pd.DataFrame, str]:
    """
    Add a 'Value' column for plotting based on the chosen relative mode.

    Parameters
    ----------
    trend_df : pd.DataFrame
        Must have columns ['Answer', 'Year', 'Category', 'Count'].
    base_counts : pd.DataFrame
        Must have columns ['Year', 'N_respondents'] (required only if mode == "year").
    mode : {"year", "answer"} or None, optional
        - None: use raw counts (Value = Count).
        - "year": Value = Count / N_respondents(year) * 100 (percent of all respondents that year).
        - "answer": Value = Count / sum(Count by Answer & Year) * 100 (category share within answer-year).

    Returns
    -------
    df_with_value : pd.DataFrame
        Copy of trend_df with an added 'Value' column.
    y_label : str
        Recommended y-axis label for the plot.
    """
    # --- Validate inputs early (clear error messages help downstream users) ---
    required_trend = {"Answer", "Year", "Category", "Count"}
    missing_trend = required_trend - set(trend_df.columns)
    if missing_trend:
        raise ValueError(f"trend_df is missing required columns: {sorted(missing_trend)}")

    if mode == "year":
        required_base = {"Year", "N_respondents"}
        missing_base = required_base - set(base_counts.columns)
        if missing_base:
            raise ValueError(f"base_counts is missing required columns: {sorted(missing_base)}")

    df = trend_df.copy()

    # --- Raw counts ---
    if mode is None:
        df["Value"] = df["Count"]
        return df, "Count"

    # --- Percent of all respondents in that year ---
    if mode == "year":
        # m:1 validates that each Year maps to at most one N_respondents row
        df = df.merge(
            base_counts[["Year", "N_respondents"]],
            on="Year",
            how="left",
            validate="m:1",
        )
        if df["N_respondents"].isna().any():
            raise ValueError("Missing N_respondents for some Year values after merge.")

        denom = df["N_respondents"].replace(0, np.nan)
        df["Value"] = df["Count"].div(denom).mul(100).fillna(0.0)
        return df, "Proportion of all respondents (%)"

    # --- Share within (Answer, Year) ---
    if mode == "answer":
        # Named aggregation avoids a post-rename step (and the typing complaint)
        totals = (
            df.groupby(["Answer", "Year"], as_index=False)
              .agg(AnswerYearTotal=("Count", "sum"))
        )
        df = df.merge(totals, on=["Answer", "Year"], how="left", validate="m:1")
        denom = df["AnswerYearTotal"].replace(0, np.nan)
        df["Value"] = df["Count"].div(denom).mul(100).fillna(0.0)
        return df, "Category share within answer-year (%)"

    raise ValueError(f"Unknown mode: {mode!r}. Use None, 'year', or 'answer'.")
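
To make the difference between the two relative modes concrete, here is a small sketch with made-up counts (the category names and numbers are hypothetical). It uses `transform("sum")` for the within-answer share, which is an equivalent shortcut to the merge used in `add_relative`.

```python
import pandas as pd

# Hypothetical counts: one answer, two categories, one year
trend = pd.DataFrame({
    "Answer":   ["Writing code", "Writing code"],
    "Year":     [2024, 2024],
    "Category": ["Currently using", "Interested in using"],
    "Count":    [30, 10],
})
base = pd.DataFrame({"Year": [2024], "N_respondents": [200]})

# mode="year": percent of all respondents in that year
by_year = trend.merge(base, on="Year", how="left", validate="m:1")
by_year["Value"] = by_year["Count"] / by_year["N_respondents"] * 100

# mode="answer": percent of each category within one (Answer, Year) group
by_answer = trend.copy()
group_total = by_answer.groupby(["Answer", "Year"])["Count"].transform("sum")
by_answer["Value"] = by_answer["Count"] / group_total * 100

print(by_year[["Category", "Value"]])    # 15.0 and 5.0
print(by_answer[["Category", "Value"]])  # 75.0 and 25.0
```

With `mode="year"` the values of one answer do not need to add up to 100, because they are measured against all respondents of that year; with `mode="answer"` they always do.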

In [None]:
# -------------------------------------------------------------------
# Answer selection (global top-N or union of per-year top-N)
# -------------------------------------------------------------------
def select_answers(
    trend_df: pd.DataFrame,
    top_n: Optional[int] = None,
    ensure_top_n_per_year: bool = False,
) -> List[str]:
    """
    Select which answers to include in plots.

    Parameters
    ----------
    trend_df : pd.DataFrame
        Must have columns ['Answer', 'Year', 'Category', 'Count'].
    top_n : int, optional
        If None, returns all answers. If provided, returns a subset.
    ensure_top_n_per_year : bool
        - If True: returns the union of per-year top_n answers (by total across categories).
        - If False: returns the global top_n answers by total across all years & categories.

    Returns
    -------
    list[str]
        Selected answers to plot (deduplicated).
    """
    # --- Validate required columns up front ---
    required = {"Answer", "Year", "Category", "Count"}
    missing = required - set(trend_df.columns)
    if missing:
        raise ValueError(f"trend_df is missing required columns: {sorted(missing)}")

    # --- All answers (original order preserved) ---
    if top_n is None:
        # drop_duplicates keeps first occurrence order; cast to str for type safety
        return trend_df["Answer"].astype(str).drop_duplicates().tolist()

    # --- Degenerate cases ---
    if not isinstance(top_n, int):
        raise TypeError(f"top_n must be an int or None, got {type(top_n).__name__}")
    if top_n <= 0:
        return []

    # --- Per-year top-N union ---
    if ensure_top_n_per_year:
        per_year = (
            trend_df.groupby(["Year", "Answer"], as_index=False)
                    .agg(total=("Count", "sum"))
        )

        # Sort with deterministic tie-break on Answer
        per_year_sorted = per_year.sort_values(
            by=["Year", "total", "Answer"],
            ascending=[True, False, True],
            kind="mergesort",  # stable
        )

        # Take top_n within each year, then union across years
        top_union = (
            per_year_sorted.groupby("Year", group_keys=False)
                           .head(top_n)["Answer"]
                           .astype(str)
                           .unique()
                           .tolist()
        )
        return top_union

    # --- Global top-N across all years/categories ---
    global_totals = (
        trend_df.groupby("Answer", as_index=False)
                .agg(total=("Count", "sum"))
                .sort_values(
                    by=["total", "Answer"],
                    ascending=[False, True],
                    kind="mergesort",
                )
    )

    return global_totals.head(top_n)["Answer"].astype(str).tolist()
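
The two selection strategies can return different answers. The following sketch uses made-up counts (answers "A", "B", "C" are placeholders) to show that the per-year union can contain more than `top_n` entries, which is the case that `strict_top_n` in the plotting function is meant to handle.

```python
import pandas as pd

# Hypothetical counts per answer and year
toy = pd.DataFrame({
    "Answer": ["A", "B", "C", "A", "B", "C"],
    "Year":   [2023, 2023, 2023, 2024, 2024, 2024],
    "Count":  [100, 10, 5, 100, 5, 20],
})

# Global top 2: rank answers by their total across all years
global_top = (
    toy.groupby("Answer")["Count"].sum()
       .sort_values(ascending=False)
       .head(2)
       .index.tolist()
)

# Per-year top 2, then the union across years (order of first appearance)
per_year_top = (
    toy.sort_values(["Year", "Count"], ascending=[True, False])
       .groupby("Year")
       .head(2)["Answer"]
       .unique()
       .tolist()
)

print(global_top)    # ['A', 'C']
print(per_year_top)  # ['A', 'B', 'C']
```

Here "B" is in the top 2 only in 2023 and "C" only in 2024, so the union holds three answers although `top_n` is 2.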


In [None]:
# -------------------------------------------------------------------
# Plot everything as a heatmap and several bar plots
# -------------------------------------------------------------------
def plot_combined_dashboard(
    trend_df: pd.DataFrame,
    base_counts: pd.DataFrame,
    categories: List[str],
    top_n: int = 8,
    ensure_top_n_per_year: bool = False,
    strict_top_n: bool = True,
    relative_heatmap: Optional[str] = "year",   # None -> absolute counts
    relative_bars: Optional[str] = "answer",    # None -> absolute counts
    cmap: str = "YlGnBu",
    annotate: bool = True,

    # Comparability & clarity
    share_y: bool = True,
    legend_per_subplot: bool = False,
    annotate_bar_totals: bool = True,

    # Layout
    bar_cols: int = 6,
    heatmap_width_per_cat: float = 4.5,
    bar_width_per_col: float = 4.2,
    top_row_height: float = 5.0,
    bar_row_height: float = 2.6,

    # Readability
    font_scale: float = 0.95,
    heatmap_tick_fontsize: int = 10,
    bar_tick_fontsize: int = 9,
    bar_title_fontsize: int = 10,
    rotate_bar_xticks: int = 30,

    # Spacing
    gridspec_hspace: float = 0.45,
    heatmap_wspace: float = 0.30,
    bar_wspace: float = 0.50,
    bar_hspace: float = 0.75,
    legend_right_pad: float = 0.18,
    cbar_fraction: float = 0.025,
    cbar_pad: float = 0.02,

    # Labels/annotation
    xlabel_pad: float = 4,
    ylabel_pad: float = 4,
    annot_fontsize: int = 9,
    show_all_heatmap_y: bool = False,

    # Legend wrapping (line breaks)
    legend_title: str = "Category",
    legend_wrap_chars: Optional[int] = 25,
    legend_title_wrap_chars: Optional[int] = None,

    # Controls y-axis label and y-ticks visibility on bar subplots
    show_bar_ylabel: bool = True,  # True: all subplots show y-label & y-ticks; False: only first column
) -> None:
    """
    Plot a combined dashboard with:
      • Top row: heatmaps (one per category, shared colorbar).
      • Bottom: stacked bar charts (one subplot per answer).
      • Optional Totals bar chart: when `relative_bars=None`, a totals subplot is added as the **first** bar.

    Key features
    ------------
    - **Totals bar chart (first position)**: Appears only when `relative_bars=None`, summing across the
      selected answers and the displayed categories per year.
    - **Legend wrapping**: Long legend labels and the legend title break into multiple lines.
      Controlled by `legend_wrap_chars` and `legend_title_wrap_chars`.
    - **Dynamic layout**: Grid adjusts automatically to the number of answers and `bar_cols`.
    - **Comparable scales**: `share_y=True` shares the y-scale across answer subplots (Totals uses its own).
    - **Y-axis visibility control**: `show_bar_ylabel` controls both the y-axis label and y-ticks:
        * If True: every bar subplot (including Totals) shows y-label and y-ticks.
        * If False: only the **first column** of bar subplots (including Totals if placed in col 0) shows y-label and y-ticks.

    Parameters
    ----------
    trend_df : pd.DataFrame
        Long-format data with columns ['Answer', 'Year', 'Category', 'Count'].
    base_counts : pd.DataFrame
        Data with ['Year', 'N_respondents'] for normalization when any `relative_* == "year"`.
    categories : List[str]
        Ordered list of category labels for heatmaps and stacked bars.
    top_n : int
        Number of answers to include.
    ensure_top_n_per_year : bool
        Include per-year top answers in the union selection.
    strict_top_n : bool
        Reduce the union back to exactly `top_n` using global totals.
    relative_heatmap : {"year", None}
        Heatmap normalization: None = absolute counts; "year" = % of respondents (requires `base_counts`).
    relative_bars : {"answer", "year", None}
        Bars normalization: None = absolute counts; "answer" = 100% within answer-year; "year" = % of respondents.
    cmap : str
        Colormap for heatmaps.
    annotate : bool
        Annotate heatmap cells with values.
    share_y : bool
        Share y-axis across **answer** bar subplots (Totals not shared).
    legend_per_subplot : bool
        If True, add legend in each bar subplot; otherwise a global legend.
    annotate_bar_totals : bool
        Annotate raw N above each stacked bar.
    show_bar_ylabel : bool
        Control y-axis label and y-ticks on bar subplots:
         - True: show on **all** bar subplots (including Totals).
         - False: show only on the **first column** of the bar grid; hide on other columns.
    legend_wrap_chars : int or None
        Wrap legend labels at this character width; None disables wrapping.
    legend_title_wrap_chars : int or None
        Wrap legend title; if None, uses `legend_wrap_chars`.
    """

    import textwrap  # the other modules used here are imported at the top of the notebook

    def _wrap_text(s: str, width: Optional[int]) -> str:
        return textwrap.fill(str(s), width=width) if width and width > 0 else str(s)

    # --- Validation ---
    required = {"Answer", "Year", "Category", "Count"}
    missing = required - set(trend_df.columns)
    if missing:
        raise ValueError(f"trend_df missing required columns: {sorted(missing)}")
    if not categories:
        raise ValueError("`categories` must be a non-empty list.")

    # Styling
    sns.set_style("whitegrid")
    sns.set_context("notebook", font_scale=font_scale)

    # --- 1) Answer selection ---
    answers = select_answers(trend_df, top_n=top_n, ensure_top_n_per_year=ensure_top_n_per_year)
    if strict_top_n and len(answers) > top_n:
        global_totals = (
            trend_df.groupby("Answer", as_index=False)
                    .agg(total=("Count", "sum"))
                    .sort_values(by=["total", "Answer"], ascending=[False, True])
        )
        selected: List[str] = []
        allowed = set(map(str, answers))
        for a in global_totals["Answer"].astype(str):
            if a in allowed:
                selected.append(a)
                if len(selected) >= top_n:
                    break
        answers = selected

    if not answers:
        raise ValueError("No answers selected—check data or selection parameters.")

    df_all = trend_df[trend_df["Answer"].astype(str).isin(answers)].copy()
    if df_all.empty:
        raise ValueError("Filtered data is empty after selecting answers; nothing to plot.")

    # --- 2) Heatmaps preparation ---
    df_heat, ylab_heat = add_relative(df_all, base_counts, mode=relative_heatmap)
    if df_heat.empty:
        raise ValueError("No heatmap data after applying heatmap relative mode.")

    years = sorted(df_heat["Year"].dropna().unique().tolist())
    if not years:
        raise ValueError("No 'Year' values available for heatmaps.")

    ans_order = (
        df_heat.groupby("Answer", as_index=False)
               .agg(total=("Value", "sum"))
               .sort_values(by=["total", "Answer"], ascending=[False, True])["Answer"]
               .astype(str).tolist()
    )
    fmt_heat = ".1f" if relative_heatmap else ".0f"

    # --- 3) Bars preparation ---
    df_bars, ylab_bars = add_relative(df_all, base_counts, mode=relative_bars)
    if df_bars.empty:
        raise ValueError("No bar data after applying bar relative mode.")

    pivoted = (
        df_bars.pivot_table(
            index=["Answer", "Year"],
            columns="Category",
            values="Value",
            fill_value=0.0,
            aggfunc="sum",
        ).reset_index()
    )
    for cat in categories:
        if cat not in pivoted.columns:
            pivoted[cat] = 0.0

    totals_raw = (
        df_all[df_all["Category"].isin(categories)]
        .groupby(["Answer", "Year"], as_index=False)
        .agg(total_count=("Count", "sum"))
    )

    include_total_bar = (relative_bars is None)

    # Common y-limit across answer subplots
    if relative_bars == "answer":
        ymax_bars = 100.0
    else:
        ymax_bars = float(pivoted[categories].sum(axis=1).max())
        if ymax_bars <= 0:
            ymax_bars = 1.0

    # --- 4) Dynamic sizing ---
    n_cats = max(1, len(categories))
    n_answer_bars = max(1, len(answers))
    n_bars_total = n_answer_bars + (1 if include_total_bar else 0)

    n_bar_cols = max(1, bar_cols)
    n_bar_rows = math.ceil(n_bars_total / n_bar_cols)

    heatmap_width = max(12.0, n_cats * heatmap_width_per_cat)
    bars_width = n_bar_cols * bar_width_per_col
    fig_width = max(heatmap_width, bars_width)
    fig_height = top_row_height + n_bar_rows * bar_row_height

    # --- 5) GridSpec scaffolding ---
    fig = plt.figure(figsize=(fig_width, fig_height))
    outer_gs = fig.add_gridspec(
        nrows=2, ncols=1,
        height_ratios=[top_row_height, n_bar_rows * bar_row_height],
        hspace=gridspec_hspace,
    )
    gs_heatmaps = outer_gs[0].subgridspec(1, n_cats, wspace=heatmap_wspace)
    gs_bars = outer_gs[1].subgridspec(n_bar_rows, n_bar_cols, wspace=bar_wspace, hspace=bar_hspace)

    # --- 5a) Heatmaps ---
    heat_axes = []
    vmin, vmax = float(df_heat["Value"].min()), float(df_heat["Value"].max())

    for i, cat in enumerate(categories):
        ax = fig.add_subplot(gs_heatmaps[0, i])
        heat_axes.append(ax)

        sub = df_heat[df_heat["Category"] == cat]
        mat = (
            sub.pivot_table(index="Answer", columns="Year", values="Value", fill_value=0.0)
               .reindex(index=ans_order, columns=years)
               .fillna(0.0)
        )

        sns.heatmap(
            mat,
            fmt=fmt_heat,
            cmap=cmap,
            ax=ax,
            cbar=False,
            vmin=vmin,
            vmax=vmax,
            annot=annotate,
            annot_kws={"fontsize": int(annot_fontsize)} if annotate else None,
        )
        ax.set_title(str(cat), fontsize=12, pad=8)
        ax.set_xlabel("Year", fontsize=10, labelpad=xlabel_pad)
        if show_all_heatmap_y or i == 0:
            ax.tick_params(axis="y", labelsize=heatmap_tick_fontsize)
        else:
            ax.tick_params(axis="y", labelleft=False)
        ax.tick_params(axis="x", labelsize=10)

    if heat_axes:
        fig.colorbar(
            heat_axes[-1].collections[0],
            ax=heat_axes,
            orientation="vertical",
            fraction=cbar_fraction,
            pad=cbar_pad,
        )

    # --- 5b) Bars ---
    palette = dict(zip(categories, sns.color_palette("tab20", n_colors=len(categories))))
    global_handles = [Patch(facecolor=palette[c], edgecolor="none") for c in categories]

    first_bar_ax = None  # for sharey across answer subplots (Totals excluded)

    # Helper to set y-label & y-ticks visibility consistently
    def _apply_yaxis_visibility(ax, is_first_col: bool, label_text: str):
        """
        Controls both y-axis label and y-ticks (marks + labels) based on show_bar_ylabel.
        - If show_bar_ylabel=True: show everywhere.
        - If show_bar_ylabel=False: show only in first column; hide elsewhere.
        """
        if show_bar_ylabel or is_first_col:
            ax.set_ylabel(label_text, fontsize=9, labelpad=ylabel_pad)
            ax.tick_params(axis="y", labelleft=True, left=True)
        else:
            ax.set_ylabel("")
            ax.tick_params(axis="y", labelleft=False, left=False)

    # Totals bar as first subplot (only when absolute bars)
    if include_total_bar:
        total_slot = 0
        row_idx = total_slot // n_bar_cols
        col_idx = total_slot % n_bar_cols
        ax_total = fig.add_subplot(gs_bars[row_idx, col_idx])  # not sharey with answers

        totals_per_year = (
            df_all[df_all["Category"].isin(categories)]
            .groupby("Year", as_index=False)
            .agg(total=("Count", "sum"))
            .sort_values("Year")
        )
        xs = totals_per_year["Year"].tolist()
        ys = totals_per_year["total"].astype(float).tolist()

        ax_total.bar(xs, ys, color="#6E6E6E", alpha=0.95, linewidth=0)
        totals_ymax = max(1.0, float(max(ys) if ys else 1.0))
        ax_total.set_ylim(0, totals_ymax * 1.08)

        if annotate_bar_totals and len(xs) > 0:
            y_offset = 0.02 * totals_ymax
            for x, y in zip(xs, ys):
                ax_total.text(
                    x, y + y_offset,
                    f"N={y:,.0f}",
                    ha="center", va="bottom",
                    fontsize=max(7, bar_tick_fontsize - 1),
                    color="#333333",
                )

        ax_total.set_title("Total (all selected answers)", fontsize=bar_title_fontsize, pad=6)
        ax_total.set_xticks(xs)
        if rotate_bar_xticks:
            ax_total.set_xticklabels(xs, rotation=rotate_bar_xticks, ha="right")
        ax_total.tick_params(axis="x", labelsize=bar_tick_fontsize, pad=2)
        ax_total.yaxis.grid(True, linestyle="--", linewidth=0.5, alpha=0.6)
        ax_total.set_axisbelow(True)
        ax_total.margins(x=0.05)

        # Apply y-axis visibility logic (Totals is in col 0)
        _apply_yaxis_visibility(ax_total, is_first_col=True, label_text="Total Count")

    # Answer subplots (start after Totals if present)
    for j, ans in enumerate(answers):
        slot = j + (1 if include_total_bar else 0)
        row_idx = slot // n_bar_cols
        col_idx = slot % n_bar_cols

        ax = fig.add_subplot(
            gs_bars[row_idx, col_idx],
            sharey=first_bar_ax if (share_y and first_bar_ax is not None) else None
        )
        if first_bar_ax is None:
            first_bar_ax = ax  # first **answer** axis sets the shared scale

        sub = pivoted[pivoted["Answer"].astype(str) == str(ans)].sort_values("Year")
        xs = sub["Year"].tolist()

        bottoms = None
        for cat in categories:
            vals = sub[cat].to_numpy(dtype=float, copy=False)
            ax.bar(xs, vals, bottom=bottoms, color=palette[cat], linewidth=0, alpha=0.95)
            bottoms = vals if bottoms is None else (bottoms + vals)

        ax.set_ylim(0, ymax_bars)

        # Annotations for raw totals above the stacks
        if annotate_bar_totals and len(xs) > 0:
            stack_tops = sub[categories].sum(axis=1).to_numpy(dtype=float, copy=False)
            sub_totals = (
                totals_raw[totals_raw["Answer"].astype(str) == str(ans)]
                .set_index("Year")
                .reindex(xs)["total_count"]
                .fillna(0.0)
                .to_numpy(dtype=float)
            )
            y_offset = 0.02 * ymax_bars
            for x, top, raw_n in zip(xs, stack_tops, sub_totals):
                ax.text(
                    x, top + y_offset,
                    f"N={raw_n:,.0f}",
                    ha="center", va="bottom",
                    fontsize=max(7, bar_tick_fontsize - 1),
                    color="#333333",
                )

        # Per-subplot legend (optional, with wrapping)
        if legend_per_subplot:
            handles = [Patch(facecolor=palette[c], edgecolor="none") for c in categories]
            labels_wrapped = [_wrap_text(c, legend_wrap_chars) for c in categories]
            ax.legend(
                handles=handles,
                labels=labels_wrapped,
                loc="upper right",
                frameon=False,
                fontsize=max(8, bar_tick_fontsize - 1),
                handlelength=1.2,
                handletextpad=0.6,
                borderaxespad=0.4,
            )

        # Readability
        ax.set_title(str(ans), fontsize=bar_title_fontsize, pad=6)
        ax.set_xticks(xs)
        if rotate_bar_xticks:
            ax.set_xticklabels(xs, rotation=rotate_bar_xticks, ha="right")
        ax.tick_params(axis="x", labelsize=bar_tick_fontsize, pad=2)
        ax.yaxis.grid(True, linestyle="--", linewidth=0.5, alpha=0.6)
        ax.set_axisbelow(True)
        ax.margins(x=0.05)

        # Apply y-axis visibility logic for each answer subplot
        _apply_yaxis_visibility(ax, is_first_col=(col_idx == 0), label_text=ylab_bars)

    # --- Global legend (if not per-subplot) with wrapped labels ---
    if not legend_per_subplot and categories:
        wrapped_labels = [_wrap_text(c, legend_wrap_chars) for c in categories]
        title_width = legend_title_wrap_chars if legend_title_wrap_chars is not None else legend_wrap_chars
        wrapped_title = _wrap_text(legend_title, title_width)
        fig.legend(
            [Patch(facecolor=palette[c], edgecolor="none") for c in categories],
            wrapped_labels,
            title=wrapped_title,
            bbox_to_anchor=(0.99, 0.5),
            loc="center left",
            frameon=False,
            handlelength=1.2,
            handletextpad=0.6,
            borderaxespad=0.6,
        )

    # --- Title & layout ---
    fig.suptitle(
        f"Dashboard for Top {top_n} Answers\n"
        f"Heatmaps ({ylab_heat}) + Stacked Bars ({ylab_bars})"
        f"{' + Totals' if include_total_bar else ''}\n",
        fontsize=15,
        y=0.995,
    )

    right_limit = max(0.0, 1.0 - (legend_right_pad if not legend_per_subplot else 0.02))
    fig.subplots_adjust(left=0.05, right=right_limit, top=0.9, bottom=0.05)

    plt.show()


### 3.2 Analyzing and plotting the data for question 2: How do StackOverflow users perceive AI?

In [None]:
# ------------------------------------------------------------
# AISelect Questions - preparations
# ------------------------------------------------------------

# 1) Prepare inputs
survey_dict = {
    "2023": survey_dfs["survey_2023"],
    "2024": survey_dfs["survey_2024"],
    "2025": survey_dfs["survey_2025"],
}

columns_to_analyze_AISelect = [
    "AISelect"
]

# Optional: map raw column names to category labels.
# If omitted, the raw column names will appear in the plots.
category_map_AISelect = {
    "AISelect": "Do you currently use AI tools in your development process?"
}

# 2) Combine surveys
long_df = prepare_long_format(survey_dict, columns_to_analyze_AISelect)

# 3) Build Answer × Year × Category counts (generic)
trend_df_AISelect, base_counts_AISelect, categories_AISelect = compute_trend_generic(
    long_df,
    columns_to_analyze_AISelect,
    category_map=category_map_AISelect,  # or None to use column names as categories
    answers=None,               # keep all answers; you could pass a list to restrict
    deduplicate_within_row=True
)

In [None]:
plot_combined_dashboard(
    trend_df=trend_df_AISelect,                 # long DF: ['Answer','Year','Category','Count']
    base_counts=base_counts_AISelect,           # per-year totals for normalization (relative='year')
    categories=categories_AISelect,             # list of category labels (one for this question)

    top_n=7,                           # how many answers to display
    ensure_top_n_per_year=True,        # union of per-year top-N answers
    strict_top_n=True,                 # cap final selection back to exactly top_n

    relative_heatmap="year",           # heatmap values as % of respondents per year
    relative_bars=None,                # absolute counts (also adds the Totals subplot)
    share_y=True,                      # same y-scale across the answer subplots
    annotate_bar_totals=False,         # do not write raw N above each bar
    show_bar_ylabel = True,            # show ylabel for each bar plot

    bar_cols=4,                        # 4 bar subplots per row (8 → 2 rows)

    top_row_height=5.0,                # height (in) of the heatmaps row
    bar_row_height=5.0,                # height (in) per row of bar charts
    heatmap_width_per_cat=5.5,         # width (in) per heatmap column
    bar_width_per_col=4.0,             # width (in) per bar subplot

    font_scale=1.0,                    # global font scaling
    rotate_bar_xticks=30,              # rotate bar x-axis tick labels (deg)

    # Spacing tuned for your layout
    heatmap_wspace=0.2,                # horizontal gap between heatmaps
    bar_wspace=0.5,                    # horizontal gap between bar subplots
    bar_hspace=0.5,                    # vertical gap between bar rows
    gridspec_hspace=0.3,               # vertical gap between heatmaps block and bars block
    legend_right_pad=0.15,             # fraction of figure width reserved for right-side legend

    # Annotation and ticks
    annot_fontsize=9,                  # font size for heatmap cell annotations
    show_all_heatmap_y=False           # show y-ticks only on leftmost heatmap (saves space)
)

To make the years easier to compare, we merge the three frequency-specific "yes" answers of the 2025 survey into a single general "Yes" answer.

In [None]:
YES_MERGE_MAP = {
    "Yes, I use AI tools daily": "Yes",
    "Yes, I use AI tools weekly": "Yes",
    "Yes, I use AI tools monthly or infrequently": "Yes"
    # add more mappings as needed…
}

trend_df_merged = (
    trend_df_AISelect
    .assign(Answer=lambda d: d["Answer"].astype(str).replace(YES_MERGE_MAP))
    .groupby(["Answer", "Year", "Category"], as_index=False, sort=False)["Count"]
    .sum()
)

In [None]:
plot_combined_dashboard(
    trend_df=trend_df_merged,                   # long DF: ['Answer','Year','Category','Count']
    base_counts=base_counts_AISelect,           # per-year totals for normalization (relative='year')
    categories=categories_AISelect,             # list of category labels (one for this question)

    top_n=4,                           # how many answers to display
    ensure_top_n_per_year=True,        # union of per-year top-N answers
    strict_top_n=True,                 # cap final selection back to exactly top_n

    relative_heatmap="year",           # heatmap values as % of respondents per year
    relative_bars=None,                # absolute counts (also adds the Totals subplot)
    share_y=True,                      # same y-scale across the answer subplots
    annotate_bar_totals=False,         # do not write raw N above each bar
    show_bar_ylabel = True,            # show ylabel for each bar plot

    bar_cols=3,                        # 3 bar subplots per row

    top_row_height=5.0,                # height (in) of the heatmaps row
    bar_row_height=5.0,                # height (in) per row of bar charts
    heatmap_width_per_cat=4.5,         # width (in) per heatmap column
    bar_width_per_col=4.0,             # width (in) per bar subplot

    font_scale=1.0,                    # global font scaling
    rotate_bar_xticks=30,              # rotate bar x-axis tick labels (deg)

    # Spacing tuned for your layout
    heatmap_wspace=0.2,                # horizontal gap between heatmaps
    bar_wspace=0.5,                    # horizontal gap between bar subplots
    bar_hspace=0.5,                    # vertical gap between bar rows
    gridspec_hspace=0.3,               # vertical gap between heatmaps block and bars block
    legend_right_pad=0.15,             # fraction of figure width reserved for right-side legend

    # Annotation and ticks
    annot_fontsize=9,                  # font size for heatmap cell annotations
    show_all_heatmap_y=False           # show y-ticks only on leftmost heatmap (saves space)
)

This question and the plots above show that the general usage of AI has increased from 2023 through 2024 to 2025. In 2023, fewer than half of the survey participants were using AI according to the heatmap above. This shifted in 2024, when 57.6% of the participants reported using AI. Even though a third of the participants did not answer this question, AI users make up a clear majority among the remaining two thirds.

When taking into account the more detailed "yes"-answers in the survey of 2025 we can also see that if the participants use AI, they mostly do so on a *daily basis*.

We will investigate the distribution of "no answer" later in order to see whether this is a general effect of the 2025 survey or whether it is specific to a certain question or to the AI topic.
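Because the heatmap percentages refer to all participants, a share "among those who answered" has to be rescaled by the non-response rate. A minimal sketch of that rescaling follows; the helper name `share_among_respondents` and the input numbers are hypothetical and only illustrate the arithmetic.

```python
def share_among_respondents(share_of_all: float, no_answer_share: float) -> float:
    """Convert a share of all participants into a share of those who answered."""
    if not 0.0 <= no_answer_share < 1.0:
        raise ValueError("no_answer_share must be in [0, 1)")
    return share_of_all / (1.0 - no_answer_share)


# Hypothetical inputs: 50% of all participants chose an answer,
# while 20% of all participants skipped the question
print(round(share_among_respondents(0.50, 0.20), 3))  # 0.625
```

The higher the non-response rate, the larger the gap between the two views, which is why the "no answer" share matters when reading the plots.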

In [None]:
# ------------------------------------------------------------
# AISent Questions - preparations
# ------------------------------------------------------------

# 1) Prepare inputs
columns_to_analyze_AISent = [
    "AISent"
]

# Optional: map raw column names to category labels.
# If omitted, the raw column names will appear in the plots.
category_map_AISent = {
    "AISent": "How favorable is your stance on using AI tools as part of your development workflow?"
}

# 2) Combine surveys
long_df = prepare_long_format(survey_dict, columns_to_analyze_AISent)

# 3) Build Answer × Year × Category counts (generic)
trend_df_AISent, base_counts_AISent, categories_AISent= compute_trend_generic(
    long_df,
    columns_to_analyze_AISent,
    category_map=category_map_AISent,  # or None to use column names as categories
    answers=None,               # keep all answers; you could pass a list to restrict
    deduplicate_within_row=True
)

In [None]:
plot_combined_dashboard(
    trend_df=trend_df_AISent,                 # long DF: ['Answer','Year','Category','Count']
    base_counts=base_counts_AISent,           # per-year totals for normalization (relative='year')
    categories=categories_AISent,             # list of category labels (3 in your case)

    top_n=7,                           # how many answers to display
    ensure_top_n_per_year=True,        # union of per-year top-N answers
    strict_top_n=True,                 # cap final selection back to exactly top_n

    relative_heatmap="year",           # heatmap values as % of respondents per year
    relative_bars=None,                # 100% stacks (category share)
    share_y=True,                      # all at 0–100
    annotate_bar_totals=False,          # write raw N above each bar
    show_bar_ylabel = True,            # show ylabel for each bar plot

    bar_cols=4,                        # 4 bar subplots per row (8 → 2 rows)

    top_row_height=5.0,                # height (in) of the heatmaps row
    bar_row_height=5.0,                # height (in) per row of bar charts
    heatmap_width_per_cat=4.5,         # width (in) per heatmap column
    bar_width_per_col=4.0,             # width (in) per bar subplot

    font_scale=1.0,                    # global font scaling
    rotate_bar_xticks=30,              # rotate bar x-axis tick labels (deg)

    # Spacing tuned for your layout
    heatmap_wspace=0.2,                # horizontal gap between heatmaps
    bar_wspace=0.5,                    # horizontal gap between bar subplots
    bar_hspace=0.5,                    # vertical gap between bar rows
    gridspec_hspace=0.3,               # vertical gap between heatmaps block and bars block
    legend_right_pad=0.15,             # fraction of figure width reserved for right-side legend

    # Annotation and ticks
    annot_fontsize=9,                  # font size for heatmap cell annotations
    show_all_heatmap_y=False           # show y-ticks only on leftmost heatmap (saves space)
)

In contrast to the question "AISelect" investigated before, the proportion of survey participants not answering this question about their stance on using AI in their workflow is roughly 30% across all three surveys of 2023, 2024, and 2025. However, among the participants who answered this question, AI is perceived largely positively ("favorable" or "very favorable"), with a slight downward trend towards 2025. In fact, in that year the negative answers ("unfavorable" or "very unfavorable") increased, which was quite noticeable for the "very unfavorable" option.

Summing up we can say that AI is perceived positively in the context of using it in the development workflow, however, with a slight downward trend.
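One way to condense such a trend into a single number per year is a net sentiment score: positive minus negative answers, divided by all given answers. The sketch below is only an illustration; the counts are made up, and the answer labels and the names `toy_sent`, `POSITIVE`, `NEGATIVE`, and `net_sentiment` are assumptions rather than part of the analysis above.

```python
import pandas as pd

# Hypothetical counts, for illustration only (not survey results)
toy_sent = pd.DataFrame({
    "Answer": ["Very favorable", "Favorable", "Indifferent",
               "Unfavorable", "Very unfavorable"] * 2,
    "Year": [2024] * 5 + [2025] * 5,
    "Count": [30, 40, 20, 7, 3, 25, 35, 20, 10, 10],
})

# Assumed grouping of the answer labels
POSITIVE = {"Very favorable", "Favorable"}
NEGATIVE = {"Unfavorable", "Very unfavorable"}


def net_sentiment(df: pd.DataFrame) -> pd.Series:
    """Per year: (positive - negative) / all given answers, ignoring 'no answer'."""
    answered = df[df["Answer"].str.lower() != "no answer"]
    # +1 for positive answers, -1 for negative ones, 0 for everything else
    sign = answered["Answer"].map(
        lambda a: 1 if a in POSITIVE else -1 if a in NEGATIVE else 0
    )
    weighted = (answered["Count"] * sign).groupby(answered["Year"]).sum()
    totals = answered.groupby("Year")["Count"].sum()
    return weighted / totals


toy_net = net_sentiment(toy_sent)
print(toy_net)
```

A score that stays above zero but shrinks from one year to the next would correspond to the pattern described above: still positive overall, with a downward trend.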

In [None]:
# ------------------------------------------------------------
# AIAcc Questions - preparations
# ------------------------------------------------------------

# 1) Prepare inputs
columns_to_analyze_AIAcc= [
    "AIAcc"
]

# Optional: map raw column names to category labels.
# If omitted, the raw column names will appear in the plots.
category_map_AIAcc = {
    "AIAcc": "How much do you trust the accuracy of the output from AI tools as part of your development workflow?"
}

# 2) Combine surveys
long_df = prepare_long_format(survey_dict, columns_to_analyze_AIAcc)

# 3) Build Answer × Year × Category counts (generic)
trend_df_AIAcc, base_counts_AIAcc, categories_AIAcc= compute_trend_generic(
    long_df,
    columns_to_analyze_AIAcc,
    category_map=category_map_AIAcc,  # or None to use column names as categories
    answers=None,               # keep all answers; you could pass a list to restrict
    deduplicate_within_row=True
)

In [None]:
plot_combined_dashboard(
    trend_df=trend_df_AIAcc,                   # long DF: ['Answer','Year','Category','Count']
    base_counts=base_counts_AIAcc,           # per-year totals for normalization (relative='year')
    categories=categories_AIAcc,             # list of category labels (3 in your case)

    top_n=6,                           # how many answers to display
    ensure_top_n_per_year=True,        # union of per-year top-N answers
    strict_top_n=True,                 # cap final selection back to exactly top_n

    relative_heatmap="year",           # heatmap values as % of respondents per year
    relative_bars=None,                # 100% stacks (category share)
    share_y=True,                      # all at 0–100
    annotate_bar_totals=False,          # write raw N above each bar
    show_bar_ylabel = True,            # show ylabel for each bar plot

    bar_cols=4,                        # 4 bar subplots per row (8 → 2 rows)

    top_row_height=5.0,                # height (in) of the heatmaps row
    bar_row_height=5.0,                # height (in) per row of bar charts
    heatmap_width_per_cat=4.5,         # width (in) per heatmap column
    bar_width_per_col=4.0,             # width (in) per bar subplot

    font_scale=1.0,                    # global font scaling
    rotate_bar_xticks=30,              # rotate bar x-axis tick labels (deg)

    # Spacing tuned for your layout
    heatmap_wspace=0.2,                # horizontal gap between heatmaps
    bar_wspace=0.5,                   # horizontal gap between bar subplots
    bar_hspace=0.5,                    # vertical gap between bar rows
    gridspec_hspace=0.3,               # vertical gap between heatmaps block and bars block
    legend_right_pad=0.15,             # fraction of figure width reserved for right-side legend

    # Annotation and ticks
    annot_fontsize=9,                  # font size for heatmap cell annotations
    show_all_heatmap_y=False           # show y-ticks only on leftmost heatmap (saves space)
)

For the question about the trust level that StackOverflow users have in the output of AI, the proportion of "no answer" is even higher than before: almost half of the participants of the 2024 survey did not answer this question. Among the rest, trust in the output of AI tools is slightly positive, with a clear trend towards distrust that becomes evident in the 2025 survey, where 13.3% of the participants stated they "highly distrust" the results.

#### Summary for question 2: How do StackOverflow users perceive AI?
This downward trend in the perception of AI, visible in both the stance towards its usage and the trust in its results, can have several reasons. On the one hand, the number of available AI tools is increasing extremely fast, and it is becoming easier and easier to use these tools in a professional work environment. The tools are therefore no longer seen as magically solving toy problems, as often portrayed in the media, but are actually used for real problems, where some AI tools still reach their limits. In addition, with the wider spread of AI tools, users and the general population are being sensitized to and educated about a responsible and careful usage of the tools. It therefore makes sense that the optimism regarding AI tools seems to slowly give way to a more realistic perception of AI.

### 3.3 Analyzing and plotting the data for question 3: for which purpose is AI used by the StackOverflow users?

In [None]:
# ------------------------------------------------------------
# AITool Questions - preparations
# ------------------------------------------------------------

columns_to_analyze_AITool = [
    "AIToolCurrently Using",
    "AIToolInterested in Using",
    "AIToolNot interested in Using",
]

# Optional: map raw column names to category labels.
# If omitted, the raw column names will appear in the plots.
category_map_AITool = {
    "AIToolCurrently Using": "Currently Using",
    "AIToolInterested in Using": "Interested in Using",
    "AIToolNot interested in Using": "Not Interested",
}

# 2) Combine surveys
long_df = prepare_long_format(survey_dict, columns_to_analyze_AITool)

# 3) Build Answer × Year × Category counts (generic)
trend_df_AITool, base_counts_AITool, categories_AITool = compute_trend_generic(
    long_df,
    columns_to_analyze_AITool,
    category_map=category_map_AITool,  # or None to use column names as categories
    answers=None,               # keep all answers; you could pass a list to restrict
    deduplicate_within_row=True
)

In [None]:
# ------------------------------------------------------------
# AITool Questions - plotting the results
# ------------------------------------------------------------

plot_combined_dashboard(
    trend_df=trend_df_AITool,                 # long DF: ['Answer','Year','Category','Count']
    base_counts=base_counts_AITool,           # per-year totals for normalization (relative='year')
    categories=categories_AITool,             # list of category labels (3 in your case)

    top_n=15,                           # how many answers to display
    ensure_top_n_per_year=True,        # union of per-year top-N answers
    strict_top_n=True,                 # cap final selection back to exactly top_n

    relative_heatmap="year",           # heatmap values as % of respondents per year
    relative_bars="answer",            # 100% stacks (category share)
    share_y=True,                      # all at 0–100
    annotate_bar_totals=False,         # write raw N above each bar
    show_bar_ylabel = False,            # show ylabel for each bar plot

    bar_cols=5,                        # 5 bar subplots per row

    top_row_height=5.0,                # height (in) of the heatmaps row
    bar_row_height=5.0,                # height (in) per row of bar charts
    heatmap_width_per_cat=4.5,         # width (in) per heatmap column
    bar_width_per_col=4.0,             # width (in) per bar subplot

    font_scale=1.0,                    # global font scaling
    rotate_bar_xticks=30,              # rotate bar x-axis tick labels (deg)

    # Spacing tuned for your layout
    heatmap_wspace=0.2,                # horizontal gap between heatmaps
    bar_wspace=0.25,                   # horizontal gap between bar subplots
    bar_hspace=0.5,                    # vertical gap between bar rows
    gridspec_hspace=0.3,               # vertical gap between heatmaps block and bars block
    legend_right_pad=0.15,             # fraction of figure width reserved for right-side legend

    # Annotation and ticks
    annot_fontsize=9,                  # font size for heatmap cell annotations
    show_all_heatmap_y=False           # show y-ticks only on leftmost heatmap (saves space)
)

For this question, the proportion of survey participants not answering was extremely high. Depending on the survey year and the sub-question, up to 80% of all participants did not answer. We can only guess at the reason, for example that there were too many choices, because there is no clear pattern showing why this question was skipped so often.

However, from those who answered we can deduce a few takeaways regarding the purpose of AI usage:
- Currently using
    - "Writing code" is still the number one application of AI. Still, in 2025 the numbers are significantly lower than in 2024 and, of the people not currently using AI for that purpose, an increasing proportion is not even interested in using it in the future.
    - Nevertheless, coding is still a huge field of application for AI, with tools currently in use for tasks ranging from code documentation and debugging to testing code. Of course, we *are* analyzing a survey of StackOverflow users, so the user base might be a little bit biased here :)
    - Another popular current use is "search for answers". This is also a common use case among many other user groups, as chatbots are gaining popularity as powerful search engines that provide precise answers and references (depending, of course, on the precision of the query).
- Not interested
    - For "Committing and reviewing code" and "Deployment and monitoring", participants still seem to prefer not to rely on AI. This makes sense because the output of AI still needs to be questioned and, additionally, the trust level in AI output is slightly decreasing.
    - "Project planning" seems to be quite unpopular among the StackOverflow users. This, however, might be biased by the user group, as the majority might be more involved in coding than in project planning. This could be evaluated further by correlating the answers with other survey questions about the participants' general field of work.
    - "Predictive analytics" is also attracting less and less interest, probably also related to the decreasing trust in AI output.
- Interested in using
    - The fields of interest, on the other hand, are almost equally distributed.
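The takeaways above compare, for each purpose, how the answers split across the three categories. A minimal sketch of such a per-purpose split is shown below. It assumes that each purpose's counts are normalized across the categories within a year, which may differ from what `plot_combined_dashboard` does internally; the counts and the names `toy_tool`, `wide`, and `shares` are made up for illustration.

```python
import pandas as pd

# Hypothetical counts, for illustration only (not survey results)
toy_tool = pd.DataFrame({
    "Answer": ["Writing code"] * 3 + ["Project planning"] * 3,
    "Year": [2025] * 6,
    "Category": ["Currently Using", "Interested in Using", "Not Interested"] * 2,
    "Count": [60, 25, 15, 10, 30, 60],
})

# One row per (Year, Answer), one column per category
wide = toy_tool.pivot_table(
    index=["Year", "Answer"],
    columns="Category",
    values="Count",
    aggfunc="sum",
    fill_value=0,
)

# Divide each row by its total so that the category shares sum to 1
shares = wide.div(wide.sum(axis=1), axis=0)
print(shares.round(2))
```

Reading along a row then shows directly whether a purpose is dominated by current use or by lack of interest.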


### 3.4 Proportion of "no answer"

As the proportion of not answered questions is, in some cases, unexpectedly high, I want to analyze the proportion of the "no answer" values with respect to the other answers of the surveys in order to see if there is a trend for/against the AI-related questions or not.

In [None]:
# Inputs
target_questions = (columns_to_analyze_AIAcc + columns_to_analyze_AISelect + columns_to_analyze_AISent + columns_to_analyze_AITool)

# AI-related questions from df_matches
ai_questions = df_matches.loc[df_matches["survey"].isin([f"survey_{y}" for y in ["2023","2024","2025"]]), "column_name"].tolist()

# Helper function to compute proportion of "no answer"
def compute_no_answer_prop(df):
    total = len(df)
    no_ans = (df["answer"].astype(str).str.lower() == "no answer").sum()
    return no_ans / total if total > 0 else None

rows = []

for year, survey_df in survey_dict.items():
    # Melt wide format to long format
    df_long = survey_df.melt(var_name="question", value_name="answer")
    
    # Group 0: Target questions
    target_df = df_long[df_long["question"].isin(target_questions)]
    target_prop = compute_no_answer_prop(target_df)

    # Group 1: All other questions
    other_df = df_long[~df_long["question"].isin(target_questions)]
    other_prop = compute_no_answer_prop(other_df)

    # Group 2: Other questions containing "AI"
    other_ai_df = other_df[other_df["question"].isin(ai_questions)]
    other_ai_prop = compute_no_answer_prop(other_ai_df)

    # Group 3: Other questions NOT containing "AI"
    other_non_ai_df = other_df[~other_df["question"].isin(ai_questions)]
    other_non_ai_prop = compute_no_answer_prop(other_non_ai_df)

    # Group 4: All questions
    all_prop = compute_no_answer_prop(df_long)

    rows.append({
        "Year": year,
        "Target Questions": target_prop,
        "All Other": other_prop,
        "Other AI": other_ai_prop,
        "Other Non-AI": other_non_ai_prop,
        "All": all_prop
    })

comparison_df = pd.DataFrame(rows)
print(comparison_df)
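The grouping logic of the loop above can be checked on a tiny example. The sketch below assumes, as the helper function does, that missing values have already been replaced by the string "no answer"; the mini-survey and the names `toy_survey`, `toy_targets`, and `toy_long` are hypothetical.

```python
import pandas as pd

# Hypothetical mini-survey: 4 respondents, 3 questions
toy_survey = pd.DataFrame({
    "AISelect": ["Yes", "no answer", "No", "Yes"],
    "AISent": ["Favorable", "no answer", "no answer", "Unsure"],
    "Country": ["DE", "US", "no answer", "FR"],
})
toy_targets = ["AISelect", "AISent"]

# Wide -> long: one row per (question, answer) cell
toy_long = toy_survey.melt(var_name="question", value_name="answer")

# Boolean masks for "cell is unanswered" and "question is a target question"
is_missing = toy_long["answer"].astype(str).str.lower() == "no answer"
is_target = toy_long["question"].isin(toy_targets)

# The mean of a boolean mask is the proportion of True values
target_prop = is_missing[is_target].mean()
other_prop = is_missing[~is_target].mean()
all_prop = is_missing.mean()
print(target_prop, other_prop, all_prop)
```

Note that each proportion is taken over cells (respondents × questions), so a group containing many questions contributes more cells than a group containing only a few.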

In [None]:
# Set Year as index for plotting
plot_df = comparison_df.set_index("Year")
plot_df_reset = plot_df.reset_index().melt(id_vars="Year", var_name="Group", value_name="Proportion")
plt.figure(figsize=(10, 6))
sns.barplot(data=plot_df_reset, x="Year", y="Proportion", hue="Group")
plt.title("Comparison of 'No Answer' Proportions by Year")
plt.ylabel("Proportion of 'No Answer'")
plt.tight_layout()
plt.show()

Comparing the proportions of the "no answer" values to the other questions, we can observe the following:
- On average, the questions considered in this analysis were answered more completely over time, with the fewest "no answer" values in the latest survey of 2025.
- Compared to "all other" or even "all" questions, the survey participants answered the target questions less often in 2023, about equally often in 2024, and clearly more often in 2025.
- "Other AI"-related questions seem to be answered relatively sparsely when compared to "other non-AI".
- The general proportion of "no answer" for all questions is increasing over time with a percentage of over 50% in the latest survey of 2025.
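Group averages like the ones above can hide which individual questions drive them. A per-question breakdown is one way to follow up; the sketch below uses a hypothetical mini-survey (`toy_survey`) and again assumes that missing values are encoded as the string "no answer".

```python
import pandas as pd

# Hypothetical mini-survey: 4 respondents, 3 questions
toy_survey = pd.DataFrame({
    "AISelect": ["Yes", "no answer", "No", "Yes"],
    "AISent": ["Favorable", "no answer", "no answer", "Unsure"],
    "Country": ["DE", "US", "no answer", "FR"],
})

# Share of "no answer" per question, highest first
per_question = (
    toy_survey.astype(str)
    .apply(lambda col: col.str.lower().eq("no answer").mean())
    .sort_values(ascending=False)
)
print(per_question)
```

Applied to a real survey frame, the top of this ranking would show whether a few very sparse questions dominate the overall "no answer" proportion.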

## 4. Summary

### Question 1: How did the AI‑related survey questions develop over time?

- Sharp rise in AI focus: The number of AI‑related items grows from 3 (2018) and 1 (2019) to 9 (2023), 11 (2024), and 31 (2025). By 2025, AI questions account for ~20% of the full survey.
- Overall survey size dip & rebound: Total question count drops around 2020/2021, then rises again in 2023–2025 alongside the AI share.

**Implication**: The survey increasingly probes AI, in both breadth (more topics like tools, trust, agents) and depth (richer sub‑items in 2025).

### Question 2: How do users perceive AI, and has that changed?

The analysis uses three recurring questions—"AISelect" (use), "AISent" (stance), and "AIAcc" (trust). A 2023 schema mix‑up between "AIBen" and "AIAcc" is detected and corrected to ensure comparability across 2023–2025.

- "AISelect": Usage rises over time. In 2024, the share using AI is ~57.6% (noting a sizeable non‑response). In 2025, the split reveals that most “Yes” users report daily usage.
- "AISent": Majority remains favorable/very favorable, but there’s a slight downward trend into 2025 with a noticeable uptick in “very unfavorable.”
- "AIAcc": Still net positive (“somewhat trust” leads) yet trending toward more distrust; in 2025, ~13.3% report “highly distrust.” Non‑response on trust is high (e.g., ~half in 2024).

**Implication**: Usage keeps climbing, but attitudes are normalizing—optimism cooling slightly as real‑world limitations and risks surface.

### Question 3: What do people use AI for?

- Top current use: “Writing code” remains the #1 application; usage in 2025 is lower than 2024, and among non‑users for coding a growing share is not interested in adopting it. “Search for answers” is another strong, current use.
- Less favored: “Committing/reviewing code” and “Deployment/monitoring” tend to show higher not‑interested rates. “Project planning” is also relatively unfavored (possibly role bias in the SO audience). “Predictive analytics” interest appears to wane, aligning with the trust findings.
- Data note: The 2025 matrix question changed and was re‑mapped to align with 2023/2024 categories before comparison. Also, non‑response for this block is very high (up to ~80% on some items), so interpret with caution.

### Additional data quality observations
- Schema evolution (2017–2025) complicates strict year‑over‑year comparisons; recent years are more comparable.
- Non‑response (“No answer”): For the AI questions analyzed, non‑response declines by 2025 relative to other questions, but overall survey non‑response rises over time (exceeding 50% in 2025 across all questions). This can bias levels and trends.

### **Bottom line**
- AI’s prominence in the survey has surged since 2023.
- Usage keeps growing (with many daily users by 2025), yet sentiment and trust show a softening—especially in 2025.
- Coding and answer‑search dominate today’s use; review/monitoring/planning remain less AI‑driven among respondents.
- Interpretation caveat: High and shifting non‑response rates and schema changes require caution.