# Descriptive Analysis of Questions Q1 to Q8

This section computes descriptive statistics for questions Q1 to Q8 (including their numerical sub-questions) from the dataset `clean_data.csv`. Q3 is a free-text question and is excluded. For each remaining column, the mean, standard deviation, minimum, maximum, mode, and frequency of the mode are calculated. In `clean_data.csv`, values representing "no answer" (originally coded as -99) have already been converted to missing values (NaN), so they do not enter any of the statistics.
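
The cleaning step itself is not part of this notebook. As a minimal sketch of why it matters, the following toy example (the data frame `raw` and its values are made up for illustration) shows how a -99 sentinel distorts the mean and how replacing it with NaN removes it from the calculation:

```python
import numpy as np
import pandas as pd

# Hypothetical raw answers; -99 marks "no answer".
raw = pd.DataFrame({"Q1": [3, -99, 5, 1], "Q5": [2, 2, -99, -99]})

# Without cleaning, the sentinel is treated as a real answer.
q1_mean_raw = raw["Q1"].mean()

# Replace the sentinel with NaN so that it is excluded from the statistics.
cleaned = raw.replace(-99, np.nan)

# pandas skips NaN by default (skipna=True).
q1_mean = cleaned["Q1"].mean()            # (3 + 5 + 1) / 3
q5_valid = int(cleaned["Q5"].notna().sum())  # number of valid answers

print(q1_mean_raw, q1_mean, q5_valid)
```

Whether the original cleaning used `replace` or another method is an assumption; only the resulting NaN values are relied on below.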

In [None]:
import pandas as pd
import numpy as np

# Load data
file_path = '../../src/llm/clean_data.csv'
df = pd.read_csv(file_path)

# Define columns for analysis (Q1-Q8, numerical sub-questions, skip Q3)
question_columns = ['Q1']
question_columns.extend([f'Q2{chr(ord("A")+i)}' for i in range(7)]) # Q2A to Q2G
# Q3 is a text question and is skipped here
question_columns.extend([f'Q4{chr(ord("A")+i)}' for i in range(8)]) # Q4A to Q4H
question_columns.append('Q5')
question_columns.extend([f'Q6{chr(ord("A")+i)}' for i in range(6)]) # Q6A to Q6F
question_columns.extend([f'Q7{chr(ord("A")+i)}' for i in range(6)]) # Q7A to Q7F
question_columns.extend([f'Q8{chr(ord("A")+i)}' for i in range(12)]) # Q8A to Q8L

results = []

for col in question_columns:
    if col in df.columns:
        # Ensure the column is numeric. Values that cannot be converted become NaN.
        # The -99 values were already converted to NaN in clean_data.csv.
        series_cleaned = pd.to_numeric(df[col], errors='coerce')
        
        # Calculate statistics
        mean_val = series_cleaned.mean()
        std_val = series_cleaned.std()
        min_val = series_cleaned.min()
        max_val = series_cleaned.max()
        
        mode_val = np.nan
        frequency_val = np.nan
        
        # mode() ignores NaN and returns all tied modes in ascending order;
        # it is empty only if the column has no valid answers.
        mode_series = series_cleaned.mode()
        if not mode_series.empty:
            mode_val = mode_series.iloc[0]  # smallest value among tied modes
            # Count how often the mode occurs (NaN never compares equal)
            frequency_val = (series_cleaned == mode_val).sum()
        
        results.append({
            'Column': col,
            'Mean': mean_val,
            'Std': std_val,
            'Min': min_val,
            'Max': max_val,
            'Mode': mode_val,
            'Frequency': int(frequency_val) if pd.notna(frequency_val) else np.nan # Frequency as Int, if not NaN
        })

# Display results as DataFrame
df_results = pd.DataFrame(results)

# Print DataFrame
print(df_results.to_string())

   Column      Mean       Std  Min  Max  Mode  Frequency
0      Q1  3.098361  1.032609  1.0  5.0   3.0        216
1     Q2A  2.442623  0.570301  1.0  3.0   3.0        235
2     Q2B  2.662551  0.515049  1.0  3.0   3.0        332
3     Q2C  1.378099  0.564307  1.0  3.0   1.0        321
4     Q2D  1.344398  0.571008  1.0  3.0   1.0        340
5     Q2E  2.254167  0.691420  1.0  3.0   2.0        220
6     Q2F  2.491561  0.603839  1.0  3.0   3.0        260
7     Q2G  2.195329  0.723511  1.0  3.0   2.0        207
8     Q4A  2.663158  0.539779  1.0  3.0   3.0        331
9     Q4B  2.090336  0.752701  1.0  3.0   2.0        203
10    Q4C  1.376874  0.581677  1.0  3.0   1.0        315
11    Q4D  1.939583  0.676550  1.0  3.0   2.0        259
12    Q4E  1.646934  0.679665  1.0  3.0   1.0        222
13    Q4F  1.247881  0.513094  1.0  3.0   1.0        373
14    Q4G  1.068966  0.293166  1.0  3.0   1.0        437
15    Q4H  1.074786  0.328459  1.0  3.0   1.0        442
16     Q5  2.076763  1.159037  

In [None]:
# Save results as CSV file
output_csv_path = '../../data/stat_summaryQ1toQ8.csv'
df_results.to_csv(output_csv_path, index=False, encoding='utf-8')

print(f"The results were successfully saved to '{output_csv_path}'.")

The results were successfully saved to '../../data/stat_summaryQ1toQ8.csv'.


## Explanation of the Calculated Columns

In the table above:

*   **Column**: The name of the column (question or sub-question) from the dataset.
*   **Mean**: The arithmetic mean of the valid (non-missing) answers for this question.
*   **Std**: The sample standard deviation (pandas' default, `ddof=1`), a measure of the dispersion of answers around the mean.
*   **Min**: The smallest value given for this question.
*   **Max**: The largest value given for this question.
*   **Mode**: The value given most frequently for this question. If several values share the highest frequency, the smallest of them is shown, because `Series.mode()` returns tied modes in ascending order and the code takes the first entry.
*   **Frequency**: The number of times the mode (the most frequent value) occurs in the answers for this question.
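
The definitions above can be checked on a small example. The sketch below uses a made-up answer series (not taken from the dataset) with a tie between two values and one missing answer:

```python
import pandas as pd

# Hypothetical answers: 1 and 2 both occur twice, one answer is missing.
answers = pd.Series([1, 1, 2, 2, 3, None])

modes = answers.mode()                  # all tied modes, ascending; NaN is ignored
mode_val = modes.iloc[0]                # the smallest tied mode is reported
frequency = int((answers == mode_val).sum())

std_sample = answers.std()              # pandas default: ddof=1 (sample)
std_population = answers.std(ddof=0)    # population variant, for contrast

print(list(modes), mode_val, frequency)
print(round(std_sample, 4), round(std_population, 4))
```

With a tie, the reported mode alone hides the second most frequent value, so it can be worth inspecting `modes` in full when interpreting a column.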

## Comparison of Generated Statistics with a Reference File

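This section compares the statistics generated in this notebook session (`stat_summaryQ1toQ8.csv`) with a reference file (`../../src/llm/stat_summary.csv`). The comparison covers the rows of the generated file (questions Q1 to Q8 and their sub-questions) and the columns `Mean`, `Std`, `Min`, `Max`, `Mode`, and `Frequency`. `Mean` and `Std` are compared with a small floating-point tolerance, the remaining columns exactly. A question that is missing from the reference file is treated as all-NaN on the reference side.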
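
To illustrate the idea of an aligned, tolerance-based comparison, here is a minimal sketch on two small made-up tables (the values and the extra row `Q9` are hypothetical). It aligns the reference to the generated rows by label and marks every cell that differs beyond the tolerance:

```python
import numpy as np
import pandas as pd

# Hypothetical generated and reference statistics, indexed by question.
generated = pd.DataFrame(
    {"Mean": [3.1, np.nan], "Frequency": [216, 10]},
    index=pd.Index(["Q1", "Q5"], name="Column"),
)
reference = pd.DataFrame(
    {"Mean": [np.nan, 3.1000000001, 1.0], "Frequency": [11.0, 216.0, 5.0]},
    index=pd.Index(["Q5", "Q1", "Q9"], name="Column"),
)

# Keep the generated row order and align the reference by label.
rows = generated.index[generated.index.isin(reference.index)]
cols = ["Mean", "Frequency"]
gen = generated.loc[rows, cols].to_numpy(dtype=float)
ref = reference.loc[rows, cols].to_numpy(dtype=float)

# equal_nan=True treats NaN in the same cell of both tables as a match.
close = np.isclose(gen, ref, rtol=1e-7, atol=1e-9, equal_nan=True)
mismatches = pd.DataFrame(~close, index=rows, columns=cols)

print(mismatches)
```

Here only the `Frequency` of `Q5` is flagged; the tiny difference in the `Mean` of `Q1` lies within the tolerance, and the two NaN means of `Q5` count as equal.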

In [None]:
import pandas as pd
import numpy as np

print("Comparison of the generated file stat_summaryQ1toQ8.csv with the reference file ../../src/llm/stat_summary.csv\n")

# Path to the generated file
path_generated = '../../data/stat_summaryQ1toQ8.csv'
# Path to the reference file
path_reference = '../../src/llm/stat_summary.csv'

try:
    df_generated = pd.read_csv(path_generated)
    df_reference = pd.read_csv(path_reference)

    # Set 'Column' as index
    df_generated_indexed = df_generated.set_index('Column')
    df_reference_indexed = df_reference.set_index('Column')

    # Select only the questions (rows) present in the generated file
    common_question_rows = df_generated_indexed.index
    df_reference_filtered = df_reference_indexed.loc[df_reference_indexed.index.isin(common_question_rows)]

    # Define columns for comparison
    stat_cols_to_compare = ['Mean', 'Std', 'Min', 'Max', 'Mode', 'Frequency']
    
    # Ensure both DataFrames have the same rows (in the same order) and columns
    df_generated_aligned = df_generated_indexed.loc[common_question_rows, stat_cols_to_compare].copy()
    df_reference_aligned = df_reference_filtered.reindex(common_question_rows)[stat_cols_to_compare].copy()

    all_values_match = True

    # Comparison for 'Mean' and 'Std' (floating-point numbers with tolerance)
    for col_name in ['Mean', 'Std']:
        if col_name in df_generated_aligned.columns and col_name in df_reference_aligned.columns:
            series_gen = df_generated_aligned[col_name]
            series_ref = df_reference_aligned[col_name]
            
            # Compare element-wise with a tolerance; equal_nan=True treats NaN in the
            # same position of both series as a match.
            is_close = np.isclose(series_gen, series_ref, rtol=1e-7, atol=1e-9, equal_nan=True)
            if not is_close.all():
                all_values_match = False
                print(f"Differences found in column: {col_name}")
                comparison_df = pd.DataFrame({'Generated': series_gen, 'Reference': series_ref})
                print(comparison_df[~is_close].to_string())
                print("-" * 50)
        else:
            print(f"Column {col_name} not present in both DataFrames for comparison.")
            all_values_match = False


    # Comparison for 'Min', 'Max', 'Mode', 'Frequency'
    # These are compared as float to handle type differences (e.g., int vs. float)
    for col_name in ['Min', 'Max', 'Mode', 'Frequency']:
        if col_name in df_generated_aligned.columns and col_name in df_reference_aligned.columns:
            series_gen = df_generated_aligned[col_name].astype(float)
            series_ref = df_reference_aligned[col_name].astype(float)

            if not series_gen.equals(series_ref): # .equals() handles NaNs correctly
                all_values_match = False
                print(f"Differences found in column: {col_name}")
                # Show original values for better readability
                comparison_df = pd.DataFrame({'Generated': df_generated_aligned[col_name], 
                                              'Reference': df_reference_aligned[col_name]})
                # Mask for different values (after conversion to float);
                # positions where both values are NaN count as equal
                mask_diff = (series_gen != series_ref) & ~(series_gen.isna() & series_ref.isna())
                print(comparison_df[mask_diff].to_string())
                print("-" * 50)
        else:
            print(f"Column {col_name} not present in both DataFrames for comparison.")
            all_values_match = False

    if all_values_match:
        print("All compared values for Q1-Q8 in the generated file match the reference file.")
    else:
        print("Differences were found. Please check the output above.")

except FileNotFoundError:
    print(f"One of the files was not found. Please check the paths:\n- Generated: {path_generated}\n- Reference: {path_reference}")
except Exception as e:
    print(f"An error occurred during the comparison: {e}")


Comparison of the generated file stat_summaryQ1toQ8.csv with the reference file ../../src/llm/stat_summary.csv

All compared values for Q1-Q8 in the generated file match the reference file.
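
If only a pass/fail result is needed, for example in an automated check, the same comparison can be sketched with `pandas.testing.assert_frame_equal`. The helper `frames_match` and the sample values below are hypothetical and not part of the notebook above:

```python
import numpy as np
import pandas as pd

def frames_match(left, right, rtol=1e-7, atol=1e-9):
    """Return True if both frames agree within the given tolerance."""
    try:
        # check_dtype=False lets int and float columns compare by value;
        # NaN in the same position of both frames counts as equal.
        pd.testing.assert_frame_equal(
            left, right, check_dtype=False, rtol=rtol, atol=atol
        )
        return True
    except AssertionError:
        return False

# Hypothetical statistics for two questions.
generated = pd.DataFrame(
    {"Mean": [3.1, np.nan], "Frequency": [216, 10]}, index=["Q1", "Q5"]
)
reference = pd.DataFrame(
    {"Mean": [3.1 + 1e-10, np.nan], "Frequency": [216.0, 10.0]}, index=["Q1", "Q5"]
)

# Same table with one deliberately changed frequency.
changed = reference.copy()
changed.loc["Q5", "Frequency"] = 11.0

print(frames_match(generated, reference), frames_match(generated, changed))
```

Unlike the cell above, this variant does not list the differing rows by column; the message of the raised `AssertionError` would have to be inspected for that.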
