# 01 - Data Loading and Statistical Analysis

This notebook covers basic data loading, preprocessing, and statistical analysis for the 2025 Software Industry Salary Survey.

## Goals
- Load and validate the cleaned dataset
- Run statistical tests with effect sizes
- Produce basic data summaries
- Compute the baseline metrics used by later analyses

## Library Imports and Setup

In [14]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind, f_oneway, kruskal
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Set up plotting style
sns.set_palette("husl")
plt.rcParams['font.family'] = 'DejaVu Sans'

# Constants
FIG_DIR = '../figures'
LOCATION_NOTE = 'Note: Estimated location is inferred from company location and work mode (Office/Hybrid → company location). Not definitive. "Yurtdışı TR hub" responses are excluded from location-based inference.'

# Ensure output directory exists
os.makedirs(FIG_DIR, exist_ok=True)

## Data Loading and Validation

In [16]:
def load_data() -> pd.DataFrame:
    """Load the cleaned dataset and parse the timestamp column"""
    df = pd.read_csv('../data/2025_cleaned_data.csv')
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    return df

# Load data
df = load_data()
print(f'Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns')
print(f'Date range: {df["timestamp"].min()} to {df["timestamp"].max()}')

Dataset loaded: 2969 rows, 94 columns
Date range: 2025-08-20 12:31:15 to 2025-08-21 11:03:36
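
`load_data` parses the timestamp column but does not check the contents of the frame. A minimal sketch of what a validation step could look like follows; the function name `validate_survey_frame` and the specific checks are assumptions for illustration, not part of the original notebook. The column names come from the `df.info()` output in this notebook.

```python
import pandas as pd

def validate_survey_frame(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    required = ['timestamp', 'gender', 'experience_years', 'salary_numeric']
    missing = [c for c in required if c not in df.columns]
    if missing:
        problems.append(f'missing columns: {missing}')
        return problems  # the remaining checks need these columns
    if df['timestamp'].isna().any():
        problems.append('unparseable timestamps')
    if (df['salary_numeric'] <= 0).any():
        problems.append('non-positive salary values')
    if (df['experience_years'] < 0).any():
        problems.append('negative experience_years')
    if not df['gender'].isin([0, 1]).all():
        problems.append('gender outside {0, 1}')
    return problems

# Tiny made-up frame (not survey data) with one deliberately bad row
sample = pd.DataFrame({
    'timestamp': pd.to_datetime(['2025-08-20 12:31:15', '2025-08-20 12:31:27']),
    'gender': [0, 1],
    'experience_years': [5, -1],
    'salary_numeric': [65.5, 125.5],
})
print(validate_survey_frame(sample))
```

Returning a list of problems rather than raising on the first one lets a notebook cell show every issue at once.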


## Data Overview and Summary Statistics

In [17]:
# Display basic dataset information
print("Dataset information:")
print(df.info())

print("\nFirst few rows:")
display(df.head())

print("\nSummary statistics for key numeric columns:")
key_cols = ['salary_numeric', 'experience_years', 'seniority_level_ic']
display(df[key_cols].describe())

Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2969 entries, 0 to 2968
Data columns (total 94 columns):
 #   Column                             Non-Null Count  Dtype         
---  ------                             --------------  -----         
 0   timestamp                          2969 non-null   datetime64[ns]
 1   gender                             2969 non-null   int64         
 2   experience_years                   2969 non-null   int64         
 3   salary_numeric                     2969 non-null   float64       
 4   is_likely_in_company_location      2969 non-null   int64         
 5   company_location_Amerika           2969 non-null   int64         
 6   company_location_Avrupa            2969 non-null   int64         
 7   company_location_Turkiye           2969 non-null   int64         
 8   company_location_Yurtdisi_TR_hub   2969 non-null   int64         
 9   employment_type_Freelance          2969 non-null   int64         
 10  employment_type_K

|  | timestamp | gender | experience_years | salary_numeric | is_likely_in_company_location | company_location_Amerika | company_location_Avrupa | company_location_Turkiye | company_location_Yurtdisi_TR_hub | employment_type_Freelance | ... | frontend_Vue | tools_FastApi | tools_Firebase | tools_Jotai | tools_Kullanmiyorum | tools_Redux | tools_Strapi | tools_Supabase | tools_Wordpress | tools_Zustand |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2025-08-20 12:31:15 | 0 | 5 | 65.5 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 2025-08-20 12:31:27 | 0 | 6 | 125.5 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 2025-08-20 12:32:54 | 0 | 7 | 155.5 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 2025-08-20 12:33:08 | 0 | 5 | 85.5 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2025-08-20 12:34:03 | 0 | 10 | 125.5 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |

Summary statistics for key numeric columns:


|  | salary_numeric | experience_years | seniority_level_ic |
| --- | --- | --- | --- |
| count | 2969.0 | 2969.0 | 2969.0 |
| mean | 98.206635 | 5.01482 | 2.214887 |
| std | 54.969052 | 4.318024 | 1.185449 |
| min | 5.0 | 0.0 | 0.0 |
| 25% | 55.5 | 2.0 | 1.0 |
| 50% | 85.5 | 4.0 | 2.0 |
| 75% | 125.5 | 6.0 | 3.0 |
| max | 230.5 | 30.0 | 6.0 |


## Effect Size Calculation Function

In [18]:
def calculate_effect_size(group1, group2):
    """Calculate Cohen's d effect size for comparing two groups"""
    n1, n2 = len(group1), len(group2)
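    # Request sample variances (ddof=1) explicitly: pandas defaults to ddof=1 but NumPy arrays default to ddof=0
    pooled_std = np.sqrt(((n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2))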
    return (group1.mean() - group2.mean()) / pooled_std
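
As a worked illustration of the pooled-standard-deviation formula used above, here is a standard-library sketch on made-up numbers. The names `cohens_d` and `describe_magnitude` are hypothetical helpers, and the cut-offs are Cohen's conventional 0.2 / 0.5 / 0.8 benchmarks; calling anything below 0.2 "negligible" is an assumption of this sketch, not a convention the notebook states.

```python
import math
import statistics

def cohens_d(sample1, sample2):
    """Cohen's d using the pooled sample standard deviation (n - 1 denominators)."""
    n1, n2 = len(sample1), len(sample2)
    var1 = statistics.variance(sample1)  # sample variance, matching pandas' default ddof=1
    var2 = statistics.variance(sample2)
    pooled_std = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.mean(sample1) - statistics.mean(sample2)) / pooled_std

def describe_magnitude(d):
    """Map |d| to conventional labels (0.2 small, 0.5 medium, 0.8 large)."""
    size = abs(d)
    if size < 0.2:
        return 'negligible'
    if size < 0.5:
        return 'small'
    if size < 0.8:
        return 'medium'
    return 'large'

# Made-up values, for illustration only
group_a = [2, 4, 6, 8]
group_b = [1, 3, 5, 7]
d = cohens_d(group_a, group_b)
print(f'd = {d:.3f} ({describe_magnitude(d)})')  # d = 0.387 (small)
```

Both groups have sample variance 20/3, so the pooled standard deviation is about 2.582 and a mean difference of 1 gives d of about 0.387.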

## Statistical Tests: Key Comparisons

In [19]:
def perform_statistical_tests(df: pd.DataFrame):
    """Perform hypothesis tests with effect sizes for key comparisons"""
    results = {}
    
    # React vs Non-React
    if 'frontend_React' in df.columns:
        react_salaries = df[df['frontend_React'] == 1]['salary_numeric']
        non_react_salaries = df[df['frontend_React'] == 0]['salary_numeric']
        
        if len(react_salaries) > 10 and len(non_react_salaries) > 10:
            t_stat, p_value = ttest_ind(react_salaries, non_react_salaries, equal_var=False)
            effect_size = calculate_effect_size(react_salaries, non_react_salaries)
            
            results['react_vs_non_react'] = {
                'react_mean': react_salaries.mean(),
                'non_react_mean': non_react_salaries.mean(),
                'mean_diff': react_salaries.mean() - non_react_salaries.mean(),
                'p_value': p_value,
                'effect_size': effect_size,
                'significant': p_value < 0.05,
                'react_count': len(react_salaries),
                'non_react_count': len(non_react_salaries)
            }
    
    # Remote vs Office
    if 'work_mode_Remote' in df.columns and 'work_mode_Office' in df.columns:
        remote_salaries = df[df['work_mode_Remote'] == 1]['salary_numeric']
        office_salaries = df[df['work_mode_Office'] == 1]['salary_numeric']
        
        if len(remote_salaries) > 10 and len(office_salaries) > 10:
            t_stat, p_value = ttest_ind(remote_salaries, office_salaries, equal_var=False)
            effect_size = calculate_effect_size(remote_salaries, office_salaries)
            
            results['remote_vs_office'] = {
                'remote_mean': remote_salaries.mean(),
                'office_mean': office_salaries.mean(),
                'mean_diff': remote_salaries.mean() - office_salaries.mean(),
                'p_value': p_value,
                'effect_size': effect_size,
                'significant': p_value < 0.05,
                'remote_count': len(remote_salaries),
                'office_count': len(office_salaries)
            }
    
    # Europe vs Turkey
    if 'company_location_Avrupa' in df.columns and 'company_location_Turkiye' in df.columns:
        europe_salaries = df[df['company_location_Avrupa'] == 1]['salary_numeric']
        turkey_salaries = df[df['company_location_Turkiye'] == 1]['salary_numeric']
        
        if len(europe_salaries) > 10 and len(turkey_salaries) > 10:
            t_stat, p_value = ttest_ind(europe_salaries, turkey_salaries, equal_var=False)
            effect_size = calculate_effect_size(europe_salaries, turkey_salaries)
            
            results['europe_vs_turkey'] = {
                'europe_mean': europe_salaries.mean(),
                'turkey_mean': turkey_salaries.mean(),
                'mean_diff': europe_salaries.mean() - turkey_salaries.mean(),
                'p_value': p_value,
                'effect_size': effect_size,
                'significant': p_value < 0.05,
                'europe_count': len(europe_salaries),
                'turkey_count': len(turkey_salaries)
            }
    
    # Gender gap
    male_salaries = df[df['gender'] == 0]['salary_numeric']
    female_salaries = df[df['gender'] == 1]['salary_numeric']
    
    if len(male_salaries) > 10 and len(female_salaries) > 10:
        t_stat, p_value = ttest_ind(male_salaries, female_salaries, equal_var=False)
        effect_size = calculate_effect_size(male_salaries, female_salaries)
        
        results['gender_gap'] = {
            'male_mean': male_salaries.mean(),
            'female_mean': female_salaries.mean(),
            'mean_diff': male_salaries.mean() - female_salaries.mean(),
            'p_value': p_value,
            'effect_size': effect_size,
            'significant': p_value < 0.05,
            'male_count': len(male_salaries),
            'female_count': len(female_salaries)
        }
    
    return results

# Perform statistical tests
test_results = perform_statistical_tests(df)

# Display results in a formatted table.
# Each result dict uses group-specific key prefixes (e.g. 'react_mean', 'office_count'),
# so map every comparison to its (group 1, group 2) prefixes explicitly.
group_prefixes = {
    'react_vs_non_react': ('react', 'non_react'),
    'remote_vs_office': ('remote', 'office'),
    'europe_vs_turkey': ('europe', 'turkey'),
    'gender_gap': ('male', 'female'),
}

results_df = pd.DataFrame([
    {
        'Comparison': test_name.replace('_', ' ').title(),
        'Group 1 Mean': result[f'{g1}_mean'],
        'Group 2 Mean': result[f'{g2}_mean'],
        'Mean Difference': result['mean_diff'],
        'P-value': result['p_value'],
        'Effect Size (Cohen\'s d)': result['effect_size'],
        'Significant': result['significant'],
        'Group 1 Count': result[f'{g1}_count'],
        'Group 2 Count': result[f'{g2}_count']
    }
    for test_name, result in test_results.items()
    for g1, g2 in [group_prefixes[test_name]]
])

display(results_df)

|  | Comparison | Group 1 Mean | Group 2 Mean | Mean Difference | P-value | Effect Size (Cohen's d) | Significant | Group 1 Count | Group 2 Count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | React Vs Non React | 96.057044 | 99.311576 | -3.254532 | 0.1288888 | -0.05922 | False | 1008 | 1961 |
| 1 | Remote Vs Office | 101.243333 | 78.636998 | 22.606335 | 6.370442e-18 | 0.41771 | True | 1350 | 573 |
| 2 | Europe Vs Turkey | 162.924242 | 92.93392 | 69.990323 | 1.621139e-26 | 1.350362 | True | 132 | 2671 |
| 3 | Gender Gap | 99.387985 | 86.102273 | 13.285712 | 5.050997e-05 | 0.242228 | True | 2705 | 264 |
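
Each of the four comparisons above is tested at 0.05 separately. One optional robustness check, which the notebook does not perform, is to adjust the p-values for multiple comparisons. The sketch below applies the Holm step-down adjustment to the p-values reported in the table; `holm_adjust` is a hypothetical helper written for this illustration.

```python
def holm_adjust(p_values: dict[str, float]) -> dict[str, float]:
    """Holm step-down adjusted p-values (capped at 1.0), keyed like the input."""
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ordered)
    adjusted = {}
    running_max = 0.0
    for rank, (name, p) in enumerate(ordered):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on
        candidate = min(1.0, (m - rank) * p)
        # Adjusted values must not decrease as the raw p-values increase
        running_max = max(running_max, candidate)
        adjusted[name] = running_max
    return adjusted

# P-values copied from the results table above
reported = {
    'React Vs Non React': 0.1288888,
    'Remote Vs Office': 6.370442e-18,
    'Europe Vs Turkey': 1.621139e-26,
    'Gender Gap': 5.050997e-05,
}
adjusted = holm_adjust(reported)
for name, p_adj in adjusted.items():
    print(f'{name}: p_adj={p_adj:.3g}, significant={p_adj < 0.05}')
```

With these inputs the conclusions do not change: the three comparisons flagged as significant stay below 0.05 after adjustment, and React vs Non React stays above it.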


## Group Comparisons: Seniority Levels

In [20]:
# Seniority level group comparison
if 'seniority_level_ic' in df.columns:
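    valid = df[['seniority_level_ic', 'salary_numeric']].dropna()
    # Keep only levels with at least 10 respondents, so every test below (including Tukey HSD) uses the same groups
    valid = valid.groupby('seniority_level_ic').filter(lambda g: len(g) >= 10)
    grouped = valid.groupby('seniority_level_ic')['salary_numeric']
    groups = [g.values for _, g in grouped]
    labels = [str(k) for k, _ in grouped]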
    
    if len(groups) >= 2:
        # ANOVA test
        try:
            f_stat, p_anova = f_oneway(*groups)
        except Exception:
            f_stat, p_anova = np.nan, np.nan
        
        # Kruskal-Wallis test
        try:
            h_stat, p_kruskal = kruskal(*groups)
        except Exception:
            h_stat, p_kruskal = np.nan, np.nan

        print("Seniority level group comparison:")
        print(f"  Groups (n>=10): {labels}")
        print(f"  ANOVA p-value: {p_anova:.4f}" if not np.isnan(p_anova) else "  ANOVA p-value: NA")
        print(f"  Kruskal-Wallis p-value: {p_kruskal:.4f}" if not np.isnan(p_kruskal) else "  Kruskal-Wallis p-value: NA")

        # Tukey HSD post-hoc test
        try:
            tukey = pairwise_tukeyhsd(endog=valid['salary_numeric'], groups=valid['seniority_level_ic'].astype(str), alpha=0.05)
            print("  Tukey HSD (significant pairs):")
            for res in tukey.summary().data[1:]:
                grp1, grp2, meandiff, p_adj, lower, upper, reject = res
                if reject:
                    print(f"    {grp1} vs {grp2}: diff={meandiff:.1f}, p_adj={p_adj:.4f}")
        except Exception:
            print("  Tukey HSD: NA")

Seniority level group comparison:
  Groups (n>=10): ['0.0', '1.0', '2.0', '3.0', '4.0', '5.0', '6.0']
  ANOVA p-value: 0.0000
  Kruskal-Wallis p-value: 0.0000
  Tukey HSD (significant pairs):
    0.0 vs 1.0: diff=-129.7, p_adj=0.0000
    0.0 vs 2.0: diff=-100.7, p_adj=0.0000
    0.0 vs 3.0: diff=-54.0, p_adj=0.0000
    0.0 vs 5.0: diff=-34.3, p_adj=0.0000
    1.0 vs 2.0: diff=29.0, p_adj=0.0000
    1.0 vs 3.0: diff=75.7, p_adj=0.0000
    1.0 vs 4.0: diff=137.9, p_adj=0.0000
    1.0 vs 5.0: diff=95.4, p_adj=0.0000
    1.0 vs 6.0: diff=133.3, p_adj=0.0000
    2.0 vs 3.0: diff=46.7, p_adj=0.0000
    2.0 vs 4.0: diff=108.9, p_adj=0.0000
    2.0 vs 5.0: diff=66.4, p_adj=0.0000
    2.0 vs 6.0: diff=104.3, p_adj=0.0000
    3.0 vs 4.0: diff=62.2, p_adj=0.0000
    3.0 vs 5.0: diff=19.7, p_adj=0.0000
    3.0 vs 6.0: diff=57.6, p_adj=0.0000
    4.0 vs 5.0: diff=-42.5, p_adj=0.0012
    5.0 vs 6.0: diff=37.9, p_adj=0.0000
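
The Tukey output identifies groups by numeric code, and the sign of `diff` is easy to misread. The sketch below relabels a few pairs using the code-to-name order from the career summary cell later in this notebook and checks the sign convention against the mean salaries in the "Salary by Career Level" table. It assumes that codes 0 to 6 map to those labels in ascending order, as that cell implies; the names `SENIORITY_LABELS`, `MEAN_SALARY`, and `relabel_pair` are hypothetical.

```python
# Code-to-label mapping, following the index assigned in the career summary cell
SENIORITY_LABELS = {
    '0.0': 'Management', '1.0': 'Junior', '2.0': 'Mid', '3.0': 'Senior',
    '4.0': 'Staff Engineer', '5.0': 'Team Lead', '6.0': 'Architect',
}

# Mean salaries copied from the "Salary by Career Level" table (rounded to 1 decimal)
MEAN_SALARY = {
    'Management': 184.8, 'Junior': 55.1, 'Mid': 84.1, 'Senior': 130.8,
    'Staff Engineer': 193.0, 'Team Lead': 150.5, 'Architect': 188.4,
}

def relabel_pair(code1, code2):
    """Return (name1, name2, mean(group 2) - mean(group 1)) for one Tukey pair."""
    name1, name2 = SENIORITY_LABELS[code1], SENIORITY_LABELS[code2]
    expected_diff = round(MEAN_SALARY[name2] - MEAN_SALARY[name1], 1)
    return name1, name2, expected_diff

# (code1, code2, diff as printed in the Tukey output above)
reported_pairs = [('0.0', '1.0', -129.7), ('1.0', '2.0', 29.0), ('4.0', '5.0', -42.5)]
for code1, code2, reported_diff in reported_pairs:
    name1, name2, expected_diff = relabel_pair(code1, code2)
    print(f'{name1} vs {name2}: reported={reported_diff}, from table={expected_diff}')
```

The values agree, which indicates that `diff` is the second group's mean minus the first group's mean: `0.0 vs 1.0: diff=-129.7` says Junior respondents report about 129.7 less than Management respondents.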


## Group Comparisons: Management Levels

In [10]:
# Management level group comparison
if 'is_manager' in df.columns:
    management_cols = [c for c in df.columns if c.startswith('management_')]
    managers = df[df['is_manager'] == 1].copy()
    
    if not managers.empty and management_cols:
        def get_management_level(row):
            for col in management_cols:
                try:
                    if row[col] == 1:
                        return col.replace('management_', '').replace('_', ' ')
                except KeyError:
                    continue
            return 'Unknown'

        managers['management_level_label'] = managers.apply(get_management_level, axis=1)
        managers = managers[managers['management_level_label'] != 'Unknown']
        
        if not managers.empty:
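            valid_m = managers[['management_level_label', 'salary_numeric']].dropna()
            # Keep only levels with at least 5 respondents, so every test below (including Tukey HSD) uses the same groups
            valid_m = valid_m.groupby('management_level_label').filter(lambda g: len(g) >= 5)
            grouped_m = valid_m.groupby('management_level_label')['salary_numeric']
            mgroups = [g.values for _, g in grouped_m]
            mlabels = [k for k, _ in grouped_m]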
            
            if len(mgroups) >= 2:
                # ANOVA test
                try:
                    f_stat_m, p_anova_m = f_oneway(*mgroups)
                except Exception:
                    f_stat_m, p_anova_m = np.nan, np.nan
                
                # Kruskal-Wallis test
                try:
                    h_stat_m, p_kruskal_m = kruskal(*mgroups)
                except Exception:
                    h_stat_m, p_kruskal_m = np.nan, np.nan

                print("\nManagement level group comparison:")
                print(f"  Groups (n>=5): {mlabels}")
                print(f"  ANOVA p-value: {p_anova_m:.4f}" if not np.isnan(p_anova_m) else "  ANOVA p-value: NA")
                print(f"  Kruskal-Wallis p-value: {p_kruskal_m:.4f}" if not np.isnan(p_kruskal_m) else "  Kruskal-Wallis p-value: NA")

                # Tukey HSD post-hoc test
                try:
                    tukey_m = pairwise_tukeyhsd(endog=valid_m['salary_numeric'], groups=valid_m['management_level_label'], alpha=0.05)
                    print("  Tukey HSD (significant pairs):")
                    for res in tukey_m.summary().data[1:]:
                        grp1, grp2, meandiff, p_adj, lower, upper, reject = res
                        if reject:
                            print(f"    {grp1} vs {grp2}: diff={meandiff:.1f}, p_adj={p_adj:.4f}")
                except Exception:
                    print("  Tukey HSD: NA")


Management level group comparison:
  Groups (n>=5): ['C Level Manager', 'Director Level Manager', 'Engineering Manager', 'Partner']
  ANOVA p-value: 0.7425
  Kruskal-Wallis p-value: 0.2319
  Tukey HSD (significant pairs):


## Data Summary Tables

In [21]:
# Create summary tables for key variables

# Salary by career level
career_summary = df.groupby('seniority_level_ic')['salary_numeric'].agg(['count', 'mean', 'std', 'min', 'max']).round(1)
career_summary.columns = ['Count', 'Mean Salary', 'Std Dev', 'Min', 'Max']
# Label the rows by name; this assumes groupby returns the seniority codes 0-6 in ascending order
career_summary.index = ['Management', 'Junior', 'Mid', 'Senior', 'Staff Engineer', 'Team Lead', 'Architect']
print("Salary by Career Level:")
display(career_summary)

# Gender distribution
gender_summary = df.groupby('gender')['salary_numeric'].agg(['count', 'mean', 'std']).round(1)
gender_summary.columns = ['Count', 'Mean Salary', 'Std Dev']
gender_summary.index = ['Male', 'Female']
print("\nSalary by Gender:")
display(gender_summary)

# Work mode distribution
work_mode_data = []
for mode in ['Remote', 'Hybrid', 'Office']:
    col = f'work_mode_{mode}'
    if col in df.columns:
        vals = df.loc[df[col] == 1, 'salary_numeric']
        if len(vals) > 0:
            work_mode_data.append({
                'Work Mode': mode,
                'Count': len(vals),
                'Mean Salary': vals.mean(),
                'Std Dev': vals.std()
            })

if work_mode_data:
    work_mode_summary = pd.DataFrame(work_mode_data).round(1)
    print("\nSalary by Work Mode:")
    display(work_mode_summary)

Salary by Career Level:


|  | Count | Mean Salary | Std Dev | Min | Max |
| --- | --- | --- | --- | --- | --- |
| Management | 83 | 184.8 | 57.6 | 45.5 | 230.5 |
| Junior | 733 | 55.1 | 33.3 | 5.0 | 230.5 |
| Mid | 1138 | 84.1 | 35.7 | 5.0 | 230.5 |
| Senior | 772 | 130.8 | 47.2 | 5.0 | 230.5 |
| Staff Engineer | 16 | 193.0 | 48.8 | 95.5 | 230.5 |
| Team Lead | 175 | 150.5 | 52.6 | 15.5 | 230.5 |
| Architect | 52 | 188.4 | 48.4 | 35.5 | 230.5 |



Salary by Gender:


|  | Count | Mean Salary | Std Dev |
| --- | --- | --- | --- |
| Male | 2705 | 99.4 | 55.3 |
| Female | 264 | 86.1 | 49.6 |



Salary by Work Mode:


|  | Work Mode | Count | Mean Salary | Std Dev |
| --- | --- | --- | --- | --- |
| 0 | Remote | 1350 | 101.2 | 55.8 |
| 1 | Hybrid | 1046 | 105.0 | 54.1 |
| 2 | Office | 573 | 78.6 | 49.9 |
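
The counts, means, and standard deviations in these summary tables are enough to cross-check the effect sizes reported earlier in the notebook (0.242228 for the gender gap, 0.41771 for Remote vs Office). The sketch below recomputes Cohen's d from the rounded summary values; `cohens_d_from_summary` is a hypothetical helper, and small discrepancies are expected because the table values are rounded to one decimal.

```python
import math

def cohens_d_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d from per-group mean, sample standard deviation, and count."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Values copied from the "Salary by Gender" and "Salary by Work Mode" tables above
gender_d = cohens_d_from_summary(99.4, 55.3, 2705, 86.1, 49.6, 264)
work_mode_d = cohens_d_from_summary(101.2, 55.8, 1350, 78.6, 49.9, 573)

print(f'Gender gap:       d = {gender_d:.3f}')
print(f'Remote vs Office: d = {work_mode_d:.3f}')
```

Both recomputed values land within rounding error of the reported ones, which suggests the summary tables and the test results describe the same groups.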


## Save Results for Other Notebooks

In [12]:
# Save the processed dataframe and test results for use in other notebooks
import pickle

# Save dataframe
df.to_pickle('../data/processed_dataframe.pkl')

# Save test results
with open('../data/statistical_test_results.pkl', 'wb') as f:
    pickle.dump(test_results, f)

print("Data and results saved for use in other notebooks.")
print(f"Dataset shape: {df.shape}")
print(f"Number of statistical tests performed: {len(test_results)}")

Data and results saved for use in other notebooks.
Dataset shape: (2969, 94)
Number of statistical tests performed: 4
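
A downstream notebook could read the saved results back as sketched below. `load_test_results` is a hypothetical helper, and the round trip uses a temporary directory with a made-up stand-in for `test_results` rather than the real `../data` files. The real file may also contain NumPy scalar types, in which case NumPy must be importable when loading.

```python
import pickle
import tempfile
from pathlib import Path

def load_test_results(path):
    """Read back a dict written with pickle.dump.

    Only unpickle files you created yourself: pickle can execute code on load.
    """
    with open(path, 'rb') as f:
        results = pickle.load(f)
    if not isinstance(results, dict):
        raise TypeError(f'expected a dict of test results, got {type(results).__name__}')
    return results

# Round trip with a made-up stand-in for test_results
example_results = {'gender_gap': {'p_value': 5.050997e-05, 'significant': True}}
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = Path(tmp_dir) / 'statistical_test_results.pkl'
    with open(example_path, 'wb') as f:
        pickle.dump(example_results, f)
    reloaded = load_test_results(example_path)

print(reloaded == example_results)  # True
```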
