## 📄 Overview

This notebook applies Association Rule Learning (ARL), using the FP-Growth implementation in `mlxtend`, to the Stack Overflow dataset to identify and measure relationships among the following attributes:

- `Gender`  
- `DevType`  
- `Continent`  
- `Cargo`  
- `Salary`  
- `YearsCode`

The output is a file named `arl_output.csv`, which contains the association rules, mined separately for each gender group, that pass the notebook's filters (support ≥ 0.1, confidence ≥ 0.5, lift > 1, conviction > 1).
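
The four metrics stored for each rule can be computed by hand on a small example. The sketch below uses made-up transactions and hypothetical helper names (`support`, `rule_metrics`); it only illustrates the standard definitions and is not the `mlxtend` code path used later in the notebook.

```python
from math import inf

# Toy transactions (hypothetical data, not taken from the survey).
transactions = [
    {"Developer", "Europe", "Salary_Alto"},
    {"Developer", "Europe"},
    {"Developer", "Asia", "Salary_Baixo"},
    {"Designer", "Europe", "Salary_Alto"},
]

def support(itemset, rows):
    """Fraction of rows that contain every item in `itemset`."""
    return sum(itemset <= row for row in rows) / len(rows)

def rule_metrics(antecedent, consequent, rows):
    """Support, confidence, lift and conviction for antecedent -> consequent."""
    supp_c = support(consequent, rows)
    supp_ac = support(antecedent | consequent, rows)
    confidence = supp_ac / support(antecedent, rows)
    lift = confidence / supp_c
    # Conviction is infinite when the rule never fails (confidence == 1).
    conviction = inf if confidence == 1 else (1 - supp_c) / (1 - confidence)
    return {"support": supp_ac, "confidence": confidence,
            "lift": lift, "conviction": conviction}

metrics = rule_metrics({"Salary_Alto"}, {"Europe"}, transactions)
print(metrics)
```

In this toy data every `Salary_Alto` row is also a `Europe` row, so the confidence is 1 and the conviction is infinite, which is the same pattern as the `inf` values in the notebook's output table.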

Additionally, a Mann-Whitney U test (α = 0.05) is performed to assess whether salaries differ significantly between women and non-women IT professionals with the same level of experience.
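
The notebook runs this test with `scipy.stats.mannwhitneyu`. To show what the statistic measures, here is a simplified pure-Python sketch. The function name `mann_whitney_u` and the salary values are hypothetical, and the p-value uses a plain normal approximation without tie or continuity corrections, so it will not match SciPy's result exactly.

```python
from math import erfc, sqrt

def mann_whitney_u(x, y):
    """Return (U statistic for x, approximate two-sided p-value)."""
    n1, n2 = len(x), len(y)
    # U counts the pairs where the x value beats the y value; ties count half.
    u1 = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    # Two-sided tail probability of the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return u1, p_value

# Hypothetical salaries (in thousands) for two groups.
group_a = [52, 61, 58, 47, 66]
group_b = [70, 64, 81, 59, 77, 90]

u, p = mann_whitney_u(group_a, group_b)
print(u, p)
```

The test compares the rank ordering of the two samples rather than their means, which makes it less sensitive to the extreme salary values visible in the raw data.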


In [1]:
pip install mlxtend==0.23.1


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.2[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


# Load data

In [2]:
import pandas as pd
woman_not_woman_df = pd.read_csv("../data/woman_not_woman_df.csv")
woman_not_woman_df.shape

(275867, 10)

In [3]:
woman_not_woman_df.sort_values("ConvertedCompYearly", ascending=False).head()

Unnamed: 0,Year,Gender,Country,DevType,ConvertedCompYearly,YearsCode,gender_orig,Continent,country_alpha_code,Cargo
259084,2022,Not Woman,United Kingdom of Great Britain and Northern I...,"Engineer, data",50000000.0,2.0,Prefer not to say,Europe,GBR,
238727,2022,Woman,Portugal,"Developer, front-end",44790396.0,6.0,Woman,Europe,PRT,Desenvolvedor
270660,2022,Not Woman,United States of America,"Developer, front-end",35000000.0,1.0,Man,North America,USA,Desenvolvedor
274597,2022,Not Woman,United States of America,"Engineer, site reliability",32500000.0,10.0,Man,North America,USA,
229558,2021,Not Woman,Afghanistan,"Developer, desktop or enterprise applications",30468516.0,3.0,"Man;Non-binary, genderqueer, or gender non-con...",Asia,AFG,Desenvolvedor


# Process DevType column

In [4]:
devTypes_map = {'Developer': ['Developer, full-stack',
  'Full-Stack Web Developer',
  'Full-stack web developer',
  'Developer, back-end;Developer, front-end;Developer, full-stack',
  'Full-stack developer',
  'Developer, front-end;Developer, full-stack',
  'Developer, front-end;Developer, full-stack;Developer, back-end',
  'Developer, back-end;Developer, full-stack',
  'Developer, full-stack;Developer, back-end',
  'Back-end developer;Front-end developer;Full-stack developer',
  'Back-end developer;Full-stack developer',
  'Developer, back-end',
  'Back-end web developer',
  'Developer, back-end;Developer, front-end;Developer, full-stack',
  'Back-end developer',
  'Developer, front-end;Developer, full-stack;Developer, back-end',
  'Developer, back-end;Developer, full-stack',
  'Developer, full-stack;Developer, back-end',
  'Back-end developer;Front-end developer;Full-stack developer',
  'Back-end developer;Full-stack developer',
  'Developer, back-end;Developer, desktop or enterprise applications',
  'Enterprise level services developer',
  'Developer, front-end',
  'Front-end web developer',
  'Developer, back-end;Developer, front-end;Developer, full-stack',
  'Developer, front-end;Developer, full-stack',
  'Developer, front-end;Developer, full-stack;Developer, back-end',
  'Developer, back-end;Developer, full-stack',
  'Developer, full-stack;Developer, back-end',
  'Back-end developer;Front-end developer;Full-stack developer',
  'Developer, desktop or enterprise applications',
  'Desktop developer',
  'Developer, back-end;Developer, desktop or enterprise applications',
  'Developer, mobile',
  'Mobile Dev (Android, iOS, WP & Multi-Platform)',
  'Mobile developer',
  'Developer, embedded applications or devices',
  'Embedded application developer',
  'Data scientist or machine learning specialist',
  'Mathematics Developers (Data Scientists, Machine Learning Devs & Devs with Stats & Math Backgrounds)',
  'DevOps specialist',
  'Developer, game or graphics',
  'Developer Experience',
  'Database administrator',
  'Developer Advocate'],
  'QA Assurance':['Developer, QA or test', "Quality Assurance", "QA Assurance"],
 'Product Manager': ['Project manager', 'Product manager'],
 'Engineering manager': ['Engineering manager'],
 'Student': ['Student'],
 'Academic researcher': ['Academic researcher'],
 'Research & Development role': ['Research & Development role'],
 'Senior Executive (C-Suite, VP, etc)': ['Senior Executive (C-Suite, VP, etc)'],
 'Engineer, data': ['Engineer, data'],
 'Cloud infrastructure engineer': ['Cloud infrastructure engineer'],
 'Data or business analyst': ['Data or business analyst'],
 'System administrator': ['System administrator'],
 'Security professional': ['Security professional'],
 'Engineer, site reliability': ['Engineer, site reliability'],
 'Educator': ['Educator'],
 'Scientist': ['Scientist'],
 'Blockchain': ['Blockchain'],
 'Hardware Engineer': ['Hardware Engineer'],
 'Designer': ['Designer'],
 'Marketing or sales professional': ['Marketing or sales professional']}

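def process_dev_type(dev):
  """Collapse a raw DevType answer into one of four coarse roles, or None."""
  # Missing answers arrive as NaN (a float); skip them before calling .lower().
  if not isinstance(dev, str):
    return None
  if dev in devTypes_map["QA Assurance"]:
    return "QA"
  if dev in devTypes_map["Developer"]:
    return "Developer"
  if dev in devTypes_map["Product Manager"]:
    return "Product Manager"
  # Any remaining answer that mentions a designer role counts as Designer.
  if "designer" in dev.lower():
    return "Designer"
  return None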

woman_not_woman_df["DevType"] = woman_not_woman_df["DevType"].apply(process_dev_type)
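
The grouping above can also be expressed as a flat lookup from raw label to group, which avoids scanning the lists on every call. This is a minimal sketch with a shortened, hypothetical mapping; the names `raw_labels_by_group`, `group_by_raw_label` and `normalize_dev_type` are illustrative and do not appear in the notebook.

```python
# Hypothetical, shortened version of the raw-label lists used above.
raw_labels_by_group = {
    "QA": ["Developer, QA or test", "Quality Assurance"],
    "Developer": ["Developer, full-stack", "Developer, back-end"],
    "Product Manager": ["Project manager", "Product manager"],
}

# Invert once: raw label -> group. setdefault keeps the first group
# that claims a label, so duplicated labels are harmless.
group_by_raw_label = {}
for group, labels in raw_labels_by_group.items():
    for label in labels:
        group_by_raw_label.setdefault(label, group)

def normalize_dev_type(dev):
    """Return the coarse group for a raw DevType value, or None if unmapped."""
    if not isinstance(dev, str):  # NaN or other missing answers
        return None
    if dev in group_by_raw_label:
        return group_by_raw_label[dev]
    if "designer" in dev.lower():
        return "Designer"
    return None

print(normalize_dev_type("Developer, QA or test"))
```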

In [5]:
woman_not_woman_df.DevType.value_counts()

Developer          217064
Designer             5935
QA                   1664
Product Manager       729
Name: DevType, dtype: int64

In [6]:
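def display_associations(result, year, gender_str):
    """Flatten a rules DataFrame into rows of [year, gender, antecedents, consequents, metrics]."""
    data = []
    for _, r in result.iterrows():
        data.append([
            year, gender_str,
            # Join each itemset into one "/"-separated, alphabetically sorted string.
            "/".join(sorted(r['antecedents'])),
            "/".join(sorted(r['consequents'])),
            r['support'], r['confidence'], r['lift'], r['conviction'],
        ])
    return data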

# Run ARL
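
Before mining rules, the next cell turns `Salary` and `YearsCode` into four quartile-based categories with the suffixes `_Baixo` (low), `_Medio` (medium), `_Alto` (high) and `_Muito_Alto` (very high). The sketch below reproduces that idea on made-up values with the standard library. `quartile_label` is a hypothetical name, and `statistics.quantiles(..., method="inclusive")` is used here as a stand-in for the pandas `quantile` call in the notebook.

```python
from statistics import quantiles

def quartile_label(value, q1, q2, q3, prefix):
    """Map a number to one of four quartile buckets, using the notebook's suffixes."""
    if value <= q1:
        return prefix + "_Baixo"
    if value <= q2:
        return prefix + "_Medio"
    if value <= q3:
        return prefix + "_Alto"
    return prefix + "_Muito_Alto"

# Hypothetical salaries (in thousands), already sorted for readability.
salaries = [10, 20, 30, 40, 50, 60, 70, 80, 90]

# The three cut points that split the data into four groups.
q1, q2, q3 = quantiles(salaries, n=4, method="inclusive")
labels = [quartile_label(s, q1, q2, q3, "Salary") for s in salaries]

print(q1, q2, q3)
print(labels)
```

Values equal to a cut point fall into the lower bucket because each comparison uses `<=`, mirroring the `categorize` function in the cell below.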

In [7]:
from mlxtend.frequent_patterns import association_rules, fpgrowth
from mlxtend.preprocessing import TransactionEncoder

def categorize(value, q1, q2, q3, col_name):
    if pd.isna(value):
        return 'None'
    elif value <= q1:
        return col_name+'_Baixo'
    elif value <= q2:
        return col_name+'_Medio'
    elif value <= q3:
        return col_name+'_Alto'
    else:
        return col_name+'_Muito_Alto'

all_data = []
print("###########################################################")
all_years_df = woman_not_woman_df.dropna().copy()
all_years_df = all_years_df.rename(columns={"ConvertedCompYearly":"Salary"})
for c_to_catg in ["Salary", "YearsCode"]:
  print(c_to_catg)
  q1 = all_years_df[c_to_catg].quantile(0.25)
  q2 = all_years_df[c_to_catg].quantile(0.50)
  q3 = all_years_df[c_to_catg].quantile(0.75)
  print(q1, q2, q3)
  print()
  all_years_df[c_to_catg+'_categoric'] = all_years_df[c_to_catg].apply(lambda y: categorize(y, q1, q2, q3, c_to_catg))

all_years_df.to_csv("data.csv", header=True)
all_years_df = all_years_df.drop(["Year", "country_alpha_code", "YearsCode", "Salary", "Country", "gender_orig"], axis=1)
print("Amount of Data: ", all_years_df.shape[0])
print(all_years_df.head().to_string())
print()

not_woman_year_df = all_years_df[all_years_df["Gender"] == "Not Woman"].drop("Gender", axis=1)
woman_year_df = all_years_df[all_years_df["Gender"] == "Woman"].drop("Gender", axis=1)

dfs_list = [["Not Woman", not_woman_year_df], ["Woman", woman_year_df]]
for gender_str, gender_df in dfs_list:
  print("GENDER: ", gender_str)
  te = TransactionEncoder()
  data_list = gender_df.to_numpy()
  te_ary = te.fit(data_list).transform(data_list)
  transformed_df = pd.DataFrame(te_ary, columns=te.columns_)
  print(transformed_df.head().to_string())

  frequent_itemsets = fpgrowth(transformed_df, min_support=0.1, use_colnames=True)
  output = association_rules(frequent_itemsets, metric = "lift", min_threshold = 1) # keeps lift >= 1; the stricter lift > 1 filter below is the one used in the manual inspection
  associations = output.loc[(output['lift'] > 1) & (output['confidence'] >= 0.5) & (output['conviction'] > 1), :]
  data = display_associations(associations, None, gender_str)
  all_data += data
  print()

###########################################################
Salary
25200.0 55000.0 95000.0

YearsCode
3.5 6.0 11.0

Amount of Data:  225392
       Gender    DevType      Continent          Cargo   Salary_categoric YearsCode_categoric
0   Not Woman  Developer  North America  Desenvolvedor  Salary_Muito_Alto      YearsCode_Alto
8   Not Woman  Developer  North America  Desenvolvedor  Salary_Muito_Alto      YearsCode_Alto
9   Not Woman  Developer  North America  Desenvolvedor  Salary_Muito_Alto      YearsCode_Alto
15  Not Woman  Developer  North America  Desenvolvedor        Salary_Alto     YearsCode_Baixo
19  Not Woman  Developer  North America  Desenvolvedor        Salary_Alto      YearsCode_Alto

GENDER:  Not Woman
   Africa  Analista de Qualidade   Asia  Desenvolvedor  Designer  Developer  Europe  Gerente de Produto  North America  Oceania  Product Manager     QA  Salary_Alto  Salary_Baixo  Salary_Medio  Salary_Muito_Alto  South America  YearsCode_Alto  YearsCode_Baixo  YearsCode_Medio
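
The boolean table above is a one-hot encoding of each row's values: one column per distinct item, in sorted order, with `True` where the row contains that item. The following plain-Python sketch mimics that step and a single-item support filter on hypothetical rows. It is not `TransactionEncoder` or `fpgrowth` themselves, and the variable names are illustrative.

```python
# Hypothetical rows, each one the list of categorical values of a respondent.
rows = [
    ["Developer", "Europe", "Salary_Alto"],
    ["Developer", "Asia", "Salary_Baixo"],
    ["Designer", "Europe", "Salary_Alto"],
]

# One column per distinct item, sorted alphabetically.
columns = sorted({item for row in rows for item in row})

# One boolean per (row, column): does the row contain that item?
encoded = [[item in row for item in columns] for row in rows]

# Support of each single item, then keep those at or above the threshold.
min_support = 0.5
supports = {
    col: sum(row[i] for row in encoded) / len(encoded)
    for i, col in enumerate(columns)
}
frequent_items = {col: s for col, s in supports.items() if s >= min_support}

print(columns)
print(frequent_items)
```

Because every row is treated as a set of items, the encoding does not record which original column a value came from. Two columns that share a label would therefore collapse into a single item.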

In [10]:
import numpy as np
output_df = (pd.DataFrame(all_data, columns=["Year",
                                            "Gender", "Antecedents",
                                            "Consequents", 'support', 'confidence', 'lift', 'conviction']).drop(["Year"], axis=1).sort_values("conviction", ascending=False))
# output_df.conviction = output_df.conviction.replace([np.inf], 1e12)
# output_df.sort_values(["Antecedents", "Gender"], ascending=False)
# output_df.sort_values("conviction", ascending=False).shape


def translate_gender(g):
  if g == "Not Woman":
    return "Não Mulher"
  return "Mulher"

def translate_cons(c):
  return  {
        'Developer': 'Desenvolvedor',
        'North America': 'América do Norte',
        'Salary_Muito_Alto': 'Salário_Muito_Alto',
        'Developer/North America': 'Desenvolvedor/América do Norte',
        'Developer/Salary_Muito_Alto': 'Desenvolvedor/Salário_Muito_Alto',
        'Europe': 'Europa',
        'Developer/Europe': 'Desenvolvedor/Europa',
        'YearsCode_Baixo': 'AnosCod_Baixo',
        'Developer/YearsCode_Baixo': 'Desenvolvedor/AnosCod_Baixo',
        'Salary_Baixo': 'Salário_Baixo',
        'Developer/Salary_Baixo': 'Desenvolvedor/Salário_Baixo'
    }.get(c, c)

def translate_ant(a):
  return {
    'YearsCode_Alto': 'AnosCod_Alto',
    'Europe/YearsCode_Alto': 'Europa/AnosCod_Alto',
    'Salary_Muito_Alto': 'Salário_Muito_Alto',
    'North America': 'América do Norte',
    'Developer/Salary_Muito_Alto': 'Desenvolvedor/Salário_Muito_Alto',
    'North America/Salary_Muito_Alto': 'América do Norte/Salário_Muito_Alto',
    'Developer/North America': 'Desenvolvedor/América do Norte',
    'Salary_Alto': 'Salário_Alto',
    'Europe/Salary_Alto': 'Europa/Salário_Alto',
    'Developer/Salary_Alto': 'Desenvolvedor/Salário_Alto',
    'Salary_Medio': 'Salário_Médio',
    'Europe/Salary_Medio': 'Europa/Salário_Médio',
    'Developer/Salary_Medio': 'Desenvolvedor/Salário_Médio',
    'Salary_Baixo': 'Salário_Baixo',
    'Developer/Salary_Baixo': 'Desenvolvedor/Salário_Baixo',
    'Europe': 'Europa',
    'Asia': 'Ásia',
    'Asia/Developer': 'Ásia/Desenvolvedor',
    'YearsCode_Muito_Alto': 'AnosCod_Muito_Alto',
    'YearsCode_Baixo': 'AnosCod_Baixo',
    'Asia/Salary_Baixo': 'Ásia/Salário_Baixo',
    'YearsCode_Medio': 'AnosCod_Médio'
}.get(a, a)



output_df["Gender"] = output_df["Gender"].apply(translate_gender)
output_df["Consequents"] = output_df["Consequents"].apply(translate_cons)
output_df["Antecedents"] = output_df["Antecedents"].apply(translate_ant)


output_df.columns = ["Gênero","Antecedentes","Consequentes","Suporte","Confiança","Lift","Convicção"]
output_df.head()

Unnamed: 0,Gênero,Antecedentes,Consequentes,Suporte,Confiança,Lift,Convicção
0,Não Mulher,Desenvolvedor,Desenvolvedor,0.964431,1.0,1.03688,inf
117,Não Mulher,Desenvolvedor/YearsCode_Muito_Alto,Desenvolvedor,0.204049,1.0,1.03688,inf
135,Mulher,Desenvolvedor/North America/Salary_Alto,Desenvolvedor,0.130884,1.0,1.063351,inf
130,Mulher,Desenvolvedor/Salário_Alto,Desenvolvedor,0.234064,1.0,1.063351,inf
129,Mulher,Desenvolvedor/Salary_Alto,Desenvolvedor,0.234064,1.0,1.063351,inf
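
Some rows above still contain untranslated items (`YearsCode_Muito_Alto`, `North America`, `Salary_Alto`) because `translate_ant` and `translate_cons` look up the whole `/`-joined string and return it unchanged when that exact combination is not listed. One possible alternative, sketched here under the assumption that no item contains `/`, is to translate item by item. `ITEM_TRANSLATIONS` and `translate_itemset` are hypothetical names; the translations reuse the wording already present in the notebook's dictionaries.

```python
# Per-item translations (English label -> Brazilian Portuguese label).
ITEM_TRANSLATIONS = {
    "Developer": "Desenvolvedor",
    "North America": "América do Norte",
    "Europe": "Europa",
    "Asia": "Ásia",
    "Salary_Baixo": "Salário_Baixo",
    "Salary_Medio": "Salário_Médio",
    "Salary_Alto": "Salário_Alto",
    "Salary_Muito_Alto": "Salário_Muito_Alto",
    "YearsCode_Baixo": "AnosCod_Baixo",
    "YearsCode_Medio": "AnosCod_Médio",
    "YearsCode_Alto": "AnosCod_Alto",
    "YearsCode_Muito_Alto": "AnosCod_Muito_Alto",
}

def translate_itemset(label, sep="/"):
    """Translate each item of a joined itemset; unknown items pass through."""
    return sep.join(ITEM_TRANSLATIONS.get(item, item) for item in label.split(sep))

print(translate_itemset("Developer/North America/Salary_Alto"))
```

With this approach a new combination of known items needs no extra dictionary entry, although the translated items are no longer guaranteed to be in alphabetical order.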


In [None]:
output_df.to_csv("arl_output.csv")



In [11]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# sns.set_theme(style="ticks", palette="pastel")

def translate_gender_to_portuguese(df, gender_column):
    """
    Translates the values in the Gender column from English to Brazilian Portuguese.

    Parameters:
        df (pd.DataFrame): The dataframe containing the Gender column.
        gender_column (str): The name of the column representing gender.

    Returns:
        pd.DataFrame: The dataframe with the Gender column translated.
    """
    translation_map = {
        'Not Woman': 'Não Mulher',
        'Woman': 'Mulher'
    }
    df[gender_column] = df[gender_column].map(translation_map)
    return df

def plot_salary_distribution_by_gender(df, gender_column, salary_column, title_template="Distribuição Salarial para {gender}", save=False, file_name="hipo_number_1"):
    """
    Plots salary by gender with Matplotlib and Seaborn: one box plot comparing the genders,
    followed by a separate histogram with quartile markers for each gender.

    Parameters:
        df (pd.DataFrame): The dataframe containing the data.
        gender_column (str): The name of the column representing gender.
        salary_column (str): The name of the column representing salary.
        title_template (str): Template for the histogram titles. Use '{gender}' as the placeholder for the gender.
        save (bool): If True, saves the plots locally as PNG files.
        file_name (str): Prefix used for the saved file names.
    """
    plt.figure(figsize=(8, 6))
    genders = df[gender_column].unique()

    sns.boxplot(x=gender_column, y=salary_column,
                hue=gender_column, palette=["b", "r"],
                data=df, log_scale=True)
    sns.despine(offset=10, trim=True)

    plt.xlabel('Gênero', fontsize=12)
    plt.ylabel('Salário', fontsize=12)
    if save:
        plt.savefig(f"{file_name}_boxplot.png", bbox_inches='tight')
    plt.show()

    for gender in genders:
        plt.figure(figsize=(8, 6))

        # Subset data for the current gender
        subset = df[df[gender_column] == gender]

        # Determine color based on gender
        color = 'r' if gender == 'Mulher' else 'b'

        # Plot the histogram on a log-scaled x-axis
        sns.set_theme()
        sns.histplot(subset[salary_column], log_scale=True, color=color, fill=False)

        # Calculate statistics
        mean_salary = subset[salary_column].mean()
        quartiles = subset[salary_column].quantile([0.25, 0.5, 0.75])

        # Add vertical lines for the quartiles (the mean line is currently disabled)
        # plt.axvline(mean_salary, color='green', linestyle='--', linewidth=1, label='Média')
        plt.axvline(quartiles[0.25], color='purple', linestyle='--', linewidth=1, label='1º Quartil')
        plt.axvline(quartiles[0.5], color='orange', linestyle='--', linewidth=1, label='Mediana')
        plt.axvline(quartiles[0.75], color='brown', linestyle='--', linewidth=1, label='3º Quartil')

        # Customize the plot
        title = title_template.format(gender=gender)
        plt.title(title, fontsize=14)
        plt.xlabel('Salário', fontsize=12)
        plt.ylabel('Frequência', fontsize=12)
        plt.legend()

        # Save the plot if required
        if save:
            plt.savefig(f"{file_name}_{gender}.png", bbox_inches='tight')

        # Show the plot
        plt.show()



In [12]:
from scipy.stats import mannwhitneyu

'''
Testing whether there is a significant salary difference between 
women and non-women professionals with the same level of experience.
'''
alpha = 0.05
data_df = pd.read_csv("data.csv")
translation_map = {
    'Not Woman': 'Não Mulher',
    'Woman': 'Mulher'
}


for xp in data_df.YearsCode_categoric.drop_duplicates().tolist():
  print(f"Running for: {xp}")
  single_xp_df = data_df[(data_df["YearsCode_categoric"] == xp)].copy()

  single_xp_df["Gender"] = single_xp_df["Gender"].map(translation_map)

  woman_df = single_xp_df[single_xp_df["Gender"] == "Mulher"]
  not_woman_df = single_xp_df[single_xp_df["Gender"] == "Não Mulher"]
  _, p_value = mannwhitneyu(woman_df.Salary, not_woman_df.Salary)

  res = None
  if p_value < alpha:
      res = "Reject the null hypothesis"
      print("p_value: ", p_value)
      print("Média Salarial Mulher: ",(woman_df.Salary.mean()))
      print("Quantidade de dados: ", woman_df.shape[0])
      print("Média Salarial Não Mulher: ",not_woman_df.Salary.mean())
      print("Quantidade de dados: ", not_woman_df.shape[0])
  else:
      res = "Fail to reject the null hypothesis"
      print("p_value: ", p_value)
  # plot_salary_distribution_by_gender(single_xp_df, "Gender", "Salary",title_template = "Distribuição Salarial - {gender} \n Ásia/Desenvolvedor/Alto Nível de Experiência", save=True, file_name="hipo_number_2")
  print(res)
  print("#######################")




Running for: YearsCode_Alto
p_value:  0.0497448345501315
Média Salarial Mulher:  108808.29305314232
Quantidade de dados:  2763
Média Salarial Não Mulher:  111940.62994032423
Quantidade de dados:  61940
Reject the null hypothesis
#######################
Running for: YearsCode_Baixo
p_value:  9.931555399911255e-25
Média Salarial Mulher:  68101.93548816832
Quantidade de dados:  5783
Média Salarial Não Mulher:  69915.43139853582
Quantidade de dados:  63892
Reject the null hypothesis
#######################
Running for: YearsCode_Muito_Alto
p_value:  0.00444865216553817
Média Salarial Mulher:  151143.98389574443
Quantidade de dados:  1499
Média Salarial Não Mulher:  164198.234187867
Quantidade de dados:  44935
Reject the null hypothesis
#######################
Running for: YearsCode_Medio
p_value:  4.864707192473846e-25
Média Salarial Mulher:  125055.43908389342
Quantidade de dados:  2913
Média Salarial Não Mulher:  106955.5229045099
Quantidade de dados:  41667
Reject the null hypothesis
##