## ***Salary Analysis in the Data Jobs Sector (Using Spark)***

## Prerequisites

Install Spark and Java in VM

In [77]:
# install Java8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# download spark 3.5.0
!wget -q https://apache.osuosl.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

In [78]:
ls -l # check the .tgz is there

total 1173052
drwxr-xr-x  1 root root      4096 Jan 11 17:02 sample_data/
drwxr-xr-x  3 root root      4096 Jan 16 14:37 Spark/
drwxr-xr-x 13 1000 1000      4096 Sep  9 02:08 spark-3.5.0-bin-hadoop3/
-rw-r--r--  1 root root 400395283 Sep  9 02:10 spark-3.5.0-bin-hadoop3.tgz
-rw-r--r--  1 root root 400395283 Sep  9 02:10 spark-3.5.0-bin-hadoop3.tgz.1
-rw-r--r--  1 root root 400395283 Sep  9 02:10 spark-3.5.0-bin-hadoop3.tgz.2


In [79]:
# extract the archive
!tar xf spark-3.5.0-bin-hadoop3.tgz

In [80]:
!pip install -q findspark

In [None]:
!pip install py4j

# For maps
!pip install folium
!pip install plotly

Define the environment

In [82]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[*] pyspark-shell"

Start Spark Session

---

In [None]:
import findspark
findspark.init("spark-3.5.0-bin-hadoop3")  # relative path to SPARK_HOME

from pyspark.sql import SparkSession

# create the session
spark = SparkSession \
        .builder \
        .appName("Joins") \
        .master("local[*]") \
        .getOrCreate()

spark.version

In [None]:
spark

In [None]:
# Import SQL functions
# Note: the wildcard import shadows Python built-ins such as max, min, sum and abs
from pyspark.sql.functions import *
from pyspark.sql.functions import histogram_numeric

Download datasets

In [86]:
!git clone https://github.com/pabloivorra/Spark.git


fatal: destination path 'Spark' already exists and is not an empty directory.


Read the CSV

In [87]:
# Relative path to the CSV file inside the cloned repository
ruta_archivo = "Spark/salary_data_cleaned.csv"

# Read the CSV file (without inferSchema, every column is loaded as a string)
salariesDF = spark.read.option("header", "true").csv(ruta_archivo)

# Show the first 15 rows of the DataFrame
salariesDF.show(15)

+---------+--------------------+--------------------+---------------+------+-------------+------------------+----------------+---------------+------------+----------------+------------+
|work_year|           job_title|        job_category|salary_currency|salary|salary_in_usd|employee_residence|experience_level|employment_type|work_setting|company_location|company_size|
+---------+--------------------+--------------------+---------------+------+-------------+------------------+----------------+---------------+------------+----------------+------------+
|     2023|Data DevOps Engineer|    Data Engineering|            EUR| 88000|        95012|           Germany|       Mid-level|      Full-time|      Hybrid|         Germany|           L|
|     2023|      Data Architect|Data Architecture...|            USD|186000|       186000|     United States|          Senior|      Full-time|   In-person|   United States|      

In [88]:
# On average, which job title pays best?

average_salary_by_job_title = salariesDF.groupBy("job_title").agg(avg("salary_in_usd").alias("avg_salary"))
job_titles_by_avg_salary = average_salary_by_job_title.orderBy("avg_salary", ascending=False)
job_titles_by_avg_salary.show(1)

# Get the name of the highest-paid job title
highest_paid_job_title = job_titles_by_avg_salary.first()["job_title"]

print("Highest-paid job:")
print(highest_paid_job_title)



+--------------------+----------+
|           job_title|avg_salary|
+--------------------+----------+
|Analytics Enginee...|  399880.0|
+--------------------+----------+
only showing top 1 row

Highest-paid job:
Analytics Engineering Manager
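
An average for a single job title can rest on very few records, in which case it says little about the role in general. Whether that applies to the top result here is an assumption that was not checked against the dataset. A minimal sketch, using invented records and plain Python rather than Spark, of reporting the record count next to each mean:

```python
import statistics
from collections import defaultdict

# Hypothetical (job_title, salary_in_usd) records, invented for illustration only
records = [
    ("Role A", 400000),
    ("Role B", 150000),
    ("Role B", 170000),
    ("Role B", 160000),
]

# Collect the salaries observed for each job title
salaries_by_title = defaultdict(list)
for title, salary in records:
    salaries_by_title[title].append(salary)

# Report the mean together with the number of records behind it
summary = {
    title: {"avg_salary": statistics.mean(values), "n_records": len(values)}
    for title, values in salaries_by_title.items()
}

# Print titles from highest to lowest average salary
for title, stats in sorted(summary.items(), key=lambda kv: kv[1]["avg_salary"], reverse=True):
    print(title, stats)
```

In the notebook, the analogous step would be to add `count("*").alias("n_records")` next to the `avg(...)` call inside `.agg(...)`, so that averages backed by a single row are easy to spot.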


In [89]:
# Find the job title with the lowest average salary and list its records
lowest_paid_job_title = average_salary_by_job_title.orderBy("avg_salary", ascending=True).first()["job_title"]

details_of_lowest_paid_job = salariesDF.filter(col("job_title") == lowest_paid_job_title).select(
    "job_title", "employee_residence", "experience_level", "employment_type", "work_setting",
    "company_location", "salary_in_usd"
)

details_of_lowest_paid_job.show()

# Print the name of the lowest-paid job title
print("Lowest-paid job:")
print(lowest_paid_job_title)



+--------------------+------------------+----------------+---------------+------------+----------------+-------------+
|           job_title|employee_residence|experience_level|employment_type|work_setting|company_location|salary_in_usd|
+--------------------+------------------+----------------+---------------+------------+----------------+-------------+
|Compliance Data A...|     United States|     Entry-level|      Full-time|      Remote|   United States|        60000|
|Compliance Data A...|           Nigeria|     Entry-level|      Full-time|      Remote|         Nigeria|        30000|
+--------------------+------------------+----------------+---------------+------------+----------------+-------------+

Lowest-paid job:
Compliance Data Analyst


In [90]:
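# Minimum and maximum salary by region
# Note: salary_in_usd was read as a string (no inferSchema), so max/min below compare
# values lexicographically; cast the column to a numeric type for numeric extremes.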

salariesDF.groupBy(salariesDF["company_location"]).agg(max(salariesDF["salary_in_usd"])).show(15)

salariesDF.groupBy(salariesDF["company_location"]).agg(min(salariesDF["salary_in_usd"])).show(15)

+--------------------+------------------+
|    company_location|max(salary_in_usd)|
+--------------------+------------------+
|             Algeria|            100000|
|      American Samoa|             50000|
|             Andorra|             50745|
|           Argentina|             80000|
|             Armenia|             50000|
|           Australia|             83864|
|             Austria|             91237|
|             Bahamas|             45555|
|             Belgium|             88654|
|Bosnia and Herzeg...|            120000|
|              Brazil|             84000|
|              Canada|             99703|
|Central African R...|             55368|
|               China|            100000|
|            Colombia|             90000|
+--------------------+------------------+
only showing top 15 rows

+--------------------+------------------+
|    company_location|min(salary_in_usd)|
+--------------------+------------------+
|             Algeria|            100000|
|      A
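
The maximum reported for Canada above (99703) is lower than Canada's average salary in the country ranking further down (about 143,919). This is consistent with `salary_in_usd` being compared as text, since the CSV is read without schema inference. A minimal sketch of the effect in plain Python, with invented values; `builtins.max` is used explicitly because the notebook's wildcard import replaces `max` with the Spark function:

```python
import builtins  # the notebook's wildcard import shadows max/min, so use the built-ins explicitly

# Hypothetical salary values, stored as strings the way an untyped CSV read returns them
salaries_as_text = ["99703", "150000", "85000"]

# String comparison is lexicographic: "9..." sorts after "1...", whatever the magnitude
lexicographic_max = builtins.max(salaries_as_text)

# Converting to integers first restores the numeric ordering
numeric_max = builtins.max(int(s) for s in salaries_as_text)

print(lexicographic_max)  # 99703
print(numeric_max)        # 150000
```

In the notebook, the analogous change would be to aggregate over `col("salary_in_usd").cast("int")` instead of the raw column, or to read the CSV with `inferSchema` enabled.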

In [91]:
# Do large or small companies pay better?


average_salary_by_company_size = salariesDF.groupBy("company_size").agg(avg("salary_in_usd").alias("avg_salary"))
sorted_average_salary = average_salary_by_company_size.orderBy("avg_salary", ascending=False)

sorted_average_salary.show()

# Medium-sized companies appear to pay best on average

+------------+------------------+
|company_size|        avg_salary|
+------------+------------------+
|           M|152237.08925189395|
|           L|141097.16310160427|
|           S| 90642.59748427673|
+------------+------------------+



In [92]:
# Average salary by job category

salario_promedio_por_categoria = salariesDF.groupBy("job_category").agg(avg("salary_in_usd").alias("salario_promedio"))
salario_promedio_por_categoria.show(15)

+--------------------+------------------+
|        job_category|  salario_promedio|
+--------------------+------------------+
|Data Science and ...|163758.57597876576|
|Data Architecture...|156002.35907335908|
|       Data Analysis|108505.72134522992|
|    Data Engineering|146197.65619469027|
|Data Management a...| 103139.9344262295|
|Data Quality and ...|100879.47272727273|
|Machine Learning ...|178925.84733893556|
|  Cloud and Database|          155000.0|
|Leadership and Ma...| 145476.0198807157|
|BI and Visualization|135092.10223642172|
+--------------------+------------------+



In [93]:
# Average salary by experience level

salario_promedio_por_experiencia = salariesDF.groupBy("experience_level").agg(avg("salary_in_usd").alias("salario_promedio"))
salario_promedio_por_experiencia.show(15)

+----------------+------------------+
|experience_level|  salario_promedio|
+----------------+------------------+
|          Senior|162356.12609926963|
|     Entry-level| 88534.77620967742|
|       Executive|189462.91459074733|
|       Mid-level|117523.91813804173|
+----------------+------------------+



In [94]:
# Average salary by year

salario_promedio_por_anio = salariesDF.groupBy("work_year").agg(avg("salary_in_usd").alias("salario_promedio")).orderBy(asc("work_year"))
salario_promedio_por_anio.show(15)

+---------+------------------+
|work_year|  salario_promedio|
+---------+------------------+
|     2020|105878.85915492958|
|     2021|106483.64467005077|
|     2022| 135467.5018359853|
|     2023|155132.59170803704|
+---------+------------------+



In [98]:
# Which countries pay best?

average_salary_by_country = salariesDF.groupBy("company_location").agg(avg(col("salary_in_usd")).alias("avg_salary"))

top5_highest_paying_countries = average_salary_by_country.orderBy("avg_salary", ascending=False).limit(5)

print("Top 5 Highest Paying Countries:")
top5_highest_paying_countries.show()


Top 5 Highest Paying Countries:
+----------------+------------------+
|company_location|        avg_salary|
+----------------+------------------+
|           Qatar|          300000.0|
|     Puerto Rico|          167500.0|
|           Japan|          165500.0|
|   United States|158158.72823413674|
|          Canada|143918.83628318584|
+----------------+------------------+



Conclusion

Based on this dataset, someone who wants to maximize their chances of earning a top salary in data science should consider the following strategy:

Aim for the role of "Analytics Engineering Manager" (preferably at senior or executive level) at a medium or large company. In addition, working for a company located in Qatar, Japan, or the United States, where average salaries exceed the 2023 overall average of about $155,133, would further improve the odds.
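
The country recommendation can be checked against the figures reported in the output tables above. A small sketch in plain Python, with the averages copied by hand from the "Top 5 Highest Paying Countries" and salary-by-year outputs (the variable names are illustrative, not from the notebook):

```python
# Average salary by company location, copied from the top-5 countries output
avg_salary_by_country = {
    "Qatar": 300000.0,
    "Puerto Rico": 167500.0,
    "Japan": 165500.0,
    "United States": 158158.72823413674,
    "Canada": 143918.83628318584,
}

# Average salary for 2023, copied from the salary-by-year output
avg_salary_2023 = 155132.59170803704

# Countries whose average exceeds the 2023 overall average
above_2023_average = [
    country
    for country, salary in avg_salary_by_country.items()
    if salary > avg_salary_2023
]

print(above_2023_average)
# ['Qatar', 'Puerto Rico', 'Japan', 'United States']
```

Of the five top-paying locations, only Canada falls below the 2023 overall average, and Puerto Rico ranks above both Japan and the United States.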