# Homework 1 - Francesco Brunello (M63001655), Antonio Boccarossa (M63001643)

---



# Introduction to Clinical Trials
Clinical trials are scientific studies conducted on human subjects to evaluate
the safety and effectiveness of medical treatments, drugs, devices, or procedures.
Each trial follows a defined protocol and may span several phases (Phase I, II,
III, IV), sometimes involving thousands of participants across multiple countries.

# The Dataset
The dataset used in this exercise comes from Dimensions.ai, a platform that
aggregates data on global scientific research. Each row in the dataset represents
a clinical trial; information about the columns can be found in the provided
legend.csv file.
Some columns contain nested or structured data, such as lists of conditions,
organizations, or locations.

# Task
Perform at least five analytics using PySpark on the provided clinical trials
dataset. The results must be compiled and presented in a structured PDF
report. For each analysis, the report should include the following components:


*   Objective: The goal of the analysis.
*   Description: A brief description of the methodology used.
*   Code: The PySpark code used to perform the analysis.

Include analyses of varying complexity, from basic aggregations to more
complex operations.

### Dependencies
*Installing pyspark dependencies.*

In [None]:
# Run below commands in google colab
# install Java8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# download Spark 3.5.5
!wget -q http://apache.osuosl.org/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz
# unzip it
!tar xf spark-3.5.5-bin-hadoop3.tgz
# install findspark
!pip install -q findspark

### Spark Context and Session
*Creating Spark Context and Session in order to use pyspark.sql utilities*

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.5-bin-hadoop3"

In [None]:
# Verify the Spark version running on the virtual cluster
import pyspark as ps
from pyspark.context import SparkContext
sc = SparkContext.getOrCreate()

assert sc.version.startswith("3."), "Verify that the cluster's Spark version is 3.x"

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession(sc)
print(spark)

<pyspark.sql.session.SparkSession object at 0x7924f1982c90>


### Reading the Dataset

*The dataset contains clinical trial records and consists of 15999 rows and 38 columns.*

In [None]:
import pandas as pd
df = pd.read_excel("dimensions_clinicalTrials.xlsx")
clinicalDS = spark.createDataFrame(df)
# printSchema shows the names and types of columns
clinicalDS.printSchema()

root
 |-- Rank: long (nullable = true)
 |-- Trial ID: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- Brief title: string (nullable = true)
 |-- Acronym: string (nullable = true)
 |-- Abstract: string (nullable = true)
 |-- Start date: timestamp (nullable = true)
 |-- Start Year: double (nullable = true)
 |-- End Date: timestamp (nullable = true)
 |-- Completion Year: double (nullable = true)
 |-- Phase: string (nullable = true)
 |-- Study Type: string (nullable = true)
 |-- Study Design: string (nullable = true)
 |-- Conditions: string (nullable = true)
 |-- Recruitment Status: string (nullable = true)
 |-- Number of Participants: double (nullable = true)
 |-- Intervention: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Registry: string (nullable = true)
 |-- Investigators/Contacts: string (nullable = true)
 |-- Sponsors/Collaborators: string (nullable = true)
 |-- City of Sponsor/Collaborator: string (nullable

### **Objective N1: Find the number of studies started per year.**




In order to find the ***The number of studies started per year*** we grouped the rows by the column 'Start Year', counted the different studies identified by Trial ID for each year and sorted the result by the count, filtering NaN Start Year values.

In [None]:
from pyspark.sql.functions import count, desc

clinicalDS.groupBy('Start Year') \
.agg(count('`Trial ID`') \
.alias('Number of Studies')) \
.orderBy(desc('Number of Studies')) \
.where(clinicalDS['Start Year'] != "NaN").show(40)

+----------+-----------------+
|Start Year|Number of Studies|
+----------+-----------------+
|    2021.0|             1460|
|    2020.0|             1438|
|    2019.0|             1321|
|    2018.0|             1207|
|    2022.0|             1151|
|    2017.0|             1146|
|    2016.0|              850|
|    2015.0|              841|
|    2013.0|              784|
|    2014.0|              767|
|    2012.0|              687|
|    2023.0|              673|
|    2011.0|              616|
|    2010.0|              502|
|    2008.0|              468|
|    2009.0|              465|
|    2007.0|              344|
|    2006.0|              268|
|    2005.0|              206|
|    2004.0|              165|
|    2001.0|               96|
|    2024.0|               88|
|    2003.0|               88|
|    2002.0|               84|
|    2000.0|               64|
|    1998.0|               26|
|    1994.0|               26|
|    1999.0|               26|
|    1993.0|               25|
|    199

### **Objective N2: Find the Top 10 most frequent medical conditions in all clinical trials**

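To find ***the top 10 most frequent medical conditions in all clinical trials***, we grouped the rows by the 'Conditions' field, counted the rows in each group, and sorted by that count in descending order, excluding NaN values. The 'Conditions' value is used as a whole here, without splitting it into individual conditions.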






In [None]:
clinicalDS.groupBy('Conditions') \
.agg(count('Conditions').alias('Entries')) \
.orderBy(desc('Entries')).where(clinicalDS['Conditions'] != "NaN").show(10)

+--------------------+-------+
|          Conditions|Entries|
+--------------------+-------+
|       Breast Cancer|    281|
|    Multiple Myeloma|    179|
|      Ovarian Cancer|    126|
|            Melanoma|    118|
|  Ulcerative Colitis|    112|
|Acute Myeloid Leu...|    106|
|            Leukemia|    102|
|Carcinoma, Non-Sm...|    101|
| Follicular Lymphoma|     92|
|Metastatic Colore...|     90|
+--------------------+-------+
only showing top 10 rows



### **Objective N3: Find the Top 10 Funder Countries that financed the most relevant studies in terms of total Altmetric Attention Score.**

To find ***the top 10 Funder Countries that financed the most relevant studies in terms of total Altmetric Attention Score***, we split the 'Funder Country' field on the ";" character and exploded the resulting list so that each country gets its own row. We then trimmed the surrounding whitespace, filtered out NaN values in both columns, grouped by country, and ordered the rows by the summed Altmetric Attention Score per country in descending order.


In [None]:
from pyspark.sql.functions import explode, split, trim, sum, col

clinicalDS.select(explode(split('Funder Country', ";")).alias('Countries'), 'Altmetric Attention Score') \
.withColumn('Countries', trim(col("Countries"))) \
.filter(col('Altmetric Attention Score') != "NaN")\
.filter(col('Countries') != "NaN")\
.groupBy('Countries')\
.agg(sum('Altmetric Attention Score').alias('Total Altmetric Attention Score'))\
.orderBy(desc('Total Altmetric Attention Score')) \
.show(10)

+--------------+-------------------------------+
|     Countries|Total Altmetric Attention Score|
+--------------+-------------------------------+
| United States|                       263844.0|
|         Japan|                        64930.0|
|       Germany|                        37015.0|
|United Kingdom|                        19507.0|
|        France|                         5281.0|
|   Switzerland|                         4513.0|
|         Italy|                         2744.0|
|       Belgium|                         1386.0|
|        Canada|                          675.0|
|       Finland|                          347.0|
+--------------+-------------------------------+
only showing top 10 rows
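
The split, explode, trim, and sum steps can be illustrated with a minimal plain-Python sketch. The rows below are hypothetical and only mimic the shape of the 'Funder Country' and 'Altmetric Attention Score' columns:

```python
from collections import defaultdict

# Hypothetical rows: (semicolon-separated funder countries, Altmetric score).
rows = [
    ("United States; Japan", 10.0),
    ("Japan", 5.0),
    ("United States;Germany ", 2.5),
]

totals = defaultdict(float)
for countries, score in rows:
    # split(";") plays the role of split + explode: one entry per country.
    for country in countries.split(";"):
        # strip() mirrors trim(): it removes surrounding whitespace only.
        totals[country.strip()] += score

# Sort countries by total score, highest first.
ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

Note that every country listed for a trial receives that trial's full score, so a trial with several funder countries contributes to each of them.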



### **Objective N4: Find the Top 5 research fields with the most clinical trials.**

To find ***the top 5 research fields with the most clinical trials***, we split the 'Fields of Research (ANZSRC 2020)' field on the ";" character and exploded the resulting list so that each research field gets its own row, then trimmed the values and filtered out NaN. Finally, we grouped by Research Field, counted the rows in each group, and ordered by that count in descending order.

In [None]:
clinicalDS.select(explode(split('Fields of Research (ANZSRC 2020)', ";")).alias('Research Field')) \
.withColumn('Research Field', trim(col("Research Field"))) \
.filter(col('Research Field') != "NaN") \
.groupBy('Research Field') \
.agg(count(col('Research Field')).alias('Count of trials')) \
.orderBy(desc('Count of trials')) \
.show(5, truncate=False)

+--------------------------------------------+---------------+
|Research Field                              |Count of trials|
+--------------------------------------------+---------------+
|32 Biomedical and Clinical Sciences         |15279          |
|3202 Clinical Sciences                      |7808           |
|3211 Oncology and Carcinogenesis            |6081           |
|3201 Cardiovascular Medicine and Haematology|1867           |
|42 Health Sciences                          |1364           |
+--------------------------------------------+---------------+
only showing top 5 rows



### **Objective N5: Find the longest Phase 3 clinical trials in terms of expected or actual years.**

To find ***the longest Phase 3 clinical trials in terms of expected or actual years***, we created a new column called 'Years' holding the difference between 'Completion Year' and 'Start Year' (after filtering out NaN values in both). We then selected the relevant fields, kept only the rows whose 'Phase' is 'Phase 3', and ordered the result by 'Years' in descending order.

In [None]:
clinicalDS.filter(col('Completion Year') != 'NaN') \
.filter(col('Start Year') != 'NaN') \
.withColumn('Years', col('Completion Year') - col('Start Year')) \
.select('Trial ID', 'Abstract', 'Years', 'Completion Year', 'Start Year') \
.where(col('Phase') == 'Phase 3') \
.distinct() \
.orderBy(desc('Years')) \
.show(2)

+-----------+--------------------+-----+---------------+----------+
|   Trial ID|            Abstract|Years|Completion Year|Start Year|
+-----------+--------------------+-----+---------------+----------+
|NCT00070564|RATIONALE: Drugs ...| 24.0|         2027.0|    2003.0|
|NCT01704716|This is a randomi...| 24.0|         2026.0|    2002.0|
+-----------+--------------------+-----+---------------+----------+
only showing top 2 rows



### **Objective N6: Find the most frequent HRCS HC Categories of study per gender (All, Female, Male), studied between 2005 and 2021 ordered by the sum of the Altmetric Attention Score of the related clinical trials.**




To find ***the most frequent HRCS HC Categories of study per gender (All, Female, Male), studied between 2005 and 2021 ordered by the sum of the Altmetric Attention Score of the related clinical trials***, we created a Window object that partitions the rows by Gender and orders each partition by Total Attention Score in descending order. The row_number function then assigns each row a rank within its partition, based on its Total Attention Score.

In the query, we selected the HRCS HC Categories (splitting the field on the ";" character and exploding the resulting list so that each category gets its own row), filtered the data by year, and removed the NaN values. We then grouped the data by Gender and Categories, summed the Altmetric Attention Score values, and kept the top-ranked row for each gender.

In [None]:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

windowSpec = Window.partitionBy("Gender").orderBy(desc("Total Attention Score"))  # Partition by gender and order by Total Attention Score

results = clinicalDS.select(explode(split('HRCS HC Categories', ';')).alias('Categories'), 'Gender', 'Altmetric Attention Score') \
.withColumn('Categories', trim(col('Categories'))) \
.filter(col('Categories')!= 'NaN') \
.filter(col('Gender') != 'NaN') \
.filter(col('Altmetric Attention Score') != 'NaN') \
.filter(col('Start Year')> 2004) \
.filter(col('Completion Year')< 2022) \
.groupBy('Gender', 'Categories' ) \
.agg(sum('Altmetric Attention Score').alias('Total Attention Score')) \
.orderBy(desc('Total Attention Score')) \
.withColumn("rank", row_number().over(windowSpec)) \
.filter(col("rank") == 1) \
.drop("rank") \
.show()


+------+----------+---------------------+
|Gender|Categories|Total Attention Score|
+------+----------+---------------------+
|   All|    Cancer|              30848.0|
|Female|    Cancer|               2237.0|
|  Male|    Cancer|                174.0|
+------+----------+---------------------+



### **Objective N7: Find the Top 10 AHC involved in Mental Health Category clinical trials with more than 100 participants.**


To find ***the top 10 AHC involved in Mental Health category clinical trials with more than 100 participants***, we selected the HRCS HC Categories (splitting the field on the ";" character and exploding the resulting list so that each category gets its own row), filtered out NaN values, and kept only the rows in the 'Mental health' category. We also kept only trials with more than 100 participants, grouped by AHC, and ordered by the number of trials in descending order.

In [None]:
clinicalDS.select('AHC', explode(split('HRCS HC Categories', ';')).alias('HRCS HC Categories Splitted') ) \
.withColumn('HRCS HC Categories Splitted', trim(col('HRCS HC Categories Splitted'))) \
.filter(col('HRCS HC Categories Splitted') != 'NaN') \
.filter(col('HRCS HC Categories Splitted') == 'Mental health') \
.filter(col('Number of Participants') > 100) \
.groupBy('AHC')\
.agg(count('AHC').alias('Number of trials')) \
.orderBy(desc('Number of trials')) \
.show(10, truncate=False)

+---------------------+----------------+
|AHC                  |Number of trials|
+---------------------+----------------+
|IRCCS_CAGRANDA       |8               |
|AOU_SENESE           |7               |
|AOU_CITTADELLASCIENZA|6               |
|AOUSSN_GMARTINO      |6               |
|AOU_CAREGGI          |5               |
|IRCCS_BURLOGAROFOLO  |5               |
|AOU_VERONA           |4               |
|AOU_PISANA           |3               |
|AOU_GONZAGA          |3               |
|AOUSSN_FEDERICOII    |3               |
+---------------------+----------------+
only showing top 10 rows



### **Objective N8: Find the most used Intervention Models (in Study Design field) in cancer-related clinical trials**

To find ***the most used Intervention Models (in the Study Design field) in cancer-related clinical trials***, we selected the Study Design entries (splitting the field on the ";" character and exploding the resulting list so that each entry gets its own row), kept only the entries that start with 'Intervention', and excluded rows whose Cancer Types value is NaN. We then grouped by the new Design Phases field and ordered by the Number of Intervention count in descending order.

In [None]:
clinicalDS.select(explode(split('Study Design', ';')).alias('Design Phases'))\
.withColumn('Design Phases', trim(col('Design Phases')))\
.filter(col('Design Phases').startswith('Intervention'))\
.filter(col('Cancer Types') != 'NaN') \
.groupBy('Design Phases')\
.agg(count('Design Phases').alias('Number of Intervention'))\
.orderBy(desc('Number of Intervention'))\
.show(truncate=False)

+----------------------------------------------------------+----------------------+
|Design Phases                                             |Number of Intervention|
+----------------------------------------------------------+----------------------+
|Intervention Model: Parallel Assignment                   |3901                  |
|Intervention Model: Single Group Assignment               |1585                  |
|Intervention Model: Sequential Assignment                 |195                   |
|Intervention Model: Crossover Assignment                  |72                    |
|Intervention Model: Factorial Assignment                  |69                    |
|Intervention Model: Cohort Study                          |2                     |
|Intervention Model: Review Of Electronic Case Report Forms|1                     |
|Intervention Model: Parallel                              |1                     |
|Intervention Model: N/A                                   |1               

### **Objective N9: Find the average number of participants per study title**

To find the **average number of participants per study title**, we grouped the rows by **Title** and computed the average of **Number of Participants** within each group.


In [273]:
clinicalDS.groupBy("Title").avg("Number of Participants").show()

+--------------------+---------------------------+
|               Title|avg(Number of Participants)|
+--------------------+---------------------------+
|An International,...|                     1000.0|
|IMPAHCT: A Phase ...|                      462.0|
|An Open-Label, Do...|                       80.0|
|Therapeutic Modul...|                       17.0|
|A Phase 2, Random...|                      120.0|
|A Randomized, Ope...|                      257.0|
|Open-Label, Phase...|                       54.0|
|A Phase 3, Open-l...|                      418.0|
|A Multicenter, Op...|                     3500.0|
|EMBER-4: A Random...|                     6000.0|
|Biomarkers Study:...|                      104.0|
|Prevention of Car...|                      260.0|
|MotOr, cogniTIVe ...|                      165.0|
|A Double-Blind, R...|                      111.0|
|Ivabradine in mul...|                        NaN|
|Interventional, R...|                      315.0|
|Calcium Algorithm...|         

### **Objective N10: Find the countries with the highest average number of participants per study**

To find the **countries with the highest average number of participants per study**, we selected **Country of Sponsor/Collaborator, Trial ID and Number of Participants** (splitting the Country of Sponsor/Collaborator field on the ";" character and exploding the resulting list so that each country gets its own row). We filtered out empty country values and NaN values in **Number of Participants**, removed duplicate rows so that each trial counts once per country, grouped by country, and computed the average. Note: the *trim* function strips the leading and trailing whitespace left around each country name by the split.

In [None]:
# Missing participant counts are dropped before averaging, because a single NaN
# would turn the whole average for that country into NaN.
clinicalDS.select(explode(split("Country of Sponsor/Collaborator",";")).alias("Countries"),"Trial ID", "Number of Participants") \
.withColumn("Countries", trim(col("Countries"))) \
.filter((col("Countries").isNotNull()) & (col("Countries") != "")) \
.filter(col("Number of Participants") != "NaN") \
.distinct() \
.groupBy("Countries") \
.avg("Number of Participants") \
.orderBy(desc("avg(Number of Participants)")).show(100)
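
One detail worth keeping in mind with averages is that NaN propagates: the Objective N9 output already shows a group whose average is NaN. The plain-Python sketch below, with hypothetical participant counts, illustrates why missing values are usually removed before the mean is computed rather than after:

```python
import math
from statistics import fmean

# Hypothetical participant counts for one country; NaN marks a missing value.
participants = [100.0, 300.0, float("nan")]

# A single NaN propagates through the mean.
raw_mean = fmean(participants)

# Dropping NaN values first gives the mean of the known values.
known = [p for p in participants if not math.isnan(p)]
clean_mean = fmean(known)

print(raw_mean, clean_mean)
```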

### **Objective N11: Find the 10 most popular cities of sponsor/collaborator per study type**

To find the **10 most popular cities of sponsor/collaborator per study type**, we selected **City of Sponsor/Collaborator, Trial ID and Study Type** (splitting the City of Sponsor/Collaborator field on the ";" character and exploding the resulting list so that each city gets its own row). We then grouped by "Study Type" and **Collaborator_Cities**, counted the Trial ID entries in each group, and ordered by that count in descending order.

In [None]:
clinicalDS.select(explode(split("City of Sponsor/Collaborator", "; ")).alias("Collaborator_Cities"), "Study Type", "Trial ID") \
.withColumn("Collaborator_Cities", trim(col("Collaborator_Cities"))) \
.filter((col("Collaborator_Cities").isNotNull()) & (col("Collaborator_Cities") != "")) \
.groupBy("Study Type", "Collaborator_Cities") \
.agg(count(clinicalDS["Trial ID"]).alias("Numero_Prove")).orderBy(desc("Numero_Prove")).show(10)


+--------------+-------------------+------------+
|    Study Type|Collaborator_Cities|Numero_Prove|
+--------------+-------------------+------------+
|Interventional|             Madrid|       13154|
|Interventional|          Barcelona|       12534|
|Interventional|               Rome|       12407|
|Interventional|              Milan|       12404|
|Interventional|              Seoul|       10437|
|Interventional|             London|        9495|
|Interventional|              Paris|        6884|
|Interventional|           New York|        6397|
|Interventional|              Tokyo|        5620|
| Observational|               Rome|        5411|
+--------------+-------------------+------------+
only showing top 10 rows



### **Objective N12: Find the clinical trial with the highest Altmetric Attention Score that started after July 10, 2013**

To find the **clinical trial with the highest Altmetric Attention Score that started after July 10, 2013**, we selected **Trial ID and Altmetric Attention Score**, ordered by "Altmetric Attention Score" in descending order, and applied two conditions in the *where* clause: the start date must be later than 2013-07-10 and the score must not be NaN.

In [274]:
clinicalDS.select("Trial ID", "Altmetric Attention Score") \
.orderBy(desc(clinicalDS["Altmetric Attention Score"])) \
.where((clinicalDS["Start Date"] > "2013-07-10") & (clinicalDS["Altmetric Attention Score"]!="NaN"))\
.limit(1).show()

+-----------+-------------------------+
|   Trial ID|Altmetric Attention Score|
+-----------+-------------------------+
|NCT04575597|                   1703.0|
+-----------+-------------------------+



### **Objective N13: Find the 5 most relevant (based on number of participants) trials per Cancer Types**

To find the **5 most relevant (based on number of participants) trials per Cancer Types**, we selected **Cancer Types, Trial ID and Number of Participants** (splitting the Cancer Types field on the ";" character and exploding the resulting list so that each cancer type gets its own row). We then filtered out empty and NaN values in "Cancer Types" and "Number of Participants", removed duplicate rows, ordered by "Number of Participants" in descending order, and kept the first 5 rows.

In [275]:
clinicalDS.select(explode(split("Cancer Types", "; ")).alias("Tipi_Cancro"), "Trial ID", "Number of Participants") \
.withColumn("Tipi_Cancro", trim(col("Tipi_Cancro"))) \
.filter((col("Tipi_Cancro").isNotNull()) & (col("Tipi_Cancro") != "") & (col("Tipi_Cancro") != "NaN")) \
.filter(col("Number of Participants")!="NaN") \
.distinct().orderBy(desc(clinicalDS["Number of Participants"])).limit(5).show()

+--------------------+-----------+----------------------+
|         Tipi_Cancro|   Trial ID|Number of Participants|
+--------------------+-----------+----------------------+
|Not Site-Specific...|NCT05339841|               82800.0|
|Not Site-Specific...|NCT02038491|               63692.0|
|       Breast Cancer|NCT04590560|               60000.0|
|     Cervical Cancer|NCT01837693|               60000.0|
|Genital System, F...|NCT01837693|               60000.0|
+--------------------+-----------+----------------------+
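
The output above is a single global top 5 across all cancer types. If a separate ranking within each cancer type is wanted, the usual pattern is "top N per group"; in PySpark this is typically done with row_number over a window partitioned by the cancer type, as in Objective N6. The plain-Python sketch below shows the idea on hypothetical rows; the function name `top_n_per_group` is illustrative only:

```python
from collections import defaultdict
import heapq

# Hypothetical (cancer type, trial ID, participants) rows, already exploded.
rows = [
    ("Breast Cancer", "T1", 500.0),
    ("Breast Cancer", "T2", 900.0),
    ("Breast Cancer", "T3", 100.0),
    ("Melanoma", "T4", 50.0),
    ("Melanoma", "T2", 900.0),
]

def top_n_per_group(rows, n):
    """Return the n largest trials by participants within each cancer type."""
    groups = defaultdict(list)
    for cancer_type, trial_id, participants in rows:
        groups[cancer_type].append((trial_id, participants))
    return {
        cancer_type: heapq.nlargest(n, trials, key=lambda t: t[1])
        for cancer_type, trials in groups.items()
    }

top2 = top_n_per_group(rows, 2)
print(top2)
```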



### **Objective N14: Retrieve the start date, the completion year and studied conditions of the top 10 clinical trials that involve the "AOUSSN_CAGLIARI" AHC and with the highest Altmetric Attention Score**

To find the **start date, the completion year and studied conditions of the top 10 clinical trials that involve the "AOUSSN_CAGLIARI" AHC and with the highest Altmetric Attention Score**, we selected **Trial ID, Start Date, Completion Year, Conditions, AHC and Altmetric Attention Score**, kept only the rows whose "AHC" is "AOUSSN_CAGLIARI", filtered out NaN values in "Altmetric Attention Score", and ordered by that score in descending order.

In [None]:
clinicalDS.select("Trial ID","Start Date", "Completion Year","Conditions","AHC", "Altmetric Attention Score") \
.where(clinicalDS["AHC"]=="AOUSSN_CAGLIARI") \
.orderBy(desc(clinicalDS["Altmetric Attention Score"])) \
.filter(col("Altmetric Attention Score") != "NaN") \
.show(10)

+-----------+-------------------+---------------+--------------------+---------------+-------------------------+
|   Trial ID|         Start Date|Completion Year|          Conditions|            AHC|Altmetric Attention Score|
+-----------+-------------------+---------------+--------------------+---------------+-------------------------+
|NCT04303780|2020-06-04 00:00:00|         2026.0|KRAS p, G12c Muta...|AOUSSN_CAGLIARI|                    788.0|
|NCT03466411|2018-04-13 00:00:00|         2030.0|     Crohn's Disease|AOUSSN_CAGLIARI|                    693.0|
|NCT04763408|2021-04-09 00:00:00|         2028.0|Carcinoma, Hepato...|AOUSSN_CAGLIARI|                    299.0|
|NCT03764293|2019-06-10 00:00:00|         2023.0|Locally Advanced ...|AOUSSN_CAGLIARI|                    275.0|
|NCT04929210|2021-08-30 00:00:00|         2026.0|Arthritis, Psoriatic|AOUSSN_CAGLIARI|                    131.0|
|NCT03607422|2018-07-27 00:00:00|         2025.0|   Atopic Dermatitis|AOUSSN_CAGLIARI|          

### **Objective N15: Find the 10 clinical trials with the youngest group of study participants**

To find the **10 clinical trials with the youngest group of study participants**, we added two columns, **min_age** and **max_age**, extracted from the "Age" field with regular expressions. We then added a third column, **diff_age**, holding the difference between the maximum and the minimum age, and selected **Trial ID and diff_age**, ordering by "diff_age" in descending order.

In [276]:
from pyspark.sql.functions import regexp_extract

trialsDS = clinicalDS.withColumn(
    "min_age",
    regexp_extract(col("Age"), r"^(\d+)", 1).cast("int")
)
trialsDS = trialsDS.withColumn(
    "max_age",
    regexp_extract(col("Age"), r"(\d+)\s*Years\s*-\s*(\d+)", 2).cast("int")
)
trialsDS = trialsDS.withColumn("diff_age", trialsDS["max_age"]-trialsDS["min_age"])
trialsDS.select("Trial ID","diff_age") \
.orderBy(desc("diff_age")) \
.show()

+-----------+--------+
|   Trial ID|diff_age|
+-----------+--------+
|NCT02239120|     132|
|NCT02239120|     132|
|NCT02239120|     132|
|NCT02239120|     132|
|NCT01874353|     112|
|NCT04987203|     112|
|NCT02476968|     112|
|NCT01874353|     112|
|NCT01844986|     112|
|NCT01874353|     112|
|NCT01844986|     112|
|NCT01874353|     112|
|NCT01844986|     112|
|NCT04987203|     112|
|NCT06129864|     112|
|NCT00053053|     103|
|NCT01344018|     102|
|NCT00769327|     102|
|NCT00553410|     102|
|NCT03948178|     102|
+-----------+--------+
only showing top 20 rows
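
To see how the two regular expressions behave, here is a small plain-Python sketch using the same patterns with the standard `re` module. The sample 'Age' strings are assumptions about the column's format, not values copied from the dataset, and `parse_age` is a hypothetical helper:

```python
import re

# Hypothetical 'Age' strings; the real column's format is assumed, not verified.
samples = ["18 Years - 65 Years", "6 Months - 17 Years", "18 Years and older"]

MIN_AGE = re.compile(r"^(\d+)")
MAX_AGE = re.compile(r"(\d+)\s*Years\s*-\s*(\d+)")

def parse_age(text):
    """Return (min_age, max_age); a part that does not match becomes None."""
    min_match = MIN_AGE.search(text)
    max_match = MAX_AGE.search(text)
    min_age = int(min_match.group(1)) if min_match else None
    max_age = int(max_match.group(2)) if max_match else None
    return min_age, max_age

parsed = [parse_age(s) for s in samples]
print(parsed)
```

Two things stand out in this sketch: the first pattern takes the leading number regardless of its unit, so a value expressed in months would be read as if it were years, and the second pattern only matches when the lower bound is expressed in years. In Spark, a non-matching `regexp_extract` yields an empty string, which becomes null after the cast to int.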



### **Objective N16: Find the most popular (by number of trials) Field of Research per country with at least 500 participants**

To find the **most popular (by number of trials) Field of Research per country with at least 500 participants**, we added one column that explodes the **Fields of Research** field (split on ";") and a second column that does the same for **Country of Sponsor/Collaborator**. We then kept only the rows whose "Number of Participants" is at least 500, filtering out NaN values on the same field. Finally, we grouped by the two new columns, counted the Trial ID entries in each group, and ordered by that count in descending order.
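
Chaining two explodes produces one row for every (country, field) combination of a trial, so a plain count tallies combinations rather than distinct trials. This is consistent with the counts reported for this objective (for example 142562 for a single pair), which exceed the 15999 rows of the dataset. The plain-Python sketch below, on hypothetical trials, contrasts the two ways of counting; in PySpark, `countDistinct("Trial ID")` is one way to count each trial once per pair:

```python
from itertools import product

# Hypothetical trials; T1 lists the same sponsor country twice.
trials = [
    {"id": "T1", "fields": "32 Biomedical; 3202 Clinical", "countries": "Italy; Italy"},
    {"id": "T2", "fields": "32 Biomedical", "countries": "Italy"},
]

row_counts = {}
distinct_trials = {}
for trial in trials:
    fields = [f.strip() for f in trial["fields"].split(";")]
    countries = [c.strip() for c in trial["countries"].split(";")]
    # Two chained explodes yield every (country, field) combination.
    for pair in product(countries, fields):
        row_counts[pair] = row_counts.get(pair, 0) + 1
        distinct_trials.setdefault(pair, set()).add(trial["id"])

# Counting distinct trial IDs removes the duplication.
trial_counts = {pair: len(ids) for pair, ids in distinct_trials.items()}
print(row_counts)
print(trial_counts)
```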

In [277]:
trialsDS.withColumn("Campi_Ricerca", explode(split("Fields of Research (ANZSRC 2020)","; "))) \
.withColumn("Paesi", explode(split("Country of Sponsor/Collaborator","; "))) \
.where(trialsDS["Number of Participants"]>=500) \
.filter(col("Number of Participants")!="NaN") \
.groupBy("Paesi", "Campi_Ricerca") \
.agg(count("Trial ID").alias("Numero_Prove")) \
.orderBy(desc("Numero_Prove")) \
.show(truncate=False)

+--------------+-----------------------------------+------------+
|Paesi         |Campi_Ricerca                      |Numero_Prove|
+--------------+-----------------------------------+------------+
|United States |32 Biomedical and Clinical Sciences|142562      |
|Italy         |32 Biomedical and Clinical Sciences|97823       |
|United States |3202 Clinical Sciences             |67772       |
|United States |3211 Oncology and Carcinogenesis   |66089       |
|Germany       |32 Biomedical and Clinical Sciences|44869       |
|Italy         |3202 Clinical Sciences             |44624       |
|Italy         |3211 Oncology and Carcinogenesis   |37674       |
|Spain         |32 Biomedical and Clinical Sciences|35135       |
|France        |32 Biomedical and Clinical Sciences|30059       |
|Japan         |32 Biomedical and Clinical Sciences|25244       |
|United Kingdom|32 Biomedical and Clinical Sciences|22872       |
|China         |32 Biomedical and Clinical Sciences|19970       |
|United St