# Clustering analysis on the COVID-19 Open Research Dataset (CORD-19)

## Overview
In response to the COVID-19 pandemic, the White House and a coalition of leading research groups
have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over
57,000 scholarly articles, including over 45,000 with full text, about COVID-19, SARS-CoV-2, and
related coronaviruses. This freely available dataset is provided to the global research community. As
a big data community, how can we help researchers find related research papers
easily?

You can find the dataset and the main challenge on Kaggle:

https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge

## Goals
Given the large volume of literature and the rapid spread of COVID-19, it is difficult for health
professionals to keep up with new information on the virus. Can clustering similar research articles
together simplify the search for related publications? How can the content of the clusters be
qualified? And within each cluster, how can we use the clustering to recommend the most similar
papers?

## Requirements
You are required to find the best way to cluster the research papers using the paper
details in the JSON files together with the metadata in the CSV file. Then you should build a
neighborhood recommender system that receives the title of a research paper and recommends
the N most similar papers to it, based on its cluster. So you should find a way to represent
the papers as vectors, cluster them, and then build a neighborhood recommender system
on the clusters.

### Required Steps

#### 1. Read the dataset using Spark:
The dataset is about 8 GB, so we don't expect you to process the entire dataset on your local machine.
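
One way to stay within a local machine's limits is to read only a subset of the JSON files. The sketch below is illustrative: the helper name `sample_json_paths` and the limit value are assumptions, not part of the assignment. It only builds a list of paths, which could then be passed to `spark.read.json`, since that reader accepts a list of paths.

```python
from pathlib import Path


def sample_json_paths(directory, limit):
    """Return at most `limit` JSON file paths from `directory`, sorted by name.

    Hypothetical helper: sorting makes the chosen subset reproducible.
    """
    paths = sorted(str(path) for path in Path(directory).glob('*.json'))
    return paths[:limit]


# Assumed usage inside the notebook (not executed here):
# subset = sample_json_paths('document_parses/pdf_json', 1000)
# papers = spark.read.option("multiline", "true").json(subset)
```

Taking the first files by name is the simplest choice; a random sample would be more representative but needs a fixed seed to stay reproducible.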

#### 2. Do exploratory data analysis:
Do the EDA to understand your data and extract insights that help you in feature
engineering. Document your insights.

#### 3. Preparation and Cleaning the data:
- Join the JSON files with the metadata in the CSV file.
- Handle nulls.
- Handle duplicates.
- Keep only the English documents.

#### 4. Preprocessing:
Our main goal is to clean and preprocess the text so that it can be represented as
vectors. Preprocessing the text is a mandatory step in NLP projects. You can have
a look at this article to explore some well-known preprocessing steps:

https://towardsdatascience.com/nlp-text-preprocessing-a-practical-guide-and-template-d80874676e79

##### Required preprocessing
1. Remove stop words.

2. Remove custom stop words. Research papers frequently use words that don't contribute to the meaning but are not considered everyday stop words; they should be removed to improve accuracy.
        custom_stop_words = [ 'doi', 'preprint', 'copyright', 'peer', 'reviewed', 'org', 'https', 'et', 'al', 'author', 'figure','rights', 'reserved', 'permission', 'used', 'using', 'biorxiv', 'medrxiv', 'license', 'fig', 'fig.', 'al.', 'Elsevier', 'PMC', 'CZI', 'www']


3. Remove punctuation. Use this set of characters:

        '!()-[]{};:'"\,<>./?@#$%^&*_~'


4. Convert text to lower case.
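
The four required steps can be sketched in plain Python, independent of Spark. This is a minimal illustration, not the notebook's implementation: the function name `preprocess` is hypothetical, and the two stop-word sets are small made-up samples (the notebook itself uses NLTK's English list and the full custom list above).

```python
# The punctuation characters from step 3, written as a Python string literal.
PUNCTUATION = '!()-[]{};:\'"\\,<>./?@#$%^&*_~'

# Small illustrative stop-word sets (assumptions, not the full lists).
BASIC_STOPWORDS = {'the', 'of', 'and', 'a', 'in', 'is', 'to'}
CUSTOM_STOPWORDS = {'doi', 'preprint', 'et', 'al', 'fig', 'figure', 'https', 'www'}

# Map each punctuation character to a space so "sars-cov-2" splits into tokens.
_PUNCT_TABLE = str.maketrans({ch: ' ' for ch in PUNCTUATION})


def preprocess(text):
    """Lower-case, strip punctuation, tokenize on whitespace, drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    tokens = no_punct.split()
    return [t for t in tokens
            if t not in BASIC_STOPWORDS and t not in CUSTOM_STOPWORDS]


print(preprocess('The spread of SARS-CoV-2 (Fig. 1), et al.'))
# ['spread', 'sars', 'cov', '2', '1']
```

Lower-casing first means the stop-word sets only need lower-case entries.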


#### 5. Vectorization:
Convert the data into a format that our algorithms can handle. For this purpose we can use *Word2vec*.

#### 6. Clustering (Not implemented yet)
Apply a clustering algorithm to the data and choose the best k using the *elbow* method. You can use *PCA* to reduce the dimensions while still keeping *95\%* of the variance, for better performance and, hopefully, to remove some noise/outliers.
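
Since this step is not implemented yet, the following is only a sketch of the two selection rules, under stated assumptions. It takes already-computed numbers as input (per-component variances from a PCA and clustering costs for several values of k) and does not run PCA or k-means itself. Both function names are hypothetical, and the "largest second difference" rule is just one heuristic substitute for reading the elbow off a plot.

```python
from itertools import accumulate


def components_for_variance(explained_variance, threshold=0.95):
    """Smallest number of leading components whose variance share reaches `threshold`."""
    total = sum(explained_variance)
    for count, running in enumerate(accumulate(explained_variance), start=1):
        if running / total >= threshold:
            return count
    return len(explained_variance)


def elbow_k(ks, costs):
    """Pick k at the sharpest bend of the cost curve (largest second difference).

    Needs at least three (k, cost) pairs; the end points cannot be chosen.
    """
    best_index = max(
        range(1, len(costs) - 1),
        key=lambda i: costs[i - 1] - 2 * costs[i] + costs[i + 1],
    )
    return ks[best_index]


# Made-up numbers, for illustration only.
print(components_for_variance([6.0, 3.0, 0.7, 0.2, 0.1]))    # 3
print(elbow_k([2, 3, 4, 5, 6], [1000, 600, 300, 250, 220]))  # 4
```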

#### 7. Recommender system (Not implemented yet)
Build a very basic recommender system:
- Create a function with the signature *recommendPaper(paper_title,N)*, where N is the number of recommended papers in the list, that returns the recommendation list.
- Recommend the top N papers that are most similar (by *cosine similarity*) to the given paper within the cluster it belongs to.
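
As this step is also not implemented yet, here is a minimal plain-Python sketch of the idea. The `PAPERS` dictionary with its titles, cluster ids, and two-dimensional vectors is made-up toy data; in the real pipeline the vectors would come from the vectorization step and the cluster ids from the clustering step.

```python
import math

# Hypothetical toy data: title -> (cluster id, document vector).
PAPERS = {
    'paper a': (0, [1.0, 0.0]),
    'paper b': (0, [0.9, 0.1]),
    'paper c': (0, [0.0, 1.0]),
    'paper d': (1, [1.0, 0.0]),
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommendPaper(paper_title, N):
    """Return the titles of the N most similar papers from the same cluster."""
    cluster, vector = PAPERS[paper_title]
    candidates = [
        (cosine_similarity(vector, other_vector), title)
        for title, (other_cluster, other_vector) in PAPERS.items()
        if other_cluster == cluster and title != paper_title
    ]
    candidates.sort(reverse=True)  # highest similarity first
    return [title for _, title in candidates[:N]]


print(recommendPaper('paper a', 2))  # ['paper b', 'paper c']
```

Note that 'paper d' has the same vector as 'paper a' but is never recommended, because the search is restricted to the query paper's own cluster.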

## Install required packages:

In [1]:
# !sudo pip3 install langdetect
# !sudo pip3 install nltk

## Import required packages:

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.ml.feature import Word2Vec
import json
from langdetect import detect
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

### Download the NLTK resources required for tokenization and stop words

In [3]:
# nltk.download('punkt')
# nltk.download('stopwords')

## Create Spark session

In [4]:
spark = SparkSession \
    .builder \
    .appName('CORD-19 Clustering') \
    .getOrCreate()

## 1. Read the dataset using Spark
### Read metadata

In [5]:
metadata = spark.read\
                .format('csv')\
                .option('header', 'true')\
                .option('inferSchema', 'true')\
                .load('./metadata.csv')

In [6]:
metadata = metadata.withColumn('publish_time', metadata.publish_time.cast(DateType()))

In [7]:
metadata.printSchema()

root
 |-- cord_uid: string (nullable = true)
 |-- sha: string (nullable = true)
 |-- source_x: string (nullable = true)
 |-- title: string (nullable = true)
 |-- doi: string (nullable = true)
 |-- pmcid: string (nullable = true)
 |-- pubmed_id: string (nullable = true)
 |-- license: string (nullable = true)
 |-- abstract: string (nullable = true)
 |-- publish_time: date (nullable = true)
 |-- authors: string (nullable = true)
 |-- journal: string (nullable = true)
 |-- mag_id: string (nullable = true)
 |-- who_covidence_id: string (nullable = true)
 |-- arxiv_id: string (nullable = true)
 |-- pdf_json_files: string (nullable = true)
 |-- pmc_json_files: string (nullable = true)
 |-- url: string (nullable = true)
 |-- s2_id: string (nullable = true)



In [8]:
metadata_pd = metadata.toPandas()
metadata_pd.head()

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url,s2_id
0,ug7v899j,d1aafb70c066a2068b02786f8929fd9c900897fb,PMC,Clinical features of culture-proven Mycoplasma...,10.1186/1471-2334-1-6,PMC35282,11472636,no-cc,OBJECTIVE: This retrospective chart review des...,2001-07-04,"Madani, Tariq A; Al-Ghamdi, Aisha A",BMC Infect Dis,,,,document_parses/pdf_json/d1aafb70c066a2068b027...,document_parses/pmc_json/PMC35282.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,
1,02tnwd4m,6b0567729c2143a66d737eb0a2f63f2dce2e5a7d,PMC,Nitric oxide: a pro-inflammatory mediator in l...,10.1186/rr14,PMC59543,11667967,no-cc,Inflammatory diseases of the respiratory tract...,2000-08-15,"Vliet, Albert van der; Eiserich, Jason P; Cros...",Respir Res,,,,document_parses/pdf_json/6b0567729c2143a66d737...,document_parses/pmc_json/PMC59543.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,
2,ejv2xln0,06ced00a5fc04215949aa72528f2eeaae1d58927,PMC,Surfactant protein-D and pulmonary host defense,10.1186/rr19,PMC59549,11667972,no-cc,Surfactant protein-D (SP-D) participates in th...,2000-08-25,"Crouch, Erika C",Respir Res,,,,document_parses/pdf_json/06ced00a5fc04215949aa...,document_parses/pmc_json/PMC59549.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,
3,2b73a28n,348055649b6b8cf2b9a376498df9bf41f7123605,PMC,Role of endothelin-1 in lung disease,10.1186/rr44,PMC59574,11686871,no-cc,Endothelin-1 (ET-1) is a 21 amino acid peptide...,2001-02-22,"Fagan, Karen A; McMurtry, Ivan F; Rodman, David M",Respir Res,,,,document_parses/pdf_json/348055649b6b8cf2b9a37...,document_parses/pmc_json/PMC59574.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,
4,9785vg6d,5f48792a5fa08bed9f56016f4981ae2ca6031b32,PMC,Gene expression in epithelial cells in respons...,10.1186/rr61,PMC59580,11686888,no-cc,Respiratory syncytial virus (RSV) and pneumoni...,2001-05-11,"Domachowske, Joseph B; Bonville, Cynthia A; Ro...",Respir Res,,,,document_parses/pdf_json/5f48792a5fa08bed9f560...,document_parses/pmc_json/PMC59580.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,


### Read the JSON files and save them as Parquet files

In [9]:
pdfJsonFiles_schema = StructType([
    StructField("paper_id", StringType()),
    StructField("metadata", StructType([
        StructField("title", StringType()),
        StructField("authors", ArrayType(StructType([
            StructField("first", StringType()),
            StructField("middle", ArrayType(StringType())),
            StructField("last", StringType()),
            StructField("suffix", StringType()),
            StructField("affiliation", StringType()),
            StructField("email", StringType()),
        ]))),
    ])),
    StructField("abstract", ArrayType(StructType([
        StructField("text", StringType()),
        StructField("cite_spans", ArrayType(StructType([
            StructField("start", IntegerType()),
            StructField("end", IntegerType()),
            StructField("text", StringType()),
            StructField("ref_id", StringType()),
        ]))),
        StructField("ref_spans", ArrayType(StructType([
            StructField("start", IntegerType()),
            StructField("end", IntegerType()),
            StructField("text", StringType()),
            StructField("ref_id", StringType()),
        ]))),
        StructField("section", StringType()),
    ]))),
    StructField("body_text", ArrayType(StructType([
        StructField("text", StringType()),
        StructField("cite_spans", ArrayType(StructType([
            StructField("start", IntegerType()),
            StructField("end", IntegerType()),
            StructField("text", StringType()),
            StructField("ref_id", StringType()),
        ]))),
        StructField("ref_spans", ArrayType(StructType([
            StructField("start", IntegerType()),
            StructField("end", IntegerType()),
            StructField("text", StringType()),
            StructField("ref_id", StringType()),
        ]))),
        StructField("section", StringType()),
    ]))),
    StructField("bib_entries", StringType()),
    StructField("ref_entries", StringType()),
    StructField("back_matter", StringType()),
])


In [10]:
pdfJsonFiles = spark.read\
                    .schema(pdfJsonFiles_schema)\
                    .option("multiline","true")\
                    .json('document_parses/pdf_json/*.json')

In [11]:
pdfJsonFiles.printSchema()

root
 |-- paper_id: string (nullable = true)
 |-- metadata: struct (nullable = true)
 |    |-- title: string (nullable = true)
 |    |-- authors: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- first: string (nullable = true)
 |    |    |    |-- middle: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- last: string (nullable = true)
 |    |    |    |-- suffix: string (nullable = true)
 |    |    |    |-- affiliation: string (nullable = true)
 |    |    |    |-- email: string (nullable = true)
 |-- abstract: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- text: string (nullable = true)
 |    |    |-- cite_spans: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- start: integer (nullable = true)
 |    |    |    |    |-- end: integer (nullable = true)
 |    |    |    |    |-- text: string (n

In [12]:
pdfJsonFiles.toPandas().head()

Unnamed: 0,paper_id,metadata,abstract,body_text,bib_entries,ref_entries,back_matter
0,93e41ab96746bff2cc0d899137cae2054645e6c6,(POSTER SESSION ABSTRACTS 220 CFTR 1ଙ MODELS O...,[(High-throughput screening (HTS) has emerged ...,"[(The high resolution, experimental 3D structu...","{""BIBREF0"":{""ref_id"":""b0"",""title"":""Data collec...","{""FIGREF0"":{""text"":""Supported by Vaincre La Mu...","[{""text"":""Background: Studies report reduced b..."
1,0a17eac54e995f96411bd23dec19354ae1db3036,"(Structure, Properties, and Biological Relevan...","[(G quadruplexes (G4s), which are known to hav...",[(The non canonical DNA structures formed by c...,"{""BIBREF0"":{""ref_id"":""b0"",""title"":""Telomeric r...","{""FIGREF0"":{""text"":""Schematic representation o...","[{""text"":""This work was supported by the Russi..."
2,ffec12aa4a9fd44c1cef69a933194c5f19e75bf0,(Humanized Mice for Live-Attenuated Vaccine Re...,[(Live-attenuated vaccines (LAV) represent one...,[(Live-attenuated vaccines (LAVs) have saved m...,"{""BIBREF0"":{""ref_id"":""b0"",""title"":""World Healt...","{""FIGREF0"":{""text"":""Development of human immun...","[{""text"":""We are thankful to our colleagues in..."
3,0ae7fff969f6c8af643343337e618034a94f4a77,(In the Realm of Opportunity: The Kaiser Wilhe...,[],"[(speakers for public lectures, as representat...","{""BIBREF0"":{""ref_id"":""b0"",""title"":""Bericht übe...","{""FIGREF0"":{""text"":""What meaning does a gene h...","[{""text"":""The case of the three heterochromous..."
4,cd6de82b70bb7c544716d3b4ad69d63e5e33fdce,(ACVIM ABSTRACTS 5 WARMUP EXERCISE ALTERS ENER...,[(This study examined the effects of three dif...,"[(PORT H.C. Schott II, C.L. Coursen, S.W. Eber...","{""BIBREF0"":{""ref_id"":""b0"",""title"":""From the HE...","{""FIGREF0"":{""text"":""VIVO PRIMING AND ACTIVATIO...","[{""text"":""This study was designed to evaluate ..."


In [13]:
pdfJsonFiles.drop('metadata', 'bib_entries', 'ref_entries', 'back_matter')\
            .write\
            .parquet("pdfJsonFiles.parquet", "overwrite")

In [14]:
pdfJsonFilesParquetFile = spark.read.parquet("pdfJsonFiles.parquet")

In [15]:
pdfJsonFilesParquetFile.count()

752

In [16]:
pdfJsonFilesParquetFile.toPandas().head()

Unnamed: 0,paper_id,abstract,body_text
0,93e41ab96746bff2cc0d899137cae2054645e6c6,[(High-throughput screening (HTS) has emerged ...,"[(The high resolution, experimental 3D structu..."
1,0a17eac54e995f96411bd23dec19354ae1db3036,"[(G quadruplexes (G4s), which are known to hav...",[(The non canonical DNA structures formed by c...
2,ffec12aa4a9fd44c1cef69a933194c5f19e75bf0,[(Live-attenuated vaccines (LAV) represent one...,[(Live-attenuated vaccines (LAVs) have saved m...
3,0ae7fff969f6c8af643343337e618034a94f4a77,[],"[(speakers for public lectures, as representat..."
4,cd6de82b70bb7c544716d3b4ad69d63e5e33fdce,[(This study examined the effects of three dif...,"[(PORT H.C. Schott II, C.L. Coursen, S.W. Eber..."


## 2. Do exploratory data analysis:

In [17]:
total_count = metadata.count()
print('Total count of All records in the metadata csv file')
total_count

Total count of All records in the metadata csv file


128492

In [18]:
metadata_nan = (metadata_pd.isnull().sum() / total_count) * 100
metadata_nan

cord_uid             0.000000
sha                 56.611307
source_x             0.000000
title                0.021791
doi                 21.693958
pmcid               52.666314
pubmed_id           22.878467
license              0.030352
abstract            20.895464
publish_time         2.076394
authors              3.702954
journal              4.733369
mag_id              97.956293
who_covidence_id    85.205305
arxiv_id            97.349251
pdf_json_files      55.394110
pmc_json_files      64.730878
url                 60.799894
s2_id               20.847991
dtype: float64

- We can see that some columns are mostly null (mag_id, who_covidence_id, arxiv_id).
- We will use the columns ['sha', 'source_x', 'title', 'license', 'abstract', 'publish_time', 'authors', 'journal'].
- COVID-19 started spreading in 2019, so we will consider only papers published from 2019 onward.
- We will drop duplicates using the paper id *(sha)*.

In [19]:
metadataGreaterThan2019 = metadata.where(year(metadata.publish_time) >= 2019)\
                                .coalesce(1)\
                                .select(
                                    'sha',
                                    'source_x',
                                    'title',
                                    'license',
                                    'abstract',
                                    'publish_time',
                                    'authors',
                                    'journal'
                                )\
                                .withColumnRenamed('abstract', 'main_abstract')\
                                .dropDuplicates(subset=['sha'])

In [20]:
print('Count of records after filtering the year to be 2019 or later is')
metadataGreaterThan2019.count()

Count of records after filtering the year to be 2019 or later is


18146

## 3. Preparation and Cleaning the data:

### Joining the json file with the metadata in the csv file

In [21]:
joinedPapersData = pdfJsonFilesParquetFile.join(
                                            broadcast(metadataGreaterThan2019),
                                            metadataGreaterThan2019.sha == pdfJsonFilesParquetFile.paper_id
                                        )\
                                        .drop('sha', 'paper_id')

In [22]:
print('Count of the joined data:')
joinedPapersData.count()

Count of the joined data:


210

In [23]:
joinedPapersData.toPandas().head()

Unnamed: 0,abstract,body_text,source_x,title,license,main_abstract,publish_time,authors,journal
0,[(Live-attenuated vaccines (LAV) represent one...,[(Live-attenuated vaccines (LAVs) have saved m...,PMC,Humanized Mice for Live-Attenuated Vaccine Res...,cc-by,Live-attenuated vaccines (LAV) represent one o...,2020-01-21,"O’Connell, Aoife K.; Douam, Florian",Vaccines (Basel)
1,[],[(Influence of the minimum b-value on prostate...,PMC,"ECR 2020 Book of Abstracts: Vienna, Austria. 1...",no-cc,,2020-05-05,,Insights Imaging
2,[(Background: The COVID-19 pandemic has broadl...,[(COVID-19 was first recognized in December 20...,Elsevier; Medline; PMC,Guidelines for TMS/tES Clinical Services and R...,no-cc,BACKGROUND: The COVID-19 pandemic has broadly ...,2020-05-12,"Bikson, Marom; Hanlon, Colleen A.; Woods, Adam...",Brain Stimul
3,[],"[(Although we often refer to viruses as ""auton...",Elsevier; PMC,Chapter 4 Interaction of virus populations wit...,els-covid,Abstract Viral population numbers are extremel...,2020-12-31,"Domingo, Esteban",Virus as Populations
4,[(The great advance in the field of diagnosis ...,[(An enormous progress occurred in the field o...,Elsevier; Medline; PMC,Biotic concerns in generating molecular diagno...,els-covid,Abstract The great advance in the field of dia...,2019-12-31,"Davidson, Irit",Journal of Virological Methods


### Keep only the English documents

In [24]:
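from langdetect.lang_detect_exception import LangDetectException

# A function to detect text language: returns 'en' for English text, None otherwise.
def detectLang(text):
    if not text:
        return None
    try:
        lang = detect(text)
    except LangDetectException:  # raised when the text has no detectable features
        return None
    return lang if lang == 'en' else None

detect_lang_udf = udf(detectLang, StringType())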

- Get the first author only.
- Split the publish date into (year, month, day).
- Get the language (en, or null otherwise).
- Drop unwanted columns.
- Fill nulls in the important columns.
- Drop non-English documents.

In [25]:
papersData = joinedPapersData.withColumn('abstract', concat_ws(' ', joinedPapersData.abstract.text))\
                            .withColumn('body_text', concat_ws(' ', joinedPapersData.body_text.text))\
                            .withColumn('first_author', split(joinedPapersData.authors, ';')[0])\
                            .withColumn('publish_year', year(joinedPapersData.publish_time))\
                            .withColumn('publish_month', month(joinedPapersData.publish_time))\
                            .withColumn('publish_day', dayofmonth(joinedPapersData.publish_time))\
                            .withColumn('lang', detect_lang_udf(joinedPapersData.title))\
                            .drop('authors', 'publish_time')\
                            .fillna({'first_author': 'unknown', 'journal': 'unknown', 'main_abstract': ''})

In [26]:
papersData = papersData.dropna()\
                        .drop('lang')\
                        .persist()

In [27]:
papersData.toPandas().head()

Unnamed: 0,abstract,body_text,source_x,title,license,main_abstract,journal,first_author,publish_year,publish_month,publish_day
0,Live-attenuated vaccines (LAV) represent one o...,Live-attenuated vaccines (LAVs) have saved mil...,PMC,Humanized Mice for Live-Attenuated Vaccine Res...,cc-by,Live-attenuated vaccines (LAV) represent one o...,Vaccines (Basel),"O’Connell, Aoife K.",2020,1,21
1,,Influence of the minimum b-value on prostate c...,PMC,"ECR 2020 Book of Abstracts: Vienna, Austria. 1...",no-cc,,Insights Imaging,unknown,2020,5,5
2,Background: The COVID-19 pandemic has broadly ...,COVID-19 was first recognized in December 2019...,Elsevier; Medline; PMC,Guidelines for TMS/tES Clinical Services and R...,no-cc,BACKGROUND: The COVID-19 pandemic has broadly ...,Brain Stimul,"Bikson, Marom",2020,5,12
3,,"Although we often refer to viruses as ""autonom...",Elsevier; PMC,Chapter 4 Interaction of virus populations wit...,els-covid,Abstract Viral population numbers are extremel...,Virus as Populations,"Domingo, Esteban",2020,12,31
4,The great advance in the field of diagnosis of...,An enormous progress occurred in the field of ...,Elsevier; Medline; PMC,Biotic concerns in generating molecular diagno...,els-covid,Abstract The great advance in the field of dia...,Journal of Virological Methods,"Davidson, Irit",2019,12,31


## 4. Preprocessing:

### Convert text to lower case

In [28]:
papersData = papersData.withColumn('abstract', lower(papersData.abstract))\
                        .withColumn('body_text', lower(papersData.body_text))\
                        .withColumn('title', lower(papersData.title))\
                        .withColumn('main_abstract', lower(papersData.main_abstract))

### Remove stop words.

In [29]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [30]:
# A function to remove stop words.
# Build the stop-word set once so that each token lookup is a fast set membership test.
english_stopwords = set(stopwords.words('english'))

def removeStopwords(text):
    text_tokens = word_tokenize(text)
    return [word for word in text_tokens if word not in english_stopwords]

remove_stopwords_udf = udf(removeStopwords, ArrayType(StringType()))

### Remove custom stop words

In [35]:
# A function to remove custom stop words.
# The text is already lower-cased, so the custom stop words are lower-cased too.
custom_stopwords = {word.lower() for word in [
    'doi', 'preprint', 'copyright', 'peer', 'reviewed', 'org',
    'https', 'et', 'al', 'author', 'figure', 'rights', 'reserved', 'permission', 'used', 'using',
    'biorxiv', 'medrxiv', 'license', 'fig', 'fig.', 'al.', 'Elsevier', 'PMC', 'CZI', 'www'
]}

def removeCustomStopwords(text_tokens):
    tokens_filtered = [word for word in text_tokens if word not in custom_stopwords]
    # Join back into one string because the punctuation step expects a string.
    return " ".join(tokens_filtered)

# The function returns a string, so the UDF return type is StringType.
remove_custom_stopwords_udf = udf(removeCustomStopwords, StringType())

### Remove Punctuation

In [32]:
# A function to remove punctuation.
# It takes a string and keeps only runs of word characters (letters, digits, underscore).
def removePunctuation(text):
    #punctuation = '!()-[]{};:'"\,<>./?@#$%^&*_~'
    tokenizer = nltk.RegexpTokenizer(r"\w+")
    return tokenizer.tokenize(text)

remove_punctuation_udf = udf(removePunctuation, ArrayType(StringType()))

In [36]:
processedPapersData = papersData.withColumn(
                                    'title',
                                    remove_punctuation_udf(
                                        remove_custom_stopwords_udf(
                                            remove_stopwords_udf(papersData.title)
                                        )
                                    )
                                )\
                                .withColumn(
                                    'abstract',
                                    remove_punctuation_udf(
                                        remove_custom_stopwords_udf(
                                            remove_stopwords_udf(papersData.abstract)
                                        )
                                    )
                                )\
                                .withColumn(
                                    'main_abstract',
                                    remove_punctuation_udf(
                                        remove_custom_stopwords_udf(
                                            remove_stopwords_udf(papersData.main_abstract)
                                        )
                                    )
                                )\
                                .withColumn(
                                    'body_text',
                                    remove_punctuation_udf(
                                        remove_custom_stopwords_udf(
                                            remove_stopwords_udf(papersData.body_text)
                                        )
                                    )
                                )\
                                .persist()

In [37]:
processedPapersData.toPandas().head()

Unnamed: 0,abstract,body_text,source_x,title,license,main_abstract,journal,first_author,publish_year,publish_month,publish_day
0,"[live, attenuated, vaccines, lav, represent, o...","[live, attenuated, vaccines, lavs, saved, mill...",PMC,"[humanized, mice, live, attenuated, vaccine, r...",cc-by,"[live, attenuated, vaccines, lav, represent, o...",Vaccines (Basel),"O’Connell, Aoife K.",2020,1,21
1,[],"[influence, minimum, b, value, prostate, cance...",PMC,"[ecr, 2020, book, abstracts, vienna, austria, ...",no-cc,[],Insights Imaging,unknown,2020,5,5
2,"[background, covid, 19, pandemic, broadly, dis...","[covid, 19, first, recognized, december, 2019,...",Elsevier; Medline; PMC,"[guidelines, tms, tes, clinical, services, res...",no-cc,"[background, covid, 19, pandemic, broadly, dis...",Brain Stimul,"Bikson, Marom",2020,5,12
3,[],"[although, often, refer, viruses, autonomous, ...",Elsevier; PMC,"[chapter, 4, interaction, virus, populations, ...",els-covid,"[abstract, viral, population, numbers, extreme...",Virus as Populations,"Domingo, Esteban",2020,12,31
4,"[great, advance, field, diagnosis, avian, viru...","[enormous, progress, occurred, field, diagnosi...",Elsevier; Medline; PMC,"[biotic, concerns, generating, molecular, diag...",els-covid,"[abstract, great, advance, field, diagnosis, a...",Journal of Virological Methods,"Davidson, Irit",2019,12,31


## 5. Vectorization:

- Concatenate all text columns into one column.
- Apply Word2Vec.

In [38]:
processedPapersData = processedPapersData\
                                        .withColumn(
                                            'text',
                                            concat(
                                                processedPapersData.title,
                                                processedPapersData.abstract,
                                                processedPapersData.main_abstract,
                                                processedPapersData.body_text
                                            )
                                        )

In [39]:
word2Vec = Word2Vec(inputCol="text", outputCol="result")
model = word2Vec.fit(processedPapersData)

result = model.transform(processedPapersData)
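
Spark's `Word2VecModel.transform` turns each token list into a single vector by averaging the vectors of its words. The plain-Python sketch below illustrates that averaging on made-up two-dimensional vectors. The function name `document_vector` and the toy vocabulary are assumptions, and so is the choice to divide by the total token count (which is our understanding of how Spark treats out-of-vocabulary words, not something this notebook verifies).

```python
# Hypothetical two-dimensional word vectors, for illustration only.
WORD_VECTORS = {
    'virus': [1.0, 0.0],
    'vaccine': [0.0, 1.0],
    'mice': [1.0, 1.0],
}


def document_vector(tokens, word_vectors, size):
    """Average the vectors of the known tokens over the total number of tokens.

    Unknown tokens contribute nothing to the sum but still count in the
    denominator (assumed to mirror Spark); an empty document maps to zeros.
    """
    if not tokens:
        return [0.0] * size
    total = [0.0] * size
    for token in tokens:
        vector = word_vectors.get(token)
        if vector is not None:
            total = [t + v for t, v in zip(total, vector)]
    return [t / len(tokens) for t in total]


print(document_vector(['virus', 'vaccine', 'unknown', 'mice'], WORD_VECTORS, 2))
# [0.5, 0.5]
```

Because every document ends up with a vector of the same length, the `result` column can be fed directly to a clustering algorithm.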

In [40]:
result.toPandas().head()

Unnamed: 0,abstract,body_text,source_x,title,license,main_abstract,journal,first_author,publish_year,publish_month,publish_day,text,result
0,"[live, attenuated, vaccines, lav, represent, o...","[live, attenuated, vaccines, lavs, saved, mill...",PMC,"[humanized, mice, live, attenuated, vaccine, r...",cc-by,"[live, attenuated, vaccines, lav, represent, o...",Vaccines (Basel),"O’Connell, Aoife K.",2020,1,21,"[humanized, mice, live, attenuated, vaccine, r...","[-0.03052558893194353, 0.04946341356110135, -0..."
1,[],"[influence, minimum, b, value, prostate, cance...",PMC,"[ecr, 2020, book, abstracts, vienna, austria, ...",no-cc,[],Insights Imaging,unknown,2020,5,5,"[ecr, 2020, book, abstracts, vienna, austria, ...","[-0.04787440190833189, -0.006202977763913214, ..."
2,"[background, covid, 19, pandemic, broadly, dis...","[covid, 19, first, recognized, december, 2019,...",Elsevier; Medline; PMC,"[guidelines, tms, tes, clinical, services, res...",no-cc,"[background, covid, 19, pandemic, broadly, dis...",Brain Stimul,"Bikson, Marom",2020,5,12,"[guidelines, tms, tes, clinical, services, res...","[-0.0413408475612456, 0.037755414528474175, -0..."
3,[],"[although, often, refer, viruses, autonomous, ...",Elsevier; PMC,"[chapter, 4, interaction, virus, populations, ...",els-covid,"[abstract, viral, population, numbers, extreme...",Virus as Populations,"Domingo, Esteban",2020,12,31,"[chapter, 4, interaction, virus, populations, ...","[-0.009700097191966454, 0.05362811148262047, -..."
4,"[great, advance, field, diagnosis, avian, viru...","[enormous, progress, occurred, field, diagnosi...",Elsevier; Medline; PMC,"[biotic, concerns, generating, molecular, diag...",els-covid,"[abstract, great, advance, field, diagnosis, a...",Journal of Virological Methods,"Davidson, Irit",2019,12,31,"[biotic, concerns, generating, molecular, diag...","[-0.011102675271875629, 0.05507244685679239, -..."


In [41]:
result.printSchema()

root
 |-- abstract: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- body_text: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- source_x: string (nullable = true)
 |-- title: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- license: string (nullable = true)
 |-- main_abstract: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- journal: string (nullable = false)
 |-- first_author: string (nullable = false)
 |-- publish_year: integer (nullable = true)
 |-- publish_month: integer (nullable = true)
 |-- publish_day: integer (nullable = true)
 |-- text: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- result: vector (nullable = true)

