# Litsurvey

Identify literature in PubMed using the keyword string ‘sickle cell’ OR ‘haemoglobin S’ OR ‘hemoglobin S’ OR ‘Hb S’, restricted to the period from 1950 to 20 October 2009.

## 1. PubMed

The search is expected to return 18336 results in PubMed.

### 1.1 Code

#### 1.1.1 Export Results in Batches from PubMed

Since PubMed can only export a limited number of records at a time, the following code splits the records into two batches by publication date and exports each batch in turn.


In [5]:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
import time

# One search term per batch; [dp] restricts the search by publication date.
keywords = '("sickle cell" OR "haemoglobin S" OR "hemoglobin S" OR "Hb S")'
find = [
    keywords + ' AND (1950/01/01:1989/12/31[dp])',
    keywords + ' AND (1990/01/01:2009/10/20[dp])',
]
for term in find:
    url = 'https://pubmed.ncbi.nlm.nih.gov/?term=' + term + '&size=200&page=51'
    driver = webdriver.Chrome()
    driver.get(url)
    driver.maximize_window()
    driver.implicitly_wait(10)  # wait up to 10 s for elements to appear
    # Open the save panel
    ActionChains(driver).move_to_element(driver.find_element(By.XPATH, r'//*[@id="save-results-panel-trigger"]')).click().perform()
    # Choose which results to save
    Select(driver.find_element(By.XPATH, r'//*[@id="save-action-selection"]')).select_by_visible_text("All results on this page")
    # Choose CSV as the export format
    ActionChains(driver).move_to_element(driver.find_element(By.XPATH, r'//*[@id="save-action-format"]')).click().perform()
    Select(driver.find_element(By.XPATH, r'/html/body/main/div[1]/div/form/div[2]/select')).select_by_visible_text("CSV")
    # Click the first button in the save form to create the file
    ActionChains(driver).move_to_element(driver.find_element(By.XPATH, r'//*[@id="save-action-panel-form"]/div[3]/button[1]')).click().perform()
    time.sleep(20)  # give the download time to finish
    driver.quit()   # close this batch's browser before starting the next one
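
The search terms in the URLs above contain spaces, quotes, and brackets. A minimal sketch of building the same kind of URL with percent-encoding is shown here. The helper name `build_search_url` is hypothetical, and only the `term` and `size` query parameters from the URL above are reproduced; any other parameter could be added to the dictionary in the same way.

```python
from urllib.parse import urlencode, urlparse, parse_qs

BASE_URL = "https://pubmed.ncbi.nlm.nih.gov/"
KEYWORDS = '("sickle cell" OR "haemoglobin S" OR "hemoglobin S" OR "Hb S")'


def build_search_url(start, end, size=200):
    """Return a search URL for one publication-date batch (hypothetical helper)."""
    term = f"{KEYWORDS} AND ({start}:{end}[dp])"
    # urlencode percent-encodes spaces, quotes, and brackets in the query string
    return BASE_URL + "?" + urlencode({"term": term, "size": size})


batches = [("1950/01/01", "1989/12/31"), ("1990/01/01", "2009/10/20")]
urls = [build_search_url(start, end) for start, end in batches]

# Decoding the query string recovers the original search term
decoded_term = parse_qs(urlparse(urls[0]).query)["term"][0]
```

Encoding the term explicitly avoids depending on the browser to repair a URL that contains raw spaces and quotes.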

#### 1.1.2 Merge and Process Results

The files exported in the previous step need to be merged. After merging, filter out records added after 2009/10/20.

In [7]:
import pandas as pd

# "csv-sicklecell-set (1).csv" and "csv-sicklecell-set (2).csv" are the files exported in the previous step.
# Make sure they are in the same folder as this Python source file.
h1 = pd.read_csv("csv-sicklecell-set (1).csv")
h2 = pd.read_csv("csv-sicklecell-set (2).csv")

# Merge the CSV files
df = pd.concat([h1, h2])
df = df.drop_duplicates()  # drop_duplicates returns a new DataFrame, so reassign it
df.index = range(1, df.shape[0] + 1)
df.to_csv('concated.csv', encoding='utf-8')

# Keep only records whose date (column index 7) is on or before 2009/10/20.
# The dates are compared as YYYY/MM/DD strings, which sort in chronological order.
df1 = df[df.iloc[:, 7] <= '2009/10/20'].copy()
df1.index = range(1, df1.shape[0] + 1)
df1.to_csv('concated-filtered.csv', encoding='utf-8')
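
The merge-and-filter step can also be sketched on a small in-memory example, which makes the logic easy to check before running it on the real exports. This is only an illustration: the function name `merge_and_filter` is hypothetical, the column names `PMID` and `Create Date` are assumptions about the exported CSV header, and the toy rows are made up.

```python
import pandas as pd


def merge_and_filter(frames, cutoff="2009/10/20", id_col="PMID", date_col="Create Date"):
    """Concatenate batches, drop repeated IDs, and keep rows dated on or before cutoff."""
    merged = pd.concat(frames, ignore_index=True)
    merged = merged.drop_duplicates(subset=id_col).reset_index(drop=True)
    # Parse dates explicitly instead of relying on string comparison
    dates = pd.to_datetime(merged[date_col], format="%Y/%m/%d")
    limit = pd.to_datetime(cutoff, format="%Y/%m/%d")
    filtered = merged[dates <= limit].reset_index(drop=True)
    return merged, filtered


# Toy data standing in for the two exported files (values are made up)
batch1 = pd.DataFrame({"PMID": [1, 2], "Create Date": ["1975/03/01", "2009/10/20"]})
batch2 = pd.DataFrame({"PMID": [2, 3, 4], "Create Date": ["2009/10/20", "2009/10/21", "2011/01/05"]})

merged, filtered = merge_and_filter([batch1, batch2])
```

Filtering with a boolean mask keeps the column names in the output, and deduplicating on an identifier column catches records that appear in both batches even if other fields differ.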

### 1.2 Results

After exporting the results to CSV format, the total number of results is as follows:

![jupyter](./image-20221024162307363.png)

The number of records is 18501, which is 165 more than the 18336 results expected.

Records may have been added or deleted after 2009/10/20, which makes the number of records differ slightly from the expected count. Records added after that date can be filtered out.

After filtering out records added after 2009/10/20, the number of results is as follows:

![jupyter](./image-20221024162347927.png)

The number of records is 18289, which is 47 fewer than the number of results expected.

I have not found a way to determine how many records were deleted. However, the adjusted result is certainly not less than 18289, probably less than 18501, and very close to 18336.
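
The comparisons in this section come down to three subtractions. The sketch below uses only the counts reported here; the variable names are illustrative.

```python
expected = 18336   # results expected for the original search
exported = 18501   # records in the merged export
filtered = 18289   # records left after the date filter

surplus_before_filter = exported - expected   # 165
removed_by_filter = exported - filtered       # 212
shortfall_after_filter = expected - filtered  # 47

# The expected count lies between the filtered and unfiltered totals
within_bounds = filtered <= expected <= exported
```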