To get data from PubMed, I used two Python packages. I started with Biopython Entrez. It did relatively well when downloading 500 abstracts (which took around 2 hours), but when I increased the number of abstracts to download, it would either raise the IncompleteRead error described in https://github.com/biopython/biopython/issues/1944 or slow down and hang for many hours. I was not able to resolve the issue by adding exception handling or by following the directions in the link. Luckily, I ran into another open source package, Metapub. I did have a hard time setting up the API key in a Jupyter notebook since there were no clear directions in the documentation. However, I reached out to the developers, and they kindly responded right away and helped me. Overall, Metapub was very fast and helpful in downloading a large amount of data from PubMed.
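
For readers who hit the same IncompleteRead error, one common pattern is to retry the failed request with an increasing delay. The sketch below is a generic illustration and not the code used in this project; the helper name `with_retries` and its parameters are hypothetical, and nothing here shows that it would resolve the issue linked above.

```python
import time
from http.client import IncompleteRead

def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() and retry on IncompleteRead, doubling the wait each time.

    `fetch` is any zero-argument callable (hypothetical; for example a
    function that performs one request and parses the response).
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except IncompleteRead:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)
```

Since the error typically appears while the response is being read, the wrapped callable would need to include both the request and the parsing step.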

To get data for the category Other, I collected topics from https://www.ncbi.nlm.nih.gov/mesh/1000048 into a Python list. I kept the topics that belong to either 'Chemicals and Drugs Category' or 'Diseases Category', excluding sub-topics such as 'Congenital, Hereditary, and Neonatal Diseases and Abnormalities' and 'Chemically-Induced Disorders'.

All of the code is shown below.


### Fetching data from PubMed using Biopython Entrez

In [1]:
from Bio import Entrez
import pandas as pd

def fetch_data_entrez(topic, quantity, file_name):
    """Search PubMed for `topic` and save up to `quantity` abstracts to a CSV file."""
    Entrez.email = "marina.drus@gmail.com"  # contact email sent to NCBI with each request
    # Step 1: search for the PMIDs that match the topic
    handle = Entrez.esearch(db="pubmed", term=topic, retmax=quantity)
    record = Entrez.read(handle)
    handle.close()
    pmids = record["IdList"]  # PMIDs are returned as strings
    # Step 2: fetch all records in one request. The original version repeated
    # this same request once per PMID, which multiplied the download time.
    handle = Entrez.efetch(db="pubmed", id=",".join(pmids), retmode="xml")
    records = Entrez.read(handle)
    handle.close()
    # Step 3: keep the first AbstractText element of each article that has an abstract
    abstracts = [str(article['MedlineCitation']['Article']['Abstract']['AbstractText'][0])
                 for article in records['PubmedArticle']
                 if 'Abstract' in article['MedlineCitation']['Article']]
    abstract_data = pd.DataFrame(abstracts, columns=['abstract'])
    abstract_data.to_csv(file_name)
    


In [2]:
fetch_data_entrez('Congenital Abnormalities', 500, 'abnom.csv')
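
For larger downloads, one option is to split the PMID list into smaller batches so that a single failed request does not affect the whole run. The sketch below shows only the batching step; the helper name `chunked` and the batch size of 200 are assumptions chosen for illustration, and the PMIDs are stand-in values rather than real search results.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Stand-in for record["IdList"]; real PMIDs would come from Entrez.esearch
pmids = [str(n) for n in range(1, 501)]
batches = list(chunked(pmids, 200))
# Each batch could then be passed to its own Entrez.efetch call,
# with id=",".join(batch)
```

With 500 stand-in PMIDs and a batch size of 200, this produces three batches.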

### Fetching data from PubMed using Metapub

In [7]:
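import os
# Set the key before importing metapub, in case the library reads it at import time
os.environ["NCBI_API_KEY"] = "metapub_key"  # placeholder for the actual NCBI API key

from metapub import PubMedFetcher
import pandas as pd

def fetch_data_metapub(topic, quantity, file_name):
    """Search PubMed for `topic` and save up to `quantity` abstracts to a CSV file."""
    fetch = PubMedFetcher()
    pmids = fetch.pmids_for_query(topic, retmax=quantity)
    # Collect abstracts in a list first; DataFrame.append was removed in pandas 2.0
    abstracts = [fetch.article_by_pmid(pmid).abstract for pmid in pmids]
    abstract_data = pd.DataFrame({"abstract": abstracts})
    abstract_data.to_csv(file_name)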


In [8]:
fetch_data_metapub('Congenital Abnormalities', 3700, 'abnom.csv')
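
Unlike the Entrez version, the Metapub function does not skip articles without an abstract, so the CSV may contain empty rows. A minimal sketch of a filter is shown below, assuming that a missing abstract shows up as `None` or an empty string; the helper name `collect_abstracts` is hypothetical, and the article objects are simple stand-ins rather than real Metapub results.

```python
from types import SimpleNamespace

def collect_abstracts(articles):
    """Return the non-empty abstracts from an iterable of article-like objects."""
    return [a.abstract for a in articles if getattr(a, "abstract", None)]

# Stand-ins for the objects returned by fetch.article_by_pmid(pmid)
articles = [
    SimpleNamespace(abstract="First abstract."),
    SimpleNamespace(abstract=None),
    SimpleNamespace(abstract=""),
]
kept = collect_abstracts(articles)
```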

### Fetching data from PubMed for category Other using Metapub

In [77]:
from metapub import PubMedFetcher
import pandas as pd

topics = [
    # Chemicals and Drugs Category
    'Amino Acids, Peptides, and Proteins', 'Biological Factors',
    'Biomedical and Dental Materials', 'Carbohydrates', 'Chemical Actions and Uses',
    'Complex Mixtures', 'Enzymes and Coenzymes', 'Heterocyclic Compounds',
    'Hormones, Hormone Substitutes, and Hormone Antagonists', 'Inorganic Chemicals',
    'Lipids', 'Macromolecular Substances', 'Nucleic Acids, Nucleotides, and Nucleosides',
    'Organic Chemicals', 'Pharmaceutical Preparations', 'Polycyclic Compounds',
    # Diseases Category
    'Animal Diseases', 'Cardiovascular Diseases', 'Digestive System Diseases',
    'Disorders of Environmental Origin', 'Endocrine System Diseases', 'Eye Diseases',
    'Female Urogenital Diseases and Pregnancy Complications', 'Hemic and Lymphatic Diseases',
    'Immune System Diseases', 'Infections', 'Male Urogenital Diseases', 'Musculoskeletal Diseases',
    'Neoplasms', 'Nervous System Diseases', 'Nutritional and Metabolic Diseases',
    'Occupational Diseases', 'Otorhinolaryngologic Diseases', 'Pathological Conditions, Signs and Symptoms',
    'Respiratory Tract Diseases', 'Skin and Connective Tissue Diseases',
    'Stomatognathic Diseases', 'Wounds and Injuries',
]

fetch = PubMedFetcher()
# Gather up to 106 PMIDs per topic into one flat list
pmids_mix = []
for topic in topics:
    pmids_mix.extend(fetch.pmids_for_query(topic, retmax=106))
# Collect abstracts in a list first; DataFrame.append was removed in pandas 2.0
abstracts = [fetch.article_by_pmid(pmid).abstract for pmid in pmids_mix]
Abstract_data = pd.DataFrame({"abstract": abstracts})
Abstract_data.to_csv('others.csv')
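
Because one article can be indexed under several MeSH topics, the combined PMID list may contain duplicates. The sketch below shows one way to drop them while keeping the original order; the helper name `unique_in_order` is hypothetical, and the PMIDs are made-up stand-in values.

```python
def unique_in_order(pmids):
    """Remove duplicate PMIDs while preserving first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(pmids))

# Stand-in for pmids_mix; real values would come from pmids_for_query
pmids_mix = ["101", "102", "101", "103", "102"]
unique_pmids = unique_in_order(pmids_mix)
```

Applying this before the abstracts are fetched would also avoid requesting the same article twice.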