# Final Project Template 

## 1) Get your data
You may use any data set(s) you like, so long as they meet these criteria:

* Your data must be publicly available for free.
* Your data should be interesting to _you_. You want your final project to be something you're proud of.
* Your data should be "big enough":
    - It should have at least 1,000 rows.
    - It should have enough columns to be interesting.
    - If you have questions, contact a member of the instructional team.

## 2) Provide a link to your data
Your data is required to be free and open to anyone.
As such, you should have a URL which anyone can use to download your data:

In [1]:
# https://www.ncbi.nlm.nih.gov/pubmed/

## 3) Import your data

In the space below, import your data. If your data span multiple files, read them all in. If applicable, merge or append them as needed.

In [2]:
#importing needed packages
import pandas as pd
import numpy as np 
from Bio import Entrez

#searching PubMed for "genomics" and downloading dictionary of data including number of publications, accession IDs, and relevant MeSH term 
Entrez.email = "eileen.cahill@nih.gov"
handle = Entrez.esearch(db="pubmed", retmax=11, term="genomics", idtype="acc")
record = Entrez.read(handle)
handle.close()

In [3]:
#looking at data and types
print(type(record))
print(type(handle))
print(record)
print(record['IdList'])
print(type(record['IdList']))

<class 'Bio.Entrez.Parser.DictionaryElement'>
<class '_io.TextIOWrapper'>
{'Count': '246457', 'RetMax': '11', 'RetStart': '0', 'IdList': ['32365385', '32365362', '32365353', '32365209', '32365196', '32365118', '32365064', '32364878', '32364825', '32364535', '32364420'], 'TranslationSet': [{'From': 'genomics', 'To': '"genomics"[MeSH Terms] OR "genomics"[All Fields]'}], 'TranslationStack': [{'Term': '"genomics"[MeSH Terms]', 'Field': 'MeSH Terms', 'Count': '110045', 'Explode': 'Y'}, {'Term': '"genomics"[All Fields]', 'Field': 'All Fields', 'Count': '186868', 'Explode': 'N'}, 'OR', 'GROUP'], 'QueryTranslation': '"genomics"[MeSH Terms] OR "genomics"[All Fields]'}
['32365385', '32365362', '32365353', '32365209', '32365196', '32365118', '32365064', '32364878', '32364825', '32364535', '32364420']
<class 'Bio.Entrez.Parser.ListElement'>


In [4]:
#converting type from Bio.Entrez.Parser.DictionaryElement to a list of [key, value] pairs
#(the built-in name "list" is no longer reused as a variable)
listrecords = []

for key, value in record.items():
    listrecords.append([key, value])
print(listrecords)
print(type(listrecords))

[['Count', '246457'], ['RetMax', '11'], ['RetStart', '0'], ['IdList', ['32365385', '32365362', '32365353', '32365209', '32365196', '32365118', '32365064', '32364878', '32364825', '32364535', '32364420']], ['TranslationSet', [{'From': 'genomics', 'To': '"genomics"[MeSH Terms] OR "genomics"[All Fields]'}]], ['TranslationStack', [{'Term': '"genomics"[MeSH Terms]', 'Field': 'MeSH Terms', 'Count': '110045', 'Explode': 'Y'}, {'Term': '"genomics"[All Fields]', 'Field': 'All Fields', 'Count': '186868', 'Explode': 'N'}, 'OR', 'GROUP']], ['QueryTranslation', '"genomics"[MeSH Terms] OR "genomics"[All Fields]']]
<class 'list'>


In [5]:
#creating iterable list of IDs of both string and int type
id_list_acquire = listrecords[3]
print(id_list_acquire)
id_list = []
id_list_int = []

for pmid in id_list_acquire[1]:
    id_list.append(pmid)
print(id_list)

#convert each ID (not just the last one) to an integer
for pmid in id_list_acquire[1]:
    id_list_int.append(int(pmid))
print(id_list_int)

['IdList', ['32365385', '32365362', '32365353', '32365209', '32365196', '32365118', '32365064', '32364878', '32364825', '32364535', '32364420']]
['32365385', '32365362', '32365353', '32365209', '32365196', '32365118', '32365064', '32364878', '32364825', '32364535', '32364420']
[32365385, 32365362, 32365353, 32365209, 32365196, 32365118, 32365064, 32364878, 32364825, 32364535, 32364420]
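The two loops above can also be written as comprehensions that read `IdList` by key rather than by position. This is a minimal sketch, assuming the search result behaves like a plain dictionary; `sample_record` is a hypothetical stand-in trimmed from the output shown earlier, so no network call is needed.

```python
# Hypothetical stand-in for the Entrez search result (trimmed to three IDs).
sample_record = {
    "Count": "246457",
    "IdList": ["32365385", "32365362", "32365353"],
}

# Look up the IDs by key instead of relying on their position in a list.
sample_ids = list(sample_record["IdList"])

# Convert each ID string to an integer.
sample_ids_int = [int(pmid) for pmid in sample_ids]

print(sample_ids)
print(sample_ids_int)
```

Reading by key avoids depending on the order in which the search result happens to list its fields.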


In [6]:
list_titles = []

for item in id_list:
    handle2 = Entrez.esummary(db="pubmed", id=item, retmode="xml")
    records2 = Entrez.parse(handle2)

    #each summary is a dictionary-like object; keep only its title
    for summary in records2:
        list_titles.append(summary['Title'])

    #close each handle once its records have been read
    handle2.close()

print(list_titles)

['Four-Dimensional Introital Ultrasound in Assessing Perioperative Pelvic Floor Muscle Functions of Women with Cystoceles.', 'Interleukin-31 promotes pathogenic mechanisms underlying skin and lung fibrosis in scleroderma.', 'Rapid reconstruction of SARS-CoV-2 using a synthetic genomics platform.', 'mstree: a multispecies coalescent approach for estimating ancestral population size and divergence time during speciation with gene flow.', 'Abnormalities in Sodium Current and Calcium Homeostasis as Drivers of Arrhythmogenesis in Hypertrophic Cardiomyopathy.', 'Analysis of meiotic segregation by triple-color fish on both total and motile sperm fractions in a t(1p;18) river buffalo bull.', 'A new neuropeptide insect parathyroid hormone iPTH in the red flour beetle Tribolium castaneum.', 'Overexpression of <i>NEIL3</i> associated with altered genome and poor survival in selected types of human cancer.', '<i>Ustilaginoidea virens</i>: Insights into an Emerging Rice Pathogen.', 'Genomic and epi
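The loop above sends one request per ID, which becomes slow once a search returns the 1,000 or more rows this project asks for. A common alternative is to request several IDs at once by passing them as a comma-separated string. The sketch below only builds those strings; `chunk_ids` and the batch sizes are illustrative assumptions, and the `Entrez.esummary` call is left as a comment because it needs network access.

```python
def chunk_ids(ids, batch_size=200):
    """Yield comma-separated ID strings holding at most batch_size IDs each."""
    for start in range(0, len(ids), batch_size):
        yield ",".join(ids[start:start + batch_size])


# Example IDs copied from the search output above.
example_ids = ["32365385", "32365362", "32365353", "32365209", "32365196"]
batches = list(chunk_ids(example_ids, batch_size=2))
print(batches)

# Each batch could then be sent in a single request, for example:
# handle = Entrez.esummary(db="pubmed", id=batch, retmode="xml")
```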

## 5) Show me the shape of your data

In [7]:
#showing the dataframe and shape
df = pd.DataFrame(listrecords)
print(df)
df.shape

df = pd.DataFrame(list_titles)
print(df)
df.shape

0                                                  1
0             Count                                             246457
1            RetMax                                                 11
2          RetStart                                                  0
3            IdList  [32365385, 32365362, 32365353, 32365209, 32365...
4    TranslationSet  [{'From': 'genomics', 'To': '"genomics"[MeSH T...
5  TranslationStack  [{'Term': '"genomics"[MeSH Terms]', 'Field': '...
6  QueryTranslation   "genomics"[MeSH Terms] OR "genomics"[All Fields]
                                                    0
0   Four-Dimensional Introital Ultrasound in Asses...
1   Interleukin-31 promotes pathogenic mechanisms ...
2   Rapid reconstruction of SARS-CoV-2 using a syn...
3   mstree: a multispecies coalescent approach for...
4   Abnormalities in Sodium Current and Calcium Ho...
5   Analysis of meiotic segregation by triple-colo...
6   A new neuropeptide insect parathyroid hormone ...
7   Overexpression

(11, 1)

In [8]:
text_glob = ""

for x in list_titles:
    text_glob += x

text_glob = text_glob.replace(".", " ")
text_lower = text_glob.lower()
print(text_lower)

four-dimensional introital ultrasound in assessing perioperative pelvic floor muscle functions of women with cystoceles interleukin-31 promotes pathogenic mechanisms underlying skin and lung fibrosis in scleroderma rapid reconstruction of sars-cov-2 using a synthetic genomics platform mstree: a multispecies coalescent approach for estimating ancestral population size and divergence time during speciation with gene flow abnormalities in sodium current and calcium homeostasis as drivers of arrhythmogenesis in hypertrophic cardiomyopathy analysis of meiotic segregation by triple-color fish on both total and motile sperm fractions in a t(1p;18) river buffalo bull a new neuropeptide insect parathyroid hormone ipth in the red flour beetle tribolium castaneum overexpression of <i>neil3</i> associated with altered genome and poor survival in selected types of human cancer <i>ustilaginoidea virens</i>: insights into an emerging rice pathogen genomic and epigenomic ebf1 alterations modulate tert
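The combined text still contains markup such as `<i>` and `</i>`, which a tokenizer would split into stray `<`, `>`, and `/i` tokens. A minimal sketch of one way to clean a title first, using only the standard library; `clean_title` is a hypothetical helper, and its pattern only handles simple tags like the ones seen here.

```python
import re


def clean_title(title):
    """Lowercase a title, drop simple HTML tags, and collapse whitespace."""
    no_tags = re.sub(r"</?\w+>", "", title)       # remove tags such as <i> and </i>
    no_periods = no_tags.replace(".", " ")        # same period handling as the cell above
    return re.sub(r"\s+", " ", no_periods).strip().lower()


example = "<i>Ustilaginoidea virens</i>: Insights into an Emerging Rice Pathogen."
print(clean_title(example))
```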

In [21]:
from nltk.tokenize import word_tokenize
from nltk import FreqDist
from nltk.corpus import stopwords
#from nltk.stem import PorterStemmer

#Create word tokens and count frequencies
#(the NLTK tokenizer and stopword data must already be downloaded with nltk.download)
tokens = word_tokenize(text_lower)
token_counts = FreqDist(tokens)
#print(tokens)

#removing stopwords and showing the remaining distinct words
stops = set(stopwords.words('english'))
kept_words = [x for x in token_counts if x not in stops]
print(kept_words)

#keeping each remaining word's original count, then showing the most frequent
freq_words = FreqDist({word: token_counts[word] for word in kept_words})
for word in freq_words:
    print(word, freq_words[word])

most_freq_words = freq_words.most_common(10)
print(most_freq_words)

['four-dimensional', 'introital', 'ultrasound', 'assessing', 'perioperative', 'pelvic', 'floor', 'muscle', 'functions', 'women', 'cystoceles', 'interleukin-31', 'promotes', 'pathogenic', 'mechanisms', 'underlying', 'skin', 'lung', 'fibrosis', 'scleroderma', 'rapid', 'reconstruction', 'sars-cov-2', 'using', 'synthetic', 'genomics', 'platform', 'mstree', ':', 'multispecies', 'coalescent', 'approach', 'estimating', 'ancestral', 'population', 'size', 'divergence', 'time', 'speciation', 'gene', 'flow', 'abnormalities', 'sodium', 'current', 'calcium', 'homeostasis', 'drivers', 'arrhythmogenesis', 'hypertrophic', 'cardiomyopathy', 'analysis', 'meiotic', 'segregation', 'triple-color', 'fish', 'total', 'motile', 'sperm', 'fractions', '(', '1p', ';', '18', ')', 'river', 'buffalo', 'bull', 'new', 'neuropeptide', 'insect', 'parathyroid', 'hormone', 'ipth', 'red', 'flour', 'beetle', 'tribolium', 'castaneum', 'overexpression', '<', '>', 'neil3', '/i', 'associated', 'altered', 'genome', 'poor', 'surv
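The same count-after-filtering idea can be sketched with the standard library alone, which may help when checking the NLTK results by hand. The stop list and sample sentence below are made up for illustration; NLTK's English stop list is much longer.

```python
from collections import Counter

# Tiny illustrative stop list (an assumption, not NLTK's list).
example_stops = {"a", "in", "of", "the", "and", "with"}

# Made-up sample text standing in for the combined titles.
example_text = "genomics of rice and genomics of wheat in the field"

# Drop stop words first, then count what remains.
words = [w for w in example_text.split() if w not in example_stops]
word_counts = Counter(words)

print(word_counts.most_common(2))
```

Filtering the full token list before counting keeps repeated words, so the counts reflect how often each word appears rather than whether it appears at all.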

## 7) Give me a problem statement.
Below, write a problem statement. Keep in mind that your task is to tease out relationships in your data and eventually build a predictive model. Your problem statement can be vague, but you should have a goal in mind. Your problem statement should be between one sentence and one paragraph.

In [10]:
#It will be helpful for NIH administrators to know which aspects of a given scientific area are most common in publications and which are least common. For example, if "human" is the most common term in a PubMed search for "genomics" publications, that suggests a strong human emphasis in genomic research.

## 8) What is your _y_-variable?
For your final project, you will need to build a statistical model. This means you will have to accurately predict some y-variable from some combination of x-variables. From your problem statement in part 7, what is that y-variable?

In [11]:
#How many times a word is used in publication abstracts/titles.