# Final Project Template 

## 1) Get your data
You may use any data set(s) you like, so long as they meet these criteria:

* Your data must be publicly available for free.
* Your data should be interesting to _you_. You want your final project to be something you're proud of.
* Your data should be "big enough":
    - It should have at least 1,000 rows.
    - It should have enough columns to be interesting.
    - If you have questions, contact a member of the instructional team.
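
One quick way to check the size criterion once the data is loaded is sketched below. It assumes the data fits in a pandas DataFrame. The function name `is_big_enough` and the `min_cols` threshold are hypothetical, since the template only fixes the 1,000-row minimum.

```python
import pandas as pd

def is_big_enough(df, min_rows=1000, min_cols=2):
    """Return True if df meets the row minimum and a (hypothetical) column minimum."""
    n_rows, n_cols = df.shape
    return n_rows >= min_rows and n_cols >= min_cols

# Small synthetic example: 1,200 rows and 3 columns.
example = pd.DataFrame({"a": range(1200), "b": range(1200), "c": range(1200)})
print(is_big_enough(example))
```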

## 2) Provide a link to your data
Your data is required to be free and open to anyone.
As such, you should have a URL which anyone can use to download your data:

In [1]:
# https://www.ncbi.nlm.nih.gov/pubmed/

## 3) Import your data

In the space below, import your data. If your data span multiple files, read them all in. If applicable, merge or append them as needed.
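
As a minimal sketch of the difference between appending and merging, the example below reads two in-memory stand-ins for CSV files, stacks them with `pd.concat`, and then joins a second table with `DataFrame.merge`. The column names (`pmid`, `title`, `year`) and the values are made up for illustration.

```python
import io
import pandas as pd

# Two in-memory "files" with the same columns (stand-ins for real CSV paths).
file_a = io.StringIO("pmid,title\n1,First title\n2,Second title\n")
file_b = io.StringIO("pmid,title\n3,Third title\n")

# Append: stack rows from files that share the same columns.
appended = pd.concat([pd.read_csv(file_a), pd.read_csv(file_b)], ignore_index=True)

# Merge: join a second table on a shared key column.
years = pd.DataFrame({"pmid": [1, 2, 3], "year": [2019, 2020, 2020]})
merged = appended.merge(years, on="pmid", how="left")
print(merged.shape)
```

Appending suits files that share columns, while merging suits tables that describe the same rows with different columns.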

In [2]:
#importing needed packages
import pandas as pd
import numpy as np 
from Bio import Entrez

#searching PubMed for "genomics" and downloading dictionary of data including number of publications, accession IDs, and relevant MeSH term 
Entrez.email = "eileen.cahill@nih.gov"
handle = Entrez.esearch(db="pubmed", retmax=100, term="genom*", idtype="acc")
record = Entrez.read(handle)
handle.close()


print(record)
if record['Count'] == '0':
    print("\nYour search did not return any publications. Try another search.\n")
    #ErrorList and WarningList may be absent, so .get() is used to avoid a KeyError
    print("Error list says: " + str(record.get('ErrorList')) + "\n")
    print("Warning list says: " + str(record.get('WarningList')))
else:
    print("Search successful.")

Search successful.
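
The search above requests 100 IDs (`retmax=100`), which is fewer than the 1,000-row minimum from part 1. ESearch also accepts a `retstart` offset, so one possible approach is to page through the results in batches. The helper below is a hypothetical sketch that only computes the `(retstart, retmax)` pairs; it makes no network calls.

```python
def paging_plan(total_wanted, batch_size=100):
    """Return (retstart, retmax) pairs that together cover total_wanted records."""
    plan = []
    for start in range(0, total_wanted, batch_size):
        # The last batch may be smaller than batch_size.
        plan.append((start, min(batch_size, total_wanted - start)))
    return plan

plan = paging_plan(1000, batch_size=100)
print(len(plan), plan[0], plan[-1])
```

Each pair could then be passed to `Entrez.esearch` as `retstart` and `retmax`, with the resulting `IdList` values collected into a single list.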


In [3]:
#looking at data and types
print(type(record))
print(type(handle))
#print(record)
#print(record['IdList'])


<class 'Bio.Entrez.Parser.DictionaryElement'>
<class '_io.TextIOWrapper'>


In [4]:
#converting type from Bio.Entrez.Parser.DictionaryElement to a list of [key, value] pairs
#(no variable is named "list", which would shadow the built-in)
listrecords = [[key, value] for key, value in record.items()]
print(listrecords)
print(type(listrecords))

<class 'list'>


In [5]:
#creating an iterable list of PubMed IDs (stored as strings)
id_list_acquire = listrecords[3]
print(id_list_acquire)
id_list = [pmid for pmid in id_list_acquire[1]]
print(id_list)

['IdList', ['32369867', '32369848', '32369831', '32369821', '32369817', '32369810', '32369809', '32369805', '32369735', '32369734', '32369668', '32369633', '32369600', '32369593', '32369588', '32369585', '32369572', '32369566', '32369565', '32369554', '32369553', '32369552', '32369522', '32369498', '32369496', '32369493', '32369486', '32369481', '32369457', '32369456', '32369452', '32369445', '32369358', '32369273', '32369165', '32369034', '32369020', '32369019', '32369005', '32369004', '32369003', '32369001', '32369000', '32368999', '32368997', '32368983', '32368982', '32368974', '32368927', '32368866', '32368861', '32368792', '32368734', '32368696', '32368685', '32368569', '32368540', '32368513', '32368431', '32368353', '32368352', '32368345', '32368326', '32368312', '32368310', '32368301', '32368296', '32368197', '32368194', '32368185', '32368168', '32368150', '32368149', '32368108', '32368105', '32368102', '32368101', '32368095', '32367807', '32367804', '32367802', '32367801', '323

In [6]:
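list_titles = []
list_pmids = []

#requesting all summaries in one call; IDs are passed as a comma-separated string
handle2 = Entrez.esummary(db="pubmed", id=",".join(id_list), retmode="xml")

#each summary is a dictionary-like object; the loop variable is named "summary"
#so the search results stored in "record" are not overwritten
for summary in Entrez.parse(handle2):
    list_pmids.append(summary['Id'])
    list_titles.append(summary['Title'])

handle2.close()

#print(list_titles)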

In [7]:
#showing the search-metadata dataframe
df = pd.DataFrame(listrecords)
print(df)

#showing the titles dataframe and its shape
df = pd.DataFrame(list_titles)
print(df)
df.shape

0                                                  1
0             Count                                             799431
1            RetMax                                                100
2          RetStart                                                  0
3            IdList  [32369867, 32369848, 32369831, 32369821, 32369...
4    TranslationSet                                                 []
5  TranslationStack  [{'Term': 'genom[All Fields]', 'Field': 'All F...
6  QueryTranslation  genom[All Fields] OR genom'ic[All Fields] OR g...
                                                    0
0   5' splice site GC&gt;GT and GT&gt;GC variants ...
1   Platelet Integrin αIIbβ3 Activation is Associa...
2   Increased urinary exosomal SYT17 levels in chr...
3   Mutational Profiling of Driver Tumor Suppresso...
4   What Will It Take to Build an Expert Group of ...
..                                                ...
95  CAM guard cell anion channel activity follows ...
96  Fine-tuning th

(100, 1)

In [8]:
#joining titles with a space so the last word of one title does not merge with the first word of the next
text_glob = " ".join(list_titles)

#replacing punctuation and markup remnants with spaces
for char in [".", ",", ":", ";", "(", ")", "?", "&", "'", "<", ">", "/i"]:
    text_glob = text_glob.replace(char, " ")

text_lower = text_glob.lower()
print(text_lower)

5  splice site gc gt gt and gt gt gc variants differ markedly in terms of their functionality and pathogenicity platelet integrin αiibβ3 activation is associated with 25-hydroxyvitamin d concentrations in healthy adults increased urinary exosomal syt17 levels in chronic active antibody-mediated rejection after kidney transplantation via the il-6 amplifier mutational profiling of driver tumor suppressor and oncogenic genes in brazilian malignant pleural mesotheliomas what will it take to build an expert group of nutrigenomic practitioners nonmosaic trisomy 19p13 3p13 2 resulting from a rare unbalanced t y 19  q12 p13 2  translocation in a patient with pachygyria and polymicrogyria introducing the bird chromosome database  an overview of cytogenetic studies in birds cytogenetic and molecular characterization of three mimetic species of the genus alagoasa bechyné 1955  coleoptera  alticinae  from the neotropical region nad sup + /sup  controls circadian reprogramming through per2 nuclear 
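
The replacement chain above leaves fragments such as `gt` (from `&gt;`) and `sup` (from `<sup>`) in the text. One alternative, sketched here with only the standard library, is to decode HTML entities first and strip simple tags before removing punctuation. The function name `clean_titles` and the sample titles are hypothetical, and the output will differ from the cell above.

```python
import html
import re

def clean_titles(titles):
    """Join titles, decode HTML entities, drop tags and punctuation, lowercase."""
    text = " ".join(titles)
    text = html.unescape(text)                    # "&gt;" becomes ">"
    text = re.sub(r"</?\w+>", " ", text)          # drop simple tags such as <i> or </sup>
    text = re.sub(r"[.,:;()?&'<>]", " ", text)    # same punctuation set as the cell above
    return re.sub(r"\s+", " ", text).strip().lower()

sample = ["GC&gt;GT variants: a review.", "NAD<sup>+</sup> controls reprogramming"]
print(clean_titles(sample))
```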

In [9]:
#tokenization
from nltk.tokenize import word_tokenize
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

#Create word tokens and count frequencies
tokens = word_tokenize(text_lower)
#print(tokens)

#removing stopwords
stops = set(stopwords.words('english'))
#named filtered_tokens so the imported stopwords module is not overwritten
filtered_tokens = [x for x in tokens if x not in stops]
#print(filtered_tokens)

#lemmatize words
lemmatizer = WordNetLemmatizer()

lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in filtered_tokens])
#print(lemmatized_output)

new_text_glob = word_tokenize(lemmatized_output)
#print(new_text_glob)

#printing most frequently used words
freq_words = FreqDist(new_text_glob)
most_freq_words = freq_words.most_common(15)
print(most_freq_words)

[('gene', 12), ('cell', 11), ('sp', 8), ('genetic', 8), ('nov', 8), ('virus', 8), ('cancer', 8), ('analysis', 7), ('patient', 6), ('novel', 6), ('characterization', 5), ('human', 5), ('using', 5), ('development', 5), ('sequencing', 5)]


In [15]:
#Named Entity Recognition
from nltk import pos_tag
from nltk import ne_chunk

import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
tags = pos_tag(new_text_glob)

chunk = ne_chunk(tags)

#print(chunk)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/eileencahill/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/eileencahill/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     /Users/eileencahill/nltk_data...
[nltk_data]   Package words is already up-to-date!


In [17]:
#Establishing rules for chunking
from nltk import RegexpParser

#each chunk pattern must open with "{" and close with "}"
rules = '''
    NP: {<DT>?<JJ>*<NN>}
    VB: {<VBP>*<RB>}
    '''
chunkparser = RegexpParser(rules)
result = chunkparser.parse(tags)
print(result)

In [16]:
#RegexpParser must be imported before it is used
from nltk import RegexpParser

rule1 = "NP: {<DT>?<JJ>*<NN>*<NNS>}"
rule2 = "VB: {<VBP>*<RB>}"
#rule3 only matches if an earlier stage has already produced VP chunks
rule3 = "VP: {<VP>+<NN>}"

chunkparse1 = RegexpParser(rule1)
result1 = chunkparse1.parse(tags)
chunkparse2 = RegexpParser(rule2)
result2 = chunkparse2.parse(result1)
chunkparse3 = RegexpParser(rule3)
result3 = chunkparse3.parse(result2)
print(result3)

In [34]:
#the whole grammar, including the NP label, must be a single string in straight quotes
reg = "NP: {<DT>|<JJ>*<NN>}"
a = nltk.RegexpParser(reg)
result = a.parse(tags)
print(result)
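
As a small, self-contained illustration of how a chunk grammar is applied, the sketch below runs `RegexpParser` on a hand-tagged sentence and collects the words in each NP chunk. The tokens and tags are made up for illustration, and no tagger models are needed because the tags are supplied directly.

```python
from nltk import RegexpParser

# Hand-tagged tokens, so no tagger models need to be downloaded for this example.
tagged = [("the", "DT"), ("novel", "JJ"), ("gene", "NN"),
          ("regulates", "VBZ"), ("development", "NN")]

# NP: optional determiner, any number of adjectives, then a singular noun.
parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = parser.parse(tagged)

# Collect the words inside each NP chunk.
noun_phrases = [[word for word, tag in subtree.leaves()]
                for subtree in tree.subtrees()
                if subtree.label() == "NP"]
print(noun_phrases)
```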

## 7) Give me a problem statement.
Below, write a problem statement. Keep in mind that your task is to tease out relationships in your data and eventually build a predictive model. Your problem statement can be vague, but you should have a goal in mind. Your problem statement should be between one sentence and one paragraph.

In [12]:
#It will be helpful for NIH administrators to know which aspects of a given scientific area are most common in publications and which are least common. For example, if we know that "human" is the most common term in a PubMed search of "genomics" publications, then we learn that there is a strong human emphasis in genomic research.

## 8) What is your _y_-variable?
For your final project, you will need to build a statistical model. This means you will have to accurately predict some y-variable from some combination of x-variables. From your problem statement in part 7, what is that y-variable?

In [13]:
#How many times a word is used in publication abstract/titles.
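
One way to turn this y-variable into numbers is sketched below, using only the standard library. The function name `word_counts`, the sample titles, and the tiny stop-word set are hypothetical and stand in for the real titles and the NLTK stop-word list.

```python
from collections import Counter

def word_counts(titles, stop_words=frozenset()):
    """Count how often each lowercase word appears across all titles."""
    counts = Counter()
    for title in titles:
        counts.update(w for w in title.lower().split() if w not in stop_words)
    return counts

titles = ["Gene expression in human cells",
          "A novel gene in cancer",
          "Human cancer genomics"]
counts = word_counts(titles, stop_words={"in", "a"})
print(counts["gene"], counts["human"], counts["cells"])
```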