# Final Project Template 

## 1) Get your data
You may use any data set(s) you like, so long as they meet these criteria:

* Your data must be publicly available for free.
* Your data should be interesting to _you_. You want your final project to be something you're proud of.
* Your data should be "big enough":
    - It should have at least 1,000 rows.
    - It should have enough columns to be interesting.
    - If you have questions, contact a member of the instructional team.
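
The size criteria above can be checked mechanically once the data are loaded. The sketch below is only an illustration: `is_big_enough` and `MIN_COLUMNS` are hypothetical names, and the column threshold is an assumption, since the template does not define "enough" columns. The function takes a `(rows, columns)` tuple such as the one returned by a DataFrame's `shape` attribute.

```python
MIN_ROWS = 1000   # taken from the criteria above
MIN_COLUMNS = 2   # assumption: the template does not say how many columns are "enough"


def is_big_enough(shape, min_rows=MIN_ROWS, min_columns=MIN_COLUMNS):
    """Return True if a (rows, columns) shape meets both size thresholds."""
    n_rows, n_columns = shape
    return n_rows >= min_rows and n_columns >= min_columns


print(is_big_enough((1500, 8)))  # True
print(is_big_enough((100, 1)))   # False
```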

## 2) Provide a link to your data
Your data is required to be free and open to anyone.
As such, you should have a URL which anyone can use to download your data:

In [3]:
# https://www.ncbi.nlm.nih.gov/pubmed/

## 3) Import your data

In the space below, import your data. If your data span multiple files, read them all in. If applicable, merge or append them as needed.
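
For data that span several files, appending can look roughly like the sketch below. It is a dependency-free illustration using the standard `csv` module; the helper name `read_and_append`, the column names, and the in-memory "files" are all made up for the example. With pandas, reading each file with `pd.read_csv` and combining the results with `pd.concat` would be the analogous approach.

```python
import csv
import io


def read_and_append(file_objects):
    """Read several CSV sources that share a header and append their rows."""
    rows = []
    for file_object in file_objects:
        rows.extend(csv.DictReader(file_object))
    return rows


# In-memory stand-ins for two hypothetical files with the same columns.
part_one = io.StringIO("pmid,title\n1,First title\n2,Second title\n")
part_two = io.StringIO("pmid,title\n3,Third title\n")

combined = read_and_append([part_one, part_two])
print(len(combined))         # 3
print(combined[2]["title"])  # Third title
```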

In [4]:
#importing needed packages
import pandas as pd
import numpy as np 
from Bio import Entrez

#searching PubMed for "genomics" and downloading dictionary of data including number of publications, accession IDs, and relevant MeSH term 
Entrez.email = "eileen.cahill@nih.gov"
handle = Entrez.esearch(db="pubmed", retmax=100, term="genom*", idtype="acc")
record = Entrez.read(handle)
handle.close()

#optional print because data size is large
#print(record)

if record['Count'] == '0':
    print("\nYour search did not return any publications. Try another search.\n")
    print("Error list says: " + str(record['ErrorList']) + "\n")
    print("Warning list says: " + str(record['WarningList']))
else:
    print("Search successful.")

Search successful.


In [5]:
#simply showing data type
print(type(record))


<class 'Bio.Entrez.Parser.DictionaryElement'>


In [6]:
#converting all data from type Bio.Entrez.Parser.DictionaryElement to a list of [key, value] pairs
#(the name "pair" avoids shadowing Python's built-in list type)
listrecords = []

for key, value in record.items():
    pair = [key, value]
    listrecords.append(pair)

#optional: examine list of records (large)
#print(listrecords)

print(type(listrecords))

<class 'list'>
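
The values collected above can also be reached by key instead of by position. The sketch below uses a plain dictionary with made-up values as a stand-in for the search result, on the assumption that the parsed result supports dictionary-style lookup (the `record.items()` call above suggests it does).

```python
# A plain dict standing in for the Entrez search result (hypothetical values).
example_record = {
    "Count": "3",
    "RetMax": "3",
    "RetStart": "0",
    "IdList": ["111", "222", "333"],
}

# Position-based access depends on the order in which the keys happen to appear.
as_pairs = [[key, value] for key, value in example_record.items()]
ids_by_position = as_pairs[3][1]

# Key-based access states the intent and does not depend on key order.
ids_by_key = list(example_record["IdList"])

print(ids_by_position == ids_by_key)  # True
```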


In [7]:
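#creating iterable list of IDs
#note: index 3 assumes 'IdList' is the fourth key in the search result
id_list_acquire = listrecords[3]
print(id_list_acquire)
id_list = []

#the loop name "pmid" avoids shadowing Python's built-in id function
for pmid in id_list_acquire[1]:
    id_list.append(pmid)
print(id_list)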

['IdList', ['32375182', '32375181', '32375177', '32375154', '32375120', '32375053', '32375052', '32375049', '32375047', '32375040', '32375039', '32375038', '32375033', '32375029', '32375028', '32375018', '32374917', '32374914', '32374878', '32374876', '32374873', '32374870', '32374866', '32374865', '32374863', '32374858', '32374845', '32374844', '32374843', '32374833', '32374831', '32374823', '32374820', '32374793', '32374791', '32374757', '32374732', '32374727', '32374726', '32374657', '32374631', '32374515', '32374503', '32374462', '32374452', '32374423', '32374413', '32374345', '32374294', '32374287', '32374261', '32374252', '32374251', '32374190', '32374181', '32374079', '32374001', '32373998', '32373977', '32373937', '32373862', '32373753', '32373711', '32373663', '32373648', '32373647', '32373633', '32373597', '32373583', '32373565', '32373552', '32373550', '32373547', '32373535', '32373528', '32373524', '32373523', '32373514', '32373482', '32373477', '32373361', '32373300', '323

In [8]:
#search PubMed for title information using the PMIDs that resulted from the keyword search
list_titles = []
list_pmids = []

for item in id_list:
    handle2 = Entrez.esummary(db="pubmed", id=item, retmode="xml")
    records2 = Entrez.parse(handle2)

    #a separate loop name keeps the earlier search result stored in "record" intact
    for summary in records2:
        #each summary is a Python dictionary or list.
        list_titles.append(summary['Title'])

    #close each handle once its summaries have been read
    handle2.close()

#print(list_titles)
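
The loop above sends one request per PMID. A possible alternative is to group the IDs so that each request covers several of them. The sketch below shows only the grouping step; `batch_ids` is a hypothetical helper, the batch size of 2 is purely illustrative, and it assumes the summary service accepts a comma-separated ID string in a single request.

```python
def batch_ids(ids, batch_size):
    """Split a list of IDs into comma-separated strings of at most batch_size IDs."""
    return [
        ",".join(ids[start:start + batch_size])
        for start in range(0, len(ids), batch_size)
    ]


example_ids = ["32375182", "32375181", "32375177", "32375154", "32375120"]
for id_string in batch_ids(example_ids, batch_size=2):
    # Each string could be passed as the id argument of one summary request.
    print(id_string)
```

Fewer, larger requests are generally gentler on a shared service, though the right batch size depends on the service's own limits.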

In [9]:
#showing the dataframe and shape
df = pd.DataFrame(list_titles)
print(df)
df.shape

0
0   Liraglutide Ameliorates Lipotoxicity-Induced O...
1   Mitochondrial DNA segregation and replication ...
2   Onset of hippocampal network aberration and me...
3   The NASA Twins Study: The Effect of One Year i...
4   Primary hyperparathyroidism as first manifesta...
..                                                ...
95  Genome-Wide Analysis of Methylation-Driven Gen...
96  Identifying Shared Risk Genes for Asthma, Hay ...
97  Genomic Breeding Programs Realize Larger Benef...
98  Drug Targeting of Genomic Instability in Multi...
99  Whole Genome Scan Reveals Molecular Signatures...

[100 rows x 1 columns]


(100, 1)

In [10]:
#create one mass of text, replace special characters, and create all lowercase text
#joining with a space keeps the last word of one title from fusing with the first word of the next
text_glob = " ".join(list_titles)

#replace each special character (and the "/i" left over from italic tags) with a space
for special in [".", ",", ":", ";", "(", ")", "?", "&", "'", "<", ">", "/i"]:
    text_glob = text_glob.replace(special, " ")

text_lower = text_glob.lower()
#print(text_lower)
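
The same cleaning step can be written with regular expressions. The sketch below is one possible variant, not a drop-in replacement: `clean_titles` is a hypothetical helper, the example titles are made up, and it removes whole tags such as `<i>` before handling the remaining punctuation, which differs slightly from replacing the angle brackets one character at a time.

```python
import re


def clean_titles(titles):
    """Join titles, drop simple markup tags, replace punctuation with spaces, lowercase."""
    text = " ".join(titles)
    text = re.sub(r"</?\w+>", " ", text)        # remove simple tags such as <i> and </i>
    text = re.sub(r"[.,:;()?&'<>]", " ", text)  # same punctuation set as the cell above
    return re.sub(r"\s+", " ", text).strip().lower()


example_titles = ["The <i>BRCA1</i> Gene: A Review.", "Risk, Data & Selection?"]
print(clean_titles(example_titles))
# the brca1 gene a review risk data selection
```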

In [11]:
#Tokenization and lemmatization steps <-- Text Preprocessing
from nltk.tokenize import word_tokenize
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

#Create word tokens and count frequencies
tokens = word_tokenize(text_lower)
#print(tokens)

#removing stopwords
stops = set(stopwords.words('english'))
#a separate name avoids overwriting the imported stopwords corpus, so the cell can be re-run
filtered_tokens = [x for x in tokens if x not in stops]
#print(filtered_tokens)

#lemmatize words
lemmatizer = WordNetLemmatizer()

lemmatized_output = ' '.join([lemmatizer.lemmatize(word) for word in filtered_tokens])
#print(lemmatized_output)

new_text_glob = word_tokenize(lemmatized_output)
#print(new_text_glob)

#printing most frequently used words
freq_words = FreqDist(new_text_glob)
most_freq_words = freq_words.most_common(15)
print(most_freq_words)

[('gene', 12), ('analysis', 11), ('study', 8), ('genetic', 8), ('genomic', 7), ('protein', 7), ('cell', 6), ('tumor', 6), ('disease', 6), ('dna', 5), ('cancer', 5), ('data', 5), ('selection', 5), ('patient', 5), ('risk', 5)]
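
The counting step itself needs nothing beyond the standard library. The sketch below reproduces the idea with `collections.Counter` on a made-up sentence; the stop-word set is a tiny illustrative stand-in, not NLTK's English list, and no lemmatization is applied.

```python
from collections import Counter

STOP_WORDS = {"the", "of", "and", "in", "a"}  # tiny illustrative set, not NLTK's list

example_tokens = "the analysis of gene expression and gene regulation in the cell".split()
kept_tokens = [token for token in example_tokens if token not in STOP_WORDS]

counts = Counter(kept_tokens)
print(counts.most_common(2))  # [('gene', 2), ('analysis', 1)]
```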


In [12]:
#Named Entity Recognition
from nltk import pos_tag
from nltk import ne_chunk
import matplotlib.pyplot as plt
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

tags = pos_tag(new_text_glob)
chunk = ne_chunk(tags)

#print(chunk)

tags_freq = FreqDist(tag for (word, tag) in chunk)
print(tags_freq.most_common())
#plot is a method of the FreqDist object, not of matplotlib.pyplot
tags_freq.plot(cumulative = True)
plt.show()

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/eileencahill/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/eileencahill/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     /Users/eileencahill/nltk_data...
[nltk_data]   Package words is already up-to-date!
[('NN', 546), ('JJ', 294), ('VBG', 32), ('NNS', 23), ('VBD', 19), ('CD', 17), ('VBN', 13), ('VBZ', 12), ('VBP', 12), ('IN', 11), ('RB', 7), ('FW', 5), ('NNP', 4), ('VB', 4), ('JJR', 4), ('JJS', 1), ('RBS', 1), ('RBR', 1)]


<Figure size 640x480 with 1 Axes>

<matplotlib.axes._subplots.AxesSubplot at 0x1a27585050>

In [None]:
#Establishing rules for chunking
#importing only the parser that is used, instead of wildcard imports
from nltk import RegexpParser
rules = r'''
NP: {<JJ><NN.>}
    {<VB.>*<RB>}
    {<NN><NN.>+}
    {<VP><NN.>+}
    {<NN.><RB>+}
    {<NNP>+}
    {<NN><NN>}
    '''
chunkparse = RegexpParser(rules)
result = chunkparse.parse(tags)
print(result)
result.draw()
#print(tags)
#nouns = [word for (word, tag) in tags if tag == "NN"]

(S
  liraglutide/JJ
  ameliorates/VBZ
  lipotoxicity-induced/JJ
  oxidative/JJ
  stress/NN
  activating/VBG
  nrf2/JJ
  (NP pathway/NN hepg2/NN)
  cell/NN
  mitochondrial/JJ
  (NP dna/NN segregation/NN)
  (NP replication/NN restrict/NN)
  transmission/NN
  detrimental/JJ
  (NP mutation/NN onset/NN)
  hippocampal/JJ
  (NP network/NN aberration/NN)
  (NP memory/NN deficit/NN)
  (NP p301s/NN tau/NN)
  mouse/NN
  associated/VBN
  early/JJ
  (NP gene/NN signature/NN)
  nasa/FW
  (NP twin/NN study/NN)
  effect/NN
  one/CD
  (NP year/NN space/NN)
  long-chain/NN
  fatty/JJ
  acid/NN
  desaturases/VBZ
  elongases/VBZ
  primary/JJ
  hyperparathyroidism/NN
  first/JJ
  (NP manifestation/NN men/NNS)
  2a/CD
  international/JJ
  (NP multicenter/NN study/NN)
  suppression/NN
  inflammasome/JJ
  (NP activation/NN irf8/NN)
  (NP irf4/NN cdc/NN)
  critical/JJ
  cell/NN
  priming/VBG
  (NP senataxin/NN ortholog/NN)
  sen1/JJ
  (NP limit/NN dna/NN)
  rna/VBP
  hybrid/JJ
  (NP accumulation/NN dna/NN)
  d
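
The tag patterns in the rules above are written in a regular-expression-like syntax. Assuming `.` acts as a single-character wildcard inside a tag pattern, as it does in Python's `re` module, `<NN.>` would match `NNS` and `NNP` but not plain `NN`. That reading is consistent with the output above, where `oxidative/JJ stress/NN` is left unchunked even though a `{<JJ><NN.>}` rule exists. The sketch below demonstrates the difference with plain `re`; it is an analogy, not a call into NLTK.

```python
import re

tags_to_test = ["NN", "NNS", "NNP"]

for pattern in ("NN.", "NN.?"):
    # fullmatch requires the whole tag to match the pattern
    matched = [tag for tag in tags_to_test if re.fullmatch(pattern, tag)]
    print(pattern, matched)
# NN. ['NNS', 'NNP']
# NN.? ['NN', 'NNS', 'NNP']
```

If singular nouns are meant to be included, a pattern with an optional trailing character is one option to consider.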

In [None]:
#training a model, bag-of-words

from sklearn.feature_extraction.text import CountVectorizer

#you can do preprocessing first and then pass the processed texts to CountVectorizer

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(list_titles)

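#print(vectorizer.get_feature_names_out())
print(vectors.toarray())

#get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases no longer provide
freq = pd.DataFrame(data = vectors.toarray(), columns = vectorizer.get_feature_names_out())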

print(freq)
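
To make the shape of a bag-of-words table concrete, the sketch below builds one by hand for two made-up titles. It is a simplification for illustration only: it splits on whitespace, whereas a real vectorizer applies its own tokenization rules, so the columns it produces may differ.

```python
from collections import Counter

example_titles = ["gene analysis", "gene gene risk"]

# Whitespace tokenization is a simplification for illustration.
tokenized = [title.lower().split() for title in example_titles]
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per title, one column per vocabulary term; each cell holds a raw count.
matrix = [[Counter(tokens)[term] for term in vocabulary] for tokens in tokenized]

print(vocabulary)  # ['analysis', 'gene', 'risk']
print(matrix)      # [[1, 1, 0], [0, 2, 1]]
```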


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectors = tfidf_vectorizer.fit_transform(list_titles)

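#get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases no longer provide
tfidfs = pd.DataFrame(data=tfidf_vectors.toarray(), columns=tfidf_vectorizer.get_feature_names_out())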

print(tfidfs)
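
The weighting behind a TF-IDF table can be sketched by hand. The version below assumes raw term counts, a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row, which is my understanding of scikit-learn's default settings; treat it as an illustration under those assumptions rather than a reimplementation. `tfidf_rows` and the example documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Compute L2-normalized TF-IDF rows with smoothed IDF for tokenized documents."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows


vocabulary, rows = tfidf_rows([["gene", "analysis"], ["gene", "risk"]])
print(vocabulary)
print([round(w, 3) for w in rows[0]])
```

In this example "gene" appears in both documents, so it receives a lower weight than "analysis", which appears in only one.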

## 7) Give me a problem statement.
Below, write a problem statement. Keep in mind that your task is to tease out relationships in your data and eventually build a predictive model. Your problem statement can be vague, but you should have a goal in mind. Your problem statement should be between one sentence and one paragraph.

In [None]:
#It will be helpful for NIH administrators to know which aspects of a certain scientific area are most common in publications and which are least common. For example, if we know that "human" is the most common term in a PubMed search of "genomics" publications, then we can infer that there is a strong human emphasis in genomic research.

## 8) What is your _y_-variable?
For the final project, you will need to fit a statistical model. This means you will have to accurately predict some y-variable from some combination of x-variables. From your problem statement in part 7, what is that y-variable?

In [None]:
#How many times a word is used in publication abstracts/titles.
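
One possible reading of this y-variable is a per-title count of a chosen term. The sketch below illustrates that reading only: `term_counts` is a hypothetical helper, the titles are made up, and whitespace splitting stands in for the fuller preprocessing used earlier.

```python
def term_counts(titles, term):
    """Return, for each title, how many times the term appears as a whole word."""
    term = term.lower()
    return [title.lower().split().count(term) for title in titles]


example_titles = ["Gene analysis of a gene family", "Risk factors in disease"]
y = term_counts(example_titles, "gene")
print(y)  # [2, 0]
```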