# Final Project Template 

## 1) Get your data
You may use any data set(s) you like, so long as they meet these criteria:

* Your data must be publicly available for free.
* Your data should be interesting to _you_. You want your final project to be something you're proud of.
* Your data should be "big enough":
    - It should have at least 1,000 rows.
    - It should have enough columns to be interesting.
    - If you have questions, contact a member of the instructional team.

## 2) Provide a link to your data
Your data is required to be free and open to anyone.
As such, you should have a URL which anyone can use to download your data:

In [124]:
# https://www.ncbi.nlm.nih.gov/pubmed/

## 3) Import your data

In the space below, import your data. If your data span multiple files, read them all in. If applicable, merge or append them as needed.
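
As a minimal sketch of the "read them all in, then append" step, the example below reads two small in-memory CSV files and stacks their rows. The file contents, the column names, and the `read_rows` helper are hypothetical and used only for illustration; with real files you would open paths instead of `io.StringIO` objects.

```python
import csv
import io

# Hypothetical stand-ins for two files that share the same columns.
file_a = io.StringIO("pmid,year\n32362049,2020\n32362048,2020\n")
file_b = io.StringIO("pmid,year\n32362037,2020\n")

def read_rows(handle):
    """Return the rows of one CSV file as a list of dictionaries."""
    return list(csv.DictReader(handle))

# Append: stack the rows of every file into one list.
all_rows = []
for handle in (file_a, file_b):
    all_rows.extend(read_rows(handle))

print(len(all_rows))
```

Appending works here because both files share the same columns; files with different columns would need a merge on a shared key instead.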

In [125]:
#importing needed packages
import pandas as pd
import numpy as np 
from Bio import Entrez

#searching PubMed for "genomics" and downloading dictionary of data including number of publications, accession IDs, and relevant MeSH term 
Entrez.email = "eileen.cahill@nih.gov"
handle = Entrez.esearch(db="pubmed", retmax=11, term="genomics", idtype="acc")
record = Entrez.read(handle)
handle.close()
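
The search above asks for only 11 IDs (`retmax=11`). One way to collect more is to page through the results with the ESearch `retstart` and `retmax` parameters. The sketch below only computes the page offsets; `page_offsets` is a hypothetical helper, the page size of 250 is an arbitrary choice, and the commented-out loop shows where `Entrez.esearch` would be called, assuming network access.

```python
def page_offsets(total, page_size):
    """Return the retstart value for each page needed to cover `total` results."""
    return list(range(0, total, page_size))

# Example: 1,000 IDs fetched 250 at a time.
offsets = page_offsets(1000, 250)
print(offsets)

# Assumed usage with Biopython (needs network access, so left commented out):
# all_ids = []
# for start in offsets:
#     handle = Entrez.esearch(db="pubmed", term="genomics",
#                             retstart=start, retmax=250)
#     all_ids.extend(Entrez.read(handle)["IdList"])
#     handle.close()
```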

In [126]:
#looking at data and types
print(type(record))
print(type(handle))
print(record)
print(record['IdList'])
print(type(record['IdList']))

<class 'Bio.Entrez.Parser.DictionaryElement'>
<class '_io.TextIOWrapper'>
{'Count': '246416', 'RetMax': '11', 'RetStart': '0', 'IdList': ['32362049', '32362048', '32362037', '32361995', '32361911', '32361833', '32361785', '32361592', '32361136', '32361020', '32360957'], 'TranslationSet': [{'From': 'genomics', 'To': '"genomics"[MeSH Terms] OR "genomics"[All Fields]'}], 'TranslationStack': [{'Term': '"genomics"[MeSH Terms]', 'Field': 'MeSH Terms', 'Count': '110045', 'Explode': 'Y'}, {'Term': '"genomics"[All Fields]', 'Field': 'All Fields', 'Count': '186827', 'Explode': 'N'}, 'OR', 'GROUP'], 'QueryTranslation': '"genomics"[MeSH Terms] OR "genomics"[All Fields]'}
['32362049', '32362048', '32362037', '32361995', '32361911', '32361833', '32361785', '32361592', '32361136', '32361020', '32360957']
<class 'Bio.Entrez.Parser.ListElement'>


In [127]:
#converting type from Bio.Entrez.Parser.DictionaryElement to list
listrecords = []

for key, value in record.items():
    # name each [key, value] pair "pair" so the built-in list is not shadowed
    pair = [key, value]
    listrecords.append(pair)
print(listrecords)
print(type(listrecords))

[['Count', '246416'], ['RetMax', '11'], ['RetStart', '0'], ['IdList', ['32362049', '32362048', '32362037', '32361995', '32361911', '32361833', '32361785', '32361592', '32361136', '32361020', '32360957']], ['TranslationSet', [{'From': 'genomics', 'To': '"genomics"[MeSH Terms] OR "genomics"[All Fields]'}]], ['TranslationStack', [{'Term': '"genomics"[MeSH Terms]', 'Field': 'MeSH Terms', 'Count': '110045', 'Explode': 'Y'}, {'Term': '"genomics"[All Fields]', 'Field': 'All Fields', 'Count': '186827', 'Explode': 'N'}, 'OR', 'GROUP']], ['QueryTranslation', '"genomics"[MeSH Terms] OR "genomics"[All Fields]']]
<class 'list'>


In [128]:
#creating iterable list of IDs of both string and int type
id_list_acquire = listrecords[3]
print(id_list_acquire)
id_list = []
id_list_int = []

for pmid in id_list_acquire[1]:
    id_list.append(pmid)
print(id_list)

# convert each ID to int; the loop variable itself must be passed to int()
for pmid in id_list_acquire[1]:
    id_list_int.append(int(pmid))
print(id_list_int)

['IdList', ['32362049', '32362048', '32362037', '32361995', '32361911', '32361833', '32361785', '32361592', '32361136', '32361020', '32360957']]
['32362049', '32362048', '32362037', '32361995', '32361911', '32361833', '32361785', '32361592', '32361136', '32361020', '32360957']
[32362049, 32362048, 32362037, 32361995, 32361911, 32361833, 32361785, 32361592, 32361136, 32361020, 32360957]


In [129]:
for item in id_list:
    handle2 = Entrez.esummary(db="pubmed", id=item, retmode="xml")
    records2 = Entrez.parse(handle2)
    for summary in records2:
        # each summary is a Python dictionary or list.
        print(summary['Title'])
        print(summary['LastAuthor'])
        print(summary['PubDate'])
    # close every handle once its summaries have been read, not just the last one
    handle2.close()

How many young drivers do not meet the driver licencing vision requirements?
Lee SS
2020 May 3
Detection of runs of homozygosity in conserved and commercial pig breeds in Poland.
Gurgul A
2020 May 3
Structural and chemical trapping of flavin-oxide intermediates reveals substrate-directed reaction multiplicity.
Li TL
2020 May 3
Genome-wide DNA methylation patterns associated with sleep and mental health in children: a population-based study.
Cecil CAM
2020 May 3
Single-cell transcriptome analysis of the novel coronavirus (SARS-CoV-2) associated gene ACE2 expression in normal and non-obstructive azoospermia (NOA) human male testes.
Qiao J
2020 Apr 30
Correction to: InRange: Comparison of the Second-Generation Basal Insulin Analogues Glargine 300 U/mL and Degludec 100 U/mL in Persons with Type 1 Diabetes Using Continuous Glucose Monitoring-Study Design.
Bergenstal R
2020 May 2
Genetic interactions among Pto-miR319 family members and their targets influence growth and wood properties in Po
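
The loop above makes one ESummary request per ID. ESummary also accepts a comma-separated list of IDs, so requests can be batched. The sketch below only builds the batched `id` strings; `batch_ids` is a hypothetical helper, the batch size of 5 is an arbitrary choice, and the commented-out loop assumes network access.

```python
def batch_ids(ids, size):
    """Split `ids` into comma-separated strings of at most `size` IDs each."""
    return [",".join(ids[i:i + size]) for i in range(0, len(ids), size)]

id_list = ['32362049', '32362048', '32362037', '32361995', '32361911',
           '32361833', '32361785', '32361592', '32361136', '32361020',
           '32360957']

batches = batch_ids(id_list, 5)
print(len(batches))
print(batches[-1])

# Assumed usage (needs network access, so left commented out):
# for batch in batches:
#     handle = Entrez.esummary(db="pubmed", id=batch, retmode="xml")
#     for summary in Entrez.parse(handle):
#         print(summary['Title'])
#     handle.close()
```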

In [130]:
#converting list of IDs to dictionary keyed by position (starting at 1)
dictionary = {position: int(pmid) for position, pmid in enumerate(id_list, start=1)}
print(dictionary)
print(type(dictionary))


{1: 32362049, 2: 32362048, 3: 32362037, 4: 32361995, 5: 32361911, 6: 32361833, 7: 32361785, 8: 32361592, 9: 32361136, 10: 32361020, 11: 32360957}
<class 'dict'>


In [131]:
#use pubmed_lookup to search pubmed for information on publications given by accession IDs
from pubmed_lookup import PubMedLookup
from pubmed_lookup import Publication

email = 'eileen.cahill@nih.gov'

for key, value in dictionary.items():
    url = "https://www.ncbi.nlm.nih.gov/pubmed/" + str(value)
    lookup = PubMedLookup(url, email)
    #print(lookup)
    #print(type(lookup))
    #print(type(key))
    #print(type(value))
    #publication = Publication(lookup)
    #CAN'T GET Publication TO WORK <- TypeError result
    
    #print(publication.abstract)


## 5) Show me the shape of your data

In [133]:
df = pd.DataFrame(listrecords)
print(df)
df.shape

0                                                  1
0             Count                                             246416
1            RetMax                                                 11
2          RetStart                                                  0
3            IdList  [32362049, 32362048, 32362037, 32361995, 32361...
4    TranslationSet  [{'From': 'genomics', 'To': '"genomics"[MeSH T...
5  TranslationStack  [{'Term': '"genomics"[MeSH Terms]', 'Field': '...
6  QueryTranslation   "genomics"[MeSH Terms] OR "genomics"[All Fields]


(7, 2)
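
The shape `(7, 2)` describes the search metadata (one row per key of `record`), not the publications themselves. A table with one row per publication is closer to what the project criteria ask for. The sketch below builds such rows from the first three summaries printed earlier in this notebook; the column names mirror the ESummary keys used there, and `table_shape` is a hypothetical helper.

```python
# Rows copied from the ESummary output shown earlier in this notebook.
rows = [
    {"Title": "How many young drivers do not meet the driver licencing vision requirements?",
     "LastAuthor": "Lee SS", "PubDate": "2020 May 3"},
    {"Title": "Detection of runs of homozygosity in conserved and commercial pig breeds in Poland.",
     "LastAuthor": "Gurgul A", "PubDate": "2020 May 3"},
    {"Title": "Structural and chemical trapping of flavin-oxide intermediates reveals substrate-directed reaction multiplicity.",
     "LastAuthor": "Li TL", "PubDate": "2020 May 3"},
]

def table_shape(rows):
    """Return (number of rows, number of distinct columns), like DataFrame.shape."""
    columns = {key for row in rows for key in row}
    return (len(rows), len(columns))

print(table_shape(rows))
```

Passing the same list of dictionaries to `pd.DataFrame(rows)` would give a DataFrame whose `.shape` reports the same pair.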

## 6) Show me the proportion of missing observations for each column of your data
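
A minimal sketch of one way to compute this, assuming the data are in a pandas DataFrame. The small `example` frame and its column names are made up for illustration; `isna().mean()` gives the fraction of missing values in each column.

```python
import pandas as pd

# Hypothetical frame with some missing values, for illustration only.
example = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "last_author": ["Lee SS", None, "Li TL", None],
    "citations": [1.0, float("nan"), 3.0, 4.0],
})

# isna() marks missing cells as True; the column mean is the missing proportion.
missing_share = example.isna().mean()
print(missing_share)
```

The same call on the project DataFrame, `df.isna().mean()`, would report one proportion per column.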

## 7) Give me a problem statement.
Below, write a problem statement. Keep in mind that your task is to tease out relationships in your data and eventually build a predictive model. Your problem statement can be vague, but you should have a goal in mind. Your problem statement should be between one sentence and one paragraph.

In [134]:
#



## 8) What is your _y_-variable?
For your final project, you will need to fit a statistical model. This means you will have to accurately predict some y-variable from some combination of x-variables. From your problem statement in part 7, what is that y-variable?

In [135]:
#