# Final Project Template 

## 1) Get your data
You may use any data set(s) you like, so long as they meet these criteria:

* Your data must be publicly available for free.
* Your data should be interesting to _you_. You want your final project to be something you're proud of.
* Your data should be "big enough":
    - It should have at least 1,000 rows.
    - It should have enough columns to be interesting.
    - If you have questions, contact a member of the instructional team.

## 2) Provide a link to your data
Your data is required to be free and open to anyone.
As such, you should have a URL which anyone can use to download your data:

#### [Dallas Lifespan Brain Study (DLBS)](http://fcon_1000.projects.nitrc.org/indi/retro/dlbs.html): _all data files are in .tar.gz format_
 - [Cognitive data](ftp://www.nitrc.org/fcon_1000/htdocs/indi/retro/dlbs_content/dlbs_cogdata.tar.gz)
 - [Neuroimaging data](ftp://www.nitrc.org/fcon_1000/htdocs/indi/retro/dlbs_content/dlbs_imaging.tar.gz)
    - [Anatomical scan parameters](ftp://www.nitrc.org/fcon_1000/htdocs/indi/retro/dlbs_content/dlbs_scan_params_anat.pdf)
    - [PET scan parameters](ftp://www.nitrc.org/fcon_1000/htdocs/indi/retro/dlbs_content/dlbs_scan_params_pet.pdf)
 - [Genetic data](ftp://www.nitrc.org/fcon_1000/htdocs/indi/retro/dlbs_content/dlbs_genetics.tar.gz)

## 3) Import your data
In the space below, import your data.
If your data span multiple files, read them all in.
If applicable, merge or append them as needed.

In [10]:
# import relevant libraries
from pathlib import Path # path management
import ftplib # ftp server access for source data file download
import time # operation timers
import tarfile # read tar.gz archive source data files
import pandas as pd # data wrangling and tabular representation
import seaborn as sns # data visualization

In [11]:
# return current working directory
Path.cwd()

PosixPath('/home/faysal/fshaikh4-GitHub/final-project')

#### Data import via `FTP` module below has been commented-out due to issues downloading neuroimaging dataset.

As an alternative, the appropriate files (in their unmodified `.tar.gz` archive format) were copied into `./data/` and used in the downstream preprocessing steps.

In [3]:
# # create './data/' dir in cwd, if not existing (but do not overwrite)
# Path('./data/').mkdir(exist_ok=True)

In [12]:
# specify URLs for source data files
ftp_url = 'www.nitrc.org'
ftp_path = 'fcon_1000/htdocs/indi/retro/dlbs_content/'

cog_fname = 'dlbs_cogdata.tar.gz'
ni_fname = 'dlbs_imaging.tar.gz'
gen_fname = 'dlbs_genetics.tar.gz'

anat_parm_fname = 'dlbs_scan_params_anat.pdf'
pet_parm_fname = 'dlbs_scan_params_pet.pdf'

In [None]:
# # start timer object
# t_i = time.time()

# # use FTP to access and download source data files (ONCE ONLY)
# ftp = ftplib.FTP(ftp_url)
# ftp.login()
# ftp.cwd(ftp_path)
# ftp.retrbinary('RETR ' + cog_fname, open('./data/' + cog_fname, 'wb').write)
# ftp.retrbinary('RETR ' + ni_fname, open('./data/' + ni_fname, 'wb').write)
# ftp.retrbinary('RETR ' + gen_fname, open('./data/' + gen_fname, 'wb').write)
# ftp.retrbinary('RETR ' + anat_parm_fname, open('./data/' + anat_parm_fname, 'wb').write)
# ftp.retrbinary('RETR ' + pet_parm_fname, open('./data/' + pet_parm_fname, 'wb').write)
# ftp.quit();

# # end timer object
# t_f = time.time()

# # print elapsed time
# print('Time elapsed: '+str(t_f - t_i)+' seconds.')
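If the FTP download is revisited, one possible shape for it is sketched below. This is a minimal sketch, not the approach used in this notebook: `download_files` is a hypothetical helper, and it assumes the server accepts anonymous logins. It uses only standard `ftplib` calls and wraps both the connection and each output file in a `with` block, so nothing is left open if a large transfer (such as the neuroimaging archive) fails partway through.

```python
import ftplib
from pathlib import Path


def download_files(host, remote_dir, filenames, dest_dir):
    """Download each named file from an FTP server into dest_dir.

    Hypothetical helper for illustration; assumes anonymous login works.
    Returns the list of local paths that were written.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target dir if needed
    saved = []
    # ftplib.FTP is a context manager: the connection is closed on exit
    with ftplib.FTP(host) as ftp:
        ftp.login()  # anonymous login
        ftp.cwd(remote_dir)
        for name in filenames:
            target = dest / name
            # the local file handle is closed even if the transfer raises
            with open(target, 'wb') as fh:
                ftp.retrbinary('RETR ' + name, fh.write)
            saved.append(target)
    return saved
```

With the variables defined above, a call would look like `download_files(ftp_url, ftp_path, [cog_fname, gen_fname], './data/')`.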

#### After obtaining relevant data files in raw `.tar.gz` archive file format, relevant data were extracted and utilized in downstream preprocessing.

In [13]:
# create extracted data subdirectories
Path('./data/cogdata/').mkdir(parents=True, exist_ok=True) # create dir (and './data/') if not existing
Path('./data/imaging/').mkdir(parents=True, exist_ok=True) # create dir (and './data/') if not existing
Path('./data/genetics/').mkdir(parents=True, exist_ok=True) # create dir (and './data/') if not existing

In [15]:
# extract relevant data files from .tar.gz archives
# (tarfile is imported above; each 'with' block closes its archive automatically)

# extract cognitive data
with tarfile.open('./data/' + cog_fname, 'r:gz') as tar:
    tar.extractall(path='./data/cogdata/')

# extract neuroimaging data
with tarfile.open('./data/' + ni_fname, 'r:gz') as tar:
    tar.extractall(path='./data/imaging/')

# extract genetics data
with tarfile.open('./data/' + gen_fname, 'r:gz') as tar:
    tar.extractall(path='./data/genetics/')
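Before reading anything into pandas, it can help to check what each archive actually contains, since the file names inside the DLBS archives are not listed in this notebook. The sketch below assumes nothing about those names; `list_archive_files` is a hypothetical helper that only returns whatever regular files are stored in a given `.tar.gz` archive.

```python
import tarfile


def list_archive_files(archive_path):
    """Return the names of the regular files stored in a .tar.gz archive."""
    with tarfile.open(archive_path, 'r:gz') as tar:
        # skip directories and links; keep only regular files
        return [member.name for member in tar.getmembers() if member.isfile()]
```

For example, `list_archive_files('./data/' + cog_fname)` would list the files inside the cognitive data archive, which can then be matched to the paths created under `./data/cogdata/`.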

In [146]:
# read each data file into an individual pandas.DataFrame object
# cog_data = pd.read_table('./data/' + cog_fname, sep='\t')
# ni_data = pd.read_table('./data/' + ni_fname, sep='\t')
# gen_data = pd.read_table('./data/' + gen_fname, sep='\t')

In [None]:
# merge data into single consolidated pandas.DataFrame object
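The merge step has not been written yet. As a rough sketch of what it could look like, the example below joins two small made-up tables on a shared key. The column names (`subject_id`, `score`, `marker`) are hypothetical and are not taken from the DLBS files; the real key column has to be identified once the data are read in.

```python
import pandas as pd

# hypothetical stand-ins for two of the source tables
cog = pd.DataFrame({'subject_id': [1, 2, 3], 'score': [0.5, 0.7, 0.9]})
gen = pd.DataFrame({'subject_id': [2, 3, 4], 'marker': ['x', 'y', 'z']})

# an outer join keeps subjects that appear in only one table;
# indicator=True adds a '_merge' column showing where each row came from
merged = cog.merge(gen, on='subject_id', how='outer', indicator=True)
print(merged)
```

An outer join is used here so that no subject is silently dropped; the `_merge` column makes it easy to count how many subjects are missing from either table before deciding whether an inner join is more appropriate.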

## Visualize neuroimaging data.

In [2]:
from nilearn import plotting
%matplotlib inline
import numpy as np
import nibabel as nb
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')

---

## 4) Show me the head of your data.

## 5) Show me the shape of your data

## 6) Show me the proportion of missing observations for each column of your data
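One common way to get the per-column proportion of missing values is to take the mean of the boolean mask returned by `isna()`. The sketch below uses a small made-up DataFrame (`df` and its columns are hypothetical) because the merged project data are not yet loaded; the same expression should apply to the real table.

```python
import numpy as np
import pandas as pd

# hypothetical example data with some missing entries
df = pd.DataFrame({
    'age': [25, np.nan, 40, 61],
    'score': [1.0, 2.0, np.nan, np.nan],
    'site': ['a', 'b', 'c', 'd'],
})

# isna() marks missing cells as True; the column-wise mean of that
# mask is the fraction of missing observations in each column
missing_prop = df.isna().mean().sort_values(ascending=False)
print(missing_prop)
```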

## 7) Give me a problem statement.
Below, write a problem statement. Keep in mind that your task is to tease out relationships in your data and eventually build a predictive model. Your problem statement can be vague, but you should have a goal in mind. Your problem statement should be between one sentence and one paragraph.

(placeholder cell to fill in later)

## 8) What is your _y_-variable?
For your final project, you will need to build a statistical model. This means you will have to accurately predict some y-variable from some combination of x-variables. From your problem statement in part 7, what is that y-variable?

(placeholder cell to fill in later)