# PHASE 1: Project Proposal & Planning

### GOALS:

1) Please provide details on 
   - the datasets, 
   - \# of obs,  
   - \# columns, 
   - what the columns are and their types. 
   - If it is geographical, 
     - what areas it covers, 
   - if it is time series, 
     - what time frame it covers. 
   - What is the label you want to predict, 
   - what are the features you use to predict the label. 

2) Develop my proposal a bit more 
     - elaborate on the columns/rows that I want
     - extract out some of the rows or use groupby to take a look at them
     - get more detailed info
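
A minimal sketch of the "extract rows / use groupby" step from goal 2. It runs on a few columns of the first five rows (copied from the `df.head()` output later in this notebook), so the names `sample`, `rated`, and `by_sex` are illustrative only and the numbers say nothing about the full dataset.

```python
import pandas as pd

# A few columns of the first five rows, copied from the df.head() output in this notebook.
sample = pd.DataFrame({
    "ID": ["OAS1_0001_MR1", "OAS1_0002_MR1", "OAS1_0003_MR1",
           "OAS1_0004_MR1", "OAS1_0005_MR1"],
    "M/F": ["F", "F", "F", "M", "M"],
    "Age": [74, 55, 73, 28, 18],
    "CDR": [0.0, 0.0, 0.5, None, None],
})

# Extract rows: keep only the subjects that have a CDR value.
rated = sample[sample["CDR"].notna()]

# Group: subject count and mean age per sex.
by_sex = sample.groupby("M/F")["Age"].agg(["count", "mean"])

print(rated)
print(by_sex)
```

On the real DataFrame the same two calls would apply to `df` directly.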

# IMPORTS

In [1]:
import os
import pandas as pd

# EXECUTE

In [2]:
# Get current file path
current_path = os.getcwd()
parent_file_path = os.path.dirname(current_path)

# Construct desired file path
file_path = os.path.join(parent_file_path, 'data', 'KAGGLE', 'kaggle-oasis-1', 'oasis_cross-sectional.csv')
print('file_path =', file_path)

file_path = c:\Users\GlaDOS\Documents\GitHub\eugene_data606\data\KAGGLE\kaggle-oasis-1\oasis_cross-sectional.csv
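
The path above is specific to one Windows machine. One possible way to keep the notebook portable is `pathlib`, sketched below. The helper name `build_data_path` is hypothetical, and the sketch assumes the notebook still sits one directory below the project root, as in the cell above.

```python
from pathlib import Path


def build_data_path(project_root: Path) -> Path:
    """Join the dataset's relative location onto a project root (hypothetical helper)."""
    return project_root / "data" / "KAGGLE" / "kaggle-oasis-1" / "oasis_cross-sectional.csv"


# The notebook is assumed to live one level below the project root.
project_root = Path.cwd().parent
file_path = build_data_path(project_root)

print("file_path =", file_path)
print("exists    =", file_path.exists())  # quick check before calling read_csv
```

`pd.read_csv` accepts a `Path` object, so the rest of the notebook would not need to change.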


Load the OASIS dataset
- Read in the CSV as a Pandas dataframe (DF)
- Top row as header

Then let's take a look at the top of the DF

In [3]:
#! WARNING -- dtypes are inferred here; set them explicitly
df = pd.read_csv(file_path,
                 header=0)  # first row holds the column names
df.head()

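|   | ID | M/F | Hand | Age | Educ | SES | MMSE | CDR | eTIV | nWBV | ASF | Delay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OAS1_0001_MR1 | F | R | 74 | 2.0 | 3.0 | 29.0 | 0.0 | 1344 | 0.743 | 1.306 | NaN |
| 1 | OAS1_0002_MR1 | F | R | 55 | 4.0 | 1.0 | 29.0 | 0.0 | 1147 | 0.810 | 1.531 | NaN |
| 2 | OAS1_0003_MR1 | F | R | 73 | 4.0 | 3.0 | 27.0 | 0.5 | 1454 | 0.708 | 1.207 | NaN |
| 3 | OAS1_0004_MR1 | M | R | 28 | NaN | NaN | NaN | NaN | 1588 | 0.803 | 1.105 | NaN |
| 4 | OAS1_0005_MR1 | M | R | 18 | NaN | NaN | NaN | NaN | 1737 | 0.848 | 1.010 | NaN |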
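
The `set datatypes` warning in the cell above could be addressed by passing a `dtype` mapping to `pd.read_csv`. The sketch below is one possible mapping, not a settled choice. It runs on the five rows shown above (rebuilt as an in-memory CSV) and assumes the whole-number columns are written without a decimal point in the file. The nullable `Int64` type keeps missing values as `<NA>` instead of forcing the column to float.

```python
import io

import pandas as pd

# The five rows from the df.head() output, rebuilt as an in-memory CSV.
# Assumption: whole-number columns appear as "2", not "2.0", in the real file.
sample_csv = io.StringIO(
    "ID,M/F,Hand,Age,Educ,SES,MMSE,CDR,eTIV,nWBV,ASF,Delay\n"
    "OAS1_0001_MR1,F,R,74,2,3,29,0,1344,0.743,1.306,\n"
    "OAS1_0002_MR1,F,R,55,4,1,29,0,1147,0.81,1.531,\n"
    "OAS1_0003_MR1,F,R,73,4,3,27,0.5,1454,0.708,1.207,\n"
    "OAS1_0004_MR1,M,R,28,,,,,1588,0.803,1.105,\n"
    "OAS1_0005_MR1,M,R,18,,,,,1737,0.848,1.01,\n"
)

# Proposed types (an assumption to review, not taken from the dataset docs).
column_types = {
    "ID": "string",
    "M/F": "category",
    "Hand": "category",
    "Age": "int64",
    "Educ": "Int64",    # nullable integer: missing values stay <NA>
    "SES": "Int64",
    "MMSE": "Int64",
    "CDR": "float64",   # takes half-step values such as 0.5
    "eTIV": "int64",
    "nWBV": "float64",
    "ASF": "float64",
    "Delay": "Int64",
}

typed_df = pd.read_csv(sample_csv, header=0, dtype=column_types)
print(typed_df.dtypes)
```

If the real file stores these columns with a decimal point, the `Int64` entries may need to be read as `float64` first and converted afterwards.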


In [4]:
#* TODO - add metadata/descriptions of column meanings
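
A possible starting point for the TODO above: a small lookup of column meanings. The descriptions reflect my understanding of the OASIS-1 documentation and are not taken from this notebook, so each one should be checked against the official data dictionary before it goes into the proposal. The names `column_notes` and `notes_table` are illustrative.

```python
import pandas as pd

# Assumed meanings; verify each entry against the OASIS-1 data dictionary.
column_notes = {
    "ID": "Subject/session identifier",
    "M/F": "Sex",
    "Hand": "Dominant hand",
    "Age": "Age in years",
    "Educ": "Education level (coded 1-5)",
    "SES": "Socioeconomic status (coded 1-5)",
    "MMSE": "Mini-Mental State Examination score (max 30)",
    "CDR": "Clinical Dementia Rating (0, 0.5, 1, 2)",
    "eTIV": "Estimated total intracranial volume",
    "nWBV": "Normalized whole-brain volume",
    "ASF": "Atlas scaling factor",
    "Delay": "Delay between visits for repeat scans (to be confirmed)",
}

notes_table = (
    pd.Series(column_notes, name="description")
    .rename_axis("column")
    .to_frame()
)
print(notes_table.to_string())
```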

Let's take a look at some general info about this DF

In [5]:
print('CHARACTERISTIC | VALUE')
print('Dimensions   =', df.ndim)
print('Elements     =', df.size)
print('Shape        =', df.shape)
print('Rows         =', df.shape[0])
print('Columns      =', df.shape[1])

CHARACTERISTIC | VALUE
Dimensions   = 2
Elements     = 5232
Shape        = (436, 12)
Rows         = 436
Columns      = 12


Let's look at each column
- check # of row entries and how many aren't empty
- check data types

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 436 entries, 0 to 435
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      436 non-null    object 
 1   M/F     436 non-null    object 
 2   Hand    436 non-null    object 
 3   Age     436 non-null    int64  
 4   Educ    235 non-null    float64
 5   SES     216 non-null    float64
 6   MMSE    235 non-null    float64
 7   CDR     235 non-null    float64
 8   eTIV    436 non-null    int64  
 9   nWBV    436 non-null    float64
 10  ASF     436 non-null    float64
 11  Delay   20 non-null     float64
dtypes: float64(7), int64(2), object(3)
memory usage: 41.0+ KB


### Column datatypes: 
- float64(7)  --> 7 floating-point columns 
- int64(2)    --> 2 integer columns 
- object(3)   --> 3 string columns (ID is an identifier; M/F and Hand are categorical) 

In [7]:
# NOTE - REPEATED - in df.info, but handy for proposal requirements
df.dtypes

ID        object
M/F       object
Hand      object
Age        int64
Educ     float64
SES      float64
MMSE     float64
CDR      float64
eTIV       int64
nWBV     float64
ASF      float64
Delay    float64
dtype: object

Summary statistics
- WARNING: 
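  - `df.describe()` summarizes only the 9 numerical columns by default; the 3 object columns (ID, M/F, Hand) are skipped
  - Q) Why are some float and some int?
    - Likely answer: NaN is a float, so pandas stores any numeric column with missing values (Educ, SES, MMSE, CDR, Delay) as float64. nWBV and ASF hold fractional values anyway. Age and eTIV are complete whole numbers, so they stay int64.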

In [8]:
df.describe()

|   | Age | Educ | SES | MMSE | CDR | eTIV | nWBV | ASF | Delay |
|---|---|---|---|---|---|---|---|---|---|
| count | 436.0 | 235.0 | 216.0 | 235.0 | 235.0 | 436.0 | 436.0 | 436.0 | 20.0 |
| mean | 51.357798 | 3.178723 | 2.490741 | 27.06383 | 0.285106 | 1481.919725 | 0.79167 | 1.198894 | 20.55 |
| std | 25.269862 | 1.31151 | 1.120593 | 3.69687 | 0.383405 | 158.740866 | 0.059937 | 0.128682 | 23.86249 |
| min | 18.0 | 1.0 | 1.0 | 14.0 | 0.0 | 1123.0 | 0.644 | 0.881 | 1.0 |
| 25% | 23.0 | 2.0 | 2.0 | 26.0 | 0.0 | 1367.75 | 0.74275 | 1.11175 | 2.75 |
| 50% | 54.0 | 3.0 | 2.0 | 29.0 | 0.0 | 1475.5 | 0.809 | 1.19 | 11.0 |
| 75% | 74.0 | 4.0 | 3.0 | 30.0 | 0.5 | 1579.25 | 0.842 | 1.28425 | 30.75 |
| max | 96.0 | 5.0 | 5.0 | 30.0 | 2.0 | 1992.0 | 0.893 | 1.563 | 89.0 |
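
The three string columns are missing from the summary above. Calling `describe()` on just those columns reports count, unique, top, and freq instead. The sketch below uses the five rows shown earlier as stand-in data; `sample` and `text_summary` are illustrative names.

```python
import pandas as pd

# String columns of the first five rows, copied from the df.head() output.
sample = pd.DataFrame({
    "ID": ["OAS1_0001_MR1", "OAS1_0002_MR1", "OAS1_0003_MR1",
           "OAS1_0004_MR1", "OAS1_0005_MR1"],
    "M/F": ["F", "F", "F", "M", "M"],
    "Hand": ["R", "R", "R", "R", "R"],
})

# With no numeric columns present, describe() falls back to
# count / unique / top / freq for every column.
text_summary = sample.describe()
print(text_summary)
```

On the full DataFrame, `df[['ID', 'M/F', 'Hand']].describe()` or `df.describe(include='all')` would give the equivalent view.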


Let's see how many empty values there are
- NOTE: if a column has 436 nulls (one per row), then the entire column is empty --> can remove from consideration

In [9]:
df.isnull().sum()

ID         0
M/F        0
Hand       0
Age        0
Educ     201
SES      220
MMSE     201
CDR      201
eTIV       0
nWBV       0
ASF        0
Delay    416
dtype: int64
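
Raw counts are easier to judge as a share of the 436 rows. The sketch below recomputes the percentages from the counts printed above; `null_counts` and `missing` are illustrative names.

```python
import pandas as pd

n_rows = 436  # from df.shape

# Null counts copied from the df.isnull().sum() output above.
null_counts = pd.Series({
    "ID": 0, "M/F": 0, "Hand": 0, "Age": 0,
    "Educ": 201, "SES": 220, "MMSE": 201, "CDR": 201,
    "eTIV": 0, "nWBV": 0, "ASF": 0, "Delay": 416,
})

missing = pd.DataFrame({
    "n_missing": null_counts,
    "pct_missing": (null_counts / n_rows * 100).round(1),
})

# Show only the columns that actually have gaps, worst first.
print(missing[missing["n_missing"] > 0].sort_values("pct_missing", ascending=False))
```

On the live DataFrame the same percentages come from `df.isnull().mean() * 100`. Delay is about 95% empty, and Educ, MMSE, and CDR share the same count of 201, which suggests (but does not prove) that the same rows are missing all three.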

Get a list of which columns are 100% null

In [10]:
print('List of empty columns:')
print('COL # | COL NAME')

# Go through each column
for column in df:

    # CASE #1 - Check if every value in the column is null
    if df[column].isnull().all():

        # Get the index of column
        col_index = df.columns.get_loc(column)

        # Print column # and column name
        print(f'{col_index} = {column}')

List of empty columns:
COL # | COL NAME
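
The same check can be written without a loop. This is a sketch on a toy frame, since the OASIS table has no fully empty column to demonstrate with; `empty_columns` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def empty_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns in which every value is null."""
    return frame.columns[frame.isna().all()].tolist()


# Toy data: 'b' is fully empty, 'c' is only partly empty.
demo = pd.DataFrame({
    "a": [1, 2],
    "b": [np.nan, np.nan],
    "c": [np.nan, 3.0],
})

print(empty_columns(demo))
```

For the OASIS DataFrame, `empty_columns(df)` should return an empty list, matching the blank output above.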


Let's check for unique values in each column
- Hand has 1 unique value --> everyone is right-handed, so the column carries no information and can probably be dropped

In [11]:
df.nunique()

ID       436
M/F        2
Hand       1
Age       73
Educ       5
SES        5
MMSE      17
CDR        4
eTIV     312
nWBV     182
ASF      282
Delay     14
dtype: int64
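
A sketch of how constant columns such as Hand could be found and dropped automatically. The helper name `constant_columns` is hypothetical and the data is a few columns of the five rows shown earlier. Note that `nunique()` ignores NaN by default, so a fully empty column counts 0 and is flagged as well.

```python
import pandas as pd


def constant_columns(frame: pd.DataFrame) -> list:
    """Return columns with at most one distinct non-null value."""
    counts = frame.nunique()
    return counts[counts <= 1].index.tolist()


# A few columns of the first five rows, copied from the df.head() output.
sample = pd.DataFrame({
    "ID": ["OAS1_0001_MR1", "OAS1_0002_MR1", "OAS1_0003_MR1",
           "OAS1_0004_MR1", "OAS1_0005_MR1"],
    "M/F": ["F", "F", "F", "M", "M"],
    "Hand": ["R", "R", "R", "R", "R"],
    "Age": [74, 55, 73, 28, 18],
})

to_drop = constant_columns(sample)
reduced = sample.drop(columns=to_drop)

print("dropping:", to_drop)
print("remaining:", reduced.columns.tolist())
```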

### REFERENCE

In [12]:
# # Get columns in a list format

# # print(df.columns)
# # print('length of df.columns =', len(df.columns))

# column_number = 0

# print('COL # | COL NAME')

# while column_number < len(df.columns):
#     print(f'{column_number} =', df.columns[column_number][0])
#     column_number += 1

In [13]:
# Get columns in a list format

print('COL # | COL NAME')
for column_number, column_name in enumerate(df.columns):
    print(f'{column_number} = {column_name}')

COL # | COL NAME
0 = ID
1 = M/F
2 = Hand
3 = Age
4 = Educ
5 = SES
6 = MMSE
7 = CDR
8 = eTIV
9 = nWBV
10 = ASF
11 = Delay


# (GENERAL) QUESTIONS:
1) Do I need to include a download script in the Jupyter Notebook?
2) Where should I create requirements.txt?