# To-Do May 16

## HMM!

## Port PCA to csv for use in R, Bayes, etc

## Find out how many individuals are represented in correct vs incorrect predictions
E.g., do some users always show up among the false negatives or false positives, or among the true positives?  
This might be especially useful for the url and created date analyses, where we don't know the extent to which a subset of usernames might be driving correct classification.  
For those in the false positives, look at the actual posts (might be easier for Twitter), and maybe even have MTurk raters judge whether they seem depressed or not. Are we finding new depressed cases, or are we just wrong?  
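
A minimal sketch of that check, assuming a table with one row per prediction. The column names (`user_id`, `target`, `pred`) and the toy data are hypothetical; none of this comes from `bgfunc`. It counts rows and distinct users in each confusion-matrix cell, then cross-tabulates users against outcomes so that a handful of users dominating one cell stands out.

```python
import pandas as pd

# Hypothetical layout: one row per prediction, with true and predicted labels.
OUTCOMES = {(1, 1): 'true_pos', (0, 0): 'true_neg',
            (0, 1): 'false_pos', (1, 0): 'false_neg'}

def label_outcomes(preds, true_col='target', pred_col='pred'):
    """Return a copy of preds with an 'outcome' column (true_pos, false_neg, ...)."""
    out = preds.copy()
    out['outcome'] = [OUTCOMES[(int(t), int(p))]
                      for t, p in zip(out[true_col], out[pred_col])]
    return out

def users_per_outcome(labeled, user_col='user_id'):
    """Number of rows and of distinct users in each confusion-matrix cell."""
    summary = labeled.groupby('outcome')[user_col].agg(['count', 'nunique'])
    summary.columns = ['n_rows', 'n_users']
    return summary

# Toy data, for illustration only
toy = pd.DataFrame({'user_id': ['a', 'a', 'a', 'b', 'c', 'c'],
                    'target':  [1, 1, 1, 0, 0, 0],
                    'pred':    [1, 1, 0, 1, 1, 1]})

labeled = label_outcomes(toy)
summary = users_per_outcome(labeled)

# Rows = users, columns = outcomes: shows who drives each cell
by_user = pd.crosstab(labeled['user_id'], labeled['outcome'])
```

If `n_users` is much smaller than `n_rows` for a cell, a few accounts are contributing most of that cell's observations.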

## Restrict username analysis to only those individuals with a minimum number of days represented in their observations 
Alternatively, a minimum number of posts   
See R code notes for more. (bayes.R)
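
A rough sketch of the filter in pandas, assuming one row per post with hypothetical `user_id` and `created_date` columns, where `created_date` is already at day resolution. The function name and thresholds are made up for illustration.

```python
import pandas as pd

def restrict_to_active_users(df, min_days=None, min_posts=None,
                             user_col='user_id', date_col='created_date'):
    """Keep only rows from users who meet the minimum days and/or posts."""
    keep = pd.Series(True, index=df.index)
    if min_days is not None:
        # distinct days per user, broadcast back onto that user's rows
        days = df.groupby(user_col)[date_col].transform('nunique')
        keep &= (days >= min_days)
    if min_posts is not None:
        # number of posts per user, broadcast back onto that user's rows
        posts = df.groupby(user_col)[date_col].transform('count')
        keep &= (posts >= min_posts)
    return df[keep]

# Toy data: 'a' posts on 3 days, 'b' posts 3 times on 1 day, 'c' posts once
toy = pd.DataFrame({
    'user_id': ['a', 'a', 'a', 'b', 'b', 'b', 'c'],
    'created_date': ['2016-01-01', '2016-01-02', '2016-01-03',
                     '2016-01-01', '2016-01-01', '2016-01-01',
                     '2016-01-05'],
})

by_days = restrict_to_active_users(toy, min_days=2)
by_posts = restrict_to_active_users(toy, min_posts=3)
```

The two thresholds select different users: a day threshold drops accounts that post in bursts, while a post threshold keeps them.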

## Move R code to Jupyter?

In [1]:
import sqlite3
import pandas as pd
import numpy as np
import pickle

from matplotlib import pyplot as plt
import seaborn as sns
sns.set_style('white')
%matplotlib inline

from bgfunc import *

loading data/LIWC/LIWC2007_English100131_words.dic
loading LIWC with stopVal=0.5, for 906 words
loading data/LIWC/LIWC2007_English100131_words.dic
loading LIWC with stopVal=0.0, for 4483 words
loading LabMT with stopVal=1.0, for 3731 words
loading ANEW with stopVal=1.0, for 765 words




## Load SQLite database 

In [2]:
dbfile = 'project_may9_649pm.db'
conn = sqlite3.connect(dbfile)

## Analysis parameters 

Which condition are we studying?  Are there any cutoffs based on testing?  
What sort of analyses or models do we want to run?

In [3]:
platform = 'tw' # ig = instagram, tw = twitter
condition = 'depression' # depression, pregnancy, ptsd, cancer

In [4]:
specs = analysis_specifications(platform, condition)

platform_long = specs['plong'][platform]
gb_types = specs['gb_types'][platform]
fields = specs['fields'][platform] 
test_name = specs['test_name'][condition]
test_cutoff = specs['test_cutoff'][condition]
photos_rated = specs['photos_rated'][condition]
has_test = specs['has_test'][condition]

clfs = ['lr','rf'] # lr = logistic regression, rf = random forests, svc = support vector
periods = ['before','after']
turn_points = ['from_diag','from_susp']

impose_test_cutoff = True # do we want to limit target pop based on testing cutoff (eg. cesd > 21)?

report_sample_size = False # simple reporting feature

load_from_pickle = True # loads entire data dict, including masters, from pickle file

final_pickle = True # pickles entire data dict after all masters are created

populate_wordfeats_db = False # generates word features from reagan code

run_master = True 
run_subsets = True
run_before_after = False
run_separate_pca = False

action_params = {
    'create_master': True, 
    'save_to_file' : False, 
    'density' : False, 
    'ml' : False, 
    'nhst' : False, 
    'corr' : False, 
    'print_corrmat' : False,
    'tall_plot': True
}

params = define_params(condition, test_name, test_cutoff, impose_test_cutoff,
                       platform, platform_long, fields, photos_rated, has_test)

In [5]:
# printout showing sample sizes for target and control groups
if report_sample_size:
    report_sample_sizes(params, conn, condition, platform_long, test_cutoff, test_name)

## Load ready data, or prepare raw data

Set load_from_pickle to determine action here.  

If you don't have a pickled data dict, or if you want to make a new one, the next block will:

- Pull data from the db
- Aggregate it in buckets (day, week, user)
- Create before/after diag/susp date subsets, along with the whole dataset
- Pickle the resulting data dict

Otherwise, we load existing cleaned/aggregated data from the pickle file.
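
The load-or-rebuild pattern can be sketched with the standard library alone. `load_or_build` and `build_toy` are hypothetical names, not part of `bgfunc`. Unlike the flag-based cell below, this version also falls back to rebuilding when the pickle file is missing.

```python
import os
import pickle
import tempfile

def load_or_build(path, build, force_rebuild=False):
    """Load a pickled object from path, or build it and cache it there."""
    if os.path.exists(path) and not force_rebuild:
        with open(path, 'rb') as f:
            return pickle.load(f)
    data = build()
    with open(path, 'wb') as f:
        pickle.dump(data, f)
    return data

# Toy builder standing in for the expensive db pull and aggregation
calls = []
def build_toy():
    calls.append(1)
    return {'master': {}, 'target': {}, 'control': {}}

path = os.path.join(tempfile.mkdtemp(), 'depression_tw_data.p')
first = load_or_build(path, build_toy)    # builds and writes the pickle
second = load_or_build(path, build_toy)   # reads the pickle, no rebuild
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from an untrusted file.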

In [6]:
pickle_path = "{cond}_{pl}_data.p".format(cond=condition, pl=platform)

if load_from_pickle:
    # load the full data dict (including masters) saved by an earlier run
    with open(pickle_path, "rb") as f:
        data = pickle.load(f)

else:
    # rebuild from the database, then save the result for next time
    data = make_data_dict(params, condition, test_name, conn)
    prepare_raw_data(data, platform, params, conn, gb_types, condition, periods, turn_points)
    with open(pickle_path, "wb") as f:
        pickle.dump(data, f)
    

### This next section generates word features from Andy Reagan's code

Only set populate_wordfeats_db = True if you need to redo the features for some reason!
    

In [7]:
if populate_wordfeats_db:
    create_word_feats_wrapper(['target','control'], gb_types, data, condition, conn, 
                              write_to_db=True, testing=False)                        

## Construct master dataset & run analyses

Possible actions:
- generate master data
- save to disk
- plot target vs control densities for each variable
- correlation plot
- ML modeling
- NHST

In [8]:
if run_master:
    master = data['master']
    target = data['target']['gb']
    control = data['control']['gb'] 
    report = 'MAIN'

    if action_params['create_master']:
        master['model'] = {}

    for gb_type in gb_types:

        master_actions(master, target, control, condition, platform, 
                       params, gb_type, report, action_params, clfs)


Merge to master: MAIN created_date
master created_date shape: (34676, 86)


Merge to master: MAIN weekly
master weekly shape: (10622, 86)


Merge to master: MAIN user_id
master user_id shape: (190, 78)



## Subset master actions

Same as above block, but for subsets, eg. target before diag_date vs controls

In [9]:
use_pca = False # should models be fit using orthogonal pca components?

if run_subsets:
    for period in periods:
        if action_params['create_master']:
            data['master'][period] = {}

        for turn_point in turn_points:    
            if action_params['create_master']:
                data['master'][period][turn_point] = {}

            master = data['master'][period][turn_point]
            target = data['target'][period][turn_point]['gb']
            control = data['control']['gb'] 
            report = '{} {}'.format(period,turn_point)

            if action_params['create_master']:
                master['model'] = {}

            for gb_type in gb_types:
                print('Reporting for: SUBSETS')
                print('Period: {}  Focus: {}  Groupby: {}'.format(period.upper(), turn_point.upper(), gb_type.upper()))
                # merge target, control, into master
                master_actions(master, target, control, condition,
                               platform, params, gb_type, report,
                               action_params, clfs, 
                               use_pca=use_pca) # PCA components are used only if use_pca is True

Reporting for: SUBSETS
Period: BEFORE  Focus: FROM_DIAG  Groupby: CREATED_DATE

Merge to master: before from_diag created_date
master created_date shape: (25948, 86)

Reporting for: SUBSETS
Period: BEFORE  Focus: FROM_DIAG  Groupby: WEEKLY

Merge to master: before from_diag weekly
master weekly shape: (8420, 86)

Reporting for: SUBSETS
Period: BEFORE  Focus: FROM_DIAG  Groupby: USER_ID

Merge to master: before from_diag user_id
master user_id shape: (171, 78)

Reporting for: SUBSETS
Period: BEFORE  Focus: FROM_SUSP  Groupby: CREATED_DATE

Merge to master: before from_susp created_date
master created_date shape: (20711, 86)

Reporting for: SUBSETS
Period: BEFORE  Focus: FROM_SUSP  Groupby: WEEKLY

Merge to master: before from_susp weekly
master weekly shape: (6997, 86)

Reporting for: SUBSETS
Period: BEFORE  Focus: FROM_SUSP  Groupby: USER_ID

Merge to master: before from_susp user_id
master user_id shape: (119, 78)

Reporting for: SUBSETS
Period: AFTER  Focus: FROM_DIAG  Groupby: CREAT

### Pickle entire data dict

Set final_pickle = True to save to disk  
Note that this is separate from saving individual files to csv, which is controlled by the save_to_file flag in action_params.

In [10]:
if final_pickle:
    # save the entire data dict, including all masters, to disk
    with open("{cond}_{pl}_data.p".format(cond=condition, pl=platform), "wb") as f:
        pickle.dump(data, f)

### Within-target before vs after

This compares before/after diag/susp dates within target population.  
Basically just a check to see whether the population looks different based on a given change point
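
For reference, a minimal sketch of this kind of split, assuming one row per observation and a per-user change-point date. The column names (`user_id`, `created_date`, `feature`) and the toy data are hypothetical, and this is not how `before_vs_after()` is implemented. Observations on the change-point date itself are counted as "after" here, which is an arbitrary choice.

```python
import numpy as np
import pandas as pd

def split_at_change_point(posts, change_dates,
                          user_col='user_id', date_col='created_date'):
    """Label each row 'before' or 'after' its user's change-point date.

    Users with no change-point date are dropped.
    """
    dates = posts[user_col].map(change_dates)
    out = posts[dates.notnull()].copy()
    is_before = out[date_col] < dates.loc[out.index]
    out['period'] = np.where(is_before, 'before', 'after')
    return out

# Toy data: user 'c' has no change-point date
toy = pd.DataFrame({
    'user_id': ['a', 'a', 'b', 'b', 'c'],
    'created_date': pd.to_datetime(['2016-01-01', '2016-01-10',
                                    '2016-02-01', '2016-02-20',
                                    '2016-03-01']),
    'feature': [1.0, 3.0, 2.0, 6.0, 9.0],
})
change_dates = pd.Series(pd.to_datetime(['2016-01-05', '2016-02-10']),
                         index=['a', 'b'])

labeled = split_at_change_point(toy, change_dates)
means = labeled.groupby('period')['feature'].mean()
```

Comparing `means['before']` with `means['after']` gives the same quick look at whether the population shifts around the change point.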

In [11]:
# no username gb because you don't have the infrastructure built (you'd need to split bef/aft before the username gb)
# but at any rate, this is just a check...and per-username groupby has the lowest sample size anyhow
if run_before_after:
    for gb_type in ['created_date','weekly']: 
        before_vs_after(data['target']['gb'], gb_type, platform, condition, params['vars'][platform], action_params)

### PCA

PCA below only runs on master (ie. not before/after diag/susp vs control).

Note: You can fold PCA into the master_actions() sequence, above, by adding the parameter use_pca=True.  
That will run PCA and use only the PCA components as predictors, though.  
Currently you can't run both PCA and non-PCA models simultaneously. You did this mainly to cut down on the length of any one code block's output.  
You may find PCA particularly helpful for the timeline groups analysis...
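
As a reminder of what "PCA components as predictors" means, here is a minimal NumPy sketch. It is not what `make_models()` does internally (that code isn't shown here). The `pca_scores` name, the standardization step, and the toy data are all assumptions.

```python
import numpy as np

def pca_scores(X, n_components):
    """Project standardized features onto their top principal components.

    Returns (scores, components, explained_variance_ratio).
    """
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0                      # avoid dividing by zero for constant columns
    Z = (X - mu) / sd
    # SVD of the standardized data: rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, Vt[:n_components], explained[:n_components]

# Toy data: 4 correlated features built from 2 underlying signals
rng = np.random.RandomState(0)
base = rng.normal(size=(50, 2))
X = np.column_stack([base[:, 0],
                     2 * base[:, 0] + 0.1 * rng.normal(size=50),
                     base[:, 1],
                     base[:, 1] - base[:, 0]])

scores, components, explained = pca_scores(X, n_components=2)
```

The columns of `scores` are uncorrelated with one another, so they can stand in for the original correlated features when fitting a classifier.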

In [12]:
if run_separate_pca:
    master = data['master']
    report = 'PCA MAIN'

    for gb_type in gb_types:

        print('RUNNING PCA: {}'.format(gb_type))
        print('')
        model_df = {'name':'Models: {} {}'.format(report, gb_type),
                    'unit':gb_type,
                    'data':master[gb_type],
                    'features':params['vars'][platform][gb_type]['means'],
                    'target':'target',
                    'platform':platform,
                    'tall_plot':action_params['tall_plot']
                   }

        excluded_set = params['model_vars_excluded'][platform][gb_type]

        _, pcafit = make_models(model_df, clf_types=clfs, excluded_set=excluded_set, 
                                tall_plot=model_df['tall_plot'], use_pca=True)