# Preprocessing data

Here we describe how we read in data, clean it up and provide more machine-readable column names, and separate out some columns for additional analyses.

Data is stored in:
 * data
 
It consists of three files, originally provided as Excel workbooks and exported to CSV from Excel:
 * TP_DV_File.csv
 * TP_Graph_Characteristics.csv
 * TP_Subject_file.csv
 
Below are brief descriptions of each file (more extensive descriptions are in the Word doc).

### TP_DV_File.csv

A table containing the subject, the graph and point combination, and whether it was marked as a tipping point or not.

### TP_Graph_Characteristics.csv

This table describes characteristics of the graphs. Five different reviewers coded each characteristic. Where they disagreed, the majority opinion was used if more than 3 reviewers coded the graph the same way (i.e. greater than 60% agreement); otherwise the characteristic was left blank.
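
The consensus rule described above can be sketched as follows. This is a minimal illustration, not the reviewers' actual procedure: the function name `consensus`, the keyword `min_agree`, and the use of `missing` to stand in for a blank cell are all assumptions.

```julia
# Hypothetical helper: return the code that at least `min_agree` reviewers
# chose, or `missing` (standing in for a blank cell) when no code reaches
# that level of agreement. With 5 reviewers, "more than 3" means 4 or 5.
function consensus(codes; min_agree = 4)
    for candidate in unique(codes)
        if count(c -> c == candidate, codes) >= min_agree
            return candidate
        end
    end
    return missing
end

kept    = consensus([1, 1, 1, 1, 0])  # 4 of 5 agree, so the code is kept
dropped = consensus([1, 1, 1, 0, 0])  # only 3 of 5 agree, so it is left blank
```

With five reviewers and a threshold of four, at most one code can qualify, so the order in which candidates are tried does not matter.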

### TP_Subject_file.csv

A table about the subjects. It contains many variables, some of which are likely correlated.

# Code walkthrough

In [None]:
using DataFrames
using CSV
using GLM
using Gadfly
using Statistics
using NamedArrays

DATA="data"


For each CSV file we wrote a function `clean_x` that reads in the file and renames the columns. Each function has roughly the same two lines: `CSV.File` reads the file and `|> DataFrame!` pipes the result into a dataframe, then `names!` sets the column names to something both short and human-identifiable. These calls match the DataFrames.jl release used here with Julia 1.1; DataFrames 1.0 and later use `rename!` in place of `names!`. A quick check of the size of `df_dv` shows that it has 5696 rows and 3 columns.

In [2]:
function clean_dv(filepath)
    df = CSV.File(filepath,normalizenames=true) |> DataFrame!
    names!(df, [:subj, :graphid, :tp])
end

df_dv = clean_dv(joinpath(DATA,"TP_DV_file.csv"))
size(df_dv)

(5696, 3)

For DV we additionally split out the `graphid` column into `q` (specific graph) and `pt` (specific point on the graph). We add these to the dataframe `df_dv` as new columns. A quick check with `size` confirms that we now have 5 columns in `df_dv`.

In [3]:
# add columns for the graph (q) and the point on that graph (pt)
df_dv.q = [split(s)[1] for s in df_dv.graphid]
df_dv.pt = [split(s)[2] for s in df_dv.graphid]
size(df_dv)

(5696, 5)
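
The split above assumes every `graphid` is a graph label and a point label separated by whitespace, as in `"Q2 A"`. Below is a minimal sketch of the same step on a few hand-written IDs. It splits each string once and checks the two-part assumption. The helper name `split_graphid` and the sample vector are illustrative and not part of the notebook.

```julia
# Hypothetical helper: split "Q2 A" into ("Q2", "A"), raising an error for
# any ID that does not have exactly two whitespace-separated parts.
function split_graphid(s::AbstractString)
    parts = split(s)
    length(parts) == 2 || error("unexpected graphid format: $s")
    # Convert to String so the results are not SubString views of the input.
    return String(parts[1]), String(parts[2])
end

graphids = ["Q2 A", "Q2 B", "Q3 C"]      # sample values only
parts_by_id = split_graphid.(graphids)   # one (q, pt) tuple per ID
q  = first.(parts_by_id)
pt = last.(parts_by_id)
```

Converting to `String` is a design choice: the columns then hold plain strings rather than `SubString` values.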

The rest of the preprocessing goes similarly.

In [4]:
function clean_gc(filepath)
    df = CSV.File(filepath,normalizenames=true) |> DataFrame!
    names!(df, [:graphid, :risingBefore, :cannotSeeAfter, :downOverall, :bellOverall, :complexOverall])
end

df_gc = clean_gc(joinpath(DATA,"TP_Graph_Characteristics.csv"))
size(df_gc)

(32, 6)

In [5]:
function clean_subject(filepath)
    df = CSV.File(filepath,normalizenames=true) |> DataFrame!
    names!(df, [:subj, :uniBrown, :expExec, :tpChange, :tpRate, :tpDir, :tpNoReturn,
            :tellMgr, :impChange, :impRise, :impFall, :impPeriodic, :numOtherTP,
            :liwcPosemo, :liwcNegemo, :liwcCause, :liwcFocusPre, :liwcFocusFut,
            :liwcRelativ, :liwcTime])
end

df_subject = clean_subject(joinpath(DATA,"TP_Subject_file.csv"))
size(df_subject)

(178, 20)

Finally, we combine all of the dataframes together into a full dataframe.

In [6]:
full_df = join(df_dv, df_gc, on= :graphid)
full_df = join(full_df, df_subject, on = :subj)
first(full_df, 6)

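| | subj | graphid | tp | q | pt | risingBefore | cannotSeeAfter | downOverall |
|---|---|---|---|---|---|---|---|---|
| | Int64 | String | Int64⍰ | SubStrin… | SubStrin… | Int64 | Int64 | Int64 |
| 1 | 1 | Q2 A | 0 | Q2 | A | 1 | 0 | 0 |
| 2 | 1 | Q2 B | 0 | Q2 | B | 1 | 0 | 0 |
| 3 | 1 | Q2 C | 0 | Q2 | C | 1 | 1 | 0 |
| 4 | 1 | Q3 A | 0 | Q3 | A | 1 | 0 | 1 |
| 5 | 1 | Q3 B | 0 | Q3 | B | 0 | 0 | 1 |
| 6 | 1 | Q3 C | 0 | Q3 | C | 0 | 0 | 1 |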
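
In the DataFrames.jl release used here, `join` defaults to an inner join, so any `graphid` or `subj` without a match in the other table would silently drop rows. The sizes reported earlier allow a quick sanity check, sketched here with the numbers hard-coded from the outputs above. The variable names are illustrative, and the reading of `df_gc` as one row per graph and point combination is an assumption based on its `graphid` key.

```julia
n_dv       = 5696   # rows in df_dv
n_graphpts = 32     # rows in df_gc (assumed: one per graph/point combination)
n_subjects = 178    # rows in df_subject

# If every subject rated every graph point, df_dv should have one row per
# (subject, graph point) pair.
expected_rows = n_subjects * n_graphpts
is_complete = expected_rows == n_dv
```

In the notebook itself, the analogous check would compare `size(full_df, 1)` against `size(df_dv, 1)` to confirm that the joins kept every row.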




Now, we save the dataframes for other notebooks to use.

In [9]:
CSV.write(joinpath(DATA,"df_dv.dat"), df_dv)
CSV.write(joinpath(DATA,"df_gc.dat"), df_gc)
CSV.write(joinpath(DATA,"df_subject.dat"), df_subject)
CSV.write(joinpath(DATA,"full_df.dat"), full_df)

"/Users/aguang/CORE/tippingpoint/tippingpoint/data/full_df.dat"

# Hypotheses

The next notebook to run is `02_descriptive_logistic`, where we go over the hypotheses again. Briefly, these are the hypotheses we examined.

Our broad hypotheses are that: tipping points cannot be predicted, that bias and experience make one less likely to declare a tipping point, and that collective decision making makes a single individual less likely to declare a tipping point.

Specifically from our independent variables, we wanted to test the following hypotheses:

 1. If a subject cannot see what follows the point of interest, they are less likely to declare a TP.
 2. If the graph is rising before the point of interest, a subject is more likely to declare a TP.
 3. If a subject is more experienced, they are less likely to declare a TP.
 4. If there is more noise in the overall graph, a subject is less likely to declare a TP.
 5. If a subject views a sustained change as important, they are less likely to declare a TP.
 6. If a subject is likely to tell their manager, they are more likely to declare a TP.
 7. If a subject sees "other" TPs, they are more likely to declare a TP.
 8. If a subject views a decline as important, they are more likely to declare a TP when the point has a decline.
 9. Emotions drive TP observations.
 10. If the group is small, a subject is more likely to declare a TP.
 