forked from wesm/charlton
-
Notifications
You must be signed in to change notification settings - Fork 0
/
TODO
107 lines (87 loc) · 4.83 KB
/
TODO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
-- Rename Categorical.from_strings (misleading name)
-- Instead of having an explicit list of stateful transformations,
perhaps we should look up each called function in the environment,
and check for some sort of magic attribute like
__charlton_stateful_transformation__
It'd also be nice if stateful transforms *could* be called like
functions when desired -- the current API is just a bit clunky all
around. Another approach for R basically uses regular functions
with a dynamically scoped state:
http://www.stat.auckland.ac.nz/~yee/smartpred/TechnicalGuide.txt
-- As a safety check for non-stateful transforms, maybe we should
always evaluate each formula on just the first row of data alone,
and make sure the result matches what we got when evaluating it
vectorized (i.e., confirm f(x[0]) == f(x)[0], where f is our
transform.
-- Some more *non*-magic builtin utilities
-- I()
-- get(string) for data names that aren't valid Python identifiers?
(is there a better name? Q()? G()?)
-- the contrast makers
-- More contrast coding methods:
-- Helmert (NB there is some disagreement about what's called Helmert
vs. what's called reverse Helmert)
-- Forward difference (probably should be the default for ordered factors)
-- Maybe 'deviation coding' (everything against grand mean)
-- Some sort of symbolic tools for user-defined contrasts -- take
the comparisons that people want to compute in terms of linear
combinations of level names, convert that to a matrix and do the
pinv dance?
-- Some way to specify the default contrast
-- Should we have a way to do "centered" contrasts? (Ask Klinton about
this.)
-- (Currently we ignore whether the levels of categorical data are
ordered. Should that change?)
-- Support for the magic "." term
-- The "y ~ everything else" form
-- The "what I had in this other ModelDesc" form (e.g., "y ~ . - a"
to drop the 'a' predictor from an old model)
-- Make sure we can handle all the different data holding objects that
people like to use (dict, recarray, pandas, larry...)
-- ability to specify the dtype of the model matrix?
-- More stateful transforms:
-- Splines
-- 'cut': numeric->factor by quantile dichotimization
-- Orthogonal polynomials
-- Ability to eliminate rows due to NaN/mask/missing data handling
(shouldn't be too bad, do it in make_model_matrices).
-- And then extend make_model_matrices to handle other "parallel"
data, like weights, which need to participate in the missingness
calculation.
-- Better NaN/masks/missing data handling in transforms. I think the
current ones will just blow up if there are any NaNs.
-- For large data being loaded in chunks, the ModelDesc->ModelSpec
code will load at least the first chunk to do type inference and
work out the shape of the model matrix. If there are any
categorical variables with an unknown number of levels, then it
will have to do a full pass through the data at this point. But if
there aren't, then it will look at only the first chunk, and then
throw it away, and then when we build the model matrix we will load
that chunk again. In practice this is not a huge problem (instead
of loading n chunks, we load n+1, where n is presumably large, plus
loading that first chunk twice in row will exploit whatever caching
is present), but we could do a bit better in this case by saving
the first chunk and the started iterator and using them to produce
the first chunk of model matrix.
-- It might also be useful to have a mode for large data that says
"please do only one pass or else error out"
-- Real testing/support for extending the evaluation machinery
-- A good way to support magic functions like mgcv's s().
-- Profiling/optimization. There are lots of places where I use lazy
quadratic algorithms (or even exponential, in the case of the
non-redundant coding stuff). Perhaps worse is the heavy
multiplication used unconditionally to load data into the model
matrix. I'm pretty sure that at least most of the quadratic stuff
doesn't matter because it's n^2 where n is something like the
number of factors in an interaction term (and who has hundreds of
factors interacting in one term?), but it wouldn't hurt to run some
profiles to check.
-- Support for building sparse model matrices directly. (This should
be pretty straightforward when it comes to exploiting the intrinsic
sparsity of categorical factors; numeric factors that evaluate to a
sparse matrix directly might be slightly more complicated.)
-- Possible optimization: let a stateful transform's memorize_chunk
function raise Stateless to indicate that actually, ha-ha, it turns
out that it doesn't need to memorize anything after all (b/c the
relevant data turns out to be specified explicitly in *args,
**kwargs)