========================================================
Introduction to the Course
--------------------------
Introduction to Data Science
========================================================
What is a Data Scientist?
------------------------
"A data scientist is what you get when you take a statistician and remove reason
and accountability."
```{r, echo=FALSE}
ic <- knitr::include_graphics;
ic('./images/jn.jpg')
```
***
<small>
This is a mean way of saying that many data scientists come from
non-statistical backgrounds and operate under circumstances where
rigor is not the top priority.
You all are lucky: you have the statistical foundation to be good data
scientists but may not have the technical background to navigate all the
tools of the trade that host and enable your work.
This course's job is to introduce those tools to you.
</small>
What is a Data Scientist?
========================================================
```{r, echo=FALSE}
ic('./images/Data_Science_VD.png')
```
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
What is a Data Scientist?
========================================================
```{r, echo=FALSE}
ic('./images/Data_Science_VD_with_me.png')
```
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
There are things about statistics you will surely be shocked I don't know.
Data Scientist vs Statistician
========================================================
A statistician takes data, typically from a designed experiment (often the
statistician is involved in the design) and produces particular sorts of
_answers_.
A data scientist takes "raw" data (which might be generated by some
non-experimental process like an application that uses a database) and
generates _questions_.
Tasks data scientists do:
========================================================
1. Visualization
2. Exploratory Data Analysis
3. ETL (Extract, Transform, Load)
4. Application Development
5. Operations (setting up pipelines between databases and models)
6. Orchestrating Data Collection Standards
7. Lots else.
Scientific Research with Computers
========================================================
This course will also be useful for anyone doing research with
computers. If you are doing non-trivial statistical analysis or simulation,
the methods we cover here will let you do so in an organized, portable,
comprehensible way.
Data Science Project Lifetime
=============================
Science is the orderly passage through this graph:
```{r, echo=FALSE}
ic("images/idealized-science.svg")
```
...
Data Science Project Lifetime
=============================
Science is the disorderly passage through this graph:
```{r, echo=FALSE}
ic("images/realized-science.svg")
```
Good Data Science
=================
```{r, echo=FALSE}
ic("images/good-data-science.svg")
```
Elaboration
===========
We want to convert an ad hoc, disordered process into one in which
each phase is represented and each change is recorded along with
meaningful context.
Goofus
======
<small>
Goofus presents preliminary results at BIOCON 2019 and publishes a
paper in 2021. A colleague asks why a large p-value (reported in 2019)
is much smaller in 2021.
Goofus just has a folder on his hard drive with a giant notebook in
it. He has a backup from late 2019 but has no idea which of the many
changes he has made since then changed the p-value. Also, he doesn't
have the previous data set, having replaced it with an updated one.
</small>
***
```{r, echo=FALSE}
knitr::include_graphics("./images/goofus.png");
```
Gallant
=======
<small>
Gallant presented at the same conference and published in 2021 as
well. She (her first name is Alice) also had a p-value reach
significance between presentation and publication.
When Gallant is asked about it, she is able
to time travel back through her git repository and re-run her analysis to
reproduce the presentation result. She then writes a `git bisect`
script and finds the exact commit where her p-value changed. The
commit message says "Modified outlier elimination to remove bad data
from this study." She can provide a definitive answer!
</small>
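<small>
A minimal sketch of what such a script might look like. The throwaway repository, the file `p_value.txt`, the commit messages, and the numbers are all invented for illustration; only the `git bisect` subcommands are real. A real check would re-run the analysis rather than read a stored number.
</small>

```sh
#!/bin/sh
# Sketch: locate the commit where a result changed, using `git bisect run`.
# Everything about this repository is hypothetical.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "gallant@example.com"
git config user.name "Gallant"

# Five commits; the stored "result" changes in the fourth one.
for i in 1 2 3 4 5; do
  p="0.20"
  msg="Routine change $i"
  if [ "$i" -ge 4 ]; then p="0.03"; fi
  if [ "$i" -eq 4 ]; then msg="Modified outlier elimination"; fi
  echo "$p" > p_value.txt
  echo "change $i" >> notes.txt
  git add p_value.txt notes.txt
  git commit -q -m "$msg"
done

# The check lives outside the repository so it exists at every checkout.
# Exit 0 means "good" (still the presentation value); non-zero means "bad".
check=$(mktemp)
cat > "$check" <<'EOF'
[ "$(cat p_value.txt)" = "0.20" ]
EOF

first=$(git rev-list --max-parents=0 HEAD)
git bisect start HEAD "$first" > /dev/null   # bad = HEAD, good = first commit
git bisect run sh "$check" > /dev/null

# After the run, refs/bisect/bad points at the first "bad" commit.
culprit=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset > /dev/null 2>&1
echo "Result changed in: $culprit"
```

<small>
Here "good" and "bad" are only labels for "old result" and "new result"; nothing about the new p-value has to be wrong.
</small>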
***
```{r, echo=FALSE}
knitr::include_graphics("./images/gallant.png");
```
Not Just Version Control
========================
By far the most useful tool you'll learn here is Git.
But you will also need these tools to make your work _really
reproducible_:
1. Unix Skills - tie everything together
2. Programming Skills - R and Python and Shell
3. Docker - reproducible development environments
4. Make - reproducible builds
Non-Tools
=========
Jupyter/RMarkdown
These are OK tools for exploratory work and quick write-ups. I
encourage you to use them if you'd like. These slides are RMarkdown.
But they are bad tools for reproducible data science.
***
1. They obscure dependencies
2. They maintain a lot of global state
3. They discourage "factoring"
4. They impose a modest technical lock-in
5. They don't play well with git
6. They interleave _reporting_ with _processing_ and these are
fundamentally disjoint tasks.
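
As a rough illustration of the last point, here is a sketch in which processing and reporting are separate steps connected only by a file. The file names, the data, and the function names `process` and `report` are assumptions made up for this example.

```sh
#!/bin/sh
# Sketch: processing and reporting as separate steps joined by a file.
# All names and data here are hypothetical.
set -eu

work=$(mktemp -d)
printf 'id,value\n1,10\n2,oops\n3,30\n' > "$work/raw.csv"

# Processing: raw.csv -> clean.csv (keep the header and integer-valued rows).
process() {
  awk -F, 'NR == 1 || $2 ~ /^[0-9]+$/' "$1" > "$2"
}

# Reporting: reads only clean.csv and never touches the raw data.
report() {
  awk -F, 'NR > 1 { n++; s += $2 } END { printf "rows=%d mean=%.1f\n", n, s / n }' "$1"
}

process "$work/raw.csv" "$work/clean.csv"
summary=$(report "$work/clean.csv")
echo "$summary"
```

Because the report depends only on `clean.csv`, it can be re-run or rewritten without repeating the processing step, and the dependency between the two is visible on disk.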
Analysis
========
While I'm hardly a statistician and you all probably know more than I
do, we will also cover:
1. Exploratory Data Analysis in R and Python
2. Processing, Joining and Cleaning Data
3. Visualization
4. Modeling (clustering, classification)
Things To Do Before Next Class
==============================
<small>
Visit these sites:
https://www.kaggle.com/
https://teddit.net/r/datasets
(teddit is a mirror of reddit that is more usable)
https://duckduckgo.com/?q=open+data+sets&t=newext&atb=v234-1&ia=web
</small>
Things to Try Before Next Class
===============================
<small>
1. Get a Docker environment running. You should be able to do this on
Windows, Linux, or Mac.
https://docs.docker.com/get-started/
On Windows you can install Docker directly or install a Linux virtual
machine via VirtualBox. The latter is a more heavyweight
solution but will make your life easier in the long run.
2. Run a rocker image (rocker/verse works the same way; the command below uses rocker/rstudio).
`docker run -e PASSWORD=yourpassword --rm -p 8787:8787 rocker/rstudio`
</small>
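
<small>
Since `--rm` discards the container when it stops, you will probably want your project directory mounted inside it. The sketch below only prints a candidate command rather than running it. The helper name `rstudio_cmd` and the mount point `/home/rstudio/work` are assumptions for illustration (the rocker images use an `rstudio` user by default), and the helper does no quoting, so it assumes a path without spaces.
</small>

```sh
#!/bin/sh
# Sketch: assemble (but do not run) a docker command that mounts a project
# directory into the container. Helper name and mount point are hypothetical.
set -eu

# $1 = project directory on the host, $2 = RStudio password
rstudio_cmd() {
  printf 'docker run --rm -e PASSWORD=%s -p 8787:8787 -v %s:/home/rstudio/work rocker/rstudio\n' \
    "$2" "$1"
}

cmd=$(rstudio_cmd "$PWD" "yourpassword")
echo "$cmd"
```

<small>
Run the printed command, then open http://localhost:8787 in a browser and log in as `rstudio` with the password you chose.
</small>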
Course Material
===============
Located here:
https://github.com/Vincent-Toups/datasci611
Homework
========
1. Pick a data set you might find interesting.
2. Write a pitch for your analysis (~500 words). What do you hope to
learn by looking at it in detail?
3. Submit a document to Canvas with:
- a picture of your face
- a phonetic pronunciation of your preferred name and pronouns
- something interesting about you (or something boring)