/
Lesson8_Correlation.ipynb
507 lines (507 loc) · 16.6 KB
/
Lesson8_Correlation.ipynb
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"../dsi.png\" style=\"height:128px;\">"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"# Lesson 8: Correlation\n",
"\n",
"In our last notebook, we learned how to use the power of programming to visualize data. But that's not all we can do with these Python libraries. Today we'll take our programming toolkit one step further and see how we can use the same libraries to calculate numerical information about our data after we've plotted it. Visualizations and numerical information together can tell us detailed stories about our data and what it means.\n",
"\n",
"We'll begin by using matplotlib - the same library we used to make line graphs, bar graphs, and pie charts - to explore the process of visualizing data using scatter plots. We'll then see how we can use another powerful library called numpy to numerically investigate the relationships between variables in our data by computing correlation coefficients. Finally we'll also look into plotting lines of best fit as an introduction to the concept of regression."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"import matplotlib\n",
"matplotlib.use('Agg')\n",
"from datascience import Table\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"plt.style.use('fivethirtyeight')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"def checker(strong, positive):\n",
" if strong:\n",
" if positive:\n",
" print(\"Try Again!\")\n",
" else:\n",
" print(\"You are correct! You now know how to make and read scatterplots to analyze trends in data!\") \n",
" else:\n",
" print(\"Try Again!\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Scatterplots\n",
"\n",
"Remember from today's lesson that when we have data about two variables, the best way to visualize it is through something called a \"scatterplot.\" We can use our handy datascience library to quickly make lots of scatterplots.\n",
"\n",
"Let's start with the example of clasroom participation from the lesson. Below we'll create a new Table using that data and then use the datascience \".scatter('Column Name 1', 'Column Name 2')\" function to create a scatterplot."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"classroom_participation = Table().with_columns([\n",
" 'Student',['Sathvik','Anjali', 'Shreyan','Chaaru','Rishi','Divya'],\n",
" '1st Week', [3,1,1,2,2,3],\n",
" '12th Week', [7,2,4,5,6,5]\n",
"])\n",
"classroom_participation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"classroom_participation.scatter('1st Week', '12th Week')"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Scatterplots are useful for illustrating \"trends\" in our data. As we can see from this scatterplot, students who tended to participate less in the first week also tended to participated less in the twelfth week - like Anjali for example, who participated the least out of all the students in the 1st and 12th weeks. Similarly, students who tended to particpate more in the first week also participated more in the twelfth week. This is exactly what we mean by a trend. \n",
"\n",
"This is an example of a \"weak positive\" correlation. It's weak because the data points don't form exactly a straight line, and it's positive because the data points rougly go from the bottom left to the top right in our scatterplot (positive slope). Now let's look at another example. Fill in the following cells to make a scatter plot of classroom absences vs. final grade in the class."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"student_abscences = Table().with_columns([\n",
" 'Student',['Rashi','Amit', 'Simaran','Shruti','Eesha','Arpit'],\n",
" 'Absences', [0,1,2,3,4,7],\n",
" 'Grades', [99,95,90,80,75,68]\n",
"])\n",
"student_abscences"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"student_abscences.scatter( ) # fill in this line to make the scatterplot"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Is this data strongly or weakly correlated? And is the correlation positive or negative? Put your answers in the cell below and run it to find out if you're right!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"strong = # put True here if you think it's a strong correlation and False if it's weak\n",
"positive = # put True here if you think the data is positively correlated, and False if it's negatively correlated\n",
"\n",
"checker(strong, positive)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Correlation Coefficients\n",
"\n",
"Scatterplots can tell us a lot about trends in our data, but just knowing whether our data is strongly or weakly correlated and whether the correlation is positive or negative isn't enough. You might ask yourself, \"How strong is the correlation? Are some correlations stronger than others?\"\n",
"\n",
"This is where the idea of a correlation coefficient comes in. The correlation coefficient is a number, known as r, between -1 and 1 that can be calculated for any two variables. If r is negative the correlation is negative, and if it's positive the correlation is positive. And the \"magnitude\" of r (in other words, how close |r| is to 1) tells us how strong the correlation is.\n",
"\n",
"In the following cells, we'll walk you through an example of calculating the correlation coefficient for a dataset. For our dataset, let's take a look at some data we first saw all the way back in lesson 1. We'll use the datascience library to create a Table with fertility, population, life expectancy, and child mortality data for all the countries in the world for the year 2015."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"fertility_data = Table.read_table('fertility.csv').where('time', 2015).drop(\"time\")\n",
"population_data = Table.read_table('population.csv').where('time', 2015).drop(\"time\")\n",
"life_expectancy_data = Table.read_table('life_expectancy.csv').where('time', 2015).drop(\"time\")\n",
"child_mortality_data = Table.read_table('child_mortality.csv').where('time', 2015).drop(\"time\")\n",
"\n",
"joined_data = fertility_data.join(\"geo\", population_data)\\\n",
" .join(\"geo\", life_expectancy_data)\\\n",
" .join(\"geo\", child_mortality_data)\n",
"joined_data = joined_data.relabeled(\"children_per_woman_total_fertility\", \"fertility\")\\\n",
" .relabeled(\"population_total\", \"population\")\\\n",
" .relabeled(\"life_expectancy_years\", \"life expectancy\")\\\n",
" .relabeled(\"child_mortality_0_5_year_olds_dying_per_1000_born\", \"child mortality\")\n",
"joined_data"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Let's take a look at the \"child mortality\" and \"life expectancy\" columns. We'd expect these two be negatively correlated since the higher the risk of dying as a child, the lower your expected lifetime should be. We can see if this trend holds true by first plotting the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"joined_data.scatter(\"life expectancy\", \"child mortality\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Yup, in fact the data seems negatively correlated. How strong is the correlation exactly? Well to answer that let's compute the correlation coefficient, using the numpy library. First, let's get the two columns we want."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"life_expectancy = joined_data.column(\"life expectancy\")\n",
"child_mortality = joined_data.column(\"child mortality\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Remember that the next step in computing the correlation coefficient is calculating the mean and standard deviation of our two variables. Numpy lets us do this with the convenient \"np.mean\" and \"np.std\" functions. Here's how to use them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"avg_life_expectancy = np.mean(life_expectancy)\n",
"stddev_life_expectancy = np.std(life_expectancy)\n",
"avg_child_mortality = np.mean(child_mortality)\n",
"stddev_child_mortality = np.std(child_mortality)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The next step is to transform the data by calculating z_x = (x - x_mean)/x_stddev and z_y = (y - y_mean)/y_stddev for each x and y value, then multiplying together to get z_x * z_y. We can use the numpy \"subtract,\" \"divide,\" and \"multiply\" functions for this."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"transformed_life_expectancy = np.divide(np.subtract(life_expectancy, avg_life_expectancy), stddev_life_expectancy)\n",
"transformed_child_mortality = np.divide(np.subtract(child_mortality, avg_child_mortality), stddev_child_mortality)\n",
"products = np.multiply(transformed_life_expectancy, transformed_child_mortality)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Finally, we add up the values in this array and divide by n, where n is the number of values, to get the final correlation coefficient."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"correlation_coefficient = np.sum(products) / len(products)\n",
"print(\"Correlation coefficient: \", correlation_coefficient)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"As you can see the correlation coefficient is negative and very close to -1, indicating that this is a strong negative correlation as expected. \n",
"\n",
"Now it's your turn. Let's take a look at another pair of variables: fertility and population. Fill in the following cells to make the scatterplot."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"joined_data.scatter( ) # fill this in to get a scatterplot of fertility vs population"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Now guess whether the correlation is strong or weak, and if it's positive or negative. Then fill in the following cells to find out if you're right by computing the correlation coefficient!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"fertility = joined_data.column( )\n",
"population = joined_data.column( )\n",
"\n",
"avg_fertility = \n",
"stddev_fertility = \n",
"avg_population = \n",
"stddev_population = \n",
"\n",
"transformed_fertility = np.divide(np.subtract( ), )\n",
"transformed_population = np.divide(np.subtract( ), )\n",
"products = \n",
"\n",
"correlation_coefficient = np.sum( ) / ( )\n",
"print(\"This is the correlation coefficient: \", correlation_coefficient)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"A much quicker way to compute the correlation coefficient is to use the np.corrcoef() function. To check if you filled in the previous cells correctly and found the right value, run the next cell to see the right answer."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true,
"scrolled": true
},
"outputs": [],
"source": [
"print(\"This is the correlation coefficient: \", \n",
" np.corrcoef(joined_data.column(\"population\"), joined_data.column(\"fertility\"))[1,0])"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Line of Best Fit\n",
"\n",
"The last thing you learned about correlation is the idea of a line of best fit. This is a line that roughly \"fits\" the data by describing the ideal linear relationship between the data points. Luckily we can plot lines of best fit easily using the datascience library by adding an additional argument, \"fit_line = True\", to our call to the scatter function. Here's an example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"joined_data.scatter(\"life expectancy\", \"child mortality\", fit_line=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"When the data is strongly correlated (correlation coefficient close to 1 or -1), the line closely fits most of the data points well. But when the data is not strongly correlated (correlation coefficient close to 0), the lie of best fit doesn't seem to fit the data at all. Let's see what the line of best fit looks like for population and fertility. Fill in the following cell to create a scatterplot of population and fertility with a best fit line."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"joined_data.scatter( ) # fill this in"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"In the next lesson, we'll learn how to actually mathematically find the equation for this line of best fit by using a technique known as \"regression.\""
]
}
],
"metadata": {
"celltoolbar": "Raw Cell Format",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}