-
Notifications
You must be signed in to change notification settings - Fork 9
/
12-joins.html
193 lines (192 loc) · 13.8 KB
/
12-joins.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Software Carpentry: R for reproducible scientific analysis</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner">
<a href="http://software-carpentry.org" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
</a>
</div>
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<a href="index.html"><h1 class="title">R for reproducible scientific analysis</h1></a>
<h2 class="subtitle">Multi-Table Joins</h2>
<section class="objectives panel panel-warning">
<div class="panel-heading">
<h2 id="learning-objectives"><span class="glyphicon glyphicon-certificate"></span>Learning objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>Focus on the third tidy data principle
<ul>
<li>Each variable forms a column.</li>
<li>Each observation forms a row.</li>
<li>Each type of observational unit forms a table.</li>
</ul></li>
<li>By able to use <code>dplyr</code>’s join functions to merge tables</li>
</ul>
</div>
</section>
<h3 id="joins">Joins</h3>
<p>The third tidy data maxim states that each observation type gets its own table. The idea of multiple tables within a dataset will be familiar to anyone who has worked with a relational database but may seem foreign to those who have not.</p>
<p>The idea is this: Suppose we conduct a behavioral experiment that puts individuals in groups, and we measure both individual- and group-level variables. We should have a table for the individual-level variables and a separate table for the group-level variables. Then, should we need to merge them, we can do so using the <code>join</code> functions of <code>dplyr</code>.</p>
<p>The join functions are nicely illustrated in RStudio’s <a href="https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf">Data wrangling cheatsheet</a>. Each function takes two data.frames and, optionally, the name(s) of columns on which to match. If no column names are provided, the functions match on all shared column names.</p>
<p>The different join functions control what happens to rows that exist in one table but not the other.</p>
<ul>
<li><code>left_join</code> keeps all the entries that are present in the left (first) table and excludes any that are only in the right table.</li>
<li><code>right_join</code> keeps all the entries that are present in the right table and excludes any that are only in the left table.</li>
<li><code>inner_join</code> keeps only the entries that are present in both tables. <code>inner_join</code> is the only function that guarantees you won’t generate any missing entries.</li>
<li><code>full_join</code> keeps all of the entries in both tables, regardless of whether or not they appear in the other table.</li>
</ul>
<div class="figure">
<img src="fig/dplyr-joins.png" alt="dplyr joins, via RStudio" />
<p class="caption">dplyr joins, via RStudio</p>
</div>
<p>We will practice on our continents data.frame from module 2 and the gapminder data.frame. Note how these are tidy data: We have observations at the level of continent and at the level of country, so they go in different tables. The continent column in the gapminder data.frame allows us to link them now. If continents data.frame isn’t in your Environment, load it and recall what it consists of:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">load</span>(<span class="st">'data/continents.RDA'</span>)
continents</code></pre></div>
<pre class="output"><code> continent area_km2 population percent_total_pop
1 Africa 30370000 1022234000 15.0
2 Americas 42330000 934611000 14.0
3 Antarctica 13720000 4490 0.0
4 Asia 43820000 4164252000 60.0
5 Europe 10180000 738199000 11.0
6 Oceania 9008500 29127000 0.4
</code></pre>
<p>We can join the two data.frames using any of the <code>dplyr</code> functions. We will pass the results to <code>str</code> to avoid printing more than we can read, and to get more high-level information on the resulting data.frames.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">left_join</span>(gapminder, continents) </code></pre></div>
<pre class="output"><code># A tibble: 1,704 × 9
country year pop continent lifeExp gdpPercap area_km2
<chr> <int> <dbl> <chr> <dbl> <dbl> <dbl>
1 Afghanistan 1952 8425333 Asia 28.801 779.4453 43820000
2 Afghanistan 1957 9240934 Asia 30.332 820.8530 43820000
3 Afghanistan 1962 10267083 Asia 31.997 853.1007 43820000
4 Afghanistan 1967 11537966 Asia 34.020 836.1971 43820000
5 Afghanistan 1972 13079460 Asia 36.088 739.9811 43820000
6 Afghanistan 1977 14880372 Asia 38.438 786.1134 43820000
7 Afghanistan 1982 12881816 Asia 39.854 978.0114 43820000
8 Afghanistan 1987 13867957 Asia 40.822 852.3959 43820000
9 Afghanistan 1992 16317921 Asia 41.674 649.3414 43820000
10 Afghanistan 1997 22227415 Asia 41.763 635.3414 43820000
# ... with 1,694 more rows, and 2 more variables: population <dbl>,
# percent_total_pop <dbl>
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">right_join</span>(gapminder, continents)</code></pre></div>
<pre class="output"><code># A tibble: 1,705 × 9
country year pop continent lifeExp gdpPercap area_km2 population
<chr> <int> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 Algeria 1952 9279525 Africa 43.077 2449.008 30370000 1022234000
2 Algeria 1957 10270856 Africa 45.685 3013.976 30370000 1022234000
3 Algeria 1962 11000948 Africa 48.303 2550.817 30370000 1022234000
4 Algeria 1967 12760499 Africa 51.407 3246.992 30370000 1022234000
5 Algeria 1972 14760787 Africa 54.518 4182.664 30370000 1022234000
6 Algeria 1977 17152804 Africa 58.014 4910.417 30370000 1022234000
7 Algeria 1982 20033753 Africa 61.368 5745.160 30370000 1022234000
8 Algeria 1987 23254956 Africa 65.799 5681.359 30370000 1022234000
9 Algeria 1992 26298373 Africa 67.744 5023.217 30370000 1022234000
10 Algeria 1997 29072015 Africa 69.152 4797.295 30370000 1022234000
# ... with 1,695 more rows, and 1 more variables: percent_total_pop <dbl>
</code></pre>
<p>These operations produce slightly different results, either 1704 or 1705 observations. Can you figure out why? Antarctica contains no countries so doesn’t appear in the gapminder data.frame. When we use <code>left_join</code> it gets filtered from the results, but when we use <code>right_join</code> it appears, with missing values for all of the country-level variables:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">right_join</span>(gapminder, continents) %>%<span class="st"> </span>
<span class="st"> </span><span class="kw">filter</span>(continent ==<span class="st"> "Antarctica"</span>)</code></pre></div>
<pre class="output"><code># A tibble: 1 × 9
country year pop continent lifeExp gdpPercap area_km2 population
<chr> <int> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 <NA> NA NA Antarctica NA NA 13720000 4490
# ... with 1 more variables: percent_total_pop <dbl>
</code></pre>
<p>There’s another problem in this data.frame – it has two population measures, one by continent and one by country and it’s not clear which is which! Let’s rename a couple of columns.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">right_join</span>(gapminder, continents) %>%<span class="st"> </span>
<span class="st"> </span><span class="kw">rename</span>(<span class="dt">country_pop =</span> pop, <span class="dt">continent_pop =</span> population)</code></pre></div>
<pre class="output"><code># A tibble: 1,705 × 9
country year country_pop continent lifeExp gdpPercap area_km2
<chr> <int> <dbl> <chr> <dbl> <dbl> <dbl>
1 Algeria 1952 9279525 Africa 43.077 2449.008 30370000
2 Algeria 1957 10270856 Africa 45.685 3013.976 30370000
3 Algeria 1962 11000948 Africa 48.303 2550.817 30370000
4 Algeria 1967 12760499 Africa 51.407 3246.992 30370000
5 Algeria 1972 14760787 Africa 54.518 4182.664 30370000
6 Algeria 1977 17152804 Africa 58.014 4910.417 30370000
7 Algeria 1982 20033753 Africa 61.368 5745.160 30370000
8 Algeria 1987 23254956 Africa 65.799 5681.359 30370000
9 Algeria 1992 26298373 Africa 67.744 5023.217 30370000
10 Algeria 1997 29072015 Africa 69.152 4797.295 30370000
# ... with 1,695 more rows, and 2 more variables: continent_pop <dbl>,
# percent_total_pop <dbl>
</code></pre>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h4 id="challenge-putting-the-pieces-together"><span class="glyphicon glyphicon-pencil"></span>Challenge – Putting the pieces together</h4>
</div>
<div class="panel-body">
<p>A colleague suggests that the more land area an individual has, the greater their gdp will be and that this relationship will be observable at any scale of observation. You chuckle and mutter “Not at the continental scale,” but your colleague insists. Test your colleague’s hypothesis by:</p>
<ul>
<li>Calculating the total GDP of each continent,
<ul>
<li>Hint: Use <code>dplyr</code>’s <code>group_by</code> and <code>summarize</code></li>
</ul></li>
<li>Joining the resulting data.frame to the <code>continents</code> data.frame,</li>
<li>Calculating the per-capita GDP for each continent, and</li>
<li>Plotting per-capita gdp versus population density.</li>
</ul>
</div>
</section>
<h2 id="challenge-solutions">Challenge solutions</h2>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h4 id="solution-to-challenge-putting-the-pieces-together"><span class="glyphicon glyphicon-pencil"></span>Solution to Challenge – Putting the pieces together</h4>
</div>
<div class="panel-body">
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(ggplot2)
<span class="co"># Calculate country-level GDP</span>
<span class="kw">mutate</span>(gapminder, <span class="dt">GDP =</span> gdpPercap *<span class="st"> </span>pop) %>%<span class="st"> </span>
<span class="co"># Group by continent</span>
<span class="st"> </span><span class="kw">group_by</span>(continent) %>%<span class="st"> </span>
<span class="co"># Calculate continent-level GDP</span>
<span class="st"> </span><span class="kw">summarize</span>(<span class="dt">cont_gdp =</span> <span class="kw">sum</span>(GDP)) %>%<span class="st"> </span>
<span class="co"># Join the continent-GDP data.frame to the continents data.frame</span>
<span class="st"> </span><span class="kw">left_join</span>(continents) %>%<span class="st"> </span>
<span class="co"># Calculate continent-level per-capita GDP</span>
<span class="st"> </span><span class="kw">mutate</span>(<span class="dt">per_cap =</span> cont_gdp /<span class="st"> </span>population) %>%<span class="st"> </span>
<span class="co"># Plot gdp versus land area</span>
<span class="st"> </span><span class="kw">ggplot</span>(<span class="kw">aes</span>(<span class="dt">x =</span> area_km2, <span class="dt">y =</span> per_cap)) +<span class="st"> </span>
<span class="co"># Draw points</span>
<span class="st"> </span><span class="kw">geom_point</span>() +<span class="st"> </span>
<span class="co"># And label them</span>
<span class="st"> </span><span class="kw">geom_text</span>(<span class="kw">aes</span>(<span class="dt">label =</span> continent), <span class="dt">nudge_y =</span> <span class="fl">5e3</span>) </code></pre></div>
<p><img src="fig/Putting the pieces together - solution-1.png" title="plot of chunk Putting the pieces together - solution" alt="plot of chunk Putting the pieces together - solution" style="display: block; margin: auto;" /></p>
</div>
</section>
</div>
</div>
</article>
<div class="footer">
<a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
<a class="label swc-blue-bg" href="https://github.com/swcarpentry/lesson-template">Source</a>
<a class="label swc-blue-bg" href="mailto:admin@software-carpentry.org">Contact</a>
<a class="label swc-blue-bg" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
</body>
</html>