Indeed it's gone. Alright. Let's upload this to hive. First, we need to upload it to hdfs.
(of course, change `irmak` to whichever directory you have in hdfs)
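Something like this does the trick (a minimal sketch; the file name `Master.csv` is an assumption, use whatever your players file is called):

```bash
# make sure your hdfs home directory exists, then copy the csv into it
hadoop fs -mkdir -p /user/irmak
hadoop fs -put Master.csv /user/irmak/
```

With the file in hdfs, we can create a hive table and load the data into it (again a sketch; the `players` table name and the trimmed-down schema are assumptions, not the real column list):

```bash
hive -e "
-- NOTE: table name and columns below are assumptions, for illustration only
CREATE TABLE players (
  player_id STRING,
  birth_year INT,
  birth_country STRING,
  name_first STRING,
  name_last STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- this moves the file from its hdfs location into hive's warehouse directory
LOAD DATA INPATH '/user/irmak/Master.csv' INTO TABLE players;"
```

If all went well, hive reports something like: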
```bash
OK
Time taken: 1.166 seconds
```
And it's in!
#### Use Hive to make queries over the distributed data
We now have a Hive table. The best part of Hive is that when you make a query (one that most of the time looks **exactly** like a sql query), Hive automatically creates the map and reduce tasks, runs them over the hadoop cluster, and gives you the answer without you having to worry about any of it. If your question is easily expressed as a sql query, Hive will take care of all the dirty work for you. The table might be spread over thousands of computers, but you don't need to think hard about that at all.
Let's start easy. Let's find out how many players we have in this table.
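Something along these lines should do it (a sketch; the `players` table and its columns follow the assumed schema above):

```bash
# count the rows in the (assumed) players table
hive -e "SELECT COUNT(*) FROM players;"
```

Hive turns this into a mapreduce job, runs it, and prints the count. We can ask a slightly harder question too, like how many players were born in each country (again, `birth_country` is an assumed column name):

```bash
# count players per birth country, most common first
hive -e "
SELECT birth_country, COUNT(*) AS n_players
FROM players
GROUP BY birth_country
ORDER BY n_players DESC;"
```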
As you can see, a simple GROUP BY statement takes care of everything.
In this manner, you can do sql-like queries over tons of data that live in hdfs in a distributed state. Since hdfs and mapreduce have overheads, it will not be as fast as a sql query on data that fits on a single machine, but you now get the answers in parallel, and are able to do sql queries over hundreds of terabytes of data.
#### Join example in Hive
Let's upload another table and see how joins work. Salaries.csv has five columns: year, team, league, player, salary. It only has salary information for the years after 1984, but it's pretty extensive.
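The steps are the same as before (a sketch; the hdfs path, the `salaries` table name, and the column types are assumptions):

```bash
# copy the csv into hdfs, then create a hive table and load the data
hadoop fs -put Salaries.csv /user/irmak/
hive -e "
-- 'yr' instead of 'year', to stay clear of hive's interval keyword
CREATE TABLE salaries (
  yr INT,
  team STRING,
  league STRING,
  player STRING,
  salary FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
LOAD DATA INPATH '/user/irmak/Salaries.csv' INTO TABLE salaries;"
```

Once both tables are in, a join looks just like it would in sql, for example (column names assumed):

```bash
# join players to their salaries on the player id
hive -e "
SELECT p.name_first, p.name_last, s.yr, s.salary
FROM players p JOIN salaries s ON (p.player_id = s.player)
LIMIT 10;"
```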