Commit 6b68fd0

added subheaders to the hive tutorial
1 parent 4e5b20d

File tree: 1 file changed (+15 −0 lines)


hive.md

Lines changed: 15 additions & 0 deletions
@@ -1,5 +1,7 @@
## Using Hive

+#### Install and set up Hive
+
ssh to your cloud computer and switch to the hduser. Go to the hduser's home.

```bash
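# The tutorial's actual commands are truncated at this hunk boundary.
# A hedged sketch of this step; my-cloud-host and the login user are
# hypothetical, so substitute your own machine's address and account.
$ ssh ubuntu@my-cloud-host
$ sudo su - hduser
$ cd ~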
@@ -64,6 +66,8 @@ Hive's syntax is (almost) identical to SQL. So let's load up some data and use i
hive> exit;
```

+#### Download some baseball data to play with
+
You should be back at your regular prompt now. Let's download some baseball data.
```bash
$ wget http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip
@@ -82,6 +86,8 @@ $ mkdir baseballdata
$ unzip lahman-csv_2014-02-14.zip -d baseballdata
```

+#### First look & cleanup of the data
+
Now you have a bunch of csv files in the `baseballdata` directory.
You can think of each csv as a table in a baseball database.
Let's create one Hive table and read a csv into that table.
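The tutorial's actual DDL is elided by this diff. As a hedged sketch of the general pattern (not the tutorial's exact statement), a comma-delimited csv maps onto a Hive text table roughly like this; `master_sketch` is a hypothetical name, and the column list is trimmed to a few fields for illustration:

```bash
hive> -- master_sketch is a hypothetical name; the tutorial's real DDL is elided.
hive> CREATE TABLE master_sketch (
    >   playerID STRING,
    >   birthYear INT,
    >   nameFirst STRING,
    >   nameLast STRING
    > )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;
```

With a table like this in place, a csv in hdfs can be read in with a `LOAD DATA INPATH` statement, which is presumably what the elided part of the tutorial does next.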
@@ -189,6 +195,8 @@ abbeybe01,1869,11,11,USA,VT,Essex,1962,6,11,USA,VT,Colchester,Bert,Abbey,Bert Wo
abbeych01,1866,10,14,USA,NE,Falls City,1926,4,27,USA,CA,San Francisco,Charlie,Abbey,Charles S.,169,68,L,L,1893-08-16,1897-08-19,abbec101,abbeych01
```

+#### Upload data to Hive
+
Indeed it's gone. Alright. Let's upload this to Hive. First, we need to upload it to hdfs.
(of course, change `irmak` to whichever directory you have in hdfs)
```bash
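# The actual commands are cut off at this hunk boundary. A hedged sketch,
# assuming the cleaned csv is still named Master.csv; irmak is the
# tutorial's hdfs user directory, so substitute your own.
$ hdfs dfs -put Master.csv /user/irmak/
# Verify that it arrived.
$ hdfs dfs -ls /user/irmak/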
@@ -259,6 +267,9 @@ OK
Time taken: 1.166 seconds
```
And it's in!
+
+#### Use Hive to make queries over the distributed data
+
We now have a Hive table. The best part of Hive is that when you make a query (one that most of the time looks **exactly** like a SQL query), Hive automatically creates the map and reduce tasks, runs them over the Hadoop cluster, and gives you the answer, without you having to worry about any of it. If your question is easily represented in the form of a SQL query, Hive will take care of all the dirty work for you. The table might be spread over thousands of computers, but you don't need to think hard about that at all.

Let's start easy. Let's find out how many players we have in this table.
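The query itself is elided by the hunk, but as a minimal sketch, a count in HiveQL looks like a plain SQL count; the table name `master` is an assumption here:

```bash
hive> SELECT COUNT(*) FROM master;
```

Behind that one line, Hive plans a MapReduce job, runs it on the cluster, and prints the total once the job finishes.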
@@ -448,6 +459,8 @@ As you can see, a simple GROUP BY statement takes care of everything. Easier tha

In this manner, you can run SQL-like queries over tons of data that live in hdfs in a distributed state. Since hdfs and MapReduce have overhead, it will not be as fast as a SQL query on data that fits on a single machine, but you get the answers in parallel, and you are able to run SQL queries over hundreds of terabytes of data.

+#### Join example in Hive
+
Let's upload another table and see how joins work. Salaries.csv has four columns: year, team, league, player, salary. It only has salary information for the years after 1984, but it's pretty extensive.

Let's remove the header and upload it to hdfs
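The header removal, upload, and join statements themselves are elided by the diff. As a hedged sketch of where this is headed, a HiveQL join over these two tables might look like the following; the table names `master` and `salaries` are assumptions, the `salaries` columns follow the description above, and the `master` columns assume the Lahman csv's own names (playerID, nameFirst, nameLast):

```bash
hive> SELECT m.nameFirst, m.nameLast, s.year, s.salary
    > FROM salaries s
    > JOIN master m ON (s.player = m.playerID)
    > WHERE s.year = 2013
    > LIMIT 10;
```

Hive plans this as one or more MapReduce jobs over the cluster, just like the single-table queries.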
@@ -642,6 +655,8 @@ Time taken: 15.451 seconds, Fetched: 106 row(s)
```
Done. By joining tables, you can build some pretty complicated queries, which Hive will automatically execute with MapReduce.

+#### More resources
+
[You can find the documentation for Hive commands here](https://cwiki.apache.org/confluence/display/Hive/LanguageManual).

[And here is another tutorial with more examples](https://cwiki.apache.org/confluence/display/Hive/Tutorial)
