# Importing Weather Data into Hive

Before starting the DataFrame tutorial, we need to import the weather data into Hive so that we have some realistic data to work with.

Since Jupyter Notebooks do not support Hive cells, the instructions are provided here. You can follow these steps inside Hue or on the Hive command line interface.

## Creating Training Database

If you don't already have a training database, create one:

    create database training;
    
## Import Weather Data

First, we create an external Hive table. Each year of weather data will later be attached to it as a partition:

    create external table training.weather_raw (
        data string
    )
    partitioned by(year string) 
    stored as textfile;
    
Now add a partition for 2014 that points at the directory holding that year's files:

    alter table training.weather_raw
    add partition(year='2014')
    location '/user/cloudera/data/weather/2014';
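
If more years are available, the same statement has to be repeated once per year. The following is a minimal sketch of a helper that generates these statements. The function name `add_partition_statement` is hypothetical, and it assumes that every year lives in a directory named after the year below `/user/cloudera/data/weather`; only the 2014 directory is confirmed by this tutorial.

```python
# Base directory taken from the statement above; other year directories
# are an assumption and may not exist on your cluster.
BASE_DIR = "/user/cloudera/data/weather"


def add_partition_statement(year, table="training.weather_raw", base_dir=BASE_DIR):
    """Build the ALTER TABLE statement that registers one year as a partition."""
    return (
        "alter table {table}\n"
        "add partition(year='{year}')\n"
        "location '{base_dir}/{year}';"
    ).format(table=table, year=year, base_dir=base_dir)


# Statement for the partition used in this tutorial.
statement_2014 = add_partition_statement(2014)
print(statement_2014)
```

The printed text can be pasted into Hue or the Hive command line interface.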
    
Check that we get some results:

    select * from training.weather_raw limit 10;
    
## Create VIEW

Each row of the raw Hive table holds a complete fixed-width record in a single string column, which is awkward to query. To work with the data in a more accessible format, we create a VIEW that extracts the relevant fields by their character positions.

    CREATE VIEW training.weather AS
    SELECT 
        year,
        SUBSTR(`data`,5,6) AS `usaf`,
        SUBSTR(`data`,11,5) AS `wban`, 
        SUBSTR(`data`,16,8) AS `date`, 
        SUBSTR(`data`,24,4) AS `time`,
        SUBSTR(`data`,42,5) AS report_type,
        SUBSTR(`data`,61,3) AS wind_direction, 
        SUBSTR(`data`,64,1) AS wind_direction_qual, 
        SUBSTR(`data`,65,1) AS wind_observation, 
        CAST(SUBSTR(`data`,66,4) AS FLOAT)/10 AS wind_speed,
        SUBSTR(`data`,70,1) AS wind_speed_qual,
        CAST(SUBSTR(`data`,88,5) AS FLOAT)/10 AS air_temperature, 
        SUBSTR(`data`,93,1) AS air_temperature_qual 
    FROM training.weather_raw;
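
To make the offsets easier to follow, here is a rough Python sketch of the same extraction. Hive's `SUBSTR` counts positions from 1, while Python slices count from 0, so each start position is shifted by one. The helper names (`substr`, `parse_record`, `make_sample_line`) are hypothetical, and the sample line is synthetic: it is filled with `#` characters and only carries values at the positions the view reads, so it is not a real weather record.

```python
# Field layout copied from the view: (name, 1-based start, length).
FIELDS = [
    ("usaf", 5, 6),
    ("wban", 11, 5),
    ("date", 16, 8),
    ("time", 24, 4),
    ("report_type", 42, 5),
    ("wind_direction", 61, 3),
    ("wind_direction_qual", 64, 1),
    ("wind_observation", 65, 1),
    ("wind_speed", 66, 4),
    ("wind_speed_qual", 70, 1),
    ("air_temperature", 88, 5),
    ("air_temperature_qual", 93, 1),
]


def substr(text, pos, length):
    """Mimic Hive's SUBSTR(text, pos, length) with a 1-based start position."""
    return text[pos - 1:pos - 1 + length]


def parse_record(line):
    """Extract the same columns as the view from one raw record."""
    row = {name: substr(line, pos, length) for name, pos, length in FIELDS}
    # The view divides these two fields by 10 after casting them to FLOAT.
    row["wind_speed"] = float(row["wind_speed"]) / 10
    row["air_temperature"] = float(row["air_temperature"]) / 10
    return row


def make_sample_line(values, width=100):
    """Build a synthetic record: '#' filler with values placed at the field offsets."""
    chars = ["#"] * width
    for name, pos, length in FIELDS:
        text = values[name]
        assert len(text) == length, name
        chars[pos - 1:pos - 1 + length] = text
    return "".join(chars)


# Synthetic values chosen for illustration only.
sample = make_sample_line({
    "usaf": "010060", "wban": "99999", "date": "20140101", "time": "0100",
    "report_type": "FM-12", "wind_direction": "240", "wind_direction_qual": "1",
    "wind_observation": "N", "wind_speed": "0030", "wind_speed_qual": "1",
    "air_temperature": "-0136", "air_temperature_qual": "1",
})
row = parse_record(sample)
print(row)
```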
    
We should also check that the output looks sensible:

    select * from training.weather limit 10;
    

# Import Station Data

We also need to import the weather station data. Fortunately this is easier, since the format is already CSV. 

## Put Data into Directory

We cannot create the table right away, because the location of a Hive table has to be a directory and we only have a single file. We first need to copy the file into a new directory of its own:

    hdfs dfs -mkdir /user/cloudera/data/weather/isd
    hdfs dfs -cp /user/cloudera/data/weather/isd-history.csv /user/cloudera/data/weather/isd

## Create Table

Now we can create the table:

    CREATE EXTERNAL TABLE training.stations(
        usaf STRING,
        wban STRING,
        name STRING,
        country STRING,
        state STRING,
        icao STRING,
        latitude FLOAT,
        longitude FLOAT,
        elevation FLOAT,
        date_begin STRING,
        date_end STRING) 
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
       "separatorChar" = ",",
       "quoteChar"     = "\"",
       "escapeChar"    = "\\"
    )
    STORED AS TEXTFILE
    LOCATION '/user/cloudera/data/weather/isd';
    
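Note that `OpenCSVSerde` returns every column as a string regardless of the declared type, so `latitude`, `longitude` and `elevation` need an explicit `CAST` in queries that use them as numbers. Again, we check the contents:

    select * from training.stations limit 10;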
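
As a rough analogy for what the separator and quote settings do, the sketch below parses one CSV line with Python's `csv` module. This is not how Hive implements the SerDe. The sample line is made up for illustration and is not taken from `isd-history.csv`; it only shows that a comma inside quotes stays within one field, and that every value arrives as a string that has to be converted explicitly.

```python
import csv

# Column names taken from the CREATE EXTERNAL TABLE statement.
COLUMNS = [
    "usaf", "wban", "name", "country", "state", "icao",
    "latitude", "longitude", "elevation", "date_begin", "date_end",
]

# Hypothetical line, invented for this example.
sample_line = (
    '"010060","99999","EXAMPLE, STATION","NO","","",'
    '"+78.250","+022.817","+0014.0","19730101","20140101"'
)

# Same separator and quote character as in the SERDEPROPERTIES.
reader = csv.reader([sample_line], delimiter=",", quotechar='"')
values = next(reader)
station = dict(zip(COLUMNS, values))

# Every field is a string; numeric columns need an explicit conversion.
latitude = float(station["latitude"])
print(station["name"], latitude)
```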

Back in the notebook, the view can now be read through the `sqlContext`:

    print sqlContext.table("training.weather").take(10)

[Row(year=u'2014', usaf=u'010060', wban=u'99999', date=u'20140101', time=u'0100', report_type=u'FM-12', wind_direction=u'240', wind_direction_qual=u'1', wind_observation=u'N', wind_speed=3.0, wind_speed_qual=u'1', air_temperature=-13.6, air_temperature_qual=u'1'), Row(year=u'2014', usaf=u'010060', wban=u'99999', date=u'20140101', time=u'0200', report_type=u'FM-12', wind_direction=u'140', wind_direction_qual=u'1', wind_observation=u'N', wind_speed=2.0, wind_speed_qual=u'1', air_temperature=-14.2, air_temperature_qual=u'1'), Row(year=u'2014', usaf=u'010060', wban=u'99999', date=u'20140101', time=u'0400', report_type=u'FM-12', wind_direction=u'210', wind_direction_qual=u'1', wind_observation=u'N', wind_speed=4.0, wind_speed_qual=u'1', air_temperature=-10.7, air_temperature_qual=u'1'), Row(year=u'2014', usaf=u'010060', wban=u'99999', date=u'20140101', time=u'0500', report_type=u'FM-12', wind_direction=u'170', wind_direction_qual=u'1', wind_observation=u'N', wind_speed=3.0, wind_speed_qual=