# Step 1 - Data Engineering

-- The climate data for Hawaii is provided through two CSV files. Start by using Python and Pandas to inspect the content of these files and clean the data.

-- Create a Jupyter Notebook file called data_engineering.ipynb and use this to complete all of your Data Engineering tasks.

-- Use Pandas to read in the measurement and station CSV files as DataFrames.

-- Inspect the data for NaNs and missing values. You must decide what to do with this data.

-- Save your cleaned CSV files with the prefix clean_.


In [18]:
# Dependencies
import pandas as pd
from datetime import datetime

In [19]:
# Store filepath in a variable
measurements = "hawaii_measurements.csv"
stations = "hawaii_stations.csv"

In [20]:
# Read our Data file with the pandas library
# Not every CSV requires an encoding, but be aware this can come up
measurements_df = pd.read_csv(measurements, encoding = "ISO-8859-1")
stations_df = pd.read_csv(stations, encoding = "ISO-8859-1")

In [21]:
# Show measurements_df table
measurements_df.head()

Unnamed: 0,station,date,prcp,tobs
0,USC00519397,2010-01-01,0.08,65
1,USC00519397,2010-01-02,0.0,63
2,USC00519397,2010-01-03,0.0,74
3,USC00519397,2010-01-04,0.0,76
4,USC00519397,2010-01-06,,73


In [22]:
# Show stations_df table
stations_df

Unnamed: 0,station,name,latitude,longitude,elevation
0,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0
1,USC00513117,"KANEOHE 838.1, HI US",21.4234,-157.8015,14.6
2,USC00514830,"KUALOA RANCH HEADQUARTERS 886.9, HI US",21.5213,-157.8374,7.0
3,USC00517948,"PEARL CITY, HI US",21.3934,-157.9751,11.9
4,USC00518838,"UPPER WAHIAWA 874.3, HI US",21.4992,-158.0111,306.6
5,USC00519523,"WAIMANALO EXPERIMENTAL FARM, HI US",21.33556,-157.71139,19.5
6,USC00519281,"WAIHEE 837.5, HI US",21.45167,-157.84889,32.9
7,USC00511918,"HONOLULU OBSERVATORY 702.2, HI US",21.3152,-157.9992,0.9
8,USC00516128,"MANOA LYON ARBO 785.2, HI US",21.3331,-157.8025,152.4


In [23]:
# Inspect the data for NaNs and missing values, then decide how to handle them.

# Show the number of missing values in each column
print(measurements_df.isnull().sum())
print(stations_df.isnull().sum())
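
One way to carry out the inspection and the follow-up decision is sketched below on a small made-up frame that mirrors the `measurements_df` columns. Dropping rows with a missing `prcp` is an assumption made for illustration; filling the gaps with 0 or a per-station mean could be just as reasonable, depending on the analysis. The names `sample_df` and `cleaned_df` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample that mirrors the measurements columns shown above
sample_df = pd.DataFrame({
    "station": ["USC00519397"] * 3,
    "date": ["2010-01-03", "2010-01-04", "2010-01-06"],
    "prcp": [0.0, 0.0, np.nan],
    "tobs": [74, 76, 73],
})

# Count the missing values in each column
missing_counts = sample_df.isnull().sum()
print(missing_counts)

# One possible decision (an assumption): drop every row with a missing value
cleaned_df = sample_df.dropna(how="any").reset_index(drop=True)
print(len(sample_df), "rows before,", len(cleaned_df), "rows after")
```

The same two calls, `isnull().sum()` and `dropna()`, apply unchanged to the full `measurements_df`.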


In [24]:
# Export files as a CSV, without the Pandas index, but with the header
measurements_df.to_csv("Clean_measurements_df.csv", index=False, header=True)
stations_df.to_csv("Clean_stations_df.csv", index=False, header=True)

# Step 2 - Database Engineering

-- Use SQLAlchemy to model your table schemas and create a sqlite database for your tables. You will need one table for measurements and one for stations.

-- Create a Jupyter Notebook called database_engineering.ipynb and use this to complete all of your Database Engineering work.

-- Use Pandas to read your cleaned measurements and stations CSV data.

-- Use the engine and connection string to create a database called hawaii.sqlite.

-- Use declarative_base and create ORM classes for each table.

-- You will need a class for Measurement and for Station.
-- Make sure to define your primary keys.
-- Once you have your ORM classes defined, create the tables in the database using create_all.

In [25]:
# Dependencies
# ----------------------------------
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, Numeric, Text, Float, Date, ForeignKey
from sqlalchemy.orm import Session
from sqlalchemy import create_engine, inspect


In [26]:
engine = create_engine("sqlite:///hawaii.sqlite")  # Create the database connection engine
conn = engine.connect()  # Connection used to populate the database from the DataFrames

from sqlalchemy_utils.functions import drop_database
drop_database(engine.url)


#engine.execute("DROP TABLE IF EXISTS mea")

In [27]:
Base = declarative_base()

In [28]:
# Create measurement and stations Classes
# ----------------------------------
class Measurements(Base):
    __tablename__ = 'measurements'
    id = Column(Integer, primary_key=True)
    station = Column(String(255))
    date = Column(Date())
    prcp = Column(Integer)
    tobs = Column(Integer)

class Stations(Base):
    __tablename__ = 'stations'
    id = Column(Integer, primary_key=True)
    station = Column(String(255))
    name= Column(String(255))
    latitude = Column(Integer)
    longitude = Column(Integer)
    elevation = Column(Integer)

Base.metadata.create_all(engine) # create the tables
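
The CSV previews show decimal values for `prcp`, `latitude`, `longitude`, and `elevation`, so `Float` columns may describe the data more accurately than `Integer`. The following is a minimal sketch of that variant, not the notebook's actual schema: the class names `MeasurementSketch` and `StationSketch` are hypothetical, and an in-memory SQLite database stands in for `hawaii.sqlite`.

```python
from sqlalchemy import Column, Date, Float, Integer, String, create_engine, inspect

try:
    from sqlalchemy.orm import declarative_base  # SQLAlchemy 1.4 and newer
except ImportError:
    from sqlalchemy.ext.declarative import declarative_base  # older releases

SketchBase = declarative_base()


class MeasurementSketch(SketchBase):
    __tablename__ = "measurements"
    id = Column(Integer, primary_key=True)
    station = Column(String(255))
    date = Column(Date)
    prcp = Column(Float)    # precipitation has decimals, e.g. 0.08
    tobs = Column(Integer)  # temperature observations are whole numbers


class StationSketch(SketchBase):
    __tablename__ = "stations"
    id = Column(Integer, primary_key=True)
    station = Column(String(255))
    name = Column(String(255))
    latitude = Column(Float)
    longitude = Column(Float)
    elevation = Column(Float)


# In-memory database used only for this illustration
sketch_engine = create_engine("sqlite://")
SketchBase.metadata.create_all(sketch_engine)

# Map each reflected column name to its reflected type object
station_types = {
    col["name"]: col["type"]
    for col in inspect(sketch_engine).get_columns("stations")
}
print(station_types)
```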


In [29]:
inspect(engine).get_table_names() # to make sure the tables appear

['measurements', 'stations']

In [30]:
# Store filepath in a variable
clean_measurements = "Clean_measurements_df.csv"
clean_stations = "Clean_stations_df.csv"

# Read our Data file with the pandas library
clean_measurements_df = pd.read_csv(clean_measurements)
clean_stations_df = pd.read_csv(clean_stations)

clean_measurements_df.head()

Unnamed: 0,station,date,prcp,tobs
0,USC00519397,2010-01-01,0.08,65
1,USC00519397,2010-01-02,0.0,63
2,USC00519397,2010-01-03,0.0,74
3,USC00519397,2010-01-04,0.0,76
4,USC00519397,2010-01-06,,73


In [31]:
clean_stations_df.head()

Unnamed: 0,station,name,latitude,longitude,elevation
0,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0
1,USC00513117,"KANEOHE 838.1, HI US",21.4234,-157.8015,14.6
2,USC00514830,"KUALOA RANCH HEADQUARTERS 886.9, HI US",21.5213,-157.8374,7.0
3,USC00517948,"PEARL CITY, HI US",21.3934,-157.9751,11.9
4,USC00518838,"UPPER WAHIAWA 874.3, HI US",21.4992,-158.0111,306.6


In [32]:
#Create the session
session = Session(engine)
list(session.query(Measurements))

[]

In [33]:
# Write the DataFrame to the measurements table
# The leftover "Unnamed: 0" index column is renamed to match the table's id primary key
clean_measurements_df.rename(columns={"Unnamed: 0": "id"}).to_sql(
    Measurements.__tablename__,
    engine,
    if_exists='append',
    index=False)


In [34]:
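# Write the cleaned stations DataFrame to the stations table
# The leftover "Unnamed: 0" index column is renamed to match the table's id primary key
clean_stations_df.rename(columns={"Unnamed: 0": "id"}).to_sql(
    Stations.__tablename__,
    engine,
    if_exists='append',
    index=False)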


In [35]:
#session.query(Stations).all()

inspector = inspect(engine)
inspector.get_columns('stations')

[{'autoincrement': 'auto',
  'default': None,
  'name': 'id',
  'nullable': False,
  'primary_key': 1,
  'type': INTEGER()},
 {'autoincrement': 'auto',
  'default': None,
  'name': 'station',
  'nullable': True,
  'primary_key': 0,
  'type': VARCHAR(length=255)},
 {'autoincrement': 'auto',
  'default': None,
  'name': 'name',
  'nullable': True,
  'primary_key': 0,
  'type': VARCHAR(length=255)},
 {'autoincrement': 'auto',
  'default': None,
  'name': 'latitude',
  'nullable': True,
  'primary_key': 0,
  'type': INTEGER()},
 {'autoincrement': 'auto',
  'default': None,
  'name': 'longitude',
  'nullable': True,
  'primary_key': 0,
  'type': INTEGER()},
 {'autoincrement': 'auto',
  'default': None,
  'name': 'elevation',
  'nullable': True,
  'primary_key': 0,
  'type': INTEGER()}]

In [36]:
inspector.get_columns('measurements')

[{'autoincrement': 'auto',
  'default': None,
  'name': 'id',
  'nullable': False,
  'primary_key': 1,
  'type': INTEGER()},
 {'autoincrement': 'auto',
  'default': None,
  'name': 'station',
  'nullable': True,
  'primary_key': 0,
  'type': VARCHAR(length=255)},
 {'autoincrement': 'auto',
  'default': None,
  'name': 'date',
  'nullable': True,
  'primary_key': 0,
  'type': DATE()},
 {'autoincrement': 'auto',
  'default': None,
  'name': 'prcp',
  'nullable': True,
  'primary_key': 0,
  'type': INTEGER()},
 {'autoincrement': 'auto',
  'default': None,
  'name': 'tobs',
  'nullable': True,
  'primary_key': 0,
  'type': INTEGER()}]

# Step 3 - Climate Analysis and Exploration

You are now ready to use Python and SQLAlchemy to do basic climate analysis and data exploration on your new weather station tables. All of the following analysis should be completed using SQLAlchemy ORM queries, Pandas, and Matplotlib.

Create a Jupyter Notebook file called climate_analysis.ipynb and use it to complete your climate analysis and data exploration.

Choose a start date and end date for your trip. Make sure that your vacation range is approximately 3-15 days total.

Use SQLAlchemy create_engine to connect to your sqlite database.

Use SQLAlchemy automap_base() to reflect your tables into classes and save a reference to those classes called Station and Measurement.
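
The connection and reflection steps might look roughly like the sketch below. It assumes SQLAlchemy 1.4 or newer (older releases used `Base.prepare(engine, reflect=True)` instead) and uses an in-memory database with two hand-written tables as a stand-in for `sqlite:///hawaii.sqlite`. Automap only maps tables that have a primary key, which is why both stand-in tables declare one.

```python
from sqlalchemy import create_engine, text
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session

# In-memory stand-in; the real notebook would use "sqlite:///hawaii.sqlite"
engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text(
        "CREATE TABLE stations (id INTEGER PRIMARY KEY, station TEXT, name TEXT)"
    ))
    conn.execute(text(
        "CREATE TABLE measurements "
        "(id INTEGER PRIMARY KEY, station TEXT, date TEXT, prcp FLOAT, tobs INTEGER)"
    ))

# Reflect the existing tables into mapped classes
Base = automap_base()
Base.prepare(autoload_with=engine)

# Save references to the reflected classes
Station = Base.classes.stations
Measurement = Base.classes.measurements

session = Session(engine)
print(sorted(Base.classes.keys()))
```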

### Precipitation Analysis

Design a query to retrieve the last 12 months of precipitation data.

Select only the date and prcp values.

Load the query results into a Pandas DataFrame and set the index to the date column.

Plot the results using the DataFrame plot method.
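
The 12-month window and the DataFrame step can be sketched as follows. The `rows` list is made-up data standing in for the `(date, prcp)` tuples an ORM query would return, and treating "12 months" as 365 days before the most recent date is an assumption.

```python
import datetime as dt

import pandas as pd

# Hypothetical (date, prcp) rows standing in for query results
rows = [
    ("2016-08-22", 0.40),
    ("2016-08-23", 0.00),
    ("2017-01-15", 0.12),
    ("2017-08-23", 0.08),
]

# Find the most recent date, then step back 365 days
last_date = dt.datetime.strptime(max(d for d, _ in rows), "%Y-%m-%d").date()
year_ago = last_date - dt.timedelta(days=365)

# ISO date strings sort chronologically, so a string comparison is enough here
recent = [(d, p) for d, p in rows if d >= year_ago.isoformat()]

# Load into a DataFrame indexed by date
prcp_df = pd.DataFrame(recent, columns=["date", "prcp"]).set_index("date")
print(prcp_df)
```

Calling `prcp_df.plot()` afterwards would draw the chart, provided Matplotlib is installed.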

### Station Analysis

Design a query to calculate the total number of stations.

Design a query to find the most active stations.

List the stations and observation counts in descending order.
Which station has the highest number of observations?

Design a query to retrieve the last 12 months of temperature observation data (tobs).

Filter by the station with the highest number of observations.
Plot the results as a histogram with bins=12.
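
A possible shape for the station-activity queries is sketched below with `func.count`, `group_by`, and a descending `order_by`. The `Measurement` class and its handful of rows are made up for the example, and the station total is counted from the distinct station codes in that table rather than from a separate stations table.

```python
from sqlalchemy import Column, Integer, String, create_engine, func
from sqlalchemy.orm import Session

try:
    from sqlalchemy.orm import declarative_base  # SQLAlchemy 1.4 and newer
except ImportError:
    from sqlalchemy.ext.declarative import declarative_base  # older releases

Base = declarative_base()


class Measurement(Base):
    __tablename__ = "measurements"
    id = Column(Integer, primary_key=True)
    station = Column(String(255))
    tobs = Column(Integer)


# In-memory database with hypothetical observations
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = Session(engine)
samples = [
    ("USC00519281", 70), ("USC00519281", 72), ("USC00519281", 68),
    ("USC00519397", 75), ("USC00519397", 77),
    ("USC00513117", 71),
]
session.add_all([Measurement(station=s, tobs=t) for s, t in samples])
session.commit()

# Total number of distinct stations
station_total = session.query(Measurement.station).distinct().count()

# Observation counts per station, most active first
station_counts = (
    session.query(Measurement.station, func.count(Measurement.id))
    .group_by(Measurement.station)
    .order_by(func.count(Measurement.id).desc())
    .all()
)
most_active = station_counts[0][0]
print(station_total, most_active, station_counts)
```

The `most_active` value can then be used to filter the `tobs` query before plotting the histogram.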

### Temperature Analysis

Write a function called calc_temps that will accept a start date and end date in the format %Y-%m-%d and return the minimum, average, and maximum temperatures for that range of dates.

Use the calc_temps function to calculate the min, avg, and max temperatures for your trip using the matching dates from the previous year (i.e. use "2017-01-01" if your trip start date was "2018-01-01")

Plot the min, avg, and max temperature from your previous query as a bar chart.

Use the average temperature as the bar height.
Use the peak-to-peak (tmax-tmin) value as the y error bar (yerr).
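
A minimal sketch of `calc_temps` follows, assuming a `Measurement` class with a `Date` column and a module-level `session`. The sample rows and the `previous_year` helper are hypothetical additions for illustration only.

```python
import datetime as dt

from sqlalchemy import Column, Date, Integer, create_engine, func
from sqlalchemy.orm import Session

try:
    from sqlalchemy.orm import declarative_base  # SQLAlchemy 1.4 and newer
except ImportError:
    from sqlalchemy.ext.declarative import declarative_base  # older releases

Base = declarative_base()


class Measurement(Base):
    __tablename__ = "measurements"
    id = Column(Integer, primary_key=True)
    date = Column(Date)
    tobs = Column(Integer)


# In-memory database with hypothetical observations
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = Session(engine)
samples = [("2017-01-01", 62), ("2017-01-02", 66), ("2017-01-03", 70), ("2017-01-10", 80)]
session.add_all([Measurement(date=dt.date.fromisoformat(d), tobs=t) for d, t in samples])
session.commit()


def calc_temps(start_date, end_date):
    """Return (tmin, tavg, tmax) for dates given as %Y-%m-%d strings, inclusive."""
    start = dt.datetime.strptime(start_date, "%Y-%m-%d").date()
    end = dt.datetime.strptime(end_date, "%Y-%m-%d").date()
    return (
        session.query(
            func.min(Measurement.tobs),
            func.avg(Measurement.tobs),
            func.max(Measurement.tobs),
        )
        .filter(Measurement.date >= start)
        .filter(Measurement.date <= end)
        .one()
    )


def previous_year(date_string):
    """Shift a %Y-%m-%d date back one year (raises ValueError for Feb 29)."""
    d = dt.datetime.strptime(date_string, "%Y-%m-%d").date()
    return d.replace(year=d.year - 1).isoformat()


# Trip dates of 2018-01-01 to 2018-01-03 map to the same days in 2017
tmin, tavg, tmax = calc_temps(previous_year("2018-01-01"), previous_year("2018-01-03"))
yerr = tmax - tmin  # peak-to-peak value used as the error bar
print(tmin, tavg, tmax, yerr)
```

For the bar chart, `tavg` would be the bar height and `yerr` the error bar.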

### Optional Recommended Analysis

The following are optional challenge queries. These are highly recommended to attempt, but not required for the homework.

Calculate the rainfall per weather station using the previous year's matching dates.

Calculate the daily normals. Normals are the averages for min, avg, and max temperatures.

Create a function called daily_normals that will calculate the daily normals for a specific date. This date string will be in the format %m-%d. Be sure to use all historic tobs that match that date string.
Create a list of dates for your trip in the format %m-%d. Use the daily_normals function to calculate the normals for each date string and append the results to a list.
Load the list of daily normals into a Pandas DataFrame and set the index equal to the date.
Use Pandas to plot an area plot (stacked=False) for the daily normals.
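
The daily-normals steps above can be sketched as follows. The sketch assumes a `Measurement` class with a `Date` column on SQLite, where `func.strftime("%m-%d", ...)` extracts the month and day. The sample rows and the two-day trip are made up.

```python
import datetime as dt

import pandas as pd
from sqlalchemy import Column, Date, Integer, create_engine, func
from sqlalchemy.orm import Session

try:
    from sqlalchemy.orm import declarative_base  # SQLAlchemy 1.4 and newer
except ImportError:
    from sqlalchemy.ext.declarative import declarative_base  # older releases

Base = declarative_base()


class Measurement(Base):
    __tablename__ = "measurements"
    id = Column(Integer, primary_key=True)
    date = Column(Date)
    tobs = Column(Integer)


# In-memory database with hypothetical observations across several years
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = Session(engine)
samples = [("2015-01-01", 60), ("2016-01-01", 64), ("2017-01-01", 68), ("2016-01-02", 70)]
session.add_all([Measurement(date=dt.date.fromisoformat(d), tobs=t) for d, t in samples])
session.commit()


def daily_normals(date_string):
    """Return (tmin, tavg, tmax) over all years for a %m-%d date string."""
    row = (
        session.query(
            func.min(Measurement.tobs),
            func.avg(Measurement.tobs),
            func.max(Measurement.tobs),
        )
        .filter(func.strftime("%m-%d", Measurement.date) == date_string)
        .one()
    )
    return tuple(row)


# Build the list of trip dates in %m-%d format (hypothetical two-day trip)
trip_start = dt.date(2018, 1, 1)
trip_dates = [(trip_start + dt.timedelta(days=i)).strftime("%m-%d") for i in range(2)]

# Collect the normals for each date and index the DataFrame by date
normals = [daily_normals(d) for d in trip_dates]
normals_df = pd.DataFrame(normals, columns=["tmin", "tavg", "tmax"], index=trip_dates)
normals_df.index.name = "date"
print(normals_df)
```

With Matplotlib installed, `normals_df.plot.area(stacked=False)` would produce the area plot.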