<!--lint disable no-heading-punctuation-->
# Surfs Up!
<!--lint enable no-heading-punctuation-->

<img src='images/surfs-up.jpeg'/>

Congratulations! You've decided to treat yourself to a long holiday vacation in Honolulu, Hawaii! To help with your trip planning, you have decided to do some climate analysis on the area. Because you are such an awesome person, you will also share your ninja analytical skills with the community by providing a climate analysis API. The following outlines what you need to do.

## Step 1 - Data Engineering

The climate data for Hawaii is provided through two CSV files. Start by using Python and Pandas to inspect the content of these files and clean the data.

* Create a Jupyter Notebook file called `data_engineering.ipynb` and use this to complete all of your Data Engineering tasks.

* Use Pandas to read in the measurement and station CSV files as DataFrames.

* Inspect the data for NaNs and missing values. You must decide what to do with this data.

* Save your cleaned CSV files with the prefix `clean_`.
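
As a rough illustration of the "inspect, then decide" step, the sketch below counts missing values per column and shows two common choices: dropping the affected rows or filling the gaps. The `sample` frame is a small hypothetical stand-in for the measurement file, not the real data, and filling with `0.0` is only one possible placeholder.

```python
import pandas as pd

# Hypothetical sample with the same columns as the measurement file
sample = pd.DataFrame({
    "station": ["USC00519397"] * 4,
    "date": ["2010-01-03", "2010-01-04", "2010-01-06", "2010-01-07"],
    "prcp": [0.0, 0.0, None, 0.06],
    "tobs": [74, 76, 73, 70],
})

# Count missing values in each column
missing_per_column = sample.isna().sum()

# Option 1: drop rows that lack a precipitation reading
dropped = sample.dropna(subset=["prcp"]).reset_index(drop=True)

# Option 2: keep every row and fill the gap with a placeholder value
filled = sample.fillna({"prcp": 0.0})
```

Dropping keeps only observed readings, while filling keeps the temperature values on those days at the cost of inventing a precipitation value; which is preferable depends on the analysis planned.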

In [1]:
# Remove outputs from any previous run; -f keeps rm quiet when a file does not exist yet
!rm -f hawaii.sqlite Resources/clean_measurements.csv Resources/clean_stations.csv Resources/clean_hawaii.csv
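
Where a Unix shell is not available, the same clean-up can be done from Python. The sketch below is one possible approach; `remove_outputs` is a hypothetical helper name, and the demonstration runs in a temporary directory so it does not touch real files.

```python
from pathlib import Path
import tempfile


def remove_outputs(paths):
    """Delete each file if it exists; missing files are skipped silently."""
    for path in paths:
        Path(path).unlink(missing_ok=True)  # missing_ok requires Python 3.8+


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "clean_measurements.csv"
    existing.write_text("station,date,prcp,tobs\n")
    missing = Path(tmp) / "clean_stations.csv"  # never created
    remove_outputs([existing, missing])
    remaining = sorted(p.name for p in Path(tmp).iterdir())
```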

In [2]:
# Dependencies
import pandas as pd

In [3]:
# Path of the CSV file
csvfile = 'Resources/hawaii_measurements.csv'

In [4]:
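# Read the CSV file into a pandas DataFrame (dtype=object keeps every value as a string)
df = pd.read_csv(csvfile, dtype=object)

# count() returns the number of non-null values per column, so a lower count reveals missing values
df.count()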

station    19550
date       19550
prcp       18103
tobs       19550
dtype: int64
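
The counts above can be turned into the number and share of missing values per column. This is a small sketch using the figures printed by `df.count()`; it assumes that at least one column is complete, so the largest count equals the number of rows.

```python
import pandas as pd

# Non-null counts copied from the df.count() output
counts = pd.Series({"station": 19550, "date": 19550, "prcp": 18103, "tobs": 19550})

# Assumption: at least one column has no gaps, so the maximum is the row count
total_rows = counts.max()

missing = total_rows - counts
missing_pct = (missing / total_rows * 100).round(2)
```

Here `missing["prcp"]` is 1447, roughly 7.4% of the rows, and every other column has no gaps.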

In [5]:
# Preview the DataFrame
# Note that row 4 is missing prcp
df[:6]

| | station | date | prcp | tobs |
|---|---|---|---|---|
| 0 | USC00519397 | 2010-01-01 | 0.08 | 65 |
| 1 | USC00519397 | 2010-01-02 | 0.0 | 63 |
| 2 | USC00519397 | 2010-01-03 | 0.0 | 74 |
| 3 | USC00519397 | 2010-01-04 | 0.0 | 76 |
| 4 | USC00519397 | 2010-01-06 | NaN | 73 |
| 5 | USC00519397 | 2010-01-07 | 0.06 | 70 |


In [6]:
# Keep only the rows where prcp is present (pd.notnull() flags non-missing values)
# reset_index(drop=True) renumbers the rows so the index has no gap
clean_measurements = df[pd.notnull(df['prcp'])].reset_index(drop=True)

# Preview the data again
clean_measurements[:5]

| | station | date | prcp | tobs |
|---|---|---|---|---|
| 0 | USC00519397 | 2010-01-01 | 0.08 | 65 |
| 1 | USC00519397 | 2010-01-02 | 0.0 | 63 |
| 2 | USC00519397 | 2010-01-03 | 0.0 | 74 |
| 3 | USC00519397 | 2010-01-04 | 0.0 | 76 |
| 4 | USC00519397 | 2010-01-07 | 0.06 | 70 |
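
Because the file was read with `dtype=object`, the values in `prcp` and `tobs` are strings rather than numbers. Before any arithmetic, they would need converting. The sketch below shows one way to do that on a hypothetical two-row frame shaped like `clean_measurements`; the `typed` name is illustrative.

```python
import pandas as pd

# Hypothetical rows shaped like clean_measurements; every value is a string
rows = pd.DataFrame({
    "station": ["USC00519397", "USC00519397"],
    "date": ["2010-01-01", "2010-01-07"],
    "prcp": ["0.08", "0.06"],
    "tobs": ["65", "70"],
}, dtype=object)

# Convert each column to a type suited to analysis
typed = rows.assign(
    date=pd.to_datetime(rows["date"], format="%Y-%m-%d"),
    prcp=pd.to_numeric(rows["prcp"]),
    tobs=pd.to_numeric(rows["tobs"]),
)
```

After the conversion, operations such as `typed["tobs"].mean()` behave numerically instead of failing or concatenating strings.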


In [7]:
# Save the cleaned measurements to a CSV file
new_csv = 'Resources/clean_measurements.csv'
clean_measurements.to_csv(new_csv, index=False)
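
The `index=False` argument matters when the file is read back later. The sketch below, which writes to in-memory buffers instead of real files, shows that saving with the default index adds an extra `Unnamed: 0` column on reload, while `index=False` preserves the original columns. The one-row `frame` is a hypothetical example.

```python
import io
import pandas as pd

frame = pd.DataFrame({"station": ["USC00519397"], "tobs": [65]})

# Default: the row index is written as an unnamed first column
with_index = io.StringIO()
frame.to_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
frame.to_csv(without_index, index=False)

cols_with = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()
```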

In [8]:
# Path of the CSV file
csvfile2 = 'Resources/hawaii_stations.csv'

In [9]:
# Read the second CSV file (weather stations listing) into a DataFrame
clean_stations = pd.read_csv(csvfile2, dtype=object)

# Use count() to check for NaN values
clean_stations.count()

station      9
name         9
latitude     9
longitude    9
elevation    9
dtype: int64

In [10]:
# Clean the station names: drop digits and periods, then keep only the text before the first comma
clean_stations['name'] = (
    clean_stations['name']
    .str.replace(r'[\d.]', '', regex=True)
    .str.split(',').str[0]
    .str.strip()  # remove whitespace left behind once the digits are gone
)

# Preview the weather stations listing
clean_stations

| | station | name | latitude | longitude | elevation |
|---|---|---|---|---|---|
| 0 | USC00519397 | WAIKIKI | 21.2716 | -157.8168 | 3.0 |
| 1 | USC00513117 | KANEOHE | 21.4234 | -157.8015 | 14.6 |
| 2 | USC00514830 | KUALOA RANCH HEADQUARTERS | 21.5213 | -157.8374 | 7.0 |
| 3 | USC00517948 | PEARL CITY | 21.3934 | -157.9751 | 11.9 |
| 4 | USC00518838 | UPPER WAHIAWA | 21.4992 | -158.0111 | 306.6 |
| 5 | USC00519523 | WAIMANALO EXPERIMENTAL FARM | 21.33556 | -157.71139 | 19.5 |
| 6 | USC00519281 | WAIHEE | 21.45167 | -157.84889 | 32.9 |
| 7 | USC00511918 | HONOLULU OBSERVATORY | 21.3152 | -157.9992 | 0.9 |
| 8 | USC00516128 | MANOA LYON ARBO | 21.3331 | -157.8025 | 152.4 |
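
The name clean-up in cell 10 applies three transformations to each station name. The sketch below traces them on a single value. The raw string is a hypothetical example of the input format (this document shows only the cleaned names), `clean_name` is an illustrative helper, and the final `strip()` removes whitespace that the digit removal can leave behind.

```python
def clean_name(raw_name: str) -> str:
    """Reduce a raw station name to its leading place name."""
    # Step 1: remove every digit
    no_digits = "".join(ch for ch in raw_name if not ch.isdigit())
    # Step 2: remove periods left over from decimal numbers
    no_periods = no_digits.replace(".", "")
    # Step 3: keep the text before the first comma and trim whitespace
    return no_periods.split(",")[0].strip()


# Hypothetical raw value, assumed to follow a "NAME number, STATE COUNTRY" pattern
example = clean_name("WAIKIKI 717.2, HI US")
```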


In [11]:
# Save the cleaned stations to a CSV file
new_csv2 = 'Resources/clean_stations.csv'
clean_stations.to_csv(new_csv2, index=False)

In [12]:
# Merge the measurements with the station details on the shared 'station' column
clean_hawaii = clean_measurements.merge(clean_stations, on='station', how='outer')
clean_hawaii[:5]

| | station | date | prcp | tobs | name | latitude | longitude | elevation |
|---|---|---|---|---|---|---|---|---|
| 0 | USC00519397 | 2010-01-01 | 0.08 | 65 | WAIKIKI | 21.2716 | -157.8168 | 3 |
| 1 | USC00519397 | 2010-01-02 | 0.0 | 63 | WAIKIKI | 21.2716 | -157.8168 | 3 |
| 2 | USC00519397 | 2010-01-03 | 0.0 | 74 | WAIKIKI | 21.2716 | -157.8168 | 3 |
| 3 | USC00519397 | 2010-01-04 | 0.0 | 76 | WAIKIKI | 21.2716 | -157.8168 | 3 |
| 4 | USC00519397 | 2010-01-07 | 0.06 | 70 | WAIKIKI | 21.2716 | -157.8168 | 3 |
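
An outer merge keeps rows from both sides even when a station appears in only one table, so it can be worth checking for unmatched rows. The sketch below uses the `validate` and `indicator` options of `DataFrame.merge` on small hypothetical frames; the values are illustrative and are not taken from the real files.

```python
import pandas as pd

# Hypothetical measurements: two stations, one with two readings
measurements = pd.DataFrame({
    "station": ["USC00519397", "USC00519397", "USC00513117"],
    "tobs": ["65", "63", "67"],
})

# Hypothetical station list: includes one station with no measurements
stations = pd.DataFrame({
    "station": ["USC00519397", "USC00513117", "USC00516128"],
    "name": ["WAIKIKI", "KANEOHE", "MANOA LYON ARBO"],
})

merged = measurements.merge(
    stations,
    on="station",
    how="outer",
    validate="many_to_one",  # raises if a station ID repeats in `stations`
    indicator=True,          # adds a _merge column: both, left_only, right_only
)

# Rows that exist in only one of the two tables
unmatched = merged[merged["_merge"] != "both"]
```

If `unmatched` is empty for the real data, an inner merge would give the same result as the outer merge.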


In [13]:
# Save the merged data to a CSV file
new_csv3 = 'Resources/clean_hawaii.csv'
clean_hawaii.to_csv(new_csv3, index=False)