# Surf's Up! Climate Analysis

### Step 1 - Data Engineering

The climate data for Hawaii is provided through two CSV files. Start by using Python and Pandas to inspect the content of these files and clean the data.

Create a Jupyter Notebook file called `data_engineering.ipynb` and use it to complete all of your Data Engineering tasks.

* Use Pandas to read in the measurement and station CSV files as DataFrames.

* Inspect the data for NaNs and missing values. You must decide what to do with this data.

* Save your cleaned CSV files with the prefix clean_.

In [1]:
# Dependencies
import os
import pandas as pd

In [2]:
# Read a CSV file from the given directory into a DataFrame.
# Returns an empty DataFrame if the file cannot be opened.

def read_base_file(data_folder, base_file):
    file_csv = os.path.join(data_folder, base_file)
    try:
        df = pd.read_csv(file_csv)
    except OSError as e:
        print("Error in reading", base_file)
        print(e)
        df = pd.DataFrame()
    return df
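
As a quick check of the fallback behaviour, the sketch below exercises a reader of the same shape against a temporary folder. The function name `read_csv_or_empty`, the file names, and the demo data are hypothetical; they stand in for the real `Resources` files.

```python
import os
import tempfile

import pandas as pd


def read_csv_or_empty(data_folder, base_file):
    """Read a CSV into a DataFrame; return an empty DataFrame on failure."""
    file_csv = os.path.join(data_folder, base_file)
    try:
        return pd.read_csv(file_csv)
    except (OSError, pd.errors.EmptyDataError, pd.errors.ParserError) as e:
        print("Error in reading", base_file)
        print(e)
        return pd.DataFrame()


# Hypothetical demo data written to a temporary folder (not the real files).
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "demo.csv"), "w") as f:
        f.write("station,prcp\nA,0.1\nB,0.2\n")

    found_df = read_csv_or_empty(tmp_dir, "demo.csv")
    missing_df = read_csv_or_empty(tmp_dir, "not_there.csv")

print(found_df.shape)    # (2, 2)
print(missing_df.empty)  # True
```

Because a failed read yields an empty DataFrame rather than an exception, it is worth checking `.empty` before continuing with the cleaning steps.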

In [3]:
# CSV files to load

data_folder = 'Resources'
base_file1 = 'hawaii_measurements.csv'
measurements_df = read_base_file(data_folder,base_file1)
base_file2 = 'hawaii_stations.csv'
stations_df = read_base_file(data_folder,base_file2)

### Cleaning Data - Measurements Table

In [4]:
# Number of rows and columns in the initial dataset.
measurements_df.shape

(19550, 4)

In [5]:
measurements_df.columns

Index(['station', 'date', 'prcp', 'tobs'], dtype='object')

In [6]:
measurements_df.head(10)

|   | station | date | prcp | tobs |
|---|---|---|---|---|
| 0 | USC00519397 | 2010-01-01 | 0.08 | 65 |
| 1 | USC00519397 | 2010-01-02 | 0.00 | 63 |
| 2 | USC00519397 | 2010-01-03 | 0.00 | 74 |
| 3 | USC00519397 | 2010-01-04 | 0.00 | 76 |
| 4 | USC00519397 | 2010-01-06 | NaN | 73 |
| 5 | USC00519397 | 2010-01-07 | 0.06 | 70 |
| 6 | USC00519397 | 2010-01-08 | 0.00 | 64 |
| 7 | USC00519397 | 2010-01-09 | 0.00 | 68 |
| 8 | USC00519397 | 2010-01-10 | 0.00 | 73 |
| 9 | USC00519397 | 2010-01-11 | 0.01 | 64 |


In [7]:
def num_missing(x):
    # Count the missing (NaN/None) values in a Series.
    return x.isnull().sum()

In [8]:
# Check whether measurements_df has duplicated rows

duplicated = measurements_df.duplicated(subset=['station', 'date', 'prcp', 'tobs']).sum()
print(f"Total of duplicated values in measurement_df : {duplicated} ")

# Count missing values per column:
print(f"Missing values per column in measurements_df:\n{measurements_df.apply(num_missing, axis=0)}")

# Count missing values per row (first 20 rows):
print(f"\nMissing values per row in measurements_df:\n{measurements_df.apply(num_missing, axis=1).head(20)}")

Total of duplicated values in measurement_df : 0 
Missing values per column in measurements_df:
station       0
date          0
prcp       1447
tobs          0
dtype: int64

Missing values per row in measurements_df:
0     0
1     0
2     0
3     0
4     1
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
16    0
17    0
18    0
19    0
dtype: int64
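
The same counts can be obtained without a helper function by summing the boolean mask that `isna()` returns. The sketch below uses a small hypothetical frame with the same column names as the measurements data, not the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same column names as the measurements data.
toy_df = pd.DataFrame({
    "station": ["A", "A", "B"],
    "date": ["2010-01-01", "2010-01-02", "2010-01-01"],
    "prcp": [0.08, np.nan, np.nan],
    "tobs": [65, 63, 70],
})

missing_per_column = toy_df.isna().sum()            # one count per column
missing_per_row = toy_df.isna().sum(axis=1)         # one count per row
rows_with_gaps = toy_df[toy_df.isna().any(axis=1)]  # the incomplete rows themselves

print(missing_per_column["prcp"])  # 2
print(missing_per_row.tolist())    # [0, 1, 1]
print(len(rows_with_gaps))         # 2
```

Selecting the incomplete rows directly is often more useful than the per-row counts, since it shows which stations and dates are affected.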


In [9]:
# Percentage of rows with a missing prcp value (missing prcp / total rows)

missing_prcp = measurements_df['prcp'].isnull().sum()
pct_nan_prcp = (missing_prcp / len(measurements_df['prcp'])) * 100
print(f"The percentage of missing data is : {pct_nan_prcp}")

The percentage of missing data is : 7.40153452685422
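
The figure follows directly from the counts reported earlier: 1447 missing `prcp` values out of 19550 rows. The sketch below reproduces the arithmetic and shows an equivalent shortcut, the mean of the `isna()` mask, on a hypothetical four-value Series.

```python
import numpy as np
import pandas as pd

missing_prcp = 1447  # missing prcp values reported in the notebook output
total_rows = 19550   # rows in measurements_df

pct_missing = missing_prcp / total_rows * 100
print(round(pct_missing, 2))  # 7.4

# Shortcut: the mean of a boolean mask is the fraction of True values.
toy_prcp = pd.Series([0.08, np.nan, 0.0, 0.0])  # hypothetical values
toy_pct = toy_prcp.isna().mean() * 100
print(toy_pct)  # 25.0
```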


In [10]:
# Fill the missing values by interpolation instead of dropping rows with
# df.dropna(axis=0, how='any').
# method='linear': ignore the index and treat the values as equally spaced.
# Only prcp has missing values, so interpolate that column alone; this also
# avoids calling interpolate on the string columns.
# ffill(): fill any remaining NaN with the last valid value before it.
# bfill(): fill any remaining NaN with the next valid value after it.

clean_measurements_df = measurements_df.copy()
clean_measurements_df['prcp'] = clean_measurements_df['prcp'].interpolate(method='linear').ffill().bfill()
print(f"Non-null counts per column in clean_measurements_df :\n{clean_measurements_df.count()} ")

clean_measurements_df.head()

Non-null counts per column in clean_measurements_df :
station    19550
date       19550
prcp       19550
tobs       19550
dtype: int64 


|   | station | date | prcp | tobs |
|---|---|---|---|---|
| 0 | USC00519397 | 2010-01-01 | 0.08 | 65 |
| 1 | USC00519397 | 2010-01-02 | 0.00 | 63 |
| 2 | USC00519397 | 2010-01-03 | 0.00 | 74 |
| 3 | USC00519397 | 2010-01-04 | 0.00 | 76 |
| 4 | USC00519397 | 2010-01-06 | 0.03 | 73 |
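
Row 4 shows the effect of linear interpolation: the missing value sits between 0.0 and 0.06 and becomes their midpoint, 0.03. The sketch below reproduces this on a short hypothetical Series that also has gaps at both ends, which is where `ffill()` and `bfill()` act as a safety net.

```python
import numpy as np
import pandas as pd

# Hypothetical precipitation values with gaps at the start, middle, and end.
prcp = pd.Series([np.nan, 0.0, np.nan, 0.06, np.nan])

# Linear interpolation fills the gap between two known neighbours.
interpolated = prcp.interpolate(method="linear")

# ffill/bfill cover any edge gaps that have no neighbour on one side.
filled = interpolated.ffill().bfill()

print(filled.round(2).tolist())  # [0.0, 0.0, 0.03, 0.06, 0.06]
```

Interpolation treats the rows as one continuous sequence, so a gap at the boundary between two stations is filled from values belonging to different stations.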


In [11]:
clean_measurements_df.dtypes

station     object
date        object
prcp       float64
tobs         int64
dtype: object
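
The `date` column is stored as `object`, meaning plain strings. This is fine for saving to CSV. If date arithmetic is needed later, a conversion along these lines is one option; it is an optional sketch on hypothetical data and not part of the cleaning step above.

```python
import pandas as pd

# Hypothetical dates taken from the pattern in the preview above.
toy_df = pd.DataFrame({"date": ["2010-01-04", "2010-01-06", "2010-01-07"]})

# Parse the ISO-formatted strings into datetime values.
toy_df["date"] = pd.to_datetime(toy_df["date"], format="%Y-%m-%d")

# With real datetimes, gaps between consecutive readings are easy to measure.
gap_days = toy_df["date"].diff().dt.days.dropna().tolist()
print(gap_days)  # [2.0, 1.0]
```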

### Cleaning Data - Stations Table

In [12]:
# Number of rows and columns in the initial dataset.
stations_df.shape

(9, 5)

In [13]:
stations_df.columns

Index(['station', 'name', 'latitude', 'longitude', 'elevation'], dtype='object')

In [14]:
# The stations DataFrame has no missing data (verified below).
stations_df.head(10)

|   | station | name | latitude | longitude | elevation |
|---|---|---|---|---|---|
| 0 | USC00519397 | WAIKIKI 717.2, HI US | 21.2716 | -157.8168 | 3.0 |
| 1 | USC00513117 | KANEOHE 838.1, HI US | 21.4234 | -157.8015 | 14.6 |
| 2 | USC00514830 | KUALOA RANCH HEADQUARTERS 886.9, HI US | 21.5213 | -157.8374 | 7.0 |
| 3 | USC00517948 | PEARL CITY, HI US | 21.3934 | -157.9751 | 11.9 |
| 4 | USC00518838 | UPPER WAHIAWA 874.3, HI US | 21.4992 | -158.0111 | 306.6 |
| 5 | USC00519523 | WAIMANALO EXPERIMENTAL FARM, HI US | 21.33556 | -157.71139 | 19.5 |
| 6 | USC00519281 | WAIHEE 837.5, HI US | 21.45167 | -157.84889 | 32.9 |
| 7 | USC00511918 | HONOLULU OBSERVATORY 702.2, HI US | 21.3152 | -157.9992 | 0.9 |
| 8 | USC00516128 | MANOA LYON ARBO 785.2, HI US | 21.3331 | -157.8025 | 152.4 |


In [15]:
# Check whether stations_df has duplicated rows
duplicated = stations_df.duplicated(subset=['station', 'name', 'latitude',
                                            'longitude', 'elevation']).sum()
print(f"Total of duplicated values in stations_df : {duplicated} ")

# Check whether stations_df has missing data

# Count missing values per column:
print(f"Missing values per column in stations_df:\n{stations_df.apply(num_missing, axis=0)}")

# Count missing values per row:
print(f"\nMissing values per row in stations_df:\n{stations_df.apply(num_missing, axis=1).head(10)}")

Total of duplicated values in stations_df : 0 
Missing values per column in stations_df:
station      0
name         0
latitude     0
longitude    0
elevation    0
dtype: int64

Missing values per row in stations_df:
0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
dtype: int64


### Save cleaned data to CSV files.

In [16]:
# Save the cleaned DataFrames to CSV files (index=False omits the row index).
clean_measurements_df.to_csv('Resources/clean_measurements.csv', index=False)
stations_df.to_csv('Resources/clean_stations.csv', index=False)
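
To confirm that a saved file reloads cleanly, a round trip such as the one below can help. It is a sketch that writes a hypothetical toy frame to a temporary folder rather than to `Resources`; the file name `clean_toy.csv` is made up for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for a cleaned DataFrame.
toy_df = pd.DataFrame({"station": ["A", "B"], "prcp": [0.08, 0.0], "tobs": [65, 63]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "clean_toy.csv")
    # index=False keeps the row index out of the file, so reloading
    # does not produce an extra "Unnamed: 0" column.
    toy_df.to_csv(out_path, index=False)
    reloaded_df = pd.read_csv(out_path)

print(reloaded_df.columns.tolist())  # ['station', 'prcp', 'tobs']
print(reloaded_df.shape)             # (2, 3)
```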