# 03 Cleaning: Weather Data

Description: Cleaning the weather data, which consists mostly of filling in missing and trace values and converting columns to numeric types. We'll also split the DataFrame by station, and then create a new set of weather features by taking averages of the two stations.

In [566]:
import pandas as pd
import numpy as np

### Investigating Weather Data

In [567]:
data = pd.read_csv('../data/weather.csv', parse_dates = ['Date'])

In [568]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2944 entries, 0 to 2943
Data columns (total 22 columns):
Station        2944 non-null int64
Date           2944 non-null datetime64[ns]
Tmax           2944 non-null int64
Tmin           2944 non-null int64
Tavg           2944 non-null object
Depart         2944 non-null object
DewPoint       2944 non-null int64
WetBulb        2944 non-null object
Heat           2944 non-null object
Cool           2944 non-null object
Sunrise        2944 non-null object
Sunset         2944 non-null object
CodeSum        2944 non-null object
Depth          2944 non-null object
Water1         2944 non-null object
SnowFall       2944 non-null object
PrecipTotal    2944 non-null object
StnPressure    2944 non-null object
SeaLevel       2944 non-null object
ResultSpeed    2944 non-null float64
ResultDir      2944 non-null int64
AvgSpeed       2944 non-null object
dtypes: datetime64[ns](1), float64(1), int64(5), object(15)
memory usage: 506.1+ KB


All of the object columns that we keep, apart from the text column `CodeSum`, will eventually have to be converted to integers or floats.
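Before converting anything, it can help to see which placeholder strings keep a column from parsing as a number. The sketch below is only an illustration on a toy frame; `toy` and `placeholder_counts` are hypothetical names, and the values are made up rather than taken from `weather.csv`.

```python
import pandas as pd

# Toy stand-in for the weather data; the values are illustrative only.
toy = pd.DataFrame({
    'Tavg': ['67', 'M', '51'],
    'PrecipTotal': ['0.00', '  T', 'M'],
    'Tmax': [83, 84, 59],
})

# Columns that pandas did not read as numbers.
non_numeric = [c for c in toy.columns
               if not pd.api.types.is_numeric_dtype(toy[c])]

# For each such column, count the entries that fail to parse as numbers.
placeholder_counts = {}
for col in non_numeric:
    unparsed = pd.to_numeric(toy[col], errors='coerce').isna()
    placeholder_counts[col] = (
        toy.loc[unparsed, col].str.strip().value_counts().to_dict()
    )

print(placeholder_counts)
```

On the real data, the same loop over `data` would list the placeholders that the cleaning cells need to handle.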

### Looking at the first five rows of Weather data

In [569]:
data.head()

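| | Station | Date | Tmax | Tmin | Tavg | Depart | DewPoint | WetBulb | Heat | Cool | ... | CodeSum | Depth | Water1 | SnowFall | PrecipTotal | StnPressure | SeaLevel | ResultSpeed | ResultDir | AvgSpeed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2007-05-01 | 83 | 50 | 67 | 14 | 51 | 56 | 0 | 2 | ... | | 0 | M | 0.0 | 0.0 | 29.1 | 29.82 | 1.7 | 27 | 9.2 |
| 1 | 2 | 2007-05-01 | 84 | 52 | 68 | M | 51 | 57 | 0 | 3 | ... | | M | M | M | 0.0 | 29.18 | 29.82 | 2.7 | 25 | 9.6 |
| 2 | 1 | 2007-05-02 | 59 | 42 | 51 | -3 | 42 | 47 | 14 | 0 | ... | BR | 0 | M | 0.0 | 0.0 | 29.38 | 30.09 | 13.0 | 4 | 13.4 |
| 3 | 2 | 2007-05-02 | 60 | 43 | 52 | M | 42 | 47 | 13 | 0 | ... | BR HZ | M | M | M | 0.0 | 29.44 | 30.08 | 13.3 | 2 | 13.4 |
| 4 | 1 | 2007-05-03 | 66 | 46 | 56 | 2 | 40 | 48 | 9 | 0 | ... | | 0 | M | 0.0 | 0.0 | 29.39 | 30.12 | 11.7 | 7 | 11.9 |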


In [570]:
data.columns

Index(['Station', 'Date', 'Tmax', 'Tmin', 'Tavg', 'Depart', 'DewPoint',
       'WetBulb', 'Heat', 'Cool', 'Sunrise', 'Sunset', 'CodeSum', 'Depth',
       'Water1', 'SnowFall', 'PrecipTotal', 'StnPressure', 'SeaLevel',
       'ResultSpeed', 'ResultDir', 'AvgSpeed'],
      dtype='object')

### Cleaning the data

Filling the missing `Tavg` values (marked 'M') with the average of the maximum and minimum temperatures, rounded up.

In [571]:
data['Tavg'] = data.apply(lambda x: int(np.ceil((x['Tmax'] + x['Tmin'])/2)) if x['Tavg'] == 'M' else x['Tavg'],1)

Transform the filled variable from object to integer.

In [572]:
data['Tavg']=pd.to_numeric(data['Tavg'])
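The fill and the conversion can also be done in one vectorized step. This is a sketch on a toy frame, not the notebook's own cell: `toy` and `midpoint` are hypothetical names, and it assumes 'M' is the only non-numeric marker in `Tavg`.

```python
import numpy as np
import pandas as pd

# Toy stand-in; values are illustrative only.
toy = pd.DataFrame({
    'Tmax': [83, 84, 60],
    'Tmin': [50, 52, 43],
    'Tavg': ['67', 'M', 'M'],
})

# Midpoint of the daily extremes, rounded up as in the cell above.
midpoint = np.ceil((toy['Tmax'] + toy['Tmin']) / 2)

# 'M' becomes NaN during parsing, then NaN is replaced by the midpoint.
toy['Tavg'] = (
    pd.to_numeric(toy['Tavg'], errors='coerce')
      .fillna(midpoint)
      .astype(int)
)

print(toy['Tavg'].tolist())  # [67, 68, 52]
```

The row-wise `apply` gives the same values; the vectorized form just avoids calling a Python function for every row.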

Replacing missing ('M') and trace ('T') values in the remaining variables with zeros. `Sunrise` and `Sunset` mark missing values with '-' instead.

In [573]:
data['WetBulb'] = data.apply(lambda x: 0 if x['WetBulb'] == 'M' else x['WetBulb'],1)

In [574]:
data['Depart'] = data.apply(lambda x: 0 if x['Depart'] == 'M' else x['Depart'],1)

In [577]:
data['Depth'] = data.apply(lambda x: 0 if x['Depth'] == 'M' else x['Depth'],1)

In [578]:
data['StnPressure'] = data.apply(lambda x: 0 if x['StnPressure'] == 'M' else x['StnPressure'],1)

In [579]:
data['PrecipTotal'] = data.apply(lambda x: 0 if x['PrecipTotal'] == '  T' else x['PrecipTotal'],1)

In [580]:
data['SeaLevel'] = data.apply(lambda x: 0 if x['SeaLevel'] == 'M' else x['SeaLevel'],1)

In [581]:
data['PrecipTotal'] = data.apply(lambda x: 0 if x['PrecipTotal'] == 'M' else x['PrecipTotal'],1)

In [582]:
data['Heat'] = data.apply(lambda x: 0 if x['Heat'] == 'M' else x['Heat'],1)

In [583]:
data['Cool'] = data.apply(lambda x: 0 if x['Cool'] == 'M' else x['Cool'],1)

In [584]:
data['Sunrise'] = data.apply(lambda x: 0 if x['Sunrise'] == '-' else x['Sunrise'],1)

In [585]:
data['AvgSpeed'] = data.apply(lambda x: 0 if x['AvgSpeed'] == 'M' else x['AvgSpeed'],1)

In [586]:
data['Sunset'] = data.apply(lambda x: 0 if x['Sunset'] == '-' else x['Sunset'],1)

### Transforming the variables from objects into numeric types

In [587]:
data['Depart']=pd.to_numeric(data['Depart'])

In [588]:
data['WetBulb'] = pd.to_numeric(data['WetBulb'])

In [589]:
data['Cool'] = pd.to_numeric(data['Cool'])

In [590]:
data['Sunrise'] = pd.to_numeric(data['Sunrise'])

In [591]:
data['Sunset'] = pd.to_numeric(data['Sunset'])

In [592]:
data['Depth'] = pd.to_numeric(data['Depth'])

In [593]:
data['SeaLevel'] = pd.to_numeric(data['SeaLevel'])

In [594]:
data['StnPressure'] = pd.to_numeric(data['StnPressure'])

In [595]:
data['Heat'] = pd.to_numeric(data['Heat'])

In [596]:
data['AvgSpeed'] = pd.to_numeric(data['AvgSpeed'])

In [597]:
data['PrecipTotal'] = pd.to_numeric(data['PrecipTotal'])
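The fill and conversion cells above all follow one pattern, so they could be collapsed into a loop. The sketch below shows the idea on a toy frame; `toy` and `placeholders` are hypothetical names, and it keeps the notebook's choice of zero as the fill value, assuming 'M', 'T' and '-' are the only placeholders.

```python
import pandas as pd

# Toy stand-in; values are illustrative only.
toy = pd.DataFrame({
    'WetBulb': ['56', 'M', '47'],
    'PrecipTotal': ['0.00', '  T', 'M'],
    'Sunrise': ['0448', '-', '0447'],
})

# Markers for missing ('M', '-') and trace ('T') values.
placeholders = ['M', 'T', '-']

for col in toy.columns:
    # Strip padding so '  T' matches 'T', then swap placeholders for '0'.
    cleaned = toy[col].str.strip()
    cleaned = cleaned.where(~cleaned.isin(placeholders), '0')
    # Parse the cleaned strings as integers or floats.
    toy[col] = pd.to_numeric(cleaned)

print(toy.dtypes)
```

Stripping whitespace first means the trace marker does not have to be matched with its exact padding.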

### Dropping Water and Snowfall

`Water1` was completely missing, and snowfall is not relevant to data recorded during the summer.

In [598]:
data.drop('Water1', axis=1, inplace=True)

In [599]:
data.drop('SnowFall', axis=1, inplace=True)

### Splitting Weather Data

Here, we'll split the weather data by station and create a new DataFrame with the two stations' data side by side. This is done so that each date appears only once.

In [600]:
weather_stn1 = data[data['Station']==1]
weather_stn2 = data[data['Station']==2]
weather_stn1 = weather_stn1.drop('Station', axis=1)
weather_stn2 = weather_stn2.drop('Station', axis=1)
weather = weather_stn1.merge(weather_stn2, on='Date')

Checking the merge

In [601]:
weather.shape

(1472, 37)

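Merge was successful: 1,472 rows is half of the original 2,944, so there is one row per date, and 37 columns is `Date` plus 18 variables from each station.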
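The shape check can also be built into the merge itself. The sketch below uses a toy frame (`toy`, `stn1`, `stn2` and `merged` are hypothetical names) and passes `validate='one_to_one'`, which makes pandas raise an error if either station has a repeated date. The `suffixes` argument is spelled out only to make the `_x`/`_y` naming explicit; those are the pandas defaults.

```python
import pandas as pd

# Toy stand-in with two stations and two dates; values are illustrative only.
toy = pd.DataFrame({
    'Station': [1, 2, 1, 2],
    'Date': pd.to_datetime(['2007-05-01', '2007-05-01',
                            '2007-05-02', '2007-05-02']),
    'Tmax': [83, 84, 59, 60],
})

stn1 = toy[toy['Station'] == 1].drop('Station', axis=1)
stn2 = toy[toy['Station'] == 2].drop('Station', axis=1)

# One row per date; fails loudly if a date is duplicated within a station.
merged = stn1.merge(stn2, on='Date',
                    suffixes=('_x', '_y'), validate='one_to_one')

print(merged.shape)          # (2, 3)
print(list(merged.columns))  # ['Date', 'Tmax_x', 'Tmax_y']
```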

In [602]:
weather.columns

Index(['Date', 'Tmax_x', 'Tmin_x', 'Tavg_x', 'Depart_x', 'DewPoint_x',
       'WetBulb_x', 'Heat_x', 'Cool_x', 'Sunrise_x', 'Sunset_x', 'CodeSum_x',
       'Depth_x', 'PrecipTotal_x', 'StnPressure_x', 'SeaLevel_x',
       'ResultSpeed_x', 'ResultDir_x', 'AvgSpeed_x', 'Tmax_y', 'Tmin_y',
       'Tavg_y', 'Depart_y', 'DewPoint_y', 'WetBulb_y', 'Heat_y', 'Cool_y',
       'Sunrise_y', 'Sunset_y', 'CodeSum_y', 'Depth_y', 'PrecipTotal_y',
       'StnPressure_y', 'SeaLevel_y', 'ResultSpeed_y', 'ResultDir_y',
       'AvgSpeed_y'],
      dtype='object')

### Creating new variables from the two stations

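Creating a new variable called `Day_length`: the number of seconds between sunrise and sunset at station 1, both recorded as 24-hour HHMM times. Sunset values whose minutes field is 60 are rolled over to the next hour before parsing.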

In [603]:
from datetime import datetime

In [604]:
def day_length(row):
    # Sunrise and sunset are HHMM integers taken from station 1
    sunset = row['Sunset_x']
    sunrise = row['Sunrise_x']
    # A minutes field of 60 is not a valid time, so roll over to the next hour
    if sunset % 100 == 60:
        sunset = sunset + 40
    # Parse into datetimes and find the difference
    x = datetime.strptime(str(sunset), '%H%M') - datetime.strptime(str(sunrise), '%H%M')
    # Return the length of the day in seconds
    return x.seconds

In [605]:
weather['Day_length'] = weather.apply(day_length, axis=1)
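To make the rollover rule concrete, here is a small standalone sketch of the same calculation. `seconds_between` is a hypothetical helper, not part of the notebook, and the two sample times are made up for illustration.

```python
from datetime import datetime

def seconds_between(sunrise, sunset):
    """Seconds from sunrise to sunset, both given as HHMM integers."""
    # A minutes field of 60 (e.g. 1860) is rolled over to the next hour (1900).
    if sunset % 100 == 60:
        sunset += 40
    fmt = '%H%M'
    # Zero-pad to four digits so that 448 is read as 04:48.
    delta = (datetime.strptime(f'{sunset:04d}', fmt)
             - datetime.strptime(f'{sunrise:04d}', fmt))
    return delta.seconds

print(seconds_between(448, 1849))  # 04:48 to 18:49 -> 50460
print(seconds_between(530, 1860))  # 05:30 to 19:00 -> 48600
```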

### Creating new variables by taking averages of the two stations

In [606]:
weather['Tmax'] = weather.apply(lambda x: np.mean([x['Tmax_x'],x['Tmax_y']]), 1)

In [607]:
weather['Tmin'] = weather.apply(lambda x: np.mean([x['Tmin_x'],x['Tmin_y']]),1)

In [608]:
weather['Tavg'] = weather.apply(lambda x: np.mean([x['Tavg_x'],x['Tavg_y']]),1)

In [609]:
weather['ResultSpeed'] = weather.apply(lambda x: np.mean([x['ResultSpeed_x'],x['ResultSpeed_y']]),1)

In [610]:
weather['ResultDir'] = weather.apply(lambda x: np.mean([x['ResultDir_x'],x['ResultDir_y']]),1)

In [611]:
weather['AvgSpeed'] = weather.apply(lambda x: np.mean([x['AvgSpeed_x'],x['AvgSpeed_y']]),1)

In [612]:
weather['Heat'] = weather.apply(lambda x: np.mean([x['Heat_x'],x['Heat_y']]),1)

In [613]:
weather['DewPoint'] = weather.apply(lambda x: np.mean([x['DewPoint_x'],x['DewPoint_y']]),1)

In [614]:
weather['WetBulb'] = weather.apply(lambda x: np.mean([x['WetBulb_x'],x['WetBulb_y']]),1)

In [615]:
weather['Cool'] = weather.apply(lambda x: np.mean([x['Cool_x'],x['Cool_y']]),1)

In [616]:
weather['PrecipTotal'] = weather.apply(lambda x: np.mean([x['PrecipTotal_x'],x['PrecipTotal_y']]),1)

In [617]:
weather['StnPressure'] = weather.apply(lambda x: np.mean([x['StnPressure_x'],x['StnPressure_y']]),1)
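Each of these cells takes the average of a station 1 value and a station 2 value on the same row. A compact way to express that, shown here on a toy frame with hypothetical names (`toy`, `name`), is to select the two columns and call `mean(axis=1)`.

```python
import pandas as pd

# Toy stand-in for the merged frame; values are illustrative only.
toy = pd.DataFrame({
    'Tmax_x': [83, 59], 'Tmax_y': [84, 60],
    'DewPoint_x': [51, 42], 'DewPoint_y': [51, 42],
})

# Average the station 1 (_x) and station 2 (_y) columns row by row.
for name in ['Tmax', 'DewPoint']:
    toy[name] = toy[[f'{name}_x', f'{name}_y']].mean(axis=1)

print(toy[['Tmax', 'DewPoint']])
```

Because the loop builds both column names from one variable, it is harder to average the wrong pair by accident.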

These variables had missing station 2 values, so they are simply copied from station 1.

In [618]:
weather['Sunset'] = weather['Sunset_x']

In [619]:
weather['Sunrise'] = weather['Sunrise_x']

In [620]:
weather['Depart'] = weather['Depart_x']

In [621]:
weather['CodeSum'] = weather['CodeSum_x']

### Dropping the two stations' variables

In [622]:
# Drop every per-station column now that the combined features exist
weather.drop([
    'SeaLevel_x',
    'SeaLevel_y',
    'Tavg_x',
    'Tavg_y',
    'ResultSpeed_x',
    'ResultSpeed_y',
    'ResultDir_x',
    'ResultDir_y',
    'AvgSpeed_x',
    'AvgSpeed_y',
    'Heat_x',
    'Heat_y',
    'Tmax_x',
    'Tmax_y',
    'Tmin_x',
    'Tmin_y',
    'Sunset_x',
    'Sunset_y',
    'Sunrise_x',
    'Sunrise_y',
    'Depart_x',
    'Depart_y',
    'DewPoint_x',
    'DewPoint_y',
    'WetBulb_x',
    'WetBulb_y',
    'Cool_x',
    'Cool_y',
    'CodeSum_x',
    'CodeSum_y',
    'Depth_x',
    'Depth_y',
    'PrecipTotal_x',
    'PrecipTotal_y',
    'StnPressure_x',
    'StnPressure_y'], axis=1, inplace=True)

### Checking Work

In [623]:
weather.columns

Index(['Date', 'Day_length', 'Tmax', 'Tmin', 'Tavg', 'ResultSpeed',
       'ResultDir', 'AvgSpeed', 'Heat', 'DewPoint', 'WetBulb', 'Cool',
       'PrecipTotal', 'StnPressure', 'Sunset', 'Sunrise', 'Depart', 'CodeSum'],
      dtype='object')

### Saving Results

In [624]:
weather.to_csv('../data/clean_weather.csv')
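Since `to_csv` writes the index as an unnamed first column by default, a later notebook that reads the file back may want to pass `index_col=0`. The round trip below is a sketch on a toy frame, written to an in-memory buffer instead of the real path; `toy`, `buffer` and `restored` are hypothetical names.

```python
import io
import pandas as pd

# Toy stand-in for the cleaned weather frame; values are illustrative only.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2007-05-01', '2007-05-02']),
    'Tavg': [67.5, 51.5],
})

# Write to an in-memory buffer, index included, as to_csv does by default.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the index instead of creating an 'Unnamed: 0' column.
restored = pd.read_csv(buffer, index_col=0, parse_dates=['Date'])

print(list(restored.columns))  # ['Date', 'Tavg']
```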