# Data Cleaning: Weather


#### Data Source:
Weather data for five of the largest cities in Spain was obtained from OpenWeatherMap. The data contains hourly information on temperature, pressure, rainfall, cloud index, and weather description.

#### Summary of cleaning actions:
- Add names to the cities
- Drop columns that contain no data
- Convert timestamps to local datetimes (Europe/Madrid) and set a datetime index
- In columns with partial data, assume NaNs are zero values
- Set elements to lower case and remove special characters in categorical columns
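
The column-dropping and NaN-filling actions can be tried out on a toy frame first. This is a minimal sketch rather than the notebook's own code: the frame `toy` and its values are made up for illustration. Note that an automatic all-NaN check would not flag `snow_1h`, which has two non-null values in the export and is dropped by hand in the cleaning function.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw export (values are made up).
toy = pd.DataFrame({
    "temp": [299.15, 298.15, 296.16],
    "lat": [np.nan, np.nan, np.nan],    # completely empty column
    "rain_1h": [np.nan, 0.3, np.nan],   # partially filled column
})

# Columns in which every value is missing.
empty_cols = toy.columns[toy.isna().all()].tolist()

# Drop the empty columns, then treat remaining gaps in rain_1h as "no rain".
cleaned = toy.drop(columns=empty_cols)
cleaned["rain_1h"] = cleaned["rain_1h"].fillna(0)

print(empty_cols)
print(cleaned)
```

Filling with zero assumes that a missing rain or snow reading means no precipitation was recorded, which is the assumption stated in the list above.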


#### Function list:
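1. clean_weather_data - takes in the raw weather data and returns a cleaned set for the five Spanish cities
2. clean_descrption_cols - sets the weather description columns to lower case and removes special characters
3. get_weather_data - reads the CSV file and applies both cleaning functions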

In [26]:
import pandas as pd
import numpy as np

In [2]:
#import data
path = './data/weather/spain-weather-2013-2019.csv'
data = pd.read_csv(path)

In [3]:
#first look at the data
data.head(3)

Unnamed: 0,dt,dt_iso,city_id,city_name,lat,lon,temp,temp_min,temp_max,pressure,...,rain_today,snow_1h,snow_3h,snow_24h,snow_today,clouds_all,weather_id,weather_main,weather_description,weather_icon
0,1380585600,2013-10-01 00:00:00 +0000 UTC,2509954,,,,299.15,299.15,299.15,1008,...,,,,,,20,801,Clouds,few clouds,02n
1,1380589200,2013-10-01 01:00:00 +0000 UTC,2509954,,,,298.15,298.15,298.15,1009,...,,,,,,20,801,Clouds,few clouds,02n
2,1380592800,2013-10-01 02:00:00 +0000 UTC,2509954,,,,296.161,296.161,296.161,1009,...,,,0.0,,,10,800,Clear,sky is Clear,02


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 263000 entries, 0 to 262999
Data columns (total 28 columns):
dt                     263000 non-null int64
dt_iso                 263000 non-null object
city_id                263000 non-null int64
city_name              0 non-null float64
lat                    0 non-null float64
lon                    0 non-null float64
temp                   263000 non-null float64
temp_min               263000 non-null float64
temp_max               263000 non-null float64
pressure               263000 non-null int64
sea_level              0 non-null float64
grnd_level             0 non-null float64
humidity               263000 non-null int64
wind_speed             263000 non-null int64
wind_deg               263000 non-null int64
rain_1h                27406 non-null float64
rain_3h                20017 non-null float64
rain_24h               0 non-null float64
rain_today             0 non-null float64
snow_1h                2 non-null float64
snow

In [16]:
def clean_weather_data(data):
    """
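    Input: hourly bulk data export from OpenWeatherMap.
    
    Output: cleaned data with city names added, empty columns dropped,
    timestamps converted to local time, and missing rain/snow values set to zero.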
    
    """

    
    #add city names
    city_codes = {3128760 : 'Barcelona', 
                  3117735 : 'Madrid', 
                  3128026 : 'Bilbao', 
                  2509954 : 'Valencia', 
                  6361046 : 'Seville'}
    
    data['city_name'] = data['city_id'].replace(city_codes)

    #drop columns that carry no usable data (snow_1h has only 2 non-null values)
    data = data.drop(['lat', 
                      'lon', 
                      'sea_level', 
                      'grnd_level', 
                      'rain_24h', 
                      'snow_today',
                      'rain_today', 
                      'snow_1h', 
                      'snow_24h'], axis=1)


    #convert timestamp to datetime object
    times = pd.to_datetime(data['dt'], unit='s', origin='unix')

    #convert the times to local time zone
    data['dt'] = times.dt.tz_localize('UTC').dt.tz_convert('Europe/Madrid').dt.strftime('%Y-%m-%d %H:%M:%S')

    #replace null values with zeros in columns with relevant information
    nul_cols = ['rain_1h', 'rain_3h', 'snow_3h']
    data[nul_cols] = data[nul_cols].fillna(0)
    
    return data
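
The timestamp handling inside the function can be checked on a few values. The sketch below is an illustration under stated assumptions: the series `stamps` is a toy input (its first value is the first `dt` in the export), and `utc=True` is used as a shortcut for the localize step. It also shows a side effect of formatting local times as plain strings: at the autumn clock change, two different UTC hours map to the same local string.

```python
import pandas as pd

# Unix timestamps: 2013-10-01 00:00 UTC, then 2013-10-27 00:00 and 01:00 UTC
# (the hour either side of the switch from summer time to winter time).
stamps = pd.Series([1380585600, 1382832000, 1382835600])

# Parse as UTC, convert to Spanish local time, then format as text.
utc = pd.to_datetime(stamps, unit="s", utc=True)
local = utc.dt.tz_convert("Europe/Madrid")
as_text = local.dt.strftime("%Y-%m-%d %H:%M:%S")

print(as_text.tolist())
# Count local strings that repeat an earlier one.
print(as_text.duplicated().sum())
```

If the repeated hour matters downstream, one option is to keep the timezone-aware column instead of the formatted string; whether that is worth it depends on how the data is joined to the demand series.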

In [6]:
data = clean_weather_data(data)
data.head(3)

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 263000 entries, 0 to 262999
Data columns (total 19 columns):
dt                     263000 non-null object
dt_iso                 263000 non-null object
city_id                263000 non-null int64
city_name              263000 non-null object
temp                   263000 non-null float64
temp_min               263000 non-null float64
temp_max               263000 non-null float64
pressure               263000 non-null int64
humidity               263000 non-null int64
wind_speed             263000 non-null int64
wind_deg               263000 non-null int64
rain_1h                263000 non-null float64
rain_3h                263000 non-null float64
snow_3h                263000 non-null float64
clouds_all             263000 non-null int64
weather_id             263000 non-null int64
weather_main           263000 non-null object
weather_description    263000 non-null object
weather_icon           263000 non-null object
dtypes: float64(

#### Checking temperature columns

The min and max temperatures appear to be identical to the hourly temperature. We will check whether this is always the case; if it is, the two redundant columns will be removed.

In [17]:
same_temp_min = (data['temp'] == data['temp_min']).sum()/len(data)
same_temp_max = (data['temp'] == data['temp_max']).sum()/len(data)
same_min_max = (data['temp_min'] == data['temp_max']).sum()/len(data)
print('Fraction of rows where temp equals temp_min: {}'.format(same_temp_min))
print('Fraction of rows where temp equals temp_max: {}'.format(same_temp_max))
print('Fraction of rows where temp_min equals temp_max: {}'.format(same_min_max))

Fraction of rows where temp equals temp_min: 0.3512661596958175


The columns' values do differ, so all three temperature columns are kept.

#### Checking categorical columns

The columns weather_main and weather_description contain categorical information. We will investigate their values and see if any information reduction is possible.

In [22]:
#investigate values in the weather main
data['weather_main'].value_counts(dropna=False)

Clear           118166
Clouds          107307
Rain             24748
Mist              4873
Fog               3016
Drizzle           2333
Thunderstorm      1319
Haze               491
Dust               404
Snow               297
Smoke               43
Squall               1
Sand                 1
Tornado              1
Name: weather_main, dtype: int64

In [21]:
#investigate values in the weather description
data['weather_description'].value_counts(dropna=False)

Sky is Clear                        100334
few clouds                           52151
broken clouds                        25835
scattered clouds                     24462
light rain                           15131
sky is Clear                         11466
moderate rain                         5690
mist                                  4873
overcast clouds                       4859
clear sky                             4244
fog                                   3016
sky is clear                          2122
heavy intensity rain                  1719
light intensity drizzle               1613
light intensity shower rain            806
proximity shower rain                  572
shower rain                            560
proximity thunderstorm                 555
drizzle                                542
haze                                   491
thunderstorm                           440
dust                                   399
thunderstorm with rain                 170
very heavy 

The weather_description column appears to be a finer-grained subcategory of the weather_main column. Something to consider in the feature selection process is whether it adds relevant information. In this step, however, we only make all fields lowercase and remove special characters, which also merges spelling variants such as "Sky is Clear", "sky is Clear" and "sky is clear".
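
Both points can be checked with a few lines. The sketch below uses a made-up five-row frame (`toy`), not the real export, and is only one possible way to test the subcategory assumption: it counts how many distinct weather_main values each description appears under.

```python
import pandas as pd

# Toy rows using labels that appear in the value counts above.
toy = pd.DataFrame({
    "weather_main": ["Clear", "Clear", "Clear", "Clouds", "Rain"],
    "weather_description": ["Sky is Clear", "sky is Clear", "sky is clear",
                            "few clouds", "light rain"],
})

# Lowercasing collapses spelling variants of the same label.
before = toy["weather_description"].nunique()
toy["weather_description"] = toy["weather_description"].str.lower()
after = toy["weather_description"].nunique()

# Number of distinct weather_main values per description; a maximum of 1
# means every description belongs to exactly one main category.
parents = toy.groupby("weather_description")["weather_main"].nunique()

print(before, after, parents.max())
```

Running the same `groupby` on the full data set would show whether any description is shared between main categories.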

In [8]:
def clean_descrption_cols(data):
    """
    Small function that sets the description columns to lower case and removes special characters from the values.
    
    """
    
    #make each element in the columns lowercase
    data[['weather_main', 'weather_description']] = data[['weather_main', 'weather_description']].apply(lambda x: x.str.lower())
    
    #remove special characters (literal match, not a regular expression)
    special_chars = [',', '/', ':', ';', '-']
    
    for char in special_chars:
        data['weather_description'] = data['weather_description'].str.replace(char, ' ', regex=False)
        
    return data

In [28]:
data = clean_descrption_cols(data)

#### Export the data

In [30]:
data.to_csv('./data/weather_2013_2019.csv')

In [19]:
def get_weather_data(path='./data/weather/spain-weather-2013-2019.csv'):

    data = pd.read_csv(path)
    
    weather_data = clean_weather_data(data)
    weather_data = clean_descrption_cols(weather_data)
    
    return weather_data

In [20]:
weather = get_weather_data()
weather.head(3)

Unnamed: 0,dt,dt_iso,city_id,city_name,temp,temp_min,temp_max,pressure,humidity,wind_speed,wind_deg,rain_1h,rain_3h,snow_3h,clouds_all,weather_id,weather_main,weather_description,weather_icon
0,2013-10-01 02:00:00,2013-10-01 00:00:00 +0000 UTC,2509954,Valencia,299.15,299.15,299.15,1008,61,5,290,0.0,0.0,0.0,20,801,clouds,few clouds,02n
1,2013-10-01 03:00:00,2013-10-01 01:00:00 +0000 UTC,2509954,Valencia,298.15,298.15,298.15,1009,65,4,250,0.0,0.0,0.0,20,801,clouds,few clouds,02n
2,2013-10-01 04:00:00,2013-10-01 02:00:00 +0000 UTC,2509954,Valencia,296.161,296.161,296.161,1009,71,4,269,0.0,0.0,0.0,10,800,clear,sky is clear,02


In [21]:
weather.city_name.value_counts()

Madrid       53357
Bilbao       52774
Seville      52488
Barcelona    52416
Valencia     51965
Name: city_name, dtype: int64

### Create National Features

The data covers 5 major cities, but our electricity demand data is for Spain as a whole, so the weather data needs to represent the whole country.

To do this we will take a population-weighted average of each feature across the cities. The rationale is that a heat wave hitting a city with twice the population is likely to drive more total energy demand than one hitting the smaller city. The city weightings are listed below:

|City | Population | Weight (share of total)|
|-----|------------|-------|
|Madrid | 3,174,000 | 0.515 |
|Barcelona | 1,165,000 | 0.189 |
|Bilbao | 345,000 | 0.056 |
|Seville | 690,000 | 0.112 |
|Valencia | 789,000 | 0.128 |


Steps:
1. Create a column holding each row's weight (the population of its city)
2. Group by the datetime column and take the weighted average of each numeric feature
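
The two steps can be sketched on a toy frame before running them on the full data. The temperatures below are made up, and the sum-of-products form is an assumption-free rewrite of the weighted mean (sum of value times weight, divided by sum of weights); it is offered as an alternative to a per-group `np.average`, not as the notebook's own method.

```python
import numpy as np
import pandas as pd

populations = {"Madrid": 3174000, "Barcelona": 1165000, "Bilbao": 345000,
               "Seville": 690000, "Valencia": 789000}

# Normalised weights: each city's share of the combined population.
weights = pd.Series(populations) / sum(populations.values())

# Toy data: two hours, one made-up temperature per city and hour.
cities = list(populations)
toy = pd.DataFrame({
    "dt": ["2013-10-01 02:00:00"] * 5 + ["2013-10-01 03:00:00"] * 5,
    "city_name": cities * 2,
    "temp": [290.0, 295.0, 288.0, 300.0, 296.0,
             289.0, 294.0, 287.0, 299.0, 295.0],
})

# Step 1: attach the weight (population) to every row.
toy["population"] = toy["city_name"].map(populations)

# Step 2: weighted mean per hour = sum(temp * population) / sum(population).
toy["temp_x_pop"] = toy["temp"] * toy["population"]
sums = toy.groupby("dt")[["temp_x_pop", "population"]].sum()
national_temp = sums["temp_x_pop"] / sums["population"]

print(weights.round(3))
print(national_temp)
```

Because the division uses the populations actually present in each group, an hour with a missing city is averaged over the remaining cities rather than being biased toward zero.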

In [23]:
populations = {'Madrid' : 3174000,
              'Barcelona' : 1165000,
              'Bilbao' : 345000,
              'Seville' : 690000,
              'Valencia' : 789000}

#create a populations column
weather['population'] = weather['city_name'].map(populations)

In [29]:
weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 263000 entries, 0 to 262999
Data columns (total 20 columns):
dt                     263000 non-null object
dt_iso                 263000 non-null object
city_id                263000 non-null int64
city_name              263000 non-null object
temp                   263000 non-null float64
temp_min               263000 non-null float64
temp_max               263000 non-null float64
pressure               263000 non-null int64
humidity               263000 non-null int64
wind_speed             263000 non-null int64
wind_deg               263000 non-null int64
rain_1h                263000 non-null float64
rain_3h                263000 non-null float64
snow_3h                263000 non-null float64
clouds_all             263000 non-null int64
weather_id             263000 non-null int64
weather_main           263000 non-null object
weather_description    263000 non-null object
weather_icon           263000 non-null object
population      

In [47]:
numeric_cols = ['temp', 'pressure', 'wind_speed', 'rain_1h', 'rain_3h', 'snow_3h']

#create dataframe to store the transformed data
national_weather = pd.DataFrame()

#for the numeric columns, group by datetime and average according to their population weight
for col in numeric_cols:
    #group by the datetime column and take the population-weighted average of each group
    national_weather[col] = weather.groupby(weather.dt).apply(lambda x : np.average(x[col], weights=x.population))

In [39]:
national_weather.index.min(), national_weather.index.max()

('2013-10-01 02:00:00', '2019-08-26 02:00:00')

In [48]:
national_weather = national_weather.reset_index()

In [49]:
national_weather.head(3)

Unnamed: 0,dt,temp,pressure,wind_speed,rain_1h,rain_3h,snow_3h
0,2013-10-01 02:00:00,293.616979,1008.499108,3.256044,0.0,0.055979,0.0
1,2013-10-01 03:00:00,293.521288,1008.459192,3.195197,0.0,0.055979,0.0
2,2013-10-01 04:00:00,293.025492,1008.336038,3.720104,0.033588,0.0,0.0


In [50]:
#save the national weather data set as is for the prophet/sklearn models
national_weather.to_csv('./data/cleaned_data/national_weather_2013_2019')

In [53]:
national_weather.shape

(51714, 7)

In [55]:
#create a separate copy indexed by datetime (format matches the strings written by clean_weather_data)
datetimes = pd.to_datetime(national_weather['dt'], format='%Y-%m-%d %H:%M:%S')

national_weather_dtidx = national_weather.set_index(pd.DatetimeIndex(datetimes))

In [57]:
national_weather_dtidx.to_csv('./data/cleaned_data/national_weather_2013_2019_dtidx')