# **Data Collection and Preparation - Weather Data**

In [1]:
import pandas as pd
import numpy as np

In [2]:
try:
    df = pd.read_csv('resources/csv_files/weather_burbank_airport.csv')
    print("Successfully imported weather_burbank_airport.csv")
except FileNotFoundError:
    print("Error: weather_burbank_airport.csv not found.")

Successfully imported weather_burbank_airport.csv


## Format

In [3]:
df

Unnamed: 0,city,timestamp,temperature,cloud_cover,cloud_cover_description,pressure,windspeed,precipitation,felt_temperature
0,Burbank,2018-01-01 08:53:00,9.0,33.0,Fair,991.75,9.0,0.0,8.0
1,Burbank,2018-01-01 09:53:00,9.0,33.0,Fair,992.08,0.0,0.0,9.0
2,Burbank,2018-01-01 10:53:00,9.0,21.0,Haze,992.08,0.0,0.0,9.0
3,Burbank,2018-01-01 11:53:00,9.0,29.0,Partly Cloudy,992.08,0.0,0.0,9.0
4,Burbank,2018-01-01 12:53:00,8.0,33.0,Fair,992.08,0.0,0.0,8.0
...,...,...,...,...,...,...,...,...,...
29239,Burbank,2021-01-01 03:53:00,13.0,33.0,Fair,986.81,0.0,0.0,13.0
29240,Burbank,2021-01-01 04:53:00,12.0,33.0,Fair,986.81,11.0,0.0,12.0
29241,Burbank,2021-01-01 05:53:00,12.0,33.0,Fair,987.47,9.0,0.0,12.0
29242,Burbank,2021-01-01 06:53:00,11.0,33.0,Fair,987.14,13.0,0.0,11.0


As with the charging sessions, we convert the values of **timestamp** from type object to datetime. As before, we know from the supplementary document of the team assignment that all datetimes are in UTC. For easier use, we convert all datetimes from the time zone *UTC* to *America/Los_Angeles*.

In [4]:
df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True, errors='coerce').dt.tz_convert('America/Los_Angeles')
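To make the conversion above concrete, here is a minimal sketch on a toy series (the values are made up for illustration and are not taken from the dataset). It assumes, as stated above, that the raw strings are naive UTC timestamps.

```python
import pandas as pd

# Toy input: two valid naive timestamps (assumed to be UTC) and one invalid value.
sample = pd.DataFrame(
    {"timestamp": ["2018-01-01 08:53:00", "2018-07-01 08:53:00", "not a date"]}
)

# utc=True interprets naive values as UTC; errors="coerce" turns
# unparseable values into NaT instead of raising an exception.
parsed = pd.to_datetime(sample["timestamp"], utc=True, errors="coerce")

# tz_convert changes the displayed wall-clock time, not the underlying instant.
local = parsed.dt.tz_convert("America/Los_Angeles")
print(local)
```

Note that the offset depends on the date: the January value is shifted by eight hours (PST), the July value by seven hours (PDT).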

Again, let's take a look at the number of unique values for each column.

In [5]:
df.nunique(dropna=True)

city                           1
timestamp                  29244
temperature                   45
cloud_cover                   17
cloud_cover_description       23
pressure                      88
windspeed                     28
precipitation                 44
felt_temperature              43
dtype: int64

We see that there is only one unique value for *city* (as we would expect from a weather station), so we can drop that column.

In [6]:
df = df.drop(columns='city')
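The same idea can be generalized: instead of naming the column by hand, we could drop every column that has at most one distinct value. The following is only a sketch on a toy frame; the variable names `toy` and `constant_columns` are ours and not part of the notebook.

```python
import pandas as pd

# Toy frame mimicking the situation above: "city" never changes.
toy = pd.DataFrame({
    "city": ["Burbank", "Burbank", "Burbank"],
    "temperature": [9.0, 9.0, 8.0],
    "windspeed": [9.0, 0.0, 0.0],
})

# Columns with at most one distinct non-null value carry no information.
unique_counts = toy.nunique(dropna=True)
constant_columns = unique_counts[unique_counts <= 1].index.tolist()

toy = toy.drop(columns=constant_columns)
print(constant_columns)
print(toy.columns.tolist())
```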

## Duplicates & Missing Data

Again, we check for duplicates:

In [7]:
len(df[df.duplicated()])

0

Apparently there are none. We continue by checking whether there are missing values:

In [8]:
df.isnull().sum()

timestamp                   0
temperature                25
cloud_cover                20
cloud_cover_description    20
pressure                    8
windspeed                  86
precipitation               0
felt_temperature           26
dtype: int64

Compared to the size of the full dataset, the number of missing values is negligible, so we don't lose much by simply dropping the corresponding rows.
(Another option would be to fill each gap with the last non-null value.)

In [9]:
df = df.dropna()
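For comparison, the other option mentioned above (filling with the last non-null value) could look roughly like the sketch below. It uses a toy frame and assumes the rows are already sorted by time; we did not apply this to the actual data.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; rows are assumed to be in chronological order.
toy = pd.DataFrame({
    "temperature": [9.0, np.nan, 8.0, np.nan],
    "windspeed": [np.nan, 0.0, np.nan, 6.0],
})

# Forward fill: every gap takes the last observed value in its column.
filled = toy.ffill()

# A gap in the very first row has no predecessor and stays NaN,
# so a final dropna() is still needed.
filled = filled.dropna()
print(filled)
```

Forward filling keeps more rows, but it copies stale measurements into the gaps, which is why dropping is the simpler choice when so few rows are affected.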

## Transformation

Let's recall the values of *cloud_cover_description*:

In [10]:
df.cloud_cover_description.unique()

array(['Fair', 'Haze', 'Partly Cloudy', 'Mostly Cloudy', 'Cloudy', 'Fog',
       'Light Rain', 'Rain', 'Heavy Rain', 'Heavy Rain / Windy',
       'Light Rain / Windy', 'T-Storm', 'Fair / Windy', 'Cloudy / Windy',
       'Mostly Cloudy / Windy', 'Partly Cloudy / Windy',
       'Thunder in the Vicinity', 'Thunder', 'Smoke',
       'Light Rain with Thunder', 'Heavy T-Storm', 'Rain / Windy',
       'Blowing Dust'], dtype=object)

We realize that *cloud_cover_description* is a categorical value. Let's see how it relates to the *cloud_cover*.

In [11]:
for description in df.sort_values(by='cloud_cover').cloud_cover_description.unique():
    print(description, df[df.cloud_cover_description == description]['cloud_cover'].unique())

T-Storm [4.]
Heavy T-Storm [4.]
Light Rain with Thunder [4.]
Light Rain [11.]
Light Rain / Windy [11.]
Rain [12.]
Rain / Windy [12.]
Blowing Dust [19.]
Fog [20.]
Haze [21.]
Smoke [22.]
Cloudy [26.]
Cloudy / Windy [26.]
Mostly Cloudy [28. 27.]
Mostly Cloudy / Windy [28. 27.]
Partly Cloudy [29. 30.]
Partly Cloudy / Windy [30. 29.]
Fair [33. 34.]
Fair / Windy [34. 33.]
Thunder in the Vicinity [38. 47.]
Thunder [38.]
Heavy Rain [40.]
Heavy Rain / Windy [40.]


| **cloud_cover_description**| **cloud_cover**            |
|----------------------------|-----------------------------|
| T-Storm                    | `[4]`                      |
| Heavy T-Storm              | `[4]`                      |
| Light Rain with Thunder    | `[4]`                      |
| Light Rain                 | `[11]`                     |
| Light Rain / Windy         | `[11]`                     |
| Rain                       | `[12]`                     |
| Rain / Windy               | `[12]`                     |
| Blowing Dust               | `[19]`                     |
| Fog                        | `[20]`                     |
| Haze                       | `[21]`                     |
| Smoke                      | `[22]`                     |
| Cloudy                     | `[26]`                     |
| Cloudy / Windy             | `[26]`                     |
| Mostly Cloudy              | `[28, 27]`                 |
| Mostly Cloudy / Windy      | `[28, 27]`                 |
| Partly Cloudy              | `[29, 30]`                 |
| Partly Cloudy / Windy      | `[29, 30]`                 |
| Fair                       | `[33, 34]`                 |
| Fair / Windy               | `[33, 34]`                 |
| Thunder in the Vicinity    | `[38, 47]`                 |
| Thunder                    | `[38]`                     |
| Heavy Rain                 | `[40]`                     |
| Heavy Rain / Windy         | `[40]`                     |

We notice something very interesting: *cloud_cover* seems to be categorical as well, and each description maps to one or two discrete *cloud_cover* values. The values do not seem to follow any order (*Cloudy* has a cloud_cover of 26, *Partly Cloudy* has 29 or 30, and *T-Storm* (thunderstorm) has 4). Furthermore, if we look at the actual values, the name *cloud_cover_description* doesn't really fit the data in the column: values like *Light Rain* or *Haze* describe the general weather conditions rather than the clouds.

Because of this, we decided to use another weather dataset. We found the website *https://www.wunderground.com/weather/us/ca/burbank*, from which we can extract historical weather data for the same weather station in Burbank. More importantly, we can retrieve the cloud cover on a scale according to [METAR](https://en.wikipedia.org/wiki/METAR), which we expect to be more suitable for our further analysis. The Python script for scraping the data via the REST API can be found here: [fetch-weather-data.py](../utils/fetch-weather-data.py).

In [12]:
try:
    new_df = pd.read_csv('resources/csv_files/new_burbank_weather_data.csv')
    print("Successfully imported new_burbank_weather_data.csv")
except FileNotFoundError:
    print("Error: new_burbank_weather_data.csv not found.")

Successfully imported new_burbank_weather_data.csv


In [13]:
new_df

Unnamed: 0,timestamp,temperature,clouds,wx_phrase,pressure,windspeed,precipitation,felt_temperature
0,2018-01-01 08:53:00,9.0,CLR,Fair,991.75,9.0,0.0,8.0
1,2018-01-01 09:53:00,9.0,CLR,Fair,992.08,0.0,0.0,9.0
2,2018-01-01 10:53:00,9.0,CLR,Haze,992.08,0.0,0.0,9.0
3,2018-01-01 11:53:00,9.0,CLR,Partly Cloudy,992.08,0.0,0.0,9.0
4,2018-01-01 12:53:00,8.0,CLR,Fair,992.08,0.0,0.0,8.0
...,...,...,...,...,...,...,...,...
30047,2021-02-01 03:53:00,16.0,CLR,Fair,991.75,6.0,0.0,16.0
30048,2021-02-01 04:53:00,17.0,CLR,Fair,991.09,0.0,0.0,17.0
30049,2021-02-01 05:53:00,16.0,CLR,Fair,990.43,6.0,0.0,16.0
30050,2021-02-01 06:53:00,14.0,CLR,Fair,990.76,0.0,0.0,14.0


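Again, we handle the data as before: we convert the timestamps, count the unique values, check for duplicates and missing values, and drop the incomplete rows.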

In [14]:
new_df['timestamp'] = pd.to_datetime(new_df['timestamp'], utc=True, errors='coerce').dt.tz_convert('America/Los_Angeles')

In [15]:
new_df.nunique(dropna=True)

timestamp           30052
temperature            45
clouds                  5
wx_phrase              23
pressure               88
windspeed              28
precipitation          46
felt_temperature       43
dtype: int64

In [16]:
len(new_df[new_df.duplicated()])

0

In [17]:
new_df.isnull().sum()

timestamp            0
temperature         25
clouds              37
wx_phrase           20
pressure             8
windspeed           87
precipitation        0
felt_temperature    26
dtype: int64

In [18]:
new_df = new_df.dropna()

Because we still have the categorical features *clouds* and *wx_phrase*, we are going to transform them.
For that, we will look at their values again:

In [19]:
new_df.clouds.unique()

array(['CLR', 'SCT', 'FEW', 'OVC', 'BKN'], dtype=object)

We can define an ordering on the clouds via the METAR scale: CLR < FEW < SCT < BKN < OVC. This allows us to encode those numerically:

In [20]:
cloud_cover_map = {
    'CLR': 0,
    'FEW': 1,
    'SCT': 2,
    'BKN': 3,
    'OVC': 4
}
new_df.loc[:, 'clouds'] = new_df['clouds'].map(cloud_cover_map)
new_df[new_df.clouds != 0]

Unnamed: 0,timestamp,temperature,clouds,wx_phrase,pressure,windspeed,precipitation,felt_temperature
67,2018-01-03 19:53:00-08:00,18.0,2,Partly Cloudy,989.44,0.0,0.0,18.0
68,2018-01-03 20:53:00-08:00,17.0,1,Partly Cloudy,990.10,0.0,0.0,17.0
73,2018-01-04 01:53:00-08:00,12.0,1,Partly Cloudy,990.76,0.0,0.0,12.0
74,2018-01-04 02:53:00-08:00,12.0,4,Cloudy,991.09,6.0,0.0,12.0
75,2018-01-04 03:53:00-08:00,12.0,4,Cloudy,990.76,0.0,0.0,12.0
...,...,...,...,...,...,...,...,...
29997,2021-01-29 17:53:00-08:00,10.0,4,Cloudy,987.47,13.0,0.0,10.0
29998,2021-01-29 18:53:00-08:00,10.0,4,Cloudy,987.80,9.0,0.0,10.0
29999,2021-01-29 19:53:00-08:00,10.0,4,Cloudy,988.45,0.0,0.0,10.0
30000,2021-01-29 20:53:00-08:00,9.0,1,Fair,989.11,0.0,0.0,9.0
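As an alternative to the dictionary mapping, the same ordering could be expressed with an ordered categorical. This is only a sketch on a toy series, not what the notebook does; one difference is that an unknown code would become -1 here, whereas `map` would produce NaN.

```python
import pandas as pd

# METAR sky-cover codes from clearest to fully overcast, as ordered above.
order = ["CLR", "FEW", "SCT", "BKN", "OVC"]

# Toy series; the values are illustrative only.
clouds = pd.Series(["CLR", "SCT", "FEW", "OVC", "BKN", "CLR"])

# An ordered categorical stores each value as its position in `order`.
cat = pd.Categorical(clouds, categories=order, ordered=True)
codes = pd.Series(cat.codes, index=clouds.index)
print(codes.tolist())
```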


In [21]:
for description in new_df.loc[new_df['windspeed'].sort_values().index].wx_phrase.unique():
    print(description, "min:", new_df[new_df.wx_phrase == description]['windspeed'].min(), "max:", new_df[new_df.wx_phrase == description]['windspeed'].max())

Fair min: 0.0 max: 31.0
Mostly Cloudy min: 0.0 max: 31.0
Haze min: 0.0 max: 24.0
Cloudy min: 0.0 max: 31.0
Partly Cloudy min: 0.0 max: 31.0
Fog min: 0.0 max: 19.0
Light Rain min: 0.0 max: 30.0
Heavy Rain min: 0.0 max: 31.0
Rain min: 0.0 max: 31.0
Smoke min: 0.0 max: 20.0
T-Storm min: 0.0 max: 24.0
Thunder in the Vicinity min: 7.0 max: 28.0
Heavy T-Storm min: 7.0 max: 20.0
Light Rain with Thunder min: 15.0 max: 15.0
Blowing Dust min: 15.0 max: 28.0
Thunder min: 15.0 max: 15.0
Fair / Windy min: 33.0 max: 52.0
Mostly Cloudy / Windy min: 33.0 max: 57.0
Cloudy / Windy min: 33.0 max: 48.0
Partly Cloudy / Windy min: 33.0 max: 44.0
Light Rain / Windy min: 33.0 max: 37.0
Rain / Windy min: 33.0 max: 44.0
Heavy Rain / Windy min: 37.0 max: 46.0


We see a clear distinction between descriptions with and without the word *Windy*: every *Windy* description has a windspeed of at least 33, while no other description exceeds 31. To lower the overall dimensionality, we add a new boolean feature called *windy*. This lets us merge *Partly Cloudy* with *Partly Cloudy / Windy*, *Fair* with *Fair / Windy*, and so on.

In [22]:
# Work on an explicit copy so the assignments below do not trigger a SettingWithCopyWarning
new_df = new_df.copy()
# Flag records whose description contains "Windy", then strip the suffix from the description
new_df['windy'] = new_df['wx_phrase'].str.contains('Windy', regex=False)
new_df['wx_phrase'] = new_df['wx_phrase'].str.replace(' / Windy', '', regex=False)
new_df[new_df['windy']]


Unnamed: 0,timestamp,temperature,clouds,wx_phrase,pressure,windspeed,precipitation,felt_temperature,windy
215,2018-01-09 04:43:00-08:00,14.0,4,Heavy Rain,975.28,37.0,3.05,14.0,True
216,2018-01-09 04:53:00-08:00,14.0,4,Light Rain,974.95,33.0,3.30,14.0,True
217,2018-01-09 05:02:00-08:00,14.0,4,Light Rain,974.63,33.0,0.00,14.0,True
218,2018-01-09 05:16:00-08:00,14.0,4,Light Rain,974.63,37.0,0.00,14.0,True
219,2018-01-09 05:53:00-08:00,13.0,4,Heavy Rain,975.61,43.0,3.81,13.0,True
...,...,...,...,...,...,...,...,...,...
29721,2021-01-19 20:53:00-08:00,17.0,3,Mostly Cloudy,984.17,33.0,0.00,17.0,True
29723,2021-01-19 22:53:00-08:00,18.0,0,Fair,984.17,37.0,0.00,18.0,True
29724,2021-01-19 23:53:00-08:00,18.0,0,Fair,984.17,33.0,0.00,18.0,True
29725,2021-01-20 00:53:00-08:00,18.0,0,Fair,984.17,35.0,0.00,18.0,True
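To double-check the observation from the min/max listing, one could compare the flag derived from the description with a plain threshold on *windspeed*. The sketch below uses a toy frame, and the threshold of 33 is our assumption, read off the listing rather than taken from any documentation.

```python
import pandas as pd

# Toy records; values are illustrative only.
toy = pd.DataFrame({
    "wx_phrase": ["Fair", "Fair / Windy", "Heavy Rain / Windy", "Cloudy"],
    "windspeed": [31.0, 33.0, 37.0, 0.0],
})

# Derive the flag from the text and strip the suffix, as in the cell above.
toy["windy"] = toy["wx_phrase"].str.contains("Windy", regex=False)
toy["wx_phrase"] = toy["wx_phrase"].str.replace(" / Windy", "", regex=False)

# Assumed threshold: lowest windspeed observed for any "Windy" description.
WINDY_THRESHOLD = 33.0
agrees = (toy["windspeed"] >= WINDY_THRESHOLD) == toy["windy"]
print(agrees.all())
```

If the check passed on the full dataset, *windy* would be fully determined by *windspeed* and could also be derived from it directly.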


Now let's add dummy variables for *wx_phrase*.

In [23]:
new_df

Unnamed: 0,timestamp,temperature,clouds,wx_phrase,pressure,windspeed,precipitation,felt_temperature,windy
0,2018-01-01 00:53:00-08:00,9.0,0,Fair,991.75,9.0,0.0,8.0,False
1,2018-01-01 01:53:00-08:00,9.0,0,Fair,992.08,0.0,0.0,9.0,False
2,2018-01-01 02:53:00-08:00,9.0,0,Haze,992.08,0.0,0.0,9.0,False
3,2018-01-01 03:53:00-08:00,9.0,0,Partly Cloudy,992.08,0.0,0.0,9.0,False
4,2018-01-01 04:53:00-08:00,8.0,0,Fair,992.08,0.0,0.0,8.0,False
...,...,...,...,...,...,...,...,...,...
30047,2021-01-31 19:53:00-08:00,16.0,0,Fair,991.75,6.0,0.0,16.0,False
30048,2021-01-31 20:53:00-08:00,17.0,0,Fair,991.09,0.0,0.0,17.0,False
30049,2021-01-31 21:53:00-08:00,16.0,0,Fair,990.43,6.0,0.0,16.0,False
30050,2021-01-31 22:53:00-08:00,14.0,0,Fair,990.76,0.0,0.0,14.0,False


In [24]:
new_df = pd.get_dummies(new_df, columns=["wx_phrase"])
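To illustrate what `pd.get_dummies` does to the frame, here is a small sketch on toy data (the two descriptions are picked arbitrarily). The original column is replaced by one indicator column per distinct value, named with the prefix `wx_phrase_`.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "temperature": [9.0, 12.0, 14.0],
    "wx_phrase": ["Fair", "Cloudy", "Fair"],
})

# One-hot encode only the listed column; all other columns are kept as-is.
encoded = pd.get_dummies(toy, columns=["wx_phrase"])
print(encoded.columns.tolist())
```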

In [25]:
# Save the prepared new dataset (new_df), not the original frame df
new_df.to_csv('resources/csv_files/new_burbank_weather_data_prepared.csv', index=False)
new_df.to_pickle('resources/pickle_files/new_burbank_weather_data_prepared.pkl')
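One practical difference between the two output formats: a pickle preserves the column dtypes, whereas a CSV stores the timestamps as text, so they have to be parsed again after loading. The following sketch demonstrates this on a toy frame written to a temporary directory; the file names are placeholders.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a timezone-aware timestamp column.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2018-01-01 08:53:00"], utc=True).tz_convert(
        "America/Los_Angeles"
    ),
    "temperature": [9.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy.csv")
    pkl_path = os.path.join(tmp_dir, "toy.pkl")
    toy.to_csv(csv_path, index=False)
    toy.to_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)      # timestamps come back as strings
    from_pkl = pd.read_pickle(pkl_path)   # dtypes are restored exactly

print(from_csv["timestamp"].dtype)
print(from_pkl["timestamp"].dtype)
```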