# Information Visualization Project - Part 1

The dataset the group will explore describes different kinds of weather events <br>
in the United States. It is available on the OpenML website at the following [link](https://openml.org/search?type=data&id=43380).

## A brief description

The US Weather Events dataset (2016-2020) compiles weather event data from about 2 thousand airports<br>
throughout the United States. It covers 49 states, and the data <br>
stretches from January 2016 to December 2020.

# Library Imports

In [1]:
# Basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Vega Altair
import altair as alt

# Download Data
import openml 

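By default, Altair refuses to embed datasets with more than 5,000 rows, so the VegaFusion data transformer is enabled to work with this larger dataset: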

In [2]:
alt.data_transformers.enable("vegafusion")

DataTransformerRegistry.enable('vegafusion')

## Downloading data

In [3]:
dataset = openml.datasets.get_dataset(43380)

In [4]:
dataset

OpenML Dataset
Name.........: US-Weather-Events-(2016---2020)
Version......: 1
Format.......: arff
Upload Date..: 2022-03-23 12:51:42
Licence......: CC BY-NC-SA 4.0
Download URL.: https://api.openml.org/data/v1/download/22102205/US-Weather-Events-(2016---2020).arff
OpenML URL...: https://www.openml.org/d/43380
# of features: None

In [5]:
X, y, _, _ = dataset.get_data(dataset_format="dataframe")

In [6]:
X.head()

Unnamed: 0,EventId,Type,Severity,StartTime(UTC),EndTime(UTC),Precipitation(in),TimeZone,AirportCode,LocationLat,LocationLng,City,County,State,ZipCode
0,W-1,Snow,Light,2016-01-06 23:14:00,2016-01-07 00:34:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
1,W-2,Snow,Light,2016-01-07 04:14:00,2016-01-07 04:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
2,W-3,Snow,Light,2016-01-07 05:54:00,2016-01-07 15:34:00,0.03,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
3,W-4,Snow,Light,2016-01-08 05:34:00,2016-01-08 05:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
4,W-5,Snow,Light,2016-01-08 13:54:00,2016-01-08 15:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0


In [7]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7479165 entries, 0 to 7479164
Data columns (total 14 columns):
 #   Column             Dtype  
---  ------             -----  
 0   EventId            object 
 1   Type               object 
 2   Severity           object 
 3   StartTime(UTC)     object 
 4   EndTime(UTC)       object 
 5   Precipitation(in)  float64
 6   TimeZone           object 
 7   AirportCode        object 
 8   LocationLat        float64
 9   LocationLng        float64
 10  City               object 
 11  County             object 
 12  State              object 
 13  ZipCode            float64
dtypes: float64(4), object(10)
memory usage: 798.9+ MB
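The `info()` output above lists ten `object` columns, which is likely the main reason the frame needs roughly 800 MB. The following is a minimal sketch of how the dtypes could be tightened, shown on a tiny hand-made frame rather than on the real download. The name `sample` and its rows are illustrative assumptions.

```python
import pandas as pd

# Tiny illustrative frame that mimics a few of the dataset's columns.
sample = pd.DataFrame({
    "Type": ["Snow", "Snow", "Rain", "Fog"],
    "Severity": ["Light", "Light", "Heavy", "Severe"],
    "StartTime(UTC)": ["2016-01-06 23:14:00", "2016-01-07 04:14:00",
                       "2016-01-07 05:54:00", "2016-01-08 05:34:00"],
    "EndTime(UTC)": ["2016-01-07 00:34:00", "2016-01-07 04:54:00",
                     "2016-01-07 15:34:00", "2016-01-08 05:54:00"],
})

# Low-cardinality text columns can be stored as categoricals.
for col in ["Type", "Severity"]:
    sample[col] = sample[col].astype("category")

# Timestamp strings can be parsed into datetime64 values.
for col in ["StartTime(UTC)", "EndTime(UTC)"]:
    sample[col] = pd.to_datetime(sample[col])

print(sample.dtypes)
```

On the full frame, comparing `X.memory_usage(deep=True).sum()` before and after such a conversion would show whether the saving is worth it; that measurement is not made here.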


## Data Exploration

### Describing the variables

EventId: 

Every row has a distinct `EventId`, so the variable is only a row identifier and carries no analytical meaning.

In [8]:
X['EventId'].nunique()

7479165
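The count above matches the number of rows reported by `X.info()`, which is what makes the column an identifier. A small sketch of a more direct check, using a made-up series `ids` in place of `X['EventId']`:

```python
import pandas as pd

# Illustrative stand-in for X['EventId'].
ids = pd.Series(["W-1", "W-2", "W-3", "W-4", "W-5"], name="EventId")

# is_unique answers the question directly, without comparing counts by eye.
all_unique = ids.is_unique

# duplicated() flags every repeat after the first occurrence.
n_duplicates = int(ids.duplicated().sum())

print(all_unique, n_duplicates)
```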

Type:

There are 7 kinds of events: 
- Snow, Fog, Cold, Storm, Rain, Precipitation and Hail

In [9]:
X['Type'].unique()

array(['Snow', 'Fog', 'Cold', 'Storm', 'Rain', 'Precipitation', 'Hail'],
      dtype=object)

In [10]:
X['Type'].value_counts()

Type
Rain             4397546
Fog              1722738
Snow              980411
Cold              197691
Precipitation     128836
Storm              49203
Hail                2740
Name: count, dtype: int64
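Since the counts are already available from pandas, one option is to aggregate first and chart the small summary table instead of the full frame. The sketch below shows that pre-aggregation step on a toy frame. The names `events` and `type_counts` are assumptions made for the example, and the added `share` column is not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for the full events frame.
events = pd.DataFrame({"Type": ["Rain", "Rain", "Fog", "Snow", "Rain", "Fog"]})

# One row per event type, sorted by descending count.
type_counts = (
    events["Type"]
    .value_counts()
    .rename_axis("Type")
    .reset_index(name="count")
)

# Relative frequency of each type.
type_counts["share"] = type_counts["count"] / type_counts["count"].sum()

print(type_counts)
```

A frame like `type_counts` has only one row per category, so it could be passed to `alt.Chart` without relying on a data transformer.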

In [11]:
title = alt.TitleParams('Events Count', anchor='middle')
base = alt.Chart(X, title=title).encode(
    alt.X(
        'Type',
        sort='-y',  # order bars by descending count
        axis=alt.Axis(labelAngle=0),
        title='Type of event'),
    alt.Y('count()', title='Count of events'),
    text='count(Type)'
).properties(
    width=500,
    height=400
)
base.mark_bar() + base.mark_text(align='center', dx=0, yOffset=-10)

Severity:

There are 6 levels of severity: 
- Light, Severe, Moderate, Heavy, UNK and Other

In [12]:
X['Severity'].unique()

array(['Light', 'Severe', 'Moderate', 'Heavy', 'UNK', 'Other'],
      dtype=object)

In [13]:
title = alt.TitleParams('Severity Count', anchor='middle')
base = alt.Chart(X, title=title).encode(
    alt.X(
        'Severity',
        sort='-y',  # order bars by descending count
        axis=alt.Axis(labelAngle=0),
        title='Type of Severity'),
    alt.Y('count()', title='Count'),
    text='count(Severity)'
).properties(
    width=500,
    height=400
)
base.mark_bar() + base.mark_text(align='center', dx=0, yOffset=-10)

Start Time and End Time:

In [14]:
X[['StartTime(UTC)', 'EndTime(UTC)']].head()

Unnamed: 0,StartTime(UTC),EndTime(UTC)
0,2016-01-06 23:14:00,2016-01-07 00:34:00
1,2016-01-07 04:14:00,2016-01-07 04:54:00
2,2016-01-07 05:54:00,2016-01-07 15:34:00
3,2016-01-08 05:34:00,2016-01-08 05:54:00
4,2016-01-08 13:54:00,2016-01-08 15:54:00


In [15]:
X[['StartTime(UTC)', 'EndTime(UTC)']].describe()

Unnamed: 0,StartTime(UTC),EndTime(UTC)
count,7479165,7479165
unique,1980059,1940056
top,2017-03-12 06:15:00,2016-03-13 06:15:00
freq,116,103
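Both time columns are stored as strings, so `describe()` only reports string statistics. A minimal sketch of how they could be parsed to obtain an event duration, using the first three rows shown in the `head()` output above (the frame name `times` and the column `duration_min` are assumptions for the example):

```python
import pandas as pd

# First three rows of the head() output above.
times = pd.DataFrame({
    "StartTime(UTC)": ["2016-01-06 23:14:00", "2016-01-07 04:14:00",
                       "2016-01-07 05:54:00"],
    "EndTime(UTC)": ["2016-01-07 00:34:00", "2016-01-07 04:54:00",
                     "2016-01-07 15:34:00"],
})

# Parse the strings into datetime64 values.
start = pd.to_datetime(times["StartTime(UTC)"])
end = pd.to_datetime(times["EndTime(UTC)"])

# Duration of each event in minutes.
times["duration_min"] = (end - start).dt.total_seconds() / 60

print(times["duration_min"].tolist())  # [80.0, 40.0, 580.0]
```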


Precipitation:

The data is extremely unbalanced: events with no recorded precipitation far outnumber<br>
those with some precipitation (the median is 0 inches).

In [16]:
X['Precipitation(in)'].describe()

count    7.479165e+06
mean     9.518492e-02
std      9.185906e-01
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      5.000000e-02
max      1.104130e+03
Name: Precipitation(in), dtype: float64
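The quartiles above only show that at least half of the events have zero precipitation. The exact proportion could be computed as in the sketch below, which uses the five precipitation values from the `head()` output as sample data (the name `precip` is an assumption for the example):

```python
import pandas as pd

# Precipitation values of the first five rows shown earlier.
precip = pd.Series([0.0, 0.0, 0.03, 0.0, 0.0], name="Precipitation(in)")

# The mean of a boolean mask is the fraction of True values.
zero_share = (precip == 0).mean()

print(f"{zero_share:.0%} of the sample events have no precipitation")
```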

In [17]:
title = alt.TitleParams('Precipitation Histogram', anchor='middle')
base = alt.Chart(X, title=title).encode(
    alt.X(
        'Precipitation(in)',
        bin=alt.Bin(maxbins=50),  # bin the values so the bars form a histogram
        title='Precipitation (inches)'),
    alt.Y('count()', title='Frequency').scale(type='log')
).properties(
    width=500,
    height=400
)
base.mark_bar()  # text layer dropped: no text encoding is defined

TimeZone

In [18]:
X['TimeZone'].head()

0    US/Mountain
1    US/Mountain
2    US/Mountain
3    US/Mountain
4    US/Mountain
Name: TimeZone, dtype: object

In [19]:
X['TimeZone'].unique()

array(['US/Mountain', 'US/Central', 'US/Eastern', 'US/Pacific'],
      dtype=object)

In [20]:
title = alt.TitleParams('TimeZone Count', anchor='middle')
base = alt.Chart(X, title=title).encode(
    alt.X(
        'TimeZone',
        sort='-y',  # order bars by descending count
        axis=alt.Axis(labelAngle=0),
        title='Time zone'),
    alt.Y('count()', title='Count'),
    text='count(TimeZone)'
).properties(
    width=500,
    height=400
)
base.mark_bar() + base.mark_text(align='center', dx=0, yOffset=-10)

AirportCode

In [21]:
X['AirportCode'].head()

0    K04V
1    K04V
2    K04V
3    K04V
4    K04V
Name: AirportCode, dtype: object

In [22]:
X['AirportCode'].nunique()

2071

In [23]:
X['AirportCode'].unique()

array(['K04V', 'KAXS', 'KAEL', ..., 'KB23', 'KARL', 'KBVR'], dtype=object)

In [24]:
title = alt.TitleParams('AirportCode Count', anchor='middle')
base = alt.Chart(X, title=title).encode(
    alt.X(
        'AirportCode',
        sort='-y',  # order bars by descending count
        axis=alt.Axis(labelAngle=0),
        title='AirportCode'),
    alt.Y('count()', title='Count')
).properties(
    width=600,
    height=400
)
base.mark_bar()  # text layer dropped: no text encoding is defined

Weather Station Locations (Latitude and Longitude)

In [25]:
X['LocationLat'].nunique()

2056

In [26]:
X['LocationLng'].nunique()

2063
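Counting latitudes and longitudes separately gives fewer distinct values than there are airport codes (2071), which only tells us that some stations share a latitude or a longitude value. Counting coordinate pairs is more informative. The sketch below illustrates the idea on a toy frame; apart from `K04V`, the coordinates are made up, and the names `stations`, `unique_stations` and `shared` are assumptions.

```python
import pandas as pd

# Toy frame: only the K04V coordinates come from the dataset preview.
stations = pd.DataFrame({
    "AirportCode": ["K04V", "K04V", "KAXS", "KAEL", "KB23"],
    "LocationLat": [38.0972, 38.0972, 34.7, 43.68, 34.7],
    "LocationLng": [-106.1689, -106.1689, -99.34, -93.37, -99.34],
})

# One row per (airport, latitude, longitude) combination.
unique_stations = stations.drop_duplicates(
    ["AirportCode", "LocationLat", "LocationLng"]
)

# Number of distinct airports recorded at each coordinate pair.
airports_per_pair = unique_stations.groupby(
    ["LocationLat", "LocationLng"]
)["AirportCode"].nunique()

# Coordinate pairs that are shared by more than one airport code.
shared = airports_per_pair[airports_per_pair > 1]

print(len(unique_stations), len(shared))
```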

In [27]:
from vega_datasets import data

In [28]:
states = alt.topo_feature(data.us_10m.url, 'states')
capitals = data.us_state_capitals.url

In [29]:
# US states background
background = alt.Chart(states).mark_geoshape(
    fill='lightgray',
    stroke='white'
).properties(
    title='Weather Station Locations',
    width=650,
    height=400
).project('albersUsa')

In [30]:
# Points and text
hover = alt.selection_point(on='pointerover', nearest=True,
                      fields=['LocationLat', 'LocationLng'])


In [37]:
# One row per station; City is kept so the text layer can label the points
base = alt.Chart(
    X[['LocationLat', 'LocationLng', 'City']].drop_duplicates()
).encode(
    longitude='LocationLng:Q',
    latitude='LocationLat:Q',
)

In [32]:
text = base.mark_text(dy=-5, align='right').encode(
    alt.Text('City:N'),
    opacity=alt.condition(~hover, alt.value(0), alt.value(1))
)

In [52]:
points = base.mark_point().encode(
    color=alt.value('navy'),
    size=alt.condition(~hover, alt.value(5), alt.value(5))
).add_params(hover)

In [53]:
background + points #+ text

In [None]:
X.groupby(['State'])

City

In [56]:
X['City'].nunique()

1716

In [67]:
list_top_cities = X['City'].value_counts().head(10).index

In [71]:
title = alt.TitleParams('Top 10 Cities count', anchor='middle')
base = alt.Chart(X[X['City'].isin(list_top_cities)], title=title).encode(
    alt.X(
        'City',
        sort='-y',  # order bars by descending count
        axis=alt.Axis(labelAngle=0),
        title='City'),
    alt.Y('count()', title='Count')
).properties(
    width=600,
    height=400
)
base.mark_bar()  # text layer dropped: no text encoding is defined

County

In [74]:
X['County'].nunique()

1100

In [77]:
list_top_county = X['County'].value_counts().head(10).index

In [78]:
title = alt.TitleParams('Top 10 Counties count', anchor='middle')
# Filter and plot on County (the original cell used the City column by mistake)
base = alt.Chart(X[X['County'].isin(list_top_county)], title=title).encode(
    alt.X(
        'County',
        sort='-y',  # order bars by descending count
        axis=alt.Axis(labelAngle=0),
        title='County'),
    alt.Y('count()', title='Count')
).properties(
    width=600,
    height=400
)
base.mark_bar()  # text layer dropped: no text encoding is defined

State

In [80]:
X['State'].nunique()

48

In [81]:
list_top_state = X['State'].value_counts().head(10).index

In [84]:
title = alt.TitleParams('Top 10 States count', anchor='middle')
base = alt.Chart(X[X['State'].isin(list_top_state)], title=title).encode(
    alt.X(
        'State',
        sort='-y',  # order bars by descending count
        axis=alt.Axis(labelAngle=0),
        title='State'),
    alt.Y('count()', title='Count')
).properties(
    width=600,
    height=400
)
base.mark_bar()  # text layer dropped: no text encoding is defined



Chart initialization

In [None]:
chart = alt.Chart(X)


# Data Transformation

In [None]:
X['date_error'] = np.where(X['StartTime(UTC)'] > X['EndTime(UTC)'], 1, 0)

In [None]:
X[X['date_error'] == 1]

Unnamed: 0,EventId,Type,Severity,StartTime(UTC),EndTime(UTC),Precipitation(in),TimeZone,AirportCode,LocationLat,LocationLng,City,County,State,ZipCode,date_error
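The empty result above suggests that no event ends before it starts. The check compares strings, which works as long as every timestamp follows the same `YYYY-MM-DD HH:MM:SS` layout. A hedged sketch of a stricter variant that parses the values first and also flags unparseable timestamps is shown below; the toy frame `events` and the extra column `date_missing` are assumptions made for the example, and the second and third rows are deliberately invalid.

```python
import pandas as pd

# Toy frame: row 1 ends before it starts, row 2 has an unparseable start.
events = pd.DataFrame({
    "StartTime(UTC)": ["2016-01-06 23:14:00", "2016-01-08 05:54:00",
                       "not a date"],
    "EndTime(UTC)": ["2016-01-07 00:34:00", "2016-01-08 05:34:00",
                     "2016-01-08 15:54:00"],
})

# errors="coerce" turns unparseable strings into NaT instead of raising.
start = pd.to_datetime(events["StartTime(UTC)"], errors="coerce")
end = pd.to_datetime(events["EndTime(UTC)"], errors="coerce")

# 1 when the event starts after it ends (comparisons with NaT are False).
events["date_error"] = (start > end).astype(int)

# 1 when either timestamp could not be parsed.
events["date_missing"] = (start.isna() | end.isna()).astype(int)

print(events[["date_error", "date_missing"]])
```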
