# Information Visualization Project - Part 1

The dataset the group will explore describes different kinds of weather events
in the United States. It is available on the OpenML website at the following [link](https://openml.org/search?type=data&id=43380).

## A brief description

The US Weather Events dataset (2016-2020) compiles weather data from 2 thousand airports
throughout the United States. It covers 49 states, and the records
stretch from January 2016 to December 2020.

# Library Imports

In [1]:
# Basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Vega Altair
import altair as alt

# Download Data
import openml 

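Altair's default data transformer raises an error for datasets with more than 5,000 rows, so the VegaFusion transformer is enabled to work with this larger dataset: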

In [None]:
alt.data_transformers.enable("vegafusion")

## Downloading data

In [2]:
dataset = openml.datasets.get_dataset(43380)

In [3]:
dataset

OpenML Dataset
Name.........: US-Weather-Events-(2016---2020)
Version......: 1
Format.......: arff
Upload Date..: 2022-03-23 12:51:42
Licence......: CC BY-NC-SA 4.0
Download URL.: https://api.openml.org/data/v1/download/22102205/US-Weather-Events-(2016---2020).arff
OpenML URL...: https://www.openml.org/d/43380
# of features: None

In [4]:
X, y, _, _ = dataset.get_data(dataset_format="dataframe")

In [5]:
X.head()

Unnamed: 0,EventId,Type,Severity,StartTime(UTC),EndTime(UTC),Precipitation(in),TimeZone,AirportCode,LocationLat,LocationLng,City,County,State,ZipCode
0,W-1,Snow,Light,2016-01-06 23:14:00,2016-01-07 00:34:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
1,W-2,Snow,Light,2016-01-07 04:14:00,2016-01-07 04:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
2,W-3,Snow,Light,2016-01-07 05:54:00,2016-01-07 15:34:00,0.03,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
3,W-4,Snow,Light,2016-01-08 05:34:00,2016-01-08 05:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
4,W-5,Snow,Light,2016-01-08 13:54:00,2016-01-08 15:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0


In [6]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7479165 entries, 0 to 7479164
Data columns (total 14 columns):
 #   Column             Dtype  
---  ------             -----  
 0   EventId            object 
 1   Type               object 
 2   Severity           object 
 3   StartTime(UTC)     object 
 4   EndTime(UTC)       object 
 5   Precipitation(in)  float64
 6   TimeZone           object 
 7   AirportCode        object 
 8   LocationLat        float64
 9   LocationLng        float64
 10  City               object 
 11  County             object 
 12  State              object 
 13  ZipCode            float64
dtypes: float64(4), object(10)
memory usage: 798.9+ MB
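
The `info()` output shows both timestamp columns stored as `object` and a memory footprint of roughly 800 MB. The following is a minimal sketch, not part of the original notebook, of one way to parse the timestamps and cast repetitive text columns to `category`. It assumes the timestamps follow the `YYYY-MM-DD HH:MM:SS` pattern visible in `X.head()`; `sample` is a hypothetical stand-in built from those first rows, and `duration_min` is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in copied from the first rows of X.head();
# in the notebook the same steps would run on X itself.
sample = pd.DataFrame({
    "EventId": ["W-1", "W-2", "W-3"],
    "Type": ["Snow", "Snow", "Snow"],
    "Severity": ["Light", "Light", "Light"],
    "StartTime(UTC)": ["2016-01-06 23:14:00", "2016-01-07 04:14:00", "2016-01-07 05:54:00"],
    "EndTime(UTC)": ["2016-01-07 00:34:00", "2016-01-07 04:54:00", "2016-01-07 15:34:00"],
    "State": ["CO", "CO", "CO"],
})

# Parse the timestamp strings (format assumed from the head() output).
for col in ["StartTime(UTC)", "EndTime(UTC)"]:
    sample[col] = pd.to_datetime(sample[col], format="%Y-%m-%d %H:%M:%S")

# Text columns with few distinct values are usually cheaper as categoricals.
for col in ["Type", "Severity", "State"]:
    sample[col] = sample[col].astype("category")

# Parsed timestamps make derived columns possible, e.g. event duration in minutes.
duration_min = (sample["EndTime(UTC)"] - sample["StartTime(UTC)"]).dt.total_seconds() / 60
```

How much memory these conversions save on the full dataset has not been measured here.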


## Data Exploration

### Describing the variables

EventId:

Every value is unique, so the variable works only as a row identifier and carries no analytical meaning.

In [10]:
X['EventId'].nunique()

7479165
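
The count above equals the number of rows reported by `X.info()`, which is what makes the identifier unique. As a small sketch under the assumption that an explicit check is wanted, pandas also exposes this directly; `ids` is a hypothetical stand-in holding the five identifiers shown in `X.head()`.

```python
import pandas as pd

# Hypothetical stand-in for X['EventId'], copied from the head() output.
ids = pd.Series(["W-1", "W-2", "W-3", "W-4", "W-5"], name="EventId")

# True when no value repeats; equivalent to nunique() == len() when there are no NaNs.
all_unique = ids.is_unique

# Number of rows whose identifier already appeared earlier in the column.
n_duplicates = int(ids.duplicated().sum())
```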

Type:

There are 7 kinds of events: 
- Snow, Fog, Cold, Storm, Rain, Precipitation, Hail

In [12]:
X['Type'].unique()

array(['Snow', 'Fog', 'Cold', 'Storm', 'Rain', 'Precipitation', 'Hail'],
      dtype=object)

In [13]:
X['Type'].value_counts()

Type
Rain             4397546
Fog              1722738
Snow              980411
Cold              197691
Precipitation     128836
Storm              49203
Hail                2740
Name: count, dtype: int64
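
The raw counts span several orders of magnitude, so relative shares can be easier to compare. This is a minimal sketch, not part of the original notebook: `counts` is rebuilt by hand from the output above, and `shares` is an illustrative name. On the real data, `X['Type'].value_counts(normalize=True)` should give the same proportions in one call.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    {
        "Rain": 4397546,
        "Fog": 1722738,
        "Snow": 980411,
        "Cold": 197691,
        "Precipitation": 128836,
        "Storm": 49203,
        "Hail": 2740,
    },
    name="count",
)

# Fraction of all events belonging to each type.
shares = counts / counts.sum()

# Most and least frequent event types.
most_common = shares.idxmax()
least_common = shares.idxmin()
```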

Chart initialization

In [8]:
chart = alt.Chart(X)

DataTransformerRegistry.enable('vegafusion')

In [34]:
title = alt.TitleParams('Events Count', anchor='middle')
base = alt.Chart(X, title=title).encode(
    alt.X(
        'Type',
        sort=alt.EncodingSortField(field="Type", op="count", order='descending'), 
        axis=alt.Axis(labelAngle=0),
        title='Type of event'),
    alt.Y('count()', title='Count of events'),
    text='count(Type)'
).properties(
    width=500,
    height=400
)
base.mark_bar() + base.mark_text(align='center', dx=0, yOffset=-10)