# Information Visualization Project - Part 1

This project explores a dataset describing different kinds of weather events<br>
in the United States. It is available on the OpenML website at this [LINK](https://openml.org/search?type=data&id=43380).

## A brief description

The US Weather Events dataset (2016-2020) compiles weather data from about two thousand airports<br>
across the United States. It covers 49 states, and the data<br>
stretches from January 2016 to December 2020.

## Library Imports

In [1]:
# Basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Vega Altair
import altair as alt

# Download Data
import openml 

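By default, Altair raises an error when a chart's data has more than 5,000 rows, so the VegaFusion data transformer is enabled to work with larger datasets: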

In [2]:
alt.data_transformers.enable("vegafusion")

DataTransformerRegistry.enable('vegafusion')

## Downloading data

In [3]:
dataset = openml.datasets.get_dataset(43380)

In [4]:
dataset

OpenML Dataset
Name.........: US-Weather-Events-(2016---2020)
Version......: 1
Format.......: arff
Upload Date..: 2022-03-23 12:51:42
Licence......: CC BY-NC-SA 4.0
Download URL.: https://api.openml.org/data/v1/download/22102205/US-Weather-Events-(2016---2020).arff
OpenML URL...: https://www.openml.org/d/43380
# of features: None

In [5]:
X, y, _, _ = dataset.get_data(dataset_format="dataframe")

In [6]:
X.head()

Unnamed: 0,EventId,Type,Severity,StartTime(UTC),EndTime(UTC),Precipitation(in),TimeZone,AirportCode,LocationLat,LocationLng,City,County,State,ZipCode
0,W-1,Snow,Light,2016-01-06 23:14:00,2016-01-07 00:34:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
1,W-2,Snow,Light,2016-01-07 04:14:00,2016-01-07 04:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
2,W-3,Snow,Light,2016-01-07 05:54:00,2016-01-07 15:34:00,0.03,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
3,W-4,Snow,Light,2016-01-08 05:34:00,2016-01-08 05:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
4,W-5,Snow,Light,2016-01-08 13:54:00,2016-01-08 15:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0


In [7]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7479165 entries, 0 to 7479164
Data columns (total 14 columns):
 #   Column             Dtype  
---  ------             -----  
 0   EventId            object 
 1   Type               object 
 2   Severity           object 
 3   StartTime(UTC)     object 
 4   EndTime(UTC)       object 
 5   Precipitation(in)  float64
 6   TimeZone           object 
 7   AirportCode        object 
 8   LocationLat        float64
 9   LocationLng        float64
 10  City               object 
 11  County             object 
 12  State              object 
 13  ZipCode            float64
dtypes: float64(4), object(10)
memory usage: 798.9+ MB


## Data Exploration

### Describing the variables

EventId:

A unique identifier for each event. It has no duplicates and carries no analytical meaning.

In [8]:
X['EventId'].nunique()

7479165

Type:

There are 7 kinds of events:
- Snow, Fog, Cold, Storm, Rain, Precipitation, Hail

In [9]:
X['Type'].unique()

array(['Snow', 'Fog', 'Cold', 'Storm', 'Rain', 'Precipitation', 'Hail'],
      dtype=object)

In [10]:
X['Type'].value_counts()

Type
Rain             4397546
Fog              1722738
Snow              980411
Cold              197691
Precipitation     128836
Storm              49203
Hail                2740
Name: count, dtype: int64
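The raw counts above are easier to compare as shares of the total. The following is a minimal sketch that rebuilds the counts from the output above as a standalone Series (the names `type_counts` and `type_share` are ours, not part of the notebook); on the full DataFrame, `X['Type'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
type_counts = pd.Series(
    {
        "Rain": 4397546,
        "Fog": 1722738,
        "Snow": 980411,
        "Cold": 197691,
        "Precipitation": 128836,
        "Storm": 49203,
        "Hail": 2740,
    },
    name="count",
)

# Share of each event type in the total number of events
type_share = type_counts / type_counts.sum()
print(type_share.round(4))
```

Rain alone accounts for roughly 59% of the events, while Hail stays below 0.1%.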

In [11]:
title = alt.TitleParams('Count by event type', anchor='middle', fontSize=15)
base = alt.Chart(X, title=title).encode(
    alt.X(
        'Type',
        # Sort bars by descending row count (the field must be an existing column)
        sort=alt.EncodingSortField(field="Type", op="count", order='descending'),
        axis=alt.Axis(labelAngle=0),
        title='Event type'),
    alt.Y('count()', title='Count'),
    text=alt.Text('count()', format=',.0f')
).properties(
    width=500,
    height=400
)
p = base.mark_bar() + base.mark_text(align='center', dx=0, yOffset=-10)
p.save('/Users/administrador/Documents/Pessoal/repositorios/information_visualization/images/events_count.png', ppi=300)

Severity:

There are 6 severity levels:
- Light, Severe, Moderate, Heavy, UNK and Other

In [12]:
X['Severity'].unique()

array(['Light', 'Severe', 'Moderate', 'Heavy', 'UNK', 'Other'],
      dtype=object)

In [13]:
title = alt.TitleParams('Count by severity level', anchor='middle', fontSize=15)
base = alt.Chart(X, title=title).encode(
    alt.X(
        'Severity',
        # Sort bars by descending row count (the field must be an existing column)
        sort=alt.EncodingSortField(field="Severity", op="count", order='descending'),
        axis=alt.Axis(labelAngle=0),
        title='Severity'),
    alt.Y('count()', title='Count'),
    text=alt.Text('count()', format=',.0f')
).properties(
    width=500,
    height=400
)
p = base.mark_bar() + base.mark_text(align='center', dx=0, yOffset=-10)
p.save('/Users/administrador/Documents/Pessoal/repositorios/information_visualization/images/severity_count.png', ppi=300)

Start Time and End Time:

In [14]:
X[['StartTime(UTC)', 'EndTime(UTC)']].head()

Unnamed: 0,StartTime(UTC),EndTime(UTC)
0,2016-01-06 23:14:00,2016-01-07 00:34:00
1,2016-01-07 04:14:00,2016-01-07 04:54:00
2,2016-01-07 05:54:00,2016-01-07 15:34:00
3,2016-01-08 05:34:00,2016-01-08 05:54:00
4,2016-01-08 13:54:00,2016-01-08 15:54:00


In [15]:
X[['StartTime(UTC)', 'EndTime(UTC)']].describe()

Unnamed: 0,StartTime(UTC),EndTime(UTC)
count,7479165,7479165
unique,1980059,1940056
top,2017-03-12 06:15:00,2016-03-13 06:15:00
freq,116,103


Precipitation:

The data is extremely unbalanced: events with no recorded precipitation<br>
far outnumber those with some precipitation.

In [16]:
X['Precipitation(in)'].describe()

count    7.479165e+06
mean     9.518492e-02
std      9.185906e-01
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      5.000000e-02
max      1.104130e+03
Name: Precipitation(in), dtype: float64
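One way to quantify the imbalance is the share of events with zero recorded precipitation. The sketch below uses a small made-up Series (`precip` is illustrative data, not taken from the dataset); on the real data the same expressions would be applied to `X['Precipitation(in)']`.

```python
import pandas as pd

# Illustrative values only: mostly zeros plus a few positive readings
precip = pd.Series(
    [0.0, 0.0, 0.0, 0.03, 0.0, 0.12, 0.0, 1.5],
    name="Precipitation(in)",
)

zero_share = (precip == 0).mean()  # fraction of events with no precipitation
positive = precip[precip > 0]      # events with some precipitation

print(f"zero-precipitation share: {zero_share:.1%}")
print(positive.describe())
```

The median of 0 in the summary above already implies that at least half of the events recorded no precipitation.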

In [17]:
title = alt.TitleParams('Precipitation histogram (inches)', anchor='middle', fontSize=15)
base = alt.Chart(X, title=title).encode(
    alt.X(
        'Precipitation(in)',
        bin=alt.Bin(maxbins=50),  # bin the values so the bars form a histogram
        title='Precipitation (inches)'),
    alt.Y('count()', title='Count').scale(type='log')
).properties(
    width=500,
    height=400
)
p = base.mark_bar()  # no text layer: this chart defines no text encoding
p.save('/Users/administrador/Documents/Pessoal/repositorios/information_visualization/images/preci_hist.png', ppi=300)

TimeZone

In [18]:
X['TimeZone'].head()

0    US/Mountain
1    US/Mountain
2    US/Mountain
3    US/Mountain
4    US/Mountain
Name: TimeZone, dtype: object

In [19]:
X['TimeZone'].unique()

array(['US/Mountain', 'US/Central', 'US/Eastern', 'US/Pacific'],
      dtype=object)

In [20]:
title = alt.TitleParams('Count by time zone', anchor='middle', fontSize=15)
base = alt.Chart(X, title=title).encode(
    alt.X(
        'TimeZone',
        # Sort bars by descending row count (the field must be an existing column)
        sort=alt.EncodingSortField(field="TimeZone", op="count", order='descending'),
        axis=alt.Axis(labelAngle=0),
        title='Time zone'),
    alt.Y('count()', title='Count'),
    text=alt.Text('count()', format=',.0f')
).properties(
    width=500,
    height=400
)
p = base.mark_bar() + base.mark_text(align='center', dx=0, yOffset=-10)
p.save('/Users/administrador/Documents/Pessoal/repositorios/information_visualization/images/4.timezone_count.png', ppi=300)

AirportCode

In [21]:
X['AirportCode'].head()

0    K04V
1    K04V
2    K04V
3    K04V
4    K04V
Name: AirportCode, dtype: object

In [22]:
X['AirportCode'].nunique()

2071

In [23]:
X['AirportCode'].unique()

array(['K04V', 'KAXS', 'KAEL', ..., 'KB23', 'KARL', 'KBVR'], dtype=object)

In [24]:
title = alt.TitleParams('Count by airport code', anchor='middle', fontSize=15)
base = alt.Chart(X, title=title).encode(
    alt.X(
        'AirportCode',
        # Sort bars by descending row count (the field must be an existing column)
        sort=alt.EncodingSortField(field="AirportCode", op="count", order='descending'),
        axis=alt.Axis(labels=False, tickSize=0),
        title='Airport code'),
    alt.Y('count()', title='Count')
).properties(
    width=800,
    height=400
)
p = base.mark_bar()  # no text layer: this chart defines no text encoding
p.save('/Users/administrador/Documents/Pessoal/repositorios/information_visualization/images/5.airportcode_count.png', ppi=300)

Weather Station Locations (Latitude and Longitude)

In [25]:
X['LocationLat'].nunique()

2056

In [26]:
X['LocationLng'].nunique()

2063

In [27]:
from vega_datasets import data

In [28]:
states = alt.topo_feature(data.us_10m.url, 'states')
capitals = data.us_state_capitals.url

In [29]:
# US states background
title = alt.Title('Weather station locations', fontSize=15)
background = alt.Chart(states, title=title).mark_geoshape(
    fill='lightgray',
    stroke='black'
).properties(
    width=650,
    height=400
).project('albersUsa')

In [30]:
# Points and text
hover = alt.selection_point(on='pointerover', nearest=True,
                      fields=['LocationLat', 'LocationLng'])  # use the column names present in X


In [31]:
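base = alt.Chart(
    # Keep City so the optional text layer below has a label to show
    X[['LocationLat', 'LocationLng', 'City']].drop_duplicates(subset=['LocationLat', 'LocationLng'])
).encode(
    longitude='LocationLng:Q',
    latitude='LocationLat:Q',
)

In [32]:
text = base.mark_text(dy=-5, align='right').encode(
    alt.Text('City:N'),  # the column is named 'City', not 'city'
    opacity=alt.condition(~hover, alt.value(0), alt.value(1))
)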

In [33]:
points = base.mark_point().encode(
    color=alt.value('blue'),
    size= alt.value(5),#alt.condition(~hover, alt.value(5), alt.value(5))
    opacity=alt.value(0.3)
)#.add_params(hover)

In [34]:
p = background + points #+ text
p.save('/Users/administrador/Documents/Pessoal/repositorios/information_visualization/images/6.stations_locations.png', ppi=300)

City

In [35]:
X['City'].nunique()

1716

In [36]:
list_top_cities = X['City'].value_counts().head(10).index

In [37]:
title = alt.TitleParams('Event count for the top 10 cities', anchor='middle', fontSize=15)
base = alt.Chart(X[X['City'].isin(list_top_cities)], title=title).encode(
    alt.X(
        'City',
        # Sort bars by descending row count (the field must be an existing column)
        sort=alt.EncodingSortField(field="City", op="count", order='descending'),
        axis=alt.Axis(labelAngle=0),
        title='City'),
    alt.Y('count()', title='Count'),
    text=alt.Text('count()', format=',.0f')
).properties(
    width=600,
    height=400
)
p = base.mark_bar() + base.mark_text(align='center', dx=0, yOffset=-10)

p.save('/Users/administrador/Documents/Pessoal/repositorios/information_visualization/images/7.cities.png', ppi=300)

County

In [38]:
X['County'].nunique()

1100

In [39]:
list_top_county = X['County'].value_counts().head(10).index

In [40]:
title = alt.TitleParams('Event count for the top 10 counties', anchor='middle', fontSize=15)
base = alt.Chart(X[X['County'].isin(list_top_county)], title=title).encode(
    alt.X(
        'County',
        # Sort bars by descending row count (the field must be an existing column)
        sort=alt.EncodingSortField(field="County", op="count", order='descending'),
        axis=alt.Axis(labelAngle=0),
        title='County'),
    alt.Y('count()', title='Count'),
    text=alt.Text('count()', format=',.0f')
).properties(
    width=600,
    height=400
)
p = base.mark_bar() + base.mark_text(align='center', dx=0, yOffset=-10)

p.save('/Users/administrador/Documents/Pessoal/repositorios/information_visualization/images/8.county.png', ppi=300)

State

In [41]:
X['State'].nunique()

48

In [42]:
list_top_state = X['State'].value_counts().head(10).index

In [43]:
title = alt.TitleParams('Event count for the top 10 states', anchor='middle', fontSize=15)
base = alt.Chart(X[X['State'].isin(list_top_state)], title=title).encode(
    alt.X(
        'State',
        # Sort bars by descending row count (the field must be an existing column)
        sort=alt.EncodingSortField(field="State", op="count", order='descending'),
        axis=alt.Axis(labelAngle=0),
        title='State'),
    alt.Y('count()', title='Count'),
    text=alt.Text('count()', format=',.0f')
).properties(
    width=600,
    height=400
)
p = base.mark_bar() + base.mark_text(align='center', dx=0, yOffset=-10)

p.save('/Users/administrador/Documents/Pessoal/repositorios/information_visualization/images/9.state.png', ppi=300)


ZipCode

In [44]:
X['ZipCode'] = X['ZipCode'].astype('object')

In [45]:
X['ZipCode'].nunique()

2020

In [46]:
list_top_zipcode = X['ZipCode'].value_counts().head(10).index

In [47]:
title = alt.TitleParams('Event count for the top 10 ZIP codes', anchor='middle', fontSize=15)
base = alt.Chart(X[X['ZipCode'].isin(list_top_zipcode)], title=title).encode(
    alt.X(
        'ZipCode:N',
        # Sort bars by descending row count (the field must be an existing column)
        sort=alt.EncodingSortField(field="ZipCode", op="count", order='descending'),
        title='ZIP code'),
    alt.Y('count()', title='Count'),
    text=alt.Text('count()', format=',.0f')
).properties(
    width=600,
    height=400
)
p = base.mark_bar() + base.mark_text(align='center', dx=0, yOffset=-10)
p.save('/Users/administrador/Documents/Pessoal/repositorios/information_visualization/images/10.zipcode.png', ppi=300)


## Data Cleaning

### Checking for nulls

Every row with a missing city is also missing its ZIP code.

In [48]:
X.isnull().sum()

EventId                  0
Type                     0
Severity                 0
StartTime(UTC)           0
EndTime(UTC)             0
Precipitation(in)        0
TimeZone                 0
AirportCode              0
LocationLat              0
LocationLng              0
City                 14563
County                   0
State                    0
ZipCode              59234
dtype: int64

In [49]:
X.loc[:,['City', 'ZipCode']].isnull().all(axis=1).sum()

np.int64(14563)

In [50]:
X.isnull().any(axis=1).sum()

np.int64(59234)
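Taken together, the three counts above imply that 59,234 - 14,563 = 44,671 rows lack only the ZIP code. A minimal sketch of the same breakdown, computed on a toy frame whose rows are invented for illustration, could look like this:

```python
import numpy as np
import pandas as pd

# Toy rows (invented): one row missing both fields, two missing only ZipCode
toy = pd.DataFrame({
    "City": ["Davis", None, "Zapata", "Saguache"],
    "ZipCode": [np.nan, np.nan, np.nan, 81149.0],
})

city_na = toy["City"].isna()
zip_na = toy["ZipCode"].isna()

summary = {
    "both missing": int((city_na & zip_na).sum()),
    "only ZipCode missing": int((~city_na & zip_na).sum()),
    "only City missing": int((city_na & ~zip_na).sum()),
}
print(summary)
```

On the real data, a zero in the "only City missing" entry would confirm that a missing city always comes with a missing ZIP code.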

### Checking for errors in the records: start and end times

In [51]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7479165 entries, 0 to 7479164
Data columns (total 14 columns):
 #   Column             Dtype  
---  ------             -----  
 0   EventId            object 
 1   Type               object 
 2   Severity           object 
 3   StartTime(UTC)     object 
 4   EndTime(UTC)       object 
 5   Precipitation(in)  float64
 6   TimeZone           object 
 7   AirportCode        object 
 8   LocationLat        float64
 9   LocationLng        float64
 10  City               object 
 11  County             object 
 12  State              object 
 13  ZipCode            object 
dtypes: float64(3), object(11)
memory usage: 798.9+ MB


In [52]:
X['StartTime(UTC)'] = pd.to_datetime(X['StartTime(UTC)'], format='%Y-%m-%d %H:%M:%S')
X['EndTime(UTC)'] = pd.to_datetime(X['EndTime(UTC)'], format='%Y-%m-%d %H:%M:%S')

In [53]:
X.loc[:,['StartTime(UTC)','EndTime(UTC)']].apply(lambda x: x.iloc[0])  # first value of each column, by position

StartTime(UTC)   2016-01-06 23:14:00
EndTime(UTC)     2016-01-07 00:34:00
dtype: datetime64[ns]

In [54]:
X['EventDuration'] = X['EndTime(UTC)'] - X['StartTime(UTC)']
X['EventDuration'] = X['EventDuration'].dt.total_seconds()  # vectorized: duration in seconds

In [55]:
X[X['EventDuration'] < 0]

Unnamed: 0,EventId,Type,Severity,StartTime(UTC),EndTime(UTC),Precipitation(in),TimeZone,AirportCode,LocationLat,LocationLng,City,County,State,ZipCode,EventDuration


In [56]:
X[X['EventDuration'] == 0]

Unnamed: 0,EventId,Type,Severity,StartTime(UTC),EndTime(UTC),Precipitation(in),TimeZone,AirportCode,LocationLat,LocationLng,City,County,State,ZipCode,EventDuration
101876,W-103205,Snow,Light,2020-11-01 07:15:00,2020-11-01 07:15:00,0.00,US/Central,K04W,46.0244,-92.8991,Hinckley,Pine,MN,55037.0,0.0
148483,W-150530,Fog,Severe,2021-11-07 07:55:00,2021-11-07 07:55:00,0.00,US/Central,KFYG,38.5876,-90.9938,Marthasville,Warren,MO,63357.0,0.0
367097,W-372333,Fog,Severe,2019-11-03 07:15:00,2019-11-03 07:15:00,0.00,US/Central,KLXY,31.6412,-96.5144,Mexia,Limestone,TX,76667.0,0.0
716783,W-726704,Snow,Light,2019-11-03 07:53:00,2019-11-03 07:53:00,0.00,US/Central,KPIR,44.3827,-100.2860,Pierre,Hughes,SD,57501.0,0.0
793055,W-804001,Snow,Light,2020-11-01 07:55:00,2020-11-01 07:55:00,0.00,US/Central,KCBG,45.5628,-93.2644,Cambridge,Isanti,MN,55008.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7326092,W-7422022,Rain,Light,2017-11-05 09:55:00,2017-11-05 09:55:00,0.01,US/Pacific,KS47,45.4175,-123.8087,Tillamook,Tillamook,OR,97141.0,0.0
7408757,W-7505805,Rain,Light,2017-11-05 09:35:00,2017-11-05 09:35:00,0.00,US/Pacific,KTDO,46.4772,-122.8065,Toledo,Lewis,WA,98591.0,0.0
7413848,W-7510896,Rain,Light,2021-11-07 09:35:00,2021-11-07 09:35:00,0.00,US/Pacific,KTDO,46.4772,-122.8065,Toledo,Lewis,WA,98591.0,0.0
7419635,W-7516895,Cold,Severe,2017-11-05 08:55:00,2017-11-05 08:55:00,0.00,US/Mountain,KSHC,41.9308,-109.9687,Green River,Sweetwater,WY,82935.0,0.0
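Every zero-duration row displayed above falls on the first Sunday of November, the date on which daylight saving time ends in the United States. This suggests, but does not prove, that these records are artifacts of the clock change. The sketch below checks that pattern on a few start times copied from the output; `dst_end_date` is a hypothetical helper written for this illustration.

```python
import pandas as pd

def dst_end_date(year):
    """First Sunday of November: end of US daylight saving time (rule in force since 2007)."""
    nov1 = pd.Timestamp(year=int(year), month=11, day=1)
    days_to_sunday = (6 - nov1.dayofweek) % 7  # Monday=0 ... Sunday=6
    return (nov1 + pd.Timedelta(days=days_to_sunday)).date()

# Start times copied from the zero-duration rows displayed above
starts = pd.to_datetime(pd.Series([
    "2020-11-01 07:15:00",
    "2021-11-07 07:55:00",
    "2019-11-03 07:15:00",
    "2017-11-05 09:55:00",
]))

# True where the event date matches the DST end date of its year
on_dst_end = starts.dt.date == starts.dt.year.map(dst_end_date)
print(on_dst_end.all())
```

Applying the same comparison to all zero-duration rows of `X` would show whether the pattern holds beyond the rows displayed here.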


In [57]:
X = X[X['EventDuration'] > 0].copy()  # copy so that later column assignments do not act on a filtered view

In [58]:
X['EventDuration'].describe()

count    7.479051e+06
mean     5.021843e+03
std      8.495643e+04
min      6.000000e+01
25%      1.200000e+03
50%      2.400000e+03
75%      4.560000e+03
max      9.795192e+07
Name: EventDuration, dtype: float64
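The maximum of about 9.8e7 seconds corresponds to roughly 1,134 days, which is implausible for a single weather event and hints at remaining outliers. A minimal sketch for expressing durations in hours and flagging suspiciously long events follows; the 30-day cutoff (`MAX_PLAUSIBLE_HOURS`) is an arbitrary assumption chosen for illustration, not a value from the dataset.

```python
import pandas as pd

# Durations in seconds: the minimum, quartiles, and maximum reported above
durations = pd.Series(
    [60.0, 1200.0, 2400.0, 4560.0, 9.795192e7],
    name="EventDuration",
)

MAX_PLAUSIBLE_HOURS = 30 * 24  # assumption: anything longer than 30 days is suspect

hours = durations / 3600
suspect = hours > MAX_PLAUSIBLE_HOURS

print(hours.round(2))
print(f"suspect events: {int(suspect.sum())}")
```

On the full data, `X[X['EventDuration'] / 3600 > MAX_PLAUSIBLE_HOURS]` would list the affected rows for manual inspection.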

In [59]:
title = alt.TitleParams('Histogram of event duration', anchor='middle', fontSize=15)
base = alt.Chart(X, title=title).encode(
    alt.X(
        'EventDuration',
        bin=alt.Bin(maxbins=50),  # bin the values so the bars form a histogram
        title='Duration (s)'),
    alt.Y('count()', title='Count').scale(type='log')
).properties(
    width=500,
    height=400
)
p = base.mark_bar()  # no text layer: this chart defines no text encoding
p.save('/Users/administrador/Documents/Pessoal/repositorios/information_visualization/images/11.duration_hist.png', ppi=300)

## Transformation

### Date variable

In [60]:
X['EventDate'] = X['StartTime(UTC)'].dt.date  # vectorized equivalent of applying x.date() row by row

Day, Month and Year (START and END)

In [61]:
# The .dt accessor extracts date parts without a Python-level loop
X['StartDay'] = X['StartTime(UTC)'].dt.day
X['StartMonth'] = X['StartTime(UTC)'].dt.month
X['StartYear'] = X['StartTime(UTC)'].dt.year

X['EndDay'] = X['EndTime(UTC)'].dt.day
X['EndMonth'] = X['EndTime(UTC)'].dt.month
X['EndYear'] = X['EndTime(UTC)'].dt.year

## Data Visualization

### Basic precipitation time series

In [62]:
aggregation = {
    'Precipitation(in)' : 'sum' 
}

In [63]:
preci_by_date = X.groupby('EventDate').agg(aggregation)

In [64]:
preci_by_date = preci_by_date.reset_index().copy()

In [65]:
preci_by_date.head()

Unnamed: 0,EventDate,Precipitation(in)
0,2016-01-01,462.12
1,2016-01-02,40.1
2,2016-01-03,47.19
3,2016-01-04,34.0
4,2016-01-05,175.24


In [66]:
preci_by_date['EventDate'] = pd.to_datetime(preci_by_date['EventDate'])
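An alternative that keeps `EventDate` as a datetime column throughout, so that no conversion back with `pd.to_datetime` is needed, is to truncate the timestamp with `dt.floor('D')` before grouping. This is a sketch on invented rows (`events` is toy data), not the notebook's own code.

```python
import pandas as pd

# Toy events (invented): two on the first day, one on the second
events = pd.DataFrame({
    "StartTime(UTC)": pd.to_datetime([
        "2016-01-01 03:00:00",
        "2016-01-01 18:30:00",
        "2016-01-02 07:15:00",
    ]),
    "Precipitation(in)": [0.10, 0.25, 0.05],
})

daily = (
    events
    .assign(EventDate=events["StartTime(UTC)"].dt.floor("D"))  # midnight of the start day
    .groupby("EventDate", as_index=False)["Precipitation(in)"]
    .sum()
)
print(daily)
print(daily.dtypes)
```

Because `EventDate` stays a datetime column, it can be passed to Altair with the `:T` type directly.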

In [67]:
title = alt.Title('Total precipitation in the country per day', fontSize=15)

ts_chart = alt.Chart(preci_by_date, title=title).mark_line().encode(
    x=alt.X('EventDate:T', title='Day'),
    y=alt.Y(
        'Precipitation(in):Q',
        title='Precipitation (inches)'),
    color=alt.value('navy')
).properties(
    width=600,
    height=300
)
p = ts_chart
p.save('/Users/administrador/Documents/Pessoal/repositorios/information_visualization/images/12.precipitation_day.png', ppi=300)

Seasonal charts

In [68]:
preci_by_month = X.groupby(['StartMonth', 'StartYear']).agg(aggregation).reset_index()

In [87]:
title = alt.Title('Precipitation seasonality by month of the year')

base = alt.Chart(preci_by_month, title=title).encode(
    x=alt.X(
        'StartMonth:N', 
        axis=alt.Axis(labelAngle=0),
        title='Month of the year'),
    y=alt.Y('Precipitation(in):Q', title='Precipitation (inches)'),
    color='StartYear:N'
).properties(
    width=600,
    height=300
)

p = base.mark_line()
p.save('/Users/administrador/Documents/Pessoal/repositorios/information_visualization/images/13.precipitation_seasonality_month.png', ppi=300)


In [91]:
title = alt.Title('Precipitation trend by month of the year')
base = alt.Chart(preci_by_month).encode(
    x=alt.X(
        'StartYear:N',
        axis=alt.Axis(labelAngle=-90),
        scale=alt.Scale(nice={'interval': 'year', 'step': 2}),
        title='Year'),
    y=alt.Y('Precipitation(in):Q', title='Precipitation (inches)')
).properties(
    width=50,
    height=400
)

# Attach the title to the faceted chart (it was defined but never used)
p = base.mark_line().facet(column='StartMonth').properties(title=title)
p.save('/Users/administrador/Documents/Pessoal/repositorios/information_visualization/images/14.precipitation_trend_month.png', ppi=300)

Month

In [71]:
X['StartYear'] == 2016

0           True
1           True
2           True
3           True
4           True
           ...  
7479160    False
7479161    False
7479162    False
7479163    False
7479164    False
Name: StartYear, Length: 7479051, dtype: bool

In [72]:
preci_by_day = X[X['StartYear'] == 2016].groupby(['StartDay', 'StartMonth']).agg(aggregation).reset_index()

In [93]:
title = alt.Title('Total precipitation by day of the month in 2016')

base = alt.Chart(preci_by_day, title=title).encode(
    x=alt.X(
        'StartDay:N', 
        axis=alt.Axis(labelAngle=0),
        title='Day of the month'),
    y=alt.Y('Precipitation(in):Q', title='Precipitation (inches)'),
    color=alt.Color('StartMonth:N').title('Month')
).properties(
    width=600,
    height=300
)

p = base.mark_line()
p.save('/Users/administrador/Documents/Pessoal/repositorios/information_visualization/images/15.preci_seasonality_day_2016.png', ppi=300)


In [74]:
teste_scale = X[['Precipitation(in)', 'StartTime(UTC)']].copy()

In [75]:
from sklearn.preprocessing import MinMaxScaler

In [76]:
scaler = MinMaxScaler()
preci_scaled = scaler.fit_transform(teste_scale[['Precipitation(in)']])

In [77]:
teste_scale['Precipitation(in)'] = preci_scaled

In [78]:
preci_scaled.min()

np.float64(0.0)

In [79]:
preci_scaled.max()

np.float64(1.0)
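Min-max scaling maps each value to (x - min) / (max - min). Because the maximum here (about 1,104 inches) is far above typical values, almost all scaled values end up very close to 0. The sketch below reproduces the transformation with NumPy on invented values and, as an assumption of ours rather than something the notebook does, compares it with a `log1p` transform that compresses the long tail instead.

```python
import numpy as np

# Invented values with one extreme outlier, mimicking the skew seen above
precip = np.array([0.0, 0.0, 0.05, 0.3, 2.0, 1104.13])

# Same formula MinMaxScaler applies with its default feature_range=(0, 1)
scaled = (precip - precip.min()) / (precip.max() - precip.min())

# Alternative (assumption): log(1 + x) keeps small values distinguishable
logged = np.log1p(precip)

print(scaled.round(5))
print(logged.round(3))
```

Even a reading of 2 inches lands below 0.002 after min-max scaling, which is why the scaled histogram looks much like the unscaled one.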

In [90]:
title = alt.TitleParams('Histogram of scaled precipitation', anchor='middle')
base = alt.Chart(teste_scale, title=title).encode(
    alt.X(
        'Precipitation(in)',
        bin=alt.Bin(maxbins=50),  # bin the values so the bars form a histogram
        title='Scaled precipitation (min-max)'),
    alt.Y('count()', title='Count').scale(type='log')
).properties(
    width=500,
    height=400
)
p = base.mark_bar()  # no text layer: this chart defines no text encoding
p.save('/Users/administrador/Documents/Pessoal/repositorios/information_visualization/images/15.preci_min_max.png', ppi=300)

In [81]:
X_na = X[X.isnull().any(axis=1)][['LocationLat', 'LocationLng']].drop_duplicates()

In [82]:
X.isnull().sum()

EventId                  0
Type                     0
Severity                 0
StartTime(UTC)           0
EndTime(UTC)             0
Precipitation(in)        0
TimeZone                 0
AirportCode              0
LocationLat              0
LocationLng              0
City                 14563
County                   0
State                    0
ZipCode              59234
EventDuration            0
EventDate                0
StartDay                 0
StartMonth               0
StartYear                0
EndDay                   0
EndMonth                 0
EndYear                  0
dtype: int64

In [83]:
X[X.isnull().any(axis=1)][['LocationLat', 'LocationLng', 'City', 'ZipCode']].drop_duplicates()

Unnamed: 0,LocationLat,LocationLng,City,ZipCode
47309,35.0222,-76.4625,Davis,
194050,33.2338,-119.4559,,
729442,38.7578,-104.3013,,
961214,40.7187,-114.0309,Wendover,
995177,26.9688,-99.2489,Zapata,
1766339,35.889,-101.03,Miami,
2364292,47.4542,-115.6697,Mullan,
2936187,44.2708,-71.3035,Sargents,
3327594,44.6629,-104.5678,,
4406000,31.346,-85.6543,,


In [84]:
X_na.merge(X[~X.isnull().any(axis=1)][['LocationLat', 'LocationLng', 'City', 'ZipCode']], on=['LocationLat', 'LocationLng'])

Unnamed: 0,LocationLat,LocationLng,City,ZipCode


In [85]:
X[(X['LocationLat'] == 33.2338) & (X['LocationLng'] == -119.4559)]['City'].unique()

array([None], dtype=object)

In [86]:
X[X['City'] == 'Davis']['ZipCode'].unique()

array([nan, 95616.0], dtype=object)
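The last result shows why missing ZIP codes cannot safely be filled from the city name alone: 95616 is the ZIP code of Davis, California, while the unmatched Davis station at (35.0222, -76.4625) lies on the East Coast. A hedged sketch of a safer fill, grouping by both `City` and `State`, is shown on toy rows below; the state labels and the Saguache row with a missing ZIP code are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "City": ["Davis", "Davis", "Saguache", "Saguache"],
    "State": ["CA", "NC", "CO", "CO"],               # state labels are assumptions
    "ZipCode": [95616.0, np.nan, 81149.0, np.nan],   # last NaN invented for the demo
})

# Fill a missing ZIP code only from rows that share both city and state;
# "first" returns the first non-null value in each group
known_zip = toy.groupby(["City", "State"])["ZipCode"].transform("first")
toy["ZipCodeFilled"] = toy["ZipCode"].fillna(known_zip)
print(toy)
```

The second Davis row stays empty because no other record with the same city and state provides a ZIP code, which is the desired behavior here.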