# Did It Rain In Seattle? (1948 - 2017): Exploratory Data Analysis 

Besides coffee, grunge, and technology companies, Seattle is perhaps most famous for how often it rains. This dataset was aggregated from publicly available NOAA (National Oceanic and Atmospheric Administration) data for the Seattle-Tacoma International Airport and has the following columns:
* DATE - The date of the observation
* PRCP - Precipitation level (in inches)
* TMAX - Maximum temperature (in F)
* TMIN - Minimum temperature (in F)
* RAIN - Whether it rained (True/False)

This dataset contains daily weather records from January 1st, 1948 to December 12, 2017; a handful of rows have missing precipitation values. This notebook contains high-level visualizations aimed at exploratory data analysis of nearly 70 years of Seattle weather data.

In [1]:
import numpy as np
import pandas as pd

import plotly.express as px

from calendar import month_abbr

In [2]:
# Read in data from file
seattle_weather_df = pd.read_csv('/kaggle/input/did-it-rain-in-seattle-19482017/seattleWeather_1948-2017.csv')
seattle_weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25551 entries, 0 to 25550
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   DATE    25551 non-null  object 
 1   PRCP    25548 non-null  float64
 2   TMAX    25551 non-null  int64  
 3   TMIN    25551 non-null  int64  
 4   RAIN    25548 non-null  object 
dtypes: float64(1), int64(2), object(2)
memory usage: 998.2+ KB


In [3]:
# Drop rows with null values
seattle_weather_df.dropna(inplace = True)
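
Before dropping rows it can help to look at what is about to be removed. The `info()` output shows 3 missing values in both `PRCP` and `RAIN`. Below is a minimal sketch of that check on a small made-up frame; `sample_df` and its values are hypothetical stand-ins, not rows from the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for seattle_weather_df with one incomplete row
sample_df = pd.DataFrame({
    'DATE': ['1998-06-01', '1998-06-02', '1998-06-03'],
    'PRCP': [0.00, np.nan, 0.12],
    'TMAX': [70, 72, 65],
    'TMIN': [50, 52, 49],
    'RAIN': [False, np.nan, True],
})

# Count missing values per column
null_counts = sample_df.isna().sum()

# Select the rows that dropna() would remove
null_rows = sample_df[sample_df.isna().any(axis=1)]

# Drop them, returning a new frame instead of modifying in place
cleaned_df = sample_df.dropna()

print(null_counts)
print(null_rows)
```

With only a few incomplete rows out of more than 25,000, dropping them is unlikely to affect the summary plots.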

**Correlation Matrix**
* Minimum and maximum temperature are strongly positively correlated (coefficient of about 0.9).
* Whether it rained and the precipitation level are moderately positively correlated (about 0.5).
* Maximum temperature and precipitation are negatively correlated (about -0.4).
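
The coefficients quoted here are Pearson correlations, which is what `DataFrame.corr()` computes by default. As a rough illustration of what that number means, the sketch below computes it by hand on synthetic temperatures and compares it with pandas. The `toy_df` frame and its random values are assumptions made for the example, not the Seattle data.

```python
import numpy as np
import pandas as pd

# Synthetic temperatures: TMAX is TMIN plus a noisy offset, so the two move together
rng = np.random.default_rng(0)
tmin = rng.normal(45, 9, size=500)
tmax = tmin + rng.normal(15, 4, size=500)
toy_df = pd.DataFrame({'TMIN': tmin, 'TMAX': tmax})

# Pearson r by definition: covariance divided by the product of standard deviations
dev_min = toy_df['TMIN'] - toy_df['TMIN'].mean()
dev_max = toy_df['TMAX'] - toy_df['TMAX'].mean()
cov = (dev_min * dev_max).mean()
r_manual = cov / (toy_df['TMIN'].std(ddof=0) * toy_df['TMAX'].std(ddof=0))

# The same quantity from pandas
r_pandas = toy_df['TMIN'].corr(toy_df['TMAX'])

print(round(r_manual, 4), round(r_pandas, 4))
```

A value near 1 means the two columns rise and fall together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.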

In [4]:
# Is there correlation between any of the features?
corr = seattle_weather_df.drop(columns = ['DATE']).corr().round(1)  

# Hide the upper triangle (including the diagonal) so each pair appears only once
mask = np.zeros_like(corr, dtype = bool)
mask[np.triu_indices_from(mask)] = True
corr = corr.mask(mask).dropna(how = 'all')

fig = px.imshow(corr, text_auto = True, title = "Correlation Matrix", template = 'none',
                color_continuous_scale = px.colors.qualitative.G10[0:2], height = 350)

fig.show()

In [5]:
# Data pre-processing for more advanced plotting 

# Time features - adding decade and month
seattle_weather_df['DECADE'] = seattle_weather_df['DATE'].apply(lambda x: int(x[0:3] + '0'))
seattle_weather_df['YEAR'] = pd.to_datetime(seattle_weather_df.DATE).dt.year
seattle_weather_df['MONTH'] = pd.to_datetime(seattle_weather_df.DATE).dt.month

# Temperature features - adding temperature average and range measurements
seattle_weather_df['TAVG'] = np.divide(np.add(seattle_weather_df['TMAX'], seattle_weather_df['TMIN']), 2)
# Note: TRANGE is half of the daily spread, i.e. (TMAX - TMIN) / 2
seattle_weather_df['TRANGE'] = np.divide(np.subtract(seattle_weather_df['TMAX'], seattle_weather_df['TMIN']), 2)

# Rain features - cleaning features
# RAIN becomes a 0/1 flag; PRCP is rescaled from inches to hundredths of an inch
seattle_weather_df['RAIN'] = seattle_weather_df['RAIN'].apply(lambda d: 1 if d == True else 0 )
seattle_weather_df['PRCP'] = seattle_weather_df['PRCP'].apply(lambda p: round(p * 100, 2))

# Summarize new shape of data
seattle_weather_df.describe().round(1).T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| PRCP | 25548.0 | 10.6 | 23.9 | 0.0 | 0.0 | 0.0 | 10.0 | 502.0 |
| TMAX | 25548.0 | 59.5 | 12.8 | 4.0 | 50.0 | 58.0 | 69.0 | 103.0 |
| TMIN | 25548.0 | 44.5 | 8.9 | 0.0 | 38.0 | 45.0 | 52.0 | 71.0 |
| RAIN | 25548.0 | 0.4 | 0.5 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| DECADE | 25548.0 | 1978.0 | 20.4 | 1940.0 | 1960.0 | 1980.0 | 2000.0 | 2010.0 |
| YEAR | 25548.0 | 1982.5 | 20.2 | 1948.0 | 1965.0 | 1982.0 | 2000.0 | 2017.0 |
| MONTH | 25548.0 | 6.5 | 3.4 | 1.0 | 4.0 | 7.0 | 10.0 | 12.0 |
| TAVG | 25548.0 | 52.0 | 10.5 | 10.0 | 44.4 | 51.5 | 60.5 | 87.0 |
| TRANGE | 25548.0 | 7.5 | 3.4 | -17.5 | 5.0 | 7.0 | 9.5 | 21.0 |
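
The `DECADE` column above is built by slicing the date string. A possible alternative, sketched below on a few made-up dates, derives the decade from the parsed year with integer arithmetic, which does not depend on the string layout. The names `toy_dates`, `decade_from_year` and `decade_from_string` are hypothetical and exist only for this comparison.

```python
import pandas as pd

# A few example dates in the same YYYY-MM-DD layout as the DATE column
toy_dates = pd.Series(['1948-01-01', '1959-12-31', '2017-12-14'])

# Parse once, then floor the year to the start of its decade
parsed = pd.to_datetime(toy_dates)
decade_from_year = (parsed.dt.year // 10) * 10

# The string-slicing approach used in the notebook, for comparison
decade_from_string = toy_dates.apply(lambda x: int(x[0:3] + '0'))

print(decade_from_year.tolist())
print(decade_from_string.tolist())
```

Both approaches agree on these dates; the arithmetic version would also keep working if the date format changed, as long as `pd.to_datetime` can still parse it.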


**Yearly/Monthly Precipitation**
* There doesn't appear to be a trend in yearly precipitation, but a few years received far less rain than the others.
* Precipitation tends to be lower in the middle of the year (primarily the summer months).
* The least rain fell in 1955 and the most in 1950.
* Some of the higher precipitation levels may be due to weather events such as flooding, heavy rain or snowfall. For example:
    * On October 20, 2003, Seattle received about 5 inches of rain in a single day.
    * Seattle flooded in November 2006 after several days of heavy rainfall.
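
Claims such as the wettest and driest year, or the single wettest day, can be checked directly with a group-by rather than read off a chart. The sketch below shows the idea on a tiny hypothetical frame (`toy_df`, with invented values); on the real data the same calls would run against `seattle_weather_df`.

```python
import pandas as pd

# Hypothetical daily records; PRCP is in hundredths of an inch
toy_df = pd.DataFrame({
    'DATE': ['1950-01-01', '1950-01-02', '1951-01-01',
             '1951-01-02', '1952-01-01', '1952-01-02'],
    'YEAR': [1950, 1950, 1951, 1951, 1952, 1952],
    'PRCP': [30.0, 12.0, 5.0, 20.0, 0.0, 8.0],
})

# Total precipitation per year
yearly_total = toy_df.groupby('YEAR')['PRCP'].sum()

# Index labels (years) of the largest and smallest totals
wettest_year = yearly_total.idxmax()
driest_year = yearly_total.idxmin()

# The single day with the most precipitation
wettest_day = toy_df.loc[toy_df['PRCP'].idxmax(), 'DATE']

print(wettest_year, driest_year, wettest_day)
```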

In [6]:
# Map month numbers to abbreviations (index 0 of month_abbr is an empty string)
month_dict = dict(zip(list(range(13)), month_abbr))

# Bars stack the daily PRCP values, which In [5] rescaled to hundredths of an inch
fig1 = px.bar(seattle_weather_df, x = 'MONTH', y = 'PRCP', hover_data = ['YEAR'],
              template = 'none', title = "Monthly Precipitation", color_discrete_sequence = px.colors.qualitative.G10)
fig2 = px.bar(seattle_weather_df, x = 'YEAR', y = 'PRCP', hover_data = ['MONTH'],
              template = 'none', title = "Yearly Precipitation", color_discrete_sequence = px.colors.qualitative.G10)

fig2.update_layout(yaxis = dict(title = 'PRCP (hundredths of an inch)'))
fig1.update_layout(xaxis = dict(title = 'MONTH', labelalias = month_dict),
                   yaxis = dict(title = 'PRCP (hundredths of an inch)'))

fig1.show()
fig2.show()

**Box Chart Temperature Distribution**

As expected, higher temperatures occur in summer months and decrease in fall and winter months. Thus, it may be helpful to plot how monthly temperature changes over the year to get a more granular view of the data.

* Outliers appear in both the warmer and the cooler months.
* There is a clear temperature increase from each 30-year period to the next. The exception appears to be December, in which the 1980s box plot has a lower median and a larger range than its counterparts.
* January 1950 appears to be the coldest month in the dataset and August 1981 the hottest.

**NOTE:** The 30-year gap for the plot was chosen because the dataset covers only 2 years of the 1940s (1948 and 1949) and 8 years of the 2010s (2010 to 2017). This leaves 6 complete decades (1950s to 2000s) to split and examine. Because the plot compares the 1950s, 1980s and 2010s, the 2010s boxes rest on fewer data points than the other two.
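
How many calendar years fall in each decade can be counted directly. The sketch below rebuilds the date span stated in the introduction (January 1, 1948 to December 12, 2017) with `pd.date_range` instead of reading the file; `span_df` and `years_per_decade` are names made up for this example.

```python
import pandas as pd

# Recreate the daily date span described in the introduction (an assumption, not the file itself)
dates = pd.date_range('1948-01-01', '2017-12-12', freq='D')

span_df = pd.DataFrame({'YEAR': dates.year})
span_df['DECADE'] = (span_df['YEAR'] // 10) * 10

# Number of distinct calendar years observed in each decade
years_per_decade = span_df.groupby('DECADE')['YEAR'].nunique()

print(years_per_decade)
```

The first and last decades are partial, while the 1950s through the 2000s each contain 10 years.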

In [7]:
# Plot box chart
fig2 = px.box(seattle_weather_df[seattle_weather_df.DECADE.isin([1950, 1980, 2010])],
              x = 'MONTH', y = 'TAVG', hover_data = ['YEAR'], color = 'DECADE', template = 'none',
              color_discrete_sequence = px.colors.qualitative.G10[3:6])

fig2.update_layout(title = "Monthly Temperature: 1950s vs 1980s vs 2010s",
                   xaxis = dict(title = 'MONTH', labelalias = month_dict), 
                   yaxis = dict(title = 'TEMP (F)'))

fig2.show()

In [8]:
# Get yearly averages of weather data
yearly_weather_df = seattle_weather_df.groupby('YEAR').mean(numeric_only = True)
yearly_weather_df = yearly_weather_df.round(2)

yearly_weather_df['DECADE'] = yearly_weather_df['DECADE'].astype('int')
yearly_weather_df.reset_index('YEAR', inplace = True)

# Show yearly_weather_df data
yearly_weather_df.head()

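| | YEAR | PRCP | TMAX | TMIN | RAIN | DECADE | MONTH | TAVG | TRANGE |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1948 | 12.51 | 57.01 | 41.2 | 0.48 | 1940 | 6.51 | 49.11 | 7.91 |
| 1 | 1949 | 8.89 | 59.15 | 41.39 | 0.38 | 1940 | 6.53 | 50.27 | 8.88 |
| 2 | 1950 | 15.11 | 57.04 | 41.0 | 0.53 | 1950 | 6.53 | 49.02 | 8.02 |
| 3 | 1951 | 11.04 | 58.55 | 41.05 | 0.41 | 1950 | 6.53 | 49.8 | 8.75 |
| 4 | 1952 | 6.5 | 58.74 | 41.47 | 0.38 | 1950 | 6.51 | 50.11 | 8.64 |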


**Max/Min Temperature Plot**
* Minimum and maximum temperature, as seen previously, are **highly** positively correlated.
* Earlier years are colder on average, and later years are hotter on average.
* A couple of outlier years don't seem to match the positive trend between year and temperature:
  * 1958 & 1967 - hotter than expected
  * 1985 - colder than expected
* The coldest year appears to be 1955, and the hottest is 2015.
* Weather extremes (or outliers) align with [well-documented weather events](https://en.wikipedia.org/wiki/Climate_of_Seattle#Precipitation) such as blizzards or floods.
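
One way to make "hotter or colder than expected" concrete is to fit a straight line of temperature against year and flag the years whose residual is unusually large. The sketch below does this with `np.polyfit` on synthetic yearly averages. The data, the injected cold year and the two-standard-deviation threshold are all assumptions chosen for illustration, not results from the Seattle data.

```python
import numpy as np
import pandas as pd

# Synthetic yearly averages with a mild warming trend plus noise
years = np.arange(1948, 2018)
rng = np.random.default_rng(1)
tavg = 50 + 0.03 * (years - 1948) + rng.normal(0, 0.3, size=years.size)
tavg[years == 1985] -= 3.0  # inject one unusually cold year
toy_yearly = pd.DataFrame({'YEAR': years, 'TAVG': tavg})

# Ordinary least-squares line: TAVG ~ slope * YEAR + intercept
slope, intercept = np.polyfit(toy_yearly['YEAR'], toy_yearly['TAVG'], deg=1)

# Residual = observed minus fitted value
toy_yearly['RESIDUAL'] = toy_yearly['TAVG'] - (slope * toy_yearly['YEAR'] + intercept)

# Flag years more than two residual standard deviations from the line
threshold = 2 * toy_yearly['RESIDUAL'].std()
flagged = toy_yearly[toy_yearly['RESIDUAL'].abs() > threshold]

print(flagged[['YEAR', 'RESIDUAL']])
```

A negative residual marks a year colder than the trend suggests, and a positive one marks a hotter year.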

In [9]:
# Plot bubble chart: bubble size is mean daily precipitation, color is year
fig5 = px.scatter(yearly_weather_df, x = 'TMAX', y = 'TMIN', trendline = 'ols', template = 'none',
                  title = "Min & Max Temperature, 1948 - 2017", size = 'PRCP', color = 'YEAR',
                  color_continuous_scale = px.colors.qualitative.G10[5:7])

# Axis titles follow the plotted columns (TMAX on x, TMIN on y)
fig5.update_layout(xaxis = dict(title = 'TMAX (F)'), yaxis = dict(title = 'TMIN (F)'))

fig5.show()

**Temperature Time Series**
* Temperature has a positive trend line, indicating generally rising temperatures.
* Temperature variability is largest roughly between 1948 and 1953. This is in line with some of the harsher temperatures seen at the beginning of the dataset's time period.
* There is also high temperature variability between 2011 and 2015.

**Note:** The calculation 'TRANGE' helps quantify the [temperature variability](https://iaspoint.com/annual-temperature-range/) within a geographic area. In this notebook it is derived from each day's maximum and minimum temperature, so its yearly mean reflects how far temperatures swing within a day. By comparison, a place with a high annual temperature range experiences more extreme fluctuations between seasons, while a smaller range indicates more consistent temperatures throughout the year.
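
To illustrate the difference between a day-level range and an annual range, the sketch below computes both on synthetic data with a smooth seasonal cycle. Here the annual range is taken as the warmest monthly mean minus the coldest monthly mean, which is one common convention and an assumption of this example. `toy_df` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Two synthetic years with a sinusoidal seasonal cycle around 52 F
dates = pd.date_range('2000-01-01', '2001-12-31', freq='D')
day_of_year = dates.dayofyear.to_numpy()
seasonal = 52 + 13 * np.sin(2 * np.pi * (day_of_year - 110) / 365.25)

toy_df = pd.DataFrame({
    'YEAR': dates.year,
    'MONTH': dates.month,
    'TMAX': seasonal + 7.5,
    'TMIN': seasonal - 7.5,
})
toy_df['TAVG'] = (toy_df['TMAX'] + toy_df['TMIN']) / 2

# Day-level range: spread between each day's maximum and minimum, averaged per year
toy_df['DAILY_RANGE'] = toy_df['TMAX'] - toy_df['TMIN']
mean_daily_range = toy_df.groupby('YEAR')['DAILY_RANGE'].mean()

# Annual range: warmest monthly mean minus coldest monthly mean, per year
monthly_mean = toy_df.groupby(['YEAR', 'MONTH'])['TAVG'].mean()
annual_range = (monthly_mean.groupby(level='YEAR').max()
                - monthly_mean.groupby(level='YEAR').min())

print(mean_daily_range.round(2))
print(annual_range.round(2))
```

The two measures answer different questions: the first describes swings within a day, the second describes the contrast between seasons.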

In [10]:
# Plot two line charts: yearly average temperature and yearly average temperature range
fig3 = px.scatter(yearly_weather_df, x = 'YEAR', y = 'TAVG', trendline = 'ols', template = 'none',
                  color_discrete_sequence = px.colors.qualitative.G10,
                  title = "Temperature Trend")
# Yearly mean of TRANGE with a LOWESS trend line
fig4 = px.scatter(yearly_weather_df, x = 'YEAR', y = 'TRANGE', trendline = 'lowess', template = 'none',
                  color_discrete_sequence = px.colors.qualitative.G10,
                  labels = {'TRANGE': 'RANGE (F)'}, title = "Temperature Range Trend")

# Recolor the trend line (the last trace) and draw both traces as lines
fig3.data[-1].line.color = px.colors.qualitative.G10[1]
fig3.update_layout(yaxis = dict(title = 'TEMP (F)'))
fig3.update_traces(mode = 'lines')

fig4.data[-1].line.color = px.colors.qualitative.G10[1]
fig4.update_layout(yaxis = dict(title = 'RANGE (F)'))
fig4.update_traces(mode = 'lines')

fig3.show()
fig4.show()