# Did It Rain In Seattle? (1948 - 2017): Exploratory Data Analysis (EDA)

Besides coffee, grunge and technology companies, one of the things that Seattle is most famous for is how often it rains. This dataset was aggregated from publicly available data on the NOAA (National Oceanic and Atmospheric Administration) website from the Seattle-Tacoma International Airport (SEA), with the following columns: 
* DATE - The date of the observation 
* RAIN - Whether it rained (True/False)
* TMIN - Minimum temperature (in F) 
* TMAX - Maximum temperature (in F) 
* PRCP - Precipitation level (in inches) 

This dataset contains daily weather records from January 1st, 1948 to December 12, 2017; only three rows are missing precipitation values. This notebook contains high-level visualizations aimed at exploratory data analysis of nearly 70 years of Seattle weather data.

## Table of Contents

1. [Correlation Matrix](#section1)
2. [Max/Min Temperature Plot](#section2)
3. [Yearly/Monthly Precipitation Plot](#section3)
4. [Box Chart Temperature Distribution](#section4)
5. [Temperature Time Series](#section5)
6. [Closing Thoughts](#section6)


In [1]:
import numpy as np
import pandas as pd

import plotly.express as px

import plotly.io as pio
pio.renderers.default = 'iframe'

from calendar import month_abbr

In [2]:
# Read in data from file
seattle_weather_df = pd.read_csv('/kaggle/input/did-it-rain-in-seattle-19482017/seattleWeather_1948-2017.csv')
seattle_weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25551 entries, 0 to 25550
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   DATE    25551 non-null  object 
 1   PRCP    25548 non-null  float64
 2   TMAX    25551 non-null  int64  
 3   TMIN    25551 non-null  int64  
 4   RAIN    25548 non-null  object 
dtypes: float64(1), int64(2), object(2)
memory usage: 998.2+ KB


In [3]:
# Drop rows with null values
seattle_weather_df.dropna(inplace = True)
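
Before dropping rows, it can help to see which columns and rows actually contain nulls. The following is a minimal sketch on a hypothetical toy frame; `toy_df` and its values are made up for illustration and are not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for seattle_weather_df (made-up values)
toy_df = pd.DataFrame({
    'DATE': ['2000-01-01', '2000-01-02', '2000-01-03'],
    'PRCP': [0.10, np.nan, 0.00],
    'RAIN': [True, np.nan, False],
})

# Count missing values per column
null_counts = toy_df.isna().sum()

# Keep the rows that contain at least one null so they can be inspected
null_rows = toy_df[toy_df.isna().any(axis = 1)]

# dropna() without inplace returns a new frame and leaves toy_df untouched
clean_df = toy_df.dropna()
```

Inspecting `null_rows` first makes it easier to judge whether dropping is reasonable or whether the gaps cluster around particular dates.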

### Correlation Matrix <a class="anchor"  id="section1"></a>

Right away we can see that minimum and maximum temperature are **highly** correlated. On the other hand, maximum temperature and precipitation are **somewhat negatively** correlated. Rain and precipitation, of course, are inherently intertwined.

In [4]:
# Is there correlation between any of the features?
corr = seattle_weather_df.drop(columns = ['DATE']).corr().round(1)  

mask = np.zeros_like(corr, dtype = bool)
mask[np.triu_indices_from(mask)] = True
corr = corr.mask(mask).dropna(how = 'all')

# Plot correlation matrix
fig = px.imshow(corr, text_auto = True, title = "Correlation Matrix", template = 'none',
                color_continuous_scale = px.colors.qualitative.G10[0:2], height = 350)

fig.show()
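
The masking step in the cell above hides the diagonal and the upper triangle so that each pair of features appears only once. Here is a small sketch of the same idea on a hypothetical three-column frame (`toy_df`, with made-up values), which may make the effect easier to see:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: B rises with A, C falls as A rises
toy_df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 4, 6, 8], 'C': [4, 3, 2, 1]})
toy_corr = toy_df.corr().round(1)

# True on the diagonal and above it, False below
toy_mask = np.zeros_like(toy_corr, dtype = bool)
toy_mask[np.triu_indices_from(toy_mask)] = True

# Masked cells become NaN; the first row is then entirely NaN and is dropped
lower = toy_corr.mask(toy_mask).dropna(how = 'all')
```

Note that `dropna(how = 'all')` only removes the all-NaN row; the last column is also all NaN but is kept, which is why the plotted matrix has one empty column.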

In [5]:
# Data pre-processing

# Time features - adding decade, year, and month
seattle_weather_df['DECADE'] = seattle_weather_df.DATE.apply(lambda x: int(x[0:3] + '0'))
seattle_weather_df['YEAR'] = pd.to_datetime(seattle_weather_df.DATE).dt.year
seattle_weather_df['MONTH'] = pd.to_datetime(seattle_weather_df.DATE).dt.month

# Temperature features - adding temperature average measurement
seattle_weather_df['TAVG'] = np.divide(np.add(seattle_weather_df.TMAX, seattle_weather_df.TMIN), 2)

# Rain features - encode RAIN as 1/0 and rescale PRCP from inches to hundredths of an inch
seattle_weather_df['RAIN'] = seattle_weather_df.RAIN.apply(lambda d: 1 if d == True else 0)
seattle_weather_df['PRCP'] = seattle_weather_df.PRCP.apply(lambda p: round(p * 100, 2))

# Summarize new shape of data
seattle_weather_df.describe().round(1).T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| PRCP | 25548.0 | 10.6 | 23.9 | 0.0 | 0.0 | 0.0 | 10.0 | 502.0 |
| TMAX | 25548.0 | 59.5 | 12.8 | 4.0 | 50.0 | 58.0 | 69.0 | 103.0 |
| TMIN | 25548.0 | 44.5 | 8.9 | 0.0 | 38.0 | 45.0 | 52.0 | 71.0 |
| RAIN | 25548.0 | 0.4 | 0.5 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| DECADE | 25548.0 | 1978.0 | 20.4 | 1940.0 | 1960.0 | 1980.0 | 2000.0 | 2010.0 |
| YEAR | 25548.0 | 1982.5 | 20.2 | 1948.0 | 1965.0 | 1982.0 | 2000.0 | 2017.0 |
| MONTH | 25548.0 | 6.5 | 3.4 | 1.0 | 4.0 | 7.0 | 10.0 | 12.0 |
| TAVG | 25548.0 | 52.0 | 10.5 | 10.0 | 44.4 | 51.5 | 60.5 | 87.0 |


### Max/Min Temperature Plot <a class="anchor"  id="section2"></a>

Earlier years are colder on average, and later years are hotter on average. The coldest year appears to be 1955, and the hottest 2015. A few outlier years don't match the positive trend between year and temperature:
* 1985 - colder than expected
* 1958 & 1967 - hotter than expected

Finally, weather extremes (or outliers) align with [well-documented weather events](https://en.wikipedia.org/wiki/Climate_of_Seattle#Precipitation) such as blizzards or floods.

In [6]:
# Get yearly averages of weather data
yearly_weather_df = seattle_weather_df.groupby('YEAR').mean(numeric_only = True)
yearly_weather_df = yearly_weather_df.round(2)

yearly_weather_df.DECADE = yearly_weather_df.DECADE.astype('int')
yearly_weather_df.reset_index('YEAR', inplace = True)

# Show yearly_weather_df data
yearly_weather_df.head()

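| | YEAR | PRCP | TMAX | TMIN | RAIN | DECADE | MONTH | TAVG |
|---|---|---|---|---|---|---|---|---|
| 0 | 1948 | 12.51 | 57.01 | 41.2 | 0.48 | 1940 | 6.51 | 49.11 |
| 1 | 1949 | 8.89 | 59.15 | 41.39 | 0.38 | 1940 | 6.53 | 50.27 |
| 2 | 1950 | 15.11 | 57.04 | 41.0 | 0.53 | 1950 | 6.53 | 49.02 |
| 3 | 1951 | 11.04 | 58.55 | 41.05 | 0.41 | 1950 | 6.53 | 49.8 |
| 4 | 1952 | 6.5 | 58.74 | 41.47 | 0.38 | 1950 | 6.51 | 50.11 |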


In [7]:
# Plot bubble chart max/min temperature trend
fig1 = px.scatter(yearly_weather_df, x = 'TMAX', y = 'TMIN', trendline = 'ols', template = 'none',
            title = "Annual Average Min & Max Temperature, 1948 - 2017", size = 'PRCP', color = 'YEAR', 
                  color_continuous_scale = px.colors.qualitative.G10[5:7])

# Axis titles must match the plotted columns: x is TMAX, y is TMIN
fig1.update_layout(xaxis = dict(title = 'TMAX (F)'), yaxis = dict(title = 'TMIN (F)'))

fig1.show()

### Yearly/Monthly Precipitation Plot <a class="anchor"  id="section3"></a>
Yearly precipitation shows no clear trend, although a few years are much drier than the rest. Precipitation also tends to be lower in the middle of the year (primarily the summer months).

Some of the higher precipitation levels may be due to weather events such as significant flooding, heavy rain or snowfall. For example:
* On October 20, 2003, Seattle received about 5 inches of rainfall in a single day.
* Seattle flooded in November of 2006 due to several days of heavy rainfall.
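
Days like these can be located by sorting on precipitation. This is a minimal sketch, assuming a frame shaped like `seattle_weather_df` after pre-processing (where `PRCP` has been multiplied by 100); `toy_df`, `PRCP_IN`, and all values are hypothetical:

```python
import pandas as pd

# Hypothetical toy frame; PRCP is in hundredths of an inch (made-up values)
toy_df = pd.DataFrame({
    'DATE': ['2001-01-01', '2001-01-02', '2001-01-03', '2001-01-04'],
    'PRCP': [12.0, 310.0, 0.0, 45.0],
})

# Two wettest days, largest first
wettest_days = toy_df.nlargest(2, 'PRCP')

# Convert back to inches for easier comparison with published records
wettest_days = wettest_days.assign(PRCP_IN = wettest_days.PRCP / 100)
```

The resulting dates can then be checked against documented weather events.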

In [8]:
# Get monthly sums of precipitation data
monthly_prcp = seattle_weather_df.groupby('MONTH').PRCP.sum()
monthly_prcp.head()

MONTH
1    39723.0
2    28290.0
3    27768.0
4    18586.0
5    12695.0
Name: PRCP, dtype: float64

In [9]:
# Retrieve months of year for plot x-axis labels
month_dict = dict(zip(list(range(13)), month_abbr))

# Plot total monthly precipitation (PRCP was rescaled to hundredths of an inch above)
fig2 = px.bar(monthly_prcp, x = monthly_prcp.index, y = 'PRCP', template = 'none', 
              title = "Monthly Precipitation Sum", color_discrete_sequence = px.colors.qualitative.G10)

fig2.update_layout(xaxis = dict(title = 'MONTH', labelalias = month_dict), yaxis = dict(title = 'PRCP (hundredths of an inch)'))

fig2.show()
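
Raw sums are hard to compare at a glance, so one option is to express each month as a share of the total. The sketch below assumes a Series shaped like `monthly_prcp`; `toy_monthly` and its values are hypothetical and cover only four months for brevity:

```python
import pandas as pd

# Hypothetical monthly sums (made-up values), indexed by month number
toy_monthly = pd.Series(
    [400.0, 300.0, 100.0, 200.0],
    index = pd.Index([1, 2, 7, 12], name = 'MONTH'),
    name = 'PRCP',
)

# Each month's percentage of the total precipitation
monthly_share = (toy_monthly / toy_monthly.sum() * 100).round(1)
```

Because shares are unit-free, the comparison does not depend on whether `PRCP` is stored in inches or hundredths of an inch.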

In [10]:
# Calculate yearly sums of precipitation data
yearly_prcp = seattle_weather_df.groupby('YEAR').PRCP.sum()
yearly_prcp.head()

YEAR
1948    4579.0
1949    3246.0
1950    5514.0
1951    4030.0
1952    2378.0
Name: PRCP, dtype: float64

In [11]:
# Plot total yearly precipitation (PRCP was rescaled to hundredths of an inch above)
fig3 = px.bar(yearly_prcp, x = yearly_prcp.index, y = 'PRCP', template = 'none', 
              title = "Annual Precipitation Sum", color_discrete_sequence = px.colors.qualitative.G10)

fig3.update_layout(yaxis = dict(title = 'PRCP (hundredths of an inch)'))

fig3.show()

### Box Chart Temperature Distribution <a class="anchor"  id="section4"></a>

As expected, higher temperatures occur in summer months and decrease in fall and winter months. Thus, it may be helpful to plot how monthly temperature changes over the year to get a more granular view of the data. Additionally, across the 30-year gaps, all months but December show a general positive trend in temperature.

**NOTE:** The 30-year gap for the plot was chosen because the 1940s decade of the dataset contains only 2 years (1948 - 1949) and the 2010s decade contains 8 years (2010 - 2017). This leaves 6 full decades to split and examine for a more balanced analysis.
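
The number of calendar years that fall in each decade can be checked directly from the date range. This is a small sketch that only assumes the dataset spans 1948 through 2017; the names `years`, `decades`, and `years_per_decade` are illustrative:

```python
import pandas as pd

# Calendar years covered by the dataset (1948 through 2017 inclusive)
years = pd.Series(range(1948, 2018), name = 'YEAR')

# Floor each year to the start of its decade, e.g. 1948 -> 1940
decades = (years // 10 * 10).rename('DECADE')

# Count the distinct years that fall in each decade
years_per_decade = years.groupby(decades).nunique()
```

On the real data, the same count could be obtained with `seattle_weather_df.groupby('DECADE').YEAR.nunique()`.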

In [12]:
# Filter data to observations dated within the 1950s, 1980s, and 2010s 
three_decades_df = seattle_weather_df[seattle_weather_df.DECADE.isin([1950, 1980, 2010])]

# Plot box chart of 30-year monthly temperature differences
fig4 = px.box(three_decades_df, x = 'MONTH', y = 'TAVG', hover_data = ['YEAR'], color = 'DECADE', 
              template = 'none', color_discrete_sequence = px.colors.qualitative.G10[3:6])

fig4.update_layout(title = "Decade Monthly Temperature: 1950s vs 1980s vs 2010s",
                   xaxis = dict(title = 'MONTH', labelalias = month_dict), 
                   yaxis = dict(title = 'TEMP (F)'))

fig4.show()

### Temperature Time Series <a class="anchor"  id="section5"></a>
As expected, the temperature trend is positive. In 1955 and 1985, the average temperature was especially low, with the opposite being true for 2015. While the annual temperature range can change by more than 10 degrees from one year to the next, its overall trend is about -4%, a very slight negative trend.

**Note:** 'TVAR' is the annual temperature range, a measurement used by meteorologists, climatologists, and environmental scientists to understand the [variability of temperature over a year](https://iaspoint.com/annual-temperature-range/) within a specific geographic area. A place with a high annual temperature range experiences more extreme fluctuations between seasons, while a smaller range indicates more consistent temperatures throughout the year.

In [13]:
# Plot average temperature trend
fig5 = px.scatter(yearly_weather_df, x = 'YEAR', y = 'TAVG', trendline = 'ols', template = 'none',
                  color_discrete_sequence = px.colors.qualitative.G10, title = "Annual Temperature Trend")

fig5.data[-1].line.color = px.colors.qualitative.G10[1]
fig5.update_layout(yaxis = dict(title = 'TEMP (F)'))
fig5.update_traces(mode = 'lines')

fig5.show()

In [14]:
# Annual temperature range (TVAR): hottest daily high minus coldest daily low in each year
yearly_groups = seattle_weather_df.groupby('YEAR')
temp_var_df = (yearly_groups.TMAX.max() - yearly_groups.TMIN.min()).rename('TVAR')

temp_var_df.head()

YEAR
1948    72
1949    80
1950    86
1951    94
1952    82
Name: TVAR, dtype: int64

In [15]:
# Plot annual temperature range trend
fig6 = px.scatter(temp_var_df, x = temp_var_df.index, y = 'TVAR', trendline ='ols', template = 'none',
                  color_discrete_sequence = px.colors.qualitative.G10, labels = {'TVAR': 
                    'RANGE (F)'}, title = "Annual Temperature Range (ATR) Trend")

fig6.data[-1].line.color = px.colors.qualitative.G10[1]
fig6.update_layout(yaxis = dict(title = 'RANGE (F)'))
fig6.update_traces(mode = 'lines')

fig6.show()
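
One way to put a number on a trend like the ones plotted in this section is to fit a straight line and compare its two endpoints. The sketch below is an assumption about how such a figure could be derived, not necessarily how the percentage quoted above was computed; `toy_years` and `toy_range` are hypothetical, made-up values:

```python
import numpy as np

# Hypothetical yearly values (made up), not taken from the dataset
toy_years = np.array([2000, 2001, 2002, 2003, 2004])
toy_range = np.array([80.0, 78.0, 79.0, 77.0, 76.0])

# Degree-1 least-squares fit; polyfit returns the slope first, then the intercept
slope, intercept = np.polyfit(toy_years, toy_range, 1)

# Fitted values at the first and last year
fitted_start = slope * toy_years[0] + intercept
fitted_end = slope * toy_years[-1] + intercept

# Relative change along the fitted line, in percent
pct_change = (fitted_end - fitted_start) / fitted_start * 100
```

Measuring the change along the fitted line rather than between the raw first and last observations keeps a single unusual year from dominating the result.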

### Closing Thoughts <a class="anchor"  id="section6"></a>

Seattle is known for being rainy, and while that's true, the data show that most of the rain happens in the winter, with much less in the summer. This suggests that while the total amount of rain might seem constant, its timing is not. People living in Seattle could be experiencing a more concentrated wet season and drier summers than they did in the past. In fact, Seattle isn't even in the top 5 [rainiest U.S. cities](https://www.weatherstationadvisor.com/rainiest-city-in-the-us/)!

In summary, these discoveries don't just help us understand Seattle's climate; they also suggest taking a closer look at how these changing weather patterns may affect important parts of the city. Thus, this analysis could serve as a foundation for more advanced work. For example, this data could be used to create a predictive model that forecasts future weather patterns with greater accuracy. Or by linking weather data with information such as local crop yields, energy consumption, or public health incidents, we can move our analysis beyond simple description and start actively preparing for the future.