I began by downloading the dataset from Kaggle and loading it into Jupyter. The dataset is called 'weatherHistory.csv' and is available at [Leeds Weather Data](https://www.kaggle.com/datasets/muthuj7/weather-dataset?resource=download). The code ran most reliably when I stored the csv file in the same folder as my Python notebook, since the relative file name used below is looked up in the notebook's working directory. Below is the setup for the packages we will use later.

In [None]:
import numpy as np  # linear algebra
import pandas as pd  # data processing & reading csvs
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # statistical graphs

Here I read the csv file into a dataframe and check that it loaded correctly by displaying the first 5 rows.

In [None]:
data = pd.read_csv('weatherHistory.csv')
data.head()

Let us get a quick summary of the data. The shape attribute tells us how many rows and columns there are, and the info() method gives a concise summary of the dataframe: column names, non-null counts and data types.

In [None]:
data.shape

In [None]:
data.info()

We can then count the missing entries in each column.

In [None]:
data.isnull().sum()
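
To make the output of that check easier to read, here is a minimal sketch on a toy dataframe. The values are made up for illustration and are not taken from 'weatherHistory.csv'; dropping the incomplete rows at the end is only one possible way to handle them, not something this notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values (not from weatherHistory.csv)
toy = pd.DataFrame({
    "Humidity": [0.89, 0.86, np.nan, 0.83],
    "Precip Type": ["rain", None, "rain", "snow"],
})

# isnull() marks each missing cell as True; sum() counts them per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# One possible follow-up (an assumption): drop every row with a missing value
cleaned = toy.dropna()
print(cleaned.shape)
```

A column with a count of 0 has no missing entries, so only columns with a non-zero count need attention.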

The heatmap below shows the pairwise correlation between the numeric columns in the dataframe. The correlation coefficient (Pearson by default in pandas) always falls in the interval [-1, 1], where values near -1 or 1 indicate a strong linear relationship and values near 0 indicate a weak one. A strong correlation between two variables does not necessarily imply a causal relationship.

In [None]:
plt.figure(figsize=(10,8))
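# keep only numeric columns: text columns cannot be correlated and raise an error in recent pandas versions
sns.heatmap(data=data.select_dtypes(include='number').corr(), annot=True)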
plt.title("Pairwise correlation of all columns in the dataframe")
plt.savefig('plot6.png', dpi=300, bbox_inches='tight')
plt.show()
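
As a rough illustration of what each cell of the heatmap contains, the sketch below computes the Pearson coefficient by hand and compares it with the value pandas returns. The toy numbers and the variable names are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; "Summary" stands in for a text column
toy = pd.DataFrame({
    "Temperature (C)": [9.5, 9.4, 9.4, 8.3, 8.8],
    "Humidity": [0.89, 0.86, 0.89, 0.83, 0.83],
    "Summary": ["Cloudy", "Cloudy", "Clear", "Clear", "Cloudy"],
})

# Correlation matrix of the numeric columns only
corr = toy.select_dtypes(include="number").corr()

# Pearson r by hand: covariance divided by the product of the spreads
x = toy["Temperature (C)"]
y = toy["Humidity"]
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(corr)
print(r_manual)
```

The hand-computed value should match the off-diagonal entry of the matrix, and the diagonal is always 1 because every column is perfectly correlated with itself.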

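Before continuing with some more interesting analysis, we have to convert the date column into a proper datetime type. The pd.to_datetime() function converts a scalar, array-like, Series or dataframe to pandas datetime objects, and utc=True converts every timestamp to UTC. I had a couple of issues with a KeyError in pandas, most likely because running the cell a second time looks for a 'Formatted Date' column that has already been moved into the index. I remedied this by clearing all outputs and re-running all cells from the top.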

In [None]:
data['Formatted Date'] = pd.to_datetime(data['Formatted Date'], utc=True)
data = data.set_index("Formatted Date")
data.head()
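
One way to make a cell like this safe to re-run is to convert and re-index only while the column still exists. The sketch below is a suggestion rather than part of the original workflow; index_by_date is a hypothetical helper name and the toy timestamps are made up.

```python
import pandas as pd

# Toy frame with hypothetical timestamps that carry different UTC offsets
toy = pd.DataFrame({
    "Formatted Date": ["2006-04-01 00:00:00.000 +0200",
                       "2006-11-01 00:00:00.000 +0100"],
    "Humidity": [0.89, 0.86],
})

def index_by_date(df, column="Formatted Date"):
    """Hypothetical helper: parse `column` as UTC datetimes and use it as the index.

    If the column has already been moved to the index, df is returned unchanged.
    """
    if column in df.columns:
        df = df.copy()
        df[column] = pd.to_datetime(df[column], utc=True)
        df = df.set_index(column)
    return df

once = index_by_date(toy)
twice = index_by_date(once)  # second call is a no-op instead of a KeyError

# Without the guard, the column lookup fails once it has become the index
try:
    once["Formatted Date"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(once.index)
print(lookup_failed)
```

In the notebook this would be called as data = index_by_date(data), which can be run any number of times.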

In [None]:
df_column = ['Apparent Temperature (C)', 'Humidity']
df_monthly_mean = data[df_column].resample("MS").mean()  # "MS" = month start: one mean per calendar month
df_monthly_mean.head()
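
To show what the resampling step does, here is a small sketch with four made-up daily readings that straddle a month boundary. The dates and values are assumptions chosen for illustration.

```python
import pandas as pd

# Four hypothetical daily readings: 30 Jan, 31 Jan, 1 Feb, 2 Feb 2006
idx = pd.date_range("2006-01-30", periods=4, freq="D", tz="UTC")
toy = pd.DataFrame({"Humidity": [0.80, 0.90, 0.70, 0.60]}, index=idx)

# Group the rows by calendar month and average each group;
# every result row is labelled with the first day of its month
monthly = toy.resample("MS").mean()
print(monthly)
```

The two January readings collapse into one row labelled 2006-01-01 and the two February readings into one row labelled 2006-02-01, which is why the monthly dataframe is much shorter than the original.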

In [None]:
sns.set_style("darkgrid")
sns.regplot(data=df_monthly_mean, x="Apparent Temperature (C)", y="Humidity", color="g")
plt.title("Relation between Apparent Temperature (C) and Humidity")
# save the figure
plt.savefig('plot1.png', dpi=300, bbox_inches='tight')
plt.show()
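
The regression plot draws a fitted straight line through the scatter but does not print its slope or intercept. If those numbers are wanted, a simple least-squares line can be fitted separately, for example with np.polyfit. The sketch below uses made-up points that lie exactly on a line; it is an illustration and not necessarily the same estimator seaborn uses internally.

```python
import numpy as np

# Hypothetical monthly means lying exactly on humidity = -0.01 * temperature + 0.9
temperature = np.array([0.0, 10.0, 20.0])
humidity = np.array([0.9, 0.8, 0.7])

# Degree-1 polynomial fit; coefficients are returned highest power first
slope, intercept = np.polyfit(temperature, humidity, deg=1)
print(slope, intercept)
```

With the real data the same call could be made on df_monthly_mean["Apparent Temperature (C)"] and df_monthly_mean["Humidity"]; a negative slope would mean that humidity tends to fall as apparent temperature rises.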

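We produce a time-series plot of the monthly mean Apparent Temperature and Humidity from 2006 to 2016. As you can see, temperature behaves cyclically, following the seasons, whereas humidity looks more stable throughout the year. Note that both series share one y-axis, so a variable with a smaller numeric range will always look flatter on this plot.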

In [None]:
plt.figure(figsize=(14,6))
sns.lineplot(data = df_monthly_mean)
plt.xlabel('Year')
plt.title("Variation of Apparent Temperature and Humidity with time")
plt.savefig('plot2.png', dpi=300, bbox_inches='tight')
plt.show()

Below is a heatmap of the correlation between the monthly mean Apparent Temperature and Humidity. The plot is then saved as 'plot7.png'.

In [None]:
sns.set_style("darkgrid")
plt.figure(figsize=(4,4))
plt.title("Correlation between Apparent temperature & Humidity")
sns.heatmap(data= df_monthly_mean.corr(), annot=True)
plt.savefig('plot7.png', dpi=300, bbox_inches='tight')
plt.show()