<a href="https://colab.research.google.com/github/deliabel/CodeDivisionWorksheets/blob/main/Air_Quality_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


### All boxes in italics are the instructions from the worksheet. Other text boxes are my write-up and notes.


# *Clean and wrangle air quality data*

*The following data file contains data collected at a roadside monitoring station.  You can see the data in a spreadsheet here: https://docs.google.com/spreadsheets/d/1XpAvrpuyMsKDO76EZ3kxuddBOu7cZX1Od4uEts14zco/edit?usp=sharing*

*The data contains:*
* *a heading line (Chatham Roadside) which needs to be skipped*   
* *dates which are sometimes left- and sometimes right-justified indicating that they are not formatted as dates, rather they are text (so need to be converted to dates)*   
* *times which are not all in the same format*   
* *Nitrogen Dioxide levels which are, again, text and sometimes contain nodata*   
* *Status which is always the same*   






### *Project - clean, sort and wrangle the data*

* *Read the dataset into a dataframe, skipping the first row*   
* *Convert dates to date format*   
* *Remove rows with nodata in the Nitrogen dioxide column*   
* *Convert the Nitrogen dioxide levels values to float type*   
* *Sort by Nitrogen dioxide level*   
* *Create a new column for 'Weekdays' (use df['Date'].dt.weekday)*   
* *Rename the column Nitrogen dioxide level to NO2 Level (V µg/m2)*   
* *Remove the Status column*   

*The dataset can be viewed here:  https://drive.google.com/file/d/1aYmBf9il2dWA-EROvbYRCZ1rU2t7JwvJ/view?usp=sharing  and the data accessed here: https://drive.google.com/uc?id=1SOe9b4VJ1FCtDVgZ2T8d00-jTw2Kux1i  This is a .csv file*


## Cleaning the data

Using the guidance provided as a starting point, the data was made more usable by:
* Skipping the first row of the raw dataset as it was read in, to remove a title row.
* Converting the 'Date' column values from object type to a datetime type, so that datetime functions can be used.
* Replacing the midnight values in the 'Time' column to match the format of the other values and to use the standard 00:00 instead of 24:00.
* Removing the rows without a measurement (value is 'nodata') from the 'Nitrogen dioxide' column. This code is commented out but left in place, in case a problem was found later.
* Converting the 'nodata' values in the 'Nitrogen dioxide' column to None, to allow the column data type to be changed with these rows in place.
* Converting the 'Nitrogen dioxide' column values from object type to a float type, in order to access the numbers.
* Renaming the 'Nitrogen dioxide' column to 'NO2 Level (V µg/m3)' to better describe the measurements and to allow the 'Status' column to be removed.
* Removing the unnecessary 'Status' column.
* Adding a 'Weekday' column to label the rows by day of the week, to allow analysis and plots.
* Adding a 'Week' column to label the rows by week.
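
The 'nodata' replacement and the float conversion could also be done in one step. This is a minimal sketch on a made-up sample (not the real file), using `pd.to_numeric` with `errors='coerce'` instead of the `replace` and `astype` calls used in this notebook. Note that it would turn any other unexpected text into NaN as well, not only 'nodata'.

```python
import pandas as pd

# Made-up sample values standing in for the raw 'Nitrogen dioxide' column
sample = pd.DataFrame({'Nitrogen dioxide': ['12.5', 'nodata', '30.1']})

# errors='coerce' converts anything that cannot be read as a number into NaN
sample['Nitrogen dioxide'] = pd.to_numeric(sample['Nitrogen dioxide'], errors='coerce')

print(sample.dtypes)
print('missing values:', sample['Nitrogen dioxide'].isna().sum())
```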



In [None]:
import pandas as pd

In [None]:
# Reads the dataset into a dataframe, skipping the first row
# Output gives an idea of the dataset
url2020 = 'https://raw.githubusercontent.com/deliabel/CodeDivisionWorksheets/main/data%20sets/NO2-measured-data-2020-2021-Chatham-Roadside.csv'
measured_20df = pd.read_csv(url2020, skiprows = 1)
measured_20df

In [None]:
# (checks if any columns have null data, and finds out which type of data is in each column)
measured_20df.info()

In [None]:
# Converts values in 'Date' column to date format
# Output shows the 'Date' datatype has changed to datetime64
measured_20df['Date']= pd.to_datetime(measured_20df['Date'], dayfirst = True)
measured_20df.info()

In [None]:
# additional: Replaces midnight time, making the format match the rest
# output shows the last midnight time is now 00:00
measured_20df = measured_20df.replace('24:00:00', '00:00')
measured_20df

In [None]:
# Removes rows with nodata in the 'Nitrogen dioxide' column
# Output shows that the indices have not changed, but there are now fewer rows
#measured_20df = measured_20df.loc[measured_20df['Nitrogen dioxide'] != 'nodata']
#measured_20df

In [None]:
# (checks there are no nodata rows left)
# (as a second check, if this is run first, there are 152 rows: 8784 - 152 = 8632, which is the new length of the df)
#measured_20df_check = measured_20df[measured_20df['Nitrogen dioxide'] == 'nodata']
#measured_20df_check

In [None]:
# additional: Replaces 'nodata' with None, might help with plots, will introduce nulls
# Output shows there are now null values in the 'Nitrogen dioxide' column, and a section where 'nodata' has been replaced with None
measured_20df = measured_20df.replace('nodata', None)
print(measured_20df.info())
measured_20df.iloc[50:60]

In [None]:
# Converts the 'Nitrogen dioxide' values to float type
# Output shows 'Nitrogen dioxide' datatype has changed to float64
measured_20df = measured_20df.astype({'Nitrogen dioxide': float,})
measured_20df.info()

In [None]:
# Renames the column 'Nitrogen dioxide' to 'NO2 Level (V µg/m3)'
measured_20df = measured_20df.rename({'Nitrogen dioxide': "NO2 Level (V µg/m3)"}, axis="columns")
measured_20df.head(3)

In [None]:
# Removes the 'Status' column
measured_20df = measured_20df.drop('Status', axis = 1)
measured_20df

In [None]:
# Creates a new 'Weekday' column to label days of the week
measured_20df.insert(1, 'Weekday', measured_20df['Date'].dt.weekday)
measured_20df.head(3)

In [None]:
# (checks that there are labels for 7 days, and that all the days are included: Monday is 0 and Sunday is 6)
measured_20df['Weekday'].unique()

In [None]:
# additional: Creates a new 'Week' column to label weeks in the year
# Output shows the new column is present and that there are labels for each week in the year
measured_20df.insert(2, 'Week', measured_20df['Date'].dt.isocalendar().week)
print(measured_20df.head(3))
measured_20df['Week'].unique()

### *Expand the dataset*
---

*There is a second data set here covering the year 2021:  https://drive.google.com/uc?id=1aYmBf9il2dWA-EROvbYRCZ1rU2t7JwvJ*

*Concatenate the two datasets to expand it to 2020 and 2021.*  

*Before you can concatenate the datasets you will need to clean and wrangle the second dataset in the same way as the first.  Use the code cell below.  Give the second dataset a different name.*

### *Show summary statistics for larger dataset*
---

*After the datasets have been concatenated, group the data by Weekdays and show summary statistics by day of the week.*

## Cleaning the second dataset

The second dataset was cleaned in the same way.
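
Since both datasets go through the same steps, the steps could be collected into one function. The sketch below is an illustration only: `clean_no2` is a hypothetical helper that is not used in this notebook, the sample rows are made up, and `pd.to_numeric` is swapped in for the separate 'nodata' replacement and float conversion.

```python
import io
import pandas as pd

def clean_no2(source):
    """Hypothetical helper: applies the cleaning steps to one dataset."""
    df = pd.read_csv(source, skiprows=1)                      # skip the title row
    df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)    # text to datetime
    # Both midnight spellings seen in the two files become 00:00
    df['Time'] = df['Time'].replace({'24:00:00': '00:00', '24:00': '00:00'})
    # 'nodata' (and any other non-numeric text) becomes NaN
    df['Nitrogen dioxide'] = pd.to_numeric(df['Nitrogen dioxide'], errors='coerce')
    df = df.rename(columns={'Nitrogen dioxide': 'NO2 Level (V µg/m3)'})
    df = df.drop(columns='Status')
    df.insert(1, 'Weekday', df['Date'].dt.weekday)
    df.insert(2, 'Week', df['Date'].dt.isocalendar().week)
    return df

# Made-up sample in the assumed layout of the raw files (title row, then header)
raw = io.StringIO(
    "Chatham Roadside\n"
    "Date,Time,Nitrogen dioxide,Status\n"
    "01/01/2021,01:00,12.5,V µg/m3\n"
    "02/01/2021,24:00,nodata,V µg/m3\n"
)
cleaned = clean_no2(raw)
print(cleaned)
```

With a helper like this, each real file could be cleaned with a single call, for example `clean_no2(url2021)`.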

In [None]:
# Reads the dataset into a dataframe, skipping the first row
url2021 = 'https://raw.githubusercontent.com/deliabel/CodeDivisionWorksheets/main/data%20sets/NO2-measured-data-2021-2022-Chatham-Roadside.csv'
measured_21df = pd.read_csv(url2021, skiprows = 1)
measured_21df

In [None]:
# (checks if any columns have null data, and finds out which type of data is in each column)
measured_21df.info()

In [None]:
# Converts values in 'Date' column to date format
measured_21df['Date']= pd.to_datetime(measured_21df['Date'], dayfirst = True)
measured_21df.info()

In [None]:
# additional: Replaces midnight time, making the format match the rest
measured_21df = measured_21df.replace('24:00', '00:00')
measured_21df.tail(26)

In [None]:
# Removes rows with nodata in the 'Nitrogen dioxide' column
# measured_21df = measured_21df.loc[measured_21df['Nitrogen dioxide'] != 'nodata']
#measured_21df = measured_21df[measured_21df['Nitrogen dioxide'] != 'nodata']
#measured_21df

In [None]:
# (checks there are no nodata rows left)
#measured_21df_check = measured_21df[measured_21df['Nitrogen dioxide'] == 'nodata']
#measured_21df_check

In [None]:
# additional: replaces 'nodata' with None, might help with plots, will introduce nulls
measured_21df = measured_21df.replace('nodata', None)
measured_21df.info()

In [None]:
# Converts the 'Nitrogen dioxide' values to float type
measured_21df = measured_21df.astype({'Nitrogen dioxide': float,})
measured_21df.info()

In [None]:
# Renames the column 'Nitrogen dioxide' to 'NO2 Level (V µg/m3)'
measured_21df = measured_21df.rename({'Nitrogen dioxide': "NO2 Level (V µg/m3)"}, axis="columns")
measured_21df.head(3)

In [None]:
# Removes the 'Status' column
measured_21df = measured_21df.drop('Status', axis = 1)
measured_21df

In [None]:
# Creates a new 'Weekday' column to label days of the week
measured_21df.insert(1, 'Weekday', measured_21df['Date'].dt.weekday)
measured_21df.head(3)

In [None]:
# (checks that there are labels for 7 days, and that all the days are included: Monday is 0 and Sunday is 6)
measured_21df['Weekday'].unique()

In [None]:
# additional: Creates a new 'Week' column to label weeks in the year
measured_21df.insert(2, 'Week', measured_21df['Date'].dt.isocalendar().week)
print(measured_21df.head(3))
measured_21df['Week'].unique()

#### ---end of cleaning---
---


## Concatenating datasets

In [None]:
# Concatenates the two datasets to expand it to 2020 and 2021.
# Output shows the new length of the dataframe
measuredNO2_df = pd.concat([measured_20df, measured_21df], ignore_index = True)
print('shape:', measuredNO2_df.shape)
measuredNO2_df

In [None]:
# Displays a summary of statistics for the measured nitrogen dioxide levels
measuredNO2_df[["NO2 Level (V µg/m3)"]].describe()

## Looking at the Data
To get an idea of any trends in the data or questions that could be investigated, ... different ways of viewing the data.

Range of Measurements
* Found the maximum and minimum measurements of the nitrogen dioxide level for each year.
* Allows the range of the measured concentrations to be seen.

Regulations
* Found the mean measurements of the nitrogen dioxide level for each year.
* Together with the maximums and minimums, this allows the measurements to be checked against the regulation levels.
* "The Air Quality Standards Regulations 2010 require that the annual mean concentration of NO2 must not exceed 40 µg/m3 and that there should be no more than 18 exceedances of the hourly mean limit value (concentrations above 200 µg/m3) in a single year."
* https://www.gov.uk/government/statistics/air-quality-statistics/ntrogen-dioxide

Annual plots
* For each year: Separated the data by finding the mean measurement for each week, and then plotted both years on one graph.
* Gives a rough idea of any trends over a year, and allows years to be compared.
* Note: The week numbers mean that the dates in this plot will not exactly be aligned. This should be a small factor considering that weekday may have more effect than date, and the season will still match.
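
The week numbers from `dt.isocalendar().week` follow the ISO 8601 calendar, in which a week belongs to the year that contains its Thursday. A short sketch with a few hand-picked dates (not taken from the data files) shows one effect of this: 1 January 2021 falls in ISO week 53 of 2020. So if the 2021 file starts on 1 January, its first few days would be grouped into week 53 at the right-hand end of the plot.

```python
import pandas as pd

# Hand-picked dates around the start and end of each year
dates = pd.Series(pd.to_datetime(['2020-01-01', '2020-12-31', '2021-01-01', '2021-01-04']))

# isocalendar() returns the ISO year, week and day for each date
iso = dates.dt.isocalendar()
print(pd.concat([dates.rename('date'), iso[['year', 'week']]], axis=1))
```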

Weekly plots
* Grouped the data for both years by day of the week, calculated the maximum and mean for each day, and made a bar chart.
* This should allow any trends over a week to be seen.

### Range of the measurements


In [None]:
# shows minimum and maximum nitrogen dioxide levels for the year 2020
# measured_20df[["NO2 Level (V µg/m3)"]].describe()
minNO2_20 = measured_20df['NO2 Level (V µg/m3)'].min()
maxNO2_20 = measured_20df['NO2 Level (V µg/m3)'].max()
print('2020 \nminimum measurement:', minNO2_20, '\nmaximum measurement:', maxNO2_20)

In [None]:
# shows minimum and maximum nitrogen dioxide levels for the year 2021
# measured_21df[["NO2 Level (V µg/m3)"]].describe()
minNO2_21 = measured_21df['NO2 Level (V µg/m3)'].min()
maxNO2_21 = measured_21df['NO2 Level (V µg/m3)'].max()
print('2021 \nminimum measurement:', minNO2_21, '\nmaximum measurement:', maxNO2_21)

In [None]:
# Sorts 2021 dataframe by Nitrogen dioxide level
# Output shows there are only 9 negative values.
sorted_21df = measured_21df.sort_values('NO2 Level (V µg/m3)')
sorted_21df.head(10)


#### Negative Values
The negative values are not meaningful measurements. This indicates that something is wrong, possibly a calibration error or a problem with the sensors. It is unclear whether this problem affects the rest of the data.

There are only 9 negative values, so they should not have a large effect on the calculations. If a problem was noticed, they could be removed in a similar way to the part of the cleaning section where 'nodata' was replaced with None.
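
One possible way to do this is sketched below on made-up values. It uses `Series.mask`, which replaces every value where the condition is true with NaN, so the rows stay in place but are skipped by `mean()`, `min()` and `max()`.

```python
import pandas as pd

col = 'NO2 Level (V µg/m3)'

# Made-up readings, including one negative value and one existing gap
sample = pd.DataFrame({col: [12.5, -0.4, 30.1, None]})

# Replace negative readings with NaN, leaving all other values unchanged
sample[col] = sample[col].mask(sample[col] < 0)

print(sample)
print('minimum after masking:', sample[col].min())
```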

### Regulations


In [None]:
# shows means of both years
meanNO2_20 = measured_20df['NO2 Level (V µg/m3)'].mean()
meanNO2_21 = measured_21df['NO2 Level (V µg/m3)'].mean()
print('2020 \nmean measurement:', meanNO2_20, '\n2021 \nmean measurement:', meanNO2_21)

#### Within Regulations
Both calculated annual mean measurements, 18.54 µg/m3 for 2020 and 18.64 µg/m3 for 2021, are lower than the maximum of 40 µg/m3 allowed.
The maximum measurements from the range section, 113.06 µg/m3 for 2020 and 82.56 µg/m3 for 2021, are both below the hourly mean limit of 200 µg/m3, so no hourly mean exceeded this limit.

Neither year exceeded the regulations.
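
The comparison against the limits could also be written as a small check. This is a minimal sketch, assuming each row is one hourly mean. The `check_regulations` function and the sample readings are made up for illustration and are not part of the worksheet.

```python
import pandas as pd

ANNUAL_MEAN_LIMIT = 40   # µg/m3, annual mean limit from the regulation quoted earlier
HOURLY_LIMIT = 200       # µg/m3, hourly mean limit value
MAX_EXCEEDANCES = 18     # hours allowed above the hourly limit in one year

def check_regulations(levels):
    """Hypothetical helper: compares one year of hourly readings with the limits."""
    annual_mean = float(levels.mean())                 # NaN values are skipped
    exceedances = int((levels > HOURLY_LIMIT).sum())   # NaN compares as False
    within = annual_mean <= ANNUAL_MEAN_LIMIT and exceedances <= MAX_EXCEEDANCES
    return {'annual_mean': annual_mean, 'exceedances': exceedances, 'within_limits': within}

# Made-up readings, not the Chatham data
example = pd.Series([10.0, 25.0, None, 210.0, 15.0])
result = check_regulations(example)
print(result)
```

On the real data this could be called once per year, for example on `measured_20df['NO2 Level (V µg/m3)']`.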

### Annual Plots

In [None]:
import matplotlib.pyplot as plt

In [None]:
# Finds mean nitrogen dioxide level for each week of 2020
meanNO2_weeks20 = measured_20df.groupby('Week')['NO2 Level (V µg/m3)'].mean()

In [None]:
# Finds mean nitrogen dioxide level for each week of 2021
meanNO2_weeks21 = measured_21df.groupby('Week')['NO2 Level (V µg/m3)'].mean()

In [None]:
# Plots data from both years, to get an idea of trends over the year
x20_weeks = meanNO2_weeks20.keys()
x21_weeks = meanNO2_weeks21.keys()

fig, ax = plt.subplots()

ax.plot(x20_weeks, meanNO2_weeks20, color='blue', label='2020')
ax.plot(x21_weeks, meanNO2_weeks21, color = 'green', label='2021')

plt.title('Mean weekly NO2 at Chatham Roadside in 2020 and 2021')
plt.xlabel('Week')
plt.ylabel('NO2 Level (V µg/m3)')
ax.legend(loc='upper right')

plt.show()

#### Seasonal Differences
The plot shows a clear difference between measurements made during summer and winter. This could be due to weather or differences in traffic levels.

The first parts of the two years are quite different, but approximately the last third of each year has peaks in very similar places. There is a bigger difference at about weeks 15 to 25. This is approximately when the first UK Covid lockdown was, and could be looked at more closely.

### Weekly Plots

In [None]:
# groups the data by Weekday and shows summary statistics by day of the week
measuredNO2_df.groupby('Weekday')['NO2 Level (V µg/m3)'].describe()

In [None]:
# Finds mean nitrogen dioxide level by day of the week
meanNO2_days = measuredNO2_df.groupby('Weekday')['NO2 Level (V µg/m3)'].mean()
meanNO2_days

In [None]:
# Finds maximum nitrogen dioxide level by day of the week
maxNO2_days = measuredNO2_df.groupby('Weekday')['NO2 Level (V µg/m3)'].max()
maxNO2_days

In [None]:
# additional: plots mean and max on a bar chart, to see if there might be a pattern over the week

labels = ['Mon', 'Tues', 'Weds', 'Thurs', 'Fri', 'Sat', 'Sun']

fig, ax = plt.subplots()

ax.bar(labels, maxNO2_days, color='xkcd:greyish purple', label='max level of NO2')
ax.bar(labels, meanNO2_days, color = 'xkcd:dark purple', label='mean level of NO2')

plt.title('Levels of NO2 measured at Chatham Roadside in 2020 and 2021')
plt.xlabel('Day of the Week')
plt.ylabel('NO2 Level (V µg/m3)')
ax.legend(loc='upper right')

plt.show()

#### Weekday Trends

As might be expected, Sunday has the lowest nitrogen dioxide levels. Monday is notably higher, with levels roughly decreasing through the week.
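
To make a weekday summary easier to read, the mean and maximum can be calculated in one call and the weekday numbers swapped for names. This is a small sketch on made-up values. The `day_names` mapping is an assumption that follows the pandas convention of Monday being 0 and Sunday being 6.

```python
import pandas as pd

col = 'NO2 Level (V µg/m3)'

# Made-up readings for three different weekdays
sample = pd.DataFrame({
    'Weekday': [0, 0, 5, 6, 6],
    col: [20.0, 30.0, 12.0, 8.0, 10.0],
})

day_names = {0: 'Mon', 1: 'Tues', 2: 'Weds', 3: 'Thurs', 4: 'Fri', 5: 'Sat', 6: 'Sun'}

# One row per weekday, with the mean and maximum as columns
summary = sample.groupby('Weekday')[col].agg(['mean', 'max']).rename(index=day_names)
print(summary)
```

Labelling the rows from the grouped index, rather than from a separate list, keeps the names matched to the values even if a weekday is missing from the data.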

## Other possible investigations
#### Without additional information:
*   trends over a day: school run, rush hour (would need to convert the times too)
    *   measurements were taken during the on/off Covid lockdowns, so work routines were interrupted
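
Converting the times could look something like the sketch below, which uses made-up rows. It assumes the 'Time' text always starts with hours and minutes as `HH:MM`, and that a reading labelled 24:00 belongs to midnight at the end of that date. That differs from the cleaning step above, which keeps the same date. The 'Timestamp' and 'Hour' columns are hypothetical additions.

```python
import pandas as pd

col = 'NO2 Level (V µg/m3)'

# Made-up rows with both time formats
sample = pd.DataFrame({
    'Date': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-01-02']),
    'Time': ['08:00', '24:00:00', '08:00'],
    col: [30.0, 10.0, 40.0],
})

# Keep only HH:MM, then read it as an offset from midnight,
# so that 24:00 rolls over to 00:00 on the next day
offsets = pd.to_timedelta(sample['Time'].str.slice(0, 5) + ':00')
sample['Timestamp'] = sample['Date'] + offsets
sample['Hour'] = sample['Timestamp'].dt.hour

# Mean level for each hour of the day
hourly_mean = sample.groupby('Hour')[col].mean()
print(sample[['Timestamp', 'Hour']])
print(hourly_mean)
```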

#### Would require additional data or to look up dates:
*   excluding special weeks (such as holidays) to improve the accuracy of trends over a week
*   effect of school holidays, other holidays/events
*   effect of lockdown dates
*   if there was weather data from a station at the same position, levels could be related to sunny days vs rainy/bad weather, or to wind speed if that was relevant

*   check data quality of results