# Task 5 - Data Analysis

- Data analysis of our [data set](https://archive.ics.uci.edu/ml/datasets/Air+Quality), in which we begin to answer the following research questions:

1. What times of day have the highest (and lowest) air pollution?

2. Why do the two tungsten oxide sensor readings (targeting NOx and NO2) have such a negative correlation?

3. What are the seasons where air pollution is high and low?


In [None]:
# load data
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
path_parent = os.path.dirname(os.getcwd())
os.chdir(path_parent)
print(os.getcwd())
from thierry.project_functions import load_and_process 
df = load_and_process('https://raw.githubusercontent.com/data301-2020-winter2/course-project-group_1050/main/data/raw/AirQualityUCI.csv')

### 1. What times of day have the highest (and lowest) air pollution?

In [None]:
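## time_df holds the mean of each numeric column for each hourly measurement time
## numeric_only=True skips non-numeric columns, which newer pandas versions refuse to average
time_df = df.groupby('Time').mean(numeric_only=True)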
time_df = time_df.reset_index()
time_df.head()

In [None]:
sns.set(rc={'figure.figsize':(50,20)})
time_plot = sns.lineplot(x='Time',
            y='value',
            hue='variable',  
            linewidth=5,
            data= pd.melt(time_df, ['Time']))
time_plot.axes.set_title('Mean Hourly Airborne Metallic Oxides', fontsize=50)
time_plot.set_xlabel('Time of Day', fontsize=40)
time_plot.set_ylabel('Metal Oxide Values', fontsize=40)
time_plot.tick_params(labelsize=20)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
time_plot



*Time plot analysis* -- There are two clear peaks in the airborne metal oxide readings: between 7 and 10 AM, and between 5 and 9 PM. These are most likely due to increased commuter activity during rush hours. There is also a low between 4 and 5 AM.
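
The peaks and the low described above can also be read off numerically rather than by eye. The sketch below assumes a frame shaped like `time_df` (one row per hour, one numeric column per sensor). The frame `toy_time_df`, its values, and the column names `sensor_a` and `sensor_b` are made up for illustration and do not come from the data set.

```python
import pandas as pd

# Hypothetical stand-in for time_df: hourly means for two made-up sensor columns.
toy_time_df = pd.DataFrame({
    "Time": ["04:00:00", "08:00:00", "13:00:00", "19:00:00"],
    "sensor_a": [700.0, 1250.0, 1000.0, 1300.0],
    "sensor_b": [600.0, 1100.0, 900.0, 1050.0],
})

# Index by time so idxmax/idxmin return the hour label rather than a row number.
hourly = toy_time_df.set_index("Time")

# For each column, the hour with the highest and the lowest mean reading.
extremes = pd.DataFrame({
    "peak_hour": hourly.idxmax(),
    "low_hour": hourly.idxmin(),
})
print(extremes)
```

Applied to the real `time_df`, the same two calls would give one peak hour and one low hour per column, which makes it easy to check whether all sensors agree on the rush-hour pattern.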

### 2. Why do the two tungsten oxide sensor readings have such a negative correlation?

In [None]:
tungsten_data = pd.DataFrame({'Temperature (C)':df['Temperature (C)'],
                             'Tungsten Oxide (NOx)':df['Tungsten Oxide (NOx)'],
                             'Tungsten Oxide (NO2)':df['Tungsten Oxide (NO2)']})
tungsten_data.head()

In [None]:

sns.lineplot(x='Temperature (C)',
            y='value',
            hue='variable',
            data= pd.melt(tungsten_data, ['Temperature (C)']))

*Tungsten Oxide Analysis* -- There is a crossover point at approximately 4 degrees Celsius: above this threshold the Tungsten Oxide (NO2) reading is higher, while below it the Tungsten Oxide (NOx) reading is higher. This suggests that the nitrogen oxide compound the tungsten oxide reacts with may depend on temperature.
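
Both the strength of the negative correlation and the crossover temperature can be computed directly instead of estimated from the plot. The sketch below is a minimal illustration on synthetic data: the frame `toy` reuses the column names of `tungsten_data`, but its values are invented so that the two readings cross at exactly 4 degrees. Rounding temperatures to whole degrees is an assumption made here to get one mean per bin.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for tungsten_data: one reading falls with temperature, the other rises.
temps = np.arange(-2.0, 12.0, 1.0)
toy = pd.DataFrame({
    "Temperature (C)": temps,
    "Tungsten Oxide (NOx)": 1200.0 - 50.0 * temps,
    "Tungsten Oxide (NO2)": 800.0 + 50.0 * temps,
})

# Pearson correlation between the two readings.
corr = toy["Tungsten Oxide (NOx)"].corr(toy["Tungsten Oxide (NO2)"])

# Mean of each reading per whole-degree temperature bin.
by_temp = toy.groupby(toy["Temperature (C)"].round())[
    ["Tungsten Oxide (NOx)", "Tungsten Oxide (NO2)"]
].mean()

# Lowest temperature bin in which the NO2 mean is at least the NOx mean.
diff = by_temp["Tungsten Oxide (NO2)"] - by_temp["Tungsten Oxide (NOx)"]
crossover = diff[diff >= 0].index.min()

print(f"correlation: {corr:.2f}, crossover near {crossover} C")
```

On real data the difference between the two means may change sign more than once, so inspecting the whole `diff` series is safer than relying on the single `crossover` value.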

### 3. What are the seasons where air pollution is high and low?

In [None]:
## date_df holds the mean of each numeric column for each measurement date
date_df = df.groupby('Date').mean(numeric_only=True)
date_df = date_df.reset_index()
date_df.head()

In [None]:
date_plot = sns.lineplot(x='Date',
            y='value',
            hue='variable',  
            linewidth=5,
            data= pd.melt(date_df, ['Date']))
date_plot.axes.set_title('Mean Daily Airborne Metallic Oxides', fontsize=50)
date_plot.set_xlabel('Date', fontsize=40)
date_plot.set_ylabel('Metal Oxide Values', fontsize=40)
date_plot.tick_params(labelsize=20)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
date_plot

In [None]:
*Date plot analysis* -- analysis goes here
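
One possible way to approach the seasonal question is to map each date to a season and average the daily means per season. The sketch below is only an illustration: `toy_date_df`, the column `sensor_a`, and all values are made up. It also assumes day/month/year date strings and northern-hemisphere meteorological seasons, neither of which is confirmed by the notebook.

```python
import pandas as pd

# Hypothetical stand-in for date_df with one made-up sensor column.
toy_date_df = pd.DataFrame({
    "Date": ["15/01/2005", "10/04/2004", "20/07/2004", "05/10/2004", "25/12/2004"],
    "sensor_a": [1200.0, 1000.0, 800.0, 1100.0, 1300.0],
})

# Assumption: dates are day/month/year strings; drop this step if they are already datetimes.
dates = pd.to_datetime(toy_date_df["Date"], format="%d/%m/%Y")

# Assumption: northern-hemisphere meteorological seasons, keyed by month number.
season_of_month = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

# Label each row with its season, then average the readings per season.
toy_date_df["Season"] = dates.dt.month.map(season_of_month)
season_means = toy_date_df.groupby("Season")["sensor_a"].mean()
print(season_means.sort_values(ascending=False))
```

With the real `date_df`, the same grouping over every sensor column would give a small table of seasonal means to compare against the daily line plot.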