# Task 5 - Data Analysis

- Data analysis of our [data set](https://archive.ics.uci.edu/ml/datasets/Air+Quality), in which we begin to answer the following research questions:

1. What times of day have the highest (and lowest) air pollution?

2. Why do the two tungsten oxide sensor readings (NOx and NO2) have such a negative correlation?


In [1]:
# load libraries
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# move up to the parent directory so the project package can be imported
path_parent = os.path.dirname(os.getcwd())
os.chdir(path_parent)
print(os.getcwd())

# load and process the raw data
from thierry.project_functions import load_and_process
df = load_and_process('https://raw.githubusercontent.com/data301-2020-winter2/course-project-group_1050/main/data/raw/AirQualityUCI.csv')

C:\Users\ryan\Desktop\301 git\course-project-group_1050\analysis


KeyError: 'CO(GT)'

### 1. What times of day have the highest (and lowest) air pollution?

In [None]:
# time_df holds the mean of each numeric column for each hourly measurement time
time_df = df.groupby('Time').mean(numeric_only=True).reset_index()
time_df.head()

In [None]:
sns.set(rc={'figure.figsize':(50,20)})
# melt to long format so that each measurement column becomes its own line
time_plot = sns.lineplot(x='Time',
            y='value',
            hue='variable',
            linewidth=5,
            data=pd.melt(time_df, ['Time']))
time_plot.set_title('Mean Daily Airborne Metallic Oxides', fontsize=50)
time_plot.set_xlabel('Time of Day', fontsize=40)
time_plot.set_ylabel('Metal Oxide Values', fontsize=40)
time_plot.tick_params(labelsize=20)
# place the legend outside the plot area
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
time_plot



*Time plot analysis*
<br>
There are two obvious peaks in airborne metal oxides: between 7 and 10 AM, and between 5 and 9 PM. This is most likely due to increased commuter activity during rush hours. There is also a low between 4 and 5 AM.
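To back up the peaks and lows read off the plot, the times can also be pulled out numerically. This is only a minimal sketch, assuming a frame shaped like `time_df` (a `Time` column plus one column of hourly means per measurement). The helper name `hourly_extremes` and the toy numbers are hypothetical and are not taken from the dataset.

```python
import pandas as pd

def hourly_extremes(hourly_means: pd.DataFrame, time_col: str = "Time") -> pd.DataFrame:
    """For each measurement column, return the time of day with the highest and lowest mean."""
    values = hourly_means.set_index(time_col)
    return pd.DataFrame({"peak_time": values.idxmax(), "low_time": values.idxmin()})

# Toy stand-in for time_df (made-up values, for illustration only)
toy_time_df = pd.DataFrame({
    "Time": ["04:00:00", "08:00:00", "13:00:00", "19:00:00"],
    "sensor_a": [1.0, 5.0, 3.0, 6.0],
    "sensor_b": [2.0, 9.0, 4.0, 7.0],
})

extremes = hourly_extremes(toy_time_df)
print(extremes)
```

On the real data this would be called as `hourly_extremes(time_df)`, giving one row per column with its peak and low times.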

### 2. Why do the two tungsten oxide sensor readings have such a negative correlation?

In [None]:
tungsten_data = pd.DataFrame({'Temperature (C)':df['Temperature (C)'],
                             'Tungsten Oxide (NOx)':df['Tungsten Oxide (NOx)'],
                             'Tungsten Oxide (NO2)':df['Tungsten Oxide (NO2)']})
tungsten_data.head()

In [None]:

res = sns.lineplot(x='Temperature (C)',
            y='value',
            hue='variable',
            data= pd.melt(tungsten_data, ['Temperature (C)']))
res.set_xlabel("Temperature (C)",fontsize=30)
res.set_ylabel("Concentration",fontsize=20)
res.tick_params(labelsize=24)

*Tungsten Oxide Analysis* 
<br>
As we can see, Tungsten Oxide (NOx) is high at low temperatures, and as temperatures climb Tungsten Oxide (NO2) overtakes it. However, is this due to temperature or to other factors?
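One way to probe whether temperature alone drives the relationship is a partial correlation: remove the straight-line effect of temperature from both readings, then correlate what is left. The sketch below is an illustration under assumptions, not part of our analysis. `partial_corr` is a hypothetical helper, the linear fit is a simplifying assumption, and the toy arrays are made up.

```python
import numpy as np

def partial_corr(x, y, control):
    """Correlation between x and y after removing a linear fit on `control` from each."""
    x, y, control = (np.asarray(v, dtype=float) for v in (x, y, control))
    # residuals of each variable after a degree-1 polynomial fit on the control
    res_x = x - np.polyval(np.polyfit(control, x, 1), control)
    res_y = y - np.polyval(np.polyfit(control, y, 1), control)
    return float(np.corrcoef(res_x, res_y)[0, 1])

# Toy data: both series share the same wiggle but have opposite temperature trends
temperature = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
wiggle = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
series_a = 2.0 * temperature + wiggle
series_b = -3.0 * temperature + wiggle

raw = float(np.corrcoef(series_a, series_b)[0, 1])
partial = partial_corr(series_a, series_b, temperature)
print(raw, partial)
```

In the toy data the raw correlation is negative only because of the opposite temperature trends. If the same thing happened on the real columns, it would point to temperature as the driver; if the negative correlation survived, other factors would be worth a look.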

In [None]:
df1 = df[['Temperature (C)', 'Tungsten Oxide (NO2)', 'Tungsten Oxide (NOx)']].copy()
corr = df1.corr()
# plot the heatmap
res = sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, annot=True, cmap=sns.diverging_palette(250, 20, as_cmap=True))
res.set_xticklabels(res.get_xmajorticklabels(), fontsize = 25)
res.set_yticklabels(res.get_ymajorticklabels(), fontsize = 25)
plt.title('Tungsten correlation')

           


*Correlation analysis*
<br>
As we can see, NO2 has a moderate positive correlation with temperature, whereas NOx has a weak negative correlation with it. NOx and NO2 also have a moderate negative correlation with each other, so maybe their concentrations are dependent on something else.
<br>
*__Could be__* 
- Time of the day?
- Time of year?
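A simple way to test candidates like these is to compute the correlation separately within each group (each hour, or each month) and see whether it changes. The following is a rough sketch, assuming a grouping column already exists in the frame. `corr_by_group`, the column names `a` and `b`, and the toy values are hypothetical.

```python
import pandas as pd

def corr_by_group(frame: pd.DataFrame, group_col: str, col_a: str, col_b: str) -> pd.Series:
    """Pearson correlation between col_a and col_b, computed separately for each group."""
    return pd.Series({
        key: group[col_a].corr(group[col_b])
        for key, group in frame.groupby(group_col)
    })

# Toy data: the two columns move together in one group and oppositely in the other
toy = pd.DataFrame({
    "Time": ["morning"] * 3 + ["evening"] * 3,
    "a": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "b": [2.0, 4.0, 6.0, 6.0, 4.0, 2.0],
})

by_time = corr_by_group(toy, "Time", "a", "b")
print(by_time)
```

With our data the call would look like `corr_by_group(df, 'Time', 'Tungsten Oxide (NOx)', 'Tungsten Oxide (NO2)')`. A time-of-year version would first need a month column derived from the dates.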



In [None]:
df2 = pd.DataFrame({'Time':df['Time'],
                             'Tungsten Oxide (NOx)':df['Tungsten Oxide (NOx)'],
                             'Tungsten Oxide (NO2)':df['Tungsten Oxide (NO2)']})
res = sns.lineplot(x='Time',
            y='value',
            hue='variable',
            data= pd.melt(df2, ['Time']))
res.set_xlabel("Time", fontsize=30)
res.set_ylabel("Concentration", fontsize=20)
res.tick_params(labelsize=24)


*Time analysis*
<br>
Nothing notable here besides the dip and subsequent increase between 00:00 and 08:00, which most likely corresponds to a temperature drop.

In [None]:
df2 = pd.DataFrame({'Time':df['Time'],
                             'Temp':df['Temperature (C)']})
res = sns.lineplot(x='Time',
            y='value',
            hue='variable',
            data= pd.melt(df2, ['Time']))
res.set_xlabel("Time", fontsize=30)
# this plot shows temperature, so label the y axis accordingly
res.set_ylabel("Temperature (C)", fontsize=20)
res.tick_params(labelsize=24)

*Temp analysis*
<br>
As predicted above, the dip lines up with temperatures dropping overnight.
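To put a number on how closely the two daily profiles follow each other, the hourly means can be correlated directly. This is a minimal sketch and only an assumption about how one might check it. `hourly_profile_corr`, the `sensor` column, and the toy values are hypothetical.

```python
import pandas as pd

def hourly_profile_corr(frame: pd.DataFrame, time_col: str, col_a: str, col_b: str) -> float:
    """Correlation between the per-time-of-day means of two columns."""
    hourly = frame.groupby(time_col)[[col_a, col_b]].mean()
    return float(hourly[col_a].corr(hourly[col_b]))

# Toy data: two readings per time of day (made-up values)
toy = pd.DataFrame({
    "Time": ["00:00:00", "00:00:00", "04:00:00", "04:00:00", "12:00:00", "12:00:00"],
    "Temperature (C)": [10.0, 12.0, 6.0, 8.0, 20.0, 22.0],
    "sensor": [100.0, 120.0, 60.0, 80.0, 200.0, 220.0],
})

r = hourly_profile_corr(toy, "Time", "Temperature (C)", "sensor")
print(r)
```

A value near 1 or -1 on the real data would support the reading that the overnight dip tracks temperature, although it would not rule out a shared cause such as traffic.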

In [None]:

# use the row position as a proxy for elapsed time across the whole dataset
df2 = df[['Tungsten Oxide (NOx)', 'Tungsten Oxide (NO2)']].copy()
df2.insert(0, 'Count', range(len(df2)))

res = sns.lineplot(x='Count',
            y='value',
            hue='variable',
            data=pd.melt(df2, ['Count']))
res.set_xlabel("Time (measurement index)", fontsize=30)
res.set_ylabel("Concentration", fontsize=20)
res.tick_params(labelsize=24)


*Long-term trend analysis*
<br>
There are no distinctive changes throughout the year besides an overall dip near the end of the dataset.
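Hourly readings are noisy, so a long-term trend is easier to judge after smoothing. The sketch below uses a rolling mean as one possible approach. `smooth_trend` is a hypothetical helper, the window size is an assumption to tune, and the toy series is made up.

```python
import pandas as pd

def smooth_trend(series: pd.Series, window: int) -> pd.Series:
    """Rolling mean over `window` consecutive readings; the first window - 1 values are NaN."""
    return series.rolling(window=window, min_periods=window).mean()

# Toy series standing in for one sensor column ordered by time
toy_series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
trend = smooth_trend(toy_series, window=3)
print(trend)
```

For hourly data, a window of `24 * 7` readings would average over roughly one week, assuming no large gaps in the measurements.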