# Project 2 - NO2

## Description

The data are a subsample of 500 observations from a data set that originates in a study submitted by Magne Aldrin [28/Jul/04].
The aim of the study was to verify whether air pollution at a road is related to traffic volume and meteorological variables. The data were collected by the Norwegian Public Roads Administration.

Description of the variables:
- **x1** hourly values of the logarithm of the concentration of NO2 (particles), measured at Alnabru in Oslo, Norway, between October 2001 and August 2003.
- **x2** logarithm of the number of cars per hour
- **x3** temperature 2 meters above ground (Celsius degrees)
- **x4** wind speed (meters/second)
- **x5** temperature difference between 25 and 2 meters above ground (Celsius degrees)
- **x6** wind direction (degrees between 0 and 360)
- **x7** hour of day
- **x8** day number, counted from October 1, 2001

## Tasks

Provide a qualitative description of the variables in the dataset and of their distributions, using histograms, pie charts, tables or other graphical tools.
Then answer the following questions:
1. Split the hours of the day into day and night hours, taking the season into account, since it matters a great deal in Norway. Is there a significant difference in the mean concentration of NO2 between day and night hours?
2. Divide each of the climatic variables x3, x4, x5 into two groups (high temperature/low temperature, high wind speed/low wind speed, etc.) and check for each of them whether there is a significant difference in the mean concentration of NO2 between the two groups.
3. Divide x2 into two groups (high number of cars/low number of cars) and check whether there is a significant difference in the mean concentration of NO2 between the two groups.
4. Determine which of the variables x2-x8 has the most influence on the concentration of NO2 in the air, and discuss possible correlations among x2-x8.

## Solution

### Data preprocessing

In [3]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

%matplotlib inline

Read the dataset and rename the columns in a meaningful way.

In [8]:
column_map = {
    "x1": "log_no2",
    "x2": "log_cars_num",
    "x3": "temp_2",
    "x4": "wind_speed",
    "x5": "temp_diff_25_2",
    "x6": "wind_dir",
    "x7": "hour",
    "x8": "day"
}

data = pd.read_csv("NO2.csv").rename(columns=column_map)
data.head()

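| Unnamed: 0 | log_no2 | log_cars_num | temp_2 | wind_speed | temp_diff_25_2 | wind_dir | hour | day |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.71844 | 7.6912 | 9.2 | 4.8 | -0.1 | 74.4 | 20 | 600 |
| 1 | 3.10009 | 7.69894 | 6.4 | 3.5 | -0.3 | 56.0 | 14 | 196 |
| 2 | 3.31419 | 4.81218 | -3.7 | 0.9 | -0.1 | 281.3 | 4 | 513 |
| 3 | 4.38826 | 6.95177 | -7.2 | 1.7 | 1.2 | 74.0 | 23 | 143 |
| 4 | 4.3464 | 7.51806 | -1.3 | 2.6 | -0.1 | 65.0 | 11 | 115 |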


Create a **date** column that holds the exact date in *yyyy-mm-dd* format, counting the **day** value forward from October 1, 2001, as stated in the description.

In [16]:
start_date = pd.Timestamp("2001-10-01")
# Vectorized: add each row's day offset to the start date in one operation
data["date"] = start_date + pd.to_timedelta(data["day"], unit="D")
data.head()

| Unnamed: 0 | log_no2 | log_cars_num | temp_2 | wind_speed | temp_diff_25_2 | wind_dir | hour | day | date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.71844 | 7.6912 | 9.2 | 4.8 | -0.1 | 74.4 | 20 | 600 | 2003-05-24 |
| 1 | 3.10009 | 7.69894 | 6.4 | 3.5 | -0.3 | 56.0 | 14 | 196 | 2002-04-15 |
| 2 | 3.31419 | 4.81218 | -3.7 | 0.9 | -0.1 | 281.3 | 4 | 513 | 2003-02-26 |
| 3 | 4.38826 | 6.95177 | -7.2 | 1.7 | 1.2 | 74.0 | 23 | 143 | 2002-02-21 |
| 4 | 4.3464 | 7.51806 | -1.3 | 2.6 | -0.1 | 65.0 | 11 | 115 | 2002-01-24 |


Now we create some additional columns: **dow_name** and **dow_num** (day of week), **month_name** and **month_num**, and **year**. We can use them later in the analysis when aggregating over a time frame, e.g. comparing the mean **log_no2** value for each day of the week.

In [23]:
data["dow_name"] = data["date"].dt.day_name()
data["dow_num"] = data["date"].dt.day_of_week
data["month_num"] = data["date"].dt.month
data["month_name"] = data["date"].dt.month_name()
data["year"] = data["date"].dt.year
data.head()

| Unnamed: 0 | log_no2 | log_cars_num | temp_2 | wind_speed | temp_diff_25_2 | wind_dir | hour | day | date | dow | month | year | dow_name | dow_num | month_num | month_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.71844 | 7.6912 | 9.2 | 4.8 | -0.1 | 74.4 | 20 | 600 | 2003-05-24 | Saturday | May | 2003 | Saturday | 5 | 5 | May |
| 1 | 3.10009 | 7.69894 | 6.4 | 3.5 | -0.3 | 56.0 | 14 | 196 | 2002-04-15 | Monday | April | 2002 | Monday | 0 | 4 | April |
| 2 | 3.31419 | 4.81218 | -3.7 | 0.9 | -0.1 | 281.3 | 4 | 513 | 2003-02-26 | Wednesday | February | 2003 | Wednesday | 2 | 2 | February |
| 3 | 4.38826 | 6.95177 | -7.2 | 1.7 | 1.2 | 74.0 | 23 | 143 | 2002-02-21 | Thursday | February | 2002 | Thursday | 3 | 2 | February |
| 4 | 4.3464 | 7.51806 | -1.3 | 2.6 | -0.1 | 65.0 | 11 | 115 | 2002-01-24 | Thursday | January | 2002 | Thursday | 3 | 1 | January |


### Quantitative univariate analysis

**IMPORTANT** For drawing the plots I suggest using **matplotlib** or **seaborn**. Please LABEL THE X AND Y AXES on each plot and WRITE DOWN YOUR CONCLUSIONS after completing each task.

**Task 1.1** Draw the **histograms** of the variables *log_cars_num*, *wind_speed*, *wind_dir*.

In [20]:
## Your code here

**Task 1.2** Draw the joint **histogram** (two histograms of different colors in one plot) of the temperature at 2 meters and at 25 meters above the ground (you need to compute the latter value yourself).
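A minimal sketch of one way to build the 25 m temperature, shown on the five rows printed by `data.head()` earlier rather than on the full dataset. It assumes that `temp_diff_25_2` is the temperature at 25 m minus the temperature at 2 m, which the variable description suggests but does not state explicitly. The column name `temp_25` is my own choice.

```python
import pandas as pd
from matplotlib import pyplot as plt

# The five rows printed by data.head(); use the full `data` frame in the notebook.
sample = pd.DataFrame({
    "temp_2": [9.2, 6.4, -3.7, -7.2, -1.3],
    "temp_diff_25_2": [-0.1, -0.3, -0.1, 1.2, -0.1],
})

# Assumption: temp_diff_25_2 = T(25 m) - T(2 m), so T(25 m) = T(2 m) + difference.
sample["temp_25"] = sample["temp_2"] + sample["temp_diff_25_2"]

fig, ax = plt.subplots()
# Semi-transparent bars keep both distributions visible where they overlap.
ax.hist(sample["temp_2"], bins=10, alpha=0.5, label="2 m")
ax.hist(sample["temp_25"], bins=10, alpha=0.5, label="25 m")
ax.set_xlabel("Temperature (°C)")
ax.set_ylabel("Number of observations")
ax.legend()
```

If the sign convention turns out to be the opposite, the only change needed is to subtract the difference instead of adding it.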

In [21]:
## Your code here

**Task 1.3** Draw the **boxplots** of the variables *log_cars_num*, *temp_2*, *wind_speed*, *wind_dir* grouped by each time frame: *dow*, *month*, *year*. You should be able to analyze, for example, how the temperature varies across months on a single plot.

**Hint 1** You may need the **hue** parameter of the plotting function to group the variables by time. See the examples in the documentation for how to interpret it.

**Hint 2** You should end up with 12 plots in total, so try to streamline your code by calling the drawing function in a loop.

In [22]:
# Your code here

**Task 1.4** Convert NO2 and the number of cars back to the original scale by exponentiating those variables. Draw the histograms of the new variables and explain why the logarithm was used originally.
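The back-transformation can be sketched as follows, again on the five rows printed by `data.head()`. The description only says "logarithm", so treating it as the natural logarithm is an assumption here. The column names `no2` and `cars_num` are hypothetical.

```python
import numpy as np
import pandas as pd

# The five rows printed by data.head(); use the full `data` frame in the notebook.
sample = pd.DataFrame({
    "log_no2": [3.71844, 3.10009, 3.31419, 4.38826, 4.3464],
    "log_cars_num": [7.6912, 7.69894, 4.81218, 6.95177, 7.51806],
})

# Assumption: natural logarithm, so np.exp is the inverse transformation.
sample["no2"] = np.exp(sample["log_no2"])
sample["cars_num"] = np.exp(sample["log_cars_num"])

print(sample[["no2", "cars_num"]].round(1))
```

Plotting the histograms of the same variable on both scales side by side makes the comparison for this task easier.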

In [25]:
# Your code here

### Time series analysis

**Task 2.1** Plot the mean temperature for each month over the whole time period. You should end up with a plot where the x-axis is in *yyyy-mm* format and the y-axis is the mean temperature in the corresponding month. Make sure your x-values are sorted in ascending order.

**Hint** Create a new column *yyyy-mm* and call `sns.lineplot` with the correct parameters. Out of the box, you should get a line connecting the monthly points together with the corresponding confidence interval for the variable. Please refer to the documentation.
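A small sketch of the *yyyy-mm* column, using the five dates and temperatures printed by `data.head()`. The column name `year_month` is a hypothetical choice, and the monthly mean is computed with plain pandas so the numbers can be checked before plotting.

```python
import pandas as pd

# The five rows printed by data.head(); use the full `data` frame in the notebook.
sample = pd.DataFrame({
    "date": pd.to_datetime(
        ["2003-05-24", "2002-04-15", "2003-02-26", "2002-02-21", "2002-01-24"]
    ),
    "temp_2": [9.2, 6.4, -3.7, -7.2, -1.3],
})

# Zero-padded "yyyy-mm" strings sort chronologically even as plain text.
sample["year_month"] = sample["date"].dt.strftime("%Y-%m")
sample = sample.sort_values("year_month")

# Mean temperature per month, in ascending month order.
monthly_mean = sample.groupby("year_month")["temp_2"].mean()
print(monthly_mean)
```

In the notebook, `sns.lineplot(data=sample, x="year_month", y="temp_2")` should give the same means, since by default it aggregates repeated x values with the mean and shades a confidence interval around the line.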

In [24]:
# Your code here

**Task 2.2** Plot the monthly mean of the other variables in the dataset and state whether you see any trend or seasonal pattern.

In [26]:
# Your code here

**Task 2.3** Plot the mean temperature, number of cars, and wind direction for each hour of the day.

In [27]:
# Your code here

### Answering the project questions

*1. Split the hours of the day into day and night hours, taking the season into account, since it matters a great deal in Norway. Is there a significant difference in the mean concentration of NO2 between day and night hours?*

**Hint 1** The number of daylight hours differs between seasons. Look up the daylight hours for Norway on the internet and include that information in the dataframe, e.g. create separate columns with the start and end hours of daylight for each month. Then create a binary column *day-night* that specifies whether each record falls into the day or the night hours.
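One possible shape for this step is sketched below, using the hours and months of the five rows printed by `data.head()`. The daylight windows in `daylight_hours` are placeholder numbers made up for illustration, not looked-up values for Oslo, and the names `daylight_hours`, `day_start`, `day_end` and `is_day` are hypothetical.

```python
import pandas as pd

# PLACEHOLDER daylight windows: month -> (first daylight hour, first dark hour).
# These numbers are illustrative only, NOT real sunrise/sunset hours for Oslo;
# replace them with the values you look up.
daylight_hours = {
    1: (9, 15), 2: (8, 17), 3: (7, 18), 4: (6, 20),
    5: (5, 21), 6: (4, 22), 7: (4, 22), 8: (5, 21),
    9: (6, 19), 10: (7, 18), 11: (8, 16), 12: (9, 15),
}

# Hour and month of the five rows printed by data.head().
sample = pd.DataFrame({
    "hour": [20, 14, 4, 23, 11],
    "month_num": [5, 4, 2, 2, 1],
})

# Attach the daylight window of each record's month.
sample["day_start"] = sample["month_num"].map(lambda m: daylight_hours[m][0])
sample["day_end"] = sample["month_num"].map(lambda m: daylight_hours[m][1])

# True when the record's hour falls inside that month's daylight window.
sample["is_day"] = (sample["hour"] >= sample["day_start"]) & (
    sample["hour"] < sample["day_end"]
)
print(sample)
```

Whether the end hour itself counts as day is a design choice. The sketch treats the window as half-open, so it counts as night.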

**Hint 2** *Significant difference* means that you need to test the hypothesis that the means of the day and night samples are equal:

- H0: m1 = m2
- H1: m1 != m2

Please refer to our lectures for how we did that. Use `scipy.stats.t.ppf` or `scipy.stats.norm.ppf` to compute the inverse cumulative distribution function instead of looking values up in tables on the internet. Make the significance level *alpha* of the test an adjustable parameter.
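As a rough illustration of such a test, here is a sketch of a two-sided Welch t-test that takes its critical value from `scipy.stats.t.ppf`. Welch's unequal-variance variant is my assumption and may differ from the procedure used in the lectures. The function name `two_sample_mean_test` is hypothetical, and the two samples are synthetic stand-ins, not the NO2 data.

```python
import numpy as np
from scipy import stats


def two_sample_mean_test(a, b, alpha=0.05):
    """Two-sided Welch test of H0: mean(a) == mean(b) at significance level alpha."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = a.size, b.size
    v1, v2 = a.var(ddof=1), b.var(ddof=1)  # unbiased sample variances

    se2 = v1 / n1 + v2 / n2  # squared standard error of the mean difference
    t_stat = (a.mean() - b.mean()) / np.sqrt(se2)

    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2**2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))

    # Critical value from the inverse CDF instead of a printed table
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return t_stat, t_crit, abs(t_stat) > t_crit


# Synthetic stand-in samples, NOT the NO2 data
rng = np.random.default_rng(0)
day = rng.normal(3.8, 0.7, size=200)
night = rng.normal(3.5, 0.7, size=200)

t_stat, t_crit, reject_h0 = two_sample_mean_test(day, night, alpha=0.05)
print(f"t = {t_stat:.3f}, critical value = {t_crit:.3f}, reject H0: {reject_h0}")
```

Because `alpha` is a function argument, rerunning the test at another significance level only requires changing that one value. For large samples, `scipy.stats.norm.ppf` gives nearly the same critical value.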

**Hint 3** Before moving on to the actual hypothesis test, try to visualize the difference and form your own expectation of the result.

In [28]:
# Your cool ideas and code go here

*2. Divide each of the climatic variables x3, x4, x5 into two groups (high temperature/low temperature, high wind speed/low wind speed, etc.) and check for each of them whether there is a significant difference in the mean concentration of NO2 between the two groups.*

**Hint 1** Based on the distributions of the variables x3, x4, x5, choose a reasonable threshold between *high* and *low* values and explain your decision.
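A sketch of one possible split, shown for `wind_speed` on the five rows printed by `data.head()`. Using the median as the threshold is only an assumption made for illustration. The task asks you to justify your own choice from the distribution. The column name `wind_group` is hypothetical.

```python
import numpy as np
import pandas as pd

# The five rows printed by data.head(); use the full `data` frame in the notebook.
sample = pd.DataFrame({
    "log_no2": [3.71844, 3.10009, 3.31419, 4.38826, 4.3464],
    "wind_speed": [4.8, 3.5, 0.9, 1.7, 2.6],
})

# Assumption: the median is the cut-off; other thresholds may be equally defensible.
threshold = sample["wind_speed"].median()
sample["wind_group"] = np.where(sample["wind_speed"] > threshold, "high", "low")

# Group sizes and mean log-NO2 per group, as a first look before testing.
print(sample.groupby("wind_group")["log_no2"].agg(["count", "mean"]))
```

A median split keeps the two groups roughly equal in size, which helps the later test, but it ignores any natural break visible in the histogram.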

**Hint 2** Please refer to the *Hint 2* of the first task regarding the hypothesis testing.

**Hint 3** Before moving on to the actual hypothesis test, try to visualize the difference and form your own expectation of the result.

In [29]:
# Your statistical skills go here

*3. Divide x2 into two groups (high number of cars/low number of cars) and check whether there is a significant difference in the mean concentration of NO2 between the two groups.*

**Hint 1** Based on the distribution of the variable x2, choose a reasonable threshold between *high* and *low* values and explain your decision. Try converting the variable to the original scale and see whether that helps.

**Hint 2** Please refer to the *Hint 2* of the first task regarding the hypothesis testing.

**Hint 3** Before moving on to the actual hypothesis test, try to visualize the difference and form your own expectation of the result.

In [30]:
# Your passion for research goes here

*4. Determine which of the variables x2-x8 has the most influence on the concentration of NO2 in the air, and discuss possible correlations among x2-x8.*

**Step 1** Draw simple scatter plots of each of the variables x2-x8 against NO2 and draw conclusions about the relationship between those variables and NO2.

**Step 2** Calculate the Pearson correlation between each of the variables x2-x8 and NO2.

**Step 3 HARD LEVEL** Build a simple linear regression with x2-x8 as the X variables to predict the Y variable NO2. Measure the quality of the model. Analyze the feature weights and draw conclusions about each feature's contribution to the overall result.
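Steps 2 and 3 might look roughly like the sketch below. The data are synthetic and generated inside the block, so the printed numbers say nothing about the real NO2 dataset. Only three of the features are used, and standardizing them before fitting is my own assumption about how to make the weights comparable. The regression uses ordinary least squares via `np.linalg.lstsq`. A library model would work just as well.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data, NOT the NO2 data
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "log_cars_num": rng.normal(7.0, 1.0, n),
    "wind_speed": rng.normal(3.0, 1.5, n),
    "temp_2": rng.normal(1.0, 6.0, n),
})
df["log_no2"] = (
    1.0 + 0.4 * df["log_cars_num"] - 0.15 * df["wind_speed"] + rng.normal(0.0, 0.3, n)
)

features = ["log_cars_num", "wind_speed", "temp_2"]

# Step 2: Pearson correlation of each feature with the target
pearson = df[features].corrwith(df["log_no2"])

# Step 3: ordinary least squares on standardized features,
# so that weights are comparable across different units
X = (df[features] - df[features].mean()) / df[features].std()
A = np.column_stack([np.ones(n), X.to_numpy()])  # prepend the intercept column
y = df["log_no2"].to_numpy()
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

weights = pd.Series(beta[1:], index=features)
residuals = y - A @ beta
r2 = 1 - residuals.var() / y.var()  # coefficient of determination

print(pearson.round(3), weights.round(3), f"R^2 = {r2:.3f}", sep="\n")
```

With standardized inputs, each weight is the change in the target per one standard deviation of that feature. Strongly correlated features can still share or trade off weight, which is why the correlations among x2-x8 need to be inspected as well.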

In [32]:
# Your love of Python coding goes directly here