# Project

Name: Stacy Waweru 

GitHub URL: https://github.com/Waweru-Stacy-123

## Business Overview

As a member of the data science team, I have been tasked with analysing the air quality dataset and providing insights to the business. The data was collected by a multi-gas sensor array deployed in the field in an Italian city.
The recordings span March 2004 to February 2005 (one year), the period during which the sensor was operated, at an industrial location with high levels of pollution. Ground-truth hourly averaged concentrations for CO, non-methanic hydrocarbons, benzene, total nitrogen oxides (NOx) and nitrogen dioxide (NO2) were provided by a co-located certified reference analyzer. Evidence of cross-sensitivities, as well as both concept and sensor drift, is present as described in De Vito et al., Sens. and Act. B, Vol. 129, 2, 2008 (citation required), and eventually affects the sensors' concentration estimation capabilities.

        

## Purpose of Research
To develop a predictive model that estimates the Air Quality Index (AQI) using sensor readings and meteorological data, helping city planners and health officials monitor and manage air quality more effectively.

Primary Objective:  
Predict Air Quality Index (AQI) Levels Based on Environmental Factors and Pollutant Concentrations

Objectives: 
1. To determine the pollutants that have the most significant impact on AQI changes 
2. To understand how weather conditions influence the concentrations of key pollutants  
3. To build and evaluate machine learning models in order to find the most accurate approach for predicting AQI


## The Data

The data is provided as an Excel file and covers a period of one year.

### Data Exploration

In [1]:
# import necessary libraries

import pandas as pd
import os
from numbers import Number
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error



ImportError: cannot import name 'int' from 'numpy' (C:\Users\stacy\anaconda3\envs\learn-env\lib\site-packages\numpy\__init__.py)

In [None]:
# Read the data

data_path = r"AirQualityUCI.xlsx"

data_df = pd.read_excel(data_path)


In [None]:
# Display the first 5 rows of the data
data_df.head() 

In [None]:
# Display the last 5 rows of the data
data_df.tail()

In [None]:
# Determine the shape of the data
data_df.shape

In [None]:
# Display the columns of the data
data_df.columns

The dataset contains hourly data for 9358 hours and 15 features.   

1. Date of observation

2. Time of observation

3. CO (GT) - True hourly averaged CO concentration in mg/m^3 (reference analyzer)

4. PT08.S1 (CO) - (tin oxide) hourly averaged sensor response (nominally CO targeted)

5. NMHC (GT) - True hourly averaged overall Non-Methanic Hydrocarbons concentration in microg/m^3 (reference analyzer)

6. C6H6 (GT) - True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)

7. PT08.S2 (NMHC) - (titania) hourly averaged sensor response (nominally NMHC targeted)

8. NOx (GT) - True hourly averaged NOx concentration in ppb (reference analyzer)

9. PT08.S3 (NOx) - (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)

10. NO2 (GT) - True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)

11. PT08.S4 (NO2) - (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)

12. PT08.S5 (O3) - (indium oxide) hourly averaged sensor response (nominally O3 targeted)

13. T - Temperature in °C

14. RH - Relative Humidity (%)

15. AH - Absolute Humidity

### Data Analysis

In [None]:
# Information about the data
data_df.info()

At first glance the data does not appear to have missing values. However, this dataset tags missing values with -200 rather than leaving them empty, so we check whether that marker appears in the dataset.

In [None]:
# Determine the missing values in the data

# Flag rows where any column holds the -200 missing-value marker
missing_values = (data_df == -200).any(axis=1)

# Display the rows with missing values
missing_rows = data_df[missing_values]
missing_rows

From the above, it is clear that there are a lot of missing values. The missing values represented by -200 will be replaced with NaN to make them easier to identify.

In [None]:
# Replace missing values (-200) with NaN

data_df.replace(-200, np.nan, inplace=True)
data_df
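
To illustrate why the -200 marker is replaced before any statistics are computed, here is a minimal sketch on a made-up frame. The column names `co` and `temp` and all values are hypothetical and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy readings; -200 marks a missing measurement.
toy = pd.DataFrame({
    "co": [2.0, 3.0, -200.0, 4.0],
    "temp": [10.0, -200.0, 12.0, 14.0],
})

# Count the marker per column before changing anything.
sentinel_counts = (toy == -200).sum()

# With the marker left in place, the mean is pulled far below any real reading.
biased_mean = toy["co"].mean()

# After replacement, pandas skips NaN when aggregating.
cleaned = toy.replace(-200, np.nan)
clean_mean = cleaned["co"].mean()

print(sentinel_counts)
print(biased_mean, clean_mean)
```

In this toy case the mean of `co` moves from -47.75 to 3.0 once the marker is treated as missing, which is why the replacement comes before `describe()` or any correlation analysis.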

In [None]:
# Determine the missing values in the data

data_df.isnull().sum()

We will need to find a way to handle the missing values. Some rows will need to be dropped while others may need to be retained.

In [None]:
# Determine the percentage of missing values in the data
percentage_missing = data_df.isnull().sum() * 100 / len(data_df)
percentage_missing

If less than 30% of a column is missing, the column remains in the dataframe and we analyze its missing values. If more than 50% is missing, we drop the whole column, since it carries too little information to be relevant. The code below applies only the 50% cutoff, so columns that fall between the two thresholds are also retained.

In [None]:
# Drop the columns with more than 50% missing values    

data_df.dropna(thresh=0.5*len(data_df), axis=1, inplace=True)

data_df.isnull().sum()
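
The 50% rule can also be written as an explicit filter on the missing percentage, which makes it easy to see which columns are dropped. This is a sketch on a hypothetical three-column frame, not the notebook's actual output.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'a' is complete, 'b' is 25% missing, 'c' is 75% missing.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# Share of missing values per column, as a percentage.
pct_missing = toy.isnull().mean() * 100

# Keep columns where at most 50% of the values are missing.
kept = toy.loc[:, pct_missing <= 50]
dropped = list(pct_missing[pct_missing > 50].index)

print(pct_missing)
print("kept:", list(kept.columns), "dropped:", dropped)
```

Filtering on `pct_missing` keeps the same columns as `dropna(thresh=0.5 * len(df), axis=1)`, because `thresh` is the minimum number of non-missing values a column needs in order to be kept.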

In [None]:
data_df.info()


For columns whose values are floats or integers, we will fill the missing values with the column's mean or median.

In [None]:
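# Fill the missing values in the data with the mean of the respective columns
columns_with_missing_values = data_df.columns[data_df.isnull().any()]
for column in columns_with_missing_values:
    # Assign back rather than calling fillna(inplace=True) on a column selection,
    # which newer pandas versions warn about and may not apply to data_df
    data_df[column] = data_df[column].fillna(data_df[column].mean())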

data_df.isnull().sum()
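
The choice between mean and median matters when a column is skewed. The sketch below uses a made-up series with one extreme value to show how the two fill values can differ; the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed readings with one missing value at the end.
skewed = pd.Series([1.0, 2.0, 2.0, 3.0, 100.0, np.nan])

mean_value = skewed.mean()      # pulled upward by the single extreme reading
median_value = skewed.median()  # stays near the bulk of the readings

mean_filled = skewed.fillna(mean_value)
median_filled = skewed.fillna(median_value)

print(mean_value, median_value)
```

Here the mean is 21.6 while the median is 2.0, so a median fill may be the safer default for pollutant columns with occasional spikes. Whether that applies to this dataset depends on the distributions shown by `describe()`.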


From the above, we can see that there are no missing values in the dataframe. We can proceed to analyze the data. 

In [None]:
# Describe the data

data_df.describe()

In [None]:
# Determine the correlation between the variables
correlation_matrix = data_df.select_dtypes(include=[np.number]).corr()
correlation_matrix              
sns.heatmap(correlation_matrix, annot=True)
plt.show()



The closer the absolute value of a coefficient is to 1, the stronger the correlation between the two columns; the sign gives the direction of the relationship. The 1s on the diagonal are each column's correlation with itself.
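
To read a correlation matrix with a particular target in mind, it can help to rank the other columns by the absolute value of their correlation with it. The sketch below does this on synthetic data; the column names `target`, `strong_pos`, `strong_neg` and `noise` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data: two columns built from the target, one unrelated column.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
toy = pd.DataFrame({
    "target": x,
    "strong_pos": 2 * x + rng.normal(scale=0.1, size=n),
    "strong_neg": -x + rng.normal(scale=0.1, size=n),
    "noise": rng.normal(size=n),
})

# Correlation of every other column with the target.
corr_with_target = toy.corr()["target"].drop("target")

# Rank by strength, ignoring the sign.
ranked = corr_with_target.abs().sort_values(ascending=False)

print(corr_with_target)
print(ranked)
```

Ranking on the absolute value keeps strongly negative relationships, such as `strong_neg` here, from being mistaken for weak ones.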

In [None]:
# Split the data into features and target variable
data_df = data_df.copy()

# Define the target variable
target = data_df['CO(GT)']  


# Drop the target variable from the features


In [None]:
# X and y variables: drop the target column (CO(GT)) from the features
X = data_df.drop('CO(GT)', axis=1)
y = data_df['CO(GT)']

In [None]:
# Split the features and target into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.2, random_state=42)
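
A possible next step, given the `LinearRegression` and `mean_squared_error` imports at the top of the notebook, is to fit a baseline model on the training split and score it on the held-out split. The sketch below is an assumption about that workflow and runs on synthetic data, not on the air quality dataset. On the real data, non-numeric columns such as Date and Time would need to be dropped or encoded before fitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic features and a target that is linear in them plus small noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.5 + rng.normal(scale=0.1, size=500)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training split only, then score on unseen rows.
model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))

print("coefficients:", model.coef_)
print("test MSE:", mse)
```

Scoring on the test split rather than the training split gives an estimate of how the model behaves on hours it has not seen.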