# **DATA ANALYTICS ON CORONA VIRUS**

This project analyses the effect of the coronavirus pandemic on air quality in India, using daily air quality readings recorded for Indian cities. The data is analysed in a Kaggle R notebook, which provides computing power and pre-installed R packages.

*The data is further split into vehicle pollution and industrial pollution, to better understand the effect of each on air quality.*

In [None]:
library(tidyverse)
list.files(path = "../input")

In [None]:
city <- read.csv("../input/air-quality-data-in-india/city_day.csv")

Here we read a CSV file that contains **26219** rows and **16** variables. As mentioned earlier, we will split this dataset into two datasets called veh_pollution (pollution caused by vehicles) and indst_pollution (pollution caused by industries). The columns are assigned to each dataset according to the hazardous gases that each source generates.

Note: Here, ozone (O3) does not mean the ozone layer but the harmful ground-level gas associated with both vehicle and industrial emissions, which can cause breathing problems and lung diseases in humans.

Variables we need for veh_pollution Dataset are:
* City
* Date
* NO2 - Nitrogen Dioxide
* CO - Carbon Monoxide
* Benzene
* Toluene
* Xylene
* O3 - Ozone
* AQI_Bucket - Air Quality Index category (Good, Satisfactory, Moderate, Poor, Very Poor, Severe)

Variables we need for indst_pollution Dataset are:
* City
* Date
* NO - Nitric Oxide
* CO - Carbon Monoxide
* SO2 - Sulfur Dioxide
* O3 - Ozone
* AQI_Bucket - Air Quality Index category (Good, Satisfactory, Moderate, Poor, Very Poor, Severe)


In [None]:
head(city)
dim(city)
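
Since rows with missing values are dropped later on, it can help to see how many values are missing in each column first. The following is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not taken from `city_day.csv`), assuming that missing category labels may arrive as empty strings:

```r
# Hypothetical toy data standing in for the `city` data frame
toy <- data.frame(
  City = c("A", "A", "B", "B"),
  NO2 = c(10.5, NA, 20.1, 18.3),
  O3 = c(NA, NA, 30.2, 25.0),
  AQI_Bucket = c("Good", "", "Poor", "Moderate"),
  stringsAsFactors = FALSE
)

# Treat empty category labels as missing values
toy$AQI_Bucket[toy$AQI_Bucket == ""] <- NA

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows without any missing value (the rows na.omit() would keep)
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```

Running the same two calls on `city` would show which gases have the most gaps before any rows are removed.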

In [None]:
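# Inspect the distinct AQI categories present in the data
unique(city$AQI_Bucket)
# Store the categories as a factor ordered from best to worst air quality;
# values that are not listed (e.g. blank entries) become NA
city$AQI_Bucket <- factor(city$AQI_Bucket, levels = c("Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"))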

In [None]:
# Split the date into year, month and day columns
new_data <- city %>%
    separate(Date, sep = "-", into = c("Year", "month", "day"))

# Visualize the AQI category of each city by year
ggplot(new_data, aes(x = Year, y = City, fill = AQI_Bucket)) +
    geom_tile() +
    ggtitle("Air Quality Of Cities From 2015-2020") +
    theme_classic() +
    theme(
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 14, face = "bold"),
        title = element_text(size = 14, face = "bold")) +
    xlab("Years") +
    ylab("Cities")

 **GETTING AND CLEANING DATA FOR VEHICLE AND INDUSTRIAL POLLUTION**

In [None]:
# Select the columns required for industrial pollution
indst_pollution <- city %>%
    select(City, Date, NO, CO, SO2, O3, AQI_Bucket) %>%
    # %in% tests membership; == would recycle the vector and drop matching rows
    filter(AQI_Bucket %in% c("Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe")) %>%
    group_by(AQI_Bucket) %>%
    arrange(City)
# Drop rows with a missing value in any of the selected columns
indst_pollution <- na.omit(indst_pollution)
indst_pollution$AQI_Bucket <- factor(indst_pollution$AQI_Bucket, levels = c("Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"))
indst_pollution$Date <- as.Date(indst_pollution$Date)
head(indst_pollution)

In [None]:
# Select the columns required for vehicle pollution
veh_pollution <- city %>%
    select(City, Date, NO2, CO, Benzene, Toluene, Xylene, O3, AQI_Bucket) %>%
    # %in% tests membership; == would recycle the vector and drop matching rows
    filter(AQI_Bucket %in% c("Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe")) %>%
    group_by(AQI_Bucket) %>%
    arrange(City)
# Drop rows with a missing value in any of the selected columns
veh_pollution <- na.omit(veh_pollution)
veh_pollution$AQI_Bucket <- factor(veh_pollution$AQI_Bucket, levels = c("Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"))
veh_pollution$Date <- as.Date(veh_pollution$Date)
head(veh_pollution)
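
When rows are filtered on a set of category labels, it is worth keeping in mind how `==` and `%in%` differ. The sketch below uses made-up vectors (`labels` and `wanted` are hypothetical names) to show that `==` compares position by position after recycling the shorter vector, while `%in%` tests whether each value belongs to the set:

```r
# Hypothetical category labels and the set of labels we want to keep
labels <- c("Good", "Poor", "Good", "Severe", "Poor", "Good")
wanted <- c("Good", "Poor")

# `==` recycles `wanted` along `labels` and compares element by element,
# so a wanted label is missed whenever it sits at the "wrong" position
eq_match <- labels == wanted

# `%in%` checks each label for membership in `wanted`
in_match <- labels %in% wanted

print(sum(eq_match))  # rows kept by ==
print(sum(in_match))  # rows kept by %in%
```

In this toy case `%in%` keeps five of the six labels while `==` keeps only three, which is why `%in%` is the safer choice inside `filter()`.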


**Here we have the vehicle pollution data for different cities per day.
Now let's analyse how the different variables correlate with O3, i.e. how strongly the different gases are related to ozone.**

In [None]:
veh_year <- veh_pollution %>%
    separate(Date, sep="-", into = c("Year", "month", "day"))%>%
    group_by(Year)%>%
    summarize(
              NO2 = sum(NO2),
              CO= sum(CO),
              Benzene= sum(Benzene),
              Toluene= sum(Toluene),
              Xylene= sum(Xylene),
              Ozone= sum(O3))
veh_year

In [None]:
library(GGally)
ggpairs(data=veh_year, columns=2:7, title="AQI data(Vehicle Pollution)")

**The plot above shows the relationship between the different gases: the closer a correlation value is to 1 (or to -1 for an inverse relationship), the more strongly the two gases are related to each other.**

**As we can see here, the gases most strongly related to *ozone* are as follows:**
1. Nitrogen Dioxide
2. Toluene
3. Xylene
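
The correlation values shown in a pairs plot can also be computed directly with base R's `cor()`. This is a minimal sketch on made-up yearly totals (`toy_year` and its numbers are hypothetical, not results from the dataset). With only one value per year, any such correlation rests on very few points and should be read as a rough indication:

```r
# Hypothetical yearly totals, one row per year
toy_year <- data.frame(
  Year = c("2015", "2016", "2017", "2018"),
  NO2 = c(10, 20, 30, 40),
  CO = c(8, 6, 5, 1),
  Ozone = c(12, 19, 33, 41)
)

# Pearson correlation matrix of the numeric columns
cor_matrix <- cor(toy_year[, c("NO2", "CO", "Ozone")])
print(round(cor_matrix, 2))

# Correlation of every gas with ozone, strongest first (by absolute value)
ozone_cor <- cor_matrix[, "Ozone"]
ozone_cor <- ozone_cor[names(ozone_cor) != "Ozone"]
print(ozone_cor[order(abs(ozone_cor), decreasing = TRUE)])
```

Applied to `veh_year`, the last step would rank the gases by the strength of their relationship with ozone instead of reading the values off the plot.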



> In India the first coronavirus case was recorded in 2020, so I created two more datasets, one for the period before the pandemic and one for the period after it began.
> For pre-corona, I took the subset of the data between the dates "2015-01-01" and "2019-12-31",
> **and**
> for post-corona, I took the subset of the data dated "2020-01-01" or later.
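
The date-based split can be sketched on a small made-up data frame to check that the boundary dates land in the intended subset. `toy` and `cutoff` are hypothetical names used only for illustration:

```r
# Hypothetical daily readings around the cutoff date
toy <- data.frame(
  Date = as.Date(c("2018-06-01", "2019-12-31", "2020-01-01", "2020-04-15")),
  O3 = c(30, 28, 25, 20)
)

# First day that counts as "post"
cutoff <- as.Date("2020-01-01")

# Rows strictly before the cutoff are "pre", all others are "post"
toy_pre <- subset(toy, Date < cutoff)
toy_post <- subset(toy, Date >= cutoff)

print(range(toy_pre$Date))
print(range(toy_post$Date))
```

Comparing against a `Date` object rather than a plain string makes the intent explicit and keeps the two subsets from overlapping.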

In [None]:
#PRE AND POST CORONA EFFECT ON INDUSTRIAL POLLUTION 
indst_pollution_pre <- subset(indst_pollution,
                           Date >= "2015-01-01" & Date <= "2019-12-31")
indst_pollution_post <- subset(indst_pollution,
                            Date >= "2020-01-01")

In [None]:
#PRE AND POST CORONA EFFECT ON VEHICLE POLLUTION 
veh_pollution_pre <- subset(veh_pollution,
                           Date >= "2015-01-01" & Date <= "2019-12-31")
veh_pollution_post <- subset(veh_pollution,
                            Date >= "2020-01-01")

In [None]:
veh_pre_percentage = veh_pollution_pre %>% group_by(AQI_Bucket) %>%
  summarise(count=n()) %>%
  mutate(Percentage=count/sum(count)) 

veh_post_percentage = veh_pollution_post %>% group_by(AQI_Bucket) %>%
  summarise(count=n()) %>%
  mutate(Percentage=count/sum(count))

#visualizing data
ggplot(veh_pre_percentage , aes(x=AQI_Bucket , y=Percentage ))+
geom_bar(fill = "skyblue", stat = "identity")+
ggtitle("Pre COVID-19 (Vehicle Pollution)")+
theme_classic()+
theme(legend.position = "none",
     axis.text.x=element_text(size=12),
    axis.title=element_text(size=14,face="bold"),
     title = element_text(size = 15),
     axis.ticks.y = element_blank(),
     axis.text.y = element_blank(),
     axis.line.y = element_blank())+
xlab("Air Quality Index")+
scale_y_continuous(limits=c(0,0.5)) + 
  geom_text(data=veh_pre_percentage, aes(label=paste0(round(Percentage*100,1),"%"),
                               y=Percentage+0.012), size=4)


ggplot(veh_post_percentage , aes(AQI_Bucket , y=Percentage ))+
geom_bar(fill = "lightblue", stat="identity")+
ggtitle("Post COVID-19 (Vehicle Pollution)")+
theme_classic()+
theme(legend.position = "none",
     axis.text.x=element_text(size=12),
    axis.title=element_text(size=14,face="bold"),
     title = element_text(size = 15),
     axis.ticks.y = element_blank(),
     axis.text.y = element_blank(),
     axis.line.y = element_blank())+
xlab("Air Quality Index")+ 
  geom_text(data=veh_post_percentage, aes(label=paste0(round(Percentage*100,1),"%"),
                               y=Percentage+0.012), size=4)



**The charts above show the pre- and post-corona effect on air pollution from vehicles. There is a drastic decrease in the share of severe, poor and very poor air quality readings, whereas the share of satisfactory and good readings increases.**
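
To put a number on such a shift, the pre and post shares can be placed side by side and subtracted. The sketch below uses made-up counts (`pre`, `post` and their values are hypothetical, not results from the dataset):

```r
# Hypothetical number of readings per AQI category
pre <- data.frame(AQI_Bucket = c("Good", "Moderate", "Poor"), count = c(10, 50, 40))
post <- data.frame(AQI_Bucket = c("Good", "Moderate", "Poor"), count = c(30, 50, 20))

# Share of each category within its own period
pre$Percentage <- pre$count / sum(pre$count)
post$Percentage <- post$count / sum(post$count)

# Join the two periods on the category and compute the change in percentage points
comparison <- merge(pre[, c("AQI_Bucket", "Percentage")],
                    post[, c("AQI_Bucket", "Percentage")],
                    by = "AQI_Bucket", suffixes = c("_pre", "_post"))
comparison$change_pts <- round(
  (comparison$Percentage_post - comparison$Percentage_pre) * 100, 1)
print(comparison)
```

The same steps could be applied to `veh_pre_percentage` and `veh_post_percentage`, since both already hold an `AQI_Bucket` and a `Percentage` column.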

Now we will analyse the yearly totals of the gases produced by **industrial pollution**.

In [None]:
indst_year <- indst_pollution %>%
    separate(Date, sep="-", into = c("Year", "month", "day"))%>%
    group_by(Year)%>%
    summarize(
              NO = sum(NO),
              CO= sum(CO),
              SO2= sum(SO2),
              Ozone= sum(O3))
ggpairs(data=indst_year, columns=2:5, title="AQI data(Industrial Pollution)")

The plot above shows how strongly each gas produced by industries is correlated with the others.

In [None]:
indst_pre_percentage = indst_pollution_pre %>% group_by(AQI_Bucket) %>%
  summarise(count=n()) %>%
  mutate(Percentage=count/sum(count)) 

indst_post_percentage = indst_pollution_post %>% group_by(AQI_Bucket) %>%
  summarise(count=n()) %>%
  mutate(Percentage=count/sum(count))

#visualizing data
ggplot(indst_pre_percentage , aes(x=AQI_Bucket , y=Percentage ))+
geom_bar(fill = "skyblue", stat = "identity")+
ggtitle("Pre COVID-19 (Industrial Pollution)")+
theme_classic()+
theme(legend.position = "none",
     axis.text.x=element_text(size=12),
    axis.title=element_text(size=14,face="bold"),
     title = element_text(size = 15),
     axis.ticks.y = element_blank(),
     axis.text.y = element_blank(),
     axis.line.y = element_blank())+
xlab("Air Quality Index")+
scale_y_continuous(limits=c(0,0.5)) + 
  geom_text(data=indst_pre_percentage, aes(label=paste0(round(Percentage*100,1),"%"),
                               y=Percentage+0.012), size=4)


ggplot(indst_post_percentage , aes(AQI_Bucket , y=Percentage ))+
geom_bar(fill = "lightblue", stat="identity")+
ggtitle("Post COVID-19 (Industrial Pollution)")+
theme_classic()+
theme(legend.position = "none",
     axis.text.x=element_text(size=12),
    axis.title=element_text(size=14,face="bold"),
     title = element_text(size = 15),
     axis.ticks.y = element_blank(),
     axis.text.y = element_blank(),
     axis.line.y = element_blank())+
xlab("Air Quality Index")+ 
  geom_text(data=indst_post_percentage, aes(label=paste0(round(Percentage*100,1),"%"),
                               y=Percentage+0.012), size=4)


> The bar charts above show that there is a drastic fall in air pollution caused by industries, which shifts the AIR QUALITY INDEX towards satisfactory and good.

# **SUMMARY**

> The coronavirus is affecting people's lifestyles in many ways. Many countries, including India, have asked people to quarantine themselves and have locked down entire countries, which means people are not allowed to go out. Anyone who needs to go out must follow the guidelines given by the government of India.
* Here, I analysed the air quality data of India for the pre- and post-corona effects on air pollution created by vehicles and industries, and I reached the conclusion that this pandemic is having a positive effect on air quality.
* Most industries are closed and many of them are trying to adopt work from home, which can be a reason why we see a continuous fall in the pollution levels of the cities.
* Another cause of air pollution is vehicles, which also seems to be decreasing, as most people have quarantined themselves in their houses to avoid getting infected by the virus. Another reason would be that most people are avoiding long-distance travel and leave their houses only to buy the essentials they need.