# Air Quality and COVID-19 Infection Rates
This project investigates a possible relationship between air quality and COVID-19 infection rates. To that end, we merge a WHO air quality dataset with a WHO COVID-19 dataset, train a linear regression model on the merged data, and use the model's predictions to test its accuracy.
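The merge-then-fit pipeline described above can be sketched on toy data. This is a minimal sketch under stated assumptions, not the project's actual method. The column names (`country_name`, `pm25_concentration`, `Country`, `Cumulative_cases`) come from the two datasets, but every value and country name below is made up. The join is on country name because dataset A carries a 3-letter `iso3` code while dataset B carries a 2-letter `Country_code`, so the codes cannot be matched directly. `np.polyfit` stands in for whichever linear regression implementation the project ends up using, and `Cumulative_cases` stands in for the infection rate target.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two WHO tables; all names and values are made up.
air = pd.DataFrame({
    "country_name": ["Country A", "Country A", "Country B",
                     "Country C", "Country C", "Country D"],
    "pm25_concentration": [10.0, 14.0, 30.0, 22.0, None, 41.0],
})
covid = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C", "Country E"],
    "Cumulative_cases": [1200, 4100, 2600, 900],
})

# Collapse city/year rows to one mean PM2.5 value per country (NaN is skipped).
air_by_country = air.groupby("country_name", as_index=False)["pm25_concentration"].mean()

# Inner join keeps only countries that appear in both tables.
merged = air_by_country.merge(
    covid, left_on="country_name", right_on="Country", how="inner"
)

# Least-squares line: Cumulative_cases ~ slope * pm25_concentration + intercept.
slope, intercept = np.polyfit(
    merged["pm25_concentration"], merged["Cumulative_cases"], deg=1
)
predicted = slope * merged["pm25_concentration"] + intercept
```

On the real data, country names may not match exactly between the two sources, so a name or code mapping would likely be needed before the join.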

Some thoughts about the web app:
- One option is an app in which the Air Quality features and COVID-19 case counts are the independent variables and the COVID-19 infection rate (%) is the dependent variable.
- Sliders and input fields would set the values of the independent variables.
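The notebook does not define the infection rate at this point. One possible definition, assumed here purely for illustration, is cumulative reported cases as a percentage of population. The function name `infection_rate_percent` and its parameters are hypothetical.

```python
def infection_rate_percent(cumulative_cases: int, population: int) -> float:
    """Return cumulative reported cases as a percentage of the population.

    Assumed definition for illustration; the project may define it differently.
    """
    if population <= 0:
        raise ValueError("population must be positive")
    return 100.0 * cumulative_cases / population
```

For example, `infection_rate_percent(50, 1000)` gives 5.0, that is, 5% of the population.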

**Important:** Because this is an exploratory project, we do not know in advance whether a significant relationship exists between the Air Quality and COVID-19 datasets. Furthermore, our results, significant or not, can only lend support to hypotheses about Air Quality and COVID-19; they cannot confirm them.

For full project motivations, please see `CS180_G16_ProjectProposal.pdf`.

In [16]:
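# Clear the namespace and show the value of every expression in a cell,
# not only the last one.
%reset -f
%config InteractiveShell.ast_node_interactivity = 'all'

import pandas as pd


def newline():
    """Print a separator line followed by a blank line."""
    print("---------------------------\n")


# Load the two WHO datasets.
dataA = pd.read_csv('datasets/A-WHO-air-quality.csv')
dataB = pd.read_csv('datasets/B-WHO-covid-infections-deaths.csv')

# Dataset A: first rows, summary statistics, column types and non-null counts.
print("Dataset A: WHO Air Quality (2023)")
print("Dataset A Contents")
dataA.head()
print("Additional Details")
dataA.describe()
dataA.info()

newline()

# Dataset B: the same three views.
print("Dataset B: WHO COVID-19 Cases and Deaths (2023)")
print("Dataset B Contents")
dataB.head()
print("Additional Details")
dataB.describe()
dataB.info()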

Dataset A: WHO Air Quality (2023)
Dataset A Contents


| | who_region | iso3 | country_name | city | year | version | pm10_concentration | pm25_concentration | no2_concentration | pm10_tempcov | pm25_tempcov | no2_tempcov | type_of_stations | reference | web_link | population | population_source | latitude | longitude | who_ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3_Sear | IND | India | Chennai | 2018 | version 2022 | NaN | 30.0 | NaN | NaN | 91.0 | NaN | NaN | U.S. Department of State, United States Enviro... | https://www.airnow.gov/index.cfm?action=airnow... | 9890427.0 | NaN | 13.08784 | 80.27847 | 1 |
| 1 | 3_Sear | IND | India | Solapur | 2016 | version 2022, version 2018 | NaN | 39.0 | NaN | NaN | 99.0 | NaN | NaN | Central Pollution Control Board India, Environ... | NaN | 985568.0 | NaN | 17.659919 | 75.906391 | 1 |
| 2 | 3_Sear | IND | India | Chennai | 2019 | version 2022 | NaN | 39.0 | NaN | NaN | 85.0 | NaN | NaN | U.S. Department of State, United States Enviro... | [[["EPA AirNow DOS","http://airnow.gov/index.c... | 9890427.0 | NaN | 13.08784 | 80.27847 | 1 |
| 3 | 3_Sear | IND | India | Hyderabad | 2019 | version 2022 | NaN | 42.0 | NaN | NaN | 87.0 | NaN | NaN | U.S. Department of State, United States Enviro... | [[["EPA AirNow DOS","http://airnow.gov/index.c... | 8943523.0 | NaN | 17.38405 | 78.45636 | 1 |
| 4 | 3_Sear | IND | India | Pune | 2017 | version 2022 | NaN | 43.0 | NaN | NaN | NaN | NaN | NaN | Central Pollution Control Board India, Environ... | http://www.cpcb.gov.in/CAAQM/ | 5727530.0 | NaN | 18.50532 | 73.823839 | 1 |


Additional Details


| | year | pm10_concentration | pm25_concentration | no2_concentration | pm10_tempcov | pm25_tempcov | no2_tempcov | population | population_source | latitude | longitude | who_ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 41364.0 | 28177.0 | 21566.0 | 26704.0 | 21344.0 | 16114.0 | 22991.0 | 17161.0 | 0.0 | 41115.0 | 41112.0 | 41364.0 |
| mean | 2016.20571 | 30.68751 | 19.977822 | 18.790743 | 93.146317 | 92.542323 | 93.863903 | 785843.1 | NaN | 40.742536 | 12.025232 | 0.999782 |
| std | 3.120684 | 30.371733 | 17.907894 | 11.909244 | 9.764701 | 10.82449 | 8.141595 | 2145706.0 | NaN | 16.882423 | 57.469974 | 0.014749 |
| min | 2000.0 | 1.0 | 0.0 | 0.0 | 10.0 | 10.0 | 11.0 | 0.0 | NaN | -76.010867 | -159.36624 | 0.0 |
| 25% | 2014.0 | 16.0 | 9.0 | 10.0 | 92.0 | 91.0 | 93.0 | 31292.0 | NaN | 37.1568 | -0.8086 | 1.0 |
| 50% | 2016.0 | 22.0 | 13.0 | 17.0 | 97.0 | 97.0 | 96.0 | 159063.0 | NaN | 43.8397 | 10.8817 | 1.0 |
| 75% | 2019.0 | 31.0 | 25.0 | 25.0 | 99.0 | 99.0 | 99.0 | 625290.0 | NaN | 49.227045 | 24.179 | 1.0 |
| max | 2022.0 | 540.0 | 436.0 | 211.0 | 100.0 | 100.0 | 100.0 | 38001020.0 | NaN | 176.891634 | 178.45 | 1.0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41364 entries, 0 to 41363
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   who_region          41364 non-null  object 
 1   iso3                41364 non-null  object 
 2   country_name        41364 non-null  object 
 3   city                41252 non-null  object 
 4   year                41364 non-null  int64  
 5   version             41219 non-null  object 
 6   pm10_concentration  28177 non-null  float64
 7   pm25_concentration  21566 non-null  float64
 8   no2_concentration   26704 non-null  float64
 9   pm10_tempcov        21344 non-null  float64
 10  pm25_tempcov        16114 non-null  float64
 11  no2_tempcov         22991 non-null  float64
 12  type_of_stations    29459 non-null  object 
 13  reference           40538 non-null  object 
 14  web_link            31983 non-null  object 
 15  population          17161 non-null  float64
 16  popu

---------------------------

Dataset B: WHO COVID-19 Cases and Deaths (2023)
Dataset B Contents

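| | Date_reported | Country_code | Country | WHO_region | New_cases | Cumulative_cases | New_deaths | Cumulative_deaths |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2020-01-03 | AF | Afghanistan | EMRO | 0 | 0 | 0 | 0 |
| 1 | 2020-01-04 | AF | Afghanistan | EMRO | 0 | 0 | 0 | 0 |
| 2 | 2020-01-05 | AF | Afghanistan | EMRO | 0 | 0 | 0 | 0 |
| 3 | 2020-01-06 | AF | Afghanistan | EMRO | 0 | 0 | 0 | 0 |
| 4 | 2020-01-07 | AF | Afghanistan | EMRO | 0 | 0 | 0 | 0 |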


Additional Details


| | New_cases | Cumulative_cases | New_deaths | Cumulative_deaths |
| --- | --- | --- | --- | --- |
| count | 295065.0 | 295065.0 | 295065.0 | 295065.0 |
| mean | 2600.664 | 1306386.0 | 23.514659 | 16987.45 |
| std | 38506.99 | 6086080.0 | 142.754287 | 72380.7 |
| min | -11360.0 | 0.0 | -3520.0 | 0.0 |
| 25% | 0.0 | 1733.0 | 0.0 | 14.0 |
| 50% | 9.0 | 31221.0 | 0.0 | 345.0 |
| 75% | 329.0 | 356244.0 | 3.0 | 5280.0 |
| max | 6966046.0 | 103436800.0 | 11447.0 | 1127152.0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 295065 entries, 0 to 295064
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Date_reported      295065 non-null  object
 1   Country_code       293820 non-null  object
 2   Country            295065 non-null  object
 3   WHO_region         295065 non-null  object
 4   New_cases          295065 non-null  int64 
 5   Cumulative_cases   295065 non-null  int64 
 6   New_deaths         295065 non-null  int64 
 7   Cumulative_deaths  295065 non-null  int64 
dtypes: int64(4), object(4)
memory usage: 18.0+ MB
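
The Dataset B output above raises three cleaning questions before any merge. `Date_reported` is stored as `object` rather than as a date. `Country_code` has 293820 non-null values out of 295065 rows. And the minimum of `New_cases` is negative. One plausible cause of the missing country codes, not confirmed from the output shown, is that pandas reads the literal string `NA` (Namibia's 2-letter code) as a missing value by default. The sketch below shows one way to handle all three on a tiny made-up extract. Flooring negative counts at zero is an assumption about how to treat what are presumably reporting corrections, and `New_cases_clean` is a hypothetical column name.

```python
import io

import pandas as pd

# Tiny made-up extract in the layout of dataset B; values are illustrative only.
raw = io.StringIO(
    "Date_reported,Country_code,Country,New_cases\n"
    "2020-01-03,AF,Afghanistan,0\n"
    "2020-01-03,NA,Namibia,5\n"
    "2020-01-04,NA,Namibia,-2\n"
)

# keep_default_na=False stops pandas from treating the string "NA" as missing;
# na_values=[""] still marks empty fields as missing.
# parse_dates converts Date_reported from text to datetime64.
covid = pd.read_csv(
    raw,
    keep_default_na=False,
    na_values=[""],
    parse_dates=["Date_reported"],
)

# Assumption: negative daily counts are corrections, so floor them at zero.
covid["New_cases_clean"] = covid["New_cases"].clip(lower=0)
```

Whether to floor negative counts, drop them, or keep them depends on how the daily figures are later aggregated; `Cumulative_cases` may be the safer column if only totals are needed.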
