<a href="https://colab.research.google.com/github/Yanhuijun1911/PythonData/blob/main/Air_quality_mini_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clean and wrangle air quality data

The following data file contains data collected at a roadside monitoring station.  You can see the data in a spreadsheet here: https://docs.google.com/spreadsheets/d/1XpAvrpuyMsKDO76EZ3kxuddBOu7cZX1Od4uEts14zco/edit?usp=sharing

The data contains:
* a heading line (Chatham Roadside), which needs to be skipped
* dates that are sometimes left- and sometimes right-justified, which indicates that they are stored as text rather than as dates (so they need to be converted to dates)
* times that are not all in the same format
* Nitrogen dioxide levels that are also stored as text and sometimes contain the placeholder `nodata`
* a Status column, which is always the same
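
The mixed time formats are not cleaned in the project steps, but a small sketch shows one way they could be brought into a single `HH:MM` form. It assumes, based on the values visible in the printed outputs of this notebook, that every time looks like `H:MM`, `HH:MM` or `HH:MM:SS`. The sample values and variable names are made up for illustration.

```python
import pandas as pd

# Made-up sample covering the formats seen in the printed outputs
times = pd.Series(["1:00", "02:00", "23:00", "24:00:00", "24:00"])

# Split on ':' and keep only the hour and minute parts
parts = times.str.split(":", expand=True)
hours = parts[0].astype(int)
minutes = parts[1].astype(int)

# Rebuild every value as zero-padded HH:MM text.
# '24:00' is kept as text: hour 24 is not accepted by Python's datetime.time.
normalised = hours.astype(str).str.zfill(2) + ":" + minutes.astype(str).str.zfill(2)
print(normalised.tolist())
```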





### Project - clean, sort and wrangle the data

Read the dataset into a dataframe, skipping the first row   
Convert dates to date format  
Remove rows with nodata in the Nitrogen dioxide column  
Convert the Nitrogen dioxide levels values to float type  
Sort by Nitrogen dioxide level  
Create a new column for 'Weekdays' (use df['Date'].dt.weekday)  
Rename the column Nitrogen dioxide level to NO2 Level (V ug/m2)  
Remove the Status column  

The dataset can be viewed here: https://drive.google.com/file/d/1aYmBf9il2dWA-EROvbYRCZ1rU2t7JwvJ/view?usp=sharing and the data (a .csv file) can be accessed here: https://drive.google.com/uc?id=1QSNJ3B1ku8kjXsA_tCBh4fbpDK7wVLAA  

**NOTE:** Some useful references are included at the bottom of this notebook.

Use the code cell below to write your code.

In [None]:
import pandas as pd
import numpy as np

url = "https://drive.google.com/uc?id=1QSNJ3B1ku8kjXsA_tCBh4fbpDK7wVLAA"
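# Skip the 'Chatham Roadside' heading line so the column names come from the second line
df = pd.read_csv(url, skiprows=1)
# Dates are written day-first (e.g. 31/12/2020); without dayfirst=True, ambiguous
# dates such as 07/04/2020 would be read month-first
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)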
print(df)
df.info()

           Date      Time Nitrogen dioxide   Status
0    2020-01-01      1:00         35.65193  V µg/m³
1    2020-01-01      2:00         37.99122  V µg/m³
2    2020-01-01      3:00         35.70462  V µg/m³
3    2020-01-01      4:00          36.5796  V µg/m³
4    2020-01-01      5:00          32.9441  V µg/m³
...         ...       ...              ...      ...
8779 2020-12-31     20:00         11.22419  V µg/m³
8780 2020-12-31     21:00         11.17037  V µg/m³
8781 2020-12-31     22:00          9.54137  V µg/m³
8782 2020-12-31     23:00          8.21683  V µg/m³
8783 2020-12-31  24:00:00          8.16537  V µg/m³

[8784 rows x 4 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8784 entries, 0 to 8783
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Date              8784 non-null   datetime64[ns]
 1   Time              8784 non-null   object        
 2   Nitrogen dioxide  8784 n
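
The raw dates in this file are written day-first (for example `31/12/2020`), so it matters how `pd.to_datetime` is told to read them. The sketch below uses two made-up date strings to show the difference between the default parsing of an ambiguous date, `dayfirst=True`, and an explicit `format`. The variable names are only for illustration.

```python
import pandas as pd

# Two day-first date strings: the first is ambiguous, the second is not
raw = pd.Series(["07/04/2020", "31/12/2020"])

# Default parsing of an ambiguous string assumes month first (4 July)
month_first = pd.to_datetime(raw.iloc[0])

# dayfirst=True reads the same string as 7 April
day_first = pd.to_datetime(raw, dayfirst=True)

# An explicit format removes the ambiguity altogether
explicit = pd.to_datetime(raw, format="%d/%m/%Y")

print(month_first.date(), day_first.dt.date.tolist())
```

A wrongly parsed date also gives a wrong value from `.dt.weekday`, so this is worth checking before grouping by day of the week.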

In [None]:
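# Drop rows by condition: keep only the rows whose Nitrogen dioxide value is not 'nodata'.
# .copy() makes the result an independent DataFrame, so later column assignments do not warn.
df = df[df["Nitrogen dioxide"] != "nodata"].copy()
print(df)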

           Date      Time Nitrogen dioxide   Status
0    2020-01-01      1:00         35.65193  V µg/m³
1    2020-01-01      2:00         37.99122  V µg/m³
2    2020-01-01      3:00         35.70462  V µg/m³
3    2020-01-01      4:00          36.5796  V µg/m³
4    2020-01-01      5:00          32.9441  V µg/m³
...         ...       ...              ...      ...
8779 2020-12-31     20:00         11.22419  V µg/m³
8780 2020-12-31     21:00         11.17037  V µg/m³
8781 2020-12-31     22:00          9.54137  V µg/m³
8782 2020-12-31     23:00          8.21683  V µg/m³
8783 2020-12-31  24:00:00          8.16537  V µg/m³

[8672 rows x 4 columns]


In [None]:
df["Nitrogen dioxide"] = df["Nitrogen dioxide"].astype(float)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8672 entries, 0 to 8783
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Date              8672 non-null   datetime64[ns]
 1   Time              8672 non-null   object        
 2   Nitrogen dioxide  8672 non-null   float64       
 3   Status            8672 non-null   object        
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 338.8+ KB
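
As an alternative to filtering on the text `nodata` and then calling `astype(float)`, the two steps can be combined with `pd.to_numeric(..., errors="coerce")` followed by `dropna`. This is only a sketch on a made-up three-row frame. Note that `errors="coerce"` turns any text that is not a number into `NaN`, not just `nodata`, so it is a slightly broader clean-up.

```python
import pandas as pd

# Made-up sample with one 'nodata' entry
sample = pd.DataFrame({
    "Date": ["01/01/2020", "01/01/2020", "02/01/2020"],
    "Nitrogen dioxide": ["35.65193", "nodata", "36.5796"],
})

# Non-numeric text becomes NaN and the column becomes float64
sample["Nitrogen dioxide"] = pd.to_numeric(sample["Nitrogen dioxide"], errors="coerce")

# Remove the rows where the conversion produced NaN
cleaned = sample.dropna(subset=["Nitrogen dioxide"])
print(cleaned)
```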


In [None]:
# Sort the rows by Nitrogen dioxide level, lowest first
df = df.sort_values("Nitrogen dioxide")
print(df)

           Date      Time  Nitrogen dioxide   Status
3442 2020-05-23     11:00           0.31041  V µg/m³
5844 2020-08-31     13:00           0.38390  V µg/m³
7684 2020-11-16      5:00           0.40116  V µg/m³
7756 2020-11-19      5:00           0.40229  V µg/m³
3440 2020-05-23      9:00           0.41544  V µg/m³
...         ...       ...               ...      ...
501  2020-01-21     22:00          66.59166  V µg/m³
504  2020-01-22      1:00          67.62859  V µg/m³
503  2020-01-21  24:00:00          69.17734  V µg/m³
2347 2020-07-04     20:00          69.88823  V µg/m³
502  2020-01-21     23:00          70.41527  V µg/m³

[8672 rows x 4 columns]


In [None]:
#Create a new column for 'Weekdays' (use df['Date'].dt.weekday)
df["Weekdays"] = df["Date"].dt.weekday
print(df)


           Date      Time  Nitrogen dioxide   Status  Weekdays
3442 2020-05-23     11:00           0.31041  V µg/m³         5
5844 2020-08-31     13:00           0.38390  V µg/m³         0
7684 2020-11-16      5:00           0.40116  V µg/m³         0
7756 2020-11-19      5:00           0.40229  V µg/m³         3
3440 2020-05-23      9:00           0.41544  V µg/m³         5
...         ...       ...               ...      ...       ...
501  2020-01-21     22:00          66.59166  V µg/m³         1
504  2020-01-22      1:00          67.62859  V µg/m³         2
503  2020-01-21  24:00:00          69.17734  V µg/m³         1
2347 2020-07-04     20:00          69.88823  V µg/m³         5
502  2020-01-21     23:00          70.41527  V µg/m³         1

[8672 rows x 5 columns]
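
The numbers returned by `.dt.weekday` run from 0 (Monday) to 6 (Sunday). If names are easier to read, `.dt.day_name()` gives the same information as text. A short sketch on three made-up dates:

```python
import pandas as pd

# Three sample dates, already in datetime format
dates = pd.Series(pd.to_datetime(["2020-05-23", "2020-08-31", "2020-11-19"]))

weekday_numbers = dates.dt.weekday    # 0 = Monday ... 6 = Sunday
weekday_names = dates.dt.day_name()   # e.g. 'Saturday'

print(list(zip(weekday_numbers, weekday_names)))
```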


In [None]:
df.drop("Status", axis = 1, inplace=True)
df

           Date      Time  Nitrogen dioxide  Weekdays
3442 2020-05-23     11:00           0.31041         5
5844 2020-08-31     13:00           0.38390         0
7684 2020-11-16      5:00           0.40116         0
7756 2020-11-19      5:00           0.40229         3
3440 2020-05-23      9:00           0.41544         5
...         ...       ...               ...       ...
501  2020-01-21     22:00          66.59166         1
504  2020-01-22      1:00          67.62859         2
503  2020-01-21  24:00:00          69.17734         1
2347 2020-07-04     20:00          69.88823         5


In [None]:
# Rename the column Nitrogen dioxide to NO2 Level (V ug/m2)
df = df.rename(columns={"Nitrogen dioxide": "NO2 Level (V ug/m2)"})
print(df)


           Date      Time  NO2 Level (V ug/m2)  Weekdays
3442 2020-05-23     11:00              0.31041         5
5844 2020-08-31     13:00              0.38390         0
7684 2020-11-16      5:00              0.40116         0
7756 2020-11-19      5:00              0.40229         3
3440 2020-05-23      9:00              0.41544         5
...         ...       ...                  ...       ...
501  2020-01-21     22:00             66.59166         1
504  2020-01-22      1:00             67.62859         2
503  2020-01-21  24:00:00             69.17734         1
2347 2020-07-04     20:00             69.88823         5
502  2020-01-21     23:00             70.41527         1

[8672 rows x 4 columns]
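
Assigning an existing column to a new name makes a copy and leaves the original column in place, whereas `DataFrame.rename` changes the label without adding a column. A minimal sketch on a made-up two-row frame:

```python
import pandas as pd

sample = pd.DataFrame({"Nitrogen dioxide": [0.31041, 0.38390], "Weekdays": [5, 0]})

# rename returns a new DataFrame; the number of columns stays the same
renamed = sample.rename(columns={"Nitrogen dioxide": "NO2 Level (V ug/m2)"})

print(list(renamed.columns))
```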


### Expand the dataset and show summary statistics for larger dataset
---

There is a second data set here covering the year 2021:  https://drive.google.com/uc?id=1aYmBf9il2dWA-EROvbYRCZ1rU2t7JwvJ  

Concatenate the two datasets so that the combined dataset covers both 2020 and 2021.  

Before you can concatenate the datasets you will need to clean and wrangle the second dataset in the same way as the first.  Use the code cell below.  Give the second dataset a different name. 

After the datasets have been concatenated, group the data by Weekdays and show summary statistics by day of the week.
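
One way to clean both files in the same way is to put the steps into a function and call it once per file. The sketch below is an illustration under assumptions, not the expected answer: the function name `clean_air_quality` and the two tiny inline CSV samples are made up, and the dates are assumed to be day-first as in the raw data.

```python
import io
import pandas as pd

def clean_air_quality(source):
    """Apply the project's cleaning steps to one file (path, URL or buffer)."""
    df = pd.read_csv(source, skiprows=1)                      # skip the heading line
    df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)   # day-first dates
    df = df[df["Nitrogen dioxide"] != "nodata"].copy()       # remove nodata rows
    df["Nitrogen dioxide"] = df["Nitrogen dioxide"].astype(float)
    df = df.sort_values("Nitrogen dioxide")
    df["Weekdays"] = df["Date"].dt.weekday
    df = df.rename(columns={"Nitrogen dioxide": "NO2 Level (V ug/m2)"})
    return df.drop(columns=["Status"])

# Made-up samples standing in for the 2020 and 2021 files
csv_2020 = (
    "Chatham Roadside\n"
    "Date,Time,Nitrogen dioxide,Status\n"
    "01/01/2020,1:00,35.0,V µg/m³\n"
    "01/01/2020,2:00,nodata,V µg/m³\n"
    "02/01/2020,1:00,30.0,V µg/m³\n"
)
csv_2021 = (
    "Chatham Roadside\n"
    "Date,Time,Nitrogen dioxide,Status\n"
    "06/01/2021,1:00,10.0,P µg/m³\n"
    "07/01/2021,1:00,nodata,P µg/m³\n"
    "07/01/2021,2:00,20.0,P µg/m³\n"
)

df_2020 = clean_air_quality(io.StringIO(csv_2020))
df_2021 = clean_air_quality(io.StringIO(csv_2021))

# Stack the two cleaned datasets and summarise NO2 by day of the week
combined = pd.concat([df_2020, df_2021], ignore_index=True)
summary = combined.groupby("Weekdays")["NO2 Level (V ug/m2)"].describe()
print(summary)
```

With the real data, the two URLs given in this notebook could be passed to the function in place of the `io.StringIO` samples.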

In [None]:
#Read the dataset into a dataframe, skipping the first row
#Convert dates to date format
#Remove rows with nodata in the Nitrogen dioxide column
#Convert the Nitrogen dioxide levels values to float type
#Sort by Nitrogen dioxide level
#Create a new column for 'Weekdays' (use df['Date'].dt.weekday)
#Rename the column Nitrogen dioxide level to NO2 Level (V ug/m2)
#Remove the Status column

In [None]:
import pandas as pd
url = "https://drive.google.com/uc?id=1QSNJ3B1ku8kjXsA_tCBh4fbpDK7wVLAA"
df = pd.read_csv(url, skiprows=1)
url2 = "https://drive.google.com/uc?id=1aYmBf9il2dWA-EROvbYRCZ1rU2t7JwvJ"
df2 = pd.read_csv(url2, skiprows=1)
df_new = pd.concat([df, df2], ignore_index = True)
print(df_new)


             Date   Time Nitrogen dioxide   Status
0      01/01/2020   1:00         35.65193  V µg/m³
1      01/01/2020   2:00         37.99122  V µg/m³
2      01/01/2020   3:00         35.70462  V µg/m³
3      01/01/2020   4:00          36.5796  V µg/m³
4      01/01/2020   5:00          32.9441  V µg/m³
...           ...    ...              ...      ...
17539  31/12/2021  20:00         12.51492  P µg/m³
17540  31/12/2021  21:00         14.00046  P µg/m³
17541  31/12/2021  22:00         10.04780  P µg/m³
17542  31/12/2021  23:00          3.49557  P µg/m³
17543  31/12/2021  24:00          4.15682  P µg/m³

[17544 rows x 4 columns]


In [None]:
# Dates are written day-first (e.g. 31/12/2021), so parse them with dayfirst=True
df_new["Date"] = pd.to_datetime(df_new["Date"], dayfirst=True)
df_new.info()
print(df_new)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17544 entries, 0 to 17543
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Date              17544 non-null  datetime64[ns]
 1   Time              17544 non-null  object        
 2   Nitrogen dioxide  17544 non-null  object        
 3   Status            17544 non-null  object        
dtypes: datetime64[ns](1), object(3)
memory usage: 548.4+ KB
            Date   Time Nitrogen dioxide   Status
0     2020-01-01   1:00         35.65193  V µg/m³
1     2020-01-01   2:00         37.99122  V µg/m³
2     2020-01-01   3:00         35.70462  V µg/m³
3     2020-01-01   4:00          36.5796  V µg/m³
4     2020-01-01   5:00          32.9441  V µg/m³
...          ...    ...              ...      ...
17539 2021-12-31  20:00         12.51492  P µg/m³
17540 2021-12-31  21:00         14.00046  P µg/m³
17541 2021-12-31  22:00         10.04780  P µg/m³
17542 2

In [None]:
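# Note: assigning the filtered Series back to the column does not remove any rows.
# pandas aligns on the index, so each 'nodata' entry becomes NaN and its row is kept.
df_new["Nitrogen dioxide"] = df_new["Nitrogen dioxide"][df_new["Nitrogen dioxide"] != "nodata"]
print(df_new)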

            Date   Time Nitrogen dioxide   Status
0     2020-01-01   1:00         35.65193  V µg/m³
1     2020-01-01   2:00         37.99122  V µg/m³
2     2020-01-01   3:00         35.70462  V µg/m³
3     2020-01-01   4:00          36.5796  V µg/m³
4     2020-01-01   5:00          32.9441  V µg/m³
...          ...    ...              ...      ...
17539 2021-12-31  20:00         12.51492  P µg/m³
17540 2021-12-31  21:00         14.00046  P µg/m³
17541 2021-12-31  22:00         10.04780  P µg/m³
17542 2021-12-31  23:00          3.49557  P µg/m³
17543 2021-12-31  24:00          4.15682  P µg/m³

[17544 rows x 4 columns]


In [None]:
df_new["Nitrogen dioxide"] = df_new["Nitrogen dioxide"].astype(float)
df_new.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17544 entries, 0 to 17543
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Date              17544 non-null  datetime64[ns]
 1   Time              17544 non-null  object        
 2   Nitrogen dioxide  17352 non-null  float64       
 3   Status            17544 non-null  object        
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 548.4+ KB


In [None]:
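# sort_values places NaN last by default, which is why the NaN rows appear at the bottom
df_new = df_new.sort_values(["Nitrogen dioxide"])
print(df_new)

# Why are there NaN values in Nitrogen dioxide? The earlier filtering step assigned a
# filtered Series back to the column, which turns the 'nodata' entries into NaN
# (index alignment) instead of removing those rows.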

            Date   Time  Nitrogen dioxide   Status
15961 2021-10-27  02:00          -0.77743  P µg/m³
15793 2021-10-20  02:00          -0.54076  P µg/m³
15891 2021-10-24  04:00          -0.41740  P µg/m³
15458 2021-06-10  03:00          -0.31174  P µg/m³
15962 2021-10-27  03:00          -0.28544  P µg/m³
...          ...    ...               ...      ...
15157 2021-09-23  14:00               NaN  V µg/m³
15464 2021-06-10  09:00               NaN  P µg/m³
15801 2021-10-20  10:00               NaN  P µg/m³
16810 2021-01-12  11:00               NaN  P µg/m³
17171 2021-12-16  12:00               NaN  P µg/m³

[17544 rows x 4 columns]
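
The `NaN` values can be reproduced on a tiny example. The sketch below, using a made-up three-row frame, compares assigning a filtered Series back to a column (rows kept, `nodata` becomes `NaN`) with filtering the whole DataFrame (rows removed).

```python
import pandas as pd

frame = pd.DataFrame({"Nitrogen dioxide": ["35.6", "nodata", "36.5"]})

# Assigning a filtered Series back to the column: pandas aligns on the index,
# so the row that was filtered out stays in the frame with NaN
masked = frame.copy()
masked["Nitrogen dioxide"] = masked["Nitrogen dioxide"][masked["Nitrogen dioxide"] != "nodata"]

# Filtering the whole DataFrame removes the row instead
dropped = frame[frame["Nitrogen dioxide"] != "nodata"]

print(len(masked), len(dropped))
```

If the rows have already been turned into `NaN`, `dropna(subset=["Nitrogen dioxide"])` removes them afterwards.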


In [None]:
# Create a new column for 'Weekdays' (use df['Date'].dt.weekday)
df_new["Weekdays"] = df_new["Date"].dt.weekday
print(df_new)

# Rename the column Nitrogen dioxide to NO2 Level (V ug/m2)
df_new = df_new.rename(columns={"Nitrogen dioxide": "NO2 Level (V ug/m2)"})
print(df_new)
# Remove the Status column
df_new.drop("Status", axis=1, inplace=True)
print(df_new)

            Date   Time  ...  Weekdays NO2 Level (V ug/m2)
15961 2021-10-27  02:00  ...         2            -0.77743
15793 2021-10-20  02:00  ...         2            -0.54076
15891 2021-10-24  04:00  ...         6            -0.41740
15458 2021-06-10  03:00  ...         3            -0.31174
15962 2021-10-27  03:00  ...         2            -0.28544
...          ...    ...  ...       ...                 ...
15157 2021-09-23  14:00  ...         3                 NaN
15464 2021-06-10  09:00  ...         3                 NaN
15801 2021-10-20  10:00  ...         2                 NaN
16810 2021-01-12  11:00  ...         1                 NaN
17171 2021-12-16  12:00  ...         3                 NaN

[17544 rows x 6 columns]
            Date   Time  ...  Weekdays NO2 Level (V ug/m2)
15961 2021-10-27  02:00  ...         2            -0.77743
15793 2021-10-20  02:00  ...         2            -0.54076
15891 2021-10-24  04:00  ...         6            -0.41740
15458 2021-06-10  03:00  ...  

### Helpful references
---
Skipping rows when reading datasets:  
https://www.geeksforgeeks.org/how-to-skip-rows-while-reading-csv-file-using-pandas/  

Converting strings to dates:  
https://www.geeksforgeeks.org/convert-the-column-type-from-string-to-datetime-format-in-pandas-dataframe/

Dropping rows where data has a given value:  
https://www.datasciencemadesimple.com/drop-delete-rows-conditions-python-pandas/  
(see section Drop a row or observation by condition) 

Convert a column of strings to a column of floats:
https://datatofish.com/convert-string-to-float-dataframe/  

Create a new column from data converted in an existing column:  
https://www.geeksforgeeks.org/create-a-new-column-in-pandas-dataframe-based-on-the-existing-columns/  

Rename a column:  
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html  

Remove a column by name:  
https://www.kite.com/python/answers/how-to-delete-columns-from-a-pandas-%60dataframe%60-by-column-name-in-python


### Reflection
1. datetime
2. merge
3. drop
4. change type: astype