## NOTE

In this notebook, I will perform almost the same data manipulation that was done on our dataset, this time using another library, **Polars**, to get familiar with it.

### Polars Library
To handle our data efficiently, we will use Polars, a DataFrame library written in Rust that gives Python developers a scalable and efficient framework for handling data. It is often considered an alternative to the very popular pandas library, and it provides a wide range of functionality for data manipulation and analysis. Some of its key features and advantages include:
- Speed and performance
- Data manipulation capabilities
- Expressive syntax
- Support for lazy evaluation


### Why choose Polars when we have pandas?

pandas, a widely adopted library, is known for its flexibility and ease of use. However, with large datasets, pandas can suffer from performance bottlenecks because it relies on single-threaded execution. As the dataset grows, processing times can become prohibitively long, limiting productivity.

Polars has been designed specifically to handle large datasets efficiently. With lazy evaluation, Polars can plan a whole query before running it, and with parallel execution it distributes computations across multiple CPU cores to deliver significant performance gains. See the speed comparison test between pandas and Polars by Yuki.


**Polars documentation:** <https://docs.pola.rs>

In [4]:
%pip install polars

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [5]:
import polars as pl

print(f"[POLARS VERSION] == {pl.__version__}")

[POLARS VERSION] == 0.20.18


In [6]:
df = pl.read_csv("../../data/AQI_data.csv")
df

Country,City,AQI Value,AQI Category,CO AQI Value,CO AQI Category,Ozone AQI Value,Ozone AQI Category,NO2 AQI Value,NO2 AQI Category,PM2.5 AQI Value,PM2.5 AQI Category,lat,lng
str,str,i64,str,i64,str,i64,str,i64,str,i64,str,f64,f64
"""Russian Federa…","""Praskoveya""",51,"""Moderate""",1,"""Good""",36,"""Good""",0,"""Good""",51,"""Moderate""",44.7444,44.2031
"""Brazil""","""Presidente Dut…",41,"""Good""",1,"""Good""",5,"""Good""",1,"""Good""",41,"""Good""",-5.29,-44.49
"""Brazil""","""Presidente Dut…",41,"""Good""",1,"""Good""",5,"""Good""",1,"""Good""",41,"""Good""",-11.2958,-41.9869
"""Italy""","""Priolo Gargall…",66,"""Moderate""",1,"""Good""",39,"""Good""",2,"""Good""",66,"""Moderate""",37.1667,15.1833
"""Poland""","""Przasnysz""",34,"""Good""",1,"""Good""",34,"""Good""",0,"""Good""",20,"""Good""",53.0167,20.8833
…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""United States …","""Highland Sprin…",54,"""Moderate""",1,"""Good""",34,"""Good""",5,"""Good""",54,"""Moderate""",37.5516,-77.3285
"""Slovakia""","""Martin""",71,"""Moderate""",1,"""Good""",39,"""Good""",1,"""Good""",71,"""Moderate""",49.065,18.9219
"""Slovakia""","""Martin""",71,"""Moderate""",1,"""Good""",39,"""Good""",1,"""Good""",71,"""Moderate""",36.3385,-88.8513
"""France""","""Sceaux""",50,"""Good""",1,"""Good""",20,"""Good""",5,"""Good""",50,"""Good""",48.7786,2.2906


In [7]:
# get the dimensionality of our df
print(f"THE SHAPE OF OUR DATAFRAME == {df.shape}")
print(f"WE HAVE {df.shape[0] } ROWS AND {df.shape[1]} COLUMNS ")

THE SHAPE OF OUR DATAFRAME == (16695, 14)
WE HAVE 16695 ROWS AND 14 COLUMNS 


In [8]:
# View rows 33 through 39 of our df (a slice of 7 rows)
print(df[33:40])

shape: (7, 14)
┌────────────┬────────┬───────────┬────────────┬───┬────────────┬────────────┬─────────┬───────────┐
│ Country    ┆ City   ┆ AQI Value ┆ AQI        ┆ … ┆ PM2.5 AQI  ┆ PM2.5 AQI  ┆ lat     ┆ lng       │
│ ---        ┆ ---    ┆ ---       ┆ Category   ┆   ┆ Value      ┆ Category   ┆ ---     ┆ ---       │
│ str        ┆ str    ┆ i64       ┆ ---        ┆   ┆ ---        ┆ ---        ┆ f64     ┆ f64       │
│            ┆        ┆           ┆ str        ┆   ┆ i64        ┆ str        ┆         ┆           │
╞════════════╪════════╪═══════════╪════════════╪═══╪════════════╪════════════╪═════════╪═══════════╡
│ United     ┆ Dayton ┆ 45        ┆ Good       ┆ … ┆ 45         ┆ Good       ┆ 39.7805 ┆ -84.2003  │
│ States of  ┆        ┆           ┆            ┆   ┆            ┆            ┆         ┆           │
│ America    ┆        ┆           ┆            ┆   ┆            ┆            ┆         ┆           │
│ United     ┆ Dayton ┆ 45        ┆ Good       ┆ … ┆ 45         ┆ Good      

In [9]:
# check the data types of each column
for col, dtype in zip(df.columns, df.dtypes):
    print(f"COLUMN: {col}  -> DATATYPE : {dtype}")

COLUMN: Country  -> DATATYPE : String
COLUMN: City  -> DATATYPE : String
COLUMN: AQI Value  -> DATATYPE : Int64
COLUMN: AQI Category  -> DATATYPE : String
COLUMN: CO AQI Value  -> DATATYPE : Int64
COLUMN: CO AQI Category  -> DATATYPE : String
COLUMN: Ozone AQI Value  -> DATATYPE : Int64
COLUMN: Ozone AQI Category  -> DATATYPE : String
COLUMN: NO2 AQI Value  -> DATATYPE : Int64
COLUMN: NO2 AQI Category  -> DATATYPE : String
COLUMN: PM2.5 AQI Value  -> DATATYPE : Int64
COLUMN: PM2.5 AQI Category  -> DATATYPE : String
COLUMN: lat  -> DATATYPE : Float64
COLUMN: lng  -> DATATYPE : Float64


In [10]:
# get general information about the data
df.describe()

statistic,Country,City,AQI Value,AQI Category,CO AQI Value,CO AQI Category,Ozone AQI Value,Ozone AQI Category,NO2 AQI Value,NO2 AQI Category,PM2.5 AQI Value,PM2.5 AQI Category,lat,lng
str,str,str,f64,str,f64,str,f64,str,f64,str,f64,str,f64,f64
"""count""","""16393""","""16695""",16695.0,"""16695""",16695.0,"""16695""",16695.0,"""16695""",16695.0,"""16695""",16695.0,"""16695""",16695.0,16695.0
"""null_count""","""302""","""0""",0.0,"""0""",0.0,"""0""",0.0,"""0""",0.0,"""0""",0.0,"""0""",0.0,0.0
"""mean""",,,62.998682,,1.342138,,31.767355,,3.819647,,59.821324,,30.267148,-3.944485
"""std""",,,43.091971,,2.371379,,22.839343,,5.880677,,43.208298,,22.947398,73.037148
"""min""","""Afghanistan""","""Aabenraa""",7.0,"""Good""",0.0,"""Good""",0.0,"""Good""",0.0,"""Good""",0.0,"""Good""",-54.8019,-171.75
"""25%""",,,39.0,,1.0,,20.0,,0.0,,34.0,,16.5167,-75.1779
"""50%""",,,52.0,,1.0,,29.0,,2.0,,52.0,,38.8158,5.6431
"""75%""",,,69.0,,1.0,,38.0,,5.0,,69.0,,46.6833,36.2833
"""max""","""Zimbabwe""","""Zyryanovsk""",500.0,"""Very Unhealthy…",133.0,"""Unhealthy for …",222.0,"""Very Unhealthy…",91.0,"""Moderate""",500.0,"""Very Unhealthy…",70.767,178.0178


In [11]:
df.null_count()
# the Country column has 302 missing values; every other column is complete

Country,City,AQI Value,AQI Category,CO AQI Value,CO AQI Category,Ozone AQI Value,Ozone AQI Category,NO2 AQI Value,NO2 AQI Category,PM2.5 AQI Value,PM2.5 AQI Category,lat,lng
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
302,0,0,0,0,0,0,0,0,0,0,0,0,0


In [12]:
df.is_duplicated().sum()

0

In [13]:
df = df.unique(subset=["City"])
df.shape

(14229, 14)

In [14]:
for col in df.columns:
    print(df[col].value_counts())

shape: (175, 2)
┌────────────────────────────┬───────┐
│ Country                    ┆ count │
│ ---                        ┆ ---   │
│ str                        ┆ u32   │
╞════════════════════════════╪═══════╡
│ Luxembourg                 ┆ 1     │
│ Ireland                    ┆ 25    │
│ Cameroon                   ┆ 37    │
│ Solomon Islands            ┆ 1     │
│ Hungary                    ┆ 46    │
│ …                          ┆ …     │
│ Saint Kitts and Nevis      ┆ 1     │
│ Iran (Islamic Republic of) ┆ 29    │
│ Montenegro                 ┆ 3     │
│ Ukraine                    ┆ 144   │
│ Finland                    ┆ 64    │
└────────────────────────────┴───────┘
shape: (14_229, 2)
┌─────────────┬───────┐
│ City        ┆ count │
│ ---         ┆ ---   │
│ str         ┆ u32   │
╞═════════════╪═══════╡
│ Geldern     ┆ 1     │
│ Veracruz    ┆ 1     │
│ Fontana     ┆ 1     │
│ Borodino    ┆ 1     │
│ Orillia     ┆ 1     │
│ …           ┆ …     │
│ Moussoro    ┆ 1     │
│ Schertz     

In [15]:
# Let's see the 6 cleanest cities (lowest AQI values; the sort is ascending)
sorted_df = df.sort(by="AQI Value")
sorted_df[["Country", "City", "AQI Value"]][:6]

Country,City,AQI Value
str,str,i64
"""Ecuador""","""Macas""",7
"""Papua New Guin…","""Tari""",8
"""Ecuador""","""Azogues""",8
"""Peru""","""Huaraz""",9
"""Indonesia""","""Manokwari""",10
"""Peru""","""Huancavelica""",10


In [16]:
# Let's see the 6 dirtiest cities (highest AQI values, at the tail of the ascending sort)
sorted_df = df.sort(by="AQI Value")
sorted_df.tail(6)[["Country", "City", "AQI Value"]]

Country,City,AQI Value
str,str,i64
"""India""","""Delhi""",500
"""India""","""Phalodi""",500
"""India""","""Mahendragarh""",500
"""India""","""Dhanaura""",500
"""India""","""Ratangarh""",500
"""India""","""Sardulgarh""",500


In [17]:
%pip install seaborn
%pip install pyarrow

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
