# 🔬 Beijing Air Quality
## 📘 Notebook 01 – Data Cleaning

| Field         | Description |
|---------------|-------------|
| Author:       | Robert Steven Elliott |
| Course:       | Code Institute – Data Analytics with AI Bootcamp |
| Project Type: | Hackathon 2 |
| Date:         | December 2025 |

### Import Libraries

In [1]:
import sys
import pandas as pd
import numpy as np
from pathlib import Path


### Project Paths

In [2]:
PROJECT_ROOT = Path.cwd().parent
sys.path.append(str(PROJECT_ROOT))
DATA_DIR = PROJECT_ROOT / "data"
RAW_DATA_DIR = DATA_DIR / "raw"
CLEAN_DATA_DIR = DATA_DIR / "clean"
# exist_ok=True makes a separate existence check unnecessary
CLEAN_DATA_DIR.mkdir(parents=True, exist_ok=True)

### Load Custom Libraries

In [3]:
from utils.data_processing import load_data, remove_unneeded_columns
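
The source of `utils.data_processing` is not shown in this notebook. The sketch below is only a guess at what the two helpers might do, inferred from the `df.info()` output in this notebook (columns named in `string_cols` appear as `string`, other text columns as `category`). The function bodies and the sample CSV are assumptions for illustration, not the project's actual code.

```python
from io import StringIO

import pandas as pd


def load_data(path, string_cols=None):
    """Hypothetical stand-in: read a CSV and assign text dtypes."""
    string_cols = set(string_cols or [])
    df = pd.read_csv(path)
    # Treat every non-numeric, non-boolean column as text.
    text_cols = df.select_dtypes(exclude=["number", "bool"]).columns
    for col in text_cols:
        # Named columns become pandas strings, the rest become categories.
        df[col] = df[col].astype("string" if col in string_cols else "category")
    return df


def remove_unneeded_columns(df, columns):
    """Hypothetical stand-in: drop the listed columns, ignoring absent ones."""
    return df.drop(columns=columns, errors="ignore")


# Made-up sample data, used only to exercise the sketch.
sample = StringIO(
    "track_id,artists,track_name,popularity,explicit\n"
    "a1,Artist A,Song A,50,True\n"
    "b2,Artist B,Song B,60,False\n"
)
demo = load_data(sample, string_cols=["track_name", "artists"])
demo = remove_unneeded_columns(demo, ["track_id"])
print(demo.dtypes)
```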

### Load Data

In [4]:
df = load_data(RAW_DATA_DIR / "spotify_raw.csv", string_cols=['track_name', 'artists'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114000 entries, 0 to 113999
Data columns (total 21 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   Unnamed: 0        114000 non-null  int64   
 1   track_id          114000 non-null  category
 2   artists           113999 non-null  string  
 3   album_name        113999 non-null  category
 4   track_name        113999 non-null  string  
 5   popularity        114000 non-null  int64   
 6   duration_ms       114000 non-null  int64   
 7   explicit          114000 non-null  bool    
 8   danceability      114000 non-null  float64 
 9   energy            114000 non-null  float64 
 10  key               114000 non-null  int64   
 11  loudness          114000 non-null  float64 
 12  mode              114000 non-null  int64   
 13  speechiness       114000 non-null  float64 
 14  acousticness      114000 non-null  float64 
 15  instrumentalness  114000 non-null  float64 
 16  li

### Remove Unneeded Columns

In [5]:
UNNEEDED_COLUMNS = ['Unnamed: 0', 'track_id', 'album_name']

df = remove_unneeded_columns(df, UNNEEDED_COLUMNS)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114000 entries, 0 to 113999
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   artists           113999 non-null  string  
 1   track_name        113999 non-null  string  
 2   popularity        114000 non-null  int64   
 3   duration_ms       114000 non-null  int64   
 4   explicit          114000 non-null  bool    
 5   danceability      114000 non-null  float64 
 6   energy            114000 non-null  float64 
 7   key               114000 non-null  int64   
 8   loudness          114000 non-null  float64 
 9   mode              114000 non-null  int64   
 10  speechiness       114000 non-null  float64 
 11  acousticness      114000 non-null  float64 
 12  instrumentalness  114000 non-null  float64 
 13  liveness          114000 non-null  float64 
 14  valence           114000 non-null  float64 
 15  tempo             114000 non-null  float64 
 16  ti

### Remove Duplicate Rows

In [6]:
df.drop_duplicates(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 106949 entries, 0 to 113999
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   artists           106948 non-null  string  
 1   track_name        106948 non-null  string  
 2   popularity        106949 non-null  int64   
 3   duration_ms       106949 non-null  int64   
 4   explicit          106949 non-null  bool    
 5   danceability      106949 non-null  float64 
 6   energy            106949 non-null  float64 
 7   key               106949 non-null  int64   
 8   loudness          106949 non-null  float64 
 9   mode              106949 non-null  int64   
 10  speechiness       106949 non-null  float64 
 11  acousticness      106949 non-null  float64 
 12  instrumentalness  106949 non-null  float64 
 13  liveness          106949 non-null  float64 
 14  valence           106949 non-null  float64 
 15  tempo             106949 non-null  float64 
 16  time_si
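
In the run above the row count falls from 114000 to 106949, so 7051 exact duplicate rows were dropped. The index still runs from 0 to 113999 because `drop_duplicates` keeps the original labels. A minimal sketch of counting duplicates before dropping them and then resetting the index, using a made-up toy frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame: the second row is an exact copy of the first.
toy = pd.DataFrame({
    "artists": ["A", "A", "B", "B"],
    "track_name": ["x", "x", "y", "z"],
    "popularity": [10, 10, 20, 20],
})

# duplicated() flags every repeat of an earlier row.
n_dupes = int(toy.duplicated().sum())

# Drop the repeats and renumber the index from 0.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"Removed {n_dupes} duplicate row(s); {len(deduped)} remain")
```

Resetting the index is optional here, since the file is later saved with `index=False`.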

### Drop Rows with Missing Values

In [7]:
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 106948 entries, 0 to 113999
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   artists           106948 non-null  string  
 1   track_name        106948 non-null  string  
 2   popularity        106948 non-null  int64   
 3   duration_ms       106948 non-null  int64   
 4   explicit          106948 non-null  bool    
 5   danceability      106948 non-null  float64 
 6   energy            106948 non-null  float64 
 7   key               106948 non-null  int64   
 8   loudness          106948 non-null  float64 
 9   mode              106948 non-null  int64   
 10  speechiness       106948 non-null  float64 
 11  acousticness      106948 non-null  float64 
 12  instrumentalness  106948 non-null  float64 
 13  liveness          106948 non-null  float64 
 14  valence           106948 non-null  float64 
 15  tempo             106948 non-null  float64 
 16  time_si
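
The output above shows one missing value each in `artists` and `track_name`, and `dropna` removes a single row. Before dropping, it can help to see where the gaps are. A short sketch on a made-up toy frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame: the middle row is missing both text fields.
toy = pd.DataFrame({
    "artists": pd.array(["A", None, "C"], dtype="string"),
    "track_name": pd.array(["x", None, "z"], dtype="string"),
    "popularity": [10, 20, 30],
})

# Missing values per column.
missing_per_column = toy.isna().sum()

# Rows that contain at least one missing value.
rows_with_missing = int(toy.isna().any(axis=1).sum())

cleaned = toy.dropna()

print(missing_per_column)
print(f"Rows dropped: {rows_with_missing}")
```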

### Save Cleaned Data

In [8]:
df.to_csv(CLEAN_DATA_DIR / "spotify_clean.csv", index=False)
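
A quick round-trip check can confirm that the saved file matches the frame in memory. The sketch below writes a made-up toy frame to a temporary directory instead of `CLEAN_DATA_DIR`, so the path and data are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the cleaned data.
toy = pd.DataFrame({"artists": ["A", "B"], "popularity": [10, 20]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "toy_clean.csv"
    toy.to_csv(out_path, index=False)
    # Read the file back while the temporary directory still exists.
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```

CSV does not store pandas dtypes, so `string` and `category` columns need to be cast again when the cleaned file is loaded in a later notebook.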