# **Data Cleaning and Initial Exploration**

___
**Atoosa Rashid** 

[GitHub](https://github.com/atoosa-r/) | [LinkedIn](https://www.linkedin.com/in/atoosarashid/) 
___

## **Table of Contents**

- [Introduction](#introduction)
- [Data Dictionary](#data-dictionary)
- [Initial Exploration](#initial-exploration)
- [Data Cleaning](#data-cleaning)
  - [Dropping Unnecessary Columns](#dropping-unnecessary-columns)
  - [Handling Missing Values](#handling-missing-values)
  - [Removing Duplicates](#removing-duplicates)
- [Final Remarks](#final-remarks)

___

## **Introduction**
This notebook focuses on cleaning and preparing the Spotify streaming data for analysis. The process involves importing the data, examining its structure, handling missing values, removing irrelevant columns, and eliminating duplicates. By the end of this notebook, the dataset will be ready for further analysis and visualization.

In [1]:
# Importing libraries: 

import numpy as np                   
import pandas as pd                   
import matplotlib.pyplot as plt       
import seaborn as sns                 
import os

In [2]:
# Loading Data

# Path to the folder containing the JSON files (fill in with your own folder path)
directory_path = 'path_to_your_files/JSON Files'

# Getting a list of all JSON files in the directory
json_files = [os.path.join(directory_path, file) for file in os.listdir(directory_path) if file.endswith('.json')]

# Combining all JSON files into a single dataframe (df)
df = pd.concat([pd.read_json(file) for file in json_files], ignore_index=True)
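
The loading step above can also be wrapped in a small helper. The following is only a sketch: the function name `load_streaming_history` and the two demo files are made up for illustration and are not part of the original notebook. It sorts the file list (because `os.listdir` returns names in arbitrary order) and fails early when the folder contains no JSON files.

```python
import json
import os
import tempfile

import pandas as pd


def load_streaming_history(directory_path):
    """Read every .json file in a folder into one DataFrame (hypothetical helper)."""
    # sorted() makes the load order reproducible; os.listdir() order is arbitrary
    json_files = sorted(
        os.path.join(directory_path, name)
        for name in os.listdir(directory_path)
        if name.endswith(".json")
    )
    if not json_files:
        raise FileNotFoundError(f"No .json files found in {directory_path!r}")
    return pd.concat([pd.read_json(path) for path in json_files], ignore_index=True)


# Tiny demonstration with two made-up files in a temporary folder
demo_records = [
    [{"ts": "2025-01-01T13:30:30Z", "ms_played": 2205}],
    [{"ts": "2025-01-02T08:00:00Z", "ms_played": 180000}],
]
with tempfile.TemporaryDirectory() as tmp:
    for i, rows in enumerate(demo_records):
        with open(os.path.join(tmp, f"part_{i}.json"), "w") as fh:
            json.dump(rows, fh)
    demo_df = load_streaming_history(tmp)

print(demo_df.shape)
```

With real data, the call would presumably be `df = load_streaming_history(directory_path)`.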

In [None]:
# Sanity check 

df.head()

___

## **Data Dictionary**

The dataset provides detailed insights into your Spotify streaming history. Below is a description of each field:

| **Field Name**                     | **Description**                                                                                         | **Example**                              |
|------------------------------------|---------------------------------------------------------------------------------------------------------|------------------------------------------|
| **ts**                             | Timestamp indicating when the track stopped playing (UTC). Format: YYYY-MM-DDTHH:MM:SSZ.               | `2025-01-01T13:30:30Z`                   |
| **platform**                       | Platform or device used for streaming (e.g., operating system and device model).                       | `iOS 14.2 (iPhone10,6)`                  |
| **ms_played**                      | Total milliseconds the track was played.                                                               | `2205`                                   |
| **conn_country**                   | Two-letter country code where the stream occurred.                                                     | `CA`                                     |
| **ip_addr**                        | IP address logged during the stream.                                                                   | `72.143.202.158`                         |
| **master_metadata_track_name**     | Name of the track streamed.                                                                            | `Lucky`                                |
| **master_metadata_album_artist_name** | Name of the artist or band.                                                                            | `H.E.R.`                               |
| **master_metadata_album_album_name** | Name of the album containing the track.                                                                | `Lucky`                                |
| **spotify_track_uri**              | Spotify URI uniquely identifying the track. Format: `spotify:track:<base-62 string>`.                  | `spotify:track:3kdyQO3jkZiUOtvoNGwOjw`   |
| **episode_name**                   | Name of the podcast episode (if applicable).                                                           | `Breaking Down the Day's News`           |
| **episode_show_name**              | Name of the podcast show (if applicable).                                                              | `The Current`                            |
| **spotify_episode_uri**            | Spotify URI uniquely identifying the podcast episode. Format: `spotify:episode:<base-62 string>`.      | `spotify:episode:abc123`                 |
| **audiobook_title**                | Title of the audiobook (if applicable).                                                                | `The Great Gatsby`                       |
| **audiobook_uri**                  | Spotify URI uniquely identifying the audiobook. Format: `spotify:audiobook:<base-62 string>`.          | `spotify:audiobook:123abc`               |
| **audiobook_chapter_uri**          | Spotify URI identifying a specific audiobook chapter.                                                  | `spotify:audiobook:chapter:xyz456`       |
| **audiobook_chapter_title**        | Title of the audiobook chapter (if applicable).                                                        | `Chapter 1: The Beginning`               |
| **reason_start**                   | Reason why the track started (e.g., `fwdbtn`, `trackdone`).                                            | `fwdbtn`                                 |
| **reason_end**                     | Reason why the track ended (e.g., `fwdbtn`, `endplay`).                                                | `fwdbtn`                                 |
| **shuffle**                        | Indicates if shuffle mode was used during playback (`True`/`False`/`null`).                            | `False`                                  |
| **skipped**                        | Indicates if the track was skipped (`True`/`False`/`null`).                                             | `False`                                  |
| **offline**                        | Indicates if the track was played offline (`True`/`False`/`null`).                                     | `False`                                  |
| **offline_timestamp**              | Timestamp of when offline mode was used, if applicable.                                                | `2025-01-01T14:00:00Z`                   |
| **incognito_mode**                 | Indicates if the track was played during a private session (`True`/`False`/`null`).                    | `False`                                  |
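
Before cleaning, it can help to confirm that the loaded data actually contains the fields listed in the table. The sketch below uses a toy DataFrame rather than the real export, and `EXPECTED_COLUMNS` and `missing_columns` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Field names copied from the data dictionary table
EXPECTED_COLUMNS = {
    "ts", "platform", "ms_played", "conn_country", "ip_addr",
    "master_metadata_track_name", "master_metadata_album_artist_name",
    "master_metadata_album_album_name", "spotify_track_uri",
    "episode_name", "episode_show_name", "spotify_episode_uri",
    "audiobook_title", "audiobook_uri", "audiobook_chapter_uri",
    "audiobook_chapter_title", "reason_start", "reason_end",
    "shuffle", "skipped", "offline", "offline_timestamp", "incognito_mode",
}


def missing_columns(frame):
    """Return the expected field names that are absent from the DataFrame."""
    return sorted(EXPECTED_COLUMNS - set(frame.columns))


# Toy frame with only two of the expected fields
toy = pd.DataFrame({"ts": ["2025-01-01T13:30:30Z"], "ms_played": [2205]})
print(missing_columns(toy))
```

An empty list from `missing_columns(df)` would suggest the export matches the dictionary; Spotify may add or rename fields over time, so a non-empty result is not necessarily an error.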


## **Initial Exploration**

In [None]:
# Shape of our df

print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

In [None]:
# Viewing the full df and all columns 

pd.set_option('display.max_columns', None)

df

In [None]:
# Displaying data types of all the columns

print(df.dtypes)

In [None]:
# Converting the data type of the `ts` column to a datetime object
df['ts'] = pd.to_datetime(df['ts'])

# Reviewing the full timeframe of the data
earliest_date = df['ts'].min()
latest_date = df['ts'].max()

print(f"The dataset covers the time period from {earliest_date} to {latest_date}.")
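
As a small illustration of the timestamp conversion, the sketch below parses two made-up `ts` values and measures the span between them. Passing `utc=True` is an assumption on my part: it makes the UTC interpretation explicit, which matches the `Z` suffix described in the data dictionary.

```python
import pandas as pd

# Two made-up timestamps in the same format as the `ts` column
toy = pd.DataFrame({"ts": ["2025-01-01T13:30:30Z", "2025-03-01T09:00:00Z"]})

# utc=True yields timezone-aware values in UTC
toy["ts"] = pd.to_datetime(toy["ts"], utc=True)

# Subtracting two timestamps gives a Timedelta
span = toy["ts"].max() - toy["ts"].min()
print(f"The toy data spans {span.days} full days.")
```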

___

## **Data Cleaning**


#### Dropping Unnecessary Columns
   - Columns related to audiobooks (`audiobook_*`) and episodes (`episode_*`) will be removed as they are not relevant for our analysis, which focuses on music streaming activity.
   - The IP Address column (`ip_addr`) will also be dropped since we will not be working with it. 

In [None]:
# Dropping unnecessary columns

columns_to_drop = [
    'episode_name', 'episode_show_name', 'spotify_episode_uri', 
    'audiobook_title', 'audiobook_uri', 'audiobook_chapter_uri', 
    'audiobook_chapter_title', 'ip_addr'
]

df = df.drop(columns=columns_to_drop)
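
As an alternative to listing each column by hand, the drop list could be built from the column names themselves. This is only a sketch on an empty toy DataFrame. Note that `spotify_episode_uri` does not start with `episode_`, so the example matches on the substring `episode` instead of a prefix.

```python
import pandas as pd

# Empty toy frame that only carries a subset of the real column names
toy = pd.DataFrame(columns=[
    "ts", "ms_played", "ip_addr",
    "episode_name", "episode_show_name", "spotify_episode_uri",
    "audiobook_title", "audiobook_uri",
])

# Collect podcast, audiobook, and IP columns by name pattern
columns_to_drop = [
    col for col in toy.columns
    if "episode" in col or col.startswith("audiobook_") or col == "ip_addr"
]

trimmed = toy.drop(columns=columns_to_drop)
print(list(trimmed.columns))  # ['ts', 'ms_played']
```

The explicit list is easier to audit; the pattern-based version is more forgiving if the export gains further `audiobook_*` or episode fields.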

In [None]:
# Sanity Check

df.head()

___

#### Handling Missing Values

In [None]:
# Summarizing all missing values found in the columns

missing_values_summary = df.isnull().sum()

print("\n Missing Values Summary Report")
print("--------------------------------------------")
print(missing_values_summary)

**Missing Values:**

- Metadata columns:
    - Missing values in `master_metadata_track_name`, `master_metadata_album_artist_name`, and `master_metadata_album_album_name` indicate entries without track-level data. These are likely non-music streams (e.g., podcast episodes or advertisements), plays with unlogged metadata, or songs that are no longer available.
    - These missing values represent less than 0.2% of the total dataset, so their impact on the analysis is minimal.
    - Because these rows carry no track information and affect so little of the data, they will be dropped.
- The `offline` column:
    - Indicates whether a session occurred offline (True/False).
    - Has minimal missing data.
- The `offline_timestamp` column:
    - Records timestamps for offline sessions.
    - Has a high proportion of missing values (mostly for online sessions where it is irrelevant).
    - Online sessions do not require `offline_timestamp`, so gaps are expected and not problematic.
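
The share of missing values per column can be computed directly, which is one way to check a figure such as the "less than 0.2%" quoted above. The sketch below runs on a four-row toy DataFrame, so its percentages are illustrative only and say nothing about the real dataset.

```python
import pandas as pd

# Toy data: one row lacks a track name, three rows lack an offline timestamp
toy = pd.DataFrame({
    "master_metadata_track_name": ["Lucky", None, "Lucky", "Lucky"],
    "offline_timestamp": [None, None, None, "2025-01-01T14:00:00Z"],
})

# isnull() gives booleans; their mean is the fraction of missing values
missing_pct = toy.isnull().mean().mul(100).round(2)
print(missing_pct)
```

Applied to the real data, `df.isnull().mean().mul(100)` would give the same report as percentages instead of raw counts.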


In [None]:
# Select rows with a missing value in any of the three metadata columns
missing_rows = df[df[['master_metadata_track_name', 
                      'master_metadata_album_artist_name', 
                      'master_metadata_album_album_name']].isnull().any(axis=1)]


pd.set_option('display.max_columns', None)
missing_rows


In [None]:
# Drop rows with missing values in the metadata columns, since they carry no track information
df = df.dropna(subset=['master_metadata_track_name', 
                       'master_metadata_album_artist_name', 
                       'master_metadata_album_album_name']).reset_index(drop=True)

df


___

#### Removing Duplicates

In [None]:
# Initial check for duplicates

print(f"Duplicates found: {df.duplicated().any()}") 

In [None]:
# Counting all duplicate rows (including all occurrences of duplicates)

duplicated_rows = df[df.duplicated(keep=False)]
print(f"Total duplicate rows: {len(duplicated_rows)}")

In [None]:
# Reviewing the individual duplicated rows

pd.set_option('display.max_columns', None)

duplicated_rows


In [None]:
# Count rows that will actually be removed

rows_to_remove = df.duplicated(keep='first').sum()
print(f"Number of rows to be removed: {rows_to_remove}")
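
The two counts above differ because `keep=False` flags every copy of a duplicated row, while `keep='first'` flags only the extra copies. A toy example (with made-up values) makes the difference concrete:

```python
import pandas as pd

# Toy data: one row appears three times, another twice, one is unique
toy = pd.DataFrame({
    "ts": ["a", "a", "a", "b", "c", "c"],
    "track": ["x", "x", "x", "y", "z", "z"],
})

# keep=False marks all occurrences of any duplicated row: 3 + 2 = 5
all_occurrences = toy.duplicated(keep=False).sum()

# keep='first' marks only the repeats after the first occurrence: 2 + 1 = 3
rows_to_remove = toy.duplicated(keep="first").sum()

# drop_duplicates() keeps the first occurrence of each row by default
deduped = toy.drop_duplicates().reset_index(drop=True)

print(all_occurrences, rows_to_remove, len(deduped))
```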


In [17]:
# After reviewing, these duplicates are identical across every remaining column (timestamps, tracks, etc.), so we can go ahead and remove them from our df:

rows_before = len(df)
df = df.drop_duplicates().reset_index(drop=True)  # reset the index, as after the dropna step
rows_after = len(df)


In [None]:
# Verifying after removing duplicates

print(f"Number of rows removed: {rows_before - rows_after}")
print(f"Final shape of DataFrame: {df.shape}")

In [None]:
# Save the cleaned df to a CSV file
#df.to_csv('/_spotify_streaming_history.csv', index=False)
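
One thing to keep in mind when saving to CSV is that column dtypes are not stored, so `ts` comes back as plain text on reload. The sketch below shows a round trip on a one-row toy DataFrame; the temporary folder and the file name `cleaned_demo.csv` are placeholders, not the notebook's actual output path.

```python
import os
import tempfile

import pandas as pd

# One-row toy frame with a timezone-aware timestamp, like the cleaned `ts` column
toy = pd.DataFrame({
    "ts": pd.to_datetime(["2025-01-01T13:30:30Z"], utc=True),
    "ms_played": [2205],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "cleaned_demo.csv")  # placeholder file name
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# CSV does not preserve dtypes, so convert `ts` back to datetime after loading
reloaded["ts"] = pd.to_datetime(reloaded["ts"], utc=True)
print(reloaded.dtypes)
```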

___

## **Final Remarks**

- The dataset has been cleaned and prepared for analysis, with irrelevant columns dropped, missing values handled, and duplicates removed.