# Data Cleaning Sub-project
 Created by Christopher Maher
### Goal: Create an API cleaning pipeline

  This project builds an ETL pipeline that cleans the data retrieved from a given API according to the following criteria:

  * Handling Missing Data
  * Dealing with Duplicate Data
  * Standardization of Data
  * Handling Outliers of Data
  * Reshaping Data 
  * Filtering and Selecting Data
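
The criteria above map naturally onto small, single-purpose functions chained with `DataFrame.pipe`. The following is a minimal sketch on a toy table, not the notebook's actual code: the column names (`track_id`, `duration_ms`), the helper names, and the z-score threshold of 3 are all assumptions made for illustration.

```python
import pandas as pd


def drop_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Handle missing data by dropping rows that contain any null value."""
    return df.dropna()


def drop_duplicate_tracks(df: pd.DataFrame) -> pd.DataFrame:
    """Deal with duplicates by keeping the first row for each track_id."""
    return df.drop_duplicates(subset="track_id", keep="first")


def drop_outliers(df: pd.DataFrame, column: str, threshold: float = 3.0) -> pd.DataFrame:
    """Keep rows whose z-score in `column` is within the threshold."""
    values = df[column]
    std = values.std(ddof=0)
    if std == 0:
        # A constant column has no outliers.
        return df
    z = (values - values.mean()) / std
    return df[z.abs() <= threshold]


# Hypothetical toy data: 12 ordinary tracks, one duplicate id,
# one missing duration, and one extreme duration.
toy = pd.DataFrame(
    {
        "track_id": [f"t{i}" for i in range(12)] + ["t0", "t12", "t13"],
        "duration_ms": [200_000.0] * 12 + [200_000.0, None, 5_000_000.0],
    }
)

cleaned = (
    toy.pipe(drop_missing)
    .pipe(drop_duplicate_tracks)
    .pipe(drop_outliers, column="duration_ms")
)
print(len(toy), "rows before cleaning,", len(cleaned), "rows after")
```

Keeping each criterion in its own function makes the steps easy to reorder, test in isolation, or skip when they do not apply to a data set.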

This data pipeline will also attempt to follow best practices for building ETL pipelines, namely:

- Error handling
- Logging
- Testing
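
As a rough illustration of how these three practices fit together, the sketch below wraps a pipeline step so that failures are logged with a traceback and re-raised, and checks the step with a plain assertion. The names `configure_logging`, `run_step`, and `remove_none` are hypothetical helpers, not part of this notebook. One detail worth knowing: Python's root logger defaults to the `WARNING` level, so `logging.info(...)` messages are not shown until logging is configured.

```python
import logging

logger = logging.getLogger("cleaning_pipeline")


def configure_logging(level: int = logging.INFO) -> None:
    """Configure the root logger so INFO messages are shown, not only warnings."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # replace any handlers a notebook may have installed already
    )


def run_step(name, func, data):
    """Run one pipeline step, logging success or the full traceback on failure."""
    try:
        result = func(data)
    except Exception:
        logger.exception("Step %r failed", name)
        raise
    logger.info("Step %r finished", name)
    return result


def remove_none(rows):
    """Example step: drop None entries from a list of rows."""
    return [row for row in rows if row is not None]


configure_logging()
cleaned_rows = run_step("remove_none", remove_none, [1, None, 2])

# A test can be as small as an assertion on a known input.
assert cleaned_rows == [1, 2]
```

Re-raising after logging keeps the failure visible to the caller instead of letting the pipeline continue with missing data.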

The pipeline is built with scalability in mind. Given the small size of this project, scaling is not actually needed; it is included as an exercise in best practice.

#### Project Details

  Use the Spotify API to identify trends in a user's 'saved songs', more commonly known as favorite songs, and from them identify the user's preferred listening type.

## Data Imports and variables
All imports are listed below as well as versions used.

Versions developed on:

 `pandas: 2.0.0`

 `logging: 0.5.1.2`
 
 `requests: 2.28.2`
 
 `json: 2.0.9`

 `numpy: 1.24.2`

Check these against your installed versions if you run into errors.

In [140]:
import logging 
import time
import json
import pandas as pd
import numpy as np
import requests
import spotipy
import scipy.stats as stats
from spotipy.oauth2 import SpotifyOAuth

# keys.py is listed in .gitignore so the credentials are never published; it defines clientID and clientSecret, both as strings
import keys

# Debug
DEBUG = False

# Proof of learning

PL = False

#Check versions
if DEBUG:
    print(pd.__version__)
    print(logging.__version__)
    print(requests.__version__)
    print(json.__version__)
    print(np.__version__)

## Data Retrieval
Retrieves the data from the following API: Spotify API

- This API is a large data collection provided by Spotify that exposes many data features. These range from how music 'feels', based on Spotify's internal classifications, to other similar information provided by the API
- This API **requires** credentials, so it's important to add your own credentials to the variables listed above
- Alongside that currently 

Puts the data into a pandas DataFrame for future processing
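
Each saved-track item returned by the API pairs an `added_at` timestamp with a nested `track` dictionary, so building a DataFrame directly from the items leaves the whole track in a single column of dicts. One possible way to get a flat table is `pd.json_normalize`, sketched below. The payload is a hypothetical stand-in shaped like `results['items']`, with made-up values and only a handful of fields.

```python
import pandas as pd

# Hypothetical payload shaped like results['items']; all values are made up.
sample_items = [
    {
        "added_at": "2023-03-01T12:00:00Z",
        "track": {
            "id": "track-1",
            "name": "Example Song",
            "duration_ms": 210000,
            "artists": [{"name": "Example Artist"}],
        },
    },
    {
        "added_at": "2023-03-02T08:30:00Z",
        "track": {
            "id": "track-2",
            "name": "Another Song",
            "duration_ms": 185000,
            "artists": [{"name": "Another Artist"}, {"name": "Guest Artist"}],
        },
    },
]

# Nested dict keys become dotted column names such as 'track.id';
# lists (for example the artists) are kept as-is in a single column.
flat = pd.json_normalize(sample_items)

# Pull the first listed artist's name out of the list column.
flat["artist"] = flat["track.artists"].map(lambda artists: artists[0]["name"])

# Keep only the columns needed for later processing.
flat = flat[["added_at", "track.id", "track.name", "artist", "track.duration_ms"]]
print(flat)
```

A flat table like this has only hashable, scalar values, which makes the later null checks and duplicate checks much simpler.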

In [103]:
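#Connect to Spotify API as a user
try:
    # The user-library-read scope is required to read the user's saved tracks
    sp = spotipy.Spotify(auth_manager=SpotifyOAuth(client_id=keys.clientID,
                                                   client_secret=keys.clientSecret,
                                                   redirect_uri='http://localhost:8080',
                                                   scope='user-library-read'))
    logging.info("Established API connection")

except Exception:
    print("Failed to connect to Spotify")
    logging.exception("Failed to connect to Spotify API")
    # Stop here: nothing below can run without a client
    raise SystemExit(1)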


results = sp.current_user_saved_tracks(limit=50)

#Fun check for if I'm getting results here
if DEBUG:
    for idx, item in enumerate(results['items']):
        track = item['track']
        print(idx, track['artists'][0]['name'], " – ", track['name'])

#put the data in dataframe!
data = pd.DataFrame(results['items'])
logging.info("Data has been entered into DataFrame")


## Data Cleaning
  - Test the DataFrame for N/A or empty values and remove any rows that contain them
  - Check for and remove duplicate data
  - Filter and shape the data
  - Standardization of data is listed for completeness but is not applicable to the current data set
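
Although standardization is not needed for the current data set, a sketch of what such a step could look like may still be useful. The example below assumes a flattened table with hypothetical columns `added_at`, `name`, and `duration_ms`; the values are made up, and the chosen conventions (UTC timestamps, trimmed names, durations in seconds) are illustrative rather than something this project prescribes.

```python
import pandas as pd

# Hypothetical flattened rows; column names and values are made up.
raw = pd.DataFrame(
    {
        "added_at": ["2023-03-01T12:00:00Z", "2023-03-02T08:30:00Z"],
        "name": ["  Example Song ", "Another  Song"],
        "duration_ms": [215000, 187500],
    }
)

standardized = raw.assign(
    # Parse ISO 8601 strings into timezone-aware UTC timestamps.
    added_at=pd.to_datetime(raw["added_at"], utc=True),
    # Trim surrounding whitespace and collapse internal runs of spaces.
    name=raw["name"].str.strip().str.replace(r"\s+", " ", regex=True),
    # Express durations in seconds rather than milliseconds.
    duration_s=raw["duration_ms"] / 1000,
)
print(standardized.dtypes)
```

Using `assign` returns a new DataFrame and leaves `raw` untouched, which keeps the step easy to rerun in a notebook.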

In [138]:
#Checks dataframe for any null values
if data.isnull().values.any():
    # Total number of null cells across all columns
    missing_values = data.isnull().sum().sum()
    logging.warning("%d missing values were found", missing_values)
    # Drop every row that is missing either the track or the time it was added
    data = data.dropna().reset_index(drop=True)

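#Remove outliers in the dataset
if PL:
    # z-scores are only defined for numeric columns
    numeric = data.select_dtypes(include=np.number)

    if not numeric.empty:
        #Absolute z scores of the numeric data
        z = np.abs(stats.zscore(numeric))

        #only keeps the rows where every value is within 3 standard deviations of its column mean
        data = data[(z < 3).all(axis=1)]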


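#Now we'll find repeated tracks and drop them, keeping the first occurrence.
#The 'track' column holds dicts, which are unhashable, so rows are compared by track id.
#This assumes every row has a track, i.e. rows with missing values are already gone.
data = data[~data['track'].map(lambda track: track['id']).duplicated()]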




#Inspect the first saved track
first_track = data['track'].iloc[0]
print(first_track)


dict_values([{'album_group': 'single', 'album_type': 'single', 'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/0gKdP7vj37EabSA4PZuVrn'}, 'href': 'https://api.spotify.com/v1/artists/0gKdP7vj37EabSA4PZuVrn', 'id': '0gKdP7vj37EabSA4PZuVrn', 'name': 'Cloke', 'type': 'artist', 'uri': 'spotify:artist:0gKdP7vj37EabSA4PZuVrn'}], 'available_markets': [], 'external_urls': {'spotify': 'https://open.spotify.com/album/3qI7YPn8H0ukCpOzlJS3jh'}, 'href': 'https://api.spotify.com/v1/albums/3qI7YPn8H0ukCpOzlJS3jh', 'id': '3qI7YPn8H0ukCpOzlJS3jh', 'images': [{'height': 640, 'url': 'https://i.scdn.co/image/ab67616d0000b273035ec317112a824562c306b0', 'width': 640}, {'height': 300, 'url': 'https://i.scdn.co/image/ab67616d00001e02035ec317112a824562c306b0', 'width': 300}, {'height': 64, 'url': 'https://i.scdn.co/image/ab67616d00004851035ec317112a824562c306b0', 'width': 64}], 'is_playable': True, 'name': 'So Innocent', 'release_date': '2023-02-22', 'release_date_precision': 'day', 'to

In [None]:
tid = 'spotify:track:4TTV7EcfroSLWzXRY6gLv6'
start = time.time()
analysis = sp.audio_analysis(tid)
delta = time.time() - start
print(json.dumps(analysis, indent=4))
print("analysis retrieved in %.2f seconds" % (delta,))