# Data Cleaning Sub-project
 Created by Christopher Maher
### Goal: Create an API cleaning pipeline

  This project builds an ETL pipeline that cleans the data returned by a given API according to the following criteria:

  * Handling Missing Data
  * Dealing with Duplicate Data
  * Standardization of Data
  * Handling Outliers of Data
  * Reshaping Data 
  * Filtering and Selecting Data

This data pipeline also tries to follow best practices for building ETL pipelines, namely:

- Error handling
- Logging
- Testing

The pipeline is written with scaling in mind. Given the size of this project, scaling is not actually needed; it is included as an exercise in best practice.
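As a rough illustration of how error handling, logging, and testing can fit together in one pipeline step, here is a minimal sketch. The logger name `cleaning_pipeline`, the helper `run_step`, and the toy step `drop_empty` are hypothetical names introduced for this example and are not part of the notebook.

```python
import logging

# Configure the root logger once, near the top of the notebook. Without this,
# logging.info() calls are dropped because the default level is WARNING.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("cleaning_pipeline")  # hypothetical logger name


def run_step(name, func, *args, **kwargs):
    """Run one pipeline step and log its success or failure (hypothetical helper)."""
    try:
        result = func(*args, **kwargs)
    except Exception:
        # logger.exception logs at ERROR level and includes the traceback.
        logger.exception("Step %r failed", name)
        raise
    logger.info("Step %r finished", name)
    return result


def drop_empty(records):
    """Toy cleaning step: drop records that contain a None value."""
    return [r for r in records if None not in r.values()]


cleaned = run_step("drop_empty", drop_empty, [{"id": 1}, {"id": None}])
assert cleaned == [{"id": 1}]  # a minimal test of the step
```

Re-raising after logging keeps the failure visible to the caller, so a broken step cannot silently feed bad data to the next one.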

#### Project Details

  Use the Spotify API to analyze a user's 'saved songs' (more commonly known as favorite songs) and identify trends in their preferred listening type.

## Data Imports and variables
All imports are listed below, along with the versions used.

Versions developed on:

 `pandas: 2.0.0`

 `logging: 0.5.1.2`
 
 `requests: 2.28.2`
 
 `json: 2.0.9`

 `numpy: 1.24.2`

If you run into errors, compare these against your installed versions.
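One possible way to automate that comparison is sketched below using the standard library's `importlib.metadata`. The `EXPECTED` dictionary and the `compare_versions` helper are hypothetical names for this example. `logging` and `json` ship with Python itself, so only the third-party packages are checked.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions this notebook was developed on, taken from the list above.
EXPECTED = {"pandas": "2.0.0", "requests": "2.28.2", "numpy": "1.24.2"}


def compare_versions(expected):
    """Return {package: (expected, installed)} for every mismatch.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


for package, (wanted, installed) in compare_versions(EXPECTED).items():
    print(f"{package}: developed on {wanted}, installed {installed}")
```

A mismatch is not necessarily a problem; it is only a hint about where to look first if a cell fails.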

In [57]:
import logging 
import time
import json
import openpyxl
import pandas as pd
import numpy as np
import requests
import spotipy
import scipy.stats as stats
from spotipy.oauth2 import SpotifyOAuth

# keys.py is listed in .gitignore because it holds secrets that must not be published; it stores clientID and clientSecret, both as strings
import keys

# Debug
DEBUG = False

# Proof of learning

PL = False

#Check versions
if DEBUG:
    print(pd.__version__)
    print(logging.__version__)
    print(requests.__version__)
    print(json.__version__)
    print(np.__version__)

## Data Retrieval
Retrieves the data from the Spotify API.

- This API is a large data collection provided by Spotify that exposes many data features, ranging from how music 'feels' (based on Spotify's internal classifications) to other track-level information
- This API **requires** credentials, so add your own to the `keys` module imported above
- Alongside that currently 

Puts the data into pandas DataFrames for future processing
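To make the unpacking step easier to picture, here is a small sketch that flattens nested items with `pd.json_normalize`. This is an alternative to the `apply(pd.Series)` approach used in this notebook, not a description of it. The `sample_items` payload is a hypothetical, heavily trimmed stand-in for `results['items']`; the real response carries many more fields per track.

```python
import pandas as pd

# Hypothetical, trimmed payload shaped like results['items'].
sample_items = [
    {
        "added_at": "2023-04-01T12:00:00Z",
        "track": {
            "id": "t1",
            "name": "Song A",
            "album": {"name": "Album A"},
            "artists": [{"id": "a1", "name": "Artist A"}],
        },
    },
    {
        "added_at": "2023-04-02T12:00:00Z",
        "track": {
            "id": "t2",
            "name": "Song B",
            "album": {"name": "Album B"},
            "artists": [
                {"id": "a2", "name": "Artist B"},
                {"id": "a3", "name": "Artist C"},
            ],
        },
    },
]

# json_normalize expands nested dictionaries into dotted column names;
# lists (such as 'artists') are kept as-is in a single column.
flat = pd.json_normalize(sample_items)

# Keep only the first listed artist of each track.
flat["first_artist"] = flat["track.artists"].apply(lambda artists: artists[0]["name"])

print(flat[["added_at", "track.name", "track.album.name", "first_artist"]])
```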

In [63]:

# define a function to extract the first element of the dictionary
def extract_first_val(row):
    return list(row.values())[0]

#Connect to Spotify API as a user
try:
    sp = spotipy.Spotify(auth_manager=SpotifyOAuth(client_id=keys.clientID, client_secret=keys.clientSecret, redirect_uri='http://localhost:8080'))
    logging.info("Established API connection")

except Exception:
    print("Failed to connect to Spotify")
    logging.exception("Failed to connect to Spotify API")
    #Stop here: nothing below can run without a client
    raise SystemExit(1)


results = sp.current_user_saved_tracks(limit=50)

#Fun check for if I'm getting results here
if DEBUG:
    for idx, item in enumerate(results['items']):
        track = item['track']
        print(idx, track['artists'][0]['name'], " – ", track['name'])

#put the data in dataframe! During this stage I 'unpack' a lot of the information since it's hidden in dictionaries
timeAdded = pd.DataFrame(results['items']).drop(['track'],axis=1)

trackInformation = pd.DataFrame(pd.DataFrame(results['items'])['track'].apply(lambda x: pd.Series(x)))

albumInfo = trackInformation['album'].apply(lambda x: pd.Series(x))

#This was a real pain to figure out: each 'artists' entry is a list rather than a single dictionary, so expand the lists into columns, keep the first artist, then apply my lambda
artistInfo = pd.DataFrame(albumInfo['artists'].tolist())[0].apply(lambda x: pd.Series(x))

# apply the function to each row of the 'external_urls' column
artistInfo['external_urls'] = artistInfo['external_urls'].apply(extract_first_val)

albumInfo['external_urls'] = albumInfo['external_urls'].apply(extract_first_val)

#Dropping columns I don't need (drop returns a new DataFrame, so reassign it)

albumInfo = albumInfo.drop('artists', axis=1)

#print(type(artistInfo['id'][49]))
logging.info("Data has been entered into DataFrame")

## Create the Data Schemas
   Schemas let us define which values we expect in each column of the DataFrames created above. We keep the DataFrames separate for now because that makes cleaning easier; the important part is that each one keeps a unique id, which is its row number.
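The idea of checking each column against an expected type can be sketched without any third-party library. The example below assumes the column-oriented shape that `DataFrame.to_dict()` produces, `{column_name: {row_number: value}}`. `EXPECTED_TYPES` and `validate_columns` are hypothetical names, and the three columns are only an illustrative subset.

```python
# Expected Python type for the values of each column (hypothetical subset).
EXPECTED_TYPES = {"id": str, "name": str, "uri": str}


def validate_columns(columns, expected_types):
    """Return a list of problems found in a column-oriented dict.

    `columns` has the shape {column_name: {row_number: value}}.
    """
    problems = []
    for column, expected in expected_types.items():
        if column not in columns:
            problems.append(f"missing column: {column}")
            continue
        for row, value in columns[column].items():
            if not isinstance(value, expected):
                problems.append(
                    f"{column}[{row}]: expected {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
    return problems


good = {"id": {0: "a1"}, "name": {0: "Artist A"}, "uri": {0: "spotify:artist:a1"}}
bad = {"id": {0: "a1", 1: None}, "name": {0: "Artist A", 1: "Artist B"}}

print(validate_columns(good, EXPECTED_TYPES))
print(validate_columns(bad, EXPECTED_TYPES))
```

Returning every problem at once, rather than stopping at the first, makes it easier to see how much of a response is malformed.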

In [33]:
from schema import Schema, SchemaError
#Convert the artistInfo DataFrame to a dictionary so the schema can work. Then create and validate from the schema information
#to_dict() yields {column: {row number: value}}, so each column is described as {int: str}
dict_artistInfo = artistInfo.to_dict()
artist_schema = Schema({'external_urls': {int: str},
                            'href': {int: str},
                            'id': {int: str},
                            'name': {int: str},
                            'type': {int: str},
                            'uri': {int: str}
})
try:
    artist_schema.validate(dict_artistInfo)
except SchemaError:
    print("The artist has missing or wrong values")
    logging.error("Artist information doesn't have correct values or is missing values")
#Schema for the albumInfo DataFrame (defined here but not validated yet)
album_schema = Schema({'album_group': {int: str},
                        'album_type': {int: str},
                        'available_markets': {int: list},
                        'name': {int: str},
                        'type': {int: str},
                        'uri': {int: str}
})

## Data Cleaning
  - Test the DataFrame for N/A or empty values and remove any that are found
  - Check for and remove duplicate data
  - Filter and shape our data
  - A section for standardizing data is included but does not apply to the current data set
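The steps above can be sketched on a toy table as follows. The `track_id` and `tempo` columns and their values are made up for illustration, and the z-score is computed directly with pandas here rather than with `scipy.stats`. A threshold of 3 can only flag a value when a column has more than 10 rows, because the largest possible population z-score among n values is sqrt(n - 1).

```python
import numpy as np
import pandas as pd

# Hypothetical feature table: 20 tracks, one extreme tempo and one missing value.
tempo = [120.0] * 9 + [121.0] * 9 + [400.0, np.nan]
df = pd.DataFrame({"track_id": [f"t{i}" for i in range(20)], "tempo": tempo})
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)  # add one duplicate row


def clean(frame, threshold=3.0):
    """Drop rows with missing values, duplicate rows, and z-score outliers."""
    frame = frame.dropna().drop_duplicates()
    numeric = frame.select_dtypes(include="number")
    # Population z-score per column; keep rows where every numeric value lies
    # within `threshold` standard deviations of its column mean.
    z = (numeric - numeric.mean()) / numeric.std(ddof=0)
    return frame[(z.abs() < threshold).all(axis=1)]


cleaned = clean(df)
print(len(df), "rows before,", len(cleaned), "rows after")
```

Missing values and duplicates are removed before the z-scores are computed so that they do not distort the column mean and standard deviation.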

In [61]:
#Checks dataframe for any null values
#NOTE: `data` is not defined in the cells shown here; it is assumed to be the raw saved-tracks DataFrame, e.g. pd.DataFrame(results['items'])
if data.isnull().values.any():
    missing_values = int(data.isnull().sum().sum())
    logging.warning("%d missing values were found", missing_values)
    #Drop every row that is missing either the track or the time it was added
    data = data.dropna()

#Remove outliers and duplicates in the dataset. NOT NEEDED for the current dataset.
if PL:
    def clean(data: pd.DataFrame):
        """
        Cleans the data of outliers and duplicate data.

        data: DataFrame of numeric data you want to clean.
        """
        #NDArray of absolute z scores of my data
        z = np.abs(stats.zscore(data))

        #only keeps the rows where every absolute z score is below 3 (aka within 3 standard deviations)

        data = data[(z < 3).all(axis=1)]

        #Now we'll drop duplicated rows, keeping the first occurrence of each, and return the result
        return data.drop_duplicates()

# Not all of the data we have is useful to us. For now, the only data we care about is
# the artist name, the song name, and the song's track ID (tid)
# First, let's clean up our data a bit more...




In [None]:
tid = 'spotify:track:4TTV7EcfroSLWzXRY6gLv6'
start = time.time()
analysis = sp.audio_analysis(tid)
delta = time.time() - start
print(json.dumps(analysis, indent=4))
print("analysis retrieved in %.2f seconds" % (delta,))