## Project Description
This notebook cleans concert data from SeatGeek that was previously extracted with a data pipeline. We will use pandas to examine which data is available and which is missing, tidy up formats, and deal with missing values. The dataset may feed several projects, including interactive dashboards that show upcoming events to someone trying to sell their tickets or browse what is available, and k-means clustering to group the concerts by category. I'm curious to find out what unsupervised learning will discover!

## Introduction
As I go over my past projects, I realize how messy they are. I am revisiting this one to redo the data cleaning. I want the data to serve two purposes: 1) an interactive dashboard and 2) machine learning, if possible. That is super ambitious and at times unrealistic; I tend to dream big without anticipating how much work it takes to finish something ambitious.

## Load libraries and dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json

In [2]:
# Loading json files into dataframes
df = pd.read_json('ny-concerts.json')

## Data Mining

### Taking stock of what data I have and don't have

In [3]:
# Check Shape
df.shape # 2779 rows and 19 columns

# Check missing Values
df = df.replace('', np.nan) # performer_genre missing was set to '' in last project
df.isna().sum() # 982 tickets missing pricing info / 725 missing performer genres
None

Though quite a bit of information is missing, I'm going to leave those rows in for now.
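Before deciding whether to drop rows, it can help to look at the share of missing values per column rather than the raw counts. The following is a minimal sketch on a made-up toy frame; the column names `average_price` and `performer_genre` are taken from this notebook, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the concert data (values are made up for illustration).
toy = pd.DataFrame({
    "average_price": [55.0, np.nan, 120.0, np.nan],
    "performer_genre": ["Rock", "", "Pop", "Blues"],
})

# Treat empty strings as missing, as in the cell above.
toy = toy.replace("", np.nan)

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same two lines on `df` would show what proportion of rows each column would lose if its missing values were dropped.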

In [4]:
#Drop null value rows (1064)
# c_df.dropna(axis = 0, subset=['average_price'], inplace=True)

## Dates

### Convert String Date Columns into DateTime Objects

In [5]:
# Check current datatype of value date columns
type(df['announce_date'][0]) # str data type

# `df.columns` to list out all columns to find remaining date columns
df.columns
date_columns = ['announce_date', 'date&time_event', 'visible_until_utc']

# Change all date_columns to datetime format using a loop
for col in date_columns:
    df[col] = pd.to_datetime(df[col])

type(df['announce_date'][0]) # pandas._libs.tslibs.timestamps.Timestamp
None
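The plain `pd.to_datetime` call above raises an error if a string cannot be parsed. One possible variation, not used in the original cell, is to pass `errors="coerce"` so that unparseable values become `NaT` and can be counted afterwards. This sketch uses a toy frame with invented values; only the column names come from the notebook.

```python
import pandas as pd

# Toy stand-in with one deliberately bad value (data invented for illustration).
toy = pd.DataFrame({
    "announce_date": ["2019-03-01T00:00:00", "not a date"],
    "visible_until_utc": ["2019-06-15T03:00:00", "2019-06-16T03:00:00"],
})
date_columns = ["announce_date", "visible_until_utc"]

# errors="coerce" turns unparseable strings into NaT instead of raising.
for col in date_columns:
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

# Confirm the column dtypes and count how many values failed to parse.
print(toy.dtypes)
print(toy[date_columns].isna().sum())
```

Checking `df.dtypes` this way confirms the conversion for every column at once, instead of inspecting the type of a single element.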

## Performer Data

I actually have plenty of data to work with if I cut out all

In [6]:
df[df['type_event'].str.contains('concert')] # I'm going to drop this column as all have it
df.performer_genre.unique()

array([nan, 'Electronic', 'Blues', 'Rock', 'Pop', 'Country',
       'Alternative', 'Hip-Hop', 'Reggae', 'Soul', 'Hard Rock', 'Latin',
       'Folk', 'Indie', 'Rnb'], dtype=object)
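The cell above eyeballs whether every row has `'concert'` in `type_event` before dropping the column. A rough sketch of making that check explicit, and of counting genres including the missing ones, might look like the following. The toy data is invented; only the column names come from the notebook.

```python
import pandas as pd

# Toy stand-in for the concert data (values are made up for illustration).
toy = pd.DataFrame({
    "type_event": ["concert", "concert", "concert"],
    "performer_genre": ["Rock", None, "Pop"],
})

# True only if every row's type_event contains 'concert'; missing values count as False.
all_concerts = toy["type_event"].str.contains("concert", na=False).all()

# Drop the column only when it carries no distinguishing information.
if all_concerts:
    toy = toy.drop(columns=["type_event"])

# Count rows per genre, keeping missing genres as their own bucket.
genre_counts = toy["performer_genre"].value_counts(dropna=False)
print(genre_counts)
```

Using `value_counts(dropna=False)` instead of `unique()` shows how many concerts fall under each genre, which is useful when judging how much data remains after removing rows without a genre.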

In [23]:
df.performer_genre.dropna()

1        Electronic
2             Blues
5              Rock
6               Pop
10          Country
12      Alternative
13              Pop
16             Rock
18          Hip-Hop
19      Alternative
24      Alternative
25            Blues
27              Pop
28             Rock
29              Pop
30              Pop
34              Pop
35          Hip-Hop
38             Rock
40             Rock
42              Pop
55       Electronic
58              Pop
61       Electronic
62       Electronic
63             Rock
64              Pop
66              Pop
67              Pop
73             Rock
           ...     
2748            Pop
2749           Soul
2750           Rock
2751           Soul
2752            Pop
2753            Pop
2754            Pop
2755    Alternative
2756          Blues
2757          Blues
2758           Soul
2759           Soul
2760            Pop
2761            Pop
2762          Blues
2763          Blues
2764            Pop
2765            Pop
2766            Pop
