## Project Description
This project cleans concert data from SeatGeek that was previously extracted with a data pipeline. I use pandas to examine which data is available and which is missing, tidy up formats, and handle missing values. The cleaned dataset could feed several projects, including an interactive dashboard of upcoming events (for someone trying to sell tickets or browse what is on) and k-means clustering to group the concerts by category. I'm curious to find out what unsupervised learning will discover!

## Introduction
As I go over my past projects, I realize how messy they are. I am revisiting this one to redo the data cleaning. I want the data to serve two purposes: 1) an interactive dashboard and 2) machine learning, if possible. That is super ambitious and at times unrealistic: I tend to dream big without anticipating how much work it actually takes to finish something ambitious.

## Load libraries

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import seaborn as sns
import datetime
import re
import json
from dateutil.parser import parse

from sklearn import preprocessing, datasets, linear_model
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import mean_squared_error, r2_score
import collections


In [4]:
# Loading json files into dataframes
df = pd.read_json('ny-concerts.json')

## Data Mining

### Taking stock of what data I have and don't have

In [1]:
# Check Shape
df.shape # 2779 rows and 19 columns

# Check missing Values
df = df.replace('', np.nan) # performer_genre missing was set to '' in last project
df.isna().sum() # 982 tickets missing pricing info / 725 missing performer genres
None

NameError: name 'df' is not defined

Although quite a bit of information is missing, I'm going to leave those rows in for now.
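To make the "taking stock" step concrete, here is a minimal sketch of the same idea on a toy frame. The rows and values are made up for illustration (they are not from the SeatGeek file), and `summary` is just a name I picked; the point is only to show empty strings being converted to NaN and then counted per column.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the SeatGeek export (values are made up)
toy = pd.DataFrame({
    "performer_genre": ["Pop", "", "rock", ""],
    "average_price": [58.0, np.nan, 114.0, np.nan],
    "venue_name": ["A", "B", "C", "D"],
})

# Treat empty strings as missing, then count the gaps per column
toy = toy.replace("", np.nan)
missing = toy.isna().sum()
missing_pct = (toy.isna().mean() * 100).round(1)

# Side-by-side view of absolute and relative missingness
summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
print(summary.sort_values("missing", ascending=False))
```

Looking at the percentage next to the raw count makes it easier to decide whether a column is worth imputing or should simply be left alone for now.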

In [7]:
#Drop null value rows (1064)
c_df.dropna(axis = 0, subset=['average_price'], inplace=True)

# Replace the string 'NaN' with numpy NaN
c_df = c_df.replace('NaN', np.nan)

In [15]:
# Check again
df.isna().sum()

# Now that the string values 'NaN' are numpy NaNs, I can see that there are more NaN values in Genre.

announce_date             0
average_price           982
date&time_event           0
event_title               0
highest_price           982
lowest_price            982
median_price            982
performer_genre           0
performer_name            0
ticket_listing_count    982
type_event                0
upcoming_events?          0
url                       0
venue_capacity            0
venue_city                0
venue_name                0
venue_score               0
venue_zipcode             0
visible_until_utc         0
dtype: int64

Unnamed: 0,announce_date,average_price,date&time_event,event_title,highest_price,lowest_price,median_price,performer_genre,performer_name,ticket_listing_count,type_event,upcoming_events?,url,venue_capacity,venue_city,venue_name,venue_score,venue_zipcode,visible_until_utc
0,2019-08-14,,2019-08-16 17:00:00,Big Eyed Phish Live,,,,,Big Eyed Phish Live,,concert,True,https://seatgeek.com/big-eyed-phish-live-ticke...,0,Canandaigua,Lincoln Hill Farms,0.000000,14424,2019-08-17 01:00:00
3,2019-05-09,,2019-08-16 18:00:00,Gatsby,,,,,Gatsby,,concert,True,https://seatgeek.com/gatsby-tickets/new-york-n...,0,New York,Club Bonafide,0.000000,10022,2019-08-17 02:00:00
4,2019-06-16,,2019-08-16 18:00:00,"No Zodiac, Kaonashi, VCTMS",,,,,"No Zodiac, Kaonashi, VCTMS",,concert,True,https://seatgeek.com/no-zodiac-kaonashi-vctms-...,0,Brooklyn,Gold Sounds Bar,0.000000,11237,2019-08-17 02:00:00
7,2019-07-07,,2019-08-16 18:30:00,OFF,,,,,OFF,,concert,True,https://seatgeek.com/off-tickets/brooklyn-new-...,0,Brooklyn,The Kingsland Bar and Grill,0.000000,11222,2019-08-17 02:30:00
8,2019-08-09,,2019-08-16 18:30:00,The Fast Lane Eagles a Tribute to the Eagles,,,,,The Fast Lane Eagles a Tribute to the Eagles,,concert,True,https://seatgeek.com/the-fast-lane-eagles-a-tr...,0,Mahopac,Putnam County Golf Course,0.000000,10541,2019-08-17 02:30:00
9,2019-06-13,,2019-08-16 19:00:00,Gunhild Carling,,,,,Gunhild Carling,,concert,True,https://seatgeek.com/gunhild-carling-tickets/n...,0,New York,Birdland,0.000000,10036,2019-08-17 03:00:00
11,2019-07-06,,2019-08-16 19:00:00,Indian Independence Day Boat Party,,,,,Indian Independence Day Boat Party,,concert,True,https://seatgeek.com/indian-independence-day-b...,0,New York,Hornblower Cruises & Events Pier 15,0.000000,10038,2019-08-17 03:00:00
14,2019-08-13,,2019-08-16 19:00:00,"Bandits on the Run, Stereo League, Cold Weathe...",,,,,"Bandits on the Run, Stereo League, Cold Weathe...",,concert,True,https://seatgeek.com/bandits-on-the-run-stereo...,250,New York,Mercury Lounge,0.455722,10002,2019-08-17 03:00:00
15,2019-07-25,,2019-08-16 19:00:00,The Queens Boat Party NYC,,,,,The Queens Boat Party NYC,,concert,True,https://seatgeek.com/the-queens-boat-party-nyc...,0,New York,Hornblower Cruises & Events Pier 15,0.000000,10038,2019-08-17 03:00:00
17,2019-08-10,,2019-08-16 19:00:00,Sunset on the Hudson,,,,,Sunset on the Hudson,,concert,True,https://seatgeek.com/sunset-on-the-hudson-tick...,0,New York,Pier 45 at Hudson River Park,0.000000,10014,2019-08-17 03:00:00


# Dates

In [16]:
# Convert string dates to datetime objects

In [38]:
type(df['announce_date'])

pandas.core.series.Series

In [43]:
df.columns

Index(['announce_date', 'average_price', 'date&time_event', 'event_title',
       'highest_price', 'lowest_price', 'median_price', 'performer_genre',
       'performer_name', 'ticket_listing_count', 'type_event',
       'upcoming_events?', 'url', 'venue_capacity', 'venue_city', 'venue_name',
       'venue_score', 'venue_zipcode', 'visible_until_utc'],
      dtype='object')

In [49]:
date_columns = ['announce_date', 'date&time_event', 'visible_until_utc']

In [50]:
for i in date_columns:
    print(i)

announce_date
date&time_event
visible_until_utc


In [59]:
#loop through the date_columns and change to time stamp
for i in date_columns:
    df[i] = pd.to_datetime(df[i])
type(df['announce_date'][0]) # not a str anymore, now a pandas Timestamp
# pd.to_datetime(df['announce_date']))

pandas._libs.tslibs.timestamps.Timestamp
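As a small sketch of the conversion loop above, this toy version adds `errors="coerce"`, which is my own addition and not something the original cell uses. It assumes that an unparseable string should become `NaT` rather than stop the loop; the sample strings are invented.

```python
import pandas as pd

# Made-up sample with one deliberately bad date string
toy = pd.DataFrame({
    "announce_date": ["2019-08-14", "2019-05-09", "not a date"],
    "visible_until_utc": [
        "2019-08-17 01:00:00",
        "2019-08-17 02:00:00",
        "2019-08-17 03:00:00",
    ],
})
date_columns = ["announce_date", "visible_until_utc"]

# errors="coerce" turns unparseable strings into NaT instead of raising
for col in date_columns:
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

print(toy.dtypes)
print(type(toy["announce_date"][0]))
print(toy["announce_date"].isna().sum(), "value(s) could not be parsed")
```

Counting the `NaT` values afterwards is a cheap check that the conversion did not silently discard a lot of rows.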

In [60]:
df.head()

Unnamed: 0,announce_date,average_price,date&time_event,event_title,highest_price,lowest_price,median_price,performer_genre,performer_name,ticket_listing_count,type_event,upcoming_events?,url,venue_capacity,venue_city,venue_name,venue_score,venue_zipcode,visible_until_utc
0,2019-08-14,,2019-08-16 17:00:00,Big Eyed Phish Live,,,,,Big Eyed Phish Live,,concert,True,https://seatgeek.com/big-eyed-phish-live-ticke...,0,Canandaigua,Lincoln Hill Farms,0.0,14424,2019-08-17 01:00:00
1,2019-04-20,,2019-08-16 17:00:00,George Fitzgerald,,,,Electronic,George Fitzgerald,,concert,True,https://seatgeek.com/george-fitzgerald-tickets...,0,Brooklyn,Elsewhere,0.0,11237,2019-08-17 01:00:00
2,2019-08-13,,2019-08-16 17:30:00,The Birdland Big Band,,,,Blues,The Birdland Big Band,,concert,True,https://seatgeek.com/the-birdland-big-band-tic...,0,New York,Birdland,0.0,10036,2019-08-17 01:30:00
3,2019-05-09,,2019-08-16 18:00:00,Gatsby,,,,,Gatsby,,concert,True,https://seatgeek.com/gatsby-tickets/new-york-n...,0,New York,Club Bonafide,0.0,10022,2019-08-17 02:00:00
4,2019-06-16,,2019-08-16 18:00:00,"No Zodiac, Kaonashi, VCTMS",,,,,"No Zodiac, Kaonashi, VCTMS",,concert,True,https://seatgeek.com/no-zodiac-kaonashi-vctms-...,0,Brooklyn,Gold Sounds Bar,0.0,11237,2019-08-17 02:00:00


In [17]:
import pytz

c_df['announce_date'] = pd.to_datetime(c_df['announce_date']).dt.date
c_df['visible_until_utc'] = pd.to_datetime(c_df['visible_until_utc'])
# c_df['visible_until_utc'] = pd.to_datetime(c_df['visible_until_utc'])

#Changing Visible Until UTC to Eastern time
eastern = pytz.timezone('US/Eastern')
c_df['visible_until_est'] = c_df['visible_until_utc'].dt.tz_localize('UTC').dt.tz_convert(eastern)

#Convert event time to weekend/weekday and afternoon/evening

c_df['event_year']= [d.split('-')[0] for d in c_df['date&time_event']]
c_df['event_month']= [d.split('-')[1] for d in c_df['date&time_event']]
c_df['event_day']= [d.split('-')[2] for d in c_df['date&time_event']]

# Create event time features (start hour, date, month, day of week)
c_df['date&time_event'] = pd.to_datetime(c_df['date&time_event'])
c_df['event_start'] = pd.to_datetime(c_df['date&time_event']).dt.strftime("%H")
c_df['event_date'] = pd.to_datetime(c_df['date&time_event']).dt.date
c_df['event_month'] = pd.to_datetime(c_df['event_date']).dt.strftime("%m")
c_df['event_day'] = pd.to_datetime(c_df['date&time_event']).dt.strftime("%a")

#Create ticket window column
c_df['ticket_avail_for'] = pd.to_datetime(c_df['visible_until_utc'].dt.date - c_df['announce_date']).dt.strftime("%d")
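A note on the ticket window: formatting a time difference with `"%d"` appears to return a day of the month rather than the total number of days. The `ticket_avail_for` values shown later in `c_df.head()` (23 and 20 for events announced in October 2018 and visible until February 2019) are consistent with that. The sketch below is one possible alternative, not the original method: it subtracts the two datetimes and reads `.dt.days`. The column name `ticket_avail_days` is hypothetical; the two sample rows reuse dates that appear in the notebook output.

```python
import pandas as pd

# Two rows using dates that appear in the c_df.head() output
toy = pd.DataFrame({
    "announce_date": ["2018-10-13", "2018-10-16"],
    "visible_until_utc": ["2019-02-02 03:30:00", "2019-02-02 04:00:00"],
})
announce = pd.to_datetime(toy["announce_date"])
visible = pd.to_datetime(toy["visible_until_utc"])

# normalize() drops the time of day, so the difference is whole calendar days
toy["ticket_avail_days"] = (visible.dt.normalize() - announce).dt.days
print(toy)
```

Keeping the result as an integer column also avoids the `object` dtype that `strftime` produces, which matters if the window is later used as a numeric feature.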



# Afternoon or Evening

In [18]:
type(c_df['event_start'][0])

str

In [19]:
# Change values to int
c_df['event_start'] = c_df['event_start'].apply(lambda x: int(x))

In [20]:
#drop all times before 12 (noon)
before_11 = c_df[c_df['event_start']<12].index
c_df.drop(before_11, inplace=True)
c_df[c_df['event_start']<12]

#drop all times above 24
after_24 = c_df[c_df['event_start']>24].index
c_df.drop(after_24, inplace=True)
c_df[c_df['event_start'] >24]


Unnamed: 0,announce_date,average_price,date&time_event,event_title:,highest_price,lowest_price,median_price,performer_genre,performer_name,ticket_listing_count,...,venue_name,venue_zipcode,visible_until_utc,visible_until_est,event_year,event_month,event_day,event_start,event_date,ticket_avail_for


In [21]:
c_df.reset_index(inplace=True)

In [22]:
type(c_df.loc[c_df['event_start'][0]])

pandas.core.series.Series

In [23]:
c_df['event_start'].value_counts().sort_values()

16      2
12      5
15      7
14     10
23     13
17     14
13     43
22     51
18     90
21    132
19    479
20    866
Name: event_start, dtype: int64

In [24]:
def f(row):
    if (12 <= row['event_start'] <= 17):
        val = 'afternoon'
    elif (18 <= row['event_start'] <= 23):
        val = 'evening'
    return val

In [25]:
c_df['time_of_day'] = c_df.apply(f, axis=1)

In [26]:
c_df['time_of_day'].value_counts()

evening      1631
afternoon      81
Name: time_of_day, dtype: int64
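The row-wise function above works because earlier cells already removed hours outside 12 to 23. As an alternative sketch (not what the notebook actually ran), `pd.cut` can assign the same two labels in one vectorized call and leaves any hour outside the bins as NaN instead of failing. The sample hours are made up.

```python
import pandas as pd

# Made-up start hours, including one (9) outside the afternoon/evening range
event_start = pd.Series([9, 12, 17, 18, 20, 23], name="event_start")

# Bins are right-closed: (11, 17] is afternoon, (17, 23] is evening
time_of_day = pd.cut(
    event_start,
    bins=[11, 17, 23],
    labels=["afternoon", "evening"],
)
print(time_of_day.value_counts())
print(time_of_day.isna().sum(), "hour(s) fell outside the bins")
```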

# Weekend or Weekday

In [27]:
type(c_df['event_day'])

pandas.core.series.Series

In [28]:
c_df['event_day'].value_counts()

Sat    416
Fri    378
Thu    266
Sun    222
Wed    196
Tue    150
Mon     84
Name: event_day, dtype: int64

In [29]:
c_df['event_day'].apply(lambda x: str(x))

0       Fri
1       Fri
2       Fri
3       Fri
4       Fri
5       Fri
6       Fri
7       Fri
8       Fri
9       Fri
10      Fri
11      Fri
12      Fri
13      Fri
14      Fri
15      Fri
16      Sat
17      Sat
18      Sat
19      Sat
20      Sat
21      Sat
22      Sat
23      Sat
24      Sat
25      Sat
26      Sat
27      Sat
28      Sat
29      Sat
       ... 
1682    Wed
1683    Thu
1684    Fri
1685    Sat
1686    Sat
1687    Sun
1688    Wed
1689    Thu
1690    Sat
1691    Sun
1692    Sat
1693    Sun
1694    Sun
1695    Sat
1696    Sat
1697    Sun
1698    Sun
1699    Fri
1700    Sun
1701    Fri
1702    Sun
1703    Sun
1704    Sat
1705    Sun
1706    Thu
1707    Sat
1708    Sun
1709    Sun
1710    Sat
1711    Fri
Name: event_day, Length: 1712, dtype: object

In [30]:
def d(row):
    if (c_df['event_day'] == 'Sat').any():
        val = 'weekend'
    else:
        val = 'weekday'
    return val

In [31]:
c_df['wkend_wkday'] = c_df.apply(d, axis=1)

In [32]:
c_df['wkend_wkday'].value_counts()

weekend    1712
Name: wkend_wkday, dtype: int64
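Every row came out as `weekend`, which suggests that `d` is testing the whole `event_day` column with `.any()` instead of the current row's value. A minimal sketch of a per-row version follows. It assumes that only Saturday and Sunday count as the weekend (Friday could arguably be included); under that assumption the day counts above would give 416 + 222 = 638 weekend rows.

```python
import pandas as pd

# Made-up sample of day abbreviations in the same "%a" format
event_day = pd.Series(["Fri", "Sat", "Sun", "Mon"], name="event_day")

# Assumption: only Saturday and Sunday count as the weekend
weekend_days = ["Sat", "Sun"]

# isin() checks each row separately, then map() turns the booleans into labels
wkend_wkday = event_day.isin(weekend_days).map({True: "weekend", False: "weekday"})
print(wkend_wkday.value_counts())
```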

In [33]:
c_df.head()

Unnamed: 0,index,announce_date,average_price,date&time_event,event_title:,highest_price,lowest_price,median_price,performer_genre,performer_name,...,visible_until_utc,visible_until_est,event_year,event_month,event_day,event_start,event_date,ticket_avail_for,time_of_day,wkend_wkday
0,0,2018-10-13,58.0,2019-02-01 18:30:00,Young Nudy with SahBabii,63.0,54.0,54.0,Hip-Hop,SahBabii,...,2019-02-02 03:30:00,2019-02-01 22:30:00-05:00,2019,2,Fri,18,2019-02-01,23,evening,weekend
1,2,2018-10-16,114.0,2019-02-01 19:00:00,Umphrey's McGee with Robert Walter's 20th Cong...,117.0,111.0,111.0,Pop,Umphrey's McGee,...,2019-02-02 04:00:00,2019-02-01 23:00:00-05:00,2019,2,Fri,19,2019-02-01,20,evening,weekend
2,3,2018-11-27,337.0,2019-02-01 19:00:00,ZyanosE with Hank Wood & The Hammerheads,584.0,253.0,256.0,rock,Hank Wood & The Hammerheads,...,2019-02-02 04:00:00,2019-02-01 23:00:00-05:00,2019,2,Fri,19,2019-02-01,9,evening,weekend
3,8,2018-12-18,41.0,2019-02-01 19:00:00,Sighns with Cyberattack,46.0,36.0,36.0,blues,Sighns,...,2019-02-02 04:00:00,2019-02-01 23:00:00-05:00,2019,2,Fri,19,2019-02-01,16,evening,weekend
4,9,2018-12-20,337.0,2019-02-01 19:30:00,Ritual Talk with The Humble Cheaters,584.0,253.0,256.0,electronic,Ritual Talk,...,2019-02-02 04:30:00,2019-02-01 23:30:00-05:00,2019,2,Fri,19,2019-02-01,14,evening,weekend


# Dropping columns and doing repairs

In [34]:
c_df.drop(columns=['index','announce_date', 'visible_until_utc', 'visible_until_est',\
                  'event_year', 'event_date', 'event_title:', 'upcoming_events?', 'date&time_event',\
                  'type_event', 'venue_zipcode'],inplace=True)
under_3 = c_df[c_df['ticket_listing_count']<3].index
c_df.drop(under_3, inplace=True)

## Fill NaN median prices with the average price
c_df[c_df['median_price'].isna()]
# fillna needs the replacement values themselves, not the boolean mask from .isna()
c_df['median_price'] = c_df['median_price'].fillna(c_df['average_price'])
c_df['median_price'] = c_df['median_price'].apply(lambda x: float(x))

In [35]:
c_df = c_df.reset_index(drop=True)

In [36]:
#drop rows with any remaining NaN (this includes performers without genre)
c_df.dropna(inplace=True)

In [37]:
c_df =c_df.reset_index(drop=True)
c_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1281 entries, 0 to 1280
Data columns (total 15 columns):
average_price           1281 non-null float64
highest_price           1281 non-null float64
lowest_price            1281 non-null float64
median_price            1281 non-null float64
performer_genre         1281 non-null object
performer_name          1281 non-null object
ticket_listing_count    1281 non-null float64
venue_city              1281 non-null object
venue_name              1281 non-null object
event_month             1281 non-null object
event_day               1281 non-null object
event_start             1281 non-null int64
ticket_avail_for        1281 non-null object
time_of_day             1281 non-null object
wkend_wkday             1281 non-null object
dtypes: float64(5), int64(1), object(9)
memory usage: 150.2+ KB


# Venues

In [38]:
v_df.head()

Unnamed: 0,venue_capacity,venue_name,venue_score
0,250,The Cutting Room,0.5739
1,206,Baby's All Right,0.424032
2,0,Knitting Factory Brooklyn,0.468147
3,0,Brooklyn Bowl,0.527452
4,0,Iridium Jazz Club,0.48766


In [39]:
v_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2875 entries, 0 to 2874
Data columns (total 3 columns):
venue_capacity    2875 non-null int64
venue_name        2875 non-null object
venue_score       2875 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 67.5+ KB


In [40]:
v_df.drop_duplicates(subset=['venue_name'],inplace=True)

In [41]:
v_df = v_df.reset_index(drop=True)

In [42]:
c_df = pd.merge(c_df, v_df, how='left', on='venue_name').reset_index(drop=True)
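Here is a small sketch of the de-duplicate-then-merge step with two optional safety checks added. The `validate` and `indicator` arguments are real `pd.merge` parameters, but using them here is my suggestion rather than part of the original cell, and the frames are toy data.

```python
import pandas as pd

# Toy stand-ins for c_df and v_df
concerts = pd.DataFrame({
    "performer_name": ["P1", "P2", "P3"],
    "venue_name": ["Venue A", "Venue B", "Venue C"],
})
venues = pd.DataFrame({
    "venue_name": ["Venue A", "Venue A", "Venue B"],
    "venue_capacity": [250, 250, 0],
})

# Keep one row per venue so the merge cannot multiply concert rows
venues = venues.drop_duplicates(subset=["venue_name"]).reset_index(drop=True)

# validate raises if venue_name is not unique on the right side;
# indicator adds a _merge column showing which concerts found a venue
merged = pd.merge(
    concerts, venues, how="left", on="venue_name",
    validate="many_to_one", indicator=True,
)
unmatched = merged.loc[merged["_merge"] == "left_only", "venue_name"].tolist()
print(merged)
print("Venues with no match:", unmatched)
```

A left merge keeps every concert, so checking the row count and the unmatched venues right away shows how many capacities still have to be filled in by hand.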

In [43]:
# c_df.loc[c_df['performer_name'] == 'Pop Evil']['venue_capacity'] = 2500

In [75]:
# work with this table
change_table['venue_capacity'] = change_table['venue_capacity'].apply(lambda x: x+250)
change_table['venue_capacity']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


0       500.0
66      500.0
79      500.0
109     500.0
147     500.0
222     500.0
238     500.0
435     500.0
502     500.0
504     500.0
524     500.0
551     500.0
634     500.0
643     500.0
660     500.0
686     500.0
692     500.0
729     500.0
757     500.0
825     500.0
884     500.0
1160    500.0
Name: venue_capacity, dtype: float64
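The warning above usually means the table being modified was created by slicing another DataFrame, so pandas cannot tell whether the assignment reaches the original. The notebook does not show how `change_table` was built, so the sketch below assumes it was a filtered slice of `c_df` and shows two common ways around the warning on toy data.

```python
import pandas as pd

# Toy stand-in for c_df
toy = pd.DataFrame({
    "venue_name": ["Venue A", "Venue B", "Venue C"],
    "venue_capacity": [250.0, 1000.0, 250.0],
})
mask = toy["venue_capacity"] == 250

# Option 1: take an explicit copy when a separate working table is wanted;
# changes to the copy never touch the original frame
change_table = toy.loc[mask].copy()
change_table["venue_capacity"] = change_table["venue_capacity"] + 250
untouched = toy["venue_capacity"].tolist()

# Option 2: update the original frame directly in a single .loc call
toy.loc[mask, "venue_capacity"] += 250
print(change_table)
print(toy)
```

Which option fits depends on whether the change is meant to end up in `c_df` itself or only in a scratch table.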

In [55]:
c_df.loc[c_df['venue_capacity'] == 0]['venue_name'].unique()

array(['Knitting Factory Brooklyn', 'Sony Hall', 'Analog BKNY',
       'Iridium Jazz Club', 'The Bowery Electric', 'Spirit of New York',
       'The Lost Horizon', 'Brooklyn Bowl', 'Apollo Theater', "SOB's",
       'New York City Center - Stage 1', 'The Haunt', 'Elsewhere',
       'The Bug Jar', 'The Chance',
       'Colden Center for Performing Arts at Kupferberg Center for the Arts',
       'Funk N Waffles', 'Kings Hall', 'The Hangar Theatre',
       'Club Helsinki', 'The 9th Ward at Babeville', 'The Bell House',
       'Montage Music Hall', "Lando's", 'Avant Gardner',
       "Lando's Hotel & Lounge", 'Mulcahys', 'Mohawk Place',
       "Sportsmen's Tavern", 'Buffalo Riverworks',
       "Funk 'n Waffles - Downtown", 'Rec Room - Buffalo',
       'Murmrr Theatre', 'Kodak Center for Performing Arts',
       'Market Hotel', 'Smith Opera House', 'Asbury Hall At The Church',
       "Sharkey's Summer Stage",
       'Carnegie Hall - Judy & Arthur Zankel Hall',
       'Asbury Hall inside Babev

In [168]:
c_df.loc[c_df['venue_name'] == "Brewery Ommegang", 'venue_capacity'] = 5000

In [169]:
c_df.loc[c_df['venue_capacity'] == 0]['venue_name'].unique()

array(['Spirit of New York', "Lando's", "Lando's Hotel & Lounge",
       "Sharkey's Summer Stage", 'Bartlett Theatre in Coxe Hall'],
      dtype=object)
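Setting capacities one venue at a time works, but a lookup table scales better when many venues report 0. The sketch below is only an illustration: the venue names and the capacity in `known_capacity` are placeholders I made up, not researched values. It also treats 0 as missing, which is an assumption about what a 0 capacity means in this data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for c_df
toy = pd.DataFrame({
    "venue_name": ["Venue A", "Venue B", "Venue C"],
    "venue_capacity": [0.0, 1800.0, 0.0],
})

# Hypothetical lookup of capacities found by hand (placeholder number)
known_capacity = {"Venue A": 300.0}

# Mark 0 as missing so it cannot be mistaken for a real capacity
toy["venue_capacity"] = toy["venue_capacity"].replace(0, np.nan)

# Fill gaps from the lookup; venues without an entry stay NaN
toy["venue_capacity"] = toy["venue_capacity"].fillna(
    toy["venue_name"].map(known_capacity)
)
still_missing = toy.loc[toy["venue_capacity"].isna(), "venue_name"].unique()
print(toy)
print("Still missing:", list(still_missing))
```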

In [178]:
c_df['venue_capacity'].value_counts()

1000.0     146
1800.0      86
250.0       85
180.0       66
200.0       54
575.0       52
2894.0      38
400.0       38
20000.0     37
3000.0      37
499.0       35
600.0       34
450.0       33
1200.0      32
2870.0      29
1753.0      27
1495.0      21
843.0       21
850.0       18
700.0       18
17500.0     17
3400.0      17
15000.0     17
900.0       15
550.0       15
5000.0      14
19200.0     14
1500.0      14
6015.0      13
25100.0     13
          ... 
14000.0      3
500.0        3
800.0        3
237.0        3
150.0        3
2804.0       2
2700.0       2
4400.0       2
944.0        2
440.0        2
12428.0      2
1510.0       2
225.0        2
3050.0       1
982.0        1
350.0        1
1740.0       1
6925.0       1
2500.0       1
2117.0       1
41800.0      1
1968.0       1
1400.0       1
2983.0       1
1744.0       1
483.0        1
2085.0       1
2738.0       1
599.0        1
489.0        1
Name: venue_capacity, Length: 81, dtype: int64