<img style="float: right;" src="meetup_logo.svg" width=200>

# Meetup - Data Cleaning (all 2018)


<i>Cleaning the contents of the NYC meetup data</i>

<u>Datasets:</u>

1. <a href='#events'>Meetup Events</a> (all 2018)
2. <a href='#groups'>Meetup Groups</a>
3. <a href='#members'>Meetup Members</a>
***

### Import libraries

In [1]:
from haversine import haversine 
import reverse_geocode
import pickle
import pandas as pd
import matplotlib.pyplot as plt
import time
import re
from meetup_api_functions import clean_text, remove_special_chars
from meetup_api_functions import get_subway_distances
import requests

ModuleNotFoundError: No module named 'meetup_api_functions'

***
<a id='events'></a>
### 1. Meetup Events

#### Load Data

In [17]:
# load the pickled monthly event files
with open('monthly_events_2018/sepoct_events.pkl', 'rb') as f:
    sepoct_events = pickle.load(f)
    
with open('monthly_events_2018/nov_events.pkl', 'rb') as f:
    nov_events = pickle.load(f)

with open('jan_events.pkl', 'rb') as f:
    jan_events = pickle.load(f)


In [13]:
len(nov_events)

11885

In [18]:
len(jan_events)

10012

In [9]:
len(aug_events)

10952

In [127]:
print("There were {} events in Sep-Oct 2018 in NYC.".format(len(sepoct_events)))

There were 25424 events in Sep-Oct 2018 in NYC.


In [128]:
# convert all_events into a dataframe
df_events = pd.DataFrame(sepoct_events)

In [129]:
# preview the information
df_events.head(1)

Unnamed: 0,created,description,duration,event_url,fee,group,headcount,how_to_find_us,id,maybe_rsvp_count,...,rsvp_limit,status,time,updated,utc_offset,venue,visibility,waitlist_count,why,yes_rsvp_count
0,1535391367000,<p>Join us in person or tune in online!</p> <p...,7200000.0,https://www.meetup.com/Build-with-Code-New-Yor...,,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to second floor/ stream online at ...,254149786,0,...,,past,1536100200000,1536109876000,-14400000,"{'country': 'us', 'localized_country_name': 'U...",public,0,,42


#### Data Cleaning

In [130]:
# check percentage of NaN values in each column
df_events.isna().sum()/len(df_events)*100

created              0.000000
description          0.641126
duration             5.046413
event_url            0.000000
fee                 84.892228
group                0.000000
headcount            0.000000
how_to_find_us      61.237413
id                   0.000000
maybe_rsvp_count     0.000000
name                 0.000000
photo_url           39.407646
rating               0.000000
rsvp_limit          73.277218
status               0.000000
time                 0.000000
updated              0.000000
utc_offset           0.000000
venue                8.535242
visibility           0.000000
waitlist_count       0.000000
why                 99.559471
yes_rsvp_count       0.000000
dtype: float64

Based on the information above, there is some data cleaning and handling of missing values to address:
- convert values in ```duration``` from milliseconds to minutes and fill in the missing values with the median
- drop ```utc_offset``` since that information is captured in ```time```
- drop ```why``` since almost all of its values (99.6%) are NaN
- clean up text in ```description``` with regex
- encode ```fee```, ```photo_url```, and ```how_to_find_us``` as binary indicators (present or missing)
- fill missing values in ```venue``` with 'None'
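
The plan above can be sketched on a small toy frame. The values and the `toy` dataframe below are made up for illustration and are not taken from the Meetup data; only the column names mirror the real ones.

```python
import numpy as np
import pandas as pd

# hypothetical three-event sample with the same kinds of gaps as df_events
toy = pd.DataFrame({
    'duration': [7200000.0, np.nan, 5400000.0],          # milliseconds
    'fee': [np.nan, {'amount': 10, 'currency': 'USD'}, np.nan],
    'why': [np.nan, np.nan, np.nan],
})

# milliseconds -> minutes, then fill the gap with the median of the known values
toy['duration_min'] = toy['duration'] / 60000
toy['duration_min'] = toy['duration_min'].fillna(toy['duration_min'].median())

# binary indicator: 1 if a fee dictionary is present, 0 otherwise
toy['has_fee'] = toy['fee'].notna().astype(int)

# drop the columns that are no longer needed
toy = toy.drop(columns=['duration', 'why'])
print(toy)
```

Assigning the result of `fillna` back to the column, rather than calling it with `inplace=True` on a column selection, is a design choice that tends to behave more predictably across pandas versions.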

In [131]:
# convert timestamps to human-readable format (dividing by 1000. since time is in milliseconds)
# df_events['time'] = df_events['time'].apply(lambda x:time.strftime('%m/%d/%Y %H:%M:%S', time.gmtime(x/1000.)))
# df_events['updated'] = df_events['updated'].apply(lambda x:time.strftime('%m/%d/%Y %H:%M:%S', time.gmtime(x/1000.)))


In [132]:
# convert values in duration column from milliseconds to minutes
df_events['duration'] = df_events['duration'].apply(lambda x: x/60000)

In [133]:
# label encode value for whether group's join-mode is open or not
df_events['group_is_open'] = df_events.group.apply(lambda x: 1 if x['join_mode'] == 'open' else 0)

In [134]:
df_events['group_id'] = df_events.group.apply(lambda x: x.get('id'))

In [135]:
# rename column to note time unit of the data
df_events.rename(columns={'duration':'duration_min'}, inplace=True)

In [136]:
df_events['how_to_find_us'].fillna(0, inplace = True)
df_events['has_how_to_find'] = df_events['how_to_find_us'].apply(lambda x: 1 if x != 0 else 0)

In [137]:
df_events['rsvp_limit'].fillna(0, inplace =True)
df_events['has_rsvp_limit'] = df_events['rsvp_limit'].apply(lambda x: 1 if x != 0 else 0)

In [138]:
# clean text in description using regex
df_events.description[0]

'<p>Join us in person or tune in online!</p> <p>Livestream: <a href="https://zoom.us/j/190996928" class="linkified">https://zoom.us/j/190996928</a></p> <p>Get started now on challenges related to these topics on our FREE online learning platform CSX, <a href="https://csx.codesmith.io/" class="linkified">https://csx.codesmith.io/</a></p> <p>--</p> <p>During this workshop, we will cover:</p> <p>- What happens when our code runs in the browser?<br/>- A closer look at objects<br/>- Reusing our logic (declaring/invoking functions)</p> <p>These concepts are the foundation of all web development - we will cover them under-the-hood so you can confidently use them as you work on harder concepts to come.</p> <p>Schedule</p> <p>6:30 - 7:00pm: Meet your future pair programming partner.</p> <p>7:00 - 8:00pm: Core JavaScript concept for the challenge and introduction to the secret hack for learning to code - pair-programming.</p> <p>8:00 - 9:00pm: Pair-programming.</p> <p>***Bring a friend who\'d li

In [139]:
# but first replace NaN in description column with 'None'
df_events.description.fillna(value = 'None', inplace = True)

In [140]:
# function to clean text in the description column
# def clean_text(description):
#     """return cleaned text; requires a string"""
#     result1 = re.sub('&lt;br/&gt;', '', re.sub('<[^>]+>', '', description))
#     final = re.sub('&lt;/p&gt;', '', re.sub('&amp;', "", result1))
#     return final
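
Since the import of `clean_text` failed in the first cell, here is a minimal stand-in, assuming the goal is simply to strip HTML tags and decode entities. The name `clean_text_sketch` is hypothetical, and unlike the commented-out version above it replaces tags with a space so that words on either side of a `<br/>` do not run together.

```python
import html
import re

def clean_text_sketch(description):
    """Return the description with HTML tags removed and entities decoded.

    Hypothetical stand-in for clean_text; requires a string.
    """
    decoded = html.unescape(description)          # &amp; -> &, &lt;br/&gt; -> <br/>
    no_tags = re.sub(r'<[^>]+>', ' ', decoded)     # replace every tag with a space
    return re.sub(r'\s+', ' ', no_tags).strip()    # collapse repeated whitespace

print(clean_text_sketch('<p>Hi &amp; bye<br/>- next</p>'))
```

Decoding entities before stripping tags also removes escaped tags such as `&lt;br/&gt;`, at the cost of occasionally eating text that contains a literal `<` followed later by `>`.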

In [141]:
# apply/lambda to clean all data in the description series
df_events['description'] = df_events['description'].apply(lambda x: clean_text(x))

In [142]:
# cleaned text in description
df_events.description[0]

"Join us in person or tune in online! Livestream: https://zoom.us/j/190996928 Get started now on challenges related to these topics on our FREE online learning platform CSX, https://csx.codesmith.io/ -- During this workshop, we will cover: - What happens when our code runs in the browser?- A closer look at objects- Reusing our logic (declaring/invoking functions) These concepts are the foundation of all web development - we will cover them under-the-hood so you can confidently use them as you work on harder concepts to come. Schedule 6:30 - 7:00pm: Meet your future pair programming partner. 7:00 - 8:00pm: Core JavaScript concept for the challenge and introduction to the secret hack for learning to code - pair-programming. 8:00 - 9:00pm: Pair-programming. ***Bring a friend who'd like to build and you can pair program together! Price: Always free For those online! Please join the stream here: Livestream: https://zoom.us/j/190996928"

In [143]:
# function to remove special character tokens in the tokenized descriptions

def remove_special_chars(some_list):
    remove = ["-", "--", "###", "##", "","•"]
    return [x for x in some_list if x not in remove]

In [144]:
# get word count of events
df_events['event_num_words'] = df_events.description.apply(lambda x: len(remove_special_chars(x.split(' '))))
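
As a quick check of how the word count behaves, the sketch below applies the same filter to a made-up sentence. `count_words` is a hypothetical helper name; the token list matches the one defined above.

```python
def remove_special_chars(tokens):
    """Drop tokens that are only punctuation markers or empty strings."""
    remove = {"-", "--", "###", "##", "", "•"}
    return [t for t in tokens if t not in remove]

def count_words(text):
    """Count space-separated tokens after removing the special tokens."""
    return len(remove_special_chars(text.split(' ')))

# the double space produces an empty token, which is filtered out along with "--"
print(count_words("Join us -- in  person"))
```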

In [145]:
# summary statistics of event duration (in minutes)
df_events.duration_min.describe()

count    24141.000000
mean       313.717452
std       1211.278678
min          1.000000
25%        120.000000
50%        120.000000
75%        210.000000
max      20160.000000
Name: duration_min, dtype: float64

In [146]:
# look at sample non-NaN value in fee column
df_events.fee[12838]

{'amount': 10,
 'accepts': 'cash',
 'description': 'per person',
 'currency': 'USD',
 'label': 'Price',
 'required': '0'}

In [147]:
# replace missing values in duration with the median value
df_events.duration_min.fillna(value = df_events.duration_min.median(), inplace = True)
# replace missing venue values with 'None'
df_events.venue.fillna(value = 'None', inplace = True)
# replace missing fee values with 0
df_events.fee.fillna(value = 0, inplace = True)
# replace missing photo_url values with 0
df_events.photo_url.fillna(value = 0, inplace = True)

In [148]:
# extract just the amount from the fee dictionary
df_events.fee = df_events.fee.apply(lambda x: x['amount'] if x!= 0 else 0)

In [149]:
df_events['has_photo'] = df_events.photo_url.apply(lambda x: 0 if x == 0 else 1)

In [150]:
# preview the cleaned dataframe
df_events.head()

Unnamed: 0,created,description,duration_min,event_url,fee,group,headcount,how_to_find_us,id,maybe_rsvp_count,...,visibility,waitlist_count,why,yes_rsvp_count,group_is_open,group_id,has_how_to_find,has_rsvp_limit,event_num_words,has_photo
0,1535391367000,Join us in person or tune in online! Livestrea...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to second floor/ stream online at ...,254149786,0,...,public,0,,42,1,21993357,1,0,137,0
1,1535385547000,Get started now on challenges related to these...,150.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,https://zoom.us/j/417883916,254146381,0,...,public,0,,64,1,21993357,1,0,131,0
2,1535392484000,In this workshop we’ll get a clear sense of th...,150.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to second floor / Livestream at ht...,254150230,0,...,public,0,,83,1,21993357,1,0,204,0
3,1531947994000,The number of opportunities for software engin...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to the 2nd floor,252915161,0,...,public,0,,113,1,21993357,1,0,229,0
4,1535383458000,Please tune into the stream here: https://zoom...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please tune into the stream here: https://zoom...,254144933,0,...,public,0,,21,1,21993357,1,0,165,0


In [151]:
df_events.columns

Index(['created', 'description', 'duration_min', 'event_url', 'fee', 'group',
       'headcount', 'how_to_find_us', 'id', 'maybe_rsvp_count', 'name',
       'photo_url', 'rating', 'rsvp_limit', 'status', 'time', 'updated',
       'utc_offset', 'venue', 'visibility', 'waitlist_count', 'why',
       'yes_rsvp_count', 'group_is_open', 'group_id', 'has_how_to_find',
       'has_rsvp_limit', 'event_num_words', 'has_photo'],
      dtype='object')

In [152]:
# only the 'why' column still contains NaN values in df_events
df_events.isna().sum()

created                 0
description             0
duration_min            0
event_url               0
fee                     0
group                   0
headcount               0
how_to_find_us          0
id                      0
maybe_rsvp_count        0
name                    0
photo_url               0
rating                  0
rsvp_limit              0
status                  0
time                    0
updated                 0
utc_offset              0
venue                   0
visibility              0
waitlist_count          0
why                 25312
yes_rsvp_count          0
group_is_open           0
group_id                0
has_how_to_find         0
has_rsvp_limit          0
event_num_words         0
has_photo               0
dtype: int64

In [153]:
# DROPPING UNNECESSARY COLUMNS
# df_events.drop(columns = ['why', 'utc_offset', 'how_to_find_us', 'visibility', 'headcount',
#                          'waitlist_count', 'updated', 'maybe_rsvp_count', 'rating', 'rsvp_limit',
#                          'status', 'photo_url'], inplace = True)

Now let's look at the ```venue``` column.

Each value in this column is itself a dictionary, so we can expand the series into its own dataframe.

In [154]:
# converting the 'venue' column into its own dataframe
df_venues = df_events['venue'].apply(pd.Series)
df_venues.head()



Unnamed: 0,address_1,city,country,id,lat,localized_country_name,lon,name,phone,repinned,state,zip,0,address_2
0,"250 Lafayette Street, New York, NY",New York,us,25315570.0,40.723171,USA,-73.997177,Codesmith,,False,NY,,,
1,Online,New York,us,25626092.0,40.74673,USA,-73.98967,Online,,True,NY,,,
2,250 Lafayette Street,New York,us,25312065.0,40.72317,USA,-73.99718,Codesmith,,True,NY,,,
3,250 Lafayette Street,New York,us,25312065.0,40.723171,USA,-73.997177,Codesmith,,False,NY,,,
4,Online,New York,us,25626092.0,40.746731,USA,-73.98967,Online,,False,NY,,,


In [155]:
# quick check on the shape of the dataframe to ensure that the number of rows matches the total number of events
df_venues.shape

(25424, 14)
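
An alternative way to expand a column of dictionaries is sketched below, assuming missing venues are stored as the string 'None' as in the cleaning step earlier. The toy `venues` series and the name `df_venues_sketch` are illustrative only. Building the frame from a list of dictionaries is usually much faster than `apply(pd.Series)`, and mapping non-dictionary values to `{}` avoids the stray column named `0`, which in the output above appears to come from the 'None' placeholders.

```python
import pandas as pd

# hypothetical sample: two real venues and one missing-venue placeholder
venues = pd.Series([
    {'lat': 40.723171, 'lon': -73.997177, 'name': 'Codesmith'},
    'None',
    {'lat': 40.74673, 'lon': -73.98967, 'name': 'Online'},
])

# replace anything that is not a dict with an empty dict -> row of NaN
records = [v if isinstance(v, dict) else {} for v in venues]
df_venues_sketch = pd.DataFrame(records, index=venues.index)

# pair latitude and longitude into tuples, one per event
latlon = list(zip(df_venues_sketch['lat'], df_venues_sketch['lon']))
print(df_venues_sketch)
```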

In [156]:
# checking for percentage of missing values for each column in the df_venues dataframe
(df_venues.isna().sum()/df_venues.shape[0])*100

address_1                  8.535242
city                       8.535242
country                    8.535242
id                         8.535242
lat                        8.535242
localized_country_name     8.535242
lon                        8.535242
name                       8.535242
phone                     91.291693
repinned                   8.535242
state                     20.689113
zip                       64.120516
0                         91.464758
address_2                 93.411737
dtype: float64

We only want to keep the latitude, longitude of the venue as well as the country of the venue, which we will dummify (1=USA, 0=not USA).

Let's zip the latitude and longitude.

In [157]:
df_events['venue_latlon'] = list(zip(df_venues.lat, df_venues.lon))

In [158]:
# drop the 'venue' column from df_events
df_events.drop(columns =['venue'], inplace=True)

In [159]:
# preview updated dataframe
df_events.head()

Unnamed: 0,created,description,duration_min,event_url,fee,group,headcount,how_to_find_us,id,maybe_rsvp_count,...,waitlist_count,why,yes_rsvp_count,group_is_open,group_id,has_how_to_find,has_rsvp_limit,event_num_words,has_photo,venue_latlon
0,1535391367000,Join us in person or tune in online! Livestrea...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to second floor/ stream online at ...,254149786,0,...,0,,42,1,21993357,1,0,137,0,"(40.723171, -73.997177)"
1,1535385547000,Get started now on challenges related to these...,150.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,https://zoom.us/j/417883916,254146381,0,...,0,,64,1,21993357,1,0,131,0,"(40.74673, -73.98967)"
2,1535392484000,In this workshop we’ll get a clear sense of th...,150.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to second floor / Livestream at ht...,254150230,0,...,0,,83,1,21993357,1,0,204,0,"(40.72317, -73.99718)"
3,1531947994000,The number of opportunities for software engin...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to the 2nd floor,252915161,0,...,0,,113,1,21993357,1,0,229,0,"(40.723171, -73.997177)"
4,1535383458000,Please tune into the stream here: https://zoom...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please tune into the stream here: https://zoom...,254144933,0,...,0,,21,1,21993357,1,0,165,0,"(40.746731, -73.98967)"


In [160]:
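# rename id column to event_id for clarity
# (rename returns a new dataframe; the result is only displayed here, so df_events itself keeps the 'id' column)
df_events.rename(index=str, columns={"id": "event_id"})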

Unnamed: 0,created,description,duration_min,event_url,fee,group,headcount,how_to_find_us,event_id,maybe_rsvp_count,...,waitlist_count,why,yes_rsvp_count,group_is_open,group_id,has_how_to_find,has_rsvp_limit,event_num_words,has_photo,venue_latlon
0,1535391367000,Join us in person or tune in online! Livestrea...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to second floor/ stream online at ...,254149786,0,...,0,,42,1,21993357,1,0,137,0,"(40.723171, -73.997177)"
1,1535385547000,Get started now on challenges related to these...,150.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,https://zoom.us/j/417883916,254146381,0,...,0,,64,1,21993357,1,0,131,0,"(40.74673, -73.98967)"
2,1535392484000,In this workshop we’ll get a clear sense of th...,150.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to second floor / Livestream at ht...,254150230,0,...,0,,83,1,21993357,1,0,204,0,"(40.72317, -73.99718)"
3,1531947994000,The number of opportunities for software engin...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to the 2nd floor,252915161,0,...,0,,113,1,21993357,1,0,229,0,"(40.723171, -73.997177)"
4,1535383458000,Please tune into the stream here: https://zoom...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please tune into the stream here: https://zoom...,254144933,0,...,0,,21,1,21993357,1,0,165,0,"(40.746731, -73.98967)"
5,1536081851000,Join us in person or tune in online! Livestrea...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to second floor / livestream at ht...,254381681,0,...,0,,57,1,21993357,1,0,137,0,"(40.72317, -73.99718)"
6,1536076828000,Please register for and tune into the stream h...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please register for and tune into the stream h...,254379546,0,...,0,,63,1,21993357,1,0,127,0,"(40.74673, -73.98967)"
7,1536092565000,"Join us on Thursday, Sep 13, for a great serie...",30.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Come up to the 2nd floor,254387434,0,...,0,,14,1,21993357,1,0,122,1,"(40.72317, -73.99718)"
8,1536092141000,"Join us on Thursday, for a great series of lig...",150.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to the 2nd floor / Watch on livest...,254387251,0,...,0,,60,1,21993357,1,0,377,0,"(40.72317, -73.99718)"
9,1535591874000,In this online workshop (9am-3pm PST / 12-6pm ...,360.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,0,254226040,0,...,0,,22,1,21993357,0,0,334,1,"(40.74673, -73.98967)"


Let's engineer some additional event features:
- number of years the group has been around
- add day of week 
- number of subway stations within 0.5 miles
- number of events held in the month
- descriptiveness of event (determined by number of words in the description text)

Adding year/month/day and day of week columns.

In [161]:
# adding event date as Year/Month/Day
# let's first change the time to a datetime datatype
df_events['time_datetime'] = pd.to_datetime(df_events['time'], unit = 'ms')

In [162]:
df_events['time_datetime'].head()

0   2018-09-04 22:30:00
1   2018-09-06 00:00:00
2   2018-09-06 22:30:00
3   2018-09-07 22:30:00
4   2018-09-11 01:30:00
Name: time_datetime, dtype: datetime64[ns]
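
The timestamps above appear to be in UTC: the first event shows 22:30, while its description lists a 6:30pm start. A minimal sketch of recovering local wall-clock time, assuming `time` is epoch milliseconds in UTC and `utc_offset` is the local offset in milliseconds (the values below are copied from the first row of the preview):

```python
import pandas as pd

# first event from the preview above
events = pd.DataFrame({'time': [1536100200000], 'utc_offset': [-14400000]})

# UTC timestamp, as computed in the notebook
events['time_utc'] = pd.to_datetime(events['time'], unit='ms')

# shifting by the offset before converting gives the local wall-clock time
events['time_local'] = pd.to_datetime(events['time'] + events['utc_offset'], unit='ms')
events['local_hour'] = events['time_local'].dt.hour
print(events[['time_utc', 'time_local', 'local_hour']])
```

If local hours matter for the analysis, this conversion would need to happen before `utc_offset` is dropped.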

In [163]:
df_events['time_m_d_y'] = df_events['time_datetime'].apply(lambda x: x.strftime('%Y-%m-%d')) 
df_events['time_m_d_y'].head()

0    2018-09-04
1    2018-09-06
2    2018-09-06
3    2018-09-07
4    2018-09-11
Name: time_m_d_y, dtype: object

In [164]:
# add column with day of week
df_events['time_m_d_y'] = pd.to_datetime(df_events['time_m_d_y'])
df_events['day_of_week'] = df_events['time_m_d_y'].dt.day_name()

In [165]:
df_events['event_hour'] = df_events['time_datetime'].dt.hour
df_events['event_hour'] = df_events['event_hour'].astype('category')

In [166]:
df_events['event_hour']

0        22
1         0
2        22
3        22
4         1
5        22
6         0
7        22
8        22
9        16
10        1
11        0
12       22
13       22
14       22
15       22
16        1
17        0
18       22
19       23
20       14
21       13
22       22
23       22
24       22
25       22
26       22
27       22
28        9
29       10
         ..
25394    23
25395    23
25396    23
25397    23
25398    23
25399    23
25400    23
25401    23
25402    22
25403    23
25404    23
25405    22
25406    23
25407    23
25408    13
25409    23
25410    20
25411    22
25412    23
25413    23
25414    22
25415    13
25416    15
25417    23
25418    23
25419    23
25420    23
25421    23
25422    20
25423    23
Name: event_hour, Length: 25424, dtype: category
Categories (24, int64): [0, 1, 2, 3, ..., 20, 21, 22, 23]

In [167]:
# bin the event hour into 6 left-closed bins (four 4-hour bins, then 16-21 and 21-24)
bins = [0, 4,8,12,16,21,24]
df_events['event_hour_group'] = pd.cut(df_events['event_hour'], bins, right =False)

In [168]:
df_events['event_hour_group'].value_counts()

[21, 24)    14210
[16, 21)     4141
[0, 4)       3399
[12, 16)     3062
[8, 12)       531
[4, 8)         81
Name: event_hour_group, dtype: int64
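
To make the bin edges easier to read, `pd.cut` also accepts labels. The sketch below uses the same edges on a handful of made-up hours; the label strings are an assumption and do not appear in the original notebook.

```python
import pandas as pd

hours = pd.Series([0, 3, 4, 16, 20, 21, 23])       # illustrative hours only
bins = [0, 4, 8, 12, 16, 21, 24]
labels = ['00-03', '04-07', '08-11', '12-15', '16-20', '21-23']

# right=False makes each bin include its left edge and exclude its right edge
hour_group = pd.cut(hours, bins=bins, right=False, labels=labels)
print(hour_group.tolist())
```

With `right=False`, hour 4 falls in the second bin and hour 21 in the last one, which matches the `[4, 8)` and `[21, 24)` intervals shown above.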

In [169]:
df_events['event_hour_group'].isna().sum()

0

In [170]:
df_events[['event_hour','event_hour_group']]

Unnamed: 0,event_hour,event_hour_group
0,22,"[21, 24)"
1,0,"[0, 4)"
2,22,"[21, 24)"
3,22,"[21, 24)"
4,1,"[0, 4)"
5,22,"[21, 24)"
6,0,"[0, 4)"
7,22,"[21, 24)"
8,22,"[21, 24)"
9,16,"[16, 21)"


In [171]:
df_events.head()

Unnamed: 0,created,description,duration_min,event_url,fee,group,headcount,how_to_find_us,id,maybe_rsvp_count,...,has_how_to_find,has_rsvp_limit,event_num_words,has_photo,venue_latlon,time_datetime,time_m_d_y,day_of_week,event_hour,event_hour_group
0,1535391367000,Join us in person or tune in online! Livestrea...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to second floor/ stream online at ...,254149786,0,...,1,0,137,0,"(40.723171, -73.997177)",2018-09-04 22:30:00,2018-09-04,Tuesday,22,"[21, 24)"
1,1535385547000,Get started now on challenges related to these...,150.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,https://zoom.us/j/417883916,254146381,0,...,1,0,131,0,"(40.74673, -73.98967)",2018-09-06 00:00:00,2018-09-06,Thursday,0,"[0, 4)"
2,1535392484000,In this workshop we’ll get a clear sense of th...,150.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to second floor / Livestream at ht...,254150230,0,...,1,0,204,0,"(40.72317, -73.99718)",2018-09-06 22:30:00,2018-09-06,Thursday,22,"[21, 24)"
3,1531947994000,The number of opportunities for software engin...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to the 2nd floor,252915161,0,...,1,0,229,0,"(40.723171, -73.997177)",2018-09-07 22:30:00,2018-09-07,Friday,22,"[21, 24)"
4,1535383458000,Please tune into the stream here: https://zoom...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please tune into the stream here: https://zoom...,254144933,0,...,1,0,165,0,"(40.746731, -73.98967)",2018-09-11 01:30:00,2018-09-11,Tuesday,1,"[0, 4)"


In [172]:
df_events['day_of_week'].value_counts()

Saturday     5736
Sunday       4281
Thursday     4040
Tuesday      3106
Friday       3104
Wednesday    2953
Monday       2204
Name: day_of_week, dtype: int64

Add the distance from each venue to all NYC subway stations.
Include a count of subway stations that are <= 0.5 miles from the venue.

In [173]:
# load subway station data
df_subway = pd.read_csv("NYC_Subway_Data.csv")

In [174]:
# dropping duplicate stations (file contains a location for each entry/exit point which is not what we need)
df_unique_subway = df_subway.drop_duplicates(subset=["Division", "Station Name"])

In [175]:
# preview the distinct subway station data
df_unique_subway

Unnamed: 0,Division,Line,Station Name,Station Latitude,Station Longitude,Route1,Route2,Route3,Route4,Route5,...,ADA,ADA Notes,Free Crossover,North South Street,East West Street,Corner,Entrance Latitude,Entrance Longitude,Station Location,Entrance Location
0,BMT,4 Avenue,25th St,40.660397,-73.998091,R,,,,,...,False,,False,4th Ave,25th St,SE,40.660323,-73.997952,"(40.660397, -73.998091)","(40.660323, -73.997952)"
2,BMT,4 Avenue,36th St,40.655144,-74.003549,N,R,,,,...,False,,True,4th Ave,36th St,NW,40.654490,-74.004499,"(40.655144, -74.003549)","(40.654490, -74.004499)"
5,BMT,4 Avenue,45th St,40.648939,-74.010006,R,,,,,...,False,,True,4th Ave,45th St,NE,40.649389,-74.009333,"(40.648939, -74.010006)","(40.649389, -74.009333)"
9,BMT,4 Avenue,53rd St,40.645069,-74.014034,R,,,,,...,False,,True,4th Ave,53rd St,SW,40.644653,-74.014690,"(40.645069, -74.014034)","(40.644653, -74.014690)"
14,BMT,4 Avenue,59th St,40.641362,-74.017881,N,R,,,,...,False,,True,4th Ave,59th St,NW,40.641606,-74.017897,"(40.641362, -74.017881)","(40.641606, -74.017897)"
20,BMT,4 Avenue,77th St,40.629742,-74.025510,R,,,,,...,False,,True,4th Ave,77th St,NW,40.629550,-74.025731,"(40.629742, -74.02551)","(40.629550, -74.025731)"
23,BMT,4 Avenue,86th St,40.622687,-74.028398,R,,,,,...,False,,True,4th Ave,86th St,SW,40.622656,-74.028547,"(40.622687, -74.028398)","(40.622656, -74.028547)"
26,BMT,4 Avenue,95th St,40.616622,-74.030876,R,,,,,...,False,,True,4th Ave,95th St,SW,40.616021,-74.031383,"(40.616622, -74.030876)","(40.616021, -74.031383)"
31,BMT,4 Avenue,9th St,40.670847,-73.988302,F,G,R,,,...,False,,True,4th Ave,9th St,NE,40.670387,-73.988480,"(40.670847, -73.988302)","(40.670387, -73.988480)"
33,BMT,4 Avenue,Atlantic Av-Barclays Ctr,40.683666,-73.978810,B,Q,D,N,R,...,True,,True,4th Ave,Pacific St,NE,40.683805,-73.978487,"(40.683666, -73.97881)","(40.683805, -73.978487)"


In [176]:
# there are 425 unique stations
df_unique_subway.shape

(425, 32)

In [177]:
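# convert the latitude and longitude into floats for distance calculation
# (astype returns a new series; the results are not assigned back here, so the dataframe is unchanged)
df_unique_subway['Station Latitude'].astype(float)
df_unique_subway['Station Longitude'].astype(float)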

0      -73.998091
2      -74.003549
5      -74.010006
9      -74.014034
14     -74.017881
20     -74.025510
23     -74.028398
26     -74.030876
31     -73.988302
33     -73.978810
34     -74.023377
37     -73.981824
43     -73.978810
45     -73.992872
48     -73.983110
52     -73.979189
59     -73.986229
60     -73.996209
70     -73.992821
78     -73.989938
82     -73.987823
92     -73.984569
101    -73.981329
118    -73.989779
120    -73.977450
128    -73.980305
136    -73.990862
142    -73.996204
147    -73.995048
152    -73.979678
          ...    
1748   -73.981848
1750   -74.005351
1754   -73.996353
1763   -73.986829
1767   -73.994791
1771   -73.995476
1774   -73.996895
1777   -73.998864
1783   -74.000610
1787   -73.994324
1788   -73.983765
1792   -73.993728
1800   -73.917757
1808   -73.887734
1812   -73.862633
1814   -73.860341
1817   -73.857473
1819   -73.854376
1821   -73.867352
1823   -73.868457
1826   -73.867164
1828   -73.873488
1830   -73.880049
1835   -73.891865
1841   -73

In [178]:
# create a new column with the converted latitudes and longitudes in a tuple
df_unique_subway['latlon'] = list(zip(df_unique_subway['Station Latitude'],df_unique_subway['Station Longitude']))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
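
The warning above is typically raised when a column is added to a frame that pandas considers a possible slice of another frame. One common way to avoid it, sketched here on a made-up three-row `stations` frame, is to take an explicit `.copy()` after `drop_duplicates` and to assign the `astype` results back:

```python
import pandas as pd

# hypothetical sample with one duplicated station (two entrances)
stations = pd.DataFrame({
    'Division': ['BMT', 'BMT', 'BMT'],
    'Station Name': ['25th St', '25th St', '36th St'],
    'Station Latitude': ['40.660397', '40.660397', '40.655144'],
    'Station Longitude': ['-73.998091', '-73.998091', '-74.003549'],
})

# explicit copy: later column assignments clearly target a new dataframe
unique_stations = stations.drop_duplicates(subset=['Division', 'Station Name']).copy()

# assign the converted values back so the conversion actually takes effect
for col in ['Station Latitude', 'Station Longitude']:
    unique_stations[col] = unique_stations[col].astype(float)

unique_stations['latlon'] = list(zip(unique_stations['Station Latitude'],
                                     unique_stations['Station Longitude']))
print(unique_stations[['Station Name', 'latlon']])
```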
  


In [179]:
# create a variable with a list of each station's (latitude, longitude)
subway_locations = list(df_unique_subway['latlon'])

In [15]:
# save the subway_locations variable
with open('subway_locations.pkl', 'wb') as f: 
    pickle.dump(subway_locations, f)

In [57]:
def get_subway_distances(coord, subway_locations):
    """returns a list of distances from venue to each subway station in NYC, sorted from closest to farthest"""
    return sorted([haversine(coord, s, unit = 'mi') for s in subway_locations])
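
For readers without the `haversine` package installed, the distance calculation can be sketched in plain Python. This is a stand-in under stated assumptions, not the package's implementation: `haversine_mi` and `count_close_stations` are hypothetical names, and the Earth radius of 3958.8 miles is an approximate mean value.

```python
import math

EARTH_RADIUS_MI = 3958.8  # approximate mean Earth radius in miles (assumption)

def haversine_mi(a, b):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(h))

def count_close_stations(coord, stations, max_miles=0.5):
    """Number of stations within max_miles of coord."""
    return sum(haversine_mi(coord, s) <= max_miles for s in stations)

venue = (40.723171, -73.997177)                       # first venue in the preview
stations = [(40.660397, -73.998091), (40.723171, -73.997177)]
print(count_close_stations(venue, stations))
```

Counting directly avoids storing the full sorted list of distances for every event when only the count is needed.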

In [180]:
# import function created to get the distances of each venue to each subway station
# apply/lambda function to every event
df_events['subway_distances'] = df_events['venue_latlon'].apply(lambda x: get_subway_distances(x, subway_locations))

In [181]:
# create a column with a count of subway stations less than 0.5 miles from each venue
df_events['num_close_subways'] = df_events['subway_distances'].apply(lambda x: len([i for i in x if i <=0.5]))

In [182]:
# preview the updated dataframe
df_events.head()

Unnamed: 0,created,description,duration_min,event_url,fee,group,headcount,how_to_find_us,id,maybe_rsvp_count,...,event_num_words,has_photo,venue_latlon,time_datetime,time_m_d_y,day_of_week,event_hour,event_hour_group,subway_distances,num_close_subways
0,1535391367000,Join us in person or tune in online! Livestrea...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to second floor/ stream online at ...,254149786,0,...,137,0,"(40.723171, -73.997177)",2018-09-04 22:30:00,2018-09-04,Tuesday,22,"[21, 24)","[0.06014082663839308, 0.08460125597085202, 0.1...",12
1,1535385547000,Get started now on challenges related to these...,150.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,https://zoom.us/j/417883916,254146381,0,...,131,0,"(40.74673, -73.98967)",2018-09-06 00:00:00,2018-09-06,Thursday,0,"[0, 4)","[0.09959557289714283, 0.19629674920363338, 0.2...",9
2,1535392484000,In this workshop we’ll get a clear sense of th...,150.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to second floor / Livestream at ht...,254150230,0,...,204,0,"(40.72317, -73.99718)",2018-09-06 22:30:00,2018-09-06,Thursday,22,"[21, 24)","[0.06007690211738512, 0.08461572701356186, 0.1...",12
3,1531947994000,The number of opportunities for software engin...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to the 2nd floor,252915161,0,...,229,0,"(40.723171, -73.997177)",2018-09-07 22:30:00,2018-09-07,Friday,22,"[21, 24)","[0.06014082663839308, 0.08460125597085202, 0.1...",12
4,1535383458000,Please tune into the stream here: https://zoom...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please tune into the stream here: https://zoom...,254144933,0,...,165,0,"(40.746731, -73.98967)",2018-09-11 01:30:00,2018-09-11,Tuesday,1,"[0, 4)","[0.09965482405502167, 0.19628496446669605, 0.2...",9


In [183]:
df_events.columns

Index(['created', 'description', 'duration_min', 'event_url', 'fee', 'group',
       'headcount', 'how_to_find_us', 'id', 'maybe_rsvp_count', 'name',
       'photo_url', 'rating', 'rsvp_limit', 'status', 'time', 'updated',
       'utc_offset', 'visibility', 'waitlist_count', 'why', 'yes_rsvp_count',
       'group_is_open', 'group_id', 'has_how_to_find', 'has_rsvp_limit',
       'event_num_words', 'has_photo', 'venue_latlon', 'time_datetime',
       'time_m_d_y', 'day_of_week', 'event_hour', 'event_hour_group',
       'subway_distances', 'num_close_subways'],
      dtype='object')

In [184]:
# create new column that notes whether there is a fee or no fee for the event
df_events['has_fee'] = df_events.fee.apply(lambda x: 0 if x == 0 else 1)

In [185]:
# preview the updated dataframe
df_events.columns

Index(['created', 'description', 'duration_min', 'event_url', 'fee', 'group',
       'headcount', 'how_to_find_us', 'id', 'maybe_rsvp_count', 'name',
       'photo_url', 'rating', 'rsvp_limit', 'status', 'time', 'updated',
       'utc_offset', 'visibility', 'waitlist_count', 'why', 'yes_rsvp_count',
       'group_is_open', 'group_id', 'has_how_to_find', 'has_rsvp_limit',
       'event_num_words', 'has_photo', 'venue_latlon', 'time_datetime',
       'time_m_d_y', 'day_of_week', 'event_hour', 'event_hour_group',
       'subway_distances', 'num_close_subways', 'has_fee'],
      dtype='object')

In [192]:
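# days between event creation and event start; both timestamps are in milliseconds (86400000 ms per day)
df_events['created_to_event_days'] = (df_events['time'].astype(int)-df_events['created'].astype(int))/86400000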

In [193]:
df_events['created_to_event_days'][1]

9.333946759259259
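
The divisor 86400000 is the number of milliseconds in a day (24 × 60 × 60 × 1000). The sketch below repeats the calculation on a single toy row and cross-checks it with pandas timedeltas. The `created` value is the one shown for row 1; the `time` value is an illustrative assumption chosen to be consistent with the output above.

```python
import pandas as pd

MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in one day

# Toy row; 'time' is an assumed value, not read from the real data.
toy = pd.DataFrame({'created': [1535385547000], 'time': [1536192000000]})

# Same arithmetic as the notebook cell: difference in ms divided by ms per day.
toy['created_to_event_days'] = (toy['time'] - toy['created']) / MS_PER_DAY

# Cross-check: convert the difference to a timedelta and divide by one day.
as_timedelta = pd.to_timedelta(toy['time'] - toy['created'], unit='ms')
days_via_timedelta = as_timedelta / pd.Timedelta(days=1)

print(toy['created_to_event_days'][0], days_via_timedelta[0])
```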

In [194]:
df_events.columns

Index(['created', 'description', 'duration_min', 'event_url', 'fee', 'group',
       'headcount', 'how_to_find_us', 'id', 'maybe_rsvp_count', 'name',
       'photo_url', 'rating', 'rsvp_limit', 'status', 'time', 'updated',
       'utc_offset', 'visibility', 'waitlist_count', 'why', 'yes_rsvp_count',
       'group_is_open', 'group_id', 'has_how_to_find', 'has_rsvp_limit',
       'event_num_words', 'has_photo', 'venue_latlon', 'time_datetime',
       'time_m_d_y', 'day_of_week', 'event_hour', 'event_hour_group',
       'subway_distances', 'num_close_subways', 'has_fee',
       'created_to_event_days'],
      dtype='object')

In [195]:
# save cleaned dataframe as pickle and json
df_events.to_pickle("df_sepoct_events_cleaned.pickle")
df_events.to_json("sepoct_events_cleaned.json")

In [196]:
df_events.shape

(25424, 38)

In [197]:
df_num_past_events = pd.DataFrame(df_events.group_id.value_counts()).reset_index()

In [198]:
df_num_past_events.columns = ['group_id', 'num_past_events']
df_num_past_events.head()

Unnamed: 0,group_id,num_past_events
0,1414748,190
1,9608102,165
2,24480154,160
3,344877,155
4,1338658,149
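
The two cells above count events per group and then assign column names by hand. A sketch of the same result in one chain, shown on toy data (the `toy_events` and `toy_counts` names and values are illustrative):

```python
import pandas as pd

# Toy stand-in for df_events: one row per event.
toy_events = pd.DataFrame({'group_id': [10, 10, 20, 10, 30, 20]})

toy_counts = (
    toy_events['group_id']
    .value_counts()                          # events per group, most frequent first
    .rename_axis('group_id')                 # name the index so reset_index labels it
    .reset_index(name='num_past_events')     # name the count column explicitly
)
print(toy_counts)
```

Naming both columns explicitly avoids relying on the default column names that `reset_index` produces, which differ between pandas versions.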


In [199]:
len(df_events.group_id.unique())

2714

Left-merge the events and group dataframes on ```group_id```

In [225]:
df_groups = pd.read_pickle('df_all_groups_cleaned.pickle')

In [227]:
df_events_group = pd.merge(df_events, df_groups, how='left', on = 'group_id')

In [228]:
df_events_group.head()

Unnamed: 0,created_x,description_x,duration_min,event_url,fee,group,headcount,how_to_find_us,id,maybe_rsvp_count,...,name_y,state,status_y,urlname,visibility_y,who,category_name,organizer_id,yrs_since_created,created_date
0,1535391367000,Join us in person or tune in online! Livestrea...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to second floor/ stream online at ...,254149786,0,...,Build with Code - New York City,NY,active,Build-with-Code-New-York,public,Engineers,tech,218119162,2.276969,01/20/2017 01:45:02
1,1535385547000,Get started now on challenges related to these...,150.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,https://zoom.us/j/417883916,254146381,0,...,Build with Code - New York City,NY,active,Build-with-Code-New-York,public,Engineers,tech,218119162,2.276969,01/20/2017 01:45:02
2,1535392484000,In this workshop we’ll get a clear sense of th...,150.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to second floor / Livestream at ht...,254150230,0,...,Build with Code - New York City,NY,active,Build-with-Code-New-York,public,Engineers,tech,218119162,2.276969,01/20/2017 01:45:02
3,1531947994000,The number of opportunities for software engin...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to the 2nd floor,252915161,0,...,Build with Code - New York City,NY,active,Build-with-Code-New-York,public,Engineers,tech,218119162,2.276969,01/20/2017 01:45:02
4,1535383458000,Please tune into the stream here: https://zoom...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please tune into the stream here: https://zoom...,254144933,0,...,Build with Code - New York City,NY,active,Build-with-Code-New-York,public,Engineers,tech,218119162,2.276969,01/20/2017 01:45:02
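
A left merge keeps every event row, but it silently duplicates rows if a `group_id` appears more than once in the groups table. A sketch of how the `validate` argument of `pd.merge` can guard against that, using toy frames (all names and values here are illustrative):

```python
import pandas as pd

# Toy events: two belong to group 10, one to a group missing from toy_groups.
toy_events = pd.DataFrame({'event_id': [1, 2, 3], 'group_id': [10, 10, 99]})
toy_groups = pd.DataFrame({'group_id': [10, 20], 'name': ['A', 'B']})

# validate='many_to_one' raises MergeError if group_id is duplicated on the right.
merged = pd.merge(toy_events, toy_groups, how='left', on='group_id',
                  validate='many_to_one')

# Events whose group was not found carry NaN in the group columns.
num_without_group = merged['name'].isna().sum()
print(len(merged), num_without_group)
```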


A second left-merge adds the number of past events per group (```df_num_past_events```).

In [229]:
df_events_group_past = pd.merge(df_events_group, df_num_past_events, how= 'left', on = 'group_id')

In [230]:
df_events_group_past.head()

Unnamed: 0,created_x,description_x,duration_min,event_url,fee,group,headcount,how_to_find_us,id,maybe_rsvp_count,...,state,status_y,urlname,visibility_y,who,category_name,organizer_id,yrs_since_created,created_date,num_past_events
0,1535391367000,Join us in person or tune in online! Livestrea...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to second floor/ stream online at ...,254149786,0,...,NY,active,Build-with-Code-New-York,public,Engineers,tech,218119162,2.276969,01/20/2017 01:45:02,38
1,1535385547000,Get started now on challenges related to these...,150.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,https://zoom.us/j/417883916,254146381,0,...,NY,active,Build-with-Code-New-York,public,Engineers,tech,218119162,2.276969,01/20/2017 01:45:02,38
2,1535392484000,In this workshop we’ll get a clear sense of th...,150.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to second floor / Livestream at ht...,254150230,0,...,NY,active,Build-with-Code-New-York,public,Engineers,tech,218119162,2.276969,01/20/2017 01:45:02,38
3,1531947994000,The number of opportunities for software engin...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please come to the 2nd floor,252915161,0,...,NY,active,Build-with-Code-New-York,public,Engineers,tech,218119162,2.276969,01/20/2017 01:45:02,38
4,1535383458000,Please tune into the stream here: https://zoom...,120.0,https://www.meetup.com/Build-with-Code-New-Yor...,0.0,"{'join_mode': 'open', 'created': 1484876702000...",0,Please tune into the stream here: https://zoom...,254144933,0,...,NY,active,Build-with-Code-New-York,public,Engineers,tech,218119162,2.276969,01/20/2017 01:45:02,38


Rename column headers for clarity (event vs. group info).

In [231]:
df_events_group_past.columns

Index(['created_x', 'description_x', 'duration_min', 'event_url', 'fee',
       'group', 'headcount', 'how_to_find_us', 'id', 'maybe_rsvp_count',
       'name_x', 'photo_url', 'rating', 'rsvp_limit', 'status_x', 'time',
       'updated', 'utc_offset', 'visibility_x', 'waitlist_count', 'why',
       'yes_rsvp_count', 'group_is_open', 'group_id', 'has_how_to_find',
       'has_rsvp_limit', 'event_num_words', 'has_photo', 'venue_latlon',
       'time_datetime', 'time_m_d_y', 'day_of_week', 'event_hour',
       'event_hour_group', 'subway_distances', 'num_close_subways', 'has_fee',
       'created_to_event_days', 'created_y', 'description_y', 'join_mode',
       'lat', 'link', 'localized_country_name', 'localized_location', 'lon',
       'members', 'name_y', 'state', 'status_y', 'urlname', 'visibility_y',
       'who', 'category_name', 'organizer_id', 'yrs_since_created',
       'created_date', 'num_past_events'],
      dtype='object')

In [232]:
df_events_group_past.rename(columns = {'created_x': 'event_created',
                                         'description_x': 'event_description',
                                         'duration_min': 'event_duration',
                                         'headcount': 'event_headcount',
                                         'id': 'event_id',
                                         'name_x': 'event_name',
                                         'rating': 'event_rating',
                                         'status_x': 'event_status',
                                         'time': 'event_time',
                                         'updated': 'event_updated',
                                         'visibility_x': 'event_visibility',
                                         'descrip_tokens': 'event_descrip_tokens',
                                         'descrip_num_words':'event_descrip_num_words',
                                         'has_fee': 'has_event_fee',
                                         'created_y': 'group_created',
                                         'description_y': 'group_description',
                                         'join_mode': 'group_join_mode',
                                         'lat': 'group_lat',
                                         'lon': 'group_lon',
                                         'link': 'group_link',
                                         'state': 'group_state',
                                         'members': 'num_members',
                                         'name_y': 'group_name',
                                         'status_y': 'group_status',
                                         'urlname': 'group_urlname',
                                         'visibility_y': 'group_visibility',
                                         'who': 'group_who',
                                         'category_name': 'group_category',
                                         'organizer_id': 'group_organizer_id',
                                         'yrs_since_created': 'group_yrs_est',
                                         'created_date':'group_created_date'
                                        }, inplace =True)
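
The `_x` and `_y` suffixes renamed above are the pandas default for column names that exist in both frames. As an alternative sketch (not what this notebook does), the `suffixes` argument of `pd.merge` can label overlapping columns at merge time; the toy frames and the `_event`/`_group` suffixes are illustrative choices:

```python
import pandas as pd

# Toy frames sharing the column names 'name' and 'status'.
toy_events = pd.DataFrame({'group_id': [10], 'name': ['JS night'], 'status': ['past']})
toy_groups = pd.DataFrame({'group_id': [10], 'name': ['Build with Code'], 'status': ['active']})

# Overlapping columns get the given suffixes instead of _x / _y.
merged = pd.merge(toy_events, toy_groups, how='left', on='group_id',
                  suffixes=('_event', '_group'))
print(merged.columns.tolist())
```

Columns that exist in only one frame keep their original names, so an explicit rename is still needed for those.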

In [235]:
df_events_group_past.columns

Index(['event_created', 'event_description', 'event_duration', 'event_url',
       'fee', 'group', 'event_headcount', 'how_to_find_us', 'event_id',
       'maybe_rsvp_count', 'event_name', 'photo_url', 'event_rating',
       'rsvp_limit', 'event_status', 'event_time', 'event_updated',
       'utc_offset', 'event_visibility', 'waitlist_count', 'why',
       'yes_rsvp_count', 'group_is_open', 'group_id', 'has_how_to_find',
       'has_rsvp_limit', 'event_num_words', 'has_photo', 'venue_latlon',
       'time_datetime', 'time_m_d_y', 'day_of_week', 'event_hour',
       'event_hour_group', 'subway_distances', 'num_close_subways',
       'has_event_fee', 'created_to_event_days', 'group_created',
       'group_description', 'group_join_mode', 'group_lat', 'group_link',
       'localized_country_name', 'localized_location', 'group_lon',
       'num_members', 'group_name', 'group_state', 'group_status',
       'group_urlname', 'group_visibility', 'group_who', 'group_category',
       'group_org

In [234]:
# save the merged dataframe
df_events_group_past.to_pickle('df_sepoct_events_groups_merged_cleaned.pickle')

***
<a id='groups'></a>
### 2. Meetup Groups

#### Load Data

In [200]:
# open all_groups file
with open('all_groups.pkl', 'rb') as f:
    all_groups = pickle.load(f)

In [201]:
# convert to dataframe
df_groups = pd.DataFrame(all_groups)

In [202]:
df_groups.shape

(8632, 29)

In [203]:
df_groups.head()

Unnamed: 0,category,city,country,created,description,group_photo,id,is_pro_hidden,join_mode,key_photo,...,organizer,pro_network,score,state,status,timezone,untranslated_city,urlname,visibility,who
0,"{'id': 34, 'name': 'Tech', 'shortname': 'tech'...",New York,US,1484876702000,<p>Build with Code hosts free weekly JavaScrip...,,21993357,,open,"{'id': 464860413, 'highres_link': 'https://sec...",...,"{'id': 218119162, 'name': 'Jenny Mith', 'bio':...",,1.0,NY,active,US/Eastern,New York,Build-with-Code-New-York,public,Engineers
1,"{'id': 2, 'name': 'Career & Business', 'shortn...",New York,US,1550615516000,<p>The TechDay New York team invites you to jo...,,31207091,,open,"{'id': 480306005, 'highres_link': 'https://sec...",...,"{'id': 263284450, 'name': 'Ana ', 'bio': '', '...",,1.0,NY,active,US/Eastern,New York,TechDayHQ,public,Members
2,"{'id': 34, 'name': 'Tech', 'shortname': 'tech'...",New York,US,1047953152000,<p>The NYC NoSQL &amp; NewSQL Group <br> (form...,"{'id': 460182357, 'highres_link': 'https://sec...",107592,,open,"{'id': 466506912, 'highres_link': 'https://sec...",...,"{'id': 6618661, 'name': 'Eric David Benari', '...",,1.0,NY,active,US/Eastern,New York,mysqlnyc,public,Data Enthusiasts
3,"{'id': 23, 'name': 'Outdoors & Adventure', 'sh...",New York,US,1548684384000,<p><span>The Awesome Events Meetup Group is th...,,31031999,,open,"{'id': 480057227, 'highres_link': 'https://sec...",...,"{'id': 236287112, 'name': 'Justin', 'bio': '',...",,1.0,NY,active,US/Eastern,New York,awesome-events,public,Awesome People
4,"{'id': 34, 'name': 'Tech', 'shortname': 'tech'...",New York,US,1321563802000,<p><span>Data Driven NYC (organized by FirstMa...,"{'id': 442920809, 'highres_link': 'https://sec...",2829432,,approval,"{'id': 442991280, 'highres_link': 'https://sec...",...,"{'id': 2369792, 'name': 'Matt Turck', 'bio': '...",,1.0,NY,active,US/Eastern,New York,DataDrivenNYC,public,Members


In [204]:
# rename id to group_id
df_groups.rename(columns ={'id':'group_id'}, inplace = True)

In [205]:
df_groups.isna().sum()

category                     8
city                         0
country                      0
created                      0
description                  0
group_photo               4533
group_id                     0
is_pro_hidden             8628
join_mode                    0
key_photo                 1484
lat                          0
link                         0
localized_country_name       0
localized_location           0
lon                          0
members                      0
meta_category              154
name                         0
next_event                5852
organizer                    0
pro_network               8318
score                        0
state                        0
status                       0
timezone                     0
untranslated_city            0
urlname                      0
visibility                   0
who                          0
dtype: int64

We'll deal with most of the missing values in this dataset by dropping columns we won't need:
- ```is_pro_hidden```, ```pro_network```, ```next_event```, ```key_photo```, ```group_photo```, ```timezone```, ```untranslated_city```, ```score```, ```country```, ```city```, ```meta_category``` (contains the same info as ```category```)


In [206]:
df_groups.drop(columns = ['is_pro_hidden', 'pro_network', 'next_event', 'key_photo', 'group_photo',
                         'timezone', 'untranslated_city', 'score', 'country', 'city', 'meta_category'], 
               inplace = True)

In [207]:
df_groups.head()

Unnamed: 0,category,created,description,group_id,join_mode,lat,link,localized_country_name,localized_location,lon,members,name,organizer,state,status,urlname,visibility,who
0,"{'id': 34, 'name': 'Tech', 'shortname': 'tech'...",1484876702000,<p>Build with Code hosts free weekly JavaScrip...,21993357,open,40.75,https://www.meetup.com/Build-with-Code-New-York/,USA,"New York, NY",-73.99,8050,Build with Code - New York City,"{'id': 218119162, 'name': 'Jenny Mith', 'bio':...",NY,active,Build-with-Code-New-York,public,Engineers
1,"{'id': 2, 'name': 'Career & Business', 'shortn...",1550615516000,<p>The TechDay New York team invites you to jo...,31207091,open,40.75,https://www.meetup.com/TechDayHQ/,USA,"New York, NY",-73.99,1361,TechDay Meetup,"{'id': 263284450, 'name': 'Ana ', 'bio': '', '...",NY,active,TechDayHQ,public,Members
2,"{'id': 34, 'name': 'Tech', 'shortname': 'tech'...",1047953152000,<p>The NYC NoSQL &amp; NewSQL Group <br> (form...,107592,open,40.75,https://www.meetup.com/mysqlnyc/,USA,"New York, NY",-73.99,24226,"🔥 SQL NYC, The NoSQL & NewSQL Database Big Dat...","{'id': 6618661, 'name': 'Eric David Benari', '...",NY,active,mysqlnyc,public,Data Enthusiasts
3,"{'id': 23, 'name': 'Outdoors & Adventure', 'sh...",1548684384000,<p><span>The Awesome Events Meetup Group is th...,31031999,open,40.78,https://www.meetup.com/awesome-events/,USA,"New York, NY",-73.96,1694,Awesome Events,"{'id': 236287112, 'name': 'Justin', 'bio': '',...",NY,active,awesome-events,public,Awesome People
4,"{'id': 34, 'name': 'Tech', 'shortname': 'tech'...",1321563802000,<p><span>Data Driven NYC (organized by FirstMa...,2829432,approval,40.76,https://www.meetup.com/DataDrivenNYC/,USA,"New York, NY",-73.97,17382,Data Driven NYC (a FirstMark Event),"{'id': 2369792, 'name': 'Matt Turck', 'bio': '...",NY,active,DataDrivenNYC,public,Members


In [208]:
# clean text in descriptions
df_groups.description = df_groups.description.apply(lambda x: clean_text(x))
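
`clean_text` comes from the local `meetup_api_functions` module, whose source is not shown here. The function below is a hypothetical stand-in (named `clean_text_sketch` to make that clear) that strips HTML tags, decodes entities, and collapses whitespace. The real helper may behave differently; for example, the output in this notebook suggests it drops `&amp;` rather than decoding it.

```python
import html
import re

def clean_text_sketch(raw):
    """Hypothetical stand-in for clean_text: remove tags, decode entities, tidy spaces."""
    no_tags = re.sub(r'<[^>]+>', ' ', raw)        # replace each HTML tag with a space
    decoded = html.unescape(no_tags)              # turn &amp; into &, &lt; into <, etc.
    return re.sub(r'\s+', ' ', decoded).strip()   # collapse runs of whitespace

# Made-up sample, not taken from the dataset.
sample = '<p>Weekly <b>JavaScript</b> &amp; Python study group</p>'
print(clean_text_sketch(sample))
```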

In [209]:
df_groups.head()

Unnamed: 0,category,created,description,group_id,join_mode,lat,link,localized_country_name,localized_location,lon,members,name,organizer,state,status,urlname,visibility,who
0,"{'id': 34, 'name': 'Tech', 'shortname': 'tech'...",1484876702000,Build with Code hosts free weekly JavaScript a...,21993357,open,40.75,https://www.meetup.com/Build-with-Code-New-York/,USA,"New York, NY",-73.99,8050,Build with Code - New York City,"{'id': 218119162, 'name': 'Jenny Mith', 'bio':...",NY,active,Build-with-Code-New-York,public,Engineers
1,"{'id': 2, 'name': 'Career & Business', 'shortn...",1550615516000,The TechDay New York team invites you to join ...,31207091,open,40.75,https://www.meetup.com/TechDayHQ/,USA,"New York, NY",-73.99,1361,TechDay Meetup,"{'id': 263284450, 'name': 'Ana ', 'bio': '', '...",NY,active,TechDayHQ,public,Members
2,"{'id': 34, 'name': 'Tech', 'shortname': 'tech'...",1047953152000,The NYC NoSQL NewSQL Group (formerly known a...,107592,open,40.75,https://www.meetup.com/mysqlnyc/,USA,"New York, NY",-73.99,24226,"🔥 SQL NYC, The NoSQL & NewSQL Database Big Dat...","{'id': 6618661, 'name': 'Eric David Benari', '...",NY,active,mysqlnyc,public,Data Enthusiasts
3,"{'id': 23, 'name': 'Outdoors & Adventure', 'sh...",1548684384000,The Awesome Events Meetup Group is the real-li...,31031999,open,40.78,https://www.meetup.com/awesome-events/,USA,"New York, NY",-73.96,1694,Awesome Events,"{'id': 236287112, 'name': 'Justin', 'bio': '',...",NY,active,awesome-events,public,Awesome People
4,"{'id': 34, 'name': 'Tech', 'shortname': 'tech'...",1321563802000,"Data Driven NYC (organized by FirstMark), is a...",2829432,approval,40.76,https://www.meetup.com/DataDrivenNYC/,USA,"New York, NY",-73.97,17382,Data Driven NYC (a FirstMark Event),"{'id': 2369792, 'name': 'Matt Turck', 'bio': '...",NY,active,DataDrivenNYC,public,Members


Let's look at the ```category``` column in more detail and extract just the information we want in the main ```df_groups``` dataframe.

In [210]:
df_category = df_groups['category'].apply(pd.Series)
df_category.head()



Unnamed: 0,id,name,shortname,sort_name,0
0,34.0,Tech,tech,Tech,
1,2.0,Career & Business,career-business,Career & Business,
2,34.0,Tech,tech,Tech,
3,23.0,Outdoors & Adventure,outdoors-adventure,Outdoors & Adventure,
4,34.0,Tech,tech,Tech,


In [211]:
df_category.isna().sum()/(len(df_category))*100

id             0.092678
name           0.092678
shortname      0.092678
sort_name      0.092678
0            100.000000
dtype: float64

In [212]:
# replace NaNs
df_category['shortname'] = df_category['shortname'].fillna('None')

In [213]:
# add columns to main dataframe and drop 'category'
df_groups['category_name'] = df_category['shortname']
df_groups.drop(columns = ['category'], inplace=True)
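
`apply(pd.Series)` expands every key of the `category` dicts into its own column, although only `shortname` is kept. A sketch that pulls that one key directly and handles missing categories in the same step, on toy data (the `toy` frame is illustrative; `'None'` mirrors the fill value used above):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_groups: one dict-valued category and one missing value.
toy = pd.DataFrame({
    'category': [{'id': 34, 'name': 'Tech', 'shortname': 'tech'}, np.nan]
})

# Take 'shortname' when the cell holds a dict; otherwise fall back to 'None'.
toy['category_name'] = toy['category'].apply(
    lambda c: c.get('shortname', 'None') if isinstance(c, dict) else 'None'
)
print(toy['category_name'].tolist())
```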

In [214]:
df_groups.head()

Unnamed: 0,created,description,group_id,join_mode,lat,link,localized_country_name,localized_location,lon,members,name,organizer,state,status,urlname,visibility,who,category_name
0,1484876702000,Build with Code hosts free weekly JavaScript a...,21993357,open,40.75,https://www.meetup.com/Build-with-Code-New-York/,USA,"New York, NY",-73.99,8050,Build with Code - New York City,"{'id': 218119162, 'name': 'Jenny Mith', 'bio':...",NY,active,Build-with-Code-New-York,public,Engineers,tech
1,1550615516000,The TechDay New York team invites you to join ...,31207091,open,40.75,https://www.meetup.com/TechDayHQ/,USA,"New York, NY",-73.99,1361,TechDay Meetup,"{'id': 263284450, 'name': 'Ana ', 'bio': '', '...",NY,active,TechDayHQ,public,Members,career-business
2,1047953152000,The NYC NoSQL NewSQL Group (formerly known a...,107592,open,40.75,https://www.meetup.com/mysqlnyc/,USA,"New York, NY",-73.99,24226,"🔥 SQL NYC, The NoSQL & NewSQL Database Big Dat...","{'id': 6618661, 'name': 'Eric David Benari', '...",NY,active,mysqlnyc,public,Data Enthusiasts,tech
3,1548684384000,The Awesome Events Meetup Group is the real-li...,31031999,open,40.78,https://www.meetup.com/awesome-events/,USA,"New York, NY",-73.96,1694,Awesome Events,"{'id': 236287112, 'name': 'Justin', 'bio': '',...",NY,active,awesome-events,public,Awesome People,outdoors-adventure
4,1321563802000,"Data Driven NYC (organized by FirstMark), is a...",2829432,approval,40.76,https://www.meetup.com/DataDrivenNYC/,USA,"New York, NY",-73.97,17382,Data Driven NYC (a FirstMark Event),"{'id': 2369792, 'name': 'Matt Turck', 'bio': '...",NY,active,DataDrivenNYC,public,Members,tech


Let's look at the ```organizer``` column in more detail and extract just the information we want in the main ```df_groups``` dataframe.

In [215]:
df_org = df_groups['organizer'].apply(pd.Series)
df_org.head()

Unnamed: 0,id,name,bio,photo
0,218119162,Jenny Mith,,"{'id': 262996470, 'highres_link': 'https://sec..."
1,263284450,Ana,,"{'id': 281661741, 'highres_link': 'https://sec..."
2,6618661,Eric David Benari,,"{'id': 4946659, 'highres_link': 'https://secur..."
3,236287112,Justin,,"{'id': 284561488, 'highres_link': 'https://sec..."
4,2369792,Matt Turck,"Managing Director, FirstMark Capital","{'id': 266918773, 'highres_link': 'https://sec..."


In [216]:
df_org.isna().sum()/(len(df_org))*100

id       0.000000
name     0.000000
bio      0.000000
photo    6.452734
dtype: float64

In [217]:
# keep only the organizer's id and drop the 'organizer' column from df_groups
df_groups['organizer_id'] = df_org['id']
df_groups.drop(columns = ['organizer'], inplace =True)

In [218]:
df_groups.head()

Unnamed: 0,created,description,group_id,join_mode,lat,link,localized_country_name,localized_location,lon,members,name,state,status,urlname,visibility,who,category_name,organizer_id
0,1484876702000,Build with Code hosts free weekly JavaScript a...,21993357,open,40.75,https://www.meetup.com/Build-with-Code-New-York/,USA,"New York, NY",-73.99,8050,Build with Code - New York City,NY,active,Build-with-Code-New-York,public,Engineers,tech,218119162
1,1550615516000,The TechDay New York team invites you to join ...,31207091,open,40.75,https://www.meetup.com/TechDayHQ/,USA,"New York, NY",-73.99,1361,TechDay Meetup,NY,active,TechDayHQ,public,Members,career-business,263284450
2,1047953152000,The NYC NoSQL NewSQL Group (formerly known a...,107592,open,40.75,https://www.meetup.com/mysqlnyc/,USA,"New York, NY",-73.99,24226,"🔥 SQL NYC, The NoSQL & NewSQL Database Big Dat...",NY,active,mysqlnyc,public,Data Enthusiasts,tech,6618661
3,1548684384000,The Awesome Events Meetup Group is the real-li...,31031999,open,40.78,https://www.meetup.com/awesome-events/,USA,"New York, NY",-73.96,1694,Awesome Events,NY,active,awesome-events,public,Awesome People,outdoors-adventure,236287112
4,1321563802000,"Data Driven NYC (organized by FirstMark), is a...",2829432,approval,40.76,https://www.meetup.com/DataDrivenNYC/,USA,"New York, NY",-73.97,17382,Data Driven NYC (a FirstMark Event),NY,active,DataDrivenNYC,public,Members,tech,2369792


In [219]:
df_groups.columns

Index(['created', 'description', 'group_id', 'join_mode', 'lat', 'link',
       'localized_country_name', 'localized_location', 'lon', 'members',
       'name', 'state', 'status', 'urlname', 'visibility', 'who',
       'category_name', 'organizer_id'],
      dtype='object')

In [220]:
# years each group has existed: subtract the 'created' timestamp from May 1, 2019
# (1556683200000 ms since the epoch, i.e. 2019-05-01 04:00 UTC), then convert ms -> days -> years
df_groups['yrs_since_created'] = ((1556683200000 - df_groups['created'])/86400000)/365
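
As a check on the hard-coded constant, the sketch below rebuilds it from a calendar date. Reading 1556683200000 as midnight on May 1, 2019 in US Eastern time is an inference from the value (it equals 2019-05-01 04:00 UTC), not something stated in the notebook; `reference` and `reference_ms` are illustrative names.

```python
import pandas as pd

# Assumed reference point: midnight May 1, 2019, US Eastern time.
reference = pd.Timestamp('2019-05-01', tz='America/New_York')

# Timestamp.value is nanoseconds since the epoch (UTC); convert to milliseconds.
reference_ms = reference.value // 10**6

# Age in years of a group created at 1484876702000 ms (first row of df_groups).
yrs = (reference_ms - 1484876702000) / 86400000 / 365
print(reference_ms, round(yrs, 6))
```

Dividing by 365 ignores leap days, so the result is a close approximation rather than an exact calendar age.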

In [221]:
df_groups['created_date'] = df_groups['created'].apply(lambda x:time.strftime('%m/%d/%Y %H:%M:%S', time.gmtime(x/1000.)))
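
The cell above formats each timestamp row by row with `time.gmtime`, so the resulting strings are in UTC. A vectorized sketch of the same conversion with `pd.to_datetime`, using the `created` value from the first row of `df_groups` (the `toy` frame itself is illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'created': [1484876702000]})

# unit='ms' interprets the integers as milliseconds since the epoch (UTC).
created_dt = pd.to_datetime(toy['created'], unit='ms')

# Same string format as the notebook cell.
toy['created_date'] = created_dt.dt.strftime('%m/%d/%Y %H:%M:%S')
print(toy['created_date'][0])
```

Keeping the intermediate datetime column (`created_dt` here) instead of only the formatted string makes later date arithmetic easier.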

In [222]:
df_groups.head()

Unnamed: 0,created,description,group_id,join_mode,lat,link,localized_country_name,localized_location,lon,members,name,state,status,urlname,visibility,who,category_name,organizer_id,yrs_since_created,created_date
0,1484876702000,Build with Code hosts free weekly JavaScript a...,21993357,open,40.75,https://www.meetup.com/Build-with-Code-New-York/,USA,"New York, NY",-73.99,8050,Build with Code - New York City,NY,active,Build-with-Code-New-York,public,Engineers,tech,218119162,2.276969,01/20/2017 01:45:02
1,1550615516000,The TechDay New York team invites you to join ...,31207091,open,40.75,https://www.meetup.com/TechDayHQ/,USA,"New York, NY",-73.99,1361,TechDay Meetup,NY,active,TechDayHQ,public,Members,career-business,263284450,0.192405,02/19/2019 22:31:56
2,1047953152000,The NYC NoSQL NewSQL Group (formerly known a...,107592,open,40.75,https://www.meetup.com/mysqlnyc/,USA,"New York, NY",-73.99,24226,"🔥 SQL NYC, The NoSQL & NewSQL Database Big Dat...",NY,active,mysqlnyc,public,Data Enthusiasts,tech,6618661,16.131724,03/18/2003 02:05:52
3,1548684384000,The Awesome Events Meetup Group is the real-li...,31031999,open,40.78,https://www.meetup.com/awesome-events/,USA,"New York, NY",-73.96,1694,Awesome Events,NY,active,awesome-events,public,Awesome People,outdoors-adventure,236287112,0.253641,01/28/2019 14:06:24
4,1321563802000,"Data Driven NYC (organized by FirstMark), is a...",2829432,approval,40.76,https://www.meetup.com/DataDrivenNYC/,USA,"New York, NY",-73.97,17382,Data Driven NYC (a FirstMark Event),NY,active,DataDrivenNYC,public,Members,tech,2369792,7.455587,11/17/2011 21:03:22


In [223]:
# pickle cleaned group dataframe
df_groups.to_pickle('df_all_groups_cleaned.pickle')

***
<a id='members'></a>
### 3. Meetup Members

Here we will merge two dataframes containing information on members. The first holds information scraped from member profile pages; the second holds member info obtained from the members API endpoint.


#### Scraped data

In [351]:
# importing member profiles scraped:
with open('member_profiles_16000.pkl', 'rb') as f:
    member_profiles = pickle.load(f)

In [352]:
print(f"Scraped {len(member_profiles)} profiles")

Scraped 15990 profiles


In [353]:
# view data in dataframe
df_members = pd.DataFrame(member_profiles)
df_members.head()

Unnamed: 0,groups,interests,member_url
0,"[Closing Deals in 6 Inch Heels NYC, Entreprene...","[Professional Development, Professional Women,...",http://www.meetup.com/members/57678912
1,"[Ann Arbor Web Accessibility, Data Driven NYC ...","[Adventure, Language & Culture, Nightlife, Bac...",http://www.meetup.com/members/230923603
2,"[ArtForward, Central Park Sketching & Art Meet...","[Theater, Performing Arts, Walking, Writing, A...",http://www.meetup.com/members/24427602
3,"[#Resist: Danbury, Adult Day Camp, Black Nonbe...","[Museum, Cooking Dinner Parties, Wine, Healthy...",http://www.meetup.com/members/75979532
4,['NYC- Small Business and Entrepreneurs Networ...,"[Hip Hop, Wine, Business Strategy, Dining Out,...",http://www.meetup.com/members/279891863


In [354]:
# count the number of items in groups and interests; members without any group or interest information will be dropped
df_members['num_groups'] = df_members.groups.apply(lambda x: len(x))
df_members['num_interests'] = df_members.interests.apply(lambda x: len(x))

In [355]:
# get the indices of rows that are missing both group and interest data; use indices to drop rows
missing_groups_ints = df_members[(df_members['num_groups'] == 0) & (df_members['num_interests']==0)]
df_members.drop(index = missing_groups_ints.index, axis = 0, inplace = True)

In [356]:
# now let's also drop members missing either groups or interests (1,105 in total) so that we only work with
# users with full info
missing_groups = df_members[df_members['num_groups'] == 0]
missing_ints = df_members[df_members['num_interests'] == 0]

df_members.drop(index = missing_groups.index, axis = 0, inplace = True)
df_members.drop(index = missing_ints.index, axis = 0, inplace = True)
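
The two drop steps above leave only members who have at least one group and at least one interest. A sketch of the same filter as a single boolean mask, on toy data (the `toy` frame and the `has_both` name are illustrative):

```python
import pandas as pd

# Toy stand-in for df_members: list-valued groups and interests.
toy = pd.DataFrame({
    'groups':    [['a'], [],    ['b'], []],
    'interests': [['x'], ['y'], [],    []],
})

toy['num_groups'] = toy['groups'].apply(len)
toy['num_interests'] = toy['interests'].apply(len)

# Keep rows that have both at least one group and at least one interest.
has_both = (toy['num_groups'] > 0) & (toy['num_interests'] > 0)
toy_full = toy[has_both].reset_index(drop=True)
print(toy_full)
```

Filtering with a mask returns a new frame, so the row counts before and after can be compared without mutating the original.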

In [357]:
df_members.shape

(14879, 5)

In [358]:
# preview the updated dataframe
df_members.head()

Unnamed: 0,groups,interests,member_url,num_groups,num_interests
0,"[Closing Deals in 6 Inch Heels NYC, Entreprene...","[Professional Development, Professional Women,...",http://www.meetup.com/members/57678912,7,4
1,"[Ann Arbor Web Accessibility, Data Driven NYC ...","[Adventure, Language & Culture, Nightlife, Bac...",http://www.meetup.com/members/230923603,8,23
2,"[ArtForward, Central Park Sketching & Art Meet...","[Theater, Performing Arts, Walking, Writing, A...",http://www.meetup.com/members/24427602,3,9
3,"[#Resist: Danbury, Adult Day Camp, Black Nonbe...","[Museum, Cooking Dinner Parties, Wine, Healthy...",http://www.meetup.com/members/75979532,12,51
4,['NYC- Small Business and Entrepreneurs Networ...,"[Hip Hop, Wine, Business Strategy, Dining Out,...",http://www.meetup.com/members/279891863,9,14


In [360]:
# save final dataframe to json and pickle
df_members.to_json("member_profiles_1600_cleaned.json")
df_members.to_pickle("df_scraped_profiles_cleaned.pickle")

#### API data

In [361]:
# getting back pickled dataframe containing the API member info
df_membersapi = pd.read_pickle('df_unique_members.pickle')

In [362]:
df_membersapi.shape

(234609, 17)

In [363]:
df_membersapi.head()

Unnamed: 0,bio,city,country,hometown,id,joined,lat,link,lon,name,other_services,photo,self,state,status,topics,visited
0,,Bronx,us,,276413419,1552398000000.0,40.82,http://www.meetup.com/members/276413419,-73.92,Charisse,{},{'highres_link': 'https://secure.meetupstatic....,{'common': {}},NY,active,"[{'urlkey': 'newtech', 'name': 'New Technology...",1552398000000.0
1,,New York,us,,245744462,1515612000000.0,40.75,http://www.meetup.com/members/245744462,-73.99,Ibrahima Diallo,{},{'highres_link': 'https://secure.meetupstatic....,{'common': {}},NY,active,"[{'urlkey': 'newtech', 'name': 'New Technology...",1515612000000.0
2,,New York,us,,273936256,1549559000000.0,40.75,http://www.meetup.com/members/273936256,-73.99,Victoria Read,{},{'highres_link': 'https://secure.meetupstatic....,{'common': {}},NY,active,[],1549559000000.0
3,,New York,us,,258398074,1531030000000.0,40.75,http://www.meetup.com/members/258398074,-73.99,+V信feng4343注册得99链接186053.com,{},,{'common': {}},NY,active,[],1531030000000.0
4,,New York,us,,259737701,1552287000000.0,40.75,http://www.meetup.com/members/259737701,-73.99,¥en,{},{'highres_link': 'https://secure.meetupstatic....,{'common': {}},NY,active,[],1552287000000.0


In [364]:
# rename the link column to member_url so it can serve as the merge key with the scraped dataframe
df_membersapi.rename(columns={'link':'member_url'}, inplace = True)
df_membersapi.head()

Unnamed: 0,bio,city,country,hometown,id,joined,lat,member_url,lon,name,other_services,photo,self,state,status,topics,visited
0,,Bronx,us,,276413419,1552398000000.0,40.82,http://www.meetup.com/members/276413419,-73.92,Charisse,{},{'highres_link': 'https://secure.meetupstatic....,{'common': {}},NY,active,"[{'urlkey': 'newtech', 'name': 'New Technology...",1552398000000.0
1,,New York,us,,245744462,1515612000000.0,40.75,http://www.meetup.com/members/245744462,-73.99,Ibrahima Diallo,{},{'highres_link': 'https://secure.meetupstatic....,{'common': {}},NY,active,"[{'urlkey': 'newtech', 'name': 'New Technology...",1515612000000.0
2,,New York,us,,273936256,1549559000000.0,40.75,http://www.meetup.com/members/273936256,-73.99,Victoria Read,{},{'highres_link': 'https://secure.meetupstatic....,{'common': {}},NY,active,[],1549559000000.0
3,,New York,us,,258398074,1531030000000.0,40.75,http://www.meetup.com/members/258398074,-73.99,+V信feng4343注册得99链接186053.com,{},,{'common': {}},NY,active,[],1531030000000.0
4,,New York,us,,259737701,1552287000000.0,40.75,http://www.meetup.com/members/259737701,-73.99,¥en,{},{'highres_link': 'https://secure.meetupstatic....,{'common': {}},NY,active,[],1552287000000.0


#### Merged data

In [365]:
# left-merge the API data onto the scraped members dataframe, keyed on the member_url column
full_df_members = pd.merge(df_members, df_membersapi, how = 'left', on= 'member_url')

In [366]:
full_df_members.shape

(14879, 21)
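
A left merge keeps every row of the left dataframe, but it can silently add rows if the key repeats on the right side, and unmatched rows come back filled with NaN. The sketch below shows one way to check both, using `validate` and `indicator` from `pd.merge`. The two small frames are hypothetical stand-ins for `df_members` and `df_membersapi`, not the real data.

```python
import pandas as pd

# Hypothetical stand-ins for the scraped and API dataframes.
scraped = pd.DataFrame({
    "member_url": [
        "http://www.meetup.com/members/1",
        "http://www.meetup.com/members/2",
        "http://www.meetup.com/members/3",
    ],
    "num_groups": [7, 8, 3],
})
api = pd.DataFrame({
    "member_url": [
        "http://www.meetup.com/members/1",
        "http://www.meetup.com/members/2",
    ],
    "city": ["Secaucus", "New York"],
})

# validate="many_to_one" raises a MergeError if member_url repeats in the API frame;
# indicator=True adds a _merge column recording where each row was found.
merged = pd.merge(
    scraped, api, how="left", on="member_url",
    validate="many_to_one", indicator=True,
)

# Rows present only in the scraped frame have no API data attached.
unmatched = int((merged["_merge"] == "left_only").sum())
print(merged.shape, unmatched)
```

If the row count after the merge equals the row count of the left frame and `unmatched` is zero, every scraped member found exactly one API record.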

In [367]:
# preview the merged dataframe
full_df_members.head()

Unnamed: 0,groups,interests,member_url,num_groups,num_interests,bio,city,country,hometown,id,...,lat,lon,name,other_services,photo,self,state,status,topics,visited
0,"[Closing Deals in 6 Inch Heels NYC, Entreprene...","[Professional Development, Professional Women,...",http://www.meetup.com/members/57678912,7,4,,Secaucus,us,secaucus,57678912,...,40.79,-74.06,Dee,{},{'highres_link': 'https://secure.meetupstatic....,{'common': {}},NJ,active,"[{'urlkey': 'business-referral-networking', 'n...",1466428000000.0
1,"[Ann Arbor Web Accessibility, Data Driven NYC ...","[Adventure, Language & Culture, Nightlife, Bac...",http://www.meetup.com/members/230923603,8,23,,New York,us,"St. Gallen, Switzerland",230923603,...,40.72,-73.98,Alistair Barrell,{},{'highres_link': 'https://secure.meetupstatic....,{'common': {}},NY,active,"[{'urlkey': 'foodie', 'name': 'Foodie', 'id': ...",1554763000000.0
2,"[ArtForward, Central Park Sketching & Art Meet...","[Theater, Performing Arts, Walking, Writing, A...",http://www.meetup.com/members/24427602,3,9,,New York,us,,24427602,...,40.72,-74.0,Beth Barber,{},{'highres_link': 'https://secure.meetupstatic....,{'common': {}},NY,active,"[{'urlkey': 'visual-studio', 'name': 'Visual S...",1447760000000.0
3,"[#Resist: Danbury, Adult Day Camp, Black Nonbe...","[Museum, Cooking Dinner Parties, Wine, Healthy...",http://www.meetup.com/members/75979532,12,51,,New Haven,us,New Haven,75979532,...,41.33,-72.97,Kathy,{},{'highres_link': 'https://secure.meetupstatic....,{'common': {}},CT,active,"[{'urlkey': 'coffee', 'name': 'Coffee', 'id': ...",1514860000000.0
4,['NYC- Small Business and Entrepreneurs Networ...,"[Hip Hop, Wine, Business Strategy, Dining Out,...",http://www.meetup.com/members/279891863,9,14,,West Hempstead,us,,279891863,...,40.69,-73.65,Karen White Kelly,{},{'highres_link': 'https://secure.meetupstatic....,{'common': {}},NY,active,"[{'urlkey': 'hiphop', 'name': 'Hip Hop', 'id':...",1556335000000.0


In [368]:
full_df_members.columns

Index(['groups', 'interests', 'member_url', 'num_groups', 'num_interests',
       'bio', 'city', 'country', 'hometown', 'id', 'joined', 'lat', 'lon',
       'name', 'other_services', 'photo', 'self', 'state', 'status', 'topics',
       'visited'],
      dtype='object')

In [370]:
# the 'self' column can be dropped: every row holds the same empty {'common': {}} value
full_df_members.self.value_counts()

{'common': {}}    14879
Name: self, dtype: int64

In [None]:
# drop the uninformative 'self' column
full_df_members.drop(columns = ['self'], inplace = True)
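
The same check can be run across all columns at once rather than one `value_counts()` call at a time. This is a minimal sketch on a hypothetical sample; `constant_columns` is an illustrative helper, not part of the notebook. Values are cast to strings first because dict entries are unhashable and cannot be counted directly.

```python
import pandas as pd

# Hypothetical sample with one column that never varies.
sample = pd.DataFrame({
    "self": [{"common": {}}, {"common": {}}, {"common": {}}],
    "state": ["NJ", "NY", "NY"],
})

def constant_columns(df):
    """Return the names of columns whose values are all identical."""
    # Cast to str so unhashable values such as dicts can be compared.
    return [
        col for col in df.columns
        if df[col].astype(str).nunique(dropna=False) <= 1
    ]

print(constant_columns(sample))
```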

In [371]:
# the other_services column contains other social media contacts for the member
full_df_members.other_services.value_counts()

{}                                                                                                                                                                                                                                                                                               13778
{'twitter': {'identifier': 'http://'}}                                                                                                                                                                                                                                                               3
{'twitter': {'identifier': '@redvioletdar'}}                                                                                                                                                                                                                                                         1
{'twitter': {'identifier': '@HarlemFund'}, 'linkedin': {'identifier': 'http://www.linkedin.com/in/thomas-lopez-pier

In [380]:
# create a column counting each member's connected social media accounts
full_df_members['num_sm_accounts'] = full_df_members.other_services.apply(len)
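
Calling `len` on every entry assumes each value in `other_services` is a dict. After a left merge, rows without an API match would hold NaN instead, and `len` would raise a `TypeError`. A more defensive variant, sketched on hypothetical data with an illustrative `count_services` helper, could look like this:

```python
import pandas as pd

# Hypothetical sample: empty dict, one service, two services, and a missing value.
sample = pd.DataFrame({
    "other_services": [
        {},
        {"twitter": {"identifier": "@example"}},
        {
            "twitter": {"identifier": "@example"},
            "linkedin": {"identifier": "http://www.linkedin.com/in/example"},
        },
        float("nan"),
    ]
})

def count_services(value):
    """Count connected services, treating anything that is not a dict as zero."""
    return len(value) if isinstance(value, dict) else 0

sample["num_sm_accounts"] = sample["other_services"].apply(count_services)
print(sample["num_sm_accounts"].tolist())
```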

In [381]:
full_df_members.head()

Unnamed: 0,groups,interests,member_url,num_groups,num_interests,bio,city,country,hometown,id,...,lon,name,other_services,photo,self,state,status,topics,visited,num_sm_accounts
0,"[Closing Deals in 6 Inch Heels NYC, Entreprene...","[Professional Development, Professional Women,...",http://www.meetup.com/members/57678912,7,4,,Secaucus,us,secaucus,57678912,...,-74.06,Dee,{},{'highres_link': 'https://secure.meetupstatic....,{'common': {}},NJ,active,"[{'urlkey': 'business-referral-networking', 'n...",1466428000000.0,0
1,"[Ann Arbor Web Accessibility, Data Driven NYC ...","[Adventure, Language & Culture, Nightlife, Bac...",http://www.meetup.com/members/230923603,8,23,,New York,us,"St. Gallen, Switzerland",230923603,...,-73.98,Alistair Barrell,{},{'highres_link': 'https://secure.meetupstatic....,{'common': {}},NY,active,"[{'urlkey': 'foodie', 'name': 'Foodie', 'id': ...",1554763000000.0,0
2,"[ArtForward, Central Park Sketching & Art Meet...","[Theater, Performing Arts, Walking, Writing, A...",http://www.meetup.com/members/24427602,3,9,,New York,us,,24427602,...,-74.0,Beth Barber,{},{'highres_link': 'https://secure.meetupstatic....,{'common': {}},NY,active,"[{'urlkey': 'visual-studio', 'name': 'Visual S...",1447760000000.0,0
3,"[#Resist: Danbury, Adult Day Camp, Black Nonbe...","[Museum, Cooking Dinner Parties, Wine, Healthy...",http://www.meetup.com/members/75979532,12,51,,New Haven,us,New Haven,75979532,...,-72.97,Kathy,{},{'highres_link': 'https://secure.meetupstatic....,{'common': {}},CT,active,"[{'urlkey': 'coffee', 'name': 'Coffee', 'id': ...",1514860000000.0,0
4,['NYC- Small Business and Entrepreneurs Networ...,"[Hip Hop, Wine, Business Strategy, Dining Out,...",http://www.meetup.com/members/279891863,9,14,,West Hempstead,us,,279891863,...,-73.65,Karen White Kelly,{},{'highres_link': 'https://secure.meetupstatic....,{'common': {}},NY,active,"[{'urlkey': 'hiphop', 'name': 'Hip Hop', 'id':...",1556335000000.0,0


In [382]:
# drop the other_services column
full_df_members.drop(columns = ['other_services'], inplace =True)

In [396]:
# percentage of missing values in each column
(full_df_members.isna().sum()/len(full_df_members))*100

groups             0.0
interests          0.0
member_url         0.0
num_groups         0.0
num_interests      0.0
bio                0.0
city               0.0
country            0.0
hometown           0.0
id                 0.0
joined             0.0
lat                0.0
lon                0.0
name               0.0
self               0.0
state              0.0
status             0.0
topics             0.0
visited            0.0
num_sm_accounts    0.0
has_photo          0.0
dtype: float64
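
An equivalent way to get the same percentages is to take the mean of the boolean missing-value mask, which also makes it easy to sort the columns from most to least incomplete. The frame below is a hypothetical sample, not the members data.

```python
import pandas as pd

# Hypothetical sample with gaps in two columns.
sample = pd.DataFrame({
    "bio": [None, "Runner", None, None],
    "state": ["NY", None, "NJ", "CT"],
    "city": ["New York", "Bronx", "Secaucus", "New Haven"],
})

# The mean of a boolean mask is the fraction of True values, i.e. the share missing.
missing_pct = sample.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```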

In [395]:
# fill in the state, bio, and hometown NaN values with the string 'None'
fill_cols = ['state', 'bio', 'hometown']
full_df_members[fill_cols] = full_df_members[fill_cols].fillna('None')

In [390]:
# create a new column indicating whether the member has a photo (1) or not (0), replacing the 'photo' column
# notna() treats both None and NaN as missing
full_df_members['has_photo'] = full_df_members.photo.notna().astype(int)
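
How the photo flag behaves depends on how missing photos are stored. A comparison against `None` catches `None` but not NaN, which is what pandas fills in for rows with no match in a left merge. The sketch below contrasts the two checks on a hypothetical series containing both kinds of missing value.

```python
import pandas as pd

# Hypothetical photo values: a real photo dict, None, and NaN.
photos = pd.Series(
    [{"highres_link": "https://example.com/a.jpeg"}, None, float("nan")],
    dtype=object,
)

# Comparing against None flags only the None entry as missing.
by_none = photos.apply(lambda x: 0 if x is None else 1)

# notna() treats both None and NaN as missing.
by_notna = photos.notna().astype(int)

print(by_none.tolist(), by_notna.tolist())
```

The two approaches agree only when every missing photo is stored as `None`.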

In [392]:
# drop the 'photo' column
full_df_members.drop(columns = ['photo'], inplace = True)

In [393]:
full_df_members.head()

Unnamed: 0,groups,interests,member_url,num_groups,num_interests,bio,city,country,hometown,id,...,lat,lon,name,self,state,status,topics,visited,num_sm_accounts,has_photo
0,"[Closing Deals in 6 Inch Heels NYC, Entreprene...","[Professional Development, Professional Women,...",http://www.meetup.com/members/57678912,7,4,,Secaucus,us,secaucus,57678912,...,40.79,-74.06,Dee,{'common': {}},NJ,active,"[{'urlkey': 'business-referral-networking', 'n...",1466428000000.0,0,1
1,"[Ann Arbor Web Accessibility, Data Driven NYC ...","[Adventure, Language & Culture, Nightlife, Bac...",http://www.meetup.com/members/230923603,8,23,,New York,us,"St. Gallen, Switzerland",230923603,...,40.72,-73.98,Alistair Barrell,{'common': {}},NY,active,"[{'urlkey': 'foodie', 'name': 'Foodie', 'id': ...",1554763000000.0,0,1
2,"[ArtForward, Central Park Sketching & Art Meet...","[Theater, Performing Arts, Walking, Writing, A...",http://www.meetup.com/members/24427602,3,9,,New York,us,,24427602,...,40.72,-74.0,Beth Barber,{'common': {}},NY,active,"[{'urlkey': 'visual-studio', 'name': 'Visual S...",1447760000000.0,0,1
3,"[#Resist: Danbury, Adult Day Camp, Black Nonbe...","[Museum, Cooking Dinner Parties, Wine, Healthy...",http://www.meetup.com/members/75979532,12,51,,New Haven,us,New Haven,75979532,...,41.33,-72.97,Kathy,{'common': {}},CT,active,"[{'urlkey': 'coffee', 'name': 'Coffee', 'id': ...",1514860000000.0,0,1
4,['NYC- Small Business and Entrepreneurs Networ...,"[Hip Hop, Wine, Business Strategy, Dining Out,...",http://www.meetup.com/members/279891863,9,14,,West Hempstead,us,,279891863,...,40.69,-73.65,Karen White Kelly,{'common': {}},NY,active,"[{'urlkey': 'hiphop', 'name': 'Hip Hop', 'id':...",1556335000000.0,0,1


In [405]:
# the 'topics' column can be dropped, as it contains the same info as 'interests'
full_df_members.drop(columns= ['topics'], inplace = True)

In [406]:
full_df_members.head()

Unnamed: 0,groups,interests,member_url,num_groups,num_interests,bio,city,country,hometown,id,joined,lat,lon,name,self,state,status,visited,num_sm_accounts,has_photo
0,"[Closing Deals in 6 Inch Heels NYC, Entreprene...","[Professional Development, Professional Women,...",http://www.meetup.com/members/57678912,7,4,,Secaucus,us,secaucus,57678912,1459463000000.0,40.79,-74.06,Dee,{'common': {}},NJ,active,1466428000000.0,0,1
1,"[Ann Arbor Web Accessibility, Data Driven NYC ...","[Adventure, Language & Culture, Nightlife, Bac...",http://www.meetup.com/members/230923603,8,23,,New York,us,"St. Gallen, Switzerland",230923603,1537537000000.0,40.72,-73.98,Alistair Barrell,{'common': {}},NY,active,1554763000000.0,0,1
2,"[ArtForward, Central Park Sketching & Art Meet...","[Theater, Performing Arts, Walking, Writing, A...",http://www.meetup.com/members/24427602,3,9,,New York,us,,24427602,1436840000000.0,40.72,-74.0,Beth Barber,{'common': {}},NY,active,1447760000000.0,0,1
3,"[#Resist: Danbury, Adult Day Camp, Black Nonbe...","[Museum, Cooking Dinner Parties, Wine, Healthy...",http://www.meetup.com/members/75979532,12,51,,New Haven,us,New Haven,75979532,1468890000000.0,41.33,-72.97,Kathy,{'common': {}},CT,active,1514860000000.0,0,1
4,['NYC- Small Business and Entrepreneurs Networ...,"[Hip Hop, Wine, Business Strategy, Dining Out,...",http://www.meetup.com/members/279891863,9,14,,West Hempstead,us,,279891863,1556335000000.0,40.69,-73.65,Karen White Kelly,{'common': {}},NY,active,1556335000000.0,0,1


In [407]:
# save the cleaned dataframe
full_df_members.to_pickle("full_df_members.pickle")
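
Pickle preserves Python objects such as the lists in `groups` and `interests`, which a CSV export would flatten to strings. A quick round-trip check, sketched here on a hypothetical one-row frame written to a temporary directory, is one way to confirm the saved file loads back as expected:

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row sample with a list-valued column.
sample = pd.DataFrame({
    "member_url": ["http://www.meetup.com/members/1"],
    "groups": [["Group A", "Group B"]],
    "num_groups": [2],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "members_sample.pickle")
    sample.to_pickle(path)
    restored = pd.read_pickle(path)

# The list column comes back as real lists, not strings.
restored_groups = restored.loc[0, "groups"]
print(list(restored.columns), type(restored_groups).__name__)
```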