In [1]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [2]:
events = pd.read_csv("./filestore/events/fact_events.csv").drop('Unnamed: 0', axis=1)

In [3]:
events.head()

Unnamed: 0,description,duration,headcount,event_id,maybe_rsvp_count,name,rating,rsvp_limit,status,time,updated,utc_offset,venue_id,visibility,waitlist_count,yes_rsvp_count
0,These meetups are very informal. I won't be st...,9000000.0,12,147478282,0,PyLadies Dublin Inaugural meetup - bring laptop!,,,past,1384799400000,1384853013000,0,16176442,public,0,22
1,"Our second meetup will be at Engine Yard, a bi...",,0,152107272,0,Second PyLadies Dublin Meetup - Let's get coding!,,,past,1387218600000,1387230236000,0,13054852,public,0,12
2,Happy New Year! Hope you all had a good Christ...,10800000.0,0,159368332,0,Our first PyLadies Dublin meetup of 2014,,,past,1390240800000,1390470097000,0,17757332,public,0,11
3,Bring your laptops along. If you want some foo...,10800000.0,0,162851382,0,PyLadies Dublin Feb meetup,,,past,1392660000000,1392672314000,0,18096492,public,0,9
4,!!!CHANGE OF VENUE UPDATE!!! &gt;&gt; More inf...,10800000.0,0,166955082,0,PyLadies Dublin Meetup,,,past,1395165600000,1395219566000,0,18950322,public,0,11


In [4]:
events.dtypes

description          object
duration            float64
headcount             int64
event_id             object
maybe_rsvp_count      int64
name                 object
rating              float64
rsvp_limit          float64
status               object
time                  int64
updated               int64
utc_offset            int64
venue_id              int64
visibility           object
waitlist_count        int64
yes_rsvp_count        int64
dtype: object

# Let's look at the event descriptions

In [5]:
desc = events['description'].tolist()

In [179]:
desc[15]

'We are delighted to announce that Metricfire will be hosting our Feb meetup. (Recap: January notes :\xa0https://pyladiesdublin.hackpad.com/PyLadies-Dublin-Jan-2015-Notes-ozFxnmsIWqs) Feel free to update Feb notes to let us know what you are working on:\xa0https://pyladiesdublin.hackpad.com/PyLadies-Dublin-Feb-2015-Notes-dmz4ESRJJsg Join our\xa0mailing list\xa0for discussions and sharing all things Python-y as well as Python-related events.\xa0 This event is\xa0FREE\xa0and is suitable for\xa0ALL\xa0levels. Questions? Ping Vicky at [masked] ABOUT METRICFIRE  Metricfire run a service called Hosted Graphite. Hosted Graphite is a hosted version of the popular Graphite open source\xa0metric and monitoring software, and we have customers all over the\xa0world. We provide a metrics platform to developers to allow them tomeasure their applications and servers. Companies send large amounts of\xa0metric data to us (125,000 data points per second or roughly 10 billion\xa0per day), which we store,

## Cleaning up the description

### Unicode, URLs, smiley faces

In [174]:
special_dict = {
    'smile' : r'[:;=]-[)D]?',
    'uni' : r'\xa0',
    'dupe_space' : r'\s{2,}|\s\Z',
    'url' : r'(?:https?|ftp|file)://\S+',
    'uls_chars' :  r'(?:&[gla][tm]p?)+'
}

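def remove_special(s):
    # Non-breaking spaces become regular spaces; every other match is deleted.
    # Note: 'dupe_space' deletes whitespace runs outright, so words separated by
    # two or more whitespace characters end up joined.
    for k, regex in special_dict.items():
        if k == 'uni':
            s = re.sub(regex, ' ', s)
        else:
            s = re.sub(regex, '', s)
    return s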

clean_special = [remove_special(s).lower() for s in desc]
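
The `dupe_space` rule deletes whitespace runs outright, which is why joined words such as `providedwe` show up in the cleaned text further down. A minimal alternative sketch, assuming the goal is to keep word boundaries, collapses each run to a single space instead. The helper name `collapse_whitespace` and the sample string are hypothetical and not part of the notebook.

```python
import re

def collapse_whitespace(s):
    """Replace non-breaking spaces and squeeze whitespace runs to one space."""
    s = s.replace('\xa0', ' ')            # non-breaking space -> regular space
    return re.sub(r'\s+', ' ', s).strip()

# Hypothetical sample text, not taken from the events data
sample = "food will be provided.\n\nWe have\xa0two  speakers "
print(collapse_whitespace(sample))
```

Swapping this in would change the downstream token counts, so the recorded outputs in this notebook would no longer match.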

### Punctuation and emojis
Some emoji characters and unwanted punctuation remain.

In [183]:
def find_unwanted_chars(s):
#     pattern = r"[^a-zA-Z0-9\s.\-/':!?&@€$_+Éáéóć%]"
    pattern = r"[^a-zA-Z0-9\s/@€$_+Éáéóć%]"
    return set(re.findall(pattern, s))

unwanted = set(char for e in desc for char in find_unwanted_chars(e))
clean_punct = []

for sent in clean_special:
    for punct in unwanted:
        sent = sent.replace(punct, "")
    clean_punct.append(sent)
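
The nested loop above calls `str.replace` once per unwanted character for every description. As a sketch of an equivalent single-pass approach, `str.translate` with a deletion table removes the same characters. The sample string and the `unwanted_sample` name are made up for illustration.

```python
import re

def find_unwanted_chars(s):
    # Same whitelist pattern as in the cell above
    pattern = r"[^a-zA-Z0-9\s/@€$_+Éáéóć%]"
    return set(re.findall(pattern, s))

# Hypothetical sample, not from the events data
sample = "Hello, world! \U0001F642 (see you @ 6pm)"
unwanted_sample = find_unwanted_chars(sample)

# The third argument of str.maketrans lists the characters to delete
delete_table = str.maketrans('', '', ''.join(unwanted_sample))
cleaned = sample.translate(delete_table)
print(sorted(unwanted_sample), repr(cleaned))
```

Deleting punctuation without inserting a space leaves the surrounding whitespace untouched, so gaps such as the double space after `world` remain.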

In [186]:
clean_punct[47]

'workday will be hosting us for our july meetup food and refreshments will also be providedwe have two speakers from workday  amanda galligan  principal network engineer will do a talk on ansible a network engineers best friend  alan kennedy  principal software development engineer infra services will also do a talk on writing network services using python coroutines naomi oreilly  qa engineer grid cloud master will be conducting a talk on bdd in python  an introduction to behaviour driven development in python with a focus on automated acceptance testing remember to bring your laptop you will have a chance to deep dive with speakers pair programme on a tutorial a project could even be your own ask a question dont be shy we are here to helpif you have announcements events projects questions feel free to add them to  rough running order 1830 guests arrive  food  beverages1900 welcome  announcements by vicky1905 quick word from workday representative1910 lightning talk 11925 lightning ta

# Removing words that have digits

After tokenizing, I found tokens that contain digits (perhaps meetup start/end times), so they are removed here.

In [190]:
def remove_digits(s):
    pattern = re.compile(r'\b(?:\d+\S+|\S+\d+)\b')
    return re.sub(pattern, '', s)

clean_digits = [remove_digits(s) for s in clean_punct]
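
To make the regex's behaviour concrete, here is a small check on a made-up string (an assumption for illustration, loosely modelled on the running-order text). Tokens that start or end with digits are removed, including purely numeric tokens of two or more digits. A lone single digit is left alone because each alternative needs at least two characters.

```python
import re

def remove_digits(s):
    # Same pattern as above: tokens that start or end with digits
    pattern = re.compile(r'\b(?:\d+\S+|\S+\d+)\b')
    return pattern.sub('', s)

# Hypothetical sample strings
sample = "guests arrive 1830 food beverages1900 welcome"
print(repr(remove_digits(sample)))      # removed tokens leave double spaces behind
print(repr(remove_digits("5 talks")))   # single digit does not match
```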

In [192]:
clean_digits[47]

'workday will be hosting us for our july meetup food and refreshments will also be providedwe have two speakers from workday  amanda galligan  principal network engineer will do a talk on ansible a network engineers best friend  alan kennedy  principal software development engineer infra services will also do a talk on writing network services using python coroutines naomi oreilly  qa engineer grid cloud master will be conducting a talk on bdd in python  an introduction to behaviour driven development in python with a focus on automated acceptance testing remember to bring your laptop you will have a chance to deep dive with speakers pair programme on a tutorial a project could even be your own ask a question dont be shy we are here to helpif you have announcements events projects questions feel free to add them to  rough running order  guests arrive  food   welcome  announcements by  quick word from workday  lightning talk  lightning talk  lightning talk   deep dive with speakers self

## Tokenizing and removing stop words

In [227]:
event_text = clean_digits

In [305]:
en_stopwords = stopwords.words('english')

In [306]:
extra_stopwords = [
    'python', 'speaker', 'speakers', 'dublin', 'ireland', 'pyladies',
    'talk', 'talks', 'irish', 'james', 'julie', 'leticia', 'charlie',
    'michael', 'marjai', 'atmasked', 'masked', 'isabella', 'annie',
    'lowney', 'daire', 'amaral', 'carlos', 'campbell', 'chris', 
    'docherty', 'louise', 'deepali', 'andrea', 'diarmuid', 'sorcha',
    'jonathan', 'eamon', 'shane', 'stella', 'mclennan', 'ingrid',
    'aimi', 'niamh', 'forgan', 'jans', 'sabine', 'vicky', 'ariane',
    'kats', 'bourke', 'georges'
]
en_stopwords = en_stopwords + extra_stopwords

The vectorizer splits each document into tokens and weights them.

X is a document-term matrix: each row is a document and each column is a word. The value in each cell is that word's TF-IDF weight in that document.

In [307]:
tfidf_vectorizer = TfidfVectorizer(stop_words=en_stopwords)  # recent scikit-learn expects a list here, not a set
X_tfidf = tfidf_vectorizer.fit_transform(event_text)
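
As a rough illustration of what the weights look like, the sketch below computes TF-IDF by hand on a made-up three-document corpus, using the smoothed IDF and L2 row normalisation that scikit-learn's `TfidfVectorizer` applies by default. The corpus and variable names are assumptions for illustration. The result has one row per document and one column per vocabulary term, which is the orientation scikit-learn returns.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the events data
docs = ["data workshop for beginners", "data talk on kubernetes", "quiz night"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(t for doc in tokenized for t in set(doc))
# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

rows = []
for doc in tokenized:
    tf = Counter(doc)
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    rows.append([v / norm for v in raw])   # L2-normalise each document row

print(len(rows), "documents x", len(vocab), "terms")
```

A term shared by several documents (`data` here) gets a lower weight than a term unique to one document, which is the property the clustering and topic models rely on.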

The matrix produced above is used as input to LDA or NMF to build the model.

# Exploring K-means Clustering

In [308]:
# from sklearn.cluster import KMeans
# from sklearn import metrics
# from scipy.spatial.distance import cdist 

Since we don't know how many topics there might be, let's use the silhouette score to decide how many clusters to use.
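
For intuition, the silhouette coefficient of a point compares its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), as (b - a) / max(a, b). The sketch below computes the mean score by hand for two hypothetical labelings of made-up 1-D points. In the notebook itself, `metrics.silhouette_score` does this on the TF-IDF matrix.

```python
def mean_silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points with given cluster labels."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)   # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(p - q) for q in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical data: two well-separated groups of three points
points = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4]
score_k2 = mean_silhouette(points, [0, 0, 0, 1, 1, 1])
score_k3 = mean_silhouette(points, [0, 0, 1, 2, 2, 2])
print(round(score_k2, 3), round(score_k3, 3))
```

The labeling that matches the natural grouping scores higher, so the k with the highest mean silhouette is the usual candidate.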

In [309]:
# clusters = range(2,30)
# distortions = []
# silhouette_coeffs = []

# for k in clusters:
#     km = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10,verbose=0)
#     km.fit(X_tfidf)
    
#     distortions.append(km.inertia_)
#     silhouette_coeffs.append(metrics.silhouette_score(X_tfidf, km.labels_))

In [310]:
# plt.style.use('seaborn-darkgrid')
# fig, ax = plt.subplots(figsize=(8,6))

# ax.plot(clusters, distortions, marker='o', color='b')
# plt.show()

In [311]:
# plt.style.use('seaborn-darkgrid')
# fig, ax = plt.subplots(figsize=(8,6))

# ax.bar(x=clusters, height=silhouette_coeffs, color='g')
# plt.show()

It seems the preprocessing was not enough and k-means is too sensitive to the noise left in the data. It could be worthwhile to extract the event descriptions manually, as there are only 70 or so events.

Below is a sample clustering for k=15; it doesn't look good.

In [312]:
# order_centroids = km.cluster_centers_.argsort()[:, ::-1]
# terms = tfidf_vectorizer.get_feature_names()

# for i in range(15):
#     print("Cluster %d:" % i, end='')
#     for ind in order_centroids[i, :10]:
#         print(' %s' % terms[ind], end='')
#     print()

# Exploring LDA

In [313]:
count_vectorizer = CountVectorizer(stop_words=en_stopwords)  # recent scikit-learn expects a list here, not a set
X_count = count_vectorizer.fit_transform(event_text)

In [314]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
count_df = pd.DataFrame(X_count.toarray(), columns=count_vectorizer.get_feature_names_out())

In [315]:
# Total occurrences of each word across all event descriptions
agg_counts = pd.DataFrame({
    'word' : count_vectorizer.get_feature_names_out(),
    'count' : count_df.sum(axis=0).to_numpy()
}).reset_index(drop=True)

In [316]:
feats = count_vectorizer.get_feature_names_out()
digits = re.compile(r'^\d+')
n_digit_feats = sum(1 for f in feats if digits.match(f))

print(f"There are {len(feats)} feature names.")
print(f"Of which {n_digit_feats} start with digits and might be noisy values")

There are 2345 feature names.
Of which 0 start with digits and might be noisy values


In [317]:
# agg_counts.sort_values('count', ascending=False).iloc[:50,:]

Judging by the counts above, the text quality is poor. I will apply some manual cleaning to the descriptions to get a more accurate representation of the events.

In [318]:
import warnings
warnings.simplefilter("ignore", DeprecationWarning)

# Load the LDA model from scikit-learn
from sklearn.decomposition import LatentDirichletAllocation as LDA

In [319]:
# Helper function: print the highest-weighted words of each topic
def print_topics(model, vectorizer, n_top_words):
    words = vectorizer.get_feature_names_out()
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        # argsort is ascending, so the reversed slice yields the top words first
        print(" ".join([words[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))
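
The slice `topic.argsort()[:-n_top_words - 1:-1]` is terse: `argsort` orders indices by ascending weight, and the reversed slice takes the last `n_top_words` of them, highest first. A plain-Python sketch of the same selection, on made-up weights and words (both hypothetical):

```python
# Hypothetical topic weights and vocabulary, for illustration only
weights = [0.1, 2.5, 0.7, 1.9, 0.3]
words = ["tea", "data", "quiz", "workshop", "night"]
n_top_words = 3

# Indices sorted by ascending weight, like numpy's argsort
ascending = sorted(range(len(weights)), key=lambda i: weights[i])
# Reversed slice: walk backwards from the end, stopping after n_top_words items
top_indices = ascending[:-n_top_words - 1:-1]
top_words = " ".join(words[i] for i in top_indices)
print(top_words)
```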

First, an attempt with just the CountVectorizer output.

In [320]:
# Tweak the parameters below
number_topics = 15
number_words = 10
jobs = -1
max_iter = 25

# None makes scikit-learn fall back to its default prior of 1 / n_components
alpha = None
eta = None

In [321]:
# Create and fit the LDA model
lda = LDA(
    doc_topic_prior = alpha,
    topic_word_prior = eta,
    n_components=number_topics,
    n_jobs = jobs,
    max_iter=max_iter)
lda.fit(X_count)

# Print the topics found by the LDA model
print("Topics found via LDA with CountVectorizer:")
print_topics(lda, count_vectorizer, number_words)

Topics found via LDA with CountVectorizer:

Topic #0:
intercom products work free bring communication questions also details customers

Topic #1:
free notes event levels thanks suitable bring well working pythony

Topic #2:
udemy please meetup work build kubernetes groupon learning working short

Topic #3:
ai bring work projects meetup please free food laptop aol

Topic #4:
women new code people event tea thanks along night drinks

Topic #5:
workshop event get data beginners women mentors session tea coffee

Topic #6:
dbs business rte career us want one learn april right

Topic #7:
please details free meetups questions food call projects submit speaking

Topic #8:
lala prizes please need dont folks people know make find

Topic #9:
power round quiz patch facebook people team folks tech shout

Topic #10:
kx kdb frances bring bank systems markets testing major derivatives

Topic #11:
using data software us questions bring salon provided max laptop

Topic #12:
open tech data free evening e

In [322]:
# Create and fit the LDA model again, this time on the TF-IDF matrix
lda = LDA(
    doc_topic_prior = alpha,
    topic_word_prior = eta,
    n_components=number_topics,
    n_jobs = jobs,
    max_iter=max_iter)
lda.fit(X_tfidf)

# Print the topics found by the LDA model
print("Topics found via LDA with tfidf vectorizer:")
print_topics(lda, tfidf_vectorizer, number_words)

Topics found via LDA with tfidf vectorizer:

Topic #0:
etsy pyladiesdub analytics perform defining gaining describe pythonshort seek donnelly

Topic #1:
graphite metric per source popular called creation large sept craftnight

Topic #2:
engineers rte dit kx beginners mentors jupyter public irelands one

Topic #3:
contribute django optional missed far beginning goal creating two github

Topic #4:
tech makers via tool giving minutes image circuit pepper meet

Topic #5:
prizes qualtrics year apply anything facebook patch change cryptoparty street

Topic #6:
pytorch bank mansura innovation opened grand canal boigrandcanalsq wraps sandwiches

Topic #7:
lia patterns belowmore fill form graphite nemeth multi paradigmatic heres

Topic #8:
traffic networking deininger maker among analyse wireshark wire shark detailsnadja

Topic #9:
gallery week django foundation barge collab chq pad continuing providedplease

Topic #10:
ics kdb women find ict round max salon quiz organising

Topic #11:
free bri

To do:
* Review the LDA model to understand how to fine-tune alpha and eta
* Visualize the results to see if they make sense
* NMF (?)