## Part 1: Get the model
- __Features (categorical): every unique word across all group descriptions, e.g. ['this']['is']['fight']['club']...__
- __Labels: Group categories__

#### Goal: Train my model to classify a group based on the description of the group.

In [208]:
import json
import re
from urllib.request import urlopen 
import io
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.naive_bayes import MultinomialNB
from pyspark import SparkContext
from sklearn.ensemble import AdaBoostClassifier
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split

In [142]:
def cleaner(descrips):
    
    'clean the description and return a list of strings'
    
    # get rid of str between < >
    delirm = re.sub('<[^>]+>', '', str(descrips))
    # get rid of symbols
    l = re.findall(r"[\w']+", str(delirm))
    return list(filter(lambda a: len(a) > 2, l))

def cleaner2(descrips):
    # remove stop words
    stops = set(stopwords.words("english"))
    return [word for word in descrips if word not in stops]
        


def combiner(words):
    
    'combine list of string back to a sentence'
    
    return ' '.join(word for word in words)


def indexer_vocabs(list_of_event):
    
    'return a list of cleaned sentence and a list of vocabulary'
    
    myindex = []
    allwords = []
    for line in list_of_event:
        list_of_words = cleaner2(cleaner(line))
        allwords += list_of_words
        element = combiner(list_of_words)
        myindex.append(element)
    return myindex, list(set(allwords))


def getDF(ind, col):
    return pd.DataFrame(index=ind, columns=col)


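def converToArray(df, col):

    'fill each cell with how often the word occurs in the row sentence (substring count)'

    for i in range(df.shape[0]):
        for word in col:
            # .ix was removed in pandas 1.0, so index by position instead
            df.iloc[i, df.columns.get_loc(str(word))] = df.index[i].count(str(word))
    return np.array(df)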


def getshortname(ss):
    
    if ss is None:
        return 'None'
    elif ss.asDict().get('category') is None:
        return 'None'
    elif ss.asDict().get('category').asDict().get('shortname') is None:
        return 'None'
    return ss.asDict().get('category').asDict().get('shortname')


def getcountry(ss):
    
    if ss is None:
        return 'None'
    elif ss.asDict().get('country') is None:
        return 'None'
    return ss.asDict().get('country')
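
To see what the cleaning helpers do, here is a small self-contained sketch that applies the same two regular expressions as `cleaner` to one short description. The `strip_and_tokenize` name is hypothetical and the stop-word step from `cleaner2` is left out so the sketch does not need the NLTK corpus.

```python
import re

def strip_and_tokenize(text):
    """Same steps as cleaner(): drop HTML tags, keep word tokens longer than 2 chars."""
    no_tags = re.sub(r'<[^>]+>', '', str(text))
    tokens = re.findall(r"[\w']+", no_tags)
    return [t for t in tokens if len(t) > 2]

# Short sample in the style of the raw descriptions
sample = "<p>No big field game this weekend !&nbsp;</p>"
tokens = strip_and_tokenize(sample)
print(tokens)
```

The HTML entity `&nbsp;` survives as the token `nbsp`, which is why stray `nbsp` words show up in the cleaned sentences. Calling `html.unescape` on the text before tokenizing would be one way to drop it.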


In [41]:
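from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
# `spark` is used in the next cell but is never defined in this notebook,
# so create the session explicitly
spark = SparkSession.builder.getOrCreate()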

In [42]:
txt = spark.read.json('/Users/DL/Desktop/DL-Projects-master/meetup group/2017/*/*/*/*')

In [244]:
txt.printSchema()

root
 |-- description: string (nullable = true)
 |-- duration: long (nullable = true)
 |-- event_url: string (nullable = true)
 |-- fee: struct (nullable = true)
 |    |-- amount: double (nullable = true)
 |    |-- currency: string (nullable = true)
 |    |-- description: string (nullable = true)
 |-- group: struct (nullable = true)
 |    |-- category: struct (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- shortname: string (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- group_lat: double (nullable = true)
 |    |-- group_lon: double (nullable = true)
 |    |-- group_photo: struct (nullable = true)
 |    |    |-- highres_link: string (nullable = true)
 |    |    |-- photo_id: long (nullable = true)
 |    |    |-- photo_link: string (nullable = true)
 |    |    |-- thumb_link: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- joi

In [67]:
df = txt.toPandas()

In [243]:
# raw data sample
df['group'][900]

Row(category=Row(id=34, name='tech', shortname='tech'), city='Chicago', country='us', group_lat=41.89, group_lon=-87.64, group_photo=None, id=23065697, join_mode='open', name='Chicago Deep Learning PB-Scale AI Big Data Cloud Boot Camp', state='IL', urlname='Chicago-Deep-Learning-PBScale-AI-Big-Data-Cloud-IoT-BootCamp')

In [151]:
descriptions = list(df['description'])
gplist = list(map(getshortname, df['group']))
countrylist = list(map(getcountry, df['group']))

In [161]:
new_table = pd.DataFrame(
    {'descriptions': descriptions,
     'group': gplist,
     'country': countrylist
    })

In [162]:
# only doing events in US, so descriptions will only be in English
new_table = new_table[new_table.country == 'us']

In [173]:
print(new_table.shape)
new_table.head(3)

(13500, 3)


| | country | descriptions | group |
|---|---|---|---|
| 31 | us | `<p><a href="https://www.eventbrite.com/e/kdb-h...` | tech |
| 40 | us | | food-drink |
| 45 | us | `<p>No big field game this weekend !&nbsp;</p> ...` | sports-recreation |


In [174]:
des_list = list(new_table['descriptions'])
len(des_list)

13500

__I only need the descriptions to build my feature matrix. I noticed that some descriptions contain a person's name, which could be very helpful for prediction.__

In [175]:
ind, vocabs = indexer_vocabs(des_list)
print(len(ind), len(vocabs))

13500 57159


In [181]:
ind[100]

"Last Meetup worked system diagram spilt ownership clearly see saw progress understanding audio input PS4 mic array able see great rendering drone simulation PX4 Gazebo React etc you're engineer enthusiast interested helping build program curious welcome nbsp Planning small beer must yrs age option Bring laptops spare drone parts you'd like see given new life feel free email anytime add active Slack group masked nbsp please clearly tell want join nonoisedrone's slack channel"

In [210]:
y = np.asarray(new_table['group'])

In [187]:
cv = CountVectorizer(vocabulary=vocabs)

In [200]:
# get the frequency table
X=cv.fit_transform(ind).toarray()
x_update = normalize(X)
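
One thing that may be worth checking here: `CountVectorizer` lowercases text by default, while `vocabs` keeps the original casing (the cleaned sample above contains words such as "Meetup" and "PS4"). The sketch below uses a made-up document and a two-word vocabulary to show how a mixed-case vocabulary entry behaves with and without `lowercase=False`. It is an illustration under those assumptions, not a rerun of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Meetup drone drone"]   # made-up document
vocab = ["Meetup", "drone"]     # mixed-case vocabulary, like `vocabs`

# Default settings lowercase the text, so the entry "Meetup" never matches
default_counts = CountVectorizer(vocabulary=vocab).fit_transform(docs).toarray()

# With lowercase=False the tokens are compared exactly as written
cased_counts = CountVectorizer(vocabulary=vocab, lowercase=False).fit_transform(docs).toarray()

print(default_counts)
print(cased_counts)
```

If the same effect applies to the real data, every capitalized entry in `vocabs` is an all-zero column in `X`. Lowercasing the words during cleaning, or passing `lowercase=False`, would keep those features.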



In [227]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=30)

In [228]:
clf = MultinomialNB()
clf.fit(x_train,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [229]:
result = clf.predict(x_test)
result

array(['dancing', 'arts-culture', 'new-age-spirituality', ..., 'games',
       'outdoors-adventure', 'tech'], 
      dtype='<U21')

In [240]:
# accuracy without normalization; with normalization the score is even lower
prediction = sum(result == y_test)/float(len(y_test))
prediction

0.56000000000000005

In [231]:
ada = AdaBoostClassifier(base_estimator=MultinomialNB(),
                        n_estimators =30)

ada.fit(x_train,y_train)

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
          learning_rate=1.0, n_estimators=30, random_state=None)

In [233]:
ada.score(x_test, y_test)

0.04925925925925926

__The score is extremely bad. Checking whether y_test contains any categories that are not in y_train.__

In [235]:
from collections import Counter

In [239]:
set(y_test) == set(y_train)

True
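
Matching label sets do not rule out class imbalance, so a natural follow-up is to compare both scores against a majority-class baseline. The sketch below uses made-up labels in place of `y_train` and `y_test`, and `majority_baseline` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / len(test_labels)

# Made-up labels standing in for y_train / y_test
train = ['tech', 'tech', 'tech', 'games', 'dancing']
test = ['tech', 'games', 'tech', 'dancing']

label, acc = majority_baseline(train, test)
print(label, acc)
```

A classifier that scores below this baseline is doing worse than guessing the largest category, which helps put a very low score in context.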

## Part 2: Takeaway

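__Since all of my features are word counts, Naive Bayes is a natural classifier for this case. It is a little different from the NB practicum, though: my labels have multiple categories, so I used Multinomial Naive Bayes. (Strictly speaking, "multinomial" describes how the word counts are modeled within each class; it fits count features and works with any number of classes.)__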

__Looking at my data, the description of the same group changes constantly, which introduces inconsistency that can confuse my model. Second, many descriptions are completely unrelated to the group category, which is a big issue because my model may not be able to capture that randomness. Increasing the amount of data might help reduce this problem.__

__In the end, the data is very messy, and cleaning took most of my time. Description seems to be a bad predictor in this case after all.__

## Part 3: Pseudocoding

### Naive Bayes

Given a training set X with classes y, where y can take multiple values (e.g. sport, art, tech):

For each class y_i:
1. Calculate prior_i = (number of elements in y_i) / (number of elements in all classes).
2. Find the frequency f_ij of each word x_j within class y_i.
3. For each word x_j in X, calculate p(x_j | y_i) = f_ij / (total word count in y_i).
4. To predict a new observation O, calculate P(y_i | O) ∝ prior_i · ∏ p(x_j | y_i) over the words x_j in O.
5. Label O with the class that has the highest P(y_i | O).
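
The pseudocode above can be sketched in plain Python as follows. This is a minimal illustration, not the scikit-learn implementation: the function names `train_nb` and `predict_nb`, the toy documents, and the add-`alpha` smoothing (absent from the pseudocode, added so unseen class/word pairs do not get zero probability) are all assumptions. Scores are summed in log space to avoid underflow when multiplying many small probabilities.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = {w for doc in docs for w in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior, log_like = {}, {}
    for c, n_c in class_counts.items():
        # prior: share of training documents that belong to class c
        log_prior[c] = math.log(n_c / len(labels))
        # likelihood: smoothed word frequency divided by smoothed class total
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total)
                       for w in vocab}
    return log_prior, log_like

def predict_nb(doc, log_prior, log_like):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        # words never seen during training are skipped
        scores[c] = log_prior[c] + sum(log_like[c][w] for w in doc if w in log_like[c])
    return max(scores, key=scores.get)

# Toy training data (made up for illustration)
docs = [['drone', 'laptop', 'code'], ['code', 'data'],
        ['dance', 'music'], ['music', 'salsa']]
labels = ['tech', 'tech', 'dancing', 'dancing']

log_prior, log_like = train_nb(docs, labels)
pred_a = predict_nb(['code', 'drone'], log_prior, log_like)
pred_b = predict_nb(['salsa', 'music'], log_prior, log_like)
print(pred_a, pred_b)
```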

## Part 4: Annotation

In [242]:
from scipy.sparse import issparse
from sklearn.naive_bayes import BaseDiscreteNB
from sklearn.utils.extmath import safe_sparse_dot
class MultinomialNB(BaseDiscreteNB):
    def __init__(self, alpha=1.0, fit_prior=True, class_prior=None):
        self.alpha = alpha
        self.fit_prior = fit_prior
        self.class_prior = class_prior

    def _count(self, X, Y):
        """Count and smooth feature occurrences."""
        # make sure X does not have negative values
        if np.any((X.data if issparse(X) else X) < 0):
            raise ValueError("Input X must be non-negative")
        # per-class word counts: Y is the binarized label matrix,
        # so Y.T dot X adds up the rows of X that belong to each class
        self.feature_count_ += safe_sparse_dot(Y.T, X)
        # count classes
        self.class_count_ += Y.sum(axis=0)

    def _update_feature_log_prob(self):
        """Apply smoothing to raw counts and recompute log probabilities"""
        
        # add alpha to every count so that no word ends up with zero probability
        smoothed_fc = self.feature_count_ + self.alpha
        # smoothed total word count per class
        smoothed_cc = smoothed_fc.sum(axis=1)
        # log(f_i / y_i) = log(f_i) - log(y_i)
        self.feature_log_prob_ = (np.log(smoothed_fc) -
                                  np.log(smoothed_cc.reshape(-1, 1)))
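
To make the smoothing step concrete, here is a small worked example that repeats the same three lines on a made-up count matrix. The numbers are illustrative only; rows stand for classes and columns for words.

```python
import numpy as np

alpha = 1.0
# Made-up counts: rows are classes, columns are words
feature_count = np.array([[3.0, 0.0, 1.0],
                          [0.0, 2.0, 2.0]])

smoothed_fc = feature_count + alpha       # no word keeps a zero count
smoothed_cc = smoothed_fc.sum(axis=1)     # smoothed total per class
feature_log_prob = np.log(smoothed_fc) - np.log(smoothed_cc.reshape(-1, 1))

# Back in probability space, each row is a distribution over words
probs = np.exp(feature_log_prob)
print(probs)
```

The word that never appeared in the first class ends up with probability 1/7 rather than 0, so a single unseen word cannot wipe out the whole product for that class.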
