# Meetups data - A study case about Meetup's events and events' category predction
--------

## CONTENT:
- [Introduction to the meetup problem](#Introduction-to-the-meetup-problem)
- [Database tables and your features](#Database-tables-and-your-features)
- [Dataset's features](#Dataset's-features)
- [Dealing with the features](#Dealing-with-the-features)
    - [Categorical features](#Categorical-features)
    - [Text features](#Text-features)
    - [Numeric features](#Numeric-features)
- [Train and Test some models](#Train%2FTest-some-models)
- [Conclusion](#Conclusion)
- [How to put the model in production?](#How-to-put-the-model-in-production%3F)
- [Attacking other problems with this model](#Attacking-other-problems-with-this-model)
- [Other possibles problems to attack](#Attacking-other-problems-with-this-data)
-------------

## Introduction to the problem

I have [here](https://www.kaggle.com/sirpunch/meetups-data-from-meetupcom#categories.csv) a Kaggle's dataset which represents some relationed tables of [Meetup's](https://www.meetup.com/) events database, exporteds in [CSV](https://pt.wikipedia.org/wiki/Comma-separated_values) files. 
With the following Entity-Relationship model:

<img src="EER-diagram.png">


The problem I have to attack here is the one that follows:

Given a __new created group__ with the host city, the place where there will be any meetings between the members, what topics characterize the group and the name and description of the group. What is the __category__ of the group (Arts & Culture, Career & Business, etc...)?

#### __Why to do that?__

Well, the answer is quite simple. Sugesting a category or automatically labeling this new group we would save time of the user at creation of this Meetup group and keep this categorical data more consistent and robust.

-----------

## Database tables and your features

In this section I will cover the dataset and your features. As said before, our dataset consists of some related tables, and they are:
    - events
    - venues
    - cities
    - categories
    - groups
    - members
    - groups_topics
    - topics
    - members_topics

As we know, the problem I have to attack is: given a new group, categorize it. So, for this activity, some tables will be useless, since the group is new and the only data we have about are: group's city, name, description and topics. So it doesn't have relationship with tables like: _events_ and _venues_.


I'm going to use pandas to take a look at columns of __each table__:

#### Groups:

In [1]:
import pandas as pd

# Reading GROUPS table csv file:
groups = pd.read_csv("files/groups.csv", encoding="ISO-8859-1")

print(groups.columns)

# Number of columns:
len(groups.columns)

Index(['group_id', 'category_id', 'category.name', 'category.shortname',
       'city_id', 'city', 'country', 'created', 'description',
       'group_photo.base_url', 'group_photo.highres_link',
       'group_photo.photo_id', 'group_photo.photo_link',
       'group_photo.thumb_link', 'group_photo.type', 'join_mode', 'lat',
       'link', 'lon', 'members', 'group_name', 'organizer.member_id',
       'organizer.name', 'organizer.photo.base_url',
       'organizer.photo.highres_link', 'organizer.photo.photo_id',
       'organizer.photo.photo_link', 'organizer.photo.thumb_link',
       'organizer.photo.type', 'rating', 'state', 'timezone', 'urlname',
       'utc_offset', 'visibility', 'who'],
      dtype='object')


36

We can see there's a lot of columns in this table, but a fewer can be good features to our model. If we take a deeper look on the columns we have ids (as foreign keys) to other tables, we have URLs to images from organizers or images from the group, all this kind of things that are not related as good features to classify our group category.

We also have the __category name__, our target to the model in the same table, therefore, it facilities our work and we are not going to need to use foreign keys to access our target category.

I'm going to remove these columns.

In [2]:
# Filtering the features I'm interested in:
groups = groups[
    [
        "group_id",
        "category.name",
        "city_id",
        "city",
        "country",
        "created",
        "description",
        "join_mode",
        "members",
        "group_name",
        "rating",
        "state",
        "timezone",
        "visibility",
        "who",
    ]
]

print(groups.columns)

# Number of columns:
len(groups.columns)

Index(['group_id', 'category.name', 'city_id', 'city', 'country', 'created',
       'description', 'join_mode', 'members', 'group_name', 'rating', 'state',
       'timezone', 'visibility', 'who'],
      dtype='object')


15

All right, now our dataset _groups_ have only features which are useful and provided at group's creation moment. This is the kind of feature we need to train/test our model.

In other moment I will check how the data is varying in some features.

#### Categories:

In [3]:
categories = pd.read_csv("files/categories.csv", encoding="utf-8")
categories.head(3)

Unnamed: 0,category_id,category_name,shortname,sort_name
0,1,Arts & Culture,Arts,Arts & Culture
1,2,Career & Business,Business,Career & Business
2,3,Cars & Motorcycles,Auto,Cars & Motorcycles


Well, all this data in table _categories_, we already have as a _group's_ column. So, we can ignore this dataset.

#### Groups_topics:

In [10]:
groups_topics = pd.read_csv("files/groups_topics.csv", encoding="ISO-8859-1")
groups_topics.head()

Unnamed: 0,topic_id,topic_key,topic_name,group_id
0,83,sportsfans,Sports Fan,241031
1,83,sportsfans,Sports Fan,289172
2,83,sportsfans,Sports Fan,295444
3,83,sportsfans,Sports Fan,1040320
4,83,sportsfans,Sports Fan,1403055


In _groups topics_ table we have an interesting data. We have here __topic names__ as a categorical feature and _group id_ and _topic id_ as FKs. 

We can do a left join on _groups_ table soon. For now, let's check _topics_ table, if exist any useful feature.

#### Topics:

In [3]:
topics = pd.read_csv("files/topics.csv", encoding="ISO-8859-1")
topics = topics[["topic_id", "description", "members"]]
topics.head()

Unnamed: 0,topic_id,description,members
0,83,Meet with others in your local area who are Sp...,471594
1,130,Meet with Latin Music fans in your town.,759757
2,182,Want to practice your English? Meetup with oth...,3176752
3,183,Meet local Spanish language and culture lovers...,1618673
4,184,Meet and mingle with local Italian language an...,465231


Here we see some useful feature like __description__, which is the topic's description and __members__, the quantity of members at each specific topics.

In [4]:
# Simplifying the dataset:
topics = topics[["topic_id", "description", "members"]]
topics.columns

Index(['topic_id', 'description', 'members'], dtype='object')

#### Members

In [5]:
members = pd.read_csv("files/members.csv", encoding="ISO-8859-1")
members.head(3)

Unnamed: 0,member_id,bio,city,country,hometown,joined,lat,link,lon,member_name,state,member_status,visited,group_id
0,3,not_found,New York,us,"New York, NY",2007-05-01 22:04:37,40.72,http://www.meetup.com/members/3,-74.0,Matt Meeker,NY,active,2009-09-18 18:32:23,490552
1,3,not_found,New York,us,"New York, NY",2011-01-23 14:13:17,40.72,http://www.meetup.com/members/3,-74.0,Matt Meeker,NY,active,2011-03-20 01:02:11,1474611
2,3,"Hi, I'm Matt. I'm an entrepreneur who has star...",New York,us,"New York, NY",2010-12-30 18:47:34,40.72,http://www.meetup.com/members/3,-74.0,Matt Meeker,NY,active,2011-01-18 20:37:23,1490492


With the first three rows of the dataset we can see that exists any missing data, related to __bio__'s member. Due to this, it's not good we use as a feature to extract values to the model. Let's see how many rows are missing bio value.

In [6]:
counts = members.bio.value_counts()

# Show first 5 repeated values.
counts.head()

Well, there's a lot of "not_found" values, especifically: 4,838,716. Cleary isn't a good ideia to put bio as a feature.

I think we can pass this table too.

#### Member_topics

In [7]:
members_topics = pd.read_csv("files/members_topics.csv", encoding="ISO-8859-1")
members_topics.head()

Unnamed: 0,topic_id,topic_key,topic_name,member_id
0,83,Sports Fan,sportsfans,121483
1,83,Sports Fan,sportsfans,165644
2,83,Sports Fan,sportsfans,327482
3,83,Sports Fan,sportsfans,337743
4,83,Sports Fan,sportsfans,358259


The main function of this table is reference topic's data with member through _member id_.

In [167]:
# Amount of topics linked to members:
print(len(members_topics.member_id))

# Total of members linked with at least one topic:
len(members_topics.member_id.unique())

3195245


605790

I don't think that is a good idea put this information as our feature to train the model because the only information we have about members, is the creator of the group, and we already going to link the group with one topic (which is a column of this table).

Thus, we are passing this table.

#### Cities

In [7]:
cities = pd.read_csv("files/cities.csv", encoding="ISO-8859-1")
cities.head()

Unnamed: 0,city,city_id,country,distance,latitude,localized_country_name,longitude,member_count,ranking,state,zip
0,West New York,7093,us,2524.541,40.790001,USA,-74.010002,661,32,NJ,7093
1,New York,10001,us,2526.837,40.75,USA,-73.989998,229371,0,NY,10001
2,New York Mills,13417,us,2392.162,43.099998,USA,-75.290001,22,109,NY,13417
3,East Chicago,46312,us,1810.371,41.639999,USA,-87.459999,31,90,IN,46312
4,New York Mills,56567,us,1418.834,46.689999,USA,-95.349998,5,1,MN,56567


From this table we can see repeated features, but exists some new features as __member_count__ from the cities and your ranking. But this ranking isn't clear for us what is the meaning of this information. Due to this, I will pass the feature ranking. But probably going to use __member_count__.

In [8]:
cities = cities[
    ["city_id", "city", "member_count"]
]  # Need columns 'city' to do the match/join

---------------

## Dataset's features

In this section I will put all together and build our dataset to the final model. Based on discussions of [this section](#Database-tables-and-your-features) our dataset will consist of:

- __groups table__: the chosen features;
- __groups_topics table__: we have topic_name and FKs to the _topics table_;
- __topics table__: we have the topics' _description_ and the _members'_ quantity of each topic as features;
- __cities table__: from this table I chose only get one feature, the _member count_ of each city, and left join with city in groups table.


Just to remember the state of our dataframes of the respectively tables:

In [170]:
print(groups.head(2))

   group_id          category.name  city_id      city country  \
0      6388       health/wellbeing    10001  New York      US   
1      6510  community/environment    10001  New York      US   

               created                                        description  \
0  2002-11-21 16:50:46  Those who practice or hold a strong interest i...   
1  2003-05-20 14:48:54  The New York Alternative Energy Meetupis for t...   

  join_mode    lat        lon  members                 group_name  rating  \
0      open  40.75 -73.989998     1440     Alternative Health NYC    4.39   
1      open  40.75 -73.989998      969  Alternative Energy Meetup    4.31   

  state    timezone visibility                      who  
0    NY  US/Eastern     public      Explorers of Health  
1    NY  US/Eastern     public  Clean Energy Supporters  


In [171]:
print(groups_topics.head(2))

   topic_id   topic_key  topic_name  group_id
0        83  sportsfans  Sports Fan    241031
1        83  sportsfans  Sports Fan    289172


In [172]:
print(topics.head(2))

   topic_id                                        description  members
0        83  Meet with others in your local area who are Sp...   471594
1       130           Meet with Latin Music fans in your town.   759757


In [173]:
print(cities.head(2))

   city_id           city  member_count
0     7093  West New York           661
1    10001       New York        229371


#### Putting all together:


[HERE TALK ABOUT TECHNIQUES I'VE USED]


In [11]:
# groups.join(cities, on='city_id', rsuffix='_from_cities')
groups_and_cities = pd.merge(
    groups, cities, on="city_id", how="left", suffixes=("", "_from_cities")
)
dataset = pd.merge(
    groups_and_cities,
    groups_topics,
    on="group_id",
    how="left",
    suffixes=("", "_from_groups_topics"),
)
dataset = pd.merge(
    dataset, topics, on="topic_id", how="left", suffixes=("", "_from_topics")
)

Going to drop duplicated and ID columns.


Also I'm going to rename some columns to keep a kind of pattern. Will organize the columns to be renamed in a dict.


In [12]:
# Droping ID and useless columns:
dataset = dataset.drop(
    columns=[
        "group_id",
        "city_id",
        "topic_key",
        "topic_id",
        "city_from_cities",
        "member_count",
    ]
)

# Renaming columns:
columns_to_rename = {"category.name": "category"}
dataset = dataset.rename(columns=columns_to_rename)

# Checking type from columns
dataset.dtypes

category                    object
city                        object
country                     object
created                     object
description                 object
join_mode                   object
members                      int64
group_name                  object
rating                     float64
state                       object
timezone                    object
visibility                  object
who                         object
topic_name                  object
description_from_topics     object
members_from_topics        float64
dtype: object

#### Checking missing data:

print(dataset.isnull().sum())

In [13]:
values = {"description": "", "group_name": ""}
# Replacing description and group name for empty string:
dataset = dataset.fillna(value=values)

# Removing other rows with nan values:
dataset = dataset.dropna(axis="rows")
print(dataset.isnull().sum())

category                   0
city                       0
country                    0
created                    0
description                0
join_mode                  0
members                    0
group_name                 0
rating                     0
state                      0
timezone                   0
visibility                 0
who                        0
topic_name                 0
description_from_topics    0
members_from_topics        0
dtype: int64


------------
# Dealing with the features

Now that we have our final dataset to train/test the model, we can see that most part has __categorical features__. Also have __text__ and __numeric__ features.

I'm going to explain how to deal with each feature type:

### Categorical features

In this section, I want to attack a classic problem for ML: __Deal with categorical features on dataset__.

Categorical data are common problems in many Data Science and Machine Learning approaches but are usually more challenging to deal with than numerical data. In particular, many machine learning algorithms require that their input is numerical and therefore categorical features must be transformed into numerical features before we can use any of these algorithms.

Therefore, I will create a few datasets with encoded categorical variables. There are a lot of techniques to encode this data, but I'm going to use:

- [__One-Hot Encoding__](http://contrib.scikit-learn.org/categorical-encoding/onehot.html): one column per category, with a 1 or 0 in each cell for if the row contained that column’s category
- [__Binary Encoding__](http://contrib.scikit-learn.org/categorical-encoding/binary.html): first the categories are encoded as ordinal, then those integers are converted into binary code, then the digits from that binary string are split into separate columns. This encodes the data in fewer dimensions that one-hot, but with some distortion of the distances.
- [__Backward Difference__](http://contrib.scikit-learn.org/categorical-encoding/backward_difference.html): the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable.

With these three encoding techniques, I will test all outputted datasets with some different models to evaluate what encoding technique better perfomed in our data.

_Note: For this activity, used [Category Encoders](http://contrib.scikit-learn.org/categorical-encoding/index.html), a python package._

### Text features
To deal with the text features I'm going to use techniques known as [__feature extraction__](https://scikit-learn.org/stable/modules/feature_extraction.html), 

### Numeric features

Before encode our dataset, I want to extract more value as possible from the date feature: _data_. Have two approaches which make sense for us, extracting them from datetime feature:
- Day of the week
- Day of the year

However, we already got day of the week as a original column. Let's get the day of the year as a column.

In [14]:
# First let's deal with text features:
from sklearn.feature_extraction.text import CountVectorizer

# dataset = pd.read_pickle('final_dataset.pkl')

text_extract_descp = dataset["description"]
text_extract_name = dataset["group_name"]
text_extract_tops = dataset["description_from_topics"]

# # Instancing the Vectorizer algorithm:
vectorizer = CountVectorizer()

# Doing the transformations and putting in our dataset:
dataset["description"] = vectorizer.fit_transform(text_extract_descp)
dataset["group_name"] = vectorizer.fit_transform(text_extract_name)
dataset["description_from_topics"] = vectorizer.fit_transform(text_extract_tops)

# Converting column 'created' to datetime type:
dataset["created"] = pd.to_datetime(dataset["created"])

# Creating numeric feature from datetimes:
dataset["created"] = dataset["created"].dt.dayofyear


# Now let's treat our categorical features:

## Initializing our encoders:
import category_encoders as ce

cols = [
    "city",
    "country",
    "join_mode",
    "state",
    "timezone",
    "visibility",
    "who",
    "topic_name",
]
# one_hot_encoder = ce.OneHotEncoder(cols=cols, drop_invariant=True, use_cat_names=True)
# binary_encoder = ce.BinaryEncoder(cols=cols, drop_invariant=True)
backward_encoder = ce.BackwardDifferenceEncoder(cols=cols)

# Separating target from features:
X = dataset.loc[:, dataset.columns != "category"].copy()
Y = dataset.loc[:, dataset.columns == "category"].copy()

# Fiting our encoders to the columns we told:
one_hot_encoder.fit(X)
binary_encoder.fit(X)
backward_encoder.fit(X)

# Encoding our dataset:
X_hot_encoded = one_hot_encoder.transform(X)
X_binary_encoded = binary_encoder.transform(X)
X_backward_encoded = backward_encoder.transform(X)

train_datasets_list = [X_hot_encoded, X_binary_encoded, X_backward_encoded]

BackwardDifferenceEncoder(cols=['city', 'country', 'join_mode', 'state', 'timezone', 'visibility', 'who', 'topic_name'],
             drop_invariant=False, handle_unknown='impute',
             impute_missing=True,
             mapping=[{'col': 'city', 'mapping':       [D.1]     [D.2]     [D.3]     [D.4]     [D.5]     [D.6]     [D.7]  \
1 -0.888889 -0.777778 -0.666667 -0.555556 -0.444444 -0.333333 -0.222222
2  0.111111 -0.777778 -0.666667 -0.555556 -0.444444 -0.333333 -0.222222
3  0.111111  0.222222 -0.666667 -0.5555...  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000

[962 rows x 960 columns]}],
             return_df=True, verbose=0)

### Analyzing dimension of categorical encoders


Before put the hands on the models and metrics, I want to analyze __how many columns__ each encoding have created to our features.

In [16]:
X_hot_encoded.shape
X_binary_encoded.shape
X_backward_encoded.shape

NameError: name 'X_backward_encoded' is not defined

-------
# Train/Test some models
__NOTE: My memory can't deal with these encoding transformations, I've tried differents approaches but always pass 8GB RAM, and my computer freeze, but in these following steps I would train some different models and look to the F1 measure score to decides our best model and take a quicky look at the CONFUSION MATRIX__

I'm very sorry for this step, I could use Spark ou divide the step with different Memory Chuncks, but had no time :( ...

In [191]:
# Train the dataset with the 3 chosen encoders
for df_encoded in train_datasets_list:
    # Split train / test data:
    X_train, X_test, y_train, y_test = train_test_split(df_encoded, Y, test_size=0.25, random_state=42)
    # Now train and test with different models with .fit() .predict()
    

Index(['category', 'city', 'country', 'created', 'description', 'join_mode',
       'lat', 'lon', 'members', 'group_name', 'rating', 'state', 'timezone',
       'visibility', 'who', 'topic_name', 'description_from_topics',
       'members_from_topics'],
      dtype='object')

We also could use [auto sklearn](https://automl.github.io/auto-sklearn/master/index.html) or [automl](https://pypi.org/project/automl/) automatic ML model selection tools to choose the best model automatically.

# Conclusion

Due this problem I mentioned above, I can't tell the final decision and conclusion, but probably the metric I would choose to evaluate the best model would be _f1 score_.

# How to put the model in production?

We build a prediction model on the Meetup's historic data using different machine learning algorithms and classifiers, plot the results and calculate some score metrics of the model on the testing data. Now what? To put it to use in order to predict the new data, we have to deploy it over the target service or porduct. In this section I'll talk about two ways to deploy our model and put it in production.

1. Build your own API;
2. Use a paid service/platform (such as Cloud ML Engine from Google).


### 1. Build your own API
Building your own API service to your model several times is the best solution to many people, because you can use, for example, python's frameworks as Flask and it's easy to start a server with right configuration, it's customizable, you can process the data in the same call since it is python building the API and processing the data for the model,  and cheaper (for free, actually).

However, this traditional approach requires a lot of setup and maintence time from software/ML engineers. Also scaling in multiple machines (clusters) using Flask causes many complications.


### 2. Deploy using a service
Using a platform that serves ML models usually is low latency and makes the ML model deployment to production easier and faster. The part of scale and maintain the model also is simple.

Usually this platforms enable you to upload your python file with _load()_ and _predict()_ method to deploy your model and do the necessary processing.

However, it demands some investment, they are not free. As example of this services we have [Panini AI](https://panini.ai/) and [Cloud Machine Learning Engine](https://cloud.google.com/ml-engine/).

----------

# Attacking other problems with this model

Now, thinking about how would I apply this model to improve other platform interactions with users at Meetup's platform, I must remember that the input information from my model it's the basic data given at the creation of the meetup's group.

Well, knowing the group's category this user is creating, to improve other platform interactions using the same developed model I think about two approaches:
- Show Meetup's groups of the same category and next him (same city or state) to engage the participation in these groups;
- Recommend some tips to the success of this new group based on the groups similar to this one
    - With this __success__ metric I mean more group members, for example. 


----------

# Attacking other problems with this data

Given all the database tables I metioned before, we can think about other problems which is possible to attack using other ML models considering all our data. For example:

- To make the creation of new groups or events __easier and faster__ to the users we could try to use _The transformer neural network_ to write these descriptions automatically, and as showed in this [paper](https://arxiv.org/abs/1706.03762), it performs so well.

- We also can suggest places (cities/states) would be more successful for this group or event

- We could create a model which suggests to a new member, a list of groups and events based on your personal information and interests
--------------