## Introduction to Comet ML  

Comet is a great tool for model versioning and experimentation as it records the parameters and conditions from each of your experiements- allowing you to reproduce your results, or go back to a previous version of your experiment.  

To create an account, visit https://www.comet.ml/  
Follow the instructions for a single user account. Once that is created, you will see a project folder. That is where the records of your experiments can be viewed. 

Comet has an abundance of tutorials and scripts, we're just going to run through this notebook to get you started on the right track. For this illustration, we will be using one of the examples found on the Comet ML GitHub repo.

To begin with, you should install as illustrated below if you don't already have it. *Always import Experiment at the top of your notebook/script.*


In [None]:
!pip install comet_ml

In [24]:
from comet_ml import Experiment

You will see an API key button at the top of the page when you click on an experiment- use this key as illustrated below to link your current workspace to comet. (If a project is empty, the code below will autogenerate for you on the project page, just copy and paste it in here)

In [25]:
# Setting the API key (saved as environment variable)
# Create an experiment with your api key
experiment = Experiment(
    api_key="769nn1WRQ3JOmstx7e82dAUlZ",
    project_name="climate-change-belief-analysis-2201acds-team-nm4",
    workspace="daniel-ifediba",
)

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/daniel-ifediba/climate-change-belief-analysis-2201acds-team-nm4/ce50db2192284febad6785503dbf443f



Import the rest of your necessary libraries as you usually would. For this demonstration we will be using the breast cancer dataset for classification so we will also import that from sklearn.

In [26]:
import warnings # to filter out warnings in the jupyter notebook
warnings.filterwarnings('ignore') # we will ignore all warning and not show them

import re                                  
import string

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
                          
from nltk.corpus import stopwords 
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import TweetTokenizer, TreebankWordTokenizer

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import MaxAbsScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.utils import resample

In [27]:
# Have a look at your dataset
df = pd.read_csv('./data/train.csv', index_col='tweetid', encoding='utf-8')
df.head()

Unnamed: 0_level_0,sentiment,message
tweetid,Unnamed: 1_level_1,Unnamed: 2_level_1
625221,1,PolySciMajor EPA chief doesn't think carbon di...
126103,1,It's not like we lack evidence of anthropogeni...
698562,2,RT @RawStory: Researchers say we have three ye...
573736,1,#TodayinMaker# WIRED : 2016 was a pivotal year...
466954,1,"RT @SoyNovioDeTodas: It's 2016, and a racist, ..."


In [28]:
# Let's pick a class size of roughly half the size of the largest size
class_size = 5000
# Let's list the target labels
labels_counts = df['sentiment'].value_counts().to_dict()

resampled_classes = []

# For each label
for label, label_size in labels_counts.items():
    # If label_size < class size then set replace to True to upsample, else False to downsample
    if label_size < class_size:
        # Upsample
        replacement = True # sample with replacement (we need to duplicate observations)
    else:
        # Downsample
        replacement = False # sample with replacement (we need to duplicate observations)
    label_data = df[df['sentiment'] == label]
    label_resampled = resample(label_data,
                               replace=replacement, # sample without replacement (no need for duplicate observations)
                               n_samples=class_size, # number of desired samples
                               random_state=27) # reproducible results

    resampled_classes.append(label_resampled)

resampled_data = pd.concat(resampled_classes, axis=0)

In [29]:
# function to strip emojis from the tweets

def remove_emoji(text):
    EMOJI_PATTERN = re.compile(
        "(["
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F700-\U0001F77F"  # alchemical symbols
        "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
        "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
        "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        "\U0001FA00-\U0001FA6F"  # Chess Symbols
        "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        "\U00002702-\U000027B0"  # Dingbats
        "\U000024C2-\U0001F251"
        "])"
    )
    text = re.sub(EMOJI_PATTERN,  '', text)
    return text

In [30]:
def tweet_preprocessor(tweet):
    # remove the old style retweet text "RT"
    tweet_clean = re.sub(r'^RT[\s]+', '', tweet)

    # remove hashtags. We have to be careful here not to remove 
    # the whole hashtag because text of hashtags contains huge information. 
    # only remove the hash # sign from the word
    tweet_clean = re.sub(r'#', '', tweet_clean)

    # remove hyperlinks
    tweet_clean = re.sub(
        r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
        ,r'url-web',
        tweet_clean)

    # remove single numeric terms in the tweet. 
    tweet_clean = re.sub(r'\s[0-9]+\s', '', tweet_clean)

    # remove emojis from in the tweet
    tweet_clean = remove_emoji(tweet_clean)

    # remove punctions from the tweet
    tweet_clean = ''.join([l for l in tweet_clean if l not in string.punctuation])

    # convert tweet to lowercase and return it
    tweet_clean = tweet_clean.lower()
    
    # tokenize the tweet
    tokenizer = TweetTokenizer() #Instantiate the tokenizer class
    tweet_tokens = tokenizer.tokenize(tweet_clean)
    
    # remove stop words
    stopwords_english = stopwords.words('english')
    tweet_tokens_without_stopwords = [t for t in tweet_tokens if t not in stopwords_english]
    
    # stem the tweet
    #stemmer = PorterStemmer()
    #tweet_stems = ' '.join([stemmer.stem(t) for t in tweet_tokens])
    
    # stem the tweet
    lemma = WordNetLemmatizer()
    tweet_stems = ' '.join([lemma.lemmatize(t) for t in tweet_tokens])
    
    return tweet_stems

In [31]:
#Get the labels and targets
X = resampled_data['message']
y = resampled_data['sentiment']

In [32]:
vectorizer = CountVectorizer( # Instantiate the object
    preprocessor=tweet_preprocessor,
    min_df=2, # with min_df of 2 for bag of words
    ngram_range=(1,2) # with ngram_range of (1,2) for bag of words
) 
vectorizer.fit(X) #build vocabulary for training
X_tokenized = vectorizer.transform(X) #encode the text data 

# scale the encoded text data 
transformer = TfidfTransformer() # Instantiate the object
X_trans = transformer.fit_transform(X_tokenized) # Transform the tokens

Split your data into train and test sets, keep in mind that you need to set a random state for your results to be reproduced!

In [33]:
#Split the labels and target into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_trans, y, test_size=0.2, random_state=42)

## GridSearch 

For this example we've used a gridsearch but you may use a model with default parameters or your own parameters too- Just remember to add/remove the neccesary data when you are logging your parameters at the end of the experiment.

The `param_grid` variable contains the 'C' values we want our gridsearch to iterate through.



In [34]:
# Create a list containing the various models to evaluate
# include the parameters we are interested to tune for each model
models = [
    ## define the mode MultinomialNB Classifier
    {
        'model': MultinomialNB(), # instantiate the model
        'param_grid': { # set hyper parameters for model tuning
            'alpha': [0.0001, 0.001, 0.1, 1, 10]
        }
    },
        ## define the model Logistic Regression Classifier
    {  
        'model': LogisticRegression(), # instantiate the model
        'param_grid': { # set hyper parameters for model tuning
            'C' : [0.1, 1],
            'solver' : ['liblinear'],
            'max_iter' : [100, 1000]
        }
    }
]      

In [35]:
for item in models:
    model = item['model'] # get the model object instance
    param_grid = item['param_grid'] # get the parameters that will be used for tuning
    # define the grid instance
    grid =  GridSearchCV(
        model, # model instance
        param_grid = param_grid, # hyper-parameters
        cv = 10, # cross-validation splitting
        scoring = 'f1_weighted', # f1_scoring
        refit=True
    )
    print(f'Fitting {model.__class__.__name__} model....')
    grid.fit(X_train, y_train) # fit the grid to the X-train and the y_train
    y_pred = grid.predict(X_test) # Predict values for X_test

    # Saving each metric to add to a dictionary for logging
    f1 = f1_score(y_test, y_pred, average='weighted')
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    print(f'Finished fitting and testing {grid.best_estimator_}')
    # Create dictionaries for the data we want to log

    params = {"random_state": 42,
              "model_type": model.__class__.__name__,
              "vectorizer": 'CounterVectorizer',
              "transformer": "TfidfTransformer",
              "param_grid": str(param_grid),
              "stratify": True
              }
    metrics = {"f1": f1,
               "recall": recall,
               "precision": precision
               }
    # Log our parameters and results
    experiment.log_parameters(params)
    experiment.log_metrics(metrics)

Fitting MultinomialNB model....
Finished fitting and testing MultinomialNB(alpha=0.1)
Fitting LogisticRegression model....
Finished fitting and testing LogisticRegression(C=1, solver='liblinear')


If you're using comet within a jupyter notebook, it's important to end your experiment when you've finished as illustrated below.

In [36]:
experiment.end()

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/daniel-ifediba/climate-change-belief-analysis-2201acds-team-nm4/ce50db2192284febad6785503dbf443f
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     f1 [2]        : (0.8764811128651229, 0.9085913574572091)
COMET INFO:     precision [2] : (0.8778590159627107, 0.9095206009759094)
COMET INFO:     recall [2]    : (0.878, 0.9085)
COMET INFO:   Parameters:
COMET INFO:     C                            : 1
COMET INFO:     alpha                        : 0.1
COMET INFO:     analyzer                     : word
COMET INFO:     binary                       : False
COMET INFO:     class_prior                  : 1
COMET INFO:     class_weight                 : 1
COMET INFO:     cv                           : 10
COMET INFO:     decode_error                 

## Display  

Running `experiment.display()` will show you your experiments comet.ml page inside your notebook as illustrated below. You can do this immediately after an experiment is run, and logged. 

In [37]:
experiment.display()