This first cell is added by Kaggle by default and contains some useful info/setup to be able to open the data files for the competition. 

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

This is where we open the dataset for the competition. This can be confusing at first, which is why Kaggle prints out the path for us above. The data files are also listed in the top right of the window, which is how we know which file to use (here, `Small_talk_Intent.csv`).

In [None]:
input_data_path = '../input/small-talk-intent-classification-data/Small_talk_Intent.csv'
df = pd.read_csv(input_data_path)
df.head()


Splitting the data early on is always a good idea because it helps avoid data leakage. You can choose a different holdout percentage with the `test_size` parameter (typical values are 15% or 20% of your data, but this can vary depending on how much data you have).

In [None]:
# split data into train and validation sets: X_train_raw/y_train and X_val_raw/y_val
from sklearn.model_selection import train_test_split

X = df['Utterances'].copy()
y = df['Intent'].copy()

X_train_raw, X_val_raw, y_train, y_val = train_test_split(X, y, test_size=0.15, random_state=42)
X_train_raw.head() + "     " + y_train.head()
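
To see how `test_size` changes the split, here is a minimal sketch on made-up data. The names `toy_X`, `toy_y`, and `split_sizes` are hypothetical and are not part of the competition dataset; the point is only to show how the holdout fraction maps to row counts.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 numbered "utterances" with alternating labels
toy_X = [f"utterance {i}" for i in range(100)]
toy_y = ["greeting" if i % 2 == 0 else "goodbye" for i in range(100)]

split_sizes = {}
for holdout in (0.15, 0.20):
    # random_state fixes the shuffle so the split is reproducible
    X_tr, X_va, y_tr, y_va = train_test_split(
        toy_X, toy_y, test_size=holdout, random_state=42
    )
    split_sizes[holdout] = (len(X_tr), len(X_va))
    print(f"test_size={holdout}: {len(X_tr)} train rows, {len(X_va)} validation rows")
```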

Next is the vectorization of the data. Note that the tokenization happens inside the `fit_transform` method as well. Parameters to the `TfidfVectorizer()` constructor can also change how the vectorization is done (e.g. limit the vocabulary to the __x__ most frequently occurring words, remove stopwords or not, etc.). There are many options, so checking out the documentation for [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) is not a bad idea.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
#import seaborn as sns
tfidf_vectorizer = TfidfVectorizer()
X_train = tfidf_vectorizer.fit_transform(X_train_raw).toarray()

# an alternative is binary (one-hot) word presence instead of TF-IDF weights:
#from sklearn.feature_extraction.text import CountVectorizer
#one_hot_vectorizer = CountVectorizer(binary=True)
#X_train = one_hot_vectorizer.fit_transform(X_train_raw).toarray()

print(f"X_train.shape = {X_train.shape}")
type(X_train)
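
As a rough illustration of the constructor parameters mentioned above, the sketch below compares a default `TfidfVectorizer` with one that caps the vocabulary (`max_features`) and drops English stopwords (`stop_words="english"`). The `toy_utterances` list is made up for this example; whether either setting helps on the real data is something to check on the validation set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, only for illustration
toy_utterances = [
    "what is the weather today",
    "tell me a joke",
    "what is your name",
    "tell me the time",
]

# Default settings: every token of two or more characters becomes a feature
default_vec = TfidfVectorizer()
X_default = default_vec.fit_transform(toy_utterances)

# Keep at most 5 features and remove scikit-learn's built-in English stopwords
limited_vec = TfidfVectorizer(max_features=5, stop_words="english")
X_limited = limited_vec.fit_transform(toy_utterances)

print(f"default vocabulary size: {X_default.shape[1]}")
print(f"limited vocabulary size: {X_limited.shape[1]}")
print(sorted(limited_vec.vocabulary_))
```

Short small-talk utterances contain many stopwords, so removing them can discard useful signal; treat these settings as things to experiment with rather than defaults.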

Finally, it's time to fit a model to the data. The main part of the assignment is to use a model __other than RandomForest__. Fortunately, this should not be too difficult a change to this notebook, since scikit-learn has many, many types of classification models and they are easily interchangeable. Besides the RandomForestClassifier, there are:
* [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
* [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)
* [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)
* [ExtraTreesClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)
* [HistGradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html)
* [Gaussian Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)
* [Naive Bayes Multinomial Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)

and even more options. Feel free to choose another one. 

In [None]:
from sklearn.naive_bayes import MultinomialNB
#from sklearn.ensemble import AdaBoostClassifier
#from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score


model = MultinomialNB()
#model = AdaBoostClassifier(n_estimators=150, random_state=0)
#model = GaussianNB()
model = model.fit(X_train, y_train)

predictions_train = model.predict(X_train)

#disp = ConfusionMatrixDisplay(confusion_matrix(y_train, predictions_train), display_labels=['negative', 'neutral', 'positive'])
#disp.plot()
print(f"accuracy (on X_train): {accuracy_score(y_train, predictions_train):.4f}")
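
Because scikit-learn classifiers share the same `fit`/`predict` interface, swapping models can be as simple as changing one line. The sketch below loops over a few of the classifiers listed above on a tiny made-up dataset; `toy_texts`, `toy_labels`, `candidates`, and `toy_scores` are hypothetical names, and the training accuracies it prints say nothing about performance on the real data.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, only for illustration
toy_texts = [
    "hello there", "hi how are you", "good morning", "hey friend",
    "goodbye for now", "see you later", "bye bye", "talk to you soon",
]
toy_labels = ["greeting"] * 4 + ["goodbye"] * 4

# Dense array, as in the cells above
X_toy = TfidfVectorizer().fit_transform(toy_texts).toarray()

# Each candidate exposes the same fit/predict methods
candidates = {
    "MultinomialNB": MultinomialNB(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(n_estimators=50, random_state=0),
}

toy_scores = {}
for name, clf in candidates.items():
    clf.fit(X_toy, toy_labels)
    toy_scores[name] = accuracy_score(toy_labels, clf.predict(X_toy))
    print(f"{name}: training accuracy = {toy_scores[name]:.4f}")
```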

In [None]:
X_val = tfidf_vectorizer.transform(X_val_raw).toarray()
print(f"X_val.shape = {X_val.shape}")
type(X_val)

In [None]:
predictions_val = model.predict(X_val)
#disp = ConfusionMatrixDisplay(confusion_matrix(y_val, predictions_val), display_labels=['negative', 'neutral', 'positive'])
#disp.plot()
print(f"accuracy (on X_val): {accuracy_score(y_val, predictions_val):.4f}")


As we did previously, let's now check to see if there is any hyperparameter tuning that can be done to further improve the model performance. We saw that for Random Forest there was not much difference here, but hyperparameter choices for other types of models can cause performance to vary much more.

In [None]:
from sklearn.metrics import log_loss
from sklearn.ensemble import HistGradientBoostingClassifier

tune_model = True # can change this to False once you've chosen a hyperparam value and before Saving your notebook with Kaggle
intents = df['Intent'].unique()
# A function to create and fit a model with a specific hyperparameter value (here, max_iter for HistGradientBoostingClassifier)
def tuneModel(hyperparam_value):
    rf_model = HistGradientBoostingClassifier(max_iter=hyperparam_value, random_state=5)
    #rf_model = RandomForestClassifier(min_samples_split=hyperparam_value, random_state=1)
    rf_model.fit(X_train, y_train)
    y_train_pred_prob = rf_model.predict_proba(X_train)
    y_train_pred = rf_model.predict(X_train)
    y_val_pred_prob = rf_model.predict_proba(X_val)
    y_val_pred = rf_model.predict(X_val)
    train_loss = log_loss(y_train, y_train_pred_prob, labels=intents)
    train_acc = accuracy_score(y_train, y_train_pred)
    val_loss = log_loss(y_val, y_val_pred_prob, labels=intents)
    val_acc = accuracy_score(y_val, y_val_pred)
    return (train_loss, val_loss, train_acc, val_acc)

# Candidate values for the hyperparameter being tuned (max_iter in tuneModel above).
# Note: only the last assignment takes effect; comment out the list you don't want.
hyp_param_vals = list(range(39,43,1)) # a narrow range: 39, 40, 41, 42
hyp_param_vals = [2,3] + list(range(5, 50, 10)) # a coarser sweep: 2, 3, 5, 15, 25, 35, 45
metrics = []

if tune_model:
    for hp in hyp_param_vals:
        metrics.append(tuneModel(hp))

Plot the model performance for each hyperparameter value we looked at.

In [None]:
import matplotlib.pyplot as plt

if tune_model:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 4))

    ax1.set_xticks(hyp_param_vals)
    ax1.set(xlabel="max_iter", ylabel="loss (lower is better)")
    ax1.plot(hyp_param_vals, [metric[1] for metric in metrics], '--ro') # validation loss
    ax1.plot(hyp_param_vals, [metric[0] for metric in metrics], '--bo') # training loss
    ax1.legend(["Validation Loss", "Train Loss"], loc=1)

    ax2.set_xticks(hyp_param_vals)
    ax2.set(xlabel="max_iter", ylabel="accuracy (higher is better)")
    ax2.plot(hyp_param_vals, [metric[3] for metric in metrics], '--ro') # validation accuracy
    ax2.plot(hyp_param_vals, [metric[2] for metric in metrics], '--bo') # training accuracy
    ax2.legend(["Validation Accuracy", "Train Accuracy"], loc=4)

To be certain, let's look at the table of values and see which hyperparameter value has the lowest validation loss.

In [None]:
# a simple matrix with first row containing hyperparam values, second row containing validation loss, third row containing validation accuracy
# (this could be presented in an even nicer format using a pandas dataframe if you like)
if tune_model:
    tuning_results = np.array([hyp_param_vals, [round(metric[1],2) for metric in metrics], [round(metric[3],2) for metric in metrics]])
    print(tuning_results)
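
The comment above suggests a pandas data frame as a nicer presentation. A minimal sketch of that idea follows. The numbers in `example_metrics` are placeholders made up purely for illustration (they are not results from this notebook), and `example_vals`, `results_df`, and `best_row` are hypothetical names. Each tuple follows the same order that `tuneModel` returns: train loss, validation loss, train accuracy, validation accuracy.

```python
import pandas as pd

# Placeholder values, NOT real results: one tuple per hyperparameter value,
# ordered as (train_loss, val_loss, train_acc, val_acc)
example_vals = [10, 20, 30]
example_metrics = [
    (0.90, 1.10, 0.70, 0.60),
    (0.50, 0.80, 0.85, 0.75),
    (0.20, 0.95, 0.97, 0.72),
]

results_df = pd.DataFrame(
    example_metrics, columns=["train_loss", "val_loss", "train_acc", "val_acc"]
)
results_df.insert(0, "hyperparam", example_vals)

# Pick the row with the lowest validation loss
best_row = results_df.loc[results_df["val_loss"].idxmin()]

print(results_df)
print(f"lowest validation loss at hyperparam = {best_row['hyperparam']}")
```

In the notebook itself, the same pattern should work with `hyp_param_vals` and `metrics` in place of the placeholder lists.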

Once you're satisfied with your model, it's time to make predictions for the test dataset and submit those to the Kaggle competition. You'll first need to load the `test.csv` data file (which, of course, does not have the labels in it).

Note that you'll likely need to do some hyperparameter tuning depending on the model that you choose. Tuning means training/fitting your model for many different values of the hyperparameter and then choosing the value that resulted in the best performance (i.e. lowest loss). See our [notebook C from class (nb_C_airline_tweets_take2.ipynb)](https://github.com/sgeinitz/cs39aa_notebooks/), at the very bottom of the notebook, for an example of how to do this. 

In [None]:
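test_data_file = 'test.csv'
# input_data_path points at a file, so join the test file name to its directory
# (this assumes test.csv sits in the same input directory as the training data)
df_test = pd.read_csv(os.path.join(os.path.dirname(input_data_path), test_data_file))
df_test.head()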

We'll now need to go through the same process of tokenizing and vectorizing the text, and then putting it through the model. We need to be careful not to mix up the order of these observations. Specifically, the `id` field is used by Kaggle to know which observation is which (when comparing to the labels that Kaggle is hiding from us). We will be okay as long as we don't shuffle or subset this dataset.

In [None]:
X_test = tfidf_vectorizer.transform(df_test['text']).toarray()
print(f"X_test.shape = {X_test.shape}")
type(X_test)

Now make the predictions and peek at the first 10. 

**I used the hyperparameter value 39, which gave the best loss and accuracy.**

In [None]:
# refit the model with the best hyperparameter value you found
from sklearn.ensemble import AdaBoostClassifier  # the earlier import is commented out
model = AdaBoostClassifier(n_estimators=100, random_state=0)
model = model.fit(X_train, y_train)

# make predictions for the test set
predictions_test = model.predict(X_test)
predictions_test[:10]

Now append the predictions to the `df_test` data frame as a new column and peek at some of those to see if the predictions look decent. 

In [None]:
df_test['predictions'] = predictions_test
pd.set_option("display.max_colwidth", 240)
df_test.head(n=10)

A submission on Kaggle should have only two fields (or columns). These are a) the `id` and b) the predicted `sentiment`. Those are the exact column names that should be used. The next cell creates a pandas data frame with those two columns, and renames the second column accordingly. 

In [None]:
df_submission = df_test[['id','predictions']]
df_submission.columns = ['id', 'sentiment']
df_submission.head()
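
Before writing the file, a few quick checks can catch ordering or column-name mistakes. The sketch below uses a made-up `toy_test` frame and `toy_predictions` list as stand-ins for `df_test` and the model output; these names are hypothetical, and the checks are one possible sanity test rather than a Kaggle requirement.

```python
import pandas as pd

# Hypothetical stand-ins for df_test and the model's predictions
toy_test = pd.DataFrame({
    "id": [101, 102, 103],
    "text": ["hello there", "see you later", "good morning"],
})
toy_predictions = ["greeting", "goodbye", "greeting"]

# Build the submission without shuffling or subsetting the test rows
toy_submission = pd.DataFrame({
    "id": toy_test["id"],
    "sentiment": toy_predictions,
})

# Exactly two columns with the expected names
assert list(toy_submission.columns) == ["id", "sentiment"]
# One prediction per test row
assert len(toy_submission) == len(toy_test)
# Same ids in the same order as the test file
assert toy_submission["id"].equals(toy_test["id"])
print("submission checks passed")
```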

The final step is simply to write this data frame with the test predictions to a csv file. 

In [None]:
df_submission.to_csv('submission.csv', index=False)

After you're certain that this notebook runs correctly, you'll then click on the __Submit__ button on the right side of this window. 