# Recurrent Neural Networks

In this task you will implement a chatbot in two ways:
1. As a classifier
2. As a generator

- Download the Python Questions from Stack Overflow dataset https://www.kaggle.com/stackoverflow/pythonquestions
- Make the chatbot so that you classify a category (i.e., tag) of input text,
and return a dialog from the correct class. Note that one question could
have multiple tags and you may need to simplify.
- Alternatively, make a sequence to sequence network that automatically
learns what to respond. It can be character based or word based.
- Hint: Start with a subset of the dataset
- Choose the network architecture with care.
- Train and validate all algorithms.
- Make the necessary assumptions.

## 1. As a Classifier

To create a chatbot as a classifier, we will classify the input text into a category (tag) and return a dialog that carries that tag.

### Understand and Prepare the Dataset

The data consists of three files:

- `Questions.csv`: Contains information about the questions asked on Stack Overflow. The 'Body' field contains the HTML of the question.
- `Answers.csv`: Contains information about the answers to the questions. The 'ParentId' field maps to a question.
- `Tags.csv`: Contains the tags associated with each question. The 'Id' field here corresponds to the 'Id' in the Questions.csv file.


In [1]:
import pandas as pd

# Large subsets
questions = pd.read_csv('data/Questions.csv', encoding='latin1', nrows=10000)
answers = pd.read_csv('data/Answers.csv', encoding='latin1', nrows=15000)
tags = pd.read_csv('data/Tags.csv', encoding='latin1')

# Small subsets
# questions = pd.read_csv('data/Questions.csv', encoding='latin1', nrows=1000)
# answers = pd.read_csv('data/Answers.csv', encoding='latin1', nrows=2000)
# tags = pd.read_csv('data/Tags.csv', encoding='latin1', nrows=1000)

In [2]:
# Report how many rows were loaded from each file
print("Questions: ", len(questions))
print("Answers: ", len(answers))
print("Tags: ", len(tags))

Questions:  10000
Answers:  15000
Tags:  1885078


### Data Preprocessing

First, we need to preprocess our data. The data has already been loaded with pandas above; we will use the BeautifulSoup library to strip the HTML and a few text operations to clean what remains.

See https://www.kaggle.com/code/nicolaswattiez/stackoverflow-python-preprocess

The 'Body' field in both the Questions and Answers datasets contains HTML. We need to remove these HTML tags and clean the text data. This also includes converting text to lowercase, removing punctuation, and potentially removing stop words (common words like 'is', 'the', 'and', etc., which don't add much information for the model).

In [3]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/markus/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

number_of_characters = 86 # valid characters allowed from regex

# Build the stop-word set once; calling stopwords.words() for every word is very slow
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    # Remove HTML tags
    text = BeautifulSoup(text, "lxml").text
    # Convert to lowercase
    text = text.lower()
    # Drop every character outside the whitelist (letters, digits, whitespace, common code symbols)
    text = re.sub(r'[^a-zA-Z0-9\s\(\)\{\}\[\]<>:;=+\-*/&|!.#, _@]', '', text)
    # Remove stopwords
    text = " ".join(word for word in text.split() if word not in STOP_WORDS)
    return text

In [5]:
questions['Body'] = questions['Body'].apply(clean_text)
answers['Body'] = answers['Body'].apply(clean_text)

In [6]:
print(questions.head()['Body'])
print(answers.head()['Body'])

0    using photoshops javascript api find fonts giv...
1    cross-platform (python) application needs gene...
2    im starting work hobby project python codebase...
3            several ways iterate result set. tradeoff
4    dont remember whether dreaming seem recall fun...
Name: Body, dtype: object
0    open terminal (applications->utilities->termin...
1    havent able find anything directly. think youl...
2    use imagemagicks convert utility this, see exa...
3    one possibility hudson. written java, theres i...
4    run buildbot - trac work, havent used much sin...
Name: Body, dtype: object


### Simplify the tags

As a question can have multiple tags, we need to simplify this. We can either choose one tag per question or create a multi-label classifier. For simplicity, we will choose the first tag for each question.

In [7]:
# Group the tags by question ID and keep only the first tag listed for each question
tags = tags.groupby('Id').first().reset_index()
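
To make the effect of this step concrete, here is a small sketch on a made-up tags table (the IDs and tags are invented for illustration). `groupby('Id').first()` keeps whichever tag is listed first for each question, which is not necessarily the most informative one. Collecting the tags into a list instead would be a possible starting point for the multi-label variant.

```python
import pandas as pd

# Toy stand-in for Tags.csv; the values are invented for illustration.
toy_tags = pd.DataFrame({
    'Id':  [1, 1, 2, 3, 3, 3],
    'Tag': ['python', 'django', 'numpy', 'python', 'pandas', 'csv'],
})

# Single-label simplification: keep the first tag listed for each question.
first_tag = toy_tags.groupby('Id').first().reset_index()
print(first_tag)

# Multi-label alternative: keep every tag of a question as a list.
all_tags = toy_tags.groupby('Id')['Tag'].apply(list).reset_index()
print(all_tags)
```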

### Merge the data

We'll merge the questions and tags dataframes on the 'Id' field.

In [8]:
# Merge tags into questions dataframe
data = pd.merge(questions, tags, how='inner', on='Id')

### Create and Train the Classifier

We'll use a simple text classifier. For this, we can use the TF-IDF vectorizer to transform our text data into numerical data and then use a classifier like logistic regression.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Split the data
X_train, X_test, y_train, y_test = train_test_split(data['Body'], data['Tag'], test_size=0.2)

# Create the pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])

# Train the classifier
pipeline.fit(X_train, y_train)

# Validate the classifier
print(f"Accuracy: {pipeline.score(X_test, y_test)}")

Accuracy: 0.948
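
A single accuracy figure is hard to interpret without a reference point, especially if one tag dominates after the tag simplification. One cheap check is to compare against a baseline that always predicts the most frequent tag. The sketch below uses an invented label list rather than the real `y_test`, and `majority_baseline_accuracy` is a hypothetical helper name.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)

# Invented labels for illustration; with the notebook's data this could be list(y_test).
toy_labels = ['python'] * 9 + ['django']
print(majority_baseline_accuracy(toy_labels))
```

If the classifier's accuracy is close to this baseline, it has probably learned little beyond the class distribution.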


### Use the Classifier

Now that we have trained our classifier, we can use it to classify new text and return a dialog from the corresponding tag.

In [10]:
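def get_response(text):
    # Clean the input the same way as the training data, then predict the tag
    tag = pipeline.predict([clean_text(text)])[0]

    # Sample a random (cleaned) question body that carries the predicted tag
    response = data[data['Tag'] == tag]['Body'].sample(1).values[0]

    return response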
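
Because `data` holds question bodies, the function above replies with another question from the predicted tag. If an answer is preferred, one option is to link answers to tags through the 'ParentId' field. The following is a minimal sketch on invented toy tables; `get_answer_response` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for the merged question/tag table and Answers.csv (values invented).
toy_data = pd.DataFrame({
    'Id':   [1, 2, 3],
    'Body': ['read csv file', 'merge two dataframes', 'render django template'],
    'Tag':  ['pandas', 'pandas', 'django'],
})
toy_answers = pd.DataFrame({
    'ParentId': [1, 2, 3],
    'Body':     ['use pd.read_csv', 'use pd.merge', 'use render()'],
})

# Attach to each answer the tag of the question it belongs to.
answers_with_tags = toy_answers.merge(
    toy_data[['Id', 'Tag']], left_on='ParentId', right_on='Id', how='inner'
)

def get_answer_response(tag, random_state=None):
    """Hypothetical helper: sample one answer body for the given tag."""
    candidates = answers_with_tags[answers_with_tags['Tag'] == tag]['Body']
    if candidates.empty:
        return None  # no answer available for this tag
    return candidates.sample(1, random_state=random_state).values[0]

print(get_answer_response('django'))
```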

## 2. As a Generator

Creating a chatbot as a generator is a bit more complex, as we need to create a sequence-to-sequence (seq2seq) network. We'll use the Keras library for this.

### Data Preprocessing

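The preprocessing steps are similar to those for the classifier, but we also need to tokenize the text, that is, map each word to an integer index and pad the resulting sequences to a common length.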

In [11]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Tokenize the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data['Body'])

# Convert text to sequences
sequences = tokenizer.texts_to_sequences(data['Body'])

# Pad sequences
sequences = pad_sequences(sequences)
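
To show what these two steps do, here is a plain-Python approximation of the default behaviour of `Tokenizer` and `pad_sequences`: word ids are assigned by descending frequency starting at 1, and shorter sequences are padded with zeros on the left. It is only a sketch; it ignores the lowercasing and punctuation filtering that the Keras `Tokenizer` also applies, and the helper names are made up.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign integer ids by descending frequency, starting at 1 (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_padded(texts, word_index):
    """Map words to ids, then left-pad with zeros to the longest sequence."""
    seqs = [[word_index[w] for w in t.split() if w in word_index] for t in texts]
    max_len = max(len(s) for s in seqs)
    return [[0] * (max_len - len(s)) + s for s in seqs]

toy_texts = ["read csv file", "read file", "merge"]  # invented examples
index = fit_word_index(toy_texts)
padded = texts_to_padded(toy_texts, index)
print(index)
print(padded)
```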



### Create the Seq2Seq model

We'll create a simple model with stacked LSTM layers. Note that this is a next-word predictor rather than a full encoder-decoder seq2seq network: it reads all tokens of a text except the last one and is trained to predict that last token.

In [12]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding

# Inputs are all tokens but the last; the target is the last token of each text
# (pad_sequences pads on the left by default, so the last column holds the final word)
X = sequences[:, :-1]
y = sequences[:, -1]

# Create the model
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=256))
model.add(LSTM(256, return_sequences=True))
model.add(LSTM(256))
model.add(Dense(vocab_size, activation='softmax'))

# Compile the model; the sparse loss takes integer targets, so no one-hot encoding is needed
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

# Train the model
model.fit(X, y, epochs=10)
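
A full encoder-decoder seq2seq model would be trained on question/answer pairs rather than on a single text column. In that setup the decoder usually receives the answer shifted by one position (teacher forcing). The following is a minimal sketch of that data preparation only, with invented token ids and hypothetical `START`/`END` marker ids that the tokenizer in this notebook does not produce.

```python
START, END = 1, 2  # hypothetical special token ids, assumed for this sketch

def make_decoder_pairs(answer_ids):
    """Return (decoder_input, decoder_target) for one tokenized answer.

    The decoder input begins with START; the target is the same sequence
    shifted left by one step and terminated with END.
    """
    decoder_input = [START] + answer_ids
    decoder_target = answer_ids + [END]
    return decoder_input, decoder_target

toy_answer = [7, 8, 9]  # invented token ids
dec_in, dec_out = make_decoder_pairs(toy_answer)
print(dec_in)   # [1, 7, 8, 9]
print(dec_out)  # [7, 8, 9, 2]
```

At every position the decoder sees the correct previous token and learns to emit the next one, which is what lets it generate a whole reply at inference time.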

### Use the Seq2Seq model

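We can use the trained model to predict the next word for a given input text.

In [None]:
import numpy as np

def generate_text(text):
    # Clean and tokenize the input the same way as the training data
    sequence = tokenizer.texts_to_sequences([clean_text(text)])

    # Pad to the input length the model was built with (this also adds the batch dimension)
    padded = pad_sequences(sequence, maxlen=model.input_shape[1])

    # predict_classes was removed from recent Keras versions; take the argmax instead
    probabilities = model.predict(padded)
    index = int(np.argmax(probabilities, axis=-1)[0])

    # Map the index back to a word (index 0 is padding and has no word)
    return tokenizer.index_word.get(index, '')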
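
A single predicted word is rarely a useful reply. One common way to get longer output is to append the predicted word to the input and predict again. The sketch below shows that loop with a stand-in predictor based on an invented lookup table. `generate_sequence` and `toy_predict` are hypothetical names, and in the notebook a function such as `generate_text` could be passed as `predict_next`.

```python
def generate_sequence(seed_words, predict_next, max_words=5, stop_word=None):
    """Repeatedly append the predicted next word to the running text.

    predict_next is any callable that maps a string to a single word.
    """
    words = list(seed_words)
    for _ in range(max_words):
        next_word = predict_next(" ".join(words))
        if not next_word or next_word == stop_word:
            break  # nothing predicted or end marker reached
        words.append(next_word)
    return " ".join(words)

# Stand-in predictor for illustration: a fixed lookup table instead of a trained model.
toy_table = {"read": "csv", "csv": "file", "file": ""}

def toy_predict(text):
    return toy_table.get(text.split()[-1], "")

print(generate_sequence(["read"], toy_predict))
```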