# Recurrent Neural Networks

In this task you are supposed to implement a chatbot in two ways:
1. As a classifier
2. As a generator

- Download the Python Questions from Stack Overflow dataset https://www.kaggle.com/stackoverflow/pythonquestions
- Make the chatbot so that you classify a category (i.e., tag) of input text,
and return a dialog from the correct class. Note that one question could
have multiple tags and you may need to simplify.
- Alternatively, make a sequence to sequence network that automatically
learns what to respond. It can be character based or word based.
- Hint: Start with a subset of the dataset
- Choose the network architecture with care.
- Train and validate all algorithms.
- Make the necessary assumptions.

*Mostly done with ChatGPT, because I don't know what I'm doing..*

## 1. Understand and Prepare the Dataset

The data consists of three files:

- `Questions.csv`: Contains information about the questions asked on Stack Overflow. The 'Body' field contains the HTML of the question.
- `Answers.csv`: Contains information about the answers to the questions. The 'ParentId' field maps to a question.
- `Tags.csv`: Contains the tags associated with each question. The 'Id' field here corresponds to the 'Id' in the Questions.csv file.
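
The relationships between the three files can be sketched with a few toy rows. This is only an illustration: the `Title` and `Score` columns and all values are made up for the example, and only the `Id`, `ParentId` and `Tag` columns are taken from the description above.

```python
import pandas as pd

# Toy stand-ins for the three CSV files (all values are made up)
questions = pd.DataFrame({"Id": [1, 2],
                          "Title": ["How to slice a list?", "What is a decorator?"]})
answers = pd.DataFrame({"Id": [10, 11, 12],
                        "ParentId": [1, 1, 2],
                        "Score": [5, 2, 7]})
tags = pd.DataFrame({"Id": [1, 1, 2],
                     "Tag": ["python", "list", "python"]})

# Answers.ParentId points at Questions.Id, so each answer row is
# attached to the question it belongs to
qa = questions.merge(answers, left_on="Id", right_on="ParentId",
                     suffixes=("_question", "_answer"))

# Tags.Id points at Questions.Id; collect each question's tags into a list
tags_per_question = tags.groupby("Id")["Tag"].apply(list)

print(qa[["Id_question", "Id_answer", "Score"]])
print(tags_per_question)
```

Because both frames have an `Id` column, the merge renames them to `Id_question` and `Id_answer`, which keeps the two kinds of identifier apart.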


In [1]:
import pandas as pd

# Large subsets
questions_df = pd.read_csv('data/Questions.csv', encoding='latin1', nrows=20000)
answers_df = pd.read_csv('data/Answers.csv', encoding='latin1', nrows=30000)
tags_df = pd.read_csv('data/Tags.csv', encoding='latin1')

# Small subsets
# questions_df = pd.read_csv('data/Questions.csv', encoding='latin1', nrows=1000)
# answers_df = pd.read_csv('data/Answers.csv', encoding='latin1', nrows=2000)
# tags_df = pd.read_csv('data/Tags.csv', encoding='latin1', nrows=1000)

In [2]:
# Print the number of rows loaded from each file
print("Questions: ", len(questions_df))
print("Answers: ", len(answers_df))
print("Tags: ", len(tags_df))

Questions:  20000
Answers:  30000
Tags:  1885078


## 2. Data Preprocessing

See https://www.kaggle.com/code/nicolaswattiez/stackoverflow-python-preprocess

### 2.1. Text Cleaning

The 'Body' field in both the Questions and Answers datasets contains HTML. We need to strip the HTML tags and clean the text. Cleaning here means converting the text to lowercase, removing characters outside an allowed set (the set keeps punctuation that is common in code, such as brackets and operators), and removing stop words (common words like 'is', 'the', 'and', which add little information for the model).

In [3]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/markus/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

number_of_characters = 86 # valid characters allowed from regex

# Build the stop word set once; calling stopwords.words() for every word is very slow
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    # Remove HTML tags
    text = BeautifulSoup(text, "lxml").text
    # Convert to lowercase
    text = text.lower()
    # Drop characters outside the allowed set (e.g. apostrophes, quotes, '?');
    # code-related punctuation such as brackets and operators is kept
    text = re.sub(r'[^a-zA-Z0-9\s\(\)\{\}\[\]<>:;=+\-*/&|!.#, _@]', '', text)
    # Remove stopwords
    text = " ".join(word for word in text.split() if word not in STOP_WORDS)
    return text

In [5]:
questions_df['Body'] = questions_df['Body'].apply(clean_text)
answers_df['Body'] = answers_df['Body'].apply(clean_text)

In [6]:
print(questions_df.head()['Body'])
print(answers_df.head()['Body'])

0    using photoshops javascript api find fonts giv...
1    cross-platform (python) application needs gene...
2    im starting work hobby project python codebase...
3            several ways iterate result set. tradeoff
4    dont remember whether dreaming seem recall fun...
Name: Body, dtype: object
0    open terminal (applications->utilities->termin...
1    havent able find anything directly. think youl...
2    use imagemagicks convert utility this, see exa...
3    one possibility hudson. written java, theres i...
4    run buildbot - trac work, havent used much sin...
Name: Body, dtype: object
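
The output above contains words such as "im", "dont" and "photoshops". A small standalone sketch shows where they come from: the character filter drops apostrophes before the stop words are removed. It reuses the regex from `clean_text`, but the three-word stop list and the helper name `clean_plain_text` are assumptions made for this example (NLTK's English list is much longer), and HTML stripping is left out.

```python
import re

# Same allowed-character set as in clean_text above
DISALLOWED = re.compile(r'[^a-zA-Z0-9\s\(\)\{\}\[\]<>:;=+\-*/&|!.#, _@]')

# Tiny illustrative stop word list (not NLTK's full English list)
STOP_WORDS = {"is", "the", "and"}

def clean_plain_text(text):
    """Lowercase, drop disallowed characters, then drop stop words."""
    text = DISALLOWED.sub('', text.lower())
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

sample = "I'm sure the answer is `x[0]`, don't you think?"
cleaned = clean_plain_text(sample)
print(cleaned)
```

The apostrophes, backticks and question mark disappear, while the brackets and comma survive, so "I'm" becomes "im" and "don't" becomes "dont".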


### 2.2. Handling Tags

A question can have multiple tags, but for simplicity, you might want to assign it to just one category. You can choose the most frequent tag, or a tag based on the content of the question.

In [7]:
# Convert 'Tag' to string and then group by 'Id', joining all tags for a question
grouped_tags = tags_df['Tag'].astype(str).groupby(tags_df['Id']).apply(lambda tags: ' '.join(tags))

# Merge tags into questions dataframe
questions_df = questions_df.merge(grouped_tags, how='inner', on='Id')

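# Function to reduce a question's tags to a single tag.
# Note: each tag occurs only once within a question, so every count is 1
# and max() effectively returns an arbitrary tag from the set.
def most_common_tag(tags):
    tags_list = tags.split()
    return max(set(tags_list), key=tags_list.count)

# Apply the function to keep one tag per question
questions_df['Tag'] = questions_df['Tag'].apply(most_common_tag)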
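
One possible way to make "most frequent tag" meaningful is to count tags across the whole dataset instead of within a single question. The sketch below does that on made-up data. The `pick_tag` helper and its `ignore` parameter are hypothetical, and skipping `python` rests on the assumption that nearly every question in this dataset carries that tag, which would otherwise collapse everything into one class.

```python
from collections import Counter

# Hypothetical question Id -> tags mapping (made-up values)
question_tags = {
    1: ["python", "django", "orm"],
    2: ["python", "django"],
    3: ["python", "numpy"],
}

# Count how often each tag occurs across the whole dataset
tag_counts = Counter(tag for tags in question_tags.values() for tag in tags)

def pick_tag(tags, counts, ignore=("python",)):
    """Return the globally most frequent tag, skipping tags in `ignore`."""
    # Fall back to the full tag list if every tag is ignored
    candidates = [t for t in tags if t not in ignore] or list(tags)
    # Ties are broken alphabetically so the result is reproducible
    return min(candidates, key=lambda t: (-counts[t], t))

single_tag = {qid: pick_tag(tags, tag_counts)
              for qid, tags in question_tags.items()}
print(single_tag)
```

Choosing globally frequent tags also tends to reduce the number of distinct classes, which makes the classification problem easier.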

### 2.3. Tokenization

Tokenization is the process of splitting the text into individual words or tokens. Each token is then mapped to an integer index, which turns the text into sequences of integers that can be fed into the model.
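
The idea can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: the Keras `Tokenizer` additionally lowercases the text and filters punctuation by default.

```python
from collections import Counter

texts = ["python list slice", "python dict", "list of dict"]

# Count words and give the most frequent words the smallest indices.
# Index 0 is left unused so it can serve as the padding value later.
counts = Counter(word for text in texts for word in text.split())
word_index = {word: i
              for i, (word, _) in enumerate(counts.most_common(), start=1)}

# Replace every word by its index
sequences = [[word_index[word] for word in text.split()] for text in texts]

print(word_index)
print(sequences)
```

The vocabulary size passed to an embedding layer is then `len(word_index) + 1`, the extra slot being the reserved index 0.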

In [8]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Initialize a tokenizer
tokenizer = Tokenizer()

# Fit it to the questions data
tokenizer.fit_on_texts(questions_df['Body'])

# Tokenize the text
questions_df['Body'] = tokenizer.texts_to_sequences(questions_df['Body'])


## 3. Implement the classifier

The idea here is to train a model to predict the tag of a question based on its content. You can use a recurrent neural network (RNN) architecture for this, as it's good at handling sequential data like text.
You will need to convert your text and tags into numerical format for training. This can involve techniques like one-hot encoding or word embedding.
Split your data into a training set and a validation set.
Train your RNN on the training data and validate it on the validation set.

### 3.1. Prepare the Target Variable

The target variable is the tag of each question. You need to convert these tags into a numerical format that can be used to train the model. One common approach is one-hot encoding, which converts each category into a binary vector.

In [9]:
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

# Initialize and fit the label encoder
le = LabelEncoder()
le.fit(questions_df['Tag'])

# Transform the tags into integers
questions_df['Tag'] = le.transform(questions_df['Tag'])

# One-hot encode the tags
tags_encoded = to_categorical(questions_df['Tag'])
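
What the two encoding steps do can be shown with NumPy on a handful of made-up tags. This is a sketch of the idea, not the scikit-learn or Keras code itself.

```python
import numpy as np

tags = ["django", "numpy", "django", "pandas"]

# Label encoding: map each distinct tag to an integer (sorted order)
classes = sorted(set(tags))
class_to_id = {tag: i for i, tag in enumerate(classes)}
ids = np.array([class_to_id[tag] for tag in tags])

# One-hot encoding: row i has a single 1 in column ids[i]
one_hot = np.eye(len(classes), dtype="float32")[ids]

# Decoding a (here: perfect) prediction back to a tag name
predicted = classes[int(np.argmax(one_hot[1]))]

print(ids)
print(one_hot)
print(predicted)
```

With thousands of classes the one-hot matrix becomes large. Keras also offers the `sparse_categorical_crossentropy` loss, which accepts the integer labels directly and may be worth considering here.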

### 3.2. Prepare the Training and Validation Sets

You need to split your data into a training set and a validation set. A common split is 80% of the data for training and 20% for validation.

In [10]:
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_val, y_train, y_val = train_test_split(questions_df['Body'], tags_encoded, test_size=0.2, random_state=42)
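
Conceptually, the split shuffles the row indices and cuts them at the 80% mark. The sketch below shows this with the standard library only. The helper `split_indices` is hypothetical and does not reproduce the exact shuffle that `train_test_split` produces for the same seed.

```python
import random

def split_indices(n_samples, val_fraction=0.2, seed=42):
    """Shuffle the indices 0..n_samples-1 and split them into train/validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_val = int(round(n_samples * val_fraction))
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = split_indices(10)
print(len(train_idx), len(val_idx))
```

Fixing the seed matters when comparing models: every run then validates on the same held-out questions.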

### 3.3. Padding the Sequences

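Neural networks require all inputs in a batch to have the same length. The `pad_sequences` function from Keras pads shorter sequences with zeros (at the front, by default) so that all sequences have the same length. Without a `maxlen` argument, the target length is that of the longest sequence.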

In [11]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad the sequences
X_train = pad_sequences(X_train)
X_val = pad_sequences(X_val, maxlen=X_train.shape[1])
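
The default behaviour of `pad_sequences` (pad at the front, truncate at the front) can be mimicked in a few lines of plain Python. The helper `pad_pre` is a hypothetical stand-in written for this illustration and returns lists rather than a NumPy array.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) sequences to a common length."""
    if maxlen is None:
        # Default: use the length of the longest sequence
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Keep only the last `maxlen` tokens of sequences that are too long
        seq = list(seq)[-maxlen:] if maxlen > 0 else []
        # Fill the front of short sequences with the padding value
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

batch = [[5, 3], [7, 1, 4, 2], [9]]
print(pad_pre(batch))
print(pad_pre(batch, maxlen=3))
```

In this notebook the longest question determines the padded length, so a single very long question makes every input that long. Passing an explicit `maxlen` to `pad_sequences` caps the length, at the cost of truncating the longest questions.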

### 3.4. Build the LSTM Model

In [12]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Initialize the model
model = Sequential()

# Add an Embedding layer
model.add(Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=1000, input_length=X_train.shape[1]))

# Add an LSTM layer
model.add(LSTM(128))

# Add a Dense layer
model.add(Dense(tags_encoded.shape[1], activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Summary of the model
model.summary()

2023-11-29 17:13:32.757853: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:274] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2023-11-29 17:13:32.797129: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 245784000 exceeds 10% of free system memory.
2023-11-29 17:13:32.881550: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 245784000 exceeds 10% of free system memory.
2023-11-29 17:13:32.922928: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 245784000 exceeds 10% of free system memory.


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 3953, 1000)        61446000  
                                                                 
 lstm (LSTM)                 (None, 128)               578048    
                                                                 
 dense (Dense)               (None, 2612)              336948    
                                                                 
Total params: 62360996 (237.89 MB)
Trainable params: 62360996 (237.89 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
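
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic. The vocabulary size of 61446 is not printed anywhere in the notebook; it is inferred from the embedding layer's 61446000 parameters divided by the embedding dimension of 1000.

```python
vocab_size = 61446   # len(tokenizer.word_index) + 1, inferred from the summary
embed_dim = 1000     # output_dim of the Embedding layer
lstm_units = 128
n_classes = 2612     # width of the final Dense layer

# Embedding: one vector of size embed_dim per vocabulary entry
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense: weight matrix plus one bias per class
dense_params = lstm_units * n_classes + n_classes

total = embedding_params + lstm_params + dense_params
size_mb = total * 4 / 1024**2  # 4 bytes per float32 parameter

print(embedding_params, lstm_params, dense_params, total, round(size_mb, 2))
```

Almost all parameters sit in the embedding layer, so reducing `output_dim` or limiting the vocabulary size would shrink the model the most.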


### 3.5. Train the Model

Finally, the model can be trained using the training data, and validated using the validation data.

In [13]:
# Train the model
model.fit(X_train, y_train, epochs=5, validation_data=(X_val, y_val))

2023-11-29 17:13:33.817076: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 252992000 exceeds 10% of free system memory.


Epoch 1/5


2023-11-29 17:13:34.489378: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 167168000 exceeds 10% of free system memory.


Epoch 2/5


KeyboardInterrupt

