<a href="https://colab.research.google.com/github/UPstartDeveloper/DS-2.4-Advanced-Topics/blob/main/Notebooks/NLP/Stack_Overflow_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What Language is It? 
## Text Classification of Stack Overflow Questions, using Keras 

Adapted from the exercise at the bottom of [this tutorial](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/keras/text_classification.ipynb) on binary text classification from TensorFlow. The original notebook stated:

> ### Exercise: multiclass classification on Stack Overflow questions

> This tutorial showed how to train a binary classifier from scratch on the IMDB dataset. As an exercise, you can modify this notebook to train a multiclass classifier to predict the tag of a programming question on [Stack Overflow](http://stackoverflow.com/).

> We have prepared a [dataset](http://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz) for you to use containing the body of several thousand programming questions (for example, "How can sort a dictionary by value in Python?") posted to Stack Overflow. Each of these is labeled with exactly one tag (either Python, CSharp, JavaScript, or Java). Your task is to take a question as input, and predict the appropriate tag, in this case, Python. 

> The dataset you will work with contains several thousand questions extracted from the much larger public Stack Overflow dataset on [BigQuery](https://console.cloud.google.com/marketplace/details/stack-exchange/stack-overflow), which contains more than 17 million posts.

> After downloading the dataset, you will find it has a similar directory structure to the IMDB dataset you worked with previously:

> ```
> train/
> ...python/
> ......0.txt
> ......1.txt
> ...javascript/
> ......0.txt
> ......1.txt
> ...csharp/
> ......0.txt
> ......1.txt
> ...java/
> ......0.txt
> ......1.txt
> ```

> Note: to increase the difficulty of the classification problem, we have replaced any occurrences of the words Python, CSharp, JavaScript, or Java in the programming questions with the word *blank* (as many questions contain the language they're about).

> To complete this exercise, you should modify this notebook to work with the Stack Overflow dataset by making the following modifications:

> 1. At the top of your notebook, update the code that downloads the IMDB dataset with code to download the [Stack Overflow dataset](http://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz) we have prepared. As the Stack Overflow dataset has a similar directory structure, you will not need to make many modifications.

> 2. Modify the last layer of your model to read `Dense(4)`, as there are now four output classes.

> 3. When you compile your model, change the loss to `losses.SparseCategoricalCrossentropy`. This is the correct loss function to use for a multiclass classification problem, when the labels for each class are integers (in our case, they can be 0, *1*, *2*, or *3*).

> 4. Once these changes are complete, you will be able to train a multiclass classifier. 

> If you get stuck, you can find a solution [here](https://github.com/tensorflow/examples/blob/master/community/en/text_classification_solution.ipynb).
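
To make step 3 of the exercise more concrete, here is a minimal plain-Python sketch of what a sparse categorical cross-entropy loss computes for one example: the negative log-probability that the model assigns to the true class, where the label is an integer index rather than a one-hot vector. The `logits` values are made up for illustration, and the helper functions are hypothetical stand-ins, not the Keras implementation.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(label, logits):
    """Loss for one example: negative log-probability of the true class.

    `label` is an integer class index (0..3 here), not a one-hot vector.
    """
    probs = softmax(logits)
    return -math.log(probs[label])

# Hypothetical raw scores from a Dense(4) output layer for a single question.
logits = [2.0, 0.5, -1.0, 0.1]

# The loss is small when the true class has the highest score...
loss_correct = sparse_categorical_crossentropy(0, logits)
# ...and large when the true class has the lowest score.
loss_wrong = sparse_categorical_crossentropy(2, logits)
print(loss_correct < loss_wrong)
```

Because a bare `Dense(4)` layer outputs raw scores rather than probabilities, the Keras loss would typically be created as `losses.SparseCategoricalCrossentropy(from_logits=True)` so that the softmax step shown above is applied inside the loss.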


## Importing Packages

In [3]:
import matplotlib.pyplot as plt
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import preprocessing
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

## What Version of TensorFlow is This?

In [4]:
print(tf.__version__)

2.4.0


## Download and Explore the Stack Overflow Dataset

In [7]:
url = "http://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz"

# download the archive into the current directory and extract it there
dataset = tf.keras.utils.get_file("stack_overflow_16k.tar.gz", url,
                                    untar=True, cache_dir='.',
                                    cache_subdir='')

# directory that holds the downloaded archive and the extracted train/ and test/ folders
dataset_dir = os.path.dirname(dataset)

In [8]:
# list everything in the directory the tar.gz archive was extracted into
os.listdir(dataset_dir)

['.config',
 'stack_overflow_16k.tar.gz.tar.gz',
 'test',
 'train',
 'README.md',
 'sample_data']

In [9]:
# do the same just for the training data directory
train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)

['java', 'csharp', 'javascript', 'python']

**Example Data**:
Let's take a look at one of the questions for Python.

In [10]:
sample_file = os.path.join(train_dir, 'python/102.txt')
with open(sample_file) as f:
  print(f.read())

"why not use self.varibale_name while calling instance attribute when we create a class which is having instance attribute,why don't we call the instance attribute using self keyword.please..class car:.    def __init__(self, type, color):.        self.type = type.        self.color = color...c1 = car('suv','red').print(c1.type)...why not print(c1.self.type), because self.type is the actual attribute..got the following error:..attributeerror: 'car' object has no attribute 'self'"



## Load the dataset

In this step, we'll prepare the data for text classification by:

1.   Loading the data using `tf.data.Dataset` ([docs](https://www.tensorflow.org/guide/data))
2.   Splitting the data into training, validation, and testing sets
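
The two steps above can be sketched in plain Python to show the idea. This is an illustration under stated assumptions, not the notebook's actual loading code: in Keras, `tf.keras.preprocessing.text_dataset_from_directory` performs a similar job, inferring integer labels from the alphabetically sorted subdirectory names and accepting `validation_split`, `subset`, and `seed` arguments. The helpers `load_labeled_texts` and `split_train_val` are hypothetical names, and the files are placeholders written to a temporary directory so the sketch runs without the real download.

```python
import os
import random
import tempfile

def load_labeled_texts(root):
    """Read every .txt file under root/<class_name>/ and pair it with an integer label."""
    # Labels follow the sorted order of the class subdirectories.
    class_names = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    examples = []
    for label, name in enumerate(class_names):
        class_dir = os.path.join(root, name)
        for fname in sorted(os.listdir(class_dir)):
            if fname.endswith(".txt"):
                with open(os.path.join(class_dir, fname), encoding="utf-8") as f:
                    examples.append((f.read(), label))
    return examples, class_names

def split_train_val(examples, validation_split=0.2, seed=42):
    """Shuffle with a fixed seed, then hold out the last fraction for validation."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_split)
    cut = len(shuffled) - n_val
    return shuffled[:cut], shuffled[cut:]

# Build a tiny stand-in for train/ (placeholder text, not real questions).
with tempfile.TemporaryDirectory() as root:
    for name in ["python", "javascript", "csharp", "java"]:
        os.makedirs(os.path.join(root, name))
        for i in range(5):
            path = os.path.join(root, name, f"{i}.txt")
            with open(path, "w", encoding="utf-8") as f:
                f.write(f"placeholder question {i} about {name}")
    examples, class_names = load_labeled_texts(root)

train_examples, val_examples = split_train_val(examples)
print(class_names)
print(len(train_examples), len(val_examples))
```

Since the downloaded archive already ships a separate `test` directory, only the `train` directory needs to be divided into training and validation subsets. Using a fixed seed keeps the two subsets disjoint and reproducible across runs.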



## Preprocess the dataset