# Lab assignment: news article classification

<img src="img/news.jpg" style="width:800px;">

In this assignment we tackle a text classification problem: categorizing news articles based on their headlines and short descriptions. The dataset contains around 200k news headlines published between 2012 and 2018, obtained from [HuffPost](https://www.huffpost.com/) and classified into 41 different categories.

## Task summary

- Classify each news article into one of the 10 most common categories and follow the instructions to answer the questions. Use only the training data to fit the model. You can use the test set to measure the quality of the classification model, but only for evaluation: the test data must not be used for cross-validation or to fit text vectorizers.
- The objective is to obtain **the best ROC AUC score** on the test set. In this kind of imbalanced problem, the ROC AUC score is preferable because it takes all classes into account.
- You can use any library and model explained in the course. The **quality of the code** will be evaluated.
- The deliverable is a single Jupyter notebook containing all the code. It should run in the course Anaconda deeplearning environment.
- Send the notebook named **homework\_[name]\_[surname]-text-mining.ipynb**.


## Guidelines

Throughout this notebook you will find empty cells that you will need to fill with your own code. Follow the instructions in the notebook and pay special attention to the following symbols.

<table align="left">
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td>You will need to solve a question by writing your own code or answer in the cell immediately below or in a different file, as instructed.</td></tr>
 <tr><td width="80"><img src="img/exclamation.png" style="width:auto;height:auto"></td><td>This is a hint or useful observation that can help you solve this assignment. You should pay attention to these hints to better understand the assignment.</td></tr>
 <tr><td width="80"><img src="img/pro.png" style="width:auto;height:auto"></td><td>This is an advanced and voluntary exercise that can help you gain deeper knowledge of the topic. Good luck!</td></tr>
</table>

To do the task, you should use the **conda environment used for deeplearning classes**. To install it:

    conda env create -f environment-deeplearning.yml -n deeplearning-labs
    conda activate deeplearning-labs

After installing it, **make sure the Jupyter kernel is set to this environment**.

*(optional)* If you would like to use additional Python packages that might not be installed in this conda environment, you can install new Python packages after you have activated it with

    conda install PACKAGENAME
    
if the package is available in the Anaconda repository. Otherwise, use

    pip install PACKAGENAME
    
If that is the case, you have to list below all the new packages used and the versions installed:

* Package1: ...
* ...

The following code will embed any plots into the notebook instead of generating a new window:

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

Lastly, if you need help with the usage of a Python function, place the cursor over its name and press Shift+Tab to open a pop-up with the related documentation. This only works inside code cells.

Let's go!

## Preliminaries: data loading 

In this assignment we will work with the news articles data contained in the following file:

In [2]:
data = "./data/News_Category_Dataset_v2.json"

<table align="left">
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td>
     Load the data into a pandas DataFrame named <b>df_news</b>.
 </td></tr>
</table>

<table align="left">
 <tr><td width="80"><img src="img/exclamation.png" style="width:auto;height:auto"></td><td>
Take advantage of the pandas function that reads a JSON file into a DataFrame.
 </td></tr>
</table>
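
As a minimal sketch of the hint above, and assuming the file stores one JSON object per line (the JSON Lines layout, which is an assumption about this dataset and is not stated in the assignment), `pandas.read_json` with `lines=True` could parse it. The toy records and the in-memory buffer are purely illustrative stand-ins for the real file.

```python
import io

import pandas as pd

# Hypothetical toy records standing in for the real file
# (assumption: one JSON object per line).
toy_json = io.StringIO(
    '{"category": "POLITICS", "headline": "A", "short_description": "x"}\n'
    '{"category": "TRAVEL", "headline": "B", "short_description": "y"}\n'
)

# lines=True tells pandas to parse each line as a separate record.
df_toy = pd.read_json(toy_json, lines=True)
print(df_toy.shape)
```

If the real file turned out to be a single JSON array instead, dropping `lines=True` would be the variant to try.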

In [None]:
#### INSERT YOUR CODE HERE

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">
      Explore the dataset and show how many news articles there are in each category. Then create a variable called <b>commoncat</b> containing a list of the 10 most common categories, and filter <b>df_news</b> to keep only these categories, naming the resulting DataFrame <b>df_filter</b>.
  </td>
 </tr> 
</table>
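
One possible way to count and filter categories, sketched on a hypothetical toy frame (the names `df_toy`, `top_k` and `df_top` are illustrative and not part of the assignment), relies on `value_counts` and `isin`:

```python
import pandas as pd

# Hypothetical toy frame; the real data has many more rows and categories.
df_toy = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B", "C"],
    "headline": ["h1", "h2", "h3", "h4", "h5", "h6"],
})

# Number of articles per category, sorted from most to least frequent.
counts = df_toy["category"].value_counts()

# Keep the k most frequent categories (k=2 here; the assignment asks for 10).
top_k = counts.head(2).index.tolist()
df_top = df_toy[df_toy["category"].isin(top_k)]
print(counts)
```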

In [None]:
#### INSERT YOUR CODE HERE

<table align="left">
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">
      Remove all duplicated rows that share the same <b>headline</b> and <b>short_description</b> values in <b>df_filter</b>.
  </td>
 </tr> 
</table>
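
A small illustration of column-restricted deduplication, assuming a hypothetical toy frame: `DataFrame.drop_duplicates` accepts a `subset` argument, so rows are compared only on the listed columns and the first occurrence is kept by default.

```python
import pandas as pd

# Hypothetical toy frame with one repeated (headline, short_description) pair.
df_toy = pd.DataFrame({
    "headline": ["h1", "h1", "h1"],
    "short_description": ["d1", "d1", "d2"],
    "category": ["A", "B", "A"],
})

# Rows count as duplicates when both listed columns match,
# regardless of the values in the other columns.
df_dedup = df_toy.drop_duplicates(subset=["headline", "short_description"])
print(len(df_toy), "->", len(df_dedup))
```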

In [None]:
#### INSERT YOUR CODE HERE

<table align="left">
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">      Create a variable called <b>X</b> with all the values of the <b>headline</b> and <b>short_description</b> columns and a variable called <b>y</b> with the values of the <b>category</b> column.
  </td>
 </tr> 
</table>

In [None]:
#### INSERT YOUR CODE HERE

<table align="left">
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">      Next, we separate the data into one set to train the model and another to evaluate its predictions. Split X and y into random train and test subsets, calling them <b>X_train</b>, <b>X_test</b>, <b>y_train</b> and <b>y_test</b> respectively. Use the random seed and test size given below:
  </td>
 </tr> 
</table>

<table align="left">
 <tr><td width="80"><img src="img/exclamation.png" style="width:auto;height:auto"></td><td>
scikit-learn can help you!
 </td></tr>
</table>
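
As a hedged sketch on hypothetical toy data, `sklearn.model_selection.train_test_split` produces the four subsets in one call. The variable names below are illustrative; the seed and test size mirror the values given in the assignment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy inputs: 10 articles with two text columns and a label.
X_toy = pd.DataFrame({
    "headline": [f"headline {i}" for i in range(10)],
    "short_description": [f"description {i}" for i in range(10)],
})
y_toy = pd.Series(["A", "B"] * 5)

# A fixed random_state makes the split reproducible across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)
print(len(X_tr), len(X_te))
```

The function also accepts a `stratify` argument to preserve class proportions in both subsets; whether to use it here is a design choice the assignment does not specify.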

In [None]:
#### INSERT YOUR CODE HERE
random_state=42
test_size=0.3

<table align="left">
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">      To feed the category labels to the model, they must be encoded numerically. Encode <b>y_train</b> first and then <b>y_test</b> as numeric values.
  </td>
 </tr> 
</table>

<table align="left">
 <tr><td width="80"><img src="img/exclamation.png" style="width:auto;height:auto"></td><td>
scikit-learn can help you! You may need to transform the labels in two different ways, depending on the model used and on how the ROC AUC score is computed.
 </td></tr>
</table>
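
The two label formats mentioned in the hint could look as follows. This is a sketch under assumptions: the toy labels and the predicted probabilities are invented for illustration, and the choice between integer codes and a one-hot matrix depends on the model you end up using.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

# Hypothetical toy labels.
y_train_toy = ["POLITICS", "TRAVEL", "SPORTS", "TRAVEL"]
y_test_toy = ["SPORTS", "POLITICS", "TRAVEL"]

# Integer codes: fit on the training labels only, then reuse the mapping.
le = LabelEncoder()
y_train_int = le.fit_transform(y_train_toy)
y_test_int = le.transform(y_test_toy)

# One-hot matrix, the format typically expected by a softmax output layer.
lb = LabelBinarizer()
lb.fit(y_train_toy)
y_test_onehot = lb.transform(y_test_toy)

# Invented predicted probabilities; columns follow le.classes_
# (alphabetical order) and every row sums to 1.
proba = np.array([
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
])

# Multi-class form: integer labels plus an explicit one-vs-rest strategy.
auc_from_int = roc_auc_score(y_test_int, proba, multi_class="ovr")
# Indicator form: the one-hot matrix is scored column by column.
auc_from_onehot = roc_auc_score(y_test_onehot, proba, average="macro")
print(auc_from_int, auc_from_onehot)
```

Both calls compute a macro-averaged one-vs-rest score here, so they agree on this toy example; the point is only that the label format passed as `y_true` determines which arguments `roc_auc_score` needs.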

In [None]:
#### INSERT YOUR CODE HERE

Now, we are ready to train the model!

## Model based on characters

<table>
<tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">Build a classification model using the training data. Use only features based on the <b>characters</b> in the text. You can try different vectorizers and combinations of parameters. Evaluate the model on the test set. What ROC AUC score can you achieve?</td></tr>
</table>

<table>
<tr><td width="80"><img src="img/exclamation.png" style="width:auto;height:auto"></td><td style="text-align:left">Note that each news article includes two text fields, <i>headline</i> and <i>short_description</i>. It is recommended that you build a model that analyzes both texts to make the decision. You can build a Pipeline that takes both data inputs into account using <a href=https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html>ColumnTransformer</a>, or join them into a single string.</td></tr>
</table>
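
To illustrate the `ColumnTransformer` option from the hint, here is a minimal sketch on hypothetical toy data. The vectorizer settings (`char_wb`, n-grams of length 2 to 4) and the logistic regression classifier are assumptions chosen for brevity, not recommended values.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data with the two text columns and binary labels.
X_toy = pd.DataFrame({
    "headline": ["Senate passes budget", "Team wins final",
                 "Election results in", "Coach praises squad"],
    "short_description": ["Lawmakers vote on spending",
                          "A late goal decided the match",
                          "Voters chose a new senate",
                          "The team trained hard for the match"],
})
y_toy = [0, 1, 0, 1]

# One character n-gram vectorizer per text column. Selecting a column by a
# plain string passes a 1-D Series, which is what the vectorizer expects.
char_features = ColumnTransformer([
    ("headline", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
     "headline"),
    ("description", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
     "short_description"),
])

char_model = Pipeline([
    ("features", char_features),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Fitting the whole pipeline keeps the vectorizers tied to the training data.
char_model.fit(X_toy, y_toy)
proba = char_model.predict_proba(X_toy)
print(proba.shape)
```

Because the vectorizers live inside the pipeline, calling `predict_proba` on held-out data only transforms it, which is consistent with the rule that test data must not be used to fit vectorizers.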

In [None]:
#### INSERT YOUR CODE HERE

## Model based on tokens (words)

<table>
<tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">Build a classification model using the training data. Use only features based on the <b>tokens</b> of the text. You can try different vectorizers and combinations of parameters. Evaluate the model on the test set. What ROC AUC score can you achieve?</td></tr>
</table>

<table>
<tr><td width="80"><img src="img/exclamation.png" style="width:auto;height:auto"></td><td style="text-align:left">Note that each news article includes two text fields, <i>headline</i> and <i>short_description</i>. It is recommended that you build a model that analyzes both texts to make the decision. You can build a Pipeline that takes both data inputs into account using <a href=https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html>ColumnTransformer</a>, or join them into a single string.</td></tr>
</table>
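
The other option in the hint, joining both fields into a single string, might be sketched like this. The toy data, the `join_text` helper and the unigram-plus-bigram setting are hypothetical choices for illustration only.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data with the two text columns and binary labels.
X_toy = pd.DataFrame({
    "headline": ["Senate passes budget", "Team wins final",
                 "Election results in", "Coach praises squad"],
    "short_description": ["Lawmakers vote on spending",
                          "A late goal decided the match",
                          "Voters chose a new senate",
                          "The team trained hard for the match"],
})
y_toy = [0, 1, 0, 1]


def join_text(frame):
    """Concatenate both text fields, treating missing values as empty."""
    return (frame["headline"].fillna("") + " "
            + frame["short_description"].fillna(""))


# Word-level features: unigrams and bigrams over the joined text.
word_model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
word_model.fit(join_text(X_toy), y_toy)

vocabulary = word_model.named_steps["tfidf"].vocabulary_
proba = word_model.predict_proba(join_text(X_toy))
print(len(vocabulary), proba.shape)
```

Joining the fields gives one shared vocabulary, whereas separate vectorizers per column let the model weight a word differently depending on the field it appears in.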

In [None]:
#### INSERT YOUR CODE HERE

## Model based on morphosyntactic analysis

<table>
<tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">Build a classification model using the training data. Use some form of morphosyntactic analysis (such as n-grams with lemmas, or filters for POS or stopwords). Evaluate the model on the test set. What ROC AUC score can you achieve?</td></tr>
</table>

<table>
<tr><td width="80"><img src="img/exclamation.png" style="width:auto;height:auto"></td><td style="text-align:left">Note that each news article includes two text fields, <i>headline</i> and <i>short_description</i>. It is recommended that you build a model that analyzes both texts to make the decision. You can build a Pipeline that takes both data inputs into account using <a href=https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html>ColumnTransformer</a>, or join them into a single string.</td></tr>
</table>

<table>
<tr><td width="80"><img src="img/exclamation.png" style="width:auto;height:auto"></td><td style="text-align:left">
Remember that it is possible to speed up morphosyntactic analysis by deactivating certain components of the spaCy nlp model. Check the notebook for the corresponding exercise to remember how.</td></tr>
</table>

In [None]:
#### INSERT YOUR CODE HERE

## Deep Learning model

<table>
<tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">Build a classification model using the training data. Use an approach based on embeddings combined with some kind of neural architecture (CNN, LSTM, GRU, ...). Evaluate the model on the test set. What ROC AUC score can you achieve?</td></tr>
</table>

In [None]:
#### INSERT YOUR CODE HERE

## Visualization of results

<table align="left">
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">     Choose your best model and try to visualize which categories it classifies worst. Why do you think that is? Can you find an example?
  </td>
 </tr> 
</table>

<table align="left">
 <tr><td width="80"><img src="img/exclamation.png" style="width:auto;height:auto"></td><td>
Remember that in the first lab we built a visualization that can help you here, but you can use a different one if you prefer.
 </td></tr>
</table>
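
A confusion matrix is one common way to see which categories are confused with each other (whether it matches the visualization from the first lab is an assumption). The sketch below uses invented toy predictions and hypothetical class names, and derives per-class recall from the matrix to spot the weakest category.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical class names and invented predictions for illustration.
labels = ["POLITICS", "SPORTS", "TRAVEL"]
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

# Per-class recall: correct predictions divided by the row total.
recall = cm.diagonal() / cm.sum(axis=1)
worst = labels[int(np.argmin(recall))]

# Heatmap of the raw counts with the class names on both axes.
fig, ax = plt.subplots()
im = ax.imshow(cm, cmap="Blues")
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=45)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
ax.set_xlabel("Predicted")
ax.set_ylabel("True")
fig.colorbar(im, ax=ax)
print("Lowest recall:", worst)
```

Off-diagonal cells with large counts point at the pairs of categories worth inspecting by hand when looking for concrete misclassified examples.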

In [None]:
#### INSERT YOUR CODE HERE

## Report 

<table>
<tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">Write a short report explaining the decisions you made when designing the models, what you tried, what worked and what did not. In addition, <b>write a table comparing all the results</b> of the models obtained in this task. It is important to compare all models using the <b>same metric</b>.</td></tr>
</table>