# Practical 8: Applications of Text Mining & NLP
#### Javier Garcia-Bernardo
<img src="img/uu_logo.png" alt="logo" align="right" title="UU" width="50" height="20" />

#### Applied Text Mining - Utrecht Summer School

In this practical you will answer a research question or solve a problem by building a pipeline for classification or clustering.

All the data has already been preprocessed and is available in the GitHub repository.


Here are some proposed research questions:

### Classification
#### RQ1: Identification of fake news, hate speech or spam + Interpretability of results:
- Data: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset or https://github.com/aitor-garcia-p/hate-speech-dataset (https://paperswithcode.com/dataset/hate-speech) or https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection
- Goal: Evaluate performance of different methods and interpret the results using LIME
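
Comparing methods can be sketched roughly as follows. This is a minimal illustration, not the intended solution: the texts and labels are invented stand-ins for one of the RQ1 datasets, and the two candidate classifiers and the macro-F1 metric are assumptions you are free to change.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy data standing in for an RQ1 dataset (1 = spam, 0 = not spam)
texts = [
    "win a free phone now", "free money click here", "claim your free prize now",
    "click here to win cash", "cheap pills buy now", "free gift card click now",
    "see you at the meeting tomorrow", "thanks for the great video",
    "the lecture starts at nine", "I really enjoyed this song",
    "can you send me the report", "lovely weather in Utrecht today",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Candidate classifiers to compare (an assumption; add or swap your own)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
}

scores = {}
for name, clf in candidates.items():
    # Same TF-IDF features for every classifier so the comparison is fair
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    # Stratified 3-fold cross-validation, scored with macro-averaged F1
    cv_scores = cross_val_score(pipe, texts, labels, cv=3, scoring="f1_macro")
    scores[name] = cv_scores.mean()
    print(f"{name}: mean macro-F1 = {scores[name]:.2f}")
```

Putting the vectorizer inside the pipeline means it is refit on the training folds only, which avoids leaking vocabulary from the held-out fold.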

#### RQ2: Evaluate the importance of metadata. Create a classification system to identify the movie genre with and without metadata:
- Data: https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots
- Options:
    * Create two classification systems, one using only metadata and one using only text. Stack them to create the best model: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html
    * Use the functional API of Keras to create one model that handles both types of inputs: https://pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/
- Goal: Evaluate performance and interpret the results using LIME
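
The stacking option might look roughly like the sketch below. The toy rows and the column names (`plot`, `origin`, `year`, `genre`) are invented for illustration and may not match the cleaned dataset; the choice of logistic regression for every model is also just an assumption.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented toy data; the column names are hypothetical
movies = pd.DataFrame({
    "plot": [
        "a ghost haunts a family in an old house",
        "a masked killer stalks teenagers at a summer camp",
        "a demon possesses a young girl",
        "zombies attack a small town at night",
        "a vampire terrorizes a remote village",
        "a cursed tape kills whoever watches it",
        "two friends plan a disastrous wedding party",
        "a clumsy detective ruins every case he touches",
        "a family road trip goes hilariously wrong",
        "an office prank war gets out of hand",
        "a man pretends to be a nanny to see his kids",
        "roommates compete for the same awkward date",
    ],
    "origin": ["American", "American", "American", "British", "British", "Japanese",
               "American", "British", "American", "British", "American", "British"],
    "year": [1982, 1980, 1973, 2004, 1958, 1998, 2011, 1964, 1983, 2005, 1993, 2012],
    "genre": ["horror"] * 6 + ["comedy"] * 6,
})
X, y = movies[["plot", "origin", "year"]], movies["genre"]

# Model 1: text only (TF-IDF on the plot column)
text_model = Pipeline([
    ("features", ColumnTransformer([("tfidf", TfidfVectorizer(), "plot")])),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Model 2: metadata only (one-hot origin, scaled year)
meta_model = Pipeline([
    ("features", ColumnTransformer([
        ("origin", OneHotEncoder(handle_unknown="ignore"), ["origin"]),
        ("year", StandardScaler(), ["year"]),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stacked model: a final classifier learns from the two models' predictions
stacked = StackingClassifier(
    estimators=[("text", text_model), ("meta", meta_model)],
    final_estimator=LogisticRegression(),
    cv=3,
)

results = {}
for name, model in {"text only": text_model, "metadata only": meta_model, "stacked": stacked}.items():
    results[name] = cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    print(f"{name}: mean accuracy = {results[name]:.2f}")
```

Comparing the three scores on the real data gives a first indication of how much the metadata adds on top of the text.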


### Clustering
#### RQ3: Create a recommendation system for movies based on their plot:
- Data: https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots
- Output: What are the closest movies to "The Shawshank Redemption", "Goodfellas", and "Harry Potter and the Sorcerer's Stone"?
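
One simple way to approach this is to represent each plot as a TF-IDF vector and rank movies by cosine similarity. The sketch below uses invented titles and plots, and the helper `closest_movies` is a hypothetical name introduced only for illustration; other representations (for example document embeddings) are equally valid.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented titles and plots standing in for the Wikipedia movie plots
titles = ["Wizard School", "The Magic Academy", "Prison Walls",
          "The Long Escape", "Mob Life", "Gangster City"]
plots = [
    "a young wizard attends a school of magic and fights a dark wizard",
    "students at a magic school learn spells from an old wizard",
    "a banker is sent to prison and plans an escape from prison",
    "two inmates dig a tunnel to escape a brutal prison",
    "a young man rises in the mob and becomes a gangster",
    "a gangster betrays the mob and the police close in",
]

# One TF-IDF vector per plot, then all pairwise cosine similarities
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(plots)
similarity = cosine_similarity(tfidf_matrix)


def closest_movies(title, k=2):
    """Return the k titles whose plots are most similar to the given title."""
    idx = titles.index(title)
    order = np.argsort(-similarity[idx])  # most similar first
    return [titles[i] for i in order if i != idx][:k]


print(closest_movies("Wizard School", k=1))
print(closest_movies("Prison Walls", k=1))
```

For the full dataset, computing all pairwise similarities can be memory-hungry; computing the similarity of one query row against the matrix is usually enough.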

#### RQ4: Cluster headlines using word embeddings:
- Data: https://www.ims.uni-stuttgart.de/en/research/resources/corpora/goodnewseveryone/ (https://aclanthology.org/2020.lrec-1.194.pdf)
- Goal: Check whether the clusters correlate with emotions or media sources
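
Checking whether clusters line up with emotions or sources can be sketched as below. Note the swap: to keep the example self-contained, it uses TF-IDF followed by truncated SVD (LSA) as a stand-in for word embeddings; with pretrained embeddings you would instead average the word vectors of each headline. The headlines, emotions, sources and column names are all invented.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

# Invented toy headlines; column names and labels are hypothetical
headlines = pd.DataFrame({
    "headline": [
        "local team wins championship after thrilling final",
        "community celebrates opening of new park",
        "scientists celebrate breakthrough in cancer research",
        "volunteers win award for helping flood victims",
        "storm warning issued as hurricane approaches coast",
        "officials warn of deadly virus outbreak",
        "residents flee as wildfire approaches town",
        "experts warn of rising flood risk on coast",
    ],
    "emotion": ["joy", "joy", "joy", "joy", "fear", "fear", "fear", "fear"],
    "source": ["Outlet A", "Outlet B", "Outlet A", "Outlet B",
               "Outlet A", "Outlet B", "Outlet A", "Outlet B"],
})

# Stand-in for embeddings: dense LSA vectors, one per headline
tfidf = TfidfVectorizer(stop_words="english").fit_transform(headlines["headline"])
embeddings = TruncatedSVD(n_components=3, random_state=0).fit_transform(tfidf)

# Cluster the headline vectors
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
headlines["cluster"] = kmeans.fit_predict(embeddings)

# Contingency table and adjusted Rand index (1 = identical grouping, about 0 = chance)
emotion_table = pd.crosstab(headlines["cluster"], headlines["emotion"])
ari_emotion = adjusted_rand_score(headlines["emotion"], headlines["cluster"])
ari_source = adjusted_rand_score(headlines["source"], headlines["cluster"])

print(emotion_table)
print(f"ARI with emotion: {ari_emotion:.2f}, ARI with source: {ari_source:.2f}")
```

The contingency table shows which labels dominate each cluster, while the adjusted Rand index summarizes the agreement in a single number that is corrected for chance.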

    
You can also come up with your own research question using any text dataset, e.g. from:
* UCI repository: https://archive.ics.uci.edu/ml/datasets.php?format=&task=&att=&area=&numAtt=&numIns=&type=text&sort=nameUp&view=table
* Papers with Code repository: https://paperswithcode.com/datasets?mod=texts&page=1
* Kaggle (code examples are often included): https://www.kaggle.com/datasets?tags=13204-NLP

Given the time restrictions, however, we recommend choosing one of the research questions above.


In [1]:
# path to the data
path_data = "./data/"

# How to read the data (we cleaned it for you).
# Run the import cell below first (it defines pd), then uncomment the lines you need.
# data_rq1_fake = pd.read_csv(f"{path_data}/rq1_fake_news.csv.gzip",sep="\t",compression="gzip")
# data_rq1_hate_speech = pd.read_csv(f"{path_data}/rq1_hate_speech.csv.gzip",sep="\t",compression="gzip")
# data_rq1_youtube = pd.read_csv(f"{path_data}/rq1_youtube.csv.gzip",sep="\t",compression="gzip")
# data_rq2_3 = pd.read_csv(f"{path_data}/rq2_3_wiki_movie_plots.csv.gzip",sep="\t",compression="gzip")
# data_rq4 = pd.read_csv(f"{path_data}/rq4_gne-release-v1.0.csv.gzip",sep="\t",compression="gzip")
# data_rq1_fake.shape, data_rq1_hate_speech.shape, data_rq1_youtube.shape, data_rq2_3.shape, data_rq4.shape

In [82]:
# Data wrangling
import pandas as pd
import numpy as np

# Machine learning tools
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline


# Interpretable AI (uncomment the next line if lime is not installed)
#!pip install lime
from lime.lime_text import LimeTextExplainer
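
The imports above can be combined roughly as follows. This is a hedged sketch on invented toy data, not the reference solution: the parameter grid, class names and example sentence are assumptions, and the `lime` import is guarded only so that the sketch still runs when the package is missing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

try:  # lime is optional here so the sketch runs without it
    from lime.lime_text import LimeTextExplainer
except ImportError:
    LimeTextExplainer = None

# Invented toy data (1 = spam, 0 = not spam)
texts = [
    "win a free phone now", "free money click here", "claim your free prize now",
    "click here to win cash", "cheap pills buy now", "free gift card click now",
    "see you at the meeting tomorrow", "thanks for the great video",
    "the lecture starts at nine", "I really enjoyed this song",
    "can you send me the report", "lovely weather in Utrecht today",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Pipeline: raw text -> TF-IDF features -> logistic regression
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Small, illustrative hyperparameter grid (step name + "__" + parameter name)
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)
best_model = search.best_estimator_
print("Best parameters:", search.best_params_)

# Explain a single prediction with LIME (skipped if lime is not installed)
explanation_weights = None
if LimeTextExplainer is not None:
    explainer = LimeTextExplainer(class_names=["not spam", "spam"], random_state=0)
    explanation = explainer.explain_instance(
        "click here for a free prize",
        best_model.predict_proba,  # must map a list of raw strings to class probabilities
        num_features=5,
        num_samples=500,
    )
    explanation_weights = explanation.as_list()  # (word, weight) pairs for class "spam"
    print(explanation_weights)
```

Passing the whole pipeline's `predict_proba` to LIME matters: LIME perturbs the raw text, so the function it calls has to include the vectorization step.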