<a href="https://colab.research.google.com/github/dzaras/review_classification/blob/main/Copy_of_3_Text_Classification_for_actors_gender_update_8_18_21.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install nltk
!pip install numpy 
!pip install pandas
!pip install gensim
!pip install scikit-learn
!pip install ktrain



# Overview
## 1. Metrics for Text Classification

## 2. Example: 
1. Text Classification with Keras (a GPU is needed).     
      In Colab: Edit > Notebook settings > Hardware accelerator > GPU

2. (Optional) Text Classification with sklearn (GPU is not needed)

## 3. Exercise: Hate-speech Classification 


# Metrics for Text Classification

In this section, we go through how to compute different metrics for classification tasks with sklearn.

You pass the ground-truth labels and the predicted labels to a built-in function, and it returns the computed score.

## Accuracy: 

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]

In [None]:
accuracy_score(y_true, y_pred)

0.5
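
To make the number above concrete, here is a minimal sketch that computes accuracy by hand, without sklearn. The helper name `accuracy` is made up for this illustration; it simply counts the positions where the prediction matches the ground truth.

```python
y_true = [0, 1, 2, 3]
y_pred = [0, 2, 1, 3]

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Positions 0 and 3 match, positions 1 and 2 do not: 2 correct out of 4
print(accuracy(y_true, y_pred))  # 0.5
```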

## Precision, Recall and F1

Note: Macro vs Micro:

Macro: Calculate metrics for each label, and find their unweighted mean.

Micro: Calculate metrics globally by counting the total true positives, false negatives and false positives.

In [None]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

In [None]:
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

In [None]:
precision_score(y_true, y_pred, average='micro')

0.3333333333333333

In [None]:
recall_score(y_true, y_pred, average='micro')

0.3333333333333333

In [None]:
f1_score(y_true, y_pred, average='micro')

0.3333333333333333
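
The difference between the two averaging modes is easier to see when the counts are written out. The sketch below recomputes precision for the same `y_true` and `y_pred` in plain Python. The helper names `per_class_counts` and `precision` are hypothetical, and treating an undefined per-label precision as 0 is an assumption made to keep the example short.

```python
def per_class_counts(y_true, y_pred):
    """Return {label: (tp, fp, fn)} for every label seen in either list."""
    pairs = list(zip(y_true, y_pred))
    counts = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        counts[c] = (tp, fp, fn)
    return counts

def precision(y_true, y_pred, average):
    counts = per_class_counts(y_true, y_pred)
    if average == "macro":
        # Score each label separately, then take the unweighted mean
        per_label = [tp / (tp + fp) if tp + fp else 0.0
                     for tp, fp, _ in counts.values()]
        return sum(per_label) / len(per_label)
    # Micro: pool the counts over all labels first, then compute one score
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    return tp / (tp + fp) if tp + fp else 0.0

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

print(per_class_counts(y_true, y_pred))  # {0: (2, 1, 0), 1: (0, 2, 2), 2: (0, 1, 2)}
macro_p = precision(y_true, y_pred, "macro")
micro_p = precision(y_true, y_pred, "micro")
print(round(macro_p, 4), round(micro_p, 4))  # 0.2222 0.3333
```

Only label 0 is ever predicted correctly, so the macro average is pulled down by the two labels with zero precision, while the micro average only looks at the pooled counts (2 true positives out of 6 predictions).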

# Example: Text Classification 

In this section we go through the process of performing text classification with Keras.


Optionally, we also show how to perform text classification with LogisticRegression using sklearn. There, we first extract a text representation (features) with sklearn and then fit one of sklearn's built-in classifiers on it.

Here we use a set of amateur film reviews as an example. Each review has a binary `target` label whose `target_name` is `female` or `male`.

In [None]:
import pandas as pd

# Here we use a CSV of amateur film reviews (about 2.4k rows) for this example.
# Each row holds the review text in `content` and a binary label in `target`
# (`target_name` is 'female' or 'male').

raw_data = pd.read_csv('https://raw.githubusercontent.com/dzaras/review_classification/main/amateur_reviews_08_actor_labels.csv', encoding='latin-1')

# Alternative datasets (commented out):
#url = 'https://raw.githubusercontent.com/dzaras/review_classification/main/newspaper_data_3_labels.csv'
#raw_data = pd.read_csv('https://raw.githubusercontent.com/dzaras/review_classification/main/newspaper_data_only_neg-pos.csv' , encoding='latin-1')
# 20-Newsgroups (http://qwone.com/~jason/20Newsgroups/), about 11k posts from 20 topics:
#raw_data = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')

print(raw_data)

                                film.title  ...                                            content
0       What Happens in Vegas               ...  The story behind this Romantic Comedy is that ...
1                 Untraceable               ...  This story of a crazed killer using hits on a ...
2                   Get Smart               ...  This film is about an incompetent agent who is...
3                   Good Dick               ...  I always check the spoiler box, just in case.T...
4          Frygtelig lykkelig               ...  Nicely done. I am glad I picked this one out. ...
...                                    ...  ...                                                ...
2363             The Bank Job               ...  Having been weaned on The Italian Job, Thomas ...
2364             The Bank Job               ...  I liked the film to start with - bit cheeky, t...
2365     La siciliana ribelle               ...  While this film adds one of the faces that cou...
2366  The 

In [None]:
raw_data.dtypes

film.title      object
author          object
target           int64
target_name     object
genre           object
year             int64
rating         float64
content         object
dtype: object

In [None]:
raw_data['content'] = raw_data['content'].astype('string')

In [None]:
raw_data.dtypes

film.title      object
author          object
target           int64
target_name     object
genre           object
year             int64
rating         float64
content         string
dtype: object

In [None]:
raw_data

Unnamed: 0,film.title,author,target,target_name,genre,year,rating,content
0,What Happens in Vegas,MovieBuff26,1,female,comedy,2010,8.0,The story behind this Romantic Comedy is that ...
1,Untraceable,tatz32000,1,female,crime,2008,7.0,This story of a crazed killer using hits on a ...
2,Get Smart,Gordon-11,0,male,,2008,,This film is about an incompetent agent who is...
3,Good Dick,wingedheartart,1,female,misc,2009,6.0,"I always check the spoiler box, just in case.T..."
4,Frygtelig lykkelig,chapsmack,0,male,crime,2009,,Nicely done. I am glad I picked this one out. ...
...,...,...,...,...,...,...,...,...
2363,The Bank Job,jemps918,0,male,crime,2008,5.0,"Having been weaned on The Italian Job, Thomas ..."
2364,The Bank Job,prbt,0,male,crime,2008,6.0,"I liked the film to start with - bit cheeky, t..."
2365,La siciliana ribelle,ricardodiazsoto,1,female,crime,2010,,While this film adds one of the faces that cou...
2366,The Secret Life of Bees,johnstonjames,1,female,drama,2010,8.0,Wow. This was a really good movie. The only re...


In [None]:
# Read the text for classification
# We keep the first 1900 observations; we could have used more or all of them
text = raw_data['content'].iloc[:1900].tolist()

In [None]:
# Read the labels
# Same 1900 observations as the text, so texts and labels stay aligned
labels = raw_data['target'].iloc[:1900].tolist()

In [None]:
# target_name[i] is the name of integer label i; in this dataset 0 = male and 1 = female
target_name = ['male', 'female']

## Text Classification with Keras

In [None]:
import ktrain
from ktrain import text as ktext
from sklearn.model_selection import train_test_split

### Forming the train and test sets

In [None]:
# We first split the original data into train and test set
X_train, X_test, y_train, y_test = train_test_split(text, labels, random_state=0)     # random_state fixes the random seed, so the split is reproducible

### Preprocess Data

Notice that we use distilbert in the next chunk of code, where we pre-process the texts. The mode used in pre-processing must be the same one used when building and training the model.
If we wanted to use a different kind of model (such as roBERTa instead of distilbert), we would change the `preprocess_mode` argument in the following chunk of code. Before doing so, we would have to check the documentation of the function to make sure the mode we want is supported.

In [None]:
trn, val, preproc = ktext.texts_from_array(x_train=X_train, y_train=y_train,
                                          x_test=X_test, y_test=y_test,
                                          class_names = target_name,
                                          preprocess_mode='distilbert',
                                          maxlen=128)

preprocessing train...
language: en
train sequence lengths:
	mean : 331
	95percentile : 833
	99percentile : 977


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 320
	95percentile : 806
	99percentile : 969


task: text classification
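
The log above reports a mean sequence length of 331, while `maxlen=128` caps every input at 128 positions, so longer reviews are cut and shorter ones are padded. The sketch below illustrates that idea on plain lists of integers. It is a simplified illustration, not ktrain's tokenizer: the helper `pad_or_truncate`, the padding id `0`, and the choice to keep the beginning of the text are assumptions.

```python
def pad_or_truncate(token_ids, maxlen, pad_id=0):
    """Cut sequences longer than maxlen and right-pad shorter ones with pad_id."""
    clipped = token_ids[:maxlen]          # keep only the first maxlen tokens
    return clipped + [pad_id] * (maxlen - len(clipped))

long_review = list(range(1, 332))         # stand-in for a 331-token review
short_review = [7, 8, 9]                  # stand-in for a 3-token review

long_fixed = pad_or_truncate(long_review, maxlen=128)
short_fixed = pad_or_truncate(short_review, maxlen=128)

print(len(long_fixed), len(short_fixed))  # 128 128
print(long_fixed[-1])                     # 128, everything after token 128 is dropped
print(short_fixed[:5])                    # [7, 8, 9, 0, 0]
```

If much of the signal sits late in long reviews, a larger `maxlen` may help, at the cost of more memory and slower training.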


### Building a Model and a Learner







In [None]:
model = ktext.text_classifier('distilbert', train_data=trn, preproc=preproc)

Is Multi-Label? False
maxlen is 128
done.


Batch size is a hyperparameter: it sets how many data points the model processes at once, and the largest usable value depends on the available memory.

In [None]:
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=16)
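
As a rough illustration of what `batch_size=16` implies, the sketch below counts how many batches make up one pass over the training data. It assumes the 1900 observations selected earlier and a 25% test split, which is the `train_test_split` default when no size is given; the variable names are made up for this example.

```python
import math

n_examples = 1900        # observations kept from the dataset
test_fraction = 0.25     # assumed: default test share of train_test_split
batch_size = 16

n_test = math.ceil(n_examples * test_fraction)
n_train = n_examples - n_test
steps_per_epoch = math.ceil(n_train / batch_size)
last_batch = n_train - (steps_per_epoch - 1) * batch_size

print(n_train, n_test)    # 1425 475
print(steps_per_epoch)    # 90
print(last_batch)         # 1, the final batch holds the leftover example
```

A larger batch size means fewer steps per pass but more memory per step.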

### Train the Model

An epoch is one full pass over the training data, similar to the `passes` we saw in the topic modelling section.

In [None]:
learner.fit_onecycle(3e-5, 4)



begin training using onecycle policy with max lr of 3e-05...
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7fc4d78bf650>
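
The call `fit_onecycle(3e-5, 4)` trains for 4 epochs with a learning rate that peaks at 3e-5. The sketch below shows the general shape of a one-cycle schedule: a linear warm-up to the maximum rate, followed by a linear decay. It is a simplified illustration and not ktrain's exact implementation; the function `triangular_lr`, the starting rate of `max_lr / 10`, and the figure of 360 total steps are assumptions chosen for the example.

```python
def triangular_lr(step, total_steps, max_lr, base_lr=None):
    """Linear warm-up to max_lr over the first half, then linear decay back."""
    if base_lr is None:
        base_lr = max_lr / 10          # assumed starting rate, for illustration only
    half = total_steps / 2
    if step <= half:
        frac = step / half             # 0 -> 1 during warm-up
    else:
        frac = (total_steps - step) / half   # 1 -> 0 during decay
    return base_lr + frac * (max_lr - base_lr)

total_steps = 360                      # hypothetical: 4 epochs of 90 steps each
schedule = [triangular_lr(s, total_steps, max_lr=3e-5) for s in range(total_steps + 1)]

peak_step = schedule.index(max(schedule))
print(peak_step)                       # 180, the midpoint of training
```

The rate is smallest at the first and last step and largest in the middle, which is why the log describes 3e-05 as the "max lr" rather than a constant rate.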

### Predict on New Data

In [None]:
p = ktrain.get_predictor(model, preproc)

# The string passed to p.predict is the text whose label we ask the model to predict
p.predict("the protagonist goes through several steps before she admits that she needs help from others.")

'male'

## (Optional) Text Classification with Sklearn

### Text Representation

The first step is to represent text as vectors of features, such as bag-of-words or tf-idf features. Here we extract tf-idf features as an example. More feature extractors can be found here: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')

In [None]:
features = tfidf.fit_transform(text)

In [None]:
features.shape

(11314, 1186545)
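
To show what a tf-idf weight is, the sketch below computes one by hand on a toy corpus. It follows the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. The three documents and the helper names are made up, and the sketch ignores the bigrams and stop-word filtering used in the cell above.

```python
import math
from collections import Counter

docs = [
    "great film great cast",
    "boring film",
    "great story",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(doc):
    """Map each term of one document to its L2-normalized tf-idf weight."""
    tf = Counter(doc)
    weights = {t: tf[t] * idf(t) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(tokenized[1])       # the document "boring film"
print({t: round(w, 3) for t, w in vec.items()})
```

In the second document, `boring` appears in one document and `film` in two, so `boring` receives the larger weight: rarer terms are treated as more informative.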

### Learning the Classification Model

There are many different models in sklearn, such as the Naive Bayes classifier, the Logistic Regression classifier, and the Linear Support Vector Machine. Here we use the Logistic Regression classifier as an example. More details can be found here: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# We first split the original data into train and test set
X_train, X_test, y_train, y_test = train_test_split(text, labels, random_state = 0)

# Extract features
tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
tfidf.fit(X_train)
X_train_features = tfidf.transform(X_train)

X_test_features = tfidf.transform(X_test)

In [None]:
# Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
# Logistic Regression classifier
from sklearn.linear_model import LogisticRegression
# Linear Support Vector Machine
from sklearn.svm import LinearSVC

# Only LogisticRegression is used here; MultinomialNB() or LinearSVC() can be fitted the same way
clf = LogisticRegression().fit(X_train_features, y_train)

### Evaluating the Classification Model

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

# predict labels for test data
y_pred = clf.predict(X_test_features)

In [None]:
accuracy_score(y_test, y_pred)

0.9020855425945564

In [None]:
f1_score(y_test, y_pred, average='macro')

0.8999629673074958

In [None]:
f1_score(y_test, y_pred, average='micro')

0.9020855425945564
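
The micro-averaged F1 above is identical to the accuracy, and that is not a coincidence. When every example has exactly one label, each wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so micro precision, micro recall, and micro F1 all reduce to accuracy. The sketch below checks this in plain Python on small made-up labels; the helper `micro_f1` is hypothetical.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for single-label classification, from pooled counts."""
    tp = sum(t == p for t, p in zip(y_true, y_pred))
    # Each error is one false positive and one false negative in the pooled counts
    fp = fn = len(y_true) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = micro_f1(y_true, y_pred)
print(round(acc, 4), round(f1, 4))  # 0.3333 0.3333
```

The macro average weights every class equally instead, which is why it can differ from accuracy when classes are imbalanced or unevenly difficult.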

# Exercise: Hate-speech Classification

You can follow the example above and apply the same code to the hate-speech classification task.



***WARNING: The data, lexicons, and notebooks all contain content that is racist, sexist, homophobic, and offensive in many other ways.***



First, let's download the dataset.

The dataset is from `Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. "Automated Hate Speech Detection and the Problem of Offensive Language." ICWSM`

More details could be found here: https://github.com/t-davidson/hate-speech-and-offensive-language

In [None]:
!git clone https://github.com/t-davidson/hate-speech-and-offensive-language.git

Cloning into 'hate-speech-and-offensive-language'...
remote: Enumerating objects: 41, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 41 (delta 4), reused 0 (delta 0), pack-reused 32
Unpacking objects: 100% (41/41), done.


In [None]:
import pandas as pd
raw_data = pd.read_csv('./hate-speech-and-offensive-language/data/labeled_data.csv')

In [None]:
raw_data

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...
...,...,...,...,...,...,...,...
24778,25291,3,0,2,1,1,you's a muthaf***in lie &#8220;@LifeAsKing: @2...
24779,25292,3,0,1,2,2,"you've gone and broke the wrong heart baby, an..."
24780,25294,3,0,3,0,1,young buck wanna eat!!.. dat nigguh like I ain...
24781,25295,6,0,6,0,1,youu got wild bitches tellin you lies


*count* = number of CrowdFlower users who coded each tweet (min is 3, sometimes more users coded a tweet when judgments were determined to be unreliable by CF).

*hate_speech* = number of CF users who judged the tweet to be hate speech.

*offensive_language* = number of CF users who judged the tweet to be offensive.

*neither* = number of CF users who judged the tweet to be neither hate speech nor offensive.

*class* = class label assigned by the majority of CF users: 0 = hate speech, 1 = offensive language, 2 = neither.
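
The relation between the three vote columns and the label can be sketched as follows: the label is the index of the category with the most votes. The helper `majority_class` is made up for this illustration, and the tie-breaking rule (the lowest index wins) is an assumption, since the column descriptions do not say how ties are resolved.

```python
def majority_class(hate_speech, offensive_language, neither):
    """Return 0, 1 or 2 for the category that received the most votes."""
    votes = [hate_speech, offensive_language, neither]
    # list.index returns the first maximum, so ties go to the lowest index (assumption)
    return votes.index(max(votes))

# Vote counts taken from rows of the table above
print(majority_class(0, 0, 3))  # 2 (neither)
print(majority_class(0, 3, 0))  # 1 (offensive language)
print(majority_class(0, 2, 1))  # 1 (offensive language)
```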

In [None]:
# Read the text for classification
text = raw_data['tweet'].tolist()

# Read the labels
labels = raw_data['class'].tolist()

# Class names, indexed by integer label: 0, 1, 2
target_name = ['hate speech', 'offensive language', 'neither']

In [None]:
target_name = ['hate speech',
 'offensive language',
 'neither']

## Text Classification with Keras

In [None]:
import ktrain
from ktrain import text as ktext
from sklearn.model_selection import train_test_split

### Forming the Train and Test Sets

In [None]:
# We first split the original data into train and test set
X_train, X_test, y_train, y_test = train_test_split(text, labels, random_state=0)     # random_state fixes the random seed, so the split is reproducible

### Preprocess Data
- We could try using 'bert' instead of 'distilbert' here.

In [None]:
trn, val, preproc = ktext.texts_from_array(x_train=X_train, y_train=y_train,
                                          x_test=X_test, y_test=y_test,
                                          class_names = target_name,
                                          preprocess_mode='distilbert',
                                          maxlen=128)

preprocessing train...
language: en
train sequence lengths:
	mean : 14
	95percentile : 26
	99percentile : 29


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 14
	95percentile : 26
	99percentile : 29


task: text classification


### Building a Model and a Learner

In [None]:
model = ktext.text_classifier('distilbert', train_data=trn, preproc=preproc)

Is Multi-Label? False
maxlen is 128
done.


In [None]:
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=16)

### Train the Model

In [None]:
learner.fit_onecycle(3e-5, 4)



begin training using onecycle policy with max lr of 3e-05...
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x7fd0d76681d0>

### Predict on New Data

In [None]:
p = ktrain.get_predictor(model, preproc)

# The string passed to p.predict is the text whose label we ask the model to predict
p.predict("They are very racist")

'hate speech'