# Classification

Text classification with fastText on three tasks: humor detection, sarcasm detection, and sentiment analysis on movie reviews.

In [1]:
# Clone the fastText repository from GitHub
!git clone https://github.com/facebookresearch/fastText.git

Cloning into 'fastText'...
remote: Enumerating objects: 3998, done.
remote: Counting objects: 100% (1057/1057), done.
remote: Compressing objects: 100% (197/197), done.
remote: Total 3998 (delta 922), reused 889 (delta 855), pack-reused 2941
Receiving objects: 100% (3998/3998), 8.30 MiB | 12.49 MiB/s, done.
Resolving deltas: 100% (2529/2529), done.


In [2]:
# Navigate into the fastText directory
%cd fastText

/content/fastText


In [3]:
# Compile the fastText source code
!make

c++ -pthread -std=c++17 -march=native -O3 -funroll-loops -DNDEBUG -c src/args.cc
c++ -pthread -std=c++17 -march=native -O3 -funroll-loops -DNDEBUG -c src/autotune.cc
c++ -pthread -std=c++17 -march=native -O3 -funroll-loops -DNDEBUG -c src/matrix.cc
c++ -pthread -std=c++17 -march=native -O3 -funroll-loops -DNDEBUG -c src/dictionary.cc
c++ -pthread -std=c++17 -march=native -O3 -funroll-loops -DNDEBUG -c src/loss.cc
c++ -pthread -std=c++17 -march=native -O3 -funroll-loops -DNDEBUG -c src/productquantizer.cc
c++ -pthread -std=c++17 -march=native -O3 -funroll-loops -DNDEBUG -c src/densematrix.cc
c++ -pthread -std=c++17 -march=native -O3 -funroll-loops -DNDEBUG -c src/quantmatrix.cc
c++ -pthread -std=c++17 -march=native -O3 -funroll-loops -DNDEBUG -c src/vector.cc
c++ -pthread -std=c++17 -march=native -O3 -funroll-loops -DNDEBUG -c src/model.cc
c++ -pthread -std=c++17 -march=native -O3 -funroll-loops -DNDEBUG -c src/utils.cc
c++ -pthread -std=c++17 -march=native -O3 -funroll-loops -DNDEBUG -

In [4]:
# Print the current working directory path
!pwd

/content/fastText


In [5]:
# Navigate back to the parent directory
%cd ../

/content


## 1. humor detection

In [6]:
# Download the training data for humor detection
!wget https://raw.githubusercontent.com/jungyeul/computational-tools-for-linguistic-analysis-ubc/main/labs/lab5/humor-detection_train.txt
# Download the test data for humor detection
!wget https://raw.githubusercontent.com/jungyeul/computational-tools-for-linguistic-analysis-ubc/main/labs/lab5/humor-detection_test.txt

--2024-04-03 18:26:33--  https://raw.githubusercontent.com/jungyeul/computational-tools-for-linguistic-analysis-ubc/main/labs/lab5/humor-detection_train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1247724 (1.2M) [text/plain]
Saving to: ‘humor-detection_train.txt’


2024-04-03 18:26:33 (22.1 MB/s) - ‘humor-detection_train.txt’ saved [1247724/1247724]

--2024-04-03 18:26:33--  https://raw.githubusercontent.com/jungyeul/computational-tools-for-linguistic-analysis-ubc/main/labs/lab5/humor-detection_test.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaitin

In [7]:
# Navigate into the fastText directory again
%cd fastText

/content/fastText


In [8]:
# Train a supervised model for humor detection
!./fasttext supervised -input ../humor-detection_train.txt -output model_humor

Read 0M words
Number of words:  31831
Number of labels: 2
Progress: 100.0% words/sec/thread:  295398 lr:  0.000000 avg.loss:  0.166033 ETA:   0h 0m 0s


In [9]:
# Test the humor detection model
!./fasttext test model_humor.bin ../humor-detection_test.txt

N	3355
P@1	0.955
R@1	0.955
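
The `test` command reports precision and recall at rank 1. The following is a minimal sketch of how such scores can be computed, assuming precision is the number of correct predictions divided by the number of predictions made, and recall is the number of correct predictions divided by the number of gold labels. The function name and the toy data are illustrative and are not part of fastText. When every example has exactly one gold label and one prediction, the two values coincide, which matches the identical P@1 and R@1 above.

```python
def precision_recall_at_k(gold, predicted, k=1):
    """Compute precision@k and recall@k.

    gold: list of sets of true labels, one set per example.
    predicted: list of ranked label lists, best guess first.
    """
    n_predicted = 0
    n_gold = 0
    n_correct = 0
    for true_labels, ranked in zip(gold, predicted):
        top_k = ranked[:k]
        n_predicted += len(top_k)
        n_gold += len(true_labels)
        n_correct += sum(1 for label in top_k if label in true_labels)
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    return precision, recall


# Toy data (hypothetical): four single-label examples, three predicted correctly.
gold = [{"true"}, {"false"}, {"true"}, {"false"}]
predicted = [["true"], ["false"], ["false"], ["false"]]
p, r = precision_recall_at_k(gold, predicted, k=1)
print(p, r)
```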


In [10]:
# Display the first few lines of the humor detection training data
!head ../humor-detection_train.txt

__label__true	My grandfather died recently, He spent his final years as a regular user of facebook..We won't see the likes of him again.
__label__true	I was sat in traffic the other day. Got hit by a car.
__label__true	Whats the difference between a ginger fanny and a cricket ball? If you try really hard, Really really hard, You can eat a cricket ball.
__label__true	Money can't buy happiness, but I'd much rather cry in a mansion.
__label__true	2B or not 2B. That is the pencil.
__label__true	What's the difference between a Jew and a canoe? Canoes tip!
__label__true	I've just won 10 million on the lottery and decided to buy my local Chinese takeaway called 'Happiness'.  Your move, philosophers.
__label__true	A man was hospitalized with 6 plastic horses up his ass. The doctor described his condition as stable.
__label__true	Just told my joke about Peter Pan again. Never gets old.
__label__true	Two blondes were driving to Disneyland and the exit sign reads: DISNEYLAND LEFT. They started cr
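
Each line of the training file starts with a `__label__` token followed by the text. The sketch below separates the two in Python, assuming labels are whitespace-separated tokens that begin with the `__label__` prefix. The helper name `parse_fasttext_line` is hypothetical.

```python
def parse_fasttext_line(line, prefix="__label__"):
    """Split one fastText-formatted line into (labels, text)."""
    tokens = line.split()
    # Tokens carrying the prefix are labels; everything else is text.
    labels = [t[len(prefix):] for t in tokens if t.startswith(prefix)]
    words = [t for t in tokens if not t.startswith(prefix)]
    return labels, " ".join(words)


labels, text = parse_fasttext_line("__label__true\t2B or not 2B. That is the pencil.")
print(labels, text)
```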

In [11]:
# Download the shuffle.c file
!wget https://raw.githubusercontent.com/jungyeul/computational-tools-for-linguistic-analysis-ubc/main/labs/lab5/shuffle.c

--2024-04-03 18:26:36--  https://raw.githubusercontent.com/jungyeul/computational-tools-for-linguistic-analysis-ubc/main/labs/lab5/shuffle.c
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3595 (3.5K) [text/plain]
Saving to: ‘shuffle.c’


2024-04-03 18:26:36 (53.0 MB/s) - ‘shuffle.c’ saved [3595/3595]



In [12]:
# Compile the shuffle.c file
!gcc -o shuffle shuffle.c

In [13]:
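# Shuffle the humor detection training data.
# The run recorded below was fed ../humor-detection_test.txt instead, so its vocabulary size (12684 words)
# and its P@1 of 0.983 come from a model trained on the test set, not from a held-out evaluation.
!./shuffle < ../humor-detection_train.txt > humor-detection_train_shuffled.txt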
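
The first ten lines of the humor training file all carry `__label__true`, which suggests the file is grouped by label, so shuffling the lines before training is a sensible precaution. The contents of `shuffle.c` are not shown here, so the sketch below is only a Python stand-in for line shuffling, not a port of that program. The function name and the `seed` parameter are assumptions made for illustration.

```python
import random


def shuffle_lines(lines, seed=0):
    """Return a shuffled copy of the lines; a fixed seed makes the order reproducible."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Toy input (hypothetical): labels grouped together, as in a sorted file.
example = ["__label__true a", "__label__true b", "__label__false c", "__label__false d"]
print(shuffle_lines(example, seed=0))
```

To apply this to a file, one would read it with `open(path).readlines()`, shuffle, and write the result back with `writelines`.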

In [14]:
# Train a supervised model for humor detection using shuffled data
!./fasttext supervised -input humor-detection_train_shuffled.txt -output model_humor_shuffled

Read 0M words
Number of words:  12684
Number of labels: 2
Progress: 100.1% words/sec/thread:  219604 lr: -0.000088 avg.loss:  0.268598 ETA:   0h 0m 0s
Progress: 100.0% words/sec/thread:  219096 lr:  0.000000 avg.loss:  0.268598 ETA:   0h 0m 0s


In [15]:
# Test the humor detection model trained on shuffled data
!./fasttext test model_humor_shuffled.bin ../humor-detection_test.txt

N	3355
P@1	0.983
R@1	0.983


In [38]:
# Display the first few lines of the shuffled humor detection training data
!head humor-detection_train_shuffled.txt

__label__false	More than 60 people killed in suicide bombs in Nigeria: officials 
__label__false	It was done in response to Ottawa 's Clint Benedict constantly falling to make saves.
__label__false	The two streams run through the estate , one of them the Glamis Burn.
__label__true	The brain is a wonderful organ:  it starts working the moment you get up in the morning, and does not stop until you get to school.
__label__false	Yemeni president appoints general to senior army post, state media report 
__label__false	Liverpool collapse nothing to do with protest, says coach  
__label__true	It is the ability to take a joke, not make one, that proves you have a sense of humor.
__label__false	Vontobel family affirms commitment to Swiss bank after patriarch's death  
__label__true	This planking epidemic is getting out of hand. The old lady next door has been laying outside for 3 days now.
__label__true	What do you tell a woman with two black eyes? Nothing, you already told her twice!


## 2. sarcasm detection

In [17]:
# Navigate back to the parent directory
%cd ../

/content


In [18]:
# Download the training data for sarcasm detection
!wget https://raw.githubusercontent.com/jungyeul/computational-tools-for-linguistic-analysis-ubc/main/labs/lab5/sarcasm-detection_train.txt
# Download the test data for sarcasm detection
!wget https://raw.githubusercontent.com/jungyeul/computational-tools-for-linguistic-analysis-ubc/main/labs/lab5/sarcasm-detection_test.txt

--2024-04-03 18:26:38--  https://raw.githubusercontent.com/jungyeul/computational-tools-for-linguistic-analysis-ubc/main/labs/lab5/sarcasm-detection_train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1633310 (1.6M) [text/plain]
Saving to: ‘sarcasm-detection_train.txt’


2024-04-03 18:26:38 (27.7 MB/s) - ‘sarcasm-detection_train.txt’ saved [1633310/1633310]

--2024-04-03 18:26:38--  https://raw.githubusercontent.com/jungyeul/computational-tools-for-linguistic-analysis-ubc/main/labs/lab5/sarcasm-detection_test.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent,

In [19]:
# Navigate into the fastText directory
%cd fastText

/content/fastText


In [20]:
# Train a supervised model for sarcasm detection
!./fasttext supervised -input ../sarcasm-detection_train.txt -output model_sarcasm

Read 0M words
Number of words:  32149
Number of labels: 2
Progress: 100.0% words/sec/thread:  209697 lr:  0.000000 avg.loss:  0.252912 ETA:   0h 0m 0s


In [21]:
# Test the sarcasm detection model
!./fasttext test model_sarcasm.bin ../sarcasm-detection_test.txt

N	5342
P@1	0.846
R@1	0.846


In [22]:
# Display the first few lines of the sarcasm detection training data
!head ../sarcasm-detection_train.txt

__label__false former versace store clerk sues over secret 'black code' for minority shoppers
__label__false the 'roseanne' revival catches up to our thorny political mood, for better and worse
__label__true mom starting to fear son's web series closest thing she will have to grandchild
__label__true boehner just wants wife to listen, not come up with alternative debt-reduction ideas
__label__false j.k. rowling wishes snape happy birthday in the most magical way
__label__false advancing the world's women
__label__false the fascinating case for eating lab-grown meat
__label__false this ceo will send your kids to school, if you work for his company
__label__true top snake handler leaves sinking huckabee campaign
__label__false friday's morning email: inside trump's presser for the ages


In [23]:
# Shuffle the sarcasm detection training data
!./shuffle < ../sarcasm-detection_train.txt > sarcasm-detection_train_shuffled.txt

In [24]:
# Train a supervised model for sarcasm detection using shuffled data
!./fasttext supervised -input sarcasm-detection_train_shuffled.txt -output model_sarcasm_shuffled

Read 0M words
Number of words:  32149
Number of labels: 2
Progress: 100.0% words/sec/thread:  262431 lr:  0.000000 avg.loss:  0.240808 ETA:   0h 0m 0s


In [25]:
# Test the sarcasm detection model trained on shuffled data
!./fasttext test model_sarcasm_shuffled.bin ../sarcasm-detection_test.txt

N	5342
P@1	0.851
R@1	0.851


## 3. sentiment analysis on movie reviews (4pts)
Prepare the data (`sentiment-analysis-on-movie-reviews-refined.txt`) for *fastText* using a series of shell commands.

In [26]:
# Download the refined sentiment analysis data for movie reviews
!wget https://raw.githubusercontent.com/jungyeul/computational-tools-for-linguistic-analysis-ubc/main/labs/lab5/sentiment-analysis-on-movie-reviews-refined.txt

--2024-04-03 18:26:44--  https://raw.githubusercontent.com/jungyeul/computational-tools-for-linguistic-analysis-ubc/main/labs/lab5/sentiment-analysis-on-movie-reviews-refined.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 894456 (873K) [text/plain]
Saving to: ‘sentiment-analysis-on-movie-reviews-refined.txt’


2024-04-03 18:26:45 (18.2 MB/s) - ‘sentiment-analysis-on-movie-reviews-refined.txt’ saved [894456/894456]



In [27]:
# Display the first few lines of the refined sentiment analysis data for movie reviews
!head sentiment-analysis-on-movie-reviews-refined.txt

A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .	1
This quiet , introspective and entertaining independent is worth seeking .	4
Even fans of Ismail Merchant 's work , I suspect , would have a hard time sitting through this one .	1
A positively thrilling combination of ethnography and all the intrigue , betrayal , deceit and murder of a Shakespearean tragedy or a juicy soap opera .	3
Aggressive self-glorification and a manipulative whitewash .	1
A comedy-drama of nearly epic proportions rooted in a sincere performance by the title character undergoing midlife crisis .	4
Narratively , Trouble Every Day is a plodding mess .	1
The Importance of Being Earnest , so thick with wit it plays like a reading from Bartlett 's Familiar Quotations	3
But it does n't leave you with much .	1
You could hate it for the same reason .	1


In [28]:
# Move the score (the last field of each line) to the front and store the result in scored-movie.txt
!awk '{first = $NF; $NF=""; print first, $0}' sentiment-analysis-on-movie-reviews-refined.txt > scored-movie.txt

In [29]:
# Display the first few lines of the scored movie data
!head scored-movie.txt

1 A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story . 
4 This quiet , introspective and entertaining independent is worth seeking . 
1 Even fans of Ismail Merchant 's work , I suspect , would have a hard time sitting through this one . 
3 A positively thrilling combination of ethnography and all the intrigue , betrayal , deceit and murder of a Shakespearean tragedy or a juicy soap opera . 
1 Aggressive self-glorification and a manipulative whitewash . 
4 A comedy-drama of nearly epic proportions rooted in a sincere performance by the title character undergoing midlife crisis . 
1 Narratively , Trouble Every Day is a plodding mess . 
3 The Importance of Being Earnest , so thick with wit it plays like a reading from Bartlett 's Familiar Quotations 
1 But it does n't leave you with much . 
1 You could hate it for the same reason . 


In [30]:
# Add labels to the scored movie data
!sed 's/^/__label__/' scored-movie.txt  > scored-movie-labelled.txt
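
The `awk` and `sed` steps together turn `text<TAB>score` into `__label__score text`. A rough Python equivalent is sketched below, assuming the score is the last whitespace-separated field on each line. The helper name `to_fasttext_line` is hypothetical. Unlike the `awk` command, this version does not collapse internal whitespace and does not leave a trailing space.

```python
def to_fasttext_line(line, prefix="__label__"):
    """Convert 'text<whitespace>score' into '__label__score text'."""
    # rsplit from the right so only the final field is treated as the score.
    text, score = line.rsplit(None, 1)
    return f"{prefix}{score} {text}"


raw = "This quiet , introspective and entertaining independent is worth seeking .\t4\n"
print(to_fasttext_line(raw))
```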

In [31]:
# Display the first few lines of the labelled scored movie data
!head scored-movie-labelled.txt

__label__1 A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story . 
__label__4 This quiet , introspective and entertaining independent is worth seeking . 
__label__1 Even fans of Ismail Merchant 's work , I suspect , would have a hard time sitting through this one . 
__label__3 A positively thrilling combination of ethnography and all the intrigue , betrayal , deceit and murder of a Shakespearean tragedy or a juicy soap opera . 
__label__1 Aggressive self-glorification and a manipulative whitewash . 
__label__4 A comedy-drama of nearly epic proportions rooted in a sincere performance by the title character undergoing midlife crisis . 
__label__1 Narratively , Trouble Every Day is a plodding mess . 
__label__3 The Importance of Being Earnest , so thick with wit it plays like a reading from Bartlett 's Familiar Quotations 
__label__1 But it does n't leave y

In [32]:
# Count the lines, words, and bytes in the labelled scored movie data
!wc scored-movie-labelled.txt

  8529 170573 979744 scored-movie-labelled.txt


In [33]:
# Extract the training data from the labelled scored movie data
!head -n 6829 scored-movie-labelled.txt > movie.train.txt

In [34]:
# Extract the validation data from the labelled scored movie data
!tail -n 1700 scored-movie-labelled.txt > movie.valid.txt
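
The `head` and `tail` calls split the 8529 labelled lines into the first 6829 for training and the last 1700 for validation, roughly an 80/20 split. The same arithmetic can be sketched in Python as follows. The function name is hypothetical, and the placeholder lines stand in for the real file contents.

```python
def split_train_valid(lines, n_valid):
    """Keep the last n_valid lines for validation and the rest for training."""
    lines = list(lines)
    if not 0 <= n_valid <= len(lines):
        raise ValueError("n_valid must be between 0 and the number of lines")
    cut = len(lines) - n_valid
    return lines[:cut], lines[cut:]


# Placeholder data with the same line count as scored-movie-labelled.txt.
dummy = [f"line {i}" for i in range(8529)]
train, valid = split_train_valid(dummy, n_valid=1700)
print(len(train), len(valid))
```

Because the split follows file order, it is only representative if the file is not sorted by label or source. Shuffling before splitting is one way to guard against that.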

In [35]:
# Display the first few lines of the training data for movie sentiment analysis
!head movie.train.txt

__label__1 A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story . 
__label__4 This quiet , introspective and entertaining independent is worth seeking . 
__label__1 Even fans of Ismail Merchant 's work , I suspect , would have a hard time sitting through this one . 
__label__3 A positively thrilling combination of ethnography and all the intrigue , betrayal , deceit and murder of a Shakespearean tragedy or a juicy soap opera . 
__label__1 Aggressive self-glorification and a manipulative whitewash . 
__label__4 A comedy-drama of nearly epic proportions rooted in a sincere performance by the title character undergoing midlife crisis . 
__label__1 Narratively , Trouble Every Day is a plodding mess . 
__label__3 The Importance of Being Earnest , so thick with wit it plays like a reading from Bartlett 's Familiar Quotations 
__label__1 But it does n't leave y

In [36]:
# Train a supervised model for movie sentiment analysis
!./fasttext supervised -input movie.train.txt -output model_movie

Read 0M words
Number of words:  16092
Number of labels: 5
Progress:  78.8% words/sec/thread:  467892 lr:  0.021238 avg.loss:  1.547459 ETA:   0h 0m 0s
Progress: 100.1% words/sec/thread:  297355 lr: -0.000067 avg.loss:  1.541772 ETA:   0h 0m 0s
Progress: 100.0% words/sec/thread:  297061 lr:  0.000000 avg.loss:  1.541772 ETA:   0h 0m 0s


In [37]:
# Test the movie sentiment analysis model
!./fasttext test model_movie.bin movie.valid.txt

N	1700
P@1	0.32
R@1	0.32
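
With five labels, a P@1 of 0.32 is easier to interpret next to a simple reference point such as always predicting the most frequent label. The sketch below computes that majority-class baseline. The function name is hypothetical, and the sample labels are just the ten shown in the `head` output earlier, so the printed share is illustrative and is not the baseline for `movie.valid.txt`.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Labels of the first ten reviews shown above (illustrative sample only).
sample_labels = ["1", "4", "1", "3", "1", "4", "1", "3", "1", "1"]
print(majority_baseline(sample_labels))
```

Running the same function over the labels in the validation file would give the figure to compare against 0.32.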
