# Sentiment Extraction
- Name: Minh T. Nguyen
- Date: 11/24/2023
- About:
    - **Description Sentiment Analysis**: Use a pretrained model to perform sentiment analysis on the listing descriptions and create a new feature.
    - Takes 2 hours to run on GPU T4 x2 on Kaggle.
- Reference: [Getting Started with Sentiment Analysis using Python - HuggingFace](https://huggingface.co/blog/sentiment-analysis-python)

In [1]:
!pip install -q transformers

**Note:** The datasets can be found [here](https://www.kaggle.com/competitions/two-sigma-connect-rental-listing-inquiries/data?select=train.json.zip).
- train.json: the training set.
- images_sample.zip: listing images organized by listing_id (a sample of 100 listings)
- Kaggle-renthop.7z: listing images organized by listing_id. Total size: 78.5 GB compressed.

In [2]:
# default directory connection from Kaggle
import numpy as np
import pandas as pd
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/train-json/train.json


In [3]:
# import libraries
import numpy as np
import pandas as pd
from transformers import pipeline
from transformers import AutoTokenizer
import re

import warnings
warnings.filterwarnings('ignore')



## 1. Import dataset

In [4]:
# import the dataset
df = pd.read_json("/kaggle/input/train-json/train.json")
df.head(5)

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,listing_id,longitude,manager_id,photos,price,street_address,interest_level
4,1.0,1,8579a0b0d54db803821a35a4a615e97a,2016-06-16 05:55:27,Spacious 1 Bedroom 1 Bathroom in Williamsburg!...,145 Borinquen Place,"[Dining Room, Pre-War, Laundry in Building, Di...",40.7108,7170325,-73.9539,a10db4590843d78c784171a107bdacb4,[https://photos.renthop.com/2/7170325_3bb5ac84...,2400,145 Borinquen Place,medium
6,1.0,2,b8e75fc949a6cd8225b455648a951712,2016-06-01 05:44:33,BRAND NEW GUT RENOVATED TRUE 2 BEDROOMFind you...,East 44th,"[Doorman, Elevator, Laundry in Building, Dishw...",40.7513,7092344,-73.9722,955db33477af4f40004820b4aed804a0,[https://photos.renthop.com/2/7092344_7663c19a...,3800,230 East 44th,low
9,1.0,2,cd759a988b8f23924b5a2058d5ab2b49,2016-06-14 15:19:59,**FLEX 2 BEDROOM WITH FULL PRESSURIZED WALL**L...,East 56th Street,"[Doorman, Elevator, Laundry in Building, Laund...",40.7575,7158677,-73.9625,c8b10a317b766204f08e613cef4ce7a0,[https://photos.renthop.com/2/7158677_c897a134...,3495,405 East 56th Street,medium
10,1.5,3,53a5b119ba8f7b61d4e010512e0dfc85,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,[],40.7145,7211212,-73.9425,5ba989232d0489da1b5f2c45f6688adc,[https://photos.renthop.com/2/7211212_1ed4542e...,3000,792 Metropolitan Avenue,medium
15,1.0,0,bfb9405149bfff42a92980b594c28234,2016-06-28 03:50:23,Over-sized Studio w abundant closets. Availabl...,East 34th Street,"[Doorman, Elevator, Fitness Center, Laundry in...",40.7439,7225292,-73.9743,2c3b41f588fbb5234d8a1e885a436cfa,[https://photos.renthop.com/2/7225292_901f1984...,2795,340 East 34th Street,low


In [5]:
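# outlier removal: drop listings priced above the 99th percentile
upper_bound = np.percentile(df["price"].values, 99)
# .copy() so that columns added later go to an independent frame, not a view of df
df_filtered = df[df["price"] <= upper_bound].copy()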
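
To make the cut-off concrete, here is a minimal pure-Python sketch of a percentile with linear interpolation between the two nearest ranks, which is what `np.percentile` does by default. The `percentile` helper and the 1,000,000 listing are illustrative assumptions, not part of the notebook; the other five prices are the ones shown in the table above.

```python
def percentile(values, q):
    """q-th percentile using linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0      # fractional rank of the percentile
    lower = int(pos)                          # rank just below (pos is non-negative)
    upper = min(lower + 1, len(ordered) - 1)  # rank just above, clamped to the end
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac

# five prices from the table above plus one made-up extreme listing
prices = [2400, 3800, 3495, 3000, 2795, 1_000_000]
upper_bound = percentile(prices, 99)

# same filter as the notebook: keep rows at or below the bound
kept = [p for p in prices if p <= upper_bound]
```

With only six values the 99th percentile falls between the two largest prices, so just the extreme listing is dropped. On the full dataset the same filter removes roughly the top 1% of prices.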

In [6]:
df_filtered.head(5)

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,listing_id,longitude,manager_id,photos,price,street_address,interest_level
4,1.0,1,8579a0b0d54db803821a35a4a615e97a,2016-06-16 05:55:27,Spacious 1 Bedroom 1 Bathroom in Williamsburg!...,145 Borinquen Place,"[Dining Room, Pre-War, Laundry in Building, Di...",40.7108,7170325,-73.9539,a10db4590843d78c784171a107bdacb4,[https://photos.renthop.com/2/7170325_3bb5ac84...,2400,145 Borinquen Place,medium
6,1.0,2,b8e75fc949a6cd8225b455648a951712,2016-06-01 05:44:33,BRAND NEW GUT RENOVATED TRUE 2 BEDROOMFind you...,East 44th,"[Doorman, Elevator, Laundry in Building, Dishw...",40.7513,7092344,-73.9722,955db33477af4f40004820b4aed804a0,[https://photos.renthop.com/2/7092344_7663c19a...,3800,230 East 44th,low
9,1.0,2,cd759a988b8f23924b5a2058d5ab2b49,2016-06-14 15:19:59,**FLEX 2 BEDROOM WITH FULL PRESSURIZED WALL**L...,East 56th Street,"[Doorman, Elevator, Laundry in Building, Laund...",40.7575,7158677,-73.9625,c8b10a317b766204f08e613cef4ce7a0,[https://photos.renthop.com/2/7158677_c897a134...,3495,405 East 56th Street,medium
10,1.5,3,53a5b119ba8f7b61d4e010512e0dfc85,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,[],40.7145,7211212,-73.9425,5ba989232d0489da1b5f2c45f6688adc,[https://photos.renthop.com/2/7211212_1ed4542e...,3000,792 Metropolitan Avenue,medium
15,1.0,0,bfb9405149bfff42a92980b594c28234,2016-06-28 03:50:23,Over-sized Studio w abundant closets. Availabl...,East 34th Street,"[Doorman, Elevator, Fitness Center, Laundry in...",40.7439,7225292,-73.9743,2c3b41f588fbb5234d8a1e885a436cfa,[https://photos.renthop.com/2/7225292_901f1984...,2795,340 East 34th Street,low


## 2. Sentiment Analysis With Pretrained Model
- The model used is `distilbert-base-uncased-finetuned-sst-2-english`, a small and fast version of BERT. It is fine-tuned on the Stanford Sentiment Treebank (SST-2) dataset, which consists of sentences from movie reviews labeled with their sentiment. DistilBERT itself is a transformer-based model that is a distilled version of BERT, designed to be faster and lighter while still retaining most of BERT's performance. It follows the BERT architecture, an attention-based neural network that uses self-attention to weigh the importance of different words in a sentence. Since there is no pretrained BERT available for apartment-listing vocabulary, this general-purpose sentiment analysis model is a reasonable choice.
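
As a rough illustration of how self-attention "weighs" words, the sketch below computes scaled dot-product attention over three toy word vectors. It is a simplification under stated assumptions: a single head, no learned query/key/value projections (real BERT layers have them), and made-up 2-dimensional vectors. The function names are hypothetical.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Each output vector is a weighted average of all input vectors."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # similarity of this word to every word, scaled by sqrt(dimension)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# made-up embeddings for "spacious", "bedroom", "the"
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(toy_vectors)
```

Here `weights[0]` holds how much the first word attends to each of the three words; the word with the most similar vector receives the largest weight.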

### Resources
- [BERT Neural Network - EXPLAINED!](https://www.youtube.com/watch?v=xI0HHN5XKDo)
- [What is BERT and how does it work? | A Quick Review](https://www.youtube.com/watch?v=6ahxPTLZxU8)

In [7]:
# get the first description
test_des = df_filtered.description.iloc[0]
print(test_des)

Spacious 1 Bedroom 1 Bathroom in Williamsburg!Apartment Features:- Renovated Eat in Kitchen With Dishwasher- Renovated Bathroom- Beautiful Hardwood Floors- Lots of Sunlight- Great Closet Space- Freshly Painted- Heat and Hot Water Included- Live in Super Nearby L, J, M & G Trains !<br /><br />Contact Information:Kenneth BeakExclusive AgentC: 064-692-8838Email: kagglemanager@renthop.com, Text or Email to schedule a private viewing!<br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><p><a  website_redacted 


In [8]:
# function to clean HTML tags and whitespace
def preprocess_text(text):
    text = re.sub(r'<[^>]+>', '', text)  # remove HTML tags
    text = re.sub(r'\s+', ' ', text)     # replace multiple whitespaces with single space
    return text.strip()

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# function to truncate text to a max length (counted in tokens)
def truncate_text(text, max_length=500):
    # encode the text, truncating so the input does not exceed max_length tokens
    inputs = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=max_length,
        truncation=True
    )
    # decode back to a string, without the special tokens
    # (the uncased tokenizer also lowercases the text)
    truncated_text = tokenizer.decode(inputs['input_ids'], skip_special_tokens=True)
    return truncated_text

# init sentiment analysis pipeline
sentiment_pipeline = pipeline("sentiment-analysis")

# function to get sentiment label and score for one description
def get_sentiment(text):
    print("Processed")  # progress marker, printed once per listing
    return sentiment_pipeline(text)[0]

# apply preprocessing to the descriptions
df_filtered['clean_description'] = df_filtered['description'].apply(preprocess_text)

# truncate descriptions to max_length
df_filtered['truncated_description'] = df_filtered['clean_description'].apply(truncate_text)

# perform sentiment analysis
df_filtered['sentiment'] = df_filtered['truncated_description'].apply(get_sentiment)

# define thresholds for sentiment classification
positive_threshold = 0.75
negative_threshold = 0.25

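# function to classify sentiment
# note: the pipeline score is the confidence of the predicted label, so for this
# two-class model it is at least 0.5; the NEGATIVE branch (score <= 0.25) therefore
# never fires as written, and NEGATIVE predictions are mapped to 0
def classify_sentiment(sentiment):
    score = sentiment['score']
    if sentiment['label'] == 'POSITIVE' and score >= positive_threshold:
        return 1
    elif sentiment['label'] == 'NEGATIVE' and score <= negative_threshold:
        return -1
    else:
        return 0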

# apply sentiment classification to the dataframe
df_filtered['sentiment_label'] = df_filtered['sentiment'].apply(classify_sentiment)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Processed
...
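
The cell above calls the pipeline once per description. A minimal sketch of the alternative, sending descriptions in batches, is shown below. `chunks`, `classify_in_batches`, and `fake_classifier` are hypothetical helpers introduced for illustration; the fake classifier stands in for the real model so the sketch runs without downloading anything.

```python
def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def classify_in_batches(texts, classifier, batch_size=32):
    """Call `classifier` on lists of texts and collect one result per text."""
    results = []
    for batch in chunks(texts, batch_size):
        results.extend(classifier(batch))
    return results

# stand-in for the real model: returns one dict per input text
def fake_classifier(batch):
    return [{"label": "POSITIVE", "score": 1.0} for _ in batch]

demo = classify_in_batches(["a", "b", "c", "d", "e"], fake_classifier, batch_size=2)
```

In the notebook, `sentiment_pipeline` could be passed as `classifier`, since a Hugging Face pipeline also accepts a list of strings and returns one result per string. Whether this shortens the run depends on the hardware and was not measured here.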


In [9]:
# check dataset's first 5 rows
df_filtered.head(5)

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,listing_id,longitude,manager_id,photos,price,street_address,interest_level,clean_description,truncated_description,sentiment,sentiment_label
4,1.0,1,8579a0b0d54db803821a35a4a615e97a,2016-06-16 05:55:27,Spacious 1 Bedroom 1 Bathroom in Williamsburg!...,145 Borinquen Place,"[Dining Room, Pre-War, Laundry in Building, Di...",40.7108,7170325,-73.9539,a10db4590843d78c784171a107bdacb4,[https://photos.renthop.com/2/7170325_3bb5ac84...,2400,145 Borinquen Place,medium,Spacious 1 Bedroom 1 Bathroom in Williamsburg!...,spacious 1 bedroom 1 bathroom in williamsburg!...,"{'label': 'POSITIVE', 'score': 0.8850623965263...",1
6,1.0,2,b8e75fc949a6cd8225b455648a951712,2016-06-01 05:44:33,BRAND NEW GUT RENOVATED TRUE 2 BEDROOMFind you...,East 44th,"[Doorman, Elevator, Laundry in Building, Dishw...",40.7513,7092344,-73.9722,955db33477af4f40004820b4aed804a0,[https://photos.renthop.com/2/7092344_7663c19a...,3800,230 East 44th,low,BRAND NEW GUT RENOVATED TRUE 2 BEDROOMFind you...,brand new gut renovated true 2 bedroomfind you...,"{'label': 'POSITIVE', 'score': 0.998374342918396}",1
9,1.0,2,cd759a988b8f23924b5a2058d5ab2b49,2016-06-14 15:19:59,**FLEX 2 BEDROOM WITH FULL PRESSURIZED WALL**L...,East 56th Street,"[Doorman, Elevator, Laundry in Building, Laund...",40.7575,7158677,-73.9625,c8b10a317b766204f08e613cef4ce7a0,[https://photos.renthop.com/2/7158677_c897a134...,3495,405 East 56th Street,medium,**FLEX 2 BEDROOM WITH FULL PRESSURIZED WALL**L...,* * flex 2 bedroom with full pressurized wall ...,"{'label': 'POSITIVE', 'score': 0.9986716508865...",1
10,1.5,3,53a5b119ba8f7b61d4e010512e0dfc85,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,[],40.7145,7211212,-73.9425,5ba989232d0489da1b5f2c45f6688adc,[https://photos.renthop.com/2/7211212_1ed4542e...,3000,792 Metropolitan Avenue,medium,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,a brand new 3 bedroom 1. 5 bath apartmentenjoy...,"{'label': 'NEGATIVE', 'score': 0.6298918724060...",0
15,1.0,0,bfb9405149bfff42a92980b594c28234,2016-06-28 03:50:23,Over-sized Studio w abundant closets. Availabl...,East 34th Street,"[Doorman, Elevator, Fitness Center, Laundry in...",40.7439,7225292,-73.9743,2c3b41f588fbb5234d8a1e885a436cfa,[https://photos.renthop.com/2/7225292_901f1984...,2795,340 East 34th Street,low,Over-sized Studio w abundant closets. Availabl...,over - sized studio w abundant closets. availa...,"{'label': 'NEGATIVE', 'score': 0.9978052973747...",0
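
In the rows above, descriptions predicted `NEGATIVE` receive `sentiment_label` 0 even at very high confidence, because the pipeline's `score` is the confidence of the predicted label rather than a probability of being positive. If the intent is for confidently negative descriptions to map to -1 (an assumption about intent, not something the notebook states), one option is to convert the result to a signed score first. The function names and the symmetric 0.75 threshold below are illustrative assumptions.

```python
def signed_score(sentiment):
    """Map a pipeline result to [-1, 1]: POSITIVE keeps its score, NEGATIVE flips sign."""
    score = sentiment["score"]
    return score if sentiment["label"] == "POSITIVE" else -score

def classify_signed(sentiment, threshold=0.75):
    """Return 1 / -1 for confident positive / negative predictions, else 0."""
    s = signed_score(sentiment)
    if s >= threshold:
        return 1
    if s <= -threshold:
        return -1
    return 0

# scores rounded from the table above
examples = [
    {"label": "POSITIVE", "score": 0.998},
    {"label": "NEGATIVE", "score": 0.630},
    {"label": "NEGATIVE", "score": 0.998},
]
labels = [classify_signed(e) for e in examples]
```

Under this variant the low-confidence negative stays at 0, while the high-confidence negative becomes -1.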


In [11]:
# save dataset in csv file
df_filtered.to_csv("/kaggle/working/sentimental_extraction.csv")
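
The `sentiment` column holds Python dicts, which a CSV file stores as their string representation. A minimal sketch of flattening the dict into two scalar fields before writing is shown below, using only the standard library. The field names `sentiment_pred` and `sentiment_score` are hypothetical, and the scores are rounded from the table above.

```python
import csv
import io

rows = [
    {"listing_id": 7170325, "sentiment": {"label": "POSITIVE", "score": 0.885}},
    {"listing_id": 7225292, "sentiment": {"label": "NEGATIVE", "score": 0.998}},
]

def flatten(row):
    """Replace the nested `sentiment` dict with two scalar fields."""
    flat = {k: v for k, v in row.items() if k != "sentiment"}
    flat["sentiment_pred"] = row["sentiment"]["label"]
    flat["sentiment_score"] = row["sentiment"]["score"]
    return flat

# write to an in-memory buffer instead of a file
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["listing_id", "sentiment_pred", "sentiment_score"]
)
writer.writeheader()
writer.writerows(flatten(r) for r in rows)

# read it back: every field is a plain string that needs no dict parsing
buffer.seek(0)
restored = list(csv.DictReader(buffer))
```

The same idea carries over to the DataFrame: two scalar columns are easier to reload and to feed into a downstream model than one column of dict strings.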

In [12]:
# check dataset's first 100 rows
df_filtered.head(100)

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,listing_id,longitude,manager_id,photos,price,street_address,interest_level,clean_description,truncated_description,sentiment,sentiment_label
4,1.0,1,8579a0b0d54db803821a35a4a615e97a,2016-06-16 05:55:27,Spacious 1 Bedroom 1 Bathroom in Williamsburg!...,145 Borinquen Place,"[Dining Room, Pre-War, Laundry in Building, Di...",40.7108,7170325,-73.9539,a10db4590843d78c784171a107bdacb4,[https://photos.renthop.com/2/7170325_3bb5ac84...,2400,145 Borinquen Place,medium,Spacious 1 Bedroom 1 Bathroom in Williamsburg!...,spacious 1 bedroom 1 bathroom in williamsburg!...,"{'label': 'POSITIVE', 'score': 0.8850623965263...",1
6,1.0,2,b8e75fc949a6cd8225b455648a951712,2016-06-01 05:44:33,BRAND NEW GUT RENOVATED TRUE 2 BEDROOMFind you...,East 44th,"[Doorman, Elevator, Laundry in Building, Dishw...",40.7513,7092344,-73.9722,955db33477af4f40004820b4aed804a0,[https://photos.renthop.com/2/7092344_7663c19a...,3800,230 East 44th,low,BRAND NEW GUT RENOVATED TRUE 2 BEDROOMFind you...,brand new gut renovated true 2 bedroomfind you...,"{'label': 'POSITIVE', 'score': 0.998374342918396}",1
9,1.0,2,cd759a988b8f23924b5a2058d5ab2b49,2016-06-14 15:19:59,**FLEX 2 BEDROOM WITH FULL PRESSURIZED WALL**L...,East 56th Street,"[Doorman, Elevator, Laundry in Building, Laund...",40.7575,7158677,-73.9625,c8b10a317b766204f08e613cef4ce7a0,[https://photos.renthop.com/2/7158677_c897a134...,3495,405 East 56th Street,medium,**FLEX 2 BEDROOM WITH FULL PRESSURIZED WALL**L...,* * flex 2 bedroom with full pressurized wall ...,"{'label': 'POSITIVE', 'score': 0.9986716508865...",1
10,1.5,3,53a5b119ba8f7b61d4e010512e0dfc85,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,[],40.7145,7211212,-73.9425,5ba989232d0489da1b5f2c45f6688adc,[https://photos.renthop.com/2/7211212_1ed4542e...,3000,792 Metropolitan Avenue,medium,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,a brand new 3 bedroom 1. 5 bath apartmentenjoy...,"{'label': 'NEGATIVE', 'score': 0.6298918724060...",0
15,1.0,0,bfb9405149bfff42a92980b594c28234,2016-06-28 03:50:23,Over-sized Studio w abundant closets. Availabl...,East 34th Street,"[Doorman, Elevator, Fitness Center, Laundry in...",40.7439,7225292,-73.9743,2c3b41f588fbb5234d8a1e885a436cfa,[https://photos.renthop.com/2/7225292_901f1984...,2795,340 East 34th Street,low,Over-sized Studio w abundant closets. Availabl...,over - sized studio w abundant closets. availa...,"{'label': 'NEGATIVE', 'score': 0.9978052973747...",0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284,1.0,0,aa1739dd95a05fe1892f4c18d796a4c4,2016-06-09 02:40:47,Actual apartment photos<br /><br />Bond New Yo...,W 56 Street,"[Doorman, Elevator, Loft, Laundry in Building,...",40.7655,7129545,-73.9817,b5d90f1b957456dfe7a3ad061efed280,[https://photos.renthop.com/2/7129545_94d1eb7f...,2795,211 W 56 Street,low,Actual apartment photosBond New York is a real...,actual apartment photosbond new york is a real...,"{'label': 'NEGATIVE', 'score': 0.7148356437683...",0
289,1.0,1,04d9c09943370b4d2ea48a47e44c028c,2016-06-16 02:38:27,1BR - JR4 - Inwood-Above 181 - Prime Location ...,W 204 Street,"[Elevator, Dishwasher, Hardwood Floors, Dogs A...",40.8679,7167474,-73.9240,2f1ac1463ec2b0212f337801d176951f,[https://photos.renthop.com/2/7167474_c228e95e...,1875,686 W 204 Street,low,1BR - JR4 - Inwood-Above 181 - Prime Location ...,1br - jr4 - inwood - above 181 - prime locatio...,"{'label': 'NEGATIVE', 'score': 0.8555898666381...",0
291,1.0,1,0,2016-06-21 01:44:38,,West 34th Street,"[Doorman, Fitness Center, Pre-War, Dogs Allowe...",40.7530,7187520,-73.9958,15cb1bfaa5e4583d5df9269942064a0e,[],3450,360 West 34th Street,low,,,"{'label': 'POSITIVE', 'score': 0.7481209635734...",0
292,1.0,2,8f837ada8d7ec5d251a369cd5909af7c,2016-06-26 03:52:06,"Gorgeous, renovated 2 Bedroom with 2 King-size...",Madison Avenue,"[Elevator, Laundry in Building, Dogs Allowed, ...",40.7957,7219030,-73.9481,5a9e53e9a2d79230745af3ea227c782c,[https://photos.renthop.com/2/7219030_2859fc47...,2659,1632 Madison Avenue,low,"Gorgeous, renovated 2 Bedroom with 2 King-size...","gorgeous, renovated 2 bedroom with 2 king - si...","{'label': 'POSITIVE', 'score': 0.9975020289421...",1
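
Row 284 above shows a side effect of removing tags outright: `photos<br /><br />Bond` becomes `photosBond`, fusing two words. A possible variant, sketched below, replaces each tag with a space before collapsing whitespace. `strip_tags_with_space` is a hypothetical alternative, not the notebook's function; `strip_tags` repeats the notebook's two regexes for comparison.

```python
import re

def strip_tags(text):
    """Same two regexes as preprocess_text: tags are deleted outright."""
    text = re.sub(r"<[^>]+>", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def strip_tags_with_space(text):
    """Replace each HTML tag with a space, then collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

# start of the description in row 284, as shown in the table above
raw = "Actual apartment photos<br /><br />Bond New York is a real"

fused = strip_tags(raw)
separated = strip_tags_with_space(raw)
```

Keeping word boundaries intact matters here because the tokenizer would otherwise split a fused string such as `photosBond` into different word pieces than the two separate words.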


In [13]:
# save dataset in json file
df_filtered.to_json("/kaggle/working/sentimental_extraction.json")