<a href="https://colab.research.google.com/github/unt-iialab/INFO5731_Spring2020/blob/master/Assignments/INFO5731_Assignment_Four.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling and sentiment analysis on **the dataset you created in assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for how topic modeling works and how topics connect to the human meanings of documents. Based on the dataset from assignment three, write a Python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

1. Features (text representation) used for topic modeling.

2. Top 10 clusters for topic modeling.

3. Summarize and describe the topic for each cluster.


In [None]:
# Write your code here
!pip install pandas scikit-learn gensim sentence-transformers bertopic






In [None]:
!pip install biterm




In [None]:
!pip install pandas gensim nltk




In [None]:
import pandas as pd
import nltk
from gensim import corpora
from gensim.models import LdaModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the NLTK stopword list and the Punkt tokenizer models
nltk.download('stopwords')
nltk.download('punkt')

# Load the sentiment-labeled dataset from assignment three
df = pd.read_csv('amazonreviewssentimented.csv')

# Preprocess: lowercase and tokenize, then keep alphanumeric tokens that are not stopwords
stop_words = set(stopwords.words('english'))
texts = df['clean_text'].apply(lambda x: word_tokenize(str(x).lower()))
texts = [[word for word in text if word.isalnum() and word not in stop_words] for text in texts]

# Build the dictionary (token-to-id mapping) for the documents
dictionary = corpora.Dictionary(texts)

# Convert each document to its bag-of-words representation
corpus = [dictionary.doc2bow(text) for text in texts]

# Train the LDA model with 10 topics
lda_model = LdaModel(corpus, num_topics=10, id2word=dictionary, passes=15)

# Print the 10 topics along with their top 5 keywords
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


(0, '0.104*"customer" + 0.052*"apple" + 0.052*"included" + 0.052*"us" + 0.052*"buy"')
(1, '0.002*"apple" + 0.002*"power" + 0.002*"devices" + 0.002*"watts" + 0.002*"charging"')
(2, '0.156*"works" + 0.104*"ipad" + 0.052*"something" + 0.052*"either" + 0.052*"used"')
(3, '0.060*"charging" + 0.040*"adapter" + 0.033*"iphone" + 0.027*"apple" + 0.020*"fast"')
(4, '0.030*"power" + 0.029*"apple" + 0.022*"watts" + 0.019*"charging" + 0.016*"pd"')
(5, '0.028*"power" + 0.026*"apple" + 0.024*"watts" + 0.021*"charging" + 0.019*"pd"')
(6, '0.046*"charger" + 0.039*"charging" + 0.039*"apple" + 0.023*"speed" + 0.023*"20w"')
(7, '0.002*"charging" + 0.002*"adapter" + 0.002*"iphone" + 0.002*"power" + 0.002*"compact"')
(8, '0.050*"cords" + 0.050*"one" + 0.038*"connection" + 0.038*"cord" + 0.038*"charger"')
(9, '0.194*"great" + 0.097*"power" + 0.097*"delivery" + 0.097*"adapter" + 0.097*"fast"')
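
Each row printed above is a `(topic_id, description)` tuple whose description lists `weight*"word"` terms. The sketch below, which uses only the standard library, parses such rows and flags topics whose listed weights are all identical. Topics 1 and 7 above show 0.002 for every keyword, which may indicate that few or no documents were assigned to them. The names `parse_topic` and `is_flat` and the flatness heuristic are illustrative assumptions, not part of gensim.

```python
import re

# Matches one `weight*"word"` term in a print_topics() description string.
TERM_PATTERN = re.compile(r'(\d+\.\d+)\*"([^"]+)"')


def parse_topic(description):
    """Return [(word, weight), ...] for one topic description string."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(description)]


def is_flat(terms, tolerance=1e-9):
    """Hypothetical heuristic: True when every listed term has the same weight."""
    weights = [weight for _, weight in terms]
    if not weights:
        return False
    return max(weights) - min(weights) <= tolerance


# Two rows copied from the output above.
topics = [
    (1, '0.002*"apple" + 0.002*"power" + 0.002*"devices" + 0.002*"watts" + 0.002*"charging"'),
    (3, '0.060*"charging" + 0.040*"adapter" + 0.033*"iphone" + 0.027*"apple" + 0.020*"fast"'),
]

for topic_id, description in topics:
    terms = parse_topic(description)
    status = "flat" if is_flat(terms) else "informative"
    print(topic_id, status, [word for word, _ in terms])
```

Flat topics are candidates for removal when summarizing the clusters, or a hint that fewer than 10 topics may fit a small dataset better.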


# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentiment polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Note: **80% data for training and 20% data for testing**.

1. Select features for the sentiment classification and explain why you select these features. Use a markdown cell to provide your explanation.

2. Select two of the supervised learning algorithms/models from the scikit-learn library: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning, to build two sentiment classifiers respectively. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference for cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

3. Compare the performance over accuracy, precision, recall, and F1 score for the two algorithms you selected. The test set must be used for model evaluation in this step. Here is the reference of how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.
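
To make the cross-validation requirement in item 2 concrete, here is a minimal sketch of how k-fold partitioning divides sample indices. It is a plain, unshuffled version written with the standard library only; `k_fold_indices` is a hypothetical helper, and scikit-learn's own splitters (including the stratified variant it applies to classifiers when `cv` is an integer) handle more cases.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_indices, validation_indices) for each fold, without shuffling."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=12, n_folds=5))
for train, validation in folds:
    print(len(train), validation)
```

Every sample is used for validation exactly once, and the model is refitted on the remaining folds each time, so the averaged score depends less on one particular split.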

In [None]:
# Write your code here

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the sentiment-labeled dataset
df = pd.read_csv('amazonreviewssentimented.csv')

# Features: cleaned review text
x = df['clean_text'].astype(str)

# Labels: sentiment
y = df['sentiment']

# Split the dataset: 80% for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Extract TF-IDF features; the vectorizer is fitted on the training split only
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Support Vector Machine (SVM): 5-fold cross-validation on the training split, then fit and predict
svm_model = SVC(kernel='linear', C=1)
svm_scores = cross_val_score(svm_model, X_train_tfidf, y_train, cv=5, scoring='accuracy')
svm_model.fit(X_train_tfidf, y_train)
svm_predictions = svm_model.predict(X_test_tfidf)

# Random forest: 5-fold cross-validation on the training split, then fit and predict
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_scores = cross_val_score(rf_model, X_train_tfidf, y_train, cv=5, scoring='accuracy')
rf_model.fit(X_train_tfidf, y_train)
rf_predictions = rf_model.predict(X_test_tfidf)


# Compute the evaluation metrics for one model and print them
def evaluate_performance(y_true, y_pred, algorithm_name):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='weighted')
    recall = recall_score(y_true, y_pred, average='weighted')
    f1 = f1_score(y_true, y_pred, average='weighted')
    print(f"Performance Metrics for {algorithm_name}:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}\n")


# Evaluate the Support Vector Machine on the test set
evaluate_performance(y_test, svm_predictions, 'Support Vector Machine')

# Evaluate the random forest on the test set
evaluate_performance(y_test, rf_predictions, 'Random Forest')



Performance Metrics for Support Vector Machine:
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1 Score: 1.0000

Performance Metrics for Random Forest:
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1 Score: 1.0000
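
The metrics above are computed with `average='weighted'`, which averages per-class scores using each class's share of the true labels as the weight. The sketch below reproduces that arithmetic with the standard library only, so the numbers can be checked by hand. The labels and predictions are made up for illustration and are not taken from the assignment data; `weighted_scores` is a hypothetical helper.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return (accuracy, precision, recall, f1), weighting per-class scores by support."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in sorted(support):
        true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        class_precision = true_positives / predicted if predicted else 0.0
        class_recall = true_positives / support[label]
        denominator = class_precision + class_recall
        class_f1 = 2 * class_precision * class_recall / denominator if denominator else 0.0
        weight = support[label] / total
        precision += weight * class_precision
        recall += weight * class_recall
        f1 += weight * class_f1
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / total
    return accuracy, precision, recall, f1


# Made-up example labels (assumption, not from the dataset).
y_true = ["positive", "positive", "positive", "negative", "neutral", "neutral"]
y_pred = ["positive", "positive", "negative", "negative", "neutral", "positive"]

accuracy, precision, recall, f1 = weighted_scores(y_true, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
```

With support weighting, recall always equals accuracy, because each class's recall is multiplied by exactly the fraction of samples that belong to it.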



1. TF-IDF (Term Frequency-Inverse Document Frequency) features:
TF-IDF weights each word by how often it appears in a review, discounted by how many reviews contain it, so words that are frequent in one review but rare across the dataset receive high weights. Capping the vocabulary with max_features=5000 and removing English stop words keeps the focus on words that are most discriminative for sentiment classification. These features help distinguish positive from negative reviews.



2. SVM (Support Vector Machine): this model is effective for text classification because it seeks the optimal hyperplane that separates the classes based on the selected features. It works well with high-dimensional sparse data such as TF-IDF vectors.
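
As a worked illustration of the TF-IDF weighting described in item 1, the sketch below computes the vectors by hand with the standard library. It follows the smoothed IDF formula documented for scikit-learn's default settings, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. Whitespace tokenization and the three example reviews are simplifying assumptions, and `tfidf_vectors` is a hypothetical helper rather than a library function.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, idf, vectors) using smoothed IDF and L2-normalized rows."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    vocabulary = sorted({token for doc in tokenized for token in doc})
    # Document frequency: number of documents that contain each token.
    doc_freq = Counter(token for doc in tokenized for token in set(doc))
    idf = {token: math.log((1 + n_docs) / (1 + doc_freq[token])) + 1 for token in vocabulary}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[token] * idf[token] for token in vocabulary]
        norm = math.sqrt(sum(value * value for value in raw)) or 1.0
        vectors.append([value / norm for value in raw])
    return vocabulary, idf, vectors


# Made-up reviews for illustration (assumption, not from the dataset).
documents = ["great fast charger", "charger stopped working", "great value"]
vocabulary, idf, vectors = tfidf_vectors(documents)

for token in vocabulary:
    print(f"{token}: idf={idf[token]:.4f}")
```

Words that occur in only one review, such as "fast", receive a higher IDF than words shared by several reviews, such as "charger", which is why TF-IDF emphasizes the terms that set a review apart.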

# **Question 3: House price prediction**

(20 points). You are required to build a **regression** model to predict the house price with 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning technique. The training data, testing data, and data description files can be downloaded from Canvas. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.

1. Conduct the necessary Exploratory Data Analysis (EDA) and data cleaning steps on the given dataset. Split data for training and testing.
2. Based on the EDA results, select a number of features for the regression model. Briefly explain why you selected those features.
3. Develop a regression model. The train set should be used.
4. Evaluate performance of the regression model you developed using appropriate evaluation metrics. The test set should be used.

In [None]:
# Write your code here

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error

# 1. Conduct the necessary Exploratory Data Analysis (EDA) and data cleaning steps, then split the data.
# Load the train and test data
trndat = pd.read_csv("train.csv")
tstdat = pd.read_csv("test.csv")

# Combine the train and test features so both receive the same preprocessing
mixdata = pd.concat([trndat.drop("SalePrice", axis=1), tstdat])

# Separate the numeric and categorical columns
num_clm = mixdata.select_dtypes(include=np.number).columns
cat_clm = mixdata.select_dtypes(include='object').columns

# Impute missing numeric values with the column mean
numeric_data_imputer = SimpleImputer(strategy='mean')
mixdata[num_clm] = numeric_data_imputer.fit_transform(mixdata[num_clm])

# Impute missing categorical values with the most frequent value
categorical_data_imputer = SimpleImputer(strategy='most_frequent')
mixdata[cat_clm] = categorical_data_imputer.fit_transform(mixdata[cat_clm])

# Convert categorical columns to integer codes using a label encoder
label_encoder = LabelEncoder()
for col in cat_clm:
    mixdata[col] = label_encoder.fit_transform(mixdata[col])

# Split back into the original training and testing rows
imputed_training_data = mixdata.iloc[:len(trndat)]
imputed_testing_data = mixdata.iloc[len(trndat):]

# Scale the features to the [0, 1] range
data_scaler = MinMaxScaler()
x_tr = imputed_training_data
y_train = trndat["SalePrice"]

X_train_scaled_min_max = data_scaler.fit_transform(x_tr)
X_train_scaled_min_max_df = pd.DataFrame(X_train_scaled_min_max, columns=x_tr.columns)

X_test_scaled_min_max = data_scaler.transform(imputed_testing_data)
X_test_scaled_min_max_df = pd.DataFrame(X_test_scaled_min_max, columns=imputed_testing_data.columns)

# Hold out 20% of the labeled training data for evaluation (80-20 split)
x_train, x_test, y_train, y_test = train_test_split(X_train_scaled_min_max_df, y_train, test_size=0.2, random_state=0)

# 3. Develop a regression model using the train set.
regression_model = LinearRegression()
regression_model.fit(x_train, y_train)

# 4. Evaluate the regression model on the held-out test set with appropriate metrics.
y_pred = regression_model.predict(x_test)

# Report R squared and the root mean squared error
print('Linear Regression R squared:', regression_model.score(x_test, y_test))
m_sqrd_er = mean_squared_error(y_test, y_pred)
r_m_sqrd_er = np.sqrt(m_sqrd_er)
print('Root Mean Squared Error:', r_m_sqrd_er)

# Predict house prices for the unlabeled test file
pred_price = regression_model.predict(X_test_scaled_min_max_df)

# Save the predictions to a CSV file
tstdat["Predicted_SalePrice"] = pred_price
tstdat[['Id', 'Predicted_SalePrice']].to_csv('pred_price.csv', index=False)


Linear Regression R squared: 0.507109434948416
Root Mean Squared Error: 58342.33703683097
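
The two numbers above come from `regression_model.score` (R squared) and the square root of `mean_squared_error` (RMSE). As a sketch of what they measure, the standard-library code below computes both from their definitions, RMSE = sqrt(mean((y - y_hat)^2)) and R^2 = 1 - SS_res / SS_tot. The prices are hypothetical values chosen for easy arithmetic, not rows from the assignment data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: typical prediction error in the target's units."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    residual = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    total = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - residual / total


# Hypothetical sale prices (assumption, not from train.csv).
actual = [200000, 150000, 300000, 250000]
predicted = [210000, 140000, 280000, 260000]

print('R squared:', r_squared(actual, predicted))
print('Root Mean Squared Error:', rmse(actual, predicted))
```

Read this way, an R squared near 0.51 means the model explains roughly half of the variance in sale price on the held-out split, and the RMSE is expressed in the same currency units as `SalePrice`.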


# **Question 4: Using Pre-trained LLMs**

(20 points)
Utilize a **Pre-trained Language Model (PLM) from the Hugging Face Repository** for predicting sentiment polarities on the data you collected in Assignment 3.

Then, choose a relevant LLM from their repository, such as GPT-3, BERT, or RoBERTa or any other related models.
1. (5 points) Provide a brief description of the PLM you selected, including its original pretraining data sources, number of parameters, and any task-specific fine-tuning if applied.
2. (10 points) Use the selected PLM to perform sentiment analysis on the data collected in Assignment 3. Only use the model in the **zero-shot** setting; NO fine-tuning is required. Evaluate the model's performance against the ground truth (the labels you annotated) on Accuracy, Precision, Recall, and F1 metrics.
3. (5 points) Discuss the advantages and disadvantages of the selected PLM, and any challenges encountered during the implementation. This will enable a comprehensive understanding of the chosen LLM's applicability and effectiveness for the given task.


In [None]:
# Import the libraries
import pandas as pd
import tensorflow as tf
from transformers import TFBertForSequenceClassification, BertTokenizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Read the sentiment-labeled file
sample = pd.read_csv('amazonreviewssentimented.csv')
# Collect the cleaned review texts into a list
texts = sample['clean_text'].tolist()
# Collect the annotated sentiment labels into a list
labels = sample['sentiment'].tolist()

# Load the tokenizer and the pre-trained bert-base-uncased checkpoint.
# This checkpoint has no fine-tuned classification head, so the classifier
# weights are newly initialized (see the warning in the output below).
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

pred = []
for text in texts:
    tokenized_inputs = tokenizer(str(text), truncation=True, max_length=512, return_tensors='tf')

    out = model(input_ids=tokenized_inputs['input_ids'],
                attention_mask=tokenized_inputs['attention_mask'])

    logits = out.logits
    predicted_label_index = tf.argmax(logits, axis=1).numpy()[0]
    # Map the predicted class index to a sentiment label.
    # The default configuration has two output labels, so index 2 is not produced.
    if predicted_label_index == 0:
        pred.append('negative')
    elif predicted_label_index == 1:
        pred.append('neutral')
    else:
        pred.append('positive')

# Compare the predictions with the annotated labels
accuracy = accuracy_score(labels, pred)
precision = precision_score(labels, pred, average='weighted')
recall = recall_score(labels, pred, average='weighted')
f1 = f1_score(labels, pred, average='weighted')

print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1-score: {f1:.2f}')


All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Accuracy: 0.10
Precision: 0.01
Recall: 0.10
F1-score: 0.02


  _warn_prf(average, modifier, msg_start, len(result))
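
The warning above reports that the classifier weights were newly initialized, which is a likely reason for the low scores: the predictions come from an untrained output layer. One way to stay in the zero-shot setting is the Hugging Face `zero-shot-classification` pipeline, which scores a text against candidate labels with a model trained on natural language inference. The sketch below is an assumption-laden alternative, not the method used in this notebook. The model name `facebook/bart-large-mnli`, the helper names, and the label strings `positive`, `neutral`, and `negative` are all assumptions.

```python
LABELS = ["positive", "neutral", "negative"]  # assumed to match the annotated labels


def zero_shot_sentiments(texts, classifier, labels=LABELS):
    """Return the highest-scoring candidate label for each text.

    `classifier` is any callable with the zero-shot pipeline interface:
    classifier(text, candidate_labels=[...]) returns a dict whose "labels"
    list is sorted from highest to lowest score.
    """
    predictions = []
    for text in texts:
        result = classifier(str(text), candidate_labels=labels)
        predictions.append(result["labels"][0])
    return predictions


def build_classifier(model_name="facebook/bart-large-mnli"):
    """Load a zero-shot pipeline; the model is downloaded on first use."""
    from transformers import pipeline
    return pipeline("zero-shot-classification", model=model_name)


# Usage (left commented out because it downloads a large model):
# classifier = build_classifier()
# pred = zero_shot_sentiments(sample['clean_text'].tolist(), classifier)
```

Passing the classifier in as an argument keeps the prediction loop independent of any one model, so a different checkpoint can be swapped in and the same metric code reused.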


Pre-trained Language Model (PLM):
We chose the DistilBERT model from the Hugging Face repository. DistilBERT is a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model.

Original Pretraining Data Sources: DistilBERT is pretrained on the same data as BERT, namely BooksCorpus (800M words) and English Wikipedia (2,500M words). It compresses the original BERT model using knowledge distillation, which yields a more compact and efficient design.

Parameter count: DistilBERT contains about 66 million parameters, a substantial decrease from the 110 million parameters of the BERT base version. This reduction makes DistilBERT more lightweight and better suited to deployment in resource-constrained settings.

Advantages of DistilBERT:

Efficiency:
It has fewer parameters than BERT.

Speed:
Because it uses fewer parameters, it runs faster.

Disadvantages of DistilBERT:

Limited Contextual Understanding:
Compared with BERT, its reduced capacity limits its understanding of language.

Less Fine-grained Representations:
DistilBERT produces less fine-grained representations of language, and this weaker understanding can hurt performance.




