# Assignment 2 - CT5120

### Instructions:
- Complete all the tasks below and upload your submission as a Python notebook on Blackboard with the filename “`StudentID_Lastname.ipynb`” before **23:59** on **November 25, 2022**.
- This is an individual assignment; you **must not** work with other students to complete this assessment.
- The assignment is worth $50$ marks and constitutes 19% of the final grade. The breakdown of the marking scheme for each task is as follows:

| Task | Marks for write-up | Marks for code | Total Marks |
| :--- | :----------------- | :------------- | :---------- |
| 1    |                  5 |              5 |          10 |
| 2    |                  - |             10 |          10 |
| 3    |                  5 |              5 |          10 |
| 4    |                  5 |              5 |          10 |
| 5    |                  5 |              5 |          10 |



---

This assignment involves tasks for feature engineering, training and evaluating a classifier for suggestion detection. You will work with the data from SemEval-2019 Task 9 subtask A to classify whether a piece of text contains a suggestion or not. 


Download train.csv, test_seen.csv and test_unseen.csv from the [GitHub repository](https://github.com/sharduls007/Assignment_2_CT5120) or uncomment the code cell below to get the data as comma-separated values (CSV) files. Each labelled CSV file contains a header row followed by the data rows (5,440 in train.csv and 1,360 in test_seen.csv) spread across 3 columns. Each row contains a unique id, a piece of text and a label assigned by an annotator. A label of $1$ indicates that the given text contains a suggestion while a label of $0$ indicates that the text does not contain a suggestion.

You can find more details about the dataset in Sections 1, 2, 3 and 4 of [SemEval-2019 Task 9: Suggestion Mining from Online Reviews and Forums
](https://aclanthology.org/S19-2151/).

We will use test_seen.csv to benchmark our model, so it includes labels. test_unseen.csv, on the other hand, is used for the [Kaggle](https://www.kaggle.com/competitions/nlp2022ct5120suggestionmining/overview) competition.


In [1]:
!curl "https://raw.githubusercontent.com/sharduls007/Assignment_2_CT5120/master/train.csv" > train.csv
!curl "https://raw.githubusercontent.com/sharduls007/Assignment_2_CT5120/master/test_seen.csv" > test.csv
!curl "https://raw.githubusercontent.com/sharduls007/Assignment_2_CT5120/master/test_unseen.csv" > test_unseen.csv



In [2]:
import numpy as np
import pandas as pd

# Read the CSV file.
train_df = pd.read_csv('train.csv', 
                 names=['id', 'text', 'label'], header=0)

test_df = pd.read_csv('test.csv', 
                 names=['id', 'text', 'label'], header=0)

# Store the data as two parallel lists: one holding the texts
# and one holding the corresponding labels.
train_texts, train_labels = train_df["text"].to_list(), train_df["label"].to_list() 
test_texts, test_labels = test_df["text"].to_list(), test_df["label"].to_list() 

# Check that training set and test set are of the right size.
assert len(test_texts) == len(test_labels) == 1360
assert len(train_texts) == len(train_labels) == 5440


---

## Task 1: Data Pre-processing (10 Marks)

Explain at least 3 steps that you will perform to preprocess the texts before training a classifier.



Edit this cell to write your answer below the line in no more than 300 words.

---

Data preprocessing turns raw, noisy data into a form that a classifier can learn from. In general it covers the following steps:
1. Data quality assessment.
Inspect the data to judge its overall quality, relevance to the project and consistency. Common problems to look out for include mismatched data types, mixed data values, outliers and missing data.
2. Data cleansing.
Handle missing values, outliers and duplicates by discarding, filling, replacing or reweighting records, so that anomalies are removed, errors are corrected and gaps are filled.
3. Data transformation.
Standardization converts data with different scales or distributions to a common scale or range, which reduces the effect of scale and distribution differences on the model. Standardized values can also be combined directly into composite, weighted indicators.
4. Feature selection or feature combination.
Feature selection keeps only the features that are meaningful for the task, so that not every feature has to be fed into the model. Filter-style selection is independent of any machine learning algorithm: it chooses features based on scores from statistical tests and on correlation measures.

For text data, these steps translate into the following pipeline:
- Data preparation: split the text into sentences.
- Removal of non-alphanumeric data: numbers, symbols, emoji, HTML tags.
- Lowercasing and misspelling normalization.
- Tokenization: split sentences into words.
- Stop-word removal.
- Stemming and lemmatization.
- Bag of words: tf-idf.
- Word embeddings: Word2Vec.
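
The text-specific steps above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the implementation used in this notebook: `STOP_WORDS` is a deliberately tiny, hypothetical stop-word set and `preprocess` is a made-up helper name. A real pipeline would use a full stop-word list such as the one shipped with NLTK.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only.
STOP_WORDS = {"i", "a", "an", "the", "to", "be", "is", "it", "if", "for", "in"}

def preprocess(text):
    """Lowercase, strip non-alphanumerics, tokenize and drop stop words."""
    text = text.lower()                       # lowercasing
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove symbols and punctuation
    tokens = text.split()                     # whitespace tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(preprocess("<b>I</b> would be beyond excited to get a response."))
```

Working on a list of tokens rather than on the whole sentence string makes it straightforward to add per-token steps such as stemming later.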

---

In the code cell below, write an implementation of the steps you defined above. You are free to use a library such as `nltk` or `sklearn` for this task.

In [4]:

import re      # regular expressions
import string  # string.punctuation

import nltk
from nltk.corpus import stopwords    # English stop-word list
from nltk.stem import PorterStemmer  # Porter stemmer

# The stop-word list has to be downloaded once per environment.
nltk.download('stopwords', quiet=True)

def dataclean(texts):
    """Lowercase, remove stop words, stem and strip punctuation from each text."""

    # Lowercasing
    texts = [text.lower() for text in texts]

    # Stop-word removal
    stops = set(stopwords.words('english'))
    texts = [' '.join(tok for tok in text.split() if tok not in stops)
             for text in texts]

    # Stemming. Each element is a whole sentence, which the stemmer treats as
    # a single word; stemming token by token would change the outputs below.
    porter_stemmer = PorterStemmer()
    texts = [porter_stemmer.stem(text) for text in texts]

    # Remove all punctuation, including '.'
    punct_re = re.compile("[" + re.escape(string.punctuation) + "]")
    texts = [punct_re.sub("", text) for text in texts]

    # Lemmatization (disabled; needs `from nltk.stem import WordNetLemmatizer`)
    # lemmatizer = WordNetLemmatizer()
    # texts = [lemmatizer.lemmatize(text, pos='v') for text in texts]

    return texts

train_texts = dataclean(train_texts)
test_texts = dataclean(test_texts)

train_df['texts'] = train_texts
test_df['texts'] = test_texts
train_df.head()

|      | id      | text | label | texts |
| :--- | :------ | :--- | :---- | :---- |
| 0    | train_0 | "One would hope if I search for a word in the ... | 0 | one would hope search word title app would sho... |
| 1    | train_1 | "I would be beyond excited to get a response." | 0 | would beyond excited get respons |
| 2    | train_2 | "Just like the user can select apps that are a... | 1 | like user select apps allowed run background w... |
| 3    | train_3 | "Once you create a CoreIndependentInputSource ... | 0 | create coreindependentinputsource touch visual... |
| 4    | train_4 | "I Have problems with Contact class on Windows... | 0 | problems contact class windows 81 windowsphone... |


---

## Task 2: Feature Engineering (I) - TF-IDF as features (10 Marks)

In the lectures we have seen that raw counts of words and `tf-idf` scores can be useful features for a classification task. Complete the following code cell to create a suggestion detector which uses `tf-idf` scores as features for a Naïve Bayes classifier.

After applying your preprocessing steps, use the training data to train the classifier and make predictions on the test set. You **must not** use the test set for training.

If everything is implemented correctly, then you should see a single floating point value between 0 and 1 at the end which denotes the accuracy of the classifier.
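
To make the `tf-idf` weighting concrete, here is a small from-scratch sketch on a made-up three-document corpus. It assumes the smoothed idf formula $\ln\frac{1+n}{1+\mathrm{df}(t)} + 1$ followed by L2 normalization, which is what scikit-learn's `TfidfTransformer` uses by default. The helper name `tfidf` and the toy documents are hypothetical and not part of the assignment.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    tokenized = [doc.split() for doc in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed idf: terms that occur in fewer documents get a larger weight.
    idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term counts in this document
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Toy documents, made up for illustration.
docs = ["please add dark mode", "dark mode works", "please fix login"]
vectors = tfidf(docs)
print(vectors[0])
```

In the first document, "add" occurs in only one document and therefore receives a higher weight than "dark", which occurs in two.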

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Calculate tf-idf scores for the words in the training set.
# ... your code goes here

# Count word occurrences, then reweight the counts into tf-idf scores.
vectorizer = CountVectorizer()
transformer = TfidfTransformer()

# Fit the vocabulary and the idf weights on the training set only.
# GaussianNB needs dense input, hence the final .toarray().
train_count = vectorizer.fit_transform(train_texts)
train_itIdf = transformer.fit_transform(train_count).toarray()

# Reuse the fitted vocabulary and idf weights for the test set.
test_count = vectorizer.transform(test_texts)
test_itIdf = transformer.transform(test_count).toarray()
# Train a Naïve Bayes classifier using the tf-idf scores for words as features.
# ... your code goes here

# instantiate the model (using the default parameters)
GNB = GaussianNB()
# fit the model with data
GNB.fit(train_itIdf, train_labels)

# Predict on the test set.
predictions = GNB.predict(test_itIdf)  
print(classification_report(test_labels, predictions))
#################### DO NOT EDIT BELOW THIS LINE #################

def accuracy(labels, predictions):
  '''
  Calculate the accuracy score for a given set of predictions and labels.
  
  Args:
    labels (list): A list containing gold standard labels annotated as `0` and `1`.
    predictions (list): A list containing predictions annotated as `0` and `1`.

  Returns:
    float: A floating point value to score the predictions against the labels.
  '''

  assert len(labels) == len(predictions)
  
  correct = 0
  for label, prediction in zip(labels, predictions):
    if label == prediction:
      correct += 1

  score = correct / len(labels)
  return score

# Calculate accuracy score for the classifier using tf-idf features.
accuracy(test_labels, predictions)

              precision    recall  f1-score   support

           0       0.80      0.58      0.67      1027
           1       0.30      0.56      0.39       333

    accuracy                           0.58      1360
   macro avg       0.55      0.57      0.53      1360
weighted avg       0.68      0.58      0.61      1360



0.5764705882352941

---

## Task 3: Evaluation Metrics (10 marks)

Why is accuracy not the best measure for evaluating a classifier? Describe an evaluation metric which might work better than accuracy for a classification task such as suggestion detection.

Edit this cell to write your answer below the line in no more than 150 words.

---
Accuracy, the fraction of all samples that are predicted correctly, is the most natural classification metric, and it is a good measure when the classes are balanced. When the data is imbalanced, however, accuracy is misleading: a classifier that always predicts the majority class scores highly while finding no positive examples at all. Most real-world data is imbalanced, including this dataset (1,027 non-suggestions versus 333 suggestions in the test set). A better choice is the F1 score of the positive class, the harmonic mean of precision (the share of predicted suggestions that are correct) and recall (the share of true suggestions that are found). Different scenarios call for different metrics, and several can be combined to evaluate a model.
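
The effect of class imbalance can be shown with a short sketch. It uses the class sizes from the classification report in Task 2 (1,027 texts labelled `0` and 333 labelled `1`) and a hypothetical classifier that never predicts a suggestion. The variable names are made up for this illustration.

```python
# Class sizes of the test set, taken from the classification report.
n_negative, n_positive = 1027, 333
toy_labels = [0] * n_negative + [1] * n_positive

# Hypothetical classifier that never predicts a suggestion.
toy_predictions = [0] * len(toy_labels)

pairs = list(zip(toy_labels, toy_predictions))
accuracy_score = sum(l == p for l, p in pairs) / len(pairs)

tp = sum(l == 1 and p == 1 for l, p in pairs)
fp = sum(l == 0 and p == 1 for l, p in pairs)
fn = sum(l == 1 and p == 0 for l, p in pairs)

# F1 = 2*TP / (2*TP + FP + FN); taken as 0 when the denominator is 0.
denominator = 2 * tp + fp + fn
f1 = 2 * tp / denominator if denominator else 0.0

print(f"accuracy = {accuracy_score:.3f}, F1 = {f1:.3f}")
# prints: accuracy = 0.755, F1 = 0.000
```

This useless classifier gets an accuracy of about 0.755 yet an F1 score of 0, because it finds none of the suggestions.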

---

In the code cell below, write an implementation of the evaluation metric you defined above. Please write your own implementation from scratch.

In [6]:
def evaluate(labels, predictions):
    '''
    Calculate the F1 score of the positive class (label `1`).
    '''

    # check that labels and predictions are of same length
    assert len(labels) == len(predictions)
    score = 0.0

    #################### EDIT BELOW THIS LINE #########################

    # Count true positives, false positives and false negatives in one pass.
    # True negatives are not needed for precision, recall or F1.
    tp = fp = fn = 0
    for gt, pred in zip(labels, predictions):
        if gt == 1 and pred == 1:
            tp += 1
        elif gt == 0 and pred == 1:
            fp += 1
        elif gt == 1 and pred == 0:
            fn += 1

    # Guard against division by zero when there are no positive
    # predictions or no positive labels.
    prec = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if prec + recall:
        score = 2 * prec * recall / (prec + recall)

    #################### EDIT ABOVE THIS LINE #########################

    return score

# Calculate evaluation score based on the metric of your choice
# for the classifier trained in Task 2 using tf-idf features.
evaluate(test_labels, predictions)


0.3924050632911392
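
As a sanity check, the reported scores can be reproduced from a confusion matrix. The counts below are not printed by the notebook. They are inferred from the class supports (1,027 and 333), the accuracy and the F1 score shown above, so treat them as an assumption rather than as recorded output.

```python
# Inferred (not printed by the notebook) confusion counts for Task 2.
tp, fn = 186, 147  # the 333 texts that truly contain a suggestion
tn, fp = 598, 429  # the 1,027 texts that do not

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
acc = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.4f} accuracy={acc:.4f}")
# prints: precision=0.30 recall=0.56 f1=0.3924 accuracy=0.5765
```

These values agree with the classification report and with the outputs of `accuracy` and `evaluate`.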

---

## Task 4: Feature Engineering (II) - Other features (10 Marks)

Describe features other than those defined in Task 2 which might improve the performance of your suggestion detector. If these features require any additional pre-processing steps, then define those steps as well.


Edit this cell to write your answer below the line in no more than 500 words.

---
Several preprocessing steps can be carried out before feature engineering:
1. Making features dimensionless
2. Standardization
3. Interval scaling
4. Differences between standardization and normalization
5. Binarization of quantitative features
6. Dummy coding for qualitative features
7. Imputation of missing values
8. Data transformation

The following feature selection and dimensionality reduction options are available:
1. Filter
   - Variance selection
   - Correlation coefficient
   - Chi-square test
   - Mutual information
2. Wrapper
   - Recursive feature elimination
3. Embedded
   - Feature selection based on a penalty term
   - Feature selection based on a tree model
4. Dimensionality reduction
   - Principal Component Analysis (PCA)
   - Linear Discriminant Analysis (LDA)

The feature implemented in the code cell below is the mean of the pretrained GloVe word embeddings of the tokens in each text.


---

In the code cell below, write an implementation of the features (and any additional pre-preprocessing steps) you defined above. You are free to use a library such as `nltk` or `sklearn` for this task.

After creating your features, use the training data to train a Naïve Bayes classifier and use the test set to evaluate its performance using the metric defined in Task 3. You **must not** use the test set for training.

To make sure that your code doesn't take too long to run or use too much memory, you can consider a time limit of 3 minutes and a memory limit of 12GB for this task.

In [None]:
# Create your features.
# ... your code goes here
# Represent each text by the mean of the GloVe embeddings of its tokens.
import numpy as np
from nltk.tokenize import word_tokenize  # requires NLTK's Punkt tokenizer data
import gensim.downloader as api
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Pretrained 300-dimensional GloVe vectors (downloaded on first use).
glove = api.load('glove-wiki-gigaword-300')

# A single random vector shared by every out-of-vocabulary token, seeded so
# that the training and test sets use the same vector.
oov = np.random.default_rng(0).random(300)

def textvectorize(dataframe):
    """Return an (n_texts, 300) array of mean-pooled GloVe vectors."""
    pooled_vecs = []
    for sent in dataframe["texts"]:
        toks = word_tokenize(sent)
        # Look up each token, falling back to the OOV vector.
        text_vecs = [glove[tok] if tok in glove else oov for tok in toks]
        # Special case where the text has no tokens at all.
        if len(text_vecs) == 0:
            text_vecs.append(oov)
        # Stack the token vectors and take their mean.
        pooled_vecs.append(np.mean(np.vstack(text_vecs), axis=0))
    return np.vstack(pooled_vecs)

X_train = textvectorize(train_df)
X_test = textvectorize(test_df)

# Labels for the train / test split.
train_Y = train_df["label"]
test_Y = test_df["label"]

print(f"Input shapes X_train: {X_train.shape}, X_test: {X_test.shape}")
print(f"Output shapes train_Y: {train_Y.shape}, test_Y: {test_Y.shape}")

# Train a Naïve Bayes classifier on the embedding features.
# ... your code goes here
GNB = GaussianNB()
# fit the model with data
GNB.fit(X_train, train_Y)

# Predict on the test set.
predictions = GNB.predict(X_test)
# Evaluate on the test set.
print(evaluate(test_Y, predictions))
print(classification_report(test_Y, predictions))

---

## Task 5: Kaggle Competition (10 marks)

Head over to https://www.kaggle.com/t/1f90b74da0b7484da9647638e22d1068  
Use the classifier above to predict the labels for test_unseen.csv from the competition page and upload the results to the leaderboard. The current baseline score is 0.36823. Make an improvement above the baseline. Please note that the evaluation metric for the competition is the f-score.

Read the competition page for more details.



In [6]:
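# Mount Google Drive only when running on Colab; skip the step elsewhere.
try:
    from google.colab import drive
    drive.mount('/content/drive')
except ModuleNotFoundError:
    print("Not running on Google Colab; skipping drive mount.")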

In [7]:
# Preparing submission for Kaggle
# using an SVM classifier
from sklearn.svm import SVC

StudentID = "22210220_Hu" # Please add your student id and lastname
test_unseen = pd.read_csv("test_unseen.csv", names=['id', 'text'], header=0)
test_unseen['text'] = dataclean(test_unseen['text'])

# Reuse the vectorizer and tf-idf transformer fitted on the training set.
unseen_count = vectorizer.transform(test_unseen['text'])
unseen_itIdf = transformer.transform(unseen_count).toarray()

svc = SVC(kernel="linear", verbose=True)
# fit the model with data
svc.fit(train_itIdf, train_labels)

# Check the accuracy on the labelled test set (test_seen.csv).
pre = svc.predict(test_itIdf)
acc = accuracy(test_labels, pre)
print(acc)

# Predict on the unlabelled test_unseen set.
predictions = svc.predict(unseen_itIdf)

# Here Id is unique identifier assigned to each test sample ranging from test_0 till test_1699
# Expected is a list of prediction made by your classifier
sub = {"Id": [f"test_{i}" for i in range(len(test_unseen))],
       "Expected": predictions}

sub_df = pd.DataFrame(sub)
# The code below will generate a StudentID.csv on your drive on the left hand side in the explorer
# Please upload the file as a submission on the competition page
# You can index your submission StudentID_Lastname_index.csv, where index is your number of submission
sub_df.to_csv(f"{StudentID}.csv", sep=",", header=True, index=False)

[LibSVM]....*.*
optimization finished, #iter = 5961
obj = -1615.111333, rho = -1.129234
nSV = 2986, nBSV = 1527
Total nSV = 2986
0.825


Mention the approach that you have chosen briefly, and what is the mean average f-score that you have achieved? Did it improve above the chosen baseline model (0.36823)? Why or why not?

Edit this cell to write your answer below the line in no more than 500 words.

---
I chose tf-idf scores as features and used a linear-kernel SVM to predict labels for the unseen dataset. On test_seen.csv this model reaches an accuracy of 0.825, up from 0.576 for the Gaussian Naïve Bayes classifier in Task 2. Note that 0.825 is accuracy, not the f-score used by the competition; the f-score on test_seen.csv can be computed with `evaluate(test_labels, pre)`. The improvement comes from feature engineering and from switching to a different model. Feature engineering is the process of transforming raw data into features that a predictive model can use. Its aim is to prepare an input dataset that suits the algorithm and improves model performance.

Two considerations matter when selecting or designing features:

- Framing the problem correctly: use the right objective measures to estimate the quality of the output.
- Inter-dependencies within the model: exploit the inherent, underlying structure of the data. Good structure gives far better results.

Once these are taken into account, the advantages of feature engineering include:

- More flexibility and less complexity in models
- Faster processing
- Clear, easy-to-understand models
- Simpler models that are easier to maintain
- A better understanding of the underlying problem
- A better representation of all the available data, which helps in characterizing the underlying problem

Overall, the input data is optimized.

---