# W06 - Text Classifier
In this hands-on session, we will build a sentiment analyzer for Twitter data using the classification and text pre-processing concepts covered earlier. We will cover:<br>
a. text pre-processing,<br>
b. splitting data for training and testing and converting them into numerical features,<br>
c. training a classifier model and making predictions on the testing dataset,<br>
d. evaluating the performance of the algorithm<br>

## Read dataset "tweets.csv"
Use the "tweets.csv" dataset from the zip file provided on the course portal "IF4041 Ilmu Data dan Penggalian Data".

In [1]:
import numpy as np
import pandas as pd

tweets = pd.read_csv('./dataset/tweets.csv', sep=",")  # adjust to your own path
tweets.head()

Unnamed: 0,ItemID,Sentiment,SentimentSource,SentimentText
0,1038,1,Sentiment140,that film is fantastic #brilliant
1,1804,1,Sentiment140,this music is really bad #myband
2,1693,0,Sentiment140,winter is terrible #thumbs-down
3,1477,0,Sentiment140,this game is awful #nightmare
4,45,1,Sentiment140,I love jam #loveit


## Question 01 (Q01)
The given dataset is still a 'raw dataset' that includes some unwanted features, unwanted characters, etc.<br>
a. Select the `SentimentText` column as the attribute and the `Sentiment` column as the label (ground truth) for this case study.<br><br>
b. For Q01.b, you are given a function template `pre_process` (see below) that performs all pre-processing steps on every tweet in the dataset. Complete the `pre_process` function with the techniques you learned in the previous hands-on week (W03) for text pre-processing, so that every text attribute is converted to `pre-processed text`, e.g., by applying: (i) tokenization, (ii) normalization, (iii) cleaning, (iv) stemming or lemmatization. Here, you will get a `list of words`.<br><br>
c. Apply the function you completed in Q01.b to each row of the `SentimentText` column in a loop. Each iteration yields a `list of words`. Append each result to a list, so that you end up with a `list of list`.<br><br>

d. Split (random & stratified) the `list of list` from Q01.c into `training data` and `testing data`. The testing dataset must be 20% of the overall dataset. Print the total size of the initial dataset, the training dataset, and the testing dataset.<br>


In [2]:
input_dataset = tweets[['SentimentText', 'Sentiment']]
input_dataset.head()

Unnamed: 0,SentimentText,Sentiment
0,that film is fantastic #brilliant,1
1,this music is really bad #myband,1
2,winter is terrible #thumbs-down,0
3,this game is awful #nightmare,0
4,I love jam #loveit,1


In [3]:
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def pre_process(input_ori):
    '''
    Write code implementation for text pre-processing in here. 
    Use what you have learned in previous meeting about text pre-processing.
    
    Parameter:
    input_ori = raw data text (single tweet data)
    
    Return value:
    processed_tweet = 'list of words'
    
    
    '''
    # tokenization
    tokenized_words = input_ori.split()

    # normalization
    normalized_words = [w.lower() for w in tokenized_words]

    # cleaning
    table = str.maketrans('', '', string.punctuation)
    punc_removed = [w.translate(table) for w in normalized_words]

    isalpha_words = [word for word in punc_removed if word.isalpha()]

    stop_words = set(stopwords.words('english'))
    stop_words_removed = [w for w in isalpha_words if w not in stop_words]

    # lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(w) for w in stop_words_removed]
    
    return lemmatized_words
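
To make the effect of the cleaning steps concrete, here is a minimal sketch that repeats the tokenization, normalization, and cleaning stages of `pre_process` on single tweets. It is not the notebook's function: `demo_clean` and `DEMO_STOP_WORDS` are hypothetical names, the stop-word set is a tiny made-up stand-in for NLTK's English list, and lemmatization is left out so the sketch runs without NLTK data.

```python
import string

# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook itself uses NLTK's English stop-word list.
DEMO_STOP_WORDS = {"i", "is", "this", "that"}


def demo_clean(tweet):
    """Tokenize, lowercase, strip punctuation, and drop stop words (no lemmatization)."""
    tokens = tweet.split()                              # whitespace tokenization
    tokens = [t.lower() for t in tokens]                # normalization
    table = str.maketrans('', '', string.punctuation)   # maps every punctuation char to None
    tokens = [t.translate(table) for t in tokens]       # cleaning: remove punctuation
    tokens = [t for t in tokens if t.isalpha()]         # keep purely alphabetic tokens
    return [t for t in tokens if t not in DEMO_STOP_WORDS]


print(demo_clean("winter is terrible #thumbs-down"))
# → ['winter', 'terrible', 'thumbsdown']
print(demo_clean("I love jam #loveit"))
# → ['love', 'jam', 'loveit']
```

Note that stripping punctuation turns the hashtag `#thumbs-down` into the single token `thumbsdown`, so hashtags survive as ordinary words rather than being discarded.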

In [4]:
preprocessed_text = []
for text in input_dataset['SentimentText']:
    preprocessed_text.append(pre_process(text))

In [5]:
from sklearn.model_selection import StratifiedShuffleSplit

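# dtype=object stores one variable-length token list per element;
# recent NumPy versions refuse to build an array from ragged lists without it
X = np.array(preprocessed_text, dtype=object)
# a plain array is indexed by position, independent of the DataFrame index
y = input_dataset['Sentiment'].to_numpy()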
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

print('Total number of dataset', X.size)
print('Total number of training dataset', X_train.size)
print('Total number of testing dataset', X_test.size)

Total number of dataset 1932
Total number of training dataset 1545
Total number of testing dataset 387
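
"Stratified" in Q01.d means that each class keeps roughly the same share in the training and testing parts. The sketch below illustrates that idea with the standard library only. It is not how `StratifiedShuffleSplit` is implemented: `stratified_split` is a hypothetical helper and the labels are made up for the demonstration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) so that each label keeps about the same ratio in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)          # group row positions by class

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)               # random selection within each class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Made-up labels: 60 positive and 40 negative examples
demo_labels = [1] * 60 + [0] * 40
train_idx, test_idx = stratified_split(demo_labels)

print(len(train_idx), len(test_idx))                 # → 80 20
print(Counter(demo_labels[i] for i in test_idx))     # 12 positives and 8 negatives
```

The test part keeps the 60/40 class ratio of the full set, which a purely random 20% sample would only match on average.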


## Question 02 (Q02)
a. Build `tfidf_model` by fitting the vectorizer below on the `training data` from Q01.d.
```
def dummy(doc):
    return doc
tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy,
    preprocessor=dummy,
    token_pattern=None)
```
b. Transform the `training data` and `testing data` from Q01.d using the `tfidf_model` from Q02.a. This gives you numerical features for both the `training data` and the `testing data`.<br><br>
c. Choose one classification algorithm that you will implement in this task, and explain why you choose it.<br><br>
d. Train the classifier model based on the algorithm you have chosen by using numerical features of `training data` from Q02.b.<br><br>
e. Make predictions on the numerical features of `testing dataset` you get in Q02.b using the classifier model that you have trained.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

def dummy(doc):
    return doc

tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy,
    preprocessor=dummy,
    token_pattern=None)

In [7]:
# X_train
train_model = tfidf.fit(X_train)
X_numeric_train = train_model.transform(X_train).toarray()

# X_test
X_numeric_test = train_model.transform(X_test).toarray()
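
The fit and transform steps above turn each token list into a vector of weights. As a rough sketch of what those weights are, the code below assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row). `tfidf_sketch` is a hypothetical helper and the three documents are made up; it is meant for building intuition, not as a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter


def tfidf_sketch(train_docs, docs):
    """Weight token lists using IDF learned from train_docs, then L2-normalise each row."""
    n_docs = len(train_docs)
    # document frequency: number of training documents that contain each token
    df = Counter(tok for doc in train_docs for tok in set(doc))
    vocab = sorted(df)
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for doc in docs:
        # tokens never seen during fitting are ignored
        counts = Counter(tok for tok in doc if tok in idf)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in raw])
    return vocab, rows


# Made-up pre-processed tweets (lists of words)
train_docs = [
    ["film", "fantastic", "brilliant"],
    ["game", "awful", "nightmare"],
    ["film", "awful"],
]
vocab, features = tfidf_sketch(train_docs, train_docs)

print(vocab)
for row in features:
    print([round(v, 3) for v in row])
```

In the first document, "film" receives a lower weight than "fantastic" because it also appears in another training document. Calling `tfidf_sketch(train_docs, test_docs)` mirrors the notebook's pattern of fitting on the training data only and reusing that vocabulary and IDF for the testing data.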

Your explanation (Q02.c): Random Forest was chosen because it extends the decision tree classifier. A decision tree suits TF-IDF word features because it first selects the most relevant words as features (using the TF-IDF value of each word). Random Forest is then used to reduce the overfitting that a single decision tree is prone to.

In [13]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

clf = RandomForestClassifier(n_estimators=10, max_depth=None,
                             min_samples_split=2, random_state=0)
clf = clf.fit(X_numeric_train, y_train)

In [14]:
y_pred = clf.predict(X_numeric_test)

## Question 03 (Q03)
After training the classification model and making predictions with it, you will now evaluate the performance of your model against the testing dataset.<br>
a. Calculate and print the accuracy of your model's predictions from Q02.e against the testing dataset ground truth.<br>
b. What can you infer from the result?<br>

In [15]:
print('Score Random Forest: ', accuracy_score(y_test, y_pred))

Score Random Forest:  0.9741602067183462
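
Accuracy is the fraction of test examples whose predicted label equals the ground truth, so with 387 testing tweets a score of 0.9742 corresponds to 377 correct predictions. The sketch below computes the same quantity by hand and also counts each (true, predicted) pair, which shows which class the errors come from. The helper names and the label lists are made up for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the ground truth."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


# Made-up labels, not taken from the tweets dataset
y_true_demo = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred_demo = [1, 0, 0, 0, 1, 1, 1, 0]

print(accuracy(y_true_demo, y_pred_demo))           # → 0.75
print(confusion_counts(y_true_demo, y_pred_demo))   # (1, 0) and (0, 1) are the error cells
```

A single accuracy number hides whether the mistakes are mostly false positives or false negatives, so looking at the pair counts is a useful complement when interpreting the result in Q03.b.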


Your answer (Q03.b) :

With Random Forest, the model reaches an accuracy score of 0.9741. Besides the choice of model, the training data is also of good quality, so other models also produce high accuracy scores on it.