# Python Machine Learning Assignment with PACMANN AI
## Sentiment Analysis

## Machine Learning Process Flowchart
### 1. Importing Data to Python
* Drop Duplicates
### 2. Data Preprocessing
* Input-Output Split, Train-Test Split
* Imputation, Processing Categorical, Normalization
### 3. Training Machine Learning
* Choose Score to optimize and Hyperparameter Space
### 4. Test Prediction
* Evaluate model performance on Test Data
    

## 1. Importing Data to Python

In [1]:
# Import pandas

import pandas as pd

##  Dataset Information

A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as "late flight" or "rude service"). From this data we can analyze how travelers in February 2015 expressed their feelings on Twitter.
 
Source : https://www.kaggle.com/crowdflower/twitter-airline-sentiment

### Content
There are 13 variables: 

1. tweet_id
2. airline_sentiment     : <b>Output</b>
3. airline_sentiment_confidence
4. negativereason
5. negativereason_confidence
6. airline
7. name
8. retweet_count
9. text
10. tweet_coord
11. tweet_created
12. tweet_location
13. user_timezone

In [2]:
# Read the dataset

data = pd.read_csv('../datasets/tweet_airlines.csv')

In [3]:
# Check the first 5 observations of the dataset

data.head()

Unnamed: 0.1,Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,name,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,0,570306133677760513,neutral,1.0,,,Virgin America,cairdin,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,1,570301130888122368,positive,0.3486,,0.0,Virgin America,jnardino,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,2,570301083672813571,neutral,0.6837,,,Virgin America,yvonnalynn,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,jnardino,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,jnardino,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


## Dropping Duplicates

In [4]:
# Check the shape of the data before dropping duplicates

data.shape

(14640, 14)

In [5]:
# Check whether there are any duplicate observations

data.duplicated().sum()

0

In [6]:
# Drop the duplicate rows

data = data.drop_duplicates()

In [7]:
# Check the shape again

data.shape

(14640, 14)

In [8]:
# Check the data again

data.head()

Unnamed: 0.1,Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,name,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,0,570306133677760513,neutral,1.0,,,Virgin America,cairdin,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,1,570301130888122368,positive,0.3486,,0.0,Virgin America,jnardino,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,2,570301083672813571,neutral,0.6837,,,Virgin America,yvonnalynn,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,jnardino,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,jnardino,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


### Make function to import and drop 

Write a function that does the following:

 1. import the data
 2. check the NUMBER OF OBSERVATIONS and the NUMBER OF COLUMNS
 3. drop duplicates
 4. drop unnecessary columns
 5. check the NUMBER OF OBSERVATIONS and the NUMBER OF COLUMNS again after dropping
 6. return the data after dropping

The function is named `importData` and takes 2 arguments:

 1. `filepath`: the path where the data is stored
 2. `drop`    : the names of the columns to drop
 
Then assign the function's return value to a variable named `data_sentiment`.

In [9]:
# Define the function

def importData(filepath, drop):
    data = pd.read_csv(filepath)
    print("Original data : %d observations, %d columns." % data.shape, '\n')
    print("Dropped columns :", drop, '\n')
    data_drop = data.drop(drop, axis=1)
    print("Number of duplicate rows :", data_drop.duplicated().sum(), '\n')
    data_unique = data_drop.drop_duplicates()
    print("Data after dropping : %d observations, %d columns." % data_unique.shape)

    return data_unique

In [10]:
# Assign the function's return value to the variable data_sentiment

drop = ['Unnamed: 0', 'tweet_id', 'name', 'tweet_location', 'user_timezone',
       'airline_sentiment_confidence', 'negativereason', 'negativereason_confidence']
data_sentiment = importData("tweet_airlines.csv", drop)

Original data : 14640 observations, 14 columns. 

Dropped columns : ['Unnamed: 0', 'tweet_id', 'name', 'tweet_location', 'user_timezone', 'airline_sentiment_confidence', 'negativereason', 'negativereason_confidence'] 

Number of duplicate rows : 137 

Data after dropping : 14503 observations, 6 columns.
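
The earlier check with `data.duplicated().sum()` returned 0, while `importData` reports 137 duplicates. That is consistent: once identifier-like columns such as `tweet_id` are dropped, rows that differed only in those columns become identical. A minimal sketch on a toy frame (the values are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy data (hypothetical values): rows 0 and 1 differ only in their identifier.
toy = pd.DataFrame({
    "tweet_id": [101, 102, 103],
    "airline": ["United", "United", "Delta"],
    "text": ["late again", "late again", "thanks!"],
})

# With the identifier present, every row is unique.
dup_with_id = int(toy.duplicated().sum())

# After dropping the identifier, row 1 becomes a duplicate of row 0.
dup_without_id = int(toy.drop(columns=["tweet_id"]).duplicated().sum())

print(dup_with_id, dup_without_id)
```

This is why the order of the two steps matters: dropping columns first and de-duplicating second removes more rows than the reverse order.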


In [11]:
data_sentiment.head()

Unnamed: 0,airline_sentiment,airline,retweet_count,text,tweet_coord,tweet_created
0,neutral,Virgin America,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800
1,positive,Virgin America,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800
2,neutral,Virgin America,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800
3,negative,Virgin America,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800
4,negative,Virgin America,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800


## 2. Data Preprocessing
### Input-Output Split

Here we separate the columns into input and output.

The data used as input is named `x`, and the output is named `y`.

In this dataset, we only need the `airline_sentiment` column as our output. 

In [12]:
# Check the data using head()

data_sentiment.head()

Unnamed: 0,airline_sentiment,airline,retweet_count,text,tweet_coord,tweet_created
0,neutral,Virgin America,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800
1,positive,Virgin America,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800
2,neutral,Virgin America,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800
3,negative,Virgin America,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800
4,negative,Virgin America,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800


### Make function for input and output

Write a function that meets the criteria below:

1. data_input
2. data_output
3. return data_input and data_output
* The purpose of writing a function is so that it can be reused in different cases. 

The function is named `extract_input_output` and takes 2 arguments:

1. `data`        : the dataset to split
2. `output_column_name` : the name of the column to use as output


In [13]:
# Write the function here

def extract_input_output(data, output_column_name):
    y = data[output_column_name]
    x = data.drop(output_column_name, axis=1)

    return x, y 

# Assign the function's return values to x, y.
# x: input data
# y: output data
x, y =  extract_input_output(data_sentiment, 'airline_sentiment')

In [14]:
x.head()

Unnamed: 0,airline,retweet_count,text,tweet_coord,tweet_created
0,Virgin America,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800
1,Virgin America,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800
2,Virgin America,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800
3,Virgin America,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800
4,Virgin America,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800


In [15]:
y.head()

0     neutral
1    positive
2     neutral
3    negative
4    negative
Name: airline_sentiment, dtype: object

## Train and Test Split

In this part, x and y are each split into 2 sets: training and test. We use the `train_test_split` function from the scikit-learn library.

In [16]:
# import the train_test_split function from the scikit-learn library

from sklearn.model_selection import train_test_split

#### Train Test Split Function
1. `x` is the input
2. `y` is the output
3. `test_size` is the proportion of the data held out for testing, e.g. 0.20 for a 20% test set
4. `random_state` is the seed for the random shuffle; use the same value on every run to get the same split, e.g. `random_state = 123`
5. Output: 
    * x_train = input of the training data
    * x_test = input of the test data
    * y_train = output of the training data
    * y_test = output of the test data
6. The order of x_train, x_test, y_train, and y_test must not be swapped
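
The role of `random_state` can be checked on a small example. The sketch below uses toy lists (hypothetical values) and shows that two calls with the same seed give the same partition:

```python
from sklearn.model_selection import train_test_split

# Toy inputs and labels (hypothetical), only to show the call pattern.
x_toy = list(range(10))
y_toy = ["neg"] * 6 + ["pos"] * 4

# The same random_state used twice yields identical partitions.
split_a = train_test_split(x_toy, y_toy, test_size=0.20, random_state=123)
split_b = train_test_split(x_toy, y_toy, test_size=0.20, random_state=123)

# The return order is fixed: inputs first (train, test), then outputs (train, test).
x_tr, x_te, y_tr, y_te = split_a
print(len(x_tr), len(x_te))
```

`train_test_split` also accepts a `stratify` argument that keeps the class proportions similar in both sets. This notebook does not use it.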

In [17]:
# Split dataset

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=123)

In [18]:
# Check the shape of each set (x_train, x_test, y_train, y_test)

print("Data input training :", x_train.shape)
print("Data input test :", x_test.shape)
print("Data output training :", y_train.shape)
print("Data output test :", y_test.shape)

Data input training : (11602, 5)
Data input test : (2901, 5)
Data output training : (11602,)
Data output test : (2901,)


In [19]:
x_train.head()

Unnamed: 0,airline,retweet_count,text,tweet_coord,tweet_created
14270,American,0,"@AmericanAir @sa_craig no, not helped one bit....",,2015-02-22 15:42:21 -0800
542,United,0,@united yes please! I am newly married and try...,,2015-02-24 10:19:32 -0800
5178,Southwest,0,@SouthwestAir three cheers to your Denver staf...,,2015-02-21 15:25:24 -0800
14114,American,1,@AmericanAir @sarahzou translation: we don't r...,,2015-02-22 17:07:45 -0800
6840,Delta,0,@JetBlue sooo earlier i said i couldnt fly wit...,,2015-02-24 04:51:15 -0800


In [20]:
# dump x_train columns

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
joblib.dump(x_train.columns,'input_col.pkl')

['input_col.pkl']
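
The column list is saved so that later data (for example the test set) can be put into the same column layout. A minimal sketch of that round trip, assuming a temporary directory and a made-up `new_data` frame:

```python
import os
import tempfile

import joblib
import pandas as pd

# Column names as they would be saved from the training input.
train_cols = pd.Index(["airline", "retweet_count", "text"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input_col.pkl")
    joblib.dump(train_cols, path)       # save
    loaded_cols = joblib.load(path)     # load again later

# Hypothetical new data with a different column order and one extra column.
new_data = pd.DataFrame({"text": ["hi"], "extra": [1],
                         "airline": ["Delta"], "retweet_count": [0]})

# Selecting by the loaded columns restores the training order and drops extras.
aligned = new_data[loaded_cols]
print(list(aligned.columns))
```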

## Separating Numerical and Categorical Data Manually

## Getting Numerical

In [21]:
# get numeric using ._get_numeric_data()

x_train_num = x_train._get_numeric_data()

In [22]:
# check the columns

x_train_num.head()

Unnamed: 0,retweet_count
14270,0
542,0
5178,0
14114,1
6840,0


In [23]:
# drop unexpected numerical column if any 

# num_categorical = [...]
# x_train_num = x_train_num.drop(num_categorical, axis = 1)

In [24]:
# dump numerical columns

joblib.dump(x_train_num.columns,'numerical_col.pkl')

['numerical_col.pkl']

## Getting Categorical


In [25]:
# Get Categorical, drop numerical columns

x_train_cat = x_train.drop(x_train_num.columns, axis=1)

In [26]:
# check the top observations!

x_train_cat.head()

Unnamed: 0,airline,text,tweet_coord,tweet_created
14270,American,"@AmericanAir @sa_craig no, not helped one bit....",,2015-02-22 15:42:21 -0800
542,United,@united yes please! I am newly married and try...,,2015-02-24 10:19:32 -0800
5178,Southwest,@SouthwestAir three cheers to your Denver staf...,,2015-02-21 15:25:24 -0800
14114,American,@AmericanAir @sarahzou translation: we don't r...,,2015-02-22 17:07:45 -0800
6840,Delta,@JetBlue sooo earlier i said i couldnt fly wit...,,2015-02-24 04:51:15 -0800


### Make a function for Separating Numerical and Categorical

In [27]:
# Def a function that returns x_train numerical and x_train categorical

def splitNumCat(data):
    data_num = data._get_numeric_data()
#    data_num = data_num.drop(num_categorical, axis = 1)
    data_cat = data.drop(list(data_num.columns.values) , axis = 1)

    return data_num, data_cat
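
`_get_numeric_data()` is a private pandas method (note the leading underscore), so it may change between versions. A possible alternative, sketched here under the hypothetical name `split_num_cat_public`, uses the public `select_dtypes` method. The two approaches may treat boolean columns differently, so this is not a guaranteed drop-in replacement:

```python
import pandas as pd

def split_num_cat_public(data):
    """Split a DataFrame into numeric and non-numeric columns with select_dtypes."""
    data_num = data.select_dtypes(include="number")
    data_cat = data.select_dtypes(exclude="number")
    return data_num, data_cat

# Toy frame (hypothetical values) with one numeric and two text columns.
toy = pd.DataFrame({"airline": ["United", "Delta"],
                    "retweet_count": [0, 1],
                    "text": ["a", "b"]})
toy_num, toy_cat = split_num_cat_public(toy)
print(list(toy_num.columns), list(toy_cat.columns))
```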

In [28]:
# call function

x_train_num, x_train_cat = splitNumCat(x_train)

In [29]:
# check the top of the x_train numerical observations!

x_train_num.head()

Unnamed: 0,retweet_count
14270,0
542,0
5178,0
14114,1
6840,0


In [30]:
# check the top of the x_train categorical observations!

x_train_cat.head()

Unnamed: 0,airline,text,tweet_coord,tweet_created
14270,American,"@AmericanAir @sa_craig no, not helped one bit....",,2015-02-22 15:42:21 -0800
542,United,@united yes please! I am newly married and try...,,2015-02-24 10:19:32 -0800
5178,Southwest,@SouthwestAir three cheers to your Denver staf...,,2015-02-21 15:25:24 -0800
14114,American,@AmericanAir @sarahzou translation: we don't r...,,2015-02-22 17:07:45 -0800
6840,Delta,@JetBlue sooo earlier i said i couldnt fly wit...,,2015-02-24 04:51:15 -0800


In [31]:
# check the top of the y_train observations!

y_train.head()

14270    negative
542      negative
5178     positive
14114    negative
6840     positive
Name: airline_sentiment, dtype: object

### Extract Time, Coordinate, and Text from Categorical Data

In [32]:
# check the top of the x_train categorical observations!

x_train_cat.head()

Unnamed: 0,airline,text,tweet_coord,tweet_created
14270,American,"@AmericanAir @sa_craig no, not helped one bit....",,2015-02-22 15:42:21 -0800
542,United,@united yes please! I am newly married and try...,,2015-02-24 10:19:32 -0800
5178,Southwest,@SouthwestAir three cheers to your Denver staf...,,2015-02-21 15:25:24 -0800
14114,American,@AmericanAir @sarahzou translation: we don't r...,,2015-02-22 17:07:45 -0800
6840,Delta,@JetBlue sooo earlier i said i couldnt fly wit...,,2015-02-24 04:51:15 -0800


In [33]:
# extract and assign to x_train_---

x_train_time = x_train_cat['tweet_created']
x_train_coord = x_train_cat['tweet_coord']
x_train_text = x_train_cat['text']

In [34]:
# drop time, coord, and text from categorical data

x_train_cat = x_train_cat.drop(['tweet_created', 'tweet_coord', 'text'], axis=1)

In [35]:
# check the top of the x_train time observations!

x_train_time.head()

14270    2015-02-22 15:42:21 -0800
542      2015-02-24 10:19:32 -0800
5178     2015-02-21 15:25:24 -0800
14114    2015-02-22 17:07:45 -0800
6840     2015-02-24 04:51:15 -0800
Name: tweet_created, dtype: object

In [36]:
# check the top of the x_train coordinate observations!

x_train_coord.head()

14270    NaN
542      NaN
5178     NaN
14114    NaN
6840     NaN
Name: tweet_coord, dtype: object

In [37]:
# check the top of the x_train text observations!

x_train_text.head()

14270    @AmericanAir @sa_craig no, not helped one bit....
542      @united yes please! I am newly married and try...
5178     @SouthwestAir three cheers to your Denver staf...
14114    @AmericanAir @sarahzou translation: we don't r...
6840     @JetBlue sooo earlier i said i couldnt fly wit...
Name: text, dtype: object

In [38]:
# check the top of the x_train categorical observations!

x_train_cat.head()

Unnamed: 0,airline
14270,American
542,United
5178,Southwest
14114,American
6840,Delta


In [39]:
# dump categorical columns

joblib.dump(x_train_cat.columns, 'categorical_col.pkl')

['categorical_col.pkl']

## Data Imputation

Data imputation is the process of filling in missing values, which usually appear as NaN.

The process is split into 2 parts:
* Numerical Imputation
* Categorical Imputation
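
The two kinds of imputation can be sketched with plain pandas. The important point is that fill values are learned from the training set and then reused on the test set. The frames below are toy data, and `'KOSONG'` ("empty") is the placeholder this notebook uses for missing categories:

```python
import pandas as pd

# Toy training and test frames (hypothetical values) with missing entries.
train = pd.DataFrame({"retweet_count": [0.0, 2.0, None, 4.0],
                      "airline": ["United", None, "Delta", "Delta"]})
test = pd.DataFrame({"retweet_count": [None, 1.0],
                     "airline": [None, "United"]})

# Numerical imputation: learn the median on the training set only.
median_train = train["retweet_count"].median()

# Categorical imputation: replace missing values with a fixed placeholder.
fill_values = {"retweet_count": median_train, "airline": "KOSONG"}
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)   # test reuses the training median

print(test_filled)
```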

In [40]:
# Check for missing values in the training set input

x_train.isnull().sum()

airline              0
retweet_count        0
text                 0
tweet_coord      10798
tweet_created        0
dtype: int64

## Numerical Data Imputation

In [41]:
# check the missing value of the x_train_num

x_train_num.isnull().sum()

retweet_count    0
dtype: int64

* Make a function for numerical imputation

In [42]:
# Import SimpleImputer (sklearn.preprocessing.Imputer was removed in scikit-learn 0.22)

import numpy as np
from sklearn.impute import SimpleImputer

In [43]:
# Define a function only to fit the imputer
#
# input argument : data, missing_values, strategy
#
# return fitted imputer

def fitImputNum(data, missing_values, strategy):
    # define 
    imput = SimpleImputer(missing_values=missing_values, strategy=strategy)

    # fit
    imput.fit(data)

    return imput

In [44]:
# Call function for fitting the imputer

imputer = fitImputNum(x_train_num, np.nan, 'median')

In [45]:
# dump imputer

joblib.dump(imputer, 'imputer.pkl')

['imputer.pkl']

In [46]:
# Define a function to transform Numerical data using Imputer
#
# input argument : data, imputer
#
# return data_num_imputed

def transformNumerical(data, imputer):
    # transform
    data_num_imputed = pd.DataFrame(imputer.transform(data))

    # replace broken column and index
    data_num_imputed.columns = data.columns
    data_num_imputed.index = data.index

    return data_num_imputed

In [47]:
# Call function for transform Numerical data

x_train_num_imputed = transformNumerical(x_train_num, imputer)

In [48]:
# check the missing value of the imputed data

x_train_num_imputed.isnull().sum()

retweet_count    0
dtype: int64

## Categorical Data Imputation


 * Make a function for categorical imputation

In [49]:
x_train_cat.isnull().sum()

airline    0
dtype: int64

In [50]:
# function definition, return imputed data

def categoricalImputation(data):
    # fillna
    data_cat_imputed = data.fillna(value='KOSONG')
    
    return data_cat_imputed

In [51]:
# Call function for imputation

x_train_cat_imputed = categoricalImputation(x_train_cat)

In [52]:
# check the missing value of the imputed data

x_train_cat_imputed.isnull().sum()

airline    0
dtype: int64

## Preprocessing Categorical Variables

### Make a function to get the dummies using Label Encoder & Label Binarizer

In [53]:
# Import LabelBinarizer and LabelEncoder

from sklearn.preprocessing import LabelEncoder, LabelBinarizer

In [54]:
# function definition

def categoricalDummies(data):
    dummy_variables = pd.DataFrame([])
    label_encoder = pd.Series([])
    label_binarizer = pd.Series([])
    
    j = 0
    for i in list(data):
        label_en = LabelEncoder()
        label_bin = LabelBinarizer()
        
        encoded = label_en.fit_transform(data[i])
        binary = label_bin.fit_transform(encoded)
        
        dummy = pd.DataFrame(binary, columns=["{}_{}".format(a, b) for b in sorted(data[i].unique())
                                              for a in [i]], 
                             index = data.index)
        dummy_variables = pd.concat([dummy_variables, dummy], axis = 1)
        label_encoder[j] = label_en
        label_binarizer[j] = label_bin
        
        j += 1
    dummy_columns = dummy_variables.columns
    
    return dummy_variables, label_encoder, label_binarizer,dummy_columns

In [55]:
# Call categoricalDummies

x_train_cat_imputed_dummy, label_encoder, label_binarizer, dummy_columns = categoricalDummies(x_train_cat_imputed)

In [56]:
# dump dummy_columns

joblib.dump(dummy_columns, 'dummy_columns.pkl')

['dummy_columns.pkl']
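
Saving `dummy_columns` matters because new data may lack some categories or contain unseen ones. The sketch below shows one way to align new data to the saved columns. It uses `pd.get_dummies` plus `reindex` instead of the notebook's `LabelEncoder`/`LabelBinarizer` pair, and the column list and test batch are made up for illustration:

```python
import pandas as pd

# Dummy columns learned on the training set (a shortened, hypothetical list).
train_dummy_columns = ["airline_American", "airline_Delta", "airline_United"]

# Hypothetical test batch: no "American" rows, plus an airline never seen in training.
test_cat = pd.DataFrame({"airline": ["Delta", "United", "JetBlue"]})

test_dummy = pd.get_dummies(test_cat, columns=["airline"], dtype=int)

# Add missing training columns as 0 and drop columns unseen during training.
test_dummy = test_dummy.reindex(columns=train_dummy_columns, fill_value=0)
print(test_dummy)
```

With this approach a row with an unseen category ends up as all zeros, which is a design choice rather than the only option.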

In [57]:
# dump label_encoder

joblib.dump(label_encoder, 'label_encoder.pkl')

['label_encoder.pkl']

In [58]:
# dump label_binarizer

joblib.dump(label_binarizer, 'label_binarizer.pkl')

['label_binarizer.pkl']

In [59]:
# check the top observations

x_train_cat_imputed_dummy.head()

Unnamed: 0,airline_American,airline_Delta,airline_Southwest,airline_US Airways,airline_United,airline_Virgin America
14270,1,0,0,0,0,0
542,0,0,0,0,1,0
5178,0,0,1,0,0,0
14114,1,0,0,0,0,0
6840,0,1,0,0,0,0


In [60]:
for i in x_train_cat_imputed_dummy.columns:
    print(x_train_cat_imputed_dummy[i].value_counts(), '\n\n')

0    9501
1    2101
Name: airline_American, dtype: int64 


0    9830
1    1772
Name: airline_Delta, dtype: int64 


0    9684
1    1918
Name: airline_Southwest, dtype: int64 


0    9265
1    2337
Name: airline_US Airways, dtype: int64 


0    8532
1    3070
Name: airline_United, dtype: int64 


0    11198
1      404
Name: airline_Virgin America, dtype: int64 




### Preprocess Time, Coordinate, and Text

Use the program from Exercise 4 to preprocess time, coordinate, and text. <br>
Modify it so that the program is made up of functions.

In [61]:
# def columnSplit(data, splitter, columns_name)
#
# return data_splitted

def columnSplit(data, splitter, columns_name):
    data_splitted = pd.DataFrame(data.str.split(splitter).tolist(),
                        columns = columns_name,
                        index = data.index)

    return data_splitted

In [62]:
x_train_time.head()

14270    2015-02-22 15:42:21 -0800
542      2015-02-24 10:19:32 -0800
5178     2015-02-21 15:25:24 -0800
14114    2015-02-22 17:07:45 -0800
6840     2015-02-24 04:51:15 -0800
Name: tweet_created, dtype: object

In [63]:
# def preprocessTime(data)
#
# return data_time (only return necessary column! read the dataset information carefully)
# 
def preprocessTime(data):
    # columnSplit
    data_time_full = columnSplit(data, ' ', ['date', 'time', 'GMT'])

    # columnSplit again
    data_date = columnSplit(data_time_full['date'], '-', ['year', 'month', 'day'])

    # and again
    data_hour = columnSplit(data_time_full['time'], ':', ['hour', 'minute', 'second'])

    
    # only return necessary column!
    data_time_clean = pd.concat([data_date['day'], data_hour], axis=1)
    
    return data_time_clean

In [64]:
# call preprocessTime

x_train_time_clean = preprocessTime(x_train_time)

In [65]:
for i in x_train_time_clean:
    x_train_time_clean[i] = x_train_time_clean[i].apply(lambda x: int(x))

In [66]:
# check top observation of time data

x_train_time_clean.head()

Unnamed: 0,day,hour,minute,second
14270,22,15,42,21
542,24,10,19,32
5178,21,15,25,24
14114,22,17,7,45
6840,24,4,51,15
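
Only `day`, `hour`, `minute`, and `second` are kept because all tweets come from February 2015, so year and month carry no information. As a cross-check of the string splitting above, the same fields can be read with the standard library. This is a sketch with the hypothetical helper name `parse_tweet_time`, not the notebook's method:

```python
from datetime import datetime

def parse_tweet_time(stamp):
    """Parse 'YYYY-MM-DD HH:MM:SS +ZZZZ' and return day, hour, minute, second."""
    parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S %z")
    return {"day": parsed.day, "hour": parsed.hour,
            "minute": parsed.minute, "second": parsed.second}

# First training row shown above.
print(parse_tweet_time("2015-02-22 15:42:21 -0800"))
```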


In [67]:
# def preprocessCoordinate(data)
#
# return data_coordinate
#
def preprocessCoordinate(data):
    # fillna
    data_coord = data.fillna(value = '[0.0, 0.0]')

    # get value
    data_coord = data_coord.str[1:-1]

    # column split
    data_coordinate = columnSplit(data_coord, ',', ['latitude', 'longitude'])

    # to_numeric
    for i in list(data_coordinate):
        data_coordinate[i] = pd.to_numeric(data_coordinate[i])
        
    return data_coordinate

In [68]:
# call preprocessCoordinate

x_train_coord_clean = preprocessCoordinate(x_train_coord)

In [69]:
# check top observation of coordinate data

x_train_coord_clean.head()

Unnamed: 0,latitude,longitude
14270,0.0,0.0
542,0.0,0.0
5178,0.0,0.0
14114,0.0,0.0
6840,0.0,0.0
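
In the training input, 10,798 of 11,602 rows have no `tweet_coord`, so after `fillna` most rows become `[0.0, 0.0]` and cannot be told apart from a real point at that location. One option, not used in this notebook, is to keep a flag column that records whether a coordinate was present. A sketch on toy values, with `has_coord` as a hypothetical column name:

```python
import pandas as pd

# Toy coordinate strings (hypothetical); the second entry is missing.
coords = pd.Series(["[40.74, -73.99]", None, "[0.0, 0.0]"])

# 1 where a coordinate was recorded, 0 where it was missing.
has_coord = coords.notna().astype(int)

# Same parsing steps as preprocessCoordinate: fill, strip brackets, split, convert.
parts = coords.fillna("[0.0, 0.0]").str[1:-1].str.split(",", expand=True)
parsed = pd.DataFrame({
    "latitude": pd.to_numeric(parts[0]),
    "longitude": pd.to_numeric(parts[1]),
    "has_coord": has_coord,
})
print(parsed)
```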


<font color='red'>======================================================================</font><br>
Before running "Stemming and Lemmatization", extract the contents of nltk_data.rar to drive C:<br>
Make sure the directory C:\nltk_data\corpora contains the wordnet folder and the wordnet.zip file
<font color='red'>======================================================================</font><br>


In [70]:
# import re, SnowballStemmer, WordNetLemmatizer

import re
from nltk.stem import SnowballStemmer, WordNetLemmatizer

In [71]:
# Define stemmer and lemmatizer

stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

In [72]:
# def preprocessText(text, stemmer, lemmatizer)
#
# return clear_text
#
def preprocessText(text, stemmer, lemmatizer):
    clear_text = pd.Series([])
    
    for i  in text.index:
        string = str(text[i])
        
        # Preprocess using RegularExpression
        string = re.sub('[^A-Za-z0-9]+', ' ', string)
        string = re.sub(' +', ' ', string.strip())
        string = string.lower()
        
        # Stemming
        string = str(string)
        string = string.split(" ")
        string = [stemmer.stem(word) for word in string]
        string = " ".join(string)

    
        # Lemmatizing
        string = str(string)
        string = string.split(" ")
        string = [lemmatizer.lemmatize(word) for word in string]
        string = " ".join(string)
    
        # Save to clear_text[i]
        clear_text[i] = string

    return clear_text
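
The regular-expression step of `preprocessText` can be tried on its own, without the stemmer or lemmatizer. The helper name `basic_clean` is hypothetical and the sample tweet is made up:

```python
import re

def basic_clean(tweet):
    """Replace non-alphanumeric runs with a space, collapse spaces, and lowercase."""
    cleaned = re.sub("[^A-Za-z0-9]+", " ", tweet)
    cleaned = re.sub(" +", " ", cleaned.strip())
    return cleaned.lower()

sample = "@AmericanAir we don't reinvest!!"
print(basic_clean(sample))
```

Note that the apostrophe is treated like any other symbol, so "don't" becomes the two tokens "don" and "t", which matches the cleaned text shown in this notebook.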

In [73]:
# call preprocessText

x_train_text_clean = preprocessText(x_train_text, stemmer, lemmatizer)

In [74]:
# check top observation of text data

x_train_text_clean.head()

14270    americanair sa craig no not help one bit actua...
542      unit yes plea i am newli marri and tri to upda...
5178     southwestair three cheer to your denver staff ...
14114    americanair sarahzou translat we don t reinves...
6840     jetblu sooo earlier i said i couldnt fli with ...
dtype: object

In [75]:
# import TF-IDF Vectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

In [76]:
# define vectorizer

tfidf = TfidfVectorizer(min_df=500, stop_words="english")
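
With an integer `min_df`, `TfidfVectorizer` keeps only terms that occur in at least that many documents, so `min_df=500` limits the vocabulary to words found in at least 500 training tweets. The effect can be seen on a toy corpus (made-up documents, with a much smaller threshold):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "flight" appears in 3 documents, "delay" in 2, the rest in 1.
docs = ["flight delay", "flight cancel", "flight delay again", "thank crew"]

vec_all = TfidfVectorizer(min_df=1).fit(docs)    # keep every term
vec_min2 = TfidfVectorizer(min_df=2).fit(docs)   # keep terms found in >= 2 documents

print(sorted(vec_all.vocabulary_))
print(sorted(vec_min2.vocabulary_))
```

A high threshold gives a small, dense feature set, at the cost of discarding rarer words that might still carry sentiment.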

In [77]:
# def fitVectorizer(text, vectorizer)
#
# return fitted vectorizer 
#
# only fit the vectorizer

def fitVectorizer(text, vectorizer):
    #fit the vectorizer
    vectorizer.fit(text)

    return vectorizer

In [78]:
# def transformText(text, fitted_vectorizer)
#
# return feature_word

def transformText(text, vectorizer):
    # transform using vectorizer
    tf_idf = vectorizer.transform(text)

    # make feature_word
    feature_word = pd.DataFrame(tf_idf.toarray(),
                               columns=vectorizer.get_feature_names_out(),  # needs scikit-learn >= 1.0
                               index=text.index)

    return feature_word

In [79]:
# call fitVectorizer

vectorizer = fitVectorizer(x_train_text_clean, tfidf)

In [80]:
# dump vectorizer

joblib.dump(vectorizer, 'vectorizer.pkl')

['vectorizer.pkl']

In [81]:
# call transformText

x_train_text_feature = transformText(x_train_text_clean, tfidf)

In [82]:
# check top observation of feature_word

x_train_text_feature.head()

Unnamed: 0,americanair,bag,cancel,custom,delay,fli,flight,help,hold,hour,...,need,plane,servic,southwestair,thank,time,unit,usairway,wa,wait
14270,0.581908,0.0,0.0,0.0,0.0,0.0,0.0,0.813254,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
542,0.0,0.0,0.0,0.0,0.0,0.0,0.707205,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.707009,0.0,0.0,0.0
5178,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.62008,0.0,0.0,0.0,0.0,0.0,0.0
14114,0.446839,0.0,0.0,0.63784,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.627292,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6840,0.0,0.0,0.0,0.0,0.0,0.815043,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Join Data

In [83]:
# take the numerical variables that no longer have missing values and the
# time, coord, and text variables that have been preprocessed
# (the categorical dummy variables are joined later, after standardization)

# join these columns back together as x_train_concat
x_train_concat = pd.concat([x_train_num_imputed, x_train_time_clean, 
                            x_train_coord_clean, x_train_text_feature], axis=1)

In [84]:
# check top observation

x_train_concat.head()

Unnamed: 0,retweet_count,day,hour,minute,second,latitude,longitude,americanair,bag,cancel,...,need,plane,servic,southwestair,thank,time,unit,usairway,wa,wait
14270,0.0,22,15,42,21,0.0,0.0,0.581908,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
542,0.0,24,10,19,32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.707009,0.0,0.0,0.0
5178,0.0,21,15,25,24,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.62008,0.0,0.0,0.0,0.0,0.0,0.0
14114,1.0,22,17,7,45,0.0,0.0,0.446839,0.0,0.0,...,0.0,0.0,0.627292,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6840,0.0,24,4,51,15,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [85]:
# Check NaN values

x_train_concat.isnull().any().any()

False

## Standardizing Variables

- PURPOSE: put the input variables on a common scale
- fit: lets the standardizer learn the mean and standard deviation of each column
- transform: replaces the data with the standardized values
- the output of transform is a pandas DataFrame
- the fitted standardizer is returned because it will be reused on the test data
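
The fit and transform steps amount to computing `(value - mean) / std` per column, with the mean and standard deviation taken from the training data only. A NumPy sketch on toy numbers (hypothetical values); `np.std` with its default `ddof=0` is the population standard deviation, which is also what `StandardScaler` uses:

```python
import numpy as np

# Toy "hour" column for training and test (hypothetical values).
train_hours = np.array([4.0, 10.0, 15.0, 15.0, 17.0])
test_hours = np.array([10.0, 23.0])

# "fit": learn the statistics from the training data only.
mean_train = train_hours.mean()
std_train = train_hours.std()

# "transform": apply the same statistics to both sets.
train_z = (train_hours - mean_train) / std_train
test_z = (test_hours - mean_train) / std_train

print(train_z.round(3), test_z.round(3))
```

After this step the training column has mean 0 and standard deviation 1, while the test column generally does not, because it is scaled with the training statistics.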

In [86]:
#Import Standard Scaler

from sklearn.preprocessing import StandardScaler

In [87]:
# def fitStandardize(data)
#
# return fitted standardizer

def fitStandardize(data):
    #define standardizer
    standardizer = StandardScaler()

    #fit
    standard = standardizer.fit(data)

    return standard

In [88]:
# def transformStandardize(data, standardizer)
#
# return standardized_data
def transformStandardize(data, standardizer):
    # transform data
    data_standard = pd.DataFrame(standardizer.transform(data))

    # replace broken column and index
    data_standard.columns = data.columns
    data_standard.index = data.index

    return data_standard

In [89]:
# call fitStandardize

normalizer = fitStandardize(x_train_concat)

In [90]:
# dump standardizer

joblib.dump(normalizer, 'standardizer.pkl')

['standardizer.pkl']

In [91]:
# call transformStandardize

x_train_standardize = transformStandardize(x_train_concat, normalizer)

In [92]:
# check top observation

x_train_standardize.head()

Unnamed: 0,retweet_count,day,hour,minute,second,latitude,longitude,americanair,bag,cancel,...,need,plane,servic,southwestair,thank,time,unit,usairway,wa,wait
14270,-0.108057,0.499135,0.500707,0.736408,-0.479603,-0.241366,0.230431,1.744146,-0.218851,-0.265185,...,-0.214438,-0.218052,-0.261428,-0.416098,-0.343852,-0.253241,-0.543325,-0.464249,-0.312419,-0.221933
542,-0.108057,1.423103,-0.436926,-0.600032,0.154485,-0.241366,0.230431,-0.451864,-0.218851,-0.265185,...,-0.214438,-0.218052,-0.261428,-0.416098,-0.343852,-0.253241,1.810427,-0.464249,-0.312419,-0.221933
5178,-0.108057,0.037151,0.500707,-0.251396,-0.30667,-0.241366,0.230431,-0.451864,-0.218851,-0.265185,...,-0.214438,-0.218052,-0.261428,1.937934,-0.343852,-0.253241,-0.543325,-0.464249,-0.312419,-0.221933
14114,1.248735,0.499135,0.87576,-1.297305,0.903861,-0.241366,0.230431,1.234422,-0.218851,-0.265185,...,-0.214438,-0.218052,3.907741,-0.416098,-0.343852,-0.253241,-0.543325,-0.464249,-0.312419,-0.221933
6840,-0.108057,1.423103,-1.562086,1.259362,-0.825469,-0.241366,0.230431,-0.451864,-0.218851,-0.265185,...,-0.214438,-0.218052,-0.261428,-0.416098,-0.343852,-0.253241,-0.543325,-0.464249,-0.312419,-0.221933


In [93]:
# After the numerical inputs are standardized, join them with the dummy variables

x_train_clean = pd.concat([x_train_cat_imputed_dummy, x_train_standardize], axis = 1)

In [94]:
x_train_clean.isnull().any().any()

False

In [95]:
x_train_clean.head()

Unnamed: 0,airline_American,airline_Delta,airline_Southwest,airline_US Airways,airline_United,airline_Virgin America,retweet_count,day,hour,minute,...,need,plane,servic,southwestair,thank,time,unit,usairway,wa,wait
14270,1,0,0,0,0,0,-0.108057,0.499135,0.500707,0.736408,...,-0.214438,-0.218052,-0.261428,-0.416098,-0.343852,-0.253241,-0.543325,-0.464249,-0.312419,-0.221933
542,0,0,0,0,1,0,-0.108057,1.423103,-0.436926,-0.600032,...,-0.214438,-0.218052,-0.261428,-0.416098,-0.343852,-0.253241,1.810427,-0.464249,-0.312419,-0.221933
5178,0,0,1,0,0,0,-0.108057,0.037151,0.500707,-0.251396,...,-0.214438,-0.218052,-0.261428,1.937934,-0.343852,-0.253241,-0.543325,-0.464249,-0.312419,-0.221933
14114,1,0,0,0,0,0,1.248735,0.499135,0.87576,-1.297305,...,-0.214438,-0.218052,3.907741,-0.416098,-0.343852,-0.253241,-0.543325,-0.464249,-0.312419,-0.221933
6840,0,1,0,0,0,0,-0.108057,1.423103,-1.562086,1.259362,...,-0.214438,-0.218052,-0.261428,-0.416098,-0.343852,-0.253241,-0.543325,-0.464249,-0.312419,-0.221933


## 3. Training Machine Learning
* We have to beat the benchmark
* Choose Score to optimize and Hyperparameter Space
* Cross-Validation: Random Search CV 


### Benchmark:

In [96]:
y_train.value_counts(normalize = True)

negative    0.626875
neutral     0.212981
positive    0.160145
Name: airline_sentiment, dtype: float64
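
One way to read these proportions is as a majority-class baseline: a model that always predicts `negative` would be right about 62.7% of the time on the training labels, so a useful classifier should score higher than that. A small sketch of the calculation on toy labels (the label counts are made up):

```python
from collections import Counter

# Toy labels (hypothetical counts): 6 negative, 2 neutral, 2 positive.
labels = ["negative"] * 6 + ["neutral"] * 2 + ["positive"] * 2

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy obtained by always predicting the most frequent class.
baseline_accuracy = majority_count / len(labels)
print(majority_class, baseline_accuracy)
```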

In [97]:
# Import classifier

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

## Decision Tree

In [98]:
def decTree_fit(x_train, y_train, scoring = 'accuracy'):
    decTree = DecisionTreeClassifier(random_state=123)

    hyperparam = {'min_samples_leaf': [3, 5, 7, 9, 13, 17, 21, 27, 33, 41, 50, 60, 80, 100],
                  'max_features': ['sqrt', 'log2', 0.25, 0.5, 0.75]}

    random_decTree = RandomizedSearchCV(decTree, param_distributions = hyperparam, cv = 5,
                                        n_iter = 10, scoring = scoring, n_jobs=2, random_state = 123)
    
    random_decTree.fit(x_train, y_train)
    
    print ("Best Accuracy", random_decTree.best_score_)
    print ("Best Param", random_decTree.best_params_)
    
    return random_decTree

In [99]:
best_decTree = decTree_fit(x_train_clean, y_train)

Best Accuracy 0.680055162903
Best Param {'min_samples_leaf': 60, 'max_features': 0.75}
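Random search does not try every combination in the grid. It draws `n_iter` candidates and cross-validates each one. The sketch below counts what that means for the decision-tree grid used in `decTree_fit`. It uses Python's own `random` module, so the sampled candidates are only illustrative and will not match the ones scikit-learn draws.

```python
import itertools
import random

# Same grid as in decTree_fit
hyperparam = {
    'min_samples_leaf': [3, 5, 7, 9, 13, 17, 21, 27, 33, 41, 50, 60, 80, 100],
    'max_features': ['sqrt', 'log2', 0.25, 0.5, 0.75],
}

# Every combination an exhaustive grid search would have to evaluate
names = list(hyperparam)
all_candidates = [dict(zip(names, values))
                  for values in itertools.product(*(hyperparam[n] for n in names))]

# Random search evaluates only n_iter = 10 of them (illustrative sample)
rng = random.Random(123)
sampled = rng.sample(all_candidates, 10)

# Each sampled candidate is fitted once per fold (cv = 5)
n_cv_fits = len(sampled) * 5

print(len(all_candidates), len(sampled), n_cv_fits)
```

By the same arithmetic, the bagging grid has 70 combinations, the random forest grid 350, and the AdaBoost grid 280, while each search still evaluates only 10 candidates.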


In [100]:
decTree = DecisionTreeClassifier(min_samples_leaf = best_decTree.best_params_.get('min_samples_leaf'),
                                 max_features = best_decTree.best_params_.get('max_features'), random_state=123)
decTree.fit(x_train_clean, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=0.75, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=60,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=123, splitter='best')
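Cell `In [100]` rebuilds the tree from `best_params_` and fits it again. With scikit-learn's default `refit=True`, the search object already holds an equivalent model, retrained on the full training set, as `best_estimator_`. A minimal sketch on synthetic data (`X` and `y` are stand-ins for `x_train_clean` and `y_train`, and the grid is shortened for speed):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data, used only for illustration
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=123)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=123),
    param_distributions={'min_samples_leaf': [3, 5, 9, 17, 33],
                         'max_features': ['sqrt', 'log2', 0.5]},
    n_iter=5, cv=5, scoring='accuracy', random_state=123)
search.fit(X, y)

# refit=True (the default) retrains the best candidate on all of X, y
model = search.best_estimator_
params_match = (model.get_params()['min_samples_leaf']
                == search.best_params_['min_samples_leaf'])
print(params_match)
```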

### Bagging

In [101]:
# def bagging_fit(x_train, y_train, scoring = 'accuracy')
#
#
# return model

def bagging_fit(x_train, y_train, scoring = 'accuracy'):
    #define DecisionTreeClassifier
    dtree = DecisionTreeClassifier(random_state=123)
                                      
    #define bagging and use DecisionTree as base estimator
    bagging = BaggingClassifier(base_estimator = dtree, random_state=123)
                                      
    # define hyperparameter 
    hyperparam = {'base_estimator__min_samples_leaf': [3, 5, 7, 9, 13, 17, 21, 27, 33, 41, 50, 60, 80, 100],
                  'n_estimators': [100, 200, 300, 500, 1000]}
    # the 'base_estimator__' prefix before 'min_samples_leaf' means the hyperparameter
    # belongs to the base estimator, in this case the decision tree
    # (min_samples_leaf is a DecisionTreeClassifier parameter)
    
    # do randomizedsearchCV for bagging, set the scoring on randomizedsearch
    random_bagging = RandomizedSearchCV(bagging, param_distributions = hyperparam, cv = 5,
                                    n_iter = 10, scoring = scoring, n_jobs=2, random_state = 123)
      
    # fit
    random_bagging.fit(x_train, y_train)

                                        
    print ("Best Accuracy", random_bagging.best_score_)
    print ("Best Param", random_bagging.best_params_)
   
    return random_bagging

In [102]:
# call bagging_fit function

best_bagging = bagging_fit(x_train_clean, y_train)

Best Accuracy 0.693932080676
Best Param {'n_estimators': 100, 'base_estimator__min_samples_leaf': 33}
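The double-underscore key in the output above is how scikit-learn addresses a parameter of a nested estimator. One way to discover the valid nested names is to inspect `get_params()`. The sketch below passes the tree positionally because, to my knowledge, recent scikit-learn releases name this argument `estimator` rather than `base_estimator`, so the printed prefix depends on the installed version.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Pass the tree positionally so the sketch does not depend on the keyword name
bag = BaggingClassifier(DecisionTreeClassifier(random_state=123), n_estimators=5)

# Nested parameters are exposed as '<outer parameter>__<inner parameter>'
nested = sorted(name for name in bag.get_params()
                if name.endswith('__min_samples_leaf'))
print(nested)

# The same key can be used in set_params or in a search grid
bag.set_params(**{nested[0]: 33})
print(bag.get_params()[nested[0]])
```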


In [103]:
# make model with best_params

decTreeBag = DecisionTreeClassifier(min_samples_leaf = best_bagging.best_params_.get('base_estimator__min_samples_leaf'),
                                    random_state=123)
bagging = BaggingClassifier(base_estimator = decTreeBag, 
                            n_estimators = best_bagging.best_params_.get('n_estimators'),
                            random_state=123, n_jobs=2)
bagging.fit(x_train_clean, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=33,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=123, splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=100, n_jobs=2, oob_score=False,
         random_state=123, verbose=0, warm_start=False)

### Random Forest

In [104]:
# def randomForest_fit(x_train, y_train, scoring = 'accuracy')
#
#
# return model

def randomForest_fit(x_train, y_train, scoring = 'accuracy'):
    # define classifier
    randomForest = RandomForestClassifier(random_state=123)
                                          
    # define hyperparameter 
    hyperparam = {'min_samples_leaf': [3, 5, 7, 9, 13, 17, 21, 27, 33, 41, 50, 60, 80, 100],
                  'max_features': ['sqrt', 'log2', 0.25, 0.5, 0.75], 
                  'n_estimators': [100, 200, 300, 500, 1000]}
                                          
    # do randomizedsearchCV for random forest, set the scoring on randomizedsearch
    random_randomForest = RandomizedSearchCV(randomForest, param_distributions=hyperparam,cv = 5,
                                             n_iter = 10, scoring = scoring, n_jobs=2, random_state = 123)
    
    # fit
    random_randomForest.fit(x_train, y_train)
    
        
    print ("Best Accuracy", random_randomForest.best_score_)
    print ("Best Param", random_randomForest.best_params_)
        
    return random_randomForest

In [105]:
# call randomForest_fit function

best_randForest = randomForest_fit(x_train_clean, y_train)

Best Accuracy 0.697034993967
Best Param {'n_estimators': 100, 'min_samples_leaf': 5, 'max_features': 'log2'}


In [106]:
# make model with best_params

randForest = RandomForestClassifier(min_samples_leaf=best_randForest.best_params_.get('min_samples_leaf'),
                                   max_features=best_randForest.best_params_.get('max_features'),
                                   n_estimators=best_randForest.best_params_.get('n_estimators'))
    
randForest.fit(x_train_clean, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='log2', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=5,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

### Adaptive Boosting

In [107]:
# def adaBoost_fit(x_train, y_train, scoring = 'accuracy')
#
#
# return model

def adaBoost_fit(x_train, y_train, scoring = 'accuracy'):
    #define DecisionTreeClassifier
    dtree = DecisionTreeClassifier(random_state=123)
    
    #define AdaBoost and use DecisionTree as base estimator
    adaBoost = AdaBoostClassifier(base_estimator = dtree , random_state=123)
    
    # define hyperparameter 
    hyperparam = {'base_estimator__min_samples_leaf': [3, 5, 7, 9, 13, 17, 21, 27, 33, 41, 50, 60, 80, 100],
                  'learning_rate':[1., .1, .01, .001],
                  'n_estimators': [100, 200, 300, 500, 1000]}
    
    # do randomizedsearchCV for AdaBoost, set the scoring on randomizedsearch
    random_adaBoost = RandomizedSearchCV(adaBoost, param_distributions = hyperparam, cv = 5,
                                    n_iter =10, scoring = scoring, n_jobs=2, random_state = 123)
        
    # fit
    random_adaBoost.fit(x_train, y_train)
    
    print ("Best Accuracy", random_adaBoost.best_score_)
    print ("Best Param", random_adaBoost.best_params_)
    
    return random_adaBoost

In [108]:
# call adaBoost_fit

best_adaBoost = adaBoost_fit(x_train_clean, y_train)

Best Accuracy 0.686433373556
Best Param {'n_estimators': 300, 'learning_rate': 0.001, 'base_estimator__min_samples_leaf': 80}


In [109]:
# make model with best_params

decTreeAdaBoost = DecisionTreeClassifier(min_samples_leaf = best_adaBoost.best_params_.get('base_estimator__min_samples_leaf'),
                                    random_state=123)
adaBoost = AdaBoostClassifier(base_estimator = decTreeAdaBoost, 
                              n_estimators = best_adaBoost.best_params_.get('n_estimators'),
                              learning_rate = best_adaBoost.best_params_.get('learning_rate'),
                              random_state=123)
    
adaBoost.fit(x_train_clean, y_train)

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=80,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=123, splitter='best'),
          learning_rate=0.001, n_estimators=300, random_state=123)

### Dump Classifier

In [110]:
# dump classifier

joblib.dump(decTree, 'decTree.pkl')
joblib.dump(bagging,'bagging.pkl')
joblib.dump(randForest,'randForest.pkl')
joblib.dump(adaBoost, 'adaBoost.pkl')

['adaBoost.pkl']
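Dumping a fitted classifier lets a later session reload it without retraining. A small round-trip sketch, assuming the standalone `joblib` package is installed; the toy data and the temporary file path are made up for illustration:

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeClassifier

# Tiny hypothetical training set
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = ['negative', 'negative', 'positive', 'positive']
model = DecisionTreeClassifier(random_state=123).fit(X, y)

# Dump to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'model.pkl')
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same = list(restored.predict(X)) == list(model.predict(X))
print(same)
```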

## 4. Test Prediction

### Preprocessing Test Data

In [111]:
# Categorical

def transformCategorical(data, categorical_columns, label_encoder, label_binarizer,dummy_columns):
    # fill missing categories with the placeholder string "KOSONG" ("empty")
    data = data[categorical_columns].fillna("KOSONG")
    dummy_variables = pd.DataFrame([])
    
    # label_encoder and label_binarizer hold one fitted object per categorical column, in the same order
    for column, label_en, label_bin in zip(categorical_columns, label_encoder, label_binarizer):
        encoded = label_en.transform(data[column])
        binary = label_bin.transform(encoded)
        
        # keep the original row index so the dummies align with the other features
        dummy = pd.DataFrame(binary, index = data.index)
        dummy_variables = pd.concat([dummy_variables, dummy], axis = 1)
    
    dummy_variables.columns = dummy_columns
    
    return dummy_variables

In [112]:
# def validData(data and all necessary object)
#
# return clean_data
#
def validData(data, numerical_columns, categorical_columns, 
              imputer, label_encoder, label_binarizer, dummy_columns, 
              stemmer, lemmatizer, vectorizer, standardizer):
    # preprocess numerical data using transformNumerical()
    data_num_imputed = transformNumerical(data[numerical_columns], imputer)

    # preprocess categorical data using transformCategorical()
    data_cat_dummy = transformCategorical(data, categorical_columns, label_encoder, label_binarizer, dummy_columns) 

    # preprocess time data using preprocessTime()
    data_time = preprocessTime(data['tweet_created'])

    # preprocess coordinate data using preprocessCoordinate()
    data_coord = preprocessCoordinate(data['tweet_coord'])

    # preprocess text data using preprocessText()
    data_text = preprocessText(data['text'], stemmer, lemmatizer)

    # make feature_word using transformText()
    text_feature = transformText(data_text, vectorizer)

    # concat all data
    data_concat = pd.concat([data_num_imputed, data_time, data_coord, text_feature], axis=1)

    # standardize using transformStandardize()
    data_standard = transformStandardize(data_concat, standardizer)
    
    data_valid = pd.concat([data_cat_dummy, data_standard], axis=1)
    
    return data_valid

In [113]:
# load necessary object
# object = joblib.load("filename.pkl")

numerical_columns = joblib.load('numerical_col.pkl')
categorical_columns = joblib.load('categorical_col.pkl')
dummy_columns = joblib.load('dummy_columns.pkl')

imputer = joblib.load('imputer.pkl')
label_binarizer = joblib.load('label_binarizer.pkl')
label_encoder = joblib.load('label_encoder.pkl')

stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()
vectorizer = joblib.load('vectorizer.pkl')

standardizer = joblib.load('standardizer.pkl')


In [114]:
# preprocess test data using validData()

x_test_clean = validData(x_test, numerical_columns, categorical_columns, 
                         imputer, label_encoder, label_binarizer, dummy_columns, 
                         stemmer, lemmatizer, vectorizer, standardizer)

In [115]:
# check top observation

x_test_clean.head()

Unnamed: 0,airline_American,airline_Delta,airline_Southwest,airline_US Airways,airline_United,airline_Virgin America,retweet_count,day,hour,minute,...,need,plane,servic,southwestair,thank,time,unit,usairway,wa,wait
8168,0,1,0,0,0,0,-0.108057,-0.886816,1.625866,-0.541926,...,-0.214438,-0.218052,-0.261428,-0.416098,-0.343852,-0.253241,-0.543325,-0.464249,-0.312419,-0.221933
13037,1,0,0,0,0,0,-0.108057,0.961119,-0.061873,-0.716244,...,-0.214438,-0.218052,-0.261428,-0.416098,-0.343852,-0.253241,-0.543325,-0.464249,3.574521,-0.221933
8351,0,1,0,0,0,0,-0.108057,-0.886816,-0.811979,-0.600032,...,-0.214438,-0.218052,-0.261428,-0.416098,-0.343852,-0.253241,-0.543325,-0.464249,-0.312419,-0.221933
8884,0,1,0,0,0,0,-0.108057,-1.810784,-0.2494,1.14315,...,-0.214438,-0.218052,-0.261428,-0.416098,3.223152,-0.253241,-0.543325,-0.464249,-0.312419,-0.221933
13298,1,0,0,0,0,0,-0.108057,0.961119,-0.811979,-0.600032,...,-0.214438,-0.218052,-0.261428,-0.416098,-0.343852,-0.253241,-0.543325,-0.464249,-0.312419,5.777006
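Note that `retweet_count` takes the value -0.108057 in both the training and the test tables. That is what one would expect when the test data is transformed with the statistics learned from the training data rather than refitted on the test set. A pure-Python sketch of that idea, with hypothetical numbers and a hand-written `standardize` helper standing in for the loaded `standardizer`:

```python
from statistics import mean, pstdev

# Hypothetical retweet counts, for illustration only
train_retweets = [0, 0, 1, 0, 4]
test_retweets = [0, 2]

# Statistics are estimated on the training data only
mu = mean(train_retweets)
sigma = pstdev(train_retweets)   # population standard deviation

def standardize(values, mu, sigma):
    """Apply already-fitted statistics to any data set."""
    return [(v - mu) / sigma for v in values]

train_scaled = standardize(train_retweets, mu, sigma)
test_scaled = standardize(test_retweets, mu, sigma)

# A raw value of 0 maps to the same standardized value in both sets
print(train_scaled[0], test_scaled[0])
```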


### Predict Test Data

In [116]:
# Load classifier

clf = joblib.load('randForest.pkl')

In [117]:
# evaluate score

clf.score(x_test_clean, y_test)

0.70596346087556017
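To put this score in context, the sketch below compares the cross-validation accuracies printed earlier, and the test accuracy above, with the majority-class benchmark. All numbers are copied from the outputs in this notebook. The benchmark share comes from the training labels, so the comparison with the test score is only approximate.

```python
# Share of the majority class (negative) in y_train
baseline = 0.626875

# Best cross-validation accuracy reported by each random search
cv_accuracy = {
    'Decision Tree': 0.680055162903,
    'Bagging': 0.693932080676,
    'Random Forest': 0.697034993967,
    'AdaBoost': 0.686433373556,
}

# Random forest accuracy on the test data
test_accuracy = 0.70596346087556017

for name, score in sorted(cv_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14} CV accuracy {score:.4f}  gain over baseline {score - baseline:+.4f}")

best = max(cv_accuracy, key=cv_accuracy.get)
test_gain = test_accuracy - baseline
print(f"{best} test accuracy {test_accuracy:.4f}  gain over baseline {test_gain:+.4f}")
```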

## Predict Single Raw Data (Example)

In [118]:
def predictData(classifier, input_columns, numerical_columns, categorical_columns, 
                imputer, label_encoder, label_binarizer, dummy_columns, stemmer, 
                lemmatizer, vectorizer, standardizer):
    raw_data = pd.Series([])
    for i in range(0,len(input_columns)):
        message = "Masukkan nilai untuk kolom : "+str(input_columns[i])+" "
        raw_data[i] = input(message)
    data = pd.DataFrame(raw_data)
    data = data.transpose()
    data.columns = input_columns
    for i in numerical_columns :
        data[i] = pd.to_numeric(data[i])
    
    # preprocess the data using validData()
    data_clean = validData(data, numerical_columns, categorical_columns, 
                           imputer, label_encoder, label_binarizer, dummy_columns, 
                           stemmer, lemmatizer, vectorizer, standardizer)
    
    result = classifier.predict(data_clean)
    
    print("The sentiment is " + result)
    print(classifier.predict_proba(data_clean))
    
    return result

In [119]:
# load necessary object
# object = joblib.load("filename.pkl")

classifier = joblib.load('randForest.pkl')
input_columns = joblib.load('input_col.pkl')
numerical_columns = joblib.load('numerical_col.pkl')
categorical_columns = joblib.load('categorical_col.pkl')
dummy_columns = joblib.load('dummy_columns.pkl')

imputer = joblib.load('imputer.pkl')
label_binarizer = joblib.load('label_binarizer.pkl')
label_encoder = joblib.load('label_encoder.pkl')

stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()
vectorizer = joblib.load('vectorizer.pkl')

standardizer = joblib.load('standardizer.pkl')

For example, suppose we want to predict the sentiment for the following data:<br>
<table>
<tr> <td> airline_sentiment_confidence </td><td> 1</td></tr>
<tr> <td>negativereason      </td><td>    Lost Luggage</td></tr>
<tr> <td>negativereason_confidence </td><td> 1</td></tr>
<tr> <td>airline </td><td>  Virgin America</td></tr>
<tr> <td>retweet_count </td><td>  0</td></tr>
<tr> <td>text </td><td> @VirginAmerica everything was fine until you lost my bag</td></tr>
<tr> <td>tweet_coord </td><td>  [40.6413712, -73.78311558]</td></tr>
<tr> <td>tweet_created </td><td>  23-02-15 13:08:00 -0800</td></tr>
</table>

In [120]:
# call predictData()

predictData(classifier, input_columns, numerical_columns, categorical_columns, 
                imputer, label_encoder, label_binarizer, dummy_columns, stemmer, 
                lemmatizer, vectorizer, standardizer)

Masukkan nilai untuk kolom : airline Virgin America
Masukkan nilai untuk kolom : retweet_count 0
Masukkan nilai untuk kolom : text @VirginAmerica everything was fine until you lost my bag
Masukkan nilai untuk kolom : tweet_coord [40.6413712, -73.78311558]
Masukkan nilai untuk kolom : tweet_created 23-02-15 13:08:00 -0800
['The sentiment is negative']
[[ 0.52680993  0.23009963  0.24309044]]


array(['negative'], dtype=object)
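The probability row printed above has one column per class. In scikit-learn the columns follow `classifier.classes_`, which holds the labels in sorted order, so here the order is assumed to be negative, neutral, positive. A short sketch that pairs each probability with its label:

```python
# Assumed order of classifier.classes_ (sorted labels)
classes = ['negative', 'neutral', 'positive']

# Probabilities copied from the predict_proba output above
probabilities = [0.52680993, 0.23009963, 0.24309044]

by_class = dict(zip(classes, probabilities))

# The predicted label is the class with the highest probability
predicted = max(by_class, key=by_class.get)

for label, p in by_class.items():
    print(f"{label:<9} {p:.3f}")
print("predicted:", predicted)
```

The prediction is `negative`, but with a probability of only about 0.53, so the model is not very confident about this tweet.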