# Python Machine Learning Assignment with PACMANN AI


## Intro to Sentiment Analysis : Text Preprocessing

##  Dataset Information

A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as "late flight" or "rude service"). From this data we can analyze how travelers in February 2015 expressed their feelings on Twitter.
 
Source : https://www.kaggle.com/crowdflower/twitter-airline-sentiment
 
### Content
Original data contains 15 columns : tweet_id, airline_sentiment, airline_sentiment_confidence, negativereason, negativereason_confidence, airline, airline_sentiment_gold, name, negativereason_gold, retweet_count, text, tweet_coord, tweet_created, tweet_location, user_timezone

In this exercise we preprocess only the `time`, `coordinate`, and `text` columns.

## 1. Importing Data to Python

In [1]:
# import the pandas library

import pandas as pd

In [2]:
# load data training_text.csv

data = pd.read_csv('training_text.csv')

In [3]:
# check the head of the data

data.head()

Unnamed: 0.1,Unnamed: 0,time,coordinate,text
0,0,2015-02-24 11:35:52 -0800,,@VirginAmerica What @dhepburn said.
1,1,2015-02-24 11:15:59 -0800,,@VirginAmerica plus you've added commercials t...
2,2,2015-02-24 11:15:48 -0800,,@VirginAmerica I didn't today... Must mean I n...
3,3,2015-02-24 11:15:36 -0800,,@VirginAmerica it's really aggressive to blast...
4,4,2015-02-24 11:14:45 -0800,,@VirginAmerica and it's a really big bad thing...


In [4]:
# check whether any values are missing

data.isnull().sum()

Unnamed: 0        0
time              0
coordinate    13621
text              0
dtype: int64

## 2. Preprocessing

### a. Time
The information in the `time` column should be split into separate columns so that the model can process it more effectively.

In [5]:
# check the data type of the time column

type(data['time'][0])

str

In [6]:
# take the 'time' column
# assign it to the variable data_time

data_time = data['time']

In [7]:
# check the head of data_time

data_time.head()

0    2015-02-24 11:35:52 -0800
1    2015-02-24 11:15:59 -0800
2    2015-02-24 11:15:48 -0800
3    2015-02-24 11:15:36 -0800
4    2015-02-24 11:14:45 -0800
Name: time, dtype: object

In [8]:
# split each value on the space character : ' '

data_time = data_time.str.split(' ')
data_time

0        [2015-02-24, 11:35:52, -0800]
1        [2015-02-24, 11:15:59, -0800]
2        [2015-02-24, 11:15:48, -0800]
3        [2015-02-24, 11:15:36, -0800]
4        [2015-02-24, 11:14:45, -0800]
5        [2015-02-24, 11:14:33, -0800]
6        [2015-02-24, 11:13:57, -0800]
7        [2015-02-24, 11:12:29, -0800]
8        [2015-02-24, 11:11:19, -0800]
9        [2015-02-24, 10:53:27, -0800]
10       [2015-02-24, 10:48:24, -0800]
11       [2015-02-24, 10:30:40, -0800]
12       [2015-02-24, 10:30:06, -0800]
13       [2015-02-24, 10:21:28, -0800]
14       [2015-02-24, 10:15:29, -0800]
15       [2015-02-24, 10:01:50, -0800]
16       [2015-02-24, 09:42:59, -0800]
17       [2015-02-24, 09:39:46, -0800]
18       [2015-02-24, 09:15:00, -0800]
19       [2015-02-24, 09:04:10, -0800]
20       [2015-02-24, 08:55:56, -0800]
21       [2015-02-24, 08:49:01, -0800]
22       [2015-02-24, 08:30:15, -0800]
23       [2015-02-24, 08:27:52, -0800]
24       [2015-02-24, 08:18:51, -0800]
25       [2015-02-24, 07:

In [9]:
# convert to a list

data_time = data_time.tolist()
data_time

[['2015-02-24', '11:35:52', '-0800'],
 ['2015-02-24', '11:15:59', '-0800'],
 ['2015-02-24', '11:15:48', '-0800'],
 ['2015-02-24', '11:15:36', '-0800'],
 ['2015-02-24', '11:14:45', '-0800'],
 ['2015-02-24', '11:14:33', '-0800'],
 ['2015-02-24', '11:13:57', '-0800'],
 ['2015-02-24', '11:12:29', '-0800'],
 ['2015-02-24', '11:11:19', '-0800'],
 ['2015-02-24', '10:53:27', '-0800'],
 ['2015-02-24', '10:48:24', '-0800'],
 ['2015-02-24', '10:30:40', '-0800'],
 ['2015-02-24', '10:30:06', '-0800'],
 ['2015-02-24', '10:21:28', '-0800'],
 ['2015-02-24', '10:15:29', '-0800'],
 ['2015-02-24', '10:01:50', '-0800'],
 ['2015-02-24', '09:42:59', '-0800'],
 ['2015-02-24', '09:39:46', '-0800'],
 ['2015-02-24', '09:15:00', '-0800'],
 ['2015-02-24', '09:04:10', '-0800'],
 ['2015-02-24', '08:55:56', '-0800'],
 ['2015-02-24', '08:49:01', '-0800'],
 ['2015-02-24', '08:30:15', '-0800'],
 ['2015-02-24', '08:27:52', '-0800'],
 ['2015-02-24', '08:18:51', '-0800'],
 ['2015-02-24', '07:49:15', '-0800'],
 ['2015-02-2

In [10]:
# build a DataFrame from the list
# make sure the index stays the same

data_time_df = pd.DataFrame(data_time, index = data.index)

In [11]:
# check the head of the data

data_time_df.head()

Unnamed: 0,0,1,2
0,2015-02-24,11:35:52,-0800
1,2015-02-24,11:15:59,-0800
2,2015-02-24,11:15:48,-0800
3,2015-02-24,11:15:36,-0800
4,2015-02-24,11:14:45,-0800


In [12]:
# rename the columns to match their contents

data_time_df.columns = ['date','time','GMT']

In [13]:
# check the head of the data

data_time_df.head()

Unnamed: 0,date,time,GMT
0,2015-02-24,11:35:52,-0800
1,2015-02-24,11:15:59,-0800
2,2015-02-24,11:15:48,-0800
3,2015-02-24,11:15:36,-0800
4,2015-02-24,11:14:45,-0800
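
The split, list conversion, DataFrame construction, and renaming steps above can probably be collapsed into a single call. The following is a minimal sketch, assuming a small sample Series (`sample_time` and `sample_time_df` are hypothetical names standing in for `data['time']` and `data_time_df`). With `expand=True`, `str.split` returns a DataFrame directly and keeps the original index.

```python
import pandas as pd

# hypothetical sample standing in for data['time']; the index is arbitrary on purpose
sample_time = pd.Series(
    ["2015-02-24 11:35:52 -0800", "2015-02-24 11:15:59 -0800"],
    index=[10, 11],
)

# expand=True returns one column per piece instead of a Series of lists
sample_time_df = sample_time.str.split(" ", expand=True)
sample_time_df.columns = ["date", "time", "GMT"]

print(sample_time_df)
```

Because the index is carried over automatically, there is no need to pass `index = data.index` by hand in this variant.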


### Exercise!
* Split the ['date'] column into columns ['year','month','day']
* Split the ['time'] column into columns ['hour','minute','second']

### Date

In [14]:
date_data = data_time_df['date']

In [15]:
date_data.head()

0    2015-02-24
1    2015-02-24
2    2015-02-24
3    2015-02-24
4    2015-02-24
Name: date, dtype: object

In [16]:
date_data = date_data.str.split('-')
date_data

0        [2015, 02, 24]
1        [2015, 02, 24]
2        [2015, 02, 24]
3        [2015, 02, 24]
4        [2015, 02, 24]
5        [2015, 02, 24]
6        [2015, 02, 24]
7        [2015, 02, 24]
8        [2015, 02, 24]
9        [2015, 02, 24]
10       [2015, 02, 24]
11       [2015, 02, 24]
12       [2015, 02, 24]
13       [2015, 02, 24]
14       [2015, 02, 24]
15       [2015, 02, 24]
16       [2015, 02, 24]
17       [2015, 02, 24]
18       [2015, 02, 24]
19       [2015, 02, 24]
20       [2015, 02, 24]
21       [2015, 02, 24]
22       [2015, 02, 24]
23       [2015, 02, 24]
24       [2015, 02, 24]
25       [2015, 02, 24]
26       [2015, 02, 24]
27       [2015, 02, 24]
28       [2015, 02, 24]
29       [2015, 02, 23]
              ...      
14610    [2015, 02, 22]
14611    [2015, 02, 22]
14612    [2015, 02, 22]
14613    [2015, 02, 22]
14614    [2015, 02, 22]
14615    [2015, 02, 22]
14616    [2015, 02, 22]
14617    [2015, 02, 22]
14618    [2015, 02, 22]
14619    [2015, 02, 22]
14620    [2015, 

In [17]:
date_data = date_data.tolist()
date_data

[['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '24'],
 ['2015', '02', '23'],
 ['2015', '02', '23'],
 ['2015', '02', '23'],
 ['2015', '02', '23'],
 ['2015', '02', '23'],
 ['2015', '02', '23'],
 ['2015', '02', '23'],
 ['2015', '02', '23'],
 ['2015', '02', '23'],
 ['2015', '02', '23'],
 ['2015', '02', '23'],
 ['2015', '02', '23'],
 ['2015', '02', '23'],
 ['2015', '02', '23'],
 ['2015', '

In [18]:
date_data_df = pd.DataFrame(date_data, index = data.index)

In [19]:
date_data_df.head()

Unnamed: 0,0,1,2
0,2015,02,24
1,2015,02,24
2,2015,02,24
3,2015,02,24
4,2015,02,24


In [20]:
date_data_df.columns = ['year', 'month', 'day']

In [21]:
date_data_df.head()

Unnamed: 0,year,month,day
0,2015,02,24
1,2015,02,24
2,2015,02,24
3,2015,02,24
4,2015,02,24


### Time

In [22]:
time_data = data_time_df['time']

In [23]:
time_data.head()

0    11:35:52
1    11:15:59
2    11:15:48
3    11:15:36
4    11:14:45
Name: time, dtype: object

In [24]:
time_data = time_data.str.split(':')
time_data

0        [11, 35, 52]
1        [11, 15, 59]
2        [11, 15, 48]
3        [11, 15, 36]
4        [11, 14, 45]
5        [11, 14, 33]
6        [11, 13, 57]
7        [11, 12, 29]
8        [11, 11, 19]
9        [10, 53, 27]
10       [10, 48, 24]
11       [10, 30, 40]
12       [10, 30, 06]
13       [10, 21, 28]
14       [10, 15, 29]
15       [10, 01, 50]
16       [09, 42, 59]
17       [09, 39, 46]
18       [09, 15, 00]
19       [09, 04, 10]
20       [08, 55, 56]
21       [08, 49, 01]
22       [08, 30, 15]
23       [08, 27, 52]
24       [08, 18, 51]
25       [07, 49, 15]
26       [07, 11, 37]
27       [05, 44, 59]
28       [05, 05, 28]
29       [23, 34, 30]
             ...     
14610    [12, 17, 14]
14611    [12, 17, 05]
14612    [12, 16, 58]
14613    [12, 16, 47]
14614    [12, 16, 20]
14615    [12, 16, 18]
14616    [12, 15, 45]
14617    [12, 15, 19]
14618    [12, 14, 44]
14619    [12, 14, 08]
14620    [12, 14, 03]
14621    [12, 13, 45]
14622    [12, 10, 58]
14623    [12, 10, 16]
14624    [

In [25]:
time_data = time_data.tolist()
time_data

[['11', '35', '52'],
 ['11', '15', '59'],
 ['11', '15', '48'],
 ['11', '15', '36'],
 ['11', '14', '45'],
 ['11', '14', '33'],
 ['11', '13', '57'],
 ['11', '12', '29'],
 ['11', '11', '19'],
 ['10', '53', '27'],
 ['10', '48', '24'],
 ['10', '30', '40'],
 ['10', '30', '06'],
 ['10', '21', '28'],
 ['10', '15', '29'],
 ['10', '01', '50'],
 ['09', '42', '59'],
 ['09', '39', '46'],
 ['09', '15', '00'],
 ['09', '04', '10'],
 ['08', '55', '56'],
 ['08', '49', '01'],
 ['08', '30', '15'],
 ['08', '27', '52'],
 ['08', '18', '51'],
 ['07', '49', '15'],
 ['07', '11', '37'],
 ['05', '44', '59'],
 ['05', '05', '28'],
 ['23', '34', '30'],
 ['22', '52', '29'],
 ['21', '35', '43'],
 ['21', '10', '41'],
 ['20', '55', '30'],
 ['20', '24', '33'],
 ['18', '46', '00'],
 ['18', '43', '35'],
 ['18', '19', '47'],
 ['17', '54', '09'],
 ['17', '41', '58'],
 ['17', '32', '54'],
 ['17', '00', '40'],
 ['16', '24', '11'],
 ['16', '20', '38'],
 ['16', '13', '09'],
 ['16', '08', '07'],
 ['16', '04', '28'],
 ['16', '01',

In [26]:
time_data_df = pd.DataFrame(time_data, index = data.index)

In [27]:
time_data_df.head()

Unnamed: 0,0,1,2
0,11,35,52
1,11,15,59
2,11,15,48
3,11,15,36
4,11,14,45


In [28]:
time_data_df.columns = ['hour', 'minute', 'second']

In [29]:
time_data_df.head()

Unnamed: 0,hour,minute,second
0,11,35,52
1,11,15,59
2,11,15,48
3,11,15,36
4,11,14,45
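
The manual splits above leave every field as a string. As an alternative, the sketch below parses the timestamp with `pd.to_datetime` and reads the fields through the `.dt` accessor, which yields integers. It assumes every row uses the same UTC offset, as in the rows shown above; `sample_time`, `parsed`, and `time_parts` are hypothetical names used only for this illustration.

```python
import pandas as pd

# hypothetical sample standing in for data['time']
sample_time = pd.Series([
    "2015-02-24 11:35:52 -0800",
    "2015-02-23 23:34:30 -0800",
])

# %z parses the UTC offset, so the result is timezone-aware
parsed = pd.to_datetime(sample_time, format="%Y-%m-%d %H:%M:%S %z")

# each .dt attribute returns the local (offset-adjusted) field as an integer
time_parts = pd.DataFrame(
    {
        "year": parsed.dt.year,
        "month": parsed.dt.month,
        "day": parsed.dt.day,
        "hour": parsed.dt.hour,
        "minute": parsed.dt.minute,
        "second": parsed.dt.second,
    },
    index=sample_time.index,
)

print(time_parts)
```

Numeric fields are usually easier for a model to consume than strings such as '02', which is the main reason to consider this route.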


### b. Coordinate 
The information in the `coordinate` column should be split into separate columns so that the model can process it more effectively.

In [30]:
# check the data type of the coordinate column

type(data['coordinate'][0])

float

In [31]:
# take the coordinate column

data_coordinate = data['coordinate']
data_coordinate.head()

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: coordinate, dtype: object

In [32]:
# inspect the shape of the values with value_counts

data_coordinate.value_counts(normalize=True)

[0.0, 0.0]                      0.160942
[40.64656067, -73.78334045]     0.005888
[32.91792297, -97.00367737]     0.002944
[40.64646912, -73.79133606]     0.002944
[39.1766101, -76.6700606]       0.001963
[40.68996177, -73.91640136]     0.001963
[35.22643463, -80.93879965]     0.001963
[33.75539049, -116.36196163]    0.001963
[40.69017276, -73.91646118]     0.001963
[37.78618135, -122.45742542]    0.001963
[40.69002464, -73.91638072]     0.001963
[39.83426941, -104.69960636]    0.001963
[33.75348859, -116.36209633]    0.001963
[37.99311597, -84.52114659]     0.001963
[32.82813261, -97.25115941]     0.001963
[34.0213466, -118.45229268]     0.001963
[40.68994668, -73.91637642]     0.001963
[37.62006843, -122.38822083]    0.001963
[18.22245647, -63.00369733]     0.001963
[39.85861339, -104.67232131]    0.000981
[32.78951797, -96.79891462]     0.000981
[25.8058675, -80.1260541]       0.000981
[4.69840554, -74.14134323]      0.000981
[35.2176668, -80.9426912]       0.000981
[41.45856783, -8

In [33]:
# fill missing values with 0.0, using the same string format as the other rows

data_coordinate = data_coordinate.fillna(value = "[0.0, 0.0]")

In [34]:
# check the head of the data

data_coordinate.head()

0    [0.0, 0.0]
1    [0.0, 0.0]
2    [0.0, 0.0]
3    [0.0, 0.0]
4    [0.0, 0.0]
Name: coordinate, dtype: object

In [35]:
# keep only the part we need (drop the surrounding brackets)

data_coordinate = data_coordinate.str[1:-1]

In [36]:
# check the head of the data

data_coordinate.head()

0    0.0, 0.0
1    0.0, 0.0
2    0.0, 0.0
3    0.0, 0.0
4    0.0, 0.0
Name: coordinate, dtype: object

### Exercise!
* Split the ['coordinate'] column into columns ["latitude", "longitude"]

In [37]:
coord_data = data_coordinate.str.split(',')
coord_data

0                         [0.0,  0.0]
1                         [0.0,  0.0]
2                         [0.0,  0.0]
3                         [0.0,  0.0]
4                         [0.0,  0.0]
5                         [0.0,  0.0]
6                         [0.0,  0.0]
7                         [0.0,  0.0]
8                         [0.0,  0.0]
9                         [0.0,  0.0]
10                        [0.0,  0.0]
11                        [0.0,  0.0]
12                        [0.0,  0.0]
13                        [0.0,  0.0]
14                        [0.0,  0.0]
15                        [0.0,  0.0]
16                        [0.0,  0.0]
17                        [0.0,  0.0]
18                        [0.0,  0.0]
19                        [0.0,  0.0]
20                        [0.0,  0.0]
21       [40.74804263,  -73.99295302]
22                        [0.0,  0.0]
23                        [0.0,  0.0]
24                        [0.0,  0.0]
25                        [0.0,  0.0]
26          

In [38]:
coord_data = coord_data.tolist()
coord_data

[['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['40.74804263', ' -73.99295302'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['42.361016', ' -71.02000488'],
 ['33.94540417', ' -118.4062472'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['33.94209449', ' -118.40410103'],
 ['0.0', ' 0.0'],
 ['33.2145038', ' -96.9321504'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['34.0219817', ' -118.38591198'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 ['0.0', ' 0.0'],
 

In [39]:
coord_data_df = pd.DataFrame(coord_data, index = data.index)

In [40]:
coord_data_df.head()

Unnamed: 0,0,1
0,0.0,0.0
1,0.0,0.0
2,0.0,0.0
3,0.0,0.0
4,0.0,0.0


In [41]:
coord_data_df.columns = ['latitude', 'longitude']

In [42]:
coord_data_df.head()

Unnamed: 0,latitude,longitude
0,0.0,0.0
1,0.0,0.0
2,0.0,0.0
3,0.0,0.0
4,0.0,0.0


In [43]:
# check the data type of each of the columns ["latitude", "longitude"]

print(type(coord_data_df['latitude'][0]))
print(type(coord_data_df['longitude'][0]))

<class 'str'>
<class 'str'>


Convert the ["latitude", "longitude"] columns back to a numeric format, using : <br>
data[column_name] = pd.to_numeric(data[column_name])

In [44]:
coord_data_df['latitude'] = pd.to_numeric(coord_data_df['latitude'])
coord_data_df['longitude'] = pd.to_numeric(coord_data_df['longitude'])

In [45]:
print(type(coord_data_df['latitude'][0]))
print(type(coord_data_df['longitude'][0]))

<class 'numpy.float64'>
<class 'numpy.float64'>
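
Filling missing coordinates with "[0.0, 0.0]" makes them indistinguishable from rows that genuinely contain [0.0, 0.0], which the `value_counts` output above shows is the most common non-missing value. One possible alternative, sketched below on a hypothetical sample (`sample_coordinate`, `coords`, and the `has_coordinate` column are names introduced here, not part of the notebook), keeps missing rows as NaN and records a separate flag.

```python
import pandas as pd

# hypothetical sample standing in for data['coordinate']; NaN marks a missing value
sample_coordinate = pd.Series([
    float("nan"),
    "[40.74804263, -73.99295302]",
    "[0.0, 0.0]",
])

coords = (
    sample_coordinate.str[1:-1]        # drop the surrounding brackets
    .str.split(",", expand=True)       # one column per coordinate
    .apply(pd.to_numeric)              # strings to floats, NaN stays NaN
)
coords.columns = ["latitude", "longitude"]

# flag rows that had a coordinate in the raw data
coords["has_coordinate"] = sample_coordinate.notna()

print(coords)
```

Whether to keep NaN or impute a value depends on the downstream model; many estimators reject NaN, in which case the flag column lets the imputed rows remain identifiable.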


### Exercise!
### Write a function that splits the data in a column

Name the function `columnSplit`; it takes 3 arguments:

 1. `data`: the data column to split
 2. `splitter`    : the delimiter to split on
 3. `columns_name` : the names of the new columns
 
Then assign the function's result to a variable named `data_ .... `

Next, repeat the splits using the columnSplit function on:
* column data['time'] into columns ['date', 'time','GMT']
* column ['date'] into columns ['year','month','day']
* column ['time'] into columns ['hour','minute','second']

In [46]:
# define columnSplit function

def columnSplit(data, splitter, columns_name):
    data = pd.DataFrame(data.str.split(splitter).tolist(),
                        columns = columns_name,
                        index = data.index)

    return data

In [47]:
# column data['time'] into columns ['date', 'time','GMT']

data_time_full = columnSplit(data['time'], ' ', ['date', 'time', 'GMT'])

In [48]:
# column ['date'] into columns ['year','month','day']

data_date = columnSplit(data_time_full['date'], '-', ['year', 'month', 'day'])

In [49]:
# column ['time'] into columns ['hour','minute','second']

data_hour = columnSplit(data_time_full['time'], ':', ['hour', 'minute', 'second'])
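
Since every frame returned by `columnSplit` shares the index of its input, the pieces can be joined back into one feature table. The sketch below repeats the function so that it runs on its own and uses a hypothetical two-row sample; `sample_time`, `full`, `date_part`, `hour_part`, and `time_features` are illustrative names, and the cast to `int` is an assumption about what a model would want.

```python
import pandas as pd

def columnSplit(data, splitter, columns_name):
    # split each string, then rebuild a DataFrame on the original index
    return pd.DataFrame(data.str.split(splitter).tolist(),
                        columns=columns_name,
                        index=data.index)

# hypothetical sample standing in for data['time']
sample_time = pd.Series([
    "2015-02-24 11:35:52 -0800",
    "2015-02-24 09:04:10 -0800",
])

full = columnSplit(sample_time, " ", ["date", "time", "GMT"])
date_part = columnSplit(full["date"], "-", ["year", "month", "day"])
hour_part = columnSplit(full["time"], ":", ["hour", "minute", "second"])

# the pieces share one index, so concat lines the rows up; cast digits to int
time_features = pd.concat([date_part, hour_part], axis=1).astype(int)
time_features["GMT"] = full["GMT"]

print(time_features)
```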

### c. Text

In [50]:
# take the data["text"] column

text = data["text"]

In [51]:
# check the head of the data

text.head()

0                  @VirginAmerica What @dhepburn said.
1    @VirginAmerica plus you've added commercials t...
2    @VirginAmerica I didn't today... Must mean I n...
3    @VirginAmerica it's really aggressive to blast...
4    @VirginAmerica and it's a really big bad thing...
Name: text, dtype: object

####  i. Regular Expression

A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).
<br>
Further reference : https://web.stanford.edu/~jurafsky/slp3/slides/2_TextProc.pdf (p1-8)

In [52]:
# import library re

import re

In [53]:
# take one particular row, e.g. row 7

raw_text = text[7]
raw_text

'@VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP'

In [54]:
# make sure the value is a string

raw_text = str(raw_text)
raw_text

'@VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP'

In [55]:
# check the documentation of re.sub
?re.sub

re.sub(pattern, repl, string, count=0, flags=0) : <br>
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl.  repl can be either a string or a callable;
if a string, backslash escapes in it are processed.  If it is
a callable, it's passed the match object and must return
a replacement string to be used.

In [56]:
# remove symbols from the text, replacing them with a space

# '[^A-Za-z0-9]+'  :
#    [] denotes the set of characters to match
#    - denotes a range
#    ^ (at the start of the set) negates the set
#    + matches one or more repetitions of the preceding element
#      e.g. baa+ matches baa, baaa, baaaa, baaaaa

raw_text = re.sub('[^A-Za-z0-9]+', ' ', raw_text)

#  the call above replaces every run of characters other than A-Z, a-z, and 0-9 with ' ' (a space)

raw_text

' VirginAmerica Really missed a prime opportunity for Men Without Hats parody there https t co mWpG7grEZP'

In [57]:
# remove extra spaces

raw_text = re.sub(' +',' ',raw_text.strip())
raw_text

'VirginAmerica Really missed a prime opportunity for Men Without Hats parody there https t co mWpG7grEZP'

In [58]:
# convert to lowercase

raw_text = raw_text.lower()
raw_text

'virginamerica really missed a prime opportunity for men without hats parody there https t co mwpg7grezp'
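
The three cleaning steps above (replace symbols, collapse spaces, lowercase) can be bundled into one helper so they are easy to reuse on other rows. This is a minimal sketch; the name `clean_tweet` is hypothetical and not part of the notebook.

```python
import re

def clean_tweet(raw):
    """Replace non-alphanumeric runs with a space, trim, and lowercase."""
    cleaned = re.sub('[^A-Za-z0-9]+', ' ', str(raw))   # symbols -> single space
    cleaned = re.sub(' +', ' ', cleaned.strip())       # trim and collapse spaces
    return cleaned.lower()

example = ('@VirginAmerica Really missed a prime opportunity for '
           'Men Without Hats parody, there. https://t.co/mWpG7grEZP')
print(clean_tweet(example))
```

Note that the URL is broken into the tokens `https`, `t`, `co`, and the short-link code. If those tokens are unwanted, one option is to strip URLs before removing symbols.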

#### ii. Stemming and Lemmatization

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:

> am, are, is $\Rightarrow$ be <br>
> car, cars, car's, cars' $\Rightarrow$ car<br>
> cats, catty  $\Rightarrow$ cat <br>
> stemming, stemmer, stemmed  $\Rightarrow$ stem <br>
> fisher, fishing, fished  $\Rightarrow$ fish

The result of this mapping of text will be something like:
>the boy's cars are different colors $\Rightarrow$ the boy car be differ color

However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma <br><br>

Further reference : https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
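
To make the contrast concrete, here is a deliberately tiny sketch in plain Python. It is not the Snowball algorithm or WordNet: the suffix list and the lemma dictionary are invented for this illustration, and `toy_stem` and `toy_lemmatize` are hypothetical names. The stemmer chops endings by rule, while the lemmatizer looks words up in a vocabulary.

```python
# toy suffix list and lemma dictionary, invented for this illustration only
SUFFIXES = ("ing", "ed", "ly", "s")
LEMMAS = {"was": "be", "are": "be", "is": "be",
          "men": "man", "hats": "hat", "missed": "miss"}

def toy_stem(word):
    # crude heuristic: strip the first matching suffix if enough of the word remains
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    # vocabulary lookup: return the dictionary form when it is known
    return LEMMAS.get(word, word)

words = ["missed", "hats", "men", "was", "really"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

for w, s, l in zip(words, stems, lemmas):
    print(f"{w:>8}  stem={s:<6} lemma={l}")
```

The rule-based stemmer turns "really" into "real" but cannot relate "men" to "man", whereas the lookup handles irregular forms and leaves unknown words untouched.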


In [59]:
# import the stemming and lemmatization classes

from nltk.stem import SnowballStemmer, WordNetLemmatizer
stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

Further reference on SnowballStemmer : http://snowball.tartarus.org/texts/introduction.html <br>
Further reference on WordNetLemmatizer : https://wordnet.princeton.edu/

<font color='red'>======================================================================</font><br>
Before running "Stemming and Lemmatization", extract the contents of nltk_data.rar to drive C:<br>
Make sure the directory C:\nltk_data\corpora contains the wordnet folder and the wordnet.zip file
<font color='red'>======================================================================</font><br>


In [60]:
# take raw_text and assign it to stem_text

stem_text = str(raw_text)
stem_text

'virginamerica really missed a prime opportunity for men without hats parody there https t co mwpg7grezp'

In [61]:
# split the sentence into words on spaces

stem_text = stem_text.split(" ")
stem_text

['virginamerica',
 'really',
 'missed',
 'a',
 'prime',
 'opportunity',
 'for',
 'men',
 'without',
 'hats',
 'parody',
 'there',
 'https',
 't',
 'co',
 'mwpg7grezp']

In [62]:
# stem each word in the sentence with the stemmer

stem_text = [stemmer.stem(word) for word in stem_text]
stem_text

['virginamerica',
 'realli',
 'miss',
 'a',
 'prime',
 'opportun',
 'for',
 'men',
 'without',
 'hat',
 'parodi',
 'there',
 'https',
 't',
 'co',
 'mwpg7grezp']

In [63]:
# join the words back into a sentence

stem_text = " ".join(stem_text)
stem_text

'virginamerica realli miss a prime opportun for men without hat parodi there https t co mwpg7grezp'

In [64]:
# take stem_text and assign it to lemmatize_text
# split the sentence into words on spaces

lemmatize_text = stem_text.split(" ")
lemmatize_text

['virginamerica',
 'realli',
 'miss',
 'a',
 'prime',
 'opportun',
 'for',
 'men',
 'without',
 'hat',
 'parodi',
 'there',
 'https',
 't',
 'co',
 'mwpg7grezp']

In [65]:
# lemmatize each word in the sentence

lemmatize_text = [lemmatizer.lemmatize(word) for word in lemmatize_text]
lemmatize_text

['virginamerica',
 'realli',
 'miss',
 'a',
 'prime',
 'opportun',
 'for',
 'men',
 'without',
 'hat',
 'parodi',
 'there',
 'http',
 't',
 'co',
 'mwpg7grezp']

In [66]:
# join the words back into a sentence

lemmatize_text = " ".join(lemmatize_text)
lemmatize_text

'virginamerica realli miss a prime opportun for men without hat parodi there http t co mwpg7grezp'

In [67]:
# print raw_text, stem_text, and lemmatize_text
# and compare the differences

print(raw_text)
print(stem_text)
print(lemmatize_text)

virginamerica really missed a prime opportunity for men without hats parody there https t co mwpg7grezp
virginamerica realli miss a prime opportun for men without hat parodi there https t co mwpg7grezp
virginamerica realli miss a prime opportun for men without hat parodi there http t co mwpg7grezp



### Exercise!
* Apply the same text preprocessing as above to another row of the text data, e.g. text[9]

In [68]:
text9 = text[9]
text9

"@VirginAmerica it was amazing, and arrived an hour early. You're too good to me."

In [69]:
text9 = str(text9)
text9

"@VirginAmerica it was amazing, and arrived an hour early. You're too good to me."

In [70]:
text9 = re.sub('[^A-Za-z0-9]+', ' ', text9)
text9

' VirginAmerica it was amazing and arrived an hour early You re too good to me '

In [71]:
text9 = re.sub(' +', ' ', text9.strip())
text9

'VirginAmerica it was amazing and arrived an hour early You re too good to me'

In [72]:
text9 = text9.lower()
text9

'virginamerica it was amazing and arrived an hour early you re too good to me'

In [73]:
stem_text9 = text9.split(' ')
stem_text9

['virginamerica',
 'it',
 'was',
 'amazing',
 'and',
 'arrived',
 'an',
 'hour',
 'early',
 'you',
 're',
 'too',
 'good',
 'to',
 'me']

In [74]:
stem_text9 = [stemmer.stem(word) for word in stem_text9]
stem_text9

['virginamerica',
 'it',
 'was',
 'amaz',
 'and',
 'arriv',
 'an',
 'hour',
 'earli',
 'you',
 're',
 'too',
 'good',
 'to',
 'me']

In [75]:
stem_text9 = ' '.join(stem_text9)
stem_text9

'virginamerica it was amaz and arriv an hour earli you re too good to me'

In [76]:
lemmatize_text9 = stem_text9.split(' ')
lemmatize_text9

['virginamerica',
 'it',
 'was',
 'amaz',
 'and',
 'arriv',
 'an',
 'hour',
 'earli',
 'you',
 're',
 'too',
 'good',
 'to',
 'me']

In [77]:
lemmatize_text9 = [lemmatizer.lemmatize(word) for word in lemmatize_text9]
lemmatize_text9

['virginamerica',
 'it',
 'wa',
 'amaz',
 'and',
 'arriv',
 'an',
 'hour',
 'earli',
 'you',
 're',
 'too',
 'good',
 'to',
 'me']

In [78]:
lemmatize_text9 = ' '.join(lemmatize_text9)
lemmatize_text9

'virginamerica it wa amaz and arriv an hour earli you re too good to me'

In [79]:
print(text9)
print(stem_text9)
print(lemmatize_text9)

virginamerica it was amazing and arrived an hour early you re too good to me
virginamerica it was amaz and arriv an hour earli you re too good to me
virginamerica it wa amaz and arriv an hour earli you re too good to me


###### re, Stem, Lemmatize on the entire dataset

In [80]:
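# start from an empty Series; an explicit dtype avoids the empty-Series dtype warning in some pandas versions
clear_text = pd.Series(dtype=object)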

In [81]:
for i in text.index:
    string = str(text[i])
    string = re.sub('[^A-Za-z0-9]+', ' ', string)    
    string = re.sub(' +',' ',string.strip())
    clear_text[i] = string.lower()

In [82]:
clear_text.head()

0                     virginamerica what dhepburn said
1    virginamerica plus you ve added commercials to...
2    virginamerica i didn t today must mean i need ...
3    virginamerica it s really aggressive to blast ...
4    virginamerica and it s a really big bad thing ...
dtype: object
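
Growing a Series one label at a time inside a Python loop can be slow on a dataset of this size. The same cleaning can likely be expressed with pandas' vectorized string methods. This is a minimal sketch on a hypothetical two-row sample; `sample_text` and `sample_clear` stand in for `text` and `clear_text`.

```python
import pandas as pd

# hypothetical sample standing in for the text column
sample_text = pd.Series([
    "@VirginAmerica What @dhepburn said.",
    "@VirginAmerica it was amazing, and arrived an hour early. You're too good to me.",
])

sample_clear = (
    sample_text.astype(str)
    .str.replace(r"[^A-Za-z0-9]+", " ", regex=True)  # symbols -> single space
    .str.strip()                                     # drop leading/trailing spaces
    .str.lower()
)

print(sample_clear.tolist())
```

Because each run of symbols is already replaced by a single space, the separate "collapse repeated spaces" step is not needed in this variant.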

In [83]:
import nltk
from nltk.stem import SnowballStemmer, WordNetLemmatizer

In [84]:
stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()
# empty Series with an explicit dtype, to be filled row by row below
stem_text = pd.Series(dtype=object)
lemmatize_text = pd.Series(dtype=object)

In [85]:
for i in clear_text.index:
    string = clear_text[i]
    string = str(string)
    string = string.split(" ")
    string = [stemmer.stem(word) for word in string]
    string = " ".join(string)
    stem_text[i] = str(string)

In [86]:
stem_text.head()

0                     virginamerica what dhepburn said
1    virginamerica plus you ve ad commerci to the e...
2    virginamerica i didn t today must mean i need ...
3    virginamerica it s realli aggress to blast obn...
4    virginamerica and it s a realli big bad thing ...
dtype: object

In [87]:
for i in stem_text.index:
    string = stem_text[i]
    string = str(string)
    string = string.split(" ")
    string = [lemmatizer.lemmatize(word) for word in string]
    string = " ".join(string)
    lemmatize_text[i] = str(string)

In [88]:
lemmatize_text.head()

0                     virginamerica what dhepburn said
1    virginamerica plus you ve ad commerci to the e...
2    virginamerica i didn t today must mean i need ...
3    virginamerica it s realli aggress to blast obn...
4    virginamerica and it s a realli big bad thing ...
dtype: object

In [89]:
clean_text = lemmatize_text

#### iii. TF-IDF

TF: Term Frequency <br>
Measures how frequently a term occurs in a document<br><br>
IDF: Inverse Document Frequency<br>
Measures how rare, and therefore how informative, a term is across all documents<br><br>
Further reference : Chapter 15.2 from Christopher D. Manning, Hinrich Schütze-Foundations of Statistical Natural Language Processing-The MIT Press (1999)
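
As a worked illustration, the sketch below computes TF-IDF weights by hand in plain Python. It assumes the variant that scikit-learn's `TfidfVectorizer` applies by default: smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each document vector. The three sample documents and the names `docs`, `idf`, and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

# hypothetical, already-cleaned sample documents
docs = [
    "virginamerica flight delay",
    "virginamerica thank",
    "flight cancel flight",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))

# smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(tokens):
    # raw term counts times idf, then scaled to unit L2 length
    tf = Counter(tokens)
    weights = {term: tf[term] * idf[term] for term in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]
single = tfidf_vector(["virginamerica"])

print(vectors[0])
print(single)
```

In the first document, "delay" appears in fewer documents than "flight", so it receives the larger weight. A document containing only one vocabulary term always gets weight 1.0 for it after normalization, which is why a feature table built this way can show exact 1.0 entries.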


In [90]:
# import TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

In [91]:
# define vectorizer

vectorizer = TfidfVectorizer(min_df=500, stop_words="english")

Input argument descriptions for TfidfVectorizer : <br>
min_df : float in range [0.0, 1.0] or int, default=1
    > When building the vocabulary ignore terms that have a document
    frequency strictly lower than the given threshold. This value is also
    called cut-off in the literature.
    If float, the parameter represents a proportion of documents, integer
    absolute counts.
    This parameter is ignored if vocabulary is not None.
 
stop_words : string {'english'}, list, or None (default)
    > If a string, it is passed to _check_stop_list and the appropriate stop
    list is returned. 'english' is currently the only supported string
    value.
  
    

In [92]:
# fit vectorizer

vectorizer.fit(clean_text)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=500,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [93]:
# transform data clean_text

tf_idf = vectorizer.transform(clean_text)

In [94]:
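# put the TfidfVectorizer output into a DataFrame
# note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)

feature_word = pd.DataFrame(tf_idf.toarray(), columns=vectorizer.get_feature_names_out(), 
                                index = text.index)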

In [95]:
# check the head of the data

feature_word.head()

Unnamed: 0,airlin,americanair,amp,bag,cancel,custom,day,delay,fli,flight,...,southwestair,thank,time,tri,unit,usairway,virginamerica,wa,wait,whi
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.727886,0.0,0.0,0.0
3,0.0,0.0,0.692405,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.721509,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


### Exercise!
### Write a function that builds TF-IDF features from a text column

Name the function `tfidfFeature`; it takes 2 arguments:

 1. `text`: the text column to vectorize
 2. `vectorizer`    : the TfidfVectorizer instance to fit on the text

The function returns `feature_word` and the fitted `vectorizer`.


In [96]:
tfidf = TfidfVectorizer(min_df=500, stop_words='english')

In [97]:
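# define the function

def tfidfFeature(text, vectorizer):
    # learn the vocabulary and IDF weights from the text
    vectorizer.fit(text)
    
    # convert the text into a TF-IDF matrix, then into a labeled DataFrame
    tf_idf = vectorizer.transform(text)
    # get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
    feature_word = pd.DataFrame(tf_idf.toarray(),
                               columns=vectorizer.get_feature_names_out(),
                               index=text.index)
    
    return feature_word, vectorizer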

In [98]:
feature_word, tfidf = tfidfFeature(clean_text, tfidf)

In [99]:
feature_word.head()

Unnamed: 0,airlin,americanair,amp,bag,cancel,custom,day,delay,fli,flight,...,southwestair,thank,time,tri,unit,usairway,virginamerica,wa,wait,whi
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.727886,0.0,0.0,0.0
3,0.0,0.0,0.692405,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.721509,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
