### Student Information
Name: Yi-Chun Hung

Student ID: 103061145

---

### Instructions

- Download the dataset provided in this [link](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences#). The sentiment dataset contains a `sentence` and `score` label. Read what the dataset is about on the link provided before you start exploring it. 


- Then, you are asked to apply each of the data exploration and data operation techniques learned in the [first lab session](https://goo.gl/Sg4FS1) on the new dataset. You don't need to explain all the procedures as we did in the notebook, but you are expected to provide some **minimal comments** explaining your code. You are also expected to use the same libraries used in the first lab session. You are allowed to use and modify the `helper` functions we provided in the first lab session or create your own. Also, be aware that the helper functions may need modification as you are dealing with a completely different dataset. This part is worth 80% of your grade!


- After you have completed the operations, you should attempt the **bonus exercises** provided in the [notebook](https://goo.gl/Sg4FS1) we used for the first lab session. There are six (6) additional exercises; attempt them all, as it is part of your grade (10%). 


- You are also expected to tidy up your notebook and attempt new data operations that you have learned so far in the Data Mining course. Surprise us! This segment is worth 10% of your grade.


- After completing all the above tasks, you are free to remove this header block and submit your assignment following the guide provided in the `README.md` file of the assignment's [repository](https://github.com/omarsar/data_mining_hw_1). 

In [148]:
import os
import glob
import pandas as pd
import numpy as np
import nltk
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import sklearn.utils
# Note: plotly >= 4 moved this module to the separate chart_studio package (chart_studio.plotly)
import plotly.plotly as py
import plotly.graph_objs as go
import math
import matplotlib.pyplot as plt
%matplotlib inline

# my functions
import helpers.data_mining_helpers as dmh
import helpers.text_analysis as ta

In [149]:
path = "./sentiment labelled sentences"
data_array = []
categories = []
# Read every "*labelled.txt" file; each line holds "<sentence>\t<score>"
for filepath in glob.glob(os.path.join(path, '*labelled.txt')):
    # Use the file name without its extension as the category name
    filename = os.path.splitext(os.path.basename(filepath))[0]
    categories.append(filename)
    with open(filepath, "r") as f:
        for line in f:
            row = line.rstrip().split("\t")
            row.append(filename)
            data_array.append(row)
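
The loading loop above can also be sketched with `pandas.read_csv`, assuming each file holds one tab-separated sentence and score per line. The function name `load_labelled_sentences` and the tiny demo files are hypothetical and exist only for illustration. Note that this variant parses `label` as an integer, whereas the loop above keeps it as a string.

```python
import csv
import glob
import os
import tempfile

import pandas as pd


def load_labelled_sentences(folder):
    """Read every '*labelled.txt' file in `folder` into one DataFrame."""
    frames = []
    for filepath in sorted(glob.glob(os.path.join(folder, "*labelled.txt"))):
        frame = pd.read_csv(
            filepath,
            sep="\t",
            header=None,
            names=["text", "label"],
            quoting=csv.QUOTE_NONE,  # sentences may contain unbalanced quotes
        )
        # File name without extension, e.g. "yelp_labelled"
        frame["category_name"] = os.path.splitext(os.path.basename(filepath))[0]
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Tiny made-up files so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "yelp_labelled.txt"), "w") as f:
        f.write("Great food.\t1\nSlow service.\t0\n")
    with open(os.path.join(tmp, "imdb_labelled.txt"), "w") as f:
        f.write('A "classic" film.\t1\n')
    demo_df = load_labelled_sentences(tmp)

print(demo_df)
```

Passing `quoting=csv.QUOTE_NONE` is a cautious choice here: review sentences often contain stray quotation marks that the default CSV quoting rules could misread.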


In [150]:
df = pd.DataFrame(data_array, columns=['text', 'label', 'category_name'])

# Shuffle the entire dataframe
df = sklearn.utils.shuffle(df).reset_index(drop=True)

# Format of df (illustrative values)
#    text  label  category_name
# 0  a     1      c
# 1  b     0      a
# ...

In [151]:
# Print the length of df
print("Data length is {}".format(len(df)))

Data length is 3000


In [152]:
# Exercise 0
df.iat[0,0] # get the element in the first row and first column
df.axes[0] # show the row axis (index) of the dataframe

RangeIndex(start=0, stop=3000, step=1)

In [153]:
# Check for missing values in each column
print(df.isnull().apply(lambda x: dmh.check_missing_values(x)))

text             (The amoung of missing records is: , 0)
label            (The amoung of missing records is: , 0)
category_name    (The amoung of missing records is: , 0)
dtype: object
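
As a complementary sketch, missing values can also be counted with plain pandas calls instead of the course helper. The `toy` frame below is made up for illustration; on the real data the same two calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Made-up frame with two missing entries, for illustration only.
toy = pd.DataFrame({
    "text": ["good phone", None, "bad movie"],
    "label": ["1", "0", np.nan],
})

missing_per_column = toy.isnull().sum()       # one count per column
missing_per_row = toy.isnull().sum(axis=1)    # one count per record

print(missing_per_column)
print(missing_per_row)
```

The column-wise counts correspond to the check above, and the row-wise counts correspond to what Exercise 1 asks for.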


In [154]:
# Exercise 1
df.isnull().apply(lambda x: dmh.check_missing_values(x),axis=1)

0       (The amoung of missing records is: , 0)
1       (The amoung of missing records is: , 0)
2       (The amoung of missing records is: , 0)
3       (The amoung of missing records is: , 0)
4       (The amoung of missing records is: , 0)
5       (The amoung of missing records is: , 0)
6       (The amoung of missing records is: , 0)
7       (The amoung of missing records is: , 0)
8       (The amoung of missing records is: , 0)
9       (The amoung of missing records is: , 0)
10      (The amoung of missing records is: , 0)
11      (The amoung of missing records is: , 0)
12      (The amoung of missing records is: , 0)
13      (The amoung of missing records is: , 0)
14      (The amoung of missing records is: , 0)
15      (The amoung of missing records is: , 0)
16      (The amoung of missing records is: , 0)
17      (The amoung of missing records is: , 0)
18      (The amoung of missing records is: , 0)
19      (The amoung of missing records is: , 0)
20      (The amoung of missing records i

In [155]:
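# Check for duplicated data (rows whose text already appeared earlier)
n_duplicates = df.duplicated("text").sum()
print("Data duplicates number : {}".format(n_duplicates))
if n_duplicates > 0:
    # Drop on the same column that was used for counting
    df.drop_duplicates(subset="text", keep='first', inplace=True)

print("Data length(drop duplicates): {}".format(len(df)))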

Data duplicates number : 17
Data length(drop duplicates): 2983
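
Counting duplicates by `text` alone and counting them across all columns can give different answers. A minimal sketch on a made-up frame (the `toy` data is hypothetical) shows the difference and how to inspect every copy before dropping anything:

```python
import pandas as pd

toy = pd.DataFrame({
    "text": ["Great phone.", "Not good.", "Great phone.", "Not good."],
    "label": ["1", "0", "1", "1"],  # last row: same text, different label
})

# keep=False marks every member of a duplicated group, not just the repeats.
all_copies = toy[toy.duplicated("text", keep=False)].sort_values("text")

n_text_dupes = toy.duplicated("text").sum()   # repeats judged by text only
n_full_dupes = toy.duplicated().sum()         # repeats judged by all columns

deduped = toy.drop_duplicates(subset="text", keep="first").reset_index(drop=True)
print(n_text_dupes, n_full_dupes, len(deduped))
```

Resetting the index after dropping rows keeps labels contiguous, which matters if later cells build document labels from `df.index`.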


In [156]:
# Sample the data
df_sample = df.sample(frac=0.25).reset_index(drop=True)
print("length of sample df : {}".format(len(df_sample)))

length of sample df : 746
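
The sample above changes on every run. A hedged sketch of two variations follows: a seeded sample for reproducibility, and a per-category (stratified) sample that preserves category proportions. It assumes pandas 1.1 or newer for `groupby(...).sample`, and the `toy` frame is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "text": ["sentence {}".format(i) for i in range(24)],
    "category_name": ["amazon_cells_labelled"] * 8
                     + ["imdb_labelled"] * 8
                     + ["yelp_labelled"] * 8,
})

# Fixed seed: the same rows are drawn on every run.
simple_sample = toy.sample(frac=0.25, random_state=42).reset_index(drop=True)

# Draw 25% inside each category so the category proportions are preserved.
stratified_sample = (
    toy.groupby("category_name")
       .sample(frac=0.25, random_state=42)
       .reset_index(drop=True)
)
print(stratified_sample["category_name"].value_counts())
```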


In [157]:
print(df_sample[0:4])

                                                text label  category_name
0                Now this dish was quite flavourful.     1  yelp_labelled
1  Everything about this film is simply incredibl...     1  imdb_labelled
2              My first visit to Hiro was a delight!     1  yelp_labelled
3         Hands down my favorite Italian restaurant!     1  yelp_labelled


In [158]:
# Visualize the category distribution of the full and the sampled data
df_category_counts = ta.get_tokens_and_frequency(list(df.category_name))
df_sample_category_counts = ta.get_tokens_and_frequency(
    list(df_sample.category_name))

In [159]:
py.iplot(ta.plot_word_frequency(df_category_counts, "Category distribution"))

In [160]:
#py.iplot(ta.plot_word_frequency(df_sample_category_counts, "Category distribution"))

In [161]:
# Exercise 2
# Grouped bar chart: category counts of the full data vs. the sample
# (go was already imported in the first cell)
layout = go.Layout(
    barmode='group'
)
data1 = go.Bar(
    x=list(df_category_counts[0]),
    y=list(df_category_counts[1]),
    name='Category stat'
)
data2 = go.Bar(
    x=list(df_sample_category_counts[0]),
    y=list(df_sample_category_counts[1]),
    name='Category sample stat'
)
data = [data1,data2]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)


In [162]:
# Feature creation
df["unigrams"] = df["text"].apply(lambda x: dmh.tokenize_text(x))

In [163]:
df[0:4]["unigrams"]

0    [I, got, food, poisoning, here, at, the, buffe...
1        [Always, a, great, time, at, Dos, Gringos, !]
2    [It, holds, a, charge, for, a, long, time, ,, ...
3                         [It, was, probably, dirt, .]
Name: unigrams, dtype: object

In [164]:
# Feature subset selection
count_vect = CountVectorizer()
df_counts = count_vect.fit_transform(df.text)
analyze = count_vect.build_analyzer()
analyze(" ".join(list(df[4:5].text)))

['all', 'things', 'considered', 'job', 'very', 'well', 'done']
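
The analyzer output above drops the single-letter word "a" and the punctuation. A minimal standard-library sketch of why, assuming the vectorizer's default settings (lowercasing and the token pattern `(?u)\b\w\w+\b`), is shown below. The function `simple_analyze` is a hypothetical stand-in, not part of scikit-learn.

```python
import re

# Default CountVectorizer token pattern: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def simple_analyze(text):
    """Lowercase the text, then keep tokens with at least two characters."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = simple_analyze("All things considered, a job very well done.")
print(tokens)
# ['all', 'things', 'considered', 'job', 'very', 'well', 'done']
```

This also explains why the `unigrams` column, built with a different tokenizer, still contains punctuation and one-letter tokens while the vectorizer's vocabulary does not.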

In [165]:
# We can obtain the feature names of the vectorizer, i.e., the terms
# (scikit-learn >= 1.0 offers get_feature_names_out(); get_feature_names() was removed in 1.2)
count_vect.get_feature_names()[0:10]

['00', '10', '100', '11', '12', '13', '15', '15g', '15pm', '17']

In [166]:
df[0:5]

| | text | label | category_name | unigrams |
| --- | --- | --- | --- | --- |
| 0 | I got food poisoning here at the buffet. | 0 | yelp_labelled | [I, got, food, poisoning, here, at, the, buffe... |
| 1 | Always a great time at Dos Gringos! | 1 | yelp_labelled | [Always, a, great, time, at, Dos, Gringos, !] |
| 2 | It holds a charge for a long time, is reasonab... | 1 | amazon_cells_labelled | [It, holds, a, charge, for, a, long, time, ,, ... |
| 3 | It was probably dirt. | 0 | yelp_labelled | [It, was, probably, dirt, .] |
| 4 | All things considered, a job very well done. | 1 | imdb_labelled | [All, things, considered, ,, a, job, very, wel... |


In [167]:
plot_x = ["term_"+str(i) for i in count_vect.get_feature_names()[0:20]]
plot_y = ["doc_"+ str(i) for i in list(df.index)[0:20]]
plot_z = df_counts[0:20, 0:20].toarray()
py.iplot(ta.plot_heat_map(plot_x, plot_y, plot_z))

In [194]:
# Exercise 3
plot_x = ["term_"+str(i) for i in count_vect.get_feature_names()[::100]]
plot_y = ["doc_"+ str(i) for i in list(df.index)[::100]]
plot_z = df_counts[::100, ::100].toarray()
py.iplot(ta.plot_heat_map(plot_x, plot_y, plot_z))

In [169]:
from sklearn.decomposition import PCA
df_reduced = PCA(n_components=3).fit_transform(df_counts.toarray())
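
For intuition about what the reduction does, here is a minimal NumPy sketch of PCA via a centered singular value decomposition. It is not scikit-learn's implementation: the function `pca_reduce` and the random toy matrix are hypothetical, and component signs may differ from what `PCA` returns.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X onto its top principal directions."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T


rng = np.random.default_rng(0)
toy_counts = rng.integers(0, 3, size=(50, 10)).astype(float)  # 50 docs x 10 terms
reduced = pca_reduce(toy_counts, n_components=3)
print(reduced.shape)  # (50, 3)
```

Each document ends up as a point in three dimensions, which is what makes the 3D scatter plot in the following cells possible.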

In [170]:
trace1 = ta.get_trace(df_reduced, df["category_name"], categories[0], "rgb(71,233,163)")
trace2 = ta.get_trace(df_reduced, df["category_name"], categories[1], "rgb(52,133,252)")
trace3 = ta.get_trace(df_reduced, df["category_name"], categories[2], "rgb(229,65,136)")

In [171]:
data = [trace1, trace2, trace3]

In [172]:
layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0
    )
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='simple-3d-scatter')

In [173]:
# Attribute Transformation / Aggregation

# Note: this takes time to compute. You may want to reduce the number of terms you compute frequencies for.
term_frequencies = []
for j in range(0,df_counts.shape[1]):
    term_frequencies.append(sum(df_counts[:,j].toarray()))

In [174]:
term_frequencies[0]

array([1])
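
The per-column loop can be slow on a large vocabulary. A hedged sketch of a vectorized alternative is shown below on a small dense toy matrix (made up for illustration). The same `np.asarray(...sum(axis=0)).ravel()` pattern should also apply to the sparse `df_counts` matrix, though that is untested here, and it would yield plain numbers rather than one-element arrays such as `array([1])`.

```python
import numpy as np

# Toy document-term matrix: 3 documents (rows) x 4 terms (columns).
toy_counts = np.array([
    [1, 0, 2, 0],
    [0, 1, 1, 0],
    [3, 0, 0, 1],
])

# Column-by-column loop, in the spirit of the cell above.
loop_totals = [toy_counts[:, j].sum() for j in range(toy_counts.shape[1])]

# One vectorized call; asarray + ravel flattens a matrix-typed result to 1-D.
vector_totals = np.asarray(toy_counts.sum(axis=0)).ravel()

print(vector_totals)  # [4 1 3 1]
```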

In [175]:
py.iplot(ta.plot_word_frequency([count_vect.get_feature_names(), term_frequencies], "Term Frequency Distribution"))

In [196]:
# Exercise 4
# Keep one out of every ten terms so the plot stays readable
data=[count_vect.get_feature_names()[::10], term_frequencies[::10]]
py.iplot(ta.plot_word_frequency(data, "Term Frequency Distribution"))

In [225]:
# Exercise 5
# Sort the sampled terms by ascending frequency (each name follows its count)
sort_name = [x for _,x in sorted(zip(data[1],data[0]))]
sort_freq = sorted(data[1])
py.iplot(ta.plot_word_frequency([sort_name,sort_freq], "Term Frequency Distribution"))
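
To show the long tail from the most frequent term downwards, the sort can be reversed. A minimal sketch with `numpy.argsort` follows; the terms and counts are made up for illustration.

```python
import numpy as np

terms = np.array(["phone", "the", "buffet", "movie"])
freqs = np.array([7, 40, 1, 12])

order = np.argsort(freqs)[::-1]      # indices from most to least frequent
sorted_terms = terms[order].tolist()
sorted_freqs = freqs[order].tolist()

print(sorted_terms)  # ['the', 'movie', 'phone', 'buffet']
print(sorted_freqs)  # [40, 12, 7, 1]
```

Using one index array for both lists keeps every term aligned with its own count.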

In [226]:
# Log-scale the term frequencies to compress the long tail
term_frequencies_log = [math.log(i) for i in term_frequencies]
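
A small sketch of what the log transform does to the scale, using made-up counts: values spanning three orders of magnitude end up within a range of about seven units, and terms that occur once map to zero. `math.log` needs strictly positive input, which holds here because every vocabulary term occurs at least once.

```python
import math

freqs = [1, 10, 100, 1000]
log_freqs = [math.log(f) for f in freqs]

raw_range = max(freqs) - min(freqs)          # 999
log_range = max(log_freqs) - min(log_freqs)  # about 6.9

print(log_freqs)
```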

In [227]:
py.iplot(ta.plot_word_frequency([count_vect.get_feature_names(), term_frequencies_log], "Term Frequency Distribution"))

In [228]:
from sklearn import preprocessing, metrics, decomposition, pipeline, dummy
mlb = preprocessing.LabelBinarizer()
mlb.fit(df.category_name)

LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)

In [229]:
mlb.classes_

array(['amazon_cells_labelled', 'imdb_labelled', 'yelp_labelled'],
      dtype='<U21')

In [230]:
df['bin_category'] = mlb.transform(df['category_name']).tolist()
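
The same one-hot encoding can be sketched with `pandas.get_dummies`, which may be handy when the result should stay in separate columns rather than a list per row. The `category_names` series below is made up for illustration.

```python
import pandas as pd

category_names = pd.Series(
    ["yelp_labelled", "amazon_cells_labelled", "imdb_labelled", "yelp_labelled"]
)

# One column per category, in sorted order; cast to int for 0/1 values.
one_hot = pd.get_dummies(category_names).astype(int)

print(one_hot.columns.tolist())
print(one_hot.values.tolist())
```

The column order matches the sorted `mlb.classes_` shown above, so `yelp_labelled` again becomes `[0, 0, 1]`.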

In [231]:
df[0:9]

| | text | label | category_name | unigrams | bin_category |
| --- | --- | --- | --- | --- | --- |
| 0 | I got food poisoning here at the buffet. | 0 | yelp_labelled | [I, got, food, poisoning, here, at, the, buffe... | [0, 0, 1] |
| 1 | Always a great time at Dos Gringos! | 1 | yelp_labelled | [Always, a, great, time, at, Dos, Gringos, !] | [0, 0, 1] |
| 2 | It holds a charge for a long time, is reasonab... | 1 | amazon_cells_labelled | [It, holds, a, charge, for, a, long, time, ,, ... | [1, 0, 0] |
| 3 | It was probably dirt. | 0 | yelp_labelled | [It, was, probably, dirt, .] | [0, 0, 1] |
| 4 | All things considered, a job very well done. | 1 | imdb_labelled | [All, things, considered, ,, a, job, very, wel... | [0, 1, 0] |
| 5 | What possesed me to get this junk, I have no i... | 0 | amazon_cells_labelled | [What, possesed, me, to, get, this, junk, ,, I... | [1, 0, 0] |
| 6 | It's a fresh, subtle, and rather sublime effec... | 1 | imdb_labelled | [It, 's, a, fresh, ,, subtle, ,, and, rather, ... | [0, 1, 0] |
| 7 | AFTER ARGUING WITH VERIZON REGARDING THE DROPP... | 0 | amazon_cells_labelled | [AFTER, ARGUING, WITH, VERIZON, REGARDING, THE... | [1, 0, 0] |
| 8 | Excellent sound quality. | 1 | amazon_cells_labelled | [Excellent, sound, quality, .] | [1, 0, 0] |
