### Student Information
Name: Aida Halitaj

Student ID: 106065432

### Instructions

- Download the dataset provided in this [link](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences#). The sentiment dataset contains a `sentence` and `score` label. Read what the dataset is about on the link provided before you start exploring it. 


- Then, you are asked to apply each of the data exploration and data operation techniques learned in the [first lab session](https://goo.gl/Sg4FS1) on the new dataset. You don't need to explain all the procedures as we did in the notebook, but you are expected to provide some **minimal comments** explaining your code. You are also expected to use the same libraries used in the first lab session. You are allowed to use and modify the `helper` functions we provided in the first lab session or create your own. Also, be aware that the helper functions may need modification as you are dealing with a completely different dataset. This part is worth 80% of your grade!


- After you have completed the operations, you should attempt the **bonus exercises** provided in the [notebook](https://goo.gl/Sg4FS1) we used for the first lab session. There are six (6) additional exercises; attempt them all, as they are part of your grade (10%).


- You are also expected to tidy up your notebook and attempt new data operations that you have learned so far in the Data Mining course. Surprise us! This segment is worth 10% of your grade.


- After completing all the above tasks, you are free to remove this header block and submit your assignment following the guide provided in the `README.md` file of the assignment's [repository](https://github.com/omarsar/data_mining_hw_1). 

**Importing requirements**

In [1]:
import glob
import math
import os

import numpy as np
import pandas as pd  # this is how I usually import pandas
from pandas import DataFrame, read_csv
import nltk
from sklearn.feature_extraction.text import CountVectorizer
import plotly.plotly as py
import plotly.graph_objs as go
import matplotlib.pyplot as plt

# Enable inline plotting
%matplotlib inline


# my functions
import helpers.data_mining_helpers as dmh
import helpers.text_analysis as ta

Reading files into a DataFrame

In [2]:
amazon = pd.read_csv('sentimentlabelledsentences/amazon_cells_labelled.txt', sep = '\t', names=['Sentences','Scores'])
imdb = pd.read_csv('sentimentlabelledsentences/imdb_labelled.txt', sep = r'\s\s\t', names=['Sentences','Scores'], engine='python')
yelp = pd.read_csv('sentimentlabelledsentences/yelp_labelled.txt', sep = '\t', names=['Sentences','Scores'])
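
The snippet below is a minimal sketch of the same tab-separated read on an in-memory sample rather than the real files. The two sample lines are made up for illustration, and passing `quoting=csv.QUOTE_NONE` is an assumption on my part: it tells pandas to treat quote characters inside a sentence as ordinary text, which may or may not be needed for this dataset.

```python
import csv
import io

import pandas as pd

# Two made-up lines in the same "sentence<TAB>score" layout as the dataset files.
raw = 'He said "wow" and left.\t1\nNot worth the price.\t0\n'

toy = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    names=["Sentences", "Scores"],
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
)
print(toy)
```

Comparing `len()` of the DataFrame with the number of lines in the file is a quick way to check that no rows were merged during parsing.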

Adding a **Source** column so that each row records which site (Amazon, IMDb or Yelp) it came from.

In [3]:
amazon['Source'] = "amazon"
imdb['Source'] = "imdb"
yelp['Source'] = "yelp"

Concatenating three DataFrames into a large one.

In [4]:
frame = [amazon, imdb, yelp]
Reviews = pd.concat(frame, axis=0)
len(Reviews)

3000

In [5]:
#re-indexing the large DataFrame
Reviews.index=range(0,3000)
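
As a side note, the tagging, concatenation and re-indexing steps can be sketched in one go with `ignore_index=True`, which avoids hard-coding the row count. The toy frames and the `site_a` / `site_b` labels below are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames standing in for the per-site DataFrames.
a = pd.DataFrame({"Sentences": ["good", "bad"], "Scores": [1, 0]})
b = pd.DataFrame({"Sentences": ["fine"], "Scores": [1]})

# Tag each frame with its origin before stacking them.
a["Source"] = "site_a"
b["Source"] = "site_b"

# ignore_index=True renumbers the rows 0..n-1 in the combined frame.
combined = pd.concat([a, b], axis=0, ignore_index=True)
print(combined)
```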

In [6]:
type(Reviews)

pandas.core.frame.DataFrame

Let's see what our table looks like!

In [7]:
#First five rows of the table
Reviews.head()

Unnamed: 0,Sentences,Scores,Source
0,So there is no way for me to plug it in here i...,0,amazon
1,"Good case, Excellent value.",1,amazon
2,Great for the jawbone.,1,amazon
3,Tied to charger for conversations lasting more...,0,amazon
4,The mic is great.,1,amazon


In [8]:
#Last five rows of the table
Reviews.tail()

Unnamed: 0,Sentences,Scores,Source
2995,I think food should have flavor and texture an...,0,yelp
2996,Appetite instantly gone.,0,yelp
2997,Overall I was not impressed and would not go b...,0,yelp
2998,"The whole experience was underwhelming, and I ...",0,yelp
2999,"Then, as if I hadn't wasted enough of my life ...",0,yelp


In [9]:
#Some rows from the middle of the table
Reviews[1555:1560]

Unnamed: 0,Sentences,Scores,Source
1555,I thought it was bad.,0,imdb
1556,"Both films are terrible, but to the credit of ...",0,imdb
1557,"Let's start with all the problemsthe acting, ...",0,imdb
1558,The script is a big flawed mess.,0,imdb
1559,The best example of how dumb the writing is wh...,0,imdb


In [10]:
#This is what the whole data table looks like.
Reviews

Unnamed: 0,Sentences,Scores,Source
0,So there is no way for me to plug it in here i...,0,amazon
1,"Good case, Excellent value.",1,amazon
2,Great for the jawbone.,1,amazon
3,Tied to charger for conversations lasting more...,0,amazon
4,The mic is great.,1,amazon
5,I have to jiggle the plug to get it to line up...,0,amazon
6,If you have several dozen or several hundred c...,0,amazon
7,If you are Razr owner...you must have this!,1,amazon
8,"Needless to say, I wasted my money.",0,amazon
9,What a waste of money and time!.,0,amazon


## Familiarizing with the data

- A query for the first 10 rows. We are keeping only the Sentences and Scores columns.

In [11]:
Reviews[0:10][["Sentences","Scores"]]

Unnamed: 0,Sentences,Scores
0,So there is no way for me to plug it in here i...,0
1,"Good case, Excellent value.",1
2,Great for the jawbone.,1
3,Tied to charger for conversations lasting more...,0
4,The mic is great.,1
5,I have to jiggle the plug to get it to line up...,0
6,If you have several dozen or several hundred c...,0
7,If you are Razr owner...you must have this!,1
8,"Needless to say, I wasted my money.",0
9,What a waste of money and time!.,0


- The last 15 rows of the "Sentences" and "Scores" columns.

In [12]:
Reviews[-15:][["Sentences","Scores"]]

Unnamed: 0,Sentences,Scores
2985,The problem I have is that they charge $11.99 ...,0
2986,Shrimp- When I unwrapped it (I live only 1/2 a...,0
2987,"It lacked flavor, seemed undercooked, and dry.",0
2988,It really is impressive that the place hasn't ...,0
2989,I would avoid this place if you are staying in...,0
2990,The refried beans that came with my meal were ...,0
2991,Spend your money and time some place else.,0
2992,A lady at the table next to us found a live gr...,0
2993,the presentation of the food was awful.,0
2994,I can't tell you how disappointed I was.,0


- Query every 300th record of the table.

In [13]:
Reviews.iloc[::300, :][0:10]

Unnamed: 0,Sentences,Scores,Source
0,So there is no way for me to plug it in here i...,0,amazon
300,Sending it back.,0,amazon
600,Their Research and Development division obviou...,1,amazon
900,"This was utterly confusing at first, which cau...",0,amazon
1200,This is definitely one of the bad ones.,0,imdb
1500,The entire audience applauded at the conclusio...,1,imdb
1800,"In fact, this stinker smells like a direct-to-...",0,imdb
2100,Our server was fantastic and when he found out...,1,yelp
2400,This one is simply a disappointment.,0,yelp
2700,The chips that came out were dripping with gre...,0,yelp


- Query only particular indices.

In [14]:
Reviews.iloc[[990, 1990, 2990]]

Unnamed: 0,Sentences,Scores,Source
990,I'm really disappointed all I have now is a ch...,0,amazon
1990,"The opening sequence of this gem is a classic,...",1,imdb
2990,The refried beans that came with my meal were ...,0,yelp
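
For reference, the three query styles used in this section (a row slice with selected columns, a stepped slice, and a list of positions) can be sketched on a small hypothetical frame:

```python
import pandas as pd

# Hypothetical 10-row frame used only to illustrate positional queries.
toy = pd.DataFrame({
    "Sentences": [f"sentence {i}" for i in range(10)],
    "Scores": [i % 2 for i in range(10)],
})

first_three = toy.iloc[0:3][["Sentences", "Scores"]]  # row slice, then column selection
every_fourth = toy.iloc[::4]                          # every 4th row: positions 0, 4, 8
picked = toy.iloc[[1, 5, 9]]                          # explicit list of positions

print(every_fourth)
print(picked)
```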


## Missing Values

In [15]:
Reviews.isnull()

Unnamed: 0,Sentences,Scores,Source
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
5,False,False,False
6,False,False,False
7,False,False,False
8,False,False,False
9,False,False,False


In [16]:
Reviews.isnull().apply(lambda x: dmh.check_missing_values(x))

Sentences    (The amount of missing records is: , 0)
Scores       (The amount of missing records is: , 0)
Source       (The amount of missing records is: , 0)
dtype: object

- Different ways to check for the missing values in each column

In [17]:
Reviews.isnull().any()
#np.logical_not(Reviews.isnull()).sum()

Sentences    False
Scores       False
Source       False
dtype: bool

In [18]:
Reviews.isnull().sum(axis=0)

Sentences    0
Scores       0
Source       0
dtype: int64

- A simple way to count null values in each row

In [19]:
Reviews.isnull().sum(axis=1)

0       0
1       0
2       0
3       0
4       0
5       0
6       0
7       0
8       0
9       0
10      0
11      0
12      0
13      0
14      0
15      0
16      0
17      0
18      0
19      0
20      0
21      0
22      0
23      0
24      0
25      0
26      0
27      0
28      0
29      0
       ..
2970    0
2971    0
2972    0
2973    0
2974    0
2975    0
2976    0
2977    0
2978    0
2979    0
2980    0
2981    0
2982    0
2983    0
2984    0
2985    0
2986    0
2987    0
2988    0
2989    0
2990    0
2991    0
2992    0
2993    0
2994    0
2995    0
2996    0
2997    0
2998    0
2999    0
Length: 3000, dtype: int64

In [20]:
Reviews.isnull().sum(axis=1).any()

False
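
Since the real table has no missing values, the checks above all return zeros. A small sketch on a hypothetical frame that does contain gaps shows what the same calls report:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing sentence and one missing score.
toy = pd.DataFrame({
    "Sentences": ["a", "b", None],
    "Scores": [1, np.nan, 0],
})

per_column = toy.isnull().sum(axis=0)     # missing values in each column
per_row = toy.isnull().sum(axis=1)        # missing values in each row
any_missing = toy.isnull().values.any()   # True if anything at all is missing

print(per_column)
print(per_row)
print(any_missing)
```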

Inserting some dummy data into the DataFrame.

In [21]:
dummySeries = pd.Series(["dummyRecord", 1], index=["Sentences", "Scores"])

In [22]:
dummySeries

Sentences    dummyRecord
Scores                 1
dtype: object

In [23]:
result_with_series = Reviews.append(dummySeries, ignore_index=True)

In [24]:
# check if the record was committed into the result
len(result_with_series)

3001

In [25]:
result_with_series.isnull().apply(lambda x: dmh.check_missing_values(x))

Sentences    (The amount of missing records is: , 0)
Scores       (The amount of missing records is: , 0)
Source       (The amount of missing records is: , 1)
dtype: object

In [26]:
dummyDictionary = [{'Sentences': 'dummyRecord', 'Scores': 1}]

In [27]:
Reviews = Reviews.append(dummyDictionary, ignore_index = True)

In [28]:
len(Reviews)

3001

In [29]:
Reviews.isnull().apply(lambda x: dmh.check_missing_values(x))

Sentences    (The amount of missing records is: , 0)
Scores       (The amount of missing records is: , 0)
Source       (The amount of missing records is: , 1)
dtype: object

In [30]:
Reviews.dropna(inplace=True)

In [31]:
Reviews.isnull().apply(lambda x: dmh.check_missing_values(x))

Sentences    (The amount of missing records is: , 0)
Scores       (The amount of missing records is: , 0)
Source       (The amount of missing records is: , 0)
dtype: object

So, the dummy row was deleted, because its missing Source was the only null value in the table!

In [32]:
len(Reviews)

3000
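
`DataFrame.append` was removed in pandas 2.0, so the cells above only run on older versions. A minimal sketch of the same insert-then-drop round trip with `pd.concat` follows. The toy frame is hypothetical, and this is one possible equivalent rather than the approach used in the lab.

```python
import pandas as pd

# Hypothetical stand-in for the Reviews table.
reviews = pd.DataFrame({
    "Sentences": ["good", "bad"],
    "Scores": [1, 0],
    "Source": ["site_a", "site_b"],
})

# The dummy record has no Source, so that cell becomes NaN after concatenation.
dummy = pd.DataFrame([{"Sentences": "dummyRecord", "Scores": 1}])
with_dummy = pd.concat([reviews, dummy], ignore_index=True)

missing_source = int(with_dummy["Source"].isnull().sum())
cleaned = with_dummy.dropna()  # drops every row that has at least one NaN

print(missing_source, len(with_dummy), len(cleaned))
```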

## Dealing with Duplicate Data

In [33]:
#Checking if there is any duplicate data
Reviews.duplicated().any()

True

In [34]:
sum(Reviews.duplicated())

17

**So we have 17 duplicated rows; let's list them.**

In [35]:
duplicate = Reviews[Reviews.duplicated(keep=False)]
print(duplicate)

                                              Sentences  Scores  Source
18                                        Works great!.       1  amazon
179   If you like a loud buzzing to override all you...       0  amazon
180                             Don't buy this product.       0  amazon
187                                       Great phone!.       1  amazon
262                                        Works great.       1  amazon
285                                       Great phone!.       1  amazon
290                                        Great Phone.       1  amazon
392                               This is a great deal.       1  amazon
402                    Excellent product for the price.       1  amazon
407                                        Works great.       1  amazon
446                                       Does not fit.       0  amazon
524                                       Works great!.       1  amazon
543                             Don't buy this product.       0 

- Now we can see the 17 unique sentences that have been repeated!

In [36]:
for x in duplicate['Sentences'].unique():
    print(x)

Works great!.
If you like a loud buzzing to override all your conversations, then this phone is for you!
Don't buy this product.
Great phone!.
Works great.
Great Phone.
This is a great deal.
Excellent product for the price.
Does not fit.
Great phone.
Definitely worth checking out.
10/10
Not recommended.
I love this place.
I won't be back.
The food was terrible.
I would not recommend this place.


- We want to delete duplicate sentences!

In [37]:
Reviews.drop_duplicates(keep=False, inplace=True)
len(Reviews)

2966

We can conclude that 17 distinct sentences were each repeated once, giving 34 duplicated rows in total. Because `keep=False` drops every copy instead of keeping the first one, all 34 rows were removed, leaving 2966.
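
The effect of the `keep` argument can be sketched on a small hypothetical frame. `keep=False` removes every copy of a repeated row, while `keep="first"` retains one copy of each.

```python
import pandas as pd

# Hypothetical frame in which one row appears twice.
toy = pd.DataFrame({
    "Sentences": ["Works great.", "Does not fit.", "Works great.", "Nice."],
    "Scores": [1, 0, 1, 1],
})

n_repeats = int(toy.duplicated().sum())                # later copies only
n_all_copies = int(toy.duplicated(keep=False).sum())   # every row that has a twin

keep_none = toy.drop_duplicates(keep=False)     # both copies removed
keep_first = toy.drop_duplicates(keep="first")  # one copy retained

print(n_repeats, n_all_copies, len(keep_none), len(keep_first))
```

Which option is appropriate depends on whether a repeated sentence should still count once in the analysis.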

## Data Preprocessing

In [38]:
Reviews_sample = Reviews.sample(n=1000)

In [39]:
len(Reviews_sample)

1000
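
`sample` draws different rows on every run unless a seed is fixed. The sketch below shows a reproducible draw on a hypothetical frame; the seed value 42 is arbitrary.

```python
import pandas as pd

# Hypothetical frame of 100 alternating labels.
toy = pd.DataFrame({"Scores": [0, 1] * 50})

# The same random_state yields the same rows on every call.
s1 = toy.sample(n=10, random_state=42)
s2 = toy.sample(n=10, random_state=42)

print(s1.index.equals(s2.index))
```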

In [40]:
categories = ['Positive', 'Negative']

In [41]:
Reviews_category_counts = ta.get_tokens_and_frequency(list(Reviews['Sentences']))
Reviews_sample_category_counts = ta.get_tokens_and_frequency(list(Reviews_sample['Sentences']))

In [42]:
py.iplot(ta.plot_word_frequency(Reviews_category_counts, "Category distribution"))
py.iplot(ta.plot_word_frequency(Reviews_sample_category_counts, "Category distribution"))

In [43]:
Reviews_category_counts = ta.get_tokens_and_frequency(list(Reviews['Scores']))
Reviews_sample_category_counts = ta.get_tokens_and_frequency(list(Reviews_sample['Scores']))

In [44]:
py.iplot(ta.plot_word_frequency(Reviews_category_counts, "Category distribution"))
py.iplot(ta.plot_word_frequency(Reviews_sample_category_counts, "Category distribution"))

In [45]:
# takes a minute or two to process
Reviews['unigrams'] = Reviews['Sentences'].apply(lambda x: dmh.tokenize_text(x))

In [46]:
Reviews[0:4]

Unnamed: 0,Sentences,Scores,Source,unigrams
0,So there is no way for me to plug it in here i...,0,amazon,"[So, there, is, no, way, for, me, to, plug, it..."
1,"Good case, Excellent value.",1,amazon,"[Good, case, ,, Excellent, value, .]"
2,Great for the jawbone.,1,amazon,"[Great, for, the, jawbone, .]"
3,Tied to charger for conversations lasting more...,0,amazon,"[Tied, to, charger, for, conversations, lastin..."


In [47]:
list(Reviews[0:1]['unigrams'])

[['So',
  'there',
  'is',
  'no',
  'way',
  'for',
  'me',
  'to',
  'plug',
  'it',
  'in',
  'here',
  'in',
  'the',
  'US',
  'unless',
  'I',
  'go',
  'by',
  'a',
  'converter',
  '.']]

In [48]:
count_vect = CountVectorizer()
Reviews_counts = count_vect.fit_transform(Reviews.Sentences)

In [49]:
analyze = count_vect.build_analyzer()
analyze(" ".join(list(Reviews[4:5].Sentences)))

['the', 'mic', 'is', 'great']

In [50]:
# We can check the shape of this matrix by:
Reviews_counts.shape

(2966, 5153)

In [51]:
# We can obtain the feature names of the vectorizer, i.e., the terms
count_vect.get_feature_names()[0:10]

['00', '10', '100', '11', '12', '13', '15', '15g', '15pm', '17']

In [52]:
Reviews[0:5]

Unnamed: 0,Sentences,Scores,Source,unigrams
0,So there is no way for me to plug it in here i...,0,amazon,"[So, there, is, no, way, for, me, to, plug, it..."
1,"Good case, Excellent value.",1,amazon,"[Good, case, ,, Excellent, value, .]"
2,Great for the jawbone.,1,amazon,"[Great, for, the, jawbone, .]"
3,Tied to charger for conversations lasting more...,0,amazon,"[Tied, to, charger, for, conversations, lastin..."
4,The mic is great.,1,amazon,"[The, mic, is, great, .]"


In [53]:
Reviews_counts[0:5].toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [54]:
count_vect.transform(['Something completely new.']).toarray()

array([[0, 0, 0, ..., 0, 0, 0]])

In [55]:
count_vect.transform(['00 Something completely new.']).toarray()

array([[1, 0, 0, ..., 0, 0, 0]])
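
To make the term-document matrix easier to read, here is a minimal sketch on a two-sentence hypothetical corpus. It reads the vocabulary from the fitted `vocabulary_` mapping, because `get_feature_names()` was replaced by `get_feature_names_out()` in recent scikit-learn releases and may not exist in your installed version.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus.
corpus = ["The mic is great.", "Great phone, great price."]

vect = CountVectorizer()  # lowercases and keeps tokens of two or more characters
counts = vect.fit_transform(corpus)

# Terms ordered by their column index in the matrix.
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(vocab)
print(counts.toarray())

# A sentence made only of unseen words maps to an all-zero row.
unseen = vect.transform(["Something completely new."]).toarray()
print(unseen)
```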

In [56]:
plot_x = ["term_"+str(i) for i in count_vect.get_feature_names()[0:20]]

In [57]:
plot_x

['term_00',
 'term_10',
 'term_100',
 'term_11',
 'term_12',
 'term_13',
 'term_15',
 'term_15g',
 'term_15pm',
 'term_17',
 'term_18',
 'term_18th',
 'term_1928',
 'term_1947',
 'term_1948',
 'term_1949',
 'term_1971',
 'term_1973',
 'term_1979',
 'term_1980']

In [58]:
plot_y = ["doc_"+ str(i) for i in list(Reviews.index)[0:20]]

In [59]:
plot_z = Reviews_counts[0:20, 0:20].toarray()

In [60]:
# to plot
py.iplot(ta.plot_heat_map(plot_x, plot_y, plot_z))

- Plotting the entire term-document matrix.

In [61]:
plot_y = ["doc_"+ str(i) for i in list(Reviews.index)[0:2966]]

In [62]:
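# NOTE: this slice has shape (2866, 2966), which does not match plot_x (20 terms) or plot_y (2966 documents)
plot_z = Reviews_counts[100:2966, 0:2966].toarray()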

In [63]:
# to plot
py.iplot(ta.plot_heat_map(plot_x, plot_y, plot_z))

The draw time for this plot will be slow for all clients (Plotly warning: Estimated Draw Time Too Long).

## Dimensionality Reduction

In [64]:
Reviews['Scores'].replace([0], 'Negative',inplace=True)
Reviews['Scores'].replace([1], 'Positive',inplace=True)

In [65]:
from sklearn.decomposition import PCA

In [66]:
Reviews_reduced = PCA(n_components=3).fit_transform(Reviews_counts.toarray())

In [67]:
Reviews_reduced.shape

(2966, 3)
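
Calling `.toarray()` on the full count matrix builds a dense 2966 x 5153 array. As a hedged alternative, the sketch below uses `TruncatedSVD`, which accepts sparse input directly. This is a swapped-in technique, not the PCA used above: `TruncatedSVD` does not center the data, so the components will differ. The random toy matrix is hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse count matrix: 30 documents by 12 terms.
rng = np.random.default_rng(0)
sparse_counts = csr_matrix(rng.integers(0, 3, size=(30, 12)))

# Reduce to 3 components without converting to a dense array.
svd = TruncatedSVD(n_components=3, random_state=0)
reduced = svd.fit_transform(sparse_counts)

print(reduced.shape)
```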

In [68]:
categories

['Positive', 'Negative']

In [69]:
trace1 = ta.get_trace(Reviews_reduced, Reviews["Scores"], "Positive", "rgb(71,233,163)")
trace2 = ta.get_trace(Reviews_reduced, Reviews["Scores"], "Negative", "rgb(52,133,252)")

In [70]:
data = [trace1, trace2]

In [71]:
layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0
    )
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='simple-3d-scatter')

## Attribute Transformation / Aggregation

In [72]:
# note this takes time to compute. You may want to reduce the amount of terms you want to compute frequencies for
term_frequencies = []
for j in range(0,Reviews_counts.shape[1]):
    term_frequencies.append(sum(Reviews_counts[:,j].toarray()))

In [73]:
term_frequencies[0]

array([1])
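
The column-by-column loop can be slow on a wide matrix. A minimal sketch of a vectorized alternative is to sum the sparse matrix along axis 0 in a single call. The small matrix below is hypothetical, and the result is a flat array of integers rather than a list of one-element arrays.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3-document by 3-term count matrix.
counts = csr_matrix(np.array([
    [1, 0, 2],
    [0, 0, 1],
    [3, 0, 0],
]))

# Sum each column at once, then flatten the result to a 1-D array.
term_totals = np.asarray(counts.sum(axis=0)).ravel()
print(term_totals)
```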

In [74]:
py.iplot(ta.plot_word_frequency([count_vect.get_feature_names(), term_frequencies], "Term Frequency Distribution"))

# Exercise

In [75]:
term_frequencies_log = [math.log(i) for i in term_frequencies]

In [76]:
py.iplot(ta.plot_word_frequency([count_vect.get_feature_names(), term_frequencies_log], "Term Frequency Distribution"))
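
The log transform compresses the long tail of very frequent terms so that rarer terms remain visible on the same axis. A small sketch with hypothetical totals, using `np.log` on the whole array at once:

```python
import numpy as np

# Hypothetical term totals spanning several orders of magnitude.
term_totals = np.array([1, 10, 100, 1000])

# Natural log; a term that occurs once maps to 0.
log_totals = np.log(term_totals)
print(log_totals)
```

A total of zero would map to negative infinity, but every term in a vocabulary fitted on the same corpus should occur at least once.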

## Discretization and Binarization

In [77]:
categories

['Positive', 'Negative']

In [78]:
Reviews['Scores'].replace([0], 'Negative',inplace=True)
Reviews['Scores'].replace([1], 'Positive',inplace=True)

In [79]:
Reviews.head()

Unnamed: 0,Sentences,Scores,Source,unigrams
0,So there is no way for me to plug it in here i...,Negative,amazon,"[So, there, is, no, way, for, me, to, plug, it..."
1,"Good case, Excellent value.",Positive,amazon,"[Good, case, ,, Excellent, value, .]"
2,Great for the jawbone.,Positive,amazon,"[Great, for, the, jawbone, .]"
3,Tied to charger for conversations lasting more...,Negative,amazon,"[Tied, to, charger, for, conversations, lastin..."
4,The mic is great.,Positive,amazon,"[The, mic, is, great, .]"


In [80]:
from sklearn import preprocessing, metrics, decomposition, pipeline, dummy

In [81]:
mlb = preprocessing.LabelBinarizer()

In [82]:
mlb.fit(Reviews.Scores)

LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)

In [83]:
mlb.classes_

array(['Negative', 'Positive'],
      dtype='<U8')

In [84]:
Reviews['bin_category'] = mlb.transform(Reviews['Scores']).tolist()

In [85]:
Reviews[0:6]

Unnamed: 0,Sentences,Scores,Source,unigrams,bin_category
0,So there is no way for me to plug it in here i...,Negative,amazon,"[So, there, is, no, way, for, me, to, plug, it...",[0]
1,"Good case, Excellent value.",Positive,amazon,"[Good, case, ,, Excellent, value, .]",[1]
2,Great for the jawbone.,Positive,amazon,"[Great, for, the, jawbone, .]",[1]
3,Tied to charger for conversations lasting more...,Negative,amazon,"[Tied, to, charger, for, conversations, lastin...",[0]
4,The mic is great.,Positive,amazon,"[The, mic, is, great, .]",[1]
5,I have to jiggle the plug to get it to line up...,Negative,amazon,"[I, have, to, jiggle, the, plug, to, get, it, ...",[0]
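
With only two classes, `LabelBinarizer` returns a single column, which is why each `bin_category` cell holds a one-element list. If plain 0/1 integers are preferred, one possible approach is to flatten the result with `ravel()`. The labels below are hypothetical.

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical label column.
labels = ["Negative", "Positive", "Positive", "Negative"]

lb = LabelBinarizer()
encoded = lb.fit_transform(labels)  # column vector, one row per label
flat = encoded.ravel().tolist()     # plain list of 0/1 integers

print(lb.classes_)
print(encoded.shape)
print(flat)
```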
