
#### Imports

In [None]:
import pandas as pd
import numpy as np

!pip install dfply
from dfply import *

Collecting dfply
  Downloading dfply-0.3.3-py3-none-any.whl (612 kB)

In [None]:
import datetime
import warnings  # needed for warnings.filterwarnings below
from tqdm import tqdm
import requests

warnings.filterwarnings("ignore")
pd.options.display.float_format = '{:.2f}'.format
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
# pd.options.display.max_colwidth = 600

import random
random.seed(9)

#### Get Data from Google Drive
- Set path to data in gdrive

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
#@title Set gdrive path
gdrive_path = "/content/gdrive/MyDrive/FRET/"

#### Upload Data
- Labeled Positive Cases
- Unlabeled Cases

Upload Positive Training Examples

In [None]:
positive_cases = pd.read_csv(gdrive_path + "FRET_Positive.csv")
positive_cases.columns = map(str.lower, positive_cases.columns)
print("Labeled Positive Cases: ", positive_cases.shape[0])
positive_cases.head(3)

Labeled Positive Cases:  20


Unnamed: 0,id,text,reportability
0,1,I purchased these candles last month. The smell is nice but the glass tends to melt in the jar,P
1,2,Microwave tends to smoke after 10 mins of use,P
2,3,The blades of this chopper are razor sharp. Should be a warning on the package,P


Upload Unlabeled Cases

In [None]:
test_cases = pd.read_csv(gdrive_path + "FRET_Test.csv") 
test_cases.columns = map(str.lower, test_cases.columns)
print("Unlabeled Cases: ", test_cases.shape[0])
test_cases.head(3)

Unlabeled Cases:  50


Unnamed: 0,id,text
0,21,Customer stated that he purchased a bike with a warranty. The breaks on his bike broke. Ths is highly dangerous.
1,22,Customer purchased a sealed pack of strawberries. Found insects inside after opening the pack. Disgusting and unsanitary.
2,23,I purchased a microwave. It started smoking after a week. We were scared it may catch fire.


#### Build Data for Naive Bayes PU (Level I)

Naive Bayes PU Steps


*   Train an NB classifier treating all labeled positive examples as P and all unlabeled examples as negative (N)
*   Use this model to predict all unlabeled examples
*   Unlabeled cases predicted as N now act as training negatives
*   Build the final NB classifier with the provided positive examples as P and the predicted negative examples as N
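
The four steps above can be sketched end to end on synthetic data. This is a minimal illustration, not the notebook's pipeline: the helper name `two_step_nb_pu`, the two-dimensional Gaussian clusters, and the cluster sizes are all assumptions made for the example.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def two_step_nb_pu(X_pos, X_unl):
    """Hypothetical helper: returns (final_model, reliable_negative_mask)."""
    X_all = np.vstack([X_pos, X_unl])
    # Step 1: treat every unlabeled example as negative and fit a first model.
    y_initial = np.array(["P"] * len(X_pos) + ["N"] * len(X_unl))
    level1 = GaussianNB().fit(X_all, y_initial)
    # Step 2: predict the unlabeled examples with that model.
    unl_pred = level1.predict(X_unl)
    # Step 3: unlabeled examples still predicted "N" become training negatives.
    reliable_neg = unl_pred == "N"
    if not reliable_neg.any():
        raise ValueError("no unlabeled example was predicted negative")
    # Step 4: refit on labeled positives plus the predicted negatives only.
    X_final = np.vstack([X_pos, X_unl[reliable_neg]])
    y_final = np.array(["P"] * len(X_pos) + ["N"] * int(reliable_neg.sum()))
    level2 = GaussianNB().fit(X_final, y_final)
    return level2, reliable_neg

# Synthetic stand-in for the embeddings (assumption): two well-separated clusters.
rng = np.random.default_rng(9)
X_pos = rng.normal(loc=2.0, scale=0.5, size=(20, 2))
X_unl = np.vstack([
    rng.normal(loc=2.0, scale=0.5, size=(10, 2)),   # hidden positives
    rng.normal(loc=-2.0, scale=0.5, size=(40, 2)),  # hidden negatives
])
model, reliable_neg = two_step_nb_pu(X_pos, X_unl)
final_pred = model.predict(X_unl)
```

Unlabeled examples that the first model predicts as P are simply left out of the second training set, which is the same choice the Level II cells make.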







In [None]:
nbpu_data = pd.concat([positive_cases, test_cases >> select(X.id, X.text)], axis = 0)
nbpu_data.reset_index(drop = True, inplace = True)
print("All Examples: ", nbpu_data.shape[0])
print("Positive Examples: ", (nbpu_data >> mask(X.reportability == "P")).shape[0])
print("Unlabeled Examples: ", (nbpu_data >> mask(X.reportability.isna())).shape[0])

All Examples:  70
Positive Examples:  20
Unlabeled Examples:  50


In [None]:
nbpu_data.head(24)

Mark all unlabeled cases as negative

In [None]:
nbpu_data["reportability_initial"] = np.where(nbpu_data.reportability.isna(), "N", nbpu_data.reportability)
nbpu_data["reportability_initial"].value_counts()

N    50
P    20
Name: reportability_initial, dtype: int64

#### Text Cleaning + Preprocessing



1.   Remove punctuation
2.   Remove numbers
3.   Make everything lowercase
4.   Remove stopwords
5.   Lemmatize
6.   Get GloVe word vectors
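
As a rough sketch, steps 1 to 4 can be chained in a single standard-library function. The tiny `STOPWORDS` set and the function name `clean_text` are assumptions for illustration only; the notebook itself uses NLTK's English stopword list, and lemmatization and the GloVe lookup (steps 5 and 6) are left out here.

```python
import re

# Tiny illustrative stopword list (assumption); NLTK's list is much longer.
STOPWORDS = {"i", "the", "is", "but", "to", "in", "of", "after", "these", "a"}

def clean_text(text, stopwords=STOPWORDS):
    """Hypothetical helper covering steps 1-4 of the list above."""
    text = re.sub(r"[^\w\s]", " ", text)  # 1. punctuation becomes a space
    text = re.sub(r"\d+", "", text)       # 2. drop digits
    tokens = text.lower().split()         # 3. lowercase, then split on whitespace
    # 4. keep only tokens that are not stopwords
    return " ".join(t for t in tokens if t not in stopwords)

cleaned = clean_text("Microwave tends to smoke after 10 mins of use")
print(cleaned)
```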






In [None]:
import nltk
# nltk.download('all')
# nltk.download('stopwords')
# nltk.download('wordnet')
import string # for list of all punctuations
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import re # importing Regex

In [None]:
nbpu_data >>= mutate(clean_text = X.text.str.replace(r'[^\w\s]', ' ', regex=True)) # Remove Punctuations (regex=True is required in pandas >= 2.0)
nbpu_data >>= mutate(clean_text = X.clean_text.str.replace(r'\d+', '', regex=True)) # Remove Numbers
nbpu_data >>= mutate(clean_text = X.clean_text.str.lower()) # Lowercase

nbpu_data.head(2)

Unnamed: 0,id,text,reportability,reportability_initial,clean_text
0,1,I purchased these candles last month. The smell is nice but the glass tends to melt in the jar,P,P,i purchased these candles last month the smell is nice but the glass tends to melt in the jar
1,2,Microwave tends to smoke after 10 mins of use,P,P,microwave tends to smoke after mins of use


In [None]:
#@title Remove Stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords
stopwords_set = set(stopwords.words('english'))

# stopwords_set.pop()
# for i, val in enumerate(random.sample(stopwords_set, 10)): 
#   print(val)

import time
tic = time.perf_counter()
nbpu_data['clean_text_wout_stopwords'] = pd.Series([[w for w in text.split() if w.lower() not in stopwords_set]
            for text in nbpu_data['clean_text']]).apply(lambda k : ' '.join(k))
toc = time.perf_counter()
print(f"Completed in : {toc - tic:0.4f} seconds")
 
nbpu_data.head(2)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Completed in : 0.0027 seconds


Unnamed: 0,id,text,reportability,reportability_initial,clean_text,clean_text_wout_stopwords
0,1,I purchased these candles last month. The smell is nice but the glass tends to melt in the jar,P,P,i purchased these candles last month the smell is nice but the glass tends to melt in the jar,purchased candles last month smell nice glass tends melt jar
1,2,Microwave tends to smoke after 10 mins of use,P,P,microwave tends to smoke after mins of use,microwave tends smoke mins use


In [None]:
#@title Lemmatization
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
from nltk.tokenize import sent_tokenize, word_tokenize

# print("tested : ", wordnet_lemmatizer.lemmatize("tested", pos = "v"))
# print("tests : ", wordnet_lemmatizer.lemmatize("tests", pos = "v"), "\n")
# print("drowned : ", wordnet_lemmatizer.lemmatize("drowned", pos = "v"))
# print("drowning : ", wordnet_lemmatizer.lemmatize("drowning", pos = "v"))


import time
tic = time.perf_counter()
nbpu_data['clean_text_lem'] = pd.Series([[wordnet_lemmatizer.lemmatize(w, pos = "v") for w in text.split()] 
                                            for text in nbpu_data['clean_text_wout_stopwords']]).apply(lambda k : ' '.join(k))
toc = time.perf_counter()
print(f"Completed in : {toc - tic:0.4f} seconds")

nbpu_data.head(2)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
Completed in : 1.7183 seconds


Unnamed: 0,id,text,reportability,reportability_initial,clean_text,clean_text_wout_stopwords,clean_text_lem
0,1,I purchased these candles last month. The smell is nice but the glass tends to melt in the jar,P,P,i purchased these candles last month the smell is nice but the glass tends to melt in the jar,purchased candles last month smell nice glass tends melt jar,purchase candle last month smell nice glass tend melt jar
1,2,Microwave tends to smoke after 10 mins of use,P,P,microwave tends to smoke after mins of use,microwave tends smoke mins use,microwave tend smoke mins use


In [None]:
nbpu_data['clean_text'] = nbpu_data['clean_text_lem']
nbpu_data.drop(["clean_text_wout_stopwords", "clean_text_lem"], axis = 1, inplace = True)

#### Numeric Representation of Text

In [None]:
import csv
glove_vectors = pd.read_table(gdrive_path + "glove.6B.50d.txt", sep = " ", index_col=0, header=None, quoting=csv.QUOTE_NONE)
glove_vectors.index.names = ['words']
glove_vectors.shape

(400000, 50)

In [None]:
gloveWord_set = set(glove_vectors.index)

# Keep only the tokens that exist in the GloVe vocabulary
import time
tic = time.perf_counter()
nbpu_data['clean_text_glvwrds'] = pd.Series([[w for w in text.split() if w.lower() in gloveWord_set]
            for text in nbpu_data['clean_text']]).apply(lambda k : ' '.join(k))
toc = time.perf_counter()
print(f"Completed in : {toc - tic:0.4f} seconds")

nbpu_data['clean_text'] = nbpu_data['clean_text_glvwrds']
nbpu_data.drop(["clean_text_glvwrds"], axis = 1, inplace = True)

# After the join, a review with no matching token is an empty string, not NaN
print("How many reviews did not match any word in the loaded GloVe vocabulary?", (nbpu_data['clean_text'].str.strip() == "").sum(), "\n")

nbpu_data.head(1)

Completed in : 0.0026 seconds
How many reviews did not match any word in the loaded GloVe vocabulary? 0 



Unnamed: 0,id,text,reportability,reportability_initial,clean_text
0,1,I purchased these candles last month. The smell is nice but the glass tends to melt in the jar,P,P,purchase candle last month smell nice glass tend melt jar


In [None]:
from itertools import chain

tic = time.perf_counter()

# The 1-based row position doubles as the review id here because ids run 1..N in row order
temp_id = list(chain(*[[c]*len(x.split()) for c, x in enumerate(nbpu_data['clean_text'], 1)]))
temp_words = list(chain(*[x.split() for x in nbpu_data['clean_text']]))
emb_temp = glove_vectors.loc[temp_words]
emb_temp['id'] = temp_id
del temp_id, temp_words
embedding_text = emb_temp.groupby(["id"], as_index = False)[emb_temp.columns[:-1]].mean()

toc = time.perf_counter()
print(f"Completed the Task in {toc - tic:0.4f} seconds")

embedding_text.head()

Completed the Task in 0.1513 seconds


Unnamed: 0,id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50
0,1,0.26,0.3,-0.07,-0.31,0.39,0.01,-0.29,-0.02,-0.13,0.39,-0.04,-0.05,0.26,0.22,0.49,0.56,-0.12,-0.14,-0.39,-0.86,0.38,-0.17,0.26,-0.17,-0.2,-0.8,-0.45,0.78,0.78,-0.18,1.98,0.23,-0.18,0.21,0.02,0.04,0.1,0.26,0.58,-0.24,0.25,0.07,0.01,0.08,0.35,0.59,-0.02,-0.23,-0.1,-0.24
1,2,-0.01,0.18,0.6,-0.37,-0.22,0.19,0.18,-0.36,-0.02,0.29,0.16,-0.16,0.53,0.48,-0.12,0.56,-0.16,0.22,-0.58,-0.92,0.52,0.1,0.82,0.51,0.2,-0.49,-0.16,0.1,1.03,-0.1,1.97,0.51,-0.28,-0.35,0.06,0.18,0.03,0.17,0.36,0.47,0.49,0.27,0.3,0.72,0.24,0.37,0.51,-0.09,-0.1,0.05
2,3,0.15,-0.38,0.64,0.02,-0.27,0.51,-0.01,-0.04,-0.07,-0.31,-0.24,-0.04,-0.16,0.3,-0.29,0.16,-0.16,-0.0,-0.25,-0.9,0.13,-0.44,-0.18,-0.43,0.17,-0.89,-0.04,0.5,0.42,-0.23,1.62,0.27,0.2,0.5,-0.08,0.16,0.0,-0.06,-0.13,-0.33,0.04,0.38,-0.12,-0.2,0.63,-0.12,0.33,0.14,0.01,-0.01
3,4,0.0,-0.1,0.3,-0.19,0.16,0.13,-0.0,0.03,0.07,0.1,-0.03,0.11,-0.46,0.15,0.34,0.7,0.13,-0.41,-0.2,-0.69,0.01,0.17,0.21,-0.28,0.14,-1.08,-0.29,0.41,0.64,0.0,2.07,0.34,-0.1,0.3,0.04,0.1,0.26,0.12,0.32,-0.35,-0.09,0.2,0.02,0.4,0.29,0.02,0.05,-0.29,0.35,-0.23
4,5,0.95,-0.12,-0.75,-0.24,0.69,0.43,-0.36,-0.21,0.18,0.28,-0.26,0.3,1.11,0.04,0.23,-0.04,0.1,0.03,-0.4,-1.07,-0.09,-0.5,1.28,-0.01,0.15,-0.43,-0.07,0.69,0.96,-0.09,2.64,0.1,-0.01,-0.06,0.08,0.44,-0.81,0.29,-0.0,-0.32,-0.23,-0.04,-0.12,0.25,1.35,0.32,0.14,-0.24,0.05,-0.19
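
The per-review averaging can also be sketched on a toy table so the arithmetic is easy to check by hand. The three-dimensional vectors, the words, and the ids below are made up for illustration. Unlike the cell above, this variant carries the real `id` column through instead of a positional counter, which is an alternative design and not what the notebook does.

```python
import pandas as pd

# Toy 3-dimensional embedding table (made-up values standing in for GloVe).
toy_vectors = pd.DataFrame(
    {1: [1.0, 3.0, 0.0], 2: [0.0, 2.0, 4.0], 3: [2.0, 2.0, 2.0]},
    index=pd.Index(["smoke", "fire", "bike"], name="words"),
)
docs = pd.DataFrame(
    {"id": [101, 102], "clean_text": ["smoke fire", "bike unknownword"]}
)

# One row per (id, word), keeping the real id of each review.
tokens = docs.assign(word=docs["clean_text"].str.split()).explode("word")
# Drop words that are missing from the embedding vocabulary.
tokens = tokens[tokens["word"].isin(toy_vectors.index)]

# Look up each word's vector, attach the id, and average per review.
vectors = toy_vectors.loc[tokens["word"].tolist()].reset_index(drop=True)
vectors["id"] = tokens["id"].to_numpy()
doc_embeddings = vectors.groupby("id", as_index=False).mean()
print(doc_embeddings)
```

Review 101 averages `smoke` (1, 0, 2) and `fire` (3, 2, 2) to (2, 1, 2), while review 102 keeps only `bike`.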


#### NBPU Model Level I



In [None]:
tic = time.perf_counter()

from sklearn.naive_bayes import GaussianNB

X_train = embedding_text.loc[:, embedding_text.columns != 'id'].values
y_train = nbpu_data.reportability_initial.values

nbpu_model_level1 = GaussianNB()
nbpu_model_level1.fit(X_train, y_train)

# Predict only the unlabeled rows, which follow the labeled positives in nbpu_data
y_pred = nbpu_model_level1.predict(X_train[len(positive_cases):])

toc = time.perf_counter()
print(f"Completed the Task in {toc - tic:0.4f} seconds")

Completed the Task in 0.0103 seconds


In [None]:
y_pred

array(['N', 'P', 'P', 'P', 'P', 'P', 'N', 'N', 'N', 'P', 'N', 'N', 'N',
       'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N',
       'N', 'P', 'N', 'P', 'N', 'N', 'N', 'N', 'P', 'N', 'N', 'N', 'N',
       'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N'], dtype='<U1')

In [None]:
nbpu_data["reportability_intermediate"] = nbpu_data["reportability_initial"]
nbpu_data.loc[nbpu_data.id.isin(test_cases.id.tolist()), 'reportability_intermediate'] = np.array(y_pred)
nbpu_data >>= select(X.id, X.text, X.reportability, X.reportability_initial, X.reportability_intermediate, everything())
nbpu_data.reset_index(drop = True, inplace = True)

nbpu_data.reportability_intermediate.value_counts()

N    41
P    29
Name: reportability_intermediate, dtype: int64

In [None]:
nbpu_data

Unnamed: 0,id,text,reportability,reportability_initial,reportability_intermediate,clean_text
0,1,I purchased these candles last month. The smell is nice but the glass tends to melt in the jar,P,P,P,purchase candle last month smell nice glass tend melt jar
1,2,Microwave tends to smoke after 10 mins of use,P,P,P,microwave tend smoke mins use
2,3,The blades of this chopper are razor sharp. Should be a warning on the package,P,P,P,blades chopper razor sharp warn package
3,4,The mats are not slip proof. I could have fallen in my tub and hurt myself,P,P,P,mat slip proof could fall tub hurt
4,5,Fruits are not fresh. Sometimes have insects in them,P,P,P,fruit fresh sometimes insects
5,6,My salad was stale and the lettuce had blackened,P,P,P,salad stale lettuce blacken
6,7,The brakes on these bikes are not safe. Do not buy!,P,P,P,brake bike safe buy
7,8,The ingredients on the cream should be written in much bigger font. I am allergic to fragnance,P,P,P,ingredients cream write much bigger font allergic
8,9,The cord of this mixer began to spark after three weeks of use,P,P,P,cord mixer begin spark three weeks use
9,10,"The potholes in front of the store are a health hazard, especially at night",P,P,P,potholes front store health hazard especially night


#### NBPU Model Level II
- Choose Labeled Positives as P for training
- Choose Intermediate Negatives as N for training
- Train an NB classifier
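
When features and labels live in separate frames, selecting each by `id` only works if both end up in the same row order. One way to make that alignment explicit is to join on `id`, as in this sketch; the frames, column names, and values are hypothetical.

```python
import pandas as pd

# Hypothetical feature and label frames keyed by id.
features = pd.DataFrame({"id": [1, 2, 3, 4], "f1": [0.1, 0.2, 0.3, 0.4]})
# id 2 is excluded from training and the rows are deliberately out of order.
labels = pd.DataFrame({"id": [4, 1, 3], "label": ["N", "P", "N"]})

# An inner merge keeps the order of the left frame's keys, so every feature
# row stays paired with its own label; validate guards against duplicate ids.
train = labels.merge(features, on="id", how="inner", validate="one_to_one")
X_train = train[["f1"]].to_numpy()
y_train = train["label"].to_numpy()
print(train)
```

In this notebook the positional approach happens to line up because the labeled positives hold the lowest ids and the negatives follow in id order.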

In [None]:
nbpu_train = (pd.concat([nbpu_data >> mask(X.id.isin(positive_cases.id.tolist())), 
                         nbpu_data >> mask(X.reportability_intermediate == "N")
                         ], axis = 0
                       ).reset_index(drop = True)
             )
print("Training Examples: ", nbpu_train.shape[0])
print(nbpu_train.reportability_intermediate.value_counts())


nbpu_test = (nbpu_data >> mask(X.id.isin(test_cases.id.tolist()))).reset_index(drop = True)


Training Examples:  61
N    41
P    20
Name: reportability_intermediate, dtype: int64


In [None]:
tic = time.perf_counter()

from sklearn.naive_bayes import GaussianNB

X_train = embedding_text.loc[embedding_text.id.isin(nbpu_train.id.tolist()), embedding_text.columns != 'id'].values
y_train = nbpu_train.reportability_intermediate.values

nbpu_model_level2 = GaussianNB()
nbpu_model_level2.fit(X_train, y_train)

y_pred_level2 =  nbpu_model_level2.predict(embedding_text.loc[embedding_text.id.isin(nbpu_test.id.tolist()), embedding_text.columns != 'id'].values)

toc = time.perf_counter()
print(f"Completed the Task in {toc - tic:0.4f} seconds")

Completed the Task in 0.0080 seconds


In [None]:
print(len(y_pred_level2))
y_pred_level2

50


array(['P', 'P', 'P', 'P', 'P', 'P', 'N', 'N', 'N', 'P', 'N', 'N', 'N',
       'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N',
       'N', 'P', 'N', 'P', 'N', 'N', 'N', 'N', 'P', 'N', 'N', 'N', 'N',
       'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N'], dtype='<U1')

In [None]:
test_cases["reportability_final"] = np.array(y_pred_level2)
print(test_cases.reportability_final.value_counts())
test_cases >> mask(X.reportability_final == "P")

N    40
P    10
Name: reportability_final, dtype: int64


Unnamed: 0,id,text,reportability_final
0,21,Customer stated that he purchased a bike with a warranty. The breaks on his bike broke. Ths is highly dangerous.,P
1,22,Customer purchased a sealed pack of strawberries. Found insects inside after opening the pack. Disgusting and unsanitary.,P
2,23,I purchased a microwave. It started smoking after a week. We were scared it may catch fire.,P
3,24,"Customer states the measuring cups in ABC cookware were razor sharp and the customer has cut herself on them several times, and is now afraid to use them.",P
4,25,"I bought a mat for the bottom of my tub, but fell twice in the tub and banged my knee because it slips. It is supposed to stick. I have been severely injured.",P
5,26,The cord of this phone charger is coated with metal and as it was charging the metal part almost caught fire. It ignited in the socket and burned out the outlet in the wall. I caught it in time. Product is unsafe to use.,P
9,30,Ordered food from the deli. The sign stated fresh food. Upon getting home I found the food to be less than satisfactory I was very disappointed because I frequent the deli often The tenders and the wedges were horrible,P
27,48,I purchased the wireless ear buds and the little wire broke off inside the charging case,P
29,50,Produce items are spoiling faster because box cutters are cutting into the packaging,P
34,55,"Currently in my third bottle, does wonder to my skin",P
