# Wikihow Dataset

Authors: David Francisco \<Dfrancisco1998@gmail.com\>

Copyright (C) 2021 David Francisco and DynaGroup i.T. GmbH

In [None]:
# Run this cell only if you are using Google Colab.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Installing dependencies on Google Colab
%pip install nltk > /dev/null
%pip install textstat > /dev/null

In [None]:
import nltk

from nltk.corpus import stopwords
nltk.download('stopwords')

from nltk.tokenize import word_tokenize
nltk.download('punkt')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import re
import string
import math
import spacy
import textstat

from sklearn.feature_extraction.text import CountVectorizer
from textblob import TextBlob

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Loading the dataset and first analysis

In [None]:
!head -5 '/content/drive/My Drive/Paraphrasing API/datasets/wikihow/test.source'

When you are shopping for vintage jewelry, one way to ensure that you are not buying fake vintage, is to ask the seller about the history of a particular item. If the piece is actually vintage the seller should be able to explain how they came across the item. For instance, it may have been passed down through the family, purchased at an estate sale or auction, or found while antique hunting., Most vintage jewelry was marked by the jewelry maker, either with initials or small emblems. Use a magnifying glass to examine the jewelry for marks before purchasing. If you notice any discrepancies between marks then the piece is likely a fake or replica.Search online for pictures of well-known vintage jewelers' marks.<n>There may be instances when some older items of jewelry were not marked. For example, early pieces of Chanel jewelry were unmarked and different markings were used during different periods.If you can’t locate any markings, then ask the seller about the history of the piece.<n> 

Loading the file into a DataFrame

In [None]:
df = pd.DataFrame()
text = []
count = []
tok_sw = []
# Each line of the file is one article; store the text and its word count.
with open('/content/drive/My Drive/Paraphrasing API/datasets/wikihow/test.source', 'r', encoding='utf-8') as file1:
  for line in file1:
    text.append(line)
    count.append(len(line.split()))
    # Optional (disabled): count the tokens that remain after stopword removal
    # text_tokens = word_tokenize(line)
    # tokens_without_sw = [word for word in text_tokens if word not in stopwords.words('english')]
    # tok_sw.append(len(tokens_without_sw))

df['text'] = text
df['word_count'] = count
# df['tok_sw'] = tok_sw
# df['sw'] = df['word_count'] - df['tok_sw']

In [None]:
df

Unnamed: 0,text,word_count
0,"When you are shopping for vintage jewelry, one...",486
1,Being overweight makes your heart work harder ...,297
2,Connect your mobile device to the computer and...,221
3,"If you hang up on a telemarketer right away, y...",242
4,"Before you pay for an online course, you shoul...",1376
...,...,...
5572,If you've already built up experience working ...,379
5573,It is usually shelved alongside other garden f...,853
5574,If you look at music and the different categor...,1122
5575,Pour 2 tablespoons of the olive oil into a ski...,356


### Cleaning

In [None]:
def remove_non_ascii(text): 
    return ''.join(i for i in text if ord(i)<128) 
 
df['text'] = df['text'].apply(remove_non_ascii) 
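
One side effect of the filter above: it also drops typographic apostrophes (such as the one in "can’t" in the sample line shown earlier), so those contractions turn into "cant" before the contraction expansion further down can see them. A minimal sketch, assuming you want to keep them, maps curly quotes to ASCII first. `normalize_quotes` and `QUOTE_MAP` are hypothetical names, not part of the original notebook.

```python
# Hypothetical helper (not in the original notebook): map typographic quotes
# to their ASCII equivalents so that an ASCII-only filter keeps them.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def normalize_quotes(text):
    return text.translate(QUOTE_MAP)

def remove_non_ascii(text):
    # Same logic as the notebook cell above.
    return ''.join(ch for ch in text if ord(ch) < 128)

sample = "If you can\u2019t locate any markings"
print(remove_non_ascii(sample))                    # apostrophe is lost
print(remove_non_ascii(normalize_quotes(sample)))  # apostrophe is kept
```

In the notebook this would mean applying `normalize_quotes` to `df['text']` before `remove_non_ascii`.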

In [None]:
text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)

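# Build the English stopword set once instead of reloading every language per token
stop_words = set(stopwords.words('english'))
tokens_without_sw = [word for word in text_tokens if word not in stop_words]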

print(tokens_without_sw)

['Nick', 'likes', 'play', 'football', ',', 'however', 'fond', 'tennis', '.']


In [None]:
df.isnull().sum()

text          0
word_count    0
dtype: int64

In [None]:
df.dropna(inplace=True)
df.isnull().sum()

text          0
word_count    0
dtype: int64

In [None]:
contractions_dict = { "ain't": "are not","'s":" is","aren't": "are not",
                     "can't": "cannot","can't've": "cannot have",
                     "'cause": "because","could've": "could have","couldn't": "could not",
                     "couldn't've": "could not have", "didn't": "did not","doesn't": "does not",
                     "don't": "do not","hadn't": "had not","hadn't've": "had not have",
                     "hasn't": "has not","haven't": "have not","he'd": "he would",
                     "he'd've": "he would have","he'll": "he will", "he'll've": "he will have",
                     "how'd": "how did","how'd'y": "how do you","how'll": "how will",
                     "I'd": "I would", "I'd've": "I would have","I'll": "I will",
                     "I'll've": "I will have","I'm": "I am","I've": "I have", "isn't": "is not",
                     "it'd": "it would","it'd've": "it would have","it'll": "it will",
                     "it'll've": "it will have", "let's": "let us","ma'am": "madam",
                     "mayn't": "may not","might've": "might have","mightn't": "might not", 
                     "mightn't've": "might not have","must've": "must have","mustn't": "must not",
                     "mustn't've": "must not have", "needn't": "need not",
                     "needn't've": "need not have","o'clock": "of the clock","oughtn't": "ought not",
                     "oughtn't've": "ought not have","shan't": "shall not","sha'n't": "shall not",
                     "shan't've": "shall not have","she'd": "she would","she'd've": "she would have",
                     "she'll": "she will", "she'll've": "she will have","should've": "should have",
                     "shouldn't": "should not", "shouldn't've": "should not have","so've": "so have",
                     "that'd": "that would","that'd've": "that would have", "there'd": "there would",
                     "there'd've": "there would have", "they'd": "they would",
                     "they'd've": "they would have","they'll": "they will",
                     "they'll've": "they will have", "they're": "they are","they've": "they have",
                     "to've": "to have","wasn't": "was not","we'd": "we would",
                     "we'd've": "we would have","we'll": "we will","we'll've": "we will have",
                     "we're": "we are","we've": "we have", "weren't": "were not","what'll": "what will",
                     "what'll've": "what will have","what're": "what are", "what've": "what have",
                     "when've": "when have","where'd": "where did", "where've": "where have",
                     "who'll": "who will","who'll've": "who will have","who've": "who have",
                     "why've": "why have","will've": "will have","won't": "will not",
                     "won't've": "will not have", "would've": "would have","wouldn't": "would not",
                     "wouldn't've": "would not have","y'all": "you all", "y'all'd": "you all would",
                     "y'all'd've": "you all would have","y'all're": "you all are",
                     "y'all've": "you all have", "you'd": "you would","you'd've": "you would have",
                     "you'll": "you will","you'll've": "you will have", "you're": "you are",
                     "you've": "you have"}

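# Regular expression for finding contractions.
# Keys are escaped and sorted longest first, so that e.g. "can't've" is tried before "can't".
contractions_re=re.compile('(%s)' % '|'.join(re.escape(k) for k in sorted(contractions_dict, key=len, reverse=True)))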

# Function for expanding contractions
def expand_contractions(text,contractions_dict=contractions_dict):
  def replace(match):
    return contractions_dict[match.group(0)]
  return contractions_re.sub(replace, text)

# Expanding contractions in the articles
df['text']=df['text'].apply(expand_contractions)

df['cleaned']=df['text'].apply(lambda x: x.lower())

In [None]:
# Removing words that contain digits (raw string avoids an invalid-escape warning)
df['cleaned']=df['cleaned'].apply(lambda x: re.sub(r'\w*\d\w*','', x))

In [None]:
# Removing punctuation
df['cleaned']=df['cleaned'].apply(lambda x: re.sub('[%s]' % re.escape(string.punctuation), '', x))

In [None]:
# Removing extra spaces
df['cleaned']=df['cleaned'].apply(lambda x: re.sub(' +',' ',x))

In [None]:
# Loading model
nlp = spacy.load('en_core_web_sm',disable=['parser', 'ner'])

# Lemmatization with stopwords removal
df['lemmatized']=df['cleaned'].apply(lambda x: ' '.join(token.lemma_ for token in nlp(x) if not token.is_stop))

### Some exploratory analysis

In [None]:
docs = list(df['lemmatized'].values)
vec = CountVectorizer()
X = vec.fit_transform(docs)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_doc = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
print(df)

                                                   text  ...                                         lemmatized
0     When you are shopping for vintage jewelry, one...  ...  shop vintage jewelry way ensure buy fake vinta...
1     Being overweight makes your heart work harder ...  ...  overweight make heart work harder pump blood b...
2     Connect your mobile device to the computer and...  ...  connect mobile device computer purchase cd lau...
3     If you hang up on a telemarketer right away, y...  ...  hang telemarketer right away will probably pla...
4     Before you pay for an online course, you shoul...  ...  pay online course realistic ability manage tim...
...                                                 ...  ...                                                ...
5572  If you have already built up experience workin...  ...  build experience work doctor office hospital u...
5573  It is usually shelved alongside other garden f...  ...  usually shelve alongside garden fertilizer



In [None]:
df_doc.sum()

aa                  7
aaa                 8
aaaaaah             1
aaaaahhhh           1
aaah                1
                   ..
zurlo               1
zweilous            1
zyflonlongacting    1
zyrtec              2
zzzs                1
Length: 86094, dtype: int64

In [None]:
df['polarity']=df['lemmatized'].apply(lambda x:TextBlob(x).sentiment.polarity)

In [None]:
print("3 Articles with Highest Polarity:")
df.loc[df['polarity'].sort_values(ascending=False).index[:3]]

3 Articles with Highest Polarity:


Unnamed: 0,text,word_count,cleaned,lemmatized,polarity
3107,They can penalize you for not addressing them ...,148,they can penalize you for not addressing them ...,penalize address chairman say thank receive ca...,0.8
3026,While Viber will allow you to call and text ov...,97,while viber will allow you to call and text ov...,viber allow text wifi require receive end vibe...,0.8
3065,"You only have to be level 2, then she will tel...",75,you only have to be level then she will tell y...,level tell cavenher follower protect shrine tu...,0.7


In [None]:
df['dale_chall_score']=df['text'].apply(textstat.dale_chall_readability_score)
df['flesch_reading_ease']=df['text'].apply(textstat.flesch_reading_ease)
df['gunning_fog']=df['text'].apply(textstat.gunning_fog)

print('Mean Dale-Chall score of the articles =>',df['dale_chall_score'].mean())

print('Mean Flesch Reading Ease of the articles =>',df['flesch_reading_ease'].mean())

print('Mean Gunning Fog index of the articles =>',df['gunning_fog'].mean())

Mean Dale-Chall score of the articles => 8.33775327236865
Mean Flesch Reading Ease of the articles => 54.863150439304114
Mean Gunning Fog index of the articles => 15.334755244755303


In [None]:
df['text_standard']=df['text'].apply(lambda x: textstat.text_standard(x))

print('Text Standard',df['text_standard'].mode())

Text Standard 0    7th and 8th grade
dtype: object


### Evaluation 

In general:

BLEU measures precision: how many of the words (and/or n-grams) in the machine-generated summaries also appear in the human reference summaries.

ROUGE measures recall: how many of the words (and/or n-grams) in the human reference summaries also appear in the machine-generated summaries.

Naturally, these results are complementary, as is often the case with precision and recall. If many words/n-grams from the system output appear in the human references, BLEU will be high; if many words/n-grams from the human references appear in the system output, ROUGE will be high.
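
To make the precision/recall distinction concrete, here is a minimal sketch of clipped n-gram precision (the BLEU-style direction) and n-gram recall (the ROUGE-style direction) for a single reference. It is a simplification under stated assumptions: whitespace tokenization, one reference, one n-gram order. The function names are illustrative and do not come from any library.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap(candidate, reference, n=1):
    """Clipped overlap: each n-gram counts at most as often as it occurs in both."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    return sum((cand & ref).values())

def ngram_precision(candidate, reference, n=1):
    """Share of candidate n-grams that also occur in the reference."""
    total = max(len(candidate) - n + 1, 0)
    return overlap(candidate, reference, n) / total if total else 0.0

def ngram_recall(candidate, reference, n=1):
    """Share of reference n-grams that also occur in the candidate."""
    total = max(len(reference) - n + 1, 0)
    return overlap(candidate, reference, n) / total if total else 0.0

reference = "the cat sat on the mat".split()
candidate = "the cat sat".split()
print(ngram_precision(candidate, reference))  # 1.0
print(ngram_recall(candidate, reference))     # 0.5
```

The short candidate gets perfect precision but only half the recall, which is exactly the asymmetry described above.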

BLEU also includes a brevity penalty, which is quite important and is already part of standard BLEU implementations. It penalizes system outputs that are shorter than the reference length. This complements the n-gram precision, which in effect penalizes outputs that are longer than the reference, since the denominator grows with the length of the system output.
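
As a sketch, the brevity penalty from the original BLEU definition is 1 when the candidate is longer than the reference and exp(1 - r/c) otherwise, where c is the candidate length and r the reference length. The function name below is illustrative, and returning 0 for an empty candidate is an assumption made here to avoid a division by zero.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """BLEU-style brevity penalty for candidate length c and reference length r."""
    if candidate_len == 0:
        return 0.0  # assumption: an empty candidate gets the maximum penalty
    if candidate_len > reference_len:
        return 1.0  # no penalty for candidates longer than the reference
    return math.exp(1 - reference_len / candidate_len)

print(brevity_penalty(6, 6))  # 1.0
print(brevity_penalty(3, 6))  # exp(-1), about 0.368
```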

You could implement something similar for ROUGE, this time penalizing system outputs that are longer than the reference length. Because ROUGE divides by the length of the human references, a longer output has a higher chance of hitting words that appear in the references, and could otherwise obtain an artificially high ROUGE score.
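
One possible form of such a penalty, purely as an illustration, mirrors the BLEU brevity penalty with the roles of candidate and reference swapped. This is a hypothetical construction and not part of standard ROUGE; the function name and the exponential shape are assumptions.

```python
import math

def length_penalty(candidate_len, reference_len):
    """Hypothetical penalty for candidates longer than the reference."""
    if candidate_len <= reference_len:
        return 1.0  # no penalty for candidates that are not too long
    if reference_len == 0:
        return 0.0  # assumption: nothing to recall from an empty reference
    return math.exp(1 - candidate_len / reference_len)

print(length_penalty(6, 6))   # 1.0
print(length_penalty(12, 6))  # exp(-1), about 0.368
```

A recall score would then be multiplied by this factor, so a candidate twice as long as the reference keeps only about 37% of its raw score.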

Finally, you could use the F1 measure to combine the two metrics: F1 = 2 * (BLEU * ROUGE) / (BLEU + ROUGE)
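
The formula above is the harmonic mean of the two scores. A minimal sketch, with a guard for the case where both scores are zero (the function name is illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of a precision-like and a recall-like score."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)

# Example: a precision-like score of 1.0 and a recall-like score of 0.5
print(f1_score(1.0, 0.5))  # about 0.667
```

Because the harmonic mean is dominated by the smaller value, a system cannot reach a high F1 by maximizing only one of the two metrics.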