# Invalid quotes analysis

During our analysis, we identified a number of invalid quotes. Here we explain how we found them and why we believe they ended up in the dataset.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Identifying invalid quotes
The first type of invalid data we identified in the quotes was CSS content.
We found these quotes by chance: while working with the `numOccurrences` field, we noticed a quote with an implausibly high count. After retrieving it, we saw that it was indeed invalid and contained only `css` code.
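
The check described above can be sketched as follows. This is a minimal illustration, not the code we ran: the helper name `top_by_occurrences` and the toy records are assumptions made for the example, and only the `quotation` and `numOccurrences` field names come from the dataset.

```python
import heapq

def top_by_occurrences(records, n=5):
    """Return the n records with the largest numOccurrences value."""
    return heapq.nlargest(n, records, key=lambda r: r["numOccurrences"])

# Toy records, made up for illustration (not taken from the dataset)
sample = [
    {"quoteID": "a", "quotation": "We will win.", "numOccurrences": 3},
    {"quoteID": "b", "quotation": "border:0 px; overflow: hidden;", "numOccurrences": 208},
    {"quoteID": "c", "quotation": "It was a good game.", "numOccurrences": 1},
]

# Print the most repeated quotations so that they can be inspected by eye
for record in top_by_occurrences(sample, n=2):
    print(record["numOccurrences"], record["quotation"])
```

Reading the few most repeated quotations by eye is usually enough to spot content that is clearly not a spoken quote.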

## Filtering data
The next step was to filter out this css so that the invalid quotes could be removed. To do so, we developed a simple filter based on the most common css properties, taking advantage of the fact that `css` has a regular syntactic structure of the form `property: value;`.  

In [None]:
import bz2
import json
import urllib.parse
import re
import pandas as pd

def filter_css_keywords(keywords, input_file, output_file):
  """Copy every quote whose text matches one of the keywords to output_file."""
  # Escape the keywords so that they are matched literally, ignoring case
  regex_keywords = re.compile('|'.join(re.escape(k) for k in keywords), re.IGNORECASE)

  with bz2.open(input_file, 'rb') as s_file:
    with bz2.open(output_file, 'wb') as d_file:
      for line in s_file:
        # Load one quote (one JSON object per line)
        instance = json.loads(line)
        # Keep only the quotes matching one of the keywords
        if regex_keywords.search(instance['quotation']):
          # Write the matching quote to the new file
          d_file.write((json.dumps(instance) + '\n').encode('utf-8'))

In [None]:
# Keywords that reveal css or template code inside a quotation
keywords=["commentId", "overflow:", "px;", "width:", "height:","margin:", "font-size", "font-family", "text-align", "add-js"]

input_file = '/content/drive/MyDrive/Quotebank/quotes.json.bz2'
output_file_html_data = '/content/drive/MyDrive/quotes-2016-html-data.json.bz2'

filter_css_keywords(keywords, input_file, output_file_html_data)

As you can see above, some of the keywords, such as `commentId` and `add-js`, are not `css` properties. Those were identified in the 2018 dataset.
Later in this notebook, we explain why we believe this data ended up in the dataset in the first place.

In [None]:
html_data_df = pd.read_json(output_file_html_data, lines=True)

Just below, you can see the types of `invalid` quotes this filter allowed us to identify.

In [None]:
pd.set_option('expand_frame_repr', False)
html_data_df[['quoteID', 'quotation']]

Unnamed: 0,quoteID,quotation
0,2016-08-12-069289,position: fixed; left:1 px; top:1 px; overflow...
1,2016-12-12-069776,position: static; vertical-align: top; margin:...
2,2016-08-16-032476,height: auto; width: auto; overflow: scroll
3,2016-05-08-040735,overflow: auto; height:600 px;
4,2016-05-20-012246,border: currentColor; overflow: hidden;
5,2016-05-08-005592,border:0 px; vertical-align: middle; overflow:...
6,2016-11-30-026225,position: absolute; left: -10000 px; top: 0px;...
7,2016-12-27-006940,border:0 px; vertical-align: top; overflow: hi...
8,2016-02-10-000681,display: block; overflow: hidden; text-decorat...
9,2016-04-29-029151,height:200 px; text-align: left; overflow: auto;


In [None]:
html_data_df

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
0,2016-08-12-069289,position: fixed; left:1 px; top:1 px; overflow...,,[],2016-08-12 14:43:46,2,"[[None, 0.6471], [Monty Oum, 0.3529]]",[http://halo.wikia.com/wiki/User:Haloprov?diff...,E
1,2016-12-12-069776,position: static; vertical-align: top; margin:...,,[],2016-12-12 10:10:36,1,"[[None, 0.8217], [Jill Stein, 0.1783]]",[http://www.brainerddispatch.com/news/elsewher...,E
2,2016-08-16-032476,height: auto; width: auto; overflow: scroll,,[],2016-08-16 01:11:06,13,"[[None, 0.9207], [Abraham Lincoln, 0.0793]]",[http://legogames.wikia.com/wiki/Character_Gri...,E
3,2016-05-08-040735,overflow: auto; height:600 px;,,[],2016-05-08 05:25:52,1,"[[None, 0.4221], [Harold Grey, 0.2179], [John ...",[http://outlander.wikia.com/wiki/User:La_Dame_...,E
4,2016-05-20-012246,border: currentColor; overflow: hidden;,,[],2016-05-20 15:59:24,6,"[[None, 0.8177], [Shirley Wilson, 0.0914], [Ch...",[http://news-journalonline.com/article/2016052...,E
5,2016-05-08-005592,border:0 px; vertical-align: middle; overflow:...,,[],2016-05-08 13:17:21,208,"[[None, 0.7628], [Anthony Smith, 0.2373]]",[http://killerinstinct.wikia.com/wiki/Tusk?dif...,E
6,2016-11-30-026225,position: absolute; left: -10000 px; top: 0px;...,,[],2016-11-30 15:32:41,23,"[[None, 0.6669], [Rogelio Chavez, 0.2696], [La...",[http://www.csnbayarea.com/headline/swat-deput...,E
7,2016-12-27-006940,border:0 px; vertical-align: top; overflow: hi...,,[],2016-12-27 01:35:17,13,"[[None, 0.5306], [Anzu Lawson, 0.4694]]",[http://killerinstinct.wikia.com/wiki/Orchid?d...,E
8,2016-02-10-000681,display: block; overflow: hidden; text-decorat...,,[],2016-02-10 16:58:00,15,"[[None, 0.6699], [John Kellogg, 0.1054], [Stev...",[http://www.hypebot.com/hypebot/2016/02/heard-...,E
9,2016-04-29-029151,height:200 px; text-align: left; overflow: auto;,,[],2016-04-29 03:01:02,4,"[[None, 0.8881], [Lilly Singh, 0.1119]]",[http://iceage.wikia.com/wiki/Template:Officia...,E


And here is the extracted list of `quoteID` values, if needed:

In [None]:
html_data_df.quoteID.tolist()

['2016-08-12-069289',
 '2016-12-12-069776',
 '2016-08-16-032476',
 '2016-05-08-040735',
 '2016-05-20-012246',
 '2016-05-08-005592',
 '2016-11-30-026225',
 '2016-12-27-006940',
 '2016-02-10-000681',
 '2016-04-29-029151',
 '2016-08-15-018894',
 '2016-07-02-005998',
 '2016-11-30-009320',
 '2016-07-06-011215',
 '2016-09-21-161027',
 '2016-07-29-077837',
 '2016-09-12-067418',
 '2016-08-20-012255',
 '2016-05-20-143609',
 '2016-07-28-022194',
 '2016-02-23-014854',
 '2016-06-03-000180',
 '2016-07-20-024148',
 '2016-08-23-128879',
 '2016-07-19-028240',
 '2016-08-06-067262',
 '2016-08-23-076923']

## Textarea parsing issue
During our analysis of the quotes, we tried to identify why these strange quotes appeared in the dataset. To investigate, we took the urls and looked at the source code of the corresponding pages.

Our main finding is that some `textarea` tags contained html content, and that the `invalid` quotations corresponded to the values of attributes such as `style="display:none;"`.

### The case `investing.com`

The website investing.com uses a custom js script, https://i-invdn-com.investing.com/js/comments-7.75.min.js, to fill its comment section. The script is custom made and specific to this website; we did not find any GitHub repository linked to it. It works by parsing the content of a `textarea` tag that contains `html` code. When parsing this content, Quotebank assumes that it is plain text, whereas it is in fact more html.

Example of html:
```html
<textarea class="js-templates displayNone">
			<div id="comment">
				<div class="comment js-comment" data-comment-id="{commentID}" id="comment-{commentID}" data-user-id="">...</textarea>
```
- Source [2017-02-15-012897, 2017-02-15-002690, 2017-02-15-012906]:
view-source:https://it.investing.com/news/stock-market-news/nexi-e-sia-sottoscrivono-atto-fusione-nasce-paytech-leader-europea-2032588


The quotes identified as `invalid` for this website are `{commentID}` and `comment-{commentID}`.

We believe that there are many more invalid quotes than those presented here, for example `addJS(...)`.
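
Quotations of this kind could also be flagged by looking for template placeholders in braces. The sketch below is only an illustration: the helper name `has_template_placeholder` and the pattern are assumptions, and a genuine quote containing a word in braces would be flagged as well.

```python
import re

# A placeholder is assumed to be an identifier wrapped in braces, e.g. {commentID}
_PLACEHOLDER = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")

def has_template_placeholder(text):
    """Return True if the text contains something like {commentID}."""
    return _PLACEHOLDER.search(text) is not None

print(has_template_placeholder("comment-{commentID}"))
print(has_template_placeholder("We need to act now."))
```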

### The case `jdsupra.com`
Unlike the previous example, this website does not use a custom script but includes an `iframe` inside a `textarea`:
```html
<textarea class="ba pa3 b--black-20 w-100 f6 mid-gray"><iframe src="//www.jdsupra.com/post/contentViewerEmbed.aspx?fid=edc7ca3b-8c2b-4df9-874a-2627ad4ae419" width="100%" height="620" frameborder="1" style="border: 2px solid #ccc; overflow-x:hidden !important; overflow:hidden;" scrolling="auto"></iframe></textarea></div>
```
The identified quote contains exactly the css style that you can see above: `border: 2px solid #ccc; overflow-x:hidden !important; overflow:hidden;`
You can find the source here [quoteID:2015-07-13-002788]:
view-source:https://www.jdsupra.com/legalnews/key-takeaways-the-growth-of-early-stage-3777309/ 

### The case `awardsdaily.com`
The next example comes from the website awardsdaily.com. In this case, it is the content of a `p` tag that is parsed as text although it actually contains an `iframe`:
```html
<p><iframe loading="lazy" style="border: none; overflow: hidden;" frameborder="0" height="1200" scrolling="no" src="https://www.facebook.com/plugins/video.php?href=https%3A%2F%2Fwww.facebook.com%2FZacEfron%2Fvideos%2F1447071815390380%2F&amp;show_text=0&amp;width=846" width="300"><span data-mce-type="bookmark" style="display: inline-block; width: 0px; overflow: hidden; line-height: 0;" class="mce_SELRES_start">﻿</span></iframe></p>
```
Source [quoteID=2017-11-16-022899]:
view-source:https://www.awardsdaily.com/2017/11/16/zac-efron-post-rehearsal-video-greatest-showman/

The quote does indeed contain the style value `display: inline-block; width: 0px; overflow: hidden; line-height: 0;`.

### The case `wikia.com`
Another website with css code identified as quotes is `wikia.com` and its alias `fandom.com`. This website is a kind of wiki that stores properties and allows them to be updated. The problem lies mainly in the `url`, which contains the parameters used to compare two different versions of a page. When two versions are compared, some css properties are shown inside input fields.

See : http://killerinstinct.wikia.com/wiki/Orchid?diff=18846&oldid=18844

The content of those fields is escaped html, and it should be treated as such during parsing.
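
Since the affected pages are version comparisons, their urls could be flagged by their query parameters. The sketch below assumes that such pages always carry both `diff` and `oldid`, as in the url above; the helper name `is_wiki_diff_url` is hypothetical.

```python
from urllib.parse import urlparse, parse_qs

def is_wiki_diff_url(url):
    """Return True if the url query string carries both `diff` and `oldid`."""
    query = parse_qs(urlparse(url).query)
    return "diff" in query and "oldid" in query

print(is_wiki_diff_url("http://killerinstinct.wikia.com/wiki/Orchid?diff=18846&oldid=18844"))
print(is_wiki_diff_url("http://www.godisageek.com/reviews/mystery-castle-xbox-one-review/"))
```

A check like this only points to pages worth inspecting; it does not say whether a given quotation from such a page is invalid.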

### The case `hypebot.com`
Another issue concerns the website http://www.hypebot.com/hypebot/2016/02/heard-well-on-the-music-biz-podcast.html. We believe it uses a block tag that is not supported by the parser, so that its content might not be considered as html. The tag in question is `noscript`:
```html
<a href=https://www.hypebot.com/hypebot/2015/08/the-need-for-transparency-in-music-streaming.html style="box-shadow: 0px 0px 4px #999; padding: 2px; display: block; border-radius: 2px; text-decoration: none;" target=_blank rel="noopener noreferrer">
  <noscript><img alt src=http://i.zemanta.com/355649964_80_80.jpg style="padding: 0; margin: 0; border: 0; display: block; width: 80px; max-width: 100%;"></noscript>
  <img class=lazyload alt src='data:image/svg+xml,%3Csvg%20xmlns=%22http://www.w3.org/2000/svg%22%20viewBox=%220%200%20210%20140%22%3E%3C/svg%3E' data-src=http://i.zemanta.com/355649964_80_80.jpg style="padding: 0; margin: 0; border: 0; display: block; width: 80px; max-width: 100%;">
</a>
```

The quote identified by the parser corresponds to the value of the `style` attribute: `padding: 0; margin: 0; border: 0; display: block; width: 80px; max-width: 100%;`.

### Others
Finally, there are a few quotations that we tried to analyse for which we did not find any malformed html or more recent tags. However, those websites might have changed since the parser ran, and using `archive.org` might reveal why this happened.

Example of website : http://www.godisageek.com/reviews/mystery-castle-xbox-one-review/


## Solutions at the source

If our analysis is correct, the problem should be fixed at the source by changing the way webpages are parsed: the `textarea`, `noscript` and `p` tags should `not` be treated as end tags that contain only text. In other words, their content should be checked to determine whether it contains html. 
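
Such a check could look like the sketch below. It is only an assumption about how the content might be tested, not the way the Quotebank parser works: the names `_TagCounter` and `contains_html` are hypothetical, and the text is unescaped first so that escaped html (as on `wikia.com`) is detected too.

```python
from html import unescape
from html.parser import HTMLParser

class _TagCounter(HTMLParser):
    """Count the opening tags met while parsing a piece of text."""

    def __init__(self):
        super().__init__()
        self.tags = 0

    def handle_starttag(self, tag, attrs):
        self.tags += 1

def contains_html(text):
    """Return True if the text, once unescaped, contains at least one tag."""
    counter = _TagCounter()
    counter.feed(unescape(text))
    counter.close()
    return counter.tags > 0

print(contains_html('<div id="comment"><div class="comment"></div></div>'))
print(contains_html("He said it was fine."))
```

When the check returns `True`, the content would be parsed again as html instead of being handed to the quote extraction as plain text.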

## Remediation of current quotes

There might be multiple solutions. The first one, which does not fix the source of the problem but only mitigates it, is to use a filter like the one we used for this project. We could use the most common css properties, or try to parse the content to see whether it corresponds to valid `css` code. Some properties, such as `height:`, `width:`, `overflow:` and `display:`, are used so often that they might already catch most of the `invalid` quotes discussed here.
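
The second idea, checking whether a quotation has the shape of `css` code, can be sketched as follows. This is a rough heuristic and not a real css parser: the name `looks_like_css`, the pattern and both thresholds are assumptions chosen for the example.

```python
import re

# One declaration: a property name, a colon, then a non-empty value
_DECLARATION = re.compile(r"^\s*-?[a-zA-Z][a-zA-Z-]*\s*:\s*[^:;{}]+$")

def looks_like_css(text, min_declarations=2, min_ratio=0.8):
    """Return True if most `;`-separated parts look like `property: value`."""
    parts = [p for p in text.split(";") if p.strip()]
    matches = sum(1 for p in parts if _DECLARATION.match(p))
    if matches < min_declarations:
        return False
    return matches / len(parts) >= min_ratio

print(looks_like_css("height: auto; width: auto; overflow: scroll"))
print(looks_like_css("We will win this; I am sure of it."))
```

Requiring at least two declarations avoids flagging short sentences of the form `label: text`, at the cost of missing quotations made of a single css declaration.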
