# Correlation Metrics Between Charset and Content Language of Web Pages

The correlation metrics are based on data from the Common Crawl January 2020 data set (CC-MAIN-2020-05). The metrics are computed on the [columnar index](https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/) by the SQL query [correlation-language-charset.sql](https://github.com/commoncrawl/cc-index-table/blob/master/src/sql/examples/cc-index/correlation-language-charset.sql), executed on [AWS Athena](https://aws.amazon.com/athena/). The Athena results are stored in a CSV file ([cc-index-table/data/cc-main-2020-05-language-charset-correlation.csv](./data/cc-main-2020-05-language-charset-correlation.csv)), from which they are analyzed further.

Some background information: content language and charset have been detected since August 2018 ([CC-MAIN-2018-34](https://commoncrawl.org/2018/08/august-2018-crawl-archive-now-available/)). The language is detected by [CLD2](https://github.com/CLD2Owners/cld2), the charset by [Apache Tika](https://tika.apache.org/). Both use heuristics based on character and n-gram frequencies, supported by metadata from the HTML and HTTP headers. Since September 2018 ([CC-MAIN-2018-39](https://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/)) the detected information is available in the columns `content_languages` and `content_charset`. The former may include multiple values separated by commas, simply because a web page can be written in multiple languages, e.g. one part in Japanese and the other(s) in English. CLD2 is able to recognize up to three languages per web page. Also important to know: language and charset detection may be wrong. The statistical methods have a (low) error rate, and the input data can be misleading (erroneous metadata or garbled content).
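
To make the column format concrete, here is a minimal sketch of how a `content_languages` value could be split into its language codes. The helper names `split_languages` and `is_single_language` are hypothetical, and the sketch assumes the codes are separated by a bare comma without spaces, as the `LIKE '%,%'` filter used later suggests.

```python
def split_languages(content_languages):
    """Split a content_languages value into a list of language codes.

    An empty or missing value (no language detected) yields an empty list.
    """
    return content_languages.split(",") if content_languages else []


def is_single_language(content_languages):
    """True if exactly one language was detected for the page."""
    return len(split_languages(content_languages)) == 1


print(split_languages("jpn,eng"))
print(is_single_language("jpn"))
```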

Now let's put the SQL query together. The complete query is available as [correlation-language-charset.sql](https://github.com/commoncrawl/cc-index-table/blob/master/src/sql/examples/cc-index/correlation-language-charset.sql); here I'll explain its steps in more detail. For better readability the query is split: one part creates a SQL view, which in turn defines a subquery "tmp". The three essential steps are:

1. get the page counts for pairs of language and charset, as well as the overall number of pages per language, per charset and in total:
```sql
SELECT COUNT(*) as n_pages,
       content_languages AS languages,
       content_charset AS charset,
       SUM(COUNT(*)) OVER() AS total_pages,
       SUM(COUNT(*)) OVER(PARTITION BY content_charset) AS n_pages_charset,
       SUM(COUNT(*)) OVER(PARTITION BY content_languages) as n_pages_languages
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2020-05'
  AND subset = 'warc'
GROUP BY content_charset,
         content_languages
```
   The [SQL window functions](https://prestodb.io/docs/current/functions/window.html) make it easy to get the overall counts.  We filter by crawl and use only the "warc" subset to count only successfully fetched pages.

2. next, based on the counts from step 1, we calculate the correlation metrics, i.e. the log-likelihood and the conditional probabilities:
```sql
SELECT ...,
       ... AS loglikelihood,
       (n_pages/CAST(n_pages_languages AS DOUBLE)) AS prob_cs_given_lang,
       (n_pages/CAST(n_pages_charset AS DOUBLE)) AS prob_lang_given_cs
```
   The calculation of the log-likelihood is left out for now; we'll discuss it later.

3. finally, to reduce noise, we filter out pages with multiple languages as well as rare (and maybe accidental) combinations of language and charset that show up fewer than 100 times:
```sql
WHERE
  -- skip low-frequency pairs of language and charset:
      n_pages >= 100
  -- skip pages with more than one language:
      AND NOT languages LIKE '%,%'
```
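
The three steps can also be mirrored in a few lines of plain Python, which may help to see what the window functions compute. This is only a sketch on a made-up toy index of six pages: the page list and the lowered threshold `MIN_PAGES` are assumptions for illustration, and the log-likelihood is omitted. As in the SQL, the totals are assumed to be computed before the filter is applied, so they still include multi-language pages.

```python
from collections import Counter

# Hypothetical toy index: one (content_languages, content_charset) pair per page.
pages = [
    ("jpn", "UTF-8"), ("jpn", "UTF-8"), ("jpn", "Shift_JIS"),
    ("eng", "UTF-8"), ("eng", "ISO-8859-1"), ("eng,jpn", "UTF-8"),
]

# Step 1: counts per (language, charset) pair, plus the totals that the
# window functions deliver in the SQL query.
n_pages = Counter(pages)
n_pages_languages = Counter(languages for languages, _ in pages)
n_pages_charset = Counter(charset for _, charset in pages)
total_pages = len(pages)

MIN_PAGES = 1  # the query above uses 100; lowered here for the toy data

result = []
for (languages, charset), n in n_pages.items():
    # Step 3: skip low-frequency pairs and pages with more than one language.
    if n < MIN_PAGES or "," in languages:
        continue
    # Step 2: conditional probabilities from the counts of step 1.
    result.append({
        "languages": languages,
        "charset": charset,
        "n_pages": n,
        "prob_cs_given_lang": n / n_pages_languages[languages],
        "prob_lang_given_cs": n / n_pages_charset[charset],
    })

for row in result:
    print(row)
```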

Athena executes the query for less than \$0.01 (at a price of \$5.00 per TB scanned):
```
(Run time: 19.12 seconds, Data scanned: 613.44 MB)
```
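
The cost estimate follows directly from the amount of data scanned. A small sketch of the arithmetic, assuming binary units (1 TB = 1024 × 1024 MB) and ignoring any minimum charge per query:

```python
PRICE_PER_TB_USD = 5.00    # price quoted above
DATA_SCANNED_MB = 613.44   # from the Athena run statistics
MB_PER_TB = 1024 ** 2      # assumption: binary units

cost_usd = DATA_SCANNED_MB / MB_PER_TB * PRICE_PER_TB_USD
print(f"estimated cost: ${cost_usd:.4f}")
```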
The resulting CSV file is also provided here as [cc-index-table/data/cc-main-2020-05-language-charset-correlation.csv](./data/cc-main-2020-05-language-charset-correlation.csv). We'll now take a look at the results.

```python
import pandas as pd
pd.options.display.float_format = '{:,.5f}'.format

df = pd.read_csv('data/cc-main-2020-05-language-charset-correlation.csv')
# page counts per charset, summed over all language rows, and their share in percent
df_ = df.loc[:, ['charset', 'n_pages_charset']]
df_['%'] = 100.0 * df_['n_pages_charset'] / df_['n_pages_charset'].sum()
df_.groupby('charset').sum().sort_values('n_pages_charset', ascending=False).head(25)
```

| charset | n_pages_charset | % |
|:--|--:|--:|
| UTF-8 | 437142931590 | 96.89321 |
| ISO-8859-1 | 10083073661 | 2.23492 |
| windows-1251 | 1344152824 | 0.29793 |
| windows-1252 | 1147056008 | 0.25425 |
| GB2312 | 449164209 | 0.09956 |
| EUC-JP | 173648376 | 0.03849 |
| Shift_JIS | 144895231 | 0.03212 |
| ISO-8859-2 | 139068840 | 0.03082 |
| GBK | 105909020 | 0.02347 |
| ISO-8859-15 | 102893508 | 0.02281 |


UTF-8 is clearly the most frequently used encoding, with a share of almost 97% in this table.
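
One caveat when reading these percentages: `n_pages_charset` is repeated on every row of a charset, so summing it counts each charset once per language it is paired with. For UTF-8 the summed value 437142931590 is exactly 155 × 2820276978, the `n_pages_charset` value shown for UTF-8 in the Japanese rows further down. The following sketch shows the effect and a de-duplicated alternative on a tiny hand-made frame. The charset counts are copied from the Japanese rows, while the `eng` row is a hypothetical addition to create a second language for UTF-8.

```python
import pandas as pd

toy = pd.DataFrame({
    "languages": ["jpn", "eng", "jpn", "jpn"],  # the "eng" row is hypothetical
    "charset": ["UTF-8", "UTF-8", "Shift_JIS", "EUC-JP"],
    "n_pages_charset": [2820276978, 2820276978, 11145787, 7893108],
})

# Summing over all rows counts UTF-8 once per language row (here: twice).
naive = toy.groupby("charset")["n_pages_charset"].sum()

# Keeping one row per charset gives the plain page count per charset.
per_charset = toy.drop_duplicates("charset").set_index("charset")["n_pages_charset"]
share = 100.0 * per_charset / per_charset.sum()

print(naive)
print(share)
```

Applied to the full CSV, the de-duplicated shares would still cover only the charsets that survive the filters of the query, so they remain an approximation.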

But now let's look at how charsets are used to write a single language. We take Japanese because it cannot (or can hardly) be written using the Latin alphabet.

```python
df[df['languages'] == 'jpn'].sort_values('prob_cs_given_lang', ascending=False)
```

|     | languages | charset | n_pages_languages | n_pages_charset | n_pages | loglikelihood | prob_cs_given_lang | prob_lang_given_cs |
|--:|:--|:--|--:|--:|--:|--:|--:|--:|
| 323 | jpn | UTF-8 | 64194459 | 2820276978 | 54506179 | 0.85865 | 0.84908 | 0.01933 |
| 324 | jpn | Shift_JIS | 64194459 | 11145787 | 6517226 | 4.76341 | 0.10152 | 0.58473 |
| 325 | jpn | EUC-JP | 64194459 | 7893108 | 3010223 | 4.81949 | 0.04689 | 0.38137 |
| 326 | jpn | GBK | 64194459 | 15129860 | 100009 | -2.24805 | 0.00156 | 0.00661 |
| 327 | jpn | windows-31j | 64194459 | 57207 | 32458 | 4.98683 | 0.00051 | 0.56738 |
| 328 | jpn | GB18030 | 64194459 | 3063488 | 12873 | -3.15421 | 0.00020 | 0.00420 |
| 329 | jpn | ISO-2022-JP | 64194459 | 23204 | 6222 | 4.54004 | 0.00010 | 0.26814 |
| 330 | jpn | UTF-16BE | 64194459 | 4458 | 2860 | 4.85729 | 0.00004 | 0.64154 |
| 331 | jpn | ISO-8859-1 | 64194459 | 103949213 | 2354 | -13.52745 | 0.00004 | 0.00002 |
| 332 | jpn | UTF-16LE | 64194459 | 29580 | 1704 | 1.97009 | 0.00003 | 0.05761 |


UTF-8 is still used most frequently, but for a significantly lower proportion of web pages (85%). There are also a couple of other character sets which you have likely never heard of. We now re-sort the table by log-likelihood:

```python
df[df['languages'] == 'jpn'].sort_values('loglikelihood', ascending=False)
```

|     | languages | charset | n_pages_languages | n_pages_charset | n_pages | loglikelihood | prob_cs_given_lang | prob_lang_given_cs |
|--:|:--|:--|--:|--:|--:|--:|--:|--:|
| 327 | jpn | windows-31j | 64194459 | 57207 | 32458 | 4.98683 | 0.00051 | 0.56738 |
| 330 | jpn | UTF-16BE | 64194459 | 4458 | 2860 | 4.85729 | 0.00004 | 0.64154 |
| 325 | jpn | EUC-JP | 64194459 | 7893108 | 3010223 | 4.81949 | 0.04689 | 0.38137 |
| 324 | jpn | Shift_JIS | 64194459 | 11145787 | 6517226 | 4.76341 | 0.10152 | 0.58473 |
| 329 | jpn | ISO-2022-JP | 64194459 | 23204 | 6222 | 4.54004 | 0.00010 | 0.26814 |
| 332 | jpn | UTF-16LE | 64194459 | 29580 | 1704 | 1.97009 | 0.00003 | 0.05761 |
| 323 | jpn | UTF-8 | 64194459 | 2820276978 | 54506179 | 0.85865 | 0.84908 | 0.01933 |
| 326 | jpn | GBK | 64194459 | 15129860 | 100009 | -2.24805 | 0.00156 | 0.00661 |
| 328 | jpn | GB18030 | 64194459 | 3063488 | 12873 | -3.15421 | 0.00020 | 0.00420 |
| 336 | jpn | Big5 | 64194459 | 1970663 | 152 | -11.14189 | 0.00000 | 0.00008 |


The log-likelihood ratio test is a metric for the correlation of two variables and is more robust than the conditional probability. This becomes evident when comparing the columns "loglikelihood" and "prob_lang_given_cs": the top charset by log-likelihood is [windows-31j](https://en.wikipedia.org/wiki/Code_page_932_(Microsoft_Windows)), while [UTF-16BE](https://en.wikipedia.org/wiki/UTF-16#UTF-16BE) has the highest conditional probability. The former is indeed a character set designed to encode Japanese content, whereas the latter is a universal charset seen on only 4,458 pages in total, so its co-occurrence with Japanese might be purely accidental.
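
The conditional probabilities can be recomputed by hand from the raw counts, which makes it easier to see how small the sample behind the UTF-16BE value is. The sketch below copies `n_pages` and `n_pages_charset` for five of the Japanese rows from the tables above; nothing else is assumed.

```python
# (charset, n_pages, n_pages_charset), copied from the Japanese rows above
rows = [
    ("windows-31j", 32458, 57207),
    ("UTF-16BE", 2860, 4458),
    ("EUC-JP", 3010223, 7893108),
    ("Shift_JIS", 6517226, 11145787),
    ("UTF-8", 54506179, 2820276978),
]

# P(language = jpn | charset) = n_pages / n_pages_charset
prob_lang_given_cs = {cs: n / n_cs for cs, n, n_cs in rows}

# Rank by conditional probability, highest first, and show the sample size.
for cs, n, n_cs in sorted(rows, key=lambda r: r[1] / r[2], reverse=True):
    print(f"{cs:12s} {prob_lang_given_cs[cs]:.5f}  (charset pages: {n_cs})")

top = max(prob_lang_given_cs, key=prob_lang_given_cs.get)
```

UTF-16BE ends up first although its estimate rests on the smallest number of pages among these rows, which is the kind of situation where a metric that takes the sample size into account is preferable.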

The calculation of the log-likelihood is a little more complex, especially if you need to avoid query failures due to integer overflows or taking $\log(0)$; see [correlation-language-charset.sql](https://github.com/commoncrawl/cc-index-table/blob/master/src/sql/examples/cc-index/correlation-language-charset.sql). I recommend reading [Ted Dunning's article about association metrics of collocations](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962) to understand the calculation and usage of the log-likelihood ratio test.
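
For orientation, here is a minimal sketch of the textbook log-likelihood ratio statistic (often written $G^2$) for a 2×2 contingency table, with a guard for the $0 \cdot \log(0)$ case. It is not the expression used in the SQL file: the `loglikelihood` column in the tables above can be negative and is evidently scaled differently, whereas $G^2$ is never negative. The function names and the example counts are hypothetical.

```python
import math


def xlogx(x):
    """Return x * ln(x), using the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of counts."""
    n = k11 + k12 + k21 + k22
    cells = xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
    rows = xlogx(k11 + k12) + xlogx(k21 + k22)
    cols = xlogx(k11 + k21) + xlogx(k12 + k22)
    return 2.0 * (cells - rows - cols + xlogx(n))


def contingency(n_pages, n_pages_languages, n_pages_charset, total_pages):
    """Build the 2x2 table for one (language, charset) pair from the counts."""
    k11 = n_pages                                    # language and charset
    k12 = n_pages_languages - n_pages                # language, other charset
    k21 = n_pages_charset - n_pages                  # other language, charset
    k22 = total_pages - n_pages_languages - n_pages_charset + n_pages
    return k11, k12, k21, k22


# Hypothetical counts, chosen only to illustrate the call.
g2 = log_likelihood_ratio(*contingency(90, 100, 120, 1000))
print(g2)
```

Python integers do not overflow, so the overflow handling needed in SQL has no counterpart here; only the zero-count guard in `xlogx` carries over.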