# Welcome to Wikipedia Articles

Thankfully, the Wikimedia Foundation provides free access, under [their license](https://creativecommons.org/licenses/by-sa/3.0/), to their full encyclopedia of knowledge via their [dumps website](https://dumps.wikimedia.org/).

The work of downloading the data, [processing it into a usable form](https://github.com/attardi/wikiextractor), and storing it in an S3 bucket has already been done for you.

- [read.json](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.json.html) - Loads JSON files and returns the results as a `DataFrame`.

In [None]:
%%time
# Processing 16.5 GB of Wikipedia articles took about 4 minutes during testing.
wikipedia_data = spark.read.json("s3://wikipedia-dump-extractor-4815879/enwiki-20220701.jsonl")
wikipedia_data.count()
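
`spark.read.json` treats its input as JSON Lines by default: each line of the file is one complete JSON object, which becomes one row. The sketch below illustrates that layout in plain Python, without Spark. The two records and their field names (`id`, `revid`, `title`, `text`) are made-up assumptions for illustration; run `printSchema` to see the real columns.

```python
import json

# Two hypothetical JSON Lines records (one JSON object per line).
# The field names and values are illustrative assumptions, not real dump contents.
sample_jsonl = (
    '{"id": "1", "revid": "100", "title": "Example A", "text": "Some text."}\n'
    '{"id": "2", "revid": "200", "title": "Example B", "text": ""}\n'
)

# Each non-empty line parses independently into one record (one future row).
records = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]

for record in records:
    print(record["title"], len(record["text"]))
```

Because every line stands alone, a large file can be divided at line boundaries and parsed in parallel.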

- [limit](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.limit.html#pyspark.sql.DataFrame.limit) - Limits the result count to the number specified.

In [None]:
%%time
# Processing 10 Wikipedia articles took < 1 second during testing.
wikipedia_data.limit(10).count()

- [printSchema](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.printSchema.html) - Prints the schema in a tree format.

In [None]:
wikipedia_data.printSchema()

## Exercises

1. How many articles have a zero length? What's that as a percentage of the total article count?
1. Use the [Soundex](https://en.wikipedia.org/wiki/Soundex) algorithm to find article titles which sound like "Adrian".
    <details>
      <summary>Hint</summary>
      To operate on a constant value, e.g. "Adrian", you can use the <code>pyspark.sql.functions.lit</code> function.
    </details>
1. What is the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) between the number of revisions a Wikipedia article has and the length of the article?
    <details>
      <summary>Hint 1</summary>
      The <code>DataFrame</code> object has a method, <code>corr</code>, which can calculate this.
    </details>
    <details>
      <summary>Hint 2</summary>
      The <code>revid</code> column will need to be cast to an integer.
    </details>
1. What is the most frequently occurring five-letter word across all articles?
    <details>
      <summary>Hint 1</summary>
      Creating a DataFrame which has a column containing an array of all words within the article could help.
    </details>
    <details>
      <summary>Hint 2</summary>
      Manipulating your array of words is hard, but transforming that into a row per word should make filtering and aggregating easier.
    </details>
1. How does the frequency of letters within Wikipedia compare to [The frequency of the letters of the alphabet in English](https://www3.nd.edu/~busiforc/handouts/cryptography/letterfrequencies.html)?
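
When checking answers to the Soundex and correlation exercises, it can help to compare against a small reference on a few hand-picked values. The following is a plain-Python sketch, not the Spark implementation: `soundex` follows the American Soundex rules as described in the linked Wikipedia article, and `pearson` computes the correlation directly from its definition. Both function names are hypothetical helpers introduced here, and Spark may handle edge cases (non-ASCII letters, empty strings, constant columns) differently.

```python
from math import sqrt

# Consonant groups and their Soundex digits (American Soundex).
_GROUPS = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
           ("L", "4"), ("MN", "5"), ("R", "6"))
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(word: str) -> str:
    """Return a 4-character Soundex code, or "" if the word has no ASCII letters."""
    letters = [c for c in word.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W are skipped and do not separate equal codes
        code = _CODES.get(ch, "")  # vowels map to "" and do separate equal codes
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]  # pad with zeros, then truncate to 4 characters


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


print(soundex("Adrian"), soundex("Adrienne"))
print(pearson([1, 2, 3], [2, 4, 6]))
```

Titles that share a code with "Adrian" are the ones Soundex considers similar sounding, and a correlation near 0 means the two columns have little linear relationship.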

## Resources

- List of [pyspark.sql.functions](https://spark.apache.org/docs/3.1.3/api/python/reference/pyspark.sql.html#functions)