In [None]:
import requests
import tempfile
import shutil
import subprocess
import shlex
import os

# Stack Exchange Analysis



# Processing the Data

## Downloading the data
Our first step is to download the [data dumps](https://archive.org/details/stackexchange). The Stack Exchange Network currently hosts its data dumps on the [Internet Archive](https://archive.org/). One dump file is provided for each site in the network (except Stack Overflow, which is split across multiple files), and each file is compressed in the 7z archive format. The code below, which downloads and extracts an archive, therefore requires the `7z` binary, which can be installed through `apt-get`, `brew`, and other package managers.
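Because the download step depends on an external binary, it can help to check for it before fetching anything. The following is a minimal sketch using only the standard library. The helper name `find_7z` and the alternative executable names `7za` and `7zr` (shipped by some p7zip packages) are assumptions for illustration, not part of the original code.

```python
import shutil

def find_7z(candidates=('7z', '7za', '7zr')):
    """Return the path of the first 7-Zip executable found on PATH, or None."""
    for name in candidates:
        found = shutil.which(name)  # None when the name is not on PATH
        if found:
            return found
    return None

if find_7z() is None:
    print('No 7z binary found; install it with your package manager first.')
```

If the check fails, installing the binary first avoids downloading a large archive that cannot then be extracted.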

In [None]:
def get_network_data(network, path=''):
    # download the archive as a stream so it is never held fully in memory
    url = 'https://archive.org/download/stackexchange/%s.stackexchange.com.7z' % network
    response = requests.get(url, stream=True)
    response.raise_for_status()  # fail early on a missing or unavailable archive

    with tempfile.NamedTemporaryFile('wb') as f:
        # copy the 7z archive into the filesystem in 1 MiB chunks
        for chunk in response.iter_content(chunk_size=1 << 20):
            f.write(chunk)
        f.flush()

        # create a folder to store the XML data
        path = os.path.join(path, network)
        os.makedirs(path, exist_ok=True)

        # there are few Python 7z compatible libraries and they don't
        # work correctly with these archives, so we call the 7z binary:
        # `x` extracts with full paths, `-aoa` overwrites existing files,
        # and `-o` sets the output directory
        args = ['7z', 'x', f.name, '-aoa', '-o%s' % path]
        return subprocess.check_call(args)

get_network_data('ai')

Now that we've downloaded the archive, it's time to load the XML data it contains into a database that we can easily query when building our features. The [README](https://ia800500.us.archive.org/22/items/stackexchange/readme.txt) provided with the data dump describes the schema of each table. Below is a visualization of these tables, along with the type each field takes once loaded into our database. The visualization was created with [WWW SQL Designer](https://ondras.zarovi.cz/sql/demo/).

![Database schema](schema.png)
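As a rough sketch of the loading step described above, the snippet below streams records from a dump file into a database table. It assumes each XML file stores its records as `<row>` elements whose attributes are the fields. SQLite, the helper name `load_rows`, and the two-column sample are illustrative choices, not the database or schema used in this analysis.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

def load_rows(xml_source, conn, table, columns):
    """Stream <row> elements from a dump file into an SQLite table."""
    cols = ', '.join('"%s"' % c for c in columns)
    marks = ', '.join('?' for _ in columns)
    conn.execute('CREATE TABLE IF NOT EXISTS "%s" (%s)' % (table, cols))
    insert = 'INSERT INTO "%s" (%s) VALUES (%s)' % (table, cols, marks)
    count = 0
    # iterparse yields each element once it is fully read, so large
    # files do not need to be loaded into memory all at once
    for _, elem in ET.iterparse(xml_source, events=('end',)):
        if elem.tag == 'row':
            # attributes missing from a row are stored as NULL
            conn.execute(insert, [elem.get(c) for c in columns])
            count += 1
            elem.clear()  # drop the row's attributes once stored
    conn.commit()
    return count

# tiny in-memory stand-in for a file such as Posts.xml (hypothetical data)
sample = io.BytesIO(b'<posts><row Id="1" Score="5" /><row Id="2" /></posts>')
conn = sqlite3.connect(':memory:')
n = load_rows(sample, conn, 'posts', ['Id', 'Score'])
rows = conn.execute('SELECT Id, Score FROM posts ORDER BY Id').fetchall()
```

The columns here are created without types, so every value is stored as text. A real loader would declare the field types shown in the schema and convert values accordingly.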