Let's start by creating a simple linear model for digit recognition.

## Accessing the data

We'll download the famous [MNIST dataset](http://yann.lecun.com/exdb/mnist/) of handwritten digits.

In [1]:
url = 'https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz'
tgz = 'mnist_png.tgz'

We can use `urlsave` from [fastcore](https://fastcore.fast.ai/) to download the dataset. However, we don't want to re-download it if we've downloaded it before. We can use Python's `Path` class to make it more convenient to work with the filesystem.

In [29]:
from pathlib import Path
from fastcore.all import urlsave, untar_dir
import tarfile

In [32]:
cfghome = Path.home()/'.fastai'
archive = cfghome/'archive'
if not (archive/tgz).exists():
    archive.mkdir(exist_ok=True, parents=True)
    urlsave(url, archive)

data = cfghome/'data'
dest = data/'mnist_png'
if not dest.exists():
    data.mkdir(exist_ok=True, parents=True)
    untar_dir(archive/tgz, dest)
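
The "only do the work if the result is missing" pattern used above can be tried out without touching the network. The following is a minimal sketch using only the standard library; `fetch_once` and `fake_download` are hypothetical helpers invented for this illustration (they are not part of fastcore), and the fake download simply writes a few bytes to stand in for the real archive.

```python
import tempfile
from pathlib import Path

def fetch_once(target: Path, fetch) -> bool:
    """Call fetch(target) only if target does not exist yet.

    Returns True if fetch ran, False if the existing file was reused.
    (Hypothetical helper for illustration, not a fastcore function.)
    """
    if target.exists():
        return False
    # Create the parent folder (and any missing ancestors) before writing.
    target.parent.mkdir(parents=True, exist_ok=True)
    fetch(target)
    return True

def fake_download(target: Path) -> None:
    # Stand-in for a real download: writes placeholder bytes instead.
    target.write_bytes(b"placeholder archive contents")

with tempfile.TemporaryDirectory() as tmp:
    archive_file = Path(tmp) / "archive" / "mnist_png.tgz"
    first = fetch_once(archive_file, fake_download)   # file is missing, so fetch runs
    second = fetch_once(archive_file, fake_download)  # file exists, so fetch is skipped

print(first, second)  # True False
```

Running the cell a second time therefore costs almost nothing, which is the behaviour we want from a notebook that gets re-executed often.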

In [21]:
#TODO: use FastDownload when available

The dataset has separate folders for the training set and the test set (which we will use for validation):

In [35]:
!ls {dest}

testing  training


The training set is split into folders according to the labels:

In [38]:
trainpath = dest/'training'
testpath = dest/'testing'
!ls {trainpath}

0  1  2  3  4  5  6  7  8  9


With `glob` we can get a list of all of the files in the training and test sets:

In [48]:
from glob import glob
def getfiles(p): return glob(f'{p}/**/*.png', recursive=True)  # glob already returns a list

trainfiles = getfiles(trainpath)
testfiles = getfiles(testpath)
fname = trainfiles[0]
fname

'/home/jhoward/.fastai/data/mnist_png/training/8/7674.png'

We can see from the path that this image shows the digit "8" and belongs to the training set. We can open the image file using [torchvision](https://pytorch.org/vision/stable/index.html):

In [49]:
from torchvision.io import read_image
img = read_image(fname)

This returns a `Tensor` (a multidimensional array). We can see its size:

In [50]:
img.shape

torch.Size([1, 28, 28])

The first dimension is the number of color channels, so this shows that it is a single-channel (grayscale) image of size 28x28.

## Creating variables

For a linear model, we'll need independent variables (the pixel values) and a dependent variable (the labels). Let's first get the labels using a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions):

In [57]:
def labelfunc(p): return Path(p).parent.name
y = [labelfunc(o) for o in trainfiles]
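
To see how the recursive glob and the parent-folder labelling fit together, here is a small self-contained sketch that builds a throwaway folder tree shaped like the training set. The file names and the `label_from_path` helper are made up for this illustration; the helper has the same body as `labelfunc`.

```python
import tempfile
from glob import glob
from pathlib import Path

def label_from_path(p):
    # The label is the name of the folder that directly contains the file.
    return Path(p).parent.name

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "training"
    # Create empty placeholder files laid out as <root>/<label>/<name>.png
    for label, name in [("3", "10.png"), ("3", "11.png"), ("8", "7674.png")]:
        (root / label).mkdir(parents=True, exist_ok=True)
        (root / label / name).touch()

    # '**' with recursive=True matches files at any depth below root.
    # Sorting makes the order reproducible; glob itself gives no ordering guarantee.
    files = sorted(glob(f"{root}/**/*.png", recursive=True))
    labels = [label_from_path(f) for f in files]

print(labels)  # ['3', '3', '8']
```

Note that the labels come back as strings, not integers, because they are folder names.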

Let's check the distribution of labels:

In [63]:
from collections import Counter
Counter(sorted(y))

Counter({'0': 5923,
         '1': 6742,
         '2': 5958,
         '3': 6131,
         '4': 5842,
         '5': 5421,
         '6': 5918,
         '7': 6265,
         '8': 5851,
         '9': 5949})
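
The raw counts are easier to judge as totals and ratios. The sketch below re-enters the counts shown above by hand (so that it runs on its own) and checks how balanced the classes are; the variable names are just for this illustration.

```python
from collections import Counter

# Counts copied from the output above.
counts = Counter({'0': 5923, '1': 6742, '2': 5958, '3': 6131, '4': 5842,
                  '5': 5421, '6': 5918, '7': 6265, '8': 5851, '9': 5949})

total = sum(counts.values())

# most_common() sorts by count, largest first.
most, most_n = counts.most_common(1)[0]
least, least_n = counts.most_common()[-1]

print(total)                        # 60000
print(most, least)                  # 1 5
print(round(most_n / least_n, 2))   # 1.24
```

So the training set holds 60,000 images, and the most frequent digit ("1") appears about 1.24 times as often as the least frequent ("5"): the classes are not perfectly balanced, but they are close.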