<a href="https://colab.research.google.com/github/gaurangdave/mnist_digits_recognition/blob/main/notebooks/00_get_data.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Get Data

* This notebook has a single task: download the MNIST data and save it to the shared Google Drive.
* The remaining notebooks read the data directly from the shared drive.

## Import Libraries

In [1]:
from sklearn.datasets import fetch_openml
import pandas as pd
from google.colab import drive
from google.colab import userdata

## Mount Google Drive

In [2]:
# Mount Google Drive at /content/drive
drive.mount("/content/drive")

Mounted at /content/drive


In [3]:
# Retrieve the Google Drive path stored in secrets
shared_folder_path = userdata.get("SHARED_DRIVE_PATH")
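
`userdata.get` hands back whatever value was stored under the secret, so a wrong or empty path would only surface later, when the CSV is written. Below is a minimal sketch of an early check, assuming the secret is meant to name an existing directory. `require_directory` is a hypothetical helper introduced for illustration; it is not part of `google.colab`.

```python
from pathlib import Path


def require_directory(path_value):
    """Return path_value as a Path, or raise if it is empty or not a directory.

    Hypothetical helper for illustration; not part of google.colab.
    """
    if not path_value:
        raise ValueError("No shared folder path was provided")
    path = Path(path_value)
    if not path.is_dir():
        raise FileNotFoundError(f"Shared folder not found: {path}")
    return path
```

In this notebook it could be called as `require_directory(shared_folder_path)` once the drive is mounted.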

## Access MNIST Dataset

In [4]:
mnist = fetch_openml("mnist_784", as_frame=False)

In [5]:
mnist.DESCR

"**Author**: Yann LeCun, Corinna Cortes, Christopher J.C. Burges  \n**Source**: [MNIST Website](http://yann.lecun.com/exdb/mnist/) - Date unknown  \n**Please cite**:  \n\nThe MNIST database of handwritten digits with 784 features, raw data available at: http://yann.lecun.com/exdb/mnist/. It can be split in a training set of the first 60,000 examples, and a test set of 10,000 examples  \n\nIt is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting. The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 

In [6]:
mnist.keys()

dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])

In [7]:
mnist["target"]

array(['5', '0', '4', ..., '4', '5', '6'], dtype=object)
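
The output shows that the labels are stored as strings in an object array, not as numbers. If a later step needs numeric labels, a cast along these lines should work. This is a sketch on a small stand-in array, not a step this notebook performs.

```python
import numpy as np

# Stand-in for mnist["target"]: an object array of digit strings.
labels = np.array(["5", "0", "4", "1", "9"], dtype=object)

# Cast each string to a small unsigned integer; the digits 0-9 fit in uint8.
numeric_labels = labels.astype(np.uint8)
```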

In [8]:
mnist.feature_names[0:10]

['pixel1',
 'pixel2',
 'pixel3',
 'pixel4',
 'pixel5',
 'pixel6',
 'pixel7',
 'pixel8',
 'pixel9',
 'pixel10']
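
Each of the 784 feature names corresponds to one pixel of a 28x28 image. The sketch below maps a name to a grid position, assuming the pixels are flattened row by row starting from the top-left corner (an assumption about the layout, not something shown in the output above). `pixel_position` is a hypothetical helper.

```python
IMAGE_SIDE = 28  # 28 * 28 = 784 features


def pixel_position(feature_name, side=IMAGE_SIDE):
    """Map a name such as 'pixel1' to a zero-based (row, column) pair.

    Assumes row-major flattening; hypothetical helper for illustration.
    """
    index = int(feature_name.removeprefix("pixel")) - 1  # names start at 1
    if not 0 <= index < side * side:
        raise ValueError(f"{feature_name} is outside a {side}x{side} image")
    return divmod(index, side)
```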

In [9]:
mnist.target_names


['class']

* The features and the target are already separated under the `data` and `target` keys.

In [10]:
X, y = mnist.data, mnist.target

In [11]:
X.shape

(70000, 784)

In [12]:
y.shape

(70000,)
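
The dataset description above says the 70,000 examples can be split into a training set of the first 60,000 and a test set of the remaining 10,000. This notebook only saves the raw data, but a positional split could be sketched as follows. `split_by_position` is a hypothetical helper, shown here on a tiny stand-in list; it should behave the same on NumPy arrays, since it relies only on `len` and slicing.

```python
def split_by_position(samples, labels, n_train=60_000):
    """Split two aligned sequences into leading (train) and trailing (test) parts."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    return samples[:n_train], samples[n_train:], labels[:n_train], labels[n_train:]


# Tiny stand-in data; with the real arrays the call would be split_by_position(X, y).
demo_samples = [[0, 1], [2, 3], [4, 5], [6, 7]]
demo_labels = ["5", "0", "4", "1"]
train_x, test_x, train_y, test_y = split_by_position(
    demo_samples, demo_labels, n_train=3
)
```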

In [13]:
type(y)

numpy.ndarray

In [14]:
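# Combine features and labels into one DataFrame: one column per pixel plus "class"
features_df = pd.DataFrame(X, columns=mnist.feature_names)
target_df = pd.DataFrame(y, columns=mnist.target_names)
mnist_df = pd.concat([features_df, target_df], axis=1)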

In [15]:
mnist_df.head()

,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,pixel10,...,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784,class
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,9


## Save Raw Data To Google Drive

In [16]:
raw_data_csv = f"{shared_folder_path}/raw_mnist_data.csv"

In [17]:
mnist_df.to_csv(raw_data_csv, index=False)

## Read Raw Data from Google Drive

In [18]:
mnist_df_read = pd.read_csv(raw_data_csv)
mnist_df_read.head()

,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,pixel10,...,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784,class
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,9
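
The two tables look identical, but the round trip through CSV can change column types: the labels were strings in `mnist["target"]`, while `pd.read_csv` infers types from the file contents. The sketch below illustrates this on a small stand-in frame written to an in-memory buffer instead of Google Drive, and shows one way to keep the labels as text if that matters downstream.

```python
import io

import pandas as pd

# Stand-in for the saved frame: one pixel column and string labels.
original = pd.DataFrame({"pixel1": [0, 0, 0], "class": ["5", "0", "4"]})

buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Default parsing infers an integer type for the label column.
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Passing dtype keeps the labels as strings.
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={"class": str})
```

Comparing `inferred.dtypes` with `as_text.dtypes` shows the difference; which form is preferable depends on what the later notebooks expect.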
