# 1. Set up the Kaggle CLI and download the dataset in Google Colab

Because all data is lost when a Google Colab session ends, the six steps below download the dataset directly from Kaggle so that you do not have to upload it by hand every time. Steps 1 and 2 are done manually, once. Steps 3 to 6 are carried out by running the three code cells below, which you have to re-run every time you start a new session.
  

1. Create a Kaggle account and download the kaggle.json API credentials. See https://github.com/Kaggle/kaggle-api for details.
2. Upload the kaggle.json file to your Google Drive.
3. Run the first cell below to copy kaggle.json from Google Drive to your Colab environment.
4. When prompted, click the link and enter the verification code.
5. Install the Kaggle CLI with pip.
6. Download the dataset.




In [3]:
# Code from https://medium.com/@move37timm/using-kaggle-api-for-google-colaboratory-d18645f93648
# Create kaggle.json by following instructions at https://github.com/Kaggle/kaggle-api
# Upload kaggle.json to google drive
# Download kaggle.json to Colab from the user's Google Drive

from googleapiclient.discovery import build
import io, os
from googleapiclient.http import MediaIoBaseDownload
from google.colab import auth
auth.authenticate_user()
drive_service = build('drive', 'v3')
results = drive_service.files().list(
        q="name = 'kaggle.json'", fields="files(id)").execute()
kaggle_api_key = results.get('files', [])
filename = "/root/.kaggle/kaggle.json"
# Create /root/.kaggle if it does not exist yet
os.makedirs(os.path.dirname(filename), exist_ok=True)
request = drive_service.files().get_media(fileId=kaggle_api_key[0]['id'])
fh = io.FileIO(filename, 'wb')
downloader = MediaIoBaseDownload(fh, request)
done = False
while done is False:
    status, done = downloader.next_chunk()
    print("Download %d%%." % int(status.progress() * 100))
# File modes are octal: 0o600 means read/write for the owner only
os.chmod(filename, 0o600)

Download 100%.
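
The credentials file should be readable and writable by its owner only. Python expects file modes as octal literals, so the leading `0o` matters: the decimal number 600 is a different mode (`0o1130`). The following sketch checks the mode of a file after `os.chmod`. It uses a temporary scratch file as a stand-in for `/root/.kaggle/kaggle.json`; the names `path`, `mode`, and `owner_only` are illustrative only.

```python
import os
import stat
import tempfile

# Hypothetical scratch file standing in for /root/.kaggle/kaggle.json
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "kaggle.json")
    with open(path, "w") as fh:
        fh.write("{}")

    # Octal literal: read/write for the owner, nothing for group and others
    os.chmod(path, 0o600)

    # Read the permission bits back from the file system
    mode = stat.S_IMODE(os.stat(path).st_mode)

    # True when no group/other permission bit is set
    owner_only = (mode & 0o077) == 0

print(oct(mode), owner_only)
```

The same check can be pointed at the real credentials path if the Kaggle CLI complains about file permissions.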


In [5]:
# Install kaggle cli
!pip install kaggle



In [6]:
# Download the dataset for the digit-recognizer challenge
!kaggle competitions download -c digit-recognizer

Downloading train.csv to /content
 82% 60.0M/73.2M [00:00<00:00, 84.2MB/s]
100% 73.2M/73.2M [00:00<00:00, 122MB/s] 
Downloading test.csv to /content
 80% 39.0M/48.8M [00:00<00:00, 52.2MB/s]
100% 48.8M/48.8M [00:00<00:00, 120MB/s] 
Downloading sample_submission.csv to /content
  0% 0.00/235k [00:00<?, ?B/s]
100% 235k/235k [00:00<00:00, 58.2MB/s]


# 2. Read the data into pandas DataFrames
1. Check that the train and test csv files have been downloaded
2. Import pandas and numpy and create the train and test dataframes from the respective csv files
3. Inspect the dataframes
4. Convert them to numpy arrays for the train, validation, and test sets

In [None]:
# Check train and test csv files exist
!ls -ltr

In [None]:
# Read the csv files using pandas
import pandas as pd
import numpy as np
df_tr = pd.read_csv('train.csv')
df_te = pd.read_csv('test.csv')


In [None]:
# Examine the contents of train.csv
# Contains 28x28 pixel values and the corresponding digit label
df_tr.info()  # info() prints its summary itself and returns None
df_tr.head()


In [None]:
# Examine the contents of test.csv
# Contains only the 28x28 pixel values without the corresponding digit label
df_te.info()  # info() prints its summary itself and returns None
df_te.head()

In [None]:
# Split the training data into pixels (independent variable) and labels (dependent variable)
X = np.asarray(df_tr.drop('label', axis=1), dtype=np.float32).reshape(-1, 28, 28)
y = np.asarray(df_tr['label'])

# Build a random boolean mask that selects roughly 20% of the labelled data for validation
validx = (np.random.uniform(size=len(X)) <= 0.2)

# Create the training set (roughly 80% of the labelled data)
X_trn = X[~validx]
y_trn = y[~validx]

# Create the validation set (roughly 20% of the labelled data)
X_val = X[validx]
y_val = y[validx]

# Create the test set
X_tes = np.asarray(df_te,dtype=np.float32).reshape(-1,28,28)
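
The random mask above selects each item independently, so the validation set is only approximately 20% of the data and differs between runs because no seed is set. If an exact and repeatable split is preferred, one alternative is a seeded permutation of the indices. The sketch below is an assumption, not part of the original notebook: the helper `split_indices`, its parameters, and the `*_demo` arrays are illustrative names, and the demo arrays only mimic the layout of the real data.

```python
import numpy as np


def split_indices(n_items, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) from a seeded shuffle of range(n_items)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_items)
    n_val = int(round(n_items * val_fraction))
    # The first n_val shuffled indices form the validation set
    return shuffled[n_val:], shuffled[:n_val]


# Stand-in arrays with the same layout as the pixel and label arrays above
X_demo = np.zeros((1000, 28, 28), dtype=np.float32)
y_demo = np.arange(1000) % 10

trn_idx, val_idx = split_indices(len(X_demo))
X_trn_demo, y_trn_demo = X_demo[trn_idx], y_demo[trn_idx]
X_val_demo, y_val_demo = X_demo[val_idx], y_demo[val_idx]

print(X_trn_demo.shape, X_val_demo.shape)
```

Calling `split_indices` again with the same `seed` reproduces the same split, which makes validation scores comparable across sessions.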

# 3. Visualize some of the data items
1. Import matplotlib
2. Visualize the first few data items and verify that the corresponding labels match

In [None]:
# Concatenate nvis images horizontally and display them using matplotlib
import matplotlib.pyplot as plt
nvis = 12
plt.imshow(np.concatenate(X_trn[:nvis],axis=1),cmap='gray',vmin=0,vmax=255)
plt.show()

# Print the corresponding labels to check they match
y_trn[:nvis]

# 4. Create a fully connected neural network in PyTorch

We follow the same steps as in [assignment 4](https://github.com/dilthoms/ai-ml-assignments/blob/master/AI-ML-Libs/sklearn-pytorch.ipynb)

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

# Create a class, define the layers in __init__, and implement the
# forward propagation. PyTorch computes the backward propagation
# for you automatically.

class SingleHidden_NN(nn.Module):
    '''
    A neural network with a single hidden layer.
    '''

    # Create a constructor and define the layers
    def __init__(self, input_size, hidden_size, output_size):
        '''
        Arguments:
            input_size  : The number of neurons in the input layer
            hidden_size : The number of neurons in the hidden layer
            output_size : The number of neurons in the output layer
        '''
        super().__init__()
        self.input_size = input_size

        # Define a PyTorch linear layer that connects the input layer to the hidden layer
        self.layer1 = nn.Linear(input_size, hidden_size)
        # Define a PyTorch linear layer that connects the hidden layer to the output layer
        self.layer2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        '''
        Implement forward propagation with ReLU activation for the hidden layer.
        Arguments:
            x      : The input batch; it is flattened to shape (-1, input_size)
        Returns:
            output : The linear activation from the output layer
        '''
        output = self.layer2(F.relu(self.layer1(x.view(-1, self.input_size))))
        return output

In [None]:
# Create a Dataset subclass for loading datasets in numpy arrays
# See https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class

from torch.utils.data import Dataset, DataLoader

class Numpy_XY_Dataset(Dataset):
  '''
  Dataset subclass that wraps a pair of numpy arrays (X, y),
  used here for the MNIST digits dataset
  '''

  def __init__(self, X, y):
    '''
    Store the independent (X) and dependent (y) variables
    '''
    super().__init__()
    assert len(X) == len(y)
    self.X = X
    self.y = y

  def __len__(self):
    '''
    Return the size of the dataset
    '''
    return len(self.X)

  def __getitem__(self, idx):
    '''
    Return the data item at index idx
    '''
    return self.X[idx], self.y[idx]
    

In [None]:
# Write the training loop here. It must create and train `model`,
# which the prediction cell below uses.
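
The training loop is left open here. The following is only one possible shape for such a loop, not a reference solution. Everything specific in it is an assumption: the Adam optimizer, cross-entropy loss, batch size 64, two epochs, scaling pixel values to [0, 1], the helper name `train`, and the stand-in `nn.Sequential` model and random data, which are used so that the block runs on its own.

```python
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def train(model, loader, epochs=2, lr=1e-3):
    """Train `model` with cross-entropy loss; return the mean loss per epoch."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    history = []
    for _ in range(epochs):
        model.train()
        total, count = 0.0, 0
        for xb, yb in loader:
            optimizer.zero_grad()            # clear gradients from the last step
            loss = criterion(model(xb), yb)  # forward pass and loss
            loss.backward()                  # backward pass
            optimizer.step()                 # parameter update
            total += loss.item() * len(xb)
            count += len(xb)
        history.append(total / count)
    return history


# Random stand-in for the training arrays (assumption, not real data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 255, size=(256, 28, 28)).astype(np.float32)
y_demo = rng.integers(0, 10, size=256).astype(np.int64)  # class labels as int64

demo_loader = DataLoader(
    TensorDataset(torch.from_numpy(X_demo / 255.0), torch.from_numpy(y_demo)),
    batch_size=64, shuffle=True)

# Stand-in for the single-hidden-layer network defined earlier
demo_model = nn.Sequential(
    nn.Flatten(), nn.Linear(28 * 28, 100), nn.ReLU(), nn.Linear(100, 10))

losses = train(demo_model, demo_loader)
print(losses)
```

In the notebook itself, the loader would be built from the training arrays and the model would be an instance of the network class defined above, assigned to `model`.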

In [None]:
# Generate predictions using the trained model
# TODO: use dataloader for X_tes. Works for now since it is small
with torch.no_grad():
  _,res = torch.max(model(torch.from_numpy(X_tes)),1)
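
The TODO above suggests a DataLoader for the test set. A minimal sketch of batched prediction follows, assuming a model that maps a float32 batch to one score per class. The helper name `predict_in_batches`, the batch size of 256, and the stand-in model and zero-filled data are assumptions made so that the block runs on its own.

```python
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def predict_in_batches(model, X, batch_size=256):
    """Return the predicted class index for every item of the numpy array X."""
    loader = DataLoader(TensorDataset(torch.from_numpy(X)), batch_size=batch_size)
    model.eval()
    preds = []
    with torch.no_grad():
        # A TensorDataset with one tensor yields one-element batches
        for (xb,) in loader:
            preds.append(model(xb).argmax(dim=1))
    return torch.cat(preds).numpy()


# Stand-in model and data (assumptions, not the trained network)
demo_model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
X_demo = np.zeros((600, 28, 28), dtype=np.float32)

demo_preds = predict_in_batches(demo_model, X_demo)
print(demo_preds.shape)
```

Because the loader is not shuffled, the predictions come back in the same order as the rows of the input array, which is what the submission file needs.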

In [None]:
# Convert the results to a pandas dataframe
# ImageId is 1-based; convert the prediction tensor to a numpy array first
sub = pd.DataFrame({"ImageId": np.arange(1, len(res) + 1), "Label": res.numpy()})

# Create the submission csv file from the dataframe
sub.to_csv("sub.csv",index=False)

In [None]:
# Submit the csv file to kaggle using the kaggle api
!kaggle competitions submit -c digit-recognizer -f sub.csv -m "First attempt"

