## Get the data

Real-world data tends to live in relational databases and Amazon S3 buckets, so brush up on your SQL and file-processing skills. Luckily, languages like Python offer a wide variety of helpers, both built in and as external packages.

Working with raw files requires an understanding of how text and image files are stored, and how they can be loaded and displayed.
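As a small illustration of the "how files are stored" point, here is a sketch that writes a short text file and reads it back both as decoded text and as raw bytes. The file name and its contents are made up for the example; the same bytes-versus-decoded distinction applies to images, which are stored as encoded bytes and decoded into pixel arrays by an image library.

```python
import tempfile
from pathlib import Path

# Write a small text file so the example is self-contained (hypothetical file name)
tmp_dir = Path(tempfile.mkdtemp())
text_path = tmp_dir / "sample.txt"
text_path.write_text("na\u00efve caf\u00e9", encoding="utf-8")

text = text_path.read_text(encoding="utf-8")  # decoded characters
raw = text_path.read_bytes()                  # the bytes actually stored on disk

# Accented characters take more than one byte in UTF-8,
# so the two lengths differ
print(len(text), len(raw))
```

Reading with the wrong encoding is a common source of garbled text, so it helps to state the encoding explicitly instead of relying on the platform default.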

### Start with a sample

Get 10 or 100 examples. Try to get a somewhat representative subset of the data, but don't try too hard: a simple random subset will do (for now).
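A minimal sketch of drawing such a sample, assuming your records are already in a Python list. The `records` list and the `take_sample` helper are made up for illustration; fixing the seed keeps the sample reproducible between runs.

```python
import random

def take_sample(records, n=10, seed=42):
    """Return a reproducible simple random sample of up to n records."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return rng.sample(records, min(n, len(records)))

# Hypothetical stand-in for a real dataset
records = [{"id": i, "text": f"example {i}"} for i in range(1000)]
sample = take_sample(records, n=10)
print(len(sample))
```

If the data lives in a database instead, the same idea can usually be expressed in the query itself, so that only the sampled rows leave the database.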

### Request more data

Some additional data might be easy to collect and store for you. How easy would it be to get that done? Can someone else do it for you?

## Data from a relational database

One easy way to deal with data stored in a database is to use a tool to extract what you need and convert it to a Pandas data frame.

We'll use data provided by OSMI (Open Sourcing Mental Illness) on mental health in the tech industry. It contains survey responses about mental health disorders and how common they are across the industry. Let's download it:

In [1]:
!wget -q https://github.com/curiousily/Hackers-Guide-to-Deep-Learning/raw/master/data/mental_health.sqlite

The SQLite file contains data that is similar to what you might have in your production systems. Let's load it and see what tables it contains:

In [2]:
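from sqlalchemy import create_engine, inspect

db_engine = create_engine("sqlite:///mental_health.sqlite")

# Engine.table_names() was removed in SQLAlchemy 2.0; the inspector works across versions
inspect(db_engine).get_table_names()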

['Answer', 'Question', 'Survey']

We can use Pandas to look at a sample of each table:

In [3]:
import pandas as pd

pd.read_sql("SELECT * FROM Answer LIMIT 5", db_engine)

| | AnswerText | SurveyID | UserID | QuestionID |
|---|---|---|---|---|
| 0 | 37 | 2014 | 1 | 1 |
| 1 | 44 | 2014 | 2 | 1 |
| 2 | 32 | 2014 | 3 | 1 |
| 3 | 31 | 2014 | 4 | 1 |
| 4 | 31 | 2014 | 5 | 1 |


In [4]:
pd.read_sql("SELECT * FROM Question LIMIT 5", db_engine)

| | questiontext | questionid |
|---|---|---|
| 0 | What is your age? | 1 |
| 1 | What is your gender? | 2 |
| 2 | What country do you live in? | 3 |
| 3 | If you live in the United States, which state ... | 4 |
| 4 | Are you self-employed? | 5 |


In [5]:
pd.read_sql("SELECT * FROM Survey LIMIT 5", db_engine)

| | SurveyID | Description |
|---|---|---|
| 0 | 2014 | mental health survey for 2014 |
| 1 | 2016 | mental health survey for 2016 |
| 2 | 2017 | mental health survey for 2017 |
| 3 | 2018 | mental health survey for 2018 |
| 4 | 2019 | mental health survey for 2019 |


Pretty straightforward: we have questions and answers from different people.

Ideally, we want all of this in a single data frame. A simple JOIN gets us there. Note that we don't need the Survey table.

In [6]:
sql_statement = """
  SELECT Answer.*, Question.questiontext
  FROM Answer
  INNER JOIN Question ON Answer.QuestionID=Question.questionid
"""
survey_df = pd.read_sql(sql_statement, db_engine)
survey_df.head()

| | AnswerText | SurveyID | UserID | QuestionID | questiontext |
|---|---|---|---|---|---|
| 0 | 37 | 2014 | 1 | 1 | What is your age? |
| 1 | 44 | 2014 | 2 | 1 | What is your age? |
| 2 | 32 | 2014 | 3 | 1 | What is your age? |
| 3 | 31 | 2014 | 4 | 1 | What is your age? |
| 4 | 31 | 2014 | 5 | 1 | What is your age? |


We can do ourselves a favor and make this easier to read/work with:

In [18]:
survey_df.columns = ["answer", "survey_year", "user_id", "question_id", "question"]
survey_df = survey_df[["question", "answer", "user_id", "survey_year", "question_id"]]
survey_df.head()

| | question | answer | user_id | survey_year | question_id |
|---|---|---|---|---|---|
| 0 | What is your age? | 37 | 1 | 2014 | 1 |
| 1 | What is your age? | 44 | 2 | 2014 | 1 |
| 2 | What is your age? | 32 | 3 | 2014 | 1 |
| 3 | What is your age? | 31 | 4 | 2014 | 1 |
| 4 | What is your age? | 31 | 5 | 2014 | 1 |


You can use this same technique for datasets with millions of rows to get a regular Pandas data frame and go from there.
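When the result is large, one option is to read it in pieces. The sketch below assumes an in-memory SQLite database with made-up rows standing in for a big production table; it relies on the `chunksize` parameter of `pd.read_sql`, which makes the call yield data frames of at most that many rows instead of one big frame.

```python
import sqlite3

import pandas as pd

# In-memory database with made-up rows, standing in for a large production table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Answer (AnswerText TEXT, UserID INTEGER)")
conn.executemany(
    "INSERT INTO Answer VALUES (?, ?)",
    [(str(20 + i), i) for i in range(1, 11)],
)

# With chunksize set, read_sql yields data frames of at most that many rows
chunks = pd.read_sql("SELECT * FROM Answer", conn, chunksize=4)

# Each chunk could be filtered or aggregated here before being combined
big_df = pd.concat([chunk for chunk in chunks], ignore_index=True)
print(big_df.shape)
```

Concatenating every chunk still needs enough memory for the full result, so chunking pays off mainly when each chunk is reduced (filtered, sampled, or aggregated) before it is kept.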

## Text data

Working with text requires a way of converting words, subwords or characters into numbers. This process is known as tokenization. The [Tokenizers](https://huggingface.co/docs/tokenizers) library by Hugging Face provides an easy way to train a custom tokenizer or use a prebuilt one.
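To make the idea concrete before reaching for a library, here is a toy word-level tokenizer. It is only a sketch: the `build_vocab` and `encode` helpers are made up for illustration, and real tokenizers (including the ones in the Tokenizers library) handle punctuation, subwords and special tokens far more carefully.

```python
def build_vocab(texts):
    """Assign an integer id to every distinct lowercase word, in order of first appearance."""
    vocab = {"[UNK]": 0}  # id 0 is reserved for words missing from the vocabulary
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab):
    """Map each word to its id, falling back to the unknown-word id."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

vocab = build_vocab(["what is your age", "what is your gender"])
ids = encode("what is your salary", vocab)
print(ids)  # [1, 2, 3, 0]
```

The word "salary" was never seen while building the vocabulary, so it maps to the unknown id. Subword tokenizers reduce this problem by splitting rare words into smaller known pieces.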

Let's install the library and download the required vocabulary for one of the prebuilt tokenizers:

In [9]:
!pip install -q tokenizers==0.9.4
!wget -q https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt

To create a tokenizer, we need to pass the vocabulary file:

In [16]:
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)

We can pick a question text from the dataset seen previously:

In [22]:
sample_question_text = survey_df.iloc[0].question
sample_question_text

'What is your age?'

And encode it:

In [30]:
encoding = tokenizer.encode(sample_question_text)
encoding.ids, encoding.tokens

([101, 2054, 2003, 2115, 2287, 1029, 102],
 ['[CLS]', 'what', 'is', 'your', 'age', '?', '[SEP]'])

The encoding maps each token to its ID. It also contains two tokens that are not part of our text: `[CLS]` and `[SEP]`. These special tokens are specific to the model (BERT in this case) that this tokenizer belongs to; they mark the start and the end of the sequence.

Using all this knowledge, we can build a simple function that converts text to a tensor:

In [37]:
import torch

def text_to_tensor(text, tokenizer):
  encoding = tokenizer.encode(text)
  # Token ids are integer indices, so keep them as a long tensor rather than floats
  return torch.tensor(encoding.ids, dtype=torch.long)

text_to_tensor(sample_question_text, tokenizer)

tensor([ 101, 2054, 2003, 2115, 2287, 1029,  102])

Of course, you might need something much more sophisticated, but this should get you started on a baseline text model.
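One of the first things a more sophisticated version tends to need is batching: texts have different lengths, while a batch tensor needs rows of equal length. The sketch below pads lists of token ids in plain Python. The `pad_batch` helper is hypothetical, and `pad_id=0` is chosen because `[PAD]` has id 0 in the `bert-base-uncased` vocabulary; other vocabularies may differ.

```python
def pad_batch(id_lists, pad_id=0):
    """Right-pad token id lists to a common length and build matching attention masks."""
    max_len = max(len(ids) for ids in id_lists)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in id_lists]
    # 1 marks a real token, 0 marks padding that the model should ignore
    mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in id_lists]
    return padded, mask

# The first list is the encoding of "What is your age?" shown earlier;
# the second is a made-up shorter sequence
batch = [[101, 2054, 2003, 2115, 2287, 1029, 102], [101, 2054, 1029, 102]]
padded, mask = pad_batch(batch)
print(padded[1])
print(mask[1])
```

Both nested lists have rows of equal length, so they can be converted to tensors directly, for example with `torch.tensor(padded)`.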

## Image data

## Labelling

Don't have enough data, or any data at all? You can create your own dataset, but that is not an easy task.

## Look at the data

## Is the data usable?

## Splitting

## Versioning

In [None]:
!pip install -q watermark
%reload_ext watermark

In [55]:
%watermark -p torch,sqlalchemy,tokenizers,pandas,numpy,scipy,sklearn

torch 1.7.0+cu101
sqlalchemy 1.3.20
tokenizers 0.9.4
pandas 1.1.4
numpy 1.18.5
scipy 1.4.1
sklearn 0.0
