<a href="https://colab.research.google.com/github/anshupandey/natural_language_processing/blob/master/Ludwig_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Ludwig

Ludwig, is a project of Uber, provides a new data type-based approach to deep learning model design that makes the tool suited for many different applications. Rather than building out the architecture, you just need to specify the data.

You can find many examples in https://uber.github.io/ludwig/examples/

In [1]:
# Let's find out colab system configuration
!cat /etc/os-release

NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic


In [2]:
#Let's find GPU 
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243


In [3]:
# GPU card properties
!nvidia-smi

Sun Jul  5 03:53:49 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P8     8W /  75W |      0MiB /  7611MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [4]:
#Install ludwig
!pip install ludwig

Collecting ludwig
[?25l  Downloading https://files.pythonhosted.org/packages/8c/e3/65161b00dc06b52a0c23e0a2eaea7127f4df8c72ff3212dd9453f2b7d737/ludwig-0.2.2.8.tar.gz (174kB)
[K     |█▉                              | 10kB 26.0MB/s eta 0:00:01[K     |███▊                            | 20kB 6.2MB/s eta 0:00:01[K     |█████▋                          | 30kB 7.6MB/s eta 0:00:01[K     |███████▌                        | 40kB 7.9MB/s eta 0:00:01[K     |█████████▍                      | 51kB 7.3MB/s eta 0:00:01[K     |███████████▎                    | 61kB 8.2MB/s eta 0:00:01[K     |█████████████▏                  | 71kB 8.5MB/s eta 0:00:01[K     |███████████████                 | 81kB 8.7MB/s eta 0:00:01[K     |█████████████████               | 92kB 8.1MB/s eta 0:00:01[K     |██████████████████▉             | 102kB 8.4MB/s eta 0:00:01[K     |████████████████████▊           | 112kB 8.4MB/s eta 0:00:01[K     |██████████████████████▋         | 122kB 8.4MB/s eta 0:00:01[K 

In [5]:
#Download data
!gsutil cp gs://dataset-uploader/bbc/bbc-text.csv .

Copying gs://dataset-uploader/bbc/bbc-text.csv...
- [1 files][  4.8 MiB/  4.8 MiB]                                                
Operation completed over 1 objects/4.8 MiB.                                      


In [6]:
#Import packages and read data
import pandas as pd
df = pd.read_csv('bbc-text.csv')
df_train=df.iloc[:2000,:]
df_test=df.iloc[2001:,:]
df.head()

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...


In [None]:
print(df['text'][0])

tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to high-

In [7]:
df.shape

(2225, 2)

In [None]:
df['category'].value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: category, dtype: int64

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   category  2225 non-null   object
 1   text      2225 non-null   object
dtypes: object(2)
memory usage: 34.9+ KB


In [8]:
#Import packages
import ludwig
from ludwig.api import LudwigModel

#Specify model definition
# train a model
model_definition = {
"input_features":[
      {"name": 'text',
        "type": 'text',
        "encoder": 'parallel_cnn',
        "level": 'word'
       }],
 
"output_features":[
        {"name": 'category',
        "type": 'category'}
],
 
"training":{
 "epochs": 10}
}


The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



In [9]:
#Train model
model = LudwigModel(model_definition)
train_stats = model.train(df_train)

# or load a model
# model = LudwigModel.load(model_path)

# obtain predictions
predictions = model.predict(df_test)

model.close()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  preprocessing_parameters['fill_value'],
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  preprocessing_parameters['fill_value'],


In [10]:
predictions

Unnamed: 0,category_predictions,category_probabilities_<UNK>,category_probabilities_sport,category_probabilities_business,category_probabilities_politics,category_probabilities_entertainment,category_probabilities_tech,category_probability
0,business,4.887402e-13,0.009892,0.781664,0.017228,0.005026,0.186191,0.781664
1,politics,5.115585e-13,0.003417,0.251162,0.539539,0.020601,0.185281,0.539539
2,tech,6.713132e-13,0.072820,0.349900,0.004466,0.062606,0.510208,0.510208
3,sport,1.643938e-13,0.986444,0.000520,0.000216,0.005480,0.007340,0.986444
4,tech,6.469230e-13,0.034109,0.009507,0.454687,0.036036,0.465662,0.465662
...,...,...,...,...,...,...,...,...
219,business,6.689826e-13,0.056447,0.626486,0.010935,0.048469,0.257662,0.626486
220,politics,2.598046e-12,0.012206,0.016530,0.792009,0.024034,0.155221,0.792009
221,entertainment,9.624698e-13,0.146100,0.004364,0.025329,0.424371,0.399836,0.424371
222,tech,8.083551e-13,0.017760,0.011468,0.076598,0.031407,0.862768,0.862768


In [11]:
# Understand accuracy and other metrics through train_stats
train_stats

{'test': OrderedDict([('category',
               OrderedDict([('loss',
                             [2.7345599796872575,
                              1.3366042958299813,
                              0.8917073753868084,
                              0.5824534579482918,
                              0.5721866688124891,
                              0.4973114891620191,
                              0.5836982147273592,
                              0.7099389414633474,
                              0.7022242912879357,
                              0.5019587086093041]),
                            ('accuracy',
                             [0.23821339950372208,
                              0.4094292803970223,
                              0.6625310173697271,
                              0.8660049627791563,
                              0.8039702233250621,
                              0.8213399503722084,
                              0.7468982630272953,
                              0.67