Merge pull request #899 from BlazZupan/doc-load-data
Documentation (loading data): Converted to rst. Corrected few typos. Centered images.
BlazZupan committed Dec 10, 2015
2 parents cf60e68 + feeedde commit 9f6bbe7
Showing 3 changed files with 61 additions and 41 deletions.
@@ -1,26 +1,25 @@
Loading your Data
=================

Orange comes with its [own data format] (http://docs.orange.biolab.si/reference/rst/Orange.data.formats.html#tab-delimited), but can
Orange comes with its `own data format <http://docs.orange.biolab.si/reference/rst/Orange.data.formats.html#tab-delimited>`_, but can
also handle Excel (.xlsx), comma- or tab-delimited data files. The input data
set is usually a table, with data instances (samples) in rows and
data attributes in columns. Attributes can be of different types
(continuous, discrete, and strings), with different elements (input variables, meta
attributes, and class). Data attribute type and element can be provided
(continuous, discrete, and strings), with different elements (input variables, meta attributes, and class). Data attribute type and element can be provided
in the data table header, and can be changed later, after reading the
data, with several specialized widgets, such as Select Rows and [Select Columns] (http://docs.orange.biolab.si/widgets/rst/data/selectattributes.html#select-attributes).
data, with several specialized widgets, such as Select Rows and `Select Columns <http://docs.orange.biolab.si/widgets/rst/data/selectattributes.html#select-attributes>`_.

In a Nutshell
-------------

- Orange can import any comma, .xlsx or tab-delimited data file. Use [File] (http://docs.orange.biolab.si/widgets/rst/data/file.html#file)
- Orange can import any comma-delimited, .xlsx, or tab-delimited data file. Use the `File <http://docs.orange.biolab.si/widgets/rst/data/file.html#file>`_
widget and then, if needed, select class and meta attributes in
[Select Columns] (http://docs.orange.biolab.si/widgets/rst/data/selectattributes.html#select-attributes) widget.
`Select Columns <http://docs.orange.biolab.si/widgets/rst/data/selectattributes.html#select-attributes>`_ widget.
- To specify the domain and the type of the attribute, attribute names
can be preceded with a label followed by a hash. Use c for class
and m for meta attribute, i to ignore a column, and C, D, S for
continuous, discrete and string attribute types. Examples: C\#mpg,
mS\#name, i\#dummy. Make sure to set **Import Options** in [File] (http://docs.orange.biolab.si/widgets/rst/data/file.html#file)
mS\#name, i\#dummy. Make sure to set **Import Options** in `File <http://docs.orange.biolab.si/widgets/rst/data/file.html#file>`_
widget and set the header to **Orange simplified header**.
- Orange's native format is a tab-delimited text file with three
header rows. The first row contains attribute names, the second the
@@ -30,79 +29,91 @@ In a Nutshell
Data from Excel
---------------

Orange 3.0 recognises Excel files directly, thus simply open your .xlsx file in the program.
Orange 3.0 can read .xlsx files from Excel. Here is an example data set (:download:`sample.xlsx <sample.xlsx>`) in Excel:

<img src="spreadsheet1.png" alt="image" width="400">
.. image:: spreadsheet1.png
   :width: 600 px
   :align: center

To load the data set in Orange, we can design a simple workflow with
File and Data Table widgets,
In Orange, let us start with a simple workflow with File and Data Table widgets,

<img src="file-data-table-workflow.png" alt="image">
.. image:: file-data-table-workflow.png
   :align: center

open the File widget (double click on its icon) and click on the file
browser icon,
and then load the data from Excel by opening the File widget (double click on the icon of the widget) and clicking on the file browser icon ("..."),

<img src="loadingyourdata.png" alt="image" width="400">
.. image:: loadingyourdata.png
   :width: 600 px
   :align: center

locate the data file ( e.g. [sample.xlsx] (sample.xlsx)) and open
locate the data file (e.g. :download:`sample.xlsx <sample.xlsx>`) and open
it. The **File** widget sends data to the **Data Table** widget, which displays the following result:

<img src="file-widget.png" alt="image" width="400">
.. image:: file-widget.png
   :width: 900 px
   :align: center

Notice that our data contains 8 data instances (rows) and 7 data
attributes (columns).
Question marks in the data table denote missing data entries. These
entries correspond to empty cells in the Excel table. Rows in our
exemplary data set represent genes, with values in the first column
attributes (columns). Question marks in the data table denote missing data entries. These entries correspond to empty cells in the Excel table. Rows in our example data set represent genes, with values in the first column
denoting a gene class. The second column stores gene names, while the
remaining columns record measurements that characterize each gene. Gene
class can be used for classification. Gene name is meta information, a
label that is not relevant to any data mining algorithm, but can identify
a data instance in, say, visualizations like scatter plot. We need to
tell Orange that these first two columns are special. One way to do this
within Orange is through [Select Columns] (http://docs.orange.biolab.si/widgets/rst/data/selectattributes.html#select-attributes) widget:
within Orange is through the `Select Columns <http://docs.orange.biolab.si/widgets/rst/data/selectattributes.html#select-attributes>`_ widget:

![image](select-columns-schema.png)
.. image:: select-columns-schema.png
   :align: center

Opening the [Select Columns] (http://docs.orange.biolab.si/widgets/rst/data/selectattributes.html#select-attributes) widget reveals that in our input data file
Opening the `Select Columns <http://docs.orange.biolab.si/widgets/rst/data/selectattributes.html#select-attributes>`_ widget reveals that in our input data file
all six columns are treated as ordinary attributes (input variables),
with the only distinction being that the first variable is categorical
(discrete) and the other five are real-valued (continuous):

<img src="select-columns-start.png" alt="image" width="400">
.. image:: select-columns-start.png
   :width: 600 px
   :align: center

To correctly reassign attribute types, drag the attribute named ``function``
to the **Class** box, and the attribute named ``gene`` to the **Meta Attribute**
box. The [Select Columns] (http://docs.orange.biolab.si/widgets/rst/data/selectattributes.html#select-attributes) widget should now look like this:
box. The `Select Columns <http://docs.orange.biolab.si/widgets/rst/data/selectattributes.html#select-attributes>`_ widget should now look like this:

<img src="select-columns-reassigned.png" alt="image" width="400">
.. image:: select-columns-reassigned.png
   :width: 500 px
   :align: center

Change of attribute types in *Select Columns* widget should be confirmed
by clicking the **Apply** button. The data from this widget is fed into
[Data Table] (http://docs.orange.biolab.si/widgets/rst/data/datatable.html#data-table) widget, that now renders class and meta attributes in a
`Data Table <http://docs.orange.biolab.si/widgets/rst/data/datatable.html#data-table>`_ widget, which now renders class and meta attributes in a
color different from those of input features:

<img src="data-table-with-class1.png" alt="image" width="400">
.. image:: data-table-with-class1.png
   :width: 500 px
   :align: center

We could also define the domain for this data set in a different way.
Say, we could make the data set ready for regression, and use ``heat 0``
as a continuous class variable, keep gene function and name as meta
variables, and remove ``heat 10`` and ``heat 20`` from the data set (making
these two attributes available for type assignment, without including
them in the data on the output of [Select Columns] (http://docs.orange.biolab.si/widgets/rst/data/selectattributes.html#select-attributes) widget):
them in the data on the output of `Select Columns <http://docs.orange.biolab.si/widgets/rst/data/selectattributes.html#select-attributes>`_ widget):

<img src="select-columns-regression.png" alt="image" width="400">
.. image:: select-columns-regression.png
   :width: 500 px
   :align: center

By setting the attributes as above, the rendering of the data in the
Data Table widget gives the following output:

<img src="data-table-regression1.png" alt="image" width="400">
.. image:: data-table-regression1.png
   :width: 600 px
   :align: center

Header with Attribute Type Information
--------------------------------------

Let us open the
[sample.xlsx] (sample.xlsx) data set in Excel again. This time,
Let us open the :download:`sample.xlsx <sample.xlsx>` data set in Excel again. This time,
however, we will augment the names of the attributes with prefix
characters expressing attribute type (class or meta attribute) and/or
its domain (continuous, discrete, string), and separate them from the
@@ -119,13 +130,17 @@ and for the domain:
- S: String

This is what the header with augmented attribute names looks like in
Excel [sample-head.xlsx] (sample-head.xlsx):
Excel (:download:`sample-head.xlsx <sample-head.xlsx>`):

<img src="spreadsheet-simple-head1.png" alt="image" width="400">
.. image:: spreadsheet-simple-head1.png
   :width: 500 px
   :align: center

We can again use a [Data Table] (http://docs.orange.biolab.si/widgets/rst/data/datatable.html#data-table) widget to read the data from Excel file. Orange will automatically recognize attribute values, which is evident in the modified class icons:
We can again use a `Data Table <http://docs.orange.biolab.si/widgets/rst/data/datatable.html#data-table>`_ widget to read the data from the Excel file. Orange will automatically recognize attribute values, which is evident in the modified class icons:

<img src="file-widget-simplified-header-example.png" alt="image" width="400">
.. image:: file-widget-simplified-header-example.png
   :width: 500 px
   :align: center

Notice that the attributes we have ignored (label "i" in the
attribute name) are not present in the data set.
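
The prefix convention described in this section can be illustrated with a short Python sketch. This is not Orange's own parser; ``parse_header_name`` is a hypothetical helper written only for illustration, and its fallback behaviour (treating a cell with an unrecognized prefix as a plain attribute name) is an assumption of the sketch.

```python
# Illustrative only: a hypothetical helper, not part of Orange.
ROLES = {"c": "class", "m": "meta", "i": "ignore"}
TYPES = {"C": "continuous", "D": "discrete", "S": "string"}


def parse_header_name(header):
    """Split a header cell such as 'mS#name' into (role, type, name).

    role and type are None when the prefix does not specify them.
    """
    prefix, sep, name = header.partition("#")
    # Assumption of this sketch: without a hash, or with characters in the
    # prefix that are not known flags, the cell is a plain attribute name.
    if not sep or any(f not in ROLES and f not in TYPES for f in prefix):
        return None, None, header
    role = None
    var_type = None
    for flag in prefix:
        if flag in ROLES:
            role = ROLES[flag]
        else:
            var_type = TYPES[flag]
    return role, var_type, name


for cell in ["C#mpg", "mS#name", "i#dummy", "weight"]:
    print(cell, "->", parse_header_name(cell))
```

The lowercase flags select the role and the uppercase flags select the type, which is why a combined prefix such as ``mS`` can carry both pieces of information.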
@@ -139,7 +154,9 @@ their domain (continuous, discrete and string, or abbreviated c, d and
s), and the third row an optional type (class, meta, or ignore). Here is
an example:

<img src="excel-with-tab1.png" alt="image" width="400">
.. image:: excel-with-tab1.png
   :width: 500 px
   :align: center
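
A file with the same three-row header can also be produced without a spreadsheet. The following is a minimal sketch using Python's standard ``csv`` module. The column names follow the sample data set discussed earlier, while the two data rows and the ``write_tab`` helper are made up for illustration and are not taken from ``sample.xlsx``.

```python
import csv
import io

# Column names follow the sample data set; the data rows below are
# placeholder values invented for this sketch.
names = ["function", "gene", "heat 0", "heat 10", "heat 20"]
types = ["discrete", "string", "continuous", "continuous", "continuous"]
roles = ["class", "meta", "", "", ""]  # third row is optional per column
rows = [
    ["groupA", "gene1", "0.1", "0.2", ""],  # empty cell left for a missing entry
    ["groupB", "gene2", "0.3", "", "0.5"],
]


def write_tab(stream, names, types, roles, rows):
    """Write a tab-delimited table with a three-row header to a text stream."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    writer.writerow(names)  # row 1: attribute names
    writer.writerow(types)  # row 2: attribute domains
    writer.writerow(roles)  # row 3: class, meta, ignore, or empty
    writer.writerows(rows)


buffer = io.StringIO()
write_tab(buffer, names, types, roles, rows)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, the same function can be given a file opened with ``open("sample.tab", "w", newline="")``, where the file name is only an example.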

The above screenshot is from Excel, but the file was actually saved using "Tab Delimited Text (.txt)" format. If you want to
save your files in .tab format, you have to rename the file so that it ends with ".tab" extension (say from sample.txt to
@@ -152,6 +169,9 @@ Saving Files in LibreOffice

If you are using LibreOffice, simply save your files in .xlsx format (available from the drop-down menu under *Save As Type*).

![image](saving-tab-delimited-files.png)
.. image:: saving-tab-delimited-files.png
   :align: center

<img src="saving-tab-delimited-files2.png" alt="image" width="400">
.. image:: saving-tab-delimited-files2.png
   :width: 500 px
   :align: center
Binary file not shown.
