# Data processing and linear regression with Exasol

This tutorial is for users who want to use Exasol as a data source for training a linear regression model. 
However, the data analysis and processing parts may also interest anyone looking for information on how to prepare a dataset for training. 

In this tutorial we will discuss the following topics:

- Part 1. How to import a dataset from a CSV file into an Exasol database.
- Part 2. How to analyze the data in the table.
- Part 3. How to prepare the data for use in a linear regression model.
- Part 4. How to create and train a model.

The model in this example will predict a flight's delay.

### Prerequisites

Readers are assumed to have a basic understanding of Exasol and SQL, as well as basic Python programming knowledge.

## Part 1. Importing data

In this tutorial we use the `Flights.csv` file, which provides information about U.S. domestic flights from 1987 to 2019.
The file contains more than 184 million rows and 109 columns.

Because the CSV file is too big (more than 80 GB), we can't work with it by importing it directly as a Pandas DataFrame.
That is why we use a database both as a warehouse and as a workbench for the subsequent data transformation process.

So, the first step: **we start an Exasol Database**. 
For this tutorial we installed an Exasol Cloud Image using [Exasol Cloud Wizard](https://cloudtools.exasol.com/#/).
If you want to know more about Exasol Cloud Wizard, you can read an article about how to use this tool in our [blog](https://www.exasol.com/en/blog/building-clusters-in-the-sky-exasol-cloud-wizard/).
A local Exasol node was not a good option in our case, because a local machine running such a node processes data significantly more slowly than the cloud image.

The second step: **import the data from the CSV file into an Exasol table**. 

1. We create an SQL CREATE TABLE statement and put it in a separate file named `flights.sql`.
Our table has 109 columns, matching the CSV file. Here is part of the statement:

In [None]:
CREATE OR REPLACE TABLE "FLIGHTS" ("YEAR" INTEGER, "QUARTER" INTEGER, "MONTH" INTEGER, "DAY_OF_MONTH" INTEGER, "DAY_OF_WEEK" DECIMAL(1,0), ... , ...);


2. We connect to Exasol, create a new schema, and create a new table using the statement above.

 To connect and execute queries we use the [pyexasol](https://github.com/badoo/pyexasol) library.


In [None]:
import pyexasol

connection = pyexasol.connect(dsn='host:port', user='username', password='password', compression=True)

connection.execute("CREATE SCHEMA IF NOT EXISTS {schema_name};".format(schema_name="FLIGHTS"))
connection.open_schema("FLIGHTS")
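# Execute the CREATE TABLE statement; this loop assumes one statement per line.
with open('flights.sql', 'r') as create_table_query:
    for line in create_table_query:
        if line.strip():  # skip blank lines
            connection.execute(query=line)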

 3. After the table is ready, we import the CSV file's content into the table using pyexasol. 
    The file is stored in Google Cloud Storage because of its size, so we only pass a link to the file.

In [None]:
connection.execute("IMPORT INTO {table_name} FROM CSV AT '{file_path}' FILE '{file_name}' "
            "COLUMN SEPARATOR = '{column_separator}' SKIP = 1  "
            "ERRORS INTO error_table (CURRENT_TIMESTAMP) REJECT LIMIT UNLIMITED ERRORS".format(
                table_name="FLIGHTS", file_path="https://storage.googleapis.com/our/path/", file_name="flights.single.csv.gz", column_separator=","))
connection.close()

Now you should be able to find all the data in the Exasol table. 
Below is a Python class that performs the process described above.

In [None]:
import simplestopwatch


class CsvImporter:
    def __init__(self, connection):
        self.connection = connection

    def import_file(self, sql_create_table_file, schema_name, table_name, file_path, file_name, column_separator):
        self.__handle_schema(self.connection, schema_name)
        self.__execute_create_table(sql_create_table_file)
        self.__run_import_command(self.connection, table_name, file_path, file_name, column_separator)

    def __execute_create_table(self, sql_create_table_file):
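        # Assumes each line of the file holds one complete statement.
        with open(sql_create_table_file, 'r') as query:
            for line in query:
                if line.strip():  # skip blank lines
                    self.connection.execute(query=line)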

    def __handle_schema(self, connection, schema_name):
        connection.execute("CREATE SCHEMA IF NOT EXISTS " + schema_name + ";")
        connection.open_schema(schema_name)

    def __run_import_command(self, connection, table_name, file_path, file_name, column_separator):
        timer = simplestopwatch.Timer()
        connection.execute(
            "IMPORT INTO {table_name} FROM CSV AT '{file_path}' FILE '{file_name}' "
            "COLUMN SEPARATOR = '{column_separator}' SKIP = 1  "
            "ERRORS INTO error_table (CURRENT_TIMESTAMP) REJECT LIMIT UNLIMITED ERRORS".format(
                table_name=table_name, file_path=file_path, file_name=file_name, column_separator=column_separator))
        timer.stop()
        print("Imported in " + str(timer))
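
`__execute_create_table` runs the SQL file one line at a time, which only works when every statement fits on a single line. If the CREATE TABLE statement is spread over several lines, the file can be split on semicolons instead. The following is a minimal sketch of such a splitter, not part of the original importer: the function name `split_sql_statements` and the `NOTES` table in the example are made up for illustration, and the splitter does not handle SQL comments.

```python
def split_sql_statements(script: str) -> list:
    """Split a SQL script on semicolons, ignoring semicolons inside quotes."""
    statements = []
    current = []
    quote = None  # the quote character we are currently inside, if any
    for char in script:
        if quote:
            # Inside a quoted string or identifier: only the matching quote ends it.
            if char == quote:
                quote = None
            current.append(char)
        elif char in ("'", '"'):
            quote = char
            current.append(char)
        elif char == ";":
            statement = "".join(current).strip()
            if statement:
                statements.append(statement)
            current = []
        else:
            current.append(char)
    # Keep a trailing statement that has no terminating semicolon.
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return statements


# Hypothetical script: a multi-line statement and a semicolon inside a string.
script = """
CREATE OR REPLACE TABLE "FLIGHTS" (
    "YEAR" INTEGER,
    "QUARTER" INTEGER
);
CREATE OR REPLACE TABLE "NOTES" ("TEXT" VARCHAR(100) DEFAULT 'a;b');
"""
statements = split_sql_statements(script)
for statement in statements:
    print(statement)
```

Each element of `statements` could then be passed to `connection.execute` in place of a single line of the file.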

## Part 2. Analyzing data

Now that we have the dataset in an Exasol table, we can start analyzing the data.

First, we should look at the list of columns and decide which of them to use for training the model.

### How do we decide which columns we need?   

We look at the columns one by one and judge whether each of them can be helpful for prediction.
A few examples of columns that we decided to use in our model for delay prediction:
- Day of week
- Departure airport
- Arrival airport etc.

And here are a few columns that we decided NOT to use:
- Flight number: a unique number that represents an exact flight.
- Delay reason: we don't need a reason to predict a delay.
- Cancelled: this information doesn't affect delay.

### How to handle columns with similar information?

As we don't need repeated information, we should select only the one column that suits the model best.
For example, we have 3 columns:
- Origin Airport: name of an airport as a string.
- Origin Airport ID: an identification number assigned by US DOT to identify a unique airport. 
- Origin Airport Sequence ID: an identification number assigned by US DOT to identify a unique airport at a given point of time.

These columns encode the same information but in different ways. For the model we select the second one, Origin Airport ID.
The first reason is that it is numeric. The second is that this code is more stable than the Origin Airport Sequence ID, as it does not change over time.
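
One way to check the stability argument is to count how many distinct sequence IDs each airport ID maps to. The following is a minimal sketch that uses an in-memory SQLite table as a stand-in for Exasol. The column names `ORIGIN_AIRPORT_ID` and `ORIGIN_AIRPORT_SEQ_ID` are assumptions based on the descriptions above, and the rows are made up purely for illustration.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for the Exasol table.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE flights ("YEAR" INTEGER, "ORIGIN_AIRPORT_ID" INTEGER, '
    '"ORIGIN_AIRPORT_SEQ_ID" INTEGER)'
)
# Made-up rows: airport 100 keeps its ID but gets a new sequence ID in 2010.
rows = [
    (2005, 100, 10001),
    (2010, 100, 10002),
    (2005, 200, 20001),
    (2010, 200, 20001),
]
conn.executemany("INSERT INTO flights VALUES (?, ?, ?)", rows)

# Count the distinct sequence IDs seen for every airport ID.
query = (
    'SELECT "ORIGIN_AIRPORT_ID", COUNT(DISTINCT "ORIGIN_AIRPORT_SEQ_ID") '
    'FROM flights GROUP BY "ORIGIN_AIRPORT_ID" ORDER BY "ORIGIN_AIRPORT_ID"'
)
seq_ids_per_airport = conn.execute(query).fetchall()
conn.close()
print(seq_ids_per_airport)  # [(100, 2), (200, 1)]
```

An airport ID with a count above 1 would appear to the model as several different airports if the sequence ID were used as the feature.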

### Analyzing selected columns

The next step is to collect data about the content of the selected columns. 
Here is a short list of questions that can give you an idea of how to analyze a column:

1. How many null values does the column contain?
2. How is the data represented in the column: string, number, date, etc.?
3. How many distinct values does the column contain?
4. What are the maximum and minimum values (for numbers)?
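
Questions 1, 3 and 4 can be answered with a single aggregate query per column. The following is a minimal sketch, again using an in-memory SQLite table as a stand-in for Exasol. The helper name `column_stats_query` and the sample values are made up for illustration, and the helper builds the SQL by plain string formatting, so it should only be given trusted table and column names.

```python
import sqlite3


def column_stats_query(table_name: str, column_name: str) -> str:
    """Build one query returning null count, distinct count, min and max."""
    return (
        'SELECT COUNT(*) - COUNT("{c}") AS NULL_COUNT, '
        'COUNT(DISTINCT "{c}") AS DISTINCT_COUNT, '
        'MIN("{c}") AS MIN_VALUE, MAX("{c}") AS MAX_VALUE '
        'FROM {t}'
    ).format(c=column_name, t=table_name)


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE FLIGHTS ("DAY_OF_WEEK" INTEGER)')
# Made-up sample values, including one NULL.
conn.executemany("INSERT INTO FLIGHTS VALUES (?)", [(1,), (7,), (3,), (None,), (7,)])

# COUNT(column) skips NULLs, so COUNT(*) - COUNT(column) is the null count.
null_count, distinct_count, min_value, max_value = conn.execute(
    column_stats_query("FLIGHTS", "DAY_OF_WEEK")
).fetchone()
conn.close()
print(null_count, distinct_count, min_value, max_value)  # 1 3 1 7
```

Computing all four numbers in one statement scans the table once instead of four times, which can matter on a table of this size.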

For our example we collected the data using pyexasol and then created charts with the plotly library.

In [None]:
import plotly.express as plotly
import pyexasol


class Stats:
    def get_categorical_stats(self, connection: pyexasol.ExaConnection, schema_name: str, table_name: str,
                              column_name: str):
        connection.open_schema(schema_name)
        sum_of_distinct_values = self.__get_query_result(connection,
                                                         'SELECT count (distinct "{column_name}") from {table_name};'
                                                         .format(column_name=column_name, table_name=table_name))
        sum_of_nulls = self.__get_query_result(connection,
                                               'SELECT COUNT(*) from  {table_name} WHERE "{column_name}" IS NULL;'
                                               .format(table_name=table_name, column_name=column_name))
        max_value = self.__get_query_result(connection, 'SELECT MAX("{column_name}") from {table_name};'
                                            .format(column_name=column_name, table_name=table_name))
        min_value = self.__get_query_result(connection, 'SELECT MIN("{column_name}") from {table_name};'
                                            .format(column_name=column_name, table_name=table_name))

        result_set = connection.export_to_pandas(
            'SELECT DISTINCT "{column_name}", COUNT("{column_name}") as sum_of_distinct_values from {table_name} '
            'GROUP BY "{column_name}" ORDER BY "{column_name}";'
                .format(column_name=column_name, table_name=table_name))

        bar = plotly.bar(result_set, x=(column_name), y="SUM_OF_DISTINCT_VALUES",
                         title="Column: " + column_name + ", sum of dist values=" + str(
                             sum_of_distinct_values) + ", nulls=" + str(sum_of_nulls) + ", max value=" + str(
                             max_value) + ", min value=" + str(min_value))
        bar.layout.xaxis.type = 'category'
        return bar

    def __get_query_result(self, connection: pyexasol.ExaConnection, query: str):
        iterable_query_result = connection.execute(query)
        counter = 0
        for row in iterable_query_result:
            counter = row[0]
        return counter


And here is an example of a chart:
<img src="img/img1.png">