# 1. Introduction

In this tutorial, I will show how to load and export a dataset using Python's built-in open function and functions from the NumPy and pandas modules. Let's import these modules first.

In [2]:
import numpy as np
import pandas as pd

# 2. Load the dataset.

Here we are going to use the dataset from https://github.com/llSourcell/Intro_to_the_Math_of_intelligence/blob/master/data.csv and see how we could play with it using some basic Python functions.

## 2.1 Open function

The open function is a Python built-in function that is commonly used to read text files. Pay attention to the **mode** argument when you use this function to read a file.

In [1]:
data_open = open('load_data.csv', mode = 'r')

The mode argument tells open how the file should be opened, for example for reading or for writing. The following list shows the available modes.

- 'r'       open for reading (default)
- 'w'       open for writing, truncating the file first
- 'x'       create a new file and open it for writing
- 'a'       open for writing, appending to the end of the file if it exists
- 'b'       binary mode
- 't'       text mode (default)
- '+'       open a disk file for updating (reading and writing)
- 'U'       universal newline mode (deprecated, and removed in Python 3.11)
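
To make the mode flags concrete, here is a minimal sketch that writes, appends to, and re-reads a small file. The file name `modes_demo.csv` and the temporary directory are assumptions for illustration; they are not part of the dataset used in this tutorial.

```python
import os
import tempfile

# Hypothetical scratch file so the example does not touch load_data.csv
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.csv')

# 'w' creates the file (or truncates an existing one) before writing
f = open(path, mode='w')
f.write('1.0,2.0\n')
f.close()

# 'a' keeps the existing content and writes at the end of the file
f = open(path, mode='a')
f.write('3.0,4.0\n')
f.close()

# 'r' (the default) opens the file for reading as text
f = open(path, mode='r')
content = f.read()
f.close()

# 'x' refuses to overwrite: it raises FileExistsError if the file already exists
try:
    open(path, mode='x')
    x_failed = False
except FileExistsError:
    x_failed = True

print(content)
print(x_failed)
```

Note that 'w' silently discards whatever was in the file, so 'x' can be a safer choice when an existing file must not be overwritten.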


To see the content of the file, call the readlines() method, which returns all lines as a list of strings. To read a single line (the first one, if nothing has been read yet), call readline() instead.

In [3]:
data_open.readlines()

['32.502345269453031,31.70700584656992\n',
 '53.426804033275019,68.77759598163891\n',
 '61.530358025636438,62.562382297945803\n',
 '47.475639634786098,71.546632233567777\n',
 '59.813207869512318,87.230925133687393\n',
 '55.142188413943821,78.211518270799232\n',
 '52.211796692214001,79.64197304980874\n',
 '39.299566694317065,59.171489321869508\n',
 '48.10504169176825,75.331242297063056\n',
 '52.550014442733818,71.300879886850353\n',
 '45.419730144973755,55.165677145959123\n',
 '54.351634881228918,82.478846757497919\n',
 '44.164049496773352,62.008923245725825\n',
 '58.16847071685779,75.392870425994957\n',
 '56.727208057096611,81.43619215887864\n',
 '48.955888566093719,60.723602440673965\n',
 '44.687196231480904,82.892503731453715\n',
 '60.297326851333466,97.379896862166078\n',
 '45.618643772955828,48.847153317355072\n',
 '38.816817537445637,56.877213186268506\n',
 '66.189816606752601,83.878564664602763\n',
 '65.41605174513407,118.59121730252249\n',
 '47.48120860786787,57.251819462268969\

Call the close() method once you are done with the file.

In [28]:
data_open.close()
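
A common way to avoid forgetting close() is the with statement, which closes the file automatically when the block ends. The sketch below assumes a small stand-in file (`with_demo.csv`, written to a temporary directory) instead of load_data.csv so that it runs on its own.

```python
import os
import tempfile

# Hypothetical stand-in for load_data.csv
path = os.path.join(tempfile.mkdtemp(), 'with_demo.csv')
with open(path, mode='w') as f:
    f.write('32.5,31.7\n53.4,68.8\n')

with open(path, mode='r') as f:
    first_line = f.readline()   # reads one line and advances the file position
    remaining = f.readlines()   # reads everything after that as a list of strings

# The file is already closed here, because the with block has ended
print(first_line)
print(remaining)
print(f.closed)
```

Because readline() advances the position in the file, the later readlines() call only returns the lines that have not been read yet.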

As we have seen, the open function loads the raw text of the dataset with very little code. However, it returns plain strings: it does not convert the values to numbers and cannot, for instance, handle NaN values in the dataset. In many machine learning problems, cleaning the data plays a vital role in developing our models. Hence, it would be much more convenient to have a function that deals with these unpleasant values automatically, as specified.
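
To illustrate the extra work that plain open leaves to us, here is a rough sketch of parsing comma-separated lines by hand and mapping empty fields to NaN. The sample lines and the helper `parse_line` are hypothetical and are not taken from the dataset.

```python
import math

# Hypothetical lines as readlines() would return them; the second has a missing value
lines = ['32.5,31.7\n', '53.4,\n', '61.5,62.6\n']

def parse_line(line):
    """Split one CSV line and convert each field to float, using NaN for empty fields."""
    fields = line.strip().split(',')
    return [float(x) if x.strip() else math.nan for x in fields]

rows = [parse_line(line) for line in lines]
print(rows)
```

Even this small helper ignores many real-world cases (quoted fields, comments, header rows), which is why the library functions in the following sections are usually preferable.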

## 2.2 NumPy

NumPy offers many functions for manipulating data. For reading a dataset, **genfromtxt** is a very helpful function. The following shows its arguments.

```Python
genfromtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, skip_header=0, skip_footer=0, converters=None, missing_values=None, filling_values=None, usecols=None, names=None, excludelist=None, deletechars=None, replace_space='_', autostrip=False, case_sensitive=True, defaultfmt='f%i', unpack=None, usemask=False, loose=True, invalid_raise=True, max_rows=None)
```
Another function is called **loadtxt**:

```Python
loadtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0)
```

We can see that **genfromtxt has more arguments than loadtxt**, such as missing_values and filling_values, which means that **genfromtxt** can cope with more complex issues when reading the data, in particular missing values. Here we read the dataset with both functions.
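
The difference shows up as soon as a value is missing. The following sketch uses a small in-memory CSV (an assumption for illustration, not the real dataset) with one empty field, and compares how the two functions react.

```python
import io
import numpy as np

# Hypothetical CSV text with one empty field in the second row
text = '32.5,31.7\n53.4,\n61.5,62.6\n'

# By default genfromtxt reads an empty field as NaN
with_nan = np.genfromtxt(io.StringIO(text), delimiter=',')

# filling_values replaces missing entries with a value of our choice
filled = np.genfromtxt(io.StringIO(text), delimiter=',', filling_values=-1.0)

# loadtxt cannot convert the empty string to a float and raises ValueError
try:
    np.loadtxt(io.StringIO(text), delimiter=',')
    loadtxt_failed = False
except ValueError:
    loadtxt_failed = True

print(with_nan)
print(filled)
print(loadtxt_failed)
```

The sentinel -1.0 is only an example; whether to keep NaN or substitute a value depends on how the data is cleaned afterwards.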

In [4]:
data_numpy_genfromtxt = np.genfromtxt(fname = 'load_data.csv', delimiter = ',' )

In [5]:
data_numpy_genfromtxt

array([[  32.50234527,   31.70700585],
       [  53.42680403,   68.77759598],
       [  61.53035803,   62.5623823 ],
       [  47.47563963,   71.54663223],
       [  59.81320787,   87.23092513],
       [  55.14218841,   78.21151827],
       [  52.21179669,   79.64197305],
       [  39.29956669,   59.17148932],
       [  48.10504169,   75.3312423 ],
       [  52.55001444,   71.30087989],
       [  45.41973014,   55.16567715],
       [  54.35163488,   82.47884676],
       [  44.1640495 ,   62.00892325],
       [  58.16847072,   75.39287043],
       [  56.72720806,   81.43619216],
       [  48.95588857,   60.72360244],
       [  44.68719623,   82.89250373],
       [  60.29732685,   97.37989686],
       [  45.61864377,   48.84715332],
       [  38.81681754,   56.87721319],
       [  66.18981661,   83.87856466],
       [  65.41605175,  118.5912173 ],
       [  47.48120861,   57.25181946],
       [  41.57564262,   51.39174408],
       [  51.84518691,   75.38065167],
       [  59.37082201,   

Then we use the loadtxt function to read the dataset.

In [6]:
data_numpy_loadtxt = np.loadtxt(fname = 'load_data.csv', delimiter = ',' )

In [7]:
data_numpy_loadtxt

array([[  32.50234527,   31.70700585],
       [  53.42680403,   68.77759598],
       [  61.53035803,   62.5623823 ],
       [  47.47563963,   71.54663223],
       [  59.81320787,   87.23092513],
       [  55.14218841,   78.21151827],
       [  52.21179669,   79.64197305],
       [  39.29956669,   59.17148932],
       [  48.10504169,   75.3312423 ],
       [  52.55001444,   71.30087989],
       [  45.41973014,   55.16567715],
       [  54.35163488,   82.47884676],
       [  44.1640495 ,   62.00892325],
       [  58.16847072,   75.39287043],
       [  56.72720806,   81.43619216],
       [  48.95588857,   60.72360244],
       [  44.68719623,   82.89250373],
       [  60.29732685,   97.37989686],
       [  45.61864377,   48.84715332],
       [  38.81681754,   56.87721319],
       [  66.18981661,   83.87856466],
       [  65.41605175,  118.5912173 ],
       [  47.48120861,   57.25181946],
       [  41.57564262,   51.39174408],
       [  51.84518691,   75.38065167],
       [  59.37082201,   

To save the data, use **np.savetxt**. The first argument is the name (or path) of the output file, and the second is the array we want to save.

In [40]:
np.savetxt('numpy_data', data_numpy_genfromtxt)
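
By default np.savetxt separates columns with a space and writes numbers in scientific notation. The sketch below shows one way to write a comma-separated file and read it back; the small array and the output path are assumptions for illustration.

```python
import os
import tempfile
import numpy as np

# Small hypothetical array standing in for the loaded dataset
data = np.array([[32.5, 31.7], [53.4, 68.8]])

# Hypothetical output path in a temporary directory
out_path = os.path.join(tempfile.mkdtemp(), 'numpy_data.csv')

# delimiter=',' writes comma-separated columns; fmt controls the number format
np.savetxt(out_path, data, delimiter=',', fmt='%.6f')

# Read the file back to check that the round trip preserves the values
restored = np.loadtxt(out_path, delimiter=',')
print(restored)
```

The format '%.6f' keeps six decimal places, so values with more digits would be rounded on export.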

## 2.3 Pandas

In pandas, the most common function for reading a dataset is **read_csv**. The following shows the arguments of this function (the exact list varies between pandas versions):

```Python
read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)

```

In [8]:
data_pandas = pd.read_csv(filepath_or_buffer = 'load_data.csv', delimiter = ',', header = None)

In [9]:
data_pandas

|     | 0         | 1         |
|-----|-----------|-----------|
| 0   | 32.502345 | 31.707006 |
| 1   | 53.426804 | 68.777596 |
| 2   | 61.530358 | 62.562382 |
| 3   | 47.475640 | 71.546632 |
| 4   | 59.813208 | 87.230925 |
| 5   | 55.142188 | 78.211518 |
| 6   | 52.211797 | 79.641973 |
| 7   | 39.299567 | 59.171489 |
| 8   | 48.105042 | 75.331242 |
| 9   | 52.550014 | 71.300880 |
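
Since this tutorial is also about exporting data, here is a minimal sketch of naming the columns, counting missing values, and writing the cleaned table back out with to_csv. The in-memory CSV and the column names `x` and `y` are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text with no header row and one missing value
text = '32.5,31.7\n53.4,\n61.5,62.6\n'

# header=None says the first row is data; names= assigns the column labels
df = pd.read_csv(io.StringIO(text), header=None, names=['x', 'y'])

# Empty fields are read as NaN, which isna() can count
n_missing = int(df['y'].isna().sum())

# With no path, to_csv returns the CSV as a string; index=False omits the row labels
csv_text = df.dropna().to_csv(index=False)

print(n_missing)
print(csv_text)
```

Passing a file name as the first argument of to_csv would write the same content to disk instead of returning a string.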


pandas also provides other functions, such as pd.read_excel and pd.read_html, for reading other types of files. Use the help function to see more information about these functions.