# Importing Data in Python (Part 1)

As a data scientist, you will clean data, wrangle and munge it, visualize it, build predictive models and interpret these models on a daily basis. Before doing any of this, however, you need to know how to get data into Python. In this course, you'll learn the many ways to import data into Python: (i) from flat files such as .txt and .csv files; (ii) from files native to other software, such as Excel spreadsheets and Stata, SAS and MATLAB files; (iii) from relational databases such as SQLite and PostgreSQL.

## Import Data From Plain Text
To check out any plain text file, you can use Python's built-in `open()` function to open a connection to the file.
```python
# Assign the file name to a variable
file_name = 'file_name.txt'
# Open a connection to the file in read-only mode ('r')
file = open(file_name, mode='r')
# Read the whole file into a string
text = file.read()
# Close the connection
file.close()
```
A context manager creates a context in which you can execute commands with the file open; once execution leaves that context, the file is closed automatically. Using the `with` statement, we therefore avoid having to close the connection to the file ourselves. For large files, we may want to print only the first few lines. The `file.readline()` method reads one line at a time: the first call returns the first line of the text file, and calling it again returns the second line, and so on.
```python
# With a context manager, we don't need to close the connection ourselves.
with open(file_name, 'r') as file:
    print(file.read())

# Print the first three lines, one readline() call per line.
with open(file_name, 'r') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())
```
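
For large files, another option is to iterate over the file object instead of calling `readline()` repeatedly. The following is a minimal sketch, assuming a small throwaway file (`sample.txt`, a made-up name created in a temporary directory) so that it runs on its own:

```python
import os
import tempfile
from itertools import islice

# Create a small throwaway file so the example is self-contained;
# the file name and its contents are made up for illustration.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt")
with open(sample_path, "w") as f:
    f.write("first line\nsecond line\nthird line\nfourth line\n")

# A file object is an iterator over its lines, so islice() can take
# the first n lines without reading the rest of the file into memory.
with open(sample_path, "r") as f:
    first_lines = [line.rstrip("\n") for line in islice(f, 3)]

print(first_lines)
```

Because only the requested lines are read, this pattern stays cheap even when the file is very large.
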
## Import Data From Flat Files

Flat files are basic text files containing records, that is, table data without structured relationships.

A flat file can also have a header, as in 'titanic.csv': a first row that describes the contents of the data columns, that is, which attribute or feature each column holds.

### Importing Flat Files Using NumPy
We're now going to load the MNIST digit recognition dataset using the NumPy function `loadtxt()` and see just how easy it can be:
* The first argument is the filename.
* The `delimiter` argument sets the separator, for example `','` for a comma-separated file or `'\t'` for a tab-delimited file.
* `skiprows` specifies how many rows to skip at the start of the file.
* `usecols` takes a list of the indices of the columns you wish to keep.
* `dtype=str` ensures that all entries are imported as strings.

```python
import numpy as np

file_name = 'data/mnist_kaggle_some_rows.csv'
# The default delimiter is whitespace, so we need to specify delimiter=',' explicitly.
data = np.loadtxt(file_name, delimiter=',')
# The file is read into a NumPy array.
print(type(data))
print(data)
```

```
<class 'numpy.ndarray'>
[[1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 ...
 [2. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [5. 0. 0. ... 0. 0. 0.]]
```
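
The MNIST call above uses only `delimiter`. As a minimal sketch of the `skiprows` and `usecols` arguments, the example below parses a small made-up CSV string (the column names and values are illustrative assumptions, not taken from the real dataset) through `io.StringIO`, which `loadtxt()` accepts in place of a file name:

```python
from io import StringIO

import numpy as np

# Made-up comma-separated content: one header row, then numeric rows.
csv_text = StringIO(
    "label,pixel0,pixel1,pixel2\n"
    "1,0,0,255\n"
    "0,12,0,0\n"
    "5,0,7,0\n"
)

# Skip the header row and keep only the first and third columns.
subset = np.loadtxt(csv_text, delimiter=",", skiprows=1, usecols=[0, 2])

print(subset.shape)
print(subset)
```
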


**Importing different datatypes:**
Let's import a text file that
* has a text header consisting of strings
* is tab-delimited

We are going to use the file seaslug.txt, whose data consists of the percentage of sea slug larvae that had metamorphosed in a given time period.

Because the file contains a string header, we need to handle it. If we import the file using `loadtxt()` without handling the header, we get a `ValueError` telling us `could not convert string to float`. We can handle it in two ways:

* Alternative 1: Set `dtype=str`, so that every entry, including the header, is imported as a string.
* Alternative 2: Skip the first row using the `skiprows` argument.

```python
# Alternative 1: import everything as strings
file_sea_slug = 'data/seaslug.txt'
data_alt_1 = np.loadtxt(file_sea_slug, delimiter='\t', dtype=str)
print(type(data_alt_1))
print(data_alt_1[0:5])
```

```
<class 'numpy.ndarray'>
[['Time' 'Percent']
 ['99' '0.067']
 ['99' '0.133']
 ['99' '0.067']
 ['99' '0']]
```


```python
# Alternative 2: skip the header row
data_alt_2 = np.loadtxt(file_sea_slug, delimiter='\t', skiprows=1)
print(type(data_alt_2))
print(data_alt_2[0:9])
```

```
<class 'numpy.ndarray'>
[[9.90e+01 6.70e-02]
 [9.90e+01 1.33e-01]
 [9.90e+01 6.70e-02]
 [9.90e+01 0.00e+00]
 [9.90e+01 0.00e+00]
 [0.00e+00 5.00e-01]
 [0.00e+00 4.67e-01]
 [0.00e+00 8.57e-01]
 [0.00e+00 5.00e-01]]
```
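
The failure that both alternatives work around can be reproduced without the real file. The following is a minimal sketch, assuming a made-up tab-delimited string with the same two-column layout (the values are illustrative only); it triggers the `ValueError`, then applies the `skiprows` fix and slices the result by column:

```python
from io import StringIO

import numpy as np

# Made-up tab-delimited content with a text header, mimicking the layout
# described above (the values are illustrative, not the real dataset).
raw = "Time\tPercent\n99\t0.067\n99\t0.133\n0\t0.5\n"

# Without skiprows or dtype=str, the header cannot be parsed as a float.
try:
    np.loadtxt(StringIO(raw), delimiter="\t")
    header_failed = False
except ValueError as err:
    header_failed = True
    print("loadtxt failed:", err)

# Skipping the header yields floats, which can then be sliced by column.
values = np.loadtxt(StringIO(raw), delimiter="\t", skiprows=1)
time_col = values[:, 0]
percent_col = values[:, 1]
print(time_col, percent_col)
```
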


Some datasets have different datatypes in different columns; for example, one column may contain strings and another floats. The function `np.loadtxt()` does not handle this well. Another function, `np.genfromtxt()`, can handle such structures: if we pass `dtype=None` to it, it will figure out what type each column should be.
```python
data = np.genfromtxt(file_name, delimiter=',', names=True, dtype=None)
```
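* The first argument is the filename.
* The second specifies the delimiter.
* The third argument, `names=True`, tells the function that there is a header.
* `dtype=None` lets NumPy infer the data type of each column separately.

*There is also another function, `np.recfromcsv()`, that behaves similarly to `np.genfromtxt()`, except that its defaults are `delimiter=','`, `names=True` and `dtype=None`, so only the filename is required. Note that NumPy 2.0 deprecates it in favor of `np.genfromtxt()`.*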

```python
# Let's look at the Titanic data:
# - it is comma-separated
# - it has a header row
# - it contains both string and numeric columns
titanic = np.genfromtxt('data/titanic_sub.csv', delimiter=',', names=True, dtype=None)
print(titanic[0:2])
```

```
[(1, 0, 3, b'male', 22., 1, 0, b'A/5 21171',  7.25  , b'', b'S')
 (2, 1, 1, b'female', 38., 1, 0, b'PC 17599', 71.2833, b'C85', b'C')]
```


```python
# Using np.recfromcsv()
titanic_2 = np.recfromcsv('data/titanic_sub.csv')
print(titanic_2[0:5])
```

```
[(1, 0, 3, b'male', 22., 1, 0, b'A/5 21171',  7.25  , b'', b'S')
 (2, 1, 1, b'female', 38., 1, 0, b'PC 17599', 71.2833, b'C85', b'C')
 (3, 1, 3, b'female', 26., 0, 0, b'STON/O2. 3101282',  7.925 , b'', b'S')
 (4, 1, 1, b'female', 35., 1, 0, b'113803', 53.1   , b'C123', b'S')
 (5, 0, 3, b'male', 35., 0, 0, b'373450',  8.05  , b'', b'S')]
```
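
The `b'...'` prefixes in the output above mean that the text fields were read as bytes. The following is a minimal sketch, assuming a small made-up CSV held in memory rather than the real file, of passing `encoding='utf-8'` to `np.genfromtxt()` to get ordinary strings, and of selecting columns by their header names:

```python
from io import StringIO

import numpy as np

# Made-up rows in the same spirit as the Titanic sample (not the real file).
csv_text = StringIO(
    "PassengerId,Survived,Sex,Fare\n"
    "1,0,male,7.25\n"
    "2,1,female,71.2833\n"
    "3,1,female,7.925\n"
)

# names=True reads the header into field names; dtype=None infers a type
# per column; encoding='utf-8' returns text as str instead of bytes.
passengers = np.genfromtxt(
    csv_text, delimiter=",", names=True, dtype=None, encoding="utf-8"
)

print(passengers.dtype.names)
print(passengers["Sex"])
print(passengers["Fare"].mean())
```

Each field name from the header becomes a key on the resulting structured array, so `passengers["Fare"]` is a regular one-dimensional float array.
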




### Importing Flat Files Using pandas

So far we have learned to import a number of different types of flat files into Python as NumPy arrays. Although NumPy arrays are incredibly useful and serve many purposes, a plain NumPy array is not a two-dimensional labeled data structure.
pandas offers the DataFrame, which has observations (rows) and variables (columns).

```python
import pandas as pd

df_titanic = pd.read_csv('data/titanic_sub.csv')

print(type(df_titanic))
df_titanic.head()
```

```
<class 'pandas.core.frame.DataFrame'>
```

| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


Handling `read_csv()` under different circumstances:
* `sep`: the separator; the default is `','`, as in .csv (comma-separated values). Specify another one if needed.
* `header=None`: load a .csv with no header row.
* `names=['column_name1', 'column_name2']`: load a .csv while specifying the column names.
* `index_col='date'`: load a .csv and use the named column as the index, for example a date column.
* `na_values=['NA']`: load a .csv while treating "NA" as a missing value.
* `skiprows=3`: load a .csv while skipping the top 3 rows.
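
The options listed above can be combined in a single call. The following is a minimal sketch, assuming a made-up semicolon-separated text held in memory (the preamble lines, column names and values are all illustrative), that skips three leading lines, supplies column names, sets the index and marks "NA" as missing:

```python
from io import StringIO

import pandas as pd

# Made-up content: three preamble lines, no header row,
# ';' as the separator and "NA" standing in for a missing value.
raw = (
    "exported by a hypothetical tool\n"
    "station: A\n"
    "units: celsius\n"
    "2021-01-01;3.5\n"
    "2021-01-02;NA\n"
    "2021-01-03;4.1\n"
)

df = pd.read_csv(
    StringIO(raw),
    sep=";",                         # non-default separator
    skiprows=3,                      # skip the three preamble lines
    header=None,                     # the data has no header row
    names=["date", "temperature"],   # so column names are supplied here
    index_col="date",                # use the date column as the index
    na_values=["NA"],                # treat "NA" as missing
)

print(df)
```
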