# Introduction to Importing Data in Python
👋 Welcome to your workspace! Here, you can write and run Python code and add text in Markdown. All the data files from the course, Introduction to Importing Data in Python, can be found in the datasets/ directory. The course packages have already been imported for you below. This is your sandbox environment: analyze the course datasets further, take notes, or experiment with code!

In [16]:
# Importing course packages; you can add more too!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import h5py
from sas7bdat import SAS7BDAT
from sqlalchemy import create_engine
import pickle

### Don't know where to start?

There are nine data files in the `datasets/` directory of varying kinds: `battledeath.xlsx`, `Chinook.sqlite`, `disarea.dta`, `ja_data2.mat`, `L-L1_LOSC_4_V1-1126259446-32.hdf5`, `mnist_kaggle_some_rows.csv`, `sales.sas7bdat`, `seaslug.txt`, and `titanic_sub.csv`. 

Import each of these files into a format useful for data analysis in Python. 

### Import data
- Flat files, e.g. `.txt`, `.csv`
- Files from other software
- Relational databases

<img src= "./media/csv & relational database.png">

### Plain text files

<center><img src= "./media/plain txt.png"></center>

In [5]:
titanic = pd.read_csv("./datasets/titanic_sub.csv")
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,male,35.0,0,0,373450,8.05,,S


### Reading a text file

In [11]:
filename = './datasets/huck_finn.txt'
file = open(filename, mode='r') # 'r' is to read
text = file.read()
file.close()

### Printing a text file

In [12]:
print(text)

YOU don't know about me without you have read a book by
the name of The Adventures of Tom Sawyer; but that
ain't no matter. That book was made by Mr. Mark Twain,
and he told the truth, mainly. There was things which
he stretched, but mainly he told the truth. That is
nothing. never seen anybody but lied one time or
another, without it was Aunt Polly, or the widow, or
maybe Mary. Aunt Polly--Tom's Aunt Polly, she is--and
Mary, and the Widow Douglas is all told about in that
book, which is mostly a true book, with some
stretchers, as I said before.



### Writing to a file

In [13]:
filename = 'huck_finn.txt'
file = open(filename, mode='w') # 'w' is to write; it empties the file if it already exists
file.close()
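
The cell above opens the file for writing but never writes anything to it. A minimal sketch of an actual write, assuming a hypothetical scratch file in a temporary directory (`scratch_notes.txt` is not part of the course datasets):

```python
import os
import tempfile

# Hypothetical scratch file, so no course dataset is overwritten.
scratch_path = os.path.join(tempfile.mkdtemp(), "scratch_notes.txt")

# mode='w' creates the file, or empties it first if it already exists.
file = open(scratch_path, mode="w")
file.write("first line\n")
file.write("second line\n")
file.close()

# Reopen in read mode to check what ended up on disk.
file = open(scratch_path, mode="r")
written_text = file.read()
file.close()

print(written_text)
```

Because `'w'` truncates an existing file, it is safer to practise on a scratch file than on anything in `datasets/`.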

### Context manager with

In [15]:
with open('huck_finn.txt','r') as file:
    print(file.read())




So far, you have learned how to:
- Print files to the console
- Print specific lines
- Discuss flat files
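
Printing specific lines is not shown in the cells above. One way to do it, sketched here on a hypothetical three-line sample file rather than a course dataset, is to call `readline()` inside a `with` block:

```python
import os
import tempfile

# Hypothetical sample file created only for this illustration.
sample_path = os.path.join(tempfile.mkdtemp(), "sample_lines.txt")
with open(sample_path, "w") as file:
    file.write("line one\nline two\nline three\n")

# Each readline() call returns the next line, including its newline.
with open(sample_path, "r") as file:
    first_line = file.readline()
    second_line = file.readline()

print(first_line, end="")
print(second_line, end="")

# The context manager closed the file when the block ended.
print(file.closed)
```

The `with` statement closes the file even if an error occurs inside the block, so no explicit `file.close()` is needed.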

### The importance of flat files in data science
- Simple comma-separated files, or files in any other delimited format.
- Have rows and columns that convert easily into DataFrame rows and columns.
- Text files containing records
- That is, table data
- `Record`: row of fields or attributes
- `Column`: feature or attribute
- `Header`: contains metadata, i.e. information about the data, such as column names.
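
To make these terms concrete, here is a small sketch using Python's built-in `csv` module. The inline text is an assumed three-column excerpt based on the first two Titanic rows shown earlier, not the full file:

```python
import csv
import io

# Assumed excerpt: one header line followed by two records.
flat_text = "PassengerId,Survived,Sex\n1,0,male\n2,1,female\n"

reader = csv.reader(io.StringIO(flat_text))
header = next(reader)    # first line: the column names
records = list(reader)   # remaining lines: one record per row

# A column is the same field taken from every record.
sex_column = [record[2] for record in records]

print(header)
print(records)
print(sex_column)
```

Note that the `csv` module returns every field as a string; NumPy and pandas, covered next, handle type conversion.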

<img src="./media/flatfile_titanic.png">

### File extension
- `.csv` - Comma-separated values
- `.txt` - Text file
- Commas, tabs - Delimiters

### Tab-delimited file

<img src = "./media/tab-delimiters.png">
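
As a rough illustration of reading a tab-delimited file, the sketch below passes `sep='\t'` to `pd.read_csv`. The inline text and its column names are made up for the example and stand in for a file on disk:

```python
import io
import pandas as pd

# Hypothetical tab-delimited content; "\t" separates the fields.
tab_text = "name\tage\tcity\nAda\t36\tLondon\nLinus\t28\tHelsinki\n"

# sep='\t' tells pandas the delimiter is a tab rather than a comma.
people = pd.read_csv(io.StringIO(tab_text), sep="\t")

print(people)
```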

### How do you import flat files?
- Two main packages: NumPy, pandas

<img src = "./media/packages.png">

- Here, you’ll learn to import:
    - Flat files with numerical data (MNIST)
    - Flat files with numerical data and strings (`titanic_sub.csv`)

### Importing flat files using NumPy
#### Why NumPy?
- NumPy arrays: standard for storing numerical data

<img src = "./media/numpy.png">

- Essential for other packages: e.g. scikit-learn

<img src = "./media/scikitlearn.png">

NumPy offers two functions for importing flat files: `loadtxt()` and `genfromtxt()`.

In [5]:
import numpy as np
filename = './datasets/mnist_kaggle_some_rows.csv'
data = np.loadtxt(filename, delimiter=',')
data

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [2., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [5., 0., 0., ..., 0., 0., 0.]])

In [9]:
# Customizing your NumPy import: skiprows=1 skips the first row of the file
import numpy as np
filename = './datasets/mnist_kaggle_some_rows.csv'
data = np.loadtxt(filename, delimiter=',', skiprows=1)
print(data)

[[0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [4. 0. 0. ... 0. 0. 0.]
 ...
 [2. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [5. 0. 0. ... 0. 0. 0.]]


In [10]:
# Customizing your NumPy import: usecols=[0] keeps only the first column
import numpy as np
filename = './datasets/mnist_kaggle_some_rows.csv'
data = np.loadtxt(filename, delimiter=',', skiprows=1, usecols=[0])
print(data)

[0. 1. 4. 0. 0. 7. 3. 5. 3. 8. 9. 1. 3. 3. 1. 2. 0. 7. 5. 8. 6. 2. 0. 2.
 3. 6. 9. 9. 7. 8. 9. 4. 9. 2. 1. 3. 1. 1. 4. 9. 1. 4. 4. 2. 6. 3. 7. 7.
 4. 7. 5. 1. 9. 0. 2. 2. 3. 9. 1. 1. 1. 5. 0. 6. 3. 4. 8. 1. 0. 3. 9. 6.
 2. 6. 4. 7. 1. 4. 1. 5. 4. 8. 9. 2. 9. 9. 8. 9. 6. 3. 6. 4. 6. 2. 9. 1.
 2. 0. 5.]
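
The effect of `skiprows` and `usecols` is easier to see on a tiny table. This sketch uses made-up inline data with a header row (the column names are assumptions, not taken from the MNIST file):

```python
import io
import numpy as np

# Hypothetical data: one header row, then two numeric rows.
small_text = "label,pixel0,pixel1\n5,0,255\n3,128,64\n"

# skiprows=1 drops the header, which loadtxt could not convert to float.
numbers = np.loadtxt(io.StringIO(small_text), delimiter=",", skiprows=1)

# usecols=[0] keeps only the first column, giving a 1-D array.
labels = np.loadtxt(io.StringIO(small_text), delimiter=",", skiprows=1, usecols=[0])

print(numbers)
print(labels)
```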


To import every value as a string instead of a float, set `dtype=str`: `data = np.loadtxt(filename, delimiter=',', dtype=str)`
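
A minimal sketch of what `dtype=str` does, again on assumed inline data. With strings there is no need to skip the header, because nothing has to be converted to a number:

```python
import io
import numpy as np

# Hypothetical data; the header row is kept this time.
text_table = "label,pixel0\n1,0\n0,255\n"

# dtype=str imports every field as a string, header included.
data = np.loadtxt(io.StringIO(text_table), delimiter=",", dtype=str)

print(data)
print(data.dtype.kind)  # 'U' means unicode strings
```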

### Mixed datatypes
<img src = "./media/mixed_datatypes.png">
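
For columns of different types, `np.genfromtxt()` is one option. The sketch below assumes a small inline excerpt with Titanic-style columns; `names=True` reads the header and `dtype=None` asks NumPy to infer a type per column:

```python
import io
import numpy as np

# Assumed excerpt mixing integers, strings, and floats.
mixed_text = "PassengerId,Sex,Fare\n1,male,7.25\n2,female,71.2833\n"

# The result is a 1-D structured array with one named field per column.
passengers = np.genfromtxt(
    io.StringIO(mixed_text),
    delimiter=",",
    names=True,
    dtype=None,
    encoding="utf-8",
)

print(passengers.dtype.names)
print(passengers["Sex"])
print(passengers["Fare"])
```

Structured arrays work, but they are less convenient than a DataFrame for this kind of data, which is part of the motivation for pandas in the next section.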

### Importing flat files using pandas
#### What a data scientist needs
- Two-dimensional labeled data structure(s)
- Columns of potentially different types
- Manipulate, slice, reshape, groupby, join, merge
- Perform statistics
- Work with time series data

## Pandas and the DataFrame

<img src= "./media/about_pandas.png">

### Manipulating pandas DataFrames
- Exploratory data analysis
- Data wrangling
- Data preprocessing
- Building models
- Visualization
- Using pandas for these tasks is standard and best practice


In [13]:
filename = './datasets/titanic_sub.csv'
data = pd.read_csv(filename)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,male,35.0,0,0,373450,8.05,,S


In [14]:
data.values

array([[1, 0, 3, ..., 7.25, nan, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, nan, 'S'],
       ...,
       [889, 0, 3, ..., 23.45, nan, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)
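
The `dtype=object` in the output above appears because the columns have different types. A small sketch on a hand-built DataFrame (the three columns are an assumed subset) shows the difference between converting the whole frame and converting a single numeric column, using `to_numpy()`, the method the pandas documentation recommends over `.values`:

```python
import pandas as pd

# Hand-built frame with an integer, a float, and a string column.
frame = pd.DataFrame(
    {"Survived": [0, 1], "Fare": [7.25, 71.2833], "Embarked": ["S", "C"]}
)

# Mixed column types force a generic object array.
as_array = frame.to_numpy()

# A single numeric column keeps its numeric dtype.
fares = frame["Fare"].to_numpy()

print(as_array.dtype)
print(fares.dtype)
```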

You’ll experience:
- Importing flat files in a straightforward manner
- Importing flat files with issues such as comments and missing values
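
As a sketch of handling such issues, `pd.read_csv` accepts `comment` and `na_values` arguments. The inline text, the `#` comment character, and the `Nothing` placeholder for missing values are all assumptions made for this example:

```python
import io
import pandas as pd

# Hypothetical messy file: a comment line and 'Nothing' for missing values.
messy_text = (
    "# exported from a hypothetical system\n"
    "PassengerId,Age,Cabin\n"
    "1,22.0,Nothing\n"
    "2,38.0,C85\n"
    "3,Nothing,C123\n"
)

# comment='#' skips the comment line; na_values marks 'Nothing' as NaN.
passengers = pd.read_csv(
    io.StringIO(messy_text), comment="#", na_values=["Nothing"]
)

print(passengers)
print(passengers.isna().sum())
```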

### Next Chapter
- Import other file types
    - Excel, SAS, Stata
- Feather
<img src = "./media/wes_post.png">

- Interact with relational databases