# Importing Data

> There is no data science without data.
>
> \- A wise person

## General Model for Importing Data

### Memory and Size

* Python stores its data in **memory** - this makes the data quick to access but limits how much data you can work with at once, which can matter in fields with very large datasets.

* With that being said, you are likely not going to run into space limitations anytime soon.

* Python memory is session-specific, so quitting Python (e.g. shutting down JupyterLab) removes the data from memory.

### General Framework

A general way to conceptualize data import into and use within Python:

1. Data sits on the computer/server - this is frequently called "disk"
2. Python code can be used to copy a data file from disk to the Python session's memory
3. Python data then sits within Python's memory ready to be used by other Python code

Here is a visualization of this process:


<center>
<img src="../images/import-framework.png" alt="import-framework.png" width="1000" height="1000">
</center>
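The three steps above can also be sketched in code. This is a minimal illustration using only the standard library; the file name `toy.csv` and its contents are made up for the example and are not part of the lesson's data.

```python
import csv
import os
import tempfile

# Step 1: a data file sits on disk (here, in a throwaway temporary directory).
tmp_dir = tempfile.mkdtemp()
file_path = os.path.join(tmp_dir, "toy.csv")  # hypothetical file name
with open(file_path, "w", newline="") as f:
    f.write("tailnum,seats\nN10156,55\nN102UW,182\n")

# Step 2: Python code copies the file's contents from disk into memory.
with open(file_path, newline="") as f:
    rows = list(csv.DictReader(f))

# Step 3: the in-memory copy is independent of the file on disk,
# so it is still usable even after the file is deleted.
os.remove(file_path)
os.rmdir(tmp_dir)

file_still_on_disk = os.path.exists(file_path)
print(file_still_on_disk)   # False
print(rows[0]["tailnum"])   # N10156
```

The data in `rows` survives the deletion of the file, but it would disappear as soon as the Python session ends.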

## Importing Tabular Data

For much of data science, tabular data -- think 2-dimensional datasets -- is the most common format of data.

Often, these are stored in the form of **delimited files** (e.g. CSV).
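To see what "delimited" means, here is a small sketch using Python's built-in `csv` module. The two strings below are made-up examples of the same table, once with commas and once with tabs as the delimiter.

```python
import csv
import io

# The same two-row table written with two different delimiters (toy data).
comma_text = "tailnum,seats\nN10156,55\n"
tab_text = "tailnum\tseats\nN10156\t55\n"

# csv.reader splits each line on the delimiter it is given.
comma_rows = list(csv.reader(io.StringIO(comma_text), delimiter=","))
tab_rows = list(csv.reader(io.StringIO(tab_text), delimiter="\t"))

print(comma_rows)              # [['tailnum', 'seats'], ['N10156', '55']]
print(comma_rows == tab_rows)  # True
```

Only the separator character differs; once parsed, the rows and columns are identical.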

## Importing Other Files

* While tabular data is the most popular in data science, other types of data files are used as well.

   - Excel files
   - SQL databases
   - JSON files
   - And many more!
   
* Python has many capabilities (via built-in, standard library, and 3rd party packages) for working with these other file types.

* Later in the lesson reading you will get exposed to a few of these.
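As a small taste, the sketch below uses two standard library modules, `json` and `sqlite3`, on made-up data. It is only one of several ways to work with these formats; pandas also provides readers such as `pd.read_json`, `pd.read_excel`, and `pd.read_sql`.

```python
import json
import sqlite3

# JSON: parse a JSON string into Python lists and dictionaries (toy data).
json_text = '[{"tailnum": "N10156", "seats": 55}, {"tailnum": "N102UW", "seats": 182}]'
records = json.loads(json_text)
print(records[1]["seats"])  # 182

# SQL: query an in-memory SQLite database (no file or server needed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE planes (tailnum TEXT, seats INTEGER)")
conn.executemany(
    "INSERT INTO planes VALUES (?, ?)",
    [(r["tailnum"], r["seats"]) for r in records],
)
big_planes = conn.execute(
    "SELECT tailnum FROM planes WHERE seats > 100"
).fetchall()
conn.close()
print(big_planes)  # [('N102UW',)]
```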

### Importing pandas

Most tabular data formats can be imported into Python using the **pandas** library. We can load pandas with the below code:

In [1]:
import pandas as pd

<div class="admonition note alert alert-info">
    <b><p class="first admonition-title" style="font-weight: bold">Note</p></b>
    <p>Recall that the pandas library is the primary library for representing and working with tabular data in Python.</p>
</div>

Pandas is preferred because it imports the data directly into a DataFrame -- the data structure of choice for tabular data in Python.

The `read_csv` function is used to import a tabular data file, a CSV, into a DataFrame:

In [2]:
planes = pd.read_csv('planes.csv')

We can view the first few rows of our DataFrame using the `head()` method:

In [3]:
planes.head()

| Unnamed: 0 | tailnum | year | type | manufacturer | model | engines | seats | speed | engine |
|---|---|---|---|---|---|---|---|---|---|
| 0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | | Turbo-fan |
| 1 | N102UW | 1998.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | | Turbo-fan |
| 2 | N103US | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | | Turbo-fan |
| 3 | N104UW | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | | Turbo-fan |
| 4 | N10575 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | | Turbo-fan |


<div class="admonition note alert alert-info">
    <b><p class="first admonition-title" style="font-weight: bold">Relax!</p></b>
    <p>Don't worry, in the next lesson we'll dive deeper into DataFrames. For now I just wanted you to be able to see the data you imported.</p>
</div>

The `read_csv()` function has many parameters for importing data. A few examples:

* `sep` - the data's delimiter
* `header` - the row number containing the column names (`0` is the first row; use `None` when the file has no header row)
* `names` - the names of the columns
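Here is a brief sketch of these three parameters in action. The data is a made-up, semicolon-delimited snippet held in a string (via `io.StringIO`) rather than a real file, and the variable names are just for illustration.

```python
import io

import pandas as pd

# A semicolon-delimited "file" with a header row (toy data).
with_header = io.StringIO("tailnum;seats\nN10156;55\nN102UW;182\n")
df1 = pd.read_csv(with_header, sep=";")  # header row is inferred (row 0)

# The same data without a header row: header=None says "no header",
# and names supplies the column names instead.
no_header = io.StringIO("N10156;55\nN102UW;182\n")
df2 = pd.read_csv(no_header, sep=";", header=None, names=["tailnum", "seats"])

print(list(df1.columns))  # ['tailnum', 'seats']
print(df2.shape)          # (2, 2)
```

Both calls should produce the same two-column DataFrame; only the way the column names are obtained differs.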

Full documentation can be pulled up in Jupyter by running the function name followed by a question mark:

In [4]:
pd.read_csv?

Signature:
pd.read_csv(
    filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]',
    sep=<no_default>,
    delimiter=None,
    header='infer',
    names=<no_default>,
    index_col=None,
    usecols=None,
    squeeze=None,
    prefix=<no_default>,
    mangle_dupe_cols

## File Paths

![](../images/timeout.png)

It’s important to understand where files exist on your computer and how to reference those paths. There are two main approaches:

1. Absolute paths
2. Relative paths

## Absolute path

An absolute path always contains the root elements and the complete list of directories to locate the specific file or folder. For the planes.csv file, the absolute path on my computer is:

In [5]:
import os

absolute_path = os.path.abspath('planes.csv')
absolute_path

'/Users/b294776/Desktop/workspace/training/UC/uc-bana-6043/instructor-material/module-2/planes.csv'

I can always use the absolute path in `pd.read_csv()`:

In [6]:
planes = pd.read_csv(absolute_path)
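As a quick check of what `os.path.abspath()` does, here is a minimal sketch using only the standard library. Note that `abspath` does not verify that the file exists; it only builds the path.

```python
import os

relative = "planes.csv"
absolute = os.path.abspath(relative)

print(os.path.isabs(relative))  # False
print(os.path.isabs(absolute))  # True

# abspath prepends the current working directory to the relative path,
# so the result depends on where Python is running.
print(os.getcwd())
print(absolute)
```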

## Relative path

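A relative path is a path built starting from the current location, i.e. the current working directory (for a notebook, this is usually the folder containing the notebook).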

If my directory layout looks like:

```
Project A
├── this_notebook.ipynb
└── planes.csv
```

then I can do:

```python
pd.read_csv('planes.csv')
```

If my directory layout looks like:

```
Project A
├── this_notebook.ipynb
└── data
    └── planes.csv
```

then I can do:

```python
pd.read_csv('data/planes.csv')
```

If my directory layout looks like:

```
Project A
├── notebooks
│   ├── this_notebook.ipynb
│   ├── notebook2.ipynb
│   └── notebook3.ipynb
└── data
    └── planes.csv
```

then I can do:

```python
pd.read_csv('../data/planes.csv')
```
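To see how `..` works, the sketch below recreates the third layout in a temporary directory and resolves the relative path starting from the `notebooks` folder. It uses `pathlib` from the standard library; the temporary directory and file contents are made up for the example.

```python
import shutil
import tempfile
from pathlib import Path

# Recreate the third layout inside a throwaway temporary directory.
root = Path(tempfile.mkdtemp())
project = root / "Project A"
(project / "notebooks").mkdir(parents=True)
(project / "data").mkdir()
(project / "data" / "planes.csv").write_text("tailnum,seats\nN10156,55\n")

# From the notebooks folder, '..' steps up to 'Project A',
# then 'data/planes.csv' steps back down to the file.
start = project / "notebooks"
target = (start / "../data/planes.csv").resolve()
found = target.exists()

print(target.parent.name, target.name)  # data planes.csv
print(found)                            # True

# Clean up the temporary directory.
shutil.rmtree(root)
```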

## Attributes & Methods

We’ve seen that we can use the dot-notation to access functions in libraries (e.g. `pd.read_csv()`).

We can use this same approach to access things inside of objects. 

What’s an object? Basically, a value that bundles data and functionality together and exposes them to users.

By that definition, our DataFrame is an object.

## Metadata

For example, our DataFrame contains some items attached to it:

In [7]:
# Dimensions of our data
planes.shape

(3322, 9)

In [8]:
# Summary info for our variables
planes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3322 entries, 0 to 3321
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   tailnum       3322 non-null   object 
 1   year          3252 non-null   float64
 2   type          3322 non-null   object 
 3   manufacturer  3322 non-null   object 
 4   model         3322 non-null   object 
 5   engines       3322 non-null   int64  
 6   seats         3322 non-null   int64  
 7   speed         23 non-null     float64
 8   engine        3322 non-null   object 
dtypes: float64(2), int64(2), object(5)
memory usage: 233.7+ KB


You may have noticed the difference between the two - one has parentheses and the other does not.

```python
planes.shape
planes.info()
```

* __Attribute__: a variable attached to an object; access it without parentheses (e.g. `planes.shape`).

* __Method__: a function attached to an object; call it with parentheses (e.g. `planes.info()`).
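To make the distinction concrete, here is a small sketch with a hypothetical `ToyTable` class. It is a much-simplified stand-in invented for this example, not how pandas implements a DataFrame.

```python
class ToyTable:
    """A hypothetical, much-simplified stand-in for a DataFrame."""

    def __init__(self, rows):
        self.rows = rows
        # Attribute: a plain value stored on the object.
        self.shape = (len(rows), len(rows[0]) if rows else 0)

    # Method: a function attached to the object; calling it needs ().
    def info(self):
        return f"ToyTable with {self.shape[0]} rows and {self.shape[1]} columns"


table = ToyTable([("N10156", 55), ("N102UW", 182)])

print(table.shape)            # (2, 2)
print(table.info())           # ToyTable with 2 rows and 2 columns
print(callable(table.shape))  # False
print(callable(table.info))   # True
```

Writing `table.info` without parentheses refers to the method itself rather than running it, which is why the parentheses matter.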