(data-import)=
# Data Import

## Introduction

In this chapter, we're going to introduce working with your own data. Here, you'll learn how to read plain-text rectangular files into Python. We'll only scratch the surface of data import, but many of the principles will translate to other forms of data. We'll finish with a few pointers to opening other types of data.

### Prerequisites

You will need to have the **pandas** package installed. To check this, and to import **pandas** into your session, run

In [3]:
import pandas as pd

If this command fails, you don't have **pandas** installed. Open up the terminal in Visual Studio Code (Terminal -> New Terminal) and type in `conda install pandas`.


## Getting Started

There is a huge range of input and output formats available in **pandas**: Stata (.dta), Excel (.xls, .xlsx), csv, tsv, big data formats (HDF5, parquet), JSON, SAS, SPSS, SQL, and more; there's a [full list](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) of formats available in the documentation.

![From the pandas documentation](https://pandas.pydata.org/pandas-docs/stable/_images/02_io_readwrite.svg)
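As a rough illustration of the pattern in the figure, readers are top-level functions named `pd.read_<format>` and writers are dataframe methods named `to_<format>`. The sketch below round-trips a tiny, made-up dataframe through CSV and JSON text held in memory (via `io.StringIO`) rather than through files on disk; the column names and values are invented for the example.

```python
import io

import pandas as pd

# A tiny, made-up dataframe used only for illustration.
df = pd.DataFrame({"name": ["Ada", "Grace"], "age": [36, 45]})

# Writers are dataframe methods named to_<format>.
# With no path given, they return the text instead of writing a file.
csv_text = df.to_csv(index=False)
json_text = df.to_json(orient="records")

# Readers are top-level functions named read_<format>.
# They accept a path or a file-like object such as io.StringIO.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text), orient="records")

print(from_csv)
print(from_json)
```

Swapping `io.StringIO(...)` for a file path is all it should take to work with real files.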

While **pandas** has a huge number of ways to read data and load it into your Python session, here we'll focus on the humble plain-text table file: for example, csv (comma-separated values) and tsv (tab-separated values).

### Reading data from a file

All of the power needed to open plain-text table files is contained in a single function, `pd.read_csv`. It takes numerous arguments, but the two most important are the (unnamed) first one, which gives the path to the data, and `sep=` (a keyword argument), which tells **pandas** whether to expect values to be separated by commas, tabs, or another character. If you leave `sep=` out, **pandas** assumes the default, a comma; passing `sep=None` together with `engine="python"` asks it to detect the separator instead. To see the full set of arguments, run `help(pd.read_csv)`.
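Here is a small sketch of how `sep=` changes the result. It uses made-up tab-separated text held in memory (via `io.StringIO`) as a stand-in for a real `.tsv` file. In current versions of **pandas**, the default is `sep=","`, and `sep=None` with `engine="python"` asks **pandas** to detect the delimiter.

```python
import io

import pandas as pd

# Made-up tab-separated text standing in for a .tsv file on disk.
tsv_text = "name\tage\nAda\t36\nGrace\t45\n"

# With the default sep=",", each line is read as one single column.
wrong = pd.read_csv(io.StringIO(tsv_text))

# Passing sep="\t" splits the values on tabs, as intended.
right = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# sep=None with engine="python" asks pandas to detect the delimiter.
sniffed = pd.read_csv(io.StringIO(tsv_text), sep=None, engine="python")

print(wrong.shape)               # (rows, columns) when the separator is wrong
print(right.shape)               # (rows, columns) when the separator is right
print(sniffed.columns.tolist())  # column names found by delimiter detection
```

If a file loads as a single wide column, a mismatched separator is usually the first thing to check.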

Here is what a simple CSV file with a row for column names (also commonly referred to as the header row) and six rows of data looks like (using the terminal):

In [7]:
! cat data/students.csv

Student ID,Full Name,favourite.food,mealPlan,AGE
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6

Note that this is a CSV file, so the values are separated by commas. Now let's load this into a **pandas** dataframe in Python:

In [8]:
students = pd.read_csv("data/students.csv")
students

| | Student ID | Full Name | favourite.food | mealPlan | AGE |
|---|---|---|---|---|---|
| 0 | 1 | Sunil Huffmann | Strawberry yoghurt | Lunch only | 4 |
| 1 | 2 | Barclay Lynn | French fries | Lunch only | 5 |
| 2 | 3 | Jayendra Lyne | NaN | Breakfast and lunch | 7 |
| 3 | 4 | Leon Rossini | Anchovies | Lunch only | NaN |
| 4 | 5 | Chidiegwu Dunkel | Pizza | Breakfast and lunch | five |
| 5 | 6 | Güvenç Attila | Ice cream | Lunch only | 6 |


If you want to download this data to try it for yourself, head to this link

The first argument to `read_csv` was the path to the data. Because we did not pass `sep=`, **pandas** used its default separator, a comma, which matches this file.
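If you don't have the file to hand, the following sketch reproduces the same load from text held in memory. The text is copied from the file listing above; the name `students_demo` and the use of `io.StringIO` are choices made for this example only. It also checks how the `N/A` entry, the empty `AGE` entry, and the value `five` come through.

```python
import io

import pandas as pd

# The same text as data/students.csv, held in memory so this runs without the file.
csv_text = """Student ID,Full Name,favourite.food,mealPlan,AGE
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6
"""

# No sep= is passed, so the default separator (a comma) is used.
students_demo = pd.read_csv(io.StringIO(csv_text))

# Six rows of data and five columns; the header row supplies the column names.
print(students_demo.shape)

# Count missing values per column: "N/A" and empty fields are read as missing.
print(students_demo.isna().sum())

# "five" is not a number, so the AGE values are kept as text.
print(students_demo["AGE"].tolist())
```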