# Reading data

We will look at different ways to read data: directly from a file, from a URL, or from an API. We will cover some popular file types and the arguments available for reading them. Start by importing pandas.

In [1]:
import pandas as pd

From now on, when we want to use a pandas function we reference it with the `pd.` prefix, for example `pd.read_csv`.

## Reading data from static files

To read data from files saved locally on your machine or in the repository (not recommended for real data), it is useful to use the `Path` class from Python's built-in `pathlib` module to build part of the filepath automatically.

In [5]:
from pathlib import Path

current_directory = Path.cwd()
home_directory = Path.home()
documents_directory = Path.home() / "Documents"

print(current_directory)
print(home_directory)
print(documents_directory)

c:\Users\Eleanor-Young\Desktop\python-pandas-toolkit\tutorials
C:\Users\Eleanor-Young
C:\Users\Eleanor-Young\Documents


Please note that `Path.cwd()` returns the current working directory. In a Jupyter Notebook (as we are using here) this is normally the folder where the notebook is located. In a .py file it is the folder your terminal is in when you run the script.
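
As a small sketch of how the pieces of a path fit together, the block below builds the path to the example CSV step by step and inspects it. It assumes the file sits in a `data` folder under the current working directory, as in this tutorial; the variable name `data_file` is only for illustration.

```python
from pathlib import Path

# Assumption: the data lives in a "data" folder under the current working directory.
data_file = Path.cwd() / "data" / "customers-1000.csv"

print(data_file.name)      # the file name on its own
print(data_file.suffix)    # the file extension
print(data_file.exists())  # True only if the file is really there
```

Checking `exists()` before reading can save you from a confusing `FileNotFoundError` when the working directory is not what you expected.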

### Reading CSVs

We will start with the simplest case: reading CSV files. I have stored example data in the repository under tutorials/data. We can use the built-in pandas function `read_csv`.

In [6]:
data = pd.read_csv(current_directory / "data/customers-1000.csv")

print(data)

     Index      Customer Id First Name Last Name                      Company  \
0        1  dE014d010c7ab0c     Andrew   Goodman                Stewart-Flynn   
1        2  2B54172c8b65eC3      Alvin      Lane  Terry, Proctor and Lawrence   
2        3  d794Dd48988d2ac      Jenna   Harding                 Bailey Group   
3        4  3b3Aa4aCc68f3Be   Fernando      Ford                 Moss-Maxwell   
4        5  D60df62ad2ae41E       Kara     Woods              Mccarthy-Kelley   
..     ...              ...        ...       ...                          ...   
995    996  FbcCaF483aFaFAE      Diana    Monroe                  Bass-Wilson   
996    997  979c4D58Ae9a9cc      Jerry   Morales                   Pratt-King   
997    998  D0DcF6a4BcefCc8     Tracie     Floyd     Holt, Wilson and Shields   
998    999  90EE9CbbDa374E9       Paul    Barnes     Brown, Oliver and Haynes   
999   1000  51732B5b2328015    Dominic     Duran                   Durham LLC   

                     City  

### Popular arguments

**usecols** - lets you select only certain columns to read in, either by listing the column names or by giving their positions (remember Python indexes from 0!).

In [8]:
data = pd.read_csv(current_directory / "data/customers-1000.csv", usecols=["Customer Id", "First Name"])

print(data)

         Customer Id First Name
0    dE014d010c7ab0c     Andrew
1    2B54172c8b65eC3      Alvin
2    d794Dd48988d2ac      Jenna
3    3b3Aa4aCc68f3Be   Fernando
4    D60df62ad2ae41E       Kara
..               ...        ...
995  FbcCaF483aFaFAE      Diana
996  979c4D58Ae9a9cc      Jerry
997  D0DcF6a4BcefCc8     Tracie
998  90EE9CbbDa374E9       Paul
999  51732B5b2328015    Dominic

[1000 rows x 2 columns]


In [9]:
data = pd.read_csv(current_directory / "data/customers-1000.csv", usecols=[1,2])

print(data)

         Customer Id First Name
0    dE014d010c7ab0c     Andrew
1    2B54172c8b65eC3      Alvin
2    d794Dd48988d2ac      Jenna
3    3b3Aa4aCc68f3Be   Fernando
4    D60df62ad2ae41E       Kara
..               ...        ...
995  FbcCaF483aFaFAE      Diana
996  979c4D58Ae9a9cc      Jerry
997  D0DcF6a4BcefCc8     Tracie
998  90EE9CbbDa374E9       Paul
999  51732B5b2328015    Dominic

[1000 rows x 2 columns]


`read_csv` accepts many other arguments, but I don't often find the need to use them for CSVs. For more information on the different arguments, please refer to the documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
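
To give a flavour of a few of those other arguments, here is a minimal sketch using `nrows`, `index_col` and `dtype`. The three rows of CSV text are made up for illustration and are read from memory with `io.StringIO`, so the block runs without the tutorial's data file.

```python
import io

import pandas as pd

# Made-up sample rows, for illustration only (not the tutorial's data set).
csv_text = """Index,Customer Id,First Name
1,A001,Ann
2,A002,Bob
3,A003,Cat
"""

# nrows: read only the first two data rows.
# index_col: use the "Index" column as the row labels.
# dtype: force "Customer Id" to be read as text.
sample = pd.read_csv(
    io.StringIO(csv_text),
    nrows=2,
    index_col="Index",
    dtype={"Customer Id": str},
)

print(sample)
```

The same arguments work when you pass a file path instead of the in-memory text.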

### Reading XLSX

Now we will consider reading data from an xlsx file; xls files work the same way. We use the built-in pandas function `read_excel`.

In [10]:
data = pd.read_excel(current_directory / "data/Financial Sample.xlsx")

print(data)

ImportError: Missing optional dependency 'openpyxl'.  Use pip or conda to install openpyxl.
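
The error above means pandas could not find openpyxl, the optional package it relies on to open xlsx files. As a minimal sketch, you can check whether it is available in your environment before calling `read_excel`; the variable name `has_openpyxl` is only for illustration.

```python
import importlib.util

# find_spec returns None when the package cannot be found in the current environment.
has_openpyxl = importlib.util.find_spec("openpyxl") is not None

if has_openpyxl:
    print("openpyxl is installed, so read_excel can open xlsx files.")
else:
    print("openpyxl is missing. Install it with: pip install openpyxl")
```

After installing the package, restart the notebook kernel and run the `read_excel` cell again.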