# Reading data

We will look at different ways to read data: directly from a file, from a URL or from an API. We will cover some popular file types and the most useful arguments for reading each of them. Start by importing pandas.

In [21]:
import pandas as pd

From now on, whenever we want to use a pandas function we reference it with the prefix `pd.`, for example `pd.read_csv`.

## Reading data from static files

When reading files saved locally on your machine or in the repository (not recommended for real data), it is useful to use the `Path` class from Python's built-in `pathlib` module to build part of the filepath automatically.

In [None]:
from pathlib import Path

current_directory = Path.cwd()
home_directory = Path.home()
documents_directory = Path.home() / "Documents"

print(current_directory)
print(home_directory)
print(documents_directory)

Please note that `Path.cwd()` returns the current working directory. In a Jupyter Notebook (as we are using here) this is normally the folder where the notebook is located. In a .py file it is the folder your terminal was in when you ran the script, which is not necessarily where the script is saved.
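Because the working directory can differ from what you expect, it can help to build the full path and check that the file exists before reading it. The snippet below is a minimal sketch; it assumes the same `data/customers-1000.csv` location used later in this tutorial.

```python
from pathlib import Path

# Build the path piece by piece; "/" joins path parts on any operating system
data_file = Path.cwd() / "data" / "customers-1000.csv"

print(data_file.name)      # file name only
print(data_file.suffix)    # file extension
print(data_file.exists())  # False means the working directory is not what you expected
```

If `exists()` prints `False`, print `Path.cwd()` to see where Python is actually looking.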

### Reading CSVs

We will start with the simplest case: reading CSV files. I have stored example data on the repository in tutorials/data. We can use the built-in pandas function `read_csv`.

In [None]:
data = pd.read_csv(current_directory / "data/customers-1000.csv")

print(data)

### Popular arguments

**usecols** - allows you to select only certain columns to be read in, either by putting the column names in a list or by referencing their positions (remember Python indexes from 0!)

In [None]:
data = pd.read_csv(current_directory / "data/customers-1000.csv", usecols=["Customer Id", "First Name"])

print(data)

In [None]:
data = pd.read_csv(current_directory / "data/customers-1000.csv", usecols=[1,2])

print(data)

There are many other possible arguments, but I don't often find the need to use them for CSVs. For more information on the different arguments, please see the documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
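To give a flavour of those other arguments, here is a small sketch using `dtype`, `parse_dates` and `nrows`. The data is a made-up sample held in memory with `StringIO` (not the repository file), so the column values are purely illustrative.

```python
from io import StringIO

import pandas as pd

# Made-up sample data standing in for a CSV file on disk
csv_text = StringIO(
    "Customer Id,First Name,Subscription Date,Orders\n"
    "001,Ana,2021-03-04,3\n"
    "002,Ben,2020-11-19,7\n"
    "003,Cy,2022-01-30,1\n"
)

sample = pd.read_csv(
    csv_text,
    dtype={"Customer Id": str},          # keep leading zeros instead of converting to integers
    parse_dates=["Subscription Date"],   # read this column as dates rather than text
    nrows=2,                             # only read the first two data rows
)

print(sample)
print(sample.dtypes)
```

`nrows` is handy for taking a quick look at a very large file before reading all of it.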

### Reading XLSX

Now we will consider reading data from an xlsx file; xls files work the same way. We use the built-in pandas function `read_excel`.

In [None]:
data = pd.read_excel(current_directory / "data/Financial Sample.xlsx")

print(data)

### Popular arguments

**header** - allows you to pick which row number you want to use as your column headers (again, remember Python indexes from 0!)

**usecols** - allows you to select only certain columns to be read in, either by putting the column names in a list or by referencing their positions (remember Python indexes from 0!). You can also reference Excel column letters as a string.

In [None]:
data = pd.read_excel(current_directory / "data/Financial Sample.xlsx", header=2, usecols=["Segment","Country"])

print(data)

In [None]:
data = pd.read_excel(current_directory / "data/Financial Sample.xlsx", header=2, usecols="A:C,F,H")

print(data)
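To see what `header` does without needing an Excel file, the sketch below uses `read_csv`, which accepts `header` with the same meaning, on made-up in-memory text. The two title rows above the real header are an assumption for illustration.

```python
from io import StringIO

import pandas as pd

# Made-up sample: two title rows, then the real column headers on row 2
text = StringIO(
    "Quarterly report,,\n"
    "Generated by finance,,\n"
    "Segment,Country,Units\n"
    "Government,Canada,10\n"
    "Midmarket,France,20\n"
)

# header=2 uses the third row (index 2) as column names and discards the rows above it
demo = pd.read_csv(text, header=2)

print(demo)
```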

**sheet_name** - an xlsx file will often contain multiple sheets, so state the name of the sheet you want. By default only the first sheet is read.

In [None]:
data = pd.read_excel(current_directory / "data/Financial Sample.xlsx", sheet_name="Sheet1", header=2, usecols="A:C,F,H")

print(data)

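**encoding** - note that this is an argument of `read_csv` and is not accepted by `read_excel` in current pandas versions. For CSV files the default encoding is `utf-8`. This will usually work, but sometimes you may get an error that pandas is unable to read the file due to its encoding, and you must try another one, such as `ISO-8859-1`.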
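A minimal sketch of trying a second encoding when the first one fails, shown on a CSV since that is where `read_csv` takes the `encoding` argument. The helper name `read_csv_with_fallback` and the sample bytes are made up for illustration.

```python
from io import BytesIO

import pandas as pd

# Made-up sample: text containing accented characters, saved as ISO-8859-1
raw_bytes = "name,city\nJosé,Málaga\n".encode("ISO-8859-1")


def read_csv_with_fallback(source_bytes, encodings=("utf-8", "ISO-8859-1")):
    """Try each encoding in turn and return the first DataFrame that decodes."""
    for encoding in encodings:
        try:
            return pd.read_csv(BytesIO(source_bytes), encoding=encoding)
        except UnicodeDecodeError:
            continue  # this encoding could not decode the bytes, try the next one
    raise ValueError(f"None of the encodings worked: {encodings}")


data = read_csv_with_fallback(raw_bytes)
print(data)
```

Keep `ISO-8859-1` last in the list: it can decode any byte, so it never raises an error, even when it is the wrong choice and produces garbled characters.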

One horrible scenario you may encounter with xlsx files is that header cells are merged or a single sheet contains multiple tables. The example below shows one way of dealing with this.

In [None]:
# Read with header=None so the two header rows arrive as ordinary rows 0 and 1
df = pd.read_excel(
    current_directory / "data/Financial Sample merged cells.xlsx", header=None)

# Get the number of columns
num_cols = df.shape[1]

# Extract merged headers and column names
merged_headers = df.iloc[0].values
column_names = df.iloc[1].values

# Initialize the current merged header
current_merged_header = None

# Create new column headers
new_columns = []
for col in range(num_cols):
    merged_header = merged_headers[col]
    column_name = column_names[col]
    
    # Update the current merged header if it's not empty
    if pd.notna(merged_header) and merged_header != '':
        current_merged_header = merged_header
    
    # Combine the current merged header with the column name
    if current_merged_header:
        new_column_name = f"{current_merged_header}_{column_name}"
    else:
        new_column_name = column_name
    
    new_columns.append(new_column_name)

# Assign the new column headers to the DataFrame
df.columns = new_columns

# Filter out columns whose name contains "nan"; these are the empty columns between tables
df = df.loc[:, ~df.columns.str.contains('nan', case=False, na=False)]

# Drop the header rows (first two rows)
df = df.drop(index=[0, 1])


print(df)
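The same idea can be sketched more compactly with pandas' `ffill`, which carries each merged header forward over the blank cells it spans. The small DataFrame here is a made-up stand-in for what `read_excel(..., header=None)` might return for a sheet with two tables separated by an empty column; it is not the repository file.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a raw sheet: row 0 = merged headers, row 1 = column names
raw = pd.DataFrame([
    ["Sales", np.nan, np.nan, "Costs", np.nan],
    ["Country", "Units", np.nan, "Country", "COGS"],
    ["Canada", 10, np.nan, "Canada", 4],
    ["France", 20, np.nan, "France", 9],
])

top = raw.iloc[0].ffill()   # repeat each merged header across the columns it covers
names = raw.iloc[1]         # the per-column names underneath
keep = names.notna()        # spacer columns between tables have no name

# Combine the two header rows for the columns we keep
new_columns = [f"{t}_{n}" for t, n in zip(top[keep], names[keep])]

# Drop the two header rows and the spacer columns, then apply the new names
tidy = raw.loc[2:, keep.to_numpy()].reset_index(drop=True)
tidy.columns = new_columns

print(tidy)
```

Deciding which columns to keep from the missing names, rather than searching the combined name for the text "nan", avoids accidentally dropping a real column whose name happens to contain those letters (for example "Finance").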


### Reading ODS files

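This works very similarly to reading Excel files with `read_excel`, the main difference being the engine used to read the file, which you can state as an argument. I have found the fastest to be `engine="calamine"`, which requires the `python-calamine` package and a recent pandas version; the `odf` engine (which requires the `odfpy` package) is the other option. More information on engines can be found in the read_excel documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html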
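As a rough sketch, you can check which ODS-capable engine is installed before calling `read_excel`. The helper name `pick_ods_engine` and the file name in the final comment are hypothetical; the module names checked are the import names of the `python-calamine` and `odfpy` packages.

```python
import importlib.util


def pick_ods_engine():
    """Return an engine name for .ods files based on what is installed, else None."""
    if importlib.util.find_spec("python_calamine") is not None:
        return "calamine"
    if importlib.util.find_spec("odf") is not None:
        return "odf"
    return None


engine = pick_ods_engine()
print(engine)

# Hypothetical usage, assuming an .ods file exists at this location:
# data = pd.read_excel(current_directory / "data/example.ods", engine=engine)
```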