# Reading data

We will look at different ways to read data, such as directly from a file or from a URL. We will consider some popular file types and the arguments available for reading them. This section covers key functions and packages for reading data:

- Getting filepaths: `pathlib`
- Reading local files: `read_csv`, `read_excel`
- Reading data from URLs: `requests`, `BeautifulSoup`


Start by importing pandas.

In [65]:
import pandas as pd

Now we can access pandas functions by using `pd.`

## Getting filepaths

For reading data from files saved locally on your machine or in the repository (not recommended for real data), it is useful to use the `Path` class from the `pathlib` module to build part of the filepath automatically. This is preferred over hardcoding full filepaths. With `Path` objects you join the parts of a filepath using `/`, whereas if you write the path out as a string you join the parts using `+`.

In [None]:
from pathlib import Path

current_directory = Path.cwd()
home_directory = Path.home()
documents_directory = Path.home() / "Documents"

print(current_directory)
print(home_directory)
print(documents_directory)

Please note that `Path.cwd()` returns the current working directory. In a Jupyter Notebook (as we are using here) this is usually the folder where the notebook is located, whereas for a .py file it is the folder your terminal is in when you run the script.
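
As a small illustration of the joining rule, the sketch below builds the same path with the `/` operator and with string concatenation. The `base` variable and the relative path are placeholders used only for illustration; the file does not need to exist for the code to run.

```python
from pathlib import Path

base = Path("tutorials")  # placeholder folder, used only for illustration

# Path objects are joined with "/", and pathlib inserts the separator for you
csv_path = base / "data" / "customers-1000.csv"

# Plain strings are joined with "+", so you must add each separator yourself
csv_string = "tutorials" + "/" + "data" + "/" + "customers-1000.csv"

print(csv_path.name)      # file name including the extension
print(csv_path.suffix)    # extension only
print(csv_path.parent)    # folder containing the file
print(csv_path.exists())  # True only if the file is really there
```

A `Path` object can be passed straight to `read_csv` or `read_excel`, so there is no need to convert it to a string first.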

## Reading CSV files

We will start with the simplest case - reading CSV files. The example data can be found in the repository in the `tutorials/data` folder. We can use the pandas function `read_csv`, where the only argument we need to supply is the filepath of the CSV we wish to read.

In [None]:
data = pd.read_csv(current_directory / "data/customers-1000.csv")

data

### Popular arguments

**usecols** - allows you to select only certain columns to be read in, either by putting the column names in a list or by referencing their positions (remember Python indexes from 0).

In [None]:
data = pd.read_csv(current_directory / "data/customers-1000.csv", usecols=["Customer Id", "First Name"])

data

In [None]:
data = pd.read_csv(current_directory / "data/customers-1000.csv", usecols=[1,2])

data

There are many other possible arguments, but I don't often find the need to use them for CSVs. For more information on the different arguments, please see the documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
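
As a rough sketch of a few other arguments that come up fairly often, the example below uses `sep`, `dtype` and `nrows`. It reads a small in-memory CSV with made-up data so that it runs without any file on disk; the variable names `raw` and `sample` are only for illustration.

```python
from io import StringIO

import pandas as pd

# Made-up data standing in for a file on disk; note the ";" separator
raw = StringIO(
    "Customer Id;First Name;Score\n"
    "001;Ana;7\n"
    "002;Ben;9\n"
    "003;Cat;4\n"
)

sample = pd.read_csv(
    raw,
    sep=";",                     # column separator used in the file
    dtype={"Customer Id": str},  # read as text so leading zeros are kept
    nrows=2,                     # only read the first two data rows
)

print(sample)
```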

## Reading XLSX

Now we will consider reading data from an xlsx or xls file. We use the pandas function `read_excel`.

In [None]:
data = pd.read_excel(current_directory / "data/Financial Sample.xlsx")

data

### Popular arguments

**header** - allows you to pick which row number you want to use as your column headers (again, remember Python indexes from 0).

**usecols** - allows you to select only certain columns to be read in, either by putting the column names in a list, by referencing their positions, or by giving the column letters from Excel as a string. You can use ":" to mean a range. For example, "A,C" means columns "A" and "C", and "A:C" means columns "A", "B" and "C".

In [None]:
data = pd.read_excel(current_directory / "data/Financial Sample.xlsx", header=2, usecols=["Segment","Country"])

data

In [None]:
data = pd.read_excel(current_directory / "data/Financial Sample.xlsx", header=2, usecols="A:C,F,H")

data

**sheet_name** - an xlsx file will often contain multiple sheets, so state the name (or position) of the sheet you want. If you leave this out, pandas reads the first sheet.

In [None]:
data = pd.read_excel(current_directory / "data/Financial Sample.xlsx", sheet_name="Sheet1", header=2, usecols="A:C,F,H")

data

**encoding** - this is an argument of `read_csv` rather than `read_excel`. The default encoding is `utf-8`, which will usually work, but sometimes you may get an error that the file cannot be read due to its encoding, and you must try another one, such as `ISO-8859-1`.
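
A minimal sketch of the encoding fallback, using `read_csv` (which accepts an `encoding` argument) and made-up bytes in place of a real file:

```python
from io import BytesIO

import pandas as pd

# Made-up CSV content saved as ISO-8859-1 rather than utf-8
raw_bytes = "name,city\nZoë,Málaga\n".encode("ISO-8859-1")

try:
    # The default utf-8 decoding fails on these bytes
    cities = pd.read_csv(BytesIO(raw_bytes))
except UnicodeDecodeError:
    # Retry with the encoding the content was actually written in
    cities = pd.read_csv(BytesIO(raw_bytes), encoding="ISO-8859-1")

print(cities)
```

In practice you would pass a filepath instead of `BytesIO(raw_bytes)`; the in-memory bytes are only there to keep the example self-contained.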

One horrible scenario you may encounter with xlsx files is that cells are merged or a single sheet contains multiple tables. Please see the example below for dealing with this.

In [None]:
df = pd.read_excel(
    current_directory / "data/Financial Sample merged cells.xlsx", header=[2, 3]
)

# Get the number of columns
num_cols = df.shape[1]

# Extract merged headers and column names
merged_headers = df.columns.get_level_values(0)
column_names = df.columns.get_level_values(1)

# Initialize the current merged header
current_merged_header = None

# Create new column headers
new_columns = []
for col in range(num_cols):
    merged_header = merged_headers[col]
    column_name = column_names[col]
    
    # Update the current merged header if it's not empty
    if pd.notna(merged_header) and merged_header != '':
        current_merged_header = merged_header
    
    # Combine the current merged header with the column name
    if current_merged_header:
        new_column_name = f"{current_merged_header}_{column_name}"
    else:
        new_column_name = column_name
    
    new_columns.append(new_column_name)

# Assign the new column headers to the DataFrame
df.columns = new_columns

# Filter out the blank columns between tables
df = df.loc[:, ~df.columns.str.contains('Date.1', case=False, na=False)]

# Drop the header rows (first two rows)
df = df.drop(index=[0, 1])

df
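
When the upper header level is already filled in for every column, the same flattening can be written more compactly. The sketch below uses a small hand-built DataFrame with a two-level header (made-up names and values) rather than the Excel file, so it illustrates the idea and is not a drop-in replacement for the loop above.

```python
import pandas as pd

# Made-up two-level header: (merged header, column name)
columns = pd.MultiIndex.from_tuples(
    [("Sales", "Units"), ("Sales", "Price"), ("Costs", "COGS")]
)
table = pd.DataFrame([[10, 2.5, 4.0], [20, 3.0, 9.5]], columns=columns)

# Join the two levels with "_" to get a single row of column headers
table.columns = [f"{top}_{bottom}" for top, bottom in table.columns]

print(table.columns.tolist())
```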


## Reading ODS files

This works very similarly to reading Excel files. The main difference is the engine used to parse the file, which you can set with the `engine` argument of `read_excel`. I have found the fastest to be `engine="calamine"`. More information on engines can be found in the `read_excel` documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
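
A minimal sketch of reading an ODS file. The helper `read_ods` and the file name `example.ods` are hypothetical. The chosen engine needs its package installed (`python-calamine` for `calamine`, `odfpy` for `odf`), and the `calamine` engine was added in pandas 2.2.

```python
from pathlib import Path

import pandas as pd


def read_ods(filepath, engine="calamine", **kwargs):
    """Read an ODS file with pandas, passing extra arguments on to read_excel."""
    return pd.read_excel(filepath, engine=engine, **kwargs)


ods_path = Path.cwd() / "data" / "example.ods"  # hypothetical file

# Only attempt the read if the file is actually present
if ods_path.exists():
    ods_data = read_ods(ods_path, sheet_name=0)
    print(ods_data.head())
```

Arguments such as `header` and `usecols` can be passed through in the same way as for xlsx files.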

## Reading data from urls

To read data directly from URLs we use the `requests` package. We will look at downloading data from this URL: https://public.tableau.com/app/learn/sample-data

The first example provides the full URL to the file.

In [None]:
import requests
from io import BytesIO

url = "https://public.tableau.com/app/sample-data/EMSI_JobChange_UK.xlsx"

# Send request to get data, allows 10 seconds or there will be a timeout error
data = requests.get(url, timeout=10)

# Use BytesIO to handle the content in memory without need to save it 
file_content = BytesIO(data.content)
    
# Read excel file into dataframe
df = pd.read_excel(file_content, sheet_name="1 digit")

df


We may only want to provide the URL of the landing page and then use Python to find something on that page, such as the link behind a download button. We can use the package `BeautifulSoup` for this.

In [None]:
from bs4 import BeautifulSoup

url_landing = "https://public.tableau.com/app/learn/sample-data"
page = requests.get(url_landing, timeout=10)
soup = BeautifulSoup(page.content, features="html.parser")

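# Look for the first link whose HTML mentions the xlsx dataset
url = None
for a in soup.find_all("a", href=True):
    if "Dataset (xlsx)" in str(a):
        url = a["href"]
        break

# Prints None if no matching link was found on the page
print(url)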

This returns the full URL, which you can then read into a pandas dataframe as before. The complete code is as follows.

In [None]:
url_landing = "https://public.tableau.com/app/learn/sample-data"
page = requests.get(url_landing, timeout=10)
soup = BeautifulSoup(page.content, features="html.parser")

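# Look for the first link whose HTML mentions the xlsx dataset
url = None
for a in soup.find_all("a", href=True):
    if "Dataset (xlsx)" in str(a):
        url = a["href"]
        break

# Stop early rather than requesting a missing or stale url
if url is None:
    raise ValueError("No link containing 'Dataset (xlsx)' was found on the page")
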
# Send request to get data, allows 10 seconds or there will be a timeout error
data = requests.get(url, timeout=10)

# Use BytesIO to handle the content in memory without need to save it 
file_content = BytesIO(data.content)
    
# Read excel file into dataframe
df = pd.read_excel(file_content, sheet_name="1 digit")

df
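
The loop above uses the `href` value exactly as it appears in the page. If a page uses relative links, the value has to be combined with the landing page URL before it can be requested. A minimal sketch using `urljoin` from the standard library, with made-up `href` values:

```python
from urllib.parse import urljoin

url_landing = "https://public.tableau.com/app/learn/sample-data"

# Hypothetical href values, as they might appear in an <a> tag
relative_href = "/app/sample-data/example.xlsx"
absolute_href = "https://example.com/files/example.xlsx"

# A relative link is resolved against the landing page URL
print(urljoin(url_landing, relative_href))

# An absolute link is returned unchanged
print(urljoin(url_landing, absolute_href))
```

Wrapping the link as `urljoin(url_landing, a["href"])` therefore works for both kinds of link.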

## Viewing dataframes

Sometimes you may not want to print the whole dataframe. There are plenty of useful methods that give you a quick view of your data. To view only the first 5 rows of the dataframe we can use `.head()`; similarly, to view the last 5 we can use `.tail()`. The default is 5, but you can also specify how many rows you wish to see.

In [None]:
df.head(10)

You can also select rows based on their values. If you want the 10 rows with the largest values in the 'Change' column, you can use `nlargest`, where the first argument is the number of rows and the second is the column to sort by. `nsmallest` works the same way for the smallest values.

In [None]:
df.nlargest(10,"Change")
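
To show how `nlargest` and `nsmallest` pick rows, here is a small sketch on a made-up DataFrame. The values are invented; only the 'Change' column name is borrowed from the data above.

```python
import pandas as pd

jobs = pd.DataFrame(
    {
        "Occupation": ["A", "B", "C", "D"],
        "Change": [120, -30, 45, 300],
    }
)

# Two rows with the largest values in "Change", largest first
top_two = jobs.nlargest(2, "Change")

# Two rows with the smallest values in "Change", smallest first
bottom_two = jobs.nsmallest(2, "Change")

print(top_two)
print(bottom_two)
```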

You can get a quick overview of the data types and size of a dataframe using the `.info()` method.

In [None]:
df.info()

You can get a list of all the column headers using `.columns`.

In [None]:
df.columns
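
`.columns` returns a pandas `Index` object rather than a plain list. A short sketch, on a made-up DataFrame, of turning it into a Python list and checking the number of rows and columns with `.shape`:

```python
import pandas as pd

# Made-up data, used only for illustration
small = pd.DataFrame(
    {"Segment": ["Government", "Midmarket"], "Units Sold": [1618.5, 1321.0]}
)

column_list = small.columns.tolist()  # plain list of header names
n_rows, n_cols = small.shape          # (number of rows, number of columns)

print(column_list)
print(n_rows, n_cols)
```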