<img src="img/tutorial-grid-snakes-light-header.svg" alt="tutorial-logo" style="width: 100%;"/>

# Working with spreadsheets using pandas


In this section of the tutorial, we will learn how to work with data and spreadsheets using the [`pandas`](https://pandas.pydata.org/) library in Python.

In [33]:
import pandas as pd

## Reading our data into pandas

The data you want to work with in pandas and then export to a spreadsheet can start in a variety of formats. The most common formats are:
- CSV (comma-separated values)
- Excel spreadsheets
- JSON (JavaScript Object Notation)
- ODF (Open Document Format)
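
As a rough sketch, each of these formats has a matching reader in pandas. The example below keeps everything in memory so it runs without any files. The two-row sample (fish names and sell prices) is only an illustration, and the commented-out paths are hypothetical.

```python
import io

import pandas as pd

# Tiny in-memory samples standing in for real files
csv_text = "Name,Sell\nanchovy,200\nangelfish,3000\n"
json_text = '[{"Name": "anchovy", "Sell": 200}, {"Name": "angelfish", "Sell": 3000}]'

from_csv = pd.read_csv(io.StringIO(csv_text))     # CSV
from_json = pd.read_json(io.StringIO(json_text))  # JSON

# Excel and ODF files both go through read_excel (hypothetical paths):
# pd.read_excel("some_file.xlsx")                # xlsx, needs the openpyxl package
# pd.read_excel("some_file.ods", engine="odf")   # ODF, needs the odfpy package

print(from_csv)
print(from_json)
```

Both readers should produce the same two-column table, which is the point: once the data is in a DataFrame, the original format no longer matters.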

In [42]:
# Let's start by creating a DataFrame from a CSV file
acnh_fish_df = pd.read_csv('sample_data/acnh-game-data/fish.csv')
acnh_fish_df

|  | # | Name | Sell | Where/How | Shadow | Total Catches to Unlock | Spawn Rates | Rain/Snow Catch Up | NH Jan | NH Feb | ... | SH Dec | Color 1 | Color 2 | Size | Lighting Type | Icon Filename | Critterpedia Filename | Furniture Filename | Internal ID | Unique Entry ID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 56 | anchovy | 200 | Sea | Small | 0 | 2–5 | No | 4 AM – 9 PM | 4 AM – 9 PM | ... | 4 AM – 9 PM | Blue | Red | 1x1 | No lighting | Fish81 | FishAntyobi | FtrFishAntyobi | 4201 | LzuWkSQP55uEpRCP5 |
| 1 | 36 | angelfish | 3000 | River | Small | 20 | 2–5 | No |  |  | ... | 4 PM – 9 AM | Yellow | Black | 1x1 | Fluorescent | Fish30 | FishAngelfish | FtrFishAngelfish | 2247 | XTCFCk2SiuY5YXLZ7 |
| 2 | 44 | arapaima | 10000 | River | XX-Large | 50 | 1 | Yes |  |  | ... | 4 PM – 9 AM | Black | Blue | 3x2 | No lighting | Fish36 | FishPiraruku | FtrFishPiraruku | 2253 | mZy4BES54bqwi97br |
| 3 | 41 | arowana | 10000 | River | Large | 50 | 1–2 | No |  |  | ... | 4 PM – 9 AM | Yellow | Black | 2x1 | Fluorescent | Fish33 | FishArowana | FtrFishArowana | 2250 | F68AvCaqddBJL7ZSN |
| 4 | 58 | barred knifejaw | 5000 | Sea | Medium | 20 | 3–5 | No |  |  | ... | All day | White | Black | 1x1 | Fluorescent | Fish47 | FishIshidai | FtrFishIshidai | 2265 | X3R9SFSAaDzBF4fE3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 75 | 23 | tilapia | 800 | River | Medium | 0 | 7–9 | No |  |  | ... | All day | Black | Black | 1x1 | Fluorescent | Fish76 | FishThirapia | FtrFishThirapia | 4190 | as78rnkwY3ahrTkBY |
| 76 | 66 | tuna | 7000 | Pier | XX-Large | 50 | 2 | Yes | All day | All day | ... |  | Blue | Black | 2x1 | Fluorescent | Fish57 | FishMaguro | FtrFishMaguro | 2274 | 4PnGXx9DSb866AeCM |
| 77 | 75 | whale shark | 13000 | Sea | Large w/Fin | 50 | 1 | Yes |  |  | ... | All day | Black | Blue | 3x2 | No lighting | Fish72 | FishJinbeezame | FtrFishJinbee | 2282 | r3RAtJsXENwnFvQh7 |
| 78 | 21 | yellow perch | 300 | River | Medium | 0 | 7–10 | No | All day | All day | ... |  | Yellow | Black | 1x1 | Fluorescent | Fish18 | FishYellowparch | FtrFishYellowparch | 2233 | bLgE5dicZniF5zZDW |


We can also create a DataFrame from an xlsx file:


In [43]:
iris_df = pd.read_excel("sample_data/iris-data.xlsx")
iris_df

|  | Iris data set (as Excel table) | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | sepal_length | sepal_width | petal_length | petal_width | species |  | Select cell G3 to see the sample Python formula. |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |  |  |
| 2 | 4.9 | 3 | 1.4 | 0.2 | setosa |  |  |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |  |  |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |  |  |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 148 | 6.5 | 3 | 5.2 | 2 | virginica |  |  |
| 149 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |  |  |
| 150 | 5.9 | 3 | 5.1 | 1.8 | virginica |  |  |
| 151 |  |  |  |  |  |  |  |


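This result is not quite what we want: the sheet's title row was used as the header, so the real column names (`sepal_length`, `sepal_width`, ...) ended up in row 0 and most columns are called `Unnamed: N`. We need more control over how the xlsx file is read, so let's look at the options that the `read_excel` function offers.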

In [48]:
help(pd.read_excel)  # in Jupyter/IPython, `pd.read_excel?` shows the same help and `pd.read_excel??` adds the source

Help on function read_excel in module pandas.io.excel._base:

read_excel(
    io,
    sheet_name: 'str | int | list[IntStrT] | None' = 0,
    *,
    header: 'int | Sequence[int] | None' = 0,
    names: 'SequenceNotStr[Hashable] | range | None' = None,
    index_col: 'int | str | Sequence[int] | None' = None,
    usecols: 'int | str | Sequence[int] | Sequence[str] | Callable[[str], bool] | None' = None,
    dtype: 'DtypeArg | None' = None,
    engine: "Literal['xlrd', 'openpyxl', 'odf', 'pyxlsb', 'calamine'] | None" = None,
    converters: 'dict[str, Callable] | dict[int, Callable] | None' = None,
    true_values: 'Iterable[Hashable] | None' = None,
    false_values: 'Iterable[Hashable] | None' = None,
    skiprows: 'Sequence[int] | int | Callable[[int], object] | None' = None,
    nrows: 'int | None' = None,
    na_values=None,
    keep_default_na: 'bool' = True,
    na_filter: 'bool' = True,
    verbose: 'bool' = False,
    parse_dates: 'list | dict | bool' = False,
    date_parser: '

The options we are most likely to need are:

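**General**
- `sheet_name`: the sheet to read from the xlsx file, given as a name or a zero-based position. A list reads several sheets and `None` reads all of them, returning a dict of DataFrames.
    - If not specified, pandas will read the first sheet by default.
- `engine`: the library used to parse the file, for example `openpyxl` for xlsx files or `odf` for ODF files.
    - If not specified, pandas will pick an engine based on the file type.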

**Rows**
- `header`: the zero-based row number to use as the column names.
    - If not specified, pandas will use the first row (`header=0`). Pass `None` if the sheet has no header row.
- `skiprows`: the number of rows to skip at the beginning of the sheet, or a list of zero-based row numbers to skip.
    - If not specified, pandas will not skip any rows.
- `nrows`: the number of data rows to read from the sheet.
    - If not specified, pandas will read all rows by default.

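**Columns**
- `index_col`: the zero-based column number to use as the index.
    - If not specified, pandas will use a default integer index.
- `usecols`: the columns to read from the xlsx file, given as Excel column letters (such as `"A:E"`) or as a list of column positions or names.
    - If not specified, pandas will read all columns by default.
- `dtype`: the data type to use for the columns, either one type for everything or a dict that maps column names to types.
    - If not specified, pandas will try to infer the data type from the data.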

**Data**
- `na_values`: additional values to treat as missing.
    - If not specified, pandas will only use its default set of missing-value markers (empty cells, `NA`, `NaN`, and similar).
- `converters`: a dict that maps a column name or position to a function applied to every value in that column.
    - If not specified, no converters are applied and pandas infers the types itself.
- `parse_dates`: whether to parse columns as dates, either `True` for the index or a list of columns.
    - If not specified, pandas will not try to parse text columns as dates.
- `thousands`: the thousands separator to use when parsing numbers that are stored as text.
    - If not specified, pandas will not assume any thousands separator.
- `decimal`: the decimal separator to use when parsing numbers that are stored as text.
    - If not specified, pandas will use `.` as the decimal separator.
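
To see a few of these options working together, here is a minimal sketch. It builds a small in-memory workbook that imitates the layout we saw earlier (a title in the first row, the real column names in the second) rather than reading the tutorial's file, so the rows and the `"unknown"` placeholder are invented for illustration. It assumes the `openpyxl` package is installed.

```python
import io

import pandas as pd

# Hypothetical mini workbook: title row first, real column names second
raw = pd.DataFrame(
    [
        ["Iris data set (as Excel table)", None, None, None],
        ["sepal_length", "sepal_width", "species", "comment"],
        [5.1, 3.5, "setosa", "not needed"],
        [4.9, "unknown", "setosa", None],
        [4.7, 3.2, "setosa", None],
    ]
)
buffer = io.BytesIO()  # stands in for a file on disk
raw.to_excel(buffer, index=False, header=False)
buffer.seek(0)

iris_small = pd.read_excel(
    buffer,
    sheet_name=0,            # first sheet
    header=1,                # second row (zero-based) holds the column names
    usecols="A:C",           # keep columns A to C, dropping the comment column
    nrows=2,                 # read only the first two data rows
    na_values=["unknown"],   # treat this placeholder as missing
)
print(iris_small)
```

The same keyword arguments should carry over to a real file path; only the values (which row is the header, which columns to keep) depend on how the sheet is laid out.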

In [51]:
# read_csv shares many of these options; here parse_dates converts the "Date" column to datetimes
pd.read_csv("sample_data/Summer_Sports_Experience_and__Kids_in_Motion__Programming__2022_to_current.csv", parse_dates=["Date"])

|  | Borough | ParkorPlayground | Date | SportsPlayed | KIMorSSE | Attendance | Cancellation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Brooklyn | Hamilton Metz Field | 2024-05-14 10:00:00 |  | K.I.M | 71.0 |  |
| 1 | Manhattan | Carmansville Playground | 2023-10-13 18:00:00 |  | K.I.M | 0.0 | Called Out |
| 2 | Queens | Bowne Playground | 2024-07-06 18:00:00 |  | K.I.M | 119.0 |  |
| 3 | Queens | Phil Scooter Rizzuto Park | 2023-07-13 18:00:00 |  | K.I.M | 109.0 |  |
| 4 | Queens | Grover Cleveland Playground | 2024-05-24 18:00:00 |  | K.I.M | 0.0 | Detailed to Event |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 12250 | Brooklyn | Seth Low Playground/ Bealin Square | 2024-05-04 10:00:00 |  | K.I.M | 90.0 |  |
| 12251 | Queens | Yellowstone Park | 2022-09-16 12:26:00 |  | K.I.M | 0.0 | Called Out |
| 12252 | Queens | Astoria Park | 2022-07-02 18:00:00 |  | K.I.M | 0.0 | Detailed to Event |
| 12253 | Queens | Queensbridge Park | 2023-08-23 18:00:00 |  | K.I.M | 0.0 |  |
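
Parsing the dates up front pays off later, because a datetime column exposes the `.dt` accessor. The sketch below uses an in-memory CSV with two rows copied from the output above (and only three of the columns), so it runs without the sample file.

```python
import io

import pandas as pd

# Two rows borrowed from the output above, trimmed to three columns
csv_text = (
    "Borough,Date,Attendance\n"
    "Brooklyn,2024-05-14 10:00:00,71.0\n"
    "Queens,2023-07-13 18:00:00,109.0\n"
)

events = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# The column is now a datetime type rather than plain text
print(pd.api.types.is_datetime64_any_dtype(events["Date"]))

# ...so date parts can be pulled out directly
events["Year"] = events["Date"].dt.year
events["Hour"] = events["Date"].dt.hour
print(events)
```

Without `parse_dates`, the `Date` column would stay as text and the `.dt` calls would raise an error.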



## Cleaning tasks

Once the data is loaded, we usually need to take care of:
- missing values
- duplicates
- data types
- renaming columns
- aggregating data
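
One way these tasks might look in pandas, as a minimal sketch. The toy DataFrame borrows the `Borough` and `Attendance` column names from the data above, but its rows are made up, and treating a missing attendance as 0 is an assumption made only for this example.

```python
import pandas as pd

# Made-up rows: one exact duplicate and one missing attendance value
visits = pd.DataFrame(
    {
        "Borough": ["Queens", "Queens", "Brooklyn", "Brooklyn"],
        "Attendance": [119.0, 119.0, 71.0, None],
    }
)

# Missing values: fill the gap (assumption: missing means nobody attended)
visits["Attendance"] = visits["Attendance"].fillna(0)

# Duplicates: drop rows that are identical in every column
visits = visits.drop_duplicates()

# Data types: attendance is a head count, so store it as an integer
visits["Attendance"] = visits["Attendance"].astype("int64")

# Renaming columns: lower-case names are easier to type
visits = visits.rename(columns={"Borough": "borough", "Attendance": "attendance"})

# Aggregating data: total attendance per borough
totals = visits.groupby("borough")["attendance"].sum()
print(totals)
```

The order matters here: the missing value has to be filled before the column can be converted to an integer type.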



## How to export your DataFrame to a spreadsheet

In [28]:
import pandas as pd

# ExcelFile opens the workbook once; parse() then reads a sheet (the first one by default)
file = pd.ExcelFile("sample_data/iris-data.xlsx")
file.parse()

|  | Iris data set (as Excel table) | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | sepal_length | sepal_width | petal_length | petal_width | species |  | Select cell G3 to see the sample Python formula. |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |  |  |
| 2 | 4.9 | 3 | 1.4 | 0.2 | setosa |  |  |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |  |  |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |  |  |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 148 | 6.5 | 3 | 5.2 | 2 | virginica |  |  |
| 149 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |  |  |
| 150 | 5.9 | 3 | 5.1 | 1.8 | virginica |  |  |
| 151 |  |  |  |  |  |  |  |


In [30]:
iris_df.head()

|  | Iris data set (as Excel table) | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | sepal_length | sepal_width | petal_length | petal_width | species |  | Select cell G3 to see the sample Python formula. |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |  |  |
| 2 | 4.9 | 3 | 1.4 | 0.2 | setosa |  |  |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |  |  |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |  |  |


In [31]:

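# Build the DataFrame first: to_excel() returns None, so chaining it onto
# the constructor would leave df set to None instead of the data.
df = pd.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [4, 5, 6],
        "c": [7, 8, 9],
        "d": [10, 11, 12],
        "e": [13, 14, 15],
        "f": [16, 17, 18],
        "g": [19, 20, 21],
        "h": [22, 23, 24],
        "i": [25, 26, 27],
        "j": [28, 29, 30],
        "k": [31, 32, 33],
    }
)

# Write only the values: no index column and no header row
df.to_excel(
    "sample_data/test.xlsx",
    index=False,
    header=False,
)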
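
When one workbook should hold more than one sheet, `pd.ExcelWriter` can be used as a context manager. The following is a minimal sketch that writes to an in-memory buffer instead of a path such as `sample_data/test.xlsx`. The sheet names `data` and `summary` are arbitrary choices for this example, and the `openpyxl` package is assumed to be installed.

```python
import io

import pandas as pd

small = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
summary = small.sum().to_frame(name="total")  # column totals, one row per column

buffer = io.BytesIO()  # stands in for a file path
with pd.ExcelWriter(buffer) as writer:
    small.to_excel(writer, sheet_name="data", index=False)
    summary.to_excel(writer, sheet_name="summary")

# Read the workbook back to check what was written
buffer.seek(0)
book = pd.ExcelFile(buffer)
print(book.sheet_names)
round_trip = book.parse("data")
print(round_trip)
```

Keeping the header row this time (the default) means the column names survive the round trip, which is usually what we want when the file will be read again.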
