# pandas for Data Science

![Data Science Workflow](img/ds-workflow.png)

## pandas
- When working with tabular data (spreadsheets, databases, etc.), **pandas** is the right tool
- **pandas** makes it easy to acquire, explore, clean, process, analyze, and visualize your data
- Together, these steps cover most of the Data Science process

## pandas help
- **pandas** is a large and complex tool
- **pandas** can do (almost) everything with data
    - if you can do it in Excel, you can do it in **pandas**
- **pandas** has a great [Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) to help you
- **pandas** also has great [tutorials](https://pandas.pydata.org/docs/getting_started/index.html)

## What will we cover here?
- Some insights into **DataFrames** (the main data structure in **pandas**)
- How to work with data

## This course also covers
- Later we will dive into how **pandas** can get data from various sources
    - Web Scraping, Databases, CSV, Parquet, Excel files
- How to combine data from different sources
- How to deal with missing data

## Getting started with pandas
- **pandas** is installed by default in Anaconda (Jupyter Notebooks)
- In other environments you can install it with
    - ```pip install pandas```
- To access **pandas** you need to import it
    - ```import pandas as pd```

```Python
import pandas as pd
```

### What is pandas?
- **pandas** is like an Excel sheet - just better
- To learn **pandas**, let's play with some data

### Read data from CSV
- What is CSV? See this lecture ([Lecture on CSV](https://youtu.be/LEyojSOg4EI))
- ```pd.read_csv(filename, parse_dates, index_col)``` ([docs](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html))
    - ```filename```: The path to the file
    - ```parse_dates=True```: If True, try to parse the index as dates (off by default)
    - ```index_col=0```: Use column 0 as the index
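
As a minimal sketch of the call above: the CSV content here is made up for illustration and is read from an in-memory string (via `io.StringIO`) so the example runs without a file on disk. With a real file you would pass its path instead.

```Python
import io

import pandas as pd

# Hypothetical sample data standing in for a CSV file on disk
csv_text = """Date,Open,Close
2021-01-04,133.52,129.41
2021-01-05,128.89,131.01
2021-01-06,127.72,126.60
"""

# read_csv accepts a path or a file-like object;
# index_col=0 uses the Date column as the index, parse_dates=True parses it as dates
data = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(data.index).__name__)  # DatetimeIndex
print(data)
```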

### Always check data
- ```.head()```: Returns the first 5 rows (by default)
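
A small sketch of `.head()`, assuming a made-up DataFrame with 10 rows (the column names are only for illustration):

```Python
import pandas as pd

# Hypothetical data: 10 rows, 2 columns
data = pd.DataFrame({"Open": range(10), "Close": range(10, 20)})

first_rows = data.head()    # first 5 rows by default
first_three = data.head(3)  # pass a number to get that many rows

print(first_rows)
print(len(first_three))  # 3
```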

## Index and columns
- ```.index```: Returns the index
- ```.columns```: Returns the column names (as an Index object, which can be converted to a list)
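
To illustrate, here is a sketch with a tiny made-up DataFrame that uses dates as its index:

```Python
import pandas as pd

# Hypothetical data with a date index
data = pd.DataFrame(
    {"Open": [1.0, 2.0], "Close": [1.5, 1.8]},
    index=pd.to_datetime(["2021-01-04", "2021-01-05"]),
)

print(data.index)    # the row labels (here a DatetimeIndex)
print(data.columns)  # the column labels (an Index object)

# Convert to a plain Python list if needed
column_names = list(data.columns)
print(column_names)  # ['Open', 'Close']
```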

## Each column has a data type
- ```.dtypes```: Returns the data type of each column
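
A quick sketch with made-up columns of different kinds. The exact name shown for text columns can differ between pandas versions, so only the numeric types are noted in the comments.

```Python
import pandas as pd

# Hypothetical data: a float column, an integer column and a text column
data = pd.DataFrame(
    {"Open": [1.0, 2.0], "Volume": [100, 200], "Ticker": ["AAPL", "AAPL"]}
)

# One entry per column: Open is float64, Volume is int64
print(data.dtypes)
```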

## The size and shape of data
- ```len(data)```: gives the number of rows in the DataFrame
- ```.shape```: Returns the number of rows and columns in the DataFrame
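
For example, with a small made-up DataFrame of 3 rows and 2 columns:

```Python
import pandas as pd

# Hypothetical data: 3 rows, 2 columns
data = pd.DataFrame({"Open": [1.0, 2.0, 3.0], "Close": [1.5, 1.8, 3.2]})

n_rows = len(data)         # number of rows
rows, columns = data.shape  # (rows, columns) tuple

print(n_rows)      # 3
print(data.shape)  # (3, 2)
```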

## Slicing rows and columns
- ```data['Close']```: Select one column (Series)
- ```data[['Open', 'Close']]```: Select multiple columns with specific names
- ```data.loc['2020-05-01':'2021-05-01']```: Select all rows between the dates (including 2021-05-01), with all columns
- ```data.iloc[50:55]```: Select rows at positions 50 to 54 (excluding 55), with all columns
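
The difference between label-based `.loc` (end included) and position-based `.iloc` (end excluded) is easy to miss. Here is a sketch on a made-up DataFrame with 10 daily rows; the dates and values are assumptions for illustration only.

```Python
import pandas as pd

# Hypothetical data: 10 consecutive days starting 2021-01-01
dates = pd.date_range("2021-01-01", periods=10, freq="D")
data = pd.DataFrame({"Open": range(10), "Close": range(10, 20)}, index=dates)

close = data["Close"]             # one column -> Series
subset = data[["Open", "Close"]]  # list of names -> DataFrame

by_label = data.loc["2021-01-03":"2021-01-05"]  # 3 rows: end label is included
by_position = data.iloc[2:5]                    # 3 rows: positions 2, 3, 4

print(len(by_label), len(by_position))  # 3 3
```

Both slices return the same three rows here, because 2021-01-03 sits at position 2 and 2021-01-05 at position 4.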

## Arithmetic operations
- Calculating with columns on all rows
    - Example: ```data['Close'] - data['Open']```
- Creating new columns
    - Example: ```data['New'] = data['Open'] - data['Close']```
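
A runnable sketch of both examples, using made-up prices:

```Python
import pandas as pd

# Hypothetical prices
data = pd.DataFrame({"Open": [10.0, 12.0, 11.0], "Close": [11.0, 11.5, 11.0]})

# The subtraction is applied row by row and returns a new Series
diff = data["Close"] - data["Open"]
print(list(diff))  # [1.0, -0.5, 0.0]

# Assigning to a new column name adds the column to the DataFrame
data["New"] = data["Open"] - data["Close"]
print(list(data["New"]))  # [-1.0, 0.5, 0.0]
```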

## Select data
- Select data based on boolean expressions
    - Example: ```data['New'] > 0```
    - Example: ```data[data['New'] > 0]```
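
The two examples work together: the first builds a boolean Series (a mask), and the second uses it to keep only the rows where the mask is True. A sketch with made-up prices:

```Python
import pandas as pd

# Hypothetical prices
data = pd.DataFrame({"Open": [10.0, 12.0, 11.0], "Close": [11.0, 11.5, 11.0]})
data["New"] = data["Open"] - data["Close"]  # -1.0, 0.5, 0.0

mask = data["New"] > 0  # boolean Series, one value per row
print(list(mask))       # [False, True, False]

positive = data[mask]   # keeps only the rows where mask is True
print(positive)
```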

## Groupby and value_counts
- Example
```Python
data['Category'] = data['New'] > 0
data.groupby('Category').mean()
```
- Example
```Python
data['Category'].value_counts()
(data['New'] > 0).value_counts()
```
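
The snippets above assume `data` already has a `New` column. A self-contained sketch with made-up prices could look like this:

```Python
import pandas as pd

# Hypothetical prices
data = pd.DataFrame(
    {"Open": [10.0, 12.0, 11.0, 14.0], "Close": [11.0, 11.5, 11.0, 13.0]}
)
data["New"] = data["Open"] - data["Close"]  # -1.0, 0.5, 0.0, 1.0
data["Category"] = data["New"] > 0          # False, True, False, True

# Mean of every numeric column, computed separately for each category
means = data.groupby("Category").mean()
print(means)

# How many rows fall into each category
counts = data["Category"].value_counts()
print(counts)
```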