# __[What kind of data does pandas handle?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html#what-kind-of-data-does-pandas-handle)__
<br>

To load the pandas package and start working with it, import the package.
The community-agreed alias for pandas is `pd`, so all of the pandas documentation assumes that pandas is loaded as `pd`.

In [None]:
import pandas as pd

### Pandas data table representation
<img src="utility/pd_data_tbl_rep_01.png"/>
<br>
##### ? Question ?
I want to store passenger data of the Titanic. For a number of passengers, I know the name (characters), age (integers) and sex (male/female) data.
<br>
##### Answer
To manually store data in a table, create a `DataFrame`.

In [18]:
passenger_data = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

passenger_data

Unnamed: 0,Name,Age,Sex
0,"Braund, Mr. Owen Harris",22,male
1,"Allen, Mr. William Henry",35,male
2,"Bonnell, Miss. Elizabeth",58,female


When using a Python dictionary of lists, the dictionary keys will be used as column headers and the values in each list as columns of the `DataFrame`.
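As a minimal sketch of this mapping (the names `columns_as_lists` and `small_table` are illustrative and not part of the tutorial), the dictionary keys come back as the column labels and the list length becomes the number of rows:

```python
import pandas as pd

# Hypothetical example data: each key becomes a column label,
# each list becomes the values of that column.
columns_as_lists = {
    "Name": ["Braund, Mr. Owen Harris", "Allen, Mr. William Henry"],
    "Age": [22, 35],
}
small_table = pd.DataFrame(columns_as_lists)

# The column labels are the dictionary keys, in insertion order.
column_labels = list(small_table.columns)
print(column_labels)

# All lists must have the same length; that length is the number of rows.
print(len(small_table))
```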
<br>

A __[DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame)__ is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the `data.frame` in R.

- The table has 3 columns, each of them with a column label. The column labels are respectively `Name`, `Age` and `Sex`.

- The column `Name` consists of textual data with each value a string, the column `Age` contains numbers and the column `Sex` is textual data.

In spreadsheet software, the table representation of our data would look very similar.

### Each column in a DataFrame is a Series
<img src="utility/df_col_series_01.png"/>
<br>
##### ? Question ?
I’m just interested in working with the data in the column `Age`.
<br>
##### Answer
To select the column, use the column label in between square brackets `[]`.

In [19]:
passenger_data['Age']

0    22
1    35
2    58
Name: Age, dtype: int64

If you are familiar with Python __[dictionaries](https://docs.python.org/3/tutorial/datastructures.html#tut-dictionaries)__, selecting a single column is very similar to selecting dictionary values by key.
<br>
<br>
You can create a `Series` from scratch as well:

In [21]:
ages = pd.Series([22, 35, 58])
ages

0    22
1    35
2    58
dtype: int64

A pandas `Series` has no column labels, as it is just a single column of a `DataFrame`. A `Series` does have row labels.
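The row labels can be inspected through the `index` attribute. The sketch below is only an illustration: the optional `name` argument and the custom labels `"a"`, `"b"`, `"c"` are assumptions made for the example, not part of the Titanic data.

```python
import pandas as pd

# A Series built from scratch; `name` is optional and is displayed as "Name: Age".
ages = pd.Series([22, 35, 58], name="Age")

# Row labels (the index) default to 0, 1, 2, ...
row_labels = list(ages.index)
print(row_labels)
print(ages.name)

# Custom row labels can be supplied via `index` (these labels are made up).
labelled_ages = pd.Series([22, 35, 58], index=["a", "b", "c"], name="Age")
print(labelled_ages["b"])
```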

### Do something with a DataFrame or Series

<br>

##### ? Question ?
I want to know the maximum Age of the passengers.
<br>
##### Answer
We can do this on the `DataFrame` by selecting the `Age` column and applying `max()`:

In [24]:
passenger_data['Age'].max()

58

As illustrated by the `max()` method, you can do things with a `DataFrame` or `Series`. pandas provides a lot of functionalities, each of them a method you can apply to a `DataFrame` or `Series`. As methods are functions, do not forget to use parentheses `()`.
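A small sketch of why the parentheses matter, using a stand-alone `Series` with the same three ages (the variable names are illustrative):

```python
import pandas as pd

ages = pd.Series([22, 35, 58], name="Age")

# With parentheses the method is called and returns a value.
oldest = ages.max()
youngest = ages.min()
print(oldest, youngest)

# Without parentheses you only get a reference to the method itself,
# not the result of calling it.
not_called = ages.max
print(callable(not_called))
```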

##### ? Question ?
I’m interested in some basic statistics of the numerical data of my data table.
<br>
##### Answer
Use the `describe()` method:

In [25]:
passenger_data.describe()

Unnamed: 0,Age
count,3.0
mean,38.333333
std,18.230012
min,22.0
25%,28.5
50%,35.0
75%,46.5
max,58.0


The __[`describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html#pandas.DataFrame.describe)__ method provides a quick overview of the numerical data in a `DataFrame`. As the `Name` and `Sex` columns are textual data, these are by default not taken into account by the __[`describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html#pandas.DataFrame.describe)__ method.
<br>

Many pandas operations return a `DataFrame` or a `Series`. The __[`describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html#pandas.DataFrame.describe)__ method is an example of a pandas operation returning a pandas `Series` or a pandas `DataFrame`.
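The following sketch rebuilds the three-passenger table to show both return types. It also uses `include="all"`, an option of `describe()` that brings the textual columns into the summary; the variable names are illustrative.

```python
import pandas as pd

passengers = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

# On a DataFrame, describe() returns a DataFrame (numeric columns only by default).
numeric_summary = passengers.describe()
print(type(numeric_summary).__name__, list(numeric_summary.columns))

# On a single column (a Series), describe() returns a Series.
age_summary = passengers["Age"].describe()
print(type(age_summary).__name__)

# include="all" also summarizes the textual columns (count, unique, top, freq).
full_summary = passengers.describe(include="all")
print(list(full_summary.columns))
```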
<br>

Check more options on `describe` in the user guide section about __[aggregations with describe](https://pandas.pydata.org/docs/user_guide/basics.html#basics-describe)__.

This is just a starting point. Similar to spreadsheet software, pandas represents data as a table with columns and rows. Beyond the representation, pandas also supports the data manipulations and calculations you would do in spreadsheet software. Continue reading the next tutorials to get started!
<br>
### REMEMBER
- Import the package, aka `import pandas as pd`
- A table of data is stored as a pandas `DataFrame`
- Each column in a `DataFrame` is a `Series`
- You can do things by applying a method to a `DataFrame` or `Series`

# __[How do I read and write tabular data?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html#how-do-i-read-and-write-tabular-data)__<br>

<img src="utility/rd_wrt_tblr_dt_01.png">
<br>
##### ? Question ?
I want to analyze the Titanic passenger data, available as a CSV file.
<br>
##### Answer
pandas provides the __[`read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv)__ function to read data stored as a csv file into a pandas `DataFrame`. pandas supports many file formats or data sources out of the box (csv, excel, sql, json, parquet, …), each of them with the prefix `read_*`.
<br>
Always check the data after reading it in. When displaying a `DataFrame`, the first and last 5 rows will be shown by default.

In [26]:
titanic_passenger_data = pd.read_csv('data/titanic.csv')
titanic_passenger_data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
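`read_csv()` accepts a file-like object as well as a path, which makes it easy to try out without a file on disk. The sketch below uses a made-up three-row CSV string (it is not the real `data/titanic.csv`) and runs a quick check on the result:

```python
import io
import pandas as pd

# Made-up CSV content standing in for a file on disk.
csv_text = """PassengerId,Name,Age
1,"Braund, Mr. Owen Harris",22
2,"Moran, Mr. James",
3,"Heikkinen, Miss. Laina",26
"""

# read_csv accepts a path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

# A quick check after reading: dimensions and missing values per column.
print(sample.shape)
print(sample.isna().sum())
```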


##### ? Question ?
I want to see the first 10 rows of a pandas DataFrame.
<br>
##### Answer
To see the first N rows of a `DataFrame`, use the __[`head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html#pandas.DataFrame.head)__ method with the required number of rows (in this case 10) as argument.

In [27]:
titanic_passenger_data.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


Interested in the last N rows instead? pandas also provides a __[`tail()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail)__ method.

In [29]:
titanic_passenger_data.tail(8)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
883,884,0,2,"Banfield, Mr. Frederick James",male,28.0,0,0,C.A./SOTON 34068,10.5,,S
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.05,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.125,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q
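Both methods return 5 rows when called without an argument. A small sketch on a hypothetical 20-row table (the `numbers` frame is invented for illustration):

```python
import pandas as pd

# Hypothetical 20-row table to make the row counts easy to see.
numbers = pd.DataFrame({"value": range(20)})

first_rows = numbers.head()   # no argument: first 5 rows
last_rows = numbers.tail(3)   # last 3 rows

print(len(first_rows), len(last_rows))
print(list(last_rows.index))
```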


To check how pandas interpreted the data type of each column, request the `dtypes` attribute:

In [31]:
titanic_passenger_data.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

For each of the columns, the data type used is listed. The data types in this `DataFrame` are integers (`int64`), floats (`float64`) and strings (`object`).
<br>
When asking for the `dtypes`, no parentheses are used! `dtypes` is an attribute of a `DataFrame` and `Series`. Attributes of a `DataFrame` or `Series` do not need parentheses. An attribute represents a characteristic of a `DataFrame`/`Series`, whereas a method (which requires parentheses) does something with the `DataFrame`/`Series`, as introduced in the __[first tutorial](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html)__.
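A minimal sketch of the attribute in use, on an invented two-column table. The missing value in `Age` shows one reason why a column of whole numbers can end up as `float64`:

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Pclass": [3, 1, 3],
        "Age": [22.0, None, 26.0],  # a missing value is stored as NaN in a float column
    }
)

# dtypes is an attribute: no parentheses, and the result is itself a Series.
column_types = sample.dtypes
print(column_types)

# A single column (a Series) exposes the singular `dtype` attribute.
print(sample["Age"].dtype)
```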


##### ? Question ?
My colleague requested the Titanic data as a spreadsheet.
<br>
##### Answer
Whereas `read_*` functions are used to read data into pandas, the `to_*` methods are used to store data. For example, the __[`to_excel()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html#pandas.DataFrame.to_excel)__ method stores the data as an Excel file, and the __[`to_json()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html#)__ method stores the data as a JSON file.

In [32]:
titanic_passenger_data.to_json('data/titanic.json')

The equivalent `read_*` function will reload the data to a `DataFrame`:

In [33]:
titanic_passenger_data_read = pd.read_json('data/titanic.json')
titanic_passenger_data_read.head(20)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
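The round trip can also be tried without touching the file system. In this sketch, which uses a made-up two-row table, `to_json()` is called without a path so that it returns the JSON text, and `read_json()` reads it back from an in-memory buffer:

```python
import io
import pandas as pd

original = pd.DataFrame(
    {
        "Name": ["Braund, Mr. Owen Harris", "Allen, Mr. William Henry"],
        "Age": [22, 35],
    }
)

# Without a path, to_json() returns the JSON document as a string.
json_text = original.to_json()

# read_json() parses it back; wrapping the string in StringIO makes it file-like.
restored = pd.read_json(io.StringIO(json_text))

print(restored)
```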


##### ? Question ?
I’m interested in a technical summary of a `DataFrame`.
<br>
##### Answer
The method `info()` provides technical information about a `DataFrame`:

In [34]:
titanic_passenger_data_read.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB


So let’s explain the output in more detail:
- It is indeed a __[`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame)__.
- There are 891 entries, i.e. 891 rows.
- Each row has a row label (aka the `index`) with values ranging from 0 to 890.
- The table has 12 columns. Most columns have a value for each of the rows (all 891 values are `non-null`). Some columns have missing values and fewer than 891 `non-null` values.
- The columns `Name`, `Sex`, `Ticket`, `Cabin` and `Embarked` consist of textual data (strings, aka `object`). The other columns are numerical data, some of them whole numbers (aka `integer`) and others real numbers (aka `float`).
- The kind of data (characters, integers,…) in the different columns is summarized by listing the `dtypes`.
- The approximate amount of RAM used to hold the DataFrame is provided as well.
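The individual pieces of the `info()` report are also available as separate attributes and methods. The sketch below is an illustration on an invented three-row table, not a description of how `info()` is implemented:

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3],
        "Age": [22.0, None, 26.0],
        "Cabin": [None, "C85", None],
    }
)

n_rows, n_columns = sample.shape            # number of entries and columns
non_null_counts = sample.count()            # the "Non-Null Count" per column
missing_counts = sample.isna().sum()        # missing values per column
memory_bytes = sample.memory_usage().sum()  # approximate memory use, in bytes

print(n_rows, n_columns)
print(non_null_counts)
print(missing_counts)
```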

### REMEMBER
- Getting data into pandas from many file formats or data sources is supported by `read_*` functions.
- Exporting data out of pandas is provided by different `to_*` methods.
- The `head`/`tail`/`info` methods and the `dtypes` attribute are convenient for a first check.
<br>

For a complete overview of the input and output possibilities from and to pandas, see the user guide section about __[reader and writer functions](https://pandas.pydata.org/docs/user_guide/io.html#io)__.

# __[How do I select a subset of a DataFrame?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html#how-do-i-select-a-subset-of-a-dataframe)__
<br>

### How do I select specific columns from a DataFrame?

<img src='utility/slct_sbst_df_01.png'>

##### ? Question ?
I'm interested in the age of the Titanic passengers.
<br>
##### Answer
To select a single column, use square brackets `[]` with the column name of the column of interest.

In [36]:
ages = titanic_passenger_data['Age']
ages

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

Each column in a __[`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame)__ is a __[`Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series)__. As a single column is selected, the returned object is a pandas __[`Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series)__. We can verify this by checking the type of the output:

In [38]:
type(titanic_passenger_data['Age'])

pandas.core.series.Series

And have a look at the `shape` of the output:

In [42]:
titanic_passenger_data['Age'].shape

(891,)

__[`DataFrame.shape`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html#pandas.DataFrame.shape)__ is an attribute of a pandas `Series` and `DataFrame` containing the number of rows and columns: (nrows, ncolumns). Remember from the __[tutorial on reading and writing](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html#min-tut-02-read-write)__ that attributes are used without parentheses. A pandas `Series` is 1-dimensional, so only the number of rows is returned.
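For comparison, a short sketch of `shape` on a whole table and on one of its columns, using an invented three-row table:

```python
import pandas as pd

sample = pd.DataFrame(
    {"Age": [22.0, 38.0, 26.0], "Sex": ["male", "female", "female"]}
)

table_shape = sample.shape          # (nrows, ncolumns)
column_shape = sample["Age"].shape  # (nrows,)
print(table_shape, column_shape)
```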

##### ? Question ?
I’m interested in the age and sex of the Titanic passengers.
<br>
##### Answer
To select multiple columns, use a list of column names within the selection brackets `[]`. The returned data type is a pandas `DataFrame`.

In [46]:
age_sex = titanic_passenger_data[['Age', 'Sex']]
age_sex

Unnamed: 0,Age,Sex
0,22.0,male
1,38.0,female
2,26.0,female
3,35.0,female
4,35.0,male
...,...,...
886,27.0,male
887,19.0,female
888,,female
889,26.0,male


The inner square brackets define a __[Python list](https://docs.python.org/3/tutorial/datastructures.html#tut-morelists)__ with column names, whereas the outer brackets are used to select the data from a pandas `DataFrame` as seen in the previous example. Remember, a `DataFrame` is 2-dimensional with both a row and column dimension.
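The difference between single and double brackets is easiest to see by checking the returned types. A minimal sketch on a made-up table (the variable names are illustrative):

```python
import pandas as pd

sample = pd.DataFrame(
    {"Age": [22.0, 38.0], "Sex": ["male", "female"], "Pclass": [3, 1]}
)

one_column = sample["Age"]            # a single label returns a Series
one_column_table = sample[["Age"]]    # a list with one label returns a DataFrame
two_columns = sample[["Age", "Sex"]]  # a list of labels returns a DataFrame

print(type(one_column).__name__, one_column.shape)
print(type(one_column_table).__name__, one_column_table.shape)
print(type(two_columns).__name__, two_columns.shape)
```

Note that a list containing a single label still returns a 2-dimensional `DataFrame`, with one column.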
<br>
For basic information on indexing, see the user guide section on __[indexing and selecting data](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-basics)__.