# Data Visualization with Modern Data Science

> Getting Started with Pandas

Yao-Jen Kuo <yaojenkuo@ntu.edu.tw> from [DATAINPOINT](https://www.datainpoint.com/)

In [1]:
from datetime import date
from datetime import timedelta
import urllib.error
import re
import pandas as pd

## About `pandas`

## What is `pandas`?

> Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more.

Source: <https://github.com/pandas-dev/pandas>

## Why `pandas`?

Python used to have a weak spot in data analysis because it lacked an appropriate structure for handling common tabular datasets. Before `pandas`, Python users had to switch to a more data-centric language such as R or MATLAB during the analysis stage.

## Import pandas with the `import` statement

By convention, pandas is imported under the alias `pd`.

In [2]:
import pandas as pd

## If Pandas is not installed, we will encounter a `ModuleNotFoundError`

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pandas'
```

## Use `pip install` in a terminal to install pandas

```bash
pip install pandas
```

## Check the version and the installation file path

- `__version__` attribute
- `__file__` attribute

In [3]:
print(pd.__version__)
print(pd.__file__)

1.4.1
/Users/kuoyaojen/opt/miniconda3/envs/python39/lib/python3.9/site-packages/pandas/__init__.py


## What does `pandas` mean?

![](https://media.giphy.com/media/46Zj6ze2Z2t4k/giphy.gif)

Source: <https://giphy.com/>

## It turns out the name has nothing to do with the animal; it refers to the three primary classes created by its author, [Wes McKinney](https://wesmckinney.com/)

- **Pan**el (deprecated since version 0.20.0)
- **Da**taFrame
- **S**eries

## In order to master `pandas`, it is vital to understand the relationships between `Index`, `ndarray`, `Series`, and `DataFrame`

- An `Index` and an `ndarray` together make up a `Series`.
- Several `Series` that share the same `Index` can then form a `DataFrame`.
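
The two relationships above can be sketched in a few lines. This is only a minimal illustration: the names `labels`, `values`, `primes`, and `squares` are made up for this example, and NumPy is assumed to be installed alongside pandas.

```python
import numpy as np
import pandas as pd

# An Index plus an ndarray make up a Series
labels = pd.Index(["a", "b", "c"])
values = np.array([2, 3, 5])
primes = pd.Series(values, index=labels)

# Several Series sharing the same Index form a DataFrame
squares = pd.Series(values**2, index=labels)
df = pd.DataFrame({"prime": primes, "square": squares})

print(df)
print(df.index.equals(labels))
```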

## `Index` from Pandas

The simplest way to create an `Index` is using `pd.Index()`.

In [4]:
prime_indices = pd.Index([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])
type(prime_indices)

pandas.core.indexes.numeric.Int64Index

## An `Index` is like a combination of `tuple` and `set`

In [5]:
# immutable
prime_indices = pd.Index([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])
try:
    prime_indices[-1] = 31
except TypeError as e:
    print(e)

Index does not support mutable operations
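
Because an `Index` cannot be modified in place, "changing" it usually means building a new one. The sketch below assumes `Index.append()` is an acceptable way to do that here; the name `extended` is hypothetical.

```python
import pandas as pd

prime_indices = pd.Index([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])

# append() returns a new Index; the original object is left untouched
extended = prime_indices.append(pd.Index([31]))

print(len(prime_indices))
print(len(extended))
print(extended[-1])
```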


In [6]:
# Index has the characteristics of a set
odd_indices = pd.Index(range(1, 30, 2))
print(prime_indices.intersection(odd_indices))         # in both
print(prime_indices.union(odd_indices))                # in either
print(prime_indices.symmetric_difference(odd_indices)) # in exactly one of the two
print(prime_indices.difference(odd_indices))           # in prime_indices only
print(odd_indices.difference(prime_indices))           # in odd_indices only

Int64Index([3, 5, 7, 11, 13, 17, 19, 23, 29], dtype='int64')
Int64Index([1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29], dtype='int64')
Int64Index([1, 2, 9, 15, 21, 25, 27], dtype='int64')
Int64Index([2], dtype='int64')
Int64Index([1, 9, 15, 21, 25, 27], dtype='int64')
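
The tuple-like side of an `Index` can be illustrated the same way. This short sketch shows that an `Index` is ordered and supports positional access, slicing, and membership tests.

```python
import pandas as pd

prime_indices = pd.Index([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])

# Tuple-like: ordered, with positional access and slicing
print(prime_indices[0])
print(prime_indices[-3:])

# Membership tests work as they do for a tuple or a set
print(7 in prime_indices)
print(9 in prime_indices)
```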


## `Series` from Pandas

The simplest way to create a `Series` is using `pd.Series()`.

In [7]:
prime_series = pd.Series([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])
print(type(prime_series))
print(prime_series)

<class 'pandas.core.series.Series'>
0     2
1     3
2     5
3     7
4    11
5    13
6    17
7    19
8    23
9    29
dtype: int64


## A `Series` is a combination of `Index` and `ndarray`

In [8]:
print(type(prime_series.index))
print(type(prime_series.values))

<class 'pandas.core.indexes.range.RangeIndex'>
<class 'numpy.ndarray'>


## The index of a `Series` can be customized

In [9]:
prime_series = pd.Series([2, 3, 5, 7, 11, 13, 17, 19, 23, 29],
                        index=range(1, 11))
prime_series

1      2
2      3
3      5
4      7
5     11
6     13
7     17
8     19
9     23
10    29
dtype: int64

In [10]:
prime_series = pd.Series([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])
prime_series.index = range(1, 11)
prime_series

1      2
2      3
3      5
4      7
5     11
6     13
7     17
8     19
9     23
10    29
dtype: int64

## Indexing a `Series`

Indexing via index positions or index labels.

In [11]:
prime_series = pd.Series([2, 3, 5, 7, 11],
                        index=["1st", "2nd", "3rd", "4th", "5th"])
print(prime_series[-1])
print(prime_series["5th"])

11
11
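
pandas also provides the explicit accessors `.iloc` (position) and `.loc` (label). The sketch below repeats the lookup above with them. Recent pandas releases warn when plain `[]` is given an integer position on a non-integer index, so the explicit form is the safer habit.

```python
import pandas as pd

prime_series = pd.Series([2, 3, 5, 7, 11],
                        index=["1st", "2nd", "3rd", "4th", "5th"])

# .iloc always indexes by integer position
print(prime_series.iloc[-1])
# .loc always indexes by label
print(prime_series.loc["5th"])
```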


## Slicing a `Series`

Slicing via index positions excludes `stop`, while slicing via index labels includes `stop`.

In [12]:
print(prime_series[:2])
print(prime_series["1st":"3rd"])

1st    2
2nd    3
dtype: int64
1st    2
2nd    3
3rd    5
dtype: int64
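
The difference in how `stop` is treated can be made explicit with `.iloc` and `.loc`. In this sketch, the names `by_position` and `by_label` are made up for illustration.

```python
import pandas as pd

prime_series = pd.Series([2, 3, 5, 7, 11],
                        index=["1st", "2nd", "3rd", "4th", "5th"])

by_position = prime_series.iloc[0:2]      # stop position 2 is excluded
by_label = prime_series.loc["1st":"3rd"]  # stop label "3rd" is included

print(len(by_position))
print(len(by_label))
```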


## A `Series` contains an `ndarray`, so it can be manipulated in the same way

- Vectorization and broadcasting.
- Fancy indexing.
- Boolean indexing.

In [13]:
# Vectorization and broadcasting
prime_series**2

1st      4
2nd      9
3rd     25
4th     49
5th    121
dtype: int64
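
To see that the `Series` operation mirrors the `ndarray` one, the sketch below squares both and compares the results. It assumes NumPy is installed; `prime_array`, `squared_series`, and `squared_array` are illustrative names.

```python
import numpy as np
import pandas as pd

prime_series = pd.Series([2, 3, 5, 7, 11],
                        index=["1st", "2nd", "3rd", "4th", "5th"])
prime_array = prime_series.to_numpy()

squared_series = prime_series**2  # element-wise on the Series
squared_array = prime_array**2    # element-wise on the underlying ndarray

# Same values either way; the Series additionally keeps its labels
print(np.array_equal(squared_series.to_numpy(), squared_array))
print(squared_series.index.equals(prime_series.index))
```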

In [14]:
# Fancy indexing
print(prime_series[[0, 1, 4]])
print(prime_series[["1st", "2nd", "5th"]])

1st     2
2nd     3
5th    11
dtype: int64
1st     2
2nd     3
5th    11
dtype: int64


In [15]:
# Boolean indexing
prime_series[prime_series % 2 == 1]

2nd     3
3rd     5
4th     7
5th    11
dtype: int64
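
Boolean indexing works in two steps: the comparison first produces a boolean `Series`, which then selects the matching elements. A minimal sketch, with `is_odd` as an illustrative name:

```python
import pandas as pd

prime_series = pd.Series([2, 3, 5, 7, 11],
                        index=["1st", "2nd", "3rd", "4th", "5th"])

# Step 1: the comparison is vectorized and yields a boolean Series
is_odd = prime_series % 2 == 1
print(is_odd)

# Step 2: the boolean Series keeps only the elements marked True
print(prime_series[is_odd])
```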

## `DataFrame` from Pandas

The simplest way to create a `DataFrame` is using `pd.DataFrame()`.

In [16]:
# column-wise
movie_df = pd.DataFrame()
movie_df["title"] = ["The Shawshank Redemption", "The Dark Knight", "Schindler's List", "Forrest Gump", "Inception"]
movie_df["imdb_rating"] = [9.3, 9.0, 8.9, 8.8, 8.7]
type(movie_df)

pandas.core.frame.DataFrame

In [17]:
movie_df

| | title | imdb_rating |
|---|---|---|
| 0 | The Shawshank Redemption | 9.3 |
| 1 | The Dark Knight | 9.0 |
| 2 | Schindler's List | 8.9 |
| 3 | Forrest Gump | 8.8 |
| 4 | Inception | 8.7 |


## `DataFrame` from Pandas

Creating a `DataFrame` with a `dict`.

In [18]:
# column-wise
movie_dict = {
    "title": ["The Shawshank Redemption", "The Dark Knight", "Schindler's List", "Forrest Gump", "Inception"],
    "imdb_rating": [9.3, 9.0, 8.9, 8.8, 8.7]
}
movie_df = pd.DataFrame(movie_dict)
type(movie_df)

pandas.core.frame.DataFrame

In [19]:
movie_df

| | title | imdb_rating |
|---|---|---|
| 0 | The Shawshank Redemption | 9.3 |
| 1 | The Dark Knight | 9.0 |
| 2 | Schindler's List | 8.9 |
| 3 | Forrest Gump | 8.8 |
| 4 | Inception | 8.7 |


## `DataFrame` from Pandas

Creating a `DataFrame` with a `list` of dictionaries.

In [20]:
# row-wise
movie_list = [
    {"title": "The Shawshank Redemption", "imdb_rating": 9.3},
    {"title": "The Dark Knight", "imdb_rating": 9.0},
    {"title": "Schindler's List", "imdb_rating": 8.9},
    {"title": "Forrest Gump", "imdb_rating": 8.8},
    {"title": "Inception", "imdb_rating": 8.7}
]
movie_df = pd.DataFrame(movie_list)
type(movie_df)

pandas.core.frame.DataFrame

In [21]:
movie_df

| | title | imdb_rating |
|---|---|---|
| 0 | The Shawshank Redemption | 9.3 |
| 1 | The Dark Knight | 9.0 |
| 2 | Schindler's List | 8.9 |
| 3 | Forrest Gump | 8.8 |
| 4 | Inception | 8.7 |


## A `DataFrame` is a combination of multiple `Series` sharing the same `Index`

In [22]:
print(type(movie_df.index))
print(type(movie_df["title"]))
print(type(movie_df["imdb_rating"]))

<class 'pandas.core.indexes.range.RangeIndex'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


## Review of the definition of modern data science

> Modern data science is a huge field. It involves applications and tools like importing, tidying, transformation, visualization, modeling, and communication. Surrounding all these is programming.

![Imgur](https://i.imgur.com/din6Ig6.png)

Source: [R for Data Science](https://r4ds.had.co.nz/)

## Key functionalities analysts rely on `pandas` for

- Importing
- Tidying
- Transforming

## Tidying and transforming together are also known as WRANGLING

![](https://media.giphy.com/media/MnlZWRFHR4xruE4N2Z/giphy.gif)

Source: <https://giphy.com>

## Importing

## `pandas` provides many functions for importing tabular data

- Flat text file
- Database table
- Spreadsheet
- JSON
- HTML `<table></table>` tags
- ...etc.

Source: <https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html>

## Using `read_csv` function for flat text files

In [23]:
def get_latest_daily_report():
    # Walk backwards from today until a published daily report is found
    data_date = date.today()
    day_delta = timedelta(days=1)
    url_template = ("https://raw.githubusercontent.com"
                    "/CSSEGISandData/COVID-19/master"
                    "/csse_covid_19_data/csse_covid_19_daily_reports/{}.csv")
    while True:
        data_date_str = data_date.strftime('%m-%d-%Y')
        print("Try importing {} data...".format(data_date_str))
        try:
            daily_report = pd.read_csv(url_template.format(data_date_str))
            print("Successfully imported {} data!".format(data_date_str))
            return daily_report
        except urllib.error.HTTPError:
            # No file for this date; try the previous day
            data_date -= day_delta

In [24]:
daily_report = get_latest_daily_report()

Try importing 04-27-2022 data...
Try importing 04-26-2022 data...
Successfully imported 04-26-2022 data!
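
`read_csv` accepts file-like objects as well as URLs and file paths, which makes it easy to try without network access. The sketch below uses made-up sample data; the column names and values are hypothetical and are not taken from the real daily reports.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data for illustration only
csv_text = """Country_Region,Confirmed,Deaths
Atlantis,100,1
Lilliput,250,3
"""

# StringIO wraps the text so read_csv can treat it like an open file
sample_report = pd.read_csv(StringIO(csv_text))
print(sample_report.shape)
print(sample_report.dtypes)
```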


## Using `read_sql` function for database tables

```python
import sqlite3

conn = sqlite3.connect('YOUR_DATABASE.db')
sql_query = """
SELECT * 
  FROM YOUR_TABLE
 LIMIT 10;
"""
pd.read_sql(sql_query, conn)
```
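
For a version that runs without an existing database file, the sketch below builds a throwaway in-memory SQLite database first. The table name `movies` and its contents are assumptions made for this example.

```python
import sqlite3

import pandas as pd

# ":memory:" creates a temporary database that lives only in RAM
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, imdb_rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("The Shawshank Redemption", 9.3), ("The Dark Knight", 9.0)],
)
conn.commit()

# read_sql runs the query and returns the result set as a DataFrame
movies = pd.read_sql("SELECT * FROM movies LIMIT 10;", conn)
conn.close()
print(movies)
```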

## Using `read_excel` function for spreadsheets

```python
excel_file_path = "PATH/TO/YOUR/EXCEL/FILE"
pd.read_excel(excel_file_path)
```

## Using `read_json` function for JSON

```python
json_file_path = "PATH/TO/YOUR/JSON/FILE"
pd.read_json(json_file_path)
```
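
`read_json` can likewise be tried on in-memory text instead of a file. This is a minimal sketch assuming the JSON is an array of records; the sample records reuse the movie data from earlier.

```python
from io import StringIO

import pandas as pd

# A JSON array of records: each object becomes one row
json_text = """
[
  {"title": "The Shawshank Redemption", "imdb_rating": 9.3},
  {"title": "The Dark Knight", "imdb_rating": 9.0}
]
"""

movies = pd.read_json(StringIO(json_text))
print(movies)
```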

## Using `read_html` function for HTML `<table></table>` tags

> The `<table>` tag defines an HTML table. An HTML table consists of one `<table>` element and one or more `<tr>`, `<th>`, and `<td>` elements. The `<tr>` element defines a table row, the `<th>` element defines a table header, and the `<td>` element defines a table cell.

Source: <https://www.w3schools.com/default.asp>

In [25]:
request_url = "https://www.imdb.com/chart/top"
html_tables = pd.read_html(request_url)
print(type(html_tables))
print(len(html_tables))

<class 'list'>
1


In [26]:
html_tables[0]

| | Unnamed: 0 | Rank & Title | IMDb Rating | Your Rating | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | | 1. 刺激1995 (1994) | 9.2 | 12345678910 NOT YET RELEASED Seen | |
| 1 | | 2. 教父 (1972) | 9.2 | 12345678910 NOT YET RELEASED Seen | |
| 2 | | 3. 黑暗騎士 (2008) | 9.0 | 12345678910 NOT YET RELEASED Seen | |
| 3 | | 4. 教父第二集 (1974) | 9.0 | 12345678910 NOT YET RELEASED Seen | |
| 4 | | 5. 十二怒漢 (1957) | 8.9 | 12345678910 NOT YET RELEASED Seen | |
| ... | ... | ... | ... | ... | ... |
| 245 | | 246. 阿拉丁 (1992) | 8.0 | 12345678910 NOT YET RELEASED Seen | |
| 246 | | 247. 姊妹 (2011) | 8.0 | 12345678910 NOT YET RELEASED Seen | |
| 247 | | 248. 美女與野獸 (1991) | 8.0 | 12345678910 NOT YET RELEASED Seen | |
| 248 | | 249. 男人的爭鬥 (1955) | 8.0 | 12345678910 NOT YET RELEASED Seen | |
