<center>
<table>
  <tr>
    <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/nasa-logo.svg" width="100"/> </td>
     <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/ASTG_logo.png?raw=true" width="80"/> </td>
     <td> <img src="https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png" width="130"/> </td>
    </tr>
</table>
</center>

        
<center>
<h1><font color= "blue" size="+3">ASTG Python Courses</font></h1>
</center>

---

<center><h1><font color="red" size="+3">Introduction to Pandas</font></h1></center>

# <font color="red">Objectives</font>
In this presentation, we will cover the following topics:
1. Pandas data structures (Series and DataFrames)
2. Inspecting data in DataFrames
3. Important functions
     - `groupby()`
     - `apply()`
     - `concat()`, `join()`, `merge()`
     - `compare()`
4. Reading remote CSV files and tables.
5. Cleaning and formatting data
6. Manipulating time series data
7. Performing statistical calculations
8. Visualizing the data

# <font color="red">Useful References</font>
- [Learn Pandas](https://bitbucket.org/hrojas/learn-pandas/src/master/) by Hernan Rojas.
- [Python Pandas Tutorial: A Complete Introduction for Beginners](https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/) by George McIntire, Brenda Martin and Lauren Washington.
- [Introduction into Pandas](https://www.python-course.eu/pandas.php) by Bernd Klein.
- [Time series analysis with pandas](http://earthpy.org/pandas-basics.html) from EarthPy.
- [Working with Time Series](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html) from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas.
- [Introduction to data analysis](https://pythongis.org/part1/chapter-03/index.html)
- [Unlocking Data Manipulation in Python: Key Pandas Techniques](https://www.dasca.org/world-of-data-science/article/unlocking-data-manipulation-in-python-key-pandas-techniques) from the Data Science Council of America.

![fig_logo](https://miro.medium.com/max/3200/1*9v51-jsfHtk6fgAIYLoiHQ.jpeg)
Image Source: pandas.pydata.org

# <font color="red">What is Pandas?</font>
+ `Pandas` is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
+ Some key features:
    - Fast and efficient DataFrame object with default and customized indexing.
    - Tools for loading data into in-memory data objects from different file formats.
    - Data alignment and integrated handling of missing data.
    - Reshaping and pivoting of data sets.
    - Label-based slicing, indexing and subsetting of large data sets.
    - Columns from a data structure can be deleted or inserted.
    - Group by data for aggregation and transformations.
    - High performance merging and joining of data.
    - Time Series functionality.
+ Able to read and write several <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html">types of files</a>, including CSV, TSV, JSON, HTML, xlsx, HDF5, and Python pickle, among others.
+ Is compatible with many other data analysis libraries, such as Scikit-Learn, Matplotlib, and NumPy.

Some of the key features of `Pandas` are captured in the diagram below:

![fig_features](https://favtutor.com/resources/images/uploads/mceu_16841658121636696850726.png)
Image Source: [favtutor.com](https://favtutor.com/blogs/numpy-vs-pandas)

# <font color="red">Packages Used</font>

We will use the following packages:

- `Matplotlib`: for visualization
- `Seaborn`: for visualization settings
- `NumPy`: for array creation.
- `Pandas`: for creating and manipulating Series and DataFrames, and for visualization.

In addition, we will use the `datetime` module to manipulate dates and times.

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import datetime
import numpy as np
import pandas as pd

In [None]:
print(f'Using Numpy version:  {np.__version__}')
print(f'Using Pandas version: {pd.__version__}')

#### Notebook settings

In [None]:
%matplotlib inline

Only 5 rows of data will be displayed:

In [None]:
pd.set_option('display.max_rows', 5)

Print floating point numbers using fixed point notation:

In [None]:
np.set_printoptions(suppress=True)

#### Graphics

In [None]:
import matplotlib.pyplot as plt

In [None]:
import seaborn as sns

- There are five preset Seaborn themes: `darkgrid` (default), `whitegrid`, `dark`, `white`, and `ticks`.
- They are each suited to different applications and personal preferences. 

In [None]:
sns.set_style("whitegrid")

- The four preset contexts, in order of relative size, are `paper`, `notebook` (default), `talk`, and `poster`.

In [None]:
sns.set_context("paper")

Remove the top and right spines of the current figure:

In [None]:
sns.despine()

![fig_pandas](https://www.dasca.org/content/images/main/essential-data-manipulation-techniques-in-pandas.jpg)
Image Source: dasca.org

# <font color="red">`pandas` Data Structures</font>

Pandas provides two main data structures:

- **Series**: 1D size-immutable array-like structure holding homogeneous data.
- **DataFrame**: 2D size-mutable tabular structure with heterogeneously typed columns.

Older releases also offered a 3D **Panel** structure. It was deprecated and has been removed from pandas (as of version 0.25), so it is not covered here.

## <font color="blue">1D Data Structures: Series</font>

- A <font color='red'>Series</font> is a one-dimensional <font color='green'>**labeled**</font> array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
- It can be seen as a column in a table or a one-dimensional NumPy array (homogeneous data), but with an added index that allows you to access elements by their label.
   - The row labels of Series are called the **index**.
- Series support a wide range of data manipulation and analysis operations.

![title](https://portal.nccs.nasa.gov/datashare/astg/training/python/pandas/pandas_series.png)

#### Creating a Series

A Series can be constructed with the `pd.Series` constructor (passing a list, a NumPy array, a dictionary, or a scalar).

```python
pd.Series(data=None, index=None, dtype=None, 
          name=None, copy=False)
```

- **data**: Array-like, iterable, dict, or scalar value. It is used to populate the rows of the Series object.
- **index**: Array-like or Index. It is used to label the rows of the Series. When `data` is a list or an array, its length must match the length of `data`. The labels do not have to be unique. If no index is given, the default is a `RangeIndex` equivalent to `np.arange(n)`.
- **dtype**: Used to specify the data type of the Series which will be formed. If this parameter is not specified then the data type will be inferred from the values present in the series.
- **copy**: Boolean used to copy the input data. 
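
The parameters above can be sketched with a few toy values. The names `temps`, `default_sr`, and `from_dict`, and the numbers themselves, are hypothetical and chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only to illustrate the constructor arguments
temps = pd.Series(
    data=[21, 23, 19],
    index=["Mon", "Tue", "Wed"],   # one label per value
    dtype="float64",               # force floats instead of the inferred int64
    name="temperature",
)
print(temps)

# Without an explicit index, pandas generates the labels 0, 1, ..., n-1
default_sr = pd.Series([21, 23, 19])
print(default_sr.index)

# With a dict, the index argument selects and orders the keys;
# labels that are missing from the dict are filled with NaN
from_dict = pd.Series({"a": 1, "b": 2}, index=["b", "a", "c"])
print(from_dict)
```

Note that the dictionary case does not require the index to have the same length as the data: the index decides which keys are kept.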

Creation from a list:

In [None]:
my_list = [5, 8, 13, 0.1, -5]

Use a list to create a Numpy array:

In [None]:
a = np.array(my_list)

In [None]:
type(a)

In [None]:
a

Use a list to create a Pandas Series:

In [None]:
sr = pd.Series(my_list)
print(type(sr))
print(sr)

...get default index values

#### NumPy arrays as backend of Pandas

Contains an array of data:

In [None]:
sr.values  

- If nothing else is specified, the values are labeled with their position number.
- The Pandas Series will then have an associated array of data labels from `0` to `N-1`:

In [None]:
sr.index

In [None]:
my_rows = list(range(5))
print(my_rows)

In [None]:
sr.index.values 

Obtain statistical information:

In [None]:
sr.describe()

#### More on the index

Rename the index values:

In [None]:
sr.index = ['A','B','C','D','E']
print(sr)

Or pass the index values during Pandas series creation:

In [None]:
sr1 = pd.Series(my_list, index=['A','B','C','D','E'])
print(sr1)

#### NumPy Array has an implicitly defined integer index used to access the values while the Pandas Series has an explicitly defined index associated with the values.

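Get the value at position `n` in the Series. Because the index now holds string labels, `sr[3]` falls back to positional access. This fallback is deprecated in recent pandas versions in favor of `iloc`:

In [None]:
print(sr[3])  # positional fallback; deprecated in recent pandas versions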

Use `iloc` (integer location) to get value at position `n`

In [None]:
print(sr.iloc[3]) 

Get the value at a given label with `loc` (label-based location):

In [None]:
print(sr.loc['D'])
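
The difference between label-based and position-based access is easiest to see when the labels are integers that do not match the positions. This is a minimal sketch with made-up Series `s` and `t`:

```python
import pandas as pd

# Hypothetical Series whose integer labels do not match the positions
s = pd.Series([10.0, 20.0, 30.0], index=[2, 0, 1])

print(s.loc[0])    # label 0    -> 20.0
print(s.iloc[0])   # position 0 -> 10.0

# Slices behave differently too:
# loc includes the end label, iloc excludes the end position
t = pd.Series([1, 2, 3, 4], index=list("abcd"))
print(t.loc["a":"c"])   # three elements: a, b, c
print(t.iloc[0:2])      # two elements: positions 0 and 1
```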

#### Filtering

In [None]:
sr[sr > 1]

#### Other ways to create Series

We can also create a Pandas Series from a dictionary:

In [None]:
sr2 = pd.Series(dict(A=5, B=8, C=13, D=0.1, E=-5))
sr2

You can also create a Pandas Series from a scalar. If you pass a single value together with several index labels, the value is repeated for every label.

In [None]:
sr3 = pd.Series(10.5, index=['A','B','C','D','E'])
print(sr3)

### <font color='green'>Breakout 1</font>

1. Create a Series using:

```python
   data = {'Course': "Pandas", 'Setting': "Virtual", 'Duration': "3 hours"}
```

2. Create a new Series with the above `data` and with the index as:

```python
   my_index = ['Course_Name', 'Course_Setting', 'Course_Duration']
```

<details><summary><b><font color="green">Click here to access the solution</font></b></summary>
<p>

```python
   data = {'Course': "Pandas", 'Setting': "Virtual", 'Duration': "3 hours"}
   sr1 = pd.Series(data)
   my_index = ['Course_Name', 'Course_Setting', 'Course_Duration']
   sr2 = pd.Series(data, index=my_index)
``` 
</p>
</details>

## <font color="blue">2D data structures</font>

A Pandas <font color='red'>DataFrame</font> is a 2-dimensional labeled data structure with columns of potentially different types. It is the most commonly used pandas object.

A <font color='red'>DataFrame</font> is like a sequence of aligned <font color='red'>Series</font> objects, i.e. they share the same index.

![title](https://portal.nccs.nasa.gov/datashare/astg/training/python/pandas/pandas_df.png)


### Features of DataFrames

A DataFrame:
- Is a way to represent and work with tabular data.
- Can be thought of as a __generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data__.
- Rows and columns of a DataFrame are labelled and can be named.
- Can be seen as a dictionary of one-dimensional NumPy arrays, lists, dictionaries or Series.
- Supports heterogeneous collections of data where the data in each column is homogeneous.
- Can perform arithmetic operations on rows and columns.
- Its size is mutable.
   - We can add/remove rows or columns as needed.
- Supports reading files such as `CSV`, `Excel`, and `JSON`, as well as `SQL` tables.
- Handles missing data.

**A pandas dataframe can be seen as a collection of pandas series**
![fig_objects](https://doit-test.readthedocs.io/en/latest/_images/base_01_pandas_5_0.png)
Image Source: doit-test.readthedocs.io
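
To make the "collection of Series" view concrete, here is a small sketch that builds a DataFrame from two Series with partly different labels. The names `height`, `weight`, and `people` and their values are hypothetical:

```python
import pandas as pd

# Two hypothetical Series with partly different labels
height = pd.Series({"A": 1.5, "B": 1.7, "C": 1.6})
weight = pd.Series({"B": 70, "C": 65, "D": 80})

# pandas aligns the Series on the union of their labels;
# positions with no value are filled with NaN
people = pd.DataFrame({"height": height, "weight": weight})
print(people)

# Each column is itself a Series that shares the DataFrame's index
print(type(people["height"]))
print(people["height"].index.equals(people.index))
```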

### <font color="green">Create DataFrame</font>

- We can create a DataFrame using a list, a NumPy Array or a dictionary

__Creation with a list__

In [None]:
data = [
    [5, True, 'x', 2.7],
    [8, True, 'y', 3.1],
    [13, False, 'z', np.nan],   # np.nan (np.NaN was removed in NumPy 2.0)
    [1, False, 'a', 0.1],
    [-5, True, 'b', -2]
]
data

In [None]:
df = pd.DataFrame(
    data=data,
    index=['A','B','C','D','E'],
    columns=['num', 'bool', 'str', 'real']
)

In [None]:
print(type(df))

In [None]:
df

__Creation with NumPy__

In [None]:
num_col = np.array([5, 8, 13, 1, -5])
bool_col = np.array([True, True, False, False, True])
str_col = np.array(['x', 'y', 'z', 'a', 'b'])
real_col = np.array([2.7, 3.1, np.nan, 0.1, -2])

In [None]:
df = pd.DataFrame(
    dict(num=num_col, bool=bool_col, str=str_col, real=real_col),
    index=['A','B','C','D','E'],
)
df

In [None]:
df.info()

__Creation with a dictionary__

- The keys are the column names
- The corresponding values are lists of numbers, bools, strings and floats respectively.

In [None]:
data_dict = {
    "num": [5, 8, 13, 1, -5],
    "bool": [True, True, False, False, True],
    "str": ['x', 'y', 'z', 'a', 'b'],
    "real": [2.7, 3.1, np.nan, 0.1, -2]
}

In [None]:
df = pd.DataFrame(data_dict, index=['A','B','C','D','E'])

In [None]:
df

### <font color="green">Inspecting data in DataFrame</font>

__Display the first few rows__:

In [None]:
df.head()

In [None]:
df.head(2)

__Display the last few rows__:

In [None]:
df.tail()

In [None]:
df.tail(3)

__Get the number of rows and columns as a tuple__:

In [None]:
df.shape

In [None]:
len(df)

__Get the type of each column__:

In [None]:
df.dtypes

__Get list of column names__:

In [None]:
df.columns

__Get the index values__:

In [None]:
df.index

### <font color="green">Obtain basic data information</font>

We can get the column count, number of values in each column, data type of each column, etc.:

In [None]:
df.info()

### <font color="green">Obtain descriptive statistics</font> 

- By default, only the numeric columns are summarized.

In [None]:
df.describe()

We can pass the argument `include='object'` to return the descriptive statistics of categorical (object) columns:

In [None]:
df.describe(include='object')

### <font color="green">Sorting records</font>

- We can sort records by any column using the `df.sort_values()` method.
- For example, we can sort the "str" column in ascending order.

In [None]:
df.sort_values('str', ascending=True)

### <font color="green">Slicing data</font>

__Get specific column(s)__:

In [None]:
df['num']

In [None]:
df.num

In [None]:
df[['num','real']]

### <font color="green">Label-based selection</font>

__Get specific row(s) by name(s)__: use `loc[]`

In [None]:
df.loc['C']

In [None]:
df.loc[['B', 'D']]

In [None]:
df.loc['A':'E':2]

__Get specific row(s) and column(s) by name(s)__:

In [None]:
df.loc['A':'D':2, ['num', 'real']]

In [None]:
df.loc['A':'C', 'num':'real']

### <font color="green">Index-based selection</font>

__Get specific row(s) by position(s)__: use `iloc[]`

In [None]:
df.iloc[2]

In [None]:
df.iloc[1:4]

__Get specific row(s) and column(s) by position(s)__: use `iloc[]`

In [None]:
df.iloc[[2,4], [1,3]]

__Display one random row__:

In [None]:
df.sample()

__Select columns based on data type__:

In [None]:
df.select_dtypes(include='object')

### <font color="green">Filtering data</font>

__Apply masking__:

In [None]:
df[df.real > 1.0]

In [None]:
df[df.real == 3.1]

__Problem with `NaN`__:
- In Python (and NumPy), `NaN` values never compare equal, not even to themselves.
- Pandas/NumPy uses the fact that `np.nan != np.nan`, and treats `None` like `np.nan`.
- A scalar equality comparison versus a `None/np.nan` doesn’t provide useful information.
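
A quick, self-contained check of these statements, using a made-up Series `s`:

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)   # NaN never equals itself

# None is stored as NaN in a float Series
s = pd.Series([1.0, np.nan, None])

eq_mask = (s == np.nan)   # every entry is False, so this finds nothing
na_mask = s.isna()        # the reliable way to locate missing values

print(eq_mask)
print(na_mask)
```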

In [None]:
df.real

In [None]:
df.real == np.nan

We use the `isnull()` method to find out which DataFrame entries are `NaN`.

In [None]:
df.isnull()

We can determine whether there is at least one `NaN` in the data:

In [None]:
df.isnull().values.any()

To count the `NaN` values in each column (a nonzero count means the column has at least one `NaN`):

In [None]:
df.isnull().sum()

###  <font color="green">Dealing with missing values</font>

- One of the most common issues encountered in datasets is the presence of missing values.
- In Pandas, missing data can be represented in various forms, such as `NaN` (Not a Number) or `None`.
- Understanding how to manage missing data ensures the integrity of the datasets and reinforces the accuracy of any conclusions drawn from them.
- How do we deal with missing values in our dataset? There are at least three (3) options we can consider:
    - Remove the rows and or columns with missing values: use `dropna()` method.
    - Replace the missing values with specified values, such as the mean or median of the column: use the `fillna()` method.
    - Fill missing values with interpolated values: use the `interpolate()` method.
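
The three options can be compared side by side on a small made-up Series (the name `s` and its values are only for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 7.0])

dropped = s.dropna()             # keeps only 1.0, 3.0, 7.0
filled = s.fillna(s.mean())      # mean() skips NaN, so the fill value is 11/3
interpolated = s.interpolate()   # linear by default: 1, 2, 3, 5, 7

print(dropped)
print(filled)
print(interpolated)
```

Dropping changes the length of the data, while filling and interpolating keep the length but introduce values that were never observed. Which trade-off is acceptable depends on the analysis.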

__Drop rows with missing values__

In [None]:
df.dropna(axis=0)

__Drop columns with missing values__

In [None]:
df.dropna(axis=1)

__Drops rows/columns with at least one missing value__

In [None]:
df.dropna(how='any')

__Drops rows/columns with all values missing__

In [None]:
df.dropna(how='all')

__Replace missing values using `fillna()`__

With any number:

In [None]:
my_number = df['real'].mean()
df.fillna(my_number)

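Forward fill: replace each missing value with the last non-missing value. The `ffill()` method replaces `fillna(method='ffill')`, which is deprecated in recent pandas versions.

In [None]:
df.ffill()

Backward fill: replace each missing value with the next non-missing value.

In [None]:
df.bfill()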

__Fill missing values using interpolation__

In [None]:
df.interpolate()

In [None]:
df.interpolate(method="linear")

In [None]:
df.interpolate(axis=0)

### <font color='green'>Breakout 2</font>

Use the above DataFrame `df` to create a new one that contains only the rows where the `bool` column is `True` and the `num` column is less than 10.

<details><summary><b><font color="orange">Click here to access the solution</font></b></summary>
<p>

```python
df[(df.num < 10) & (df["bool"])]
``` 
</p>
</details>

# <font color="red">Important Operations on DataFrames</font>

- Operations on rows and/or columns
   - `apply()`
   - `map()`
   - `replace()`
- Merging
   - `concat()` 
   - `join()` 
   - `merge()`
- Comparing
   - `compare()`
- Grouping
   - `groupby()`

## <font color="blue"> Applying</font>

- `replace()`: Use for targeted value substitutions.
- `map()`: Use for simple value mappings on a single Series.
- `apply()`: Use for complex transformations involving multiple columns or rows.

### <font color="green"> `replace()` Function</font>

The `replace()` method is a convenient way to substitute specific values (here, letter grades with numbers) in an entire DataFrame or in a single column (Series).

In [None]:
df = pd.DataFrame(
    {
        'student1': ['D', 'A', 'B', 'C'], 
        'student2': ['B', 'C', 'B', 'A']
    }
)
df

In [None]:
df.replace({'A': 1, 'B': 2, 'C': 3, 'D': 4})

### <font color="green"> `map()` Function</font>

- Here, the `map()` method takes a dictionary as an argument, where the keys are the values to be replaced and the values are the replacements. It also accepts a function. Entries with no matching key become `NaN`.
- It is used on one column only (Series).

In [None]:
mymapping = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

In [None]:
df['student1'].map(mymapping)

### <font color="green"> `apply()` Function</font>
- Allows us to use an external function to manipulate every row or column.
- The function receives each column (or row) as a Series and can return a single value or a Series.
- Can be used to perform a wide range of transformations and reductions on your data.
- Use the `axis` parameter to choose whether the function is applied to each column (`axis=0`, the default) or to each row (`axis=1`).

```python
DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)
```

In [None]:
data = [(-13, 5, 7), (2, 4, -6), (7.5, -5, 8),(-2, 3, 9)]
df = pd.DataFrame(data, 
                  columns=['col1', 'col2', 'col3'],
                  index=['row1', 'row2', 'row3', 'row4']
                 )
print(df)

__Apply to all entries__:

In [None]:
def square_func(x):
    return x**2

In [None]:
df.apply(square_func)

In [None]:
df.apply(lambda x: x**2)

__Apply to a specific column__:

In [None]:
df['col3'].apply(lambda x: x**2)

__Calculation along axis__:

In [None]:
df.apply(sum, axis=0)

In [None]:
df4 = df.apply(sum, axis=1)
df4

__Using multiple arguments__:

In [None]:
def my_func(x, y, z):
    return (x+y**2)/z

In [None]:
df.apply(lambda x: my_func(x['col1'], x['col2'], x['col3']), axis=1)
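
Instead of wrapping the call in a `lambda`, extra arguments can also be forwarded by `apply()` itself, either positionally through `args` or as keyword arguments. A minimal sketch, where `df_demo` and `weighted_sum` are hypothetical names introduced for this example:

```python
import pandas as pd

df_demo = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]},
                       index=["row1", "row2"])

def weighted_sum(row, w1, w2):
    """Combine the two columns of one row with the given weights."""
    return w1 * row["col1"] + w2 * row["col2"]

# Extra positional arguments go through args=...
res_args = df_demo.apply(weighted_sum, axis=1, args=(10, 1))

# ...and extra keyword arguments are passed on directly
res_kwargs = df_demo.apply(weighted_sum, axis=1, w1=10, w2=1)

print(res_args)
```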

### <font color='green'>Breakout 3</font>
The code below creates a Pandas DataFrame of students' grades.

```python
columns = ["Students", "Engl", "Phys", "Math", "Comp"]
students = ["Julia", "Jules", "Julio"]
engl_grades = ["A", "D", "B"]
phys_grades = ["A", "A", "C"]
math_grades = ["C", "A", "A"]
comp_grades = ["B", "B", "C"]

zipped = list(zip(students, engl_grades, phys_grades, 
                  math_grades, comp_grades))
student_df = pd.DataFrame(zipped, columns = columns)
```

Do the following:
1. Set the `Students` as index.
2. Replace the letters (`A`, `B`, `C`, `D`) with numbers (`4`, `3`, `2`, `1`). You may want to do a Google search on how to replace a string with an integer in a Pandas DataFrame.
3. Compute the GPA of each student.

<details><summary><b><font color="orange">Click here to access the solution</font></b></summary>
<p>

```python
# Question 1
student_df = student_df.set_index(columns[0])
    
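# Question 2
mymap = {'A': 4, 'B': 3, 'C': 2, 'D': 1}

# Option 1: replace the letters directly
new_student_df = student_df.replace(mymap)

# Option 2 (same result): map every cell.
# In pandas >= 2.1, applymap is deprecated in favor of DataFrame.map.
new_student_df = student_df.applymap(lambda s: mymap.get(s, s))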
    
# Question 3
new_student_df.mean(axis=1)
``` 
</p>
</details>

## <font color="blue"> Merging</font>

Consider a DataFrame whose rows are students and whose columns are the courses they were graded in:

In [None]:
columns = ["Engl", "Phys", "Math", "Comp"]
students = ["Julia", "Jules", "Julio"]
engl_grades = ["A", "D", "B"]
phys_grades = ["A", "A", "C"]
math_grades = ["C", "A", "A"]
comp_grades = ["B", "B", "C"]

zipped = list(zip(engl_grades, phys_grades, 
                  math_grades, comp_grades))
student_grades = pd.DataFrame(zipped, columns = columns, index=students)
student_grades

Create a new DataFrame with one row:

In [None]:
new_student_name = ['Jean']
new_student_grade = pd.DataFrame([['C', 'A', 'B', 'A']], 
                                 columns = columns, 
                                 index=new_student_name)
new_student_grade

Create a new DataFrame with two new courses:

In [None]:
new_courses = ['Psy', 'Bio']
new_grades = [
    ['C', 'A'], 
    ['A', 'B'], 
    ['A','A'], 
    ['B', 'C']]
new_course_grades = pd.DataFrame(new_grades, 
                                 columns = new_courses, 
                                 index=students+new_student_name)
new_course_grades

### <font color="green">`concat()` function</font>

- Append either columns or rows from one DataFrame to another.

In [None]:
student_grades2 = pd.concat([student_grades, new_student_grade])
student_grades2
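
The cell above appends rows. The `axis` argument of `pd.concat()` controls the direction, as this small sketch with two hypothetical frames (`left` and `right`) shows:

```python
import pandas as pd

left = pd.DataFrame({"Engl": ["A", "D"]}, index=["Julia", "Jules"])
right = pd.DataFrame({"Phys": ["A", "C"]}, index=["Jules", "Julio"])

# axis=0 (default): stack the rows; columns are the union of both frames
rows = pd.concat([left, right])

# axis=1: place the frames side by side; rows are aligned on the index
cols = pd.concat([left, right], axis=1)

print(rows)
print(cols)
```

In both directions, cells that have no counterpart in the other frame are filled with `NaN`.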

### <font color="green">`join()` function</font>

- Used to combine two DataFrames on their row indices.

In [None]:
student_grades3 = student_grades2.join(new_course_grades)
student_grades3

### <font color="green">`merge()` function</font>

Combines or joins two DataFrames with the same columns or indices.

```python
merge(left, right, how='inner', on=None, left_on=None, right_on=None,
      left_index=False, right_index=False, sort=False,
      suffixes=('_x', '_y'), copy=True, indicator=False,
      validate=None)
```

- The key column(s) to join on: `on`, `left_on`, `right_on`
- The merging method: `how`
   - INNER JOIN: `how='inner'`
   - LEFT JOIN: `how='left'`
   - RIGHT JOIN: `how='right'`
   - OUTER JOIN: `how='outer'`
   - CROSS JOIN: `how='cross'`


![fig_merge](https://datacomy.com/data_analysis/pandas/merge/types-of-joins.png)
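
The join types only differ when the two tables do not hold exactly the same keys. The sketch below uses two hypothetical frames (`left` and `right`, with made-up grades) and the `indicator=True` option, which adds a `_merge` column telling where each row came from:

```python
import pandas as pd

left = pd.DataFrame({"Student": ["Jules", "Julio", "Julia"],
                     "Math": [100, 89, 91]})
right = pd.DataFrame({"Student": ["Julio", "Julia", "Jean"],
                      "Phys": [87, 95, 78]})

for how in ["inner", "left", "right", "outer"]:
    merged = left.merge(right, on="Student", how=how, indicator=True)
    print(f"how={how!r}: {len(merged)} rows")
    print(merged, "\n")

# Row counts for a few join types
inner_rows = len(left.merge(right, on="Student", how="inner"))
outer_rows = len(left.merge(right, on="Student", how="outer"))
cross_rows = len(left.merge(right, how="cross"))   # no key: every pairing
print(inner_rows, outer_rows, cross_rows)
```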

In [None]:
grade_df1 = pd.DataFrame([['Jules',100,85,90], ['Julio',89,97,85], ['Julia',91,75,95]], 
                         columns=['Student', 'Math', 'Eng', 'Phys'])
grade_df1

In [None]:
grade_df2 = pd.DataFrame([['Jules',100,92,93], ['Julio',93,94,87], ['Julia',93,82,95]], 
                         columns=['Student', 'Math', 'Eng', 'Phys'])
grade_df2

In [None]:
grade_df1.merge(grade_df2, on='Student')

In [None]:
grade_df1.merge(grade_df2, on='Math')

In [None]:
grade_df1.merge(grade_df2, how='right')

In [None]:
grade_df1.merge(grade_df2, how='inner')

A cross join creates all possible combinations of the rows of `left` and `right`:

In [None]:
grade_df1.merge(grade_df2, how='cross')

### <font color="green"> `compare()` function</font>

- Compares two DataFrames row-by-row and column-by-column.
- Displays the differences next to each other.

```python
df1.compare(df2, align_axis=1, keep_shape=False, keep_equal=False)
```

In [None]:
print(grade_df1.compare(grade_df2))

Add the argument `keep_equal=True` if you want to keep the corresponding values that are equal.

In [None]:
print(grade_df1.compare(grade_df2, keep_equal=True))

In [None]:
print(grade_df1.compare(grade_df2, align_axis=1))

When `align_axis=0`, the `DataFrame.compare()` method returns a DataFrame in which the differences are stacked vertically, with rows drawn alternately from `self` and `other`.

In [None]:
print(grade_df1.compare(grade_df2, align_axis=0))

If `keep_shape=True`, all rows and columns are shown in the resulting DataFrame. Otherwise, only the ones with different values are shown.

In [None]:
print(grade_df1.compare(grade_df2, keep_shape=True))

#### `groupby()` Function

Will be covered in a later section.

# <font color='red'>Pandas DateTime</font>
- Being able to handle and work with temporal information is extremely important when doing data analysis. 
- Time information in the data allows us to see patterns through time (trends) as well as to make predictions into the future (at varying levels of confidence).
- Many data points we collect are obtained at different time intervals and ordered chronologically. They are referred to as time series data.
- The [datetime](https://docs.python.org/3/library/datetime.html) module provides functionality for manipulating dates and times.
- Pandas provides a number of tools to handle time series data, including methods for manipulating `datetime` objects.

Generate sequences of fixed-frequency dates and time spans:

In [None]:
dti = pd.date_range('2022-01-01', periods=15, freq='h')  # 'h' = hourly ('H' is deprecated in pandas 2.2+)
print(type(dti))
dti

Manipulating and converting date times with timezone information:

In [None]:
dti = dti.tz_localize("UTC")
dti

Use the sequence to create a Pandas series:

In [None]:
ts = pd.Series(range(len(dti)), index=dti)
print(ts)

Resample or convert the time series to a particular frequency:

- Sample every two hours and compute the mean

In [None]:
ts.resample('2h').mean()

Create a Pandas series where the index is the time component:

In [None]:
num_periods = 67
ts = pd.Series(np.random.random(num_periods),
               index=pd.date_range('2021-01', 
                                   periods=num_periods, 
                                   freq='W'))
ts

Create a Pandas DataFrame where the index is the time component:

In [None]:
num_periods = 2500
df = pd.DataFrame(dict(X = np.random.random(num_periods), 
                       Y = -5+np.random.random(num_periods)),
                  index=pd.date_range('2000', 
                                      periods=num_periods, 
                                      freq='D'))
df

**Resampling**
- The `resample()` function is used to resample time-series data.
- It groups data by a certain time span. 
- You specify a method of how you would like to resample.
- Pandas comes with many in-built options for resampling, and you can even define your own methods.
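
As a sketch of the last point, a user-defined function can be passed to `apply()` on the resampled object. The Series `daily` and the function `value_range` are hypothetical names introduced here:

```python
import numpy as np
import pandas as pd

# Two weeks of made-up daily data, starting on a Monday
idx = pd.date_range("2022-01-03", periods=14, freq="D")
daily = pd.Series(np.arange(14, dtype=float), index=idx)

def value_range(x):
    """Custom aggregation: spread between the largest and smallest value."""
    return x.max() - x.min()

weekly_mean = daily.resample("W").mean()               # built-in method
weekly_range = daily.resample("W").apply(value_range)  # user-defined method

print(weekly_mean)
print(weekly_range)
```

With the `'W'` alias each bin ends on a Sunday, so the 14 days fall into exactly two weekly bins.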

Here are some time period options:

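| Alias | Description |
| --- | --- |
| 'D' | Calendar day |
| 'W' | Weekly |
| 'M' | Month end |
| 'Q' | Quarter end |
| 'A' or 'Y' | Year end |

Note: starting with pandas 2.2, the month-end, quarter-end, and year-end aliases are spelled `'ME'`, `'QE'`, and `'YE'`, and the older spellings listed above are deprecated.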

Here are some method options for resampling:

| Method | Description |
| --- | --- |
| max | Maximum value |
| mean | Mean of values in time range |
| median | Median of values in time range |
| min | Minimum data value |
| sum | Sum of values |

In [None]:
df.X.resample('Y').mean()

In [None]:
df.Y.resample('W').sum()

In [None]:
df.X.resample('Q').median()

# <font color="red">Applications</font>

## <font color="blue"> Report on UFO Sightings</font>

In [None]:
url = 'http://bit.ly/uforeports'
df_ufo = pd.read_csv(url)            
df_ufo 

Convert the Time column to datetime format:

In [None]:
df_ufo['Time'] = pd.to_datetime(df_ufo.Time)
df_ufo

Rename the column to Date:

In [None]:
df_ufo.rename(columns={'Time':'Date'}, inplace=True)
df_ufo

Move the Date column as the DataFrame index:

In [None]:
df_ufo.set_index(['Date'], inplace=True)

In [None]:
df_ufo

**Question 1**: How to select the sightings between two dates? The number of rows of the result gives the number of sightings.

In [None]:
df1 = df_ufo.loc['1978-01-01 09:00:00':'1980-01-01 11:00:00']
df1
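
The same kind of selection can be tried offline on a small made-up table. In this sketch the frame `sightings` and its dates are hypothetical. It also sorts the index first, since slicing a `DatetimeIndex` by date strings is most reliable when the index is sorted:

```python
import pandas as pd

# Hypothetical sightings, deliberately out of chronological order
dates = pd.to_datetime(["1978-03-01 21:00", "1979-07-15 02:30",
                        "1980-06-01 23:00", "1979-01-10 20:00"])
sightings = pd.DataFrame({"State": ["CA", "TX", "CA", "NY"]}, index=dates)
sightings = sightings.sort_index()

# Label slicing with date strings: both end points are included
subset = sightings.loc["1978-01-01":"1979-12-31"]
n_sightings = len(subset)

# Equivalent count with a boolean mask on the index
mask = (sightings.index >= "1978-01-01") & (sightings.index < "1980-01-01")
n_mask = int(mask.sum())

print(subset)
print(n_sightings, n_mask)
```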

**Question 2**: How to extract the sightings in a specific month?

In [None]:
df2 = df_ufo[df_ufo.index.month == 2]
df2

**Question 3**: How to extract the sightings in a specific year?

In [None]:
df3 = df_ufo[df_ufo.index.year == 1999]
df3

**Question 4**: How to extract the sightings in a given State?

In [None]:
df4 = df_ufo[df_ufo['State']== 'CA']
df4

**Question 5**: How to get the sightings with shape `TRIANGLE`?

In [None]:
df5 = df_ufo[df_ufo['Shape Reported']== 'TRIANGLE']
df5

**Question 6**: How to count the number of sightings in each state?

In [None]:
df6 = df_ufo.groupby(['State']).count()
df6

In [None]:
df6['City'].plot(kind='barh', 
                 figsize=(18,17));

## <font color="blue">Population Data</font>

### Using the `groupby` Function and Related Functions to Aggregate

Read data from url as pandas dataframe:

In [None]:
pop_url = 'http://bit.ly/2cLzoxH'

pop_data = pd.read_csv(pop_url)
pop_data

Convert the `year` values to datetime objects and make `year` the index:

In [None]:
pop_data['year'] = pd.to_datetime(pop_data.year, format="%Y")

In [None]:
pop_data.rename(columns={'year': 'Year'}, inplace=True)

In [None]:
pop_data.set_index(['Year'], inplace=True)

In [None]:
pop_data

We want to create a new dataframe by selecting the `continent` and `pop` columns only:

In [None]:
continent_pop = pop_data[['continent', 'pop']]
continent_pop

### Pandas `groupby()` Function

- It is used to group rows that have the same values.
- It is used with **aggregate functions** (`count`, `sum`, `min`, `max`, `mean`) to get the statistics based on one or more column values.
- This is also called the **Split-Apply-Combine** process:
    - The `groupby()` function splits the data into groups based on some criteria.
    - The aggregate function is applied to each of the groups.
    - The groups are combined together to create a new DataFrame.
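
The three steps can be spelled out by hand and compared with `groupby()`. This is only a sketch on a made-up table (`pop` and its numbers are hypothetical), not a description of how pandas implements grouping internally:

```python
import pandas as pd

pop = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia", "Europe", "Oceania"],
    "pop": [100, 50, 300, 70, 20],
})

# Manual version of the three steps
pieces = {}
for name in pop["continent"].unique():
    group = pop[pop["continent"] == name]    # split
    pieces[name] = group["pop"].mean()       # apply
manual = pd.Series(pieces).sort_index()      # combine

# The same computation in one line with groupby()
auto = pop.groupby("continent")["pop"].mean()

print(manual)
print(auto)
```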

In [None]:
grouped_pop = continent_pop.groupby("continent")
grouped_pop

How can we then print the content of the grouped object? The `head()` method returns the first rows of each group:

In [None]:
grouped_pop.head()

Obtain statistical description:

In [None]:
grouped_pop.describe().transpose()

**Iterating through Groups**

In [None]:
for key, item in grouped_pop:
    print(f"Key is: {str(key)}")
    print(f"{str(item)} \n\n")

#### Selecting a Group

A single group can be selected using `get_group()`:

In [None]:
grouped_pop.get_group('Oceania')

#### Functions To Aggregate

**`mean()`** computes mean values for each group:

In [None]:
grouped_pop.aggregate("mean")  # string names are preferred over NumPy functions in recent pandas

In [None]:
grouped_pop.mean()

**`sum()`** adds the values within each group.

In [None]:
grouped_pop.aggregate("sum")

In [None]:
grouped_pop.sum()

**`size()`** computes the number of rows in each group.

In [None]:
grouped_pop.aggregate("size")

In [None]:
grouped_pop.size()

For each group, you can similarly use:
    
- `count()`: computes the number of values.
- `max()`: gets maximum value.
- `min()`: gets minimum value.
- `std()`: computes standard deviation of the values.
- `var()`: computes variance, an estimate of variability.
- `sem()`: computes standard error of the mean values.
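
A few of these methods are shown below on a small made-up table (`toy` is a hypothetical name). The last call uses named aggregation, which computes several statistics at once and gives each result column an explicit name:

```python
import pandas as pd

toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe", "Europe"],
    "pop": [100, 300, 50, 70, 90],
})

g = toy.groupby("continent")["pop"]
print(g.count())   # number of values per group
print(g.max())     # largest value per group
print(g.std())     # sample standard deviation per group

# Named aggregation: result_column=(source_column, function_name)
summary = toy.groupby("continent").agg(
    n=("pop", "count"),
    largest=("pop", "max"),
    spread=("pop", "std"),
)
print(summary)
```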

**Applying several functions at once**

In [None]:
grouped_pop.agg(["sum", "mean", "std"])

**`describe()`** computes a quick summary of values per group

In [None]:
grouped_pop.describe()

**`first()`** gets the first row value within each group.

In [None]:
grouped_pop.first()

**`last()`** gets the last row value within each group.

In [None]:
grouped_pop.last()

**`nth()`** gives nth value, in each group.

In [None]:
grouped_pop.nth(8)

## <font color="blue">Read HTML Table</font>

We want to be able to read the **United States presidential election results for Minnesota** table from:

[https://en.wikipedia.org/wiki/Minnesota](https://en.wikipedia.org/wiki/Minnesota)


In [None]:
df_table = pd.read_html('https://en.wikipedia.org/wiki/Minnesota')
df_table

We read all the tables from the webpage. We can select the specific table we want to read by using the `match` parameter:

In [None]:
df_table = pd.read_html('https://en.wikipedia.org/wiki/Minnesota', 
                        match='United States presidential election results for Minnesota')

df_table

In [None]:
type(df_table)

In [None]:
len(df_table)

You can see that the result is a list containing one DataFrame. We can then extract the DataFrame:

In [None]:
df = df_table[0]
df

Let us gather basic information on rows and columns:

In [None]:
df.info()

#### Change the column names
- For readability and easy manipulation, we want each column name to be a single word.

In [None]:
new_columns = ['Year', 'GOP_Num', 'GOP_Perc', 
               'DNC_Num', 'DNC_Perc', 'Others_Num', 'Others_Perc']
df.columns = new_columns
df

In [None]:
df.info()

- Notice that the columns `GOP_Perc`, `DNC_Perc` and `Others_Perc` most likely hold strings (`object` dtype) because of the `%` sign.
- We want them to have numerical values.
- We can use `replace()` with the `regex=True` parameter to replace every `%` with an empty string.

In [None]:
df = df.replace({'%': ''}, regex=True)
df

In [None]:
df.info()

- The columns `GOP_Perc`, `DNC_Perc` and `Others_Perc` are still strings.
- We need to convert them into floating-point numbers.

In [None]:
df[['GOP_Perc', 'DNC_Perc', 'Others_Perc']] = df[['GOP_Perc', 'DNC_Perc', 'Others_Perc']].apply(pd.to_numeric)
df.info()
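
The two cleaning steps above can be tried on a small made-up table without downloading anything. This sketch uses the string accessor `str.replace` on each column instead of `DataFrame.replace`, which is an equivalent alternative for this case. The `toy` frame and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical rows shaped like the renamed election table (values are made up).
toy = pd.DataFrame({
    "Year": [2000, 2004, 2008],
    "GOP_Perc": ["40.5%", "45.0%", "42.25%"],
    "DNC_Perc": ["50.5%", "51.0%", "54.75%"],
})

perc_cols = ["GOP_Perc", "DNC_Perc"]
cleaned = toy.copy()
for col in perc_cols:
    # Remove the literal percent sign, then convert the text to floats.
    without_sign = cleaned[col].str.replace("%", "", regex=False)
    cleaned[col] = pd.to_numeric(without_sign)

print(cleaned.dtypes)
```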

__Convert the `Year` values to datetime objects and set `Year` as the index__:

In [None]:
df['Year'] = pd.to_datetime(df['Year'], format="%Y")

In [None]:
df.set_index('Year', inplace=True)
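
One benefit of a datetime index is that rows can be selected with date strings. The sketch below shows this on a made-up frame (the `toy` data is hypothetical). The years are converted to text before parsing so that the `%Y` format applies unambiguously.

```python
import pandas as pd

# Hypothetical yearly data (values are made up).
toy = pd.DataFrame({
    "Year": [2008, 2012, 2016, 2020],
    "GOP_Num": [10, 12, 11, 13],
})

# Parse the four-digit years and use them as the index.
toy["Year"] = pd.to_datetime(toy["Year"].astype(str), format="%Y")
toy = toy.set_index("Year")

# Label-based slicing on a datetime index includes both end points.
recent = toy.loc["2012":"2016"]
print(recent)
```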

#### Compute the means

In [None]:
df['GOP_Num'].mean()

In [None]:
df['DNC_Num'].mean()

#### Plot the time series

In [None]:
df[['GOP_Num', 'DNC_Num']].plot(color=['red', 'blue']);

In [None]:
df[['GOP_Perc', 'DNC_Perc']].plot(color=['red', 'blue']);

## <font color="blue">Weather Data</font>

<center>https://www.wunderground.com/cgi-bin/findweather/getForecast?query=KDAA</center>

#### Pandas <font color='red'>read_csv</font>

In [None]:
url = "https://portal.nccs.nasa.gov/datashare/astg/training/python/pandas/weather/hampton_10-10-15_10-10-16.csv"
weather_df = pd.read_csv(url)

In [None]:
weather_df

Print the column labels:

In [None]:
weather_df.columns

Get basic information on the data:

In [None]:
weather_df.info()

The column `Events` may have missing values.

__Descriptive statistics__:

In [None]:
weather_df.describe()

__Access the values of a column as you would in a dictionary__:

In [None]:
weather_df["Max TemperatureF"]

In [None]:
weather_df["EDT"]

Accessing a column as an attribute is convenient because it supports autocompletion:

In [None]:
weather_df.EDT  
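
Attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. This is illustrated on a made-up frame below (the `toy` frame and its `mean` column are hypothetical):

```python
import pandas as pd

# Hypothetical frame with three kinds of column names.
toy = pd.DataFrame({
    "EDT": ["2015-10-10", "2015-10-11"],
    "Max TemperatureF": [70, 68],
    "mean": [60, 59],
})

dates = toy.EDT                     # valid identifier: attribute access works
max_temp = toy["Max TemperatureF"]  # name contains a space: brackets are required
mean_column = toy["mean"]           # brackets always return the column
mean_method = toy.mean              # attribute access returns the DataFrame method

print(type(mean_column).__name__)
print(callable(mean_method))
```

Bracket access is therefore the safer choice in scripts, while attribute access is handy for interactive exploration.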

Select multiple columns:

In [None]:
weather_df[["EDT", "Mean TemperatureF"]]

You can also chain a method such as `head()` onto the selected column:

In [None]:
weather_df.EDT.head() 

In [None]:
weather_df["Mean TemperatureF"].head()

#### Rename columns

- Some column names have multiple words. It is easier to manipulate the DataFrame when each column name is one word only.
- Assign a new list of column names to the columns property of the DataFrame.

In [None]:
weather_df.columns = [
    "date", "max_temp", "mean_temp", "min_temp", "max_dew",
    "mean_dew", "min_dew", "max_humidity", "mean_humidity",
    "min_humidity", "max_pressure", "mean_pressure",
    "min_pressure", "max_visibilty", "mean_visibility",
    "min_visibility", "max_wind", "mean_wind", "min_wind",
    "precipitation", "cloud_cover", "events", "wind_dir"
]
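
Assigning a full list works only if it has exactly one name per column, in the right order. As an alternative sketch, `rename()` takes a mapping from old to new labels, so a name cannot end up on the wrong column. The three-column frame below is a made-up stand-in for `weather_df`.

```python
import pandas as pd

# Hypothetical stand-in for a few columns of the weather data.
toy = pd.DataFrame({
    "EDT": ["2015-10-10", "2015-10-11"],
    "Max TemperatureF": [70, 68],
    "Mean TemperatureF": [62, 60],
})

# Map each old label to its new one; unlisted columns keep their names.
renamed = toy.rename(columns={
    "EDT": "date",
    "Max TemperatureF": "max_temp",
    "Mean TemperatureF": "mean_temp",
})

print(renamed.columns.tolist())
```

`rename()` returns a new DataFrame and leaves the original untouched.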

In [None]:
weather_df

Now we can use dot notation with every column:

In [None]:
weather_df.mean_temp.head()

In [None]:
weather_df.mean_temp.std()

In [None]:
weather_df.mean_temp.mean()

### Visualization

In [None]:
weather_df.mean_temp.plot();

In [None]:
weather_df[['max_temp','min_temp']].plot(subplots=False);

In [None]:
new_weather_df = weather_df[['max_temp','min_temp']]
new_weather_df.plot(subplots=True);

With the `loc` method, we can select rows and columns by label instead of by position. Here the row labels are the default integer index, so `50:125` selects rows 50 through 125, both included:

In [None]:
new_weather_df = weather_df.loc[50:125,['max_temp','min_temp']]
new_weather_df.plot(subplots=True);
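
The inclusive end point of a `loc` slice is a common source of off-by-one surprises, because position-based `iloc` slices exclude the end point. Here is a minimal comparison on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical temperatures with the default integer index 0..9.
toy = pd.DataFrame({
    "max_temp": range(60, 70),
    "min_temp": range(40, 50),
})

by_label = toy.loc[2:5, ["max_temp", "min_temp"]]  # labels 2, 3, 4 and 5
by_position = toy.iloc[2:5, [0, 1]]                # positions 2, 3 and 4

print(len(by_label), len(by_position))
```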

The <font color='red'>plot()</font> function returns a matplotlib <font color='red'>Axes</font> object. You can pass this object to subsequent calls to `plot()` through the `ax` parameter in order to compose plots.

In [None]:
ax = weather_df.max_temp.plot(title="Min and Max Temperatures", 
                                figsize=(12,6));
weather_df.min_temp.plot(color="red", ax=ax);
ax.set_ylabel("Temperature (F)");

Create a scatter plot:

In [None]:
new_weather_df.plot(kind='scatter', x='max_temp', y='min_temp');

### <font color="green"> Breakout 4</font>
Use the `weather_df` DataFrame to:

1. Convert the `date` column values into datetime objects.
2. Set the `date` column as the index.
3. Plot the time series of the maximum and minimum temperatures on the same axes, for dates ranging from November 2015 to March 2016.

<details><summary><b><font color="green">Click here to access the solution</font></b></summary>
<p>

```python

# Question 1
weather_df['date'] = pd.to_datetime(weather_df['date'])

# Question 2: Set the date column (datetime objects) as the index
weather_df.set_index("date", inplace=True) 

# Question 3:
# Select the date range (November 2015 through March 2016)
slice_weather_df = weather_df.loc['2015-11-01':'2016-03-31']

# Plot
ax = slice_weather_df.max_temp.plot(title="Min and Max Temperatures", 
                                figsize=(12,6));
slice_weather_df.min_temp.plot(color="red", ax=ax);
ax.set_ylabel("Temperature (F)");
``` 
</p>
</details>

## <font color="blue">Climate data</font>

### <center>Global Surface Temperature Change based on Land and Ocean Data</center>

Web scraping:

[https://www.columbia.edu/~mhs119/Temperature/](https://www.columbia.edu/~mhs119/Temperature/)


#### Reference

- [http://pubs.giss.nasa.gov/docs/2010/2010_Hansen_ha00510u.pdf](http://pubs.giss.nasa.gov/docs/2010/2010_Hansen_ha00510u.pdf)
- [https://data.giss.nasa.gov/gistemp/graphs_v4/](https://data.giss.nasa.gov/gistemp/graphs_v4/)