# NOTES

##### [Tutorial Source](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/01_table_oriented.html)

##### Imports
- Auto-reload modules when their source changes
  - Load the extension in the .ipynb file: `%load_ext autoreload`
  - `%autoreload` options:
    - 0: turns off autoreload
    - 1: reloads only modules marked with `%aimport`
    - 2: reloads all modules (except those excluded by `%aimport`) before running code
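
Outside IPython, or with autoreload set to 0, a module can be reloaded by hand with `importlib.reload`. The sketch below uses the standard-library `json` module as a stand-in for a custom module such as `decorators`; that substitution is an assumption made only to keep the example self-contained.

```python
import importlib
import json  # stand-in for a custom module that was edited on disk

# reload() re-executes the module's source and returns the same module object
reloaded = importlib.reload(json)

print(reloaded is json)
```

Names pulled in with `from module import name` keep pointing at the old objects until that import statement is run again.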

##### Part 1
- `object` in the `dtypes` output usually means the column holds strings (strictly, it can hold any Python object)
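
A minimal sketch of why `object` is broader than "string": a column that mixes value types falls back to the `object` dtype, while a purely numeric column gets a numeric dtype. The values are made up for illustration, and depending on the pandas version, text-only columns may be reported with a dedicated string dtype instead.

```python
import pandas as pd

# A column mixing strings and numbers can only be stored as generic Python objects
mixed = pd.Series(["Braund", 22, 7.25])
numbers = pd.Series([22.0, 38.0, 26.0])

print(mixed.dtype)
print(numbers.dtype)
```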

# Imports

In [4]:
# Import Python Modules
# %autoreload 2 
import importlib
import pandas as pd # type: ignore
import os
import sys
# Import Custom Modules
from decorators import CsvToDF, Sep, VerifyData

# Check Module Versions
# check_imported_versions(globals())
# sep()

# Import Data
cwd       = os.getcwd()
csv_file  = 'manifest.csv'
csv_filepath = os.path.join(cwd, csv_file)
data = CsvToDF(csv_filepath) 

# Print to Verify Data
VerifyData(dataframe=data, df_name="DATA", rows_to_display=5)

Data imported successfully from '/Users/soigne/PROJECTS/PANTRY/Python/MODULES_PY/pandas_module/PD_Tutorial/manifest.csv'
Execution time: 0.0026679039001464844 seconds...
----------------------------------------------------------------------------------------------------------------
<<<<<<<<<<<<<<<<	DATA: Head		>>>>>>>>>>>>>>>>
    PassengerId  Survived  Pclass                                                 Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                              Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                                Heikkinen, Miss Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1         Futre

# TUTORIAL

### PART 1: Reading/Writing Tabular Data and Exploring DataFrame Info

##### Export Data to Excel File

In [None]:
# Export data to an Excel file, then read it back (requires an Excel engine such as openpyxl)
data.to_excel("manifest.xlsx", sheet_name="passengers", index=False)
titanic = pd.read_excel("manifest.xlsx", sheet_name="passengers")
# print(titanic.head(10).to_string())
# print(titanic.tail(10).to_string())

##### Exploring DataFrame Info

In [41]:
print("Shape: ", data.shape)
Sep()
data.info()  # info() prints the summary itself and returns None, so no print() is needed
'''
Output Explanation:
- It is indeed a DataFrame.
- There are 891 entries, i.e. 891 rows.
- Each row has a row label (aka the index) with values ranging from 0 to 890.
- The table has 12 columns. Most columns have a value for each of the rows (all 891 values are non-null). Some columns do have missing values and fewer than 891 non-null values.
- The columns Name, Sex, Ticket, Cabin and Embarked consist of textual data (strings, aka object). The other columns are numerical data, some of them whole numbers (aka integer) and others real numbers (aka float).
- The kind of data (characters, integers, …) in the different columns is summarized by listing the dtypes.
- The approximate amount of RAM used to hold the DataFrame is provided as well.
'''
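
The non-null counts that `info()` reports can also be computed directly, which is handy when only the missing-value totals are needed. This is a sketch on a hypothetical three-row sample, not the full manifest.

```python
import pandas as pd

# Hypothetical three-row sample; None marks a missing value
sample = pd.DataFrame({
    "Name": ["Braund", "Cumings", "Heikkinen"],
    "Age": [22.0, None, 26.0],
    "Cabin": [None, "C85", None],
})

non_null = sample.notna().sum()  # same counts info() lists as "Non-Null Count"
missing = sample.isna().sum()    # complement: missing values per column

print(non_null)
print(missing)
```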

##### DataFrame Attributes

In [51]:
# index: displays the labels of the rows
print("INDEX:\n", data.index)
# columns: displays the labels of the columns
print("COLUMNS:\n", data.columns)
# axes: returns a list of the row and column labels (i.e., [index, columns])
print("AXES:\n", data.axes)
# dtypes: displays the data type of each column
print("DTYPES:\n", data.dtypes)
# size: displays the total number of elements (rows x columns)
print("SIZE:\n", data.size)
# ndim: displays the number of dimensions
print("NDIM:\n", data.ndim)
# empty: displays 'True' if the dataframe is empty, otherwise 'False'
print("EMPTY:\n", data.empty)
# T: transposes the dataframe, swapping rows with columns
print("TRANSPOSE:\n", data.T)
# values: returns the dataframe data as a NumPy array
print("VALUES:\n", data.values)
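
How these attributes relate to each other is easier to see on a tiny frame. The sketch below uses a hypothetical 3x2 frame so that each value can be checked by hand.

```python
import pandas as pd

# Hypothetical 3x2 frame used only to make the attribute values easy to check
small = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Pclass": [3, 1, 3]})

rows, cols = small.shape
print(small.size == rows * cols)  # size counts every cell
print(small.ndim)                 # a DataFrame has 2 dimensions
print(small.T.shape)              # transposing swaps rows and columns
print(small.to_numpy().shape)     # to_numpy() is the alternative the pandas docs recommend over .values
```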

### PART 2: Selecting Subsets of a DF

In [None]:
ages = data["Age"]
print(ages.head())
# Check type: a single column is a one-dimensional Series
print(type(ages))
# Check shape: a Series has one dimension, so the shape is (rows,)
print(ages.shape)

# Selecting multiple columns creates a DataFrame
new_df = data[["Age", "Sex"]]
print(new_df.head())
print(type(new_df))
print(new_df.shape)
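
The bracket style alone decides the return type, even for a single column. This is a sketch with a made-up two-row frame.

```python
import pandas as pd

people = pd.DataFrame({"Age": [22.0, 38.0], "Sex": ["male", "female"]})

as_series = people["Age"]   # single label -> Series
as_frame = people[["Age"]]  # list with one label -> one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```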

### PART 3: Filter Specific Rows from a DF

##### Get passengers older than 35

In [None]:
above_35 = data[data["Age"] > 35]
below_35 = data[data["Age"] < 35]
is_35    = data[data["Age"] == 35]
age_na   = data[pd.isna(data["Age"])]
print(data.shape)
Sep()

print("Aged >35: ", above_35.shape)
print("Aged <35: ", below_35.shape)
print("Aged =35: ", is_35.shape)
print("Aged N/A: ", age_na.shape)
print("Sum: ", sum([above_35.shape[0], below_35.shape[0], is_35.shape[0], age_na.shape[0]]))
Sep()

# The condition itself evaluates to a boolean Series (True where Age > 35)
print(data["Age"] > 35)
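
The row counts only add up once the missing ages are included, because a comparison against NaN is always False. The sketch below shows this on four made-up ages.

```python
import pandas as pd

ages = pd.Series([22.0, 35.0, 54.0, float("nan")], name="Age")

above = ages > 35
below = ages < 35
equal = ages == 35
missing = ages.isna()

# The NaN row is False in every comparison, so only isna() catches it
print(above.tolist())
print(missing.tolist())

# Each row falls into exactly one of the four groups
total = above.sum() + below.sum() + equal.sum() + missing.sum()
print(total == len(ages))
```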

##### Get passengers from cabin classes 2 and 3

In [None]:
# Original:
# class_2_3 = data[(data["Pclass"] == 2) | (data["Pclass"] == 3)]
# Simplified:
class_2_3 = data[data["Pclass"].isin([2, 3])]

# Verify Data:
VerifyData(class_2_3, "class_2_3", 10)
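
The two forms select the same rows, which can be checked on a small made-up column:

```python
import pandas as pd

classes = pd.DataFrame({"Pclass": [1, 2, 3, 3, 1, 2]})

with_or = classes[(classes["Pclass"] == 2) | (classes["Pclass"] == 3)]
with_isin = classes[classes["Pclass"].isin([2, 3])]

print(with_or.equals(with_isin))
```

In the `|` form, each comparison needs its own parentheses because `|` binds more tightly than `==`.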

##### Work with passenger data for which the age is known

In [None]:
age_no_na = data[data["Age"].notna()]
print(age_no_na.head().to_string())
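
An equivalent way to drop rows with an unknown age is `dropna` restricted to that column. This is a sketch on a hypothetical three-row frame.

```python
import pandas as pd

ages = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [22.0, float("nan"), 26.0]})

via_mask = ages[ages["Age"].notna()]
via_dropna = ages.dropna(subset=["Age"])  # only the Age column decides which rows go

print(via_mask.equals(via_dropna))
```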

### Review Questions

In [5]:
# Test Dataset for Review Questions
# Named review_data so the Titanic DataFrame in `data` is not overwritten
review_data = {
    "ID": [1, 2, 3, 4, 5],
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 40, 45],
    "City": ["New York", "San Francisco", "Chicago", "Los Angeles", "New York"],
    "Salary": [70000, 85000, 65000, 90000, 120000],
    "Department": ["IT", "HR", "IT", "Marketing", "IT"]
}

df = pd.DataFrame(review_data)

##### 1: Filter and summarize data

In [8]:
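# Filter the data to show only employees from the IT department.
# Then, calculate the average salary for the IT department.
# Note: DataFrame.filter() selects by row/column labels, not by cell values,
# so filtering on values needs a boolean mask.
it_emps = df[df["Department"] == "IT"]
print(it_emps)
print("Average IT salary:", it_emps["Salary"].mean())

   ID     Name  Age      City  Salary Department
0   1    Alice   25  New York   70000         IT
2   3  Charlie   35   Chicago   65000         IT
4   5      Eva   45  New York  120000         IT
Average IT salary: 85000.0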
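
The summary step can also be computed for every department at once with `groupby`. The sketch below is one possible approach and rebuilds only the two columns it needs from the test dataset; the name `staff` is introduced here for illustration.

```python
import pandas as pd

staff = pd.DataFrame({
    "Department": ["IT", "HR", "IT", "Marketing", "IT"],
    "Salary": [70000, 85000, 65000, 90000, 120000],
})

# Mean salary per department, indexed by department name
avg_by_dept = staff.groupby("Department")["Salary"].mean()
print(avg_by_dept)
```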
