<a href="https://colab.research.google.com/github/faithkane3/workshops/blob/main/pandas_crash_course.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to my Pandas Crash Course!

This tutorial is a basic intro to:
- the Python data science library, pandas.
- the pandas DataFrame object.
- DataFrame attributes and methods we need to manipulate our data throughout the DS pipeline.

## Getting Started
1. Create your own copy of this notebook to work in: click the 'File' tab in the upper-left part of the menu bar, then click 'Save a copy in Drive'. This gives you a copy of my Google Colab notebook that you can edit and save.
2. As you complete exercises, save your work by clicking 'File' and then 'Save' in the menu bar, or by pressing `cmd+S` (`ctrl+S` on Windows and Linux).

### Orientation:
- This notebook is composed of cells. Each cell will contain either text or Python code.
- To run the Python code cells, click the "play" button next to the cell or click your cursor inside the cell and do "Shift + Enter" on your keyboard. 
- Run the code cells in order from top to bottom, because later cells depend on the imports and variables defined in earlier ones.

### Troubleshooting
- If the notebook does not appear to be working correctly, restart the environment: open the **Runtime** menu and select **Restart Runtime**.
- If the notebook is running correctly, but you need a fresh copy of this original notebook, go [here](https://colab.research.google.com/drive/1Io39BlBOYHn1y22_zRfniXhdXE9s-huU?usp=sharing) and repeat the steps above in 'Getting Started'.
- Save frequently (`cmd+S`), so you keep all of your exercise solutions!

___

## <font color=red>What Is a Pandas DataFrame?</font>

The pandas DataFrame object is a two-dimensional labeled data structure whose columns can hold the same or different data types. Each column is a pandas Series object: a one-dimensional, labeled array made up of an index (autogenerated unless you supply one) and data. A DataFrame is like a collection of Series objects aligned on the same index. **There are three main components that make up a pandas DataFrame: the index, the columns, and the data.** 

![dataframe diagram](https://www.w3resource.com/w3r_images/pandas-data-structure.svg)
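
Before loading real data, here is a minimal sketch of the idea that a DataFrame is a set of Series aligned on one index. The names `names`, `ages`, and `toy_df`, along with the values in them, are made up for illustration and are not part of the Titanic data.

```python
import pandas as pd

# Hypothetical toy data: two Series that share the same index labels.
names = pd.Series(["Ann", "Bob", "Cy"], index=[10, 20, 30])
ages = pd.Series([31, 45, 27], index=[10, 20, 30])

# Each Series becomes one column; rows are aligned on the shared index.
toy_df = pd.DataFrame({"name": names, "age": ages})

print(toy_df.index.tolist())    # row labels
print(toy_df.columns.tolist())  # column labels
print(type(toy_df["age"]))      # selecting one column gives back a Series
```

Selecting a single column with `toy_df["age"]` hands back a Series that still carries the row labels 10, 20, and 30.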

<font color=purple>Let's create a DataFrame, so we can examine these components below.</font>

___

# Imports

In [21]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 8))
plt.rc('font', size=12)

# turn off warnings
import warnings
warnings.filterwarnings("ignore")

### Create your DataFrame using a csv file

```python
df = pd.read_csv(url)
```

In [6]:
url = 'https://raw.githubusercontent.com/faithkane3/faithkane3.github.io/master/titanic_df.csv'
titanic_df = pd.read_csv(url, index_col=0)
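
The `index_col=0` argument tells `read_csv` to use the first column of the file as the row index. The sketch below shows the difference on a tiny in-memory CSV. The text in `csv_text` is a hypothetical stand-in, and it assumes the file was saved with its row labels in an unlabeled first column.

```python
import io
import pandas as pd

# Hypothetical CSV text whose first column holds saved row labels.
csv_text = ",name,age\n0,Ann,31\n1,Bob,45\n"

# Without index_col, the unlabeled first column becomes a data column.
plain = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())
print(indexed.columns.tolist())
```

Without `index_col=0`, pandas names the unlabeled column `Unnamed: 0` and keeps it as ordinary data, which is usually not what I want.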

In [8]:
# I can use the python `type()` function to validate that my `titanic_df` is a DataFrame.

type(titanic_df)

pandas.core.frame.DataFrame

## The Components of a Pandas DataFrame - Index, Columns, Data

Now that I have a pandas DataFrame to work with, I can look at the components of the DataFrame object using the `.index` , the `.columns`, and the `.values` attributes. 


In [11]:
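# Because I passed `index_col=0`, my index comes from the first CSV column, so it is an integer index (`Int64Index` in this output) rather than a default `RangeIndex`.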

titanic_df.index

Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            881, 882, 883, 884, 885, 886, 887, 888, 889, 890],
           dtype='int64', length=891)

In [12]:
# My column labels are also an index object.

titanic_df.columns

Index(['passenger_id', 'survived', 'pclass', 'sex', 'age', 'sibsp', 'parch',
       'fare', 'embarked', 'class', 'deck', 'embark_town', 'alone'],
      dtype='object')

In [13]:
# My values are a two-dimensional NumPy array; pandas is built on NumPy.

titanic_df.values

array([[0, 0, 3, ..., nan, 'Southampton', 0],
       [1, 1, 1, ..., 'C', 'Cherbourg', 0],
       [2, 1, 3, ..., nan, 'Southampton', 1],
       ...,
       [888, 0, 3, ..., nan, 'Southampton', 0],
       [889, 1, 1, ..., 'C', 'Cherbourg', 1],
       [890, 0, 3, ..., nan, 'Queenstown', 1]], dtype=object)
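
The `dtype=object` in the output above appears because the columns hold a mix of numbers and strings, so NumPy falls back to a generic object array. Here is a small sketch of that behavior on a made-up two-column frame (`toy_df` and its values are hypothetical). It uses `.to_numpy()`, which the pandas documentation recommends over `.values`.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one string column.
toy_df = pd.DataFrame({"fare": [7.25, 71.28], "sex": ["male", "female"]})

mixed = toy_df.to_numpy()                # mixed column types give an object array
numeric = toy_df[["fare"]].to_numpy()    # a numeric-only selection stays float64

print(mixed.dtype)
print(numeric.dtype)
```

Selecting only the numeric columns first keeps a numeric dtype, which matters when I hand the array to numerical code.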

#### `.head()`, `.tail()`, and `.sample()`

The `.head(n)` method returns the first n rows of a DataFrame or Series; n = 5 by default. The result is a new object that keeps the original index labels.

The `.tail(n)` method returns the last n rows; n = 5 by default. Increase or decrease n to return more or fewer than 5 rows.

The `.sample(n)` method returns a random sample of n rows; n = 1 by default. Again, the original index labels are retained.

In [14]:
titanic_df.head()

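| | passenger_id | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | deck | embark_town | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | NaN | Southampton | 0 |
| 1 | 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | C | Cherbourg | 0 |
| 2 | 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | NaN | Southampton | 1 |
| 3 | 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | C | Southampton | 0 |
| 4 | 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | NaN | Southampton | 1 |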


In [15]:
titanic_df.tail()

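| | passenger_id | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | deck | embark_town | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 886 | 0 | 2 | male | 27.0 | 0 | 0 | 13.0 | S | Second | NaN | Southampton | 1 |
| 887 | 887 | 1 | 1 | female | 19.0 | 0 | 0 | 30.0 | S | First | B | Southampton | 1 |
| 888 | 888 | 0 | 3 | female | NaN | 1 | 2 | 23.45 | S | Third | NaN | Southampton | 0 |
| 889 | 889 | 1 | 1 | male | 26.0 | 0 | 0 | 30.0 | C | First | C | Cherbourg | 1 |
| 890 | 890 | 0 | 3 | male | 32.0 | 0 | 0 | 7.75 | Q | Third | NaN | Queenstown | 1 |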


In [16]:
titanic_df.sample()

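| | passenger_id | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | deck | embark_town | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 617 | 0 | 3 | female | 26.0 | 1 | 0 | 16.1 | S | Third | NaN | Southampton | 0 |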
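
Because `.sample()` is random, rerunning the cell above will usually show a different row. The sketch below shows all three methods on a made-up Series (`s` is hypothetical) and uses the `random_state` parameter so the sample is reproducible.

```python
import pandas as pd

# Hypothetical toy Series with string labels as its index.
s = pd.Series(range(10), index=list("abcdefghij"))

first_three = s.head(3)  # first 3 rows, original labels kept
last_two = s.tail(2)     # last 2 rows

# The same random_state gives the same sample on every call.
sample_a = s.sample(n=3, random_state=42)
sample_b = s.sample(n=3, random_state=42)

print(first_three.index.tolist())
print(last_two.index.tolist())
print(sample_a.equals(sample_b))
```

Fixing `random_state` is handy in a tutorial or a report, where I want readers to see the same rows I saw.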




---



#### `.info()`

The `.info()` method allows me to check out the data types and counts of non-null values of each column in my DataFrame all at once.



In [18]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   passenger_id  891 non-null    int64  
 1   survived      891 non-null    int64  
 2   pclass        891 non-null    int64  
 3   sex           891 non-null    object 
 4   age           714 non-null    float64
 5   sibsp         891 non-null    int64  
 6   parch         891 non-null    int64  
 7   fare          891 non-null    float64
 8   embarked      889 non-null    object 
 9   class         891 non-null    object 
 10  deck          203 non-null    object 
 11  embark_town   889 non-null    object 
 12  alone         891 non-null    int64  
dtypes: float64(2), int64(6), object(5)
memory usage: 97.5+ KB
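
The non-null counts above also tell me how much data is missing: `age` has 714 of 891 values, so 177 are null. As a minimal sketch, the same numbers can be computed directly with `.isna().sum()` and `.count()`. The frame `toy_df` below is a hypothetical stand-in, not the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing values.
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0],
    "deck": [None, "C", None],
})

missing_per_column = toy_df.isna().sum()  # number of nulls in each column
non_null_per_column = toy_df.count()      # same idea as Non-Null Count in .info()

print(missing_per_column)
print(non_null_per_column)
```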




---



#### `.value_counts()`

The `.value_counts()` method returns how often each unique value occurs in a column or Series. This is an extremely useful method you will find yourself using often with Series containing object and category data types.

In [20]:
# Survived is a binary column with 1 representing survivors; there are no null values.

titanic_df.survived.value_counts(dropna=False)

0    549
1    342
Name: survived, dtype: int64
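
Two parameters are worth knowing here: `dropna=False` gives missing values their own row, and `normalize=True` returns proportions instead of counts. The sketch below uses a made-up Series (`embarked` and its values are hypothetical) so the effect of each is easy to see.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S"])

counts = embarked.value_counts()                       # NaN is excluded by default
counts_with_nan = embarked.value_counts(dropna=False)  # NaN gets its own row
proportions = embarked.value_counts(normalize=True)    # shares of the non-null values

print(counts_with_nan)
print(proportions)
```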

___

#### `.astype()`

The `.astype()` method allows me to convert a Series from one data type to another. Like most methods, it returns a new transformed Series by default instead of mutating my original data.
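
Here is a minimal sketch of `.astype()` on a made-up column of 0/1 integers shaped like `survived`. The Series and variable names are hypothetical. It converts to `bool` and to `str` and shows that the original Series is left unchanged.

```python
import pandas as pd

# Hypothetical toy column of 0/1 integers, similar to `survived`.
survived = pd.Series([0, 1, 1, 0])

as_bool = survived.astype(bool)  # new Series; `survived` itself is unchanged
as_text = survived.astype(str)   # integers become their string form

print(survived.dtype)
print(as_bool.tolist())
print(as_text.tolist())
```

To keep a conversion, I assign the result back, for example `survived = survived.astype(bool)`.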

In [None]:
# 