# Pandas

In this lesson we will learn the basics of data manipulation using the Pandas library. 

# Set up

In [None]:
import numpy as np
import pandas as pd

In [None]:
# Set seed for reproducibility
np.random.seed(seed=1234)

# Load data

We're going to work with the [Titanic dataset](https://www.kaggle.com/c/titanic/data), which has data on the people who boarded the RMS Titanic in 1912 and whether they survived the voyage. It's a common, feature-rich dataset, which makes it a good fit for exploratory data analysis with Pandas.

Let's load the data from the CSV file into a Pandas DataFrame. Passing `header=0` tells `read_csv` that the first row (index 0) is a header row containing the names of the columns in our dataset.
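
To see what `header=0` changes without downloading anything, here is a minimal sketch that parses a tiny in-memory CSV. The CSV text and the variable names (`csv_text`, `with_header`, `no_header`) are made up for illustration and are not part of the Titanic data.

```python
import io

import pandas as pd

# Made-up two-row CSV, used only for illustration
csv_text = "pclass,name,survived\n1,Alice,1\n3,Bob,0\n"

# header=0: the first line supplies the column names
with_header = pd.read_csv(io.StringIO(csv_text), header=0)
print(with_header.columns.tolist())  # ['pclass', 'name', 'survived']
print(len(with_header))  # 2

# header=None: every line is treated as data and columns are numbered
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(no_header.columns.tolist())  # [0, 1, 2]
print(len(no_header))  # 3
```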

In [None]:
# Read from CSV to Pandas DataFrame
url = "https://raw.githubusercontent.com/GokuMohandas/MadeWithML/main/datasets/titanic.csv"
df = pd.read_csv(url, header=0)

In [None]:
# First five rows
df.head()

These are the different features:
* `pclass`: class of travel
* `name`: full name of the passenger
* `sex`: gender
* `age`: numerical age
* `sibsp`: number of siblings/spouses aboard
* `parch`: number of parents/children aboard
* `ticket`: ticket number
* `fare`: cost of the ticket
* `cabin`: location of room
* `embarked`: port that the passenger embarked at (C - Cherbourg, S - Southampton, Q - Queenstown)
* `survived`: survival metric (0 - died, 1 - survived)

# Exploratory data analysis (EDA)

Now that we've loaded our data, we're ready to start exploring it to find interesting information.

> Be sure to check out our entire lesson devoted to [EDA](https://madewithml.com/courses/mlops/exploratory-data-analysis/) in our [mlops](https://madewithml.com/#mlops) course.

In [None]:
import matplotlib.pyplot as plt

We can use `.describe()` to extract some standard details about our numerical features. 

In [None]:
# Describe features
df.describe()
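
By default, `.describe()` only summarizes numerical columns. As a small sketch on a made-up DataFrame (the `toy` data below is hypothetical), passing `include="all"` also summarizes text columns with rows such as `unique`, `top` and `freq`.

```python
import pandas as pd

# Made-up data, used only for illustration
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],
    "fare": [7.25, 71.28, 7.92],
    "sex": ["male", "female", "female"],
})

# Default: only the numerical columns are summarized
numeric_summary = toy.describe()
print(numeric_summary.columns.tolist())  # ['age', 'fare']

# include="all": text columns are summarized as well
full_summary = toy.describe(include="all")
print(full_summary.columns.tolist())  # ['age', 'fare', 'sex']
print(full_summary.loc["unique", "sex"])  # 2
```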

In [None]:
# Correlation matrix (numeric columns only; numeric_only needs pandas 1.5+)
corr = df.corr(numeric_only=True)
plt.matshow(corr)
plt.xticks(range(len(corr.columns)), corr.columns, rotation=45)
plt.yticks(range(len(corr.columns)), corr.columns, rotation=45)
plt.colorbar()
plt.show()

We can also use `.hist()` to view the histogram of values for a feature.

In [None]:
# Histograms
df['age'].hist()

In [None]:
# Unique values
df['embarked'].unique()
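
While `.unique()` lists the distinct values, `.value_counts()` also shows how often each one occurs. A minimal sketch on made-up values (the `toy` column below is hypothetical):

```python
import pandas as pd

# Made-up port codes, including one missing value
toy = pd.DataFrame({"embarked": ["S", "C", "S", "Q", "S", None]})

# Frequency of each value; missing values are excluded by default
counts = toy["embarked"].value_counts()
print(counts["S"])  # 3

# dropna=False adds a row for the missing values
counts_with_missing = toy["embarked"].value_counts(dropna=False)
print(len(counts_with_missing))  # 4
```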

# Filtering

In [None]:
# Selecting data by feature
df['name'].head()

In [None]:
# Filtering
df[df['sex']=='female'].head() # only the female data appear
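
Filters can also be combined. The sketch below, on a made-up DataFrame, uses `&` (and) and `|` (or); each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

# Made-up data, used only for illustration
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "age": [29.0, 40.0, 12.0, 8.0],
})

# Both conditions must hold
adult_women = toy[(toy["sex"] == "female") & (toy["age"] >= 18)]
print(adult_women.index.tolist())  # [0]

# Either condition may hold
women_or_children = toy[(toy["sex"] == "female") | (toy["age"] < 18)]
print(women_or_children.index.tolist())  # [0, 2, 3]
```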

# Sorting

In [None]:
# Sorting
df.sort_values('age', ascending=False).head()
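
`sort_values` also accepts a list of columns, with one sort direction per column. A small sketch on made-up data:

```python
import pandas as pd

# Made-up data, used only for illustration
toy = pd.DataFrame({
    "pclass": [3, 1, 3, 1],
    "age": [22.0, 38.0, 35.0, 54.0],
})

# Sort by class (ascending), then by age within each class (descending)
ordered = toy.sort_values(["pclass", "age"], ascending=[True, False])
print(ordered["age"].tolist())  # [54.0, 38.0, 35.0, 22.0]
```

Note that the original row labels travel with the rows, so `ordered.index` is no longer 0, 1, 2, 3.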

# Grouping

In [None]:
# Grouping (average only the numeric columns; text columns cannot be averaged)
survived_group = df.groupby('survived')
survived_group.mean(numeric_only=True)
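
A group can also be summarized with several statistics at once. This sketch uses `.agg()` with named aggregations on made-up data; the output column names (`mean_age`, `max_fare`, `passengers`) are hypothetical labels chosen for this example.

```python
import pandas as pd

# Made-up data, used only for illustration
toy = pd.DataFrame({
    "survived": [0, 0, 1, 1],
    "age": [40.0, 20.0, 30.0, 10.0],
    "fare": [10.0, 20.0, 50.0, 70.0],
})

# One output column per (input column, statistic) pair
summary = toy.groupby("survived").agg(
    mean_age=("age", "mean"),
    max_fare=("fare", "max"),
    passengers=("age", "size"),
)
print(summary.loc[0, "mean_age"])  # 30.0
print(summary.loc[1, "max_fare"])  # 70.0
```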

# Indexing

We can use `iloc` to get rows or columns at particular positions in the dataframe.

In [None]:
# Selecting row 0
df.iloc[0, :] 

In [None]:
# Selecting a specific value
df.iloc[0, 1]
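
`iloc` selects by integer position, while its counterpart `loc` selects by label. The difference is easiest to see when the row labels are not 0, 1, 2, as in this sketch on made-up data:

```python
import pandas as pd

# Made-up data with row labels 10, 20, 30 instead of 0, 1, 2
toy = pd.DataFrame(
    {"name": ["Alice", "Bob", "Cara"], "age": [29.0, 40.0, 12.0]},
    index=[10, 20, 30],
)

# Same cell, addressed by position and by label
print(toy.iloc[0, 1])  # 29.0
print(toy.loc[10, "age"])  # 29.0

# Slices: iloc excludes the end position, loc includes the end label
by_position = toy.iloc[0:2]
by_label = toy.loc[10:20]
print(by_position.index.tolist())  # [10, 20]
print(by_label.index.tolist())  # [10, 20]
```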

# Preprocessing

After exploring, we can clean and preprocess our dataset.

> Be sure to check out our entire lesson focused on [preprocessing](https://madewithml.com/courses/mlops/preprocessing/) in our [mlops](https://madewithml.com/#mlops) course.

In [None]:
# Rows with at least one NaN value
df[pd.isnull(df).any(axis=1)].head()

In [None]:
# Drop rows with NaN values
df = df.dropna() # removes rows with any NaN values
df = df.reset_index(drop=True) # resets row indexes after dropping rows (drop=True avoids adding an extra `index` column)
df.head()
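
After `dropna()`, the remaining rows keep their original labels, which leaves gaps in the index. By default `reset_index()` stores those old labels in a new column called `index`; passing `drop=True` discards them instead. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing age
toy = pd.DataFrame({"age": [22.0, np.nan, 26.0], "fare": [7.25, 71.28, 7.92]})

cleaned = toy.dropna()
print(cleaned.index.tolist())  # [0, 2]

# Default: old labels are kept as an extra column
kept = cleaned.reset_index()
print(kept.columns.tolist())  # ['index', 'age', 'fare']

# drop=True: old labels are discarded
renumbered = cleaned.reset_index(drop=True)
print(renumbered.columns.tolist())  # ['age', 'fare']
print(renumbered.index.tolist())  # [0, 1]
```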

In [None]:
# Dropping multiple columns
df = df.drop(['name', 'cabin', 'ticket'], axis=1) # we won't use text features for our initial basic models
df.head()

In [None]:
# Map feature values
df['sex'] = df['sex'].map({'female': 0, 'male': 1}).astype(int)
df['embarked'] = df['embarked'].dropna().map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
df.head()
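
One thing to watch with `.map()`: any value that is missing from the dictionary becomes NaN, and casting a column that contains NaN to `int` fails. A small sketch for checking this before casting, using a made-up Series with a hypothetical unexpected code `"X"`:

```python
import pandas as pd

# Made-up port codes; "X" is not in the mapping
ports = pd.Series(["S", "C", "Q", "X"])
mapping = {"S": 0, "C": 1, "Q": 2}

mapped = ports.map(mapping)
print(mapped.isna().sum())  # 1

# Inspect which original values were not covered by the mapping
unmapped = ports[mapped.isna()]
print(unmapped.tolist())  # ['X']
```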

# Feature engineering

We're now going to use feature engineering to create a column called `family_size`. We'll first define a function called `get_family_size` that computes the family size from the number of siblings/spouses (`sibsp`) and parents/children (`parch`).

In [None]:
# Function to compute family size from siblings/spouses and parents/children
def get_family_size(sibsp, parch):
    family_size = sibsp + parch
    return family_size

Once we define the function, we can use `apply` with a `lambda` to call it on each row (`axis=1`), passing that row's `sibsp` and `parch` values to compute the family size.

In [None]:
df["family_size"] = df[["sibsp", "parch"]].apply(lambda x: get_family_size(x["sibsp"], x["parch"]), axis=1)
df.head()
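
For a simple sum like this, adding the two columns directly gives the same values as the row-wise `apply` and is usually faster on large DataFrames, since it avoids calling a Python function per row. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up data, used only for illustration
toy = pd.DataFrame({"sibsp": [1, 0, 3], "parch": [0, 0, 2]})

# Row-wise: the lambda runs once per row
by_apply = toy.apply(lambda row: row["sibsp"] + row["parch"], axis=1)

# Column-wise: a single vectorized addition
by_columns = toy["sibsp"] + toy["parch"]

print(by_apply.tolist())  # [1, 0, 5]
print(by_columns.tolist())  # [1, 0, 5]
```

The `apply` approach remains useful when the per-row logic cannot be expressed as column arithmetic.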

In [None]:
# Reorder columns (any column not listed here is dropped)
df = df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'family_size', 'fare', 'embarked', 'survived']]
df.head()

# Save data

Finally, let's save our preprocessed data into a new CSV file to use later.

In [None]:
# Saving dataframe to CSV
df.to_csv('processed_titanic.csv', index=False)
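
The `index=False` argument keeps the row index out of the file. As a sketch of why that matters, the example below writes a made-up DataFrame to a temporary directory both ways and reads it back; the file name `toy.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up data, used only for illustration
toy = pd.DataFrame({"pclass": [1, 3], "fare": [71.5, 7.25]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")

    # index=False: only the data columns are written
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

    # index=True: the row index is written as an unnamed first column
    toy.to_csv(path, index=True)
    reloaded_with_index = pd.read_csv(path)

print(reloaded.columns.tolist())  # ['pclass', 'fare']
print(reloaded_with_index.columns.tolist())  # ['Unnamed: 0', 'pclass', 'fare']
```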

In [None]:
# See the saved file
!ls -l