## AcademyXi Data Analysis - Data Manipulation
### Workshop B - Data manipulation in practice
In this workshop module, we will go through a number of ways in which you can use Python for data manipulation. 

Think of this as the beginning of a rich, rewarding journey. We appreciate that some of the material below may seem difficult (or easy) depending on your experience level with Python and Pandas. 

If you're not sure about how some part of the code below is working, try reviewing the documentation for the method or function (e.g. [here](https://pandas.pydata.org/docs/index.html)). Alternatively, think about what the code has done to the underlying data, and go back to the code to see if you can understand the steps it's taken to do so.

Good luck!

### Preparation

This section prepares the notebook by installing the required packages and loading the data.

In [None]:
%%capture tmp
# Note: the %%capture cell magic must be the first line of the cell
# Install additional libraries required (fsspec and s3fs) to load files through AWS S3
!pip install fsspec s3fs

# Import libraries to be used
import plotly.express as px
import pandas as pd
import numpy as np

In [None]:
# Load data from S3
df = pd.read_csv("s3://databyjp/academyxi/Datafiniti_Womens_Shoes_sm.csv")

In [None]:
# Check that the file has been properly loaded
df.head()

In [None]:
# Show summary information about the DataFrame, as well as individual columns
df.info()

## Sort / filter data

Sorting and filtering data in a Pandas DataFrame is easy and powerful. Take a look at some common ways to do it below.

### Sort data with Python and Pandas

In [None]:
# We will be using just a few columns, so let's make a copy of the DataFrame with only those
# (.copy() makes this an independent DataFrame rather than a selection from df)
sdf = df[["id", "prices.merchant", "prices.amountMax"]].copy()

In [None]:
# The .sort_values method is one you will use very often. It can take a single column name like so:
sdf.sort_values("prices.amountMax")

In [None]:
# Or pass a list of column names, which will sort the data by each column in the order specified
sdf.sort_values(["prices.merchant", "prices.amountMax"])

In [None]:
# By default, .sort_values method sorts the data in ascending order. 
# To sort in descending order, add the ascending=False argument.
sdf.sort_values(["prices.merchant", "prices.amountMax"], ascending=False)

In [None]:
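# To sort one column in ascending order and another in descending order,
# pass a list to the ascending argument, with one value per column.
# (Chaining two .sort_values calls is not reliable with the default settings,
# because the default sort algorithm does not guarantee that ties keep their previous order.)
sdf.sort_values(["prices.merchant", "prices.amountMax"], ascending=[True, False])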
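
Chaining sorts only gives a dependable result when the second sort is stable, meaning tied rows keep the order left by the first sort. The sketch below uses made-up data (the `toy` DataFrame and its values are invented for illustration) to compare a single call using a list for `ascending` with a chained version that asks for a stable algorithm via `kind="mergesort"`.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "merchant": ["B", "A", "B", "A", "A"],
    "price": [10.0, 30.0, 20.0, 5.0, 30.5],
})

# One call: merchant ascending, price descending within each merchant
one_call = toy.sort_values(["merchant", "price"], ascending=[True, False])

# Chained calls: sort by the secondary column first, then by the primary column
# using a stable algorithm so that ties keep the order from the first sort
chained = (
    toy.sort_values("price", ascending=False)
       .sort_values("merchant", kind="mergesort")
)

print(one_call)
print(one_call.equals(chained))
```

The single call is usually easier to read, since the sort direction sits next to the column it applies to.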

### Filter data with Python and Pandas
Three useful ways to access particular rows of Pandas DataFrames are by the:

- row number; 
- `index` value; or
- column values.

Let's take a look at each below.

#### Filter by rows
The `.iloc` indexer can be used to slice the data by position, in the way in which it is currently arranged.

In [None]:
# Get the first 10 rows
sdf.iloc[:10]

In [None]:
# Get the row at position 15 (the 16th row, as positions start at 0)
sdf.iloc[15]  # Note that a single row is returned as a Series rather than a DataFrame

#### Filter by index
Each row of a DataFrame includes an `index` value, which acts as a name for each row. 

This might simply be a meaningless number, but it can be more - it might for example be a date, a user ID or any other label, allowing for convenient selection of subsets. 

In [None]:
# Get rows from the start up to and including index label 10
sdf.loc[:10,:]

While the example above might look the same as before, selecting by index label lets us do things like:

In [None]:
# Select rows where the merchant is Walmart.com and the index label is 10 or smaller
sdf[sdf["prices.merchant"]=="Walmart.com"].loc[:10]
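
The difference between position and index label becomes visible once the rows are rearranged. As a minimal sketch on made-up prices (the `toy` DataFrame is invented for illustration), sorting changes each row's position but not its label:

```python
import pandas as pd

# Made-up prices; the default index labels are 0 to 4
toy = pd.DataFrame({"price": [30, 10, 50, 20, 40]})

# Sorting rearranges the rows, but each row keeps its original index label
by_price = toy.sort_values("price", ascending=False)
print(by_price.index.tolist())  # the labels in their new order

# .iloc counts positions in the current arrangement: the first two rows
top_two = by_price.iloc[:2]

# .loc slices by label, inclusive: from the first row through the row labelled 0
through_label_0 = by_price.loc[:0]

print(top_two)
print(through_label_0)
```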

#### Filter by column data

You may have noticed the code `sdf[sdf["prices.merchant"]=="Walmart.com"]` above when showing how to filter data by the index. 

This code uses a Boolean array produced by `sdf["prices.merchant"]=="Walmart.com"`, in which each row is marked as `True` or `False`, based on whether the row's "prices.merchant" column value is "Walmart.com".

This is an extremely powerful method of data filtering, as any number of logical (and/or) operations can be combined using these Boolean arrays as you will see below.

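Pay attention to how the query is constructed using brackets, combining logical operations. The brackets around each comparison are required because `&` and `|` bind more tightly than comparison operators such as `==` and `>`. If you are not sure, I find it helpful to articulate what each clause within one set of brackets is doing, and then to consider how the logical operators (& = AND, | = OR) combine them.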
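
To see what these Boolean arrays look like on their own, here is a small sketch on made-up data (the `toy` DataFrame, its merchants and its prices are invented for illustration). Each mask is built separately and then combined:

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "merchant": ["Walmart.com", "Amazon.com", "Walmart.com", None],
    "price": [75.0, 55.0, 8.0, 130.0],
})

# Each comparison produces a Series of True/False values, one per row
is_walmart = toy["merchant"] == "Walmart.com"
is_pricey = toy["price"] > 50
print(is_walmart)

# Masks can be combined before being used to select rows
both = toy[is_walmart & is_pricey]     # AND: both conditions hold
either = toy[is_walmart | is_pricey]   # OR: at least one condition holds
not_walmart = toy[~is_walmart]         # NOT: the condition does not hold

print(both)
```

Naming the masks first is optional, but it can make a long filter easier to read than one deeply bracketed line.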

In [None]:
# Get the portion of the dataframe where "prices.merchant" has "Walmart.com" values
sdf[sdf["prices.merchant"]=="Walmart.com"]

In [None]:
# Get the portion of the dataframe where "prices.merchant" has "Walmart.com" values, 
# and the prices.amountMax is above 50
sdf[(sdf["prices.merchant"]=="Walmart.com") & (sdf["prices.amountMax"] > 50)]

In [None]:
# Get the portion of the dataframe where "prices.merchant" has "Walmart.com" values
# and the prices.amountMax is above 60 or less than 10
sdf[(sdf["prices.merchant"]=="Walmart.com") & ((sdf["prices.amountMax"] > 60) | (sdf["prices.amountMax"] < 10))]

In [None]:
# Get the portion of the dataframe where "prices.merchant" is missing
# and the prices.amountMax is above 120
sdf[(sdf["prices.merchant"].isna()) & (sdf["prices.amountMax"] > 120)]

In [None]:
# Get the portion of the dataframe where "prices.merchant" contains string "com"
# and the prices.amountMax is above 60 (na=False treats missing merchants as non-matches)
sdf[(sdf["prices.merchant"].str.contains("com", na=False)) & (sdf["prices.amountMax"] > 60)]

As you can see, Pandas provides flexible and powerful data filtering tools. This just scratches the surface of the large array of ways in which you can filter data in Pandas. 

To learn more, check out [this tutorial](https://pandas.pydata.org/pandas-docs/dev/getting_started/intro_tutorials/03_subset_data.html) from Pandas, and other methods such as `.query` ([reference](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html)), `.filter` ([reference](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.filter.html)) and how to test for patterns in strings ([reference](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#testing-for-strings-that-match-or-contain-a-pattern)).
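
As a rough illustration of two of these methods, the sketch below uses a made-up DataFrame with simplified column names (`merchant` and `price` are assumptions chosen for this example, not the dataset's actual column names). `.query` expresses a row filter as a string, while `.filter` selects by label rather than by value.

```python
import pandas as pd

# Made-up data with simplified column names, for illustration only
toy = pd.DataFrame({
    "merchant": ["Walmart.com", "Amazon.com", "Walmart.com"],
    "price": [75.0, 55.0, 8.0],
})

# Row filter written with Boolean masks
masked = toy[(toy["merchant"] == "Walmart.com") & (toy["price"] > 50)]

# The same row filter written as a query string
queried = toy.query("merchant == 'Walmart.com' and price > 50")

# .filter picks columns (or rows) by their labels, here any column containing "pri"
price_columns = toy.filter(like="pri")

print(queried)
print(price_columns.columns.tolist())
```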

## Data type conversions

Now, let's take a look at how to convert data types within Pandas.

### Data types - simple conversion

To convert one data type to another in a DataFrame, the `.astype` method can be used ([read more](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html)). Take a look below:

In [None]:
# Floating point to String
df["prices.amountMax"].astype(str)

In [None]:
# Floating point to Integer (drops the decimal part, i.e. truncates towards zero)
df["prices.amountMax"].astype(int)

In [None]:
# Boolean to Binary / Integer
df["prices.isSale"].astype(int)
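
Two details of float-to-integer conversion are worth seeing on a small example. The sketch below uses made-up numbers (not values from the dataset) to show that `.astype(int)` truncates rather than rounds, and that it fails when the column contains missing values. Filling the gaps with 0 is only one possible choice and may not suit your data.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
prices = pd.Series([1.9, -1.9, 2.2])

truncated = prices.astype(int)        # decimal part dropped: 1, -1, 2
rounded = prices.round().astype(int)  # rounded first: 2, -2, 2

# A missing value cannot be represented as a plain integer
with_gap = pd.Series([1.9, np.nan])
try:
    with_gap.astype(int)
except ValueError as err:
    print("Conversion failed:", err)

# One option (an assumption about what is appropriate): fill the gap first
filled = with_gap.fillna(0).astype(int)

print(truncated.tolist(), rounded.tolist(), filled.tolist())
```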

### Data types - converting dates

Without manual intervention, dates and/or times are usually loaded as strings. However, they are best handled in a native datetime format as it allows date/time specific operations.

Take a look at a few examples below:

What happens when we manipulate the data as-is, without converting it to a datetime object?

In [None]:
# Grab the year data - first four characters of the "dateAdded" column
df["dateAddedYr"] = df["dateAdded"].str[:4]
print(df["dateAddedYr"])

In [None]:
# What happens if we operate on the year column?
df["dateAddedYr"] * 2

In [None]:
# So let's convert the data to integers
df["dateAddedYr"] = df["dateAddedYr"].astype(int)
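
The difference between the two cells above can be reproduced on a tiny example. This sketch uses two made-up year strings, stored explicitly as Python objects (`dtype=object` is an assumption made here so the behaviour is the same across Pandas versions):

```python
import pandas as pd

# Made-up year strings, stored as plain Python strings
years_as_text = pd.Series(["2015", "2017"], dtype=object)

# Multiplying a string repeats it rather than doubling a number
doubled_text = years_as_text * 2

# After conversion to integers, the same operation is arithmetic
years = years_as_text.astype(int)
doubled = years * 2

print(doubled_text.tolist())
print(doubled.tolist())
```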

However, many operations are easier if the column is converted to a datetime type.

In [None]:
# "2015-05-04T12:13:08Z" follows the standard ISO 8601 datetime format, so the column can be converted to datetime objects directly.
df["dateAdded"] = pd.to_datetime(df["dateAdded"])
df["dateAdded"]

In [None]:
# Once the column has been converted to datetime objects, their properties can be accessed through the `.dt` accessor
print(df["dateAdded"].dt.year)  # Year
print(df["dateAdded"].dt.timetz)  # Time of day, including timezone information
print(df["dateAdded"].dt.dayofweek)  # Day of week (Monday=0, Sunday=6)

In [None]:
# This can now be used to easily filter our data
# Let's say we want to see all data added in 2017 or later, and on Saturday/Sunday (dayofweek 5 or 6).
df[(df["dateAdded"].dt.year >= 2017) & (df["dateAdded"].dt.dayofweek >= 5)]
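
The same steps can be tried on a handful of timestamps without loading the full dataset. In this sketch only the first timestamp comes from the example above; the other three are made up for illustration. Weekend rows are selected with `.isin([5, 6])`, which states the two days explicitly.

```python
import pandas as pd

toy = pd.DataFrame({
    "dateAdded": [
        "2015-05-04T12:13:08Z",  # Monday
        "2017-01-07T09:00:00Z",  # Saturday (made up)
        "2017-01-08T18:30:00Z",  # Sunday (made up)
        "2018-03-14T08:15:00Z",  # Wednesday (made up)
    ]
})

# Convert the strings to timezone-aware datetime values
toy["dateAdded"] = pd.to_datetime(toy["dateAdded"])
print(toy["dateAdded"].dt.tz)  # the timezone attached to the column

# Build one mask per condition, then combine them
is_recent = toy["dateAdded"].dt.year >= 2017
is_weekend = toy["dateAdded"].dt.dayofweek.isin([5, 6])  # Saturday=5, Sunday=6
weekend_recent = toy[is_recent & is_weekend]

print(weekend_recent)
```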

If you will be working with date/time data, we recommend reading [this tutorial](https://pandas.pydata.org/docs/getting_started/intro_tutorials/09_timeseries.html), and [this reference guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html).