<a id="toc"></a>
# Table of Contents
## [Python for Data Analysis (pandas)](#pandas)
## [Querying Dataframes](#querying)
### [Query Challenge](#query_challenge)
## [Reshaping Data](#reshaping)
## [Aggregating Data](#aggregates)
## [Merging Datasets](#merging)
## [Ordering Data](#ordering)
## [Exporting Data](#exporting)

<a id="pandas"></a>
## Pandas
[Back to Table of Contents](#toc)

Pandas is a _library_ which allows us to do some powerful operations with table-like data. We can query datasets with a high degree of granularity, merge them together, sort, and aggregate them.

I highly recommend reading through the documentation when you can! **[It is available here.](https://pandas.pydata.org/pandas-docs/stable/index.html)**

Accompanying this lesson are four Excel spreadsheets that we will combine to answer various queries.

- `COMPANIES.xlsx`
- `EMPLOYEES.xlsx`
- `FAKE_DATA_BUILD.xlsx` <- This is our main table
- `ISO_COUNTRY_LOOKUPS.xlsx`

To start with, let's read a spreadsheet and see the output.

Remember, if we want to use a _library_ we have to **import** it.

In [None]:
import pandas as pd # we can create a temporary name using the "as" keyword here which can make it shorter

pd.set_option('display.max_rows', 1000) # This option sets the maximum number of rows pandas will display
pd.set_option('display.max_columns', 500) # This option sets the maximum number of columns pandas will display

df_employees = pd.read_excel('EMPLOYEES.xlsx')
df_employees # This will display the table

Above, I've named the table "df_employees". Since this table holds the employee information, I've called it "employees", but why did I add the prefix "df_"?

In pandas, table objects are stored as what are called "dataframes" - I'm using "df" as shorthand for that to let me know what kind of _object_ I'm working with.

Let's read our other sheets into dataframe objects so we can begin using them.

In [None]:
df_main = pd.read_excel("FAKE_DATA_BUILD.xlsx", sheet_name="Main")
df_main_definitions = pd.read_excel("FAKE_DATA_BUILD.xlsx", sheet_name="VariableDefinitions")
df_companies = pd.read_excel("COMPANIES.xlsx")
df_countries = pd.read_excel("ISO_COUNTRY_LOOKUPS.xlsx")

In [None]:
# write each dataframe name here individually and see what it outputs when you run this



Next, we can learn a little bit about our data using a few small tools:

- `len(dataframe)` <- tells us how many rows are in the dataframe.
- `dataframe.info()` <- gives us basic information about the dataframe.
- `dataframe.columns.values` <- gives us the headers of the dataframe (useful later).

In [None]:
print('Number of rows in df_main: ' + str(len(df_main)))
print('Number of rows in df_employees: ' + str(len(df_employees)))
print('Number of rows in df_companies: ' + str(len(df_companies)))
print('Number of rows in df_countries: ' + str(len(df_countries)))

Let's take a look at df_main and see what we can learn about our data.

In [None]:
df_main.info()

In [None]:
df_main.columns.values

Although looking at the data is useful, we need to learn a little more about what each column contains. In this dataset, we have a "VariableDefinitions" sheet that we pulled into df_main_definitions. Let's take a look.

In [None]:
df_main_definitions

<a id="querying"></a>
## Querying Dataframes
[Back to Table of Contents](#toc)

Reading data is useful, but without the ability to ask a dataset questions, pandas doesn't give us much beyond what Excel already offers. This is where the strength of pandas starts to show.

The _syntax_ for querying is fairly simple, but it can become complex depending on the query itself.

For our purposes, I'll describe it this way:

The statement: `df_main[df_main["ORGN_CTRY_CODE"] == "BR"]` can be read as "within the dataframe df_main tell me where 'ORGN_CTRY_CODE' equals 'BR'." Or, more simply: all data for exports from Brazil.

We can also utilize multiple conditions like so:

`df_main[(df_main["ORGN_CTRY_CODE"] == "BR") & (df_main["DEST_CTRY_CODE"] == "US")]` - this can be read as "within the dataframe df_main tell me where 'ORGN_CTRY_CODE' equals 'BR' **_and_** 'DEST_CTRY_CODE' equals 'US'." Or, more simply: all data for packages exported from Brazil to the US.
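To see what is happening inside those square brackets, here is a minimal sketch on a tiny made-up table. The column names mirror `df_main`, but `df_toy` and its values are invented purely for illustration. Each comparison produces a column of True/False values, and `&` ("and"), `|` ("or"), and `~` ("not") combine them.

```python
import pandas as pd

# Toy data: the column names mirror df_main, the values are made up.
df_toy = pd.DataFrame({
    "ORGN_CTRY_CODE": ["BR", "BR", "MX", "US"],
    "DEST_CTRY_CODE": ["US", "AR", "US", "BR"],
})

# Each comparison gives one True/False value per row.
from_brazil = df_toy["ORGN_CTRY_CODE"] == "BR"
to_us = df_toy["DEST_CTRY_CODE"] == "US"

# & means "and", | means "or", ~ means "not".
brazil_to_us = df_toy[from_brazil & to_us]
brazil_or_to_us = df_toy[from_brazil | to_us]
not_from_brazil = df_toy[~from_brazil]

print(brazil_to_us)
```

When the conditions are written inline rather than stored in variables, each one needs its own parentheses, because Python evaluates `&` before `==`.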


In [None]:
df_main[df_main["ORGN_CTRY_CODE"] == "BR"]

In [None]:
df_main[(df_main["ORGN_CTRY_CODE"] == "BR") & (df_main["DEST_CTRY_CODE"] == "US")]

<a id="query_challenge"></a>
### Query Challenge
[Back to Table of Contents](#toc)

Write the following queries:

- Get information about packages _from_ Mexico that were of transportation type "Truck"
- Get information about packages from Argentina whose invoices have _NOT_ been paid.
- Get information about packages from the US for 2017
    - Query dates like this: `(df_main["DATETIME"] >= '2017-01-01') & (df_main["DATETIME"] < '2018-01-01')` (writing dates as year-month-day avoids any confusion between day and month)
- Get information about packages for all countries whose "SOLUTION_TYPE" was "FF" and whose "INVOICE_REV" is greater than 2500
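As a hedged illustration of the date hint above, here is a small sketch on made-up data. It assumes the "DATETIME" column holds real datetime values (the toy table below builds them with `pd.to_datetime`); if your column was read in as plain text, the comparison would behave differently.

```python
import pandas as pd

# Toy data: the dates and revenues are invented for illustration only.
df_toy = pd.DataFrame({
    "DATETIME": pd.to_datetime(
        ["2016-12-31", "2017-01-01", "2017-06-15", "2018-01-01"]
    ),
    "INVOICE_REV": [100, 200, 300, 400],
})

# Keep rows on or after Jan 1 2017 and strictly before Jan 1 2018.
in_2017 = (df_toy["DATETIME"] >= "2017-01-01") & (df_toy["DATETIME"] < "2018-01-01")
df_2017 = df_toy[in_2017]

print(df_2017)
```

Using `>=` for the start and `<` for the end means the first day of 2017 is included and the first day of 2018 is not.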

<a id="reshaping"></a>
## Reshaping Data
[Back to Table of Contents](#toc)

Remember viewing the output of `dataframe.columns.values`? Let's use that to slim down the actual data we want for display.

We can get a subset of columns of a dataframe like this:

- `df_new` has headers `["col1", "col2", "col3", "col4"]`.
- We get a subset of these columns by creating a list of the columns we want and feeding it to the dataframe.
    - `df_new_subset = df_new[["col2", "col4"]]`
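Here is a minimal sketch of the steps above, using a made-up `df_new` with the four example headers. It also shows a detail that is easy to miss: a list of names (double brackets) gives back a dataframe, while a single name (single brackets) gives back one column, which pandas calls a Series.

```python
import pandas as pd

# Toy dataframe with the four example headers; the values are made up.
df_new = pd.DataFrame({
    "col1": [1, 2],
    "col2": [3, 4],
    "col3": [5, 6],
    "col4": [7, 8],
})

# A list of column names returns a smaller dataframe.
df_new_subset = df_new[["col2", "col4"]]

# A single column name returns a Series (one column on its own).
col2_only = df_new["col2"]

print(type(df_new_subset).__name__)
print(type(col2_only).__name__)
```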

Let's get the headers of `df_main` again and take a look to see which ones we want - we'll convert them to a list as well.

In [None]:
hdrs_main = list(df_main.columns.values)
hdrs_main

Let's say our end goal is to get the total revenue for each employee. What columns would we need to do that? We'd probably need the employee ID and the invoice revenue - two columns "EMPLOYEE_ID" and "INVOICE_REV".

In [None]:
df_employee_revenue = df_main[["EMPLOYEE_ID", "INVOICE_REV"]]
df_employee_revenue

<a id="aggregates"></a>
## Aggregating Data
[Back to Table of Contents](#toc)

It's great that we can get the columns that we need but there are two problems.
1. Each row is an individual transaction, so it doesn't show us the _total_ revenue for each employee.
2. We don't know who any of these employee IDs refer to (more on this later).

First, we need to _aggregate_ the data by summing the revenue for each employee.

The syntax is as follows:

`df_employee_revenue.groupby(["EMPLOYEE_ID"])["INVOICE_REV"].sum()`

We can also get the _average_ by using a different function:

`df_employee_revenue.groupby(["EMPLOYEE_ID"])["INVOICE_REV"].mean()`
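The sketch below runs both functions on a tiny made-up table so the results are easy to check by hand. The `as_index=False` argument is optional: without it, the employee ID becomes the row label of the result, and with it, the employee ID stays a regular column, which is handy for merging later.

```python
import pandas as pd

# Toy transactions: two for employee 1, three for employee 2 (made-up values).
df_toy = pd.DataFrame({
    "EMPLOYEE_ID": [1, 1, 2, 2, 2],
    "INVOICE_REV": [100.0, 300.0, 50.0, 150.0, 100.0],
})

# Default: the result is a Series labelled by EMPLOYEE_ID.
totals_series = df_toy.groupby(["EMPLOYEE_ID"])["INVOICE_REV"].sum()

# as_index=False: the result is a dataframe with EMPLOYEE_ID as a column.
totals_frame = df_toy.groupby(["EMPLOYEE_ID"], as_index=False)["INVOICE_REV"].sum()
averages = df_toy.groupby(["EMPLOYEE_ID"], as_index=False)["INVOICE_REV"].mean()

print(totals_frame)
print(averages)
```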

Let's try it out.

In [None]:
# as_index=False keeps EMPLOYEE_ID as a regular column instead of turning it into the index
df_tot_emp_rev = df_employee_revenue.groupby(["EMPLOYEE_ID"], as_index=False)["INVOICE_REV"].sum()
df_tot_emp_rev

<a id="merging"></a>
## Merging Datasets
[Back to Table of Contents](#toc)

Aggregating data is useful, but it doesn't solve the second problem we mentioned earlier: we don't know who these people are. Luckily, we have another spreadsheet which has all of the employee information we need to solve this problem.

The way that we handle situations like this is to use pandas' _merge_ functionality.

Here's the syntax:

`df1.merge(df2, left_on='lkey', right_on='rkey')`

- **df1** is the dataset you want to merge outside data onto.
- **df2** is the dataset you want to merge onto df1.
- **left_on** is the name of the column in df1 you want df2 to merge onto.
- **right_on** is the name of the column in df2 you want to connect to the column in df1.
- If you don't specify any keys, pandas merges on the columns that share the same name in both dataframes.
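To make the parameters above concrete, here is a small sketch with two made-up tables. The names `df_rev`, `df_people`, "EMP_ID", and "NAME" are hypothetical and are not taken from our spreadsheets. It also shows the optional `how` argument: by default, `merge` keeps only rows that have a match in both tables, while `how="left"` keeps every row of the first table.

```python
import pandas as pd

# Hypothetical tables: the key column has a different name in each one.
df_rev = pd.DataFrame({
    "EMPLOYEE_ID": [1, 2, 3],
    "INVOICE_REV": [400.0, 300.0, 250.0],
})
df_people = pd.DataFrame({
    "EMP_ID": [1, 2, 4],
    "NAME": ["Ana", "Ben", "Cy"],
})

# Default (inner) merge: only employees 1 and 2 appear in both tables.
inner = df_rev.merge(df_people, left_on="EMPLOYEE_ID", right_on="EMP_ID")

# Left merge: every row of df_rev is kept; employee 3 has no NAME.
left = df_rev.merge(df_people, left_on="EMPLOYEE_ID", right_on="EMP_ID", how="left")

print(inner)
print(left)
```

Comparing the row counts before and after a merge is a quick way to notice when some keys did not find a match.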

[Additional information can be found here.](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)

Let's merge df_employees onto df_tot_emp_rev.

In [None]:
# Let's take a look at df_employees again to verify the columns and data
df_employees

In [None]:
df_tot_emp_rev

In [None]:
# Let's merge
df_employee_rev_info = df_tot_emp_rev.merge(df_employees, left_on="EMPLOYEE_ID", right_on="EMPLOYEE_ID")

df_employee_rev_info

<a id="ordering"></a>
## Ordering Data
[Back to Table of Contents](#toc)

Now that we've aggregated and merged our data, let's order it and find the top 10 employees in our company! [More information is available here.](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)

The syntax for sorting values is as follows:

`df.sort_values(by=['col1'])`
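As a quick sketch on made-up data: `sort_values` sorts from smallest to largest unless we pass `ascending=False`, and it returns a new dataframe rather than changing the original, which is why we usually assign the result to a variable.

```python
import pandas as pd

# Toy totals; the values are made up for illustration.
df_toy = pd.DataFrame({
    "EMPLOYEE_ID": [1, 2, 3],
    "INVOICE_REV": [300.0, 400.0, 250.0],
})

# Default order: smallest revenue first.
ascending_sorted = df_toy.sort_values(by=["INVOICE_REV"])

# ascending=False: largest revenue first.
descending_sorted = df_toy.sort_values(by=["INVOICE_REV"], ascending=False)

print(descending_sorted)
```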

The column we want to sort by is "INVOICE_REV".

In [None]:
# ascending=False puts the largest revenue first
df_employee_rev_info = df_employee_rev_info.sort_values(by=["INVOICE_REV"], ascending=False)

Let's get the top 10.

In [None]:
df_top_10 = df_employee_rev_info.head(10)
df_top_10

<a id="exporting"></a>
## Exporting Data
[Back to Table of Contents](#toc)

Now that we've created our report, let's export it into Excel!

In [None]:
# index=False leaves the row index out of the spreadsheet
df_top_10.to_excel("our_cool_employee_report.xlsx", index=False)
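If you are curious what `index=False` changes, the sketch below shows it using `to_csv`, a sibling of `to_excel` that writes plain text. This is only an illustration: `df_report` and its values are made up, and the output goes to an in-memory text buffer instead of a file so nothing is written to disk.

```python
import io

import pandas as pd

# Toy report; the values are made up for illustration.
df_report = pd.DataFrame({
    "EMPLOYEE_ID": [2, 1],
    "INVOICE_REV": [400.0, 300.0],
})

# With index=False, only our own columns are written.
buffer_no_index = io.StringIO()
df_report.to_csv(buffer_no_index, index=False)
header_no_index = buffer_no_index.getvalue().splitlines()[0]

# With the default, an extra unnamed first column holds the row index.
buffer_with_index = io.StringIO()
df_report.to_csv(buffer_with_index)
header_with_index = buffer_with_index.getvalue().splitlines()[0]

print(header_no_index)
print(header_with_index)
```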