# Tech #3 Summarizing and Sorting Data

- This file builds on the code from Tech #2 and illustrates how to summarize and sort data.
- Please follow the instructions above each empty cell. Although the code from Tech #2 is already typed in, you still need to execute those cells one by one. To execute a cell, select it and click "Run".

---
## Code from Tech #2 Loading and Navigating Data
---
Additional note: In a code cell, any line that starts with the # character is a comment. You can use comments to help your audience understand your code. Python ignores comments when it executes the code.

**Step 1: Import pandas package**

In [None]:
import pandas as pd

**Step 2: Load the dataset**

In [None]:
df = pd.read_csv('Compustat_fy2019.csv', parse_dates = ['datadate'])
# parse_dates is a new parameter: it tells pandas to parse the 'datadate' column as dates automatically.
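
To see what `parse_dates` changes, the sketch below reads the same small CSV text twice, once without and once with the parameter. The CSV text and the names `csv_text`, `plain`, and `parsed` are made up for illustration and are not part of the Compustat file.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a real file.
csv_text = "conm,datadate\nFirm A,2019-12-31\nFirm B,2019-06-30\n"

# Without parse_dates, the date column is read as plain text.
plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, pandas converts the column to a date type.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['datadate'])

print(plain['datadate'].dtype)
print(parsed['datadate'].dtype)

# A parsed date column supports date operations, such as extracting the year.
print(parsed['datadate'].dt.year.tolist())   # [2019, 2019]
```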

*Note 1: Difference from Excel*  
In Python, we don't work directly on the dataset file itself. When you load a dataset, pandas reads it into a temporary copy (a DataFrame) held in your computer's memory. All manipulations are performed on that copy and won't change your original dataset file unless you explicitly save over it.
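
A minimal sketch of this behavior, assuming a tiny made-up file (`toy.csv`, written to a temporary folder so the example is self-contained): changing the loaded data leaves the file on disk untouched.

```python
import os
import tempfile
import pandas as pd

# Write a tiny made-up CSV file into a temporary folder.
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'toy.csv')
pd.DataFrame({'conm': ['Firm A', 'Firm B'], 'at': [100.0, 250.0]}).to_csv(path, index=False)

toy = pd.read_csv(path)
toy['at'] = toy['at'] * 2        # change the copy held in memory

reloaded = pd.read_csv(path)     # read the file from disk again
print(toy['at'].tolist())        # [200.0, 500.0]
print(reloaded['at'].tolist())   # [100.0, 250.0]; the file itself did not change
```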

*Note 2: File Directory*  
We only include the dataset filename (i.e., Compustat_fy2019.csv) here because this dataset is kept in the same folder as the opened Jupyter Notebook. Python takes the folder from which this notebook is opened as the default folder. But if you save your dataset in a different folder, you need to specify the full file path for Python to access it.   

**Step 3: Navigate the dataset**

In [None]:
df.head()
# Leaving the parentheses empty returns the first five observations by default.

In [None]:
df.tail()
# Returns the last five observations.

---
# New code to learn in this class
---

## Summarizing data
---
What's important for us to know about the data before we perform any analysis?
- How many observations are there?
- How many variables are there? What are those variables?
- What is the average of a certain variable? 
- ...

### Returning the index of the dataset
```Python
df.index
```
- An index is an identifier for a row. By default, the index runs from 0 to (#rows - 1).
- You can use the index to refer to the rows you need, just as you can use a name to call someone.
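
As a small illustration of referring to a row by its index, the sketch below uses a made-up three-row DataFrame (`toy`) rather than the Compustat data. The `.loc` accessor used here is a standard pandas feature that this notebook does not otherwise cover.

```python
import pandas as pd

# Made-up data standing in for the real dataset.
toy = pd.DataFrame({
    'conm': ['Firm A', 'Firm B', 'Firm C'],
    'at': [100.0, 250.0, 70.0],
})

print(toy.index)            # the default index covers 0, 1, 2

# Use an index label to pull out one row.
row = toy.loc[1]            # the row whose index label is 1
print(row['conm'])          # Firm B
```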

### Returning the columns of the dataset
```Python
df.columns
```

### Returning the shape of the dataset
```Python
df.shape
```
- There are no parentheses after shape because it is an attribute (a stored value), not a function. You can think of it as a property that already describes the shape of the data.
- The result is (#rows, #columns), or in other words, (#observations, #variables).
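
For instance, on a made-up DataFrame with three observations and two variables (the names `toy`, `n_rows`, and `n_cols` are only for illustration), the columns and the shape look like this:

```python
import pandas as pd

# Made-up data: 3 observations, 2 variables.
toy = pd.DataFrame({
    'conm': ['Firm A', 'Firm B', 'Firm C'],
    'at': [100.0, 250.0, 70.0],
})

print(list(toy.columns))    # ['conm', 'at']
print(toy.shape)            # (3, 2)

# The two numbers can be unpacked into separate variables.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)       # 3 2
```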

### Returning summary statistics of the numerical variables
```Python
df.describe()
```
- By default, the summary statistics include the number of observations (count), the average (mean), the standard deviation (std), the minimum (min), the maximum (max), and the values at the 25th, 50th, and 75th percentiles.
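
The sketch below runs `describe()` on a small made-up DataFrame and then reads one statistic out of the result. The column `ni` and all values are invented for illustration. It also shows one possible way to get the average of a single variable directly.

```python
import pandas as pd

# Made-up data: one text column and two numerical columns.
toy = pd.DataFrame({
    'conm': ['Firm A', 'Firm B', 'Firm C'],
    'at': [100.0, 250.0, 70.0],
    'ni': [10.0, -5.0, 4.0],
})

summary = toy.describe()             # the text column 'conm' is left out
print(summary)

# Read a single statistic out of the summary table...
print(summary.loc['mean', 'at'])     # 140.0
# ...or compute it directly on one column.
print(toy['at'].mean())              # 140.0
```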

In [None]:
df.describe()

## Sorting data
---

### Sorting on one variable
```Python
df.sort_values(by = 'at', ascending = False)
```
- Sort the observations on a chosen variable; in this example, total assets (`at`).
- Specify whether to sort from the smallest to the largest (`ascending = True`, the default) or from the largest to the smallest (`ascending = False`).
- In this example, we sort the observations by total assets in descending order.
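
A short sketch on made-up data shows two details that are easy to miss: each row keeps its original index label after sorting, and `sort_values` returns a new sorted DataFrame instead of changing the one you started with. The names `toy` and `sorted_toy` are illustrative.

```python
import pandas as pd

# Made-up data standing in for the real dataset.
toy = pd.DataFrame({
    'conm': ['Firm A', 'Firm B', 'Firm C'],
    'at': [100.0, 250.0, 70.0],
})

sorted_toy = toy.sort_values(by='at', ascending=False)
print(sorted_toy)

# Each row carries its original index label along.
print(list(sorted_toy.index))   # [1, 0, 2]

# The original DataFrame is still in its original order.
print(list(toy.index))          # [0, 1, 2]
```

To keep working with the sorted data, store the result in a variable, as `sorted_toy` does here.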

### Sorting on multiple variables
```Python
df.sort_values(by = ['datadate','at'], ascending = [True, False])
```
- In this example, we first sort the data by fiscal year end (starting from the earliest). Then, among firms that have the same fiscal year end, we sort by total assets (starting from the largest).
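
The same two-level sort can be tried on a made-up DataFrame with two fiscal year ends. All firm names, dates, and values here are invented for illustration.

```python
import pandas as pd

# Made-up data: two firms per fiscal year end.
toy = pd.DataFrame({
    'conm': ['Firm A', 'Firm B', 'Firm C', 'Firm D'],
    'datadate': pd.to_datetime(['2019-12-31', '2019-06-30', '2019-12-31', '2019-06-30']),
    'at': [100.0, 250.0, 70.0, 300.0],
})

# Earliest fiscal year end first; within the same date, largest total assets first.
sorted_toy = toy.sort_values(by=['datadate', 'at'], ascending=[True, False])
print(sorted_toy)

print(list(sorted_toy['conm']))   # ['Firm D', 'Firm B', 'Firm A', 'Firm C']
```

The two lists are matched by position: the first entry of `ascending` applies to the first variable in `by`, the second entry to the second variable.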

## Online support
- [Pandas Documentation: An official guide](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)
- [Stack Overflow: A question and answer site for programmers](https://stackoverflow.com/questions/37787698/how-to-sort-pandas-dataframe-from-one-column)
- Just Google! You will be surprised how many people have had the same problem as you!

---
# Quiz
---
Please load the dataset "Compustat_fy2019.csv" and use the functions we learned to answer the following questions. You can run your code in the empty cells below and enter your answers in the Canvas quiz.

**1. What is the net income for the 29th observation of "Compustat_fy2019.csv" (AMERICAN EXPRESS CO)?**

**2. Among companies listed on NASDAQ, which company has the largest total equity according to the dataset?**  
Note: companies listed on NASDAQ have an exchange code of 14.  
(This dataset includes companies listed on three exchanges: New York Stock Exchange (11), American Stock Exchange (12), and NASDAQ (14).)

**3. What's the index for the company that you found in question 2?** 
