# pandas

- Pandas is one of the most commonly used Python packages/libraries/modules for data science.<br><br>
- Pandas is Python's answer for making two-dimensional tables (à la Excel and SQL).<br><br>
- Pandas calls a table a "DataFrame".<br><br>
- Pandas DataFrames are used by Python's other packages for statistical analysis, data manipulation, and data visualization.<br><br>
- Pandas DataFrames can be exported as .csv and other files.<br><br>

The pandas syntax isn't very intuitive, and some of it differs from basic Python. I still have to look a lot of things up in pandas if it's something I don't do very often. However, it is the standard tool for working with spreadsheet-like data in Python, so you'll need to learn it at some point.<br><br>
Pandas code is written in a more *functional* style than basic Python. This means that instead of manipulating objects directly, we'll more often be applying functions across our data.

#### <br>Why do we work with Jupyter Notebooks for data science?

Jupyter Notebooks allow us to view nicely formatted output (such as pandas DataFrames and data visualizations) directly below the code used to create the object. They also allow you to scroll through large DataFrames or images.

#### <br>NumPy arrays
This week is going to focus on the Python package pandas. However, pandas (like many other Python packages) is built on NumPy arrays. NumPy is another Python package, and NumPy arrays are multi-dimensional arrays that typically hold a single type of numerical data. They allow for much faster calculations than basic Python objects such as lists. If you work with large numerical datasets, you will also want to look into the NumPy package. NumPy arrays do not have the features that many of us want to work with, such as column headers and convenient handling of non-numerical data; that's why pandas is so popular.
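
To make the contrast concrete, here is a minimal sketch using made-up numbers (the column names `temp` and `wind` are invented for illustration). It shows a NumPy array doing math on every element at once, and the same values wrapped in a DataFrame that adds column headers.

```python
import numpy as np
import pandas as pd

# A 2-D NumPy array: rows and columns of numbers, but no labels.
arr = np.array([[8.0, 6.5], [18.0, 1.0], [14.5, 1.5]])

# Arithmetic is applied to every element at once.
doubled = arr * 2

# The same values as a DataFrame, now with column headers.
toy_df = pd.DataFrame(arr, columns=["temp", "wind"])

print(arr.shape)
print(doubled)
print(toy_df)
```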

### <br><br><br>Importing pandas

Because pandas is one of the most commonly used Python packages, it is usually imported under a shortened version of its actual name. This makes it quicker to type.

In [None]:
import pandas as pd

Pandas comes with the Anaconda distribution of Python and is available on Google Colab.

### <br><br><br>Opening files from your computer

#### If you are using Google Colab, you must run the next line of code. *If you are NOT using Google Colab, do NOT run the next line.*
Google Colab requires you to load data files into your workspace by hand (or by using this trick to pull them in from GitHub).

In [None]:
!wget https://raw.githubusercontent.com/aGitHasNoName/pandasBasics/main/forestfires.csv
!wget https://raw.githubusercontent.com/aGitHasNoName/pandasBasics/main/pigeonRacing.txt
!wget https://raw.githubusercontent.com/aGitHasNoName/pandasBasics/main/zoo.xlsx

### <br><br><br>Loading a csv file

We will use the function `pd.read_csv()`. As a reminder, when we use a function from an imported module, we first give the module's name, followed by a dot, followed by the function name.
<br><br>This will automatically create a **DataFrame** object, which we are saving as `df`. `df` is a common variable name for a DataFrame. In one line, you open the file, turn its contents into a pandas DataFrame, assign it to a variable, and close the file. (Already we're seeing the differences from basic Python.)

In [None]:
df = pd.read_csv("forestfires.csv")

This is a dataset of forest fires in northeastern Portugal. I have included the dataset as a csv file in today's materials, but the data is publicly available at this site: https://archive.ics.uci.edu/ml/datasets/Forest+Fires

### <br><br><br>Viewing the DataFrame

In [None]:
df

<br>Take a minute to look at the data. The DataFrame will have a slightly different look on Colab and Jupyter, and on different versions of Jupyter.
<br><br>The number at the beginning of each row is called an **index**. The index was automatically assigned by pandas when the dataset was loaded. It was not in the original csv file. It is merely a series of consecutive numbers going down the rows. The rows were loaded in whatever order they were in the csv file.

If you are working in Google Colab, there is a new feature that lets you magically convert your DataFrame into an interactive table. We're NOT going to use that feature, though you can feel free to explore it on your own time. 

<br><br>There are ways to view pieces of the DataFrame. Try these to see what they do:

In [None]:
df.head()

In [None]:
df.head(10)

In [None]:
df.tail()

In [None]:
df.tail(2)

In [None]:
df.sample()

In [None]:
df.sample(6)

### <br><br><br>Loading other types of files

We can open a tab-separated file using the same function we used to open a csv. We just have to pass a second argument, a **keyword argument**, to tell it that the delimiter is a tab instead of the default (comma). This dataset contains rankings of professional racing pigeons.

In [None]:
pigeon_df = pd.read_csv("pigeonRacing.txt", delimiter="\t")

In [None]:
pigeon_df.head()
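
If you want to see the delimiter argument at work without a file on disk, here is a small sketch. The tab-separated text and its column names are made up for illustration, and `io.StringIO` is used only so that `read_csv` can treat a string like an open file.

```python
import io
import pandas as pd

# Made-up tab-separated text standing in for a file on disk.
tsv_text = "rank\tname\tspeed\n1\tSky\t72.5\n2\tBolt\t70.1\n"

# "\t" is the tab character; without it, read_csv would look for commas.
toy_pigeons = pd.read_csv(io.StringIO(tsv_text), delimiter="\t")

print(toy_pigeons.shape)
print(list(toy_pigeons.columns))
```

`sep="\t"` does the same thing; `delimiter` is an alias for `sep`.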

<br><br>We will use a different function to open an Excel file. This file has information about animals and has two sheets within the Excel file. We will first load the first sheet and then the second. We have to pass the `read_excel()` function one extra argument to specify the sheet. Sheets are numbered starting from 0, so `sheet_name=0` is the first sheet:

In [None]:
zoo_df = pd.read_excel("zoo.xlsx", sheet_name=0)

In [None]:
zoo_df.head()

In [None]:
zoo_class_df = pd.read_excel("zoo.xlsx", sheet_name=1)

In [None]:
zoo_class_df.head()

### <br><br>Exercise 1

Try to load two or three files from your own computer into pandas. Try with at least two different file types (csv, tab-delimited, Excel).

<br>**If you are using Google Colab**, you will need to upload the files to Colab yourself. You can do this by clicking on the folder on the left menu. You should see a file tree come up that includes sample_data. Right click anywhere in this space and choose upload to upload your own files.

### <br><br><br>Getting basic info about the DataFrame

You can use the `len()` function to find out how many rows are in a DataFrame object:

In [None]:
len(df)

<br>The `describe()` method will give you some very basic stats about each column in your DataFrame:

In [None]:
df.describe()

<br>The `shape` attribute will return the number of rows and columns as a tuple. An attribute gives us some stored data about an object - it is not a method function, so it does not get parentheses.

In [None]:
df.shape

You can even save the shape tuple as an object, in case you need to include it in any code:

In [None]:
df_shape = df.shape

In [None]:
print("Our DataFrame has " + str(df_shape[0]) + " rows and " + str(df_shape[1]) + " columns.")

<br>The `size` attribute will tell you the total number of elements in the DataFrame (size = rows x columns):

In [None]:
df.size
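
Here is a short sketch of how `len()`, `shape`, and `size` relate to each other, using a tiny made-up DataFrame (the column names are invented for illustration). It also builds the same message as above with an f-string, which can be easier to read than joining strings with `+`.

```python
import pandas as pd

# A tiny made-up DataFrame: 3 rows, 2 columns.
toy_df = pd.DataFrame({"temp": [8.2, 18.0, 14.6], "rain": [0.0, 0.0, 0.2]})

# Unpack the shape tuple into two variables.
n_rows, n_cols = toy_df.shape

# len() matches the number of rows; size is rows x columns.
print(len(toy_df) == n_rows)
print(toy_df.size == n_rows * n_cols)

message = f"Our DataFrame has {n_rows} rows and {n_cols} columns."
print(message)
```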

<br>To return a list of the column names, you can start with the `columns` attribute:

In [None]:
df.columns

Hmm. That looks strange because it is a pandas object. You can make it into a list so that it is easier to work with:

In [None]:
column_names = list(df.columns)
print(column_names)

<br>To find out the data types of the data found in each column, use the `dtypes` attribute:

In [None]:
df.dtypes

<br>To **transpose** a DataFrame (swap the rows and columns), you also use an attribute:

In [None]:
df.T

<br>Let's see if that changed our DataFrame object:

In [None]:
df

<br><br>It didn't change! `df.T` returns a new, transposed DataFrame and leaves the original alone. Many pandas operations work this way, much like string methods return a new string. To save the transposed DataFrame, we have to assign it to a variable:

In [None]:
df_t = df.T
df_t
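
The same behavior can be checked on a tiny made-up DataFrame (names invented for illustration). This sketch compares the shape before and after, which makes it easy to see that the original was left untouched.

```python
import pandas as pd

# A tiny made-up DataFrame: 3 rows, 2 columns.
toy_df = pd.DataFrame({"temp": [8.2, 18.0, 14.6], "rain": [0.0, 0.0, 0.2]})

# .T gives back a new DataFrame with rows and columns swapped.
toy_t = toy_df.T

print(toy_df.shape)        # the original still has 3 rows and 2 columns
print(toy_t.shape)         # the transposed copy has 2 rows and 3 columns
print(list(toy_t.index))   # the old column names are now the row labels
```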

### <br><br>Exercise 2

First run the following code cell to look at the zoo animals DataFrame:

In [None]:
zoo_df

Write code to create a list of column names from `zoo_df`:

Write code to return the data type for each column in `zoo_df`:

<br><br><br>At this point, you may want to learn how to select data from your DataFrame. For example, how do you choose a single column to work with? How do you choose all animals that are aquatic? **Selecting data** is a big part of working with pandas DataFrames, but there are actually multiple ways to do it. Tomorrow we are going to focus only on selecting data. For the rest of today, we're going to practice several common tasks you'll want to do that don't involve selecting data. I don't expect you to memorize exactly how to do all these tasks today, but the info will be here for when you need it, plus, we will get practice with some common pandas syntax.

### <br><br><br>Renaming columns

Here's what our column names look like in the forest fire dataset:

In [None]:
df.head()

Four of the columns end in "\_code". Let's remove that part from the column names. We can use the `rename()` method. We need to pass it a **dictionary** in which each old name is a key and its new name is the value.

In [None]:
df.rename(columns = {"moisture_code": "moisture", "fuel_code": "fuel"})

In [None]:
df.head()

Uh-oh, the change didn't stick. Like the string methods we've seen before, `rename()` returns a new object instead of changing the original, so we know the answer: assign the result back to a variable.

In [None]:
df = df.rename(columns = {"moisture_code": "moisture", "fuel_code": "fuel"})

In [None]:
df.head()
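
Here is the same pattern on a tiny made-up DataFrame (the column names `wind_kmh` and `temp_c` are invented for illustration). It is only a sketch, but it shows two things: the original keeps its names, and columns that are not in the dictionary are left alone.

```python
import pandas as pd

toy_df = pd.DataFrame({"wind_kmh": [6.7, 0.9], "temp_c": [8.2, 18.0]})

# Only "wind_kmh" is in the dictionary, so only that name changes.
renamed = toy_df.rename(columns={"wind_kmh": "wind"})

print(list(toy_df.columns))    # original names, unchanged
print(list(renamed.columns))   # new object with the new name
```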

### <br><br>Exercise 3

Write code to remove "\_code" from the ends of the drought and initial_spread column names:

In [None]:
df.head()

### <br><br><br>Dropping rows and columns

Let's drop a single row from the DataFrame. How about row 2? You still have to assign the result back to `df` to make the change permanent:

In [None]:
df = df.drop(2)

In [None]:
df.head()

<br>The index numbers did not reset when we dropped a row. 2 is missing!

We can reset the index and pretend that 2 was never there. The `reset_index()` method accepts a keyword argument, `drop`. If we don't pass `drop=True`, an extra column containing the old index numbers will get added to our DataFrame. Let's first reset the index without passing the argument, but we won't save that DataFrame:

In [None]:
df.reset_index()

You can see that new column `index` contains the original index positions. Now let's save a new version of our DataFrame, with the indexes reset, but without that new column:

In [None]:
df = df.reset_index(drop=True)

In [None]:
df.head()
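
The whole sequence can be sketched on a four-row, made-up DataFrame (the `temp` column is invented for illustration), where it is easier to see what happens to the index at each step.

```python
import pandas as pd

toy_df = pd.DataFrame({"temp": [8.2, 18.0, 14.6, 8.3]})

# Drop the row labeled 2; the remaining labels keep their old numbers.
dropped = toy_df.drop(2)
print(list(dropped.index))

# Without drop=True, the old labels are kept in a new "index" column.
with_old = dropped.reset_index()
print(list(with_old.columns))

# With drop=True, the old labels are thrown away.
clean = dropped.reset_index(drop=True)
print(list(clean.index))
print(list(clean.columns))
```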

<br><br><br>The `drop()` function defaults to dropping rows. If we want to drop a column, we need to add one more argument. `axis=1` is used in pandas to refer to columns as opposed to rows (`axis=0`). The `axis` argument is used elsewhere in pandas, too. Let's drop the "X" column:

In [None]:
df = df.drop("X", axis=1)

In [None]:
df.head()
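
As a side note, `drop()` also accepts a `columns` keyword, which some people find easier to read than `axis=1`. The sketch below uses a made-up DataFrame with invented column names `a`, `b`, and `c` to show that both spellings give the same result, and that a list drops several columns at once.

```python
import pandas as pd

toy_df = pd.DataFrame({"a": [7, 7], "b": [5, 4], "c": [8.2, 18.0]})

# Two ways to drop the same column.
by_axis = toy_df.drop("a", axis=1)
by_keyword = toy_df.drop(columns="a")
print(by_axis.equals(by_keyword))

# Pass a list to drop more than one column.
only_c = toy_df.drop(columns=["a", "b"])
print(list(only_c.columns))
```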

### <br><br>Exercise 4

Write code to view the last 5 rows of the DataFrame:

Now write code to drop the very last row:

In [None]:
df.tail()

Write code to remove the "Y" column:

In [None]:
df.head()

### <br><br><br>Sorting a DataFrame

There are two functions for sorting your DataFrame.

If you want to sort by the index numbers, or if you want to sort by the column names (alphabetically), you use `sort_index()`. We'll use two of its arguments: the axis to sort by (row or column) and the order (ascending or not).

The default arguments are to sort by row index with 0 at the top, which is how we've already been viewing the data:

In [None]:
df.sort_index()

Let's try more arguments:

In [None]:
df.sort_index(ascending=False)

In [None]:
df.sort_index(axis=1)

In [None]:
df.sort_index(axis=1, ascending=False)

<br><br><br>The second sort function, `sort_values()`, will sort the frame by the data in a column:

In [None]:
df.sort_values("area_burned")

In [None]:
df.sort_values("day")

### <br><br>Exercise 5

Write code to sort the DataFrame by the rain column, with the largest values at the top:

<br><br><br>You can also sort on multiple values by passing the `sort_values` function a list of column names instead of a single name. If we want to first sort by day, then by area burned:

In [None]:
df.sort_values(["day", "area_burned"])
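
When sorting on several columns, `ascending` can also be a list with one value per column. The sketch below uses a made-up DataFrame (the `area` column is a stand-in invented for illustration) and sorts by day in ascending order, then by area with the largest values first within each day.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "day": ["mon", "fri", "mon", "fri"],
    "area": [0.5, 2.0, 3.1, 0.0],
})

# One True/False per column: day ascending, area descending.
sorted_df = toy_df.sort_values(["day", "area"], ascending=[True, False])

print(sorted_df)
```

Notice that the days are sorted alphabetically ("fri" before "mon"), not in calendar order, because they are stored as plain text.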

### <br><br><br>Saving your changed DataFrame

We've made a lot of changes to the forest fire dataset. Let's save it as a new csv file. First, we can decide what we're going to call the new file:

In [None]:
new_filename = "fire_changed.csv"

Next, we can use the `to_csv()` method function to save the new file:

In [None]:
df.to_csv(new_filename)
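
One detail worth knowing: by default, `to_csv()` also writes the index as an unlabeled first column. Passing `index=False` leaves it out. The sketch below uses a tiny made-up DataFrame and calls `to_csv()` without a filename, which returns the csv text as a string so we can look at it directly instead of writing a file.

```python
import io
import pandas as pd

toy_df = pd.DataFrame({"temp": [8.2, 18.0], "rain": [0.0, 0.2]})

# With no filename, to_csv() returns the csv text as a string.
with_index = toy_df.to_csv()
without_index = toy_df.to_csv(index=False)

print(with_index)
print(without_index)

# Reading the first version back in gives an extra, unnamed column.
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))
```

If you reload a saved file and find a mystery column called `Unnamed: 0`, this is usually where it came from.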

### <br><br>Exercise 6

The `zoo_df` DataFrame was originally an Excel file. Write code to save it as a csv file:

In [None]:
new_zoo = 