# Introduction to Pandas

* * * 

### Icons Used In This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced - so you know what Python can be used for!<br>

### Learning Objectives
1. [Libraries](#lib)
2. [Importing Data from Files](#data)
3. [Data Frames: Spreadsheets in Python](#df)
4. [🚀 Project](#project)


<a id='lib'></a>

## Libraries

A **library** refers to a reusable chunk of code. Usually, a Python library contains a collection of related functionalities.

We have already been using Python's [standard library](https://docs.python.org/3/library/) - it comes ready and loaded with Python. We've also used `pandas` to work with data frames. Today, we will expand on our Pandas knowledge and do our first data science project.

### Installing libraries

The most common option is to install a library directly from the command line. One way to do this is to open the Jupyter Launcher by clicking the `+` symbol in the top left of JupyterLab, then selecting Terminal.

You can then use `pip`, a Python package installer, to install new packages. Simply run `pip install [PACKAGE_NAME]`, and the package will be installed.

💡 **Tip**: You can also install packages within a Jupyter Notebook. Create a new cell, and run the command `!pip install [PACKAGE_NAME]`.


### Importing libraries 
Before we can use a library like Pandas, we have to **import** it into the current session.
Importing is done with the `import` keyword. We simply run `import [PACKAGE_NAME]`, and everything inside the package becomes available to use.

Let's import the `numpy` module, which has a lot of useful functions for working with numerical data. Let's access a function from this module using dot notation.

In [None]:
import numpy

print('The mean of [1, 4, 5] is:', numpy.mean([1, 4, 5]))

For many packages, like `numpy`, there is an **alias**, or nickname that they are often imported as. For common packages (especially those with long names), it saves a lot of typing when you use a nickname. For example, `numpy` is usually imported as below:

In [None]:
import numpy as np

print('mean of [1, 4, 5] is:', np.mean([1, 4, 5]))

There are very common abbreviations used for some of the more popular libraries, including:

* `pandas` -> `pd`
* `numpy` -> `np`
* `matplotlib.pyplot` -> `plt`
* `statsmodels.api` -> `sm`

⚠️ **Warning**: Sometimes aliases can make programs harder to understand, since readers must learn your program's aliases. Be very intentional about using aliases!

### Help!

How do we know what we can do with a library like `numpy`? Usually, packages provide **documentation** that explains their components. As an example, let's have a look at the documentation for the standard library's `math` module [online](https://docs.python.org/3/library/math.html).

Being comfortable sifting through documentation is a **very** important skill!
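
You can also browse documentation without leaving the notebook. Here is a minimal sketch, using the standard library's `math` module as the example: `dir()` lists the names a module provides, and a function's docstring holds the same text that the built-in `help()` would show. The variable name `public_names` is just for illustration.

```python
import math

# List the public names (functions and constants) that the module provides
public_names = [name for name in dir(math) if not name.startswith('_')]
print(public_names[:5])

# Print the docstring of one function; help(math.sqrt) shows the same text
print(math.sqrt.__doc__)
```

The same pattern should work for any imported library, for example `dir(numpy)` once `numpy` is imported.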


## 🥊 Challenge: Locating the Right Library

You want to select a random value from a list of data.

1. What [standard library](https://docs.python.org/3/library/) would you most expect to help? Look at the documentation and find it.
2. Which **function** would you select from that library? 💡 **Tip**: Look at "Functions for sequences" in the documentation.
3. Import the library, and apply the function to the following list.

In [None]:
ids = [1, 2, 3, 4, 5, 6]

In [None]:
# YOUR CODE HERE


<a id='data'></a>

# Importing Data from Files

No set of basic skills is complete without learning how to import data from files. 

## Getting Your Bearings

Before we can get our data, we first have to figure out where the file is on our hard disk! 

We can use the magic command `%pwd` to check the location of your "working directory" (the folder on your computer that Python is currently connected to). 

In [None]:
# print working directory
%pwd

⚠️ **Warning**: Navigating file paths can be *pretty confusing* 😵‍💫, but it's an important skill!


## Importing .csv Files

As data scientists, we'll often be working with **Comma Separated Values (.csv)** files.

Comma separated values files are common because they are relatively small and look good in spreadsheet software. A comma separated values file is just a text file that contains data but that has commas (or other separators) to indicate column breaks.

`pandas` comes with a function, [`read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html), that makes it really easy to import .csv files.
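
To see that a .csv file really is just text with separators, here is a small sketch that reads a made-up CSV from a string instead of a file. The contents of `csv_text` and the name `toy_df` are invented for illustration; `io.StringIO` simply lets a string stand in for a file.

```python
import io

import pandas as pd

# A tiny made-up CSV: the first line holds column names, commas split columns
csv_text = "name,score\nAna,90\nBen,85\n"

# read_csv accepts file-like objects as well as file paths
toy_df = pd.read_csv(io.StringIO(csv_text))
print(toy_df)
```

Each line of text becomes one row, and each comma marks a column break.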

💡 **Tip**: Let's have a look at a .csv file in our File Browser!

### Wait, How Do I Get My Files?

The file we want to import is inside a folder called "data", which is inside of the main "Python-Fundamentals" folder. As you can see in the file path, this directory is two folders "up" from where we currently are.

💡 **Tip**: Let's use the File Browser to the left of our screen, as well as our Finder (Mac) / File Explorer (Windows), to orient ourselves.

Have a look at the "gapminder-FiveYearData.csv" file we are importing below.

* The `read_csv()` function takes a string as its main argument. This string consists of the file path pointing to the file.
* `../` means 'go up one level in the folder'.
* `../../` means 'go up two levels in the folder'.
* `data/` means 'go into a folder called "data"'.
* `gapminder-FiveYearData.csv` is the file name we are accessing within that "data" folder.
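
If a path is confusing, it can help to take it apart. The sketch below uses the standard library's `pathlib` module to split the relative path described above into its steps and to check whether it points at a real file. Whether `exists()` reports `True` depends on where your notebook is running, so treat the folder layout here as an assumption.

```python
from pathlib import Path

# The relative path described above: up two levels, then into "data"
relative_path = Path('../../data/gapminder-FiveYearData.csv')

print(Path.cwd())               # the current working directory (like %pwd)
print(relative_path.parts)      # the individual steps in the path
print(relative_path.name)       # just the file name
print(relative_path.exists())   # True only if the file is really there
```

If `exists()` prints `False`, the path does not match your working directory, and `read_csv()` would raise a `FileNotFoundError`.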

In [None]:
import pandas as pd

df = pd.read_csv('../../data/gapminder-FiveYearData.csv')
df.head()

🔔 **Question**: What does the [gapminder-FiveYearData](https://en.wikipedia.org/wiki/Gapminder_Foundation) dataset seem to be about?

<a id='df'></a>

# Data Frames: Spreadsheets in Python

A common data structure you've likely already encountered is **tabular data**. Think of an Excel sheet: each column corresponds to a different feature of each datapoint, while rows correspond to different samples.

In scientific programming, tabular data is often called a **data frame**. In Python, there is a specialized library called Pandas, which contains an object `DataFrame` that implements this data structure.

We're going to explore `pandas` more closely in the next part, but let's try creating a `DataFrame` object right now. 

First, **paste the dictionary you created in the Challenge above.**

In [None]:
# YOUR CODE HERE


Next, we `import` the `pandas` **library** (see the Libraries section above) and pass the dictionary to the `pd.DataFrame()` function, storing the result in a variable called `df`. The code below assumes your dictionary is named `fruits`.

In [None]:
import pandas as pd

df = pd.DataFrame(fruits)
df

Note that the keys of our dictionary became column names, and the values became cells, in the `DataFrame`. In addition, there is an **index** on the left that keeps track of the row.
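
Here is a minimal, self-contained sketch of the same idea. The dictionary `example_people` and its values are made up for illustration; they are not the dictionary from the workshop.

```python
import pandas as pd

# Hypothetical dictionary: each key is a column name,
# and each value is a list holding that column's cells
example_people = {
    'name': ['Ana', 'Ben', 'Cy'],
    'age': [31, 25, 40],
}

example_df = pd.DataFrame(example_people)
print(example_df)

# The index on the left counts rows, starting at 0
print(list(example_df.index))
```

The two keys become two columns, and the three list entries become three rows.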

💡 **Tip**: Objects can also have **attributes**, or variables associated with the data type. We can get the number of columns and rows with `df.shape`, an attribute of the dataframe.

🔔 **Question**: How many rows and columns does this dataframe have?

In [None]:
df.shape

## 🥊 Challenge: Building a DataFrame

The following code gives an error. Why does it have an error? 

💡 **Tip:** Google the line at the bottom of the error message if you need help!

In [None]:
fruit = ['apple', 'orange']
length = [3.2, 2.1, 3.1]
color = ['red', 'orange', 'yellow']

fruit_dict = {
    'fruit': fruit,
    'length': length,
    'color': color}

df_fruit = pd.DataFrame(fruit_dict)

## Slicing Columns
We can choose a single column by selecting the name of that column. The act of obtaining a particular subset of a data frame is often referred to as **slicing**. This uses bracket notation to select part of the data.

Check it out:

In [None]:
df['country']

`pandas` calls this a `Series` object. It's like a list, except it's labeled. 

You can index into a Series object just like you can with a list!

In [None]:
gap_country = df['country']
gap_country[0]
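
The sketch below repeats these steps on a tiny made-up table, so it runs without the data file. The names `toy` and `countries` and the values are invented for illustration.

```python
import pandas as pd

# A made-up table with the same kind of columns as the gapminder data
toy = pd.DataFrame({
    'country': ['Egypt', 'Chile', 'Japan', 'Egypt'],
    'year': [2002, 2002, 2007, 2007],
})

countries = toy['country']   # selecting one column gives a Series
print(type(countries))

print(countries[0])          # a single element, like a list
print(countries[1:3])        # a slice: rows 1 and 2, with their labels kept
```

Note that the slice is still a `Series`: the labels on the left travel along with the values.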

`DataFrame` objects also have methods, including those for [merging](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html?highlight=merge#pandas.DataFrame.merge), [aggregation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html), and others. Many of these functions operate on a single column of the DataFrame. For example, we can identify the number of unique values in each column by using `.nunique()`, and what those unique values are by using `.unique()`:

In [None]:
#number of unique countries in the df
print(df['country'].nunique())

#unique countries in the df
print(df['country'].unique())

## More Methods: `.head()`, `.describe()`, and `.value_counts()`

The `.head()` method will show the first five rows of a Data Frame by default. Put an integer in the parentheses to specify a different number of rows. 

`.describe()` provides basic summary statistics. 

`.value_counts()` counts frequencies.

In [None]:
# View the first 3 rows
df.head(3)

In [None]:
# Produce some quick summary statistics
df.describe()

Now, let's investigate how many rows fall into each category:

In [None]:
# How many rows are there for each year?
df['year'].value_counts()
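
To see what `.value_counts()` returns, here is a small sketch on a made-up Series of years. The values in `toy_years` are invented for illustration.

```python
import pandas as pd

# A made-up column of years with some repeats
toy_years = pd.Series([1952, 1957, 1952, 2007, 1952])

counts = toy_years.value_counts()
print(counts)

# The result is itself a Series: the labels are the distinct values,
# and the cells are how often each one appears
print(counts.loc[1952])
print(toy_years.nunique())
```

By default the most frequent value comes first, which makes it easy to spot the largest category.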

## Column names

You can call [attributes](https://medium.com/@shawnnkoski/pandas-attributes-867a169e6d9b) of a Pandas variable by using "dot notation" - it's like a method, but without the parentheses. 

💡 **Tip**: Attributes are **features** of data. Methods **allow you to do something** with data.

💡 **Tip**: A method is written with parentheses: e.g. `df['year'].value_counts()`. An attribute is written without parentheses: e.g. `df.columns`.


In [None]:
# List the column names using the .columns *attribute*
df.columns
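
The difference between attributes and methods is easy to check on a small made-up table. In this sketch, `toy_df` and its values are invented for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

# Attributes: no parentheses, they report something about the data
print(toy_df.columns)
print(toy_df.dtypes)

# Methods: parentheses, they do something and return a result
print(toy_df.head(2))

# Leaving off the parentheses gives you the method itself, not its result
print(callable(toy_df.head))
```

If a cell prints something like `<bound method ...>`, that is usually a sign the parentheses were left off a method.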

🔔 **Question**: Here's another popular attribute: `shape`. What do you think it does?

In [None]:
df.shape

## Slicing Rows

You can slice rows of a DataFrame like you would a string or a list. If we just want three rows: 

In [None]:
df[6:9]
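
As with lists, the slice starts at the first number and stops just before the second. The sketch below shows this on a made-up table; `toy_df` and `subset` are names invented for illustration.

```python
import pandas as pd

# A made-up table with six rows, labelled 0 to 5
toy_df = pd.DataFrame({'letter': list('abcdef')})

# Rows at positions 2, 3 and 4 (the end of the slice is excluded)
subset = toy_df[2:5]
print(subset)

# .iloc is the explicit way to select rows by position
print(subset.equals(toy_df.iloc[2:5]))
```

The rows keep their original index labels (2, 3, 4) rather than being renumbered from 0.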

## Conditional Subsetting

What if we want a subset based on a condition? For example, what if we just wanted a subset for data only when country is equal to Egypt? 

In [None]:
df['country'] == 'Egypt'

💡 **Tip**: Fancy terminology alert: the above Series is called a **Boolean mask**. It's like a list of True/False labels that we can use to filter our Data Frame for a certain condition! We'll cover this further in Python Fundamentals II.

Here, we subset our Data Frame with the fancy Boolean mask we just created. 

In [None]:
# Dataframe just of data points in Egypt
egypt_df = df[df['country'] == 'Egypt']
egypt_df

In [None]:
# Data frame just of 2002
year_2002_df = df[df['year'] == 2002]
year_2002_df.head()
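
Here is a minimal sketch of the two steps, the mask and then the filter, on a made-up table. The names `toy`, `mask` and `egypt_rows` and the values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['Egypt', 'Chile', 'Egypt'],
    'year': [2002, 2002, 2007],
})

# Step 1: the comparison gives one True/False value per row
mask = toy['country'] == 'Egypt'
print(mask)

# True counts as 1, so summing the mask counts the matching rows
print(mask.sum())

# Step 2: putting the mask inside the brackets keeps only the True rows
egypt_rows = toy[mask]
print(egypt_rows)
```

Storing the mask in its own variable is optional, but it can make longer filters easier to read.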

💡 **Tip**: You can learn more about Pandas DataFrames in D-Lab's [**Python Data Wrangling**](https://github.com/dlab-berkeley/Python-Data-Wrangling) workshop. [Register now](https://dlab.berkeley.edu/training/upcoming-workshops).


## Creating a new Column

To create a new column, put the new column name in `[]` brackets on the left side of the assignment. The right side can be a calculation on an existing column:

In [None]:
df['lifeExp_rounded'] = df['lifeExp'].round()
df.head()
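
The same pattern works for any calculation on a column. In the sketch below, the table `toy` and its values are made up, and the `lifeExp_months` column is a hypothetical example of arithmetic on a column.

```python
import pandas as pd

toy = pd.DataFrame({'lifeExp': [28.801, 71.263, 82.603]})

# Round each value to the nearest whole number
toy['lifeExp_rounded'] = toy['lifeExp'].round()

# Arithmetic is applied to every row of the column at once
toy['lifeExp_months'] = toy['lifeExp'] * 12

print(toy)
```

Assigning to a column name that already exists would overwrite that column instead of adding a new one.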

In [None]:
# YOUR CODE HERE


<a id='group'></a>

# 🎬 Demo: Grouping and Plotting Data Frames

This demo shows how to create a plot from the California Health Interview Survey (CHIS), the nation's largest state health survey. We look at the people in the data who often feel left out, and group them by their self-perceived general health.

More on Pandas in the next session!

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

chis_df = pd.read_csv('../data/chis_extract.csv')
chis_df.head()

In [None]:
# Group by general health, take the "feel_left_out" column, and count its values as proportions
chis_grouped = chis_df.groupby('general_health')['feel_left_out'].value_counts(normalize=True)

# Pivot the table and put it in a new DataFrame
ag = chis_grouped.unstack()

# Plot the normalized share of each group that answered 'OFTEN'
ag['OFTEN'].sort_values().plot(kind='barh')
plt.title('Feeling left out, sorted by general health')
plt.xlabel('Proportion of group');
plt.ylabel('General health');
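
To see what the grouping and pivoting steps produce, here is a sketch on a tiny made-up survey. The rows in `survey` are invented for illustration and are not CHIS data; only the column names and the `'OFTEN'` answer are taken from the demo above.

```python
import pandas as pd

# Made-up responses, reusing the column names from the demo
survey = pd.DataFrame({
    'general_health': ['GOOD', 'GOOD', 'POOR', 'POOR', 'POOR', 'GOOD'],
    'feel_left_out': ['OFTEN', 'NEVER', 'OFTEN', 'OFTEN', 'NEVER', 'NEVER'],
})

# Within each health group, the share of each answer
grouped = survey.groupby('general_health')['feel_left_out'].value_counts(normalize=True)

# Pivot so each answer becomes a column and each health group a row
table = grouped.unstack()
print(table)

# The column that the demo plots: the share answering 'OFTEN' per group
print(table['OFTEN'].sort_values())
```

Because of `normalize=True`, each row of the pivoted table adds up to 1.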

# 🎉 Well done!

Today's project took us through basic data manipulation and analysis using Pandas. 

### 💡 Tip: More workshops!

D-Lab teaches workshops that allow you to practice more with DataFrames and visualization.

- To learn more about data wrangling, check out D-Lab's [Python Data Wrangling workshop](https://github.com/dlab-berkeley/Python-Data-Wrangling).
- To learn more about data visualization, check out D-Lab's [Python Data Visualization workshop](https://github.com/dlab-berkeley/Python-Data-Visualization).