# Day 4: Session A - Data Frames

[Link to session webpage](https://eds-217-essential-python.github.io/course-materials/interactive-sessions/4a_dataframes.html)

Date: 9/6/24

In [None]:
import pandas as pd
import numpy as np

## Step 1: Import Data

In [None]:
url = "https://raw.githubusercontent.com/datasets/world-cities/master/data/world-cities.csv"
cities_df = pd.read_csv(url)

## Step 2: Explore Data

- Figuring out the DataFrame structure
- Looking at data types
- Getting a sense of data values
- Checking out column names

In [None]:
cities_df.head()

In [None]:
cities_df.tail()

### Exploring DataFrames using their properties

Use the `shape` property to get the size (rows, columns) of a dataframe

In [None]:
cities_df.shape
# shape is a property, rather than a function, so we don't need parentheses
# cities_df.shape is really just a variable that contains information about this dataframe
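
A quick sketch of what `shape` holds, using a tiny made-up dataframe (`toy_df` and its values are illustrative stand-ins, not the cities data):

```python
import pandas as pd

# toy_df is a hypothetical stand-in for cities_df
toy_df = pd.DataFrame({
    "name": ["Goleta", "Santa Barbara", "Ventura"],
    "country": ["United States", "United States", "United States"],
})

# shape is a plain tuple, so it can be unpacked into rows and columns
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)
```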

Use `columns` to get the column names

In [None]:
cities_df.columns
# returns an Index object, because column labels are stored as an index (row labels are too)

Use the `dtypes` property to see the data type of each column

In [None]:
cities_df.dtypes
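
A small sketch of working with `columns` and `dtypes`, assuming a made-up dataframe (the `geonameid` values here are placeholders, not real IDs):

```python
import pandas as pd

# toy_df is a hypothetical stand-in for cities_df
toy_df = pd.DataFrame({
    "name": ["Goleta", "Ventura"],
    "geonameid": [1, 2],  # placeholder IDs
})

# columns is an Index; tolist() converts it to a regular Python list
column_names = toy_df.columns.tolist()
print(column_names)

# dtypes is a Series with one entry per column
print(toy_df.dtypes)

# check whether a single column holds integers
id_is_integer = pd.api.types.is_integer_dtype(toy_df["geonameid"])
print(id_is_integer)
```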

### Exploring DataFrames using their methods

Use the `describe()` method to get a summary of the dataframe

In [None]:
cities_df.describe()
# summarizes the numeric columns in your df (non-numeric columns are skipped by default)
# so here it only returns geonameid because it's the only numeric column
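
To summarize the text columns too, `describe()` takes an `include` argument. A minimal sketch on a made-up dataframe (`toy_df` and its values are illustrative):

```python
import pandas as pd

# toy_df is a hypothetical stand-in for cities_df
toy_df = pd.DataFrame({
    "name": ["Goleta", "Portland", "Toronto"],
    "country": ["United States", "United States", "Canada"],
    "geonameid": [1, 2, 3],  # placeholder IDs
})

# default: numeric columns only
numeric_summary = toy_df.describe()

# include="all": every column, with count/unique/top/freq for text columns
full_summary = toy_df.describe(include="all")
print(full_summary)
```

Cells that don't apply to a column (like `mean` for a text column) show up as `NaN`.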

Use `info()` to get detailed information about column types and content

In [None]:
cities_df.info()
# helpful for spotting missing data: compare each column's non-null count to the total number of entries

Use `isnull()` and `sum()` to count missing values

In [None]:
cities_df.isnull().sum()
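
Why this works: `isnull()` returns a dataframe of True/False values, and `sum()` counts each True as 1. A sketch on made-up data (`toy_df` is illustrative):

```python
import pandas as pd
import numpy as np

# toy_df is a hypothetical stand-in for cities_df
toy_df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "subcountry": ["California", None, np.nan],
})

# True wherever a value is missing (None and NaN both count)
missing_mask = toy_df.isnull()

# summing booleans column by column gives the number of missing values
missing_counts = missing_mask.sum()
print(missing_counts)
```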

## Step 3: Clean Data

To remove rows with missing data, use `dropna()`. Use the `subset` argument to check only specific columns for missing values.

In [None]:
cities_df = cities_df.dropna(subset=['subcountry'])

In [None]:
cities_df.isnull().sum()
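
A sketch of how `subset` changes what gets dropped, using a made-up dataframe (the `nickname` column is hypothetical and not in the cities data):

```python
import pandas as pd

# toy_df is a hypothetical stand-in for cities_df
toy_df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "subcountry": ["California", None, "Oregon"],
    "nickname": [None, None, "Rose City"],
})

# no subset: drops every row that has a missing value in ANY column
dropped_any = toy_df.dropna()

# with subset: only drops rows where 'subcountry' is missing
dropped_subset = toy_df.dropna(subset=["subcountry"])

print(len(toy_df), len(dropped_any), len(dropped_subset))
```

Note that `dropna()` returns a new dataframe, which is why the result is assigned back to a variable.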

## Step 4: Basic Selecting and Filtering

### Basic Selecting

Selecting a column is easy! Just put the column name in square brackets after the dataframe

In [None]:
cities_df['name'].head()
# using .head() only to keep my notebook output short (leave it off when you need the whole column)

In [None]:
cities_df[['name','country']].head()
# to select multiple columns, pass a list of the column names inside the brackets (hence the double [[ ]])

# list_we_want = ['name','country','subcountry']
# cities_df[list_we_want]

To make a series from a column, request it like `df['column']`

To make a dataframe from a column, request it as a single-item list: `df[['column']]`
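
A quick check of the difference, on a made-up dataframe (`toy_df` is illustrative):

```python
import pandas as pd

# toy_df is a hypothetical stand-in for cities_df
toy_df = pd.DataFrame({
    "name": ["Goleta", "Portland", "Toronto"],
    "country": ["United States", "United States", "Canada"],
})

as_series = toy_df["name"]    # single brackets -> Series (1-dimensional)
as_frame = toy_df[["name"]]   # single-item list -> DataFrame (2-dimensional)

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```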

### Basic Filtering

Use conditional expressions to filter rows

In [None]:
us_cities = cities_df[ cities_df['country'] == 'United States' ]
# the `cities_df['country'] == 'United States'` part gives us a boolean Series (True/False for each row)

# could be written as:
# rows_we_want = cities_df['country'] == 'United States'
# us_cities = cities_df[rows_we_want]

us_cities.head()
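
To see the mask itself, here is a sketch on a made-up dataframe (`toy_df` is illustrative):

```python
import pandas as pd

# toy_df is a hypothetical stand-in for cities_df
toy_df = pd.DataFrame({
    "name": ["Goleta", "Toronto", "Portland"],
    "country": ["United States", "Canada", "United States"],
})

# the comparison produces one True/False per row
rows_we_want = toy_df["country"] == "United States"
print(rows_we_want.tolist())

# indexing with the mask keeps only the rows marked True
us_only = toy_df[rows_we_want]
print(us_only)
```

The filtered rows keep their original index labels, so `us_only` here is indexed 0 and 2 rather than 0 and 1.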

### Filtering on multiple columns

We can combine conditions with logical operators (`&` for "and") to filter on multiple columns

In [None]:
in_us = cities_df['country'] == "United States"
in_ca = cities_df['subcountry'] == "California"

california_cities = cities_df[in_us & in_ca]
california_cities.head()

In [None]:
# More compact! But each condition needs its own parentheses, because & binds more tightly than ==
california_cities = cities_df[
    (cities_df['country'] == "United States") & 
    (cities_df['subcountry'] == "California")
]
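
The same pattern works with `|` ("or") and `~` ("not"). A sketch on a made-up dataframe (`toy_df` is illustrative):

```python
import pandas as pd

# toy_df is a hypothetical stand-in for cities_df
toy_df = pd.DataFrame({
    "name": ["Goleta", "Portland", "Toronto", "Fresno"],
    "country": ["United States", "United States", "Canada", "United States"],
    "subcountry": ["California", "Oregon", "Ontario", "California"],
})

# | keeps rows where either condition is True
ca_or_or = toy_df[
    (toy_df["subcountry"] == "California") |
    (toy_df["subcountry"] == "Oregon")
]

# ~ flips a mask: rows where the condition is NOT True
not_us = toy_df[~(toy_df["country"] == "United States")]

print(ca_or_or["name"].tolist())
print(not_us["name"].tolist())
```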

## Step 5: Sorting and Ranking

coming soon ...

## Step 6: Transformations

coming soon ...

## Steps 7 & 8: Grouping / Aggregation

The workhorse method for grouping in pandas is `groupby()`

In [None]:
cities_per_country = cities_df.groupby('country')

In [None]:
cities_per_country['name'].count().sort_values(ascending=False).head(10)
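
A sketch of the same group-then-count pattern on a made-up dataframe (`toy_df` is illustrative). For a simple count by one column, `value_counts()` should give the same numbers:

```python
import pandas as pd

# toy_df is a hypothetical stand-in for cities_df
toy_df = pd.DataFrame({
    "name": ["Goleta", "Portland", "Toronto", "Fresno"],
    "country": ["United States", "United States", "Canada", "United States"],
})

# group rows by country, then count non-null names within each group
counts = (
    toy_df.groupby("country")["name"]
    .count()
    .sort_values(ascending=False)
)
print(counts)

# shortcut for counting rows per value of a single column
shortcut = toy_df["country"].value_counts()
print(shortcut)
```

`count()` only counts non-null values, so the two approaches can differ if the counted column has missing data.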