# Day 4: Session A - Data Frames in Pandas

https://eds-217-essential-python.github.io/course-materials/interactive-sessions/4a_dataframes.html

Date: 9/6/24

In [None]:
import pandas as pd
import numpy as np

## Step 1: Import Data

In [None]:
url = "https://raw.githubusercontent.com/datasets/world-cities/master/data/world-cities.csv"
cities_df = pd.read_csv(url)

## Step 2: Explore Data

Explore the data to figure out the DataFrame structure, look at the data types, get a sense of the data values, and check the column names.


In [None]:
cities_df.head()

In [None]:
cities_df.tail()

### Exploring DataFrames using their properties

Use the `shape` property to get the size (rows, columns) of a DataFrame. Don't put parentheses `()` after `shape`: it is a property, not a method. `shape` returns a tuple, and a tuple can't be called, so `cities_df.shape()` raises a `TypeError`.

In [None]:
cities_df.shape
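As a minimal sketch of why `shape` takes no parentheses, here is a made-up three-row frame. `toy_df` is a hypothetical stand-in for `cities_df`, not the course data.

```python
import pandas as pd

# Hypothetical toy data standing in for cities_df
toy_df = pd.DataFrame({
    "name": ["Santa Barbara", "Austin", "Lyon"],
    "country": ["United States", "United States", "France"],
})

# shape is a property that returns a (rows, columns) tuple
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)

# Adding () tries to call that tuple, which Python does not allow
try:
    toy_df.shape()
except TypeError as err:
    print("TypeError:", err)
```

Because `shape` is a tuple, it can be unpacked straight into row and column counts.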

Use the `columns` property to get the column names. It returns a pandas `Index` object, which behaves much like a list.

In [None]:
cities_df.columns

Use the `dtypes` property to determine the data type of each column.

In [None]:
cities_df.dtypes
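To see what `columns` and `dtypes` actually return, here is a small sketch on made-up data. `toy_df` and its `city_id` column are hypothetical and not part of the course dataset.

```python
import pandas as pd

# Hypothetical toy data; city_id is an invented integer column
toy_df = pd.DataFrame({
    "name": ["Santa Barbara", "Lyon"],
    "city_id": [1, 2],
})

# columns is a pandas Index, not a plain Python list
print(type(toy_df.columns))

# tolist() converts it when a real list is needed
column_names = toy_df.columns.tolist()
print(column_names)

# dtypes is a Series that maps each column name to its data type
print(toy_df.dtypes)
```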

### Exploring Data Frames using their methods

Use the `describe()` method to get summary statistics for the numeric columns of the DataFrame.

In [None]:
cities_df.describe()

Use `info()` to get detailed information about column types and content. 

In [None]:
cities_df.info()

Chain `isnull()` and `sum()` to count the missing values in each column.

In [None]:
cities_df.isnull().sum()
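The chain works in two steps, which this sketch splits apart on made-up data. `toy_df` and its placeholder values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing subcountry values
toy_df = pd.DataFrame({
    "name": ["City A", "City B", "City C"],
    "subcountry": ["Region A", np.nan, np.nan],
})

# isnull() returns a same-shaped frame of True/False flags
missing_flags = toy_df.isnull()

# sum() adds up each column, counting every True as 1
missing_counts = missing_flags.sum()
print(missing_counts)
```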

## Step 3: Cleaning Data

To remove rows with missing data, use `dropna()`. Pass the `subset` argument to check for missing values only in specific columns.

In [None]:
cities_df = cities_df.dropna(subset=['subcountry'])

In [None]:
cities_df.isnull().sum()
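To illustrate what `subset` changes, this sketch compares it with a bare `dropna()` on made-up data. `toy_df` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row lacks a subcountry, another lacks a name
toy_df = pd.DataFrame({
    "name": ["City A", "City B", np.nan],
    "subcountry": ["Region A", np.nan, "Region C"],
})

# Drop rows only where subcountry is missing; the row with a missing name stays
by_subset = toy_df.dropna(subset=["subcountry"])

# With no arguments, dropna() drops rows with a missing value in any column
drop_any = toy_df.dropna()

print(len(toy_df), len(by_subset), len(drop_any))
```

Note that `dropna()` returns a new DataFrame and leaves `toy_df` unchanged, which is why the notebook assigns the result back to `cities_df`.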

## Step 4: Basic Selecting and Filtering

Selecting a column is easy! Just put the column name in square brackets after the data frame.

In [None]:
cities_df['name'].head()  # head() only keeps the notebook output short for this demo; it isn't needed to select a column

If we want to select more than one column, put the columns we want into a list.

In [None]:
cities_df[['name', 'country', 'subcountry']]  # double brackets: the inner list names the columns. A Series holds only one column, so the result is a DataFrame.

To make a series from a column, request it like `df['column']`


To make a DataFrame from a column, request it as a single item list: `df[['column']]`
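The difference between the two forms shows up in the type and shape of the result. A minimal sketch on made-up data (`toy_df` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data standing in for cities_df
toy_df = pd.DataFrame({
    "name": ["Santa Barbara", "Lyon"],
    "country": ["United States", "France"],
})

as_series = toy_df["name"]    # single brackets give a Series
as_frame = toy_df[["name"]]   # a one-item list gives a DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The Series is one-dimensional, while the one-column DataFrame keeps both a row and a column dimension.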

### Basic Filtering

Use conditional expressions to filter rows.


In [None]:
us_cities = cities_df[ cities_df['country'] == 'United States' ]
print(us_cities)

We can combine logical operators to filter on multiple columns!

In [None]:
# What about in just California? Be Declarative and Expressive!

in_us = cities_df['country'] == 'United States'
in_ca = cities_df['subcountry'] == 'California'

ca_cities = cities_df[in_us & in_ca]
print(ca_cities.head())

In [None]:
# Efficient! But we need to wrap each condition in parentheses `( )` because `&` binds more tightly than `==`

ca_cities = cities_df[
    (cities_df['country'] == 'United States') & 
    (cities_df['subcountry'] == 'California') 
]

print(ca_cities.head())
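To show why the parentheses matter, this sketch runs the filter with and without them on made-up data. `toy_df` is hypothetical, and the exact exception raised by the unparenthesized version may differ across pandas versions, so the code catches it broadly.

```python
import pandas as pd

# Hypothetical toy data standing in for cities_df
toy_df = pd.DataFrame({
    "name": ["Santa Barbara", "Austin", "Lyon"],
    "country": ["United States", "United States", "France"],
    "subcountry": ["California", "Texas", "Auvergne-Rhône-Alpes"],
})

# With parentheses, each comparison is evaluated before & combines them
with_parens = toy_df[
    (toy_df["country"] == "United States") &
    (toy_df["subcountry"] == "California")
]
print(with_parens["name"].tolist())

# Without parentheses, & binds more tightly than ==, so Python first tries
# "United States" & toy_df["subcountry"], and the expression fails.
try:
    toy_df[toy_df["country"] == "United States" & toy_df["subcountry"] == "California"]
    error_raised = False
except Exception as err:  # exception type can vary with pandas version and dtype
    error_raised = True
    print(type(err).__name__)
```

Naming each condition first, as with `in_us` and `in_ca`, sidesteps the precedence problem entirely.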

## Step 5: Sorting and Ranking

coming soon...

## Step 6: Basic Transformations

coming soon ...

## Steps 7/8: Grouping / Aggregation

The workhorse method for grouping in pandas is `groupby()`

In [None]:
cities_per_country = cities_df.groupby('country')  # groupby() takes the name of the column to group by

Use aggregation on groupby to get patterns out of the datasets. 

Aggregations: `mean()`, `max()`, `median()`, `count()`

See the cheat sheet on aggregation on the course webpage

In [None]:
cities_per_country['name'].count().sort_values(ascending=False).head(10)
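The chain above can be broken into its individual steps. This is a sketch on made-up data (`toy_df` is hypothetical), small enough to check the counts by eye.

```python
import pandas as pd

# Hypothetical toy data: three cities in France, two in the US, one in Japan
toy_df = pd.DataFrame({
    "name": ["Santa Barbara", "Austin", "Lyon", "Paris", "Nice", "Tokyo"],
    "country": ["United States", "United States", "France",
                "France", "France", "Japan"],
})

# Step 1: group the rows by country (no computation happens yet)
grouped = toy_df.groupby("country")

# Step 2: count the non-missing names within each group
counts = grouped["name"].count()

# Step 3: sort largest first and keep the top two
top = counts.sort_values(ascending=False).head(2)
print(top)
```

`count()` skips missing values, so it reports how many non-missing names each country has rather than the raw number of rows.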

## Step 9: Visualization

...coming soon