# Day 4: Session A - [DataFrames]

[https://eds-217-essential-python.github.io/course-materials/interactive-sessions/4a_dataframes.html]

Date: [09/06/24]

In [None]:
import pandas as pd
import numpy as np

In [None]:
url = "https://raw.githubusercontent.com/datasets/world-cities/master/data/world-cities.csv"
cities_df = pd.read_csv(url)

## Step 2: Explore Data

Figuring out the DataFrame structure.

Looking at data types.

Getting a sense of data values.

Checking out the column names.

In [None]:
cities_df.head()

In [None]:
cities_df.tail()

### Exploring DataFrames using their properties:

Use `shape` to get the size (rows, columns) of a DataFrame.

- Attribute: a quantity contained by a DataFrame that is NOT itself a DataFrame, e.g. its class, shape, etc.

- Attributes are standalone variables that are properties of the object you are referring to, so they are accessed without parentheses.

In [None]:
cities_df.shape

Use `columns` to get the column names

In [None]:
cities_df.columns  # returns an Index object holding the column names

list(cities_df.columns)  # coerce the Index of column names into a plain Python list

Use the `dtypes` property to see the data type of each column

In [None]:
cities_df.dtypes
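
As a minimal sketch of the three properties above, the cell below uses a small made-up DataFrame so the results are easy to check by eye. The name `toy_df` and all of its values are hypothetical and are not taken from the cities dataset.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not from world-cities.csv)
toy_df = pd.DataFrame({
    "name": ["City A", "City B", "City C"],
    "country": ["Country X", "Country X", "Country Y"],
    "made_up_id": [101, 102, 103],
})

n_rows, n_cols = toy_df.shape          # shape is a (rows, columns) tuple
column_names = list(toy_df.columns)    # Index of names -> plain Python list

print(n_rows, n_cols)
print(column_names)
print(toy_df.dtypes)                   # one dtype per column
```

Note that none of the three needs parentheses, because they are attributes rather than methods.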

### Exploring DataFrames using their methods

Use the `describe()` method to get a summary of the dataframe

In [None]:
cities_df.describe()  # summarizes numeric columns by default
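
Because `describe()` skips non-numeric columns by default, it can help to compare it with `describe(include="all")`. The sketch below does this on a hypothetical `toy_df` with made-up values.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy_df = pd.DataFrame({
    "name": ["City A", "City B", "City C"],
    "made_up_value": [1.0, 2.0, 3.0],
})

numeric_summary = toy_df.describe()            # numeric columns only by default
full_summary = toy_df.describe(include="all")  # also summarizes non-numeric columns

print(numeric_summary)
print(full_summary)
```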

Use `info()` to get detailed information about column types and content.

In [None]:
cities_df.info()

Use `isnull()` and `sum()` to count missing values

In [None]:
cities_df.isnull().sum()
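
To see why chaining `isnull()` and `sum()` counts missing values, here is a small sketch on a hypothetical `toy_df` with one value deliberately set to `np.nan`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one deliberately missing value
toy_df = pd.DataFrame({
    "name": ["City A", "City B", "City C"],
    "subcountry": ["Region 1", np.nan, "Region 2"],
})

is_missing = toy_df.isnull()       # same shape as toy_df, True where a value is missing
missing_counts = is_missing.sum()  # True counts as 1, so this totals missing values per column

print(is_missing)
print(missing_counts)
```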

## Step 3: Clean Data

To remove rows with missing data, use `dropna()`; pass the `subset` argument to check only specific columns.

In [None]:
cities_df = cities_df.dropna(subset=['subcountry'])

In [None]:
cities_df.isnull().sum()
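
The effect of `subset` is easiest to see next to a plain `dropna()`. The sketch below uses a hypothetical `toy_df` (the `made_up_note` column is invented for this comparison) with missing values in two different columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in two different columns
toy_df = pd.DataFrame({
    "name": ["City A", "City B", "City C"],
    "subcountry": ["Region 1", np.nan, "Region 2"],
    "made_up_note": [np.nan, "x", "y"],
})

drop_any = toy_df.dropna()                          # drops rows missing a value in any column
drop_subset = toy_df.dropna(subset=["subcountry"])  # only checks the 'subcountry' column

print(len(toy_df), len(drop_any), len(drop_subset))
```

With `subset`, the row that is only missing `made_up_note` is kept, so less data is thrown away.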

## Step 4: Basic Selecting and Filtering

Selecting a column is easy! Just put the column name in square brackets after the DataFrame.

In [None]:
cities_df['name'].head() #using the head() method to keep my notebook clean [for demo]

If we want to select more than one column, put the columns we want into a list.

In [None]:
cities_df[['name', 'country']]  # double brackets: a list of column names returns a DataFrame
# requesting a single column with single brackets [] returns a Series
# requesting a single column as a single-element list [[]] returns a DataFrame

To get a Series from a column, request it like `df['column']`

To get a DataFrame from a column, request it as a single-item list: `df[['column']]`
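
A quick way to confirm the Series vs. DataFrame distinction is to check the type and shape of each result. This is a minimal sketch on a hypothetical `toy_df`.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy_df = pd.DataFrame({
    "name": ["City A", "City B"],
    "country": ["Country X", "Country Y"],
})

as_series = toy_df["name"]    # single brackets -> Series (one-dimensional)
as_frame = toy_df[["name"]]   # one-item list -> DataFrame with a single column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```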

### Basic Filtering

Use conditional expressions to filter rows. 

In [None]:
us_cities = cities_df[ cities_df['country'] == 'United States' ]
print(us_cities)

We can combine logical operators to filter on multiple columns

In [None]:
# Efficient! Parentheses matter: `&` binds more tightly than `==`, so wrap each condition
california_cities = cities_df[
    (cities_df['country'] == 'United States') & 
    (cities_df['subcountry'] == 'California')]   
print(california_cities[['name', 'country', 'subcountry']].head())

In [None]:
# Declarative and expressive!
in_us = cities_df['country'] == 'United States'
in_ca = cities_df['subcountry'] == 'California'

california_cities = cities_df[ in_us & in_ca ]
california_cities.head()
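
Named boolean masks like `in_us` and `in_ca` can also be combined with `|` (or) and `~` (not), not just `&` (and). The sketch below shows all three on a hypothetical `toy_df`; the mask names `in_x` and `in_r1` are made up for this example.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy_df = pd.DataFrame({
    "name": ["City A", "City B", "City C", "City D"],
    "country": ["Country X", "Country X", "Country Y", "Country Y"],
    "subcountry": ["Region 1", "Region 2", "Region 1", "Region 3"],
})

# Each mask is a boolean Series with one True/False per row
in_x = toy_df["country"] == "Country X"
in_r1 = toy_df["subcountry"] == "Region 1"

both = toy_df[in_x & in_r1]     # rows where both conditions hold
either = toy_df[in_x | in_r1]   # rows where at least one condition holds
not_x = toy_df[~in_x]           # rows where the first condition does not hold

print(both["name"].tolist())
print(either["name"].tolist())
print(not_x["name"].tolist())
```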

## Step 5: Sorting and Ranking

## Step 6: Transformations

## Step 7/8: Grouping / Aggregation

The workhorse method for grouping in pandas is `groupby()`.

Apply an aggregation to the groupby result to get patterns out of datasets.

Aggregations: `mean()`, `max()`, `median()`, `count()`

In [None]:
cities_per_country = cities_df.groupby('country')
cities_per_country['name'].count().sort_values(ascending = False).head(10)
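
To make the group-then-aggregate pattern concrete, here is a minimal sketch on a hypothetical `toy_df`; the `made_up_value` column and all values are invented for illustration. It applies `count()` to one column and `mean()` to another from the same groupby object.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy_df = pd.DataFrame({
    "name": ["City A", "City B", "City C", "City D", "City E"],
    "country": ["Country X", "Country X", "Country Y", "Country X", "Country Y"],
    "made_up_value": [10, 20, 30, 40, 50],
})

grouped = toy_df.groupby("country")  # one group per distinct country

# count() tallies non-missing values in each group; sort so the largest group comes first
city_counts = grouped["name"].count().sort_values(ascending=False)

# a different aggregation on a different column of the same groups
mean_values = grouped["made_up_value"].mean()

print(city_counts)
print(mean_values)
```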

## Visualization